User:Neil Shah-Quinn (WMF)/Data portal draft

There is a great deal of publicly available, open-licensed data about Wikimedia projects. This page is intended to help community members, developers, and researchers who are interested in analyzing raw data learn what data and infrastructure are available. If you have any questions, you might find the answer in the Frequently Asked Questions about Data.

If you wish to browse pre-computed metrics and dashboards, see statistics.

If this publicly available data isn't sufficient, you can look at the page on private data access to see what non-public data exists and how you can gain access.

If you wish to donate or document any additional data sources, you can use the Wikimedia organization on DataHub.

Quick glance

Data Dumps (details)

Homepage | Download

Dumps of all WMF projects for backup, offline use, research, etc.

Wiki content, revisions, metadata, and page-to-page and outside links
XML and SQL formats
Updated once or twice a month
Large file sizes
The dumps.wikimedia.org domain also hosts some other data, including anonymized survey data from the three 2011/12 Editor surveys

API (details)

Homepage

The API provides direct, high-level access to the data contained in MediaWiki databases through HTTP requests to the web service.

Meta info about the wiki and logged-in user, properties of pages (revisions, content, etc.) and lists of pages based on criteria
JSON, WDDX, XML, YAML, and PHP's native serialization format
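
As a minimal sketch of what such an HTTP request looks like, the Python snippet below builds an API URL that asks for general information about a wiki in JSON. The endpoint `https://en.wikipedia.org/w/api.php` and the helper name `build_api_url` are chosen here for illustration; other wikis expose the API at their own `api.php` path.

```python
from urllib.parse import urlencode

# Assumed endpoint for illustration: English Wikipedia's API.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_api_url(params, endpoint=API_ENDPOINT):
    """Return a GET URL for the API (hypothetical helper).

    `format=json` is added unless the caller sets another format.
    """
    query = {"format": "json"}
    query.update(params)
    # Sort keys so the same parameters always yield the same URL.
    return endpoint + "?" + urlencode(sorted(query.items()))


# Meta information about the wiki itself (site name, namespaces, ...).
siteinfo_url = build_api_url({"action": "query", "meta": "siteinfo"})
print(siteinfo_url)
```

The resulting URL can be fetched with any HTTP client. Only the URL construction is shown here so that the example runs without network access.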

Toolforge (details)

Homepage

Toolforge allows you to connect to shared server resources and query a copy of the database (with some lag).

acts as a standard web server hosting web-based tools
command-line tools
account required

Recent changes stream (details)

Homepage

Wikimedia broadcasts every change to every Wikimedia wiki using the Socket.IO protocol.

Analytics Dumps (details)

Homepage

Raw pageview, unique device estimates, mediacounts, etc.

Delimited, usually: Project, (Page title,) Count
Aggregated hourly or daily
pageviews | pageviews compressed | mediacounts | unique devices
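
The delimited layout can be read with a few lines of code. The sketch below assumes whitespace-separated lines in which the first field is the project, the last field is the count, and an optional page title sits between them. Real files may carry extra columns, so check the documentation of the specific dataset before relying on this. The sample lines and the names `CountRecord` and `parse_count_line` are made up for illustration.

```python
from collections import Counter
from typing import NamedTuple, Optional


class CountRecord(NamedTuple):
    project: str
    title: Optional[str]  # None for per-project lines without a page title
    count: int


def parse_count_line(line):
    """Parse one 'project [title] count' line (assumed layout)."""
    fields = line.split()
    if len(fields) == 2:
        project, count = fields
        return CountRecord(project, None, int(count))
    if len(fields) == 3:
        project, title, count = fields
        return CountRecord(project, title, int(count))
    raise ValueError(f"unexpected number of fields: {line!r}")


# Invented sample lines, not real traffic data.
sample = [
    "en Main_Page 120",
    "en Earth 30",
    "de Erde 12",
]

# Aggregate counts per project.
totals = Counter()
for line in sample:
    record = parse_count_line(line)
    totals[record.project] += record.count

print(dict(totals))  # {'en': 150, 'de': 12}
```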

WikiStats (details)

Homepage | Download

Reports in 25+ languages based on data dumps and server log files.

Unique visits, page views, active editors and more
Intermediate CSV files available.
Graphical presentation.
Monthly

DBpedia (details)

Homepage

DBpedia extracts structured data from Wikipedia and lets users run complex queries and link Wikipedia data to other data sets.

RDF, N-Triples, SPARQL endpoint, Linked Data
Billions of triples in a consistent ontology
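
To illustrate the kind of query a SPARQL endpoint accepts, the sketch below builds a request URL for a small query. It assumes the public endpoint at `https://dbpedia.org/sparql` and a `format` parameter for requesting JSON results. The query, the `dbo:` and `dbr:` terms used in it, and the helper name `sparql_url` are illustrative and should be checked against the current DBpedia ontology.

```python
from urllib.parse import urlencode

# Assumed public endpoint; availability and limits may change.
DBPEDIA_SPARQL = "https://dbpedia.org/sparql"

# Illustrative query: up to ten resources typed as dbo:City in Germany.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?city WHERE {
  ?city a dbo:City ;
        dbo:country dbr:Germany .
}
LIMIT 10
""".strip()


def sparql_url(query, endpoint=DBPEDIA_SPARQL):
    """Return a GET URL that asks the endpoint for JSON results."""
    params = {
        "query": query,
        "format": "application/sparql-results+json",  # assumed parameter
    }
    return endpoint + "?" + urlencode(params)


url = sparql_url(QUERY)
print(url[:60] + "...")
```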

DataHub and Figshare (details)

DataHub Homepage

A collection of various Wikimedia-related datasets.

smaller (usually one-time) surveys/studies
dbpedia lite, DBpedia-Live and others
EPIC/Oxford quality assessment

Figshare (datasets tagged 'wikipedia')

Readership data

the pageviews API
unique devices API and dumps
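
As a rough sketch of how the pageviews API can be queried, the snippet below builds a URL for the per-article endpoint of the Wikimedia REST API and sums the `views` field of a response. The helper names `pageviews_url` and `total_views`, the example article, and the response body are assumptions made for illustration; the numbers are invented, not real traffic.

```python
import json
from urllib.parse import quote

# Base path of the per-article pageviews endpoint (Wikimedia REST API).
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(project, article, start, end,
                  access="all-access", agent="all-agents",
                  granularity="daily"):
    """Build a per-article pageviews URL; dates are YYYYMMDD strings."""
    # Titles use underscores for spaces and must be percent-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    return "/".join([BASE, project, access, agent, title,
                     granularity, start, end])


def total_views(response_text):
    """Sum the `views` field over the `items` list of a JSON response."""
    payload = json.loads(response_text)
    return sum(item["views"] for item in payload["items"])


url = pageviews_url("en.wikipedia", "Albert Einstein", "20240101", "20240107")
print(url)

# Invented response body with the same shape, for an offline check.
fake_response = '{"items": [{"views": 10}, {"views": 15}]}'
print(total_views(fake_response))  # 25
```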

Editing metadata

Editing metadata includes information such as the user, the time, and the comment of each revision, but does not include the content of the revision itself.

This data is available from:

the action API
the XML data dumps
the replicas of the MediaWiki databases available on Wikimedia's Toolforge
Recent changes stream
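
For the action API, a request for editing metadata might look like the following sketch. It lists query parameters that ask for revision IDs, timestamps, users, and comments while leaving out `content`, and then extracts those fields from a response in the `formatversion=2` layout. The page title, the helper name `extract_metadata`, and the sample response are invented for illustration.

```python
import json

# Parameters for the action API: revision metadata only, no page text.
# "content" is deliberately left out of rvprop.
REVISION_PARAMS = {
    "action": "query",
    "prop": "revisions",
    "titles": "Earth",  # example page title
    "rvprop": "ids|timestamp|user|comment",
    "rvlimit": "5",
    "format": "json",
    "formatversion": "2",
}


def extract_metadata(response_text):
    """Yield (revid, timestamp, user, comment) for each revision."""
    payload = json.loads(response_text)
    for page in payload["query"]["pages"]:
        for rev in page.get("revisions", []):
            # user and comment can be absent if they have been hidden.
            yield (rev["revid"], rev["timestamp"],
                   rev.get("user"), rev.get("comment"))


# Invented response in the formatversion=2 layout, for an offline check.
sample_response = json.dumps({
    "query": {"pages": [{
        "pageid": 1,
        "title": "Earth",
        "revisions": [{
            "revid": 101,
            "parentid": 100,
            "timestamp": "2024-01-02T03:04:05Z",
            "user": "ExampleUser",
            "comment": "fix typo",
        }],
    }]}
})

rows = list(extract_metadata(sample_response))
print(rows)
```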

Raw content data

Data that includes the raw content of page revisions is available from:

Structured content data

Wikidata Query Service
DBpedia

Miscellaneous data

Analysis infrastructure

In addition to the raw data described above, a good deal of infrastructure for research and analysis is available to people contributing to Wikimedia's mission.