There is a great deal of publicly available, open-licensed data about Wikimedia projects. This page is intended to help community members, developers, and researchers who are interested in analyzing raw data learn what data and infrastructure are available.
If you have any questions, you might find the answer in the Frequently Asked Questions about Data. If you still have questions, you can email your question to the Analytics mailing list (more information).
If you wish to browse pre-computed metrics and dashboards, see statistics.
If this publicly available data isn't sufficient, you can look at the page on private data access to see what non-public data exists and how you can gain access.
If you wish to donate or document any additional data sources, you can use the Wikimedia organization on DataHub.
See also inspirational example uses.
WMF releases data dumps of Wikipedia, Wikidata, and all WMF projects on a regular basis, as well as dumps of other Wikimedia-related data such as search indices and short URL mappings.
- Text of current and/or all revisions of all pages, in XML format (schema)
- Metadata for current and/or all revisions of all pages, in XML format (schema)
- Most database tables as SQL files
- Page-to-page link lists
- Lists of pages with links outside of the project
- Media metadata
- Info about each page
- Titles of all pages in the main namespace, i.e. all articles
- List of all pages that are redirects and their targets
- Log data, including blocks, protection, deletion, uploads
- Misc bits
- Stub-prefixed dumps for some projects which only have header info for pages and revisions without actual content
- Adds/changes dumps (includes no moves or deletes, plus some other limitations) (documentation)
- Wikidata entity dumps - see Wikidata:Data access for more information
- Available other dumps
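The XML page dumps listed above are usually too large to load into memory at once, so a streaming parser is a common approach. The following is a minimal sketch, not an official tool: the inline sample is hand-written to resemble the dump layout, its namespace URI is illustrative (the schema version differs between dumps), and the helper names `local` and `iter_pages` are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Hand-written sample shaped like a pages dump. Real files are far larger and
# declare an XML namespace whose version varies; the URI here is illustrative.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <ns>0</ns>
    <id>1</id>
    <revision>
      <id>100</id>
      <timestamp>2020-01-01T00:00:00Z</timestamp>
      <text>Hello world</text>
    </revision>
  </page>
  <page>
    <title>Talk:Example</title>
    <ns>1</ns>
    <id>2</id>
    <revision>
      <id>101</id>
      <timestamp>2020-01-02T00:00:00Z</timestamp>
      <text>Discussion</text>
    </revision>
  </page>
</mediawiki>"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, namespace, text) for each <page>, one page at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = ns = text = None
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "ns":
                ns = int(child.text)
            elif name == "text":
                text = child.text or ""  # the last <text> seen wins
        yield title, ns, text
        elem.clear()  # drop the finished page's children to save memory


pages = list(iter_pages(io.BytesIO(SAMPLE)))
articles = [title for title, ns, _ in pages if ns == 0]
print(articles)
```

For a compressed dump file, the same generator should work on a stream opened with `bz2.open(path, "rb")` from the standard library, assuming the file is bzip2-compressed.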
You can download the dumps from the last year (dumps.wikimedia.org/enwiki/ for English Wikipedia, dumps.wikimedia.org/dewiki/ for German Wikipedia, etc.). Download mirrors offer an alternative to the download page.
Due to large file sizes, using a download tool is recommended.
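If a dedicated download tool is not available, a script can at least stream the file to disk in chunks instead of reading it into memory. This is a rough sketch using only the Python standard library; the function names and the User-Agent string are hypothetical, and it does not resume interrupted transfers, which is one reason a proper download tool remains preferable.

```python
import io
import urllib.request

DUMPS_HOST = "https://dumps.wikimedia.org"


def dump_dir_url(wiki):
    """Return the dump directory for a wiki database name such as 'enwiki'."""
    return f"{DUMPS_HOST}/{wiki}/"


def copy_in_chunks(src, dst, chunk_size=1024 * 1024):
    """Copy one binary stream to another in fixed-size chunks.

    Returns the number of bytes copied.
    """
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total


def download(url, dest_path,
             user_agent="example-dump-fetcher/0.1 (hypothetical contact)"):
    """Stream a URL to disk without holding the whole file in memory."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response, \
            open(dest_path, "wb") as out:
        return copy_in_chunks(response, out)


# Offline demonstration of the chunked copy on an in-memory stream.
buffer = io.BytesIO()
copied = copy_in_chunks(io.BytesIO(b"x" * 2500), buffer, chunk_size=1000)
print(dump_dir_url("dewiki"), copied)
```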
SQL dumps are provided as dumps of entire tables, using mysqldump.
Some older dumps exist in various formats.
How to and examples
Some tools are listed on the following pages, but these tools are mostly outdated and non-functional:
All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages.
- Mailing list: xmldatadumps-l
- Bug reports: Dumps Generation project in Phabricator
- Design work on Dumps 2.0 replacement: Dumps Rewrite project in Phabricator
The MediaWiki API provides direct, high-level access to the data contained in MediaWiki databases. Client programs can log in to a wiki, get data, and post changes automatically by making HTTP requests.
- Meta information about the wiki and the logged-in user
- Properties of pages, including page revisions and content, external links, categories, templates, etc.
- Lists of pages that match certain criteria
- See the full list of available information
To query the database, send an HTTP GET request to the desired endpoint (for example, https://en.wikipedia.org/w/api.php for English Wikipedia), setting the action parameter to query and defining the query details in the URL.
How to and examples
- API Tutorial
- Example: a URL can ask the English Wikipedia API (https://en.wikipedia.org/w/api.php?) to send (action=query) the content (rvprop=content) of the most recent revision of Main Page (titles=Main%20Page) in XML format (format=xml). You can paste such a URL into a browser to see the output.
- More examples
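As an illustration of how such a request URL is put together, the sketch below assembles the query string from the parameters named in the example. The `prop=revisions` parameter is an addition not spelled out above: `rvprop` belongs to the revisions module, so the request needs it. The function names and the User-Agent string are hypothetical.

```python
import urllib.parse
import urllib.request

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_query_url(title, fmt="xml"):
    """Build a URL asking for the latest revision content of one page."""
    params = {
        "action": "query",
        "prop": "revisions",  # assumption: module that rvprop belongs to
        "rvprop": "content",
        "titles": title,
        "format": fmt,
    }
    # quote_via=quote encodes spaces as %20 rather than '+'.
    query = urllib.parse.urlencode(params, quote_via=urllib.parse.quote)
    return f"{API_ENDPOINT}?{query}"


def fetch(url, user_agent="example-api-client/0.1 (hypothetical contact)"):
    """Perform the GET request and return the raw response body as bytes."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read()


url = build_query_url("Main Page")
print(url)
```

Calling `fetch(url)` would perform the actual request; it is left out of the demonstration so the sketch runs without network access.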
To try out the API interactively on English Wikipedia, use the API Sandbox.
To use the API, your application or client might need to log in.
Before you start, learn about the API etiquette.
Researchers may be granted special access rights on a case-by-case basis.
All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL).
Toolforge and PAWS
Toolforge hosts command-line or web-based tools, which can query copies of the database. The copies are generally close to real time, but replication lag sometimes occurs.
Explore the database schema of the MediaWiki software.
Using Toolforge requires familiarity with Unix/Linux command line, SSH keys, SQL/databases, and some programming.
To start using Toolforge, see this Quickstart guide.
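To give a feel for the kind of SQL that can be run against the database copies, here is a small sketch that uses an in-memory SQLite database as a stand-in. It recreates only a handful of columns from the MediaWiki `page` table and fills them with made-up rows; the real replicas run on a different database engine, so connection code, placeholder syntax, and column types will differ.

```python
import sqlite3

# SQLite stand-in for a database copy; the rows below are invented.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE page (
        page_id INTEGER PRIMARY KEY,
        page_namespace INTEGER NOT NULL,
        page_title TEXT NOT NULL,
        page_is_redirect INTEGER NOT NULL,
        page_len INTEGER NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO page VALUES (?, ?, ?, ?, ?)",
    [
        (1, 0, "Example", 0, 1200),
        (2, 0, "Old_name", 1, 40),          # a redirect
        (3, 1, "Example", 0, 300),          # a talk page (namespace 1)
        (4, 0, "Another_article", 0, 5400),
    ],
)

# Longest non-redirect pages in the main namespace (namespace 0).
QUERY = """
    SELECT page_title, page_len
    FROM page
    WHERE page_namespace = 0 AND page_is_redirect = 0
    ORDER BY page_len DESC
    LIMIT ?
"""
longest = conn.execute(QUERY, (10,)).fetchall()
print(longest)
```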
Recent changes stream
Analytics datasets offer data on pageviews, mediacounts, unique devices, revision history, data by country, and Wikidata QRanks.
Files starting with "project" contain total hits per project per hour statistics.
Per-country pageviews data is also available, sanitized for privacy reasons. See this announcement post (June 2023).
See the README for details on the format.
You can interactively browse the page view statistics at https://pageviews.toolforge.org. More documentation on the Pageviews Analysis tool is available.
The public "Geoeditors" dataset contains information about the monthly number of active editors from a particular country on a particular Wikipedia language edition (bucketed and redacted for privacy reasons). For some earlier years, similar data is available at /, see also Edits by project and country of origin.
Wikistats is an informal but widely recognized name for a set of reports which provide monthly trend information for all Wikimedia projects and wikis.
Wikistats offers many dashboards that display trends in reading, contributing, and content, broken down by project, covering metrics such as:
- unique visitors
- page views (overall and mobile only)
- editor activity
- article count
Data is presented as charts with the option to download the underlying data.
For more details on Wikistats, see wikitech:Analytics/Systems/Wikistats_2.
DBpedia.org is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia and to link other datasets on the Web to Wikipedia data.
The English version of the DBpedia knowledge base describes millions of things, and the majority of items are classified in a consistent ontology (persons, places, creative works like music albums, films and video games, organizations like companies and educational institutions, species, diseases, etc.). Localized versions of DBpedia in more than a hundred languages describe millions of things.
The data set also features:
- about 2 billion pieces of information (RDF triples)
- labels and abstracts for >10 million unique things in up to 111 different languages
- millions of links to images, links to external web pages, data links into external RDF datasets, links to Wikipedia categories, YAGO categories
- https://www.dbpedia.org/resources/ has download links for all the data sets, different formats and languages.
- SPARQL endpoint
- https://dbpedia.org/sparql is DBpedia's SPARQL endpoint.
- DBpedia data from version 3.4 on is licensed under the terms of the Creative Commons Attribution-ShareAlike 3.0 License and the GNU Free Documentation License.
- Mailing list: DBpedia Discuss
- Forum: https://forum.dbpedia.org/
- DBpedia related publications, blog posts and projects
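As a sketch of how the SPARQL endpoint listed above might be queried from a script, the code below builds a GET request following the standard SPARQL protocol (a `query` parameter plus an `Accept` header for JSON results) and flattens a result in the W3C SPARQL JSON results format. The query, the Berlin resource, the sample result, and all function names are illustrative assumptions, not taken from DBpedia's documentation.

```python
import json
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative query: the English label of one resource.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Berlin> rdfs:label ?label .
  FILTER (lang(?label) = "en")
} LIMIT 1
"""


def build_request(query):
    """Build a GET request that asks for SPARQL results as JSON."""
    url = SPARQL_ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    headers = {
        "Accept": "application/sparql-results+json",
        "User-Agent": "example-sparql-client/0.1 (hypothetical contact)",
    }
    return urllib.request.Request(url, headers=headers)


def bindings_to_rows(results):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


def run_query(query):
    """Send the query over the network and return flattened rows."""
    with urllib.request.urlopen(build_request(query)) as response:
        return bindings_to_rows(json.load(response))


# Offline demonstration with a hand-written result document.
sample = {
    "head": {"vars": ["label"]},
    "results": {"bindings": [
        {"label": {"type": "literal", "xml:lang": "en", "value": "Berlin"}},
    ]},
}
rows = bindings_to_rows(sample)
print(rows)
```

`run_query(QUERY)` would contact the live endpoint; it is not called here so that the sketch runs offline.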
The DataHub repository is meant to become the place where all Wikimedia-related data sources are documented. The collection is open to contributions and researchers are encouraged to donate relevant datasets.
- Hotels/restaurants/attractions data as CSV/OSM/OBF
- Tourism guide for offline use
The WMF privacy engineering team uses differential privacy to release data that would otherwise be too sensitive to release. This data currently only includes pageview statistics; in the future, it will include statistics about editors, CentralNotice impressions and views, search, and more.
- Pageview data (currently only available as daily TSVs)
Differentially-private data is currently available in static TSV form at https://analytics.wikimedia.org/published/datasets/. Work to make this data available via API is ongoing.
Differentially-private data and code are available under a Creative Commons Zero license.