
Tutorial: Wikimedia Public (Research) Resources


The Wikimedia Foundation's mission is to disseminate open knowledge effectively and globally. In keeping with this mission, we support research in areas that benefit the free knowledge community, and we aim to make any work done with our support openly available to the public. While we collect as little user data as possible, all the material (text and multimedia) in our projects is public and can be reused by anyone under a Creative Commons license. Moreover, the full history of every article and the interactions among users are also public, and we offer a set of tools for accessing this data. In this tutorial we give an overview of these data sources and explain in detail how to work with them, covering data and tools such as the Wikipedia dumps, Quarry (SQL replicas), Pageviews, PAWS (Wikimedia-hosted Jupyter notebooks), and Wikidata, with special attention to the multimedia content hosted by the Wikimedia Commons project.

An outline of the tutorial

  • Introduction to Wikimedia Projects
  • Overview of Wikimedia's datasets and tools:
    • Static Dumps: Full Wikipedia dumps, where to get them and how to parse them.
    • MediaWiki Utilities: Python packages for working with MediaWiki data.
    • Wikimedia API: The Wikimedia API for accessing data.
    • Pageviews API: How to get detailed pageview counts for any Wikipedia page.
    • Quarry: The web interface to interact with Wikimedia SQL servers.
    • Clicks: Explanation of the click dataset (navigation paths within Wikipedia).
    • Event Stream: Explanation of the (live) Event stream dataset.
    • Wikidata: How to interact with this (semantic) knowledge base.
    • ORES: Public machine-learning-based quality control systems.
  • Hands-on session: Using the free images on Wikimedia Commons.
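
As a taste of the Pageviews API item in the outline, here is a minimal Python sketch, not taken from the tutorial material itself. It assumes the public per-article endpoint of the Wikimedia REST API; the helper names (pageviews_url, total_views, fetch_pageviews), the default parameter values and the example User-Agent string are illustrative choices of ours.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Per-article endpoint of the Wikimedia Pageviews REST API.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(article, start, end, project="en.wikipedia",
                  access="all-access", agent="user", granularity="daily"):
    """Build the request URL; start and end are YYYYMMDD strings."""
    # Titles use underscores instead of spaces and must be percent-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    return f"{BASE}/{project}/{access}/{agent}/{title}/{granularity}/{start}/{end}"


def total_views(payload):
    """Sum the 'views' field over the 'items' list of a decoded response."""
    return sum(item["views"] for item in payload.get("items", []))


def fetch_pageviews(article, start, end,
                    user_agent="tutorial-example/0.1 (contact: you@example.org)"):
    """Download and decode the JSON response (requires network access)."""
    # Wikimedia asks API clients to send a descriptive User-Agent;
    # the value above is a placeholder to replace with your own.
    req = Request(pageviews_url(article, start, end),
                  headers={"User-Agent": user_agent})
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)


# Example (not executed here because it needs network access):
# data = fetch_pageviews("Albert Einstein", "20200101", "20200107")
# print(total_views(data))
```

The URL builder and the summing helper are kept separate from the network call so that they can be tried offline, for instance on a response saved from a browser.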

Target audience and prerequisites

Researchers interested in machine learning, natural language processing, knowledge graphs, and the Semantic Web. Basic knowledge of Python is desirable but not mandatory. People from the social sciences and without a CS background will also learn how to use interfaces that require no previous coding knowledge and whose output can later be processed in spreadsheets or statistical software.

All the material created and used in this tutorial is open access. Community members interested in reproducing the tutorial at their own locations should contact the authors.


Slides can be found here: Wikimedia Public (Research) Resources.


Dates & Location

Where: Taipei International Convention Center, Taiwan (Room TBA)

When: April 20th, 2020 (Time TBA)


For further questions please send an email to diego at