Wikimedia Data Tutorial:
Using public data from Wikipedia and its sister projects
for academic research

held at ICWSM 2024

This is the website for the tutorial Wikimedia data How-to: Using public data from Wikipedia and its sister projects for academic research at the International AAAI Conference on Web and Social Media (ICWSM) 2024.

Motivation

Wikimedia's data is one of the largest available resources of multi-modal data, including articles across 326 language versions of Wikipedia, millions of images on Wikimedia Commons, and structured data in the Wikidata knowledge graph. Behind this well of information are large, active, and global communities working together. In its mission to disseminate open knowledge, the Wikimedia Foundation makes this data available under open licenses. In contrast, access to data about online communities on many other platforms is becoming more constrained. Given how widely Wikipedia and its sister projects are used around the world, Wikimedia data is one of the most useful resources for investigating our online habits and understanding the dynamics of community governance and collaboration in online spaces (see, e.g., Hill & Shaw: The most important laboratory for social scientific and computing research in history).

However, working with this information can be challenging in practice. These challenges include navigating the Wikimedia ecosystem (How to identify relevant data for specific research questions?), technical barriers (How to access available datasets via dumps?), and best practices (How to pre-process/filter the datasets?).

In this tutorial, we discuss the different data formats available through the Wikimedia projects, how to access them, and the best practices for working with Wikimedia information and the community creating this information. We will combine lecture-style segments with hands-on exercises so that everyone is ready to start their next project using these resources.

Duration

half-day event (~4 hours)
more details: t.b.a.

Prerequisites

Basic knowledge of Python is desirable to fully take advantage of the hands-on exercises using Jupyter Notebooks. Other useful but not mandatory skills are a basic understanding of data processing and APIs.

Participants in the exercise session only need a laptop with internet access and a browser. To simplify setup, we will use PAWS, which is an environment to run Jupyter Notebooks hosted by the Wikimedia Foundation. For preparation, setting up a free Wikimedia account is desirable (link to create an account).
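To give a rough sense of the Python and API familiarity mentioned above, here is a minimal sketch that is not taken from the tutorial materials. It builds a request URL for the MediaWiki Action API on English Wikipedia and reads page sizes out of a JSON response. The helper names (build_info_query, page_lengths) and the sample response are hypothetical and hand-written for illustration; no network request is made.

```python
import json
from urllib.parse import urlencode

# Action API endpoint of English Wikipedia; other wikis use the same path on their own domain.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_info_query(titles):
    """Return a URL asking the Action API for basic page info on the given titles."""
    params = {
        "action": "query",
        "prop": "info",
        "titles": "|".join(titles),  # multiple titles are separated by "|"
        "format": "json",
        "formatversion": "2",  # version 2 returns pages as a list
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def page_lengths(response_text):
    """Map page title -> length in bytes from a formatversion=2 JSON response."""
    data = json.loads(response_text)
    return {
        page["title"]: page["length"]
        for page in data["query"]["pages"]
        if "length" in page  # pages flagged as missing carry no length
    }


# Hand-written sample response for illustration only (titles and values are made up).
SAMPLE_RESPONSE = json.dumps({
    "query": {"pages": [
        {"pageid": 1, "title": "Example A", "length": 1200},
        {"title": "Example B", "missing": True},
    ]}
})

url = build_info_query(["Example A", "Example B"])
lengths = page_lengths(SAMPLE_RESPONSE)
print(url)
print(lengths)
```

In a notebook, the URL could be fetched with any HTTP client and the response text passed to page_lengths; the sample string above merely stands in for that step.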

Organizers

Martin Gerlach (Wikimedia Foundation)
Lucie-Aimée Kaffee (Hugging Face)
Tiziano Piccardi (Stanford University)