Learning patterns/Working with the Wikipedia data dump for research

A learning pattern for evaluation
Working with the Wikipedia data dump for research
Problem: Much scientific research uses or could make use of the rich data provided by Wikipedia. However, the data dumps can seem too large and complex to work with.
Solution: Before starting to work with the data, clearly identify which information you need so that you can use the smallest possible dump. Start with a toy example. Pre-process and filter the data, and use database solutions.
Creator: ASociologist
Created on: 19:42, 21 March 2022 (UTC)
Status: DRAFT

What problem does this solve?

Scientific research has increasingly started to use Wikipedia as a data source. Given the large amount of information produced over the past twenty years and the fact that it is publicly available, Wikipedia data can be a gold mine for researchers who need text data or who are interested in successful online peer production. While having access to large amounts of data is generally an advantage, downloading and working with the files that contain all this information can become a computational problem. Downloading and working with all revisions of all pages of the English Wikipedia means handling multiple terabytes of text: too much for a casual researcher without access to specialised infrastructure. In this learning pattern, I want to highlight how the computational load can be reduced and what steps you can take to conduct research with Wikipedia data dumps.

What is the solution?

Step 1: Ask yourself which data you actually need

When working with Wikipedia data, the first and most important step is to answer the question of which data you really need. Do you need the text of all Wikipedia articles in every state they have ever been in? Most likely not. Wikimedia provides a number of different data dumps with different levels of information; see, for example, the different files of the latest data dump of the English Wikipedia.

You should ask yourself the following questions:

  1. Do you need metadata or content?
    • Stubs do not contain any page or revision content, but include metadata such as information on which user edited which page to what extent. Content files like pages-articles also contain the raw revision content and are much larger. If you are not interested in the actual text but in contribution behaviour, you might only need to focus on the stubs.
  2. Do you need information on articles or everything?
    • Are you interested in the encyclopedic content of Wikipedia only, or do you need data on discussions and users? The articles files contain everything in the main namespace, but do not contain user pages and talk pages.
  3. Do you need a current snapshot of Wikipedia or do you care about its history?
    • There are data dumps which include only the current revision of each page, while the history files go back in time.

Depending on which data you need, the file you have to download differs. For example, if you are not interested in textual content, but in the number of contributions per user across different namespaces and across time, you need to work with the stub meta history files. If you are interested in the most current version of articles, you need to work with the current articles files.
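
To illustrate, the sketch below builds the URL of a dump file and streams it to disk. The wiki name, file name, and helper functions are only examples: the exact file names, and whether a single combined file is offered, differ per wiki and dump run, so check the dump index page of your wiki first.

  import requests

  # Hypothetical helper: build the URL of a dump file on dumps.wikimedia.org from
  # the wiki name and the dump file name. At the time of writing, combined files
  # follow a pattern such as enwiki-latest-stub-meta-history.xml.gz, but check the
  # dump index page of your wiki before relying on a specific name.
  def dump_url(wiki="enwiki", dump_file="stub-meta-history.xml.gz"):
      return f"https://dumps.wikimedia.org/{wiki}/latest/{wiki}-latest-{dump_file}"

  def download(url, target_path):
      # Stream the (potentially very large) file to disk instead of holding it in memory.
      with requests.get(url, stream=True, timeout=60) as response:
          response.raise_for_status()
          with open(target_path, "wb") as out:
              for chunk in response.iter_content(chunk_size=1 << 20):
                  out.write(chunk)

  download(dump_url(), "enwiki-latest-stub-meta-history.xml.gz")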

Related to this: also check whether you really need to work with the data dumps at all. Depending on your interest, the information might be available elsewhere. For example, if you are trying to find the longest Wikipedia pages, there is a special page listing them. In other cases, a single API call might suffice.
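
As an illustration of the latter, a single call to the MediaWiki Action API returns the current wikitext of a page. A minimal sketch in Python using the requests library (the article title is only an example):

  import requests

  # Fetch the current wikitext of one article via the MediaWiki Action API.
  API_URL = "https://en.wikipedia.org/w/api.php"
  params = {
      "action": "query",
      "prop": "revisions",
      "rvprop": "content",
      "rvslots": "main",
      "titles": "Wikipedia",      # example article title
      "format": "json",
      "formatversion": "2",
  }
  response = requests.get(API_URL, params=params, timeout=30)
  page = response.json()["query"]["pages"][0]
  wikitext = page["revisions"][0]["slots"]["main"]["content"]
  print(wikitext[:500])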

Step 2: Work with a toy example

Whichever data file contains the information you need and you subsequently download, it is probably still very large. Instead of pre-processing, working with, and analysing this huge file directly, work with a toy example first. A small toy example allows you to explore the data structure and try out solutions and potential approaches without long waiting times. Working with a toy example lets you fail fast and fail often. Once your approach and your code work fine on the toy example, you can apply them to the data you are actually interested in.

Data dumps from small language versions can work as useful toy examples.
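
If you have already downloaded a large dump, one simple way to build a toy example is to cut out its first few pages. The sketch below does this line by line; it assumes a bzip2-compressed dump (use gzip.open for .gz files) and relies on each </page> tag sitting on its own line, which holds for the dumps I have seen but is an assumption, and the function name is just an example.

  import bz2

  def make_toy_dump(full_dump_path, toy_dump_path, n_pages=100):
      # Copy the dump header and the first n_pages <page> blocks into a small file.
      pages_copied = 0
      with bz2.open(full_dump_path, "rt", encoding="utf-8") as src, \
           bz2.open(toy_dump_path, "wt", encoding="utf-8") as dst:
          for line in src:
              dst.write(line)
              if "</page>" in line:
                  pages_copied += 1
                  if pages_copied >= n_pages:
                      break
          dst.write("</mediawiki>\n")  # close the root element so the toy file is valid XML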

Step 3: Pre-process and filter the data

You have identified the data file you need and want to start working with it. The XML file you downloaded is still very large, as XML is a verbose format. The data becomes easier to work with once it is transformed into a tabular format such as a CSV file. The next step is thus to parse the Wikipedia dump, for example with the Python wiki dump parser. The parser can and should be adjusted to your needs so that it outputs exactly the data you need. Again, ask yourself if there is any information that can be skipped: Do you need the edit summaries? Do you need edits made by IPs? Are you only interested in edits made on Mondays? Filter the data now to end up with workable files.
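
If you prefer not to depend on an external parser, the Python standard library can stream the XML directly. The following minimal sketch writes one CSV row per revision of a main-namespace page; it assumes a bzip2-compressed stub or history dump, and the selected columns and the namespace filter are only examples to adapt to your own question.

  import bz2
  import csv
  import xml.etree.ElementTree as ET

  def local(tag):
      # Strip the XML namespace, which differs between dump schema versions.
      return tag.rsplit("}", 1)[-1]

  def dump_to_csv(dump_path, csv_path):
      # Stream the dump and write one CSV row per revision of a main-namespace page.
      with bz2.open(dump_path, "rb") as source, open(csv_path, "w", newline="") as out:
          writer = csv.writer(out)
          writer.writerow(["page_title", "namespace", "revision_id", "timestamp", "user"])
          title = ns = None
          for event, elem in ET.iterparse(source, events=("end",)):
              tag = local(elem.tag)
              if tag == "title":
                  title = elem.text
              elif tag == "ns":
                  ns = elem.text
              elif tag == "revision":
                  rev_id = timestamp = user = None
                  for child in elem.iter():
                      ctag = local(child.tag)
                      if ctag == "id" and rev_id is None:
                          rev_id = child.text          # first <id> is the revision id
                      elif ctag == "timestamp":
                          timestamp = child.text
                      elif ctag in ("username", "ip"):
                          user = child.text
                  if ns == "0":                        # example filter: articles only
                      writer.writerow([title, ns, rev_id, timestamp, user])
                  elem.clear()                         # free memory as we go
              elif tag == "page":
                  elem.clear()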

Step 4: Work with a toy example

See step 2. Always work with a toy example, even if you think it is not necessary. It will save time in the long run.

Step 5: Use a database

If the data file you want to analyse still spans multiple gigabytes, it can make sense to store your data in a database. Most programming languages can work directly with database files, and aggregating and filtering via SQL commands is often quicker and computationally cheaper.
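
As a sketch of what this can look like with SQLite, which ships with Python's standard library: load the CSV from step 3 into a table once, then let the database do the grouping. The table layout follows the example CSV above and is an assumption, not a fixed schema.

  import csv
  import sqlite3

  def load_revisions(csv_path, db_path):
      # Load the revision CSV from step 3 into a SQLite table and index it.
      con = sqlite3.connect(db_path)
      con.execute(
          "CREATE TABLE IF NOT EXISTS revisions "
          "(page_title TEXT, namespace INTEGER, revision_id INTEGER, timestamp TEXT, user TEXT)"
      )
      with open(csv_path, newline="") as f:
          rows = csv.reader(f)
          next(rows)  # skip the header row
          con.executemany("INSERT INTO revisions VALUES (?, ?, ?, ?, ?)", rows)
      con.execute("CREATE INDEX IF NOT EXISTS idx_user ON revisions (user)")
      con.commit()
      return con

  con = load_revisions("revisions.csv", "revisions.db")
  # The aggregation runs inside the database; only the small result set reaches Python.
  query = ("SELECT user, COUNT(*) AS n_edits FROM revisions "
           "GROUP BY user ORDER BY n_edits DESC LIMIT 10")
  for user, n_edits in con.execute(query):
      print(user, n_edits)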

Step 6: Work with a toy example

See step 2. Always work with a toy example, really. Load your toy example into the database.


When to use

  • Use this pattern when you want to work with Wikipedia data but do not have the computational power, facilities, or time to work with very large amounts of data.
  • I have used this approach in my project.

Endorsements

--ASociologist (talk) 20:31, 21 March 2022 (UTC)

See also

Related patterns

External links

References