User:Isaac (WMF)/Standard research approaches
This page seeks to document various standard research approaches to common tasks with Wikimedia data. It is complementary to the analysis gotchas essay, which focuses on common pitfalls when working with Wikimedia data. It is always a work-in-progress. Feel free to make requests / leave suggestions on the talk page. Note that there is no "true" standard -- these are just my opinions / collected feedback based on several years of working with Wikimedia data. This Jupyter notebook in particular has a lot of code snippets for extracting various attributes from Wikipedia articles if you have access to the cluster.
Reader session reconstruction
This is only relevant to individuals with access to the webrequest logs. You may occasionally want to go from analyzing pageviews as independent events to inferring sequences (sessions) of pageviews that came from the same individual. This has three parts: 1) grouping pageviews by the individual who requested them, 2) ordering those pageviews into a navigation structure (not always linear; often tree-like), and 3) identifying when a session starts or stops.
The approach for grouping pageviews by individual (really "actor" because the operationalization actually seeks to identify unique browsers accessing Wikimedia, not unique individuals) is now formalized in the pageview_actor table and depends on the user-agent (browser/OS/device) and IP address associated with a given pageview.
There are various approaches to the second challenge that depend on the need and range from the simplest (order linearly by timestamp) to more complex approaches that also inspect the referer information for a pageview and seek to build tree(s) of pageviews. See the code / papers below for examples.
When dealing with session length, pageviews are usually only considered within a given 24-hour period. The longer the time period considered, the more likely it is that an individual switches devices or IP addresses, or that another individual uses the same device + IP. These concerns are much more salient for mobile devices than desktop devices. A "session" is then often defined as any sequence of pageviews in which consecutive views are separated by no more than an hour, following Halfaker et al. (https://arxiv.org/abs/1411.2878).
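The one-hour inactivity threshold described above can be sketched roughly as follows; `pageviews` is a hypothetical time-sorted list of (timestamp, page) tuples for a single actor, and the names are illustrative, not from any Wikimedia library.

```python
from datetime import timedelta

def sessionize(pageviews, cutoff=timedelta(hours=1)):
    """Split a time-ordered list of (timestamp, page) pairs into sessions,
    starting a new session whenever the gap between consecutive pageviews
    exceeds the cutoff (one hour, per Halfaker et al.)."""
    sessions = []
    for ts, page in pageviews:
        # compare against the timestamp of the last view in the current session
        if sessions and ts - sessions[-1][-1][0] <= cutoff:
            sessions[-1].append((ts, page))
        else:
            sessions.append([(ts, page)])
    return sessions
```

In practice you would first group the webrequest rows by actor (user-agent + IP) and by day, then apply something like this per actor.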
NOTE: Arora et al. "Wikipedia Reader Navigation: When Synthetic Data Is Enough" shows that in many cases, it is perfectly reasonable to just use independent source-target referral pairs to study reader behavior -- i.e. ignore the potentially larger sessions and merely work with aggregate counts of pageview and associated referral sources. This has the benefit that some of the data is public (see clickstream data) and the data can largely be extracted directly from the webrequest data without any further processing.
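Working with independent source-target pairs as described by Arora et al. reduces to simple counting. A minimal sketch, assuming a hypothetical iterable of (referer, target) pairs; the `min_count` threshold loosely mimics the minimum-count filtering applied to the public clickstream release and is an illustrative parameter, not an official value.

```python
from collections import Counter

def aggregate_referrals(pairs, min_count=1):
    """Count (source, target) referral pairs and drop rare ones.
    `pairs` is any iterable of (referer, target) tuples."""
    counts = Counter(pairs)
    return {pair: n for pair, n in counts.items() if n >= min_count}
```

The same aggregation is what the clickstream dataset provides pre-computed per wiki per month.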
- Piccardi et al. A Large-Scale Characterization of How Readers Browse Wikipedia
- Example code with various anonymization techniques
- Evaluation of effectiveness of UA+IP approach and original approach
- Analysis of IP address stability
- See Editor session filtering below for details on how to remove (potential) editors from reader session data.
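One simple version of the referer-based tree-building mentioned above: each pageview becomes a child of the most recent earlier view of its referer page, and views whose referer was not seen earlier become roots. This is a sketch with illustrative names, not the exact algorithm from the papers listed.

```python
def build_trees(pageviews):
    """pageviews: time-ordered list of (page, referer) pairs for one session.
    Returns (roots, children) where roots is a list of indices and
    children maps a parent index to its child indices."""
    last_seen = {}   # page title -> index of its most recent view
    children = {}    # parent index -> list of child indices
    roots = []
    for i, (page, referer) in enumerate(pageviews):
        if referer in last_seen:
            children.setdefault(last_seen[referer], []).append(i)
        else:
            roots.append(i)
        last_seen[page] = i
    return roots, children
```

Attaching to the *most recent* view of the referer is one convention among several; see the Piccardi et al. paper for a fuller treatment.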
Editor session reconstruction and filtering
If you want highly detailed information about editor workflows, then the EditAttemptStep EventLogging schema is the best place to go (private data). This data has the caveat that it is currently sampled at 1 in 16 users on Wikipedia (to verify, type
If you are just interested in whether a particular reader session contains a potential edit, then you can use the following clause, which captures edit attempts across mobile/desktop and wikitext as well as Visual Editor (but not the apps -- see task T277785 for more details):
SELECT ...
FROM wmf.webrequest
WHERE (
    uri_query LIKE '%action=edit%'  -- desktop wikitext editor
    OR uri_query LIKE '%action=visualeditor%'  -- desktop and mobile visualeditor
    OR uri_query LIKE '%&intestactions=edit&intestactionsdetail=full&uiprop=options%'  -- mobile wikitext editor
)
...
See this README for more information about what each clause does, but beware that there is no guarantee that the clause is up-to-date, and you might want to verify that attempting to edit on various platforms still triggers those URL strings. This clause applies to both logged-in and IP editors; if you are just interested in whether a particular reader session was for a logged-in user (with no guarantee that they edited), then use the loggedIn property from the x_analytics_map column in webrequest.
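If you are working with the raw x_analytics string rather than the pre-parsed map column, it is a semicolon-delimited list of key=value pairs; a small parser (hypothetical helper names, and worth double-checking the field format against the current webrequest schema):

```python
def parse_x_analytics(x_analytics):
    """Parse a raw x_analytics string like 'ns=0;page_id=123;loggedIn=1'
    into a dict. The webrequest table also exposes this pre-parsed
    as the x_analytics_map column."""
    return dict(
        kv.split('=', 1)
        for kv in x_analytics.split(';')
        if '=' in kv
    )

def is_logged_in(x_analytics):
    # loggedIn is present and set to 1 for requests from logged-in users
    return parse_x_analytics(x_analytics).get('loggedIn') == '1'
```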
Multilingual analysis (joining in Wikidata IDs)
When comparing edit / content / reading trends across multiple languages of Wikipedia (highly recommended), the easiest way to identify parallel content across languages is via the interlanguage links maintained on Wikidata (QIDs). How you do this depends on the size of your dataset and available resources.
For smaller joins -- e.g., several thousand articles -- you can use the pageprops API or wbgetentities API to get the Wikidata IDs for Wikipedia articles. The wbgetentities API can then be used to get all the sitelinks (Wikipedia articles etc.) associated with a given Wikidata ID.
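A minimal sketch of mapping article titles to Wikidata IDs via the pageprops API. The request-building and response-parsing helpers here are illustrative; `parse_wikidata_ids` assumes the standard `action=query` JSON response shape, and actually fetching the URL of course requires network access.

```python
from urllib.parse import urlencode

def pageprops_url(titles, lang='en'):
    """Build a pageprops API request URL for a batch of titles
    (the API accepts up to 50 titles per request for most users)."""
    params = {
        'action': 'query',
        'prop': 'pageprops',
        'ppprop': 'wikibase_item',
        'titles': '|'.join(titles),
        'format': 'json',
    }
    return f'https://{lang}.wikipedia.org/w/api.php?' + urlencode(params)

def parse_wikidata_ids(response):
    """Extract {title: QID} from the parsed JSON response; missing pages
    and pages without a Wikidata item simply have no pageprops entry."""
    pages = response.get('query', {}).get('pages', {})
    return {
        p['title']: p['pageprops']['wikibase_item']
        for p in pages.values()
        if 'wikibase_item' in p.get('pageprops', {})
    }
```

For redirects you will also typically want to pass `redirects` in the query parameters so titles resolve to their canonical pages.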
For larger joins -- e.g., millions -- and those with access to the data lake, you can use the item_page_link table to join in Wikidata IDs to page titles or IDs (code examples). This data can also be extracted from the public Wikidata JSON dumps but this is potentially prohibitively expensive given the size of that dump. You can also comment on task T258514 to request pre-processed public snapshots of this data.
Summarizing and assigning topics to Wikipedia content (representation learning)
There are countless approaches to representing Wikipedia content as higher-level topics for easier analysis. The two main ones are using the existing categories associated with an article (and mapping them to, e.g., major topic classifications) and using machine-learning models to classify articles according to some external taxonomy. The latter approach is the most formalized and is used by the Wikimedia Foundation for analyses. Specifically, this is the taxonomy used, and task T297631 has details on using the data on Hive (private). There is also a public snapshot occasionally added to figshare, or you can use the unofficial API for exploring the model.
For a more complete summary of this challenge and potential approaches, see the March 2020 Research Showcase and Johnson et al. "Language-agnostic Topic Classification for Wikipedia". For unsupervised embeddings, see this section for more details.
Processing Wikipedia text (for NLP)
The fastest way to access Wikipedia text in bulk is via the dumps. These are public and can be processed sequentially using the mwxml Python library, or in parallel if you have private access to the cluster and use the wikitext tables. The main challenge for language modeling is how to pre-process the wikitext to remove extraneous syntax while retaining all of the appropriate content and context. There is no one right answer to this -- e.g., for some tasks you may wish to retain category information, while for others you may wish to only retain article text. At a high level, the best library for working with wikitext is mwparserfromhell (Python). While regexes or other languages might be faster, they will almost certainly miss many of the edge cases caught by mwparserfromhell.
My main recommendation is to use the code / rules in HuggingFace's datasets repository (wikipedia.py) as a simple but effective script. Other approaches include mwtext (Python) and those by fastText (Perl) and wiki2vec (Java).
PageRank for Wikipedia
A few potential approaches for calculating PageRank for Wikipedia can be found in this repository. The notebooks contained there are for use on our cluster but might be convertible to outside setups or contain useful pre-processing / hyperparameter choices.
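For small graphs that fit in memory, a plain power-iteration PageRank is enough to get started; the sketch below uses a `{page: [outlinks]}` dict and one common convention for dangling pages (redistributing their mass uniformly). It is not the code from the repository above.

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank over a {page: [outlinks]} dict."""
    nodes = set(links) | {t for outs in links.values() for t in outs}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iters):
        # mass sitting on pages with no outlinks gets spread uniformly
        dangling = sum(rank[p] for p in nodes if not links.get(p))
        new = {node: (1 - damping) / n + damping * dangling / n
               for node in nodes}
        for page, outs in links.items():
            if outs:
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new[target] += share
        rank = new
    return rank
```

At Wikipedia scale you would instead run this on the cluster (e.g., via Spark's GraphX or an equivalent), which is what the notebooks in the repository target.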
Not yet documented, but ideally the following will eventually be covered:
- Working with HTML dumps (parsed version of Wikipedia articles)
- Diffing Wikipedia edits
- Extracting references from Wikipedia articles