
Research:Language switching behavior on Wikipedia

From Meta, a Wikimedia project coordination wiki
Duration:  2019-March – ??
This page documents a stalled research project.

Wikipedia is decidedly multilingual. Many concepts have corresponding articles in many languages. While these articles are sometimes translations (e.g., via Content Translation), they often contain additional content or varying perspectives on a given topic. Readers can easily access this content via the interlanguage links on the sidebar of a given article. While some readers only ever see the content that exists in their native language, many readers do take advantage of these varying perspectives and view content in multiple languages. Anecdotally, through conversations with readers and feedback related to the Universal Language Selector [1], a variety of reasons for language switching have been noted: reading about a topic in a more comfortable language, looking at how different cultures write about a concept, switching to a language that the reader believes will have more extensive content, and learning a language or testing one's skills.

This project focuses on the following question: for what types of articles do readers switch languages? The hope is that identifying classes of articles where readers often switch will indicate that these articles have gaps in content, should be prioritized for content translation or section recommendation, or should be surfaced more strongly as providing additional context to the reader. Article types could be defined by categories, content, the structure of the article, etc.



Possible Approaches


The goal of this project was to identify when a reader switched languages for a given article. Theoretically, there are several ways this could be done:

  • Examine the referer data in the webrequest table. When a page view is associated with one project and contains a referer from another project, this might be evidence of a language switch. This method is used for various analyses (evaluation of compact language links; interlanguage navigation table), and while it likely works quite well for generating aggregate numbers, we rule it out for the following reasons:
    • With modern browsers and HTTPS, when the domain changes (e.g., from "de.wikipedia.org" to "en.wikipedia.org"), everything but the domain is removed from the referer. It is therefore not possible to easily determine whether the previous page was the same article in another language or a page like the Main Page or a user page.
    • The exception is IE browsers, and from these and some manual checking, we can see that only about 60% of switches between languages are for the same article. This method would therefore result in a large number of false positives.
  • Record switches via EventLogging on the interlanguage links.
    • This has happened in the past via this schema, but is currently not active. Future projects could explore reintroducing this logging, but we preferred to work from existing data sources at least at this stage.
  • Reconstruct reader sessions and record when multiple projects are viewed in the same session
    • This approach was not taken for the same reason mentioned above: the presence of two different language projects in a session does not actually indicate that the reader chose to read the same article in two different languages.
  • Reconstruct reader sessions, associate all page views with their Wikidata concepts, and identify when the same Wikidata concept is viewed on multiple projects.
    • This is the approach we took as described below. It is the most complex, but it also is the most exact and allows us to distinguish between simple co-occurrence of language projects and actual language switches.
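
To make the limitation of the referer-based approach concrete, here is a minimal sketch in Python. The function names (`wiki_language`, `cross_project_referer`) and the input shape are hypothetical and are not taken from the webrequest table schema. The sketch only assumes that a page view carries a host and a referer URL, and it shows that a referer trimmed to its origin reveals the source project but not the source page.

```python
from urllib.parse import urlsplit


def wiki_language(host):
    """Return the language code of a Wikipedia host, or None.

    Handles hosts such as 'de.wikipedia.org' and 'de.m.wikipedia.org'.
    (Hypothetical helper; real project names are more varied.)
    """
    parts = host.lower().split(".")
    if len(parts) >= 3 and parts[-2:] == ["wikipedia", "org"] and parts[0] != "www":
        return parts[0]
    return None


def cross_project_referer(page_host, referer):
    """Return (from_language, to_language, source_page_known) or None.

    None means the referer does not point to a different Wikipedia
    language edition. source_page_known is False when the browser
    trimmed the referer down to its origin, so the previous page could
    have been the same article, the Main Page, or anything else.
    """
    ref = urlsplit(referer)
    from_lang = wiki_language(ref.hostname or "")
    to_lang = wiki_language(page_host)
    if from_lang is None or to_lang is None or from_lang == to_lang:
        return None
    source_page_known = ref.path not in ("", "/")
    return (from_lang, to_lang, source_page_known)
```

With an origin-only referer such as "https://de.wikipedia.org/", the function can report that the reader came from German Wikipedia, but `source_page_known` is False, which is exactly the ambiguity that leads to false positives.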

Dataset Generation


The following steps were taken to build the dataset of language switches:

  • Collect reader page views across all of the Wikipedia languages and associate each page with its corresponding Wikidata ID. This mapping is used to identify when a reader views the same article in multiple languages.
    • See this Phabricator task for documentation of how this mapping was generated efficiently.
  • Associate each page view with a device via a hash of the user-agent and client-IP.
    • See this analysis of the appropriateness of this method for reconstructing reader sessions.
  • Reconstruct the sessions associated with each device hash for a given day, i.e., order all page views by device hash and timestamp.
  • For each session, determine whether multiple projects were viewed. If so:
    • For each article viewed, determine if an article with the same Wikidata ID and different project was viewed at a later point.
    • Record each pair of <from-language> and <to-language> for a Wikidata concept.
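
The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline that was actually run: the `PageView` record, its field names, the use of SHA-256 for the device hash, and the choice to record each (concept, from-language, to-language) triple at most once per session are all hypothetical. The input is assumed to hold one day of page views that already carry their Wikidata IDs.

```python
import hashlib
from collections import defaultdict
from typing import NamedTuple


class PageView(NamedTuple):
    timestamp: str    # ISO 8601 string, so string order equals time order (assumption)
    user_agent: str
    client_ip: str
    project: str      # language code, e.g. "de" for de.wikipedia.org
    wikidata_id: str  # e.g. "Q64"


def device_hash(view):
    """Hash user-agent and client IP into a device identifier (SHA-256 is an assumption)."""
    raw = f"{view.user_agent}|{view.client_ip}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def language_switches(views):
    """Return (wikidata_id, from_language, to_language) triples for one day of views."""
    # Group page views by device hash to approximate reader sessions.
    sessions = defaultdict(list)
    for view in views:
        sessions[device_hash(view)].append(view)

    switches = []
    for session in sessions.values():
        session.sort(key=lambda v: v.timestamp)
        # Skip sessions that only ever touch a single project.
        if len({v.project for v in session}) < 2:
            continue
        recorded = set()  # avoid recording the same triple twice per session
        for i, earlier in enumerate(session):
            for later in session[i + 1:]:
                same_concept = later.wikidata_id == earlier.wikidata_id
                if same_concept and later.project != earlier.project:
                    triple = (earlier.wikidata_id, earlier.project, later.project)
                    if triple not in recorded:
                        recorded.add(triple)
                        switches.append(triple)
    return switches
```

A session that views two projects but never the same Wikidata concept in both yields no triples, which is what separates this approach from simple co-occurrence of language projects.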

Policy, Ethics and Human Subjects Research


At this stage, this research is based solely on an analysis of logs. Before any data is publicly released, it will go through a privacy/security review.