Research:Language switching behavior on Wikipedia

From Meta, a Wikimedia project coordination wiki
Created
22:13, 22 March 2019 (UTC)
Duration:  2025-June – 2025-July
This page documents a completed research project.


Wikipedia is decidedly multilingual. Many concepts have corresponding articles in many languages. While these articles are sometimes translations (e.g., via Content Translation), they often contain additional content or varying perspectives on a given topic. Readers can easily access this content via the interlanguage links on the sidebar of a given article. While some readers only ever see the content that exists in their native language, many take advantage of these varying perspectives and view content in multiple languages. Anecdotally, through conversations with readers and feedback related to the Universal Language Selector [1], a variety of reasons for language switching have been noted: reading about a topic in a more comfortable language, looking at how different cultures write about a concept, switching to a language that the reader believes will have more extensive content, and learning a language or testing one's skills.

This project explores the following questions: How common is language switching? For which languages is switching most common? Does language switching happen on desktop, mobile web, and app at similar rates? Does language switching happen for logged-in users and anonymous users at similar rates? And finally, for what types of articles do readers switch languages? The hope is that identifying classes of articles where readers often switch will indicate which articles have gaps in content, should perhaps be prioritized for content translation or section recommendation, or should be surfaced more strongly as providing additional context to the reader. Article types could be related to categories, content, the structure of the article, etc.

Methods

Data/definitions

There are multiple methodological approaches to examine language switching.[note 1]

For this project, a language switch is defined as any instance of an actor navigating from one language edition of Wikipedia to an article on another language edition of Wikipedia.

We will use the pageview_actor table, a subset of the webrequest table (see documentation). We will look at instances of a Wikipedia pageview where the referer is another Wikipedia language edition.

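  • How? Language switches will be queried using WHERE clauses that determine that the current wiki and the referer wiki are different Wikipedia language editions:
    -- Case 1: the referer's host parses to a Wikipedia project that differs from the current one.
    WHERE ((normalize_host(parse_url(referer, 'HOST')).project_family = 'wikipedia'
            AND normalize_host(parse_url(referer, 'HOST')).project <> normalized_host.project)
    -- Case 2: the referer is in one of the canonical (desktop or mobile) Wikipedia domain lists
    -- and its leading subdomain differs from the current project.
          OR ((referer IN {canon_wp_domain_list} OR referer IN {canon_wp_mobile_domain_list})
               AND REGEXP_EXTRACT(referer, '^([a-z0-9-]*)\.') <> normalized_host.project))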
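
As a rough illustration of the definition above, the following Python sketch applies the same idea to a single (referer, current host) pair. It is not the query used in this project: the function names `wikipedia_language` and `is_language_switch` are hypothetical, and the host parsing is a simplified stand-in for `normalize_host` that only recognizes hosts of the form `<lang>.wikipedia.org` and `<lang>.m.wikipedia.org`.

```python
from urllib.parse import urlparse


def wikipedia_language(host):
    """Return the language code of a Wikipedia host, or None if it is not one.

    Simplifying assumption: hosts look like '<lang>.wikipedia.org' or
    '<lang>.m.wikipedia.org'. The real normalize_host UDF handles more cases.
    """
    if not host:
        return None
    parts = host.lower().split(".")
    # Drop the mobile marker so 'fr.m.wikipedia.org' is treated like 'fr.wikipedia.org'.
    if len(parts) == 4 and parts[1] == "m":
        parts = [parts[0]] + parts[2:]
    if len(parts) == 3 and parts[1:] == ["wikipedia", "org"] and parts[0] != "www":
        return parts[0]
    return None


def is_language_switch(referer, current_host):
    """True if the pageview on current_host was referred by a different Wikipedia language edition."""
    if not referer:
        return False
    # Accept both full URLs and bare domains as referers.
    referer_host = urlparse(referer).hostname if "://" in referer else referer
    referer_lang = wikipedia_language(referer_host)
    current_lang = wikipedia_language(current_host)
    return (
        referer_lang is not None
        and current_lang is not None
        and referer_lang != current_lang
    )
```

Under these assumptions, a pageview on `de.wikipedia.org` referred by `https://en.wikipedia.org/wiki/Berlin` counts as a language switch, while one referred by a search engine or by a sister project such as Wiktionary does not.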
    

Dataset Generation

TBA

Policy, Ethics and Human Subjects Research

At this stage, this research is based solely on an analysis of logs. Before any data is publicly released, it will go through a privacy/security review.

Initial findings

How common is language switching?

About 1.2% of daily Wikipedia readers switch languages at least once.

  • Percentage is based on averages of 10 randomly selected days tested across three months.
  • For perspective, about 70% of sessions on English Wikipedia consist of a single pageview,[1] so 1.2% is quite impressive.
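
The averaging described above can be sketched as follows. This is an illustration only: the page does not say how the days were sampled or how the average was computed, so the seeded random sample and the unweighted mean of daily percentages are assumptions, and `sample_days` and `average_switch_rate` are hypothetical names.

```python
import random
from statistics import mean


def sample_days(all_days, k, seed=0):
    """Pick k distinct days at random (seeded so the sample is reproducible)."""
    rng = random.Random(seed)
    return sorted(rng.sample(all_days, k))


def average_switch_rate(daily_counts):
    """Unweighted mean of per-day percentages of readers who switched at least once.

    daily_counts maps a day to a (switching_readers, total_readers) pair.
    """
    daily_pct = [
        100.0 * switchers / total for switchers, total in daily_counts.values()
    ]
    return mean(daily_pct)
```

With made-up numbers, two days with 10 and 30 switching readers out of 1,000 readers each average to 2.0%. An unweighted mean treats every sampled day equally; pooling counts across days would instead weight busier days more heavily.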

For which language(s) is switching most common?

English is the most common language switched from and switched to if we don't normalize for overall readership, that is, if we don't take into account each language version's overall daily pageviews. If we normalize by overall readership, then smaller languages (in terms of global language population) are among the top Wikipedia language editions by share of pageviews that come from language switches. These findings are based on averages from 10 randomly selected days tested across three months.
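
To make the effect of normalization concrete, here is a minimal sketch that ranks language editions either by raw language-switch pageviews or by the share of each edition's pageviews that came from a switch. The function name `rank_by_switches` is hypothetical and any counts passed to it in examples are made up; it is not the analysis code used for these findings.

```python
def rank_by_switches(switch_views, total_views, normalize):
    """Rank language editions by incoming language-switch pageviews.

    With normalize=False, rank by raw counts. With normalize=True, rank by the
    share of each edition's pageviews that came from a language switch.
    """
    def score(lang):
        if normalize:
            return switch_views[lang] / total_views[lang]
        return switch_views[lang]

    return sorted(switch_views, key=score, reverse=True)
```

For instance, an edition with 1,000 switch pageviews out of 1,000,000 outranks one with 50 out of 1,000 on raw counts, but the order flips once each count is divided by the edition's total pageviews (0.1% versus 5%).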

Does language switching happen on desktop, mobile web, and app at similar rates?

If we normalize by overall readership per access method, language switching occurs most often on desktop, followed by mobile web, and least often on the mobile app.

  • Desktop: 1.38% of pageviews came from language switches
  • Mobile web: 0.54% of pageviews came from language switches
  • Mobile app: 0.02% of pageviews came from language switches
  • These findings are based on averages from 10 randomly selected days tested across three months.

Does language switching happen for logged-in users and anonymous users at similar rates?

If we normalize by overall readership per logged-in status, language switching occurs more often for logged-in users than for logged-out (i.e., anonymous) users.

  • 2.7% of pageviews by logged-in users came from language switches
  • 0.8% of pageviews by logged-out users came from language switches
  • These findings are based on averages from 10 randomly selected days tested across three months.

Open questions

  • For which type(s) of articles do readers switch languages?
  • Relation to MinT for readers?
  • When do and how often do readers switch to Simple English Wikipedia?
  • Could we use Quicksurveys to target those requests, to learn more about language switch motivations?
  • Do the most common language pairs match the Reader Surveys (i.e., the question about which other languages respondents speak)?
  • Using language populations as a reservoir of who could switch (for normalization)
  • Why do people switch?
  • To what degree is icon recognizability affecting the device-type patterns we're observing?
  • From which wikis is this 'switching traffic' most frequently coming? To what degree can piecing some of these analyses together get us to a hypothesis (or set of them) for the 'why' question above?
  • Are people frequently switching multiple times in a session? Do they frequently switch back?
  • Are there any interactions between referral source and likelihood of switching? For example, is search result traffic heavily driving into enwiki, with people then switching into other languages they read?

Notes

  1. Possible approaches to language switching analysis include the following:
    1. Using the UniversalLanguageSelector table
      • This table contains data for users who switch languages using the Universal Language Selector, a tool that allows users to select a language and easily configure its support.
      • Caveats:
        • Universal Language Selector is a desktop-only feature; so this table excludes mobile web and mobile app language switches
        • While this table will be helpful in providing super-specific data (e.g., percentages of language switches which come via Universal Language Selector on desktop), it will not be able to provide data for examining overall trends and counts of all language switches.
    2. Using the interlanguage navigation table
    3. Using the pageview_actor table
      • This table is a subset of the webrequest table. (See Documentation).
      • Analysis options:
        • Presence of same article in different languages within a reader session - via building the session and assigning articles to their Wikidata QIDs and checking for matches across languages. (Example code at https://github.com/geohci/language-switching). In addition to capturing internal language edition switches, this would also capture folks who, e.g., used Google to find another language version of the article.
        • Presence of multiple language editions in the same reader session (regardless of whether it is the same page). This (likely) captures multilingual readers but may not provide information about preference; it would also miss multilingual readers who stay with a single language edition.
        • Comparison of language editions read with the accept_language property provided by reader browsers.
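
The first analysis option for the pageview_actor table (the same article in different languages within a reader session) could be sketched roughly as below. This is a simplified illustration, not the code in the linked repository: it assumes the session has already been built and each pageview already mapped to a Wikidata QID, and the function name `qid_language_switches` is hypothetical.

```python
from collections import defaultdict


def qid_language_switches(session):
    """Find concepts a reader viewed in more than one language during a session.

    session is an ordered list of (language, qid) pageviews, where qid is the
    Wikidata item the article is linked to (None if the article has no item).
    Returns a dict mapping each qid to the languages it was viewed in, in
    first-seen order, restricted to qids seen in at least two languages.
    """
    langs_by_qid = defaultdict(list)
    for lang, qid in session:
        if qid is None:
            continue  # Articles without a Wikidata item cannot be matched.
        if lang not in langs_by_qid[qid]:
            langs_by_qid[qid].append(lang)
    return {qid: langs for qid, langs in langs_by_qid.items() if len(langs) > 1}
```

Because this matches on QIDs rather than referers, it also picks up readers who reached the second language version through an external site, as noted in the list above.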

References

  1. Piccardi, Tiziano; Gerlach, Martin; Arora, Akhil; West, Robert (18 Jan 2023). "A Large-Scale Characterization of How Readers Browse Wikipedia". ACM Transactions on the Web. p. 16.