User:Isaac (WMF)/Comparing Wikipedia language editions

From Meta, a Wikimedia project coordination wiki

Some brainstorming on challenges in doing multilingual Wikipedia research and evaluating differences between language editions (as a result of discussions during the January 2022 Research Showcase).

Read the full paper at:


There's been a very promising shift both in HCI / computational social science research towards better understanding cross-lingual differences on Wikimedia projects and in NLP towards building more multilingual language models. For the former, it's important to make accurate comparisons and understand the potential causes of differences between language editions. For the latter, it's important to understand how and why content varies between language editions in order to construct high-quality multilingual datasets.


There are lots of factors that can potentially cause differences in content / trace data between language editions. Below are some important ones.

Local Context[edit]

The offline context for editors of a particular language edition:

Geography and Culture[edit]

Some language editions are closely aligned with a small geographic area -- e.g., a single country -- and a relatively homogeneous culture. Languages that have been associated with colonialism or large diasporas, however, can have editors from more diverse backgrounds. This can lead to trivial aspects that affect quantitative data -- e.g., variation in dialects/spellings between regions can lead to many acceptable spellings of the same words between articles -- and have larger impacts as well -- e.g., diversity appears to play an important role in maintaining NPOV and generally making a community more resilient to misinformation.

Availability of Sources[edit]

Wikipedia reflects the world and thus is limited by what sources are available to draw from and cite. For languages with fewer resources, especially when they are not digital or accessible, writing high-quality content can be much more difficult.

Editor Community and Governance[edit]

Aspects of governance and community that affect what content is contributed and by whom:


Wikis have different rules for who can contribute and how. These rules can affect who can edit at all (e.g., no IP editors on Portuguese Wikipedia), who can create articles (e.g., Articles for creation (enwiki)), who can translate articles (e.g., Content Translation on English Wikipedia), and even whether the editors who investigate sockpuppets or do other patrolling work speak the language (Checkusers).

Policies and Norms[edit]

Though the core content policies of Wikipedia tend to be shared across language communities, other policies can differ. For example, English Wikipedia accepts fair-use imagery (which is not allowed on Wikimedia Commons, the main image repository for Wikipedia). This allows many images of company logos, movie posters, etc. to be included on English Wikipedia; such images are not so much missing from other language editions as potentially not meeting their guidelines.

Size, Age, and Composition[edit]

The size of a wiki has huge implications for how it functions. Much of the impact of size is mediated through access to technology -- e.g., smaller wikis generally lack tools to support their work. Smaller wikis are also generally less of a target for vandalism, less entrenched in their norms,[1] and might display more idiosyncrasies in their content due to more relaxed content policies or the early editors' interests / abilities to write articles via bots.


The technologies and tools used by Wikipedians can vary substantially between wikis:

Interface and Tool differences[edit]

Different language editions of Wikipedia have different software configurations and extensions installed. A very salient example of this is that some wikis have an extension called Structured Discussions (or "Flow") installed that, as the name suggests, provides more structure around talk page discussions. This change in interface -- as compared to a more unstructured talk page -- can have a substantive impact on the nature and structure of discussions and would complicate comparisons of discussion structure/volume across language editions.[2] Other major extensions that vary in usage include Flagged revisions (determines whether a revision is shown to readers before review), Content translation (supports editors in translating articles across languages), and PageAssessments (simplifies tagging of articles with WikiProjects, quality, and importance scores).
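A wiki's installed extensions can be listed programmatically through the MediaWiki Action API (`meta=siteinfo` with `siprop=extensions`). The sketch below shows one way to diff two wikis' configurations; the function names (`installed_extensions`, `diff_extensions`) are illustrative, not an existing library.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Real MediaWiki API parameters: siteinfo with the extensions property
# returns the list of extensions installed on that wiki.
SITEINFO_PARAMS = {
    "action": "query",
    "meta": "siteinfo",
    "siprop": "extensions",
    "format": "json",
}

def installed_extensions(wiki_domain):
    """Fetch the set of extension names installed on a wiki,
    e.g. installed_extensions('pt.wikipedia.org')."""
    url = f"https://{wiki_domain}/w/api.php?{urlencode(SITEINFO_PARAMS)}"
    with urlopen(url) as resp:
        data = json.load(resp)
    return {ext["name"] for ext in data["query"]["extensions"]}

def diff_extensions(exts_a, exts_b):
    """Return (only on wiki A, only on wiki B)."""
    return exts_a - exts_b, exts_b - exts_a
```

Diffing, say, `en.wikipedia.org` against a wiki with Structured Discussions enabled would surface exactly the kind of interface difference described above before any comparison of talk-page data is attempted.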

Bots and Filters[edit]

The availability of bots or other automated filters to do simple tasks on a wiki varies greatly between language editions. Taking the example of vandalism detection, English Wikipedia has a very extensive AbuseFilter configuration that catches many bad edits before they are even published, automated bots such as ClueBot NG, and RecentChanges filters based on ORES models for detecting bad-faith and damaging edits. This both reduces vandalism and greatly speeds up the response time as compared to wikis that depend on human evaluation. English Wikipedia in general has many specialized tools that blur the lines between human and bot editing, such as AutoWikiBrowser or Twinkle, which can lead to very high numbers of small edits as part of routine maintenance tasks. Examining common edit tags (e.g., via en:Special:Tags) is a good, though not foolproof, way to discover these tools.
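One way to discover such tooling at scale is to tally the change tags attached to recent edits; the MediaWiki `list=recentchanges` module (with `rcprop=tags`) returns these tags per edit. The tallying helper below is a minimal sketch operating on the records that API returns; the function names are illustrative.

```python
from collections import Counter
from urllib.parse import urlencode

# Real MediaWiki API parameters: recentchanges with rcprop=tags returns
# the change tags (AWB, Twinkle, mobile edit, etc.) on each edit.
RC_PARAMS = {
    "action": "query",
    "list": "recentchanges",
    "rcprop": "tags|user",
    "rclimit": "500",
    "format": "json",
}

def recentchanges_url(wiki_domain):
    """Build the API URL for one batch of recent changes on a wiki."""
    return f"https://{wiki_domain}/w/api.php?{urlencode(RC_PARAMS)}"

def tally_tags(changes):
    """Count edit tags across a list of recentchanges records
    (each record is a dict with an optional 'tags' list)."""
    counts = Counter()
    for change in changes:
        counts.update(change.get("tags", []))
    return counts
```

Comparing the resulting tag distributions between two language editions gives a quick (if rough) picture of how much routine editing is mediated by semi-automated tools on each wiki.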

Reliance on transcluded content[edit]

Some wikis have much more strongly embraced templates and/or Wikidata as a source of data for Wikipedia articles. This removes much of the content that a reader sees from the raw wikitext of a page (as the content is transcluded when the page is parsed via templates or other automated means). For research that relies on, e.g., extracting facts from infoboxes, a huge difference would be seen between data extracted from the wikitext and data extracted from the HTML. For example, see this study of Wikidata transclusion on English Wikipedia (and its discussion of how this might vary by wiki).
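The wikitext-vs-HTML gap can be made concrete with a toy example. The snippet below assumes an infobox field filled via the Wikibase `{{#statements:...}}` parser function, so the value never appears in the raw wikitext, only in the rendered HTML; the sample strings and function names are illustrative.

```python
import re

# Toy article: the "founded" field is transcluded from Wikidata, so the
# raw wikitext contains only a parser-function call, not the actual date.
WIKITEXT = "{{Infobox company\n| founded = {{#statements:P571}}\n}}"
HTML = '<table class="infobox"><tr><th>Founded</th><td>1998</td></tr></table>'

def founded_from_wikitext(wikitext):
    """Naive wikitext extraction: returns the literal field value,
    which here is the unresolved transclusion, not a date."""
    match = re.search(r"\|\s*founded\s*=\s*(.+)", wikitext)
    return match.group(1).strip() if match else None

def founded_from_html(html):
    """Extraction from the parsed HTML sees the transcluded value."""
    match = re.search(r"<th>Founded</th><td>([^<]+)</td>", html)
    return match.group(1) if match else None
```

A wikitext-based pipeline would record the field as unparseable markup while an HTML-based pipeline recovers "1998" -- and since wikis differ in how heavily they transclude, this discrepancy itself varies by language edition.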


Researchers can better account for these language edition differences in their work by following some good research practices:

Situated Researchers[edit]

Researchers should attempt to include Wikipedians, native speakers, and/or folks who are long-time readers of a language edition on their teams. This is an important step beyond using e.g., machine-translated versions of content for inspecting the data / articles.

Language-agnostic Metrics[edit]

Researchers should consider how language-invariant their metrics / features are and whether there are alternatives that are more language-agnostic. For example, page length as the sole measure of content or quality depends on the script used and how "dense" a language is. If links can be used in the modeling, they can be represented as Wikidata IDs (language-agnostic) and greatly increase the density and coverage of the data.
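The link-based approach above can be sketched as follows. The title-to-QID mapping here is a toy stand-in (in practice it comes from the MediaWiki API via `prop=pageprops` / `ppprop=wikibase_item`, or from the Wikidata dumps), and the helper names are illustrative.

```python
# Toy mapping from (language, article title) to Wikidata QID; real QIDs
# shown for Berlin (Q64) and Germany (Q183), but the table itself is a
# stand-in for an API or dump lookup.
TITLE_TO_QID = {
    ("en", "Berlin"): "Q64",
    ("de", "Berlin"): "Q64",
    ("en", "Germany"): "Q183",
    ("de", "Deutschland"): "Q183",
}

def links_as_qids(lang, titles):
    """Map an article's outgoing link titles to language-agnostic QIDs."""
    return {TITLE_TO_QID[(lang, t)] for t in titles if (lang, t) in TITLE_TO_QID}

def link_jaccard(qids_a, qids_b):
    """Jaccard overlap of two QID sets (0 when both are empty)."""
    union = qids_a | qids_b
    if not union:
        return 0.0
    return len(qids_a & qids_b) / len(union)
```

Because "Germany" and "Deutschland" both resolve to Q183, two articles with equivalent links score identically regardless of language, which a title-based comparison would miss.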

Matched Corpora[edit]

When comparing content, care should be taken when constructing the sample of articles. An obvious approach is using the subset of articles that exist in all of the languages under study -- e.g., for a comparison of Spanish and English Wikipedia, focus on articles that exist on both. This can introduce skew, though, in that many local-language topics would likely be filtered out. Other matching strategies that use language-agnostic features should also be considered -- e.g., similar categories, similar links/topics, similar structure or edit history, etc.
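The intersection approach (and its skew) can be sketched with Wikidata sitelinks, which record which language editions have an article for each concept. The sample data and function name below are illustrative; real sitelinks come from the Wikidata JSON dumps or API.

```python
# Toy sitelinks table: QID -> {language: article title}. "Q0" is a
# made-up QID standing in for a local-interest topic with no English
# article -- exactly the kind of topic intersection matching drops.
SITELINKS = {
    "Q64": {"en": "Berlin", "es": "Berlín"},
    "Q183": {"en": "Germany", "es": "Alemania"},
    "Q0": {"es": "Tema local"},
}

def matched_corpus(sitelinks, langs):
    """QIDs with an article in every language under study. Note the
    skew: single-language (often local) topics are filtered out."""
    return {qid for qid, links in sitelinks.items()
            if all(lang in links for lang in langs)}
```

Comparing the matched set against each edition's full article set is a quick way to quantify how much local content the intersection discards before committing to this sampling strategy.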


  1. TeBlunthuis, Nathan; Shaw, Aaron; Hill, Benjamin Mako (21 April 2018). "Revisiting "The Rise and Decline" in a Population of Peer Production Projects". Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems: 1–7. doi:10.1145/3173574.3173929. 
  2. Aragón, Pablo; Gómez, Vicenç; Kaltenbrunner, Andreas (3 May 2017). "To Thread or Not to Thread: The Impact of Conversation Threading on Online Discussion". Proceedings of the International AAAI Conference on Web and Social Media 11 (1): 12–21. ISSN 2334-0770.