Research:Prioritization of Wikipedia Articles/Language-Agnostic Quality

Tracked in Phabricator:
task T293480

A core underlying technology for prioritizing Wikipedia articles to be improved is an understanding of the existing quality (or, inversely, the work needed) for any article. The assumption is that articles that are already high-quality need not be prioritized for edit recommendations, even if they are considered quite important. There are three general approaches to gathering article quality scores:

  • Community assessments: individual assessments of an article's quality by Wikipedians -- e.g., English Wikipedia's quality assessment via WikiProjects. This is the gold standard but is challenging to use in research for two main reasons:
    • Different language communities have different quality scales, making comparison across languages difficult
    • Assessing article quality is a Sisyphean task given the number of articles and constant evolution of wikis. This means that quality assessments are often out-of-date or non-existent.
  • Predictive models: machine-learning models that predict an article's quality and are trained on past community assessments -- e.g., ORES articlequality. This approach addresses the lag in community quality assessments but it is still difficult to comprehensively analyze quality across languages because of the dependence on local community assessments and varying scales.
  • Synthetic metrics: models that measure an article's quality based on a series of rules that seek to capture important aspects of quality -- e.g., WikiRank.[1] This approach is much easier to scale across languages and is more transparent, though its simplicity might also lead to a lack of nuance or undesirable feedback loops if incorporated into editing workflows.

The central goal of this work is to build a simple quality model that covers all wikis and is sufficiently accurate to be useful for research and modeling. For these goals, the synthetic-metric approach is the most amenable, though we do use predictive models to inform how to weight the features that comprise the metric. The intent is also to make the model easily configurable -- i.e. amenable to community-specific or task-specific fine-tuning. Almost by definition, these models will start out as fairly reductive, but they should provide a strong baseline and a useful input to building more nuanced approaches to modeling article quality.

Potential Features[edit]

There are many potential features that could be incorporated into article quality models, some of which are listed below. Some of these features are clearly relevant to article quality in the traditional sense of a reader's experience -- e.g., page length, number of images -- while others might just relate to what sort of work is considered important in building a high-quality Wikipedia article -- e.g., quality of annotations -- even if there isn't a clear, direct impact on the reader experience. Features related to edit history -- e.g., number of editors -- are generally not considered despite being indicative, because they are not particularly actionable. See Warncke-Wang et al.[2] for more details and a very nicely-grounded approach to developing quality models that heavily inspires this work. Lewoniewski et al.[1] also provide a taxonomy of feature types and a strong baseline for synthetic measures of article quality that informs this work.

  • Quantity of content: how much information is available in an article?
    • Page length: the simplest measure of article quality but one with a fair bit of signal. Challenges include that different languages require substantially different numbers of characters to express the same ideas and thus the meaning of page length is not consistent across wikis.
    • Number of links: Linking to other Wikipedia articles generally helps readers explore content and is a good practice. Different wikis have different norms, however, about where and when it is appropriate to add links.
  • Accessibility: is the article well-structured and easy to understand?
    • Number of sections: Wikipedia articles are generally broken into sections to provide structure to the content. While more sections may suggest higher quality, it can be difficult to determine the appropriate number of sections for an article of a given length.
    • Number of templates: Templates generally indicate greater structure and readability to an article -- e.g., infoboxes -- but wikis vary greatly in their reliance on templates.
    • Number of images: multimedia content enriches Wikipedia articles, though certain topics are much easier to illustrate than others and certain norms lead to very high numbers of "images" -- e.g., including flags next to people's names to illustrate their nationality as is common in sports-related articles (example).
    • Alt-text coverage: do all images have alt-text that provides a useful description for readers who use screenreaders? This is just one example -- see English Wikipedia's Accessibility Manual of Style for more details.
  • Reliability: does the article fulfill the core content policies of Verifiability and Neutral Point of View?
    • Number of references: good Wikipedia articles should be verifiable, which generally means well-referenced. Different wikis sometimes use specific templates for references that are separate from the more universal ref tags, which can complicate extraction.
    • Lack of issue templates: editors indicate reliability issues with various cleanup templates, so the absence of these templates is a positive signal
  • Annotations: how well-maintained is the metadata about the article?
    • Number of categories: categories make finding and maintaining content easier
    • Wikidata item completeness: how much structured data is available about an article?
    • Assessments: WikiProjects are groups of editors who work to improve content in specific topic areas. They track this work through two main types of assessments: quality and importance.
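Most of the content features above can be approximated directly from raw wikitext. The stdlib-only Python sketch below counts a few of them with regular expressions; the patterns are simplifying assumptions, since real extraction must handle language-specific namespaces (e.g., `Category:` has local names on most wikis), templates that emit references, and similar edge cases:

```python
import re

def extract_features(wikitext: str) -> dict:
    """Rough language-agnostic feature counts from raw wikitext.

    Illustrative only: these regexes miss many real-world cases
    (localized namespaces, citation templates, gallery tags, etc.).
    """
    return {
        "page_length": len(wikitext),
        # <ref>...</ref> and self-closing <ref name=.../> tags
        "refs": len(re.findall(r"<ref[ >]", wikitext)),
        # lines of the form == Heading == (level 2 and deeper)
        "headings": len(re.findall(r"(?m)^==+[^=].*?==+\s*$", wikitext)),
        "wikilinks": len(re.findall(r"\[\[", wikitext)),
        "templates": len(re.findall(r"\{\{", wikitext)),
        # English namespace name only; other wikis use localized names
        "categories": len(re.findall(r"\[\[Category:", wikitext, re.IGNORECASE)),
    }

sample = (
    "Example article.<ref>A source.</ref>\n"
    "== History ==\n"
    "More text with a [[link]] and {{Infobox}}.\n"
    "[[Category:Examples]]\n"
)
print(extract_features(sample))
```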

Feature Types[edit]

There are generally two ways to represent a given feature: raw count (e.g., # of references) or proportion (e.g., # of references / page length). The first approach purely emphasizes "more content is better" but is simple to interpret. The second approach emphasizes more controlled growth -- e.g., if you add an additional section, the article might be penalized if that section doesn't have references. In the most extreme case, all features are proportions and a well-cited stub article with an image could be considered just as high-quality as a Featured Article. In practice, some form of raw page length is probably always included to account for longer articles generally being higher quality.
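The trade-off between the two representations can be seen in a tiny numeric example: a well-cited stub and a long, well-cited article can have identical reference density even though their raw counts differ by 50x. The numbers below are invented for illustration:

```python
# Hypothetical articles: a well-cited stub vs. a long, well-cited article.
stub = {"refs": 4, "length": 2_000}          # 4 references, 2k characters
featured = {"refs": 200, "length": 100_000}  # 200 references, 100k characters

def ref_density(article):
    """Proportional representation: references per character."""
    return article["refs"] / article["length"]

# Raw counts differ 50x, but the proportional feature is identical --
# which is why some raw page-length signal is usually retained.
print(ref_density(stub), ref_density(featured))  # 0.002 0.002
```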

Basic Approach[edit]

  • Extract language-agnostic features
  • Normalize features to [0-1] range for consistency
    • This can be purely language-specific -- e.g., identifying the 95th-percentile page length for a wiki, setting that as a maximum threshold, and then scaling all page-length values accordingly. The resulting model is useful for ranking content by quality, but the interpretation of the resulting scores will vary greatly by wiki -- e.g., a score of 1 in English Wikipedia may represent a far more developed article than a score of 1 in Simple English Wikipedia.
    • This could also potentially include shared thresholds -- e.g., determining that 10 images or 20 sections is a consistent expectation of high-quality content regardless of wiki. This would be closer to a more standard article quality model that seeks to map articles to interpretable classes such as Featured Article.
  • Learn a model to map the features to a single quality score between 0 and 1. If the features are all within a [0-1] range, this can be as simple as a weighted average of the features.
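The three steps above can be sketched end-to-end in a few lines of Python. The nearest-rank percentile, the example weights, and the feature values here are all illustrative assumptions, not the production configuration:

```python
def percentile_cap(values, p=0.95):
    """Per-wiki normalization threshold: the p-th percentile of a
    feature's observed values (simple nearest-rank sketch)."""
    s = sorted(values)
    idx = min(int(p * len(s)), len(s) - 1)
    return s[idx]

def normalize(value, cap):
    """Cap at the threshold, then scale into [0, 1]."""
    return min(value / cap, 1.0) if cap > 0 else 0.0

def quality_score(features, caps, weights):
    """Map normalized features to a single [0, 1] score via a
    weighted average (weights assumed to sum to 1)."""
    return sum(w * normalize(features[f], caps[f]) for f, w in weights.items())

# Hypothetical per-wiki data: reference counts for 20 articles.
ref_counts = list(range(1, 21))
caps = {"refs": percentile_cap(ref_counts), "length": 8_000}
features = {"refs": 30, "length": 4_000}   # refs exceed the cap -> clipped to 1.0
weights = {"refs": 0.5, "length": 0.5}
print(quality_score(features, caps, weights))  # → 0.75
```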


Data dumps from each model iteration can be found under this folder:


The first iteration of the language-agnostic quality model uses just four basic features: page length (log-normalized), number of references, number of sections (level 2 and 3 only), and number of images. Each feature is normalized to the 95th percentile for that wiki, with fairly generous minimum thresholds enforced -- 5 sections, 5 images, 10 references -- which roughly correspond to the average English Wikipedia article. The features are then weighted as follows: references (0.486), page length (0.258), headings (0.241), images (0.015). This results in a simple model that is mostly wiki-dependent -- i.e. scores are not comparable across wikis -- and mainly depends on quantity of content and references.
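Putting the documented pieces together, a sketch of this first iteration might look like the following. The weights and minimum thresholds are those listed above, but the exact form of the log normalization and how the minimum thresholds interact with the per-wiki percentiles are assumptions:

```python
import math

# Weights and minimum thresholds as documented for the first iteration.
WEIGHTS = {"refs": 0.486, "page_length": 0.258, "headings": 0.241, "images": 0.015}
MIN_CAPS = {"refs": 10, "headings": 5, "images": 5}

def first_iteration_score(features, wiki_p95):
    """Weighted sum of per-wiki-normalized features, in [0, 1].

    wiki_p95: the 95th-percentile value of each feature on this wiki.
    """
    score = 0.0
    for name, weight in WEIGHTS.items():
        value, cap = features[name], wiki_p95[name]
        if name == "page_length":
            # Log-normalize page length before capping (assumed form).
            value, cap = math.log1p(value), math.log1p(cap)
        cap = max(cap, MIN_CAPS.get(name, 0))  # enforce minimum thresholds
        score += weight * min(value / cap, 1.0) if cap > 0 else 0.0
    return score

# Hypothetical per-wiki 95th-percentile thresholds.
wiki_p95 = {"page_length": 50_000, "refs": 30, "headings": 8, "images": 6}
developed = {"page_length": 100_000, "refs": 50, "headings": 10, "images": 10}
print(first_iteration_score(developed, wiki_p95))  # ≈ 1.0: all features at/above caps
```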

For more details, see:



  1. Lewoniewski, Włodzimierz; Węcel, Krzysztof; Abramowicz, Witold (14 August 2019). "Multilingual Ranking of Wikipedia Articles with Quality and Popularity Assessment in Different Topics". Computers 8 (3): 60. doi:10.3390/computers8030060.
  2. Warncke-Wang, Morten; Cosley, Dan; Riedl, John (2013). "Tell Me More: An Actionable Quality Model for Wikipedia" (PDF). WikiSym '13. doi:10.1145/2491055.2491063.