Research:Prioritization of Wikipedia Articles/Language-Agnostic Quality

Tracked in Phabricator: Task T293480

A core underlying technology for prioritizing Wikipedia articles to be improved is to have an understanding of the existing quality (or the inverse work-needed) for any article. The assumption is that for articles that are already high-quality, even if they are considered quite important, it is not necessary to prioritize them for edit recommendations. There are three general approaches to gathering article quality scores:

  • Community assessments: individual assessments of an article's quality by Wikipedians -- e.g., English Wikipedia's quality assessment via WikiProjects. This is the gold standard but is challenging to use in research for two main reasons:
    • Different language communities have different quality scales, making comparison across languages difficult
    • Assessing article quality is a Sisyphean task given the number of articles and constant evolution of wikis. This means that quality assessments are often out-of-date or non-existent.
  • Predictive models: machine-learning models that predict an article's quality and are trained on past community assessments -- e.g., ORES articlequality. This approach addresses the lag in community quality assessments but it is still difficult to comprehensively analyze quality across languages because of the dependence on local community assessments and varying scales.
  • Synthetic metrics: models that measure an article's quality based on a series of rules that seek to capture important aspects of an article's quality -- e.g., WikiRank.[1] This approach is transparent and much easier to scale across languages, though its simplicity might also lead to a lack of nuance or to undesirable feedback loops if incorporated into editing workflows.

The central goal of this work is to build a simple quality model that has coverage of all wikis and is sufficiently accurate as to be useful for research and modeling. For these goals, the synthetic-metric approach is the best fit, though we do use predictive models to inform how to weight the features that comprise the metric. The intent is to build the model so that it is easily configurable -- i.e., amenable to community-specific or task-specific fine-tuning. Almost by definition, these models will start out as fairly reductive but hopefully will provide a strong baseline and useful input for building more nuanced approaches to modeling article quality.

Potential Features

There are many potential features that could be incorporated into article quality models, some of which are listed below. Some of these features are clearly relevant to article quality in the traditional sense of a reader's experience -- e.g., page length, number of images -- while others might just relate to what sort of work is considered important in building a high-quality Wikipedia article -- e.g., quality of annotations -- even if there isn't a clear, direct impact on the reader experience. Features related to edit history -- e.g., number of editors -- are generally not considered despite being indicative because they are not particularly actionable. See Warncke-Wang et al.[2] for more details and a very nicely-grounded approach to developing quality models that heavily inspires this work. Lewoniewski et al.[1] also provide a taxonomy of feature types and strong baseline for synthetic measures of article quality that informs this work.

  • Quantity of content: how much information is available in an article?
    • Page length: the simplest measure of article quality but one with a fair bit of signal. Challenges include that different languages require substantially different numbers of characters to express the same ideas and thus the meaning of page length is not consistent across wikis.
    • Number of links: Linking to other Wikipedia articles generally helps readers explore content and is a good practice. Different wikis have different norms, however, about where and when it is appropriate to add links.
  • Accessibility: is the article well-structured and easy to understand?
    • Number of sections: Wikipedia articles are generally broken into sections to provide structure to the content. While more sections may generally suggest higher quality, it can be difficult to determine the appropriate number of sections for an article of a given length.
    • Number of templates: Templates generally add structure and readability to an article -- e.g., infoboxes -- but wikis vary greatly in their reliance on templates.
    • Number of images: multimedia content enriches Wikipedia articles, though certain topics are much easier to illustrate than others and certain norms lead to very high numbers of "images" -- e.g., including flags next to people's names to illustrate their nationality as is common in sports-related articles (example).
    • Alt-text coverage: do all images have alt-text that provides a useful description for readers who use screenreaders? This is just one example -- see English Wikipedia's Accessibility Manual of Style for more details.
  • Reliability: does the article fulfill the core content policies of Verifiability and Neutral Point of View?
    • Number of references: good Wikipedia articles should be verifiable, which generally means well-referenced. Different wikis sometimes use specific templates for references that are separate from the more universal ref tags, which can complicate extraction.
    • No issue templates: editors flag reliability issues with various maintenance templates, so the absence of such templates is a positive signal.
  • Annotations: how well-maintained is the metadata about the article?
    • Number of categories: categories make finding and maintaining content easier
    • Wikidata item completeness: how much structured data is available about an article?
    • Assessments: WikiProjects are groups of editors who work to improve content in specific topic areas. They track this work through two main types of assessments: quality and importance.

Feature Types

There are generally two ways to represent a given feature: raw count (e.g., # of references) or proportion (e.g., # of references / page length). The first approach purely emphasizes "more content is better" but is simple to interpret. The second approach emphasizes more controlled growth -- e.g., if you add an additional section, the article might be penalized if that section doesn't have references. In the most extreme case, all features are proportions and a well-cited stub article with an image could be considered just as high-quality as a Featured Article. In practice, some form of raw page length is probably always included to account for longer articles generally being higher quality.
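
As a rough illustration of the difference, here is a minimal sketch comparing the two representations; the feature names and numbers are hypothetical, chosen only to show how the same article looks under each scheme:

```python
# Hypothetical raw values for a single article
page_length = 2500   # characters of wikitext
references = 10
sections = 4

# Raw-count representation: "more content is better"
raw_features = {
    "page_length": page_length,
    "references": references,
    "sections": sections,
}

# Proportion representation: growth must be matched by references/structure
proportion_features = {
    "page_length": page_length,                          # some raw length measure is usually kept
    "references_per_length": references / page_length,   # penalizes unreferenced growth
    "sections_per_length": sections / page_length,
}
```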

Basic Approach

  • Extract language-agnostic features
  • Normalize features to [0-1] range for consistency
    • This can be purely language-specific -- e.g., identifying the 95th percentile of page length for a wiki, setting that as a maximum threshold, and then scaling all page-length features accordingly. The resulting model is useful for ranking content by quality, but interpretation of the resulting scores will vary greatly by wiki -- e.g., an article scoring 1 on English Wikipedia may be far more developed than an article scoring 1 on Simple English Wikipedia.
    • This could also potentially include shared thresholds -- e.g., determining that 10 images or 20 sections is a consistent expectation of high-quality content regardless of wiki. This would be closer to a more standard article quality model that seeks to map articles to interpretable classes such as Featured Article.
  • Learn a model to map the features to a single quality score between 0 and 1. If the features are all within a 0-1 range, this can be as simple as a weighted average of the features (see the sketch below).
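
A minimal sketch of this pipeline, assuming percentile-based normalization and a weighted average; the feature names, thresholds, and weights here are placeholders rather than the production values:

```python
def normalize_features(raw_features, max_thresholds):
    """Clip each feature at its wiki-specific threshold (e.g., the 95th percentile
    for that wiki) and scale to the [0, 1] range."""
    return {name: min(value / max_thresholds[name], 1.0)
            for name, value in raw_features.items()}

def quality_score(normalized_features, weights):
    """Weighted average of normalized features -> single quality score in [0, 1]."""
    return sum(weights[name] * value for name, value in normalized_features.items())

# Hypothetical wiki-level thresholds (e.g., 95th-percentile values for one wiki)
max_thresholds = {"page_length": 30000, "references": 10, "sections": 5, "images": 5}
# Hypothetical weights that sum to 1.0
weights = {"page_length": 0.4, "references": 0.3, "sections": 0.2, "images": 0.1}

article = {"page_length": 12000, "references": 4, "sections": 3, "images": 1}
print(quality_score(normalize_features(article, max_thresholds), weights))  # ~0.42
```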

Models

Data dumps from each model iteration can be found under this folder: https://analytics.wikimedia.org/published/datasets/one-off/isaacj/quality/

V1

The first iteration of the language-agnostic quality model just uses four basic features: page length (log-normalized), number of references, number of sections (level 2 and 3 only), and number of images. Each feature is normalized to the 95th percentile for that wiki (with pretty generous minimum thresholds enforced -- 5 sections, 5 images, 10 references -- which roughly correspond to the average English Wikipedia article). The features are then weighted as follows: references (0.486), page length (0.258), headings (0.241), images (0.015). This results in a simple model that is mostly wiki-dependent -- i.e. scores are not comparable across wikis -- and mainly depends on quantity of content and references.
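
A rough sketch of how the V1 scoring could be reproduced, using the weights above; the exact form of the log transformation and the handling of the percentile thresholds are assumptions based on this description, not the notebook code linked below:

```python
import math

# Published V1 weights
V1_WEIGHTS = {"references": 0.486, "page_length": 0.258, "headings": 0.241, "images": 0.015}

# Minimum thresholds enforced on each wiki's 95th-percentile values
MIN_THRESHOLDS = {"references": 10, "headings": 5, "images": 5}

def v1_quality(raw, wiki_p95):
    """raw: feature values for one article; wiki_p95: 95th-percentile values for its wiki.
    Page length is log-normalized; every feature is divided by its (floored) wiki
    threshold and capped at 1 before the weighted sum."""
    score = 0.0
    for feature, weight in V1_WEIGHTS.items():
        if feature == "page_length":
            value = math.log(1 + raw[feature])            # log-normalization (assumed form)
            threshold = math.log(1 + wiki_p95[feature])
        else:
            value = raw[feature]
            threshold = max(wiki_p95[feature], MIN_THRESHOLDS[feature])
        score += weight * min(value / threshold, 1.0)
    return score

article = {"page_length": 15000, "references": 12, "headings": 6, "images": 2}
wiki_p95 = {"page_length": 40000, "references": 8, "headings": 4, "images": 3}
print(round(v1_quality(article, wiki_p95), 3))  # ~0.967 for this made-up article/wiki
```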

For more details, see: https://analytics.wikimedia.org/published/datasets/one-off/isaacj/quality/V1_2021_04/README.md

Code: https://github.com/geohci/miscellaneous-wikimedia/blob/master/article-features/quality_model_features_V1.ipynb

V2

The second iteration of the quality model slightly expanded the feature set, changed how the features were extracted (with fairly large consequences), and, as a result, tweaked the approach to normalizing / transforming the features. I retained much of the same structure / approach as V1, though predictions for certain articles can vary greatly between the two versions.

The features and pre-processing now are as follows (a code sketch of these transformations appears after the list):

  • Page length: square-root-normalized number of bytes in the wikitext. Minimum wiki threshold of 100, which is equivalent to 10000 characters. Most wikis are already above this -- even Chinese Wikipedia is at 12,526 characters at its 95th percentile.
  • References: number of ref tags per normalized-page-length. Minimum wiki threshold of 0.15, which is roughly equivalent to 1 reference at 45 characters, 2 references at 180 characters, 3 references at 400 characters etc. Alternatively, it's about 2 references per section (using section logic below).
  • Sections: number of headings (levels 2 and 3 only) per normalized-page-length. Minimum wiki threshold of 0.1, which is 1 heading at 100 characters, 2 headings at 400 characters, 3 headings at 900 characters etc. This mostly penalizes bot-driven wikis that have many articles with lede paragraphs (so 0 sections).
  • Wikilinks: square-root-normalized number of wikilinks per normalized-page-length. Minimum wiki threshold of 0.1, which is equivalent to 1 link per 100 characters (~1 link per sentence).
  • Categories: number of categories (raw count; no transformation). Minimum wiki threshold of 5.
  • Media: number of media files (raw count; no transformation) -- e.g., image, video, or audio files. Minimum wiki threshold of 2.
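
Putting the transformations above together, here is a minimal sketch of the V2 feature pre-processing; the capping behavior and the assumption that the wiki-level percentiles are computed on the transformed features are based on this description rather than the production code:

```python
import math

# Minimum wiki-level thresholds from the list above (floors on each wiki's 95th percentile)
MIN_THRESHOLDS = {
    "page_length": 100,   # sqrt(characters), i.e. 10,000 characters
    "references": 0.15,   # refs per sqrt(characters)
    "sections": 0.1,      # level-2/3 headings per sqrt(characters)
    "wikilinks": 0.1,     # sqrt(wikilinks) per sqrt(characters)
    "categories": 5,      # raw count
    "media": 2,           # raw count
}

def v2_features(raw_counts):
    """Transform raw counts for one article into the V2 feature representation.
    Expected keys: characters, references, sections, wikilinks, categories, media."""
    norm_length = math.sqrt(max(raw_counts["characters"], 1))
    return {
        "page_length": norm_length,
        "references": raw_counts["references"] / norm_length,
        "sections": raw_counts["sections"] / norm_length,
        "wikilinks": math.sqrt(raw_counts["wikilinks"]) / norm_length,
        "categories": raw_counts["categories"],
        "media": raw_counts["media"],
    }

def scale_to_unit(features, wiki_p95):
    """Scale each feature to [0, 1] using the wiki's 95th percentile of the
    transformed feature (assumed), floored at the minimum thresholds, capped at 1."""
    return {name: min(value / max(wiki_p95[name], MIN_THRESHOLDS[name]), 1.0)
            for name, value in features.items()}
```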

The features are all extracted directly from the wikitext of the article as opposed to relying on link tables. This is generally simpler and faster than the above approach because references and sections always needed to be extracted via wikitext anyway. It also allows the model to be applied directly to historical revisions (for which link tables do not exist) and to be easily hosted as an API. Crucially, it filters out a lot of what I consider to be noise -- e.g., links added via navigation templates; image icons such as flags or checkmarks that are added via templates (which is why the minimum media threshold dropped from 5 in V1 to 2 in this version); and tracking/hidden categories that are added via templates. Using the square root to reduce skew is a weaker correction than the logarithm and thus allows for a wider range of scores (with the logarithm, even two sentences would generally have been enough to get 50% of the page length feature). Normalizing references, sections, and wikilinks by page length is done to reduce collinearity, but also because these features should grow in line with content (though the square-root correction to page length means that they do not track linearly).
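
A rough sketch of how such counts might be pulled straight from the wikitext with simple regular expressions; the linked notebook is more careful (e.g., handling localized category/file prefixes per wiki and template-added content), so treat these patterns as illustrative approximations:

```python
import re

def extract_raw_counts(wikitext):
    """Approximate raw counts from a page's wikitext (English prefixes only; illustrative)."""
    return {
        "characters": len(wikitext),
        # <ref>...</ref> and self-closing <ref name="..." /> tags
        "references": len(re.findall(r"<ref[ >/]", wikitext, flags=re.IGNORECASE)),
        # level-2 and level-3 headings: == Heading == or === Heading ===
        "sections": len(re.findall(r"^===?[^=].*?===?\s*$", wikitext, flags=re.MULTILINE)),
        # [[wikilinks]], excluding category and media links
        "wikilinks": len(re.findall(r"\[\[(?!Category:|File:|Image:)", wikitext, flags=re.IGNORECASE)),
        "categories": len(re.findall(r"\[\[Category:", wikitext, flags=re.IGNORECASE)),
        "media": len(re.findall(r"\[\[(?:File|Image):", wikitext, flags=re.IGNORECASE)),
    }

# e.g., feed the result into v2_features() from the sketch above
```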

I also considered including a feature for the number of templates, but it had a negative relationship with quality, so I left it out. The negative relationship makes sense in retrospect: there are many reasons to add templates, often related to maintenance or quality issues.

The features are weighted as follows in the model: page length (0.395), references (0.181), sections (0.123), wikilinks (0.115), media (0.114), and categories (0.070). Due to rounding, the weights sum to 0.998, so that is the maximum possible quality score.

For more details, see: https://analytics.wikimedia.org/published/datasets/one-off/isaacj/quality/V2_2022_01/README.md

Code: https://github.com/geohci/miscellaneous-wikimedia/blob/master/article-features/quality_model_features_V2.ipynb

References

  1. Lewoniewski, Włodzimierz; Węcel, Krzysztof; Abramowicz, Witold (14 August 2019). "Multilingual Ranking of Wikipedia Articles with Quality and Popularity Assessment in Different Topics". Computers 8 (3): 60. doi:10.3390/computers8030060.
  2. Warncke-Wang, Morten; Cosley, Dan; Riedl, John (2013). "Tell Me More: An Actionable Quality Model for Wikipedia" (PDF). WikiSym '13. doi:10.1145/2491055.2491063.