Research:Cross-lingual article quality assessment

Created: 11:28, 24 June 2022 (UTC)
Contact: Diego Sáez-Trumper, Paramita Das
Duration: April 2022 – July 2022
This page documents a completed research project.


Wikipedia, the largest web repository of free knowledge, is available in more than 300 language editions. As content quality varies from article to article, editors spend substantial time rating articles against specific criteria. However, keeping these assessments complete and up to date is largely impossible given the ever-changing nature of Wikipedia. To overcome this limitation, this research project proposes a computational framework for modeling the quality of Wikipedia articles. State-of-the-art approaches to modeling Wikipedia article quality have leveraged machine learning techniques with language-specific features, finely tuned to a particular language edition's dynamics and existing quality classes. In contrast, our framework is based on language-agnostic structural features extracted from the articles, a set of universal weights, and a language-version-specific normalization criterion. This ensures that all language editions of Wikipedia can benefit from our framework, even those that do not have their own quality assessment scheme. Using this framework, we have built a large dataset of feature values and quality scores for all revisions of articles in all existing language versions of Wikipedia. We provide a descriptive analysis of these resources and a benchmark of our framework. We believe that these datasets, which are released for public use, can support a variety of downstream tasks in Wikipedia research.

To address this limitation, a strong baseline model [1] has been implemented that covers all wikis and is sufficiently accurate to measure article quality in a language-agnostic manner.


Language-Agnostic Framework

Our framework adopts a language-agnostic approach to represent Wikipedia articles, enabling one to assess their quality, such as the comprehensiveness of the knowledge or information contained within them. Several language editions of Wikipedia implement their own system for assessing the quality of articles, assigning labels based on their overall quality (good, mediocre, or substandard). These hierarchical quality rankings are based on several key factors, including topic coverage, content organization, and structural style. For instance, in English Wikipedia, articles are ranked using quality classes such as FA, GA, B, C, Start, and Stub. The label FA represents the highest-quality articles, indicating articles that are comprehensive and well written, while Stub denotes the lowest quality, with minimal meaningful content and an overall structure that requires improvement. Similar quality divisions exist in other language editions; in the French Wikipedia, labels such as AdQ, BA, A, B, BD, and ébauche are used, where AdQ represents the highest-quality articles and ébauche stands for the lowest quality, similar to Stub in English Wikipedia. The task of assessing article quality is primarily carried out by Wikipedia editors, who record their evaluations on the talk pages associated with each article.

It has been observed that publicly released resources often cover only a small, select subgroup of popular language versions among the more than 300 existing ones. [2] For that reason, our language-agnostic modeling framework has been created as a resource covering all languages, following the guiding principle of knowledge equity. We apply the framework to the full dump of revisions from all language versions of Wikipedia to show how the language-agnostic features are able to capture article quality. All the data generated from this work is publicly available; we therefore expect a diversity of research communities to benefit from this work.

Language-Agnostic Features

We have selected a set of structural features to incorporate into our proposed article quality model. Our approach is inspired by previous work providing a grounded approach to developing quality models [3], which served to propose rankings of relative quality and popularity in multiple language versions of Wikipedia [4]. For our modeling framework of article quality in any given language version of Wikipedia, we have designed the following set of language-agnostic features (a sketch of how they can be computed follows the list):

  • Page length: Square-root-normalized number of characters that exist in the Wikitext of the given revision.
  • References: Number of ref tags that exist per normalized page length. Good Wikipedia articles should be verifiable, which generally means well-referenced.
  • Sections: Number of headings (levels 2 and 3 only) per normalized page length. Wikipedia articles are generally broken into sections to provide structure to the content.
  • Wikilinks: Square-root-normalized number of wikilinks per normalized page length. Linking to other Wikipedia articles generally helps readers explore content and is a good practice.
  • Categories: Number of categories (raw count; no transformation). Categories make finding and maintaining content easier.
  • Media: Number of media files (raw count; no transformation) – e.g., image, video, or audio files.
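
The following Python sketch illustrates how these transformations could be applied to raw counts extracted from a revision. The function name and the exact reading of the wikilinks normalization are assumptions for illustration, not the project's exact implementation.

  import math

  def normalized_features(num_chars, num_refs, num_headings, num_wikilinks,
                          num_categories, num_media):
      """Illustrative (assumed) computation of the language-agnostic features
      from raw counts extracted from a revision's wikitext."""
      # Square-root-normalized page length.
      page_length = math.sqrt(num_chars)
      safe_length = max(page_length, 1.0)  # guard against empty revisions
      return {
          "page_length": page_length,
          # Counts per square-root-normalized page length.
          "refs": num_refs / safe_length,
          "headings": num_headings / safe_length,
          # Interpreted here as sqrt(wikilinks) / sqrt(characters).
          "wikilinks": math.sqrt(num_wikilinks) / safe_length,
          # Raw counts, no transformation.
          "categories": num_categories,
          "media": num_media,
      }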

Feature Extraction

Wikipedia articles are not static; they evolve over time. Editors are responsible both for creating new articles and for updating existing ones by generating new versions, called revisions. Revisions include the content of the corresponding version of the article in Wikitext format, as well as associated metadata such as the authoring editor, the timestamp, and a descriptive comment by the editor about the revision. The full history of revisions of Wikipedia articles is available in the XML dumps. To generate our dataset of language-agnostic features of Wikipedia articles, we first retrieve the Wikitext content of every revision of every article in every available language version of Wikipedia, from the beginning up to the end of 2022. Note that we only consider pages that represent articles (i.e., the main namespace) and that we omit page redirects. Then, we apply regular expressions to extract all the features from each revision. This could also be done with libraries for parsing Wikitext content such as mwparserfromhell [5]. However, we have found our approach with regular expressions on PySpark to be up to 10 times faster on medium-sized articles. With our feature extraction process, we generate a dataset of more than 2 billion revisions stored as CSV files (one file per language edition). Each row is a revision, and the columns are the id of the revision (revision id), the id of the page (page id), and the values of the extracted language-agnostic features described above. The feature files are available here.
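
As a rough illustration of the regular-expression approach, the raw counts could be obtained along the following lines. The patterns shown here are simplified assumptions (for example, they assume English namespace prefixes for categories and files, which are localized in other language editions) and may differ from the exact expressions used in the pipeline.

  import re

  # Simplified, assumed patterns; the project's actual regular expressions
  # may handle more wikitext edge cases (templates, comments, localized prefixes).
  REF_RE = re.compile(r"<ref[ >]", re.IGNORECASE)
  HEADING_RE = re.compile(r"^(==|===)[^=].*?\1\s*$", re.MULTILINE)  # level-2/3 headings only
  WIKILINK_RE = re.compile(r"\[\[[^\]:|]+(\|[^\]]*)?\]\]")  # plain wikilinks, no namespace prefix
  CATEGORY_RE = re.compile(r"\[\[Category:", re.IGNORECASE)
  MEDIA_RE = re.compile(r"\[\[(File|Image):", re.IGNORECASE)

  def raw_counts(wikitext):
      """Count the structural elements of a single revision's wikitext."""
      return {
          "num_chars": len(wikitext),
          "num_refs": len(REF_RE.findall(wikitext)),
          "num_headings": len(HEADING_RE.findall(wikitext)),
          "num_wikilinks": len(WIKILINK_RE.findall(wikitext)),
          "num_categories": len(CATEGORY_RE.findall(wikitext)),
          "num_media": len(MEDIA_RE.findall(wikitext)),
      }

These raw counts can then be fed to a normalization step such as the normalized_features sketch shown earlier.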

Quality Modeling of Articles

Our approach to quality modeling of Wikipedia articles across languages relies on the language-agnostic features described in the previous section. The pipeline has two stages: 1) learning feature weights, and 2) deriving pre-processing thresholds. In the first stage, a small sample of data is used to learn the relative weight of each of the model features (e.g., categories, page length, etc.). This stage is also used for testing different feature transformations, such as log-normalization. In the second stage, the language-agnostic features of every article are compared against the 95th percentile for that language edition of Wikipedia to determine what a “high quality” article should attain – e.g., if the top 5% of articles in English Wikipedia have at least 14 categories, then an article with 5 categories will have a score of 0.36 [min(1, 5/14)] for that feature, while an article with 20 categories would have a score of 1 [min(1, 20/14)]. Certain global minimum thresholds are also set at this stage, based on manual inspection of the data. For example, the minimum threshold for sections is 0.1, to penalize bot-driven language editions of Wikipedia with many articles that consist only of a lede paragraph (i.e., 0 sections).
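
As an illustration of this second stage, the capped per-feature scores could be computed as in the following sketch. The helper names are hypothetical, and the handling of the global minimum thresholds reflects our reading of the description above rather than a confirmed implementation detail.

  # Illustrative sketch (hypothetical names) of the second stage: the per-language
  # threshold is the 95th percentile of a feature, never lower than the global
  # minimum, and scores are capped at 1. Thresholds are assumed to be positive.

  def percentile_95(values):
      # Rough percentile for illustration only.
      ordered = sorted(values)
      return ordered[int(0.95 * (len(ordered) - 1))]

  def feature_score(value, language_threshold, global_minimum=0.0):
      effective_threshold = max(language_threshold, global_minimum)
      return min(1.0, value / effective_threshold)

  # Example from the text: with a 95th-percentile threshold of 14 categories,
  # 5 categories score min(1, 5/14) = 0.36 and 20 categories score 1.
  assert round(feature_score(5, 14), 2) == 0.36
  assert feature_score(20, 14) == 1.0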

Feature     | Weight | Min. threshold for top quality
Page length | 0.395  | 10,000 characters
References  | 0.181  | 0.15 (∼2 references per section)
Sections    | 0.123  | 0.1 (1 heading at 100 chars, 2 headings at 400 chars, etc.)
Wikilinks   | 0.115  | 0.1 (∼1 link per sentence)
Media       | 0.114  | 2
Categories  | 0.070  | 5
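
Putting the two stages together, the final score can be read as a weighted combination of the capped feature scores, using the weights from the table above. The weighted-sum form shown below is an assumption for illustration.

  # Weights from the table above; combining them as a weighted sum of the
  # capped per-feature scores is an assumption for illustration.
  WEIGHTS = {
      "page_length": 0.395,
      "refs": 0.181,
      "headings": 0.123,
      "wikilinks": 0.115,
      "media": 0.114,
      "categories": 0.070,
  }

  def quality_score(feature_scores):
      """Combine per-feature scores (each already capped at 1) into one score in [0, 1]."""
      return sum(WEIGHTS[name] * feature_scores[name] for name in WEIGHTS)

  # An article that meets or exceeds every threshold scores approximately 1.
  print(quality_score({name: 1.0 for name in WEIGHTS}))  # ~0.998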

Model Evaluation

Takeaways

Useful Links

code: https://gitlab.wikimedia.org/dsaez/article-quality-changes/-/blob/main/article_quality_Changes/history_quality_for_historic_revisions.ipynb

References

  1. https://meta.wikimedia.org/wiki/Research:Prioritization_of_Wikipedia_Articles/Language-Agnostic_Quality
  2. Johnson, Isaac; Lescak, Emily (2022). "Considerations for multilingual Wikipedia research". 
  3. Warncke-Wang, Morten; Cosley, Dan; Riedl, John (2013). "Tell me more: an actionable quality model for Wikipedia". Proceedings of the 9th International Symposium on Open Collaboration. pp. 1–10. 
  4. Lewoniewski, Włodzimierz; Węcel, Krzysztof; Abramowicz, Witold (2019). "Multilingual ranking of Wikipedia articles with quality and popularity assessment in different topics". Computers (MDPI) 8 (3). 
  5. https://github.com/earwig/mwparserfromhell