
Research:Cross-lingual article quality assessment

From Meta, a Wikimedia project coordination wiki
11:28, 24 June 2022 (UTC)
Diego Sáez-Trumper
Paramita Das
Duration:  2022-April – 2022-July
This page documents a completed research project.

As the largest web repository of free knowledge, Wikipedia is available in more than 300 language editions. Because content quality varies from article to article, editors spend substantial time rating articles against specific criteria. However, keeping these assessments complete and up to date is largely impossible given the ever-changing nature of Wikipedia. To overcome this limitation, this research project proposes a computational framework for modeling the quality of Wikipedia articles. State-of-the-art approaches to modeling Wikipedia article quality have leveraged machine learning techniques with language-specific features, finely tuned to a particular language edition's dynamics and existing quality classes. In contrast, our framework is based on language-agnostic structural features extracted from the articles, a set of universal weights, and a normalization criterion specific to each language version. As a result, all language editions of Wikipedia can benefit from our framework, even those that do not have their own quality assessment scheme. Using this framework, we have built a large dataset of feature values and quality scores from over 2 billion revisions of articles in all existing language versions of Wikipedia. We provide a descriptive analysis of these resources and a benchmark of our framework. We believe that these datasets, which are released for public use, can help in different downstream tasks related to Wikipedia research.

Language-Agnostic Framework

Our framework adopts a language-agnostic approach to representing Wikipedia articles, which makes it possible to assess their quality, such as the comprehensiveness of the knowledge or information they contain. A few language editions of Wikipedia implement their own system for assessing the quality of articles, assigning labels based on overall quality: good, mediocre, or substandard. These hierarchical quality rankings are based on several key factors, including topic coverage, content organization, and structural style. For instance, in English Wikipedia, articles are ranked using the quality classes FA, GA, B, C, START, and STUB. The label FA represents the highest-quality articles, which are comprehensive and well written, while STUB denotes the lowest quality: articles with minimal meaningful content whose overall structure requires improvement. Similar quality divisions exist in other language editions, such as the French Wikipedia, where the labels AdQ, BA, A, B, BD, and ebauche are used; AdQ represents the highest-quality articles and ebauche the lowest quality, similar to STUB in English Wikipedia. The task of assessing article quality is primarily carried out by Wikipedia editors, who record their evaluations on the talk page associated with each article. Publicly available resources often cover only a small subgroup of popular language versions among the more than 300 existing ones [1]. For that reason, our language-agnostic modeling framework has been created as a resource to provide knowledge in all languages, following the guiding principle of knowledge equity. We apply the framework to the full dump of revisions from all language versions of Wikipedia to show how the language-agnostic features capture article quality. All the data generated from this work is publicly available. Therefore, we expect a diversity of research communities to benefit from this work.

Language-Agnostic Features

We have selected a set of structural features that are incorporated into our proposed article quality model. Our approach is inspired by previous work that provides a grounded approach to developing quality models [2] and that has served to propose rankings of relative quality and popularity in multiple language versions of Wikipedia [3]. For our modeling framework of article quality in any given language version of Wikipedia, we have designed the following set of language-agnostic features:

  • Page length: Square-root-normalized number of characters that exist in the Wikitext of the given revision.
  • References: Number of ref tags that exist per normalized page length. Good Wikipedia articles should be verifiable, which generally means well-referenced.
  • Sections: Number of headings (levels 2 and 3 only) per normalized page length. Wikipedia articles are generally broken into sections to provide structure to the content.
  • Wikilinks: Square-root-normalized number of wikilinks per normalized page length. Linking to other Wikipedia articles generally helps readers explore content and is a good practice.
  • Categories: Number of categories (raw count; no transformation). Categories make finding and maintaining content easier.
  • Media: Number of media files (raw count; no transformation) – e.g., image, video, or audio files.
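The feature definitions above can be sketched as a small function that turns raw counts into feature values. This is a minimal sketch, not the project's code: the function name and argument names are hypothetical, "normalized page length" is read here as the square root of the character count, and the handling of empty pages is an assumption.

```python
import math


def normalize_features(chars, refs, headings, wikilinks, categories, media):
    """Turn raw counts for one revision into the six feature values.

    All names are illustrative; only the transformations follow the list above.
    """
    norm_length = math.sqrt(chars)  # square-root-normalized page length
    if norm_length == 0:
        # Assumption: an empty page gets zero for every length-relative feature.
        return {"page_length": 0.0, "references": 0.0, "sections": 0.0,
                "wikilinks": 0.0, "categories": categories, "media": media}
    return {
        "page_length": norm_length,
        "references": refs / norm_length,                 # ref tags per normalized length
        "sections": headings / norm_length,               # level 2-3 headings per normalized length
        "wikilinks": math.sqrt(wikilinks) / norm_length,  # sqrt of link count per normalized length
        "categories": categories,                         # raw count
        "media": media,                                   # raw count
    }


# Made-up counts for a 10,000-character article.
example = normalize_features(chars=10_000, refs=15, headings=10,
                             wikilinks=100, categories=5, media=2)
```

Dividing by the square root of the length, rather than the length itself, means longer articles need proportionally fewer references and headings per character to reach the same feature value.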

Feature Extraction

Wikipedia articles are not static; they evolve over time. Editors both create new articles and update existing ones by generating new versions, called revisions. A revision includes the content of the corresponding version of the article in Wikitext format, as well as associated metadata such as the authoring editor, the timestamp, and a descriptive comment by the editor about the revision. The full history of revisions of Wikipedia articles is available in the XML dumps. To generate our dataset of language-agnostic features of Wikipedia articles, we first retrieve the Wikitext content of every revision of every article in every available language version of Wikipedia from the beginning to the end of 2022. Note that we only consider pages that represent articles (i.e., the main namespace) and that we omit page redirects. Then, we apply regular expressions to extract all the features from each revision. This could also be done with libraries for parsing Wikitext content such as mwparserfromhell [4]. However, we have found our approach with regular expressions on PySpark to be up to 10 times faster on medium-sized articles. With our feature extraction process, we generate a dataset of more than 2 billion revisions stored as CSV files (one file per language edition). Each row is a revision, and the columns are the id of the revision (revision id), the id of the page (page id), and the values of the extracted language-agnostic features described above. The feature files are available here.
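As an illustration of regular-expression-based extraction, the following minimal sketch counts the raw elements in a single Wikitext string. The patterns, the English namespace prefixes, and the function name are assumptions made for this example; the project's actual expressions are not given on this page, and a language-agnostic version would need the namespace aliases of every language edition.

```python
import re

# Hypothetical patterns, for illustration only.
REF_RE = re.compile(r"<ref[\s>/]", re.IGNORECASE)                    # opening <ref ...> tags
HEADING_RE = re.compile(r"^(={2,3})[^=].*?\1[ \t]*$", re.MULTILINE)  # level 2 and 3 headings
LINK_RE = re.compile(r"\[\[([^\[\]|]+)")                             # target of [[target|label]]

# Assumption: English namespace prefixes only.
CATEGORY_PREFIXES = ("category:",)
MEDIA_PREFIXES = ("file:", "image:")


def count_raw_features(wikitext):
    """Return raw counts for one revision's Wikitext."""
    counts = {"chars": len(wikitext),
              "refs": len(REF_RE.findall(wikitext)),
              "headings": len(HEADING_RE.findall(wikitext)),
              "wikilinks": 0, "categories": 0, "media": 0}
    for target in LINK_RE.findall(wikitext):
        target = target.strip().lower()
        if target.startswith(CATEGORY_PREFIXES):
            counts["categories"] += 1
        elif target.startswith(MEDIA_PREFIXES):
            counts["media"] += 1
        else:
            # Assumption: category and media links are not counted as wikilinks.
            counts["wikilinks"] += 1
    return counts


sample = (
    "'''Example''' is a [[test]] article about [[Testing|tests]].<ref>Source A</ref>\n"
    "== History ==\n"
    "Some text.<ref name=\"b\">Source B</ref>\n"
    "=== Early years ===\n"
    "[[File:Example.jpg|thumb|An image]]\n"
    "==== Too deep ====\n"
    "[[Category:Examples]]\n"
)
sample_counts = count_raw_features(sample)
```

In this sample the level 4 heading is ignored, and the closing `</ref>` tags and any `<references />` tag do not match the reference pattern.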

To illustrate the value of the dataset, we compare the 9 largest language versions by editing activity: English (en), German (de), French (fr), Spanish (es), Italian (it), Russian (ru), Japanese (ja), Chinese (zh), and Vietnamese (vi). The box plots below present the distribution of the values of each feature in the latest revision of each article in these 9 versions.

Box plot of the feature (page length) for the top 9 Wikipedia language versions by editing activity: en, de, fr, es, it, ru, ja, zh, vi.
Box plot of the feature (number of references) for the top 9 Wikipedia language versions by editing activity: en, de, fr, es, it, ru, ja, zh, vi.
Box plot of the feature (number of headings) for the top 9 Wikipedia language versions by editing activity: en, de, fr, es, it, ru, ja, zh, vi.
Box plot of the feature (number of wikilinks) for the top 9 Wikipedia language versions by editing activity: en, de, fr, es, it, ru, ja, zh, vi.
Box plot of the feature (number of categories) for the top 9 Wikipedia language versions by editing activity: en, de, fr, es, it, ru, ja, zh, vi.
Box plot of the feature (number of media) for the top 9 Wikipedia language versions by editing activity: en, de, fr, es, it, ru, ja, zh, vi.

We observe that English Wikipedia, the largest and most popular language version, exhibits larger values for features such as page length and number of references. However, Japanese Wikipedia leads in the number of sections and wikilinks. We also note the remarkably lower values for Vietnamese Wikipedia, a language version with a high percentage of very short, bot-generated articles (i.e., stubs) [5].

Quality Modeling of Articles

Our approach to quality modeling of Wikipedia articles across languages relies on the language-agnostic features described in the previous section. The pipeline has two stages: 1) learning feature weights, and 2) deriving pre-processing thresholds. In the first stage, a small sample of data is used to learn the relative weight of each model feature (e.g., categories, text). This stage is also used to test different feature transformations, such as log-normalization. In the second stage, the language-agnostic features of every article are compared against the 95th percentile for that language edition of Wikipedia, which determines what a “high quality” article should attain. For example, if the top 5% of articles in English Wikipedia have at least 14 categories, then an article with 5 categories has a score of 0.36 [min(1, 5/14)] for that feature, while an article with 20 categories has a score of 1 [min(1, 20/14)]. Certain global minimum thresholds are also set at this stage, based on manual inspection of the data. For example, the minimum threshold for sections is 0.1, which penalizes bot-driven language editions of Wikipedia in which many articles consist only of a lede paragraph (i.e., 0 sections). The weight and minimum threshold of each feature are shown in the table below.

Feature        Weight   Min. threshold for top quality
Page length    0.395    10,000 characters
References     0.181    0.15 (∼2 references per section)
Sections       0.123    0.1 (1 heading at 100 chars, 2 headings at 400 chars, etc.)
Wikilinks      0.115    0.1 (∼1 link per sentence)
Media          0.114    2
Categories     0.070    5
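The scoring described in this section can be sketched as follows. The weights and minimum thresholds are taken from the table; everything else is an assumption made for illustration: the function names, the nearest-rank percentile method, the reading that the minimum threshold acts as a floor on the 95th percentile, and the final score as a weighted sum of per-feature scores. The page length threshold is expressed in normalized units (the square root of 10,000 characters).

```python
WEIGHTS = {"page_length": 0.395, "references": 0.181, "sections": 0.123,
           "wikilinks": 0.115, "media": 0.114, "categories": 0.070}

# Minimum thresholds from the table; page length in sqrt(characters).
MIN_THRESHOLDS = {"page_length": 100.0, "references": 0.15, "sections": 0.1,
                  "wikilinks": 0.1, "media": 2, "categories": 5}


def percentile_95(values):
    """Nearest-rank 95th percentile (assumed method; the page does not specify one)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integer arithmetic
    return ordered[rank - 1]


def feature_score(value, p95, min_threshold):
    """Score one feature in [0, 1] against its language-specific target."""
    target = max(p95, min_threshold)  # assumption: the threshold is a floor on p95
    return min(1.0, value / target)


def quality_score(features, p95_by_feature):
    """Weighted sum of per-feature scores (assumed combination rule)."""
    return sum(
        WEIGHTS[name] * feature_score(features[name], p95_by_feature[name],
                                      MIN_THRESHOLDS[name])
        for name in WEIGHTS
    )


# Worked example from the text: the 95th percentile for categories is 14.
five_categories = feature_score(5, p95=14, min_threshold=5)     # 5/14, about 0.36
twenty_categories = feature_score(20, p95=14, min_threshold=5)  # capped at 1.0
```

Note that the weights in the table sum to 0.998 rather than exactly 1, so under this reading an article that reaches every target scores 0.998.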

Dataset of Predicted Quality Scores

With our dataset of language-agnostic features of revisions of Wikipedia articles in over 300 language editions, we apply the modeling approach described above to predict their quality. Again, we store the results in CSV files with the original columns (revision id, page id) and a column with predicted quality score (pred qual). In addition, we include a column with the id of each article in Wikidata (item id) to facilitate the identification of the same Wikipedia article across language editions.

Further, we analyze the evolution of article quality for the 9 largest language editions by editing activity. In particular, for each year, we select the predicted quality score of the latest revision of every article existing up to and including that year. These scores are grouped into box plots, as shown in the figures below. For most language editions, we observe a slow but steady increase over time, with the overall quality becoming more stable in recent years. This observation can be explained by the work of Wikipedia editors, who improve the quality of articles by expanding their content. However, this is not the case for all language editions. Article quality in the Vietnamese Wikipedia rises and falls until 2013, and its values are thereafter concentrated in a narrower range than in other language editions. We examined the Vietnamese language edition in the MediaWiki History dataset [6] and found that the ratio of bot-generated revisions increased until 2014 (up to 92% of revisions in that year were written by bots), then declined, and rebounded in the last 3 years.
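The yearly selection described above can be sketched as follows. This is an illustration under assumptions: revision timestamps are not among the columns listed for the score files, so the sketch assumes they have been joined in from revision metadata as ISO 8601 strings, and the function name and tuple layout are hypothetical.

```python
def latest_scores_by_year(revisions, years):
    """For each year, keep the score of each page's latest revision up to that year.

    revisions: iterable of (page_id, iso_timestamp, score) tuples (assumed layout).
    Returns {year: {page_id: score}}.
    """
    # ISO 8601 timestamps in the same time zone sort chronologically as strings.
    ordered = sorted(revisions, key=lambda rev: rev[1])
    latest = {}   # page_id -> score of the most recent revision seen so far
    result = {}
    position = 0
    for year in sorted(years):
        # Consume every revision made in or before this year.
        while position < len(ordered) and int(ordered[position][1][:4]) <= year:
            page_id, _, score = ordered[position]
            latest[page_id] = score
            position += 1
        result[year] = dict(latest)  # snapshot of all articles existing by this year
    return result


# Made-up revisions for two pages.
demo_revisions = [
    (1, "2010-05-01T00:00:00Z", 0.2),
    (2, "2011-01-01T00:00:00Z", 0.1),
    (1, "2012-03-01T00:00:00Z", 0.5),
]
snapshots = latest_scores_by_year(demo_revisions, [2010, 2011, 2012])
```

The values of each yearly snapshot are what would be grouped into one box of a box plot.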

Box plots showing article quality scores as predicted by our model over years: English Wikipedia
Box plots showing article quality scores as predicted by our model over years: German Wikipedia
Box plots showing article quality scores as predicted by our model over years: French Wikipedia
Box plots showing article quality scores as predicted by our model over years: Spanish Wikipedia
Box plots showing article quality scores as predicted by our model over years: Italian Wikipedia
Box plots showing article quality scores as predicted by our model over years: Russian Wikipedia
Box plots showing article quality scores as predicted by our model over years: Japanese Wikipedia
Box plots showing article quality scores as predicted by our model over years: Chinese Wikipedia
Box plots showing article quality scores as predicted by our model over years: Vietnamese Wikipedia

Model Evaluation

To evaluate the effectiveness of our modeling approach against the quality assessment schemes of Wikipedia, we compile a sample of test articles from both the English Wikipedia and the French Wikipedia. We extract the ground-truth quality labels in these two language editions through regular expressions. We select only articles whose revision timestamp precedes the last quality assessment appearing on their talk page. In this way, we ensure that the content of the articles has not changed substantially between the time of the ground-truth quality assessment and the revision from which we extract language-agnostic features. Selecting the corresponding revision of each article, we create a dataset of 12,640 articles from English Wikipedia and 12,864 articles from French Wikipedia with ground-truth quality labels, balanced across the quality classes of the respective language edition. Although French Wikipedia uses a quality scheme distinct from that of English Wikipedia, we have quantitatively established a mapping between the quality classes used in French Wikipedia and the quality assessment scheme of English Wikipedia, as shown below:

French WP quality labels    English WP quality labels
ebauche                     STUB

By aligning the quality class hierarchies of the two language editions in this way, we can compare and evaluate the performance of our model within a common quality framework. For the sample of articles with ground-truth quality labels, we apply our modeling framework to compute a numerical quality score (between 0 and 1) for each test article. Then, we map each output score to a quality label of English Wikipedia according to ranges derived from a small sample of English Wikipedia articles (i.e., the upper limit of each quality class is the median of the predicted quality scores of the revisions in that class). In this way, the quality scores generated for the test articles (here, French and English articles) are mapped to the English Wikipedia quality classes.
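The mapping from scores to labels can be sketched as below. The sample scores are made up for illustration, the function names are hypothetical, and two assumptions go beyond the text: the class medians are taken to increase with class rank, and any score above the highest limit is assigned to the top class.

```python
from bisect import bisect_left
from statistics import median

CLASSES = ["STUB", "START", "C", "B", "GA", "FA"]  # lowest to highest quality


def class_upper_limits(scores_by_class):
    """Upper limit of each class: the median predicted score of its sample revisions."""
    return [median(scores_by_class[label]) for label in CLASSES]


def score_to_label(score, upper_limits):
    """Assign the lowest class whose upper limit is at least the score."""
    index = bisect_left(upper_limits, score)  # requires ascending limits
    # Assumption: scores above the highest limit still belong to the top class.
    return CLASSES[min(index, len(CLASSES) - 1)]


# Made-up sample scores, not values from the project.
sample_scores = {
    "STUB": [0.10, 0.20, 0.30], "START": [0.30, 0.40, 0.50],
    "C": [0.50, 0.55, 0.60], "B": [0.60, 0.65, 0.70],
    "GA": [0.70, 0.80, 0.90], "FA": [0.85, 0.90, 0.95],
}
limits = class_upper_limits(sample_scores)
labels = [score_to_label(s, limits) for s in (0.15, 0.45, 0.70, 0.99)]
```

Because the limits come from English Wikipedia only, the same `limits` list would be reused unchanged when labeling French articles.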

The figures below show the confusion matrices for the predictions of our model on the datasets of revisions from English and French Wikipedia. In both matrices, misclassifications are concentrated within the quality label subgroups. For example, in English Wikipedia, articles belonging to the FA class are predicted as GA more often than as any other quality class. The same holds for French Wikipedia, as well as for the other subgroups of quality classes (i.e., START/STUB and B/C). Because quality classes are defined by qualitative measures only, we conclude that distinguishing a class from the one immediately above or below it is a complex task.

Confusion matrix of class distributions of the articles as predicted by our model: English Wikipedia
Confusion matrix of class distributions of the articles as predicted by our model: French Wikipedia

Model Benchmarking

We benchmark our framework against two baseline machine learning models: ORES, an article quality prediction framework created by the Wikimedia Foundation (Halfaker and Geiger 2020), and Random Forest (RF). We chose the second model not only to compare it with our framework but also to examine the predictive value of our set of language-agnostic features under a supervised machine learning approach. RF models are trained separately on the revisions of English Wikipedia and French Wikipedia, and we select the best hyperparameter settings in the training phase. To obtain a more reliable estimate of the performance of the RFs across all classes, we use 10-fold cross-validation; the average accuracy obtained by the classifier is 0.52 for English Wikipedia and 0.51 for French Wikipedia. The benchmark against the ORES and RF models is based on the following metrics:

  • Spearman rank correlation (m1): To measure the association between ground-truth quality labels and model scores, we use Spearman’s rank correlation coefficient. Ground-truth labels are converted to numerical values according to the ranks of the classes. For example, in English Wikipedia, which has six quality classes, FA (featured article) labels are transformed to 1.0, GA (good article) labels to 5/6, B labels to 4/6, etc.
  • Label alignment: We quantify label alignment in three ways:
  1. Exact match (m2): Percentage of predicted labels exactly assigned to the ground-truth labels.
  2. Within the same group (m3): Percentage of predicted labels falling within the same group of ground truth classes. Quality classes are categorized into three groups of labels: GA/FA, C/B and START/STUB.
  3. Within one class (m4): Percentage of predicted labels matching within one class of ground-truth labels.
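The four metrics can be sketched with the standard library alone. This is an illustrative implementation, not the evaluation code used in the project: the function names are hypothetical, tied values receive average ranks, and the example labels are made up.

```python
import math

CLASSES = ["STUB", "START", "C", "B", "GA", "FA"]
RANK = {label: i for i, label in enumerate(CLASSES)}
GROUP = {"STUB": 0, "START": 0, "C": 1, "B": 1, "GA": 2, "FA": 2}


def label_to_value(label):
    """FA -> 1.0, GA -> 5/6, B -> 4/6, and so on."""
    return (RANK[label] + 1) / len(CLASSES)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(x, y):
    """m1: Pearson correlation of the ranks (undefined if either input is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def label_alignment(true_labels, predicted_labels):
    """m2, m3, m4 as percentages."""
    pairs = list(zip(true_labels, predicted_labels))
    n = len(pairs)
    return {
        "m2": 100 * sum(t == p for t, p in pairs) / n,
        "m3": 100 * sum(GROUP[t] == GROUP[p] for t, p in pairs) / n,
        "m4": 100 * sum(abs(RANK[t] - RANK[p]) <= 1 for t, p in pairs) / n,
    }


# Made-up labels for six articles.
truth = ["FA", "GA", "B", "C", "START", "STUB"]
predicted = ["GA", "GA", "C", "START", "START", "STUB"]
alignment = label_alignment(truth, predicted)
```

For m1, the ground-truth side would be `[label_to_value(t) for t in truth]` and the other side the model's numerical quality scores.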

The results of our model benchmarking are presented in the tables below. The three tables show the values of the metrics m1, m2, m3, and m4 for our model, ORES, and RF, respectively.

Our model
Language   m1     m2     m3     m4
English    0.79   40.9   66.3   82.4
French     0.76   40.4   67.9   83.9

ORES
Language   m1     m2     m3     m4
English    0.85   58.6   78.7   89.9
French     0.79   50.9   68.4   83.0

RF
Language   m1     m2     m3     m4
English    0.82   51.7   73.8   85.2
French     0.80   51.6   71.9   83.8

For English Wikipedia, we observe that ORES provides the best results in all metrics. Recall that ORES models incorporate language-dependent features, which might explain their more accurate predictions of article quality. However, the results differ for French Wikipedia, where RFs and our framework achieve slightly better predictions in certain metrics, e.g., label assignment within one class (m4). These results support our intuition about the noteworthy predictive value of modeling with language-agnostic features. Furthermore, the improvement obtained with RFs, compared with the simpler heuristics of our framework, suggests that machine learning techniques hold promise when applied to our set of language-agnostic features.


We have presented a framework to model Wikipedia article quality using language-agnostic features. Our approach transforms the unstructured and massive content of Wikipedia XML dumps into a dataset of language-agnostic features from revisions. Therefore, this resource contains a structured, smaller, and more manageable representation of the full history of all Wikipedia articles. Additionally, we have created a second dataset with the scores resulting from applying our framework that automatically assesses article quality using these features. Our datasets have several applications. The most intuitive one is to examine the evolution of article quality in a particular language version of Wikipedia and run cross-lingual studies. For example, future research could analyze each feature to conduct specific studies on the number of characters, references, sections, images, and links added within and across languages, e.g., to evaluate the impact of coordinated campaigns to expand knowledge on Wikipedia. With the release of the datasets of our work, we expect to make Wikipedia content more accessible to diverse research communities.

However, we should note that our quality assessment model was evaluated only with articles from English and French Wikipedia. While these two language editions were chosen because they allowed us to obtain comparable “ground-truth quality” labels, the exclusive use of these two languages for testing is a limitation of this study. Therefore, future work should extend this evaluation with data from low-resource language editions of Wikipedia.



  1. Johnson, Isaac; Lescak, Emily (2022). "Considerations for multilingual Wikipedia research".
  2. Warncke-Wang, Morten; Cosley, Dan; Riedl, John (2013). "Tell me more: an actionable quality model for Wikipedia". Proceedings of the 9th International Symposium on Open Collaboration. pp. 1–10.
  3. Lewoniewski, Włodzimierz; Węcel, Krzysztof; Abramowicz, Witold. "Multilingual ranking of Wikipedia articles with quality and popularity assessment in different topics". Computers (MDPI) 8 (3).
  4. https://github.com/earwig/mwparserfromhell
  5. https://stats.wikimedia.org/EN/BotActivityMatrixCreates.htm
  6. https://dumps.wikimedia.org/other/mediawiki_history/readme.html