Research:Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia
This project is part of WikiProjectReliability which is dedicated to improving the reliability of Wikipedia articles. The quality and reliability of Wikipedia content is maintained by a community of volunteer editors. Machine learning and information retrieval algorithms could help scale up editors’ manual efforts around Wikipedia content reliability. However, there is a lack of large-scale data to support the development of such research. To fill this gap, we release Wiki-Reliability, the first dataset of English Wikipedia articles annotated with a wide set of content reliability issues.
On Wikipedia, the review and moderation of content is usually self-governed by Wikipedia’s volunteer community of editors, through collaboratively created policies and guidelines. However, despite the size of Wikipedia’s community (41k monthly active editors for English Wikipedia in 2020), the labor cost of maintaining Wikipedia’s content is high: Wikipedia patrollers reviewing at a rate of 10 revisions/minute would still require 483 labour hours per day to review the 290k edits saved across all the language editions of Wikipedia.
Given the large labor costs associated with patrolling new edits, automated strategies can help Wikimedia’s community of maintainers avoid task overload and focus on the content moderation work where they are most needed. This has been carried out successfully at scale for counter-vandalism, for example with the ORES service, an open source algorithmic scoring service which scores Wikipedia edits in real time using multiple independent Machine Learning classifiers.
The goal of the project is to encourage automated strategies for the moderation of content reliability, by providing Machine Learning datasets for this purpose.
Citation and Verifiability Templates
One of the ways reliability is governed on Wikipedia is through templates: messages displayed on the page they are included in, which warn of gaps in the reliability of the page’s content. We rely on these templates to build our dataset. Templates are tags used by expert Wikipedia editors to indicate content issues, such as the presence of “non-neutral point of view” or “contradictory articles”, and serve as a strong signal for detecting reliability issues in a revision.
WikiProjectReliability maintains a list of maintenance templates related to citation and verifiability issues. These serve as a warning marker not just for the reader, but also as a tag for maintenance purposes, which points out to editors that moderation fixes are needed to improve the reliability of an article. Thus, we can get an idea for the reliability of an article by checking for the presence of these templates.
We select the 10 most popular reliability-related templates on Wikipedia, and propose an effective method to label almost 1M samples of Wikipedia article revisions as positive or negative with respect to each template. Each positive/negative example in the dataset comes with the full article text and 20 features from the revision’s metadata.
Previous works obtained positive and negative instances of a template being present by randomly selecting from a current snapshot of pages, rather than comparing subsequent revisions. Our approach is able to capture multiple instances of template addition and removal over the full revision history of pages.
Prior work has also constructed a negative set from Wikipedia’s Featured Articles and Good Articles. This ignores stylistic differences between those articles and the positive set, which a model could learn to predict instead of the reliability issue. In contrast, our approach takes into account differences in article category and quality, because we extract pairs of related contrasting examples. We surmise that this would improve the ability of a model to learn from the data. Additionally, our approach addresses the imbalanced class issue where the number of positive examples exceeds the number of negative examples.
Selection of templates
We categorised the 41 WikiProjectReliability maintenance templates by article, section, and in-line level. We then manually curated the maintenance templates based on their impact to Wikimedia, prioritising templates which are of interest to the community.
Parsing Wikipedia Dumps
The full history of Wikipedia articles is available through periodically updated XML dumps. We use the full English Wikipedia dump from September 2020, which is 1.1TB in size when compressed in bz2 format. We convert this data to AVRO and process it using PySpark. We apply a regular expression to extract all the templates in each revision, and retain all the articles that contain a template from our predefined list. Next, using the MediaWiki History dataset, we obtain additional information (metadata) about each revision of these articles.
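The template extraction step can be illustrated with a minimal sketch. The regular expression below is an assumption for illustration, not the pattern used in the project: it captures only the template name, does not balance nested braces, and normalises names by lowercasing. The `TRACKED` set is a hypothetical subset of the template list.

```python
import re

# Matches the name of a template invocation such as "{{POV|date=May 2020}}".
# Simplified: parser functions ("{{#if:...}}") are skipped and nested braces
# are not balanced; only the template name is captured.
TEMPLATE_NAME = re.compile(r"\{\{\s*([^{}|#<>\[\]]+?)\s*(?:\||\}\})")

# Hypothetical subset of the tracked template names (lowercased).
TRACKED = {"more citations needed", "pov", "contradict"}


def extract_templates(wikitext):
    """Return the normalised names of all templates found in the wikitext."""
    names = set()
    for match in TEMPLATE_NAME.finditer(wikitext):
        # Lowercasing and replacing underscores is a simplification of
        # MediaWiki's title normalisation rules.
        names.add(match.group(1).replace("_", " ").strip().lower())
    return names


def tracked_templates(wikitext):
    """Return only the templates that belong to the tracked list."""
    return extract_templates(wikitext) & TRACKED


sample = "{{POV|date=May 2020}}\nSome text.{{citation needed}} {{Infobox person\n| name = X}}"
print(sorted(extract_templates(sample)))
print(tracked_templates(sample))
```

In a PySpark job, a function like `tracked_templates` would typically be applied to each revision's text, keeping the pages for which the result is non-empty.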
Handling instances of vandalism
The accurate detection of true positive and negative cases is further complicated by instances of vandalism, where a template has been maliciously or wrongly added or removed. To handle this, we rely on the wisdom of the crowd by ignoring revisions which were later reverted by editors. Prior research suggests that 94% of reverts can be detected by matching MD5 checksums of revision content historically. Thus, we use SHA checksum matching as a reliable method of detecting whether a revision was later reverted.
However, comparing SHA checksums of consecutive revisions is a computationally expensive process as it requires processing the entire history of revisions. Fortunately, the MediaWiki History dataset contains monthly data dumps of all events with pre-computed fields of computationally expensive features to facilitate analyses. Of interest is the revision_is_identity_reverted feature in revision events, which marks whether a revision was reverted by a later revision. We use it to exclude all reverted revisions from the articles’ edit history.
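The pre-computed flag avoids the expensive pass over the history, but the idea behind checksum-based revert detection is simple to sketch. The code below is an illustration under stated assumptions, not the MediaWiki History implementation: a revision counts as reverted when a later revision restores content identical to an earlier one, and `radius` is a hypothetical cap on how far back a revert may reach.

```python
import hashlib


def sha1_of(text):
    """Hex SHA-1 digest of a revision's content."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def find_reverted(texts, radius=15):
    """Return indices of revisions undone by a later identity revert.

    texts: revision contents of one page, oldest first.
    radius: hypothetical limit on how many revisions back a revert may reach.
    """
    checksums = [sha1_of(t) for t in texts]
    reverted = set()
    for j, checksum in enumerate(checksums):
        start = max(0, j - radius)
        # Search backwards for the most recent earlier revision with the
        # same content, skipping the immediate predecessor (a null edit).
        for i in range(j - 2, start - 1, -1):
            if checksums[i] == checksum:
                # Everything strictly between i and j was undone by j.
                reverted.update(range(i + 1, j))
                break
    return reverted


history = ["clean", "vandalised", "clean", "clean plus a fix"]
bad = find_reverted(history)
kept = [text for k, text in enumerate(history) if k not in bad]
print(bad, kept)
```

With the MediaWiki History dataset, the same filtering reduces to dropping every revision whose revision_is_identity_reverted field is true.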
Obtaining positive and negative examples
For each citation and verifiability maintenance template, we construct a dataset which consists of positive and negative examples of a template’s presence. We iterate through all consecutive revisions of a page’s revision history to extract positive/negative class pairs. We define a positive example as the revision where a template was added. The presence of a template serves as a strong signal for detecting that a reliability issue exists in a revision.
Labeling negative samples in this context is a non-trivial task. While the presence of a template is a strong signal of a content reliability issue, the absence of a template does not necessarily imply the converse. A revision may contain a reliability issue that has not yet been reviewed by expert editors and flagged with a maintenance template. Therefore, we iterate through the article history succeeding the positive revision, and label as negative the first revision where the template does not appear, i.e. the revision where the template was removed. The negative example then acts as a contrasting class, as it constitutes the positive example which has been fixed for its reliability issue.
- 1. Obtain Wikipedia edit history
- We download the Wikipedia dump from September 2020, and load it in AVRO format for PySpark processing.
- 2. Obtaining “reverted” status of a revision
- We query the MediaWiki History dataset to extract the “reverted” status of a revision.
- 3. Check if the revision contains a template
- For each template in the template list, we loop over all non-reverted revisions to find the first revision where the template has been added. We mark that revision as positive to indicate that it contains the template.
- 4. Process positive/negative pairs
- We iterate through all consecutive revisions of a positive example to find the next non-reverted revision where the template has been removed, and mark it as negative.
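Steps 3 and 4 above can be sketched as a single pass over a page's history. This is a minimal illustration rather than the released processing code; it assumes each revision is a dict with the hypothetical keys `revision_id` and `templates` (the set of template names found in that revision), that reverted revisions have already been dropped, and that every addition followed by a removal yields one pair.

```python
def extract_pairs(revisions, template):
    """Return (positive_id, negative_id) pairs for one page and one template.

    revisions: non-reverted revisions of a page, oldest first.
    template: normalised template name to track.
    """
    pairs = []
    positive_id = None           # revision that added the template, if any
    previously_present = False
    for rev in revisions:
        present = template in rev["templates"]
        if present and not previously_present:
            # The template was added in this revision: positive example.
            positive_id = rev["revision_id"]
        elif not present and previously_present:
            # First revision without the template: negative example.
            pairs.append((positive_id, rev["revision_id"]))
            positive_id = None
        previously_present = present
    # A template that is never removed produces no pair.
    return pairs


history = [
    {"revision_id": 1, "templates": set()},
    {"revision_id": 2, "templates": {"pov"}},
    {"revision_id": 3, "templates": {"pov"}},
    {"revision_id": 4, "templates": set()},
    {"revision_id": 5, "templates": {"pov"}},
]
print(extract_pairs(history, "pov"))
```

Revision 5 in the usage example adds the template again but is never followed by a removal, so it contributes no pair.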
We share all processing code on our GitHub page.
After processing all reliability-related templates for positive/negative class pairs, we filter out templates with pair counts of less than 1000. This leaves us with the following top 10 templates:
|more citations needed||article||13707|
|pov||article or section||5214|
|contradict||article or section||2268|
Following that, we compute both metadata and text-level features for our data.
We extract metadata features for each revision in our data by querying the ORES API’s Article Quality model, resulting in 26 metadata features in total.
For our final released datasets, we narrow down our feature list by filtering out features of the lowest importance based on our benchmark binary models. To analyse the performance of different features across all templates, we obtained the importance scores of features from our benchmarked XGBoost models, which achieved the best performance on our metadata features. We trained XGBoost models on different subsets of features ordered by their importance, and obtained the accuracy score for each. We determined the most commonly occurring features of least importance across all templates, and removed them from our final dataset, reducing our feature count from 26 to 20. We confirmed that the models trained on the reduced subset of features achieve comparable (and sometimes improved) accuracy to models trained on the full set. The schema for our released datasets for each template is as follows:
|page_id||Page ID of the revision|
|revision_id||ID of the revision|
|revision_id.key||ID of the corresponding pos/neg revision|
|revision_text_bytes||Change in bytes of revision text|
|stems_length||Average length of stemmed text|
|images_in_tags||Count of images in tags|
|infobox_templates||Count of infobox templates|
|paragraphs_without_refs||Total length of paragraphs without references|
|shortened_footnote_templates||Number of shortened footnotes (i.e., citations with page numbers linking to the full citation for a source)|
|words_to_watch_matches||Count of matches from Wikipedia's words to watch: words that are flattering, vague or endorsing a viewpoint|
|revision_words||Count of words for the revision|
|revision_chars||Number of characters in the full article|
|revision_content_chars||Number of characters in the content section of an article|
|external_links||Count of external links not in Wikipedia|
|headings_by_level(2)||Count of level-2 headings|
|ref_tags||Count of reference tags, indicating the presence of a citation|
|revision_wikilinks||Count of links to pages on Wikipedia|
|article_quality_score||Letter grade of article quality prediction|
|cite_templates||Count of templates that come up on a citation link|
|cn_templates||Count of citation needed templates|
|who_templates||Number of who templates, signaling vague "authorities", i.e., "historians say", "some researchers"|
|revision_templates||Total count of all transcluded templates|
|category_links||Count of categories an article has|
|has_template||Binary label indicating presence or absence of a reliability template in our dataset|
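The feature-pruning step that produced this schema (finding the features that are most often among the least important across templates) can be sketched as follows. The function, its `bottom_k` and `min_templates` thresholds, and the placeholder feature names and scores in the usage example are all hypothetical; the source does not state the exact selection rule.

```python
from collections import Counter


def commonly_least_important(importances_by_template, bottom_k, min_templates):
    """Return features ranked in the bottom_k for at least min_templates templates.

    importances_by_template: {template: {feature: importance score}}.
    """
    counts = Counter()
    for importances in importances_by_template.values():
        # Sort feature names by ascending importance for this template.
        ranked = sorted(importances, key=importances.get)
        counts.update(ranked[:bottom_k])
    return sorted(f for f, n in counts.items() if n >= min_templates)


# Made-up scores for illustration only; these are not results from the project.
scores = {
    "template_a": {"feature_1": 0.50, "feature_2": 0.10, "feature_3": 0.40},
    "template_b": {"feature_1": 0.30, "feature_2": 0.20, "feature_3": 0.50},
}
print(commonly_least_important(scores, bottom_k=1, min_templates=2))
```

Features returned by such a rule are candidates for removal; the accuracy comparison against the full feature set remains the deciding check.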
While certain citation-related templates can be distinguished by metadata features, some reliability templates are distinguished by differences in their content text. Thus, we also create text-based datasets for the purpose of text classification.
For each revision in our dataset, we query the API for its wikitext, which we parse to obtain only its plain text content, filtering out all wikilinks, and non-content sections (reference, external links, etc). Finally, we obtain the diff between the revision texts of each positive/negative pair, obtaining the changed sections of text for each revision. We produce two versions of our text datasets: one composed of the diff text, and another of the full article level text.
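As an illustration of the diff step, the sketch below uses Python's standard `difflib` to keep only the lines that differ between the two revisions of a pair. The line-level granularity and the function name `changed_text` are assumptions; the source does not specify how the diff was computed.

```python
import difflib


def changed_text(pos_text, neg_text):
    """Return (text only in the positive revision, text only in the negative one)."""
    pos_lines = pos_text.splitlines()
    neg_lines = neg_text.splitlines()
    matcher = difflib.SequenceMatcher(a=pos_lines, b=neg_lines, autojunk=False)
    only_pos, only_neg = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # "replace", "delete" and "insert" blocks hold the changed lines.
            only_pos.extend(pos_lines[i1:i2])
            only_neg.extend(neg_lines[j1:j2])
    return "\n".join(only_pos), "\n".join(only_neg)


pos = "Intro line.\nSome say it is the best.\nEnd."
neg = "Intro line.\nA 2019 survey ranked it first.\nEnd."
print(changed_text(pos, neg))
```

The two returned strings correspond to the diff variants of the txt_pos and txt_neg fields; the full-text variant simply keeps both revision texts unchanged.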
The schema for our released text datasets is as follows:
|page_id||Page ID of the revision|
|revision_id_pos||Revision ID of the positive example|
|revision_id_neg||Revision ID of the corresponding negative example|
|txt_pos||Full/Diff text of the positive example|
|txt_neg||Full/Diff text of the corresponding negative example|
We release all our final datasets on Figshare.
In order to measure the informativeness of our dataset, we establish some baseline models for comparison by future work. For each template, we train baseline classification models predicting the label has_template, which acts as an indicator for the presence of a citation and verifiability issue. We train two kinds of baseline models: metadata feature-based models and text-based models.
We train benchmark models for the metadata feature datasets across all templates. We benchmark each template on the following models: Logistic Regression, SVM, Decision Tree, Random Forest, K-Nearest Neighbours, Naive Bayes, XGBoost, and finally, an ensemble of all the aforementioned classifiers.
Each classifier is trained using 5-fold cross validation, with GroupKFold to ensure non-overlapping groups of page IDs across the train and test splits: no revision of a page appears in the test set if another revision of the same page occurs in the train set, and vice versa. We also ensure that the train and test splits have balanced label distributions.
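The guarantee that group-wise splitting provides can be shown with a small standard-library sketch. This is not scikit-learn's GroupKFold, which is the class named above; the round-robin assignment of pages to folds is a simplification that does not try to balance fold sizes or label distributions.

```python
def group_k_fold(page_ids, n_splits=5):
    """Yield (train_indices, test_indices) with no page shared between them.

    page_ids: one page ID per sample, in dataset order.
    """
    unique_pages = sorted(set(page_ids))
    if len(unique_pages) < n_splits:
        raise ValueError("need at least n_splits distinct pages")
    # Assign whole pages, not individual revisions, to folds.
    fold_of_page = {page: k % n_splits for k, page in enumerate(unique_pages)}
    for fold in range(n_splits):
        test = [i for i, p in enumerate(page_ids) if fold_of_page[p] == fold]
        train = [i for i, p in enumerate(page_ids) if fold_of_page[p] != fold]
        yield train, test


# Each page contributes a positive and a negative revision.
page_ids = [10, 10, 11, 11, 12, 12, 13, 13, 14, 14]
for train, test in group_k_fold(page_ids):
    train_pages = {page_ids[i] for i in train}
    test_pages = {page_ids[i] for i in test}
    print(sorted(test_pages), train_pages.isdisjoint(test_pages))
```

Because each positive/negative pair shares a page ID, splitting by page also keeps both halves of a pair on the same side of the split.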
For all experiments, we recorded the overall classification accuracy as well as the precision, recall, F1-score, and Area Under the ROC Curve.
We train text classification models on our text data across all templates. We convert the raw text data to vector representations in the following manners: by computing TF-IDF features, and by encoding the text using pre-trained word embeddings.
We obtain TF-IDF features from our raw text data, with a maximum vocabulary size of 1000, after confirming that this makes minimal difference to performance compared with training on the full vocabulary.
We then benchmark the following models on our TF-IDF text features: Logistic Regression, SVM, Decision Tree, Random Forest, K-Nearest Neighbours, XGBoost.
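To make the vocabulary cap concrete, here is a minimal standard-library sketch of TF-IDF with a limited vocabulary. It is an illustration only: whitespace tokenisation, the smoothed IDF formula, the absence of vector normalisation, and the name `tfidf_vectors` are assumptions, and the project's own feature extraction may differ.

```python
import math
from collections import Counter


def tfidf_vectors(documents, max_features=1000):
    """Return (vocabulary, vectors) using the max_features most frequent terms."""
    tokenised = [doc.lower().split() for doc in documents]
    # Vocabulary cap: keep only the most frequent terms across the corpus.
    term_counts = Counter(t for tokens in tokenised for t in tokens)
    vocabulary = [t for t, _ in term_counts.most_common(max_features)]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(t for tokens in tokenised for t in set(tokens))
    n_docs = len(documents)
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append([tf[t] * idf[t] for t in vocabulary])
    return vocabulary, vectors


docs = ["the cat sat", "the dog sat", "the bird flew"]
vocabulary, vectors = tfidf_vectors(docs, max_features=2)
print(vocabulary)
print(vectors)
```

Each document becomes a fixed-length vector with one weight per retained term, which is the input the classifiers listed above are trained on.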
Finally, we trained simple binary text classifiers on pre-trained word embeddings. We perform Transfer Learning to train the classifiers using pretrained text embedding modules from TF-Hub. We test our data on the following TF-Hub embedding modules:
- nnlm-en-dim128: encodes each individual word into 128 dimension embedding vectors and then averages them across a sentence for a final 128-dimensional sentence embedding.
- random-nnlm-en-dim128: a text embedding module with the same vocabulary and network as nnlm-en-dim128, but with randomly initialised weights
- universal-sentence-encoder: which takes in variable length English text and outputs a 512 dimensional vector
We add a DNNClassifier classification layer on top of each text embedding module, and train in two modes:
- Without module training: training only the classifier (i.e. freezing the module), and
- With module training: training the classifier together with the module
Each model is trained over 25 epochs.
Across all templates, the XGBoost model achieves the highest performance scores. The ensemble StackingCVClassifier model is able to achieve improved or comparable performance to the XGBoost model.
We show an example of the score results for the template original research below:
|Model||Accuracy||Precision||Recall||F1-score||AUC|
|Support Vector Machine||0.580475||0.5814||0.580475||0.57928||0.609065|
We also computed the feature importances from our classifier, plotted below:
We release a notebook of the benchmarking models and scores for all other templates on PAWS.
On the template original research, we obtain the following results:
|Model||Accuracy||Precision||Recall||F1-score||AUC|
|Support Vector Machine||0.504947||0.252473||0.500000||0.338844||0.560614|
The models trained on our text-based features do not perform as well as on metadata features. This is unsurprising as the text data is more difficult to learn from. As our data is at the article/diff level as opposed to at the sentence level, the observed performance could be attributed to multiple factors, such as the TF-IDF features not being expressive enough.
We release a notebook of the benchmarking models and scores for all other templates on PAWS.
Finally, we also trained some simple binary text classifiers on our text data using pre-trained word embeddings from TF-Hub. The results for the original research template are presented below:
The best performing model from our experiments is the universal-sentence-encoder model, which obtains an accuracy of 52.45%. However, the embedding-based approach does not outperform a simple TF-IDF logistic regression model. We surmise that the simple model architecture is not expressive enough for the task: for example, we use pre-trained sentence-level embedding modules even though our data is at the article level, owing to computational constraints for this project. The limited number of dimensions used to represent document-length text may also dilute the signal.
Due to computational constraints, we were also unable to train our data on more complex models such as BERT. We believe that such models could lead to greater improvements on the task and hope that the final released datasets encourage future work on this task.
- Ivan Beschastnikh, Travis Kriplean, and David W. McDonald. 2008. Wikipedian Self-Governance in Action: Motivating the Policy Lens. In International AAAI Conference on Web and Social Media (ICWSM).
- Andrea Forte, Vanesa Larco, and Amy Bruckman. 2009. Decentralization in Wikipedia Governance. Journal of Management Information Systems 26, 1 (2009), 49–72.
- Dan Cosley, Dan Frankowski, Loren Terveen, and John Riedl. 2007. SuggestBot: using intelligent task routing to help people find work in wikipedia. In Proceedings of the 12th international conference on Intelligent user interfaces. ACM, 32–41.
- Aaron Halfaker and R. Stuart Geiger. 2020. ORES: Lowering Barriers with Participatory Machine Learning in Wikipedia. Proc. ACM Human Computer Interaction. 4, CSCW2, Article 148 (October 2020), 37 pages.
- Maik Anderka, Benno Stein, and Nedim Lipka. 2012. Predicting quality flaws in user-generated content: the case of wikipedia. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval (SIGIR '12). Association for Computing Machinery, New York, NY, USA, 981–990.
- Shruti Bhosale, Heath Vinicombe, and Raymond Mooney. 2013. Detecting promotional content in wikipedia. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1851–1857.
- Aniket Kittur, Bongwon Suh, Bryan A. Pendleton, and Ed H. Chi. 2007. He says, she says: conflict and coordination in Wikipedia. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '07). Association for Computing Machinery, New York, NY, USA, 453–462.