Research:Learning from dispute templates

12:23, 28 January 2021 (UTC)
collaboration patterns, article quality, templates
This page documents a completed research project.

This document describes the planned continuation of collaboration between researchers at the Wikimedia Foundation and the University of Cambridge, following a 12-week internship from January to April 2021.

Problem description

Wikipedia leverages large-scale online collaboration by editors to provide an open source online encyclopaedia. The collaborative content creation process has played a key role in the creation of the encyclopaedia, but has also led to some variance in the quality of articles on the platform (Sarkar et al., 2019): sockpuppeting (Solorio et al., 2013), vandalism (West et al., 2010), factual inaccuracies (Wang and Cardie, 2016) and biased article perspectives (Matei and Dobrescu, 2011) are some of the artefacts of this process. To alert moderators, other editors and readers to such issues, the platform allows editors to add a variety of markup templates to articles.

In this work, our goal is to understand and model disputes in biographies of living persons. We collect a set of templates used to signal problematic content in those articles and, based on them, create a dataset that allows ML models to be trained to discover such content automatically.

Previous Work

During the internship, a dataset of approximately 34 000 tag addition events was mined, considering only articles in the category “Biographies of Living People” and only tags that describe articles with a promotional tone: “advert”, “fanpov”, “autobiography”, “peacock” and “weasel”. Since the objective was to study patterns of collaboration related to these tags, we extracted features describing edits to the article itself, the associated Talk page, the history of editors who contributed to the page, and relationships between editors. We developed logistic regression models to discriminate between these events and negative samples, based on the extracted features. The top performing model had an ROC-AUC score of 0.758 on the full dataset, which is promising. These models serve as a baseline for performance on this task, since they cannot incorporate the inherently graph-like structure of the data and require the per-editor features to be aggregated. Furthermore, they do not take the text of the article into account at all.

Detecting promotional tone from text


An example of a sample from our dataset, with three different revisions shown.

Maintaining a neutral point of view is a desideratum in many communication channels, e.g. news articles, scientific writing, and encyclopaedias. The task of detecting biased writing is useful to mitigate the distribution of content which contains unfair representations of a topic. On Wikipedia, the problem of biased writing manifests itself in the form of promotional tone, which violates the cornerstone "neutral point of view" policy of the platform.

During an internship in January-March 2021, an initial investigation was launched into the effects of collaboration patterns on article quality, specifically focusing on articles with a promotional tone problem. This included developing a data extraction methodology and predictive models. The relevant code and data were released.

This work was subsequently extended to consider text-based methods for detecting promotional tone. The resulting dataset, which we called WikiEvolve, contains seven versions of the same article from Wikipedia, taken from different points in its revision history: one with promotional tone, and six without it. This provides a more precise training signal for promotional tone detection models. To model this data, we adapted the gradient reversal layer framework to encode two article versions simultaneously and thus leverage this additional training signal. In our experiments, our proposed adaptation of gradient reversal improved the accuracy of four different architectures on both in-domain and out-of-domain evaluation.


Our data extraction methodology consists of (i) finding articles tagged as having a promotional tone problem at some point in their edit history, (ii) selecting the revision where the template was added as a positive sample, and (iii) sampling negatives from revisions which did not contain the template.

Finding promotional tone tags

To identify tags of interest, we refer to the Wikipedia category "Articles with a promotional tone" and identify the quality tags which occur most frequently in the articles of this category. These are advert, autobiography, fanpov, peacock and weasel. Each of these tags describes a different type of promotional tone issue. We then use regular expressions to collect all revisions which contain variations of these tags in the Wikitext history table.
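The regular expression step can be illustrated with a minimal sketch. The tag names come from the text above, but the pattern itself and the helper name find_promo_tags are illustrative assumptions, not the expressions used in the project. A real pipeline would also need to cover template aliases and redirects, which this pattern does not attempt.

```python
import re

# Tag names are taken from the text; everything else here is an assumption.
PROMO_TAGS = ("advert", "autobiography", "fanpov", "peacock", "weasel")

# Matches e.g. {{Advert}}, {{ peacock |date=May 2020}}, {{Weasel|section}}.
# Case-insensitive, allows whitespace and optional template parameters.
TAG_PATTERN = re.compile(
    r"\{\{\s*(" + "|".join(PROMO_TAGS) + r")\s*(?:\|[^{}]*)?\}\}",
    re.IGNORECASE,
)


def find_promo_tags(wikitext: str) -> set[str]:
    """Return the lower-cased promotional tone tags found in a revision's wikitext."""
    return {match.group(1).lower() for match in TAG_PATTERN.finditer(wikitext)}


example = "{{Advert|date=May 2020}}\n'''Jane Doe''' is a world-renowned expert. {{peacock}}"
tags = find_promo_tags(example)
```

Plain mentions of the words (for instance "an advert for a product") are not matched, because the pattern requires the surrounding template braces.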

Finding tag addition events

Once incidences of promotional tone tags have been identified, we use the MediaWiki history table to find the full edit histories of these articles. For each article, we then identify the point in its edit history where a tag was added, and consider this version of the article as the positive sample. We exclude cases where the tag addition edit was reverted by another editor. The article text at this timestamp is retrieved from the WikiText data lake.
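As a rough sketch of this step, the function below walks a chronologically sorted revision history and reports the revisions in which a tag appears for the first time relative to the preceding revision, skipping additions that were reverted. The Revision record and its field names are hypothetical and do not reflect the schema of the MediaWiki history table.

```python
from dataclasses import dataclass


@dataclass
class Revision:
    # Hypothetical record; field names are assumptions for illustration.
    rev_id: int
    has_tag: bool   # the revision's wikitext contains a promotional tone tag
    reverted: bool  # the edit was later reverted by another editor


def tag_addition_events(history: list[Revision]) -> list[int]:
    """Return rev_ids at which a tag is newly present, ignoring reverted additions.

    `history` is assumed to be sorted chronologically.
    """
    events = []
    previous_has_tag = False
    for rev in history:
        if rev.has_tag and not previous_has_tag and not rev.reverted:
            events.append(rev.rev_id)
        previous_has_tag = rev.has_tag
    return events


history = [
    Revision(1, has_tag=False, reverted=False),
    Revision(2, has_tag=True, reverted=True),    # tag added, then reverted
    Revision(3, has_tag=False, reverted=False),  # the revert itself
    Revision(4, has_tag=True, reverted=False),   # tag added and kept
    Revision(5, has_tag=True, reverted=False),
    Revision(6, has_tag=False, reverted=False),  # tag removed
]
events = tag_addition_events(history)
```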

Sampling negatives

For each positive sample, we select contrasting negatives from the revision history of the same article. We consider as candidates all revisions which were not reverted and which took place within 30 revisions (chronologically sorted) of the tag addition event. This is intended to ensure that the negative samples are of the same approximate stage of article development as the positive sample. We exclude the revision immediately before the tag addition event, as it is this version which prompted the tag to be added.

Up to three revisions (depending on availability) are selected at random from these candidates on each side of the positive, i.e. up to three before it and up to three after it. We refer to a positive together with its negatives as a sample set. The negatives sampled before the tag addition are denoted neg_pre, and those sampled after it are denoted neg_post.
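The sampling rules above can be sketched as follows. This is a minimal illustration under stated assumptions: revisions are dictionaries with hypothetical keys rev_id, has_tag and reverted, the history is sorted chronologically, and "within 30 revisions" is interpreted as a positional window around the positive.

```python
import random


def sample_negatives(history, pos_index, window=30, k=3, seed=0):
    """Return (neg_pre, neg_post) rev_ids for the positive at history[pos_index].

    Candidates are non-reverted revisions without the tag that lie within
    `window` positions of the positive. The revision immediately before the
    positive is excluded, since it prompted the tag to be added.
    """
    rng = random.Random(seed)

    def eligible(i):
        rev = history[i]
        return not rev["reverted"] and not rev["has_tag"]

    first = max(0, pos_index - window)
    last = min(len(history), pos_index + window + 1)

    # range(..., pos_index - 1) stops before the immediately preceding revision.
    pre = [i for i in range(first, pos_index - 1) if eligible(i)]
    post = [i for i in range(pos_index + 1, last) if eligible(i)]

    neg_pre = sorted(rng.sample(pre, min(k, len(pre))))
    neg_post = sorted(rng.sample(post, min(k, len(post))))
    return ([history[i]["rev_id"] for i in neg_pre],
            [history[i]["rev_id"] for i in neg_post])
```

When fewer than three candidates exist on one side, all of them are returned, which mirrors the "depending on availability" condition in the text.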

The number of samples per tag and class is shown in this table:

Tag            # Positives  # Negatives
Autobiography  1578         9289
Advert         4361         25843
Fanpov         413          2446
Peacock        2859         16960
Weasel         906          5421
Total          8539         59959

The dataset can be downloaded here.


To make better use of the training signal available in the multiple versions per article in WikiEvolve, we adapt gradient reversal [1] to learn more robust, topic-independent features.

Gradient reversal is a machine learning technique which was introduced as a methodology for learning domain-independent features in the context of domain adaptation. This approach jointly optimises two classifiers which rely on a shared underlying encoder model: (i) a label predictor for the main task, which predicts class labels and is used during both training and test time, and (ii) a domain classifier, which predicts either the source or the target domain during training as the auxiliary task. The parameters of the encoder model are optimised to minimise the loss of the main task classifier while maximising the loss of the domain classifier. This is achieved through a gradient reversal layer, which leaves the input unchanged during forward propagation and reverses the gradient by multiplying it by a negative scalar during the backpropagation.
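To make the forward and backward behaviour concrete, here is a hand-derived scalar sketch in plain Python. It is not the project's implementation; in practice the layer would be written as a custom operation in an automatic differentiation framework. The one-parameter encoder (h = w * x), the one-parameter domain classifier (v * h), the squared-error loss and all variable names are assumptions chosen to keep the gradients easy to follow.

```python
def grl_forward(h):
    """Forward pass: the gradient reversal layer is the identity."""
    return h


def grl_backward(grad_output, lam):
    """Backward pass: flip the sign of the gradient and scale it by lam."""
    return -lam * grad_output


def domain_loss(w, v, x, d):
    """Squared error of a toy domain classifier v * h on encoder output h = w * x."""
    h = grl_forward(w * x)
    return 0.5 * (v * h - d) ** 2


def encoder_grad_through_grl(w, v, x, d, lam):
    """Gradient of the domain loss w.r.t. the encoder weight w, as seen through the GRL."""
    h = w * x
    err = v * grl_forward(h) - d       # dL/d(prediction)
    grad_h = err * v                   # dL/dh at the classifier input
    grad_h = grl_backward(grad_h, lam) # reversed before reaching the encoder
    return grad_h * x                  # chain rule: dh/dw = x


w, v, x, d, lam, lr = 0.5, 1.0, 2.0, 0.0, 1.0, 0.1
before = domain_loss(w, v, x, d)
w_new = w - lr * encoder_grad_through_grl(w, v, x, d, lam)  # ordinary descent step
after = domain_loss(w_new, v, x, d)
```

Because the gradient reaching the encoder is reversed, an ordinary gradient descent step on w increases the domain loss (here `after` is larger than `before`), while the domain classifier's own parameter v would still be updated with the unreversed gradient.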

This approach is motivated by theory on domain adaptation, which suggests that a good representation for cross-domain transfer is one for which an algorithm cannot learn to identify the domain of origin of the input observation [2].

Our adaptation of this framework differs from that of Ganin and Lempitsky in that it considers two text inputs, x and x', concurrently, as opposed to one. We further define the auxiliary task as classifying whether two samples originated from the same article. The features we learn are therefore more likely to be informative of the tone, while avoiding learning content-related biases.
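The input to such a two-branch setup can be pictured as pairs of texts with a main-task label and an auxiliary same-article label. The sketch below is only one plausible way to build these pairs; the pairing strategy, the data layout and the function name make_pairs are assumptions, since the text does not specify how pairs are drawn.

```python
import random


def make_pairs(sample_sets, seed=0):
    """Build (x, x_prime, tone_label, same_article) tuples.

    sample_sets: dict mapping a (hypothetical) article id to a list of
    (text, tone_label) entries, with tone_label 1 for the tagged revision
    and 0 for the sampled negatives.
    """
    rng = random.Random(seed)
    article_ids = sorted(sample_sets)
    pairs = []
    for article_id in article_ids:
        other_ids = [a for a in article_ids if a != article_id]
        for text, tone_label in sample_sets[article_id]:
            # Partner drawn from the same article: auxiliary label 1.
            siblings = [t for t, _ in sample_sets[article_id] if t != text]
            if siblings:
                pairs.append((text, rng.choice(siblings), tone_label, 1))
            # Partner drawn from a different article: auxiliary label 0.
            if other_ids:
                other_text, _ = rng.choice(sample_sets[rng.choice(other_ids)])
                pairs.append((text, other_text, tone_label, 0))
    return pairs


sample_sets = {
    "A": [("a_pos", 1), ("a_neg", 0)],
    "B": [("b_pos", 1), ("b_neg", 0)],
}
pairs = make_pairs(sample_sets)
```

The main classifier would be trained on tone_label, while the auxiliary classifier behind the gradient reversal layer would be trained on same_article.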



The feature encoder models are responsible for producing an embedding of an article to be used in both the main and the auxiliary task. Training with a gradient reversal layer (GRL) should improve the features learnt by these models. We therefore compare GRL training against standard classifier training (in which the auxiliary branch is removed and one sample is processed at a time) for four different feature extractor models.

The models we evaluate are:

  • Bag-of-words (BoW + MLP): a bag-of-words representation of an article is propagated through a multilayer perceptron (MLP) to obtain an embedding.
  • Averaged embeddings (AvgEmb + MLP): GloVe embeddings [3] for every word in the article are averaged, followed by an MLP model.
  • Hierarchical Attention Network (HAN) [4]: word embeddings are processed using an LSTM layer followed by an attention mechanism to build up sentence embeddings. Sentence embeddings are similarly combined to form an article embedding.
  • Longformer [5]: a transformer-based model, adapted for long-form documents. We finetune a pretrained model.


Model             PR-AUC  Accuracy
BoW + MLP         0.6019  0.5913
BoW + GRL         0.6409  0.6102
AvgEmb + MLP      0.6129  0.5848
AvgEmb + GRL      0.6415  0.6084
HAN               0.6271  0.5968
HAN + GRL         0.6459  0.6102
Longformer        0.6798  0.6392
Longformer + GRL  0.6821  0.6432

The results from our evaluation on the FullTest test set are shown in the table above. We observe that models trained with gradient reversal consistently outperform models trained without it, on both the accuracy and PR-AUC metrics. All improvements, except for the Longformer, are statistically significant at the α = 0.05 level, using a permutation test to compare PR-AUC values. Larger gains are observed for the BoW + MLP and AvgEmb + MLP models than for the HAN and Longformer models. A possible explanation is that these simpler models rely only on word-level information, and are thus more susceptible to the topical biases which the GRL opposes.
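For readers unfamiliar with this kind of significance test, the following is a minimal sketch of a paired permutation test on PR-AUC. The exact variant used in the evaluation is not stated, so the details here are assumptions: average precision stands in for PR-AUC, the two models' scores are randomly swapped per example, and the p-value is two-sided with add-one smoothing.

```python
import random


def average_precision(labels, scores):
    """Average precision, a common estimate of the area under the PR curve."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall point
    return total / hits if hits else 0.0


def paired_permutation_test(labels, scores_a, scores_b, n_perm=1000, seed=0):
    """Two-sided p-value for the difference in average precision between two models."""
    rng = random.Random(seed)
    observed = abs(average_precision(labels, scores_a)
                   - average_precision(labels, scores_b))
    at_least_as_extreme = 0
    for _ in range(n_perm):
        perm_a, perm_b = [], []
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the two models are exchangeable,
            # so each example's scores are swapped with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            perm_a.append(a)
            perm_b.append(b)
        diff = abs(average_precision(labels, perm_a)
                   - average_precision(labels, perm_b))
        if diff >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (n_perm + 1)
```

An improvement would be called significant at the α = 0.05 level when the returned p-value is below 0.05.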

These results support the motivation behind our data collection method and training framework. By incorporating our knowledge of how samples are related into both the dataset and the training procedure, models are exposed to different versions of the same content (with and without promotional tone) and can therefore learn features that are more effective for detecting promotional tone than models that ignore this information.

Further Work

Promotional tone for more languages

While English Wikipedia is a good context for obtaining data to train ML algorithms, Wikipedia editions with fewer resources (editors, articles, usage of templates) should also benefit from this work. As a next step, we want to extend our ML model to work in more languages.

Graph representation of collaboration patterns

We would like to test more sophisticated models of collaboration that can account for features of the editor network, i.e. graph neural networks. The computing requirements for these models are substantial; so far we have only been able to test models on the different templates separately, and not on the full dataset. Having access to Wikimedia infrastructure would make this possible. Furthermore, we would like to see if these features generalise to the task of predicting escalation as in De Kock and Vlachos (2021).


  1. Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189. PMLR, 2015.
  2. S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1):151–175, 2010.
  3. J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
  4. Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, 2016.
  5. I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.