User:Isaac (WMF)/Content tagging/Data gaps

As Wikimedia content grows and investments are made to diversify the types of contributions that folks can make to the projects, machine learning (ML) models are playing an important role in enabling this growth by scaling the communities' tools for tracking, maintaining, and expanding this content. The quality of these models (and therefore the distribution of their benefits) depends directly on the quality of data that they are trained on -- i.e. without high-quality, representative data about existing community workflows, it is very difficult to design models to support these processes.[1]

Take the example of patrolling for vandalism: smaller Wikipedia language communities generally can effectively patrol edits to identify vandalism through basic tools like RecentChanges and diffs.[2] Larger language communities often require more extensive tooling to help identify vandalism without overwhelming patrollers. These communities tend to rely more on tooling like filters on RecentChanges, anti-vandalism bots, or AbuseFilter to help identify potentially problematic edits. Building these tools and tracking their performance, however, requires collecting high-quality data at scale about which edits have been reverted for vandalism. This often leads to a situation where certain well-resourced communities have the data (and developer expertise) to build these models while others are left out.

This data gap is a major barrier to building equitable ML models to support Wikimedians for several reasons:

  • Equity: when consistently-structured data does not exist across language editions, researchers generally must resort to writing custom logic to extract the necessary data. Often this is limited to just a few language editions by the researchers' language skills, data availability, or time. Even if language-agnostic methods or transfer learning allow the model to be extended to other languages, its performance will likely suffer in those languages because content and processes vary greatly between language editions.
  • Bias: similar to the coverage equity issues, custom logic often relies on heuristics that can introduce bias into the data. For example, a lack of high-quality data about vandalism means that a common strategy is using heuristics to extract reverts that were done by trusted editors.[3] While this data might be quite high-precision, it generally suffers in recall and will reflect the biases of those editors that meet the heuristic criteria.
  • Maintainability: custom data extraction logic is often quite fragile and small changes in norms can easily break the pipelines. As a result, it is not uncommon to extract an initial dataset and continue to use this for modeling far into the future, even as that data grows stale.[4]

There are generally no easy solutions to these challenges. These data gaps can often be addressed, however, with careful software/tool design that supports editors while also considering data as a core output.

Article quality: a positive example

Assessing article quality -- e.g., stub to featured articles -- is a task in which Wikipedians invest great effort, as it helps them prioritize work and evaluate their progress on closing knowledge gaps. Quality assessment is generally conducted by Wikipedians under the auspices of WikiProjects. A common workflow is that an article is tagged as relevant to a given WikiProject -- i.e. topic area -- and its quality is then assessed periodically by a member of that WikiProject and tracked within categories, tables, or by bots. Researchers have developed many models over the years to automatically assess article quality to support these efforts and their research (WikiProject Medicine example).

Much of this model development initially focused on English Wikipedia and required processing the wikitext dumps (large/slow), using specialized libraries to extract WikiProject quality assessment templates from talk pages, mapping those assessments to their associated WikiProject, and mapping the talk pages to their associated article (example). Extending this approach to other language editions required extensive knowledge of their approaches to assessing quality and custom regexes for extracting the data.
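To illustrate why this kind of extraction is fragile, here is a minimal sketch of a regex-based extractor for English-Wikipedia-style assessment banners. The template shape (`{{WikiProject <name>|class=...}}`), the patterns, and the function name are assumptions made for illustration; real banners have many more variants, and other language editions use different template names and parameters.

```python
import re

# Hypothetical patterns for banners such as
# {{WikiProject Medicine|class=B|importance=high}}.
# They only cover this one shape and would need rewriting per wiki.
BANNER_RE = re.compile(
    r"\{\{\s*WikiProject ([^|}]+)((?:\|[^{}]*)?)\}\}", re.IGNORECASE
)
CLASS_RE = re.compile(r"\|\s*class\s*=\s*([^|}\s]+)", re.IGNORECASE)


def extract_assessments(talk_wikitext):
    """Return (project, quality_class) pairs found in talk-page wikitext.

    quality_class is None when a banner carries no class parameter.
    """
    results = []
    for match in BANNER_RE.finditer(talk_wikitext):
        project = match.group(1).strip()
        class_match = CLASS_RE.search(match.group(2))
        quality = class_match.group(1) if class_match else None
        results.append((project, quality))
    return results


# Made-up talk-page content for demonstration.
sample = (
    "{{WikiProject Medicine|class=B|importance=high}}\n"
    "{{WikiProject Biology|importance=low}}"
)
print(extract_assessments(sample))
```

Even this small example has to decide how to handle a banner without a class parameter, which hints at how much local knowledge a production pipeline needs.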

In 2016, the Community Tech team created a MediaWiki extension called PageAssessments. This extension sought to support the existing editor workflows for evaluating article quality among other things. Thankfully the team also designed the extension to standardize the data model for quality assessments and make the data available via database tables and APIs where it can easily be accessed and processed. Arabic, English, French, Hungarian, and Turkish Wikipedias use the extension. While ideally adoption would be far higher, this has greatly improved the accessibility of ground-truth data for these languages and underlies the language-agnostic quality model. Any wiki that adopted the extension would require relatively little work to incorporate into the model as it is now just a very simple SQL query to extract the data -- e.g., example of data from Climate Change English WikiProject.
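As a rough sketch of what such a query might look like, the example below recreates a simplified version of the PageAssessments tables in an in-memory SQLite database and joins them. The table and column names follow the extension's schema as best recalled and should be checked against its documentation; the rows and the helper function are made up for illustration.

```python
import sqlite3

# Assumed, simplified schema modeled on the PageAssessments tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE page_assessments_projects (
        pap_project_id INTEGER PRIMARY KEY,
        pap_project_title TEXT
    );
    CREATE TABLE page_assessments (
        pa_page_id INTEGER,
        pa_project_id INTEGER,
        pa_class TEXT,
        pa_importance TEXT
    );
    """
)

# Made-up rows standing in for real assessment data.
conn.executemany(
    "INSERT INTO page_assessments_projects VALUES (?, ?)",
    [(1, "Climate change"), (2, "Medicine")],
)
conn.executemany(
    "INSERT INTO page_assessments VALUES (?, ?, ?, ?)",
    [(101, 1, "FA", "Top"), (102, 1, "Stub", "Low"), (103, 2, "B", "Mid")],
)

QUERY = """
SELECT pa_page_id, pa_class, pa_importance
FROM page_assessments
JOIN page_assessments_projects ON pa_project_id = pap_project_id
WHERE pap_project_title = ?
ORDER BY pa_page_id
"""


def assessments_for_project(connection, project_title):
    """Return (page_id, class, importance) rows for one WikiProject."""
    return connection.execute(QUERY, (project_title,)).fetchall()


rows = assessments_for_project(conn, "Climate change")
print(rows)
```

The same query would work unchanged on any wiki using the extension, which is the property that makes the data easy to fold into a language-agnostic model.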

Major data gaps

Below are a few types of models that would be very valuable for the Wikipedia community but are greatly hampered by the lack of data.

Vandalism detection

Only a few language editions of Wikipedia have automated tools for detecting vandalism to support patrollers (Patrolling on Wikipedia) -- e.g., ClueBot NG on English Wikipedia or ORES editquality models in several more languages. These tools are highly impactful[5] and patrollers on wikis without them either collectively seek to review every edit or resort to basic heuristics for filtering out likely-non-vandalism (for smaller wikis, this is often feasible).

A natural assumption would be that good data on vandalism already exists: any edit that was reverted. Unfortunately, reverts happen for a variety of reasons: to undo vandalism, but also as part of edit warring, as a form of vandalism themselves, to remove accidental edits, or out of unfounded suspicion about an edit. Some of these cases can be filtered out with additional heuristics such as ignoring self-reverts or limiting reverts to just those made by "trusted" editors. The resulting data is imperfect but has the benefit of still being mostly language-agnostic and thus relatively easy to gather for any language edition of Wikipedia. Past research has also used the presence of various phrases in edit summaries or the usage of certain tools, though that limits the coverage of the data to specific languages.
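The two heuristics mentioned above (ignoring self-reverts and keeping only reverts by "trusted" editors) can be sketched as follows. The group list and the 1000-edit threshold are taken from the Arabic Wikipedia example in footnote 3; the `Revert` record and its field names are hypothetical and do not correspond to any particular dataset or API.

```python
from dataclasses import dataclass

# Groups and threshold from the Arabic Wikipedia example (footnote 3).
TRUSTED_GROUPS = frozenset({
    "sysop", "oversight", "editor", "bot",
    "rollbacker", "checkuser", "abusefilter", "bureaucrat",
})
MIN_EDITS = 1000


@dataclass(frozen=True)
class Revert:
    """Hypothetical record describing one revert."""
    reverting_user: str
    reverted_user: str
    reverter_groups: frozenset
    reverter_edit_count: int


def is_likely_vandalism_revert(revert):
    """Apply the self-revert and trusted-editor heuristics."""
    if revert.reverting_user == revert.reverted_user:
        return False  # self-reverts are usually corrections, not patrolling
    in_trusted_group = bool(TRUSTED_GROUPS & revert.reverter_groups)
    return in_trusted_group and revert.reverter_edit_count >= MIN_EDITS


# Made-up reverts for demonstration.
reverts = [
    Revert("Patroller", "Anon1", frozenset({"rollbacker"}), 5200),
    Revert("Newbie", "Newbie", frozenset(), 12),
    Revert("Newbie", "Anon2", frozenset(), 12),
    Revert("Admin", "Admin", frozenset({"sysop"}), 90000),
]
kept = [r for r in reverts if is_likely_vandalism_revert(r)]
print([r.reverted_user for r in kept])
```

Note that the third revert is discarded even though it may well have undone vandalism, which is the recall cost of this kind of filter.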

The main drawback of these heuristic-based approaches is that they reduce the complexity of vandalism to a binary. While vandalism often elicits thoughts of edits that insert blatantly false information or use insulting language, it is far more wide-ranging. The data for training models would ideally distinguish between these different types of vandalism because no one model will effectively detect them all, and it's important to have transparency around what a given model is actually capable of so that editors may use it appropriately. WikiLoop DoubleCheck is a specific example of a tool designed to reduce the need for heuristics, though it does not support distinguishing between different types of vandalism.

Spam-bot detection

One form of abusive editing on Wikipedia is spam bots -- i.e. editor accounts that are created for the sole purpose of, e.g., inserting problematic links to external websites. Existing approaches use regexes to extract spam accounts from block logs (example) but have a hard time comprehensively identifying specific spam edits. Ideally editors would have consistent means of identifying spam edits as well as a structured way of identifying the reason behind blocks so that spam edits and accounts could be more easily identified. This example also pertains to sockpuppet detection or other accounts that are banned for other forms of abuse. There is now a task to address this issue.
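A minimal sketch of the regex approach is shown below. The patterns, the sample block comments, and the function name are all hypothetical; they assume English-language block reasons and would miss spam blocks recorded with other wording or in other languages.

```python
import re

# Hypothetical English-only patterns for spam-related block reasons.
SPAM_PATTERNS = [
    re.compile(r"\bspam(?:bot|mer|ming)?\b", re.IGNORECASE),
    re.compile(r"\badvertising\b", re.IGNORECASE),
]


def is_spam_block(comment):
    """Return True if a block-log comment matches any spam pattern."""
    return any(pattern.search(comment) for pattern in SPAM_PATTERNS)


# Made-up (account, block comment) pairs for demonstration.
block_log = [
    ("ExampleUser1", "Spambot"),
    ("ExampleUser2", "Abusing multiple accounts"),
    ("ExampleUser3", "Advertising or promotion only"),
    ("ExampleUser4", "Vandalism-only account"),
]
spam_accounts = [user for user, comment in block_log if is_spam_block(comment)]
print(spam_accounts)
```

This identifies accounts, not the individual spam edits they made, which is the gap the paragraph above describes.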

Content integrity issues

There are many types of content integrity issues on Wikipedia such as citation needed or NPOV issues. Many of these issues are tagged via templates, but tagging is very likely much sparser than it could be and varies by language. It can also be unclear what a template pertains to, as the template is often added long after the issue was created and might be left on the page even after the issue has been resolved. Building datasets of content reliability issues can therefore be quite tricky.[6] Ideally, templates (or a similar system) would be designed that are more clearly connected to the content (like the citation-needed template), have clearer feedback loops to actually fixing issues to encourage usage by editors, are potentially less disruptive to the reading experience, and are more consistent and structured across languages. These would cover a wide range of content quality issues for which models could be built to assist editors in identifying and fixing them.


There are certain content quality issues that are generally easy to fix and so are much more likely to be fixed directly by editors than have a template placed indicating the issue. Copy-editing (read more) is a prime example of this -- i.e. fixing small grammatical or spelling errors on Wikipedia. Detecting content that needs copy-editing is a great task for new editors as it often can leverage their pre-existing language skills while slowly introducing them to the general concept and workflow of editing. Building datasets of copyedits to train models across many language communities to detect these issues likely would require more structured edit summaries for editors to indicate when their edit is primarily a copyedit (as opposed to changes to the semantics of the content itself). This would have further benefits for building datasets to identify other common edit intentions such as adding facts or refactoring existing content.

Reliable sources

Verifiability is a core content policy on Wikipedia that requires that reliable sources back up content. Despite the incredibly important role that references play in ensuring content integrity,[7] references on the Wikimedia projects are generated through a complex mixture of <ref> tags and templates that makes it difficult to analyze or track them comprehensively. This makes it very complicated to develop tools for tracking the spread of sources through language editions, validating the quality and appropriateness of sources, and analyzing patterns in sources such as content language,[8] geoprovenance, or whether they are open-access.[9] As a result, most research has focused on specific subsets of references -- e.g., scholarly sources -- and very few tools exist to assist in tracking sources on the Wikimedia projects.

Common causes and potential fixes

Looking at these examples, there are some similarities in the root causes of these data gaps: use of templates to track issues and unstructured edit/log comments to record the reasons for a particular action on wiki.


When editor communities have a need to track metadata about content for which there is no structured mechanism, they often turn to templates as a flexible means of achieving these goals. These cleanup templates are quite common for many content integrity and non-encyclopedic language issues. They are also used for tracking article quality: individual WikiProjects created their own template that they could add to an article talk page, which included a parameter for indicating article quality. This template would then add the appropriate category to the article indicating the topic and assessed quality of the article -- e.g., FA-class medicine articles. The PageAssessments extension was created (see above) to help standardize this process. While these systems work well for editors, they are very difficult to effectively parse and extract data from across many languages given their semi-structured nature and the local knowledge needed to interpret their parameters. They also see varying uptake by editors due to a mixture of factors likely including the slow feedback cycle for fixing the issues, lack of awareness about the specific templates, and their highly visible nature.

Improving our ability to gather high-quality data from these templates is a major project that would require improvements at many levels. While the PageAssessments example is one potential path to fixing their shortcomings and might be a good approach for something as important as references -- developing an extension that puts structure around the existing processes -- it is also a resource-intensive solution that is very custom to specific issues. A more general solution likely has a few components:

  • Standardization: for many years, editors have requested global templates as a means of more easily sharing templates across language editions. This would have clear benefits for smaller language communities without the time or technical resources to develop their own templates and could also greatly simplify the extraction of structured data from these templates.
  • Feedback: tools like Newcomer Tasks and Citation Hunt have made important strides towards ensuring that editors can easily discover articles that have been tagged with cleanup templates. Building more tools and increasing the numbers of editors using these tools can be a strong motivating factor for applying these templates instead of their tags being relegated to another long backlog.
  • Design: while oftentimes it is desirable for readers to see these clean-up tags, they can also clutter an article and prioritizing bigger issues can mean more easily-fixable issues might be left untagged. With more editors responding to cleanup tags, communities might want to consider less-visible tags as a means to expand tagging without drastically impacting the display of content.

Log comments

Wikimedia projects contain extensive logs of administrative actions taken by editors, but these logs often combine many distinct actions in one place and rely on edit comments to distinguish between them. For example, reverting vandalism shows up in the edit history but is mixed with many other types of reverts (as described above). Block logs combine editors blocked for spam, sockpuppetry, vandalism, violating username policies, and many other reasons. Parsing out which blocks are for which reasons requires knowledge of the specific norms in that language community and does not scale well across many languages.

Improving our ability to gather high-quality data from these logs would require altering the interfaces used to carry out these actions to provide a more structured set of reasons that are connected across languages so researchers could easily build language-agnostic datasets of vandalism or different types of abusive accounts. This is already implemented in certain tools such as Twinkle but usage is not universal across wikis.

Wikitext vs. HTML

Wikipedia articles are written in wikitext. This is considered the source of truth for what content is in a Wikipedia -- i.e. what is stored in the databases -- and it is parsed into the HTML that is presented to readers (and edited in VisualEditor). Wikitext does not have strict specifications (background), however, and templates in particular give editors a lot of flexibility as to how to achieve a given effect in the version of the content presented to readers. As a result, there are often many ways to achieve similar aims. For example, the norm may be to add a reference via <ref> tags, but editors can also use a variety of footnote templates, import references from Wikidata via templates, and generate references from other templates. An analysis of wikitext against the final parsed HTML found that, on average, only about 90% of references were generated via <ref> tags and that this varied greatly by language edition and article. Wikidata transclusion is another classic case of the gap between wikitext and HTML.
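The gap can be illustrated with a small sketch that counts `<ref>` tags in wikitext and reference markers in the corresponding parsed HTML. The sample wikitext and HTML are made up, and the assumption that rendered references carry a `typeof` attribute containing `mw:Extension/ref` reflects Parsoid-style output as best understood here; it should be verified against real HTML before being relied on.

```python
import re
from html.parser import HTMLParser

# Matches opening <ref> tags (with or without attributes, or self-closing)
# but not </ref> or <references />.
REF_TAG_RE = re.compile(r"<ref[\s>/]", re.IGNORECASE)


def count_wikitext_ref_tags(wikitext):
    """Count opening <ref> tags written directly in the wikitext."""
    return len(REF_TAG_RE.findall(wikitext))


class RefCounter(HTMLParser):
    """Count elements whose typeof includes the assumed reference marker."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        typeof = dict(attrs).get("typeof") or ""
        if "mw:Extension/ref" in typeof.split():
            self.count += 1


def count_html_refs(html):
    """Count reference markers in parsed HTML."""
    parser = RefCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Made-up article: one <ref> tag plus one footnote template.
wikitext = (
    "Fact one.<ref>Source A</ref> Fact two.{{sfn|Smith|2020}}\n<references />"
)
html = (
    '<p>Fact one.<sup typeof="mw:Extension/ref">[1]</sup> Fact two.'
    '<sup typeof="mw:Transclusion mw:Extension/ref">[2]</sup></p>'
    '<div typeof="mw:Extension/references"></div>'
)

wikitext_refs = count_wikitext_ref_tags(wikitext)
html_refs = count_html_refs(html)
print(wikitext_refs, html_refs)
```

In this toy case a wikitext-only analysis would see half of the references that a reader sees, because the second one is generated by a template.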

We continue to rely on the wikitext for most modeling due to its availability (e.g., dumps, data lake) and libraries like mwparserfromhell that were developed to help researchers and developers easily manipulate and extract information from wikitext. But while these resources are impressive, they are constrained by the inherent limitations of wikitext itself. And this general gap between wikitext and the parsed HTML also seems to be expanding.[10]

Instead, we need to start thinking of the HTML as the source-of-truth for modeling. It may not be what is found in our databases, but it is far more structured so that features and clean plaintext can be extracted far more uniformly. And increasingly, with tools like VisualEditor, it is what editors are editing. We are thankfully making progress towards this state -- e.g., Enterprise HTML dumps, libraries to manipulate it like mwparserfromhtml. However, we still need to make these resources more accessible and ensure we have the resources for enabling historical analyses without overloading the Parsoid APIs.

See also


  1. For more information and background, see this list of example models, the mission of the WikiLoop project, and an extensive discussion of this challenge in this paper: Asthana, Sumit; Thommel, Sabrina Tobar; Halfaker, Aaron Lee; Banovic, Nikola (2021-10-13). "Automatically Labeling Low Quality Content on Wikipedia by Leveraging Patterns in Editing Behaviors". Proceedings of the ACM on Human-Computer Interaction 5 (CSCW2): 1–23. ISSN 2573-0142. doi:10.1145/3479503. 
  2. See this report on content moderation on medium-sized wikis for more details.
  3. Example for Arabic Wikipedia that gathers reverts done by users who are in several different groups (sysop, oversight, editor, bot, rollbacker, checkuser, abusefilter, bureaucrat) and have at least 1000 edits: editquality Makefile
  4. Loureiro, Daniel; Barbieri, Francesco; Neves, Leonardo; Anke, Luis Espinosa; Camacho-Collados, Jose (2022-04-01). "TimeLMs: Diachronic Language Models from Twitter". arXiv:2202.03829 [cs]. 
  5. Geiger, R. Stuart; Halfaker, Aaron (2013-08-05). "When the levee breaks: without bots, what happens to Wikipedia's quality control processes?". Proceedings of the 9th International Symposium on Open Collaboration. WikiSym '13 (New York, NY, USA: Association for Computing Machinery): 1–6. ISBN 978-1-4503-1852-5. doi:10.1145/2491055.2491061. 
  6. Check out these two examples (learning from dispute templates and large scale dataset on content reliability) or reach out to Diego (WMF) for more information.
  7. For an excellent example of the importance of sources in fighting disinformation, check out: Cohen, Noam (7 September 2021). "One Woman’s Mission to Rewrite Nazi History on Wikipedia". Wired. Retrieved 13 December 2022. 
  8. Sen, Shilad W.; Ford, Heather; Musicant, David R.; Graham, Mark; Keyes, Os; Hecht, Brent (2015-04-18). "Barriers to the Localness of Volunteered Geographic Information". Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. CHI '15 (New York, NY, USA: Association for Computing Machinery): 197–206. ISBN 978-1-4503-3145-6. doi:10.1145/2702123.2702170. 
  9. "How many Wikipedia references are available to read? We measured the proportion of open access sources across languages and topics.". Wikimedia Foundation. 2018-08-20. Retrieved 2022-12-13.
  10. Mitrevski, Blagoj; Piccardi, Tiziano; West, Robert (2020-05-26). "WikiHist.html: English Wikipedia's Full Revision History in HTML Format". Proceedings of the International AAAI Conference on Web and Social Media 14: 878–884. ISSN 2334-0770. doi:10.1609/icwsm.v14i1.7353.