User:Isaac (WMF)/Content tagging

This page in a nutshell: This page was created in 2021-2022, and I am leaving it in that state rather than continuing to update it. The statuses are likely out of date, and some prototypes may no longer exist.

This page is a brainstorming space for identifying potential "content-tagging models" that the Wikimedia Research team could work on, assessing their feasibility / value, and prioritizing which ones to prototype.

Content Tagging Models

Content refers to Wikipedia articles, Wikidata items, Commons images, etc. It can also include smaller units than this -- e.g., sections, paragraphs, sentences, words. Importantly, the smaller the unit of content, the harder the modeling task generally is. Edits -- i.e. changes in this content -- are also included below but generally require substantively different approaches than modeling content itself.
Tagging refers to applying some (isolated) label to the content -- e.g., a topic. Generally it would indicate a classification model and seek to extend annotations already maintained by the Wikimedia community -- e.g., via talk pages, metadata templates, maintenance templates, Wikidata properties, categories, or edit tags. The domain of tags can range from very small -- e.g., binary measure of citation needed or not -- to very large -- e.g., any existing Wikipedia category -- but generally is some fixed, finite set of tags (often referred to as a taxonomy). The goal is generally to bring greater structure to the richness of Wikimedia content such that it is easier to describe (analytics), filter (interfaces), and automate decisions based on content (recommender systems).
Model indicates that some sort of algorithm will be used to predict the appropriate tags for a given piece of content. This can range from a simple set of heuristics to more complex machine-learning models. Almost always, the resulting model will be designed to complement existing editor processes. Note that in the table below, the Model column gives the name of the project, while the Approach column describes the algorithm.
Groundtruth refers to where we can find existing tags for training and validating a model. The groundtruth must consist of individual units -- e.g., sentences -- and associated tags -- e.g., citation is needed. Good groundtruth data can be particularly hard to construct for tasks with binary labels due to missing data -- i.e., it is generally not safe to assume that the absence of a tag means the tag isn't needed. For example, a sentence that has neither a citation nor a citation-needed template may still require a citation.
Approach refers to what type of model or data could be used to generate high-quality predictions. Knowing what data will be needed is important for evaluating the complexity of the model and its coverage. For example, approaches that only use an article's wikilinks or structural features like page length are relatively easy to extend across languages. Approaches that require parsing text are more complex and less likely to scale well across languages (in particular to low-resourced languages). In the table below, approaches are marked as language-agnostic (LA), language-dependent (LD), or language-specific (LS).
Status is my evaluation of where a given project stands. This can range from projects that are just ideas, to prototypes at varying levels of maturity, to varying levels of productionization.
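The missing-data point under Groundtruth can be sketched in code. This is a minimal illustration, not the team's pipeline: the `Sentence` fields and both function names are hypothetical, and treating an existing reference as evidence that a citation was needed is an assumption based on the "Existing references, templates" groundtruth listed for Citation Needed. The key idea is that untagged sentences are kept as unlabeled rather than treated as negatives.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Sentence:
    # Hypothetical fields; a real pipeline would derive them from wikitext.
    text: str
    has_citation: bool      # sentence already carries an inline reference
    has_cn_template: bool   # sentence carries a citation-needed template


def citation_label(s: Sentence) -> Optional[bool]:
    """Return True when there is evidence a citation is needed, else None."""
    if s.has_citation or s.has_cn_template:
        return True
    # Absence of a tag is NOT evidence of a negative: leave it unlabeled.
    return None


def build_groundtruth(sentences: List[Sentence]) -> Tuple[List[str], List[str]]:
    """Split sentences into known positives and unlabeled examples."""
    positives = [s.text for s in sentences if citation_label(s) is True]
    unlabeled = [s.text for s in sentences if citation_label(s) is None]
    return positives, unlabeled
```

Reliable negatives would then have to come from some other source (for example, manual annotation), since this split alone never produces them.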

Model	Example Tags	Unit	Groundtruth	Approach	Status
Topic (high-level)	History, Physics, Environment	Article, section	WikiProjects + taxonomy	Links (LA)	Being productionized
Topic (region)	Lithuania, Northern Europe, Europe	Article, section	Wikidata properties + taxonomy	Links (LA)	Working prototype
Topic (person)	<gender> , <occupation>	Article	Wikidata ontology + taxonomy	Wikidata structure (LA)	Working prototype
Article Category (fine-grained)	Any Wikipedia category	Article	Categories	Links (LA)	Not started
Section Translation	History, Gallery, Early Life	Section	Taxonomy of common section types	Links + structural features (LA)	To be productionized
Quality	Stub to featured article	Article	WikiProjects	Structural features (LA)	Working prototype
Importance	Top, high, medium, low	Article	WikiProjects (importance) or pageviews (reader demand)	Structural features or pageviews	Working prototype (pageviews)
Readability	Simple, complex	Article	??	Structural features of text (LA) or language-specific	Early stages

Improvements Needed	Binary	Article, section	Other language versions, edit history	Dates / sources across languages (LA)	Not started
NPOV Violations	Binary	Sentence	Existing templates	Text-based (LD)	Early explorations
Citation Needed	Binary	Sentence	Existing references, templates	Text-based (LD)	Working prototype (3 languages)
Copy-edit	Any text	Sentence	Dictionaries	Text-based (LS)	Working prototype (31 languages)
Link Recommendation	Any Wikipedia article	Word	Existing links	Text-based (LS)	In production (many languages)

Image Category (fine-grained)	Any Commons category / Wikidata item	Image	Existing categories / depicts statements on Commons images	Basic computer-vision (LA)	Early stages
Image Caption	Unstructured text? Keywords to include?	Image	Existing captions for images on Commons / Wikipedia	Computer-vision (LA) + natural language generation (LS)	Early stages
Image Recommendation	Any Commons image	Article, section	Existing images in articles, Wikidata	Structural (LA) but also testing text-based (LD)	Research; In production (LA)

Edit Actions	Remove reference, change media	Edit diff	Edit summaries	Structural features (LA)	Working prototype
Edit Intentions	Copyedit, wikification	Edit diff	Edit summaries	Structural features (LA)	Old prototype (English)
Edit Quality (vandalism)	Damaging or not, good-faith/bad-faith	Edit diff	Edit summaries/tags	Text-based (LS)	In production (several languages)
Edit Toxicity (harassment)	Toxic or not	Talk page comment	Crowdsourced labels	Text-based (LS)	Dropped due to ethical concerns but continued by others
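To illustrate why the "Links (LA)" approaches in the table extend easily across languages, here is a minimal sketch, assuming each wikilink can be mapped to a language-independent Wikidata item. The `SITELINKS` lookup, its placeholder item IDs, and the `link_features` function are all hypothetical; a real system would resolve links through Wikidata sitelinks rather than a hard-coded dictionary.

```python
from collections import Counter
from typing import Iterable

# Hypothetical lookup: (wiki, page title) -> item ID.
# The IDs are placeholders, not real Wikidata items.
SITELINKS = {
    ("enwiki", "Vilnius"): "QID_VILNIUS",
    ("ltwiki", "Vilnius"): "QID_VILNIUS",
    ("enwiki", "Baltic Sea"): "QID_BALTIC_SEA",
    ("ltwiki", "Baltijos jūra"): "QID_BALTIC_SEA",
}


def link_features(wiki: str, links: Iterable[str]) -> Counter:
    """Represent an article as a bag of item IDs for its outgoing links."""
    features = Counter()
    for title in links:
        item = SITELINKS.get((wiki, title))
        if item is not None:  # skip links with no known item
            features[item] += 1
    return features
```

Because the English and Lithuanian articles end up in the same feature vocabulary, a single model could in principle be applied to both without any language-specific text parsing.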

Prioritization

The following criteria will be used to select tagging models to prioritize (in no specific order):

Usefulness to Product: support from Product teams improves the likelihood that the project will be a success (more feedback, clear use-case, access to more resources).
Usefulness to Research: how many different projects on the team would benefit from the model? For example, quality modeling would support Knowledge Gaps metrics as well as many research projects on understanding reader/editor behavior and content dynamics.
Coverage / Equity: the greater the number of languages that a model can reasonably scale to, the more widespread the benefits can be. We generally favor language-agnostic over language-dependent over language-specific approaches.
Research Value: will the research help advance our techniques or contribute interesting, generalizable knowledge, or is it more applied (with some preference for the former)? For example, region classification seems to work best via label propagation, which differs from standard supervised classification models and is thus a useful technique to develop.
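For readers unfamiliar with label propagation, the following is a minimal sketch of the general idea, not the approach actually used for region classification. It assumes an undirected graph of articles (for example, built from links) and a few seed articles with known regions; the function name, the synchronous majority-vote update, and the alphabetical tie-break are all illustrative choices.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def propagate_labels(
    edges: Iterable[Tuple[str, str]],
    seeds: Dict[str, str],
    iterations: int = 10,
) -> Dict[str, str]:
    """Spread seed labels to neighbors by repeated majority vote."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    labels = dict(seeds)
    for _ in range(iterations):
        updates = {}
        for node in neighbors:
            if node in seeds:
                continue  # seeds keep their known labels
            votes = Counter(
                labels[n] for n in neighbors[node] if n in labels
            )
            if votes:
                # Most common label wins; ties broken alphabetically.
                updates[node] = min(
                    votes.items(), key=lambda kv: (-kv[1], kv[0])
                )[0]
        if all(labels.get(n) == lab for n, lab in updates.items()):
            break  # nothing changed, so the labels have converged
        labels.update(updates)
    return labels
```

Unlike a standard supervised classifier, nothing is trained here: labels flow through the graph structure, so nodes that are several hops from any seed only receive a label after several iterations.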