User:Isaac (WMF)/Content tagging

From Meta, a Wikimedia project coordination wiki

This page is a brainstorming space for identifying potential "content-tagging models" that the Wikimedia Research team could work on, assessing their feasibility / value, and prioritizing which ones to prototype.

Content Tagging Models

  • Content refers to Wikipedia articles, Wikidata items, Commons images, etc. It can also include smaller units than this -- e.g., sections, paragraphs, sentences, words. Importantly, the smaller the unit of content, the harder the modeling task generally is. Edits -- i.e. changes in this content -- are also included below but generally require substantively different approaches than modeling content itself.
  • Tagging refers to applying some (isolated) label to the content -- e.g., a topic. Generally it would indicate a classification model and seek to extend annotations already maintained by the Wikimedia community -- e.g., via talk pages, metadata templates, maintenance templates, Wikidata properties, categories, or edit tags. The domain of tags can range from very small -- e.g., binary measure of citation needed or not -- to very large -- e.g., any existing Wikipedia category -- but generally is some fixed, finite set of tags (often referred to as a taxonomy). The goal is generally to bring greater structure to the richness of Wikimedia content such that it is easier to describe (analytics), filter (interfaces), and automate decisions based on content (recommender systems).
  • Model indicates that some sort of algorithm will be used to predict the appropriate tags for a given piece of content. This can range from a simple set of heuristics to more complex machine-learning models. Almost always, the resulting model will be designed to complement existing editor processes. Note that in the table below, the Model column gives each model's name, while the Approach column describes the algorithm.
  • Groundtruth refers to where we can find existing tags for training and validating a model. The groundtruth must consist of individual units -- e.g., sentences -- and associated tags -- e.g., citation is needed. Good groundtruth data can be particularly hard to construct for tasks with binary labels due to missing data -- i.e., it is generally not safe to assume that the lack of a tag means the tag isn't necessary. For example, a sentence that has neither a citation nor a citation-needed template should not be assumed to need no citation.
  • Approach refers to what type of model or data could be used to generate high-quality predictions. Knowing the data that will be needed is important for evaluating the complexity and coverage of the model. For example, approaches that can just use an article's wikilinks or structural features like page length are relatively easy to extend across languages. Approaches that require parsing text are more complex and less likely to scale well across languages (in particular to low-resourced languages). In the table below, approaches are marked as language-agnostic (LA), language-dependent (LD), or language-specific (LS).
  • Status is my evaluation of where a given project stands. This can range from projects that are just ideas to varying levels of prototypes to varying levels of productionization.
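The groundtruth caveat above (a missing tag is not evidence of a negative label) can be made concrete with a minimal Python sketch. It assumes a hypothetical `Sentence` record and a hypothetical `from_reviewed_article` flag as one possible source of trusted negatives; neither is taken from an existing pipeline. Sentences whose status is unknown get `None` rather than `False`, so they can be excluded from training and validation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sentence:
    """Hypothetical unit of content for a citation-needed model."""
    text: str
    has_citation: bool
    has_citation_needed_template: bool
    # Assumption: some articles have been reviewed closely enough that an
    # uncited sentence in them can be trusted as "no citation needed".
    from_reviewed_article: bool = False


def citation_needed_label(s: Sentence) -> Optional[bool]:
    """Return True/False only when the evidence supports it, else None."""
    if s.has_citation or s.has_citation_needed_template:
        # An editor either supplied a citation or explicitly asked for one.
        return True
    if s.from_reviewed_article:
        # Trusted negative: reviewers left the sentence uncited.
        return False
    # No citation and no template: the label is missing, not negative.
    return None


def labeled_only(sentences: List[Sentence]) -> List[Sentence]:
    """Keep only sentences with a known label for training/validation."""
    return [s for s in sentences if citation_needed_label(s) is not None]
```

Treating the unknown case as its own value keeps the missing-data problem visible in the dataset instead of silently turning it into noisy negatives.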
| Model | Example Tags | Unit | Groundtruth | Approach | Status |
|---|---|---|---|---|---|
| Topic (high-level) | History, Physics, Environment | Article, section | WikiProjects + taxonomy | Links (LA) | Being productionized |
| Topic (region) | Lithuania, Northern Europe, Europe | Article, section | Wikidata properties + taxonomy | Links (LA) | Working prototype |
| Topic (person) | <gender>, <occupation> | Article | Wikidata ontology + taxonomy | Wikidata structure (LA) | Working prototype |
| Article Category (fine-grained) | Any Wikipedia category | Article | Categories | Links (LA) | Not started |
| Section Translation | History, Gallery, Early Life | Section | Taxonomy of common section types | Links + structural features (LA) | To be productionized |
| Quality | Stub to featured article | Article | WikiProjects | Structured features (LA) | Working prototype |
| Importance | Top, high, medium, low | Article | WikiProjects (importance) or pageviews (reader demand) | Structural features or pageviews | Working prototype (pageviews) |
| Readability | Simple, complex | Article | ?? | Structural features of text (LA) or language-specific | Early stages |
| Improvements Needed | Binary | Article, section | Other language versions, edit history | Dates / sources across languages (LA) | Not started |
| NPOV Violations | Binary | Sentence | Existing templates | Text-based (LD) | Early explorations |
| Citation Needed | Binary | Sentence | Existing references, templates | Text-based (LD) | Working prototype (3 languages) |
| Copy-edit | Any text | Sentence | Dictionaries | Text-based (LS) | Working prototype (31 languages) |
| Link Recommendation | Any Wikipedia article | Word | Existing links | Text-based (LS) | In production (many languages) |
| Image Category (fine-grained) | Any Commons category / Wikidata item | Image | Existing categories / depicts statements on Commons images | Basic computer-vision (LA) | Early stages |
| Image Caption | Unstructured text? Keywords to include? | Image | Existing captions for images on Commons / Wikipedia | Computer-vision (LA) + natural language generation (LS) | Early stages |
| Image Recommendation | Any Commons image | Article, section | Existing images in articles, Wikidata | Structural (LA) but also testing text-based (LD) | Research; In production (LA) |
| Edit Actions | Remove reference, change media | Edit diff | Edit summaries | Structural features (LA) | Working prototype |
| Edit Intentions | Copyedit, wikification | Edit diff | Edit summaries | Structural features (LA) | Old prototype (English) |
| Edit Quality (vandalism) | Damaging or not, good-faith/bad-faith | Edit diff | Edit summaries/tags | Text-based (LS) | In production (several languages) |
| Edit Toxicity (harassment) | Toxic or not | Talk page comment | Crowdsourced labels | Text-based (LS) | Dropped due to ethical concerns but continued by others |
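Several rows in the table list "Links (LA)" as the approach. One way to see why link-based features can be language-agnostic is sketched below in Python: each wikilink is mapped to the Wikidata item it points to, so the same article in two languages yields the same feature set. The `SITELINKS` dictionary is a small hand-written stand-in for a real sitelinks lookup, and its item IDs and the function name `link_features` are illustrative assumptions, not part of any production model.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

# Toy stand-in for a (wiki, page title) -> Wikidata item lookup.
# The IDs are illustrative only.
SITELINKS: Dict[Tuple[str, str], str] = {
    ("enwiki", "Lithuania"): "Q37",
    ("ltwiki", "Lietuva"): "Q37",
    ("enwiki", "Vilnius"): "Q216",
    ("ltwiki", "Vilnius"): "Q216",
}


def link_features(wiki: str, link_titles: Iterable[str]) -> FrozenSet[str]:
    """Represent an article as the set of Wikidata items it links to.

    Links without a known item are skipped, so the features depend only on
    language-independent item IDs, not on the titles themselves.
    """
    return frozenset(
        SITELINKS[(wiki, title)]
        for title in link_titles
        if (wiki, title) in SITELINKS
    )


# The same article in English and Lithuanian produces identical features.
english = link_features("enwiki", ["Lithuania", "Vilnius", "Unmapped page"])
lithuanian = link_features("ltwiki", ["Lietuva", "Vilnius"])
```

A downstream classifier trained on such item sets would not need to parse text in any particular language, which is the property the Approach column is tracking.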

Prioritization

The following criteria will be used to select tagging models to prioritize (in no specific order):

  • Usefulness to Product: support from Product teams improves the likelihood that the project will be a success (more feedback, clear use-case, access to more resources).
  • Usefulness to Research: how many different projects on the team would benefit from the model? For example, quality modeling would support Knowledge Gaps metrics as well as many research projects on understanding reader/editor behavior and content dynamics.
  • Coverage / Equity: the greater the number of languages that a model can reasonably scale to, the more widespread the benefits can be. Generally favor language-agnostic over language-dependent over language-specific approaches.
  • Research Value: will the research help advance our techniques or contribute interesting, generalizable knowledge, or is it more applied (with some preference for the former)? For example, region classification seems to work best via label propagation, which is different from standard supervised classification models and thus a useful technique to develop.
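To illustrate how label propagation differs from standard supervised classification, here is a minimal, generic Python sketch. It is not the approach actually used for region classification; the graph, the seed labels, and the function name `propagate_labels` are all assumptions made for the example. Instead of learning a function from features to labels, it spreads a few known labels along links until nothing changes.

```python
from collections import Counter
from typing import Dict, List


def propagate_labels(
    graph: Dict[str, List[str]],
    seeds: Dict[str, str],
    max_iters: int = 10,
) -> Dict[str, str]:
    """Spread seed labels over a link graph by neighbor majority vote.

    graph maps each node to the nodes it links to; seeds maps a few nodes
    to known labels, which are never overwritten.
    """
    labels = dict(seeds)
    for _ in range(max_iters):
        updates = {}
        for node, neighbors in graph.items():
            if node in seeds:
                continue
            # Count the labels of already-labeled neighbors.
            votes = Counter(labels[n] for n in neighbors if n in labels)
            if votes:
                top_label = votes.most_common(1)[0][0]
                if labels.get(node) != top_label:
                    updates[node] = top_label
        if not updates:
            break  # converged: no label changed in this pass
        labels.update(updates)
    return labels


# Toy link graph: only the two country articles start with a region label.
toy_graph = {
    "Lithuania": [],
    "France": [],
    "Vilnius": ["Lithuania"],
    "Kaunas": ["Vilnius", "Lithuania"],
    "Neris": ["Kaunas"],
    "Lyon": ["France"],
}
toy_seeds = {"Lithuania": "Lithuania", "France": "France"}
regions = propagate_labels(toy_graph, toy_seeds)
```

In this toy graph, "Neris" links to no seed directly and only receives a label on the second pass, once "Kaunas" has been labeled. Reaching articles that are several links away from any labeled example is the behavior that sets propagation apart from a per-article classifier.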