User:Isaac (WMF)/Content tagging
This page is a brainstorming space for identifying potential "content-tagging models" that could be worked on by the Wikimedia Research team, assessing their feasibility / value, and prioritizing which ones to prototype.
Content Tagging Models
- Content refers to Wikipedia articles, Wikidata items, Commons images, etc. It can also include smaller units -- e.g., sections, paragraphs, sentences, words. Importantly, the smaller the unit of content, the harder the modeling task generally is. Edits -- i.e. changes to this content -- are also included below but generally require substantively different approaches than modeling content itself.
- Tagging refers to applying some (isolated) label to the content -- e.g., a topic. Generally it would indicate a classification model and seek to extend annotations already maintained by the Wikimedia community -- e.g., via talk pages, metadata templates, maintenance templates, Wikidata properties, categories, or edit tags. The domain of tags can range from very small -- e.g., binary measure of citation needed or not -- to very large -- e.g., any existing Wikipedia category -- but generally is some fixed, finite set of tags (often referred to as a taxonomy). The goal is generally to bring greater structure to the richness of Wikimedia content such that it is easier to describe (analytics), filter (interfaces), and automate decisions based on content (recommender systems).
- Model indicates that some sort of algorithm will be used to predict the appropriate tags for a given piece of content. This can range from a simple set of heuristics to more complex machine-learning models. Almost always, the resulting model will be designed to complement existing editor processes. Note that in the table below, the Model column gives a name for the task, while the Approach column covers the algorithm.
- Groundtruth refers to where we can find existing tags for training and validating a model. The groundtruth must consist of individual units -- e.g., sentences -- and associated tags -- e.g., citation is needed. Good groundtruth data can be particularly hard to construct for tasks with binary labels due to missing data -- i.e. the absence of a tag generally does not mean that the tag is unnecessary. For example, a sentence with neither a citation nor a citation-needed template may still require a citation.
- Approach refers to what type of model or data could be used to generate high-quality predictions. Knowing the data that will be needed is important for evaluating the complexity of the model and its coverage. For example, approaches that can just use an article's wikilinks or structural features like page length are relatively easy to extend across languages. Approaches that require parsing text are more complex and less likely to scale well across languages (in particular to low-resource languages). In the table below, approaches are marked as language-agnostic (LA), language-dependent (LD), or language-specific (LS).
- Status is my evaluation of where a given project stands. This can range from projects that are just ideas to varying levels of prototypes to varying levels of productionization.
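The missing-data problem described under Groundtruth can be sketched as follows. This is a minimal illustration, not an existing pipeline: the `Sentence` fields and both function names are hypothetical, and treating already-cited sentences as positive examples is an assumption based on the "Existing references, templates" groundtruth listed for Citation Needed. The key point is that untagged sentences are kept as "unknown" rather than being turned into negatives.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Sentence:
    """Hypothetical unit of content with the signals we can observe."""
    text: str
    has_citation: bool      # an inline reference is attached
    has_cn_template: bool   # a citation-needed template is attached


def citation_needed_label(s: Sentence) -> Optional[bool]:
    """Return True when there is positive evidence, None when unknown.

    Assumption: an existing citation or a citation-needed template both
    indicate that the sentence needs a citation. A sentence with neither
    is NOT labeled False, because the tag may simply be missing.
    """
    if s.has_citation or s.has_cn_template:
        return True
    return None


def build_groundtruth(
    sentences: List[Sentence],
) -> Tuple[List[Tuple[str, bool]], List[str]]:
    """Split sentences into labeled examples and an unlabeled pool."""
    labeled: List[Tuple[str, bool]] = []
    unlabeled: List[str] = []
    for s in sentences:
        label = citation_needed_label(s)
        if label is None:
            unlabeled.append(s.text)
        else:
            labeled.append((s.text, label))
    return labeled, unlabeled
```

Negative examples would then have to come from some other source of evidence (for instance, manual annotation), which is why binary tasks tend to need more care than their small tag set suggests.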
Model | Example Tags | Unit | Groundtruth | Approach | Status |
---|---|---|---|---|---|
Topic (high-level) | History, Physics, Environment | Article, section | WikiProjects + taxonomy | Links (LA) | Being productionized |
Topic (region) | Lithuania, Northern Europe, Europe | Article, section | Wikidata properties + taxonomy | Links (LA) | Working prototype |
Topic (person) | <gender>, <occupation> | Article | Wikidata ontology + taxonomy | Wikidata structure (LA) | Working prototype |
Article Category (fine-grained) | Any Wikipedia category | Article | Categories | Links (LA) | Not started |
Section Translation | History, Gallery, Early Life | Section | Taxonomy of common section types | Links + structural features (LA) | To be productionized |
Quality | Stub to featured article | Article | WikiProjects | Structural features (LA) | Working prototype |
Importance | Top, high, medium, low | Article | WikiProjects (importance) or pageviews (reader demand) | Structural features or pageviews | Working prototype (pageviews) |
Readability | Simple, complex | Article | ?? | Structural features of text (LA) or language-specific | Early stages |
Improvements Needed | Binary | Article, section | Other language versions, edit history | Dates / sources across languages (LA) | Not started |
NPOV Violations | Binary | Sentence | Existing templates | Text-based (LD) | Early explorations |
Citation Needed | Binary | Sentence | Existing references, templates | Text-based (LD) | Working prototype (3 languages) |
Copy-edit | Any text | Sentence | Dictionaries | Text-based (LS) | Working prototype (31 languages) |
Link Recommendation | Any Wikipedia article | Word | Existing links | Text-based (LS) | In production (many languages) |
Image Category (fine-grained) | Any Commons category / Wikidata item | Image | Existing categories / depicts statements on Commons images | Basic computer-vision (LA) | Early stages |
Image Caption | Unstructured text? Keywords to include? | Image | Existing captions for images on Commons / Wikipedia | Computer-vision (LA) + natural language generation (LS) | Early stages |
Image Recommendation | Any Commons image | Article, section | Existing images in articles, Wikidata | Structural (LA) but also testing text-based (LD) | Research; In production (LA) |
Edit Actions | Remove reference, change media | Edit diff | Edit summaries | Structural features (LA) | Working prototype |
Edit Intentions | Copyedit, wikification | Edit diff | Edit summaries | Structural features (LA) | Old prototype (English) |
Edit Quality (vandalism) | Damaging or not, good-faith/bad-faith | Edit diff | Edit summaries/tags | Text-based (LS) | In production (several languages) |
Edit Toxicity (harassment) | Toxic or not | Talk page comment | Crowdsourced labels | Text-based (LS) | Dropped due to ethical concerns but continued by others |
Prioritization
The following criteria will be used to select tagging models to prioritize (in no specific order):
- Usefulness to Product: support from Product teams improves the likelihood that the project will be a success (more feedback, clear use-case, access to more resources).
- Usefulness to Research: how many different projects on the team would benefit from the model? For example, quality modeling would support Knowledge Gaps metrics as well as many research projects on understanding reader/editor behavior and content dynamics.
- Coverage / Equity: the greater the number of languages that a model can reasonably scale to, the more widespread the benefits can be. Generally, language-agnostic approaches are favored over language-dependent ones, which in turn are favored over language-specific ones.
- Research Value: will the research help advance our techniques or contribute interesting, generalizable knowledge, or is it more applied (with some preference for the former)? For example, region classification seems to work best via label propagation, which is different from standard supervised classification models and thus a useful technique to develop.
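To make the contrast with supervised classification concrete, here is a minimal sketch of label propagation over a link graph. It is an assumption about what such an approach could look like, not the method used in the region prototype: the function name `propagate_labels`, the majority-vote rule, and the toy graph are all hypothetical. A few seed articles carry known region tags, and every other article repeatedly takes the most common tag among its already-labeled neighbors.

```python
from collections import Counter
from typing import Dict, List


def propagate_labels(
    links: Dict[str, List[str]],
    seeds: Dict[str, str],
    max_iters: int = 10,
) -> Dict[str, str]:
    """Spread seed labels through a graph by neighbor majority vote.

    links: article -> list of linked articles (hypothetical toy input).
    seeds: article -> known label; seed labels are never overwritten.
    """
    labels = dict(seeds)
    for _ in range(max_iters):
        updates = {}
        for node, neighbors in links.items():
            if node in seeds:
                continue
            votes = Counter(labels[n] for n in neighbors if n in labels)
            if not votes:
                continue  # no labeled neighbors yet
            # Highest count wins; ties are broken alphabetically so the
            # result is deterministic.
            best = min(votes, key=lambda lab: (-votes[lab], lab))
            if labels.get(node) != best:
                updates[node] = best
        if not updates:
            break  # converged
        labels.update(updates)  # synchronous update after each pass
    return labels


# Toy example: two seed articles, three untagged ones.
toy_links = {
    "Vilnius": ["Kaunas"],
    "Kaunas": ["Vilnius", "Trakai"],
    "Trakai": ["Kaunas"],
    "Paris": ["Lyon"],
    "Lyon": ["Paris"],
}
toy_seeds = {"Vilnius": "Lithuania", "Paris": "France"}
toy_result = propagate_labels(toy_links, toy_seeds)
```

In this toy graph, "Trakai" has no direct link to a seed and only receives its tag on the second pass, via "Kaunas". Unlike a supervised classifier, no feature vectors or training step are involved; the graph structure itself carries the signal.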