Language agnostic link-based article topic
This page is an on-wiki machine learning model card.

| Model Information Hub | |
|---|---|
| Model creator(s) | Isaac Johnson, Martin Gerlach, and Diego Sáez-Trumper |
| Model owner(s) | WMF Research Team |
| Model interface | Lift Wing API |
| Past performance | Previous performance data |
| Publications | Language-agnostic Topic Classification for Wikipedia |
| Code | Gitlab repository |
| Uses PII | No |
This model uses the links in a Wikipedia article to predict a set of topics that the article may belong to.
Predicting an article's general topic, and doing so consistently across many languages, is useful for analyses of Wikipedia dynamics and for filtering in edit recommendation systems. However, manually grouping Wikipedia's very diverse range of articles into coherent, consistent topics across all Wikipedia projects is difficult.
This model is a new, language-agnostic approach to predicting which topics an article might be relevant to. It uses the wikilinks in a given Wikipedia article to predict which (zero to many) of a set of 64 topics are relevant to that article. For example, Mount Everest might reasonably be associated with South Asia, East Asia, Sports, and Earth and the Environment.
The training data for this model was over 30 million Wikipedia articles spanning all languages on Wikipedia. Each article is represented as a collection of its wikilinks as represented by their respective Wikidata items. This data originates from the normal editing activities of Wikipedia and Wikidata editors, and was collected in an automated fashion.
This model is deployed on LiftWing (documentation). Right now, it can also be publicly accessed through a beta testing site. This model may be useful for high-level analyses of Wikipedia dynamics (pageviews, article quality, edit trends), filtering articles, and cross-language analytics. It should not be used for projects outside of Wikipedia, namespaces outside of 0, disambiguations, or redirects.
Motivation
A major challenge for many analyses of Wikipedia dynamics — e.g., imbalances in content quality, geographic differences in what content is popular, what types of articles attract more editor discussion — is grouping the very diverse range of Wikipedia articles into coherent, consistent topics. This problem has been addressed using various approaches based on Wikipedia’s category network, WikiProjects, and external taxonomies. However, these approaches have always been limited in their coverage: typically, only a small subset of articles can be classified, or the method cannot be applied across (the more than 300) languages on Wikipedia.
This language-agnostic approach for classifying articles into a taxonomy of topics, which is based on the links in an article, can be easily applied to (almost) any language and article on Wikipedia. It matches the performance of a language-dependent approach while being simpler and having much greater coverage.
Users and uses
Intended users:

- researchers
- bots
- editors
- user scripts and gadgets
Example uses:

- high-level analyses of Wikipedia dynamics such as pageview, article quality, or edit trends — e.g. How are pageview dynamics different between the `physics` and `biology` categories?
- filtering to relevant articles — e.g. Filter articles only to those in the `music` category.
- cross-language comparisons — e.g. How does the quality of articles in the `east-asia` category differ between French Wikipedia and Japanese Wikipedia?
The model should not be used for:

- projects outside of Wikipedia — e.g. Wiktionary, Wikinews, etc.
- namespaces outside of 0, disambiguations, and redirects — the training data for this model explicitly excludes draft pages, talk pages, disambiguations, redirects, and other non-article pages, as they have no training label that could be associated with them.
Ethical considerations, caveats, and recommendations
- The model fits articles into a taxonomy of topics initially developed as a guide for discovering English WikiProjects. While some tweaks have been made to align it better with topic classification, it likely reflects English Wikipedia's interests and distinctions. Other language editions presumably would make different distinctions.
- While 0.5 is the suggested threshold, other thresholds may be more appropriate depending on the language and topic label. Notably, the raw scores from the model measure the model's confidence that a topic is relevant, not the degree of topical relevance. Thus, a higher score does not mean a topic is more relevant, and topics with clearer relevance (e.g., geographic topics, biographies) will generally score higher than topics with more ambiguous relevance (e.g., society, education). A sketch of per-topic thresholding follows this list.
- Gaps in WikiProject coverage are known to lead to biases in recall for certain topics. For example, film labels are largely missing actors from Nollywood (Nigeria), so recall is lower for articles about Nigerian films and actors than for articles about Hollywood (US) films and actors.
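As a rough illustration of the thresholding caveat above, here is a minimal Python sketch of applying a default threshold with per-topic overrides. The override values and the choice of which topics get them are hypothetical assumptions for illustration, not tuned recommendations.

```python
# Sketch: filtering model output with per-topic thresholds.
# The threshold values below are hypothetical, not recommendations.

DEFAULT_THRESHOLD = 0.5

# Hypothetical overrides: clearer topics can take a higher bar,
# more ambiguous topics a lower one.
PER_TOPIC_THRESHOLDS = {
    "Culture.Biography.Biography*": 0.7,
    "History_and_Society.Education": 0.3,
}

def filter_topics(results):
    """Keep only topics whose score clears their (per-topic) threshold."""
    return [
        r for r in results
        if r["score"] >= PER_TOPIC_THRESHOLDS.get(r["topic"], DEFAULT_THRESHOLD)
    ]

results = [
    {"score": 0.863, "topic": "Culture.Biography.Biography*"},
    {"score": 0.275, "topic": "Culture.Biography.Women"},
]
print(filter_topics(results))  # only the first entry clears its threshold
```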
Model
Performance

As of the most recent model training in November 2022, the model had the following test statistics:
| Precision (percentage of articles classified as a topic that were actually the topic) | Recall (percentage of articles that actually were a topic that were classified as the topic) | F1 (harmonic mean of precision and recall) | Average precision (mean precision across all thresholds) |
|---|---|---|---|
| 0.881 (micro); 0.841 (macro) | 0.795 (micro); 0.676 (macro) | 0.833 (micro); 0.744 (macro) | 0.893 (micro); 0.796 (macro) |
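The table above reports both micro-averaged scores (computed over all article-topic decisions pooled together) and macro-averaged scores (computed per topic, then averaged). As a toy illustration of the difference, here is a short sketch using scikit-learn; the label matrices are made up and use 3 topics instead of 64.

```python
# Sketch: micro vs. macro averaging for multilabel predictions.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Rows = articles, columns = topics (toy data, 3 topics).
y_true = np.array([[1, 0, 1],
                   [1, 0, 0],
                   [0, 1, 1]])
y_pred = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 1, 1]])

for avg in ("micro", "macro"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0
    )
    print(f"{avg}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```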
Detailed model performance (November 2022, threshold=0.5 for all topics): per-topic table omitted.
Performance Notes
- Performance suffers for articles with few outlinks, though generally precision remains high while recall drops. Atypical linking practices, though difficult to define precisely, may also lead to unexpected results. In practice, the number of outlinks in a given Wikipedia article varies by language edition, region of the world, and article age.
- The ground-truth data for this model (WikiProject tags) is quite comprehensive but certainly missing many legitimate tags. This means reported precision is likely an underestimate (some predictions counted as false positives are in fact correct), while reported recall is likely an overestimate (true topics missing from the ground truth are never counted as misses).
- Evaluation factors: number of outlinks and topic
Implementation
Overview at: https://fasttext.cc/docs/en/supervised-tutorial.html
- Epochs: 2
- Learning rate: 0.1
- Window size: 20
- Min count (under which QID is not retained in vocab): 20
- No pre-trained embeddings used
- Embeddings dimension: 50
- Total number of classification-layer params: 3,200 (embedding dimension 50 × 64 topics)
- Vocab size: 4,535,915
- Total number of embeddings params: 226,795,750 (vocab size * embeddings dimension)
- Model size on disk: 944 MB
- Decision thresholds: 0.5 for all labels
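The hyperparameters listed above map onto fastText's supervised mode. Here is a minimal training sketch using the fastText Python bindings; the training file path and its line format are hypothetical, and the `loss="ova"` (one-vs-all) setting is an assumption made so that each of the 64 topics gets an independent score that can be thresholded at 0.5.

```python
# Sketch: training with the hyperparameters listed above (fastText Python
# bindings, https://fasttext.cc). File path and line format are hypothetical:
# each line holds an article's outlink QIDs plus __label__-prefixed topics,
# e.g. "Q192110 Q96 Q2317008 __label__Culture.Biography.Biography*"
import fasttext

model = fasttext.train_supervised(
    input="train.txt",  # hypothetical path
    epoch=2,            # Epochs: 2
    lr=0.1,             # Learning rate: 0.1
    ws=20,              # Window size: 20
    minCount=20,        # Min count under which a QID is dropped: 20
    dim=50,             # Embeddings dimension: 50
    loss="ova",         # assumption: one-vs-all loss for multilabel scores
)
model.save_model("outlink_topic_model.bin")

# Predict topics for a new article's outlink QIDs, keeping scores >= 0.5.
labels, scores = model.predict("Q192110 Q96 Q2317008", k=-1, threshold=0.5)
```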
The response follows this schema:

```
{
  article: <url string>,
  results: [
    {score: <score (0-1)>, topic: <string>},
    ... (up to 64 topics)
    {score: <score (0-1)>, topic: <string>}
  ]
}
```
Input

```
$ curl https://api.wikimedia.org/service/lw/inference/v1/models/outlink-topic-model:predict -X POST -d '{"page_title": "Frida_Kahlo", "lang": "en", "threshold": 0.1}' -H "Content-type: application/json"
```
Output

```
{
  "prediction": {
    "article": "https://en.wikipedia.org/wiki/Frida_Kahlo",
    "results": [
      {"score": 0.863, "topic": "Culture.Biography.Biography*"},
      {"score": 0.516, "topic": "Geography.Regions.Americas.North_America"},
      {"score": 0.477, "topic": "Culture.Visual_arts.Visual_arts*"},
      {"score": 0.275, "topic": "Culture.Biography.Women"}
    ]
  }
}
```
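For convenience, here is the same request issued from Python with the `requests` library; the variable names are illustrative, and the endpoint, payload, and response shape are exactly those of the curl example above.

```python
# Sketch: the curl example above, as a Python requests call.
import requests

resp = requests.post(
    "https://api.wikimedia.org/service/lw/inference/v1/models/outlink-topic-model:predict",
    json={"page_title": "Frida_Kahlo", "lang": "en", "threshold": 0.1},
)
prediction = resp.json()["prediction"]
for result in prediction["results"]:
    print(f'{result["topic"]}: {result["score"]:.3f}')
```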
Data
The training data for this model was over 30 million Wikipedia articles spanning all languages on Wikipedia. Each article was represented as the list of Wikidata items associated with its outlinks. This data originated from the normal editing activities of Wikipedia and Wikidata editors, and was collected in an automated fashion.
Training data:

- 90% sample of every language in Wikipedia
- In practice, this means English Wikipedia provides 17.9% of the data, followed by French (4.4%), German (3.7%), Italian (3.4%), Spanish (3.3%), and Egyptian Arabic (3.1%); all other languages are below 3%
- Sampling is done by Wikidata ID, so all language versions of a given article appear in exactly one of the training, validation, or test splits (a sketch of this split appears after this list)

Test and validation data:

- Same data pipeline and general approach as the training data
- 8% sample of every language in Wikipedia for test
- 2% retained for validation
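As a sketch of the by-QID split described above, the following shows one deterministic way to assign every language version of an article to the same split. The hashing scheme is a hypothetical example, not the project's actual pipeline code.

```python
# Sketch: deterministic 90/8/2 split keyed on Wikidata ID, so all language
# versions of an article land in the same split. Hashing scheme is illustrative.
import hashlib

def split_for(qid: str) -> str:
    """Map a Wikidata QID to train (90%), test (8%), or validation (2%)."""
    bucket = int(hashlib.sha256(qid.encode()).hexdigest(), 16) % 100
    if bucket < 90:
        return "train"
    elif bucket < 98:
        return "test"
    return "validation"

print(split_for("Q5588"))  # same answer for the en, fr, ja, ... Frida Kahlo articles
```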
Example data pipeline:
1) Initial article
Magdalena Carmen Frida Kahlo y Calderón (6 July 1907 – 13 July 1954) was a Mexican painter known for her many portraits, self-portraits, and works inspired by the nature and artifacts of Mexico. Inspired by the country's popular culture, she employed a naïve folk art style to explore questions of identity, postcolonialism, gender, class, and race in Mexican society...
2) Isolate links
Frida Kahlo: [
[[Self-portrait]],
[[Mexico]],
[[Culture of Mexico]],
[[Naïve art]],
[[Folk art]],
[[Postcolonialism]],
...
]
3) Convert to list of Wikidata items
Q5588: [ # Frida Kahlo
Q192110, # self-portrait
Q96, # Mexico
Q2317008, # Culture of Mexico
Q281108, # Naïve art
Q1153484, # Folk art
Q265425, # Postcolonialism
...
]
4) Train model
This set of Wikidata IDs is then fed into the model, which maps each ID to an embedding, averages them together, and then uses a multinomial logistic regression to predict labels.
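Below is a minimal numpy sketch of this forward pass with random placeholder weights and a toy vocabulary. A per-topic sigmoid is used here so each of the 64 scores can be thresholded independently at 0.5, which is an assumption consistent with the decision rule described above rather than the model's exact output layer.

```python
# Sketch: average outlink embeddings, then score 64 topics with a linear layer.
# All weights are random placeholders; real vocab has ~4.5M QIDs.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"Q192110": 0, "Q96": 1, "Q2317008": 2}  # toy vocab
embeddings = rng.normal(size=(len(vocab), 50))   # embedding dimension 50
W = rng.normal(size=(50, 64))                    # 50 x 64 = 3,200 classifier params

def predict(outlink_qids):
    ids = [vocab[q] for q in outlink_qids if q in vocab]  # drop out-of-vocab QIDs
    doc = embeddings[ids].mean(axis=0)                    # average the outlink embeddings
    logits = doc @ W
    return 1 / (1 + np.exp(-logits))                      # per-topic scores in (0, 1)

scores = predict(["Q192110", "Q96", "Q2317008"])
topics_over_threshold = np.nonzero(scores >= 0.5)[0]      # indices of predicted topics
```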
Licenses
- Code: MIT License
- Model: CC0 License
Citation
Cite this model as:

```bibtex
@article{johnson2021classification,
  author  = {Johnson, Isaac and Gerlach, Martin and Sáez-Trumper, Diego},
  title   = {Language-agnostic Topic Classification for Wikipedia},
  journal = {WWW '21: Companion Proceedings of the Web Conference 2021},
  month   = {April},
  year    = {2021},
  pages   = {594--601}
}
```