Machine learning models/Production/Language agnostic link-based article topic

Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
  • Model creator(s): Isaac Johnson, Martin Gerlach, and Diego Sáez-Trumper
  • Model owner(s): WMF Research Team
  • Model interface: Lift Wing API
  • Past performance: Previous performance data
  • Publications: Language-agnostic Topic Classification for Wikipedia
  • Code: GitHub repository
  • Uses PII: No
This model uses the links in a Wikipedia article to predict the set of topics the article may belong to.


How can we predict what general topic an article is in, and do so consistently across many languages? Answering this question would be useful for various analyses of Wikipedia dynamics. However, it is difficult to group a very diverse range of Wikipedia articles into coherent, consistent topics manually across all Wikipedia projects.

This model is a new, language-agnostic approach to predicting which topics an article might be relevant to. It uses the wikilinks in a given Wikipedia article to predict which (zero to many) of a set of 64 topics are relevant to that article. For example, Mount Everest might reasonably be associated with South Asia, East Asia, Sports, and Earth and Environment.

The training data for this model was over 30 million Wikipedia articles spanning all languages on Wikipedia. Each article was represented as the list of Wikidata items associated with its outlinks. This data originated from the editing activities of Wikipedia and Wikidata editors, and was collected in an automated fashion.

This model is deployed on LiftWing. Right now, it can be publicly accessed through a beta testing site. This model may be useful for high-level analyses of Wikipedia dynamics (pageviews, article quality, edit trends), filtering articles, and cross-language analytics. It should not be used for projects outside of Wikipedia, namespaces outside of 0, disambiguations, or redirects.

Motivation

A major challenge for many analyses of Wikipedia dynamics — e.g., imbalances in content quality, geographic differences in what content is popular, what types of articles attract more editor discussion — is grouping the very diverse range of Wikipedia articles into coherent, consistent topics. This problem has been addressed using various approaches based on Wikipedia’s category network, WikiProjects, and external taxonomies. However, these approaches have always been limited in their coverage: typically, only a small subset of articles can be classified, or the method cannot be applied across (the more than 300) languages on Wikipedia.

This language-agnostic approach for classifying articles into a taxonomy of topics, which is based on the links in an article, can be easily applied to (almost) any language and article on Wikipedia. It matches the performance of a language-dependent approach while being simpler and having much greater coverage.

Users and uses

Intended users
  • researchers
  • bots
  • editors
  • user scripts and gadgets
Use this model for
  • high-level analyses of Wikipedia dynamics such as pageviews, article quality, or edit trends — e.g. How are pageview dynamics different between the physics and biology categories?
  • filtering to relevant articles — e.g. Filter articles only to those in the music category.
  • cross-language comparisons — e.g. How does the quality of articles in the east-asia category differ between French Wikipedia and Japanese Wikipedia?
Don't use this model for
  • projects outside of Wikipedia — e.g. Wiktionary, Wikinews, etc.
  • namespaces outside of 0, disambiguations, and redirects — the training data for this model explicitly excludes draft pages, talk pages, disambiguations, redirects, and other non-article pages, as they do not have a training label that could be associated with them.
Current uses

This model is in production as of August 2022 but not currently incorporated in any products. The model is occasionally run manually on all Wikipedia articles and the outputs are put into an HDFS table, which powers the backend of the Pageview Topics Dashboard (Prototype), created by the Product Analytics team. (Note that this dashboard was created as a prototype and may have errors in loading.)

This model is currently accessible through the following beta sites:

Ethical considerations, caveats, and recommendations

  • The model fits articles into a taxonomy of topics initially developed as a guide for discovering English WikiProjects. While some tweaks have been made to align it better with topic classification, it likely reflects English Wikipedia's interests and distinctions. Other language editions presumably would make different distinctions.
  • While 0.5 is the suggested threshold, other thresholds may be more appropriate depending on the language and topic label. Notably, the raw scores from the model are not a measure of topical relevance but a measure of model confidence that a topic is relevant. Thus, a higher score does not mean a topic is more relevant, and topics with clearer relevance (e.g., geographic regions, biographies) will generally score higher than topics with more ambiguous relevance (e.g., society, education). A short filtering sketch follows this list.
  • Gaps in WikiProject coverage are known to lead to biases in recall for certain topics. For example, film labels are largely missing for actors from Nollywood (Nigeria), and thus recall is lower for articles about Nigerian films and actors than for Hollywood (US) films and actors.
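
The per-topic thresholding described above happens when post-processing the model's output. Below is a minimal Python sketch that assumes the score/topic output format documented under Implementation; the specific threshold values and the filter_topics helper are illustrative assumptions, not part of the model.

# Illustrative per-topic thresholds; 0.5 is the documented default, the other
# values are made-up examples of tightening or loosening specific topics.
CUSTOM_THRESHOLDS = {
    "Culture.Biography.Biography*": 0.7,   # assumed: a topic with clearer relevance
    "History_and_Society.Education": 0.3,  # assumed: a more ambiguous topic
}
DEFAULT_THRESHOLD = 0.5

def filter_topics(results, thresholds=CUSTOM_THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Keep only topics whose confidence clears the per-topic threshold."""
    return [r for r in results if r["score"] >= thresholds.get(r["topic"], default)]

# Example using the documented output format (list of score/topic dictionaries).
example = [
    {"score": 0.890, "topic": "Culture.Biography.Biography*"},
    {"score": 0.281, "topic": "Culture.Biography.Women"},
]
print(filter_topics(example))  # keeps only the Biography* entry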

Model

Performance

As of the most recent model training in November 2022, the model had the following test statistics:

Overall Model Performance (November 2022, threshold=0.5 for all topics)
  • Precision (percentage of articles classified as a topic that actually had that topic): 0.881 (micro); 0.841 (macro)
  • Recall (percentage of articles that actually had a topic that were classified as that topic): 0.795 (micro); 0.676 (macro)
  • F1 (harmonic mean of precision and recall): 0.833 (micro); 0.744 (macro)
  • Average precision (mean precision across all thresholds): 0.893 (micro); 0.796 (macro)
Detailed Model Performance (November 2022, threshold=0.5 for all topics)
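
For reference, the micro and macro averages above are standard multi-label aggregates over the 64 topic labels. The sketch below shows how such numbers can be computed with scikit-learn; the y_true and y_score arrays are random placeholders, not the model's actual evaluation data.

# Placeholder evaluation: binary indicator matrices with one column per topic.
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, average_precision_score

rng = np.random.default_rng(0)
n_articles, n_topics = 1000, 64
y_true = rng.integers(0, 2, size=(n_articles, n_topics))   # ground-truth topic labels
y_score = rng.random((n_articles, n_topics))               # model confidence scores
y_pred = (y_score >= 0.5).astype(int)                      # threshold=0.5 for all topics

for average in ("micro", "macro"):
    p = precision_score(y_true, y_pred, average=average, zero_division=0)
    r = recall_score(y_true, y_pred, average=average, zero_division=0)
    f1 = f1_score(y_true, y_pred, average=average, zero_division=0)
    ap = average_precision_score(y_true, y_score, average=average)
    print(f"{average}: precision={p:.3f} recall={r:.3f} F1={f1:.3f} AP={ap:.3f}")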

Performance Notes

  • Performance will suffer for articles with few outlinks, though generally this means precision remains high while recall drops. Atypical linking practices might also lead to unexpected results, though these are difficult to characterize in advance. In practice, the number of outlinks in a given Wikipedia article does vary by language edition, region of the world, and age of the article.
  • The ground-truth data for this model (WikiProject tags) is quite comprehensive but is certainly missing many legitimate tags. This means that measured precision is likely an underestimate (some predictions counted as false positives are in fact correct), while measured recall is likely an overestimate, so true recall is not as good as the numbers suggest.
  • Evaluation factors: number of outlinks, topic

Implementation

Model architecture

The model is a fastText supervised classifier; overview at: https://fasttext.cc/docs/en/supervised-tutorial.html. A hedged training sketch using the hyperparameters below follows this list.

  • Epochs: 2
  • Learning rate: 0.1
  • Window size: 20
  • Min count (under which QID is not retained in vocab): 20
  • No pre-trained embeddings used
  • Embeddings dimension: 50
  • Total number of classifier params (excluding embeddings): 3,200 (50 x 64)
  • Vocab size: 4,535,915
  • Total number of embeddings params: 226,795,750 (vocab size * embeddings dimension)
  • Model size on disk: 944 MB
  • Decision thresholds: 0.5 for all labels
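A minimal sketch of a fastText supervised run with the settings above is shown below. The training file name, its label format, and the one-vs-all loss are assumptions for illustration; the actual training script lives in the linked GitHub repository.

# Sketch only: train a fastText supervised classifier on bags of outlink QIDs.
# Each line of the (assumed) training file looks roughly like:
#   __label__Culture.Biography.Biography* __label__Culture.Visual_arts.Visual_arts* Q192110 Q96 Q2317008 ...
import fasttext

model = fasttext.train_supervised(
    input="outlinks_train.txt",  # assumed file name
    epoch=2,       # Epochs: 2
    lr=0.1,        # Learning rate: 0.1
    ws=20,         # Window size: 20
    minCount=20,   # QIDs seen fewer than 20 times are dropped from the vocab
    dim=50,        # Embeddings dimension: 50
    loss="ova",    # assumed: one-vs-all loss so multiple topics can apply
)

# Predict all topics scoring at least 0.5 for one article's bag of QIDs.
labels, scores = model.predict("Q192110 Q96 Q2317008 Q281108", k=-1, threshold=0.5)
print(list(zip(labels, scores)))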
Output schema
{
  article: <url string>,
  results: [
	{score: <score (0-1)>, topic: <string>},
	... (up to 64 topics)
	{score: <score (0-1)>, topic: <string>}
  ]
}
Example input and output

Input

GET /api/v1/topic?threshold=0.1&lang=en&title=Frida_Kahlo

Output

{
  "article": "https://en.wikipedia.org/wiki/Frida_Kahlo",
  "results": [
    {"score": 0.890, "topic": "Culture.Biography.Biography*"},
    {"score": 0.516, "topic": "Geography.Regions.Americas.North_America"},
    {"score": 0.484, "topic": "Culture.Visual_arts.Visual_arts*"},
    {"score": 0.281, "topic": "Culture.Biography.Women"}
  ]
}
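
The request above can be issued from Python as in the sketch below. The base URL is a placeholder (the beta host is not listed on this page), and the query parameters mirror the documented example.

# Sketch of calling the topic endpoint shown above with the requests library.
import requests

BASE_URL = "https://<beta-host>"  # placeholder: substitute the actual beta/Lift Wing host

params = {
    "lang": "en",
    "title": "Frida_Kahlo",
    "threshold": 0.1,  # only return topics scoring at least 0.1
}
response = requests.get(f"{BASE_URL}/api/v1/topic", params=params, timeout=30)
response.raise_for_status()

for result in response.json()["results"]:
    print(f'{result["topic"]}: {result["score"]:.3f}')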

Data

The training data for this model was over 30 million Wikipedia articles spanning all languages on Wikipedia. Each article was represented as the list of Wikidata items associated with its outlinks. This data originated from the editing activities of Wikipedia and Wikidata editors, and was collected in an automated fashion.

Data pipeline
Wikilinks in a Wikipedia article (to other namespace 0 articles in that wiki) at the most current revision are selected from the pagelinks table and mapped to their corresponding Wikidata IDs as of a snapshot date. If a link has no Wikidata ID, or its Wikidata ID is not within the model's vocabulary, it is dropped. The resulting bag of Wikidata IDs is fed into the model, which maps each ID to a 50-dimensional embedding, averages them together, and then uses multinomial logistic regression to predict labels. A sketch of the outlink-to-Wikidata lookup step follows the data splits below.
Training data
  • 90% sample of every language in Wikipedia
  • In practice, this meant that English Wikipedia provided 17.9% of the data, followed by French (4.4%), German (3.7%), Italian (3.4%), Spanish (3.3%), and Egyptian Arabic (3.1%); all other languages contributed below 3% each
  • Sampling is done by Wikidata ID, so all the language versions of a given article either appear in training, validation, or test but not across multiple splits.
Test data
  • Same data pipeline and general approach as training data
  • 8% sample of every language in Wikipedia
  • 2% retained for validation
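To illustrate the data pipeline described above, the sketch below approximates the article-to-outlink-QIDs step with the public MediaWiki API rather than the pagelinks table used in production. Continuation handling is omitted, so only the first batch of links is returned, and vocabulary filtering happens later, inside the model.

# Sketch only: fetch an article's namespace-0 outlinks and map them to Wikidata IDs.
import requests

def outlink_qids(title, lang="en"):
    """Return the Wikidata QIDs of the pages linked from a Wikipedia article."""
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "generator": "links",   # iterate over the article's wikilinks...
        "gplnamespace": 0,      # ...restricted to namespace 0
        "gpllimit": "max",
        "prop": "pageprops",    # fetch each linked page's Wikidata item
        "ppprop": "wikibase_item",
    }
    resp = requests.get(f"https://{lang}.wikipedia.org/w/api.php", params=params, timeout=30)
    resp.raise_for_status()
    pages = resp.json().get("query", {}).get("pages", {})
    qids = []
    for page in pages.values():
        qid = page.get("pageprops", {}).get("wikibase_item")
        if qid:                 # links without a Wikidata ID are dropped
            qids.append(qid)
    return qids

print(outlink_qids("Frida Kahlo")[:5])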
Example data pipeline
1) Initial article
Magdalena Carmen Frida Kahlo y Calderón (6 July 1907 – 13 July 1954) was a Mexican painter known for her many portraits, self-portraits, and works inspired by the nature and artifacts of Mexico. Inspired by the country's popular culture, she employed a naïve folk art style to explore questions of identity, postcolonialism, gender, class, and race in Mexican society...
2) Isolate links
Frida Kahlo: [
  [[Self-portrait]],
  [[Mexico]],
  [[Culture of Mexico]],
  [[Naïve art]],
  [[Folk art]],
  [[Postcolonialism]],
  ...
]
3) Convert to list of Wikidata items
Q5588: [    # Frida Kahlo
  Q192110,  # self-portrait
  Q96,      # Mexico
  Q2317008, # Culture of Mexico
  Q281108,  # Naïve art
  Q1153484, # Folk art
  Q265425,  # Postcolonialism
  ...
]
4) Train model
This set of Wikidata IDs is then fed into the model, which maps each ID to an embedding, averages them together, and then uses a multinomial logistic regression to predict labels.
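
A minimal numpy sketch of that forward pass is shown below. The vocabulary, embedding matrix, and classifier weights are random placeholders standing in for the trained model's parameters, and the per-topic sigmoid is an assumption consistent with the multi-label scores in the example output above.

# Sketch only: average 50-dimensional QID embeddings and score the 64 topics.
import numpy as np

rng = np.random.default_rng(0)
DIM, N_TOPICS = 50, 64

vocab = {"Q192110": 0, "Q96": 1, "Q2317008": 2, "Q281108": 3}  # QID -> embedding row
embeddings = rng.normal(size=(len(vocab), DIM))                # placeholder embeddings
W = rng.normal(size=(N_TOPICS, DIM))                           # placeholder classifier weights
b = np.zeros(N_TOPICS)                                         # placeholder classifier biases

def predict_scores(qids):
    """Average the embeddings of in-vocabulary QIDs and score every topic."""
    rows = [vocab[q] for q in qids if q in vocab]   # out-of-vocabulary QIDs are dropped
    if not rows:
        return np.zeros(N_TOPICS)
    doc_vector = embeddings[rows].mean(axis=0)
    return 1.0 / (1.0 + np.exp(-(W @ doc_vector + b)))  # one confidence score per topic

scores = predict_scores(["Q192110", "Q96", "Q2317008", "Q281108"])
print((scores >= 0.5).sum(), "topics above the 0.5 threshold")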

Licenses

Citation

Cite this model as:

@article{johnson2021classification,
   author={Johnson, Isaac and Gerlach, Martin and Sáez-Trumper, Diego},
   title={Language-agnostic Topic Classification for Wikipedia},
   journal={WWW '21: Companion Proceedings of the Web Conference 2021},
   month={April},
   year={2021},
   pages={594--601}
}