Machine learning models/Production/Language agnostic link-based article topic

Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
  • Model creator(s): Isaac Johnson, Martin Gerlach, and Diego Sáez-Trumper
  • Model owner(s): WMF Research Team
  • Model interface: Lift Wing API
  • Past performance: Previous performance data
  • Publications: Language-agnostic Topic Classification for Wikipedia
  • Code: GitHub repository
  • Uses PII: No
This model uses the links in a Wikipedia article to predict the set of topics the article may belong to.


How can we predict what general topic an article is in, and do so consistently across many languages? Answering this question would be useful for various analyses of Wikipedia dynamics. However, it is difficult to group a very diverse range of Wikipedia articles into coherent, consistent topics manually across all Wikipedia projects.

This model is a new, language-agnostic approach to predicting which topics an article might be relevant to. It uses the wikilinks in a given Wikipedia article to predict which (zero to many) of a set of 64 topics are relevant to that article. For example, Mount Everest might reasonably be associated with South Asia, East Asia, Sports, and Earth and Environment.

The training data for this model was over 30 million Wikipedia articles spanning all languages on Wikipedia. Each article was represented as the list of Wikidata items associated with its outlinks. This data originated from the editing activities of Wikipedia and Wikidata editors, and was collected in an automated fashion.

This model is deployed on LiftWing. Right now, it can be publicly accessed through a beta testing site. This model may be useful for high-level analyses of Wikipedia dynamics (pageviews, article quality, edit trends), filtering articles, and cross-language analytics. It should not be used for projects outside of Wikipedia, namespaces outside of 0, disambiguations, or redirects.

Motivation

A major challenge for many analyses of Wikipedia dynamics — e.g., imbalances in content quality, geographic differences in what content is popular, what types of articles attract more editor discussion — is grouping the very diverse range of Wikipedia articles into coherent, consistent topics. This problem has been addressed using various approaches based on Wikipedia’s category network, WikiProjects, and external taxonomies. However, these approaches have always been limited in their coverage: typically, only a small subset of articles can be classified, or the method cannot be applied across (the more than 300) languages on Wikipedia.

This language-agnostic approach for classifying articles into a taxonomy of topics, which is based on the links in an article, can be easily applied to (almost) any language and article on Wikipedia. It matches the performance of a language-dependent approach while being simpler and having much greater coverage.

Users and uses

Intended users
  • researchers
  • bots
  • editors
  • user scripts and gadgets
Use this model for
  • high-level analyses of Wikipedia dynamics such as pageviews, article quality, or edit trends — e.g. How are pageview dynamics different between the physics and biology categories?
  • filtering to relevant articles — e.g. Filter articles only to those in the music category.
  • cross-language comparisons — e.g. How does the quality of articles in the east-asia category differ between French Wikipedia and Japanese Wikipedia?
Don't use this model for
  • projects outside of Wikipedia — e.g. Wiktionary, Wikinews, etc.
  • namespaces outside of 0, disambiguations, and redirects — the training data for this model explicitly excludes draft pages, talk pages, disambiguations, redirects, and other non-article pages, as they do not have a training label that could be associated with them.
Current uses

This model is in production as of August 2022 but not currently incorporated in any products. The model is occasionally run manually on all Wikipedia articles and the outputs are put into an HDFS table, which powers the backend of the Pageview Topics Dashboard (Prototype), created by the Product Analytics team. (Note that this dashboard was created as a prototype and may have errors in loading.)

This model is currently accessible through the following beta sites:

Ethical considerations, caveats, and recommendations

  • The model fits articles into a taxonomy of topics initially developed as a guide for discovering English WikiProjects. While some tweaks have been made to align it better with topic classification, it likely reflects English Wikipedia's interests and distinctions. Other language editions presumably would make different distinctions.
  • While 0.5 is the suggested threshold, other thresholds may be more appropriate depending on the language and topic label. Notably, the raw scores from the model are not a measure of topical relevance but a measure of model confidence that a topic is relevant. Thus, a higher score does not mean a topic is more relevant, and topics with clearer relevance (e.g., geographic regions, biographies) will generally score higher than topics with more ambiguous relevance (e.g., society, education). A short filtering sketch follows this list.
  • Gaps in WikiProject coverage are known to lead to biases in recall for certain topics. For example, film labels are largely missing for actors from Nollywood (Nigeria), and thus recall is lower for articles about Nigerian films and actors than for Hollywood (US) films and actors.
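
The per-topic thresholding described above happens when post-processing the model's output. Below is a minimal Python sketch that assumes the score/topic output format documented under Implementation; the specific threshold values and the filter_topics helper are illustrative assumptions, not part of the model.

# Illustrative per-topic thresholds; 0.5 is the documented default, the other
# values are made-up examples of tightening or loosening specific topics.
CUSTOM_THRESHOLDS = {
    "Culture.Biography.Biography*": 0.7,   # assumed: a topic with clearer relevance
    "History_and_Society.Education": 0.3,  # assumed: a more ambiguous topic
}
DEFAULT_THRESHOLD = 0.5

def filter_topics(results, thresholds=CUSTOM_THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Keep only topics whose confidence clears the per-topic threshold."""
    return [r for r in results if r["score"] >= thresholds.get(r["topic"], default)]

# Example using the documented output format (list of score/topic dictionaries).
example = [
    {"score": 0.890, "topic": "Culture.Biography.Biography*"},
    {"score": 0.281, "topic": "Culture.Biography.Women"},
]
print(filter_topics(example))  # keeps only the Biography* entry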

Model

Performance

As of the most recent model training in November 2022, the model had the following test statistics:

Overall Model Performance (November 2022, threshold=0.5 for all topics)
  • Precision (percentage of articles classified as a topic that actually had that topic): 0.881 (micro); 0.841 (macro)
  • Recall (percentage of articles that actually had a topic that were classified as that topic): 0.795 (micro); 0.676 (macro)
  • F1 (harmonic mean of precision and recall): 0.833 (micro); 0.744 (macro)
  • Average precision (mean precision across all thresholds): 0.893 (micro); 0.796 (macro)
Detailed Model Performance (November 2022, threshold=0.5 for all topics)
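
For reference, the micro and macro averages above are standard multi-label aggregates over the 64 topic labels. The sketch below shows how such numbers can be computed with scikit-learn; the y_true and y_score arrays are random placeholders, not the model's actual evaluation data.

# Placeholder evaluation: binary indicator matrices with one column per topic.
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, average_precision_score

rng = np.random.default_rng(0)
n_articles, n_topics = 1000, 64
y_true = rng.integers(0, 2, size=(n_articles, n_topics))   # ground-truth topic labels
y_score = rng.random((n_articles, n_topics))               # model confidence scores
y_pred = (y_score >= 0.5).astype(int)                      # threshold=0.5 for all topics

for average in ("micro", "macro"):
    p = precision_score(y_true, y_pred, average=average, zero_division=0)
    r = recall_score(y_true, y_pred, average=average, zero_division=0)
    f1 = f1_score(y_true, y_pred, average=average, zero_division=0)
    ap = average_precision_score(y_true, y_score, average=average)
    print(f"{average}: precision={p:.3f} recall={r:.3f} F1={f1:.3f} AP={ap:.3f}")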

Performance Notes

  • Performance will suffer for articles with few outlinks, though generally this means precision remains high while recall drops. Atypical linking practices might also lead to unexpected results, though these are difficult to characterize in advance. In practice, the number of outlinks in a given Wikipedia article does vary by language edition, region of the world, and age of the article.
  • The ground-truth data for this model (WikiProject tags) is quite comprehensive but is certainly missing many legitimate tags. This means that measured precision is likely an underestimate (some predictions counted as false positives are in fact correct), while measured recall is likely an overestimate, so true recall is not as good as the numbers suggest.
  • Evaluation factors: number of outlinks, topic

Implementation

Model architecture

The model is a fastText supervised classifier; overview at: https://fasttext.cc/docs/en/supervised-tutorial.html. A hedged training sketch using the hyperparameters below follows this list.

  • Epochs: 2
  • Learning rate: 0.1
  • Window size: 20
  • Min count (under which QID is not retained in vocab): 20
  • No pre-trained embeddings used
  • Embeddings dimension: 50
  • Total number of classifier params (excluding embeddings): 3,200 (50 x 64)
  • Vocab size: 4,535,915
  • Total number of embeddings params: 226,795,750 (vocab size * embeddings dimension)
  • Model size on disk: 944 MB
  • Decision thresholds: 0.5 for all labels
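A minimal sketch of a fastText supervised run with the settings above is shown below. The training file name, its label format, and the one-vs-all loss are assumptions for illustration; the actual training script lives in the linked GitHub repository.

# Sketch only: train a fastText supervised classifier on bags of outlink QIDs.
# Each line of the (assumed) training file looks roughly like:
#   __label__Culture.Biography.Biography* __label__Culture.Visual_arts.Visual_arts* Q192110 Q96 Q2317008 ...
import fasttext

model = fasttext.train_supervised(
    input="outlinks_train.txt",  # assumed file name
    epoch=2,       # Epochs: 2
    lr=0.1,        # Learning rate: 0.1
    ws=20,         # Window size: 20
    minCount=20,   # QIDs seen fewer than 20 times are dropped from the vocab
    dim=50,        # Embeddings dimension: 50
    loss="ova",    # assumed: one-vs-all loss so multiple topics can apply
)

# Predict all topics scoring at least 0.5 for one article's bag of QIDs.
labels, scores = model.predict("Q192110 Q96 Q2317008 Q281108", k=-1, threshold=0.5)
print(list(zip(labels, scores)))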
Output schema
{
  article: <url string>,
  results: [
	{score: <score (0-1)>, topic: <string>},
	... (up to 64 topics)
	{score: <score (0-1)>, topic: <string>}
  ]
}
Example input and output

Input

GET /api/v1/topic?threshold=0.1&lang=en&title=Frida_Kahlo

Output

{
  "article": "https://en.wikipedia.org/wiki/Frida_Kahlo",
  "results": [
    {"score": 0.890, "topic": "Culture.Biography.Biography*"},
    {"score": 0.516, "topic": "Geography.Regions.Americas.North_America"},
    {"score": 0.484, "topic": "Culture.Visual_arts.Visual_arts*"},
    {"score": 0.281, "topic": "Culture.Biography.Women"}
  ]
}
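
The request above can be issued from Python as in the sketch below. The base URL is a placeholder (the beta host is not listed on this page), and the query parameters mirror the documented example.

# Sketch of calling the topic endpoint shown above with the requests library.
import requests

BASE_URL = "https://<beta-host>"  # placeholder: substitute the actual beta/Lift Wing host

params = {
    "lang": "en",
    "title": "Frida_Kahlo",
    "threshold": 0.1,  # only return topics scoring at least 0.1
}
response = requests.get(f"{BASE_URL}/api/v1/topic", params=params, timeout=30)
response.raise_for_status()

for result in response.json()["results"]:
    print(f'{result["topic"]}: {result["score"]:.3f}')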

Data

The training data for this model was over 30 million Wikipedia articles spanning all languages on Wikipedia. Each article was represented as the list of Wikidata items associated with its outlinks. This data originated from the editing activities of Wikipedia and Wikidata editors, and was collected in an automated fashion.

Data pipeline
Wikilinks in a Wikipedia article (to other namespace 0 articles in that wiki) at the most current revision are selected from the pagelinks table and mapped to their corresponding Wikidata IDs as of a snapshot date. If a link has no Wikidata ID, or its Wikidata ID is not within the model's vocabulary, it is dropped. The resulting bag of Wikidata IDs is fed into the model, which maps each ID to a 50-dimensional embedding, averages them together, and then uses multinomial logistic regression to predict labels. A sketch of the outlink-to-Wikidata lookup step follows the data splits below.
Training data
  • 90% sample of every language in Wikipedia
  • In practice, this meant that English Wikipedia provided 17.9% of the data, followed by French (4.4%), German (3.7%), Italian (3.4%), Spanish (3.3%), and Egyptian Arabic (3.1%); all other languages contributed below 3% each
  • Sampling is done by Wikidata ID, so all the language versions of a given article either appear in training, validation, or test but not across multiple splits.
Test data
  • Same data pipeline and general approach as training data
  • 8% sample of every language in Wikipedia
  • 2% retained for validation
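To illustrate the data pipeline described above, the sketch below approximates the article-to-outlink-QIDs step with the public MediaWiki API rather than the pagelinks table used in production. Continuation handling is omitted, so only the first batch of links is returned, and vocabulary filtering happens later, inside the model.

# Sketch only: fetch an article's namespace-0 outlinks and map them to Wikidata IDs.
import requests

def outlink_qids(title, lang="en"):
    """Return the Wikidata QIDs of the pages linked from a Wikipedia article."""
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "generator": "links",   # iterate over the article's wikilinks...
        "gplnamespace": 0,      # ...restricted to namespace 0
        "gpllimit": "max",
        "prop": "pageprops",    # fetch each linked page's Wikidata item
        "ppprop": "wikibase_item",
    }
    resp = requests.get(f"https://{lang}.wikipedia.org/w/api.php", params=params, timeout=30)
    resp.raise_for_status()
    pages = resp.json().get("query", {}).get("pages", {})
    qids = []
    for page in pages.values():
        qid = page.get("pageprops", {}).get("wikibase_item")
        if qid:                 # links without a Wikidata ID are dropped
            qids.append(qid)
    return qids

print(outlink_qids("Frida Kahlo")[:5])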
Example data pipeline
1) Initial article
Magdalena Carmen Frida Kahlo y Calderón (6 July 1907 – 13 July 1954) was a Mexican painter known for her many portraits, self-portraits, and works inspired by the nature and artifacts of Mexico. Inspired by the country's popular culture, she employed a naïve folk art style to explore questions of identity, postcolonialism, gender, class, and race in Mexican society...
2) Isolate links
Frida Kahlo: [
  [[Self-portrait]],
  [[Mexico]],
  [[Culture of Mexico]],
  [[Naïve art]],
  [[Folk art]],
  [[Postcolonialism]],
  ...
]
3) Convert to list of Wikidata items
Q5588: [    # Frida Kahlo
  Q192110,  # self-portrait
  Q96,      # Mexico
  Q2317008, # Culture of Mexico
  Q281108,  # Naïve art
  Q1153484, # Folk art
  Q265425,  # Postcolonialism
  ...
]
4) Train model
This set of Wikidata IDs is then fed into the model, which maps each ID to an embedding, averages them together, and then uses a multinomial logistic regression to predict labels.
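
A minimal numpy sketch of that forward pass is shown below. The vocabulary, embedding matrix, and classifier weights are random placeholders standing in for the trained model's parameters, and the per-topic sigmoid is an assumption consistent with the multi-label scores in the example output above.

# Sketch only: average 50-dimensional QID embeddings and score the 64 topics.
import numpy as np

rng = np.random.default_rng(0)
DIM, N_TOPICS = 50, 64

vocab = {"Q192110": 0, "Q96": 1, "Q2317008": 2, "Q281108": 3}  # QID -> embedding row
embeddings = rng.normal(size=(len(vocab), DIM))                # placeholder embeddings
W = rng.normal(size=(N_TOPICS, DIM))                           # placeholder classifier weights
b = np.zeros(N_TOPICS)                                         # placeholder classifier biases

def predict_scores(qids):
    """Average the embeddings of in-vocabulary QIDs and score every topic."""
    rows = [vocab[q] for q in qids if q in vocab]   # out-of-vocabulary QIDs are dropped
    if not rows:
        return np.zeros(N_TOPICS)
    doc_vector = embeddings[rows].mean(axis=0)
    return 1.0 / (1.0 + np.exp(-(W @ doc_vector + b)))  # one confidence score per topic

scores = predict_scores(["Q192110", "Q96", "Q2317008", "Q281108"])
print((scores >= 0.5).sum(), "topics above the 0.5 threshold")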

Licenses

Citation

Cite this model as:

@article{johnson2021classification,
   author={Johnson, Isaac and Gerlach, Martin and Sáez-Trumper, Diego},
   title={Language-agnostic Topic Classification for Wikipedia},
   journal={WWW '21: Companion Proceedings of the Web Conference 2021},
   month={April},
   year={2021},
   pages={594--601}
}