Research:Language-Agnostic Topic Classification

Created

21:06, 14 May 2020 (UTC)

Contact

Isaac Johnson

Wikimedia Foundation

Collaborators

Martin Gerlach

Wikimedia Foundation

Diego Saez-Trumper

Wikimedia Foundation

Duration: 2019-September – 2023-May

Open access
via arXiv

Open source
via GitHub

Open data
via Figshare


This page documents a completed research project.

This project comprised several complementary efforts to build language-agnostic topic classification models for Wikipedia articles, i.e., to label a Wikipedia article with one or more high-level topics using a single model that can provide predictions for any language edition of Wikipedia. It built directly on ORES's language-specific models, which make topic predictions from the text of an article. The work resulted in the link-based topic classification model, which has been put into production and is used by various tools, for example to guide newcomers toward relevant articles to edit.

Methods

These models share a core approach: they apply language modeling techniques but use Wikidata as the feature space, so predictions rest on a shared, language-independent vocabulary that can be applied to any Wikipedia article. All of them use the ground-truth set of topic labels for English Wikipedia, which is based on the WikiProject directory.

Topic Classification of Wikidata Items

Main article: Demographics Survey Work Log

This approach makes topic predictions not for Wikipedia articles but for Wikidata items. It treats a given Wikidata item's statements and identifiers as a bag of words -- e.g., "instance-of human" becomes just two tokens "P31" and "Q5". A labeled dataset is generated from Wikidata items that have a labeled English Wikipedia article.

For instance, if you wanted to get a topic prediction for Langdon House, this approach would map the article to its Wikidata item (Q6485811). That Wikidata item (as of May 2020) has the following properties and values:

Together, these properties and values (as represented by the P and Q IDs) would form a bag-of-words representation from which the model learns a simple supervised multi-label classifier. As a side-effect of this model, you can extract embeddings for properties and values on Wikidata (the "words") and Wikidata items (the "documents").
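The flattening step described above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing code: the function name `item_to_tokens`, the input shape (a dict from property ID to a list of values), and the choice to keep only the property ID for non-item values such as external identifiers are all assumptions. The example claims are illustrative and are not the statements of any particular item.

```python
from typing import Dict, List


def item_to_tokens(claims: Dict[str, List[str]]) -> List[str]:
    """Flatten {property: [values]} into a bag of P/Q tokens.

    Assumption: item-valued statements contribute both the property and
    the value (e.g. "P31", "Q5"); identifiers and other literal values
    contribute only the property ID.
    """
    tokens: List[str] = []
    for prop, values in claims.items():
        tokens.append(prop)
        tokens.extend(v for v in values if v.startswith("Q"))
    return tokens


# Illustrative claims: instance-of human, a gender statement, and an
# external identifier with a made-up literal value.
example_claims = {"P31": ["Q5"], "P21": ["Q6581072"], "P214": ["12345"]}
tokens = item_to_tokens(example_claims)
print(" ".join(tokens))
```

The resulting token list plays the role of a "document" whose "words" are property and item IDs, which is what makes the representation independent of any one language.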

Wikidata-item Resources

Model details / performance
Dump of topic predictions for every Wikidata item with a Wikipedia sitelink: https://figshare.com/articles/Topics_for_each_Wikipedia_Article_across_Languages/12127434
Model code: https://github.com/geohci/wikidata-topic-model

Topic Classification of Wikipedia Articles

For this approach, we use the actual Wikipedia articles to make predictions, but represent them as a bag of outlinks to Wikidata items. The high-level view of this process is as follows:

For a given article, collect all of the Wikipedia articles that it links to (same wiki; namespace 0). I currently use the pagelinks table, but this could be updated to use the wikitext dumps for greater control.
Resolve any redirects.
Map these outlinks (in the form of page IDs) to their associated Wikidata items (using the item_page_link table).
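The three steps above can be sketched with in-memory lookups standing in for the database tables. This is only an illustration under stated assumptions: `outlinks_to_qids`, `redirects`, and `page_to_qid` are hypothetical names, redirects are resolved with a single hop, duplicate items are dropped, and pages without a Wikidata item are skipped. The page IDs are made up.

```python
from typing import Dict, Iterable, List


def outlinks_to_qids(
    outlinks: Iterable[int],
    redirects: Dict[int, int],
    page_to_qid: Dict[int, str],
) -> List[str]:
    """Map an article's outlinks (page IDs) to a bag of Wikidata QIDs."""
    qids: List[str] = []
    seen = set()
    for page_id in outlinks:
        # Step 2: resolve a redirect to its target page (single hop assumed).
        target = redirects.get(page_id, page_id)
        # Step 3: look up the Wikidata item; skip pages that have none.
        qid = page_to_qid.get(target)
        if qid is not None and qid not in seen:
            seen.add(qid)
            qids.append(qid)
    return qids


# Toy data: page 11 redirects to page 10; page 13 has no Wikidata item.
outlinks = [10, 11, 12, 13]
redirects = {11: 10}
page_to_qid = {10: "Q30", 12: "Q1397"}
qids = outlinks_to_qids(outlinks, redirects, page_to_qid)
print(qids)
```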

Continuing with the Langdon House example, this article (as of May 2020) would be represented as:

Q12063015, Q76321820, Q6976908, Q1293931, Q3702059, Q6977993, Q1022954, Q12063106, Q11190, Q12063060, Q1148084, Q5870034, Q2131593, Q6975902, Q6976135, Q1850711, Q6977769, Q3349886, Q6976054, Q4618708, Q6975331, Q1850701, Q4643746, Q1967620, Q1657477, Q12063540, Q6977716, Q6977837, Q6977725, Q6978002, Q1582434, Q6975782, Q1967636, Q6975806, Q1679966, Q1397, Q6976367, Q576744, Q6976290, Q30, Q3719, Q6976722, Q6976479, Q1967640, Q382362, Q6977028, Q1839000, Q6977991, Q4365410, Q2888877, Q6977001, Q16147380, Q6977800, Q6976998, Q8676, Q12063156, Q1850680, Q6973378, Q6976779, Q6976989, Q6976981, Q6977699, Q22664, Q6977032, Q6977161, Q5773622, Q6977307, Q12063371, Q6977894, Q176290, Q1967649, Q6977264, Q6977097, Q6977075, Q6977067, Q43196, Q468574, Q6977490, Q1967645, Q6977990, Q1967630, Q1620797, Q308439

Note, most of these links actually come from this navigation template, which points to the potential value of extracting these links from the wikitext.

As above, a machine learning model can then be learned that takes an input bag-of-words (where the words are QIDs) and outputs topic predictions. Because this model uses QIDs as its feature space, it can make predictions for any article from any Wikipedia language edition so long as that article's outlinks are first mapped to Wikidata items.
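To make the prediction step concrete, here is a minimal sketch of multi-label inference over a bag of QIDs: average the embeddings of the known QIDs, then score each topic with an independent logistic unit. This is not the production model. The function `predict_topics`, the two-dimensional embeddings, the weights, the 0.5 threshold, and the topic names "Geography" and "Biography" are all toy assumptions chosen for illustration.

```python
import math
from typing import Dict, List


def predict_topics(
    qids: List[str],
    embeddings: Dict[str, List[float]],
    label_weights: Dict[str, List[float]],
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Return {topic: probability} for topics scoring at or above threshold."""
    # QIDs never seen in training have no embedding and are ignored.
    vecs = [embeddings[q] for q in qids if q in embeddings]
    if not vecs:
        return {}
    dim = len(vecs[0])
    # Document vector: mean of the QID embeddings (bag-of-words averaging).
    doc = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    predictions: Dict[str, float] = {}
    for label, weights in label_weights.items():
        z = sum(w * x for w, x in zip(weights, doc))
        prob = 1.0 / (1.0 + math.exp(-z))  # independent sigmoid per topic
        if prob >= threshold:
            predictions[label] = prob
    return predictions


# Toy parameters (assumed, not learned from real data).
embeddings = {"Q30": [1.0, 0.0], "Q1397": [1.0, 0.0], "Q5": [0.0, 1.0]}
label_weights = {"Geography": [4.0, -4.0], "Biography": [-4.0, 4.0]}
preds = predict_topics(["Q30", "Q1397", "Q999"], embeddings, label_weights)
print(preds)
```

Because each topic is scored independently, an article can receive zero, one, or several topics, which matches the multi-label setup described above.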

Outlink Resources

Navigation Embeddings

Navigation embeddings start with building reader sessions, where a session is a sequence of consecutive pageviews from the same device with no more than one hour between pageviews. These pageviews are mapped to their corresponding Wikidata items, and representation learning approaches such as word2vec can then be applied to the sessions to learn an embedding for each Wikidata item. As a result, embeddings for Wikidata items will be similar if their corresponding Wikipedia articles are often read by the same person in quick succession -- i.e. equivalent to strong connections in the clickstream dataset. While this approach has not thus far been used for topic classification, it is valuable to include as another language-agnostic approach to representing Wikipedia articles.
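The session-building rule can be sketched as below. This is a minimal illustration, assuming pageviews arrive as `(device_id, unix_timestamp, qid)` tuples that have already been mapped to Wikidata items; the function name `build_sessions` and the input shape are hypothetical, and the real pipeline may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

MAX_GAP_SECONDS = 3600  # "no more than one hour between pageviews"


def build_sessions(pageviews: Iterable[Tuple[str, int, str]]) -> List[List[str]]:
    """Group (device_id, unix_timestamp, qid) pageviews into sessions of QIDs."""
    by_device: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for device, ts, qid in pageviews:
        by_device[device].append((ts, qid))

    sessions: List[List[str]] = []
    for device in sorted(by_device):
        views = sorted(by_device[device])  # chronological order per device
        current = [views[0][1]]
        for (prev_ts, _), (ts, qid) in zip(views, views[1:]):
            # A gap longer than one hour closes the current session.
            if ts - prev_ts > MAX_GAP_SECONDS:
                sessions.append(current)
                current = []
            current.append(qid)
        sessions.append(current)
    return sessions


# Toy pageviews with placeholder QIDs.
pageviews = [
    ("a", 0, "Q1"),
    ("a", 1800, "Q2"),   # 30 minutes later: same session
    ("a", 9000, "Q3"),   # 2 hours later: new session
    ("b", 100, "Q4"),
]
sessions = build_sessions(pageviews)
print(sessions)
```

Each session is a list of QIDs, so it can be treated as a "sentence" when training a word2vec-style model over the sessions.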

Navigation embeddings resources

For more details, see this example in which navigation embeddings are used to find articles related to the Covid-19 pandemic.
