Jump to content

Research:Language-Agnostic Topic Classification/Alternative approaches

From Meta, a Wikimedia project coordination wiki

At various stages/projects, we have also used other approaches to language-agnostic topic classification that are model-based but do not rely on article links. Two are described below but have generally been found to have lower precision or coverage than the link-based approach.

Topic Classification of Wikidata Items

[edit]

This approach makes topic predictions not for Wikipedia articles but for Wikidata items. It treats a given Wikidata item's statements and identifiers as a bag of words -- e.g., "instance-of human" becomes just two tokens "P31" and "Q5". A labeled dataset is generated from Wikidata items that have a labeled English Wikipedia article.

For instance, if you wanted to get a topic prediction for Langdon House, this approach would map the article to its Wikidata item (Q6485811). That Wikidata item (as of May 2020) has the following properties and values:

Together, these properties and values (as represented by the P and Q IDs) would form a bag-of-words representation from which the model learns a simple supervised multi-label classifier. As a side-effect of this model, you can extract embeddings for properties and values on Wikidata (the "words") and Wikidata items (the "documents").

Wikidata-item Resources

[edit]
[edit]

Navigation embeddings start with building reader sessions, where sessions are consecutive pageviews from the same device with no more than one hour between pageviews. These pageviews are mapped to their corresponding Wikidata items and then representation learning approaches such as word2vec can be applied to these sessions to learn embeddings for each Wikidata item. As a result, embeddings for Wikidata items will be similar if their corresponding Wikipedia articles are often read by the same person in quick succession -- i.e. equivalent to strong connections in the clickstream dataset. While this approach thusfar has not been used for topic classification, it is valuable to include as another language-agnostic approach to representing Wikipedia articles.

[edit]
  • For more details, see this example in which navigation embeddings are used to find articles related to the Covid-19 pandemic.