Research:Language-Agnostic Topic Classification
This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.
This project comprises several complementary efforts to build language-agnostic topic classification models for Wikipedia articles: a single model labels an article with one or more high-level topics and can provide predictions for any language edition of Wikipedia. The goal is for these models to complement ORES' language-specific models, which make topic predictions based on the text of the article.
All of these models apply language modeling techniques but use Wikidata as the feature space, so that predictions are based on a shared, language-independent vocabulary that can be applied to any Wikipedia article. They are all trained on the same groundtruth: topic labels for English Wikipedia articles derived from the WikiProject directory.
Topic Classification of Wikidata Items
This approach makes topic predictions not for Wikipedia articles but for Wikidata items. It treats a given Wikidata item's statements and identifiers as a bag of words -- e.g., "instance-of human" becomes just two tokens "P31" and "Q5". A labeled dataset is generated from Wikidata items that have a labeled English Wikipedia article.
For instance, if you wanted to get a topic prediction for Langdon House, this approach would map the article to its Wikidata item (Q6485811). That Wikidata item (as of May 2020) has the following properties and values:
- P31 (instance of): Q3947 (house)
- P18 (image)
- P17 (country): Q30 (United States of America)
- P131 (located in the administrative territorial entity): Q1397 (Ohio)
- P625 (coordinate location)
- P149 (architectural style): Q4365410 (Carpenter Gothic)
- P1435 (heritage designation): Q19558910 (place listed on the National Register of Historic Places)
- P649 (NRHP reference number)
Together, these properties and values (as represented by the P and Q IDs) would form a bag-of-words representation from which the model learns a simple supervised multi-label classifier. As a side-effect of this model, you can extract embeddings for properties and values on Wikidata (the "words") and Wikidata items (the "documents").
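As a rough illustration of the bag-of-words step described above, the following sketch flattens an item's statements into P and Q tokens. The `Claims` structure and the `claims_to_tokens` helper are hypothetical simplifications (they are not taken from the model code), and the choice to emit the property token once per value is an assumption.

```python
from typing import Dict, List

# Hypothetical, simplified view of an item's statements:
# property ID -> list of item-valued QIDs. The list is empty when the value
# is not an item (e.g. an image, a coordinate, or an external identifier).
Claims = Dict[str, List[str]]


def claims_to_tokens(claims: Claims) -> List[str]:
    """Flatten statements into a bag of P/Q tokens.

    Assumption: a property with an item value yields two tokens (P, Q);
    a property without an item value yields only its P token.
    """
    tokens: List[str] = []
    for prop, values in claims.items():
        if not values:
            tokens.append(prop)
        for qid in values:
            tokens.extend([prop, qid])
    return tokens


# Statements for Langdon House (Q6485811) as listed above.
langdon_house: Claims = {
    "P31": ["Q3947"],        # instance of: house
    "P18": [],               # image
    "P17": ["Q30"],          # country: United States of America
    "P131": ["Q1397"],       # located in the administrative territorial entity: Ohio
    "P625": [],              # coordinate location
    "P149": ["Q4365410"],    # architectural style: Carpenter Gothic
    "P1435": ["Q19558910"],  # heritage designation
    "P649": [],              # NRHP reference number
}

tokens = claims_to_tokens(langdon_house)
print(" ".join(tokens))
```

The resulting token list is what a bag-of-words classifier would consume as the "document" for this item.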
- Model details / performance
- Try out the model here: https://wiki-topic.toolforge.org/#wikidata-model
- Dump of topic predictions for every Wikidata item with a Wikipedia sitelink: https://figshare.com/articles/Topics_for_each_Wikipedia_Article_across_Languages/12127434
- Model code: https://github.com/geohci/wikidata-topic-model
Topic Classification of Wikipedia Articles
For this approach, we use the actual Wikipedia articles to make predictions, but represent them as a bag of outlinks to Wikidata items. The high-level view of this process is as follows:
- For a given article, collect all of the Wikipedia articles that it links to (same wiki; namespace 0). I currently use the pagelinks table, but this could be updated to use the wikitext dumps for greater control.
- Resolve any redirects.
- Map these outlinks (in the form of page IDs) to their associated Wikidata items (using the item_page_link table).
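The three steps above can be sketched with in-memory lookups. This is only an illustration: the `outlinks_to_qids` helper, the dictionary-based stand-ins for the database tables, and the page IDs are all hypothetical, and collapsing repeated links into a set is an assumption.

```python
from typing import Dict, Iterable, Set


def outlinks_to_qids(
    outlinks: Iterable[int],
    redirects: Dict[int, int],
    page_to_qid: Dict[int, str],
) -> Set[str]:
    """Resolve redirects, then map outlink page IDs to Wikidata QIDs.

    Assumptions: redirects are resolved with a single hop, and pages
    without an associated Wikidata item are silently dropped.
    """
    qids: Set[str] = set()
    for page_id in outlinks:
        target = redirects.get(page_id, page_id)  # follow redirect if present
        qid = page_to_qid.get(target)
        if qid is not None:
            qids.add(qid)
    return qids


# Toy data with made-up page IDs; only the QIDs are real identifiers.
outlinks = [101, 102, 103, 104]
redirects = {102: 101}                      # page 102 redirects to page 101
page_to_qid = {101: "Q1397", 103: "Q30"}    # page 104 has no Wikidata item

article_qids = outlinks_to_qids(outlinks, redirects, page_to_qid)
print(sorted(article_qids))
```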
Continuing with the Langdon House example, this article (as of May 2020) would be represented as:
- Q12063015, Q76321820, Q6976908, Q1293931, Q3702059, Q6977993, Q1022954, Q12063106, Q11190, Q12063060, Q1148084, Q5870034, Q2131593, Q6975902, Q6976135, Q1850711, Q6977769, Q3349886, Q6976054, Q4618708, Q6975331, Q1850701, Q4643746, Q1967620, Q1657477, Q12063540, Q6977716, Q6977837, Q6977725, Q6978002, Q1582434, Q6975782, Q1967636, Q6975806, Q1679966, Q1397, Q6976367, Q576744, Q6976290, Q30, Q3719, Q6976722, Q6976479, Q1967640, Q382362, Q6977028, Q1839000, Q6977991, Q4365410, Q2888877, Q6977001, Q16147380, Q6977800, Q6976998, Q8676, Q12063156, Q1850680, Q6973378, Q6976779, Q6976989, Q6976981, Q6977699, Q22664, Q6977032, Q6977161, Q5773622, Q6977307, Q12063371, Q6977894, Q176290, Q1967649, Q6977264, Q6977097, Q6977075, Q6977067, Q43196, Q468574, Q6977490, Q1967645, Q6977990, Q1967630, Q1620797, Q308439
Note that most of these links come from a navigation template included in the article, which points to the potential value of extracting links from the wikitext instead.
As above, a machine learning model can then be trained that takes a bag-of-words as input (where the words are QIDs) and outputs topic predictions. Because this model uses QIDs as its feature space, it can make predictions for any article from any Wikipedia language edition so long as that article's outlinks are first mapped to Wikidata items.
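To make the prediction step concrete, here is a minimal inference sketch. It assumes one common design for bag-of-words classifiers (average the QID embeddings into a document vector, then score each topic independently with a sigmoid), which this page does not confirm as the actual architecture. The `predict_topics` helper, the toy embeddings, the weights, and the topic names are all invented for illustration.

```python
import math
from typing import Dict, List


def predict_topics(
    qids: List[str],
    embeddings: Dict[str, List[float]],
    label_weights: Dict[str, List[float]],
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Return every topic whose independent probability meets the threshold.

    QIDs missing from the vocabulary are ignored, so the same function
    works for an article from any language edition.
    """
    known = [embeddings[q] for q in qids if q in embeddings]
    if not known:
        return {}
    dim = len(known[0])
    # Document vector: element-wise mean of the QID vectors.
    doc = [sum(vec[i] for vec in known) / len(known) for i in range(dim)]
    predictions: Dict[str, float] = {}
    for label, weights in label_weights.items():
        logit = sum(w * x for w, x in zip(weights, doc))
        prob = 1.0 / (1.0 + math.exp(-logit))  # one sigmoid per label (multi-label)
        if prob >= threshold:
            predictions[label] = prob
    return predictions


# Toy two-dimensional parameters (made up, not learned from data).
toy_embeddings = {"Q1397": [2.0, 0.0], "Q30": [1.0, 0.0], "Q4365410": [0.0, 2.0]}
toy_weights = {
    "geography": [2.0, -1.0],
    "architecture": [-1.0, 2.0],
    "physics": [-2.0, -2.0],
}

predicted = predict_topics(
    ["Q1397", "Q30", "Q4365410", "Q999999999"], toy_embeddings, toy_weights
)
print(sorted(predicted))
```

Because each topic is scored independently, an article can receive several topics at once or none at all.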
- Model details / performance statistics
- Try out the model here: https://wiki-topic.toolforge.org/#lang-agnostic-model
- Model code: https://github.com/geohci/wikipedia-language-agnostic-topic-classification
Navigation embeddings start with building reader sessions, where a session is a sequence of consecutive pageviews from the same device with no more than one hour between pageviews. These pageviews are mapped to their corresponding Wikidata items, and representation learning approaches such as word2vec can then be applied to the sessions to learn an embedding for each Wikidata item. As a result, embeddings for Wikidata items will be similar if their corresponding Wikipedia articles are often read by the same person in quick succession -- i.e. equivalent to strong connections in the clickstream dataset. While this approach has not thus far been used for topic classification, it is valuable to include as another language-agnostic approach to representing Wikipedia articles.
- For more details, see this example in which navigation embeddings are used to find articles related to the Covid-19 pandemic.
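The session-building rule described above can be sketched as follows. The `(device_id, timestamp, qid)` tuple layout and the `build_sessions` helper are hypothetical; how devices are actually identified in the pageview data is not covered here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

MAX_GAP_SECONDS = 3600  # no more than one hour between consecutive pageviews


def build_sessions(pageviews: Iterable[Tuple[str, int, str]]) -> List[List[str]]:
    """Group pageviews into per-device sessions of Wikidata QIDs.

    Each pageview is (device_id, unix_timestamp, qid). A new session
    starts whenever the gap to the previous pageview from the same
    device exceeds MAX_GAP_SECONDS.
    """
    by_device: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for device, ts, qid in pageviews:
        by_device[device].append((ts, qid))

    sessions: List[List[str]] = []
    for views in by_device.values():
        views.sort()  # chronological order within a device
        current: List[str] = []
        last_ts = None
        for ts, qid in views:
            if last_ts is not None and ts - last_ts > MAX_GAP_SECONDS:
                sessions.append(current)
                current = []
            current.append(qid)
            last_ts = ts
        if current:
            sessions.append(current)
    return sessions


# Toy pageviews with placeholder device IDs and QIDs.
toy_pageviews = [
    ("a", 0, "Q1"),
    ("a", 600, "Q2"),
    ("a", 4201, "Q3"),   # 3601 seconds after the previous view: new session
    ("b", 100, "Q4"),
    ("a", 4300, "Q1"),
]
sessions = build_sessions(toy_pageviews)
print(sessions)
```

Each session then plays the role of a sentence, and each QID the role of a word, when a word2vec-style model is trained on the output.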