Research talk:Characterizing Wikipedia Reader Behaviour/Demographics and Wikipedia use cases/Work log/2019-09-11


Thursday, September 12, 2019

This work log documents my progress on extending the ORES drafttopic model, which predicts topics for a given English Wikipedia article, to other language editions of Wikipedia. The goal is to map any given Wikipedia article to one or more human-interpretable labels that identify the high-level topics related to that article. These topics can then be used to understand reader behavior by mapping page views from millions of articles down to a much smaller set of topics. In particular, for the debiasing and analysis of the reader demographic surveys, I have well over one million unique article page views across more than one hundred languages that need to be evaluated.

Example

If someone were to read the article for the Storm King Art Center, an outdoor sculpture garden in the Hudson Valley, NY, USA, we would want to map this page view to topics such as Culture (it is an outdoor art museum) and Geography (it has a physical location). Furthermore, we may want more fine-grained labels within Culture, such as Art, and within Geography, such as North America. One approach to doing this is through WikiProjects -- there exist many hundreds of WikiProjects on English Wikipedia, each associated with a specific topic, and each of these WikiProjects has added its template to the articles it considers important to its topic. In this case, the talk page for Storm King has been tagged with templates from WikiProject Museums, WikiProject Visual arts, WikiProject Hudson Valley, and WikiProject Public Art. Based on the WikiProject Directory, a mapping can be built between these specific WikiProjects and the higher-level categories they belong to -- in this case: "Culture.Arts", "Culture.Plastic arts", "Culture.Visual arts", and "Geography.Americas".
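The lookup itself amounts to a dictionary from WikiProject names to high-level topic labels. A minimal Python sketch, with a hand-written toy mapping standing in for the directory-derived one:

```python
# Sketch of the WikiProject -> topic lookup described above. This toy
# mapping stands in for the one derived from the WikiProject Directory.
wikiproject_to_topic = {
    "WikiProject Museums": ["Culture.Arts"],
    "WikiProject Visual arts": ["Culture.Visual arts"],
    "WikiProject Public Art": ["Culture.Plastic arts"],
    "WikiProject Hudson Valley": ["Geography.Americas"],
}

def topics_for_article(tagged_wikiprojects):
    """Union of high-level topics for the WikiProjects tagging an article."""
    topics = set()
    for wp in tagged_wikiprojects:
        topics.update(wikiproject_to_topic.get(wp, []))
    return sorted(topics)

print(topics_for_article(["WikiProject Museums", "WikiProject Hudson Valley"]))
# ['Culture.Arts', 'Geography.Americas']
```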

In practice, mapping an article to the WikiProjects that have tagged it, and from there to a list of topics, is not straightforward. In some cases, the Directory is not well-formed, and WikiProjects can be inadvertently left out (e.g., WikiProject Europe and several others appear outside of the sections for Geography.Europe) or assigned to odd categories (e.g., a broken link for Cities of the United States can assign all WikiProjects under Geography.Americas to Geography.Cities instead). A given WikiProject may also use multiple templates. For WikiProject Public Art, the template used with Storm King is actually WikiProject Wikipedia Saves Public Art rather than Template:WikiProject Public Art. And finally, while most English Wikipedia articles have been tagged by at least one WikiProject, any given WikiProject has likely not tagged many articles that could reasonably be associated with its topic area.

Why WikiProjects?

No taxonomy of topics will be perfect, and mapping all of Wikipedia to ~50 topics is an incredibly reductive task. This work uses a taxonomy of topics based on WikiProjects from English Wikipedia. This naturally raises concerns about whether these topics are appropriate for other language editions, but I have not encountered a clearly superior taxonomy for Wikipedia articles, and the WikiProjects taxonomy is easily derived and modifiable. Further details can be found in the initial paper written about the drafttopic model[1].

Additional models based purely on Wikidata were also explored but abandoned because the Wikidata instance-of / subclass taxonomy does not map closely enough to more general topic taxonomies. Wikipedia categories are famously difficult to map to a coherent taxonomy and do not readily scale across languages either. Outside taxonomies such as DBpedia introduce additional data processing complexities. More details are contained within the See Also section below for those who are interested in exploring these alternatives.

Modeling

While looking up the WikiProjects that have tagged an article works well for long-standing articles on English Wikipedia, some method is needed for automatically inferring these topics when articles are new or outside of English Wikipedia. That is, a model needs to be built that can predict what topics should be applied to any given Wikipedia article.

Existing drafttopic model

The existing ORES drafttopic model takes the text of a page, represents each word via word embeddings, and makes predictions based on an average of these word embeddings. This allows it to capture the nuances of language while not requiring the article to already have much of the structure (Wikidata, links) that established articles on Wikipedia often have. This approach was taken so that the model would be applicable to drafts of new articles. It has the drawback, however, of being difficult to scale to other languages. Approaches such as multilingual word embeddings are still largely unproven and bring about many other challenges around preprocessing, loading the rather large word embeddings into memory, and needing the text for each article (which is nontrivial when analyzing over one million articles across more than 100 language editions).
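As an illustration of that representation, an averaged word-embedding article vector might be computed as follows. The vocabulary and 4-dimensional vectors are made up for the example; the real model uses pretrained word embeddings of much higher dimensionality:

```python
# Toy illustration of the drafttopic representation: each word maps to a
# vector, and an article is represented by the element-wise mean of those
# vectors. Embeddings here are invented for the example.
EMBED_DIM = 4
embeddings = {
    "sculpture": [0.9, 0.1, 0.0, 0.2],
    "garden":    [0.4, 0.8, 0.1, 0.0],
    "museum":    [0.8, 0.2, 0.1, 0.1],
}

def article_vector(words):
    """Average the embeddings of in-vocabulary words; zeros if none match."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return [0.0] * EMBED_DIM
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

v = article_vector(["sculpture", "garden", "unknownword"])  # out-of-vocab words ignored
```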

Wikidata Model

For my particular context of representing page views to existing Wikipedia articles, I am not restricted to just article text. To avoid building separate language models for each Wikipedia, I represent a given article not by its text but by the statements on its associated Wikidata item. This is naturally language-independent and, intuitively, many Wikidata statements map directly to topics (e.g., an item with the occupation property and physician value should probably fall under STEM.Medicine). These Wikidata statements are treated like a bag of words.
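A minimal sketch of this bag-of-words representation, flattening each claim into a property=value token (the property and value IDs below are illustrative):

```python
# Sketch of the "bag of words" representation over Wikidata statements.
# Each claim is flattened into a single property=value token; the model
# then treats an item as the collection of these tokens.
def claims_to_tokens(claims):
    """Flatten {property: [values]} claims into bag-of-words tokens."""
    tokens = []
    for prop, values in sorted(claims.items()):
        for value in values:
            tokens.append(f"{prop}={value}")
    return tokens

# Illustrative claims: P31 = instance of (human), P106 = occupation (physician)
item_claims = {"P31": ["Q5"], "P106": ["Q39631"]}
tokens = claims_to_tokens(item_claims)
# ['P106=Q39631', 'P31=Q5']
```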

Gathering training data

I wrote a script that loops through the dump of the current version of English Wikipedia, checks pages in the article talk namespace (namespace 1), and retains any talk page that has templates whose names include "wp" or "wikiproject".
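The template filter can be sketched roughly as below. A real run would iterate the XML dump with a proper parser (e.g., mwxml) rather than matching raw wikitext with a regex, but the filtering rule is the same:

```python
import re

# Rough sketch of the talk-page filter: keep a page if any template name
# includes "wp" or "wikiproject" (case-insensitive). The regex captures a
# template name as everything after "{{" up to the first "|" or "}".
TEMPLATE_RE = re.compile(r"\{\{\s*([^|}]+)")

def wikiproject_templates(wikitext):
    """Return template names on a talk page that look WikiProject-related."""
    names = [m.strip() for m in TEMPLATE_RE.findall(wikitext)]
    return [n for n in names
            if "wp" in n.lower() or "wikiproject" in n.lower()]

talk = "{{WikiProject Museums|class=B}} {{WPVA}} {{Talk header}}"
print(wikiproject_templates(talk))
# ['WikiProject Museums', 'WPVA']
```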

These templates are mapped to topics via the existing drafttopic code (with a few adjustments to clean the directory).

Another script then loops through the Wikidata JSON dump and maps each talk page and its topics to a Wikidata item (joining on the title, or the QID if available) and its associated claims.
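The join can be sketched as follows; the item records, QIDs, and claims below are toy stand-ins for entries in the actual Wikidata JSON dump:

```python
# Sketch of the join between labeled talk pages and the Wikidata dump.
# labeled_pages maps an enwiki article title to its WikiProject-derived
# topics; items stands in for records parsed from the Wikidata JSON dump.
labeled_pages = {"Storm King Art Center": ["Culture.Arts", "Geography.Americas"]}

items = [
    {"id": "Q123456",  # toy QID
     "sitelinks": {"enwiki": {"title": "Storm King Art Center"}},
     "claims": {"P31": ["Q33506"]}},
    {"id": "Q42", "sitelinks": {}, "claims": {"P31": ["Q5"]}},
]

def join_items(items, labeled_pages):
    """Yield (QID, topics, claims) for items whose enwiki title was labeled."""
    for item in items:
        title = item["sitelinks"].get("enwiki", {}).get("title")
        if title in labeled_pages:
            yield item["id"], labeled_pages[title], item["claims"]

rows = list(join_items(items, labeled_pages))
```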

Building supervised model

I use fastText to build a model that predicts a given Wikidata item's topics based on its claims. A more complete description of fastText and how to build this model is contained within this PAWS notebook. Notably, there is some pre-processing to go from the JSON files output above to fastText-ready files.
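A sketch of that pre-processing: fastText's supervised format expects each training example on one line, with one or more `__label__` prefixes followed by the feature tokens. Since topic labels can contain spaces, one common convention (assumed here) is to replace them with underscores:

```python
# Sketch of the pre-processing from (topics, claim tokens) rows to
# fastText's supervised-learning input format. Spaces in labels are
# replaced with underscores since fastText splits labels on whitespace.
def to_fasttext_line(topics, claim_tokens):
    labels = " ".join("__label__" + t.replace(" ", "_") for t in topics)
    return labels + " " + " ".join(claim_tokens)

line = to_fasttext_line(["Culture.Visual arts"], ["P31=Q33506"])
# '__label__Culture.Visual_arts P31=Q33506'
```

Given a file of such lines, a multi-label model can then be trained with fastText's supervised mode (e.g., `fasttext.train_supervised` in the Python bindings, with `loss='ova'` for independent per-label probabilities).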

Performance

As a baseline, this Wikidata model is compared to the existing drafttopic model for English Wikipedia (more drafttopic statistics here). Notably, this is not an apples-to-apples comparison because the Wikidata model is trained on a much larger and much less balanced dataset than the drafttopic dataset. For both models, the measured false positive rate is much higher than the true rate for many classes due to the sparsity of WikiProject templating -- i.e., there are generally many articles that reasonably fit within a given WikiProject but have not been labeled as such.
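For reference, the micro scores below aggregate true/false positive and false negative counts across all classes (so large classes dominate), while the macro scores average per-class scores with equal weight. A small illustration with made-up counts:

```python
# Micro vs. macro F1 from per-class (tp, fp, fn) counts; the counts are
# invented to show how a large, easy class dominates the micro score.
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

per_class = {"Culture.Arts": (90, 10, 20), "STEM.Medicine": (5, 5, 15)}

# Macro: average of per-class F1 scores.
macro_f1 = sum(f1(*c) for c in per_class.values()) / len(per_class)

# Micro: pool the counts across classes, then compute one F1.
tp = sum(c[0] for c in per_class.values())
fp = sum(c[1] for c in per_class.values())
fn = sum(c[2] for c in per_class.values())
micro_f1 = f1(tp, fp, fn)  # dominated by the larger class
```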

Grid search was used to determine the best choice of fastText hyperparameters. This demonstrated that model performance is largely robust to the specific choices, though higher embedding dimensionality, learning rates, and epoch counts led to greater overfitting.
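The search itself is an exhaustive sweep over the parameter grid; in the sketch below, `train_and_eval` is a placeholder for training a fastText model with one configuration and scoring it on the validation split:

```python
from itertools import product

# The hyperparameter grid swept in the table below (2*3*3*3*3 = 162 runs).
grid = {
    "dim": [50, 100],
    "epoch": [10, 20, 30],
    "lr": [0.05, 0.1, 0.2],
    "minCount": [3, 5, 10],
    "ws": [5, 10, 20],
}

def train_and_eval(params):
    # Placeholder: the real run trains fastText with `params` and
    # returns validation micro F1.
    return 0.0

configs = [dict(zip(grid, vals)) for vals in product(*grid.values())]
best = max(configs, key=train_and_eval)
```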

Results from the grid search (see the fastText documentation for hyperparameter definitions):
dim  epoch  lr  minCount  ws  train_micro_F1  val_micro_F1  train_macro_F1  val_macro_F1
50 10 0.05 3 5 0.815 0.809 0.64 0.627
50 10 0.05 3 10 0.815 0.808 0.64 0.624
50 10 0.05 3 20 0.815 0.808 0.64 0.626
50 20 0.05 3 5 0.822 0.812 0.661 0.635
50 20 0.05 3 10 0.822 0.811 0.661 0.634
50 20 0.05 3 20 0.822 0.811 0.66 0.634
50 30 0.05 3 5 0.827 0.813 0.674 0.64
50 30 0.05 3 10 0.826 0.813 0.674 0.639
50 30 0.05 3 20 0.826 0.812 0.673 0.638
50 10 0.05 5 5 0.813 0.808 0.637 0.624
50 10 0.05 5 10 0.814 0.808 0.637 0.625
50 10 0.05 5 20 0.814 0.808 0.637 0.625
50 20 0.05 5 5 0.82 0.811 0.655 0.633
50 20 0.05 5 10 0.82 0.811 0.655 0.632
50 20 0.05 5 20 0.82 0.811 0.655 0.632
50 30 0.05 5 5 0.823 0.812 0.666 0.638
50 30 0.05 5 10 0.823 0.812 0.666 0.637
50 30 0.05 5 20 0.823 0.812 0.666 0.638
50 10 0.05 10 5 0.811 0.807 0.632 0.62
50 10 0.05 10 10 0.811 0.807 0.632 0.62
50 10 0.05 10 20 0.811 0.807 0.632 0.621
50 20 0.05 10 5 0.816 0.809 0.647 0.629
50 20 0.05 10 10 0.816 0.809 0.647 0.628
50 20 0.05 10 20 0.816 0.809 0.647 0.629
50 30 0.05 10 5 0.818 0.81 0.655 0.631
50 30 0.05 10 10 0.818 0.81 0.655 0.632
50 30 0.05 10 20 0.818 0.81 0.655 0.632
100 10 0.05 3 5 0.815 0.809 0.641 0.625
100 10 0.05 3 10 0.815 0.808 0.64 0.626
100 10 0.05 3 20 0.815 0.808 0.641 0.627
100 20 0.05 3 5 0.822 0.811 0.662 0.635
100 20 0.05 3 10 0.822 0.811 0.662 0.634
100 20 0.05 3 20 0.822 0.812 0.662 0.636
100 30 0.05 3 5 0.827 0.813 0.675 0.639
100 30 0.05 3 10 0.827 0.813 0.675 0.64
100 30 0.05 3 20 0.826 0.812 0.674 0.639
100 10 0.05 5 5 0.814 0.808 0.639 0.624
100 10 0.05 5 10 0.814 0.808 0.638 0.625
100 10 0.05 5 20 0.814 0.808 0.638 0.624
100 20 0.05 5 5 0.82 0.811 0.656 0.633
100 20 0.05 5 10 0.82 0.811 0.657 0.633
100 20 0.05 5 20 0.82 0.811 0.656 0.632
100 30 0.05 5 5 0.823 0.812 0.667 0.638
100 30 0.05 5 10 0.823 0.812 0.667 0.637
100 30 0.05 5 20 0.823 0.812 0.667 0.637
100 10 0.05 10 5 0.812 0.807 0.634 0.622
100 10 0.05 10 10 0.811 0.807 0.634 0.622
100 10 0.05 10 20 0.811 0.807 0.633 0.62
100 20 0.05 10 5 0.816 0.809 0.648 0.629
100 20 0.05 10 10 0.816 0.809 0.648 0.629
100 20 0.05 10 20 0.816 0.809 0.647 0.629
100 30 0.05 10 5 0.818 0.81 0.656 0.633
100 30 0.05 10 10 0.818 0.81 0.655 0.632
100 30 0.05 10 20 0.818 0.81 0.656 0.632
50 10 0.1 3 5 0.817 0.809 0.65 0.631
50 10 0.1 3 10 0.817 0.81 0.65 0.631
50 10 0.1 3 20 0.817 0.809 0.65 0.631
50 20 0.1 3 5 0.824 0.812 0.67 0.638
50 20 0.1 3 10 0.824 0.812 0.67 0.638
50 20 0.1 3 20 0.824 0.812 0.671 0.638
50 30 0.1 3 5 0.828 0.813 0.683 0.639
50 30 0.1 3 10 0.829 0.813 0.683 0.64
50 30 0.1 3 20 0.828 0.813 0.683 0.64
50 10 0.1 5 5 0.815 0.809 0.646 0.63
50 10 0.1 5 10 0.815 0.809 0.646 0.629
50 10 0.1 5 20 0.815 0.809 0.647 0.63
50 20 0.1 5 5 0.821 0.811 0.663 0.636
50 20 0.1 5 10 0.821 0.811 0.663 0.636
50 20 0.1 5 20 0.821 0.811 0.663 0.635
50 30 0.1 5 5 0.825 0.812 0.673 0.637
50 30 0.1 5 10 0.825 0.812 0.674 0.638
50 30 0.1 5 20 0.824 0.812 0.673 0.636
50 10 0.1 10 5 0.813 0.808 0.64 0.625
50 10 0.1 10 10 0.813 0.808 0.641 0.625
50 10 0.1 10 20 0.813 0.808 0.641 0.626
50 20 0.1 10 5 0.817 0.81 0.654 0.631
50 20 0.1 10 10 0.817 0.809 0.653 0.631
50 20 0.1 10 20 0.817 0.81 0.653 0.632
50 30 0.1 10 5 0.82 0.81 0.661 0.633
50 30 0.1 10 10 0.819 0.81 0.661 0.633
50 30 0.1 10 20 0.819 0.81 0.661 0.633
100 10 0.1 3 5 0.817 0.809 0.651 0.631
100 10 0.1 3 10 0.817 0.809 0.651 0.631
100 10 0.1 3 20 0.817 0.809 0.651 0.631
100 20 0.1 3 5 0.824 0.812 0.671 0.639
100 20 0.1 3 10 0.824 0.812 0.67 0.636
100 20 0.1 3 20 0.824 0.812 0.671 0.637
100 30 0.1 3 5 0.828 0.813 0.683 0.641
100 30 0.1 3 10 0.828 0.813 0.682 0.641
100 30 0.1 3 20 0.829 0.813 0.684 0.641
100 10 0.1 5 5 0.816 0.809 0.648 0.632
100 10 0.1 5 10 0.815 0.809 0.648 0.631
100 10 0.1 5 20 0.816 0.809 0.648 0.63
100 20 0.1 5 5 0.821 0.811 0.664 0.636
100 20 0.1 5 10 0.821 0.811 0.664 0.636
100 20 0.1 5 20 0.821 0.811 0.664 0.637
100 30 0.1 5 5 0.825 0.812 0.674 0.637
100 30 0.1 5 10 0.825 0.812 0.674 0.637
100 30 0.1 5 20 0.825 0.812 0.674 0.637
100 10 0.1 10 5 0.813 0.808 0.641 0.626
100 10 0.1 10 10 0.813 0.808 0.641 0.625
100 10 0.1 10 20 0.813 0.808 0.642 0.627
100 20 0.1 10 5 0.817 0.81 0.654 0.633
100 20 0.1 10 10 0.817 0.81 0.654 0.631
100 20 0.1 10 20 0.817 0.809 0.653 0.631
100 30 0.1 10 5 0.819 0.81 0.661 0.634
100 30 0.1 10 10 0.819 0.81 0.661 0.634
100 30 0.1 10 20 0.819 0.81 0.661 0.633
50 10 0.2 3 5 0.82 0.811 0.658 0.636
50 10 0.2 3 10 0.819 0.81 0.658 0.635
50 10 0.2 3 20 0.819 0.81 0.657 0.636
50 20 0.2 3 5 0.826 0.812 0.677 0.639
50 20 0.2 3 10 0.827 0.813 0.678 0.641
50 20 0.2 3 20 0.827 0.813 0.678 0.641
50 30 0.2 3 5 0.831 0.813 0.69 0.64
50 30 0.2 3 10 0.831 0.813 0.689 0.641
50 30 0.2 3 20 0.831 0.813 0.689 0.64
50 10 0.2 5 5 0.817 0.81 0.653 0.633
50 10 0.2 5 10 0.817 0.81 0.654 0.634
50 10 0.2 5 20 0.818 0.81 0.653 0.633
50 20 0.2 5 5 0.823 0.812 0.67 0.637
50 20 0.2 5 10 0.823 0.812 0.67 0.637
50 20 0.2 5 20 0.823 0.812 0.67 0.636
50 30 0.2 5 5 0.826 0.812 0.679 0.637
50 30 0.2 5 10 0.826 0.812 0.679 0.637
50 30 0.2 5 20 0.827 0.812 0.68 0.637
50 10 0.2 10 5 0.814 0.809 0.646 0.63
50 10 0.2 10 10 0.814 0.808 0.646 0.63
50 10 0.2 10 20 0.815 0.808 0.646 0.629
50 20 0.2 10 5 0.819 0.81 0.658 0.633
50 20 0.2 10 10 0.819 0.81 0.658 0.633
50 20 0.2 10 20 0.818 0.81 0.658 0.632
50 30 0.2 10 5 0.821 0.81 0.665 0.631
50 30 0.2 10 10 0.821 0.81 0.665 0.632
50 30 0.2 10 20 0.821 0.81 0.665 0.631
100 10 0.2 3 5 0.819 0.81 0.658 0.638
100 10 0.2 3 10 0.819 0.81 0.658 0.636
100 10 0.2 3 20 0.819 0.81 0.659 0.637
100 20 0.2 3 5 0.827 0.813 0.678 0.639
100 20 0.2 3 10 0.827 0.813 0.678 0.64
100 20 0.2 3 20 0.827 0.813 0.679 0.639
100 30 0.2 3 5 0.831 0.813 0.69 0.639
100 30 0.2 3 10 0.831 0.813 0.69 0.641
100 30 0.2 3 20 0.831 0.813 0.69 0.639
100 10 0.2 5 5 0.817 0.81 0.653 0.634
100 10 0.2 5 10 0.817 0.81 0.653 0.633
100 10 0.2 5 20 0.818 0.81 0.655 0.635
100 20 0.2 5 5 0.823 0.812 0.671 0.637
100 20 0.2 5 10 0.823 0.812 0.671 0.638
100 20 0.2 5 20 0.823 0.812 0.67 0.636
100 30 0.2 5 5 0.826 0.812 0.679 0.637
100 30 0.2 5 10 0.826 0.812 0.679 0.637
100 30 0.2 5 20 0.826 0.812 0.679 0.638
100 10 0.2 10 5 0.814 0.808 0.647 0.632
100 10 0.2 10 10 0.814 0.808 0.647 0.63
100 10 0.2 10 20 0.814 0.808 0.646 0.63
100 20 0.2 10 5 0.819 0.81 0.659 0.633
100 20 0.2 10 10 0.819 0.81 0.658 0.631
100 20 0.2 10 20 0.818 0.81 0.658 0.632
100 30 0.2 10 5 0.821 0.81 0.665 0.632
100 30 0.2 10 10 0.821 0.81 0.666 0.633
100 30 0.2 10 20 0.821 0.81 0.666 0.634

Based on the grid search, a model built with the following hyperparameters was evaluated on the test set: lr = 0.1; dim = 50; minCount = 3; epoch = 30:

Model      | Micro Precision | Macro Precision | Micro Recall | Macro Recall | Micro F1 | Macro F1
drafttopic | 0.826           | 0.811           | 0.576        | 0.554        | 0.668    | 0.643
Wikidata   | 0.881           | 0.809           | 0.762        | 0.560        | 0.811    | 0.643

Qualitative

To explore this Wikidata-based model, you can query it via a local API as described in this code repository: https://github.com/geohci/wikidata-topic-model

See Also

References

  1. Asthana, Sumit; Halfaker, Aaron (November 2018). "With Few Eyes, All Hoaxes Are Deep". Proc. ACM Hum.-Comput. Interact. 2 (CSCW): 21:1–21:18. ISSN 2573-0142. doi:10.1145/3274290.