Research:Language-Agnostic Topic Classification/Model comparison

As of September 2020, there are three different topic classification models that all seek to achieve the same objective: apply topics from the ORES taxonomy to a given Wikipedia article. As decisions are made about how to unify these models into a single approach, we realized that we needed a better understanding of how similar their predictions are. At a high level, the text-based approach is much more computationally and resource-intensive than either the outlinks-based or Wikidata-based approach, but it requires only article text and is presumed to be most reflective of the actual content of an article, and therefore the most accurate. Given that the text-based model currently hosted in ORES is also the most-vetted model (its predictions have been evaluated by volunteers in four languages), it was chosen as the reference point, and the other two models (Wikidata-based and outlinks-based) were evaluated to determine how much their predictions overlap with it. Of note, this work was overseen by Isaac (WMF) but largely executed by HAKSOAT.
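
For reference, predictions from the text-based model hosted in ORES can be retrieved via the public scoring API. Below is a minimal sketch; the revision ID and the 0.5 probability threshold are illustrative assumptions, not values used in this analysis:

```python
import requests

# ORES scoring endpoint for the text-based "articletopic" model.
ORES_URL = "https://ores.wikimedia.org/v3/scores/{wiki}/{revid}/articletopic"

def get_text_based_topics(wiki, revid, threshold=0.5):
    """Return the set of topics scored at or above `threshold` (threshold is illustrative)."""
    resp = requests.get(ORES_URL.format(wiki=wiki, revid=revid))
    resp.raise_for_status()
    score = resp.json()[wiki]["scores"][str(revid)]["articletopic"]["score"]
    return {topic for topic, p in score["probability"].items() if p >= threshold}

# Hypothetical usage: topics for one English Wikipedia revision (revid is made up).
# print(get_text_based_topics("enwiki", 987654321))
```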

Research Questions

We gathered random samples of articles from the Arabic (ar), Czech (cs), English (en), and Vietnamese (vi) Wikipedias. The sample size varied, but in all cases at least 1000 articles were analyzed. For both the Wikidata-based and outlinks-based models, we asked the following questions (a sketch of the resulting categorization logic follows the list):

  • For what proportion of articles does this model provide the same predictions as the text-based model?
    • This is ideal as the text-based model has been validated so identical performance means that the models could be used interchangeably. We refer to this as exact.
  • For what proportion of articles does this model simply have lower recall than the text-based model -- i.e., all of its predictions are also made by the text-based model?
    • This is less ideal but from the standpoint of users who receive e.g., article recommendations, there should be no noticeable difference if they are being provided recommendations from a pool of e.g., 1000 articles instead of 5000. We refer to this as similar.
  • For what proportion of articles does this model predict additional topics not predicted by the text-based model but all of these additional topics are part of the groundtruth data?
    • This is a new prediction that the text-based model would not have made but that is part of the groundtruth, so we can safely assume it would be an acceptable prediction for users. We also refer to this as similar.
  • For what proportion of articles does this model predict additional topics not predicted by the text-based model but at least some of these additional topics are NOT part of the groundtruth data?
    • These are the predictions that we have to pay the most attention to, as they are validated by neither the text-based model nor the groundtruth. In practice, they may be completely acceptable (it is well known that the groundtruth data is incomplete), but a high percentage of articles in this category would indicate that further evaluation is necessary.
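
Concretely, each of these questions reduces to set comparisons between the topic labels predicted per article. A minimal sketch of the categorization, using our own (hypothetical) function and variable names rather than the original analysis code:

```python
def categorize(candidate, text_based, groundtruth):
    """Assign one article's predictions to an agreement category.

    candidate:   set of topics predicted by the Wikidata- or outlinks-based model
    text_based:  set of topics predicted by the reference text-based model
    groundtruth: set of topics in the article's groundtruth labels
    """
    if candidate == text_based:
        return "exact"
    extra = candidate - text_based
    if not extra:
        # Lower recall: every prediction is also made by the text-based model.
        return "similar (lower recall)"
    if extra <= groundtruth:
        # All additional predictions are validated by the groundtruth.
        return "similar (higher recall)"
    # At least one prediction is in neither the text-based output nor the groundtruth.
    return "different"

# Hypothetical usage:
# categorize({"STEM.Biology"}, {"STEM.Biology", "Geography.Africa"}, {"STEM.Biology"})
# -> "similar (lower recall)"
```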

For some languages, we also segmented the results by various metadata such as article length, number of Wikidata statements, and number of outlinks. Where pertinent, those results are also reported below.
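
As an illustration of this segmentation, articles can be bucketed by a metadata field and the agreement-category rates tabulated per bucket. A sketch with toy stand-in data (the real analysis would load per-article results; column names are ours):

```python
import pandas as pd

# Toy stand-in data: one row per article.
df = pd.DataFrame({
    "length": [120, 450, 900, 3000, 15000, 80, 2200, 60000],  # bytes
    "category": ["different", "similar", "exact", "exact",
                 "exact", "different", "similar", "exact"],
})

# Bucket articles into length quartiles, then compute the share of each
# agreement category within each bucket.
df["length_quartile"] = pd.qcut(df["length"], q=4, labels=False)
rates = (
    df.groupby("length_quartile")["category"]
      .value_counts(normalize=True)
      .unstack(fill_value=0)
)
print(rates)  # share of exact / similar / different per length bucket
```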

Results

At a high level, we see that the Wikidata-based and outlinks-based models perform similarly, though the outlinks-based model has both slightly higher precision and slightly higher recall. The models produce exactly the same predictions as the text-based model for about 50% of articles. For another 40% of articles, the models produce similar predictions (either lower or higher recall). For the remaining 10% of articles, the models predict different topics -- i.e., topics that appear in neither the groundtruth nor the text-based model's predictions.

An examination of this 10% of predictions indicates that most are acceptable and not a cause for concern, though the Women's Biography label stands out as incorrectly applied by the outlinks-based model, so further analysis or discussion of how to handle that label is necessary. For some languages, these articles are concentrated heavily among stubs (the bottom 10-20% of articles by length).

See Also