
Machine learning models/Proposed/Article country

From Meta, a Wikimedia project coordination wiki
Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Isaac Johnson
Model owner(s): Isaac Johnson
Model interface: UI; API
Past performance: Research meta page
Code: API and Data pipeline
Uses PII: no
In production? no
Which projects? all Wikipedia languages
This model uses Wikidata properties, wikilinks, and categories to predict which countries are relevant to a Wikipedia article.


Taking a Wikipedia article and determining which countries are relevant to it is not a straightforward task. As explored in the Knowledge Gaps Taxonomy, there are different ways in which article subjects are connected to countries. Some are very direct -- e.g., a place being physically located in a country or a person being born in a country. Others are far hazier -- e.g., a species being native to a region, or a cuisine or type of music being culturally connected with a country. This model seeks to operationalize these different definitions of relevance and combine the resulting signals into a complete set of predictions for which countries are relevant to any given Wikipedia article.

Specifically, the model takes the union of three signals:

  • Wikidata properties: these include coordinates for places, citizenship and other information for most biographies, and the generic country property (P17), which covers a wide range of relationships. These are used directly by the model.
  • Categories: many categories are explicitly modeled on Wikidata as being related to a particular country -- e.g., Flora of France -- and use of such a category in a Wikipedia article can signify the article's connection to that country. These are used directly by the model.
  • Wikilinks: for country relationships that are difficult to model via Wikidata/categories or have not yet been modeled, we fall back on the links in an article to indicate other relevant countries. If enough links in an article point to a given country, that is treated as a valid prediction as well.
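
The union of the three signals can be sketched as follows. This is a minimal illustration rather than the model's actual code: the function name `predict_countries` and its three arguments are hypothetical, and each signal is assumed to have already been reduced to a collection of country names.

```python
def predict_countries(wikidata_countries, category_countries, link_countries):
    """Return the sorted union of the countries suggested by each signal.

    Hypothetical helper: each argument is an iterable of country names
    produced by one signal (Wikidata properties, categories, wikilinks).
    """
    combined = set(wikidata_countries) | set(category_countries) | set(link_countries)
    return sorted(combined)


# A country needs support from only one signal to be included.
example = predict_countries({"France"}, {"France", "Spain"}, set())
print(example)
```

Because the signals are combined by union, adding a signal can only add countries to the prediction set; it never removes one.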

Motivation


Editors generally want to edit content with which they have some familiarity -- i.e., they know something about the topic, its context, the types of sources that might be relevant and reliable, etc. An important aspect of this familiarity is some amount of geographic/cultural proximity to the topic, and countries are one very important scale at which this proximity is often determined. More fine-grained scales -- e.g., the topic is relevant to your particular state or town -- are often important too, but in practice they can be difficult to operationalize, whereas countries are a relatively constrained vocabulary that is already used in many places on the Wikimedia projects. Earlier article-region models used even larger regions such as Western Europe or South America, which were reported as too coarse to be useful as filters for many editors. This is especially true for organizers receiving grant funding that expects them to focus on content within their particular country: they currently cannot use much of the recommender system tooling (Content Translation, Newcomer Tasks, etc.) to directly address their needs.

Users and uses

Use this model for
  • high-level analyses of Wikipedia dynamics
  • filtering / ranking articles in tools – e.g., only showing articles about Ecuador in a recommender system
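
As a rough sketch of the filtering use case, assuming predictions have already been fetched for a set of candidate articles: the function name `filter_by_country` and the sample predictions below are made up for illustration and are not output from the model.

```python
def filter_by_country(predictions, country):
    """Keep the titles whose predicted countries include `country`.

    `predictions` is a hypothetical mapping of article title -> list of
    predicted country names (e.g. the "countries" field of the API output).
    """
    return [title for title, countries in predictions.items() if country in countries]


# Made-up predictions, for illustration only.
candidates = {
    "Quito": ["Ecuador"],
    "Japanese iris": ["Japan"],
    "Cotopaxi": ["Ecuador"],
}
ecuador_titles = filter_by_country(candidates, "Ecuador")
print(ecuador_titles)
```

Since the model favors precision over recall, a filter like this may leave out some relevant articles.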
Don't use this model for
Current uses
  • Under development but intended as a filter for recommender systems similar to articletopic

Ethical considerations, caveats, and recommendations

  • The model is oriented towards precision as opposed to recall -- i.e., do not assume that a country is irrelevant just because it is not listed as a prediction.
  • Where editors do not see countries that they would expect to be relevant to a given article, they can consider editing (where appropriate) that article's Wikidata item, its categories, or the Wikidata items for the categories themselves. Wikilinks that point to articles tied to that country may also be added, but care should be taken not to over-link just to trigger a specific model prediction.
  • No list of countries can be perfect, but details about the list used here may be found on the Research page for the model.

Model


Performance


Implementation

Model architecture
Rule-based algorithm, so nothing fancy architecture-wise.
Output schema
{
  countries: [
    <country name>,
    ... (up to 250 countries)
  ]
}
The above is the most basic output -- additional details and metadata can be provided.
Example input and output
$ curl 'https://wiki-region.wmcloud.org/regions?lang=en&title=Japanese%20iris'

Output

{
  "qid": "Q16753983",
  "countries": [
    "Japan"
  ],
  "wikidata": [],
  "links": [
    {
      "country": "Japan",
      "count": 4,
      "prop-tfidf": 0.4264280798348245
    },
    {
      "country": "United Kingdom",
      "count": 2,
      "prop-tfidf": 0.1924294562973159
    }
  ],
  "categories": []
}
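This example output shows not only the predicted countries but also the source of each prediction. Here the Wikidata and category signals are empty, so Japan is predicted from wikilinks alone, while the United Kingdom appears among the link candidates without becoming a prediction.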
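
A minimal sketch of reading such a response in Python, assuming only the fields visible in the example above (the structure of non-empty "wikidata" and "categories" entries is not shown there, so they are left untouched). The helper name `summarize_links` is hypothetical.

```python
import json

# Response body copied from the example above.
RESPONSE = """
{
  "qid": "Q16753983",
  "countries": ["Japan"],
  "wikidata": [],
  "links": [
    {"country": "Japan", "count": 4, "prop-tfidf": 0.4264280798348245},
    {"country": "United Kingdom", "count": 2, "prop-tfidf": 0.1924294562973159}
  ],
  "categories": []
}
"""


def summarize_links(result):
    """List (country, link count, became a prediction?) for each link candidate."""
    predicted = set(result["countries"])
    return [
        (entry["country"], entry["count"], entry["country"] in predicted)
        for entry in result["links"]
    ]


result = json.loads(RESPONSE)
for country, count, is_predicted in summarize_links(result):
    status = "predicted" if is_predicted else "candidate only"
    print(f"{country}: {count} links ({status})")
```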

Data


Two components of the model (Wikidata properties and Wikipedia categories) do not require any calibration/learning as they are used directly. The wikilinks portion of the model does require calibration, however, to know when a set of links is providing sufficient evidence to elevate a given country to a full prediction. That is what is described below.

Data pipeline
Wikilinks in a Wikipedia article (to other namespace 0 articles in that wiki) at the most current revision are selected from the pagelinks table and mapped to their corresponding Wikidata IDs as of a set date. Each of these Wikidata IDs is checked for existing country predictions (these are cached predictions based on just the Wikidata properties and will likely include categories too at some point). For each country reached via the links, the count of links and the proportion of total links that point to that country are tallied. The proportions are then adjusted via tf-idf logic. If a country is linked at least three times and its adjusted proportion reaches 0.25 (one-quarter), it becomes a prediction. The tf-idf transformation is included because certain countries are far more prevalent in the link data and are therefore likely to be linked even when they are not especially relevant. It is precomputed based on all pagelinks on Wikipedia.
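
The thresholding step might look roughly like the sketch below. The exact tf-idf adjustment is not spelled out on this page, so multiplying each proportion by a per-country weight is an assumption. The weights in `idf`, the function name `link_predictions`, and the inclusive treatment of both thresholds are likewise illustrative choices, not the model's actual code.

```python
from collections import Counter

MIN_LINKS = 3        # from the text: at least three links to the country
MIN_ADJUSTED = 0.25  # from the text: adjusted proportion of 0.25


def link_predictions(link_countries, idf):
    """Return link-based country predictions for one article.

    link_countries: one collection of country names per outgoing link
        (empty when the linked article has no cached country prediction).
    idf: hypothetical precomputed weights; frequently linked countries
        would get weights below 1.0. The true weighting is an assumption here.
    """
    total_links = len(link_countries)
    if total_links == 0:
        return []
    # Count each country at most once per link.
    counts = Counter(c for countries in link_countries for c in set(countries))
    predictions = []
    for country, count in counts.items():
        adjusted = (count / total_links) * idf.get(country, 1.0)
        if count >= MIN_LINKS and adjusted >= MIN_ADJUSTED:
            predictions.append({"country": country, "count": count, "prop-tfidf": adjusted})
    return sorted(predictions, key=lambda p: p["prop-tfidf"], reverse=True)


# Made-up article with eight links: four to Japan-related pages, two to UK-related pages.
links = [["Japan"]] * 4 + [["United Kingdom"]] * 2 + [[]] * 2
weights = {"Japan": 0.9, "United Kingdom": 0.7}  # illustrative values only
print(link_predictions(links, weights))
```

In this made-up case only Japan passes both thresholds. The United Kingdom is dropped on the link count alone, whatever its adjusted proportion.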
Training data
Test data
The model was evaluated by human reviewers on a few subsets of articles. Full results can be seen in this notebook; overall, across all of Wikipedia, precision is 0.975 and recall is 0.772.

Licenses


Citation


Cite this model as:

@misc{johnson_2024_articlecountry,
   title={Article country model},
   author={Johnson, Isaac},
   year={2024},
   url={https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Article_country}
}