Machine learning models/Proposed/add-a-link model

Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub

  • Model creator(s): Martin Gerlach, Kevin Bazira, Djellel Difallah, Tisza Gergő, Kosta Harlan, Rita Ho, and Marshall Miller
  • Model owner(s): MGerlach (WMF)
  • Model interface: https://api.wikimedia.org/wiki/Link_Recommendation_API
  • Publications: paper and arXiv
  • Code: mwaddlink
  • Uses PII: No
  • In production? Yes
  • Which projects? add-a-link structured tasks (most Wikipedias)

This model uses wikitext and existing links to recommend potential new links to add to an article.


This model generates suggestions for new links to be added to articles. Specifically, each suggestion for a new link contains the anchor-text and the page-title of the target article. The intended use case of the model is to generate link recommendations at scale for the add-a-link structured task.

Motivation

Contributing to Wikipedia requires familiarity not only with the technical side of the MediaWiki platform (e.g. editing tools) but also with its intricate system of policies and guidelines. These issues pose significant barriers to the retention of new editors (so-called newcomers), which is a key mechanism for maintaining or increasing the number of active contributors and thus for ensuring the functioning of an open collaboration system such as Wikipedia[1]. Various interactive tools have been introduced to lower these barriers, such as the visual editor, which reduces the technical hurdles of editing by providing a “what you see is what you get” interface.

Another promising approach to retaining newcomers is the Structured Tasks framework developed by Wikimedia’s Growth Team. This approach builds on earlier successes with suggesting edits that are easy to make (such as adding an image description), which are believed to lead to positive editing experiences and, in turn, a higher probability that editors continue participating. Structured tasks aim to generalize this workflow by breaking an editing process down into steps that are easily understood by newcomers, easy to use on mobile devices, and guided by algorithms.

The task of adding links has been identified as an ideal candidate for this process: i) adding links is a frequent work type and is considered an attractive task by users[2], ii) it is well defined, and iii) it can be considered low-risk with respect to vandalism or other negative outcomes.

Based on the "Add a link" Experiment Analysis, we can conclude that the add-a-link structured task improves outcomes for newcomers compared with both a control group that did not have access to the Growth features and a group that had the unstructured "add links" task, particularly when it comes to constructive (non-reverted) edits. The most important findings are:

  • Newcomers who get the Add a Link structured task are more likely to be activated (i.e. make a constructive first article edit).
  • They are also more likely to be retained (i.e. come back and make another constructive article edit on a different day).
  • The feature also increases edit volume (i.e. the number of constructive edits made across the first couple of weeks), while at the same time improving edit quality (i.e. reducing the likelihood that the newcomer's edits are reverted).

Users and uses

Use this model for
Get recommendations for new links to add to articles in Wikipedia
Don't use this model for
Getting recommendations for articles in Wikipedia languages that did not pass the backtesting check: as, bo, diq, dv, dz, fy, gan, hyw, ja, krc, mnw, my, pi, shn, sn, szy, ti, ur, wuu, zh, zh_classical, zh_yue. The backtesting suggests that the model would likely yield very low-quality recommendations for these languages, so the corresponding models were not published and are not available.

Ethical considerations, caveats, and recommendations

  • The performance varies across languages, for at least two main reasons. First, parsing some languages is challenging; for example, standard word tokenization, which relies on whitespace to separate tokens, does not work for Japanese. Second, for some languages we have very little training data because the corresponding Wikipedias contain few articles. We implemented a backtesting evaluation to make sure that each deployed model passes a minimum level of quality.

Model

Performance

The model was evaluated offline (test data, see below) and manually by editors for an initial set of 6 languages.

Project   Offline precision (test data)   Offline recall (test data)   Manual precision (editors)
arwiki    0.754                           0.349                        0.92
bnwiki    0.743                           0.297                        0.75
cswiki    0.778                           0.43                         0.7
enwiki    0.832                           0.457                        0.78
frwiki    0.823                           0.464                        0.82
viwiki    0.903                           0.656                        0.73

For all other languages (301 languages in total), the model was evaluated only offline with the test data. In practice, we required a precision of 0.7–0.75 or higher so that the majority of suggestions would be true positives. As a result, we discarded the models for 23 languages. For details, see the report.
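
As a rough illustration of this acceptance criterion, the sketch below (with hypothetical helper names, not the actual mwaddlink evaluation code) computes precision and recall by comparing predicted links against the held-out links of a test set and applies the precision threshold:

def precision_recall(predicted_links, gold_links):
    """predicted_links / gold_links: sets of (source page-title, target page-title, anchor-text) triplets."""
    true_positives = len(predicted_links & gold_links)
    precision = true_positives / len(predicted_links) if predicted_links else 0.0
    recall = true_positives / len(gold_links) if gold_links else 0.0
    return precision, recall


def passes_backtesting(predicted_links, gold_links, min_precision=0.75):
    # Accept a language's model only if its offline precision meets the threshold.
    precision, _ = precision_recall(predicted_links, gold_links)
    return precision >= min_precision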


Implementation

Sketch of the link recommendation model for the add-a-link structured task

The model works in three stages to identify links in an article:

  • Mention detection: we parse the text of the article and identify n-grams of tokens that are not yet linked as potential anchor texts.
  • Link generation: for a given anchor-text, we generate potential link candidates for the target page-title from an anchor dictionary. The anchor dictionary stores the anchor-text and the target page-title of all already existing links in the corresponding Wikipedia. The same anchor-text can yield more than one link candidate. We only generate link candidates for an anchor-text if that link has been used at least once before.
  • Link disambiguation: for each candidate link consisting of the triplet (source page-title, target page-title, anchor text) we predict a probability via a binary classification task. In practice, we use XGBoost’s gradient boosting trees[3]. As model input we use the following features:
    • ngram: the number of words in the anchor (based on simple tokenization)
    • frequency: count of the anchor-link pair in the anchor-dictionary
    • ambiguity: how many different candidate links exist for an anchor in the anchor-dictionary
    • kurtosis: the kurtosis of the shape of the distribution of candidate-links for a given anchor in the anchor-dictionary
    • Levenshtein-distance: the Levenshtein distance between the anchor and the link. This measures how similar the two strings are; roughly speaking, it corresponds to the number of single-character edits needed to transform one string into the other, e.g. the Levenshtein distance between “kitten” and “sitting” is 3.
    • w2v-distance: similarity between the article (source page) and the link (target page) based on the content of the pages. This is obtained from wikipedia2vec[4]. Similar to the concept of word embeddings, we map each article to a vector in an (abstract) 50-dimensional space in which articles with similar content are located close to each other. Thus, given two articles (say the source article and a possible link target), we can look up their vectors and estimate their similarity by calculating the distance between them (more specifically, the cosine similarity). The rationale is that a link is more likely to be relevant if the corresponding article is similar to the source article.

More details can be found here.
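
To make the three stages concrete, here is a minimal sketch in Python. It assumes the anchor dictionary, wikipedia2vec embeddings, and trained classifier are already available, works on pre-tokenized text, and omits the kurtosis feature; all names are illustrative rather than taken from the mwaddlink code.

import numpy as np
import Levenshtein  # pip install python-Levenshtein


def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def candidate_ngrams(tokens, max_n=3):
    # 1. Mention detection: yield n-grams of tokens as potential anchor texts
    # (the real model also skips spans that are already linked).
    for n in range(max_n, 0, -1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def recommend_links(source_title, tokens, anchor_dict, embeddings, model, threshold=0.5):
    """anchor_dict: {anchor-text: {target page-title: count}}; embeddings: {page-title: 50-d vector}."""
    suggestions = []
    for anchor in candidate_ngrams(tokens):
        candidates = anchor_dict.get(anchor, {})  # 2. Link generation from the anchor dictionary
        for target, count in candidates.items():
            features = [
                len(anchor.split()),                   # ngram
                count,                                 # frequency
                len(candidates),                       # ambiguity
                Levenshtein.distance(anchor, target),  # Levenshtein-distance
                cosine_similarity(embeddings[source_title], embeddings[target]),  # w2v-distance
            ]
            prob = model.predict_proba(np.array([features]))[0, 1]  # 3. Link disambiguation
            if prob >= threshold:
                suggestions.append({"link_text": anchor, "link_target": target, "score": float(prob)})
    return suggestions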


Model architecture
XGBoost model: XGBClassifier with default values
Output schema
{
  links: <Array of link objects>,
  links_count: <Number of recommendations>,
  page_title: <Article title>,
  pageid: <Article identifier>,
  revid: <Revision identifier>
}

The link object looks like:

{
  context_after: <Characters immediately succeeding the link text; may include spaces, punctuation, and partial words>,
  context_before: <Characters immediately preceding the link text; may include spaces, punctuation, and partial words>,
  link_index: <0-based index of the link recommendation within all link recommendations, sorted by wikitext offset>,
  link_target: <Title of the article that should be linked to>,
  link_text: <Phrase to link in the article text>,
  match_index: <0-based index of the link anchor within the list of matches when searching for the phrase to link within simple wikitext (top-level wikitext that is not part of any other wikitext construct)>,
  score: <Probability that the link should be added>,
  wikitext_offset: <Character offset describing where the anchor begins>
}
Example input and output
GET /service/linkrecommendation/v1/linkrecommendations/wikipedia/en/Earth

Output

{
  "links": [
    {
      "context_after": " to distin",
      "context_before": "cially in ",
      "link_index": 0,
      "link_target": "Science fiction",
      "link_text": "science fiction",
      "match_index": 0,
      "score": 0.5104129910469055,
      "wikitext_offset": 13852
    }
  ],
  "links_count": 1,
  "page_title": "Earth",
  "pageid": 9228,
  "revid": 1013965654
}
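
For illustration, the same request can be made from Python with the requests library. This assumes the service is exposed through the api.wikimedia.org gateway listed under "Model interface" above; adjust the base URL for other deployments.

import requests

BASE_URL = "https://api.wikimedia.org/service/linkrecommendation/v1"

response = requests.get(f"{BASE_URL}/linkrecommendations/wikipedia/en/Earth")
response.raise_for_status()

# Print each recommendation in the format shown in the example output above.
for link in response.json()["links"]:
    print(f'{link["link_text"]} -> {link["link_target"]} (score {link["score"]:.2f})')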

Data

Data pipeline
We generate a gold-standard dataset of sentences with the aim of ensuring high recall in terms of existing links (i.e. ideally these sentences should not be missing any links). For a given article, we only pick the first sentence containing at least one link; subsequent sentences are more likely to be missing links because, according to Wikipedia’s style guide, a concept should generally be linked only once. For each language, we generate 200k sentences (or fewer if the language contains fewer articles). We split the gold-standard dataset into a training set and a test set (50% − 50%).
Training data
For each sentence in the training data, we store the existing links as positive instances (𝑌 = 1) with their triplet (source page-title, target page-title, anchor-text). For each link we calculate the corresponding features X. We generate negative instances (𝑌 = 0), i.e. non-existing links, via two different mechanisms. First, for a positive triplet we generate triplets (source page-title, target page-title, anchor-text) by looking up alternative candidate links for the same anchor-text in the anchor dictionary. Second, we generate triplets by identifying unlinked anchor-texts in the sentence and looking up the corresponding candidate links for the target page-title in the anchor dictionary. (A simplified sketch of this procedure follows the Test data description below.)
Test data
For each sentence in the test data, we identify the existing links as positive instances and record the corresponding triplets (source page-title, target page-title, anchor-text). We then run the linking algorithm on the raw text with all links removed (each link is replaced by its anchor text). In practice, we i) generate possible anchor texts iteratively (giving preference to longer anchor texts); ii) predict the most likely link from all the candidate links of the anchor text found in the anchor dictionary; and iii) accept the link if the probability the model assigns exceeds a threshold value 𝑝∗ ∈ [0,1].
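
The sketch below illustrates both the training-instance generation and the test-time linking procedure under simplifying assumptions: sentences are represented as token lists, the anchor dictionary maps anchor-texts to candidate target page-titles, and score_candidate stands in for the feature computation and classifier call from the Implementation section. All names are illustrative, not taken from the mwaddlink code.

from xgboost import XGBClassifier


def build_training_instances(source_title, existing_links, unlinked_anchors, anchor_dict):
    """Training data: existing_links is {anchor-text: target page-title} for links in the sentence;
    unlinked_anchors are anchor-texts found in the sentence that are not linked."""
    instances = []  # (source page-title, target page-title, anchor-text, label)
    for anchor, target in existing_links.items():
        instances.append((source_title, target, anchor, 1))               # positive instance (Y = 1)
        for alternative in anchor_dict.get(anchor, {}):
            if alternative != target:
                instances.append((source_title, alternative, anchor, 0))  # negative: alternative candidate
    for anchor in unlinked_anchors:
        for candidate in anchor_dict.get(anchor, {}):
            instances.append((source_title, candidate, anchor, 0))        # negative: unlinked anchor-text
    return instances


def train_disambiguation_model(X_train, y_train):
    """Fit the binary classifier on the feature matrix X and labels Y derived from the instances."""
    model = XGBClassifier()  # default hyperparameters, as stated under Model architecture
    model.fit(X_train, y_train)
    return model


def link_sentence(source_title, tokens, anchor_dict, score_candidate, p_star, max_n=3):
    """Test data: score_candidate(source, target, anchor, candidates) returns the classifier
    probability for a candidate triplet; p_star is the acceptance threshold."""
    accepted = []
    linked_positions = set()
    for n in range(max_n, 0, -1):                              # i) prefer longer anchor texts
        for i in range(len(tokens) - n + 1):
            if any(pos in linked_positions for pos in range(i, i + n)):
                continue                                       # skip tokens already claimed by a link
            anchor = " ".join(tokens[i:i + n])
            candidates = anchor_dict.get(anchor, {})
            if not candidates:
                continue
            best_target, best_prob = max(                      # ii) most likely candidate link
                ((target, score_candidate(source_title, target, anchor, candidates))
                 for target in candidates),
                key=lambda pair: pair[1],
            )
            if best_prob > p_star:                             # iii) accept only above the threshold p*
                accepted.append((anchor, best_target, best_prob))
                linked_positions.update(range(i, i + n))
    return accepted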

Licenses


Citation

Cite this model as[5]:

@INPROCEEDINGS{gerlach2021multilingual,
title = "Multilingual Entity Linking System for Wikipedia with a {Machine-in-the-Loop} Approach",
booktitle = "Proceedings of the 30th {ACM} International Conference on Information \& Knowledge Management",
author = "Gerlach, Martin and Miller, Marshall and Ho, Rita and Harlan, Kosta and Difallah, Djellel",
publisher = "Association for Computing Machinery",
pages = "3818--3827",
series = "CIKM '21",
year =  2021,
address = "New York, NY, USA",
location = "Virtual Event, Queensland, Australia",
doi = "10.1145/3459637.3481939"
}
  1. Halfaker, A., Geiger, R. S., Morgan, J. T., & Riedl, J. (2013). The rise and decline of an open collaboration system. The American Behavioral Scientist, 57(5), 664–688. https://doi.org/10.1177/0002764212469365
  2. Cosley, D., Frankowski, D., Terveen, L., & Riedl, J. (2007). SuggestBot: using intelligent task routing to help people find work in wikipedia. Proceedings of the 12th International Conference on Intelligent User Interfaces, 32–41. https://doi.org/10.1145/1216295.1216309
  3. Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. https://doi.org/10.1145/2939672.2939785
  4. Yamada, I., Asai, A., Sakuma, J., Shindo, H., Takeda, H., Takefuji, Y., & Matsumoto, Y. (2020). Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 23–30. https://doi.org/10.18653/v1/2020.emnlp-demos.4
  5. Gerlach, M., Miller, M., Ho, R., Harlan, K., & Difallah, D. (2021). Multilingual Entity Linking System for Wikipedia with a Machine-in-the-Loop Approach. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 3818–3827. https://doi.org/10.1145/3459637.3481939