Research:Link recommendation model for add-a-link structured task

From Meta, a Wikimedia project coordination wiki
Jump to navigation Jump to search
14:13, 22 October 2020 (UTC)
Djellel Difallah
Duration:  2020-September – ??

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

As part of the work on structured tasks, the Growth team is aiming to implement the Add a link structured task. Here we describe the algorithm developed by the research team to automatically generate link-recommendations (see Add-a-link for the technical inrastructure). As a black box the algorithm works as follows:

  • Input: source-page/article = the article in which we want to add links
  • Output: list of recommended links with the following information
    • target-page/link = article which we want to link to
    • anchor/mention = text in source-page where link to target-page should be placed
    • link-likelihood = probability the model assigns to the link. Setting a threshold for the link-probability (thus only recommending links with a probability above, say, 0.8) allows us to make more accurate predictions, however, with the expense of missing potential useful links.

In the following we want to explain some of the details of the black box that generates the recommendations.

Problem statement[edit]

Our goal is to make recommendations to add links to a given article. The recommendation should state: i) which article to link to, ii) where in the text should the link be added (using existing text without adding additional words).


First, we identify unlinked parts of the text which could potentially contain a link. For example, finding the plain text “Berlin”, we might recommend the link Berlin, the capital of Germany. However, we have to consider that this text could also refer to Berlin (Wisconsin), or the music band Berlin, etc. As a second step, we thus have to identify possible candidates and decide which (if any of them) to recommend as the article we would recommend to link.

Looking at links in existing articles, we use machine learning to identify the most likely candidates for articles to link to. The details are described below.

Anchor dictionary[edit]

The anchor dictionary records all links and their anchors across all articles in a given Wikipedia. This will serve as a look-up table to identify potential candidates for linking. From the wikitext-dump we identify all (anchor,link)-pairs from the signature using mwparserfromhell. As a result we obtain a dictionary of the form "{mention: {link: count} }".

We include the following pre-processing steps:

  • We normalize titles of the links (i.e. uppercasing the first character, see format of article-titles)
  • We resolve redirects of article-titles (i.e. if the link is a redirect-page we record the title of the redirected-to article)
  • We only keep links to main-namespace articles in the wiki under consideration
  • We only keep anchor/links in articles that are not redirects and that are in the main namespace
  • We normalize anchors by lowercasing the string
  • We only keep anchors with a link-probability > 6.5% (Milne&Witten 2008). The link-probability is the fraction of times the mention appears as a link (the mention can also appear as plain text without any links). This is intended to avoid mentions such as “and” in the anchor-dictionary, similar to filtering stopwords in NLP.

Gold dataset[edit]

We generate a gold-dataset of fully-formed sentences in Wikipedia articles which already contain links. For this, we consider the first sentence in an article which contains at least one link. This choice is motivated mainly by the fact that later sentences might not contain links which had already occurred earlier in the article (false negatives). An example for a gold-sentence from the article Belgium in simple wikipedia would be “Belgium, officially the Kingdom of Belgium, is a federal state in Western Europe.”

Training data[edit]

In order to train a ML-model we build a training dataset containing features X and labels Y. The gold-dataset provides examples of which words are used as anchors for which links in which articles (and similarly which words are not used as anchors). Thus, we can construct a dataset of positive (Y=1) and negative (Y=0) examples. Consider the example above for the article “simple:Belgium”: “Belgium, officially the Kingdom of Belgium, is a federal state in Western Europe”. The positive examples (Y=1) can be obtained from the existing links, for example the anchor-link pair (“state”, “simple:Sovereign state”). The negative examples (Y=0) can be obtained by two different means. The first possibility are alternative links to existing anchor-link pairs; looking up the anchor-dictionary would yield the anchor-link pair (“state”, “simple:State”). The second possibility are potential anchors which are do not contain links such as “western europe”; looking up the anchor dictionary would yield the anchor-link pair (“western europe”, “simple:Western Europe”). For the gold-sentence of an article, we generate as many positive and negative anchor-link pairs and calculate the following features (X):

  • ngram: the number of words in the anchor (based on simple tokenization)
  • frequency: count of the anchor-link pair in the anchor-dictionary

ambiguity: how many different candidate links exist for an anchor in the anchor-dictionary

  • kurtosis: the kurtosis of the shape of the distribution of candidate-links for a given anchor in the anchor-dictionary
  • Levenshtein-distance: the Levensthein-distance between the anchor and the link. This measures how similar the two strings are. Roughly speaking, it corresponds to the number of single-character edits one has to make to transform one string into another, e.g. the Levensthein-distance between “kitten” and “sitting” is 3.
  • w2v-distance: similarity between the article (source-page) and the link (target-page) based on the content of the pages. This is obtained from wikipedia2vec. Similar to the concept of word-embeddings, we map each article as a vector in an (abstract) 50-dimensional space in which articles with similar content will be located close to each other. Thus given two articles (say the source article and a possible link) we can look up their vectors and get an estimate for their similarity by calculating the distance between them (more specifically, the cosine-similarity). The rationale is that a link might be more likely if the corresponding article is more similar to the source article.
  • nav-distance: similarity between the article (source-page) and the link (target-page) based on the navigation of readers. This is obtained from an approach similar to the Navigation vectors. Conceptually similar to the w2v-distance except that we construct the 50-dimensional vectors based on the sequence of articles from readers on Wikipedia. That is, two articles will be more similar if they are read together in the same reading sessions.

Thus, for each triple (article, anchor, link) we will calculate a set of features X and record the label Y=0,1 (depending on whether the anchor-link pair exist in the article.

Training the model[edit]

Given the training data consisting of features X and labels Y, we train a classifier that distinguishes between the positive and the negative examples in the training dataset. Specifically, we use xgboost with default parameters.

Generating recommendations[edit]

Generating link recommendations work in the following way once we trained the model:

  • Choose an article and get the wikitext
  • Identify an unlinked part of plain text and consider as a possible anchor
  • Identify link candidates by looking up the anchor in the anchor-dictionary
  • For the triple (article, anchor, link) calculate the features X
  • Using the features X, make a prediction for the probability P(Y=1|X) from the xgboost-model. This yields the probability that we should add the anchor-link pair in the given article
  • Rank all candidate-links for a possible anchor according to their probability
  • Return the top candidate if the probability exceeds a given threshold (say 0.5)

Evaluating the model (Backtesting)[edit]

We evaluate the trained model using a held-out sample of the gold-dataset containing sentences with existing links taken from Wikipedia articles.

We generate link recommendations for the plain text of the sentence (removing all links). We then compare the predicted anchor-link pairs with the actual anchor-link pairs and calculate precision (how many of the recommended links are true, i.e. exist in the gold-data) and recall (how many of the true links are also recommended by the model). It is probably better to aim for a higher precision (making sure that the links we recommend are actually genuine) than for a high recall (making sure that we capture all existing links) in order to reduce the amount of false positives.


First set of results (2020-10)[edit]

We ran an initial evaluation on 7 wikis (simple, de, pt, ar, cs, ko, vi). There are some variations in the performance but, overall, the model yields satisfactory results across the different languages. Aiming for a recall of around 0.5 (i.e. the link recommendation capture 50% of the true links), we get a precision between 70-80%. In some languages, the precision is even higher (e.g. vi). The reasons for these variations are not clear at the moment and requires further research.

Threshold (link-probability) 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
simple prec 0.53 0.55 0.58 0.62 0.67 0.7 0.75 0.82 0.85 0.83
recall 0.67 0.66 0.64 0.61 0.55 0.48 0.38 0.27 0.17 0.07
de prec 0.62 0.67 0.69 0.72 0.75 0.79 0.83 0.87 0.91 0.94
recall 0.63 0.62 0.6 0.58 0.54 0.48 0.41 0.33 0.24 0.11
pt prec 0.57 0.67 0.72 0.79 0.82 0.85 0.87 0.9 0.94 0.99
recall 0.54 0.57 0.58 0.58 0.57 0.52 0.46 0.4 0.3 0.18
ar prec 0.56 0.61 0.67 0.72 0.77 0.81 0.86 0.9 0.94 0.98
recall 0.43 0.43 0.42 0.4 0.37 0.34 0.29 0.24 0.18 0.08
cs prec 0.58 0.62 0.64 0.67 0.71 0.74 0.78 0.84 0.88 0.94
recall 0.56 0.55 0.54 0.51 0.48 0.42 0.35 0.25 0.13 0.03
ko prec 0.49 0.61 0.64 0.67 0.69 0.72 0.75 0.79 0.83 0.87
recall 0.4 0.4 0.4 0.4 0.39 0.37 0.35 0.3 0.21 0.08
vi prec 0.74 0.79 0.83 0.86 0.88 0.91 0.94 0.96 0.99 0.996
recall 0.81 0.78 0.76 0.74 0.72 0.68 0.63 0.58 0.53 0.48