Research:Improving link coverage/Supporting anchor text insertion

From Meta, a Wikimedia project coordination wiki

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


Types of entity insertions in Wikipedia in May and June 2020. A: exact match between the anchor text and the entity name; B: the anchor text was previously used as a mention of this entity; C: the anchor text is absent, but the new entity is inserted in a pre-existing block of text; D: the new entity is inserted in a new block of text.

Hyperlinks are important for ensuring easy and effective navigation of any website. Yet the rapid rate at which websites evolve makes maintaining the quality and structure of hyperlinks a challenging task. This problem can be broken down into two subtasks:

  1. Given a set of webpages, which pairs of pages should be linked together?
  2. Having identified a pair of pages to be linked, where in the source page should the link(s) be inserted?

The first task reduces to a link prediction (or link completion) problem: identifying a list of candidate pages to be linked from a given page. We addressed this task in our previous work on improving link coverage.

The second task, on the other hand, can be seen as performing entity insertion. When the source page already contains appropriate anchor texts for the new link, these anchor texts become the candidate positions for the new link; all that remains is to ask a human which anchor text is best suited. Things become more interesting when the source contains no anchor text for the new link; here it is far less clear where to insert the link. In essence, we have found a topic that is not yet, but should be, mentioned in the source, and the task is not simply to decide which existing anchor text to use, but rather where in the page to insert a new anchor text. Our empirical analysis shows that cases where the anchor text is not yet present in the source article are not rare. On the contrary, such cases account for 50% of all entity insertions in Wikipedia. The distribution of entity insertions across four insertion types, for the months of May and June 2020, is shown in the accompanying figure.

Here we develop an approach for mining human navigation traces to automatically find candidate positions for new links to be inserted in a webpage. Our intention is to demonstrate the effectiveness of our approach by evaluating it using Wikipedia server logs.


  • Wikipedia has grown rapidly since its inception: from 163 articles in January 2001 to 42.9M articles in December 2018. Based on the statistics from 2018 [1], approximately 8,000 new articles were added to Wikipedia every day on average. This enormous growth has enriched Wikipedia's content in both quantity and variety. However, it also calls for continuous quality control to maintain the link structure, if not improve it. At this scale, that requires either a very large number of editors or powerful automatic methods that can aid editors in this task.
  • In this work, our intention is to develop a method for estimating the probability distribution over all potential insertion positions. This will in turn be used to suggest potential link insertion positions by overlaying a heatmap on the text of the source page, thereby reducing the cognitive load of the editors. Specifically, instead of going through the entire page to find the best place to insert the new link, the editors have to process only a small fraction of the total content of the page. Note that a human in the loop is still useful to ensure that the new links are added at close-to-optimal positions, and the overall link structure is indeed of high quality.


We propose a data-driven approach for predicting potential positions of a new link from a source page s to a target page t within the text of s. It relies on the following intuitions, where m denotes a page that is already linked from s:

  1. Given an indirect path s → m → … → t, the new link from s to t should appear in the proximity of the clicked link to m in the source page s, since t is connected to s via m on the path (which hints that this is also the case in the user's mind).
  2. The more frequently m is the successor of s on paths to t, the more peaked the position distribution should be around m.
  3. The shorter the paths from s to t that have m as the immediate successor of s, the more peaked the position distribution should be around m.

Let p_m be the position of the existing link from s to m, and d be the distance between p_m and the new link from s to t in the text of the source page s. We propose a Bayesian network to estimate the joint probability distribution over all insertion positions. Formally, the model combines three factors: the relative frequency of m appearing as the successor of s on the indirect path s → m → … → t; a uniform distribution over all occurrences of the link to m in s; and the distribution of the distance d between p_m and the new link, estimated from the navigation logs.
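As a rough illustration of how the three factors described above could be combined, the following sketch scores every candidate position in a source page. It assumes that positions are integer token offsets, that the factors are multiplied and summed over all intermediate pages and all occurrences of their links, and that probability mass falling outside the page is dropped before renormalizing. The function name, the argument names, and the toy inputs are hypothetical and are not taken from the project.

```python
def position_distribution(successor_counts, link_positions, distance_pmf, n_positions):
    """Return one probability per candidate insertion position (0..n_positions-1).

    successor_counts: {page: number of logged indirect paths on which that page
                       was clicked right after the source} (hypothetical input)
    link_positions:   {page: [offsets of existing links to that page in the source]}
    distance_pmf:     {signed distance: probability}, assumed to be estimated
                      from navigation logs
    """
    total_paths = sum(successor_counts.values())
    scores = [0.0] * n_positions
    for page, count in successor_counts.items():
        positions = link_positions.get(page, [])
        if not positions or total_paths == 0:
            continue  # no existing link to this page, so no positional evidence
        p_successor = count / total_paths      # relative successor frequency
        p_occurrence = 1.0 / len(positions)    # uniform over link occurrences
        for anchor in positions:
            for distance, p_distance in distance_pmf.items():
                target = anchor + distance
                if 0 <= target < n_positions:
                    scores[target] += p_successor * p_occurrence * p_distance
    total = sum(scores)
    # Renormalize because mass that fell outside the page was discarded.
    return [x / total for x in scores] if total > 0 else scores


# Toy example: page "A" follows the source on 3 of 4 paths, page "B" on 1 of 4.
dist = position_distribution(
    successor_counts={"A": 3, "B": 1},
    link_positions={"A": [10], "B": [2, 40]},
    distance_pmf={-1: 0.25, 0: 0.5, 1: 0.25},
    n_positions=50,
)
best_position = dist.index(max(dist))
```

In this toy setting the distribution peaks at the position of the link to the most frequent successor, which matches the second intuition. The resulting list could serve as the per-position intensity of a heatmap.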


The probability of inserting a new link at different positions in a page can also be estimated from the surrounding text. To this end, we learn aligned embeddings for words and links from the existing links in each Wikipedia article. Once such embeddings are learned, the probability distribution over all insertion positions in a page can be obtained by measuring the similarity of the new link to the different text spans of the page.
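The text-based scoring described above might look like the following sketch. It assumes that each text span is represented by the mean of its word embeddings, that similarity is measured with cosine similarity, and that similarities are turned into a distribution with a softmax. These choices, the `temperature` parameter, and all names are assumptions for illustration; the source does not specify them.

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors (assumed span representation)."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def span_distribution(link_vec, span_vecs, temperature=1.0):
    """Softmax over cosine similarities between the link and each span.

    Assumes span_vecs is non-empty and temperature is positive.
    """
    sims = [cosine(link_vec, v) / temperature for v in span_vecs]
    top = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in sims]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2-dimensional embeddings: the first span points in the link's direction.
link = [1.0, 0.0]
spans = [
    mean_vector([[1.0, 0.0], [0.8, 0.2]]),
    mean_vector([[0.0, 1.0], [0.1, 0.9]]),
    mean_vector([[-1.0, 0.0], [-0.9, 0.1]]),
]
probs = span_distribution(link, spans)
```

A lower `temperature` would concentrate the distribution on the most similar span, while a higher one would flatten it.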


  • We plan to use navigation traces extracted from Wikimedia's server logs, where all HTTP requests to Wikimedia projects are logged.


  • Quantitative: For quantitative evaluation, we use metrics from the recommender systems literature, such as Recall@k and Mean Reciprocal Rank (MRR). To obtain these metrics, we sort the insertion positions in decreasing order of estimated probability and identify the rank of the ground-truth insertion position in this sorted list. The ranks are then used to compute Recall@k (for chosen cutoffs k) and MRR.
  • Qualitative: For qualitative measures, we intend to run a crowdsourcing experiment in which the control condition asks editors to insert links without the heatmap, while the treatment condition asks them to insert links with the heatmap. We then measure and compare the following for both conditions:
  1. How long it takes them to find a position?
  2. How cumbersome they find the experience?
  3. How highly others rate the chosen position?
  4. How frequently inserted links are reverted/clicked?
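The quantitative metrics mentioned above can be sketched as follows. The code assumes one ground-truth insertion position per test case and breaks ties between equally probable positions by their order in the page; both are assumptions for illustration, and the function names are hypothetical.

```python
def rank_of_truth(probs, truth_index):
    """1-based rank of the ground-truth position, sorted by decreasing probability."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order.index(truth_index) + 1


def recall_at_k(ranks, k):
    """Fraction of test cases whose ground-truth position is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test cases."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy example: three pages, each with estimated probabilities and a true position.
cases = [
    ([0.1, 0.6, 0.3], 1),       # truth has the highest probability
    ([0.5, 0.2, 0.3], 2),       # truth has the second-highest probability
    ([0.4, 0.3, 0.2, 0.1], 3),  # truth has the lowest probability
]
ranks = [rank_of_truth(probs, truth) for probs, truth in cases]
recall_1 = recall_at_k(ranks, 1)
mrr = mean_reciprocal_rank(ranks)
```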

Research Terms

This formal research collaboration is based on a mutual agreement between the collaborators to respect Wikimedia user privacy and focus on research that can benefit the community of Wikimedia researchers, volunteers, and the WMF. To this end, the researchers who work with the private data have entered into a non-disclosure agreement as well as a memorandum of understanding.