Research:Improving link coverage/Supporting anchor text insertion
This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.
Hyperlinks are important for ensuring easy and effective navigation of any website. Yet the rapid rate at which websites evolve (thanks to the big data era), renders maintaining the quality and structure of hyperlinks to be a challenging task. This problem can be broken down further into two subtasks:
- Given a set of webpages, identify pairs of pages that should be linked together?
- Having identified a pair of pages to be linked, where in the source page should the link(s) be inserted?
The first task reduces to solving a link prediction/completion problem, thereby identifying a list of candidate pages to be linked to a given page, which was addressed in our previous work on improving link coverage.
The second task on the other hand, can be seen as performing entity insertion. Specifically, when the source page contains appropriate anchor texts for the new link, these anchor texts become the candidate positions for the new link; all that remains to do is to ask a human which anchor text is best-suited. Things become more interesting when the source contains no anchor text for the new link; here it is far less clear where to insert the link. In essence, we have found a topic that is not yet, but should be, mentioned in the source, and the task is not simply to decide which existing anchor text to use, but rather where in the page to insert a new anchor text. There are two key challenges in performing entity insertion:
- Absence of anchor text is not rare: Based on our empirical analysis, we found that cases where the anchor text is not yet present in the source article correspond to about 50% of the total number of entity insertions in Wikipedia articles. The distribution of entity insertions into 4 different insertion types from the months of May and June 2020 is shown in figure on the right.
- Majority of articles are long: About 80% of Wikipedia articles contain 100 sentences or more, thus, identifying insertion positions for new entities within a Wikipedia article is a daunting task requiring the editors to scan the entire article text, thereby increasing their cognitive load.
Here we develop an approach for mining human navigation traces to automatically find candidate positions for new links to be inserted in a webpage. Our intention is to demonstrate the effectiveness of our approach by evaluating it using Wikipedia server logs.
- Wikipedia has grown rapidly since its inception: from 163 articles in January 2001 to 57M articles in January 2022. Based on the statistics from 2022 , on an average approximately 20K new articles are added to Wikipedia every day. With about 100K edits per day, 25-30K edits lead to link insertions. Without doubt, this enormous growth has promoted the prosperity of Wikipedia enriching the content in terms of both quantity and variety. However, this growth also warrants continuous quality control towards maintaining the link structure, if not improving it, which at this scale either requires a humongous army of editors or powerful automatic methods that can aid editors in this task.
- Moreover, in the light of the aforementioned challenges, there is a need for methods with the following characteristics.
- Ability to operate in the absence of textual context (targeted towards the red portion of the above plot), i.e., cases where the anchor text is not yet present in the source article.
- Ability to make locality-aware predictions, i.e., predictions that lie in the neighborhood of the true insertion position, if not at the true position. Ideally, the methods should be able to learn a distribution that can be overlaid on a Wikipedia article as a heatmap.
- Light-weight, easy-to-update, and scalable: Insertions can be on different pages and with different characteristics, and to accommodate the rapid rate of such insertions, models need to be updated at least every week, if not daily.
- In this work, we develop a method for estimating the probability distribution over all potential insertion positions within an article that stands strong on all three aforementioned requirements. This will in turn be used to suggest potential link insertion positions by overlaying a heatmap on the text of the source page, thereby reducing the cognitive load of the editors. Specifically, instead of going through the entire page to find the best place to insert the new link, the editors have to process only a small fraction of the total content of the page. Note that a human in the loop is still useful to ensure that the new links are added at close-to-optimal positions, and the overall link structure is indeed of high quality.
Data and Methodology
We construct navigation traces for readers of Wikipedia articles using Wikimedia's webrequest server logs from April, 2020. We only consider internal navigation, i.e., navigation to/from websites external to Wikipedia are ignored. Also, we used the articles as they existed on April 1st, 2020, by obtaining the article revision ids aligned with the April 1st XML dump. Articles in HTML were crawled using their revision ids.
The webrequest logs contain an entry for each HTTP request to Wikimedia servers, specifying information, including, but not limited to, timestamp, requested URL, referrer URL, client IP address, and user agent information.
Navigation traces are constructed following the approach proposed by Paranjape et al. . As a first step, we construct navigation sessions for each Wikipedia reader. Note that a reader may visit multiple pages from a given page by opening different browser tabs, and thus, navigation sessions are innately trees and not chains. We construct navigation trees by stitching together pageviews using the referrer information. From this, we obtain navigation sequences by extracting all root-to-leaf paths from each navigation tree with at least 2 nodes. We observed that most sequences are short, however, long sequences do exist. For instance, 75% (600M) of the sequences in the English Wikipedia are of length 2, but only 0.5% (4M) are of length 10 or more.
We propose a data-driven approach that relies on the following intuitions for predicting potential positions of a new link from to in the source page .
- Given an indirect path , the new link from to should appear in the proximity of the clicked link to in the source page , since is connected to via on the path (which hints to this being the case also in the user’s mind).
- The more frequently is the successor of on paths to , the more peaked the position distribution should be around .
- The shorter the paths from to that have as the immediate successor of , the more peaked the position distribution should be around .
We have verified all the aforementioned intuitions using the navigation traces extracted from Wikimedia's server logs. The results are presented in Figure 2.
Let be the position of the existing link from to , and be the distance between and the new link from to in the text of the source page . We propose a bayesian network to estimate the joint probability distribution over all insertion positions. Formally,
where, is the relative frequency of appearing as a successor on the indirect path , is a uniform distribution over all the insertions of the link in , and is the distribution of the distance between and estimated from the navigation logs. It is important to note that we use the links inserted in May, 2020 to estimate of the NavNet model.
Please see Figure 3 for a pictorial representation of the NavNet model.
Note that we currently use only requests to the desktop version of the English Wikipedia, but our method generalizes to any language edition.
We also introduce multiple alternative approaches that serve as baselines to assess the efficacy of the proposed NavNet model. Specifically, the baselines correspond to the following two broad categories:
- Methods that use the affinity of text and entities. We propose an approach using pre-trained models for obtaining representations from textual content of Wikipedia articles: (1) GloVe: Static word embeddings, and (2) SBERT: A state-of-the-art pre-trained large language model.
- Methods that use existing links in a page, and learn a mixture model. Similar to the previous methods, we use pre-trained models for obtaining representations from textual content: (1) GloVe: Static word embeddings, and (2) SBERT: A state-of-the-art pre-trained large language model, but also use models for obtaining representations from the underlying network structure of hyperlinked Wikipedia articles: (1) DeepWalk, (2) LouvainNE, (3) NetSMF, and (4) Lemane.
Using the aforementioned representations, we then infer the probability distribution over all insertion positions in a page by measuring the similarity of the new link with either different text spans or other links existent on the page .
We evaluate the efficacy of NavNet by comparing its performance with the aforementioned baselines using links inserted in Wikipedia in the month of June 2020. We only consider those inserted links that were existent in the extracted navigation traces and were clicked more than 10 times, which results in an evaluate set of about 5000 inserted links. Moreover, we consider two broad types of queries (cf. Figure 1):
- 'Easy': The anchor text or context already existent, only the entity mention needs to be inserted.
- 'Hard': Both anchor text and context absent, and an entire block of text along with the new entity mention needs to be inserted.
We perform extensive evaluation using multiple meaningful evaluation metrics, belonging to two categories: (1) Locality-aware and (2) Ranking-based.
- NDCG: Normalized cumulative discount gain, by giving higher relevance to those closer to the ground-truth insertion position.
- EMD: Earth mover distance
- MRR: Mean reciprocal rank
For the metrics, we vary .
As portrayed in Table 1, we find that NavNet statistically significantly outperforms all the strong baselines across a variety of considered metrics, thereby showcasing the ability of NavNet to provide considerable advancements towards address the entity insertion problem.
It is also important to note that NavNet is extremely straight-forward to maintain as it requires about 1 hour to extract information from logs, when compared to all the other baselines, which require training times in the order of days, if not weeks. Moreover, as portrayed in Table 2, NavNet is 3-5 times more efficient (in terms of inference time), and possesses 5-10 times smaller memory footprint.
We also conducted an ablation analysis determining the strength of different components. It can be noted from Figure 4 that the sole knowledge of links that were clicked to reach a target is quite strong, as it captures 90% of the NavNet model performance. Next, the relative frequency that captures the strength of each component, and lastly, the decay function, in fact gaussian decay works quite well. This finding substantiates that NavNet is simple to maintain as it does not depend substantially on the learned distributions, and just the knowledge of which links were clicked is enough in certain practical scenarios.
This formal research collaboration is based on a mutual agreement between the collaborators to respect Wikimedia user privacy and focus on research that can benefit the community of Wikimedia researchers, volunteers, and the WMF. To this end, the researchers who work with the private data have entered in a non-disclosure agreement as well as a memorandum of understanding.
- ↑ Paranjape, A., West, R., Zia, L., Leskovec, J. (2016). Improving Website Hyperlink Structure Using Server Logs. In WSDM. https://dl.acm.org/doi/abs/10.1145/1188966.1188996