Research:Inferring Wikipedia missing links

Zekarias Tilahun Kefato
Cristian Consonni
Alberto Montresor
Duration:  2017-03 – ??

This page is an incomplete draft of a research project.
Information is incomplete and is likely to change substantially before the project starts.


When studying and analyzing a complex network, missing information frequently limits the understanding of the network that can be achieved. For example, in social networks we may observe the diffusion of a meme or a hashtag, the adoption of a product, etc., but we do not know the exact source of influence, given that the underlying network may be (partially) unknown.[1]

Wikipedia is another instance where we may encounter missing information, in the form of missing links between pages that are strongly related but not directly connected. This may hamper the user experience, as the user may have to navigate through many intermediate pages before reaching the desired target. Missing links are due to the extremely dynamic nature of Wikipedia, which changes faster than the available editors can monitor.


As part of Research:Improving link coverage, Paranjape et al.[2] pointed out how useful the Wikipedia link structure is for providing sufficient and coherent information, and how difficult it is for humans to monitor and maintain related links because of its dynamic nature. We propose an approach that is completely different from that of Paranjape et al.[2] and that has the potential to achieve a significant improvement in extracting relevant links, and thus to assist human editors in a more meaningful way. Our approach has already been applied to the problem of inferring edges in social networks[1] and has proven to be quite effective. However, it has not yet been tested at scale, and we believe that Wikipedia is a very good fit for this purpose.


To tackle the network inference problem, we have developed a novel algorithm and have achieved encouraging results compared with the state of the art; a paper on this topic is under peer review for the MLG2017 conference[1].

Our approach can be summarized in the following two phases:

  1. We utilize multiple cascading events that capture meaningful structural patterns of the network, to obtain information about potential missing edges;
  2. Inspired by recent studies on representation learning of words in natural language documents, we exploit existing results in that area and apply them to retrieve the latent structure of the network.
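
The two phases above can be illustrated with a deliberately simplified sketch. It is not the algorithm of the paper[1]: as a stand-in for learned representations, it builds plain co-occurrence count vectors from cascades and ranks unlinked node pairs by cosine similarity. The function names, the `window` parameter, and the treatment of edges as undirected are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def cooccurrence_vectors(cascades, window=2):
    """Phase 1 (simplified): for each node, count the nodes that appear
    within `window` steps of it in any cascade."""
    vectors = defaultdict(Counter)
    for cascade in cascades:
        for i, node in enumerate(cascade):
            lo = max(0, i - window)
            hi = min(len(cascade), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[node][cascade[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidate_edges(cascades, known_edges, window=2):
    """Phase 2 (simplified): score every node pair that is not already
    linked, highest similarity first. Edges are treated as undirected."""
    vectors = cooccurrence_vectors(cascades, window)
    known = {frozenset(edge) for edge in known_edges}
    scored = []
    for a, b in combinations(sorted(vectors), 2):
        if frozenset((a, b)) in known:
            continue
        scored.append((cosine(vectors[a], vectors[b]), a, b))
    return sorted(scored, reverse=True)
```

Comparing all pairs is quadratic in the number of nodes, so a sketch like this only suits small examples; a learned, low-dimensional embedding is what would make the similarity search practical at Wikipedia scale.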

We are planning to use the data provided by the Wikimedia Foundation in two different ways:

  • We will use the information contained in the server access logs to identify missing links in the Wikipedia structure;
  • We will validate our network inference approach using the existing Wikipedia structure as ground truth.
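
One common way to use an existing link structure as ground truth is to hide a fraction of the known links, run the inference on the rest, and check how many hidden links are recovered. The sketch below assumes that protocol; the page does not specify the evaluation procedure, and the function names, the `holdout_fraction` parameter, and the precision/recall-at-k metric are illustrative choices.

```python
import random


def split_edges(edges, holdout_fraction=0.2, seed=0):
    """Split known links into an observed set and a hidden (held-out) set."""
    rng = random.Random(seed)
    shuffled = sorted(edges)  # sort first so the split is reproducible
    rng.shuffle(shuffled)
    n_hidden = int(len(shuffled) * holdout_fraction)
    return set(shuffled[n_hidden:]), set(shuffled[:n_hidden])


def precision_recall_at_k(ranked_pairs, hidden_edges, k):
    """Compare the top-k predicted pairs against the hidden links."""
    top_k = ranked_pairs[:k]
    hits = sum(1 for pair in top_k if pair in hidden_edges)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(hidden_edges) if hidden_edges else 0.0
    return precision, recall
```

Here `ranked_pairs` is a list of `(source, target)` tuples ordered from most to least likely, as produced by whatever inference method is being evaluated.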


The timeline of the project depends on when we will be able to analyze the data; the approach described in the previous section is already implemented.

Policy, Ethics and Human Subjects Research

This proposal does not involve any human subjects research.

Data collection

To achieve our goal, we need information about how users navigate the Wikipedia link structure; this information is contained in the request logs. Each navigation trace corresponds to one of the cascading events mentioned in Phase 1 of our approach[1].
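
As a rough illustration of how request-log records could be turned into cascades, the sketch below groups page requests by session and orders them by time. The record layout `(session_id, timestamp, page)` is hypothetical: the actual fields available in the request logs are not described on this page, and the function name and `min_length` parameter are likewise assumptions.

```python
from collections import defaultdict


def build_cascades(records, min_length=2):
    """Group hypothetical (session_id, timestamp, page) records into
    time-ordered page sequences, one per session."""
    by_session = defaultdict(list)
    for session_id, timestamp, page in records:
        by_session[session_id].append((timestamp, page))

    cascades = []
    for session_id in sorted(by_session):
        visits = sorted(by_session[session_id])  # order by timestamp
        pages = [page for _, page in visits]
        if len(pages) >= min_length:  # single-page sessions carry no path
            cascades.append(pages)
    return cascades
```

For example, the records `("s1", 2, "B")`, `("s1", 1, "A")`, and `("s1", 3, "C")` would yield the single cascade `["A", "B", "C"]`.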


  1. Kefato Z, Sheikh N, and Montresor A. "DeepInfer: Diffusion Network Inference through Representation Learning". Submitted to MLG2017. (preprint)
  2. Paranjape, Ashwin; West, Robert; Zia, Leila; Leskovec, Jure (2016). "Improving Website Hyperlink Structure Using Server Logs" (PDF). Proceedings of the 9th International ACM Conference on Web Search and Data Mining (WSDM): 615–624. doi:10.1145/2835776.2835832. Retrieved 2017-03-13.