Research:Recommending links to increase visibility of articles/Supporting entity insertion


Links are a fundamental part of Wikipedia, turning isolated pieces of knowledge into a network of information that is much richer than the sum of its parts. However, inserting a new link into the network is not trivial: it requires not only identifying the corresponding source and target articles but also reading the source article to find a suitable position at which to integrate the link into the text.

To support editors in the latter task, this project develops a multilingual model for entity insertion. The task is motivated by the use-case of increasing the visibility of specific articles in the network, such as adding incoming links to orphan articles (i.e. de-orphanization).

This work builds heavily upon (and extends) previous efforts to support anchor text insertion: Research:Improving_link_coverage/Supporting_anchor_text_insertion.

Motivation

General

Adding new knowledge to Wikipedia requires not only creating new content but also integrating it into the existing knowledge structure. In fact, editors have developed a dedicated guideline to “build the web” in English Wikipedia’s manual of style. When adding a new link, we need to, first, identify two entities that should be linked and, second, identify suitable text in the source article where the link will be added. While there is a wide variety of models and tools to improve linking, existing approaches address this problem insufficiently.

On the one hand, link recommendation aims to infer which nodes should be linked in a network[1]. Using features such as reader navigation makes it possible to suggest new useful links between articles[2]. However, these models focus exclusively on the network structure and ignore the problem that links need to be embedded in the text of the article.

On the other hand, entity-linking approaches consider the existing text and try to identify the most probable link target for specific tokens or substrings in the text (the anchor). This approach is used in many existing tools for Wikipedia, such as the add-a-link model[3].

Use-case: De-orphanization

In our work on orphan articles[4], we found a surprisingly large number of orphan articles in Wikipedia (~9M articles across all Wikipedia language versions), which are de facto invisible to readers navigating Wikipedia. We described a promising approach, based on link translation, to identify candidate links that increase the visibility of orphan articles (de-orphanization). While this gives us a source and a target article for the candidate link, a remaining challenge is where to insert the link in the text of the source article. This step is crucial to make the link recommendations for de-orphanization more actionable for editors.

Methods

Problem description

We consider the problem of entity insertion in Wikipedia: Given a source and target article, the goal of entity insertion is to identify the most suitable text span in the source article for inserting a link to the target article. Specifically, we operationalize the task of entity insertion as a ranking problem, where the goal is to rank all the candidate text spans in the source article by how related they are to the target article.
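The ranking formulation can be illustrated with a minimal sketch. The scoring function below (cosine similarity over bag-of-words counts) is only a stand-in chosen for illustration and is not the model developed in this project; the function names and the example texts are hypothetical.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text: str) -> Counter:
    """Lowercased bag-of-words representation of a text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_spans(candidate_spans: list[str], target_text: str) -> list[tuple[float, str]]:
    """Rank candidate spans of the source article by similarity to the target text.

    Assumption: relatedness is approximated by lexical overlap only.
    """
    target_vec = tokenize(target_text)
    scored = [(cosine(tokenize(span), target_vec), span) for span in candidate_spans]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Made-up candidate spans from a source article and a made-up target lead.
spans = [
    "The city hosts an annual film festival.",
    "Its harbour is one of the busiest ports in the region.",
]
target = "A port is a maritime facility where ships load and discharge cargo in a harbour."
ranking = rank_spans(spans, target)
```

Here the second span is ranked first because it shares vocabulary with the target text. A purely lexical score like this one cannot place a link when the relevant mention is absent from the source article.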

Data

We constructed a new multilingual dataset for the entity insertion task in Wikipedia. The dataset consists of all the links from all Wikipedia pages, together with each link's surrounding context and additional metadata (such as page titles, QIDs and lead paragraphs). Overall, the dataset contains 958M links from 49M pages in 105 languages. The largest language is English (en), with 166.7M links from 6.7M pages, and the smallest language is Xhosa (xh), with 2.8K links from 1.6K pages. The data processing was done in two steps.

Existing links. We first extracted all the links present on 2023-10-01. We extracted the content of all articles from their HTML version using the corresponding snapshot of the Enterprise HTML dumps. We removed articles without a lead paragraph and without a QID. For each article, we considered all internal links in the main article body (ignoring figures, tables, notes, and captions) together with their surrounding context. We removed all links where either the source or the target article was one of the removed articles, and we dropped all self-links.
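The filtering rules above can be sketched as follows. This is a simplified illustration, not the project's pipeline: the `Article` and `Link` records and the `filter_links` function are hypothetical, and the sketch assumes that an article is dropped when either its QID or its lead paragraph is missing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Article:
    title: str
    qid: Optional[str]   # Wikidata identifier, None if missing
    lead: Optional[str]  # lead paragraph, None if missing


@dataclass(frozen=True)
class Link:
    source: str   # title of the article containing the link
    target: str   # title of the linked article
    context: str  # text surrounding the link


def filter_links(articles: list[Article], links: list[Link]) -> list[Link]:
    """Keep links whose endpoints are both retained articles, dropping self-links."""
    # Assumption: an article is retained only if it has both a QID and a lead.
    kept = {a.title for a in articles if a.qid and a.lead}
    return [
        link
        for link in links
        if link.source in kept and link.target in kept and link.source != link.target
    ]


# Made-up records for illustration.
articles = [
    Article("A", "Q1", "Lead of A."),
    Article("B", "Q2", "Lead of B."),
    Article("C", None, "Lead of C."),  # no QID, so the article is removed
]
links = [
    Link("A", "B", "text around the link to B"),
    Link("A", "C", "text around the link to C"),  # target was removed
    Link("A", "A", "text around a self-link"),    # self-link
]
kept_links = filter_links(articles, links)
```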

Added links. We then found all the links added between 2023-10-01 and 2023-11-01. We extracted the set of added links by comparing the existing links in snapshots from two consecutive months: we applied the same procedure as above to each snapshot and took the difference of the two sets to identify the links existing in the second month but missing from the first. To identify the edit in which each link was inserted, we went through the revision history of the respective articles, available in the Wikimedia XML dumps. Once we had identified the pair of revision IDs before and after the link was inserted, we downloaded the corresponding HTML versions directly. Comparing the two HTML versions, we could identify the changes made by the editor when inserting the link, which we grouped into five categories:

  • text_present: links that fall into this category were added by hyperlinking an existing mention;
  • missing_mention: links were added by taking an already existing sentence and adding the mention for a new entity (and potentially some additional context to the sentence);
  • missing_sentence: the link was added by writing a new sentence and hyperlinking part of the text;
  • missing_span: an extension of the previous category, but where the editors wrote a span of multiple sentences;
  • missing_section: the links were added in a section that did not exist in the previous version of the article.
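The snapshot comparison that precedes this categorisation amounts to a set difference. The sketch below assumes that each link is represented as a (source title, target title) pair, which is an illustrative simplification; the function name and the example pairs are made up.

```python
def added_links(
    first_snapshot: set[tuple[str, str]],
    second_snapshot: set[tuple[str, str]],
) -> set[tuple[str, str]]:
    """Links present in the second snapshot but missing from the first."""
    return second_snapshot - first_snapshot


# Made-up (source, target) pairs standing in for two monthly snapshots.
october = {("Article A", "Article B"), ("Article A", "Article C")}
november = {
    ("Article A", "Article B"),
    ("Article A", "Article C"),
    ("Article A", "Article D"),
}
new_links = added_links(october, november)
```

The result only says that a link appeared between the two snapshots. Finding the specific revision in which it was inserted still requires walking the revision history, as described above.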

Model

t.b.a.

Results

Characterizing link insertion

Here, we empirically characterize how new links are inserted by looking at all added links and counting the occurrence of the different categories.

Entity insertion strategies for links added from 2023-10 to 2023-11 for a subset of 20 languages. The x-axis shows the language code and the number of links added in each language.

For only 27% of added links was the mention used as the anchor of the link already present in the text. For the majority of added links, some of the text had to be changed as well (adding the mention, adding a sentence, or adding a larger span of text).

This means that, for the majority of added links, simple string matching between the text and the page title of the target article is unlikely to be successful.
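To make the limitation concrete, here is a minimal sketch of such a string-matching baseline. The function name and the example sentences are hypothetical; the baseline simply looks for the target's page title in the source text and returns nothing when the title is not mentioned.

```python
import re
from typing import Optional


def find_title_mention(text: str, title: str) -> Optional[tuple[int, int]]:
    """Return the character span of the first case-insensitive match of `title`
    in `text`, or None if the title does not occur."""
    pattern = r"\b" + re.escape(title) + r"\b"
    match = re.search(pattern, text, flags=re.IGNORECASE)
    return match.span() if match else None


title = "University of Geneva"

# Mention already present: the baseline finds a span to hyperlink.
with_mention = "She studied at the University of Geneva before moving abroad."
found = find_title_mention(with_mention, title)

# Mention absent: the baseline returns None, although an editor could still
# add the link by writing new text.
without_mention = "She studied in Switzerland before moving abroad."
missing = find_title_mention(without_mention, title)
```

Only the first situation corresponds to the text_present category; in the other cases there is nothing for string matching to find.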




Resources

  • Paper: t.b.a. (in preparation)
  • Code: t.b.a.
  • Tool: t.b.a.

References

  1. Ghasemian, Amir; Hosseinmardi, Homa; Galstyan, Aram; Airoldi, Edoardo M; Clauset, Aaron (2020-09-22). "Stacking models for nearly optimal link prediction in complex networks". Proc. Natl. Acad. Sci. U. S. A. 117 (38): 23393–23400. ISSN 0027-8424. doi:10.1073/pnas.1914950117. 
  2. Paranjape, Ashwin; West, Robert; Zia, Leila; Leskovec, Jure (2016). "Improving Website Hyperlink Structure Using Server Logs". Proc Int Conf Web Search Data Min 2016: 615–624. doi:10.1145/2835776.2835832. 
  3. Gerlach, M., Miller, M., Ho, R., Harlan, K., & Difallah, D. (2021). Multilingual Entity Linking System for Wikipedia with a Machine-in-the-Loop Approach. CIKM ’21, 3818–3827. https://doi.org/10.1145/3459637.3481939
  4. Arora, A., West, R., & Gerlach, M. (2023). Orphan Articles: The Dark Matter of Wikipedia. In arXiv [cs.SI]. http://arxiv.org/abs/2306.03940