Research:Recommending links to increase visibility of articles

From Meta, a Wikimedia project coordination wiki
Tracked in Phabricator:
Task T293030
Duration:  2021-August – ??

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


In order to support newcomers in their first edits, the Growth Team has been developing the Structured Tasks framework. Structured tasks break down the editing process into smaller steps that are easily understood, easy to use on mobile devices, and can be guided by algorithms. The first structured task that was implemented was add-a-link, which has been deployed to 4 wikis (arwiki, bnwiki, cswiki, and viwiki). Results from those wikis have been encouraging (T277355) -- with only 6.2% of edits from recommended links being reverted. Therefore, we would like to implement other types of tasks that are part of editors’ workflows.

One idea is to further develop the structured task on adding links. The current add-a-link framework is simple (it suggests text and the link) and the priority is given to the action of adding links rather than the value of the added link. Here, we want to add new incoming links to articles in order to increase their visibility. For example, there still exist many orphan-articles, i.e., articles without any incoming links, which cannot be reached from any other Wikipedia page. This is a much more difficult editing task, since we have to add the link to our target article in the text of a different article, the source article.

Key aims of the project:

  • Thrust 1: Understand better which articles require new links to address structural gaps on Wikipedia as well as assess the quantitative impact of addressing these gaps.
  • Thrust 2: Develop an algorithm for a structured task to suggest links to orphan articles (or other articles with low visibility) to increase their visibility.

Background

(Zhu et al., 2020) [1] show that improving articles as part of campaigns can lead to significant, substantial, and long-term increases in both content consumption and subsequent contributions. More importantly in this context, they find significant spillover effects: attention also increases for downstream hyperlinked articles.

(Wagner et al. 2015; Wagner et al. 2016) [2][3] investigated the gender gap in the content of Wikipedia articles. They showed that, in addition to an underrepresentation of women in the number of articles, there are also substantial structural biases in the way articles on women are connected in the hyperlink network. For example, biographies of women are less central in the network, as quantified by their consistently lower PageRank values. This results in lower visibility. (Langrock & González-Bailón 2020) [4] systematically investigate how campaigns such as Art+Feminism are able to address these biases. They find that the campaigns are generally successful at improving the content of a target page, but fail to improve its visibility (number of inlinks).

There is an {{Orphan}} maintenance template that tracks articles that are not linked from any other Wikipedia article, i.e., articles without incoming links. The category Category:Orphaned articles lists these articles. As of 2021-08-20, there are about 90K articles listed in this category. The template mentions the Find link tool, though for the few examples we tried, it did not yield any suggestions.

Takeaways:

  • Incoming links are important for the visibility of articles
  • There are many articles that lack incoming links, either as part of a structural bias or because they are simply orphans
  • In contrast to other biases, existing campaigns are not as successful in addressing these biases
  • Machine learning algorithms can help empower editors to address these issues by generating good recommendations

Methods

Recommending links to increase the visibility of articles can be broken down into 3 steps:

  • Identify articles that are lacking incoming links (these are the target pages of the new links, for example orphan-articles).
  • Identify candidate articles from which to link to these articles (these are the source pages for the new links).
  • Identify potential locations in the text of the source page where the link to the target page could be inserted. This might be specific words, sentences, or sections where we assume the link should be added. This is most likely very challenging, as suitable anchor text for the link might not yet exist. Thus, adding the links will probably also involve adding some text.
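The first step above can be illustrated with a minimal sketch, assuming the link graph is available as a plain list of (source, target) pairs. The function name and the toy data are hypothetical; a real pipeline would also need to handle redirects, disambiguation pages, and namespaces, which are ignored here.

```python
from collections import Counter


def find_orphans(articles, links):
    """Return the articles that have no incoming links from other articles.

    articles: iterable of article titles (toy input, hypothetical format).
    links: iterable of (source, target) pairs.
    """
    # Count incoming links per article; a self-link does not make an
    # article reachable from elsewhere, so it is ignored.
    in_degree = Counter(target for source, target in links if source != target)
    return sorted(a for a in articles if in_degree[a] == 0)


articles = ["A", "B", "C", "D"]
links = [("A", "B"), ("B", "A"), ("C", "A"), ("D", "D")]
orphans = find_orphans(articles, links)
print(orphans)
```

In this toy graph, C has no incoming links and D only links to itself, so both are treated as candidate target pages.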

Timeline

  • Thrust 1: Exploratory analysis, characterizing orphan articles, and quantifying the causal impact of de-orphanization.
  • Thrust 2: Developing a prototype model to recommend links for de-orphanization, identifying positions and generating text for link insertion, followed by potential refinements and evaluation.

Thrust 1: Characterizing Orphan Articles

To better understand structural gaps in Wikipedia, we conduct the first systematic study of orphan articles, which are articles without any incoming links from other Wikipedia articles, across 319 different language versions of Wikipedia. These articles are of particular interest since they are de facto invisible for readers navigating hyperlinks in Wikipedia.

Specifically, we aim to address the following research questions:

  • RQ1: What are the key characteristics of orphan articles?
  • RQ2: Does adding incoming links (de-orphanization) increase the visibility of orphan articles?
  • RQ3: What is the current state of de-orphanization and what are the potential ways to improve it?

Key Highlights

Many orphans

  • We find that a surprisingly large share of content, roughly 15% (8.8M) of all articles, is de facto invisible to readers navigating Wikipedia, and thus rightfully term orphan articles the dark matter of Wikipedia.
Analyzing the extent of orphan articles across all Wikipedia language versions.
  • This observation is not limited to a few or to small Wikipedia language versions: for more than 100 Wikipedia language versions the percentage of orphans is above 30%, including Egyptian Arabic (78%) and Vietnamese (50%), which are among the 20 largest Wikipedia language versions.
  • We find that being an orphan is negatively correlated with (1) higher article quality and (2) the topic of history and society, and slightly positively correlated with being a newer article.
  • More importantly, we show that orphan articles encode structural biases: biography articles about women are substantially more common among orphans than expected from their overall frequency.

Lack of visibility

A pictorial representation of the quasi-experiment: an article that receives a new incoming link (denoted in red font) is considered as treated, whereas the same article in another language that does not receive any new incoming links is considered as control.
  • We provide causal evidence through a quasi-experiment that adding new incoming links to orphans (de-orphanization) leads to a statistically significant increase in their visibility, measured as the number of pageviews.
  • Importantly, we find that this increase is mainly driven by internally referred pageviews from other Wikipedia articles that contain a link to the de-orphanized article.
Per-month DiD treatment effect with 95% CIs for the (a)-(b) forward and (c)-(d) reverse setup, considering November 2022 as the treatment month.
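The idea behind a difference-in-differences (DiD) estimate can be illustrated with a minimal 2x2 sketch: the change in pageviews of treated articles is compared with the change for their controls over the same period. The log(1 + pageviews) transform, the function name, and the toy numbers are assumptions made for illustration; the exact specification used in the study is not reproduced here.

```python
from math import log
from statistics import mean


def did_effect(treated_pre, treated_post, control_pre, control_post):
    """2x2 difference-in-differences on log(1 + pageviews).

    Each argument is a list of pageview counts (toy input).
    """
    def avg_log(values):
        return mean(log(1 + v) for v in values)

    # Change over time within each group ...
    treated_change = avg_log(treated_post) - avg_log(treated_pre)
    control_change = avg_log(control_post) - avg_log(control_pre)
    # ... and the difference between those two changes.
    return treated_change - control_change


# Toy data: treated articles gain pageviews, controls stay flat.
effect = did_effect(
    treated_pre=[10, 20], treated_post=[30, 60],
    control_pre=[10, 20], control_post=[10, 20],
)
print(effect)
```

A positive value means the treated articles gained more attention than the controls did over the same period. Subtracting the control change removes trends that affect both groups alike, such as seasonal variation in readership.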

Challenges for editors

  • We further highlight the challenges faced by editors for de-orphanizing articles, demonstrate the need to support them in addressing this issue, and provide potential solutions for developing automated tools based on cross-lingual approaches.
  • Specifically, we find that the rate of organic de-orphanization is alarmingly low. For the snapshots we considered, editors added new incoming links to ~35K orphan articles.
  • While this constitutes an impressive effort by the community, at that rate it would take approximately 20 years to de-orphanize all orphan articles (assuming no newly created orphan articles).
  • We hypothesize that existing tools do not support editors in addressing this issue effectively. For example, the Find link tool generally does not yield many results for orphan articles, especially for smaller languages.
  • We develop an approach for identifying articles from which to link to orphans via link translation. Results show that our link-translation tool could be effective for 5.5M (62%) of orphan articles.
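The estimate of roughly 20 years mentioned above can be checked with a back-of-the-envelope calculation. The sketch assumes that the snapshots considered are one month apart, which this page does not state explicitly at this point, so the interval is a labeled assumption.

```python
orphans_total = 8_800_000           # roughly 8.8M orphan articles (from the text)
deorphanized_per_interval = 35_000  # ~35K de-orphanized between the snapshots
intervals_per_year = 12             # ASSUMPTION: snapshots are one month apart

# Years needed to clear the backlog if no new orphans were created.
years = orphans_total / deorphanized_per_interval / intervals_per_year
print(round(years, 1))
```

Under this assumption the backlog would take about 21 years to clear, in line with the approximate figure given above.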

Overall, our work not only unravels a key limitation in the link structure of Wikipedia and quantitatively assesses its impact, but also provides a new perspective on the challenges of maintenance associated with content creation at scale in Wikipedia.

We wrote up the findings of Thrust 1 in a paper [5] entitled Orphan Articles: The Dark Matter of Wikipedia (available as a pre-print: https://arxiv.org/abs/2306.03940).

Thrust 2: Link-translation for De-orphanizing articles

As a first prototype, we consider a simple approach to this problem:

  • We restrict ourselves to orphan articles as target pages. Without any incoming links, those articles are not visible from within Wikipedia; thus, adding any incoming link will increase their visibility.
  • We generate candidate links from inspecting all other language versions of Wikipedia. Specifically, we check whether there is an existing link to the target page in any of the other Wikipedias. If yes, we will identify the matching article in the corresponding language and recommend that link. This corresponds to "translating" an existing link from one language to another language version.
  • (optional) Recommend the translated section. Since we recommend existing links from other languages, we can recommend a suitable location for that link in the text. For example, we first identify the section-title where the already existing link is located. Using, e.g. the section alignment tool, we can identify a suitable section for the language of interest.
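The candidate-generation step above can be sketched as follows. This is a minimal illustration, assuming that articles are identified by language-independent item IDs and that the links of each wiki are given as sets of (source, target) pairs. The function name, the data format, and the ranking by number of supporting languages are hypothetical choices, not the implementation used in the project.

```python
def translate_links(target_item, target_lang, links, titles):
    """Suggest source articles in target_lang that could link to target_item.

    links: {lang: set of (source_item, target_item)}, existing links per wiki.
    titles: {lang: {item: title}}, the articles that exist in each wiki.
    Returns a list of (source_title, supporting_langs), most supported first.
    """
    existing = links.get(target_lang, set())
    support = {}
    for lang, lang_links in links.items():
        if lang == target_lang:
            continue
        for source, target in lang_links:
            if target != target_item:
                continue
            # The source article must also exist in the target wiki ...
            if source not in titles[target_lang]:
                continue
            # ... and must not already link to the target there.
            if (source, target_item) in existing:
                continue
            support.setdefault(source, set()).add(lang)
    # ASSUMPTION: rank candidates by how many languages contain the link.
    ranked = sorted(support.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(titles[target_lang][src], sorted(langs)) for src, langs in ranked]


titles = {
    "en": {"Q1": "Orphan topic", "Q2": "Hub article", "Q3": "Other article"},
    "de": {"Q1": "Verwaistes Thema", "Q2": "Zentralartikel", "Q4": "Nur deutsch"},
    "fr": {"Q1": "Sujet orphelin", "Q2": "Article central"},
}
links = {
    "en": set(),
    "de": {("Q2", "Q1"), ("Q4", "Q1")},
    "fr": {("Q2", "Q1")},
}
suggestions = translate_links("Q1", "en", links, titles)
print(suggestions)
```

In the toy data, the link from Q2 to Q1 exists in German and French, so "Hub article" is suggested as a source page for English. Q4 also links to Q1 in German but has no English counterpart, so it is skipped.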

More details with a first exploratory analysis can be found here: Research:Recommending links to increase visibility of articles/Link-translation#Exploratory analysis

Results

We evaluated the link recommendations for new incoming links to orphan articles (de-orphanization) via the link-translation approach and compared it with different baselines (editor tools, heuristics, embeddings). As ground truth, we consider all incoming links added to articles that were orphans in Jan 2022 but were de-orphanized in Feb 2022. We evaluate each method on its ability to predict the newly added incoming links, using the metrics Recall@k and Mean Reciprocal Rank (MRR). Our link-translation approach considerably outperforms all the baselines. Specifically, it provides the best suggestions to de-orphanize articles in the following scenarios:

  • low-resourced languages (the macro average is at least as high as the micro average, indicating that link translation performs as well as, or better than, it does elsewhere for languages with fewer resources)
  • lower values of k, i.e., when considering only the few top suggestions (a recall@1 of 15% is a remarkably strong outcome)
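For readers unfamiliar with the two metrics named above, the following sketch shows one common way to compute them. The function names and toy data are hypothetical, and the exact definitions used in the evaluation (for example, how ties or multiple ground-truth links are handled) may differ.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear among the top-k suggestions."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant suggestion, or 0.0 if none is found."""
    relevant = set(relevant)
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """queries: list of (ranked_suggestions, relevant_items) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


# Each query is one orphan article: ranked source suggestions and the
# source articles from which editors actually added a link (toy data).
queries = [
    (["a", "b", "c"], {"b"}),
    (["x", "y"], {"x"}),
    (["p", "q"], {"z"}),
]
mrr = mean_reciprocal_rank(queries)
print(mrr, recall_at_k(["a", "b", "c"], {"b"}, 1), recall_at_k(["a", "b", "c"], {"b"}, 2))
```

A micro average pools all orphan articles across languages before averaging, whereas a macro average first computes the metric per language and then averages over languages, so small wikis count as much as large ones.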

Detailed results can be found in Research:Recommending links to increase visibility of articles/Link-translation:Evaluation

Tools

Based on the idea of link translation, we built a first prototype of a tool to support editors in increasing the visibility of orphan articles, as well as to generate reading recommendations that surface articles lacking visibility. The advantage of generating recommendations via link translation is that i) they are easily interpretable, ii) they have already been vetted by other language communities, and iii) they work at scale. Therefore, we can generally consider the recommendations to be of high quality. We built functional user interfaces for the tool on Toolforge, where you can try the tool yourself.

Supporting editors: https://linkrec.toolforge.org/

  • This tool aims to increase the visibility of articles. It generates recommendations of articles from which to link to them. We identify recommendations by looking up the corresponding article in other Wikipedia languages. Surprisingly, in many cases there are already existing links which can simply be "translated".

Supporting readers: https://linkrec.toolforge.org/readmore

  • This tool aims to surface articles lacking visibility as follow-up reading recommendations for a selected article in a given language. In spirit, it is similar to the Read more feature. In contrast to Read more, our experimental tool selects related articles based on the following criteria: i) novelty: they do not yet appear as blue links in the selected article in the given language; ii) lack of visibility: they appear as blue links in only a few (or no) other articles in the given language; and iii) relevance: they already exist as blue links in the selected article in other Wikipedia languages.
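The three selection criteria above can be combined into a simple filter. The sketch below is an illustration under stated assumptions, not the code behind the tool: articles are represented by language-independent item IDs, every article existing in a wiki appears as a key in that wiki's link dictionary, and the threshold `max_inlinks` is a hypothetical parameter for "only a few" incoming links.

```python
def readmore_candidates(selected, lang, outlinks, max_inlinks=1):
    """Suggest follow-up articles for `selected` in wiki `lang`.

    outlinks: {lang: {article_item: set of linked article items}} (toy format).
    """
    local = outlinks[lang]
    already_linked = local.get(selected, set())

    # Count incoming links within the given language, ignoring self-links.
    inlinks = {}
    for source, targets in local.items():
        for target in targets:
            if target != source:
                inlinks[target] = inlinks.get(target, 0) + 1

    # Relevance: items linked from the selected article in other languages.
    linked_elsewhere = set()
    for other_lang, graph in outlinks.items():
        if other_lang != lang:
            linked_elsewhere |= graph.get(selected, set())

    return sorted(
        item for item in linked_elsewhere
        if item in local                          # exists in the given language
        and item not in already_linked            # novelty
        and inlinks.get(item, 0) <= max_inlinks   # lack of visibility
    )


outlinks = {
    "en": {"Q1": {"Q2"}, "Q2": set(), "Q3": set(), "Q4": set(),
           "Q5": {"Q4"}, "Q6": {"Q4"}},
    "de": {"Q1": {"Q2", "Q3", "Q4", "Q7"}},
}
candidates = readmore_candidates("Q1", "en", outlinks)
print(candidates)
```

In the toy data only Q3 is suggested: Q2 is already linked from the selected article, Q4 already has two incoming links in English, and Q7 does not exist in English.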

Link insertion

Once we have identified a suitable new link to add (i.e., a pair consisting of a source article and a target article), we are still faced with the task of inserting the link somewhere into the text of the source article. This can be a non-trivial task if an anchor word matching the page title of the target article is not available and/or if the source article contains a lot of text. Therefore, we develop a multilingual model to support the task of link insertion by identifying the most suitable text span for a specific new link.

More details: Research:Recommending links to increase visibility of articles/Supporting entity insertion
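To make the distinction concrete, the easy case described above, where the page title of the target article already appears in the source text, can be handled with plain string matching. The sketch below covers only that case and is not the multilingual model; the function name and example text are hypothetical.

```python
import re


def find_anchor_spans(source_text, target_title):
    """Return (start, end) character spans where the target title is mentioned."""
    pattern = re.compile(r"\b" + re.escape(target_title) + r"\b", re.IGNORECASE)
    return [match.span() for match in pattern.finditer(source_text)]


text = "The city lies on the river Elbe. The Elbe floods in spring."
spans = find_anchor_spans(text, "Elbe")
print([text[start:end] for start, end in spans])
print(find_anchor_spans(text, "Danube"))
```

When no span is found, as for "Danube" here, there is no existing anchor text, which is exactly the situation in which a model has to propose where new text with the link could be added.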

Resources

  1. Zhu, K., Walker, D., & Muchnik, L. (2020). Content Growth and Attention Contagion in Information Networks: Addressing Information Poverty on Wikipedia. Information Systems Research, 31(2), 491–509. https://doi.org/10.1287/isre.2019.0899
  2. Wagner, C., Garcia, D., Jadidi, M., & Strohmaier, M. (2015). It’s a man's Wikipedia? Assessing gender inequality in an online encyclopedia. Ninth International AAAI Conference on Web and Social Media. https://www.aaai.org/ocs/index.php/ICWSM/ICWSM15/paper/viewPaper/10585
  3. Wagner, C., Graells-Garrido, E., Garcia, D., & Menczer, F. (2016). Women through the glass ceiling: gender asymmetries in Wikipedia. EPJ Data Science, 5(1), 1–24. https://doi.org/10.1140/epjds/s13688-016-0066-4
  4. Langrock, I., & González-Bailón, S. (2020). The Gender Divide in Wikipedia: A Computational Approach to Assessing the Impact of Two Feminist Interventions. https://doi.org/10.2139/ssrn.3739176
  5. Arora, A., West, R., & Gerlach, M. (2023). Orphan Articles: The Dark Matter of Wikipedia. arXiv [cs.SI]. https://arxiv.org/abs/2306.03940