Grants:Programs/Wikimedia Research Fund/Patching broken external links on Wikipedia

From Meta, a Wikimedia project coordination wiki
status: not funded
Patching broken external links on Wikipedia
start and end dates: 07/01/2022 - 06/30/2023
budget (USD): 40,000-50,000 USD
applicant(s): Harsha Madhyastha


Overview

Username

Applicant's Wikimedia username. If one is not provided, then the applicant's name will be provided for community review.

Harsha Madhyastha

Project title

Patching broken external links on Wikipedia

Entity Receiving Funds

Provide the name of the individual or organization that would receive the funds.

Regents of the University of Michigan

Research proposal

Description

Description of the proposed project, including aims and approach. Be sure to clearly state the problem, why it is important, why previous approaches (if any) have been insufficient, and your methods to address it.

A key challenge in preserving Wikipedia for future generations is that, even a few years after an article has been compiled, some of its external references cease to work [1, 2], robbing visitors of the context that the article's editors meant to provide them.

To address this problem, the InternetArchiveBot [3] augments every broken external reference on Wikipedia with a link to an archived copy of the dysfunctional URL. However, this practice assumes that, if a URL no longer works, the page it was pointing to likely no longer exists.

On the contrary, many URLs cease to work only because the sites hosting them have been reorganized and the URLs for their pages have changed, e.g., requests to https://www.denmark.k12.wi.us/ecc/ return a “Not Found” error, because the page has moved to https://www.denmark.k12.wi.us/schools/ecc. In such cases, relying on an archived copy of a page’s old URL is not ideal for several reasons:

  • No archived copy exists for many broken URLs
  • Functionality which requires interactions with a page's back-end servers does not work on archived copies
  • Content in the last archived copy might be stale, e.g., for a page which lists the current members of an organization
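
As a concrete illustration of the reorganization example above, the following minimal Python sketch applies a single path-rewrite rule to a broken URL. The function name `rewrite_url` and the form of the rule (host, old path prefix, new path prefix) are assumptions made for illustration; they are not taken from FABLE.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_url(url, host, old_prefix, new_prefix):
    """Apply one hypothetical rewrite rule to a URL.

    Returns the rewritten URL if the rule matches, otherwise None.
    """
    parts = urlsplit(url)
    # The rule applies only to URLs on the reorganized site.
    if parts.netloc != host:
        return None
    # The rule applies only to paths under the old prefix.
    if not parts.path.startswith(old_prefix):
        return None
    new_path = new_prefix + parts.path[len(old_prefix):]
    return urlunsplit(
        (parts.scheme, parts.netloc, new_path, parts.query, parts.fragment)
    )


# The example from the text: /ecc/ has moved to /schools/ecc.
candidate = rewrite_url(
    "https://www.denmark.k12.wi.us/ecc/",
    host="www.denmark.k12.wi.us",
    old_prefix="/ecc/",
    new_prefix="/schools/ecc",
)
print(candidate)
```

A rule of this form only proposes a candidate; whether the rewritten URL actually serves the original page still has to be verified separately.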

To sidestep these problems, we have developed FABLE. Given a broken URL, FABLE employs a variety of approaches to determine if the page previously available at that URL still exists on the web, and if so, at what new URL. Honoring the choice made in the past to link to a specific page, we aim to find that page's new URL, rather than finding alternate pages with similar content. We have run our current prototype of FABLE on a sample of 5000 broken external links on Wikipedia. For 34% of these broken URLs, FABLE was able to find functional aliases (i.e., new URLs for the same pages).

We estimate that our current prototype of FABLE finds aliases for only ~60% of the URLs that have one, that only 81–86% of the aliases it finds are correct, and that it takes ~5 minutes per input URL. In the proposed project, we aim to improve FABLE along three dimensions:

  1. Coverage: find aliases for more broken URLs
  2. Accuracy: ensure that a higher fraction of its results are correct
  3. Efficiency: consume fewer resources in operating FABLE
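
For a sense of scale, the figures quoted above can be combined with simple arithmetic. The sketch below assumes that the 81–86% accuracy range applies uniformly to the 5000-URL sample and that URLs are processed one at a time at ~5 minutes each; neither assumption is stated in the proposal, so the results are rough estimates only.

```python
SAMPLE_SIZE = 5000         # broken external links in the sample
ALIAS_FOUND_PCT = 34       # share of sample for which an alias was found
ACCURACY_PCT = (81, 86)    # estimated share of found aliases that are correct
MINUTES_PER_URL = 5        # approximate processing time per input URL

# Number of sampled URLs for which FABLE reported an alias.
aliases_found = SAMPLE_SIZE * ALIAS_FOUND_PCT / 100

# Range of aliases expected to be correct (assumes uniform accuracy).
correct_low, correct_high = (aliases_found * pct / 100 for pct in ACCURACY_PCT)

# Total processing time if URLs are handled strictly one after another.
total_hours = SAMPLE_SIZE * MINUTES_PER_URL / 60
total_days = total_hours / 24

print(f"aliases found: {aliases_found:.0f}")
print(f"expected correct aliases: {correct_low:.0f} to {correct_high:.0f}")
print(f"sequential processing time: {total_hours:.1f} hours ({total_days:.1f} days)")
```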

Budget

Approximate amount requested in USD.

40,000-50,000

Budget Description

Briefly describe what you expect to spend money on (specific budgets and details are not necessary at this time).

The funds will be used to support the following:
  1. A graduate student researcher's 25% appointment (tuition, stipend, and benefits) for 12 months
  2. Half a month of summer salary plus associated benefits for the PI
  3. Cloud computing costs to support the execution of our FABLE system
  4. 15% overhead on direct costs

Impact

Address the impact and relevance to the Wikimedia projects, including the degree to which the research will address the 2030 Wikimedia Strategic Direction and/or support the work of Wikimedia user groups, affiliates, and developer communities. If your work relates to knowledge gaps, please directly relate it to the knowledge gaps taxonomy.

One of the three main thrusts in Wikimedia’s 2030 Strategic Direction is to improve the integrity of knowledge available on Wikipedia. A significant long-term threat is that, though millions of contributors and community editors put in the effort to include appropriate citations and ensure verifiability, many of these external references cease to work over time.

Our work aims to preserve the fruits of this collective effort. By enabling broken links on Wikipedia to be patched such that they continue to point to the pages they originally linked to, our work will ensure that users can access the latest content and all of the functionality available at those links, in contrast to the current practice of relying on archived copies of pages.

Dissemination

Plans for dissemination.

We aim to publish papers describing our work on FABLE.

We aim to build a wikibot which, given a Wikipedia article, will 1) attempt to find a functional alias for every broken URL in the article, and 2) post the aliases it finds in the article's Talk page and solicit input from the article's editors about their accuracy.
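
The second step of the bot described above, posting proposed aliases for review, might look roughly like the following Python sketch. It only formats a Talk page message as wikitext; the function name `build_talk_message`, the message wording, and the article title in the usage example are hypothetical, and actually posting the text to a wiki is out of scope here.

```python
def build_talk_message(article_title, aliases):
    """Format proposed link replacements as wikitext for a Talk page.

    aliases is a list of (broken_url, candidate_alias) pairs.
    """
    lines = [
        "== Proposed replacements for broken external links ==",
        f"The following replacements are proposed for broken external links "
        f"in [[{article_title}]]. Please confirm or reject each one.",
    ]
    for broken_url, alias in aliases:
        # One bullet per broken link, showing the old URL and the candidate.
        lines.append(f"* {broken_url} -> {alias}")
    lines.append("~~~~")  # wikitext signature marker, expanded on save
    return "\n".join(lines)


# Usage with the URL pair from the proposal and a hypothetical article title.
message = build_talk_message(
    "Example article",
    [(
        "https://www.denmark.k12.wi.us/ecc/",
        "https://www.denmark.k12.wi.us/schools/ecc",
    )],
)
print(message)
```

Keeping message construction separate from posting makes it easy for editors to review the wording before any bot is allowed to edit Talk pages.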

We have also been in conversation with the Internet Archive, who have expressed interest in having the InternetArchiveBot use the URL replacement patterns identified by FABLE.

Past Contributions

Prior contributions to related academic and/or research projects and/or the Wikimedia and free culture communities. If you do not have prior experience, please explain your planned contributions.

PI Madhyastha's research related to the web has resulted in several papers at top conferences over the last decade (see https://harshavm.engin.umich.edu/publications_by_year/). The deep understanding of the web gained from these prior projects will be crucial to the success of the proposed effort.

The PI has also consistently emphasized impact beyond papers. Notable examples include:

  • Data from the PI's iPlane system (https://web.eecs.umich.edu/~harshavm/iplane/) has been used widely in research studies
  • Our Vroom framework (awarded the IRTF’s Applied Networking Research Prize) has been used to speed up the mobile website of 1-800-Flowers
  • Our MyPageKeeper app for flagging social spam and malware on Facebook was used by over 20K users

I agree to license the information I entered in this form excluding the pronouns, countries of residence, and email addresses under the terms of Creative Commons Attribution-ShareAlike 4.0. I understand that the decision to fund this Research Fund application, the application itself along with all the information entered by me in this form excluding the pronouns, countries of residence, and email addresses of the personnel will be published on Wikimedia Foundation Funds pages on Meta-Wiki and will be made available to the public in perpetuity. To make the results of your research actionable and reusable by the Wikimedia volunteer communities, affiliates and Foundation, I agree that any output of my research will comply with the WMF Open Access Policy. I also confirm that I have read the privacy statement and agree to abide by the WMF Friendly Space Policy and Universal Code of Conduct.

Yes