- Harsha Madhyastha
Affiliation or grant type
- University of Michigan
- Harsha Madhyastha
- Harsha Madhyastha: User:HarshaMadhyastha
- Improving the Persistence of External References on Wikipedia
Description of the proposed project, including aims and approach. Be sure to clearly state the problem, why it is important, why previous approaches (if any) have been insufficient, and your methods to address it.
A key challenge in preserving Wikipedia for future generations is that, even a few years after an article has been compiled, some of its external references cease to work, robbing visitors of the context that the article's editors meant to provide them.
To address this problem, the current best practice on Wikipedia is to augment any broken external reference with a link to an archived copy of the dysfunctional URL; the InternetArchiveBot implements this approach at scale. However, this best practice is neither complete nor sufficient.
1. Current systems identify a link as broken only if an error is encountered when crawling it. However, many links return a non-erroneous response yet redirect to an unrelated page. Reference 2 in https://en.wikipedia.org/wiki/Brian_Dubie is an example of such a "soft-404". In other cases, there can be content drift, i.e., the content at the link has been modified since it was cited, so the link no longer serves the purpose for which it was created.
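As a rough illustration of the two failure modes described above, the following Python sketch shows one simple way each could be flagged. It is not the algorithm this project proposes to develop. The function names, the "redirected to the site root" rule, and the 0.5 similarity threshold are all assumptions made for the example, and it presumes that the crawl results (final URL, status code, page text) have already been collected.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit


def looks_like_soft_404(requested_url: str, final_url: str, status: int) -> bool:
    """Hypothetical heuristic: a 200 response that ended up on the site
    root after a redirect from a deeper path is a soft-404 candidate."""
    if status != 200:
        return False  # hard errors are already caught by error-based checks
    requested = urlsplit(requested_url)
    final = urlsplit(final_url)
    requested_path = requested.path.rstrip("/")
    final_path = final.path.rstrip("/")
    redirected = (requested.netloc, requested_path) != (final.netloc, final_path)
    # Only flag redirects from a specific page to a bare homepage.
    return redirected and requested_path != "" and final_path == ""


def has_drifted(archived_text: str, live_text: str, threshold: float = 0.5) -> bool:
    """Hypothetical content-drift check: flag the link when the live page
    shares too few words, in order, with the text captured at citation time."""
    similarity = SequenceMatcher(None, archived_text.split(), live_text.split()).ratio()
    return similarity < threshold


if __name__ == "__main__":
    # example.org URLs are placeholders, not real references.
    print(looks_like_soft_404("http://example.org/news/2009/story.html",
                              "http://example.org/", 200))
    print(has_drifted("the governor signed the bill today",
                      "buy cheap domain names now online"))
```

A heuristic this simple would miss soft-404s that redirect to another deep page and would misfire on pages that legitimately move to a homepage, which is one reason a manual study of sampled links is a useful first step.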
In this project, we aim to study these limitations in two ways. First, we will quantify the prevalence of the above-mentioned problems. Similar to prior studies, we will manually examine a random sample of external links on Wikipedia. Second, we will aim to devise algorithms that automate the identification of broken links that are either missed by current systems or for which an archived copy proves insufficient. These algorithms could help inform future revisions to systems such as InternetArchiveBot.
Approximate amount requested in USD.
- 50,000 USD
Briefly describe what you expect to spend money on (specific budgets and details are not necessary at this time).
The funds will be used to support the following:
- A graduate student researcher's 25% appointment (tuition, stipend, and benefits) for 12 months
- Half a month of summer salary plus associated benefits for the PI
- 15% overhead on direct costs
Address the impact and relevance to the Wikimedia projects, including the degree to which the research will address the 2030 Wikimedia Strategic Direction and/or support the work of Wikimedia user groups, affiliates, and developer communities. If your work relates to knowledge gaps, please directly relate it to the knowledge gaps taxonomy.
One of the main thrusts in Wikimedia’s 2030 Strategic Direction is to improve the integrity of knowledge available on Wikipedia. A significant long-term threat is that, though millions of contributors and community editors put in the effort to include appropriate citations, many of these external references decay over time.
Our work aims to preserve the fruits of the collective effort put into ensuring Wikipedia's verifiability. By both quantifying the shortcomings of current best practices to cope with this issue and studying how these shortcomings could be overcome, our work will inform future improvements to the systems used to patch external references on Wikipedia.
Plans for dissemination.
We will aim to publish a research paper that describes our findings.
To specifically communicate our findings to the Wikimedia community, we will 1) post on forums such as Wikipedia's Village pump, where potential improvements to wikibots are discussed by the community, and 2) give a talk at Wikimedia Research's monthly showcase.
We have been communicating all of our previous findings on this topic to the Internet Archive, and we will continue to do the same in this project.
Prior contributions to related academic and/or research projects and/or the Wikimedia and free culture communities. If you do not have prior experience, please explain your planned contributions.
Our research paper on characterizing broken links on Wikipedia for which no archived copies exist helped inform revisions to WaybackMedic, another bot that the Internet Archive runs on Wikipedia to fix dead links.
We have developed a system called FABLE; given a broken URL, it determines whether the page previously available at that URL still exists on the web, and if so, at what new URL. Encouraged by FABLE's high accuracy in finding URL replacements for permanently dead links, we plan to start developing a wikibot based on FABLE next year.
I agree to license the information I entered in this form, excluding the pronouns, countries of residence, and email addresses, under the terms of Creative Commons Attribution-ShareAlike 4.0. I understand that the decision to fund this Research Fund application, the application itself, and all the information entered by me in this form, excluding the pronouns, countries of residence, and email addresses of the personnel, will be published on Wikimedia Foundation Funds pages on Meta-Wiki and will be made available to the public in perpetuity. To make the results of your research actionable and reusable by the Wikimedia volunteer communities, affiliates and Foundation, I agree that any output of my research will comply with the WMF Open Access Policy. I also confirm that I have read the privacy statement and agree to abide by the WMF Friendly Space Policy and Universal Code of Conduct.