Community Tech/Migrate dead external links to archives

Tracked in Phabricator:
Task T120433 resolved

The Migrate dead links to the Wayback Machine project aims to redirect dead links to the Internet Archive's Wayback Machine. This project page provides information about the process and a place to discuss the tools.

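For the Community Tech team, this project supports the work of User:Cyberpower678, who created Cyberbot II. Cyberbot currently runs on article pages on English Wikipedia, looking for links that have been tagged as dead. Cyberbot queries the Internet Archive, replaces the dead external link with a link to the archived version, and then posts a message on the article's talk page asking contributors to review the change.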
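The query-and-replace step can be sketched roughly as follows. This is a minimal illustration, not Cyberbot's actual code: it assumes the Internet Archive's public Wayback availability endpoint and its JSON response shape (the page does not say which Internet Archive API Cyberbot calls), and every function name here is hypothetical. The network request itself is left out, so the sketch only builds the lookup URL and interprets a response body.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Assumption: the public Wayback availability endpoint. The page does not
# state which Internet Archive API Cyberbot actually uses.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query_url(dead_url: str, timestamp: Optional[str] = None) -> str:
    """Build the lookup URL for a dead link (timestamp is YYYYMMDD[hhmmss])."""
    params = {"url": dead_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_body: str) -> Optional[str]:
    """Return the archived URL from a response body, or None if there is none."""
    data = json.loads(response_body)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest.get("url")


def replace_dead_link(wikitext: str, dead_url: str,
                      archive_url: Optional[str]) -> str:
    """Swap the dead URL for its archived copy; leave the text alone if none."""
    if archive_url is None:
        return wikitext
    return wikitext.replace(dead_url, archive_url)
```

A real bot would edit the citation template around the link rather than do a plain string replacement, and would then post the review request on the talk page. Both steps are omitted here.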

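Community Tech's support for the project includes creating a centralized logging interface and building an advanced dead link detection module, so that Cyberbot can go through the entire wiki and identify links that need to be fixed.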


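We cite many sources that are available on the web. Over time, many of them disappear from the pages where they were hosted, or are moved or removed, which breaks the links. This is a big problem, and it is very hard to track manually. To avoid losing sources, we need an automated process that migrates dead links to the Internet Archive's Wayback Machine or flags them for contributor review. This was the most popular proposal in the 2015 Community Wishlist Survey, with 111 support votes.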

Technical discussion and background


This is a slightly less technical overview. For more details, please see our meeting notes and the Phabricator task.

October 3, 2016

Work is ongoing to deploy IABot to Swedish Wikipedia (T136142). The code is ready for testing, but it still needs local configuration and localization.

July 20, 2016

The dead link detection has been approved for normal use and IABot's switches have been flipped. However, some operational bugs cropped up that had been masked while the switches were off. Cyberpower is working to fix them quickly. Upgrades to the bot's intelligence have also been made. As Ryan mentioned, it's almost sentient now.

June 21, 2016

Cyberpower says in T136728 that the false positive rate is down to 0.2%, way better than the goal. This should be enough to get approval for the bot...

June 8, 2016

Cyberbot is working through the trials required to get approval for phase 2 of the project -- running through all the external links on English Wikipedia, not just the ones marked with the dead link template. Community Tech added the code to skip paywalled sites, and we're helping to track down some of the false positives -- live links that Cyberbot is marking as dead. Right now, the false positive rate is around 8%. The approval process hasn't established a set limit on false positives, but we're hoping to get it down to 1%, if possible.

May 9, 2016

We're still testing and tweaking Cyberbot's dead link detection; there have been some false positives that we're helping to track down and fix. We will also work on having the bot skip paywalled sites, not counting them as dead.

April 24, 2016

We have been working with Cyberpower678 on dead link detection -- testing links in articles to make sure that they're still alive. The bot now checks for HTTP 404 error messages, and if it finds that the link is dead, then it checks again in a few days. When a link returns 404s after three checks, Cyberbot marks the link as dead and replaces it with an archive link.
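The three-strikes rule described above can be sketched like this. It is an illustration under stated assumptions, not Cyberbot's code: the three-day recheck interval stands in for "a few days", the choice to reset the counter on any non-404 response is a guess, and all names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

RECHECK_INTERVAL = timedelta(days=3)  # assumption: stands in for "a few days"
FAILURES_BEFORE_DEAD = 3              # from the text: three checks


@dataclass
class LinkState:
    """What the bot remembers about one external link between runs."""
    consecutive_404s: int = 0
    last_checked: Optional[datetime] = None

    @property
    def is_dead(self) -> bool:
        return self.consecutive_404s >= FAILURES_BEFORE_DEAD


def due_for_check(state: LinkState, now: datetime) -> bool:
    """A link is due if it was never checked or the interval has elapsed."""
    return (state.last_checked is None
            or now - state.last_checked >= RECHECK_INTERVAL)


def record_check(state: LinkState, status_code: int, now: datetime) -> bool:
    """Record one HTTP result; return True once the link counts as dead."""
    if status_code == 404:
        state.consecutive_404s += 1
    else:
        # Assumption: any non-404 response resets the counter.
        state.consecutive_404s = 0
    state.last_checked = now
    return state.is_dead
```

Spacing the checks out means a site that is only briefly offline is not marked as dead, which is one way to keep false positives down.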

April 6, 2016

Cyberbot II has now worked through all of the pages marked with the Dead link template. There are still more than 60,000 pages in the category of articles with dead links; these are links for which the Internet Archive doesn't have a suitable archive. We're starting a discussion on Template talk:Dead link, suggesting that we add an extra parameter to the template that can mark these as unfixable, so that we don't have to keep rechecking the same unsalvageable links.

March 28, 2016

The centralized logging interface is finished, and Cyberbot now uses the logging API. Documentation is on Fixing dead links/Deadlink logging app.

March 27, 2016

Cyberbot II is currently being discussed on Wikipedia:Bots/Requests for approval/Cyberbot II 5a. We're working with Cyberpower678 on detecting dead links that aren't marked with the dead link template.

March 14, 2016

User:Green Cardamom has created a new bot called WaybackMedic. It cleans up after a bug in the Internet Archive API that returned false positives for Cyberbot to use as replacements for dead links. According to Green Cardamom, there are tens of thousands of archive links that are pointing to the wrong archive. WaybackMedic is cleaning up those links, and also cleaning up some formatting bugs from early in Cyberbot's career.

WaybackMedic is currently in bot trials.

March 3, 2016

Dead links logging interface

Work on the Centralized logging interface on Tool Labs for all dead links bots is going well. There's a first draft version up on Tool Labs: Deadlinks, and documentation here: Fixing dead links/Deadlink logging app. We'll be asking for stakeholder input next week.

Coming up soon: we'll work on an output API from the logging interface, showing the last time that a particular page was checked, and the last article that a particular bot processed. This will help a bot that crashed or paused to pick up where it left off, and it'll help multiple bots running on the same wiki to avoid checking the same pages that another bot recently processed. (T128685)
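The two lookups described above, and how a bot might use them, can be sketched over an in-memory list of log records. The record fields, function names and the seven-day window are all hypothetical; the real output API is tracked in T128685 and may look quite different.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class CheckRecord:
    """One hypothetical log row: which bot checked which page, and when."""
    wiki: str
    page: str
    bot: str
    checked_at: datetime


def last_checked(log: List[CheckRecord], wiki: str, page: str) -> Optional[datetime]:
    """Most recent time any bot checked this page, or None if never."""
    times = [r.checked_at for r in log if r.wiki == wiki and r.page == page]
    return max(times) if times else None


def last_page_processed(log: List[CheckRecord], wiki: str, bot: str) -> Optional[str]:
    """The page this bot handled most recently, so it can resume from there."""
    records = [r for r in log if r.wiki == wiki and r.bot == bot]
    if not records:
        return None
    return max(records, key=lambda r: r.checked_at).page


def should_skip(log: List[CheckRecord], wiki: str, page: str, now: datetime,
                window: timedelta = timedelta(days=7)) -> bool:
    """Skip pages that some bot checked within the window (assumed: 7 days)."""
    seen = last_checked(log, wiki, page)
    return seen is not None and now - seen < window
```

A bot restarting after a crash would call `last_page_processed` once, then consult `should_skip` for each page it is about to visit.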

February 7, 2016

We've defined goals for what we're expecting to do this quarter on this project, and a goal for later on this year.

Goals for this quarter (until end of March):

  • Centralized logging interface on Tool Labs for all dead links bots -- This can be used to track what pages have been checked. It'll be useful for individual bots so they don't go over the same page multiple times, and especially useful if there are multiple bots running on the same wiki. The log will include the name of the wiki, the archive source, and the number of links fixed or notifications posted. (This should accommodate bots from DE, FR and others.) Several tickets: Create a centralized logging API (T126363), Create a web interface (T126364), Documentation (T126365).
  • Investigation of advanced dead link detection -- Investigate and plan for adding advanced dead link detection as a module for Cyberbot, detecting other kinds of dead links besides 4XX and 5XX error codes. This may involve adapting Internet Archive's code. See T125181 and T127749.
  • Documentation and code review for Cyberbot -- Documenting Cyberbot's code, in preparation for helping other developers create bots on other wikis. Documentation has started at InternetArchiveBot, code review in T122227.
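As a rough picture of the first goal, here is what a single entry submitted to the centralized log might contain. The wiki name, archive source and the two counts come from the description above. The page title and bot name fields, the JSON encoding and all identifiers are assumptions made for illustration, not the schema of the actual logging API.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Iterable


@dataclass
class LogEntry:
    """One hypothetical log submission from a dead link bot."""
    wiki: str                      # e.g. "enwiki"
    page_title: str                # assumed field, not listed in the text
    bot: str                       # assumed field, not listed in the text
    archive_source: str            # e.g. "Wayback Machine"
    links_fixed: int = 0
    notifications_posted: int = 0

    def __post_init__(self) -> None:
        if self.links_fixed < 0 or self.notifications_posted < 0:
            raise ValueError("counts must be non-negative")

    def to_json(self) -> str:
        """Serialize the entry for submission (JSON is an assumption)."""
        return json.dumps(asdict(self), sort_keys=True)


def links_fixed_by_wiki(entries: Iterable[LogEntry]) -> Dict[str, int]:
    """Total links fixed per wiki, the kind of figure a report might show."""
    totals: Dict[str, int] = {}
    for entry in entries:
        totals[entry.wiki] = totals.get(entry.wiki, 0) + entry.links_fixed
    return totals
```

Keeping the wiki name in every entry is what lets one log serve bots from several language editions at once.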

For later on this year:

  • Our big goal for this project is to help bot-writers on many different language Wikipedias to create their own dead link archive bots. Each community has its own templates, policies, approach and preferred archive service, and it's not scalable for our team / Cyberpower / Internet Archive to create bots for every language Wikipedia. We want to provide the tools that bot writers can use -- modular code that includes APIs and advanced dead link detection, documentation of the existing code, and a centralized logging interface that all bots can use.

February 2, 2016

One of the tasks we'll be working on with Cyberpower678 is a centralized logging interface for tracking and reporting dead link fixes. The log will be kept on Tool Labs. This will help keep track of which pages have had their dead links fixed, when, and by what agent/bot. This will facilitate 3 things:

  • If a bot dies, it can pick up where it left off
  • It will help prevent bots from doing redundant work
  • It will provide a centralized (and hopefully comprehensive) reporting interface for the Internet Archive and other archive providers

The tracking ticket for the logging interface is T125610.

Jan 20, 2016

  • We're working with Cyberpower678 and the Internet Archive to define more exactly what the goals for the project are.
  • We're looking at different ways of approaching the problem, because the best solution may involve a few different tools or processes working together. There are three bots (English, Spanish, French) we need to compare, and we need the solution to work for different languages and different projects (not just Wikipedia).
  • We're investigating automated ways to test links to see if they're dead or not. Legoktm has a proposal and Niharika is working on an algorithm.
  • We need to limit the number of requests we send to the Internet Archive's API.
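For the last point, one simple way to cap the request rate is a client-side throttle that enforces a minimum gap between calls. This is only a sketch: the class name is hypothetical, the interval is whatever limit gets agreed with the Internet Archive (none is given here), and the clock and sleep functions are injectable purely so the behaviour can be tested without waiting.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Allow at most one request every `min_interval` seconds."""

    def __init__(self, min_interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        # The next request may go out one interval after this one.
        self._next_allowed = now + self.min_interval
        return delay
```

A bot would call `wait()` immediately before each request to the archive API. Several bots sharing one limit would need a shared counter instead, which this sketch does not cover.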


Too early to say anything yet, but when we have a good estimation, we'll put it here.

Initial Community Tech team assessment

Support: High. Dead reference links hurt our projects' reliability and verifiability, and connecting dead links with an archive keeps our content useful. There were some dissents in the voting phase, pointing out that it's better when humans find the appropriate alternative links, rather than a bot that might not choose the right one.
Impact: High. Improving the quality of citations helps readers as well as contributors. There are some bots currently running on English, French and Spanish Wikipedias. We want to help build solutions that can be adapted to every language.
Feasibility: High. Cyberbot II is currently active on English Wikipedia, and Elvisor on Spanish Wikipedia. Cyberpower678's work on Cyberbot is being supported by The Wikipedia Library and the Internet Archive. There is obviously good work being done here, and we can figure out how to best support it, and help it to scale globally.
Risk: Low. Cyberbot II is running on English Wikipedia, with no major issues encountered. It may be challenging to integrate with other wikis’ citation templates.