InternetArchiveBot/How the bot fixes broken links

The process of fixing broken links occurs in four steps: parsing the contents of the page, scanning the links that appear on the page, creating archives of links, and editing the page.

Parsing the page

First, the bot retrieves a batch of 5,000 articles to work with from the wiki’s API. Before beginning the batch, the bot checks whether any configuration changes have been made to it from the Management Interface. Once it has applied any configuration changes, it proceeds to work through the batch of 5,000 articles, usually in alphabetical order.
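
For illustration, the sketch below shows how such a batch might be assembled from the standard MediaWiki allpages API. The endpoint, function name, and batch handling are illustrative assumptions; the bot's actual implementation differs.

 # Illustrative sketch: fetch a batch of article titles in alphabetical
 # order from a MediaWiki API, as described above. All names here are
 # hypothetical; the bot's internal code differs.
 import requests

 API = "https://en.wikipedia.org/w/api.php"  # any MediaWiki API endpoint

 def fetch_batch(start_title="", batch_size=5000):
     titles = []
     cont = {"apfrom": start_title} if start_title else {}
     while len(titles) < batch_size:
         params = {
             "action": "query", "list": "allpages", "apnamespace": 0,
             "aplimit": min(500, batch_size - len(titles)),
             "format": "json", **cont,
         }
         data = requests.get(API, params=params, timeout=30).json()
         titles += [p["title"] for p in data["query"]["allpages"]]
         cont = data.get("continue", {})
         if not cont:  # no more pages on the wiki
             break
     return titles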

The bot begins its page analysis by scanning the page’s wikitext in full. It gathers as much information as it can about citation templates containing URLs, URLs contained within square brackets, and URLs posted plainly, while respecting <nowiki>, <pre>, <code>, <syntaxhighlight>, <source>, and HTML comments. It also collects relevant information about the links it finds, such as whether another editor or bot has declared a link dead, or whether a link already has archive URLs associated with it. The next step is to analyze the information returned by the parsing engine and build metadata from it.
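
The sketch below illustrates the general idea of this step on a much smaller scale: extracting URLs from wikitext while ignoring the protected regions listed above. The bot's real parsing engine is considerably more thorough.

 # Simplified illustration of URL extraction that respects <nowiki>,
 # <pre>, <code>, <syntaxhighlight>, <source>, and HTML comments.
 import re

 EXCLUDED = re.compile(
     r"<!--.*?-->|<(nowiki|pre|code|syntaxhighlight|source)\b.*?</\1\s*>",
     re.DOTALL | re.IGNORECASE,
 )
 URL = re.compile(r'https?://[^\s|\]}<>"]+')

 def find_urls(wikitext):
     cleaned = EXCLUDED.sub("", wikitext)  # drop protected regions first
     return URL.findall(cleaned)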

The metadata for each reference – a source with a link in it – is built from the parser’s data, which includes: access time, archive time, archive URL, the original URL, whether the URL is dead, whether it is considered a paywall URL, whether the reference is malformed, whether the reference can be expanded into a template if desired, and whether the URL is already being treated as dead. If no access time is found in the reference, InternetArchiveBot will extrapolate one by assuming the link was last accessed at the time it was added to the page. If an archive URL is found that does not reveal the original URL it is supposed to display, the bot will derive the original URL itself to verify that the archive actually belongs to the reference.
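
As a rough model, the metadata might look like the structure below; the field names are illustrative, not the bot's actual schema.

 # A rough model of the per-reference metadata distilled from the
 # parser output. Field names are illustrative only.
 from dataclasses import dataclass
 from typing import Optional

 @dataclass
 class ReferenceMetadata:
     original_url: str
     access_time: Optional[int] = None   # Unix timestamp, if found
     archive_url: Optional[str] = None
     archive_time: Optional[int] = None
     is_dead: bool = False               # result of live scanning
     treated_as_dead: bool = False       # already tagged dead on wiki
     is_paywall: bool = False
     is_malformed: bool = False
     convertible_to_template: bool = False

 def extrapolate_access_time(meta, time_link_added):
     """If no access time was parsed, assume the link was last
     accessed when it was added to the page."""
     if meta.access_time is None:
         meta.access_time = time_link_added
     return meta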

Note regarding URLs in templates: by default, InternetArchiveBot does not handle URLs contained inside unknown templates, as it cannot know how altering such a template may affect the rendered page. All templates not configured as citation templates are considered unknown templates. Templates are configured on the Citation Templates configurator.

Scanning the links

After each link has been analyzed, the bot, if configured to do so, will attempt to determine whether the link is actually dead by performing a quick scan, with URL lookups spaced out to prevent overload. Multiple URLs from the same domain are queried in series with a one-second break between requests, while URLs from different domains are grouped together into an asynchronous query. Only URLs meeting a set of criteria are included for scanning.
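
A minimal sketch of this scheduling, assuming a simple HTTP HEAD check: URLs are grouped by domain, each group is scanned in series with a one-second break, and the groups run concurrently.

 # Illustrative scan scheduler: serial within a domain, concurrent
 # across domains. The bot's real checker is more involved.
 import time
 from concurrent.futures import ThreadPoolExecutor
 from urllib.parse import urlparse
 import requests

 def check_domain_group(urls):
     results = {}
     for url in urls:
         try:
             r = requests.head(url, timeout=10, allow_redirects=True)
             results[url] = r.status_code < 400
         except requests.RequestException:
             results[url] = False
         time.sleep(1)  # one-second break between same-domain requests
     return results

 def scan(urls):
     groups = {}
     for url in urls:
         groups.setdefault(urlparse(url).netloc, []).append(url)
     merged = {}
     with ThreadPoolExecutor() as pool:  # domain groups run concurrently
         for result in pool.map(check_domain_group, groups.values()):
             merged.update(result)
     return merged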

URLs to be scanned must not have been scanned in the last seven days. If, however, the last scan of the URL resulted in a response marking the URL dead, the waiting period is reduced to three days. A URL must fail three consecutive scans to be considered dead, which means it takes a minimum of nine days for a URL to be declared dead. Once a URL is considered dead, the waiting period returns to seven days. If a dead URL is scanned and found to be alive, its status is immediately reset to alive. URLs and domains set to the permalive (i.e. permanently alive) state are always treated as alive, unless the link is tagged on wiki as dead and the bot is configured to accept the tag over its own assessment; permalive links are not scanned. URLs and domains set to the permadead (i.e. permanently dead) state are always treated as dead, without exception; permadead links are not scanned either.
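
These waiting periods and transitions amount to a small state machine, sketched below with illustrative names and a plain dictionary per link.

 # Sketch of the re-scan and dead-marking rules described above.
 import time

 DAY = 86400  # seconds

 def due_for_scan(link, now=None):
     """Decide whether a link is eligible for a new scan."""
     now = now or time.time()
     if link["state"] in ("permalive", "permadead"):
         return False  # never scanned
     if link["state"] == "dead":
         wait = 7 * DAY  # waiting period returns to seven days once dead
     elif link["consecutive_failures"] > 0:
         wait = 3 * DAY  # failing, but not yet confirmed dead
     else:
         wait = 7 * DAY  # normal weekly cadence
     return now - link["last_scanned"] >= wait

 def record_scan(link, is_alive):
     """Update a link's state after a scan result."""
     link["last_scanned"] = time.time()
     if is_alive:
         link["consecutive_failures"] = 0
         link["state"] = "alive"  # a live result resets status immediately
     else:
         link["consecutive_failures"] += 1
         if link["consecutive_failures"] >= 3:
             link["state"] = "dead"  # three consecutive failed scans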

In addition to links in the permadead and permalive states, the bot also does not scan links it believes to be on subscription sites, since such sites cannot be accurately assessed without knowing whether a failure response is due to missing credentials or to the page actually being down. As a result, the bot will only consider such a URL dead if it is set manually in the bot, or if it is tagged as dead on the wiki. The bot looks for templates, or parameter values in the CS1 templates, set by users to identify a site as requiring a free or paid account to access the material. If one is found, the domain is marked as a subscription site and the bot discontinues checking it.
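
For illustration, that detection might look like the sketch below. The url-access values and template names shown are examples drawn from English Wikipedia's CS1 conventions; the actual lists are configured per wiki.

 # Illustrative subscription-site detection from on-wiki markers.
 from urllib.parse import urlparse

 ACCESS_VALUES = {"subscription", "registration", "limited"}
 SUBSCRIPTION_TEMPLATES = {"subscription required", "registration required"}

 def marks_subscription(reference):
     """reference: dict with 'url', 'params' (CS1 parameters), and
     'templates' (template names found alongside the reference)."""
     if reference["params"].get("url-access") in ACCESS_VALUES:
         return True
     return any(t.lower() in SUBSCRIPTION_TEMPLATES
                for t in reference["templates"])

 def flag_subscription_domains(references, subscription_domains):
     for ref in references:
         if marks_subscription(ref):
             # the whole domain stops being scanned from here on
             subscription_domains.add(urlparse(ref["url"]).netloc)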

Archiving dead links

After all the links have been scanned, the bot proceeds to look for archives of the links it has found to be dead. If configured to do so, it may look for archives of living links as well. If the bot has previously discovered an archive URL for a particular link, it will defer to its memory of it. InternetArchiveBot can recognize and use more than 20 different archival services. If no archive exists for a living link, the bot will attempt to save a copy of the link to the Wayback Machine, if configured to do so, and use the resulting snapshot as the new archive URL. If no archive snapshot is found for a dead link, the bot will consider it unfixable and mark it as a permanently dead reference. If a reference appears to have an archive URL, but the archive is not recognized as valid, or the URL does not match the snapshot, the archive URL will be replaced with one the bot considers valid. This ensures the integrity of the references and helps to minimize the effects of mistakes and vandalism.
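
The lookup for a single link can be sketched with the Wayback Machine's public availability and Save Page Now endpoints, as below. The bot itself consults its own database first and supports many more archival services, so this is only an approximation.

 # Illustrative archive lookup against the Wayback Machine.
 import requests

 def find_archive(url, timestamp=None):
     """Return the closest Wayback Machine snapshot URL, or None."""
     params = {"url": url}
     if timestamp:
         params["timestamp"] = timestamp  # e.g. "20180115", from access time
     data = requests.get("https://archive.org/wayback/available",
                         params=params, timeout=30).json()
     snap = data.get("archived_snapshots", {}).get("closest")
     return snap["url"] if snap and snap.get("available") else None

 def save_to_wayback(url):
     """Ask Save Page Now for a fresh snapshot of a live link; the
     service redirects to the new snapshot on success."""
     r = requests.get("https://web.archive.org/save/" + url, timeout=120)
     return r.url if r.ok else None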

Editing the page

Once the bot has finished analyzing the links, it will attempt to establish whether it is repeating an edit that was reverted in the past. If it detects a repeat, it will attempt to determine why the edit was reverted, including whether it was reported as a false positive. If it assesses that the edit may have been a false positive, it will report the false positive on behalf of the reverting user and will not reattempt the edit until a developer has reviewed the report.

Finally, an edit is posted to the article. The bot also does some cleanup along the way and will make other minor changes to references it intends to edit. Most commonly, it makes template spacing more consistent, forces the use of HTTPS on archive URLs when available, and sometimes simply toggles UrlStatus on a cite template from alive to dead. If the bot encounters a reference that does not use a citation template, it will convert the bare URL reference into one that uses a cite template, provided it can do so without breaking the reference's format. If a bare reference has an archive template – a template that renders a link to the original URL's archive snapshot – and that template is considered obsolete, the bot will convert it to the preferred default.
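
Two of these cleanups can be sketched as simple text substitutions, as below. The url-status parameter shown is English Wikipedia's CS1 spelling of the UrlStatus setting mentioned above; the regexes are deliberately simplified.

 # Illustrative cleanups: force HTTPS on Wayback archive URLs and flip
 # a citation's url-status from live to dead.
 import re

 def force_https_on_archives(wikitext):
     return re.sub(r"http://web\.archive\.org/",
                   "https://web.archive.org/", wikitext)

 def mark_dead(citation):
     # |url-status=live  ->  |url-status=dead
     return re.sub(r"(\|\s*url-status\s*=\s*)live\b", r"\1dead", citation)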

After the edit is made, and if configured to do so, the bot will leave a talk page message describing the changes it has made to the page. Alternatively, the bot can be configured to leave only talk page messages instead of editing the article.