InternetArchiveBot/Documentation/Managing URLs and domains

InternetArchiveBot stores metadata about the URLs it encounters during its runs. IABot uses this data to keep track of the URLs it scans and works with. However, the data is sometimes flawed and may need to be fixed by a human operator.

There are two ways to update this data: at the URL level, which updates a single URL, or at the domain level, which updates the metadata for all URLs under a domain (host) name.

Manage metadata for an individual URL

The URL metadata that can be modified through this interface comprises the archive URL to use, the last seen access time, and whether or not the URL is dead.
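
For orientation, the editable record can be pictured as a small structure like the Python sketch below. The field names are assumptions chosen for illustration; they are not IABot's actual schema.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class UrlRecord:
        """Hypothetical shape of the per-URL metadata editable here.

        Field names are illustrative; this is not IABot's actual schema.
        """
        url: str                 # the URL exactly as it appears on the wiki
        access_time: datetime    # used to pick the closest archive snapshot
        live_state: str          # one of the states listed under "Metadata fields"
        archive_url: str | None  # snapshot of the original URL, if one is saved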

To access the URL metadata:

  1. Under "Manage URL Data," click "Manage individual URLs".
  2. Paste the URL, exactly as seen on the wiki, into the URL field.
  3. If the bot has encountered the URL, the interface will pull up its data.
  4. The first part of the UI shows the metadata fields. The second part is a list of pages on the selected wiki where the URL was found. Once the page list has fully loaded, the search results are retained in your session, and you can use them to submit a bot job with the saved pages.
  5. To submit the bot job with the search results, click the “Run bot on affected pages” button. It will take you to the job queuer with the pages pre-filled. Click Submit to submit the job.
  6. If you no longer have the search results pulled up, you can still recall them by going to Submit Bot Job, clicking “Load pages from URL search”, and then clicking Submit.
  7. You can now change the properties of the data. They are described below.
  8. Users with the appropriate permission can override the validation checks performed on the archive URL. However, the archive URL must still point to a supported archive service (see the sketch after this list).
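
As a rough illustration of step 8, the following Python sketch separates the overridable check (that the snapshot matches the original URL) from the check that always applies (that the archive comes from a supported service). The allowlist and the matching rule here are simplified assumptions, not IABot's actual validation logic.

    from urllib.parse import urlparse

    # Hypothetical allowlist for illustration; the real list is at SUPPORTED ARCHIVES.
    SUPPORTED_ARCHIVE_HOSTS = {"web.archive.org", "archive.today"}

    def check_archive_url(archive_url: str, original_url: str,
                          override: bool = False) -> bool:
        """Sketch of the archive URL checks; not IABot's actual code."""
        host = urlparse(archive_url).hostname or ""
        if host not in SUPPORTED_ARCHIVE_HOSTS:
            return False  # service check applies even with the override
        if override:
            return True   # privileged users may skip the snapshot check
        # Naive snapshot check: a Wayback-style archive URL embeds the
        # original URL, e.g. https://web.archive.org/web/2020/http://example.com/
        return original_url in archive_url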

When you have made the changes you want, you may optionally add a reason in the reason box. Click Submit to submit your changes.

Below the metadata and page list, there is an alteration log. This shows every user who has made changes to the URL metadata and what changes they made. Below the alteration log is a basic scan log, which shows a rolling history of the bot's scans of the URL and their results.

Metadata fields

  • Access time: this is the time IABot uses to find the best archive URL for the respective URL.
  • Live state: this determines how IABot treats the URL on the wiki. There are 5 adjustable states (see the sketch after this list):
    • Alive: The URL is considered alive and will be scanned regularly with a 1 week waiting period.
    • Dead: The URL is considered dead and will be scanned regularly with a 3 day waiting period.
    • Permalive (Whitelisted): The URL is considered alive, and it will not be scanned.
    • Permadead (Blacklisted): The URL is considered dead, and it will not be scanned.
    • Subscription site: The URL will not be scanned, nor will it be considered dead or alive. The bot will not know the state of the URL and will not handle it on wiki.
  • Live state also has 2 additional, non-selectable states, applied by the bot:
    • Unknown: The URL has not yet been assessed by the bot.
    • Dying: The URL has failed to produce a response the bot considers indicative of a live website. The URL will be scanned regularly with a 3 day waiting period. If the URL fails three consecutive tests, it will be set to the Dead state. While in the Dying state, the URL is still treated as alive.
  • Archive URL: This is the URL of an archive snapshot for the URL you are editing. The snapshot MUST be of the original URL, and it must come from a supported archive service, or it will be rejected. For a list of accepted archive services, see SUPPORTED ARCHIVES.
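
The live states and scan cadences above can be summarized in a small state-transition sketch in Python. This is one reading of the rules on this page, not IABot's implementation; the treatment of the Unknown state and the dead-to-alive recovery path are assumptions.

    from enum import Enum

    class LiveState(Enum):
        ALIVE = "alive"
        DYING = "dying"
        DEAD = "dead"
        PERMALIVE = "permalive"        # whitelisted
        PERMADEAD = "permadead"        # blacklisted
        SUBSCRIPTION = "subscription"
        UNKNOWN = "unknown"

    # Rescan interval in days; None means the URL is never scanned.
    SCAN_INTERVAL_DAYS = {
        LiveState.ALIVE: 7,
        LiveState.DYING: 3,
        LiveState.DEAD: 3,
        LiveState.UNKNOWN: 0,          # assumption: unassessed URLs scan at once
        LiveState.PERMALIVE: None,
        LiveState.PERMADEAD: None,
        LiveState.SUBSCRIPTION: None,
    }

    def next_state(state: LiveState, responded: bool,
                   consecutive_failures: int) -> LiveState:
        """Transition after one scan; consecutive_failures includes this scan."""
        if SCAN_INTERVAL_DAYS[state] is None:
            return state               # manual states are never rescanned
        if responded:
            return LiveState.ALIVE     # assumption: a dead URL that responds recovers
        if consecutive_failures >= 3:
            return LiveState.DEAD      # third consecutive failed test
        return LiveState.DYING         # still treated as alive on the wiki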

Manage an entire domain

In certain cases you may want to manage the status of an entire domain name, for instance when a website has been shut down entirely. To get started, go to "Manage entire domains" under "Manage URL data".

  1. Type in an excerpt of the host name you want to look up. If you know the exact host name, you may optionally click "Perform exact match lookup" to speed up the search. Be aware that the search can be slow and demanding on the web application. DO NOT include the scheme (e.g. https://), the port, or any part of the URL following the initial slash (see the sketch after this list).
  2. Once the search completes, you will be presented with a list of possible matches to your search. Select the domains you wish to make changes to and click Submit.
  3. You will now see two parts to the UI. The first part is a basic metadata form; the second is a set of three tabs containing the list of domains being changed, the list of URLs that will be affected by the change, and the list of wiki pages on which any of those URLs appear. Once the page list has loaded, a blue button will appear above the pages allowing you to run the bot on those pages. Note that the pages may take some time to load.
    To submit the bot job with the search results, click the “Run bot on affected pages” button. It will take you to the job queuer with the pages pre-filled. Click Submit to submit the job.
    If you no longer have the search results pulled up, you can still recall them by going to Submit Bot Job, clicking “Load pages from domain search”, and then clicking Submit.
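
Since the search expects a bare host name, a helper like the following Python sketch can reduce a full URL to the part the form wants (no scheme, no port, nothing after the first slash). The helper is illustrative and not part of the tool.

    from urllib.parse import urlparse

    def hostname_for_lookup(url: str) -> str:
        """Strip the scheme, port, and path, keeping only the host name."""
        # urlparse only finds the host after "//"; tolerate bare hosts too.
        if "//" not in url:
            url = "//" + url
        return urlparse(url).hostname or ""

    # hostname_for_lookup("https://example.com:8080/path?q=1") -> "example.com"
    # hostname_for_lookup("example.com/some/page")             -> "example.com"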

There are only a few options to change here. They are described below.

Metadata fields

  • Global live state: There are 6 adjustable states, at two levels: domain level and URL level. Domain level states override URL level states (see the sketch after this list).
    • (none): This option will unset any domain level state applied to the domain. URL level states will be applied by the bot going forward.
    • Alive: Changes all URLs’ live states to Alive at the URL level. All URLs will be considered alive and regularly scanned with a 1 week waiting period.
    • Dead: Changes all URLs’ live states to Dead at the URL level. All URLs will be considered dead and regularly scanned with a 3 day waiting period.
    • Permalive (Whitelisted): Applies the Whitelisted state at the domain level. URL level states will be overridden, unless the individual URL is either whitelisted or blacklisted. All URLs will be considered alive and will not be scanned.
    • Permadead (Blacklisted): Applies the Blacklisted state at the domain level. URL level states will be overridden, unless the individual URL state is either whitelisted or blacklisted. All URLs will be considered dead and will not be scanned.
    • Subscription site: Applies the subscription site state at the domain level. Individual URLs’ states WILL NOT be overridden. When a URL scan produces a result consistent with a site requiring an account to access it, the individual URL will be set to Subscription site. URLs will still be treated according to their respective URL level states.
    • (mixed): This is a non-selectable option. It simply indicates that the domain level state settings are not consistent among the domains selected to be modified.
  • Delete archive URLs: This allows you to purge all URLs under the selected domains of their saved archive URLs. You can additionally restrict how new archive URLs will be allowed to be saved going forward. Pick one of the options.
  • Optionally, you may add a reason for your change, which will be visible under the individual URL, and click Submit.
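
The precedence rules above can be condensed into a small resolver, sketched below in Python. It is one reading of the description on this page, not IABot's actual code; in this sketch the stored domain-level state is one of permalive, permadead, subscription, or unset, since the Alive and Dead options rewrite URL-level states rather than setting a domain-level one.

    def effective_state(url_state: str, domain_state: str | None) -> str:
        """Sketch of how domain-level states override URL-level ones."""
        # (none): no domain-level state, so the URL-level state applies.
        if domain_state is None:
            return url_state
        # Individually whitelisted/blacklisted URLs always keep their state.
        if url_state in ("permalive", "permadead"):
            return url_state
        # A domain-level subscription setting does not override URL states;
        # it only affects how future scan results are classified.
        if domain_state == "subscription":
            return url_state
        # Domain-level Permalive/Permadead override everything else.
        return domain_state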