InternetArchiveBot/API

From Meta, a Wikimedia project coordination wiki

The IABot Management Interface comes with a simple, fast, and easy-to-use API. It lets tools and gadgets integrate with the interface, contribute to improving IABot's core, and draw on the vast resources IABot has. Other Wikipedia bots can access the API as well, and approved bots on Wikipedia are automatically given bot rights on the API. To access the API, go to https://tools.wmflabs.org/iabot/api.php.

Authorizing the tool/bot/gadget to use the API

Unlike the primary interface, the API allows limited access to information without authorization. To use the API fully, however, an OAuth authorization header must be passed to the API in every request. Conventional OAuth requires user input and therefore does not work for bots, so for bots the API uses a header relay system: it relays your header to MediaWiki (MW) to obtain your bot's account details and then passes back the payload, provided the supplied header was valid. Gadgets and on-wiki JS scripts can use the conventional authentication method, which directs the user to approve the tool's access to the user's account.

Differences between authorization methods

The conventional method of authorization typically involves users clicking the login button on the tool interface, which directs them to an authorization window where they authorize the tool's access, after which they are directed back to the original location. This allows the tool to fully access the user's account, within the grants allowed, and make edits on behalf of the user.

API access for a bot account must work without popup dialogs, since these are JS based and a bot cannot answer them. As such, the API accepts the bot's OAuth header and passes it to the MW software to identify the bot. The header is generated by the bot, and the payload from the response is passed back to the bot. The API will not always use the header and as such will not always pass back a payload. Bots are encouraged to validate the payload at least once every run and whenever a payload is passed back. To force a payload response, set the "returnpayload" parameter in either the POST or GET fields. This will, however, slow down the response, as the API then tries to retrieve a payload. This method of authorization only allows the tool to identify which account just connected to the API; the tool cannot make edits on behalf of the bot. As such, some API features will be disabled.

Authorizing a JS gadget or another tool using OAuth

If the tools using the API are on-wiki JS gadgets or other OAuth-supported tools, they can authorize the user for the API by sending the user to the URL https://tools.wmflabs.org/iabot/oauthcallback.php?action=login&returnto={returnurl}, where {returnurl} is the URL to direct the user back to on successful login.

To log the users back out, call the URL https://tools.wmflabs.org/iabot/oauthcallback.php?action=logout&returnto={returnurl}.
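The login and logout URLs above can be assembled programmatically. The following is a minimal sketch in Python; the function name is hypothetical, and percent-encoding the return URL is an assumption based on ordinary query-string rules rather than something this page states.

```python
from urllib.parse import urlencode

OAUTH_CALLBACK = "https://tools.wmflabs.org/iabot/oauthcallback.php"


def oauth_url(action: str, return_url: str) -> str:
    """Build the login or logout URL for the IABot interface.

    The return URL is percent-encoded so that it survives as a single
    query-string value (an assumption, not documented behaviour).
    """
    if action not in ("login", "logout"):
        raise ValueError("action must be 'login' or 'logout'")
    query = urlencode({"action": action, "returnto": return_url})
    return f"{OAUTH_CALLBACK}?{query}"
```

A gadget would then redirect the user to `oauth_url("login", current_location)` and later to `oauth_url("logout", current_location)`.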

Authorizing a bot with an owner-only consumer

This method is more complicated and limits some API functions that require direct access to the account. Because authentication works by relaying the OAuth header to MW, the destination URL encoded into the OAuth signature is not the tool API URL but MW's OAuth /identify URL. Encoding any other URL into the signature will result in an invalid signature response. The identify URL must also belong to the correct wiki. Since this is a multi-wiki interface, the tool needs to know which wiki you are working on. The default wiki is enwiki. To change it, set the wiki parameter to the appropriate wiki; you can set it in the same request as the authorization. For example, if you want the Swedish Wikipedia, your request needs to have wiki=svwiki in either the POST or GET fields. When signing the header, the URL being encoded must match the format https://{domain}/w/index.php?title=Special:OAuth/identify, where {domain} is the domain of the wiki being worked on.

If you are planning to run on the English Wikipedia, you would pass wiki=enwiki in the GET or POST fields, and encode the URL https://en.wikipedia.org/w/index.php?title=Special:OAuth/identify into your OAuth signature. You would then pass it to the tool API to authenticate you.
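As a rough illustration of the pairing between the wiki parameter and the URL that goes into the signature, here is a small Python sketch. The lookup table and function names are hypothetical; only enwiki and svwiki are named on this page, and other wikis would need their own entries.

```python
# Hypothetical lookup table: this page only names enwiki and svwiki.
WIKI_DOMAINS = {
    "enwiki": "en.wikipedia.org",
    "svwiki": "sv.wikipedia.org",
}


def identify_url(wiki: str) -> str:
    """Return the URL that must be encoded into the OAuth signature."""
    domain = WIKI_DOMAINS[wiki]
    return f"https://{domain}/w/index.php?title=Special:OAuth/identify"


def auth_fields(wiki: str = "enwiki") -> dict:
    """Fields to send with the header; returnpayload forces a payload back."""
    return {"wiki": wiki, "returnpayload": "1"}
```

The point of keeping both in one place is that the wiki field and the signed URL cannot drift apart: a request with `wiki=svwiki` signed against the English identify URL would be rejected as an invalid signature.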

If you are not familiar with OAuth, you may use the two provided scripts, which can be called from the shell and are described in the section below.

If the OAuth header passed to the tool is valid, it will be able to identify the bot connecting to the application, and it will pass back an encrypted payload. It is important to note that the header is only used to identify the bot. To prevent unexpected session failures, it is recommended to always pass a header and validate the payload whenever one is returned. It is also recommended to validate the payload at least once every run. To force a payload in the response, you can set the returnpayload parameter. This however slows down the request as the API now tries to perform the identify request. More details about the parameters are explained below.

Helper scripts to make OAuth easier

OAuth can be difficult to implement for some users. There are 2 scripts meant to be executed externally that can do the hard work for you.

The first is the MWOAuthGenerateHeader:

  1. Download: MWOAuthGenerateHeader.php
  2. Execute the script in the command line interface or via some form of exec function as follows
    php MWOAuthGenerateHeader.php <consumerkey> <consumersecret> <accesstoken> <accesssecret> <identifyurl>
    Replace the bracketed placeholders with the appropriate values
  3. On success, the output of the script will be a properly formatted OAuth header to be passed to the tool via the headers. If something went wrong, it will output "FAIL" instead.
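For a bot written in Python, the steps above could be wrapped roughly as follows. This is a sketch under assumptions: the function names are hypothetical, `php` is assumed to be on the PATH, and the script is assumed to sit in the working directory. Only the argument order and the "FAIL" output come from the description above.

```python
import subprocess


def header_command(consumer_key, consumer_secret, access_token,
                   access_secret, identify_url,
                   script="MWOAuthGenerateHeader.php"):
    """Build the argument list in the order the script expects."""
    return ["php", script, consumer_key, consumer_secret,
            access_token, access_secret, identify_url]


def parse_header_output(output: str) -> str:
    """Return the header, or raise if the script printed FAIL or nothing."""
    header = output.strip()
    if not header or header == "FAIL":
        raise RuntimeError("MWOAuthGenerateHeader.php did not produce a header")
    return header


def generate_header(*credentials_and_url) -> str:
    """Run the script and return the OAuth header it prints."""
    completed = subprocess.run(header_command(*credentials_and_url),
                               capture_output=True, text=True, check=True)
    return parse_header_output(completed.stdout)
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems with secrets that contain special characters.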

The second is the MWOAuthDecodePayload:

  1. Download: MWOAuthDecodePayload.php
  2. When the tool passes back the payload, execute this script as follows
    php MWOAuthDecodePayload.php <consumersecret> <payload>
    Replace the bracketed placeholders with the appropriate values
  3. On success, the output of the script will be a JSON object with your account details. If any part of the payload fails to pass validation, it will output an error message instead. If you know the script was executed properly, a validation failure could suggest that some form of interception or attack took place during the request, and execution should be aborted.
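Handling the decoder's output from Python might look like the sketch below. It assumes that the error message printed on failure is not itself a JSON object, which this page does not state explicitly; the function names are hypothetical.

```python
import json
import subprocess


def decode_command(consumer_secret, payload, script="MWOAuthDecodePayload.php"):
    """Build the argument list in the order the script expects."""
    return ["php", script, consumer_secret, payload]


def parse_decode_output(output: str) -> dict:
    """Return the account details, or raise if validation failed.

    Assumption: anything that is not a JSON object is an error message.
    """
    try:
        details = json.loads(output)
    except json.JSONDecodeError as exc:
        raise RuntimeError(f"payload rejected: {output.strip()}") from exc
    if not isinstance(details, dict):
        raise RuntimeError(f"payload rejected: {output.strip()}")
    return details


def decode_payload(consumer_secret, payload) -> dict:
    """Run the script and return the decoded account details."""
    completed = subprocess.run(decode_command(consumer_secret, payload),
                               capture_output=True, text=True, check=True)
    return parse_decode_output(completed.stdout)
```

A bot following the advice above would treat the raised error as a reason to stop the run rather than retry silently.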

API functions, usage, and limitations

GET or POST?

The API accepts either, and the two can be used interchangeably. When the same parameter is passed in both, the POST value takes priority over the GET value. It is recommended to POST all requests.

Global parameters

The API allows some information to be obtained without the need for authorization, such as obtaining URL information and pages they're found on, but the remaining functions require authorization.

There are global parameters that, when passed, affect all requests and how the API operates.

| Parameter name | Possible values | Description |
|---|---|---|
| action | See the section below | The primary parameter directing the API. It tells the API what action to carry out. |
| returnpayload | Anything | Forces the API to pass the header to the wiki's OAuth and try to get a payload. If successful, a payload is returned with the response. On failure, an error is returned with the response; the bot is not logged out, however. |
| token | The token echoed from the previous response | Required for all write requests that make some form of change. This is the CSRF token. Omitting it returns a 400 error. |
| checksum | The checksum token echoed from the previous response | Required for all write requests that make some form of change. This token is used to validate the request. A bad token results in a 409 error and a missing token results in a 400 error. |
| wiki | 'enwiki', 'svwiki', or any other supported wiki | Tells the tool which wiki to switch to. Bots should generally use their home wiki. Changing this also means changing the identify URL mentioned in the above section for owner-only consumers. |
| offset | string | Used for requests requiring pagination. When omitted, only the first 1000 entries are returned. To move through the result set, repeat the same request with this parameter set to the continue value from the previous response, when one is defined. |
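The offset parameter and the continue value it is paired with suggest a simple loop. The sketch below is one way to write it in Python; `send` is a hypothetical callable that performs the HTTP request and returns the decoded JSON response, so the loop itself stays independent of any HTTP library.

```python
from typing import Callable, Dict, Iterator


def iter_result_pages(send: Callable[[Dict[str, str]], dict],
                      params: Dict[str, str]) -> Iterator[dict]:
    """Yield each response, following `continue` through `offset`.

    The first request is sent without `offset`; every later request
    repeats the same parameters plus the previous `continue` value.
    """
    request = dict(params)
    while True:
        response = send(request)
        yield response
        next_offset = response.get("continue")
        if not next_offset:
            break  # no further result sets
        request = dict(params, offset=next_offset)
```

Because it is a generator, a caller can stop early without fetching the remaining sets, which helps stay within the rate limits.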

Global return values

All API responses are in JSON.

In every response there are values that get returned when completing a request.

| Value name | Type | Description |
|---|---|---|
| loggedon | boolean | Indicates whether or not the client is logged onto the API. |
| username | string | Only defined if the client is logged in. Contains the identified user connected to the API. |
| csrf | string | Only defined if the client is logged in. Contains the CSRF token required for all write requests. |
| checksum | string | Only defined if the client is logged in. Contains the checksum token required to execute a write request. |
| servetime | float | The length of time in seconds the webserver took to service the request. |
| result | string | Defined when a write request is being made. Successful requests have the value "success". Unsuccessful requests have the value "fail". |
| continue | string | Defined when a request uses pagination and has more results. This is the value for the next set of entries, which can contain up to 1000 entries per set. Pass it to the offset parameter when repeating the request to go to the next set. |
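Write requests need the csrf and checksum values from the previous response, passed back as the token and checksum parameters. A minimal Python sketch of that hand-off, with a hypothetical function name:

```python
def write_params(previous_response: dict, action: str, **fields) -> dict:
    """Build the fields for a write request from the previous response.

    The response's `csrf` value goes into the `token` parameter and its
    `checksum` value into the `checksum` parameter.
    """
    if not previous_response.get("loggedon"):
        raise PermissionError("client is not logged on, so no tokens are available")
    params = {"action": action, **fields}
    params["token"] = previous_response["csrf"]
    params["checksum"] = previous_response["checksum"]
    return params
```

Since the tokens are echoed with each response, the safest pattern is to build every write request from the most recent response rather than caching tokens across a run.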

Global return values on error

| Value name | Type | Description |
|---|---|---|
| noaccess | string | Returned when the API is inaccessible. Possible codes are "disabledinterface", "maintenance", and "Missing authorization". Respectively, they mean the tool has been disabled by a developer, the tool developers are performing maintenance, and the client is not logged in to the API to execute the desired function. |
| notavailable | string | Usually passed back when the API function being accessed is not executable with an owner-only consumer. |
| errormessage | string | Defined whenever an error occurs during a request. It is usually accompanied by an error code for bots to identify. Contains an English description of the error. |
| noaction | string | Default response for the API if no action is being performed. |
| validationerror | string | Defined when returnpayload is set in the request. Possible values are "noheader" and "invalidheader". |
| usedheader | string | Returned with validationerror and autherror. Contains the header the API attempted to use. |
| autherror | string | Defined when the initial login to the API failed. |
| requesterror | string | Defined when a request couldn't be executed due to an issue with the request. Possible values are "invalidchecksum", "missingchecksum", "blocked", "dberror", "404", "invalidtoken", and "missingtoken". |
| ratelimit | string | Default response for the API if the number of allowed requests was exceeded during the last minute. Limits are 5 requests/minute for anonymous users, 500 requests/minute for logged in users, and 5000 requests/minute for authorized bots. |
| missingpermission | string | Default response for the API if the requested action requires a permission that the client lacks. Contains the permission required. |
| accessibletogroups | array | Defined with missingpermission. Lists the usergroups that have access to the requested function. |
| missingvalue | string | Defined when a required value for a request is missing or has bad data. |
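A client can check a response for the error values listed above before using it. The following Python sketch collects whichever of them are present; the function name is hypothetical, and action-specific error values are deliberately left out.

```python
# Global error values as listed in the table above.
ERROR_KEYS = (
    "noaccess", "notavailable", "errormessage", "noaction",
    "validationerror", "usedheader", "autherror", "requesterror",
    "ratelimit", "missingpermission", "accessibletogroups", "missingvalue",
)


def collect_errors(response: dict) -> dict:
    """Return only the error-related values present in a response."""
    return {key: response[key] for key in ERROR_KEYS if key in response}
```

An empty result means none of the global error values were returned; a write request should still be checked against the result value.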

Action functions

action=getfalsepositives

This action allows automated processes to fetch reported false positives, from either bots or tool users. This action requires the "viewfpreviewpage" permission, and logically is not available to anonymous clients.

This action offers the following parameters:

| Parameter | Required | Accepted values | Description |
|---|---|---|---|
| displayopen | Default option | Anything | Setting this parameter will return all false positives that have been reported but not acted on. |
| displayfixed | No | Anything | Setting this parameter will return all false positives that were reported and fixed. |
| displaydeclined | No | Anything | Setting this parameter will return all false positives that were reported and declined as an invalid report. |

The action has the following possible return values:

| Value | Type | Description |
|---|---|---|
| openreports | int | The number of open reports that have not been acted on. |
| fpreports | array | All of the reports in the request. |

This function does not return action specific errors.

action=getbotqueue

This action allows automated processes to fetch bot jobs in the bot queue submitted from either bots or tool users. This action requires the "viewbotqueue" permission, and logically is not available to anonymous clients.

This action offers the following parameters:

| Parameter | Required | Accepted values | Description |
|---|---|---|---|
| displayqueued | No | Anything | Setting this parameter will return all bot jobs still pending completion. |
| displayrunning | Default option | Anything | Setting this parameter will return all bot jobs actively being worked on. |
| displayfinished | No | Anything | Setting this parameter will return all bot jobs that have been successfully finished. |
| displaykilled | No | Anything | Setting this parameter will return all bot jobs that have been killed by the requesting users or tool maintainers. |
| displaysuspended | No | Anything | Setting this parameter will return all bot jobs that have been suspended by the tool maintainers. |

The action has the following possible return values:

| Value | Type | Description |
|---|---|---|
| queued | int | The number of bot jobs still pending. |
| running | int | The number of bot jobs in operation. |
| botqueue | array | Details of all of the bot queue jobs requested. |

This function does not return action specific errors.

action=reportfp

This action allows automated processes to report false positives to the interface and tool maintainers. This action requires the "reportfp" permission, and logically is not available to anonymous clients. This action requires the CSRF and checksum tokens to work.

This action offers the following parameters:

| Parameter | Required | Accepted values | Description |
|---|---|---|---|
| fplist | Yes | Newline separated string | A list of URLs, separated by newlines, to be reported as false positives. |

The action has the following possible return values:

| Value | Type | Description |
|---|---|---|
| toreport | array | The URLs that were reported to the maintainers. |
| toreset | array | The URLs that were automatically corrected during reporting. |
| notdead | array | The URLs already found to be alive and are being ignored. |
| notfound | array | The URLs IABot hasn't encountered and are being ignored. |
| alreadyreported | array | The URLs that were already reported and will not be reported again. |

This action has the following possible errors:

| Value | Type | Description |
|---|---|---|
| reportfperror | string | Defined when an error specific to the action has occurred. |
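Preparing a reportfp request and reading its response could be sketched as follows in Python. The helper names are hypothetical, and the token and checksum parameters still have to be added before sending, as described under the global parameters.

```python
from typing import Dict, Iterable

REPORT_KEYS = ("toreport", "toreset", "notdead", "notfound", "alreadyreported")


def reportfp_fields(urls: Iterable[str]) -> Dict[str, str]:
    """Join the URLs with newlines, as the fplist parameter expects."""
    cleaned = [url.strip() for url in urls if url.strip()]
    if not cleaned:
        raise ValueError("fplist is required and cannot be empty")
    return {"action": "reportfp", "fplist": "\n".join(cleaned)}


def summarize_report(response: dict) -> Dict[str, int]:
    """Count how many URLs ended up in each return value."""
    return {key: len(response.get(key, [])) for key in REPORT_KEYS}
```

The summary makes it easy to log, for example, how many of the submitted URLs were actually forwarded to the maintainers versus ignored.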

action=searchurldata

This action allows automated processes to fetch URL data for any encountered URL on Wikipedia. This action is available to anonymous clients.

This action offers the following parameters:

| Parameter | Required | Accepted values | Description |
|---|---|---|---|
| urls | No | Newline separated string | Required if neither urlids nor any of the search filters is set. A list of URLs, separated by newlines, to look up and provide details about. |
| urlids | No | Newline separated int | Required if neither urls nor any of the search filters is set. A list of URL IDs, separated by newlines, to look up and provide details about. |
| hasarchive | No | 0 or 1 | Required if urls, urlids, and the other search filters are not set. Set to 0 to retrieve all URLs with no archive associated with them. Set to 1 to retrieve all URLs with an archive associated with them. |
| livestate | No | Pipe separated string | Required if urls, urlids, and the other search filters are not set. Filters records to the given states of the URLs. Available options are: dead (URLs considered dead); dying (unstable URLs that are likely dead, but not yet considered dead); alive (URLs considered alive); unknown (URLs whose states are unknown); paywall (URLs considered closed access and therefore undeterminable); whitelisted (URLs that are whitelisted and considered alive; not included when filtering by alive); blacklisted (URLs that are blacklisted and considered dead; not included when filtering by dead). |
| isarchived | No | string | Required if urls, urlids, and the other search filters are not set. Filters records based on whether they have a known available archive in the Wayback Machine. Returns true or false in the archived field for each URL. For URLs where it is uncertain whether an archive exists, NULL is returned instead. Available options (only one can be picked) are: yes (URLs with a confirmed archive available at the Wayback Machine or other archives); no (URLs confirmed unavailable at the Wayback Machine); unknown (URLs not confirmed available at the Wayback Machine); missing (URLs that are either not confirmed available or confirmed unavailable at the Wayback Machine). |
| reviewed | No | 0 or 1 | Required if urls, urlids, and the other search filters are not set. Set to 0 to retrieve all URLs that haven't been reviewed by a user or another bot. Set to 1 to retrieve all URLs that have been reviewed by a user or another bot. |

The action has the following possible return values:

| Value | Type | Description |
|---|---|---|
| urls | array | Details of all of the URLs requested. |

This function does not return action specific errors.
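The separators differ between parameters (newlines for urls and urlids, pipes for livestate), which is easy to get wrong by hand. A Python sketch of a helper that builds the request fields, with a hypothetical name and signature, might look like this:

```python
from typing import Dict, Iterable, Optional

LIVE_STATES = {"dead", "dying", "alive", "unknown",
               "paywall", "whitelisted", "blacklisted"}
ARCHIVED_OPTIONS = {"yes", "no", "unknown", "missing"}


def searchurldata_params(urls: Optional[Iterable[str]] = None,
                         urlids: Optional[Iterable[int]] = None,
                         hasarchive: Optional[bool] = None,
                         livestate: Optional[Iterable[str]] = None,
                         isarchived: Optional[str] = None,
                         reviewed: Optional[bool] = None) -> Dict[str, str]:
    """Build searchurldata fields using the documented separators."""
    params = {"action": "searchurldata"}
    if urls:
        params["urls"] = "\n".join(urls)
    if urlids:
        params["urlids"] = "\n".join(str(int(i)) for i in urlids)
    if hasarchive is not None:
        params["hasarchive"] = "1" if hasarchive else "0"
    if livestate:
        states = list(livestate)
        bad = [s for s in states if s not in LIVE_STATES]
        if bad:
            raise ValueError(f"unknown live state(s): {bad}")
        params["livestate"] = "|".join(states)
    if isarchived is not None:
        if isarchived not in ARCHIVED_OPTIONS:
            raise ValueError(f"unknown isarchived option: {isarchived}")
        params["isarchived"] = isarchived
    if reviewed is not None:
        params["reviewed"] = "1" if reviewed else "0"
    if len(params) == 1:
        # Only `action` is present: the API needs urls, urlids, or a filter.
        raise ValueError("set urls, urlids, or at least one search filter")
    return params
```

For example, `searchurldata_params(livestate=["dead", "dying"], hasarchive=False)` asks for dead or dying URLs that have no archive associated with them.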

action=searchpagefromurl

This action allows automated processes to fetch the pages on which an encountered URL was found. This action is available to anonymous clients.

This action offers the following parameters:

| Parameter | Required | Accepted values | Description |
|---|---|---|---|
| url | No | string | Required if urlid is not set. A URL to look up found pages with. Using urlid is recommended. |
| urlid | No | int | Required if url is not set. Looks up the pages based on the URL's ID. |

The action has the following possible return values:

| Value | Type | Description |
|---|---|---|
| pages | array | A list of all pages the URL was found on. |

This function does not return action specific errors.

action=searchurlfrompage

This action allows automated processes to fetch the URLs that were found on the given pages. This action is available to anonymous clients.

This action offers the following parameters:

| Parameter | Required | Accepted values | Description |
|---|---|---|---|
| pageids | Yes | Pipe separated int | A list of page IDs to look up. |

The action has the following possible return values:

| Value | Type | Description |
|---|---|---|
| urls | array | A list of all URLs that were found on the given pages. |

This function does not return action specific errors.

Wikipedia page IDs can be retrieved using the MediaWiki API. For example, for 'Albert Einstein':

https://en.wikipedia.org/w/api.php?action=query&titles=Albert%20Einstein&prop=info
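To feed the result of such a query into searchurlfrompage, the page IDs have to be pulled out of the MediaWiki response and joined with pipes. The Python sketch below assumes the query was made with format=json added (the URL above does not include it) and that the response uses MediaWiki's default layout, in which pages are keyed by page ID and missing titles get negative keys. The function names are hypothetical.

```python
from typing import Dict, Iterable, List


def page_ids_from_query(response: dict) -> List[int]:
    """Extract page IDs from a MediaWiki action=query JSON response.

    Missing titles are keyed with negative numbers and are skipped.
    """
    pages = response.get("query", {}).get("pages", {})
    return sorted(int(page_id) for page_id in pages if int(page_id) > 0)


def searchurlfrompage_fields(page_ids: Iterable[int]) -> Dict[str, str]:
    """Join the page IDs with pipes, as the pageids parameter expects."""
    ids = [str(int(page_id)) for page_id in page_ids]
    if not ids:
        raise ValueError("pageids is required")
    return {"action": "searchurlfrompage", "pageids": "|".join(ids)}
```

The two steps chain directly: `searchurlfrompage_fields(page_ids_from_query(mw_response))`.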

action=modifyurl

This action allows automated processes to modify the URL data IABot uses. This action requires the "changeurldata" permission, and logically is not available to anonymous clients. This action requires the CSRF and checksum tokens to work. Depending on the change being made, the following permissions are also required:

  • "alteraccesstime" to alter the access time.
  • "blacklisturls" to add URLs to the blacklist.
  • "deblacklisturls" to remove URLs from the blacklist.
  • "whitelisturls" to add URLs to the whitelist.
  • "dewhitelisturls" to remove URLs from the whitelist.
  • "alterarchiveurl" to alter the archive URL of URLs.
  • "overridearchivevalidation" to bypass the archive validation checks.

Not all permissions are needed; only those covering the desired functions are.

This action offers the following parameters:

| Parameter | Required | Accepted values | Description |
|---|---|---|---|
| urlid | Yes | int | The URL ID of the URL to modify. |
| accesstime | No | A PHP recognized timestamp | The timestamp of the access time. The bot uses this when searching for new archives. Requires the "alteraccesstime" permission to modify. |
| livestateselect | No | 0, 3, 5, 6, or 7 | The live state to set the URL to. Set to 0 to mark the URL as dead, 3 to mark it as alive, 5 to mark it as a paywall, 6 to blacklist it (requires the "blacklisturls" permission), or 7 to whitelist it (requires the "whitelisturls" permission). Conversely, if the URL is already blacklisted or whitelisted, the "deblacklisturls" or "dewhitelisturls" permission is required to set the new state. |
| archiveurl | No | string | A URL of an archive snapshot of the original URL. Requires the "alterarchiveurl" permission to modify. |
| reason | No | string | An optional reason describing the changes being made and why. It's recommended to provide one. |
| overridearchivevalidation | No | 1 or "on" | Bypass the checks on the archive snapshot. The snapshot will still be checked to ensure it is an archive snapshot, but the check that it matches the original will be bypassed. Requires the "overridearchivevalidation" permission to set. |

This action has no unique output responses for successful requests.

This action has the following possible errors:

| Value | Type | Description |
|---|---|---|
| urldataerror | string | Defined when an error specific to URL modification occurred. Possible values are "illegalaccesstime", "stateblockedatdomain", "illegalstate", "invalidarchive", "urlmismatch", and "404". |
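The numeric livestateselect codes are easier to work with behind named constants. The Python sketch below builds the fields for a modifyurl request; the state names used as dictionary keys and the function name are hypothetical labels, and the token and checksum parameters must still be added before sending.

```python
from typing import Dict, Optional

# Codes from the livestateselect row above; the key names are labels
# chosen for this sketch.
LIVE_STATE_CODES = {
    "dead": 0,
    "alive": 3,
    "paywall": 5,
    "blacklisted": 6,
    "whitelisted": 7,
}


def modifyurl_fields(urlid: int,
                     livestate: Optional[str] = None,
                     archiveurl: Optional[str] = None,
                     accesstime: Optional[str] = None,
                     reason: Optional[str] = None,
                     override_validation: bool = False) -> Dict[str, str]:
    """Build modifyurl fields, including only the values being changed."""
    fields = {"action": "modifyurl", "urlid": str(int(urlid))}
    if livestate is not None:
        fields["livestateselect"] = str(LIVE_STATE_CODES[livestate])
    if archiveurl is not None:
        fields["archiveurl"] = archiveurl
    if accesstime is not None:
        fields["accesstime"] = accesstime
    if reason is not None:
        fields["reason"] = reason
    if override_validation:
        fields["overridearchivevalidation"] = "1"
    return fields
```

Leaving unchanged values out of the request keeps the set of permissions a bot needs as small as possible.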

action=analyzepage

This action allows automated processes to run the bot library on a page and make an edit on the client's behalf. This action requires the "analyzepage" permission, and logically is not available to anonymous clients. This action requires the CSRF and checksum tokens to work. This action is only available for fully authenticated clients. Owner-only consumers will not work.

This action offers the following parameters:

| Parameter | Required | Accepted values | Description |
|---|---|---|---|
| pagesearch | Yes | string | The page title of the page to analyze. |
| reason | No | string | An optional reason for analyzing the page. |
| archiveall | No | "on" | Attempt to add archives to all non-dead references and save non-existent copies to the Wayback Machine. |

The action has the following possible return values:

| Value | Type | Description |
|---|---|---|
| linksanalyzed | int | The number of URLs the bot found and analyzed. |
| linksarchived | int | The number of URLs the bot archived to the Wayback Machine. |
| linksrescued | int | The number of URLs it fixed on wiki, either through adding archives or correcting formatting. |
| linkstagged | int | The number of URLs it tagged as dead on wiki. |
| pagemodified | bool | Whether the page was edited or not. |
| waybacksadded | int | The number of Wayback Machine archives added to the page. |
| othersadded | int | The number of other archives added to the page. |
| revid | int OR bool | The revision ID of the edit. False if no edit was made. |
| modifiedlinks | array | The list of links it modified on the page. |

This action has the following possible errors:

| Value | Type | Description |
|---|---|---|
| analyzeerror | string | Defined when an error specific to the action has occurred. Possible values are "404" and "apierror". |

action=submitbotjob

This action allows automated processes to submit bot jobs for InternetArchiveBot to carry out. This action requires the "submitbotjobs" permission, and logically is not available to anonymous clients. This action requires the CSRF and checksum tokens to work. Additionally, bot jobs larger than 500 pages require the "botsubmitlimit5000" permission, bot jobs larger than 5000 pages require the "botsubmitlimit50000" permission, and bot jobs larger than 50000 pages require the "botsubmitlimitnolimit" permission.

This action offers the following parameters:

| Parameter | Required | Accepted values | Description |
|---|---|---|---|
| pagelist | Yes | Newline separated string | A list of page titles to process, separated by newlines. |

The action has the following possible return values:

| Value | Type | Description |
|---|---|---|
| id | int | The job ID number. |
| status | string | The current job run status. Possible values are "queued", "running", "complete", "killed", and "suspended". |
| requestedby | string | The user that requested the bot job. |
| targetwiki | string | The wiki code of the target wiki. |
| queued | string | Timestamp of when the job was submitted. |
| lastupdate | string | Timestamp of the last update to the job. |
| totalpages | int | The total number of pages in the bot job. |
| completedpages | int | The number of pages completed by the bot. |
| runstats | array | An array of statistics during the run. |

This action has the following possible errors:

| Value | Type | Description |
|---|---|---|
| bqsubmiterror | string | Defined when an error specific to the action has occurred. |
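The size thresholds for submitbotjob can be checked before a job is sent. The Python sketch below returns the permissions named in the description for a given page count. It lists only the size permission for the tier the job falls into; whether the API also expects the lower-tier permissions is not stated on this page, so treat that as an open assumption. The function name is hypothetical.

```python
from typing import List


def required_submit_permissions(page_count: int) -> List[str]:
    """Return the permissions the description names for a job of this size."""
    permissions = ["submitbotjobs"]
    if page_count > 50000:
        permissions.append("botsubmitlimitnolimit")
    elif page_count > 5000:
        permissions.append("botsubmitlimit50000")
    elif page_count > 500:
        permissions.append("botsubmitlimit5000")
    return permissions
```

A client could compare this list against the permissions its account holds and split an oversized page list into smaller jobs instead of triggering a missingpermission response.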

action=getbotjob

This action allows automated processes to look up a bot job submitted for InternetArchiveBot to carry out. This action requires the "submitbotjobs" permission, and logically is not available to anonymous clients. This action requires the CSRF and checksum tokens to work. Additionally, bot jobs larger than 500 pages require the "botsubmitlimit5000" permission, bot jobs larger than 5000 pages require the "botsubmitlimit50000" permission, and bot jobs larger than 50000 pages require the "botsubmitlimitnolimit" permission.

This action offers the following parameters:

| Parameter | Required | Accepted values | Description |
|---|---|---|---|
| id | Yes | int | The job ID to look up. |

The action has the following possible return values:

| Value | Type | Description |
|---|---|---|
| id | int | The job ID number. |
| status | string | The current job run status. Possible values are "queued", "running", "complete", "killed", and "suspended". |
| requestedby | string | The user that requested the bot job. |
| targetwiki | string | The wiki code of the target wiki. |
| queued | string | Timestamp of when the job was submitted. |
| lastupdate | string | Timestamp of the last update to the job. |
| totalpages | int | The total number of pages in the bot job. |
| completedpages | int | The number of pages completed by the bot. |
| runstats | array | An array of statistics during the run. |

This function does not return action specific errors.
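A client polling getbotjob can derive progress from the totalpages and completedpages values. The Python sketch below shows one way to do that; the function names are hypothetical, and treating only "complete" and "killed" as final is an assumption, since this page does not say whether a suspended job can resume.

```python
# Assumption: a suspended job may resume, so it is not treated as final.
FINAL_STATUSES = {"complete", "killed"}


def job_progress(job: dict) -> float:
    """Return the fraction of pages completed, between 0.0 and 1.0."""
    total = job.get("totalpages") or 0
    if total <= 0:
        return 0.0
    return job.get("completedpages", 0) / total


def is_finished(job: dict) -> bool:
    """Tell whether polling can stop for this job."""
    return job.get("status") in FINAL_STATUSES
```

A polling loop would call the API with the job's id, stop once `is_finished` returns true, and otherwise wait between requests to stay well inside the rate limits.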

action=logout

This action allows automated processes to log out of the API.

This action has no parameters.

This function does not return any action specific values.

This function does not return action specific errors.