User:MPopov (WMF)/Notes/Internal API requests
Note: It's important to unset the HTTP proxy environment variables so the requests can correctly go to the internal endpoint.
This page documents how to query the MediaWiki Action API, MediaWiki REST API, and Wikimedia REST API internally in R and Python, rather than sending requests over the Internet. The code examples here were tested on stat1007.eqiad.wmnet.
R

First, unset the proxy environment variables:
Sys.unsetenv("HTTP_PROXY")
Sys.unsetenv("HTTPS_PROXY")
Sys.unsetenv("http_proxy")
Sys.unsetenv("https_proxy")
Using the httr2 package:

library(httr2)

req <- request("https://api-ro.discovery.wmnet/w/api.php") %>%
  req_headers("Host" = "en.wikipedia.org")

req <- req %>%
  req_url_query(
    action = "query",
    prop = "info",
    titles = "R_(programming_language)|Python_(programming_language)",
    format = "json"
  )
# Fix error "SSL certificate problem: unable to get local issuer certificate":
req <- req %>%
  req_options(ssl_verifypeer = 0)

# Perform the request and parse the JSON body:
resp <- req %>%
  req_perform() %>%
  resp_body_json()
To convert the response into a tidy data frame, we can use map_dfr() from purrr and as_tibble() from tibble:

library(tidyverse)

page_info <- resp$query$pages %>%
  map_dfr(as_tibble)
| pageid | ns | title | contentmodel | pagelanguage | pagelanguagehtmlcode | pagelanguagedir | touched | lastrevid | length |
|---|---|---|---|---|---|---|---|---|---|
| &lt;int&gt; | &lt;int&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;int&gt; | &lt;int&gt; |
| 23862 | 0 | Python (programming language) | wikitext | en | en | ltr | 2022-05-17T15:11:33Z | 1088356878 | 146500 |
| 376707 | 0 | R (programming language) | wikitext | en | en | ltr | 2022-05-16T17:53:29Z | 1087609113 | 59925 |
Python

As in R, first remove the proxy environment variables:
import os
os.environ.pop('HTTP_PROXY', None)
os.environ.pop('HTTPS_PROXY', None)
os.environ.pop('http_proxy', None)
os.environ.pop('https_proxy', None)
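If you'd rather not mutate the environment for the whole session, the same cleanup can be wrapped in a small context manager that restores the variables afterwards. This is an optional convenience, not part of the original page:

```python
import os
from contextlib import contextmanager

@contextmanager
def no_proxy():
    """Temporarily remove proxy environment variables, restoring them on exit."""
    saved = {}
    for var in ('HTTP_PROXY', 'HTTPS_PROXY', 'http_proxy', 'https_proxy'):
        if var in os.environ:
            saved[var] = os.environ.pop(var)
    try:
        yield
    finally:
        # Put back whatever was unset so the rest of the session is unaffected
        os.environ.update(saved)
```

Any request made inside `with no_proxy():` then bypasses the proxy, while code outside the block still sees the original environment.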
Using the requests library:
import requests
url = 'https://api-ro.discovery.wmnet/w/api.php'
headers = {'Host': 'en.wikipedia.org'}
payload = {
'action': 'query',
'prop': 'info',
'titles': 'R_(programming_language)|Python_(programming_language)',
'format': 'json'
}
# verify=False works around the "unable to get local issuer certificate"
# error, as in the R example:
resp = requests.get(url, headers=headers, params=payload, verify=False).json()
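To see what requests actually sends here (the TLS connection goes to the internal host, while the Host header selects which wiki answers), you can build the request without performing it and inspect the result offline. This is just a sanity-check sketch using the same url/headers/payload as above:

```python
import requests

url = 'https://api-ro.discovery.wmnet/w/api.php'
headers = {'Host': 'en.wikipedia.org'}
payload = {
    'action': 'query',
    'prop': 'info',
    'titles': 'R_(programming_language)|Python_(programming_language)',
    'format': 'json'
}

# Prepare the request without sending it, to check the final URL and headers
prepared = requests.Request('GET', url, headers=headers, params=payload).prepare()

print(prepared.url)              # query string appended to the internal URL
print(prepared.headers['Host'])  # en.wikipedia.org
```

Note that with verify=False requests emits an InsecureRequestWarning on each call; it can be silenced with urllib3.disable_warnings() if the noise is a problem.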
To convert the response into a data frame, we can use DataFrame.from_dict() from pandas:
import pandas as pd
page_info = pd.DataFrame.from_dict(resp['query']['pages'], orient='index')
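With orient='index' the dict keys (the page IDs) become the DataFrame index, duplicating the pageid column, and the timestamps arrive as strings. A small tidy-up step, sketched here against an abridged sample of the structure returned under resp['query']['pages'] (only some fields shown), resets the index and parses the timestamps:

```python
import pandas as pd

# Abridged sample mirroring resp['query']['pages'] from the request above
pages = {
    '23862': {'pageid': 23862, 'ns': 0, 'title': 'Python (programming language)',
              'touched': '2022-05-17T15:11:33Z', 'length': 146500},
    '376707': {'pageid': 376707, 'ns': 0, 'title': 'R (programming language)',
               'touched': '2022-05-16T17:53:29Z', 'length': 59925},
}

page_info = pd.DataFrame.from_dict(pages, orient='index')

# Drop the redundant page-ID index and parse the MediaWiki timestamp:
page_info = page_info.reset_index(drop=True)
page_info['touched'] = pd.to_datetime(page_info['touched'])
```

After this, page_info has a plain 0..n-1 index and a proper datetime column, which makes sorting and filtering by touched straightforward.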
|  | pageid | ns | title | contentmodel | pagelanguage | pagelanguagehtmlcode | pagelanguagedir | touched | lastrevid | length |
|---|---|---|---|---|---|---|---|---|---|---|
| 23862 | 23862 | 0 | Python (programming language) | wikitext | en | en | ltr | 2022-05-17T15:11:33Z | 1088356878 | 146500 |
| 376707 | 376707 | 0 | R (programming language) | wikitext | en | en | ltr | 2022-05-16T17:53:29Z | 1087609113 | 59925 |