User:MPopov (WMF)/Notes/Internal API requests

From Meta, a Wikimedia project coordination wiki

This page documents how to query the MediaWiki Action API, MediaWiki REST API, and Wikimedia REST API internally from R and Python, rather than sending requests over the Internet. The code examples here were tested on stat1007.eqiad.wmnet.

R

First, unset the proxy environment variables so that requests to internal endpoints are not routed through the HTTP proxy:

Sys.unsetenv("HTTP_PROXY")
Sys.unsetenv("HTTPS_PROXY")
Sys.unsetenv("http_proxy")
Sys.unsetenv("https_proxy")

Using the httr2 package, we point the request at the internal api-ro.discovery.wmnet endpoint and set the Host header to the wiki we want to query:

library(httr2)

req <- request("https://api-ro.discovery.wmnet/w/api.php") %>%
    req_headers("Host" = "en.wikipedia.org")

req <- req %>%
    req_url_query(
        action = "query",
        prop = "info",
        titles = "R_(programming_language)|Python_(programming_language)",
        format = "json"
    )

# Disable peer certificate verification to fix the error
# "SSL certificate problem: unable to get local issuer certificate":
req <- req %>%
    req_options(ssl_verifypeer = 0)

# Perform the request:
resp <- req %>%
    req_perform() %>%
    resp_body_json()

To convert the response into a nice data frame, we can use map_dfr from purrr and as_tibble from tibble:

library(tidyverse)

page_info <- resp$query$pages %>%
    map_dfr(as_tibble)
page_info
# A tibble: 2 × 10
  pageid    ns title                         contentmodel pagelanguage pagelanguagehtmlcode pagelanguagedir touched              lastrevid  length
   <int> <int> <chr>                         <chr>        <chr>        <chr>                <chr>           <chr>                    <int>   <int>
1  23862     0 Python (programming language) wikitext     en           en                   ltr             2022-05-17T15:11:33Z 1088356878 146500
2 376707     0 R (programming language)      wikitext     en           en                   ltr             2022-05-16T17:53:29Z 1087609113  59925

Python

As in R, first unset the proxy environment variables so that requests to internal endpoints are not routed through the HTTP proxy:

import os

os.environ.pop('HTTP_PROXY', None)
os.environ.pop('HTTPS_PROXY', None)
os.environ.pop('http_proxy', None)
os.environ.pop('https_proxy', None)

Using the requests library:

import requests

url = 'https://api-ro.discovery.wmnet/w/api.php'

headers = {'Host': 'en.wikipedia.org'}

payload = {
    'action': 'query',
    'prop': 'info',
    'titles': 'R_(programming_language)|Python_(programming_language)',
    'format': 'json'
}
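
The pipe-separated titles value can also be assembled from a Python list, which is handy when the page titles come from elsewhere in a script. A small sketch (the variable names here are illustrative; the Action API accepts up to 50 titles per request for most users, 500 for bots):

```python
# Build the pipe-separated 'titles' parameter from a list of page titles.
# Up to 50 titles may be combined per request (500 for bots).
pages = ['R_(programming_language)', 'Python_(programming_language)']

payload = {
    'action': 'query',
    'prop': 'info',
    'titles': '|'.join(pages),  # "R_(programming_language)|Python_(programming_language)"
    'format': 'json'
}
```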

# verify=False disables TLS peer verification (the same certificate issue as in R):
resp = requests.get(url, headers=headers, params=payload, verify=False).json()

To convert the response into a nice data frame, we can use DataFrame.from_dict from pandas:

import pandas as pd

page_info = pd.DataFrame.from_dict(resp['query']['pages'], orient='index')
page_info
        pageid  ns                          title contentmodel pagelanguage pagelanguagehtmlcode pagelanguagedir               touched   lastrevid  length
23862    23862   0  Python (programming language)     wikitext           en                   en             ltr  2022-05-17T15:11:33Z  1088356878  146500
376707  376707   0       R (programming language)     wikitext           en                   en             ltr  2022-05-16T17:53:29Z  1087609113   59925
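
Action API query results are paginated: when a response contains a continue object, its parameters must be merged into the next request until no continue is returned. A minimal sketch of that loop, using a stub fetch function with canned responses in place of a live requests.get(...).json() call (the stub and its data are illustrative, not a real API response):

```python
def query_all(fetch, params):
    """Yield each page dict, following 'continue' until the API is exhausted.

    `fetch` is any callable taking a params dict and returning the parsed
    JSON response, e.g. a wrapper around requests.get(...).json().
    """
    params = dict(params)  # copy so the caller's dict is not mutated
    while True:
        resp = fetch(params)
        for page in resp['query']['pages'].values():
            yield page
        if 'continue' not in resp:
            break
        # Merge the continuation parameters into the next request.
        params.update(resp['continue'])

# Canned two-batch response to illustrate the loop:
batches = [
    {'query': {'pages': {'1': {'pageid': 1, 'title': 'A'}}},
     'continue': {'gapcontinue': 'B', 'continue': 'gapcontinue||'}},
    {'query': {'pages': {'2': {'pageid': 2, 'title': 'B'}}}},
]
fetch = lambda params, it=iter(batches): next(it)
pages = list(query_all(fetch, {'action': 'query', 'format': 'json'}))
# pages now holds both batches' page dicts, in order.
```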