Toolhub/Progress reports/2021-08-27
Report on activities in the Toolhub project for the week ending 2021-08-27.
Tool created to collect enwiki userscripts
Continuing the data import work described in last week's notes, Bryan has a new toolinfo-scraper tool running in Toolforge. The tool currently has one job, which runs twice per hour and turns the list of user scripts documented on w:en:Wikipedia:User_scripts/List into a toolinfo.json file that Toolhub can import. As a result of this import from enwiki, 591 user scripts are currently visible on the Toolhub demo server.
This same tool can be extended in the future to add parsing jobs for other lists maintained on wiki or elsewhere which can be transformed into toolinfo format.
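As a rough illustration of the transformation such a parsing job performs, the sketch below turns already-parsed list rows into toolinfo-style records. It is not the scraper's actual code: the row keys (`title`, `description`, `url`), the `enwiki-userscript-` name prefix, and the helper names are hypothetical, and the output fields reflect only a minimal reading of the toolinfo format.

```python
import json


def to_toolinfo(rows):
    """Convert parsed list rows into minimal toolinfo-style records.

    Each row is assumed (hypothetically) to be a dict with
    'title', 'description' and 'url' keys.
    """
    records = []
    for row in rows:
        title = row["title"].strip()
        # Hypothetical naming scheme: prefix plus a slug of the title.
        name = "enwiki-userscript-" + title.lower().replace(" ", "-")
        records.append({
            "name": name,
            "title": title,
            "description": row["description"].strip(),
            "url": row["url"],
        })
    return records


def render_toolinfo(rows):
    """Serialize the records as the JSON array a crawler could fetch."""
    return json.dumps(to_toolinfo(rows), indent=2)
```

A job for a different list would only need its own parser that produces rows in the same shape.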
A useful side effect of this effort was the discovery of an exception that could be raised during a crawler run when the name of a record read from a crawled URL collides with toolinfo data uploaded to Toolhub via the API. This exception is now caught and recorded as an error against the URL instead of causing the entire crawler run to fail.
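The general pattern of isolating a failure to the URL that caused it can be sketched as follows. This is not Toolhub's crawler code: `NameCollision`, `crawl`, and the `fetch` and `save` callables are hypothetical stand-ins used only to show the control flow.

```python
class NameCollision(Exception):
    """Hypothetical error: a crawled record reuses a name owned by an API upload."""


def crawl(urls, fetch, save):
    """Process every URL, recording failures per URL instead of aborting.

    fetch(url) returns an iterable of records; save(record) may raise
    NameCollision. Returns a dict mapping each failing URL to its error text.
    """
    errors = {}
    for url in urls:
        try:
            for record in fetch(url):
                save(record)
        except NameCollision as exc:
            # Record the problem against this URL and continue with the rest.
            errors[url] = str(exc)
    return errors
```

In this sketch, records saved from a URL before the collision are kept and the remainder of that URL is skipped; whether the real crawler behaves the same way is not stated in this report.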
Production logging
Bryan returned to a patch that he had started back in March and found that the upstream issue he had been struggling to work around has since been fixed. This allowed him to move forward with building a logging formatter that emits events in the Elastic Common Schema (ECS) format currently preferred for debug logging at the Foundation.
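For readers unfamiliar with the idea, a minimal sketch of a standard-library `logging.Formatter` that emits ECS-style JSON might look like the following. It is not the formatter from the patch: the class name, the flat dotted keys, and the `ecs.version` value are assumptions chosen for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class EcsFormatter(logging.Formatter):
    """Render log records as one JSON object per line using ECS field names."""

    def format(self, record):
        doc = {
            "@timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "log.level": record.levelname.lower(),
            "log.logger": record.name,
            "message": record.getMessage(),
            # Assumed schema version; a real deployment would pin its own.
            "ecs.version": "1.7.0",
        }
        if record.exc_info:
            doc["error.stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(doc)
```

Such a formatter would typically be attached to a handler with `handler.setFormatter(EcsFormatter())` or referenced from a logging configuration dictionary.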
Production memcache
Effie Mouzeli has been helping Bryan understand more about how Mcrouter is configured and used to connect MediaWiki servers with "pools" of memcached servers. Toolhub's memcached needs are less complicated than MediaWiki's, but in order to reuse shared infrastructure we will try to use the same pools of servers. By sharing some basic configuration with the MediaWiki hosts, we hope to avoid creating a new special case of memcached server handling for the Wikimedia Site Reliability Engineering team.
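From the application's side, talking to memcached through Mcrouter can look the same as talking to a single memcached server. The sketch below builds a Django `CACHES` setting from the environment under that assumption. The environment variable names, the default host and port, and the key prefix are all hypothetical placeholders, and the `PyMemcacheCache` backend assumes Django 3.2 or later; none of this is taken from Toolhub's actual configuration.

```python
import os


def cache_settings(env=os.environ):
    """Build a Django CACHES dict pointing at a local Mcrouter endpoint.

    MCROUTER_HOST and MCROUTER_PORT are hypothetical variable names that a
    Helm chart could inject per deployment.
    """
    host = env.get("MCROUTER_HOST", "127.0.0.1")
    port = env.get("MCROUTER_PORT", "11213")  # placeholder default port
    return {
        "default": {
            "BACKEND": "django.core.cache.backends.memcached.PyMemcacheCache",
            "LOCATION": f"{host}:{port}",
            "KEY_PREFIX": "toolhub",  # hypothetical prefix for a shared pool
        }
    }
```

A settings module could then use `CACHES = cache_settings()`, leaving the choice of pool entirely to the Mcrouter configuration.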
Wrap up
Last week's report ended with an expectation that by this report we would have a better understanding of the work left to deploy Toolhub fully into production. Bryan's current punch list of known remaining work is:
- Select content license and document in Toolhub UI at all data entry locations
- Add support for mcrouter to the Helm chart for Toolhub and per-deployment configuration via helmfile settings
- Load test API to establish initial container sizing limits for CPU and RAM
- Add production OAuth secrets to generated configuration
- Deploy into "staging" Kubernetes cluster
- Create initial database tables and seed data
- Create initial Elasticsearch schema and index
- Test crawler process from inside Kubernetes staging cluster to validate HTTP proxy configuration
It seems unlikely that all of these tasks will be done in the coming week, but it should be possible to work through a number of them.