Toolhub/Progress reports/2021-10-29

From Meta, a Wikimedia project coordination wiki

Report on activities in the Toolhub project for the week ending 2021-10-29.

Brief outage on 2021-10-27

Tracked in Phabricator:
Task T294437 resolved

Toolhub was unavailable on 2021-10-27 from approximately 14:00 UTC to 14:10 UTC because it could not connect to its backing database. The cause was scheduled maintenance to place a proxy service between connecting clients such as Toolhub and the "m5" database servers. The outage was resolved when the configuration change was reverted.

During the attempt, it was discovered that Toolhub's network configuration did not allow TCP connections originating from Toolhub's Kubernetes namespace to the new proxy servers. This was the result of a misunderstanding by Bryan when he initially configured the NetworkPolicy for Toolhub. The configuration issue has been addressed by updating the network egress rules to allow connections to the new dbproxy1017.eqiad.wmnet and dbproxy1021.eqiad.wmnet hosts.
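A blocked egress rule of this kind shows up as a TCP connection that cannot be opened from inside the namespace. The following is a minimal sketch of a reachability check, not the procedure the team used. The host names come from this report; the port number 3306 (the MariaDB/MySQL default) and the function names are assumptions made for illustration.

```python
import socket

# Host names are taken from the report.
DB_PROXY_HOSTS = ["dbproxy1017.eqiad.wmnet", "dbproxy1021.eqiad.wmnet"]
# Assumption: the proxies listen on the MariaDB/MySQL default port.
ASSUMED_DB_PORT = 3306


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False


def check_hosts(hosts, port, timeout: float = 3.0) -> dict:
    """Map each host name to whether a TCP connection succeeded."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

Calling `check_hosts(DB_PROXY_HOSTS, ASSUMED_DB_PORT)` from a pod in the affected namespace would be expected to report `False` for both hosts while the egress rules are missing. Note that the check cannot distinguish a policy block from any other network failure.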

A second attempt at inserting the proxy into the connection flow has not yet been scheduled, but is expected to happen sometime in November 2021.

Notice now shown on demo servers

Tracked in Phabricator:
Bug 734385

Now that the official https://toolhub.wikimedia.org/ service is operational, there is a possibility of confusion for folks who find and use the demonstration server. To help reduce this confusion, we have implemented a feature that displays a prominent but dismissible notice on Toolhub instances running in "debug" mode. The backend implementation can be extended in the future if additional use cases are found for providing information about the runtime environment to the front end application.
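One way to picture an extensible backend for this is a function that turns runtime settings into a list of notices for the front end to render. The sketch below is purely illustrative and is not Toolhub's actual code: the `Notice` class, the `runtime_notices` function, the `demo-server` key, and the message wording are all hypothetical.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Notice:
    """Hypothetical notice record sent to the front end."""
    key: str                  # stable identifier so a dismissal can be remembered
    message: str
    dismissible: bool = True


def runtime_notices(debug: bool) -> list:
    """Return notices to display for the current runtime environment."""
    notices = []
    if debug:
        notices.append(Notice(
            key="demo-server",
            message=(
                "This is a demonstration server. "
                "The official service is at https://toolhub.wikimedia.org/."
            ),
        ))
    # Further environment-based notices could be appended here.
    return [asdict(n) for n in notices]
```

Returning a list rather than a single flag is the design choice that leaves room for other runtime-based messages later.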

Warnings from Kubernetes health checks resolved

Tracked in Phabricator:
Task T294072 resolved

One of the useful features of running Toolhub inside a Kubernetes cluster is the built-in application health check probes run by the supervisor process. These health checks can trigger automatic restarts of the application and other alerts. Toolhub's "readiness" check was succeeding, but it was also logging a warning on each probe attempt because Toolhub was returning an HTTP 301 redirect response instead of the expected "200 OK" response. This has been corrected by a configuration change that allows requests to the /healthz check endpoint over plain HTTP, as used by the Kubernetes probe, without forcing a redirect to HTTPS.
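The behavior described above can be sketched as a small decision function: plain HTTP requests are redirected to HTTPS unless the path is on an exemption list. This is an illustration of the logic only, under stated assumptions. The names `REDIRECT_EXEMPT` and `https_redirect` and the `/some/page` path are hypothetical. Django, for example, offers `SECURE_SSL_REDIRECT` and `SECURE_REDIRECT_EXEMPT` settings that behave in a similar way, but this report does not say which mechanism the fix used.

```python
import re

# Hypothetical exemption list: paths (leading slash stripped) that may be
# served over plain HTTP. "healthz" is the endpoint named in the report.
REDIRECT_EXEMPT = [re.compile(r"^healthz$")]


def https_redirect(path: str, is_secure: bool,
                   host: str = "toolhub.wikimedia.org"):
    """Decide whether a request must be redirected to HTTPS.

    Returns (301, headers) when a redirect is required, or None when the
    request should continue to its normal handler.
    """
    if is_secure:
        return None
    stripped = path.lstrip("/")
    if any(pattern.search(stripped) for pattern in REDIRECT_EXEMPT):
        # Exempt path: a plain HTTP probe gets the handler's own response.
        return None
    return 301, {"Location": f"https://{host}{path}"}
```

With the exemption in place, `https_redirect("/healthz", is_secure=False)` returns `None`, so the probe reaches the health check handler directly, while a plain HTTP request for `/some/page` still receives a 301.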

Wrap up

This was a fairly quiet week for Toolhub development and discussion. In November, the team will grow thanks to the hiring of a new software engineer and a software engineering contractor. Bryan and Seve will be working in the coming weeks to onboard these new team members and ensure that they have a good selection of new features, bugs, and technical debt tasks to work on.