Toolhub/Progress reports/2021-04-02

From Meta, a Wikimedia project coordination wiki

Report on activities in the Toolhub project for the week ending 2021-04-02.

Crawler enhancements

Following up on work done last week, a patch has been merged to improve the reporting of crawl errors. The crawler now collects the log events emitted while it processes each URL and saves them to the database. The patch also displays this collected log data as part of the crawler history information. This should help administrators and toolinfo.json URL maintainers gain a better understanding of network and data validation errors encountered by the crawler.
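As a rough illustration of the idea of collecting log events per URL, here is a minimal sketch using only Python's standard logging module. The names ListHandler, capture_logs, and crawl are hypothetical and are not taken from the Toolhub code base; the logged messages are invented, and the sketch returns the events instead of saving them to a database.

```python
import logging
from contextlib import contextmanager


class ListHandler(logging.Handler):
    """Logging handler that keeps each event in an in-memory list."""

    def __init__(self):
        super().__init__()
        self.events = []

    def emit(self, record):
        # Store only the parts a history view would likely display.
        self.events.append(
            {"level": record.levelname, "message": record.getMessage()}
        )


@contextmanager
def capture_logs(logger):
    """Attach a ListHandler for the duration of one URL's processing."""
    handler = ListHandler()
    logger.addHandler(handler)
    try:
        yield handler.events
    finally:
        logger.removeHandler(handler)


logger = logging.getLogger("crawler_sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo from printing to the root logger


def crawl(url):
    """Pretend to process one URL and return the log events it produced."""
    with capture_logs(logger) as events:
        logger.info("Fetching %s", url)
        # Invented validation failure, for illustration only.
        logger.error("Invalid toolinfo at %s: %s", url, "missing 'name'")
    return {"url": url, "logs": events}


result = crawl("https://example.org/toolinfo.json")
```

In a real crawler the returned list would presumably be stored alongside the crawl run record so that it can be shown in the history view.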

Data consistency improvements

Tracked in Phabricator:
Task T277231 resolved

Testing of faceted search in early March showed occasional facet terms consisting of an empty string. It was not initially obvious how these values were being introduced into the database and Elasticsearch index. Research over the last week isolated the problem to our Django model layer, and specifically to the way that Django's CharField and TextField handle validation and storage of NULL and empty strings. Our toolhub.apps.toolinfo.models.Tool model contains multiple text/char fields which are marked with both blank=True, meaning that the value is optional, and null=True, meaning that missing values should be stored in the database as NULL. The upstream documentation warns against this practice:

If a string-based field has null=True, that means it has two possible values for “no data”: NULL, and the empty string.

In our use case, removing the null=True configuration made the problem worse rather than better because of the way that Elasticsearch processes blank strings and null values. A blank string is treated as a present but empty value. A null is treated as an absent value. Ideally we want the latter behavior for optional strings so that we can more easily search for records where the value is missing.
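To make the difference concrete, here is a small sketch of how a search for records without a value could be expressed in the Elasticsearch query DSL, built as a plain Python dict. An exists query matches only documents that have an indexed value for the field, and an empty string counts as a value while a null does not. The helper name missing_field_query and the use of "license" as an index field name are assumptions for illustration, not taken from the Toolhub search code.

```python
import json


def missing_field_query(field):
    """Build a query body matching documents with no value for `field`."""
    return {
        "query": {
            "bool": {
                # Exclude every document where the field has an indexed value.
                "must_not": [{"exists": {"field": field}}],
            }
        }
    }


# "license" is an assumed field name used only for this example.
body = missing_field_query("license")
print(json.dumps(body, indent=2))
```

If blank strings were stored instead of nulls, this query would skip those records, because "" is still an indexed value.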

The fix we have introduced is a pair of custom subclasses of Django's CharField and TextField which transform empty strings into null values before saving data to the database. This change makes the representation consistent for both the database and the search index.
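The general shape of such a field can be sketched as a mixin that overrides get_prep_value, the hook Django fields use to convert a Python value before it is sent to the database. This is only a sketch under assumptions: the class names are hypothetical and may not match the merged patch, and FakeCharField is a stand-in for django.db.models.CharField so that the example runs without Django installed.

```python
class BlankAsNullMixin:
    """Convert empty strings to None just before the value is saved."""

    def get_prep_value(self, value):
        # Let the parent field do its usual conversion first.
        value = super().get_prep_value(value)
        # Collapse "" to None so only one "no data" value is ever stored.
        return None if value == "" else value


class FakeCharField:
    """Stand-in for django.db.models.CharField (assumption for the demo)."""

    def get_prep_value(self, value):
        # Pass None through unchanged and coerce other values to str.
        return value if value is None else str(value)


class BlankAsNullCharField(BlankAsNullMixin, FakeCharField):
    """Hypothetical field: in a Django project the second base class
    would be models.CharField (or models.TextField)."""


field = BlankAsNullCharField()
print(field.get_prep_value(""))
print(field.get_prep_value("MIT"))
```

Placing the conversion in the field class means every model that uses the field gets the same behavior, rather than relying on each caller to normalize values before saving.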

After applying the patch, existing database records can be cleaned with these manual SQL updates:

update toolinfo_tool set author = NULL where author = '';
update toolinfo_tool set repository = NULL where repository = '';
update toolinfo_tool set subtitle = NULL where subtitle = '';
update toolinfo_tool set openhub_id = NULL where openhub_id = '';
update toolinfo_tool set bot_username = NULL where bot_username = '';
update toolinfo_tool set replaced_by = NULL where replaced_by = '';
update toolinfo_tool set icon = NULL where icon = '';
update toolinfo_tool set license = NULL where license = '';
update toolinfo_tool set tool_type = NULL where tool_type = '';
update toolinfo_tool set api_url = NULL where api_url = '';
update toolinfo_tool set translate_url = NULL where translate_url = '';
update toolinfo_tool set bugtracker_url = NULL where bugtracker_url = '';
update toolinfo_tool set _schema = NULL where _schema = '';
update toolinfo_tool set _language = NULL where _language = '';
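Because the manual updates above all follow one pattern, they could also be generated from a list of column names. The sketch below runs that pattern against an in-memory SQLite table purely so that it is runnable; the table layout is cut down to four of the columns, the sample rows are invented, and the helper name blank_to_null is hypothetical. It is not the procedure used on the real database.

```python
import sqlite3

# Subset of the columns named in the manual updates above.
COLUMNS = ["author", "repository", "subtitle", "license"]


def blank_to_null(conn, table, columns):
    """Run one UPDATE per column, replacing '' with NULL.

    Returns the total number of values changed.
    """
    changed = 0
    for col in columns:
        # Identifiers cannot be bound as parameters, so they are
        # interpolated; only pass trusted, hard-coded names here.
        cur = conn.execute(
            f"UPDATE {table} SET {col} = NULL WHERE {col} = ''"
        )
        changed += cur.rowcount
    conn.commit()
    return changed


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toolinfo_tool "
    "(name TEXT, author TEXT, repository TEXT, subtitle TEXT, license TEXT)"
)
# Invented sample rows mixing blank strings, NULLs, and real values.
conn.executemany(
    "INSERT INTO toolinfo_tool VALUES (?, ?, ?, ?, ?)",
    [
        ("tool-a", "", "https://example.org/a", "", "MIT"),
        ("tool-b", "Alice", "", None, ""),
    ],
)

changed = blank_to_null(conn, "toolinfo_tool", COLUMNS)
rows = conn.execute(
    "SELECT name, author, repository, subtitle, license "
    "FROM toolinfo_tool ORDER BY name"
).fetchall()
```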

Exposing toolinfo origin to the UI

Tracked in Phabricator:
Task T278258 resolved

The 'origin' of each toolinfo record, which tracks whether it was created by the crawler or submitted directly to the API, is now returned in API and search results. Our UI uses this value to decide whether or not to show the edit action for a given toolinfo record to the user. The current authorization model for toolinfo data is that editing is exclusive to the origin and user that created the initial record.
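The decision described above can be sketched as a small predicate. Everything named here is an assumption for illustration: the origin values "crawler" and "api", the ToolRecord fields, and the function show_edit_action are hypothetical, and the actual UI check is not written in Python.

```python
from dataclasses import dataclass
from typing import Optional

ORIGIN_CRAWLER = "crawler"  # assumed value for crawler-created records
ORIGIN_API = "api"  # assumed value for records submitted to the API


@dataclass
class ToolRecord:
    name: str
    origin: str
    created_by: str


def show_edit_action(record: ToolRecord, current_user: Optional[str]) -> bool:
    """Return True if the edit action should be offered to this user."""
    if current_user is None:
        # Anonymous visitors never see the edit action.
        return False
    # Only API-origin records are editable by a user, and only by the
    # user who created the initial record.
    return record.origin == ORIGIN_API and record.created_by == current_user


crawled = ToolRecord("tool-a", ORIGIN_CRAWLER, "crawler-bot")
submitted = ToolRecord("tool-b", ORIGIN_API, "Alice")
print(show_edit_action(submitted, "Alice"))
```

A check like this in the UI only hides the action; the API would still need to enforce the same rule on its side.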

April-June 2021 planning

Bryan and Srishti spent some time over the last week examining the work remaining on the roadmap to our planned 1.0 release. As mentioned in last week's wrap up, lists of tools, annotations (community-maintained notes/details for tools), and moderation and patrolling are the major functionality remaining. Beyond these major features, we will also need to complete a security review, add more configuration to harden the service for production use, integrate with various deployment tools, add tracking for basic service health metrics, and write some end-user help documentation. Ideally we would also find an initial solution for translation of dynamic content.

No final decisions have been made yet on the full scope of the work for the April-June quarter. We have chosen moderation and patrolling as the most critical remaining feature to work on first. We will be having additional discussions in the coming week to determine if we will try to complete all of the remaining features prior to launch or if some can be deferred. We will also be making a decision on how to adjust our "sometime in June 2021" target launch date to a more concrete date.

Wrap up

This week marks the completion of the second full quarter of implementation work on Toolhub. During the quarter, Bryan and Srishti have:

  • Given two presentations to Wikimedia Foundation staff about the Toolhub project
  • Completed the user interface to register, authorize, view, and revoke Toolhub OAuth grants
  • Implemented faceted search
  • Implemented creation and editing of new toolinfo records via the API and UI
  • Implemented the backend for viewing history and diffs of toolinfo records over time
  • Designed and implemented a uniform notification system for the UI
  • Improved localization with changes to the translation layer
  • Made many improvements to our frontend tooling to call the backend API
  • Added "soft delete" support for toolinfo records
  • Made various improvements in the functionality and reporting for the URL crawler

The frontend for history, diffs, and rollback/revert of toolinfo revisions is under review and expected to be merged in the coming week.