File metadata cleanup drive/Tasks and tools

From Meta, a Wikimedia project coordination wiki

Main tasks

Add information templates

For each wiki, we need to identify files that don't have any information templates, then add one.

For Commons, there's a Labs tool that lists files lacking any of the accepted information templates. The tool also tries to guess how to add the information template and suggests a possible edit for you.

For other wikis, the best starting point is the list of interwiki links for {{Information}}. We're also looking at tools to make this easier.

Complete information templates

As a sub-task, we need to check that all the required fields of the template are filled out.

On Commons, this can be done by looking into files that are missing an author and those that are missing a source.
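For wikis without such lists, the same check can be approximated on raw wikitext. The following is a minimal sketch, not an existing tool: the function names are hypothetical, the required fields ("author" and "source") are taken from the paragraph above, and it assumes the template is called {{Information}} with named parameters. Local template names and parameter names differ between wikis and would need to be adjusted.

```python
import re


def template_params(wikitext, name="Information"):
    """Return {param: value} for the first {{name|...}} call, or None if absent.

    Hypothetical helper: a simple scanner, not a full wikitext parser.
    """
    start = re.search(r"\{\{\s*" + re.escape(name) + r"\s*[|}]", wikitext, re.IGNORECASE)
    if start is None:
        return None
    i = start.end() - 1  # position of the first '|' or '}' after the name
    curly = square = 0   # nesting depth of {{...}} and [[...]] inside values
    parts, current = [], []
    while i < len(wikitext):
        two = wikitext[i:i + 2]
        if two == "{{":
            curly += 1
            current.append(two)
            i += 2
        elif two == "}}":
            if curly == 0:
                break  # end of the template we are scanning
            curly -= 1
            current.append(two)
            i += 2
        elif two == "[[":
            square += 1
            current.append(two)
            i += 2
        elif two == "]]":
            square = max(0, square - 1)
            current.append(two)
            i += 2
        elif wikitext[i] == "|" and curly == 0 and square == 0:
            # A top-level pipe separates two parameters.
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(wikitext[i])
            i += 1
    parts.append("".join(current))
    params = {}
    for part in parts[1:]:  # parts[0] is the empty text before the first '|'
        key, sep, value = part.partition("=")
        if sep:
            params[key.strip().lower()] = value.strip()
    return params


def missing_fields(wikitext, required=("author", "source")):
    """Return the required fields that are absent or empty, or None if no template."""
    params = template_params(wikitext)
    if params is None:
        return None
    return [field for field in required if not params.get(field)]


sample = "{{Information\n|description=A cat\n|source={{own}}\n|author=\n}}"
print(missing_fields(sample))
```

A result of None means the page has no information template at all, which corresponds to the first task; a non-empty list means the template exists but still needs completing.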

Add machine-readable markers to information and copyright templates

In parallel, we need to add machine-readable markers to the templates themselves. This needs to be done for all information templates and all copyright tags.

Add machine-readable markers to related templates

As a secondary goal, we should try to add machine-readable markers to related templates as well. This notably includes:

  • templates about trademarks
  • warning templates (personality rights, FOP, URAA, etc.)
  • templates about local/uncertain PD licenses
  • templates about GLAM partnerships
  • assessment templates (featured etc.)
  • campaign templates (WLM etc.)
  • permissions templates (e.g. OTRS ticket number)

Other things we could do:

  • add machine-readable markers for file title and alt text to information templates
  • differentiate between Location and Object location

Tools

We're currently evaluating tools and processes to make this effort easier. These may include bots, gadgets or other tools that identify missing metadata.

Possible ways to measure our progress are to:

  • Analyze template links: Go through wikis and compare the total number of uploaded files (from Special:Statistics) to the number of files that have both an information template and a copyright tag. This may not capture files whose templates are present but contain missing or broken information.
  • Use maintenance categories: The MediaWiki parser could be modified to output maintenance categories when machine-readable copyright information is missing. The categories would have to exist on every wiki, and it might take several weeks for File: pages to be purged from the parser cache.
  • Analyze MediaViewer logs: When MediaViewer can't find metadata it can read, it logs an error. This can be used to identify missing templates, missing metadata, and generally measure how many files opened in MediaViewer lack readable information. It will only surface files viewed in MediaViewer, though.
  • Analyze HTML dumps: We could generate regular dumps of File description pages in HTML format from Parsoid. A script would then parse the dump to check that metadata are present. This would need to be done for all wikis, but the script could output metrics in a single location. Another possible tool is mwoffliner, but it modifies the content, so it might not be the best fit.
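The HTML-dump approach could look roughly like the sketch below. It is only an illustration under stated assumptions: the marker names (`fileinfotpl_aut`, `fileinfotpl_src`, `licensetpl`) are assumed here, modelled on the id/class convention used for machine-readable data on Commons, and should be confirmed against each wiki's templates. The `pages` dictionary stands in for whatever a real dump reader would produce, and the function names are hypothetical.

```python
from html.parser import HTMLParser

# Assumed marker names; the real set depends on each wiki's templates.
INFO_IDS = {"fileinfotpl_aut", "fileinfotpl_src"}
LICENSE_CLASSES = {"licensetpl"}


class MarkerScanner(HTMLParser):
    """Collect every id and class value seen in a page's HTML."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if value is None:
                continue
            if key == "id":
                self.ids.add(value)
            elif key == "class":
                self.classes.update(value.split())


def missing_markers(html):
    """Return the list of expected markers that the page does not contain."""
    scanner = MarkerScanner()
    scanner.feed(html)
    scanner.close()
    missing = sorted(INFO_IDS - scanner.ids)
    if not (LICENSE_CLASSES & scanner.classes):
        missing.append("license")
    return missing


def coverage(pages):
    """pages maps title -> HTML. Return (fraction complete, {title: missing markers})."""
    report = {title: missing_markers(html) for title, html in pages.items()}
    incomplete = {title: miss for title, miss in report.items() if miss}
    complete = len(pages) - len(incomplete)
    return (complete / len(pages) if pages else 0.0), incomplete


pages = {
    "File:A.jpg": '<td id="fileinfotpl_aut">x</td><td id="fileinfotpl_src">y</td>'
                  '<span class="licensetpl foo">z</span>',
    "File:B.jpg": "<p>No templates here</p>",
}
ratio, incomplete = coverage(pages)
print(ratio, incomplete)
```

The returned fraction is the kind of single metric that could be reported per wiki, while the dictionary of incomplete pages doubles as a worklist for the main tasks above.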

This is a work in progress and your input is genuinely appreciated.

See also