Wikimedia Blog/Drafts/The Open Access Media Importer
This post is now available at https://blog.wikimedia.org/2013/10/21/scientific-multimedia-files-get-a-second-life-on-wikipedia/.
- Making scientific multimedia files routinely available on Wikimedia Commons
- Wikimedia Commons as a repository of scientific multimedia files
- The Open Access Media Importer makes scientific multimedia files available on Wikimedia Commons
- The second life of scientific multimedia files on Wikimedia Commons
- PubMed Central's openly licensed multimedia files are now on Wikimedia Commons
- Mutual enrichment of scientific multimedia and Wikimedia platforms
On Wikimedia projects, audio and video content has traditionally taken a back seat to text and static images (however, changes are underway). Conversely, more and more scholarly publications come with audio and video files, though these are typically relegated to the "supplementary material" (a legacy of the print era) rather than embedded next to the relevant text passages. A rising number of these publications are Open Access, i.e. freely available under Creative Commons licenses that allow the materials to be reused in other contexts.
Why not enrich thematically related Wikimedia pages with such multimedia files? That's where the Open Access Media Importer (OAMI) comes in. It makes scientific video and audio clips accessible to the Wikimedia community and a broader public audience. The OAMI is an open-source program (or 'bot') that crawls PubMed Central — a full-text database of over 3 million biomedical research articles — and extracts multimedia files from those publications in the database that are available under Wikimedia-compatible licenses.
Such reuse-friendly terms are the key ingredient in making scholarly materials useful beyond the article in which they were originally published. However, the OAMI aims to make these materials even more useful by making them accessible:
- in places where people actually look for them (Wikimedia platforms are a prime example),
- in one coherent format (in our case Ogg Vorbis/Theora, which isn't encumbered by patent restrictions), and
- in a way that allows for collaborative annotation with relevant metadata. This makes it a lot easier to browse and search the media files.
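The license check that decides which articles the bot draws from can be illustrated with a minimal sketch. This is not the OAMI's actual code: the function names, the record layout (`id`, `license_url`, `media`) and the URL matching are assumptions made for illustration. The only fact relied on is that Wikimedia Commons accepts CC BY, CC BY-SA and CC0 material but not NC or ND variants.

```python
from urllib.parse import urlparse

# Assumption for this sketch: licenses are identified by Creative Commons URLs.
# Wikimedia Commons accepts CC BY, CC BY-SA and CC0, but not NC or ND variants.
COMPATIBLE_PATHS = ("/licenses/by/", "/licenses/by-sa/", "/publicdomain/zero/")
CC_HOSTS = ("creativecommons.org", "www.creativecommons.org")


def is_commons_compatible(license_url):
    """Return True if the URL points to a CC BY, CC BY-SA or CC0 license."""
    if not license_url:
        return False
    parsed = urlparse(license_url.strip().lower())
    if parsed.netloc not in CC_HOSTS:
        return False
    # Normalize the path so that ".../by/4.0" and ".../by/4.0/" are treated alike.
    path = parsed.path if parsed.path.endswith("/") else parsed.path + "/"
    return path.startswith(COMPATIBLE_PATHS)


def select_media(articles):
    """Yield (article_id, media_file) pairs from compatibly licensed articles.

    `articles` is a hypothetical list of dicts with the keys
    "id", "license_url" and "media" (a list of file names).
    """
    for article in articles:
        if is_commons_compatible(article.get("license_url")):
            for media_file in article.get("media", []):
                yield article["id"], media_file


if __name__ == "__main__":
    # Made-up sample records, not real PubMed Central entries.
    sample = [
        {"id": "example-1",
         "license_url": "https://creativecommons.org/licenses/by/4.0/",
         "media": ["s001.mov", "s002.wav"]},
        {"id": "example-2",
         "license_url": "https://creativecommons.org/licenses/by-nc/3.0/",
         "media": ["s001.avi"]},
    ]
    for article_id, media_file in select_media(sample):
        print(article_id, media_file)
```

In this sketch, articles without license information are skipped rather than guessed at, which seems the safer default given that publisher-supplied license metadata is not always consistent.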
Status and Statistics
Since the first tests with the bot in mid-2012, the number of video files on Commons has more than doubled, from about 15k to well over 38k (of which about 10% use the WebM format, the rest Ogg Theora). Much of this increase is due to the OAMI, which has uploaded over 14k files so far: mainly videos, with a few hundred audio files among them. According to the BaGLAMa tool, over 700 of these files are currently used across more than 50 Wikimedia projects, which exposes them to a total of about 3 million page views a month, a scale most of them would never reach via the supplementary materials in which they were originally published.
Once the bot had been approved for automated uploads in October last year, its initial focus was on importing multimedia files from articles published in the past, while in parallel it already processed new ones as they came in. Since mid-2013, the focus has shifted: with the exception of about a hundred files that failed to convert or upload properly, the import of backfiles has been completed, so the bot is now chiefly processing files from newly published articles, several hundred per month.
The Open Access Media Importer is being developed by a team of three: Daniel Mietchen (project lead), Nils Dagsson Moskopp (software development), and Raphael Wimmer (infrastructure). Its initial development was supported by a grant from Wikimedia Germany e.V. Once the materials are uploaded to Wikimedia Commons, the community there helps by improving categorization and file descriptions, fixing file conversion or thumbnail problems, and renaming or cropping files. On that basis, Wikipedia editors looking for illustrations can readily find these materials and incorporate them into the articles they work on. WikiProject Open Access helps with that too, and it has featured a number of OAMI-provided files in its Open Access File of the Day series, which highlights files from Open Access sources that are used multiple times in Wikimedia contexts.
Unforeseen but welcome side effects
Operating the bot uncovered inconsistencies in the metadata that publishers deliver to PubMed Central along with their articles. These issues range from license information to keywords and MIME types, and they have reached a scale that led not only to some adjustments of PubMed Central's workflows but also to a conference paper that was to be presented on Tuesday at JATS Con, a conference on the XML standard that publishers use to exchange information about the content of their publications. Due to the recent shutdown of the US government, the conference had to be canceled, but a JATS user meeting will take place that same day, where an abbreviated version of the talk will be presented.
In converting all multimedia files to Ogg Vorbis/Theora, the OAMI encounters a wide range of codecs and container formats. Some of these (e.g., animated GIFs inside a QuickTime container) were not initially supported by the GStreamer library used for conversion. Bug reports have been submitted to the developers and resulted in several code changes that benefit all GStreamer users.
Another side effect is that the bot has been nominated as a finalist in the first year of the Accelerating Science Award Program, for which the award ceremony is to take place today as part of the kick-off event for this year's Open Access Week. This nomination further increases the visibility of Wikimedia Commons within the scientific community. Video interviews with all six finalists are available on Commons.
The bot was designed to be extendable to sources other than PubMed Central (manual tests with materials from Dryad and from PANGAEA have already been performed), to media types other than audio and video and to target sites other than Wikimedia Commons. Work on a derived pipeline for exporting these videos to YouTube has started, and we welcome anyone who wants to submit patches or plugins, e.g. to make other collections of openly licensed media more accessible (CERN just started to release materials under an open license), to use video tagging for improving the file description pages and their categorization, or to suggest Wikipedia articles where a given file might fit in. Of course, the code for the bot is free and open-source software, and forks to build something cool based on the OAMI (e.g. games or citizen science projects) are most welcome.
Daniel Mietchen (WikiProject Open Access) and Raphael Wimmer
- embeds in Wikimedia pages (or, via InstantCommons, MediaWiki pages more generally)
- search engines
- categories and pages
- links from within Commons, especially its category tree
- links from other Wikimedia projects
- Problems with reuse:
- language: most of the materials are English-only
- from an embedded multimedia file (especially in gallery environments), it is not always obvious how to get to the corresponding file page on Commons
- focus (e.g. concatenated videos about different species)
- video metadata often provided in individual frames rather than via the video container's metadata or the video's captions in the original scholarly article
- Wikimedia Commons is not widely known as a repository of PDF files, but it has more than 215,000 of them, a number that is dwarfed by the over 18 million image files but is about the same as that of all audio and video files combined.
- The overall number of files on Wikimedia Commons currently stands at 17 million, which is about twice the number of topics covered across all Wikipedias.
- Half a dozen Wikipedias have more than a million articles
The top ten most embedded items
- via GLAMorous
- Rare instruments or conditions
- Exception: Zenaida macroura
- Stridulation in multiple taxa
- C. elegans video category
- Conference talks
- Corrected -NC file
File:Alticus arnoldorum jumping - pone.0011197.s008.ogv - multiple uses
File:Otus jolandae song - pone.0053712.s008.oga - essence of scientific finding; some uses
File:Zenaida macroura vocalizations - pone.0027052.s009.oga - used in ca. 30 pages
- commons:User:Open Access Media Importer Bot
- outreach:GLAM/Newsletter/November 2012/Contents/Open Access report
- commons:Commons:Deletion requests/File:GFP's-Mechanical-Intermediate-States-pone.0046962.s003.ogv
- commons:Commons:MIME type statistics
- Breaking through walls of text: How we will create a richer Wikimedia experience
- JATS-Con abstract accepted