Research:Supporting Commons contribution by GLAM institutions/Preparing media items for upload

From Meta, a Wikimedia project coordination wiki

"The hardest part doesn't have anything to do with Pattypan itself. It has to do with... Cleaning up metadata and transforming them into the Commons format"

Metadata about files to upload is structured in all sorts of ways (when it is in a structured format at all). Usually, some kind of transformation, munging, or filtering is necessary to get it into a format that a particular upload tool can accept.
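As a rough illustration of the kind of transformation described above, the sketch below renders one flat metadata record as a Commons {{Information}} template block. The record keys (title, created, institution, creator) and the sample values are hypothetical stand-ins for an institution's own columns, and the function name is invented for this example. Each upload tool defines its own input format, which this sketch does not attempt to reproduce.

```python
def to_information_wikitext(record):
    """Render one flat metadata record as a Commons {{Information}} block."""
    # Left: parameters of the {{Information}} template.
    # Right: hypothetical column names from an institution's own export.
    mapping = [
        ("description", "title"),
        ("date", "created"),
        ("source", "institution"),
        ("author", "creator"),
    ]
    lines = ["{{Information"]
    for param, key in mapping:
        # Missing fields are left empty rather than guessed.
        value = str(record.get(key, "")).strip()
        lines.append(f"|{param}={value}")
    lines.append("}}")
    return "\n".join(lines)


# Invented sample record, for illustration only.
record = {
    "title": "Example object, front view",
    "created": "1950",
    "institution": "Example Museum",
    "creator": "Unknown photographer",
}

wikitext = to_information_wikitext(record)
print(wikitext)
```

Leaving missing fields empty keeps the gaps visible, so they can be reviewed by a person before upload instead of being filled with placeholder text.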

Collecting, or in some cases, creating metadata for media items before upload presents a variety of challenges, particularly for smaller GLAM organizations and smaller GLAM donation projects. In many cases, the metadata for media items is not already organized and stored in a structured form in an institutional database.

In some cases, there is very little documented metadata at all for the media in the collection at the start of the GLAM project, and the collection itself may not be available in digital form. In these cases, generating this metadata (through background research or digitization) and associating it with media items to be uploaded becomes a vital, but complex and tedious, part of the GLAM project.

Assembling metadata for upload

Participant Quote
p7 "We get content from different sources, we have some metadata that follows specific standards like Dublin Core, but it can be anything. This type of metadata is tree-structured. It might have a depth level that is different depending on the record at hand. One task I had to address was the flattening the data. Make sure the information was present on level 1. In order to import it with GLAM wiki toolset, need to flatten."
p1 "The database in [the Museum of Brazilian History] is different, it’s only maintained currently on a computer that is a PC XT. Using a technology from the 80s. It’s completely offline. Runs on some sort of DOS, so each file is a different file, and we don’t know how to extract the data. It’s not something we’ve ever seen. This may be common in third world museums. It’s been closed for years because the roof is collapsed. But it’s one of the most important museums in Brazil. But about 75,000 objects in their collection are in this computer, which we don’t know how to get them out of."
p6 "The collection was mostly digitized paintings and drawings. The size was not spectacular. They didn’t want to reduce the resolution before uploading though. Some of the pictures was taken by professional photographers, so there is metadata for the camera, but not much metadata about the content. We didn’t even talk about metadata at that time. It was hard because the collections management software they used wasn’t very friendly for capturing metadata."
p11 "The general process generally involves using a scanner or camera and crop the image so that you don’t get unnecessary details. We think it would improve our work if we had a semi-automated tool. The Bulgarian Archive actually uploads up to 30% of their images online. However, even if they are online, if we have to download these images, fix them, add the metadata, and upload to Commons."
  • Accessing data stored in legacy databases. Even when media items and their associated metadata are available in a digital database and a structured format, getting that data out of the database can be challenging. Legacy information formats, database software platforms, operating systems and computer hardware all present distinct challenges for data export, and often the challenges combine to create a series of substantial hurdles that must be overcome before the media and metadata are even accessible in a format that can be read and reviewed, let alone uploaded.
  • Combining and standardizing metadata. Metadata about items within a GLAM organization's collection may be stored in multiple places and in multiple formats (both digital and analog). Combining that metadata into a single structure that is compatible with the input parameters of available upload tools (GLAMwiki Toolset, Pattypan, Upload Wizard, etc.), so that it can be uploaded to Commons, is a major challenge, especially for smaller and less-well-resourced GLAM organizations and for GLAM projects that lack participants with significant, relevant technical expertise.
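The flattening task that p7 describes (moving tree-structured metadata of varying depth onto a single level) can be sketched roughly as follows. This is a minimal illustration, not the participant's actual method: the dotted key naming, the sample records, and the choice of CSV output are all assumptions made for the example, and a real upload tool may expect different column names.

```python
import csv
import io


def flatten(node, prefix=""):
    """Collapse a nested dict/list into a single-level dict with dotted keys."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        # List positions become numeric key segments, e.g. "subject.0".
        items = ((str(i), v) for i, v in enumerate(node))
    else:
        # Leaf value: store it under the path built so far.
        return {prefix: node}
    flat = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


# Invented records with different nesting depths, for illustration only.
records = [
    {"title": "Street recording", "creator": {"name": "A. Example"}},
    {
        "title": "Harbour sounds",
        "coverage": {"place": {"city": "London"}},
        "subject": ["harbour", "ships"],
    },
]

rows = [flatten(r) for r in records]

# Use the union of all keys as columns so no record loses a field.
columns = sorted({key for row in rows for key in row})

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Taking the union of keys across all records means that records with shallower structure simply get empty cells, which makes missing metadata easy to spot in a spreadsheet.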

Creating and digitizing metadata

Participant Quote
p1 "The only metadata that existed was small sticker on each item, like “Ostrich skeleton”. We don’t know when the object came, who donated it. We can’t generate that. We’ve been focusing a lot on the techniques of bone maceration to be more specific about this kind of information: what was the technique that was used to create the exhibit, info about the subject animal. They aren’t even sure what animals they have skeletons of sometimes. We work with professors to classify this information."
p8 "The pictures were numbered, and I got a word document for each numbered picture with a description. I supplemented those descriptions, and also changed the filename to a descriptive one. Other images I had to identify myself (they weren’t listed in the document). Data was messy: image list was in one document, descriptions were in another document. Preparation was very manual, required original research. For example, [microscope] power information. Sometimes this was listed in the description, but had to estimate this myself for many [pathology slides]."
p10 "Some of the [historical documents] were of a paper format that scanners could not read. In older days, content was on paper that cannot be scanned easily. Without metadata uploads are useless, but sometimes [the GLAM institution] do not even know what is the metadata for an item!"
p7 "My work was mostly in metadata preparation, [figuring out] how we can get/define metadata that wasn’t structured in the original metadata file. Wanted content that represented European sound creation. 1 million audio files accompanied by metadata. Need to create a selection that would represent content created in Europe. One problem that I needed to get a solution for was how to identify content from Europe when the metadata was unstructured? Could be that the location field would be a place in London, but it would be non-standardized, non-normalized. We used a service to get a lot of metadata about the sounds in the recording."
  • Generating new metadata. Another challenge that many GLAM organizations face early in the project (before upload) is improving the metadata of their collections so that each item has enough metadata to be legally uploaded to Commons, and to be discovered and used once it's there. Since many of these items are historical, cultural, or scientific records and artifacts, subject matter expertise (and sometimes a fair degree of detective work) is necessary to create this metadata. Examples include: the creator or copyright holder of the media items, the preservation or preparation technique used to create the object depicted (p1 above), and the power of the microscope used to create a pathology slide (p8 above).
  • Digitizing analog metadata. Metadata about the items depicted in digital media is often not recorded digitally, as in the case of the skeleton labels described by p1 above. In other cases, print media must be scanned and converted to a digital document format for upload. Metadata about these documents is often embedded in the text itself: OCR technology can sometimes help with extracting this metadata (for example, title and author), but in many cases the condition of the printed artifact, the language of the artifact, or the quality of the scanned image thwarts OCR and automated metadata extraction. In these cases, item- or collection-level metadata must be transcribed by hand (or at best, copied and pasted as plain text), which is a time-consuming and error-prone process. Several GLAMs reported leaving valuable metadata out of the upload because the cost of extracting it from the source was too high.