FindingGLAMs/White Paper/DOCS

From Meta, a Wikimedia project coordination wiki
Expanding what is possible around GLAMs on the Wikimedia projects
A White Paper as Guidance for Future Work
developed as part of the FindingGLAMs project

Case Study 4: DOCS – Documents Obtained, Compiled and used for Sourcing

Key facts

Time: Fall 2019

Organizations involved: Nordic Museum, UNESCO

Wikimedia/free knowledge communities involved: Wikimedia Sverige, Swedish Wikisource community

Keywords: Wikimedia Commons, Wikidata, Wikisource, digitalization, literature, copyright, archives

Key conclusions

  • Uploading a GLAM collection to Wikidata and Wikimedia Commons can involve several different, independently developed tools, which are not naturally connected to each other. The uploader has to learn each of them in order to develop a functional working process.
  • The process requires experience and knowledge of the Wikimedia tool ecosystem, including tools maintained by volunteers, which can make it inaccessible to new editors and GLAM staff.
  • OpenRefine and Pattypan are powerful GUI-based tools that can be used in tandem in a Wikidata + Wikimedia Commons process, lowering the entry barrier for new uploaders.
  • There is great potential value in creating a clear path by which uploaded material can be (semi-)automatically shared with multiple specialized Wikimedia platforms and connected across them. This path, and the tools it needs, are still to be developed.

Background

A digitized ethnographic library reaches Wikimedians

The Nordic Museum (Nordiska museet) is a Swedish museum, located in Stockholm, dedicated to the cultural history of Sweden. Wikimedia Sverige has previously collaborated with the museum on projects ranging from uploading digitized artworks[1] to Wikimedia Commons to supporting high school students who edit Wikipedia using materials from the museum’s library.[2] The museum staff who work with digital media have a keen interest in and familiarity with the open knowledge movement and our organization.

Fataburen is a journal that has been published by the Nordic Museum for over 120 years. The museum had undertaken a digitization project to make it more accessible to the public, and had cataloged and published the articles in PDF format on the open publication platform DiVa.[3] Furthermore, high resolution individual page scans in TIFF format are stored in an internal database that is not available to the public. Since the digitization project covered the whole history of the journal, from its beginnings to the most recent volumes, a considerable share of the articles was in the public domain, as more than 70 years had passed since the death of their authors.

In order to make the collection better known and more accessible, the museum reached out to Wikimedia Sverige to help share Fataburen on the Wikimedia platforms. We judged the material to be highly valuable to the open knowledge movement, as the journal is a classic in Swedish cultural research, with many notable contributors over the years, and treats topics that are interesting to the general public (like folklore, life in the countryside, etc.) in an accessible way. In other words, we anticipated that the material could be put to good use by Wikipedians.

As we had access to both the scans and the metadata, it made sense to both upload the (public domain) scans to Wikimedia Commons and (all) the data to Wikidata. Furthermore, we envisioned that the collection could be of interest to the Swedish Wikisource community, as each article is short enough for one or two volunteers to proofread, and it would make them easier to use as sources in Wikipedia articles. That way, the museum would gain higher visibility for its publications, and at the same time the Swedish Wikimedia community would get access to valuable source materials.

Archival photos – archival documents

As part of our collaboration with the UNESCO Archives, we worked with their collection of photographs documenting the organization's history. The Archives manage a collection of about 170,000 visual resources, 5,000 of which have been deemed to be particularly important and included in the Digitizing Our Shared UNESCO History project.[4] We were tasked with uploading a smaller, curated selection of 100 photographs to Wikimedia Commons.

What makes this project interesting in the context of processing documents is that the photographs’ metadata is not yet digitized and only exists on the backs of the photographs. The descriptions provide relevant information, without which the photographs' educational value would be significantly lower. That's why we were provided with two files of each photo, one of the front and one of the back. For the information written on the back side of a photo to be truly valuable, it has to be converted to text. That way, it is easier to read, find, analyze and share. OCR technology was developed for this task, but it is not perfect: the results of the automatic recognition process have to be validated by a human, as they can contain errors. This is particularly true when dealing with text that has lost contrast and become blurry with age, as is the case with much archival material – including the UNESCO photos. Manually validating thousands of files does not require specialized knowledge, but can get tedious, and might not be considered an efficient use of staff resources by GLAM institutions.

That's why the UNESCO Archives have implemented a crowdsourcing project to transcribe the photograph captions, which are then validated.[5] A volunteer does not have to invest a lot of time into transcribing a photograph that contains a paragraph or two of text. In exchange, they get a sense of achievement from helping out one of the world's most renowned organizations. As of February 2020, 214 volunteers had entered the data from 48% of the 5,000 participating photographs, and 37% had been validated.

It was the output of the crowdsourced transcription process that was shared with us together with the photograph files. We did not have to process the texts ourselves in order to upload them as metadata accompanying the images on Wikimedia Commons. Even though we had access to the transcribed captions, we still found the actual back sides of the photographs important enough to be uploaded as well. Firstly, they provide an educational value, serving as an example of how archival photographs can be described. Secondly, information that is not conveyed by the raw text, such as the graphic layout or even the color of the ink, might be of interest to researchers; we cannot imagine all possible uses people can have for them, so we should not limit them by withholding material that we have the power to share. Thirdly, the viewer can refer to the back sides to confirm that the transcribed text is indeed true to the source, which is important to researchers, journalists and others who need to be sure the resources found on Wikimedia Commons are reliable. That’s why both sides of every photo were uploaded and linked to each other by including a thumbnail of the other side in the file’s information box.[6]
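The cross-linking of the two sides can be sketched as follows. This is a minimal illustration, not the script used for the upload: the file names and the helper function are hypothetical, and the sketch only assumes that the information box has a field that accepts ordinary wikitext (such as `other versions` in the {{Information}} template).

```python
def cross_link_sides(front_name, back_name):
    """Map each file name to wikitext that shows a thumbnail of its other side."""
    return {
        front_name: f"[[File:{back_name}|thumb|Back of the photograph]]",
        back_name: f"[[File:{front_name}|thumb|Front of the photograph]]",
    }


# Hypothetical file names, for illustration only.
links = cross_link_sides("Example photo (front).jpg", "Example photo (back).jpg")
print(links["Example photo (front).jpg"])
```

Each value would be placed in the information box of the file named by its key, so that a reader of either side is one click away from the other.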

Problem

The following steps were identified as necessary for the completion of the Nordic Museum project:

  • All bibliographic metadata of the about 2,000 articles is uploaded to Wikidata. This was identified as necessary because structured data is much more powerful than unstructured data, making it possible to link to authors, topics, etc., as well as enabling complex queries and thus increasing the discoverability of the data. In particular, it would enable us to find articles written by authors who have been dead long enough for their works to have entered the public domain.
  • Public domain articles in the collection are identified. This was identified as necessary because the museum had digitized all the articles, regardless of their copyright status, and did not provide information about the copyright status in their system. We had to identify the articles that could be uploaded to Wikimedia Commons.
  • The public domain articles are uploaded to Wikimedia Commons. This step would make the articles available to Wikimedians.
  • The public domain articles on Wikimedia Commons are connected to the corresponding Wikidata items. This step was identified as necessary so that those who use Wikidata to find interesting articles can access the scanned files and read them.
  • A single article is published and proofread on Swedish Wikisource, to act as an example and to hopefully initiate community engagement with the collection. This step was identified as necessary to serve as a proof of concept and basis for reflection on how it might be possible to use Wikisource to improve the educational benefits of digitized literature from GLAM collections.
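The kind of query that the first step is meant to enable can be sketched as follows. This is an illustration under stated assumptions rather than the query we ran: the journal's Wikidata item ID is passed in as a placeholder, the function name is hypothetical, and the 70-year term is assumed to run to the end of the calendar year. The properties used are published in (P1433), author (P50) and date of death (P570).

```python
def build_public_domain_query(journal_qid, reference_year, term_years=70):
    """Build a SPARQL query for articles whose linked authors died long enough ago.

    Assumption: protection lasts through the end of the term_years-th calendar
    year after the author's death, so death_year + term_years < reference_year.
    """
    cutoff_year = reference_year - term_years
    return f"""
SELECT ?article ?articleLabel ?author ?death WHERE {{
  ?article wdt:P1433 wd:{journal_qid} ;   # published in the journal
           wdt:P50 ?author .              # author linked as an item
  ?author wdt:P570 ?death .               # author's date of death
  FILTER(YEAR(?death) < {cutoff_year})
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "sv,en". }}
}}
"""


# "Q123" is a placeholder, not the real item ID of the journal.
query = build_public_domain_query("Q123", reference_year=2019)
print(query)
```

The query returns one row per article and author, so an article with several authors still has to be checked to confirm that all of them qualify. Articles whose authors are recorded only as name strings do not appear at all, which is why matching authors to items matters.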

The following problems were encountered in the course of pursuing these goals:

The author information in the metadata was only available as strings

This caused difficulties on two levels. First of all, it reduced the value of the data on Wikidata, as it made it impossible to link to the authors' items and utilize the power of Wikidata as a platform for linked structured data. It did not, however, prevent us from uploading the data – the Wikidata property author name string (P2093) exists for precisely this reason, making it possible to add author information without having identified the corresponding Wikidata item. Indeed, since its creation back in 2015, the property has been used over 30 million times, primarily by editors importing large datasets of bibliographic metadata, indicating that we were far from alone with this problem. Volunteers have developed tools to work with author name strings on Wikidata, such as the Author Disambiguator.[7]

Using author name strings for all the authors, however, while fast and convenient, would have caused significant problems in our work on Wikimedia Commons, if not prevented it in its entirety. In order to upload the articles to Wikimedia Commons, we had to be sure they were in the public domain; this status depends on the author’s death date, as described above. Neither the copyright status of the articles nor the authors’ death dates were provided in the museum’s database, meaning that we had to find this information on our own. Since many of the contributors to Fataburen were notable in Wikipedia/Wikidata terms, we expected to find they already had Wikidata items with the basic biographical information we needed.

To solve this problem we employed Mix’n’match, a volunteer-developed tool for matching strings (e.g. person names) to Wikidata items.[8] The tool makes it possible for editors to collaborate on matching a dataset. Once we published a Mix’n’match catalog with the authors' names, it was quick and easy for both WMSE's and the museum's staff to match the names to Wikidata items, and the catalog even attracted the attention of two volunteer editors.[9] Once we were happy with the matching – that is, when all the names that could be linked to existing Wikidata items had been linked, a number of new items had been created for authors that we found notable, and over 60% of the entries in the catalog were matched – the matching results were downloaded using the export function in Mix’n’match. The authors who were matched were also the most prolific contributors to the journal, ensuring a high number of articles with linked authors. The remaining names would be added using the author name string property.
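The choice between a linked author and an author name string can be sketched as follows. This is only an illustration: the function name, the author names and the item IDs are hypothetical, and the lookup table stands in for the matching results exported from the catalog. The properties are author (P50) and author name string (P2093).

```python
AUTHOR_ITEM = "P50"            # author, linked to a Wikidata item
AUTHOR_NAME_STRING = "P2093"   # author name string, used when no item is known


def author_statement(name, matches):
    """Return (property, value) for one author of an article.

    `matches` maps author names to Wikidata item IDs for matched authors.
    """
    qid = matches.get(name)
    if qid:
        return (AUTHOR_ITEM, qid)
    return (AUTHOR_NAME_STRING, name)


# Hypothetical names and item IDs, for illustration only.
matches = {"Anna Exempel": "Q100"}
print(author_statement("Anna Exempel", matches))
print(author_statement("Bo Okänd", matches))
```

Only the first kind of statement lets a query reach the author's date of death, so only articles with matched authors can be checked for public domain status automatically.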

The mapping of authors’ names to Wikidata was useful not only for the needs of this project. A couple of weeks later, it was ingested into KulturNav, a shared authority platform used by a number of cultural heritage institutions across Sweden and Norway.[10] Since a KulturNav property exists on Wikidata (P1248), the KulturNav URIs could then be added to the authors’ items. The mapping we did can now be re-used by the Nordic Museum to enrich their collection, or indeed by any other GLAM institution using KulturNav.

Public domain files uploaded to Wikimedia Commons must be in the public domain in both the country of origin and the United States

Copyright law can get complicated on a global scale; Wikimedia Commons can only host free content, the definition of which is not the same in every country. It is an international project, with users and content from all over the world. However, the servers that host Wikimedia Commons (and the other Wikimedia platforms) are under U.S. jurisdiction. Because of that, a work must be covered by a valid free license or have entered the public domain in both the country where it was first published and the United States.[11]

In 2019, in order for a work to be in the public domain in the United States, it had to be published before 1924. In Sweden, works enter the public domain 70 years after the author's death. It is thus possible for a work to be in the public domain in Sweden, but not in the United States.

This problem was solved by only uploading articles that were in the public domain in both countries. It was a fully acceptable solution, as it left us with over 200 files to process, and the museum staff involved in the project were familiar with that particular Commons policy. Had we been working with a GLAM without previous experience of the Wikimedia projects, this issue might have taken some time to explain; it is also not inconceivable that it might have influenced their decision to contribute to Wikimedia Commons.
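The two conditions can be expressed as a small check. This is a sketch of the rule as summarized above, not legal advice and not the code we used: the function name is hypothetical, the cut-off values are those that applied in 2019, and the 70-year term is assumed to run to the end of the calendar year. An article with unknown authors or unknown death years is treated as not confirmed.

```python
def is_public_domain_in_both(publication_year, author_death_years,
                             reference_year=2019, us_cutoff_year=1924,
                             term_years=70):
    """Return True only if the work is public domain in Sweden and the US.

    author_death_years holds one entry per author; None means unknown.
    """
    if not author_death_years:
        return False  # no known authors, so the status cannot be confirmed
    us_ok = publication_year < us_cutoff_year
    sweden_ok = all(
        death is not None and death + term_years < reference_year
        for death in author_death_years
    )
    return us_ok and sweden_ok


print(is_public_domain_in_both(1910, [1940]))  # both conditions hold
print(is_public_domain_in_both(1930, [1940]))  # Sweden only
```

The second call illustrates the case described above: the author died long enough ago for Sweden, but the article was published too late for the United States.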

We had access to the scanned material in two formats. On the museum’s open publishing platform DiVa, the articles are published as PDF files. Some of these PDF files contain the cover image of the issue in which the article was originally published, which was problematic: the copyright status of the cover image would have to be investigated separately in order to determine whether it could be uploaded. Had we decided to preventively exclude the cover images, we still would have had to process the PDF files to remove the offending page.

Furthermore, while PDF is an acceptable format on Wikimedia Commons and Wikisource, DjVu is often considered more suitable, due to it being an open standard from the start and the resulting availability of free and open tools to work with it.[12] We made the decision to use this open standard to make our contribution more accessible to the community.

The alternative to those PDF files was the set of individual page scans, in TIFF format, stored in the museum’s internal database and not available to the public. They had a higher resolution than the PDF files, which was important to us, as it would make OCR-ing and proofreading the articles in Wikisource easier. This also gave us the freedom to collate them into DjVu files. This was what we decided to do – even though it involved more work than downloading the readily available PDF files, as we had to write custom scripts to download and convert the files. The museum staff agreed that sharing the materials in the highest possible quality was important.

TIFF is a lossless format and thus preferable to JPG in cases where high fidelity is important. Scanned text is such a case, as any aberrations will make it harder to process for OCR software and human eyes alike. That was our principal motivation for choosing to engage directly with the TIFF files. Another was that uploading the highest quality resources available is considered a good custom in the Wikimedia Commons community.[13] Not only are high-resolution files better suited for print and modern displays, they also prove the GLAM's dedication to sharing their material with the public in the same format that is available internally.

Implementation

The project was carried out in several steps, each step using a suitable tool.

Metadata processing

OpenRefine,[14] an open-source application for data cleanup and reconciliation, was used to process the bibliographic metadata and upload it to Wikidata, creating a new item for each of the about 2,000 journal articles. OpenRefine is a popular tool among Wikidata editors, even though it is unaffiliated with Wikidata and was not originally built with Wikidata in mind. We used it to execute all the steps of the Wikidata process, from examining the raw metadata to creating the Wikidata items.

File processing and upload

Downloading the TIFF files from the museum’s internal database and collating them into DjVu files required us to write a dedicated script.[15] While the script is highly specific to this particular task, we imagine it could be partially reusable, as the database uses the same API as KulturNav.[16] Operations such as downloading documents from a list would be done in the same way in another database using this architecture. Of course, large parts of the script will be reusable if we ever do another project using materials from the museum’s database. The parts where the image files are converted and collated into DjVu files are also generic.
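The generic conversion and collation part can be sketched as a list of external commands. This is not the dedicated script referenced above but a minimal illustration that makes several assumptions: ImageMagick's `convert` turns each TIFF into a PPM file, the DjVuLibre encoder `c44` turns each PPM into a single-page DjVu, and `djvm -c` bundles the pages into one document. The function and file names are hypothetical.

```python
from pathlib import Path


def plan_djvu_commands(tiff_pages, output_djvu):
    """Return the command lines that would turn page scans into one DjVu file."""
    commands = []
    page_files = []
    for tiff in tiff_pages:
        stem = Path(tiff).with_suffix("")
        ppm = f"{stem}.ppm"
        page = f"{stem}.djvu"
        commands.append(["convert", str(tiff), ppm])  # TIFF -> PPM
        commands.append(["c44", ppm, page])           # PPM -> one-page DjVu
        page_files.append(page)
    # Bundle the single pages, in order, into one multi-page document.
    commands.append(["djvm", "-c", str(output_djvu), *page_files])
    return commands


# Hypothetical file names, for illustration only.
for command in plan_djvu_commands(["page1.tif", "page2.tif"], "article.djvu"):
    print(" ".join(command))
```

Each command list can be handed to `subprocess.run(command, check=True)` once the tools are installed. Keeping the planning separate from the execution makes it easy to inspect what will be run before processing a whole volume.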

Pattypan[17] and OpenRefine were both used to upload the public domain articles to Wikimedia Commons. Pattypan is an open-source application for batch file uploading to Wikimedia Commons, designed to be easy to use for GLAM volunteers and staff. Pattypan uses a spreadsheet to handle the metadata of the uploaded files. That spreadsheet was prepared with OpenRefine, using the output of the Wikidata pre-processing from the previous step as an input and modifying it to fit into the Wikimedia Commons informational template format. It should be noted that OpenRefine is in no way necessary to prepare data for Pattypan ingestion; different users have different preferred tools and workflows, and the spreadsheet format offers great flexibility, down to editing the data by hand in a spreadsheet software of one’s choice. In our case, using the two tools in tandem made the most sense, due to having already processed the metadata in OpenRefine in the Wikidata step.
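The step of fitting a metadata row into the informational template can be sketched as follows. The sketch uses Python's `string.Template` to stand in for the template filling that the upload tool performs. The column names and the sample row are hypothetical, and the field names are those of the {{Information}} template, which may differ from the template we actually used.

```python
from string import Template

# Wikitext skeleton with one placeholder per spreadsheet column.
WIKITEXT = Template("""{{Information
 |description = ${description}
 |date = ${date}
 |source = ${source}
 |author = ${author}
 |permission = ${permission}
}}""")


def render_row(row):
    """Fill the wikitext skeleton from one spreadsheet row (a dict)."""
    # substitute() raises KeyError if a required column is missing,
    # which catches incomplete rows before anything is uploaded.
    return WIKITEXT.substitute(row)


# Hypothetical row, for illustration only.
row = {
    "path": "article.djvu",
    "name": "Example article.djvu",
    "description": "Example article from the journal",
    "date": "1910",
    "source": "Nordic Museum",
    "author": "Anna Exempel",
    "permission": "Public domain",
}
print(render_row(row))
```

Columns such as `path` and `name` are not part of the wikitext and are simply ignored by the substitution.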

Also used in this project were Mix’n’match and QuickStatements, the former for the matching of author strings to Wikidata items outlined above and the latter for making batch edits to the Wikidata items of the journal articles, such as adding links to the scanned files on Wikimedia Commons. Both tools are developed by volunteers and popular among Wikidata editors.
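A batch edit of this kind can be sketched as follows. The sketch assumes the tab-separated command format of QuickStatements with the file name given as a quoted string, and it assumes that the link is made with the property document file on Wikimedia Commons (P996); neither detail is confirmed by our project notes. The item IDs, the file names and the function name are hypothetical.

```python
def commons_file_statement(item_qid, commons_filename, prop="P996"):
    """Return one tab-separated command: item, property, quoted file name."""
    return f'{item_qid}\t{prop}\t"{commons_filename}"'


# Hypothetical item IDs and file names, for illustration only.
articles = {
    "Q1001": "Example article one.djvu",
    "Q1002": "Example article two.djvu",
}
batch = "\n".join(
    commons_file_statement(qid, filename) for qid, filename in articles.items()
)
print(batch)
```

The resulting text, one command per line, is what would be pasted into the tool as a batch.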

Structured Data on Commons (SDC) was not used in this project. At the time of executing this project, the development of SDC was focused on media whose aim is to visualize something, such as artworks, photos and illustrations, unlike scanned text.

Wikisource

Once we had a number of interesting articles uploaded to Wikimedia Commons, we explored how to add them to Wikisource. An early question was whether to do this step automatically as well. It would be technically viable to make a bot that creates the index pages for all the 200 articles. But would it be beneficial to the community?

The Swedish Wikisource community is comparatively small, with a dozen active users. This means the rate of proofreading is slow but steady, and the selection of works that get proofread is determined by that small user group's interests. We decided that dumping 200 new documents on them would not be a nice thing to do (and would likely be counter-productive). Instead, we posted an announcement on Wikisource's discussion forum that the documents were available on Wikimedia Commons, and created an index page for one of them.[18] After we initiated the proofreading process, other users came and finished it.

A couple of weeks later we found that a dozen more articles from our upload had been proofread,[19] showing that the community was interested in the subject matter.

Outcome

The project resulted in 204 files on Wikimedia Commons,[20] 1,822 Wikidata items[21] and one proofread article on Wikisource.[22] At the time of writing (February 2020), the Wikisource community has proofread an additional 12 articles. We find this impressive since the Swedish Wikisource community is rather small: in February 2020, there were 18 users who had performed an action within the previous 30 days.[23]

Future

What this project demonstrates is that the way from raw material to Wikimedia uploads can be long and require several tools. This has been especially true since Wikidata became part of the Wikimedia ecosystem and garnered the interest of GLAM institutions. These days, creating Wikidata items for the materials uploaded to Wikimedia Commons and making them as complete as possible can be an important part of a GLAM upload process. The project Sum of All Paintings, which aims to create Wikidata items for every notable painting in the world, demonstrates how important GLAM collections are to the Wikimedia community.[24]

The most obvious advantage of working with Wikidata and Wikimedia Commons simultaneously is that it enables complex queries, as in our example, where we had to identify articles by sufficiently dead authors. On the other hand, with the emergence of Structured Data on Commons, one cannot help but wonder what the division between SDC and Wikidata should be. Had we included SDC in our project, we would have had to provide data in three places: on the file page on Wikimedia Commons, as SDC statements, and in the corresponding Wikidata item. This might be confusing primarily for new users, but also for experienced Wikimedians who first learned to work with files on Wikimedia Commons and then saw the emergence of Wikidata and SDC. To make things even more confusing, information templates on Wikimedia Commons based on the Artwork module[25] can automatically pull some information (e.g. the accession number) from the linked Wikidata item, if one has been provided. This is not clearly documented and the Artwork module is updated regularly with new functionalities.

This shows that working with structured data about GLAM content is not straightforward. Every uploader can develop their own workflow and strategy, and since both Structured Data on Commons and the way the informational templates pull data from Wikidata are under development, editors who follow the development can experiment and implement new ideas regularly. On the other hand, newcomers or editors who have no interest in technical development and just want to know exactly what is possible to do and how to do it might find this landscape difficult to navigate.

To solve this conundrum, or at least make it more approachable, one imagines documentation might be helpful. There is currently no single user's manual that describes the whole process, focusing specifically on the interplay of Wikidata, (non-structured) Wikimedia Commons and SDC and aimed at newcomers. The primary reason, one assumes, is that the projects have evolved separately from each other and the communities are mostly separate as well. Secondly, technical development on these platforms is rapid and proceeds independently on each. Thirdly, the tools available to editors, like QuickStatements, Pattypan, OpenRefine and the many small gadgets and user scripts on both Wikidata and Wikimedia Commons, are developed independently by volunteers, so it's not realistic to expect all of them to engage in a coordinated documentation project.

Wikisource was an interesting part of this project, as it was the only one where we chose to engage manually. That was a conscious choice due to the small size of the Swedish Wikisource community. With such a small group of active users, it is obvious that the content of this platform is very curated and determined by the editors' interests. One might wonder why the activity is so low. Is it simply a natural reflection of the low level of interest in crowdsourced transcription and proofreading in the general public, or do technical aspects also come into play?

The latter is definitely a possibility, as we have found. Once one has uploaded or found an interesting literary work on Wikimedia Commons, initializing a proofreading project on Wikisource is not easy – in fact, the documentation classifies it as an "advanced task".[26] In general, Wikisource might be the platform that requires the most technical knowledge to participate in – there are a lot of tags with arbitrary names that have to be memorized. Notably, the tags and formatting customs are determined by each Wikisource community, so the skills gained on Swedish Wikisource do not automatically transfer to the English, German or other Wikisource editions, which can also lead to editors being less active than they would like to be.

We imagine that at least the process of initializing a new proofreading project could be significantly improved by a closer synergy with Wikimedia Commons and by the creation of a user-friendly tool akin to the Upload Wizard. When stumbling upon an interesting literary work on Wikimedia Commons, the user could be offered a direct link to import it into Wikisource. If the file on Wikimedia Commons contains information about the language of the work (in the information template, in SDC or in the linked Wikidata item), the right Wikisource version could be determined automatically. Then, the user would be asked a series of questions, one per screen so as to not risk informational overload, to determine the variables that they currently have to enter manually, like which pages of the work to include. Again, any information already available in structured form, like the name of the author, or the genre of the work, could be imported to Wikisource automatically.

The reason why we included the UNESCO Archives upload in this case study is that it also involves the problem of transcription, albeit approaching it from a different angle: the photo captions were proofread on a different platform, in a drive organized by the GLAM. That made our work considerably easier, as it is unclear how such a project could be undertaken in the Wikimedia environment. Wikisource was developed to proofread multi-paged works like books and articles, and requires, as mentioned previously, a fair amount of internalized technical knowledge. The transcription manual for volunteers published by UNESCO[27] makes clear that the platform they are using for their crowdsourcing project, HeritageHelpers, provides a much simpler environment, making it easier for newcomers to start contributing. Most importantly, the crowdsourcing platform makes it possible for users to enter the data directly into a set of predefined fields, such as Country, Date range, Persons, Credits and so on. This means that the output of the process has a structured form, which is much easier to process for Wikimedia Commons or Wikidata than unstructured text.

In this case study, we explored several ways to interact with visual resources with text content in order to maximize their benefits for Wikimedians and other users. A picture says more than a thousand words, but if the picture is a scan of an article or the back of a photo with text on it, the words cannot be neglected. Even though Wikimedia Commons is primarily a visual platform, it can store many different types of media files. If anything, this project demonstrates how many different ways exist to work with them. It makes it clear that there is a need for tools and workflows that do not address only one platform, like Wikimedia Commons. The Wikimedia power users, including those working with GLAM content partnerships, devote a lot of their time to figuring out the right tools for the task at hand, and the best ways to involve the different Wikimedia platforms to truly make the materials useful and used.

A centralized effort to develop tools and documentation for those users could have enormous benefits. It would not only save their time but also lower the threshold for new editors, including GLAM professionals, who want to work directly with their material on the Wikimedia platforms. A typical GLAM professional who wants to share their expertise and has a basic understanding of the Wikimedia platforms would have to devote a lot of time to researching and learning all the tools and steps we implemented in this case study – especially if they did not know from the beginning what is possible to achieve in the first place. Streamlining the process and enabling seamless connections between the different Wikimedia platforms will give those users more power.

References