Learning patterns/Data transfers to Wikimedia Commons: Sharing institutional archives

A learning pattern for content release partnerships

Problem: The standard upload function of the free media archive Wikimedia Commons has been designed primarily for uploading small amounts of data or media files.

Solution: The transfer of larger data sets or entire archives usually requires some preparation, depending on the character of the data and its associated metadata. This Learning Pattern tries to provide first guidance through the most common tools and procedures.

Creators: Nicolas Rück (WMDE), Jakob Warkotsch (WMDE)

Created on: December 2, 2015

What problem does this solve?

Why should a cultural institution release its archive to the public?

A transfer of precious datasets from the institution’s archives to a free media archive has many benefits: It simplifies their re-use, contextualization and dissemination by third parties. The data can, for example, be integrated into Wikipedia or used barrier-free in a scientific context. New creative and innovative technical applications can be developed based on the archive. Previously unknown background details can be identified and explored by volunteers.

By now, numerous institutions have recognized the benefits of free data: As early as 2008, the German Bundesarchiv (Federal Archive) released over 80,000 photos from its inventory, which have since been used, for example, to illustrate numerous Wikipedia articles viewed by thousands of readers every day. The Veikkos archive donated a unique collection of over 40,000 public domain seals to Commons, which were immediately sorted, categorized and assigned further by the volunteer community of the Wikimedia projects.

Other examples come from the free culture hackathon Coding Da Vinci, which made available a data set of a historic fabric sample collection from the Hochschule für Technik und Wirtschaft (University of Applied Science) Berlin, historical 18th-century writings from the archives of the district office of Berlin Charlottenburg-Wilmersdorf, images of geological collections of the City Museum Berlin as well as audio and video files from the Ethnological Museum Berlin, to name just a few.

However, the standard upload function of the free media archive Wikimedia Commons has been designed primarily for uploading only small amounts of data or media files.

What is the solution?

The transfer of larger data sets or entire archives usually requires some preparation, depending on the character of the data and its associated metadata. This Learning Pattern tries to provide first guidance through the most common tools and procedures:

In which format can I upload media files to Commons?

File Types

For Commons, all file formats listed here are suitable.

Resolution / Compression / File Size

To take full advantage of all the possibilities that can result from continued re-use of media files, they should be uploaded in the highest available resolution and with as little lossy compression as possible. Reducing and compressing images is a lot of work and brings many disadvantages and few advantages.

However, uploads in the default settings of Commons are, for technical reasons, limited. You can find more information about file size here.

File Name

Ideally, the file name should be composed of an explanatory title and an inventory or object number. In the case of artistic works, for example, it may consist of the artist's name, the name of the artwork and an object number. For book scans, the following structure has proved useful: author's name, title and page number. More general information on file naming can be found here.
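
The naming scheme described above can be sketched as a small helper. This is only an illustration: the method name `commons_file_name`, the sample title and the inventory number `HTW-0042` are made up, and the set of characters replaced here is an assumption about what is problematic in file names on a wiki, not an official list.

```ruby
# Hypothetical helper: builds a file name from an explanatory title and an
# inventory number, as suggested in the text above.
def commons_file_name(title, inventory_number, extension)
  # Assumption: these characters are treated as unsuitable for wiki file
  # names and are replaced by spaces; repeated spaces are then collapsed.
  clean_title = title.gsub(/[#<>\[\]|{}:\/]/, ' ').squeeze(' ').strip
  "#{clean_title} (#{inventory_number}).#{extension}"
end

puts commons_file_name('Fabric sample: red/blue stripes', 'HTW-0042', 'jpg')
# prints "Fabric sample red blue stripes (HTW-0042).jpg"
```

Keeping the inventory number in the name makes it easier to match an uploaded file back to the record in the institution's own catalogue.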

What do I need to upload data sets to Commons?

To transfer media files to the free media repository Commons, you need an Internet connection, a free Commons user account and a free tool for transmitting the data (please see the sections below). In addition, the uploading institution has to be the copyright owner of the files in order to release them under a free license (unless they are already public domain works).

For further information about Wikimedia Commons, please see also:

How do I create a user account for Commons?

A user account for Commons can be created here (or here for test uploads to Commons Beta). This account is also valid for other Wikimedia projects such as Wikipedia and many others. It is best to get separate accounts for every person from your institution who is going to edit, and choose usernames of the form "Person Name (Institution Name)". Further details on the creation of user accounts are available here.

Some Wikimedia projects accept multiple people from your institution sharing a single account named after your institution; others do not. In particular, the English-language Wikipedia insists that each account be used by only one person and not shared. Thus, if you plan on making any edits to the English Wikipedia, it is important that each person from your institution has a separate account.

Afterwards, it is recommended to get your account verified for reasons of transparency and safety (currently only available on the German Wikipedia and on Commons). All information about user account verification for institutions and organisations can be found here (in German) and here for Commons.

On your user page you can introduce yourself and your institution, and/or the media files made available by you (see below "How can I present my data after the upload?"). If you have any conflicts of interest, you should disclose them on your user page. In particular, if you are being compensated for editing Wikimedia, it should be clear on your user page who is compensating you. If you create a user page on Meta-Wiki, it will be copied to all other Wikimedia wikis automatically. Further information on user pages is listed here.

What upload tool should I use?

To upload a single file or a small number of files, the easy-to-use UploadWizard may be sufficient. To upload single files that are already available on the web, URL2 Commons has proved to be a very simple and useful tool.

For uploading larger amounts of data, the VicuñaUploader or Pattypan is recommended instead. If a data set is already available online and its individual media files already have their own metadata, the GWToolset can be used to transfer the data to Commons. The VicuñaUploader and the GWToolset are discussed in detail below.

Using the “VicuñaUploader” (for uploading complete folders of data available offline)

For a brief description of the VicuñaUploader please read here. A free download of the VicuñaUploader software and instructions in English are available, too.

Instructions for uploading files with Vicuña:

Choose the files to be uploaded via Files → Read files
To mark all files for the upload, select Edit → Select all
Go to Edit → Edit selected files description
Under Desc you can now insert your description text in the form {{en|Description Text in English}}
Descriptions in other languages are optional, e.g. {{de|Description Text in German}} {{fr|Description Text in French}}, etc.
With the menu item Date you can specify the date of the files by using the Date template.
By using Categories you can assign files to a specific category (for more information on the use of categories, please see below).
Under Tools → Settings you can specify additional details like authorship, source and license. In the free text field License you can enter a project or data set template in double-braces (for further information about project or data set templates, please see below).
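
If the description texts are already stored per language in your own records, the Desc value described above can be generated instead of typed by hand. The following is a minimal sketch; the method name `multilingual_description` and the sample texts are assumptions made for illustration.

```ruby
# Hypothetical helper: turns a hash of language code => text into the
# "{{en|...}} {{de|...}}" form expected in the description field.
def multilingual_description(texts)
  texts.map { |lang, text| "{{#{lang}|#{text}}}" }.join(' ')
end

puts multilingual_description(en: 'Seal of the city', de: 'Siegel der Stadt')
# prints "{{en|Seal of the city}} {{de|Siegel der Stadt}}"
```

The resulting string can be pasted into the Desc field for the selected files.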

Using the “GWToolset” (for transferring data which is already available online)

See also: Commons:GLAMwiki_Toolset#Instructions for additional details.

Preconditions and Preparations

Please note: For an upload via the GWToolset, a basic knowledge of programming is required.

The upload is recommended to be tested on Commons Beta first. If this is successful, the upload can be repeated on Commons itself.
To transfer online available data to Commons with the GWToolset, your Commons user account needs special user rights. These can be requested here. Once you have obtained your authorization, you can access and use the GWToolset.
As a next step, the server which contains the source data must be whitelisted. This can be requested via this link on Phabricator. To log in on Phabricator you can use your Commons or Wikipedia account (under "Login or Register MediaWiki"). It may take some time (approx. one week) until your request is approved. As there may be further questions regarding your pending request, it is recommended to check its status on Phabricator from time to time.
Some servers are already whitelisted on the GWToolset, for example the photo portal Flickr. If your files are on Flickr, they can alternatively be transferred directly from there with the simple and self-explanatory flickr2commons tool or with the UploadWizard (if you have the "upload by url" user right). But please note: With these methods, additional information specified on Flickr (such as image descriptions) will be transmitted, too.

Creating an XML file for uploading through GWToolset

For the upload with the GWToolset you have to create a flat XML file containing the metadata for all images/ files. The creation of such a file and some useful tips for this process are explained in the steps below. For the conversion and processing of the data, a basic knowledge of programming is required. The process of creating the XML file can be roughly divided into five steps:

Converting your metadata file into a machine-readable format
Importing the file into a data structure
Customizing the data fields
Creating category data fields
Creating the XML file

1. Converting the metadata file into a machine-readable format

For each picture the metadata file must contain at least a filename, a title and the URL of the image. Many cultural institutions use spreadsheets to manage their metadata so we will use this as an example throughout the next steps. Other formats such as JSON and XML are popular as well. The easiest way to transform a spreadsheet into a machine-readable format is to export it to CSV. To do this, open the metadata file in any spreadsheet program and convert it to CSV by using the export function of the program.

2. Importing the file into a data structure

For the next steps a piece of code in a scripting language such as Ruby, PHP, Perl or Python is needed. The example code below is written in Ruby. This is a working example program based on the excerpts shown in the following.

First, the file should be read line by line and the data fields should be transferred into an appropriate data structure (Map, Dictionary, etc.). Let us assume our CSV file has 5 columns: "Title", "URL", "Description", "Categories" and "Year of creation". These can now be extracted with the following code example:

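require 'csv'

# `file` holds the path to the CSV export created in step 1.
metadata = []

# Each row becomes a hash keyed by field name (columns are separated by ';').
CSV.read(file, col_sep: ';').each do |row|
  metadata << {
    title: row[0],
    url: row[1],
    description: row[2],
    categories: row[3],
    year: row[4]
  }
end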
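
Before continuing, it can be worth checking that every row actually contains the fields the upload depends on. The sketch below is one possible way to do this, not part of the original example program: it assumes the CSV export has a header row with the five column names used above, treats "Title" and "URL" as required, and uses made-up sample rows with `example.org` URLs.

```ruby
require 'csv'

# Hypothetical sample data; a real export would be read from a file instead.
csv_text = <<~CSV
  Title;URL;Description;Categories;Year of creation
  Sample A;https://example.org/a.jpg;An etching;Radierung, Mappe;1890
  Sample B;;A lithograph;Lithografie;1902
CSV

# Assumption: these two columns must be filled for every image.
REQUIRED = ['Title', 'URL'].freeze

# headers: true makes each row addressable by column name.
rows = CSV.parse(csv_text, col_sep: ';', headers: true)

# Split the rows into complete ones and ones with a missing required field.
valid, invalid = rows.partition do |row|
  REQUIRED.none? { |field| row[field].to_s.strip.empty? }
end

puts "#{valid.size} valid, #{invalid.size} invalid"
invalid.each { |row| puts "Missing required field in: #{row['Title']}" }
```

Rows reported as invalid can then be corrected in the spreadsheet before the XML file is generated.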

3. Customizing the data fields

It often occurs that some of the fields from the given metadata file should not be transferred to Commons as they are (e.g. due to differing sorting patterns, naming conventions, etc). If these can be automatically adjusted, you can create a class to process the raw metadata.

Reading the data:

class ImageMeta
  attr_reader :title, :url, :description

  # Create instance variables from the "fields" hash,
  # stripping surrounding whitespace from each value
  def initialize(fields)
    fields.each { |field, value| instance_variable_set "@#{field}", value.strip unless value.nil? }
  end
end

You can create methods for the fields that need to be changed within that class. In our example from above, the “Year of Creation” can be supplemented by a MediaWiki-date template as follows:

def year
 "{{Date|#{@year}}}"
end

The customizing of other fields works the same way. From the metadata which was read in step 2, you can now create objects of the defined class:

metadata.map! { |fields| ImageMeta.new(fields) }
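
As a usage sketch, the excerpts above can be combined and tried on a single hand-written record. The sample values and the `example.org` URL are made up for illustration, and the class is restated here only so that the snippet runs on its own.

```ruby
class ImageMeta
  attr_reader :title, :url, :description

  # Create instance variables from the "fields" hash, skipping empty cells
  def initialize(fields)
    fields.each { |field, value| instance_variable_set("@#{field}", value.strip) unless value.nil? }
  end

  # Wrap the raw year in the MediaWiki Date template
  def year
    "{{Date|#{@year}}}"
  end
end

# Hypothetical record in the shape produced by step 2
metadata = [
  { title: ' Sample A ', url: 'https://example.org/a.jpg',
    description: nil, categories: nil, year: '1890' }
]
metadata.map! { |fields| ImageMeta.new(fields) }

image = metadata.first
puts image.title       # surrounding whitespace has been stripped
puts image.year        # the year wrapped in the Date template
p image.description    # nil, because the cell was empty
```

Note that an empty cell leaves the corresponding attribute as nil, so later steps should be prepared to handle missing values.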

4. Creating category data fields

To categorize images uploaded via the GWToolset properly, each category must be in a separate XML tag of the metadata file. To convert the raw category data into a list of categories, the category field should first be read from the CSV file and then split into individual categories. For example, if the categories are separated by commas in a column of the CSV file, they can be extracted in the ImageMeta class as follows:

def raw_categories
  # to_s guards against an empty categories cell (nil)
  @categories.to_s.split(',').map(&:strip)
end

To maintain flexibility in transferring the categories, you can even create a class for this:

class CategoryMapping
  MAPPING = {
    'Radierung' => 'Etchings',
    'Lithografie' => 'Lithographs',
    'Aquatinta' => 'Aquatint',
    'Mappe' => 'Portfolios'
  }

  def initialize(raw_categories)
    @raw_categories = raw_categories
  end

  def mapped_categories
    return [] if @raw_categories.nil?

    categories = []
    @raw_categories.each do |category|
      categories << MAPPING[category] if MAPPING[category]
    end

    categories.uniq
  end
end
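
The behaviour of such a mapping can be illustrated with a compact alternative sketch. It is not the class above but a shorter function with the same intent (unknown categories are dropped, duplicates removed, a missing list yields an empty result); it assumes Ruby 2.7 or later for `filter_map`, and the category 'Unbekannt' is a made-up value that has no mapping.

```ruby
# Shortened mapping table for illustration
MAPPING = {
  'Radierung' => 'Etchings',
  'Lithografie' => 'Lithographs'
}.freeze

# Array(nil) is [], so a missing category list simply yields no categories.
# filter_map keeps only the categories that have an entry in MAPPING.
def mapped_categories(raw_categories)
  Array(raw_categories).filter_map { |category| MAPPING[category] }.uniq
end

p mapped_categories(['Radierung', 'Unbekannt', 'Radierung'])  # prints ["Etchings"]
p mapped_categories(nil)                                      # prints []
```

Keeping the mapping in one table makes it easy to review with the volunteer community which institutional terms correspond to which Commons categories.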

5. Creating the XML file

Now all the data is available and the XML file can be created. The naming of the XML elements can be arbitrary for the GWToolset, but the tool requires a certain XML structure. The root element contains XML elements for each of the images, which in turn contain the metadata. Within the image elements, however, no further nesting is allowed.

The XML structure for our example would be as follows:

require 'nokogiri'

builder = Nokogiri::XML::Builder.new(encoding: 'UTF-8') do |xml|
  # Root element containing one flat <image> element per file
  xml.images do
    metadata.each do |image|
      xml.image do
        xml.title image.title
        xml.description image.description
        xml.year image.year
        xml.imageUrl image.url

        # One numbered element per category (category0, category1, ...)
        mapping = CategoryMapping.new(image.raw_categories)
        mapping.mapped_categories.each_with_index do |category, i|
          xml.send "category#{i}", category
        end
      end
    end
  end
end

builder.to_xml

The output of builder.to_xml can subsequently be written to a file and thus, the XML file for the GWToolset upload is completed.
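
Writing the result to disk needs only the standard library. In the sketch below a hand-written string stands in for the output of builder.to_xml so that the snippet runs without Nokogiri; the sample values, the file name `gwtoolset-metadata.xml` and the use of the system temp directory are assumptions. The string also shows the flat structure described above: one element per image, with no further nesting inside it.

```ruby
require 'tmpdir'

# Stand-in for builder.to_xml (hypothetical sample content)
xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <images>
    <image>
      <title>Sample A</title>
      <description>An etching</description>
      <year>{{Date|1890}}</year>
      <imageUrl>https://example.org/a.jpg</imageUrl>
      <category0>Etchings</category0>
      <category1>Portfolios</category1>
    </image>
  </images>
XML

# Write the XML as UTF-8 to a file that can then be passed to the GWToolset
path = File.join(Dir.tmpdir, 'gwtoolset-metadata.xml')
File.write(path, xml, encoding: 'UTF-8')

puts "Wrote #{File.size(path)} bytes to #{path}"
```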

Start the transfer process

Details of the individual steps and the respective input fields are explained in the following screencast.

Start the GWToolset. It can also be found at https://commons.wikimedia.org/ → Special pages → GWToolset
Enter the requested inputs for the metadata detection and move on to → Submit.
Enter the requested inputs for the metadata mapping and move on to → Preview batch. Important: for the use of categories, please first read the "How to organize and categorize data on Commons?" section below.

Check the preview. If it corresponds to the order in which the files are to be deposited on Commons subsequently, click on process stack to start the transfer.

Once the transfer request has been sent via the GWToolset, the browser window can be closed and the computer can be turned off. The transfer will be in progress between the servers in the background. The files should gradually show up in the List of new files, and in the category / categories that you provided during the upload.

How can I present my data after the upload?

To introduce your newly uploaded collection and (as recommended) to get in touch with the volunteer community of the Wikimedia projects, you can create a project page about the data set. On that page you can present your uploaded collection, your project or the cooperation in whose context the data was provided. When doing this, please keep in mind what Commons is NOT!

Generally, project pages are created as galleries and can then be designed freely. A description of how to do this can be found here.

Some examples of existing project pages:

How to organize and categorize data on Commons?

The category structure is the preferred method to organize files on Commons and to make sure they can be found properly. Each file should be found in the category structure. To ensure this, each file must be assigned directly to a category or appear in a gallery page which in turn is categorized. Each category itself must be categorized so that a hierarchical structure (similar to a family tree) results.

How this is done in detail is shown in this example.

Besides choosing categories based on the actual motif of an image, it is often recommended to additionally categorize each file by the type of institution which is uploading, as many other institutions or cooperations have done previously:

This can be handled, for example, by adding a template to each file, which contains further information about the uploading institution in addition to category information (please see below: "Using Data Set Templates / Project Templates").

Using Data Set Templates / Project Templates

What is a data set template?

A so-called “data set template” may be included in the description page of an individual media file. It includes a brief explanation of your institution and of the data set the respective file belongs to. Additionally, you can categorize the files with a project template, which makes it possible to find all other files of a particular data set or of the uploading institution. Please find more detailed information on templates and MediaWiki here.

How do I create a template for my project or my Commons partnership?

Examples of templates for collaborative Commons partnerships can be found on this page. If you need help with creating templates, you can reach out to WikiProject Templates and make a new request there.

The information contained in the actual project template is imported from other templates which have to be set up separately as follows:

Template:YOUR INSTITUTION-source

→ main template in which the information of the sub-templates is mapped. In addition, a category is defined, to which all files using this template are assigned.

Template:YOUR INSTITUTION-source/layout

→ defines the layout. Here, for example, a logo can be inserted and the text placement is defined.

Template:YOUR INSTITUTION-source/lang

→ provides the descriptive text in the existing language versions

Template:YOUR INSTITUTION-source/en

→ contains the description text in English

Template:YOUR INSTITUTION-source/de

→ contains the description text in German

Please feel encouraged to create templates for additional languages and link them in the “Template:YOUR INSTITUTION-source/lang”.

Examples for each component of the template for the partnership with the “Hochschule für Technik und Wirtschaft Berlin” (University of Applied Science):

With a click on the Edit button, you can view (and copy) the source code of each template.

How to integrate a template into a file

To add a template to a file, please insert the following code in the file description page, usually right below the paragraph about the license: {{template name}}
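
For scripted uploads, the placement described above can be sketched by assembling the text of a file description page. This is a rough illustration under several assumptions: the method name `file_description` is made up, the page layout with an Information block followed by a license section is a common pattern on Commons rather than a requirement stated in this guide, and `PD-old` and the sample field values are placeholders to be replaced with whatever applies to your files.

```ruby
# Hypothetical helper: builds the wikitext of a file description page and
# places the project template right below the license template.
def file_description(description:, date:, source:, author:, license:, project_template:)
  <<~WIKITEXT
    == {{int:filedesc}} ==
    {{Information
    |description=#{description}
    |date=#{date}
    |source=#{source}
    |author=#{author}
    }}

    == {{int:license-header}} ==
    {{#{license}}}
    {{#{project_template}}}
  WIKITEXT
end

puts file_description(
  description: '{{en|Fabric sample}}',
  date: '{{Date|1890}}',
  source: 'https://example.org/a.jpg',
  author: 'Unknown',
  license: 'PD-old',
  project_template: 'YOUR INSTITUTION-source'
)
```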

The position at which the code is inserted varies depending on which of the upload tools described above is used. Please refer to the tool’s description (see above) for further information.

Further templates

In addition to the project template, which usually refers to a dedicated cooperation of an institution with Commons, you can also create a specific template for your institution. This template can be added to file description pages and may contain additional details such as the location, date of foundation or website of your institution.

Where can I find further support?

Through programs and projects such as "Medienschatz" or "GLAM" (Galleries, Libraries, Archives & Museums), Wikimedia Deutschland supports volunteers and institutions in the release and transfer of stored data and helps to connect them.

If you have any questions, you can contact us at:

community@wikimedia.de (for volunteers)
glam@wikimedia.de (for cultural institutions)

For questions about GWToolset you can find help on a dedicated mailing list.