Digitization/Digitization process

From Meta, a Wikimedia project coordination wiki

Definition

Digitization is the process of converting analog media or objects into digital representations that computers can read. Analog signals, which are continuous, are converted into digital signals, which are represented in discrete, non-continuous units. Digital signal processing involves specific trade-offs that affect how the analog media is converted into a digital representation.
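The conversion from a continuous signal to discrete units can be illustrated with a minimal sketch. It assumes a made-up "analog" signal (a 1 Hz sine wave) and hypothetical function names; real capture devices do this in hardware, so this only shows the idea of sampling at fixed intervals and rounding each sample to a limited number of levels.

```python
import math

def sample_and_quantize(signal, duration, sample_rate, bits):
    """Sample a continuous function signal(t) with values in [-1, 1]
    and quantize each sample to one of 2**bits integer levels."""
    levels = 2 ** bits
    n_samples = int(duration * sample_rate)
    codes = []
    for n in range(n_samples):
        t = n / sample_rate          # discrete point in time
        value = signal(t)            # "analog" value, a float in [-1, 1]
        # Map [-1, 1] onto 0 .. levels - 1 and round to the nearest level.
        codes.append(round((value + 1) / 2 * (levels - 1)))
    return codes

def sine_1hz(t):
    # Hypothetical analog source used only for illustration.
    return math.sin(2 * math.pi * t)

# 8 samples per second, 3 bits per sample (8 possible levels).
codes = sample_and_quantize(sine_1hz, duration=1.0, sample_rate=8, bits=3)
print(codes)
```

Raising `sample_rate` or `bits` makes the digital version follow the original more closely, at the cost of more data. This is one of the trade-offs mentioned above.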

In the context of cultural heritage institutions, the intended audience for the output of digitization (digital objects) is the general public, since digitization is done in order to facilitate access to media collections. The process of digitization itself, however, is mainly oriented towards converting information into a form that computers can easily process, store and transfer. This in turn allows automated tasks (such as large-scale text analysis) to be performed over the media.

There are two major challenges that any digitization process faces, especially when digitizing materials that need to maintain a certain level of accuracy or fidelity to the original.

  • Fidelity to the original. This challenge is more variable and depends on your needs. If you have cultural objects such as paintings, illustrations or photographs, you want to capture the colors so that they look approximately as they do in the original (more on this later, since viewing is always affected by lighting conditions). Capture needs to happen in a controlled environment, or at least include reference points to the real world that you can check later on (e.g. a color chart for colored media). A color such as blue can vary a lot depending on the lighting conditions or on whether other objects in the photo also catch light. Colors can also vary depending on the color space your capture device is working in, on whether you have decided not to use the camera's native format (such as RAW or DNG images), and so on.
  • Differences in rendering information between capture devices and display devices. Your capture device might be able to capture more information than your display device is capable of displaying. For example, a flatbed scanner might be able to work with a color depth of 48 bits, but most displays will only show up to 24 bits of color depth (don't know what color depth is? We'll go into that later). Does this mean that it is useless to work with a 48-bit color depth? Not necessarily, but you need to be able to make this assessment on your own.
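To get a feeling for the gap between 48-bit and 24-bit color depth, the arithmetic can be sketched as follows. The sketch assumes three color channels (red, green, blue) with the same number of bits each, and it reduces 16-bit values to 8-bit ones by simply dropping the lower bits. Both are assumptions made for illustration, and actual software may convert differently.

```python
def levels_per_channel(total_bits, channels=3):
    """Number of distinct values each channel can take,
    assuming the bits are split evenly between channels."""
    return 2 ** (total_bits // channels)

def total_colors(total_bits):
    """Number of distinct colors representable with total_bits."""
    return 2 ** total_bits

for depth in (24, 48):
    print(depth, "bits:", levels_per_channel(depth), "levels per channel,",
          total_colors(depth), "colors")

def to_8_bit(value_16):
    """Reduce a 16-bit channel value (0..65535) to 8 bits (0..255)
    by dropping the 8 least significant bits."""
    return value_16 >> 8

# Two different shades in the 16-bit capture end up as the same
# value once reduced to 8 bits for display.
print(to_8_bit(32768), to_8_bit(32900))
```

The extra levels are not visible on a 24-bit display, but they remain in the file and can matter when the image is edited or adjusted later, which is why the assessment depends on what you plan to do with the capture.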

If you don't need to maintain any fidelity to the original and you only need to extract the content (e.g. you have a book that you want to scan just for the text), most of the sections outlined in this wiki will not be that useful, especially the discussions about color. However, there are still certain quality controls that you need to maintain in order to do the job properly, and there are various considerations regarding the format, size and shape of the analog material that you need to take into account.

Goals of digitization

Traditionally, digitization has been seen as a way of preserving the original materials. However, digitization is not really a preservation process for your originals. It is only an indirect preservation process: what digitization enables is taking out of circulation fragile materials that cannot be handled anymore or need to be handled carefully, as well as materials that are requested on a regular basis and could be damaged over time. Digitization cannot replace a careful and well-thought-out preservation plan for the original materials. What do we mean by this? If your building catches fire, you need to have a plan for that. Digitization is not a backup plan for the case where a building catches fire and all the originals are destroyed.

On the same note, whenever the term "preservation standards" appears in the digitization process, it actually refers to digital preservation. Digital preservation is a complex concept because it can mean either the preservation process applied to digital objects that were created from analog material or the preservation process applied to born-digital materials. Here we will talk about digital preservation in the first sense.

That said, the goals of digitization are:

  • To enable ubiquitous access to your collection: this means that people or institutions can view your collection whenever and wherever they see fit. Access is increased by sharing your collection with other institutions or through larger platforms such as Wikimedia Commons.
  • To enable reperforming processes over your collection: this means allowing people or institutions to remix your content, to perform automated tasks over it, to print your collections in larger formats and to exhibit them through other means, among others.

Although they might seem the same, there are some slight differences in terms of the process that each of them might follow. If you have no other means at hand, you can enable access by taking pictures with your phone camera, applying some post-processing and uploading them to the Internet (no matter the platform; for that matter, it can be a Google Drive folder). Some might see this as completely wrong, but the truth is that if this is the only tool you have, it will perform well enough to give you acceptable results for enabling access. As a counter-example, you might have digitized your materials at the best quality possible, but if you are giving away blurred, downsized images just to prevent re-use, you are not enabling access at all. Digitization doesn't fail when you put your content on platforms that are unsuitable for archiving, when you deliver your content on platforms that you don't own or when you give away pictures that are not that good. It fails when someone needs to send you an email to request the files or visit your office to access the digitized material.

On the other hand, if you want to allow reperforming processes, you do need to achieve certain quality standards and maintain certain levels of quality control. Some of these reperforming operations might not be allowed by copyright law (e.g. printing some originals), so you need to clear copyright before enabling full access to high-quality digital objects.

In some cases both goals might converge, but in others they might not.

Intended uses

In a digitization project, you can design your project in two possible ways:

  • specific uses approach: in this case you digitize your media with settings that are convenient or relevant to your needs at a certain point in time. For example, if you need a thumbnail of your media to display in your catalog, with this method you'll digitize it with low quality settings. With this approach, you might need to digitize your material more than once.
  • use-neutral approach: in this approach, you aim to digitize only once, at the highest quality possible, following standard settings in order to maintain consistency. Standard settings will include color management, an image quality control process and long-term archiving of digital surrogates. The downside of this approach is that it is expensive and time-consuming, but it might save you from digitizing your collection more than once.
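One practical consequence of the use-neutral approach is that smaller versions can be derived from a single high-quality master instead of rescanning. The sketch below only computes target dimensions. The derivative names, the pixel sizes and the `derivative_size` helper are hypothetical values chosen for illustration, not recommended settings.

```python
def derivative_size(master_width, master_height, max_side):
    """Return (width, height) scaled so that the longer side equals
    max_side, preserving the aspect ratio. Never upscales."""
    longer = max(master_width, master_height)
    if longer <= max_side:
        return master_width, master_height
    scale = max_side / longer
    return (max(1, round(master_width * scale)),
            max(1, round(master_height * scale)))

# Hypothetical derivative profiles: name -> longest side in pixels.
DERIVATIVES = {"thumbnail": 200, "web": 1600, "print": 6000}

master = (6000, 4000)  # assumed size of the archived master, in pixels
for name, side in DERIVATIVES.items():
    print(name, derivative_size(*master, side))
```

With a specific uses approach, only the thumbnail-sized capture would exist, and the larger versions could not be produced later without digitizing the material again.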

Types of digitization process

There are two types of digitization process: destructive scanning and non-destructive scanning.

  • Destructive scanning is the process of taking a book apart (normally by cutting the spine with a saw or guillotine) in order to digitize it with a table scanner that has an ADF (Automatic Document Feeder). It is the easiest, fastest and cheapest way to scan paper materials, but it is not recommended if the original materials need to be preserved. For an overview of methods for destructive scanning, check the DIY Book Scanner Forum section on destructive scanning.
Destructive scanning can also be performed on photographs and illustrations using a table scanner with ADF, although this can be far more challenging than with books, in part due to the intrinsic characteristics of the materials, such as glossy or fragile paper, which the machine can destroy even before a digital object has been produced.
Another type of destructive scanning method is to use a photocopier or flatbed scanner for books. In most cases this forces the binding of the book, considerably affecting its original shape, and depending on how the book was bound, it might even tear pages out. This is definitely not a recommended process for scanning books if you have a large collection. It is also incredibly time-consuming.
A good example of a destructive scanning process can be found in the Caselaw Access Project from the Harvard Library Innovation Lab at the Harvard Law School. In this case, you can see that they do actually conserve the original materials if someone wants to access them for whatever reason, but they do not re-bind them after the process.
  • Non-destructive scanning is the preferred method if you need to preserve your original materials. To put this type of process in place you need a frame that can open the book as gently as possible or, if you are digitizing loose paper, proper tables where you can handle the material as gently and as little as possible. The challenge of non-destructive methods is that they tend to be expensive and time-consuming, normally because you need to turn the pages yourself or place each page correctly, one by one. The exception is an automated book scanner that turns the pages automatically, but these also tend to be incredibly expensive.
An important detail to consider is that using a scanner with an Automatic Document Feeder doesn't necessarily mean that you are using a destructive scanning method. This is obvious, because one is a type of machine and the other is a type of method, but it is worth clarifying anyway: you can have loose papers or materials that can be fed into a scanner with ADF and still be performing a non-destructive scanning method. When and how to use an ADF scanner depends a lot on the type of material that you have.

The challenge here is to decide which type of method to apply and when. The decision will be based on your need to preserve the materials in their original format and shape, combined with what you need from the material (i.e., whether you are only interested in converting the information to digital form or you are also interested in showing it as close to the original as possible).