User:LA2/Digitizing books with MediaWiki

Written by user:LA2 on August 12, 2005.

Digitizing books with MediaWiki

After these ideas have matured, they should probably be moved to a page of their own, either here on Meta or on mediawiki.org. Related pages can be found in the categories books, images, and Wikidata.

Update: There is now (since September 2006) a ProofreadPage extension to the MediaWiki software (SVN), created by ThomasV and already activated on the English and French Wikisource, that displays the scanned page image and the proofread OCR text side by side, with links to the previous and next page. The extension uses a special "Page:" namespace. An example is this title page from volume I of the works of Fermat. See also the help page Help:Side by side image view for proofreading.

One definite advantage of the Page: namespace is that it makes it easy to keep statistics on the number of scanned pages. See also the Wikisource:List of books in the Page namespace and its interwiki links. Here's an over-simplistic shell script that counts these pages in a database dump:

# count page-table rows whose namespace (second field) is 104, the Page: namespace
gunzip < frwikisource-20061107-page.sql.gz | tr '(' '\n' | awk -F, '$2==104' | wc -l
Number of pages in the Page: namespace

Date              English   French   German
6-7 Nov 2006          689    17763        -
14-15 Oct 2006        395    17715        -
21 Sept 2006           24    17495        -
11-12 Sept 2006         -        -        -

Playground

The New Student's Reference Work, 1914

In September and October 2005 I started real attempts to use the existing MediaWiki software to publish scanned books. The scanned images of book pages are uploaded (using the script by Eloquence) to Wikimedia Commons, and the books are presented as part of Wikisource. The first titles are:

  • The New Student's Reference Work, 1914

Why write new software?

At the Wikimania 2005 conference, I was discussing some ideas for publishing old books with Magnus Manske, PatrickD and others. Some people, who have seen Project Runeberg, want to start similar projects in other languages and are asking me for the software. As if there were any software that could be transplanted and reused! Unfortunately, this is not the case. Project Runeberg uses a complex rats' nest of old scripts in various programming languages, which has grown over the past twelve years. The code is ugly! A far better idea would be to extract the best experience and ideas from Project Runeberg and build some new software. Fortunately, the ideas are rather simple and easy to summarize. Even better: if you want to experiment with this, you don't have to scan many books, because the Internet Archive already contains scanned books that you can download. All you need to focus on is programming the website. The division of labour into raw scanning (e.g. University of Toronto), archiving (e.g. the Internet Archive), proofreading (e.g. pgdp.net), and publishing the proofread texts (e.g. Project Gutenberg) was not available when Project Runeberg started in 1992, but the situation is very different in 2005. If you are designing software in 2005, it might live until 2015, so you shouldn't use the style of 1995. My own suggestion is to add modules to the MediaWiki software.

As a matter of fact, I'm also among those who need "another Project Runeberg", because I have some English and German books that I want to digitize and make available, and which fall outside of Project Runeberg's Scandinavian scope. For example, I scanned and OCRed the German book of quotations Geflügelte Worte by Georg Büchmann. Karl Eichwalder submitted my files to Distributed Proofreaders (see below), from where the proofread text will later appear as an e-text on Project Gutenberg. But I also want to make the scanned images available in the fashion of Project Runeberg. Instead of reusing the old software, I wrote a minimalist script that generated the static HTML pages at aronsson.se/buchmann/. These pages don't have any "edit" link for proofreading the text. Another non-Scandinavian book I scanned was a cross-index to the German encyclopedia Meyers Konversationslexikon (Schlüssel zu Meyers Konversations-Lexikon); it has been made available by Christian Aschoff, who digitized the entire encyclopedia.

Elements of book digitization

This section is a quick summary for those who have not digitized books before.

The contents of old books are subject to copyright, and copyright laws are slightly different in each country. In most countries, copyright expires when the author has been dead for 70 years. So you need to know (1) who the author is, and (2) when she died. Wikipedia should be a very good source for this knowledge. If you want to digitize a book, start by writing a Wikipedia article about the author. Knowing who the author is can be hard enough. Translators and illustrators also hold copyright, and each coauthor must have been dead for 70 years for copyright to expire. For completely anonymous works, copyright expires 70 years after the first publication. For books that are still under copyright, you can try to contact the author or the heirs and ask them to release the contents under a Creative Commons license.

For some readers it is good enough to hear the story of Snow White or Moby Dick as it was retold from memory. They are great stories, but your friend's grandmother would tell them somewhat differently than your own grandmother. Even if you read from the book, some publishers print abridged versions or modify the contents in other ways. If you want to be able to quote parts of a book, you need to know which edition you use. Many people prefer to quote from the first edition. While it is quite possible to digitize editions other than the first, it is absolutely necessary to document which edition (publisher and print year) you digitize. For some books it is interesting to digitize more than one edition.

Digitization is the transfer of the black-and-white print of a text page into some useful digital computer file. The most useful format is text (including plain text, HTML, RTF, MS Word and wikitext), because text can be searched, edited and copied. The most naive approach is to look at the printed book and type its text into the computer. Most people would instead use a scanner (or digital camera) and an OCR program. OCR is short for optical character recognition: the recognition of editable letters and digits in the patterns of black-and-white pixels of a scanned image. Either way, some errors will always remain, and there is always a need for manual proofreading of the text. OCR is non-trivial. There are some attempts to write free and open source software for OCR (ClaraOCR, Gamera), but so far they have not been able to perform like the commercial proprietary software (FineReader, OmniPage, TextBridge).
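
As a minimal, hedged sketch of the scan-and-OCR step: the following assumes a SANE-supported scanner (the scanimage command) and uses the free GOCR program purely as an example of a command-line OCR tool, not one of the programs mentioned above; a commercial package would do the same job better.

scanimage --resolution 300 --mode Gray > 0001.pnm   # scan one page at 300 dpi, grayscale
gocr -i 0001.pnm -o 0001.txt                        # raw OCR output; still needs manual proofreading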

Suppose you find a text of Moby Dick online, and you wonder if the names Ishmael, Manhattoes, Corlears and Coenties (all found in the first chapter) are spelled right. Maybe they are correct as printed, or maybe they are OCR errors. Or if they were indeed printing errors, maybe someone "corrected" them during digitization. How could you know? Going to the library to have a look in the "real" book would be very time consuming if this happens often. The solution is to keep the scanned images of every book page online, so the "real" printed page (photos don't lie, right?) is just one click away. An additional benefit is that the scanned image of the title page automatically documents which edition was used.

Most of these lessons were already learned from microfilming in the 1950s and have been updated by large-scale digitization projects in the 1990s, primarily at university and state libraries. A good current source is D-Lib Magazine (online, open access). Unfortunately, many of the libraries involved are not as committed to open contents as could be wished. A good read is the online book by Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (2005).

Project Gutenberg was started in 1971 as an e-text project and still doesn't publish scanned images. Project Runeberg was started in 1992 as an e-text project for Scandinavian languages and started to add scanned images in 1998. Distributed Proofreaders was founded in December 2000 as a preprocessor for Project Gutenberg. Its scanned images are not "published" but are presented only during two phases of proofreading, after which the e-text is submitted to PG. Old scanned images are archived, and it is hoped that DP will some day publish them. In 2002 Project Runeberg added a wiki-like function for online proofreading. It is not workflow-oriented like DP; instead, any page can be proofread at any time, very much like articles in Wikipedia are improved by whoever finds an error.

The above should explain why the early ideas for "Project Sourceberg" (now Wikisource) are quite inadequate for any serious digitization work.

Functionality wanted in MediaWiki

I hope to outline a set of functions that are needed for digitized books and that can be implemented separately, each well integrated into the MediaWiki architecture and addressing needs found in other subprojects as well.

  • The upload function of MediaWiki might need to be streamlined for uploading scanned books. Some popular formats for scanned books are PDF, DjVu and TIFF. All of them can contain multiple pages, either in one huge file or in a large number of smaller files, one for each page. MediaWiki would be incomplete if it didn't support both kinds: uploading of huge files and uploading of multiple files. Yet another multipage approach is to create a ZIP archive of several smaller scanned images.
  • For multipage formats (PDF, DjVu, TIFF, ZIP), the MediaWiki software needs to be able to extract (batch or on-demand) single pages; a shell sketch of this step and the following conversion step is given after this list.
  • For non-webfriendly image formats (PDF, DjVu, TIFF), the MediaWiki software needs to be able to convert them to webfriendly image formats (PNG), either in batch or on the fly. This is no more complicated than today's rescaling of images.
  • When one image contains the whole of a book page, MediaWiki should support some simple image manipulation operations, such as cropping the margins, adjusting for skew, contrast and color adjustments, and copying an area (an illustration) to an image file of its own; a second sketch after this list shows how such operations could look. For example, in a scanned encyclopedia page such as this one, the illustrations of various types of cylinders should be copied so they can be reused in Wikipedia articles. (Thanks to Sj for reminding me.)
  • Typically one user will scan images of a book and upload them, then another user will download the images to run OCR and then upload text files for each page. The second user might prefer a multipage format (PDF, DjVu, TIFF, ZIP) for easier download. Does MediaWiki have an image download function like this today?
  • The flat wiki namespace is fine for Wikipedia articles, but not for the pages and chapters of printed books. The need to isolate books (each having several chapters and a table of contents) from each other has already been discussed on Wikibooks (see Wikibooks should use subpages and the feature requests "book construct" and "image pager"). If separate compartments can be created for each book (or volume of a multipart work), scanned page images should be uploaded within such a compartment.
  • While most books use page numbers of some kind, these are often inadequate for identifying pages when digitizing the book. For example, there can be more than one page 13, and there can be several pages without numbers. Some encyclopedias use column numbering, with columns 1 and 2 on one page, columns 3 and 4 on the next, etc. Project Runeberg uses the convention of naming the scanned images 0001.tif, 0002.tif, 0003.tif, etc., and page numbers are added as attributes of the pages; a third sketch after this list illustrates this naming. The sequence of pages is important for adding navigation links to the previous and next page. If some pages were scanned out of order, the MediaWiki software could offer the ability to move pages around.
  • The table of contents is a mapping between a list of articles (chapters) and pages. Think of a printed magazine and how one article can end on the same page where the next article begins, and some articles are "continued on page 71" (or in the next issue). Is there a need to point out where on the page an article starts? The resulting data structures could become very complex. How far do you need and want to go?
    • Perhaps User:Nigelk/Nav is an interesting extension for building a table of contents.
  • Once both the scanned image and the OCR text are available for a book page, proofreading becomes a slight modification of MediaWiki's normal edit window. A simple HTML frameset divides the screen in half (vertically or horizontally), presenting the scanned image in one frame and the edit form in the other. Try the proofreading functions of Project Runeberg and Distributed Proofreaders for examples and ideas.
  • Actually, proofreading is just one specialization of the more general principle of two-window editing, which can also be used for translating Wikipedia articles (French source article to the left, English edit window to the right) or TV interview transcription (TV animation above, transcript edit window below).
  • Proofreading and publishing of pages from scanned books can be seen as a specialization of a more general principle: wiki-style editing applied to contents within a predefined namespace, where the URLs and the sequence of pages are predefined. You don't need "red links" to create new pages, because the only pages you need are one for each scanned book page. Suppose, for example, that you want to create a wiki for Amazon.com-style book reviews. You would want to allow this for every possible ISBN number, but users shouldn't be able to create pages outside of the ISBN numbering system. Perhaps this is bordering on Wikidata? Maybe Wikidata should be used for defining the sequence of pages, their page numbers, and the table of contents.
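
To make the extraction and conversion items above concrete, here is a minimal shell sketch using existing command-line tools (pdftoppm from the xpdf/poppler tools, tiffsplit from libtiff, ddjvu from DjVuLibre, unzip, and ImageMagick's convert). The file names are invented for the example; a MediaWiki extension would call such tools internally rather than ask users to run them.

# split multipage sources into single pages
pdftoppm -r 150 book.pdf page                    # PDF  -> page-1.ppm, page-2.ppm, ...
tiffsplit book.tif page                          # TIFF -> pageaaa.tif, pageaab.tif, ...
ddjvu -format=tiff -page=1 book.djvu page1.tif   # DjVu -> page 1 as a TIFF file
unzip book.zip -d pages                          # ZIP  -> the images it contains

# convert one extracted page to a web-friendly PNG
convert page-1.ppm page-1.png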
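
The cropping, deskewing and contrast operations can likewise be sketched with ImageMagick. The pixel coordinates are invented, and this only illustrates the kind of processing MediaWiki would need to expose through its web interface:

# cut 40 pixels of margin off each side of a 2000x3000 pixel scan
convert 0001.png -crop 1920x2920+40+40 +repage cropped.png

# straighten a slightly skewed page and increase the contrast
convert cropped.png -deskew 40% -contrast clean.png

# copy an area (an illustration) to an image file of its own
convert 0001.png -crop 400x300+150+800 +repage illustration.png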
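
Finally, the sequential naming convention (0001.tif, 0002.tif, ...) is easy to apply in shell. The scan-*.tif input names are again invented, and the loop assumes the files already sort in scanning order:

# number the scans in scanning order, independently of printed page numbers
n=0
for f in scan-*.tif; do
  n=$((n + 1))
  mv "$f" "$(printf '%04d.tif' "$n")"
done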