
CIS-A2K/Indic Languages/Digitization of books in Wikisource using DjVu

From Meta, a Wikimedia project coordination wiki

Over the past few years, my interactions with Indic language Wikimedians involved in the sister projects of Wikipedia (such as Wikisource, Wiktionary, and so on) have revealed that very little help or supporting material is available for these projects. However, many senior Wikimedians from Latin-script language projects have helped us in the past. On Wikisource, for example, John_Vandenberg from English Wikisource provided support in various forms while we were struggling.

When it comes to sister wiki projects in Indic languages, very little sharing of best practices takes place. This blog post is a small step towards bridging that gap. Going forward, with the help of Indic Wikimedians working in Indic wiki projects, I will be sharing best practices and stories from Indic wiki communities.

This blog post is about a particular method we use in Wikisource to digitize old source texts. It is largely based on the original post published a few months ago by Manoj, a Malayalam wiki librarian.

The current method of adding content to Indic language Wikisources is slow and painstaking. Optical Character Recognition (OCR), which allows electronic conversion of printed matter into machine-encoded text, is not currently available for Indic languages. A few FOSS OCR tools are at various stages of development. So the only option available to us (Indic wiki librarians) is to type the content. At present, community members keep the physical book open and type it out manually. Sometimes they have a PDF of the document and keep switching screens while typing. As you can imagine, both approaches are tedious.

There is an alternative that provides some remedy, though it is still nowhere near OCR (for Indic languages). It helps with digitization and is based on an open source technology called DjVu. (It is already widely used in English and other Latin-script language wikis.) Among Indic wikis, only Malayalam and Sanskrit wiki librarians currently use it for their respective Wikisources. I believe most other communities are not using it because they may not even be aware of it, and hence this post is for all Indic communities, so that they may adopt or adapt some ideas from it.



Instead of PDF, the format used here for digital archiving is DjVu, an open file format designed primarily to store scanned documents. (PDFs can easily be converted to DjVu format.)
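As a rough illustration of that conversion step, here is a minimal Python sketch that wraps pdf2djvu, an open source command-line converter. This post does not say which tool was actually used for Keralolpathi, so the choice of pdf2djvu is an assumption, and the function names and the file name in the usage note are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdf2djvu_command(pdf_path, djvu_path=None, dpi=300):
    """Return the pdf2djvu argument list for converting one PDF to DjVu."""
    pdf = Path(pdf_path)
    # Default output: same name as the PDF, with a .djvu extension.
    out = Path(djvu_path) if djvu_path else pdf.with_suffix(".djvu")
    # -d sets the rendering resolution, -o names the output file.
    return ["pdf2djvu", "-d", str(dpi), "-o", str(out), str(pdf)]


def convert_pdf_to_djvu(pdf_path, djvu_path=None, dpi=300):
    """Run the conversion; raises if pdf2djvu is missing or exits with an error."""
    if shutil.which("pdf2djvu") is None:
        raise RuntimeError("pdf2djvu is not installed or not on PATH")
    subprocess.run(build_pdf2djvu_command(pdf_path, djvu_path, dpi), check=True)
```

Calling convert_pdf_to_djvu("Keralolpathi.pdf") would, under these assumptions, write Keralolpathi.djvu next to the PDF, ready to be uploaded for use with an Index page.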

To use this method in your Wikisource, you need to enable the Index and Page namespaces and a few other technical things on your Wikisource. Please write to me at shiju@wikimedia.org if you need assistance in this regard. John_Vandenberg has also offered his full support for this.

Let me show you how this is done through an example: Keralolpathi, a book recently digitized in Malayalam Wikisource. The Index page of Keralolpathi is shown below.

Main Index page of the book for Digitization


This book was scanned and converted to DjVu format. (Thanks to Manoj, who is coordinating all these activities in Malayalam Wikisource.) The page numbers of the book are listed at the bottom of the Index page, as shown in the above image. The background colour of each page number denotes the current digitization status of that page:

  • Red - Digitization of the page is yet to be started
  • Pink - Digitized, but not proofread
  • Yellow - Digitized and proofread once
  • Green - Approved page (proofread twice)
  • Light blue - No content
  • Dark blue - Problem page

Looking at the Index page, you can select a page to contribute to according to your interest. For example, if you would like to proofread a page, select a pink page and proofread it. If you would like to digitize a new page by typing its content, click on a red link, and so on.
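The colour legend and the way a contributor picks a page can be summarised in a small Python sketch. It only models the idea and does not talk to Wikisource. The task names, the function names, and the sample page colours are hypothetical. Treating yellow pages as the ones awaiting a second proofread is an inference from the legend above.

```python
from collections import Counter

# Colour legend as described in this post.
STATUS_BY_COLOUR = {
    "red": "not started",
    "pink": "digitized, not proofread",
    "yellow": "proofread once",
    "green": "approved (proofread twice)",
    "light blue": "no content",
    "dark blue": "problem page",
}

# Colours a volunteer looks for, by task (task names are hypothetical).
COLOURS_FOR_TASK = {
    "type": {"red"},            # type the page from the scan
    "proofread": {"pink"},      # first proofread
    "approve": {"yellow"},      # second proofread (inferred from the legend)
}


def pages_for_task(page_colours, task):
    """Return the sorted page numbers whose colour matches the given task."""
    wanted = COLOURS_FOR_TASK[task]
    return sorted(page for page, colour in page_colours.items() if colour in wanted)


def progress_summary(page_colours):
    """Count how many pages are in each status."""
    return Counter(STATUS_BY_COLOUR[colour] for colour in page_colours.values())


# Made-up colours for a six-page book, purely for illustration.
sample_book = {1: "light blue", 2: "green", 3: "yellow", 4: "pink", 5: "red", 6: "red"}
print(pages_for_task(sample_book, "type"))
print(progress_summary(sample_book))
```

With the sample data, pages 5 and 6 are the ones waiting to be typed, and page 4 is the one waiting for its first proofread.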


DjVu Digitization - Edit window


For example, this is the view you see once you click on a particular page from the Index page. The corresponding page in Malayalam Wikisource is here.

The page you see in the right column is the scanned page. Our task is to type in the left column while looking at the page on the right. By default, the OCR facility associated with this DjVu system will try to scan and convert the text. However, as OCR technology for Indic languages is still poor, it produces very little meaningful content.

So we need to type in the edit box on the left while looking at the scanned image, which is always displayed on the right. All one needs to do is type the content and save. That is it.

The process is simple, and anyone who is interested in contributing to wiki projects in their mother tongue can participate. It removes the complication of having to manage physical books or switch between screens while typing, and thereby reduces effort, improves speed, and makes quality easier to control.

I want to dedicate this post to Indic language speakers who want to contribute to Indic language wikis but get lost for various reasons. Wikisource is really one of the easiest ways to start in the Wikimedia movement, and many start their journeys through it. Its policies are relatively simpler than those on Wikipedias, and there is much less technical markup. I myself started my journey with the Wikimedia movement by creating an account in Malayalam Wikisource (even before Malayalam Wikipedia or English Wikipedia!).

The beauty of this is that you can help digitize and preserve content in areas that interest you, whether it is literature, religion, fiction, or anything else!

In all of this, don’t forget that while you are contributing to Wikisource, you also end up reading wonderful, ancient books and documents!

Please write to me at shiju@wikimedia.org if you need assistance in this regard.

NOTE: Good OCR software is urgently required for Indic languages. I hope the technical people reading this post will look into this.