Wikimedia Blog/Drafts/Using Google Drive and the OCR in it to digitize scanned images containing text

From Meta, a Wikimedia project coordination wiki

Title ideas

  • Using Google Drive and the OCR in it to digitize scanned images containing text
  • Using Google Drive and its OCR to digitize scanned images of books and all printed text

Summary

A brief, one-paragraph summary of the post's content, about 20-80 words. On the blog, this will be shown in the chronological list of posts or in the featured post carousel on top, next to a "Read more" link.

  • ...

Body

Google's Optical Character Recognition (OCR) software now works for more than 248 of the world's languages, including all the major South Asian languages. It is easy to use and works with over 90 percent accuracy for most of those languages.

OCR software has been extremely beneficial for the study of language, helping to extract text from images of virtually any printed text—and sometimes even handwriting, which opens the door to old texts, manuscripts, and more.

Typically, OCR software has difficulty reading the text on old documents or on pages with blemishes and ink marks, and it produces gibberish instead of legible text.

Google's support page for this feature shares additional details about character formatting, such as its ability to preserve bold and italic text in the output:

When processing your document, we attempt to preserve basic text formatting such as bold and italic text, font size and type, and line breaks. However, detecting these elements is difficult and we may not always succeed. Other text formatting and structuring elements such as bulleted and numbered lists, tables, text columns, and footnotes or endnotes are likely to get lost.

— Support page, Google.

According to Wikimedians Shiju Alex and Ravishankar Ayyakkannu, for languages like Malayalam and Tamil the OCR works with almost 100 percent accuracy. It also handles layout-related tasks such as auto-cropping, separating text from images by discarding the images, and ignoring colored backgrounds.

Users of South Asian languages including Bangla, Malayalam, Kannada, Odia, Tamil, and Telugu have also shared feedback on Facebook after testing the updated OCR software. For a few scripts, such as Gurmukhi (used to write Punjabi), the output is quite poor: a test with a screenshot from Punjabi Wikipedia resulted largely in gibberish.
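The accuracy figures quoted above do not come with a stated measurement method. One common way to put a number on OCR quality is character-level accuracy: compare the OCR output against a hand-typed reference and count the edits needed to turn one into the other. The sketch below is only an illustration under that assumption; the function names are hypothetical and it is not how Google or the testers computed their figures.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions between two strings."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters recovered correctly, from 0.0 to 1.0.

    Assumption: characters are Unicode code points. For Indic scripts a
    reader may prefer to count grapheme clusters instead, which this
    sketch does not do.
    """
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))
```

For example, a ten-character reference with one wrong character in the OCR output scores 0.9, while output that shares almost nothing with the reference (the "gibberish" case) scores close to 0.0.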

This is a large leap for languages with many old texts that are not yet digitized. Old and valuable texts in many languages can now be digitized, shared over the internet on platforms like Wikisource, and preserved as freely available knowledge.

Google's OCR partly uses Tesseract, an OCR engine released as free and open-source software. Originally developed at Hewlett-Packard between 1985 and 1994, released as open source in 2005, and sponsored by Google since 2006, Tesseract is considered to be one of the world's most accurate OCR engines and works for over 60 languages. The source code is now hosted at https://github.com/tesseract-ocr. Check this link for the OCR outputs in various South Asian scripts.
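For readers who want to try Tesseract directly rather than through Google Drive, the following is a minimal sketch that calls the `tesseract` command-line tool from Python. It assumes Tesseract and the relevant language data are already installed locally; the helper names and the file name `page.png` are hypothetical, and the results will not necessarily match what Google Drive produces.

```python
import subprocess


def build_tesseract_command(image_path: str, language: str = "tam") -> list[str]:
    """Build the command line for one OCR run.

    "stdout" as the output base tells tesseract to print the recognized
    text instead of writing a file; "-l" selects the language data
    (for example "tam" for Tamil or "mal" for Malayalam).
    """
    return ["tesseract", image_path, "stdout", "-l", language]


def ocr_image(image_path: str, language: str = "tam") -> str:
    """Run tesseract on an image and return the recognized text.

    Raises subprocess.CalledProcessError if tesseract exits with an
    error, and FileNotFoundError if the binary is not installed.
    """
    result = subprocess.run(
        build_tesseract_command(image_path, language),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout
```

Calling `ocr_image("page.png", "mal")` would then return the text Tesseract recognizes in a scanned Malayalam page, provided the Malayalam language data is installed.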

Subhashish Panigrahi, Wikimedian and Programme Officer, Access To Knowledge, Centre for Internet and Society

Notes

Ideas for social media messages promoting the published post:

Twitter (@wikimedia/@wikipedia):

(Tweet text goes here - max 117 characters)
---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|------/

Facebook/Google+

  • ...