Wikisource Handbook/OCR

From Meta, a Wikimedia project coordination wiki

Optical Character Recognition (OCR) is a process that converts the text in a scanned image (PDF, DjVu, JPG, etc.) into Unicode characters. For Indic languages, no software could perform OCR accurately until mid-2015, when Google released an OCR service for Indic languages. Indic Wikisource communities have been using the service since then.


OCR4Wikisource is free and open-source software developed by T. Shrinivasan et al. for Linux users to automate mass Google OCR through the Google Drive API. The software will:

  • Step 1: Download the book from Wikimedia Commons
  • Step 2: Split the file into individual pages
  • Step 3: Upload the pages to Google Drive one by one for OCR
  • Step 4: Download the OCRed text
  • Step 5: Upload the text to the respective Wikisource pages
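
The five steps above can be sketched as a small pipeline. This is only an illustration and not the OCR4Wikisource code itself: the function names (`commons_file_url`, `page_title`, `run_pipeline`) and the four caller-supplied step functions are hypothetical. It assumes that the file can be fetched through the `Special:FilePath` redirect on Wikimedia Commons and that each scanned page has a corresponding Wikisource page named `Page:<filename>/<page number>`.

```python
from urllib.parse import quote


def commons_file_url(filename):
    """Build a download URL for a file hosted on Wikimedia Commons.

    Special:FilePath redirects to the actual file; spaces become underscores.
    """
    return ("https://commons.wikimedia.org/wiki/Special:FilePath/"
            + quote(filename.replace(" ", "_")))


def page_title(filename, page_number):
    """Title of the Page-namespace page that holds one scanned page's text."""
    return f"Page:{filename}/{page_number}"


def run_pipeline(filename, download, split, ocr, upload):
    """Run the five steps; every step function is supplied by the caller.

    download(url) -> book, split(book) -> list of page images,
    ocr(image) -> text, upload(title, text) -> None (all hypothetical).
    """
    book = download(commons_file_url(filename))       # Step 1
    pages = split(book)                               # Step 2
    results = {}
    for number, image in enumerate(pages, start=1):
        text = ocr(image)                             # Steps 3 and 4
        title = page_title(filename, number)
        upload(title, text)                           # Step 5
        results[title] = text
    return results
```

Passing the steps in as functions keeps the sketch independent of any particular OCR backend or wiki client, so each step can be tried out on its own before running a whole book.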

It is recommended to create a bot account to run this script.

To install the script:
  • Step 1: Download the zip file from this link.[1]
  • Step 2: Extract the OCR4wikisource-master folder from the zip file and keep it in your Home directory.
  • Step 3: Open a terminal with the shortcut Ctrl+Alt+T.
  • Step 4: Type the following commands:
  1. cd OCR4wikisource-master
  2. bash ./

Drive configuration and running OCR
  • Step 1: Go to this address [2] and create a new project.
  • Step 2: Enable the Google Drive API and the Fusion Tables API.
  • Step 3: Go to the Credentials menu and then to the OAuth consent screen, where you must enter a name in the "Product name shown to users" field.
  • Step 4: Create credentials by selecting OAuth client ID.
  • Step 5: Set Application type to Other and give it any name.
  • Step 6: Download the JSON file and copy it to the OCR4wikisource-master folder.
  • Step 7: Rename the JSON file.
  • Step 8: Open the terminal and install another tool from this address [3] by typing the following commands:
  1. sudo apt-get install python-pip
  2. sudo pip install google-api-python-client
  3. sudo pip install gdcmdtools
  • Step 9: Run this command: client_secret_file name.json
  • Step 10: While the command runs, a web link appears in the terminal. Open the link and click the Allow button, which opens a new page showing a token.
  • Step 11: Copy the token and paste it into the terminal, after which the API will be configured.
  • Step 12: Go to the OCR4wikisource-master folder, open the config.ini file, and fill it in accordingly.
  • Step 13: Open the terminal and run the following command: python

Note that this software runs only on Linux. Also, please check whether the Google Drive API supports your language.
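
To show where the language comes into play, here is a minimal sketch of how a single page image could be sent to Google Drive for OCR over plain HTTP. It is not how OCR4Wikisource works internally (the script relies on gdcmdtools, as described above). The sketch assumes the Drive API v3 multipart upload endpoint, where asking for the Google Docs MIME type triggers the conversion and the `ocrLanguage` query parameter carries the language hint, and it assumes you already hold an OAuth 2.0 access token. The function names `build_ocr_upload` and `build_text_export` are hypothetical.

```python
import json
import uuid
from urllib.parse import urlencode
from urllib.request import Request

UPLOAD_URL = "https://www.googleapis.com/upload/drive/v3/files"
FILES_URL = "https://www.googleapis.com/drive/v3/files"


def build_ocr_upload(image_bytes, name, language, token, image_type="image/jpeg"):
    """Build (but do not send) a multipart upload that asks Drive to OCR an image."""
    boundary = uuid.uuid4().hex
    # Requesting the Google Docs type makes Drive convert the image to text.
    metadata = {"name": name, "mimeType": "application/vnd.google-apps.document"}
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        b"Content-Type: application/json; charset=UTF-8\r\n\r\n",
        json.dumps(metadata).encode("utf-8"),
        f"\r\n--{boundary}\r\n".encode(),
        f"Content-Type: {image_type}\r\n\r\n".encode(),
        image_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    # ocrLanguage is the language hint, for example "ta" for Tamil.
    query = urlencode({"uploadType": "multipart", "ocrLanguage": language})
    return Request(
        f"{UPLOAD_URL}?{query}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/related; boundary={boundary}",
        },
    )


def build_text_export(file_id, token):
    """Build (but do not send) a request that exports the OCRed document as plain text."""
    query = urlencode({"mimeType": "text/plain"})
    return Request(
        f"{FILES_URL}/{file_id}/export?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )
```

Both functions only construct the requests; sending them with `urllib.request.urlopen` requires a valid token and network access. The upload response is JSON whose `id` field identifies the new document, which is the `file_id` expected by `build_text_export`.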

Google OCR tool


The Google OCR tool adds a toolbar button to the Page namespace that derives text from the current page's image via the OCR service of Google's Cloud Vision API.[1] Check which languages this service supports.[4] Click the button on each Wikisource page to get its OCRed text.
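
For readers curious about what such a button might do behind the scenes, the following sketch builds a request body for the Cloud Vision `images:annotate` REST endpoint and reads the recognised text out of a response. It is an illustration under assumptions, not the tool's actual code: whether the tool uses `DOCUMENT_TEXT_DETECTION` or passes language hints is not stated here, and the function names `build_vision_request` and `extract_text` are hypothetical.

```python
import base64
import json

VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_vision_request(image_bytes, language_hints=()):
    """Return the JSON body for one OCR request on one page image."""
    request = {
        # The image is sent inline as base64-encoded bytes.
        "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
        # DOCUMENT_TEXT_DETECTION is the feature aimed at dense text such as book pages.
        "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
    }
    if language_hints:
        # Optional hints, for example ["bn"] for Bengali.
        request["imageContext"] = {"languageHints": list(language_hints)}
    return json.dumps({"requests": [request]})


def extract_text(response):
    """Pull the full page text out of a parsed annotate response (a dict)."""
    responses = response.get("responses") or [{}]
    return responses[0].get("fullTextAnnotation", {}).get("text", "")
```

The body would be sent as a POST to `VISION_ENDPOINT` with suitable credentials. `extract_text` returns an empty string when the service found no text, so the caller can leave the page untouched in that case.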

Note: OCRed text is not 100% accurate. Manual proofreading is needed to correct the errors.