Community Tech/Google OCR for Indic language Wikisources/notes

Notes on #25 Wishlist Survey item: Tool to use Google OCRs in Indic language Wikisource (T120788)

July 13, 2016

Adding texts to Wikisource: Help:Beginner's guide to adding texts

You need to enable the OCR gadget, listed as "OCR - Enable OCR button in Page: namespace".

Upload the document to Commons as scans, PDF, or DjVu.

Step 1: Create an Index page for the text, by replacing "File" with "Index" in the address bar. It gives you a list of fields to fill out -- title, author, publisher, publication date, etc. (Example in screenshots is Index:Alice in Blunderland.pdf.)

Step 2: The newly created Index page has lots of redlinks -- a number for each page in the scan.

Step 3: Click on one of the redlinks to create the Page. The ProofreadPage extension sets up a side-by-side view, with the scan next to the text being edited. If the scan doesn't automatically have OCR metadata, clicking the OCR button sends the image to Tesseract to get OCR'd.

Step 1: Creating the Index page for Alice in Blunderland, with fields to fill out
Step 2: A new Index page, with lots of redlinks
Step 3: Click on a redlink number to create a Page. The OCR button is highlighted.
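
The title changes in Steps 1 and 3 can be sketched in a few lines of Python. This is only an illustration of the naming pattern, not part of any gadget or extension: the function names index_title_for and page_title_for are hypothetical, and the "Page:<file name>/<number>" form is assumed to be the usual ProofreadPage convention for per-page titles.

```python
def _file_name(file_title: str) -> str:
    """Strip the 'File:' prefix from a Commons title and return the bare file name."""
    namespace, sep, name = file_title.partition(":")
    name = name.strip()
    if sep != ":" or namespace.strip().lower() != "file" or not name:
        raise ValueError(f"not a File: title: {file_title!r}")
    return name


def index_title_for(file_title: str) -> str:
    """Step 1: replace 'File' with 'Index' to get the Index page title."""
    return "Index:" + _file_name(file_title)


def page_title_for(file_title: str, page_number: int) -> str:
    """Step 3: title of the Page created from one numbered redlink (assumed convention)."""
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    return f"Page:{_file_name(file_title)}/{page_number}"


if __name__ == "__main__":
    print(index_title_for("File:Alice in Blunderland.pdf"))
    print(page_title_for("File:Alice in Blunderland.pdf", 5))
```

With the example from the screenshots, the first call gives the Index title used in Step 1, and the second gives the title of the Page behind redlink number 5.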

TPT is the lead maintainer of the ProofreadPage extension. Phe wrote the OCR tools, which are on Tool Labs as Phetools. Proofreading stats are at Phetools stats.

According to the List of Indian language wiki projects, the most active Indic Wikisources appear to be:

Bengali (bn) - 463,000 pages
Telugu (te) - 38,000 pages
Malayalam (ml) - 26,000 pages
Sanskrit (sa) - 17,000 pages
Kannada (kn) - 15,000 pages
Gujarati (gu) - 10,000 pages
Odia (or) - 7,000 pages
Tamil (ta) - 4,000 pages
Assamese (as) - 2,000 pages
Marathi (mr) - 2,000 pages

Of those, the Google OCR API supports Bengali, Sanskrit, Tamil, Assamese, and Marathi.
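
To get a rough sense of how much of this content the supported languages cover, the figures above can be added up. The sketch below uses only the rounded page counts and the support list from these notes. The names PAGES, GOOGLE_OCR_SUPPORTED and coverage are made up for this example, and the result is only as accurate as the rounded inputs.

```python
# Rounded page counts per Indic Wikisource, copied from the list above.
PAGES = {
    "bn": 463_000, "te": 38_000, "ml": 26_000, "sa": 17_000, "kn": 15_000,
    "gu": 10_000, "or": 7_000, "ta": 4_000, "as": 2_000, "mr": 2_000,
}

# Languages from the list that the notes say the Google OCR API supports.
GOOGLE_OCR_SUPPORTED = {"bn", "sa", "ta", "as", "mr"}


def coverage(pages, supported):
    """Return (covered pages, total pages, codes of unsupported wikis)."""
    covered = sum(count for code, count in pages.items() if code in supported)
    total = sum(pages.values())
    unsupported = [code for code in pages if code not in supported]
    return covered, total, unsupported


if __name__ == "__main__":
    covered, total, unsupported = coverage(PAGES, GOOGLE_OCR_SUPPORTED)
    print(f"{covered:,} of {total:,} pages ({covered / total:.0%}) are in supported languages")
    print("Not supported:", ", ".join(unsupported))
```

Most of the covered pages come from Bengali alone, so the share says more about the size of Bengali Wikisource than about how many communities are served.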

Support votes for this wishlist item came from contributors to the Bengali, Telugu, Sanskrit, Kannada, Tamil, Odia, and Punjabi communities.

Hindi Wikisource (hi) doesn't exist yet; some source text is available at http://wikisource.org/wiki/Main_Page:Hindi

Sept 8

Sam contacted Tamil Wikisource: User talk:Balajijagadesh#OCR gadget testing. Balajijagadesh is helping with testing.

Lots of testing is tracked in phab:T142770.