Community Tech/Google OCR for Indic language Wikisources/notes

From Meta, a Wikimedia project coordination wiki

Notes on #25 Wishlist Survey item: Tool to use Google OCRs in Indic language Wikisource (T120788)

July 13, 2016

Adding texts to Wikisource: Help:Beginner's guide to adding texts

You need to enable a gadget in your preferences: "OCR - Enable OCR button in Page: namespace".

Upload the document to Commons as scanned images, a PDF, or a DjVu file.

Step 1: Create an Index page for the text, by replacing "File" with "Index" in the address bar. It gives you a list of fields to fill out -- title, author, publisher, publication date, etc. (Example in screenshots is Index:Alice in Blunderland.pdf.)

Step 2: The newly created Index page has lots of redlinks -- a number for each page in the scan.

Step 3: Click on one of the redlinks to create the Page. This uses the ProofreadPage extension to show the scan side by side with the editable text. If the scan doesn't already come with OCR metadata, clicking the OCR button sends the image to Tesseract to be OCR'd.
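The title changes in Steps 1 to 3 can be sketched as below. This is only an illustration: the helper names index_title and page_title are hypothetical, and it assumes the English namespace names "File", "Index" and "Page" (other Wikisources may use localized namespace names).

```python
def index_title(file_title: str) -> str:
    """Turn 'File:Foo.pdf' into 'Index:Foo.pdf' (Step 1)."""
    prefix, sep, name = file_title.partition(":")
    if sep != ":" or prefix != "File" or not name:
        raise ValueError(f"not a File: title: {file_title!r}")
    return "Index:" + name


def page_title(index: str, page_number: int) -> str:
    """Build the Page: title behind one redlink on the Index page (Steps 2-3)."""
    prefix, sep, name = index.partition(":")
    if sep != ":" or prefix != "Index" or not name:
        raise ValueError(f"not an Index: title: {index!r}")
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    return f"Page:{name}/{page_number}"


if __name__ == "__main__":
    idx = index_title("File:Alice in Blunderland.pdf")
    print(idx)
    print(page_title(idx, 1))
```

Each redlink on the Index page corresponds to one such Page: title, one per page of the scan.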

TPT is the lead maintainer of the ProofreadPage extension. Phe wrote the OCR tools, which are on Tool Labs as Phetools. Proofreading stats here: Phetools stats.

List of Indian language wiki projects -- the most active Indic Wikisources appear to be:

Of those, the Google OCR API supports Bengali, Sanskrit, Tamil, Assamese, and Marathi.
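As a rough sketch of what a request for one of these languages might look like, the snippet below builds a JSON body with a language hint. It assumes that "Google OCR API" here means the Google Cloud Vision images:annotate REST endpoint, which these notes do not confirm. The helper name build_ocr_request and the example image URL are hypothetical, and the snippet only constructs the body; it does not send anything.

```python
import json

# Language codes for the languages these notes list as supported.
LANGUAGE_HINTS = {
    "Bengali": "bn",
    "Sanskrit": "sa",
    "Tamil": "ta",
    "Assamese": "as",
    "Marathi": "mr",
}


def build_ocr_request(image_url: str, language: str) -> dict:
    """Build a request body for one page image (assumed Cloud Vision format)."""
    if language not in LANGUAGE_HINTS:
        raise ValueError(f"no language hint known for {language!r}")
    return {
        "requests": [
            {
                "image": {"source": {"imageUri": image_url}},
                "features": [{"type": "TEXT_DETECTION"}],
                "imageContext": {"languageHints": [LANGUAGE_HINTS[language]]},
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical image URL, for illustration only.
    body = build_ocr_request("https://example.org/scan-page-1.jpg", "Tamil")
    print(json.dumps(body, indent=2))
```

Passing a language that is not in the table raises an error instead of sending the request without a hint, which makes an unsupported language easy to spot during testing.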

Support votes for this wishlist item included people from the Bengali, Telugu, Sanskrit, Kannada, Tamil, Odia and Punjabi communities.

Hindi Wikisource (hi) doesn't exist yet; some source text is available at http://wikisource.org/wiki/Main_Page:Hindi

Sept 8

Sam contacted Tamil WS: User talk:Balajijagadesh#OCR gadget testing. Balajijagadesh is helping with testing.

Lots of testing: phab:T142770.