Community Tech/Google OCR for Indic language Wikisources/notes
Notes on #25 Wishlist Survey item: Tool to use Google OCRs in Indic language Wikisource (T120788)
July 13, 2016
Adding texts to Wikisource: Help:Beginner's guide to adding texts
You need to enable a gadget: "OCR - Enable OCR button in Page: namespace".
Upload the document to Commons as scans, PDF, or DjVu.
Step 1: Create an Index page for the text by replacing "File" with "Index" in the address bar. This gives you a list of fields to fill out: title, author, publisher, publication date, etc. (The example in the screenshots is Index:Alice in Blunderland.pdf.)
Step 2: The newly created Index page has lots of redlinks, one numbered link for each page in the scan.
Step 3: Click on one of the redlinks to create the Page. This uses the ProofreadPage extension to set up a side-by-side view of the scan and the editable text. If the scan doesn't automatically have OCR metadata, clicking the OCR button sends the image to Tesseract to be OCR'd.
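The title changes in the steps above can be sketched in a few lines of Python. This is only an illustration of the naming pattern, not part of any gadget or tool: the function names index_title_for and page_title_for are hypothetical, and the "Page:<file name>/<number>" form is assumed from how ProofreadPage names the pages linked from an Index.

```python
def index_title_for(file_title: str) -> str:
    """Swap the 'File' namespace prefix for 'Index', as done by hand in Step 1."""
    prefix, sep, rest = file_title.partition(":")
    if sep != ":" or prefix != "File" or not rest:
        raise ValueError(f"not a File: title: {file_title!r}")
    return "Index:" + rest


def page_title_for(file_title: str, page_number: int) -> str:
    """Build the title behind one numbered redlink on the Index page (Steps 2-3).

    Assumes the 'Page:<file name>/<number>' convention.
    """
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    rest = index_title_for(file_title).removeprefix("Index:")
    return f"Page:{rest}/{page_number}"


# Example from these notes: the scan of "Alice in Blunderland".
example_index = index_title_for("File:Alice in Blunderland.pdf")
example_page = page_title_for("File:Alice in Blunderland.pdf", 5)
```

With the example file, example_index is "Index:Alice in Blunderland.pdf" and example_page is "Page:Alice in Blunderland.pdf/5".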
Screenshot, Step 1: Creating the Index page for Alice in Blunderland, with fields to fill out.
Screenshot, Step 2: A new Index page, with lots of redlinks.
Screenshot, Step 3: Click on a redlink number to create a Page. The OCR button is highlighted.
Indic-language Wikisources, by number of pages:
- Bengali (bn) - 463,000 pages
- Telugu (te) - 38,000 pages
- Malayalam (ml) - 26,000 pages
- Sanskrit (sa) - 17,000 pages
- Kannada (kn) - 15,000 pages
- Gujarati (gu) - 10,000 pages
- Odia (or) - 7,000 pages
- Tamil (ta) - 4,000 pages
- Assamese (as) - 2,000 pages
- Marathi (mr) - 2,000 pages
Of those, the Google OCR API supports Bengali, Sanskrit, Tamil, Assamese, and Marathi.
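To see how much of the existing content that support covers, the list can be tallied as below. This is a rough sketch using only the rounded page counts and the supported languages stated in these notes; the variable and function names are made up for illustration, and the support set reflects the notes as written, not a current check of the API.

```python
# Rounded page counts per Wikisource, copied from the list above.
PAGES = {
    "bn": 463_000,  # Bengali
    "te": 38_000,   # Telugu
    "ml": 26_000,   # Malayalam
    "sa": 17_000,   # Sanskrit
    "kn": 15_000,   # Kannada
    "gu": 10_000,   # Gujarati
    "or": 7_000,    # Odia
    "ta": 4_000,    # Tamil
    "as": 2_000,    # Assamese
    "mr": 2_000,    # Marathi
}

# Languages these notes say the Google OCR API supports.
GOOGLE_OCR_SUPPORTED = {"bn", "sa", "ta", "as", "mr"}


def split_by_support(pages, supported):
    """Return (covered, uncovered) dicts of language code -> page count."""
    covered = {code: n for code, n in pages.items() if code in supported}
    uncovered = {code: n for code, n in pages.items() if code not in supported}
    return covered, uncovered


covered, uncovered = split_by_support(PAGES, GOOGLE_OCR_SUPPORTED)
covered_pages = sum(covered.values())
uncovered_pages = sum(uncovered.values())
total_pages = sum(PAGES.values())

print(f"covered:   {covered_pages:,} of {total_pages:,} pages")
print(f"uncovered: {uncovered_pages:,} pages in {sorted(uncovered)}")
```

On these rounded figures, the supported languages account for 488,000 of 584,000 pages, almost all of it Bengali. Telugu, Malayalam, Kannada, Gujarati, and Odia (96,000 pages together) are not covered.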
Supporters of this wishlist item included people from the Bengali, Telugu, Sanskrit, Kannada, Tamil, Odia and Punjabi communities.
Hindi Wikisource (hi) doesn't exist yet; some source text is available at http://wikisource.org/wiki/Main_Page:Hindi
Sam contacted Tamil Wikisource: User talk:Balajijagadesh#OCR gadget testing. Balajijagadesh is helping with testing.
Lots of testing: phab:T142770.