Community Tech/Google OCR for Indic language Wikisources/notes
Notes on #25 Wishlist Survey item: Tool to use Google OCRs in Indic language Wikisource (T120788)
July 13, 2016
Adding texts to Wikisource: Help:Beginner's guide to adding texts
You need to enable the gadget "OCR - Enable OCR button in Page: namespace".
Upload the document to Commons as scans, a PDF, or a DjVu file.
Step 1: Create an Index page for the text, by replacing "File" with "Index" in the address bar. It gives you a list of fields to fill out -- title, author, publisher, publication date, etc. (Example in screenshots is Index:Alice in Blunderland.pdf.)
Step 2: The newly created Index page has lots of redlinks -- a number for each page in the scan.
Step 3: Click on one of the redlinks to create the Page. The ProofreadPage extension sets up a side-by-side view of the scan and the text. If the scan doesn't already have OCR metadata, clicking the OCR button sends the image to Tesseract to be OCR'd.
Screenshots:
- Step 1: Creating the Index page for Alice in Blunderland, with fields to fill out
- Step 2: A new Index page, with lots of redlinks
- Step 3: Click on a redlink number to create a Page. The OCR button is highlighted.
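The title changes in Steps 1 to 3 can be sketched in a few lines of Python. This is only an illustration, not code from ProofreadPage: the helper names `index_title` and `page_title` are made up here, and the sketch assumes the usual ProofreadPage convention that scan page N lives at `Page:<file name>/N`.

```python
FILE_PREFIX = "File:"


def _file_name(file_title: str) -> str:
    """Strip the 'File:' namespace prefix, rejecting anything else."""
    if not file_title.startswith(FILE_PREFIX):
        raise ValueError(f"not a File: title: {file_title!r}")
    return file_title[len(FILE_PREFIX):]


def index_title(file_title: str) -> str:
    """Step 1: replace 'File' with 'Index' to get the Index page title."""
    return "Index:" + _file_name(file_title)


def page_title(file_title: str, page_number: int) -> str:
    """Step 3: title of the Page created from one redlink number.

    Assumes the 'Page:<file name>/<number>' naming convention.
    """
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    return f"Page:{_file_name(file_title)}/{page_number}"


if __name__ == "__main__":
    scan = "File:Alice in Blunderland.pdf"
    print(index_title(scan))    # Index:Alice in Blunderland.pdf
    print(page_title(scan, 5))  # Page:Alice in Blunderland.pdf/5
```

In practice the same substitution is done by hand in the address bar; the sketch just makes the naming pattern explicit.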
TPT is the lead maintainer of the ProofreadPage extension. Phe wrote the OCR tools, which are on Tool Labs as Phetools. Proofreading stats here: Phetools stats.
List of Indian language wiki projects -- the most active Indic Wikisources appear to be:
- Bengali (bn) - 463,000 pages
- Telugu (te) - 38,000 pages
- Malayalam (ml) - 26,000 pages
- Sanskrit (sa) - 17,000 pages
- Kannada (kn) - 15,000 pages
- Gujarati (gu) - 10,000 pages
- Odia (or) - 7,000 pages
- Tamil (ta) - 4,000 pages
- Assamese (as) - 2,000 pages
- Marathi (mr) - 2,000 pages
Of those, the Google OCR API supports Bengali, Sanskrit, Tamil, Assamese, and Marathi.
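To get a rough sense of how much of the existing content falls under supported languages, the figures above can be tallied as follows. This is a quick sketch using only the rounded page counts and the supported-language list from these notes; the names `PAGES`, `SUPPORTED` and `coverage` are just for illustration.

```python
# Rounded page counts per Indic Wikisource, copied from the list above.
PAGES = {
    "bn": 463_000, "te": 38_000, "ml": 26_000, "sa": 17_000, "kn": 15_000,
    "gu": 10_000, "or": 7_000, "ta": 4_000, "as": 2_000, "mr": 2_000,
}

# Languages from that list that the notes say the Google OCR API supports.
SUPPORTED = {"bn", "sa", "ta", "as", "mr"}


def coverage(pages: dict[str, int], supported: set[str]) -> tuple[int, int]:
    """Return (pages in supported languages, pages in all listed languages)."""
    covered = sum(n for code, n in pages.items() if code in supported)
    return covered, sum(pages.values())


if __name__ == "__main__":
    covered, total = coverage(PAGES, SUPPORTED)
    unsupported = sorted(set(PAGES) - SUPPORTED)
    print(f"{covered:,} of {total:,} pages ({covered / total:.0%})")
    print("not supported:", ", ".join(unsupported))
```

With these figures, supported languages account for 488,000 of the 584,000 listed pages, almost entirely because of Bengali; Telugu, Malayalam, Kannada, Gujarati and Odia are not covered.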
Support votes for this wishlist item included people from Bengali, Telugu, Sanskrit, Kannada, Tamil, Odia and Punjabi.
Hindi Wikisource (hi) doesn't exist yet, but some source text is available at http://wikisource.org/wiki/Main_Page:Hindi
Sept 8
Sam contacted Tamil WS: User talk:Balajijagadesh#OCR gadget testing. Balajijagadesh is helping with testing.
Lots of testing: phab:T142770.