CIS-A2K/Events/Gujarati OCR

From Meta, a Wikimedia project coordination wiki
CIS-A2K

CIS-A2K (Centre for Internet and Society - Access to Knowledge) is a campaign to promote the fundamental principles of justice, freedom, and economic development. It deals with issues like copyrights, patents and trademarks, which are an important part of the digital landscape.
If you have a general proposal or suggestion for the Access to Knowledge team, you can write on the discussion page. If you have appreciation or feedback on our work, please share it on the feedback page.

Making Gujarati OCR and TTS available to GU Wikimedia Community

Date: Sept. 18, 2013
Present: Prof. Rama Mohan and T. Vishnu Vardhan, over telephone

Context

M.S. University, Baroda has developed a Gujarati OCR that it considers very effective. Prof. Rama Mohan of M.S. University headed this project, which was funded by DEITY, GOI. Gujarati Wikimedians felt that having this OCR would add great value to the community. Following an initial discussion with GU Wikimedians Dhaval S Vyas and Sushant Savla on Sept. 15, 2013, a first meeting was set up with Prof. Rama Mohan. The key discussion points are captured below.

Key discussion points on OCR

Mr. T. Vishnu Vardhan gave a background to the selfless work put in by the GU Wikimedians and explained how the GU OCR could be of great help to the community and in effectively making knowledge in Gujarati open to the public through Wikisource.

Prof. Rama Mohan gave a background to the project, which has been funded by DEITY as part of the TDIL project. He said that MS University has so far trained the OCR on 25 different books, each of about 250 pages, and that the results show the OCR to be robust.

Mr. T. Vishnu Vardhan asked if the results are equally good when converting older prints.

Prof. Rama Mohan replied that dated books and images could be a challenge, so the user should decide after running the OCR whether the output is worth editing or whether the entire text should be retyped. According to Prof. Rama Mohan, the conversion success rate so far ranges from 70 to 95%.

Mr. T. Vishnu Vardhan explored the following options: a) buying a set of licenses; b) making the OCR available in the Public Domain; c) making the source code open for further improvements; and d) whether GU Wikimedians could test the OCR and contribute to its effectiveness.

Prof. Rama Mohan said that MS Univ. is in the process of putting up a Portal where any user can use the software to convert a scanned image of Gujarati text into text. He said that this could be used by anyone, including the GU Wikimedians.

Mr. T. Vishnu Vardhan tried probing for further details on this (such as when it will be made available, who can use it, and how effective it will be), but not many details were forthcoming. At the end of the conversation, Prof. Rama Mohan felt that the Portal could probably be up in a couple of months.

Upon enquiry, Prof. Rama Mohan willingly agreed to support the Wikisource community by converting a certain set of scanned texts (if the entire books are available). However, the community should not expect perfect conversion and should be willing to proofread the output before using it. Mr. T. Vishnu Vardhan enquired about updates on the OCR efforts being funded by DEITY across various institutions, and Prof. Rama Mohan shared some updates.

Mr. T. Vishnu Vardhan asked about Tesseract, and Prof. Rama Mohan informed him that he has not yet studied it.

Mr. T. Vishnu Vardhan explored the possibility of buying some licenses of the OCR. Prof. Rama Mohan was not sure who could sell the licenses and suggested exploring this with DEITY.

Mr. T. Vishnu Vardhan specifically brought up the Gujarati Sabha case, and Prof. Rama Mohan expressed in-principle agreement to deploy the GU OCR there. The details need to be worked out. Prof. Rama Mohan said that, other than infrastructure (Linux machines), some personnel need to be trained in the use of the OCR. As the GUI is not sophisticated, the personnel need at least 3-4 days of training in using commands.

Mr. T. Vishnu Vardhan asked if MS Univ. could offer this training if CIS-A2K sends the personnel to MS Univ. Prof. Rama Mohan agreed to this in principle.

Mr. T. Vishnu Vardhan asked if MS Univ. or Prof. Rama Mohan would be unhappy if CIS-A2K and the WMIN Chapter were to lobby DEITY or GOI to release the source code of this OCR. Prof. Rama Mohan said that he would be happy if it could be done.

Mr. T. Vishnu Vardhan asked if it is worthwhile for CIS-A2K, the WMIN Chapter and other such bodies to lobby DEITY, MoIT, GOI to make all OCR development open source. Prof. Rama Mohan said there is nothing wrong in giving it a shot and shared that TDIL is looked after by Dr. Swarnalatha (Director), DEITY. Vijay Kumar at DEITY is also a critical person. However, such decisions need to go to Secretary level.

Key discussion points on TTS

Prof. Rama Mohan gave a background to the TTS development at MS University. Mr. T. Vishnu Vardhan asked what the funding requirement would be. Prof. Rama Mohan said that it would need 2 people working for about 1 year. He added that segmenting the continuous speech is critical.

Mr. T. Vishnu Vardhan asked if the speech is in a native Indian tongue. Prof. Rama Mohan said yes.

Prof. Rama Mohan said that there are around 800 speech units in the database and that enhancing it to 3000 units is the challenge. The TTS engine is ready and can be re-built.

Mr. T. Vishnu Vardhan enquired if and when would be a good time to meet Prof. Rama Mohan in person. Prof. Rama Mohan said that he would be more than happy to meet in Baroda in October.