Grants talk:IEG/Tamil OCR to recognize content from printed books

From Meta, a Wikimedia project coordination wiki

Finalize your proposal by October 1st!

Hi kbalavignesh. Thank you for drafting this proposal!

  • Once you're ready to submit it for review, please update its status (in your page's Probox markup) from DRAFT to PROPOSED, as the deadline is September 30th.
  • If you have any questions at all, feel free to contact me (IEG committee member) or Siko (IEG program head), or just post a note on this talk page and we'll see it.

Cheers, Ocaasi (talk) 20:16, 25 September 2014 (UTC)

Accuracy

@Kbalavignesh: Hey there, I was just checking out this proposal and had some questions related to Tesseract. As its functionality is central to this grant, could you provide some examples of what the software can currently do, and what its limitations are? What accuracy issues do you expect to encounter? I JethroBT (talk) 21:35, 29 September 2014 (UTC)

Hi @I JethroBT: Tesseract per se is not a standalone OCR application. It is an OCR engine that can be extended to any language. The engine has already been extended for Tamil, but the test results are still not satisfactory, because many Tamil fonts have to be trained before the engine becomes more accurate. The engine responds well on basic OCR operations such as page layout analysis, binarisation and segmentation. Yet the training data, and the process of creating it, are relatively tough. Moreover, the scope for ambiguity is high in Tamil (for example, னை being recognised as னன, and வெ being recognised as கிவ). Apologies for answering a question posted to balavignesh; I just gave my part of the understanding as the deadline is near. Expecting a clearer explanation from balavignesh. :) Please post if further clarifications are needed. --Commons sibi (talk) 11:19, 30 September 2014 (UTC)
Hi @Commons sibi:, Thanks for the clarification.
Hi @I JethroBT: The core engine of Tesseract is like a child. Training Tesseract is like teaching a new language to a child by introducing it character by character.
Many styles of fonts exist. Also, in reality, printed text in an image may be distorted. So Tesseract needs more specific and continuous training to recognize the text accurately.
At present there is no online tool for training and managing the trained data. It needs some technical skill to set up and train. So developing this tool will help users in the following ways:
  • A simple GUI that lets non-technical users train on text.
  • Sharing trained data with others.
  • Continuous training by many people.
Ultimately, as training improves accuracy, wiki contributors can use it to recognize text from images.
For example, a lot of information in old books exists only as scanned images. It is hard to type that content manually to put it on a wiki. Instead, the contributor can select suitable trained data and give the image as input. Tesseract will recognize the text in the image and output it as text. --balavignesh (talk) 16:20, 30 September 2014 (UTC)
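To make the "image in, text out" step concrete, here is a minimal Python sketch of calling the Tesseract command-line tool with Tamil trained data. It assumes the `tesseract` executable and its Tamil data (language code `tam`) are installed; the function names and file names are hypothetical and are not part of the proposed trainer tool.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="tam"):
    """Return the argument list for one Tesseract run.

    `tam` is the language code of Tesseract's Tamil trained data.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def recognize(image_path, output_base, lang="tam"):
    """Run Tesseract on one image and return the recognized text.

    Tesseract writes its result to `<output_base>.txt`.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

A contributor would call something like `recognize("page_001.png", "page_001")` (hypothetical file names) and then proofread the returned text; the quality of that text depends on the trained data selected with `lang`.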

License of the proposed project

Hi bala, what is the proposed license under which the Trainer GUI is to be released? GPL or Apache 2.0? --Commons sibi (talk) 01:23, 1 October 2014 (UTC)

Hi @Commons sibi:, It will be GPL.

Hardware and internet costs?

Hi kbalavignesh, thanks for submitting this proposal! In order for us to confirm it would be eligible for an IEG, can you please share some more details about what "hardware and internet costs" include in your budget? And will you be hosting your web application on Wikimedia Labs or elsewhere? Best wishes, Siko (WMF) (talk) 23:19, 3 October 2014 (UTC)

Hi Siko (WMF), hardware is the cost of a laptop/PC and internet is the cost of internet service. It is not about hosting. The hardware cost is approximate and depends on the need.
Hi, kbalavignesh - yes, but what I'm trying to understand is what these costs will be used for :) The 4 project participants need a grant to pay for computers for them to work on, and internet service in order to complete this project, because they don't have their own that they can rely on for completing this work, is that correct? And as for hosting, that's a separate question from your hardware and internet costs, actually (sorry that wasn't clear): I may not be familiar with the details of the web application you've got in mind, but I assume it will need hosting somewhere...I'm guessing that place would be Tool Labs? Best wishes, Siko (WMF) (talk) 17:45, 7 October 2014 (UTC)
Hi Siko (WMF), this will be a standalone web application. It needs a LAMP stack (Linux, Apache, MySQL/PostgreSQL, PHP) and a Tesseract OCR installation. During development and testing it will be hosted on a VPS. Later, once stable, it can be moved to Wikimedia Labs. The hardware cost includes the hosting cost (VPS).

Free content and public domain materials

Any idea on how many applicable materials in Tamil can be considered free? I believe digitization will be most useful to Wikimedia contributors when the content can be freely shared (i.e., already in the public domain or released by the author). Would you like to give special attention to supporting OCR of historical documents in this grant proposal? I'm not sure about the Tamil language, but if it were Japanese, I would pay some attention to supporting characters historically used in printed materials of early modern Japan, because old public domain materials tend to contain such characters. whym (talk) 10:37, 7 October 2014 (UTC)

Hi whym, it seems Tamil Virtual University provided many books to Tamil wiki projects. Please check the endorsement comments. We can train Tesseract with any kind of character. I am not sure about Japanese characters, but old Tamil characters contain only a few minor shape changes. So proper training can make Tesseract read any type of character. --balavignesh (talk) 07:41, 9 October 2014 (UTC)
Noolaham Foundation (a non-profit digital preservation and archiving organization focused on Sri Lankan Tamil-speaking communities) has a notable and growing collection of Public Domain or Creative Commons licensed Tamil resources (http://www.noolaham.org). A practical Tamil OCR is desperately needed. --Natkeeran (talk) 14:09, 9 October 2014 (UTC)
@Kbalavignesh and Natkeeran: thank you for the information. Any idea on what kinds of content are included? Reference works and scholarly works would be most relevant to Wikimedia projects. Other kinds of materials (such as novels) would enrich Wikisource at least. Regarding fonts, since differences that look very minor to human eyes could confuse OCR systems a lot, I thought having some focus on target fonts might make sense (more on this point in a section below). whym (talk) 14:22, 9 October 2014 (UTC)

Online Tesseract trainer.

I beg your pardon, but there are online tools too. Check this: http://pp19dd.com/tesseract-ocr-chopper/. This problem is not very specific to Tamil. If you can do something pan-Indian, that would be great. --Rahmanuddin (talk) 04:47, 9 October 2014 (UTC)

Hi Rahmanuddin, please check the section "How it differs from other tools?". As I explained, the above tool only helps to chop the image into sections. It does not help to train as a group and share among the group. Also, as I explained, initially this will be focused on Tamil. Later, it can be customized for other Tesseract-supported languages with minimal effort. --balavignesh (talk) 07:41, 9 October 2014 (UTC)
I would also like to know how these previous projects can be built upon, or can inform the new project. --Natkeeran (talk) 14:10, 9 October 2014 (UTC)

Questions: Technical, Potential Partnerships

(These questions came out of the discussion going on in the Tamil Wiki Village Pump. Transferring the current questions here to share them with others.)

  • Will the feedback based on continuous training be accumulated similar to web based training?
Yes. The idea is that feedback (corrections to the output) will be gathered and the trained data will be rebuilt. --balavignesh (talk) 19:02, 11 October 2014 (UTC)
  • Has background research been conducted regarding previous research and development of Tamil OCRs? If so, where can I find it?
  • Is segmentation the main issue?
  • Is character classification the main issue?
  • Is Tesseract the best or among the best of the open source OCR engines? Please justify?
  • Will this project accumulate towards a general purpose Tamil OCR?
Yes. --balavignesh (talk) 19:02, 11 October 2014 (UTC)
  • Where will it be hosted? What organizational support/structure will there be to continue development after the completion of this project?
During development this will be hosted on a VPS. Later, it can be moved to Wikimedia Labs. The code and trained data will be released under the GPL and maintained in a common place. We can create and maintain a group to provide continuous support and development. --balavignesh (talk) 19:02, 11 October 2014 (UTC)
  • What is the minimum quality of images that is required, or that can be supported?
Inter-character and inter-line spacing should be clear, and the characters in the image should not be distorted. --balavignesh (talk) 19:02, 11 October 2014 (UTC)
  • Can we do any pre-processing to improve quality?
Yes. There are many open source image manipulation tools available. We can add this feature in future development. --balavignesh (talk) 19:02, 11 October 2014 (UTC)
  • Will the proposed team be responsive to Tamil wiki and user community?
Yes. --balavignesh (talk) 19:02, 11 October 2014 (UTC)
  • I am the technical lead of the Noolaham Foundation, a digital preservation and archiving organization focused on Sri Lankan Tamil-speaking communities. We have a desperate need for a working Tamil OCR. We will consider providing a full-time staff member to do testing for all or part of the duration. We can also provide test materials (e.g. TIFF files). Will providing these resources advance the project? How effectively will you be able to utilize these resources?
Thanks for the support. Initially, sample materials will be helpful for development. Testers can start testing once a working version of the tool is released. --balavignesh (talk) 19:02, 11 October 2014 (UTC)
  • Can you/we get some support from INFITT for this project, resource wise or just token support?
It is always great to get support from the community. This tool needs more training contributions from the community. We need comments from INFITT members. --balavignesh (talk) 19:02, 11 October 2014 (UTC)
  • Would involving computer science students help this project? How will the Team use students?
Yes. They can develop or test based on their skill set. --balavignesh (talk) 19:02, 11 October 2014 (UTC)
  • Assuming this is a valuable project that will lead us closer to a practical Tamil OCR, how can we maximize the chances of getting this grant?
Our discussions and a clear project plan will increase the chances. A project plan will be added soon. --balavignesh (talk) 19:02, 11 October 2014 (UTC)

--Natkeeran (talk) 13:58, 9 October 2014 (UTC)
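On the pre-processing question in the list above, one standard technique is binarisation, which the Accuracy thread also mentions as a basic OCR operation. The following is a minimal pure-Python sketch of Otsu's thresholding on a flat list of grey levels. It is an illustration only: the thread does not say which pre-processing method the team would use, and the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for a flat list of 0-255 grey levels."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += hist[level]          # pixels at or below this level
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg     # pixels above this level
        if weight_fg == 0:
            break
        sum_bg += level * hist[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor); Otsu maximizes it.
        between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between > best_variance:
            best_variance, best_threshold = between, level
    return best_threshold


def binarize(pixels, threshold):
    """Map pixels at or below the threshold to 0 (ink), the rest to 255 (paper)."""
    return [0 if p <= threshold else 255 for p in pixels]
```

In practice a scanned page would be loaded with an image library and flattened to grey levels before calling `otsu_threshold`; that loading step is left out here to keep the sketch self-contained.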

Focus of measuring performance

Is there any specific focus when measuring accuracy and the number of supported fonts? It would be useful to have some examples of or pointers to some of the target fonts and the target set of books or documents on which the accuracy will be evaluated, and the reasons why you choose them. I'd suggest putting some focus on reference works or similar, to make it more relevant to Wikimedia projects. Other evaluations to measure robustness using a more diverse set would be great to have, too, though. whym (talk) 14:10, 9 October 2014 (UTC)

@Kbalavignesh and Whym: Very good suggestions. The first target would be the Tamil Encyclopedia and the Tamil Children's Encyclopedia (20 volumes together). They were released under a Creative Commons license by the academy due to efforts by the Tamil Wiki community. These reference works will be immediately useful to the wiki and the greater communities. This should be the first target. There is a large collection of public domain works released by the Tamil Nadu government as well. The other main source is the public domain works at noolaham.org. --Natkeeran (talk) 15:53, 9 October 2014 (UTC)
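One common way to put a number on OCR accuracy for a chosen set of pages is the character error rate (CER): the edit distance between the OCR output and a hand-typed reference, divided by the reference length. The Python sketch below is an illustration under stated assumptions, not a metric the proposal commits to. It counts Unicode code points, which is a simplification for Tamil, where one visible letter can span several code points.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in code points."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Reporting this figure per typeface and per source (for example, per encyclopedia volume) would make it possible to set a specific target for each one.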

Eligibility confirmed, round 2 2014


This Individual Engagement Grant proposal is under review!

We've confirmed your proposal is eligible for round 2 2014 review. Please feel free to ask questions and make changes to this proposal as discussions continue during this community comments period.

The committee's formal review for round 2 2014 begins on 21 October 2014, and grants will be announced in December. See the schedule for more details.

Questions? Contact us.

Jtud (WMF) (talk) 22:05, 9 October 2014 (UTC)

Release of Training Data sets

This is a very good initiative. As far as I can see, there are two outputs in this project:

  • Web based Trainer GUI.
  • Annotated training data sets (against publisher and font types)

AFAIK, there is not much training data available for Indian languages. So please consider releasing the training data sets created for this project as a deliverable of this project. It will be a major useful data corpus for the FOSS community to build many Tamil computing tools. --AniVar (talk) 04:56, 10 October 2014 (UTC)

Hi AniVar, the main motivation of this tool is training and sharing trained data. So it will serve as a repository for trained data, which will be shared under the GPL license. --balavignesh (talk) 18:04, 11 October 2014 (UTC)
Thanks for clarifying. Please add it to the project outputs, and make sure the data sets are annotated. But I doubt you can select the GPL for training data sets. The license must be based on the license of the training materials you use. Please make sure to use freely licensed or PD materials. --AniVar (talk) 18:29, 11 October 2014 (UTC)
In particular, where will those training data sets be made available? And how would they be kept available indefinitely? Asaf (WMF) (talk) 23:19, 16 October 2014 (UTC)
Hi Asaf (WMF), during development the trained data can be stored where the application is hosted. Later the application can be hosted on Wikimedia Labs along with the trained data. --balavignesh (talk) 11:06, 3 November 2014 (UTC)

Example of usage cycle?

So, suppose I download Tesseract and the latest(?) Tamil OCR training data set. And I do some intensive further training on Tamil material I have. How do I upload or contribute back the modified/new training sets? How are they incorporated/added to the existing ones?

More generally, could you concretely describe the flow or cycle you envision, from a user perspective? Asaf (WMF) (talk) 23:19, 16 October 2014 (UTC)

Hi @Asaf (WMF): I have added details on the grant page. Check the section "Sample use cases". --balavignesh (talk) 10:10, 28 October 2014 (UTC)
Thanks, balavignesh! Asaf (WMF) (talk) 20:13, 28 October 2014 (UTC)

References / Previous Tools

--Natkeeran (talk) 15:50, 10 October 2014 (UTC)

Comment--Commons sibi (talk) 16:31, 10 October 2014 (UTC)
Thank you, Natkeeran, for these references! I'm not sure if one of them is also the tool mentioned by Dr. Selvakumar in the endorsement section of the proposal page; if it isn't, it would be good to also check whether that work can be used (after being relicensed). Asaf (WMF) (talk) 23:20, 16 October 2014 (UTC)
Asaf (WMF): Not sure. PonVizhi is the software provided free of charge by the Tamil Nadu and Indian governments. However, as noted above, it is not FOSS. If efforts are made through the right channels, we may be able to get it released under a FOSS or compatible license. However, it does not support the vast majority of Tamil fonts in use. --Natkeeran (talk) 20:53, 17 October 2014 (UTC)

Some suggestions

Hi, thanks for taking up this challenging project. Some suggestions:

  • The title of the project is misleading and confusing. You could have simply stuck to "Creating trainer GUI for Tesseract OCR for Tamil".
  • Have you done basic research on Tamil typefaces? If not, please do meet some old publishers, printers and typesetters. Madras (now Chennai) was the first southern city to become a centre for printing and publishing, so there is a lot of rich and untapped printing and publishing history within the city.
  • Prioritize meeting families who have been in printing and publishing for 2-3 generations.
  • Come up with a list of typefaces and prioritize them for your work. For instance, you should prioritize a typeface that was used in the early 1900s over one from the 1970s, as most of the content in the public domain (or copyright-free) will be from the 1900s and not the 1970s.
  • Please do visit RMRL to get a sense of the various typefaces. The staff there are also very knowledgeable and will extend help.
  • Tamil Wikimedia community has recently got the Tamil Encyclopedias donated under CC. Prof. Selvakumar told me that all the Tamil encyclopedias that were donated have the same typeface. So if you could prioritize training Tesseract for this typeface then it could have immediate use to the Tamil Wikimedia projects and the community.
  • Please also reach out to people at Noolaham Foundation and Project Madurai.

Best wishes.--Visdaviva (talk) 12:31, 27 October 2014 (UTC)

Very good set of suggestions, Visdaviva. I earlier pointed out the need to add annotated training data sets as an output. Visdaviva is pointing specifically to the orthography analysis, which will be helpful in forming annotation categories. Please make sure to annotate training data against typeface, publisher and period (year). As I pointed out earlier, a well-researched orthographic categorisation and annotation of training data will be a very helpful output, apart from the trainer GUI code. --AniVar (talk) 20:55, 27 October 2014 (UTC)

Hi Visdaviva, AniVar, thank you very much for the suggestions. Initially we are planning to cover the basic features of the tool. After deployment, we can start adding features and corrections based on our requirements, and we can do more work on the trained data as well. --balavignesh (talk) 10:23, 28 October 2014 (UTC)
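The annotation suggested above (typeface, publisher, period, plus the licence of the source material) could be recorded per training sample. Here is a minimal Python sketch of such a record. All field names are hypothetical, and nothing here reflects a format the project actually adopted.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainingSample:
    """One annotated training image (all field names are illustrative)."""
    image_file: str      # scanned page or line image
    ground_truth: str    # hand-checked transcription
    typeface: str
    publisher: str
    year: int            # period of printing
    source_license: str  # licence of the scanned material


def to_json_lines(samples):
    """Serialize samples as one JSON object per line, keeping Tamil text readable."""
    return "\n".join(
        json.dumps(asdict(s), ensure_ascii=False, sort_keys=True) for s in samples
    )


def group_by_typeface(samples):
    """Group samples so that trained data can be built per typeface."""
    groups = {}
    for sample in samples:
        groups.setdefault(sample.typeface, []).append(sample)
    return groups
```

Keeping the source licence on every record also addresses the earlier point that the licence of a training data set depends on the materials it was built from.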

Aggregated feedback from the committee for Tamil OCR to recognize content from printed books

Scoring criteria (see the rubric for background). Scores range from 1 (weak alignment) to 10 (strong alignment).
(A) Impact potential: 7.5
  • Does it fit with Wikimedia's strategic priorities?
  • Does it have potential for online impact?
  • Can it be sustained, scaled, or adapted elsewhere after the grant ends?
(B) Innovation and learning: 5.8
  • Does it take an innovative approach to solving a key problem?
  • Is the potential impact greater than the risks?
  • Can we measure success?
(C) Ability to execute: 6.8
  • Can the scope be accomplished in 6 months?
  • How realistic/efficient is the budget?
  • Do the participants have the necessary skills/experience?
(D) Community engagement: 7.5
  • Does it have a specific target community and plan to engage it often?
  • Does it have community support?
  • Does it support diversity?
Comments from the committee:
  • Lots of endorsements, clear target community, will support diversity in making special content easily available which has so far been inaccessible. Appreciate the focus on an under-represented language project.
  • Could vastly improve Tamil Wikisource, though it is less certain to be adapted for further languages. Probably not scalable, but the use of Tesseract can be good. Best to be clear that the focus and impact are for the Tamil language, not the adaptation of Tesseract for use in other languages.
  • Potential impact is large compared to risk.
  • Approach seems pretty standard when trying to build a new OCR tool.
  • Looking at <https://printalert.wordpress.com/author/kbalavignesh/>, the main proposer still seems to be learning OCR. There is some time until the project starts, though, and other collaborators might be able to help out in this regard.
  • Might appreciate seeing some more volunteer time involved in the project
  • The measures of success are good, but insufficiently quantified. Specific targets should be identified for each metric of interest.

Thank you for submitting this proposal. The committee is now deliberating based on these scoring results, and WMF is proceeding with its due-diligence. You are welcome to continue making updates to your proposal pages during this period. Funding decisions will be announced by early December. — ΛΧΣ21 16:57, 13 November 2014 (UTC)


Not sure if I can comment on the above assessment. I think the "Innovation and learning" opportunity is greater than the corresponding score suggests. OCR and OCR-related tools have been a significant technical and organizational challenge in Tamil and other Indian languages. From getting test materials to developing a platform that can be further developed on, there is an array of challenges. There is no active FOSS or commercial project in this space. Although the proposed solution may be a standard one, it must be considered within the overall context, where many challenges are involved. Providing the "Specific targets (should be) identified for each metric of interest" will further clarify whether this proposal can address those challenges. --Natkeeran (talk) 19:45, 17 November 2014 (UTC)

Round 2 2014 Decision


This project has not been selected for an Individual Engagement Grant at this time.

We love that you took the chance to creatively improve the Wikimedia movement. The committee has reviewed this proposal and not recommended it for funding, but we hope you'll continue to engage in the program. Please drop by the IdeaLab to share and refine future ideas!

Comments regarding this decision:
We recognized the potential that a project like this could have for the Tamil web, and we appreciate your efforts in this regard. Because the scope of IEG is focused on impact to Wikimedia projects, and direct impact is still unclear for this trainer GUI, we ultimately felt this proposal was not a good candidate for funding at this time. However, we encourage you to deepen your involvement in the Wikimedia community and would be happy to see you return in the future as the team develops clearer plans with Wikimedia volunteers either for this project, or any future ideas you may have.

Next steps:

  1. Review the feedback provided on your proposal and ask for any clarifications you need using this talk page.
  2. Visit the IdeaLab to continue developing this idea and share any new ideas you may have.
  3. To reapply with this project in the future, please make updates based on the feedback provided in this round before resubmitting it for review in a new round.
  4. Check the schedule for the next open call to submit proposals - we look forward to helping you apply for a grant in a future round.
Questions? Contact us.


Planning to continue without Grant

Even though I did not get the grant, I gained good experience in applying for one. The endorsements, discussions and suggestions show the importance of this project. Thank you all.

Thank you very much to Wikimedia for running the grant program. It has helped me shape my idea.

I am planning to continue with this project without the grant. I request you all to provide the same support and suggestions to make it happen.

Progress and updates: http://tesstrainer.wordpress.com/

Repository: https://github.com/kbalavignesh/tesstrainer

--balavignesh (talk) 03:07, 1 January 2015 (UTC)