WikiConference India 2016/Submissions/Indic Wikisource & Google OCR Co-ordination

From Meta, a Wikimedia project coordination wiki
Hashtag: #WCI2016
Please make sure to log in to Wikimedia Meta before creating your Proposal. If your submission is selected, we will contact your Wikipedia account so make sure your Username is correct.
Title of the submission
Your Username (For the submission author)

jayantanth (Link)

Type of presentation

Workshop

Abstract (in about 300 words)

The Indic Wikisource domains were started around 2006-2007. However, because no adequate Indic OCR was available, the projects did not flourish. Typing was not a good solution for these projects, although the Malayalam, Telugu, Tamil and Bengali Wikisource communities did contribute by typing books page by page. In January 2015, Google released its multilingual OCR tool in Google Drive. We have tested it and found the OCR results to be quite good. Then, in October 2015, Shrinivasan T of Tamil Wikipedia developed a Python script that automates OCR jobs using the Google Drive OCR tool. Using this tool, the Tamil and Bengali Wikisource communities have OCRed about 400554 and 466747 pages respectively (the latest stats for the Indic Wikisources are below). The other Wikisources, however, have been quite silent because they lack the knowledge of how to use this tool. In this workshop we will teach how to use this tool so that the other Indic-language Wikisources can be developed.

Prerequisites for this workshop:

  • Linux users
  • Basic Wikipedia editing skills.
  • Basic knowledge of Wikisource.
  • A Linux-based laptop.
  • A Windows laptop with Oracle VirtualBox (a preloaded Linux will be an added advantage)


Time required: Minimum of one hour

Requisite knowledge: Knowledge of Linux is helpful but not compulsory.

Target group: All Indic Language Community

Result

Accepted


Last updated in July 2016; full stats are available here.

          Page namespace                                                             Main namespace
language  all pages  not proofread  problematic  without text  proofread  validated  all pages  with scans  without scans  disamb  percent
te            27706           8537           28           621      18520      17219      11677        2839           8838       0    24.31
ta           400554         395592            2            14       4946       4632       4099          68           4031       0     1.66
ml            19918          11690          119           284       7825        668       6254         670           5584       0    10.71
gu             5570            360            9           107       5094       2597       4824         655           4169       0    13.58
bn           466747         459703          192          2909       3943       1159       6828         885           5913      30    13.02
kn            10617           8659            5            51       1902        594       6862          73           6789       0     1.06
or             4570           2764            2            22       1782        383        362           0            362       0        0
sa             4308           3486            9            90        723        192      13851           6          13845       0     0.04
as              710            307            0             0        403         66       1314          10           1304       0     0.76
mr             1774           1749            0             3         22          6        970           1            969       0      0.1
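
The page does not say how the percent column is calculated. As a rough sketch, the reported values appear consistent with dividing the main-namespace pages with scans by the sum of pages with and without scans (that is, leaving disambiguation pages out). This formula is an assumption inferred from the numbers, and the function name `scan_backed_percent` is hypothetical. The rows are copied from the table above.

```python
# Rows copied from the table: (language, with scans, w/o scans, disamb, reported percent)
ROWS = [
    ("te", 2839, 8838, 0, 24.31),
    ("ta", 68, 4031, 0, 1.66),
    ("bn", 885, 5913, 30, 13.02),
    ("sa", 6, 13845, 0, 0.04),
]


def scan_backed_percent(with_scans: int, without_scans: int) -> float:
    """Percent of main-namespace pages backed by scans.

    Assumed formula (inferred from the table, not stated on the page):
    disambiguation pages are excluded from the denominator.
    """
    total = with_scans + without_scans
    if total == 0:
        return 0.0
    return round(100 * with_scans / total, 2)


if __name__ == "__main__":
    for lang, with_scans, without_scans, _disamb, reported in ROWS:
        computed = scan_backed_percent(with_scans, without_scans)
        print(f"{lang}: computed {computed}, reported {reported}")
```

The bn row is the only one in the table with disambiguation pages, and it is the row that distinguishes this formula from a plain division by all main-namespace pages.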

Interested attendees and comments

  1. Highly recommended. Every Indian language Wikisource volunteer should attend this. --Ravi (talk) 18:37, 19 July 2016 (UTC)
  2. --Manojk (talk) 03:30, 20 July 2016 (UTC)
  3. Csyogi (talk) 10:05, 20 July 2016 (UTC)
  4. --Balajijagadesh (talk) 02:28, 2 August 2016 (UTC)
  5. --Kannan Shanmugam (talk) 06:06, 2 August 2016 (UTC)