Community Wishlist Survey 2020/Wikisource/New OCR tool




  • Problem: 1) Wikisource has to rely on external OCR tools. The most widely used one has been out of service for many months, and all that time we have been waiting to see whether its creator will reappear and repair it. The other external OCR tools do not work well: they are either extremely slow to respond or generate poor-quality text. None of these tools can handle text divided into columns, as on magazine pages, and they often have problems with non-English characters and diacritics, so the OCR output needs to be improved.
    2) The hOCR tool does not work for Wikisources based on non-Latin scripts. The phetools hOCR tool creates a Tesseract OCR text layer only for Wikisources based on Latin script. For Indic Wikisource, for example, there is a temporary Google OCR tool for this, but integrating non-Latin scripts into our own tool would be more useful.
  • Who would benefit: Wikisource contributors handling scanned texts which do not have an original OCR layer or whose original OCR layer is poor, and contributors to wikisources based on non-Latin scripts.
  • Proposed solution: Create an integral OCR tool that the Wikimedia programmers would be able to maintain without relying on help of one specific person. The tool should:
    • be quick
    • generate good quality OCR text
    • be able to handle text written in columns
    • be able to handle non-English characters of Latin script including diacritics
    • be able to handle non-Latin languages

Tesseract, which is an open source application, also has a specific procedure for training its OCR, which requires the corrected text of a page and an image of the page itself. On the Wikisource side, pages marked as proofread belong to books that have been fully transcribed and reviewed. So what needs to be done is to strip the formatting from the text of these finished transcriptions, expand template transclusions and move references to the bottom. Then take the text along with an image of the page in question and run it through Tesseract's training procedure. The improvement would then be deployed on ToolLabs. The better the OCR, the easier the process becomes with each book, allowing Wikisource editors to be more productive and complete more pages than they could previously. This would also motivate users on Wikisource.
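
As a rough illustration of the clean-up step described above, here is a minimal Python sketch. The function name `wikitext_to_plain` and the regular expressions are assumptions made for illustration, not part of any existing Wikisource or Tesseract tool. It handles only bold/italic markup, simple links and `<ref>` tags, and it leaves template expansion out, since that would need the MediaWiki parser.

```python
import re

# Matches <ref>...</ref> and <ref name="x">...</ref>, but not self-closing refs.
REF_PATTERN = re.compile(r"<ref(?:\s[^>/]*)?>(.*?)</ref>", re.DOTALL)
# Matches [[target]] and [[target|label]], capturing the visible text.
LINK_PATTERN = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def wikitext_to_plain(wikitext: str) -> str:
    """Strip simple formatting and move reference text to the bottom.

    Hypothetical helper: templates are NOT expanded here.
    """
    references = []

    def collect(match):
        # Remember the reference text and remove it from the body.
        references.append(match.group(1).strip())
        return ""

    body = REF_PATTERN.sub(collect, wikitext)
    # Bold and italic markup are runs of two or more apostrophes.
    body = re.sub(r"'{2,}", "", body)
    # Replace links with their visible text.
    body = LINK_PATTERN.sub(r"\1", body)
    body = body.strip()
    if references:
        body += "\n\n" + "\n".join(references)
    return body


if __name__ == "__main__":
    sample = "The ''quick'' [[fox|brown fox]]<ref>A note.</ref> jumps."
    print(wikitext_to_plain(sample))
```

The resulting plain text would then be paired with the scan of the same page as input for the training procedure; how that pairing is stored depends on the training setup and is not shown here.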

Some concerns have been raised that the WMF nearly always uses open source software, which excludes e.g. ABBYY FineReader and Adobe, and that the problem with free OCR engines is their lack of language support, so they are never really going to replace Phe's tools fully. I do not know whether free OCR engines suffice for this task, but I hope the new tool will be as good as or even better than Phe's tools, and ideological reasons that would be an obstacle to quality should be put aside.

Discussion

I think this is the #1 biggest platform-related problem we are facing on English Wikisource at this time. —Beleg Tâl (talk) 15:09, 27 October 2019 (UTC)

Yeah. For some reason neither Google Cloud nor phetools supports all of the languages of Tesseract. Tesseract, in comparison to the wikisources, is missing Anglo-Saxon, Faroese, Armenian, Limburgish, Neapolitan, Piedmontese, Sakha, Venetian and Min Nan. --Snaevar (talk) 15:12, 27 October 2019 (UTC)

Note that you really don't want a tool that scans all pages for all languages as that is so compute-intensive that you'd wait minutes for every page you tried to OCR. Tesseract supports a boatload of languages and scripts, and can be trained for more, but you still need a sensible way to pick which ones are relevant on any given page. --Xover (talk) 07:27, 31 October 2019 (UTC)
I know. Both the Google Cloud and phetools gadgets pull the language from the language code of the wikisource that the button is pressed on and thus only use one language. The same thing applies here. These languages are mentioned, however, so it is clear which wikisources this proposal could support and which ones it would not. P.S. I am not American, so I will never try to word things to cover all bases. --Snaevar (talk) 23:01, 2 November 2019 (UTC)
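
The single-language approach described in the two comments above can be sketched as follows. This is only an illustration under assumptions: the mapping table is a small hand-picked sample, and `TESSERACT_LANG` and `ocr_command` are hypothetical names, not taken from the Google Cloud or phetools gadgets. The command it builds uses Tesseract's ordinary command-line form with the `-l` language option.

```python
# Hypothetical sample mapping from a wikisource language code to the
# name of a Tesseract language model; a real table would be much longer.
TESSERACT_LANG = {
    "en": "eng",
    "fr": "fra",
    "de": "deu",
    "cs": "ces",
    "bn": "ben",
}


def ocr_command(image_path: str, wiki_lang: str) -> list:
    """Build a Tesseract command that uses only the wiki's own language."""
    lang = TESSERACT_LANG.get(wiki_lang)
    if lang is None:
        # Failing loudly is clearer than silently scanning every language.
        raise ValueError("no Tesseract language known for " + repr(wiki_lang))
    # "stdout" asks Tesseract to print the recognised text.
    return ["tesseract", image_path, "stdout", "-l", lang]


if __name__ == "__main__":
    print(ocr_command("page.png", "cs"))
```

Picking one model per wiki keeps each request cheap, and a missing entry in the table shows directly which wikisources would not be covered.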

Even aside from the OCR aspect, being able to extract the formatting out of a PDF into wikitext would be highly valuable for converting PDFs (and other formats via PDF) into wikimarkup. T.Shafee(Evo﹠Evo)talk 11:19, 29 October 2019 (UTC)

I am not sure about formatting. Some scans or even originals are quite poor, and in such cases the result of trying to identify italics or bold letters may be much worse than if the tool extracted just pure text. I would support adding such a feature only if it could be turned on and off. --Jan.Kamenicek (talk) 22:05, 30 October 2019 (UTC)

Many pages require only simple automatic OCR. But there are pages with another font (italics, Fraktur) or pages with mixed languages (e.g. a missal in both the local language and Latin), where it would be useful to have some recognition options. This is easier to do on a local PC, but not everybody has that option. JAn Dudík (talk) 11:21, 31 October 2019 (UTC)

Would also be great to default the OCR formatting to match the MOS, rather than having to change it all to conform to the MOS manually. --YodinT 14:19, 25 November 2019 (UTC)

Voting

  • Support Bodhisattwa (talk) 06:45, 21 November 2019 (UTC)
  • Support JAn Dudík (talk) 07:15, 21 November 2019 (UTC)
  • Support Le ciel est par dessus le toit (talk) 13:00, 21 November 2019 (UTC)
  • Support Lyokoï (talk) 17:32, 21 November 2019 (UTC)
  • Support Tpt (talk) 19:36, 21 November 2019 (UTC)
  • Support: impossible to contribute since Phe’s tool is down. —Pols12 (talk) 21:03, 21 November 2019 (UTC)
  • Support Pamputt (talk) 21:38, 21 November 2019 (UTC)
  • Support Sadads (talk) 21:41, 21 November 2019 (UTC)
  • Support Balajijagadesh (talk) 05:24, 22 November 2019 (UTC)
  • Support Libcub (talk) 08:13, 22 November 2019 (UTC)
  • Support Jahl de Vautban (talk) 09:22, 22 November 2019 (UTC)
  • Support Lionel Scheepmans Contact French native speaker, sorry for my dysorthography 10:47, 22 November 2019 (UTC)
  • Support Alan Talk 12:46, 22 November 2019 (UTC)
  • Support JLTB34 (talk) 13:29, 22 November 2019 (UTC)
  • Support GPSLeo (talk) 21:10, 22 November 2019 (UTC)
  • Support DraconicDark (talk) 02:29, 23 November 2019 (UTC)
  • Support FreeCorp (talk) 05:25, 23 November 2019 (UTC)
  • Support Pavithra.A (talk) 12:14, 23 November 2019 (UTC)
  • Support Emptyfear (talk) 17:12, 23 November 2019 (UTC)
  • Support @ջեօ 17:15, 23 November 2019 (UTC)
  • Support --Armenmir (talk) 17:27, 23 November 2019 (UTC)
  • Support আফতাবুজ্জামান (talk) 23:18, 23 November 2019 (UTC)
  • Support Liuxinyu970226 (talk) 10:26, 24 November 2019 (UTC)
  • Support VIGNERON * discut. 10:40, 24 November 2019 (UTC)
  • Support Pymouss Tchatcher - 11:38, 24 November 2019 (UTC)
  • Support Eatcha (talk) 12:22, 25 November 2019 (UTC)
  • Support --Bander7799 (talk) 12:34, 25 November 2019 (UTC)
  • Support JogiAsad (talk) 13:27, 25 November 2019 (UTC)
  • Support Murma174 (talk) 13:27, 25 November 2019 (UTC)
  • Support Also in rtl language wikisource, do not insert ltr tags before punctuation marks. This causes problems. Naḥum (talk) 13:37, 25 November 2019 (UTC)
  • Support --YodinT 14:19, 25 November 2019 (UTC)
  • Support Blue Rasberry (talk) 15:32, 25 November 2019 (UTC)
  • Support MJLTalk 15:35, 25 November 2019 (UTC)
  • Support Husky (talk) 16:12, 25 November 2019 (UTC)
  • Support A garbage person (talk) 16:19, 25 November 2019 (UTC)
  • Support 16:43, 25 November 2019 (UTC)
  • Support Sgvijayakumar (talk) 19:09, 25 November 2019 (UTC)
  • Support Ninovolador (talk) 21:27, 25 November 2019 (UTC)
  • Support Vkalaivani (talk) 22:46, 25 November 2019 (UTC)
  • Support Risker (talk) 05:03, 26 November 2019 (UTC)
  • Support Geonuch (talk) 05:32, 26 November 2019 (UTC)
  • Support Hsarrazin (talk) 14:31, 26 November 2019 (UTC)
  • Support β16 - (talk) 15:08, 26 November 2019 (UTC)
  • Support Thibaut120094 (talk) 16:51, 26 November 2019 (UTC)
  • Support Noting that Community Tech forking and fixing Phe's tools will help precisely nothing in the long run. We need a WMF-supported tool that's within some WMF team's responsibilities to maintain and properly integrated into Mediawiki release cycles. Make use of volunteers where available, certainly, but someone at the WMF needs to own the OCR tool or it might as well stay a community gadget. Do please feel free to use this Wish to spend the necessary time kicking Phe's OCR tools until they start working again though. It's bound to be something stupid that's making it fail: like, has anybody tried to simply restart the tool? It could be hanging on a stale NFS file handle for all we know! Xover (talk) 06:10, 27 November 2019 (UTC)
    That is exactly what I hope is going to be solved. In this proposal I stated the problem: "Wikisource has to rely on external OCR tools" and proposed the solution: "Create an integral OCR tool that the Wikimedia programmers would be able to maintain without relying on help of one specific person." --Jan Kameníček (talk) 10:14, 1 December 2019 (UTC)
  • Support Acélan (talk) 13:19, 27 November 2019 (UTC)
  • Support Harkawal Benipal (talk) 16:08, 27 November 2019 (UTC)
  • Support Indic Wikisource community members at Wiki Advanced Training 2019 asked for a bulk OCR tool not dependent on platform (Linux, Windows etc.). I hope this tool allows bulk OCRing of pages. Satdeep Gill (talk) 16:43, 27 November 2019 (UTC)
  • Support WhatamIdoing (talk) 16:55, 27 November 2019 (UTC)
  • Support Pyb (talk) 18:05, 27 November 2019 (UTC)
  • Support This would be my #1 for Wikisource. Of course it should be open source. Wellparp (talk) 19:03, 28 November 2019 (UTC)
  • Support Peter Alberti (talk) 19:54, 28 November 2019 (UTC)
  • Support 94rain Talk 12:53, 30 November 2019 (UTC)
  • Support Satpal Dandiwal (talk) 21:07, 30 November 2019 (UTC)
  • Support while also agreeing with Xover's thoughts. Mahir256 (talk) 07:37, 1 December 2019 (UTC)
  • Support Candalua (talk) 16:35, 1 December 2019 (UTC)
  • Support Rahmanuddin (talk) 06:49, 2 December 2019 (UTC)
  • Support सुबोध कुलकर्णी (talk) 12:25, 2 December 2019 (UTC)
  • Support Ruthven (msg) 12:41, 2 December 2019 (UTC)
  • Support Sannita - not just another it.wiki sysop 13:19, 2 December 2019 (UTC)
  • Support Jberkel (talk) 13:22, 2 December 2019 (UTC)
  • Support Saederup92 (talk) 13:24, 2 December 2019 (UTC)
  • Support Omshivaprakash (talk) 14:14, 2 December 2019 (UTC)
  • Support Novak Watchmen (talk) 17:54, 2 December 2019 (UTC)