Community Wishlist Survey 2020/Archive/Improve OCR with wikisource texts

Improve OCR with wikisource texts

This proposal has been merged into Community Wishlist Survey 2020/Wikisource/New OCR tool.

Problem: Users on multiple Wikisources have complained that the OCR is not good enough, with letters with diacritics mentioned as being especially bad.
Who would benefit: All users of languages that are supported by Tesseract (close to 100 languages or language variants) and for which WMF hosts a Wikisource. The main content of Wikisource comes from transcribing books whose copyright has expired, and OCR is used in this process. Wikisource uses (among other things) the Google Cloud OCR, which uses Tesseract, and the phetools OCR on Labs, which uses Tesseract as well. In addition, some Wikisource users run Tesseract on their own computers. This proposal would improve the OCR tools hosted on ToolLabs. The better the OCR, the easier the work on each book, allowing Wikisource editors to complete more pages than they could previously. This would also motivate users on Wikisource.
Proposed solution: Tesseract has a specific procedure for training its OCR, which requires the corrected text of a page and an image of the page itself. On the Wikisource side, pages that have been marked as proofread identify books that have been fully transcribed and reviewed. So, what needs to be done is to strip the formatting from the text of these finished transcriptions, expand template transclusions and move references to the bottom. Then take the text along with an image of the page in question and run both through Tesseract's training procedure. The resulting improvement would then be deployed on ToolLabs.
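
The text-preparation step described above could be sketched roughly as follows. This is only an illustration, not part of the proposal or of any existing tool: the function name `strip_wikitext` and the regular expressions are assumptions, and the sketch assumes template transclusions have already been expanded (for example through the MediaWiki `expandtemplates` API) before the text is passed in. It removes bold/italic markup, links and leftover tags, and moves the contents of `<ref>` footnotes to the bottom of the page text.

```python
import re

# Matches <ref>...</ref> and <ref name="x">...</ref>, but not <references/>.
REF_RE = re.compile(r"<ref(?:\s[^>/]*)?>(.*?)</ref>", re.DOTALL | re.IGNORECASE)


def strip_wikitext(wikitext: str) -> str:
    """Return plain text for OCR training from proofread page wikitext.

    Hypothetical helper: assumes templates were expanded beforehand.
    """
    footnotes = []

    def collect(match):
        # Remember the footnote text and remove it from the running text.
        footnotes.append(match.group(1).strip())
        return ""

    text = REF_RE.sub(collect, wikitext)
    # Drop any remaining HTML-like tags (e.g. <references/>, <br/>).
    text = re.sub(r"<[^>]+>", "", text)
    # Remove bold/italic quote markup ('' to ''''').
    text = re.sub(r"'{2,5}", "", text)
    # [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Normalise runs of spaces and trim each line.
    text = re.sub(r"[ \t]+", " ", text)
    text = "\n".join(line.strip() for line in text.splitlines()).strip()
    # Footnotes go to the bottom, as they appear on the printed page.
    if footnotes:
        text += "\n" + "\n".join(footnotes)
    return text


if __name__ == "__main__":
    sample = "The ''quick'' [[fox|brown fox]]<ref>A note.</ref> jumps."
    print(strip_wikitext(sample))
```

The cleaned text would then be saved next to the matching page image so that the pair can be fed to Tesseract's training procedure; how the pairs must be named and segmented depends on the training tooling used and is not covered by this sketch.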

More comments: Tesseract is an open source application.

Phabricator tickets:
Proposer: Snaevar (talk) 23:21, 23 October 2019 (UTC)

Discussion

Yeah, we have a backlog of uploaded texts whose text layer did not get included and that require manual use of the OCR button. OCR quality is a major pain point. Slowking4 (talk) 00:14, 24 October 2019 (UTC)

Also note that Wikisource OCR doesn't work properly with italic and small-caps text. Adding support for these would be a huge improvement. Kaldari (talk) 16:11, 30 October 2019 (UTC)

Also, texts in Fraktur are very problematic. JAn Dudík (talk) 11:15, 31 October 2019 (UTC)

I checked the Fraktur issue, and I would need 100+ A4 pages' worth of 18th-century texts (16th-17th century is OK, too, but I did not check that). Your main Wikisource (cs.wikisource) does not have that. This proposal will improve those languages that do have Fraktur support, namely Danish, German, Slovak and Swedish (that is an exhaustive list, btw.).
I do not know enough about the italic and bold issue to comment on it.--Snaevar (talk) 15:43, 8 November 2019 (UTC)

@Snaevar: I wonder if you might want to merge this proposal with Jan.Kamenicek's proposal: New OCR tool. I think one unified proposal would probably get more votes than two similar proposals on their own. Kaldari (talk) 16:18, 30 October 2019 (UTC)

I will answer the merge proposal on the main talk page of Community Wishlist Survey 2020.--Snaevar (talk) 15:43, 8 November 2019 (UTC)

@IFried (WMF): Note that this proposal has been merged into New OCR tool. Kaldari (talk) 16:10, 12 November 2019 (UTC)