Community Wishlist Survey 2020/Wikisource/Improve extraction of a text layer from PDFs


Improve extraction of a text layer from PDFs

  • Problem: If a PDF scan has an OCR layer (i.e. the original OCR layer, usually of high quality, that is part of many PDF scans provided by libraries, not the OCR text produced by our own OCR tools), the text is extracted from it very poorly in the Wikisource Page namespace. DjVu files do not suffer from this problem: their OCR layer is extracted well, and if the PDF is converted to DjVu, text extraction from its OCR layer improves as well. (Example of OCR extraction from a PDF here: [1], example of the same from DjVu here: [2].) As most libraries, including the Internet Archive and HathiTrust, offer PDFs with OCR layers for download rather than DjVu files, we need to fix text extraction from PDFs.
  • Who would benefit: All Wikisource contributors working with PDF scans downloaded from various major libraries (see above). Some contributors on Commons have expressed concern that the DjVu file format is dying and attempted to deprecate it in favour of PDF. Although that attempt has not succeeded (this time), many people still prefer working with PDFs, because the DjVu format is difficult for them to work with, because they do not know how to convert a PDF into DjVu (a conversion sketch follows this list) or how to edit DjVu scans, and because the DjVu format is not supported by web browsers.
  • Proposed solution: Fix the extraction of text from the existing OCR layers of PDF scans.
  • More comments:
  • Phabricator tickets:
  • Proposer: Jan.Kamenicek (talk) 20:18, 24 October 2019 (UTC)
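A minimal sketch of the conversion workaround referred to above, assuming pdf2djvu and DjVuLibre are installed locally; scan.pdf and scan.djvu are placeholder file names:

import subprocess

# Convert the library PDF to DjVu; pdf2djvu carries the PDF's hidden text layer
# over into the DjVu hidden text layer by default.
subprocess.run(["pdf2djvu", "-o", "scan.djvu", "scan.pdf"], check=True)

# Dump the DjVu text layer with djvutxt (DjVuLibre) to confirm the OCR survived.
text = subprocess.run(["djvutxt", "scan.djvu"],
                      capture_output=True, text=True, check=True).stdout
print(text[:500])

The resulting DjVu can then be uploaded in place of the PDF, and the text layer is picked up correctly in the Page namespace.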

Discussion

There are also libraries where it is possible to download a batch of pages (20–100) as a PDF, but with either no DjVu at all or only single pages in DjVu.

There is also the possibility of using the external Google OCR gadget:

mw.loader.load('//wikisource.org/w/index.php?title=MediaWiki:GoogleOCR.js&action=raw&ctype=text/javascript');
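// This loads the shared GoogleOCR gadget from multilingual Wikisource; it is typically added to one's common.js.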

However, it produces more OCR errors and sometimes lines get mixed up. JAn Dudík (talk) 12:13, 25 October 2019 (UTC)

Yes, exactly, the Google OCR is really poor (en.ws has it among their gadgets), but the original OCR layer that is part of most scans obtained from libraries is often really good; only MediaWiki fails to extract it correctly. If you download a PDF document, e.g. from HathiTrust, it usually contains an OCR layer provided by the library (i.e. not obtained by any of our tools), and when you try to use this original OCR layer in the Wikisource Page namespace, you get very poor results. But if you take the same PDF document and convert it to DjVu prior to uploading it here, you get amazingly better results when extracting the text in Wikisource, and you do not need any of our OCR tools. This means that the original OCR layer of the PDF is good; we are just unable to extract it correctly from the PDF for some reason, although we are able to extract it from DjVu. --Jan.Kamenicek (talk) 17:10, 25 October 2019 (UTC)
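One quick way to check the claim above, namely that the PDF itself already carries a usable OCR layer, is to dump the embedded text of a single page with poppler's pdftotext before uploading; a sketch, assuming poppler-utils is installed, with scan.pdf and the page number as placeholders:

import subprocess

# Dump only page 5 of the embedded text layer; "-" writes the result to stdout.
result = subprocess.run(
    ["pdftotext", "-f", "5", "-l", "5", "-layout", "scan.pdf", "-"],
    capture_output=True, text=True, check=True)
print(result.stdout)

If this output is clean while the Page namespace shows garbled text for the same page, the problem lies in the extraction step, not in the library's OCR.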
Yeah, it is pretty bad when the text layer does not appear and the OCR buttons hang greyed out, while I can still cut and paste the text from the IA txt file. Clearly a failure to hand off a clear text layer. Slowking4 (talk) 02:34, 28 October 2019 (UTC)

Voting