Community Wishlist Survey 2020/Wikisource/Generate thumbnails for large-format PDFs


Generate thumbnails for large-format PDFs

Problem: For some PDFs with very large page images (typically scanned newspapers), no preview images ("thumbnails") are shown.
Who would benefit: Wikisource contributors proofreading newspaper pages.
Proposed solution: Look at the PDF files described in phab:T25326, phab:T151202, commons:Category:Finlands Allmänna Tidning 1878, to find out why no thumbnails are generated.
More comments: When the JPEG for an individual page is extracted, that JPEG can be uploaded. But when the same JPEG is embedded in a PDF, no thumbnail is generated. Is it because of its size? Small pages (books) work fine, but newspapers (large pages) fail.
Phabricator tickets: phab:T151202
Proposer: LA2 (talk) 21:04, 23 October 2019 (UTC)

Discussion

Hi LA2! Can you provide a description of the problem? This could help give us a deeper understanding of the wish. Thank you! --IFried (WMF) (talk) 18:52, 25 October 2019 (UTC)
The problem is very easy to understand. I find a free, digitized PDF and upload it to Commons, then start to proofread in Wikisource. This always works fine for normal books, but when I try the same for newspapers, no image is generated. Apparently this is because the image has a larger number of pixels. I haven't tried to figure out what the limit is. --LA2 (talk) 21:36, 25 October 2019 (UTC)

For File:Finlands_Allmänna_Tidning_1878-00-00.pdf at least, ghostscript rendered the file correctly locally, but it took a ridiculous amount of time (evince seems to render it instantly, so I don't know why ghostscript takes so long). So at a first guess, I suppose it's hitting time limits. Bawolff (talk) 20:20, 25 October 2019 (UTC)
Maybe the solution is to fix ghostscript? Another way is to bypass ghostscript and use pdfimages to extract the embedded JPEG images and render those, since JPEG rendering seems to work fine. I don't know. --LA2 (talk) 21:34, 25 October 2019 (UTC)
pdfimages is not a solution, as a PDF page may consist of multiple images and it is hard to extract their relative location (at least not possible with pdfimages). Ankry (talk) 20:23, 9 November 2019 (UTC)
I was going to write to the National Library about this (I think I know at least one of the persons involved), but I don't think I observe this slowness on gs 9.27: phabricator:P9760. Maybe I should try a non-dummy command. Nemo 09:06, 27 November 2019 (UTC)
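
To compare timings like those discussed above, a minimal Python sketch along the following lines could time a single-page Ghostscript render and stop it after a limit. The flags are standard Ghostscript options, but the 60-second limit, the 150 dpi resolution, and the function names are assumptions made for illustration; they are not the command or the limits the Wikimedia thumbnailer actually uses.

```python
import subprocess
import time


def build_gs_command(pdf_path, out_path, page=1, dpi=150):
    """Build a Ghostscript command that renders one PDF page to a JPEG."""
    return [
        "gs",
        "-dBATCH", "-dNOPAUSE", "-dSAFER", "-dQUIET",
        "-sDEVICE=jpeg",
        f"-dFirstPage={page}",
        f"-dLastPage={page}",
        f"-r{dpi}",
        f"-sOutputFile={out_path}",
        pdf_path,
    ]


def time_render(pdf_path, out_path, limit_seconds=60.0):
    """Return the render time in seconds, or None if the limit is exceeded.

    The limit is a hypothetical value chosen for this sketch.
    Raises FileNotFoundError if gs is not installed.
    """
    start = time.monotonic()
    try:
        subprocess.run(
            build_gs_command(pdf_path, out_path),
            check=True,
            timeout=limit_seconds,
        )
    except subprocess.TimeoutExpired:
        return None
    return time.monotonic() - start
```

Calling time_render on a local copy of the problem file and on an ordinary book scan would show whether the newspaper PDF is markedly slower on a given Ghostscript version.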

What about providing a more compact design for proofreading overall? Those seconds spent scrolling count. If we had the window with the extracted text on one side and a same-size window with the scan on the other, in which you can zoom and move quickly, that would save time and be more attractive for newbies. As it is now, it looks rather techy and is in some cases difficult to handle. For example, there should also be more help content or a link to the discussion page, presented in a more attractive design. Juandev (talk) 09:22, 4 November 2019 (UTC)

I think that a tool that allows generating such thumbnails manually / on request / offline, with much higher limits, and available to a specific group of users (Commons admins? a dedicated group?) may be a workaround for this problem. Ankry (talk) 20:23, 9 November 2019 (UTC)

@LA2 and Bawolff: -- File repaired (as cited above). Please check. Hrishikes (talk) 02:26, 22 November 2019 (UTC)

I'm not sure what you want me to check - the question at hand is why that particular version of the file failed to render. Bawolff (talk) 09:15, 22 November 2019 (UTC)

Wow, @Hrishikes and Bawolff:, there is a fix? How exactly does it work? Could it be integrated into the upload process? Could it be applied to all files in commons:Category:Finlands Allmänna Tidning 1878? --LA2 (talk) 19:13, 10 December 2019 (UTC)

@LA2: -- This problem occurs in highly compressed files and is linked to the OCR layer. The fix consists of decompressing the file (so that the size in MB increases) and either flattening or removing the OCR layer. I first tried flattening; it usually works but did not in this case, so I removed the OCR. Now it works. And yes, it is potentially usable for other files in your category. Extract the pages as PNG/JPG and rebuild the PDF. Hrishikes (talk) 01:39, 11 December 2019 (UTC)
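
The "extract the pages and rebuild the PDF" step described above could be sketched roughly as follows in Python. The comment does not say which tools were used, so this is an assumption: Ghostscript renders each page to a JPEG and the img2pdf command-line tool wraps the images into a new PDF. The function names, the file naming pattern, and the 150 dpi default are hypothetical. Rendering pages to images means the rebuilt PDF carries no text (OCR) layer, and re-encoding as JPEG may cost some quality.

```python
import glob
import os
import subprocess


def render_pages_command(pdf_path, out_dir, dpi=150):
    """Build a Ghostscript command that writes one JPEG per PDF page."""
    pattern = os.path.join(out_dir, "page-%04d.jpg")
    return [
        "gs",
        "-dBATCH", "-dNOPAUSE", "-dSAFER", "-dQUIET",
        "-sDEVICE=jpeg",
        f"-r{dpi}",
        f"-sOutputFile={pattern}",
        pdf_path,
    ]


def rebuild_command(image_paths, out_pdf):
    """Build an img2pdf command; zero-padded names sort into page order."""
    return ["img2pdf", *sorted(image_paths), "-o", out_pdf]


def rebuild_pdf(pdf_path, out_pdf, work_dir, dpi=150):
    """Render every page to JPEG, then assemble the images into a new PDF."""
    os.makedirs(work_dir, exist_ok=True)
    subprocess.run(render_pages_command(pdf_path, work_dir, dpi), check=True)
    pages = glob.glob(os.path.join(work_dir, "page-*.jpg"))
    if not pages:
        raise RuntimeError("Ghostscript produced no page images")
    subprocess.run(rebuild_command(pages, out_pdf), check=True)
```

Applied to a whole category, rebuild_pdf would be called once per file; whether the result then thumbnails correctly on Commons would still need to be checked for each upload.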

Voting

Support Important issue for every project which relies on multi-page documents (PDF is a notoriously bad format but that's what we have in practice). It probably doesn't require much coding, but the Community Tech team could help by lobbying the appropriate WMF departments to get more resources assigned to the thumbnail generation. Nemo 09:16, 22 November 2019 (UTC)
Support --Jan.Kamenicek (talk) 10:31, 23 November 2019 (UTC)
Support Liuxinyu970226 (talk) 10:26, 24 November 2019 (UTC)
Support LA2 (talk) 12:45, 25 November 2019 (UTC)
Support Stefan Kühn (talk) 13:16, 25 November 2019 (UTC)
Support JogiAsad (talk) 13:25, 25 November 2019 (UTC)
Support Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 13:26, 25 November 2019 (UTC)
Support – Ammarpad (talk) 15:36, 25 November 2019 (UTC)
Support A garbage person (talk) 16:18, 25 November 2019 (UTC)
Support 游魂 16:41, 25 November 2019 (UTC)
Support Ciao • Bestoernesto • ✉ 17:45, 25 November 2019 (UTC)
Support Eunostos (talk) 20:37, 25 November 2019 (UTC)
Support Geonuch (talk) 15:09, 26 November 2019 (UTC)
Support Not that I want to encourage more use of PDF, but we run into too many pointless problems with all multi-page formats (the majority with PDF it seems) and reducing this will reduce both wasted time and frustration (which often hits new contributors: the old hands have learned to avoid the pain points). Xover (talk) 05:54, 27 November 2019 (UTC)
Support ··· 🌸 Rachmat04 · ☕ 14:35, 27 November 2019 (UTC)
Support Peter Alberti (talk) 19:59, 28 November 2019 (UTC)
Support Rahmanuddin (talk) 06:50, 2 December 2019 (UTC)
Support सुबोध कुलकर्णी (talk) 12:12, 2 December 2019 (UTC)
Support Novak Watchmen (talk) 17:56, 2 December 2019 (UTC)