Community Wishlist Survey 2020/Wikisource/Generate thumbnails for large-format PDFs

From Meta, a Wikimedia project coordination wiki

The survey has concluded.


  • Problem: For some PDFs with very large page images (typically scanned newspapers), no preview images ("thumbnails") are shown.
  • Who would benefit: Wikisource contributors who proofread newspaper pages.
  • Proposed solution: Look at the PDF files described in phab:T25326, phab:T151202, commons:Category:Finlands Allmänna Tidning 1878, to find out why no thumbnails are generated.
  • More comments: When extracting the JPEG for an individual file, that JPEG can be uploaded. But when the JPEG is baked into a PDF, no thumbnail is generated. Is it because of its size? Small pages (books) work fine, but newspapers (large pages) fail.
  • Phabricator tickets: phab:T151202
  • Proposer: LA2 (talk) 21:04, 23 October 2019 (UTC)

Discussion

  • Hi LA2! Can you provide a description of the problem? This could help give us a deeper understanding of the wish. Thank you! --IFried (WMF) (talk) 18:52, 25 October 2019 (UTC)
    The problem is very easy to understand. I find a free, digitized PDF and upload it to Commons, then start to proofread in Wikisource. This always works fine for normal books, but when I try the same for newspapers, no image is generated. Apparently this is because the image has a larger number of pixels. I haven't tried to figure out what the limit is. --LA2 (talk) 21:36, 25 October 2019 (UTC)
  • For File:Finlands_Allmänna_Tidning_1878-00-00.pdf at least, ghostscript correctly rendered the file locally, but took a lot of time (like, a ridiculous amount of time; evince seems to render it instantly, so I don't know why ghostscript takes so long). So at a first guess, I suppose it's hitting time limits. Bawolff (talk) 20:20, 25 October 2019 (UTC)
    Maybe the solution is to fix ghostscript? Another way is to bypass ghostscript and use pdfimages to extract the embedded JPEG images and render them, since JPEG rendering seems to work fine. I don't know. --LA2 (talk) 21:34, 25 October 2019 (UTC)
    pdfimages is not a solution as a PDF page may consist of multiple images and it is hard to extract their relative location (at least not possible with pdfimages). Ankry (talk) 20:23, 9 November 2019 (UTC)
    I was going to write to the National Library about this (I think I know at least one of the persons involved) but I don't observe this slowness on gs 9.27, I think: phabricator:P9760. Maybe I should try a non-dummy command. Nemo 09:06, 27 November 2019 (UTC)
  • What about providing a more compact design for proofreading overall? Those seconds of scrolling count. If we had a window with the extracted text on one side and a same-size window with the scan on the other, in which you can zoom and move fast, that would save time and be more attractive for newbies. The way it is now, it looks kind of techy and is in some cases difficult to handle. E.g. there should also be more content help, or a link to the discussion page, presented in a more attractive design. Juandev (talk) 09:22, 4 November 2019 (UTC)
  • I think that a tool that allows generating such thumbnails manually / on request / offline, with much higher limits and available to a specific group of users (Commons admins? a dedicated group?), may be a workaround for this problem. Ankry (talk) 20:23, 9 November 2019 (UTC)
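
Note: the time-limit guess above could be checked locally with something like the following. This is only a minimal sketch, not how the Wikimedia thumbnailer is actually configured: the function names, the 150 dpi resolution and the limit passed in are all assumptions made for illustration. It renders a single page with ghostscript and reports whether the render finished within a chosen number of seconds.

```python
import subprocess
import time


def gs_render_command(pdf_path, page, out_path, dpi=150):
    """Build a ghostscript command that renders one PDF page to a JPEG.

    The dpi default is an illustrative assumption, not a value taken
    from any production configuration.
    """
    return [
        "gs",
        "-dBATCH", "-dNOPAUSE", "-dSAFER", "-q",
        "-sDEVICE=jpeg",
        f"-r{dpi}",
        f"-dFirstPage={page}",
        f"-dLastPage={page}",
        f"-sOutputFile={out_path}",
        pdf_path,
    ]


def time_render(command, limit_seconds):
    """Run a command; return (finished_within_limit, elapsed_seconds).

    A non-zero exit status raises subprocess.CalledProcessError, so a
    genuine rendering failure is not mistaken for a timeout.
    """
    start = time.monotonic()
    try:
        subprocess.run(command, check=True, timeout=limit_seconds)
        finished = True
    except subprocess.TimeoutExpired:
        finished = False
    return finished, time.monotonic() - start
```

Calling time_render(gs_render_command("newspaper.pdf", 1, "page1.jpg"), 60) on a problem file and on an ordinary book scan would show whether the newspaper page really needs far longer. A tool with "much higher limits", as suggested above, would in this sketch simply pass a larger limit_seconds.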
I'm not sure what you want me to check - the question at hand is why that particular version of the file failed to render. Bawolff (talk) 09:15, 22 November 2019 (UTC)
Wow, @Hrishikes and Bawolff:, there is a fix? How exactly does it work? Could it be integrated into the upload process? Could it be applied to all files in commons:Category:Finlands Allmänna Tidning 1878? --LA2 (talk) 19:13, 10 December 2019 (UTC)
@LA2: -- This problem occurs in highly compressed files and is linked to the OCR layer. The fix consists of decompressing the file (so that the size in MB increases) and either flattening or removing the OCR layer. I first tried flattening; it usually works but did not in this case, so I removed the OCR. Now it works. And yes, it is potentially usable for other files in your category. Extract the pages as PNG/JPG and rebuild the PDF. Hrishikes (talk) 01:39, 11 December 2019 (UTC)
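
Note: the "extract the pages and rebuild the PDF" step could be scripted roughly as below. The tools actually used for the fix are not named in this thread, so pdftoppm (from poppler-utils) and img2pdf are assumptions here, as are the function names and the 150 dpi default. Rasterizing each page drops the OCR text layer and re-encodes the image, so this is a sketch of the workaround, not a lossless conversion.

```python
import glob
import os
import re
import subprocess


def extract_command(pdf_path, prefix, dpi=150):
    """pdftoppm command that writes one JPEG per page as <prefix>-<n>.jpg."""
    return ["pdftoppm", "-r", str(dpi), "-jpeg", pdf_path, prefix]


def rebuild_command(image_paths, out_pdf):
    """img2pdf command that wraps the given images, in order, into a PDF."""
    return ["img2pdf", *image_paths, "-o", out_pdf]


def page_number(path):
    """Return the numeric page suffix of a file named <prefix>-<n>.jpg."""
    match = re.search(r"-(\d+)\.jpg$", path)
    if match is None:
        raise ValueError(f"unexpected page file name: {path}")
    return int(match.group(1))


def rebuild_pdf(pdf_path, out_pdf, workdir, dpi=150):
    """Rasterize every page, then rebuild an image-only PDF.

    Returns the number of pages written. The OCR layer of the source
    file is not carried over.
    """
    prefix = os.path.join(workdir, "page")
    subprocess.run(extract_command(pdf_path, prefix, dpi), check=True)
    # Sort numerically so that page 10 follows page 9, not page 1.
    pages = sorted(glob.glob(prefix + "-*.jpg"), key=page_number)
    subprocess.run(rebuild_command(pages, out_pdf), check=True)
    return len(pages)
```

Running rebuild_pdf over each file in a category would be a batch version of the manual fix; whether the rebuilt files then get thumbnails would still need to be checked file by file.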

Voting

  • Support Important issue for every project which relies on multi-page documents (PDF is a notoriously bad format but that's what we have in practice). It probably doesn't require much coding, but the Community Tech team could help by lobbying the appropriate WMF departments to get more resources assigned to the thumbnail generation. Nemo 09:16, 22 November 2019 (UTC)
  • Support --Jan.Kamenicek (talk) 10:31, 23 November 2019 (UTC)
  • Support Liuxinyu970226 (talk) 10:26, 24 November 2019 (UTC)
  • Support LA2 (talk) 12:45, 25 November 2019 (UTC)
  • Support Stefan Kühn (talk) 13:16, 25 November 2019 (UTC)
  • Support JogiAsad (talk) 13:25, 25 November 2019 (UTC)
  • Support Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 13:26, 25 November 2019 (UTC)
  • Support Ammarpad (talk) 15:36, 25 November 2019 (UTC)
  • Support A garbage person (talk) 16:18, 25 November 2019 (UTC)
  • Support 16:41, 25 November 2019 (UTC)
  • Support Ciao • Bestoernesto 17:45, 25 November 2019 (UTC)
  • Support Eunostos (talk) 20:37, 25 November 2019 (UTC)
  • Support Geonuch (talk) 15:09, 26 November 2019 (UTC)
  • Support Not that I want to encourage more use of PDF, but we run into too many pointless problems with all multi-page formats (the majority with PDF it seems) and reducing this will reduce both wasted time and frustration (which often hits new contributors: the old hands have learned to avoid the pain points). Xover (talk) 05:54, 27 November 2019 (UTC)
  • Support ··· 🌸 Rachmat04 · 14:35, 27 November 2019 (UTC)
  • Support Peter Alberti (talk) 19:59, 28 November 2019 (UTC)
  • Support Rahmanuddin (talk) 06:50, 2 December 2019 (UTC)
  • Support सुबोध कुलकर्णी (talk) 12:12, 2 December 2019 (UTC)
  • Support Novak Watchmen (talk) 17:56, 2 December 2019 (UTC)