Community Wishlist Survey 2019/Archive/Tool for easy conversion to DJVU


Tool for easy conversion to DJVU

Problem: After the conversion module at the Internet Archive stopped working, converting the files to DJVU has become a pain point.
français: ("Depuis l'arrêt du module de conversion sur Internet Archive, convertir des fichiers en DJVU est devenu une plaie.")

Who would benefit: All Wikisource users.
français: ("tous les participants à Wikisource")

Proposed solution: Create a tool that offers conversion from PDF to DJVU directly on Commons, or salvage the old tool at the Internet Archive.
français: ("créer un outil proposant la conversion de PDF en Djvu directement sur Commons ou récupérer l'ancien outil d'Internet Archive")

Other comments:

Phabricator tickets: phab:T73989

Proposer: Shev123 (talk) 15:46, 2 November 2018 (UTC)

Discussion

This was posted in French. I've done a rough translation to English, but kept the French original below. /Johan (WMF) (talk) 18:43, 2 November 2018 (UTC)

Converting PDF to DjVu is not so hard: there's an excellent free tool, pdf2djvu, for both Windows and Unix/Linux. The problem is building a DjVu with the best possible OCR text layer. IMHO IA discontinued DjVu output because it turned out to be useful for only a small number of items. Perhaps IA could activate DjVu output again "on demand" for items where DjVu is needed; I hope that Wikimedia Foundation and Internet Archive staff can reach an agreement about this. In the meantime, IA Upload can already build a DjVu with an excellent OCR layer from many IA items, and I feel that the residual issues for difficult cases could be fixed soon. --Alex brollo (talk) 19:29, 5 November 2018 (UTC)

pdf2djvu is not so easy to use (especially the first time). Adding a text to Wikisource is complicated and I would like to make some steps easier. I forgot to say that modifying a DjVu (because something is wrong with the file) is also very complicated. Sorry for my English... --Shev123 (talk) 10:19, 6 November 2018 (UTC)
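For readers who have not used pdf2djvu before, here is a minimal sketch of a basic conversion, wrapped in Python. It assumes pdf2djvu is installed and on the PATH, and uses only its `-o` (output file) and `--dpi` (resolution) options. The helper names `build_pdf2djvu_command` and `convert`, the default of 300 dpi, and the file name `book.pdf` are hypothetical choices for illustration. As far as I know, pdf2djvu carries over a text layer that already exists in the PDF but does not run OCR itself, so this does not address the OCR quality issue raised above.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdf2djvu_command(pdf_path, djvu_path=None, dpi=300):
    """Return the argument list for a basic pdf2djvu run.

    The 300 dpi default is an illustrative assumption, not a recommendation.
    """
    pdf = Path(pdf_path)
    # Default output name: same name as the input, with a .djvu extension.
    djvu = Path(djvu_path) if djvu_path else pdf.with_suffix(".djvu")
    return ["pdf2djvu", f"--dpi={dpi}", "-o", str(djvu), str(pdf)]


def convert(pdf_path, djvu_path=None, dpi=300):
    """Run pdf2djvu and return the path of the DjVu file it was asked to write."""
    if shutil.which("pdf2djvu") is None:
        raise RuntimeError("pdf2djvu was not found on PATH")
    cmd = build_pdf2djvu_command(pdf_path, djvu_path, dpi)
    # check=True raises CalledProcessError if pdf2djvu exits with an error.
    subprocess.run(cmd, check=True)
    return cmd[-2]


if __name__ == "__main__":
    # Only print the command here; call convert("book.pdf") to actually run it.
    print(" ".join(build_pdf2djvu_command("book.pdf")))
```

Building the command separately from running it makes it possible to check what would be executed without having pdf2djvu installed.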

This is totally true. For Mac users like me, who relied on Internet Archive for the DjVu conversion, it is now really a problem.

User:Tpt has included a conversion from PDF/JP2 to DjVu when using IA-upload to import a book from IA that has no DjVu, but this requires that the book already be on IA. A first conversion is then done on IA, which may take hours (and recently 2 days) for a book. And the on-the-fly conversion to DjVu sometimes fails (just have a look at the tool's page).

To import books from Gallica, Google, or other libraries, it would be really great to have an online tool that could convert PDF files to DjVu when importing directly (without having to upload to IA first), or even convert files already uploaded on Commons to DjVu...

For now, the process can take up to 2-3 days before the file is uploaded and ready for correction, which is much, much longer than before and really problematic, since the users with the technical ability to convert files have left.

A tool that users without specific technical ability could use to upload a book file themselves and have it converted to DjVu would certainly improve the project and relieve some of the pressure on the very few who have the technical ability... --Hsarrazin (talk) 16:59, 6 November 2018 (UTC)

See also the 2017 proposal: Improve workflow for uploading books to Wikisource. 65 endorsements! --Consulnico (talk) 09:57, 7 November 2018 (UTC)

I think the main problem is that we're becoming one of the few remaining communities making any use of DjVu; for most people, PDF suffices and is already part of their workflow. The IA has said they'd look into re-enabling it, but that was a year ago and they've not, so I think we're stuck with fixing up IA Upload, if that's the best thing to do. This proposal basically boils down to fixing the outstanding bugs with IA Upload (which are mostly already documented). I'm going to archive this; hope that's ok? Sam Wilson 02:43, 15 November 2018 (UTC)

@Samwilson: Please put this proposal back to the vote. By what right did you remove it? --Le ciel est par dessus le toit (talk) 12:34, 17 November 2018 (UTC)

@Le ciel est par dessus le toit: Sorry, I mistakenly made the archiving edit under my personal username, not my WMF one. See below for more rationale for archiving this. SWilson (WMF) (talk) 23:21, 19 November 2018 (UTC)

@Samwilson: This proposal was not for "fixing IA-upload" but for a full tool that would allow any Wikisource contributor to provide a PDF file or URL and have it uploaded and converted to DjVu without having to upload it to IA first. Could you please put it back where it was? --Hsarrazin (talk) 17:19, 18 November 2018 (UTC)

@Hsarrazin: The reason CommTech feel that we can't work on DjVu-generation this year is that it's not simple, and is only of limited use. If someone has a PDF to proofread, they can upload it to Commons with the existing tools, and proofread it from PDF. I'm not trying to say that DjVu isn't a better format in lots of ways, and I'm disappointed that we can't do more with it, but the reality seems to be that it's less and less used by GLAM partners, and the tools for working with it are not evolving. I definitely still think that the issue linked above (for an all-round better ingestion workflow for Wikisource) is something that really needs more focus, but specifically around DjVu I think there's a diminishing return for effort spent on it. Sorry! SWilson (WMF) (talk) 23:21, 19 November 2018 (UTC)

@Samwilson: There is a technical issue at stake here: on frws, we are now working to find scans for texts that were uploaded to main space without a backing scan. We have a tool that automatically matches the uploaded text with the text layer in the DjVu. It does not work with PDF, so we are obliged to redo the correction work when the book is a PDF and not a DjVu. This is really needed for our work.

AFAIK wikisource is not mainly used by GLAM partners (not frws), so taking into account the needs of GLAM partners to refuse a legitimate wikisource wish before it is even submitted to vote does not seem fair to me (and the whole frws community I'm now writing for).

Why did you remove it without letting the community vote on it? It is a Community wishlist after all, and the Community should be allowed to vote on it. --Hsarrazin (talk) 08:25, 20 November 2018 (UTC)

By saying that "we can't work on DjVu-generation this year", you are, in fact, burying the issue, because next year even fewer people will be able to use DjVu, so there will be even less reason (in your opinion)... This is NOT fair to dedicated contributors, who should at least be allowed to vote on this... --Hsarrazin (talk) 08:32, 20 November 2018 (UTC)

@Hsarrazin: Are you referring to the toolforge:phetools match_and_split tool? That indeed showcases one of the advantages of DjVu over PDF (easier traversal of OCR text by page-text). And I'm sorry if you feel like I'm burying the issue; I'm really not trying to. I think we all need to talk more about how and why we're using DjVu, because it takes not inconsiderable technical resources to continue to do so. Basically by archiving this proposal the CommTech team is saying that it's not something that we can technically work on, and that's why it's not worth having people vote on it. I think we'd be better off building tools to work with the formats that are more commonly in use. Again, I do apologise if I've not communicated properly about this. I really do want the best for Wikisource. :) SWilson (WMF) (talk) 10:29, 20 November 2018 (UTC)