Talk:Community Tech/Ebook Export Improvement

From Meta, a Wikimedia project coordination wiki

Project Overview: Request for Feedback (May 2020)

Hello, everyone! We invite you to read the content page of the project, which includes an analysis of the ebook export process and its primary issues, and share your feedback below. Thank you!

Have we covered the main reasons why people export ebooks?

  • Yes, this is accurately covered and clearly explained. MartinPoulter (talk) 18:47, 28 May 2020 (UTC)
  • Great summary. In our actions, we should always keep in mind that there are two types of users: contributors, but also visitors to Wikisource. We have to make sure visitors have a good experience exporting books to whatever device they have. --Viticulum (talk) 19:25, 28 May 2020 (UTC)
    • @MartinPoulter and Viticulum: Thank you for the feedback! Also, it's a great point that we'll need to be continually mindful of both contributors and visitors. While these two groups will have some overlapping needs in this project (such as being able to find & download books), the contributors may have greater familiarity with Wikisource. For this reason, it's important that we identify the largest problems with the user experience, so we can hopefully improve UX overall. Thanks again! --IFried (WMF) (talk) 15:03, 4 June 2020 (UTC)

Have we covered the main methods to export ebooks?

  • Well explained. I learned something new myself.
    • In French Wikisource, we mostly use option "#4: Export via links at the top of text", and "#3: Export via links on the main page" when announcing new books.
    • We also have those links on the author page for each book. Ex.: See an author
    • My main concern is that external users can easily find and understand how to export books. With "#2: Export via the left side panel", external users can't export a full book, and this can be misleading to them. --Viticulum (talk) 19:28, 28 May 2020 (UTC)
  • This depends on the Wikisource. On Czech WS we have only the PDF and EPUB options, and no others in gadgets. On sk.WS there is only the PDF option. Export should be available on all language versions for all users. JAn Dudík (talk) 11:30, 29 May 2020 (UTC)

@MartinPoulter, Viticulum, and JAn Dudík: Thank you for the feedback! It's also very helpful to be reminded of the fact that different wikis have different common practices, and some do not have as many options available. Ideally, we will want to improve overall user experience, so that: 1) users can easily discover how to download books, and 2) users have various options available to them, if possible, rather than being limited by one option. We'll investigate how we can improve this experience. Thank you again! --IFried (WMF) (talk) 19:57, 9 June 2020 (UTC)

Have we covered the main problems experienced when exporting ebooks?

  • Yes, and I like that reliability was placed first, as book export is so core to the functionality of Wikisource that it needs really high uptime. An observation that occurs near the end is crucial: "The WSExport tool is not easily discoverable, and it doesn't provide an intuitive user experience". Yes, this is the case. Colleagues who are intelligent and very familiar with the Web have looked at this site and not grasped that any book on the site can be exported in a variety of formats, and it's easy to see why they miss that. MartinPoulter (talk) 18:47, 28 May 2020 (UTC)
@MartinPoulter: Thank you for sharing this! It is a fantastic point and it really helps frame some of the key issues. We want the WSExport to work well, of course. In addition, we need people to be able to find it, otherwise all of our potential improvements will have only limited impact. For this reason, we have been investigating some of the primary issues related to user experience and we hope to improve the discoverability of WSExport, along with its general reliability and performance. Again, thank you for sharing this perspective! --IFried (WMF) (talk) 23:43, 24 July 2020 (UTC)

Problem exporting the image on the first page in pdf format

  • For me, the most frustrating problem: when exporting an ebook in "pdf" format, in most cases (see, for example, the book), the image on the cover page does not appear on the first page, and an error message is shown when opening the file (in French, "Données insuffisantes pour une image", i.e., "insufficient data for an image"). Thanks, Laurent --Lorlam (talk) 17:36, 28 May 2020 (UTC)

@Lorlam: Thank you; much appreciated! --IFried (WMF) (talk) 22:06, 10 July 2020 (UTC)
This seems a side effect of changes made earlier in 2020, as this was not happening before. --Viticulum (talk) 19:33, 28 May 2020 (UTC)
Yes, this problem appeared in February 2020. --Lorlam (talk) 20:43, 28 May 2020 (UTC)

@MartinPoulter, Lorlam, and Viticulum: Thank you for this information! We completely agree that the user experience is not optimal, and we hope to improve it (both for experienced editors/readers & newcomers). Also, thanks for the information regarding the PDF export issue, which seems to have appeared around February 2020. I'll share this information with the team & see if we can investigate. --IFried (WMF) (talk) 20:24, 9 June 2020 (UTC)

@Lorlam: Thanks for this comment. The link you provided doesn't seem to have an image on the first page, so I wasn't able to properly test/reproduce this issue. Do you have another example? Thanks! --IFried (WMF) (talk) 16:18, 11 June 2020 (UTC)
@IFried: The link given should have an image after exporting the book (see,_13.djvu), but I could give other examples ( ... ... ). Thanks, --Lorlam (talk) 18:02, 11 June 2020 (UTC)
@Lorlam: Thanks for the information! We have tested this, and we noticed the following: When we first downloaded the book, the first cover page was blank. However, when we looked at the download a day later (in one case), the cover page had changed to display the expected content. The page was red-colored, which was not expected, but the content and imagery looked fine. Does this reflect your experience? --IFried (WMF) (talk) 23:00, 24 June 2020 (UTC)
@IFried: Interesting! A red-colored image instead of a grayscale image! This may indicate a bad object definition in the PDF structure, with insufficient data to render the green and blue channels. The object definition for the Scribe cover should be: <</Type/XObject/BitsPerComponent 8/ColorSpace/DeviceGray/DL 22383/Filter[/DCTDecode]/Height 565/Length 22383/Subtype/Image/Width 400>> instead of <</Type/XObject/BitsPerComponent 8/ColorSpace/DeviceRGB/DL 22383/Filter[/DCTDecode]/Height 565/Length 22383/Subtype/Image/Width 400>> as written by Calibre. As mentioned, a workaround is to convert the 8-bit grayscale cover to 24-bit true color. --Denis Gagne52 (talk) 15:37, 27 June 2020 (UTC)
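If the diagnosis above is right, the mismatch could in principle be caught before the image dictionary is written. The following is a minimal Python sketch under that assumption, not Calibre's or WSExport's actual code: it reads the component count from the JPEG frame header and picks the matching PDF /ColorSpace name. The function names jpeg_component_count and pdf_colorspace_for are hypothetical.

```python
import struct

# Start-of-frame (SOF) markers; each is followed by the frame header that
# declares precision, height, width and the number of colour components.
SOF_MARKERS = {
    0xC0, 0xC1, 0xC2, 0xC3, 0xC5, 0xC6, 0xC7,
    0xC9, 0xCA, 0xCB, 0xCD, 0xCE, 0xCF,
}


def jpeg_component_count(data: bytes) -> int:
    """Return the number of colour components declared in a JPEG stream."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        if marker == 0x01 or 0xD0 <= marker <= 0xD9:  # markers without a length
            pos += 2
            continue
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker in SOF_MARKERS:
            # Layout after the marker: length(2) precision(1) height(2)
            # width(2) components(1), so the count sits at offset 9.
            if pos + 9 >= len(data):
                raise ValueError("truncated frame header")
            return data[pos + 9]
        pos += 2 + length  # skip this segment
    raise ValueError("no frame header found")


def pdf_colorspace_for(data: bytes) -> str:
    """Pick the PDF colour space name that matches the JPEG's components."""
    names = {1: "/DeviceGray", 3: "/DeviceRGB", 4: "/DeviceCMYK"}
    count = jpeg_component_count(data)
    if count not in names:
        raise ValueError("unsupported component count: %d" % count)
    return names[count]
```

For a grayscale cover like the one described, pdf_colorspace_for would return /DeviceGray, which is the value the corrected object definition uses.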
@Denis Gagne52: Ah, interesting; thanks for providing that potential explanation (which I have noted). Much appreciated. --IFried (WMF) (talk) 21:57, 7 July 2020 (UTC)
@IFried: Sorry, for me, for books which have problems, the cover page never displays (for information, I use Acrobat Reader to open pdf files) --Lorlam (talk)
@IFried: I found information about this problem, which is described on GitHub cu --Lorlam (talk) 15:58, 26 June 2020 (UTC)
@Lorlam: Thanks for getting back to us and providing this additional information. We'll check it out. One thing: It seems like the Github link that you provided didn't work (404 error when I tried to access it). Can you try to share it again? Thanks! --IFried (WMF) (talk) 21:54, 7 July 2020 (UTC)
@IFried: Yeh ! it is cu --Lorlam (talk) 00:10, 8 July 2020 (UTC)
@IFried: Hi! It is August now, and for me this problem is still not fixed :-( --Lorlam (talk) 17:30, 21 August 2020 (UTC)
@Lorlam: Hello, and thank you for reaching out! We have not yet begun prioritizing and tackling individual bugs, such as the one you described. We are still primarily in the research phase of the project. We cannot guarantee that we will fix specific Wikisource or WSExport-specific bugs yet (and there are many for us to look into); however, we do hope to fix many bugs, with priority given to those related to: 1) people being unable to download books, and 2) people being unable to read the text within the books (i.e., font rendering issues). In the meantime, we would love it if you could check out our August update and share your feedback. By sharing your feedback, you'll help us get closer to wrapping up the research phase of the project, and we'll then be able to dig into some of the highest priority Wikisource-related bugs. Also, one quick note: My username is actually IFried_(WMF), so if you use that in the future, I'll definitely get the ping. Anyway, thank you for all of your feedback so far, and we hope to read more in response to the August update! --IFried (WMF) (talk) 23:32, 21 August 2020 (UTC)
Just to say, although this may seem a minor issue, it is actually quite a nuisance for end users who download multiple books. Without a cover unique to each book, they all look the same (Wikisource logo plus small text title, or a boring text title). Ebooks with designed covers, e.g. the original title page, are much more identifiable. JimKillock (talk) 13:27, 28 August 2020 (UTC)

Which formatting and style issues are the most common and frustrating, in your opinion?

  • We often add images using the crop image tool. This works fine in the web view, but when we download the book, the whole page image is included instead of the cropped image.
  • Tables are often not rendered properly in the downloaded book.
  • The sfrac template is not rendered properly in the downloaded book.

--Balajijagadesh (talk) 18:27, 27 May 2020 (UTC)

@Balajijagadesh: Thank you for this information! One question: When we conducted some basic tests, the fractions in ebook exports looked okay. Maybe you can provide some examples of the sfrac template issue, which we can use for analysis? Thanks! --IFried (WMF) (talk) 23:07, 24 June 2020 (UTC)
@IFried (WMF): Hi. Thanks for reaching out. The sfrac template is rendered properly in pdf and epub formats, but it is not rendered properly in mobi format. The horizontal bar disappears, which introduces an alignment problem. Let me know if you can reproduce the problem. Regards -- Balajijagadesh (talk) 13:50, 2 July 2020 (UTC)
@Balajijagadesh: Thank you for sharing this information! We have tested this issue on mobi, and we have been able to reproduce the issue. I have written a ticket for this. Appreciate it! --IFried (WMF) (talk) 22:22, 10 July 2020 (UTC)
  • When converting Gujarati ebooks into mobi format using WSExport, the text is printed only on the right half of the page. --Sushant savla (talk) 11:10, 28 May 2020 (UTC)
@Sushant savla: Thank you for this feedback! We think we captured this issue in our example #5 on the project page. Is this correct/is this the same issue that you are describing? Thanks! --IFried (WMF) (talk) 23:09, 24 June 2020 (UTC)
  • I have found that the main problem with "Download as PDF" is fonts. When special fonts are used, especially those that support diacriticals, the output is not always rendered in the same font. Rather, a standard font is sometimes used, one which does not support the diacriticals. There are also sometimes unexpected changes to font size that can ruin the formatting. Dovi (talk) 12:12, 28 May 2020 (UTC)
@Dovi: Thank you for providing this information! We have some follow-up questions (so we can better understand the problem). Our questions: Can you provide an example of where you are seeing this issue? And how are you downloading the PDF? Is it via the side-panel (and, therefore, via ElectronPDF) or via the top panel (and, therefore, via WSExport), or somewhere else? Thanks in advance! --IFried (WMF) (talk) 23:13, 24 June 2020 (UTC)
    • Slightly related: I do not see the "Choose format" option in Hebrew Wikisource. How can it be enabled? Dovi (talk) 12:18, 28 May 2020 (UTC)
@Dovi: Hello! The ability to see "Choose format" should be available, if WSExport is enabled, on the wiki. If you want to enable it in the sidebar, you can try to contact someone with interface admin rights on your wiki in order to enable it in the sidebar. --IFried (WMF) (talk) 23:15, 24 June 2020 (UTC)
  • (Perhaps this should be in a new section, feel free to move): The first example from enWS is, in my opinion, not a good example. The markup at enWS was using the <center> tag, and it wasn't using the s:en:Template:page break template, which inserts some CSS-styled div to produce a page break in ereaders (break-after:page; page-break-after:always;). So, I think these issues are not really the fault of the WS-export tool, but rather an issue that should be fixed at enWS. Perhaps WS-export could spot "suspect" markup and make a best-effort attempt to hotfix them during export, but that would mask the underlying issue of poor markup at the source and offload the burden onto the WS-export maintainers. Inductiveload (talk) 10:53, 29 May 2020 (UTC)
@Inductiveload: Thanks for sharing this information; it was very helpful. We can see, like you wrote, that the example is due to incorrect markup (i.e., template:pagebreak should have been used instead of <center> tag). In this case, the issue seems to be community outreach and education rather than a technical issue. However, we still want to document that this is happening, so that we can inform our communities how to mitigate these issues when they export books using WSExport. We'll also look into adding more details on the project page about this. Thanks again! --IFried (WMF) (talk) 23:18, 24 June 2020 (UTC)
  • I tried to export several books with the WSExport tool, and the biggest issue was metadata. On cs.wikisource, all content pages have an infobox with information about the author, source, licence, etc., and the same table appeared at the beginning of every chapter in the exported book. There should be an option to hide this information on export and have it only once in the text. JAn Dudík (talk) 11:50, 29 May 2020 (UTC)
@JAn Dudík: Thanks for the feedback! While we see that someone provided a solution to the metadata issue with ebook exports, we also understand that there are other issues, and we hope to improve the ebook export experience overall. Furthermore, we see that there’s an issue with encoding in external hyperlinks, which we've noted. Thanks! --IFried (WMF) (talk) 23:32, 24 June 2020 (UTC)
  • @JAn Dudík: The support for WSExport on cs.wikisource is very poor. If the cs.wikisource community wants good exported e-books, it would unfortunately require a lot of changes there. Hiding the metadata table is one of the simpler changes. --EBookian (talk) 20:35, 29 May 2020 (UTC)
  • @EBookian: Is there documentation somewhere on what to do for better support? JAn Dudík (talk) 20:49, 29 May 2020 (UTC)
  • @JAn Dudík: WSExport is quite a simple tool which takes some pages and translates them into an e-book; there is not much to document, though it surely is lacking in some areas. You added the microformat there, which is a good thing. On the other hand, cs.wikisource relies heavily on those metadata tables at the moment, and if you exclude them from export now, you will see no division between chapters. You need to unify the style of pages, create e-book CSS, ... I am getting out of the scope of this page; if you wish, we can continue this talk somewhere else. --EBookian (talk) 21:20, 29 May 2020 (UTC)
@JAn Dudík: Thanks for bringing up this question about documentation! We also see that improved documentation of best practices can help people encounter less confusion and errors. We’re currently looking into how to do this, and we’ll update the project page when we have information. --IFried (WMF) (talk) 22:06, 7 July 2020 (UTC)
  • I have now tried to read one exported book in my mobile app, and I found that there is a problem with the encoding of external hyperlinks: instead of UTF-8, the link is probably in latin-2 (instead of cs:s:Autor:Věnceslav Černý I got a link to Autor:VÄ›nceslav_ÄŚernĂ˝). JAn Dudík (talk) 20:49, 29 May 2020 (UTC)
@JAn Dudík: Thanks for this information. In order to better understand the problem, we have a few questions: 1) When you say you are using the mobile app, what do you mean, exactly (since there is no Wikisource app?). Are you using the mobile view of a desktop browser, for example? 2) Did you use the download PDF button on this page (we are asking because this link uses ElectronPDF rather than WSExport)? Thanks! --IFried (WMF) (talk) 22:10, 7 July 2020 (UTC)
@IFried (WMF): I used wsexport to generate an epub file from a cs.wikisource book. Then I copied it to my mobile and opened it using the Cool Reader app (but you can imagine any other e-book reader). The text of the book and the images were correct, but the external links from the infoboxes had bad encoding. JAn Dudík (talk) 09:22, 8 July 2020 (UTC)
@JAn Dudík: Thank you for this explanation! We tested accessing a downloaded epub via a mobile reading app, and we didn’t see any issues with the external links. However, we understand that this issue may still occur sometimes. For this reason, we have documented this issue in Phabricator. We may not have time to fix it in the scope of this project, since we are primarily focusing on issues related to the WSExport tool not working/working too slowly or books not being readable (i.e., basic functionality of the tool and basic readability of the text). However, we have it documented, in case someone would like to fix it now or in the future. Thank you for reporting it! --IFried (WMF) (talk) 23:33, 24 July 2020 (UTC)
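For anyone trying to reproduce the encoding report above: the garbled link is what appears when UTF-8 bytes are decoded with a Central European single-byte code page. The exact characters in the report match Windows-1250 (a close relative of Latin-2). The Python sketch below only illustrates that mechanism; where in the export pipeline the wrong decoding happens is not established here, and the helper names garble and repair are hypothetical.

```python
def garble(text: str, wrong_codec: str = "cp1250") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


def repair(garbled: str, wrong_codec: str = "cp1250") -> str:
    """Reverse the mistake: recover the original bytes, then decode as UTF-8.

    This only works when every byte survived the wrong decoding; code page
    1250 leaves a few byte values undefined, so some text cannot round-trip.
    """
    return garbled.encode(wrong_codec).decode("utf-8")


original = "Věnceslav Černý"
broken = garble(original)   # the mojibake form seen in the exported link
restored = repair(broken)   # back to the original author name
```

Here broken reproduces the garbled author name from the report (with a space instead of the underscore), and restored equals the original string.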
  • When converting text from Wikisource into pdf or rtf, the text is indented at the start of every paragraph. It even indents the first line of a poem, even if it is enclosed in a poem tag, so the output for poems is bad and all of their alignment is spoiled. The poems are not indented in epub or mobi format. The issue can be seen here -- Balajijagadesh (talk) 07:06, 3 July 2020 (UTC)
@Balajijagadesh: Thanks for this feedback! We have tested this issue on epub, pdf, and mobi. As you wrote, the pdf version had incorrect indentation. The mobi version had the numbers smashed into the text, which also looked strange. The only version that looked okay was epub. We have written a ticket to track the issue, and we’ll see if we can look into this. In addition, we are beginning to investigate the best practices for proofreading content to Wikisource. Once we share these findings, we hope it can help prevent some formatting and styles issues in the future. Thanks! --IFried (WMF) (talk) 22:03, 10 July 2020 (UTC)
@Balajijagadesh: Thank you for bringing this up! We have covered this issue in example #2 on the project page, and we agree that this is a big problem. We really hope that we can fix it, and we have begun investigating how we may be able to do this. Thanks again and we hope to provide updates on this issue soon. --IFried (WMF) (talk) 22:05, 10 July 2020 (UTC)

Which user experience issues are the most common and frustrating, in your opinion?

  • Often the download time of a book is so long that people close the page. Many times the wsexport tool doesn't work at all. -- Balajijagadesh (talk) 18:28, 27 May 2020 (UTC)
  • See mw:Bug management/Triage/201410. In general, things only got worse since 2014, so everything that applied back then is still valid. You need to study the relevant components in Phabricator. Nemo 13:42, 29 May 2020 (UTC)
@Balajijagadesh and Nemo bis: Thank you for the feedback on the most frustrating UX issues! This is helpful and we will take a look. --IFried (WMF) (talk) 22:09, 10 July 2020 (UTC)

Which problems, overall, do you find the most critical to fix, and why?

  • Since the latest version, WSExport is slower than before. External visitors may not be patient if the system is too slow (they may think it is not working). When the time-out is reached, the message is not user-friendly for external visitors. --Viticulum (talk) 19:31, 28 May 2020 (UTC)
@Viticulum: Thanks for the feedback! One question: What is the latest version you are referring to? Also, thanks for the comment about the need to improve user-friendly messaging (we’ll look into it). --IFried (WMF) (talk) 22:11, 10 July 2020 (UTC)
@IFried (WMF): Sorry for taking so long to get back to you. The slowness was experienced in May in production. I do not know how to determine versions. I will test 10 books this week, every day. Results on Friday. --Viticulum (talk) 19:40, 26 July 2020 (UTC)
@IFried (WMF): Please see here the result of my test for Export Time. --Viticulum (talk) 19:56, 2 August 2020 (UTC)
@Viticulum: Thank you for sharing this very useful information! We have included it as a note in our current investigation about Wikisource errors and issues. This will help us have a better understanding of the wait times experienced by some Wikisource users when downloading books. We hope this analysis can help us identify primary issues and how we can go about fixing or improving them. Thank you again! --IFried (WMF) (talk) 21:31, 11 August 2020 (UTC)
  • We need multi-year reliability. Multi-page export needs to be provided by a MediaWiki extension again, in all the formats people need: PDF and EPUB at a minimum (but when you support EPUB, it's easy to add ZIM and ODT as well). The development and maintenance of the extension needs to be outsourced to a third party, with sufficient funding for at least 5 years, so that users and partners (for instance libraries) can be sure that it will keep existing in the future and not vanish overnight if a couple of people at WMF decide so. Without a reliable export, it's impossible to get national libraries and the various access methods to bring users to Wikisource. Nemo 13:45, 29 May 2020 (UTC)
@Nemo bis: Thanks for the feedback! Just to make sure we understand your comment, can you clarify what you mean by “multi-year reliability?” To your other point, we agree that Wikisource should have more standardized and easily accessible tools and gadgets. For this reason, we will be working to improve this issue, especially through the ‘Migrate Wikisource specific edit tools from gadgets to Wikisource extension’ wish. Finally, to your point regarding maintenance: While the Community Tech team will not be maintaining Wikisource, overall, in a long-term capacity, we are hoping to increase the overall health and usability of Wikisource, so that it is easier to maintain in the future. --IFried (WMF) (talk) 22:13, 10 July 2020 (UTC)

Anything else you would like to add?

  • I would like the developers/technical team to pay attention to ebooks in RTL languages. These are written right-to-left (e.g., Hebrew and Arabic). I hope the Export tool will also support such languages. From past experience, such support is not automatic, and special care is needed to ensure this. --Naḥum (talk) 12:14, 28 May 2020 (UTC)
@Nahum: Thanks so much for this feedback! We would love to learn more about the issues and challenges unique to RTL users on Wikisource, especially regarding ebook exports. Can you provide more details? We agree that this should be looked into as well, so we look forward to your response. --IFried (WMF) (talk) 22:15, 10 July 2020 (UTC)

Modernisation does not export

Hi! One issue with the export is that the modernisation system that we use, at least on fr.wikisource, does not work in exported formats because it is in JS. This makes for very unpleasant reading of old texts that have been transcribed in the original version and then modernised with the modernisation system. It is very convenient to use on Wikisource itself but very disappointing in the export. --M0tty (talk) 12:00, 28 May 2020 (UTC)

See this example [1] for modernisation of old French: in the middle left, there is "Orthographe originale" or "Orthographe moderne". This is done for each chapter. It is not possible to extract a chapter or the whole book in modernised French. This functionality is not incorporated in WSExport. I believe this would be a whole project in itself. Tpt could give more insight. --Viticulum (talk) 19:46, 28 May 2020 (UTC)
Ideally, as it seems possible to include some JavaScript in an ePub, it would be great if the ePub file could contain both versions and switch from one to the other using exactly the same JavaScript code as on French Wikisource. However, it's possible that this would require loading not only the "local" replacements present as a parameter of the modernisation template, but also the entire Wikisource modernisation dictionary, or at least the subset of words which are found in the exported text. --George2etexte (talk) 14:13, 2 June 2020 (UTC)
@M0tty, Viticulum, and George2etexte: Thanks for sharing this information. From my understanding, you are writing about the fact that Wikisource readers online can choose which orthography to select, but this is not available for ebook exports. Is this correct? And, if so, can you provide a bit more explanation and context around it (for example, do you know if there is already a Phabricator ticket that documents this problem)? The fix for this may be a large project that is out of scope for the current project. However, it’s good for us to still know about this issue, and we would like to document it in Phabricator. We look forward to your response. Thanks! --IFried (WMF) (talk) 22:17, 10 July 2020 (UTC)
Hi @IFried (WMF): Yes, that's exactly it. The epub export can't export the modernisation layer. I haven't found any ticket on Phabricator regarding this issue. Thanks for looking into this. --M0tty (talk) 17:46, 11 July 2020 (UTC)
@M0tty: Thank you for your response! We have documented the issue on Phabricator. We may not be able to work on it during the span of this project, since we’re primarily focused on fixing issues related to people not being able to download, access, or read books (i.e., core, basic usage bugs). However, we wanted to document it, and we hope that it can be picked up by someone to be fixed in the future. Thank you for letting us know about this issue! --IFried (WMF) (talk) 23:17, 17 July 2020 (UTC)

Math export

  • Currently, on the various wikis, it is possible to activate MathML to get a nice rendering of mathematical formulas instead of vector images. The current export process does not support this MathML format and includes all mathematical formulas as images, like the old MediaWiki way. Since MathML is now widely supported, it would be really useful to be able to export the code with MathML. — Alan Talk 13:16, 28 May 2020 (UTC)
@Nalou: Thanks so much for this information! As a first question, can you let us know a bit more about how you activate and use MathML (with an example, preferably)? If we understand correctly, you are writing about the inability to use math markup in Wikisource. For this reason, users need to employ tactics that aren’t ideal, such as capturing an image of a formula with the crop tool. Is that correct? Thanks! --IFried (WMF) (talk) 22:19, 10 July 2020 (UTC)
@IFried (WMF): Thanks for following up on my remark. To activate MathML, I simply checked the dedicated button in the Appearance tab in the preferences (at the bottom of the page). It gives a nice MathML rendering in pages. One example can be found here. There are LaTeX formulas embedded in math tags in the wikicode. We can have math markup on the Wikisource website; this is an example of it, and it works very nicely in a web browser. But if I want to export a PDF version of the book, the exported document uses images for the math formulas. If you try the export in htmlz format, you'll see in the HTML code that the formulas are included as images. To reformulate what I said, I would like to have proper math formulas, like in a LaTeX document. The wikicode exists. I do not know how it is treated by wsexport, but the math tags are exported as images, whereas there are ways to handle the LaTeX formulas directly. Your last sentence is partly correct: I tried to automate the modification of the HTML export of the book to replace the images with the LaTeX formulas, but it was quite complicated, so I stopped. As a test, I suggest that you look at the export of the book I mentioned (direct link to wsexport here). If you look at page 6, you will see that the math symbols are not in the same font as the main text and that they cannot be selected. This comes from the pre-rendering of math formulas by the MediaWiki engine and their inclusion in the document. This system was designed to give the best cross-platform accessibility of math in Wikipedia, but it is not well suited to exporting documents today. I am convinced that a better solution is possible. Scientific books are a huge part of our culture. It would be very nice to be able to produce modern versions of old and inaccessible books in PDF or EPUB. I tried to be a bit more exhaustive in the description. Feel free to ask me more if it is not clear (it is quite hard to describe everything using text). — Alan Talk 13:03, 11 July 2020 (UTC)
@Nalou: Thank you so much for your detailed response! From my understanding, the issue is that mathematical formulas are sometimes expressed as images rather than text, which limits what people can do in terms of reading, sharing, and analyzing the information. Are we correct in this analysis? If so, we have documented this issue on Phabricator, and we’ll see if we can do anything to fix it. If this isn’t the issue, we would love to hear more details so we can understand. Thanks! --IFried (WMF) (talk) 23:15, 17 July 2020 (UTC)

Wrong date order for exports in French

  • For book exports from French Wikisource, the date order is inverted (month/day/year), as is the rule in English. For example, today: "Exporté de Wikisource le 05/28/20". It is strange to have the comment "Exporté de…" in French with the date in the English format: in French the date order is day/month/year, so today is 28/05/20 (and not 05/28/20). Thanks, Laurent --Lorlam (talk) 17:48, 28 May 2020 (UTC)

=> Ok now, the problem has been fixed. --Lorlam (talk) 00:39, 25 June 2020 (UTC)
@Lorlam: Thanks for reporting this issue! As you wrote, the issue appears to be fixed in some cases. However, we still see this issue arising in other cases, such as in Tamil exports, so we’ll look into this. One possible solution may be to display the name of the month rather than the number. Thanks! --IFried (WMF) (talk) 22:24, 10 July 2020 (UTC)
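The two options discussed in this thread (a per-language numeric order, or spelling out the month name) can be sketched as follows. This is an illustrative Python sketch only, not how WSExport formats its dates: the pattern table, the month list, and both function names are assumptions made for the example.

```python
from datetime import date

# Hypothetical per-language numeric patterns; anything not listed falls
# back to the unambiguous ISO 8601 order.
DATE_PATTERNS = {
    "fr": "%d/%m/%y",  # day/month/year
    "en": "%m/%d/%y",  # month/day/year
}

# Hypothetical month-name table for French.
MONTH_NAMES_FR = [
    "janvier", "février", "mars", "avril", "mai", "juin", "juillet",
    "août", "septembre", "octobre", "novembre", "décembre",
]


def numeric_export_date(day: date, lang: str) -> str:
    """Format the export date with the numeric order of the given language."""
    return day.strftime(DATE_PATTERNS.get(lang, "%Y-%m-%d"))


def named_export_date(day: date, month_names: list) -> str:
    """Format the export date with the month spelled out, avoiding ambiguity."""
    return "%d %s %d" % (day.day, month_names[day.month - 1], day.year)
```

With the date from the original report, numeric_export_date(date(2020, 5, 28), "fr") gives 28/05/20, and named_export_date(date(2020, 5, 28), MONTH_NAMES_FR) gives 28 mai 2020.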

Bad export in "pdf" for French civility titles

  • The "pdf" export tool does not correctly export the French civility titles that we use on French Wikisource (templates in curly braces: "M." for Monsieur / "MM." for Messieurs / "Mlle" for Mademoiselle / "Mme" for Madame / "Mmes" for Mesdames / etc.). The "pdf" export shows the characters with a dotted underline, which is not very readable (for example, in the cast list of the play). Thanks, Laurent --Lorlam (talk) 18:16, 28 May 2020 (UTC)

  • To add a clue for this problem: someone on French Wikisource said it is a problem with the {{abréviation}} template (see here), and with all other templates that use it. All these "civility title" templates are described here: ... Thanks --Lorlam (talk) 21:00, 28 May 2020 (UTC)
=> This problem has been fixed by modifying the template on French Wikisource, so it is okay now ;-) --Lorlam (talk) 21:10, 31 May 2020 (UTC)
@Lorlam: Thanks for reporting this! It appears that this issue has been fixed, as you have written. However, if this issue arises again, please do let us know. Thank you! --IFried (WMF) (talk) 22:25, 10 July 2020 (UTC)

e-book navigation

The way Table of Contents is translated into e-book navigation (I mean e-book reader navigation, not ToC that would be printed) is very limited. It would be beneficial if there was a way to allow editors to change the structure of e-book navigation to align better with the book structure (probably by some ToC tags). --EBookian (talk) 20:58, 29 May 2020 (UTC)

@EBookian: Thanks for the information! Can you provide more details on this problem (perhaps a specific example of where you are seeing this problem)? This will help us understand the problem better. Much appreciated! --IFried (WMF) (talk) 22:26, 10 July 2020 (UTC)

Long chapters and footnotes

I have observed on several occasions a problem with footnotes in books with no chapters or with chapters exceeding 80 or 100 pages. The wsexport epub tool arbitrarily splits the chapter and breaks the links to the footnotes, forcing you to split the chapter artificially to get around the problem. See the example in Histoire de l'affaire Dreyfus T.2. --Cunegonde1 (talk) 03:36, 30 May 2020 (UTC)

@Cunegonde1: Thank you for this information! We have conducted some basic tests on EPUB and PDF to try to reproduce the splitting and link problems. However, we were unable to reproduce the issues. The footnotes appeared to properly display at the end of the chapter with linking functionality. Perhaps you can share a screenshot and more details that demonstrate the issue? This will help us understand the problem better and see if it is something we can fix. Thanks! --IFried (WMF) (talk) 22:28, 10 July 2020 (UTC)
@IFried: You can see the issue in this book: Sade - histoire_de_Juliette. If you create the epub and open it in an editor, you can see that the first footnote call is located in the chapter file c1_L_histoire_de_Juliette_premiere_partie.xhtml, page 62, while the text of the footnote is located in a chapter file called c1_L_histoire_de_Juliette_premiere_partie_2.xhtml. The link between the footnote call and the footnote text is <a xmlns:epub="" href="#cite_note-1" epub:type="noteref">[1]</a>, which does not point to the chapter file that contains the footnote text. Excuse my poor English. Summary (translated from French): the link between the footnote call and the note itself does not work; the footnote call is in one section of the epub and the note itself in another, with no link pointing to that section. --Cunegonde1 (talk) 06:53, 11 July 2020 (UTC)
@Cunegonde1: Thank you for your response! Are you saying that the footnote link (for example, “1” on page 62) does not actually redirect the user to the appropriate footnote section when they click on it? If so, we are able to reproduce the issue & we have documented this behavior in a Phabricator ticket. If this is something else, maybe you can provide a screenshot or more details? Thanks. --IFried (WMF) (talk) 23:13, 17 July 2020 (UTC)
@IFried (WMF): Thanks, you describe exactly the issue.--Cunegonde1 (talk) 05:46, 18 July 2020 (UTC)
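
One way to picture a fix for the behaviour described above: when a chapter is split into several XHTML files, a fragment-only link such as href="#cite_note-1" needs the name of the file that actually contains the target. The following is a minimal sketch under that assumption, not WSExport's actual code; `fix_cross_file_links` is a hypothetical name, and the regular expressions are a simplification (a real implementation would use an XML parser).

```python
import re

ID_RE = re.compile(r'\bid="([^"]+)"')       # simplified: any attribute ending in id="..."
HREF_RE = re.compile(r'href="#([^"]+)"')    # fragment-only links


def fix_cross_file_links(files: dict[str, str]) -> dict[str, str]:
    """Prefix fragment-only links with the file that defines the target id."""
    ids_by_file = {name: set(ID_RE.findall(html)) for name, html in files.items()}

    fixed = {}
    for name, html in files.items():
        def repl(match, current=name):
            fragment = match.group(1)
            if fragment in ids_by_file[current]:
                return match.group(0)  # target is in the same file: keep as-is
            for other, ids in ids_by_file.items():
                if fragment in ids:
                    return f'href="{other}#{fragment}"'
            return match.group(0)      # target not found anywhere: leave untouched
        fixed[name] = HREF_RE.sub(repl, html)
    return fixed
```

Links whose target lives in the same file are left alone, so only the references broken by the split are rewritten.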

Prevent page breaks after headings

I would also like to draw your attention to some inappropriate page breaks, typically between a section heading and the text of the section, when the section does not begin on a new page. For example, in the epub exported from this work by Gauss, whose chapters are themselves divided into short numbered articles (as you can see here), the article number and the beginning of the article are frequently separated by a page break. On French Wikisource, some templates, namely {{t2}} to {{t6}}, are designed specifically to set the style and hierarchy of headings (based on the HTML tags h2 to h6). Could these templates or the export tool be modified to prevent a page break between such a heading and the start of the section, whatever the number of carriage returns that follow the heading in the code? --ElioPrrl (talk) 15:27, 30 May 2020 (UTC)

Did you consider using these tags?
  • <div style="page-break-inside: avoid;"> <!-- Beginning of the block: avoid page breaks -->
  • Your text-block…
  • </div> <!-- End of the block: avoid page breaks -->
This could be encapsulated in a template. --Denis Gagne52 (talk) 16:27, 27 June 2020 (UTC)
@ElioPrrl: Thanks for reporting this issue! We have tested this issue, and we were able to reproduce it. We also see that the tags described above by Denis Gagne52 could fix this issue. Can you let us know whether the issue is indeed fixed by the tags? Thanks! --IFried (WMF) (talk) 22:31, 10 July 2020 (UTC)
@Denis Gagne52 and IFried (WMF): These tags cannot fit the bill, since I want to avoid page breaks not inside a paragraph but after a paragraph (most of the time, after a title tag). There is also a property called page-break-after, but I have not succeeded in fixing this problem with it (I'm new to CSS, having been learning it for six months, and maybe this explains my difficulties). Looking at the exported code, I saw that the title tags h1, h2, etc. are often followed by one or several blank lines <p><br/></p>, where the page can be broken; I think this is why page-break-after is ineffective. And as we cannot predict how many blank lines will follow a title tag, I have no idea how to prevent these page breaks, whatever the number of carriage returns following the title. --ElioPrrl (talk) 15:14, 17 July 2020 (UTC)

@ElioPrrl: Your title must be enclosed in the div together with the following paragraph, or part of it. The page break will then happen before or after the div.

<div style="page-break-inside: avoid;">
Your title block
The following paragraph
</div>
--Denis Gagne52 (talk) 01:32, 19 July 2020 (UTC)
@Denis Gagne52: I do understand; but then there can't be any page break in the following paragraph either... And if the paragraph consists of more than three or five lines (even more so if the paragraph is longer than a page), this solution is much too coarse. By the way, it would be far more comfortable if the solution were implemented either in MediaWiki or in our templates t2, ..., t6. --ElioPrrl (talk) 08:53, 19 July 2020 (UTC)
@Denis Gagne52 and ElioPrrl: Thank you for this explanation and feedback! This issue is focused on problems related to proofreading, if we understand correctly. The current project focuses on improving ebook exports, rather than proofreading, so this issue is not in the scope of this project. However, we hope to improve the experience of proofreading by sharing documentation of best practices for all Wikisource users (the research is in progress, and we’ll share our findings in the future). I hope this can be of help, and thank you again! --IFried (WMF) (talk) 23:31, 24 July 2020 (UTC)
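
The obstacle ElioPrrl describes (blank <p><br/></p> paragraphs between a heading and its text, which give the reader a place to break the page) could in principle be handled by post-processing the exported XHTML. The following is only a sketch under that assumption, not something WSExport or the {{t2}} to {{t6}} templates are known to do; `drop_blank_paragraphs_after_headings` and `HEADING_CSS` are hypothetical names, and e-reader support for page-break-after varies.

```python
import re

# CSS rule that asks the renderer not to break the page right after a heading.
# Assumption: the target e-reader honours this property.
HEADING_CSS = "h1, h2, h3, h4, h5, h6 { page-break-after: avoid; }"

# A closing heading tag followed by one or more blank <p><br/></p> paragraphs.
BLANK_AFTER_HEADING = re.compile(
    r'(</h[1-6]>)(?:\s*<p>\s*<br\s*/?>\s*</p>)+',
    re.IGNORECASE,
)


def drop_blank_paragraphs_after_headings(xhtml: str) -> str:
    """Remove blank paragraphs that directly follow a heading."""
    return BLANK_AFTER_HEADING.sub(r'\1', xhtml)
```

With the blank paragraphs gone, the heading is immediately followed by real text, so a rule like `HEADING_CSS` has a chance to keep the two together regardless of how many carriage returns followed the heading in the wikitext.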


Initials built with the lettrine model in the French Wikisource are not properly displayed in the ePub exported files (see e.g. this play). It's a bit better in the PDF exports (the font size is a bit larger than the text, although it does not seem to adapt to the number of lines given in the « lignes= » parameter of the model). --George2etexte (talk) 14:13, 2 June 2020 (UTC)

@George2etexte: Thank you for letting us know about this! We have conducted some basic tests. In our tests, we found that the lettrine was represented better in EPUB than PDF, but we understand that there may be different experiences on different devices. We may not have capacity to fix this issue, but we have noted it. Is there a Phabricator ticket for documentation purposes? If not, would you like to create one and tag us? Thanks! --IFried (WMF) (talk) 22:32, 10 July 2020 (UTC)

Cropped image handling

Telugu wikipedia wikisource extensively uses cropped scans to represent images or figures in text, as in the example page. The current wsexport handles this well, and we would like this functionality to be retained in the future. -- 04:50, 3 July 2020 (UTC)

@ Thank you! We are happy to hear that WSExport handles cropped images well on Telugu Wikisource (we assume you meant Wikisource?). However, we also know that there are cropped image issues experienced by other users, so we’ll see if there is something that we can do to improve this issue. Thanks again for commenting. --IFried (WMF) (talk) 22:07, 10 July 2020 (UTC)
Thanks IFried (WMF) for your response. --Arjunaraoc (talk) 16:52, 4 September 2020 (UTC)

Early findings: Request for feedback (August 2020)

Pinging everyone who previously commented on this page (and apologies if I missed anyone!).

@Balajijagadesh, Sushant savla, M0tty, Sannita, Dovi, Nahum, Nalou, Lorlam, MartinPoulter, Consulnico, Viticulum, Inductiveload, JAn Dudík, Nemo bis, EBookian, Cunegonde1, ElioPrrl, George2etexte, and Denis Gagne52:

Hello, everyone! We have just posted an August update for the ebook export improvement project, which shares our findings related to the project so far. We invite you to read our analysis and share your feedback below. We deeply appreciate your feedback, which will help us determine next steps for the project. Thank you in advance!

What are your general thoughts about the guiding principles that we have learned from the consultation so far (i.e., “Lessons from the consultation”)? Is there anything that you think we should add or change?

I am very happy to see all the recognition given to the export tool and its importance. Thank you to all the team for all the great work. The visitor experience will always remain “my priority”. (I know, I am repeating myself…). I understand the limits on resource availability, but we are surely heading in the right direction.
Sharing good practices is a great idea. --Viticulum (talk) 16:10, 21 August 2020 (UTC)
Viticulum Thank you so much for all of your feedback! It makes us really happy to know that you think we are going in the right direction. We agree that user experience is very important, and we also want to improve it. For this reason, we will be sharing an update in September about proposed UX improvements. In the meantime, we’ll continue investigating and looking into how the WSExport tool, font rendering, and other core issues can be improved, as well. Thank you again! --IFried (WMF) (talk) 21:32, 28 August 2020 (UTC)

Is there anything you would like to share about the work we have done so far (i.e., VPS work, Calibre upgrade, various investigations, and the consolidation of tickets)? We’re open to any thoughts or suggestions!

What do you think of the proposal to investigate cache generated ebooks? Would this be useful and high-priority, in your view? Do you have any concerns?

I think this would make server resource utilization much more efficient. This is also complementary to the next section (request queue optimizing; see also my comment there) and (IMO) can be developed together. However, we should have an option to skip or clear the cache for a specific request, especially as an e-book test feature for Wikisource editors. Ankry (talk) 22:08, 14 August 2020 (UTC)
@Ankry: Thank you for sharing this feedback! It’s great that you think this could be a useful improvement. In regard to the ability to skip the cache, is the main reason that the user may want to see a new version (rather than the cached version) of the book? We think this is a good reason; we just want to confirm that we understand your thought process correctly. Thank you and we look forward to your response! --IFried (WMF) (talk) 21:35, 28 August 2020 (UTC)
@IFried (WMF): I mean the case when Wikisource editors make some fixes and want to see the result (whether their fixes work or not). This is a special case, so definitely not for default behaviour. Ankry (talk) 13:11, 29 August 2020 (UTC)
I understand the technical point of view, and the usefulness for speed reasons. My concern is that books continue to be validated and corrected even once they are published. (The first phase is correcting until all pages are yellow; then sometimes another user validates them to green.) Since WSExport “cannot” know this, we may not be downloading the latest version. --Viticulum (talk) 16:14, 21 August 2020 (UTC)
@Viticulum: Yes, this is a great point, and we touched upon this in the discussion above with Ankry. One possible solution may be to allow editors to skip the cache, if they want, so it’s a choice rather than a requirement. I will communicate this point to the engineers, and we’ll see if we can come up with a solution that takes into account this concern. Much appreciated for bringing it up! --IFried (WMF) (talk) 21:36, 28 August 2020 (UTC)
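
The trade-off discussed above (serve a cached ebook for speed, but let editors bypass the cache to check their fixes) can be sketched as follows. This is a hedged illustration, not the design the team chose: `EbookCache`, its `skip_cache` flag, and the one-hour default expiry are all hypothetical.

```python
import time
from typing import Callable


class EbookCache:
    """In-memory cache of generated ebooks with an explicit bypass option."""

    def __init__(self, generate: Callable[[str, str], bytes], ttl_seconds: float = 3600.0):
        self._generate = generate   # hypothetical function that builds the ebook
        self._ttl = ttl_seconds     # cached entries older than this are regenerated
        self._entries: dict[tuple[str, str], tuple[float, bytes]] = {}

    def get(self, title: str, fmt: str, skip_cache: bool = False) -> bytes:
        key = (title, fmt)
        entry = self._entries.get(key)
        if not skip_cache and entry is not None:
            created, data = entry
            if time.monotonic() - created < self._ttl:
                return data  # fresh enough: serve the cached file
        # Cache miss, expired entry, or explicit bypass: regenerate and store.
        data = self._generate(title, fmt)
        self._entries[key] = (time.monotonic(), data)
        return data
```

The default path serves visitors quickly, the expiry limits how stale a corrected or newly validated book can be, and `skip_cache=True` covers the special case Ankry describes, while also refreshing the cached copy for everyone else.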

What do you think of the proposal to investigate job queue for more efficient ebook generation? Would this be useful and high-priority, in your view? Do you have any concerns?

I think that efficient request handling is important, as it would shorten the time that clients wait for the requested e-book. It is IMO more likely that the same e-book is requested multiple times in a short period (e.g. because it has been announced as a newly completed work, or because people are sharing information about the book) than that completely random e-books are requested. The advantage would be twofold: higher end-user satisfaction due to receiving the e-book faster, and lower server load due to merging multiple requests into a single e-book generation process.
It would also be nice if users immediately received information about the status of the e-book generation process. My tests on an external server suggest that many users who do not get any result in 10-15s try to request the e-book again and again. E-book generation time is usually longer than 15s. Ankry (talk) 21:52, 14 August 2020 (UTC)

I agree with Ankry. But then, when a book is announced, I think this is also the period when the most corrections are being made, for example because another user decides to start validating the book. It is not easy to find a compromise, but speed is important. --Viticulum (talk) 16:20, 21 August 2020 (UTC)
@Ankry: Thank you so much for sharing this! We are very happy that you think the job queue work would help address issues related to reliability. Your detailed explanation was helpful, as well. Regarding the export status: We agree that it is currently confusing to users who may not know the status of the export. We would like to address this, as well, if possible. I’ll talk with the team about how we can let users know the status of the export in a more intuitive and accessible way. --IFried (WMF) (talk) 21:38, 28 August 2020 (UTC)
@Viticulum: Yes, this will be a recurring theme in this project (i.e., balancing the need for speed with the need to have the latest version of the book). This is something that we will be mindful of as a team and consider high priority in terms of how we think about the user experience. For this reason, I’ll be sharing this topic with the engineers, and we’ll discuss what can be done. Once we have a better idea, I can share an update on the project page. --IFried (WMF) (talk) 21:42, 28 August 2020 (UTC)
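
Ankry's two points (merging identical requests into a single generation job, and telling users what the status is) can be illustrated with a small sketch. It is an assumption-laden example, not the job queue the team built: `ExportQueue`, `request`, and `status` are hypothetical names, and the "pending"/"idle" labels are made up for illustration.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable


class ExportQueue:
    """Runs ebook generation jobs and merges duplicate in-flight requests."""

    def __init__(self, generate: Callable[[str, str], bytes], max_workers: int = 2):
        self._generate = generate  # hypothetical function that builds the ebook
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        # Re-entrant lock: a done-callback may run immediately in the calling thread.
        self._lock = threading.RLock()
        self._in_flight: dict[tuple[str, str], Future] = {}

    def request(self, title: str, fmt: str) -> Future:
        key = (title, fmt)
        with self._lock:
            future = self._in_flight.get(key)
            if future is None:
                # First request for this book and format: start one job.
                future = self._pool.submit(self._generate, title, fmt)
                self._in_flight[key] = future
                future.add_done_callback(lambda _f, k=key: self._forget(k))
            # Later identical requests share the same job and its result.
            return future

    def status(self, title: str, fmt: str) -> str:
        with self._lock:
            return "pending" if (title, fmt) in self._in_flight else "idle"

    def _forget(self, key: tuple[str, str]) -> None:
        with self._lock:
            self._in_flight.pop(key, None)
```

A user who retries after 10 to 15 seconds would then attach to the job already in progress instead of starting a new one, and `status` gives the interface something to display in the meantime.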

What do you think of the proposal to investigate how to prevent incomplete book downloads? Would this be useful and high-priority, in your view? Do you have any concerns?

Yes this should be looked into. It gives a bad opinion to external visitors. --Viticulum (talk) 16:21, 21 August 2020 (UTC)
Great! We appreciate the feedback. --IFried (WMF) (talk) 21:43, 28 August 2020 (UTC)

What do you think of the proposal to switch to a new system of fonts? Would this be useful and high-priority, in your view? Do you have any concerns?

What work or investigations would you like to see that is *not* being addressed or is being addressed in a different way than you would expect? In other words, what do you think we’re overlooking, if anything?

I think you should investigate the "TOC tree" (I don't know what it's called) of the generated ePub. Especially for larger books, it's very useful to have sections and subsections, and sometimes even deeper levels. We should encourage the use of semantic tags, such as h1, h2, etc., for that purpose. I think the French Wikisource already does that, but I don't know how that translates into the epub. Regards, Ignacio Rodríguez (talk) 01:48, 14 August 2020 (UTC)

This is an improvement that I would appreciate. A single-level table of contents is not suitable for many types of books. It could be easier, in my opinion, to specify the position from the index rather than with the h2, h3, ... tags. At present, the only way to get this result is to split the index between the main page and the sub-pages, an arduous process that should be simplified. --Denis Gagne52 (talk) 23:47, 21 August 2020 (UTC)

@Ignacio Rodríguez and Denis Gagne52: Thank you so much for sharing this information! You both wrote about difficulties related to the Table of Contents, and specifically about the fact that WSExport does not always download complete books that have sub-sub pages. For this reason, it may be useful for the team to analyze how we can better support books that have sub-sub pages in the ebook export process. We have two follow-up questions:
  1. Do you have specific examples of the problem related to Table of Contents to share? We would like to examine any specific examples you can provide.
  2. Do you have any ideas of the best way to solve it? We noticed that your comments express differing views on whether we should use h1, h2… tags. Can you share more on your opinion on the tags, among other solutions?
We are very curious about any additional information that can be provided. Thank you in advance! --IFried (WMF) (talk) 21:48, 28 August 2020 (UTC)
Apologies in advance if I don't make myself clear, as English isn't my first language. I haven't personally experienced problems with downloading only partial books. I am referring to the resulting TOC that my ebook reader gets. Normally it takes the elements directly from the "index page" (.ws-summary div?). But sometimes the structure inside of that is lost. I am suggesting that if you specify the structure with "h tags", that would take precedence and the TOC could be built from there. Take for example this book I proofread in 2017. I specified h2, h3, and h4 sections (referring to the Book, Chapter and Section levels of the original book). But when I download the ePub, the resulting TOC only has the links I provided (linking to the Chapter [h3] levels), and there's no clean way that I know of to make a TOC that respects the section-level links.
The other option I can think of is to try to establish a format directly from the Index, as Denis suggested, but I think that would be harder, as every project has its own index formatting templates. --Ignacio Rodríguez (talk) 02:32, 29 August 2020 (UTC)
@Ignacio Rodríguez: Thank you for your response! If I understand correctly, the problem in the example you provided is that only chapters are included in the TOC, but the book names (such as Libro Primero) and sub-titles for chapters (such as “Resumen de la…”) are not included in the TOC. Is that correct? --IFried (WMF) (talk) 14:42, 10 September 2020 (UTC)
@IFried (WMF): That's it --Ignacio Rodríguez (talk) 14:58, 10 September 2020 (UTC)
@Ignacio Rodríguez and IFried (WMF): We both share the same goal. The means do not matter as long as the result is achieved. As we cannot put aside the current method which supports all that is in inventory, my proposal is to add a notion of hierarchy to the ws-summary which would be independent of the division into pages. Currently we have to modify links in the main page and repeat them in subpages so that they appear second in the TOC. Here’s an example from the book I am working on to show the difficulty of producing a two-level TOC with ws-export. --Denis Gagne52 (talk) 23:58, 30 August 2020 (UTC)
@Denis Gagne52: Hello, and thank you for your response! We have tested the example you provided, and we were able to see the chapters properly displayed (and linking to the relevant content) for both PDF and EPUB. Are you still experiencing this issue -- and, if you are, can you provide some more details or a screenshot example? Sorry for the inconvenience; we just want to ensure that we are getting all the information we can about bugs, and we unfortunately cannot reproduce this one. For this reason, any further information would be appreciated. Thank you in advance! --IFried (WMF) (talk) 14:43, 10 September 2020 (UTC)
@IFried (WMF): This example was provided not for you to see the chapters properly displayed but to show the complexity to build a multi-level TOC. If it was user-friendly we would find many of these in Wikisource. --Denis Gagne52 (talk) 21:10, 10 September 2020 (UTC)
@Denis Gagne52: Thank you for clarifying the issue you were describing. From our understanding, you are talking about the complexity of building a multi-level Table of Contents. Ideally, this process should be easier for people to do. While this is outside the scope of WSExport work, and it sounds like a different wish, it is good for us to know about. Perhaps it can be approached as a new wish in a future wishlist or a volunteer developer can take it on. Thank you again! --IFried (WMF) (talk) 14:54, 15 October 2020 (UTC)
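
Ignacio Rodríguez's suggestion (letting heading tags define the e-book navigation hierarchy) can be sketched as below. This is purely illustrative and is not how WSExport builds its table of contents, which according to the discussion comes from the index page; `TocEntry` and `build_toc` are hypothetical names, and the regular expression stands in for a proper XHTML parser.

```python
import re
from dataclasses import dataclass, field

HEADING_RE = re.compile(r'<h([1-6])[^>]*>(.*?)</h\1>', re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r'<[^>]+>')  # used to strip inline markup from titles


@dataclass
class TocEntry:
    level: int
    title: str
    children: list["TocEntry"] = field(default_factory=list)


def build_toc(xhtml: str) -> list[TocEntry]:
    """Nest headings by level: an h3 becomes a child of the preceding h2, etc."""
    root = TocEntry(level=0, title="")
    stack = [root]
    for match in HEADING_RE.finditer(xhtml):
        entry = TocEntry(int(match.group(1)), TAG_RE.sub("", match.group(2)).strip())
        # Climb back up until the top of the stack is a shallower heading.
        while stack[-1].level >= entry.level:
            stack.pop()
        stack[-1].children.append(entry)
        stack.append(entry)
    return root.children
```

For a book marked up with h2 for books, h3 for chapters and h4 for sections, this yields a three-level tree in which the book names and the chapter sub-titles appear alongside the chapters.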

Anything else you would like to add?

Hi! I'm a little surprised, because the August status update starts with some lessons that the tech team has learned: 1. Keeping in mind both contributors and visitors, and 2. Thinking about user experience improvements rather than technical improvements. These are two excellent points. But in my opinion, none of the suggested improvements below follows these two principles: the cache for generated e-books is a technicality that is only about performance and will provide just a very marginal improvement. The font change seems to be a detail to a non-tech guy like me, etc. As important as they may be, it seems to me that the impact of these improvements on the user experience will be minor. They are still pertinent for improving WSExport overall, though. Good job. Greetings. --M0tty (talk) 19:14, 21 August 2020 (UTC)

@M0tty: Hello! Apologies, I should have clarified that we'll be sharing our proposal to improve the user & reader experience in the next status update (most likely, in September). We are definitely planning to address general user experience issues, which are high priority for us. We're just still conducting research on that front, so we're not quite ready to share our findings yet. I'll update the August update to make that more clear. Also, thanks for your other comments. We agree that improving the reliability of the WSExport tool is very important, and we hope to make a meaningful difference through our work. Thanks again for your comment! --IFried (WMF) (talk) 19:54, 21 August 2020 (UTC)
Hi @IFried (WMF):! Thx for the clarification. Cheers! --M0tty (talk) 22:45, 21 August 2020 (UTC)

Portal site

Hey, I would like to use this chance to recommend the creation of a portal where readers can browse, read, and download all available ebooks in all languages. As far as I know, most metadata is now stored in Wikidata (including genre, and so on), so it should be pretty straightforward to build a simple browse-and-download site for readers, with the option to go to Wikisource to edit the source. On a separate note, I appreciate that the Foundation is spending some time on Wikisource. The mobile version doesn't work that well, and mobile seems to be becoming a standard these days. --LibraryFighter (talk) 13:35, 5 September 2020 (UTC)

@LibraryFighter: Thank you so much for your feedback! You provided a very interesting and exciting vision for a future Wikisource experience. Unfortunately, this would be a whole different project (since it doesn’t directly deal with improving WSExport and the ebook export experience). Also, this project idea could be quite large, due to the fact that it would be for so many different communities. However, we encourage you to continue exploring this idea, and perhaps a team can explore it in the future (especially if it became smaller in scope). You may also consider proposing a new project inspired by this idea. Finally, we thank you for your kind words about this initiative that the team is taking on. We’ll also look into the mobile experience and see if we can improve it. Much appreciated! --IFried (WMF) (talk) 14:55, 15 October 2020 (UTC)