Community Wishlist Survey 2020/Wikisource

Community Wishlist Survey 2020

Wikisource
35 proposals, 47 contributors

The proposal phase has ended.
Come back on November 20 to vote on proposals!

Vertical display for classical Chinese content

  • Problem: Most content on Chinese Wikisource is classical Chinese, which has been printed or written vertically for thousands of years.
  • Who would benefit: Chinese and Japanese Wikisource, and other Wikimedia projects in languages that use vertical display (like Manchu).
  • Proposed solution: Add vertical display support to the Wikimedia software. To the proposer's knowledge, Wikimedia already supports right-to-left display for Arabic and Hebrew.

    A switch button on each page and a "force" setting in Special:Preferences should be added to allow readers to switch the display mode between traditional vertical text 傳統直寫 and modern horizontal text 新式橫寫. A magic word should be added that allows a page to set its own default display mode (a rough sketch of the switch follows below).

    A hypothetical vertical Chinese Wikisource is shown below. (In this picture, some characters are rotated, but they should not be.)

    [Image: Hypothetical vertical Chinese Wikisource.png]

  • More comments:
  • Phabricator tickets:
  • Proposer: 維基小霸王 (talk) 13:59, 1 November 2019 (UTC)
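
A minimal sketch of what the proposed display switch could look like as a gadget, assuming the text lives in the standard #mw-content-text container; the real feature would also need core support for the proposed magic word and preference:

```js
// Hedged sketch: toggle traditional vertical layout with one CSS property.
mw.loader.using( 'mediawiki.util' ).then( function () {
	var content = document.getElementById( 'mw-content-text' );
	var vertical = false;

	// writing-mode: vertical-rl lays text out top-to-bottom in
	// right-to-left columns, the traditional Chinese arrangement.
	function setVertical( on ) {
		content.style.writingMode = on ? 'vertical-rl' : 'horizontal-tb';
	}

	var link = mw.util.addPortletLink( 'p-views', '#', '直寫/橫寫',
		'ca-vertical-toggle', 'Switch between vertical and horizontal display' );
	link.addEventListener( 'click', function ( e ) {
		e.preventDefault();
		vertical = !vertical;
		setVertical( vertical );
	} );
} );
```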

Discussion

Improve the PDF/book reader

  • Problem: When we view a scanned book in PDF or DjVu format on Commons or Wikisource, it is always a single-page view. Every time we go to the next page, we need to click on the drop-down menu of pages. See this book on Commons for example. For book readers, it is more like viewing images than reading books. This creates difficulty not only in reading, but also in identifying missing or duplicated pages when the file needs to be corrected.
  • Who would benefit: Commons & Wikisource editors and readers
  • Proposed solution: Implement the open-source Internet Archive BookReader on all wikis, especially Commons and Wikisource (an embedding sketch follows below). For the same book on the Internet Archive, see the difference. It has features like single-page view, double-page view, thumbnail view, zoom and a wide variety of others.
  • More comments: Proposed in Community Wishlist Survey 2016 and 2019
  • Phabricator tickets: phab:T154100
  • Proposer: as before in 2016 and 2019. Bodhisattwa (talk) 12:36, 9 November 2019 (UTC)
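
As an illustration, embedding BookReader on a web page follows the pattern of its public demos; a minimal sketch with placeholder image URLs and sizes (a real integration would generate these from the wiki's thumbnail system):

```js
// Hedged sketch, following the BookReader demo pattern (circa v4).
// Requires BookReader's JS/CSS and a <div id="BookReader"></div> on the page.
var options = {
	el: '#BookReader',
	bookTitle: 'Example scanned book', // placeholder
	data: [
		// One inner array per spread; widths/heights in pixels.
		[ { width: 800, height: 1200, uri: 'https://example.org/page-001.jpg' } ],
		[
			{ width: 800, height: 1200, uri: 'https://example.org/page-002.jpg' },
			{ width: 800, height: 1200, uri: 'https://example.org/page-003.jpg' }
		]
	]
};
var br = new BookReader( options );
br.init();
```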

Discussion

Proofread extension enhancement

  • Problem: The Proofread extension is a great tool for proofreading, but the transcluded output has some problems:
    • The source text is stored in a different namespace (the Page namespace) from the main-namespace page, which may hold metadata for the text.
    • When somebody wants to edit this text, they find nothing relevant in the source, only links to one or many transcluded pages. There are some tools (default on some Wikisources) which help to find the correct page. These tools display the page number in the middle of the text (on some wikis at the edge), and in the source HTML there are invisible parts, sometimes in the middle of a word/sentence/paragraph. Split off into a new proposal.
    • Various hacks are needed in the source (Page namespace) to get correctly formatted output:
      • <nowiki/> is needed to mark the end of a paragraph at the end of a page; otherwise transclusion joins the two paragraphs together.
      • It is impossible to transclude a poem that spans several pages as one block; there must be marks for the beginning and end of the poem at every page break, so it appears to be several poems.
    • It is a problem to have simple links ([[/foo/]]) to other subpages from non-transcluded pages.
    • Documentation. It probably exists somewhere in English, but in other languages there is only basic information, and some users rely on many enhanced functions which are not in the documentation.
    Because of these problems some users (including me) prefer to put text directly in the main namespace.
  • Who would benefit: Wikisource editors, readers, users of TTS
  • Proposed solution: Solve these bugs, and make tools for better orientation in transcluded pages.
  • More comments:
  • Phabricator tickets:
  • Proposer: JAn Dudík (talk) 13:24, 25 October 2019 (UTC)

Discussion

  • JAn Dudík, thank you so much for this wish! We have discussed this as a team, and it's too large in its current form. Can you perhaps break this wish into multiple smaller wishes? For example, the second bullet point is a great example of an individual wish. This would make the proposal much more manageable for the team. Thank you! --IFried (WMF) (talk) 01:00, 11 November 2019 (UTC)
OK, I made my own proposal. JAn Dudík (talk) 16:33, 11 November 2019 (UTC)
Thank you, JAn Dudík! --IFried (WMF) (talk) 02:08, 12 November 2019 (UTC)

Tools to easily localize content deleted at Commons

  • Problem: When a book scan is deleted on Commons, it completely breaks indexes on Wikisource. Commons does not notify Wikisource when they delete files, nor do they make any attempt to localize the file instead of deleting it. Wikisource has no way of tracking indexes that have been broken by the Commons deletion process.
  • Who would benefit: Wikisource editors
  • Proposed solution: 1) Make it really easy to localize files, for example by fixing phab:T8071, and 2) Fix or replace the bot(s) that used to notify Wikisources of pending deletions of book scans used by Wikisource
  • More comments: A similar approach may also be helpful for Wikiquote and Wiktionary items that depend on Wikisource, when Wikisource content is moved or deleted.
  • Phabricator tickets: phab:T8071
  • Proposer: —Beleg Tâl (talk) 14:45, 4 November 2019 (UTC)

Discussion

  • A new Commons deletion bot was created in 2017. Create a Phabricator task with the tag "Community-Tech" to enable it on your wiki. Once you have done that, only bug T8071 remains.--Snaevar (talk) 18:29, 4 November 2019 (UTC)
  • A better way is probably FileImporter enabled on local wiki: phab:T214280. --Xover (talk) 11:52, 5 November 2019 (UTC)

Increase maximum file size

  • Problem: For now it is impossible to upload files bigger than 4 GiB. That is a problem especially for high-quality uncompressed book scans and for high-resolution or long video material. There are only two workarounds: compress the data, which is lossy in many cases, or split up the file, which makes the parts harder to link together.
  • Who would benefit: Everyone who wants to use high-quality book scans as one file, long video files or similarly big files.
  • Proposed solution:
  • More comments:
  • Phabricator tickets: task T191802
  • Proposer: GPSLeo (talk) 18:46, 21 October 2019 (UTC)

Discussion

  • The first letter after the project name should always be capitalized. I will fix that now. GeoffreyT2000 (talk) 19:01, 21 October 2019 (UTC)
  • yes, files will continue to get bigger, and hard limits will increasingly stop work. Slowking4 (talk) 18:57, 22 October 2019 (UTC)
  • One thing to consider for people implementing: it's not just the storing of large objects in Swift. It's also about being able to effectively serve large ranges of large files from Varnish (in the video case), and in the PDF/DjVu case it's also about scaling the whole "1 file but 20,000 associated thumbnails" system. It's been a while since I've looked at file storage, so it's possible my knowledge is outdated, but I just wanted to mention what I believe are the issues involved. Bawolff (talk) 02:34, 23 October 2019 (UTC)
    @Bawolff: "Thumbnails" for multi-page formats (PDF/DjVu/TIFF) could really use some TLC! There are performance issues that make the proofreading (transcribing) process slow, and lots of issues that lead to outdated thumbnails. General improvements in performance and reliability here could have a really big cumulative effect.
    On the performance side I suspect that 1) Wikisource essentially needs full size page images where the infrastructure is rigged for reduced-size thumbnails that can be generated on the fly, and 2) the sizes of "thumbnail" requested from ProofreadPage never hits a size that exists pre-generated (is there some kind of cache-hit analysis that could reveal this?). Making sure there are pre-generated images for each page and that ProofreadPage requests a size that exists should reduce this to just serving an image (which is reasonably fast today, and will benefit from any future general infrastructure improvements).
    On the reliability side, I think general armoring is in order, combined with an algorithm that behaves well in the context of multi-page media. Just guessing based on some of the cases I've seen: something pathological in the file trips up whatever component is generating the thumbnails, and because its assumption is that it's operating on "1 file : 1 image : multiple thumbnails", it falls down completely instead of recovering gracefully when faced with "1 file : many images : multiple thumbnails". If the image data on page 3 of a PDF is borked, the right failure mode is to give up on that page and move on to pp. 4–1024, but from the outside it looks like it gets stuck and gives up on the entire file.
    This isn't directly related to this Wishlist proposal (I'm just free-associating from your comment above) and I can't really see a way to put this that would fit in a proposal of its own ("Make images m0ar better!!!!"), so I'm mainly just putting it out there in the hopes it'll find a home somewhere. --Xover (talk) 10:31, 9 November 2019 (UTC)

Activate TemplateStyles via the Index page CSS field

  • Problem: The TemplateStyles extension is almost magic in the Wikisource environment, but there needs to be an easy way to activate it on all pages of an Index.
  • Who would benefit: all contributors
  • Proposed solution: Optionally allow the CSS field of an Index page to be filled with the name of a valid TemplateStyles page. A simple regex could be used to check whether the field contains raw CSS or a valid page name (a sketch follows below).
  • More comments: Presently it.wikisource and nap.wikisource are testing other tricks to load work-specific TemplateStyles into all pages of an Index, with very interesting results.
  • Phabricator tickets: phab:T226275, phab:T215165
  • Proposer: Alex brollo (talk) 07:24, 9 November 2019 (UTC)
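
A minimal sketch of the proposed check, assuming (hypothetically) the convention that a TemplateStyles page name ends in ".css" and never contains braces; the exact heuristic would need refining:

```js
// Hedged sketch: decide whether the Index CSS field holds a page name or raw CSS.
function parseIndexCssField( value ) {
	value = value.trim();
	// Looks like a page title such as "Index:Foo.djvu/styles.css"?
	if ( /^[^{};]+\.css$/.test( value ) ) {
		return { type: 'templatestyles-page', page: value };
	}
	// Otherwise keep treating the field as raw CSS, as today.
	return { type: 'raw-css', css: value };
}

console.log( parseIndexCssField( 'Index:Foo.djvu/styles.css' ) );
console.log( parseIndexCssField( '.poem { margin-left: 2em; }' ) );
```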

Discussion

  • Reproducing original books is inherently layout and formatting heavy, presenting books to readers is inherently layout and formatting heavy. Inline formatting templates are carrying a lot of weight right now, with somewhat severe limitations and very little semantics. Getting a good system for playing with the power of CSS would help a lot. --Xover (talk) 11:08, 9 November 2019 (UTC)

XTools Edit Counter for Wikisource

  • Problem: There are no Wikisource-specific statistics about per-user proofreading and validation; it is impossible to get statistics about proofreading tasks. The Wikisource workflow is different from Wikipedia's and cannot be covered by XTools, so we need a dedicated statistics tool for Wikisource.
  • Who would benefit: The whole Wikisource community.
  • Proposed solution: Make a Wikisource-specific statistics tool (a sketch of the raw data already available from the API follows below).
  • More comments:
  • Phabricator tickets: phab:T173012
  • Proposer: Jayantanth (talk) 16:01, 26 October 2019 (UTC)
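
For illustration, some raw material for such statistics can already be pulled from the standard API; a minimal sketch, assuming the Page: namespace is number 104 (it varies between wikis) and a hypothetical user name:

```js
// Hedged sketch: count a user's edits in the Page: namespace. Real
// proofreading statistics would also need the proofreading status
// (quality level) reached by each revision, which this does not show.
mw.loader.using( 'mediawiki.api' ).then( function () {
	new mw.Api().get( {
		action: 'query',
		list: 'usercontribs',
		ucuser: 'ExampleUser', // hypothetical user name
		ucnamespace: 104,      // Page: namespace on many Wikisources
		uclimit: 'max'
	} ).then( function ( data ) {
		console.log( data.query.usercontribs.length + ' Page: edits found' );
	} );
} );
```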

Discussion

Improve external links

  • Problem: On the Italian Wikisource it is a problem to make too many links to Wikidata on a single page. But such links are necessary to improve the use of Wikisource books outside our own platform, on tablets and PCs: the presence of links to Wikidata makes the books a much more useful hypertext.
  • Who would benefit: Every reader
  • Proposed solution: I am not technical, so I have only needs and not solutions ;-)
  • More comments:
  • Phabricator tickets:
  • Proposer: Susanna Giaccai (talk) 11:22, 4 November 2019 (UTC)

Discussion

@Giaccai: Can you give specific examples of pages where this is currently a problem ? —TheDJ (talkcontribs) 12:57, 4 November 2019 (UTC)
This is the only page with a Lua error: it:s:Ricordi di Parigi/Uno sguardo all’Esposizione. IMHO it's a Lua "not enough memory" issue, coming from exhausted Lua space: you can see "Lua memory usage: 50 MB/50 MB" in the NewPP limit report. --Alex brollo (talk) 14:47, 4 November 2019 (UTC)
Weren't the links to Wikidata to be used only in case of author's names? --Ruthven (msg) 18:45, 4 November 2019 (UTC)
No; presently there are tests to link other kinds of entities (e.g. locations) to Wikidata. Wikidata is used to find a link to a Wikisource page, or to a Wikipedia page, or to the Wikidata page when both are lacking (dealing with locations, usually the resulting link points to Wikipedia). --Alex brollo (talk) 07:17, 5 November 2019 (UTC)
I just investigated the error. The "not enough memory" issue is caused by s:it:Modulo:Wl and s:it:Module:Common. @Alex brollo: What is going on is that the full item serialization is loaded into Lua memory twice per link: once by local item = mw.wikibase.getEntity(qid) in s:it:Modulo:Wl and once by local item = mw.wikibase.getEntityObject() in s:it:Module:Common. You could probably avoid both of these calls by relying on the Wikibase Lua functions that are already used in s:it:Modulo:Wl, and so greatly limit the memory consumption of the module. Tpt (talk) 16:08, 12 November 2019 (UTC)

Transcluded book viewer with book pagination

[Image: Vis on it.wikisource]
  • Problem: When we view a transcluded (NS0) book, it has the normal look of a wiki page. Most book readers and lovers don't like this kind of view and navigation; they want a book-like, page-by-page or two-page view, like a physical book, instead of moving to the next subpage every time. Italian Wikisource has created a JavaScript tool for this: Vis, View In Sequence (a two-sided view of our pages).
  • Who would benefit: Wikisource editors and readers
  • Proposed solution: Make a viewer like Vis, View In Sequence (a two-sided view of our pages), the default.
  • More comments:
  • Phabricator tickets:
  • Proposer: Jayantanth (talk) 15:43, 11 November 2019 (UTC)

Discussion

Improve extraction of a text layer from PDFs

  • Problem: If a scan in PDF has an OCR layer (i.e. an original OCR layer, usually of high quality, which is part of many PDF scans provided by libraries, not the OCR text obtained by our OCR tools), the text is extracted from it very poorly in the Wikisource Page namespace. DjVu files do not suffer from this problem and their OCR layer is extracted well. If the PDF is converted into DjVu, the extraction of the text from its OCR layer improves too. (Example of OCR extraction from a PDF here: [1], example of the same from DjVu here: [2].) As most libraries, including the Internet Archive and HathiTrust, offer downloads of PDFs with OCR layers and not DjVus, we need to fix the text extraction from PDFs.
  • Who would benefit: All Wikisource contributors working with PDF scans downloaded from various major libraries (see above). Some contributors on Commons have expressed their concern that the DjVu file format is dying and attempted to deprecate it in favour of PDF. Although the attempt has not succeeded (this time), many people still prefer working with PDFs (because the DjVu format is difficult for them to work with, or they do not know how to convert PDF into DjVu or how to edit DjVu scans, and also because the DjVu format is not supported by Internet browsers...).
  • Proposed solution: Fix the extraction of text from existing OCR layers of scans in PDF.
  • More comments:
  • Phabricator tickets:
  • Proposer: Jan.Kamenicek (talk) 20:18, 24 October 2019 (UTC)

Discussion

There are also libraries where it is possible to download a bunch of pages (20–100) in PDF, but no DjVu at all, or only single pages.

There is also the possibility of an external Google OCR:

mw.loader.load('//wikisource.org/w/index.php?title=MediaWiki:GoogleOCR.js&action=raw&ctype=text/javascript');

, but it produces more OCR errors and sometimes lines get mixed up. JAn Dudík (talk) 12:13, 25 October 2019 (UTC)

Yes, exactly: the Google OCR is really poor (en.ws has it among its gadgets), but the original OCR layer which is part of most scans obtained from libraries is often really good; only MediaWiki fails to extract it correctly. If you download a PDF document e.g. from HathiTrust, it usually contains an OCR layer provided by the library (i.e. not obtained by one of our tools), and when you try to use this original OCR layer in the Wikisource Page namespace, you get very poor results. But if you take the same PDF document and convert it to DjVu prior to uploading it here, then you get amazingly better results when extracting the text from the original OCR layer in Wikisource, and you do not need any of our OCR tools. This means that the original OCR layer of the PDF is good; we are just not able to extract it correctly from the PDF for some reason, although we are able to extract it from DjVu. --Jan.Kamenicek (talk) 17:10, 25 October 2019 (UTC)
yeah - it is pretty bad when the text layer does not appear and the OCR buttons hang grayed out, but I can cut and paste text from the IA txt file. Clearly a failure to hand off a clean text layer. Slowking4 (talk) 02:34, 28 October 2019 (UTC)
  • @Jan.Kamenicek: It sounds like there are various problems with the extraction of some PDFs' text layers. Would indeed be great to fix! Is this to do with e.g. columns? Could you please add some examples to this proposal of PDFs (or pages of) that are failing to be extracted correctly? Thanks! —Sam Wilson 15:54, 12 November 2019 (UTC)
    @Samwilson: We can compare File:The Hussite Wars, by the Count Lützow.pdf with the File:The Hussite wars, by the Count Lützow.djvu. The PDF file was downloaded from HathiTrust, the djvu file was created by converting the pdf file into djvu using an online converter. I personally would expect that further processing would result in some data loss and that the quality of the djvu file would be worse. However, when you compare for example [3] with [4], you can see that Mediawiki extracts the OCR layer much better from the DJVU file than from the PDF file, which means that Mediawiki is not able to extract the OCR layer from PDF files properly. --Jan.Kamenicek (talk) 17:17, 12 November 2019 (UTC) I have also added the examples into the problem description above. --Jan.Kamenicek (talk) 17:22, 12 November 2019 (UTC)
    @Jan.Kamenicek: Thanks, that's very useful! Sam Wilson 19:39, 12 November 2019 (UTC)

Migrate Wikisource specific edit tools from gadgets to core

  • Problem: There are many useful edit-tool gadgets on some Wikisources. Many of these should be used everywhere, but...
    • Not every user knows that a script can be imported from another wiki.
    • Some of these scripts cannot simply be imported; they must be translated or localised.
    • The majority of users will look for these tools on en.wikisource, but there are many scripts on it.wikisource too, for example.
  • Who would benefit: Editors on other Wikisources
  • Proposed solution: Select the best tools across the Wikisources and integrate them as new core functions (the current import workaround is shown below).
  • More comments:
  • Phabricator tickets:
  • Proposer: JAn Dudík (talk) 13:24, 5 November 2019 (UTC)
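
For reference, the current cross-wiki import workaround looks like this (a minimal example with a placeholder gadget name; interface messages would still need separate translation):

```js
// Hedged sketch: load a gadget from another wiki by hand.
mw.loader.load(
	'//en.wikisource.org/w/index.php?title=MediaWiki:Gadget-Example.js' +
	'&action=raw&ctype=text/javascript'
);
```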

Discussion

UI improvements on Wikisource

  • Problem: A big part of the work on Wikisource is proofreading OCR texts. The 2010 wikitext editor has some useful functions, but they are divided across several tabs:
    • Advanced: there is a very useful search-and-replace button
    • Special characters: many characters which are not on the keyboard
    • Proofread tools (Page namespace only): some more tools.
    When I am working on a longer OCR text, there are typical errors which can be fixed by search and replace (e.g. " -> “ or ii -> n), so I must use the first tab. Then a character from another language is missing, so I must switch to the second tab and find it. Then I find the next typical error, so I must switch back to the first...
  • Who would benefit: Wikisource editors, but it would be useful for other projects too.
  • Proposed solution: Proofreading is probably done mainly on desktops (notebooks), whose monitors are wide enough to show all these tools on one tab without the need to switch again and again.
  • More comments:
  • Phabricator tickets:
  • Proposer: JAn Dudík (talk) 20:59, 22 October 2019 (UTC)

Discussion

Hi, did you know that you can customize the edit toolbar to your liking? See https://www.mediawiki.org/wiki/Manual:Custom_edit_buttons. Also, I use a search-and-replace plugin directly in the browser, as this works better for me. See e.g. https://chrome.google.com/webstore/detail/find-replace-for-text-edi/jajhdmnpiocpbpnlpejbgmpijgmoknnl https://addons.mozilla.org/en-US/firefox/addon/find-replace-for-text-editing/?src=search I use the Chrome one and it works all right for simple stuff. For more advanced stuff I copy the text to Notepad++/Notepadqq/LibreOffice Writer and do the regex stuff there.--So9q (talk) 11:26, 25 October 2019 (UTC)
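
To illustrate the customization route So9q mentions, here is a minimal sketch based on the toolbar-customization manual; the icon URL and the inserted characters are placeholders:

```js
// Hedged sketch: add a curly-quotes button to the 2010 wikitext editor.
mw.loader.using( [ 'ext.wikiEditor' ] ).then( function () {
	$( '#wpTextbox1' ).wikiEditor( 'addToToolbar', {
		section: 'main',
		group: 'insert',
		tools: {
			curlyQuotes: {
				label: 'Insert curly quotes',
				type: 'button',
				icon: '//upload.wikimedia.org/example-icon.png', // placeholder
				action: {
					type: 'encapsulate',
					options: { pre: '“', post: '”' }
				}
			}
		}
	} );
} );
```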

Support CSS Shapes module

  • Problem: Some books are decorated with illustrations that are non-rectangular and lay out text along the irregular shape. This can sometimes be approximated by just using normal MediaWiki image syntax, but in some cases the only reasonable layout is the original one. For example, this page (illustrations by Howard Pyle). The proper way to achieve that is using CSS Shapes Level 1: shape-outside: url(…) pointing at an image whose alpha channel gives the exclusion mask for the element it matches. However, there's no way to apply advanced (non-predefined) CSS rules to an image in MediaWiki, and the CSS sanitizer would (I believe) nuke the url() from inline styles in any case.
  • Who would benefit: All projects. The Wikisourcen run into this situation every so often, but all projects will have use for it every now and again.
  • Proposed solution: Extended syntax for applying arbitrary CSS to images? Let the CSS sanitizer accept url() for internal images? Bonus for some syntax where I can just use "File:Image.jpg" instead of "//upload.wikimedia.org/…". Parametrized values for TemplateStyles maybe? If we could have a template {{masked image |float=left |display=Image.jpg |mask=Image.png |threshold=0.5}} that could feed the latter two to its TemplateStyles stylesheet, that solves this problem. Cases where we need shape-inside would still be unresolved, as well as a myriad other cases for fancy CSS, that fully parametrized TemplateStyles would solve.
  • More comments:
  • Phabricator tickets: T200632, T203416
  • Proposer: Xover (talk) 21:08, 10 November 2019 (UTC)

Discussion

Inter-language link support via Wikidata

  • Problem: Wikidata's inter-language link system does not work well for Wikisource, because it assumes that pages are structured the same way as Wikipedia pages are structured, and this is not the case.
  • Who would benefit: Editors and readers of all Wikisources, and editors and readers of Wikidata
  • Proposed solution:
    1. Support linking from Wikidata to Multilingual Wikisource
    2. Support automatic interlanguage links between multiple editions that are linked to different items on Wikidata, where these items are linked by "has edition" and "edition or translation of"
  • More comments: This was also proposed last year
  • Phabricator tickets: phab:T138332, phab:T128173, phab:T180304, phab:T54971
  • Proposer: —Beleg Tâl (talk) 15:47, 23 October 2019 (UTC)

Discussion

This issue causes a lot of confusion for new editors on Wikisource and Wikidata, who frequently set up the interwiki links incorrectly in order to bypass this limitation. —Beleg Tâl (talk) 16:12, 23 October 2019 (UTC)

@Beleg Tâl: great proposal! For information, @Tpt: is working on something quite similar (Tpt: can you confirm?). We should keep this proposal, as this is important and any help is welcome, but still we should keep that in mind ;) Cdlt, VIGNERON * discut. 14:47, 27 October 2019 (UTC)
Hi! Yes, indeed, I am working on it as part of mw:Extension:Wikisource. It's currently in the process of being deployed on the Wikimedia test cluster before a deployment on Wikisource. It should be done soon, so hopefully nothing is needed from the Foundation on this (except help with the deployment). Tpt (talk) 13:59, 30 October 2019 (UTC)
@Tpt: Fantastic, thank you!! —Beleg Tâl (talk) 17:22, 2 November 2019 (UTC)
  • FYI I have repeated T54971, which I have been asking about for ages, to try to get it supported. --Liuxinyu970226 (talk) 13:17, 3 November 2019 (UTC)
  • I would just note that on svwikisource and plwikisource there are JavaScript-based implementations of multi-version interwiki, and they seem to work fine if the appropriate structures are available in Wikidata. Ankry (talk) 20:09, 9 November 2019 (UTC)

memoRegex

  • Problem: OCR editing needs lots of work-specific regex substitutions, and it would be great to save them and share them with other users. Shared regex substitutions are also very useful to harmonize the formatting of all pages of a work.
  • Who would benefit: all users (the unexperienced ones could use complex regex substitutions tested by experienced ones)
  • Proposed solution: it.wikisource uses it:s:MediaWiki:Gadget-memoRegex.js, which does the job (it optionally saves regex substitutions tested with the it.source Find & Replace tool, so that they can be applied by any other user with a click while editing pages of the same Index). The idea should be tested, refined and applied to a deep revision of the central Find and Replace tool (a rough sketch follows below).
  • More comments: The tool has been tested on different projects.
  • Phabricator tickets:
  • Proposer: Alex brollo (talk) 07:33, 25 October 2019 (UTC)
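
A minimal sketch of the shared-substitutions idea, assuming (hypothetically) that the rules are stored as JSON on a subpage of the Index; the actual it.wikisource gadget differs in detail:

```js
// Hedged sketch: fetch shared rules from a JSON subpage and apply them
// to the edit box. Rule format: { find: 'ii', replace: 'n', flags: 'g' }.
mw.loader.using( 'mediawiki.api' ).then( function () {
	var indexTitle = 'Index:Example.djvu'; // placeholder
	new mw.Api().get( {
		action: 'query',
		prop: 'revisions',
		titles: indexTitle + '/regex.json',
		rvprop: 'content',
		rvslots: 'main',
		formatversion: 2
	} ).then( function ( data ) {
		var page = data.query.pages[ 0 ];
		if ( page.missing ) {
			return;
		}
		var rules = JSON.parse( page.revisions[ 0 ].slots.main.content );
		var box = document.getElementById( 'wpTextbox1' );
		rules.forEach( function ( r ) {
			box.value = box.value.replace( new RegExp( r.find, r.flags ), r.replace );
		} );
	} );
} );
```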

Discussion

  • Actually, this is very useful. It's an extension of a workaround to solve the search-and-replace bug that affects all Wikisource projects. If reimplementing Search & Replace is retained as a solution, "memoRegex" should be considered as part of the implementation. --Ruthven (msg) 18:51, 4 November 2019 (UTC)

New OCR tool

  • Problem: 1) Wikisource has to rely on external OCR tools. The most widely used one has been out of service for many months, and all that time we have been waiting to see whether its creator will appear and repair it. The other external OCR tools do not work well (they either have extremely slow response times or generate bad-quality text). None of these tools can handle text divided into columns on magazine pages, and they often have problems with non-English characters and diacritics; the OCR output needs to be improved.
    2) The hOCR tool does not work for Wikisources based on non-Latin scripts. PheTool hOCR creates a Tesseract OCR text layer for Wikisources based on Latin script. For Indic Wikisources, for example, there is a temporary Google OCR to do this, but integrating non-Latin scripts into our own tool would be more useful.
  • Who would benefit: Wikisource contributors handling scanned texts which do not have an original OCR layer or whose original OCR layer is poor, and contributors to wikisources based on non-Latin scripts.
  • Proposed solution: Create an integral OCR tool that the Wikimedia programmers would be able to maintain without relying on help of one specific person. The tool should:
    • be quick
    • generate good quality OCR text
    • be able to handle text written in columns
    • be able to handle non-English characters of Latin script including diacritics
    • be able to handle non-Latin languages

Tesseract, which is an open-source application, also has a specific procedure for training the OCR, which requires the corrected text of a page and an image of the page itself. On the Wikisource side, pages that have been marked as proofread identify books that have been transcribed and reviewed fully. So what needs to be done is to strip the formatting from the text of these finished transcriptions, expand template transclusions and move references to the bottom, then take the text along with an image of the page in question and run it through Tesseract's training procedure. The improved model would then be updated on ToolLabs. The better the OCR, the easier the process becomes with each book, allowing Wikisource editors to become more productive and complete more pages than they could previously. This would also motivate users on Wikisource.

Some concerns have been raised that the WMF nearly always uses open-source software, which excludes e.g. ABBYY FineReader and Adobe, and that the problem with free OCR engines is their lack of language support, so they are never really going to replace Phe's tools fully. I do not know whether free OCR engines suffice for this task or not, but I hope the new tool will be as good as or even better than Phe's tools; ideological reasons that would be an obstacle to quality should be put aside. (A small client-side sketch with the open-source Tesseract engine follows below.)
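
As one possible open-source building block, a minimal sketch using the Tesseract.js library (v2 API); the image URL and language are placeholders, and a production tool would run server-side and pick languages per wiki:

```js
// Hedged sketch: recognize one scanned page in the browser.
// Assumes tesseract.min.js (Tesseract.js v2) is loaded on the page.
Tesseract.recognize(
	'https://example.org/scan-page-042.jpg', // placeholder page image
	'ces', // Czech; Tesseract uses ISO 639-2 language codes
	{ logger: function ( m ) { console.log( m.status, m.progress ); } }
).then( function ( result ) {
	console.log( result.data.text );
} );
```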

Discussion

I think this is the #1 biggest platform-related problem we are facing on English Wikisource at this time. —Beleg Tâl (talk) 15:09, 27 October 2019 (UTC)

Yeah. For some reason neither Google Cloud nor phetools supports all of the languages of Tesseract. Compared with the list of Wikisources, Tesseract is missing Anglo-Saxon, Faroese, Armenian, Limburgish, Neapolitan, Piedmontese, Sakha, Venetian and Min Nan.--Snaevar (talk) 15:12, 27 October 2019 (UTC)

Note that you really don't want a tool that scans all pages for all languages as that is so compute-intensive that you'd wait minutes for every page you tried to OCR. Tesseract supports a boatload of languages and scripts, and can be trained for more, but you still need a sensible way to pick which ones are relevant on any given page. --Xover (talk) 07:27, 31 October 2019 (UTC)
I know. Both the Google Cloud and phetools gadgets pull the language from the language code of the Wikisource where the button is pressed, and thus only use one language. The same thing applies here. These languages are mentioned so that it is clear which Wikisources this proposal could support, and which ones it would not. P.S. I am not American, so I will never try to word things to cover all bases.--Snaevar (talk) 23:01, 2 November 2019 (UTC)

Even aside from the OCR aspect, being able to extract the formatting out of a PDF into wikitext would be highly valuable for converting PDFs (and other formats, via PDF) into wiki markup. T.Shafee(Evo﹠Evo)talk 11:19, 29 October 2019 (UTC)

I am not sure about formatting. Some scans or even originals are quite poor, and in such cases the result of trying to identify italics or bold letters may be much worse than if the tool extracted just the pure text. I would support adding such a feature only if it could be turned on and off. --Jan.Kamenicek (talk) 22:05, 30 October 2019 (UTC)

Many pages require only simple automatic OCR. But there are pages with other fonts (italics, Fraktur) or pages with mixed languages (e.g. a missal in both the local language and Latin), where it would be useful to have some recognition options. This can be done more easily on a local PC, but not everybody has that option. JAn Dudík (talk) 11:21, 31 October 2019 (UTC)

Improve workflow for uploading academic papers to Wikisource

[Video: 5-minute documentary on medics using Internet-in-a-Box in the Dominican Republic, where many medical facilities have no internet. These medics would presumably also appreciate better access to the research literature (Doc James?).]
[Image: Likewise the staff of the University of Ngaoundéré, which has a slow, expensive satellite connection, but also has local-network Afripedia access]
[Image: ...and biologists doing field research]
  • Opportunity: There are a large and increasing number of suitably-licensed new Wikisource texts: academic papers (strikingly-illustrated example on Wikisource). Many articles are now published under CC-BY or CC-BY-SA licenses, and with fully-machine-readable formats; initiatives like Plan S will further this trend.
  • Problem: Uploading these articles is needlessly difficult, and few are on Wikisource.
  • Who would benefit: Having these papers on Wikisource (or a daughter project) would benefit:
  1. anyone accessing open-access fulltexts online. There is an ongoing conflict between traditional academic-article publishers and the open-access movement in academia, and some publishers have done things which make freely-licensed article fulltexts harder to access (for instance, one publisher has paywalled open-access articles, pressured academics and third-party hosters to take down fulltexts (examples), charged for Creative-Commons licensed materials,[5] sought to retroactively add NC restrictions to Creative Commons licenses, forbidden republication of CC-licensed articles on some platforms at some times, and acted with the apparent aim of making legal free fulltexts harder to find [6]); it also bought out a sharing platform (example). Wikimedia has social and legal clout to resist such tactics, and is well-indexed by search tools.
  2. anyone wanting offline access to the academic literature (through Internet-in-a-Box): people with poor internet connectivity (including field scientists), or censorship
  3. those who have difficulties reading non-accessible content. Some online journals work hideously badly with screenreaders.
  4. anyone wishing to archive a previously published paper, including the Wikijournals (journals go bust and go offline, and many research funders require third-party archiving)
  5. those using the fulltexts for other Wikimedia projects (e.g. Wikipedia sourcing, academic article illustrations copied to Commons for reuse)
  • Proposed solution: Create an importer tool for suitable articles, with a GUI (the core fetch step is sketched below).
  • More comments:
    • Konrad Foerstner's JATS-to-Mediawiki importer does just this, but seems stuck in pre-alpha.
      • Even the citation-metadata scrapers which we already use could automate much of the formatting.
      • Another apparently-related tool is the possibly-proprietary Pubchunks.
    • The #Icanhazwikidata system allows academics to add academic-paper metadata to Wikidata by tweeting an identifier; Magnus Manske's Source Metadata tool added it automatically. An #Icanhazfulltexthosting system could allow uploading a fulltext by tweeting a fulltext link (with some feedback if your linked PDF lacks the needed machine-readable layer).
  • Phabricator tickets:
  • Proposer: HLHJ (talk) 02:02, 9 November 2019 (UTC)
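
A minimal sketch of the fetch step such an importer could start from, using Europe PMC's full-text XML endpoint, which serves JATS for open-access articles; the article ID is a placeholder, and the JATS-to-wikitext conversion (the hard part) is not shown:

```js
// Hedged sketch: fetch a JATS full text and read a couple of fields.
var id = 'PMC3257301'; // placeholder open-access article ID
fetch( 'https://www.ebi.ac.uk/europepmc/webservices/rest/' + id + '/fullTextXML' )
	.then( function ( resp ) { return resp.text(); } )
	.then( function ( xml ) {
		var doc = new DOMParser().parseFromString( xml, 'application/xml' );
		var title = doc.querySelector( 'article-title' );
		var license = doc.querySelector( 'license' );
		console.log( 'Title:', title && title.textContent );
		console.log( 'License:', license && license.textContent.trim() );
	} );
```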

Discussion

Thanks to Daniel Mietchen for the example article. I've made a lot of statements about projects I don't know much about, and would appreciate advice and corrections. HLHJ (talk) 02:02, 9 November 2019 (UTC)

Us becoming a repository for high-quality open-access journals is a good idea. We just need to be careful that we do not include predatory publishers. Doc James (talk · contribs · email) 09:22, 9 November 2019 (UTC)
@Doc James: What, in your view, is the advantage of hosting academic papers on Wikisource vs just hosting the PDFs on Commons? It seems like most mobile devices and browsers these days support reading PDFs, and modern PDFs almost always have a text layer that is already searchable (and gets indexed by Google, etc.). Converting PDFs into Wikisource texts is a lot of work, and I'm curious what it achieves in your view. Kaldari (talk) 16:21, 12 November 2019 (UTC)
User:Kaldari Yes maybe the problems with PDFs are improving. Papers are more sources than media files though. So they more naturally fit within Wikisource. Doc James (talk · contribs · email) 16:28, 12 November 2019 (UTC)

Reorganize the Wikisource toolbar

  • Problem: Some shortcuts are superfluous, others are missing.
  • Who would benefit: New editors; it would make editing easier for them.
  • Proposed solution: Rearrange the toolbar a bit.
  • More comments: In the toolbar, we have {{}}, {{|}}, {{|}}. I think we should keep {{}} and replace the other two, which are useless to me: it is just as fast to type the | on the keyboard at the desired place. Instead we could put {{paragraph|}}, {{space|}} and {{separation}} there. <ref>txt</ref> duplicates the icon next to it (insert file) at the top left; it could be replaced by <ref follow=pxxx>. Next to <br/> we could add <brn|> and {{br0}}. The search-and-replace button could appear next to the pencil (switch editor) dropdown at the top right.
  • Phabricator tickets:
  • Proposer: Carolo6040 (talk) 10:58, 25 October 2019 (UTC)

Discussion

I think you mean the character insertion bar below the editor? That can already be modified by the community itself; it does not require effort by the development team. —TheDJ (talkcontribs) 12:48, 4 November 2019 (UTC)
we need a redesign of default menus for wikisource. this is beyond the capabilities of the new editor. visual editor will not be used until this is done, either by community or developers. Slowking4 (talk) 15:11, 4 November 2019 (UTC)
Still, these changes can be performed by local interface admins (such edits are not to be done by new editors). Check for example MediaWiki:Edittools. Ruthven (msg) 18:55, 4 November 2019 (UTC)

Enable book2scroll so that it works for all Wikisources

  • Problem: book2scroll is not enabled for all Wikisources and does not work for any non-Latin Wikisource. It is a very useful tool for marking and numbering pages in Index: pages.
  • Who would benefit: The whole Wikisource community.
  • Proposed solution: The problem is that this code is very old (as in Toolserver-old) and only works with some site naming schemes; many titles in other languages don't work either.
  • More comments: same as in the previous year's list
  • Phabricator tickets: phab:T205549
  • Proposer: Jayantanth (talk) 15:58, 26 October 2019 (UTC)

Discussion

Improve export of electronic books

  • Problem: Imagine if Wikipedia pages could not be displayed for many days, or were only available once in a while for many weeks. Imagine if Wikipedia displayed pages with missing or scrambled information. This is what visitors get when they download books from the French Wikisource. Visitors do not read books online in a browser; they want to download them to their reader in EPUB, MOBI or PDF. The current tool (WsExport) to export books in these formats has all those problems: in spring 2017, it was on and off for over a month; after October 2017, the MOBI format did not work, then PDF stopped working. These problems still continue on and off.


  • Since the project was finished sometime in July or August 2019, the stability of the WsExport tool has improved. Unfortunately there have been downtimes, some up to 12 hours. The fact that the tool does not get back online rapidly is a deterrent for our readers/visitors.

  1. On September 30, no download from 10 am to 10 pm Montreal time
  2. On October 30, no download for around 30 minutes, from about 13:00 to 13:30
  3. On October 31, no answer or bad gateway at 22:10
  4. On November 1, no download from 17:15 to 22:30
  5. On November 2, no download from 10:30 to 11:40
  6. On November 2, no download or bad gateway from 19:25 to 22:46

  • I have tested books and found the same problems as before.

  1. Missing text at the end or beginning of a page (in plain text or in a table)
  2. Duplication of text at the end or beginning of a page
  3. Table titles don't appear
  4. Table alignment on a page (centered) not respected
  5. Text alignment in table cells not respected
  6. Styles in tables not respected in MOBI format
  7. And others

  • For all these reasons this project is resubmitted this year; that shows the importance the Wikisource community gives to this important aspect of the Wikisource project: an interface for contributors and an interface for everyone else from the public who wishes to read good e-books. --Viticulum (talk) 21:45, 7 November 2019 (UTC)


  • Who would benefit: The end users, the visitors to Wikisource, by having access to high-quality books. This would improve the credibility of Wikisource.

    This export tool is the showcase of Wikisource. Contributors can be patient with system bugs, but visitors won't be, and they won't come back.

    The export tool is as important as the website itself.
  • Proposed solution: We need a professional tool that runs and is supported 24/7 by Wikimedia Foundation professional developers, as the various Wikimedia websites are (a simple availability probe is sketched below).

    The tool should support the different kinds of electronic books and the evolution of e-book technology.

    The various bugs should be corrected.
  • More comments: There are not enough people on a small wiki (even the French, Polish or English Wikisource) to support and maintain such a tool.
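
As a stopgap, downtime like that listed above can at least be detected automatically; a minimal availability probe, assuming the tool's historical Toolforge URL and a placeholder book:

```js
// Hedged sketch: log whether the export tool answers at all.
var url = 'https://tools.wmflabs.org/wsexport/tool/book.php' +
	'?lang=fr&format=epub&page=Candide'; // placeholder book
fetch( url, { method: 'HEAD' } ).then( function ( resp ) {
	console.log( new Date().toISOString(),
		resp.ok ? 'up' : 'down (' + resp.status + ')' );
} ).catch( function () {
	console.log( new Date().toISOString(), 'down (no answer)' );
} );
```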


Discussion

Repair Index finder

  • Problem: It's rather similar to the first proposal on this page; that is, for at least a month, the Index finder has been broken: whatever title you put into it, it says something along the lines of "The index finder is broken. Sorry for the inconvenience." (this is just from memory!). It also gives a list of indexes, from the largest to the smallest. The compromise I, at any rate, am using now is the index search built into the search engine.
  • Who would benefit: Everybody who wants to find an index.
  • Proposed solution: Somebody who has a good knowledge of bugs? I'm not good at wikicode!
  • More comments: Apologies for any vague terminology; I am writing via mobile.
  • Phabricator tickets: task T232710
  • Proposer: Orlando the Cat (talk) 07:00, 5 November 2019 (UTC)

Discussion

Batch move API

  • Problem: On Wikisource, the "atomic unit" is a work, consisting of a scanned book in the File: namespace, a set of transcribed pages in the Page: namespace, an index in the Index: namespace, and hopefully also one or more pages in mainspace that transcludes the pages for presentation. This is unlike something like a Wikipedia, where the atomic unit is the (single) page in mainspace, period.
    ProofreadPage ties these together using the pagename: an Index: page looks for its own pagename (i.e. without namespace prefix) in the File: namespace, and creates virtual pages at Page:filenameoftheuploadedfile.PDF/1 (and …/2 etc.). If any one of these are renamed, the whole thing breaks down.
    A work can easily have 1000+ pages: if it needs to be renamed, all 1000 pages have to be renamed. This is obviously not something you would ever undertake manually. But API:Move just supports moving a single page, leading to the need for complicated hacks like w:User:Plastikspork/massmove.js.
    The net result is that nothing ever gets renamed on Wikisource, and when it's done it's only done by those running a personal admin-bot (so of the already very few admins available, only the subset that run their own admin-bots can do this, and that's before taking availability into account).
  • Who would benefit: All projects, but primarily the Wikisources; it would be used (via scripts) by +sysop, but it would benefit all users who can easily have consistent page names for, say, a multi-volume work or whatever else necessitates renaming.
  • Proposed solution: It would vastly simplify this if API:Move supported batch moves of related pages: at worst by an indexed list of from→to titles; better with from→to pairs provided by a generator function; and ideally by intelligently moving according to some form of pattern (the single-move loop this would replace is sketched below). For example, Index:vitalrecordsofbr021916brid.djvu would probably move to Index:Vital records of Bridgewater, Massachusetts - Vol. 2.djvu, and Page:-namespace pages from Page:vitalrecordsofbr021916brid.djvu/1 would probably move to Page:Vital records of Bridgewater, Massachusetts - Vol. 2.djvu/1
    It would also be of tremendous help if mw.api actually understood ProofreadPage and offered a convenience function that treated the whole work as a unit (Index:filename, Page:filename/pagenum, and, if local, File:filename) for purposes of renaming (moving) them.
  • More comments: For the purposes of this proposal, I consider cross-wiki moves out of scope, so, e.g., renaming a File: at Commons as part of the process of renaming the Index:/Page: pages on English Wikisource would be a separate problem (too complicated). Ditto fixing any local mainspace transclusions that refer to the old name (that's a manageable manual or semi-automated/user-tools job).
  • Phabricator tickets:
  • Proposer: Xover (talk) 12:41, 5 November 2019 (UTC)
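
For comparison, the single-move loop that such scripts must implement today looks roughly like this; titles are placeholders, a real script would throttle and handle errors, and noredirect needs the suppressredirect right:

```js
// Hedged sketch: rename a work's Page:-namespace pages one move at a time.
mw.loader.using( 'mediawiki.api' ).then( function () {
	var api = new mw.Api();
	var from = 'vitalrecordsofbr021916brid.djvu';
	var to = 'Vital records of Bridgewater, Massachusetts - Vol. 2.djvu';
	var pages = 5; // in reality often 1000+

	function moveOne( n ) {
		if ( n > pages ) {
			return;
		}
		api.postWithToken( 'csrf', {
			action: 'move',
			from: 'Page:' + from + '/' + n,
			to: 'Page:' + to + '/' + n,
			reason: 'Renaming work',
			noredirect: 1
		} ).then( function () {
			moveOne( n + 1 );
		} );
	}
	moveOne( 1 );
} );
```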

Discussion

@Xover: Why is the sysop bit needed here? I think the bot flag is enough unless the pages are fully protected. Ankry (talk) 20:45, 9 November 2019 (UTC)
@Ankry: Because page-move vandalism rises to a whole `nother level when you can do it in batches of 1k pages at a time. And for the volume we're talking about, having to go through a request and wait for an admin to handle it is not a big deal: single page moves happen all the time, but batch moves of entire works would typically top out at a couple per week (ignore a decade's worth of backlog for now). Given these factors, requiring +sysop (or, if you want to be fancy, some other bit that can be assigned to a given user group like "mass movers" or whatever) seems like a reasonable tradeoff. You really don't want inexperienced users doing this willy-nilly!
But so long as I get an API that lets me do this in a sane way (and w:User:Plastikspork/massmove.js is pretty insane), I'd be perfectly happy imposing limitations like that in the user script or gadget implementing it (unless full "Move work" functionality is implemented directly in core, of course). Different projects will certainly have different views on that issue. --Xover (talk) 21:28, 9 November 2019 (UTC)

Repair Book Uploader Bot

  • Problem: Book Uploader Bot was a valuable tool for uploading books from Google Books to Commons for Wikisource. It has not been working for a long time, and uploading a book from Google Books takes a long time (you need to download the book in PDF, run OCR, convert it into a DjVu, upload it to Commons and then fill in the information). From IA, we have IA Upload; it works but also has some issues from time to time.
  • Who would benefit: Contributors of Wikisources
  • Proposed solution: Repair the tool or build a new one
  • More comments:
  • Phabricator tickets:
  • Proposer: Shev123 (talk) 14:58, 10 November 2019 (UTC)

Discussion

Offer PDF export of original pagination of entire books

  • Problem: Presently the PDF conversion of proofread Wikisource books doesn't mirror the pagination and page design of the original edition, since it comes from ns0 transclusion.
  • Who would benefit: Offline readers.
  • Proposed solution: Build an alternative PDF coming from a page-for-page conversion of the Page namespace.
  • More comments: Some Wikisource contributors think that nsIndex and nsPage are simply "transcription tools"; I think that they are much more: they are the true digitization of an edition, while the ns0 transclusion is something like a new edition.
  • Phabricator tickets: T179790
  • Proposer: Last year's proposal by Alex brollo got 57 votes. Jayantanth (talk) 16:03, 26 October 2019 (UTC)

Discussion

Ajax editing of nsPage text

  • Problem: When editing simple pages, much user time is lost in the cycle: save, load in view mode, go to the next page (which opens in view mode), load it into edit mode.
  • Who would benefit: experienced users
  • Proposed solution: it.wikisource has implemented an Ajax environment that allows saving the edited text and loading the next page in edit mode (and much more) very quickly via Ajax calls: it:s:MediaWiki:Gadget-eis.js (eis means Edit In Sequence). It's far from refined, but it works and it has been tested on other Wikisource projects too. IMHO the idea should be refined and developed (a bare-bones sketch of the save-and-next cycle follows below).
  • More comments:
  • Phabricator tickets:
  • Proposer: Alex brollo (talk) 07:16, 25 October 2019 (UTC)
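
A bare-bones sketch of the save-and-next cycle done with two API calls instead of full page loads; the page titles are placeholders, and the real eis gadget also handles the page status, header/footer fields and edit conflicts:

```js
// Hedged sketch: save the current page and load the next one without
// leaving edit mode. No error or conflict handling.
mw.loader.using( 'mediawiki.api' ).then( function () {
	var api = new mw.Api();

	function saveAndNext( title, nextTitle, box ) {
		api.postWithToken( 'csrf', {
			action: 'edit',
			title: title,
			text: box.value,
			summary: 'Proofreading'
		} ).then( function () {
			return api.get( {
				action: 'query',
				prop: 'revisions',
				titles: nextTitle,
				rvprop: 'content',
				rvslots: 'main',
				formatversion: 2
			} );
		} ).then( function ( data ) {
			box.value = data.query.pages[ 0 ].revisions[ 0 ].slots.main.content;
		} );
	}

	saveAndNext( 'Page:Example.djvu/12', 'Page:Example.djvu/13',
		document.getElementById( 'wpTextbox1' ) );
} );
```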

Discussion

  • I enthusiastically support this - I have often wished that I could move directly from page to page while staying in edit mode. It would be particularly useful for error checking: making sure, for instance, that every page in a range, which could have been proofread by different people over a number of months or even years, conforms to the latest format/structure etc. CharlesSpencer (talk) 11:03, 25 October 2019 (UTC)
  • I think this is a very good project specific improvement that can be made within the remit of community wishlist. Seems feasible as well. —TheDJ (talkcontribs) 12:55, 4 November 2019 (UTC)
  • This would be a great first step towards something like a full-featured dedicated "transcription mode", which would likely involve popping into full screen (hiding page chrome, navbar, etc.; using all available space inside the browser window, but not letting the page scroll, because that conflicts with the independently scrolling text field and scanned-page display, in practice causing your whole editing UI to "jump around" unpredictably), some more flexibility and intelligence in coarse layout (i.e. when previewing a page, the text field and scanned page are side by side, but the rendered text you are trying to compare to the scanned page is about a screenful of vertical scrolling away), prefetching of the next scanned page (cf. the gadget mentioned at the last Wikimania), and possibly other refinements (line-by-line highlighting on the scanned page? We often have pixel coordinates for that from the OCR process). Alex brollo's proposal is one great first change under a broader umbrella that is adapting the tools to the typical workflow on Wikisource, versus the typical workflow on Wikipedia-like projects: the difference makes tools that are perfectly adequate for Wikipedia-likes really clunky and awkward for the Wikisources. Usable, but with needlessly high impedance. --Xover (talk) 12:53, 5 November 2019 (UTC)
    @Samwilson: Could s:User:Samwilson/FullScreenEditing.js be a piece of this larger puzzle? I haven't played with it, but it looks like a good place to start. If this kind of thing (a separate, focussed editing mode) were implemented somewhere core-adjacent, it might also provide an opportunity to clean up the markup used, à la that attempt last year(ish) that failed due to reasons (I'm too fuzzy on the details. Resize behaviour for the text fields got messed up, I think.). Could something like that also have hooks for user scripts? There are lots of little things that are suitable for user scripting to optimize the proofreading process: memoized per-work snippets of text or regex substitutions; refilling the header/footer from the values in the associated Index:; magic comments / variables (think Emacs variables or linter options) for stuff like curly/straight quote marks. In a dedicated editing mode, where the markup is clean (unlike the chaos of a full skin and multiple editors), both the page and the code could have API-like hooks that would make that kind of thing easier. --Xover (talk) 11:20, 9 November 2019 (UTC)
  • Thanks for the appreciation :-). Really, the it.wikisource eis tool - even if rough in code - is appreciated by many users. I'd also like to mention its "ajax-preview" option, which allows one to see very quickly (<1 sec) the result of the current editing/formatting, and also allows some simple edits of brief chunks of text nodes (immediately editing the underlying textarea). Some text mistakes are much more evident in "view" mode than in "edit" mode, but presently VisualEditor is too slow to be used for typical fast editing on Wikisource. --Alex brollo (talk) 09:43, 7 November 2019 (UTC)

UX VisualEditor on Wikisource

Edit proposal/discussion

  • Problem: The visual editor came to Wikisource but has not shown much promise: the menus require much searching and scrolling down, memorizing keyboard shortcuts, or customizing CSS.
    Veterans go back to wikitext, and newbies do not know how to navigate.
  • Who would benefit: new editors to Wikisource
  • Proposed solution: do a thorough UX design pass on the default menu layout
  • More comments: "visual editor is broken on Wikisource" was a broad consensus expressed at Wikimania [7] and the previous wishlist [8]
  • Phabricator tickets:
  • Proposer: Slowking4 (talk) 19:07, 22 October 2019 (UTC)

Discussion

There is also a problem with VE inside <poem></poem>: VE thinks the text is a parameter of {{poem}}. The same happens with some formatting templates (one at the beginning containing <div class=foo> and a second with </div> at the end of the text). T45120 JAn Dudík (talk) 21:03, 22 October 2019 (UTC)
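To illustrate the second case (the template names below are invented for the example), a work may be formatted as:

    {{block start}}   <!-- expands to an opening <div class=foo> -->
    Some quotation or poem,
    continued over several lines.
    {{block end}}     <!-- expands to the matching closing </div> -->

Each template alone emits unbalanced HTML, so VE cannot map the pair onto its document model and instead presents the whole passage as opaque template content (see T45120).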

IMHO Visual Editor is almost useless on Wikisource; a dedicated Wikisource editing interface should be built from scratch, under close supervision by a group of very active Wikisource editors. --Alex brollo (talk) 07:34, 9 November 2019 (UTC)

Make content of Special:IndexPages up-to-date and available to wikicode

Edit proposal/discussion

  • Problem: 1. The content of Special:IndexPages (e.g. s:pl:Special:IndexPages) is not updated after the status of some pages in an index changes, until the index page is purged. 2. The data from this page is not available to wikicode. Making it available would let users build statistics, sortable lists, or graphical tools showing the status of index pages. On plwikisource, we make this data available to wikicode via a bot which regularly updates specific templates; these extra edits could then be avoided.
  • Who would benefit: All Wikisources, mainly those with a large number of indexes
  • Proposed solution: Make the per-index counts of pages in each status from Special:IndexPages available via a mechanism like a magic word, a Lua function, or something similar (a hypothetical sketch follows the proposal).
  • More comments:
  • Phabricator tickets:
  • Proposer: Ankry (talk) 19:12, 9 November 2019 (UTC)
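A purely hypothetical sketch of what the wikicode side could look like (the parser-function name, parameters, and return values below are invented for illustration; no such function exists today):

    {{#indexstatus:Index:Example.djvu|proofread}}   <!-- would return e.g. 124 -->
    {{#indexstatus:Index:Example.djvu|validated}}   <!-- would return e.g. 87 -->

With something like this, the statistics templates currently refreshed by bot on plwikisource could be computed directly at parse time, with no extra edits.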

Discussion

Template limits

Edit proposal/discussion

  • Problem: Long or heavily formatted texts hit the parser's template limits (notably the post-expand include size), so heavily templated Wikisource pages fail to render completely.
  • Who would benefit: Every text on every Wikisource is potentially concerned, but the obvious target is text with a lot of templates (either long text, text with heavy formatting, or both).
  • Proposed solution: I'm not a dev, but I can imagine multiple solutions:
    • increase the limit (easy but maybe not a good idea in the long run) (struck through later: bad idea, cf. infra)
    • improve template expansion (it's strange that "small" templates like the formatting ones consume so much)
    • use something other than templates to format text
    • any other idea is welcome
  • More comments:
  • Phabricator tickets: not exactly the same but there is phab:T123844
  • Proposer: VIGNERON * discut. 09:28, 24 October 2019 (UTC)

Discussion

  • Would benefit all projects as pages that use a large number of templates, such as cite templates, often hit the limit and have to work round the problem. Keith D (talk) 23:44, 27 October 2019 (UTC)
  • for clarity, this is solely about the include size limit? (There are several other types of template limits). Bawolff (talk) 23:14, 1 November 2019 (UTC)
    @Bawolff: What usually bites us is the post-expand include size limit. See e.g. s:Category:Pages where template include size is exceeded. Note that the problem is exacerbated by ugly templates that spit out oodles of data, but the underlying issue is that the Wikisourcen operate by transcluding together lots of smaller pages into one big page, so even well-designed templates and non-pathological cases will sometimes hit this limit. --Xover (talk) 12:02, 5 November 2019 (UTC)
  • @VIGNERON: unfortunately, various parser limits exist to protect our servers and users from pathologically slow pages. Relaxing them is not a good solution, so we can't accept this proposal as it is. However, if it were reformulated more generally like "do something about this problem", it might be acceptable. MaxSem (WMF) (talk) 19:32, 8 November 2019 (UTC)
    • @MaxSem (WMF): thanks for this input. And absolutely! Raising the limit is just one of the ideas I suggested; "do something about this problem" is exactly what this proposal is about. I struck the "increase the limit" suggestion, and I can change other wording if needed; my end goal is just to be able to format text on Wikisource. And if you have any other suggestion, you're welcome 😉. VIGNERON * discut. 19:54, 8 November 2019 (UTC)
  • The problem here is that almost all content on large Wikisources is transcluded using ProofreadPage. I noticed that, as a result, all the output of templates placed on pages in the Page namespace is counted against the post-expand include size limit twice. If you also note that, besides the templates, Wikisource pages have a lot of non-template content, you will see that Wikisource templates must be tiny, efficient, etc. Even a long CSS class name in an extensively used template can be a problem.
@Bawolff and MaxSem (WMF): So the question is whether this particular limit has to be the same for very large, high-traffic wikis like English Wikipedia as for medium/small, low-traffic wikis like Wikisource. I think the Wikisources would benefit a lot even from raising it by 25-50% (from 2 MB to 2.5-3 MB).
Another idea is based on the fact that the Wikisource page lifecycle is: create / verify / leave untouched for years. So if large transcluding pages weigh heavily on the parser, maybe the solution is to use less aggressive updates / more aggressive caching for them? I think delayed updates would not be a big problem for Wikisource pages.
Just another idea: on plwikisource we have no pages hitting this limit at the moment, thanks to a workaround: for large pages we make userspace transclusions using the {{iwpages}} template, see here. Of course, very large pages may then kill users' browsers instead of killing servers. But I think this is acceptable if somebody really wants to see the whole Bible on a single page (we have had such requests...). Unfortunately, this mechanism is incompatible with the Cite extension (transcluded parts contain references with colliding ids - but maybe this can be fixed easily?). Also, a disadvantage is that there are no dependencies on the userspace-transcluded parts of the page(s) (but maybe this is not a problem?). Ankry (talk) 20:04, 9 November 2019 (UTC)
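For a sense of scale (the numbers below are illustrative, not measurements): take a 400-page volume transcluded into a single mainspace page, where the formatting templates on each Page: emit about 1 KB of expanded output. Counted once in the Page-namespace expansion and once again on transclusion, that gives 400 pages × 1 KB × 2 ≈ 0.8 MB, i.e. 40% of the 2 MB post-expand include size budget is spent on template output alone, before any of the actual text is counted.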
Yeah, depending on just exactly what performance issue that limit is trying to avoid, it is very likely a good idea to investigate whether that problem is actually relevant on the Wikisources. Once a page on Wikisource is finished, it is by definition an exception for it to ever be edited again: after initial development the page is supposed to reflect the original printed book which, obviously, will not change. Even the big Wikisources are tiny compared to enwp, so general resource consumption (RAM, CPU) during parsing has a vastly smaller multiplication factor. A single person could probably be reasonably expected to patrol all edits for a given 24-hour period on enWS without making it a full-time job (I do three days' worth of userExpLevel=unregistered;newcomer;learner recent changes on my lunch break). If we can run enwp with the current limit, it should be possible to handle all the Wikisourcen with even ten times that limit and barely see it anywhere in Grafana.
Not that there can't be smarter solutions, of course. And I don't know enough about the MW architecture here to predict exactly what the limit is achieving, so it's entirely possible even a tiny change will melt the servers. But it's something that's worth investigating at least. --Xover (talk) 21:50, 9 November 2019 (UTC)
@Ankry and Xover: thanks a lot for these inputs. Raising the limit even a bit may be a good short-term solution, but I think we need a longer-term solution even more. The most urgent thing is to look into all aspects of the problem to see what can be done and how. Cheers, VIGNERON * discut. 15:01, 12 November 2019 (UTC)

ProofreadPage extension in alternate namespaces

Edit proposal/discussion

  • Problem: ProofreadPage elements, such as the "Source" link in the navigation, do not display in namespaces other than mainspace.
  • Who would benefit: Wikisources with works in non-mainspace, such as user translations on English Wikisource
  • Proposed solution: Modify the ProofreadPage extension to allow its use in namespaces other than mainspace
  • More comments: I also proposed this in the 2019 and 2017 wishlist surveys.
  • Phabricator tickets: phab:T53980
  • Proposer: —Beleg Tâl (talk) 16:23, 23 October 2019 (UTC)

Discussion

  • Not a lot of work, heaps of impact. Thanks for the proposal. --Gryllida 23:35, 6 November 2019 (UTC)

Improve workflow for uploading books to Wikisource

Edit proposal/discussion

  • Problem:
Uploading books to Wikisource is difficult.
In the current workflow you need to upload the file to Commons, then go to Wikisource and create the Index page (and you need to know the exact URL). The files need to be DjVu, a format with separate layers for the scan and the text. This matters for tools like Match & Split (if the file is a PDF, that tool doesn't work).
More importantly, the current workflow (especially for library uploads) involves Internet Archive and the famous IA-Upload tool. This tool is now fundamental for many libraries and uploaders, but it has several issues.
Since Internet Archive stopped creating DjVu files from its scans, the international community has struggled to automatically create a DjVu for upload to Commons and then to Wikisource.
This has created a situation where libraries love Internet Archive and want to use it, but then get stuck because they don't know how to create a DjVu for Wikisource, and the IA-Upload tool is buggy and often fails.
Summary
    • The IA-Upload tool is buggy and often fails when creating DjVu files.
    • Match & Split doesn't work with PDF files.
    • Users do not expect to upload to Commons when transferring files from Internet Archive to Wikisource.
    • Upload to Internet Archive is an important feature, especially for GLAMs (i.e. libraries).
  • Who would benefit:
    • all Wikisource communities, especially new users
    • new GLAMs (libraries and archives) who at the moment have a hard time coping with the wiki ecosystem.
  • Proposed solution:
Improve the IA-Upload tool: https://tools.wmflabs.org/ia-upload/commons/init
The tool should be able to create a good-quality DjVu from Archive files, and not fail as often as it does now.
It should also hide the Commons upload phase from the end user. The user should be able to upload a file to Internet Archive and then use the file's ID to directly create the Index page on Wikisource. We could have an "Advanced" mode that shows all the steps for experienced users, and a "Standard" one that keeps things simple.
  • More comments:
  • Phabricator tickets: related: phab:T154413
  • Proposer: originally proposed by Aubrey (talk) in 2017 - re-proposed by Candalua (talk) 16:15, 6 November 2019 (UTC)

Discussion

Better editing of transcluded pages

Edit proposal/discussion

  • Problem: When somebody wants to edit text on a page with transcluded content, they find nothing relevant in the source, only links to one or many transcluded pages under the textarea. There are some tools (enabled by default on some Wikisources) which help to find the correct page. These tools display the page number in the middle of the text (on some wikis at the edge), and in the HTML source there are invisible markers, sometimes in the middle of a word/sentence/paragraph.
  • Who would benefit: Users who want to correct transcluded text
  • Proposed solution: 1) Make the invisible HTML markers visible without disturbing the text, and find a way to move them out of the middle of words (see the CSS sketch after this list).
    • en.wikisource example (link to page is on the left edge):
      • dense undergrowth of the sweet myrtle,&#32;<span><span class="pagenum ws-pagenum" id="2" data-page-number="2" title="Page:Tales,_Edgar_Allan_Poe,_1846.djvu/16">&#8203;</span></span>so much prized by the horticulturists of England.
        
    • cs.wikisource example (the link is not displayed by default; when made visible by CSS, it sits in the middle of the text):
      • vedoucí od západního břehu řeky k východ<span><span class="pagenum" id="20" title="Stránka:E. T. Seton - Prerijní vlk, přítel malého Jima.pdf/23"></span></span>nímu.
        
  • Alternate solution: 2) After clicking [edit], display the pagination of the transcluded text; clicking on a page will open it for editing.
  • Alternate solution 2: 3) Make transcluded pages editable in VE.
  • More comments: Split from this proposal
  • Phabricator tickets:
  • Proposer: JAn Dudík (talk) 16:30, 11 November 2019 (UTC)
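As a sketch of what solution 1 could look like in the reading view (e.g. as a per-wiki gadget stylesheet; the class and attribute names are taken from the en.wikisource example above, and the styling values are illustrative):

    span.pagenum::after {
        /* Build a small, unobtrusive marker from the span's own attribute. */
        content: "[" attr(data-page-number) "]";
        font-size: 70%;
        vertical-align: super;
        color: #72777d;
    }

Because CSS-generated content is not part of the document text, the marker would not be picked up when the text is copied. It does not, however, solve the harder half of the problem (markers landing in the middle of words), which needs a fix on the ProofreadPage side.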

Discussion

Repair search and replace in Page editing

Edit proposal/discussion

  • Problem: Currently, "Search and replace", as provided by the code editor (top-left option in the advanced editing tab), simply doesn't work in the "Page" namespace.

This is the basic tool to... search and replace text when editing, mass-correct OCR mistakes, etc. It is simply not working.

  • Who would benefit: All editing users
  • Proposed solution: Reimplement the function, or fix the bug in the MediaWiki software.
  • More comments: There are some workarounds, as implemented in it.source, but they are new gadgets that mimic this basic functionality of MediaWiki.
  • Phabricator tickets: phabricator:T183950, phab:T198688 and phab:T212347
  • Proposer: Ruthven (msg) 11:44, 29 October 2019 (UTC)

Discussion

  • Extending the proposal: this would benefit all wiki projects.
    • I would suggest something more general: after I use Search and replace, I cannot undo anymore, in case my replacement (or, more importantly, something before it) was wrong. This is a general problem with the text editor: every time I use any of the existing buttons (like bold, or math, or whatever), I cannot undo past that point. So, if I have been editing for some time, then do something wrong, and then use one of these buttons (or search and replace), I must do the whole work from the beginning, because I cannot get back to the mistake I made before using one of these buttons. This is not the case with the visual editor, so I think it would be possible to change this in the text editor rather easily.
    • There are only two options in search and replace: you can either replace one occurrence after the other, or in the whole text. I would be really grateful if I could use search and replace only within a selected portion of the text (and not the whole one). Yomomo (talk) 22:24, 8 November 2019 (UTC)
    • About Search and replace: if I want to replace something spanning multiple lines, the newline character will not be matched. I don't know how difficult it is to change this, but it would be a benefit to be able to replace parts even when they (and the replacement) span multiple lines. Yomomo (talk) 14:52, 1 November 2019 (UTC)

Generate thumbnails for large-format PDFs

Edit proposal/discussion

  • Problem: For some PDFs with very large page images (typically scanned newspapers), no thumbnail images are generated.
  • Who would benefit: Wikisource when proofreading newspaper pages.
  • Proposed solution: Look at the PDF files described in phab:T25326, phab:T151202, commons:Category:Finlands Allmänna Tidning 1878, to find out why no thumbnails are generated.
  • More comments: When the JPEG for an individual page is extracted from the PDF, that JPEG can be uploaded and rendered fine. But when the JPEG is embedded in a PDF, no thumbnail is generated. Is it because of its size? Small pages (books) work fine, but newspapers (large pages) fail.
  • Phabricator tickets: phab:T151202
  • Proposer: LA2 (talk) 21:04, 23 October 2019 (UTC)

Discussion

  • Hi LA2! Can you provide a description of the problem? This could help give us a deeper understanding of the wish. Thank you! --IFried (WMF) (talk) 18:52, 25 October 2019 (UTC)
    The problem is very easy to understand. I find a free, digitized PDF and upload it to Commons, then start to proofread in Wikisource. This always works fine for normal books, but when I try the same for newspapers, no image is generated. Apparently this is because the image has a larger number of pixels. I haven't tried to figure out what the limit is. --LA2 (talk) 21:36, 25 October 2019 (UTC)
  • For File:Finlands_Allmänna_Tidning_1878-00-00.pdf at least, ghostscript correctly rendered the file locally, but took a lot of time (like a ridiculous amount of time; evince seems to render it instantly, so I don't know why ghostscript takes so long). So at a first guess, I suppose it's hitting time limits. Bawolff (talk) 20:20, 25 October 2019 (UTC)
    Maybe the solution is to fix ghostscript? Another way is to bypass ghostscript and use pdfimages to extract the embedded JPEG images and render those, since JPEG rendering seems to work fine. I don't know. --LA2 (talk) 21:34, 25 October 2019 (UTC)
    pdfimages is not a solution as a PDF page may consist of multiple images and it is hard to extract their relative location (at least not possible with pdfimages). Ankry (talk) 20:23, 9 November 2019 (UTC)
  • What about providing a more compact design for proofreading altogether? Those seconds of scrolling count. If we had, on one side, a window with the extracted text and, on the other side, a same-sized window with the scan, in which you can zoom and pan quickly, that would save time and be more attractive for newbies. The way it is now, it looks kind of techy and in some cases is difficult to handle. E.g., there should also be more contextual help, or a link to the discussion page, presented in a more attractive design. Juandev (talk) 09:22, 4 November 2019 (UTC)
  • I think that a tool that allows generating such thumbnails manually / on request / offline, with much higher limits, and available to a specific group of users (Commons admins? a dedicated group?) may be a workaround for this problem. Ankry (talk) 20:23, 9 November 2019 (UTC)

Index creation wizard

Edit proposal/discussion

  • Problem: The process of turning a PDF or DjVu file into an index for transcription and proofreading is quite complicated and confusing. See Help:Index pages and Help:Page numbers for the basics.
  • Who would benefit: Anyone wanting to start a Wikisource transcription
  • Proposed solution: Create a wizard that walks an editor through the process of creating an index from a PDF or DjVu file (that has already been uploaded). Most importantly, it will facilitate creating the pagelist, by allowing the editor to go through the pages and identify the cover, title page, table of contents, etc, as well as where the page numbering begins.
  • More comments: This is similar to a proposal from the 2016 Wishlist, but more limited in scope, i.e. this proposal only deals with the index creation process, not uploading or importing files.
  • Phabricator tickets: task T154413 (related)
  • Proposer: Kaldari (talk) 15:32, 30 October 2019 (UTC)

Discussion

  • A wizard for initial setup is a good start, but an interactive visual editor for Index: pages, and especially for <pagelist … /> tags, would be even better. The pagelist is often edited multiple times and by multiple people, and currently requires a lot of jumping between the scan and the browser, mental arithmetic and mapping between physical and logical page numbers, multiple numbering schemes and ranges in a single work, etc. etc. A visual editor oriented around thumbnails of each page in the book and allowing you to tag pages: “This thumbnail, physically in position 7 in the file, is logically the ‘Title’ page”; “In these 25 pages (physical 13–37) the numbering scheme is roman numerals, and numbering starts on the first page I've selected”; “On this page (physical 38) the logical numbering resets to 1, and we're now back to default arabic numerals”; “This page (physical 324) is not included in the logical numbering sequence, so it should be skipped and logical numbering should resume on the subsequent page, and this page should get the label ‘Plate’”. All this stuff is much easier to do in a visual / direct-manipulation way than by writing rules describing it in a custom mini-syntax. --Xover (talk) 11:40, 9 November 2019 (UTC)
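To make the scenario above concrete, here is roughly how it has to be expressed in the current <pagelist … /> mini-syntax (the page numbers are the ones from the comment; treat the exact attribute values as illustrative):

    <pagelist 7="Title" 13to37=roman 13=1 38=1 324="Plate" 325=287 />

Note the final entry: because physical page 324 is excluded from the logical sequence, the numbering has to be manually restarted at 287 on the following page, which is exactly the kind of mental arithmetic the comment describes.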

Structured, plain text Wikisource exports

Edit proposal/discussion

  • Problem:
One of the main goals of Wikisource is to produce transcription text that can be widely shared and reused in other contexts. Wikisource has a lot of promise for allied organizations, such as GLAM institutions, that also work on preserving and providing access to textual works. Currently, however, Wikisource is vastly underutilized by the GLAM sector, and one of the main reasons is the lack of interoperability.
It is actually quite difficult to make completed texts on Wikisource useful to the outside world because of the use of wiki formatting and templates in transcriptions, as well as non-transcription text on Wikisource pages, which are not easy to strip out in any programmatic way—and certainly not in any built-in way easily available to reusers, such as an API request.
For example, here is the first paragraph of the US Declaration of Independence:
WHEN in the course of human Events, it becomes necessary for one People to dissolve the Political Bands which have connected them with another, and to assume among the Powers of the Earth, the separate and equal Station to which the Laws of Nature and of Nature’s God entitle them, a decent Respect to the Opinions of Mankind requires that they should declare the causes which impel them to the Separation.
But here is how Wikisource transcribes it:
{{dropinitial|W}}HEN in the cour{{ls}}e of human Events, it becomes nece{{ls}}{{ls}}ary for one People to di{{ls}}{{ls}}olve the Political Bands which have connected them with another, and to a{{ls}}{{ls}}ume among the Powers of the Earth, the {{ls}}eparate and equal Station to which the Laws of Nature and of Nature’s God entitle them, a decent Re{{ls}}pect to the Opinions of Mankind requires that they {{ls}}hould declare the cau{{ls}}es which impel them to the Separation.
Here is how that looks in an API response (spoiler: terrifying).
  • Who would benefit:
  1. This would primarily benefit Wikisource's potential reusers, specifically stakeholders seeking to use Wikisource as a platform for crowdsourcing their content and then ingesting transcription data back into their datasets.
  2. It would also benefit all Wikisource editing communities more directly by encouraging increased institutional partnerships and new users.
  3. It might help Wikisource in other ways, like improving the search index, and paving the way for future storage of transcriptions as structured data statements (directly on Wikidata or as local statements like Structured Data on Commons).
  • Proposed solution:
A successful outcome would involve documented core functionality that allows a user to easily access a transcription for a given work as plain text, and in a machine-readable context.
This could look like a method in the existing Wikisource API for requesting sanitized data, with a "translation" layer on the backend that takes the formatted Wikisource pages and delivers clean versions to the consumer. For example, a JSON array of pages if querying a mainspace or index namespace page, or, at the very least, the ability to get such data on a per-page basis by querying the Page namespace. It would hopefully not be an external, standalone tool that might become unmaintained. (A hypothetical sketch of such an interface follows the proposal.)
I think it could look something like Extension:TextExtracts, except that extension doesn't seem to work in Wikisource contexts and has a character limit.
  • More comments:
  • Phabricator tickets:
  • Proposer: Dominic (talk) 16:54, 23 October 2019 (UTC)
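As a purely hypothetical illustration of the kind of interface described above (this route does not exist today; the URL and response shape are invented to match the "JSON array of pages" idea):

    GET https://en.wikisource.org/api/rest_v1/page/plaintext/United_States_Declaration_of_Independence

    {
      "title": "United States Declaration of Independence",
      "pages": [
        {
          "source": "Page:United_States_Declaration_of_Independence.djvu/1",
          "text": "WHEN in the course of human Events, it becomes necessary for one People to dissolve the Political Bands which have connected them with another, ..."
        }
      ]
    }

Here the templates ({{dropinitial}}, {{ls}}) would be resolved to plain characters and the wiki markup stripped, so a downstream harvester could match each transcription back to the digital image it came from.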

Discussion

While I do like the idea, it is worth considering the REST API in line with this proposal. Dominic's example in the REST API. See also https://en.wikisource.org/api/rest_v1/ --Snaevar (talk) 17:42, 23 October 2019 (UTC)

@Snaevar: Yes, that's a good point; I wasn't really aware of it. It looks like what we want could be similar in approach to the Wiktionary definition method, except without any HTML elements in the response. In general, though, I really like how this works: https://en.wiktionary.org/api/rest_v1/page/definition/apple. Dominic (talk) 19:49, 23 October 2019 (UTC)

@Dominic: I'm not sure if it's quite what you're after, but the Wikisource Export tool can export to plain text (for example, United States Declaration of Independence (Dunlap Broadside)). It does this by creating an epub of the work and using Calibre to convert that to plain text. —Sam Wilson 23:29, 23 October 2019 (UTC)

@Samwilson: This is a useful example as well, but it doesn't quite get at the use case I'm describing either. I think there are several export tools like this aimed at serving the reader of a text, but none that seem optimized for the use case of a downstream harvester attempting to consume Wikisource transcription data at scale. For example, if I were an institution that contributed that digital image of the Declaration, and thousands of others, and then wanted to ingest the work produced by Wikisource back into the source dataset, I would want to be able to easily query for the text of a specific digital image in the form of structured data (when I say plain text, I mean the transcription doesn't have HTML elements or wiki markup, not that it is unstructured or in TXT format). Dominic (talk) 21:14, 24 October 2019 (UTC)
@Dominic: It sounds like you might be talking about being able to export in TEI format, or some other non-presentational markup? This would indeed be terrific! At the moment, the closest we come is HTML, because wikitext only fully outputs to HTML. There have been experiments with using things like Pandoc to turn wiki HTML into Docbook or other structured formats, but because we don't have semantic markup for lots of elements (e.g. a quotation paragraph in a book is just indented, or a chapter title is just larger text) there's no way to actually portray these things in a real structured way in any format. For instance, the example paragraph you give above uses the long s, and an institution bringing the transcription back into their collection would want to retain that knowledge about the work, but plain text doesn't give it (that's a slightly shallow example, maybe, because that can just be represented with an actual ſ character – but lots of things can't be).

Maybe you could update the proposal to clarify what you mean by "as plain text, and in a machine-readable context", because it feels like these things might be in opposition — if it's machine readable, it might be text but it's not really plain text, if you see what I mean? We already have plain text, but it's not very useful! :-) Sam Wilson 22:36, 24 October 2019 (UTC)

Regarding the REST API: you still have to learn how to use it. What about something easy, a few clicks, for tech idiots like me? I don't have time to learn it; will the GLAM employee have time to learn it? Why would somebody learn something new if they can get the same thing elsewhere without learning anything? Juandev (talk) 09:17, 4 November 2019 (UTC)