Talk:Wikisource roadmap

From Meta, a Wikimedia project coordination wiki

Announcements of this page made to communities by community members

This topic is two-fold. First, we can see who is actively reading Wikisource-l as the links are made, and we build a natural list of good people to contact from the various communities for cross-language communication. Secondly, we can see which communities might not be connected with the rest of us and consider some outreach.--BirgitteSB (talk) 01:11, 31 July 2012 (UTC)

Metadata and MediaWiki

I recall speaking with Daniel (before the Unconference) about MediaWiki and metadata. When I mentioned that there is not actually a place to store metadata in MediaWiki, I believe he said an upgrade to fix this was scheduled to be done before Wikidata could roll out. I may have greatly misunderstood the conversation; I honestly don't know what "a place to store metadata in MediaWiki" really means at all. It is just a phrase I read many years ago in a thread about why metadata was impossible for us. I just imagine it as a kitchen strangely lacking a pantry. Can anyone explain this situation, or else tell me I am not making any sense? I would think that the appearance of some new "pantry" in MediaWiki might be useful to what Tpt is working on. Maybe it is already somehow part of the metadata roadmap and I just don't recognize what is being described there as my "pantry" issue. If anyone can understand what issue I am talking about, I would appreciate being corrected as to the proper terminology. Then I could probably search and find the right discussions on my own to understand things better.--BirgitteSB (talk) 18:03, 31 July 2012 (UTC)

Yes, I think Daniel was talking about ContentHandler, which adds a way for MediaWiki to manage content that is not wikitext: with ContentHandler, extensions can add namespaces with their own storage format (for example JSON, as Wikidata will do). This feature will be very useful for improving ProofreadPage, because it would let many features be implemented in a better way (especially by moving a lot of JavaScript to the PHP side, such as the index page form builder or the headers and footers of Page pages), but it is not needed for what I'm doing now (an export system). Tpt (talk) 20:23, 3 August 2012 (UTC)

Confusing To Do item

Currently the first line in the TODO section is:

I'm not sure what this means. Is this about the gadget (which started on English Wikisource as s:en:MediaWiki:Gadget-addViafData.js) or the Help page? Either way, what do we want it to do? The bullet point just stops without an object. - AdamBMorgan (talk) 17:26, 5 August 2012 (UTC)

I reread this in context on the hackpad to jog my memory. We were talking about using Wikidata to manage metadata, with the permanent Wikidata IDs of Wikipedia articles serving as a catalog alongside the IDs from other authorities. Now to give some context: Wikidata would need development and funding to be able to handle metadata on Wikisource. Daniel from Wikidata was encouraging us to develop the idea, as he needs proposals for what to do with Wikidata after next year to present for funding. Wikidata is not funded by WMF; WMDE manages the project, and they have received funding from some big groups (I don't remember exactly who). So they need to ask for more funding and explain what they will do with it, or look for new work. This is not like staff development. So if this direction is funded for Wikidata, I imagine it will still be years until we can work with Wikidata. In the meantime, if it turns out that this development is forthcoming, we can plan to gather metadata in a way that will be transferable to Wikidata. This intermediate method would store the metadata in templates. However, since templates are painful, we discussed using the magic forms for templates, like the ones you see in the demo Tpt has made for index templates. So the to-do item is to build a gadget that imports some basic metadata from authoritative catalogues and creates this template in a way that is not very painful, but still open-ended enough to add additional metadata. It was discussed that this should be a gadget, since a Wikidata extension would supersede it. What was not really discussed was what happens if Wikidata is not funded to develop a metadata handler for Wikisource. In that case I imagine we would want to make an extension rather than just a gadget for our in-house solution. But even if we go with some in-house extension, a gadget still might be a good way to experiment before finalizing how an extension will operate. Does this make sense, or do I need to back up?--BirgitteSB (talk) 03:56, 8 August 2012 (UTC)
OK. I don't have the technical knowledge to talk about the magic forms, but isn't this proposed gadget exactly what English Wikisource has already? Can't other Wikisources simply copy and paste the gadget to their own wikis and localise the text? Inductiveload created it; he may be able to help with that. - AdamBMorgan (talk) 01:05, 13 August 2012 (UTC)
Which gadget are you thinking of?--BirgitteSB (talk) 03:53, 13 August 2012 (UTC)

Global and category-based searches

Hello,

It seems to me that a tool resembling s:fr:Wikisource:Recherche dans les catégories is missing (the gadget must be enabled for it to work). The idea would be to have several fields on one page, to be able to select the search language or languages, and to enter category names (with autocompletion to avoid errors) so that they can be intersected: Auteurs français + Auteurs du XVIIe siècle + Poètes. The extension mw:Extension:DynamicPageList (Wikimedia) already exists and makes s:fr:Portail:Littérature française du XVIIe siècle/Catégories possible, but it is limited and not always practical; for example, you have to do the presentation yourself. I therefore think that a tool specific to Wikisource and covering all language domains would be of great interest. Marc (talk) 15:50, 8 August 2012 (UTC)
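To illustrate what "intersecting" categories means here, a minimal Python sketch follows. It assumes the members of each category have already been fetched into sets; the function name `intersect_categories` and the page titles are hypothetical placeholders, not part of any existing gadget.

```python
def intersect_categories(members_by_category, selected):
    """Return the pages that belong to every selected category."""
    sets = [set(members_by_category.get(name, ())) for name in selected]
    if not sets:
        return set()
    return set.intersection(*sets)


# Hypothetical category contents, for illustration only.
members = {
    "Auteurs français": {"Author A", "Author B", "Author C"},
    "Auteurs du XVIIe siècle": {"Author B", "Author C", "Author D"},
    "Poètes": {"Author C", "Author E"},
}

result = intersect_categories(
    members, ["Auteurs français", "Auteurs du XVIIe siècle", "Poètes"]
)
```

A category name that is unknown yields an empty result, which is one reason autocompletion of category names would help.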

My Wikisource wishlist

Proper OCR tool

  • Software needed: Tesseract 3.02 (or newer), a lot of language data files for Tesseract, a tool on Labs or toolserver
  • Importance: Medium

Here I want a tool similar to PHE's OCR tool on toolserver (https://toolserver.org/~phe/ocr.php), but with Tesseract 3.02 and a lot more language data files. There should be a language data file for each Wikisource language, except where none exists.

This tool would also work a bit differently from PHE's OCR tool. Here is the process I have in mind.

  1. When the user clicks the "OCR" button, the tool notes which Wikisource project the request is coming from
  2. The two-letter ISO 639-1 code used on Wikisource is converted to the three-letter ISO 639-3 code used by Tesseract
  3. The text is OCRed in the specified language
  4. The text is shown on Wikisource

Example

  1. Let's say that I am on is.wikisource.org and click on the "OCR" button. The script will notice that the request comes from the Icelandic Wikisource.
  2. The script converts the two-letter language code "is" to the three-letter language code "isl"
  3. Tesseract OCRs the text in Icelandic
  4. The text is shown on Wikisource
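Steps 1 and 2 could look roughly like the Python sketch below. The mapping table is only a partial illustration, and the function name `tesseract_lang_for` is a hypothetical helper; a real tool would need an entry for every Wikisource language that has a Tesseract language data file.

```python
from urllib.parse import urlparse

# Partial, illustrative mapping from Wikisource subdomain codes to the
# three-letter codes used in Tesseract language data file names.
WIKISOURCE_TO_TESSERACT = {
    "is": "isl",
    "fr": "fra",
    "de": "deu",
    "en": "eng",
}


def tesseract_lang_for(request_url):
    """Return the Tesseract language code for a Wikisource URL, or None."""
    host = urlparse(request_url).hostname or ""
    parts = host.split(".")
    # Only hosts of the form <code>.wikisource.org are recognised.
    if len(parts) != 3 or parts[1:] != ["wikisource", "org"]:
        return None
    # Unknown subdomains fall through to None (no language data available).
    return WIKISOURCE_TO_TESSERACT.get(parts[0])
```

The returned code would then be passed to Tesseract with its `-l` option, for example `-l isl`.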

Alternative

Instead of steps 1 and 2, we could have a drop-down list of languages in place of the OCR button. This drop-down list would show all the available languages, letting the user choose one. The name of the language would be shown in the drop-down list, but the three-letter code would be sent in the query.

Example: If I choose Icelandic in the drop-down list, it will send a query saying that the language is "isl".
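A minimal sketch of the drop-down variant, in Python: the display name is what the user sees, and only the three-letter code goes into the query. The parameter names `lang` and `page` and the function `ocr_query` are assumptions made for illustration; the original proposal does not specify them.

```python
from urllib.parse import urlencode

# Display name shown in the drop-down -> three-letter code sent to the tool.
# Only a few entries are listed here as an illustration.
LANGUAGE_CHOICES = {
    "Icelandic": "isl",
    "French": "fra",
    "German": "deu",
}


def ocr_query(display_name, page):
    """Build the query string for the selected language.

    The parameter names "lang" and "page" are hypothetical.
    """
    return urlencode({"lang": LANGUAGE_CHOICES[display_name], "page": page})
```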

Convert one page in PDF to PNG

  • Software needed: Ghostscript, a tool on Labs or toolserver
  • Importance: Medium

This tool is intended for importing image files. It will fetch a PDF file from Commons and convert one of the PDF pages to PNG using Ghostscript. I am not even considering DjVu here, as I know that the compression level of the images there is much too high to get an acceptable result.

For the tool, we need a user interface with the following fields.

Required fields

  • File name: The file name on Commons
  • Page: A page containing an image that the user wants to create a PNG from

Optional fields

  • Resolution: resolution in dpi - default: 300 dpi
  • Color quality: Drop down list, with 24-bit and grayscale. Default: 24-bit

Command line with default options, assuming that the file name on Commons is myscan.pdf, the output name is myscan.png and the page is the first one:

gs -q -sDEVICE=png16m -dBATCH -dNOPAUSE -dFirstPage=1 -dLastPage=1 -r300 -sOutputFile=myscan.png myscan.pdf

  • For another page, -dFirstPage=1 and -dLastPage=1 need to be changed.
  • For another resolution, -r300 needs to be changed.
  • For grayscale color quality, use -sDEVICE=pnggray
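The way the form fields map onto the command line can be sketched in Python as follows. The function name `gs_command` and the `DEVICES` table are hypothetical; the Ghostscript options are exactly the ones from the command line above.

```python
# Color quality choice from the form -> Ghostscript output device.
DEVICES = {"24-bit": "png16m", "grayscale": "pnggray"}


def gs_command(pdf_name, page, resolution=300, color="24-bit"):
    """Return the Ghostscript argument list for converting one PDF page to PNG."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # myscan.pdf -> myscan.png
    output_name = pdf_name.rsplit(".", 1)[0] + ".png"
    return [
        "gs", "-q", f"-sDEVICE={DEVICES[color]}", "-dBATCH", "-dNOPAUSE",
        f"-dFirstPage={page}", f"-dLastPage={page}", f"-r{resolution}",
        f"-sOutputFile={output_name}", pdf_name,
    ]
```

Building the command as a list rather than a single string means it can be handed to `subprocess.run` without going through a shell, which matters because the file name comes from user input.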

Once Ghostscript has finished the conversion, offer the file to the user for download. Once the user has downloaded the file, (s)he can crop and rotate the image as needed and finally upload the file to Commons. --Snaevar (talk) 13:27, 18 February 2013 (UTC)

The need to recover certain HTML tags from being subject to the wikicode

Is there any way of providing the ability to directly apply the <H1>, <H2>, <H3>, <H4>, <H5> & <H6> HTML 'Headings' tags, regaining their intended functionality and established specification on a case-by-case basis, without the wikicode automatically treating them as "typical" Wikipedia article sections & default Table of Contents entries every time?

The current coding that provides for the familiar ability to assign and organize section levels by using 2 or more " = " characters as the section level-determinant should remain unchanged. This, in illustrative terms, is when one currently types into the editbox...

== Example Current ==


...then submits that for a save, the normal wikicode handling of that manual input in the underlying HTML becomes...

<h2><span class="editsection">[<a href="/w/index.php?title=Talk:Wikisource_roadmap&action=edit&section=11" 
title="Edit section: Example Current">edit</a>]</span> 
<span class="mw-headline" id="Example_Current">Example Current</span></h2>

... by the final rendering. The wikicode's automatic insertions of:

  1. <span class="editsection"> causes the section [Edit] link to be generated to the right of the section's title text; and
  2. <span class="mw-headline" id="Example_Current"> triggers the addition of Example Current as an entry in the default wikicode generated Table of Contents.


What I'm hoping for is the ability for the wikicode to skip the automatic insertion of SPANs #1 & #2 above, and the standard behavior they trigger, ONLY in the specific instances where a.) one of the six standard HTML heading tags is applied directly instead of the 2 or more ' = ' characters method; AND b.) some predesignated class value(s) are detected on the heading tag applied in a.).


So the ability sought after here in illustrative terms would be as if one typed into the editbox...

<h2 class="overrideMW">Example Proposed</h2>


... and submits that for a save, the overridden wikicode handling of that in the underlying HTML output would / should be something along the lines of...

<h2 class="overrideMW"><span class="heading2" id="Example_Proposed">Example Proposed</span></h2>
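The proposed rule can be sketched in Python as below. This only illustrates the requested behavior on a string of input text; it is not how the MediaWiki parser works, and the function name `render_override_headings` is hypothetical. The class name "overrideMW" is the one used in the example above.

```python
import re

# Matches a directly written heading tag carrying the opt-out class.
OVERRIDE_HEADING = re.compile(
    r'<h([1-6]) class="overrideMW">(.*?)</h\1>', re.IGNORECASE | re.DOTALL
)


def render_override_headings(text):
    """Wrap opted-out headings in a plain span, with no edit link or TOC hook."""
    def replace(match):
        level, title = match.group(1), match.group(2).strip()
        anchor = title.replace(" ", "_")
        return (
            f'<h{level} class="overrideMW">'
            f'<span class="heading{level}" id="{anchor}">{title}</span>'
            f'</h{level}>'
        )
    return OVERRIDE_HEADING.sub(replace, text)
```

Headings written with the ' = ' syntax, and heading tags without the class, do not match the pattern and are left for the normal handling.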


We can then further manipulate the resulting outputs with templates and / or basic css styling ... yada, yada, yada.


PLEASE refrain from explaining away what is specifically being asked for here - 'recovery of the <H#> HTML tag-elements without the mere presence of them triggering their incorporation into the wikicode and affecting their normal output' - with laundry lists of magic-word, div block, border-bottom, horizontal-line, font-sizing workarounds and the like to reproduce the appearance of these formal heading tags under the wikicode; these "hacks" are already well known and heavily in use, to say the least.

The issue at hand, however, is that just about EVERYTHING outside of the wikiworld will not properly detect, convert and display such novelties as the universally understood section headings for document content that real headings normally are & were originally designed to be, at least not without additional manipulation or customization of settings. Nor will such software apply the normal rules on allowing/avoiding page breaks before, after or in between without additional manipulation or customization of settings. The list of other potential shortcomings goes on with the same variation on a theme...

To conclude, I hope there is a way both to preserve the well-established application of headings under the wikicode and to add a new way to opt out of them being subject to that wikicode on a case-by-case basis, without corrupting the integrity of either the wikicode or the W3 specification in the process. Thank you in advance for your time in reading all this as well as any consideration that may follow. -- George Orwell III (talk) 01:54, 30 April 2013 (UTC) (I prefer you contact me at WikiSource instead)