Wikisource roadmap

From Meta, a Wikimedia project coordination wiki

This page grew out of the minutes of the Wikisource unconference held at Wikimania 2012. For reference, see the slides of Aubrey's presentation at Wikimania 2012 (many of the issues are outlined there, but only in summary): http://commons.wikimedia.org/wiki/File:Wikisource_2012_-_Aubrey.pdf

There is a distribution list for all the Wikisources; please make sure you are subscribed to the Wikisource mailing list.

Please make sure to submit all Wikisource-related bugs to Bugzilla. Aaharoni will be picking up maintenance of ProofreadPage as his back-burner project.

A draft roadmap for Proofread Page is available on mediawiki.org.

Metadata management system

On the Index page, an extension takes a form and generates a template whose fields are mapped to Dublin Core and can be exported through OAI-PMH.
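As an illustration of the mapping idea, here is a minimal Python sketch that turns a dictionary of Index-page fields into an oai_dc record. The field names in FIELD_TO_DC and the function name index_to_oai_dc are assumptions made for this example; the actual mapping is the one described on the configuration page linked under "Mapping".

```python
import xml.etree.ElementTree as ET

# Standard namespaces for the oai_dc metadata format.
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical mapping from Index-page form fields to Dublin Core elements.
FIELD_TO_DC = {
    "Title": "title",
    "Author": "creator",
    "Publisher": "publisher",
    "Year": "date",
    "Language": "language",
}


def index_to_oai_dc(fields):
    """Build an oai_dc XML string from a dict of Index-page fields."""
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for field, dc_name in FIELD_TO_DC.items():
        value = fields.get(field)
        if value:  # skip fields the form left empty
            child = ET.SubElement(root, f"{{{DC_NS}}}{dc_name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


print(index_to_oai_dc({"Title": "Foobar", "Author": "A. Writer"}))
```

An OAI-PMH response would wrap such a record in the protocol's own envelope; that part is left out here.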

Demo system

http://wikisource-dev.wmflabs.org/w/index.php/Main_Page

Slides of the presentation at Wikimania 2012

http://commons.wikimedia.org/wiki/File:Getting_ebooks_on_Wikisource,_a_first_step_to_a_Semantic_Digital_Library_-_Wikimania_2012.pdf

Mapping

https://wikisource.org/wiki/Wikisource:ProofreadPage/Configuring_index_pages

Roadmap

  • Review of the first version of the OAI-PMH generator.
  • Deployment on one Wikisource for testing (fr?).
  • Addition of new features to OAI-PMH (sets, classification, copyright data).
  • Import of bibliographic metadata from OCLC.
  • Deployment on all Wikisources.
  • Once Wikidata is complete and deployed on Wikisource: RDF API with Wikidata URIs.

Wikidata

Wikidata should be the ultimate place for the metadata about authors and books that Wikisource uses. Moreover, there is currently redundancy with Commons, which has a Creator template for creators/authors and a Book template for books. Right now there is no standard practice: people often don't add these important templates on Commons, and put all the metadata within Wikisource (mostly on Index pages). In theory, Wikidata should be the only place where this data is stored, and we should then transclude it both in Commons (Book template, Creator template) and in Wikisource (Index pages, etc.).

It's important to discuss metadata with the Wikidata community and keep it engaged: https://meta.wikimedia.org/wiki/Wikidata/Notes/Future#Wikisource

Mobile version

It's important to file all bugs in Bugzilla. Please note that the mobile version is currently optimized for reading rather than editing. In 2013 the Mobile Team will work on editing features for the Wikipedia app; we could give them our feedback at that point.

There is also a project for an Android app for Wikisource. A first APK is available here.

This app is looking for a logo. Is anyone interested in designing one?

DjVu

From Wikitech-l:

DjVu files are the Wikisource standard for proofreading. They have very interesting features: they are fully "open" in structure and layering, and they can be shared quickly and effectively on the web when stored in "indirect" mode. Most interestingly, their text layer, which can be easily extracted, contains both the mapped text from OCR and metadata. A free library, DjVuLibre, gives full command-line access to any file content. Apparently it is not supported: we should find and engage DjVu developers. At present, the structure and features of DjVu files are barely used. Indirect mode is IMHO not supported at all, there is no means of accessing the mapped text layer or the metadata, and only the "full text" can be accessed, once, when creating a new page in the Page namespace.

It would be great:

  • to support indirect mode as the standard;
  • to allow free, easy access to the full text layer content from wikisource user interface.
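To make the "easy access to the text layer" point concrete, the sketch below builds command lines for two DjVuLibre tools: djvutxt (text layer, optionally with word coordinates) and djvused (metadata). The Python helper names are made up for this example, and the DjVuLibre binaries are assumed to be installed and on the PATH.

```python
import subprocess


def text_layer_command(djvu_path, page, detail="word"):
    """Command that prints the text layer of one page.

    With --detail=word, djvutxt emits an s-expression that includes the
    coordinates of each word (the "mapped text").
    """
    return ["djvutxt", f"--page={page}", f"--detail={detail}", djvu_path]


def metadata_command(djvu_path):
    """Command that prints the metadata stored in the document."""
    return ["djvused", djvu_path, "-e", "print-meta"]


def run(command):
    """Run a command and return its standard output (needs DjVuLibre)."""
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout


# Example (requires a real file and DjVuLibre):
# print(run(text_layer_command("book.djvu", 3)))
print(text_layer_command("book.djvu", 3))
```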

One crucial feature for Wikisource would be the ability to reinsert proofread text into DjVu files, perhaps by uploading the updated file directly to Commons. This feature could be crucial for GLAMs, which could benefit from Wikisource proofreading and get their files back with human-proofread text.

Archivists seem excited about the tools our platform offers. However, these partners would like to have the validated text returned to the DjVu file for their own digital collections. We do not have a good way of doing this, and we have no way at all of doing it while preserving word search.

Alex Brollo has done some work on that. He suggests that Wikisources should work with indirect DjVu files instead of bundled ones, so that, if we manage to update the file on every edit, we could save page by page instead of modifying a 40 MB file every time. If you are interested in working on this, please write to Aubrey; Aubrey and Alex are running some tests.
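The two steps discussed here (splitting a bundled file into an indirect document, then replacing the text layer of a single page) can be sketched with DjVuLibre's djvmcvt and djvused tools. This is only an illustration under assumptions: the helper names are invented, the tools are assumed to be installed, and the text passed to set-txt must already be in djvused's own s-expression format (the format print-txt produces), not plain wikitext.

```python
def to_indirect_command(bundled_path, out_dir, index_name="index.djvu"):
    """Command that converts a bundled DjVu into an indirect document.

    The pages are written as separate files in out_dir, next to a small
    index file that refers to them.
    """
    return ["djvmcvt", "-i", bundled_path, out_dir, index_name]


def set_page_text_command(djvu_path, page, sexpr_file):
    """Command that replaces the hidden text layer of one page and saves.

    sexpr_file must contain text in djvused's s-expression format.
    File names with spaces would need quoting inside the script.
    """
    script = f"select {page}; set-txt {sexpr_file}"
    return ["djvused", djvu_path, "-e", script, "-s"]


print(to_indirect_command("book.djvu", "book_pages"))
print(set_page_text_command("book.djvu", 12, "page12.dsed"))
```

Converting the validated wikitext into positioned words for that s-expression is the hard part and is not addressed by this sketch.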

BirgitteSB sent some emails and got some feedback:

Cf. the comments of Sun, 04 Nov 2012 on wikitech-l.

DjVu as a file format is not maintained by anyone. That seems significant to me, but some do not think it matters much, since the format is finished and works. RTF is likewise unmaintained as a file format and is widely used without problems. But this does mean there is no obvious group to approach for advice on the above issue.

At Wikimania 2012 SJ put Birgitte in touch with Josh from the Free Software Foundation, who is going to see whether he can find anyone working on DjVu-related software or something similar. WMF staff also encouraged us to find an interested student who could be considered for a Summer of Code project.

Internet Archive

It would be great to ease the process of uploading books to Wikisource. Right now, we often search through Google Books, then go to the Internet Archive to see whether the book we are interested in is also there (the Internet Archive provides OCRed DjVus). If the book is not there, we download it from GB and upload it to IA, then upload it to Commons. Could a bot or browser extension do that? With one click we could ask the IA script (user:tbt) to give priority to a book we choose; then, in a few hours, we could go to IA, find the DjVu and have it uploaded directly to Commons, with some important metadata filled in.

I'm working on a gadget that does the upload from IA to Commons and creates the Book template from IA metadata. I have to fix a bug in the part that uploads to Commons, and then the script will be ready. Tpt (talk) 19:05, 30 August 2012 (UTC)
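The metadata half of such a gadget could look roughly like the Python sketch below. It is not Tpt's gadget: the function names are invented, the Book template parameter names are assumptions to check against the template documentation on Commons, and the fetch relies on the Internet Archive's public metadata endpoint (https://archive.org/metadata/ followed by the item identifier).

```python
import json
import urllib.request


def fetch_ia_metadata(identifier):
    """Fetch the 'metadata' dict of an Internet Archive item (network call)."""
    url = f"https://archive.org/metadata/{identifier}"
    with urllib.request.urlopen(url) as response:
        return json.load(response).get("metadata", {})


def first(value):
    """IA fields may be a string or a list of strings; keep the first value."""
    if isinstance(value, list):
        return value[0] if value else ""
    return value or ""


def book_template(identifier, meta):
    """Render a Commons Book template from IA metadata.

    Parameter names (Author, Title, ...) are assumptions for this sketch.
    """
    params = [
        ("Author", first(meta.get("creator"))),
        ("Title", first(meta.get("title"))),
        ("Publisher", first(meta.get("publisher"))),
        ("Date", first(meta.get("date"))),
        ("Language", first(meta.get("language"))),
        ("Source", f"https://archive.org/details/{identifier}"),
    ]
    lines = ["{{Book"]
    lines += [f" |{name} = {value}" for name, value in params if value]
    lines.append("}}")
    return "\n".join(lines)


sample = {"title": "Foobar", "creator": ["A. Writer", "B. Editor"], "date": "1899"}
print(book_template("foobar00writ", sample))
```

The identifier and sample values above are placeholders, not a real item.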

Documentation of existing features

BirgitteSB: "I found out about the default header (<pages index=Foobar header=1 />), which has been available for some YEARS now. It is amazing! Tpt of fr.WS showed me the next day how robust this feature is. Typing only "/" will fill in the previous and next chapter, to give one example, which is a more recent feature he added. But there is crazy stuff (parameters that are undocumented on en.WS) it can do with a lot less typing or pasting. Tpt is also working on having the index page generated in some magical way that will impress everyone. (Part of the metadata roadmap)"

Tpt: This shows that there is a huge communication problem: there is no "official" ProofreadPage documentation. I think writing it is one of the most important things to do. Can anyone do it?

GLAM

GLAMs, as mentioned above, are really interested in Wikisource. Dominic McDevitt-Parks has spoken a lot about this and about his residency with NARA. Wikisource could be very useful, for example, for transcribing and proofreading documents for archives and libraries. It would be very important to get regular, specific feedback from these institutions:

  • what do they think of Wikisource?
  • what would they like to see as features?
  • would they use Wikisource APIs?
  • do they use Djvu?

One possible example is the ability to insert human-proofread text back into DjVus. But do they really want it? Another could be the ability to assess their OCR, which was one hope of the Gallica partnership.

Aboutness

As per Asaf's proposal, we could use tags/keywords to state the "aboutness" of a document, and the tags could go in the Dublin Core subject field.

Upload Wizard

We must not send our users to Commons to upload DjVus. We could build an upload wizard within Wikisource that uploads directly to Commons, with synchronized templates. But the user must stay where they are, on Wikisource. The same goes for IA.

Xanadu

One of the most beautiful (and vague and impossible) concepts of digital libraries was developed in the sixties by Ted Nelson, who wrote about Xanadu, a huge hypertext where all texts could be stored, written, read, paid for, linked and transcluded. One of his visions was w:transclusion: Wikisource does that (with the Labeled Section Transclusion extension), but it is largely underused. This is partly because, to transclude sections of text, you have to put explicit <section> labels in the markup; it would be very useful to be able to hide the labels within templates.

On Italian Wikisource, for example, we use an anchor template (called §), which is perfect for "saving" a portion of text, making it an anchor, and hyperlinking to it from different texts/pages. It is very useful, for example, when you have a quote in text A, and that quote is from text B. You can go to text B, use the § template to wrap the original quote and make it linkable, and then go to text A and add a link: thus, every quote can be read directly in its original place, which provides context. Example.

Now, it would be extremely interesting to be able to make a span of text not only linkable but also transcludable. We could then be the perfect source of quotes for Wikiquote (and Wikipedia), and for the whole Web itself.
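For readers unfamiliar with labeled sections, the Python sketch below mimics what the extension does when a page is transcluded by section name: it collects the text between the begin and end markers of that name. It is a simplified stand-in written for this page (the real extension works inside the MediaWiki parser), and the function name extract_section is made up.

```python
import re


def extract_section(wikitext, name):
    """Return the text between <section begin=NAME /> and <section end=NAME />.

    If the same label is used several times, the pieces are concatenated.
    Both quoted and unquoted label names are accepted.
    """
    label = re.escape(name)
    pattern = re.compile(
        r'<section\s+begin="?{0}"?\s*/>(.*?)<section\s+end="?{0}"?\s*/>'.format(label),
        re.DOTALL,
    )
    return "".join(match.group(1) for match in pattern.finditer(wikitext))


page = (
    "Some narration. "
    "<section begin=quote />To be, or not to be<section end=quote />"
    " More narration."
)
print(extract_section(page, "quote"))
```

A template such as § could emit these markers itself, which is the "hide the labels within templates" idea mentioned above.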

User automation tools

Many tasks on Wikisource are tiring and dull. Bots and other tools have been developed locally over many years, but they are only ever used by a minority of technically skilled users, who are burdened with many requests. It would be very useful to provide ordinary users (or at least experienced users with limited technical skills) with bot-like tools that free up time for "intelligent" tasks. Automating as many tasks as possible would save our users time and effort and thus help grow our user base.

Translation tools

Translation tools could be improved in the future. A Tikiwiki installation actually has a translation tool written in PHP that is far more robust than we would probably need, although it would need further development to allow accuracy/fluency checks in the way we currently do proofreading/validation checks.

Another option is to improve the existing DoubleWiki extension: right now there is a small double arrow beside the interwiki link, but nobody sees it. We should have icons/messages to highlight this amazing feature for our readers; it is a great added value of Wikisource and we should exploit it.

Wikimania Presentation details

  • Demo
  • 2008 Paper (Wiki-Translation has actually developed the feature described in the paper and has been using it for the past three years.)

Statistics

Unfortunately, statistics tools are always developed for the Wikipedias, and never for the Wikisources. Given that the traffic is *much* lower, maybe some tools could easily be used to provide the Wikisource communities with analysis and insight. For our communities it would be very useful to know which pages are the most visited, which books are the most read, and so on.

Wikicaptcha

See also mw:CAPTCHA
Slides of the presentation at Wikimania

http://commons.wikimedia.org/wiki/File:Wikicaptcha.pdf Seb35 used to work on this: he should be contacted for an update.

Critics

Visual Editor

See also mw:VisualEditor
See also mw:Extension:Proofread page

The Visual Editor is something developers have been working on for a while. It can easily be deployed in a new namespace to be tested by the community. For Wikisource, it would be paramount to have the Visual Editor implemented directly within the Proofread Page extension. Aubrey has spoken with Krinkle, a developer of the Visual Editor, who highlighted some possible (probable) obstacles to the feasibility of the project, due to the different "architecture" of the two extensions.

Some discussion is needed between developers to understand those issues and how to cope with them.

Microtask

It would be very interesting to develop Wikisource features that reduce the unit of work: at present, a user can proofread/validate a whole page, upgrading its quality status. This lets users spend less time and still accomplish small but useful tasks. It would be even more helpful to let the user proofread/validate a single line or a single word. Projects and experiments in citizen science (e.g. Zooniverse) show that making tasks smaller and simpler for users optimizes the crowdsourced labor and increases the user base. This is an important aim for the development and success of Wikisource.
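One way to picture line-level work units is sketched below in Python: a page of OCR text is split into one task per non-empty line, each with its own status, and the page counts as proofread only when every line is. The class, the function names and the status strings are assumptions for this illustration, not an existing ProofreadPage data model.

```python
from dataclasses import dataclass


@dataclass
class LineTask:
    """One line of a page, treated as an independent unit of work."""
    page: int
    line: int
    text: str
    status: str = "not proofread"


def page_to_line_tasks(page, ocr_text):
    """Split the OCR text of a page into one task per non-empty line."""
    lines = [l for l in ocr_text.splitlines() if l.strip()]
    return [LineTask(page, n, text) for n, text in enumerate(lines, start=1)]


def page_is_proofread(tasks):
    """A page is proofread only when it has tasks and all of them are done."""
    return bool(tasks) and all(t.status == "proofread" for t in tasks)


tasks = page_to_line_tasks(7, "First line\n\nSecond line\n")
tasks[0].status = "proofread"
print(len(tasks), page_is_proofread(tasks))
```

The same structure would work at word level by splitting on the word boxes of the DjVu text layer instead of on line breaks.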

Microtask contributions requiring development

  • Proofread at line or word level (could possibly piggyback on Wikisource Captcha development)
  • Catalog one piece of bibliographic data (part of the Wikidata use case)

TODO

  • write gadget that can (looks like http://commons.wikimedia.org/wiki/Help:Gadget-VIAFDataImporter)
  • search in worldcat and try and return some dc info, a
  • enter Wikipedia links to be recorded in the template as subject and exported as dc:subject
  • Copyright calculator.
  • write proposal to all wikisources
  • when Wikidata arrives, convert the templates
  • map the metadata with template Book
    • self categorizing template
  • map the author page with template Creator
  • upload from Wikisource in Commons via API
    • for the book template, there shouldn't be conflicting issues
    • Wikisource metadata wins over Commons metadata
  • January 2013: ready proposals for Summer of Code and get feedback from User:Sumanah

Tasks

  • User:Zaran: speak with Krinkle for Visual Editor
  • Max: OCLC thing
  • BirgitteSB: upload here her mail and divide into bullet points
  • Jarekt: template Book
  • Everyone: speak with your own community and give them this link
  • Aubrey: speak with Alex Brollo about the text layer of the Djvu and how to put the text layer back
  • Tpt: ProofreadPage + mobile things.

Participants include Asaf Bartov, Aubrey, BirgitteSB, Kristin Anderson, Daniel Kinzler, Jeremy Baron, Maximilian Klein, and maybe 5 others.