Wikisource roadmap

This page develops from the minutes of the Wikisource unconference held at Wikimania 2012. For reference, see the slides of Aubrey's presentation at Wikimania 2012 (many issues are outlined there, but it is only a summary): http://commons.wikimedia.org/wiki/File:Wikisource_2012_-_Aubrey.pdf

There is a distribution list for all the Wikisources; please make sure you are subscribed to the Wikisource mailing list.

Please make sure to submit all Wikisource-related bugs in Bugzilla. Aaharoni will be picking up maintenance of ProofreadPage as his back-burner project.

A draft roadmap for ProofreadPage is available on mediawiki.org.

Metadata management system

On the index page, an extension takes a form and generates a template that is mapped to Dublin Core and can be exported via OAI-PMH.
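As a rough illustration of the mapping described above, the sketch below turns a filled-in index form into an `oai_dc` record. The form field names (`Title`, `Author`, and so on) and the mapping table are assumptions made for this example, not the extension's actual configuration; only the Dublin Core element names and the two namespace URIs are standard.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# Hypothetical mapping from index-form fields to Dublin Core elements.
FIELD_TO_DC = {
    "Title": "title",
    "Author": "creator",
    "Publisher": "publisher",
    "Year": "date",
    "Language": "language",
}


def index_to_oai_dc(fields):
    """Build an oai_dc XML string from a dict of index-form fields."""
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for field, dc_name in FIELD_TO_DC.items():
        value = fields.get(field)
        if value:  # skip fields the form left empty
            ET.SubElement(root, f"{{{DC_NS}}}{dc_name}").text = value
    return ET.tostring(root, encoding="unicode")


record = index_to_oai_dc(
    {"Title": "Romeo and Juliet", "Author": "William Shakespeare"}
)
```

A real OAI-PMH response would wrap such a record in the protocol's envelope (header, datestamp, identifier); that part is left out here.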

Demo system

http://wikisource-dev.wmflabs.org/w/index.php/Main_Page

Slides of the presentation at Wikimania 2012

http://commons.wikimedia.org/wiki/File:Getting_ebooks_on_Wikisource,_a_first_step_to_a_Semantic_Digital_Library_-_Wikimania_2012.pdf

Mapping

https://wikisource.org/wiki/Wikisource:ProofreadPage/Configuring_index_pages

Roadmap

Review of the first version of the OAI-PMH generator.
Deployment on one Wikisource for testing (fr?)
Addition of new features to OAI-PMH (sets, classification, copyright data)
Import of bibliographic metadata from OCLC
Deployment on all Wikisources
Once Wikidata is done and deployed on Wikisource: RDF API with Wikidata URIs.

Wikidata

Wikidata should be the ultimate place for the metadata regarding authors and books that Wikisource uses. Moreover, at present there is redundancy with Commons, which has a Creator template for creators/authors and a Book template for books. Right now, there is no standard way to act: people often don't put these important templates in Commons, and instead put all the metadata within Wikisources (Index pages, mostly). Theoretically, Wikidata should be the only place to store those data, and we should then transclude them in both Commons (Book template, Creator template) and Wikisource (Index pages, etc.).

It's important to engage the Wikidata community in discussion regarding metadata: https://meta.wikimedia.org/wiki/Wikidata/Notes/Future#Wikisource

Mobile version

It's important to file all the bugs in Bugzilla. Please note that right now the mobile version is optimized for reading rather than editing. In 2013 the Mobile Team will work on the editing features for the Wikipedia app; we could give them our feedback at that point.

certificate
~~stretched logo bugzilla:38832~~ Fixed
texts are aligned to the left; they should be center-aligned bugzilla:38833
~~table borders are visible bugzilla:37222~~ Fixed
tooltips and other hover pop-ups are inaccessible bugzilla:30191

There is also a project for an Android app for Wikisource. A first APK is available here.

This app is looking for a logo. Is anyone interested in making one?

Djvu

From Wikitech-l:

Djvu files are the wikisource standard supporting proofreading. They have very interesting features, being fully "open" in structure and layering, and allowing fast and effective sharing on the web when they are stored in their "indirect" mode. Most interesting, their text layer, which can be easily extracted, contains both the mapped text from OCR and metadata. A free library, DjVuLibre, allows full command-line access to any file content. Apparently it's not supported: we should find and engage DjVu developers. Presently, the structure and features of djvu files are minimally used. Indirect mode is IMHO not supported at all, there is no means of accessing the mapped text layer or the metadata, and only the "full text" can be accessed, once, when creating a new page in the Page namespace.

It would be great:

to support indirect mode as the standard;
to allow free, easy access to the full text layer content from wikisource user interface.
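To make the idea of command-line access to the text layer concrete, here is a minimal sketch that only builds the argument list for DjVuLibre's `djvutxt` tool, without running it. The `-page=` and `-detail=` options are written as they appear in the tool's usage text as far as we know; please check them against the DjVuLibre version you have installed. The file name `book.djvu` is a placeholder.

```python
import shlex


def djvutxt_command(djvu_path, page=None, detail=None):
    """Build the argv list for a djvutxt call; nothing is executed here."""
    argv = ["djvutxt"]
    if page is not None:
        # Restrict extraction to one page (or a page range string).
        argv.append(f"-page={page}")
    if detail is not None:
        # Ask for the mapped text (with coordinates) at the given level,
        # e.g. "page", "line" or "word", instead of plain text.
        argv.append(f"-detail={detail}")
    argv.append(djvu_path)
    return argv


cmd = djvutxt_command("book.djvu", page=3, detail="word")
printable = shlex.join(cmd)
```

The resulting list can be handed to `subprocess.run(cmd, capture_output=True, text=True)` on a machine where DjVuLibre is installed.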

One crucial feature for Wikisource would be the possibility of reinserting proofread text into Djvu files, perhaps uploading the updated file directly to Commons. This feature could be crucial for GLAMs, which could profit from Wikisource proofreading and get their files back with human-proofread OCR.

The BHL http://www.biodiversitylibrary.org/ is very interested in working with us.
NARA has a dashboard feature directing people into proofreading on Wikisource.

Archivists seem excited about the tools our platform offers. However, these partners would like to have the validated text returned to the djvu file for their own digital collection. We do not have a good way of doing this, and we have no way at all of doing it while preserving word search.

Alex Brollo has done some work on that. He suggests that Wikisources should work with indirect djvu files instead of bundled ones, so that if we manage to edit the file for every edit, we could save page by page instead of modifying a 40 MB file every time. If someone is interested in working on this, please write to Aubrey; Aubrey and Alex are running some tests.

BirgitteSB sent some emails and got some feedback:

Bień, Janusz S. (2011) Efficient search in hidden text of large DjVu documents. In: Advanced Language Technologies for Digital Libraries. Lecture Notes in Computer Science (Theoretical Computer Science and General Issues) (6699). Springer, pp. 1-14. ISBN 978-3-642-23159-9
Janusz also suggested this graphical editor for Djvu. He said they are working on a proof-of-concept tool for a similar purpose.

Cf. the comments of Sun, 04 Nov 2012 on wikitech-l.

Djvu as a file format is not maintained by anyone. This seems significant to me, but some do not think it is very profound, since the format is finished and works. RTF is likewise an unmaintained file format that is widely used without problems. But it does mean there is no obvious group to approach and seek advice from on the above issue.

At Wikimania 2012 SJ put Birgitte in touch with Josh from the Free Software Foundation, who is going to see if he can find anyone who is working on djvu-related stuff or something similar. There was also encouragement from WMF staff to find an interested student who could be considered for a Summer of Code project.

Internet Archive

It would be great to ease the process of uploading books to Wikisource. Right now, we often search through Google Books, then go to Internet Archive to see if the book we are interested in is also there (Internet Archive provides OCRed djvus). If the book is not there, we download it from GB and upload it to IA, then we upload it to Commons. Now, could a bot/browser extension do that? With a click we could ask the IA script (user:tbt) to give priority to a book we choose; then, in a few hours, we could go to IA, find the djvu and have it uploaded directly to Commons, with some important metadata filled in.

I'm working on a gadget that does the upload from IA to Commons and creates the Book template from IA metadata. I have to solve a bug in the part that uploads to Commons, and then the script will be ready. Tpt (talk) 19:05, 30 August 2012 (UTC)

Documentation of existing features

BirgitteSB: "I found out about the default header (<pages index=Foobar header=1 />), which has been available for some YEARS now. It is amazing! Tpt of fr.WS showed me the next day how robust this feature is. Typing only "/" will fill in the previous and next chapter, for example, which is a more recent feature he added. But there is crazy stuff (parameters that are undocumented on en.WS) it can do with a lot less typing or pasting. Tpt is also working on having index pages generated in some magical way that will impress everyone. (Part of the metadata roadmap)"
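Since the tag is so little documented, a small sketch may help show its shape. The helper below only assembles the wikitext for a `<pages />` tag; the function name and its keyword arguments are made up for this example, while `index`, `from`, `to` and `header` are the tag attributes as used on Wikisource. The index name and page numbers are placeholders.

```python
def pages_tag(index, start=None, end=None, header=False):
    """Return a <pages /> tag that transcludes pages of an index."""
    attrs = [f'index="{index}"']
    if start is not None:
        attrs.append(f"from={start}")
    if end is not None:
        attrs.append(f"to={end}")
    if header:
        # Ask ProofreadPage to generate the default header.
        attrs.append("header=1")
    return "<pages " + " ".join(attrs) + " />"


tag = pages_tag("Foobar.djvu", start=11, end=24, header=True)
```

Here `tag` holds the wikitext one would paste into a main-namespace page to transclude scans 11 to 24 with the automatic header.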

Tpt: This shows that there is a huge communication problem: there is no "official" ProofreadPage documentation. I think that it's one of the most important things to do. Can anyone do it?

GLAM

GLAMs, as previously said, are really interested in Wikisource. Dominic McDevitt-Parks has spoken a lot about this and about his residency with NARA. Wikisource could be very useful, for example, for transcribing and proofreading documents for archives and libraries. It would be very important to have constant and specific feedback from these institutions:

what do they think of Wikisource?
what would they like to see as features?
would they use Wikisource APIs?
do they use Djvu?

One possible example is the possibility of inserting human-proofread text back into djvus. But do they really want it? Another could be the ability to assess their OCR, which is one hope of the Gallica partnership.

Aboutness

As per Asaf's proposal, we could use tags/keywords to state the "aboutness" of a document, and the tags could go in the Dublin Core subject field.

Upload Wizard

We must not send our users to Commons to upload djvus. We could build an Upload Wizard within Wikisource that uploads directly to Commons, with synchronized templates. But the user must stay where they are, on Wikisource. Same for IA.

Xanadu

One of the most beautiful (and vague and impossible) concepts of digital libraries was developed in the sixties by Ted Nelson, who wrote about Xanadu, a huge hypertext where all texts could be stored, written, read, paid for, linked, and transcluded. One of his visions was transclusion: Wikisource does that (with the Labeled Section Transclusion extension), but it is largely underused. This is partly because, to transclude sections of text, you have to explicitly put <section> labels in the markup, and it would be very useful to be able to hide the labels within templates.
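To show what those labels do, here is a minimal sketch that pulls a labeled section out of a page's wikitext, roughly the way a transclusion would. It is a simplification written for this page, not the extension's code: the function name is made up, and the real extension handles more cases than this single regular expression.

```python
import re


def extract_section(wikitext, name):
    """Return the text between <section begin=name /> and <section end=name />."""
    label = re.escape(name)
    pattern = re.compile(
        rf'<section\s+begin="?{label}"?\s*/>(.*?)<section\s+end="?{label}"?\s*/>',
        re.DOTALL,
    )
    match = pattern.search(wikitext)
    return match.group(1) if match else None


page = (
    'Intro. <section begin="quote" />To be, or not to be.'
    '<section end="quote" /> Outro.'
)
quote = extract_section(page, "quote")
```

With the labels in place, only the wrapped span is returned, which is exactly the portion another page could reuse.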

In Italian Wikisource, for example, we use an anchor template (called §), which is perfect for "saving" a portion of text, making it an anchor, and hyperlinking to it from different texts/pages. It is very useful, for example, when you have a quote in text A, and that quote is from text B. You can go to text B, use the § template to wrap the original quote and make it linkable, and then go to text A and add a link: thus, every quote can be read directly in its original place, providing context. Example.

Now, it would be extremely interesting to be able not only to make a span of text linkable, but transcludable. We could then be the perfect source for Wikiquote (and Wikipedia) for quotes, and for the whole Web itself.

User automation tools

Many tasks on Wikisource are tiring and dull. Bots and other tools have been developed locally over many years, but they are only ever used by a minority of technically skilled users, who are burdened with many requests. It would be very useful to provide normal users (or at least expert users with poor technical skills) with bot-like tools to save time for "intelligent" tasks. Automating most tasks would save our users time and effort and thus grow our userbase.

Translation tools

Translation tools should improve in the future. A Tikiwiki installation actually has a translation tool written in PHP that is far more robust than we would probably need, although it could use development to allow accuracy/fluency checks the way we currently do proofreading/validation checks.

Another thing could be improving the existing DoubleWiki extension: right now, there is a small double arrow beside the interwiki link, but nobody sees it. We should have icons/messages to highlight this amazing feature for our readers; it is a great added value that Wikisource has, and we should exploit it.

Wikimania Presentation details

Demo
2008 Paper (Wiki-Translation has actually developed the feature written about and been using this feature for the past three years.)

Statistics

Unfortunately, statistics tools are always developed for Wikipedias, and never for Wikisources. Given that the traffic is *much* lower, maybe some tools could easily be used to provide Wikisource communities with analysis and insight. For our communities it would be very useful to know which pages are the most visited, which books are the most read, and so on.

Wikicaptcha

See also mw:CAPTCHA
Slides of the presentation at Wikimania

http://commons.wikimedia.org/wiki/File:Wikicaptcha.pdf Seb35 used to work on this: he should be contacted for an update.

Criticism

Blog post: Captchas are becoming ridiculous

Visual Editor

See also mw:VisualEditor

See also mw:Extension:Proofread page

The Visual Editor is something developers have been working on for a while. It can be easily implemented in a new namespace to be tested by the community. For Wikisource, it would be paramount to have the Visual Editor implemented directly within the ProofreadPage extension. Aubrey has spoken with Krinkle, a developer of the Visual Editor, who highlighted some possible (probable) difficulties with the feasibility of the project, due to the different "architecture" of the two extensions.

Some discussion is needed between developers to understand those issues and how to cope with them.

Microtask

It would be very interesting to develop Wikisource features that reduce the unit of work: at present, a user can proofread/validate a whole page, upgrading its quality status. This lets users dedicate less time yet still accomplish small but useful tasks. It would be even more helpful to let the user proofread/validate a single line or a single word. Projects and experiments in citizen science (e.g. Zooniverse) show that making tasks smaller and simpler for users optimizes the crowdsourcing labor and increases the userbase. This is an important aim for the development and success of Wikisource.

Microtask contributions requiring development

Proofread at line or word level (could possibly piggyback Wikisource Captcha development)
Catalog one piece of bibliographic data (part of the Wikidata use case)

TODO

write a gadget (like http://commons.wikimedia.org/wiki/Help:Gadget-VIAFDataImporter) that can:
- search in WorldCat and try to return some DC info, a
- enter Wikipedia links to be recorded in the template as subject and exported as dc:subject
Copyright calculator.
write proposal to all wikisources
when Wikidata arrives, convert the templates
map the metadata with template Book
- self categorizing template
map the author page with template Creator
upload from Wikisource in Commons via API
- for the book template, there shouldn't be conflicting issues
- Wikisource metadata wins on Commons metadata

January 2013: have proposals ready for Summer of Code and get feedback from User:Sumanah

Tasks

User:Zaran: speak with Krinkle for Visual Editor
Max: OCLC thing
BirgitteSB: upload here her mail and divide into bullet points
Jarekt: template Book
Everyone: speak with your own community and give them this link
Aubrey: speak with Alex Brollo about the text layer of the Djvu and how to put the text layer back
Tpt: ProofreadPage + mobile things.

Participants include Asaf Bartov, Aubrey, BirgitteSB, Kristin Anderson, Daniel Kinzler, Jeremy Baron, Maximilian Klein, and maybe 5 others

Notes

Metadata, Semantic MW

would like to see you write down your requirements: data you would like to collect, entities described in the FRBR standard, etc. (FRBR refers to https://en.wikipedia.org/wiki/FRBR and is pronounced like English "ferber". People referred to FRBRs. FRBR defines these increasingly precise descriptions of literary works/objects: work, expression, manifestation, and item. A "work" is Shakespeare's play called Romeo and Juliet. An "item" might be tangible: the copy of Romeo and Juliet on my shelf.)
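The four FRBR levels can be pictured as a chain of records, each pointing to the more abstract one above it. The sketch below is only an illustration of that nesting: the class fields are a simplification chosen for this example, and the publisher and year are invented placeholders, not a real edition.

```python
from dataclasses import dataclass


@dataclass
class Work:
    title: str  # the abstract creation, e.g. the play itself


@dataclass
class Expression:
    work: Work
    language: str  # one realization of the work, e.g. a given text


@dataclass
class Manifestation:
    expression: Expression
    publisher: str  # one published edition of that text
    year: int


@dataclass
class Item:
    manifestation: Manifestation
    location: str  # one physical or digital copy


work = Work("Romeo and Juliet")
text = Expression(work, "en")
edition = Manifestation(text, "Example Press", 1900)  # invented edition
copy = Item(edition, "my shelf")
```

Walking from `copy` up to `work` follows the same path as the prose: item, manifestation, expression, work.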

thoughts on wikidata and wikisource https://meta.wikimedia.org/wiki/Wikidata/Notes/Future#Wikisource SMW (Semantic MediaWiki) can be used to express relations between works, expressions, manifestations, etc. (It's implemented by extensions not now running on wikisource.)

In Semantic MediaWiki: x is y, subject predicate object

In WikiData: x has property so and so, and there is a statement somewhere that asserts that this is so

How do you fit semantic and wikidata approaches together?

historians may want this level of detail in their metadata, and you'd want to model it

should we model with this level of detail? or use established formats like FRBR, that's more tractable, less deep and easier to process

We want to integrate primary data objects from authoritative sources in a different way from the data items that we want to maintain on wikidata to support wikipedia

we want to use their bibliographic metadata

we don't know yet how these things mix and match

the kinds of cataloging described (indexing of characters, etc.) is very labor intensive

open, federated, no single silo, single server, single id frbr site which has ids

in linked open data world the basic solution is redundancy and same as relationships

prototype book/essay cataloger system: mint new ids for things as needed for my system, call shakespeare number 17; then, when attempting to get my semantic relationships out to your university catalog, make an equation: 17 in my system = 1356 in bibliotek francais
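A tiny sketch of that "equation" between identifiers: each system keeps minting its own ids, and a separate table records which ids are asserted to mean the same thing. The system names `local` and `external` and the helper names are placeholders for this example; the numbers are the ones from the note above.

```python
from collections import defaultdict

# Maps a (system, id) pair to the set of pairs asserted to be the same entity.
same_as = defaultdict(set)


def link(a, b):
    """Record a symmetric same-as assertion between two (system, id) pairs."""
    same_as[a].add(b)
    same_as[b].add(a)


def equivalents(entity):
    """Return the known equivalents of an entity, sorted for stable output."""
    return sorted(same_as.get(entity, set()))


# "17 in my system = 1356 in the other catalog"
link(("local", "17"), ("external", "1356"))
```

Neither system has to adopt the other's numbering; the redundancy lives in the mapping table, which is also where a misleading same-as assertion would have to be corrected.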

Will wikidata be federated? It will mint new ids for things minted by wikipedia. They'll be more stable than ids minted by wikipedia whose article titles can change. DBpedia uses English language wikipedia urls for identifiers, thus non-English pages don't appear in it. wikidata will help with global ids for some things.

semantic integration and handshaking doable in a way that provides useful results

hard semantics very difficult to maintain ... how to get a computer to find out for you that a painting by date, color, etc ... difficult

can query systems and let human sort through results

rdf vocabularies and standards are very complex crowdsourcing will get you to the level of skos perhaps doesn't expect wikidata to implement frbr entities on its own

characters in a book ... id from some sort of open data cloud will reference external ids, and this is already in the specification

needed in order to catalog things not in wikipedia probably not in the owl germany about political entity or geographical area same as relationships can be misleading mappings can be useful

user interface may not require rdf of anyone except the programmer

oclc releasing cataloging as linked data in rdf implementing pre draft to schema.org also releasing entire database?

you don't need a dump because you have urls that you can query with rdf

mass imports into wikidata of oclc? no, wikidata needs curation, not designed for homogeneous data

would love to have a system that says datahub.io has info ... where to get dump, fields needed, who publishes

pulling info into wikidata when and how needed mapping data on demand all info is versioned, each change creates a new version history of source data depends on data provided by foreign source

for any data point or property value, was true at this point in time

for copyright, expression is appropriate data point skipped expression level for wikidata seems useless for wikidata

technical ability varies ... project perspective

could be a wikisource extension look at a scanned book, new essay, called x, about y and z pick topics, popup, query field, skos / lc / other aboutness options

is it about france, germany empire or republic of germany pick what you can, germany, and more accurate version, also about lcsh republic of germany

simple extension for user interface not difficult work for volunteer requiring deep knowledge of semantic web

tell us what you see, pick subjects from sources on list store in format useful in future for wikipedia or libraries

pdf scans in bulk create entities in bulk create new work and expression from essay that begins on page 23

wikidata current scope, till spring of next year is wikipedia will have works described, because works are in wikipedia manifestation, etc. will be there as they are entered in wikipedia

on the manifestation level we have authoritative records work level, community collected records

fine to begin without something in between; we need a work, that has an id, to hang an individual essay on

there aren't yet work ids, no national register of work ids we have wikipedia articles at each level

use case: practical: doing essays

extension for wikidata data entry project preparing data for a future entity separated system aboutness of the work requires human intelligence

categorization in different systems; set is being maintained by huge community at a detailed level

limiting it to the set of wikipedia pages not adequate ... article about person who does not have an article about them, for example

essay book about john smith the third is a stub can go into queue for interested wikipedians can then link to actual data record magazine has review of book of essays

building system and data structures for interlanguage links

do you have info for how we want to apply this?

please go to wiki page and write your own thoughts on how you would apply this

seeing that there is a need to record info about entities used and transcribed in wikisource

unclear whether it should be attached to actual file on wikisource or in wikidata, a technical question that may not need to concern the user

claim made by a wikisource editor is primary information this needs to be reflected in a different data structure, who thought this about that

already exists for label, what is this thing called. no external source available for what is the name of the label

sees what is needed ... we might be going into this area next year after initial phase

also important, needs to find sponsors and donors to keep the development going next year for that we need use cases, something that shows why this effort is useful

have a value proposition to make here to other organizations that would like to do this but are not staffed or funded to do so

most large owners of metadata do not have essay level cataloging

currently only full text search (where available) to find things like poem, "A Dream" even title level without the aboutness

also hard to search for stuff about wikipedia, you get so many articles that just happen to be on wikipedia ... add minus wikipedia.org

need to figure out feeding pipeline for where we get the material use that we will categorize

human added value, adding concepts not specifically mentioned in the text

we need to have something that can handle copyright metadata ...

one mega gadget that will run on specific .... and also separately fill out another field that will add an aboutness ...

aboutness does not need to wait for wikidata wikipedia link is a good indicator of aboutness public domain locator is available has a project up that covers many cases, useful

current state of metadata in wikisource had a presentation

suggest a gadget that generates templates for now rather than an extension that does something special

about half a year for wikidata to get to the point where it can store this level of detail in mediawiki

infrastructure will be there, you write an extension that will cover the special case

Hamlet example: With new "entities" (like variables, or fields) in a cataloging database, we could have a way to classify/annotate which works have Hamlet as a character (not just the major play by Shakespeare) and to query on that property. One could thus find translations in other languages; works not by Shakespeare which use Hamlet as a character; and academic theses and publications which discuss Hamlet as a character or quote from works about Hamlet the character.

Template:Creator has a link for Wikisource http://commons.wikimedia.org/wiki/Template:Creator