Wikisource reader app/Selection

From Meta, a Wikimedia project coordination wiki
Outreachy Round 31
7 October 2025 – 6 March 2026 Round 31 of Outreachy

Workflow of transcription on Wikisource

Before selecting books for the app, it was necessary to understand the Wikisource workflow. That is difficult in itself, because different language versions follow different workflows, in which some steps overlap and others do not. To get around this difficulty, a workflow for printed materials such as books, periodicals and newspapers was outlined as comprehensively as possible; an ideal Wikisource language community can adopt it if it wishes. For convenience, workflows for audio and video content, which are not yet well developed on Wikisource, are left out.

The common workflow that almost all Wikisource language communities adopt for transcribing content is as follows:

  1. Identification - The first step is to identify printed materials that are within the scope of Wikisource, considering copyright status, publication status, etc.
  2. Digitization - The next step is to scan these materials, either afresh or by collecting existing scans from digital and physical resources where they have already been digitised.
  3. Upload - Once the digitised materials are available, they are uploaded to Wikimedia Commons.
  4. Indexing - The uploaded materials are then brought into Wikisource in the Index namespace and checked for missing, duplicate or bad scans, in order to create indexed pagelists and tables of contents.
  5. Proofreading - Wikisource community volunteers then OCR and proofread every page of these materials.
  6. Validation - A second group of volunteers checks the proofread pages again and validates them.
  7. Transclusion - After proofreading and/or validation is done, the content is transcluded into the main namespace, properly divided into chapters etc. according to the table of contents. This step makes the work ready for readers.

Note: This is the basic workflow of Wikisource, which all communities are expected to follow. Unfortunately, some communities skip critical steps, such as creating tables of contents or the entire transclusion step, for various reasons.

Workflow of Wikidata integration with Wikisource

Apart from the basic workflow above for transcribing and creating an e-book, communities differ in how they add metadata about the materials, such as the names of authors, publishers, publication years, etc. In practice, communities adopt one of three scenarios, or a combination of them.

  1. No metadata - Volunteers sometimes add no metadata at all, or add it only partially, on Wikisource index pages or transcluded pages. This is the worst scenario and needs to be avoided.
  2. Metadata stored locally - The majority of language communities store metadata locally: on Wikimedia Commons in the file description, and/or on Wikisource in designated fields of the Index namespace, and/or in the header of transcluded pages. This can lead to duplicated effort, a higher chance of error, data redundancy, etc.
  3. Metadata stored on Wikidata - Very few Wikisource language communities store metadata centrally as Wikidata items and round-trip it back to Wikimedia Commons (in the file description) and/or to Wikisource (in designated fields of the Index namespace and/or in the header of transcluded pages). This is the ideal scenario, which makes it possible to fully leverage the power of Wikidata.

Bibliographic data model on Wikidata

For a Wikisource mobile reading app, the actual content and the metadata are equally important, so that readers can not only read the content but also navigate and search it easily. Storing metadata in a central database like Wikidata is therefore preferable, because it can then be queried and reused easily.

Keeping steps 1 to 7 of content transcription and scenario 3 of metadata storage in mind, suitable criteria for selecting Wikisource content to make available to readers can be drafted.

The material needs to:

  1. be digitised and uploaded to Wikimedia sites
  2. have an index page
  3. be completely proofread (at least, if not validated)
  4. be completely transcluded, divided into chapters, if any
  5. have metadata stored centrally and accurately, following the Wikidata Books data model, with the following linkages on the respective Wikidata items:
    1. title in native language (mandatory)
    2. language of work (mandatory)
    3. author(s), editor(s) (if any), translator(s) (if any)
    4. date of publication, publisher, place of publication
    5. Wikisource index page URL (mandatory, not more than one value)
    6. Wikisource sitelink of the transcluded page with a proofread or validated badge (mandatory)
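
The criteria above can be sketched as a simple checklist function. This is only an illustration: the `BookRecord` structure and all of its field names are hypothetical and are not taken from the app, the API or Wikidata. It assumes that the facts about a work have already been gathered from Wikisource and Wikidata.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BookRecord:
    """Hypothetical summary of one work; all field names are illustrative."""
    digitised_and_uploaded: bool = False
    has_index_page: bool = False
    fully_proofread: bool = False
    fully_transcluded: bool = False
    native_title: Optional[str] = None
    languages: List[str] = field(default_factory=list)
    index_page_urls: List[str] = field(default_factory=list)
    has_badged_sitelink: bool = False


def unmet_criteria(book: BookRecord) -> List[str]:
    """Return the names of the mandatory criteria that the record fails."""
    checks = [
        ("digitised and uploaded", book.digitised_and_uploaded),
        ("index page", book.has_index_page),
        ("completely proofread", book.fully_proofread),
        ("completely transcluded", book.fully_transcluded),
        ("title in native language", bool(book.native_title)),
        ("language of work", len(book.languages) > 0),
        # Mandatory and "not more than one value" together mean exactly one.
        ("single index page URL", len(book.index_page_urls) == 1),
        ("badged sitelink", book.has_badged_sitelink),
    ]
    return [name for name, ok in checks if not ok]
```

A work qualifies when `unmet_criteria` returns an empty list. Optional linkages such as editors or translators are deliberately left out of the check.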

Example

Let’s get such a list for Bangla Wikisource with this SPARQL query - https://w.wiki/HcKo

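# Bangla Wikisource works with a badged sitelink and exactly one index page URL
SELECT ?item WHERE {
  # Sitelinks on bn.wikisource.org and the Wikidata item each one describes
  ?sitelink schema:isPartOf <https://bn.wikisource.org/>;
            schema:about ?item.
  # Keep sitelinks that carry either of the two accepted badges
  { ?sitelink wikibase:badge wd:Q20748092. }
  UNION
  { ?sitelink wikibase:badge wd:Q20748093. }
  # P1957: Wikisource index page URL
  ?item wdt:P1957 ?indexPage .
}
GROUP BY ?item
# Keep items that have exactly one distinct index page URL
HAVING (COUNT(DISTINCT ?indexPage) = 1)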

An example item of one such book looks like this.

You can change the language code in the above SPARQL query from bn to that of your preferred language to see the number of books available for your language in the app.
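
Swapping the language code can also be done programmatically. The sketch below fills the query from this section into a template. The function name `build_query` and the validation pattern for language codes are assumptions made for illustration, not part of any existing tool.

```python
import re

# The query from this section, with the language code left as a placeholder.
# Doubled braces are literal braces for str.format().
QUERY_TEMPLATE = """\
SELECT ?item WHERE {{
  ?sitelink schema:isPartOf <https://{lang}.wikisource.org/>; schema:about ?item.
  {{ ?sitelink wikibase:badge wd:Q20748092. }}
  UNION
  {{ ?sitelink wikibase:badge wd:Q20748093. }}
  ?item wdt:P1957 ?indexPage .
}}
GROUP BY ?item
HAVING (COUNT(DISTINCT ?indexPage) = 1)
"""


def build_query(lang: str) -> str:
    """Return the selection query for one Wikisource language subdomain."""
    # Assumed, conservative shape of a language code: lowercase letters,
    # optionally in hyphen-separated parts. This keeps stray characters
    # out of the IRI inside the query.
    if not re.fullmatch(r"[a-z]+(?:-[a-z]+)*", lang):
        raise ValueError(f"unexpected language code: {lang!r}")
    return QUERY_TEMPLATE.format(lang=lang)
```

For example, `build_query("fr")` yields the same query against `https://fr.wikisource.org/`, and the resulting text can be pasted into the query service that the short link above opens.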

The API

An API was developed that serves as a catalogue, or index, of books that follow the books data model described above. The API was built using Django and deployed on Toolforge. It periodically runs a set of SPARQL queries to retrieve data, processes that data and updates the database. The API is available at https://wsindex.toolforge.org/books/ and the source code at https://codeberg.org/ph4ni/wsindex

Data that can be fetched from the metadata API

Key | Description | Sourced from
wikidata_qid | QID of the book | Wikidata
title | Title in English | Title (P1476) in English or Label in English or Title in mul or Label in mul, in that order of priority
title_native_language | Title in the native language | Title (P1476) statement in native language or Label in native language or Title in mul or Label in mul, in that order of priority
languages | List of languages the book is in | P407 - language of work or name
date_of_publication | Date published | P577 - publication date
authors | List with Author label and Wikidata QID | P50 - author
editors | List with Editor label and Wikidata QID | P98 - editor
translators | List with Translator label and Wikidata QID | P655 - translator
genre | List of genres of the book | P136 - genre
type_of_work | Form of creative work | P7937 - form of creative work
ws_url | Link to the Wikisource page | Wikisource sitelink
thumbnail_url | Link to the thumbnail version of the cover page | From the file linked to P18 (image) or from the qualifier P4714 (title page number) to the value of P996 (document file on Wikimedia Commons)
epub_url | Link to the epub file | Downloadable URL based on WS Export
wikisource_index_url | Link to index page | P1957 - Wikisource index page URL
view_count | Number of views of the ws_url page in the last year | Page views from Wikimedia Analytics API
subjects | Subjects of the work | P921 - main subject
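
The fallback order given for `title` and `title_native_language` can be illustrated with a short sketch. This is not the code used by the API. The function name `pick_title` and the shape of its inputs (plain dictionaries keyed by language code) are assumptions made for illustration.

```python
from typing import Dict, Optional


def pick_title(titles: Dict[str, str], labels: Dict[str, str], lang: str) -> Optional[str]:
    """Choose a title for `lang` using the priority order from the table.

    titles: Title (P1476) values keyed by language code (assumed input shape)
    labels: item labels keyed by language code (assumed input shape)
    """
    candidates = (
        titles.get(lang),   # 1. Title (P1476) in the requested language
        labels.get(lang),   # 2. Label in the requested language
        titles.get("mul"),  # 3. Title in mul
        labels.get("mul"),  # 4. Label in mul
    )
    for candidate in candidates:
        if candidate:  # skip missing and empty values
            return candidate
    return None
```

Calling it with `lang="en"` corresponds to the `title` field, and calling it with the language of the work corresponds to `title_native_language`.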
Querying the API

URL | Description
https://wsindex.toolforge.org/books/ | Base URL, which returns 32 results with pagination
https://wsindex.toolforge.org/books/?page=2 | Example of a page
https://wsindex.toolforge.org/books/Q51543972/ | Get a book by its QID. Also works without the prefix 'Q'
https://wsindex.toolforge.org/books/?languages=fr | Get books by language code
https://wsindex.toolforge.org/books/?search=India | Search books by title and author names
https://wsindex.toolforge.org/books/?genres=dictionary | Get books by genres
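
The URL patterns listed above can be assembled with a small helper. The sketch below only builds URLs and makes no network calls. The helper names `book_url` and `list_url` are hypothetical, and combining several filters in one request is an assumption, since the table only shows them one at a time.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://wsindex.toolforge.org/books/"


def book_url(qid: str) -> str:
    """URL of a single book, e.g. book_url("Q51543972")."""
    return f"{BASE_URL}{qid}/"


def list_url(page: Optional[int] = None, **filters: str) -> str:
    """URL of a book listing, optionally filtered and paginated.

    Filter names mirror the query parameters in the table:
    languages, search and genres.
    """
    params = dict(filters)
    if page is not None:
        params["page"] = page
    # urlencode percent-encodes values, so search terms may contain spaces.
    return BASE_URL + ("?" + urlencode(params) if params else "")
```

For instance, `list_url(languages="fr")` reproduces the language example from the table, and `list_url()` returns the base URL unchanged.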