Wikisource Loves Manuscripts

From Meta, a Wikimedia project coordination wiki

Wikisource Loves Manuscripts is a call to action and a project to support the digitization of manuscripts on Wikisource.

Background

In 2020–21, the Wikimedia Foundation funded two projects that helped create a new Wikisource in the Balinese language of Indonesia. One project focused on creating the technology to support the transcription of handwritten palm-leaf manuscripts on Wikisource, while the other focused on scanning and digitizing more manuscripts from archives and individual collectors. We believe this is a replicable strategy for engaging with culture and heritage in Southeast Asia.

Wikisource Loves Manuscripts pilot in Indonesia

Pusat Pengkajian Islam dan Masyarakat (PPIM), a research institute based in Jakarta, will lead the Wikisource Loves Manuscripts pilot in Indonesia in collaboration with Wikimedia Indonesia and the community-led WikiLontar project, with the support of the Wikimedia Foundation.

Regions

The project will focus on manuscript rescue on three islands: Bali, Java, and Sumatra. Manuscripts from these islands are richly diverse in language, script, writing support, and text content.

Timeline

  • October to December 2022 - Project planning and announcement
  • January to March 2023 - First rescue mission & proofread-a-thon
  • April to June 2023 - Second rescue mission & proofread-a-thon
  • July to September 2023 - Third rescue mission & proofread-a-thon

Primary Activities

Manuscript digitization

The core activity of this project is to digitize manuscript collections, held by individuals and institutions (libraries, museums, etc.), that are in danger of being damaged. All pages of each manuscript will be photographed or scanned, and the digital copies will be uploaded to Wikimedia Commons under a suitable Creative Commons license. Each manuscript bundle will be given adequate metadata via Wikidata.
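As an illustration only, the check that a manuscript bundle is ready for upload could be sketched as below. The record type, its field names, and the helper are hypothetical and are not taken from the project; the actual metadata model on Wikidata may differ.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class ManuscriptRecord:
    """Hypothetical metadata for one manuscript bundle (all field names are assumptions)."""
    title: str
    language: str
    script: str
    writing_support: str  # e.g. "palm leaf" or "paper"
    holder: str           # owning individual or institution
    license: str          # e.g. "CC BY-SA 4.0"
    page_files: List[str] = field(default_factory=list)  # one image file per page


def missing_fields(record: ManuscriptRecord) -> List[str]:
    """Return the names of fields that are still empty, in declaration order."""
    missing = []
    for f in fields(record):
        value = getattr(record, f.name)
        if isinstance(value, str):
            value = value.strip()  # treat whitespace-only text as empty
        if not value:
            missing.append(f.name)
    return missing
```

A record would be considered complete, under these assumptions, when `missing_fields` returns an empty list.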

Wikisource proofread-a-thon

Once a manuscript has been uploaded to Wikimedia Commons and given metadata, it will be transcribed on Wikisource. Volunteers will type the text in the same script used in the manuscript, so the project will include an introduction to how Wikisource handles typing in non-Latin scripts. In the next stage, a competition will be held to transcribe the digitized manuscripts.

Transkribus pilot

Texts on Wikisource are transcribed through a mix of automated text recognition and community corrections. Good-quality Optical Character Recognition (OCR) lets contributors focus on improving content through proofreading rather than transcribing everything manually. It is a prerequisite for scaling Wikisource projects. The Foundation’s CommTech team improved Wikisource by integrating two OCR engines, Google OCR and Tesseract. However, many languages and documents are still not supported by high-quality on-wiki OCR, including the Balinese and Javanese language Wikisources that launched in 2021.

  • Transkribus is an AI-powered text and handwriting recognition tool that can be used to create OCR models based on transcriptions on Wikisource. Our initial research found no other text and handwriting recognition tool that can be trained to support any language. There is also existing community demand (West Bengal Wikimedians) and partner engagement (British Library) with Transkribus.
  • A team from IIIT Hyderabad with expertise in computer vision and applied machine learning will test the viability of Transkribus with the under-supported languages of Southeast Asia. In the first phase of the pilot, we will use Balinese language documents already transcribed by volunteers on Wikisource to build a new OCR model.
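The page does not say how the viability of a model will be measured. One common metric for text recognition is the character error rate (CER): the edit distance between the model output and a reference transcription, divided by the length of the reference. The sketch below is an assumption about how such a comparison against volunteer transcriptions might look, not the pilot's actual method; the function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER of an OCR output (hypothesis) against a human transcription (reference)."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Python strings are compared here by Unicode code point. For scripts such as Balinese, where one visible character can span several code points, counting by grapheme cluster may be more appropriate; that choice is left open in this sketch.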

January 2023 Update

READ-COOP gave a workshop on creating Transkribus models to the IIIT Hyderabad team and the Wikisource Fellows. READ-COOP has released the workshop video and the slide deck under a CC BY-NC license.

Team

PPIM

Wikimedia Indonesia

IIIT Hyderabad

  • Dr. Ravi Kiran

Wikimedia Foundation

Learning partners sign-up

Affiliates, communities, and organizations with an interest in manuscript digitization are invited to sign up below to become a learning partner on this project:

  • Example: Affiliate/community/organization name, focus languages/scripts, any other details including links
  • Wikimedia Nigeria User Group/Bukky658 for the National Library of Nigeria Kwara State, English/Yoruba
  • Punjabi Wikimedians User Group for the second phase of Digitalizing Punjabi Manuscripts. We ran the first phase as a pilot project in which we scanned 4 manuscripts totaling more than 2,000 pages. We now plan to expand the project to include 6 manuscripts of approximately 20,000 pages. We are covering old and rare manuscripts that urgently need to be preserved. More details are available.
  • Wikimedia Thailand/2ndoct, interested in working on Tai Tham script (Lanna script)
  • ...