Wikisource Loves Manuscripts
Wikisource Loves Manuscripts is a call-to-action and a project to support the digitization of manuscripts on Wikisource.
Background
Wikisource Loves Manuscripts pilot in Indonesia
Pusat Pengkajian Islam dan Masyarakat (PPIM), a research institute based in Jakarta, will lead the Wikisource Loves Manuscripts pilot in Indonesia in collaboration with Wikimedia Indonesia and the community-led WikiLontar project, with the support of the Wikimedia Foundation.
Regions
The project will focus on manuscript rescue on three islands: Bali, Java and Sumatra. Manuscripts from these islands are richly diverse in language, script, writing support, and text content.
Timeline
- October to December 2022 - Project planning and announcement
- January to March 2023 - First preservation mission & proofread-a-thon
- April to June 2023 - Second preservation mission & proofread-a-thon
- July to September 2023 - Third preservation mission & proofread-a-thon
Primary Activities
Manuscript digitization
The core activity of this project is to digitize manuscript collections that are in danger of being damaged and that belong to individuals and institutions (libraries, museums, etc.). All pages of each manuscript will be photographed (or scanned), and a digital copy will be uploaded to Wikimedia Commons under a suitable Creative Commons license. Each manuscript bundle will be provided with sufficient metadata via Wikidata.
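As a rough illustration of what "sufficient metadata" could mean in practice, the sketch below models one manuscript bundle and checks it for empty fields before upload. The choice of fields, the `ManuscriptRecord` class, and its methods are assumptions made for this example and not the project's actual schema. The property IDs are existing Wikidata properties, but a real Wikidata statement would use item IDs rather than the plain-text labels shown here.

```python
from dataclasses import dataclass

# Existing Wikidata properties; which ones the project actually uses is an assumption.
PROPERTY_IDS = {
    "language": "P407",    # language of work or name
    "script": "P282",      # writing system
    "material": "P186",    # made from material
    "collection": "P195",  # collection
}


@dataclass
class ManuscriptRecord:
    """Hypothetical description of one digitized manuscript bundle."""
    title: str
    language: str
    script: str
    material: str
    collection: str
    page_count: int

    def missing_fields(self):
        """Return the names of fields that are still empty or zero."""
        return [name for name, value in vars(self).items()
                if value in ("", None, 0)]

    def to_statements(self):
        """Map the descriptive fields to their Wikidata property IDs."""
        return {pid: getattr(self, name) for name, pid in PROPERTY_IDS.items()}


record = ManuscriptRecord(
    title="Example lontar",
    language="Balinese",
    script="Balinese script",
    material="palm leaf",
    collection="",  # not yet known, so the check below reports it
    page_count=42,
)
print(record.missing_fields())
print(record.to_statements())
```

A check like `missing_fields()` could be run before each upload so that incomplete records are caught while the manuscript is still at hand.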
Wikisource proofread-a-thon
Manuscripts that have been uploaded to Wikimedia Commons with metadata will then be transcribed on Wikisource. Volunteers will type each manuscript in the script used in the original. For this reason, there will be an introduction to how Wikisource handles typing in non-Latin scripts. In the next stage, a competition will be held to transcribe the digitized manuscripts.
Transkribus pilot
Texts on Wikisource are transcribed through a mix of automated text recognition and community corrections. Good quality Optical Character Recognition (OCR) allows contributors to focus on improving the quality of content, through proofreading, rather than doing the full transcription manually. It is a prerequisite for scaling Wikisource projects. The Foundation’s CommTech team improved Wikisource by integrating two OCR engines, Google OCR and Tesseract. But many languages and documents are still not supported with high-quality on-wiki OCR, including the Balinese and Javanese language Wikisources that launched in 2021.
- Transkribus (website) is an AI-powered text and handwriting recognition tool which can be used to create OCR models based on transcriptions on Wikisource. Initial research found no other text and handwriting recognition tool that can be trained to support any language. There is also existing community demand (West Bengal Wikimedians) and partner engagement (British Library) with Transkribus.
- A team from IIIT Hyderabad with expertise in computer vision and applied machine learning will test the viability of Transkribus with the under-supported languages of South-East Asia. In the first phase of the pilot, we will be using Balinese language documents already transcribed by volunteers on Wikisource, in order to build a new OCR model.
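One common way to judge whether an OCR model is good enough to leave volunteers with proofreading rather than full transcription is the character error rate (CER): the edit distance between the model output and a trusted transcription, divided by the length of the transcription. The sketch below is a minimal, self-contained illustration of that metric. It is not the evaluation method used by the pilot team or by Transkribus, and the function names are chosen for this example only.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in code points."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# A volunteer transcription serves as the reference, OCR output as the hypothesis.
print(character_error_rate("abcd", "abxd"))  # one substitution in four characters
```

Python strings are compared code point by code point here. For scripts such as Balinese or Javanese, where one visible character can consist of several code points, counting by grapheme cluster may give a more meaningful figure; that refinement is left out of this sketch.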
Updates
Team
PPIM
- Ilham Nurwansah (Wikimedian in Residence)
- Abdullah Maulani (Community Coordinator)
Wikimedia Indonesia
IIIT Hyderabad
- Dr. Ravi Kiran
Wikimedia Foundation
- Satdeep Gill (Culture & Heritage team)
- Sakti Pramudya (Partnerships team)
- Praveen Das (Partnerships team)
- Sam Wilson (Community Tech team)
Learning partners sign-up
Affiliates, communities, and organizations with an interest in manuscript digitization are invited to sign up below to become a learning partner on this project:
- Example: Affiliate/community/organization name, focus languages/scripts, any other details including links
- Wikimedia Nigeria User Group/Bukky658 for the National Library of Nigeria Kwara State, English/Yoruba
- Punjabi Wikimedians User Group for the second phase of Digitalizing Punjabi Manuscripts. We ran the first phase as a pilot project in which we scanned 4 manuscripts totaling more than 2,000 pages. We now plan to expand the project to include 6 manuscripts of approximately 20,000 pages. We are covering old and rare manuscripts that urgently need preservation. Further details are available.
- Wikimedia Thailand/2ndoct interested in working on Tai Tham script (Lanna script)
- Bikol Wikipedia Community/Filipinayzd interested in working on digitization of old newspapers and literary manuscripts in the Central Bikol language, in partnership with the James O'Brien SJ Library of Ateneo de Naga University, and the University of Nueva Caceres Museum
- West Bengal Wikimedians, Bengali language and script.
- Wikimedia Community User Group Malaysia/Tofeiku interested in digitizing old Malay manuscripts in Jawi script in Malaysia, particularly those held by the National Library of Malaysia.
- ...