OpenSpeaks/Archives
OpenSpeaks Archives is a public digital archive supporting community-led documentation of low-resourced languages. It contributes to Wikimedia projects, communities, and the open knowledge movement. So far, we have documented 20 South Asian languages and improved nearly 900 pages across 100+ Wikipedia and Wikimedia projects.
Framework • Tools • Captioning Guide • Media
-
A native speaker shares the barriers to native-language education in his near-extinct language, Gorum
-
A Marcha-Rongpo-speaking couple living in a city remembers their childhoods in different villages
-
A Jaunpuri-Garhwali speaker discussing access to public information and social welfare
-
A joke narrated by author Surendra Singh Pangtey in Johari-Kumaoni dialect
-
Overview
OpenSpeaks Archives addresses three broader gaps: content gaps about Indigenous and other low-resourced languages, a lack of practical tools for community-led audiovisual documentation, and citation bias against oral history as a source of knowledge. Working with community language archivists, GLAM partners, and Wikimedians, we create citable, accessible media to embed and use as sources on Wikimedia projects and open educational resources, and technological tools to support language documenters.
Our 2024–2025 pilot archived five tongues (languages and dialects), and our first phase (2025–2026) aims to document 13 tongues from three South Asian countries.
Why it matters
Although Wikimedia projects are multilingual and diverse, many living languages and their speaker communities remain largely invisible in both Wikimedia content and community participation. This is the case even though fictional languages are richly documented. Oral histories from Majority World communities are rarely treated as citable knowledge within Wikimedia policies and workflows, limiting how far community-held knowledge can travel. OpenSpeaks Archives challenges this status quo by demonstrating how oral history can be recorded ethically, adhering to community protocols, consent, and verifiability, and aligning with emerging movement work on oral citations. The project aims to make it normal, not exceptional, to cite low-resourced language oral history on Wikipedia, Wikidata, and sister projects.
What we do
The project works across three main areas:
Educational resources
- OpenSpeaks Oral History Framework: a practical framework for documenting oral history using FAIR–CARE principles, peer-reviewed in a short version at Wiki Workshop 2026.[1]
- OpenSpeaks Captioning: guidance on creating accessible, multilingual subtitles and captions for audio and video used in Wikimedia projects.
- OpenSpeaks on Wikiversity: toolkit covering planning, recording, consent, accessibility, and publishing workflows for language documentation.
These resources encourage community archivists, Wikimedians, and GLAM practitioners to learn at their own pace and then adapt the workflows to their languages and contexts.
Tools and infrastructure
Several open-source prototype tools are being developed as part of the project, documented at OpenSpeaks/Tools:
- OpenSpeaks Subtitler: the key webapp, which lets community subtitle creators and translators subtitle offline and online; integrated with Wikimedia Commons to pull audio/video.
- Media Metadata Viewer & Compress Helper: helps inspect media properties quickly and generate compression commands for sharing and editing.
- Media Duration Calculator: batch-calculates total duration of media in folders to support budgeting and project planning.
- Multimedia Folder Organizer and related utilities: support structured naming, tagging, and transcript word counting, so production workflows can be replicated and scaled.
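As a rough illustration of what "generating a compression command" could look like, the sketch below builds an ffmpeg command line for a smaller sharing copy of a recording. The helper name `build_compress_command` and the defaults (CRF 28, 128k audio, MP4 output) are assumptions made for this example, not settings taken from the OpenSpeaks tools; it also assumes collaborators have ffmpeg with libx264 and AAC support installed.

```python
import shlex
from pathlib import Path


def build_compress_command(source: str, crf: int = 28, audio_bitrate: str = "128k") -> str:
    """Return an ffmpeg command string that re-encodes `source` to a smaller MP4.

    Hypothetical helper: the defaults are illustrative, not OpenSpeaks settings.
    """
    src = Path(source)
    # Write next to the original, e.g. interview.mov -> interview_compressed.mp4
    target = src.with_name(f"{src.stem}_compressed.mp4")
    args = [
        "ffmpeg",
        "-i", str(src),
        "-c:v", "libx264",    # H.264 video
        "-crf", str(crf),     # higher CRF = smaller file, lower quality
        "-preset", "medium",
        "-c:a", "aac",        # AAC audio
        "-b:a", audio_bitrate,
        str(target),
    ]
    # shlex.join quotes arguments containing spaces so the command can be pasted into a shell
    return shlex.join(args)


if __name__ == "__main__":
    print(build_compress_command("interview 01.mov"))
```

The function only produces the command text and never runs ffmpeg, so an archivist can review or adjust it before compressing anything.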
Communities and languages
Our primary focus is to gradually build capacity within communities so that they can document their languages themselves.
OpenSpeaks Fellows, who are native speakers and community organisers, guide the work in three regional clusters. They review and subtitle media, ensure consent and community ownership, and mentor new archivist-Wikimedians in their languages. For our 2024–2025 pilot, we focused on one language from Nepal and four tongues from India. During our first implementation phase, we focus on four clusters: Nepal (two languages), northern India (five tongues), eastern–southeastern India (five languages), and Sri Lanka (one language).
| Cluster | Language (ISO 639 code) | Fellow/Coordinator | Resource persons/other collaborators | Interviewees |
|---|---|---|---|---|
| Northern India | Marcha (dial. Rongpo - rnp) | Kimmi Pal | Bimla and K.S. Badwal | |
| Northern India | Johari (dial. Kumaoni - kfy) | Surendra Singh Pangtey | Bhuppi Pangtey | Surendra Singh Pangtey |
| Northern India | Jaunpuri (dial. Garhwali - gbm) | Arun Gour | Sampati, Bhagwandi, Suchita | |
| Northern India | Jaunsari (jns) | Arun Gour | Deepak Joshi | |
| Northern India | Bangani | Jaiprakash Chauhan | Jaiprakash Chauhan | |
| Eastern-Southeastern India | Sora (srb) | Opino Gomango | Ramani Dalbehera, Namad Dalbehera, Opino Gomango | |
| Eastern-Southeastern India | Juray (juy) | Opino Gomango | Dinabandhu Gomango, Manjula Bhuyan, Srinivas Gomango | |
| Eastern-Southeastern India | Juang (jun) | Opino Gomango (coordinator) | | |
| Eastern-Southeastern India | Gorum/Parengi (pcj) | Boloram Muduli (Fellow), Opino Gomango (coordinator) | | |
| Eastern-Southeastern India | Lambadi (lmn) | Nenavath Mohan | Nenavath Mohan and Meghavath Sathish | |
| Nepal | Saptariya Tharu (thq) | Sanjib Chaudhary | | |
| Nepal | Raji (rji) | Uday Raj Aaley | | |
| Nepal | Kumhali (kra) | Uday Raj Aaley | | |
-
Uday Raj Aaley, researcher and OpenSpeaks Fellow, speaking with Kaliprasad Raji, a Raji-language elder during documentation in a public place in the latter's village
-
Fellow Opino Gomango interviewing Kshetrabasi Gomango to document the Juang language
-
In community-based language documentation like the Ho-language documentation here, often a group of community members reviews what is being recorded. This process informed our Oral History Framework.
Publications
- Panigrahi, Subhashish; Gomango, Opino; Pal, Kimmi (2026). OpenSpeaks Archives: Citing Low-Resourced Language Oral History Multimedia. Wikimedia Foundation.
- Panigrahi, Subhashish (2026-03-31). "OpenSpeaks Archives: Language Documentation Field diary (July 2025–March 2026)". Diff. Retrieved 2026-04-05.
Phase 1 (2025–2026)
Goals and expected outcomes
Activities and timeline
Output
Overall project
- 2025
- July:
- Discussed the project plan with archivists.
- Discussed collaboration on tool development with the Indic MediaWiki User Group.
- Coordinated early funding, resource persons, and other logistics.
- August:
- Decided to call all language leads OpenSpeaks Fellows.
- Organised the Language Documentation for Open Knowledge Workshop in Dehradun (15 August).
- Developed prototype tools that directly address technical gaps identified in last year's pilot: creating and editing subtitles, inspecting media properties and compressing files for sharing/editing, and batch-calculating total media duration inside folders for project planning/budgeting.
- Shared prototypes with the Indic MediaWiki User Group and discussed collaborating on tool development.
- In-person field translation conducted for Johari in Dehradun, India. One output video used in multiple Wikipedia articles.
- To plan for subtitling and translation, met in person with Arun Gour, OpenSpeaks Fellow for Jaunpuri, who flagged the need for a tutorial to understand subtitling.
- Kimmi Pal, Fellow for Rongpo, finished the first draft of subtitles for the interviews of Bimla and K.S. Bharwal.
- September:
- Seven OpenSpeaks Fellows confirmed participation.
- Created three prototype tools addressing all technical gaps identified in the pilot:
- OpenSpeaks Subtitler: key tool; detects pauses, generates dummy subtitles and helps subtitle offline.
- Media Optimizer: displays essential metadata of audio/video files and generates a command-line prompt to compress media for sharing with collaborators.
- Additional tools prototyped for media folder organisation and transcript word counting.
- Successful follow-up meeting with Indic MediaWiki Developers User Group; key members expressed interest in collaborating on tool-building.
- Presented at Celtic Knot (virtual, 23 September), initiating a broader movement conversation on oral citation practices.
- First phase of language documentation begun in Gorum/Parengi and Juang languages by Opino Gomango in Mysore, India.
- Kimmi Pal (Rongpo) finished subtitling most of the Rongpo recordings; videos edited and subtitles integrated, scheduled for publication on Wikimedia Commons in mid-October.
- October:
- OpenSpeaks/Tools page created on Meta-Wiki, documenting all tools in development.
- Five prototype tools created as webapps (to be published once fully tested):
- Commons Metadata Generator: generates wikicode for files created during a language documentation project.
- Media Folder Analyzer: analyses audio/video duration for project budgeting.
- Multimedia Folder Organizer: organises, tags, categorises and renames media files in a production folder.
- Transcript Word Counter: counts transcript words to support billing and project planning.
- Print Subtitles: prints subtitles for translators to correct offline.
- First batch of subtitled oral history videos from multiple languages prepared for upload to Wikimedia Commons.
- Selected for participation in Wikimedia Futures Lab, to be held in Frankfurt from 30 January–1 February 2026.
- In conversation with the Songhay language diaspora community to support mentorship and the Wikipedia incubation process for the Songhay language.
- November:
- Invited to Deutsche Welle (DW) Akademie's invite-only "The next chapter: Journalism in the age of AI" gathering (Chiang Mai, Thailand, 25 November).
- Presented OpenSpeaks Archives at FosterLang: "Strategies to Increase Equitable Forms of Exchange and Partnerships in Support of Linguistic Capital and Language Justice", a working group organised by Linguapax International.
- Indic MediaWiki User Group collaboration still being confirmed; contingency plan to engage a paid developer if needed, focusing on essential tool feedback from active users.
- December:
- Spoke at WikiConference Kerala 2025: "Digital Tools and Strategy for Indigenous Languages" (recorded talk).
- Significant progress with media translation across multiple languages; processed media uploaded to Commons and embedded into Wikipedia and Wikidata entries.
- Two new Wikimedia Commons templates created for community use:
- Audio/video template: for OpenSpeaks Archives audiovisual files, reusable by others.
- Image/presentation template: for slide decks and visual documentation.
- Verbal agreement for a GLAM collaboration with a New Delhi, India-based institution related to language data, capacity building, educational resources, and tools.
- New GLAM partnership finalised with a European public archive to permanently archive all videos in the OpenSpeaks archive; outcomes include:
- Assigning a DOI to each video, enabling citation on Wikipedia, Wikidata, and other Wikimedia projects.
- Field linguist training for Wikimedians interested in language documentation, to be delivered by the archive's staff.
- OER to be created on how language archivists can contribute data to the archive and subsequently to Wikimedia projects.
- OpenSpeaks to act as a bridge between Wikimedian-archivists and the archive, rather than as a gatekeeper; formal agreement to be signed soon.
- Webapp created for adding rich metadata to Wikimedia Commons.
- Invited to and recorded for a podcast by Radio Taiwan International (Taiwanese public broadcaster).
- 2026
- January:
- Moderated a panel on community activism at the AI Impact Summit 2026 pre-Summit event.
- Tech lead finalised for building tools; prototyping and UI design in progress.
- Submitted a paper to Wiki Workshop 2026, co-authored with two OpenSpeaks Fellows, Opino Gomango and Kimmi Pal.
- Acquisition of Eastern Tharu media archive from Sanjib Chaudhary completed; currently being edited for publication, retaining his copyright.
- Media in three languages processed, subtitled, and published on Commons; used in Wikipedia and Wikidata.
- February:
- Participated in Wikimedia Futures Lab; co-created the conceptual model for WikiVoice with Wikimedians from Nigeria and Indonesia.
- Wiki Loves Languages co-launched with the Dagbani Wikimedians User Group, Igbo Wikimedians User Group, Odia Wikimedians User Group and Wikimedians of Santali Language User Group.
- March:
- Oral History Framework released, incorporating feedback received on the accepted talk at Wiki Workshop.
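The September entry describes the OpenSpeaks Subtitler as detecting pauses and generating dummy subtitles. A minimal sketch of that idea follows, assuming the audio has already been reduced to one loudness value per fixed-length frame. The function names, the threshold logic, and the `[subtitle N]` placeholder text are all hypothetical; the Subtitler's actual method is not documented here and may differ.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def speech_segments(levels, frame_seconds, threshold, min_pause_frames):
    """Group frames at or above `threshold` into segments split by long pauses.

    Assumption: `levels` holds one precomputed loudness value per frame.
    """
    segments = []
    start = None       # index of the first loud frame in the current segment
    last_loud = None   # index of the most recent loud frame
    for i, level in enumerate(levels):
        if level >= threshold:
            if start is None:
                start = i
            last_loud = i
        elif start is not None and i - last_loud >= min_pause_frames:
            # The pause is long enough: close the segment at the last loud frame
            segments.append((start * frame_seconds, (last_loud + 1) * frame_seconds))
            start = None
    if start is not None:
        segments.append((start * frame_seconds, (last_loud + 1) * frame_seconds))
    return segments


def dummy_srt(segments):
    """Build SRT text with one placeholder cue per (start, end) segment."""
    cues = []
    for number, (start, end) in enumerate(segments, start=1):
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        cues.append(f"{number}\n{timing}\n[subtitle {number}]\n")
    return "\n".join(cues)


if __name__ == "__main__":
    levels = [0, 5, 5, 0, 0, 0, 5, 0]  # toy loudness values, 0.5 s per frame
    print(dummy_srt(speech_segments(levels, 0.5, threshold=1, min_pause_frames=2)))
```

A translator would then replace each placeholder with the real text while keeping the timings, which is what makes pre-generated cues useful for offline subtitling.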
Tools
- OpenSpeaks Subtitler
- Webapp for creating audio/video subtitles both offline and online.

- Media Metadata Viewer & Compress Helper
- Quickly inspect media properties and compress files for sharing/editing with collaborators.
- Media Duration Calculator
- Batch calculates total media duration of audio and video files inside folders for project planning/budgeting.
- Multimedia Organization Tool
- Organise, categorise, tag, and batch-rename multimedia files (video, audio, image) inside a folder using structured naming conventions for production workflows.
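The batch duration calculation described for the Media Duration Calculator can be sketched with the Python standard library alone, under the simplifying assumption that recordings are uncompressed WAV files. The names `wav_duration_seconds`, `total_duration_seconds`, and `format_hms` are hypothetical; the actual tool covers audio and video more broadly and its implementation is not shown here.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path: Path) -> float:
    """Duration of one WAV file: number of frames divided by frames per second."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_duration_seconds(folder: str) -> float:
    """Sum the durations of all .wav files in `folder` and its subfolders."""
    return sum(wav_duration_seconds(p) for p in sorted(Path(folder).rglob("*.wav")))


def format_hms(seconds: float) -> str:
    """Render a duration as H:MM:SS for budgeting notes."""
    total = int(round(seconds))
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    import sys

    folder = sys.argv[1] if len(sys.argv) > 1 else "."
    print(format_hms(total_duration_seconds(folder)))
```

Run against a production folder, this prints a single total that can feed subtitling and translation budgets; other formats would need a media library or an external probe tool instead of `wave`.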
Open Educational Resources
- Oral History Framework for documenting oral history using FAIR-CARE principles (shorter version peer-reviewed in Wiki Workshop 2026)
- OpenSpeaks Captioning Guide for captioning/subtitling
Awareness & capacity building
- "How we're building a language archive using open source and open licensing". FOSS United. FOSS United. 2026-03-28. Retrieved 2026-03-29. (deck)
- "Explorer Spotlight". National Geographic Society and National Centre for Biological Sciences (NCBS). 2026-01-29. Retrieved 2026-03-16.
- "Cultural Rights, Innovation, and Development in the AI Moment - Towards a Public Domain Framing". UNESCO civil society network on AI ethics and policy. (virtual, 10 March 2026)
- "Wikimedia Futures Lab". Wikimedia Deutschland. 2026-01-29. Retrieved 2026-03-05. (screening of Gyani Maiya; discussion on citing oral history for low-resourced languages; co-built an alpha prototype of WikiVoice, a platform for reliable, citable oral history, together with Tochi Precious and Biyanto R.)
- "Indigenous languages and Small Language Models: Creating Open Source Protocols for Community Toolkits". Official Pre-Summit Event of the AI Impact Summit 2026. Design Beku. 2026-01-21.
- "Digital Tools and Strategy for Indigenous Languages" (PDF). WikiConference Kerala 2025. 2025-12-21. Retrieved 2025-12-24. (video)
- "Speaking in Our Voices: Preserving Local Languages through Wikimedia Projects". Wiki in Africa. 2025-12-02. Retrieved 2025-12-02.
- "Strategies to Increase Equitable Forms of Exchange and Partnerships in Support of Linguistic Capital and Language Justice". Fosterlang. FosterLang, Linguapax International. 2025-10-29. Retrieved 2025-12-02.
- "When Can We Cite Low-Resourced Language Oral Histories in Wikimedia Projects?", Celtic Knot Conference 2025 (virtual, 23 September 2025)
- Language Documentation for Open Knowledge Workshop, Dehradun, India (15 August 2025)
- "OpenSpeaks Archives: Language digital archive for Wikimedia projects", Wikimania 2025 Nairobi (virtual, 8 August 2025)
- "Oral Knowledge in the Digital Commons: Reflections on Documentation and Citational Practice". Future of the Commons Collective (virtual, 28 July 2025)
In media
- Shyn, Oleksandr (2025-12-02). "The Untranslatables Project: Johar, with Subhashish Panigrahi". Radio Taiwan International. Retrieved 2025-12-02.
Friends and collaborators
References
- ↑ Panigrahi, Subhashish; Gomango, Opino; Pal, Kimmi (2026-03-25). OpenSpeaks Archives: Citing Low-Resourced Language Oral History Multimedia. Wiki Workshop 2026. Online: Wikimedia Foundation.
Cite as
- "OpenSpeaks Archives". Endangered Languages Archive. DOI: 2196/c6ab6125-379b-46b2-b756-ee5e1b0d744e. www.elararchive.org. Retrieved 2026-03-25.
