Speech Data Collection
Wikispeech develops tools and infrastructure for collecting speech data. This is a key component in making speech technologies more accessible, inclusive, and available under free licences.
Why collect speech data
High-quality speech data is essential for building text-to-speech (TTS) systems in new languages. Our goal is to make Wikispeech available in as many languages as possible. Beyond this, we aim to create a diverse and openly available speech data repository that includes:
- Words, short phrases, and longer sentences
- Different dialects and language variations
- Speech from people with speech impairments
- Recordings with varying sound quality, reflecting real-world conditions
Each recording is associated with metadata, such as:
- Language and dialect
- Type of speech (word, sentence, etc.)
- Audio quality
- Age and other relevant attributes for research
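As an illustration only, the metadata fields listed above could be modelled as a small record type. This is a minimal sketch, not Wikispeech's actual schema: the class name `RecordingMetadata`, its field names, the allowed speech types, and the `validate` helper are all assumptions introduced for this example.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional

# Hypothetical set of speech types, based on the categories named in the text.
ALLOWED_SPEECH_TYPES = {"word", "phrase", "sentence"}


@dataclass
class RecordingMetadata:
    """Hypothetical metadata record attached to one speech recording."""
    language: str                     # e.g. a language code such as "sv"
    speech_type: str                  # one of ALLOWED_SPEECH_TYPES
    audio_quality: str                # e.g. "studio" or "noisy" (labels are assumed)
    dialect: Optional[str] = None     # optional, since not every speaker reports one
    age_range: Optional[str] = None   # a range rather than an exact age


def validate(meta: RecordingMetadata) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    if not meta.language:
        problems.append("language is required")
    if meta.speech_type not in ALLOWED_SPEECH_TYPES:
        problems.append("unknown speech type: " + meta.speech_type)
    return problems


example = RecordingMetadata(
    language="sv",
    speech_type="sentence",
    audio_quality="noisy",
    dialect="Scanian",
)
print(validate(example))  # an empty list when the record is valid
print(asdict(example))    # plain dict, ready to serialise alongside the audio
```

Keeping the optional attributes nullable means a contributor can leave out personal details without the record being rejected.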
Open data for research and development
We aim to make both recordings and metadata available under free licences. This allows universities, researchers, and organisations to:
- Improve text-to-speech systems
- Develop better speech recognition technologies
- Build more inclusive and accessible digital tools
Enabling voice-based access to knowledge
As speech recognition becomes more common, voice is increasingly used to access digital content. While Wikispeech enables users to listen to Wikipedia, navigation still often depends on screens, keyboards, or touch interfaces.
We see a future where users can explore knowledge through natural voice interaction, asking questions and receiving answers conversationally.
With sufficient speech data, it is also possible to:
- Improve the naturalness of synthetic speech (intonation, rhythm, emphasis)
- Support more languages and dialects
- Contribute to documenting language development and variation over time
Challenges
Collecting and sharing speech data comes with important challenges.
Privacy and sensitive data
Some metadata, such as information about speech impairments, can be highly sensitive. We must ensure that all data collection and processing fully respects users’ privacy and rights.
Storage and accessibility
We need sustainable and cost-effective solutions for storing and distributing large amounts of speech data, while maintaining integrity and accessibility.