Community Wishlist Survey 2022/Reading/IPA audio renderer/TTS investigation
Community Tech needs to select a text-to-speech (TTS) engine to drive the IPA audio renderer wish. There are a few good options available to us.
Overview
TTS Engine | Type | Licence | Languages | Costs (USD/character) | Voices | SSML |
---|---|---|---|---|---|---|
phoneme-synthesis + meSpeak.js | Library | GPLv3 (open source) | 24 | N/A | 29 | Subset (enabled via a flag) |
larynx | CLI/API | MIT (open source) | 9 | N/A | 50 | Subset |
espeak-ng | CLI/API | GPLv3 (open source) | 127 | N/A | 127[nb 1] | Subset |
Google Cloud | API | Closed source | 40 | 0.000004 | 100 | Full |
IBM Cloud | API | Closed source | 13 | 0.00002 | 26 | Full |
Microsoft Azure | API | Closed source | 129 | 0.000016 | 270 | Full |
Amazon AWS | API | Closed source | 22 | 0.000004 | 66 | Full |
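To make the per-character prices easier to compare, the sketch below estimates a monthly bill for each hosted service. The rates and free quotas are copied from this page; the `Pricing` class, the `monthly_cost_usd` helper and the 10 million characters per month volume are illustrative assumptions, not measured usage, and the providers' current price lists should be checked before relying on the numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-character pricing for one hosted TTS service (figures from this page)."""
    rate_usd_per_char: float   # price once the free quota is used up
    free_chars_per_month: int  # monthly free quota


PRICING = {
    "Google Cloud": Pricing(0.000004, 4_000_000),
    "IBM Cloud": Pricing(0.00002, 10_000),
    "Microsoft Azure": Pricing(0.000016, 500_000),
    # The AWS free quota only applies for the first 12 months.
    "Amazon AWS": Pricing(0.000004, 5_000_000),
}


def monthly_cost_usd(provider: str, chars: int, use_free_quota: bool = True) -> float:
    """Estimate one month's cost for `chars` synthesised characters."""
    pricing = PRICING[provider]
    free = pricing.free_chars_per_month if use_free_quota else 0
    billable = max(0, chars - free)  # characters below the quota cost nothing
    return billable * pricing.rate_usd_per_char


if __name__ == "__main__":
    volume = 10_000_000  # hypothetical monthly volume, not a real estimate
    for name in PRICING:
        print(f"{name}: ${monthly_cost_usd(name, volume):.2f}")
```

Passing `use_free_quota=False` models the case where a free quota has expired, as it would for AWS after the first year.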
Requirements
The TTS engine we pick should:
- accept SSML (speech synthesis markup language), as an emerging W3C standard[1]
- produce acceptable quality speech synthesis
- support as wide a range of languages as possible
Audio samples
- https://tnt-dev.toolforge.org/projects/tts (work in progress)
phoneme-synthesis + meSpeak.js
Notes
meSpeak.js (modulary enhanced speak.js) is a client-side JavaScript text-to-speech library based on the speak.js project[2]. Because it runs client-side, it could possibly be included directly in an extension.
Licence
- GPLv3 (open source)
Languages & voices
24 languages (29 voices) are supported, with varying completeness[3]
- Catalan
- Czech
- German
- Greek
- English
- Esperanto
- Spanish
- Finnish
- French
- Hungarian
- Italian
- Kannada
- Latin
- Latvian
- Dutch
- Polish
- Portuguese
- Romanian
- Slovak
- Swedish
- Turkish
- Mandarin Chinese
- Cantonese Chinese
Quality
Better than larynx out of the box, but could improve further with some tweaking.
Costs
N/A
SSML
SSML support can be enabled via a flag.[2]
Notes
Has some issues with the schwa (ə)
Links
larynx
Notes
larynx would need to be run as an API on the production cluster, with an extension packaging IPA -> SSML
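As a rough illustration of the "IPA -> SSML" packaging step, the sketch below wraps an IPA transcription in the `<phoneme alphabet="ipa">` element defined by SSML 1.1. The function name `ipa_to_ssml`, its parameters and the default language are assumptions made for this example; it is not the extension's actual code.

```python
from xml.sax.saxutils import escape, quoteattr


def ipa_to_ssml(ipa: str, fallback_text: str, lang: str = "en-US") -> str:
    """Wrap an IPA transcription in a minimal SSML 1.1 document.

    `fallback_text` is the plain spelling an engine may fall back to if it
    cannot render the phoneme string.
    """
    # Drop surrounding /slashes/ or [brackets] that IPA is often written with.
    ipa = ipa.strip().strip("/[]")
    return (
        '<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(fallback_text)}</phoneme>'
        "</speak>"
    )


if __name__ == "__main__":
    print(ipa_to_ssml("/təˈmeɪtoʊ/", "tomato"))
```

Escaping both the attribute and the element text keeps the output well-formed XML even when the input contains characters such as `&` or `<`.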
Licence
- MIT (open source)
Languages & voices
9 languages (50 voices) are supported[4]. The voices are primarily based on Glow-TTS, a voice model trained with Monotonic Alignment Search[5]
- English
- German
- French
- Spanish
- Dutch
- Italian
- Swedish
- Swahili
- Russian
Quality
Tested: fairly poor with default settings, and will require a lot of tweaking.
Costs
N/A
SSML
Only a subset of SSML is supported; however, the most useful elements (i.e. phonemes) exist[6]
Notes
Links
- GitHub
- TheresNoTime's fork
- Languages/Voices
- SSML support
- CommTech's test installation: https://larynx-tts.wmcloud.org/openapi/
espeak-ng
Notes
meSpeak.js, mentioned above, is based on eSpeak; eSpeak NG is a CLI application that is backwards compatible with eSpeak[7]. We would also need to run this as an API.
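One way an API layer could drive the CLI is to shell out to it per request. The sketch below only assumes the standard `espeak-ng` options `-m` (interpret SSML markup), `-v` (voice) and `-w` (write a WAV file); the function names, the 30-second timeout and the overall design are hypothetical, and how well a given SSML element is honoured would still need testing.

```python
import subprocess
from typing import List


def espeak_ng_command(ssml: str, voice: str, wav_path: str) -> List[str]:
    """Build an espeak-ng invocation that renders SSML to a WAV file."""
    return [
        "espeak-ng",
        "-m",            # interpret SSML markup in the input text
        "-v", voice,     # voice/language name, e.g. "en"
        "-w", wav_path,  # write audio to this WAV file instead of playing it
        ssml,
    ]


def synthesise(ssml: str, voice: str, wav_path: str) -> None:
    """Run espeak-ng; raises CalledProcessError on a non-zero exit status."""
    # Passing a list (no shell=True) avoids shell injection from request input.
    subprocess.run(espeak_ng_command(ssml, voice, wav_path), check=True, timeout=30)


if __name__ == "__main__":
    print(espeak_ng_command("<speak>hello</speak>", "en", "out.wav"))
```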
Licence
- GPLv3 (open source)
Languages & voices
127 languages (127 voices) are supported
Quality
Untested
Costs
N/A
SSML
Similar to meSpeak.js, a subset of SSML is supported.
Notes
Links
Google Cloud
Notes
API
Licence
- Proprietary
Languages & voices
40 languages (100+ voices)
Quality
As expected from a commercial service, very good with default settings. No tweaking necessary.
Costs
All costs exclude "WaveNet" (DeepMind neural network model[9]) voices, and are based on publicly available pricing.
Free quota
- 4 million characters per month
Then
- $0.000004 USD per character
SSML
Fully supported
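For reference, SSML is sent to the Cloud Text-to-Speech REST API in the `input.ssml` field of a `text:synthesize` request. The sketch below only builds the JSON body; the helper name and defaults are assumptions for illustration, and authentication and the HTTP call itself are deliberately left out.

```python
import json

# REST endpoint for synthesis requests (v1 API).
SYNTHESIZE_URL = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_synthesize_request(ssml: str, language_code: str = "en-US",
                             audio_encoding: str = "MP3") -> str:
    """Return the JSON body for a text:synthesize request carrying SSML."""
    body = {
        "input": {"ssml": ssml},                  # SSML instead of plain "text"
        "voice": {"languageCode": language_code},
        "audioConfig": {"audioEncoding": audio_encoding},
    }
    # ensure_ascii=False keeps IPA characters readable in the payload.
    return json.dumps(body, ensure_ascii=False)


if __name__ == "__main__":
    print(build_synthesize_request("<speak>hello</speak>"))
```

The response carries the audio as a base64-encoded `audioContent` field, so a caller would decode it before storing or serving the file.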
Notes
Links
IBM Cloud
Notes
API
Licence
- Proprietary
Languages & voices
13 languages (26 voices) are supported[10]
- Arabic
- Chinese
- Czech
- Dutch
- English
- French
- German
- Italian
- Japanese
- Korean
- Portuguese
- Spanish
- Swedish
Quality
Untested
Costs
All costs are based on publicly available pricing.
Free quota
- 10,000 characters per month
Then
- $0.00002 USD per character
SSML
Fully supported
Notes
Links
Microsoft Azure
Notes
API
Licence
- Proprietary
Languages & voices
129 languages (270 voices) are supported
Quality
As expected from a commercial service, very good with default settings. No tweaking necessary.
Costs
All costs exclude "Custom Neural" voices, and are based on publicly available pricing.
Free quota
- 0.5 million characters per month
Then
- $0.000016 USD per character
SSML
Fully supported
Notes
Links
Amazon AWS
Notes
API
Licence
- Proprietary
Languages & voices
22 languages (66 voices) are supported
Quality
As expected from a commercial service, very good with default settings. No tweaking necessary.
Costs
All costs are based on publicly available pricing.
Free quota
- 5 million characters per month (for 12 months)
Then
- $0.000004 USD per character
SSML
Fully supported
Notes
Links
See also
Footnotes
References
- [1] "Speech Synthesis Markup Language (SSML) Version 1.1". www.w3.org. Archived from the original on 2022-03-16. Retrieved 2022-05-18.
- [2] "meSpeak.js: Text-to-Speech on the Web". www.masswerk.at. Archived from the original on 2022-04-27. Retrieved 2022-05-18.
- [3] "meSpeak – Voices & Languages". www.masswerk.at. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
- [4] "Larynx: End to end text to speech system using gruut and onnx". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
- [5] Kim, Jaehyeon; Kim, Sungwon; Kong, Jungil; Yoon, Sungroh (2020-10-22). "Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search". Archived from the original on 2022-05-16. Retrieved 2022-05-18.
- [6] "Larynx: SSML". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
- [7] "eSpeak NG Text-to-Speech". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
- [8] "eSpeak NG languages". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
- [9] "Introducing Cloud Text-to-Speech powered by DeepMind WaveNet technology". Google Cloud Blog. Archived from the original on 2022-05-16. Retrieved 2022-05-18.
- [10] "IBM Cloud Text-to-speech languages". cloud.ibm.com. Archived from the original on 2021-04-15. Retrieved 2022-05-18.