Community Wishlist Survey 2022/Reading/IPA audio renderer/TTS investigation

Tracked in Phabricator:
Task T307624

Community Tech need to select a text-to-speech (TTS) engine to drive the IPA audio renderer wish. There are a few good options available to us.

Overview

TTS Engine	Type	Licence	Languages	Costs (USD/character)	SSML	Voices
phoneme-synthesis + meSpeak.js	Library	GPLv3 (open source)	24	N/A	Y	29
larynx	CLI/API	MIT (open source)	9	N/A	Y	50
espeak-ng	CLI/API	GPLv3 (open source)	127	N/A	Y	127^[nb 1]
Google Cloud	API	Closed source	40	0.000004	Y	100+
IBM Cloud	API	Closed source	13	0.00002	Y	26
Microsoft Azure	API	Closed source	129	0.000016	Y	270
Amazon AWS	API	Closed source	22	0.000004	Y	66

Requirements

The TTS engine we pick should:

accept SSML (Speech Synthesis Markup Language), a W3C standard^[1]
produce acceptable quality speech synthesis
support as wide a range of languages as possible
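As a rough illustration of how the first and third requirements could be applied to the overview table, the sketch below filters the candidates by SSML support and ranks them by language count. The `Engine` record and `shortlist` helper are hypothetical names introduced here; the figures are copied from the overview table, and speech quality is left out because it cannot be reduced to a number.

```python
from typing import NamedTuple


class Engine(NamedTuple):
    """One row of the overview table (hypothetical record type)."""
    name: str
    open_source: bool
    languages: int
    ssml: bool


# Figures copied from the overview table above.
ENGINES = [
    Engine("phoneme-synthesis + meSpeak.js", True, 24, True),
    Engine("larynx", True, 9, True),
    Engine("espeak-ng", True, 127, True),
    Engine("Google Cloud", False, 40, True),
    Engine("IBM Cloud", False, 13, True),
    Engine("Microsoft Azure", False, 129, True),
    Engine("Amazon AWS", False, 22, True),
]


def shortlist(engines, open_source_only=False):
    """Keep engines that accept SSML, widest language coverage first."""
    candidates = [
        e for e in engines
        if e.ssml and (e.open_source or not open_source_only)
    ]
    return sorted(candidates, key=lambda e: e.languages, reverse=True)


if __name__ == "__main__":
    for engine in shortlist(ENGINES, open_source_only=True):
        print(f"{engine.name}: {engine.languages} languages")
```

Language count alone says nothing about how complete each language is, so a ranking like this is only a starting point for the listening tests.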

Audio samples

https://tnt-dev.toolforge.org/projects/tts (work in progress)

phoneme-synthesis + meSpeak.js

Notes

meSpeak.js ("modularly enhanced speak.js") is a client-side JavaScript text-to-speech library based on the speak.js project^[2]. Because it runs in the browser, it could possibly be bundled directly in an extension.

Licence

GPL v3

Languages & voices

24 languages (29 voices) are supported, with varying completeness^[3]

Catalan
Czech
German
Greek
English
Esperanto
Spanish
Finnish
French
Hungarian
Italian
Kannada
Latin
Latvian
Dutch
Polish
Portuguese
Romanian
Slovak
Swedish
Turkish
Mandarin Chinese
Cantonese Chinese

Quality

Better than larynx out of the box, and could be improved further with some tweaking.

Costs

N/A

SSML

SSML support can be enabled via a flag.^[2]

Notes

Has some issues with the schwa (ə).

Links

larynx

Notes

larynx would need to run as an API on the production cluster, with an extension converting IPA to SSML.
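The IPA to SSML step could look roughly like the sketch below, which wraps a transcription in the standard SSML `<phoneme>` element with `alphabet="ipa"`. The function name `ipa_to_ssml` and its parameters are hypothetical and do not describe the extension's actual code; whether a given engine honours the element still needs testing per engine.

```python
from xml.sax.saxutils import escape, quoteattr

# Namespace defined by the W3C SSML 1.1 specification.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def ipa_to_ssml(ipa: str, fallback_text: str = "", lang: str = "en") -> str:
    """Wrap an IPA transcription in an SSML <phoneme> element.

    Surrounding slashes or square brackets, as commonly written in
    articles (/.../ or [...]), are stripped before wrapping.
    """
    ipa = ipa.strip().strip("/[]")
    return (
        f'<speak version="1.1" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>'
        f"{escape(fallback_text)}"
        "</phoneme></speak>"
    )


if __name__ == "__main__":
    print(ipa_to_ssml("/təˈmɑːtoʊ/", "tomato"))
```

The fallback text is what an engine would read if it ignored the `ph` attribute, so passing the written word keeps the output usable on engines with partial SSML support.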

Licence

MIT

Languages & voices

9 languages (50 voices) are supported^[4]. The voices are primarily based on Glow-TTS, a flow-based voice model trained with Monotonic Alignment Search^[5]

English
German
French
Spanish
Dutch
Italian
Swedish
Swahili
Russian

Quality

Tested: fairly poor with default settings, and will require a lot of tweaking.

Costs

N/A

SSML

Only a subset of SSML is supported, but the most useful elements (i.e. phonemes) are available^[6]

Notes

Links

GitHub
TheresNoTime's fork
Languages/Voices
SSML support
CommTech's test installation: https://larynx-tts.wmcloud.org/openapi/

espeak-ng

Notes

meSpeak.js, mentioned above, is based on eSpeak; eSpeak NG is a CLI application that is backwards compatible with eSpeak^[7]. We would also need to run this as an API.

Licence

GPL v3

Languages & voices

127^[nb 1] languages^[8]

See list

Quality

Untested

Costs

N/A

SSML

Similar to meSpeak.js, a subset of SSML is supported.
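For testing, the CLI could be driven from a small wrapper like the sketch below. It relies on the `espeak-ng` options `-m` (interpret SSML markup), `-v` (voice), `-f` (input text file) and `-w` (write a WAV file). The helper names are hypothetical, and which SSML elements are actually honoured is not confirmed here.

```python
import shutil
import subprocess


def espeak_ng_command(ssml_path: str, wav_path: str, voice: str = "en") -> list:
    """Build the argument list for an SSML-aware espeak-ng run."""
    return [
        "espeak-ng",
        "-m",             # interpret SSML markup in the input
        "-v", voice,      # voice / language to use
        "-f", ssml_path,  # read the input from this file
        "-w", wav_path,   # write the audio to this WAV file
    ]


def synthesise(ssml_path: str, wav_path: str, voice: str = "en") -> bool:
    """Run espeak-ng if it is installed; return False when it is missing."""
    if shutil.which("espeak-ng") is None:
        return False
    subprocess.run(espeak_ng_command(ssml_path, wav_path, voice), check=True)
    return True


if __name__ == "__main__":
    print(" ".join(espeak_ng_command("input.ssml", "output.wav")))
```

Passing an argument list rather than a shell string avoids quoting problems if the wrapper is later exposed behind an API.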

Notes

Links

Google Cloud

Notes

API

Licence

Proprietary

Languages & voices

40 languages (100+ voices) are supported

See list

Quality

As expected from a commercial service, very good with default settings. No tweaking necessary.

Costs

All costs exclude "WaveNet" (DeepMind's neural generative audio model^[9]) voices, and are based on publicly available pricing.

Free quota

4 million characters per month

Then

$0.000004 USD per character

SSML

Fully supported
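A request to the service might be assembled as in the sketch below, which builds the JSON body for the Cloud Text-to-Speech REST method `text:synthesize` with SSML input. The helper name, the default language code and the choice of MP3 output are assumptions for illustration; authentication and the HTTP call itself are omitted.

```python
import json

# REST endpoint of the v1 synthesize method.
SYNTHESIZE_URL = "https://texttospeech.googleapis.com/v1/text:synthesize"


def google_tts_request_body(ssml: str, language_code: str = "en-US") -> dict:
    """Build the JSON body for a text:synthesize call with SSML input."""
    return {
        "input": {"ssml": ssml},
        "voice": {"languageCode": language_code},
        "audioConfig": {"audioEncoding": "MP3"},
    }


if __name__ == "__main__":
    ssml = '<speak><phoneme alphabet="ipa" ph="həˈloʊ">hello</phoneme></speak>'
    print(SYNTHESIZE_URL)
    print(json.dumps(google_tts_request_body(ssml), ensure_ascii=False, indent=2))
```

The response carries the audio as base64 in an `audioContent` field, so a caller would decode it before storing or serving the file.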

Notes

Links

IBM Cloud

Notes

API

Licence

Proprietary

Languages & voices

13 languages (26 voices) are supported^[10]

Arabic
Chinese
Czech
Dutch
English
French
German
Italian
Japanese
Korean
Portuguese
Spanish
Swedish

Quality

Untested

Costs

All costs are based on publicly available pricing.

Free quota

10,000 characters per month

Then

$0.00002 USD per character

SSML

Fully supported

Notes

Links

Microsoft Azure

Notes

API

Licence

Proprietary

Languages & voices

129 languages (270 voices) are supported

See list

Quality

As expected from a commercial service, very good with default settings. No tweaking necessary.

Costs

All costs exclude "Custom Neural" voices, and are based on publicly available pricing.

Free quota

0.5 million characters per month

Then

$0.000016 USD per character

SSML

Fully supported

Notes

Links

Amazon AWS

Notes

API

Licence

Proprietary

Languages & voices

22 languages (66 voices) are supported

See list

Quality

As expected from a commercial service, very good with default settings. No tweaking necessary.

Costs

All costs are based on publicly available pricing.

Free quota

5 million characters per month (for 12 months)

Then

$0.000004 USD per character
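To compare the four commercial services at a given volume, the free quotas and per-character rates quoted in their Costs sections can be combined as in the sketch below. The `Pricing` record and `monthly_cost` helper are hypothetical names; the figures are the ones listed on this page, so they exclude WaveNet and Custom Neural voices, and the sketch assumes the AWS free quota still applies (it only lasts for 12 months).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Free monthly quota and per-character rate (hypothetical record)."""
    free_chars_per_month: int
    usd_per_char: float


# Figures copied from the Costs sections on this page.
PRICING = {
    "Google Cloud": Pricing(4_000_000, 0.000004),
    "IBM Cloud": Pricing(10_000, 0.00002),
    "Microsoft Azure": Pricing(500_000, 0.000016),
    "Amazon AWS": Pricing(5_000_000, 0.000004),  # free quota: first 12 months only
}


def monthly_cost(provider: str, chars: int) -> float:
    """Estimated USD cost for synthesising `chars` characters in one month."""
    pricing = PRICING[provider]
    billable = max(0, chars - pricing.free_chars_per_month)
    return billable * pricing.usd_per_char


if __name__ == "__main__":
    volume = 10_000_000  # example monthly volume, chosen arbitrarily
    for name in sorted(PRICING, key=lambda n: monthly_cost(n, volume)):
        print(f"{name}: ${monthly_cost(name, volume):.2f}")
```

Because rendered audio can be cached and IPA strings are short, real usage may well stay inside the free quotas; the example volume is only there to show how the rates diverge above them.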

SSML

Fully supported

Notes

Links

Footnotes

[nb 1] Voice count unconfirmed; likely at least one per language.

References

[1] "Speech Synthesis Markup Language (SSML) Version 1.1". www.w3.org. Archived from the original on 2022-03-16. Retrieved 2022-05-18.
[2] "meSpeak.js: Text-to-Speech on the Web". www.masswerk.at. Archived from the original on 2022-04-27. Retrieved 2022-05-18.
[3] "meSpeak – Voices & Languages". www.masswerk.at. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
[4] "Larynx: End to end text to speech system using gruut and onnx". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
[5] Kim, Jaehyeon; Kim, Sungwon; Kong, Jungil; Yoon, Sungroh (2020-10-22). "Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search". Archived from the original on 2022-05-16. Retrieved 2022-05-18.
[6] "Larynx: SSML". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
[7] "eSpeak NG Text-to-Speech". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
[8] "eSpeak NG languages". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
[9] "Introducing Cloud Text-to-Speech powered by DeepMind WaveNet technology". Google Cloud Blog. Archived from the original on 2022-05-16. Retrieved 2022-05-18.
[10] "IBM Cloud Text-to-speech languages". cloud.ibm.com. Archived from the original on 2021-04-15. Retrieved 2022-05-18.