Wikispeech/Pilot study

From Meta, a Wikimedia project coordination wiki

This is the pilot study for Wikispeech. It outlines the work that will be done during 2016–2017. The pilot study was carried out in August–December 2015 in cooperation between Wikimedia Sverige, STTS – speech technology services, and KTH's Speech, Music and Hearing, part of the School of Computer Science and Communication.


Wikipedia is one of the most used websites in the world, with approximately 500 million visitors and 20 billion page views every month. Wikipedia is a so-called wiki, built on the MediaWiki software. MediaWiki is used by many thousands of other websites, and this project aims to create the software needed to make text-to-speech possible on all of them, optimised for Wikipedia.

With the help of navigation and recital using synthetic speech, people who find it easier to assimilate information through speech than through text can get equal access to the information. In the long run, the open nature of the project will make it possible to develop new ways of presenting the spoken information, e.g. through a player intended for mobile phones. The beneficiaries include people with visual impairment or dyslexia, and people who are illiterate. The approximately 25% of people who find it easier to learn from spoken text could also utilise this functionality, as could those who wish to learn while doing something else (e.g. driving). If 25% of the readers of Wikipedia benefit, that would mean approximately 115–125 million people in the long run. Those who have received a medical diagnosis regarding limitations in reading comprehension (e.g. dyslexia, visual impairment or cognitive impairment) often have access to technological aids. However, this usually requires a diagnosis, that you live in a high-income country, and that your language has working text-to-speech. People with poor reading comprehension (from unaccustomed readers to those who cannot read at all) also have limited access to commercial tools, even when these would improve their understanding; this is especially true if they do not wish to share their data with one of the IT behemoths. In conclusion, the assessment is that a very large group would benefit from built-in text-to-speech on Wikipedia.

Making all of the websites which use MediaWiki more accessible to those who find it hard to assimilate written information is therefore incredibly important. The project will increase the accessibility of one of the most important websites. All other platforms using MediaWiki will be able to make use of the technical solutions which are developed during the project. That means several thousand websites will be able to quickly and easily activate text-to-speech.

Wikipedia and many other wikis contain a lot of specialised text which requires a very extensive lexicon of pronunciations for text-to-speech to work satisfactorily. Additionally, Wikipedia exists in 288 different languages, and the platform needs to be scalable to all of these languages in addition to any future ones. The project makes it easier to develop text-to-speech for languages which as of yet lack this technology. This is of interest since there are numerous speakers of these languages in Sweden, but not enough to make them commercially attractive or prioritised; commercial solutions only exist for a minority of all languages.

Flexibility is therefore crucial. Swedish, English and a right-to-left language (Arabic) will be included in the development project. One way to build a flexible platform is to use the language expertise among the tens of thousands of volunteers involved in the various Wikimedia projects. By user-generating the lexicon of pronunciations, with recordings of specialised texts, we can get refined and high-quality text-to-speech even for obscure subjects in languages which previously had no working text-to-speech solutions. The methods for crowdsourced text-to-speech can be used for other types of texts as well, as these will be freely shared. In contrast to closed solutions, this crowdsourced open source solution also makes it possible for users of the text-to-speech to improve it themselves and thereby avoid annoying errors (just as many readers appreciate being able to correct e.g. typos when reading texts).

It is a common mistake to assume that there is one solution which fits everyone who is e.g. blind. Even if they share an impairment, the individuals may otherwise differ completely, with different circumstances and needs. For instance, someone who has been blind since birth may often be able to play back texts at a very fast speed, whereas another person who, in addition to being blind, also has a cognitive impairment might instead need to play back the text at a slower speed than normal. The possibility to create a personal account on Wikipedia where your settings can be saved is therefore critical.

The project will benefit both researchers and companies since the material which is generated by volunteers to improve the text-to-speech will be free to reuse – including for commercial purposes. A similar solution is at present not available on the market.

For research the project delivers unique possibilities. The gains for research around speech technology and text-to-speech solutions are evident: access to large amount of data is one of the main conditions for modern speech technology and here data is generated on all levels, with narrations, user data, and user-generated reviews. The project not only supports but may be at the forefront of new methods in user-centered iterative research and development.

Particularly interesting from the perspective of research concerning speech technology is that the project is working on the reading of longer connected texts, on a wide range of topics. Existing text-to-speech solutions are typically not designed for that kind of reading, though it is of great importance from an accessibility perspective.

The project also contributes to research outside text-to-speech technology. For example, the feedback from users that is generated could be regarded as a form of perception tests that can provide insights into how real listeners perceive speech of various kinds, and from the continuous updates and additions of words and pronunciations by users we can learn things that were not previously known on how language develops and how languages ​​relate to each other.

User Scenarios[edit]

Below three different user scenarios are presented. The idea is to show how a user will be using the three different processes that Wikispeech has: listening to the text-to-speech function, improving it, and developing it for new languages.  

Kim Learns From the Text[edit]

Kim, who has difficulties assimilating written text, is interested in learning more about democracy and visits Wikipedia to learn more about parliamentarianism. Kim has always had difficulties reading, but for various reasons this has never been investigated and Kim has never been provided with any technological aids. Today the schools in Sweden are fairly good at investigating the needs of the child and giving her/him the tools, but this is quite a new development. On the website there is, however, a button that anyone can use to have the article read aloud using text-to-speech.

Kim has left the laptop at home today and instead decides to use the mobile phone to read the article. After all, it is just as easy to navigate the user-friendly mobile interface in all of the 70 languages in which the article exists (for now). Today Kim visits the Swedish article, as it is well structured, but might otherwise have looked at the English version of the article (where it is also possible to use text-to-speech). In the article Kim clicks on the button and the text starts being recited.

Multiple different lexicons of pronunciations have been included, but the article contains many technical terms which require a specialised lexicon of pronunciations to be correct (e.g. words such as "oligarchy" and "gerrymandering"). The Swedish text also mentions Gustav III, which requires the system to understand that this should be pronounced "Gustav the third". A week ago this would have caused trouble, but luckily an engaged volunteer has just helped by extending the lexicon, and the text-to-speech system is now very capable in this subject area in Swedish.

Having gone through part of the article, Kim takes a pause and stops the text-to-speech. Coming back the next day, the article continues to be recited from the correct place in the text. Conveniently enough, all of the accessibility settings Kim has made with regards to reading voice, playback speed etc. remain, as these were saved as personal settings when Kim decided to log in to Wikipedia. Kim gets interrupted on several occasions and needs to go back and re-listen to parts of the text, which is easy using either the keyboard or the mouse.

Kim Corrects Pronunciations in the Text[edit]

When taking a look at the article about [[:sv:M/S Teaterskeppet|M/S Teaterskeppet]], Kim discovers that the previous name of the ship, Vágbingur, is not pronounced well. Kim decides to fix this by, in this case, referring to the Faroese pronunciation dictionary, which is easily done through a built-in tool.

Kim goes on listening and discovers that the nautical term "starboard" is pronounced incorrectly (as "star board"). As an old sailor, Kim feels that this is not acceptable and therefore corrects the lexicon of pronunciations. Kim has been given a special user right and, when logged in to Wikipedia, can update the lexicon of pronunciations; the correction will, as is usual on a wiki, go live directly.

Updating the phonetic text goes quickly thanks to the handy toolkit, which makes it easy to pick IPA symbols and listen to make sure it sounds right. When everything looks good, Kim also takes the opportunity to record the pronunciation as an audio file, which is uploaded, thereby enriching Wiktionary and/or Wikidata. Kim likes that the effort benefits multiple platforms! Especially since the recordings can be used to make even better text-to-speech in the future.

Kim also takes the opportunity to review some corrections made by other users to make sure they are of high quality.

Kim Adds a Favourite Language[edit]

Kim has delved into Esperanto for many years and is horrified to discover that there is no text-to-speech for the language. After looking around for a while Kim finds that the most important components are actually available as free software which can be put together to get a functioning text-to-speech system.

Even if the existing lexicon of pronunciations is pretty weak, Kim thinks that once everything is up and running it might be worth spending two hours per day expanding it. After contacting the Esperanto association, where Kim is a member, Kim finds three more persons willing to help with expanding the lexicon. Together they start going through the well-structured process that is available to enable text-to-speech in a new language, making the developers aware of the interest and of the existing components by adding all of the information to a wiki page. As soon as they feel ready they can start extending the lexicon. (Some words in Esperanto are already in the dictionary, as Esperanto words mentioned in other language versions have already been added.) They keep a good pace, and the small group manages to upload more than 200 new words per day to the lexicon. The quality of their recordings is very high, as they could use professional recording equipment borrowed from Wikimedia Sverige's Technology Pool.

A couple of months later, the various components for Wikispeech in Esperanto have been adapted by Wikimedia developers to fit into the existing infrastructure. It took a little bit longer than expected, as some of the components' licenses were not clearly defined and those who created them had to be contacted. Also, some existing components had to be adapted because they were built with outdated technology. Happily, the creators decided to release the components under a free license.

Background and Definitions[edit]

Minimum Viable Product[edit]

In this project we use two different definitions: walking skeleton and Minimum Viable Product (MVP). A walking skeleton is the minimum functionality needed to be able to see what the system actually does; no advanced functionality is included, only the absolute basics. A Minimum Viable Product includes more, namely the minimum amount of functionality required for the Wikimedia community and disability organisations to feel that it is worth activating text-to-speech. These requirements are higher, since it is not enough that the system works; a certain amount of ease of use is also required.

These two concepts are used below to classify the different properties of the system. Our goal is to achieve all of the properties for the MVP during the project.

Overarching System Description[edit]

The system will be built to handle various existing language resources, since these vary a lot today. The system also needs to be generic enough to be used on all Wikimedia projects[1] and on third-party installations of MediaWiki. The system will therefore consist of several well-defined APIs which can exchange information; in this text we refer to these jointly as a "wrapper". In the image below, the wrapper makes up all the circles and arrows, while the boxes need to be exchangeable.

The different boxes labeled 2, 3, 4 and 6 may also consist of different parts for different languages, but have in common that they deliver input and output data in a well-defined way. This technical part is also described in an appendix called Process 0: System design.
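The wrapper-and-boxes design described above can be sketched as a small set of fixed interfaces with exchangeable implementations. The class and method names below are illustrative assumptions for the sketch, not the project's actual API.

```python
from abc import ABC, abstractmethod


class TextProcessor(ABC):
    """An exchangeable box: turns raw wiki text into annotated text."""
    @abstractmethod
    def process(self, text: str, language: str) -> dict: ...


class SynthesisEngine(ABC):
    """An exchangeable box: turns annotated text into audio."""
    @abstractmethod
    def synthesise(self, annotated: dict) -> bytes: ...


class Wrapper:
    """The fixed 'circles and arrows': routes data between the boxes."""
    def __init__(self, processor: TextProcessor, engine: SynthesisEngine):
        self.processor = processor
        self.engine = engine

    def recite(self, text: str, language: str) -> bytes:
        return self.engine.synthesise(self.processor.process(text, language))


# A per-language component only has to honour the interface to plug in.
class DummyProcessor(TextProcessor):
    def process(self, text, language):
        return {"lang": language, "tokens": text.split()}


class DummyEngine(SynthesisEngine):
    def synthesise(self, annotated):
        return b"\x00" * len(annotated["tokens"])  # placeholder "audio"


audio = Wrapper(DummyProcessor(), DummyEngine()).recite("Hello world", "en")
```

Because the wrapper only depends on the well-defined interfaces, a language with unusual resources can swap in its own processor or engine without touching the rest of the pipeline.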

New Language[edit]

In addition to providing a few different ways to listen to Wikipedia with the text-to-speech solution, it should also be possible to add new voices to an existing or new language. It will not be a complete interface to create a new synthesis (voice), but there will be instructions that can be followed to create one.

MediaWiki and the Wikimedia Server Environment[edit]

The MediaWiki software is written in PHP, but extensions can use other languages. For something to be enabled in production on the Wikimedia servers, it is a requirement that all parts are freely licensed and can be used in multiple languages (MediaWiki exists in 371 languages and Wikipedia in 291). In the Wikimedia environment there are approximately 800 MediaWiki installations, all of which might potentially require support for Wikispeech.

Wikispeech will be created as an extension. It must be possible to activate and configure it separately on each wiki. For that to be allowed, the extension must be translatable on the platform. All code must be written with that in mind, and some work will be needed to install it there.

For the markup of sounds, and to make it easily editable for anyone who edits on Wikipedia, a few different ways to store this have been investigated. The solution we believe to be the most suitable is to not use markup directly in the traditional wiki text, but instead use a technique similar to the one used by the new editor for MediaWiki (VisualEditor), i.e. Parsoid. This solution helps prevent the (sometimes substantial amount of) markup code from obstructing anyone who would like to edit an article, and thus does not irritate the editors, but the markup is still easy to edit for those who want to improve the speech synthesis.

The GUI will be defined during the project. It will include a number of separate pieces, of which the most conspicuous to the end user is the audio player. The second most important piece will be the interface for making improvements. The Wikimedia Foundation has already developed a design library with all the components well described (how they look and how they are used), so the design will not have to be done at that level; rather, the choices of layout and UX will be what matters.

We have selected a server solution because we want to achieve the highest possible accessibility. Instead of relying on readers having their own applications installed, we ensure that everyone can benefit from the results. The alternative would obviously be to create a client that the reader installs, but that is an additional threshold, and readers would also have to be made aware that the client exists. A server solution additionally allows the synthesiser to be used by third parties.


The motivation for making the solution as modular as we did is that we want to make use of the vast quantity of resources available in this area, whilst being able to deal with the fact that these are very often not standardised. The alternative would be to specify an existing standard. This would come at the cost of not being able to make use of the majority of the pre-existing resources, making Wikispeech develop at a much slower pace.


In the project three processes will be set up: reciting the text using speech synthesis, improving the speech synthesis and adding speech synthesis for a new language. Here we will firstly give a schematic overview of these processes before presenting all of the properties which make up each process.

Process for recital includes:

Navigation → NLP → Synthesis engine → Audio player

Process for improving speech synthesis includes:

Incorrectly recited text is identified → Correction in the lexicon of pronunciations → Control of the quality of the lexicon change → Collection of speech data

or, alternatively:

Incorrectly recited text is identified → Correction occurs in article → Community driven annotation in the text

Process for adding a new language includes:

Expressed interest for the activation of a new language → Identification of existing components → Possible development of API adaptations → Missing or bad components are developed → Installation → Local configuration



In the process diagram, the steps are colour-coded:

Green = Walking skeleton

Yellow = MVP

White = Possible continued development

Process for Recital[edit]

This process is general to all languages and must therefore be described in such a way that the language-specific parts instead live in the process for adding a language. (This includes the first three languages.)

Step 1: Navigation[edit]

The user must be able to listen to the synthesis in their web browser. The playback must allow pausing and resuming in multiple ways. The possibility to navigate in the text/page is desirable. The user should also be able to follow along in the text, where the current word is highlighted during playback.


  1. Start and stop functions for recital of the whole text from the start of the article with a mouse click on the playback button
  2. Start and stop functions for recital of the whole text from the start of the article with a keyboard shortcut
  3. The possibility to move directly from the article text to the surrounding user interface
  4. Recitation of marked text
  5. Works in multiple web browsers[2]
  6. Can be activated from both mobile and desktop view
  7. Speed and other settings of recitation can be adjusted according to personal preferences
  8. Recitation can be rewound to allow re-listening
  9. Recitation can skip ahead in different intervals (e.g. word, sentence or paragraph)
  10. Recitation indicates that something is a link, footnote or similar (e.g. through a “pling”) which can be configured
  11. In editing mode the article editor can listen to and correct the pronunciation of a certain text directly
  12. Can be activated through the Wikipedia-app
  13. Navigation of recitation for the whole text through voice control

Step 2: Natural Language Processing (NLP)[edit]

The NLP component consists mainly of two parts: text processing and pronunciation generation.


  1. Text being recited is sent to Wikimedia servers (Wikispeech API)
  2. API for text processing (e.g. https and json/ssml)[3]
  3. Conversion of text from raw to annotated text with sufficient information (i.e. annotation for pauses, emphasis within a sentence etc.; the minimum being a series of phoneme sequences and text normalising/parsing)
  4. API for pronunciation component
  5. Structure for lexicon of pronunciations[4]
  6. Automatic pronunciation component for words not in the lexicon
  7. TTS handles various tags (e.g. that it is an image caption or table being read)
  8. Make it possible to choose between existing pronunciations for the language (in multiple ways)
  9. Generation of prosodic tags (emphasis, phrasing - either through data or rules)
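The text-processing exchange in steps 1–4 above could look roughly like the following JSON request/response pair. The field names and transcriptions are illustrative assumptions, not the finalised Wikispeech schema.

```python
import json

# Hypothetical request sent to the text-processing API (step 2).
request = {
    "language": "sv",
    "text": "Gustav III föddes 1746.",
}

# Hypothetical annotated response (step 3): normalised tokens with phoneme
# sequences and expansions for abbreviations, numerals and regnal numbers.
response = {
    "language": "sv",
    "tokens": [
        {"orth": "Gustav", "ipa": "ˈɡɵsːtav"},
        {"orth": "III", "expansion": "den tredje"},
        {"orth": "föddes", "ipa": "ˈfœdːɛs"},
        {"orth": "1746", "expansion": "sjuttonhundrafyrtiosex"},
    ],
}

# Serialised for transport over HTTPS as JSON (step 2's "https and json").
payload = json.dumps(request, ensure_ascii=False)
```

The key point is that the annotated form, not the raw wiki text, is what gets handed to the synthesis engine, so every text-processing component must emit the same well-defined structure.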

Step 3: Synthesis engine[edit]


  1. Synthesis engine handling (the generic parts)
  2. Synthesis API for all engines
  3. Interpretation of SSML-annotation (of pausing, emphasis within a sentence, intonation, volume and speed)
  4. Make language flavour selection (e.g. British/American) possible
  5. Make prosodic control (emphasis) possible
  6. Recording API (for recorded audio files)
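Step 3 in the list above refers to SSML annotation. The fragment below is an illustrative example of the kind of markup a synthesis engine would interpret, using standard SSML 1.0 elements (`<break>`, `<emphasis>`, `<prosody>`); the exact subset Wikispeech will support is an assumption here.

```python
import xml.etree.ElementTree as ET

# Illustrative SSML: pausing, in-sentence emphasis, and speed/volume control.
ssml = """<speak version="1.0" xml:lang="en">
  <prosody rate="slow" volume="loud">
    Wikipedia is a <emphasis level="strong">free</emphasis> encyclopedia.
    <break time="500ms"/>
    Anyone can edit it.
  </prosody>
</speak>"""

# A synthesis engine parses the annotation and maps each element to
# acoustic controls (pause length, pitch/energy for emphasis, rate, volume).
root = ET.fromstring(ssml)
tags = [el.tag for el in root.iter()]
```

Because the annotation travels in a standard format, the same NLP output can drive different synthesis engines, which is exactly what the common synthesis API in step 2 requires.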

Step 4: Audio player[edit]


  1. See which word is being read
  2. Possibility to choose between existing voices for the language (e.g. male/female)
  3. Works in multiple web browsers
  4. Audio file is available in various formats for download
  5. Using your own voice

Process for Improving Speech Synthesis[edit]

Step 1: Incorrectly recited text is identified[edit]

The user shall in various ways be able to help improve mispronounced words etc. while listening.


  1. The user can report on a wikipage that something sounds weird (manually)
  2. The user can highlight things which sound weird using a keystroke or similar (without having to provide the correct pronunciation, instead the report ends up in a queue where someone else can correct the problem)
  3. The user can report when a mistake is discovered and submit a correction through an easy mechanism
  4. An external actor identifies a problem and can with our tool contribute to the lexicon (e.g. The Swedish Tax Agency includes Wikispeech and improves the pronunciation of all terms related to economy)
  5. Choose whether it is a general fault (so that the lexicon must be updated) or if this should be annotated locally in the article
  6. Works in multiple web browsers (this applies to all parts of the editor, e.g. identification, correction in the lexicon and correction in the article)

Step 2a: Correction in the lexicon of pronunciations[edit]

The improvement is added centrally and impacts all pages.


  1. Input of phonetic text for correct pronunciation using what is currently in the toolbox
  2. Other types of annotation (e.g. how an abbreviation, a date or similar should be recited)
  3. Input of phonetic text for correct pronunciation using specially developed tools for IPA/SAMPA, with the possibility to have the input read back to you and suggestions based on similar words
  4. Make a comparison of the change to the pronunciation (validation of IPA characters) and give a warning when things differ too much
  5. A user right for corrections to the lexicon of pronunciations
  6. Corrections from normal users end up in a queue awaiting approval
  7. Recording of correct pronunciation and user-generated conversion to phonetic text which can be transferred to Wikidata and Wiktionary. (For those who know the language but not IPA, it will be possible to record the pronunciation. Other users can then listen to it and enter the IPA based on the recording.)
  8. Recording of correct pronunciation and automatic conversion to phonetic text
  9. Notifications to a user who reported an error when it has been corrected
  10. It is possible to create a personal user lexicon with the pronunciations you want to use but which shouldn’t be used globally
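The validation and warning described in items 3–4 above could be sketched as follows. The allowed symbol set and the warning threshold are illustrative assumptions; a real deployment would use a per-language phoneme inventory.

```python
# Illustrative per-language inventory of permitted IPA symbols.
ALLOWED_IPA = set("ˈˌːbdfghjklmnprstvwzaeiouɑɛɔøɵœŋʃɕ ")


def valid_ipa(transcription: str) -> bool:
    """Item 4's validation: every character must be a known IPA symbol."""
    return all(ch in ALLOWED_IPA for ch in transcription)


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance between two transcriptions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def needs_warning(old: str, new: str, threshold: int = 4) -> bool:
    """Item 3's check: warn when the change differs too much from the old entry."""
    return edit_distance(old, new) > threshold
```

A small edit (fixing one phoneme) passes silently, while replacing a transcription wholesale triggers the warning and can be routed to the approval queue in item 6.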

Step 2b: Correction occurs in article[edit]

The improvement is added locally on the specific page.


  1. Choice of pronunciation entry in the lexicon, through user input e.g. how an abbreviation, date, name or similar should be read (through choice of correction type, language and existing pronunciation which can be listened to)
  2. Suggestion to correct all occurrences of the same word in the text (it is possible to see how the word is being used)

Step 3a: Control of the quality of the lexicon change[edit]

To avoid vandalism and mistakes by beginners, new tools are needed.


  1. Control of multiple similar reports/suggested improvements
  2. Check the impact of a change (the number of pronunciations affected and excerpts from texts where the word is used to illustrate the context)

Step 3b: Community driven annotation in the text[edit]

To avoid vandalism and mistakes by beginners being added to articles, MediaWiki tools are used


  1. Annotations go live directly after the edit
  2. Quality assurance through existing tools (e.g. recent changes)
  3. Quality assurance through new tools (e.g. new tags)

Step 4a: Collection of speech data[edit]

Primarily to improve the speech synthesis, but might in the future allow for voice-based searches on Wikipedia.


  1. Recording of a corpus of text as a basis for speech synthesis; from individual volunteers who recite longer texts
  2. Recording of a corpus of text as a basis for speech synthesis; from multiple volunteers who recite shorter texts that are later merged (crowdsourcing)
  3. Inclusion of recordings from sources from other organisations is possible (e.g. audio books)

Process for Adding a New Language[edit]

This will happen three times during the project itself, for Swedish, English and a right-to-left language (Arabic).

Step 1: Expressed interest for the activation of a new language[edit]


  1. Communication about the interest is possible (e.g. through a wiki page)

Step 2: Identification of existing components[edit]


  1. Analysis of which components exist for text processing, lexicon, audio corpus and synthesis, and at what level they are

Step 3: Possible API adaptations (the parts are included in the “wrapper”)[edit]


  1. Manual adaptations of the component APIs for the new components

Step 4: Missing or bad components are developed[edit]

This is decided on a case-by-case basis.


  1. Manual improvement of bad components, or creation of missing ones (by developers and/or experts)
  2. Development of the lexicon with simple tools (e.g. imports and micro-contributions)
  3. Training of prosodic modeling using existing and newly recorded material
  4. Development of the lexicon with the help of gamified tools (cf. Wikidata-game)

Step 5: Installation[edit]


  1. Manual installation by developers

Step 6: Local configuration[edit]


  1. Manual server-side configurations by developers
  2. Manual configuration by the community on the wiki (e.g. the possibility of local (re)naming of different technical messages for the specific wiki and styling)


Work Package | Month | Comments/description
System specification | March 2016–September 2017 | The main part finalised during Phase 1 (Etapp 1), but also updated with agile methods afterwards.
Product description | March–April 2016 | Our basis for communication with relevant actors.
Inform relevant actors | March–May 2016 | The Wikimedia community and other developers are informed. Everything is available and structured on Phabricator and on wiki pages. Emails are sent out.
The "wrapper" is developed | March–June 2016 | Create a "wrapper" with the different parts on Wikimedia Labs, i.e. the APIs and the basic infrastructure.
The interface for playback is developed | March–September 2016 | Some parts have to be developed once the wrapper is ready.
The interface for improvements/corrections is developed | March–September 2016 | Completion depends on the finalisation of both the wrapper and the playback interface.
Test specifications are written | April–May 2016 | Initially a document created as a foundation for the work on the tests done by KTH; thereafter a specification of the activities around the tests at every sprint. It includes a test plan, user specifications and a test environment specification.
Report for Phase 1 (Etapprapport 1) | May 2016 | Sent in to PTS.
Trimming of the various parts | May–November 2016 | Based on the comments about language, bugs, performance, etc.
Release notes | June 2016–August 2017 | Done continuously at each version's release, in accordance with the way this is done by WMF (i.e. branch cuts with all their commits/Phabricator tasks).
Report of unit tests | June 2016–August 2017 | Only automatic reports. Done continuously each time the code is changed.
User manual & FAQ for the end users | June 2016–August 2017 | Updated continuously during the project based on feedback. A translatable page, so that the instructions can easily be translated while keeping the information up to date.
Test of functions | July 2016–February 2017 | Organise test groups from the community and from disability organisations that try the tool (the wrapper and the interface) and provide comments. Reports on this will be completed in March 2017.
Thesis topics | July 2016–August 2017 | Define suitable topics for theses that students can work on (e.g. modules).
Report for Phase 2 (Etapprapport 2) | September 2016 | Sent in to PTS.
Development of new projects | August–October 2016 | Define what is needed for new projects and write applications. This is done outside of the PTS-financed project, but is based on its results.
Report for Phase 3 (Etapprapport 3) | February 2017 | Sent in to PTS.
User documentation for installation and operation | March–August 2017 |
Wikimedia Foundation activates the Wikispeech extension | February–September 2017 | Practicalities regarding implementation and the exchange of code review.
Report for Phase 4 (Etapprapport 4) | June 2017 | Sent in to PTS.
Final report | September 2017 | Replaces the report for Phase 5 (Etapprapport 5). The project ends on 15 September.
  1. This will only include standard wiki pages, and so will exclude e.g. Special Pages and Wikibase pages.
  2. See here for selection:
  3. Text processing adapted for Wikispeech will be necessary (i.e. that which is done to the text before it is sent to the speech synthesis engine). Articles contain annotation and structure which is relevant for text-to-speech, and this must be sent to the synthesis API in a pre-defined format.
  4. Here you merge “known” words and get a pronunciation for them. A lexicon is needed where each entry has certain mandatory fields (minimum is probably the orthographic word, a phonetic transcription and language). In addition to the mandatory fields it is desirable to include further information (when it was last saved, who edited the entry, language of the word, language of the pronunciation, word class, disambiguating designations for pronunciations of homographs, code for the system of phonetic annotation (IPA, SAMPA, etc.), etc.).
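The lexicon entry structure described in footnote 4 can be sketched as a record with mandatory and optional fields. The field names below are illustrative, not the project's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LexiconEntry:
    # Mandatory fields: orthographic word, phonetic transcription, language.
    orthography: str
    transcription: str
    language: str
    # Desirable optional fields.
    notation: str = "IPA"                      # phonetic annotation system (IPA, SAMPA, ...)
    word_class: Optional[str] = None           # part of speech
    pronunciation_language: Optional[str] = None
    disambiguation: Optional[str] = None       # distinguishes homograph pronunciations
    last_saved: Optional[str] = None
    edited_by: Optional[str] = None


entry = LexiconEntry(orthography="oligarchy",
                     transcription="ˈɒlɪɡɑːki",
                     language="en")
```

Keeping the mandatory core small makes it easy to import existing lexica, while the optional fields support the provenance tracking and homograph disambiguation the footnote mentions.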