Grants:IEG/Audio Dictionary

From Meta, a Wikimedia project coordination wiki
Audio Dictionary
summary: Create structured audio sets of the 6000 most common words for 3 languages, usable to teach these languages.
target: English and all other Wiktionaries and Wikiversities; the structured data is usable by Wikimedia and external projects.
strategic priority: increasing reach
created on: 18:05, 30 September 2014 (UTC)

Project idea

What is the problem you're trying to solve?

Problem: Wikimedia projects provide various materials (Wiktionary, Wikipedia) and courses (Wikiversity) for learning foreign languages. Two recurrent problems could be addressed. First, for listening comprehension and speech production, audio recordings are a notable help. The main source of audio recordings used by the Wikimedia Foundation is the Shtooka Project, led by Nicolas Vion in Paris. This project provides good coverage for French (~15000 words), Chinese (~8800 words), Dutch (~8700), English (~5000) and Russian (~5000), the Dutch and English sets being past Wikipedia initiatives. Major languages such as Arabic (~900), Hindi (0!), Spanish (1700) and Indonesian (0!) are poorly covered. Second, the base materials, such as word lists and sentence lists, are not systematic and vary in completeness relative to real usage. Indeed, the wiki way, being based on volunteers, is by nature non-systematic.

What is your solution?

Solutions: We know from lexicographic studies that the 6000 most common words of a language generally cover 95 to 99% of its spoken and written material. Such a list guides us along the most effective road to documenting a lexicon. In addition, Nicolas Vion, language lover and administrator of the Shtooka Project, provided audio recording training to me (Hugo Lopez, language teaching PhD researcher and Wikipedian) in 2013 for a one-off study on Chinese teaching and learning. As a PhD researcher at the French National Institute for Oriental Languages and Civilizations (INALCO), I also have access to native speakers of Oriental languages such as Arabic, Hindi, and Indonesian, as well as free access to a professional recording studio, which guarantees suitable recording quality for such databases. We propose to use publicly available word frequency lists, the expertise mentioned above, and this hardware to systematically record the "List of 6000 common words" for several major non-Western languages.
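The coverage claim above can be checked against any frequency list. The following is a minimal sketch, not the project's actual tooling: the function name `top_n_coverage` and the toy corpus are assumptions made for illustration, and a real check would use a large corpus for the target language.

```python
from collections import Counter


def top_n_coverage(tokens, n):
    """Return the n most frequent words and the share of tokens they cover."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    top = counts.most_common(n)  # ties keep first-seen order
    covered = sum(count for _, count in top)
    return [word for word, _ in top], covered / total


# Toy corpus, purely illustrative; a real list would come from a large corpus.
sample_tokens = "the cat sat on the mat and the dog sat on the rug".split()
words, coverage = top_n_coverage(sample_tokens, 3)
print(words, round(coverage, 2))
```

With a real corpus, calling the function with n = 6000 would give the word list to record and the share of running text that list covers.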

Project goals

Creating systematic audio sets

We propose to combine this available content, expertise, and access to systematically record high-quality audio for:

  • an "Arabic audio database of 6000 common words"
  • a "Hindi audio database of 6000 common words"
  • an "Indonesian audio database of 6000 common words".

Project plan


Recording sessions

The recording sessions require:

  • An audio recording studio. We have a pre-agreement for free access to the audio recording studio of the French INALCO. Tests have already been done there, with good results.
  • A paid native speaker. The amount of work required, estimated at about 4 to 6 working days per language, makes it necessary to pay the native speakers.
  • A paid technical assistant. The technical assistant will manage the textual content, prepare the recording hardware and software, and lead the recording sessions. The amount of work required, estimated at about 4 to 6 working days per language (times 3), makes it necessary to pay this technical assistant.

Each daily session is expected to produce between 1000 and 2000 audio files. Recording 6000 words is thus expected to take between 4 sessions (mostly fast) and 6 sessions (mostly slow).
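The session estimate can be reproduced with a short calculation. This is a rough sketch under the stated rates only; the helper name `sessions_needed` and the intermediate rate of 1500 files per session are assumptions added for illustration.

```python
import math


def sessions_needed(total_words, words_per_session):
    """Number of whole sessions needed to record total_words."""
    return math.ceil(total_words / words_per_session)


TOTAL_WORDS = 6000  # target list size per language, from the proposal

# 1000 and 2000 are the stated bounds; 1500 is an assumed average rate.
for rate in (1000, 1500, 2000):
    print(rate, "files/session ->", sessions_needed(TOTAL_WORDS, rate), "sessions")
```

Six sessions corresponds to the slow rate of 1000 files per session, and four sessions to an average of about 1500. A sustained 2000 files per session would need only three sessions, so the 4 to 6 range leaves some margin.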

Other notes

Pre-tests in the recording studio highlighted the need for a silent portable PC. A project manager will also be required to manage bookings, administrative work, trials and failures. The software used will be Shtooka Recorder, which increases productivity for systematic audio recordings.


Item | Description | Units | Total (EUR)
Team expenses: | | |
Native Speaker 1 | 4 - 6 working sessions | 400€ x1 + 20% taxes | 500€
Native Speaker 2 | 4 - 6 working sessions | 400€ x1 + 20% taxes | 500€
Native Speaker 3 | 4 - 6 working sessions | 400€ x1 + 20% taxes | 500€
Technical assistant | 4 - 6 working sessions * 3 | 400€ x3 + 20% taxes | 1500€
Coordinator / Project manager | 15h * 4 * 20€ | 1200€ + 20% | 1440€
Basic mini PC | 300€ * 1 (including taxes) | 300€ + 0% | 300€
Total: 4740€ (estimate on 2014/09/30)

Note: the project manager estimate is generous, assuming coordination of each subproject (3 items) plus coordination with the WMF (1 item). It may change.

The final cost is about 30 cents per audio file.
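The budget total and the per-file cost can be checked as follows. This is an illustrative sketch using the line totals listed in the budget, in euros; the variable names are assumptions, and the file count assumes all 3 × 6000 words are recorded.

```python
# Line totals in euros, as listed in the budget table.
budget_eur = {
    "Native Speaker 1": 500,
    "Native Speaker 2": 500,
    "Native Speaker 3": 500,
    "Technical assistant": 1500,
    "Coordinator / Project manager": 1440,
    "Basic mini PC": 300,
}

LANGUAGES = 3
WORDS_PER_LANGUAGE = 6000

total_eur = sum(budget_eur.values())
total_files = LANGUAGES * WORDS_PER_LANGUAGE
cost_per_file_eur = total_eur / total_files

print(f"Total: {total_eur} EUR for {total_files} files")
print(f"Cost per file: {cost_per_file_eur:.3f} EUR")
```

The listed totals sum to the stated 4740€, and dividing by 18,000 files gives roughly 0.26€ per file, the same order of magnitude as the figure quoted above.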

Community engagement

Similar initiatives have already been led occasionally by Wikipedians and have been widely reused across the Wikimedia projects and beyond, showing the interest and value of such media creations.


The systematic dataset can and should be extended by follow-up initiatives from individuals, Wikimedia chapters, or the WMF (IEG). The textual base can be translated, and the methodology can be replicated for other languages, major or rare.

Measures of success

Note: in addition to your project-specific measures of success, you will also be asked to report on some Global Metrics at the end of your final report. Please keep this in mind as you plan, and we'll support you as you begin your project.

Number of audio files produced, their quality, and their reuse (longer term).

Get involved


Community Notification

Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions.


Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. (Other constructive feedback is welcome on the talk page of this proposal).

  • Community member: add your name and rationale here.

See also


  • <to come>