User talk:LA2/Corpus

From Meta, a Wikimedia project coordination wiki

Early draft


The CEE region speaks many different languages, many of them Slavic, and all very different from English, the dominant international language. There is a dire need for improved support for translation, both manual and automatic, through dictionaries such as Wiktionary. Two years ago, at the CEE Meeting 2015 in Tartu, I gave a lightning presentation about foreign language contributions to Wiktionary, briefly describing my contributions of Swedish words to the English and Russian Wiktionary. Now I present some problems facing such contributions, and how they can be solved with the help of literary texts from Wikisource that exist in both languages. This is a novel approach that combines the two projects for a very useful result. My experience is from Russian and Swedish, and I am now applying this pattern to Ukrainian and Belarusian as well.


What Wiktionary is: documenting all languages in all languages. The difference between the host language (the language of the site) and foreign language contributions (entries for foreign words, with explanations in the host language).

Wiktionary is a lonely hobby. For any combination of host and foreign language, there is typically less than one active contributor. A single user makes some contributions for some time, then leaves for other tasks. Users can very seldom cooperate. [graph from Scandinavian languages in en.wikt] In this environment, it is important to leave the work in a state where others can continue later.

Zipf's law - long tail of words - rare words and technical terminology are easy to translate - common words are short and ambiguous.
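The long tail can be illustrated with a minimal sketch. It assumes an idealized Zipf distribution, where the word of rank r has a frequency proportional to 1/r, and a vocabulary of 200,000 words. Both are assumptions made for illustration, not measurements of any real language, and the function name `zipf_coverage` is hypothetical.

```python
# Idealized Zipf model: the word of rank r has frequency proportional to 1/r.
# VOCABULARY_SIZE is an assumed figure for illustration, not a measurement.
VOCABULARY_SIZE = 200_000


def zipf_coverage(top_n: int, vocabulary_size: int = VOCABULARY_SIZE) -> float:
    """Share of running text covered by the top_n most frequent words."""
    if not 0 <= top_n <= vocabulary_size:
        raise ValueError("top_n must be between 0 and vocabulary_size")
    total = sum(1.0 / rank for rank in range(1, vocabulary_size + 1))
    covered = sum(1.0 / rank for rank in range(1, top_n + 1))
    return covered / total


# Compare dictionary sizes discussed in this draft.
for size in (2_000, 20_000, 200_000):
    print(f"{size:>7} words cover {zipf_coverage(size):.0%} of running text")
```

Under these assumptions, each tenfold increase in dictionary size adds a roughly constant share of coverage, which is one way to picture the long tail: the many rare words together still account for a noticeable part of any text.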

A good primary school dictionary has 20,000 words. Advanced dictionaries cover 100,000 or 200,000 words. Only 2,000 words (for some combination of host and foreign language) cannot possibly provide useful coverage.

Look at the statistics, and you will find that Wiktionary is still very much in its starting phase. It was started in December 2002 and will soon celebrate its 15th anniversary. It exists in 172 host languages, but only 65 of those have more than 20,000 words and only 26 have more than 200,000 words. In the English Wiktionary, only 20 languages (English + 19 foreign ones) have more than 20,000 words. The entry counts there are: Italian 144,074, Finnish 82,393, Russian 59,631, German 55,809, Czech 29,920, Swedish 24,006, Latvian 11,485, Lithuanian 6,000, Ukrainian 3,780, Belarusian 2,034.
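The comparison against dictionary sizes can be made explicit with a small sketch. It uses only the English Wiktionary entry counts quoted above and the 20,000 and 200,000 word marks mentioned in this draft. The level labels and the function name `coverage_level` are hypothetical, chosen for illustration.

```python
# Entry counts per foreign language in the English Wiktionary, as quoted above.
ENTRY_COUNTS = {
    "Italian": 144074,
    "Finnish": 82393,
    "Russian": 59631,
    "German": 55809,
    "Czech": 29920,
    "Swedish": 24006,
    "Latvian": 11485,
    "Lithuanian": 6000,
    "Ukrainian": 3780,
    "Belarusian": 2034,
}

# Dictionary sizes used as yardsticks in this draft.
SCHOOL_DICTIONARY = 20_000
ADVANCED_DICTIONARY = 200_000


def coverage_level(entries: int) -> str:
    """Label an entry count relative to the two yardsticks (labels are illustrative)."""
    if entries > ADVANCED_DICTIONARY:
        return "advanced dictionary"
    if entries > SCHOOL_DICTIONARY:
        return "school dictionary"
    return "below school dictionary"


# List the languages from largest to smallest entry count.
for language, entries in sorted(ENTRY_COUNTS.items(), key=lambda item: -item[1]):
    print(f"{language:<11} {entries:>7}  {coverage_level(entries)}")
```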

For both Russian and Ukrainian, commercial publishers provide English translation dictionaries in various sizes: 50,000, 100,000 and 200,000 words. But for Ukrainian the coverage seems more random. In an English-Ukrainian dictionary with 80,000 words, neither newcomer nor novice was listed. Beginner = новачок was listed, but the Ukrainian-English section did not list новачок. So even if Wiktionary is bad, commercial dictionaries are also bad.

Wiktionary entries can be elaborate or short. For technical terminology, a short entry can be sufficient. By translating the names of academic specializations such as biology, biologist, chemistry, chemist, it was possible in a short time to create 300 entries for words in Ukrainian and Belarusian in the Swedish Wiktionary, reaching the 500-word threshold that merits a mention on the main page.

It is, however, easy to write bad dictionary entries, with incomplete coverage or based on misunderstanding or guessing. It is also easy to invent poor example uses of words. Using real sentences from literature is a great help in explaining how words are used.

But we also have to consider copyright. Luckily, many out-of-copyright texts exist, for example in Wikisource.

For translations (foreign language contributions), it is easy to provide bad translations of real sentences. Using real sentences from literature and real translations is a great help. Luckily, Wikisource exists in many languages.
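Looking up real sentences together with their real translations can be sketched as below. This is a minimal sketch that assumes the two texts have already been aligned sentence by sentence, which is a separate and harder step not shown here. The sentence pairs are invented placeholders, not quotations from Wikisource, and the function name `find_examples` is hypothetical.

```python
import re

# Invented placeholder pairs (Swedish, English), NOT quotations from Wikisource.
# In practice these would come from a text and its published translation.
ALIGNED_SENTENCES = [
    ("Hunden sover under bordet.", "The dog sleeps under the table."),
    ("Jag ser en hund i parken.", "I see a dog in the park."),
    ("Katten dricker mjölk.", "The cat drinks milk."),
]


def find_examples(word_forms, aligned_sentences):
    """Return the pairs whose foreign-language side contains any given word form.

    Matching is on whole words and ignores case. Inflected forms must be
    listed explicitly, because no morphological analysis is attempted.
    """
    if not word_forms:
        return []
    alternatives = "|".join(re.escape(form) for form in word_forms)
    pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return [pair for pair in aligned_sentences if pattern.search(pair[0])]


# Collect candidate example sentences for the Swedish word "hund" (dog).
for foreign, translation in find_examples(["hund", "hunden"], ALIGNED_SENTENCES):
    print(f"{foreign}  /  {translation}")
```

Listing the inflected forms by hand keeps the sketch simple. For heavily inflected languages such as Russian or Ukrainian, a real tool would need a list of word forms for each entry.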