Grants:IEG/Pan-Scandinavian Machine-assisted Content Translation/Timeline

Timeline for Pan-Scandinavian Machine-assisted Content Translation

Timeline Date
Danish→Nynorsk release 1 January 2016
SALDO lexicon converted for Swedish 15 January 2016
Danish→Bokmål release 1 February 2016
Danish→Swedish release 1 March 2016
Nynorsk→Danish release 1 April 2016
Swedish→Nynorsk, Swedish→Bokmål, Bokmål→Swedish, Nynorsk→Swedish releases 17 May 2016


Monthly updates

Please prepare a brief project update each month, in a format of your choice, to share progress and learnings with the community along the way. Submit the link below as you complete each update.

December

https://meta.wikimedia.org/wiki/Skanwiki/Skanwikiprojekt_MT#2015-12-29 and the entries below it contain reports in Norwegian; here's a short summary in English:

We have created a new release of the translator package apertium-dan-nor along with its individual language dependencies (Danish, Bokmål, Nynorsk). The package previously included Nynorsk→Danish and Bokmål→Danish translators; it now includes a Danish→Nynorsk translator for the first time!

Most of the time has been spent on vocabulary/tagging consistency. In short, something like "singular" or "noun" is a tag, and different languages tag words differently: nouns may be ungendered in one language and gendered in another, or certain verbs may take the passive in one language while there is no corresponding form in the other, etc. Most of this was dealt with in the bilingual dictionary, but some changes were also made to the monolingual dictionaries where the differences were "arbitrary" rather than motivated by true linguistic differences. Lots of plain bugs in the tagging were fixed as well. The structural transfer rules (which deal with tag/word changes covering more than one word, generalising over different words) were initially copied from the Bokmål→Nynorsk translator; some rules were deleted and others slightly modified, but all in all most of the work from that translator's transfer rules could be re-used here.

In addition to the work on dan-nor, we now have a converter script for the SALDO dictionary for Swedish. This gives us a positively huge dictionary of well-tagged verbs, nouns, adjectives, adverbs, but we still need to manually work on the closed classes (pronouns, determiners, function words) where we need to ensure taggings don't have unnecessary arbitrary differences from the other dictionaries, and where the differences are too idiosyncratic to make script-writing useful.


January

The Danish→Bokmål translator is now released, with many little improvements all round (including improvements to the three other directions between dan/nno/nob).

As with Danish→Nynorsk, transfer rules were based off Nynorsk→Bokmål and then tweaked to accept the various tagset differences in Danish. We have also used Wikidata queries (after some help from the kind people in the irc://irc.freenode.net/#mediawiki-i18n channel) to quickly get lists of proper noun translations, leading to some fixes to our proper noun translations.

We've also updated stats at http://wiki.apertium.org/wiki/Scandinavian_MT_project – in particular with Word Error Rates, where low numbers are good (fewer edits needed before publication). For a historical article, WER was down at 10.87 % for dan→nob and 13.64 % for dan→nno, while the more literary article «Peter Høeg» was up at 22.64 % for dan→nno; this gives an indication of the performance of the translators. So the best result is similar to the first edition of nob→nno (although that evaluation was back in 2009, and much has happened since).
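As a rough illustration of what a Word Error Rate measures, here is a minimal Python sketch: the word-level edit distance between the raw machine translation and its post-edited version, divided by the length of the post-edited text. The function name and the whitespace tokenisation are assumptions made for this example; the figures above were produced with the project's own evaluation setup, whose tokenisation and normalisation may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    reference:  the post-edited (publishable) text
    hypothesis: the raw machine translation
    Tokenisation is plain whitespace splitting (an assumption of this sketch).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of five reference words.
print(f"{word_error_rate('det er ei god bok', 'det er en god bok'):.2%}")  # 20.00%
```

A lower value means fewer word edits are needed before the translation can be published.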

Regarding communication, we've notified the various village pumps about https://meta.wikimedia.org/wiki/Skanwiki/Skanwikiprojekt_MT whenever there have been new releases (as well as notifying the Apertium mailing list).

February

The Swedish←→Danish translator is now released. This one was based on an old translator that hasn't seen major work since 2009, and then only supported the one direction Swedish→Danish.

The new work is modernised quite a bit, using some new technology that we didn't have in Apertium back in 2009, e.g. lexical selection (a module for choosing the right word-translation of a given word-and-part-of-speech in context) and dynamic compounding analysis (Scandinavian languages have long compound nouns like German). It also now has three-stage transfer instead of one, making it easier to do word-order changes over larger contexts, and uses shared monolingual dependencies (so we don't duplicate monolingual data across language pairs). Due to all these modernisations, a lot of work was spent on checking consistency across lexicons and rules. The Swedish monolingual dictionary was also completely changed to use the SALDO lexicon as mentioned elsewhere, which also required a lot of consistency work to synchronise the tagsets and analyses. The old Swedish→Danish translator now performs quite a bit better in terms of word-coverage (see http://article.gmane.org/gmane.comp.nlp.apertium/5613 for some details).


We also released a new version 1.1.0 of Nynorsk←→Bokmål, since there have been some word additions and rule fixes thanks to feedback from our users :-)

March

A new version 1.3.0 of the Danish-Norwegian translator is out, now with Nynorsk→Danish support; it can be tested at https://apertium.org . There are some notes in Norwegian at https://meta.wikimedia.org/wiki/Skanwiki/Skanwikiprojekt_MT#2016-04-02 and technical release notes on the Apertium mailing list at http://thread.gmane.org/gmane.comp.nlp.apertium/5779 .

The biggest change is that both Norwegian→Danish translators now use three-stage transfer instead of one, which allows changes with larger contexts. This also improves translation of compound words. Coverage is also increased, and is currently between 87% and 92% on Wikipedia text; 92%-95% when only counting lower case (a lot of the unknowns are proper nouns, which often don't need translation).
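The two coverage figures can be read as follows: the share of all tokens the translator knows, and the same share when only lower-case tokens are counted (which filters out most proper nouns). Here is a minimal Python sketch of that calculation. The `coverage` function and the `known_words` set are hypothetical stand-ins for this example; in the real pipeline it is the morphological analyser that decides whether a word is known.

```python
def coverage(tokens, known_words, lowercase_only=False):
    """Fraction of tokens found in known_words.

    known_words is assumed to be a set of lower-cased word forms.
    With lowercase_only=True, only tokens written entirely in lower case
    are counted, so capitalised words (mostly proper nouns) are ignored.
    """
    if lowercase_only:
        tokens = [t for t in tokens if t.islower()]
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t.lower() in known_words)
    return known / len(tokens)


# Toy data: the two capitalised names are unknown to the dictionary.
tokens = ["Peter", "Høeg", "er", "en", "dansk", "forfatter"]
known_words = {"er", "en", "dansk", "forfatter"}

print(f"all tokens:      {coverage(tokens, known_words):.0%}")
print(f"lower case only: {coverage(tokens, known_words, lowercase_only=True):.0%}")
```

In this toy example all of the unknown tokens are names, which is the same effect that explains the gap between the two ranges reported above.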

We have also worked on tagging consistency (removing "arbitrary" differences that had no linguistic basis) between the four monolingual dictionaries to make it easier to start on the last leg of the project: Swedish-Norwegian.

April

This month was spent mostly on Swedish-Norwegian. Coverage was increased by adding/fixing entries manually as well as by crossing the bilingual dictionaries via Danish (swe-dan + dan-nor = swe-nor); coverage on Wikipedia text has increased from 72-75% to 81-86% (all pairs around 90% when only counting lowercased words). The current coverage status is at http://wiki.apertium.org/wiki/Scandinavian_MT_project . We also worked on vocabulary consistency, but that job is far from complete.
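Crossing two bilingual dictionaries via a pivot language amounts to composing two word mappings: every Swedish word inherits the Norwegian translations of its Danish translations. The Python sketch below shows the idea on toy data. The function name and the dictionary entries are illustrative assumptions, and real dictionary entries also carry part-of-speech and other tags, which are ignored here.

```python
def cross_dictionaries(src_to_pivot, pivot_to_tgt):
    """Compose two bilingual dictionaries through a shared pivot language.

    Both arguments map a word to a set of translations. Source words whose
    pivot translations have no target entry are dropped.
    """
    crossed = {}
    for src_word, pivot_words in src_to_pivot.items():
        targets = set()
        for pivot_word in pivot_words:
            targets |= pivot_to_tgt.get(pivot_word, set())
        if targets:
            crossed[src_word] = targets
    return crossed


# Toy entries (illustrative only): Swedish -> Danish and Danish -> Bokmål.
swe_dan = {"flicka": {"pige"}, "fönster": {"vindue"}, "okänd": {"ukendt"}}
dan_nob = {"pige": {"jente", "pike"}, "vindue": {"vindu"}}

swe_nob = cross_dictionaries(swe_dan, dan_nob)
for word in sorted(swe_nob):
    print(word, "->", sorted(swe_nob[word]))
```

Note that a crossed entry can pick up several target translations ("flicka" gets two here), and entries with no path through the pivot ("okänd") still have to be added by hand.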

Swedish-Norwegian is a completely new pair, and still needs lots of work, but we plan on a first release mid-May, with full three-stage transfer in all directions, lexical selection rules, and all the difficult closed-class words done, and in such a state that the main remaining improvements should consist of increasing coverage.


We also discussed a possible future MT project with Faroese Wikipedia members.

May, June

For the 17th of May (the Norwegian constitution day, also start of the union between Sweden and Norway), we released a "beta" 0.1.0 of apertium-swe-nor; then on the 7th of June (the date of the dissolution of the Swedish-Norwegian union) we released the first complete apertium-swe-nor package, versioned 0.2.0. Release statements for 0.1.0 and 0.2.0 were posted on the Apertium list, and announcements aimed at Scandinavian wikis are at https://meta.wikimedia.org/wiki/Skanwiki/Skanwikiprojekt_MT#2016-06-07 and below.

The swe-nor package has all the features of the other pairs, like three-stage transfer, lexical selection, both rule-based and statistical disambiguation, and dynamic compounding. Transfer rules were based off copies from dan→swe (giving nor→swe) and dan→{nno,nob} (giving swe→{nno,nob}), while the bilingual dictionary was heavily based on crossing via Danish (dan-nor ∩ swe-dan = swe-nor), then augmented with lots of manual additions. Since the bilingual dictionary got quite a lot of "extra" translations that weren't always the most frequent, we used simple unigram corpus frequencies to create fallback (low-weight) lexical selection rules which rule out the more rare translations where we haven't yet made more sophisticated rules. For the transfer rules, we also added some rules for handling the Swedish supine, which only partly corresponds to Norwegian/Danish perfect forms.
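The frequency-based fallback can be sketched in a few lines of Python: for every source word with more than one candidate translation, pick the candidate that is most frequent in a target-language corpus as the default. This is only an illustration of the idea under simplified assumptions; the function name and toy data are made up for this example, and the sketch does not produce rules in Apertium's own lexical selection format.

```python
from collections import Counter


def fallback_translations(bidix, target_corpus_tokens):
    """Pick a default translation for each ambiguous source word.

    bidix maps a source word to a set of candidate translations.
    The default is the candidate with the highest unigram frequency in the
    target corpus; ties are broken alphabetically to keep the result stable.
    """
    freq = Counter(target_corpus_tokens)
    defaults = {}
    for src_word, candidates in bidix.items():
        if len(candidates) > 1:  # unambiguous entries need no fallback
            defaults[src_word] = max(sorted(candidates), key=lambda w: freq[w])
    return defaults


# Toy data (illustrative only).
bidix = {"flicka": {"jente", "pike"}, "fönster": {"vindu"}}
corpus = "ei jente og ei jente til , ikke en pike ved et vindu".split()

print(fallback_translations(bidix, corpus))  # {'flicka': 'jente'}
```

Because such defaults carry low weight, any more specific, context-sensitive rule written later can still override them.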

According to Kartik Mistry of Language Engineering, the Debian packaging is done and the packages are very close to being included in Content Translation (apparently there was some delay due to upgrading servers to newer OSes).

On the PR front, we got quite a bit in June:

and we're currently asking Scandinavian Wikipedians and Apertiumers for help with evaluation.

