Grants:IEG/Pan-Scandinavian Machine-assisted Content Translation/Midpoint
Welcome to this project's midpoint report! This report shares progress and learnings from the Individual Engagement Grantee's first 3 months.
In a few short sentences or bullet points, give the main highlights of what happened with your project so far.
- Released first version of Danish→Norwegian Bokmål machine translator
- along with updates to Norwegian Nynorsk/Bokmål→Danish, and a new version of Nynorsk↔Bokmål
- Released first version of Danish→Swedish translator
- along with many improvements to the Swedish→Danish machine translator
- Created and integrated new Swedish machine translation lexicon from the Swedish SALDO lexicon
- Co-operated with Content Translation developers on integration issues
Methods and activities
How have you set up your project, and what work has been completed so far?
Describe how you've set up your experiment or pilot, sharing your key focuses so far and including links to any background research or past learning that has guided your decisions. List and describe the activities you've undertaken as part of your project to this point.
I've followed a fairly standard workflow used in Apertium language data development. Each language pair (e.g. Swedish–Danish) provides a bidirectional translator. It depends on monolingual language packages for monolingual dictionaries and disambiguation, and provides its own bilingual dictionary and transfer rules. We commit changes to language pairs and language data in Apertium's Subversion monorepository. There's no branching or pull requesting here: typically only a couple of people work on a given language or language pair at any one time, and with this kind of data it's easier to avoid conflicts than with typical program code.
- Most relevant past learning: Created an Apertium MT system during Google Summer of Code 2009 (and have kept working on such systems afterwards). Resources I found useful then:
- but most of all, live help from other Apertium developers: http://wiki.apertium.org/wiki/IRC
- PROTIP: Find a good Apertium mentor at http://wiki.apertium.org/wiki/Apertium_mentors to help you plan your project and see it through.
- Main guidelines followed during development:
- Time-based releases, with the release schedule seen at http://wiki.apertium.org/wiki/Scandinavian_MT_project
- Prioritising dictionary consistency and coverage, at the expense of more linguistically interesting issues like getting all the syntax right
- Where possible, reuse data from better-developed nearby language pairs: their transfer and disambiguation rules can often serve as a basis, which may save time compared to writing everything from scratch, and dictionaries can often be crossed to get new word translations transitively
- PR: Tell everyone about your project, and be sure to keep updating the people who tell others.
- I've regularly updated the Scandinavian Village Pumps (as well as Apertium's own mailing list) whenever there have been releases, and notified various Twitter users, and linked to my own "hub" at https://meta.wikimedia.org/wiki/Skanwiki/Skanwikiprojekt_MT
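The dictionary crossing mentioned in the guidelines above can be illustrated with a minimal sketch. This is not Apertium's own tooling: real bilingual dictionaries are XML files with part-of-speech tags, whereas here the dictionaries are plain Python mappings, and the function name `cross_dictionaries` and the toy word lists are made up for illustration. The idea is only that a Danish→Bokmål entry and a Bokmål→Swedish entry sharing the same Bokmål word suggest a candidate Danish→Swedish entry.

```python
from collections import defaultdict


def cross_dictionaries(a_to_b, b_to_c):
    """Compose A->B and B->C word lists into candidate A->C translations.

    Both inputs map a source word to a set of target words.
    (Hypothetical helper for illustration, not part of Apertium.)
    """
    a_to_c = defaultdict(set)
    for a_word, b_words in a_to_b.items():
        for b_word in b_words:
            # Follow the pivot word into the second dictionary, if present
            for c_word in b_to_c.get(b_word, ()):
                a_to_c[a_word].add(c_word)
    return dict(a_to_c)


# Toy data, invented for this example
dan_to_nob = {"hund": {"hund"}, "spørgsmål": {"spørsmål"}, "billede": {"bilde"}}
nob_to_swe = {"hund": {"hund"}, "spørsmål": {"fråga"}}

dan_to_swe = cross_dictionaries(dan_to_nob, nob_to_swe)
print(dan_to_swe)
```

Words whose pivot translation is missing from the second dictionary (here "billede") produce no crossed entry, and ambiguous pivot words can produce wrong candidates, so crossed entries would still need manual review.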
What are the results of your project or any experiments you’ve worked on so far?
Please discuss anything you have created or changed (organized, built, grown, etc) as a result of your project to date.
The main deliverables in the timeline so far have all been released:
- Danish→Nynorsk translator
- SALDO lexicon converted for Swedish
- Danish→Bokmål translator
- Danish→Swedish translator
as well as new versions of other translators like Nynorsk↔Bokmål, Swedish→Danish, Bokmål→Danish.
All the translators have over 80 % coverage on Wikipedia text, so on average fewer than 2 out of 10 words are unknown when translating. Many of these unknowns are also "free rides" such as proper nouns, which can pass through untranslated; counting only lowercased words gives over 90 % Wikipedia coverage for most translators.
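As a rough sketch of what such a coverage figure means, the snippet below counts the share of tokens that are not marked as unknown. It assumes unknown words carry a leading asterisk, as in Apertium's default translation output, and it splits on whitespace only. The function name `coverage` and the sample sentence are invented for illustration; this is not necessarily how the figures above were computed.

```python
def coverage(translated_text, lowercase_only=False):
    """Return the share of tokens NOT marked as unknown.

    Assumption: unknown words are prefixed with '*'.
    With lowercase_only=True, tokens starting with an uppercase letter
    (often proper nouns) are left out of the count entirely.
    """
    tokens = translated_text.split()
    if lowercase_only:
        tokens = [t for t in tokens if t.lstrip("*")[:1].islower()]
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if not t.startswith("*"))
    return known / len(tokens)


# Invented sample output with three words marked as unknown
sample = "Byen *Trondheim ligger ved *fjorden i *Trøndelag"

overall = coverage(sample)                     # 4 known out of 7 tokens
lower = coverage(sample, lowercase_only=True)  # 3 known out of 4 tokens
print(f"{overall:.2f} {lower:.2f}")
```

In the sample, two of the three unknowns are capitalised place names, so restricting the count to lowercased tokens raises the score, which mirrors the difference between the 80 % and 90 % figures.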
The translator packages have also received "tech upgrades", and now all use the newest modules available in the Apertium platform:
- Monolingual dependencies for less data redundancy
- Three-stage transfer for better grammar and cleaner rules
- Lexical selection for more fluent-sounding translations
- Constraint Grammar rules for better disambiguation
- Dynamic decompounding for wider coverage of unseen words
For every release, we have Apertium release statements:
- http://article.gmane.org/gmane.comp.nlp.apertium/5498 dan-nor 1.1.0
- http://article.gmane.org/gmane.comp.nlp.apertium/5561 dan-nor 1.2.1
- http://article.gmane.org/gmane.comp.nlp.apertium/5613 swe-dan 0.7.0 including SALDO
- http://article.gmane.org/gmane.comp.nlp.apertium/5620 nno-nob 1.1.0
with links to downloads.
The releases are all packaged for Debian, and in the process of being included in the Content Translation tool (see e.g. https://phabricator.wikimedia.org/T124137 for progress on dan-nor), though the actual work here is done by people outside this IEG project. The translators themselves are also testable on https://apertium.org
On the PR side, I've written on the Wikimedia Norge blog, and co-written a press release sent to various Norwegian newspapers (though nothing has come of that yet).
Please take some time to update the table in your project finances page. Check that you’ve listed all approved and actual expenditures as instructed. If there are differences between the planned and actual use of funds, please use the column provided there to explain them.
Then, answer the following question here: Have you spent your funds according to plan so far? Please briefly describe any major changes to budget or expenditures that you anticipate for the second half of your project.
Yes. No changes to the budget have been made, and none are anticipated.
The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you are taking enough risks to learn something really interesting! Please use the below sections to describe what is working and what you plan to change for the second half of your project.
What are the challenges
What challenges or obstacles have you encountered? What will you do differently going forward? Please list these as short bullet points.
- Main challenge: Although people do get excited about this project, it's difficult to get people excited enough to actually join in contributing (even something as simple as saying "this gets translated wrong").
- Slightly challenging: Having patience while others do the packaging work that gets things into Content Translation.
On the technical side, the Apertium platform feels quite "stable"; the main challenge there is to keep working and seeing things through.
What is working well
What have you found works best so far? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your findings in the form of a link to a learning pattern.
Next steps and opportunities
What are the next steps and opportunities you’ll be focusing on for the second half of your project? Please list these as short bullet points. If you're considering applying for a 6-month renewal of this IEG at the end of your project, please also mention this here.
The next scheduled deliverables are:
- Nynorsk→Danish translator
- Swedish→Nynorsk, Swedish→Bokmål, Bokmål→Swedish, Nynorsk→Swedish translators
I also plan on doing more extensive reporting to Village Pumps and Twitter, and looking for more mailing lists or forums to notify, not only about the Scandinavian systems, but about the possibility of doing similar projects for other languages. (I am strongly considering applying for a renewal.)
We’d love to hear any thoughts you have on how the experience of being an IEGrantee has been so far. What is one thing that surprised you, or that you particularly enjoyed from the past 3 months?
So far it's been great! It has been motivating hearing so many people say they think the project is worthwhile, and it is motivating to work on something that may be useful to lots of people.
- Note on the Subversion workflow: although there's no branching or pull requesting in the shared repository, I do use git-svn myself, since I like having offline commits and diff logs.