Grants:IEG/Pan-Scandinavian Machine-assisted Content Translation/Final



Welcome to this project's final report! This report shares the outcomes, impact and learnings from the Individual Engagement Grantee's 6-month project.

Part 1: The Project

Summary

In a few short sentences, give the main highlights of what happened with your project. Please include a few key outcomes or learnings from your project in bullet points, for readers who may not make it all the way through your report.

  • Created new, and updated existing, free and open source machine translation data packages for use in Wikimedia's Content Translation tool
    • Created new translators Swedish↔Nynorsk, Swedish↔Bokmål, Danish→Swedish, Danish→Nynorsk, Danish→Bokmål, Nynorsk→Danish
    • Updated and modernised existing machine translators for Swedish→Danish, Bokmål→Danish, Nynorsk↔Bokmål
  • Worked with Language Engineering team members to make sure packaging went smoothly
  • Spread the good word about Content Translation wherever I could think of


Methods and activities

What did you do in the project?

Please list and describe the activities you've undertaken during this grant. Since you already told us about the setup and first 3 months of activities in your midpoint report, feel free to link back to those sections to give your readers the background, rather than repeating yourself here, and mostly focus on what's happened since your midpoint report in this section.

  • Mostly followed the same workflow as in Grants:IEG/Pan-Scandinavian_Machine-assisted_Content_Translation/Midpoint#Methods_and_activities
    • Minor workflow improvement: having another computer continually running dictionary consistency testing on the newest code (similar in spirit to having a Continuous Integration server) sped things up a lot, as opposed to a purely local compile-test-edit-cycle
    • Minor development improvement: pre-filling lexical selection files with low-weight (fallback) unigram corpus frequencies to avoid spending time writing rules just to deselect very rare words (bigrams/trigrams would be better, but learning those takes a lot more time, resources and up-front effort)
  • Focused a bit more on PR (Twitter/Reddit, talking on Village Pumps, helping with blog post translation)
  • Reached out to Scandinavian-speaking Apertiumers and Wikipedians who might be able to help with evaluation
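
The unigram pre-filling mentioned in the workflow notes above could look roughly like the following. This is a minimal sketch, not the project's actual script: the function name `fallback_rules`, the toy dictionary, the toy frequencies and the weight `0.1` are all invented for illustration, and the emitted XML is only loosely modelled on Apertium's lexical selection rule files, so the exact element and attribute names should be checked against the Apertium documentation.

```python
from xml.sax.saxutils import quoteattr


def fallback_rules(translations, target_freq, weight=0.1):
    """Yield one low-weight rule per ambiguous source entry.

    translations: {(source_lemma, pos): [candidate target lemmas]}
    target_freq:  {target_lemma: unigram count in some target-language corpus}
    The rule selects the candidate that is most frequent in the corpus, so
    hand-written (higher-weight) rules are only needed for the real exceptions.
    """
    for (lemma, pos), candidates in sorted(translations.items()):
        if len(candidates) < 2:
            continue  # unambiguous entries need no selection rule
        best = max(candidates, key=lambda c: target_freq.get(c, 0))
        yield (
            f'<rule weight="{weight}">'
            f"<match lemma={quoteattr(lemma)} tags={quoteattr(pos + '.*')}>"
            f"<select lemma={quoteattr(best)}/></match></rule>"
        )


# Toy data (hypothetical, not taken from any real dictionary or corpus).
toy_translations = {
    ("src_word", "n"): ["rare_tgt", "common_tgt"],
    ("other_word", "vblex"): ["only_tgt"],
}
toy_freq = {"common_tgt": 950, "rare_tgt": 3, "only_tgt": 120}

rules = list(fallback_rules(toy_translations, toy_freq))
print("\n".join(rules))
```

Because the generated rules carry a low weight, any manually written rule with a higher weight would still take precedence; that is the assumed meaning of "fallback" here.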

Outcomes and impact

Outcomes

What are the results of your project?

Please discuss the outcomes of your experiments or pilot, telling us what you created or changed (organized, built, grew, etc) as a result of your project.

For every way of pairing two of the mainland Scandinavian Wikipedias (Nynorsk, Bokmål, Swedish and Danish), there is now a free and open source Apertium machine translator available and packaged for inclusion in Content Translation.

(Debian packaging of the sources was done mainly by Tino Didriksen of the Apertium project and Kartik Mistry of Wikimedia Language Engineering.)
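
As a quick sanity check on "every way of pairing", the translators listed in the Summary can be compared against all directed pairs of the four languages. The snippet below is only an illustrative sketch; the helper name `expand` is made up, and the pair lists are copied from the Summary bullets.

```python
from itertools import permutations

LANGS = ["Nynorsk", "Bokmål", "Swedish", "Danish"]


def expand(spec):
    """Expand 'A→B' into one directed pair and 'A↔B' into both directions."""
    if "↔" in spec:
        a, b = spec.split("↔")
        return {(a, b), (b, a)}
    a, b = spec.split("→")
    return {(a, b)}


# Copied from the Summary section of this report.
created = ["Swedish↔Nynorsk", "Swedish↔Bokmål", "Danish→Swedish",
           "Danish→Nynorsk", "Danish→Bokmål", "Nynorsk→Danish"]
updated = ["Swedish→Danish", "Bokmål→Danish", "Nynorsk↔Bokmål"]

covered = set().union(*(expand(spec) for spec in created + updated))
all_pairs = set(permutations(LANGS, 2))  # 4 languages give 4 * 3 directed pairs
missing = all_pairs - covered

print(len(all_pairs), len(covered), sorted(missing))  # prints: 12 12 []
```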


Apparently,

   "One new @Wikipedia article is created via our content translation tool every 7 minutes" – @WhatToTranslate
   https://twitter.com/WikiResearch/status/662357460322160640

and from what I can tell from the stats, having MT available seems to increase usage of the tool.


Since not all of the new pairs are installed in Content Translation yet, we'll have to wait a bit before evaluating their direct impact on Wikipedia growth, but it seems likely they'll have a positive effect. https://en.wikipedia.org/wiki/Special:ContentTranslationStats#cx-stats-publishedtab-0 says that

  • of 3437 articles translated to Bokmål, 1698 came from Nynorsk (MT), 290 Swedish (no MT[1]) and 180 Danish (no MT)
  • of 567 articles translated to Nynorsk, 546 came from Bokmål (MT), 5 Swedish (no MT)
  • of 591 articles translated to Danish, 265 came from Swedish (MT), 53 from Bokmål (no MT) and 3 from Nynorsk (no MT)
  • of 239 articles translated to Swedish, 4 came from Danish (no MT) and 3 from Bokmål (no MT)

The remaining translations are mostly from English or other large European languages.

Danish has had Apertium MT from Swedish in Content Translation for a while and has many translations from Swedish (265 of 591), whereas Swedish does not yet have Apertium MT from Danish installed and has very few translations from Danish (4 of 239). This might be evidence that MT availability increases usage of Content Translation; the same trend seems to hold across the other figures above.
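
To make the comparison easier to see, the counts from the list above can be turned into percentages per source language. This is just a small arithmetic sketch; the structure `PUBLISHED` and the function `source_shares` are names invented here, and the numbers are the ones quoted from the statistics page.

```python
# target -> (total published, {source: (count, has Apertium MT installed)})
PUBLISHED = {
    "Bokmål": (3437, {"Nynorsk": (1698, True), "Swedish": (290, False), "Danish": (180, False)}),
    "Nynorsk": (567, {"Bokmål": (546, True), "Swedish": (5, False)}),
    "Danish": (591, {"Swedish": (265, True), "Bokmål": (53, False), "Nynorsk": (3, False)}),
    "Swedish": (239, {"Danish": (4, False), "Bokmål": (3, False)}),
}


def source_shares(total, sources):
    """Return {source: percentage of all translations into the target}, to one decimal."""
    return {src: round(100 * count / total, 1) for src, (count, _mt) in sources.items()}


for target, (total, sources) in PUBLISHED.items():
    shares = source_shares(total, sources)
    for src, (count, has_mt) in sources.items():
        label = "MT" if has_mt else "no MT"
        print(f"{src} -> {target}: {count}/{total} = {shares[src]}% ({label})")
```

With these figures, each MT-enabled source accounts for between about 45% and 96% of all translations into its target, while every Scandinavian source without Apertium MT stays at 9% or below. The sample is small and other factors (such as community size) are not controlled for, so this is suggestive rather than conclusive.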

Progress towards stated goals

Please use the below table to:

  1. List each of your original measures of success (your targets) from your project plan.
  2. List the actual outcome that was achieved.
  3. Explain how your outcome compares with the original target. Did you reach your targets? Why or why not?
Planned measure of success (include numeric target, if applicable) | Actual result | Explanation
Danish→Nynorsk release | released 1 January 2016 |
SALDO lexicon converted for Swedish | released 15 January 2016 |
Danish→Bokmål release | released 1 February 2016 |
Danish→Swedish release | released 1 March 2016 |
Nynorsk→Danish release | released 1 April 2016 |
Swedish→Nynorsk, Swedish→Bokmål, Bokmål→Swedish, Nynorsk→Swedish releases | released 17 May 2016 as "beta" (needed a bit more work to be release-quality), final release 7 June 2016 |


The above are the "intrinsic" measures. Until the pairs are all installed in Content Translation, it's unfortunately not possible to measure their effect on Wikipedia growth.


Think back to your overall project goals. Do you feel you achieved your goals? Why or why not?

Regarding the code/data parts, I feel I have achieved my goals. There is of course always more that could be done, but all the translators are now in such a state that they no longer need the kind of major changes that require co-ordinated work. So it should be easy for different committers to make continual improvements to the translators at different times.

I would have liked to get more community involvement; here I'm not sure what I could have done to improve things. It's difficult to motivate Wikipedia editors to take part if they can't use the tools immediately, so integration with Content Translation and getting fast updates is vital.

Global Metrics

We are trying to understand the overall outcomes of the work being funded across all grantees. In addition to the measures of success for your specific program (in above section), please use the table below to let us know how your project contributed to the "Global Metrics." We know that not all projects will have results for each type of metric, so feel free to put "0" as often as necessary.

  1. Next to each metric, list the actual numerical outcome achieved through this project.
  2. Where necessary, explain the context behind your outcome. For example, if you were funded for a research project which resulted in 0 new images, your explanation might be "This project focused solely on participation and articles written/improved, the goal was not to collect images."

For more information and a sample, see Global Metrics.

Metric | Achieved outcome | Explanation
1. Number of active editors involved | 1 |
2. Number of new editors | 0 | Not a direct goal of the project
3. Number of individuals involved | 8 | including main developer, mentor, packagers, PR contacts, testers
4. Number of new images/media added to Wikimedia articles/pages | 1 | Not a direct goal of the project
5. Number of articles added or improved on Wikimedia projects | 3 | Not measured yet – hopefully in the hundreds/thousands after the packages are activated, but so far just some evaluation articles
6. Absolute value of bytes added to or deleted from Wikimedia projects | 0 | Not a direct goal of the project


Learning question
Did your work increase the motivation of contributors, and how do you know?
  • I got a lot of thumbs up and cheers from people wanting to use the tool once available, and questions about "when can I use it", so in the sense of "we're getting something new and shiny" it seems to have been motivating.

Indicators of impact

Do you see any indication that your project has had impact towards Wikimedia's strategic priorities? We've provided 3 options below for the strategic priorities that IEG projects are most likely to impact. Select one or more that you think are relevant and share any measures of success you have that point to this impact. You might also consider any other kinds of impact you had not anticipated when you planned this project.

Option A: How did you increase participation in one or more Wikimedia projects?

Option B: How did you improve quality on one or more Wikimedia projects?

Option C: How did you increase the reach (readership) of one or more Wikimedia projects?

  • As the machine translators are not yet installed, we can't yet say we've increased reach/participation/quality, but we do expect at the very least reach to increase due to content becoming available in more languages. There might also be more participation because people who might not otherwise be active contributors might enjoy the new translation workflow.

Project resources

Please provide links to all public, online documents and other artifacts that you created during the course of this project. Examples include: meeting notes, participant lists, photos or graphics uploaded to Wikimedia Commons, template messages sent to participants, wiki pages, social media (Facebook groups, Twitter accounts), datasets, surveys, questionnaires, code repositories... If possible, include a brief summary with each link.

  • Some of the people who got involved, centrally and peripherally:
    • Kevin Brubeck Unhammer – main developer
    • Francis M. Tyers – mentor, contributor
    • Tino Didriksen – Apertium packager (Debian packages)
    • Kartik Mistry – Wikimedia Language Engineering (Debian packages)
    • Trond Trosterud – Apertium contributor (feedback and contributions)
    • Astrid Carlsen – Wikimedia Norway Executive Director (help with PR, blogs)
    • John Erling Blad – Wikimedia enthusiast (feedback, help with PR, finding relevant contacts)

Learning

The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you took enough risks in your project to have learned something really interesting! Think about what recommendations you have for others who may follow in your footsteps, and use the below sections to describe what worked and what didn’t.

What worked well

What did you try that was successful and you'd recommend others do? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.


Endorsed:

Also a new one, though a bit technical:

(Previously created a new one during Midpoint: Developing Apertium MT for your language in Content Translation)

What didn’t work

What did you try that you learned didn't work? What would you think about doing differently in the future? Please list these as short bullet points.

  • So far, simply asking for feedback on translations has given very few results. We need to find a way to make it easier for people to find and report faults in our systems.

Other recommendations

If you have additional recommendations or reflections that don’t fit into the above sections, please list them here.

Next steps and opportunities

Are there opportunities for future growth of this project, or new areas you have uncovered in the course of this grant that could be fruitful for more exploration (either by yourself, or others)? What ideas or suggestions do you have for future projects based on the work you’ve completed? Please list these as short bullet points.

  • Expanding the translation pairs into Faroese and possibly Icelandic would be the next logical step – but we'd have to recruit participants willing to spend some time going through translations
  • Support for Translation Memories in Content Translation might turn out to be very useful – Apertium supports this to a certain extent, but nothing very integrated exists yet
  • Writing a scientific article to showcase the work to a wider natural-language technology audience


Think your project needs renewed funding for another 6 months?




Part 2: The Grant

Finances

Actual spending

Please copy and paste the completed table from your project finances page. Check that you’ve listed the actual expenditures compared with what was originally planned. If there are differences between the planned and actual use of funds, please use the column provided to explain them.

Expense Approved amount Actual funds spent Difference
Main developer, 6 months 50 % work 12,500 $
Co-funding by Apertium project -2,500 $
Total 10,000 $


Remaining funds

Do you have any unspent funds from the grant?

Please answer yes or no. If yes, list the amount you did not use and explain why.

  • No

If you have unspent funds, they must be returned to WMF. Please see the instructions for returning unspent funds and indicate here if this is still in progress, or if this is already completed:


Documentation

Did you send documentation of all expenses paid with grant funds to grantsadmin(_AT_)wikimedia.org, according to the guidelines here?

Please answer yes or no. If no, include an explanation.

  • Yes

Confirmation of project status

Did you comply with the requirements specified by WMF in the grant agreement?

Please answer yes or no.

  • Yes

Is your project completed?

Please answer yes or no.

  • Yes

Grantee reflection

We’d love to hear any thoughts you have on what this project has meant to you, or how the experience of being an IEGrantee has gone overall. Is there something that surprised you, or that you particularly enjoyed, or that you’ll do differently going forward as a result of the IEG experience? Please share it here!

This has been a great experience: creating useful language technology that will benefit Wikipedia, and seeing other people get excited about it too :) If I were doing it again, I'd probably think a bit more about how to get feedback faster from human translators, since that was the most challenging part, but overall it seems to have been a success.


Footnotes

  1. By "no MT" I mean no Apertium MT. Yandex statistical MT support was recently added to all pairs among Swedish, Danish and Bokmål (but not Nynorsk); however, the counts here are from well before Yandex support was added, so that should not have too large an effect. Reading the discussions, it seems Yandex quality could be better.