Celtic Knot Conference 2020/Keynote: Strategies for scaling up the Irish language Wikipedia

Welcome to the note-taking pad of one of the Celtic Knot Conference 2020 sessions! This space is dedicated to collaborative note-taking, comments and questions to the speaker(s). You can edit this document directly, and use the chat feature in the bottom-side corner.

✨⏯️ Session details

Name: Keynote by Kevin Scannell
Speaker: Kevin Scannell - https://en.wikipedia.org/wiki/Kevin_Scannell - https://cs.slu.edu/~scannell/
Link to the video/replay: https://www.youtube.com/watch?v=fmoG7QIzB7s
More details: ...

💬❓ Questions

Feel free to add questions here, while or after watching the session. Please add your (user)name in brackets after the question. The host of the session will pick a few questions to ask during the livestream. The speaker or other participants will answer on this pad (asynchronously: the answer may come in a few hours or days).

Amir (User:Amire80): I'd love to hear more about the usage of machine translation. Some people love it and some, not so much. How do you use it? What challenges do you have? Do you paste manually from Google Translate, or do you use the Content Translation extension? Are there MT engines that are better than Google Translate, at least in some way?

(User:michaelgraaf) When the translation quality from the Content Translation tool is poor, it's difficult to get editors to use it, even if you tell them that using it is the way to improve it. Vicious circle.

Amir: Do people prefer to type from scratch? They can type from scratch in Content Translation, and enjoy things like easy image and link adaptation.

It's an eternal struggle. The problem is exacerbated in the Bantu languages of SA, where a well-meaning foundation is arranging translations to be done off-wiki then pasted in. To really mess things around, they are adding material to the English article texts before translation, so the output is no longer a translation of the English article. For example, inserting a "Background" paragraph before the lede. The only way to beneficiate their work for ML purposes would be to ask them to provide the entire collection of documents they've created as a corpus. Then you could match each English text with its various translations easily enough from a spreadsheet they have kept.

Meghan (User:Dowlinme): I'm an editor on Irish Wikipedia and I really like using the content translation to create articles (EN to GA). It's helpful for me to already have most of the vocabulary available and then I work on the grammar. Unfortunately infoboxes and references come out strange and have to be manually corrected, which I think puts some people off.

Rebecca (User:Smirkybec): +1 to Meghan's statement, the fact that many templates do not "translate" (as they are missing or broken) leads to frustration.

Lianne (User:Gwikor_Frank): +1! It's especially difficult for Cornish as we're doing a Wikipedia editor recruitment drive at the moment, but the speech community is generally older (we mostly learn as adults) and so tends to feel unconfident or easily confused with technology.

Amir (User:Amire80) 2: What is the name of this MT engine that was presented? I cannot find it in Phabricator. I'm in the team that develops Content Translation, and I can try checking the status.

Vigneron: +1, very good question Sir!

Answer from Prof Scannell: http://www.intergaelic.com/

Amir: Aha, I remember *Caidhean, but I'm probably not spelling it correctly. What's the correct spelling? I'd love to help.

http://www.potafocal.com/cai/ - The spelling is "caighdeán" ('standard' in Irish)

Thanks. Found it: https://phabricator.wikimedia.org/T159799 . I'll try to wake it.

Amir (User:Amire80) 3: Do you know anything about the readers of the Irish Wikipedia? Who reads it? Do people find it useful? What kind of people are they? Native speakers? Enthusiasts? School students? Adult students?

Rebecca (User:Smirkybec): Looking at the stats, it seems like there are a lot of readers in the US, which is possibly diaspora (anecdotally, I know a lot of Irish immigrants edit/read Vicipéid as a way of staying in touch with home) and also there are a lot of Irish/Celtic studies programmes in the US.

🖊️🔗 Collaborative note-taking

Feel free to take notes about the session here, add some useful links, etc.

Prof Scannell started learning Irish in the 1990s; he's a professor at St Louis University (Missouri), originally from Boston. More on https://en.wikipedia.org/wiki/Kevin_Scannell ;)
Prof Scannell has worked on several projects around Goidelic languages
Obviously, having a resource in your language is important, particularly for immersion schools.
One of the advantages he sees is for people working on Natural Language Processing: populating Wikipedias in regional and minority languages helps programmers scale up their English-trained models into other languages, since these models are all trained on Wikipedia data. This is a mixed bag, as the Irish Wikipedia isn't a great example of Irish-language writing online.
3 ways he has worked on scaling up the Gaeilge Wikipedia.
1st strategy: in 2013, a project imported an Irish-language encyclopædia (Fréamh an Eolais, https://ga.wikipedia.org/wiki/Fr%C3%A9amh_an_Eolais ) by Prof Hussey. A friend, Ciarán Ó Bréartúin, who is a supporter of open-source content, convinced the publishers of the value of republishing their book on the Irish Wikipedia; this relied on personal persuasion. Prof Scannell's contribution was to wikify the content by creating inter-article links, which can be quite tricky, especially when words have to be stemmed. Multi-word phrases and disambiguation were also challenges: for example, Neptune the planet and Neptune the god. This more than doubled the size of the Irish Wikipedia, adding ~7,000 articles written by a fluent speaker and domain expert. He's not sure this is easily repeatable for all languages, of course.
2nd strategy: Machine translation. This is a standard approach to importing articles from English, both by translating manually and by using Google Translate followed by post-editing. One particular Irish editor, Marcas Ó Duinn (User:Marcas.oduinn), has added a lot of content about old Irish history and mythology; this was a huge gap previously, with articles about Gaelic language and culture being far better covered on the English-language Wikipedia. Prof Scannell's own work uses machine translation and post-editing to translate content between different Gaelic languages (Irish: ga, Scottish Gaelic: gd, Manx: gv).
- Machine translation has got much better recently in some language pairs where there is a lot of parallel text, such as English and Mandarin Chinese. Between the closely-related Gaelic languages there are far fewer parallel texts. The Bible is one such text, but the versions are generally translations from English rather than between these languages directly.
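As a rough illustration of the wikification problem described here (inflected forms, multi-word phrases, and ambiguous titles), below is a minimal Python sketch. It is not Prof Scannell's code: the `CANDIDATES` table, the `wikify` function and the English sample data are all hypothetical, and a real system would use a proper stemmer or lemmatizer rather than a hand-written lookup. It tries the longest phrase first, links unambiguous matches with MediaWiki `[[Title|text]]` syntax, and leaves ambiguous ones unlinked for a human editor.

```python
import re

# Illustrative data only (hypothetical): lowercase surface form -> candidate titles.
CANDIDATES = {
    "solar system": ["Solar System"],                   # multi-word phrase
    "planet": ["Planet"],
    "planets": ["Planet"],                              # inflected form -> base article
    "neptune": ["Neptune (planet)", "Neptune (god)"],   # ambiguous
}


def wikify(text, candidates=CANDIDATES, max_words=3):
    """Add [[wiki links]] to text, preferring the longest matching phrase."""
    # Split into alternating word and non-word tokens so spacing is preserved.
    tokens = re.findall(r"\w+|\W+", text)
    out = []
    i = 0
    while i < len(tokens):
        matched = False
        # An n-word phrase spans 2n-1 tokens (words plus separators between them).
        for n in range(max_words, 0, -1):
            span = tokens[i:i + 2 * n - 1]
            if len(span) < 2 * n - 1:
                continue
            phrase = "".join(span)
            titles = candidates.get(phrase.lower())
            if titles is None:
                continue
            if len(titles) == 1:
                title = titles[0]
                out.append(f"[[{title}]]" if title == phrase else f"[[{title}|{phrase}]]")
            else:
                # Ambiguous: leave unlinked for a human to disambiguate.
                out.append(phrase)
            i += 2 * n - 1
            matched = True
            break
        if not matched:
            out.append(tokens[i])
            i += 1
    return "".join(out)


print(wikify("Neptune is one of the planets in the Solar System."))
```

In this sketch "Neptune" stays unlinked because the table lists two candidate articles; choosing between them needs context, which is the disambiguation challenge mentioned in the notes.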
- [Screenshare: http://InterGaelic.com ~12 minutes into the video] Demonstrating content from Scottish Gaelic to Irish
- Prof Scannell has developed APIs to handle this language translation. A friend, Michal Boleslav Měchura (https://coislife.ie/udair/michal-boleslav-mechura/), built InterGaelic from it.
- Not sure how broadly applicable this is; building the MT engine was quite challenging, but is working well between those 3 languages
3rd strategy: Wikidata, a shared database that will be more-widely discussed throughout this conference. For a long time Scannell had been planning on getting to know Wikidata; he did so on sabbatical recently.
- Irish Wikipedia has a lot of stub articles, particularly about placenames and species. As these are often manually-created they are more likely to suffer from human error — misspellings in particular.
- There are some good database sources in Irish, so linking those up to Wikidata will make it easier to create better-quality articles in the Irish-language Wikipedia.
- Scannell wrote a bot to clean up placenames in both gawiki and Wikidata. In particular, ensuring that the official placenames were accurate (https://www.logainm.ie/). Wikidata allows for descriptions in each language, so he also verified those. He avoided wholesale importing new items, due to unclear rights issues.
- As a result, he "fell in love" with Wikidata as a model, and his bot has spent a lot of time adding item labels and item descriptions in Irish: 4 million descriptions and 3 million labels as of today, with another 3 million of each queued up for addition.
  - (haven't we all fallen in love with it? :D )
- You can use the statements in Wikidata items to create sentences ("$name ($DOB–$DOD) was a $career in $place" and so on) pretty easily.
- Wikidata is great and serves Wikipedias in various ways, but the end-goal is to use it to populate the Wikipedias. Part of this is using Wikidata-originated infoboxes (see https://meta.wikimedia.org/wiki/Celtic_Knot_Conference_2020/Submissions/Wikidata-powered_infoboxes), which are poorly implemented in gawiki thus far.
- Wikidata can create very high-quality article stubs in this manner, hopefully replacing a bunch of the human effort involved. This can free up more-capable Irish-language editors to do more valuable work expanding articles.
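The sentence-template idea from these notes ("$name ($DOB–$DOD) was a $career in $place") can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `item` dictionary stands in for values that would come from a Wikidata item's statements, the field names and sample values are made up, and nothing here queries Wikidata or reflects how Prof Scannell's bot actually works.

```python
from string import Template

# Hypothetical stub template, modelled on the pattern quoted in the notes.
STUB = Template("${name} (${born}–${died}) was a ${career} in ${place}.")


def stub_sentence(item):
    """Return a stub sentence, or None if any required field is missing."""
    try:
        return STUB.substitute(item)
    except KeyError:
        # Skip incomplete items rather than emit a broken sentence.
        return None


# Placeholder values, not real Wikidata content.
item = {
    "name": "Jane Example",
    "born": 1900,
    "died": 1980,
    "career": "poet",
    "place": "Exampletown",
}
print(stub_sentence(item))
```

Returning `None` for incomplete items is one possible design choice: it keeps half-filled sentences out of generated stubs, at the cost of skipping items that a human could still describe. A real Irish-language template would also need to handle grammatical changes to the inserted words, which this sketch ignores.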
This is an interesting paper: https://arxiv.org/abs/2004.09095 The State and Fate of Linguistic Diversity and Inclusion in the NLP World. It summarises where various languages stand as regards resources for Natural Language Processing, on a scale from 0 to 5. There is a GitHub page at https://microsoft.github.io/linguisticdiversity/ including a list of where they thought each language was: https://microsoft.github.io/linguisticdiversity/assets/lang2tax.txt They have all of the Celtic languages at level 1 except Irish at 2. User:DavydhT
Rebecca: There was quite a lot of jealousy in the Irish-language Wikipedia community towards the Welsh-language community, which had a lot of Wikidata-originated content, before Prof Scannell's work here.
[Some discussion of InterGaelic that didn't get noted here]. The output of the API is richer than what Google Translate would give you, as it provides the alignment between source and destination language, which Google does not. Rebecca points out that GTranslate isn't great for translating into Irish. Hopefully Prof Scannell's work on machine translation will help with that in on-wiki tools in the future?
- Google Translate has improved a lot in the last 3 years, though it's notably better towards English than away from it. There's a way to go yet, but heavy post-editing is no longer needed. There are other MT engines out there; it's a difficult space to compete in, as things move so fast and a lot of money is poured into the sector. As training on large datasets often needs supercomputer clusters, it's much easier if you're working in a specific niche, like Prof Scannell is doing.
The personal relationships needed to import a whole encyclopædia into gawiki were very important. Is something like that likely to happen again?
- There's a real culture of openness with the people working with the Irish language — software developers on MT are sharing resources, software and data between themselves, and often under open licenses. This has leaked into other areas too — Logainm.ie is an amazing database of placename information, available on an open licence. (It's unclear if it's a compatible Open licence.) There's definitely willingness to help each other out here.
  - Ok, so: I got permission to import links to Logainm a few years ago (before I proposed the Logainm ID property), the problem is that importing other information isn't possible because of Logainm's licence (CC-BY: https://www.logainm.ie/en/inf/proj-copyright) -- User:Jimregan
- Rebecca points out there's worry about data sources becoming subsumed into Wikipedia, so it's important to help people understand that they will always have a place, especially given the nature of linked data on Wikidata.

✨✨✨✨✨
More information about the Celtic Knot Conference 2020: https://meta.wikimedia.org/wiki/Celtic_Knot_Conference_2020
The Friendly Space Policy also applies on this space: https://meta.wikimedia.org/wiki/Celtic_Knot_Conference_2020/Friendly_Space_Policy