Celtic Knot Conference 2020/Submissions/Search Support for Minority Languages/Discussions

Welcome to the note-taking pad of one of the Celtic Knot Conference 2020 sessions! This space is dedicated to collaborative note-taking, comments and questions to the speaker(s). You can edit this document directly, and use the chat feature in the bottom-side corner.

✨⏯️ Session details

Name: Search Support for Minority Languages
Speaker: https://meta.wikimedia.org/wiki/User:TJones_(WMF)
Link to the video/replay: https://www.youtube.com/watch?v=Pi3-w9ne3zg
Slides & Notes: https://commons.wikimedia.org/wiki/File:Search_Support_for_Minority_Languages_(Celtic_Knot_July_2020).pdf
More details: Celtic Knot Conference 2020/Submissions/Search Support for Minority Languages
See also: Wikidata helpdesk (Celtic Knot Conference 2020/Wikidata helpdesk)

💬❓ Questions

Feel free to add questions here, while or after watching the session. Please add your (user)name in brackets after the question. The host of the session will pick a few questions to ask during the livestream. The speaker or other participants will answer on this pad (asynchronously: the answer may come in a few hours or days).

See the Q&A section in the Collaborative note-taking section below for answers to some of these.

User:Amire80: Adding this before the session even started, hoping that it will be relevant :) — For languages that don't have nice built-in support plugins in our search platform, how can the volunteers contribute to it? Send words, send grammar or morphology rules, etc.? And how can people contribute improvements to existing plug-ins?
- (See Q&A below)
And this is just a shout-out: Oracle recently named "Wikipedia Search" as one of "The 25 greatest Java apps ever written": https://blogs.oracle.com/javamagazine/the-top-25-greatest-java-apps-ever-written
- Ha! They mention Lucene and Elasticsearch (which is built on Lucene). The WMF Search team does customize Elasticsearch, and we do write plugins for it, but our main customization and integration layer is CirrusSearch, which is written in PHP. https://www.mediawiki.org/wiki/Extension:CirrusSearch [User:TJones (WMF)]
More of a note than a question. Termau.cymru can find a word if you give it a mutated form. http://termau.cymru/#gath finds "cath" 'cat'. I can't remember exactly how the Maes-T software does it internally, but in principle it should be fairly easy to adapt it to the Cornish dictionary, which uses the same backend software https://cornishdictionary.org.uk/ [User:DavydhT]
- It looks like the mutation is not mechanically reversible in some cases. Both b- and m- soften to f-, so you need a dictionary to know what the right answer is, which makes it more complex to do in software. Alas. [User:TJones (WMF)]
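The ambiguity in this thread can be sketched in a few lines of Python. This is only an illustration of why a dictionary is needed, not how Maes-T or the WMF search stack works: the function names and the tiny lexicon are hypothetical, only soft mutation is covered, and the case where soft mutation deletes an initial g- is ignored.

```python
# Reverse soft-mutation table: mutated initial -> possible radical initials.
# "dd" is listed first so it is tried before the shorter "d".
REVERSE_SOFT = [
    ("dd", ["d"]),
    ("b", ["p"]),
    ("d", ["t"]),
    ("g", ["c"]),
    ("f", ["b", "m"]),  # ambiguous: both b- and m- soften to f-
    ("l", ["ll"]),
    ("r", ["rh"]),
]

def radical_candidates(word):
    """Return the word itself plus every radical form it could come from."""
    candidates = [word]
    for mutated, radicals in REVERSE_SOFT:
        if word.startswith(mutated):
            rest = word[len(mutated):]
            candidates += [radical + rest for radical in radicals]
            break
    return candidates

def lookup(word, lexicon):
    """Keep only the candidates that are attested in the lexicon."""
    return [c for c in radical_candidates(word) if c in lexicon]

# Hypothetical three-word lexicon, for illustration only.
LEXICON = {"cath", "bara", "mam"}
```

With this lexicon, lookup("gath", LEXICON) finds "cath", while "fara" produces both "bara" and "mara" as candidates and only the lexicon can rule out the second.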
I have a list of stopwords in Breton (https://github.com/Wanibzh29/16.-Breton-Stopwords-br_fr/blob/master/Stopwords%20br-fr%20(breton) ), are you interested? ;) (Nicolas VIGNERON)
- yes please [User:DavydhT]
- Very cool! The list is longer than I would have expected, and I see some potential tokenization issues (apostrophes often cause problems). Next week, I’ll open a Phab ticket and see what’s currently happening with Breton search indexing. Maybe we can upgrade the diacritic-stripping, too. [User:TJones (WMF)]
Where do those corpora (for generating stopword lists, stemming formulas etc.) come from when we talk about minority languages?
- (See Q&A below)
Where does one find search-related tickets for my language in Phabricator?
- (See Q&A below)
Could Lexemes be useful? (We don't tag stopwords yet, but we have stems.) (Nicolas VIGNERON)
- At some point Lexemes will reach a critical mass of volume and complexity and then they will not only be useful but indispensable for all sorts of NLP work. (A long time ago I would have been very hesitant to believe in the Lexemes project, but Wikipedia and Wikidata have shown that there is a will to do this kind of super valuable work.) For search right now, Lexemes could become a source of data for stemmers and other kinds of NLP applications. I don't think it would be practical to, say, look up lemmas for words in real time, at least not at the scale needed to support on-wiki search. But having the data in a computer-readable (and computer-friendly!) format would be so valuable; it would certainly be better than scraping Wiktionary (which is something I've considered doing). [User:TJones (WMF)]

🖊️🔗 Collaborative note-taking

Feel free to take notes about the session here, add some useful links, etc.

[Introduction to Trey Jones, Senior Computational Linguist in the Search Platform team at WMF]
This was originally pitched as a lightning talk, so it will be a relatively short presentation followed by plenty of time for Q&A.

Trey Jones

Part of Trey's job is to work on language-specific search, so this talk provides information that should help people contribute improvements for their language
Tokenisation: what counts as a word. In most European languages this is relatively straightforward, but hyphenation, for example, complicates things
Segmentation is what we call tokenisation when it's very hard.
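A minimal Python sketch of the hyphenation problem: one possible approach (an assumption, not necessarily what the WMF search stack does) is to index a hyphenated word both as a whole and as its parts, so either form of a query can match. The names TOKEN_RE and tokenise are hypothetical.

```python
import re

# Runs of word characters, optionally joined by internal hyphens.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*")

def tokenise(text):
    """Return tokens; a hyphenated word yields the whole form plus its parts."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        word = match.group(0)
        tokens.append(word)
        if "-" in word:
            tokens.extend(word.split("-"))
    return tokens
```

For example, tokenise("a well-known engine") keeps "well-known" and also emits "well" and "known". This only works for languages that put spaces between words, which is where segmentation comes in.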
Normalisation of case: Turkish pairs I with ı and İ with i; Irish has prefixes that are handled differently in capitalised words
- Diacritic folding, removing some (but not all) diacritics
- Serbian has 2 alphabets, so normalisation involves transliterating from Cyrillic
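The three normalisation steps above can be sketched in Python as follows. These are illustrative helpers with hypothetical names, not the filters the search platform actually uses; in particular, which diacritics to keep for a given language is a per-language decision, represented here only by the keep parameter.

```python
import unicodedata

def turkish_lower(text):
    """Lowercase with Turkish rules: I -> dotless i, dotted capital I -> i."""
    # U+0130 is the dotted capital I, U+0131 is the dotless small i.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()

def fold_diacritics(text, keep=""):
    """Strip combining marks, except from characters listed in keep."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)

# Serbian Cyrillic letters paired with their Latin equivalents (30 letters).
SR_CYR_TO_LAT = dict(zip(
    "абвгдђежзијклљмнњопрстћуфхцчџш",
    "a b v g d đ e ž z i j k l lj m n nj o p r s t ć u f h c č dž š".split(),
))

def serbian_to_latin(text):
    """Transliterate Serbian Cyrillic to Latin, preserving capitalisation."""
    out = []
    for ch in text:
        latin = SR_CYR_TO_LAT.get(ch.lower())
        if latin is None:
            out.append(ch)            # not a Serbian Cyrillic letter
        elif ch.isupper():
            out.append(latin.capitalize())
        else:
            out.append(latin)
    return "".join(out)
```

Generic lowercasing gets the Turkish pairs wrong, which is why turkish_lower replaces the two capitals before calling lower().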
Stemming: mapping related forms (hopes, hoped, hoping) to a common stem, and handling consonant mutation (athair, n-athair in Irish)
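A toy Python illustration of the idea, assuming a naive suffix-stripping approach. Real stemmers are far more careful than this, and the suffix list, the minimum stem length and both function names are invented for the example.

```python
# Suffixes are tried in order, longest first.
SUFFIXES = ("ing", "ed", "es", "e", "s")

def toy_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            return stem
    return word

def strip_irish_prefix(word):
    """Remove a hyphenated n- or t- prefix, as in n-athair."""
    for prefix in ("n-", "t-"):
        if word.startswith(prefix):
            return word[len(prefix):]
    return word
```

Here hope, hopes, hoped and hoping all reduce to "hop", so a search for one form can match the others, and n-athair is indexed under athair.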
Stop words: in a query like "identification of languages" you'd discount "of" when searching. But there are edge cases, like the group "The The". Irish has h, n and t as stop words because of consonant mutation (compare n-athair above).
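A small Python sketch of stop-word filtering with one possible way of handling the "The The" edge case: if every token is a stop word, keep the query as it is. That fallback is an assumption made for illustration, not a description of how on-wiki search behaves, and the word list is a tiny made-up sample.

```python
# Tiny illustrative list, not a real stop-word list for any language.
STOP_WORDS = {"a", "an", "of", "the"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words, unless that would leave nothing to search for."""
    kept = [t for t in tokens if t.lower() not in stop_words]
    return kept if kept else list(tokens)
```

So ["identification", "of", "languages"] loses "of", while ["The", "The"] is returned unchanged.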
Language-specific tools
- Elision
- Chinese is written in both Simplified and Traditional characters
- Khmer is a great example, as there is a "correct" order to put the character parts together, but a different order can result in the same character:
  - ង្ក្រា ( ង + ្ក + ្រ + ា )
  - ង្រ្កា ( ង + ្រ + ្ក + ា )
  - ង្រា្ក ( ង + ្រ + ា + ្ក )
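One way to make the three sequences above compare as equal is to reorder each cluster before indexing. The Python sketch below assumes the first sequence listed is the target order (subscript consonants first, with the subscript ្រ last among them, then the vowel sign) and that the input is a single cluster; the function name is hypothetical. Standard Unicode normalisation such as NFC does not do this reordering.

```python
COENG = "\u17d2"  # marker that turns the following consonant into a subscript
RO = "\u179a"     # the consonant whose subscript form is written last

def normalise_khmer_cluster(cluster):
    """Reorder one cluster: base, subscripts (ro last), then other signs."""
    base, rest = cluster[0], cluster[1:]
    subscripts, others = [], []
    i = 0
    while i < len(rest):
        if rest[i] == COENG and i + 1 < len(rest):
            subscripts.append(rest[i:i + 2])  # marker plus its consonant
            i += 2
        else:
            others.append(rest[i])
            i += 1
    # Stable sort: only the subscript ro moves, and it moves to the end.
    subscripts.sort(key=lambda pair: pair[1] == RO)
    return base + "".join(subscripts) + "".join(others)

# The three sequences from the notes, written as code points.
VARIANT_1 = "\u1784\u17d2\u1780\u17d2\u179a\u17b6"
VARIANT_2 = "\u1784\u17d2\u179a\u17d2\u1780\u17b6"
VARIANT_3 = "\u1784\u17d2\u179a\u17b6\u17d2\u1780"
```

After normalisation all three variants become VARIANT_1, so a query typed in any of the orders can match text typed in another.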
Current state of affairs
- New stemmers: [language list]
A really big part of improving search in any language is having a volunteer who can help, so Trey generally asks "tell me what makes search suck in your language?"

Q&A

User:Amire80: Adding this before the session even started, hoping that it will be relevant :) — For languages that don't have nice built-in support plugins in our search platform, how can the volunteers contribute to it? Send words, send grammar or morphology rules, etc.? And how can people contribute improvements to existing plug-ins?
- It depends on the situation. Stop words and elision are often very straightforward: you can open a Phabricator ticket and tag Trey. For existing plugins, it depends on whether we have access to the code and can support them. Anything available for Elasticsearch is easier for Trey's team to customise.
- Dealing with diacritics is a particularly difficult problem.
- For morphology and grammar, we generally need existing software with a compatible licence. If there's a straightforward set of rules it's less of a problem.
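To show why stop words and elision are the straightforward cases, here is a sketch of Elasticsearch index settings written as a Python dict. The filter types (stop, elision, lowercase) and their parameters are standard Elasticsearch ones, and the Irish values follow the pattern Elasticsearch documents for its built-in irish analyzer. The filter and analyzer names are made up, and this is not necessarily the configuration CirrusSearch uses.

```python
import json

INDEX_BODY = {
    "settings": {
        "analysis": {
            "filter": {
                # Drops the h, n, t left over when prefixed words are split.
                "ga_hyphenation_stop": {
                    "type": "stop",
                    "stopwords": ["h", "n", "t"],
                    "ignore_case": True,
                },
                # Removes elided particles before an apostrophe.
                "ga_elision": {
                    "type": "elision",
                    "articles": ["d", "m", "b"],
                    "articles_case": True,
                },
                "ga_lowercase": {"type": "lowercase", "language": "irish"},
                "ga_stop": {"type": "stop", "stopwords": "_irish_"},
            },
            "analyzer": {
                "ga_sketch": {
                    "tokenizer": "standard",
                    "filter": [
                        "ga_hyphenation_stop",
                        "ga_elision",
                        "ga_lowercase",
                        "ga_stop",
                    ],
                }
            },
        }
    }
}

# Serialised form, as it would be sent when creating an index.
INDEX_BODY_JSON = json.dumps(INDEX_BODY, indent=2)
```

A volunteer-supplied stop-word list would replace the "_irish_" placeholder with an explicit list of words, which is why a plain word list is an easy contribution to act on.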
Oracle recently named Wikipedia Search as one of the 25 greatest Java applications ever written. That's funny, because the layer we control is written in PHP 😉
[A question about Welsh mutations, Breton stop words] (See Questions above)
If it's a foreign.jpg, can English search find it? [context of the question unclear]
Where do those corpora (for generating stopword lists, stemming formulas etc.) come from when we talk about minority languages?
- You don't necessarily need corpora for stop words. A corpus helps, because it allows frequency analysis. For Mirandese, we started with the list of Portuguese stop words and the language volunteer translated those to Mirandese as a starting point.
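When a corpus is available, the frequency analysis mentioned above can be as simple as counting words and having a speaker review the most common ones. A minimal Python sketch follows; the function name, the tokenising regex and the cut-off are assumptions for illustration.

```python
import re
from collections import Counter

def candidate_stop_words(texts, top_n=20):
    """Return the top_n most frequent words as stop-word candidates."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return [word for word, _ in counts.most_common(top_n)]
```

The output is only a candidate list: frequent content words still need to be weeded out by someone who knows the language.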
Where does one find minority language search tickets in Phabricator?
- https://phabricator.wikimedia.org/search/query/advanced/
- Search for the language name, generally.
  - You probably want to check "Open" for "Document Status", and add "Task" for "Document Types"—that will limit your results to currently open tickets.
  - Search in Phabricator is often "fun"
How hard stemming is depends a lot on the language. Esperanto was easiest, because it is deliberately regular. Stemming in English is a disaster because the English language is just wild.
The best ways to get in touch with Trey are on-wiki as User:TJones (WMF) or tjones@wikimedia.org. Trey is TJones on Phabricator and the team is taggable as "Discovery", because the tag has not moved to reflect the newer team name. Trey is happy to get tickets into Phabricator from other comms methods, if that is useful for people.

✨✨✨✨✨
More information about the Celtic Knot Conference 2020: https://meta.wikimedia.org/wiki/Celtic_Knot_Conference_2020
The Friendly Space Policy also applies on this space: https://meta.wikimedia.org/wiki/Celtic_Knot_Conference_2020/Friendly_Space_Policy