User talk:Isaac (WMF)/Language modeling

Anyone should feel free to use this page to give feedback on the proposed taxonomy of language modeling approaches. I will also use it to summarize conversations that I have about it.

Research Team 25 May 2021

Regarding how many languages we should seek to support with our modeling:
- How do we view trade-offs in coverage and performance? E.g., 95% accuracy in a few languages vs. 70% in all languages?
- Strong agreement with aiming to have a baseline language-agnostic model where possible (coverage of all ~300 language editions if the model is for Wikipedia) and then considering more language-dependent/specific improvements
Related efforts:
- Should reach out to Search to understand their experiences/coverage, as they in effect do a large amount of language-specific modeling to run their search indices
- Related concepts in the research sphere: "multilingual by default". DS proposed "Inclusive Document Representation" or IDR as the catchphrase.
- In general, plenty of existing efforts in inclusive language modeling that MR can maybe help connect us with
- Good paper on mBERT performance for under-resourced languages: https://arxiv.org/pdf/2005.09093.pdf
- IJ will be talking with ML platform about this soonish
Other considerations when balancing trade-offs:
- Now vs. future capacity -- e.g., do you build something that the technical infrastructure would struggle with right now but that would probably be more easily supported in a number of years?
- Engineering vs. Research time -- some models might require minimal researcher time but large amounts of engineer time or vice versa
Language-dependent/specific approaches aren't inappropriate; they just might call for stricter requirements and greater care:
- Be careful about which languages are supported -- i.e., specifically select less-resourced languages to work on and not just English.
- Show that language-agnostic approaches don't work "well enough"
Roadmap
- LZ would like an explicit outcome:
  - White paper?
  - Embeddings / parsers that we should support for the community even if we don't view them as the best use of research time?
  - Research directly comparing these approaches across different languages for various tasks?