User talk:Isaac (WMF)/Language modeling

Anyone should feel free to use this page to give feedback on the proposed taxonomy of language modeling approaches. I will also use it to summarize conversations that I have.

Research Team 25 May 2021

  • Regarding how many languages we should seek to support with our modeling:
    • How do we view trade-offs between coverage and performance? E.g., 95% accuracy in a few languages vs. 70% accuracy in all languages?
    • Strong agreement with aiming to have a baseline language-agnostic model (coverage of all ~300 language editions if the model is for Wikipedia) where possible and then considering more language-dependent/specific improvements
  • Related efforts:
    • Should reach out to Search to understand their experiences/coverage, as they in effect do a large amount of language-specific modeling to run their search indices
    • Related concepts in research sphere: "multilingual by default". DS proposed "Inclusive Document Representation" or IDR as the catchphrase.
    • In general, plenty of existing efforts in inclusive language modeling that MR can maybe help connect us with
    • Good paper on mBERT performance for under-resourced languages: https://arxiv.org/pdf/2005.09093.pdf
    • IJ will be talking with ML platform about this soonish
  • Other considerations when balancing trade-offs:
    • Now vs. future capacity -- e.g., do you build something that the technical infrastructure would struggle with right now but that would probably be more easily supported in a number of years?
    • Engineering vs. Research time -- some models might require minimal researcher time but large amounts of engineer time or vice versa
  • Language-dependent/specific approaches aren't inappropriate; they just might need to meet stricter requirements and be handled with greater care:
    • Be careful about which languages are supported -- i.e., deliberately select less-resourced languages to work on and not just English.
    • Show that language-agnostic approaches don't work "well enough"
  • Roadmap
    • LZ would like an explicit outcome
    • White paper?
    • Embeddings / parsers that we should support for the community even if we don't view them as the best use of research time?
    • Research directly comparing these approaches across different languages for various tasks?