Machine learning models/Proposed/Language Identification

From Meta, a Wikimedia project coordination wiki
Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Laurie Burchell, Alexandra Birch, Nikolay Bogoychev, and Kenneth Heafield
Model owner(s): WMF Language Team
Publications: An Open Dataset and Model for Language Identification (Laurie Burchell, Alexandra Birch, Nikolay Bogoychev, and Kenneth Heafield)
Code: GitHub repository
Uses PII: No
In production? No
Which projects? Machine Translation
This model predicts the language in which a given text snippet is written.


Motivation[edit]

Language identification (LID) is a foundational step in many natural language processing (NLP) pipelines. User interfaces often require people to choose or indicate the language of content. Predicting the language of text automatically, and thereby avoiding this selection step, improves the user experience. Reliably guessing which language the user is writing in makes it possible to integrate language tools seamlessly in different contexts, such as T295862: View translated messages in Talk pages. In this way the interactions for getting an automatic translation are simplified: the user does not need to figure out and indicate the source language.

There are several language identification tools, but no existing system can detect all of the 300+ languages in which Wikipedia exists. The Compact Language Detector 2 library can detect 83 languages; fastText's LID model can detect 176. A further problem is that these models do not publish their training data. In this context, the project "An Open Dataset and Model for Language Identification" by researchers at the University of Edinburgh is a step forward. The dataset and model cover 201 languages, making this possibly the most capable and performant open language identification system.

Users and uses[edit]

Intended users
  • researchers
  • developers
  • user scripts and gadgets
Use this model for
  • Identify the language of an arbitrary text, such as Wikipedia article sections or talk page messages. The model returns a language code and a confidence score for the prediction.
  • The identified language can be used for a default action, such as treating the content as written in the predicted language and machine-translating it, while prompting the user to validate the prediction and correct it if needed.
Don't use this model for
  • Fully automated actions. The output is a probabilistic prediction: irrespective of its score, do not use it to trigger an action where the user is not involved. Prompt the user with the prediction and provide a mechanism for manual override.
Current uses
This model is not in production, but it is intended for use in the machine translation UI of the mw:MinT project and other use cases where machine translation is involved.

Ethical considerations, caveats, and recommendations[edit]

  • For languages that are very similar in vocabulary and script, the model can return a wrong prediction with a very high score. For example, the model is weak at distinguishing Indonesian and Malay because they are very close. User interfaces that use the prediction should indicate that it is probabilistic and should provide a mechanism for manual override.
  • For languages that share a common script (for example, Latin), if the training data for a language is small because of its low-resource nature, the model may be biased towards a similar higher-resource language in the same script.
  • This dataset and model cover only 201 languages: the ones we were able to test with the FLORES-200 Evaluation Benchmark. In addition, because the test set consists of sentences from a single domain (wiki articles), performance on this test set may not reflect how well our classifier works in other domains. Future work could create a LID test set representative of web data, where these classifiers are often applied. Finally, most of the data was not audited by native speakers, as would be ideal. Future versions of this dataset should have more languages verified by native speakers, with a focus on the least-resourced languages.
  • Our work aims to broaden NLP coverage by allowing practitioners to identify relevant data in more languages. However, we note that LID is inherently a normative activity that risks excluding minority dialects, scripts, or entire microlanguages from a macrolanguage. Choosing which languages to cover may reinforce power imbalances, as only some groups gain access to NLP technologies. In addition, errors in LID can have a significant impact on downstream performance, particularly (as is often the case) when a system is used as a ‘black box’. The performance of our classifier is not equal across languages, which could lead to worse downstream performance for particular groups. We mitigate this by providing metrics by class. (Verbatim copy from the Ethics statement section of the paper.)

Model[edit]

Performance[edit]

Model architecture

Overview at: https://fasttext.cc/docs/en/supervised-tutorial.html

  • Loss: softmax
  • Epochs: 2
  • Learning rate: 0.8
  • Minimum number of word occurrences: 1000
  • Embedding dimension: 256
  • Character n-grams: 2–5
  • Word n-grams: 1
  • Bucket size: 1,000,000
  • Threads: 68
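The hyperparameters above map directly onto fastText's supervised training API. A minimal sketch follows; the training-file path is a hypothetical placeholder, and actually running the commented-out training call requires the `fasttext` package and the OpenLID training data:

```python
# Hyperparameters from this model card, expressed as keyword arguments
# for fasttext.train_supervised. The input path below is a placeholder.
OPENLID_HPARAMS = dict(
    loss="softmax",    # loss: softmax
    epoch=2,           # epochs
    lr=0.8,            # learning rate
    minCount=1000,     # minimum number of word occurrences
    dim=256,           # embedding dimension
    minn=2, maxn=5,    # character n-grams: 2-5
    wordNgrams=1,      # word n-grams
    bucket=1_000_000,  # bucket size
    thread=68,         # threads
)

# Actual training (not run here; needs the fasttext package and data):
# import fasttext
# model = fasttext.train_supervised(input="openlid_train.txt", **OPENLID_HPARAMS)
```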
Output schema
	{score: <score (0-1)>, language: <string>}
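fastText supervised models return labels with a `__label__` prefix, so serving code must convert the raw `predict()` output into the schema above. A sketch of that mapping, assuming the standard fastText label format (the exact labels used by this deployment are an assumption):

```python
def to_schema(labels, scores):
    """Map raw fastText predict() output, e.g. (('__label__en',), [0.53]),
    to the documented {score, language} schema. Sketch; assumes the
    standard '__label__' prefix used by fastText supervised models."""
    label = labels[0]
    prefix = "__label__"
    language = label[len(prefix):] if label.startswith(prefix) else label
    return {"score": float(scores[0]), "language": language}

# Hypothetical raw prediction for an English snippet:
print(to_schema(("__label__en",), [0.5333166718482971]))
```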
Example input and output

Input

curl https://inference-staging.svc.codfw.wmnet:30443/v1/models/langid-model:predict -X POST -i -H "Host: langid.experimental.wikimedia.org" -d '{"text": "Lorem Ipsum is simply dummy text of the printing and typesetting industry."}'

Output

{"language":"en","score":0.5333166718482971}
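The same exchange can be reproduced programmatically. A minimal sketch using only Python's standard library; it builds the request payload and parses the documented response without sending anything, since the endpoint above is internal to the Wikimedia network:

```python
import json

# Request body for the langid service (same payload as the curl example).
payload = json.dumps({
    "text": "Lorem Ipsum is simply dummy text of the printing "
            "and typesetting industry."
})

# Documented response body, parsed into the {language, score} schema.
response_body = '{"language":"en","score":0.5333166718482971}'
result = json.loads(response_body)
print(result["language"], round(result["score"], 2))  # en 0.53
```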

Data[edit]

Data pipeline

The following data sources were used to build the open dataset. Sources were chosen as those likely to have trustworthy language labels and which did not rely on other LID systems for labelling.

  • Arabic Dialects Dataset (El-Haj et al., 2018)
  • Bhojpuri Language Technological Resources Project (BLTR) (Ojha, 2019)
  • Global Voices (Tiedemann, 2012)
  • Guaraní Parallel Set (Góngora et al., 2022)
  • The Hong Kong Cantonese corpus (HKCan-Cor) (Luke and Wong, 2015)
  • Integrated dataset for Arabic Dialect Identification (IADD) (Zahir, 2022; Alsarsour et al., 2018; Abu Kwaik et al., 2018; Medhaffar et al., 2017; Meftouh et al., 2015; Zaidan and Callison-Burch, 2011)
  • Leipzig Corpora Collection (Goldhahn et al., 2012)
  • LTI LangID Corpus (Brown, 2012)
  • MADAR 2019 Shared Task on Arabic Finegrained Dialect Identification (Bouamor et al., 2019)
  • EM corpus (Huidrom et al., 2021)
  • MIZAN (Kashefi, 2018)
  • MT-560 (Gowda et al., 2021; Tiedemann, 2012; Post et al., 2012; Ziemski et al., 2016; Rozis and Skadiņš, 2017; Kunchukuttan et al., 2018; Agić and Vulić, 2019; Esplà et al., 2019; Qi et al., 2018; Zhang et al., 2020; Bojar et al., 2013, 2014, 2015, 2016, 2017, 2018; Barrault et al., 2019, 2020)
  • NLLB Seed (Costa-jussà et al., 2022)
  • SETIMES news corpus (Tiedemann, 2012)
  • Tatoeba collection (Tiedemann, 2012)
  • Tehran English-Persian Parallel (TEP) Corpus (Pilevar et al., 2011)
  • Turkish Interlingua (TIL) corpus (Mirzakhalov et al., 2021)
  • WiLI benchmark dataset (Thoma, 2018)
  • XL-Sum summarisation dataset (Hasan et al., 2021)
Training data
The final dataset contains 121 million lines of data in 201 language classes. Before sampling, the mean number of lines per language is 602,812. The smallest class contains 532 lines of data (South Azerbaijani) and the largest contains 7.5 million lines of data (English). More details are available in the paper.
Test data
The FLORES-200 benchmark provided by Costa-jussà et al. (2022) was used for evaluation. More details are available in the paper.
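The paper reports a macro-average F1 of 0.93 across the 201 classes. Macro-averaging computes F1 per language and then takes the unweighted mean, so a low-resource language counts as much as English. A small self-contained sketch of the metric, applied to toy labels rather than the real evaluation data:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro-average)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Toy example: one "en" sentence misclassified as "fr".
print(round(macro_f1(["en", "fr", "en", "fr"],
                     ["en", "fr", "fr", "fr"]), 4))  # 0.7333
```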

Licenses[edit]

  • Code: OpenLID is a fastText model. fastText is licensed under the MIT License.
  • Model: The model is licensed under the GNU General Public License v3.0. The individual datasets that make up the training dataset have different licenses but all allow (at minimum) free use for research - a full list is available in this repo.


Citation[edit]

Cite this model as:

@inproceedings{burchell-etal-2023-open,
    title = "An Open Dataset and Model for Language Identification",
    author = "Burchell, Laurie  and
      Birch, Alexandra  and
      Bogoychev, Nikolay  and
      Heafield, Kenneth",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-short.75",
    doi = "10.18653/v1/2023.acl-short.75",
    pages = "865--879",
    abstract = "Language identification (LID) is a fundamental step in many natural language processing pipelines. However, current LID systems are far from perfect, particularly on lower-resource languages. We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033{\%} across 201 languages, outperforming previous work. We achieve this by training on a curated dataset of monolingual data, which we audit manually to ensure reliability. We make both the model and the dataset available to the research community. Finally, we carry out detailed analysis into our model{'}s performance, both in comparison to existing open models and by language class.",
}