User:Isaac (WMF)/Language modeling
This page documents some pondering I have been doing about the different approaches to language modeling and the trade-offs between performance and sustainability (in a broad sense: both energy usage and the ability to easily monitor and maintain models). It is inspired in equal parts by this blogpost on recent developments in NLP and by discussions of the sustainability of large language modeling, as typified by the Stochastic Parrots paper. Thoughts are welcome! I see three potentially useful things to come out of this framework:
- Guide discussion and decision-making on what sorts of technologies to work on -- i.e. moving away from "could we do this?" to "how feasible would it be to support our language communities in an equitable way through this approach?"
- Identify core technologies -- e.g., language parsers -- that underlie these approaches so that these basic prerequisites can be developed for other researchers to experiment with even if we do not see an approach as currently a good use of resources.
- Set norms -- e.g., in the same way machine-learning practitioners often discuss the importance of first trying logistic regression in classification tasks, we might first ask whether a language-agnostic approach has been considered prior to examining more intensive approaches.
The three approaches and basic characteristics are shown below and expanded upon further in each section:
|Language-Agnostic Modeling||Article/section-level||Equitable performance w/ good Wikidata coverage / linking||Low||Low||Topic classification|
|Language-Dependent NLP||Paragraph/sentence-level||Higher performance for more-resourced languages but some shared performance between language families||Medium||Medium||Citation needed|
|Language-Specific NLP||Word-level||Highly-dependent on language data available||High||High||Link recommendation|
Language-agnostic modeling is not even language modeling in the classic sense, but the unique structure of Wikimedia projects -- e.g., links, categories, templates, Wikidata -- allows for many approaches to modeling content that do not require parsing the text on a page.
- Because this approach tends to depend on data that is relevant to an article as a whole but not necessarily to individual statements, it operates best at the article level. Using page links allows for section-level modeling as well, but the data will be far sparser in many circumstances. There generally is not enough structured content at the sentence level to make good predictions.
- Though all languages share a vocabulary and therefore the training data is likely dominated by the larger language editions, the reliance on Wikidata items as that vocabulary means that the models generally perform equally well across language editions as long as each language edition is well-connected to Wikidata and makes good use of links etc. in its articles.
- Generally the necessary features can be extracted from the various link tables (pagelinks, categorylinks, etc.) or Wikidata, making the preprocessing very efficient and requiring no language-specific extraction approaches.
- This generally results in a single (relatively small) model for all languages that does not require individual fine-tuning. For example, the links-based topic-classification model has a footprint of under 1GB for the embeddings and model parameters.
- The simplicity of the pipeline and the use of a single model mean that these models can be developed and maintained with relatively small demands on engineer time.
- Topic classification
- Category recommendation -- i.e. what categories are missing from an article?
- Quality modeling -- i.e. how high-quality is a given Wikipedia article?
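As a rough illustration of the language-agnostic features described above, the sketch below represents an article as a bag of Wikidata items derived from its outlinks. The `SITELINKS` mapping is a small hand-written sample standing in for data that would come from the link tables and Wikidata, and `article_to_qids` is a hypothetical helper, not part of any existing pipeline.

```python
from collections import Counter

# Hand-written sample of (wiki, page title) -> Wikidata item.
# Assumption: in practice this would be built from Wikidata sitelinks.
SITELINKS = {
    ("enwiki", "Douglas Adams"): "Q42",
    ("enwiki", "Earth"): "Q2",
    ("frwiki", "Douglas Adams"): "Q42",
    ("frwiki", "Terre"): "Q2",
}

def article_to_qids(wiki, outlinks):
    """Map an article's outlinks to a bag of Wikidata items.

    Titles with no known Wikidata item (e.g. red links) are skipped.
    """
    qids = (SITELINKS.get((wiki, title)) for title in outlinks)
    return Counter(q for q in qids if q is not None)

# Outlinks for the "same" article in two language editions (made-up data).
en_features = article_to_qids("enwiki", ["Douglas Adams", "Earth", "Some red link"])
fr_features = article_to_qids("frwiki", ["Terre", "Douglas Adams"])

# Both editions end up in the same shared vocabulary of Wikidata items.
print(en_features == fr_features)
```

Because the features are Wikidata items rather than words, a single model can consume articles from any language edition without a language-specific extraction step, provided the links resolve to items.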
Language-dependent NLP begins to look a lot more like familiar natural language processing in that these models parse text to make their predictions, but they do so in a more language-agnostic manner -- i.e. they make no attempt to model language-specific entities such as words, parts-of-speech, etc. but instead treat text as merely a stream of characters to be modeled (regardless of language). Thus, the performance of the model still depends on the amount of data available for that language, but new languages can be supported far more easily. This places these models in a middle ground: they can handle tasks that pure language-agnostic modeling cannot, but with a more realistic footprint and coverage of languages than language-specific NLP.
- This approach is best-suited to the sentence or paragraph level. This requires some basic additional parsing -- e.g., splitting on new-line tokens, sections, or potentially sentence delimiters -- but these are much simpler to build with good coverage for all languages.
- Because this approach has a single, language-agnostic tokenizer, it is not a good approach for word- or phrase-level tasks as the model is unaware of word breaks.
- Because the shared vocabulary is not inherently language-agnostic -- i.e. different strings of characters may mean very different things in different languages -- the highest performance will likely be seen in more well-resourced languages.
- Preprocessing is simple because a single tokenizer is used for all languages, though some languages, such as Khmer, may benefit from additional pre-processing. This approach is still on the low end for engineering demands.
- This will likely result in larger models than purely language-agnostic approaches, but a single model with a relatively small vocabulary can be developed.
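The preprocessing described above can be sketched roughly as follows: split the text into paragraphs on newlines, then treat each paragraph as a plain stream of characters. Character n-grams stand in here for whatever learned tokenizer a real model would use, and both function names are hypothetical.

```python
def split_paragraphs(text):
    """Split on newlines and drop empty lines; no language-specific rules."""
    return [p.strip() for p in text.split("\n") if p.strip()]

def char_ngrams(paragraph, n=3):
    """Slide a fixed-size window over the raw characters.

    The window ignores word boundaries entirely, which is why this kind of
    tokenization suits paragraph-level tasks but not word-level ones.
    """
    return [paragraph[i:i + n] for i in range(len(paragraph) - n + 1)]

# The same two functions are applied to every language.
text = "The cat sat.\n\n猫が座った。"
for para in split_paragraphs(text):
    print(para, char_ngrams(para)[:3])
```

The same code path handles both paragraphs, but the resulting n-grams carry no notion of where a word starts or ends, which matches the limitation noted above for word- or phrase-level tasks.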
Language-specific NLP is the classical approach to natural-language modeling, where individual models are built for each language that seek to model text in a more structured manner. Because these approaches have been around longer, most machine-learning models at Wikimedia use this approach. Despite their high costs, they are necessary for certain types of modeling challenges.
- This approach is best-suited to tasks that require word-level understanding and prediction -- e.g., determining where specifically to insert a link or extracting relationships between concepts from text.
- Because words are not shared across languages, the performance for a particular language is strongly tied to the amount of text available in that language for training a language model. Substantial fine-tuning might be required -- e.g., crowdsourced word-lists, syntactic parsing -- which further favors well-resourced languages and requires support from individuals who understand that language.
- Because languages do not split up "words" in a uniform way, this requires language-specific (or language-family-specific) tokenizers. This leads to many pre-processing pipelines that must be developed, fine-tuned, and maintained, which requires a lot of engineer time compared to other approaches.
- Furthermore, modeling at the word level leads to language-specific vocabularies and models that have a far larger footprint.
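To make the tokenizer point concrete, here is a minimal sketch of a per-language tokenizer registry. Everything in it is hypothetical: the two tokenizers are deliberately crude stand-ins (per-character splitting is not real word segmentation), and a production pipeline would need a properly developed tokenizer for each language or language family.

```python
import re

def tokenize_whitespace(text):
    """Works for languages that separate words with spaces."""
    return re.findall(r"\w+", text)

def tokenize_per_character(text):
    """Crude stand-in for a real segmenter in languages written without spaces."""
    return [ch for ch in text if ch.isalnum()]

# Hypothetical registry: every supported language needs its own entry,
# and each entry has to be developed, tuned, and maintained separately.
TOKENIZERS = {
    "en": tokenize_whitespace,
    "zh": tokenize_per_character,
}

def tokenize(lang, text):
    """Dispatch to the tokenizer registered for the given language code."""
    if lang not in TOKENIZERS:
        raise ValueError(f"no tokenizer available for language {lang!r}")
    return TOKENIZERS[lang](text)

print(tokenize("en", "Add a link here."))
print(tokenize("zh", "我爱北京"))
# Applying the whitespace tokenizer to unspaced text yields one giant "word".
print(tokenize_whitespace("我爱北京"))
```

A language with no registry entry simply cannot be served, which is the maintenance burden described above: coverage grows one tokenizer at a time.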