User:Isaac (WMF)/ML modeling at Wikimedia

There are many ways to think about the different types of ML models in use on the Wikimedia projects. Below I present a few typologies that have helped guide my thinking about appropriate strategies for deploying ML in support of tasks on the wikis. This extends and updates some previous thinking I had done in this space, specific to language modeling and to when it is possible to develop a single model that can support all wikis (language-agnostic) as opposed to multilingual or language-specific models.

I have found these typologies most useful for deciding how to balance potential performance against cost, coverage, potential for harm, and the feedback loops for improving the model. Generally, the simpler and more constrained you can make the model while still preserving "good-enough" performance, the broader its impact and the easier it will be to maintain and improve.

Task complexity

The typology that I find myself using the most is an expansion of the language-agnostic typology, which focused on the question of language coverage. The angle I most often focus on now is the complexity of the task: classification vs. language understanding/manipulation vs. language generation. These distinctions also tend to map directly to other facets such as how the model is trained (supervised vs. fine-tuned vs. instruction-tuned), hosted (in-house vs. mixed vs. 3rd-party), or the unit of prediction (articles vs. sentences vs. words). These are general categories and of course all of them have exceptions and are evolving -- in particular, memory/compute requirements for large models have decreased greatly since ChatGPT came out in late 2022. These criteria also focus mostly on Wikipedia (i.e. text-based language modeling) but could largely be applied to Commons (image modeling) or Wikidata as well.

The three approaches and basic characteristics are shown below and expanded upon further in each section:

Approach | Coverage | Cost | Harms | How to Intervene | Example
Classification | High (all languages) | Low (single CPU) | Generally low and easy to audit | Re-train | Topic classification
Language Understanding/Manipulation | Medium (all medium/large-sized language editions) | Medium (small GPU) | Still limited but harder to assess | Increase size and fine-tune | Article descriptions
Language Generation | Low (highly-resourced languages) | High (large GPU) | High potential and difficult to assess | Release more data and benchmarks | ChatGPT

Classification models

These are models with a specific purpose that we train fully from scratch and that are generally pretty compact. Their outputs are generally a probability of some label -- e.g., vandalism, article topic, article quality. This allows us to achieve high coverage, low latency/cost, and a high degree of control. While some classification models could lead to significant harm if deployed without sufficient guardrails in higher-stakes situations -- e.g., vandalism detection -- the harms are related to questions of allocation and so can generally be audited for fairness in advance and adjusted. We have a fair bit of experience in building these models at Wikimedia -- e.g., all of the ORES models.
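
As a concrete (if toy) illustration of what this class of model looks like, here is a minimal sketch of a compact, CPU-friendly topic classifier: a linear model over sparse text features that outputs a probability per label. The features, data, and labels are made up for illustration; the actual ORES/Lift Wing models are built differently.

```python
# Minimal sketch of a compact classification model (toy data/features,
# not the actual ORES / Lift Wing implementation).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: articles represented by their outlink titles,
# labeled with a high-level topic. Real models use far larger labeled corpora.
articles = [
    "association_football fifa world_cup goalkeeper",
    "midfielder premier_league penalty_kick",
    "quantum_mechanics wave_function physicist",
    "particle_physics electron photon",
]
topics = ["sports", "sports", "physics", "physics"]

# A linear model over sparse count features stays small and fast on a single CPU.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(articles, topics)

# The output is a probability per label, which downstream tools can threshold and audit.
probs = model.predict_proba(["striker scored in the world_cup"])[0]
print(dict(zip(model.classes_, probs)))
```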

The main challenge with classification models tends to be in aligning their training data with trace data collected from the wikis so that they can be continually re-trained and cover all language editions. Because of their tight scope and compactness, we have a high degree of control and ability to intervene when problems arise with these types of models. These models tend to be most helpful for curation-related tasks on wiki, but many other tasks, especially those related to supporting editors in improving content, can also be mapped to a simple classification task.
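
As a sketch of what aligning training data with trace data can look like in practice, the snippet below derives labels for a revert-risk-style classifier directly from edit history: edits that were reverted within a short window become positive examples, so the label set can be regenerated from on-wiki traces before every re-training run. The field names and data source here are hypothetical.

```python
# Hypothetical sketch: derive (re-)training labels from on-wiki trace data.
# Edits reverted within 48 hours are treated as positive ("damaging") examples,
# so the label set can be rebuilt automatically before each re-training run.
from datetime import timedelta

def build_labels(revisions):
    """revisions: iterable of dicts with hypothetical fields
    'rev_id', 'timestamp', 'was_reverted', 'reverted_at'."""
    labeled = []
    for rev in revisions:
        reverted_quickly = (
            rev["was_reverted"]
            and rev["reverted_at"] - rev["timestamp"] <= timedelta(hours=48)
        )
        labeled.append((rev["rev_id"], int(reverted_quickly)))
    return labeled
```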

Language understanding and manipulation

The next class of models focuses on some form of understanding or manipulating text. Most models in this space output text but are strongly constrained to the particular text that is being evaluated. One example is the article description model, which generates a potential short description for an article but does so largely based on the first paragraph of the article, so it is mainly extracting and re-styling text as opposed to performing more complex generative tasks. A more involved example is machine translation, which requires re-styling across languages but is still heavily constrained to the input text. There are also some examples of models that fit here but that are also classification models and can be developed either way. For example, while vandalism detection can be construed as a simple classification task (language-agnostic revert risk), it can also be construed as a more complex language understanding task (multilingual revert risk). This is also true for assessing content readability (language-agnostic vs. multilingual).

While these models need to understand language in order to manipulate it, they are still generally small enough to host on our infrastructure. To develop these capabilities, however, they need to be trained on much larger text corpora than the Wikimedia projects alone provide, so we often start with a generic, open-source language model (e.g., BERT) and fine-tune it for the specific task. While the choice of pre-trained model is important because it impacts size/performance/coverage, the fine-tuning process substantially impacts performance, so issues can usually be addressed at this stage. The challenge, then, for these models is developing a sufficiently large and diverse fine-tuning dataset so that the pre-trained model can learn the intricacies of the Wikimedia task.
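
As a rough sketch of what this fine-tuning stage looks like (using the Hugging Face Trainer API; the deployed Wikimedia models differ in architecture and use far larger fine-tuning datasets), one might start from a generic multilingual checkpoint and fine-tune it on task-specific labels:

```python
# Illustrative sketch of fine-tuning a generic pre-trained model for a Wikimedia task.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)

# Hypothetical fine-tuning data: (text, label) pairs for the task of interest.
data = Dataset.from_dict({
    "text": ["Douglas Adams was an English author and screenwriter.",
             "asdf qwerty random vandalism"],
    "label": [0, 1],
})
data = data.map(lambda x: tokenizer(x["text"], truncation=True,
                                    padding="max_length", max_length=128),
                batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()  # the fine-tuning stage is where task-specific issues get addressed
```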

Because these models are outputting text, the potential for harm is higher and more difficult to evaluate than for classification models. The models are heavily constrained to the input text and their specific task, however, so risk assessment generally can be focused on a few known areas where issues can crop up with language -- e.g., encoding gender stereotypes via grammar such as using feminine nouns for nurses and masculine for doctors, white-washing such as referring to dictators as politicians, or just failing where there is insufficient context. And when issues arise, it is sometimes possible to address them via further fine-tuning.

Language generation

The largest and most open-ended class of models is generative AI, as typified by ChatGPT. These models bring the promise of addressing the very long tail of tasks related to the Wikimedia projects where AI could provide support: everything from helping editors convert a very specific type of archival data into statements for Wikidata, to answering specific newcomer questions, to suggesting edits to a page based on a new source. While almost all of these tasks could be construed as narrow language understanding/manipulation tasks, it is not feasible to train and host specific models for each of these needs. As such, providing support for this long tail almost certainly requires something as large and open-ended as a chat agent.

While these models show much promise, they also tend to bring higher costs and potential for harm. They can be constrained via approaches like retrieval-augmented generation (RAG), but there is still a much higher potential for hallucination or inappropriate answers. Given the size of these models, Wikimedia will have to depend on third-party developers to train and open-source or host them for the time being. The wide variety of tasks that these models can support also means that, on top of the usual ways in which language generation can go wrong (see above), more aggressive testing (red-teaming) of the different potential scenarios in which a model might be misused is also required.
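
To illustrate how retrieval-augmented generation constrains a generative model, here is a rough sketch that grounds answers in passages retrieved via the MediaWiki search API. The generate() argument is a placeholder for whatever hosted LLM is actually being called; it is not a specific Wikimedia or vendor API.

```python
# Rough sketch of retrieval-augmented generation (RAG) grounded in Wikipedia.
# `generate` stands in for a call to whatever third-party LLM is being used.
import requests

def search_wikipedia(query, lang="en", limit=3):
    """Retrieve search result snippets from the MediaWiki search API."""
    resp = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={"action": "query", "list": "search", "srsearch": query,
                "srlimit": limit, "format": "json"},
        headers={"User-Agent": "rag-sketch/0.1"},
    )
    return [hit["snippet"] for hit in resp.json()["query"]["search"]]

def answer(question, generate):
    """Constrain the model to retrieved on-wiki context rather than free generation."""
    context = "\n".join(search_wikipedia(question))
    prompt = (f"Answer using only the context below; say 'not sure' otherwise.\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return generate(prompt)
```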

This lack of control over generative AI models raises the question of how to intervene to make these models more beneficial for Wikimedians. Like task-specific language models, these generative models also start with a generic, pre-trained model. But rather than being fine-tuned to develop capabilities around a specific task, these models then go through an instruction-tuning process where they are tweaked to respond well to a wide range of requests and to do so in an appropriate manner. This means that there is no simple place to intervene to ensure that these models will perform well at Wikimedia-specific tasks. The models often have some basic understanding of policies and style on wiki from their pre-training that allows them to perform reasonably well, but this is not an explicit objective during training nor is this capability explicitly measured when evaluating models. There are two places, then, where it might be possible to intervene: 1) providing more high-quality, diverse data to the model such that the pre-trained model's understanding of Wikimedia-related concepts is better, and 2) establishing benchmarks for Wikimedia-related tasks such that it is easy to compare different models' performance and to nudge developers toward training models that do well at these tasks.
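
A minimal sketch of what the second intervention (a Wikimedia-specific benchmark) could look like: a shared set of task examples and a harness that scores any candidate model against them so results are directly comparable. The benchmark format and model interface here are hypothetical.

```python
# Hypothetical sketch of a Wikimedia-task benchmark harness: score any candidate
# model on a shared set of task examples so results are directly comparable.
def evaluate(model_generate, benchmark):
    """benchmark: list of {'prompt': ..., 'reference': ...} dicts (hypothetical format)."""
    correct = 0
    for example in benchmark:
        prediction = model_generate(example["prompt"])
        correct += int(example["reference"].lower() in prediction.lower())
    return correct / len(benchmark)

benchmark = [
    {"prompt": "Write a short description for the article 'Douglas Adams'.",
     "reference": "English writer"},
    # ... more Wikimedia-specific tasks: policy questions, wikitext edits, etc.
]

# scores = {name: evaluate(fn, benchmark) for name, fn in candidate_models.items()}
```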

Technical considerations

Another valuable way of categorizing ML models is by thinking about the technical aspects of their outputs and how that impacts the requirements of their deployment. A few examples that I've found useful are below.

Independence of predictions

Is the model handling single inputs (easy) or is it part of a system oriented towards supporting comparisons across many outputs (hard)? For example, classification models often output a score for a single article or edit and thus can largely retrieve the necessary data inputs with a few API calls and have no large data dependencies beyond the model weights. On the other hand, any system that is intended to provide some sort of ranking is generally much more complex -- for example, finding similar users to aid in sockpuppet detection or similar articles to support list-building. This requires maintaining data on how all entities relate to all other entities, which presents infrastructure challenges. It also means that false positives can be far more disruptive. In the above examples, if the representation for an editor or an article is largely random for editors with low edit counts or for stub articles (which describes most editors and most articles), then the model will often return spurious outputs because these low-data entities will show up randomly and quite frequently in the results.
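
The failure mode described above is easy to reproduce with a toy simulation: when most entities sit behind near-random embedding vectors (because there is little data about them), they will still regularly appear among the nearest neighbours of any query simply because of how many of them there are. The numbers below are made up.

```python
# Toy simulation: a small set of entities plus a much larger set of "low-data"
# entities whose embeddings are essentially random. In this sketch every vector
# is random; the point is purely that the much larger low-data population
# floods the top-k results of any similarity query.
import numpy as np

rng = np.random.default_rng(0)
dim, n_well_modeled, n_low_data = 64, 1_000, 50_000  # made-up proportions

well_modeled = rng.normal(size=(n_well_modeled, dim))
low_data = rng.normal(size=(n_low_data, dim))
embeddings = np.vstack([well_modeled, low_data])
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

query = embeddings[0]                               # a "well-modeled" entity
top_k = np.argsort(embeddings @ query)[::-1][1:11]  # its 10 nearest neighbours
print("low-data entities in top 10:", int((top_k >= n_well_modeled).sum()))
```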

Dynamicism of predictions

Are the outputs relatively static (easy) and able to be processed in slower batches – e.g., the topic of an article – or more dynamic (hard) and probably in need of real-time support – e.g., a link recommendation, which might be edited and no longer be relevant? Similar to above, dynamic outputs pose infrastructure demands -- they require a relatively quick feedback loop between actions taken on wiki and re-calculation of the associated model inferences. In cases when this feedback loop is not quick enough, they also begin to accrue more false positives (and resulting user frustration) as their predictions are invalidated but the cache is not.
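
One common way to keep that feedback loop tight is to key cached predictions on the revision they were computed for, so that any edit (which creates a new revision) automatically invalidates the stale inference. A minimal sketch, with hypothetical function names:

```python
# Minimal sketch: key cached predictions on the revision ID so that any edit
# (which produces a new revision) automatically invalidates stale inferences.
cache = {}

def get_prediction(page, current_rev_id, run_model):
    """run_model(page) is a hypothetical stand-in for the actual inference call."""
    key = (page, current_rev_id)
    if key not in cache:
        cache[key] = run_model(page)  # recompute only when the page has changed
    return cache[key]
```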

Discreteness of predictions

Are the outputs limited to a small number of discrete labels (easy) – e.g., one of the ~60 high-level article topics – which can be indexed with existing Search systems, or are they continuous outputs (hard), such as article embeddings, for which storage and look-ups are less well-supported?
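
A small sketch of the difference: discrete labels drop straight into an inverted index of the kind existing Search systems already provide, while continuous embeddings require a similarity scan or a dedicated vector store. The data below is made up.

```python
# Sketch of why discrete outputs are easier to serve than continuous ones (toy data).
from collections import defaultdict
import numpy as np

# Discrete labels: a plain inverted index answers "all articles about X" instantly.
topic_index = defaultdict(set)
topic_index["Physics"].update({"Electron", "Photon"})
topic_index["Sports"].add("Association football")
print(topic_index["Physics"])

# Continuous embeddings: answering "articles similar to X" needs a similarity
# scan (or a dedicated vector index), which standard search stacks support less well.
embeddings = {"Electron": np.random.rand(64), "Photon": np.random.rand(64)}
query = embeddings["Electron"]
scores = {title: float(vec @ query) for title, vec in embeddings.items()}
print(max(scores, key=scores.get))
```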