User:Isaac (WMF)/AI Datasets


This is an evolving draft of a potential blogpost (series) on datasets, AI, and Wikimedia.

From Dumps to Datasets: Recommendations for a More Purposeful Provision of Wikimedia Data for AI

It is well-documented that Wikimedia data – especially Wikipedia – has been essential to the progress of AI – especially language modeling – over the past several years. The BERT language model, introduced in 2018 and often considered the first modern LLM, drew the majority of its training data from English Wikipedia. Even today, with language models like GPT-4 and their corresponding training datasets being several orders of magnitude larger, Wikipedia remains one of the top sources of data.

Though this usage of Wikimedia data for AI has been beneficial in directing attention to the Wikimedia projects, it has been largely incidental to Wikimedia's mission. The dumps have been made available since at least 2005, and while researchers were always an expected end-user, the AI community in particular has not been viewed as a key stakeholder. As the future of the Wikimedia projects begins to feel more intertwined with the future of generative AI, however, I believe it is worth taking a deeper look at how this data is and could be used for AI – that is, understanding how we might be more purposeful in what we provide.

I argue that this new class of generative AI models lays bare two key deficiencies in the data that prevent the Wikimedia communities from fully benefiting from these new technologies: 1) a few key gaps – most notably imagery from Wikimedia Commons – that reduce how well these models represent the Wikimedia projects; 2) a lack of Wikimedia-specific benchmark datasets that would make clear what we would like to see from AI trained on this valuable Wikimedia data and help Wikimedians identify which models are best suited to their tasks.

Some history...or why generative AI is new to Wikimedia but AI is not

Wikimedia is no stranger to AI. Since at least 2010, when User:ClueBot_NG was introduced to revert vandalism, basic machine learning has been used in impactful ways on the projects. The 2010s saw an expansion of models provided to help support important on-wiki curation tasks, most notably the ORES infrastructure and corresponding models for edit quality, article quality, and article topic. These supervised-learning models were trained to support clearly-scoped tasks and were sufficiently small that they could be easily trained from scratch on Wikimedia data.

During this time, we also saw the rise of a new type of model on the Wikimedia projects – those used for richer language-understanding and -manipulation tasks. Machine translation on Wikipedia is the foremost example, but other prominent examples include OCR on Wikisource and, more recently, models like article description recommendation. While these models could support some of Wikimedia's more complex needs and vastly sped up content creation, they were too large to warrant the Wikimedia community training them fully from scratch. Existing open-source models were often a good start, but Wikimedia's needs frequently exceed what they offer, most notably in its diversity of languages. A successful solution for closing these gaps has been providing Wikimedia data specific to these tasks so that outside developers can improve their tooling in ways that directly support the Wikimedia projects. For example, translation data from Wikimedia was packaged to help developers fine-tune their models, leading to advances like NLLB-200 that dramatically expanded translation support on Wikipedia. On Wikisource, a partnership with Transkribus used Wikimedians' transcriptions to enable new OCR models for poorly-supported documents like hand-written Balinese palm-leaf manuscripts.

The new generation of generative AI models, as typified by ChatGPT, is dramatically different from the above. These models potentially enable a wide range of tasks, from writing articles to building SPARQL queries to enabling new ways of interacting with content. They can support these diverse needs because they are not trained for any specific task, but their size precludes the Wikimedia community from directly fine-tuning them to meet its needs. Instead, after a generic pre-training stage in which they learn to predict the next word in a sentence, they undergo an instruction-tuning stage in which they are trained to respond appropriately to a wide variety of tasks. This leaves two indirect pathways to nudge developers toward training models that are beneficial for the Wikimedia projects: 1) providing more, high-quality data; and 2) establishing benchmarks that evaluate how well these models perform on important Wikimedia tasks and guide developers toward the improvements we would like to see.

Dumps

The Wikimedia dumps typify the long-standing approach to providing data publicly. They are not backups, nor are they always consistent or complete, but they are faithful representations of the Wikimedia projects at particular moments in time. These dumps have been invaluable to the Wikimedia research and AI communities, and a whole host of helper libraries have sprung up to help folks interact with this data.
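As an illustration of what working with the dumps looks like in practice, here is a minimal sketch that streams article wikitext from a pages-articles dump using mwxml, one of those helper libraries. The file name is a placeholder, and the snippet assumes a current-revisions-only dump; adapt it to whichever snapshot you download.

<syntaxhighlight lang="python">
import bz2

import mwxml  # community helper library for parsing MediaWiki XML dumps

# Placeholder path: substitute whichever pages-articles dump you downloaded.
DUMP_PATH = "enwiki-latest-pages-articles.xml.bz2"


def iter_articles(path):
    """Yield (title, wikitext) pairs for main-namespace articles in a dump."""
    with bz2.open(path, "rb") as f:
        dump = mwxml.Dump.from_file(f)
        for page in dump:
            # Skip non-article namespaces and redirects.
            if page.namespace != 0 or page.redirect:
                continue
            # A current-revisions dump has one revision per page.
            for revision in page:
                if revision.text:
                    yield page.title, revision.text


for title, wikitext in iter_articles(DUMP_PATH):
    print(title, len(wikitext))
    break  # just demonstrate the first article
</syntaxhighlight>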

The gem of these dumps is the semi-monthly snapshot of the current wikitext of every article in every language edition of Wikipedia. Wikipedia text is invaluable to natural language processing (NLP) for a variety of reasons. It is generally long-form (lots of context to learn from), "well-written", and has relatively little bias (at least compared to the rest of the internet), thanks both to its neutral point-of-view policy and the constant efforts of Wikipedians to close knowledge gaps. The content is trustworthy and its sources reliable thanks to the careful work of the editor community. And the content is highly multilingual, making it especially important for training models in languages that are digitally underrepresented.

For a long time, these dumps were still a messy resource that NLP folks had to further process to strip the wikitext syntax (templates, categories, etc.) and leave only the plaintext, natural-language content of an article. Different researchers took slightly different approaches, from a simple Perl script to a more fully-fledged Python library that also expands some templates. It is only recently that some standardization has appeared (positive foreshadowing): HuggingFace provides already-processed plaintext snapshots of Wikipedia, which have been among their most popular datasets, along with subsets such as this 2016 dataset of text from just good and featured articles.
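To make the contrast concrete, the sketch below shows both routes: stripping wikitext to plaintext yourself (here with the mwparserfromhell library, standing in for the various home-grown approaches) and loading HuggingFace's already-processed plaintext snapshot. The snapshot name in the config string is illustrative; check the dataset page for the snapshots actually available.

<syntaxhighlight lang="python">
import mwparserfromhell            # wikitext parsing library
from datasets import load_dataset  # HuggingFace datasets

# Route 1: strip wikitext to plaintext yourself.
wikitext = "'''Douglas Adams''' was an [[England|English]] author.{{citation needed}}"
plaintext = mwparserfromhell.parse(wikitext).strip_code()
print(plaintext)  # templates and markup removed, link text kept

# Route 2: use the already-processed plaintext snapshot on HuggingFace.
# The config name ("20231101.en") is a placeholder for whichever snapshot/language you want.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
for article in wiki:
    print(article["title"])
    print(article["text"][:200])
    break  # just peek at the first article
</syntaxhighlight>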

The dumps are not complete, though. Notably, none of the imagery found on Wikipedia (or hosted more broadly on Wikimedia Commons) is available via the dumps. Downloading this large source of freely-licensed imagery requires gathering the images one-by-one via APIs, which can easily take months or result in throttling if the requests are too frequent. Large one-off subsets of Commons have been released in the past – most notably the Wikipedia-based Image Text (WIT) dataset, which Google could compile given their unique infrastructure and which was further parlayed into an image-text modeling challenge – but these remain one-offs and usually focus on just the subset of imagery that appears on Wikipedia. This large gap in the dumps means that Wikimedia is not nearly as well-represented within the computer vision community and the resulting image or multimodal models, with datasets like ImageNet or LAION that are scraped from the broader web being more common.
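For a sense of why gathering imagery this way is slow, here is a small sketch that looks up original-file URLs for a handful of Commons files via the public API, pausing between requests to stay polite. The file titles, pause length, and User-Agent string are placeholders; crawling the full collection one request at a time is what stretches into months.

<syntaxhighlight lang="python">
import time

import requests

API_URL = "https://commons.wikimedia.org/w/api.php"
# Placeholder titles: substitute the files you actually want to download.
FILE_TITLES = [
    "File:Commons-logo.svg",
    "File:Wikipedia-logo-v2.svg",
]
# Wikimedia asks API clients to identify themselves with a descriptive User-Agent.
HEADERS = {"User-Agent": "example-image-fetcher/0.1 (contact: you@example.org)"}


def fetch_image_urls(titles, pause_seconds=1.0):
    """Look up the original-file URL for each File: title, one request at a time."""
    urls = {}
    for title in titles:
        params = {
            "action": "query",
            "titles": title,
            "prop": "imageinfo",
            "iiprop": "url",
            "format": "json",
        }
        response = requests.get(API_URL, params=params, headers=HEADERS, timeout=30)
        response.raise_for_status()
        pages = response.json()["query"]["pages"]
        for page in pages.values():
            for info in page.get("imageinfo", []):
                urls[title] = info["url"]
        time.sleep(pause_seconds)  # be polite; too-frequent requests get throttled
    return urls


print(fetch_image_urls(FILE_TITLES))
</syntaxhighlight>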