
Research:Large Language Models (LLMs) Impact on Wikipedia's Sustainability

Created: 14:53, 23 July 2024 (UTC)
Duration: July 2024 – April 2025
Keywords: LLMs, AI, Wikipedia

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


Purpose


The purpose of this study is to learn more about how Large Language Models (LLMs) are trained on Wikipedia, and how their use in AI-powered chatbots such as OpenAI's ChatGPT, Microsoft's Copilot, and Google's Gemini affects the sustainability of Wikipedia as a crowd-sourced project and introduces issues related to information literacy and exploitative digital labor.

Brief background


Wikipedia, as a collaboratively edited and open-access knowledge archive, provides a vast and rich dataset for training Artificial Intelligence (AI) applications and models (Deckelmann, 2023; Schaul et al., 2023; McDowell, 2024) and for making the data within the encyclopedia more accessible. However, such reliance on the crowd-sourced encyclopedia introduces numerous ethical issues related to data provenance, knowledge production and curation, and digital labor. This research critically examines the use of Wikipedia as a training set for Large Language Models (LLMs) specifically, addressing the implications of this practice for data ethics, information accessibility, and cultural representation. Drawing on critical data studies (boyd & Crawford, 2012; Iliadis & Russo, 2016), feminist posthumanism (Haraway, 1988, 1991), and recent critical interrogations of Wikidata's ethics (McDowell & Vetter, 2024; Zhang et al., 2022), this study explores the potential biases and power dynamics embedded in the data curation processes of Wikipedia and its subsequent use in LLMs. Our research employs a mixed-methods approach, including content analysis of specific case studies where LLMs have been trained using Wikipedia, and interviews with key stakeholders, including computer scientists, journalists, and WMF staff.

Methods


Interview


The method of study will be a semi-structured interview conducted over Zoom videoconferencing software or via email exchange. Interview subjects will be asked questions related to their understanding of the relationship between Wikipedia and Large Language Models such as ChatGPT. Interviews will last approximately 30-60 minutes depending on the subjects' responses. Total participation in the study, including e-mail communication and IRB consent, will be under 90 minutes. The IRB consent form will be emailed to subjects as part of the recruitment email. Zoom video recordings will be stored in the PI's password-protected institutional account and allowed to expire after 120 days. Only the transcript will be downloaded, and it will be stored on the PI's password-protected laptop computer.

Instruments

  1. Informed consent: https://docs.google.com/document/d/1vcO5zZEcZs4a37O1XvVIsSU76SWA_oxG/edit?usp=sharing&ouid=113182016009423657566&rtpof=true&sd=true
  2. Interview questions: https://docs.google.com/document/d/1SKItfnX0MHQHb0sl2N2tJsx2sPWO4LNRzpr5kYi9TkU/edit?usp=sharing

Subject selection


Subject selection is guided by the PI's knowledge of individuals who either work at or have knowledge of the intersection of Wikipedia and Large Language Models. These individuals include data and computer scientists, journalists, and product designers at the Wikimedia Foundation. Each potential subject will be emailed individually by the PI and asked whether they would be willing to participate in the interview study. They will be provided with an informed consent document in the same email.

Participant inclusion criteria


Our inclusion criteria are: 1) computer scientists, researchers, product designers, or journalists with previous experience and insight into LLMs, machine learning, or Wikipedia/Wikimedia; and 2) English-speaking participants.

Timeline


  • Interviews: July 25 – August 15, 2024
  • Analysis: August 15 – September 5, 2024
  • Drafting: August 20 – September 16, 2024
  • Article submission: September 16, 2024
  • Article revision: November 2024 – January 2025

Policy, Ethics and Human Subjects Research


This project has been approved by the Indiana University of Pennsylvania Institutional Review Board for the Protection of Human Subjects (phone: 724-357-7730).

Confidentiality and privacy


Research subjects will be given the choice to remain anonymous or to be named in the research article created as part of this study. If a research subject wishes to be named, they will be identified by their name and professional title in the research article.

If the subject does not want to be named in the research article, they will be given the chance to choose a pseudonym and will be described by their profession (e.g., a data scientist working for a major tech company).

During the data collection and analysis process, all subjects' identities will be kept confidential: the PI will know their identities, but they will not be identifiable to anyone outside the research project. Zoom recordings will be stored in the Zoom cloud, protected by password, and will not be kept past 120 days. If the IRB requires that data be kept for the typical 5 years, it will be downloaded to the PI's personal laptop, which is password protected. The transcript of each Zoom session will be downloaded, edited to remove the participant's name, and stored on the PI's password-protected laptop computer.

If the subject chooses an email interview format, their data will remain in the password-protected email client until the end of the study, at which point it will be deleted.

Results



Resources



References

  • boyd d, Crawford K (2012) Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society 15:662–679. https://doi.org/10.1080/1369118X.2012.678878
  • Evenstein Sigalov S, Nachmias R (2017) Wikipedia as a platform for impactful learning: a new course model in higher education. Education and Information Technologies 22(6): 2959–2979.
  • Ford H (2022) Writing the revolution: Wikipedia and the survival of facts in the digital age. The MIT Press, Cambridge, Massachusetts
  • McDowell ZJ (2024) Wikipedia and AI: Access, representation, and advocacy in the age of large language models. Convergence: The International Journal of Research into New Media Technologies 30:751–767. https://doi.org/10.1177/13548565241238924
  • McDowell ZJ, Vetter MA (2021) Wikipedia and the Representation of Reality. 1st edition. Routledge, New York, NY
  • McDowell Z, Vetter M (2022b) Fast “truths” and slow knowledge; oracular answers and Wikipedia’s epistemology. Fast Capitalism 19(1): 104–112.
  • McDowell Z, Vetter M (2024) The Re-alienation of the commons: Wikidata and the ethics of “free” data. International Journal of Communication 18: 590–608.