Community Wishlist Survey 2023/Larger suggestions/Create a large language model that aligns with the Wikimedia movement
This proposal is a larger suggestion that is out of scope for the Community Tech team. Participants are welcome to vote on it, but please note that regardless of popularity, there is no guarantee this proposal will be implemented. Supporting the idea helps communicate its urgency to the broader movement.
Create a large language model that aligns with the Wikimedia movement
- Problem: ChatGPT and other large language models (better known as AI chatbots) are being developed, and they could either disrupt or benefit Wikimedia projects in unexpected ways.
- Proposed solution: Create our own open-source large language models that serve Wikimedia's mission.
- Who would benefit: Mainly editors, by making it easier to fight vandalism and write content.
- More comments: There is a related proposal named Create Wikipedia article stub from Wikidata using ChatGPT. Some of the potential uses for such a model include:
- Brainstorming, identifying possibly missing information
- Detecting more subtle, context-dependent vandalism
- Generating SQL queries against Wikipedia's database without the user needing to know SQL (see the sketch after this list)
- Making templates without needing to know complex wikitext and Lua (probably the easiest to do)
- Copyediting, identifying prose issues
- Recommending known sources and further reading for a topic (this can be done right now with ChatGPT, but the recommendations would be much more effective if the AI were trained on article references)
- Quickly making stubs on a large scale (Abstract Wikipedia, wink wink)
- and more...
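To make the SQL-generation idea concrete, here is a minimal sketch (not part of the original proposal) of how an instruction-tuned open model could turn a plain-English question into a query. Everything here is an assumption for illustration: the model name is a placeholder, and the schema hint covers only a small subset of the real MediaWiki tables.

```python
# Minimal sketch: "some-open-llm" is a hypothetical placeholder model name.
from transformers import pipeline

generator = pipeline("text-generation", model="some-open-llm")

# A tiny subset of the real MediaWiki schema, given to the model as context.
SCHEMA_HINT = (
    "Tables: page(page_id, page_namespace, page_title), "
    "revision(rev_id, rev_page, rev_timestamp)."
)

def question_to_sql(question: str) -> str:
    """Ask the model to write a single SQL query answering the question."""
    prompt = (
        f"{SCHEMA_HINT}\n"
        f"Write one SQL query against the Wikipedia database that answers: "
        f"{question}\nSQL:"
    )
    result = generator(prompt, max_new_tokens=128, do_sample=False)
    # The pipeline returns the prompt plus the generation; keep only the latter.
    return result[0]["generated_text"][len(prompt):].strip()

print(question_to_sql("How many pages are in the main namespace?"))
```

Any generated query would still need human review before being run; the point is only that the user would not have to know SQL up front.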
- Phabricator tickets:
- Proposer: CactiStaccingCrane (talk) 11:32, 2 February 2023 (UTC)
Discussion
- We can either use existing language models (e.g. GPT) or develop our own model based on an existing one. But creating a whole new model from scratch sounds very complex and difficult. Thanks. SCP-2000 07:50, 3 February 2023 (UTC)
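As a rough, purely illustrative sketch of that second route (adapting an existing open model rather than building one from scratch), continued training on a slice of a freely licensed Wikipedia dump with Hugging Face Transformers might look like this; the base model name is a placeholder:

```python
# Illustrative sketch only: "some-open-llm" is a placeholder; a real effort
# would need a permissively licensed base model and far more data and compute.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "some-open-llm"  # hypothetical open base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # some causal LMs lack a pad token

# Wikipedia dumps are freely licensed, so they are usable as training text.
wiki = load_dataset("wikipedia", "20220301.en", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

wiki = wiki.map(tokenize, batched=True, remove_columns=wiki.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="wikimedia-llm",
                           per_device_train_batch_size=1),
    # The causal-LM collator copies inputs to labels so the Trainer gets a loss.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    train_dataset=wiki,
)
trainer.train()
```

Starting from an existing checkpoint this way sidesteps the cost of training a whole new model, which is the point of the comment above.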
- Have you seen phab:T328494?--Strainu (talk) 20:17, 10 February 2023 (UTC)
Voting
Support Xbypass (talk) 20:07, 10 February 2023 (UTC)
Support TheAmerikaner (talk) 20:17, 10 February 2023 (UTC)
Strong support PureTuber (talk) 22:20, 10 February 2023 (UTC)
Support Skimel (talk) 00:39, 11 February 2023 (UTC)
Support NMaia (talk) 05:59, 11 February 2023 (UTC)
Support Shizhao (talk) 13:59, 11 February 2023 (UTC)
Strong oppose as A) your goals were too vague, and B) language models like GPT-3.5 and LaMDA cannot *know*. They can and will spout out bullshit with utter confidence, because they're not meant as a source of information. QuickQuokka [talk • contribs] 17:17, 12 February 2023 (UTC)
- The point here isn't to rely on LLMs for factual accuracy, but to make editing easier. Tasks such as summarizing sources, detecting vandalism, etc. do not require the LLM to "know"; it just needs to understand and synthesize new text. I do agree that my goals are too vague and that I should've locked in on a specific feature that LLMs can deliver. CactiStaccingCrane (talk) 09:50, 13 February 2023 (UTC)
Oppose It's quite unclear what the goal here is, other than doing it for the sake of doing it. The proposal speaks vaguely of "fighting vandalism", "writing articles", and "benefits", but with no real focus. Perhaps there is value in large language models for Wikimedia, but there has to be a plausible goal first. Eiim (talk) 16:44, 13 February 2023 (UTC)
Support Rodolfo Hermans (talk) 09:11, 14 February 2023 (UTC)
Oppose per Eiim. --Lion-hearted85 (talk) 12:08, 14 February 2023 (UTC)
Support It's inevitable anyway, so why not do it now? Meganinja202 (talk) 15:50, 14 February 2023 (UTC)
Support This would be great for researching information, like answering quick questions based on information available on the Wikimedia projects. In its answers, the AI could point to sources. Also, the bibliographic data in Wikidata could be used to access open-access scientific articles and further train the AI. Fvtvr3r (talk) 20:16, 14 February 2023 (UTC)
Support ChatGPT is too much of a black box, so please, yes, develop one. Bert76 (talk) 09:36, 15 February 2023 (UTC)
Support Qono (talk) 18:06, 15 February 2023 (UTC)
Support People will use LLMs regardless of what our policies say. We might as well make an LLM available that is specifically trained to recognise sources Wikipedia sees as reliable, and specifically trained on our policies. I doubt any general-purpose LLM will ever be "good" at Wikipedia-specific tasks, without being trained specifically for them. DFlhb (talk) 19:36, 16 February 2023 (UTC)
Support Vulcan❯❯❯Sphere! 16:39, 18 February 2023 (UTC)
Weak oppose While it could be useful, the issue is that information coming from an AI obviously cannot be fully trusted. For some of these things, like source recommendations, the time it would take to verify the correctness of the AI's suggestions would be just as long as the time it would take to do it yourself, and for things like vandalism detection, that's not what language models can do. That's not what they're for. It's an interesting idea at its base, though: using AI to make lives easier. But putting it that way, isn't that what society as a whole is generally trying to do right now? Snowmanonahoe (talk) 19:36, 18 February 2023 (UTC)
Support cyrfaw (talk) 18:54, 21 February 2023 (UTC)
Support Thank you, but this MUST be backed up by RSs. Thingofme (talk) 01:56, 23 February 2023 (UTC)