Talk:Community Wishlist/Wishes/Software for turning articles into spoken Wikipedia audios using novel AI voice tech
This page is for discussions related to the Community Wishlist/Wishes/Software for turning articles into spoken Wikipedia audios using novel AI voice tech page.
Thank you
@Prototyperspective, thank you for sharing a problem with the Community Wishlist. We will get back to you with further questions if need be. –– STei (WMF) (talk) 13:15, 18 October 2024 (UTC)
Generate spoken content on the fly
I completely agree with your proposal and am genuinely glad to see it addressed. In today's AI landscape, though, I believe we already have the capability to generate spoken content dynamically. Rather than relying on pre-recorded audio, AI can read the most current version of an article on the fly, ensuring that the content is always up to date without needing to refresh recordings. This approach could also allow us to conserve energy and focus it on developing other useful features related to this functionality, such as the ability to read only specific sections of an article, fast-forward or rewind by a certain duration, adjust playback speed, or even customize the voice style or accent, making article navigation more efficient and personalized. - Klein Muçi (talk) 23:55, 20 October 2024 (UTC)
- Well, I expected this issue to be raised. There are many problems with on-the-fly generation, which is why this proposal was deliberately not about it, and while I'm sure I'll forget to raise some of them, I guess there's no way around addressing this (and thanks for asking). Maybe there even is a way to implement this via on-the-fly generation, but it would overcomplicate things for no reason and make it much harder to implement well without drawbacks, if that's possible at all (and I think in practice it isn't).
- However, first of all, you seem to have misunderstood the proposal: Rather than relying on pre-recorded audio, AI can read the most current version of an article[…]. This proposal is exactly about having a recent or the latest version of the article spoken by AI; it does not rely on "pre-recorded audio" – why do you think so?
- So here are the problems with on-the-fly generation of the very latest version of the article, in case you also meant that (the quote above makes me unsure of it, but other people will inevitably wonder about this as well):
- Generating audio dynamically would mean the audio cannot be found and is not indexed; a key benefit of publishing the audio as a file is that people can find it
- It would require the listener to wait until the audio generation is done
- It would cause excessive server load that is not needed (and may not be possible even if WMF had more computing resources, due to reliance on third-party servers)
- It could not be embedded in pages where people can find it and tap play
- Some adjustments usually need to be made, and this can't happen when generation is fully automatic and dynamic. Please see all the adjustments made in the SoniTranslate tutorial and, if possible, create an audio for a non-short article with it once to see the issues that need corrections. Some of the issues are really simple things that vary per article, such as replacing "M" with "million" depending on context, removing the header of a table that is not narrated, or removing some characters (if not the entire example) in a code example that is narrated, etc.
- As explained in the proposal, automatic generation can be implemented once the tool has been used to develop correction & adjustment rules, such as replacing "i.a." with "inter alia" or "e.g." with "for example". Once the tool has been created, one can develop the at-scale automatic generation script (e.g. at some point, 4 files for every article, each with a different voice). It would still not generate the audios on the fly but at regular intervals, such as once a month if there have been changes, due to reasons 1-4 and more. For the ability to read specific sections, please see Community Wishlist/Wishes/Video & audio chapters (jump to timestamp).
- Prototyperspective (talk) 12:57, 21 October 2024 (UTC)
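To illustrate, the correction and adjustment rules described above could be expressed as a small table of pattern replacements applied to article text before narration. This is only a hypothetical sketch of the idea; the rule list and the function name `normalize_for_narration` are illustrative assumptions, not part of the proposal or any existing tool:

```python
import re

# Illustrative pre-narration normalization rules (pattern -> replacement).
# The actual rule set would be developed iteratively while using the tool.
RULES = [
    (re.compile(r"\bi\.a\."), "inter alia"),
    (re.compile(r"\be\.g\."), "for example"),
    # "M" directly after a number usually means "million" in article prose:
    (re.compile(r"(\d)\s*M\b"), r"\1 million"),
]

def normalize_for_narration(text: str) -> str:
    """Apply correction/adjustment rules before handing text to TTS."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

Context-dependent cases (like "M" in a chemistry article meaning "molar") are exactly why such rules would need to be refined manually with the tool before any fully automatic, at-scale generation.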
- Prototyperspective, I do understand that it won't be pre-recorded per se, and I didn't want to use that term initially, but I had no idea how to express the difference between what you were suggesting and my suggestion. What I'm picturing is an AI capable enough to actually read properly. If we're talking about problems such as failing to replace "M" with "million" etc., then the AI we're imagining is different, and it wouldn't be too different from the read-out-loud functionality some web browsers already have natively. We can of course create an AI tool first and then progress towards making it more autonomous, but I personally believe Wikimedia in general is falling a bit behind in the whole AI thing, and the sooner we try to fully embrace it, the better it would be.
Currently, the only big AI feature we have is the Automoderator functionality. From my experience tutoring new users in wiki workshops, I frequently hear requests for AI-driven features, such as an article creation assistant that handles things like references and templates, or a policy-advising AI that warns users of potential policy violations before submission. These ideas aren't part of the current proposal, but they highlight a broader expectation for AI integration.
Personally, I’ve always appreciated the spoken articles project. However, the constant updates to articles made recording them in a static format seem impractical to me. Since browser-based text-to-speech is now readily available, if we're going to adopt AI-driven solutions within Wikipedia, I think it’s good to go all in, if we can afford that. — Klein Muçi (talk) 13:27, 21 October 2024 (UTC)
- Reading "M" instead of "million" is reading properly – the AI can't know what it's supposed to mean or what the preferred replacement is, and it shouldn't replace things on its own but read the text exactly as specified. As said, it needs these auto-correction & adjustment rules; then it could do things more autonomously, and these rules could be developed using the tool, where audios are created more slowly than with at-scale fully automatic generation.
- It's very different from the read-out-loud functionality of browsers due to the reasons listed above, starting with narration quality: read-out-loud uses very low-quality text-to-speech.
- We can of course create an AI tool first and then progress towards making it more autonomous but I personally believe Wikimedia in general is falling a bit behind – I think so too, but again, step 1 is needed for step 2: if you skip step 1, the thing will probably not work, and people will consider it failed if it's not done right.
- The spoken article audios are on average outdated by an estimated 7 years, while the audios created here would simply be a version that is 1 or 2 months old (and articles with lots of changes, or in the current-events category, could be updated more frequently). Browser-based text-to-speech is outdated, low-quality tech that nobody wants to listen to, while this proposal is about creating high-quality audios that one actually likes and listens to, similar to podcasts. Prototyperspective (talk) 13:55, 21 October 2024 (UTC)
- Prototyperspective, as I already said, I support every kind of AI integration and, by extension, your proposal as well. I'd just be more supportive of a wish that asks for a more "autonomous AI approach" if I had to choose. In an age where AI technology is being used even for entertainment purposes, Wikimedia sites feel very left behind, relying so much on manual work. Big wikis can work hard enough to not have that large a gap in this aspect, but small wikis, which have 10 active volunteers tops, don't stand a chance. But again: I support your proposal. — Klein Muçi (talk) 15:16, 21 October 2024 (UTC)