Talk:Community Wishlist/W214
Missing template name
@Prototyperspective: It looks like the name of the template you were referring to has been removed in the part that reads English via template like . Could you add it in? SWilson (WMF) (talk) 04:54, 8 August 2024 (UTC)
Using Wikidata
I am not sure why this Wish suggests machine translation rather than Wikidata. E.g. wikidata:Q1413117 has quite a few translations for 'retention pond' already. Any thoughts, @Prototyperspective? Also, phab:T120451 is sort of related. Commander Keane (talk) 22:44, 19 October 2024 (UTC)
- Good point and question. Four issues with that:
- Often items are linked to Wikidata items that are very related but not exactly the same (could link some examples later)
- Many, maybe even most, Commons categories are not linked to a Wikidata item
- Even when the WD item is about exactly the same as the WMC cat, the label is often somewhat different, e.g. here retention pond vs retention ponds (this could be solved via plural/singular detection)
- Many Wikidata items do not have labels in many languages
- However, one could consider using these labels for categories that are linked to a Wikidata item for the languages where the item has the label set. I think it would make things more complicated and introduce issues rather than making things better. Nevertheless, it may (not sure about that) be good to add but the wish is meant to be as simple to implement as possible and this would make it more difficult while likely introducing problems. I think one thing one could theoretically definitely do in a beneficial way is to compare machine translated titles to labels to then either add labels or alter/replace the machine translated cat title.
- "Retention ponds" can be easily translated into many languages using MT quite accurately – e.g. see here and here. mw:MinT could be used for this. Prototyperspective (talk) 22:57, 19 October 2024 (UTC)
- Good points. Multilingual categorisation is probably a top issue for Commons and any step forward, like using machine translation, would be good. I see MT as a stop gap solution though and would be interested in other wishes in this realm using Wikidata. I notice that MT doesn't quite get Japanese right for Eurotunnel Class 9. And Italian and French MT don't match their respective Wikipedia articles either. Commander Keane (talk) 23:27, 19 October 2024 (UTC)
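A minimal sketch of the combination discussed in this thread: use the Wikidata label when the linked item has one in the reader's language and its English label matches the category title up to singular/plural, and fall back to machine translation otherwise. Everything here is hypothetical: `display_title`, `naive_singular`, the sample labels, and the `translate` callback, which stands in for a service such as MinT without modelling its real API.

```python
from typing import Callable, Dict, Tuple


def naive_singular(word: str) -> str:
    """Very rough English singularisation, for illustration only."""
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def same_concept(category_title: str, item_label: str) -> bool:
    """Treat 'Retention ponds' and 'retention pond' as the same concept."""
    a = [naive_singular(w) for w in category_title.lower().split()]
    b = [naive_singular(w) for w in item_label.lower().split()]
    return a == b


def display_title(
    category_title: str,
    lang: str,
    item_labels: Dict[str, str],
    translate: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Return (title to show, source), where source is 'wikidata' or 'machine'."""
    en_label = item_labels.get("en")
    if en_label and lang in item_labels and same_concept(category_title, en_label):
        return item_labels[lang], "wikidata"
    # No usable label: fall back to machine translation (to be marked for review).
    return translate(category_title, lang), "machine"


# Sample data typed in by hand, not fetched from Wikidata.
labels = {"en": "retention pond", "es": "estanque de retención"}
fake_mt = lambda title, lang: f"[{lang}] {title}"  # placeholder translator

print(display_title("Retention ponds", "es", labels, fake_mt))
print(display_title("Retention ponds", "ja", labels, fake_mt))
```

Note that the label returned from Wikidata stays singular while the category title is plural, which is the mismatch raised above; a real implementation would need language-specific handling for that.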
I don't know if this is the right solution
I don't think machine translation is the right solution here. I think a system of manual translation would be better.
I would propose:
- make category redirects fully work (so if you categorize something under a redirected name, it works correctly. Thus alternative names won't be confusing.)
- allow associating redirected category titles with specific lang codes
- show the category name as whatever lang code user language is set to.
Bawolff (talk) 17:44, 14 March 2025 (UTC)
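The three steps proposed above can be sketched as a small lookup structure, assuming each category stores its redirect titles keyed by language code. The class and method names are hypothetical and do not correspond to anything in MediaWiki; the Animals/Tiere pair is the redirect example mentioned elsewhere on this page.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Category:
    canonical: str
    # language code -> redirect title associated with that language
    titles_by_lang: Dict[str, str] = field(default_factory=dict)


class CategoryIndex:
    def __init__(self) -> None:
        self._by_title: Dict[str, Category] = {}

    def add(self, category: Category) -> None:
        # Register the canonical title and every language-tagged redirect.
        self._by_title[category.canonical] = category
        for title in category.titles_by_lang.values():
            self._by_title[title] = category

    def resolve(self, title: str) -> str:
        """Step 1: categorising under a redirected name lands in the target."""
        return self._by_title[title].canonical

    def display(self, title: str, user_lang: str) -> str:
        """Step 3: show the title for the user's language, else the canonical one."""
        cat = self._by_title[title]
        return cat.titles_by_lang.get(user_lang, cat.canonical)


index = CategoryIndex()
index.add(Category("Animals", {"de": "Tiere"}))  # step 2: redirect tagged with 'de'

print(index.resolve("Tiere"))
print(index.display("Animals", "de"))
print(index.display("Animals", "fr"))
```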
- You make a statement but why do you think this would be so? Why would a system of manual translation be better? It's not but also this isn't only machine translation so maybe you had a few wrong assumptions. This is making use of machine translation so instead of having no values in many languages, it has a very-likely correct one. That is better than a system where most languages do not have most or all item labels + descriptions set.
- Also it saves a lot of time that could be used for other things and is more likely to produce good results when the l/d is sourced from a well-maintained well-written source l/d (e.g. well-defined items rather than sloppily described ones; the baseline quality would be much higher and vandalism + low-quality cases far rarer + l/d more reliable).
- See this key part: b) flaws can be corrected using a machine translation correction system where people can see the machine translations for the languages they speak and if necessary adjust them.
- Regarding your proposal: good to bring up redirects. See this part: "Rarely people create redirects in other languages (example) which can also e.g. be used with HotCat - the approach proposed here is more scalable and much better than creating redirects like that." So something more constructive would be considering and addressing this part instead of just bringing up redirects. Like, how would these redirects be created at scale? But I could do that for you: one could show the Wikidata label for the linked item as a redcat for the user with that language on that category page so they can create that redirect cat with a click (or even auto-create these). Two problems with that are that often the category title deviates somewhat from the Wikidata item and that many categories do not have a Wikidata item. The redirects idea may sound good at first glance and on paper – it's short and simple – but it's not viable.
- It's not scalable (there are millions of categories; this wouldn't even work for languages with many active editors willing to spend/waste their time on this but there are very many without many editors)
- it's not implementable to any meaningful degree in the real world and
- as far as I can see, it would only be a way to duck out of using machine translation, which is more than ready, for no reason other than avoiding the great opportunity it offers
- However, it could be best to involve redirects – for example because then there will be a matching URL which may have relevance to how things show in search engines and the discoverability of the page (also for unregistered viewers it would set the language of the category page, e.g. ?uselang=de for a German-language redirect to a cat page). I think a good way to use them would be to have these redirects be auto-created based on the translated category name. A translated category name is the machine translated label or the adjusted version thereof – if a translated label gets adjusted or even gets a click by an autoconfirmed user on the 'confirm' button next to it, then the redirect is created (the prior created one possibly deleted). If an auto-translated cat label is not confirmed one could have it wait 6 months until the redirect is autocreated. Also, "associating redirected category titles with specific lang codes" here would be done automatically. "so if you categorize something under a redirected name, it works correctly": this is a bug in the UploadWizard and should be addressed but is an independent issue. "show the category name as whatever lang code user language is set to": Yes, this should be used to show the translated category title and is part of this proposal.
- Prototyperspective (talk) 16:55, 15 March 2025 (UTC)
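The redirect auto-creation rule described in the comment above can be sketched as follows, assuming a translated title is tracked with its creation date and two review flags. The record type, field names, and the 183-day approximation of "6 months" are assumptions for illustration, not part of any existing system.

```python
from dataclasses import dataclass
from datetime import date, timedelta

WAIT = timedelta(days=183)  # assumption: "6 months" approximated as 183 days


@dataclass
class TranslatedTitle:
    source_title: str
    lang: str
    text: str
    created: date
    confirmed_by_autoconfirmed: bool = False  # 'confirm' button was clicked
    manually_adjusted: bool = False           # a human corrected the translation


def should_create_redirect(t: TranslatedTitle, today: date) -> bool:
    """Create at once if reviewed; otherwise only after the waiting period."""
    if t.confirmed_by_autoconfirmed or t.manually_adjusted:
        return True
    return today - t.created >= WAIT


pending = TranslatedTitle("Garden sheds", "nl", "Tuinhuisjes", date(2025, 1, 1))
print(should_create_redirect(pending, date(2025, 3, 1)))  # still waiting
print(should_create_redirect(pending, date(2025, 8, 1)))  # waiting period over
```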
- As Bawolff says a system of manual translation would be better. Wikidata is already a manual translation system, let's use that. If you insist on machine translation you could use it where Wikidata lacks a language (and clearly mark it for human review). I do believe that Wikidata is the holy grail for Commons categorisation. Some may not be able to see that yet, but I imagine others can (I am guessing Google's $1.5 million Wikidata grant was intended to feed their machine translation of places/things so they can make AI summaries that remove the need for people to click on Wikipedia articles). Also, this wish presents a solution (and not necessarily a good one) in the title so I suggest reforming it as "Multilingual categorisation system for Commons" or similar. If you see any limitations in using Wikidata for categorisation, you can discuss them somewhere and solutions will be found. As a software development community we probably are going to have to decide if we continually patch a bad system, or work on a better new one which is presumably based on SDC. Commander Keane (talk) 22:22, 15 March 2025 (UTC)
- Please first engage with what is in the wish or in the prior comment that you reply to. I don't know if you read either. I'll just repeat and leave it at that: Two problems with that are that often the category title deviates somewhat from the Wikidata item and that many categories do not have a Wikidata item. Prototyperspective (talk) 21:27, 16 March 2025 (UTC)
- I know about your arguments of Wikidata deviation and lack of Wikidata items. I think they are not difficult to solve using SDC. Name a problematic category and I will attempt to describe the solutions I envision.
- This wish is only about reading categories isn't it? What about adding the correct category at upload? Or organising category trees when you don't speak English, including creating a category? For example a Japanese speaker takes a photo of a theme park. They go to add a category in the upload form by typing "アドベンチャーワールド", and the machine translation system presents two translated options of categories "Adventure World (Japan)" and "Adventure World (amusement park)". They have to guess the correct one. I imagine this is like picking between "Disneyland (USA)" or "Disneyland (amusement park)", which one looks more correct? I would go for the latter in case the first is referring to a company or city - which is completely wrong in this scenario. There will be even trickier scenarios. In the Adventure World example, you could disambiguate the latter option better by changing the category name to "Adventure World (Perth)" and hope that Scotland's Perth doesn't open a theme park in the future. But guess what, Wikidata takes care of any ambiguity already, and SDC depicts use that.
- I move that this wish be archived and a problem-based wish be established, where everyone can discuss an appropriate solution (including machine translation). Commander Keane (talk) 22:46, 16 March 2025 (UTC)
- Then please address them for constructive discussion in particular if you mean to attack my proposal. What do you mean with "I think they are not difficult to solve using SDC" and why would it not use machine translation?
- Examples: c:Category:Our World in Data energy and environment maps of the world, c:Category:Microscopic images relating to biology, c:Category:Animals using surface tension, c:Category:Meat consumption maps of the world, c:Category:Audio files with closed captioning in Serbian Cyrillic, c:Category:Agriculture statistics, c:Category:Wikipedia article statistics.
- Spanish machine translation for the latter: Estadísticas del artículo de Wikipedia; Verdict: correct (does not need any adjustment); Estimated time required to set a label for the hundreds of languages Wikidata has: more than 3 hours (without also setting some item description or alias); Verdict of volunteer time-well spent: not time-efficient; Feasibility of most or all languages' labels being set this way: near-zero (due to limited time available, low motivation for this task, and the many millions of items).
- If you have another assessment than the above, please explain the rationale.
- 'and the machine translation system presents two translated options of categories "Adventure World (Japan)" and "Adventure World (amusement park)"': This is off-topic to this wish. I suggest you create a separate wish or issue about [how to address potential] ambiguity of autocomplete cats in UploadWizard. 'They have to guess the correct one': No, they don't: they'd need to check the two categories to find which is the correct one to set but again this is a separate problem (and is rare btw).
- "where everyone can discuss an appropriate solution": Wishes are about solutions to described problems. When somebody has another solution, they can submit another wish. So far you haven't described an alternative solution, and talk pages are also there for discussing a wish and potential alternatives, which you so far haven't. Prototyperspective (talk) 11:51, 17 March 2025 (UTC)
- I will try to convey my understandings.
- Structured Data on Commons (SDC) is not about duplicating the current category system with a corresponding Wikidata for each category.
- SDC is about describing files individually and running queries that intersect those descriptions.
- These queries would form the basis for new category-like pages. So intersecting article (Q191067), Wikipedia (Q52) and statistical data (Q35308049) would generate a page like c:Category:Wikipedia article statistics. The page would be defined by an editor once; this is not about a query service with blank inputs. Subcats could be generated by placing additional intersections, like Good article (Q21167453).
- Describing all ~117 million files is a big job, but so was writing 64 million Wikipedia articles (source). The benefits are a readable, editable, flexible system that doesn't preference one language over any other that people from any language can contribute to (easing a scarcity of volunteers).
- Concerns about the state of Wikidata's translation coverage could be eased by machine translation integration where human translation isn't available. Considering the concept of intersections many languages have translations for items already available. FOSS (and probably proprietary) machine translators need to feed on Wikidata I imagine.
- Flexibility with SDC is in the querying, for example the ability for an Arabic speaker to instantly find German Wikipedia article statistics from 2007 with PNG extension and Polish labels. A purposefully obscure one, but what about instantly creating a page to display all the photographs (and nothing irrelevant) of Karel Čapek using Czech input (idea credit) so you can pick the best one for a Wikipedia article.
- Also, why should humans need to constantly narrow down categorisation by country, province, town, suburb when the parent cat gets too big? Or century, decade, year, month, day? This is an automated software job, and that software needs to be able to understand what is in the files. I am looking at a bigger solution to many problems. I understand technological improvements are required, and they may be large but the rewards are large too.
- The potential for SDC may be easier to understand by looking at a file you are familiar with. I think I saw you upload something like a deadlift exercise demonstration video from YouTube. The human SDC tags would be the wikidata item "deadlift", "demonstration" and "female". Other data would be "video", "no audio", "source: YouTube (channel ID = 'donna does dumbells')" and whatever else you like. Then for a navigatable tree you could start with "exercise demonstrations", which subpages like "deadlift exercise demonstrations" autogenerated if there are >1 files. If you want a page of videos you add that intersection. Also, if that particular YouTube channel gets 126 deadlift videos uploaded it will be possible to refine the categorisation automatically by creating a subcat-type page if one creator has > 10 files (no need to visit each of 'donna's' files and change "cat:deadlift videos" to "cat:deadlift videos by donna"). The search potential is also raised if files are fully described. Looking for a deadlift video with audio not from YouTube is possible in the future if that is useful.
- I wasn't aware you purposefully limited your proposal to only reading categories in languages other than English, my general alternative "Multilingual categorisation system for Commons" did contradict that.
- I didn't feel comfortable discussing potential alternatives under a wish with such a restrictive title and description that I disagreed with, but in the end you drew me out. I don't want to invest any further time in discussing this at the moment but I am sure there are others with a better overview and way of explaining things. I hope you and anyone else following along gained a better understanding of my reasoning. Commander Keane (talk) 08:12, 24 March 2025 (UTC)
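The intersection idea in the comment above can be sketched with plain sets, assuming each file carries a set of Wikidata item IDs and a category-like page is defined once as the set of items a file must have. The Q-IDs are the ones quoted in the comment; the file names and the `page_members` helper are made up for illustration and this is not how SDC queries are actually executed.

```python
from typing import Dict, List, Set


def page_members(files: Dict[str, Set[str]], required: Set[str]) -> List[str]:
    """Return the files whose statements include every required item."""
    return sorted(name for name, items in files.items() if required <= items)


# Hypothetical files and their item sets.
files = {
    "Enwiki article count 2007.png": {"Q191067", "Q52", "Q35308049"},
    "Good article growth.svg": {"Q191067", "Q52", "Q35308049", "Q21167453"},
    "Wikipedia logo.svg": {"Q52"},
}

# article + Wikipedia + statistical data
article_stats = {"Q191067", "Q52", "Q35308049"}
print(page_members(files, article_stats))

# A "subcat" is the same definition with one more intersection (Good article).
print(page_members(files, article_stats | {"Q21167453"}))
```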
- Commons categories are also about describing files individually and [enabling] running queries that intersect those descriptions.
- Maybe Structured Data on Commons (SDC) is not about duplicating the current category system in its goals or in theory but it is in practice and reality. This is also quite clear from the fact that, for example, most SD gets set via tools that set it based on categories.
- "These queries would form the basis for new category-like pages." 1. That's only your idea and not real. 2. It's redundant to categories. 3. Just more extra work. 4. "would generate a page like c:Category:Wikipedia article statistics": It would contain various files that don't belong there and, more importantly, miss many files that are in the category. In addition, it also doesn't have useful subcategories for whatever subcategorizations are useful.
- "These queries would form the basis for new category-like pages": You can do the same with Commons categories. Either personally using deepcategory or petscan, or also for others by creating a new category, or via some script/bot that populates some category like that routinely.
- "that doesn't preference one language over any other": Continued perseverance of the misconception that these various things would be possible with SD but aren't with categories. I could link to this very proposal.
- "Flexibility with SDC is in the querying": Same again. Can be done with categories. In addition, nobody would use this as it's too difficult for real people except for a handful of very active techie long-time contributors. And again these could also use the categories. Moreover, completing that task by just looking at the category and maybe checking two subcategories takes maybe 30 seconds, so what's won with that? Thinking of and creating that query and fixing its issues would probably even take longer anyway.
- And as a note it can be useful to show the other language items also since it may contain some high-quality very suitable files that may not yet be in that language but which the user could and may want to translate so navigating to or filtering via a nearly empty subcategory (for the subject) about Polish charts but first shortly coming across a glanced category containing also other charts is beneficial.
- Just look at how many hypotheticals and ideas make up your post, nearly all of it – categories in contrast are there, real, tangible, pragmatic, sufficient and improvable with no infeasible, extremely laborious ideas needed to justify using and improving these.
- "This is an automated software job": Agree! SD.is.not.needed.for.that. Why do you assume categories can only be set manually by humans but structured data not? I don't know where such misconceptions come from. (Btw, I had another idea/proposal regarding the tech survey but didn't yet outline it since I think those other things in it are even more urgent for now. For now see Auto-addition of inferrable categories, which I may revive some time.)
- "I am looking at a bigger solution to many problems.": Me too, but it's a misconception that SD would address these or that SD would be required for addressing these. Pretty active there so I think I can say I have robust enough experience with the site for things to be considered in a way that does not assume I'm not thinking of broader issues also etc.
- 'The human SDC tags would be the wikidata item "deadlift", "demonstration" and "female"': all of that is set via the categories. 'Then for a navigatable tree you could start with "exercise demonstrations"': there's a category for that. 'which subpages like "deadlift exercise demonstrations" autogenerated if there are >1 files': agree, but SD is not needed for that. And it's just hypotheticals anyway.
- Thanks for elaborations though. I just think while there may be some applications for SD like having metadata about whether or not the file has audio (currently only in the categories), it doesn't solve / improve the issues people think it would and impedes/delays addressing these in practice and inhibits people from considering proposals that do not require/involve SD. Prototyperspective (talk) 14:50, 24 March 2025 (UTC)
Clarification
Hi @Prototyperspective, I want to clarify the problem you're trying to solve here.
Take for example https://commons.wikimedia.org/wiki/File:Zernez,_Unterengadin,_Graub%C3%BCnden._20-09-2023._(actm.)_71.jpg?uselang=nl ... it's in the category "Garden sheds".
Is your proposal about displaying "Tuinhuisjes" instead of "Garden sheds" in the list of categories on the bottom of the page when a user is browsing the File page in Dutch? And then if they click through on the category link should the Category page title be also displayed as "Tuinhuisjes" rather than "Garden sheds"? And how about if I'm browsing in Dutch and I want to add an image to that category - ought I be able to add [[Categorie:Tuinhuisjes]] to the wikitext and it'll be understood as the "Garden sheds" category?
If you want all these things then I can't see how to do it without changing how categories are stored in MediaWiki itself - which would be a very large engineering project. CParle (WMF) (talk) 15:16, 7 August 2025 (UTC)
- Yes. It's more the other way around, I'd start with that the category page c:Category:Garden sheds should display the machine translated title if Dutch is set in the preferences or ?uselang=nl is in the URL. Then only once this works could one work on also having the categories underneath or above the images on the file pages display the machine translated title.
- It's a very good question how category addition would work with this – I think also at start it could work just as is with only the English category title and any redirects working where active users that actually do add categories (a small subset of Commons users) either have English set in their preferences for that reason or click the button to see the original/English cat-title on demand if they want to add such on the cat-page (or maybe also the cat on the file page). Redirects work as well and there already are many redirects of cat-titles in other languages but not consistently or reliably – for example there is c:Category:Tiere (de) redirecting to c:Category:Animals (en). Then later this could be changed so that users could also enter the cat-title in the other languages. There would be several ways for this – one would be to store and have linked the machine translated (and possibly rarely manually adjusted) titles to their original cat-title so that when people enter these the proper category gets added; another would be to automatically create very many redirects which automatically get adjusted (well deleted and a new redirect made) if the machine-translated title is manually adjusted or the original cat is renamed causing the machine translated one to change as well. I'd favor the latter since that also affects the URLs which is relevant (mainly) to people searching the Web in other languages with Web search engines.
- Never said this would be a small change. It would have tremendous benefits, may not be as big of a change as you think it is (actually, just letting MinT translate category titles and enabling their display on the cat page would already solve most of it) and importantly this would be less effort and save a great deal of volunteer time compared to people entering and adjusting and fixing and checking manual translations of category titles. That's not feasible anyway; currently a very small subset (even when it comes to the largest categories linked on the frontpage) of cats have a small number of manually added and maintained translations via templates like multilingual description such as c:Category:Mechanical engineering (sometimes this is just the title, not a description including the title in bold). More useful would be to explain why you think it would be a specifically very large engineering project and what you think would need to be changed. I don't think this is an urgent issue to solve but consider that machine translations have become much better just recently and that Commons categories are generally very short and well-translatable and importantly the big impact this could have in terms of more people using and on the Web finding Commons (category pages). The full thing including also displaying translated cats at file pages and allowing people to use these for adding cats could be more complex but again I'd suggest to start with and at first only implement the basic display of the translated title on cat pages. Prototyperspective (talk) 11:23, 10 August 2025 (UTC)
- @MusikAnimal (WMF) and JWheeler-WMF: Please change the status of the wish to open. It's been submitted since 4 August 2024. It's an important impactful thing that could be done regardless whether it would be a difficult / large undertaking or not. Prototyperspective (talk) 18:02, 13 October 2025 (UTC)
- I have marked it as "Accepted". Best, MusikAnimal (WMF) (talk) 07:04, 14 October 2025 (UTC)
- @Prototyperspective
- Thank you for your wish!
- While reviewing this with the team, we were unsure about ways to decompose the proposed initiative into one or more concrete deliverables. Then we can work with the stakeholders to understand the feasibility of the wish, but right now it's all very open. MikeZ-WMF (talk) 11:39, 27 November 2025 (UTC)
- Thanks for all the good conversations on this discussion page so far!
- Right now there's not yet consensus on using anything machine generated without review and the translation team doesn't have the right service for this yet, so we're leaving this as a long-term opportunity for now. MikeZ-WMF (talk) 09:44, 19 March 2026 (UTC)
Wondering how far this has been thought through
- How will we avoid absurdities like turning L'Étoile into "The Star" or Les Deux Magots into "The Two Nest Eggs"?
- How will we avoid collisions (same name in different languages for different categories; or two different names translate into the same name in some language)?
- How will we add categories to file pages and category pages if we don't all see them under the same name?
- Jmabel (talk) 00:50, 25 October 2025 (UTC)
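The collision question above can be made concrete with a small check, assuming the translated titles for one target language are available as a mapping from source title to translation. `find_collisions` and the sample data are hypothetical ("Der Stern" is an invented second category, used only to force a collision).

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def find_collisions(
    translations: Dict[str, str], existing_titles: Set[str]
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Return (duplicates, clashes) for one target language.

    duplicates: translated title -> source categories that share it
    clashes: source categories whose translation equals the title of
             a different, already existing category
    """
    by_translation: Dict[str, List[str]] = defaultdict(list)
    for source, translated in translations.items():
        by_translation[translated.casefold()].append(source)
    duplicates = {t: sorted(s) for t, s in by_translation.items() if len(s) > 1}
    clashes = sorted(
        source
        for source, translated in translations.items()
        if translated in existing_titles and translated != source
    )
    return duplicates, clashes


to_english = {
    "L'Étoile": "The Star",
    "Der Stern": "The Star",
    "Garden sheds": "Garden sheds",
}
existing = {"L'Étoile", "Der Stern", "Garden sheds", "The Star"}

duplicates, clashes = find_collisions(to_english, existing)
print(duplicates)
print(clashes)
```

Titles flagged this way could be held back for human review instead of being displayed or turned into redirects automatically.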
- Very good question! The talk page of the wish is the current place to develop this further and identify issues like this and think about and develop ways such could be addressed. These kinds of simple mistranslations may well be the most common types of issues with machine translation for well-working language-pairs like EN<->ES. The postediting system described in W78 is about how such could be corrected efficiently and at scale by contributors when it comes to articles. For category titles, which are shorter and e.g. rarely have the same mistranslatable word in many titles, that's probably not needed or not needed fully to be able to address this well. I think something simpler would work here – namely being able to correct translations which could undo the translations into other languages than the corrected one and mark them for review or adjust them based on the corrected translation. Correcting translations would allow the user to specify which part of the title are names that should not be translated (e.g. WHO or The Guardian) which in that case would also adjust the other languages' translations (and the edit that does the correction does show up on the Watchlist). Moreover, if the category is linked to some Wikidata item this could be used in clever ways to do things like this automatically for many categories and/or mark them for needing human review.
- Hold back that cat translation so a human review is needed where the user working on that backlog can for example append (TV series) to the translation in say French. There may well be quite a few hundred of these but overall it would be a tiny percentage of categories and if 99.9% of categories have titles translated into your language and 0.1% are still in English / the original cat title, then that's still huge progress.
- I like that you really thought much about this and with your Commons experience asked key questions, thanks for that. There are at least two options: 1. also show the English/main/source category title somewhere well-visible on the category page 2. allow people to use these translated categories like redirects. I think the latter would be preferable. I can already add the category Animals to files by adding German "Tiere" via HotCat (including its autocomplete; note that if adding the cat via wikitext editing a bot will move the file to the target cat shortly thereafter) but it's unreliably rare exceptions that German-language redirects have been created so this can't be used in practice.
- Prototyperspective (talk) 23:04, 27 October 2025 (UTC)
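The suggestion above to mark parts of a title as names that should not be translated can be sketched with placeholder masking. This assumes the translation engine passes placeholder tokens through unchanged, which would have to be verified for any real service; `translate_with_protected` and the toy `fake_translate` function are hypothetical.

```python
from typing import Callable, Dict, Iterable


def translate_with_protected(
    title: str, protected: Iterable[str], translate: Callable[[str], str]
) -> str:
    """Mask protected names, translate the rest, then restore the names."""
    placeholders: Dict[str, str] = {}
    masked = title
    # Longest names first so a shorter name never splits a longer one.
    for i, name in enumerate(sorted(protected, key=len, reverse=True)):
        if name in masked:
            token = f"__{i}__"
            masked = masked.replace(name, token)
            placeholders[token] = name
    result = translate(masked)
    for token, name in placeholders.items():
        result = result.replace(token, name)
    return result


def fake_translate(text: str) -> str:
    """Toy English-to-Spanish stand-in that wrongly translates 'The'."""
    return text.replace("Articles from", "Artículos de").replace("The", "El")


print(fake_translate("Articles from The Guardian"))
print(translate_with_protected("Articles from The Guardian", ["The Guardian"], fake_translate))
```

Storing the protected spans with the source title, rather than with one translation, is what would let a single correction carry over to the other languages.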