
Talk:Community Wishlist/Wishes/Physical Wikimedia Commons media dumps (for backups, AI models, more metadata)

From Meta, a Wikimedia project coordination wiki
Latest comment: 7 months ago by Prototyperspective in topic Legal implications


why are there no Wikimedia Commons dumps that include any media?


Assuming there is a reason, which there usually is, why are there no Wikimedia Commons dumps that include any media? Polygnotus (talk) 19:40, 4 September 2024 (UTC)

See c:Commons:Dumps and backups; maybe you can find some information on that there. A dump of all files, rather than, for example, only files in use or all files except videos above 100 MB, would be so large that even few people could download the whole thing. I think this really needs to change, but consider what would happen if people could download a dump of some 100 TB from a constantly changing website but couldn't get a physical dump (e.g. lots of server load). Hence, I think what I proposed here should probably be implemented first. Prototyperspective (talk) 20:22, 4 September 2024 (UTC)
These may be of interest:
https://magnustools.toolforge.org/can_i_haz_files.html
https://wikilovesdownloads.toolforge.org/
Polygnotus (talk) 20:41, 4 September 2024 (UTC)
Thanks a lot! These are very useful. I suspected something like this must exist, so I should have searched for such tools when the Dumps and backups page was created. I'll add them there if you don't mind; you could edit it or add more if you know of any. It's strange that WMC didn't have a page about dumps and backups, or a page linking to these tools, earlier, and there are probably further pages where these links are relevant. (Note that this doesn't make the proposed thing less useful, and I doubt these tools work for large categories, especially not c:Category:CommonsRoot.) Prototyperspective (talk) 09:16, 7 September 2024 (UTC)
Toolforge tools have a huge discoverability problem. These tools can't handle CommonsRoot, but they are better than nothing. Polygnotus (talk) 15:59, 7 September 2024 (UTC)
Yes, I've added them to the page along with other tools I found (if you have any more, please add them). I agree with what you said about Toolforge tools. What ideas do you have for making them more discoverable? I think they should always have at least 1) a documentation page (which includes a talk page and info on how to use the tool) and 2) a source code repo (for example on GitHub). Let me know whether you agree, what could be done about it, and whether you have any further ideas. This applies not only to Toolforge tools but also to scripts. Prototyperspective (talk) 10:29, 8 September 2024 (UTC)

┌─────────────────────────────────┘
@Prototyperspective: You could stick this in en:Special:MyPage/common.js and it would add 2 links to the bottom of the Tools section of the sidebar.

mw.loader.using(['mediawiki.util', 'mediawiki.base']).then(function () {

    // Add two links to the bottom of the sidebar's "Tools" section.
    function addCustomLinks() {
        var portletId = 'p-tb'; // This is the ID for the "Tools" section

        // The 'wikipage.content' hook can fire more than once per page view,
        // so do nothing if the links have already been added.
        if (document.getElementById('toolforge-link')) {
            return;
        }

        mw.util.addPortletLink(portletId, 'https://toolhub.wikimedia.org/search?ordering=-modified_date&page=1&page_size=50', 'Toolforge', 'toolforge-link');
        mw.util.addPortletLink(portletId, 'https://en.wikipedia.org/wiki/Wikipedia:User_scripts/List', 'Userscripts', 'userscripts-link');
    }

    mw.hook('wikipage.content').add(addCustomLinks);
});

Polygnotus (talk) 16:09, 8 September 2024 (UTC)

I'm not sure what it does or how it relates to what I wrote. Could you explain, and what do you think about what I suggested there? The toolhub site is really interesting, but doesn't this script simply add a link to that website to some wiki pages? What's the benefit of that? Even if that is useful, requiring users to add this script just to show the link is not a solution; one could also just send the link. Additionally, it seems like the documentation needs to be within the Wikimedia projects themselves, for example so that it's findable in search, wikilinkable, and discussable on talk pages. Lastly, it seems like lots of tools on there return a 404, like this one. Prototyperspective (talk) 17:16, 8 September 2024 (UTC)
Well, I often want to look up whether some kind of tool/script already exists or whether I have to make it. This script adds 2 links to the bottom of the Tools section in the sidebar to give me quick access. That does not solve the problem of discoverability for everyone, of course, but it does save me some time.
The toolhub is a great idea but, as you noticed, not perfect yet. It could be improved and promoted more so that people can find what they need and know where to look. Polygnotus (talk) 17:22, 8 September 2024 (UTC)
"It could be improved and promoted more so that people can find what they need and know where to look." Agreed, and thanks for the info about that site anyway. If it really just adds these two links, then I'd just enter something like toolh into the URL bar and add the page to my bookmarks. It would be good if you or somebody else created an issue to add links to these pages where they are relevant, or added them directly, but I don't know where that would be and don't think it would be the Tools section. Maybe in Special:Preferences under tools and/or via integrating this into the Wikimedia search engine. Prototyperspective (talk) 17:26, 8 September 2024 (UTC)
Legal implications

I don't like encouraging legal paranoia but I think legal implications of an offline dump should be considered. What if a file in the dump is discovered to be illegal? There will be a large percentage of copyvios but I am talking about something a court orders action on.

And I think I read something in Phabricator about WMF Legal prescribing the destruction of models based on files where a file included in the model was discovered to be sensitive.

On another topic, I am not sure if I spotted it in the wish, but reduced-resolution versions of large files could make the dumps more manageable, and applications like reverse image search may not need the original file anyway. Commander Keane (talk) 11:06, 13 November 2024 (UTC)

Good point. Maybe there are protections for cases like this, where it's common sense that one can't make sure that absolutely no copyvios are included in the dump, or maybe it's better done by some external project rather than the WMF. I don't think it's an issue when models are just trained on these files but don't output their contents. I think if reduced-resolution files are used, those should only be large unused files (like a 5 GB video that is used in no Wikimedia project). I included the idea of having smaller dumps / filters where one could just filter out some particularly large files. Prototyperspective (talk) 12:50, 13 November 2024 (UTC)
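
For illustration only, the size filter mentioned above could look roughly like the sketch below. It is not an existing tool or dump format: the record fields (title, sizeBytes, usageCount), the function name selectForDump, and the 1 GB threshold are all hypothetical and chosen just to show the idea of keeping every file that is in use and dropping only unused files above a size limit.

```javascript
// Hypothetical record shape: { title, sizeBytes, usageCount }.
// These field names are assumptions, not part of any real dump format.
function selectForDump(files, maxUnusedBytes) {
    return files.filter(function (file) {
        // Files used on at least one wiki page are always kept.
        if (file.usageCount > 0) {
            return true;
        }
        // Unused files are kept only if they are at or below the size limit.
        return file.sizeBytes <= maxUnusedBytes;
    });
}

var GB = 1024 * 1024 * 1024;

// Made-up sample records for demonstration.
var sample = [
    { title: 'File:Small unused.jpg', sizeBytes: 2 * 1024 * 1024, usageCount: 0 },
    { title: 'File:Large used.webm', sizeBytes: 5 * GB, usageCount: 3 },
    { title: 'File:Large unused.webm', sizeBytes: 5 * GB, usageCount: 0 }
];

var kept = selectForDump(sample, 1 * GB).map(function (file) {
    return file.title;
});
console.log(kept);
```

With the made-up sample, only the large unused video is filtered out. A variant could swap the excluded files for reduced-resolution versions instead of dropping them, as suggested earlier in this thread.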