Talk:File metadata cleanup drive


Templates about GLAM partnerships

@Guillaume (WMF):, @TheDJ: Re: marking up templates about GLAM partnerships -- How do you want this done? What information needs to be identifiable for them?

I presume the big carrot for GLAMs is that, if they get their templates in order, then they can get a credit on MediaViewer in some upcoming iteration of the software?

That would seem to imply that the machine-readable info needs to provide a short text string to identify the institution.

Is this correct? Does it need to be multilingual? Is a Wikidata Q-number more appropriate for internationalisation?

Thanks, Jheald (talk) 08:59, 12 September 2014 (UTC)

Jheald, I think that 'how' to do it exactly is part of what needs to be determined. But it's going to be primarily using a structure of yet-to-be-defined classes. So one class on each template to identify it as a "Donating institution" template, one class wrapping the name of the institution and one class wrapping a relevant URL for the 'donation drive' or something. Adding Q properties for the institutions might be nice, but can easily be done in a follow-up. Multilinguality is not needed here, I think; once stuff is connected to Wikidata, that will probably be easy to add afterwards. —TheDJ (talkcontribs) 20:02, 12 September 2014 (UTC)
you have institution templates, and institution fields with artwork & photograph template; and there are upload event templates like c:Template:NARA-cooperation
metadata standardization among GLAMs is a big problem. the Glamwiki mass uploader maps each institution's metadata as a part of the upload. if we could train & roll that out, we could standardize on the commons template & wikidata. (apart from APIs). Slowking4 (talk) 13:40, 22 September 2014 (UTC)
@Slowking4 and TheDJ:. You're right. It's important to distinguish what we're talking about.
In particular, which templates we're talking about, and how they are to be marked up.
  1. The institution templates, ie templates of the form c:Institution:British Library, should be little problem, as these all route through a common infrastructure, so it would be very easy to add tags to them. Incidentally, see d:Template:Institution/wrapper/test for current progress towards drawing the fields in this template directly from Wikidata.
  2. These usually fill the "current location" field of the c:Template:Artwork template, so usually that field will be taken care of by taking care of the relevant institution template. It would be interesting to know how many pages using the Artwork template have this field set by hand, without using the Institution template. My guess is probably quite few, but it would be good to know.
  3. Finally there are the source credit templates, such as c:Template:British Library image or c:Template:Mechanical Curator image. These can be more expansive, and may link to a number of different catalogues at the source institution. For example in this file of St Cuthbert, the template has back-links to four separate catalogues/viewers at the BL; plus it could also have linked back to the image at the BL Images Online site, but that link was already covered in the source text above. It's not clear how in general to handle these, in particular because some of the different back-links can relate to different types of thing -- so some link-backs relate to the specific scan; some to the specific folio or object; and some to the manuscript or book; sometimes all for the same file. (And, at the BL site at least, there may be no cross-references between these different targets at all). In all I think there are something like ten different catalogues at the BL that the template can link back to, sometimes overlapping and sometimes not, and more could probably be added. (IIRC the BL has something well over 50 legacy catalogues, with no budget or current project to merge or integrate them any time soon). So whilst it would be good to try to better standardise source-credit templates, it needs some thought as to how to cope best with the range of bespoke back-links and descriptions that each institution may require. Also note that in the St Cuthbert image above, there is further valuable source information in the free text before the source-credit template, so by itself even the template doesn't give the full institutional information. Jheald (talk) 19:16, 28 September 2014 (UTC)
(OT) One other thing to add -- note that there are also examples where the Artwork template field has been used with the location field purposely not set -- for example, for engravings like c:File:MA(1829) p.340 - Fleurs - John Preston Neale.jpg taken from books from the BL Mechanical Curator collection, because (i) it was thought 'location' could be confusing with the location of the item depicted (even more confusing for engravings of sculptures that have their own particular locations), and (ii) there are many libraries that hold the edition of the book from which the engraving was taken, so (in the context of the Artist template, at least) it was thought it would be misleading to assert that this was an image whose location should be particularly asserted with the BL.
A future structure on Wikidata may make it clearer to say which properties are associated with the particular scan-set, which with the edition or work, and which the piece of art that may be being depicted, as typically these will each have different Wikidata items, that the CommonsData item will reference -- see some of the recent threads at d:Wikidata talk:WikiProject Visual arts for attempts to think about this, for example d:Wikidata talk:WikiProject Visual arts#Engravings_reproduced_more_than_once_.28should_CommonsData_property_.22Underlying_work.22_accept_multiple_targets.3F.29.
However, some of this will rest on whether such a multiplication of items on main Wikidata really is the way the Wikidata community decides to go. Jheald (talk) 19:16, 28 September 2014 (UTC)
  • BTW: Some partnership templates can be found at c:Commons:Partnership templates, and some partnerships at Commons:Partnerships. But it looks like a lot of content institutions aren't there -- for example Wellcome Foundation, New York Public Library, British Library and who knows how many more. And it's not clear how many of these other institutions may have more complicated templates, including item-specific linkbacks. Jheald (talk) 15:41, 8 October 2014 (UTC)
More diverse templates can be found in c:Category:Source templates and its various subcategories; though there may well be further source templates that have evaded categorisation. Jheald (talk) 15:53, 8 October 2014 (UTC)

Licensing modelling on Wikidata

@Guillaume (WMF): If you are up to the task, it should be possible to model the licensing information on Wikidata. For instance: <CC-BY-SA 3.0> allows <reuse>, <CC-BY-SA 3.0> requires <attribution>, etc. It would need some more properties but it would be machine readable on Wikidata, and multilingual.--Micru (talk) 12:13, 12 September 2014 (UTC)
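As a rough illustration of the statement model suggested above, here is a minimal Python sketch. It is not actual Wikidata modelling: the property names `allows` and `requires` and the helper `statements_about` are hypothetical, and only the two example statements are taken from the comment.

```python
from collections import defaultdict

# (subject, property, value) statements; these two are the examples from the comment.
# The property names are hypothetical stand-ins for Wikidata properties.
STATEMENTS = [
    ("CC-BY-SA 3.0", "allows", "reuse"),
    ("CC-BY-SA 3.0", "requires", "attribution"),
]


def statements_about(license_name, statements=STATEMENTS):
    """Group the values of every property recorded for one license."""
    grouped = defaultdict(list)
    for subject, prop, value in statements:
        if subject == license_name:
            grouped[prop].append(value)
    return dict(grouped)
```

A consumer could then ask what a license requires without parsing any template text; labels for `reuse` or `attribution` would come from the items themselves, which is where the multilingual part would live.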

That is in the plans, but I think only after the basic metadata is cleaned up on Commons first. Figuring out how to represent the information in Wikidata is... complicated :/ Lots of details here. —Luis Villa (WMF) (talk) 15:32, 12 September 2014 (UTC)
Yeah, there is a difference between what we want to build for CommonsData (Commons wikibase), and what we want to do 'right now' with the raw data. This campaign would likely make the 'wikidata' step of Commons significantly easier. —TheDJ (talkcontribs) 20:02, 12 September 2014 (UTC)

Add information templates

The Labs tool makes a wrong guess for the date when the picture was taken. Exif "File change date and time" changes whenever the picture is edited in an image processing program and saved again. Example [1]. Using exif "Date and time of data generation" or exif "Date and time of digitizing" would give a better guess.--Kogo01 (talk) 17:10, 13 September 2014 (UTC)
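The preference order suggested here could look roughly like the following Python sketch. It assumes the EXIF data has already been read into a dict keyed by the standard EXIF tag names (`DateTimeOriginal`, `DateTimeDigitized`, `DateTime`); the function name `guess_capture_date` is made up for illustration and is not how the Labs tool is actually written.

```python
from datetime import datetime

# Standard EXIF tag names, in order of preference for "when was the picture taken".
# DateTime ("File change date and time") comes last because editing software rewrites it.
PREFERRED_TAGS = ("DateTimeOriginal", "DateTimeDigitized", "DateTime")


def guess_capture_date(exif):
    """Return the best-guess capture datetime from a dict of EXIF tags, or None."""
    for tag in PREFERRED_TAGS:
        raw = exif.get(tag)
        if not raw:
            continue
        try:
            # EXIF stores timestamps as "YYYY:MM:DD HH:MM:SS".
            return datetime.strptime(raw.strip(), "%Y:%m:%d %H:%M:%S")
        except ValueError:
            continue  # malformed value, fall through to the next tag
    return None
```

Falling back to the file change date only when the other two tags are absent or unparseable keeps the current behaviour as a last resort.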

Thanks! There are a few other things to fix in the tool as well, so I'll make a list and reach out to Magnus to see if he's available to help fix them. Guillaume (WMF) (talk) 16:22, 16 September 2014 (UTC)
I like the idea of this tool, and wouldn't mind putting some work into some of the heuristics... for instance, it seems possible to guess a language template for the description. Also, in cases where the EXIF data doesn't include a camera model (or is lacking other fields that would imply a scanner vs. actual camera) and something that looks like a date appears in the description, maybe present that instead... [Example case]. There are already way too many old images on commons with the date field filled from upload dates, etc. (particularly bot transfers from enwiki). Anyway, is there a proper place to discuss this particular tool? :) --Junkyardsparkle (talk) 19:50, 22 September 2014 (UTC)

nice tool, i see it kicks out images with license inside the information template, broken templates i.e. c:File:000 Cremation shirt.jpg. perhaps notifying users, and changing custom templates [2]. perhaps a bot for systematic errors in user uploads would expedite, and we can use humans on the harder cases. Slowking4 (talk) 16:32, 29 September 2014 (UTC)

for example user:G.dallorto has a lot of images that could be handled with a bot [3]. could we have a system of bot requests to handle repetitive information templates? Slowking4 (talk) 01:11, 2 October 2014 (UTC)
VisualFileChange is your friend in such cases I think. I’m happy to take requests. :) Jean-Fred (talk) 11:51, 2 October 2014 (UTC)
that is pretty slick. i'm having difficulty parsing the java. i want to do this insert information template diff [4] to this list of files [5]. (more than 1000?) i expect you could do it faster than me. Slowking4 (talk) 01:32, 3 October 2014 (UTC)
maybe not. it does not see when uploader or subsequent editors are inconsistent. the info tool sorts by file name and the visualfilechange sorts by upload date - this is dysfunctional.
there needs to be an auto semi-auto tool to sort like text together, and insert the right template fields. until there is one you will not make a dent in the backlog. Slowking4 (talk) 15:30, 4 November 2014 (UTC)

scale?

How many files are we talking about, for the metadata cleanup drive? There is a Quarry-query:

which says: "Takes too long to run (~1h) for Quarry ; result on 2014-09-04 was 697.814 files."

  • Magnus AddInformation tool shows the 20.000th file still beginning with the letter A: File:A71-Talbruecke-Judental.jpg

Any other numbers? Approximations? --Atlasowa (talk) 10:53, 18 September 2014 (UTC)

Yeah, although the query is too big for quarry, it runs if executed directly on TooLabs. I’m happy to run it as often as needed (someone could set up a cron for daily progress)
If you have any suggestions of other number we might want to have, I’d be happy to look into it :) Jean-Fred (talk) 20:19, 18 September 2014 (UTC)
The ~700,000 estimate is in the right ballpark for Commons; we may need to add another ~100,000 for information templates with missing files. As for other wikis, I'm working on a script to get those numbers. I'll post them as soon as I get them :) Guillaume (WMF) (talk) 06:01, 19 September 2014 (UTC)
you need to also consider cleanup of the information templates that should really be photograph and artwork, (which have more fields that correspond with institution's metadata), and looking back at institution to improve metadata on commons. information template might work for individuals uploading their own work, but it does not work with archival material in institutions with detailed metadata. this will require a continuous effort similar to the teahouse. Slowking4 (talk) 13:43, 22 September 2014 (UTC)

Some messages

I sent 74 wikis a message (which I had to send anyway) with some information on this initiative. --Nemo 19:32, 18 September 2014 (UTC)

That's a fantastic initiative in its own right, Nemo. Thank you!--Erik Moeller (WMF) (talk) 04:29, 19 September 2014 (UTC)
I've said it on IRC, but thanks again for this; it's much appreciated :) Guillaume (WMF) (talk) 06:01, 19 September 2014 (UTC)

Defining the scope

Pinging Erik, Luis, DJ, Gergő:

I feel that we need to define the scope of this project a bit more. Specifically, we need to agree on which type of metadata we care about, and which wikis are concerned.

For the metadata, the prototype script (see below) currently looks for:

  • description
  • author
  • source
  • license
  • license URL

The tracking categories added by Gergő in https://gerrit.wikimedia.org/r/#/c/160580/ are the same, except for the license URL. Luis, do you have a recommendation? I'm fine with removing the URL; I just want us to have a clear definition of what counts as "missing", because if we change our definition later, our metrics and our baseline won't mean a thing :)
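To make the "what counts as missing" question concrete, here is a minimal Python sketch. It assumes each file's extracted metadata is available as a plain dict keyed by the field names listed above; the function names are hypothetical and this is not the prototype script itself.

```python
# Fields the prototype script looks for, as listed above.
REQUIRED_FIELDS = ("description", "author", "source", "license", "license URL")


def missing_fields(metadata, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in a file's metadata dict."""
    return [
        field for field in required
        if not str(metadata.get(field) or "").strip()
    ]


def is_missing_metadata(metadata, required=REQUIRED_FIELDS):
    """A file counts as "missing" if any required field is absent or blank."""
    return bool(missing_fields(metadata, required))
```

Dropping the license URL from the definition would just mean passing `required=REQUIRED_FIELDS[:4]`, which is why the list needs to be settled before the baseline run: the same file can flip between "missing" and "complete" depending on it.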

For the wikis, the scope was defined as "all Wikimedia wikis". The script currently includes some closed wikis (but not all), and doesn't include chapter wikis (but it could). Erik, do you have a preference?

Thanks :) Guillaume (WMF) (talk) 20:04, 23 September 2014 (UTC)

How do you identify images/licenses which should have a license URL but don't? Many licenses (well, license-like things which have a copyright tag template) don't really have a URL: PD-*, fair use etc. --Tgr (WMF) (talk) 20:14, 23 September 2014 (UTC)
Tgr (WMF): One solution for fair use is to use the file description page as URL; since it has a lot of information on the copyright status, like the fair use rationale, it kinda makes sense. It's what DJ and I did on Template:Non-free use rationale and Template:Non-free use rationale 2. I guess we could do the same for PD-*. Are there cases where this would be an issue? Guillaume (WMF) (talk) 11:49, 24 September 2014 (UTC)
I don't see any issue with that. --Tgr (WMF) (talk) 14:37, 24 September 2014 (UTC)
per the above, i would suggest:
  • institution
  • date
Slowking4 (talk) 00:00, 24 September 2014 (UTC)
Slowking4: We're really trying to identify required information that applies to all files. While "institution" and "date" might be important for a subset of images, many files don't need them. A tracking template similar to commons:Template:Author missing might be a better way to identify and fix those missing fields. Guillaume (WMF) (talk) 11:49, 24 September 2014 (UTC)
Some offhand questions:
  • Do we have cases where people are specifying a title separate from the filename? If so, that matters for some licenses. (My general observation is that we always simply use the filename, but it wouldn't surprise me if I have missed counter-examples.)
  • How do you propose to handle the multi-license case?
  • Besides the multi-license case above, are there any scenarios where it is non-trivial to generate the license URL from the license name?
Luis Villa (WMF) (talk) 17:31, 24 September 2014 (UTC)
  • Luis: Regarding the title, it's theoretically possible to specify a title as part of the "Other" field in commons:Template:Credit line, but that field can also be used for a URL or an affiliation (ugh). I don't know if there are others.
  • Regarding multi-licensing, I'm not sure there's a way to identify cases where one license is machine-readable but another is not. I feel the best we'll be able to do is going through the interwikis of license templates and see if any of them don't output machine-readable data.
  • As for generating the license URL automatically, it might be tricky to do it from the license name, however it should be easier to have URLs for Wikidata items for each license.
HTH, Guillaume (WMF) (talk) 19:03, 24 September 2014 (UTC)
  • Title: Might be worth experimenting with trying to parse titles out of the Other field, or at least doing a quick sampling to see how often (if at all) it comes up and is different from the file name.
  • Multi-licensing: I meant more the case where the machine-readable license is something like CC-by-all, which is not really one license but rather many. I know how mediaviewer will handle this, but wasn't clear how this project wants to treat that.
  • URLs: I'm just trying to avoid redundant data, and it seems to me that for most of the scenarios I can think of the URL and the name are redundant - the URL is always derivable from the name. But possibly I'm not being creative enough about the broad variety of license statements we might have :)
Luis Villa (WMF) (talk) 19:25, 24 September 2014 (UTC)
License URLs can be language-dependent so more precisely you should be able to generate them from a license identifier + language pair. (This should be a challenge for using Wikidata to store license info since Wikidata does not have a multilingual URL type AFAIK.) --Tgr (WMF) (talk) 20:37, 24 September 2014 (UTC)
License URLs should not generally be language dependent. The license is the license.
Or to put it another way: CC BY-SA 2.0 UK is a different license from BY-SA 2.0 Spain, and CC BY-SA 2.5 Spain is different from CC BY-SA 2.5 Argentina. You can't simply say "ah, this person is a Spanish speaker, I will show them the Spanish license".
This will change in CC 4.0; CC BY-SA 4.0 English and 4.0 Spanish are translations of each other and you can show them interchangeably to the user based on their language preference.
Deeds are a somewhat different story, and of course our default links for CC point to the deeds rather than the license itself. I'm not sure how best to handle that. —Luis Villa (WMF) (talk) 22:04, 24 September 2014 (UTC)
The pages you are linking to are language dependent. Compare https://creativecommons.org/licenses/by-sa/2.0/uk/deed and https://creativecommons.org/licenses/by-sa/2.0/uk/deed.de . The legalcodes are not (pre-4.0), but all of the existing license templates use the deed as a machine-readable URL, not the legalcode. Also, not all licenses have the deed-legalcode distinction - GFDL for example has a single page which serves as both, and has unofficial translations of that. --Tgr (WMF) (talk) 08:00, 25 September 2014 (UTC)
It is for CC to handle redirecting of languages on their side for deeds (which I seem to recall they try, or at least are trying, to do).
For GFDL, the key word is unofficial - we should basically never be linking to the translations directly. —Luis Villa (WMF) (talk) 14:12, 25 September 2014 (UTC)
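For reference, the "license identifier + language pair" idea for deed links can be sketched in Python as below. The URL pattern is inferred only from the two deed URLs quoted above and has not been checked against every CC license; the function name and parameters are hypothetical.

```python
def cc_deed_url(license_code, version, jurisdiction=None, language=None):
    """Build a Creative Commons deed URL from identifier parts plus an optional language.

    The pattern is an assumption based on the two example URLs in this thread.
    """
    parts = ["https://creativecommons.org/licenses", license_code, version]
    if jurisdiction:
        parts.append(jurisdiction)
    # A language suffix selects a translated deed; without it the default deed is used.
    parts.append(f"deed.{language}" if language else "deed")
    return "/".join(parts)
```

For example, `cc_deed_url("by-sa", "2.0", "uk", language="de")` reproduces the German deed URL quoted above. Note that the jurisdiction is part of the license identifier, while the language only changes which deed page is shown; that is the distinction being argued in this thread.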

Proposal: use licensetpl_link_req to determine if Licence URL is needed

I agree with Luis Villa that giving a licence URL is trivial and only needed where there are free licenses. Following that, I would like to propose that licensetpl_link_req be used to identify which templates should have a licence URL. In detail, that would work by checking whether licensetpl_link_req is false: if so, the licence URL is not needed; otherwise the URL should be given.--Snaevar (talk) 10:34, 11 October 2014 (UTC)
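A minimal Python sketch of this rule, assuming the machine-readable markers of a license template have already been extracted into a dict keyed by marker name. That dict representation and the function names are assumptions for illustration, not an existing API.

```python
def needs_license_url(template_fields):
    """Apply the proposed rule: a URL is expected unless licensetpl_link_req is false."""
    value = str(template_fields.get("licensetpl_link_req", "")).strip().lower()
    return value != "false"


def url_problem(template_fields):
    """True when a URL is expected but the licensetpl_link marker is empty or absent."""
    has_url = bool(str(template_fields.get("licensetpl_link", "")).strip())
    return needs_license_url(template_fields) and not has_url
```

With this reading, a template that does not set licensetpl_link_req at all is treated as needing a URL, which matches "otherwise the url should be given" but may be stricter than intended for PD-* or fair use tags.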

Prototype dashboard

Over the last few days, I've been working on a dashboard to track files missing machine-readable metadata across Wikimedia sites. You can see the prototype at https://tools.wmflabs.org/mrmetadata/ .

Caveats:

  • This is a prototype; it has broken links and probably many bugs.
  • I just got a Tool Labs account so I'm setting up the tool there; the data you currently see was generated by using the script on my personal laptop.
  • The data on the prototype is static: it isn't continuously updated yet. Also, it only includes a few wikis, because I interrupted the script.
  • The data doesn't include Commons yet, because the process for Commons is different from the one used to go around wikis with local uploads.

The goal of this tool is twofold:

  • to track how many file pages are missing machine-readable metadata, and measure our progress over time by using the initial run as a baseline;
  • to provide a point of entry for each wiki, i.e. if someone wants to focus on files from Wikinews in Arabic, they have their own dashboard.

I'm still improving this prototype, and I welcome feedback as well. My first priority was to get the baseline numbers, and as soon as we agree on what actually counts as "missing metadata" (see the section above), I'll start a full run of the script and schedule it for regular runs. Guillaume (WMF) (talk) 20:04, 23 September 2014 (UTC)

Looks awesome! --Tgr (WMF) (talk) 20:20, 23 September 2014 (UTC)

Really nice dashboard! I just looked at File metadata cleanup drive: wikipedia/de
Total number of files: 166,101. Files with missing machine-readable metadata: 137,627. 18% complete.
Uh. And that is just the local files from deWP? How many files from commons are used on deWP (and how many of those lack metadata)? --Atlasowa (talk) 09:53, 2 October 2014 (UTC)
Atlasowa: Note that the dashboard is still a prototype; I'm still tweaking the metrics so the numbers may change in the future. To address your question, if you look at the current first page of files for de.wikipedia, you see that all of them have machine-readable information, but they're all missing machine-readable license. This probably means (I haven't checked yet) that the information template on de.wikipedia already emits machine-readable tags, but that the license templates don't do that yet. I'm preparing a how-to to explain how to modify the templates to emit machine-readable metadata. Once we add machine-readable tags to the license templates, they will propagate to the files and hopefully we'll make a lot of progress with very few edits :) Guillaume (WMF) (talk) 12:00, 2 October 2014 (UTC)
Hehe but where's the oldwikisource? --Liuxinyu970226 (talk) 13:22, 18 October 2014 (UTC)
It's at https://tools.wmflabs.org/mrmetadata/wikisourceorg/mul/ . Guillaume (WMF) (talk) 23:42, 4 November 2014 (UTC)

Fixing tools like Flickr2commons

Would it be within the scope of this drive to engage the maintainers of these types of tools in an effort to not produce results like this? It seems like fixing it at the source would be better than fixing it forever, maybe borrowing on the work that's already going into cleanup tools... --Junkyardsparkle (talk) 00:34, 6 October 2014 (UTC)

There is nothing in your example that needs fixing in the tool. It just happens that this image on Flickr has similar entries to our template in its free-text description. Unless you have code for a functional AI somewhere, I can't really see how we could support edge cases like that. --Magnus Manske (talk) 08:42, 7 October 2014 (UTC)
Please don't take offense, I didn't mean to sound critical. :) I'm just noticing more and more cases of archives being hosted on flickr (which is the real problem) and it seems like tools can't safely assume "own work" by the flickr account anymore. It might not require much AI to detect a case where the date field was missing, and perhaps do some things based on that. But again, the real problem is collection holders using an image hosting platform that isn't suited for that purpose at all. But solving that problem is even further out of scope. --Junkyardsparkle (talk) 09:04, 7 October 2014 (UTC)
i kinda agree, hence my comment about an option for photograph or artwork template. however, i'm seeing this as a separate issue: first introduction of information template where none exists, (big improvement, must have) and updating wizard, commonist, and flickr2commons to include better metadata input. (nice to have) the GWtoolset could be the answer. maybe some upgrades to the old tools would help too. and Magnus, the flickr tool is the best of the bunch. but, i find i'm spending more time cleaning up metadata than uploading. it's an even larger problem than getting information templates in. maybe we need some sort of bot request system to migrate from information to artwork. Slowking4 (talk) 01:12, 8 October 2014 (UTC)
Yes, I mention flickr2commons only because it appears to be a best-of-breed tool that is still being used, not sure about the others. As far as the order of things go, there's an argument to be made for not templating information without sanity checking it at the same time, since once it's in a machine-readable form, it will tend to propagate to places that won't necessarily reflect later corrections. And by "sanity check" I only mean flagging things like date/license combinations that don't make sense together, things like that, so that a human can look at it. I just feel like it's a bad thing for commons to have so many historical photos with nonsense in the author/date fields, but if that isn't really relevant to this drive, I'll shut up and go away... --Junkyardsparkle (talk) 03:13, 8 October 2014 (UTC)
@Magnus Manske: An idea: institutions are a 'limited set of users' with recurring contributions. We could just keep a list on a wiki page, make the Flickr2commons tool source that page on an hourly basis and then handle 'known flickr users' in a specialized way... —TheDJ (talkcontribs) 09:04, 8 October 2014 (UTC)

Fixing the ID based approach of {{Information}}

So, if we are doing this drive in the wider movement context, we should really make sure that the things being fixed are not made using the broken approaches of the current information template. Licensing is slightly better, but even THAT can be improved.

One of the 'advisable' ways is to use classes. One to wrap the big template, one to wrap and identify each key-value pair, one to wrap the name of the key and one to wrap the value. In theory, that would allow for maximum flexibility. Even better would be if we could add classes to indicate the data type, so you could indicate if something is a URI, label, wikitext etc.

Alternatively, we could use a TemplateData-like approach (we discussed this in Berlin this week) to define the structure of the information in our templates. My point is that yes, anything is better than nothing, but if we are really going out there asking people to 'fix their templates', it would be good if we have a story to tell them that enriches the current situation in the best way possible. —TheDJ (talkcontribs) 09:15, 8 October 2014 (UTC)
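To illustrate what the class-based wrapping described above would mean for a consumer, here is a minimal Python sketch that reads key/value pairs out of rendered HTML. The class names `md-key` and `md-value` are hypothetical placeholders, since the actual classes are yet to be defined, and the parser is only a sketch using the standard library.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "wbr"}


class FieldExtractor(HTMLParser):
    """Collect key/value pairs from elements marked with md-key / md-value classes."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._target = None  # "key" or "value" while inside a marked element
        self._depth = 0      # nesting depth inside the marked element
        self._key = ""
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._target:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if "md-key" in classes:
            self._target, self._depth, self._buffer = "key", 1, []
        elif "md-value" in classes:
            self._target, self._depth, self._buffer = "value", 1, []

    def handle_data(self, data):
        if self._target:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._target:
            return
        self._depth -= 1
        if self._depth == 0:
            text = "".join(self._buffer).strip()
            if self._target == "key":
                self._key = text
            else:
                self.fields[self._key] = text
            self._target = None
```

The point of the sketch is that the consumer never looks at element IDs or table layout: any wiki's template can use whatever markup it likes as long as the key and value are wrapped in the agreed classes.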

@TheDJ:. I'm sorry, but what does the above mean? What do you mean by "to use classes"? What are these classes? Do you just mean devising and adding appropriate HTML-like tags, or is there more to it? Could you give a concrete illustration of what you're thinking of? Jheald (talk) 10:08, 8 October 2014 (UTC)
It's a basic element of HTML. Knowledge of what a class is is sort of a basic requirement for template authors. And I don't have an example, because i haven't had time to write one yet, but I will. —TheDJ (talkcontribs) 11:01, 8 October 2014 (UTC)
@TheDJ: Thanks! I had immediately previously been looking at the work to formulate a Multimedia API, so sorry if my mind had been thinking about "class" in quite a different sense :-) Jheald (talk) 11:20, 8 October 2014 (UTC)
TheDJ: There is indeed value in "doing things right", however there is also value in not rewriting the system since we're going to do that again in a year for Structured Data. For licenses, classes sound good, but many information templates across wikis were copied from Commons, and some of them already have the machine-readable markers in td elements. Do you know why we didn't go the class route for the information template from the beginning? How much work is it going to be to use classes instead of IDs in the information template? Guillaume (WMF) (talk) 08:15, 9 October 2014 (UTC)
To close the loop on this: for now we're going to use a simplified version of the Information template from Commons. Guillaume (WMF) (talk) 18:41, 29 October 2014 (UTC)

Binding templates to their Q-items

The preparation for migrating to structured data will probably include providing a CommonsMetadata-ish interface which will work whether the data is stored in templates or in Wikibase. To do this, we need data parity, which is easy enough for data that's currently stored in the file page (such as {{Information}} parameters) but less trivial for data implicitly stored in templates.

The obvious example of templates with lots of implicit data are institution templates, but e.g. license templates include implicit data as well (some machine-readable, some not so much). The sane way to store this data would be using Wikidata (as the local Wikibase will be only used for per-file information, and using wikitext for storage should be avoided), which means we need some way of matching templates with their corresponding Q-items. That Q-item might be the item of the template (in which case the binding is trivial), but in at least some cases that's not the case - e.g. {{cc-by-sa-3.0}} should be matched to d:Q14946043 (which is the cc-by-sa-3.0 item), not d:Q5614379 (which is the linked item).

The obvious way would be to

  • create/select an appropriate property, "described by" or something similar, to store which item holds information about what being tagged with this template means for a file
  • make sure each template is linked to an item
  • point the property to the right item for each template.
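The lookup implied by these three steps could be sketched in Python as follows. The two dicts stand in for Wikidata sitelinks and for statements of the proposed "described by" property, which does not exist yet; only the two Q-numbers come from the example above, and the function name is hypothetical.

```python
# Template -> item linked to the template page itself (the trivial sitelink binding).
TEMPLATE_ITEM = {"cc-by-sa-3.0": "Q5614379"}

# Hypothetical "described by" statements: template item -> item that describes
# what being tagged with the template means for a file.
DESCRIBED_BY = {"Q5614379": "Q14946043"}


def item_for_template(template_name, template_item=TEMPLATE_ITEM, described_by=DESCRIBED_BY):
    """Resolve a template to the item holding its implicit data, or None if unlinked."""
    linked = template_item.get(template_name.lower())
    if linked is None:
        return None  # step 2 not done yet: the template has no item at all
    # Follow "described by" when present; otherwise the template's own item is the binding.
    return described_by.get(linked, linked)
```

Falling back to the template's own item covers the case mentioned above where the binding is trivial, so the property would only need to be set where the two items differ.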

@Daniel Kinzler (WMDE): what do you think? --Tgr (WMF) (talk) 17:27, 16 October 2014 (UTC)

Labs tool - Commons Template:Painting

I see the Labs tool recommends the Commons c:Template:Painting for paintings. That template is deprecated in favor of c:Template:Artwork (and its page redirects there). Can we update the tool accordingly before we jump in? - PKM (talk) 02:04, 17 October 2014 (UTC)

And separately, should we consider a bot to update the ~4,000 existing files that use c:Template:Painting to c:Template:Artwork as part of this project? It probably needs to be done sometime. - PKM (talk) 02:16, 17 October 2014 (UTC)

Need some help

I need an example of how to do the fix for a template like en:Template:Non-free album cover. First, it's a copyright template, right? There aren't any tables or divs where I could add those classes, so I'm probably lost with these things. --Stryn (talk) 20:21, 22 October 2014 (UTC)

MrMetadata

Where to find oldwikisource? --Liuxinyu970226 (talk) 01:02, 23 October 2014 (UTC)

Would it also be possible to add the wikimedia.org wikis other than meta (such as chapters, outreach etc.)? /André Costa (WMSE) (talk) 10:20, 23 October 2014 (UTC)
And of course Commons =) /André Costa (WMSE) (talk) 10:21, 23 October 2014 (UTC)
Liuxinyu970226: Done for oldwikisource. It was actually already in the list, but it wasn't running because of a bug with fr.wikisource that was blocking the following wikis in the list. I've removed fr.wikisource while I investigate, and restarted that list so oldwikisource now works (as well as de.wikiversity and wmfwiki, among others).
André Costa (WMSE): outreachwiki doesn't have local uploads. As for chapter wikis, I didn't want to include them by default, but I can add them on an opt-in basis. I'm assuming you're interested in se.wikimedia.org, so I've run it manually for today, and added it to the list, so it'll now be updated daily with the others. Let me know if you'd like other wikis to be added. Guillaume (WMF) (talk) 18:39, 29 October 2014 (UTC)
I forgot to respond about Commons! Because Commons has more than 22 million files, we can't use the same method as the one we use for everything else (it would take weeks to complete a single run). So, what I need to do is exclude all the files that we know already have machine-readable metadata (e.g. the 20 million or so files with the standard information template), and then only check the remaining 2 or so million files. Because the process is different, I need to write new code for this so it's going to take a few more days. We also have the recently added tracking categories that can be helpful to track metrics. Guillaume (WMF) (talk) 19:22, 29 October 2014 (UTC)
Many thanks for se.wikimedia. For Commons I expected something like that to be the issue. /André Costa (WMSE) (talk) 10:44, 30 October 2014 (UTC)
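For reference, the Commons approach described above (exclude files already known to carry machine-readable metadata, then check only the rest) can be sketched as a simple set difference. This is only an illustration, not the actual MrMetadata code; the function name and the sample file titles are hypothetical.

```python
def files_to_check(all_files, files_with_known_metadata):
    """Return only the files that still need a metadata check.

    all_files: iterable of file titles on the wiki.
    files_with_known_metadata: titles already known to be fine, e.g. files
    using the standard information template (assumed input).
    """
    known = set(files_with_known_metadata)  # set lookup keeps the filter fast
    return [title for title in all_files if title not in known]


# Hypothetical sample data
all_files = ["File:A.jpg", "File:B.jpg", "File:C.jpg"]
with_information_template = ["File:A.jpg", "File:C.jpg"]

print(files_to_check(all_files, with_information_template))
```

With roughly 20 million of 22 million files excluded up front, only the remaining 2 million or so would need the slower per-file check.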

Files located on Commons with a local description page[edit]

I noticed that nn:File:Brun_ris.jpg show up on the list of files to fix. But it is on Commons and the local description page is just text. Perhaps someone could exclude files like that from the list? --MGA73 (talk) 18:16, 25 October 2014 (UTC)

I don't see the file in the categories (nn:Kategori:Filer uten maskinlesbar opphavsmann, nn:Kategori:Filer uten maskinlesbar lisens, nn:Kategori:Filer uten maskinlesbar kjelde and nn:Kategori:Filer uten maskinlesbar beskriving). Where does it show up for you? --Stefan2 (talk) 18:56, 25 October 2014 (UTC)
At the moment in none of these categories. --MGA73 (talk) 19:45, 25 October 2014 (UTC)
Oh... You probably ask where I found the file :-D Look in https://tools.wmflabs.org/mrmetadata/wikipedia/nn/index.html --MGA73 (talk) 19:51, 25 October 2014 (UTC)
Done. Thank you for reporting this! It was on my to-do list but I appreciate the nudge :) I just fixed it, so the next time the script runs, it'll skip files that aren't hosted locally. Guillaume (WMF) (talk) 16:57, 29 October 2014 (UTC)
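A minimal sketch of what "skip files that aren't hosted locally" could look like, for anyone writing a similar script. It assumes each page record comes from an `action=query&prop=imageinfo` API response and carries an `imagerepository` field whose value is `local` for locally hosted files; that field name is stated here as an assumption, and this is not the actual fix made to the script. The second file title is hypothetical.

```python
def locally_hosted(pages):
    """Keep only the titles whose file is stored on the local wiki.

    pages: list of dicts, each assumed to have a "title" and an
    "imagerepository" key ("local" for local uploads, something else
    for files served from a shared repository such as Commons).
    """
    return [
        page["title"]
        for page in pages
        if page.get("imagerepository") == "local"
    ]


# Hypothetical sample records
pages = [
    {"title": "File:Brun_ris.jpg", "imagerepository": "shared"},
    {"title": "File:Example local upload.jpg", "imagerepository": "local"},
]

print(locally_hosted(pages))
```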

Problem with tool or template[edit]

As suggested on File metadata cleanup drive/How to fix metadata I copied the File metadata cleanup drive/How to fix metadata/Simple information template to no.wiki: no:Mal:Information-

As suggested I also used https://tools.wmflabs.org/add-information/no_information.php on no:File:DJ nygrav.jpg.

But no description shows up on the file page!

As far as I can tell the problem is that the template uses "|description" but the tool adds "|Description". It would be nice if someone would either fix the tool or the template so they work together. --MGA73 (talk) 18:34, 25 October 2014 (UTC)

Any reason why the template is called "Mal:Information" instead of "Mal:Informasjon"? Preferably, the parameters should also be in Norwegian. --Stefan2 (talk) 18:47, 25 October 2014 (UTC)
The reason is that the tool that can be used to add the template uses English. Also, if we copy the files to Commons, the template should be in English. If all goes well, any files without a good source or license etc. will be deleted from no-wiki, and all the good files will be copied to Commons. So in time no files will be left on no-wiki. --MGA73 (talk) 19:47, 25 October 2014 (UTC)
Done. Thank you for reporting this. I've edited the "simple information template" to include the alternate capitalization. Guillaume (WMF) (talk) 16:24, 29 October 2014 (UTC)
Thank you. Also for fixing the script mentioned in the heading above this one. --MGA73 (talk) 17:15, 4 November 2014 (UTC)

Global file deletion review[edit]

If you are interested in metadata cleanup, you might also be interested in Requests for comment/Global file deletion review. In my experience, when a file page lacks a standardized template, it may also have missing or misrepresented information that needs to be salvaged from deleted files across different sites. whym (talk) 03:14, 2 November 2014 (UTC)

Are any projects excepted?[edit]

Meta hosts a lot of bad files; see Category:Images with unknown license for example. Perhaps the problem is that Meta spent years trying to figure out whether it should have an Exemption doctrine policy or not. But it makes me wonder: are Meta and/or any other projects excepted? Or are there just not many people working on cleanup on Meta? --MGA73 (talk) 18:14, 4 November 2014 (UTC)

The problem on Meta seems to be WM:CSD, which doesn't always permit deletion of files with insufficient copyright information. The category is also partly full because people often include source and licensing information in textual format instead of using templates. --Stefan2 (talk) 16:24, 5 November 2014 (UTC)

How to handle "and future versions" cases[edit]

While going through license templates on the English Wikibooks, I came across b:Template:Cc-by-sa-all, which says that a work "is licensed under the Creative Commons Attribution ShareAlike license versions 3.0 2.5, 2.0, 1.0 and any later versions." (emphasis mine).

Does anyone have any idea about how to handle this case, since that last part isn't an actual license, and therefore doesn't have a name or a link to a license? Guillaume (WMF) (talk) 23:14, 20 November 2014 (UTC)

And the same problem happens with templates like b:Template:GFDL, which says "Version 1.2 or any later version". User:LVilla (WMF), User:Slaporte (WMF), do you have any advice? Guillaume (WMF) (talk) 23:28, 20 November 2014 (UTC)
For the GFDL, I've made this edit that reuses the existing links and language of the template. Feedback is welcome. Guillaume (WMF) (talk) 19:54, 21 November 2014 (UTC)
I would put "1.2+" in the short/long name, not that it matters a lot. The harder question is, how we are going to handle these once license templates will be expected to be bound to a Wikidata item? --Tgr (WMF) (talk) 17:54, 23 November 2014 (UTC)
If we're looking for fun, there's also c:Template:Wikimedia-screenshot and its friends :) Guillaume (WMF) (talk) 00:14, 26 November 2014 (UTC)
It looks as if c:Template:Wikimedia-screenshot can be used for screenshots of any Wikimedia website, including Wikinews and Wikidata which use different licences. There is also c:Template:Wikinews screenshot which is a special version meant for Wikinews screenshots, and that template looks unnecessarily verbose to me. Why does it mention GFDL and CC-BY-SA in the first place? --Stefan2 (talk) 00:31, 26 November 2014 (UTC)
Sorry, didn't see this ping earlier. Typically, people trying to analyze licenses like this (e.g., SPDX) treat "version X" and "version X, or any later version" as two separate licenses, since when new versions are published new terms or restrictions may come into play. You then actually link to the latest version, but track them separately in the DB. Make sense? —Luis Villa (WMF) (talk) 21:44, 12 December 2014 (UTC)
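To make the suggestion above concrete, here is a small sketch of tracking "version X" and "version X or any later version" as two separate entries while linking to the latest version for the "or later" case. The `LicenseRef` class and `link_version` helper are hypothetical names invented for this illustration; the `-only` / `-or-later` suffixes follow the style of SPDX identifiers, and nothing here reflects how the license templates or any Wikimedia database actually store this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseRef:
    family: str     # e.g. "GFDL"
    version: str    # e.g. "1.2"
    or_later: bool  # True for "or any later version"

    @property
    def key(self):
        """Distinct tracking key, so the two variants never collapse into one."""
        suffix = "or-later" if self.or_later else "only"
        return f"{self.family}-{self.version}-{suffix}"


def link_version(ref, published_versions):
    """Pick the version to link to: the latest published one for
    'or later' references, otherwise exactly the stated version."""
    if not ref.or_later:
        return ref.version
    return max(
        published_versions,
        key=lambda v: tuple(int(part) for part in v.split(".")),
    )


gfdl_versions = ["1.1", "1.2", "1.3"]
fixed = LicenseRef("GFDL", "1.2", or_later=False)
later = LicenseRef("GFDL", "1.2", or_later=True)

print(fixed.key, link_version(fixed, gfdl_versions))
print(later.key, link_version(later, gfdl_versions))
```

Keeping the key separate from the link target means a newly published version changes where the "or later" entry links without merging it into the fixed-version entry.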

"Presumed Public domain"[edit]

Another weird case is files tagged as being "Presumed in the Public domain", like de:Datei:022 Amm.jpg. How should we tag these files? Guillaume (WMF) (talk) 23:52, 20 November 2014 (UTC)

That's an odd tag. It'd be nice to go through and actually make determinations one way or the other. Assuming that's not possible, can we do a combination of "PD" and "warning, check the licensing"? That seems to be closest to what is intended here. —Luis Villa (WMF) (talk) 21:38, 12 December 2014 (UTC)
We don't have markers for such warnings yet, so I've dodged the issue by reusing the template's current wording. The proper modeling of those usage terms can be done when we move to a proper structured data system. Guillaume (WMF) (talk) 20:50, 17 December 2014 (UTC)

Top templates to fix[edit]

@Guillaume (WMF): I created a shell one-liner from hell to get the top templates used in the MrMetadata "failing files" for a project/language. Returns some Information/style false positives as well, but fixing a few of the license templates from that list could "fix" a lot of files. --Magnus Manske (talk) 11:26, 21 November 2014 (UTC)

  • On English Wikipedia, there are numerous problems with fair use files, and plenty of fair use files therefore end up in the problem categories. There is some discussion about this problem at w:WT:NFC#Help with file metadata cleanup. One problem is that the information part isn't always machine-readable. --Stefan2 (talk) 13:50, 21 November 2014 (UTC)

@Magnus Manske: Thank you! I'd been meaning to add those for a while, and I'm really glad you beat me to it :) I'll see if it's worth scripting to have it run for all wikis. Guillaume (WMF) (talk) 17:46, 21 November 2014 (UTC)
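For anyone who wants to reproduce the idea without the shell one-liner, the following is a rough Python sketch of counting which templates appear most often on the failing files. It is not Magnus's one-liner; it assumes you already have the wikitext of each failing file page as a string, and the regular expression is a simplification that will also pick up things like `int:filedesc` as false positives.

```python
import re
from collections import Counter

# Matches the name part of "{{Name}}" or "{{Name|...": text after "{{"
# up to the first "|" or "}}" (simplified; ignores parser functions).
TEMPLATE_RE = re.compile(r"\{\{\s*([^|{}\n]+?)\s*(?:\||\}\})")


def top_templates(wikitexts, limit=10):
    """Count, per template name, how many pages use it at least once."""
    counts = Counter()
    for text in wikitexts:
        names = {match.group(1).strip() for match in TEMPLATE_RE.finditer(text)}
        counts.update(names)  # a set, so each page counts once per template
    return counts.most_common(limit)


# Hypothetical sample pages
pages = [
    "{{Information|description=x}}\n{{Cc-by-sa-all}}",
    "{{Cc-by-sa-all}}",
    "plain text description without templates",
]

print(top_templates(pages))
```

Fixing the license templates at the top of such a list is what would "fix" the largest number of files in one go.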

AddInformation love[edit]

Last week I cleaned up over 200 files on Commons using the wonderful NoInformation tool developed by Magnus Manske.

It's a great tool. If you enjoy cleanup projects that are drama-free and let you view a lot of really cool images, I highly recommend it.

It occurred to me during the cleanup that with some bug fixes, the addinformation gadget that NoInformation uses could be run by a bot to clean up the files. These files could be placed in a temporary category for review to make sure the information template is added correctly and that there is no data loss. With ~700,000 files on Commons needing the template, which is in turn half of the sum total of all Wikimedia files lacking machine-readable data, I think this is a reasonable goal.

This would need two things:

  1. Bug fixing for addinformation: it currently mangles == {{int:filedesc}} == and == {{int:license-header}} ==[6]. It also does not play well with custom templates[7], location templates[8], custom licenses[9] and old upload logs[10], etc.
  2. Bot approval: who owns the bot? It's complicated.

If we attempt to correct this by hand, even with the bug fixes I mentioned in place, it would take roughly 3,000 Commons editors a day of solid work to clean up 233 files apiece and check for quality. This isn't scalable, but we can find the solution :) Keegan (talk) 08:59, 22 November 2014 (UTC)

Keegan: Thank you for your help! The code-mangling issue seems to have been fixed, thanks to Mark and Magnus Manske. For the other issues, the best thing would probably be to report dedicated issues, although we might fix those by bot more easily.
For the bot discussion, see c:Commons:Bots/Work requests#Adding the Information template to files that don't have it. Guillaume (WMF) (talk) 17:33, 4 December 2014 (UTC)

Authors with a lot of images without information template[edit]

Is it possible to get a list of the authors with many images without an information template (and of the images involved)? Those large uploaders tend to have some standard way of formatting their description pages, and a lot of images are involved. These therefore seem to me like the parts that are suitable to work on semi-automatically with bots. Mvg, Basvb (talk) 21:35, 22 November 2014 (UTC)

(partly answered on Commons; copying here for reference) The list of authors is now available and updated daily; I should be able to get a list of images but it might take a little more time. Guillaume (WMF) (talk) 01:26, 17 December 2014 (UTC)

Machine-readable data on Open Government License[edit]

I have added machine readable data to c:Template:OGL. Could someone please review the correctness of this? Thanks!

(By the way, it appears there are already 3 versions of OGL, but our template has a generic name, although it points to v1 license. We might want to look into this.)

Jean-Fred (talk) 15:58, 16 December 2014 (UTC)

Jean-Fred: I fixed it with this edit. The classes in the table's header were overwriting each other, so I moved them all further down. It works fine now :) Regarding the different versions, this might be more of a discussion for Commons. Guillaume (WMF) (talk) 00:00, 17 December 2014 (UTC)
Thanks Guillaume! And hop, 15K files now with MR-license \o/ Jean-Fred (talk) 00:52, 17 December 2014 (UTC)

Wrong licence[edit]

Some projects have a tag telling that the copyright information likely is wrong, see d:Q11118887. This template does not seem to have any computer-readable copyright tags. Should it have this? I assume so, since the template provides information about the other copyright templates... --Stefan2 (talk) 16:55, 18 December 2014 (UTC)

instructions not clear, check my work[edit]

As a first attempt, I just mangled the two below articles, someone pls look at this and fix what I did wrong (make it a good example)... I'm especially concerned about 1) using the correct template 2) not having filled in all the fields, 3) not having deleted enough redundant information between the old untagged data and the new template. Once I think it's right, I'll start poking away at other untemplated stuff. It's been a while since I've helped with maintenance like this.

I see that both of these are now tagged as missing an author. The first one, I haven't a guess. The second one I suppose technically has two authors -- I assume I could pull the original author from the article, but there is also the poster who transcribed it. --Ssd (talk)

Hello Ssd, and thank you for taking this on! I'm not very familiar with the English Wikipedia's fair use templates, but here's my take: That looks like the correct template, and you may want to add something in the "Replaceability" field about the fact that there is no free equivalent of this video, so the image cannot be replaced by a free image. If you want to copy the item about ABC, that may go into the "Other information" field. Once that is done, you should be able to remove the corresponding section below. You'll probably want to add another copy of that template for each article (with its rationale) where the image is used. This is based on my (admittedly limited) understanding of the documentation at w:Template:Non-free use rationale. You may want to double-check on the template's talk page.
Regarding the "missing author" category, I'm not sure there's much more you can do. In many fair use cases, the author is difficult to establish. That category was made to help with maintenance, but shouldn't necessarily be seen as indicative of a specific problem to solve on every file where it appears.
Hope that helps :) Guillaume (WMF) (talk) 16:52, 5 January 2015 (UTC)

Help needed for de:Vorlage:Musik-Zitat[edit]

In de-WP, we are unsure how to handle de:Vorlage:Musik-Zitat that is responsible for the vast majority of the files in de:Kategorie:Datei:Keine maschinenlesbare Lizenz. See the bottom of this discussion. --Leyo (talk) 00:32, 7 January 2015 (UTC)

I've replied there. Let me know if you need anything else! Guillaume (WMF) (talk) 20:38, 7 January 2015 (UTC)
de:Template:Musik-Zitat seems to be a fair use copyright tag for music, so I assume that the files simply should be tagged as fair use (and additionally with de:Template:DÜP as there does not seem to exist any exemption doctrine policy covering the files). It seems that many of the files also could be converted to text, see w:Help:Score. --Stefan2 (talk) 22:57, 7 January 2015 (UTC)
I'm fine with the solution by Guillaume (WMF), thank you. --Leyo (talk) 02:48, 8 January 2015 (UTC)

Multiple copyright tags[edit]

Some files have multiple copyright tags. For example, File:Hammer Museum Westwood June 2012.jpg by User:King of Hearts has three tags: two telling that the photograph is licensed under GFDL and CC-BY-SA and one telling that the depicted buildings are in the public domain per c:Template:PD-US-architecture. Mediawiki misinterprets this information and claims that the entire picture is in the public domain.

From api.php:

<LicenseShortName value="Public domain" source="commons-desc-page" hidden=""/>
<UsageTerms value="Public domain" source="commons-desc-page" hidden=""/>
<AttributionRequired value="false" source="commons-desc-page" hidden=""/>
<Copyrighted value="False" source="commons-desc-page" hidden=""/>
<License value="pd" source="commons-templates" hidden=""/>

The Media Viewer also gets things wrong by claiming that the file is in the public domain. This seems to be a big problem as it is fairly common to have multiple copyright tags for different portions of a file, in particular when someone has taken a photograph of an object. --Stefan2 (talk) 16:41, 22 January 2015 (UTC)

phabricator:T77108, unless I misremember. /André Costa (WMSE) (talk) 12:46, 23 January 2015 (UTC)
We first need to create a machine readable marker that can indicate that a license is referring to the embedded work. Such a marker currently doesn't exist. Worse, the template in which we could embed such a marker doesn't really exist either.... Once we have such a marker, we can simply ignore these 'licenses', as they are only relevant to our internal verification procedure, they are not really required for re users. —TheDJ (talkcontribs) 12:11, 29 March 2015 (UTC)
So my idea is that we add a new marker, the class "work-partial". Anything (licenses, information templates etc) inside this .work-partial would be grouped separately (and initially ignored by CommonsMetaData). —TheDJ (talkcontribs) 12:36, 29 March 2015 (UTC)
Commons has c:Template:3-D in PD, but it is usually not used. Instead, a short statement is usually used which tells what the other copyright tag refers to. c:Template:PD-US-architecture and "freedom of panorama" templates (for example, c:Template:FoP-US) always imply 'work-partial' since you can't upload architecture to Commons but only pictures of architecture, but it is also common to use standard copyright tags such as c:Template:PD-old-100 to refer to the copyright status of included works.
On English Wikipedia, files may use w:Template:Photo of art, which takes both a free licence chosen by the photographer (e.g. CC-BY-SA) and a fair use copyright tag referring to the included item. In this situation, it is more important to provide machine-readable metadata about both copyright tags: you must comply with fair use requirements in order to use the included item, but you must also comply with CC-BY-SA requirements in order not to violate the photographer's copyright. --Stefan2 (talk) 12:45, 29 March 2015 (UTC)
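The "work-partial" idea in this thread can be sketched as follows: while scanning the rendered file page, license markers nested inside an element with class `work-partial` are grouped separately from those that apply to the whole file. Note the assumptions: `work-partial` is only the marker proposed above and does not exist yet, `licensetpl_short` is assumed here to be the class wrapping the short license name, and this is an illustration rather than how CommonsMetadata is implemented.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LicenseCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_elements = []  # (tag, inside_work_partial, is_license_name)
        self.whole_work = []     # licenses applying to the file as a whole
        self.partial_work = []   # licenses applying only to an embedded work

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        parent_partial = self.open_elements[-1][1] if self.open_elements else False
        in_partial = parent_partial or "work-partial" in classes
        self.open_elements.append((tag, in_partial, "licensetpl_short" in classes))

    def handle_endtag(self, tag):
        # Close the nearest matching open element (and anything left open inside it).
        for i in range(len(self.open_elements) - 1, -1, -1):
            if self.open_elements[i][0] == tag:
                del self.open_elements[i:]
                break

    def handle_data(self, data):
        if self.open_elements and self.open_elements[-1][2] and data.strip():
            in_partial = self.open_elements[-1][1]
            target = self.partial_work if in_partial else self.whole_work
            target.append(data.strip())


def split_licenses(html):
    """Return (licenses for the whole file, licenses for embedded works)."""
    collector = LicenseCollector()
    collector.feed(html)
    return collector.whole_work, collector.partial_work


# Hypothetical rendered markup for a photo of a public-domain building
sample = """
<div class="licensetpl"><span class="licensetpl_short">CC BY-SA 3.0</span></div>
<div class="work-partial">
  <div class="licensetpl"><span class="licensetpl_short">Public domain</span></div>
</div>
"""

print(split_licenses(sample))
```

Keeping the partial group instead of discarding it would also cover the w:Template:Photo of art case, where both copyright tags matter to re-users.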

zhwiki page seems down[edit]

@Guillaume (WMF): [11] - Still "Last updated on: 2015-01-02", and I've fixed more and more files. --Liuxinyu970226 (talk) 13:34, 5 February 2015 (UTC)

Liuxinyu970226: Thank you for the notification. I've restarted the script for zhwiki and will try to find the cause of the problem. Guillaume (WMF) (talk) 16:09, 9 February 2015 (UTC)

Europeana/DPLA efforts[edit]

Might be interesting: https://docs.google.com/document/d/1H6TWxGARqUMxJrc2sXjaBlOsg7UkUTb27rvtS8aC5y4/edit?pli=1 --Nemo 10:56, 1 June 2015 (UTC)

Unused file status[edit]

How about also adding an "in use" column in the tables like toollabs:mrmetadata/wikipedia/tt/index.html, or alternatively producing a list of unused files which lack license? These are often good candidates for deletion, which is much easier than metadata fixing. --Nemo 15:13, 2 October 2015 (UTC)

Commons bot to inform users about broken templates[edit]

Hi all, over the last few months I have been cleaning up files that appear in various maintenance categories such as commons:category:Pages using Information template with incorrect parameter or commons:category:Language templates with no text displayed. Often an editor creates the problem by changing the file. So my idea is to have a bot that checks the changes in certain predefined maintenance categories daily. For each new file in these categories, it checks whether only one editor worked on the file within the last day. If so, it adds a message to the user page asking the editor to review their changes. This would move some of the cleanup work from the cleaner to the responsible editor. --Aschroet (talk) 13:50, 11 November 2015 (UTC)

  • Please make bot proposals for Commons on Commons instead of Meta. c:COM:VPP is probably an appropriate venue. --Stefan2 (talk) 15:05, 11 November 2015 (UTC)

You are right. Did it already. --Aschroet (talk) 07:07, 12 November 2015 (UTC)
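For the record, the selection rule proposed in this thread (notify only when a single editor touched the file within the last day) can be sketched like this. The function name, the input shapes, and the sample users are all hypothetical; fetching category members and revisions, and posting the actual message, are deliberately left out.

```python
from datetime import datetime, timedelta


def editors_to_notify(new_files, revisions, now, window=timedelta(days=1)):
    """Map each sole recent editor to the files they should review.

    new_files: titles newly added to a maintenance category.
    revisions: dict of title -> list of (username, timestamp) tuples.
    A file is reported only if exactly one distinct user edited it
    within `window` before `now`.
    """
    notify = {}
    for title in new_files:
        recent_editors = {
            user
            for user, timestamp in revisions.get(title, [])
            if now - timestamp <= window
        }
        if len(recent_editors) == 1:
            notify.setdefault(recent_editors.pop(), []).append(title)
    return notify


# Hypothetical sample data
now = datetime(2015, 11, 11, 12, 0)
revisions = {
    "File:One.jpg": [("Alice", now - timedelta(hours=2))],
    "File:Two.jpg": [("Alice", now - timedelta(hours=3)),
                     ("Bob", now - timedelta(hours=1))],
    "File:Three.jpg": [("Carol", now - timedelta(days=3))],
}

print(editors_to_notify(["File:One.jpg", "File:Two.jpg", "File:Three.jpg"],
                        revisions, now))
```

Files with several recent editors, or with no recent edits at all, are left for the usual manual cleanup.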

MrMetadata is down?[edit]

@Guillaume (WMF): MrMetadata has not updated the stats for Commons or en.wp since 2015-08-31. Most, if not all, other sites are no longer being updated as well. The last site updated appears to be zh.wp, on 2015-10-01. Is this reporting tool permanently offline, or can it be restarted? —RP88 (talk) 03:34, 15 November 2015 (UTC)