Talk:Wikilegal/Database Rights

From Meta, a Wikimedia project coordination wiki

We welcome your comments and questions! Big thanks to Lukas Mezger/Gnom for his work on early drafts, and Elaine Wallace for her help in polishing and finishing it. -LVilla (WMF) (talk) 02:43, 8 November 2013 (UTC)

Thank you for preparing this. Very useful for helping folks understand the topic. Cheers. Aude (talk) 09:13, 8 November 2013 (UTC)
Yes, thank you. People have expressed widely differing views on this subject, so hopefully this will make it easier for us to reach consensus. --Avenue (talk) 12:40, 9 November 2013 (UTC)
Thank you from me too. I have linked to this on the Wikidata:Project_chat page. Filceolaire (talk) 22:50, 11 November 2013 (UTC)

Other systematic extracts

I have a question arising from the last sentence on the page, i.e. "For EU databases, bots or other automated ways of extracting data should also be avoided because of the Directive’s prohibition on “repeated and systematic extraction” of even insubstantial amounts of data." There are many systematic editing efforts that do not involve automation, such as editors working their way through a shared checklist. Should extracting data from EU databases in such ways also be avoided? --Avenue (talk) 11:17, 10 November 2013 (UTC)

The Database Directive defines extraction as "transfer ... to another medium ... by any means" (Article 7.2(a)). In other words, it does not discriminate between automated extraction and manual extraction. -LVilla (WMF) (talk) 20:21, 11 November 2013 (UTC)
I think the conclusion's focus on bots is causing confusion. Any means (including 12 monkeys) should be the focus of the issues with using EU database contents.
Instead, the part which mentions bots/automation should make clearer that it is a warning to Wikimedians that using software to extract data puts them at personal risk of a lawsuit for using the database in a way other than intended by the database authors, i.e. any automation gives the database operator a legal basis to claim w:denial of service. John Vandenberg (talk) 03:57, 6 April 2014 (UTC)

A guideline for libraries in Germany

A good resource in German language might be

@Gnom: do you think it makes sense to have a "further references" section? Did you come across other references like this one that could be useful? I think perhaps the list of CC0 users I linked to below might also make sense in such a section. -LVilla (WMF) (talk) 18:12, 13 December 2013 (UTC)
Hi @LVilla (WMF): Please excuse my late response. In reply to your first question: Why not? I haven't read the document @Make: recommends, but the contents look quite relevant. I can't submit any other documents to the list, however. In other news: I'm looking for a topic for my Ph.D. thesis at the moment, and a database/open data-related problem is definitely on the list of proposed topics I will present to my professor at the end of January... --Gnom (talk) 18:31, 30 December 2013 (UTC)

No way to get out of this grey situation?

Could you please give some examples of documents we can use to request agreement from database managers to extract data from their databases? This document says that we should avoid extracting and importing data without an agreement, but if we want to reach an agreement, what do we need to do, what should we ask for, and in which terms? Snipre (talk) 20:17, 12 December 2013 (UTC)

Hi, Snipre: the best thing to do is to use standard open licenses. The best one of these is CC0 (used by many organizations, not just Wikidata); the Public Domain Dedication License is also good. I will see if I should update the text to suggest that. -LVilla (WMF) (talk) 17:48, 13 December 2013 (UTC)
I see that we already basically do that. Perhaps CC, or the Wikidata team, should put together a guide on convincing data producers to use liberal data licenses. -LVilla (WMF) (talk) 18:11, 13 December 2013 (UTC)
@LVilla (WMF): Thanks for the answer. The small problem with your text is that it provides no guide or solution for avoiding problems in the future. Leaving a grey zone with the advice "don't do that" is not enough, because in one, five or ten years, through the manual imports of thousands of contributors, even without any coordination, large portions of some databases will end up in Wikidata, for example, because these databases are often the main source for certain kinds of information. In that case the rule is not "don't import data from databases" but "don't import this kind of data", and in these cases tools like Wikidata aren't good systems.
I see another problem: citing the source of data. We encourage contributors to cite sources when importing data, but in doing so we give database owners a way to track the data and to attack Wikidata because large portions of their data were imported into Wikidata (and again, as above, thousands of contributors working without any coordination produce in the end the same result as a bot importing data automatically). To avoid this, we would be better off avoiding source citations altogether, so that we could say we imported the data from sources other than their databases, even if this is not true. Snipre (talk) 08:22, 14 December 2013 (UTC)
And can you say whether licences like this one (leaving aside data coming from third parties) can be considered equivalent to CC0? Snipre (talk) 08:45, 14 December 2013 (UTC)
And for data coming from databases under a CC BY licence, is it possible to import it into a CC0 database by citing the source and providing a link to the webpage where the data is stored? Snipre (talk) 08:45, 14 December 2013 (UTC)
I think this has probably gotten a bit off-topic for this talk page - questions about citation and licenses (which are more general than just the question of database rights) are probably best for general discussion in the Wikidata community, not tucked out of the way here. Happy to discuss elsewhere if that is appropriate. -LVilla (WMF) (talk) 01:17, 18 December 2013 (UTC)

French National Assembly directory

Hi, could the authors please clarify why they are linking to as an example? I have read citation 12, which immediately follows the link, and that is a very different legal issue. I also read citations 11 and 13, and can't find any mention of the French National Assembly directory. Was this example included in some other legal decision? John Vandenberg (talk) 02:42, 6 April 2014 (UTC)

@LuisV (WMF): just a gentle ping. John Vandenberg (talk) 08:15, 8 May 2014 (UTC)
Thanks for the ping, John. I don't recall why the example was used; User:Gnom might. --Luis Villa (WMF) (talk) 23:56, 9 May 2014 (UTC)
Hi John and Luis! Actually, I put in the link to the Assemblée nationale website because I wanted to insert a nice website database example into my draft, and I was looking for something simple, accessible, and encyclopedia-like, and maybe also something pretty and a little exotic ;-) As I can see now, it looks like footnote 12 was meant to support that specific example, which it doesn't. The judgment quoted in footnote 12 is about a selection/list of 11,000 poems and it is one of the main precedents regarding databases in German law. I presume that I forgot to reposition the footnote at some point during the drafting process. Therefore, footnote 12 should be moved next to footnote 11 - I just took the liberty to make that change. I'm sorry about the confusion and I hope I made things a little clearer. --Gnom (talk) 22:01, 11 May 2014 (UTC)
Thanks, @Gnom:! --Luis Villa (WMF) (talk) 16:09, 12 May 2014 (UTC)

Copyright in data transcriptions?

@LVilla (WMF): Could you give any thought to the U.S. copyright position regarding transcriptions of historical data?

There are a number of sites (especially genealogical) which contain large-scale databases of such transcriptions.

The original handwriting of such records may not be easy to read, and is sometimes susceptible to multiple interpretations. Would the act of reading such records be considered a creative act of the kind likely to attract a new U.S. copyright? (But: does transcribing an old handwritten narrative, or the draft for a novel, create any new copyright for the transcriber?)

Is the error rate of any relevance in this? For example, if two independent transcriptions differ in only 1 record in 10,000, would that suggest that comparatively little autonomous input has been required in the transcription process? On the other hand, if the rate of differences, e.g. in the transcription of surnames, was (say) 1 in 20, would that make a difference?

If attempts have been made to clean up the database, does that also make a difference? Cf. Bridgeman v. Corel: the intention would then only be to accurately reproduce the original, and Bridgeman said the nearer a reproduction was to the original, the less strong could be any claim for copyright.

Finally, in any potential fair use analysis, what would be the taking? The whole database? Only so much of it as represented surnames that were genuinely hard to decipher? Or no taking at all, since the material was ultimately factual?

Thank you for any thoughts on this. Jheald (talk) 15:44, 11 September 2014 (UTC)

@Jheald: Some tough questions there. I'll try to get someone to look into it, but it may be a while before we can get back to you. --Luis Villa (WMF) (talk) 16:40, 11 September 2014 (UTC)

Crawling websites to obtain data points

Dear legal team,

I hope this is the right place to ask this question. If not, could someone please point me in the right direction?

There has been a lengthy discussion going on in the WikiProject:Tennis at Wikidata on using bots to obtain data points available on web sites for inclusion in Wikidata on a regular, i.e. weekly, basis. I will do my best to structure the issue and the questions; if you have any further questions, please feel free to ask by pinging me. Further, please note that I have no legal degree and all of my assumptions are based on a more or less educated guess with no professional backing.


There have been some discussions around this issue for some time at various places, please refer to the following links (I hope the short list is complete):

Please note that in both discussions the legal context as well as your database rights analysis are/were taken into consideration.

Case at hand

The world's professional tennis players and tournaments are organised in the ATP (men) and WTA (women). According to Wikipedia, the ATP is based in London/EU whereas the WTA is based in the US.

Both of these organisations publish weekly (every Monday) updated player profiles on their web sites that contain, inter alia, current world ranking, best historic world ranking, prize money, win/loss record, etc., i.e. the data points change regularly for active players. Please see two examples:

Both organisations hold a copyright on their respective Terms & Conditions:

As these data points change weekly, we would like to have some automation in obtaining the data, e.g. crawling their websites or asking for database interfaces. Before we do so, we think it would be beneficial to obtain a legal opinion on this matter, as, to our understanding, it is not covered by your high-level analysis on this site.

Legal issues from my point of view

  1. I understand that even though ATP/WTA hold copyrights on their databases, this may be true only for the database as such, i.e. even if we had access to it, we could not simply copy the database; it does not apply to the contents of the database, e.g. world ranking, prize money, etc., if the selection of these contents does not involve some kind of judgement. As ALL the male or female tennis players, respectively, are included in the database and all of these contents are merely numerical statements of some kind, I believe that these contents are not copyrighted. Do you see this the same way?
  2. Therefore, I understand that we may use such data points; however, we may be restricted from automatically obtaining such information by a bot on a regular basis. Correct? Is there a difference between the ATP (EU) and the WTA (USA)?


  1. Bot solution
    1. Would it be possible to use a bot to crawl the two web sites regularly and enter the information into Wikidata? Is there a difference between the ATP (EU) and the WTA (USA)? Basically, I don't think that this is possible, but I want to try at least ;)
    2. If this wouldn't be allowed per se, would it make a difference if ATP/WTA gave us written permission (OTRS) to do so? In my understanding, permission by ATP/WTA is just between us and them, but we also have to think of our licences and whether this permission would still allow us to publish the obtained data under a CC licence. This is basically the big question I have, because I see various interacting lines of argument here: are we able to publish data under a CC licence that we can already publish under CC if it is not obtained automatically (at least for the EU), given that we obtained permission to obtain the data automatically? Again, is there a difference between the ATP (EU) and the WTA (USA)? In addition, please refer to the most current discussion we have on the WikiProject:Tennis on Wikidata (link above) to see more thinking around this from our side.
  2. Interface solution
    1. How would the questions above be answered differently if we got some agreement with ATP/WTA to access their database via an interface?

Please consider that we will not be able to make ATP/WTA change their database rights from copyrighted to CC or the like.

Thanks for looking into this. I understand that this may take a while, but if you could come up with something in the next two weeks or so, this would be highly appreciated, as I am in the process of getting accredited for the ATP World Tour Finals and may have some discussions around the issue there. I don't want to have ATP give us access or permission to do something, only to find that our CC licensing model forbids us from actually using that access or permission.

Thank you and in case of any questions, just ping me. --Mad melone (talk) 12:46, 16 October 2014 (UTC)

Anything? --Mad melone (talk) 14:49, 28 October 2014 (UTC)
As I have not received an answer to my emails yet, perhaps someone from the legal team will see this by chance? --Mad melone (talk) 15:01, 13 November 2014 (UTC)
I have sent an email to the legal person I contacted before, to be sure they have seen this issue. Hopefully they will get back to us. Edoderoo (talk) 14:45, 14 November 2014 (UTC)

Does the UK Open Government Licence (OGL) impose conditions on downstream re-users?

Hi! Would it be possible for somebody to take a look at d:Wikidata:Project chat#OGL licence for data, where we've been trying to discuss whether the UK Open Government Licence (OGL) imposes conditions on downstream re-users or not, and therefore whether data under it can't or can be included at scale in Wikidata, which is of course intended to be available CC0.

My first thought was that the OGL seems not to contain any viral re-licensing conditions. However, on further thought, it seems to me that significant re-use of information from OGL databases that was incorporated in Wikidata would count as "re-utilisation" (in the terminology of the EU Database Directive) of the original data; and so they would be in breach of the Directive if they in turn were not compliant with the OGL licensing terms. And so in effect there is (or could be) an induced obligation, and therefore such data cannot be said to be compatible with CC0.

But I'd be grateful if anybody could drop in and say whether I had got this right. Jheald (talk) 17:05, 19 September 2017 (UTC)

Hi Jheald, I'll take a look (I co-wrote this Wikilegal piece). --Gnom (talk) Let's make Wikipedia green! 20:20, 19 September 2017 (UTC)


"From a legal perspective, a database is any organized collection of materials - hard copy or electronic - that permits a user to search for and access individual pieces of information contained within the materials" - is a tree-type graph on Wikidata, or Commons's category tree, an "organized collection of materials"? --Fractaler (talk) 11:51, 23 February 2018 (UTC)

New EC report

There's also a section about originality threshold. CC0 is mentioned various times. --Nemo 10:05, 10 May 2018 (UTC)