Talk:Wikilegal/Lexicographical Data

From Meta, a Wikimedia project coordination wiki

Some comments

Hi @Jrogers (WMF):,

Thank you for this interesting work. I am not a lawyer but a lexicographer: I have published a dictionary and I contribute to French Wiktionary. English is not my mother tongue, so you're welcome to ask me for clarification. Well, I have four main concerns.

  1. Pronunciations are not created? Well, that is tricky. What we call pronunciation in a dictionary is a written description of a sound sequence. Finding the best way to describe a sequence of sounds is not trivial. Usually, for a first description of an unknown language, it will take at least two months. A sound can be described with more or less granularity, including some prosody or subtle traits that one may not notice or that are not relevant for distinguishing words. There is also variation in the sounds, so representations are approximations of a given standard and may not represent any real occurrence of the sound. Finally, there can be bias in the choice of the norm: the lexicographer may be influenced by the usage of a small part of the community, or may choose a supposedly better way of saying some words (e.g. because it sounds more Latin, or because it is a loanword and has to be pronounced more like the source language). In total, I think there is dedicated work and there are personal decisions behind the way a sound is represented in a dictionary, and pronunciation may be considered copyrightable.
  2. Grammatical information. It could be interesting to define more precisely what this means. Is it how grammatical words are defined in a dictionary? Is it how verb categories are indicated? Well, it is not easy to describe grammatical information. It depends a lot on the analysis and on how it is popularized. Do you think it could be interesting to distinguish path verbs and motion verbs because they have a specific argument structure? Well, maybe, depending on your target readers. In some dictionaries, you can find far more detailed descriptions of verbs than in a basic one. Meanwhile, it is not a fact that a given noun is countable, or belongs to a masculine/feminine/neuter category (when there is a gender distinction). It depends on a prior collection of data and analysis. It changes over time. Sometimes grammatical information can be altered by an ideology, as a dictionary is never totally neutral and can be used as a way to give an orientation to a language (e.g. for French, feminine attributes of social gender are associated with the arbitrary feminine grammatical category, and some dictionaries have described nouns as feminine despite a different use in society). As grammatical information is not factual data, could it be subject to copyright?
  3. You did not mention etymology, but I imagine etymologies are copyrightable in the same way as descriptions, because the phrasing is particular. Is there something specific here?
  4. Is there a specific problem with pictures associated with meanings? Could the specific choice of associating a figure with a meaning be copyrightable in one way or another? I think it is a creative choice to pick, for example, the perfect picture to illustrate prout (the sound of a fart in French). So, do you think it could be copyrightable?
    Again, I am very pleased to read your work! I hope you will spend more time expanding your analysis to every aspect of lexicography. Noé (talk) 10:00, 18 February 2018 (UTC)
  5. A new question suggested by Lyokoï: Is it possible to mention on every page of Wiktionary whether a word is present or absent in a given dictionary? Until now, we considered that such work may be protected by database laws, but if it is not, it would be great to know, to help us track when a word appears in a dictionary and when it does not. Noé (talk) 10:14, 19 February 2018 (UTC)
    • Just a comment on this 5th question. The list of words (nomenclature) of a dictionary should normally be considered as copyrighted, because it's the result of a huge selection work, and because customers may buy a dictionary only to check the presence or not of words (arbitration of a discussion, or dictionary used as a referee for word games). If such a list is copied, it's unfair competition, because there will be fewer customers for the copied dictionary. Lmaltier (talk) 20:46, 19 February 2018 (UTC)
      Not sure this is relevant as, for several languages, Wiktionaries are the largest compilation of terms. Thus most or all other dictionaries are subclasses of Wiktionary. By this argument all dictionaries are 'copying' Wiktionary, using varying selection criteria. - Amgine/meta wikt wnews blog wmf-blog goog news 00:45, 20 February 2018 (UTC)
      No, of course, they don't copy Wiktionary. I was referring to a precise nomenclature resulting from a selection. Please read my explanation again. Lmaltier (talk) 18:32, 20 February 2018 (UTC)
      Yes, I think I understood you did not mean they copy Wiktionary. However, the argument regarding the precise nomenclature copyright still could be made - a 3rd party dictionary is using a subset of Wiktionary which could (imo spuriously) be construed as violating Wiktionary's copyright. But I suspect Noé is discussing using referents such as wikt:en:Template:R:TLFi in entries; they indicate the term appears in a specific dictionary (w:fr:Trésor de la langue française informatisé) and could be used to deductively disclose such a precise nomenclature (or to create it, if a category were added to the template.) - Amgine/meta wikt wnews blog wmf-blog goog news 20:00, 20 February 2018 (UTC)
      Was I unclear? I mentioned a precise, complete nomenclature. A subset of a nomenclature is not this nomenclature, not at all. Anyway, I think my concern is not relevant to Wiktionary protection (Wiktionaries don't perform any selection work at all, and their nomenclature is not fixed; it changes every day). But this concern is important nonetheless, because Wiktionary should not violate copyright laws. What you explain about TLFi is exactly what I meant. Lmaltier (talk) 20:21, 20 February 2018 (UTC)
      My personal opinion is that such a deductive disclosure - which is less accessible than that published on CNRTL - would not constitute an infringement. But I am not a lawyer, and would suggest consulting a copyright expert for France to answer that precise question. - Amgine/meta wikt wnews blog wmf-blog goog news 21:32, 20 February 2018 (UTC)

@Noé: Thank you for the questions! First of all, I should say that it's difficult to answer for every country in the world. While copyright laws are mostly harmonized through the Berne Convention, it is still possible for the courts in different countries to disagree about where the minimum line of creativity lies for a work to qualify for copyright.

  • That said, as best I understand the law in the U.S. and as is likely similar in most other countries, I think that pronunciation information would not qualify for copyright. I do not disagree with you that there may be considerable work that goes into determining it. But it is almost certainly factual information under the legal definition of the term "factual." This is similar to historical or archeological writing in which it may take a considerable amount of work and investment for someone to determine historical information about a past event, but once that information is written, it cannot be copyrighted and becomes generally available to the public (although the exact manner of phrasing a historical book may still be copyrighted). The brevity of pronunciations in combination with this factual nature makes me believe it is very unlikely they could be copyrighted even in situations where significant effort is expended in creating them.
  • For grammatical information, we were primarily contemplating information such as whether a word is a verb, an adjective and so forth, and information about its conjugations or similar in languages where that is relevant. This information would, again, fall under the "factual" definition as that term is understood in copyright law and therefore would not be copyrighted.
  • Etymology, I think, would likely be copyrighted in many cases, as it consists of longer sentences describing the origin of the word and allows for greater author creativity. Specific words may not have a copyrighted etymology, however, similar to the fixed expression issue in definitions. For example, if one were referring to specific Greek mythological terms (e.g., between Scylla and Charybdis), the etymology of that may not be copyrighted because every dictionary in the world would refer to the same usage in the Odyssey.
  • Pictures would typically be treated separately. If the picture itself is not copyrighted, it could be copied freely and if it is, it could not be copied even if the written entry could be.
  • Regarding word lists and the presence or absence of words, I think it's very likely that variance in international law is especially high. Database rights themselves only exist in some countries and not others, which makes the issue more difficult to determine. In the U.S., it is very likely that indicating the presence or absence of certain words would not be copyrighted under the same factual information issue already discussed. However, this might vary in European countries with stronger database rights, which can extend even to works that do not meet the threshold of creativity for copyright. Similarly, a word list by itself is probably not copyrighted in the U.S. I agree that there is a level of choice as to which words are on the list, but the organization of those words is almost certainly alphabetical order, and that choice about what words to include would be unlikely to pass the threshold of creativity as explained in the Feist case (where, similarly, a phone book had choices available about what numbers to include or leave out, but the Court found it was not copyrightable). I suspect even in EU countries, the word list by itself would not be restricted, as it might be too small of a data set and too little work to compile to be covered under European database rights, although again I'm not totally sure there because country by country variance on database rights is an unresolved issue. I hope all that is helpful! -Jrogers (WMF) (talk) 02:57, 24 February 2018 (UTC)
Hello, I would say that etymologies are much more than what you describe. For the example of "between Scylla and Charybdis", we can have many analyses of the first text where the expression is found and of the semantic context of its appearance. Many etymologies aren't settled by consensus, and we need to explain each one. Moreover, we need to explain why some are false, like folk etymologies. We also need to explain the phonetic evolution of each word, under the various existing historical phonetic analyses. If you based your analysis on the kind of etymology found in the English Wiktionary, please don't forget that those are really bad (no sources, no dating, no explanations...) compared to those of the French version (for example: accommodation, bréhaigne...). --Lyokoï (talk) 13:59, 26 February 2018 (UTC)

Thanks again, Jrogers, your work is much appreciated! I think you clarified each point very well, with a legal perspective that I was missing, my expertise being in linguistics. Well, I still have some questions, because Wiktionary is much more than a dictionary: it's almost a complete shelf of lexicographical books!

  • Wiktionaries contain lexicographical thesauri. Prolegomena: we distinguish lexicographical thesauri from documentation thesauri. The former list every word useful to describe something (e.g. for honey you should have bee, hive, flower, pollen, etc.) and the latter are made to structure documents (e.g. for honey you should have subcategories of honey and descriptors for properties of honey). A nice example in French could be the thesaurus for beer (cheers!). So, is a lexicographical thesaurus protected by CC BY-SA?
  • Wiktionaries offer attestations of usage via quotations from published books, unpublished academic works, press articles, lyrics, film dialogue, or even cooking recipes. They are not used as sources but written out in the pages, with a word highlighted to show how it appears in real language. In French Wiktionary, we consider that attestations are covered by the "droit de citation", but this may vary in international law. In en.Wikipedia similar content is mentioned as inline citation, but such usage is much more frequent in Wiktionary. To make it clear, there are more than 340,000 attestations in French Wiktionary. Is the inclusion of attestations a problem? How should we specify the legal status of these pieces of text?
  • Finally (for today at least), Wiktionaries indicate in the licensing footer that the texts are under CC BY-SA, but it appears that a lot of content in Wiktionaries can't be protected by this license. Should we change the copyright footer text for Wiktionaries? Should we indicate something specific in the Terms of Use, distinguishing where CC BY-SA applies, where it doesn't work, and where content is an inline citation with a different status? I feel we are lying to contributors by telling them their contributions will be covered by CC BY-SA when they may not be.
I hope you can still consider looking at my questions. I am very happy to read your answers, as they address some doubts that I (and others in various communities) have had for ages. Noé (talk) 23:01, 27 February 2018 (UTC)
Hi @Jrogers (WMF):,
A new question arose. Wikidata is helping Wiktionaries by connecting the macrostructures with entries like d:Q35459762#P971. You wrote that macrostructure could be copyrightable, so it is CC BY-SA on Wiktionaries. Then, is there a problem with having a copy of the Wiktionaries' macrostructure under CC0 in Wikidata? Noé (talk) 10:37, 11 March 2018 (UTC)

Comment by Psychoslave

Hi @Noé:. As @Jrogers (WMF): and yourself already pointed out, this is a complex subject, as it mixes international law and local jurisdictions, so I think there will be no straightforward, single answer regarding the legality of such and such at a global level.

To start with an easy, playful technical answer: no, no license protects anything. But possibly copyright, or some other rights, might "protect"[1] some data. And a license can then modulate these conferred monopolies.

I think that you are right regarding the "droit de citation": it won't hold outside French jurisdiction, although in the United States, for example, one might try to justify it by fair use, which has a different scope but certainly shares a common subset of granted rights with the former.

Take for example "facts", which are constantly argued to be "not copyrightable". First, it raises the question of what a fact is. And especially, is there a legal definition of a fact? Here I must admit my ignorance; maybe Jacob can point us to a legal text defining a fact under United States law. What's clear is that this "not copyrightable" claim will not hold in every place and circumstance. For example, despite it also being a common law state, in India it seems that facts are possibly copyrightable.

That then leads to the question of whether we want to create projects that maybe won't face legal problems under United States jurisdiction, or an infrastructure which is modular enough to be adapted to local jurisdictions, or whether we just say "let's ignore the jurisdictions which do not fit with the WMF agenda and put them under the label 'in a few countries it might be different' so we can minimize the problem right into oblivion", like in the Wikidata team's self-hagiographic rhetoric? --Psychoslave (talk) 08:14, 1 March 2018 (UTC)

  1. Actually, I think that "protecting data" is propaganda terminology used by the so-called "intellectual property" industry, because what is protected is the monopoly that the law grants to a single person on a mental ability, not the data, which needs neither protection nor law to exist.
Thanks for the link. The document you linked to explicitly states: "As a general rule, facts are not copyrightable." Only expressions of facts might be copyrightable. Please correct my reading - I am not sure how you use this document to support the conclusion that "in India it seems that facts are possibly copyrightable". --denny (talk) 17:25, 2 March 2018 (UTC)
I've got the same reading as denny. For me: an entry in a Wiktionary is not a fact but a collection of expressed facts, and thus is clearly copyrighted, but the facts in this entry are not copyrighted. Cdlt, VIGNERON * discut. 16:55, 3 March 2018 (UTC)
Actually, I agree with your interpretation of this particular sentence. But this sentence is already considerably nuanced by the very next one alone: "However, going back to copyright doctrine, a fact per se not being copyrightable does not extend to allowing copyright protection to be denied to the expression of a fact (as an article which includes one or more facts in its text." So the "possibly" in the conclusion "in India facts are possibly copyrightable" means that in some circumstances factual data can also be affected by copyright issues, especially when it comes to large data sets. Indeed, the article also indicates: "Further, a fact in itself not being copyrightable does not also mean that a collection of facts is not copyrightable. This is what allows databases to be protected under copyright law." The article has a section entitled "Conclusion" which states: "Thus, although there is considerable confusion under copyright law, the provisions of the IT Act are relatively clear, and violating the provisions of the latter law could result in criminal prosecution." Just extracting the single sentence that recalls the general rule and throwing away the primary overall message of the article does not seem to be a proper reading of it. Is Wikidata thought to be out of scope of these explicit issues, and if so, through which legal arguments? If this kind of risk does pertain to Wikidata, is it considered fine that Wikimedia contributors take this kind of risk? Surely, advice from people competent in legal matters, unlike me, would be far more pertinent here. Could, for example, @Jrogers (WMF): give us legal feedback? --Psychoslave (talk) 10:13, 4 March 2018 (UTC)
@Psychoslave: your interpretation seems a bit off: « the very next [sentence] » is not about facts but about the expression of a fact (same for the other sentences), which is where legislation draws the line. This seems quite clear to me; it fits what I already knew on this subject and can be found in many places. For what I know in France: « le droit d’auteur ne protège pas les idées ou les concepts » ("copyright does not protect ideas or concepts"), or the old legal maxim « Les idées sont de libre parcours » ("ideas are free to circulate"); you'll find many resources and cases on this matter. I still agree with denny's interpretation: « "As a general rule, facts are not copyrightable." Only expressions of facts might be copyrightable. » Cheers, VIGNERON * discut. 18:59, 6 March 2018 (UTC)
Hi @VIGNERON:. Once again, this is not about single statements, whether they are considered to carry fact or fiction. It's all about how much data is extracted from a source which is not public domain or covered by CC0 and imported into a CC0-covered database, because this is the real criterion which raises concerns here. French Wikiquote had to blank its database and start again from scratch following a sui generis issue. That a quote comes verbatim from a source surely counts as a fact. Actually, Wikidata provides data whose factual character is far more doubtful. Still, it's not legally possible to extract a whole database that only contains quotes and their sources without facing legal issues when republishing them under incompatible terms. In other countries, there are other monopoly rights on databases which apply, as has already been pointed out. --Psychoslave (talk) 08:52, 12 March 2018 (UTC)
@Psychoslave: I agree, it would not be legal to copy-paste a whole Wiktionary without respecting its license, but Wikidata is not copy-pasting whole databases; it's not even a copy per se, as facts are transformed into data and enriched by comparison of multiple sources. I'm not a lawyer, but if the same facts are found in multiple documents, can one document claim rights on them? Cdlt, VIGNERON * discut. 10:33, 12 March 2018 (UTC)
What is a fact? "Object 1 is Object 2" is a fact? "Object 2 is Object 3" is a fact? "Object X is Object X+1" is a fact? "Object 1 is Object 2 is Object 3 is ... Object Y" is a fact? Is it a database? Is it a knowledge base? Can it be free? --Fractaler (talk) 14:22, 12 March 2018 (UTC)
@VIGNERON: once again, it seems that I failed to get across the exact problem which raises the current issue. For example, the sui generis right is not about whether data is transformed after extraction, nor whether the data is factual; it's about the amount and frequency of extraction, whatever the final form of republication. See for example Quels sont les droits reconnus au producteur de la base de données ?. Moreover, having several sources is not enough. Let's say I make one hundred websites using data extracted from a random database which is not under a free license. All these websites have statements in common, all generated from this single database I wasn't legally allowed to use for such a purpose. Now, I could go further, extract data from these new websites and put it into a new database stating 100 sources for each statement. This is the data traceability problem. I think you are in a better position to say whether Wikidata would be able to cope with this problem, which certainly pertains to legal issues, but also more generally to data reliability. --Psychoslave (talk) 15:12, 13 March 2018 (UTC)

Side notes

As a side note, I would like to apologize for the sarcastic tone I used above (self-hagiographic rhetoric), as it has been reported as troublesome. The observation that Noé made, regarding the lack of impartial presentation when it comes to asking the community to make an important decision, is not covered by this apology. --Psychoslave (talk) 07:49, 5 March 2018 (UTC)