Talk:WikiCite/Shared Citations

From Meta, a Wikimedia project coordination wiki

Talkpage for the "Shared Citations" proposal.

Citation Formats and Sources

Aligning bibliographic metadata with the Citation Style Language (CSL) schema (aka the "CSL-JSON" fields) could be powerful. This format is specifically designed to enable citation formatting, and there is a whole open ecosystem of tools and bots which can work with it (eg, the Zotero reference manager, citeproc styling libraries in several programming languages, extraction/transform helpers for many catalogs and HTML formats). This would make it easier for readers to do things like "re-format this citation in an alternative format" or "give me a BibTeX of all the citations in this article for import to Zotero". For editors and bots, it would potentially be easier to do automated updates or improvements of partial citation metadata from external catalogs and sources. --Blnewbold (talk) 21:55, 4 January 2021 (UTC)

Ideally, Shared Citations will be as inclusive in allowing reuse from external sources as Zotero and Wikidata, and more. Zotero has import from many formats, parsers from many kinds of web pages (including a single page's metadata, as well as parsing a list of citations from the page). Wikidata has import from ORCID and extra tools like SourceMD etc --Vladimir Alexiev (talk) 10:11, 15 January 2021 (UTC)
@Blnewbold and Vladimir Alexiev: thank you for these specific example uses and potential benefits to a wider ecosystem of citation management. Do you think that the proposal document can be improved to include this information somehow? Perhaps in the sections called "Community-developed", or "Open questions", or "use cases"? And if so - could you please suggest a couple of sentences that I could copy/paste into the right place? I would like to be able to incorporate this information accurately. LWyatt (WMF) (talk) 10:25, 15 January 2021 (UTC)
Please check the work of Lars Willighagen on citation.js (see also https://peerj.com/articles/cs-214/). He created many mappings already for Wikidata which can serve as a template here. --Egon Willighagen (talk) 06:52, 18 January 2021 (UTC)
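
As a rough illustration of the CSL-JSON point raised in this thread, here is a minimal Python sketch of a CSL-JSON-shaped record and a naive conversion to a BibTeX entry. The field names (`type`, `title`, `author` with `family`/`given`, `container-title`, `issued` with `date-parts`, `DOI`) follow the CSL-JSON schema. The sample record, the `csl_to_bibtex` helper, and the partial type mapping are assumptions made up for this example and are not part of the proposal. A real deployment would more likely rely on a citeproc library or citation.js.

```python
# Hypothetical sample record; only the field names come from CSL-JSON.
record = {
    "id": "doe2020example",
    "type": "article-journal",
    "title": "An Example Article",
    "author": [
        {"family": "Doe", "given": "Jane"},
        {"family": "Roe", "given": "Richard"},
    ],
    "container-title": "Journal of Examples",
    "issued": {"date-parts": [[2020, 5, 17]]},
    "DOI": "10.1234/example",
}

# Partial, illustrative mapping; any other CSL type falls back to @misc.
CSL_TO_BIBTEX_TYPE = {
    "article-journal": "article",
    "book": "book",
    "chapter": "incollection",
}


def format_author(name: dict) -> str:
    """Render one CSL name object as 'Family, Given'."""
    if "literal" in name:  # institutional or unparsed names
        return name["literal"]
    family = name.get("family", "")
    given = name.get("given", "")
    return f"{family}, {given}" if given else family


def csl_to_bibtex(item: dict) -> str:
    """Naive CSL-JSON to BibTeX conversion covering a handful of fields."""
    entry_type = CSL_TO_BIBTEX_TYPE.get(item.get("type", ""), "misc")
    fields = {}
    if item.get("title"):
        fields["title"] = item["title"]
    if item.get("author"):
        fields["author"] = " and ".join(format_author(a) for a in item["author"])
    if item.get("container-title"):
        # Simplification: always emitted as 'journal', which only suits articles.
        fields["journal"] = item["container-title"]
    date_parts = item.get("issued", {}).get("date-parts", [])
    if date_parts and date_parts[0]:
        fields["year"] = str(date_parts[0][0])  # first element is the year
    if item.get("DOI"):
        fields["doi"] = item["DOI"]
    body = ",\n".join(f"  {key} = {{{value}}}" for key, value in fields.items())
    return f"@{entry_type}{{{item['id']},\n{body}\n}}"


print(csl_to_bibtex(record))
```

The sketch ignores escaping of special characters and most CSL fields. It is only meant to show that a record stored with CSL-JSON field names can be re-rendered into another format without re-keying the metadata.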

Principles

Related to "pragmatic scope", curious what the scope/limits of bibliographic metadata will be. Here are some specific metadata scope questions:

1. Will authors be normalized or tagged with external identifiers (eg, ORCID or VIAF)? I would guess not; this isn't necessary for formatting references.

It's best if such are included, but that's a lot of work. WD took the pragmatic approach for "scientific articles": you first import them with "author name string", then gradually replace these strings with "author", and there are tools to do this disambiguation and linking.

2. Which external identifiers for works would be included? DOIs, ISBNs, PubMed identifiers, arxiv identifiers are commonly used and likely to be included, but what about holdings-specific identifiers (like Hathitrust, JSTOR, Amazon, WorldCat, Open Library) or domain-specific identifiers (dblp for conference proceedings)? Wikidata has been expansive/inclusive of such identifiers; deciding how inclusive Shared Citations should be sounds like it will require discussion and consensus.

The more external identifiers/links, the merrier. They enable large-scale data integration efforts (building of Science Knowledge Graphs). Shared Citations should be as expansive as Wikidata in this regard. Being able to link to eg DBLP doesn't mean an editor HAS to do it.

3. Would Shared Citations include bibliographic metadata beyond what is necessary for citation? Eg, access and licensing status (Open Access "color", public domain status, Creative Commons license), subject and categorization (keywords? controlled vocabulary? wikidata entities?), retraction or update/errata status (hopefully!), original language, etc. --Vladimir Alexiev (talk) 10:08, 15 January 2021 (UTC)

@Vladimir Alexiev: I think the most sensible answer to all of these questions is: Those are things the community would need to decide. I have opinions about all of them, but it's important as a proposer that I'm trying to demonstrate the potential scope and potential functionality, without being unnecessarily prescriptive about the specific implementation decisions.
But to give you my personal opinions...
1. The proposal implies that the only records in the Shared Citations database are the citation records themselves. Anything else – notably authors or publishers – would not have records on the database in their own right. Those would be Wikidata items. There should not be a "victor hugo" record in the citation database, there should only be Victor Hugo (Q535) in Wikidata. In cases where there is no item in Wikidata (e.g. for many journalists who just happen to be the author of a single newspaper article used in a citation) their name would just be a string in the Citation record, and perhaps a redlink to a "would you like to create a wikidata item about this person" which would take you to Wikidata as illustrated in this mockup.
2. I would suggest that it should be very expansive. I don't see a reason why it couldn't be. Because some of the citations (e.g. to historic newspapers) might only be held in a couple of physical libraries, it would be useful to add their specific physical call-numbers. Why not? Like you say: having the option to add more info doesn't mean you have to. That's exactly why many citation templates on Wikipedia have many many fields available for edge cases.
3. I believe so, yes. As illustrated in this mockup I've implied "retraction status" property, a "primary source?" property, multiple access location links (including "shadow library" like SciHub). You could expand that with a qualifier property for "price for access for an individual without academic affiliation", open access status/colour, subject keywords... Any of those things. My only point of caution would be to not 'encroach' on Wikidata. This proposed database should be pragmatic and its ontology focused on supporting practical needs of Wikimedia-project editors and readers. For potential uses which go beyond that, then it might be more appropriate to create the work as its own item on Wikidata as well and then use all the fancy properties there - such as the "cites" property which builds the citation graph beyond the scope of what Wikipedia is referencing.
FYI: I've made some slight reformatting of how your questions are shown on the page (without changing the words), in order to make it easier to clarify what I'm responding to.
-- LWyatt (WMF) (talk) 10:46, 15 January 2021 (UTC)

MediaWiki Development

Considerations

Bibliographic data vs citation: Bibliographic data describes the resource as a whole; a citation MAY include a specific location within that resource. For sharing, the resource description must be universal and the specific location (usually a page number) needs to be a characteristic of the article where the resource is cited. Lamona (talk) 15:35, 28 January 2021 (UTC)

While any attempt to make a 'clean' description of different kinds of citable resource is bound to end up being caught up in the edge-cases, I've tried to get at the point you're making in the description on the subsection "relationship to Wikidata". The proposed "record" pages on the Shared Citations database would be for the specific work being cited - e.g. the "PenguinBooks paperback, Armenian translation, first published in 1995, 3rd edition, of Voltaire's Candide". Have a look at the example of a "book" record, here File:References status quo & Shared Citations records examples.pdf. Note: that this record concatenates all references to the same subsection of that work (page, chapter, page range). That is my proposal for how to address this issue.
However, the Shared Citations database would NOT have a record for the "overarching" concepts of "Voltaire" or "Candide" - THOSE belong as items on Wikidata. Does that clarify/answer the comment you're making? LWyatt (WMF) (talk) 17:12, 15 February 2021 (UTC)
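
The split discussed in this thread (one shared description of the work, plus a per-use location such as a page number) could be modelled roughly as follows. This is only a sketch under assumptions: the class names `SharedRecord` and `CitationUse`, the identifier `SC1`, the article titles, and the `uses_by_location` helper are all hypothetical and are not taken from the proposal or its mockups.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedRecord:
    """Universal description of one specific edition of a work."""
    record_id: str
    title: str
    publisher: str
    year: int


@dataclass(frozen=True)
class CitationUse:
    """One use of a record in an article; the location belongs to the use."""
    record_id: str
    article: str
    location: str  # e.g. "p. 12" or "ch. 3"


def uses_by_location(uses, record_id):
    """Group the articles citing a record by the location they cite."""
    grouped = defaultdict(list)
    for use in uses:
        if use.record_id == record_id:
            grouped[use.location].append(use.article)
    return dict(grouped)


# Hypothetical data: one edition record, cited from three articles.
candide = SharedRecord("SC1", "Candide", "Penguin Books", 1995)
uses = [
    CitationUse("SC1", "Voltaire", "p. 12"),
    CitationUse("SC1", "Candide", "p. 12"),
    CitationUse("SC1", "Optimism", "ch. 3"),
]

print(uses_by_location(uses, candide.record_id))
```

Keeping the location on the use rather than on the record means the bibliographic description is stored once, while the record page can still list which pages or chapters are cited and from where.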

WD ref storing and sharing

Each field of each Wikidata reference is stored independently, even if it’s identical.

@LWyatt (WMF): This is not quite true. WD examines ref fields and makes ONE ref node for each identical set of fields (in programming, this is called "value objects"). But this merging is insufficient, eg

  • if the two refs have different "retrieved at" they won't be merged, even though they are the same thing
  • if there are minor differences (eg trailing slash in "reference URL" or not), they won't be merged
  • if they're part of the same resource (eg two pages from the same book), they won't be related
  • it's neither controlled by nor exposed to the user --Vladimir Alexiev (talk) 10:03, 15 January 2021 (UTC)
Thank you Vladimir Alexiev for this technical clarification. Would you say that the short phrase I've used (which you quoted) should be changed because it is missing this nuance, or would you say that it is accurate enough? And if so, how would you suggest that it be improved so it can still express the same idea (about the massive redundancy) as briefly as possible? LWyatt (WMF) (talk) 10:19, 15 January 2021 (UTC)
@Vladimir Alexiev: Hmm, I wasn't aware of that. I've long thought there should be a better UI for handling Wikidata references but haven't seen any obvious solution - there are several javascript gadgets that make it a lot easier these days, but it's still more painful to add and maintain references there than needed, I think. ArthurPSmith (talk) 21:03, 1 March 2021 (UTC)
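
To make the "value objects" point in this thread concrete, here is a minimal Python sketch of merging references by the exact set of their fields. It is an illustration of the general idea only and is not Wikidata's actual hashing or storage code. The helper names `reference_key` and `merge_identical`, the field labels, and the example URLs are assumptions made for this example.

```python
import hashlib
import json


def reference_key(fields: dict) -> str:
    """Hash a reference's fields; identical field sets give identical keys."""
    canonical = json.dumps(fields, sort_keys=True)  # key order must not matter
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def merge_identical(references):
    """Keep one node per distinct field set, like a value object."""
    nodes = {}
    for ref in references:
        nodes.setdefault(reference_key(ref), ref)
    return nodes


# Hypothetical references to what a human would call the same source.
a = {"reference URL": "https://example.org/report", "retrieved": "2021-01-15"}
b = {"retrieved": "2021-01-15", "reference URL": "https://example.org/report"}
c = {"reference URL": "https://example.org/report", "retrieved": "2021-02-01"}
d = {"reference URL": "https://example.org/report/", "retrieved": "2021-01-15"}

nodes = merge_identical([a, b, c, d])
print(len(nodes))  # a and b merge; c and d stay separate
```

Only `a` and `b` collapse into one node. A different "retrieved" date (`c`) or a trailing slash (`d`) is enough to produce a separate node, which is the limitation described in the bullet list above.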

Pilots

Good idea to start from sister projects like Wikiquote, although I suppose only the French Wikiquote would be suitable at the moment. Most Wikiquote subdomains follow the model of the Low-barrier Wikiquote, therefore introducing large software dependencies would be highly disruptive. Nemo 19:12, 16 February 2021 (UTC)

whatever the rollout ‘schedule’ would be, especially at the very initial stages, they would need to be pragmatic choices to ensure that there was the capacity and support from the local editing community to support the new feature, be patient with any initial flaws, and give advice. It would also have to be practical to ensure that a diverse range of languages and cultures are involved in the very initial stages: to ensure that the community culture, content, and properties of the new database are built from the very beginning with a diversity of citation forms (not just one culture’s definition of what is ‘real’). This is something user:Joalpe especially taught me. LWyatt (WMF) (talk) 22:21, 16 February 2021 (UTC)
I agree. Specifically about Wikiquote, I believe this should be possible by carefully selecting a few diverse subdomains. Another I'd try is Russian. It would be great to have any Japanese community on board, and so on. One can only find out by discussing it with them. Nemo 08:13, 18 February 2021 (UTC)
precisely. That's why there's a big caveat about that 'rollout schedule' - in that it could not feasibly be so neat as "all languages of sisterproject, then all languages of the next sister project, etc." It would, in practice, be much more 'piecemeal' - working with those communities who are most interested first - but with a dedicated attempt to try to ensure those early projects aren't all from one language/culture/script/sisterproject. LWyatt (WMF) (talk) 17:57, 18 February 2021 (UTC)

Development principles

This proposal about a "database" is super generic; it might mean anything, possibly not even a wiki, and I understand that's intentional. I love https://fatcat.wiki but I hope we're not trying to reinvent the wheel here. If it's going to be a wiki, or anything like a wiki, I'd like to ask that it follows the MediaWiki principles, simply because that's what Wikimedia is most experienced at doing. If WikiCite turns out to need something different, it should be hosted by another entity, which would be better at spending the millions of dollars this is going to require. Nemo 19:12, 16 February 2021 (UTC)

Makes sense to me. I’ve added a reference to that page in the ‘principles - database’ subheading. LWyatt (WMF) (talk) 22:11, 16 February 2021 (UTC)
Good! Nemo 08:13, 18 February 2021 (UTC)

Wikidata or bust

This is a worthy idea, but the implementation MUST be based on the existing infrastructure in Wikidata. Otherwise, it will inevitably compete with that, just because people can.

The page makes an attempt to justify the separation, but most of the arguments are weak:

  • Scale: this is probably the best reason. It is indeed true that some (most?) of the references would not fit the notability of items on Wikidata, but we could conceive different criteria for citations, just like there are for lexemes.
  • Service: totally orthogonal to where the ontology lives
  • Scope: likewise totally orthogonal to where the ontology lives. Also, the limits seem to be artificial ones meant to justify the separation; I don't really see the need to limit to existing Wikimedia citations
  • Sovereignty: as long as there are separate rules for citations, there is no need for this. I personally consider the distribution of efforts into lots of different projects as a terrible waste of resources.

Here are my arguments for keeping this within Wikidata:

  • resource optimisation for developers. Software developed for this project will be easily extended to other Wikidata purposes (e.g. on-wiki editors)
  • resource optimisation for reusers. No need for new libraries, functions, checking where something lives, it will reuse a huge part of the tons of code people wrote for wikidata.
  • ease of understanding. Wikipedia vs Commons is already complicated. Wikidata introduces a new layout, makes it harder for newcomers. Having a fourth project needed to write articles (well, maybe not needed, but strongly recommended by policies and best practices on Wikipedia) raises the entry barrier even further.
  • competition: as described above, just because people can. We have editors who refuse to upload to Commons. What happens when you have the same reference in Wikipedia, Wikidata, and this? Strainu (talk) 18:43, 23 February 2021 (UTC)
Hello Strainu, I appreciate you've taken the time to read this rather lengthy documentation page for the proposal, and gone to the effort of describing your thoughts upon it. It is useful that these points are here, enumerated and elaborated, as they allow for other people to build upon the thinking too. And I hope that other people will comment here - there's a lot of people by now who have added their name to the 'endorse' list, and also a lot of people who have asked me the question "why not inside Wikidata", so hopefully this sparks some conversations!
When I was first running interviews with many people in the first phase of this project (to identify the problems and see which ones could feasibly be addressed) the biggest "structural" question was - as you have correctly identified - whether the eventual proposal should be "inside" or "separate" from Wikidata itself. The "WikiCite Roadmap" series of ideas was equally split on this topic with options for both. So, as Lydia, Addshore and Amir can attest - in my own thinking based on the research and consultation (with them, and many others), I went 'back and forth' many times on this very question.
If I can summarise the various points you've made to what I feel is your core concern with the proposed solution, it is that of *inevitable competition* with Wikidata. You say, "What happens when you have the same reference in Wikipedia, Wikidata, and this?" and "I personally consider the distribution of efforts into lots of different projects as a terrible waste of resources." - I completely concur, and that is precisely the main feature of this whole proposal. Currently we Wikimedians are all doing the monotonous task of reference management with an almost 100% distributed effort. Even when the same footnote is used hundreds of times, the same metadata (the ISBN, URL, author name... ) has to be typed hundreds of times. So, even if you feel that a "new" database just for references is a distribution of effort compared to it being handled inside wikidata directly, that is nonetheless still incalculably more centralised than it currently is.
One of the "principles" in this proposal is of non-deprecation: that the existing reference systems (e.g. the citation templates in Wikipedias) remain and are not removed. It is up to the communities how they wish to use the new system (or not use it). So, there will always be edge-cases of obscure references that can't sensibly be centralised and also community members who prefer the existing workflow. That is perfectly fine, and just like Uploading to Commons (another one of your examples) is a choice for individuals and whole Wikis for how they wish to operate. The task is to make the new project (and Commons) compelling enough for people to want to shift - not to force anyone.
Equally, it is not necessarily something that the Wikidata community would want forced upon them. Wikidata is often described as being overwhelmed with millions of items about scholarly journal articles. Notwithstanding your suggestion of a new namespace similar to how Lexemes work, the idea that Shared Citations would bring tens of millions of new items about specific publications - items about individual tweets sent by politicians, about specific articles in specific pages of old local newspapers... - would be rightly seen by many of the WD community as an unacceptable extension of the scope of the project. This is aside from whether the Wikibase software and query service can handle that scale of the combined WD+SharedCitations corpus.
To return to the question of 'competition': I feel that the strict scope of this proposed database lends itself most naturally to a new/separate project - not a new "sister project" (which WOULD feel like it is competing for attention) but a "service" project that supports all Wikimedia sister projects (including Wikidata) in this specific area. WikimediaCommons was originally set up for that strict "service" role for centrally storing images - but then rapidly grew the larger scope of hosting freely licensed multimedia whether or not it's used in a WP article. Wikidata, on the other hand, was born with the "wide" scope in mind and serves many roles for many communities (inside and beyond the wikiverse). Precisely because Wikidata already exists, the specific/targeted scope of the Shared Citations database - to store the citation metadata of things being referenced in Wikimedia projects only - is possible. As d:user:Andrawaag describes it: we need to ensure that Shared Citations doesn't suffer from people trying to make "stamp collections" - adding items to make "sets" of works for the sense of completeness' sake. THAT is the value/role/purpose of Wikidata and its queries. Shared Citations' role is pragmatic - to support references - while maintaining agnosticism about what each Wikimedia sister project/language edition feels is appropriate policies for Notability, Reliable Source, and reference display format.
Of course, there would be extensive collaboration and inter-linking with Wikidata. My above sentences are not meant to imply that Shared Citations would live in some isolation. For example: the Shared Citations database would not have an item for "Victor Hugo" or "Penguin Books" or "COVID-19" (examples of author, publisher, and main-subject respectively). These would be linked from Wikidata in order to allow for searching for "all references in Wikimedia projects with the author = victor hugo". You can see a mockup of how the connection between the free-text and wikidata item might work at this example. Furthermore, it is quite possible (probable?) that the "properties" used for all these items in the Shared Citations database should be imported from Wikidata using Federation. It might make no sense to create local properties - and the translations of the labels for those properties - when organising the metadata fields needed for references does fit within Wikidata's scope (e.g. author (P50)).
I'm not dismissing the validity of your concern/question - I agree it's a major decision, one that would be determined by professional analysis of the database architectural needs as much as community preferences. And I don't expect that my comments here will "make you change your mind". But I've written this to ensure you know that the comments you raise are taken seriously; that there's no obvious or 100% right/wrong answer (notwithstanding your opening statement that this "MUST" be placed inside WD); and so that other people who come along later can add their perspective in reply to us.
Sincerely, LWyatt (WMF) (talk) 15:09, 24 February 2021 (UTC)
Thanks for taking the time to write such an extensive answer. I would like to see more people involved in this discussion as well, so I will only respond to a very specific statement you make, which I feel could be the key to learning from the mistakes of the past.
You say not a new "sister project" but a "service" project that supports all Wikimedia sister projects. The same has been said for Commons and Wikidata at their respective birth. Looking back, it is my belief that as soon as there will be a community around this project, that community will ask for equal standing. And it's normal, it's about recognition and resources.
So, can we do something to prevent a community from happening? Not really, but we could prevent a site from appearing. What I'm suggesting is that edits to this database be made from other projects only. This is a big change from the current way projects work, but has several advantages:
  • people are less likely to see this as a project, but as a database
  • it's irrelevant where the database lives, you edit it from your home wiki
  • the entity that does the software development will be forced to design for interaction rather than concentrating on local features and leaving interaction with other projects for "later" (remember how bad wikibase was in Lua for the first few years? And we still don't have an official way of editing Wikidata from other wikis or cross-uploading a new version of a picture on Commons)
This does leave open the question of database administration, for which I don't have a good idea.
On an unrelated note, it's unclear to me if you have formally asked the wikidata community to host this database (maybe I missed it on the proposal) Strainu (talk) 15:49, 24 February 2021 (UTC)
A reason why I believe it will be easier to retain the scope as ‘service project’ for this proposal, compared to Commons’ history, is the very fact that Wikidata already exists. Anything that is beyond the scope of S.C. can probably find a natural home in Wikidata. As a practical measure of achieving (enforcing?) this, I have also suggested in the proposal document that there be NO “create new record” button/workflow on the database itself. Thereby, the only way to create a new citation record in the central database would be to ‘push’ a reference from a Wikimedia project. I’m not sure if that’s a good idea, but it’s a possibility and would help engrain the ‘service project’ culture by making it harder to do ‘stamp collecting’.
Your suggestion that the S.C. Database could be made ‘invisible’ and only editable from within other projects is a fascinating one. It was mentioned as an idea in some of the early consultation interviews I ran too. You will note that in this proposal I have never explicitly stated a preference for what kind of user interface or database software it is built on. My belief is that Wikibase is the default option and any other suggestion should have to prove itself as superior to it. Nonetheless I’ve tried to NOT pre-determine specific technical solutions for this proposal - that would be premature for a proposal at this stage. I am working with User:KChapman (WMF) and the WMF architecture team who will be researching some of these kinds of questions. I will ensure your suggestion is, at the very least, raised as an option for investigation.
No, I have not formally asked the wikidata community to host this proposed database, not just because I don’t want to pre-empt the Architecture team’s research, but also because this IS still just a proposal. I do not want to over-promise what is not yet officially on any annual plan nor has any resources formally dedicated to it. There is a long paragraph in the proposal document just after the “timeline” which describes my thinking about the concept of “approval” - and how proposing something so complex (and with such a potential impact on both wmf and volunteers) must do a “delicate dance” of raising awareness of the idea, but not making anyone feel like someone else has already made a decision without consulting first. Thus, if I had made a formal request to the wikidata community at this stage, it would give an incorrect impression to other parts of the movement (and wikidata itself) about how formal this proposal is at this stage. We are still in the mode of, “let’s tell people about this idea and see if it has traction across all corners of our movement”, not “formal vote about anything”. Nonetheless, I have added this proposal to this week’s wikidata community newsletter, asking for feedback, and this proposal was recently publicised/promoted on the wikidata-l mailing list by the Internet Archive’s Mark Graham.
LWyatt (WMF) (talk) 11:52, 25 February 2021 (UTC)
Making the database editable only from other projects is not a technical solution, but a policy one. Returning an error page on accessing the pages instead of the API is trivial. I believe the proposal should be clearly pushing for a service project and framed in a way that prevents it from becoming a sister project.--Strainu (talk) 13:30, 25 February 2021 (UTC)

citations are references

From my perspective the notion of a separate project is not realistic. First, in the scope all the citations of Wikidata are included in "Shared Citations". Every scholarly work includes citations and is cited when it is relevant. At Wikidata we have a bot that is revived that will include the citations for scholarly works. This would make "Shared Citations" a superset of Wikidata.

Dear GerardM, Because your comments here are numerous, I will reply to them inline - to make it more obvious which comment I'm replying to. Also, I note that you've added an "oppose" on the proposal page in a subsection called "endorsements", as well as writing these critiques here. That list of endorsements is not a vote. If you don't wish to endorse it - don't - and bring your concerns here (which you have done). But it seems odd to provide an "anti-endorsement".
If you want to create an echo chamber, you do a good job. Only inviting endorsements does not make your proposal well argued.
If someone in the street is asking people to sign a petition to endorse a policy, and you don't like the policy - then you don't sign the petition. It's not a vote. This talkpage is a useful and valid place to express concerns and debate the issues (as you and others are doing). LWyatt (WMF) (talk) 14:35, 3 March 2021 (UTC)
With regards to your first point (above): I think we are talking about different things - due to the often-overlapping popular usage of "citation" and "reference". To try to be more clear - this proposal is not talking about the "Cites work" property in Wikidata. That is the property that collects information about when one work (usually a scientific journal article) has a reference to another work. Rather, for the Wikidata context specifically, this would only affect things which are used in the "References" fields of claims in Wikidata. There would be many items in Wikidata about scholarly journal articles (for example) which would not receive Shared Citation records, because those journal articles aren't used as references for other wikidata items (or Wikipedia footnotes etc). Furthermore, there will be many things in the Shared Citations database that will never be in Wikidata - most obviously the tens of millions of individual URLs (to blogs etc) that are used in Wikipedia footnotes.
Your notion that you should not have the citations for the citations does not help us improve Wikipedia. The hubris that the existing Wikipedia references qualify for the sum of all knowledge for any subject is breathtaking.
I have been adding loads of references to websites and the likes as a reference for a paper.. Never, why? Thanks, GerardM (talk) 06:15, 3 March 2021 (UTC)
Unfortunately, I do not understand the point you are expressing. Could you clarify? LWyatt (WMF) (talk) 14:35, 3 March 2021 (UTC)
Thus, Shared Citations is neither a subset, nor a superset, of Wikidata. It is partially overlapping circles. LWyatt (WMF) (talk) 11:25, 2 March 2021 (UTC)
Only when Shared Citations is to be a project divorced from the work that has already been done. Thanks, GerardM (talk)
As with the previous comment - your meaning here is unclear. LWyatt (WMF) (talk) 14:35, 3 March 2021 (UTC)

When "Shared Citations" insists on a certain presentation, it follows that it is just that, a presentation. There are mechanisms in Wikidata that will strongly suggest to include the information needed for a presentation and by implication you will have that presentation. Insisting that a reference will arise like Pallas Athena from the head of Zeus is rather unwiki and not becoming of a Wikimedia project.

Sorry, but I don't understand what this means. LWyatt (WMF) (talk) 11:25, 2 March 2021 (UTC)

We already have Scholia who KNOWS for certain Wikipedias if a paper is used as a reference. As a template Scholia shows the references for a SUBJECT as it is.

I'm not sure if this is a critique of the Shared Citations proposal because I'm not sure how it relates to the previous sentence or the following sentence. LWyatt (WMF) (talk) 11:25, 2 March 2021 (UTC)
What you propose exists largely already. Just find out what is already there. It needs widespread practical adoption, the implementation may need verification for "other" projects just like your abstract does. Thanks, GerardM (talk) 06:15, 3 March 2021 (UTC)

When the Open Library is onboard supporting the books that are used as references it would be great. We already link to OpenLibrary for books and they link to Wikidata. When we are to acknowledge this relation, we would suggest people to read the books where they are available. We should promote the donation of rare books to the Internet Archive exactly because we aim to share in the sum of all knowledge.

One of the things I've implied with this example wireframe is the suggestion that we could indicate where a reader can access the work in multiple places. People could, in theory, make tools which allow a reader to say their location and thereby be suggested local libraries (for example), or we could include properties to indicate what the $ cost would be for the paywall for different methods of access, and also link to "shadow" libraries like SciHub. And yes, we could most certainly connect with the Internet Archive - who have already made an extensive endorsement in the "Partners" section - which could use this combined data to know what it is they should prioritise for digitisation next. LWyatt (WMF) (talk) 11:25, 2 March 2021 (UTC)
You may have a wireframe. We have the data. It does not take a new project to enhance our relations and expose available texts. What it takes is an interest in the extent we serve our readers already and link them to the IA, OL and others. Thanks, GerardM (talk) 06:15, 3 March 2021 (UTC)
If I understand your meaning, then we agree - Wikidata's excellent corpus should continue to grow and thrive and have fabulous connections/partnerships with other organisations such as the I.A. LWyatt (WMF) (talk) 14:35, 3 March 2021 (UTC)

When we truly aim for SharedCitations to be a success, we would partner extensively. For instance with ORCiD and enable people to assess the content of their publications in both Wikidata and ORCiD.

ORCiD is an identifier for people - and therefore would not appear in the Shared Citations database. Records for people belong in Wikidata, which can then be connected to any of their publications in the Shared Citations database. See, for example this wireframe example where an author does not yet exist in Wikidata and the 'redlink' suggestion to create it, and by comparison this wireframe example where an author is in Wikidata.
However, with your sentence about ORCiD and partnerships generally, to "truly aim for Shared Citations to be a success" - this seems to be something you think is a good potential thing, not a criticism, which confuses me because you wrote "oppose" before... LWyatt (WMF) (talk) 11:25, 2 March 2021 (UTC)
Your argument about ORCiD demonstrates why your project is one dimensional. With all papers in one database, and authors known by any and all identifiers, we can invite authors to come to Wikidata and curate what we have on them, their papers, their citations, their co-authors. In Wikipedia articles for many authors their papers are mentioned. These lists become less complete over time. These same papers typically are in Wikidata, papers are added all the time, are attributed all the time. Making for the potential to collaborate.
As with the previous comment: yes - Wikidata is excellent for this kind of collaboration with individuals (and organisations) to improve the quality and quantity of data about all sorts of things - as we all agree. LWyatt (WMF) (talk) 14:35, 3 March 2021 (UTC)

When our aim is to include ALL the scholarly works, we can point out that Wikipedia references are old vis a vis the new publications available. It is another perspective on the validity of articles.

The Shared Citations proposal is not to create items for all scholarly works. It has a pragmatic scope to store the metadata used for references in Wikimedia projects. Wikidata is the place for that larger scope, and this proposal is agnostic about Wikidata's criteria for notability. LWyatt (WMF) (talk) 11:25, 2 March 2021 (UTC)
This may seem pragmatic. It prevents collaboration and it makes for the need of two closely related sets of data where shared citations will ensure that operationally things will be less effective. Thanks, GerardM (talk) 06:15, 3 March 2021 (UTC)

Having a separate project is in my perspective a power grab that will not add value and make our operations more complicated. Thanks, GerardM (talk) 12:27, 28 February 2021 (UTC)

It could be equally argued that trying to put this proposal "inside" Wikidata would be a "power grab", so it's just a matter of whether people perceive that to be the case. For what it's worth, I've extensively consulted with Wikidata users and developers to ensure it is a proposal that is supporting all Wikimedia projects, not a new competing sister project. LWyatt (WMF) (talk) 11:25, 2 March 2021 (UTC)
You concentrate on one part of the argument while the more essential parts follow. It will not add value, it will ensure that much of the information will be in two places, making the argument used by some stronger "that we do not need all that". Your refusal to acknowledge that a citation in a paper IS a reference ensures that there is no value for scientists in your project but for the scientists who study our projects. It prevents functionality where we open up all the literature for a subject for those who care for it. Your approach will NOT indicate the extent to which existing references are old and may have been superseded. In my opinion you may have talked with the usual prospects and you have not refuted the points why your idea will not help us beyond the narrow scope you entertain and will damage what is done on papers in Wikidata. We already have much of the proposed functionality in Scholia and in the latest developments using data for references in Wikipedia from Wikidata. So thank you but no thank you. Thanks, GerardM (talk) 05:29, 3 March 2021 (UTC)
Your disapproval with the proposal is noted. LWyatt (WMF) (talk) 14:35, 3 March 2021 (UTC)
I agree with Strainu that this project would exist better inside Wikidata (by being a separate namespace like lexemes are). Wikidata is already very welcoming of creating properties that are needed over at WikiCommons and I don't think anybody complained at Wikidata about WikiCommons properties living inside the Wikidata namespace for properties.
If this project has its own admins and policy making it will be a competing sister project.
If this project results in it being much easier to add references inside of Wikidata, it would be good for Wikibase, and it would be good if other Wikibase instances could simply reuse the new features. That would be possible if the project were another namespace within Wikidata/Wikibase, but not if it's its own Wikibase. ChristianKl❫ 17:42, 2 March 2021 (UTC)
Thanks for chiming in with this User:ChristianKl. While the proposal as-written does state a separate/independent database (highly connected to WD, but still independent), the prospect that it could also achieve the same functionality as a new namespace in Wikidata should not be discounted. It changes the nature of the proposal in terms of 'structure' but not the eventual hoped-for benefits. With regards to properties - the 4th bullet point in the Open questions section discusses the assumption that (even as a separate database) they could be "federated" in from Wikidata. But - that is a specific technical implementation I did not want to pre-suppose the implementation-answer for (hence it is an open question). With regards to Admins - I did describe in the Workflows subsection that there would probably need to be some kind of community process for giving advanced permissions (such as deletions, bots...). However, as for the policy-making, the core principles of editorial independence, non-deprecation, style-independence, and Create upon use - should mean that it is enforced as a "service" project - not a competitor with editorial policies of its own that in any way override the editorial policies of the individual wikis. Indeed: If it was based inside Wikidata, I can imagine a lot of Wikipedians being nervous about WP editorial policies being overridden by WD. I would not want "Shared Citations" to enter that fight - it should be agnostic. But nonetheless: the idea of 'new namespace in WD' is intriguing. One practical/technical question which would address that independently of what anyone prefers, is whether Wikidata (and the query service) can handle the added scale of tens of millions of records about things like individual URLs? Already people say that WDQS is overloading... LWyatt (WMF) (talk) 14:35, 3 March 2021 (UTC)

Hi everyone, (There are a lot of open threads already so I'm adding here what could probably go in a few different sections here.) As I have been talking about for many years now the current state of Wikidata is already a struggle, both technically and socially. Today 40% of Wikidata's Items are scientific papers, considerably more if you add Items created to indicate authors of those papers etc. This is not a healthy balance, especially considering how little those Items are used today. They are important but as we stand we are very much pushing the limits of Wikidata in all areas. We are playing catch-up everywhere. The SPARQL endpoint being very much at capacity is just one of those issues, not even thinking about all the trickle-down issues this causes wrt data maintainability and data re-use. So while my team, I and everyone in the community is doing everything we can to make Wikidata scale more both socially and technically there are limits to this and there are limits to how fast this can be done without jeopardising the whole project. The massive amount of data we are talking about here would make all this even harder. A clearer separation as proposed here makes a lot of sense to me. And if you ask me I'd go even further. --Lydia Pintscher (WMDE) (talk) 07:51, 22 April 2021 (UTC)

Thank you very much for your comments Lydia Pintscher (WMDE). It is very important to me that this proposal is designed in a way that is a positive thing for the movement in general and also for Wikidata specifically. @ChristianKl and Strainu: for your awareness, Lydia's comment here is referencing your suggestions above. LWyatt (WMF) (talk) 14:56, 22 April 2021 (UTC)

@Lydia Pintscher (WMDE): Thanks for your view. As a botmaster, I'm painfully aware of the sparql endpoint limits. However, moving things to a new project will not solve the problem, but multiply it. On one hand, this new thing, whatever we call it, will very quickly match the size of Wikidata. On the other hand, cross-queries are a natural request and they were even discussed on this page. I don't know much about the architecture of the endpoint, but I suspect those will bring even more pressure on the system. Strainu (talk) 17:14, 22 April 2021 (UTC)

The simple size of the graph is a big issue currently. So not combining them (and even moving some things out of Wikidata) would actually help a lot easing one of the pain points. One of the things the search platform team is currently looking into exactly because of that is how we can meaningfully partition the data we already have. It's not the only solution but from what we know at the moment separation helps a lot. --Lydia Pintscher (WMDE) (talk) 17:42, 22 April 2021 (UTC)

Zotero translators integration

Reading the "Partners" section I realized that Zotero and their translators would be a great partner for this project (not necessarily in the sense from the page). They are of course used currently in Cite, but there is no way to easily extend them/add new ones. You should consider building a way to visually define a translator for a website. I have in mind something like the element selectors from the browser's development window. The resulting translator should then be pushed back to upstream.--Strainu (talk) 18:08, 24 February 2021 (UTC)

I think the best person to comment about this is User:Diegodlh who is currently also working on a WikiCite-grant funded project - “addon for Zotero with citation graph support” for Wikidata. LWyatt (WMF) (talk) 11:24, 25 February 2021 (UTC)
> They are of course used currently in Cite
Do you mean Cite Q? What do you mean Zotero translators are currently used in it?
> there is no way to easily extend them/add new ones
I have never myself proposed a new Zotero translator, but as far as I know the community is open to contributions, although they do recommend that the right steps are followed to expose metadata appropriately.
> You should consider building a way to visually define a translator for a website. I have in mind something like the element selectors from the browser's development window.
I'm not sure I understand what you mean here. Do you mean the translator should be able to import individual items from the References section of a Wikipedia article, like what happens in Google Scholar result pages, for example? As far as I know, this is not currently possible with the Wikipedia translator. Even if it were, I understand the way Zotero handles this is through the Zotero Item Selector window. I guess a different way of addressing this (like the one you are suggesting) should be proposed to the Zotero Connectors developers instead. But anyway, I'd appreciate it if you could restate your question, as I may have understood wrongly. --Diegodlh (talk) 18:13, 25 February 2021 (UTC)

Diegodlh The cite extension uses the Zotero translators to properly convert urls to citations. Over the years, communities such as mine have also developed such software, for instance this huge js which covers Romanian news outlets.

However, there is no easy way to contribute these back to upstream, so they can be reused in the Citation extension, your project, or anywhere else.

What I'm proposing is a visual way of creating such translators: the software would have a list of fields needed, it would try to extract as much of the information as possible automatically, then confirm the data with the user. For the remaining fields, it would display the site in a frame and allow the user to select the element corresponding to that field (e.g. For date, the user would have to click on the date in the page). After a few people confirm the data, the software could automatically create a pull request for the Zotero translators repository.
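A minimal sketch of the workflow described above, in Python and using only the standard library. It tries to fill a list of needed fields from embedded `<meta>` tags, then reports which fields would still need the user to click an element on the page. The field list, the `META_KEYS` mapping, and the function names are all assumptions made for illustration; none of them come from Zotero or Citoid.

```python
from html.parser import HTMLParser

# Hypothetical list of fields a citation needs; not taken from Zotero or Citoid.
REQUIRED_FIELDS = ("title", "author", "date", "publisher")

# Assumed mapping from common embedded-metadata keys to those fields.
META_KEYS = {
    "og:title": "title",
    "citation_title": "title",
    "author": "author",
    "citation_author": "author",
    "article:published_time": "date",
    "citation_date": "date",
    "og:site_name": "publisher",
}


class MetaExtractor(HTMLParser):
    """Collects citation fields from <meta> tags, keeping the first value seen."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("name") or attr.get("property")
        field = META_KEYS.get(key)
        content = attr.get("content")
        if field and content and field not in self.found:
            self.found[field] = content


def draft_translator(html, user_selectors=None):
    """Return (auto-extracted fields, user-chosen selectors, fields still missing)."""
    parser = MetaExtractor()
    parser.feed(html)
    selectors = dict(user_selectors or {})
    missing = [f for f in REQUIRED_FIELDS
               if f not in parser.found and f not in selectors]
    return parser.found, selectors, missing
```

The `missing` list is what a visual tool would ask the user to resolve by clicking; each click would add an entry such as `{"date": "span.data"}` to `user_selectors`. Turning the finished definition into an actual Zotero translator and a pull request is not covered here.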

It's a lot of work, but it could be "gamified" (not the right word, but the best I could find). What I mean is when a user introduces a link in some Citation interface in Wikipedia or Wikidata, the software would generate the Citation but offer the user to improve it. If the user chooses to do so, it would launch the interface described above. Strainu (talk) 07:11, 26 February 2021 (UTC)

@Strainu: I had completely missed your point. I understand now that you mean the Citoid service.
> Over the years, communities such as mine have also developed such software, for instance this huge js which covers Romanian news outlets.
Could you explain what this software does? Could it be replaced by a Zotero translator?
> there is no easy way to contribute these back to upstream, so they can be reused in the Citation extension, your project, or anywhere else.
Just to be sure, I understand you mean "the Citoid service" when you say "the Citation extension" here. I don't know the Citoid service in detail, but I understand one has to create new translators and submit them to Zotero's repository for translators on Github if one wants the Citoid service to use them. This is what you mean is not easy, right? On the other hand, when you say "your project" you mean the Wikicite plugin for Zotero? The plugin only uses two custom Wikidata translators, which I don't plan to upload to the Zotero's repository yet, anyway.
> What I'm proposing is a visual way of creating such translators
Oh, I see now! Sorry, I was reading your comment with a Shared Citations frame, and I couldn't see that you were proposing something that goes well beyond just that. Your visual way of creating translators (more generally, a visual way of creating a scraper) would be something the whole Zotero community (and services using their translators, such as Citoid) would benefit from. Have you checked the Zotero forum to see if this has been discussed already?
> the software would generate the Citation but offer the user to improve it. If the user chooses to do so, it would launch the interface described above
This sounds like a cool idea! I'm not a heavy translator user, so I can't be sure how often translators fail to extract the citation data from websites. What is your experience? In general, I understand it would be better to have websites appropriately expose their metadata rather than developing a new translator, but I understand this is not always possible. --Diegodlh (talk) 20:28, 1 March 2021 (UTC)
> Could you explain what this software does? Could it be replaced by a Zotero translator?
Yes, it's basically a set of translators for a number of Romanian sites considered trusted sources. In Zotero, you would have one translator for each site.
> Just to be sure, I understand you mean "the Citoid service" when you say "the Citation extension" here.
Not only. I mean any method of introducing citations in the Wikimedia world. That includes Citoid, Wikidata (including your project, if I understand it correctly) and Shared Citations in the future, if launched.
I haven't looked in Zotero, my understanding is that it is based on different technologies, I'm interested in something web-based that would work or could easily be ported to the Wikimedia world.--Strainu (talk) 09:48, 2 March 2021 (UTC)
@Strainu: As far as I know, the Citoid service is the only Wikimedia tool able to extract citation metadata from a URL provided. It is planned that Wikidata supports Citoid to help adding statement references, and Shared Citations will integrate with Citoid too. Citoid uses Zotero translators to extract citation metadata from websites.
As I mentioned in my previous reply, I'm not a heavy Citoid (or Zotero translator in general) user, so I'm not sure how common it is to find websites not supported by any of the current translators (and hence by Citoid). The Zotero's translators Github repo does seem quite active. If it is indeed a common experience, it may be worth applying to these open Wikimedia software grants. What do you think? I may try and develop a browser extension that would visually help create new translators and post them to Zotero's translator repo, and to its Wikimedia's fork. --Diegodlh (talk) 19:09, 2 March 2021 (UTC)
I've been doing some research, thinking around the idea and its specific implementation more thoroughly, and talking about it with User:Scann. We think it might be worth it developing one such tool and promoting it among the Wikipedia and Zotero communities.
To sum up, Citoid uses Zotero's translators to fetch citation metadata for a URL provided. Websites which expose metadata appropriately are understood by generic translators. This is often not the case and site-specific translators are needed. However, most of them seem to be for English sources (see here, or here). Although contributions to the Zotero's translators repository are open, they require some technical skills.
The idea, then, would be to develop a web browser extension, a visual scraper editor, that would enable non-technical users to create and edit web translators, define test cases, and post them to Zotero's translator repo (and Wikimedia's fork). This would benefit both Wikimedia and Zotero communities, by widening the coverage and diversity of websites supported by Zotero translators and Citoid.
Before starting to write a proposal, we would really appreciate User:Mvolz_(WMF)'s (Citoid) and User:LWyatt (WMF)'s (Wikicite) feedback. I have also notified the Zotero community, as we would appreciate their feedback as well.
@Mvolz (WMF): I think users may be unwilling to use the extension if they have to wait until their new translators are (1) pulled by the Zotero's translators repo, and (2) incorporated into Wikimedia's mirror. Maybe the browser extension could run a Zotero translator server itself to extract citation metadata with user's translators (until they are used by Citoid) and output a raw citation template (e.g., Cite web template) that the user can copy and paste. For that, the VisualEditor's Citation tool would have to be changed to accept raw Citation templates as well. What do you think? Alternatively, I was thinking of having the Citoid API accept a custom translator provided by the user, but I'm not sure if the translator server is prepared (security-wise) to run user-provided JS code. --Diegodlh (talk) 17:04, 5 March 2021 (UTC)
I have had a thorough discussion about this idea with Zotero developers Dan Stillman and Sebastian Karcher on the Zotero forums. Their main concerns are around the quality and maintainability of translators created with a visual tool aimed at non-technical users. Although I agree that "metadata quality and saving reliability does matter when generating citations, so bad translators can sometimes be worse than non-existent ones", I also agree that there might be "a perfect-is-the-enemy-of-the-good argument to be made". Regarding their suggestion that it may be better to focus on improving embedded metadata (e.g., JSON-LD) support instead, although I agree this would increase translator coverage, it would still leave websites which do not embed metadata (or which embed wrong metadata) out of the picture. See the forum thread for the full discussion. --Diegodlh (talk) 15:06, 8 March 2021 (UTC)
I've started a separate sub-page to discuss this idea in its own space. --Diegodlh (talk) 00:26, 9 March 2021 (UTC)
@Strainu: @LWyatt (WMF): I and User:Scann have been further developing this idea and we posted a proposal here today. We may continue introducing minor changes until the proposal goes out for community review on March 26th. We would appreciate your thoughts and comments in the discussion page, as well as your endorsements if you would like to support it. Thanks! --Diegodlh (talk) 21:34, 16 March 2021 (UTC)

Other record examples? And deleted records?

I think it would be useful to see how you would handle other types of references. Reference URL + date retrieved is obviously very common. Wikidata has a lot of references that are in the form "stated in: database, database id (often an external id on the item), retrieved date". In both cases the retrieval date may be important; if the database or source website changes continuously it's critical. So ideally we want to find an archive.org link on or as close as possible to the retrieval date. The same URL or database may be cited with many different retrieval dates, so you do get a bit of an explosion in specific citation constructs.

Another question I had on this was what happens when a citation is added to this database from a client wiki page creating a reference, but then that reference is removed and the citation is no longer used anywhere. Will there be a cleanup or removal process to delete unused records, or will they stick around? ArthurPSmith (talk) 21:11, 1 March 2021 (UTC)
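A minimal sketch of the "archive link as close as possible to the retrieval date" lookup raised in the first paragraph above. It only builds the query and reads an already-decoded response; it makes no network call. It assumes the Internet Archive's Wayback availability endpoint and its `archived_snapshots` / `closest` response shape; the function names are illustrative.

```python
from datetime import date
from urllib.parse import urlencode

# Assumed endpoint: the Internet Archive's Wayback availability lookup.
WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"


def availability_query(url, retrieved):
    """Build the lookup URL for the snapshot closest to the retrieval date."""
    params = {"url": url, "timestamp": retrieved.strftime("%Y%m%d")}
    return WAYBACK_AVAILABILITY + "?" + urlencode(params)


def closest_snapshot(response):
    """Return the snapshot URL from a decoded JSON response, or None if absent."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None
```

Because the retrieval date is part of the query, the same cited URL with two different retrieval dates may resolve to two different snapshots, which is the "explosion" of citation constructs mentioned in the comment.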

Thanks for taking the time to comment ArthurPSmith.
You're right that the 'devil's in the detail' with how to handle different types of things to cite. A lot of the work of the community in the early stages would be identifying the required new software features to deal with important cases, triaging those requests, and designing solutions for them. As you say, the "easiest" is the simple URL and then perhaps the modern scholarly journal article. But beyond that - even something innocuous as "a book" and there's rabbitholes galore! In this wireframe illustration I tried to indicate the idea that there could be a tight connection to the Internet Archive - not just one link but perhaps multiple links depending on the access-date parameter. Given that I.A. are very keen on this proposal (and have added their own section to the Partners subheading) I suspect they would be able to spend time thinking about this kind of issue you raise and optimise their bot(s) for it. By centralising the reference handling to this database it will make it far easier for them to support such cases.
With regards to second point - about "unused records" - this is one of the things I can't make a by-fiat solution for, but did try to identify in the proposal document as a known-problem. It is, ultimately, something that the community would need to come to an agreement about, as it is an editorial policy decision. In the Workflows subheading I wrote:

Deletion and under what circumstances. If a record is being used in any Wikimedia project that would make it automatically valid for retention. But what happens when a citation is no longer used in any Wikimedia page?

At one end of the scale - the most obvious case for something which is unused being retained is the example I provided of a journal article in the wireframe illustrations (the same wireframe I referred to in the previous paragraph). In it you can see that this is a journal article with a "publication status=retracted", and zero current "citelinks" but many former "citelinks". [Citelink is my cute term - riffing off Wikidata's "sitelinks" - meaning "instance of use as a citation"]. I think this is an obvious use case where we would like to retain the information about a reference which was used validly at some point, but is no longer.
At the other end of the scale, is what I wrote in the Open questions section regarding Spam:

Nonetheless, spammers will undoubtedly find innovative ways to stress the system. A potential way to mitigate the risk would be to have a "speedy delete" policy for newly created Shared Citation database record where the associated and sole reference on the originating wiki was itself removed within a short period of time.

I could imagine someone creating a user sub page with 100 reference templates to the different pages of their personal blog, consequently creating the 100 new Shared Citation records, and then immediately deleting that user subpage. I think this is an obvious use case for mass deletion of those items.
There are various technical and editorial policies that the community could enact for this. You could, for example, restrict the creation of shared citation records to only references used in mainspace. However that would prejudice against user: or draft: namespace creations. My instinctive proposal is a "probation period", whereby: if a reference is removed from all instances where it was added within, say, one week of its first creation – then it should be deleted from the Shared Citations database. There would need to be some kind of "flag" placed on such "probationary" items to make them visible/sortable.
Sincerely, LWyatt (WMF) (talk) 10:23, 2 March 2021 (UTC)
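The "probation period" rule suggested above can be sketched as follows. This is only an illustration under stated assumptions: the record shape and field names are hypothetical, and the one-week window is the "say, one week" figure from the comment, not a decided policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumed probation window; the comment suggests "say, one week".
PROBATION = timedelta(weeks=1)


@dataclass
class CitationRecord:
    """Hypothetical record shape; field names are illustrative only."""
    record_id: int
    created: datetime
    current_citelinks: int = 0
    # When the last remaining use was removed, if that has happened.
    last_removed: Optional[datetime] = None


def is_probationary(rec, now):
    """True while the record is still inside its probation window (the 'flag')."""
    return now - rec.created <= PROBATION


def should_speedy_delete(rec):
    """True when every use was removed within the probation window."""
    if rec.current_citelinks > 0 or rec.last_removed is None:
        return False
    return rec.last_removed - rec.created <= PROBATION
```

Under this rule the spam case (100 records created and orphaned the same day) is deleted, while the retracted journal article with many former citelinks is kept, because its last use was removed long after the window closed.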
Why do you believe that a URL is simple? URLs have titles that might change over time. Some people might want the title to change when the source website changes the title while others might prefer to keep the title as it was when a reference was made.
You're right, there are many cases and circumstances when a reference to a webpage is not "simple" - for the reason you gave but probably many others I've never even heard of. My point was that, arguably, a stable and clean URL to a notable source is probably one of the most common and easily understood kinds of reference types that we have in the Wikiverse, and should be one of the first kinds of reference formats that should be built in. People would be surprised if "handling a reference to a website" wasn't one of the things that the database could do from day 1. But for what it's worth I've struck out the word "simple" to not give the impression that I think that handling URLs would always and necessarily be simple. LWyatt (WMF) (talk) 14:42, 3 March 2021 (UTC)

What's a citation?

This is a long proposal and I don't see anywhere an attempt to define what's meant by a citation. When two Wikipedia articles cite the same journal article, the two citations might differ both in which page number they reference and in when they were retrieved (those two come immediately to my mind but there might be others). ChristianKl❫ 17:21, 2 March 2021 (UTC)

Thinking about it a bit we sometimes also have quotes in Wikidata reference statements. ChristianKl❫ 22:41, 2 March 2021 (UTC)
Ultimately, the level of granularity for what each record should cover would be a community consensus decision + based on what is technically possible. There would also need to be edge cases: Dictionaries and Bibles (just to give two examples) have different kinds of sub-section structures for how you reference them and so might need their own special way of being handled.
Nonetheless, what I'm proposing is indicated most clearly in this wireframe. The idea would be that separate references to the same subsection of a work (e.g. page, pagerange, chapter, or possibly even 'quote') would be grouped. If there was no subsection specified then it would be implied to reference the whole document. With these wireframe illustrations I wished to demonstrate that at least 1 possible technical implementation was possible, without trying to suggest that this particular implementation was 'set in stone' - it would be a combination of community consensus and technical feasibility. LWyatt (WMF) (talk) 14:49, 3 March 2021 (UTC)
So is your suggestion that in addition to the datatype of Citations there's the datatype of Citelink and that citelinks can have additional properties like Retrieved/Page/Quote/ArchiveUrl? ChristianKl❫ 10:54, 4 March 2021 (UTC)
I don't wish to pretend I have all the answers, or even that I have a specific technical implementation preference - just that these are the kind of things which would indeed need to be worked out. The important - at this stage - is to demonstrate that there are no inherently insurmountable obstacles, and the medium-obstacles (which we might classify this example as being) could be addressed in a couple of ways.
In one of the wireframes I proposed that a potential way this might look in the sourcecode on Wikipedia might be: <cite>{{C|1234567#123|style=1|freetext}}</cite> whereby the #123 indicates the subsection (perhaps the pagenumber 123 in this arbitrary example). The Shared Citation record (number 1234567 in this arbitrary example) would then display this 'citelink' grouped with any other usages across anywhere else in Wikimedia which link to the same subsection (page 123). Does that answer your question? LWyatt (WMF) (talk) 14:24, 4 March 2021 (UTC)
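A minimal sketch of how the hypothetical `{{C|...}}` syntax above could be read and how 'citelinks' could be grouped by record and subsection. The syntax itself is only the wireframe's arbitrary example; the regular expression, function names, and the handling of named versus free-text parameters are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Matches the hypothetical {{C|<record>#<subsection>|...}} syntax from the wireframe.
C_TEMPLATE = re.compile(r"\{\{C\|(\d+)(?:#([^|}]+))?((?:\|[^|}]*)*)\}\}")


def parse_c(wikitext):
    """Yield (record_id, subsection or None, named params, free text) per template."""
    for m in C_TEMPLATE.finditer(wikitext):
        named, free = {}, []
        # group(3) looks like "|style=1|freetext"; drop the empty first piece.
        for part in m.group(3).split("|")[1:]:
            if "=" in part:
                key, value = part.split("=", 1)
                named[key.strip()] = value.strip()
            else:
                free.append(part.strip())
        yield int(m.group(1)), m.group(2), named, " ".join(free)


def group_citelinks(pages):
    """Map (record_id, subsection) to the list of page titles that use it."""
    groups = defaultdict(list)
    for title, text in pages.items():
        for record_id, subsection, _named, _free in parse_c(text):
            groups[(record_id, subsection)].append(title)
    return dict(groups)
```

A template with no `#subsection` is grouped under `(record_id, None)`, matching the idea that an unspecified subsection implies a reference to the whole document.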

Adding Bias to Problem Statement/Goals

Pasting my comment here for any thoughts/discussion:

Thanks, Liam and Noé. Gaps and biases are a significant problem, as large as the ones listed in the proposal's problem statement, and equally relevant. The findings of the writer in The Washington Post story ( Wikipedia’s political science coverage is biased. I tried to fix it.) speak to a very deep problem that isn't represented among those driving this work. If knowledge equity is fundamental to the goals of the movement, it is also fundamental to the problems that must be solved. It's great that you noted a use case for gender bias research, but the existence of gender bias and other harmful biases is known, and belongs beside the other established problems and goals that motivate this work (verifiability, anti-disinformation, knowledge integrity, duplication, repetition, manual effort). It's much easier to understand that something is a priority--one that people care deeply about and are committed to realizing/solving--when it's at the top. Is there a reason not to do this? OpenSexism (talk) 18:03, 2 March 2021 (UTC)
  • The main reason is that whether citations are shared between Wikiprojects or aren't shared isn't central to the question of gaps and biases. A large part of what produces gaps and biases are editorial decisions and this project is not about interfering with editorial decisions of existing Wikiprojects beyond giving them tools to inform themselves. Challenging the editorial independence of individual Wikipedias would be a way to doom this project in the typical way WMF efforts get into problems when they seek fights with Wikipedias. ChristianKl❫ 21:02, 2 March 2021 (UTC)
    • Yes, as User:ChristianKl indicates, this proposed project is definitively framed as agnostic to the editorial policies on each Wiki. Nonetheless - simply by being able to visualise the different kinds of knowledge/equity gaps in Wikimedia project references - within, among, and changes-over-time - makes it that much easier for the communities on those projects to try to address them. Currently it is simply not possible to say with ease and relative certainty "what is the gender ratio of authors cited in category:medicine articles on English Wikipedia compared to French?". With some caveats, that would become possible with this kind of citation database - making arguments and campaigns to help redress that gender-gap all the easier to mount. LWyatt (WMF) (talk) 14:53, 3 March 2021 (UTC)
  • @OpenSexism: given there is the ongoing discussion here - and also our parallel conversation on twitter, can I request that you copy all of the other parallel discussion from the main page of this proposal (the thread of commentary following your initial 'oppose' endorsement message) and move it over to this page instead? It's weird we're having this conversation on the proposal page AND its talkpage. Based on that twitter conversation, I have the impression that you would like to work with me to improve the proposal documentation, rather than your being opposed to the proposal definitively. If you could suggest some practical ways I can rewrite the proposal document to incorporate your concern more thoroughly, please suggest them. LWyatt (WMF) (talk) 17:49, 3 March 2021 (UTC)
    • Thanks, Liam. How about something like the following:
Citations are the core of verifiability, anti-disinformation and knowledge integrity in our movement, and ensuring that the demographics of primary source authors are representative is an important dimension of knowledge equity. Our Verifiability policy has become the backbone of the reliable web. Wikimedia’s citations are one of our greatest assets. However, because they are stored as raw “inline” text in each content page, references are also one of our biggest burdens.
Our references are high in maintenance, technical complexity, and duplication of effort, and suffer from gaps and biases that are difficult to quantify and subsequently to address. In order to reach the 2030 goal of eliminating the gender gap, for example, better tools for understanding and monitoring progress are required.
Currently, the burden of reference creation and maintenance is shouldered by repetitive, manual, volunteer effort which is disproportionately felt by smaller communities. Individuals and groups working towards addressing systemic citation bias lack tools that would make imbalances visible to the broader community.
... Not sure what you mean by copy the posts and put them here (the text? the wikitext?) OpenSexism (talk) 17:21, 4 March 2021 (UTC)
Fair suggestions OpenSexism. I've integrated these sentences - with some copy edits - in this diff.
What I meant by 'copy the posts' was to remove the threaded conversation that we've been having on the main proposal page in the 'endorsements' section, and paste it here in this subheading of the talkpage. This would neaten up that section of the proposal page, but also it would consolidate the conversations here. It's suboptimal to have the conversation split. I would have moved all that content over myself, but, especially since it is in response to your 'oppose' comment, I didn't want this to appear like censorship on my part. LWyatt (WMF) (talk) 14:39, 5 March 2021 (UTC)
  • Thanks, Liam. I understand that you prefer comments that express support in the 'endorsement' section, so is there an equally visible space for expressing concerns in the working draft? OpenSexism (talk) 19:28, 7 March 2021 (UTC)
I suppose "equally visible" is in the eye of the beholder... Ultimately, this talkpage is the place for concerns (and ideas) to be raised, discussed, and hopefully addressed. Moving that thread of conversation here would at least consolidate the conversation to one place rather than spread across two. Ultimately, I would naturally like to hope that you would feel your concerns are adequately addressed via this conversation and would feel confident to change your comment from overt opposition to at least neutral or ideally supportive - but that's a separate point! LWyatt (WMF) (talk) 17:25, 8 March 2021 (UTC)
  • You are working with the community to include feedback, which is great. Yet the impulse to retroactively impose norms upon the discussion is not true to its history and our collaboration. Since more eyes "behold" the proposal than the talk page, perhaps better would be a link from the endorsements section to this discussion. OpenSexism (talk) 21:19, 9 March 2021 (UTC)
  • It's not attempting to retroactively impose norms, but to consolidate the conversation in one place. The norms are that the "endorsements" heading on a proposal page is a place for people to list their endorsement if they wish - and this talkpage is the place for conversation. I'm specifically not removing/editing that section of the mainpage myself lest it be seen as retroactive imposition, but I do agree that linking from there to here would do readers a better service. LWyatt (WMF) (talk) 15:31, 10 March 2021 (UTC)
  • In your reply to ChristianKl, you mention 'Currently it is simply not possible to say with ease and relative certainty "what is the gender ratio of authors cited in category:medicine articles on English Wikipedia compared to French?". With some caveats, that would become possible with this kind of citation database.' Is there anything that should change in the body of the proposal to minimize these caveats and/or better support achieving the goals laid out in the problem statement? OpenSexism (talk) 20:29, 9 March 2021 (UTC)
  • The caveats I was thinking of when I wrote that sentence are two: 1) The proportion of references on articles using the 'shared citations' method (rather than existing citation templates) would need to be high, and across many articles, before it became statistically reliable. If only a small number of references used this new style, or they were only the recent publications, or they were only on a specific subset of articles, then it would not be a representative answer. So - increased and consistent uptake of Shared Citations references itself is the way to mitigate that. And 2) the gender of the author (or any other demographic information) would be on the author's Wikidata item - which would be linked to from the Shared Citations record. The Shared Citations database itself won't have items for authors, just works. Wikidata is where the authors' data is collected. So, it would only be possible to get a statistically significant result for a query for authors' gender when a large number of the authors cited in Wikipedia articles also have their own Wikidata items. The proposal document already mentions in the Use Cases subheading that it would enable the community to "Create ‘redlists’ for Wikidata of authors and publications which are frequently cited but don’t have a Wikidata item". That is the method by which caveat no.2 would be mitigated. LWyatt (WMF) (talk) 15:31, 10 March 2021 (UTC)
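To make the two caveats concrete, here is a minimal sketch of how such a gender-ratio query might report its own coverage alongside the counts. Everything in it is an assumption for illustration: the record layout, the QIDs, the `author_gender` lookup (standing in for the gender statement on each author's Wikidata item) and the function name are hypothetical, not part of the proposal.

```python
from collections import Counter

# Hypothetical records: each cited work lists its authors as Wikidata QIDs,
# or None where the author has no Wikidata item yet (caveat 2).
citations = [
    {"wiki": "enwiki", "authors": ["Q1", "Q2"]},
    {"wiki": "enwiki", "authors": ["Q3", None]},
    {"wiki": "frwiki", "authors": ["Q1", None, None]},
]

# Hypothetical lookup standing in for the gender statement on each author item.
author_gender = {"Q1": "female", "Q2": "male", "Q3": "male"}


def gender_summary(records, wiki):
    """Count cited-author genders for one wiki and report the resolvable share."""
    counts = Counter()
    total = 0
    for record in records:
        if record["wiki"] != wiki:
            continue
        for qid in record["authors"]:
            total += 1
            gender = author_gender.get(qid) if qid else None
            if gender:
                counts[gender] += 1
    resolved = sum(counts.values())
    # Coverage is the fraction of cited authors whose gender could be looked up.
    coverage = resolved / total if total else 0.0
    return dict(counts), coverage
```

A low coverage figure signals that the ratio is not yet representative, and the unresolved authors are the natural candidates for the 'redlists' mentioned above.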

versions and errors

I just added the question on this to the main page. For many of the works cited as examples, there exist other equally good forms of the same reference, or equally good versions of the reference in other versions of the work--Ovid's Metamorphoses (en:Metamorphoses) is perhaps the clearest example. The page cites one particular edition of one particular English translation. Other printings of the same translation from the same publisher are likely to have different page numbering; dozens of other English translations exist; dozens of other languages have translations, made over a span of centuries, many in later reprints and adaptations--some will be fairly literal and easy to relate to the original; some much less faithful; there are multiple editions of the original Latin version, and quite apart from the presentation, the text in them will be slightly different. In addition, scholarly references will be to the original Latin book and line numbers, even if a translation is being cited. Even more generally, references in WP will be to individual lines in the work, or individual books, or the work as a whole. The only ways of handling these are a concordance of all possibilities, or the adoption of a standard and the adjustment of everything to it--both of these are major research projects in their own right, on which a person could build a career.

There are analogous problems for journal articles. Print and electronic versions are not always identical--people have often cited the open access "Preprint" format, not the definitive reference. The differences are relatively minor, but if only one is given, people using the other format cannot find the article. What is not minor is that people have sometimes cited an article while seeing only an abstract or worse, only a title, sometimes even from a journals database such as ProQuest as if ProQuest were the publisher. Often, and it is even suggested as a use for this on the proposal page, they will have copied it from another article without seeing any form of the original, and without really knowing if it even applies (academics do that also, not just WPedians). The entire concept that someone can copy references they haven't seen in order to reference articles is a violation of WP:V (and of all academic practice). Even I almost never in approving an AfC actually check references against the original, though sometimes I have enough doubts that I actually do check in context of an AfD, and quite a few times the reference is wrong or used wrong, or turns out when seen to be an advertisement. Adopting this proposal as it stands will certainly produce more references in articles, just as using unrevised translations can produce more articles--in both cases, if we are being used by the public as their major reference source, we are not fulfilling our responsibility and not meeting even our own weak standards.

I am concerned that we will build a database that is inaccurate and inconsistent, deluding ourselves we are approaching accuracy when we are just adding complexity. When I came in 06 I originally wanted to improve the accuracy of WP, but soon concluded it was hopeless. Just as in Wikipedia I now pretty much confine myself to removing the worst trash and promotionalism, if I work on this, I will again only try to clear up the worst of the errors. I can't force others to work towards an accuracy that may be beyond their abilities and available facilities, nor can I dissuade programmers from designing elaborate structures that ignore the weakness of the foundations.

I can predict the response to this: we should build it first, and fix it later. How many of the 6 million enWP articles are actually reliable? How many more are likely to be 5 years from now? DGG (talk) 07:13, 26 March 2021 (UTC)

This is the kind of question that demonstrates you've read the proposal and understand its inherent advantages and weaknesses! The meta answer is: it's for the community to decide on editorial matters.
My personal initial response is that a valid reference is a valid reference - even if there are different editions/versions of the same text being cited in different places across the wikis. These inconsistencies already exist, and they're not inherently a 'problem'; a central database would only help to make any inconsistencies visible and therefore fixable (if they are problematic). I can imagine a specific metadata property being created to the effect of "Different version of same work" - which would be able to bi-directionally connect two (or more) actively used citations which are to effectively the same original work. They might have different pagination, for example. It would then be possible to cross-reference the "status" or "year" metadata properties to see if one of those references is from a superseded edition of the work (e.g. academic preprint or older edition of a textbook). From there, it would be possible for motivated editors to create worklists to update the superseded citations. There's nothing inherently wrong with two different editions of Ovid's Metamorphoses being cited in different articles. But, if a Wikipedia language community decided that only 1 edition is the canonical version, then this system would enable the community to trace and replace the others. LWyatt (WMF) (talk) 11:51, 26 March 2021 (UTC)
I see you also include some of this in the sketch you referred to earlier. In terms of my questions, if we proceeded in that way, it does have "pages", "editions" and "works" as layers, but for many works we'll also need "language" and "translation". And these layers are only given for the instances where people specify them, which are a minority. It leaves out quite a few special cases, including the recommended one--that since there is both a print and electronic version of this particular edition, it is advisable to cite both. It's not a question of having references to 2 editions--it will become having references to 10 in enWP and perhaps 30 in the various WPs, and another 50 being too unspecific to know what they're referring to. I agree that if we were just collating references we could fix this up later, though we might not ever be able to figure out what the unspecified ones were actually referring to, and I suspect most of the time the people citing them would not even know that themselves, but just copied another reference. Further, there are no superseded versions when one cites what one has seen. If a person cites the 16th century translation, then the reference must be to that. If what they saw is the first ed., and there's a third, they must cite the first--the software in Wikidata will tell them there's another, and tempt them to cite the newest rather than the one they use. People working in fields who know academic practices generally do it right now--this is a small minority of editors.
The problem is the ability for people to use what we have collated to make a reference themselves. Only a few of them with the book before them will actually try to get it right. Most of them, making an article on some aspect of the subject and desiring to support or even just embellish it with a reference to the work, will copy any reference that looks likely. We already have the problem with people copying from one article to another works they have never seen; and this will facilitate people doing this as the standard practice. Citing something one has not actually seen is faking verifiability. It will end up like the set of smilies on a cell phone.
When WP started, nobody really foresaw what would be the real problems. They instead constructed elaborate software to deal with problems that have never arisen, and I think this is true with Wikidata also. There is probably no real alternative to doing this again, as systems develop as people use them. I'm not suggesting we not do this, just warning that it will be much more difficult than we are thinking, just as making a reasonably accurate encyclopedia has proven more difficult than we thought at first. As for me personally, I am not going to participate in the design of an inadequate and complicated system that tries to interface with multiple other complicated and erroneous systems. I know from experience here and elsewhere that the parts I think important will not be the parts that are developed. I will do what I am trained and experienced in doing, and help clear up the resulting mess later. This is not depreciating what you are doing--if we relied on people like me, nothing would ever get built. But putting it another way, if individual encyclopedia articles have n problems, the encyclopedia as a whole has n², the current WM complex n³, and this will be n⁴. And in the end, I do see the fun of the adventure. DGG (talk) 00:13, 27 March 2021 (UTC)
I certainly concur with you that 'building a new database' could be a way to simply magnify and further-complicate existing problems - but, like you say, it will be an adventure to try to do it right! For a lot of the current inconsistent and inaccurate citation practices that you mentioned (or other kinds of related problems that previous commenters in this talkpage mentioned) it's important to note that these problems are already existent in our projects: this proposed database would simply make them visible for the first time. That shouldn't be considered a flaw of this project if, through it, we can aggregate all the different incorrect ways that Ovid's Metamorphoses has been referenced across the whole wikiverse. Rather, in my opinion, that should be seen as an opportunity for cleanup that never existed before. But yes - we need to ensure that it is not built in such a way that encourages blind re-citing of works which are the wrong version. To ensure that would require lots of careful user-interface research, to identify a good way for people adding a new reference to be 'suggested' existing items in the database, and allow them to say "this one, but edition 3, not edition 2" and for it to seamlessly create that new record in the database. LWyatt (WMF) (talk) 15:42, 29 March 2021 (UTC)
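As a rough illustration of the ideas in this thread (a bi-directional "Different version of same work" link, creating "this one, but edition 3, not edition 2" as a new record, and worklists built from the "year" property), a minimal sketch might look like the following. The `Citation` fields, the in-memory `records` store and all function names are hypothetical; nothing here reflects an actual Shared Citations schema. The worklist only flags records for human review, since, as noted above, an editor must cite the edition they actually saw.

```python
from dataclasses import dataclass, field


@dataclass
class Citation:
    """Hypothetical shared-citation record; field names are illustrative only."""
    record_id: str
    title: str
    edition: int
    year: int
    other_versions: set = field(default_factory=set)


# Hypothetical in-memory store standing in for the citation database.
records = {}


def add_record(record_id, title, edition, year):
    records[record_id] = Citation(record_id, title, edition, year)
    return records[record_id]


def link_versions(id_a, id_b):
    """Record 'different version of same work' in both directions."""
    records[id_a].other_versions.add(id_b)
    records[id_b].other_versions.add(id_a)


def derive_edition(source_id, new_id, edition, year):
    """'This one, but edition 3, not edition 2': copy, adjust, and link."""
    source = records[source_id]
    new = add_record(new_id, source.title, edition, year)
    # Link the new record to everything the source was already linked to.
    for other in list(source.other_versions):
        link_versions(new_id, other)
    link_versions(new_id, source_id)
    return new


def older_version_worklist():
    """List records that have a linked version with a later year, for review."""
    return sorted(
        r.record_id
        for r in records.values()
        if any(records[o].year > r.year for o in r.other_versions)
    )
```

For example, after `add_record("c1", "Some textbook", 2, 2005)` and `derive_edition("c1", "c2", 3, 2012)`, the two records point at each other and `older_version_worklist()` lists only `"c1"`.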