Group 3: Representing citations and citation events

Room 123, 4:00 - 6:00 pm • Etherpad: Room 123

Goal

Discuss how to express the citation of a source in a Wikimedia artifact (such as a Wikipedia article or a Wikidata statement) and review alternative ways to represent such citations.

Participants

Adam Becker (Open Journal, Freelance Astrophysicist)
Adam Shorland (Wikimedia Deutschland, Wikidata)
Daniel Kinzler (Wikimedia Deutschland, Wikidata)
Daniel Mietchen (National Institutes of Health (NIH), organizer)
Elizabeth Seiver (Public Library of Science (PLOS))
Joe Wass (Crossref)
Jonathan Dugan (organizer)
Karen Coyle (KarenCoyle.net)
Laura Rueda (DataCite)
Terry Catapano (Plazi Verein / Columbia University Libraries)

Summary

This working group discussed the use cases and data needs for structured citations. We defined the terminology for the discussion, created a recommendation for the data structure of citation instances, and explored how the existing infrastructure and community needs would support a transition from the current systems for representing citations to a more structured system stored in Wikidata.

Introduction

Citations are the connections between all of humanity's knowledge; as such, they are interesting in their own right. There are many people who would like to access citations from Wikipedia and other Wikimedia projects, independently of the citing objects themselves, in order to analyse the citation network and provide various services. Thus, we need a data standard for citations in Wikimedia projects. The standard must be flexible enough to handle the wide variety of citations, and it must be machine-readable. Here at WikiCite 2016, we've taken the first steps toward developing such a standard.

Terminology

A citation instance is a relationship between a citing resource (the citation origin) and a cited resource (the citation target). The citation appears at a particular point in the origin (the origin anchor) and can point to a particular point in the target (the target anchor, which in many cases does not exist or is not mentioned). [We also discussed that it would be useful to indicate what exact portion of the text a particular cited source supports. Let's call it the "domain". It would perhaps be uncommon, but it would be quite nice to have. I thought that this would basically be a fancier anchor, but it now seems to me that the anchor and the "domain" should be separate: a citation may be placed at the end of a sentence or paragraph for typographical reasons, while the statement it supports is not at the end. Do others agree that this would be useful to mention here at least as a point of consideration?]

(kcoyle): Is a bibliographic reference that is recognizable as a reference (e.g. has an ISBN or DOI) but is not between <ref></ref> tags included in this? There are many bibliographic "things" in lists of works or further reading sections. The origin anchor in this case would/could be the article itself.

For example, the Wikipedia article on capybaras (Capybara) cites Charles Darwin's book The Voyage of the Beagle. In this citation instance, the citation origin is the capybara article, and the origin anchor is the point in that article at which this citation appears (at the end of the section on Etymology). The citation target is The Voyage of the Beagle, and the target anchor is page 619 of that book (as indicated in the citation). These four things uniquely define the citation instance (as long as we do not take versions into account). Another citation, identical other than the origin anchor, could easily exist: the capybara article could cite Darwin's book more than once. But this would, for our purposes, be considered a separate citation instance. (In contrast, a target anchor in a single citation instance can be multi-valued (e.g. pp. 1, 3-5, 180-184). Q: could there just be multiple target anchors instead? [These notions seem equivalent to me. Speaking of multiple citations having the same anchor seems clearer to me than saying that an anchor is multi-valued. In my mind, the anchor's "value" would be the word offset or some such, of which there would be only one.])

Recommendations

Data structure

Citation instances can be modeled as follows:

    Instance ID
    Citation origin
    Origin anchor [perhaps add the domain, see above]
    Citation target
    Target anchor
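
As a rough illustration, the five fields above could be written as a small Python record. This is only a sketch: the class name, the field names, and the identifier strings are assumptions made for the example, not a format the group agreed on. The target anchor is optional because, as noted in the Terminology section, it often does not exist.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CitationInstance:
    """One citation instance with the five fields listed above (hypothetical names)."""
    instance_id: str                     # format is an assumption; identity is still undecided
    origin: str                          # the citing resource
    origin_anchor: str                   # where in the origin the citation appears
    target: str                          # the cited resource, ideally a unique identifier
    target_anchor: Optional[str] = None  # often absent or not mentioned


# The capybara example from the Terminology section, with made-up identifier strings.
capybara_cites_darwin = CitationInstance(
    instance_id="example-1",
    origin="Wikipedia article: Capybara",
    origin_anchor="end of section 'Etymology'",
    target="The Voyage of the Beagle",
    target_anchor="p. 619",
)

# A citation of a website would typically have no target anchor.
website_citation = CitationInstance(
    instance_id="example-2",
    origin="Wikipedia article: Capybara",
    origin_anchor="some other point in the article",
    target="some website",
)
```

A second citation of the same book from a different point in the article would be a separate record with a different origin anchor and instance ID.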

While this appears straightforward, details of implementation are complicated:

The target ideally points to a unique identifier (DOI, ISBN, Wikidata entry, etc.) that contains all the necessary bibliographic data. (The question of what bibliographic data are necessary is beyond the scope of this working group.)
The target points to the current version, not a specific version, so updates to the bibliographic data are reflected in any new rendering. [do we need to define "rendering"? Do we need to explain the implications of caching?]
It might be nice if the origin anchor specified what section of text is covered by a citation, but this is uncommon, as authors are often intentionally vague in citation origin anchoring. Thus, text selection in the origin anchor should not be required. [perhaps treat the "domain" of the citation separately from the anchor, see my comment above]
It would be nice for the target anchor to be as specific as possible, but in most cases the target anchor will simply not exist, especially if the target is a website or scholarly article.
Citation references are managed as part of wikitext (for now). The complexity is greatly reduced by being able to reference bibliographical data, instead of specifying it inline.
The citation style is defined locally in wikitext, e.g. by specifying a template name or parameter.

Several details remain undecided:

How do we track citation identity across page revisions? Perhaps we inject a UUID into the wikitext? This is ultimately a question about the nature of the citation instance identifiers ["identity"?], and this is still up in the air.
How do we specify origin anchors that are robust against editing unrelated sections of the origin text? [->Hypothes.is]
How do we specify origin anchors that point to the exact location of the citation in the origin text, yet also retain information about co-citation groups (e.g. [5][6][8]) in a machine-readable way? A machine should be able to easily tell that [5][6][8] are all in the same place, yet also know the order in which they appear. It should be able to do this without retrieving the full origin text. And finally, this must all be robust against edits in other parts of the page. This is a difficult, but solvable, technical problem.
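
To make the co-citation question more concrete, here is one possible shape for the data, sketched in Python. It assumes each citation carries an opaque anchor identifier plus its position within the group at that anchor; both fields, and all the values, are hypothetical. It does not address how anchors stay stable across edits, which is the open part of the problem.

```python
from collections import defaultdict

# (anchor id, position within the group, citation label); all values are made up.
records = [
    ("anchor-a", 1, "[6]"),
    ("anchor-a", 0, "[5]"),
    ("anchor-b", 0, "[9]"),
    ("anchor-a", 2, "[8]"),
]


def co_citation_groups(records):
    """Group citations by anchor and return labels in their order of appearance."""
    grouped = defaultdict(list)
    for anchor, position, label in records:
        grouped[anchor].append((position, label))
    # Sort each group by its stored position, then keep only the labels.
    return {
        anchor: [label for _, label in sorted(items)]
        for anchor, items in grouped.items()
    }


groups = co_citation_groups(records)
```

With records like these, a consumer can tell that [5], [6], and [8] share one location and recover their order without fetching the origin text.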

Bibliographic metadata

Citations can reference bibliographic data as Wikidata items (e.g. a book like The Origin of Species). Details such as chapter and page can be supplied as local anchor information.

Alternatively, citations may contain bibliographical data directly, as part of the wikitext; we continue to use existing mechanisms like templates to structure them. This "legacy" citation format would preferably be discouraged after enough time has passed.

Modelling all cited sources as separate Wikidata items may be impractical due to maintenance overhead, community capacity, and database scaling.

We could also "recycle" the references attached to a Wikidata statement (e.g. "water boils at 100°C" is supported by [1][2][3]...), by referencing a Wikidata statement itself [do we need to explain "Statement"? It refers to a technical concept from the Wikibase data model here], implicitly citing the references that support that Wikidata statement. This introduces very difficult questions about transparency and mutability of citations over time. These questions, in turn, relate to larger issues about how scientific paradigms change and break over time, how scholarly communities communicate allegiance to sets of ideas through citations, and the cultural differences between "traditional" scholarly communities and the community of Wikipedia editors. We discussed these at great length and decided that we were unlikely to resolve these issues in a timely fashion.

Discussion

Data availability and use cases

Once this standard is in place, a machine-readable representation (e.g. as JSON) of all citation instances should be made available via a data API for every version of every page. The necessary data extraction can be achieved via a parser function or Lua library while parsing the wiki page.
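
As a sketch of what such a JSON representation might look like, the Python snippet below serializes one citation instance for one page revision. The key names, the revision placeholder, and the overall layout are assumptions for illustration only; no such API or schema was defined by the group.

```python
import json

# One citation instance, using the capybara example; key names are hypothetical.
citation_instances = [
    {
        "instance_id": "example-1",
        "origin_anchor": "end of section 'Etymology'",
        "target": "The Voyage of the Beagle",
        "target_anchor": "p. 619",
    }
]

# Citations are reported per page revision; the revision value is a placeholder.
document = {
    "page": "Capybara",
    "revision": "placeholder-revision-id",
    "citations": citation_instances,
}

payload = json.dumps(document, indent=2)

# A consumer can read the citations back without parsing any wikitext.
decoded = json.loads(payload)
```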
These stand-alone citations could then serve many purposes and enable many services. These include (but are not limited to!) the following:

A citation recommendation engine. "I see you've cited A. Perhaps you'd also like to cite B, C, and D?" Such an engine could suggest papers to read or to cite. It could show you the most relevant citations for a particular subject or by a particular person. This would be enhanced by tracking co-citation groups. It could also suggest citations to support translations of articles ("I see you are starting a translation of an existing article. Here are some citations used in other language versions of Wikipedia that might be helpful.").
Publishers and other content providers would be extremely interested in article-level metrics (ALMs) derived from a database of citations, and in seeing how their content is used across Wikimedia properties.
A large citation database would, in effect, be a dependency tree for knowledge: fact X in article A is supported by objects B, C, and D, and so on. Performing network analysis on such a tree could lead to very interesting results.
[Simply having a large corpus of citations with nicely tagged topics and co-location information would be very nice for researchers, too. Though we don't really have a citation graph here, since Wikipedia pages don't cite each other.]
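
The recommendation engine mentioned in the list above could start from simple co-occurrence counts: sources that are often cited by the same pages as A are candidates to suggest alongside A. The Python sketch below uses toy data and a hypothetical `recommend` function; it is a naive baseline for illustration, not a method the group proposed.

```python
from collections import Counter

# Citing page -> set of cited sources; toy data for illustration.
cited_by_origin = {
    "article-1": {"A", "B", "C"},
    "article-2": {"A", "B"},
    "article-3": {"A", "D"},
    "article-4": {"C", "D"},
}


def recommend(source, cited_by_origin, limit=3):
    """Suggest sources most often cited by the same pages that cite `source`."""
    counts = Counter()
    for targets in cited_by_origin.values():
        if source in targets:
            counts.update(targets - {source})
    # Highest co-occurrence first; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:limit]]


suggestions = recommend("A", cited_by_origin)
```

Here B ranks first for A because two pages cite both, while C and D each co-occur with A once.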

Further resources

Hypothesis: Robust anchors for electronic documents
Citation Ontology
Open Annotation Ontology
Rich Citations (structured JSON citation objects)

Appendix: workgroup notes

Raw notes from group 3.