WikiCite 2018/Program/Building a WikiCite corpus

Abstract

In WikiCite contexts, a corpus is a set of Wikidata entries that share some common characteristics, for instance having the same author, translator, language or topic, being cited from the same Wikipedia page or having been published from within a geographic region or within a specific period. In this talk, I will explore some examples of such corpora that have been or are being assembled on Wikidata and highlight how they can be used to improve data quality, data models, tools and workflows or simply to gather a deeper understanding of the relationships between elements of the corpus. These examples include corpora under the auspices of the WikiProjects Wikipedia Sources, Retractions, Invasive Species, Kākāpō as well as Zika Corpus and others.

Presentation

Defining a WikiCite corpus

Multiple approaches are possible here; I am just illustrating some.

Primary corpora

things that have been
- authored
- published
- cited
- archived
- used as a reference on Wikimedia platforms

Secondary corpora

things related to things that form primary corpora, e.g.
- topics of things that have been published
- authors of things that have been cited
  - Danish female authors, ordered by citations known to Wikidata
- people with the same author name string
  - 7x "Li Li"
- events attended by authors
  - for instance WikiCite 2018 (see next talk)
- institutions with which authors are/were affiliated
  - e.g. James Mason University people
- things published in a given language
  - see talk by Jason Evans
- collections of things that form primary corpora, e.g.
  - corpus of things cited from Wikipedia that have a persistent identifier for publications
    - WikiProject Wikipedia Sources
- items or properties required by things that form primary corpora, e.g.
  - Bibliographic properties
  - newspaper, book or journal items need publisher items
    - see also WikiProject Books or WikiProject Periodicals
  - journal article items need journal items
- tools or workflows around things that form primary corpora
  - e.g. SourceMD or Using OpenRefine to extract affiliation information from ORCID
  - note that work on corpora stimulated the development of such tools
- Wiki(m|p)edia citation templates for things that form primary corpora
  - e.g. Citation templates on the English Wikipedia
- publications licensed compatibly with Wikimedia projects
  - good foundation for reuse of text and media in Wikisource, Wikimedia Commons etc.
    - see also Ina Blümel's lightning talk
- statements supported by the same source (cf. Dario's slide 17)
  - all statements citing
    - any article from the New York Times (or Daily Mail, or Berkeley News); for those working on such issues around newspapers, see WikiProject Periodicals on Wikidata and WikiProject Newspapers on the English Wikipedia
      - a specific article amongst these
    - any works of Joseph Stiglitz
    - journal articles by physicists who worked at Oxford University in the 1970s
    - a journal article that was retracted
- timelines of things that form primary corpora, e.g.
  - Histropedia timeline of publications about invasive species
  - similar timeline for the Zika Corpus
- things translated by the same translator
  - e.g. by Yanka Kupala
- type specimens of biological taxa or minerals
  - nothing to see here yet
- research published last week
  - see also this FORCE 2018 session
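
Several of the corpora listed above boil down to "all items that share one property value". As a minimal sketch, assuming the corpus is queried through the Wikidata Query Service (which predefines the `wd:`, `wdt:`, `pr:` and `prov:` prefixes), the following Python helpers only build the SPARQL text and do not send it anywhere. The function names `corpus_query` and `cited_source_query` are made up for this illustration; the property IDs used (P2093 "author name string", P407 "language of work or name", P248 "stated in") are existing Wikidata properties. Note that the first example returns works carrying the author name string "Li Li", not the people behind that string.

```python
import re


def _term(value: str) -> str:
    """Render a value as a SPARQL term: an item (wd:Q...) or a quoted string."""
    if re.fullmatch(r"Q\d+", value):
        return f"wd:{value}"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def corpus_query(property_id: str, value: str, limit: int = 100) -> str:
    """Build a query for items whose `property_id` has the given value.

    Hypothetical helper: `value` may be an item ID such as "Q1860"
    or a plain string such as an author name string.
    """
    if not re.fullmatch(r"P\d+", property_id):
        raise ValueError(f"not a property ID: {property_id!r}")
    return (
        "SELECT ?item WHERE {\n"
        f"  ?item wdt:{property_id} {_term(value)} .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def cited_source_query(source_qid: str, limit: int = 100) -> str:
    """Build a query for statements whose reference is 'stated in' (P248) a source."""
    if not re.fullmatch(r"Q\d+", source_qid):
        raise ValueError(f"not an item ID: {source_qid!r}")
    return (
        "SELECT ?statement WHERE {\n"
        "  ?statement prov:wasDerivedFrom ?reference .\n"
        f"  ?reference pr:P248 wd:{source_qid} .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


# Works sharing the author name string "Li Li" (P2093).
print(corpus_query("P2093", "Li Li"))
# Works whose language (P407) is English (Q1860).
print(corpus_query("P407", "Q1860"))
```

The resulting text can be pasted into the query service's web interface. Keeping query construction separate from execution makes it easy to review a corpus definition before running it.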

What to consider before getting started on a new corpus

What is already there?
- items
- properties
- What about lexemes/forms/senses?
How is it modeled?
- poems
- datasets
What is the purpose of the existing and new corpora?
- discovery
  - e.g. of knowledge, connections, potential collaborators
- quality control
  - might involve
    - constraint statements
    - Shape Expressions
      - see also WikiProject ShEx and lightning talks by Eric Prud'hommeaux and Jose Emilio Labra Gayo
    - maintenance queries
      - for constraints, benchmarks etc. (some examples)
    - Scholia
      - see also next talk
- research assessment
What about starting your corpus as a subset of one of the existing ones?
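
One possible shape for the maintenance queries mentioned above is "items in the corpus that lack a property they should have", for example journal articles without a journal. The sketch below only builds the SPARQL text, assuming it will be run on the Wikidata Query Service. The helper name `missing_property_query` is hypothetical; P921 ("main subject") and P1433 ("published in") are existing Wikidata properties, and Q42 is used purely as a well-known stand-in item for the corpus topic.

```python
import re


def missing_property_query(
    corpus_property: str,
    corpus_item: str,
    required_property: str,
    limit: int = 100,
) -> str:
    """Build a query for corpus members that have no value for `required_property`.

    Hypothetical helper: the corpus is defined as all items whose
    `corpus_property` points to `corpus_item`.
    """
    for pid in (corpus_property, required_property):
        if not re.fullmatch(r"P\d+", pid):
            raise ValueError(f"not a property ID: {pid!r}")
    if not re.fullmatch(r"Q\d+", corpus_item):
        raise ValueError(f"not an item ID: {corpus_item!r}")
    return (
        "SELECT ?item WHERE {\n"
        f"  ?item wdt:{corpus_property} wd:{corpus_item} .\n"
        f"  FILTER NOT EXISTS {{ ?item wdt:{required_property} ?value . }}\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


# Items with main subject (P921) Q42 that lack "published in" (P1433).
# Q42 is a stand-in; replace it with the topic item of the corpus at hand.
print(missing_property_query("P921", "Q42", "P1433"))
```

A query like this is a starting point rather than a full quality check: constraint statements and Shape Expressions can express richer expectations than the mere presence of a property.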

Notes

How does it relate to the past, present and future of WikiCite?
- Extends across the three possible scenarios in the WikiCite Roadmap
- create new properties
- revise data models
- write Shape Expressions
- build/adapt tools and workflows
  - SourceMD

Additional considerations

Complete corpora are good for testing purposes, so watch out for
- things that do not change (much any more), e.g.
  - all citations from a given version of a publication
  - publishers, journals, organizations, authors, countries etc. that do not exist any more
  - publications in extinct languages
Scholia for quality control

Presenter

Daniel Mietchen is trained as a biophysicist and now works for the Data Science Institute of the University of Virginia on opening up research and education workflows for large-scale collaboration, including with machines. More details via Scholia or this user page.