WikiCite 2018/Program/Building a WikiCite corpus

From Meta, a Wikimedia project coordination wiki



In WikiCite contexts, a corpus is a set of Wikidata entries that share some common characteristic, for instance having the same author, translator, language or topic, being cited from the same Wikipedia page, or having been published within a given geographic region or period. In this talk, I will explore some examples of such corpora that have been or are being assembled on Wikidata and highlight how they can be used to improve data quality, data models, tools and workflows, or simply to gain a deeper understanding of the relationships between elements of the corpus. These examples include corpora under the auspices of the WikiProjects Wikipedia Sources, Retractions, Invasive Species and Kākāpō, as well as the Zika Corpus and others.



Defining a WikiCite corpus


Multiple approaches are possible here; I am just illustrating some.

Primary corpora

  • things that have been
    • authored
    • published
    • cited
    • archived
    • used as a reference on Wikimedia platforms
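Since a corpus is defined by a shared characteristic, one way to pin it down is as a SPARQL query against Wikidata. The following is a minimal sketch, not a method prescribed by the talk: the function name `corpus_query` and the `CORPUS_PROPERTIES` mapping are hypothetical helpers introduced here for illustration, and the example uses Q42 (Douglas Adams) purely as a well-known author item.

```python
import re

# Wikidata properties for a few shared characteristics a corpus can be built on.
# The mapping itself is an illustrative helper, not part of any WikiCite tool.
CORPUS_PROPERTIES = {
    "author": "P50",
    "main subject": "P921",
    "language of work": "P407",
    "cites work": "P2860",
}


def corpus_query(characteristic, value_qid, limit=100):
    """Build a SPARQL query selecting items that share one property value."""
    prop = CORPUS_PROPERTIES[characteristic]
    # Reject anything that is not a plain item ID before placing it in the query.
    if not re.fullmatch(r"Q[1-9]\d*", value_qid):
        raise ValueError(f"not a Wikidata item ID: {value_qid!r}")
    return (
        "SELECT ?work ?workLabel WHERE {\n"
        f"  ?work wdt:{prop} wd:{value_qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # Corpus of works with Douglas Adams (Q42) as author.
    print(corpus_query("author", "Q42"))
```

The resulting text can be pasted into the Wikidata Query Service; swapping the characteristic or the item ID yields a different corpus without changing anything else.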

Secondary corpora


What to consider before getting started on a new corpus

  • What is already there?
    • items
    • properties
    • What about lexemes/forms/senses?
  • How is it modeled?
  • What is the purpose of the existing and new corpora?
    • discovery
      • e.g. of knowledge, connections, potential collaborators
    • quality control
      • might involve
        • constraint statements
        • Shape Expressions
        • maintenance queries
          • for constraints, benchmarks etc. (some examples)
        • Scholia
          • see also next talk
    • research assessment
  • What about starting your corpus as a subset of one of the existing ones?
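A maintenance query for quality control could, for example, list the items of a corpus that lack a statement one would expect them to have. The sketch below assumes a corpus defined by a single property-value pair and checks for a missing publication date (P577); the function names `missing_statement_query` and `run_query` are hypothetical, and the choice of P577 as the benchmark is an assumption made for illustration only.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def missing_statement_query(selector_prop, selector_qid, expected_prop, limit=50):
    """Select corpus members that have no statement for expected_prop."""
    return (
        "SELECT ?item WHERE {\n"
        f"  ?item wdt:{selector_prop} wd:{selector_qid} .\n"
        f"  FILTER NOT EXISTS {{ ?item wdt:{expected_prop} ?value . }}\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def run_query(query, user_agent):
    """Send the query to the endpoint and return the matching item URIs."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    # A descriptive User-Agent should be supplied by the caller.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=60) as response:
        data = json.load(response)
    return [row["item"]["value"] for row in data["results"]["bindings"]]


if __name__ == "__main__":
    # Works by author Q42 (Douglas Adams) without a publication date (P577).
    print(missing_statement_query("P50", "Q42", "P577"))
```

Building the query is kept separate from running it so that the query text can be reviewed, or saved on a WikiProject page, before anything is sent over the network.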


  • How does it relate to the past, present and future of WikiCite?
  • create new properties
  • revise data models
  • write Shape Expressions
  • build/adapt tools and workflows
  • SourceMD

Additional considerations

  • Complete corpora are good for testing purposes, so look out for
    • things that no longer change (much), e.g.
      • all citations from a given version of a publication
      • publishers, journals, organizations, authors, countries etc. that do not exist any more
      • publications in extinct languages
  • Scholia for quality control


Daniel in 2017

Daniel Mietchen is trained as a biophysicist and now works for the Data Science Institute of the University of Virginia on opening up research and education workflows for large-scale collaboration, including with machines. More details via Scholia or this user page.