WikiCite 2016/Report/Group 5/Zika corpus project

Pending tasks for Zika corpus project

Pending tasks for the Zika corpus project, listed to get a sense of the possible dependencies and timeline.

Team members

  1. Adam Shorland (Wikimedia Deutschland, Wikidata)
  2. Andra Waagmeester (Micelio)
  3. Daniel Mietchen (National Institutes of Health (NIH))
  4. Dario Taraborelli (Wikimedia Research)
  5. Eamon Duede (Knowledge Lab @ University of Chicago)
  6. Finn Årup Nielsen (Danmarks Tekniske Universitet (Technical University of Denmark))
  7. Jonas Kress (Wikimedia Deutschland)
  8. Katherine Thornton (University of Washington)
  9. Konrad Förstner (Universität Würzburg (University of Wurzburg))
  10. Markus Kaindl (Springer Nature)
  11. Tobias Schönberg (Wikidata)


  1. Discuss with team members whether the consensus is to move these tasks to Phabricator. If any of you have concerns about the learning curve or coordination costs, we can keep this on Etherpad or on the wiki.
  2. Where is the best place to post the project? We could post it here, on wikidata-l, or on both lists. Maybe someone can advise?

Summary of goals

  1. Our goal is to showcase the benefits of storing source metadata in Wikidata with a small, curated corpus of the literature on Zika virus (Q202864). The corpus is small (<900 entries) and relevant for annotating existing Wikidata items, and the overall topic is notable enough that we believe it will be a very good use case for this project.
  2. The first step is to generate a dump of bibliographic metadata and the citation graph for these papers. Eamon and his team are working on this.
  3. The second step is to identify which items on Zika virus already exist in Wikidata and may be duplicates. Konrad has already started working on this.
  4. Once we have obtained the data and merged/removed the duplicates, we'll prepare it for import into Wikidata via Magnus's Source MD tool. Initially, we'll aim to use short author names (P2093) to represent authors as strings, and we'll represent citations between articles via the newly created cites property (P2860). We should give the community a heads-up about this plan (I am not terribly concerned given the small size of the import, but given the debates followed by the creation of
  5. Meanwhile, we'll coordinate with the GeneWiki community via Andra to annotate and crosslink existing Wikidata items by sourcing them with this corpus, via the stated in property (P248). Andra: I was wondering if you could give us examples of the statements that you would be generating and the type of items involved. I also wanted to discuss with the rest of the WG members which other statements they believe could reasonably be created once we have the source metadata corpus. Institutional affiliations? Main subjects? Journals? Other ideas? Andra: also feel free to invite the rest of the GeneWiki folks to the wikicite-discuss group.
  6. Once the corpus is in and the annotation completed, we'll experiment with the generation of reports via WDQS and visualizations.
  7. Once we have enough materials, we'll advertise the dataset, analyses and visualizations. At the very least we should work on a blog post (WMF / WMDE), but we can think of other possible venues.
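
As a rough illustration of steps 4 and 6, the sketch below flattens a few dump records into property/value triples using P2093 and P2860, and builds a WDQS query that counts incoming citations per paper. It is not the Source MD import format or the team's actual pipeline: the record layout (`id`, `authors`, `cites`), the function names, and the use of main subject (P921) to select Zika papers are all assumptions made for the example, and the sample records are made up.

```python
from urllib.parse import urlencode

# Property IDs named in the plan above.
P_AUTHOR_NAME = "P2093"  # authors represented as plain strings
P_CITES = "P2860"        # citation link between two articles

# Assumption: papers are tagged with the topic via "main subject" (P921).
# The plan only lists main subjects as one possible kind of statement.
P_MAIN_SUBJECT = "P921"
ZIKA_VIRUS = "Q202864"

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def to_statements(records):
    """Flatten hypothetical dump records into (subject, property, value) triples."""
    statements = []
    for rec in records:
        for name in rec["authors"]:
            statements.append((rec["id"], P_AUTHOR_NAME, name))
        for cited in rec["cites"]:
            statements.append((rec["id"], P_CITES, cited))
    return statements


def build_citation_query(topic_qid=ZIKA_VIRUS, limit=100):
    """Build a SPARQL query counting incoming citations per paper on a topic."""
    return (
        "SELECT ?paper (COUNT(?citing) AS ?citations) WHERE {\n"
        f"  ?paper wdt:{P_MAIN_SUBJECT} wd:{topic_qid} .\n"
        f"  ?citing wdt:{P_CITES} ?paper .\n"
        "}\n"
        "GROUP BY ?paper\n"
        "ORDER BY DESC(?citations)\n"
        f"LIMIT {limit}"
    )


def build_query_url(query):
    """Return a GET URL for the query; nothing is sent over the network here."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


if __name__ == "__main__":
    # Made-up records standing in for the bibliographic dump from step 2.
    sample = [
        {"id": "paper-A", "authors": ["A. Author", "B. Author"], "cites": ["paper-B"]},
        {"id": "paper-B", "authors": ["C. Author"], "cites": []},
    ]
    for statement in to_statements(sample):
        print(statement)
    print(build_citation_query())
    print(build_query_url(build_citation_query(limit=10)))
```

Keeping the statements as plain triples leaves the choice of import tool open, and building the query as a string makes it easy to paste into the WDQS web interface before automating anything.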