WikiCite 2016/Report/Group 5/Notes

From Meta, a Wikimedia project coordination wiki

Notes and links

Goal

Wikidata will serve as a centralized, highly structured repository capable of representing the highly networked nature of the scholarly sources that support the knowledge archived across all Wikimedia projects. This signals an unprecedented opportunity for not only scientists and scholars but also society at large to explore the complex landscape of human knowledge. Yet, it is not clear what such an exploration would look like. What kinds of questions can be asked of such a system? With the generous support of Crossref and the Sloan and Moore Foundations, the WikiCite 2016 Workshop established a working group to not only envision concrete use cases for scholarly source-related questions in Wikidata but also to determine whether the technical foundations required to effectively express those questions as intelligent, efficient, and systematic queries are in place. Where these technical foundations are lacking but needed, the working group tasked itself with developing proposals for overcoming such limitations.

This group focused on discussing and prioritizing use cases for Wikidata queries involving source metadata. The assumption is that we already have all the required data. We also worked on obtaining a small, openly licensed bibliographic and citation graph dataset to build a proof of concept of the querying and visualization potential of having this data stored in Wikidata and exposed via SPARQL.

Notes

  • See Proposal: Retrieving Wikidata statements by source
  • Aim
    • discuss and prioritize the most important types of source-related queries that WDQS should support
    • determine if these queries can be effectively expressed in SPARQL and executed via WDQS or if they require a different indexing / data modeling strategy

Key properties

Properties expressing a citation relation

The 'Cites' property was suggested, supported and created during the WikiCite 2016 meeting. The quick and bold creation promptly became a topic of discussion on Wikidata and a Wikidata user even suggested it for deletion. Nevertheless, meeting participants rapidly utilized the property to mark up a few scientific papers, so small citation networks could be visualized. This was particularly the case for scientific papers about the Zika virus and fever.

Other relevant properties

See also

Examples

  • list all Wikidata statements citing a New York Times article
  • list the most popular scholarly journals used as citations of statements for any item that is a subclass of economics
  • retrieve all statements citing the works of Joseph Stiglitz (d:18430)
  • retrieve all statements citing journal articles
    • by physicists from Oxford University
    • that have a PubMed Central ID
  • list all statements citing a specific journal article that was retracted
  • list all statements citing (WD) a source that cites (non-WD) a specific journal article (or one that was retracted)
    • this is outside the current scope of any Wikidata-related project, because it requires storing scholarly citations between papers
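
As a rough illustration of how one of the queries above ("retrieve all statements citing the works of" a given author) might be expressed for WDQS, here is a minimal sketch. It assumes sources are attached to statements through "stated in" (P248) references and that the source items carry an "author" (P50) statement; the function name and the `limit` parameter are hypothetical, and the function only builds the query text without running it.

```python
import re


def statements_citing_author(author_qid: str, limit: int = 100) -> str:
    """Build a SPARQL query for statements whose reference is a work by one author.

    Assumption: references use "stated in" (P248) and sources use "author" (P50).
    """
    if not re.fullmatch(r"Q[1-9]\d*", author_qid):
        raise ValueError(f"not a Wikidata item ID: {author_qid!r}")
    if limit < 1:
        raise ValueError("limit must be positive")
    return f"""
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX pr: <http://www.wikidata.org/prop/reference/>
PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT ?statement ?source WHERE {{
  ?statement prov:wasDerivedFrom ?reference .  # statement -> reference node
  ?reference pr:P248 ?source .                 # reference is "stated in" a source
  ?source wdt:P50 wd:{author_qid} .            # source lists the author
}}
LIMIT {limit}
""".strip()
```

The resulting string could be pasted into the WDQS interface or sent to its SPARQL endpoint. Whether such a query runs efficiently over all references is exactly the open question raised in the notes.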

  • all Zika-related journal articles (WD) that were published in the last n weeks (see Wikidata WikiProject Source Metadata: Items about Zika virus or fever)

  • coauthors of X
    • requires storing bibliographic metadata for all publications by X
  • coauthors of X in Wikipedia
    • is there an interest for coauthors limited to sources cited in Wikipedia?
  • other examples of queries on citations restricted to Wikipedia would be more useful
  • X's H-Index
    • requires storing bibliographic metadata for all publications by X and all their citations
  • Can citation links be typed?
  • Citations restricted by their target
  • Note: you cannot add qualifiers to sourcing statements, e.g. stated in (with a specific citation intention)
  • How do we think about veracity on WD?
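
To make the data requirements for the coauthor and H-index examples concrete, here is a small in-memory sketch. It assumes the authorship records and the paper-to-paper citation links are already available; in Wikidata these would correspond to "author" and "cites" statements. The data layout and function names are hypothetical.

```python
from collections import Counter


def coauthors(author, authorship):
    """Return everyone who shares at least one paper with `author`.

    `authorship` maps a paper ID to the set of its author IDs.
    """
    found = set()
    for authors in authorship.values():
        if author in authors:
            found |= set(authors) - {author}
    return found


def h_index(author, authorship, cites):
    """Largest h such that `author` has h papers with at least h citations each.

    `cites` is an iterable of (citing_paper, cited_paper) pairs.
    """
    # Count each distinct citation link once.
    incoming = Counter(cited for _, cited in set(cites))
    counts = sorted(
        (incoming[paper] for paper, authors in authorship.items() if author in authors),
        reverse=True,
    )
    h = 0
    for rank, count in enumerate(counts, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h
```

Both functions need the complete publication record of X, and the H-index additionally needs every incoming citation, which is why the notes flag these as requiring bibliographic metadata and citations to be stored.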

Use cases

Reuse source MD

    • Example: look up a particular publication via a combination of free-form keywords, e.g. author, journal name, words in the title ('choosing experiments evans sociology')
    • Is this something WDQS would be able to return? Would a vanilla search API be more appropriate?
  • Publication lists
    • Example: all publications by Finn Arup Nielsen sorted by publication date
    • requires storing biblio metadata for the entire publication record of a given author
    • could potentially be implemented via a script periodically syncing up an author entry on Wikidata with the corresponding ORCID record
    • could extend to bibliographies/ reading lists of all types
    • knowledge wells (return the developed scholarship from an arbitrary 'community' [e.g. individual, lab, department, division, university, company]).
  • custom curriculums
    • Example: all publications by members of a given lab
    • ORCID supports affiliations as free-form text, Wikidata has the benefit of supporting affiliations via linked data
    • Example: all publications supported by grants from a specific funder
    • Overlaps potentially with Crossref data (funderID)
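
The idea of periodically syncing an author entry with an ORCID record could start from a simple comparison of identifiers. The sketch below assumes both sides have already been fetched as lists of DOIs (no ORCID or Wikidata API call is shown); the function names are hypothetical, and DOIs are compared case-insensitively.

```python
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi: str) -> str:
    """Strip common resolver prefixes and upper-case the DOI for comparison."""
    doi = doi.strip()
    for prefix in DOI_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.upper()


def missing_from_wikidata(orcid_dois, wikidata_dois):
    """Return normalized DOIs present in the ORCID record but not yet in Wikidata."""
    known = {normalize_doi(d) for d in wikidata_dois}
    return sorted({normalize_doi(d) for d in orcid_dois} - known)
```

The returned list is the set of works a sync script would still need to create or link.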

Sanity Checks

This is mostly targeted at data producers / source owners

  • Example 1:
    • bot scraping data about proteins and storing sources on Wikidata
    • used to reference text, but created errors referencing synonyms, e.g. Ebola River (Q934455) instead of Ebolavirus (Q5331908)
  • Example 2:
    • graph representations surfacing type/class errors, e.g. a query for US states sharing borders used to return items that are not an instance of a state (link)
  • Example 3:
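
A type/class check like the one in Example 2 could be sketched as a filter over query results. The input format (a mapping from item to its "instance of" classes) and the identifiers in the usage example are hypothetical placeholders, not real Wikidata IDs.

```python
def type_violations(results, expected_class):
    """Return the items whose classes do not include `expected_class`.

    `results` maps an item ID to the set of classes it is an instance of.
    """
    return sorted(
        item for item, classes in results.items() if expected_class not in classes
    )


# Hypothetical result of a "states sharing a border" query.
bordering = {
    "state_a": {"us_state"},
    "state_b": {"us_state", "former_republic"},
    "some_river": {"river"},
}
suspicious = type_violations(bordering, "us_state")  # items to review by hand
```

The same pattern would apply to Example 1: checking that the target of a reference belongs to the expected class (a virus rather than a river) before a bot writes it.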

Federated Wikibase Queries

  • run queries across multiple data providers
  • analyze data quality by comparing results from separate providers
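
Comparing results from separate providers can be reduced to a set comparison once each provider's answer is keyed by a shared identifier. The sketch below assumes both result sets have already been retrieved and are keyed the same way (for example by DOI); the function name and the report layout are hypothetical.

```python
def compare_providers(a, b):
    """Compare two {identifier: value} result sets from different providers."""
    shared = a.keys() & b.keys()
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        # Identifiers both providers know but describe differently.
        "conflicts": {k: (a[k], b[k]) for k in sorted(shared) if a[k] != b[k]},
    }
```

Entries under `conflicts` are the candidates for a data quality review, while the two `only_in_*` lists show coverage gaps.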

Generating a test case

We decided to identify a corpus of references to explore the feasibility of importing them and using them as sources for existing Wikidata items. Requirements for this dataset are the following:

  • size: the corpus should have a fairly small number of nodes (articles)
  • relevance: the corpus should fill some obvious gaps, such as serving to directly source statements in Wikidata
  • PID-ready: the corpus should have clean metadata derivable from persistent identifiers (DOIs or PMIDs)
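
The size and PID-ready requirements could be checked mechanically before import. The following is a minimal sketch assuming each record is a dictionary with optional `doi` and `pmid` fields (a hypothetical layout); the DOI pattern is a common heuristic rather than a complete validator, and the relevance requirement still needs human judgment.

```python
import re

DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")  # heuristic match for modern DOIs
PMID_RE = re.compile(r"^[1-9]\d*$")        # PMIDs are positive integers


def is_pid_ready(record):
    """True if the record carries a plausible DOI or PMID."""
    doi = str(record.get("doi") or "")
    pmid = str(record.get("pmid") or "")
    return bool(DOI_RE.match(doi) or PMID_RE.match(pmid))


def select_corpus(records, max_size):
    """Keep PID-ready records only, capped at `max_size` nodes."""
    ready = [r for r in records if is_pid_ready(r)]
    return ready[:max_size]
```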

Obtaining a dataset

Mapping records to existing WD items

Tools for importing data and curating it

Curation

Ask the GeneWiki community to help cross-reference this corpus with existing Wikidata items

Running sample queries and visualizations

  • Zika dataset
  • existing visualizations
  • timeline
  • listeria-generated list of references
  • graph visualizations
  • queries

Reference material

Proposal

  • Import an entire corpus of bibliographic metadata and citation graph for a given field
    • Show all kinds of queries / visualizations that can be obtained via WDQS
    • Source: Pubmed? Mendeley? American Physical Review?
  • Mendeley contacted 25 May
  • American Physical Review contacted 25 May

Example queries

See also

Results

  • An example of how the set of articles can be used in Wikidata: d:Q202864. This is the entity for Zika virus; we added sources for several of the statements that had none.
  • A property 'cites' (d:property:P2860) was created to model citation events between documents.
  • Up to and after the meeting, Finn Årup Nielsen created Wikidata items for all papers associated with data in the OpenfMRI neuroimaging database (d:Q23891141).

For Andra

  • Zika virus @ BioProject from the National Center for Biotechnology Information (NCBI)