WikiCite 2016/Report/Group 5
Room 129, 4:00 - 6:00 pm • Etherpad: Room 129
Identify use cases for SPARQL queries involving source metadata. Obtain a small open licensed bibliographic and citation graph dataset to build a proof-of-concept of the querying and visualization potential of source metadata in Wikidata. Includes work on the Zika virus corpus.
- Adam Shorland (Wikimedia Deutschland, Wikidata), Thursday
- Andra Waagmeester (Micelio)
- Daniel Mietchen (National Institutes of Health (NIH), organizer)
- Dario Taraborelli (Wikimedia Research)
- Eamon Duede (Knowledge Lab @ University of Chicago)
- Finn Årup Nielsen (Danmarks Tekniske Universitet (Technical University of Denmark)), Thursday
- Jonas Kress (Wikimedia Deutschland)
- Katherine Thornton (University of Washington)
- Konrad Förstner (Universität Würzburg)
- Markus Kaindl (Springer Nature), Thursday
Wikidata will serve as a centralized, highly structured repository capable of representing the densely networked nature of the scholarly sources that support the knowledge archived across all Wikimedia projects. This signals an unprecedented opportunity, not only for scientists and scholars but also for society at large, to explore the complex landscape of human knowledge. Yet it is not clear what such an exploration would look like. What kinds of questions can be asked of such a system? We established a working group not only to envision concrete use cases for scholarly source-related questions in Wikidata, but also to determine whether the technical foundations required to express those questions as intelligent, efficient, and systematic queries are in place. Where these technical foundations are lacking but needed, the working group tasked itself with developing proposals for overcoming such limitations.
This group focused on discussing and prioritizing use cases for Wikidata queries involving source metadata. The assumption is that we already have or have access to all the required data. In addition, we worked to obtain a small, open licensed bibliographic and citation graph dataset to build a proof of concept of the querying and visualization potential of having this data stored in Wikidata and exposed via SPARQL.
One of the unique benefits of representing source metadata as linked open data is the ability to perform arbitrary queries on sources and the statements that reference them. Integrating Wikidata with an annotated source metadata repository will allow us to answer questions like the following:
- list all Wikidata statements citing a New York Times article
- list the most popular scholarly journals used as citations in any item that is a subclass of economics
- retrieve all statements citing the works of Joseph Stiglitz
- retrieve all statements citing journal articles by physicists from Oxford University
- list all statements citing a journal article that was retracted
- list all statements citing a source that cites a journal article that was retracted
- list all journal articles on Zika virus that were published in the last four weeks
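As an illustration, the last question in the list above could be phrased roughly as follows. This is a minimal sketch, not a query the group ran: it builds the SPARQL text in Python, uses the identifiers named in this report (P31, Q13442814, P921, Q202864), and assumes that publication dates are stored under P577 and that the query service predefines the `wd:`, `wdt:`, `wikibase:`, `bd:` and `xsd:` prefixes. The function name and the reference date are hypothetical.

```python
from datetime import date, timedelta


def recent_articles_query(subject_qid: str, today: date, weeks: int = 4) -> str:
    """Build a SPARQL query for scientific articles on a subject
    published within the last `weeks` weeks before `today`."""
    cutoff = today - timedelta(weeks=weeks)
    # P31 = instance of, Q13442814 = scientific article, P921 = main subject.
    # P577 (publication date) is an assumption, not named in the report.
    return f"""
SELECT ?article ?articleLabel ?published WHERE {{
  ?article wdt:P31 wd:Q13442814 ;
           wdt:P921 wd:{subject_qid} ;
           wdt:P577 ?published .
  FILTER(?published >= "{cutoff.isoformat()}T00:00:00Z"^^xsd:dateTime)
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
ORDER BY DESC(?published)
""".strip()


# Q202864 = Zika virus; the reference date is only an example.
query = recent_articles_query("Q202864", today=date(2016, 6, 1))
print(query)
```

The printed text can be pasted into the query service. The other questions in the list would need additional modeling (for example, how a retraction is recorded), which is exactly the kind of data model requirement discussed in the session.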
In this session we discussed the most important types of source-related queries that Wikidata Query Service should support and the underlying data model requirements.
- Most of the use cases the group discussed can be readily expressed in SPARQL so long as the bibliographic data model for the corresponding source metadata is rich enough. The only exception is reference lookup/recommendation via free-form keywords, which may require a different technical solution.
- It is hard to assess the scalability or limits of SPARQL when applied to source metadata without real data. For this reason, the group participants recommended focusing on building proof-of-concept datasets that can both provide value to the broader Wikidata community and allow us to experiment with and augment querying capabilities against real data.
- The group expressed the need to streamline the process for property creation during technical events like this one, where a group of Wikidatans who are also subject matter experts can create a needed property on the spot and thereby significantly speed up data modeling and data ingestion efforts.
We discussed four primary use cases:
Analysis of sources supporting statements of scientific relevance
The first use case is targeted at the community of researchers studying the evolution and structure of scientific and scholarly communication (Metaknowledge or Science of Science). Having a complete citation and source metadata graph in Wikidata, integrated with statements of scientific relevance, will allow this community to perform analyses that – to date – are impossible without substantial data curation and aggregation efforts. We discussed use cases such as the analysis of coauthor / co-citation networks for authors of scientific articles in general or for articles cited in Wikipedia, Wikidata, or other Wikimedia properties. We discussed how common bibliographic metrics such as the h-index or a journal's impact factor could one day be generated transparently, replicated and verified by anyone via data stored in Wikidata.
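To make the point about transparent metrics concrete, here is a minimal sketch of how an h-index could be recomputed by anyone from "cites" edges once they are available. The edge list and paper names below are invented for illustration and are not Wikidata data; the function names are hypothetical.

```python
from collections import Counter


def citation_counts(cites_edges, papers):
    """Count incoming citations for each paper.

    cites_edges is an iterable of (citing, cited) pairs.
    """
    counts = Counter(cited for _, cited in cites_edges)
    return [counts[p] for p in papers]


def h_index(counts):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, count in enumerate(sorted(counts, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical data: papers A, B and C by one author, cited by X, Y and Z.
edges = [("X", "A"), ("Y", "A"), ("Z", "A"), ("X", "B"), ("Y", "B"), ("Z", "C")]
counts = citation_counts(edges, ["A", "B", "C"])
print(counts)           # [3, 2, 1]
print(h_index(counts))  # 2
```

Because both the edges and the computation are open, the result can be replicated and verified, which is the property the group was interested in.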
Data quality checks
A second use case involving source metadata is related to automated data quality checks of Wikidata statements. This case is mostly targeted at data producers. Examples include bots that scrape data from the scholarly literature and store the resulting statements in Wikidata (such as StrepHit or ContentMine). Such bots will typically introduce errors that require entity disambiguation, e.g. Ebola River (Q934455) instead of Ebolavirus (Q5331908). Queries that run over semi-automatically generated statements and retrieve potential errors could feed queues of statements requiring human curation/review.
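A review queue of this kind could be fed by a very simple check. The sketch below assumes that statements have already been retrieved as (item, property, value) triples and that a hand-maintained table of confusable entities exists. Both the table and the item identifiers of the sample statements are hypothetical; only the Ebola River / Ebolavirus pair comes from the example above.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    item: str   # subject item
    prop: str   # property, e.g. P921 (main subject)
    value: str  # value item


# Value a bot may have picked by mistake -> value that was probably meant.
# Q934455 = Ebola River, Q5331908 = Ebolavirus.
CONFUSABLE = {"Q934455": "Q5331908"}


def review_queue(statements, confusable=CONFUSABLE):
    """Return (statement, suggested_value) pairs that need human review."""
    return [(s, confusable[s.value]) for s in statements if s.value in confusable]


# Hypothetical bot output; the item identifiers are placeholders.
bot_statements = [
    Statement("ARTICLE-1", "P921", "Q934455"),
    Statement("ARTICLE-2", "P921", "Q5331908"),
]
for statement, suggestion in review_queue(bot_statements):
    print(statement.item, "uses", statement.value, "- did the bot mean", suggestion, "?")
```

The check only flags candidates; deciding whether an article really is about the river remains a task for a human curator.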
Dynamic generation of publication lists
A third use case we discussed is the idea of using Wikidata and WDQS as a way to dynamically generate publication lists by author, department/institution, or topic. The target for this use case is individual students, academics, and institutions with a need to routinely generate lists of publications meeting specific criteria. This use case is also targeted at teachers and instructors creating syllabi and reference lists as part of classes. Funders could also generate lists of works they financially support, broken down by topic, scientific field, or any other criterion that can be expressed in Wikidata. Clean bibliographic metadata and support for identifiers like DOI (Q25670), PMID (Q2082879), ORCID (Q51044), Funder ID etc. are critical for supporting this use case.
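A publication list keyed on an author identifier might be generated along the following lines. This is a sketch under assumptions: it presumes that authors are linked to works via P50, that ORCID iDs are stored under P496, and that DOIs are stored under P356. None of these property numbers are given in this report. The function name and the sample ORCID iD are hypothetical.

```python
import re

# Four groups of four characters; the last character may be the check digit X.
ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def publication_list_query(orcid: str) -> str:
    """Build a SPARQL query listing works by the author with this ORCID iD."""
    if not ORCID_PATTERN.fullmatch(orcid):
        raise ValueError(f"not a well-formed ORCID iD: {orcid!r}")
    # Assumed properties: P496 = ORCID iD, P50 = author, P356 = DOI.
    return f"""
SELECT ?work ?workLabel ?doi WHERE {{
  ?author wdt:P496 "{orcid}" .
  ?work wdt:P50 ?author .
  OPTIONAL {{ ?work wdt:P356 ?doi . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
""".strip()


# Placeholder identifier, not a real researcher.
print(publication_list_query("0000-0000-0000-0000"))
```

Validating the identifier before interpolating it keeps malformed input out of the query text. Lists by institution, topic, or funder would follow the same shape with a different first triple pattern.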
Source lookup across Wikimedia properties
Source reuse in citations across Wikipedia articles in the same language, across Wikipedia language editions, or across Wikimedia projects is a known pain point of the way this data is currently stored. A centralized store for this data will allow source lookup and citation recommendation functionality to be designed into editing interfaces for Wikimedia contributors. It is unclear at this stage whether WDQS would be the best way to support this functionality or whether a different approach would be needed, for example lookup via a combination of free-form keywords such as author, journal name, or words in the title ("choosing experiments evans sociology" → Q24289251).
The group also worked on testing the current capabilities of WDQS at expressing a number of queries involving source metadata (see the appendix of this group's report).
From abstract use cases to a proof of concept
We determined that assessing the usefulness of specific types of query is hard without real data already available in a structured data store. We proceeded to request the creation of a property – cites (P2860) – which was key to the group's ability to conduct this work by importing and curating a small data sample. The process involving the creation of the property as part of a technical event was in itself very interesting: it triggered a request for deletion and a lengthy discussion on the legitimacy of new properties created without going through the default community discussion period. The participants felt the default process may need to be revisited in order to allow future data modeling initiatives to have the ability to create important properties (particularly when they map uncontroversially to existing properties in mainstream ontologies such as Schema.org or Dublin Core), while respecting community processes at large.
Finally, the group realized that the most viable strategy to showcase the potential of a corpus integrating Wikidata statements, expert annotations and source metadata would be to focus on the production of a relatively small, highly curated dataset, tied to existing content in Wikidata but also producing value beyond what currently exists in the project. We settled on the case of the corpus of scholarly articles on Zika virus and the project is currently being developed as a spin-off of this group.
Limitations of the current approach
The focus of the group was significantly skewed towards queries of scholarly papers, for a number of reasons:
- source metadata are more readily available in canonical format via existing APIs and services
- the definition of a data model is more straightforward than for other types of publications, as highlighted in other workgroups
- there's an immediate need to address these issues in the context of existing curation projects that involve Wikidata and Wikipedia, such as Gene Wiki.
The group participants were aware of the limited scope of this exercise, but unanimously felt that the creation of a proof-of-concept for this simplified use case could help spearhead a discussion on the value of this data and the kind of manual and automated processes required to support its creation and consumption.
Appendix: workgroup notes
- Wikidata items that are instances of (d:P31) scientific article (d:Q13442814) and have a PMID (d:P698) or PMCID (d:P932)
- Wikidata items that are instances of scientific article (d:Q13442814) but do not have a PMID (d:P698) or PMCID (d:P932)
- Wikidata statements that use scientific papers as references, specifically Wikidata items with statements involving a PMID (d:P698) or PMCID (d:P932)
- Wikidata statements that have scientific papers as references, specifically Wikidata items that are instances of a scientific article (d:Q13442814) but do not have a PMID (d:P698) or PMCID (d:P932)
- Most common author names (strings) for articles whose main subject (d:P921) is Zika virus (d:Q202864)
- Scientific articles with metadata stored in Wikidata and containing "zika" in the title
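The first two queries in the list above differ only in whether an identifier is required or excluded, so they can be generated from one template. The sketch below is an illustration rather than the text of the queries the group actually ran. It reads "do not have a PMID or PMCID" as "have neither", and it assumes the query service predefines the `wd:` and `wdt:` prefixes. The function name is hypothetical.

```python
def article_identifier_query(with_identifier: bool, limit: int = 100) -> str:
    """Build a SPARQL query for scientific articles (Q13442814) that either
    have a PMID (P698) or PMCID (P932), or have neither."""
    if with_identifier:
        id_clause = (
            "  { ?article wdt:P698 ?id . }\n"
            "  UNION\n"
            "  { ?article wdt:P932 ?id . }"
        )
    else:
        id_clause = (
            "  FILTER NOT EXISTS { ?article wdt:P698 ?pmid . }\n"
            "  FILTER NOT EXISTS { ?article wdt:P932 ?pmcid . }"
        )
    return (
        "SELECT DISTINCT ?article WHERE {\n"
        "  ?article wdt:P31 wd:Q13442814 .\n"
        f"{id_clause}\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


print(article_identifier_query(with_identifier=True, limit=10))
print()
print(article_identifier_query(with_identifier=False, limit=10))
```

The `LIMIT` is there because, as noted earlier in the report, the behavior of such queries at scale was hard to assess without real data.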
Example of citation networks for individual Zika research papers using the Wikidata Graph builder:
- root node: Zika virus complete genome from Salvador, Bahia, Brazil (Q23906890 - doi: 10.1016/j.meegid.2016.03.030), mode: undirected
- root node: Zika virus – an overview (Q23308149 - doi: 10.1016/j.micinf.2016.03.003), mode: both
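The idea behind such a citation network can be sketched as a breadth-first walk over "cites" (P2860) edges starting at a root node. The code below is not the Wikidata Graph Builder's implementation. The mode names "forward", "reverse" and "undirected" are this sketch's own and are only assumed to correspond loosely to the tool's options. The edge list is invented.

```python
from collections import defaultdict, deque


def citation_neighbourhood(edges, root, mode="undirected", max_depth=2):
    """Return {node: distance from root} for nodes reachable within max_depth.

    edges is an iterable of (citing, cited) pairs.
    mode: "forward" follows citing -> cited, "reverse" follows cited -> citing,
    "undirected" follows both directions.
    """
    if mode not in ("forward", "reverse", "undirected"):
        raise ValueError(f"unknown mode: {mode!r}")
    cited_by_node = defaultdict(set)   # node -> works it cites
    citing_of_node = defaultdict(set)  # node -> works that cite it
    for citing, cited in edges:
        cited_by_node[citing].add(cited)
        citing_of_node[cited].add(citing)

    def neighbours(node):
        if mode == "forward":
            return cited_by_node[node]
        if mode == "reverse":
            return citing_of_node[node]
        return cited_by_node[node] | citing_of_node[node]

    distance = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if distance[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in neighbours(node):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance


# Hypothetical edges: A cites B, B cites C, D cites B.
edges = [("A", "B"), ("B", "C"), ("D", "B")]
print(citation_neighbourhood(edges, "A", mode="forward"))
print(citation_neighbourhood(edges, "A", mode="undirected"))
```

With these edges, the forward walk from A reaches B and C, while the undirected walk also reaches D through its citation of B.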
- raw notes from group 5.
- see also Zika corpus project