Wikicite/grant/Wikipedia Citations in Wikidata/Report


Goals

The aim of the project was to develop a codebase to enrich Wikidata with citations to scholarly publications (journal articles and books) that are currently referenced in English Wikipedia. At the end of April 2021, we released all of the code developed during the project in its GitHub repository. The code successfully implements the originally envisioned workflow for creating citation data from Wikipedia articles and uploading it to Wikidata.

Outcome

Target outcome: Create a mapping between Wikipedia citations and the OpenCitations Data Model (OCDM).
Achieved outcome: The mapping between the various ways citations are represented in Wikipedia articles and the OCDM has been implemented in the first two steps of the workflow, namely the extractor and the converter.
Explanation: The mapping was not defined as a mapping document; rather, it was implemented at the code level. The extractor module is responsible for extracting bibliographic references and citations from Wikipedia articles, and is based on existing code developed by one of the members of the research group. All the extracted citations are stored as a Parquet dataset. From this dataset, the converter module converts the extracted citation data into a set of OCDM-compliant RDF files; a minimal sketch of this conversion is given after the table.

Target outcome: Create a mapping between the OCDM and Wikidata.
Achieved outcome: The mapping between the OCDM and Wikidata has been implemented in the last step of the workflow, namely the pusher.
Explanation: The mapping was not defined as a mapping document; rather, it was implemented at the code level. The pusher module takes as input a set of OCDM-compliant RDF files describing scholarly data and produces a series of TSV files compliant with the QuickStatements input format, which enable a Wikidata user to bulk upload the citation data onto Wikidata; a sketch of the QuickStatements output also follows the table.

Target outcome: Implement a tool to enrich and disambiguate entities via PIDs.
Achieved outcome: The tool oc_graphenricher has been developed and released on PyPI.
Explanation: oc_graphenricher is a standalone tool working on OCDM-compliant data. It is divided into two parts: an enricher component, responsible for finding new identifiers (i.e. ORCID, VIAF, DOI, ISSN, Wikidata QID, Crossref's publisher ID) for the entities in an OCDM-compliant dataset, and an instance-matching component, responsible for deduplicating entities that share the same identifier (the instance-matching idea is illustrated after the table). It is at the core of the third step of the workflow, i.e. the enricher.

Target outcome: Release the code with an open-source license and appropriate documentation for future use.
Achieved outcome: The code has been released on GitHub under the ISC license and is accompanied by extensive documentation.
Explanation: The code has been developed following a Test-Driven Development process and includes several tests for each module to check its consistency. In addition, extensive documentation of the code has been produced, accompanied by a full description of how the workflow works and how to run it.
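
As an illustration only, and not the project's actual converter code, the following minimal Python sketch shows how extracted citation records stored in a Parquet file might be turned into OCDM-style RDF using pandas and rdflib. The file name, column name and base URL are assumptions; the OCDM itself relies on standard SPAR vocabularies such as FaBiO and CiTO, which are the ones used below.

    import pandas as pd
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import DCTERMS, RDF

    # Hypothetical input: one row per extracted citation (file and column names are assumptions).
    citations = pd.read_parquet("extracted_citations.parquet")

    FABIO = Namespace("http://purl.org/spar/fabio/")
    CITO = Namespace("http://purl.org/spar/cito/")
    BASE = Namespace("https://example.org/br/")  # placeholder base URL, not the project's

    g = Graph()
    g.bind("fabio", FABIO)
    g.bind("cito", CITO)

    for i, row in citations.iterrows():
        citing = BASE[f"citing-{i}"]   # the Wikipedia page holding the citation
        cited = BASE[f"cited-{i}"]     # the cited bibliographic resource
        g.add((citing, RDF.type, FABIO.WikipediaEntry))
        g.add((cited, RDF.type, FABIO.JournalArticle))
        g.add((cited, DCTERMS.title, Literal(row["title"])))
        g.add((citing, CITO.cites, cited))

    g.serialize(destination="citations_ocdm.ttl", format="turtle")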
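
Likewise, here is a hedged sketch of the kind of QuickStatements (version 1) TSV commands that the pusher step is meant to produce. The title, DOI and QIDs below are invented placeholders, not data produced by the project; P31 (instance of), P356 (DOI) and P2860 (cites work) are real Wikidata properties.

    # Generate a small QuickStatements v1 TSV file (all values below are placeholders).
    commands = [
        ["CREATE"],                                       # create a new item for a cited article
        ["LAST", "Len", '"An invented article title"'],   # English label of the new item
        ["LAST", "P31", "Q13442814"],                     # instance of: scholarly article
        ["LAST", "P356", '"10.1234/INVENTED.DOI"'],       # DOI (string value, quoted)
        ["Q4115189", "P2860", "Q4115189"],                # "cites work" statement between two placeholder items
    ]

    with open("citations_quickstatements.tsv", "w", encoding="utf-8") as f:
        for command in commands:
            f.write("\t".join(command) + "\n")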
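
Finally, the instance-matching idea behind the enricher step can be illustrated with a generic sketch; this is not the oc_graphenricher API, just toy Python showing how entities sharing the same external identifier can be grouped and deduplicated.

    from collections import defaultdict

    # Toy records standing in for bibliographic resources; fields and values are invented.
    records = [
        {"id": "br/1", "doi": "10.1234/abc", "title": "Some article"},
        {"id": "br/2", "doi": "10.1234/abc", "title": "Some article (duplicate)"},
        {"id": "br/3", "doi": "10.9876/xyz", "title": "Another article"},
    ]

    # Group records by a shared identifier (here the DOI) ...
    by_doi = defaultdict(list)
    for record in records:
        by_doi[record["doi"]].append(record)

    # ... and keep one representative per group, i.e. deduplicate.
    deduplicated = [group[0] for group in by_doi.values()]
    print(f"{len(records)} records reduced to {len(deduplicated)} distinct resources")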


Lessons learned

What worked well

In general, the work plan went smoothly, with no insurmountable issues. We finished the implementation of the code and the production of the documentation on time, even though we underestimated the effort needed for the second step of the workflow, i.e. the converter, which required almost two months to be properly addressed.

What did not work so well

Although we have implemented the full workflow, its last step, i.e. the pusher, is semi-automatic and needs a human in the loop when QuickStatements is used. Within the timeframe of the project, we did not have the chance to devise and develop a fully automatic approach for uploading new data to Wikidata, particularly because of issues concerning the creation of new entities (e.g. it was not straightforward to automatically retrieve the QID of an entity just created via QuickStatements) and the automated interaction with the Wikidata database (e.g. running a bot that automatically interacts with it requires obtaining specific permissions).
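
For context, one way to look up the QID of an item after the fact is to query the public Wikidata SPARQL endpoint by an external identifier such as the DOI (property P356). This is a generic illustration, not part of the project's codebase, and it does not solve the underlying problem of getting the QID back at creation time; the DOI used is an arbitrary example.

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"
    doi = "10.1038/NPHYS1170"  # arbitrary example DOI, not project data

    # Find items whose DOI (P356) matches the given string.
    query = """
    SELECT ?item WHERE {
      ?item wdt:P356 "%s" .
    }
    """ % doi

    response = requests.get(
        ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "wikicite-report-example/0.1"},
    )
    response.raise_for_status()

    for binding in response.json()["results"]["bindings"]:
        # The QID is the last path segment of the entity URI.
        print(binding["item"]["value"].rsplit("/", 1)[-1])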

What would you do differently next time

The issue of getting the QID of a newly created entity, and thus having a fully automated workflow, is something that could have been addressed from the beginning had it been known. According to the Telegram channel, other recipients of WikiCite grants ran into a similar issue. Better coordination among the recipients from the outset might have helped identify and address this problem in the first stages of the project.

Finances

Grant funds spent

The grant amount of 8,070 EUR has been used to pay the salaries of two research fellows appointed to work on the project. The monthly salary of a research fellow is 1,614 EUR.

The following salaries have been paid with the project's grant:

  • A short-term research fellow, working from January to April 2021 and addressing the main part of the implementation of the workflow: 6,456 EUR (1,614 EUR per month, for 4 months)
  • Another short-term research fellow, working in January 2021 and focussing on the task of creating a tool for implementing the enricher step: 1,614 EUR (1,614 EUR per month, for 1 month)

In total, 8,070 EUR have been spent on salaries.

Remaining funds

The full grant of 8,070 EUR has been completely spent; no funds remain.

Anything else

The grant covered the expenses of hiring the personnel who developed the software and documentation for extracting citations from the English Wikipedia and pushing them to Wikidata.

In the future, when supported by appropriate funding and/or personnel, the intention is to use the software to create a dataset of English Wikipedia citations and to understand, in particular, how many new entities (i.e. citing Wikipedia pages, cited articles and venues, authors) we would need to add to Wikidata in order to upload the whole set of extracted citations. Once this figure is computed, we will ask the Wikidata community and database maintainers whether they agree to the upload of all such new entities, considering the huge number of bibliography-related entities that such a mass upload would add to the dataset.