Wikicite/grant/Wikipedia Citations in Wikidata

From Meta, a Wikimedia project coordination wiki

Project summary

Project Name
Wikipedia Citations in Wikidata
Start/End dates
January to May 2021
Amount requested (and the currency you wish to receive it in)
8,070 EUR
Amount requested (in US$ equivalent)
9,559.34 USD (exchange rate as of 18 September 2020)

The people

Contact person name/Wikimedia username
Silvio Peroni
Contact person e-mail address
Organisation (optional)
Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy
Project participants
  • Silvio Peroni (PI): his expertise includes document markup, semantic description of bibliographic resources using OWL ontologies, development of services for citation data management, and bibliometric and scientometric studies. He is one of the main developers of the SPAR (Semantic Publishing and Referencing) Ontologies, Co-Director of OpenCitations, and a founding member of the Initiative for Open Citations (I4OC) and of the Initiative for Open Abstracts (I4OA). In recent years, he has developed most of the software used to ingest and expose OpenCitations data.
  • Giovanni Colavizza (co-I): he has extensive research experience in citation analysis and, in particular, in mining and analysing citations and citation usage in Wikipedia.
  • Marilena Daquino (co-I): she has experience in citation and bibliographic data modelling, reengineering in RDF, and data reconciliation activities (particularly adopting Wikidata services).
  • Short-term research software engineer (to be hired).

The project


We propose to develop a codebase to enrich Wikidata with citations to scholarly publications (journal articles and books) that are currently referenced in English Wikipedia. This codebase will build on previous work, such as wikiciteparser, and integrate new components, notably: i) a classifier to distinguish citations by cited source (books, journal articles, and other online content); ii) a look-up module to equip citations with identifiers from Crossref or other APIs. In so doing, Wikipedia Citations extends prior work, such as mwcites, which focused only on citations already equipped with identifiers.
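To illustrate the look-up module, the following is a minimal sketch that assumes a free-text citation is matched against the Crossref REST API `works` endpoint through its `query.bibliographic` parameter. The function names and the `min_score` threshold are hypothetical and are not taken from the existing Wikipedia Citations scripts.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"


def crossref_query_url(citation_text, rows=1):
    """Build a Crossref bibliographic search URL for a free-text citation."""
    params = {"query.bibliographic": citation_text, "rows": rows}
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


def best_doi(response, min_score=60.0):
    """Return the DOI of the top-scoring match, or None if it scores too low.

    min_score is an arbitrary illustrative threshold, not a tuned value.
    """
    items = response.get("message", {}).get("items", [])
    if not items:
        return None
    top = max(items, key=lambda item: item.get("score", 0.0))
    if top.get("score", 0.0) < min_score:
        return None
    return top.get("DOI")


def lookup_doi(citation_text):
    """Query Crossref over the network and return a DOI or None."""
    with urlopen(crossref_query_url(citation_text), timeout=30) as reply:
        return best_doi(json.load(reply))
```

Keeping URL construction and response parsing separate from the network call lets both be tested offline against stored responses.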

Our goal is to develop four software modules in Python (the codebase from now on) that can be easily reused by developers in the Wikidata community:

  1. [extractor] a module to extract citation and bibliographic information from articles in the English Wikipedia;
  2. [converter] a module to convert extracted information into a CSV-based format compliant with a shareable bibliographic data model, e.g., the OpenCitations Data Model;
  3. [enricher] a module for reconciling bibliographic resources and people (obtained in step 2) with entities available in Wikidata via their persistent identifiers (primarily DOIs, QIDs, ORCIDs, VIAFs, then also persons, places and organisations if time allows);
  4. [pusher] a module to disambiguate, deduplicate, and load citation and bibliographic data into Wikidata, reusing code already developed by the Wikidata community as much as possible.
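The first two modules can be sketched as follows. This is a simplified illustration, not the project's design: it assumes citations appear as non-nested `{{cite ...}}` templates, infers the source type from the template name alone, and writes a hypothetical set of CSV columns that is not the OpenCitations Data Model layout.

```python
import csv
import io
import re

# Matches non-nested {{cite ...}} templates only; nested templates would
# require a real wikitext parser rather than a regular expression.
CITE_TEMPLATE = re.compile(r"\{\{\s*cite\s+(\w+)\s*\|([^{}]*)\}\}", re.IGNORECASE)

# Rule-based stand-in for a classifier: map template names to source types.
SOURCE_TYPES = {"journal": "journal article", "book": "book"}

# Hypothetical column set chosen for this example.
COLUMNS = ["source_type", "title", "doi", "year"]


def extract_citations(wikitext):
    """Return one dict of template parameters per citation template found."""
    citations = []
    for match in CITE_TEMPLATE.finditer(wikitext):
        kind = match.group(1).lower()
        fields = {}
        for part in match.group(2).split("|"):
            name, sep, value = part.partition("=")
            if sep:  # skip positional parameters that have no name
                fields[name.strip().lower()] = value.strip()
        fields["source_type"] = SOURCE_TYPES.get(kind, "other")
        citations.append(fields)
    return citations


def to_csv(citations):
    """Serialise citations to CSV, leaving missing columns empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(citations)
    return buffer.getvalue()
```

For example, `to_csv(extract_citations(page_text))` would turn the wikitext of one article into rows ready for the enricher step.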

As a case study for creating and testing the codebase, we rely on the Wikipedia Citations dataset, which currently includes around 30M citations from Wikipedia pages to a variety of sources, of which 4M are to scientific publications. The preprint of the article describing the dataset (currently under review for Quantitative Science Studies) is available on arXiv. Wikipedia Citations includes MIT-licensed scripts to replicate and extend its results. The codebase developed during the Wikipedia Citations in Wikidata project will build on the existing scripts used to create the Wikipedia Citations dataset.

The codebase will be accompanied by extensive documentation to foster its reuse. In particular, we will focus on providing:

  1. extensive documentation in browsable HTML pages of all the Python classes, methods, and functions developed in the codebase;
  2. clear explanation of the full workflow implemented by the codebase, so as to enable the replication of all its steps (extraction, conversion, enriching, pushing) using either the data of our case study or any other data compatible with it;
  3. runnable unit tests to check the correctness of all the code developed and to verify the consistency of the codebase after future modifications and extensions.
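As an indication of what such runnable unit tests might look like, here is a small sketch using Python's standard `unittest` module. The helper `normalize_doi` and its prefix list are hypothetical examples, not functions from the planned codebase.

```python
import unittest


def normalize_doi(raw):
    """Strip common resolver prefixes and lowercase a DOI string."""
    doi = raw.strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi.strip()


class NormalizeDoiTest(unittest.TestCase):
    def test_strips_resolver_prefix(self):
        self.assertEqual(normalize_doi("https://doi.org/10.1000/XYZ"), "10.1000/xyz")

    def test_strips_doi_scheme(self):
        self.assertEqual(normalize_doi(" doi:10.1000/xyz "), "10.1000/xyz")

    def test_plain_doi_unchanged(self):
        self.assertEqual(normalize_doi("10.1000/xyz"), "10.1000/xyz")


def run_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeDoiTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests().wasSuccessful()` after a change gives a quick check that existing behaviour is preserved.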

The development of the codebase contributes to one of the most desired applications for WikiCite, i.e., the management of citations on Wikipedia. Future work from the community could focus on:

  1. the reuse of the codebase to extract and upload new structured citation data to Wikidata;
  2. the creation of additional extractors for Wikipedia articles available in other languages;
  3. the creation of additional extractors for mining and reengineering data on references to non-scholarly publications (e.g. web pages);
  4. the possibility of importing into Wikidata additional data compliant with the same data model, i.e. the OpenCitations Data Model, such as data available in OpenCitations.

After the successful completion of the project, and in coordination with the Wikidata and WikiCite communities, we will consider ingesting into Wikidata an updated version of the Wikipedia Citations dataset, using the developed codebase.


The proposed project will considerably improve the interconnection between Wikipedia and Wikidata by providing structured data for one of the key elements of Wikipedia articles, i.e., references. Having citation links from Wikipedia to scholarly publications has several benefits. On the one hand, it improves the discoverability of relevant encyclopaedic articles related to scholarly studies, thus defining a folksonomy of topics related to particular research. On the other hand, it establishes Wikipedia as a social authority and policy hub, which would enable policymakers to assess the importance of an article, person, research group, or institution by looking at how many Wikipedia articles cite them.

Generally speaking, these citations in Wikidata would make Wikipedia content more discoverable and enrich Wikidata with a ready-to-use corpus for further analysis or for developing new services (e.g., citation recommendation bots). In addition, Wikimedia projects (e.g., Scholia), infrastructures (e.g., OpenCitations), and GLAM services that already leverage the Wikidata knowledge base or alignments to Wikipedia pages would benefit from mechanisms for discovering relevant works related to entities described in Wikipedia and distilled in Wikidata.


All the software modules described above will be implemented by a short-term research fellow with skills in programming, software engineering, and Semantic Web technologies. The fellow will:

  • create a mapping between the Wikipedia Citations and the OpenCitations Data Model schemas;
  • create a mapping between the OpenCitations Data Model and the current way bibliographic metadata and citations are represented in Wikidata;
  • devise and implement a mechanism that is able to interact with several open REST APIs and SPARQL endpoints available online to perform disambiguation activities of the entities included in the dataset (e.g., documents and people);
  • document the project outcomes in Meta-Wiki, and release the code with an open source license and appropriate documentation for future use.
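As an illustration of the SPARQL-based disambiguation task listed above, the sketch below looks up Wikidata items by DOI through the public Wikidata Query Service. It assumes the DOI property (P356) with values stored in upper case, as is conventional in Wikidata. The function names and the `user_agent` string are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def doi_query(doi):
    """Build a SPARQL query that finds items whose DOI (P356) matches."""
    # Upper-case the DOI and escape characters that would break the literal.
    literal = doi.upper().replace("\\", "\\\\").replace('"', '\\"')
    return f'SELECT ?item WHERE {{ ?item wdt:P356 "{literal}" . }} LIMIT 5'


def qids_from_results(results):
    """Extract QIDs from a SPARQL JSON results document."""
    qids = []
    for binding in results.get("results", {}).get("bindings", []):
        uri = binding["item"]["value"]
        if uri.startswith(ENTITY_PREFIX):
            qids.append(uri[len(ENTITY_PREFIX):])
    return qids


def find_items_by_doi(doi, user_agent="citations-sketch/0.1 (example)"):
    """Run the query over the network and return the matching QIDs."""
    params = urlencode({"query": doi_query(doi), "format": "json"})
    request = Request(f"{WDQS_ENDPOINT}?{params}", headers={"User-Agent": user_agent})
    with urlopen(request, timeout=60) as reply:
        return qids_from_results(json.load(reply))
```

A DOI that returns no QID would indicate a candidate for a new item, while more than one QID would point to duplicates to resolve before loading.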

Measures of success

The primary measure of success is the amount of codebase documentation, which is crucial to foster the adoption and reuse of the codebase by the community in other relevant Wikimedia projects. In particular, we will report how much text and how many pages, documents, and unit tests are released to make the codebase and the workflow it implements reusable by developers in different contexts.

Secondly, we consider as a potential measure of success the increase in the number of citations included in Wikidata after a bulk import of the citation data created using the software developed in the project. While Wikidata already includes a significant amount of citation information, we expect to be able to considerably increase it in a single batch operation, in coordination with the community. Our work would also facilitate future imports of Wikipedia citations into Wikidata.

Thirdly, we consider as a potential measure of success the increased engagement with citation data in Wikidata by users and its community. Such a measure can be provided by the Wikimedia Foundation itself, and can be quantified by looking at the statistics on users' interactions with the new data.

Lastly, another potential measure of success is the claimed, planned, or direct ingestion of the resulting citation data into existing services, both those created by Wikidata communities (see the Community section) and those of external providers (e.g., OpenCitations) that may analyse the new data and/or build new services on top of it.


The results of the project are of interest to several communities, namely:

  • [wiki] scholars and developers leveraging Wikidata in existing services;
  • [biblio] scholars, librarians, and developers in Library and Information science, bibliometrics, research assessment;
  • [dh] scholars and developers in Digital Humanities.

We plan to inform and engage with the aforementioned communities about the project proposal, project updates, and data/code releases at the beginning, during, and at the end of the project, in the following ways:

In the later stages of the project, we would like to contribute to the following pages:

The Budget

We budget to hire a software engineer to work on the tasks we previously described.


  • 8,070.00 EUR (9,559.34 USD at the exchange rate of 18 September 2020) for hiring a short-term (5 months) research fellow to work on the project, from 1 January 2021 to 31 May 2021.

COVID risk assessment (for in-person events)

No in-person activities are planned.


Community notification

We are currently in touch via private email with people actively working in Wikidata-related projects.

Moreover, we advertised the project proposal in the following pages/channels:


Optional: Community members are encouraged to endorse your proposal and leave a rationale here.


Any questions about this proposal and feedback from reviewers should be placed on the associated discussion page.