Research:Towards Modeling Citation Quality

From Meta, a Wikimedia project coordination wiki
12:52, 21 May 2018 (UTC)
Duration:  2018-May – 2018-?
citations, references, accessibility, machine learning
This page documents a completed research project.


We would like to understand and map the quality of citations and references in Wikipedia. Reference 'quality' is a broad notion that includes reliability, accessibility, neutrality, and more. In this project, we map a set of citation dimensions as a step towards a complete understanding of citation quality.

Citation Quality Dimensions

Towards modeling a full, rich notion of citation quality, we are exploring how the topic distribution and accessibility of citations vary across different languages.


First, we define a topic for each publication, by:

  • Collecting all articles where a publication is cited
  • For articles in Wikipedia editions other than enwiki, finding the corresponding article in enwiki. This is done by finding the Wikidata item corresponding to each article, then retrieving the enwiki page linked from that Wikidata item.
  • Assigning a topic to each article, from the top level of the WikiProject hierarchy, using Scoring Platform's drafttopic tool
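
The lookup and inheritance steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes each Wikidata item is available as a dict with a `sitelinks` map (the shape returned by the Wikidata `wbgetentities` API), and it assumes a publication takes the most frequent topic among its citing articles. The page does not state the aggregation rule, so `publication_topic` is a hypothetical helper.

```python
from collections import Counter


def enwiki_title(entity):
    """Return the enwiki title linked from a Wikidata entity dict, or None."""
    sitelink = entity.get("sitelinks", {}).get("enwiki")
    return sitelink["title"] if sitelink else None


def publication_topic(article_topics):
    """Pick the most frequent topic among the citing articles.

    ASSUMPTION: majority vote; the original page does not say how
    topics from several citing articles are combined.
    """
    counts = Counter(topic for topic in article_topics if topic)
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical entity, shaped like one item of a wbgetentities response.
entity = {
    "id": "Q42",
    "sitelinks": {"enwiki": {"site": "enwiki", "title": "Douglas Adams"}},
}
print(enwiki_title(entity))
print(publication_topic(["Culture", "STEM", "STEM"]))
```

Articles whose Wikidata item has no enwiki sitelink return `None` here; how such articles were handled in the project is not documented on this page.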


We mark each publication (of type doi) as Open Access or Closed Access as follows:

  • We download the dataset from Unpaywall, which contains, for each publication with a DOI, a reference to its open access version, if any.
  • We match the Unpaywall dataset with the entries in our data, and assign an accessibility label to each of them.
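
The matching step can be sketched as a join on normalized DOIs. This is a sketch under assumptions: the field names `doi`, `is_oa`, and `best_oa_location` follow the Unpaywall data format, while `normalize_doi`, `label_access`, and the choice to return `(None, None)` for unmatched DOIs are hypothetical and not taken from the project.

```python
def normalize_doi(doi):
    """Lowercase a DOI and strip common prefixes so both sides join on one form."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def label_access(citation_dois, unpaywall_records):
    """Map each cited DOI to (open_access, open_access_url).

    ASSUMPTION: DOIs missing from the Unpaywall data get (None, None);
    the original page does not say how unmatched entries are labeled.
    """
    index = {normalize_doi(r["doi"]): r for r in unpaywall_records}
    labels = {}
    for doi in citation_dois:
        record = index.get(normalize_doi(doi))
        if record is None:
            labels[doi] = (None, None)
            continue
        # best_oa_location may be null for closed-access records
        location = record.get("best_oa_location") or {}
        labels[doi] = (bool(record.get("is_oa")), location.get("url"))
    return labels


# Made-up records for illustration only.
records = [
    {"doi": "10.1000/abc", "is_oa": True,
     "best_oa_location": {"url": "https://example.org/abc.pdf"}},
    {"doi": "10.1000/xyz", "is_oa": False, "best_oa_location": None},
]
labels = label_access(["10.1000/ABC", "doi:10.1000/xyz", "10.1000/missing"], records)
print(labels)
```

Normalizing case matters because DOIs are case-insensitive, and citations in wikitext may be written in any case.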


We published a dataset of citations with identifiers. There is a file for each Wikipedia edition (e.g. English Wikipedia, Farsi Wikipedia). Each line contains the following tab-separated values:

  • page_id - the id of the page in Wikipedia that is citing the publication
  • page_name - the title of the page citing the publication
  • revision_id - the id of the revision in which the citation was added
  • timestamp - the time when the revision was saved
  • publication_type - the type of the publication cited, one of: isbn, doi, pmid, pmc, arxiv
  • publication_id - the identifier of the publication, format differs according to the type
  • topic - publication topic inherited from the pages where it is cited
  • open_access - boolean: true if the publication is open access, false if it is not open access [for DOI publications only]
  • open_access_url - the url of the open access version of the publication, if 'open_access' is true
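
A line in this format could be read along the following lines. This is a hedged sketch: it assumes the columns appear in the order listed above, that the boolean is written as the text `true` or `false`, and that the last two columns may be empty or absent for non-DOI publications. The `Citation` class and `parse_line` function are hypothetical names, not part of the published dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Citation:
    page_id: int
    page_name: str
    revision_id: int
    timestamp: str
    publication_type: str
    publication_id: str
    topic: str
    open_access: Optional[bool]     # None when no label is present
    open_access_url: Optional[str]  # None when there is no open version


def parse_line(line):
    """Parse one tab-separated line into a Citation."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (9 - len(fields))  # pad trailing columns that are absent
    # ASSUMPTION: the flag is encoded as "true"/"false"; anything else is unknown
    open_access = {"true": True, "false": False}.get(fields[7].strip().lower())
    return Citation(
        page_id=int(fields[0]),
        page_name=fields[1],
        revision_id=int(fields[2]),
        timestamp=fields[3],
        publication_type=fields[4],
        publication_id=fields[5],
        topic=fields[6],
        open_access=open_access,
        open_access_url=fields[8] or None,
    )


# Made-up example line, not taken from the dataset.
example = "12\tExample page\t345\t2018-05-21T12:52:00Z\tdoi\t10.1000/abc\tSTEM\ttrue\thttps://example.org/abc.pdf\n"
print(parse_line(example))
```

Keeping `open_access` as an optional value preserves the difference between "closed access" and "no label", which matters because only DOI publications are labeled.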


We produced some visualizations to show the distribution of publications by topic and accessibility. (Non-interactive) examples below.

The Y axis shows the number of publications per topic; the X axis shows the percentage of open publications in that topic.

The Y axis shows the number of publications per language; the X axis shows the percentage of open publications in that language.
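
The two quantities plotted on these axes can be computed with a simple grouping pass. The sketch below is illustrative only: it assumes one row per publication (already deduplicated), represented as a dict with an `open_access` value of `True`, `False`, or `None`, and it assumes that unlabeled publications are left out of both the count and the percentage. `open_share_by` is a hypothetical helper.

```python
from collections import defaultdict


def open_share_by(rows, key):
    """Return {group: (publication_count, percent_open)} grouped by `key`.

    ASSUMPTION: rows without an accessibility label (None) are skipped.
    """
    totals = defaultdict(int)
    opens = defaultdict(int)
    for row in rows:
        if row.get("open_access") is None:
            continue
        totals[row[key]] += 1
        if row["open_access"]:
            opens[row[key]] += 1
    return {group: (totals[group], 100.0 * opens[group] / totals[group])
            for group in totals}


# Made-up rows for illustration only.
rows = [
    {"topic": "STEM", "open_access": True},
    {"topic": "STEM", "open_access": False},
    {"topic": "Culture", "open_access": True},
    {"topic": "Culture", "open_access": None},
]
print(open_share_by(rows, "topic"))
```

Passing a different `key`, for example a language column added while reading the per-edition files, would give the per-language version of the same summary.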