Research:Understanding the context of citations in Wikipedia

From Meta, a Wikimedia project coordination wiki

This page documents a completed research project.

Wikipedia is not only a collection of multilingual encyclopedic content but also a collection of citation data. To date, there has been little work to systematically examine the quality, kind, and role of citations that appear in Wikimedia projects. Citations are widely understood to be markers of quality and signposts for navigating networks of information resources. If we want to use Wikipedia citation data toward this end and enable others to do so, it is important to first understand the different features of Wikipedia citations.

In this project, we propose to use machine learning techniques to build an enhanced set of Wikipedia citation data and use that data to explore Wikipedia citations and citation practices in more depth than has yet been possible.

Benefits to the Community


Discovering Characteristics of Wikipedia's Citations

Reference data could enhance the learning techniques applied in Automated Classification of Article Importance, and those techniques could in turn be applied to help categorize and groom citation data in order to discover characteristics of Wikipedia citations.

For example, some of the questions we aim to eventually answer include:

  1. What sources and kinds of citations appear in articles of different quality?
  2. What sources and kinds of citations appear in articles of different categories (e.g. biology, art history, computer science)?
  3. What kinds of citation-related editing activity do different kinds of users engage in and are there lifecycle effects?

But in order to get there, we need to build a dataset of enhanced citation data.

New Data

The project will yield an open dataset of enhanced Wikipedia citation data.

Automated Measurement of the Quality and Importance of References in Wikipedia

People forage for information through networks of references in Wikipedia. Measuring the quality and importance of references lets readers decide efficiently what kind of information to look for next and increases their chances of finding relevant information. Appraising an article's importance is a manual process, and citations are additional markers of its quality. Measures of reference and source quality and importance could therefore help inform editors' assessment of potential sources.

New Knowledge in the World

In terms of scholarly output, traditional bibliometric and scientometric work is built on assumptions and theories about scientific practices as a context for citation. We aim to update these assumptions and theories for a world of participatory information production.

Proposed Analysis


This study aims to explore the quality, kind, and role of citations in Wikipedia articles and to compare them against citation practices in the sciences. To this end, we 1) quantify citation quality and importance, 2) classify the quality and categories of articles given their citations, and 3) model types of users with different citation behaviors and predict their lifecycle effects on edited articles. The primary methods used in this project will be supervised machine learning techniques.

Building a Dataset

The proposed dataset includes the following constructs and attributes:

reference - the text within an article that refers to a particular source

source - an external resource that provides support for a statement in an article

article - a concept being described

A reference has the following fields
  • content -- appears between the <ref></ref> tags (in the primary tags if named)
    • raw_content
    • templated -- boolean, whether the entirety of the content is a template call
    • cite_template -- citation templates are usually named something like {{cite book|...}}. This field will contain the type of the citation template, if present.
    • urls -- all the urls that appear in the content
      • domain --
    • identifiers -- all of the structured identifiers contained in the content
  • occurrences -- the locations in the text that the <ref> tags appear
    • section -- the section # as defined by MediaWiki (starts at 0, split by headers)
    • text_offset -- the # of chars between the beginning of the article and the beginning of the <ref> tag
    • preceding_text -- the immediate 250 chars before the beginning of the <ref> tag.
    • header_level -- the level of the header immediately above the <ref> tag (0 for lead)
    • header_text -- the text of the closest header
    • level_2_text -- the text of the nearest level 2 header (see Research:Investigate_frequency_of_section_titles_in_5_large_Wikipedias)
    • level_2_offset -- the # of chars between the nearest level 2 header and the beginning of the <ref> tag
  • revid -- revision ID associated with reference addition/deletion/edit
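
As a rough illustration of how several of the reference fields above might be derived from raw wikitext, the sketch below uses only regular expressions. It is not the parser used to build the dataset: the function name `extract_references` and all of the patterns are assumptions made for this example, the full template name (e.g. "cite book") stands in for the template type, nested templates and `<ref>` tags inside comments or `<nowiki>` are not handled, and self-closing reuses such as `<ref name="x" />` are skipped.

```python
import re

# Hypothetical patterns for this sketch; real wikitext has many more edge cases.
REF_RE = re.compile(r"<ref\b(?P<attrs>[^>]*)(?<!/)>(?P<content>.*?)</ref>",
                    re.DOTALL | re.IGNORECASE)
HEADER_RE = re.compile(r"^(={2,6})[ \t]*(.*?)[ \t]*\1[ \t]*$", re.MULTILINE)
TEMPLATE_RE = re.compile(r"^\{\{\s*(cite [^|}]+?)\s*[|}]", re.IGNORECASE)
URL_RE = re.compile(r"https?://[^\s|\]}<]+")


def extract_references(wikitext, context_chars=250):
    """Return one dict per <ref>...</ref> pair found in wikitext."""
    # (start offset, header level, header text) for every level 2-6 header.
    headers = [(m.start(), len(m.group(1)), m.group(2))
               for m in HEADER_RE.finditer(wikitext)]
    refs = []
    for m in REF_RE.finditer(wikitext):
        offset = m.start()
        content = m.group("content").strip()
        preceding = [h for h in headers if h[0] < offset]
        if preceding:
            _, level, text = preceding[-1]
        else:
            level, text = 0, None  # the lead section has no header
        template = TEMPLATE_RE.match(content)
        refs.append({
            "raw_content": content,
            "templated": content.startswith("{{") and content.endswith("}}"),
            "cite_template": template.group(1).lower() if template else None,
            "urls": URL_RE.findall(content),
            "section": len(preceding),  # 0 for the lead, split by headers
            "text_offset": offset,
            "preceding_text": wikitext[max(0, offset - context_chars):offset],
            "header_level": level,
            "header_text": text,
        })
    return refs
```

Calling `extract_references(wikitext)` on the text of one revision returns a list of dictionaries whose keys follow the field names listed above; fields such as identifiers, level_2_text, and revid are left out of the sketch.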

A source has the following fields
  • type (categorical) -- books, journal articles, conference proceedings, magazines, mainstream newspapers, etc.
  • level (categorical) -- primary, secondary, and tertiary
  • style (categorical) -- full, inline, short, in-text, and general reference -- this property should be bound to reference
  • quality
    • verifiability (see Wikipedia Verification Check: A Chrome Browser Extension)
      • technical verifiability (boolean) -- the trustworthiness of a source's identifier (e.g. ISBN, DOI)
      • practical verifiability (boolean) -- the open accessibility of a source. This may raise an interesting tension between open-access sources and those that require payment, in terms of the value each adds to the body of sources in Wikipedia.
        • Has a URL
        • URL returns a 200
    • persistence (numerical) -- how long a source has persisted, based on the history of reference revisions
      • revision history -- timestamp, article title, article ID, user name, user IP, added edits, and removed edits
  • relevance
    • content similarity
      • raw distance (numerical) -- a source's textual proximity to titles, headers, specific sections, preceding text, etc.
      • semantic distance (numerical) -- semantic proximity measured with reference to WordNet
    • topological similarity (numerical) -- the neighbor distance between sources, shortest paths, sigma of network diameter, etc.
  • importance
    • cumulative measures (numeric) -- cited frequency, citation burst, etc.
    • topological measures (numeric) -- PageRank, HITS, betweenness centrality, sigma of modularity, network diameter, clustering coefficient, etc.
    • possible measures: TBD
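
The two practical verifiability checks listed above ("Has a URL" and "URL returns a 200") could be computed along the following lines. This is a minimal sketch rather than the project's implementation: the function names, the use of a HEAD request, the User-Agent string, and the 10-second timeout are all assumptions made for illustration, and some servers answer HEAD differently from GET.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request fails."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "citation-check-sketch"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 or 403
    except (urllib.error.URLError, ValueError, OSError):
        return None  # unreachable host, malformed URL, timeout, ...


def practical_verifiability(urls, fetch_status=http_status):
    """Summarise the two checks for the URLs found in one reference.

    fetch_status is injectable so the logic can be exercised offline.
    """
    return {
        "has_url": bool(urls),
        "returns_200": any(fetch_status(url) == 200 for url in urls),
    }
```

Passing a stub such as `fetch_status={"http://a.example/ok": 200}.get` allows the function to be tried out without network access.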


  • A comprehensive review of research on citation practices in Wikipedia
  • Labeled reference dataset
  • Machine prediction model for predicting "good" and "bad" citations (likely, citations that will be removed vs. those that stay)
  • A report describing the above items.



By the end of the project, we aim to have built and evaluated a classifier that can accurately assess features of citations in Wikipedia data. We aim to engage interested members of the community in an open discussion about what features of citation data would be most useful to extract (see talk page). We will use this tool to investigate features of citation data and publish both an open dataset and (hopefully) a paper that describes our findings.

Policy, Ethics and Human Subjects Research


The proposed work does not involve human subjects research. If we decide later to incorporate a human-centered approach, the Drexel investigators have a long history of obtaining approval for and conducting human subjects work, including many studies specific to Wikipedia and Wikipedia editors.





The references dataset, extracted from 503 XML files of the July 1, 2017 dump of English Wikipedia, is now available as a set of compressed JSON files:

  Citations with contexts in Wikipedia
Halfaker, Aaron; Kim, Meen Chul; Forte, Andrea; Taraborelli, Dario (2017): Citations with contexts in Wikipedia. figshare. https://doi.org/10.6084/m9.figshare.5588842 Retrieved: 22:36, Dec 01, 2017 (GMT)

The parsed references include 1) citation context(s), 2) structured data and bibliographic metadata, and 3) additional data/metadata as follows:

  • Citation context(s)
    • section: the section # as generated by the parser (Integer)
    • text_offset: the # of chars between the beginning of the article and the beginning of the <ref> tag (Integer)
    • preceding_text: the immediate 250 chars before the beginning of the <ref> tag (String)
    • succeeding_text: the immediate 250 chars after the end of the <ref> tag (String)
    • header_level: the level of the header immediately above the <ref> tag (Integer)
    • header_text: the text of the header immediately above the <ref> tag (String)
    • header_offset: the # of chars between the closest header and the beginning of the <ref> tag (String)
    • level_2_text: the text of the nearest preceding level 2 header (String)
    • level_2_offset: the # of chars between the nearest preceding level 2 header and the beginning of the <ref> tag (Integer)
  • Structured data and bibliographic metadata
    • templated: boolean describing whether the <ref> wraps a citation template (Boolean)
    • cite_template: the type of the citation template (String)
    • urls: all the urls that appear in the content (String)
    • identifiers: all of the persistent identifiers contained in the content (String)
  • Additional data/metadata
    • revid: the revision ID associated with reference addition/deletion/edit (Integer)
    • name: the name of the reference (String)
    • raw_content: the original content between <ref> tags (String)

The JSON schema and Python parsing libraries used to generate the dataset are listed in the references of the data registry.
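
For readers who want to explore the files, a loader might look like the sketch below. The on-disk layout is an assumption here, not something stated on this page: the sketch assumes bz2 compression with one JSON object per line, and the helper names `iter_references` and `template_counts` are hypothetical. Check the JSON schema in the data registry before relying on it.

```python
import bz2
import json
from collections import Counter


def iter_references(path):
    """Yield one reference record per non-empty line of a bz2-compressed file.

    Assumes line-delimited JSON; adjust if the files use another layout.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def template_counts(records):
    """Count records by their cite_template value (None when untemplated)."""
    return Counter(record.get("cite_template") for record in records)
```

Because `iter_references` is a generator, `template_counts(iter_references(path))` processes a file record by record without loading it all into memory.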

Future Directions


The only field missing from the proposed reference structure is the domain of each URL. Rather than just parsing domain names, extracting categories of domains would add a richer dimension of interpretation when quantifying and classifying the quality and categories of articles given their citations. Multiple approaches can be considered. First, supervised learning could serve this purpose. It requires initial training labels, which we could crowdsource using predefined semi-structured categories in order to minimize annotation bias. Second, NER (named entity recognition) could extract meaningful word fragments from URLs, after which thesauri such as WordNet could be used to capture representative abstractions of those fragments at a semantic level.
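
The first step, plain domain parsing, can be sketched with Python's standard library as shown below. The helper names `url_domain` and `domain_counts` are assumptions made for this example. Note that the result is the full hostname with a leading "www." removed, not the registrable domain; reducing "news.bbc.co.uk" to "bbc.co.uk" would require a Public Suffix List lookup, which is not attempted here.

```python
from collections import Counter
from urllib.parse import urlsplit


def url_domain(url):
    """Return the lowercased hostname of url without a leading 'www.', or None."""
    try:
        host = urlsplit(url).hostname
    except ValueError:  # e.g. a malformed IPv6 literal
        return None
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host


def domain_counts(urls):
    """Count how often each domain occurs, ignoring URLs with no hostname."""
    domains = (url_domain(url) for url in urls)
    return Counter(domain for domain in domains if domain is not None)
```

The counts produced by `domain_counts` over the `urls` field could serve as a starting point for choosing which domains to label first.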
