Research:Characterizing Wikipedia Citation Usage

From Meta, a Wikimedia project coordination wiki
Tiziano Piccardi
Michele Catasta
Jure Leskovec
Robert West
Dario Taraborelli
Bahodir Mansurov
Duration:  2018-05 — ??
Open data project

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.


According to Wikipedia's Citation page, “Citations have several important purposes: to uphold intellectual honesty (or avoiding plagiarism),[2] to attribute prior or unoriginal work and ideas to the correct sources, to allow the reader to determine independently whether the referenced material supports the author's argument in the claimed way, and to help the reader gauge the strength and validity of the material the author has used.[3]”

Therefore, at a time when fake news has proven to be pivotal in many political and societal matters, it is of great interest to assess whether the external citations in Wikipedia are used (and checked) by its readers.

The goal of this project is to understand the role of external citations in Wikipedia reading. To this end, we plan to instrument all English Wikipedia articles for a limited period in order to capture user interactions with footnotes and references. The expected outcomes of our project are: (1) a deeper understanding of the citation usage patterns exhibited by readers, and (2) insights and potential recommendations for eliciting higher interest in citations.

In the longer term, we would also like to develop a predictive model that outputs the click frequency of any given external citation. The insights exposed by this model could inform the design of new or better tools for citation visualization, such as the Reference Tooltips.


We expect the methodology and the list of questions to change over time as we learn more about the data. The current work-in-progress list of questions we have brainstormed is given below.

Citation Usage in Wikipedia

  • What are the most cited resources? Building on existing research on citations with identifiers, we will break down the analysis by type of resource (e.g., scientific article, newspaper article, company webpage, blog, social media), by URL domain, and by topic.

Citation Consumption

  • How frequently are references clicked (e.g., what fraction of pageviews entails a reference click)?
  • Which characteristics of a Wikipedia article affect how often its external citations are visited? We will use different techniques to highlight the role of article popularity, article topic, article quality, and article saliency (e.g., an article about a current, trending event) in citation consumption.
  • What are the common characteristics of an external citation, and how do they affect citation clickthrough? Characteristics include position on the page, creation time relative to the lifecycle of the article, and reference type (e.g., further information, support for a fact).
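The first question above (what fraction of pageviews entails a reference click) could be computed along the following lines. This is only a minimal sketch: the `Event` record, its `pageview_id` field, and the action names `"pageLoad"` and `"extClick"` are hypothetical placeholders chosen for illustration, not the fields of the actual schema.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Event:
    pageview_id: str  # hypothetical: identifier tying events to one pageview
    action: str       # hypothetical: "pageLoad" or "extClick"


def reference_click_fraction(events: Iterable[Event]) -> float:
    """Return the fraction of pageviews with at least one reference click."""
    pageviews = set()
    clicked = set()
    for event in events:
        if event.action == "pageLoad":
            pageviews.add(event.pageview_id)
        elif event.action == "extClick":
            clicked.add(event.pageview_id)
    if not pageviews:
        return 0.0
    # Count each pageview once, however many references were clicked in it,
    # and ignore clicks whose pageview was never logged.
    return len(clicked & pageviews) / len(pageviews)
```

Counting a pageview once regardless of the number of clicks matches the "fraction of pageviews" wording; a per-reference click rate would need a different denominator.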

Reader Behavior Analysis

  • Can we study how distinct groups of Wikipedia readers interact with external citations (e.g., reference followers vs. ignorers, top-to-bottom readers vs. random-section readers)?
  • Which types of reading sessions lead readers to consult external citations more frequently? We will evaluate the impact of session length, entry point (search engine vs. random browsing), and position within a Wikipedia browsing session (first article vs. end of the session).

Data Collection

We are in the early stages of understanding the data, so the data collection plans are a work in progress and we will iterate on them. As of June 14, 2018, here is what we know about the data collection steps:

  • We will collect data for a few days, sampling 1 to 15% of traffic, depending on the sparsity of entries. (We do not know the frequency of citation usage, so we may have to change this plan based on the initial validation steps.)
  • After this period, we will check the data quality. Once that is verified, we intend to collect data at a 100% sampling rate for a period of one week.
  • The schema for the data we will collect is here: Schema:CitationUsage. The schema itself does not store the IP address; the IP is collected automatically by the event capsule.
  • While the schema does not include the IP, the clientIP is collected by the EventLogging capsule, and that information is purged (dropped) every 90 days; see Data retention and purging.
  • Initially, we intend to purge all data at 90-day intervals until we get a better sense of what kind of signal we can get from this kind of data.
  • We won't collect data from logged-in users.
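One common way to sample a fixed share of traffic is to hash a per-session token into a bucket and compare it with the sampling rate. The sketch below illustrates that idea together with the logged-in exclusion from the list above. It is an assumption-laden illustration, not the mechanism used in this project: the function names, the `session_token` argument, and the use of SHA-256 are all hypothetical.

```python
import hashlib


def in_sample(session_token: str, rate: float) -> bool:
    """Deterministically decide whether a session falls in the sample.

    `rate` is the sampling rate as a fraction, e.g. 0.01 for 1%.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(session_token.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer bucket in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Integer threshold avoids float rounding at the rate == 1.0 boundary.
    return bucket < int(rate * 2**64)


def should_log(session_token: str, rate: float, logged_in: bool) -> bool:
    """Hypothetical gate: skip logged-in users, then apply sampling."""
    return (not logged_in) and in_sample(session_token, rate)
```

Hashing the session token rather than drawing a random number per event keeps all events of a sampled session together, which matters if per-session behavior is to be analyzed later.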

Discussion moved to the talk page.

First round of data collection

The first round of data collection started on Thu, Jun 28, 11:13 PM (UTC) and ended on Mon, Jul 9, 11:06 PM (UTC).
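As a quick check of how long this window was, the two timestamps can be subtracted directly. The year is not stated in the sentence above; the sketch assumes 2018, which is consistent with the project dates given earlier on this page and with the stated weekdays.

```python
from datetime import datetime, timezone

YEAR = 2018  # assumption: the year is not given alongside the timestamps

start = datetime(YEAR, 6, 28, 23, 13, tzinfo=timezone.utc)  # Thu, Jun 28, 11:13 PM
end = datetime(YEAR, 7, 9, 23, 6, tzinfo=timezone.utc)      # Mon, Jul 9, 11:06 PM

duration = end - start
print(duration)  # 10 days, 23:53:00
```

Under that assumption, the first round covered just under 11 days.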

See also