Research:Wikipedia clickstream

From Meta, a Wikimedia project coordination wiki

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.


About

The Wikipedia Clickstream dataset contains counts of (referer, resource) pairs extracted from the request logs of Wikipedia. A referer is an HTTP header field that identifies the address of the webpage that linked to the resource being requested. The data shows how people get to a Wikipedia article and what links they click on. In other words, it gives a weighted network of articles, where each edge weight corresponds to how often people navigate from one page to another. To give an example, consider the figure below, which shows incoming and outgoing traffic to the "London" article on English Wikipedia during January 2015.

[Figure: London clickstream.png — incoming and outgoing traffic to the "London" article, January 2015]
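The weighted-network view described above can be sketched in a few lines of Python. The rows below are illustrative (prev, curr, n) tuples, not actual counts from the dataset:

```python
# A minimal sketch of the clickstream as a weighted, directed graph.
# Each edge (prev, curr, n) says: n clients reached curr from prev.
rows = [
    ("other-search", "London", 4000),
    ("England", "London", 1200),
    ("London", "River_Thames", 800),
    ("London", "England", 650),
]

def incoming(article, rows):
    """Total incoming traffic: sum of edge weights where the article is the target."""
    return sum(n for prev, curr, n in rows if curr == article)

def outgoing(article, rows):
    """Outgoing clicks: edges where the article is the source."""
    return {curr: n for prev, curr, n in rows if prev == article}
```

With these helpers, `incoming("London", rows)` sums traffic from search and from other articles, while `outgoing("London", rows)` lists which links readers followed onward.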

Where to get the Data

The canonical citation and most up-to-date version of this dataset can be found at:

Ellery Wulczyn, Dario Taraborelli (2015). Wikipedia Clickstream. figshare. doi:10.6084/m9.figshare.1305770

Getting Started

Check out this IPython notebook for a tutorial on working with the February 2015 release.

Data Preparation

For each release, and for several Wikipedia language versions, we take one month's worth of requests for articles in the main namespace. Referers are mapped to a fixed set of values, based on this scheme:

  • an article in the main namespace -> the article title
  • a page from any other Wikimedia project -> other-internal
  • an external search engine -> other-search
  • any other external site -> other-external
  • an empty referer -> other-empty
  • anything else -> other-other
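The mapping scheme above can be sketched as a single function. This is a simplified illustration under assumed rules — the host patterns, search-engine list, and the `map_referer` function itself are hypothetical stand-ins for the real pipeline's more involved logic:

```python
import re

# Illustrative list of search-engine hosts; the real pipeline's list is longer.
SEARCH_ENGINES = ("google.", "bing.", "yahoo.", "duckduckgo.")

def map_referer(referer, main_ns_titles):
    """Map a raw referer URL to one of the fixed values described above (sketch)."""
    if not referer:
        return "other-empty"
    host = re.sub(r"^https?://", "", referer).split("/", 1)[0]
    parts = referer.split("/wiki/", 1)
    if host.endswith("wikipedia.org") and len(parts) == 2:
        title = parts[1]
        if title in main_ns_titles:
            return title                # an article in the main namespace
        return "other-other"
    if host.endswith((".wikimedia.org", ".wiktionary.org")):
        return "other-internal"         # a page from another Wikimedia project
    if any(s in host for s in SEARCH_ENGINES):
        return "other-search"           # an external search engine
    return "other-external"             # any other external site
```

For example, `map_referer("https://en.wikipedia.org/wiki/London", {"London"})` yields the article title, while an empty referer yields `other-empty`.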

Requests for pages that get redirected are mapped to the page they redirect to. We attempt to exclude spider traffic by classifying user agents with the ua-parser library and a few additional Wikipedia-specific filters. Finally, any `(referer, resource)` pair with 10 or fewer observations is removed from the dataset. To give a sense of the scale of the data, the March 2016 release for English Wikipedia contained 25 million `(referer, resource)` pairs from a total of 6.8 billion requests.
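The final aggregation and thresholding step can be sketched as follows. The `aggregate` function and the sample requests are illustrative, not the actual pipeline code:

```python
from collections import Counter

def aggregate(requests):
    """Count (referer, resource) pairs and drop any pair
    observed 10 or fewer times, as described above."""
    counts = Counter(requests)
    return {pair: n for pair, n in counts.items() if n > 10}

# Illustrative request stream: 12 hits on one pair, 5 on another.
requests = [("England", "London")] * 12 + [("England", "Paris")] * 5
kept = aggregate(requests)
```

Here only the `("England", "London")` pair survives the threshold, since the other pair has fewer than 11 observations.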

Format

The current data includes the following four fields:

  • prev: the result of mapping the referer URL to the fixed set of values described above
  • curr: the title of the article the client requested
  • type: describes (prev, curr)
    • link: if the referer and request are both articles and the referer links to the request
    • external: if the referer host is not en(.m)?.wikipedia.org
    • other: if the referer and request are both articles but the referer does not link to the request. This can happen when clients search or spoof their referer.
  • n: the number of occurrences of the (referer, resource) pair
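The fields above arrive as tab-separated rows, which can be read with the standard library. The sample rows here are illustrative, not real counts:

```python
import csv
import io

# Illustrative tab-separated rows with the four fields: prev, curr, type, n.
sample = (
    "other-search\tLondon\texternal\t4000\n"
    "England\tLondon\tlink\t1200\n"
)

def read_clickstream(fileobj):
    """Yield one dict per (prev, curr, type, n) row of the TSV."""
    for prev, curr, type_, n in csv.reader(fileobj, delimiter="\t"):
        yield {"prev": prev, "curr": curr, "type": type_, "n": int(n)}

rows = list(read_clickstream(io.StringIO(sample)))
```

For a real release file, replace the `io.StringIO` wrapper with an open file handle.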

Releases

As the project has evolved, the exact details of how the data is generated have changed. Below is a list of releases, with notes where the data preparation and format differ from what is described above.

January 2017

  • released a dataset for English (2017_01_en)

September 2016

  • released a dataset for English (2016_09_en)

August 2016

  • released a dataset for English (2016_08_en)
  • released a dataset for English (2016_08_en_unresolved) where redirects were not resolved in the usual manner. Instead, the requested article is captured in the curr_unresolved column, so page titles in this column can be redirects; in that case, the curr column captures the page the user was redirected to.

April 2016

  • released datasets for English, Arabic, and Farsi Wikipedia

March 2016

  • external referers were mapped to a more granular set of fixed values

February 2016

  • external referers were mapped to a more granular set of fixed values

February 2015

  • data also included page ids for prev and curr
  • only requests to the desktop version were used
  • requests from clients who made too many requests were removed (for details, see here and here)
  • redlinks were included as a type
  • external referers were mapped to a more granular set of fixed values

January 2015

  • data also included page ids for prev and curr
  • only requests to the desktop version were used
  • redirects were not resolved
  • external referers were mapped to a more granular set of fixed values

Applications

This data can be used for various purposes:

  • determining the most frequent links people click on for a given article
  • determining the most common links people followed to an article
  • determining how much of the total traffic to an article clicked on a link in that article
  • generating a Markov chain over Wikipedia
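The last application above, a Markov chain over Wikipedia, amounts to normalizing each article's outgoing counts into transition probabilities. A minimal sketch, with illustrative edge counts:

```python
from collections import defaultdict

# Illustrative (prev, curr, n) edges; not actual dataset counts.
edges = [
    ("London", "River_Thames", 800),
    ("London", "England", 200),
    ("England", "London", 500),
]

def transition_matrix(edges):
    """Normalize outgoing counts per page into transition probabilities."""
    totals = defaultdict(int)
    for prev, _, n in edges:
        totals[prev] += n
    probs = defaultdict(dict)
    for prev, curr, n in edges:
        probs[prev][curr] = n / totals[prev]
    return probs

P = transition_matrix(edges)
```

Note that article-to-article edges (`type` = link) are the natural input here; rows whose `prev` is one of the `other-*` values would need to be filtered out or modeled as extra source states.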

External links