Research:Wikipedia clickstream

From Meta, a Wikimedia project coordination wiki

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.


About

The Wikipedia Clickstream dataset contains counts of (referer, resource) pairs extracted from the request logs of Wikipedia. A referer is an HTTP header field that identifies the address of the webpage that linked to the resource being requested. The data shows how people get to a Wikipedia article and what links they click on. In other words, it gives a weighted network of articles, where each edge weight corresponds to how often people navigate from one page to another. To give an example, consider the figure below, which shows incoming and outgoing traffic to the "London" article on English Wikipedia during January 2015.

[Figure: London clickstream.png — incoming and outgoing traffic to the "London" article, January 2015]
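The weighted-network view described above can be sketched in a few lines of Python. The rows below are illustrative (prev, curr, n) tuples, not actual counts from the dataset:

```python
# A minimal sketch of the clickstream as a weighted, directed graph.
# Each edge (prev, curr, n) says: n clients reached curr from prev.
rows = [
    ("other-search", "London", 4000),
    ("England", "London", 1200),
    ("London", "River_Thames", 800),
    ("London", "England", 650),
]

def incoming(article, rows):
    """Total incoming traffic: sum of edge weights where the article is the target."""
    return sum(n for prev, curr, n in rows if curr == article)

def outgoing(article, rows):
    """Outgoing clicks: edges where the article is the source."""
    return {curr: n for prev, curr, n in rows if prev == article}
```

With these helpers, `incoming("London", rows)` sums traffic from search and from other articles, while `outgoing("London", rows)` lists which links readers followed onward.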

Where to get the Data

The canonical citation and most up-to-date version of this dataset can be found at:

Ellery Wulczyn, Dario Taraborelli (2015). Wikipedia Clickstream. figshare. doi:10.6084/m9.figshare.1305770

Getting Started

Check out this IPython notebook for a tutorial on working with the February 2015 release.

Data Preparation

For each release, and for several Wikipedia language versions, we take one month's worth of requests for articles in the main namespace. Referers are mapped to a fixed set of values, based on this scheme:

  • an article in the main namespace -> the article title
  • a page from any other Wikimedia project -> other-internal
  • an external search engine -> other-search
  • any other external site -> other-external
  • an empty referer -> other-empty
  • anything else -> other-other
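The mapping scheme above can be sketched as a single function. This is a simplified illustration under assumed rules — the host patterns, search-engine list, and the `map_referer` function itself are hypothetical stand-ins for the real pipeline's more involved logic:

```python
import re

# Illustrative list of search-engine hosts; the real pipeline's list is longer.
SEARCH_ENGINES = ("google.", "bing.", "yahoo.", "duckduckgo.")

def map_referer(referer, main_ns_titles):
    """Map a raw referer URL to one of the fixed values described above (sketch)."""
    if not referer:
        return "other-empty"
    host = re.sub(r"^https?://", "", referer).split("/", 1)[0]
    parts = referer.split("/wiki/", 1)
    if host.endswith("wikipedia.org") and len(parts) == 2:
        title = parts[1]
        if title in main_ns_titles:
            return title                # an article in the main namespace
        return "other-other"
    if host.endswith((".wikimedia.org", ".wiktionary.org")):
        return "other-internal"         # a page from another Wikimedia project
    if any(s in host for s in SEARCH_ENGINES):
        return "other-search"           # an external search engine
    return "other-external"             # any other external site
```

For example, `map_referer("https://en.wikipedia.org/wiki/London", {"London"})` yields the article title, while an empty referer yields `other-empty`.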

Requests for pages that get redirected are mapped to the page they redirect to. We attempt to exclude spider traffic by classifying user agents with the ua-parser library and a few additional Wikipedia-specific filters. Finally, any `(referer, resource)` pair with 10 or fewer observations is removed from the dataset. To give a sense of the scale of the data, the March 2016 release for English Wikipedia contained 25 million `(referer, resource)` pairs from a total of 6.8 billion requests.
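The final aggregation and thresholding step can be sketched as follows. The `aggregate` function and the sample requests are illustrative, not the actual pipeline code:

```python
from collections import Counter

def aggregate(requests):
    """Count (referer, resource) pairs and drop any pair
    observed 10 or fewer times, as described above."""
    counts = Counter(requests)
    return {pair: n for pair, n in counts.items() if n > 10}

# Illustrative request stream: 12 hits on one pair, 5 on another.
requests = [("England", "London")] * 12 + [("England", "Paris")] * 5
kept = aggregate(requests)
```

Here only the `("England", "London")` pair survives the threshold, since the other pair has fewer than 11 observations.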

Format

The current data includes the following four fields:

  • prev: the result of mapping the referer URL to the fixed set of values described above
  • curr: the title of the article the client requested
  • type: describes (prev, curr)
    • link: if the referer and request are both articles and the referer links to the request
    • external: if the referer host is not en(.m)?.wikipedia.org
    • other: if the referer and request are both articles but the referer does not link to the request. This can happen when clients search or spoof their referer.
  • n: the number of occurrences of the (referer, resource) pair
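The fields above arrive as tab-separated rows, which can be read with the standard library. The sample rows here are illustrative, not real counts:

```python
import csv
import io

# Illustrative tab-separated rows with the four fields: prev, curr, type, n.
sample = (
    "other-search\tLondon\texternal\t4000\n"
    "England\tLondon\tlink\t1200\n"
)

def read_clickstream(fileobj):
    """Yield one dict per (prev, curr, type, n) row of the TSV."""
    for prev, curr, type_, n in csv.reader(fileobj, delimiter="\t"):
        yield {"prev": prev, "curr": curr, "type": type_, "n": int(n)}

rows = list(read_clickstream(io.StringIO(sample)))
```

For a real release file, replace the `io.StringIO` wrapper with an open file handle.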

Releases

As the project has evolved, the exact details of how the data is generated have changed. Below is a list of releases, with notes where the data preparation and format differ from what is described above.

January 2017

  • released a dataset for English (2017_01_en)

September 2016

  • released a dataset for English (2016_09_en)

August 2016

  • released a dataset for English (2016_08_en)
  • released a dataset for English (2016_08_en_unresolved) where redirects were not resolved in the usual manner. Instead, the requested article is captured in the curr_unresolved column, so page titles in this column can be redirects; in that case, the curr column captures the page the user was redirected to.

April 2016

  • released datasets for English, Arabic, and Farsi Wikipedia

March 2016

  • external referers were mapped to a more granular set of fixed values

February 2016

  • external referers were mapped to a more granular set of fixed values

February 2015

  • data also included page ids for prev and curr
  • only requests to the desktop version were used
  • requests from clients who made too many requests were removed (for details, see here and here)
  • redlinks were included as a type
  • external referers were mapped to a more granular set of fixed values

January 2015

  • data also included page ids for prev and curr
  • only requests to the desktop version were used
  • redirects were not resolved
  • external referers were mapped to a more granular set of fixed values

Applications

This data can be used for various purposes:

  • determining the most frequent links people click on for a given article
  • determining the most common links people followed to an article
  • determining how much of the total traffic to an article clicked on a link in that article
  • generating a Markov chain over Wikipedia
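The last application above, a Markov chain over Wikipedia, amounts to normalizing each article's outgoing counts into transition probabilities. A minimal sketch, with illustrative edge counts:

```python
from collections import defaultdict

# Illustrative (prev, curr, n) edges; not actual dataset counts.
edges = [
    ("London", "River_Thames", 800),
    ("London", "England", 200),
    ("England", "London", 500),
]

def transition_matrix(edges):
    """Normalize outgoing counts per page into transition probabilities."""
    totals = defaultdict(int)
    for prev, _, n in edges:
        totals[prev] += n
    probs = defaultdict(dict)
    for prev, curr, n in edges:
        probs[prev][curr] = n / totals[prev]
    return probs

P = transition_matrix(edges)
```

Note that article-to-article edges (`type` = link) are the natural input here; rows whose `prev` is one of the `other-*` values would need to be filtered out or modeled as extra source states.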

External links