Research:Characterizing Wikipedia Reader Behaviour


This page documents a research project in progress.
Information may be incomplete and change as the project progresses.

You can read peer-reviewed results of this research here.

We are currently working on reproducing and extending the study to several other language editions. Details on this ongoing work are described here.

This page summarizes our research efforts on characterizing reader behavior on Wikipedia; i.e., why we read Wikipedia.


Wikipedia attracts millions of readers from across the globe and serves a broad range of their daily information needs. Despite this, very little is known about the motivations and needs of this diverse user group: why they come to Wikipedia, how they consume the content in the encyclopedia, and how they learn. Without this knowledge, creating more content, products, and services that ensure high levels of user experience remains an open challenge. In light of this, this research is concerned with understanding why we read Wikipedia.

In detail, we present a robust taxonomy of use cases for reading Wikipedia, constructed through a series of surveys. We then enrich the survey data by linking each survey response to the respondent's behavior traces mined from Wikipedia's webrequest logs. Finally, we use the joined survey and log data to identify characteristic behavior patterns for reader groups with specific intentions.

The outcomes of this research can help the community at large, Wikipedia's editor and developer communities, as well as the Wikimedia Foundation (e.g., the Reading team), to make more informed decisions about how to create and serve encyclopedic content in ways that are more suitable for the needs of those who seek to access it, and to design appropriate user experiences.


  1. We present a robust taxonomy for characterizing use cases for reading Wikipedia, which captures users' motivations to visit Wikipedia, the depth of information they are seeking, and their familiarity with the topic of interest prior to visiting Wikipedia.
  2. We quantify the prevalence of, and the interactions between, users' motivations, information needs, and prior familiarity via a large-scale survey yielding almost 30,000 responses.
  3. We enhance our understanding of the behavioral patterns associated with different use cases by combining survey responses with digital traces recorded in Wikipedia's server logs.

Taxonomy of Wikipedia Readers

Our research relies on a taxonomy of Wikipedia readers, something that was previously absent from the literature. We designed and analyzed a series of surveys based on techniques from grounded theory to build a robust categorization scheme for Wikipedia readers' motivations and needs.

We have many hypotheses about why users read Wikipedia. Some come from our own experience with Wikipedia, and some come from mostly unstructured interactions with Wikipedia readers. To understand our readers, our first goal was to collect qualitative data on why users read Wikipedia articles, using a series of surveys conducted on live Wikipedia.

We iterated on the design of the survey, starting with a round of free-form surveys (S1 and S2) that asked participants "Why are you reading this article today?" and let them answer in free-form text. We hand-coded the responses and arrived at three broad ways in which users interpreted the question; we use them as orthogonal dimensions to shape our taxonomy:

  • Motivation: work/school project, personal decision, current event, media, conversation, bored/random, intrinsic learning
  • Information need: quick fact look-up, overview, in-depth
  • Prior knowledge: familiar, unfamiliar
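As an illustration, the three dimensions above can be written down as a small data structure. This is only a sketch: the class and field names (`Motivation`, `InformationNeed`, `PriorKnowledge`, `UseCase`) are hypothetical, and it assumes one value per dimension for simplicity, which may not match how the surveys recorded answers.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Motivation(Enum):
    WORK_SCHOOL = "work/school project"
    PERSONAL_DECISION = "personal decision"
    CURRENT_EVENT = "current event"
    MEDIA = "media"
    CONVERSATION = "conversation"
    BORED_RANDOM = "bored/random"
    INTRINSIC_LEARNING = "intrinsic learning"


class InformationNeed(Enum):
    FACT = "quick fact look-up"
    OVERVIEW = "overview"
    IN_DEPTH = "in-depth"


class PriorKnowledge(Enum):
    FAMILIAR = "familiar"
    UNFAMILIAR = "unfamiliar"


@dataclass(frozen=True)
class UseCase:
    """One point in the taxonomy: a value on each of the three dimensions."""
    motivation: Motivation
    information_need: InformationNeed
    prior_knowledge: PriorKnowledge


# Because the dimensions are treated as orthogonal, every combination is a
# valid use case under this simplified single-value-per-dimension reading.
all_use_cases = [
    UseCase(m, n, k)
    for m, n, k in product(Motivation, InformationNeed, PriorKnowledge)
]
```

Under these assumptions the taxonomy spans 7 × 3 × 2 = 42 combinations, which is the space over which the prevalence of use cases can be tabulated.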

We assessed the robustness of the above taxonomy in follow-up surveys. S2 is identical to S1 and validates our categories on unseen data; additionally, we tested the design of S1 across multiple languages (English, Spanish, Persian). Subsequently, we crafted a multiple-choice version of the free-form surveys (S3) that let respondents select options along the dimensions of our taxonomy. It also offered an "other" option where users could again enter free-form text if no option applied. Only 2.3% of respondents used the "other" option, and hand-coding of the corresponding free-form responses did not result in new categories. We thus concluded that our categories are robust and use the resulting classification as our taxonomy of Wikipedia readers.

Survey                  Project            Platform        Start       End
S1-English              English Wikipedia  Desktop         2015-10-12  2015-10-16
S2-English              English Wikipedia  Mobile          2015-11-20  2015-11-23
S3-English              English Wikipedia  Desktop+Mobile  2015-11-24  2015-11-30
S1-Spanish              Spanish Wikipedia  Desktop+Mobile  2016-01-28  2016-01-30
S1-Persian              Persian Wikipedia  Desktop+Mobile  2016-01-28  2016-01-30
S3-English Large Scale  English Wikipedia  Desktop+Mobile  2016-02-29  2016-03-08

Data and Preprocessing

This section covers our core data and preprocessing.

Large-scale survey

To quantify the prevalence of the driving factors specified by our taxonomy, we ran a large-scale survey on English Wikipedia consisting of the same three questions on motivation, depth of information need, and prior knowledge (S3-English Large Scale). The responses to this survey are the main subject of the data analyses conducted for this research project. Overall, our dataset consists of survey answers from 29,372 participants after basic data cleaning such as removing duplicate answers from the same users. Given the added instrumentation, there is a new privacy statement [1] and an added message right before the Submit button in the survey that helps users make a more informed choice about whether to submit their responses. You can track the Phabricator tasks associated with deploying the survey here.

Webrequest logs and features

Ultimately, we aim to understand how users' motivation, desired depth of knowledge, and prior knowledge (i.e., their answers to our survey) manifest themselves in their reading behavior. The data collected through the survey alone, however, does not provide any information on the respondent's behavior beyond the single pageview upon which the survey was presented. To analyze respondents' reading behavior in context, we connect survey responses to the webrequest logs maintained by Wikipedia's web servers. As the information needs and reading behavior of the same user may change over time, we operate at an intermediate temporal granularity by decomposing a user's entire browsing history into sessions, where we define a session as a contiguous sequence of pageviews with no break longer than one hour. Additionally, we supplemented the data with a large variety of features: survey features capture responses to survey questions; request features capture background information about the respondent mined from webrequest logs; article features describe the requested Wikipedia article; and session/activity features are derived from the entire reading session and beyond-session activity.
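The session definition above can be sketched in a few lines. This is a minimal illustration rather than the project's actual pipeline: the function name `split_into_sessions` and the sample timestamps are hypothetical, and the sketch assumes the pageview timestamps of a single reader have already been extracted from the logs.

```python
from datetime import datetime, timedelta

# One-hour inactivity threshold, as stated in the session definition.
SESSION_GAP = timedelta(hours=1)


def split_into_sessions(timestamps, gap=SESSION_GAP):
    """Group one reader's pageview timestamps into sessions.

    A new session starts whenever the break since the previous
    pageview is longer than `gap`.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # continue the current session
        else:
            sessions.append([ts])     # first pageview, or the gap was exceeded
    return sessions


# Hypothetical pageviews of a single reader.
pageviews = [
    datetime(2016, 3, 1, 9, 0),
    datetime(2016, 3, 1, 9, 20),
    datetime(2016, 3, 1, 10, 15),  # 55 minutes after the previous view
    datetime(2016, 3, 1, 12, 0),   # 105 minutes later, so a new session
    datetime(2016, 3, 1, 12, 5),
]
sessions = split_into_sessions(pageviews)
session_lengths = [len(s) for s in sessions]
```

With these sample timestamps the reader's history splits into two sessions of three and two pageviews. Session-level features (for example, session length or time between pageviews) would then be computed per group.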

Survey bias correction

The goal of this work is to study the motivations and behaviors representative of Wikipedia's entire reader population. However, deducing properties of a general population from surveying a limited subpopulation is subject to different kinds of biases and confounds, including coverage bias (inability to reach certain subpopulations), sampling bias (distortions due to the sampling procedure), and non-response bias (varying likelihood of survey participation after being sampled as a participant). Consequently, an important step in our analyses is to account for potential bias in survey responses. To that end, we opt for inverse propensity score weighting. This technique assigns a weight to each survey response, thus correcting bias with respect to a control group (the Wikipedia population). The rationale behind this procedure is that answers of users less likely to participate in the survey should receive higher weights, as they represent a larger part of the overall population with similar features. To determine participation probabilities (propensity scores), we use gradient-boosted regression trees on individual samples to predict whether they belong to the survey group or the control group, using all of our features. By using background features (e.g., country, time) plus digital traces (e.g., sessions), and by building a representative control group, we have an advantage over traditional survey design, which is often limited to a few response features such as gender and age, as well as to small control groups.
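The weighting step can be illustrated with a small sketch. It is not the project's implementation: it assumes propensity scores have already been estimated (here they are made-up numbers, not the output of gradient-boosted trees), it uses the common choice of weights proportional to 1/p, and the function names `ipw_weights` and `weighted_share` are hypothetical.

```python
def ipw_weights(propensities):
    """Return weights proportional to 1/p, rescaled to sum to the sample size.

    Assumption: p is the estimated probability that a reader with these
    features takes part in the survey, so a low p yields a high weight.
    """
    if any(not 0.0 < p <= 1.0 for p in propensities):
        raise ValueError("propensity scores must lie in (0, 1]")
    raw = [1.0 / p for p in propensities]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


def weighted_share(answers, weights, category):
    """Weighted fraction of responses that fall into `category`."""
    total = sum(weights)
    matching = sum(w for a, w in zip(answers, weights) if a == category)
    return matching / total


# Hypothetical responses and propensity scores for four respondents.
answers = ["media", "media", "work/school", "conversation"]
propensities = [0.5, 0.5, 0.25, 0.25]

weights = ipw_weights(propensities)
raw_share = answers.count("media") / len(answers)
corrected_share = weighted_share(answers, weights, "media")
```

In this toy example the "media" respondents are twice as likely to participate as the others, so their unweighted share of one half drops to one third after weighting. The same weighted tabulation would apply to any survey dimension.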

Summary of Results: Why We Read Wikipedia

  • Wikipedia is read in a wide variety of use cases that differ in their motivation triggers, the depth of information needs, and readers' prior familiarity with the topic. There are no clearly dominating use cases, and readers are familiar with the topic they are interacting with as often as they are not.
  • Wikipedia is used for shallow information needs (fact look-up and overview) more often than for deep information needs. Deep information needs are most prevalent when the reader is driven by intrinsic learning, and fact look-ups are triggered by conversations, whereas overviews are triggered by bored/random exploration, media coverage, or the need to make a personal decision.
  • Results suggest that motivations are mostly stable over time (days of the week and hours of the day). There are a few exceptions to this general observation: motivations triggered by the media increase on weekends and at night, conversation triggers increase on weekends, and work/school triggers increase on weekdays and during the day.
  • When Wikipedia is used for work or school assignments, users tend to use a desktop computer to engage in long pageviews and sessions; sessions tend to be topically coherent and predominantly involve central, serious articles rather than entertainment-related ones; search engine usage is increased; and sessions tend to traverse from the core to the periphery of the article network.
  • Media-driven usage is directed toward popular, entertainment-related articles that are frequently less well embedded into the article network.
  • Intrinsic learning tends to involve arts and science articles with no significant navigational features; conversations bring infrequent users to Wikipedia, who engage in short interactions with the site, frequently on their mobile devices.
  • People who use Wikipedia out of boredom or in order to explore randomly tend to be power users; they navigate Wikipedia on long, fast-paced, topically diverse link chains; and they often visit popular articles on entertainment-related topics, less so on science-related topics.
  • Current events tend to drive traffic to long sports- and politics-related articles; the articles tend to be popular, likely because the triggering event is trending.
  • When Wikipedia is consulted to make a personal decision, the articles are frequently geography- and technology-related, possibly due to travel or product purchase decisions.



No presentation is scheduled at this point.


A list of presentations about this work can be found below:

See also

Research terms

This project is conducted in collaboration with a team at Stanford University. This formal research collaboration is based on a mutual agreement between the collaborators to respect Wikimedia user privacy and focus on research that can benefit the community of Wikimedia researchers, volunteers, and the WMF. To this end, the researchers who work with the private data have entered into a non-disclosure agreement as well as a memorandum of understanding. This research is subject to the Wikimedia Foundation's Open Access Policy.