# Abstract

Much existing knowledge about global consumption of peer-produced information is supported by data on Wikipedia page view counts and surveys. In 2017, we began measuring the time readers spend on a given page view (dwell time), enabling a more detailed understanding of such reading patterns.

In this report, we validate and model this new data and, building on existing findings, use regression analysis to test hypotheses about how patterns in reading time vary, in particular between global contexts. This data allows us to answer questions like: How much time do Wikipedia readers typically spend when they visit an article? How does time spent vary from language edition to language edition, or between different kinds of readers or articles? How can we determine whether a new design change increases the time spent by readers?

We validate this data and begin to answer questions such as the ones above. We observe the limitations of the data, most notably a high rate (57%) of missing data on mobile devices. It is important to consider these shortcomings, but we believe that the data can be fruitfully applied to improve our current knowledge of how people read Wikipedia. We used regression analyses to explore how factors like page length, device choice, and the locations of readers are related to reading times. We believe that our results for device choice and reader location offer behavioral data to corroborate findings from Why the World Reads Wikipedia, a large scale survey of Wikipedia readers of 14 language editions.

Here are some highlights from our analysis:

• The entire human race spent about 670,000 years reading Wikipedia between November 2017 and October 2018.
• The distribution of reading times is skewed. Upon opening an article, the median visitor will remain for only 25 seconds. Given that a reader remains for at least 25 seconds, the chances are about even that they will spend more than 75 seconds on the page. This means that the median and geometric mean are better for summarizing changes to reading time than the arithmetic mean.
• Log-normal distributions fit the data well enough to reasonably justify regression modeling and t-tests for comparing geometric means. However, both power-law distributions such as the lomax distribution and the three-parameter exponentiated Weibull distribution often fit better than the log-normal distribution. This is a potential avenue for future analysis.
• Page length has a positive relationship to reading time. Doubling page length is associated with an increase in reading times of a factor of 1.2. A typical page that is 10000 bytes of wikitext long will have expected reading time of about 25 seconds. Our model predicts that the expected reading time will increase to 30 seconds if the page's length were to increase to 20000 bytes of wikitext.
• Readers visiting from mobile devices are likely to spend less time per page view than readers from desktop devices. The geometric mean reading time is 26 seconds on desktop and 22 seconds on mobile.
• On average, readers spend over twice as much time on their last view in a session compared to preceding views. The typical reading time for page views other than the last in session is about 22 seconds, but if the view is the last in session this jumps to about 44 seconds.
• Typical readers in the Global South spend more time per page view than readers in the Global North. Readers in Global South countries are predicted to spend 130% as much time compared to equivalent readers in Global North countries. The median Global North reader on mobile, not in the last page view in their session, spends just over 20.1 seconds on a page compared to the 21.5 seconds spent by a Global South reader.
• The gap between the time spent by Global South readers and Global North readers is amplified on desktop devices. Global North readers on desktop have a median reading time of 16.1 seconds, compared to 21.8 seconds for Global South readers.

# Introduction

In 2017, the Wikimedia Foundation’s web team introduced new instrumentation to measure the amount of time Wikipedia readers spend on the pages they view. At the outset, the goal was to develop a new metric for evaluating how feature releases, such as the launch of the page previews feature, may change reader behavior. However, we realized that this data could also be useful for understanding patterns of reading behavior in general. In preliminary work during 2017, Readers senior analyst Tilman Bayer and Readers analysis intern Zareen Farooqui explored and evaluated the new metric. This project continues this work with additional validation and new analyses of general reading patterns.

Measuring the amount of time that visitors to Wikipedia spend viewing pages provides new information about patterns of content consumption on Wikipedia. MediaWiki wikis provide records of every edit and thereby readily available, granular, and high-quality data on productive collaboration, but we lack comparable sources of data for understanding Wikipedia readership. Analysts of Wikipedia content consumption have approached this through a creative arsenal of data collection strategies including view counts, click streams, session lengths, eye-tracking, scroll positions, instrumentation of a feature on the mobile website that allows readers to expand and collapse particular sections, and surveys. Previous research also used limited approximations of dwell time such as assuming that the end of a page-view is always marked by a new web-request originating from the same IP and user agent[1]. However, at present, the only publicly available large-scale and granular data source on reader behavior is the count of page views. Measuring reading time provides additional nuance over view data. With reading times in our field of view, it becomes clear that not all views are created equal. Some page views involve deep reading, yet most are quite short.

In this project we first evaluate the quality of the adopted approach for measuring reading times. We do this by comparing them to the timing of server side log events and by looking for patterns of systematically missing or invalid data. While we do find some inconsistencies such as a substantial amount of missing data from mobile devices and a low rate of invalid (missing or negative) measurements, we believe that the data can be generally informative as long as these limitations are considered.

We next consider possible probability models for reading time. One anticipated use of reading time data is in the evaluation of design experiments seeking to improve reader experiences on Wikipedia, to answer questions such as "How does this feature change reading behavior?" Model selection is important for validating assumptions that underlie the use of a statistic such as an arithmetic or geometric mean as a metric. Model selection itself can sometimes lead to novel insights when theorized data generating processes predict that a given model will be a good fit for the data [2] [3]. We evaluate several different distributions and find that the log-normal distribution fits the data well enough to justify the use of the geometric mean as a metric and of ordinary least squares regression models to explore marginal effects of other variables.

Consistent with the results of the Why the World Reads Wikipedia survey, we find that readers in countries that are less developed or in the Global South stay on given pages for longer on average compared to readers in the Global North or in more developed countries. Moreover, this difference is amplified where we would expect users to consume information in depth: on the desktop (non-mobile) site. While we hypothesized that the difference would also be greater in the last page-view in a session, this idea was not supported by our data analysis. We demonstrate these patterns using non-parametric methods and multivariate parametric analyses.

# Methods

## Data collection and validation

### Collecting reading time data

The "reading depth" instrumentation plugin works by running JavaScript in the client browser which sends two messages to the server during a page view. The first message is sent when the page is loaded and the second message is sent when it is unloaded. The "page unloaded" event sends values from timers that measure, among other things, the amount of time that the page was visible in the visitor's browser window.

More specifically, the plugin uses the page visibility API to measure time visible, the total amount of time that the page was in a visible browser tab. Time visible is the primary measure we use in this report because it excludes time when the user could not possibly have been reading the page. The plugin also records a second measure of reading time: total time. This is simply the entire time the page was loaded in the browser. We use this variable for data validation and in robustness checks. The plugin also measures page load time in two ways: time till first paint, the time from the request until the browser starts to render any part of the page; and dom interactive time, the time from the request until the user can interact with the page. The current version of the reading depth event logging plugin was first enabled on November 20th 2017. From November 2017 until September 2018 we logged events from a 0.1% sample of visitor sessions, and the sampling rate was increased to 10% on September 25, 2018.
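The difference between time visible and total time can be illustrated with a minimal sketch. This is not the plugin's implementation (which runs as JavaScript in the browser); the event format and the function name `summarize_page_view` are assumptions made for illustration only.

```python
def summarize_page_view(events):
    """Return (time_visible, total_time) in seconds for one page view.

    `events` is a hypothetical, time-ordered list of (timestamp_sec, state)
    tuples, where state is "visible", "hidden", or "unload".
    """
    time_visible = 0.0
    visible_since = None
    start = events[0][0]
    end = start
    for timestamp, state in events:
        # Close any open visible interval whenever a new event arrives.
        if visible_since is not None:
            time_visible += timestamp - visible_since
            visible_since = None
        if state == "visible":
            visible_since = timestamp
        end = timestamp
        if state == "unload":
            break
    return time_visible, end - start


# Made-up page view: visible for 12s, hidden for 28s, visible again for 15s.
events = [(0.0, "visible"), (12.0, "hidden"), (40.0, "visible"), (55.0, "unload")]
time_visible, total_time = summarize_page_view(events)
```

In this example the tab was in the background for 28 of the 55 seconds, so time visible is shorter than total time.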

Since we care about the reading behavior of humans, we detect known bots using user agent strings and exclude them from all of our analyses.

## Missing data

We are only able to collect data from web browsers that support the APIs on which the instrument depends. Also, we excluded certain user agents that were found to send data unreliably in our testing, namely the default Android browser, versions of Chrome earlier than 39, Safari, and all browsers running on versions of iOS less than 11.3. See this Phabricator task for additional details. We also do not collect data from browsers that have not enabled JavaScript or that have enabled Do Not Track.

Even when the above conditions are met, in some cases we are still not able to collect data. Sometimes we observe a "page loaded" event, indicating that a user in our sample opened a page, but we do not observe a corresponding event indicating that the user has left the page (the "page unloaded" event). This issue affects 57% of records on the mobile site and about 5% of records on the desktop site. The likely explanation for why many mobile views are affected is that many mobile browsers will refresh the page if the user switches to a program other than the browser. In such cases the browser will not send a "page unloaded" event. We only include events where we observe exactly 1 page loaded event and 1 page unloaded event, and we remove 0.016% of page read events where, for unknown reasons, the instrument recorded a page visible time that was less than 0 or undefined.
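The filtering rule described above could be sketched as follows. The field names (`page_token`, `action`, `visible_length`) and the action labels are hypothetical stand-ins chosen for this example, not the actual event schema.

```python
import math
from collections import Counter


def valid_page_views(events):
    """Keep unload events that belong to a cleanly paired page view.

    A page view is kept only if it has exactly one load event, exactly one
    unload event, and a visible time that is defined and not negative.
    """
    counts = Counter((e["page_token"], e["action"]) for e in events)
    kept = []
    for e in events:
        if e["action"] != "pageUnloaded":
            continue
        token = e["page_token"]
        if counts[(token, "pageLoaded")] != 1 or counts[(token, "pageUnloaded")] != 1:
            continue
        visible = e.get("visible_length")
        if visible is None or math.isnan(visible) or visible < 0:
            continue
        kept.append(e)
    return kept


# Made-up events: only page view "a" is complete and valid.
events = [
    {"page_token": "a", "action": "pageLoaded"},
    {"page_token": "a", "action": "pageUnloaded", "visible_length": 12.5},
    {"page_token": "b", "action": "pageLoaded"},  # never unloaded
    {"page_token": "c", "action": "pageLoaded"},
    {"page_token": "c", "action": "pageUnloaded", "visible_length": -3.0},
    {"page_token": "d", "action": "pageLoaded"},
    {"page_token": "d", "action": "pageUnloaded", "visible_length": None},
]
kept = valid_page_views(events)
```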

## Taking a sample

Because Wikipedia is so widely read, even a 0.1% sample results in an amount of data exceeding the statistical requirements of this project-level analysis and leading to computational inconveniences. We therefore conduct our analysis on random subsamples of the collected data. (Data was collected at a higher sampling rate to enable content-level analysis of dwell times, e.g. for specific topics or pages, which was among the possible research topics envisioned for this project.) To ensure that all wikis are fairly and adequately represented in our sample, we use stratified sampling. Stratified sampling assigns a weight to each group that adjusts the probability that members of the group are chosen in the sample. This introduces a known bias in the resulting sample, which is corrected using the weights in ways analogous to weighted averaging.

For estimating total reading time, and for the univariate analysis, we stratify by wikis and take up to 20,000 data points for each wiki and exclude wikis that have fewer than 300 data points. In the Multivariate Analysis below, we stratify by wikis, by the country of the reader's approximate location, and by whether or not we think that the user is on a mobile device. We sample up to 200 data points for each stratum.
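As a rough illustration of stratified sampling with weights, the sketch below caps the number of records drawn per stratum and assigns each sampled record a weight equal to the number of records it represents. The function names, record layout, and the particular weighting scheme are assumptions for this example and may differ from the procedure actually used.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, max_per_stratum, min_per_stratum=0, seed=0):
    """Return a list of (record, weight) pairs sampled per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        if len(members) < min_per_stratum:
            continue  # drop strata with too little data
        if len(members) <= max_per_stratum:
            chosen = members
        else:
            chosen = rng.sample(members, max_per_stratum)
        # Each sampled record stands in for this many original records.
        weight = len(members) / len(chosen)
        sample.extend((record, weight) for record in chosen)
    return sample


def weighted_mean(sample, value):
    """Weighted average that undoes the known sampling bias."""
    total_weight = sum(weight for _, weight in sample)
    return sum(value(record) * weight for record, weight in sample) / total_weight


# Made-up data: a large wiki with short views and a small wiki with long views.
records = [("big", 10.0)] * 1000 + [("small", 100.0)] * 10
sample = stratified_sample(records, key=lambda r: r[0], max_per_stratum=5)
estimate = weighted_mean(sample, value=lambda r: r[1])
population_mean = sum(r[1] for r in records) / len(records)
```

Without the weights, the ten sampled records would average 55 seconds, far from the population mean; with the weights, the estimate matches it.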

## Distribution of reading times

Before turning to our other questions, we present summary statistics and a high level description of reading behavior on Wikipedia. When someone opens a given page on Wikipedia, how long do they typically stay on the page? Are reading times highly skewed? How much does reading behavior vary across different language editions of Wikipedia? How much time does all of humanity spend reading Wikipedia?

### Wikipedia as a whole

In general, the distribution of reading times is very skewed. Upon opening a Wikipedia page, a reader will close it or navigate away in less than 25 seconds about half of the time. Given that they remain on the page longer than the median dwell time, half of the time they will navigate away within 75.1 seconds. When discussing reading times, it makes more sense to discuss geometric means, medians and other percentiles because in skewed distributions the arithmetic mean is far away from most of the mass of the distribution. This runs contrary to intuitive interpretations of the arithmetic mean. Fortunately, once the data has been log-transformed, the density is bell-shaped and the median and the mean are quite close. Therefore, we also find it useful to discuss the geometric mean. We discuss the skewed nature of the data and the benefits of log-transformation in more depth below.

| | 5% | 25% | 50% | 75% | 95% |
|---|---|---|---|---|---|
| time visible (sec) | 1.8 | 8.0 | 25.0 | 75.1 | 439.1 |

Table 1.1 This table shows percentiles for reading times over all Wikipedia editions
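To see why the arithmetic mean is a poor summary of skewed data, consider the following small sketch. The dwell times are made-up numbers chosen to be skewed, not values from our dataset.

```python
import math
import statistics

# Made-up, heavily skewed dwell times in seconds (not real measurements).
dwell_times = [2, 5, 8, 12, 25, 25, 40, 75, 300, 3600]

arithmetic_mean = statistics.mean(dwell_times)
median = statistics.median(dwell_times)

# The geometric mean is the exponential of the mean of the logged data.
log_times = [math.log(t) for t in dwell_times]
geometric_mean = math.exp(statistics.mean(log_times))
```

A single very long page view pulls the arithmetic mean far above the bulk of the data, while the geometric mean stays close to the median.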

Figure 1.1. The distribution of dwell times across all language editions of Wikipedia. The top figure shows a histogram of dwell times less than 300 seconds (5 minutes) long. In this figure we can see that the median dwell time is about 25 seconds long and that the distribution of dwell times is very skewed, with the mean far from the median. The Y axis represents the probability that a given page view is in a given box. In the lower figure, the dwell times are log-transformed and the data appear bell-shaped, with considerable skew to the right.

### Total time spent

The plot below shows an estimate of the total amount of time people read each month, taken by multiplying the average reading time in each month (measured as time visible) by the number of views per month. Humanity spent about 672,349 years reading Wikipedia from November 2017 through October 2018, excluding readers using the mobile apps and identified bots. It is possible that some people leave Wikipedia pages visible in their browsers for extended periods of time without reading. For example, someone might open a page and then walk away from the computer to have lunch. To make our estimates of reading time somewhat conservative, we rounded all page views lasting more than 1 hour down to 1 hour in these estimates of reading time.

Figure 1.2. This chart plots an estimate of the total amount of time people spend reading Wikipedia each month from November 2017 through October 2018.
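The arithmetic behind this estimate can be sketched as follows. The sketch assumes that the "average" is an arithmetic mean of capped reading times and ignores sampling weights; the function name and the year length constant are choices made for this example.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
ONE_HOUR = 3600.0


def estimate_total_years(sampled_times_sec, monthly_page_views):
    """Estimate total reading time in years for one month.

    Reading times above one hour are capped at one hour before averaging,
    which keeps the estimate conservative.
    """
    capped = [min(t, ONE_HOUR) for t in sampled_times_sec]
    mean_time = sum(capped) / len(capped)
    return mean_time * monthly_page_views / SECONDS_PER_YEAR


# Made-up numbers: three sampled views, one left open for two hours.
years = estimate_total_years([10.0, 20.0, 7200.0], monthly_page_views=3)
```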

### Variation between different language editions

We present box plots of the distribution of page visible times on different Wikipedia language editions. As above we place unscaled data side-by-side with log-transformed data. The box plots show the inter-quartile range (IQR) as a box. A line inside the box represents the median and the "whiskers" extend the IQR by a factor of 1.5. The plots with untransformed data are truncated at 300 seconds (5 minutes) because showing the entire range of the data would render the plots illegible. The log-transformed plots show the full range of the data. We hope these plots will be useful to readers who may wish to know how reading times compare between their wikis of interest.

We highlight a handful of example language editions that are representative of projects of different sizes and of different cultures. These are Arabic (ar), German (de), English (en), Spanish (es), Hindi (hi), Dutch (nl) and Punjabi (pa).

This large chart shows box plots for each wiki. Wikis are in alphabetical order by language code.

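| wiki | 5% | 25% | 50% | 75% | 95% |
|---|---|---|---|---|---|
| ar | 5.2 | 5.2 | 21.5 | 69.9 | 371.7 |
| de | 14.1 | 14.1 | 14.1 | 56.6 | 482.7 |
| en | 37.2 | 37.2 | 37.2 | 37.2 | 262.4 |
| es | 23.3 | 23.3 | 23.3 | 65.5 | 616.4 |
| hi | 2.5 | 11.4 | 31.4 | 82.6 | 360.5 |
| nl | 6.1 | 6.1 | 15.9 | 60.1 | 441.8 |
| pa | 2.0 | 7.2 | 19.5 | 55.4 | 303.1 |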

Table 1.2 This table shows percentiles for reading times for selected Wikipedia editions

Figure 1.3. This chart shows box plots of the distribution of time that Wikipedia pages were open in the browser on a selection of wikis. The plots were computed on random samples of several thousand observations for each wiki and truncated at 300 seconds. Spanish, Hindi, and Arabic appear to have longer reading times while English and Punjabi appear to have somewhat shorter reading times.

We observe a great deal of variation in the distribution of reading times between different language editions. We do not investigate these differences any further in this report because we lack knowledge of the specific contexts of each community and their audiences which would be necessary to adequately explain them. Instead, we present an analysis of the relationship between reading time and the development level of readers' countries to offer a more general explanation of one factor that might make a difference.

### Other variables

We use some variables other than the timers obtained from the plugin in our analysis. The event-logging system records the date and time the page was viewed as well as the page title of each page a user visits in a session. We obtained the page length, measured in bytes at the time the page was viewed, by merging the event logging data with the edit history. To understand how reading behavior on mobile devices differs from behavior on non-mobile (i.e. desktop) devices, we assume that visitors to mobile web-hosts (e.g. en.m.wikipedia.org) are using mobile devices and that visitors to non-mobile web-hosts (e.g. en.wikipedia.org) are on non-mobile (desktop) devices.

We obtain the approximate country in which a reader is located from the MaxMind GeoIP database which is integrated with the analytics pipeline. We then use the human development index (HDI) from the UN's human development data to measure the development level of the country. We lack geolocation data before March 3rd 2018, which limits our analysis of development and reading times to the period from then until September 28th 2018.

In our model selection process, we observed that partial residual plots of the interaction term between the HDI and mobile were very skewed. Standardizing the HDI by centering to 0 and scaling it by the standard deviation (taken at the country level) improved this and also allows us to interpret results in terms of standard deviations.
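Standardizing a country-level variable such as the HDI amounts to subtracting the mean and dividing by the standard deviation, both taken over countries. A minimal sketch, with made-up index values and a hypothetical `standardize` helper (the sample standard deviation is used here as an assumption):

```python
import statistics


def standardize(values_by_country):
    """Center values at 0 and scale by the country-level standard deviation."""
    values = list(values_by_country.values())
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    return {country: (v - mean) / sd for country, v in values_by_country.items()}


# Made-up index values for three hypothetical countries.
hdi = {"A": 0.50, "B": 0.70, "C": 0.90}
hdi_standardized = standardize(hdi)
```

After this transformation, a regression coefficient on the HDI is read as the change associated with a one standard deviation difference between countries.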

We also use a second, dichotomous, measure of development in terms of established regional classifications of Global North and Global South. Finally, this EventLogging instrumentation retains a session token with which we measure the number of pages viewed in the session so far (Nth in session) and whether or not a given page view is the last in session.
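The two session-level variables can be derived by grouping page views on the session token and ordering them in time. The sketch below uses hypothetical field names (`session_token`, `timestamp`) and is only meant to show the idea.

```python
from collections import defaultdict


def annotate_sessions(page_views):
    """Add nth_in_session and last_in_session to each page view."""
    by_session = defaultdict(list)
    for view in page_views:
        by_session[view["session_token"]].append(view)
    annotated = []
    for views in by_session.values():
        views.sort(key=lambda v: v["timestamp"])
        for n, view in enumerate(views, start=1):
            annotated.append(
                {**view, "nth_in_session": n, "last_in_session": n == len(views)}
            )
    return annotated


# Made-up page views from two sessions, deliberately out of order.
page_views = [
    {"session_token": "s1", "timestamp": 30, "page": "B"},
    {"session_token": "s1", "timestamp": 10, "page": "A"},
    {"session_token": "s2", "timestamp": 5, "page": "C"},
]
annotated = {v["page"]: v for v in annotate_sessions(page_views)}
```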

## Univariate model selection

### Motivation

We want to be able to answer questions like: Did introducing a new feature to the user interface cause a change in the amount of time users spend reading? Are reading times on English Wikipedia longer than on Spanish Wikipedia? What is the relationship between the development level of a reader's country and the amount of time they spend reading on different devices if we account for other factors?

Using a parametric model allows us to perform statistical tests to answer questions such as the ones listed above. Parametric models assume the data have a given probability distribution and have interpretable parameters such as mean, variance, and shape parameters. Fitting parametric distributions to data allows us to estimate these parameters and to statistically test changes in the parameters. However, assuming a parametric model can lead to misleading conclusions if the assumed model is not the true model. Therefore we want to evaluate how well different parametric models fit the data in order to justify parametric assumptions. Understanding how the data is distributed can also be interesting in its own right because distributions can inform understandings of the data generating process.

### Candidate models

Log-Normal Distribution: The log-normal distribution is a two-parameter probability distribution. Intuitively, it is just a normal distribution, but on a logarithmic scale. This gives it convenient properties because its parameters are the mean and variance of the log-transformed data. This means that one can take the logarithm of the data and then use t-tests for evaluating differences in means, or use ordinary least squares to infer regression models. These advantages make the log-normal distribution a common choice in analyzing skewed data, even when it is not a perfect fit.

Lomax (Pareto Type II) Distribution: Datasets on human behavior often exhibit power-law distributions, meaning that the probability of extreme events, while still low, is much greater than would be predicted by a normal (or log normal) distribution [3]. Power law distributions are a commonly used class of one-sided "heavy-tailed," "long-tailed," or "fat-tailed" probability models [8]. We fit the Lomax Distribution, a commonly used long-tailed distribution with two parameters that assumes that power law dynamics occur over the whole range of the data.

Weibull Distribution: Liu et al. (2010) model reading times on web pages using a Weibull distribution [9]. This model has two parameters: $\lambda$, a scale parameter, and $k$, a shape parameter. The Weibull distribution can be a useful model because of the intuitive interpretation of $k$. If $k > 1$, then reading behavior exhibits "positive aging," which means that the longer someone stays on a page, the more likely they are to leave the page at any moment. Conversely, $k < 1$ is interpreted as "negative aging," which means that the longer someone remains on a page, the less likely they are to leave the page at any given moment. The Weibull distribution is often used in the context of reliability engineering because it is convenient for modeling the chances that a given part will fail at a given moment.
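The aging interpretation follows from the Weibull hazard function, $h(t) = (k/\lambda)(t/\lambda)^{k-1}$, which gives the instantaneous rate of leaving the page at time $t$. The short sketch below evaluates it for illustrative parameter values that were not estimated from our data.

```python
def weibull_hazard(t, k, lam):
    """Instantaneous rate of leaving at time t under a Weibull(k, lam) model."""
    return (k / lam) * (t / lam) ** (k - 1)


times = [1.0, 10.0, 100.0]
positive_aging = [weibull_hazard(t, k=2.0, lam=30.0) for t in times]  # rising
negative_aging = [weibull_hazard(t, k=0.5, lam=30.0) for t in times]  # falling
no_aging = [weibull_hazard(t, k=1.0, lam=30.0) for t in times]        # constant
```

Whatever the parameter values, the hazard either rises, falls, or stays flat over the whole range of $t$; it cannot change direction.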

Exponentiated Weibull Distribution: The Weibull model assumes that the rate of readers leaving a page changes monotonically over time. This means that a reader cannot become more likely to leave the page up to some point and then less likely to leave afterwards. In other words there must be either "negative aging," "positive aging," or "no aging." This excludes more complicated dynamic processes where "positive aging" gives way to "negative aging" after a point in time [10]. Therefore if the data show that the likelihood of a reader leaving a page first increases and then decreases (or vice versa) then assumptions of the Weibull model are violated. The exponentiated Weibull distribution is a three-parameter generalization of the Weibull distribution that relaxes this constraint. The extra degree of freedom will allow this model to fit a greater range of empirical distributions compared to the two-parameter Weibull model.

We also considered the gamma distribution and the exponential distribution, but we will not go into depth about them here. We didn't have a strong motivation for these models and they did not fit the data well.

We fit the models using SciPy. The exponentiated Weibull, Weibull, lomax, and gamma models were fit using maximum likelihood estimation and the other models were fit using the method of moments.

### Methods

Our method for model selection is inspired in part by Liu et al. (2010), who compared the log-normal distribution to the Weibull distribution of dwell times on a large sample of web pages [9]. They fit both models to data for each website and then compare two measures of model fit: the log-likelihood, which measures the probability of the data given the model (higher is better), and the Kolmogorov-Smirnov distance (KS distance), which is the maximum difference between the model CDF and the empirical CDF (lower is better). For the sample of web pages they consider, the Weibull model outperformed the log-normal model in a large majority of cases according to both goodness-of-fit measures.

Similar to the approach of Liu et al. (2010), we fit each of the models we consider on reading time data, separately for each Wikipedia project [9]. We also use the KS distance to evaluate goodness-of-fit. The KS-distance provides a statistical test of the null hypothesis that the model is a good fit for the data [8]. The KS-test is quite sensitive to deviations between the model and the data, especially in large samples. Failing to reject this hypothesis with a large sample of data supports the conclusion that the model is a good fit for the data. For the sample sizes we use, passing the KS test is a high bar. This allows us to go beyond Liu et al. (2010) by evaluating whether each distribution is a plausible model, instead of just whether one distribution is a better fit than another.
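To make the KS distance concrete, the following sketch computes it by hand for two candidate models on synthetic data. It uses only the Python standard library rather than SciPy, fits the log-normal from the mean and standard deviation of the logged data, and does not compute the p-value of the KS test. The data and helper names are assumptions for illustration, not our analysis code.

```python
import math
import random
import statistics


def fit_lognormal(times):
    """Estimate (mu, sigma) from the mean and standard deviation of log(times)."""
    logs = [math.log(t) for t in times]
    return statistics.mean(logs), statistics.pstdev(logs)


def lognormal_cdf(t, mu, sigma):
    """CDF of a log-normal distribution with log-scale mean mu and sd sigma."""
    return 0.5 * (1.0 + math.erf((math.log(t) - mu) / (sigma * math.sqrt(2.0))))


def ks_distance(times, cdf):
    """Maximum absolute gap between the empirical CDF and a model CDF."""
    xs = sorted(times)
    n = len(xs)
    distance = 0.0
    for i, x in enumerate(xs):
        model = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        distance = max(distance, (i + 1) / n - model, model - i / n)
    return distance


# Synthetic data (not real reading times): 2,000 log-normal draws.
rng = random.Random(0)
times = [rng.lognormvariate(3.0, 1.0) for _ in range(2000)]

mu, sigma = fit_lognormal(times)
d_lognormal = ks_distance(times, lambda t: lognormal_cdf(t, mu, sigma))

# Compare against an exponential model fit by matching the sample mean.
mean_time = statistics.mean(times)
d_exponential = ks_distance(times, lambda t: 1.0 - math.exp(-t / mean_time))
```

Because the synthetic data really are log-normal, the log-normal model should have the smaller distance here; on real reading times the ordering of models is an empirical question.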

Liu et al. (2010) compare two distributions that each have 2 parameters, but the models we consider have different numbers of parameters (the exponentiated Weibull model has 3 parameters and the exponential model has only 1). Adding parameters can increase model fit without improving out-of-sample predictive performance or explanatory power. To avoid the risk of over-fitting and to make a fair comparison between models, we use the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) instead of the log-likelihood. Both criteria attempt to quantify the amount of information lost by the model (lower is better): they start from the log-likelihood but add a penalty for each model parameter. The difference between AIC and BIC is that the BIC penalty grows with the logarithm of the sample size, so it penalizes additional parameters more heavily in large samples. For a more detailed example of this procedure see this work log.
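Both criteria are simple functions of the maximized log-likelihood. The sketch below uses their standard definitions; the log-likelihood and sample size in the example are made-up numbers.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Made-up case: two models with identical fit but 2 versus 3 parameters.
log_likelihood, n_obs = -1000.0, 20000
aic_gap = aic(log_likelihood, 3) - aic(log_likelihood, 2)
bic_gap = bic(log_likelihood, 3, n_obs) - bic(log_likelihood, 2, n_obs)
```

With 20,000 observations, the extra parameter costs 2 points of AIC but about 9.9 points of BIC, which is why BIC can favor a two-parameter model where AIC does not.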

We build these goodness-of-fit measures for each wiki and rank them from best to worst. For each distribution, we report the mean and median of these ranks. In addition, we report the mean and median p-values of the KS-tests and the proportion of wikis that pass the KS test for each model. We also use diagnostic plots to compare the empirical and modeled distributions of the data in order to explain where models are failing to fit the data. Because the data is so skewed, we log the X axis of these plots.
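The ranking step can be sketched as follows for a single metric where lower is better, such as AIC. The scores are made-up values, the function name is hypothetical, and ties are broken arbitrarily in this simplified version.

```python
import statistics


def rank_models(scores_by_wiki):
    """Rank models within each wiki, then summarize ranks per model.

    `scores_by_wiki` maps wiki -> {model: score}, where lower is better.
    Returns model -> (mean rank, median rank).
    """
    ranks = {}
    for scores in scores_by_wiki.values():
        ordered = sorted(scores, key=scores.get)
        for rank, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(rank)
    return {
        model: (statistics.mean(r), statistics.median(r))
        for model, r in ranks.items()
    }


# Made-up AIC values for three hypothetical models on two wikis.
scores_by_wiki = {
    "wiki_a": {"model_1": 100.0, "model_2": 110.0, "model_3": 150.0},
    "wiki_b": {"model_1": 210.0, "model_2": 200.0, "model_3": 260.0},
}
summary = rank_models(scores_by_wiki)
```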

The diagnostic plots are shown with data on English Wikipedia. On this wiki, the exponentiated Weibull model is the best fit, followed by the lomax model and then the log-normal model, and only the exponentiated Weibull model passes the KS test.

## Results

### Goodness-of-fit metrics

The table below shows the results of this procedure. The lomax, exponentiated Weibull, and log-normal all fit the data reasonably well. All pass the KS-test for many wikis, and are in a three-way tie for best median rank according to AIC.

| model | AIC_rank mean | AIC_rank median | BIC_rank mean | BIC_rank median | ks_rank mean | ks_rank median | ks_pvalue mean | ks_pvalue median | ks_pass95 mean | ks_pass97_5 mean |
|---|---|---|---|---|---|---|---|---|---|---|
| Lomax | 1.780992 | 2.0 | 1.723140 | 1.5 | 2.033058 | 2.0 | 0.272096 | 1.765701e-01 | 0.760331 | 0.818182 |
| Log-normal | 2.256198 | 2.0 | 2.165289 | 2.0 | 2.342975 | 2.0 | 0.272003 | 1.476440e-01 | 0.685950 | 0.772727 |
| Exponentiated Weibull | 2.260331 | 2.0 | 2.417355 | 3.0 | 2.144628 | 2.0 | 0.293441 | 2.078284e-01 | 0.727273 | 0.793388 |
| Weibull | 3.938017 | 4.0 | 3.942149 | 4.0 | 3.809917 | 4.0 | 0.084717 | 1.190242e-04 | 0.247934 | 0.297521 |
| Gamma | 4.933884 | 5.0 | 4.962810 | 5.0 | 4.785124 | 5.0 | 0.044867 | 2.255307e-12 | 0.132231 | 0.157025 |
| Exponential | 5.830579 | 6.0 | 5.789256 | 6.0 | 5.884298 | 6.0 | 0.015356 | 0.000000e+00 | 0.049587 | 0.066116 |

Table 2.1 Goodness of fit statistics resulting from the model selection process. The lomax, log-normal, and exponentiated Weibull distributions fit the data reasonably well.

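The lomax distribution is the best fit across all wikis by mean rank on AIC, BIC, and KS distance, although the exponentiated Weibull distribution has slightly higher mean and median KS p-values. With only 2 parameters, the lomax distribution has a better mean AIC and BIC rank than the three-parameter exponentiated Weibull distribution (exponweib) and passes the KS test 76% of the time at the 95% confidence level.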

The exponentiated Weibull model fits the data better than the log-normal model in terms of passing KS-tests and KS rank, while the two are nearly tied with respect to AIC (mean ranks of about 2.26 each). However, the log-normal is better in terms of BIC, which imposes a greater penalty on the additional parameter of the exponentiated Weibull model.

The Weibull model fits substantially worse than the lomax, log-normal, and exponentiated Weibull in terms of all of our goodness-of-fit metrics. In this respect, our results differ from those of Liu et al. (2010), who observed the Weibull model fitting dwell time data better than the log-normal model [9]. We observe that for dwell times on Wikipedia, the log-normal model is the better fit. While substantially worse than the lomax model, the log-normal model still passes the KS-test for about 69% of wikis in the sample.

### Discussion

We found that lomax, exponentiated Weibull, and log-normal models all fit the data within reason. We now discuss how each of these models can be applied to understanding Wikipedia reading behavior.

Lomax (Pareto Type II) Distribution: That the lomax model fits well indicates that Wikipedia reading time data may follow a power law. Mitzenmacher (2004) describes several possible data generating processes for power law (Pareto) and log-normal distributions [2]. Rich-get-richer dynamics such as preferential attachment are commonly associated with power law distributions. However, according to Mitzenmacher's analysis, a mixture of log-normal distributions can also generate data appearing to follow a power law. Therefore, we cannot conclude from our model-fitting exercise that reading behavior on Wikipedia is driven by rich-get-richer dynamics. Furthermore, it is difficult to conceive of mechanisms for such dynamics. On the other hand, it is intuitive that a mixture of different log-normal processes is involved in reading time, such as an exploration process mixed with a reading process or even a mixture of behavior patterns associated with discrete types of information consumption. Deeper exploration of potential power-law dynamics in reading behavior is a potential avenue for future research.

Log-Normal Distribution: The log-normal model does not fit the data perfectly, but it fits well enough to be useful. It frequently passes KS tests, and is preferred to the exponentiated Weibull by the BIC. Even though the lomax model typically fits the data better, and the log-normal model is likely to underestimate the probability of very long reading times, assuming a log-normal model is very convenient. Once the data is transformed to a log-scale we can use t-tests to compare differences in means. This implies that the mean of the logarithm of reading time is an appropriate metric for evaluating experiments. Furthermore, assuming log-normality justifies using ordinary least squares to estimate regression models in multivariate analysis instead of more complex models that require maximum likelihood estimation.

Weibull Distribution: The Weibull model did not fit the data well. This was somewhat disappointing because we had hoped to analyze reading behavior in terms of the inferred parameter that indicates positive or negative aging. While Liu et al. (2010) observed that the Weibull model out-performed the log-normal model on their datasets, we observe the opposite. However, the exponentiated Weibull model generalizes the Weibull, is a good fit for the data, and can help us explain why the Weibull does not fit the data well.

Exponentiated Weibull Distribution: The Exponentiated Weibull has 3 parameters [11]. Two are shape parameters (${\displaystyle \alpha >0}$ and ${\displaystyle \gamma >0}$) and one is a scale parameter (${\displaystyle \lambda >0}$). The major qualitative distinctions in interpreting the model depend on the shape parameters. In many cases the parameters can be interpreted in terms of a transition from negative to positive aging (or vice versa) after some threshold. However, if either ${\displaystyle \gamma >1}$, ${\displaystyle \alpha <1}$ or ${\displaystyle \gamma <1}$, ${\displaystyle \alpha >1}$ then qualitative interpretation may require closer inspection of estimated hazard functions.

Unfortunately, we estimated ${\displaystyle \alpha >1}$ and ${\displaystyle \gamma <1}$ for all but 1 of the 242 Wikipedia projects we analyzed. This limits the utility of exponentiated Weibull models for large-scale analysis of reading on many wikis because the locations of the parameters do not lead directly to intuitive qualitative interpretations. However, by plotting the estimated hazard function we can see over what range of the data the hazard function is decreasing or increasing, accelerating or decelerating.

Figure 2.1. Hazard functions for the parametric models estimated on English Wikipedia. The exponentiated Weibull model (the best fit to the data) indicates that the hazard rate increases in the first seconds of a page view after which we observe negative aging.

We observe that, on English Wikipedia, the log-normal and exponentiated Weibull models both indicate a brief period of positive aging, during which the instantaneous rate of page-leaving increases, followed by negative aging. This helps explain why the Weibull model is not a good fit for the data compared to the log-normal and exponentiated Weibull models: the Weibull distribution cannot model a non-monotonic hazard function. While Liu et al. (2010) found that a Weibull process can describe the distribution of dwell times in data collected through a web browser plugin, our analysis suggests that behavior by Wikipedia readers may be more complex. One plausible interpretation is that the hazard rate increases during the first 1 or 2 seconds of a page view because readers require this time to decide whether to leave the page or to remain.
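The contrast between a monotonic and a non-monotonic hazard can be sketched numerically. The example below assumes the scale parameterization ${\displaystyle F(x)=(1-e^{-(x/\lambda )^{\gamma }})^{\alpha }}$, which may differ from the parameterization used in our estimation, and uses hypothetical parameter values with ${\displaystyle \alpha >1}$ and ${\displaystyle \gamma <1}$ rather than our fitted estimates. Setting ${\displaystyle \alpha =1}$ recovers the ordinary Weibull model.

```python
import math


def exp_weibull_hazard(x, alpha, gamma, lam):
    """Hazard f(x) / (1 - F(x)) for F(x) = (1 - exp(-(x / lam) ** gamma)) ** alpha."""
    z = (x / lam) ** gamma
    log_base = math.log1p(-math.exp(-z))  # log(1 - exp(-z)), stable for large z
    pdf = (alpha * (gamma / lam) * (x / lam) ** (gamma - 1)
           * math.exp(-z) * math.exp((alpha - 1) * log_base))
    survival = -math.expm1(alpha * log_base)  # 1 - F(x)
    return pdf / survival


# Hypothetical parameters (not estimates from this report).
ALPHA, GAMMA, LAM = 5.0, 0.5, 1.0

grid = [10 ** (k / 10) for k in range(-10, 31)]  # 0.1 s to 1000 s, log-spaced
hazards = [exp_weibull_hazard(x, ALPHA, GAMMA, LAM) for x in grid]
weibull_hazards = [exp_weibull_hazard(x, 1.0, GAMMA, LAM) for x in grid]

peak_index = max(range(len(grid)), key=hazards.__getitem__)
print(f"exponentiated Weibull hazard peaks near {grid[peak_index]:.1f} s")
print("plain Weibull hazard is monotone decreasing:",
      all(a > b for a, b in zip(weibull_hazards, weibull_hazards[1:])))
```

With these illustrative values the exponentiated Weibull hazard rises and then falls, while the plain Weibull hazard with the same ${\displaystyle \gamma }$ only falls.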

#### Distribution fitting plots

To further explore how well these distributions fit the data, we present a series of diagnostic plots that compare the empirical distribution of the data with the model predicted distributions. For each of the four models under consideration (lomax, log-normal, exponentiated Weibull, Weibull), we present a density plot, a distribution plot, and a quantile-quantile plot (Q-Q plot). The density plots compare the probability density function of the estimated parametric model to the normalized histogram of the data. Similarly, the distribution plots compare the estimated cumulative distribution to the empirical distribution. The Q-Q plots show the values of the quantile function for the data on the x-axis and for the estimated model on the y-axis. These plots can help us diagnose ways that the data diverge from each of the models. We present the x-axis of all these plots on a logarithmic scale to improve the visibility of the data.

We show these plots for data from English Wikipedia. For this wiki, the likelihood-based goodness-of-fit measures indicate that the exponentiated Weibull model is the best fit (BIC = 19321) followed in order by the lomax (BIC = 19351), the log-normal (BIC = 19373) and the Weibull (BIC = 20111), but the log-normal model is the only model that passes the KS test (${\displaystyle p}$ = 0.089).
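To make the goodness-of-fit measures concrete, here is a minimal sketch for the log-normal case only: closed-form maximum likelihood estimates, the BIC, and the KS statistic. It runs on simulated data, and the helper names are hypothetical. It stops at the KS statistic and does not compute a p-value.

```python
import math
import random


def fit_lognormal(times):
    """Closed-form maximum likelihood estimates (mu, sigma) of a log-normal."""
    logs = [math.log(t) for t in times]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def lognormal_loglik(times, mu, sigma):
    """Log-likelihood of the sample under a log-normal(mu, sigma)."""
    return sum(
        -math.log(t * sigma * math.sqrt(2 * math.pi))
        - (math.log(t) - mu) ** 2 / (2 * sigma ** 2)
        for t in times
    )


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion; lower values indicate a preferred model."""
    return n_params * math.log(n_obs) - 2 * loglik


def ks_statistic(times, mu, sigma):
    """Largest gap between the empirical CDF and the fitted log-normal CDF."""
    xs = sorted(times)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        cdf = 0.5 * (1 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2))))
        worst = max(worst, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return worst


# Simulated dwell times (hypothetical, for illustration only).
rng = random.Random(1)
sample = [rng.lognormvariate(3.2, 1.2) for _ in range(2000)]
mu_hat, sigma_hat = fit_lognormal(sample)
score = bic(lognormal_loglik(sample, mu_hat, sigma_hat), n_params=2, n_obs=len(sample))
d_stat = ks_statistic(sample, mu_hat, sigma_hat)
print(f"mu={mu_hat:.2f} sigma={sigma_hat:.2f} BIC={score:.0f} KS D={d_stat:.3f}")
```

Comparing models by BIC would repeat the likelihood and parameter-count steps for each candidate distribution fitted to the same sample.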

Figure 2.2. The Lomax model accurately estimates the rate of long reading times, but its monotonic density overestimates the probability of very short reading times and underestimates that of reading times in the range of 1-10 seconds.

Figure 2.3. The log-normal model fits the data well, but overestimates the probability of very short reading times and underestimates the probability of very long reading times.

Figure 2.4. The Exponentiated Weibull model fits the data somewhat better than the log-normal model, but still overestimates the occurrence of very short reading times.

Figure 2.5. The Weibull model is not a good fit for the data. On a log scale, the PDF is not only monotonically decreasing, it is concave up everywhere. It greatly overestimates the probability of very short and very long reading times while underestimating the probability of reading times between 10 and 1000 seconds.

# Reading time and human development

This section explores the research question "How do Wikipedia readers in less developed countries differ from readers in more developed countries?" Results of a global survey of Wikipedia readers suggest that readers in less developed countries are more likely to engage in deeper information seeking tasks. Assuming that Wikipedia readers spend more time reading when executing such information seeking tasks, our data on reading times allows us to test this hypothesis using behavioral data.

H1: Readers in less developed countries are more likely to spend more time reading each page they visit compared to readers in more developed countries.

The assumption that time spent reading correlates with the depth of an information seeking task is clearly questionable. Other factors such as reading fluency, the type of device used to access information, internet connection speed and whether a reader is on a page that contains the information they wish to consume may all confound the relationship between reading time and the type of an information seeking task. We attempt to build confidence that such factors do not drive the observed relationship in two ways. First, we use multivariate regression to statistically control for observable factors. Second, we examine how the gap between low-development and high-development countries depends on device type and on whether the reader is on the last page in their session by testing the following hypotheses.

H2: The amount that readers in less developed countries read more than readers in more developed countries will be greater on desktop than on mobile devices.

The intuition for this hypothesis is that users will prefer to engage in deeper information seeking tasks on desktop devices instead of on mobile devices where they are more likely to engage in shallower tasks such as quick look-up of facts [12].

H3: The amount of additional time that readers in less developed countries spend over readers in more developed countries will be greater in the last page view in a session than on other page views.

The intuition for this hypothesis is that deep reading of an article is most likely to take place in the last page view in a session. Therefore, if the gap between readers in low and high development contexts is attributable to types of information seeking tasks, then we will observe a larger gap in reading times between less developed and more developed countries in the last page view in a session than in other page views.

We test these three hypotheses using two regression models that differ only in how they represent economic development. Model 1a uses the human development index (HDI) reported by the United Nations and model 1b uses the Global North and Global South regional classification. We also evaluated alternative specifications of these models, as discussed below.

Model 1a: ${\displaystyle Y=B_{0}+B_{1}HDI+B_{2}Mobile+B_{3}Mobile:HDI+B_{4}RevisionLength+B_{5}DayOfWeek+B_{6}Month+}$
${\displaystyle B_{7}NthInSession+B_{8}LastInSession+B_{9}HDI:LastInSession+B_{10}Mobile:LastInSession+}$
${\displaystyle B_{11}FirstPaint+B_{12}DomInteractiveTime}$
Model 1b: ${\displaystyle Y=B_{0}+B_{1}GlobalNorth+B_{2}Mobile+B_{3}Mobile:GlobalNorth+B_{4}RevisionLength+B_{5}DayOfWeek+B_{6}Month+}$
${\displaystyle B_{7}NthInSession+B_{8}LastInSession+B_{9}GlobalNorth:LastInSession+B_{10}Mobile:LastInSession+}$
${\displaystyle B_{11}FirstPaint+B_{12}DomInteractiveTime}$

We include Day Of Week and Month as statistical controls for seasonal and weekly reading patterns. Including NthInSession statistically controls for the number of pages a reader has viewed so far in the session. Revision Length, the size of the wiki page measured in bytes, roughly accounts for the amount of textual content on the page. To statistically control for the time it takes for pages to load, we include time to first paint and DOM interactive time. We include Mobile:LastInSession because it improved the model fit during our model selection process.

We consider H1 supported if ${\displaystyle B_{1}<0}$ in both models; H2 if ${\displaystyle B_{3}>0}$; and H3 if ${\displaystyle B_{9}<0}$. Because interaction terms can be difficult to interpret qualitatively, we present marginal effects plots to assist in qualitative interpretation of the observed relationships.

We explored alternative model specifications that include higher order terms and additional interaction terms. We chose to present model 1a and model 1b because more complex models neither substantively improve the explained variance or the predictive performance nor lead to qualitatively different conclusions. We fit both models using weighted ordinary least squares estimation in R on a stratified sample of size 9,873,641.
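The models above were fit in R. Purely as an illustration of weighted least squares with an interaction term, the following Python sketch fits a reduced model with only an intercept, HDI, Mobile, and Mobile:HDI to simulated data. The coefficient values, sample size, noise level, and weights are all made up for the example, and the small linear solver stands in for a statistics library.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def weighted_ols(rows, y, w):
    """Weighted least squares via the normal equations (X'WX) b = X'Wy."""
    k = len(rows[0])
    xtwx = [[sum(wi * r[i] * r[j] for r, wi in zip(rows, w)) for j in range(k)]
            for i in range(k)]
    xtwy = [sum(wi * r[i] * yi for r, yi, wi in zip(rows, y, w)) for i in range(k)]
    return solve(xtwx, xtwy)


# Simulated data with made-up coefficients; none of these values come from the report.
TRUE = {"intercept": 3.0, "hdi": -0.2, "mobile": -0.1, "mobile:hdi": 0.1}
rng = random.Random(42)
rows, log_y, weights = [], [], []
for _ in range(4000):
    hdi = rng.gauss(0, 1)                   # standardized development index
    mobile = float(rng.random() < 0.5)      # 1.0 for mobile, 0.0 for desktop
    x = [1.0, hdi, mobile, mobile * hdi]    # last column is the interaction term
    rows.append(x)
    log_y.append(sum(b * v for b, v in zip(TRUE.values(), x)) + rng.gauss(0, 0.5))
    weights.append(rng.uniform(0.5, 2.0))   # stand-in for stratified sampling weights

estimates = dict(zip(TRUE, weighted_ols(rows, log_y, weights)))
print({name: round(value, 3) for name, value in estimates.items()})
```

In this simulated setup a positive Mobile:HDI coefficient means the negative association between HDI and reading time is weaker on mobile, which is the pattern H2 predicts.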

### Non-Parametric Analysis

The multivariate analysis assumes a parametric model and, as we saw in the univariate analysis above, the assumption of log-normality may be invalid. Therefore, we also provide a simple non-parametric analysis based on median reading times. This analysis does not allow us to include statistical controls or perform statistical hypothesis tests, but it does not depend on distributional assumptions. We construct a 2x2x2 table of users depending on whether they are in the Global North or Global South, whether they are on a mobile or desktop device, and whether they are on the last page view in their session. The medians of each cell of the table validate that our findings are not driven by the normality assumption alone.
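The cell medians can be computed with a simple grouping step. The sketch below uses a handful of hypothetical page views, and the field names are invented for the example.

```python
import statistics
from collections import defaultdict


def median_by_cell(page_views):
    """Median visible time for each (last_in_session, region, desktop) cell."""
    cells = defaultdict(list)
    for view in page_views:
        key = (view["last_in_session"], view["region"], view["desktop"])
        cells[key].append(view["seconds_visible"])
    return {key: statistics.median(values) for key, values in cells.items()}


# Hypothetical page views; field names and values are illustrative only.
views = [
    {"last_in_session": False, "region": "Global North", "desktop": True, "seconds_visible": 12.0},
    {"last_in_session": False, "region": "Global North", "desktop": True, "seconds_visible": 18.0},
    {"last_in_session": False, "region": "Global North", "desktop": True, "seconds_visible": 400.0},
    {"last_in_session": True, "region": "Global South", "desktop": False, "seconds_visible": 20.0},
    {"last_in_session": True, "region": "Global South", "desktop": False, "seconds_visible": 36.0},
]
medians = median_by_cell(views)
for cell, value in medians.items():
    print(cell, value)
```

The 400-second view in the first cell barely moves that cell's median, which is the robustness property that motivates using medians here.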

# Results

## Regression Analysis

We use marginal effects plots to interpret our regression models. A marginal effects plot shows how the model predicted outcome varies with respect to one or more of the predictors when other terms of the model are held constant at some typical value [13]. The y-axis shows the model predicted values and the x-axis shows the values of the predictor variables. In the marginal effects plots shown here, uncertainty intervals represent confidence intervals of the parameter estimates, not uncertainty about the model predictions. Uncertainty about model predictions in this case is generally very high, as our models explain only a small fraction (about 7%) of the variance in reading times.

### Page length

Before presenting results on our hypotheses about human development and reading times, we first consider page length. The association between page length and reading times is small and positive (${\displaystyle \mathrm {B} =0.17,SE=0.0004}$) as shown by the marginal effects plot in Figure 3.1. Pages on Wikipedia vary quite widely in page length: from 1 to 2,000,000 bytes long. Our model estimates that the difference between the shortest and the longest page lengths can account for a difference in typical reading times from about 5 seconds to about 45 seconds. If a page were to double its length, our model would predict a marginal increase in reading times of a factor of 1.2. This means that should a page with 10000 bytes of text with an average reading time of 25 seconds double in length to 20000 bytes then we would expect an increase in average reading time to 30 seconds.

Figure 3.1: Marginal effects plot showing how the time spent on pages depends on page length according to Model 1a.

### Hypothesis 1: Development and reading times

We find support for H1, which predicted that readers in more developed countries (${\displaystyle \mathrm {B} =-0.20,SE=0.002}$) or in the Global North (${\displaystyle \mathrm {B} =-0.27,SE=0.002}$) are likely to spend less time on each page than readers in less developed countries or in the Global South. The effect size is significant as shown in the marginal effects plots. According to model 1a, a prototypical user, average in all other respects, in a country with an HDI 1 standard deviation below the mean can be expected to spend about 25 seconds on a given page compared to about 18 seconds spent by an average reader in a country with an HDI 1 standard deviation above the mean. Similarly, according to model 1b, assuming that all else is equal, a user in a Global South country is expected to spend 130% as much time reading a page as an equivalent reader in a Global North country, and a prototypical Global North reader is expected to spend just over 16 seconds on a page compared to the 21 seconds spent by a Global South reader.
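The 130% figure can be approximately reproduced from the Global North coefficient, under the assumption (ours for this sketch) that the regression outcome is the natural logarithm of reading time, so that a coefficient of b multiplies the predicted reading time by exp(b).

```python
import math

# Global North coefficient from the Model 1b regression table.
B_GLOBAL_NORTH = -0.2680

# Assumption: the outcome is the natural log of reading time, so a coefficient
# of b multiplies the predicted reading time by exp(b).
south_over_north = math.exp(-B_GLOBAL_NORTH)

# The 16-second figure is the prototypical Global North reading time quoted in the text.
implied_south_seconds = 16 * south_over_north
print(f"ratio = {south_over_north:.2f}, implied Global South time = {implied_south_seconds:.1f} s")
```

This applies the main effect only, so it describes the reference levels of the interacting terms.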

Figure 3.2. Marginal effects plot showing how the time spent on pages depends on the development level of the country they are in.

Figure 3.3. Marginal effects plot showing how the time spent on pages depends on the development level of the country they are in.

### Hypothesis 2: Development and reading times on mobile devices

In general, we find evidence of a "device gap" between desktop and mobile devices. The geometric means of reading times on mobile devices and desktop devices are 22 and 26 seconds respectively. Although more people visit Wikipedia from mobile devices, these visitors spend less time reading than visitors to the desktop site. However, the marginal effects plot of our model shows that this gap is explained almost entirely by last-in-session behavior. On desktop, there is a sizable gap between reading times in the last page view of a session (where the typical reading time is about 44 seconds) and in other page views (about 18 seconds). This gap is much smaller on mobile devices, where typical reading times are about 26 seconds for the last page view and about 19 seconds for other page views.

Figure 4.1. Marginal effects plot from Model 1 showing how the time spent on pages varies with type of device and on whether a page view is the last in a session.

We also find support for our second hypothesis: that readers in the Global North (${\displaystyle \mathrm {B} =0.15,SE=0.002}$) or higher HDI (${\displaystyle \mathrm {B} =0.11,SE=0.002}$) countries are likely to spend even less time reading compared to Global South or lower HDI readers when they are on a desktop device compared to a mobile device (visiting the desktop vs the mobile webhost). Indeed, as shown in the marginal effects plot for model 1b, for the prototypical reader, the gap between Global South and Global North is greater on desktop devices (about 5 seconds) than on mobile devices (about 3 seconds). The marginal effects plot for model 1a indicates that, according to the model, prototypical readers in low HDI countries are likely to spend more time reading when they are on a desktop device than on a mobile device, but the reverse is true for readers in high HDI countries. Yet readers in low HDI countries are expected to spend more time reading a given page no matter their device choice.
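The role of the interaction term can be seen by combining coefficients from the Model 1b regression table. This sketch assumes a natural-log outcome and holds all other terms equal, considering page views that are not the last in a session so that the Global North : Last in Session term does not apply.

```python
import math

# Coefficients from the Model 1b regression table.
B_GLOBAL_NORTH = -0.2680
B_MOBILE_X_GLOBAL_NORTH = 0.1490


def south_to_north_ratio(mobile):
    """Predicted Global South / Global North reading-time ratio, other terms equal.

    Assumes a natural-log outcome, so differences in coefficients become ratios.
    """
    north_effect = B_GLOBAL_NORTH + (B_MOBILE_X_GLOBAL_NORTH if mobile else 0.0)
    return math.exp(-north_effect)


desktop_ratio = south_to_north_ratio(mobile=False)
mobile_ratio = south_to_north_ratio(mobile=True)
print(f"desktop: {desktop_ratio:.2f}x, mobile: {mobile_ratio:.2f}x")
```

The positive interaction coefficient partly offsets the negative Global North coefficient on mobile, so the predicted ratio is closer to 1 there than on desktop.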

Figure 4.2. Marginal effects plot showing how the time spent on pages depends on the kind of device a reader is using and the development level of the country they are in.

Figure 4.3. Marginal effects plot showing how the time spent on pages depends on the kind of device a reader is using and the development level of the country they are in.

### Hypothesis 3: Development and last-in-session

As we expect deep reading to be most likely in the last page view in a session, we predicted in H3 that the difference in reading times between less developed countries and more developed countries would be amplified in the last page view in a session. However, we do not find support for this hypothesis, which would have been indicated by a negative regression coefficient for the interaction term between development and last-in-session. Instead we find positive coefficients for HDI:Last in session (${\displaystyle \mathrm {B} =0.06,SE=0.002}$) in model 1a and for Global North:Last in session (${\displaystyle \mathrm {B} =0.08,SE=0.002}$) in model 1b.

Figure 5.1. Marginal effects plot showing how the time spent on pages depends on whether a reader is on their last page view in a session and the development level of the country they are in.

Figure 5.2. Marginal effects plot showing how the time spent on pages depends on whether a reader is on their last page view in a session and the development level of the country they are in.

### Non-Parametric Analysis

The table below shows the median time pages are visible by the user's economic region, device and whether a page is the last viewed in the user's session. Consistent with H1, users in the Global South spend more time on pages compared to users in the Global North regardless of device or session stage. Consistent with H2, the difference between Global South and Global North users is clearly more pronounced on desktop compared to mobile. In contrast to the prediction of H3, but in line with the findings from our parametric analysis, we do not observe an accentuation of the difference between Global South and Global North users in the last page view in a session.

| Last-in-session | Economic region | Desktop | Median time visible (seconds) |
|---|---|---|---|
| False | Global North | False | 20.1 |
| False | Global North | True | 16.1 |
| False | Global South | False | 21.5 |
| False | Global South | True | 21.8 |
| True | Global North | False | 28.1 |
| True | Global North | True | 39.8 |
| True | Global South | False | 28.7 |
| True | Global South | True | 43.6 |

Table 6.1. Median reading times by last-in-session, economic region, and device type.
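The comparisons described above can be read off Table 6.1 by subtracting the Global North median from the Global South median within each cell. The short sketch below does this arithmetic on the table values; the variable and function names are ours.

```python
# Median seconds visible, keyed by (last_in_session, region, desktop), from Table 6.1.
MEDIANS = {
    (False, "Global North", False): 20.1,
    (False, "Global North", True): 16.1,
    (False, "Global South", False): 21.5,
    (False, "Global South", True): 21.8,
    (True, "Global North", False): 28.1,
    (True, "Global North", True): 39.8,
    (True, "Global South", False): 28.7,
    (True, "Global South", True): 43.6,
}


def south_minus_north(last_in_session, desktop):
    """Global South median minus Global North median for one cell, in seconds."""
    south = MEDIANS[(last_in_session, "Global South", desktop)]
    north = MEDIANS[(last_in_session, "Global North", desktop)]
    return round(south - north, 1)


gaps = {
    (last, desktop): south_minus_north(last, desktop)
    for last in (False, True)
    for desktop in (False, True)
}
for (last, desktop), gap in gaps.items():
    device = "desktop" if desktop else "mobile"
    stage = "last view" if last else "earlier view"
    print(f"{stage:12s} {device:7s} gap = {gap:.1f} s")
```

The gap is positive in every cell, larger on desktop than on mobile, and not larger in the last page view than in earlier page views.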

# Limitations

Two important limitations of this analysis affect our ability to compare reader behavior between mobile phone and PC devices. The first is the technical limitation of the browser instrumentation on mobile devices, discussed above, which led to a large amount of missing data on mobile devices. This missing data likely introduces a negative bias to our measures of reading time on mobile devices, because data is more likely to be missing in cases where the user switches tasks from the browser, and then subsequently returns to complete their reading task. This bias may be quite significant as the issue affects a large proportion of our sample. We are considering improvements to the instrumentation that address this limitation, in particular making use of the Page Lifecycle API recently introduced in Google Chrome.

A second limitation of our ability to compare mobile phone and PC devices is derived from our intuitions about how reader behavior may differ in the two cases. Mainly, we think that it may be somewhat common for readers to leave a page visible in a web browser at times when they are not directly reading it (the "lunch break problem"). Users may leave multiple visible windows on PCs, while only interacting with one, or may leave a browser window visible and move away from their computer for long periods of time. In general, the best we can hope to observe is that a page is visible in a browser. We cannot, through this instrument alone, know with confidence that an individual is reading. It may be possible to introduce additional browser instrumentation for the collection of scrolling, mouse position, or mouse click information. However, such steps should be taken with care as additional data collection may negatively affect the experiences of readers and editors in terms of privacy, browser responsiveness, page load times, and power consumption.

To address this limitation, we fit regression models on data with dwell times greater than 1 hour removed, and found that our results were not substantively affected by the change. Therefore, we do not believe that our results are driven by user behaviors that generate the appearance of long reading times that do not correspond to reading.

An additional limitation arises from the missing data described above. It is possible that we are missing data in ways that may potentially confound our results, especially, but not exclusively, in terms of the comparison between mobile and non-mobile devices.

The analysis presented here is carried out on observational, rather than experimental, data with the intention of describing correlations, rather than demonstrating causal relationships, between our variables. We used ordinary least squares analysis because it was convenient for working with our large dataset; however, future analysis might better account for the hierarchical structure of our data using multilevel modeling.

# Conclusion

We conducted the first in-depth investigation of reading time data from Wikipedia. We measured the time that web pages are visible in the browser windows of Wikipedia visitors as an approximation of reading time. We vetted the data to understand its limitations and found a high rate of missing data on mobile, among other less significant irregularities. Future analysts should keep this in mind and work to improve the coverage.

One anticipated application of reading time data is for evaluating design interventions intended to improve the user experience of Wikipedia visitors. We recommend that analysts and designers use geometric means as a metric for comparing reading behavior between treatments or between sites. The distribution of reading times is very skewed and therefore the arithmetic mean can be misleading. Moreover, for most wikis, the log-normal distribution is a good fit to the data, and this justifies the use of geometric means.

The reading time data we used in this study is a promising tool for future researchers to better understand Wikipedia's audiences. For example, recent prior research has shown widespread misalignment between how often articles are visited and the quality of those articles [14]. However, we have observed that not all views are created equal. Future studies on the relationship between content production and content consumption on Wikipedia might use reading time data to learn about how content consumption might depend on article quality.

# References

1. Lemmerich, Florian; Sáez-Trumper, Diego; West, Robert; Zia, Leila (2018-12-02). "Why the World Reads Wikipedia: Beyond English Speakers". arXiv:1812.00474 [cs].
2. Mitzenmacher, Michael (2004-01-01). "A Brief History of Generative Models for Power Law and Lognormal Distributions". Internet Mathematics 1 (2): 226–251. ISSN 1542-7951. doi:10.1080/15427951.2004.10129088. Retrieved 2018-10-17.
3. Stumpf, Michael P. H.; Porter, Mason A. (2012-02-10). "Critical Truths About Power Laws". Science 335 (6069): 665–666. ISSN 1095-9203. PMID 22323807. doi:10.1126/science.1216142. Retrieved 2019-03-22.
4. Kiesler, Sara; Sproull, Lee S. (1986-01-01). "Response Effects in the Electronic Survey". Public Opinion Quarterly 50 (3): 402–413. ISSN 0033-362X. doi:10.1086/268992. Retrieved 2018-12-29.
5. Phillips, Derek L.; Clancy, Kevin J. (1972-03-01). "Some Effects of "Social Desirability" in Survey Studies". American Journal of Sociology 77 (5): 921–940. ISSN 0002-9602. doi:10.1086/225231. Retrieved 2018-12-29.
6. Antin, Judd; Shaw, Aaron (2012). "Social desirability bias and self-reports of motivation: a study of Amazon Mechanical Turk in the US and India". Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. CHI '12. New York, NY, USA: ACM. pp. 2925–2934. ISBN 978-1-4503-1015-4. doi:10.1145/2207676.2208699. Retrieved 2014-01-12.
7. Hill, Benjamin Mako; Shaw, Aaron (2013). "The Wikipedia Gender Gap Revisited: Characterizing Survey Response Bias with Propensity Score Estimation". PLoS ONE 8 (6). doi:10.1371/journal.pone.0065782.
8. Clauset, A.; Shalizi, C.; Newman, M. (2009-11-04). "Power-Law Distributions in Empirical Data". SIAM Review 51 (4): 661–703. ISSN 0036-1445. doi:10.1137/070710111. Retrieved 2019-01-01.
9. Liu, Chao; White, Ryen W.; Dumais, Susan (2010). "Understanding Web Browsing Behaviors Through Weibull Analysis of Dwell Time". Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '10 (New York, NY, USA: ACM): 379–386. doi:10.1145/1835449.1835513.
10. Pal, M.; Ali, M.M.; Woo, J. (2006). "Exponentiated Weibull distribution". Statistica 66 (2): 139–147.
11. Pal, Manisha; Ali, M. Masoom; Woo, Jungsoo (2006). "Exponentiated Weibull distribution". Statistica 66 (2): 139–147. ISSN 1973-2201. doi:10.6092/issn.1973-2201/493. Retrieved 2018-10-14.
12. Pearce, Katy E.; Rice, Ronald E. (2013-08-01). "Digital Divides From Access to Activities: Comparing Mobile and Personal Computer Internet Users". Journal of Communication 63 (4): 721–744. ISSN 1460-2466. doi:10.1111/jcom.12045. Retrieved 2018-10-20.
13. Pepinsky, Thomas B. (2018-01-01). "Visual heuristics for marginal effects plots". Research & Politics 5 (1): 2053168018756668. ISSN 2053-1680. doi:10.1177/2053168018756668. Retrieved 2019-03-23.
14. Warncke-Wang, Morten; Ranjan, Vivek; Terveen, Loren; Hecht, Brent (2015-04-21). "Misalignment Between Supply and Demand of Quality Content in Peer Production Communities". Ninth International AAAI Conference on Web and Social Media. Retrieved 2016-08-15.

# Appendices

## Regression Tables

Statistical models

| Term | Model 1a | Model 1b |
|---|---|---|
| Intercept | 1.3660 (0.0085)*** | 1.3791 (0.0085)*** |
| Global North | | -0.2680 (0.0022)*** |
| mobile : Global North | | 0.1490 (0.0024)*** |
| mobile : Last in Session | -0.6332 (0.0021)*** | -0.6349 (0.0021)*** |
| Global North : Last in Session | | 0.0830 (0.0024)*** |
| Human development index | -0.1961 (0.0018)*** | |
| mobile : HDI | 0.1133 (0.0019)*** | |
| HDI : Last in Session | 0.0632 (0.0019)*** | |
| Revision length (bytes) | 0.1752 (0.0004)*** | 0.1758 (0.0004)*** |
| time to first paint | -0.0164 (0.0006)*** | -0.0171 (0.0006)*** |
| time to dom interactive | 0.0025 (0.0009)** | 0.0024 (0.0009)** |
| mobilemobile | -0.0118 (0.0023)*** | -0.0142 (0.0023)*** |
| sessionlength | -0.0001 (0.0000)*** | -0.0001 (0.0000)*** |
| lastinsessionLast in session | 0.8632 (0.0023)*** | 0.8575 (0.0023)*** |
| nthinsession | 0.0002 (0.0000)*** | 0.0002 (0.0000)*** |
| dayofweekMon | 0.0939 (0.0020)*** | 0.0926 (0.0020)*** |
| dayofweekSat | 0.0169 (0.0020)*** | 0.0175 (0.0020)*** |
| dayofweekSun | 0.0322 (0.0020)*** | 0.0332 (0.0020)*** |
| dayofweekThu | 0.0561 (0.0019)*** | 0.0548 (0.0019)*** |
| dayofweekTue | 0.0349 (0.0020)*** | 0.0326 (0.0020)*** |
| dayofweekWed | 0.0757 (0.0019)*** | 0.0743 (0.0019)*** |
| usermonth4 | 0.0095 (0.0096) | 0.0083 (0.0096) |
| usermonth5 | 0.0108 (0.0095) | 0.0104 (0.0095) |
| usermonth6 | -0.0102 (0.0097) | -0.0103 (0.0097) |
| usermonth7 | -0.0494 (0.0097)*** | -0.0491 (0.0097)*** |
| usermonth8 | -0.0119 (0.0097) | -0.0121 (0.0097) |
| usermonth9 | 0.0382 (0.0076)*** | 0.0370 (0.0076)*** |
| usermonth10 | -0.0004 (0.0075) | 0.0010 (0.0075) |
| R2 | 0.0721 | 0.0725 |
| Adj. R2 | 0.0720 | 0.0725 |
| Num. obs. | 9873641 | 9873641 |
| RMSE | 14.2330 | 14.2297 |

Standard errors in parentheses. ***p < 0.001, **p < 0.01, *p < 0.05