
This is the first larger investigation into the reading time metric, and its results have shown us both expected and unexpected patterns in reading behavior. We hope to expand this research in the future by identifying the causes of the trends observed below, and ultimately to incorporate these findings into our overall product strategy and planning processes.

# Abstract

How much time do Wikipedia readers spend when they visit an article? How does time spent vary from language edition to language edition, or between different kinds of readers or articles? How can we determine whether a new design change increases time spent by readers?

In 2017, the Wikimedia Foundation began collecting data on page dwell times to answer such questions. In this project, we validate this data and begin to answer questions such as the ones above. We observe the limitations of the data, most notably a high rate (57%) of missing data on mobile devices. Yet as long as one keeps such shortcomings in mind, we believe that the data can be fruitfully applied to improve our current knowledge of how people read Wikipedia. We used regression analyses to explore how factors like page length, device choice, and the locations of readers are related to reading times. We believe that our results for device choice and reader location offer behavioral data to corroborate findings from Why the World Reads Wikipedia, a large scale survey of Wikipedia readers of 14 language editions.

Here are some highlights from our analysis:

• We estimate that the entire human race spent about 670,000 years reading Wikipedia between November 2017 and October 2018.
• Upon opening an article, the median visitor will remain for only 25 seconds, but the distribution is extremely skewed. Given that a reader remains for at least 25 seconds, the chances are about even that they will spend more than 75 seconds on the page.
• Page length is positively related to reading time. Doubling page length is associated with an increase in reading times by a factor of 1.2.
• Typical readers on mobile devices spend about 1.7 seconds less per view than readers on desktop devices.
• On average, readers spend over twice as much time on their last view in a session compared to other views.
• Typical readers in the Global South spend more time in an average page view than readers in the Global North.
• The gap between the time spent by Global South readers and Global North readers is amplified on desktop devices. On desktop devices the gap is 5 seconds, but on mobile devices it is only 3 seconds.
• As reading times are highly skewed, the median and geometric mean are better for summarizing changes to reading time than the arithmetic mean.
• Log-normal distributions fit the data well enough to reasonably justify regression modeling and t-tests on log-transformed reading times. However, both power-law distributions such as the Lomax distribution and the three-parameter exponentiated Weibull distribution often fit better than the log-normal distribution.

# Introduction

In 2017, the Wikimedia Foundation’s web team introduced new instrumentation to measure the amount of time Wikipedia readers spend on the pages they view. The original goal was to develop a new metric for evaluating how feature releases, such as this year’s launch of the page previews feature, may change reader behavior. However, we realized that this data could also be useful for understanding patterns of reading behavior in general. In preliminary work during 2017, Readers senior analyst Tilman Bayer and Readers analysis intern Zareen Farooqui explored and evaluated the new metric. This project continues this work with additional validation and new analyses of general reading patterns.

Measuring the amount of time that visitors to Wikipedia spend viewing pages provides previously unavailable information about patterns of content consumption on Wikipedia. While MediaWiki wikis provide histories of every edit, and therefore granular, high-quality data on productive collaboration is readily available, comparable sources of data are unavailable when it comes to understanding Wikipedia readership. Researchers interested in how people consume Wikipedia content have approached the question through a creative arsenal of data collection strategies, including view counts, click streams, session lengths, eye-tracking, scroll positions, section collapsing and expansion on the mobile website, and surveys. However, at present, the only publicly available large-scale data source on reader behavior is the count of page views. Measuring reading time provides additional nuance over view data. With reading times in view, it becomes clear that not all views are created equal. Some page views involve deep reading, yet most are quite short.

In this project we first evaluate the quality of the adopted approach for measuring reading times. We do this by comparing the measurements to the timing of server-side log events and by looking for patterns of systematically missing or invalid data. While we do find some shortcomings, such as a substantial amount of missing data from mobile devices and a low rate of invalid (missing or negative) measurements, we believe that the data can be generally informative as long as these limitations are considered.

We next consider possible probability models for reading time. One anticipated use of reading time data is in the evaluation of design experiments seeking to improve reader experiences on Wikipedia. "How does this feature change reading behavior?" is a question experimenters are likely to ask. Model selection is important for validating assumptions that underlie the use of a statistic such as an arithmetic or geometric mean as a metric. Model selection itself can sometimes lead to novel insights, when theorized data generating processes predict that a given model will be a good fit for the data. We evaluate several different distributions and find that the log-normal distribution fits the data well enough to justify the use of the geometric mean as a metric and of ordinary least squares regression models to explore marginal effects of other variables.

Consistent with the results of the survey, we find that readers in countries that are less developed or are in the Global South stay on given pages for longer on average compared to readers in the Global North or in more developed countries. Moreover, this difference is amplified where we would expect users to consume information in depth: on the desktop (non-mobile) site. While we hypothesized that the difference would also be greater in the last pageview in a session, this idea was not supported. We demonstrate these patterns using non-parametric methods and multivariate parametric analyses.

# Methods

## Data collection and validation

The reading depth plugin works by running JavaScript in the client browser which sends two messages to the server. The first message (the page loaded event) is sent when the page is loaded and the second message (the page unloaded event) sends values from timers that measure, among other things, the amount of time that the page was visible in the visitor's browser window.

More specifically, the plugin uses the Page Visibility API to measure visible time, the total amount of time that the page was in a visible browser tab. Visible time is the primary measure we use in this report because it excludes time when the user could not possibly have been reading the page. The plugin also records a second measure of reading time: total time. This is simply the entire time the page was loaded in the browser. We use this variable for data validation and in robustness checks. The plugin also measures page load time in two ways: time to first paint, the time from the request until the browser starts to render any part of the page; and DOM interactive time, the time from the request until the user can interact with the page. The current version of the reading depth event logging plugin was first enabled on November 20, 2017. From November 2017 until September 2018 we logged events from a 0.1% sample of visitor sessions; the sampling rate was increased to 10% on September 25, 2018.

Since we care about the reading behavior of humans, we identify bots using user agent strings and exclude them from all of our analyses.

## Missing data

We are only able to collect data from web browsers that support the APIs on which the instrument depends. Also, we excluded certain user agents that were found to send data unreliably in our testing, namely the default Android browser, versions of Chrome earlier than 39, Safari, and all browsers running on versions of iOS less than 11.3. See this Phabricator task for additional details. We also do not collect data from browsers that have not enabled JavaScript or that have enabled Do Not Track.

Even when the above conditions are met, in some cases we are still not able to collect data. Sometimes we observe a page loaded event, indicating that a user in our sample opened a page, but we do not observe a corresponding event indicating that the user has left the page (a page unloaded event). This issue affects 57% of records on the mobile site and about 5% of records on the desktop site. The likely explanation of why so many mobile views are affected is that many mobile browsers refresh the page if the user switches to a program other than the browser; in such cases the browser never sends a page unloaded event. We only include events where we observe exactly one page loaded event and one page unloaded event.
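This pairing rule can be sketched in pandas; the column names (`page_view_token`, `event_type`) are hypothetical stand-ins for the actual EventLogging schema:

```python
import pandas as pd

# Hypothetical event log: one row per event, keyed by a page view token.
events = pd.DataFrame({
    "page_view_token": ["a", "a", "b", "c", "c", "c"],
    "event_type": ["loaded", "unloaded", "loaded",
                   "loaded", "unloaded", "unloaded"],
})

# Count events of each type for every page view.
counts = pd.crosstab(events["page_view_token"], events["event_type"])

# Keep only page views with exactly one loaded and one unloaded event.
valid = counts[(counts["loaded"] == 1) & (counts["unloaded"] == 1)]
print(valid.index.tolist())  # only token "a" qualifies
```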

We also remove 0.016% of page read events where, for unknown reasons, the instrument recorded a page visible time that was less than 0 or undefined.

## Taking a sample

Because Wikipedia is so widely read, even a sample of 0.1% of events results in a huge amount of data. This exceeds the requirements of this project-level analysis and leads to computational difficulties. We therefore conduct our analysis on random samples of the dataset. (Data was collected at a higher sampling rate to enable content-level analysis of dwell times, e.g. for specific topics or pages, which was among the possible research topics envisioned for this project.) To ensure that all wikis are fairly and adequately represented in our sample, we use stratified sampling. Stratified sampling ensures that all groups are represented fairly in a random sample by assigning a "weight" to each group to adjust the probability that members of the group are chosen in the final sample. This introduces a known "bias" into the resulting sample, so the "weights" are subsequently used (as in a weighted average) to correct the known sampling bias. For estimating total reading time, and for the univariate analysis, we stratify by wiki, take up to 20,000 data points for each wiki, and exclude wikis that have fewer than 300 data points.

In the multivariate analysis we stratify by wiki, by the country of the reader, and by whether or not we think the user is on a mobile device, and we sample up to 200 data points for each stratum.
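A minimal sketch of this capped per-stratum sampling in pandas, with thresholds scaled down for illustration (the actual analysis used a 20,000-row cap and a 300-observation floor per wiki):

```python
import pandas as pd

# Illustrative records with a 'wiki' stratum column.
df = pd.DataFrame({
    "wiki": ["enwiki"] * 500 + ["pawiki"] * 40,
    "visible_time": range(540),
})
MAX_PER_WIKI = 100  # cap per stratum (20,000 in the actual analysis)
MIN_PER_WIKI = 50   # exclusion floor (300 in the actual analysis)

# Drop wikis with too few observations ...
sizes = df.groupby("wiki")["visible_time"].transform("size")
eligible = df[sizes >= MIN_PER_WIKI]

# ... then shuffle and keep at most MAX_PER_WIKI rows per wiki.
sample = (eligible.sample(frac=1, random_state=0)
                  .groupby("wiki")
                  .head(MAX_PER_WIKI))
print(sample["wiki"].value_counts().to_dict())  # {'enwiki': 100}
```

Here pawiki falls below the floor and is excluded entirely, while enwiki is capped at the per-stratum maximum.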

Before turning to our other questions, we present summary statistics and a high-level description of reading behavior on Wikipedia. When someone opens a given page on Wikipedia, how long do they typically stay on the page? Are reading times highly skewed? How much does reading behavior vary across different language editions of Wikipedia? How much time does all of humanity spend reading Wikipedia?

### Wikipedia as a whole

In general, the distribution of reading times is very skewed. Upon opening a Wikipedia page, a reader will close it or navigate away in less than 25 seconds about half of the time. If they do remain on the page longer than the median dwell time, half of the time again they are likely to navigate away within 75.1 seconds. Given such a highly skewed distribution, it makes more sense to discuss medians and other percentiles, because the mean will be very far away from most of the mass of the distribution. This runs quite contrary to what the mean represents in most people's intuitions. Fortunately, once the data has been log-transformed, the density is bell-shaped and the median and the mean are quite close. Therefore, we also find it useful to discuss the geometric mean. We discuss the skewed nature of the data and the benefits of log-transformation at length below.

                    5%    25%   50%   75%   95%
time visible (sec)  1.8   8.0   25.0  75.1  439.1

Table 1.1. This table shows percentiles for reading times over all Wikipedia editions.
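The pull of the heavy tail on the arithmetic mean, and the agreement between the median and the geometric mean under log-normality, can be illustrated with simulated data (the parameters below are chosen only to roughly match the observed ~25-second median, not fitted to the real data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated dwell times (seconds) from an illustrative log-normal.
times = rng.lognormal(mean=np.log(25.0), sigma=1.6, size=100_000)

print(np.median(times))    # near 25: robust to the heavy tail
print(stats.gmean(times))  # near the median under log-normality
print(times.mean())        # pulled far to the right by rare long views
```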

Figure 1.1. The distribution of dwell times across all language editions of Wikipedia. The top figure shows a histogram of dwell times less than 300 seconds (5 minutes) long. In this figure we can see that the median dwell time is about 25 seconds and that the distribution of dwell times is very skewed. The Y axis represents the probability that a given page view falls in a given bin. In the lower figure, the dwell times are log-transformed and the data appear bell-shaped, with considerable skew to the right.

### Total time spent

The plot below shows an estimate of the total amount of time people spent reading each month, computed by multiplying the average reading time in each month (measured as visible time) by the number of views in that month. Humanity spent about 672,349 years reading Wikipedia from November 2017 through October 2018, excluding readers using the mobile apps, and bots. It is possible that some people leave Wikipedia pages visible in their browsers for extended periods of time without reading. For example, someone might open a page and then walk away from the computer to have lunch. To make our estimates somewhat conservative, we rounded all page views lasting more than 1 hour down to one hour in these estimates of reading time.
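The estimate can be sketched with made-up numbers (the view counts and times below are illustrative, not the actual measurements):

```python
import numpy as np

SECONDS_PER_HOUR = 3600
SECONDS_PER_YEAR = 365.25 * 24 * SECONDS_PER_HOUR

# Illustrative per-view visible times in seconds; the 2-hour view is
# rounded down to one hour to keep the estimate conservative.
visible = np.array([12.0, 250.0, 7200.0])
capped = np.minimum(visible, SECONDS_PER_HOUR)

# Scale the per-view mean by a monthly view count (an illustrative
# figure, not the actual traffic numbers).
monthly_views = 15e9
total_years = capped.mean() * monthly_views / SECONDS_PER_YEAR
print(round(total_years))
```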

Figure 1.2. This chart plots an estimate of the total amount of time people spent reading Wikipedia each month from November 2017 through October 2018.

### Variation between different language editions

We present box plots of the distribution of page visible times on different Wikipedia language editions. As above, we place unscaled data side-by-side with log-transformed data. The box plots show the inter-quartile range (IQR) as a box. A line inside the box represents the median, and the "whiskers" extend the IQR by a factor of 1.5. The plots with untransformed data are truncated at 300 seconds (5 minutes) because showing the entire range of the data would render the plots illegible. The log-transformed plots show the full range of the data. We hope these plots will be useful to readers who wish to know how reading times compare between their wikis of interest.

We highlight a handful of example language editions that are representative of projects of different sizes and different cultures. These are Arabic (ar), German (de), English (en), Spanish (es), Hindi (hi), Dutch (nl) and Punjabi (pa).

This large plot shows box plots for each wiki. Wikis are in alphabetical order by language code.

wiki   5%    25%   50%   75%   95%
ar     5.2   5.2   21.5  69.9  371.7
de     14.1  14.1  14.1  56.6  482.7
en     37.2  37.2  37.2  37.2  262.4
es     23.3  23.3  23.3  65.5  616.4
hi     2.5   11.4  31.4  82.6  360.5
nl     6.1   6.1   15.9  60.1  441.8
pa     2.0   7.2   19.5  55.4  303.1
Figure 1.3. This chart shows box plots of the distribution of time that Wikipedia pages were open in the browser on a selection of wikis. The plots were computed on random samples of several thousand observations for each wiki and truncated at 300 seconds. Spanish, Hindi, and Arabic appear to have longer reading times, while English and Punjabi appear to have somewhat shorter reading times.

We observe a great deal of variation in the distribution of reading times between different language editions. We do not investigate these differences further in this report because we lack knowledge of the specific contexts of each community and its audience that would be necessary to adequately explain them. Instead, we present an analysis of the relationship between reading time and the development level of readers' countries to offer a more general explanation of one factor that might make a difference.

### Other variables

We use some variables other than the timers obtained from the plugin in our analysis. The EventLogging system records the date and time the page was viewed as well as the title of each page a user visits in a session. We obtained the page length, measured in bytes at the time the page was viewed, by merging the event logging data with the edit history. To understand how reading behavior on mobile devices differs from behavior on non-mobile (i.e. desktop) devices, we assume that visitors to mobile web hosts (e.g. en.m.wikipedia.org) are using mobile devices and that visitors to non-mobile web hosts (e.g. en.wikipedia.org) are on non-mobile (desktop) devices.
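The host-based classification amounts to a simple rule; a minimal sketch, assuming the standard `en.m.wikipedia.org`-style mobile host convention:

```python
# Classify mobile vs. desktop from the request host: mobile hosts carry
# an "m" subdomain between the language code and the project domain.
def is_mobile_host(host: str) -> bool:
    parts = host.split(".")
    return len(parts) > 2 and parts[1] == "m"

print(is_mobile_host("en.m.wikipedia.org"))  # True
print(is_mobile_host("en.wikipedia.org"))    # False
```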

We obtain the approximate country in which a reader is located from the MaxMind GeoIP database, which is integrated with the analytics pipeline. We then use the Human Development Index (HDI) from the UN's human development data to measure the development level of the country. We lack geolocation data before March 3, 2018, and this limits our analysis of development and reading times to the period from then until September 28, 2018.

In our model selection process, we observed that partial residual plots of the interaction term between the HDI and mobile were very skewed. Standardizing the HDI by centering to 0 and scaling it by the standard deviation (taken at the country level) improved this and also allows us to interpret results in terms of standard deviations, which are likely more intuitive to readers less familiar with the HDI.
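The standardization step itself is straightforward; a sketch with illustrative (not actual) HDI values:

```python
import pandas as pd

# Illustrative country-level HDI values (not the actual UN figures).
hdi = pd.Series({"NOR": 0.95, "USA": 0.92, "IND": 0.64, "NER": 0.38})

# Center at 0 and scale by the country-level standard deviation, so
# regression coefficients read as "per standard deviation of HDI".
hdi_std = (hdi - hdi.mean()) / hdi.std()
print(hdi_std.round(2).to_dict())
```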

We also use a second, dichotomous, measure of development in terms of established regional classifications of Global North and Global South. Finally, this EventLogging instrumentation retains a session token with which we measure the number of pages viewed in the session so far (Nth in session) and whether or not a given page view is the last in session.

## Univariate model selection

### Motivation

We want to be able to answer questions like: Did introducing a new feature to the user interface cause a change in the amount of time users spend reading? Are reading times on English Wikipedia longer than on Spanish Wikipedia? What is the relationship between the development level of a reader's country and the amount of time they spend reading on different devices if we account for other factors?

Using a parametric model allows us to perform statistical tests to answer questions such as the ones listed above. Parametric models assume the data have a given probability distribution and have interpretable parameters such as mean, variance, and shape parameters. Fitting parametric distributions to data allows us to estimate these parameters and to statistically test changes in the parameters. However, assuming a parametric model can lead to misleading conclusions if the assumed model is not the true model. Therefore we want to evaluate how well different parametric models fit the data in order to justify parametric assumptions. Understanding how the data is distributed can also be interesting in its own right because distributions can inform understandings of the data generating process.

### Candidate models

Log-Normal Distribution: The log-normal distribution is a two-parameter probability distribution. Intuitively, it is just a normal distribution on a logarithmic scale. This gives it convenient properties because its parameters are the mean and variance of the log-transformed data. This means that one can take the logarithm of the data and then use t-tests to evaluate differences in means, or use ordinary least squares to infer regression models. These advantages make the log-normal distribution a common choice for analyzing skewed data, even when it is not a perfect fit for the data.
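For example, under the log-normal assumption an experiment comparing two conditions reduces to a t-test on log reading times. A sketch with simulated data (the geometric means of 25 and 30 seconds are illustrative, not measured values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Simulated control and treatment reading times (seconds).
control = rng.lognormal(mean=np.log(25.0), sigma=1.5, size=5000)
treatment = rng.lognormal(mean=np.log(30.0), sigma=1.5, size=5000)

# Under log-normality, a t-test on log times compares geometric means.
t, p = stats.ttest_ind(np.log(treatment), np.log(control))
print(t > 0, p < 0.05)  # the simulated lift is detectable
```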

Lomax (Pareto Type II) Distribution: Datasets on human behavior often exhibit power-law distributions, meaning that the probability of extreme events, while still low, is much greater than would be predicted by a normal (or log normal) distribution. Power law distributions are a commonly used class of one-sided "heavy-tailed," "long-tailed," or "fat-tailed" probability models [6]. We fit the Lomax Distribution, a commonly used heavy-tail distribution with two parameters that assumes that power law dynamics occur over the whole range of the data.

Weibull Distribution: Liu et al. (2010) model reading times on web pages using a Weibull distribution [7]. This model has two parameters: ${\displaystyle \lambda }$, a scale parameter, and ${\displaystyle k}$ a shape parameter. The Weibull distribution can be a useful model because of the intuitive interpretation of ${\displaystyle k}$. If ${\displaystyle k>1}$, then reading behavior exhibits "positive aging," which means that the longer someone stays on a page, the more likely they are to leave the page at any moment. Conversely ${\displaystyle k<1}$ is interpreted as "negative aging," which means that as someone remains on a page, the less likely they are to leave the page at any given moment. The Weibull distribution is often used in the context of reliability engineering because it is convenient for modeling the chances that a given part will fail at a given moment.

Exponentiated Weibull Distribution: The Weibull model assumes that the rate of readers leaving a page changes monotonically with respect to time. This means that a reader cannot first become more likely to leave the page up to some point in time and then become less likely to leave afterwards. In other words, there must be either "negative aging," "positive aging," or "no aging." This excludes processes in which "positive aging" transitions to "negative aging" after a point [8]. Therefore, if the data show that the probability of a reader leaving a page first increases and then decreases (or vice versa), then the assumptions of the Weibull model are violated. The exponentiated Weibull distribution is a three-parameter generalization of the Weibull distribution that relaxes this constraint. The extra degree of freedom allows this model to fit a greater range of empirical distributions than the two-parameter Weibull model.

We also considered the gamma distribution and the exponential distribution, but we will not go into depth about them here. We did not have a strong motivation for these models and they did not fit the data well.

We fit the models using SciPy. The exponentiated Weibull, Weibull, Lomax, and gamma models were fit using maximum likelihood estimation; the other models were fit using the method of moments.
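A sketch of the fitting procedure with `scipy.stats`, using simulated data in place of a wiki's reading times:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Simulated reading times standing in for one wiki's data.
times = rng.lognormal(mean=np.log(25.0), sigma=1.5, size=5_000)

# Fit each candidate by maximum likelihood (location fixed at zero,
# since reading times start at 0) and compute the KS distance.
for dist in (stats.lomax, stats.lognorm,
             stats.exponweib, stats.weibull_min):
    params = dist.fit(times, floc=0)
    ks = stats.kstest(times, dist.name, args=params)
    print(f"{dist.name}: KS distance {ks.statistic:.4f}")
```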

### Methods

Our method for model selection is inspired in part by Liu et al. (2010), who compared the log-normal distribution to the Weibull distribution for dwell times on a large sample of web pages [7]. They fit both models to data for each website and then compared two measures of model fit: the log-likelihood, which measures the probability of the data given the model (higher is better), and the Kolmogorov-Smirnov distance (KS-distance), which is the maximum difference between the model CDF and the empirical CDF (lower is better). For the sample of web pages they consider, the Weibull model outperformed the log-normal model in a large majority of cases according to both goodness-of-fit measures.

Similar to the approach of Liu et al. (2010), we fit each of the models we consider on reading time data, separately for each Wikipedia project [7]. We also use the Kolmogorov-Smirnov distance (KS-distance) to evaluate goodness-of-fit. The KS-distance supports a statistical test of the null hypothesis that the model is a good fit for the data [6]. The KS-test is quite sensitive to deviations between the model and the data, especially in large samples, so failing to reject this hypothesis with a large sample of data supports the conclusion that the model is a good fit. This allows us to go beyond Liu et al. (2010) by evaluating whether each distribution is a plausible model, instead of just whether one distribution is a better fit than another.

Liu et al. (2010) compare two distributions that each have 2 parameters, but the models we consider have different numbers of parameters (the exponentiated Weibull model has 3 parameters and the exponential model has only 1). Adding parameters can increase model fit without increasing out-of-sample predictive performance or explanatory power. To avoid the risk of over-fitting and to make a fair comparison between models, we use the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) instead of the raw log-likelihood. Both criteria attempt to quantify the amount of information lost by the model (lower is better): they evaluate the log-likelihood but add a penalty for the number of model parameters. The difference between AIC and BIC is that BIC's penalty on parameters grows with the sample size. For a more detailed example of this procedure see this work log.
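SciPy does not report AIC or BIC directly, but both are simple functions of the maximized log-likelihood. A sketch (note that `len(params)` also counts the location parameter we fix at zero; a careful implementation would count only freely estimated parameters):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
times = rng.lognormal(mean=np.log(25.0), sigma=1.5, size=10_000)  # simulated

def aic_bic(dist, data):
    """AIC and BIC for a scipy.stats distribution fit by maximum likelihood."""
    params = dist.fit(data, floc=0)          # fix location at zero
    loglik = dist.logpdf(data, *params).sum()
    k = len(params)                          # includes the fixed loc
    aic = 2 * k - 2 * loglik
    bic = k * np.log(len(data)) - 2 * loglik
    return aic, bic

aic_ln, bic_ln = aic_bic(stats.lognorm, times)
aic_ex, bic_ex = aic_bic(stats.expon, times)
print(aic_ln < aic_ex, bic_ln < bic_ex)  # the true model wins on both
```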

We analyze these goodness-of-fit measures for each wiki and rank them from best to worst. For each distribution, we report the mean and median of these ranks. In addition, we report the mean and median p-values of the KS-tests and the proportion of wikis that pass the KS-test for each model. We also use diagnostic plots to compare the empirical and modeled distributions of the data in order to explain where models fail to fit the data. Because the data is so skewed, we use a logarithmic X axis in these plots.

The diagnostic plots are shown with data on English Wikipedia. On this wiki, the exponentiated Weibull model is the best fit, followed by the Lomax model and then the log-normal model. Only the exponentiated Weibull model passes the KS-test.

## Results

### Goodness-of-fit metrics

The table below shows the results of this procedure. The Lomax, exponentiated Weibull, and log-normal models all fit the data reasonably well. All three pass the KS-test for many wikis, and they are in a three-way tie for best median rank according to AIC.

                        AIC rank       BIC rank       KS rank        KS p-value          KS pass rate
model                   mean   median  mean   median  mean   median  mean    median      95%     97.5%
Lomax                   1.78   2.0     1.72   1.5     2.03   2.0     0.272   1.77e-01    0.760   0.818
Log-normal              2.26   2.0     2.17   2.0     2.34   2.0     0.272   1.48e-01    0.686   0.773
Exponentiated Weibull   2.26   2.0     2.42   3.0     2.14   2.0     0.293   2.08e-01    0.727   0.793
Weibull                 3.94   4.0     3.94   4.0     3.81   4.0     0.085   1.19e-04    0.248   0.298
Gamma                   4.93   5.0     4.96   5.0     4.79   5.0     0.045   2.26e-12    0.132   0.157
Exponential             5.83   6.0     5.79   6.0     5.88   6.0     0.015   0.00e+00    0.050   0.066

The Lomax distribution is the best fit across all wikis according to all metrics. With only 2 parameters, it has a lower AIC and BIC than the three-parameter exponentiated Weibull distribution and passes the KS-test 76% of the time at the 95% confidence level.

The exponentiated Weibull model fits the data better than the log-normal model in terms of passing KS-tests and with respect to AIC. However, the log-normal is better in terms of BIC, which imposes a greater penalty on the additional parameter of the exponentiated Weibull model.

The Weibull model fits substantially worse than the Lomax, log-normal, and exponentiated Weibull models in terms of all of our goodness-of-fit metrics. In this respect, our results differ from those of Liu et al. (2010), who observed the Weibull model fitting dwell time data better than the log-normal model [7]. We observe that for dwell times on Wikipedia, the log-normal model is the better fit. While substantially worse than the Lomax model, the log-normal model still passes the KS-test for about 69% of wikis in the sample.

### Discussion

We found that lomax, exponentiated Weibull, and log-normal models all fit the data within reason. We now discuss how each of these models can be applied to understanding Wikipedia reading behavior.

Lomax (Pareto Type II) Distribution: That the Lomax model fits well indicates that Wikipedia reading time data may follow a power law. Mitzenmacher (2004) describes several possible data generating processes for power law (Pareto) and log-normal distributions [9]. Rich-get-richer dynamics such as preferential attachment are commonly associated with power law distributions. However, according to Mitzenmacher's analysis, a mixture of log-normal distributions can also generate data appearing to follow a power law. Therefore, we cannot conclude from our model-fitting exercise that reading behavior on Wikipedia is driven by rich-get-richer dynamics. Furthermore, it is difficult to conceive of mechanisms for such dynamics. On the other hand, it is intuitive that a mixture of different log-normal processes is involved in reading time, such as an exploration process mixed with a reading process, or even a mixture of behavior patterns associated with discrete types of information consumption. Deeper exploration of potential power-law dynamics in reading behavior is a potential avenue for future research.

Log-Normal Distribution: The log-normal model does not fit the data perfectly, but it fits well enough to be useful. It frequently passes KS-tests, and it is preferred to the exponentiated Weibull by the BIC. Even though the Lomax model typically fits the data better, and the log-normal model is likely to underestimate the probability of very long reading times, assuming a log-normal model is very convenient. Once the data is transformed to a log scale we can use t-tests to compare differences in means. This implies that the mean of the logarithm of reading time is an appropriate metric for evaluating experiments. Furthermore, assuming log-normality justifies using ordinary least squares to estimate regression models in multivariate analysis instead of more complex models that require maximum likelihood estimation.

Weibull Distribution: The Weibull model did not fit the data well. This was somewhat disappointing because we had hoped to analyze reading behavior in terms of the inferred parameter that indicates positive or negative aging. While Liu et al. (2010) observed that the Weibull model out-performed the log-normal model on their datasets, we observe the opposite. However, the exponentiated Weibull model generalizes the Weibull, is a good fit for the data, and can help us explain why the Weibull does not fit the data well.

Exponentiated Weibull Distribution: The Exponentiated Weibull has 3 parameters [10]. Two are shape parameters (${\displaystyle \alpha >0}$ and ${\displaystyle \gamma >0}$) and one is a scale parameter (${\displaystyle \lambda >0}$). The major qualitative distinctions in interpreting the model depend on the shape parameters.

• If ${\displaystyle \alpha =1}$ and ${\displaystyle \gamma =1}$ then the model is equivalent to an exponential distribution with parameter ${\displaystyle \lambda }$.
• If ${\displaystyle \alpha =1}$ then the model is equivalent to a Weibull distribution.
• In this case the hazard rate is always increasing (positive aging) if ${\displaystyle \gamma >1}$ and always decreasing (negative aging) if ${\displaystyle \gamma <1}$.
• In addition, if ${\displaystyle \alpha >1}$ then the hazard rate increases when ${\displaystyle 0<x<\lambda }$.
• On the other hand, if ${\displaystyle \alpha <1}$ then the hazard rate decreases when ${\displaystyle x>\lambda }$.
• ${\displaystyle \gamma >1}$ and ${\displaystyle \alpha >1}$ indicates positive aging (the hazard rate is increasing).
• ${\displaystyle \gamma <1}$ and ${\displaystyle \alpha <1}$ indicates negative aging (the hazard rate is decreasing).
• If ${\displaystyle \gamma >1}$ and ${\displaystyle \alpha <1}$, or ${\displaystyle \gamma <1}$ and ${\displaystyle \alpha >1}$, then qualitative interpretation may require closer inspection of the estimated hazard function.
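
In the mixed cases, the hazard function has to be inspected numerically. The sketch below does this with `scipy.stats.exponweib` for illustrative (not fitted) parameter values in the ${\displaystyle \alpha >1}$, ${\displaystyle \gamma <1}$ regime estimated for nearly all wikis; the mapping of ${\displaystyle \alpha }$ and ${\displaystyle \gamma }$ onto scipy's `a` and `c` shape parameters is an assumption of this example.

```python
# Numerically inspecting the exponentiated Weibull hazard rate, since in
# the mixed-shape-parameter case the parameter values alone do not reveal
# whether aging is positive or negative. Parameters are illustrative.
import numpy as np
from scipy.stats import exponweib

alpha, gamma, lam = 3.0, 0.6, 30.0      # illustrative, not fitted values
dist = exponweib(alpha, gamma, scale=lam)

x = np.linspace(0.1, 300, 3000)         # seconds
hazard = dist.pdf(x) / dist.sf(x)       # instantaneous rate of leaving

peak = x[np.argmax(hazard)]
print(f"hazard peaks at ~{peak:.1f}s, then decreases (negative aging)")
```

For these parameter values the hazard rises to an interior peak and then declines, i.e. brief positive aging followed by negative aging, which is the qualitative shape reported for English Wikipedia below.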

We estimated ${\displaystyle \alpha >1}$ and ${\displaystyle \gamma <1}$ for all but one of the 242 Wikipedia projects we analyzed. This limits the utility of exponentiated Weibull models for large-scale analysis of reading on many wikis, because these parameter values do not lead directly to an intuitive qualitative interpretation. However, by plotting the estimated hazard function, we can see over what range of the data the hazard function is increasing or decreasing, accelerating or decelerating.

Figure 2.1. Hazard functions for the parametric models estimated on English Wikipedia. The exponentiated Weibull model (the best fit to the data) indicates that the hazard rate increases in the first seconds of a page view after which we observe negative aging.

We observe that, on English Wikipedia, the log-normal and exponentiated Weibull models both indicate a brief period of positive aging, during which the instantaneous rate of page-leaving increases, followed by negative aging. This helps explain why the Weibull model is not a good fit for the data compared to the log-normal and exponentiated Weibull models: the Weibull distribution cannot model a non-monotonic hazard function. While Liu et al. (2010) found that such a process can describe the distribution of dwell times in data from a web browser plugin, our analysis suggests that the behavior of Wikipedia readers may be more complex. One plausible interpretation is that the hazard rate increases during the first one or two seconds of a page view because readers require this time to decide whether to leave the page or to remain.

#### Distribution fitting plots

To further explore how well these distributions fit the data, we present a series of diagnostic plots that compare the empirical distribution of the data with the model-predicted distributions. For each of the four models under consideration (Lomax, log-normal, exponentiated Weibull, Weibull), we present a density plot, a distribution plot, and a quantile-quantile plot (Q-Q plot). The density plots compare the probability density function of the estimated parametric model to the normalized histogram of the data. Similarly, the distribution plots compare the estimated cumulative distribution function to the empirical distribution. The Q-Q plots show the values of the quantile function for the data on the x-axis and for the estimated model on the y-axis. These plots can help us diagnose the ways that the data diverge from each of the models. We present the x-axis of all of these plots on a logarithmic scale to improve the visibility of the data.
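
The quantities behind a Q-Q diagnostic of this kind can be computed as follows. This is a sketch under assumed inputs: the dwell times are simulated, and we fit the log-normal with its location fixed at zero, which is one reasonable convention for dwell-time data.

```python
# Minimal sketch of a Q-Q diagnostic: empirical quantiles of (simulated)
# dwell times versus quantiles of a fitted log-normal model. Points near
# the diagonal indicate a good fit in that part of the distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
dwell = rng.lognormal(mean=3.2, sigma=1.4, size=5_000)  # simulated data

# Fit a log-normal with the location parameter fixed at 0.
shape, loc, scale = stats.lognorm.fit(dwell, floc=0)

probs = np.linspace(0.01, 0.99, 99)
empirical_q = np.quantile(dwell, probs)
model_q = stats.lognorm.ppf(probs, shape, loc=loc, scale=scale)

# Compare quantiles on a log scale, since dwell times are heavily skewed.
max_log_gap = np.max(np.abs(np.log(model_q) - np.log(empirical_q)))
print(f"max log-quantile discrepancy: {max_log_gap:.3f}")
```

Plotting `empirical_q` against `model_q` (both axes logarithmic) would reproduce the style of Q-Q plot described above.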

We show these plots for data from English Wikipedia. For this wiki, the likelihood-based goodness-of-fit measures indicate that the exponentiated Weibull model is the best fit (BIC = 19321), followed in order by the Lomax (BIC = 19351), the log-normal (BIC = 19373), and the Weibull (BIC = 20111), but the log-normal model is the only model that passes the KS test (${\displaystyle p}$ = 0.089).
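
A BIC-and-KS comparison of this kind can be sketched as follows. The data here are simulated (and actually log-normal), so the reported numbers will not match the English Wikipedia figures above; fixing each distribution's location at zero is an assumption of this example.

```python
# Hedged sketch of the goodness-of-fit comparison: fit each candidate
# distribution by maximum likelihood, then compute BIC and a KS test.
# Simulated data stand in for the page-visibility measurements.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
dwell = rng.lognormal(mean=3.2, sigma=1.4, size=2_000)

candidates = {
    "log-normal": stats.lognorm,
    "Weibull": stats.weibull_min,
    "Lomax": stats.lomax,
    "exp. Weibull": stats.exponweib,
}

results = {}
for name, dist in candidates.items():
    params = dist.fit(dwell, floc=0)         # location fixed at zero
    loglik = np.sum(dist.logpdf(dwell, *params))
    k = len(params) - 1                      # location was not estimated
    bic = k * np.log(len(dwell)) - 2 * loglik
    ks = stats.kstest(dwell, dist.cdf, args=params)
    results[name] = (bic, ks.pvalue)

for name, (bic, pval) in results.items():
    print(f"{name:12s} BIC = {bic:9.1f}  KS p = {pval:.3f}")
```

Note that, as in the analysis above, the KS p-values are computed as if the parameters were known rather than estimated, so they should be read as a rough diagnostic rather than an exact test.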

Figure 1.1. The Lomax model accurately estimates the rate of long reading times, but its monotonic density overestimates the probability of very short reading times and underestimates that of reading times in the range of 1-10 seconds.
Figure 1.2. The log-normal model fits the data well, but overestimates the probability of very short reading times and underestimates the probability of very long reading times.
Figure 1.3. The exponentiated Weibull model fits the data somewhat better than the log-normal model, but still overestimates the occurrence of very short reading times.
Figure 1.4. The Weibull model is not a good fit for the data. On a log scale, the PDF is not only monotonically decreasing, it is concave up everywhere. It greatly overestimates the probability of very short and very long reading times while underestimating the probability of reading times between 10 and 1000 seconds.

## Multivariate Analysis

This section explores the research question "How do Wikipedia readers in less developed countries differ from readers in more developed countries?" Results of a global survey of Wikipedia readers suggest that readers in less developed countries are more likely to engage in deeper information seeking tasks. Assuming that readers spend more time reading when executing such tasks, our data on reading times allow us to test this hypothesis with behavioral data.

H1: Readers in less developed countries are more likely to spend more time reading each page they visit compared to readers in more developed countries.

The assumption that time spent reading correlates with the depth of an information seeking task is clearly questionable. Other factors such as reading fluency, the type of device used to access information, internet connection speed, and whether a reader is on a page that contains the information they wish to consume may all confound the relationship between reading time and the type of an information seeking task. We attempt to build confidence that such factors do not drive the observed relationship in two ways. First, we use multivariate regression to statistically control for observable factors. Second, we examine how the gap between readers in less developed and more developed countries depends on device type and on whether the reader is on the last page in their session, by testing the following hypotheses.

H2: The amount that readers in less developed countries read more than readers in more developed countries will be greater on desktop than on mobile devices.

The intuition for this hypothesis is that users will prefer to engage in deeper information seeking tasks on desktop devices instead of on mobile devices where they are more likely to engage in shallower tasks such as quick lookup of facts [11].

H3: The amount by which readers in less developed countries read more than readers in more developed countries will be greater on the last page view in a session than on other page views.

The intuition for this hypothesis is that deep reading of an article is most likely to take place in the last page view in a session. Therefore, if the gap between readers in less and more developed countries is attributable to the types of information seeking tasks they perform, then the gap should be larger on the last page view in a session than on other page views.

We test these three hypotheses using two regression models that differ only in how they represent economic development. Model 1a uses the Human Development Index (HDI) reported by the United Nations, and model 1b uses the Global North and Global South regional classification.

Model 1a: ${\displaystyle Y=B_{0}+B_{1}HDI+B_{2}Mobile+B_{3}Mobile:HDI+B_{4}RevisionLength+B_{5}DayOfWeek+B_{6}Month+}$
${\displaystyle B_{7}NthInSession+B_{8}LastInSession+B_{9}HDI:LastInSession+B_{10}Mobile:LastInSession+}$
${\displaystyle B_{11}FirstPaint+B_{12}DomInteractiveTime}$
Model 1b: ${\displaystyle Y=B_{0}+B_{1}GlobalNorth+B_{2}Mobile+B_{3}Mobile:GlobalNorth+B_{4}RevisionLength+B_{5}DayOfWeek+B_{6}Month+}$
${\displaystyle B_{7}NthInSession+B_{8}LastInSession+B_{9}GlobalNorth:LastInSession+B_{10}Mobile:LastInSession+}$
${\displaystyle B_{11}FirstPaint+B_{12}DomInteractiveTime}$

We include Day of Week and Month as statistical controls for seasonal and weekly reading patterns. Including NthInSession statistically controls for the number of pages a reader has viewed so far in the session. Revision Length, the size of the wiki page measured in bytes, roughly accounts for the amount of textual content on the page. To statistically control for the time it takes pages to load, we include FirstPaint and DomInteractiveTime. We include Mobile:LastInSession because it improved the model fit during our model selection process.

We consider H1 supported if ${\displaystyle B_{1}<0}$ in both models; H2 if ${\displaystyle B_{3}>0}$; and H3 if ${\displaystyle B_{9}<0}$. Because interaction terms can be difficult to interpret qualitatively, we also present marginal effects plots to assist in the qualitative interpretation of the observed relationships.

We explored alternative model specifications that include higher order terms and additional interaction terms. We chose to present models 1a and 1b because the more complex models neither substantively improve the explained variance or predictive performance nor lead to qualitatively different conclusions. We fit both models using weighted ordinary least squares estimation in R on a stratified sample of 9,873,641 page views.
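
A weighted least squares setup of this shape can be sketched in Python with statsmodels (the analysis itself was run in R). Everything here is illustrative: the data are simulated, the variable names loosely follow model 1b, and the subset of interaction terms and the coefficient values are assumptions.

```python
# Illustrative sketch of a weighted OLS regression with interaction
# terms, in the spirit of model 1b. Simulated data, assumed names.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 5_000
df = pd.DataFrame({
    "global_north": rng.integers(0, 2, n),
    "mobile": rng.integers(0, 2, n),
    "last_in_session": rng.integers(0, 2, n),
    "revision_length": rng.lognormal(9, 1, n),
    "weight": rng.uniform(0.5, 2.0, n),   # stratified-sampling weights
})
# Simulate log reading times with a Global North deficit, as in model 1b.
df["log_visible"] = (
    8.3 - 0.27 * df["global_north"] + 0.17 * np.log(df["revision_length"])
    + rng.normal(0, 1.1, n)
)

model = smf.wls(
    "log_visible ~ global_north * mobile + global_north * last_in_session"
    " + mobile * last_in_session + np.log(revision_length)",
    data=df, weights=df["weight"],
).fit()
print(model.params["global_north"])
```

The `a * b` formula syntax expands to main effects plus the `a:b` interaction, mirroring terms such as ${\displaystyle B_{3}Mobile:GlobalNorth}$ in the models above.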

Future iterations of the analysis may improve upon these models by including fixed effects for wiki and random effects for page, to considerably strengthen the statistical controls and better isolate the relationships of interest. Computational limitations (issues with memory consumption and swapping) placed these steps outside the scope of the present project.

### Non-Parametric Analysis

The multivariate analysis assumes a parametric model, and as we saw in the univariate analysis above, the assumption of log-normality may be invalid. Therefore, we also provide a simple non-parametric analysis based on median reading times. We construct a 2×2×2 table of page views depending on whether the reader is in the Global North or Global South, on a mobile or desktop device, and on the last page view in their session. The medians in each cell of the table confirm that our findings are not driven by the normality assumption alone.
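
The cell-wise medians can be computed with a simple grouped aggregation. This sketch uses simulated data with an assumed Global South premium and assumed column names; it only illustrates the shape of the computation.

```python
# Sketch of the non-parametric check: median visible time within each
# cell of the 2x2x2 breakdown. Simulated data, assumed column names.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 20_000
df = pd.DataFrame({
    "last_in_session": rng.integers(0, 2, n).astype(bool),
    "economic_region": rng.choice(["Global North", "Global South"], n),
    "desktop": rng.integers(0, 2, n).astype(bool),
})
# Simulated visible lengths (ms), with Global South readers spending
# about 30% longer, as an assumed effect for illustration.
base = rng.lognormal(mean=9.9, sigma=1.3, size=n)
df["visible_length_ms"] = base * np.where(
    df["economic_region"] == "Global South", 1.3, 1.0
)

medians = (
    df.groupby(["last_in_session", "economic_region", "desktop"])
      ["visible_length_ms"].median()
)
print(medians)
```

Because the comparison uses medians within cells, it makes no distributional assumption about reading times.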

# Results

## Regression Analysis

### Hypothesis 1: Development and reading times

We find support for H1: readers in more developed countries (${\displaystyle \mathrm {B} =-0.20,SE=0.002}$) or in the Global North (${\displaystyle \mathrm {B} =-0.27,SE=0.002}$) are likely to spend less time on each page than readers in less developed countries or in the Global South. The effect size is substantial, as shown in the marginal effects plots. According to model 1a, a prototypical reader, average in all other respects, in a country with an HDI one standard deviation below the mean can be expected to spend about 25 seconds on a given page, compared to about 18 seconds for an equivalent reader in a country with an HDI one standard deviation above the mean. Similarly, according to model 1b, all else being equal, a reader in a Global South country is expected to spend about 130% as much time reading a page as an equivalent reader in a Global North country: a prototypical Global North reader is expected to spend just over 16 seconds on a page, compared to the 21 seconds spent by a Global South reader.
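
The 130% figure follows directly from the log-scale outcome: a coefficient ${\displaystyle B}$ on a binary predictor multiplies expected reading time by ${\displaystyle e^{B}}$. A back-of-envelope check (assuming, as in our models, that the outcome is log reading time):

```python
# Converting the model 1b Global North coefficient to a time ratio,
# under the assumption that the regression outcome is log reading time.
import math

B = -0.27                                  # Global North coefficient
ratio_north_vs_south = math.exp(B)         # Global North / Global South
ratio_south_vs_north = 1 / ratio_north_vs_south
print(f"Global South readers spend {ratio_south_vs_north:.0%} as much time")
```

This matches the roughly 16-second versus 21-second predictions above.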

Figure 3.1. Marginal effects plot showing how the time spent on pages depends on the development level of the country the reader is in. Visible length is measured in milliseconds.
Figure 3.2. Marginal effects plot showing how the time spent on pages depends on the development level of the country the reader is in.

### Hypothesis 2: Development and reading times on mobile devices

We also find support for our second hypothesis: readers in Global North (${\displaystyle \mathrm {B} =0.15,SE=0.002}$) or higher-HDI (${\displaystyle \mathrm {B} =0.11,SE=0.002}$) countries are likely to spend even less time reading, compared to Global South or lower-HDI readers, when they are on a desktop device rather than a mobile device (visiting *.wikipedia.org rather than *.m.wikipedia.org). Indeed, as shown in the marginal effects plot for model 1b, for the prototypical reader the gap between Global South and Global North is greater on desktop devices (about 5 seconds) than on mobile devices (about 3 seconds). The marginal effects plot for model 1a indicates that, according to the model, prototypical readers in low-HDI countries are likely to spend more time reading on a desktop device than on a mobile device, while the reverse is true for readers in high-HDI countries. Yet readers in low-HDI countries are expected to spend more time reading a given page regardless of device choice. While a prototypical reader in a country one standard deviation below the mean HDI is predicted to read for about 25 seconds on desktop and about 22 seconds on mobile, a reader in a country one standard deviation above the mean is predicted to read for about 19 seconds on mobile and about 17 seconds on desktop.

Figure 4.1. Marginal effects plot showing how the time spent on pages depends on the kind of device the reader is using and the development level of the country they are in.
Figure 4.2. Marginal effects plot showing how the time spent on pages depends on the kind of device the reader is using and the development level of the country they are in.

### Hypothesis 3: Development and last-in-session

As we expect deep reading to be most likely in the last page view of a session, H3 predicted that the difference in reading times between less developed and more developed countries would be amplified in the last page view of a session. However, we do not find support for this hypothesis, which would have been indicated by a negative regression coefficient on the interaction term between development and last-in-session. Instead, we find positive coefficients for HDI:Last in session (${\displaystyle \mathrm {B} =0.06,SE=0.002}$) in model 1a and for Global North:Last in session (${\displaystyle \mathrm {B} =0.08,SE=0.002}$) in model 1b.

Figure 5.1. Marginal effects plot showing how the time spent on pages depends on whether the reader is on the last page view in their session and the development level of the country they are in.
Figure 5.2. Marginal effects plot showing how the time spent on pages depends on whether the reader is on the last page view in their session and the development level of the country they are in.

### Non-Parametric Analysis

The table below shows the median time pages are visible, broken down by the user's economic region, device type, and whether the page is the last viewed in the user's session. Consistent with H1, users in the Global South spend more time on pages than users in the Global North regardless of device or session stage. Consistent with H2, the difference between Global South and Global North users is clearly more pronounced on desktop than on mobile. In contrast to the prediction of H3, but in line with the findings from our parametric analysis, we do not observe an accentuation of the difference between Global South and Global North users in the last page view in a session.

Last in session  Economic region  Desktop  Visible length (ms)
False            Global North     False    20109
False            Global North     True     16185
False            Global South     False    21554
False            Global South     True     21804
True             Global North     False    28178
True             Global North     True     39840
True             Global South     False    28684
True             Global South     True     43630

# Limitations

Two important limitations of this analysis affect our ability to compare reader behavior between mobile phones and PCs. The first is the technical limitation of the browser instrumentation on mobile devices, discussed above, which leads to a large amount of missing data on mobile devices. This missing data likely introduces a negative bias into our measures of reading time on mobile devices, because data is more likely to be missing in cases where the user switches away from the browser and subsequently returns to complete their reading task. This bias may be quite significant, as the issue affects a large proportion of our sample. We are considering improvements to the instrumentation that address this limitation, possibly making use of the Page Lifecycle API recently introduced in Google Chrome.

A second limitation of our ability to compare mobile phones and PCs derives from our intuitions about how reader behavior may differ in the two cases. Mainly, we think that it may be somewhat common for readers to leave a page visible in a web browser at times when they are not directly reading it (the "lunch break problem"). Users may leave multiple visible windows open on PCs while only interacting with one, or may leave a browser window visible and move away from their computer for long periods of time. In general, the best we can hope to observe is that a page is visible in a browser. We cannot, through this instrument alone, know with confidence that an individual is reading. It may be possible to introduce additional browser instrumentation to collect scrolling, mouse position, or mouse click information. However, such steps should be taken with care, as additional data collection may negatively affect the experiences of readers and editors in terms of privacy, browser responsiveness, page load times, and power consumption.

To address this limitation, we fit the regression models on data with dwell times greater than 1 hour removed and found that our results were not substantively affected by the change. Therefore, we do not believe that our findings are driven by user behaviors that generate the appearance of long reading times without corresponding reading.
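
The robustness filter amounts to a simple threshold on dwell time. A sketch with simulated data and an assumed column name:

```python
# Sketch of the robustness check: drop page views whose visible time
# exceeds one hour before refitting. Simulated data, assumed column name.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
df = pd.DataFrame({"visible_length_ms": rng.lognormal(10, 1.5, 100_000)})

one_hour_ms = 60 * 60 * 1000
trimmed = df[df["visible_length_ms"] <= one_hour_ms]
dropped = 1 - len(trimmed) / len(df)
print(f"dropped {dropped:.2%} of page views")
```

Because such extreme dwell times are rare, the filter removes only a small fraction of page views, which is consistent with the results being unaffected.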

An additional limitation arises from the missing data described above. It is possible that we are missing data in ways that may potentially confound our results, especially, but not exclusively, in terms of the comparison between mobile and non-mobile devices.

The analysis presented here is carried out on observational, rather than experimental, data with the intention of describing correlations, rather than demonstrating causal relationships, between our variables.

### Alternative Explanations

Data plans: is the effect driven by people wanting to get the most out of each page view? Wikipedia Zero data could be used as a way to answer this.

## Bonus Results

### Last in session

Marginal effects plot showing how the time spent on pages depends on whether a reader is on their last page view in a session.

### Page length

Marginal effects plot showing how the time spent on pages depends on the page length.

### Device type

Marginal effects plot showing how the time spent on pages depends on the type of device.

# Conclusion

• We have a metric for reading times.
• Summarize findings from each of the above sections.
• Propose some future directions.

# References

1. Lemmerich, Florian; Sáez-Trumper, Diego; West, Robert; Zia, Leila (2018-12-02). "Why the World Reads Wikipedia: Beyond English Speakers". arXiv:1812.00474 [cs].
2. Kiesler, Sara; Sproull, Lee S. (1986-01-01). "Response Effects in the Electronic Survey". Public Opinion Quarterly 50 (3): 402–413. ISSN 0033-362X. doi:10.1086/268992. Retrieved 2018-12-29.
3. Phillips, Derek L.; Clancy, Kevin J. (1972-03-01). "Some Effects of "Social Desirability" in Survey Studies". American Journal of Sociology 77 (5): 921–940. ISSN 0002-9602. doi:10.1086/225231. Retrieved 2018-12-29.
4. Antin, Judd; Shaw, Aaron (2012). "Social desirability bias and self-reports of motivation: a study of Amazon Mechanical Turk in the US and India". Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. CHI '12. New York, NY, USA: ACM. pp. 2925–2934. ISBN 978-1-4503-1015-4. doi:10.1145/2207676.2208699. Retrieved 2014-01-12.
5. Hill, Benjamin Mako; Shaw, Aaron (2013). "The Wikipedia Gender Gap Revisited: Characterizing Survey Response Bias with Propensity Score Estimation". PLoS ONE 8 (6). doi:10.1371/journal.pone.0065782.
6. Clauset, A.; Shalizi, C.; Newman, M. (2009-11-04). "Power-Law Distributions in Empirical Data". SIAM Review 51 (4): 661–703. ISSN 0036-1445. doi:10.1137/070710111. Retrieved 2019-01-01.
7. Liu, Chao; White, Ryen W.; Dumais, Susan (2010). "Understanding Web Browsing Behaviors Through Weibull Analysis of Dwell Time". Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '10 (New York, NY, USA: ACM): 379–386. doi:10.1145/1835449.1835513.
8. Pal, M.; Ali, M.M.; Woo, J. (2006). "Exponentiated Weibull distribution". Statistica 66 (2): 139–147.
9. Mitzenmacher, Michael (2004-01-01). "A Brief History of Generative Models for Power Law and Lognormal Distributions". Internet Mathematics 1 (2): 226–251. ISSN 1542-7951. doi:10.1080/15427951.2004.10129088. Retrieved 2018-10-17.
10. Pal, Manisha; Ali, M. Masoom; Woo, Jungsoo (2006). "Exponentiated Weibull distribution". Statistica 66 (2): 139–147. ISSN 1973-2201. doi:10.6092/issn.1973-2201/493. Retrieved 2018-10-14.
11. Pearce, Katy E.; Rice, Ronald E. (2013-08-01). "Digital Divides From Access to Activities: Comparing Mobile and Personal Computer Internet Users". Journal of Communication 63 (4): 721–744. ISSN 1460-2466. doi:10.1111/jcom.12045. Retrieved 2018-10-20.

# Appendices

## Regression Tables

Statistical models

                                 Model 1a              Model 1b
Intercept                        8.2737 (0.0085)***    8.2868 (0.0085)***
Global North                                           -0.2680 (0.0022)***
Mobile : Global North                                  0.1490 (0.0024)***
Mobile : Last in session         -0.6332 (0.0021)***   -0.6349 (0.0021)***
Global North : Last in session                         0.0830 (0.0024)***
Human development index (HDI)    -0.1961 (0.0018)***
Mobile : HDI                     0.1133 (0.0019)***
HDI : Last in session            0.0632 (0.0019)***
Revision length (bytes)          0.1752 (0.0004)***    0.1758 (0.0004)***
Time to first paint              -0.0164 (0.0006)***   -0.0171 (0.0006)***
Time to DOM interactive          0.0025 (0.0009)**     0.0024 (0.0009)**
Mobile                           -0.0118 (0.0023)***   -0.0142 (0.0023)***
Session length                   -0.0001 (0.0000)***   -0.0001 (0.0000)***
Last in session                  0.8632 (0.0023)***    0.8575 (0.0023)***
Nth in session                   0.0002 (0.0000)***    0.0002 (0.0000)***
Day of week: Mon                 0.0939 (0.0020)***    0.0926 (0.0020)***
Day of week: Sat                 0.0169 (0.0020)***    0.0175 (0.0020)***
Day of week: Sun                 0.0322 (0.0020)***    0.0332 (0.0020)***
Day of week: Thu                 0.0561 (0.0019)***    0.0548 (0.0019)***
Day of week: Tue                 0.0349 (0.0020)***    0.0326 (0.0020)***
Day of week: Wed                 0.0757 (0.0019)***    0.0743 (0.0019)***
Month: Apr                       0.0095 (0.0096)       0.0083 (0.0096)
Month: May                       0.0108 (0.0095)       0.0104 (0.0095)
Month: Jun                       -0.0102 (0.0097)      -0.0103 (0.0097)
Month: Jul                       -0.0494 (0.0097)***   -0.0491 (0.0097)***
Month: Aug                       -0.0119 (0.0097)      -0.0121 (0.0097)
Month: Sep                       0.0382 (0.0076)***    0.0370 (0.0076)***
Month: Oct                       -0.0004 (0.0075)      0.0010 (0.0075)
R²                               0.0721                0.0725