Research:Reading time/Draft Report
So far, we have been unable to determine how long readers spend on our pages. Knowing these numbers can tell a lot about the way people read Wikipedia as well as signal ways to improve the features we make available to readers. Patterns of reading time can give us insight into the types of pages readers find interesting, or the types of pages where more time is spent due to complexity or difficulty we can then work to resolve. We can identify audiences that read longer than others and thus learn more about their needs and the ways in which they find Wikipedia helpful in their search for knowledge.
This is the first larger investigation into the reading time metric, and its results have shown us both expected and unexpected patterns in reading behavior. We hope to expand this research in the future by identifying the causes of the trends we observe below, and finally we hope to incorporate these findings into our overall product strategy and planning processes.
Abstract[edit]
How much time do Wikipedia readers spend when they visit an article? How does time spent vary from language edition to language edition, or between different kinds of readers or articles? How can we determine whether a new design change increases time spent by readers?
In 2017, the Wikimedia Foundation began collecting data on page dwell times to answer such questions. In this project, we validate this data and begin to answer questions such as the ones above. We observe the limitations of the data, most notably a high rate (57%) of missing data on mobile devices. Yet as long as one keeps such shortcomings in mind, we believe that the data can be fruitfully applied to improve our current knowledge of how people read Wikipedia. We used regression analyses to explore how factors like page length, device choice, and the locations of readers are related to reading times. We believe that our results for device choice and reader location offer behavioral data to corroborate findings from Why the World Reads Wikipedia, a large-scale survey of Wikipedia readers in 14 language editions.
Here are some highlights from our analysis:
 We estimate that the entire human race spent about 670,000 years reading Wikipedia between November 2017 and October 2018.
 Upon opening an article, the median visitor will remain for only 25 seconds, but the distribution is extremely skewed. Given that a reader remains for at least 25 seconds, the chances are about even that they will spend more than 75 seconds on the page.
 Page length is positively related to reading time. Doubling page length is associated with an increase in reading time by a factor of about 1.2.
 Typical readers on mobile devices spend about 1.7 seconds less per view than readers on desktop devices.
 On average, readers spend over twice as much time on their last view in a session compared to other views.
 Typical readers in the Global South spend more time in an average page view than readers in the Global North.
 The gap between the time spent by Global South readers and Global North readers is amplified on desktop devices. On desktop devices the gap is 5 seconds, but on mobile devices it is only 3 seconds.
 As reading times are highly skewed, the median and geometric mean are better for summarizing changes to reading time than the arithmetic mean.
 Lognormal distributions fit the data well enough to reasonably justify regression modeling and t-tests on log-transformed reading times. However, both power-law distributions such as the Lomax distribution and the three-parameter exponentiated Weibull distribution often fit better than the lognormal distribution.
Introduction[edit]
In 2017, the Wikimedia Foundation’s web team introduced new instrumentation to measure the amount of time Wikipedia readers spend on the pages they view. The original goal was to develop a new metric for evaluating how feature releases, such as this year’s launch of the page previews feature, may change reader behavior. However, we realized that this data could also be useful for understanding patterns of reading behavior in general. In preliminary work during 2017, Readers senior analyst Tilman Bayer and Readers analysis intern Zareen Farooqui explored and evaluated the new metric. This project continues this work with additional validation and new analyses of general reading patterns.
Measuring the amount of time that visitors to Wikipedia spend viewing pages provides previously unavailable information about patterns of content consumption on Wikipedia. While MediaWiki wikis record the history of every edit, so granular, high-quality data on productive collaboration is readily available, comparable sources of data are unavailable when it comes to understanding Wikipedia readership. Researchers interested in how people consume Wikipedia content have approached the question through a creative arsenal of data collection strategies, including view counts, clickstreams, session lengths, eye-tracking, scroll positions, collapse/expansion of sections on the mobile website, and surveys. However, at present, the only publicly available large-scale data source on reader behavior is the count of page views. Measuring reading time provides additional nuance over view data. With reading times in our field of view, it becomes clear that not all views are created equal. Some page views involve deep reading, yet most are quite short.
In this project we first evaluate the quality of the adopted approach for measuring reading times. We do this by comparing them to the timing of server side log events and by looking for patterns of systematically missing or invalid data. While we do find some inconsistencies such as a substantial amount of missing data from mobile devices and a low rate of invalid (missing or negative) measurements, we believe that the data can be generally informative as long as these limitations are considered.
We next consider possible probability models for reading time. One anticipated use of reading time data is in the evaluation of design experiments seeking to improve reader experiences on Wikipedia. "How does this feature change reading behavior?" is a question experimenters are likely to ask. Model selection is important for validating assumptions that underlie the use of a statistic such as an arithmetic or geometric mean as a metric. Model selection itself can sometimes lead to novel insights when theorized data generating processes predict that a given model will be a good fit for the data. We evaluate several different distributions and find that the lognormal distribution fits the data well enough to justify the use of the geometric mean as a metric and of ordinary least squares regression models to explore marginal effects of other variables.
After conducting model selection, we used such regression models to understand how page length, device use, whether the page view is the last in a session, and the reader's economic development context relate to reading time. In this analysis we seek to answer a research question: How do typical information seeking tasks differ between people in less developed countries compared to more developed countries?
A recent large-scale survey of readers of different Wikipedia language editions found that readers in less developed countries were more likely to engage in "deep" information seeking tasks compared to readers in more developed countries ^{[1]}. However, this study is limited by its use of self-reported data, which can be biased by effects of social desirability ^{[2]} ^{[3]} ^{[4]} and by participation biases due to the volunteer nature of web-based surveys (which were shown to have significant demographic effects in the case of a previous Wikipedia reader and editor survey ^{[5]}). In this study, we address this limitation by directly comparing reader behavior in contexts with varying levels of development.
Consistent with the results of the survey, we find that readers in countries that are less developed or are in the Global South stay on given pages for longer on average compared to readers in the Global North or in more developed countries. Moreover, this difference is amplified where we would expect users to consume information in depth: on the desktop (nonmobile) site. While we hypothesized that the difference would also be greater in the last pageview in a session, this idea was not supported. We demonstrate these patterns using nonparametric methods and multivariate parametric analyses.
Methods[edit]
Data collection and validation[edit]
Collecting reading time data[edit]
The reading depth plugin works by running JavaScript in the client browser which sends two messages to the server. The first message (the page loaded event) is sent when the page is loaded and the second message (the page unloaded event) sends values from timers that measure, among other things, the amount of time that the page was visible in the visitor's browser window.
More specifically, the plugin uses the Page Visibility API to measure visible time, the total amount of time that the page was in a visible browser tab. Visible time is the primary measure we use in this report because it excludes time when the user could not possibly have been reading the page. The plugin also records a second measure of reading time: total time, which is simply the entire time the page was loaded in the browser. We use this variable for data validation and in robustness checks. The plugin also measures page load time in two ways: time to first paint, the time from the request until the browser starts to render any part of the page; and DOM interactive time, the time from the request until the user can interact with the page. The current version of the reading depth event logging plugin was first enabled on November 20, 2017. From November 2017 until September 2018 we logged events from a 0.1% sample of visitor sessions; the sampling rate was increased to 10% on September 25, 2018.
Since we care about the reading behavior of humans, we identify bots using user agent strings and exclude them from all of our analyses.
Missing data[edit]
We are only able to collect data from web browsers that support the APIs on which the instrument depends. Also, we excluded certain user agents that were found to send data unreliably in our testing, namely the default Android browser, versions of Chrome earlier than 39, Safari, and all browsers running on versions of iOS less than 11.3. See this Phabricator task for additional details. We also do not collect data from browsers that have not enabled JavaScript or that have enabled Do Not Track.
Even when the above conditions are met, in some cases we are still not able to collect data. Sometimes we observe a page loaded event, indicating that a user in our sample opened a page, but we do not observe a corresponding page unloaded event indicating that the user left the page. This issue affects 57% of records on the mobile site and about 5% of records on the desktop site. The likely explanation for why so many mobile views are affected is that many mobile browsers refresh the page when the user switches to a program other than the browser; in such cases the browser never sends a page unloaded event. We only include events where we observe exactly one page loaded event and one page unloaded event.
We also remove 0.016% of page read events where, for unknown reasons, the instrument recorded a page visible time that was less than 0 or undefined.
Taking a sample[edit]
Because Wikipedia is so widely read, even a sample of 0.1% of events results in a huge amount of data. This exceeds the requirements of this project-level analysis and leads to computational difficulties. We therefore conduct our analysis on random samples of the dataset. (Data was collected at a higher sampling rate to enable content-level analysis of dwell times, e.g. for specific topics or pages, which was among the possible research topics envisioned for this project.) To ensure that all wikis are fairly and adequately represented in our sample, we use stratified sampling, which assigns a "weight" to each group to adjust the probability that members of the group are chosen in the final sample. This introduces a known "bias" into the resulting sample, so the "weights" are subsequently used (as in a weighted average) to correct the known sampling bias. For estimating total reading time, and for the univariate analysis, we stratify by wiki, take up to 20,000 data points for each wiki, and exclude wikis that have fewer than 300 data points.
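As a rough illustration of the stratified sampling and reweighting described above, the following sketch uses a toy event log with hypothetical wiki names and a much smaller per-wiki cap than the report's 20,000:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy event log: one large wiki and one small wiki (hypothetical data).
events = pd.DataFrame({
    "wiki": ["enwiki"] * 10_000 + ["pawiki"] * 500,
    "visible_time": rng.lognormal(mean=np.log(25), sigma=1.4, size=10_500),
})

def stratified_sample(df: pd.DataFrame, cap: int) -> pd.DataFrame:
    """Sample up to `cap` rows per wiki and attach inverse-probability
    weights so that weighted statistics remain unbiased."""
    parts = []
    for _, group in df.groupby("wiki"):
        n = min(cap, len(group))
        sampled = group.sample(n=n, random_state=1)
        # Weight = inverse of each row's selection probability.
        parts.append(sampled.assign(weight=len(group) / n))
    return pd.concat(parts, ignore_index=True)

# The report caps each wiki at 20,000 rows; a tiny cap is used here.
sample = stratified_sample(events, cap=300)

# The weighted mean corrects for the over-representation of the small wiki.
weighted_mean = np.average(sample["visible_time"], weights=sample["weight"])
```

Without the weights, the small wiki would contribute as many rows to the sample as the large one and would be heavily over-represented in any pooled statistic.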
In the Multivariate Analysis we stratify by wiki, by country of the reader, and by whether or not we think the user is on a mobile device, and we sample up to 200 data points for each stratum.
Distribution of reading times[edit]
Before turning to our other questions, we present summary statistics and a high-level description of reading behavior on Wikipedia. When someone opens a given page on Wikipedia, how long do they typically stay on the page? Are reading times highly skewed? How much does reading behavior vary across different language editions of Wikipedia? How much time does all of humanity spend reading Wikipedia?
Wikipedia as a whole[edit]
In general, the distribution of reading times is very skewed. Upon opening a Wikipedia page, a reader will close it or navigate away in less than 25 seconds about half of the time. If they do remain on the page longer than the median dwell time, half of the time again they are likely to navigate away within 75.1 seconds. Because the distribution is so skewed, it makes more sense to discuss medians and other percentiles: in highly skewed distributions the mean lies far from most of the mass of the distribution, quite contrary to what the mean represents in most people's intuitions. Fortunately, once the data has been log-transformed, the density is bell-shaped and the median and the mean are quite close. Therefore, we also find it useful to discuss the geometric mean. We discuss the skewed nature of the data and the benefits of log-transformation at length below.
                      5%     25%    50%    75%    95%
time visible (sec)    1.8    8.0    25.0   75.1   439.1

Table 1.1: Percentiles of reading times over all Wikipedia editions.
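To illustrate why the geometric mean tracks the median while the arithmetic mean does not, consider a simulated lognormal sample whose median matches the observed 25 seconds (the spread parameter is illustrative, not fitted to the actual data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated reading times (seconds): a lognormal distribution whose
# median is 25 s; sigma = 1.5 is purely illustrative.
times = rng.lognormal(mean=np.log(25), sigma=1.5, size=200_000)

median = np.median(times)
arith_mean = times.mean()                  # pulled far up by the long tail
geo_mean = np.exp(np.log(times).mean())    # the mean on the log scale
```

Here the arithmetic mean lands several times higher than the median, while the geometric mean stays close to it, which is why the report prefers medians and geometric means for summarizing reading time.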
Total time spent[edit]
The plot below shows an estimate of the total amount of time people read each month, taken by multiplying the average reading time in each month (measured as visible time) by the number of views per month. Humanity spent about 672,349 years reading Wikipedia from November 2017 through October 2018, excluding readers using the mobile apps, and bots. It is possible that some people leave Wikipedia pages visible in their browsers for extended periods of time without reading. For example, someone might open a page and then walk away from the computer to have lunch. To make our estimates of reading time somewhat conservative, we rounded all page views lasting more than one hour down to one hour in these estimates of reading time.
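The estimate can be reproduced in outline as follows; the per-view times and the monthly view count below are simulated and hypothetical, not the report's actual figures:

```python
import numpy as np

SECONDS_PER_YEAR = 365.25 * 24 * 3600
CAP_SECONDS = 3600  # round views longer than an hour down to one hour

rng = np.random.default_rng(1)

# Hypothetical per-view visible times (seconds) for one month's sample.
sample_times = rng.lognormal(mean=np.log(25), sigma=1.5, size=100_000)

# Conservative mean reading time per view, applying the one-hour cap.
capped_mean = np.minimum(sample_times, CAP_SECONDS).mean()

# Scale by a hypothetical monthly page view count to get reader-years.
monthly_views = 15e9
reader_years = capped_mean * monthly_views / SECONDS_PER_YEAR
```

The cap matters precisely because the distribution is heavy-tailed: a handful of day-long "open in a background tab" views would otherwise dominate the total.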
Variation between different language editions[edit]
We present box plots of the distribution of page visible times on different Wikipedia language editions. As above, we place unscaled data side-by-side with log-transformed data. The box plots show the interquartile range (IQR) as a box; a line inside the box marks the median, and the "whiskers" extend the IQR by a factor of 1.5. The plots with untransformed data are truncated at 300 seconds (5 minutes) because showing the entire range of the data would render the plots illegible. The log-transformed plots show the full range of the data. We hope these plots will be useful to readers who wish to know how reading times compare between their wikis of interest.
We highlight a handful of example language editions that are representative of projects of different sizes and of different cultures. These are Arabic (ar), German (de), English (en), Spanish (es), Hindi (hi), Dutch (nl) and Punjabi (pa).
wiki   5%     25%    50%    75%    95%
ar     5.2    5.2    21.5   69.9   371.7
de     14.1   14.1   14.1   56.6   482.7
en     37.2   37.2   37.2   37.2   262.4
es     23.3   23.3   23.3   65.5   616.4
hi     2.5    11.4   31.4   82.6   360.5
nl     6.1    6.1    15.9   60.1   441.8
pa     2.0    7.2    19.5   55.4   303.1
Box plots of the times that Wikipedia pages were open in the browser on a selection of wikis. The plots were computed on random samples of several thousand observations for each wiki and truncated at 300 seconds. Spanish, Hindi, and Arabic appear to have longer reading times, while English and Punjabi appear to have somewhat shorter reading times.
We observe a great deal of variation in the distribution of reading times between different language editions. We do not investigate these differences any further in this report because we lack knowledge of the specific contexts of each community and their audiences, which would be necessary to adequately explain them. Instead, we present an analysis of the relationship between reading time and the development level of readers' countries to offer a more general explanation of one factor that might make a difference.
Other variables[edit]
We use some variables other than the timers obtained from the plugin in our analysis. The event logging system records the date and time the page was viewed as well as the page title of each page a user visits in a session. We obtained the page length, measured in bytes at the time the page was viewed, by merging the event logging data with the edit history. To understand how reading behavior on mobile devices differs from behavior on non-mobile (i.e. desktop) devices, we assume that visitors to mobile web hosts (e.g. en.m.wikipedia.org) are using mobile devices and that visitors to non-mobile web hosts (e.g. en.wikipedia.org) are on desktop devices.
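A minimal sketch of this hostname-based classification rule (the function name and the exact matching logic are ours, not the report's):

```python
def is_mobile_host(hostname: str) -> bool:
    """Treat visits to mobile web hosts (e.g. en.m.wikipedia.org) as
    mobile, and everything else (e.g. en.wikipedia.org) as desktop."""
    return ".m." in hostname or hostname.startswith("m.")

# Example classifications.
assert is_mobile_host("en.m.wikipedia.org")
assert not is_mobile_host("en.wikipedia.org")
```

Note that this is a proxy: a desktop browser can request the mobile site and vice versa, so "mobile" here really means "mobile site", not "mobile hardware".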
We obtain the approximate country in which a reader is located from the MaxMind GeoIP database, which is integrated with the analytics pipeline. We then use the Human Development Index (HDI) from the UN's human development data to measure the development level of the country. We lack geolocation data before March 3, 2018, which limits our analysis of development and reading times to the period from then until September 28, 2018.
In our model selection process, we observed that partial residual plots of the interaction term between HDI and mobile device use were very skewed. Standardizing the HDI by centering it at 0 and scaling it by the standard deviation (taken at the country level) improved this, and also allows us to interpret results in terms of standard deviations, which are likely more intuitive to readers less familiar with the HDI.
We also use a second, dichotomous, measure of development in terms of established regional classifications of Global North and Global South. Finally, this EventLogging instrumentation retains a session token with which we measure the number of pages viewed in the session so far (Nth in session) and whether or not a given page view is the last in session.
Univariate model selection[edit]
Motivation[edit]
We want to be able to answer questions like: Did introducing a new feature to the user interface cause a change in the amount of time users spend reading? Are reading times on English Wikipedia longer than on Spanish Wikipedia? What is the relationship between the development level of a reader's country and the amount of time they spend reading on different devices if we account for other factors?
Using a parametric model allows us to perform statistical tests to answer questions such as the ones listed above. Parametric models assume the data have a given probability distribution and have interpretable parameters such as mean, variance, and shape parameters. Fitting parametric distributions to data allows us to estimate these parameters and to statistically test changes in the parameters. However, assuming a parametric model can lead to misleading conclusions if the assumed model is not the true model. Therefore we want to evaluate how well different parametric models fit the data in order to justify parametric assumptions. Understanding how the data is distributed can also be interesting in its own right because distributions can inform understandings of the data generating process.
Candidate models[edit]
Lognormal Distribution: The lognormal distribution is a two-parameter probability distribution. Intuitively, it is just a normal distribution on a logarithmic scale. This gives it convenient properties because its parameters are the mean and variance of the log-transformed data. This means that one can take the logarithm of the data and then use t-tests for evaluating differences in means, or use ordinary least squares to infer regression models. These advantages make the lognormal distribution a common choice for analyzing skewed data, even when it is not a perfect fit.
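As a sketch of how such a t-test on log-transformed reading times might look (simulated data with a hypothetical treatment effect, not an actual experiment):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical A/B test: does a feature change shift reading times?
# Both arms are lognormal; the treatment's median is slightly higher.
control = rng.lognormal(mean=np.log(25), sigma=1.4, size=50_000)
treatment = rng.lognormal(mean=np.log(27), sigma=1.4, size=50_000)

# Under a lognormal assumption, a t-test on the log-transformed data
# compares the geometric means of the two conditions.
t_stat, p_value = stats.ttest_ind(np.log(treatment), np.log(control))
```

The geometric-mean ratio, `exp(mean(log(treatment)) - mean(log(control)))`, is then the natural effect-size summary for such an experiment.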
Lomax (Pareto Type II) Distribution: Datasets on human behavior often exhibit power-law distributions, meaning that the probability of extreme events, while still low, is much greater than would be predicted by a normal (or lognormal) distribution. Power-law distributions are a commonly used class of one-sided "heavy-tailed," "long-tailed," or "fat-tailed" probability models ^{[6]}. We fit the Lomax distribution, a commonly used heavy-tailed distribution with two parameters, which assumes that power-law dynamics hold over the whole range of the data.
Weibull Distribution: Liu et al. (2010) model reading times on web pages using a Weibull distribution ^{[7]}. This model has two parameters: λ, a scale parameter, and k, a shape parameter. The Weibull distribution can be a useful model because of the intuitive interpretation of k. If k > 1, then reading behavior exhibits "positive aging," which means that the longer someone stays on a page, the more likely they are to leave the page at any moment. Conversely, k < 1 is interpreted as "negative aging," which means that the longer someone remains on a page, the less likely they are to leave at any given moment. The Weibull distribution is often used in reliability engineering because it is convenient for modeling the chances that a given part will fail at a given moment.
Exponentiated Weibull Distribution: The Weibull model assumes that the rate at which readers leave a page changes monotonically with respect to time: there must be either "negative aging," "positive aging," or "no aging" throughout. This excludes processes in which "positive aging" transitions to "negative aging" after a point ^{[8]}. Therefore, if the data show that the probability of a reader leaving a page first increases and then decreases (or vice versa), the assumptions of the Weibull model are violated. The exponentiated Weibull distribution is a three-parameter generalization of the Weibull distribution that relaxes this constraint. The extra degree of freedom allows this model to fit a greater range of empirical distributions than the two-parameter Weibull model.
We also considered the gamma distribution and the exponential distribution, but we will not go into depth about them here. We did not have a strong motivation for these models, and they did not fit the data well.
We fit the models using SciPy. The exponentiated Weibull, Weibull, Lomax, and gamma models were fit using maximum likelihood estimation; the other models were fit using the method of moments.
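A sketch of this fitting step with SciPy on simulated data (the sample here is drawn from a lognormal, so the lognormal fit should win; `floc=0` pins the location parameter so only the documented shape and scale parameters are estimated):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Simulated reading times; real data would come from the event logs.
times = rng.lognormal(mean=np.log(25), sigma=1.4, size=5_000)

# Fit candidate models by maximum likelihood, with location fixed at 0.
lognorm_params = stats.lognorm.fit(times, floc=0)
lomax_params = stats.lomax.fit(times, floc=0)
exponweib_params = stats.exponweib.fit(times, floc=0)

# Log-likelihood of each fitted model on the same data (higher is better).
ll_lognorm = stats.lognorm.logpdf(times, *lognorm_params).sum()
ll_lomax = stats.lomax.logpdf(times, *lomax_params).sum()
```

For `stats.lognorm`, the fitted shape parameter is the standard deviation of the log-transformed data, so it should recover the simulated sigma of 1.4.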
Methods[edit]
Our method for model selection is inspired in part by Liu et al. (2010), who compared the lognormal distribution to the Weibull distribution for dwell times on a large sample of web pages ^{[7]}. They fit both models to data for each website and then compare two measures of model fit: the log-likelihood, which measures the probability of the data given the model (higher is better), and the Kolmogorov–Smirnov distance (KS distance), which is the maximum difference between the model CDF and the empirical CDF (lower is better). For the sample of web pages they consider, the Weibull model outperformed the lognormal model in a large majority of cases according to both goodness-of-fit measures.
Similar to the approach of Liu et al. (2010), we fit each of the models we consider to reading time data, separately for each Wikipedia project ^{[7]}. We also use the Kolmogorov–Smirnov distance (KS distance) to evaluate goodness of fit. The KS distance supports a statistical test of the null hypothesis that the model is a good fit for the data ^{[6]}. The KS test is quite sensitive to deviations between the model and the data, especially in large samples, so failing to reject this hypothesis with a large sample of data supports the conclusion that the model fits well. This allows us to go beyond Liu et al. (2010) by evaluating whether each distribution is a plausible model, instead of just whether one distribution fits better than another.
Liu et al. (2010) compare two distributions that each have 2 parameters, but the models we consider have different numbers of parameters (the exponentiated Weibull model has 3 parameters and the exponential model has only 1). Adding parameters can increase model fit without increasing out-of-sample predictive performance or explanatory power. To avoid the risk of overfitting and to make a fair comparison between models, we use the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) instead of the raw log-likelihood. Both criteria attempt to quantify the amount of information lost by the model (lower is better): each evaluates the log-likelihood but adds a penalty for the number of model parameters. The difference between AIC and BIC is that BIC's penalty grows with the sample size. For a more detailed example of this procedure see this work log.
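A sketch of how AIC, BIC, and the KS test can be computed for one fitted model with SciPy (simulated data; note that KS p-values are only approximate when the parameters were estimated from the same data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
times = rng.lognormal(mean=np.log(25), sigma=1.4, size=5_000)

# Fit the lognormal model with the location parameter fixed at 0.
params = stats.lognorm.fit(times, floc=0)
log_lik = stats.lognorm.logpdf(times, *params).sum()

k = 2                                # free parameters: shape and scale
n = len(times)
aic = 2 * k - 2 * log_lik            # penalty does not grow with n
bic = k * np.log(n) - 2 * log_lik    # penalty grows with n

# KS test of the data against the fitted distribution.
ks_stat, ks_p = stats.kstest(times, "lognorm", args=params)
```

Since log(5000) > 2, BIC penalizes each parameter more heavily than AIC here, which is why BIC favors the two-parameter lognormal over the three-parameter exponentiated Weibull in the results below.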
We analyze these goodness-of-fit measures for each wiki and rank the models from best to worst. For each distribution, we report the mean and median of these ranks. In addition, we report the mean and median p-values of the KS tests and the proportion of wikis that pass the KS test for each model. We also use diagnostic plots to compare the empirical and modeled distributions in order to explain where models fail to fit the data. Because the data is so skewed, we use a logarithmic X axis in these plots.
The diagnostic plots are shown with data from English Wikipedia. On this wiki, the exponentiated Weibull model is the best fit, followed by the Lomax model and then the lognormal model. Only the exponentiated Weibull model passes the KS test.
Results[edit]
Goodnessoffit metrics[edit]
The table below shows the results of this procedure. The Lomax, exponentiated Weibull, and lognormal models all fit the data reasonably well. All pass the KS test for many wikis and are in a three-way tie for best median rank according to AIC.
model                   AIC rank        BIC rank        KS rank         KS p-value           KS pass rate
                        mean    median  mean    median  mean    median  mean      median     at 95%   at 97.5%
Lomax                   1.78    2.0     1.72    1.5     2.03    2.0     0.272     1.77e-01   0.76     0.82
Lognormal               2.26    2.0     2.17    2.0     2.34    2.0     0.272     1.48e-01   0.69     0.77
Exponentiated Weibull   2.26    2.0     2.42    3.0     2.14    2.0     0.293     2.08e-01   0.73     0.79
Weibull                 3.94    4.0     3.94    4.0     3.81    4.0     0.085     1.19e-04   0.25     0.30
Gamma                   4.93    5.0     4.96    5.0     4.79    5.0     0.045     2.26e-12   0.13     0.16
Exponential             5.83    6.0     5.79    6.0     5.88    6.0     0.015     0.0        0.05     0.07
The Lomax distribution is the best fit across all wikis according to all metrics. With only two parameters, it has a lower AIC and BIC than the three-parameter exponentiated Weibull distribution and passes the KS test 76% of the time at the 95% confidence level.
The exponentiated Weibull model fits the data better than the lognormal model in terms of passing KS tests and with respect to AIC. However, the lognormal is better in terms of BIC, which imposes a greater penalty on the additional parameter of the exponentiated Weibull model.
The Weibull model fits substantially worse than the Lomax, lognormal, and exponentiated Weibull models in terms of all of our goodness-of-fit metrics. In this respect, our results differ from those of Liu et al. (2010), who observed the Weibull model fitting dwell time data better than the lognormal model ^{[7]}; for dwell times on Wikipedia, the lognormal model is the better fit. While substantially worse than the Lomax model, the lognormal model still passes the KS test for about 69% of wikis in the sample.
Discussion[edit]
We found that lomax, exponentiated Weibull, and lognormal models all fit the data within reason. We now discuss how each of these models can be applied to understanding Wikipedia reading behavior.
Lomax (Pareto Type II) Distribution: That the Lomax model fits well indicates that Wikipedia reading time data may follow a power law. Mitzenmacher (2004) describes several possible data generating processes for power-law (Pareto) and lognormal distributions ^{[9]}. Rich-get-richer dynamics such as preferential attachment are commonly associated with power-law distributions. However, according to Mitzenmacher's analysis, a mixture of lognormal distributions can also generate data appearing to follow a power law. Therefore, we cannot conclude from our model-fitting exercise that reading behavior on Wikipedia is driven by rich-get-richer dynamics. Furthermore, it is difficult to conceive of mechanisms for such dynamics. On the other hand, it is intuitive that a mixture of different lognormal processes is involved in reading time, such as an exploration process mixed with a reading process, or even a mixture of behavior patterns associated with discrete types of information consumption. Deeper exploration of potential power-law dynamics in reading behavior is a potential avenue for future research.
Lognormal Distribution: The lognormal model does not fit the data perfectly, but it fits well enough to be useful. It frequently passes KS tests and is preferred to the exponentiated Weibull by the BIC. Even though the Lomax model typically fits the data better, and the lognormal model is likely to underestimate the probability of very long reading times, assuming a lognormal model is very convenient. Once the data is transformed to a log scale, we can use t-tests to compare differences in means. This implies that the mean of the logarithm of reading time is an appropriate metric for evaluating experiments. Furthermore, assuming lognormality justifies using ordinary least squares to estimate regression models in multivariate analysis instead of more complex models that require maximum likelihood estimation.
Weibull Distribution: The Weibull model did not fit the data well. This was somewhat disappointing because we had hoped to analyze reading behavior in terms of the inferred shape parameter k, which indicates positive or negative aging. While Liu et al. (2010) observed that the Weibull model outperformed the lognormal model on their datasets, we observe the opposite. However, the exponentiated Weibull model generalizes the Weibull, is a good fit for the data, and can help us explain why the Weibull does not fit well.
Exponentiated Weibull Distribution: The exponentiated Weibull distribution has three parameters ^{[10]}. Two are shape parameters (k, the Weibull shape, and α, the exponentiation parameter) and one is a scale parameter (λ). The major qualitative distinctions in interpreting the model depend on the shape parameters.
 If α = 1 and k = 1 then the model is equivalent to an exponential distribution with scale parameter λ.
 If α = 1 then the model is equivalent to a Weibull distribution.
  In this case the hazard rate is always increasing (positive aging) if k > 1 and always decreasing (negative aging) if k < 1.
 If k = 1 then the model is equivalent to an exponentiated exponential distribution and the hazard rate may not be monotonic.
  In addition, if α > 1 then the hazard rate increases.
  On the other hand, if α < 1 then the hazard rate decreases.
 k > 1 and αk > 1 indicates positive aging (the hazard rate is increasing).
 k < 1 and αk < 1 indicates negative aging (the hazard rate is decreasing).
 If either k > 1 and αk < 1, or k < 1 and αk > 1, then qualitative interpretation may require closer inspection of estimated hazard functions.
We estimated k < 1 and αk > 1 for all but 1 of the 242 Wikipedia projects we analyzed. This limits the utility of exponentiated Weibull models for large scale analysis of reading on many wikis, because in this regime the locations of the parameters do not lead directly to intuitive qualitative interpretations. However, by plotting the estimated hazard function we can see over what range of the data the hazard function is decreasing or increasing, accelerating or decelerating.
We observe that, on English Wikipedia, the lognormal and exponentiated Weibull models both indicate a brief period of positive aging, during which the instantaneous rate of page-leaving increases, followed by negative aging. This helps explain why the Weibull model is not a good fit for the data compared to the lognormal and exponentiated Weibull models: the Weibull distribution cannot model a non-monotonic hazard function. While Liu et al. (2010) found that such a process can describe the distribution of dwell times in data from a web browser plugin, our analysis suggests that behavior by Wikipedia readers may be more complex. One plausible interpretation is that the hazard rate increases during the first 1 or 2 seconds of a page view because readers require this time to decide whether to leave the page or to remain.
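The rise-then-fall hazard shape can be illustrated directly from the closed-form exponentiated Weibull density, using shape parameters k and α with k < 1 and αk > 1. The parameter values below are hypothetical, chosen for illustration rather than fitted to our data:

```python
import math

def ew_hazard(t, k, alpha, sigma=1.0):
    """Hazard rate f(t)/S(t) of the exponentiated Weibull distribution
    with shape parameters k and alpha and scale sigma."""
    z = (t / sigma) ** k
    w = 1.0 - math.exp(-z)                 # Weibull CDF component
    F = w ** alpha                         # exponentiated Weibull CDF
    f = (alpha * k / sigma) * (t / sigma) ** (k - 1) \
        * math.exp(-z) * w ** (alpha - 1)  # density
    return f / (1.0 - F)

# Hypothetical parameters in the regime k < 1, alpha*k > 1, which
# produces a hazard that first rises and then falls.
k, alpha = 0.6, 3.0
grid = [0.05 * i for i in range(1, 400)]   # t from 0.05 to ~20
h = [ew_hazard(t, k, alpha) for t in grid]
peak = h.index(max(h))
print(grid[peak])  # the hazard peaks at an interior point of the grid
```

Near zero the hazard behaves like αk·t^(αk−1), which rises for αk > 1; in the tail it behaves like the Weibull hazard k·t^(k−1), which falls for k < 1 — reproducing the brief positive aging followed by negative aging described above.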
Distribution fitting plots[edit]
To further explore how well these distributions fit the data, we present a series of diagnostic plots that compare the empirical distribution of the data with the model-predicted distributions. For each of the four models under consideration (Lomax, lognormal, exponentiated Weibull, Weibull), we present a density plot, a distribution plot, and a quantile-quantile plot (QQ plot). The density plots compare the probability density function of the estimated parametric model to the normalized histogram of the data. Similarly, the distribution plots compare the estimated cumulative distribution to the empirical distribution. The QQ plots show the values of the quantile function for the data on the x-axis and for the estimated model on the y-axis. These plots can help us diagnose ways that the data diverge from each of the models. We present the x-axis of all these plots on a logarithmic scale to improve the visibility of the data.
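The quantile pairs behind a QQ plot can be computed without any plotting library: sort the data for empirical quantiles, and invert the fitted model's CDF for the model quantiles. The sketch below does this for a lognormal fit on simulated (not Wikipedia) data:

```python
import math
import random
from statistics import NormalDist

random.seed(7)
# Hypothetical reading times drawn from a lognormal distribution.
times = sorted(random.lognormvariate(3.0, 1.0) for _ in range(5000))

# Fit the lognormal via the mean and standard deviation of log(times).
logs = [math.log(x) for x in times]
mu = sum(logs) / len(logs)
sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / (len(logs) - 1))

# One QQ point per probability level: empirical quantile on the x-axis,
# model quantile exp(mu + sigma * z_p) on the y-axis.
qq = []
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    emp = times[int(p * len(times))]
    model = math.exp(mu + sigma * NormalDist().inv_cdf(p))
    qq.append((emp, model))
print(qq)  # points near the diagonal indicate a good fit
```

When the model fits, the paired quantiles fall close to the diagonal; systematic departures in the tails are exactly what the QQ plots below are designed to expose.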
We show these plots for data from English Wikipedia. For this wiki, the likelihood-based goodness-of-fit measures indicate that the exponentiated Weibull model is the best fit (BIC = 19321), followed in order by the Lomax (BIC = 19351), the lognormal (BIC = 19373) and the Weibull (BIC = 20111), but the lognormal model is the only model that passes the KS test (p = 0.089).
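This kind of comparison can be reproduced in outline with scipy: fit each candidate distribution by maximum likelihood, compute the BIC from the log-likelihood, and run a KS test against the fitted model. The sketch below uses simulated lognormal data, so the numbers are illustrative only and do not correspond to the BIC values reported here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical dwell times (seconds): simulated from a lognormal.
data = rng.lognormal(mean=3.0, sigma=1.0, size=2000)

def bic(dist, data, params):
    """BIC = k*ln(n) - 2*ln(L). The parameter count includes the
    fixed loc term for both models, so the comparison is unaffected."""
    loglik = np.sum(dist.logpdf(data, *params))
    return len(params) * np.log(len(data)) - 2 * loglik

params_ln = stats.lognorm.fit(data, floc=0)
params_wb = stats.weibull_min.fit(data, floc=0)
bic_ln = bic(stats.lognorm, data, params_ln)
bic_wb = bic(stats.weibull_min, data, params_wb)

# KS test of the lognormal fit. Note: p-values are optimistic when the
# parameters were estimated from the same data being tested.
ks = stats.kstest(data, "lognorm", args=params_ln)
print(bic_ln < bic_wb, round(ks.pvalue, 3))
```

Lower BIC indicates a better penalized fit; on lognormal data the lognormal model should beat the Weibull, mirroring the ordering we observe on English Wikipedia.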

Figure 1.4. The Weibull model is not a good fit for the data. On a log scale, the PDF is not only monotonically decreasing, it is concave up everywhere. It greatly overestimates the probability of very short and very long reading times while underestimating the probability of reading times between 10 and 1000 seconds.

Multivariate Analysis[edit]
This section explores the research question "How do Wikipedia readers in less developed countries differ from readers in more developed countries?" Results of a global survey of Wikipedia readers suggest that readers in less developed countries are more likely to engage in deeper information seeking tasks. Assuming that Wikipedia readers spend more time reading when executing such information seeking tasks, our data on reading times allows us to test this hypothesis using behavioral data.
 H1: Readers in less developed countries are more likely to spend more time reading each page they visit compared to readers in more developed countries.
The assumption that time spent reading correlates with the depth of an information seeking task is clearly questionable. Other factors such as reading fluency, the type of device used to access information, internet connection speed, and whether a reader is on a page that contains the information they wish to consume may all confound the relationship between reading time and the type of an information seeking task. We attempt to build confidence that such factors do not drive the observed relationship in two ways. First, we use multivariate regression to statistically control for observable factors. Second, we examine how the gap between low-development and high-development countries depends on device type and on whether the reader is on the last page in their session by testing the following hypotheses.
 H2: The amount by which readers in less developed countries read more than readers in more developed countries will be greater on desktop than on mobile devices.
The intuition for this hypothesis is that users will prefer to engage in deeper information seeking tasks on desktop devices instead of on mobile devices where they are more likely to engage in shallower tasks such as quick lookup of facts ^{[11]}.
 H3: The amount of additional time that readers in less developed countries spend relative to readers in more developed countries will be greater on the last page view in a session than on other page views.
The intuition for this hypothesis is that deep reading of an article is most likely to take place in the last page view in a session. Therefore, if the gap between low and high development context readers is attributable to types of information seeking tasks, then we will observe a larger gap in reading time on the last page view in a session than on other page views.
We test these three hypotheses using two regression models that differ only in how they represent economic development. Model 1a uses the human development index (HDI) reported by the United Nations and model 1b uses the Global North and Global South regional classification. We evaluated alternative specifications of these models.
 Model 1a: log(VisibleTime) = β₀ + β₁·HDI + β₂·Mobile + β₃·LastInSession + β₄·Mobile:HDI + β₅·HDI:LastInSession + β₆·Mobile:LastInSession + β₇·RevisionLength + β₈·FirstPaint + β₉·DomInteractiveTime + β₁₀·SessionLength + β₁₁·NthInSession + DayOfWeek + Month + ε
 Model 1b: log(VisibleTime) = β₀ + β₁·GlobalNorth + β₂·Mobile + β₃·LastInSession + β₄·Mobile:GlobalNorth + β₅·GlobalNorth:LastInSession + β₆·Mobile:LastInSession + β₇·RevisionLength + β₈·FirstPaint + β₉·DomInteractiveTime + β₁₀·SessionLength + β₁₁·NthInSession + DayOfWeek + Month + ε
We include Day Of Week and Month as statistical controls for seasonal and weekly reading patterns. Including NthInSession statistically controls for the number of pages a reader has viewed so far in the session. Revision Length, the size of the wiki page measured in bytes, roughly accounts for the amount of textual content on the page. To statistically control for the time it takes for pages to load we include FirstPaint and DomInteractiveTime. We include Desktop:LastInSession because, during our model selection process, it improved the model fit.
We consider H1 supported if the coefficient on development (HDI or Global North) is negative in both models; H2 supported if the coefficient on the interaction between Mobile and development is positive; and H3 supported if the coefficient on the interaction between development and LastInSession is negative. Because interaction terms can be difficult to interpret qualitatively, we present marginal effect plots to assist in qualitative interpretation of the observed relationships.
We explored alternative model specifications that include higher order terms and additional interaction terms. We chose to present models 1a and 1b because more complex models neither substantively improve the explained variance and predictive performance nor lead to qualitatively different conclusions. We fit both models using weighted ordinary least squares estimation in R on a stratified sample of size 9,873,641.
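Weighted ordinary least squares of this kind can be computed by rescaling each row of the design matrix and outcome by the square root of its weight and then solving an ordinary least squares problem. A minimal numpy sketch with a development-by-device interaction; all variable names, coefficients, and weights are hypothetical, not our actual model:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
# Hypothetical covariates: a standardized development index, a mobile
# indicator, and their interaction, loosely echoing model 1a.
hdi = rng.normal(size=n)
mobile = rng.integers(0, 2, size=n).astype(float)
X = np.column_stack([np.ones(n), hdi, mobile, hdi * mobile])
beta_true = np.array([3.0, -0.2, -0.1, 0.11])   # made-up coefficients
log_time = X @ beta_true + rng.normal(scale=0.5, size=n)

# Weighted OLS: multiply rows by sqrt(w), then solve plain OLS.
w = rng.uniform(0.5, 2.0, size=n)               # e.g. sampling weights
sw = np.sqrt(w)[:, None]
beta_hat, *_ = np.linalg.lstsq(X * sw, log_time[:, None] * sw,
                               rcond=None)
print(beta_hat.ravel())  # close to beta_true
```

The sqrt-weight trick makes the weighted problem solvable with any OLS routine, which is useful when, as here, the sample is stratified and observations carry unequal weights.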
Future iterations of the analysis may improve upon these models by including fixed effects for Wiki and random effects for page to considerably strengthen statistical controls and better isolate the relationships of interest. Computational limitations (issues with memory consumption and with SWAP) moved these steps out of the scope of the present project.
NonParametric Analysis[edit]
The multivariate analysis assumes a parametric model, and as we saw in the univariate analysis above, the assumption of lognormality may be invalid. Therefore, we also provide a simple nonparametric analysis based on median reading times. We construct a 2×2×2 table of users depending on whether they are in the Global North or Global South, whether they are on a mobile or desktop device, and whether they are on the last page view in their session. The medians in each cell of the table validate that our findings are not driven by the lognormality assumption alone.
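Constructing this table amounts to grouping page views by the three binary factors and taking the median within each cell. A small stdlib sketch with made-up page views (all values hypothetical):

```python
from collections import defaultdict
from statistics import median

# Hypothetical page views: (global_south, desktop, last_in_session, secs).
views = [
    (False, False, False, 18), (False, False, False, 22),
    (False, True,  False, 15), (False, True,  True,  40),
    (True,  False, False, 25), (True,  False, False, 27),
    (True,  True,  True,  45), (True,  True,  True,  41),
]

# Group dwell times by the three binary factors, then take cell medians.
cells = defaultdict(list)
for south, desktop, last, secs in views:
    cells[(south, desktop, last)].append(secs)
table = {key: median(vals) for key, vals in cells.items()}
print(table[(False, False, False)])  # median for North/mobile/not-last
```

Because the median ignores the shape of the tails, this check is robust to the misfit of any particular parametric family.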
Results[edit]
Regression Analysis[edit]
Hypothesis 1: Development and reading times[edit]
We find support for H1, which predicted that readers in more developed countries or in the Global North are likely to spend less time on each page than readers in less developed countries or in the Global South. The effect size is substantial, as shown in the marginal effects plots. According to model 1a, a prototypical user, average in all other respects, in a country with an HDI 1 standard deviation below the mean can be expected to spend about 25 seconds on a given page, compared to about 18 seconds spent by an average reader in a country with an HDI 1 standard deviation above the mean. Similarly, according to model 1b, assuming that all else is equal, a user in a Global South country is expected to spend 130% as much time reading a page as an equivalent reader in a Global North country: a prototypical Global North reader is expected to spend just over 16 seconds on a page compared to the 21 seconds spent by a Global South reader.
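Because the models are fit on log reading time, a coefficient b on a binary predictor translates into a multiplicative effect of exp(b) on the raw time scale. The arithmetic below shows how a coefficient of roughly 0.26 would correspond to the 130% figure; the coefficient value is chosen for illustration rather than taken from the fitted model:

```python
import math

# A hypothetical log-scale coefficient on a Global South indicator.
b_global_south = 0.2624
ratio = math.exp(b_global_south)
print(round(ratio, 2))  # multiplicative effect on reading time, ~1.30

# Conversely, a gap of 16 vs 21 seconds implies a log-scale gap of:
print(round(math.log(21 / 16), 3))
```

This exponentiation step is how the log-scale regression coefficients are converted into the seconds-based comparisons quoted throughout this section.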


Hypothesis 2: Development and reading times on mobile devices[edit]
We also find support for our second hypothesis: that readers in Global North or higher HDI countries are likely to spend even less time reading compared to Global South or lower HDI readers when they are on a desktop device than when they are on a mobile device (visiting *.wikipedia.org rather than *.m.wikipedia.org). Indeed, as shown in the marginal effects plot for model 1b, for the prototypical reader, the gap between Global South and Global North is greater on desktop devices (about 5 seconds) than on mobile devices (about 3 seconds). The marginal effects plot for model 1a indicates that, according to the model, prototypical readers in low HDI countries are likely to spend more time reading when they are on a desktop device than on a mobile device, while the reverse is true for readers in high HDI countries. Yet readers in low HDI countries are expected to spend more time reading a given page no matter their device choice. While a prototypical reader in a country 1 standard deviation below the mean for HDI is predicted to read for about 25 seconds on desktop and about 22 seconds on mobile, in a country 1 standard deviation above the mean, she is predicted to read for about 19 seconds on mobile and about 17 seconds on desktop.


Hypothesis 3: Development and lastinsession[edit]
As we expect deep reading to be most likely in the last page view in a session, we predicted H3: the difference in reading times between less developed countries and more developed countries will be amplified in the last page view in a session. However, we do not find support for this hypothesis, which would have been indicated by a negative regression coefficient for the interaction term between development and last-in-session. Instead, we find positive coefficients for HDI:Last in Session in model 1a and for Global North:Last in Session in model 1b.


NonParametric Analysis[edit]
The table below shows the median time pages are visible, broken down by the user's economic region, device type, and whether a page is the last viewed in the user's session. Consistent with H1, users in the Global South spend more time on pages compared to users in the Global North regardless of device or session stage. Consistent with H2, the difference between Global South and Global North users is clearly more pronounced on desktop compared to mobile. In contrast to the prediction of H3, but in line with the findings from our parametric analysis, we do not observe an accentuation of the difference between Global South and Global North users in the last page view in a session.
Last In Session  Economic Region  Desktop  Visible Length (ms)  

False  Global North  False  20109 
False  Global North  True  16185 
False  Global South  False  21554 
False  Global South  True  21804 
True  Global North  False  28178 
True  Global North  True  39840 
True  Global South  False  28684 
True  Global South  True  43630 
Limitations[edit]
Two important limitations of this analysis affect our ability to compare reader behavior between mobile phone and PC devices. The first is the technical limitation of the browser instrumentation on mobile devices, discussed above, which leads to a large amount of missing data on mobile devices. This missing data likely introduces a negative bias to our measures of reading time on mobile devices, because data is more likely to be missing in cases where the user switches tasks away from the browser and then subsequently returns to complete their reading task. This bias may be quite significant as the issue affects a large proportion of our sample. We are considering improvements to the instrumentation that address this limitation, possibly making use of the Page Lifecycle API recently introduced in Google Chrome.
A second limitation of our ability to compare mobile phone and PC devices is derived from our intuitions about how reader behavior may differ in the two cases. Namely, we think that it may be somewhat common for readers to leave a page visible in a web browser at times when they are not directly reading it (the "lunch break problem"). Users may leave multiple visible windows on PCs, while only interacting with one, or may leave a browser window visible and move away from their computer for long periods of time. In general, the best we can hope to observe is that a page is visible in a browser. We cannot, through this instrument alone, know with confidence that an individual is reading. It may be possible to introduce additional browser instrumentation for the collection of scrolling, mouse position, or mouse click information. However, such steps should be taken with care as additional data collection may negatively affect the experiences of readers and editors in terms of privacy, browser responsiveness, page load times, and power consumption.
To address this limitation, we fit regression models on data with dwell times greater than 1 hour removed, and found that our results were not substantively affected by the change. Therefore, we do not believe that our findings are driven by user behaviors that generate the appearance of long reading times without corresponding reading.
An additional limitation arises from the missing data described above. It is possible that we are missing data in ways that may potentially confound our results, especially, but not exclusively, in terms of the comparison between mobile and nonmobile devices.
The analysis presented here is carried out on observational, rather than experimental, data with the intention of describing correlations, rather than demonstrating causal relationships, between our variables.
Discussion: development level and information seeking tasks[edit]
Our analysis of reading time is generally consistent with findings from the survey study, which suggested that readers in Global South countries are more likely to engage in more intensive information seeking tasks. We found that readers in less developed countries have longer reading times than readers in more developed countries and that this relationship is amplified when they choose to read on a non-mobile (desktop) device compared to a mobile device. We also considered whether the relationship would be amplified in the last page view in a browser session, which we expect to be associated with content consumption as opposed to discovery. While we do observe that all readers dwell longer in the last page view in a session, and that readers in developing countries appear to read longer, we do not observe that the gap between readers in less developed and more developed countries is amplified in the last view in the session. These relationships are supported not only by the regression models, but also by a nonparametric analysis comparing medians of the dwell times by device, Global North/Global South, and last-in-session. The observed relationships are quite similar whether development level is measured using the human development index or by dichotomized economic region. All of these results are consistent with the proposition that readers in the Global South are more likely to engage in deep information seeking tasks compared to readers in the Global North.
Alternative Explanations[edit]
That said, there are several plausible alternative explanations that we cannot rule out in the presented analysis. The observed reading time gap between more and less developed countries may be due to factors other than the types of information seeking tasks in which they are engaged. For instance, if readers in less developed countries are more likely to read in languages that are not their primary language, they may spend more time reading independently of their task. A future iteration of this project may partially address this limitation by accounting for whether readers are visiting a Wikipedia edition that is a common primary language in their country. A second alternative explanation is that the gap between readers in more and less developed countries is primarily due to time spent on exploration rather than on content consumption. We proposed H3 as an attempt to rule out this alternative explanation, but we were unsuccessful. More generally, drawing conclusions about information seeking from our analysis rests on relatively strong assumptions about the relationships between task type and reading times. Future work on information seeking behavior on Wikipedia testing these assumptions would help validate such conclusions.
Data plans: is this driven by people wanting to get the most out of their page views? We could use Wikipedia Zero as a way to answer this.
Bonus Results[edit]
Last in session[edit]
Page length[edit]
Device type[edit]
Conclusion[edit]
 We have a metric for reading times.
 Summarize findings from each of the above sections.
 Propose some future directions.
References[edit]
 ↑ Lemmerich, Florian; Sáez-Trumper, Diego; West, Robert; Zia, Leila (2018-12-02). "Why the World Reads Wikipedia: Beyond English Speakers". arXiv:1812.00474 [cs].
 ↑ Kiesler, Sara; Sproull, Lee S. (1986-01-01). "Response Effects in the Electronic Survey". Public Opinion Quarterly 50 (3): 402–413. ISSN 0033-362X. doi:10.1086/268992. Retrieved 2018-12-29.
 ↑ Phillips, Derek L.; Clancy, Kevin J. (1972-03-01). "Some Effects of "Social Desirability" in Survey Studies". American Journal of Sociology 77 (5): 921–940. ISSN 0002-9602. doi:10.1086/225231. Retrieved 2018-12-29.
 ↑ Antin, Judd; Shaw, Aaron (2012). "Social desirability bias and self-reports of motivation: a study of Amazon Mechanical Turk in the US and India". Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. CHI '12. New York, NY, USA: ACM. pp. 2925–2934. ISBN 978-1-4503-1015-4. doi:10.1145/2207676.2208699. Retrieved 2014-01-12.
 ↑ Hill, Benjamin Mako; Shaw, Aaron (2013). "The Wikipedia Gender Gap Revisited: Characterizing Survey Response Bias with Propensity Score Estimation". PLoS ONE 8 (6). doi:10.1371/journal.pone.0065782.
 ↑ ^{a} ^{b} Clauset, A.; Shalizi, C.; Newman, M. (2009-11-04). "Power-Law Distributions in Empirical Data". SIAM Review 51 (4): 661–703. ISSN 0036-1445. doi:10.1137/070710111. Retrieved 2019-01-01.
 ↑ ^{a} ^{b} ^{c} ^{d} Liu, Chao; White, Ryen W.; Dumais, Susan (2010). "Understanding Web Browsing Behaviors Through Weibull Analysis of Dwell Time". Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '10 (New York, NY, USA: ACM): 379–386. doi:10.1145/1835449.1835513.
 ↑ Pal, M.; Ali, M.M.; Woo, J. (2006). "Exponentiated Weibull distribution". Statistica 66 (2): 139–147.
 ↑ Mitzenmacher, Michael (2004-01-01). "A Brief History of Generative Models for Power Law and Lognormal Distributions". Internet Mathematics 1 (2): 226–251. ISSN 1542-7951. doi:10.1080/15427951.2004.10129088. Retrieved 2018-10-17.
 ↑ Pal, Manisha; Ali, M. Masoom; Woo, Jungsoo (2006). "Exponentiated Weibull distribution". Statistica 66 (2): 139–147. ISSN 1973-2201. doi:10.6092/issn.1973-2201/493. Retrieved 2018-10-14.
 ↑ Pearce, Katy E.; Rice, Ronald E. (2013-08-01). "Digital Divides From Access to Activities: Comparing Mobile and Personal Computer Internet Users". Journal of Communication 63 (4): 721–744. ISSN 1460-2466. doi:10.1111/jcom.12045. Retrieved 2018-10-20.
Appendices[edit]
Regression Tables[edit]
Model 1a  Model 1b  

Intercept  8.2737 (0.0085)^{***}  8.2868 (0.0085)^{***}  
Global North  0.2680 (0.0022)^{***}  
mobile : Global North  0.1490 (0.0024)^{***}  
mobile : Last in Session  0.6332 (0.0021)^{***}  0.6349 (0.0021)^{***}  
Global North : Last in Session  0.0830 (0.0024)^{***}  
Human development index  0.1961 (0.0018)^{***}  
mobile : HDI  0.1133 (0.0019)^{***}  
HDI : Last in Session  0.0632 (0.0019)^{***}  
Revision length (bytes)  0.1752 (0.0004)^{***}  0.1758 (0.0004)^{***}  
time to first paint  0.0164 (0.0006)^{***}  0.0171 (0.0006)^{***}  
time to dom interactive  0.0025 (0.0009)^{**}  0.0024 (0.0009)^{**}  
mobile  0.0118 (0.0023)^{***}  0.0142 (0.0023)^{***}  
session length  0.0001 (0.0000)^{***}  0.0001 (0.0000)^{***}  
Last in session  0.8632 (0.0023)^{***}  0.8575 (0.0023)^{***}  
nth in session  0.0002 (0.0000)^{***}  0.0002 (0.0000)^{***}  
day of week: Mon  0.0939 (0.0020)^{***}  0.0926 (0.0020)^{***}  
day of week: Sat  0.0169 (0.0020)^{***}  0.0175 (0.0020)^{***}  
day of week: Sun  0.0322 (0.0020)^{***}  0.0332 (0.0020)^{***}  
day of week: Thu  0.0561 (0.0019)^{***}  0.0548 (0.0019)^{***}  
day of week: Tue  0.0349 (0.0020)^{***}  0.0326 (0.0020)^{***}  
day of week: Wed  0.0757 (0.0019)^{***}  0.0743 (0.0019)^{***}  
month: Apr  0.0095 (0.0096)  0.0083 (0.0096)  
month: May  0.0108 (0.0095)  0.0104 (0.0095)  
month: Jun  0.0102 (0.0097)  0.0103 (0.0097)  
month: Jul  0.0494 (0.0097)^{***}  0.0491 (0.0097)^{***}  
month: Aug  0.0119 (0.0097)  0.0121 (0.0097)  
month: Sep  0.0382 (0.0076)^{***}  0.0370 (0.0076)^{***}  
month: Oct  0.0004 (0.0075)  0.0010 (0.0075)  
R^{2}  0.0721  0.0725  
Adj. R^{2}  0.0720  0.0725  
Num. obs.  9873641  9873641  
RMSE  14.2330  14.2297  
^{***}p < 0.001, ^{**}p < 0.01, ^{*}p < 0.05 