Research:Reading time/Draft Report
What can measuring the amount of time that visitors to Wikipedia spend on each page tell us about patterns of content consumption on Wikipedia? Much research about Wikipedia focuses on how content is produced in terms of editing and collaboration. The open nature of wikis and the related practice of publishing histories of every edit mean that granular and high-quality data on collaboration is readily available. Researchers also study how people consume Wikipedia content by analyzing view counts, click streams, session lengths, eye tracking, scroll positions, the expansion of collapsed sections on the mobile website, and surveys. While researchers of reader behavior are deploying a creative arsenal of approaches, the only readily available large-scale data on reader behavior is the count of page views.
Last year, the Wikimedia Foundation began collecting data on page dwell times through an event logging plugin on all wikis. In this project, we set out to assess the validity of this data, understand how it is distributed, and demonstrate how it may be useful. Results from our preliminary analysis support the idea that information seeking tasks vary between less developed countries and more developed ones. Readers in countries that are less developed or are in the Global South stay on given pages for longer on average. Moreover, this difference is almost entirely located where we would expect users to consume information in depth: in the last view in a session and on the desktop (non-mobile) site. This finding supports the conclusions of a recent analysis of a large-scale survey of Wikipedia readers ^{[1]}.
Background
In 2017, the Wikimedia Foundation's web team introduced new instrumentation to measure the amount of time Wikipedia readers spend on the pages they view. The original goal was to develop a new metric for evaluating how feature releases, such as this year's launch of Page Previews, may change user behavior. However, we realized that this data could also be useful for understanding patterns of reading behavior in general. In preliminary work during 2017, Tilman and Zareen explored and evaluated the new metric. This project continues this work with additional validation and new analyses of general reading patterns.
Contributions
 Evaluate the consistency of the measured times by comparing them to the timing of server-side log events.
 Select parametric model(s) for the distribution of reading times to help choose a metric.
 Use the new data to answer descriptive research questions about how the following variables relate to reading time:
 Page length
 Page load time
 Mobile devices
 Development level of the reader's context
 Whether the page is the last viewed in a session
Methods
Data Collection and Validation
Collecting reading time data
The reading depth plugin works by running JavaScript in the client browser, which sends two messages to the server. The first message (the page loaded event) is sent when the page is loaded. The second message (the page unloaded event) sends values from timers that measure, among other things, the amount of time that the page was visible in the visitor's browser window.
More specifically, the plugin uses the Page Visibility API to measure visible time, the total amount of time that the page was in a visible browser tab. Visible time is the primary measure we use in this report because it excludes time when the user could not possibly have been reading the page. The plugin also records a second measure of reading time: total time. This is simply the total time the page was loaded in the browser, and we use this variable for data validation and in robustness checks. The plugin also measures page load time in two ways: time till first paint, the time from the request until the browser starts to render any part of the page; and DOM interactive time, the time from the request until the user can interact with the page. The current version of the reading depth event logging plugin was first enabled on November 20th 2017. From November 2017 until September 2018 we logged events from a 0.1% sample of visitor sessions, and the sampling rate was increased to 10% on September 25, 2018.
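The distinction between visible time and total time can be illustrated with a minimal sketch. The actual plugin runs as JavaScript in the browser; the Python below only mimics the arithmetic, and the event format (a list of timestamped visibility states) and the function name are assumptions made for illustration, not the plugin's real data structures.

```python
def visible_time_ms(visibility_events, unload_ms):
    """Sum the time a page spent in a visible tab.

    visibility_events: list of (timestamp_ms, is_visible) pairs sorted by
    time, starting with the state at page load (hypothetical format).
    unload_ms: timestamp of the page unloaded event.
    """
    total = 0
    # Pair each state change with the next one; the last interval ends at unload.
    boundaries = visibility_events[1:] + [(unload_ms, False)]
    for (start, is_visible), (end, _) in zip(visibility_events, boundaries):
        if is_visible:
            total += end - start
    return total


# The page loads visible, is hidden from 5 s to 20 s, and is closed at 26 s.
events = [(0, True), (5_000, False), (20_000, True)]
visible = visible_time_ms(events, unload_ms=26_000)
total = 26_000 - events[0][0]
```

In this example total time counts the whole 26 seconds, while visible time leaves out the 15 seconds during which the tab was hidden.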
Missing Data
We are only able to collect data from web browsers that support the APIs on which the instrument depends. This excludes the default Android browser, versions of Chrome earlier than 39, Safari, and all browsers running on versions of iOS less than 11.3. We also do not collect data from browsers that have not enabled JavaScript or that have enabled doNotTrack. See this Phabricator task for additional details.
Even when the above conditions are met, in some cases we are still not able to collect data. Sometimes we observe a page loaded event, indicating that a user in our sample opened a page, but we do not observe a corresponding event indicating that the user has left the page (a page unloaded event). This issue affects 57% of records on the mobile site and about 5% of records on the desktop site. We suspect that so many mobile views are affected because many mobile browsers will refresh the page if the user switches to a program other than the browser. We will not observe a page unloaded event in these cases.
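A missing-data rate of this kind can be computed by matching page loaded events to page unloaded events. The sketch below assumes each event carries a per-page-view token, a site label, and an action name; these field values (`"pageLoaded"`, `"pageUnloaded"`, `"mobile"`, `"desktop"`) are hypothetical stand-ins rather than the schema's confirmed names.

```python
from collections import defaultdict


def missing_unload_rate(events):
    """Share of page loaded events with no matching page unloaded event.

    events: iterable of (page_token, site, action) tuples (hypothetical format).
    Returns a dict mapping each site to its missing-data rate.
    """
    loaded = defaultdict(set)
    unloaded = defaultdict(set)
    for token, site, action in events:
        if action == "pageLoaded":
            loaded[site].add(token)
        elif action == "pageUnloaded":
            unloaded[site].add(token)
    return {
        site: 1 - len(tokens & unloaded[site]) / len(tokens)
        for site, tokens in loaded.items()
    }


example_events = [
    ("a", "mobile", "pageLoaded"),
    ("a", "mobile", "pageUnloaded"),
    ("b", "mobile", "pageLoaded"),   # no unload event observed
    ("c", "desktop", "pageLoaded"),
    ("c", "desktop", "pageUnloaded"),
]
rates = missing_unload_rate(example_events)
```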
Other variables
We use some variables other than the timers obtained from the plugin in our analysis. The event logging system records the date and time the page was viewed as well as the page title of each page a user visits in a session. We obtained page length, measured in bytes at the time the page was viewed, by merging the event logging data with the edit history. To understand how reading behavior on mobile devices differs from behavior on non-mobile (i.e. desktop) devices, we assume that visitors to mobile webhosts (e.g. en.m.wikipedia.org) are using mobile devices and that visitors to non-mobile webhosts (e.g. en.wikipedia.org) are on non-mobile (desktop) devices. We obtain the approximate country in which a reader is located from the MaxMind GeoIP database, which is integrated with the analytics pipeline. We then use the UN's human development data to measure the development level of the country. We also use a second, dichotomous, measure of development in terms of established regional classifications of Global North and Global South. Finally, the event logging pipeline retains a session token with which we measure the number of pages viewed in the session so far (Nth in session) and whether or not a given page view is the last in session.
Taking a sample
Because Wikipedia is so widely read, even a sample of 0.1% of events results in a huge amount of data, well exceeding the requirements of our project-level analysis here. We therefore conduct our analysis on random samples of the dataset. (Data was collected at a higher sampling rate to enable content-level analysis of dwell times, e.g. for specific topics or pages, which was among the possible research topics envisaged for this project.) To ensure that all wikis are fairly represented in our sample, we use stratified sampling. Stratified sampling ensures that all groups are represented fairly in a random sample by assigning a "weight" to each group to adjust the probability that members of the group are chosen in the final sample. This introduces a known "bias" in the resulting sample, so the "weights" are subsequently used (as in a weighted average) to correct the known sampling bias. For estimating total reading time, and for the univariate analysis, we stratify by wiki. For the multivariate analysis we stratify by wiki, by country of the reader, and by whether or not we think that the user is on a mobile device.
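The idea of sampling by stratum and then correcting with weights can be sketched as follows. This is an illustrative simplification, not the sampling code used for this report: it assumes an equal number of rows is drawn from every stratum, and the function names and row format are hypothetical.

```python
import random


def stratified_sample(rows, stratum_of, per_stratum, seed=0):
    """Draw up to per_stratum rows from each stratum.

    Returns (row, weight) pairs where weight is the inverse of the row's
    inclusion probability, so weighted statistics undo the sampling bias.
    """
    rng = random.Random(seed)
    strata = {}
    for row in rows:
        strata.setdefault(stratum_of(row), []).append(row)
    sample = []
    for members in strata.values():
        n = min(per_stratum, len(members))
        weight = len(members) / n
        for row in rng.sample(members, n):
            sample.append((row, weight))
    return sample


def weighted_mean(sample, value_of):
    """Weighted average that corrects for unequal sampling probabilities."""
    total_weight = sum(weight for _, weight in sample)
    return sum(weight * value_of(row) for row, weight in sample) / total_weight


# Toy data: a large wiki with short views and a small wiki with long views.
rows = [("big", 10.0)] * 900 + [("small", 30.0)] * 100
sample = stratified_sample(rows, stratum_of=lambda r: r[0], per_stratum=50)
estimate = weighted_mean(sample, value_of=lambda r: r[1])
```

Without the weights, the small wiki would make up half of the sample and pull the average upward; with the weights, the estimate matches the mean over all rows.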
Estimating Total Reading Time
Univariate Model Selection
Motivation
We want to be able to answer questions like: Did introducing a new feature to the user interface cause a change in the amount of time users spend reading? Are reading times on English Wikipedia greater than on Spanish Wikipedia? What is the relationship between the development level of a reader's country and the amount of time they spend reading on different devices if we account for other factors?
Using a parametric model lets us use statistical tests to answer questions like those above. Parametric models assume the data have a given probability distribution and have interpretable parameters such as mean, variance, and shape parameters. Fitting parametric distributions to data allows us to estimate these parameters and to statistically test changes in the parameters. However, assuming a parametric model can lead to misleading conclusions if the assumed model is not the true model. Therefore we want to evaluate how well different parametric models fit the data in order to justify parametric assumptions. Understanding how the data is distributed can also be interesting in its own right, as distributions can inform understandings of the data generating process.
Candidate Models
Log-normal Distribution: The log-normal distribution is a two-parameter probability distribution. Intuitively, it is the same as a normal distribution, but on a logarithmic scale. This gives it convenient properties because its parameters can be interpreted as the mean and variance of the logged data. This means that one can take the logarithm of the data and then use t-tests for evaluating differences in means or use ordinary least squares to infer regression models. These advantages make the log-normal distribution a common choice in analyzing skewed data, even when it is not a perfect fit for the data.
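As a minimal sketch of this workflow, the functions below fit a log-normal distribution by taking logs and compare two groups with Welch's t statistic on the logged values. The function names are hypothetical, and the choice of Welch's variant of the t-test is an assumption made for illustration; the report does not say which variant was used.

```python
import math
import statistics


def fit_lognormal(times):
    """Maximum-likelihood log-normal fit: mean and std of the logged data."""
    logs = [math.log(t) for t in times]
    return statistics.mean(logs), statistics.pstdev(logs)


def welch_t_on_logs(times_a, times_b):
    """Welch's t statistic for a difference in mean log reading time."""
    logs_a = [math.log(t) for t in times_a]
    logs_b = [math.log(t) for t in times_b]
    standard_error = math.sqrt(
        statistics.variance(logs_a) / len(logs_a)
        + statistics.variance(logs_b) / len(logs_b)
    )
    return (statistics.mean(logs_a) - statistics.mean(logs_b)) / standard_error


# Toy reading times (seconds) whose logs are exactly 1, 2 and 3.
times = [math.exp(1), math.exp(2), math.exp(3)]
mu, sigma = fit_lognormal(times)
```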
Lomax (Pareto Type II) Distribution: Data on human behavior often exhibit power-law distributions, meaning that the probability of extreme events, while still low, is much greater than would be predicted by a normal (or log-normal) distribution. Therefore, power-law distributions are often referred to as "heavy-tailed," "long-tailed," or "fat-tailed." We fit the Lomax distribution, a commonly used heavy-tailed distribution with two parameters.
Weibull Distribution: Liu et al. (2010) model reading times on web pages using a Weibull distribution ^{[2]}. The Weibull distribution has two parameters: λ, a scale parameter, and k, a shape parameter. The Weibull distribution can be a useful model because of the intuitive interpretation of k. If k > 1 then reading behavior exhibits "positive aging," which means that the longer someone stays on a page, the more likely they are to leave the page at any moment. Conversely, k < 1 is interpreted as "negative aging," which means that the longer someone remains on a page, the less likely they are to leave the page at any given moment. The Weibull distribution is often used in the context of reliability engineering because it is convenient for modeling the chances that a given part will fail at a given moment.
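The aging interpretation follows from the Weibull hazard function, the instantaneous rate at which readers leave a page given that they are still on it. A short sketch, using the standard two-parameter form with a scale and a shape parameter (the function name is illustrative):

```python
def weibull_hazard(t, scale, shape):
    """Weibull hazard rate: (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


# Hazard at 1 s and 10 s for three shape values, with scale fixed at 1.
positive_aging = [weibull_hazard(t, 1.0, 2.0) for t in (1.0, 10.0)]   # rises
negative_aging = [weibull_hazard(t, 1.0, 0.5) for t in (1.0, 10.0)]   # falls
no_aging = [weibull_hazard(t, 1.0, 1.0) for t in (1.0, 10.0)]         # constant
```

A shape above 1 gives a hazard that increases with time on the page, a shape below 1 gives one that decreases, and a shape of exactly 1 reduces to the constant hazard of the exponential distribution.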
Exponentiated Weibull Distribution: The Weibull model assumes that the rate of readers leaving a page changes monotonically with respect to time. This means the model cannot describe readers who become more likely to leave the page up to a point, after which they become less likely to leave. In other words, there must be either "negative aging," "positive aging," or "no aging." This excludes processes in which "positive aging" transitions to "negative aging" after a point ^{[3]}. Therefore, if the data show that the probability of a reader leaving a page first increases and then decreases (or vice versa), then the assumptions of the Weibull model are violated. The exponentiated Weibull distribution is a three-parameter generalization of the Weibull distribution that relaxes this constraint. The extra degree of freedom allows this model to fit a greater range of empirical distributions than the two-parameter Weibull model.
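To see how the extra parameter relaxes the monotonicity constraint, the sketch below computes the hazard of the exponentiated Weibull distribution from its standard CDF, the Weibull CDF raised to a power. The parameter names and the particular values used in the example are chosen for illustration only and are not estimates from the reading time data.

```python
import math


def exponweib_hazard(t, scale, shape, power):
    """Hazard of the exponentiated Weibull distribution.

    CDF: (1 - exp(-(t / scale) ** shape)) ** power; power = 1 gives Weibull.
    """
    z = (t / scale) ** shape
    base = 1.0 - math.exp(-z)  # plain Weibull CDF
    cdf = base ** power
    pdf = (
        power
        * (shape / scale)
        * (t / scale) ** (shape - 1)
        * math.exp(-z)
        * base ** (power - 1)
    )
    return pdf / (1.0 - cdf)


# With shape < 1 and shape * power > 1 the hazard rises and then falls.
early = exponweib_hazard(0.01, 1.0, 0.5, 4.0)
middle = exponweib_hazard(1.0, 1.0, 0.5, 4.0)
late = exponweib_hazard(100.0, 1.0, 0.5, 4.0)
```

Here the hazard at the middle time point is higher than at both the early and the late one, a shape that no two-parameter Weibull distribution can produce.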
We also considered the gamma distribution and the Exponential Distribution, but we will not go into depth about them here. We didn't have a strong motivation for these models and they did not fit the data well.
Methods
Our method for model selection is inspired in part by Liu et al. (2010), who compared the log-normal distribution to the Weibull distribution of dwell times on a large sample of web pages. They fit both models to data for each website and then compare two measures of model fit: the log-likelihood, which measures the probability of the data given the model (higher is better), and the Kolmogorov-Smirnov distance (KS distance), which is the maximum difference between the model CDF and the empirical CDF (lower is better). For the sample of web pages they consider, the Weibull model outperformed the log-normal model in a large majority of cases according to both goodness-of-fit measures.
Similar to the approach of Liu et al. (2010), we fit each of the models we consider on reading time data separately for each Wikipedia project. We also use the Kolmogorov-Smirnov distance (KS distance) to evaluate goodness of fit. The KS distance supports a statistical test of the null hypothesis that the model is a good fit for the data. Failing to reject this hypothesis with a large sample of data supports the conclusion that the model is a good fit for the data. This allows us to go beyond Liu et al. (2010) by evaluating whether each distribution is a plausible model instead of just whether one distribution is a better fit than another.
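The KS distance itself is straightforward to compute. The following sketch evaluates it for an arbitrary model CDF and includes a log-normal CDF as one example model; it illustrates the definition only and is not the code used to produce the results in this report. The function names are hypothetical.

```python
import math


def lognormal_cdf(x, mu, sigma):
    """CDF of a log-normal distribution with log-scale mean mu and std sigma."""
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))


def ks_distance(data, model_cdf):
    """Maximum absolute gap between the empirical CDF and the model CDF."""
    xs = sorted(data)
    n = len(xs)
    distance = 0.0
    for i, x in enumerate(xs, start=1):
        f = model_cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        distance = max(distance, i / n - f, f - (i - 1) / n)
    return distance


# Three observations compared against a uniform(0, 1) model.
d_uniform = ks_distance([0.1, 0.4, 0.7], lambda x: x)
# The same kind of comparison against a log-normal model.
d_lognormal = ks_distance([0.5, 1.0, 2.0], lambda x: lognormal_cdf(x, 0.0, 1.0))
```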
Liu et al. (2010) compare two distributions that each have 2 parameters, but the models we consider have different numbers of parameters (the exponentiated Weibull model has 3 parameters and the exponential model has only 1). Adding parameters can increase model fit without increasing out-of-sample predictive performance. To avoid the risk of overfitting and to make a fair comparison between models, we use the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) instead of the log-likelihood. Both criteria attempt to quantify the amount of information lost by the model (lower is better). They are based on the log-likelihood but add a penalty for the number of model parameters. The difference between AIC and BIC is that the BIC penalty grows with the logarithm of the sample size, so BIC penalizes extra parameters more heavily on large samples. For a more detailed example of this procedure see this work log.
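Both criteria are simple functions of the maximized log-likelihood. The sketch below uses their standard definitions; the numbers in the example are invented to show how the penalties differ and are not results from this analysis.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fits on 1000 observations: the 3-parameter model has a
# slightly higher log-likelihood than the 2-parameter model.
two_param = {"aic": aic(-100.0, 2), "bic": bic(-100.0, 2, 1000)}
three_param = {"aic": aic(-98.5, 3), "bic": bic(-98.5, 3, 1000)}
```

In this made-up case the three-parameter model wins on AIC but loses on BIC, because with 1000 observations each extra parameter costs about 6.9 points of BIC but only 2 points of AIC.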
We compute these goodness-of-fit measures for each wiki and rank the models from best to worst. We report the mean and median of these ranks. In addition, we report the mean and median p-values of the KS tests and the proportion of wikis that pass the KS test for each model. We also use diagnostic plots to compare the empirical and modeled distributions of the data in order to explain where models are failing to fit the data. Because the data is so skewed, we log the X axis of these plots.
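The ranking step can be sketched as below. The input format (a score per model per wiki, where lower is better, as with AIC, BIC, or KS distance) and the function name are assumptions for illustration, and ties are not given any special treatment.

```python
import statistics


def summarize_ranks(scores_by_wiki):
    """Rank models within each wiki, then report (mean rank, median rank).

    scores_by_wiki: {wiki: {model: score}} where a lower score is better.
    """
    ranks = {}
    for scores in scores_by_wiki.values():
        ordered = sorted(scores, key=scores.get)  # best model first
        for position, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(position)
    return {
        model: (statistics.mean(r), statistics.median(r))
        for model, r in ranks.items()
    }


# Made-up scores for two wikis and three models.
example_scores = {
    "wikiA": {"lomax": 10.0, "lognorm": 12.0, "weibull": 20.0},
    "wikiB": {"lomax": 11.0, "lognorm": 9.0, "weibull": 30.0},
}
rank_summary = summarize_ranks(example_scores)
```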
Results
The table below shows the results of this procedure. The Lomax, exponentiated Weibull, and log-normal distributions all fit the data reasonably well. All pass the KS test for many wikis, are in a three-way tie for best median rank on the goodness-of-fit statistics, and are close in terms of mean AIC and BIC ranks.
The Lomax distribution is the best fit according to most metrics. With only 2 parameters, it has better mean AIC and BIC ranks than the three-parameter exponentiated Weibull distribution (exponweib) and passes the KS test for 76% of wikis at the 95% confidence level (it is slightly behind exponweib on KS rank, but passes the KS test more often).
The exponentiated Weibull model fits the data better than the log-normal model in terms of passing KS tests and with respect to AIC. However, the log-normal model is better in terms of BIC, which imposes a greater penalty on the exponentiated Weibull's extra parameter.
In contrast to the findings of Liu et al. (2010), the log-normal model is a better fit for the data than the (non-exponentiated) Weibull model. While somewhat worse than the Lomax model, the log-normal model still passes the KS test for about 69% of wikis.
model  AIC rank (mean)  AIC rank (median)  BIC rank (mean)  BIC rank (median)  KS rank (mean)  KS rank (median)  KS p-value (mean)  KS p-value (median)  KS pass 95% (mean)  KS pass 97.5% (mean)
lomax  1.880165  2.0  1.814050  2.0  2.086777  2.0  0.252968  1.498555e-01  0.756198  0.818182
exponweib  2.148760  2.0  2.359504  2.0  1.958678  2.0  0.279059  1.845931e-01  0.739669  0.789256
lognorm  2.173554  2.0  2.057851  2.0  2.318182  2.0  0.255167  1.479791e-01  0.685950  0.756198
weibull  3.954545  4.0  3.942149  4.0  3.917355  4.0  0.072418  3.552921e-05  0.227273  0.260331
gamma  4.958678  5.0  4.971074  5.0  4.818182  5.0  0.041481  3.219647e-15  0.111570  0.123967
exponential  5.884298  6.0  5.855372  6.0  5.900826  6.0  0.018722  0.000000e+00  0.049587  0.053719
The leftmost mode consists of events where the page is open for a very short time. These might be "quick backs" where someone opens a page, immediately realizes they do not want the page open, and then navigates away. The vast majority of the density is around the central mode. I created a new set of goodness-of-fit plots where the X axis is scaled to show how well the models fit the central mode.


The table above reports the mean and median ranks of the KS statistics. When we look at the rate at which each distribution passes the KS test at the 0.05 and 0.025 significance levels, the Lomax distribution does the best.
Discussion
A paper by Michael Mitzenmacher explains the data generating processes behind power-law (Pareto) distributions and compares them to log-normal distributions. My takeaway is that many types of data generating processes can generate either log-normal or Pareto distributions. Rich-get-richer dynamics (preferential attachment) are often associated with power-law distributions. However, according to Mitzenmacher, multiplicative processes can produce either log-normal or power-law distributions, depending on somewhat subtle differences in the process. Both of these seem somewhat counterintuitive as explanations for reading time. Finally, Mitzenmacher also points out that a mixture of log-normal distributions can also yield a power law. Perhaps we have a situation where we have 2 distributions (reading, leaving the page open), both of which are log-normal. If Pareto distributions are a good fit, then we might actually prefer a mixture of two log-normal distributions. Based on my reading of Mitzenmacher's paper, the most likely explanation for the good fit of the Lomax distribution is that the data are generated by a mixture of log-normal distributions.
The table below illustrates our approach to model selection. We fit each of 5 distributions to a sample of views from each wiki and compute goodness of fit criteria. Each of the models is fit on a 75% subsample (training set) and the goodness of fit criteria are computed using the other 25% (test set).
The table shows results on a handful of selected wikis. The lognormal model appears to be a good fit, outperforming the Weibull model (weibull_min). The exponentiated Weibull model outperforms both, but may be difficult to interpret. Next we will fit the models on all the wikis to better evaluate which models are a good fit.
^{[4]}
The model might not be a perfect fit for the data. There are more flexible versions of the Weibull distribution with additional degrees of freedom that might be worth trying.
Multivariate Analysis
Limitations
Two important limitations of this analysis affect our ability to compare reader behavior between mobile phone and PC devices. The first is the technical limitation of the browser instrumentation on mobile devices, discussed above, which leads to a large amount of missing data on mobile devices. This missing data likely introduces a negative bias in our measures of reading time on mobile devices because data is more likely to be missing in cases where the user switches tasks from the browser and then subsequently returns to complete their reading task. This bias may be quite significant as the issue affects a large proportion of our sample. Improvements to the instrument that address this limitation are underway.
A second limitation of our ability to compare mobile phone and PC devices is derived from our intuitions about how reader behavior may differ in the two cases. Mainly, we think that it may be somewhat common for readers to leave a page visible in a web browser at times when they are not directly reading it. Users may leave multiple visible windows on PCs, while only interacting with one, or may leave a browser window visible and move away from their computer for long periods of time. In general, the best we can hope to observe is that a page is visible in a browser. We cannot, through this instrument alone, know with confidence that an individual is reading. It may be possible to introduce additional browser instrumentation for the collection of scrolling, mouse position, or mouse click information. However, such steps should be taken with care as additional data collection may negatively affect the experiences of readers and editors in terms of privacy, browser responsiveness, page load times, and power consumption.
To address these limitations, we fit regression models on data with dwell times greater than 1 hour removed (see Research_talk:Reading_time/Work_log/20181118#Robustness_check:_removing_long_dwell_times) and found that our results were not substantively affected by the change. Therefore, we do not believe that user behaviors that generate the appearance of long reading times without corresponding to reading substantively affect our results.
Results
Differences in reading time by different projects
visiblelength
wiki  max  percentile_95  percentile_75  median  mean  percentile_25  percentile_5  min  count
arwiki  1.525657e+12  358095.30  79087.25  28957.0  3.862190e+08  9719.25  2150.25  219.0  3954
dewiki  2.994148e+08  444023.55  79470.50  24497.0  3.996264e+05  7482.75  1750.10  261.0  4084
enwiki  1.515280e+12  374439.40  64480.00  21509.0  3.579515e+08  6854.00  1649.20  188.0  4237
eswiki  1.526892e+12  577875.50  100811.00  32537.0  3.725822e+08  10733.50  2238.15  125.0  4102
hiwiki  2.021348e+07  359279.70  78303.00  31319.0  1.000993e+05  11674.50  2665.90  7.0  3679
nlwiki  1.825265e+08  472283.75  66615.00  22465.0  2.952587e+05  6989.75  1729.25  267.0  4114
pawiki  1.528875e+12  274091.00  54982.25  20308.0  5.298374e+08  7543.25  2007.75  105.0  2886
Wikipedia pages were open in the browser on a selection of wikis. The plots were computed on random samples of several thousand observations for each wiki and truncated at 300 seconds. Spanish, Hindi, and Arabic appear to have longer reading times while English and Punjabi appear to have somewhat shorter reading times.
Total Reading Time
Discussion
Page Length
Figure (page length): This plot shows how the model predicts reading time will change as pages get longer. The difference between very long and very short pages is associated with up to a 40 second increase in expected reading time for views that are the last in a session and about 15 seconds for views that are not the last in session. For typical pages, a doubling of the length of the page is associated with an increase of about 4 seconds for non-last-in-session views and about 7 seconds for last-in-session views.
Discussion
Page Load Time
Discussion
Last in session
Discussion
Mobile
Discussion
Development and HDI
Figure (development and last in session): This chart shows how reading times predicted by a regression model change as the development level of the country changes. Readers from the Global South read for longer than those in the Global North, especially on the last view in a session. The difference between mobile and desktop is mainly a difference in last-in-session behavior.
Discussion
Conclusion
 We have a metric for reading times.
 Summarize findings from each of the above sections.
 Propose some future directions.
References
 1. Lemmerich, Florian; Sáez-Trumper, Diego; West, Robert; Zia, Leila (2018-12-02). "Why the World Reads Wikipedia: Beyond English Speakers". arXiv:1812.00474 [cs].
 2. Liu, Chao; White, Ryen W.; Dumais, Susan (2010). "Understanding Web Browsing Behaviors Through Weibull Analysis of Dwell Time". Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '10 (New York, NY, USA: ACM): 379–386. doi:10.1145/1835449.1835513.
 3. Pal, M.; Ali, M.M.; Woo, J. (2006). "Exponentiated Weibull distribution". Statistica 66 (2): 139–147.
 4. Yi, Xing; Hong, Liangjie; Zhong, Erheng; Liu, Nanthan Nan; Rajan, Suju (2014). "Beyond Clicks: Dwell Time for Personalization". Proceedings of the 8th ACM Conference on Recommender Systems. RecSys '14 (New York, NY, USA: ACM): 113–120. doi:10.1145/2645710.2645724.
Appendices
Regression Tables
term  model 1  model 2  
Intercept  8.1783 (0.0084)^{***}  8.2388 (0.0084)^{***}  
mobile  0.0962 (0.0015)^{***}  0.0006 (0.0023)  
Human Development Index  0.1007 (0.0009)^{***}  0.1613 (0.0014)^{***}  
mobile : HDI  (not included)  0.1059 (0.0019)^{***}  
Revision length (bytes)  0.1752 (0.0004)^{***}  0.1752 (0.0004)^{***}  
time to first paint  0.0164 (0.0006)^{***}  0.0163 (0.0006)^{***}  
time to dom interactive  0.0023 (0.0009)^{**}  0.0023 (0.0009)^{**}  
sessionlength  0.0001 (0.0000)^{***}  0.0001 (0.0000)^{***}  
lastinsessionTRUE  0.9281 (0.0015)^{***}  0.9232 (0.0015)^{***}  
nthinsession  0.0002 (0.0000)^{***}  0.0002 (0.0000)^{***}  
dayofweekMon  0.0940 (0.0020)^{***}  0.0940 (0.0020)^{***}  
dayofweekSat  0.0189 (0.0020)^{***}  0.0171 (0.0020)^{***}  
dayofweekSun  0.0336 (0.0020)^{***}  0.0324 (0.0020)^{***}  
dayofweekThu  0.0563 (0.0019)^{***}  0.0563 (0.0019)^{***}  
dayofweekTue  0.0350 (0.0020)^{***}  0.0353 (0.0020)^{***}  
dayofweekWed  0.0760 (0.0019)^{***}  0.0758 (0.0019)^{***}  
usermonth4  0.0095 (0.0096)  0.0096 (0.0096)  
usermonth5  0.0113 (0.0095)  0.0111 (0.0095)  
usermonth6  0.0097 (0.0097)  0.0100 (0.0097)  
usermonth7  0.0487 (0.0097)^{***}  0.0494 (0.0097)^{***}  
usermonth8  0.0112 (0.0097)  0.0118 (0.0097)  
usermonth9  0.0377 (0.0076)^{***}  0.0383 (0.0076)^{***}  
usermonth10  0.0002 (0.0075)  0.0000 (0.0075)  
mobileTRUE:lastinsessionTRUE  0.6508 (0.0021)^{***}  0.6442 (0.0021)^{***}  
R^{2}  0.0717  0.0719  
Adj. R^{2}  0.0717  0.0719  
Num. obs.  9873641  9873641  
RMSE  14.2360  14.2338  
^{***}p < 0.001, ^{**}p < 0.01, ^{*}p < 0.05 