What can measuring the amount of time that visitors to Wikipedia spend on each page tell us about patterns of content consumption on Wikipedia? Much research about Wikipedia focuses on how content is produced in terms of editing and collaboration. The open nature of Wikis and the related practice of publishing histories of every edit mean that granular and high quality data on collaboration is readily available. Researchers also study how people consume Wikipedia content by analyzing view counts, click streams, session lengths, eyetracking, scroll positions, collapse expansion on the mobile website, and through surveys. While researchers of reader behavior are deploying a creative arsenal of approaches, the only readily available large scale data on reader behavior is the count of page views.

Last year, the Wikimedia Foundation began collecting data on page dwell times through an event logging plugin on all Wikis. In this project, we set out to assess the validity of this data, understand how it is distributed, and to demonstrate how it may be useful. Results from our preliminary analysis support the idea that information seeking tasks vary between less developed countries and more developed ones. Readers in countries that are less developed or are in the global south stay on given pages for longer on average. Moreover, this difference is almost entirely located where we would expect users to consume information in depth: in the last view in a session and on the desktop (non-mobile) site. This finding supports the conclusions of recent analysis of a large-scale survey of Wikipedia readers [1].

## Background

In 2017, the Wikimedia Foundation’s web team introduced new instrumentation to measure the amount of time Wikipedia readers spend on the pages they view. The original goal was to develop a new metric for evaluating how feature releases, such as this year’s launch of Page Previews, may change user behavior. However, we realized that this data could also be useful for understanding patterns of reading behavior in general. In preliminary work during 2017, Tilman and Zareen explored and evaluated the new metric. This project continues that work with additional validation and new analyses of general reading patterns.

## Contributions

• Evaluate the consistency of the measured times by comparing them to the timing of server side log events.
• Select parametric model(s) for the distribution of reading times to help choose a metric.
• Use the new data to answer descriptive research questions about how the following variables relate to reading time:
  • Page length
  • Mobile devices
  • Development level of the reader's context
  • Whether the page is the last viewed in a session

## Methods

### Data Collection and Validation

The reading depth plugin works by running JavaScript in the client browser, which sends two messages to the server. The first message (the page loaded event) is sent when the page is loaded, and the second message (the page unloaded event) sends values from timers that measure, among other things, the amount of time that the page was visible in the visitor's browser window.

More specifically, the plugin uses the Page Visibility API to measure visible time, the total amount of time that the page was in a visible browser tab. Visible time is the primary measure we use in this report because it excludes time when the user could not possibly have been reading the page. The plugin also records a second measure of reading time: total time. This is simply the total time the page was loaded in the browser, and we use this variable for data validation and in robustness checks. The plugin also measures page load time in two ways: time to first paint, the time from the request until the browser starts to render any part of the page; and DOM interactive time, the time from the request until the user can interact with the page. The current version of the reading depth event logging plugin was first enabled on November 20, 2017. From November 2017 until September 2018 we logged events from a 0.1% sample of visitor sessions, and the sampling rate was increased to 10% on September 25, 2018.
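As a rough illustration of how the two timers differ, the sketch below replays a hypothetical sequence of visibility changes for one page view and accumulates visible time and total time. The event format and the function name `summarize_page_view` are assumptions made for this example; the actual plugin runs in the browser and its implementation may differ.

```python
def summarize_page_view(events):
    """Return (visible_ms, total_ms) for one page view.

    `events` is a hypothetical, time-ordered list of (timestamp_ms, state)
    pairs, where state is "visible", "hidden", or "unload". The first event
    marks the page load and the last event must be "unload".
    """
    if not events or events[-1][1] != "unload":
        raise ValueError("expected a time-ordered list ending with 'unload'")

    visible_ms = 0
    for (start, state), (end, _next_state) in zip(events, events[1:]):
        # Only intervals that begin in the visible state count as visible time.
        if state == "visible":
            visible_ms += end - start

    # Total time runs from the first event (load) to the unload event.
    total_ms = events[-1][0] - events[0][0]
    return visible_ms, total_ms


# A reader opens a page, switches tabs for 40 seconds, then returns and leaves.
example = [(0, "visible"), (12_000, "hidden"), (52_000, "visible"), (60_000, "unload")]
visible_ms, total_ms = summarize_page_view(example)
print(visible_ms, total_ms)
```

In this made-up view the page is loaded for 60 seconds but visible for only 20, which is why visible time is the more conservative measure of reading.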

#### Missing Data

We are only able to collect data from web browsers that support the APIs on which the instrument depends. This excludes the default Android browser, versions of Chrome earlier than 39, Safari, and all browsers running on versions of iOS earlier than 11.3. We also do not collect data from browsers that have JavaScript disabled or that have enabled doNotTrack. See this Phabricator task for additional details.

Even when the above conditions are met, in some cases we are still not able to collect data. Sometimes we observe a page loaded event, indicating that a user in our sample opened a page, but we do not observe a corresponding event indicating that the user has left the page (a page unloaded event). This issue affects 57% of records on the mobile site and about 5% of records on the desktop site. We suspect that so many mobile views are affected because many mobile browsers refresh the page if the user switches to a program other than the browser. We do not observe a page unloaded event in these cases.
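A minimal sketch of how such a missing-data rate could be computed is shown below. It assumes each logged event can be reduced to a `(site, page_token, action)` tuple; these field names and the action labels are hypothetical and are not taken from the event logging schema.

```python
from collections import defaultdict


def missing_unload_share(events):
    """Share of page loaded events with no matching page unloaded event, per site.

    `events` is an iterable of hypothetical (site, page_token, action) tuples,
    where action is "loaded" or "unloaded".
    """
    loaded = defaultdict(set)
    unloaded = defaultdict(set)
    for site, page_token, action in events:
        if action == "loaded":
            loaded[site].add(page_token)
        elif action == "unloaded":
            unloaded[site].add(page_token)

    # A view is missing when its token appears among loads but not unloads.
    return {
        site: len(tokens - unloaded[site]) / len(tokens)
        for site, tokens in loaded.items()
    }


events = [
    ("mobile", "a", "loaded"), ("mobile", "a", "unloaded"),
    ("mobile", "b", "loaded"), ("mobile", "c", "loaded"),
    ("desktop", "d", "loaded"), ("desktop", "d", "unloaded"),
    ("desktop", "e", "loaded"), ("desktop", "e", "unloaded"),
]
shares = missing_unload_share(events)
print(shares)
```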

#### Other variables

In addition to the timers obtained from the plugin, we use several other variables in our analysis. The event logging system records the date and time the page was viewed as well as the title of each page a user visits in a session. We obtained page length, measured in bytes at the time the page was viewed, by merging the event logging data with the edit history. To understand how reading behavior on mobile devices differs from behavior on non-mobile (i.e. desktop) devices, we assume that visitors to mobile webhosts (e.g. en.m.wikipedia.org) are using mobile devices and that visitors to non-mobile webhosts (e.g. en.wikipedia.org) are on non-mobile (desktop) devices. We obtain the approximate country in which a reader is located from the MaxMind GeoIP database, which is integrated with the analytics pipeline. We then use the UN's human development data to measure the development level of the country. We also use a second, dichotomous measure of development based on established regional classifications of Global North and Global South. Finally, the event logging pipeline retains a session token with which we measure the number of pages viewed in the session so far (Nth in session) and whether or not a given page view is the last in session.
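The two session-level variables can be derived by grouping views on the session token and ordering them by time. The sketch below shows one way to do this, assuming each view is a dictionary with hypothetical keys `session`, `timestamp`, and `title`; the real pipeline's field names are not given in this report.

```python
from collections import defaultdict


def annotate_sessions(views):
    """Add 'nth_in_session' and 'last_in_session' to each page view.

    `views` is a list of dicts with hypothetical keys 'session' (the session
    token), 'timestamp', and 'title'. The input is not modified.
    """
    by_session = defaultdict(list)
    for view in views:
        by_session[view["session"]].append(view)

    annotated = []
    for session_views in by_session.values():
        ordered = sorted(session_views, key=lambda v: v["timestamp"])
        for position, view in enumerate(ordered, start=1):
            annotated.append({
                **view,
                "nth_in_session": position,
                "last_in_session": position == len(ordered),
            })
    return annotated


views = [
    {"session": "s1", "timestamp": 30, "title": "B"},
    {"session": "s1", "timestamp": 10, "title": "A"},
    {"session": "s2", "timestamp": 5, "title": "C"},
]
annotated = annotate_sessions(views)
by_title = {view["title"]: view for view in annotated}
print(by_title["B"]["nth_in_session"], by_title["B"]["last_in_session"])
```

Note that a session with a single view is both first and last in session under this definition.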

### Taking a sample

Because Wikipedia is so widely read, even a 0.1% sample of events yields a huge amount of data, well exceeding the requirements of our project-level analysis. We therefore conduct our analysis on random samples of the dataset. (Data was collected at a higher sampling rate to enable content-level analysis of dwell times, e.g. for specific topics or pages, which was among the possible research topics envisaged for this project.) To ensure that all wikis are fairly represented in our sample, we use stratified sampling. Stratified sampling ensures that all groups are represented fairly in a random sample by assigning a "weight" to each group that adjusts the probability that members of the group are chosen in the final sample. This introduces a known "bias" in the resulting sample, so the weights are subsequently used (as in a weighted average) to correct it. For estimating total reading time, and for the univariate analysis, we stratify by wiki. For the multivariate analysis we stratify by wiki, by the country of the reader, and by whether or not we think the user is on a mobile device.
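The following sketch illustrates the idea with a toy population. It assumes one simple scheme, drawing the same number of views from every stratum and weighting each sampled view by the number of population views it stands in for; the exact weighting scheme used in the analysis is not specified here, and the wiki names and values are made up.

```python
import random


def stratified_sample(population, per_stratum, seed=0):
    """Draw up to `per_stratum` values from each stratum.

    `population` maps a stratum name (e.g. a wiki) to a list of values.
    Returns a list of (stratum, value, weight) tuples.
    """
    rng = random.Random(seed)
    sample = []
    for stratum, values in population.items():
        if not values:
            continue  # nothing to sample from an empty stratum
        n = min(per_stratum, len(values))
        weight = len(values) / n  # each sampled view represents this many views
        for value in rng.sample(values, n):
            sample.append((stratum, value, weight))
    return sample


def weighted_mean(sample):
    """Weighted average that undoes the known over-sampling of small strata."""
    total_weight = sum(weight for _, _, weight in sample)
    return sum(value * weight for _, value, weight in sample) / total_weight


# Toy population: a large wiki with short views and a small wiki with long views.
population = {"bigwiki": [10.0] * 9000, "smallwiki": [30.0] * 1000}
sample = stratified_sample(population, per_stratum=100)
unweighted = sum(value for _, value, _ in sample) / len(sample)
print(unweighted, weighted_mean(sample))
```

The unweighted mean of the sample over-represents the small wiki, while the weighted mean recovers the population mean of the toy data.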

### Univariate Model Selection

#### Motivation

We want to be able to answer questions like: Did introducing a new feature to the user interface cause a change in the amount of time users spend reading? Are reading times on English Wikipedia greater than on Spanish Wikipedia? What is the relationship between the development level of a reader's country and the amount of time they spend reading on different devices if we account for other factors?

Using a parametric model lets us use statistical tests to answer questions like those above. Parametric models assume the data have a given probability distribution and have interpretable parameters such as mean, variance, and shape parameters. Fitting parametric distributions to data allows us to estimate these parameters and to statistically test changes in the parameters. However, assuming a parametric model can lead to misleading conclusions if the assumed model is not the true model. Therefore, we want to evaluate how well different parametric models fit the data in order to justify parametric assumptions. Understanding how the data is distributed can also be interesting in its own right, as distributions can inform our understanding of the data generating process.

#### Candidate Models

Log-Normal Distribution: The log-normal distribution is a two-parameter probability distribution. Intuitively, it is the same as a normal distribution, but on a logarithmic scale. This gives it convenient properties because its parameters can be interpreted as the mean and variance of the logged data. This means that one can take the logarithm of the data and then use t-tests for evaluating differences in means or use ordinary least squares to infer regression models. These advantages make the log-normal distribution a common choice in analyzing skewed data, even when it is not a perfect fit for the data.
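As a small illustration of this convenience, the sketch below logs two synthetic samples and computes a t statistic for the difference in their mean log reading times. It uses Welch's unequal-variance form of the t statistic, which is a choice made for this example rather than something the report specifies, and the data are simulated, not real reading times.

```python
import math
import random
import statistics


def welch_t_on_logs(sample_a, sample_b):
    """Welch's t statistic for the difference in mean log reading time."""
    log_a = [math.log(x) for x in sample_a]
    log_b = [math.log(x) for x in sample_b]
    mean_a, mean_b = statistics.fmean(log_a), statistics.fmean(log_b)
    var_a, var_b = statistics.variance(log_a), statistics.variance(log_b)
    standard_error = math.sqrt(var_a / len(log_a) + var_b / len(log_b))
    return (mean_a - mean_b) / standard_error


rng = random.Random(42)
# Synthetic log-normal "reading times" in milliseconds; group_b has a higher log-mean.
group_a = [rng.lognormvariate(10.0, 1.2) for _ in range(2000)]
group_b = [rng.lognormvariate(10.3, 1.2) for _ in range(2000)]
t_stat = welch_t_on_logs(group_a, group_b)
print(round(t_stat, 2))
```

A large negative statistic here indicates that the first group has a lower mean on the log scale, which corresponds to a multiplicative difference on the original time scale.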

Lomax (Pareto Type II) Distribution: Data on human behavior often exhibit power-law distributions, meaning that the probability of extreme events, while still low, is much greater than would be predicted by a normal (or log-normal) distribution. Therefore, power-law distributions are often referred to as "heavy-tailed," "long-tailed," or "fat-tailed." We fit the Lomax distribution, a commonly used heavy-tailed distribution with two parameters.

Weibull Distribution: Liu et al. (2010) model reading times on web pages using a Weibull distribution [2]. The Weibull distribution has two parameters: $\lambda$, a scale parameter, and $k$, a shape parameter. The Weibull distribution can be a useful model because of the intuitive interpretation of $k$. If $k>1$ then reading behavior exhibits "positive aging," which means that the longer someone stays on a page, the more likely they are to leave the page at any moment. Conversely, $k<1$ is interpreted as "negative aging," which means that the longer someone remains on a page, the less likely they are to leave the page at any given moment. The Weibull distribution is often used in the context of reliability engineering because it is convenient for modeling the chances that a given part will fail at a given moment.
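The aging interpretation follows from the hazard rate, the instantaneous rate at which readers leave a page given that they have stayed until time $t$. Under the standard parameterization of the Weibull distribution (assumed here, since the report does not spell it out), the density $f$, survival function $S$, and hazard rate $h$ are:

$$
f(t) = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1} e^{-(t/\lambda)^{k}}, \qquad
S(t) = e^{-(t/\lambda)^{k}}, \qquad
h(t) = \frac{f(t)}{S(t)} = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1}.
$$

Because $h(t)$ is proportional to $t^{k-1}$, it increases in $t$ when $k>1$ (positive aging), decreases when $k<1$ (negative aging), and is constant when $k=1$, in which case the Weibull reduces to the exponential distribution.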

Exponentiated Weibull Distribution: The Weibull model assumes that the rate of readers leaving a page changes monotonically with respect to time. This means that a reader cannot, for example, become more likely to leave a page up to some point and less likely to leave afterwards. In other words, there must be either "negative aging," "positive aging," or "no aging." This excludes processes in which "positive aging" transitions to "negative aging" after a point [3]. Therefore, if the data show that the probability of a reader leaving a page first increases and then decreases (or vice versa), then the assumptions of the Weibull model are violated. The exponentiated Weibull distribution is a three-parameter generalization of the Weibull distribution that relaxes this constraint. The extra degree of freedom allows this model to fit a greater range of empirical distributions than the two-parameter Weibull model.
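To illustrate the kind of non-monotonic leaving rate that the two-parameter Weibull cannot represent, the sketch below evaluates the hazard rate of the exponentiated Weibull distribution. It assumes the common parameterization with CDF $F(t) = (1 - e^{-(t/\lambda)^k})^{\alpha}$; the function name and the parameter values are illustrative assumptions, not estimates from the reading time data.

```python
import math


def exponweib_hazard(t, shape_k, scale, power_a):
    """Hazard rate f(t) / (1 - F(t)) of the exponentiated Weibull, for t > 0.

    Assumed CDF: F(t) = (1 - exp(-(t / scale) ** shape_k)) ** power_a.
    Extremely large t can underflow the survival term to zero.
    """
    z = (t / scale) ** shape_k
    # log(1 - exp(-z)), using whichever form is numerically stable for this z.
    if z < 1.0:
        log_base = math.log(-math.expm1(-z))
    else:
        log_base = math.log1p(-math.exp(-z))
    log_pdf = (
        math.log(power_a * shape_k / scale)
        + (shape_k - 1) * math.log(t / scale)
        - z
        + (power_a - 1) * log_base
    )
    # Survival 1 - F(t), written with expm1 to avoid cancellation near F(t) = 1.
    survival = -math.expm1(power_a * log_base)
    return math.exp(log_pdf) / survival


# Illustrative parameters (k = 0.5, alpha = 4): the hazard rises, then falls.
early = exponweib_hazard(0.01, shape_k=0.5, scale=1.0, power_a=4.0)
middle = exponweib_hazard(4.0, shape_k=0.5, scale=1.0, power_a=4.0)
late = exponweib_hazard(400.0, shape_k=0.5, scale=1.0, power_a=4.0)
print(early < middle > late)
```

With `power_a=1` the function reduces to the ordinary Weibull hazard, which is always monotonic in `t`.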

We also considered the gamma and exponential distributions, but we will not go into depth about them here. We did not have a strong motivation for these models and they did not fit the data well.

#### Methods

Our method for model selection is inspired in part by Liu et al. (2010), who compared the log-normal distribution to the Weibull distribution as models of dwell times on a large sample of web pages. They fit both models to data for each website and then compare two measures of model fit: the log-likelihood, which measures the probability of the data given the model (higher is better), and the Kolmogorov-Smirnov distance (KS-distance), which is the maximum difference between the model CDF and the empirical CDF (lower is better). For the sample of web pages they consider, the Weibull model outperformed the log-normal model in a large majority of cases according to both goodness-of-fit measures.

Similar to the approach of Liu et al. (2010), we fit each of the models we consider to reading time data separately for each Wikipedia project. We also use the KS-distance to evaluate goodness-of-fit. The KS-distance supports a statistical test of the null hypothesis that the data were drawn from the model distribution. Failing to reject this hypothesis with a large sample of data supports the conclusion that the model is a good fit for the data. This allows us to go beyond Liu et al. (2010) by evaluating whether each distribution is a plausible model instead of just whether one distribution is a better fit than another.
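The KS-distance itself is simple to compute. The sketch below is a minimal pure-Python version that compares the empirical CDF of a sample against any model CDF; it computes only the distance, not the p-value of the KS test, and the synthetic data and parameter values are assumptions for illustration.

```python
import math
import random


def ks_distance(data, model_cdf):
    """Maximum absolute gap between the empirical CDF of `data` and `model_cdf`."""
    ordered = sorted(data)
    n = len(ordered)
    distance = 0.0
    for i, x in enumerate(ordered):
        cdf = model_cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        distance = max(distance, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return distance


def lognormal_cdf(x, mu, sigma):
    """CDF of a log-normal distribution with log-mean mu and log-sd sigma."""
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))


rng = random.Random(1)
# Synthetic log-normal "reading times"; not real data.
data = [rng.lognormvariate(10.0, 1.2) for _ in range(5000)]
mean = sum(data) / len(data)

d_lognormal = ks_distance(data, lambda x: lognormal_cdf(x, 10.0, 1.2))
d_exponential = ks_distance(data, lambda x: 1.0 - math.exp(-x / mean))
print(d_lognormal < d_exponential)
```

On this synthetic sample the model that generated the data has a much smaller KS-distance than a mean-matched exponential model, which is the comparison the goodness-of-fit ranking relies on.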

Liu et al. (2010) compare two distributions that each have 2 parameters, but the models we consider have different numbers of parameters (the exponentiated Weibull model has 3 parameters and the exponential model has only 1). Adding parameters can increase model fit without increasing out-of-sample predictive performance. To avoid the risk of over-fitting and to make a fair comparison between models, we use the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) instead of the log-likelihood. Both criteria attempt to quantify the amount of information lost by the model (lower is better): they are based on the log-likelihood but add a penalty for the number of model parameters. The difference between AIC and BIC is that the BIC penalty grows with the logarithm of the sample size, whereas the AIC penalty does not depend on the sample size. For a more detailed example of this procedure see this work log.
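Using the standard definitions $\mathrm{AIC} = 2p - 2\ln L$ and $\mathrm{BIC} = p\ln n - 2\ln L$ (with $p$ parameters, $n$ observations, and maximized likelihood $L$), the comparison can be sketched as follows. The example fits only the exponential and log-normal models, because both have closed-form maximum likelihood estimates, and it uses synthetic data; it is not the fitting code used for the results reported here.

```python
import math
import random


def aic(log_likelihood, n_params):
    """Akaike information criterion (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


def exponential_loglik(data):
    """Maximized log-likelihood of a 1-parameter exponential model."""
    mean = sum(data) / len(data)
    return sum(-math.log(mean) - x / mean for x in data)


def lognormal_loglik(data):
    """Maximized log-likelihood of a 2-parameter log-normal model."""
    logs = [math.log(x) for x in data]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n  # maximum likelihood variance
    return sum(
        -v - 0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        for v in logs
    )


rng = random.Random(7)
data = [rng.lognormvariate(10.0, 1.2) for _ in range(2000)]  # synthetic data
n = len(data)
scores = {
    "exponential": (aic(exponential_loglik(data), 1), bic(exponential_loglik(data), 1, n)),
    "lognorm": (aic(lognormal_loglik(data), 2), bic(lognormal_loglik(data), 2, n)),
}
print(scores)
```

Here the log-normal model wins on both criteria despite its extra parameter, because the improvement in log-likelihood far exceeds the penalty.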

We compute these goodness-of-fit measures for each wiki and rank the models from best to worst. We report the mean and median of these ranks. In addition, we report the mean and median p-values of the KS-tests and the proportion of wikis that pass the KS-test for each model. We also use diagnostic plots to compare the empirical and modeled distributions of the data in order to explain where models are failing to fit the data. Because the data is so skewed, we log the X axis of these plots.
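The ranking step can be sketched as below. The function takes a lower-is-better score (such as AIC, BIC, or KS-distance) for every model on every wiki, ranks the models within each wiki, and summarizes the ranks. The wiki names and scores are invented for illustration, and ties are broken arbitrarily by sort order, which may differ from the procedure used for the reported table.

```python
import statistics


def rank_models(scores_by_wiki):
    """Summarize per-wiki model ranks as (mean rank, median rank).

    `scores_by_wiki` maps wiki -> {model: score}, where lower scores are better.
    """
    ranks = {}
    for scores in scores_by_wiki.values():
        ordered = sorted(scores, key=scores.get)  # best (lowest) score first
        for rank, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(rank)
    return {
        model: (statistics.fmean(model_ranks), statistics.median(model_ranks))
        for model, model_ranks in ranks.items()
    }


# Hypothetical AIC values for three models on three wikis.
aic_by_wiki = {
    "wiki_a": {"lomax": 100.0, "lognorm": 105.0, "weibull": 130.0},
    "wiki_b": {"lomax": 210.0, "lognorm": 205.0, "weibull": 260.0},
    "wiki_c": {"lomax": 50.0, "lognorm": 55.0, "weibull": 54.0},
}
summary = rank_models(aic_by_wiki)
print(summary)
```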

### Results

The table below shows the results of this procedure. The Lomax, exponentiated Weibull, and log-normal distributions all fit the data reasonably well. All three pass the KS-test for many wikis, are in a three-way tie for best median rank on the goodness-of-fit statistics, and are close in terms of mean AIC and BIC rank.

The Lomax distribution is the best fit according to most metrics. With only 2 parameters, it has a better (lower) mean AIC and BIC rank than the three-parameter exponentiated Weibull distribution (exponweib) and passes the KS-test for 76% of wikis at the 95% confidence level (it is slightly behind exponweib on KS rank, but passes the KS-test more often).

The exponentiated Weibull model fits the data better than the log-normal model in terms of passing KS-tests and with respect to AIC. However, the log-normal is better in terms of BIC, which imposes a greater penalty on the exponentiated Weibull's extra parameter.

In contrast to the findings of Liu et al. (2010), the log-normal model is a better fit for the data than the (non-exponentiated) Weibull model. While substantially worse than the Lomax model, the log-normal model still passes the KS-test for about 69% of wikis.

| Model | AIC rank (mean) | AIC rank (median) | BIC rank (mean) | BIC rank (median) | KS rank (mean) | KS rank (median) | KS p-value (mean) | KS p-value (median) | KS pass 95% (mean) | KS pass 97.5% (mean) |
|---|---|---|---|---|---|---|---|---|---|---|
| lomax | 1.880165 | 2.0 | 1.814050 | 2.0 | 2.086777 | 2.0 | 0.252968 | 1.498555e-01 | 0.756198 | 0.818182 |
| exponweib | 2.148760 | 2.0 | 2.359504 | 2.0 | 1.958678 | 2.0 | 0.279059 | 1.845931e-01 | 0.739669 | 0.789256 |
| lognorm | 2.173554 | 2.0 | 2.057851 | 2.0 | 2.318182 | 2.0 | 0.255167 | 1.479791e-01 | 0.685950 | 0.756198 |
| weibull | 3.954545 | 4.0 | 3.942149 | 4.0 | 3.917355 | 4.0 | 0.072418 | 3.552921e-05 | 0.227273 | 0.260331 |
| gamma | 4.958678 | 5.0 | 4.971074 | 5.0 | 4.818182 | 5.0 | 0.041481 | 3.219647e-15 | 0.111570 | 0.123967 |
| exponential | 5.884298 | 6.0 | 5.855372 | 6.0 | 5.900826 | 6.0 | 0.018722 | 0.000000e+00 | 0.049587 | 0.053719 |

Density of English Wikipedia page visible times (logged). The distribution of page visible times appears to have 3 modes. One mode is at very short times, and a second, very small mode is far out on the right side at very long times. The vast majority of the density is in the main, middle mode.

The leftmost mode consists of events where the page is open for a very short time. These might be "quick backs," where someone opens a page, immediately realizes they do not want it open, and then navigates away. Because the vast majority of the density is around the central mode, we created a new set of goodness-of-fit plots where the X-axis is scaled to show how well the models fit the central mode.

Logged goodness-of-fit plots for a Lomax model for English Wikipedia. The Lomax model fits the right side of the data very well. However, unlike the data, its PDF is monotonically decreasing. This is why it overestimates the probability of the head.

Logged goodness-of-fit plots for an exponentiated Weibull model for English Wikipedia. The exponentiated Weibull model is clearly a good fit for the data, but it doesn't fit the left mode very well.

Logged goodness-of-fit plots for a log-normal model for English Wikipedia. The log-normal model is almost as good a fit as the exponentiated Weibull, but it is a bit worse at the left mode.

Logged goodness-of-fit plots for a Weibull model for English Wikipedia. The Weibull model is not a good fit. Its PDF is not only monotonically decreasing, it is concave up everywhere.

The first table looks at the mean and median KS statistics. When we look at the rate at which each distribution passes the KS-test at the 0.05 and 0.025 significance levels (95% and 97.5% confidence), the Lomax model actually does the best.

#### Discussion

Michael Mitzenmacher has written a very good paper explaining the data generating processes behind power-law (Pareto) distributions and comparing them to those behind log-normal distributions. Our takeaway is that many types of data generating processes can produce either log-normal or Pareto distributions. Rich-get-richer dynamics (preferential attachment) are often associated with power-law distributions. However, according to Mitzenmacher, multiplicative processes can produce either log-normal or power-law distributions, depending on somewhat subtle differences in the process. Both seem somewhat counter-intuitive for reading time. Finally, Mitzenmacher also points out that a mixture of log-normal distributions can follow a power law. Perhaps we have a situation with 2 distributions (reading, and leaving the page open), both of which are log-normal. If so, then where Pareto distributions are a good fit we might actually prefer a mixture of two log-normal distributions. Based on our reading of Mitzenmacher's paper, such a mixture is the most likely explanation for the good fit of the Lomax model.

The table below illustrates our approach to model selection. We fit each of 5 distributions to a sample of views from each wiki and compute goodness of fit criteria. Each of the models is fit on a 75% sub-sample (training set) and the goodness of fit criteria are computed using the other 25% (test set).

The table shows results on a handful of selected wikis. The lognormal model appears to be a good fit, outperforming the Weibull model (weibull_min). The exponentiated Weibull model outperforms both, but may be difficult to interpret. Next we will fit the models on all the wikis to better evaluate which models are a good fit.

This plot shows the Weibull distribution fit to data on dwell times on Wikipedia pages in 2018, restricted to views lasting less than 10 minutes. Units on the x axis are milliseconds.

The model might not be a perfect fit for the data. There are more flexible versions of the Weibull distribution with additional degrees of freedom that might be worth trying.

## Limitations

Two important limitations of this analysis affect our ability to compare reader behavior between mobile phones and PCs. The first is the technical limitation of the browser instrumentation on mobile devices, discussed above, which leads to a large amount of missing data on mobile devices. This missing data likely introduces a negative bias in our measures of reading time on mobile devices because data is more likely to be missing in cases where the user switches tasks away from the browser and subsequently returns to complete their reading task. This bias may be quite significant as the issue affects a large proportion of our sample. Improvements to the instrument that address this limitation are underway.

A second limitation of our ability to compare mobile phone and PC devices is derived from our intuitions about how reader behavior may differ in the two cases. Mainly, we think that it may be somewhat common for readers to leave a page visible in a web browser at times when they are not directly reading it. Users may leave multiple visible windows on PCs, while only interacting with one, or may leave a browser window visible and move away from their computer for long periods of time. In general, the best we can hope to observe is that a page is visible in a browser. We cannot, through this instrument alone, know with confidence that an individual is reading. It may be possible to introduce additional browser instrumentation for the collection of scrolling, mouse position, or mouse click information. However, such steps should be taken with care as additional data collection may negatively affect the experiences of readers and editors in terms of privacy, browser responsiveness, page load times, and power consumption.

To address these limitations, we fit regression models on data with dwell times greater than 1 hour removed (see Research_talk:Reading_time/Work_log/2018-11-18#Robustness_check:_removing_long_dwell_times) and found that our results were not substantively affected by the change. Therefore, we do not believe that our results are driven by user behaviors that generate the appearance of long reading times without corresponding to actual reading.

## Results

### Differences in reading time by different projects

Summary statistics of visiblelength by wiki:

| wiki | max | percentile_95 | percentile_75 | median | mean | percentile_25 | percentile_5 | min | count |
|---|---|---|---|---|---|---|---|---|---|
| arwiki | 1.525657e+12 | 358095.30 | 79087.25 | 28957.0 | 3.862190e+08 | 9719.25 | 2150.25 | 219.0 | 3954 |
| dewiki | 2.994148e+08 | 444023.55 | 79470.50 | 24497.0 | 3.996264e+05 | 7482.75 | 1750.10 | 261.0 | 4084 |
| enwiki | 1.515280e+12 | 374439.40 | 64480.00 | 21509.0 | 3.579515e+08 | 6854.00 | 1649.20 | 188.0 | 4237 |
| eswiki | 1.526892e+12 | 577875.50 | 100811.00 | 32537.0 | 3.725822e+08 | 10733.50 | 2238.15 | 125.0 | 4102 |
| hiwiki | 2.021348e+07 | 359279.70 | 78303.00 | 31319.0 | 1.000993e+05 | 11674.50 | 2665.90 | 7.0 | 3679 |
| nlwiki | 1.825265e+08 | 472283.75 | 66615.00 | 22465.0 | 2.952587e+05 | 6989.75 | 1729.25 | 267.0 | 4114 |
| pawiki | 1.528875e+12 | 274091.00 | 54982.25 | 20308.0 | 5.298374e+08 | 7543.25 | 2007.75 | 105.0 | 2886 |

This chart shows box plots of the distribution of time that Wikipedia pages were open in the browser on a selection of wikis. The plots were computed on random samples of several thousand observations for each wiki and truncated at 300 seconds. Spanish, Hindi, and Arabic appear to have longer reading times, while English and Punjabi appear to have somewhat shorter reading times.

### Page Length

Marginal effects of page length. This chart shows how reading times predicted by a regression model change as pages get longer. The difference between very long and very short pages is associated with up to a 40 second increase in expected reading time for views that are the last in a session and about 15 seconds for views that are not the last in session. For typical pages, a doubling of the length of the page is associated with an increase of about 4 seconds for non-last-in-session views and about 7 seconds for last-in-session views.

### Development and HDI

Marginal effects plot for model 3 by globalsouth and lastinsession. This chart shows how reading times predicted by a regression model change as the development level of the country changes. Readers from the Global South read for longer than those in the Global North, especially on the last view in a session. The difference between mobile and desktop is mainly a difference in last-in-session behavior.

## Conclusion

• We have a metric for reading times.
• Summarize findings from each of the above sections.
• Propose some future directions.

# References

1. Lemmerich, Florian; Sáez-Trumper, Diego; West, Robert; Zia, Leila (2018-12-02). "Why the World Reads Wikipedia: Beyond English Speakers". arXiv:1812.00474 [cs].
2. Liu, Chao; White, Ryen W.; Dumais, Susan (2010). "Understanding Web Browsing Behaviors Through Weibull Analysis of Dwell Time". Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '10 (New York, NY, USA: ACM): 379–386. doi:10.1145/1835449.1835513.
3. Pal, M.; Ali, M.M.; Woo, J. (2006). "Exponentiated Weibull distribution". Statistica 66 (2): 139–147.
4. Yi, Xing; Hong, Liangjie; Zhong, Erheng; Liu, Nanthan Nan; Rajan, Suju (2014). "Beyond Clicks: Dwell Time for Personalization". Proceedings of the 8th ACM Conference on Recommender Systems. RecSys '14 (New York, NY, USA: ACM): 113–120. doi:10.1145/2645710.2645724.

# Appendices

## Regression Tables

Statistical models (standard errors in parentheses)

| Term | model 1 | model 2 |
|---|---|---|
| Intercept | 8.1783 (0.0084)*** | 8.2388 (0.0084)*** |
| mobile | 0.0962 (0.0015)*** | 0.0006 (0.0023) |
| Human Development Index | -0.1007 (0.0009)*** | -0.1613 (0.0014)*** |
| mobile : HDI | | 0.1059 (0.0019)*** |
| Revision length (bytes) | 0.1752 (0.0004)*** | 0.1752 (0.0004)*** |
| time to first paint | -0.0164 (0.0006)*** | -0.0163 (0.0006)*** |
| time to dom interactive | 0.0023 (0.0009)** | 0.0023 (0.0009)** |
| sessionlength | -0.0001 (0.0000)*** | -0.0001 (0.0000)*** |
| lastinsessionTRUE | 0.9281 (0.0015)*** | 0.9232 (0.0015)*** |
| nthinsession | 0.0002 (0.0000)*** | 0.0002 (0.0000)*** |
| dayofweekMon | 0.0940 (0.0020)*** | 0.0940 (0.0020)*** |
| dayofweekSat | 0.0189 (0.0020)*** | 0.0171 (0.0020)*** |
| dayofweekSun | 0.0336 (0.0020)*** | 0.0324 (0.0020)*** |
| dayofweekThu | 0.0563 (0.0019)*** | 0.0563 (0.0019)*** |
| dayofweekTue | 0.0350 (0.0020)*** | 0.0353 (0.0020)*** |
| dayofweekWed | 0.0760 (0.0019)*** | 0.0758 (0.0019)*** |
| usermonth4 | 0.0095 (0.0096) | 0.0096 (0.0096) |
| usermonth5 | 0.0113 (0.0095) | 0.0111 (0.0095) |
| usermonth6 | -0.0097 (0.0097) | -0.0100 (0.0097) |
| usermonth7 | -0.0487 (0.0097)*** | -0.0494 (0.0097)*** |
| usermonth8 | -0.0112 (0.0097) | -0.0118 (0.0097) |
| usermonth9 | 0.0377 (0.0076)*** | 0.0383 (0.0076)*** |
| usermonth10 | 0.0002 (0.0075) | 0.0000 (0.0075) |
| mobileTRUE:lastinsessionTRUE | -0.6508 (0.0021)*** | -0.6442 (0.0021)*** |
| R² | 0.0717 | 0.0719 |