Research talk:Reading time/Work log/2018-10-18

Thursday, October 18, 2018

Wrapping up model selection.

On Tuesday I observed that the Lomax distribution, a two-parameter power-law distribution, fits the page visible times at least as well as any other distribution we have tried so far. Based on my reading of Michael Mitzenmacher's paper on power laws, the most likely explanation for this is that the data are generated by a mixture of log normal distributions.
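For reference, here is the standard form of the Lomax density and survival function. The symbols $\alpha$ (shape) and $\lambda$ (scale) are notation assumed here, not taken from the fitting code. The derivative shows that the density is strictly decreasing on $x \ge 0$, so a single Lomax can never have an interior mode.

```latex
f(x;\alpha,\lambda) = \frac{\alpha}{\lambda}\left(1 + \frac{x}{\lambda}\right)^{-(\alpha+1)},
\qquad
S(x) = 1 - F(x) = \left(1 + \frac{x}{\lambda}\right)^{-\alpha},
\qquad x \ge 0,\ \alpha > 0,\ \lambda > 0

\frac{d}{dx} f(x;\alpha,\lambda)
  = -\frac{\alpha(\alpha+1)}{\lambda^{2}}\left(1 + \frac{x}{\lambda}\right)^{-(\alpha+2)} < 0
```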

This is consistent with our intuition that there may be (at least) two different processes generating these measurements. In the first process, someone reads an article in their browser and then switches away from or closes the tab when they are done reading. In the second process, they stop reading but leave the page open, and the tab remains visible for some time until they close it. Looking at the data, it appears that we do indeed have a mixture of log normals. However, instead of a distribution with 2 modes, it looks like we have 3 modes.

Density of English Wikipedia Page Visible Times (Logged). It looks like the distribution of page visible times has 3 modes. One mode is for very short times, and a second (teeny-tiny) mode is way out on the right side with very long times. The vast majority of the density is in the main, middle mode.

The leftmost mode consists of events where the page is open for a very short time. These might be "quick backs" where someone opens a page, immediately realizes they do not want the page open, and then navigates away. The vast majority of the density is around the central mode. I created a new set of goodness-of-fit plots where the X-axis is scaled to show how well the models fit the central mode.

Logged goodness of fit plots for a Lomax model for English Wikipedia. The Lomax model fits the right side of the data very well. However, unlike the data, its PDF is monotonically decreasing. This is why it overestimates the probability of the head.

Logged goodness of fit plots for an Exponentiated Weibull model for English Wikipedia. The Exponentiated Weibull model is clearly a good fit for the data, but it doesn't fit the left mode very well.

Logged goodness of fit plots for a Log Normal model for English Wikipedia. The log normal model is almost as good a fit as Exponentiated Weibull, but it is a bit worse at the left mode.

Logged goodness of fit plots for a Weibull model for English Wikipedia. The Weibull model is not great. The PDF is not only monotonically decreasing, it is concave up everywhere.

Conclusions

Comparing reading times: Given the results of the model selection process, I recommend using t-tests on logged data to detect differences in means. This is the primary metric for comparing reading times. The results also support using least squares estimation on logged data for estimating regression models. The fact that the Lomax model fits the data better than the lognormal model comes with a caveat: lognormal-based models will underestimate the probability of long page dwell times. However, this is a minor threat to validity compared to the assumption that we are measuring reading behavior.
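A minimal sketch of a t-test on logged dwell times, using only the Python standard library. Welch's unequal-variance form is my choice here, not something settled above, and the two groups are synthetic lognormal samples (hypothetical, not Wikipedia data).

```python
import math
import random
import statistics


def welch_t_on_logs(a, b):
    """Welch's t statistic and approximate degrees of freedom on log(a), log(b)."""
    la = [math.log(x) for x in a]
    lb = [math.log(x) for x in b]
    # Squared standard errors of the two log-scale means.
    se2_a = statistics.variance(la) / len(la)
    se2_b = statistics.variance(lb) / len(lb)
    t = (statistics.fmean(la) - statistics.fmean(lb)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(la) - 1) + se2_b ** 2 / (len(lb) - 1)
    )
    return t, df


# Synthetic example: group B has a log-scale mean 0.2 higher than group A.
rng = random.Random(0)
group_a = [rng.lognormvariate(3.0, 1.0) for _ in range(2000)]
group_b = [rng.lognormvariate(3.2, 1.0) for _ in range(2000)]
t_stat, dof = welch_t_on_logs(group_a, group_b)
```

A difference in log-scale means corresponds to a ratio of geometric means on the original scale, which is how a significant result here would be read.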

Ageing and Hazard functions: Since the ordinary Weibull model is such a poor fit for the data, I do not recommend making decisions based on the interpretation of the Weibull shape parameter $k$ as indicative of positive or negative ageing. The fits of the Exponentiated Weibull model indicate that ageing is often non-monotonic, which contradicts the assumptions of the Weibull model. I recommend that future work seeking to measure changes in ageing patterns interpret hazard function and survival plots of Exponentiated Weibull models. Uncertainty may be estimated using bootstrapping or MCMC.
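To illustrate why a non-monotonic hazard is out of reach for the ordinary Weibull model, here is a small sketch of the Exponentiated Weibull hazard function. The parametrization (`k`, `alpha`, `lam`) and the parameter values are assumptions chosen for illustration, not fitted values. With `alpha = 1` the model reduces to the ordinary Weibull, whose hazard is always monotonic.

```python
import math


def exponweib_hazard(x, k, alpha, lam=1.0):
    """Hazard h(x) = f(x) / (1 - F(x)), with F(x) = (1 - exp(-(x/lam)**k))**alpha."""
    z = (x / lam) ** k
    base = -math.expm1(-z)  # 1 - exp(-z), computed without cancellation
    cdf = base ** alpha
    pdf = (
        alpha * (k / lam) * (x / lam) ** (k - 1)
        * base ** (alpha - 1) * math.exp(-z)
    )
    return pdf / (1.0 - cdf)


# Log-spaced grid from 0.01 to 100 (arbitrary time units).
xs = [10 ** (-2 + 4 * i / 199) for i in range(200)]

# Illustrative shape values: k < 1 with k * alpha > 1 gives a hazard
# that first rises and then falls.
hazard = [exponweib_hazard(x, k=0.5, alpha=4.0) for x in xs]
peak = max(range(len(xs)), key=hazard.__getitem__)
```

Here `peak` lands in the interior of the grid, meaning the hazard increases for short dwell times and decreases afterwards, a pattern that no single value of $k$ can express in the plain Weibull model.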

The logged Weibull plot above may not be totally clear (e.g. the QQ plot is malformed). Here's the unlogged version:

Goodness of fit plots for a Weibull model for English Wikipedia. The model's PDF goes to infinity at 0, but the data do not, so the Weibull model really overestimates the probability of short dwell times. On the other hand, it underestimates the probability of medium-length dwell times.

Multi-modal distribution: The relative goodness of the Lomax model compared to the lognormal model is probably due to multi-modality in the data. A mixture of lognormals or a lognormal-Pareto mixture might fit the data quite well. Future work might attempt to correct bias in single-mode models by fitting mixture models.
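As a rough sketch of what fitting a mixture of lognormals could look like, the code below runs EM for a two-component Gaussian mixture on the logged times. Everything here is an assumption for illustration: two components rather than the three modes seen in the density plot, a median-split initialization, a fixed iteration count, and synthetic data with a small "quick back" component.

```python
import math
import random


def normal_pdf(y, mu, sigma):
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_lognormal_mixture(times, n_iter=100):
    """EM for a two-component Gaussian mixture fitted to log(times)."""
    ys = sorted(math.log(t) for t in times)
    n = len(ys)
    # Crude initialization: split the sorted logs at the median.
    lower, upper = ys[: n // 2], ys[n // 2:]
    w = [0.5, 0.5]
    mu = [sum(lower) / len(lower), sum(upper) / len(upper)]
    sigma = [1.0, 1.0]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for y in ys:
            p = [w[j] * normal_pdf(y, mu[j], sigma[j]) for j in range(2)]
            total = (p[0] + p[1]) or 1e-300  # guard against underflow
            resp.append((p[0] / total, p[1] / total))
        # M-step: responsibility-weighted updates of weight, mean, and sd.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            w[j] = nj / n
            mu[j] = sum(r[j] * y for r, y in zip(resp, ys)) / nj
            var = sum(r[j] * (y - mu[j]) ** 2 for r, y in zip(resp, ys)) / nj
            sigma[j] = max(math.sqrt(var), 1e-3)
    return w, mu, sigma


# Synthetic data: 20% short "quick back" events, 80% longer views.
rng = random.Random(1)
times = [
    rng.lognormvariate(0.0, 0.5) if rng.random() < 0.2
    else rng.lognormvariate(3.5, 1.0)
    for _ in range(2000)
]
weights, means, sds = fit_lognormal_mixture(times)
```

Working on the log scale keeps the M-step in closed form, since a mixture of lognormals in time is a mixture of normals in log time.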

Next steps

I will switch to multivariate analysis of page visible time. We have a good justification for using OLS regressions on the log of page visible times, which will make fitting models very convenient. I'm going to build variables we are likely to use and do some work to justify rationales for the hypotheses that we can test.
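A minimal sketch of OLS on the log of page visible time, reduced to a single predictor so it fits in closed form with the standard library. The predictor `is_mobile` and the simulated effect size are hypothetical placeholders, not variables or results from this project; a real multivariate analysis would use a regression library with several covariates.

```python
import math
import random


def ols_log_time(x, times):
    """Least-squares fit of log(time) = intercept + slope * x."""
    y = [math.log(t) for t in times]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Simulated data: a hypothetical binary predictor lowers log dwell time by 0.4.
rng = random.Random(2)
is_mobile = [rng.randint(0, 1) for _ in range(4000)]
times = [math.exp(3.0 - 0.4 * m + rng.gauss(0.0, 1.0)) for m in is_mobile]

intercept, slope = ols_log_time(is_mobile, times)
ratio = math.exp(slope)  # multiplicative effect on the geometric mean time
```

Because the response is logged, `exp(slope)` reads as a ratio of geometric mean dwell times between the two groups rather than a difference in seconds.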