Research:Reading time/Model selection

From Meta, a Wikimedia project coordination wiki

Our method for model selection is inspired in part by Liu et al. (2010), who compared the log-normal distribution to the Weibull distribution of dwell times on a large sample of web pages [1]. They fit both models to data for each web page and then compare two measures of model fit: the log-likelihood, which measures the probability of the data given the model (higher is better), and the Kolmogorov-Smirnov distance (KS distance), which is the maximum difference between the model CDF and the empirical CDF (lower is better). For the sample of web pages they consider, the Weibull model outperformed the log-normal model in a large majority of cases according to both goodness-of-fit measures.

Similar to the approach of Liu et al. (2010), we fit each of the models we consider on reading time data, separately for each Wikipedia project [1]. We also use the KS distance to evaluate goodness-of-fit. The KS-distance provides a statistical test of the null hypothesis that the model is a good fit for the data [2]. The KS-test is quite sensitive to deviations between the model and the data, especially in large samples. Failing to reject this hypothesis with a large sample of data supports the conclusion that the model is a good fit for the data. For the samples sizes we use, passing the KS test is a high bar. This allows us to go beyond Liu et al. (2010) by evaluating whether each distribution is a plausible model, instead of just whether one distribution is a better fit than another.

Liu et al. (2010) compare two distributions that each have 2 parameters, but the models we consider have different numbers of parameters (the exponentiated Weibull model has 3 parameters and the exponential model has only 1). Adding parameters can increase model fit without improving out-of-sample predictive performance or explanatory power. To avoid the risk over-fitting and to make a fair comparison between models we use the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) instead of the log-likelihood. Both criteria attempt to quantify the amount of information lost by the model (lower is better), evaluate the log likelihood, but add a penalty for the model parameters. The difference between AIC and BIC is that BIC maintains the penalty for larger sample sizes. For a more detailed example of this procedure see this work log.

We build these goodness-of-fit measures for each wiki and rank them from best to worst. For each distribution, we report the mean and median of these ranks. In addition, we report the mean and median p-values of the KS-tests and the proportion of wikis that pass the KS test for each model. We also use diagnostic plots to compare the empirical and modeled distributions of the data in order to explain where models are failing to fit the data. Because the data is so skewed, we log the X axis of these plots.

The diagnostic plots are shown with data on English Wikipedia. On this Wiki, the exponentiated Weibull model is the best fit, followed by the lomax model and then the log-normal model and only the exponentiated Weibull model passes the KS test.

  1. a b Liu, Chao; White, Ryen W.; Dumais, Susan (2010). "Understanding Web Browsing Behaviors Through Weibull Analysis of Dwell Time". Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '10 (New York, NY, USA: ACM): 379–386. doi:10.1145/1835449.1835513.  Closed access
  2. Clauset, A.; Shalizi, C.; Newman, M. (2009-11-04). "Power-Law Distributions in Empirical Data". SIAM Review 51 (4): 661–703. ISSN 0036-1445. doi:10.1137/070710111. Retrieved 2019-01-01.