Research talk:Reading time/Work log/2018-10-02

Monday, October 1, 2018 / Tuesday, October 2, 2018

Investigate the 40 second period error DONE

It turns out that the 40 second intervals were due to a bug in computing the deltas, so they are not a real problem in the data.

Made cleaner plots of discrepancies

This chart shows the distribution of discrepancies between event timestamps and measured dwell times on Wikipedia. The 40 second intervals are positive (except around 0) and appear to decay exponentially. The discrepancies compare the time between server log events and the total length of time recorded by the browser. The discrepancies between visible length and the timestamps are similar.
As above, with dwell times measured in visible length, and the axis constrained. 
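The discrepancy plotted above can be sketched as follows. This is a minimal illustration, not the code used for the charts: the field names (`server_interval_ms`, `total_length_ms`, `visible_length_ms`) and the sign convention (client timer minus server interval) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PageView:
    # All field names are hypothetical; they stand in for whatever the
    # event schema actually provides.
    server_interval_ms: int  # time between server log events for the view
    total_length_ms: int     # total time recorded by the browser
    visible_length_ms: int   # time the page was visible in the browser


def discrepancy_ms(view: PageView, use_visible: bool = False) -> int:
    """Client timer minus server interval (assumed sign convention).

    A positive value means the browser reported more time than the
    server timestamps allow.
    """
    client = view.visible_length_ms if use_visible else view.total_length_ms
    return client - view.server_interval_ms


view = PageView(server_interval_ms=10_000, total_length_ms=12_500, visible_length_ms=9_000)
total_gap = discrepancy_ms(view)
visible_gap = discrepancy_ms(view, use_visible=True)
```

If the charts use the opposite sign, swap the operands in the subtraction.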


Look by IP block

When grouping by IPv4 block, there are not any obvious discrepancies. When comparing IPv4 to IPv6 it becomes clear that most of the errors are coming from IPv4.

Comparison of the discrepancies by the first digit of the IPv4 address. 1*, 2*, and 7* addresses are somewhat more common, but these might be overrepresented in the logs as well.

TODO: Do these as a proportion of all events in the group.
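One way to approach this TODO is to divide the number of discrepant events in each group by the total number of events in that group. The sketch below assumes each event is a hypothetical `(ip, has_discrepancy)` pair and groups IPv4 addresses by their first digit, as in the chart; IPv6 addresses are lumped into a single group.

```python
import ipaddress
from collections import Counter


def ip_group(ip: str) -> str:
    """Return 'IPv6' for IPv6 addresses, else the first digit plus '*'."""
    addr = ipaddress.ip_address(ip)  # raises ValueError on malformed input
    if addr.version == 6:
        return "IPv6"
    return str(addr)[0] + "*"


def error_proportion_by_group(events):
    """events: iterable of (ip, has_discrepancy) pairs (hypothetical shape)."""
    totals = Counter()
    errors = Counter()
    for ip, has_discrepancy in events:
        group = ip_group(ip)
        totals[group] += 1
        if has_discrepancy:
            errors[group] += 1
    # Proportion of events in each group that show a discrepancy.
    return {group: errors[group] / totals[group] for group in totals}


sample = [
    ("10.0.0.1", True),
    ("192.168.0.1", False),
    ("2.3.4.5", True),
    ("2001:db8::1", False),
]
proportions = error_proportion_by_group(sample)
```

Normalizing this way separates groups that are error-prone from groups that are simply common in the logs.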

As above, comparing IPv4 to IPv6. Most of the errors come from IPv6.
  • Look by Geolocation (Mountain View, Redmond, Country, region, city)
Variation in reading time discrepancy by country. The y axis is the average magnitude of the discrepancy between times on the server logs and client side timers. The x axis shows country codes. There is quite a bit of variation in the amount of discrepancy by country, but so far no clear pattern.
Variation in the proportion of client timers that measure more time than the log events suggest is possible. The y axis is the average proportion of views where times on the server logs are shorter than the client side timers. The x axis shows country codes. There doesn't appear to be much variation. The countries at the high and low end have smaller sample sizes.
  • Inform engineering of findings
  • Maybe fall back to the Webrequest table if we need more information
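The two per-country quantities shown in the geolocation charts (average magnitude of the discrepancy, and proportion of views where the server interval is shorter than the client timer) could be computed along these lines. The input shape `(country_code, server_ms, client_ms)` is an assumption for illustration. The sample size is returned as well, since the countries at the extremes had few observations.

```python
from collections import defaultdict


def summarize_by_country(views):
    """views: iterable of (country_code, server_ms, client_ms) tuples (hypothetical shape)."""
    # Per country: [view count, sum of |server - client|, count of server < client]
    acc = defaultdict(lambda: [0, 0.0, 0])
    for country, server_ms, client_ms in views:
        entry = acc[country]
        entry[0] += 1
        entry[1] += abs(server_ms - client_ms)
        entry[2] += 1 if server_ms < client_ms else 0
    return {
        country: {
            "n": n,
            "mean_abs_discrepancy_ms": abs_sum / n,
            "prop_server_shorter": shorter / n,
        }
        for country, (n, abs_sum, shorter) in acc.items()
    }


sample_views = [("US", 1000, 1200), ("US", 2000, 1500), ("DE", 500, 500)]
summary = summarize_by_country(sample_views)
```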

Improve Workflow

Filtering data for analysis

  • Exclude bots and spiders.
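As a rough illustration of the bot and spider filter, the sketch below drops events whose user agent string matches a simple pattern. The pattern and the `user_agent` key are assumptions made for the example. A real pipeline would more likely rely on whatever agent classification the logging infrastructure already provides.

```python
import re

# Deliberately crude, hypothetical pattern; it will miss bots that do not
# identify themselves and may flag legitimate agents containing these words.
BOT_PATTERN = re.compile(r"bot|spider|crawler", re.IGNORECASE)


def is_probable_bot(user_agent: str) -> bool:
    """Treat empty user agents and pattern matches as automated traffic."""
    return not user_agent or bool(BOT_PATTERN.search(user_agent))


def exclude_bots(events):
    """events: iterable of dicts with a 'user_agent' key (hypothetical shape)."""
    return [e for e in events if not is_probable_bot(e.get("user_agent", ""))]


events = [
    {"user_agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/62.0"},
    {"user_agent": "Googlebot/2.1"},
    {"user_agent": ""},
]
kept = exclude_bots(events)
```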