Research talk:Reading time/Work log/2018-10-26

From Meta, a Wikimedia project coordination wiki
Jump to navigation Jump to search

Friday, October 26, 2018[edit]

Investigating Timing Discrepancies[edit]

I previously observed a low, but significant rate of inconsistent timings in the server log. Jon asked If I would look into this a little more and provide some data on events where this occurs and surrounding events.

I plotting the distribution of these events over time reveals two interesting patterns:

  1. A very brief spike around [Date]
  2. A longer lived, normally distributed spike around [Time]

I created a table in Hadoop (nathante.neg_span_spike) to enable easier investigation of the spike.

I'll upload plots of the distributions of the incidence of the issue over time as well as distributions of the discrepancies.

One pretty crazy thing I think I might be seeing is that the page unloaded events are in December 2017, but the month and date fields are set to August and September.

Code for building the page timings table[edit]

---- the lengths in the record are in ms, but the dt are in s
CREATE TABLE  nathante.pageEventTimings LOCATION "/user/nathante/pageEventTimings" AS
SELECT lo.pageToken AS pageToken, 
        lo.dt_ts AS lo_dt_ts, 
        ul.dt_ts AS ul_dt_ts, 
        ul.dt_ts - lo.dt_ts AS EventSpan,
        (ul.dt - lo.dt) - totalLength/1000.0 AS errTotalLength, 
        (ul.dt - lo.dt) - visibleLength/1000.0 AS errVisibleLength,
        ul.dt as ul_dt,
        lo.dt as lo_dt,
        ul.totalLength,
        ul.visibleLength,
        ul.domInteractiveTime,
        ul.event, 
        ul.useragent, 
        ul.wiki,
        ul.year,
        ul.month,
        ul.day,
        ul.hour,
        ul.webhost,
        ul.ip,
        split(ul.ip,'[\.]')[0] AS ipBlock3,
        substr(ul.ip,0,1) AS ipBlock1,
        ul.ip LIKE "%.%.%.%" AS isipv4,
              ul.geocoded_data
                    FROM 
                    (SELECT unix_timestamp(GetMediawikiTimestamp(dt),"YYYYMMddHHmmss") AS dt_ts,
                            dt,
                            event.pageToken as pageToken
                     FROM nathante.cleanReadingData WHERE event.action = "pageLoaded") lo,
                    (SELECT unix_timestamp(GetMediawikiTimestamp(dt),"YYYYMMddHHmmss") AS dt_ts, 
                     dt,
                     event.pageToken as pageToken,
                     event.totalLength AS totalLength, 
                     event.visibleLength AS visibleLength, 
                     event.domInteractiveTime AS domInteractiveTime,
                     event, 
                     useragent,
                     wiki,
                     year,
                     month,
                     day,
                     hour,
                     webhost,
                     ip,
                     geocoded_data
                         FROM nathante.cleanReadingData
                         WHERE event.action = "pageUnloaded") ul
WHERE lo.pageToken = ul.pageToken AND lo.dt_ts > 0