Research talk:Reading time/Work log/2018-11-02


Saturday, November 3, 2018

Page unloaded event differences by mobile?

As discussed in the meeting with Jon, one possible limitation of the data may threaten our ability to make a fair comparison between mobile readers and desktop readers: do mobile browsers fire pageUnloaded events when readers close the browser or switch apps? If they do not, we will have a large number of pageLoaded events without matching pageUnloaded events, and the missing data will be correlated with mobile usage. Even if mobile browsers do fire pageUnloaded events, they may do so in situations where we would expect the visiblelength counter to be updated, so the recorded value may lag the true one. This would bias mobile reading times downward.
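The second concern can be illustrated with a toy calculation. This is only a sketch with invented numbers: `TRUE_TIMES`, `observed_mean`, and the assumption that a truncated event records a fixed fraction of the true visible time are all hypothetical, chosen to show the direction of the bias rather than its size.

```python
from statistics import mean

# Hypothetical true visible reading times in seconds (not real data).
TRUE_TIMES = [10.0, 20.0, 30.0, 40.0]


def observed_mean(true_times, truncated_idx, fraction_recorded):
    """Mean recorded time when some page views are logged before the
    visible-time counter has caught up.

    truncated_idx: positions of page views whose recorded time is cut short.
    fraction_recorded: share of the true time recorded for those views
    (an assumed constant, purely for illustration).
    """
    observed = [
        t * fraction_recorded if i in truncated_idx else t
        for i, t in enumerate(true_times)
    ]
    return mean(observed)


# Desktop: every page view records its full visible time.
desktop_mean = observed_mean(TRUE_TIMES, set(), 1.0)
# Mobile: same true times, but two page views record only half of theirs.
mobile_mean = observed_mean(TRUE_TIMES, {1, 3}, 0.5)
```

With identical true reading times, the mobile mean comes out lower purely because of how the events were recorded.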

I wrote this query to count, separately for mobile and desktop, how often pageLoaded and pageUnloaded events fail to pair up one-to-one for a page token.

SELECT COUNT(DISTINCT pagetoken) AS NReaders,
       Mobile,
       SUM(IF(one_each, 1, 0)) AS n_one_each,
       SUM(IF(not_unloaded, 1, 0)) AS n_not_unloaded,
       SUM(IF(loaded_more_than_1x, 1, 0)) AS n_loaded_more_than_1x,
       SUM(IF(unloaded_more_than_1x, 1, 0)) AS n_unloaded_more_than_1x
FROM (
    -- One row per page token: classify its loaded/unloaded counts.
    SELECT pagetoken,
           Mobile,
           (SUM(Nloaded) == 1) AND (SUM(Nunloaded) == 1) AS one_each,
           (SUM(Nloaded) == 1) AND (SUM(Nunloaded) == 0) AS not_unloaded,
           SUM(Nloaded) > 1 AS loaded_more_than_1x,
           SUM(Nunloaded) > 1 AS unloaded_more_than_1x
    FROM (
        -- Count events per page token and action.
        SELECT pagetoken, action, Mobile,
               COUNT(*) AS N,
               SUM(IF(action == "pageLoaded", 1, 0)) AS Nloaded,
               SUM(IF(action == "pageUnloaded", 1, 0)) AS Nunloaded
        FROM (
            SELECT event.pagetoken AS pagetoken,
                   event.action AS action,
                   webhost LIKE "%.m.%" AS Mobile
            FROM nathante.cleanReadingData
            WHERE event.namespaceid == 0
        ) g
        GROUP BY pagetoken, action, Mobile
    ) h
    -- Group by token only, not action, so both counts land in the same row;
    -- grouping by action here would make one_each impossible.
    GROUP BY pagetoken, Mobile
) i
GROUP BY Mobile

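# Assumption: as_pandas comes from impyla (impala.util); hive_cursor is the
# cursor that executed the query above.
from impala.util import as_pandas

dt = as_pandas(hive_cursor)

# Share of page tokens in each category.
dt['p_one_each'] = dt['n_one_each'] / dt['nreaders']
dt['p_not_unloaded'] = dt['n_not_unloaded'] / dt['nreaders']
dt['p_unloaded_more_than_1x'] = dt['n_unloaded_more_than_1x'] / dt['nreaders']

# Drop the column that is not reported below.
dt = dt.drop(columns='n_loaded_more_than_1x')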
    nreaders  mobile  n_one_each  n_not_unloaded  n_unloaded_more_than_1x  p_one_each  p_not_unloaded  p_unloaded_more_than_1x
0          6    None           3               2                        1    0.500000        0.333333                 0.166667
1  448722947   False   424857529        21928737                  1935907    0.946815        0.048869                 0.004314
2  940421259    True   400006656       535123953                  5288834    0.425348        0.569026                 0.005624

As suspected, the incidence of pageLoaded events without pageUnloaded events is high on mobile: about 57% of mobile page tokens, compared with about 5% on desktop!
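To make the per-token classification behind these counts concrete, here is a plain-Python sketch on a toy event list. The tokens and events are invented, and `classify` and `summarize` are hypothetical helper names rather than part of the analysis above. The sketch sums both event types per page token before classifying, which is what lets a token count as one_each.

```python
from collections import Counter, defaultdict

# Toy events as (pagetoken, action, mobile); all values are invented.
EVENTS = [
    ("a", "pageLoaded", False), ("a", "pageUnloaded", False),
    ("b", "pageLoaded", True),
    ("c", "pageLoaded", True), ("c", "pageUnloaded", True),
    ("c", "pageUnloaded", True),
    ("d", "pageLoaded", False), ("d", "pageLoaded", False),
    ("d", "pageUnloaded", False),
]


def classify(n_loaded, n_unloaded):
    """Return the (possibly overlapping) categories for one page token."""
    labels = []
    if n_loaded == 1 and n_unloaded == 1:
        labels.append("one_each")
    if n_loaded == 1 and n_unloaded == 0:
        labels.append("not_unloaded")
    if n_loaded > 1:
        labels.append("loaded_more_than_1x")
    if n_unloaded > 1:
        labels.append("unloaded_more_than_1x")
    return labels


def summarize(events):
    """Count page tokens per category, split by the mobile flag."""
    # Event counts per (pagetoken, mobile), summed over both actions.
    per_token = defaultdict(Counter)
    for pagetoken, action, mobile in events:
        per_token[(pagetoken, mobile)][action] += 1

    summary = defaultdict(Counter)
    for (_, mobile), actions in per_token.items():
        summary[mobile]["nreaders"] += 1
        for label in classify(actions["pageLoaded"], actions["pageUnloaded"]):
            summary[mobile][label] += 1
    return summary


summary = summarize(EVENTS)
# Share of mobile tokens with a load event but no unload event.
p_not_unloaded_mobile = summary[True]["not_unloaded"] / summary[True]["nreaders"]
```

In the toy data, token "b" is the only mobile token that was loaded once and never unloaded, so it alone contributes to the mobile not_unloaded count.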

@Jon (WMF): --- FYI

Nevertheless, I am fitting the models that we talked about earlier. —The preceding unsigned comment was added by Groceryheist (talk) 03:03, 3 November 2018

Thanks for looking into this and quantifying these concerns!
Also CCing Timo Tijhof, with whom I had a chat about this recently in Portland. He mentioned that Google has proposed a new browser feature that addresses such issues, the "Page Lifecycle API". Actually, from https://developers.google.com/web/updates/2018/07/page-lifecycle-api it seems that this is already live in the most recent versions of Chrome? Jon, would this be something we could try using in the ReadingDepth schema? Regards, Tbayer (WMF) (talk) 02:39, 4 November 2018 (UTC)