Schema talk:ReadingDepth

Please specify the schema maintainer.
Team: Product Analytics
Project: Reading Depth (see phab:T155639)
Purge: Auto-purge just eventCapsule PII and sessionToken after 90 days, keep the rest indefinitely


The default sample was decreased to 0% on Tuesday, 20 August 2019 at 6:45 PM UTC+1 (see phab:T229042) -- Phuedx (WMF) (talk) 09:48, 23 August 2019 (UTC)

Previous updates

The ReadingDepth instrumentation is enabled on all Wikipedias. Per 340095, the sampling rate is currently 0.1%, i.e. at least one ReadingDepth event is logged for 0.1% of all distinct browser sessions.

Around September 2018, we augmented the schema with separate samples from the Page Issues A/B test, leaving the default sample as is (phab:T191532#4393286).
The default sample was increased from 0.1% to 10% on September 25, 2018 (phab:T205176). The current sampling ratio and scope for the default sample can generally be found in InitialiseSettings.php. Regards, Tbayer (WMF) (talk) 04:30, 6 October 2018 (UTC)

Likely broken on Safari and some other browsers

Since this schema was first launched, Safari has added sendBeacon support. However, per the investigation in phab:T204143, the instrumentation has to be regarded as broken for this browser for the time being, as events are being sent inconsistently. Thus, events with a Safari user agent should currently be excluded from all data analysis involving this schema. Regards, Tbayer (WMF) (talk) 05:52, 20 September 2018 (UTC)

After some further investigation into additional discrepancies (see the more recent parts of phab:T204143), it seems that the current recommendation should be to remove the following from analysis:
  • Android native browser (i.e. useragent.browser_family = 'Android')
  • iOS versions older than 11.3
  • Chrome <= 38
  • Safari (desktop and likely also Mobile Safari - needs a little further investigation)
Regards, Tbayer (WMF) (talk) 20:58, 23 October 2018 (UTC)

Update: It appears Safari Mobile (on iOS >= 11.3) does not need to be excluded, based on the findings from that task (phab:T204143#4895679).

So the updated recommendation is to exclude the following:

  • Android native browser (i.e. useragent.browser_family = 'Android')
  • iOS versions older than 11.3
  • Chrome <= 38
  • desktop Safari (i.e. useragent.browser_family = 'Safari')

Here is a Hive code snippet implementing these restrictions (to be used in the WHERE clause of a query to event.readingdepth):

... AND (
 (useragent.browser_family != 'Safari')
 AND (useragent.browser_family != 'Android')
 AND ((useragent.os_family != 'iOS') OR (CAST(useragent.os_major AS INT) > 11) OR (CAST(useragent.os_major AS INT) = 11 AND CAST(useragent.os_minor AS INT) >= 3))
 AND ((useragent.browser_family != 'Chrome') OR (CAST(useragent.browser_major AS INT) > 38))
)

Regards, Tbayer (WMF) (talk) 06:54, 21 January 2019 (UTC)
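
For analyses done outside Hive, the same exclusion list can be sketched as a small Python predicate. This is only an illustration: the function name `keep_for_analysis` and its argument names are hypothetical (they mirror the `useragent` fields named above), and treating unparseable version strings as excluded is an assumption chosen to match how a NULL comparison drops a row in a WHERE clause.

```python
from typing import Optional


def _to_int(value: Optional[str]) -> Optional[int]:
    """Parse a user-agent version component; return None if it is not numeric."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def keep_for_analysis(browser_family, browser_major, os_family, os_major, os_minor) -> bool:
    """Return True if the user agent passes the exclusion list (hypothetical helper)."""
    # Desktop Safari and the Android native browser are excluded outright.
    if browser_family in ("Safari", "Android"):
        return False
    # iOS older than 11.3 is excluded (assumption: unparseable versions are dropped too).
    if os_family == "iOS":
        major, minor = _to_int(os_major), _to_int(os_minor)
        if major is None:
            return False
        if major < 11 or (major == 11 and (minor is None or minor < 3)):
            return False
    # Chrome 38 and older is excluded.
    if browser_family == "Chrome":
        major = _to_int(browser_major)
        if major is None or major <= 38:
            return False
    return True
```

Note that Mobile Safari on iOS 11.3 or later passes this check, in line with the update above, because its browser family is 'Mobile Safari' rather than 'Safari'.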

Technical details on implementation

Regards, Tbayer (WMF) (talk) 05:06, 6 October 2018 (UTC)

2017 sampling bug

Note that earlier data from this schema was affected by a serious sampling bug fixed in late September 2017 (phab:T175918), which caused many events occurring after the first pageview in a session not to be recorded. Regards, Tbayer (WMF) (talk)

Events where unloaded event is logged before the loaded event

I see about 0.75% of cases where the page unloaded event is logged before the page loaded event. I don't know why this would happen, and it merits further investigation.
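
As a rough illustration of how such a share can be measured, the sketch below counts the pairs whose unloaded timestamp precedes the loaded timestamp. The function name `share_unloaded_first` and the sample timestamps are made up for this example, and the timestamp format is assumed to be ISO 8601 with a trailing "Z".

```python
from datetime import datetime

# Assumed timestamp format: ISO 8601 with a trailing "Z".
TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def share_unloaded_first(pairs):
    """Return the fraction of (loaded, unloaded) timestamp pairs where unloaded comes first."""
    total = 0
    out_of_order = 0
    for loaded, unloaded in pairs:
        total += 1
        if datetime.strptime(unloaded, TS_FORMAT) < datetime.strptime(loaded, TS_FORMAT):
            out_of_order += 1
    # An empty input has no out-of-order pairs by definition.
    return out_of_order / total if total else 0.0


# Made-up sample: one of the four pairs has the unloaded event first.
sample = [
    ("2018-08-01T10:00:00Z", "2018-08-01T10:00:30Z"),
    ("2018-08-01T23:59:59Z", "2018-08-01T23:59:58Z"),
    ("2018-08-02T00:00:01Z", "2018-08-02T00:02:00Z"),
    ("2018-08-02T12:00:00Z", "2018-08-02T12:00:05Z"),
]
print(share_unloaded_first(sample))
```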

I made nathante.pageeventtimings, a table of page event timings from the event logs. More details in the work log.

Looking at a subset of this data, I observed a clear daily pattern of spikes around midnight UTC. This seems like a mystery to me.

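import pandas as pd
from impala.util import as_pandas  # assumed source of as_pandas (impyla); hive_cursor is an open Hive cursor

# Pull the August rows of the page event timings table.
hive_cursor.execute("SELECT * FROM nathante.pageEventTimings WHERE month = 8")
dt_err3 = as_pandas(hive_cursor)

# Strip the "table." prefix from the column names.
dt_err3 = dt_err3.rename(columns={name: name.split('.')[1] for name in dt_err3.columns})
# Some loaded timestamps lack the trailing "Z"; add it so one format string parses them all.
missing_z = ~dt_err3['lo_dt'].str.endswith("Z")
dt_err3.loc[missing_z, "lo_dt"] = dt_err3.loc[missing_z, "lo_dt"] + "Z"
dt_err3["lo_dt"] = pd.to_datetime(dt_err3.lo_dt, format="%Y-%m-%dT%H:%M:%SZ")
dt_err3["ul_dt"] = pd.to_datetime(dt_err3.ul_dt, format="%Y-%m-%dT%H:%M:%SZ")
# Keep only the rows with a negative event span, as a copy so that adding a column is safe.
dt_err3['span_neg'] = dt_err3.eventspan < 0
dt_err4 = dt_err3.loc[dt_err3.span_neg].copy()
dt_err4['ul_minute'] = dt_err4.ul_dt.dt.round('min')
ax = dt_err4.ul_dt.hist(bins=500, figsize=(20, 10))
ax.set_xlabel("Unloaded DT")

This chart shows a histogram of cases of inconsistency in the logs of dwell time events on English Wikipedia.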

Groceryheist (talk) 21:25, 28 October 2018 (UTC)

In the following line, should dt_err3.lo read dt_err3.ul?
dt_err3["ul_dt"] = pd.to_datetime(dt_err3.lo_dt,format="%Y-%m-%dT%H:%M:%SZ")
Regards, Tbayer (WMF) (talk) 00:55, 18 November 2018 (UTC)
Yes, I had already fixed that issue by the time I made the plot, but didn't correct it here (until now). Groceryheist (talk) 00:58, 18 November 2018 (UTC)
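
One way to check a midnight pattern like the one described in this thread without plotting is to count the affected events per UTC hour of day. The sketch below is a minimal illustration using only the standard library; the function name `count_by_utc_hour` and the sample timestamps are made up, and the timestamp format is assumed to match the one used in the snippet above.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp format, matching the format string used in the thread.
TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def count_by_utc_hour(unloaded_timestamps):
    """Return a list of 24 counts, one per UTC hour of day."""
    counts = Counter(datetime.strptime(ts, TS_FORMAT).hour for ts in unloaded_timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


# Made-up sample of unloaded timestamps for out-of-order events.
sample_unloaded = [
    "2018-08-01T23:59:10Z",
    "2018-08-02T00:00:05Z",
    "2018-08-02T00:30:00Z",
    "2018-08-02T12:00:00Z",
]
hourly = count_by_utc_hour(sample_unloaded)
```

A concentration of counts in hours 23 and 0 across many days would be consistent with the spikes seen in the histogram, though it would not by itself explain them.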

Note on February ResourceLoader module schema deprecation

In T214444 we removed the call to the deprecated schema.ReadingDepth module. Previously, this call delayed the sending of ReadingDepth events, which waited until the window load event. Although unlikely, in analyses of events around this period you might see additional or earlier page load events and/or slight discrepancies in timestamps (e.g. sessions might appear slightly shorter).

Removed from WikimediaEvents

Following on from the reduction of the default sample to 0% on Tuesday, 20 August 2019, the ReadingDepth instrument, supporting config variables, and documentation were removed from the WikimediaEvents codebase, both to improve performance and to ease maintenance (see gerrit:626116). Phuedx (WMF) (talk) 14:05, 9 September 2020 (UTC)