Requests for comment/User site behavior collection

The following request for comments is closed. The request was eventually archived as inactive.

Fundraising is opening this request for comments to ask whether it is acceptable, in support of the fundraiser, to perform sampled tracking and aggregation of site visitor sessions.

The purpose of this sampled tracking and aggregation is to build a model of how readers interact with our site over time. This request for comments (RFC) specifically asks whether it is acceptable to track users over a 15-day interval to collect statistics about the number of pages viewed per visit and the time between visits, for users across geographical regions, languages, and Wikimedia Foundation properties.

The Wikimedia Foundation legal team has given approval for this experiment.
Erik Möller has given engineering approval.

Motivation

Fundraising

Fundraising would like to be able to model the effect of immediate and historic banner delivery on future income. We hypothesize that future income can be modeled as a function of unique visitors seeing a banner on their first site visit. To test this, we must know how many unique visitors there are to the site at any given moment. To fine-tune our efforts further, it is useful to know this across languages, countries, and web properties. This data can then inform us about the long-term effects on income of temporarily increasing the number of banners seen.

Other questions we can answer

If we were to show two or more banners from multiple campaigns, how many visitors would be expected to have seen a banner from a specific previous campaign?
With a purely random distribution of banners over visitors, how long would it take for one visitor to have seen two or more banners?
- Answering this would inform the design of an algorithm that maximizes the length of time between a user seeing banners.

Why could this not be collected in other ways

It could; however, it would require running live tests with real banners. We deem this annoying and think it should be avoided where possible. Running it live may also affect things in uncertain ways, so this approach can also be thought of as variable reduction.

In general

This is valuable in general because it can inform the required length of future studies on unrelated topics. It will also help us to understand and independently validate the data we obtain from third-party analytics providers.

What data will be collected & retained?

For anonymous users, connecting over HTTP (specifically not HTTPS)¹:

Server side

The average number of page views per session (a session is defined as any set of contiguous page requests where the inter-page request time is less than 24 minutes²).
The number of sessions
The average inter-session length
The average time between page requests in a session
The maximum amount of time between page requests in a session
The minimum amount of time between page requests in a session
The project (e.g. Wikipedia)
The language the user is viewing in (e.g. English)
The user's country (as determined by a GeoIP lookup)
Down-sampling rate
Time collected
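
As an illustration only, the per-visitor session statistics listed above could be derived from a list of page request timestamps roughly as follows. This is a minimal sketch, not the code used in the study: the names `summarize` and `VisitorStats` are hypothetical, timestamps are assumed to be in seconds, and the inter-session length is assumed to be the gap between the last request of one session and the first request of the next.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional

# Session threshold taken from the definition above (24 minutes).
SESSION_GAP_SECONDS = 24 * 60


@dataclass
class VisitorStats:
    session_count: int
    avg_pages_per_session: float
    avg_inter_session_gap: Optional[float]  # None when there is only one session
    avg_in_session_gap: Optional[float]     # None when no session has 2+ requests
    max_in_session_gap: Optional[float]
    min_in_session_gap: Optional[float]


def summarize(request_times: List[float]) -> VisitorStats:
    """Split one visitor's request timestamps into sessions and summarize them."""
    if not request_times:
        raise ValueError("need at least one page request")
    times = sorted(request_times)
    session_sizes = [1]        # page count of each session
    in_session_gaps = []       # gaps shorter than the threshold
    inter_session_gaps = []    # gaps that start a new session
    for prev, cur in zip(times, times[1:]):
        gap = cur - prev
        if gap < SESSION_GAP_SECONDS:
            session_sizes[-1] += 1
            in_session_gaps.append(gap)
        else:
            session_sizes.append(1)
            inter_session_gaps.append(gap)
    return VisitorStats(
        session_count=len(session_sizes),
        avg_pages_per_session=mean(session_sizes),
        avg_inter_session_gap=mean(inter_session_gaps) if inter_session_gaps else None,
        avg_in_session_gap=mean(in_session_gaps) if in_session_gaps else None,
        max_in_session_gap=max(in_session_gaps) if in_session_gaps else None,
        min_in_session_gap=min(in_session_gaps) if in_session_gaps else None,
    )
```

For example, requests at 0, 60, 180, 3780 and 3900 seconds form two sessions (three pages, then two pages) separated by a 3600-second gap.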

Client side

The last page request time
The current number of pages in a session
A random number between 1 and 20 which will decide on which page view to report³

EventLogging Schema: Schema:AnonymousUserSiteUsage

Notes

1. We can assume that a user connecting over HTTPS is more concerned about their privacy. We also need an opt-out mechanism.
2. From comScore data, this is the average amount of time spent on Wikipedia plus 3 standard deviations.
3. Each unique visitor only visits once. The number comes from comScore data: users typically view 20 pages per month on Wikipedia. Only data up to the mean is required because of a hypothesis that this distribution over time is log-normal; such distributions can be parameter-fitted even if only data up to the mean is included in the fit.

How will this be generated & reported

The code that does this will be distributed via CentralNotice, either in a banner or in the controller itself. Because this code is delivered to every user, this ensures that accurate counts are recorded.

The data will be stored locally in a LocalStorage object (think of this as a cookie that never gets transmitted up to the server).
- Upon transmission the data will be replaced in LocalStorage with a participation token which prevents a user from contributing more than once.

Data will be uploaded using HTTPS into the EventLogging infrastructure once the user has seen the pre-determined number of pages.
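
The reporting flow described in this section can be sketched as a small state machine. The following is an illustration under stated assumptions rather than the actual CentralNotice code: a plain dictionary stands in for the browser's LocalStorage object, the `upload` callback stands in for the HTTPS submission to EventLogging, all key names are hypothetical, and the reporting page is assumed to be counted over all page views rather than within a single session.

```python
import random
from typing import Any, Callable, Dict, Optional

SESSION_GAP_SECONDS = 24 * 60   # session threshold from the definition above
TOKEN_KEY = "participated"      # hypothetical name for the participation token


def on_page_view(storage: Dict[str, Any], now: float,
                 upload: Callable[[dict], None],
                 rng: Optional[random.Random] = None) -> None:
    """Update the locally stored state for one page view; report at most once."""
    rng = rng or random.Random()
    if storage.get(TOKEN_KEY):
        return  # this visitor has already contributed
    if "report_on_page" not in storage:
        # First page view: pick which page view (1 to 20) triggers the report.
        storage["report_on_page"] = rng.randint(1, 20)
        storage["total_pages"] = 0
        storage["session_pages"] = 0
        storage["last_request"] = None
    last = storage["last_request"]
    if last is not None and now - last >= SESSION_GAP_SECONDS:
        storage["session_pages"] = 0  # the gap was too long: a new session starts
    storage["session_pages"] += 1
    storage["total_pages"] += 1
    storage["last_request"] = now
    if storage["total_pages"] >= storage["report_on_page"]:
        upload({"session_pages": storage["session_pages"],
                "total_pages": storage["total_pages"]})
        # Replace the collected data with the participation token.
        storage.clear()
        storage[TOKEN_KEY] = True
```

A real payload would carry the aggregated session statistics; the two counters here only keep the sketch short.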

Down-Sampling

An additional die will be rolled to determine whether a user will report at all. Initially, users will have a 1:100 chance of reporting. However, to test the ability of the EventLogging infrastructure to scale, the down-sampling ratio may be reduced to as low as 1:1 (every user reports) during the study.
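
A minimal sketch of such a die roll is shown below. The function names are hypothetical. The second helper rests on an assumption: that the down-sampling rate is recorded with each report so that raw counts can be scaled back up to an estimate for the whole population.

```python
import random


def will_report(one_in_n: int, rng: random.Random) -> bool:
    """Roll once per visitor: True with probability 1 / one_in_n."""
    if one_in_n < 1:
        raise ValueError("one_in_n must be at least 1")
    return rng.randrange(one_in_n) == 0


def estimated_visitors(reports_received: int, one_in_n: int) -> int:
    """Scale a count of received reports back up by the down-sampling rate."""
    return reports_received * one_in_n
```

With `one_in_n=100` roughly one visitor in a hundred reports, and with `one_in_n=1` every visitor does.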

Older browsers

Browsers that do not support LocalStorage will not be participants. The same approach could potentially be implemented with cookies, but cookies are sent with every page request, which raises worries about exposing more information than necessary to anyone eavesdropping on the connection.

Bot (Web scraper) filtering

The client-side code will inspect the user-agent string. Clients whose user agent contains the word 'bot', a URL, or both will not report.
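
One way such a check could look is sketched below. The function name is hypothetical, and two details are assumptions not stated in this RFC: the match on 'bot' is case-insensitive, and a "URL" is recognized by an `http://`, `https://` or `www.` prefix.

```python
import re

# Assumed heuristic for "contains a URL": a scheme or a www. prefix.
_URL_PATTERN = re.compile(r"https?://|www\.", re.IGNORECASE)


def looks_like_bot(user_agent: str) -> bool:
    """Return True if the user agent contains 'bot' or something URL-like."""
    return "bot" in user_agent.lower() or bool(_URL_PATTERN.search(user_agent))
```

A substring match on 'bot' is deliberately broad, so it may also exclude a small number of legitimate clients whose user agent happens to contain those letters.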

Experiment Length

15 days. This figure again comes from comScore data: the site-wide mean minus 2 standard deviations of average sessions per month is 2.08, so assuming a linear model it will take 14.4 days to reach half of all unique visitors. The extra margin is for ease of management and buffer time for lead-in and lead-out measurements.
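
The RFC does not show how the 14.4-day figure was derived. One reading that reproduces it, assuming a 30-day month, is to divide the month by the 2.08 sessions per month:

```python
DAYS_PER_MONTH = 30          # assumption: the RFC does not state the month length
SESSIONS_PER_MONTH = 2.08    # comScore-derived lower bound quoted above

# Average spacing between sessions for a visitor at the lower bound.
days_between_sessions = DAYS_PER_MONTH / SESSIONS_PER_MONTH
print(round(days_between_sessions, 1))
```

Rounding that spacing up and adding a margin is consistent with the 15-day experiment length.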

Data Privacy

The EventLogging infrastructure does not log the user-agent string, and it stores the IP address in hashed form.

Precedent

We currently have approval to track the number of 'banners seen' by a user and report this back up to the fundraising team upon donation. This RfC is a detailed and specific extension of this approval.