Fundraising is opening this request for comments to ask whether it is acceptable, in order to support the fundraiser, to perform sampled tracking and aggregation of site visitor sessions.
The purpose of this sampled tracking and aggregation is to build a model of how readers interact with our sites over time. This request for comments (RFC) specifically asks whether it is acceptable to track users over a 15 day interval to collect statistics on the number of pages viewed per visit and the time between visits, broken down by geographical region, language, and Wikimedia Foundation property.
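As a purely illustrative sketch, a single sampled observation might carry fields like the following; the names and types here are assumptions for discussion, not the actual EventLogging schema:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical shape of one sampled, aggregated observation. Field names
# are illustrative only -- they are NOT the real EventLogging schema.
@dataclass
class SessionObservation:
    region: str                       # geographical region
    language: str                     # site language
    project: str                      # Wikimedia property, e.g. "wikipedia"
    pages_viewed: int                 # pages viewed in this visit
    days_since_last: Optional[float]  # None on the first observed visit
```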
The Wikimedia Foundation legal team has given approval for this experiment.
Fundraising would like to model the effect of immediate and historic banner delivery on future income. We hypothesize that future income can be modeled as a function of the number of unique visitors who see a banner on their first site visit. To test this hypothesis we must know how many unique visitors the site has at any given moment. To further fine-tune our efforts, it is useful to know this across languages, countries, and web properties. This data can then inform us about the long-term effects on income of temporarily increasing the number of banners shown.
Other questions we can answer
If we were to show banners from two or more campaigns, how many visitors would be expected to have seen a banner from a specific previous campaign?
With a purely random distribution of banners over visitors, how long would it take for one visitor to see two or more banners?
Answering this would inform the design of an algorithm that maximizes the length of time between a user's banner impressions.
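The second question can be explored with a small Monte Carlo sketch. The model below is an assumption for illustration (not the study's actual design): each visit independently shows a banner with probability p, and we count visits until a visitor has seen two banners; the expected count is k / p (a negative-binomial mean).

```python
import random

def visits_until_k_banners(p, k=2, rng=random.random):
    """Simulated number of visits until a visitor has seen k banners,
    assuming each visit independently shows a banner with probability p."""
    seen = 0
    visits = 0
    while seen < k:
        visits += 1
        if rng() < p:
            seen += 1
    return visits

def average_visits(p, k=2, trials=100_000):
    """Monte Carlo estimate of the expected visit count (theory: k / p)."""
    return sum(visits_until_k_banners(p, k) for _ in range(trials)) / trials

# With a 10% banner rate, a visitor needs about 20 visits on average
# before seeing a second banner.
```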
Why could this not be collected in other ways?
It could, but it would require running live tests with real banners, which we deem annoying and think should be avoided where possible. Doing it live may also affect outcomes in uncertain ways, so this approach can be thought of as variable reduction.
This is valuable in general because it can inform the required length of future studies on unrelated topics. It will also help us understand and independently validate the data we obtain from third-party analytics providers.
We can assume that a user connecting over HTTPS is more concerned about their privacy. We also need an opt-out mechanism.
From comScore data, this figure is the average amount of time spent on Wikipedia plus 3 standard deviations.
This assumes each unique visitor visits only once. The number comes from comScore data: users typically make 20 page views per month on Wikipedia. Only data up to the mean is required because of the hypothesis that this distribution over time is log-normal; such distributions can be parameter-fitted when only data up to the mean is included in the fit.
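A small simulation illustrates why data up to the mean can suffice for a log-normal (the parameters below are invented for illustration, not real traffic data): the median exp(mu) of a log-normal always lies below its mean exp(mu + sigma**2 / 2), so observations at or below the mean are enough to locate the median and hence recover mu.

```python
import math
import random
import statistics

# Illustrative parameters -- NOT real Wikipedia traffic figures.
random.seed(7)
mu, sigma = 2.0, 0.75
n = 200_000
samples = sorted(random.lognormvariate(mu, sigma) for _ in range(n))

mean = statistics.fmean(samples)
median = samples[n // 2]

# For a log-normal, median = exp(mu) < mean = exp(mu + sigma**2 / 2),
# so more than half the mass sits below the mean and the median is
# recoverable from that lower portion of the data alone.
est_mu = math.log(median)                               # recovers mu ~ 2.0
share_below_mean = sum(x <= mean for x in samples) / n  # > 0.5
```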
An additional die will be rolled to determine whether a user will report at all. This die will start with users having a 1:100 chance of reporting. However, to test the ability of the EventLogging infrastructure to scale, this sampling ratio may be raised as high as 1:1 during the study.
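A minimal sketch of this sampling die, assuming the roll happens once per browser and the outcome persists so a visitor is either in or out of the sample for the whole study. The storage key name is hypothetical, and a plain dict stands in for the browser's LocalStorage:

```python
import random

SAMPLING_RATE = 1 / 100  # starting ratio; may be raised toward 1:1

def should_report(storage, rng=random.random):
    """Roll the sampling die once and persist the outcome in `storage`
    (a dict standing in for LocalStorage; the key is hypothetical).
    Subsequent calls return the stored decision unchanged."""
    if "fr-sampled" not in storage:
        storage["fr-sampled"] = rng() < SAMPLING_RATE
    return storage["fr-sampled"]
```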
Browsers that do not support LocalStorage will not participate. The same approach could potentially be implemented with cookies, but there are concerns about exposing more information than necessary to an eavesdropper, since cookies are sent with every page request.
15 days - This comes from comScore again: the site-wide mean minus 2 standard deviations of sessions per month is 2.08; assuming a linear model, it will take 14.4 days to observe half of all unique visitors. The extra margin allows for ease of management and buffer time for lead-in / lead-out measurements.
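The 14.4-day figure follows from simple arithmetic, sketched here under the assumption of a 30-day month:

```python
# Worked arithmetic behind the 15-day window (assumes a 30-day month).
sessions_per_month = 2.08  # site-wide mean - 2 SD, from comScore
days_in_month = 30
gap_days = days_in_month / sessions_per_month
print(round(gap_days, 1))  # -> 14.4
```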
We currently have approval to track the number of 'banners seen' by a user and report this back to the fundraising team upon donation. This RfC is a detailed and specific extension of that approval.