Research:Mobile User Behavioural Differences
This research project explores two themes. The primary theme is behavioural differences between mobile and desktop readers on Wikipedia, and between different demographic groups of those readers (countries, connection types). The secondary theme is the utility of IP and User Agent combinations as a UUID. We expect the behavioural differences to interest both researchers and designers/product managers seeking to understand the impact the rise of mobile as a platform will have on our traffic, and the UUID experiments to interest researchers as a more methodological point.
The rise of the mobile web has implications for how we (and others) design software; as a platform it is more compressed, used in different circumstances, and used by a slightly different user population. We do not, however, have a good idea of what these differences are. Some historical work has looked at a small subset of the metrics we use to measure reader behaviour (session length), but it was limited by the lack of engineering support: user identification and isolation was done by combining IP address and user agent, with only a limited ability to test whether this was actually robust.
That question - whether IP/UA combos are sufficient to distinguish users - is itself interesting, because IP/UA combinations are commonly used to identify users, not just at Wikimedia but in other spaces too, where there isn't the possibility of using an actual, cookie-stored ID. Our hope is that we can also answer questions about the viability of that identification method, and how it varies between populations.
For privacy protection reasons we will look at this data with two different approaches: one to answer questions around IP entropy, and the other to answer questions around mobile versus desktop user behaviour.
The entropy questions can be answered with pre-existing data in HDFS - pageviews from Wikimedia app users, all of whom have a unique ID associated with their device. This gives us longitudinal data with a fixed UUID and varying IP addresses, meaning that we can survey IP change rates easily. One confounding element is that there is no equivalent desktop data, and mobile app users are a small and probably biased proportion of our traffic, but the concerns of Analytics Engineering around instituting new, temporary UUIDs cannot otherwise be met.
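As a rough illustration of what "surveying IP change rates" could look like, the sketch below computes, for each device, the fraction of consecutive requests on which the IP address differs. The row layout `(uuid, timestamp, ip)` and the function name `ip_change_rate` are assumptions made for this example; they do not describe the actual HDFS table layout.

```python
from collections import defaultdict


def ip_change_rate(rows):
    """Per-device share of consecutive requests where the IP changed.

    rows: iterable of (uuid, timestamp, ip) tuples (hypothetical layout).
    Returns {uuid: changes / transitions}; devices with a single request
    are skipped because no transition can be observed for them.
    """
    by_device = defaultdict(list)
    for uuid, ts, ip in rows:
        by_device[uuid].append((ts, ip))

    rates = {}
    for uuid, events in by_device.items():
        events.sort()  # order each device's requests by timestamp
        pairs = list(zip(events, events[1:]))
        if not pairs:
            continue
        changes = sum(1 for (_, a), (_, b) in pairs if a != b)
        rates[uuid] = changes / len(pairs)
    return rates
```

The same per-device rates could then be grouped by country or connection type, provided those fields are available alongside each request.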
The mobile behaviour questions can be answered with an eventlogging schema, launched on 0.0001% of readers, tagging them with a cookie that lasts 7 days. This schema will collect, every time they land on a page:
- The timestamp of the pageview;
- The platform (desktop or mobile web) that the user is browsing on.
With the HDFS data we will compare a pseudo-unique ID (computed with IP and user agent) to the actual UUID to identify the rate at which it degrades. Where possible these datapoints will then be subdivided by country, platform and connection type to get a nuanced view of the circumstances in which they vary.
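One way this comparison might be sketched is shown below. It hashes IP and user agent into a pseudo-ID, then reports two numbers: how many distinct pseudo-IDs a real device is split into on average, and what share of pseudo-IDs cover more than one real device. The use of SHA-256, the `|` separator, the row layout and both metric names are assumptions for illustration, not the method used in the project.

```python
import hashlib
from collections import defaultdict


def pseudo_id(ip, user_agent):
    """Hash an IP/user agent pair into a pseudo-unique ID (assumed scheme)."""
    return hashlib.sha256(f"{ip}|{user_agent}".encode("utf-8")).hexdigest()


def compare_ids(rows):
    """Compare pseudo-IDs against real UUIDs.

    rows: iterable of (uuid, ip, user_agent) tuples (hypothetical layout).
    Returns (fragmentation, collision_rate):
      fragmentation  - mean number of pseudo-IDs seen per real UUID
      collision_rate - share of pseudo-IDs shared by more than one UUID
    """
    pseudo_per_uuid = defaultdict(set)
    uuid_per_pseudo = defaultdict(set)
    for uuid, ip, ua in rows:
        pid = pseudo_id(ip, ua)
        pseudo_per_uuid[uuid].add(pid)
        uuid_per_pseudo[pid].add(uuid)

    if not pseudo_per_uuid:
        return 0.0, 0.0  # no data, nothing to compare

    fragmentation = (
        sum(len(p) for p in pseudo_per_uuid.values()) / len(pseudo_per_uuid)
    )
    collision_rate = (
        sum(1 for u in uuid_per_pseudo.values() if len(u) > 1)
        / len(uuid_per_pseudo)
    )
    return fragmentation, collision_rate
```

A fragmentation of 1.0 with a collision rate of 0.0 would mean the pseudo-ID behaves exactly like the real UUID over the period examined; higher values on either measure indicate degradation.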
With the EventLogging data we will be able to look at the length of each session, the amount of time on page, the number of pages viewed during a session, and the number of sessions within a fixed period (say, 24 hours).
This research has been discussed with both the Research and Mobile teams at the Wikimedia Foundation, who are very interested: the Research team because the IP entropy element has implications for, and supports, their work on unique IDs, and the Mobile team because knowledge about the differences between mobile and desktop browsing habits is useful for product decisions.
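A minimal sketch of how these metrics could be derived from cookie-tagged timestamps follows. It splits each reader's pageviews into sessions wherever the gap between consecutive pageviews exceeds an inactivity cutoff. The 30-minute cutoff, the `(cookie_id, unix_timestamp)` event layout and the function names are assumptions for this example; the proposal does not specify a session definition.

```python
from collections import defaultdict

SESSION_GAP = 30 * 60  # seconds; assumed inactivity cutoff, not from the proposal


def sessionize(events, gap=SESSION_GAP):
    """Group pageview timestamps into sessions per reader.

    events: iterable of (cookie_id, unix_timestamp) pairs (hypothetical layout).
    Returns {cookie_id: [[ts, ...], ...]} with one inner list per session.
    """
    by_user = defaultdict(list)
    for cid, ts in events:
        by_user[cid].append(ts)

    sessions = {}
    for cid, stamps in by_user.items():
        stamps.sort()
        current = [stamps[0]]
        out = [current]
        for ts in stamps[1:]:
            if ts - current[-1] > gap:
                current = [ts]  # gap too long: start a new session
                out.append(current)
            else:
                current.append(ts)
        sessions[cid] = out
    return sessions


def session_metrics(session):
    """Length in seconds, page count and inter-page gaps for one session."""
    gaps = [b - a for a, b in zip(session, session[1:])]
    return {
        "length": session[-1] - session[0],
        "pages": len(session),
        "time_on_page": gaps,
    }
```

Note that with timestamps alone the time spent on the last page of a session cannot be observed, so the time-on-page values cover every page except the final one. The number of sessions within a fixed period can be obtained by counting the sessions whose first timestamp falls inside that window.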
- February - April 2016: draft EventLogging schema and get approval for methods.
- May 2016: write and deploy data collection scripts. Collect data. Disable scripts.
- May - July 2016: Analysis and writeup.
Policy, Ethics and Human Subjects Research
There are no substantial ethical and privacy implications here, beyond the problem that Wikimedia's Engineering department already assigns UUIDs for the apps. The only additional data collection actually being done is the EventLogging schema, which only collects a timestamp and a UUID - it does not collect the page that was viewed, and so there is no risk of (for example) creating a convenient "reading list" for specific users. The only situation in which this would pose a privacy risk - by linking to actual content - is if the request logs were also leaked, which is rather improbable.
Despite this we are interested in getting rid of this data as soon as possible; the EventLogging table should be purged and removed after 90 days.
This project and its implications have been discussed with both Analytics Engineering and WMF Legal, both of whom find the risks acceptable.