Research:Unique Devices/Other Possible Implementations

From Meta, a Wikimedia project coordination wiki

Qualified Last Visited tokens

Proposal 1: qualified last access date cookie

For all requests, we will set a qualified last access date cookie for the domain of the request (expiring after 32 days), where the qualified last access date is:

   <qualified last access date> ::= <method> <project> <language> <last_access_date>
   
   <method> ::= "desktop" | "mobile"
   <project> ::= "wikipedia" | "wikibooks" | ...
   <language> ::= "en" | "pt" | ...
   <last_access_date> ::= <date> (now)
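
As a rough illustration of the grammar above, the cookie value could be assembled and read back as follows. This is only a sketch: the grammar does not say how the four fields are joined, so the "|" separator, the ISO date format, the cookie name last_access and all function names here are assumptions made for the example.

```python
from datetime import date

COOKIE_LIFETIME_DAYS = 32  # expiry stated in the proposal
SEPARATOR = "|"            # assumption: the grammar does not define a separator


def build_cookie_value(method, project, language, today):
    """Join the four grammar fields into a single cookie value."""
    return SEPARATOR.join([method, project, language, today.isoformat()])


def parse_cookie_value(value):
    """Split a cookie value into (method, project, language, last_access_date)."""
    method, project, language, raw_date = value.split(SEPARATOR)
    return method, project, language, date.fromisoformat(raw_date)


def set_cookie_header(value):
    """Render a Set-Cookie header; the cookie name is hypothetical."""
    max_age = COOKIE_LIFETIME_DAYS * 24 * 60 * 60
    return f"Set-Cookie: last_access={value}; Max-Age={max_age}; Path=/"


value = build_cookie_value("mobile", "wikipedia", "en", date(2015, 3, 14))
header = set_cookie_header(value)
```

On every request the server would overwrite the cookie with today's date, so the value a client sends is always the date of its previous visit for that method, project and language.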
   

Solves: s2, s3, t1, t2, m1, m2

Doesn't solve: s1, s4, m3

Todo

   check with App team if this could handle m3
   Ottomata talks to Varnish people about setting cookies (e.g. Brandon Black)
   [nuria] Looking at the ERB templates for VCL in Puppet, I'd say there is no problem, although the code is ugly:
   https://github.com/wikimedia/operations-puppet/blob/production/templates/varnish/text-common.inc.vcl.erb#L35

Pseudo code:

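   # Count unique clients from t1 to t2 with qualifier q.
   # Each request is expected to expose timestamp, qualifier and
   # last_access_date (None when no cookie was sent).
   def count_unique_visitors(all_requests, t1, t2, q):
       unique_visitors = 0
       for request in all_requests:
           # Skip requests outside the window or with another qualifier
           if not (t1 <= request.timestamp <= t2 and request.qualifier == q):
               continue
           # No cookie, or last access before the window: first visit in the window
           if request.last_access_date is None or request.last_access_date < t1:
               unique_visitors += 1
       return unique_visitors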
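
To make the counting rule concrete, here is a small self-contained worked example. It is a sketch under assumptions: the Request record, the way the qualifier is written as one string, and the sample dates are all invented for illustration. The point it shows is that no per-client identifier is needed; the last access date alone tells us whether a request is a client's first in the window.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Request:
    timestamp: date
    qualifier: str                     # e.g. "mobile wikipedia en"
    last_access_date: Optional[date]   # None when no cookie was sent


def count_unique(requests, t1, t2, q):
    """Count requests in [t1, t2] for qualifier q that are a client's first in the window."""
    return sum(
        1
        for r in requests
        if t1 <= r.timestamp <= t2
        and r.qualifier == q
        and (r.last_access_date is None or r.last_access_date < t1)
    )


q = "mobile wikipedia en"
requests = [
    Request(date(2015, 3, 1), q, None),               # client A, no cookie yet
    Request(date(2015, 3, 5), q, date(2015, 3, 1)),   # client A again, already counted
    Request(date(2015, 3, 9), q, date(2015, 2, 20)),  # client B, last seen before March
    Request(date(2015, 3, 9), "desktop wikipedia en", None),  # different qualifier
]
march_uniques = count_unique(requests, date(2015, 3, 1), date(2015, 3, 31), q)
```

Only the first and third requests are counted, giving two unique clients for March.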

Existing browser session cookies

Fingerprinting

The request logs, our primary logs of traffic to the Wikimedia sites, contain by default the IP address, user agent and language variant for each request. This was used during the mobile sessions analysis in early 2014 as a way of extracting "unique client" information for reconstructing sessions, which is one of the possible use cases for unique client identification. Testing done during that study showed that the inclusion of user agents and language variants, on top of IP addresses, dramatically increased the granularity of the unique client IDs.

This approach has the advantage of not requiring any implementation work; we simply take information that is already present and hash it with an appropriate algorithm. It has two primary problems. First, it requires us to keep intact user agents and IP addresses, information we may want to scrub entirely, or extract value from and then discard.[1] Second, it is not accurate for periods longer than 24 hours, which is not sufficient for reasonable unique client analysis or fundraising analysis, although it might work for (some) session analysis.
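
A minimal sketch of the hashing step might look like the following. The choice of SHA-256, the optional salt, the field separator and the function name are assumptions for illustration; the text above only says "an appropriate algorithm".

```python
import hashlib


def fingerprint(ip, user_agent, language, salt=""):
    """Hash the three logged request fields into one opaque client identifier."""
    # A separator byte keeps field boundaries unambiguous,
    # so ("a", "bc") and ("ab", "c") do not collide.
    material = "\x1f".join([salt, ip, user_agent, language])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


# Example values; 192.0.2.1 is a documentation-only address.
a = fingerprint("192.0.2.1", "Mozilla/5.0 (Android)", "en")
b = fingerprint("192.0.2.1", "Mozilla/5.0 (iPhone)", "en")
```

Two requests from the same IP address but with different user agents yield different identifiers, which is the added granularity described above. The identifier changes whenever the client's IP address or user agent changes, which is one reason it degrades over longer periods.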

Identifying tokens

Similar to fingerprinting (discussed above), we could provide a unique identifying token to each client, in the form of a cookie. This would be passed through as a new column in the request logs (or as an addition to an existing column).

In theory, this would allow us to handle all three sets of use cases: unique clients would be handled by the mere existence of the token, and the session analysis and fundraising use cases by it being associated with user requests. The problem comes with how to build in sufficient privacy protections. At its core, the idea is somewhat problematic: it would mean that a researcher could (in theory) trace all of a user's read actions within a period of time, if they knew that reader's ID. This is not behaviour we would engage in but it's behaviour we should note and protect against.

So, clearly we'd need some kind of opt-out mechanism. Do Not Track is a possibility, but the implementations vary from browser to browser, as does how that is transmitted to the server and what the default settings are.[2] Another way of doing it would be to set a "don't give me a unique ID" cookie instead, offering users the ability to request one: this would require additional engineering effort on top of building the token-distributing and saving systems.
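
The two opt-out mechanisms could be combined roughly as in the sketch below. The cookie names, the function name and the use of a random UUID as the token are all hypothetical, and headers are assumed to arrive as a dictionary keyed by the canonical "DNT" spelling; only the DNT request header with value "1" comes from existing browser behaviour.

```python
import uuid

OPT_OUT_COOKIE = "no_unique_id"  # hypothetical "don't give me a unique ID" cookie
TOKEN_COOKIE = "client_token"    # hypothetical token cookie


def token_for_request(cookies, headers):
    """Return the token to log for this request, or None if the client opted out."""
    if cookies.get(OPT_OUT_COOKIE) == "1":
        return None                      # explicit opt-out cookie
    if headers.get("DNT") == "1":
        return None                      # browser sent Do Not Track
    existing = cookies.get(TOKEN_COOKIE)
    if existing:
        return existing                  # returning client keeps its token
    return uuid.uuid4().hex              # new client gets a random token


new_token = token_for_request({}, {})
```

Requests that return None would be logged without a token, so opted-out clients would simply be invisible to unique client counts.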

Even with those protections, we'd still need an expiry limit on the cookie. If it is below 31 days, unique client counting on a month-by-month basis is impossible (a user would appear twice if their cookie expired halfway through the 30-day standard month). Making it greater than that, though, increases the privacy problems, and if we are setting a standard of "the cookie lasts as long as any of our use cases need it to", Fundraising's use cases would ideally have it last a year.
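
The double-counting problem can be seen in a small simulation. This sketch assumes the cookie's expiry is fixed when the token is issued and is not refreshed on later visits; the function name and the sample dates are invented for illustration.

```python
from datetime import date, timedelta
from itertools import count


def distinct_tokens(visit_days, lifetime_days):
    """Number of different tokens one client presents across its visits."""
    issued = count(1)            # stand-in for freshly generated tokens
    token, expires = None, None
    seen = set()
    for day in sorted(visit_days):
        if token is None or day >= expires:
            # Cookie missing or expired: the client is issued a new token
            token = next(issued)
            expires = day + timedelta(days=lifetime_days)
        seen.add(token)
    return len(seen)


visits = [date(2015, 4, 1), date(2015, 4, 20)]
with_15_days = distinct_tokens(visits, 15)
with_31_days = distinct_tokens(visits, 31)
```

With a 15-day lifetime the same client shows up under two tokens within April and would be counted twice; with a 31-day lifetime it shows up under one.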

Hybrid approach

  1. For example, replacing exact user agents (which include version information and browser plugins) with "[device] [os] [browser] [browser_major_version]", or replacing the IP address with the country it geolocates to.
  2. For a long time, DNT was the default for Internet Explorer. I'm not sure if it still is.