Talk:Edge Uniques
Slight contradictions
The third "key point" is
Readers can block or clear these cookies without affecting their reading or editing experience.
but then, about DDoS, it says
During an attack, this additional context will help us maintain site availability while minimizing disruptions to real humans.
so clearly they can affect your reading experience if you get classified as not "real humans".
Also
The design makes use of a stateless protocol to make sure that even at the edge, the cookie values including their originating IP addresses will never be stored in traffic logs or databases
but wikitech:Edge uniques acknowledges that
However, when functionally necessary, our edge servers will create derivative, irreversible, temporary one-way hashes of the long-term unique identifier, which may be forwarded to other systems
so clearly a functional equivalent of the IP address and other personal data included in the cookie can be stored elsewhere, even if only in a hashed version. So I don't see how a "stateless protocol" can make such promises. Storing hashed versions of personal data is only helpful if the salt changes regularly, for example every day as in cryptolog (used by the Internet Archive). There is a mention of a server-side key rotation once a year, but I'm not sure exactly what threat this protects from. Nemo 06:52, 12 April 2025 (UTC)
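For context on the cryptolog approach mentioned above, a minimal sketch of daily-rotating keyed hashing might look like the following. The function names and key handling here are illustrative assumptions, not cryptolog's actual implementation.

```python
import datetime
import hashlib
import hmac
import os

# Illustrative only: a cryptolog-style anonymizer that hashes an IP address
# with a secret salt rotated once per day. Within a day the same IP maps to
# the same token; after rotation, tokens can no longer be correlated across days.

_daily_salt = None
_salt_date = None

def _get_daily_salt() -> bytes:
    """Return a random salt, regenerated whenever the (UTC) date changes."""
    global _daily_salt, _salt_date
    today = datetime.datetime.now(datetime.timezone.utc).date()
    if _salt_date != today:
        _daily_salt = os.urandom(32)  # kept only in memory in this sketch
        _salt_date = today
    return _daily_salt

def anonymize_ip(ip: str) -> str:
    """One-way hash of an IP address, unlinkable across salt rotations."""
    digest = hmac.new(_get_daily_salt(), ip.encode(), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for log readability

# Example: stable today, different after the next rotation.
print(anonymize_ip("203.0.113.7"))
```

The point of the daily rotation is exactly the one made above: once the old salt is discarded, not even the operator can link yesterday's tokens to today's traffic.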
- @Nemo bis
- Re: "real humans" - The intent here is that this scheme doesn't make anything worse for real humans that the existing situation today. We sometimes have to block traffic to deal with attacks today with a heavy hand (whole ISPs or larger ranges of addresses, and/or faked UA strings belonging to a subset of real browsers, etc), and sometimes real humans are caught up in these blocks as collateral damage. This is especially true in a world where a botnet's source IPs may be the same as our actual users. In the most extreme example: in the same household you might have an IoT thermostat which has been compromised to participate in an attack on the wikis, *and* a legitimate human reader of our wikis, both sharing the same public IP address. If the reader blocks or disallows our cookies, then their situation would remain as it was before this project. But for the expected common case where many of our readers do not block the cookies and have a history of reading our projects, we could potentially block or limit that IoT device's illegitimate traffic while preserving the reader's legitimate traffic, even if they share an exact match on source IP.
- Re: "stateless" and derivative hashes: The base cookie mechanism (the one that creates these new cookies' contents, validates them on reception, etc) is stateless on our end, and the only state kept on these is in the users' own agents/browsers. When we generate a new cookie for a new agent, we keep no record of said cookie. It's sent to the user's agent and forgotten on our end. If or when it's sent back to us later, we simply trust the MAC signature in the cookie (which relies on the ~yearly-rotated server-side secret key referenced earlier) to validate the contents, and again don't keep any local log, record, or database about the cookie or the unique identifier it carries.
- Separately from the base cookie mechanism, there are the uses to which the CDN puts the validated identifiers carried by the cookies. The raw identifier itself is (again) never stored, forwarded to any other system, or recorded anywhere; any use of the raw identifier happens as some stateless transformation in our CDN edge nodes. A trivial example would be the attack scenarios referenced above: we may explicitly allow traffic that carries metadata about prior history with us before blocking the remaining traffic from a range of IPs, or we may key a local, short-term, in-memory rate-limiter (inside the CDN software) on the identifier rather than on the IP address.
- For A/B test purposes, we derive a one-way hash based on the name of the experiment hashed together with the identifier from the cookie. There is no way to work backwards from this derived hash to obtain the original identifier from the cookie. We then use the derived hash to determine experiment enrollment (e.g. the first 0.1% of the numeric space of the derived hashes for this experiment are included in the experiment). This is also a stateless mechanism: on every request, we re-calculate the derivation from scratch and re-check enrollment. For most normal requests (e.g. wiki pageviews), the only result of this enrollment check is that, if the agent is part of the selected 0.1%, the name of the experimental group is included in a header sent to both the browser and MediaWiki, with contents like "ExperimentFoo=GroupA". In these cases, which is nearly all cases, even the derived hash is not forwarded or recorded anywhere. (Rough sketches of the stateless cookie validation and of this derivation appear after this reply.)
- However, if the agent in question was part of the enrolled subset of users in an experiment, and when the client-side Javascript (which knows it's in an experiment based on the header mentioned above) decides to submit an experimental metric (such as "The user clicked the new blue button") back to specific analytics ingestion URIs (e.g. "https://en.wikipedia.org/beacon/v2/events"), only then does the CDN attach the (statelessly re-derived for this request) derived hash to the metric before forwarding it to analytics, so that the experimenters can distinguish results between unique devices under experiment when performing mathematical analysis. These metrics records are ultimately stored in an analytics database, where this derived hash is treated as PII under our normal policies (just like IP addresses on our normal traffic logs are today) and deleted after 90 days.
- These stored identifiers can't be reversed to reveal the original long-term cookie identifier (which only the users' agents/browsers keep state on), and can't be mathematically correlated across unrelated experiments over time, either. They will correlate across multiple metrics submissions from the same agent while they remain enrolled in the same ongoing experiment, so long as they don't clear cookies. If the user happens to clear cookies partway through an experiment in which they were enrolled, they will most likely drop out of the experiment at that point. If by statistical chance their new identifier was also included in the same experiment, their derived ID would not match the one they had before. BBlack (WMF) (talk) 13:27, 14 April 2025 (UTC)
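As a rough illustration of the stateless, MAC-signed cookie described above: the server keeps no record of issued cookies and simply re-verifies a signature whenever one comes back. The cookie layout, key size, and helper names below are assumptions for the sketch, not the deployed design.

```python
import base64
import hashlib
import hmac
import os
import time

# Illustrative sketch only: issue and validate a unique-identifier cookie whose
# integrity rests entirely on an HMAC over its contents, so no server-side
# state (log, record, or database) about issued cookies is needed.

SERVER_KEY = os.urandom(32)  # stand-in for a long-lived, periodically rotated secret

def issue_cookie() -> str:
    """Create a new cookie value; nothing about it is stored server-side."""
    identifier = os.urandom(16)                      # random unique ID
    issued_at = int(time.time()).to_bytes(8, "big")  # timestamp, e.g. for expiry checks
    payload = identifier + issued_at
    mac = hmac.new(SERVER_KEY, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(payload + mac).decode()

def validate_cookie(value: str) -> bytes | None:
    """Return the carried identifier if the MAC verifies, else None. Stateless."""
    try:
        raw = base64.urlsafe_b64decode(value.encode())
    except ValueError:
        return None
    if len(raw) < 56:  # 16-byte ID + 8-byte timestamp + 32-byte MAC
        return None
    payload, mac = raw[:-32], raw[-32:]
    expected = hmac.new(SERVER_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):
        return None
    return payload[:16]  # used transiently; never logged in this sketch
```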
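And a similarly hedged sketch of the per-experiment derivation and enrollment check: mixing the experiment name into a one-way hash means the result cannot be reversed to recover the cookie identifier and does not correlate across unrelated experiments. The hash choice and threshold arithmetic are assumptions, not the actual code.

```python
import hashlib

def derived_hash(experiment_name: str, identifier: bytes) -> int:
    """One-way hash of (experiment name, cookie identifier) as a large integer."""
    digest = hashlib.sha256(experiment_name.encode() + identifier).digest()
    return int.from_bytes(digest, "big")

def enrolled(experiment_name: str, identifier: bytes, fraction: float = 0.001) -> bool:
    """True if the derived hash falls in the first `fraction` of the hash space."""
    return derived_hash(experiment_name, identifier) < int(fraction * 2**256)

# Re-computed statelessly on every request; only when an enrolled agent submits
# an experimental metric would the derived hash accompany that submission.
```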
- "If the reader blocks or disallows our cookies, then their situation would remain as it was before this project." This seems unlikely to remain true. Once you start assuming that human readers don't block the cookie, you'll very likely be quicker to classify cookieless traffic as "bot" and block it. Anomie (talk) 01:24, 22 April 2025 (UTC)
- But this isn't true. We've had plenty of chances to make the approximation "cookieless implies bot" and have not done so. We instead try to apply some heuristics to identify bot behavior. And we are working to improve that. I can assure you, none of the ideas are simply using a lack of cookies. Milimetric (WMF) (talk) 14:49, 22 April 2025 (UTC)
Ad blockers
Privacy Badger appears to block cookies for a domain when they have more than 24 bits of entropy. The cookie will contain 384 bits of information (not sure how much entropy that's going to be). Are we expecting that all privacy-conscious visitors with ad blockers and other widely used protections from tracking will block this cookie? Nemo 06:52, 12 April 2025 (UTC)
- Reading the Privacy Badger FAQ, I understand it mainly blocks third-party cookies (what is proposed here is a first-party cookie). For first-party cookies, I understand that it tracks whether first-party cookies are set from third-party origins, and if there are 3 such cases with a total of more than 12 bits of entropy, then the third party is blocked (per this doc and a quick reading of heuristicblocking.js; Facebook and Google get special treatment for first-party cookies). Although this should be confirmed by reading Privacy Badger’s algorithm more carefully, I don’t think the proposed cookie will be caught by it.
- BTW I’m currently using Privacy Badger and it says that no tracker is found on en.WP, Wikidata, and Commons (the ones I checked), which is quite expected given the WMF’s policy and implementation.
- Also, if WMF’s Legal Team is interested, it may be worth checking whether the WMF complies with the EFF’s DNT standard policy and, if so, setting /.well-known/dnt-policy.txt (I guess it does, but IANAL); it would be a positive hint to Privacy Badger and other extensions. ~ Seb35 [^_^] 12:48, 12 April 2025 (UTC)
What's changed?
What has changed since the last time this was proposed and rejected that makes having uniques worth the increased privacy intrusion? Legoktm (talk) 00:18, 15 April 2025 (UTC)
- I believe the proposal you're raising was about unique cookies which would have been stored in databases, and which would have been recorded on all site requests / pageviews alongside all the other analytics metadata in our webrequest logs. This proposal is very different in all of those aspects. Do you have a link if we're not talking about the same thing here? BBlack (WMF) (talk) 01:48, 15 April 2025 (UTC)
Impossible to track vs "just trust us, bro"
I think the direction here is kind of sad. We've gone from a system where it's impossible for WMF to track people, to one where we rely on trust for WMF to do the right thing. Sure, the software is open source and whatnot, but there is no way for the average user to verify that the deployed software is the software it is supposed to be. In the old system, anyone could simply verify that the cookie does not have tracking data in it. The new system requires trust that WMF behaves the way it's supposed to. It also requires people to not screw up. WMF has accidentally logged session cookies in Varnish logs in the past, so it seems very plausible that similar mistakes could happen in the future. Similarly, we've been seeing an erosion of the rule of law and rights in the United States - this new system also requires trust that WMF is not coerced into secretly changing its behaviour. Bawolff (talk) 09:43, 29 April 2025 (UTC)
- I honestly think that we have these problems without the cookie as well. We can already fingerprint people with a mix of their user agent string and their IP address. That gives us a lot of entropy. And the only place we do that in our code is to apply heuristics that help identify poorly behaved automata, so we can exclude them from our data. That trust has already been in place for over a decade, and it hasn't been violated yet. I very much agree with the potential for this to change, for someone else to take control or for external forces to coerce us. But it can already happen without this new cookie. I hope that the cookie actually surfaces this concern more and makes it easier to pay attention. Whereas the fingerprinting I spoke of is probably unknown to most people, buried in some analytics code, this cookie is clearly visible to everyone browsing our site. So we have been talking about a "canary page" where we would detail where someone would have to look in our code to double check that we're keeping our promise of not storing these edge uniques. I'm optimistic here, but also happy to be proven wrong if someone makes a good argument. Milimetric (WMF) (talk) 13:51, 29 April 2025 (UTC)
FAQ
Hi everyone, we've started building an FAQ over here: Edge Uniques/FAQ. My plan was to post it on this talk page, but honestly, it got a bit too long for that. As of writing, the questions we're trying to address there are:
- How is this mechanism different from past ideas about creating unique tracking cookies on our sites?
- How is this cookie going to help people access the sites?
- Does blocking or clearing these cookies really not affect a reader's experience on the sites?
- How is it possible to have metadata insights about visitor history without storing the cookies in server-side logs or databases?
- How are you doing A/B Testing if you are destroying the cookie at the edge?
- What is the math behind the code?
- Why is the cookie's expiry set to 365 days? Could it be shorter?
- Are there any plans for this mechanism to honor anti-tracking headers sent by some user agents or extensions, such as DNT: 1 and/or Sec-GPC: 1?
- When readers get upset about unannounced changes from A/B tests, volunteers (not paid staff) must handle their complaints. How will the Wikimedia Foundation decrease this support burden for the volunteers?