Edge Uniques/FAQ
Frequently asked questions about the Edge Uniques project and technical implementation.
Q: How is this mechanism different from past ideas about creating unique tracking cookies on our sites?
A: The key difference in this Edge Uniques design is that the system does not collect or store first-party tracking identifiers. The system is designed to work statelessly on our servers, where the only persistent record of the cookie and its identifier exists in the user's own browser, not in our databases or logs.
Refer to wikitech:Edge uniques and gitlab:repos/sre/libvmod-wmfuniq/ for additional technical detail. See other FAQs in this document for additional detail as well.
Q: How is this cookie going to help people access the sites?
Our sites are subject to increasing volumes of undesirable automated traffic from various kinds of denial of service attacks and unfriendly scraping, which place a significant burden on both our infrastructure and our staff in trying to detect and mitigate the impact.
Unfortunately, it's also the case that a lot of the undesirable traffic comes from the same networks and IP addresses as our actual human readers. This happens because many of these automated sources use sophisticated botnets built on top of end user agents and devices (e.g. compromised PCs and IoT devices in readers' homes). An additional complication is that singular IP addresses are often network address translation (NAT) exit points, where many unrelated users or devices are all sharing IP addresses (e.g. university campus networks, some mobile carrier networks, etc).
Currently, when undesirable traffic floods our network, we identify the source IPs or IP Networks causing the problem, and either completely block or heavily rate limit all traffic from those sources, hopefully temporarily. When we do this, there is often collateral fallout in the form of legitimate, human readers on the same IPs or networks having their access throttled or denied as well.
The existence of this cookie presents our staff with a new tool which we believe will help with this situation. We can reasonably assume that the bulk of our readers will be carrying this new cookie, and it has some site history metadata (more on this in the next question!) within it, which is difficult for many of the low-effort botnets to replicate in their agents with similar patterns. We can craft network rate limiting or blocking rules which pay attention to the presence of and the metadata within these new cookies to better sort the wheat from the chaff, basically. By first explicitly allowing traffic with appropriate-looking cookies before we blanket-deny large networks to control the undesirable traffic, we can reduce the collateral damage effects of this blocking on human users.
The collateral fallout of being blocked or limited in order to preserve our infrastructure from an attack is something that can happen to anyone today. If, in the future, when these cookies are live you choose to block or clear these cookies, then you would still face those same blockage risks as before. If you don't block or clear these cookies, then it's possible that we can preserve your traffic when blocking a botnet which shares IPs or networks with you. The existence of your cookie does have an observable effect on your access in the future, but it should be a positive change from today's world for those who keep the cookie, as opposed to a negative change for those who do not – those who do not keep the cookie merely get the same experience as they would have today.
Q: Does blocking or clearing these cookies really not affect a reader's experience on the sites?
Compared to our current state, where this cookie doesn't exist, nothing really changes for readers. To understand this more clearly, some context helps:
In terms of A/B testing, if you are in an experiment group, there is a chance your experience may change as part of the experiment. Some experiments may set the likelihood of your experience being affected to a low level (e.g., a 0.01% chance), whereas in other cases it may be higher (e.g., 1%–10%). Generally, larger wikis can use lower sampling rates because of their larger populations of users, whereas smaller wikis need higher sampling rates – or longer experiment windows – to achieve statistical validity. In the past, experimentation on Wikipedia has involved limiting the tested changes to logged-in users or app users, or testing only on specific projects of differing sizes. We will continue to use these methods, but in some cases they are not representative of the 1.5 to 2 billion unique devices that view Wikipedia on a monthly basis: what works for committed editors might not work for readers, what works for iOS app users might not work for mobile browser users, and what works for Polish Wikipedia might not work for Japanese Wikipedia or Arabic Wikipedia (let alone English Wikipedia).
Q: How is it possible to have metadata insights about visitor history without storing the cookies in server-side logs or databases?
As detailed in the low-level design documentation, the WMF-Uniq cookie value is composed of several distinct numeric fields. In addition to the unique random identifier, these cookies also contain a few bits of very approximate site visit history:
- the day the identifier was first created (but not the time),
- how many weeks ago this cookie was last used (rounded to weeks),
- a rough count of how many weeks you’ve visited (not which specific weeks).
At the end of the cookie value, there is a cryptographic signature which authenticates all of these fields, and proves to our CDN edge servers that the cookie was originally generated by our own servers and hasn’t been tampered with.
Our CDN edge servers generate these cookies and process them ephemerally, but these cookies and the unique identifiers they contain are never forwarded from our CDN to other, more-complex parts of our internal infrastructure stack, and they are never stored anywhere in our infrastructure persistently, such as a log or database. The only place in which these cookies are stored persistently is in the user’s own browser. When our CDN servers process these cookies, they immediately destroy and forget them.
When metadata fields need updating (e.g. to update the weekly counter), this happens via our CDN edge server receiving and authenticating the old copy of the cookie from the browser during a normal client request to our servers, and then sending a new version of the cookie with updated metadata to the browser during a normal server response. If the client forgets or clears the cookie and never sends it back to us again, then the unique identifier and all of the site history metadata it contains is effectively lost forever, as there is no copy of it in our infrastructure. In this case, our CDN edge servers will generate a fresh new cookie with fresh history and a new random identifier. Read the discussion about X-Experiment-Enrollments, elsewhere in this FAQ, to learn what kind of data is transmitted to MediaWiki application servers and analytics servers.
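The stateless round trip described above can be illustrated with a minimal Python sketch. The dotted field layout, field names, and HMAC-SHA256-based signature here are all assumptions for illustration only; the production implementation lives in libvmod-wmfuniq and uses libsodium primitives.

```python
import hashlib
import hmac
import secrets
import time

# Illustrative signing key, known only to the CDN edge servers.
SERVER_KEY = secrets.token_bytes(32)

def sign(payload: str) -> str:
    # Authenticate the cookie fields so the edge can detect tampering.
    return hmac.new(SERVER_KEY, payload.encode(), hashlib.sha256).hexdigest()[:24]

def make_cookie(ident: str, created_day: int,
                weeks_since_use: int, weeks_visited: int) -> str:
    # Hypothetical layout: id.createdDay.weeksSinceUse.weeksVisited.signature
    payload = f"{ident}.{created_day}.{weeks_since_use}.{weeks_visited}"
    return f"{payload}.{sign(payload)}"

def refresh_cookie(cookie: str) -> str:
    # Verify the old cookie and emit an updated one. Nothing is stored
    # server-side: the browser's copy is the only persistent record.
    payload, sig = cookie.rsplit(".", 1)
    if not hmac.compare_digest(sig, sign(payload)):
        # Invalid or tampered: mint a fresh identifier with fresh history.
        return make_cookie(secrets.token_hex(16), int(time.time() // 86400), 0, 1)
    ident, created, _, weeks = payload.split(".")
    return make_cookie(ident, int(created), 0, int(weeks) + 1)

old = make_cookie("a3f1c2", 20100, 3, 7)  # created day 20100, visited 7 weeks
new = refresh_cookie(old)                 # history updated, identifier kept
```

If the browser never sends the cookie back, both the identifier and its history are gone for good, since only the browser ever held them.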
Q: How are you doing A/B testing if you destroy the cookie at the edge?
A/B testing works by mapping different portions of the possible random IDs in the WMF-Uniq cookie to a small number of fixed experiment groups.
When viewing pages on a wiki where an A/B-test is active, the CDN edge servers determine group assignments and send only the group information (via an HTTP header) to the MediaWiki backend servers. For example: X-Experiment-Enrollments: button-versus-link-2025=A. MediaWiki can then use this information to render variant A or B.
For interactions where an instrument may send an event (e.g. from MediaWiki PHP, or client-side JavaScript), any events related to the current experiment automatically include which groups a given pageview was part of. Only predetermined metrics relevant to the thing being tested are associated with these fixed group names (not everything else a user might do on the wiki). In addition to the fixed group names (e.g. "A" or "B"), the edge also attaches a separate short-term random identifier to the event, specific to a given experiment.
This allows analysis of the experiment by combining or separating data from the same browser, while minimizing the data collected: your WMF-Uniq random ID is never stored, and events from unrelated experiments, or from long periods of time, are never associated with each other. Each experiment goes through this process with a separately-derived set of values, so no anonymized identifier is ever reused across experiments, reducing the risk of correlation between A/B tests.
More detailed technical answer
Each A/B test has:
- a name (e.g., "button-versus-link-2025"),
- a set of groups (e.g., "blue button" and "text link"),
- enrollment sampling ratios for each group (e.g., 0.01% for blue button and 0.01% for text link),
- and the wikis to which the A/B test applies.
An experiment can also optionally have a shared selector (e.g. “button-experiment-series”) that can associate two related experiments which need to A-B bucket users the same way.
The sampling ratio will be applied to a set of "buckets". There are exactly 100,000 buckets in our sampling system. Suppose an experiment has an enrollment sampling ratio of 0.01% for the treatment group that gets a blue button (let's call this group "A") and 0.01% for the control group that gets a text link instead of a blue button (let's call this group "B"). In this example, 10 out of the 100,000 buckets would be eligible for the blue button and 10 out of the 100,000 buckets would be eligible for the text link. The 99,980 other buckets are out of scope for this A/B test. The buckets would look like this:
- Buckets 1–10: blue button (group "A")
- Buckets 11–20: text link (group "B")
- Buckets 21–100,000: out of scope
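The bucket ranges above amount to a simple lookup. Here is an illustrative Python sketch of that mapping (not the production configuration format):

```python
def assign_group(bucket: int) -> str:
    """Map a bucket number (1-100,000) to an experiment group."""
    if 1 <= bucket <= 10:
        return "A"            # blue button: 10 of 100,000 buckets (0.01%)
    if 11 <= bucket <= 20:
        return "B"            # text link: 10 of 100,000 buckets (0.01%)
    return "out-of-scope"     # the remaining 99,980 buckets
```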
When one of our CDN edge servers receives a web request to a wiki where an A/B test is active, and the request contains a valid WMF-Uniq cookie, the CDN will concatenate the edge unique ID and A/B test name (or shared selector, if enabled), and apply a one-way cryptographic hashing function to the resulting concatenated value. The output of this hash function is a derived unique identifier which is specific to both this agent (as identified by their cookie) and this particular experiment or related set of experiments. This new derived identifier can’t be correlated to or otherwise used to recover the original unique identifier from the WMF-Uniq cookie.
A portion of this derived identifier will be interpreted as a bucket number between 1 and 100,000. In this example, if the number is between 1 and 10, then the user is in the blue button "A" group. If the number is between 11 and 20, then the user is in the text link "B" group. If the number is between 21 and 100,000, then the user is out of scope for this experiment.
Continuing with this example, if the user is in group "A" or group "B", the CDN edge will convey the experiment group name in its forwarded request to our MediaWiki application servers with an HTTP header, like this:
X-Experiment-Enrollments: button-versus-link-2025=A
In this example, MediaWiki would render a blue button for the user. Additionally, MediaWiki's JavaScript code would record an interaction event associated with the A/B test.
For a user who has been bucketed into an experiment, MediaWiki's JavaScript will instrument actions such as "button was shown" and "button was clicked" and record an event via a beacon URL on the same domain as the wiki:
- Suppose the user had visited the following URL: https://en.wikipedia.org/wiki/Clade, and was bucketed into an A/B test.
- The web browser will send events to a URL like the following: https://en.wikipedia.org/beacon/v2/events?hasty=true
- Events sent to this "/beacon/v2/events" endpoint naturally include the WMF-Uniq cookie and will be received first by the CDN edge.
In order to know how well the group "A" blue button fares against the group "B" text link, it is necessary to convey some form of an ID to an analytics database (as this represents a distinct "user"). When a CDN edge server receives this event, and the request contains a valid WMF-Uniq cookie, the CDN edge will once again perform a concatenation of the edge unique ID and A/B test name, and then apply a hashing function to the concatenated value and verify whether the user was indeed included in this A/B test. Once verified, the CDN edge will forward the event to an analytics server with an extra HTTP header added that conveys the experiment name, group name, and the hashed value.
Such a request from the CDN edge to an analytics server may look something like the following.
POST /path/to/processor HTTP/2
Host: analytics-server
X-Experiment-Enrollments: button-versus-link-2025=A/cSbN4iEngsnYMz1vEK9O6g;
...
{
  "$schema": "/analytics/product_metrics/web/base/1.4.2",
  "dt": "2025-10-10T11:12:11.111Z",
  "meta": {
    "stream": "product_metrics.web_base.button_versus_link_2025",
    "domain": "en.wikipedia.org"
  },
  "action": "clicked",
  "experiment": {
    "enrolled": "button-versus-link-2025",
    "assigned": "A",
    "subject_id": "awaiting",
    "sampling_unit": "edge-unique",
    "coordinator": "xLab",
    "other_assigned": {}
  },
  "agent": {
    "client_platform": "mediawiki_js",
    "client_platform_family": "desktop_browser",
    "release_status": "prod"
  }
}
When the analytics server receives this request, it will substitute the hashed value from the X-Experiment-Enrollments header (cSbN4iEngsnYMz1vEK9O6g in this example) into the subject_id field's "awaiting" placeholder. This value is an encoded version of the experiment-specific, derived hash output; it is not the original long-term identifier inside the WMF-Uniq cookie, and that long-term identifier cannot be recovered or reverse-engineered from this hash value.
The above is a simplified example of what an event may contain. This YAML event schema declares the possible fields that an event may contain (n.b., we discourage including the User-Agent string in experiment-related events).
Events like the above will be stored for up to 90 days in the raw form that includes the hashed value.
When a user is included in more than one A/B test:
1. The hashes will be different for each test, by virtue of different A/B test names being concatenated with an edge unique ID and then being hashed.
2. The subject_id field in events for an A/B test will be populated only with the hashed value pertaining to the specific A/B test for which the event was recorded. The other_assigned field will contain the name and group of any other in-scope but unrelated experiment(s), but not the subject_id value(s) associated with them. This makes it possible to determine whether one A/B test is confounding the results of another, while avoiding the inclusion of multiple distinct hashes that might simplify correlation of the same user.
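Point 1 can be demonstrated with a short sketch, using Python's built-in BLAKE2 in place of the libsodium call; the concatenation scheme shown here is an assumption for illustration:

```python
import hashlib

def derived_id(edge_unique_id: str, experiment_name: str) -> str:
    # One-way hash of the edge unique ID concatenated with the experiment
    # name: the same browser yields a different value per experiment.
    data = (edge_unique_id + experiment_name).encode()
    return hashlib.blake2b(data, digest_size=16).hexdigest()

uid = "7f3a9c1e5b2d4860"  # hypothetical edge unique ID
a = derived_id(uid, "button-versus-link-2025")
b = derived_id(uid, "search-ranking-2025")  # hypothetical second experiment
# a and b differ, so events from the two tests cannot be joined on subject_id.
```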
Refer to wikitech:Edge uniques and gitlab:repos/sre/libvmod-wmfuniq/ for additional technical detail.
Q: What is the math behind the code?
The Edge Uniques system uses two well-established cryptographic tools: random number generation and hashing. Together they create unique identifiers that are practically impossible to guess or accidentally duplicate.
Edge uniques values, hashing, and bucketing relies upon:
1. Use of the open source libsodium library to generate random 128-bit sequences for edge unique IDs and salts. These are functionally very large random numbers.
2. The BLAKE2 hashing function from the open source libsodium library, which produces 128-bit hash output values given the edge unique ID and its creation day. These hash outputs are again functionally very large random numbers.
3. A 64-bit unsigned number taken from the top 64 bits of the 128-bit hash output, which is then reduced modulo (%) 100,000. The result of this modulo operation is a number in the range 0 to 99,999 (above we use bucket examples of 1–100,000, which is functionally equivalent).
Within our code as of now, wu_process_cookie and sign_and_encode_cookie use the randombytes_buf function of libsodium to generate edge unique random values and salts for hashing. The wu_abtests_proc function is responsible for determining whether an edge unique ID concatenated with an experiment name falls into an eligible bucket. It does this by calling derive_ident, which in turn uses crypto_generichash_blake2b_salt_personal to get the 128-bit hashed output of the concatenation, applies the modulo (%) operation to the top 64 bits of that hash output (a large unsigned 64-bit integer), and finally checks whether the result is within a bucket defined as in-sample for A/B test enrollment.
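Steps 2 and 3 can be sketched in Python using the standard library's BLAKE2 implementation (the production code calls libsodium's crypto_generichash_blake2b_salt_personal; the salt value, byte order, and in-sample range below are illustrative assumptions):

```python
import hashlib

NUM_BUCKETS = 100_000

def derive_bucket(edge_unique_id: bytes, experiment_name: bytes,
                  salt: bytes) -> int:
    # 128-bit BLAKE2b hash of the edge unique ID concatenated with
    # the experiment name, keyed with a salt.
    digest = hashlib.blake2b(edge_unique_id + experiment_name,
                             digest_size=16, salt=salt).digest()
    # Interpret the top 64 bits as an unsigned integer, then reduce
    # modulo 100,000 to get a bucket number in 0..99,999.
    top64 = int.from_bytes(digest[:8], "big")
    return top64 % NUM_BUCKETS

bucket = derive_bucket(b"\x01" * 16, b"button-versus-link-2025", b"demo-salt")
in_sample = bucket < 20  # e.g. 20 in-sample buckets for a 0.02% total enrollment
```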
See the cookie design at a low level for additional detail.
Q: Why is the cookie's expiry set to 365 days? Could it be shorter?
The 365-day cookie expiration serves a few important purposes: it prevents experiment data disruption from frequent cookie regeneration, maintains consistent user identification across visits, and helps prevent readers on shared IP addresses from being incorrectly blocked. While we could set a shorter expiration (e.g. 90 days), the 365-day value more clearly signals our intention to maintain a consistent way to know if a user is new or returning.
Q: Are there any plans for this mechanism to honor anti-tracking headers sent by some user agents or extensions, such as DNT: 1 and/or Sec-GPC: 1?
We don’t believe that these anti-tracking headers apply to our Edge Uniques implementation. They are intended to block third-party pervasive user tracking across multiple websites, and to prevent the sale or sharing of collected personal data to third parties. If we were to alter our sites’ behavior in response to these headers, it would imply that our default behavior with these cookies is to support the kind of pervasive user tracking these headers are intended to prevent, which it is not. In detail:
Do Not Track is deprecated and isn't treated consistently across browsers. The Electronic Frontier Foundation, which led the original proposal to create Do Not Track, has an explainer on the topic of tracking from that time which makes quite clear that the real problem this mechanism was meant to address was pervasive third-party tracking that follows users across many sites. It describes specific exceptions both for first-party “tracking” of a site’s own traffic, and for trackers which are used for functional security reasons. It even allows for third-party cookies, so long as there is some contractual legal integrity about tracking with the provider. Our cookies are not stored or logged, and therefore don’t even qualify as true first-party tracking, much less third-party; no third parties are involved at all. One could argue that the A/B test derivative’s use is a type of first-party tracking, but even so: it only applies to the small number of users in active tests, it does not record the original cookie values, and it remains, again, first-party-only, and thus outside the intended scope of DNT. We can’t honor DNT by applying it in ways users wouldn’t expect from its stated purpose.
Sec-GPC is presently experimental (as of May 2025 it is supported only in newer versions of Firefox), and furthermore is intended for the browser to indicate the user's preference for avoiding information being shared or sold. We do not share or sell user data to third parties which means that Sec-GPC's purpose doesn't apply in our context. Refer to the Wikimedia privacy policy for more information.
Q: When readers get upset about unannounced changes from A/B tests, volunteers (not paid staff) must handle their complaints. How will the Wikimedia Foundation decrease this support burden for the volunteers?
We will publish a catalog for moderators that lists all planned, active, and archived experiments including contact information (Phabricator and email) for the teams that are running the tests. This will allow volunteers to quickly identify where to send people for help instead of having to deal with it themselves. We will evaluate the effectiveness of this and make changes as needed. We’d love more collaboration on this so we can get it done right: if you have ideas or feedback about how we can make this better, please add your thoughts to a Phab task tagged with #experimentation-lab, or contact Virginia, Julie or Johan.