Differential privacy/Completed/Country-project-page/User filtering

A core component of differential privacy is determining what level of contribution will be protected within the dataset -- i.e., the privacy unit -- and enforcing sufficient filtering to provide that protection. Commonly the goal is to protect any person who might contribute data -- i.e., in a differentially-private dataset, it should not be possible to determine whether any given person has contributed data.^[1] In practice, this "person" is actually a user account, as that is how most tech platforms track who contributes what data -- e.g., a phone plus a user account with location history turned on for Google's mobility reports.^[2]

The privacy unit is not just theoretical. If the unit is a person, the dataset must be filtered such that each person contributes no more than a fixed number of data points to the entire dataset. Otherwise, the formal privacy guarantees will not hold. For most tech platforms, this is trivial to enforce -- all of their users are required to be logged in, so all actions are associated with user IDs that can easily be used for filtering. Wikimedia projects, however, generally do not require accounts, both to preserve users' privacy and to reduce barriers to access.
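As a rough illustration of the filtering described above for platforms that do have user IDs, the following Python sketch caps the number of data points each user contributes. The function name, the `(user_id, page)` tuple layout, and the limit of 2 are assumptions made for this example and do not describe any Wikimedia pipeline.

```python
from collections import Counter


def bound_contributions(events, max_per_user):
    """Keep at most max_per_user events per user ID, in arrival order.

    events: iterable of (user_id, page) tuples (hypothetical layout).
    """
    kept = []
    seen = Counter()  # number of events already kept per user
    for user_id, page in events:
        if seen[user_id] < max_per_user:
            seen[user_id] += 1
            kept.append((user_id, page))
    return kept


events = [("a", "Cat"), ("a", "Dog"), ("a", "Fox"), ("b", "Cat")]
print(bound_contributions(events, max_per_user=2))
```

Without a user ID there is nothing to group by, which is the difficulty the rest of this page addresses.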

This privacy-first policy on Wikimedia projects leads to an ironic tension: by virtue of not collecting user identification data, the strong privacy guarantees of differential privacy at the user level are much more challenging to achieve. Below is a possible approach that seeks to achieve these strong privacy guarantees without further compromising user privacy in terms of the data collected by the Wikimedia Foundation. It focuses on the use case of differentially-private datasets of pageviews -- e.g., how many times each Wikipedia article is read by individuals from a given country.

Client-side filtering

Following a rich history of creative use of generic cookies to obtain high-quality data that would usually require fingerprinting (e.g., unique devices, SessionLength metrics), the gist of this approach is to track contributions on the client side and pass a simple indicator of whether a pageview should be excluded from any differentially-private datasets. Some possibilities for implementation and potential issues are described below.

Basic design

A cookie would be established in the client's browser that keeps track of pages read and is used to determine whether a pageview should be filtered or not from differentially-private datasets. This in effect makes the privacy unit a browser, which we refer to as an actor. While we hope an actor is conceptually similar to a user, it is different in that a user may access Wikipedia from many different browsers and one browser may also be used by many different users.
Filtering would occur on a daily cadence -- i.e., any cookies would reset after no more than 24 hours. This reduces issues associated with users switching devices frequently or clearing cookies from their browser.
The filter status of a given pageview will be determined based on the client cookie and logic implemented on the Varnish servers that handle webrequests.
The only additional data that will be stored on the Wikimedia Foundation's servers will be the decision whether to include the pageview (via the x-analytics header), which can then be used for filtering the webrequest table.

At its simplest, this is a cookie that is set to 1 on the first pageview, with an expiry at midnight UTC. With each subsequent pageview it increments by 1; once it exceeds a pre-determined threshold -- e.g., 10 -- it stops being incremented, and every subsequent pageview carries an x-analytics header with a new field -- e.g., dp-do-filter. On the server side, the table used for differentially-private datasets is the subset of pageviews that do not contain this field.
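The counter logic in the preceding paragraph can be sketched in Python as follows. This is only an illustration under stated assumptions: the real logic would live in the browser cookie and in Varnish, and the names `on_pageview` and `next_midnight_utc` are hypothetical. The threshold of 10 is the example value from the text.

```python
from datetime import datetime, time, timedelta, timezone

THRESHOLD = 10  # example threshold from the text


def next_midnight_utc(now):
    """Return the next midnight UTC after `now` (a timezone-aware datetime)."""
    tomorrow = now.astimezone(timezone.utc).date() + timedelta(days=1)
    return datetime.combine(tomorrow, time.min, tzinfo=timezone.utc)


def on_pageview(cookie_count, now, threshold=THRESHOLD):
    """Update the counter cookie for one pageview.

    cookie_count is None when no cookie is present (first view of the day).
    Returns (new_count, expiry, do_filter).
    """
    previous = 0 if cookie_count is None else cookie_count
    # Stop incrementing once the counter has exceeded the threshold.
    new_count = min(previous + 1, threshold + 1)
    do_filter = new_count > threshold
    return new_count, next_midnight_utc(now), do_filter


# Simulate 12 pageviews from one browser on the same day.
now = datetime(2021, 6, 1, 13, 0, tzinfo=timezone.utc)
count, flags = None, []
for _ in range(12):
    count, expiry, do_filter = on_pageview(count, now)
    flags.append(do_filter)
print(flags, count, expiry)
```

In this sketch the first 10 pageviews are kept and every later one is flagged, while the stored counter never grows beyond one more than the threshold.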

A further benefit of this approach is that it would be easy to support user preferences that allow for private viewing sessions by automatically including the dp-do-filter field in all pageviews.
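A minimal sketch of how a private-viewing preference could reuse the same field, together with the server-side subset selection, is shown below. The dictionary representation of the x-analytics header, the function names, and the `private_viewing` flag are all assumptions for illustration; only the field name dp-do-filter comes from the text.

```python
FIELD = "dp-do-filter"  # example field name from the text


def x_analytics_fields(over_threshold, private_viewing):
    """Hypothetical: extra x-analytics fields attached to one pageview."""
    fields = {}
    # The field is set either because the daily limit was exceeded
    # or because the reader opted out via a (hypothetical) preference.
    if over_threshold or private_viewing:
        fields[FIELD] = "1"
    return fields


def dp_eligible(pageviews):
    """Server side: keep only pageviews whose header lacks the field."""
    return [pv for pv in pageviews if FIELD not in pv["x_analytics"]]


pageviews = [
    {"page": "Cat", "x_analytics": x_analytics_fields(False, False)},
    {"page": "Dog", "x_analytics": x_analytics_fields(True, False)},
    {"page": "Fox", "x_analytics": x_analytics_fields(False, True)},
]
print([pv["page"] for pv in dp_eligible(pageviews)])
```

Because both cases set the same field, the server cannot tell an opted-out pageview from one that exceeded the limit.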

Major design choices

Which pageviews

We have decided to only include an actor's first 10 unique pageviews. A challenge with our approach is ensuring a representative set of pageviews is included. In traditional approaches, all of a user's contributions would be considered and, if they exceed the limit, a random subset would be included in the dataset. This is not possible with client-side filtering, where we must decide whether to filter when the page is viewed, not post hoc. A concern is that using the first k pageviews might be an inaccurate reflection of what content is being read -- e.g., if many reader sessions start with the Main Page for a wiki and the reader clicks through to articles from there, the Main Page and pages linked from it would be overrepresented in the final dataset.
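The first-viewed bias can be made concrete with a small Python sketch that contrasts keeping the first k pageviews with keeping a random subset of size k. The session below (a Main Page view followed by 19 article views) and the function names are invented for illustration.

```python
import random


def first_k(views, k):
    """Client-side style: keep the first k pageviews as they happen."""
    return views[:k]


def random_k(views, k, rng):
    """Traditional style: keep a uniformly random subset of size k."""
    if len(views) <= k:
        return list(views)
    return rng.sample(views, k)


# Hypothetical session: the Main Page followed by 19 articles.
session = ["Main_Page"] + [f"Article_{i}" for i in range(19)]

rng = random.Random(0)
trials = 10_000
hits = sum("Main_Page" in random_k(session, 10, rng) for _ in range(trials))

print("first-k keeps Main_Page:", "Main_Page" in first_k(session, 10))
print("random-k keeps Main_Page in", hits / trials, "of trials")
```

With 20 views and k = 10, a random subset keeps the Main Page about half the time, whereas the first-k rule always keeps it.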

The more pageviews per actor we include, the smaller this first-viewed bias will be. The vast majority of readers view fewer than 10 pages per day, so all of their pageviews would be included and their data would be unbiased. Higher thresholds (larger sensitivity values) require more noise to achieve equivalent privacy guarantees and thus do not necessarily translate into higher-quality data. Furthermore, there is a legitimate argument to be made that we want the datasets to reflect what most readers are viewing, not what is being most viewed.
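To illustrate how the threshold drives the amount of noise, the sketch below assumes the Laplace mechanism, which this page does not specify, and treats the threshold k as the L1 sensitivity of the pageview counts. Under that assumption the noise scale is k/ε and its standard deviation is √2·k/ε. The ε value of 1.0 is arbitrary.

```python
import math


def laplace_scale(sensitivity, epsilon):
    """Scale b of Laplace noise for a given L1 sensitivity and epsilon."""
    return sensitivity / epsilon


def laplace_std(sensitivity, epsilon):
    """Standard deviation of Laplace(b) noise, which is sqrt(2) * b."""
    return math.sqrt(2) * laplace_scale(sensitivity, epsilon)


epsilon = 1.0  # arbitrary value for illustration
for k in (5, 10, 20):
    print(f"threshold {k}: noise std ~ {laplace_std(k, epsilon):.1f} per count")
```

Doubling the threshold doubles the noise added to every count, which is why a higher threshold does not automatically improve data quality.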

We also opted to count only unique pageviews, to further reduce this first-viewed bias, because we believe most duplicate views of a page come from reloads or going back in one's browser (as opposed to the person reading the page a second time). This noticeably complicates our cookie and algorithm but should improve the data quality.
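One way to see why unique counting complicates the cookie: a single counter is no longer enough, because the client must remember which pages have already been counted. The Python sketch below is one possible reading, not the actual design. It assumes the cookie stores a set of page IDs and that repeat views of an already-counted page are excluded.

```python
def on_unique_pageview(seen_pages, page_id, threshold=10):
    """Decide whether one pageview is included, counting unique pages only.

    seen_pages: frozenset of page IDs already counted today (hypothetical
    cookie contents). Returns (updated_seen_pages, include).
    """
    if page_id in seen_pages:
        return seen_pages, False  # reload or back-navigation: not counted again
    if len(seen_pages) >= threshold:
        return seen_pages, False  # daily limit of unique pages reached
    return seen_pages | {page_id}, True


seen, decisions = frozenset(), []
for page_id in [1, 2, 2, 3]:
    seen, include = on_unique_pageview(seen, page_id, threshold=2)
    decisions.append(include)
print(decisions)
```

Storing page identifiers on the client keeps this state out of server logs, at the cost of a larger cookie than a single counter.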

Fixed or flexible threshold

We are starting with 10 as a single fixed threshold. This allows the information passed to the server to be a simple binary "include" or "exclude" with each pageview. If more flexible thresholds are required -- e.g., 5 for some datasets but 10 for others -- there are two options. Either the count of the pageview in the session would need to be provided to the server, which carries privacy implications as it could be used to generate more accurate reader sessions on the server, or multiple thresholds would have to be supported. The latter might be a reasonable trade-off for two or three thresholds, but with more thresholds it gradually becomes equivalent to sending the pageview count.
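The trade-off in the preceding paragraph can be quantified with a short sketch: given one include/exclude flag per threshold, count how many groups of pageview positions the server could tell apart. The function names and the cap of 20 views are assumptions for illustration.

```python
def exclusion_flags(view_index, thresholds):
    """One flag per threshold: is the view_index-th view (1-based) excluded?"""
    return tuple(view_index > t for t in sorted(thresholds))


def distinct_patterns(thresholds, max_views):
    """Number of flag patterns the server could observe across a session."""
    return len({exclusion_flags(i, thresholds) for i in range(1, max_views + 1)})


for thresholds in [(10,), (5, 10), tuple(range(1, 20))]:
    groups = distinct_patterns(thresholds, max_views=20)
    print(len(thresholds), "threshold(s):", groups, "distinguishable groups")
```

With a single threshold the server learns only "within limit" or "over limit"; with a threshold at every position, the flags identify the exact position of each pageview in the session.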

Alternatives not under consideration

Storing device-specific user IDs, even if these are quickly discarded, is not under consideration at this time.
We are very hesitant to use approximate user IDs -- e.g., based on IP address and user-agent information -- as it is difficult to quantify the privacy loss this introduces, the efficacy of this is likely to shift over time (see task T242825), and the privacy costs would be unequally distributed -- e.g., mobile users whose IP changes frequently would have weaker privacy guarantees than desktop users whose IP is stable.^[3]
We would like to avoid weaker privacy guarantees such as pageview-level privacy. Our most vulnerable users are generally frequent editors -- because they may be subject to retribution for what they write, much of their data is available via their edit history, and they generate lots of pageviews -- and they would receive the least protection under pageview-level privacy.
Local differential privacy -- where data is made differentially-private before sending data to the Wikimedia Foundation -- is not currently under consideration. It would be substantially more complicated and it is unclear how it would be implemented effectively.
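
On the pageview-level privacy alternative listed above, the standard group-privacy property of pure ε-differential privacy shows why heavy readers would be the least protected. Assuming a mechanism M that is ε-differentially private at the pageview level, and two datasets D and D' that differ in the m pageviews of a single reader (the symbols M, D, D', S, and m are introduced here for illustration), the guarantee degrades linearly in m:

```latex
\Pr[M(D) \in S] \;\le\; e^{m\varepsilon} \, \Pr[M(D') \in S]
\qquad \text{for every set of outputs } S .
```

A reader with a single pageview is protected at ε, while an editor with 100 pageviews in the dataset is protected only at 100ε.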

References

[1] This can be further complicated when there are recurring dataset releases so the privacy unit then becomes e.g., a user's contributions in a given day with reduced guarantees over time. A common alternative to the user as a privacy unit might be just a single data point -- e.g., in a differentially-private dataset of pageview counts, it should not be possible to determine whether a given pageview is in the dataset. This is a weaker form of protection, however, as one might be able to determine whether a given user who contributed many pageviews is present in the data.
[2] Aktay, Ahmet; Bavadekar, Shailesh; Cossoul, Gwen; Davis, John; Desfontaines, Damien; Fabrikant, Alex; Gabrilovich, Evgeniy; Gadepalli, Krishna; Gipson, Bryant; Guevara, Miguel; Kamath, Chaitanya; Kansal, Mansi; Lange, Ali; Mandayam, Chinmoy; Oplinger, Andrew; Pluntke, Christopher; Roessler, Thomas; Schlosberg, Arran; Shekel, Tomer; Vispute, Swapnil; Vu, Mia; Wellenius, Gregory; Williams, Brian; Wilson, Royce J. (3 November 2020). "Google COVID-19 Community Mobility Reports: Anonymization Process Description (version 1.1)". arXiv:2004.04145 [cs].
[3] Saxon, James; Feamster, Nick (27 May 2021). "GPS-Based Geolocation of Consumer IP Addresses". arXiv:2105.13389 [cs].
