Differential privacy/Docs/Privacy budget accounting

An overarching goal of the differential privacy project at WMF is to introduce a strong measure of accounting to our private data releases. DP is particularly well-suited to this goal because various measures of privacy loss can be composed easily with each other. This document seeks to summarize where that goal has led us so far and answer the following questions:

  • How do you track cumulative privacy loss across mechanisms that might use different noise distributions?
  • More generally, what academic resources exist on privacy budget management?

Tracking cumulative privacy loss across multiple mechanisms

Typically, the cumulative privacy loss of running multiple mechanisms is determined through a process that has two key components:

  1. The privacy properties of each mechanism must be characterized under one or more variants of differential privacy.
  2. The cumulative privacy loss of running all of the mechanisms must be determined (using composition rules).

The first step is to map each mechanism to a privacy definition. In general, the same noise mechanism can satisfy multiple privacy definitions. Basic mechanisms, like the Laplace and Gaussian mechanisms, have been analyzed in the literature under different privacy definitions (a small numeric sketch of these characterizations follows the list below):

  • Mironov, "Rényi Differential Privacy"[1] analyzes the Laplace and Gaussian mechanisms under Rényi DP
  • Bun and Steinke, "Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds"[2] analyzes the Gaussian mechanism under zCDP
  • Dong, Roth, and Su, "Gaussian Differential Privacy"[3] (f-differential privacy) analyzes stochastic gradient descent (and its subcomponents, Gaussian noise and subsampling) under Gaussian DP.
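
Each of these characterizations reduces to a simple closed-form expression in the mechanism's noise scale. The following is our own minimal sketch of the standard formulas, not code from the cited papers:

  # Laplace mechanism with scale b on a query with L1 sensitivity s
  # satisfies (s/b)-pureDP (Dwork and Roth[4], Thm. 3.6).
  def laplace_pure_dp(sensitivity: float, scale: float) -> float:
      return sensitivity / scale

  # Gaussian mechanism with noise sigma on a query with L2 sensitivity s
  # satisfies (alpha, alpha * s**2 / (2 * sigma**2))-RenyiDP for any
  # alpha > 1 (Mironov[1]).
  def gaussian_rdp(sensitivity: float, sigma: float, alpha: float) -> float:
      return alpha * sensitivity**2 / (2 * sigma**2)

  # The same Gaussian mechanism satisfies (s**2 / (2 * sigma**2))-zCDP
  # (Bun and Steinke[2], Prop. 1.6).
  def gaussian_zcdp(sensitivity: float, sigma: float) -> float:
      return sensitivity**2 / (2 * sigma**2)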

Furthermore, there are also rules that allow one to convert a characterization under one privacy definition to another. For instance, if mechanism A is ρ-zCDP, it is also an (ε, δ)-approxDP mechanism, where the parameters ε and δ are functions of ρ. See Lemma 3.5 of “Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds”[2] for this particular conversion.
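
Written out, Lemma 3.5 gives ε = ρ + 2·sqrt(ρ·ln(1/δ)) for any choice of δ > 0. A minimal sketch of the conversion:

  import math

  # Lemma 3.5 of Bun and Steinke[2]: a rho-zCDP mechanism is also
  # (rho + 2*sqrt(rho*ln(1/delta)), delta)-approxDP for any delta > 0.
  def zcdp_to_approx_dp(rho: float, delta: float) -> float:
      return rho + 2 * math.sqrt(rho * math.log(1 / delta))

  # Example: a 0.1-zCDP mechanism, converted at delta = 1e-6.
  print(zcdp_to_approx_dp(rho=0.1, delta=1e-6))  # ~2.45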

For the second step, there are many such composition rules, and they vary depending on the variant of differential privacy (pureDP, approxDP, zCDP, etc.). Composition rules can be found in the previously cited resources, as well as in Dwork and Roth’s textbook, "The Algorithmic Foundations of Differential Privacy."[4]
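
For reference, here is a sketch of three commonly used rules, assuming the statements (and theorem numbering) in the cited texts:

  import math

  # Basic composition (Dwork and Roth[4], Thm. 3.16): the epsilons and
  # deltas of approxDP mechanisms simply add up.
  def basic_composition(eps_deltas):
      return (sum(e for e, _ in eps_deltas), sum(d for _, d in eps_deltas))

  # zCDP composition (Bun and Steinke[2], Lemma 1.7): the rhos add up.
  def zcdp_composition(rhos):
      return sum(rhos)

  # Advanced composition (Dwork and Roth[4], Thm. 3.20): k runs of an
  # (eps, delta)-DP mechanism are (eps_prime, k*delta + delta_prime)-DP.
  def advanced_composition(eps, delta, k, delta_prime):
      eps_prime = (eps * math.sqrt(2 * k * math.log(1 / delta_prime))
                   + k * eps * (math.exp(eps) - 1))
      return (eps_prime, k * delta + delta_prime)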

The order in which these steps occur can affect the final overall privacy guarantee. For instance, suppose mechanisms A and B are zCDP, mechanism C is pureDP, and we ultimately want an approxDP guarantee. One could combine them in at least three possible ways:

  1. convert A, B, and C to approxDP and then compose using an approxDP composition rule
  2. compose A and B using a zCDP composition rule, convert the result to approxDP, convert C to approxDP (i.e., set δ = 0), and then compose them using an approxDP composition rule
  3. convert C to zCDP, compose A, B, and C under zCDP, and then convert the result to approxDP

Each approach may give slightly different final values for the parameters.
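
To make that concrete, here is a sketch with purely illustrative parameters (ρ = 0.05 for A and B, ε = 0.5 for C, overall δ = 1e-6), using Lemma 3.5 and the pureDP-to-zCDP conversion of Bun and Steinke[2] (Prop. 1.4):

  import math

  def zcdp_to_approx_dp(rho, delta):
      # Lemma 3.5 of Bun and Steinke[2]
      return rho + 2 * math.sqrt(rho * math.log(1 / delta))

  rho_a, rho_b = 0.05, 0.05  # A and B are rho-zCDP
  eps_c = 0.5                # C is pureDP, i.e. (0.5, 0)-approxDP
  delta = 1e-6               # overall delta target

  # Route 1: convert A and B to approxDP separately (delta/2 each),
  # then basic composition with C (which has delta = 0).
  eps_1 = 2 * zcdp_to_approx_dp(rho_a, delta / 2) + eps_c

  # Route 2: compose A and B under zCDP first, convert, then add C.
  eps_2 = zcdp_to_approx_dp(rho_a + rho_b, delta) + eps_c

  # Route 3: convert C to zCDP (pure eps-DP implies (eps**2/2)-zCDP,
  # Prop. 1.4), compose all three under zCDP, then convert.
  eps_3 = zcdp_to_approx_dp(rho_a + rho_b + eps_c**2 / 2, delta)

  print(eps_1, eps_2, eps_3)  # roughly 4.01, 2.95, 3.75

With these (arbitrary) numbers, route 2 happens to give the smallest final ε; with other parameters a different route can win, which is why the ordering matters.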

Note that there are some subtleties with composition rules in differential privacy, as many of them make implicit assumptions about what is fixed and what is variable. For instance, the optimal composition theorem for approxDP[5] assumes the privacy parameters are fixed in advance; if they are instead chosen adaptively, the rule does not apply. For more details on the subtleties of composition (and a toy filter sketch after the list below), see:

  • Rogers et al., "Privacy Odometers and Filters: Pay-as-you-Go Composition"[6]
  • Vadhan and Wang, "Concurrent Composition of Differential Privacy"[7]
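
As a rough illustration of the filter idea from Rogers et al.[6], here is a toy sketch that admits adaptively chosen pureDP queries until a fixed budget would be exceeded. It uses simple additive composition only; the filters in the paper are tighter:

  class BasicPrivacyFilter:
      """Toy pay-as-you-go filter for adaptively chosen pureDP queries."""

      def __init__(self, epsilon_budget: float):
          self.budget = epsilon_budget
          self.spent = 0.0

      def try_spend(self, epsilon: float) -> bool:
          if self.spent + epsilon > self.budget:
              return False  # HALT: refuse to run the mechanism
          self.spent += epsilon
          return True

  budget_filter = BasicPrivacyFilter(epsilon_budget=1.0)
  assert budget_filter.try_spend(0.4)      # admitted; 0.6 remaining
  assert not budget_filter.try_spend(0.7)  # refused: would exceed budget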

Tracking cumulative privacy loss with Tumult Analytics

In tmlt.analytics, privacy loss is tracked by the Session. The Session is responsible for calculating the privacy loss of each query (i.e., noise mechanism) and mapping it to the privacy loss parameters of the Session. Session currently supports pureDP and zCDP; approxDP is on the near-term roadmap. The Session also tracks (and bounds) cumulative privacy loss.
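
For illustration, here is a minimal sketch based on the public Tumult Analytics tutorials. Argument names and required parameters have changed across releases (newer versions, for instance, also take a protected_change argument), so consult the documentation for the version in use:

  from pyspark.sql import SparkSession
  from tmlt.analytics.privacy_budget import PureDPBudget
  from tmlt.analytics.query_builder import QueryBuilder
  from tmlt.analytics.session import Session

  spark = SparkSession.builder.getOrCreate()
  pageviews = spark.createDataFrame(
      [("Main_Page",), ("Earth",), ("Earth",)], ["page"]
  )

  # A Session with a total budget of epsilon = 1 under pureDP.
  session = Session.from_dataframe(
      privacy_budget=PureDPBudget(epsilon=1.0),
      source_id="pageviews",
      dataframe=pageviews,
  )

  # Each evaluate() call draws from the Session's budget; the Session
  # refuses queries once the budget is exhausted.
  count_query = QueryBuilder("pageviews").count()
  answer = session.evaluate(count_query, privacy_budget=PureDPBudget(epsilon=0.5))
  print(session.remaining_privacy_budget)  # epsilon = 0.5 left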

One can think of a Session as a black-box mechanism that satisfies some DP privacy guarantee. Thus, one could combine the privacy guarantees across Sessions using the aforementioned composition rules. Providing support for this “inter-Session” privacy loss accounting is also on Tumult Labs’ roadmap.
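
Until that lands, a back-of-the-envelope sketch: treating each Session as a black-box pureDP mechanism, basic composition bounds the overall loss by the sum of the per-Session budgets.

  # Illustrative only: two Sessions run as black-box pureDP mechanisms.
  session_epsilons = [1.0, 0.5]          # per-Session pureDP budgets
  total_epsilon = sum(session_epsilons)  # 1.5-pureDP by basic composition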

Privacy Budget Management

There are several aspects to privacy budget management. Some considerations include, but are not limited to, the following:

  1. Identifying the data source and the entity that is being protected
  2. Tracking cumulative privacy loss
  3. Deciding privacy parameters

The first point is probably the most subtle and the hardest to operationalize. If one has a single data source and wishes to apply standard differential privacy, then this step is straightforward. But in practice, an organization often wants to perform DP releases across multiple, related data sources. Placing all of these releases into a single privacy budget management framework can be challenging.

Such challenges exist at Wikimedia. For instance, two data sources, pageview_hourly and pageview_actor, both capture information about which pages users have viewed. The pageview_actor table contains an actor_signature field, which makes it possible to bound the contributions per actor. In contrast, the historical pageview_hourly table lacks this field, which limits how differential privacy can be instantiated on that table. Further, if one wanted to assess the cumulative privacy risk to an individual who appears in both of these datasets, it would be challenging, because one cannot easily characterize releases from these datasets under the same privacy model.
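
As an illustration of why the actor_signature field matters, here is a hypothetical PySpark sketch (the schema and the cap are made up) that keeps at most a fixed number of rows per actor, which is what bounds each actor's influence on a released count:

  from pyspark.sql import SparkSession, functions as F
  from pyspark.sql.window import Window

  spark = SparkSession.builder.getOrCreate()

  # Hypothetical schema: one row per (actor_signature, page) view event.
  views = spark.createDataFrame(
      [("a1", "Main_Page"), ("a1", "Earth"), ("a1", "Moon"), ("a2", "Earth")],
      ["actor_signature", "page"],
  )

  # Keep at most MAX_ROWS rows per actor, so any one actor can change a
  # released count by at most MAX_ROWS -- the sensitivity bound that
  # pageview_hourly cannot offer without an actor identifier.
  MAX_ROWS = 2
  w = Window.partitionBy("actor_signature").orderBy(F.rand())
  bounded = (
      views.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") <= MAX_ROWS)
      .drop("rn")
  )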

We are not aware of available software that provides sufficient flexibility for this critical aspect of privacy budget management.

For the second aspect, tracking cumulative privacy loss, the composition rules discussed above can be used. Again, this comes with the important caveat that those composition rules generally assume that all the mechanisms are applied to the same data source. If one is trying to manage privacy loss across different but related data sources, the problem becomes harder.

The final aspect, deciding privacy parameters, is more of a policy question where, unfortunately, there are not a lot of easy answers. Some resources that may be helpful here include the following:

Here are some other resources related to privacy budget management:

References and cited papers