Differential privacy/Completed/GDI Equity Dashboard/Problem statement

Since 2021, the Global Data and Insights Team (GDI team) has been working on publishing more statistics about WMF's work to support the global free knowledge movement. To that end, they asked the Privacy Team to support them in publishing differentially-private statistics about WMF-issued grants and grantees since 2009. Much of this data is publicly available on-wiki, but other grants are issued confidentially due to concerns such as grantee nationality or activity. As such, the strong guarantees of differential privacy are necessary in order to maintain confidentiality and grantee privacy.

Problem description[edit]

Input data[edit]

GDI has provided a rich dataset of thousands of grants for analysis. Each row represents a single grant and contains the following relevant features (a sketch of a few example rows follows the list):

  • Year: calendar year in which the grant was given, ranging from 2009 to 2023
  • Subcontinent: the UN subcontinent of the grantee
  • USD over grant life: the total amount of money given over the life of the grant (note: grants can span more than one year, but for this data release each awarded grant is treated as if it were given entirely in a single year)
  • Unique grantee ID: a pseudonymous numeric identifier associated with each grantee, used to limit the number of grants that a single grantee can contribute to the dataset
  • Organization type: the type of organization receiving the grant (user groups, allied organizations, individuals, etc.) — we are seeking to protect individual grantees
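
To make this schema concrete, the sketch below shows what a few rows could look like as a pandas DataFrame. The column names and values are illustrative assumptions for this page, not real grant data:

import pandas as pd

# Illustrative rows only: not real grants, grantees, or amounts.
grants = pd.DataFrame([
    {"year": 2012, "subcontinent": "Eastern Asia", "usd_over_grant_life": 1800.0,
     "grantee_id": 101, "org_type": "individual", "self_disclosure": False},
    {"year": 2016, "subcontinent": "South America", "usd_over_grant_life": 950.0,
     "grantee_id": 102, "org_type": "user group", "self_disclosure": True},
    {"year": 2022, "subcontinent": "Western Africa", "usd_over_grant_life": 3100.0,
     "grantee_id": 103, "org_type": "individual", "self_disclosure": True},
])
print(grants.head())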

Desired output[edit]

We are hoping to release data in the following format (the data below is not real):

Example data for GDI grants DP release
Time period | UN subcontinent | Noisy grant sum (USD) | Noisy grant count (grants)
2009-2014   | Eastern Asia    | $12,740               | 23
2015        | South America   | $2,563                | 6
...         | ...             | ...                   | ...
2022        | Western Africa  | $5,325                | 5

The intent is to re-run this DP script each year in order to publish statistics for each new year of data.

Auxiliary data[edit]

  • Country protection list: a public list of countries for which WMF does not release data; grants from these countries are excluded from this data release (see the pre-filtering sketch after this list)
  • Location self-disclosure: many WMF grant applicants disclose their location on the public wiki pages where they apply for the grant; we can presume that we only need to protect the location of those individuals who have not already disclosed it
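
A minimal sketch of how the country protection list could be applied as a pre-filter, assuming a pandas DataFrame shaped like the input data above with a hypothetical country column, and a country_protection_list set (both names are assumptions for this example, not part of the actual pipeline):

import pandas as pd

def prefilter_protected_countries(db: pd.DataFrame, country_protection_list: set) -> pd.DataFrame:
    """Drop grants issued in countries on the Country Protection List before any aggregation."""
    return db[~db["country"].isin(country_protection_list)]

# The self-disclosure flag is kept as a column rather than filtered out here,
# because it is used later to decide which grantees need DP protection.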

Requirements[edit]

The requirements for this DP algorithm fall into three categories: utility requirements, privacy requirements, and performance requirements. We will consider each in turn.

Utility requirements[edit]

In order to measure utility for this algorithm, we've used many of the metrics defined on the DP error metrics page, specifically:

  • median relative error
  • bias
  • dropped rows percent
  • spurious rows percent

For all of the listed metrics, the utility requirement is an average error of ±5% or less when aggregated at the subcontinental level. This means that the median relative error should be <5% and the overall bias of the dataset should be <5%. No more than 5% of the rows in the released dataset should be "hallucinations" (spurious rows), and no more than 5% of the rows that should have been released should be randomly dropped (dropped rows).
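
As a rough illustration of how these checks could be computed, the sketch below compares a non-noisy aggregate table against the released (noisy) table. The function and column names are assumptions made for this example, not the actual error-metric implementation:

import pandas as pd

def error_metrics(true_df: pd.DataFrame, noisy_df: pd.DataFrame, keys: list, value: str) -> dict:
    """Compare a non-private aggregate to its DP release on a single value column."""
    merged = true_df.merge(noisy_df, on=keys, how="outer",
                           suffixes=("_true", "_noisy"), indicator=True)
    both = merged[merged["_merge"] == "both"]  # rows present in both tables
    diff = both[f"{value}_noisy"] - both[f"{value}_true"]
    return {
        "median_relative_error": (diff.abs() / both[f"{value}_true"]).median(),
        "bias": diff.sum() / both[f"{value}_true"].sum(),
        "dropped_rows_pct": 100 * (merged["_merge"] == "left_only").sum() / len(true_df),
        "spurious_rows_pct": 100 * (merged["_merge"] == "right_only").sum() / len(noisy_df),
    }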

Privacy requirements[edit]

The goal of this data release is to provide privacy to WMF grantees who meet the following criteria:

  • individuals
  • who have not already publicly disclosed the location of their grant
  • and are not located in one of the countries on the Country Protection List

on the level of a person-time partition. A time partition is usually one year, but because grant data is sparse earlier on, 2009-2014 are considered one time partition. This privacy guarantee requires bounding the contribution that any given grantee can make to a given time partition of the dataset. Privacy loss for a grantee will increase if they appear in multiple partitions (i.e. in multiple years).

Two queries will be made on the underlying dataset (a sum query and a count query), which further increases the privacy loss through composition.
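
Under basic sequential composition, the per-partition privacy loss is the sum of the epsilons spent on the count and sum queries. Below is a minimal sketch of the contribution-bounding step this guarantee relies on, assuming a pandas DataFrame with the columns described above plus a hypothetical time_partition column (one value per year, with 2009-2014 collapsed into a single value):

import pandas as pd

def limit_contrib(private: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Keep at most n grants per (grantee, time partition), chosen by random shuffle."""
    return (
        private
        .sample(frac=1.0, random_state=seed)  # shuffle so the retained grants are arbitrary
        .groupby(["grantee_id", "time_partition"], group_keys=False)
        .head(n)
    )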

Performance requirements[edit]

There are no stringent performance requirements for this data release, because it will largely be a one-off historical release of data. As with other DP releases, we are using a custom Spark instance that represents roughly 20% of the available compute. This job (as well as the error metric calculation) should run in under 10 minutes, and should be easy to re-run in future years as new data becomes available.

Algorithm[edit]

Hyperparameters[edit]

The grants sum and grants count queries largely share the same hyperparameters, which are listed below:

  • (for grants sum only) Grant amount range:
    • a range (currently $460-3450) constraining the possible values of the USD over grant life column
    • values below $460 are increased to $460, and values above $3450 are decreased to $3450 (see the clamping sketch after this list)
    • set based on real-world data, with the lower and upper bounds corresponding roughly to the 10th and 90th percentiles of the underlying dataset
  • Grants per person-time partition:
    • currently
    • used to limit the number of rows corresponding to a given unique grantee ID in a given time partition, to enable a strict upper bound on privacy loss
  • Noisy suppression threshold:
    • minimum number of noisy grants or noisy dollars required for a row to be released publicly
    • for grants, 0 (matching the release thresholds in the pseudocode below)
    • for dollars, $2,000 (matching the release thresholds in the pseudocode below)
  • Epsilon:
    • Privacy loss per time partition
    • for grants, 1 (matching the epsilons in the pseudocode below)
    • for dollars, 1.2 (matching the epsilons in the pseudocode below)
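
As a small illustration of the clamping described in the grant amount range bullet above, a sketch assuming NumPy and the currently listed bounds (the function name is made up for this example):

import numpy as np

LOWER, UPPER = 460.0, 3450.0  # current clamping range in USD, roughly the 10th and 90th percentiles

def clamp_amounts(usd_over_grant_life: np.ndarray) -> np.ndarray:
    """Clamp each grant amount into [LOWER, UPPER] so the DP sum has bounded sensitivity."""
    return np.clip(usd_over_grant_life, LOWER, UPPER)

# Example: amounts of $100, $2,000, and $9,000 become $460, $2,000, and $3,450.
print(clamp_amounts(np.array([100.0, 2000.0, 9000.0])))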

Pseudocode and explanation[edit]

The following (python-ish) pseudocode illustrates the algorithm that will be used for this DP data release:

def grants_dp(
  db,  # dataset of grants containing year, subcontinent, amount, org_type, self_disclosure, and grantee_id (excluding country protection list)
  n,   # number of grants per time partition to protect
  R,   # list [460, 3450] containing limiting sum range
  eps, # list [1, 1.2] of epsilons for each aggregation
  t    # list [0, 2000] of release thresholds
):
  # private is individuals who haven't self-disclosed
  private = db.filter((org_type == "individual") & (self_disclosure == False))

  # nonprivate is non-individuals or individuals who have self-disclosed
  nonprivate = db.filter((org_type != "individual") | ((org_type == "individual") & (self_disclosure == True)))

  # limit to just n grants per time partition
  private = limit_contrib(private, n)

  # do DP aggregations
  private_count = dp_count(eps[0], private)
  private_sum = dp_sum(eps[1], private, range=R)

  # do non-DP aggregations 
  nonprivate_count = normal_count(nonprivate)
  nonprivate_sum = normal_sum(nonprivate)

  # sum all rows to get final count and sum
  final_count = private_count + nonprivate_count
  final_sum = private_sum + nonprivate_sum

  # keep only the rows above the release thresholds
  final_count = final_count.filter(count > t[0])
  final_sum = final_sum.filter(sum > t[1])

  # join the columns (allowing for inconsistent values) to return
  return final_count.join(final_sum, on=['year', 'subcontinent'], how='outer')

This algorithm takes advantage of the fact that once a dataset has been made differentially private, one can conduct any non-private transformation on the result without affecting the differential privacy guarantee (the post-processing property).

In the preprocessing step, individual grantees who have not self-disclosed their location are separated from organizational grantees and individuals who have self-disclosed their location. The algorithm then does parallel private and nonprivate aggregations, adding them together afterwards to get a joint private-nonprivate sum that still has DP protection.

At the very end, the sum and count aggregations are outer-joined together, providing one table with all of the privatized information. Note: the fact that these aggregations are conducted separately from each other may lead to inconsistencies (e.g. a row where a grant count is reported but a sum is not, or vice versa). This is by design, and means that the noise introduced by the algorithm is providing protection to those grantees who need it.
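
For illustration, a call to the pseudocode function above might look like the sketch below. The hyperparameter values are taken from the comments in the pseudocode; dp_count, dp_sum, and the other helpers remain pseudocode, so this is a usage sketch rather than runnable code:

# grants_df is assumed to be the pre-filtered dataset (country protection list already excluded);
# n_per_partition is the per-grantee contribution bound from the hyperparameters above.
released = grants_dp(
    db=grants_df,
    n=n_per_partition,
    R=[460, 3450],  # clamping range for USD over grant life
    eps=[1, 1.2],   # epsilons for the count and sum queries, respectively
    t=[0, 2000],    # noisy suppression thresholds for count and sum
)

Re-running the release for a new year would then amount to appending the new grants data and repeating this call.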