Differential privacy/Proposed/Future work (15 August 2022)

Although the theory of differential privacy (DP) dates back to 2006, DP software is still a relatively nascent technology, both at WMF and at other technology and data organizations. Because DP, particularly when operating at terabyte scale on distributed computing infrastructure, is an early-stage technology, many theoretical and implementation problems remain unsolved (or have not even been described as problems).

These facts underlie a three-pronged approach toward thinking about WMF's future DP work:

  1. Continue publishing differentially private data releases
  2. Increase automation and velocity in the realms of DP experimentation, data publication, and privacy controls
  3. Work on documenting, socializing, educating, and publicizing WMF's work on differential privacy

We'll go through these prongs one by one, and then discuss other possible uses of differential privacy at WMF.

DP data releases

A large portion of future work on differential privacy at WMF is focused on increasing the number of differentially private data products we make available. The proposed releases largely fall into two categories: (1) releasing data that was previously deemed too sensitive to release, and (2) revamping existing datasets that are currently released without formal privacy guarantees.

Some plausible examples of case 1:

  • releasing pageview counts from the pageview_hourly dataset, grouped by country and page ID, covering May 2015 to January 2021 (a minimal sketch of this kind of release appears after these lists)
  • working with the Global Data and Insights Team to release global data about how WMF has historically given grants while ensuring that individual grantees' information is protected

Some plausible examples of case 2:

  • revamping the geoeditors_monthly public data release so that it has a strong guarantee of privacy and doesn't need to rely on bucketing
  • revamping the clickstream dataset (or using other differentially private methods, like synthetic data generation) to enable the release of 3-page, 4-page, etc. sessions
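
Both categories ultimately reduce to publishing noisy aggregates. As a rough illustration of case 1, the sketch below applies the Laplace mechanism to pre-aggregated per-(country, page ID) counts. Everything here is hypothetical: the function name, the groups, and the counts. It also assumes each user contributes at most one pageview per group, which bounds the sensitivity at 1; a production release would enforce contribution bounds and use a vetted DP library rather than hand-rolled noise.

  import numpy as np

  def dp_counts(counts, epsilon, sensitivity=1.0, rng=None):
      """Release counts via the Laplace mechanism.

      sensitivity is the most one user can change any single count
      (1 here, assuming at most one contribution per user per group).
      """
      rng = rng or np.random.default_rng()
      noise = rng.laplace(scale=sensitivity / epsilon, size=len(counts))
      noisy = np.asarray(counts, dtype=float) + noise
      # Noisy counts can go negative; clamp and round before publication.
      return np.maximum(np.round(noisy), 0).astype(int)

  # Hypothetical per-(country, page ID) pageview counts.
  groups = [("FR", 1234), ("DE", 1234), ("FR", 5678)]
  true_counts = [150, 47, 3]
  print(dict(zip(groups, dp_counts(true_counts, epsilon=1.0))))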

One potential future use of differential privacy that falls outside both of the cases above is in "encrypting" historical financial data. Data that is currently stored at the level of individual transactions would be aggregated and anonymized using DP, but not released. Differential privacy would be useful here because it would compress the data and ensure that a breach of the aggregated financial data could not harm individual donors.
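
To make the aggregation step concrete, here is a hedged sketch of a bounded-contribution noisy sum: each donor's total is clamped to a cap, so no single donor can shift the result by more than that cap, and Laplace noise scaled to the cap is added. The cap, budget, and amounts are all invented for illustration.

  import numpy as np

  def dp_sum(per_donor_totals, epsilon, cap, rng=None):
      """Noisy sum with bounded contribution: clamping each donor's
      total to cap makes the L1 sensitivity of the sum equal to cap."""
      rng = rng or np.random.default_rng()
      clamped = np.clip(per_donor_totals, 0.0, cap)
      return clamped.sum() + rng.laplace(scale=cap / epsilon)

  # Hypothetical per-donor donation totals (USD).
  totals = np.array([25.0, 5.0, 300.0, 12.5])
  print(round(dp_sum(totals, epsilon=0.5, cap=100.0), 2))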

DP automation

All of the data releases listed above will likely require intensive work to get over the finish line. Right now, DP is a highly specialized subfield, and only one engineer at WMF (Hal Triedman) has the expertise to oversee DP data releases. As such, future work on differential privacy must also include meta-work to productionize our process.

Ideally, each data release would not be a bespoke process. Instead, there would be a common formula for most data releases, enabling large sections of the process to be automated. Future work in this area would include automating DP hyperparameter experimentation and data publication. For experimentation, instead of manually specifying and re-coding a set of error functions, hyperparameters, etc. for each data release, these parameters could be made reusable across releases. For data publication, code templates for Airflow processes could be created, making publication significantly easier.
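
As a thought experiment, such a common formula might take the shape of a declarative release specification that shared tooling consumes; the class and field names below are purely illustrative, not an existing WMF API.

  from dataclasses import dataclass, field
  from typing import Callable, Sequence

  @dataclass
  class DPReleaseSpec:
      """Hypothetical declarative spec a shared DP pipeline could consume."""
      name: str
      group_by: Sequence[str]            # e.g. ("country", "page_id")
      epsilon: float
      delta: float = 0.0
      min_noisy_count: int = 1           # suppress tiny groups after noising
      error_metrics: Sequence[Callable] = field(default_factory=list)

  # Declaring a new release becomes configuration rather than new code;
  # an Airflow template could then be parameterized by specs like this one.
  pageviews_spec = DPReleaseSpec(
      name="pageviews_by_country_and_page",
      group_by=("country", "page_id"),
      epsilon=1.0,
      error_metrics=[lambda true, noisy: abs(true - noisy) / max(true, 1)],
  )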

Another realm of automation could come in the form of privacy controls. Differentially private releases are characterized by parameters known as epsilon (ε) and delta (δ), which quantify the privacy risk to an individual whose data is contained in the released dataset. In principle, these parameters are composable: doing two data releases with privacy risks (ε1, δ1) and (ε2, δ2) leads to a net privacy risk of at most (ε1 + ε2, δ1 + δ2). In practice, it's not so simple. Regardless, we will work on logging the privacy risk of each DP data release and take those metrics into account when considering new releases.
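
As a minimal sketch of what that logging could look like, the snippet below tracks cumulative spend under basic sequential composition, where budgets simply add; real accounting would likely use tighter bounds (e.g. advanced composition), which is part of why it's not so simple in practice. The ledger entries are invented.

  from dataclasses import dataclass

  @dataclass(frozen=True)
  class PrivacySpend:
      release: str
      epsilon: float
      delta: float

  # Invented ledger entries, for illustration only.
  ledger = [
      PrivacySpend("pageviews_by_country_2021", epsilon=1.0, delta=1e-7),
      PrivacySpend("geoeditors_monthly_2022", epsilon=0.5, delta=1e-8),
  ]

  # Basic sequential composition: (ε, δ) budgets add across releases.
  total_epsilon = sum(s.epsilon for s in ledger)
  total_delta = sum(s.delta for s in ledger)
  print(f"cumulative spend: (ε={total_epsilon}, δ={total_delta})")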

Documentation, socialization, education, publication

Future work must also include public-facing aspects. Over the last 16 months, there have been many opportunities to educate WMF staff internally about differential privacy; we must make those sessions available to the broader internet and enable more Wikimedians to understand the principles and guarantees behind DP. Ideally, more than one person should be able to conduct DP data releases, and many people at WMF should be able to request DP services and discuss the benefits and drawbacks of DP in depth.

Another strand of this work consists of written materials: both on-wiki documentation (this document is part of that effort) and publications. We must have a clearly structured and frequently updated hub of information on DP projects. WMF also faces new and distinctive challenges in the realm of DP, meaning we have an opportunity to use our experience to push the discipline forward. Hopefully, we can publish research papers and give talks to share our experiences with the wider world and spark academic discourse.

Other work

This document has focused mostly on DP data releases; however, DP is a general-purpose paradigm and can be used for other things. Below are some longer-term, more speculative ideas for DP at WMF:

  • DP for machine learning on PII (e.g. a sockpuppet detection model that uses information like IP addresses); see the sketch after this list
  • DP for synthetic data generation
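
For the machine-learning idea, the standard recipe is DP-SGD (Abadi et al., 2016): clip each example's gradient to bound any individual's influence, then add Gaussian noise calibrated to that bound. The sketch below shows a single such update with toy numbers; it is not an actual sockpuppet model, and all hyperparameters are placeholders.

  import numpy as np

  def dp_sgd_step(params, per_example_grads, clip_norm, noise_multiplier, lr, rng=None):
      """One DP-SGD-style update: per-example clipping plus Gaussian noise."""
      rng = rng or np.random.default_rng()
      clipped = [
          g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
          for g in per_example_grads
      ]
      noise = rng.normal(scale=noise_multiplier * clip_norm, size=params.shape)
      noisy_mean = (np.sum(clipped, axis=0) + noise) / len(per_example_grads)
      return params - lr * noisy_mean

  # Toy example: three per-example gradients for a two-parameter model.
  grads = [np.array([0.4, -1.2]), np.array([2.0, 0.1]), np.array([-0.3, 0.8])]
  params = dp_sgd_step(np.zeros(2), grads, clip_norm=1.0, noise_multiplier=1.1, lr=0.1)
  print(params)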