Differential privacy/Proposed/DP dataset release prioritization

From Meta, a Wikimedia project coordination wiki

Differential privacy (DP) is a wide-ranging paradigm for statistically-guaranteed data privacy. The Privacy Engineering (PE) team at the Wikimedia Foundation (WMF) is in the midst of a multi-year effort to establish use cases and workflows for DP at WMF.

To help WMF prepare for future data releases, Tumult Labs will be evaluating three potential use cases for DP and writing feasibility reports for each use case.

The purpose of this document is to:

  1. outline a rationale for inclusion/exclusion and prioritization of dataset types in DP data releases (from both a technical and an organizational perspective)
  2. summarize the steps taken to date to establish consensus on the three datasets that we’re giving to Tumult Labs
  3. explain the three datasets that we’ve settled on for evaluation and future release:
    1. Geoeditors_monthly
    2. Search queries and results (queries = Wikidata item labels, results = wiki project articles)
    3. CentralNotice engagement funnel statistics for Wikimedia affiliates and projects (WikiLovesMonuments, Picture of the Year, etc.)

Technical factors and prioritization rationale

As we attempt to productionize DP at WMF, there are technical factors that may (dis)qualify datasets from consideration, as well as organizational rationales for prioritizing the datasets that do qualify for release. All of these considerations were salient in determining the three datasets we’ve identified for future release.

Technical factors

Keyset size: For non-approximate DP to work properly, the domain of the mechanism (called the keyset) must be finite and known before calculation. Noise is calculated for every key in the keyset, even keys with a value of zero. Guaranteeing privacy across the entire keyset can therefore incur both a computation cost (since WMF has limited compute capacity) and an increase in error metrics (since billions of draws from a random distribution lead to outliers). In the context of WMF, this requirement means that:

  • Most natural language domains (specific search queries, deletion log text, etc.) are out of scope: the keyset would be the number of valid UTF-8 characters raised to the power of the maximum accepted query length. Most of those keys would be gibberish, and the keyset would quickly become intractably large.
  • Individual reader paths are out of scope. Although the PE team considered revamping the clickstream dataset, we quickly realized that, with more than 60 million articles, the keyset for mapping every page to every page even once would exceed 3.6 quadrillion pairs, which our clusters would be unable to compute. Moreover, the community's desire to expand the clickstream dataset mostly concerns adding more projects/namespaces, not surfacing more of the long tail of the distribution by lowering thresholds (see T289532 and T296359).
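The keyset requirement above can be sketched in a few lines. This is a minimal illustration (not WMF's actual pipeline, and not the Tumult Labs tooling): it assumes a simple Laplace mechanism, and shows why every key in the keyset, zero-valued or not, receives a noise draw, so keyset size directly drives both compute cost and the number of noisy zero-count outliers.

```python
import numpy as np

def dp_counts(true_counts: dict, keyset: list, epsilon: float, sensitivity: float = 1.0):
    """Add Laplace noise to every key in the keyset, including keys
    whose true count is zero. One noise draw per key means that
    keyset size drives both computation and outlier-heavy error."""
    rng = np.random.default_rng()
    scale = sensitivity / epsilon
    return {
        key: true_counts.get(key, 0) + rng.laplace(0.0, scale)
        for key in keyset  # noise is drawn once per key, observed or not
    }

# Toy keyset: every (project, country) pair must receive noise,
# even pairs with no observed activity.
keyset = [(p, c) for p in ["enwiki", "arwiki"] for c in ["MA", "AR", "US"]]
observed = {("arwiki", "MA"): 120}
noisy = dp_counts(observed, keyset, epsilon=1.0)
```

With a quadrillion-key domain (as in the clickstream case above), this per-key loop is exactly what becomes infeasible.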

Plausible privacy units: For DP to be applied safely, we usually need a plausible privacy unit: the entity whose contribution to a dataset the privacy guarantee protects. For most purposes at WMF, this unit corresponds to one person's contribution (a number of pageviews, edits, donations, etc.). This gives WMF a preference for datasets with some form of back-end key for identification, though it is also possible to apply DP to keyless (or pre-aggregated) data.
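In practice, the privacy unit is enforced by bounding each person's contribution before aggregation. The sketch below is a hypothetical illustration (the `user_key` field and event shape are assumptions, not WMF schema): it clamps each back-end key to a fixed number of events, which is the bound that calibrates the DP noise.

```python
from collections import Counter

def bound_contributions(events, max_per_user: int):
    """Clamp contributions so that no single privacy unit (a back-end
    user key) contributes more than max_per_user events to the
    aggregate. This contribution bound is what calibrates DP noise."""
    kept, per_user = [], Counter()
    for user_key, value in events:
        if per_user[user_key] < max_per_user:
            per_user[user_key] += 1
            kept.append((user_key, value))
    return kept

events = [("u1", "edit")] * 10 + [("u2", "edit")] * 2
bounded = bound_contributions(events, max_per_user=5)
# u1 is clamped to 5 events; u2's 2 events are untouched
```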

Organizational rationales

Principle of least access: Currently, if a researcher (or editor, or affiliate employee, etc.) wants access to data that is kept private by WMF, they sign the Confidentiality agreement for nonpublic information, a legal agreement with WMF not to share data. This process is onerous, requires many people to do extra work, and often gives the non-WMF employee permission to access much more internal data than what they asked for. We can use the principle of least access to prioritize data releases that most reduce the number of legal forms that need to be signed.

Establishing new classes of DP data release: The PE team's first two data releases both concern pageviews, and going forward we want to prioritize novel kinds of released data. Each time we work through a new class of data release, it becomes easier to release that kind of data in the future.

Establishing a consensus

In order to gather ideas and establish a consensus about which datasets we should prioritize for release, we did the following:

  1. Did an initial internal brainstorm of possible datasets based on ~15 months of work in the DP space at WMF
  2. Ran the initial set of ideas past the WMF Research team to see if they had any more ideas and/or experiences with commonly-requested datasets from the wiki research community
  3. Created a survey with five possible dataset release ideas and sent it out to the research community via wiki-research-l for feedback
  4. Looked at the results together with the Research team and decided which three datasets should be prioritized based on technical factors, organizational rationales, and community input

Selected datasets

After this consensus process, the following datasets were selected for priority DP release:

geoeditors_monthly

Currently, the geoeditors_monthly dataset is released publicly as rounded values and numeric ranges corresponding to editing activity in a given country-project-month tuple. For example, a row might look like this:

Geoeditors_monthly_public data sample
| wiki_db | project      | country_name | country_code | activity_level | editors_ceil | month   |
| arwiki  | ar.wikipedia | Morocco      | MA           | 100 or more    | 10           | 2022-11 |

The goal of this project will be to use DP to report more precisely on both the number of editors (editors_ceil) and the number of edits (activity_level) within a given country-project-month tuple. The final presentation of the numbers has not yet been decided, and may or may not look like the current table.

From a technical perspective, geoeditors_monthly is a particularly good dataset to start with because the keyset of projects and countries for a given month is quite small (~60,000). Further, editor behavior is more easily accounted for within WMF architecture than reader behavior. This will make defining a privacy unit by bounding editor contributions relatively easy.
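The small keyset claim can be made concrete. Below is an illustrative sketch (the country and project lists are stand-ins, not the real enumeration): the monthly keyset is the cross product of countries and projects, zero-filled so that every tuple receives noise in a later DP step.

```python
def zero_filled_keyset(countries, projects, observed):
    """Build the full country x project keyset for one month, filling
    unobserved tuples with zero so every key can later receive noise."""
    return {
        (c, p): observed.get((c, p), 0)
        for c in countries
        for p in projects
    }

countries = ["MA", "AR", "US"]   # illustrative; the real list is longer
projects = ["arwiki", "eswiki"]  # illustrative
observed = {("MA", "arwiki"): 120}
full = zero_filled_keyset(countries, projects, observed)
# At full scale, a few hundred countries times a few hundred projects
# stays on the order of tens of thousands of keys: easily computable.
```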

Organizationally, releasing geoeditors_monthly is also ideal. Many researchers want access to more precise versions of this data than we currently release. Publishing it would establish a new class of DP dataset, enabling more rapid release of future datasets concerning editor data.

Finally, there is community desire to release geoeditors_monthly. Six of the eight research community respondents were neutral to positive about this dataset's importance for their work, and the WMF Research team agreed that it would be useful.

Search queries and results

At present, WMF makes no datasets about search queries or results publicly available. This project would seek to privately release information about search queries and results on WMF projects.

About a decade ago, there was a short-lived attempt to publish information about what people were searching for on Wikipedia; it was quickly nixed due to privacy concerns. Five years later, in 2017, a senior engineer on the Search Platform team investigated what appears in the most common search queries with no results. This project will be more closely aligned with the 2012 attempt to release data. Because we cannot bound the keyset for natural-language strings, this release must focus on search queries matching Wikidata item labels and on search-result rankings of pages.

This dataset will be technically difficult to wrangle, as it requires querying across all projects. However, if we limit the keysets to Wikidata item labels (for queries) and Wikipedia project page rankings (for results), we can create bounded keysets for this data. We will likely be able to define a fuzzy notion of a privacy unit by using actor_signature to limit the number of queries per actor. For queries, we aim to release the frequency with which Wikidata item labels are searched, per project, per some time period. For results, the output metric is less clear, but we're interested in the relative popularity/frequency of pages in search results, as represented by how highly a page is ranked and how often it appears in results.
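The query side of this plan can be sketched as follows. This is a hypothetical illustration, not the planned implementation: the log shape and function names are assumptions, but it combines the two ideas in the paragraph above, dropping queries outside the bounded Wikidata-label keyset and capping queries per actor_signature, before adding Laplace noise over the whole keyset.

```python
import numpy as np
from collections import Counter, defaultdict

def dp_label_frequencies(query_log, label_keyset, max_per_actor, epsilon):
    """Sketch: frequency of searched Wikidata item labels. Each
    actor_signature contributes at most max_per_actor queries, which
    bounds sensitivity; Laplace noise covers the entire label keyset."""
    rng = np.random.default_rng()
    per_actor = defaultdict(int)
    counts = Counter()
    for actor_signature, label in query_log:
        if label not in label_keyset:
            continue  # free-text queries outside the label keyset are dropped
        if per_actor[actor_signature] < max_per_actor:
            per_actor[actor_signature] += 1
            counts[label] += 1
    scale = max_per_actor / epsilon  # sensitivity = per-actor contribution bound
    return {lbl: counts[lbl] + rng.laplace(0.0, scale) for lbl in label_keyset}

log = [("a1", "Morocco"), ("a1", "Morocco"), ("a1", "zzz gibberish"),
       ("a2", "Casablanca")]
labels = {"Morocco", "Casablanca", "Rabat"}
noisy = dp_label_frequencies(log, labels, max_per_actor=2, epsilon=1.0)
```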

Organizationally, releasing search data would create a new class of DP data. It would represent a new class of data from WMF more generally, and would provide researchers with insight into how search works on-wiki. This prospective data release was also strongly supported by half of the research community members we surveyed, as well as the WMF Research team.

CentralNotice engagement funnel

CentralNotice is a MediaWiki extension for announcing things in banner form on-wiki. It is frequently (and at times controversially) used for soliciting donations, but it also is a key tool for WMF affiliates, projects like Wiki Loves Monuments, user groups, etc. As such, access to CentralNotice data is frequently requested by community members. This data release will seek to release privatized statistics about CentralNotice banner funnels, with a focus on affiliate, project, and user group data (rather than fundraising data, for the moment).

Output data might look something like

CentralNotice DP release proposed data sample
| country   | project | message_id | views     | clicks |
| Argentina | eswiki  | 09876      | 1,324,354 | 9,786  |

where message_id identifies a specific CentralNotice banner text and clicks is the number of link clicks from the banner.

This dataset will also be difficult to wrangle technically, as it will involve joining information across multiple private datasets. The keyset, however, is limited greatly by the fact that banner metadata (countries, languages, projects, etc.) is public information. The privacy unit will likely be defined simply using actor_signature, similarly to search data.
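The role of public banner metadata can be illustrated with a small sketch. The table shapes and field names here are hypothetical, not the real private datasets, and the DP noise step is omitted: the point is that the public metadata defines the keyset, so every (country, project, message_id) tuple appears in the output even when no private events were recorded for it.

```python
def join_banner_funnel(public_banners, private_events):
    """Sketch: aggregate private view/click events onto the keyset
    defined by public banner metadata, zero-filling missing tuples."""
    events = {}
    for country, project, message_id, kind in private_events:
        key = (country, project, message_id)
        row = events.setdefault(key, {"views": 0, "clicks": 0})
        row[kind] += 1
    return {
        key: events.get(key, {"views": 0, "clicks": 0})
        for key in public_banners  # keyset comes from public metadata
    }

banners = [("Argentina", "eswiki", "09876"), ("Mexico", "eswiki", "09876")]
raw = [("Argentina", "eswiki", "09876", "views"),
       ("Argentina", "eswiki", "09876", "clicks")]
funnel = join_banner_funnel(banners, raw)
```

A real release would bound contributions per actor_signature and add noise to these counts before publication, as described above.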

This release would be a new class of data release, and would likely reduce requests for access to WMF private data from people and organizations with poor cybersecurity practices. Although it scored poorly on the research community survey, that may be because it targets a different editor/affiliate constituency.