Differential privacy/Completed/Geoeditors/Feasibility report

Problem definition

The geo-editors monthly dataset reports the number of active editors, broken down by project and country, for each month and for a limited set of countries. This dataset allows Wikimedia and third parties to assess the amount of collaboration between editors in different countries across different regions. Wikimedia currently releases a monthly version of the geo-editors tables using coarse rounded values and numerical ranges.

A sample of what a current release might look like can be seen in the table below. Each row counts the number of editors who edit a particular project, from a specific country, a given number of times. The project column denotes which Wikimedia project is edited; this often indicates the language of the page, among other properties. The country column denotes the country the editors edited from. The country of an editor is computed automatically from identifiers such as IP address. The activity_level column describes the activity level of the editors in the row. In the current release this falls into the coarse ranges of 1 to 4, 5 to 99, and 100 or more. The editors_ceil column reports an upper bound on the number of editors who have edited the particular project from the particular country and whose activity level falls within the specified bucket.

Geoeditors_monthly_public data sample
wiki_db	project	country_name	country_code	activity_level	editors_ceil	month
arwiki	ar.wikipedia	Morocco	MA	100 or more	10	2022-11
arwiki	ar.wikipedia	Morocco	MA	5 to 99	10	2022-11
arwiki	ar.wikipedia	Morocco	MA	1 to 4	10	2022-11

Privacy risk

If the data underlying this table were combined with the publicly available edit history, there would be a significant risk of disclosing editors' locations. The date, time, and username of each edit are publicly available, so an adversary could link that information to the geo-editors monthly data in order to determine the country from which specific editors edit. The simplest example of this attack would be a page that was only edited once in a month. In this case, the geo-editors data would precisely reveal the country the edit came from.

Due to this risk, previous versions of this release have not included data from countries identified as potentially dangerous for journalists or internet freedom. Additionally, current releases of geo-editors monthly use very coarse measures of both the activity level and the editor count in order to make linking more difficult.

Objectives

The objective of this project is to use differential privacy (or a variant thereof) to report more precisely both the number of editors (editors_ceil) and the number of edits (activity_level). Currently these measures use a small number of coarse buckets. For example, the editors_ceil column only reports upper bounds and the activity_level column only uses three ranges: 1 to 4, 5 to 99, and 100 or more. Differential privacy would enable the editors_ceil column to be published as a set of direct (noisy) counts and would allow the granularity of the activity_level buckets to be increased in order to capture more complex distributions. We could also (if desired) report the noisy number of edits for each row directly (currently an analyst who wants to know how many edits were made has to estimate it from the rough number of editors and the activity level). The granularity of the differentially private data can be tuned to meet the accuracy and privacy loss requirements.

Privacy analysis

The geo-editors monthly release is well suited to differential privacy. All of the released statistics are aggregate statistics with a clear bound on user contribution. Likewise, the overall keyset of (project, country) pairs is relatively small (approximately 75,000), which limits the risk of spurious counts. Depending on the definition, privacy loss for editors can be bounded and tracked across multiple releases.

Here we discuss several possible differential privacy strategies and their tradeoffs. Each of these descriptions assumes that there are no changes to the current methods of collecting the raw data. Changes to the underlying data protection (along the lines of those made for the differentially private pageview dataset) could allow us to use stronger privacy notions. We introduce three definitions of the neighboring relation for differential privacy. Each protects privacy at a different granularity and affects editors' privacy guarantees differently.

Neighboring relation

The existing dataset contains the activity of all editors. Editors are identified either by a pseudonym if logged in or by IP address if not. As such, editors who are logged in may contribute to different projects from a variety of countries under the same linkable pseudonym. Editors who are logged out can still contribute to multiple projects from the same IP address, but are likely to receive (and be tracked under) a different IP address if contributing from a different country.

(Country, Project, Month)-Differential Privacy: This definition protects the addition or deletion of any (editor, country, project, month) tuple. As a result, every editor incurs one unit of privacy loss for each (country, project, month) combination in which they appear. For example, an editor who appears in exactly one (country, project) pair each month would incur one unit of privacy loss for each month in which they edited. Likewise, an editor who edits across multiple projects (or countries) would incur one unit of privacy loss per (country, project) pair they edited from in each month. This neighboring relation is the most straightforward, as it can be achieved using row-level differential privacy on a pre-aggregated dataset. However, it incurs increased (potentially substantially increased) privacy loss for prolific editors who edit multiple projects or travel to multiple countries.
(Country, Month)-Differential Privacy: This definition protects the addition or deletion of any (editor, country, month) tuple. It is more general than the previous one, as it protects the activity of editors across projects: an editor who edits multiple projects (but remains in the same country) incurs only one unit of privacy loss. In exchange, the noise required scales with the maximum number of projects a single editor edits. In order to bound the noise, public editor data may be used to find a true upper bound on the number of edited projects.
(Month)-Differential Privacy: This is a more general neighboring definition than the previous one, as it also protects editors who move across countries. However, any bound on editors' travel derived from public data will not protect editors who are identified by IP address, as they will likely not have the same IP address in different countries. As such, the bound on editor influence across multiple rows will be an approximation, and IP editors whose activities across countries exceed that bound will suffer a larger privacy loss. Likewise, this definition requires a larger noise scale to account for editors who travel across countries as well as those who edit multiple projects.
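
To make the three relations above concrete, the following sketch computes the contribution bound (the number of output rows a single protected unit can affect) that each relation implies. The record layout, the function name contribution_bounds, and the sample records are assumptions made for illustration; in practice the bound would have to be derived from public data, as noted above.

```python
from collections import defaultdict

def contribution_bounds(records):
    """Max number of (country, project) rows one protected unit can touch.

    records: iterable of (editor, country, project, month) tuples.
    This layout is hypothetical, chosen only for illustration.
    """
    per_country_month = defaultdict(set)  # (editor, country, month) -> projects
    per_month = defaultdict(set)          # (editor, month) -> (country, project) pairs
    for editor, country, project, month in records:
        per_country_month[(editor, country, month)].add(project)
        per_month[(editor, month)].add((country, project))
    return {
        # One (editor, country, project, month) tuple touches exactly one row.
        "country_project_month": 1,
        # One (editor, country, month) tuple touches one row per project edited.
        "country_month": max((len(p) for p in per_country_month.values()), default=0),
        # One (editor, month) tuple touches one row per (country, project) pair.
        "month": max((len(p) for p in per_month.values()), default=0),
    }

# Made-up records: one editor edits two projects from MA and one from FR.
records = [
    ("alice", "MA", "ar.wikipedia", "2022-11"),
    ("alice", "MA", "fr.wikipedia", "2022-11"),
    ("alice", "FR", "fr.wikipedia", "2022-11"),
    ("bob", "MA", "ar.wikipedia", "2022-11"),
]
bounds = contribution_bounds(records)
```

The bound grows as the relation becomes more general, which is why the more protective definitions require more noise.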

Privacy challenges

There remains one significant challenge to applying differential privacy to the geo-editors monthly release. While editors are usually identified by a pseudonym, editors may choose instead to be identified by IP address. This makes it difficult to bound the contribution of editors who use IP identification while moving between IP addresses. Because of this, any measure of privacy loss is only guaranteed to apply to those using pseudonym identifiers, and those using IP identification may experience larger privacy loss.

Wikimedia will implement IP masking in the near future. This technique will assign each editor an identifier instead of using their IP address directly. This identifier will be tied to a browser cookie, which will enable editors to be linked across multiple IP addresses. While this is an improvement over using the IP address directly, the bound achieved using IP masking will still be only approximate, as editors who use multiple devices and opt out of using an account may experience larger privacy loss.

Initial mechanism

Given the current output of the geo-editors monthly release, we suggest using a histogram query for each (project, country, month) triple. Given a range of activity levels and some preset bucket ranges, a histogram query reports the (noisy) number of editors whose activity level falls into each bucket. For example, using the existing activity level breakdown, the corresponding histogram query would be on the range $[1,\infty )$ with only three buckets representing the ranges $[1,4]$, $[5,99]$ and $[100,\infty )$. In this case any editor who has made 100 or more edits within the month would be counted in the last bucket, while an editor who has made only 50 edits would be counted only in the second bucket.

In order to report a differentially private histogram, zero-mean Laplace noise (or Gaussian noise if using zCDP) is added to the value of each bucket with a scale of ${\frac {\Delta }{\epsilon }}$ (${\frac {\Delta }{\sqrt {2\rho }}}$ if using zCDP). Here $\epsilon$ (or $\rho$ under zCDP) refers to the privacy loss budget and $\Delta$ refers to the maximum user contribution under the chosen neighboring relation. For (Country, Project, Month)-Differential Privacy this value is 1, while under (Country, Month)-Differential Privacy it is the maximum number of projects a single editor has contributed to, which must be identified or upper bounded using public information.

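import random

def initial_mechanism(buckets, project_country_pairs, editor_activity, epsilon):
  # Assumed inputs: editor_activity maps (project, country) to a list of per-editor
  # monthly edit counts; buckets are inclusive (low, high) ranges.
  output = {}
  for (project, country) in project_country_pairs:
    counts = editor_activity.get((project, country), [])
    for (low, high) in buckets:
      s = sum(1 for c in counts if low <= c <= high)
      # The difference of two exponentials with rate epsilon is Laplace noise of scale 1/epsilon.
      noise = random.expovariate(epsilon) - random.expovariate(epsilon)
      output[country, project, (low, high)] = s + noise
  return output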
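
The noise scales described above can be worked through numerically. The sketch below is illustrative only: the helper names and the budget values are assumptions, not recommendations for the actual release.

```python
import math

def laplace_scale(sensitivity, epsilon):
    """Scale b of the Laplace noise: Delta / epsilon."""
    return sensitivity / epsilon

def gaussian_sigma(sensitivity, rho):
    """Standard deviation of the Gaussian noise under zCDP: Delta / sqrt(2 * rho)."""
    return sensitivity / math.sqrt(2 * rho)

# Hypothetical budgets, for illustration only.
b_row = laplace_scale(1, epsilon=1.0)      # (Country, Project, Month): Delta = 1
b_country = laplace_scale(5, epsilon=1.0)  # (Country, Month) with an assumed bound of 5 projects
sigma_row = gaussian_sigma(1, rho=0.5)     # Gaussian alternative under zCDP

# Laplace noise with scale b has standard deviation sqrt(2) * b.
std_row = math.sqrt(2) * b_row
```

With these assumed values, moving from (Country, Project, Month) to (Country, Month) protection multiplies the Laplace scale by the project bound, here from 1 to 5.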

Extensions

Here we discuss possible additions or tunable values which can improve the utility of the previous mechanism.

Tunable parameters and optimization

The histogram query has 2 tunable parameters, the number of buckets and the bucket sizes. Each parameter has its own unique tradeoffs that require independent tuning.

The number of buckets decides the granularity of the histogram. The more buckets, the more precise the data, which is presumably better for analysts. However, smaller buckets will have fewer users who contribute to them. The noise added is on an absolute scale, so smaller buckets will have higher relative error.

The way the buckets divide the space can also be adjusted. While it might be tempting to use equally sized buckets, this will fail to capture condensed distributions with long tails (since a majority of the buckets will capture only the tail). In order to capture more complex distributions effectively, the size of each bucket must be tuned to the particular distribution. For example, the existing three buckets in the geo-editors release use a small first bucket $[1,4]$ to capture the head of the distribution, a larger middle bucket $[5,99]$, and an unbounded third bucket $[100,\infty )$.
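
One simple way to obtain buckets that are narrow at the head and wide in the tail is to let the bucket width grow geometrically. The sketch below is one possible scheme, not the bucketing used in the current release; the function name and the growth factor are assumptions.

```python
def geometric_buckets(max_value, factor=2):
    """Inclusive (low, high) buckets whose width grows by `factor`.

    The final bucket is open-ended and starts at or below max_value.
    """
    buckets, low, width = [], 1, 1
    while low + width - 1 < max_value:
        buckets.append((low, low + width - 1))
        low += width
        width *= factor
    buckets.append((low, float("inf")))
    return buckets

buckets = geometric_buckets(100)
```

For max_value=100 this yields seven buckets, starting with single-value buckets for the least active editors and ending with an open-ended bucket for the most active.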

Alternatively, bucket sizes and domains can be chosen dynamically based on the distribution of a (project, country) pair. A few summary statistics, such as a (bounded) mean or quantile, can be used to estimate the approximate shape of the distribution and derive both a bucket size and a domain. This would require either that additional budget be spent on these summary statistics or that public data be used to inform the dynamic bucketing.
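
As a minimal sketch of the public-data variant, the function below derives bucket boundaries from empirical quantiles of per-editor edit counts. It assumes these counts come from public data (so no privacy budget is spent) and that there are at least as many counts as buckets; the function name and the sample values are hypothetical.

```python
def quantile_buckets(public_counts, n_buckets):
    """Inclusive (low, high) buckets holding roughly equal numbers of editors.

    Assumes len(public_counts) >= n_buckets and that all counts are >= 1.
    """
    values = sorted(public_counts)
    # Upper edge of every bucket except the last: the value at each interior quantile.
    edges = sorted({values[(i * len(values)) // n_buckets - 1] for i in range(1, n_buckets)})
    buckets, low = [], 1
    for high in edges:
        if high >= low:
            buckets.append((low, high))
            low = high + 1
    buckets.append((low, float("inf")))  # open-ended tail bucket
    return buckets

# Made-up per-editor edit counts with a long tail.
public_counts = [1, 1, 1, 2, 2, 3, 5, 8, 20, 150]
buckets = quantile_buckets(public_counts, n_buckets=5)
```

Because edit counts are integers with many ties, duplicate quantile values are merged, so the function may return fewer buckets than requested.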

Post-processing

Given the vast amount of public data available at WMF, additional post-processing can be done to increase the overall utility of the data release. WMF publicly releases a change log of all pages, which includes the editors' individual identifiers as well as the total number of edits on each page. The private release can be post-processed to ensure that it agrees with the public data on the total number of edits per (project, month) pair. This will result in data that is easier to analyze for third parties who are less knowledgeable about differential privacy. For advanced users, this sort of post-processing may actually be harmful, as it will make it harder to model the noise that was added to each statistic.

Edits count

In addition to a noisy count of editors, we could also add a noisy sum of edits. For most rows of the dataset, we will be able to easily bound the number of edits per editor (based on the top of the activity level bucket), so we will have a natural bound for the noise added to this sum. However, this value could be substantially more noisy than the editor count (due to higher contribution bounds), and calculating it would require spending additional privacy loss budget.
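
A minimal sketch of such a noisy sum is shown below, assuming the (Country, Project, Month) neighboring relation and Laplace noise. Each editor's edit count is clamped to the top of the activity level bucket, so adding or removing one editor changes the sum by at most that bound. The function name and inputs are hypothetical, and the open-ended top bucket would need an explicitly chosen clamping bound.

```python
import random

def noisy_edit_sum(edit_counts, bucket_high, epsilon):
    """Sum of per-editor edit counts, clamped to bucket_high, plus Laplace noise."""
    clamped_sum = sum(min(c, bucket_high) for c in edit_counts)
    # The sensitivity of the clamped sum is bucket_high, so the noise scale is bucket_high / epsilon.
    scale = bucket_high / epsilon
    # The difference of two exponentials with mean `scale` is zero-mean Laplace noise.
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return clamped_sum + noise

# Made-up edit counts for the "5 to 99" bucket; the value 120 is clamped to 99.
estimate = noisy_edit_sum([5, 20, 99, 120], bucket_high=99, epsilon=1.0)
```

For the "5 to 99" bucket the noise scale is 99 times larger than for the editor count at the same budget, which illustrates why this value could be substantially noisier.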