Differential privacy/Completed/Country-project-page-historical/Problem statement

The Wikimedia Foundation, with help from Tumult Labs, is seeking to release ~5 years of historical pageview data using differential privacy. Wikimedia collects historical data about visits to its website pages in a table, pageview_hourly. Currently, data from January 2021 on is compiled for easy accessibility via the Pageviews API, but data from 2015-2020 is inaccesible. This project seeks to release that data.

Problem description

Input data

Each row of the pageview_hourly corresponds to a set of views from users with the same characteristics. Each row of pageview_hourly contains, among other things, the following information of interest.

The ID of the page (page id)
The project associated with the page (project)
The date and time of the visit (as hour, day, month, and year, we simply denote it date)
A set of user characteristics at the time of visit (as, continent, country, subdivision, city, user agent map, and access method. We simply denote it characteristics)
A count of views by users with the particular descriptor (view count).

The data is pre-aggregated and users can contribute any number of times to the aggregate count. This makes bounding the user contribution very difficult. Note that the data contains pageviews from both Wikipedia users, who simply read the encyclopedia, and from editors, who can engage with particular pages at greater length and therefore may contribute many views to particular pages. Prior to February 2017, Wikimedia's systems counted every time an editor previewed their edits as a pageview. Since an editor could plausibly preview a page many times in the course of a lengthy edit, this means that editors could contribute disproportionately many pageviews to the total during that time period.

Desired output

The goal is to generate a histogram of page views, grouped by date (at day-level granularity), country, project, and page ID: for each tuple (project, page id, country, date), we want to compute and publish the number of times this page was viewed from this specific country, on this particular date. The histogram will be thresholded to remove pages without much traffic.

Auxiliary data

Data about the total number of pageviews is publicly available via the WMF Pageviews API, and can allow for bounding the keyset of pages to just those pages with more than a certain number (currently 150) of total global views in a given day.

Requirements

The requirements for a working solution to this problem fall in three categories: utility requirements, privacy requirements, and performance requirements.

Utility requirements

We care primarily about the most-viewed articles per country, project, and date. We would like these to have relatively low error, but care less about smaller pages. In particular, our primary accuracy metric is recall for the true top 1000 pages with more than release_thresh pageviews per project, country, and date. In other words, we want to optimize the number of the true top 1000 pages with more than release_thresh views that appear in the noisy data (not necessarily in the noisy top 1000, if there are more than 1000 noisy pages).

Our secondary utility metric is absolute error between true and noisy pageview counts, for pages in the top-1000 with more than release_thresh views per project, country and date.

We are also attempting to minimize bias, spurious rows, and dropped rows (see DP error metrics for more information on those metrics).

Privacy requirements

Ideally we would provide Wikipedia users and editors with user-level differential privacy guarantees. Unfortunately, due to the impossibility of bounding user contributions, we cannot achieve user-level privacy without adding overwhelming amounts of noise. Instead, we propose to protect a given volume of personal Wikipedia activity per day. In particular, we will pick a number of pageviews, $n$ , and apply differentially private noise to the dataset such that users who viewed fewer than $n$ pages in a day receive the full differential privacy guarantee. Users who viewed more than $n$ pages will experience increased privacy loss proportional to the extent to which their activity exceeded $n$ pageviews. Similarly, users who interact with Wikipedia on multiple days may experience increased privacy loss.

One particularly strong objective of our privacy requirements is that we do not want it to be possible to determine the country from which an editor made a given edit (note that the pseudonym of the editor, and the date and time of the edit are already public information). The potentially high volume of editor page views prior to 2017 means that we will need to adopt a separate strategy for this portion of the dataset. We expect that editor pageviews post-2017 will be more closely aligned with normal user behavior, and that editors should be reasonably protected by the same $n$ pageview guarantee as other users.

Performance requirements

Although this data release will in some sense be a ”one-off” (new pageview data is being processed by a different algorithm, and therefore the total volume of data for our algorithm is fixed), we still want to avoid excessive resource consumption. In particular, our goal is to process each data-day's worth of data in 10 minutes or less (measured on a spark cluster using ~20% of the production spark cluster resources).

Algorithm

We will present two separate algorithms: one for post-2017 data where editors’ contributions are more limited, and one for pre-2017 data where editors contributions will tend to be higher.

Hyperparameters

Both the pre/post 2017 data will contain several parameters which can impact the performance of the algorithms. Here we present each of these parameters along with a default value. These values should be tuned prior to release to achieve the desired utility metrics. The parameters are as follows.

User contribution limit: $n$ $n$
- The number of views in a day that we protect with the full privacy guarantee. Should be set so that most users contribute fewer views in a day
- Defines the sensitivity of the DP counts
- Should be set using real data about usage statistics
- Default Value: 30 (post-2017), 300 (pre-2017)
Global Suppression threshold: $t$ $t$
- Number of global views per day required for a page to be considered in the keyset
- Default Value: 150
- Tunable
Noisy Suppression threshold: $\tau$ $\tau$
- Number of noisy views required for a page country pair to be published
- Default value: 100
- Tunable
Epsilon: $\epsilon$ $\epsilon$
- Privacy loss per day per $n$ page views
- Default Value: 1
- Tunable
(possibly) Delta: $\delta$ $\delta$
- Likelihood of catastrophic failure from leaking records
- Only used for zero-concentrated differential privacy (zCDP), which may not be relevant in this case

Post-2017 data

We propose the following algorithm for post-2017 data detailed in Algorithm 1. The precise values of thresholds, privacy budgets and sensitivities will be tuned at a later date.

In python-esque psuedocode:

def historical_pageview_DP_algo(
  n,   # number of pageviews to protect
  t,   # minimum pageview threshold for including pages in the keyset
  tau, # minimum noisy pageview threshold for including page views broken down by country in the output
  eps, # privacy loss per day and per volume of pageviews
  W,   # set of all (page, country, date) tuples in the pageviews_hourly table
  p,   # public dataset containing global pageviews
  tbl, # the pageviews_hourly table
):
  keys = W.filter(safe_countries).filter(p[page] >= t)
  pageview_hourly_split = tbl.split() # into m rows with n views and one row with the remaining views

  out = []

  for (page, country, date) in keys:
    x = dp_sum(eps, clamping_bounds=[0, n], pageview_hourly_split[page][country][date])
    if x >= tau:
      out.append(x)

  return out

There is a desire for user-level privacy but there is no user identifier so individual contribution cannot be bounded. Likewise the data is pre-aggregated which allows for a user to contribute to multiple rows in the table even across different characteristics (like changing devices).

The algorithm is split into two phases: the pre-processing phase and the main phase. The pre-processing phase will include generating the keyset from publicly available data as well as partially disaggregating the pageview_hourly table to be compatible with Tumult Analytics (which is not designed to work with pre-aggregated data). In the main phase Tumult Analytics will be used to to generate the appropriate private sums.

The pre-processing phase contains two steps. The first step is generating the keyset of (day, page, country) pairs which will be published. Starting with the complete cross product of days, pages and countries we remove pairs that do not meet the specified criteria. First any countries which have been identified by independent organizations as potentially dangerous are removed. Then any page which has less than $t$ page views globally on a given day is removed from the keyset for that day. In the second step the pageview hourly table is pre-processed to be compatible with Tumult analytics. To do this any row with view count greater than $n$ must be split into an equivalent amount of rows with view count up to $n$ . This only needs to be done for the rows which are in the keyset which was previously generated. This allows the following step to be written as a simple sum in Tumult Analytics.

The main phase consists of a set of sum queries for each for each day and post processing suppression. For each day in the dataset we use the keyset generated in the pre-processing phase to determine the set of sum queries. For each item in the keyset for that day we take the sum over all rows matching that key with clamping bounds $[0,n]$ and privacy budget $\epsilon$ . After the noisy sums are generated remove from the output any sum whose noise value is less than $\tau$ .

Pre-2017 data

Prior to 2017 the data contains editors interactions with the page preview tool. This data counts each time an editor refreshes the page preview as in individual page view. This leads to editors having significantly higher contribution to page views particularly on pages that they have edited. For this time period editors may experience significantly higher actualized privacy loss than the typical user. This privacy loss additionally compounds across each page that is edited. As such for the pre-2017 data we suggest adding noise which scales to the contributions of each editor. As such for the pre-2017 data the value of $n$ is set to be significantly higher (300 instead of 30) than in the post-2017 data. This results in significantly stronger protections at the cost of higher noise.