Research:Differential privacy for Wikimedia data
This page provides an overview of research and development to incorporate differential privacy into the Wikimedia Foundation's approaches to releasing public datasets.
Differential Privacy at the Wikimedia Foundation
Ensuring open data access while maintaining user privacy
Projects supported by the Wikimedia Foundation are among the most used online resources in the world, garnering tens of billions of visits in hundreds of countries annually. As such, the Foundation has access to terabytes of user data: session histories, IP addresses, asset metadata, and more. Analyzed effectively, this user-derived data can provide a rich resource for understanding user behavior and the dissemination of information on many topics, from epidemiology to online harassment and browsing patterns.
Looming over this conflict is the political nature of Wikimedia projects — users and editors are pseudonymous for a reason. In spite of our best efforts to anonymize data, motivated actors could still combine our data with other outside data sources, or direct users to low-traffic pages, in order to spy on or persecute our users for their view history, edit history, or other behavior.
In this context, differential privacy provides a first step toward reconciling transparency and privacy: allowing us to release data while measuring and mitigating data harm.
What is differential privacy?
Differential privacy was introduced in 2006 by Cynthia Dwork and her collaborators, and provides a mathematical framework for releasing statistics and information about a dataset while ensuring that almost nothing is recoverable about the presence or absence of any single individual in that dataset. To put it simply, making a dataset differentially-private involves adding a relatively small, measurable amount of random noise to the data, either as it is being collected or as it is being queried. This methodology provides a formal guarantee that an individual participant’s data will not be leaked, and that any adversary, regardless of their computational power and knowledge of other data in the dataset, will learn almost nothing about a participant that they could not have learned without that participant’s data.
Imagine, for example, that Alice is an English Wikipedia user and Bob is trying to figure out her view history: the output of any computation Bob runs on a differentially-private dataset of pageviews is nearly statistically independent of whether or not Alice’s data is actually included, with the degree of independence controlled by the privacy parameter epsilon. This is achieved either by 1) Alice adding some random noise to her pageview history before sending it to the dataset (local differential privacy) or 2) WMF adding some random noise to the dataset prior to making it public for Bob to query (central differential privacy).
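The central-model case can be sketched with the standard Laplace mechanism: the curator adds Laplace noise, scaled to the query's sensitivity, to a true count before publishing it. This is an illustrative sketch with made-up numbers, not the Foundation's actual pipeline; the function names and the example count are hypothetical.

```python
import math
import random

def laplace_sample(scale: float) -> float:
    # Inverse-CDF sampling from a Laplace(0, scale) distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    # Adding or removing one pageview changes the count by at most
    # `sensitivity`, so Laplace noise with scale sensitivity/epsilon
    # yields an epsilon-differentially-private count.
    return true_count + laplace_sample(sensitivity / epsilon)

# Hypothetical example: the view count of a page Alice may have visited.
noisy = dp_count(true_count=1204, epsilon=1.0)
```

Because the noise has mean zero, repeated releases average out near the true count, while any single release reveals almost nothing about whether Alice's one view is included.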
Note: differential privacy does not provide a guarantee that the results of releasing a dataset won’t be harmful to Alice; just that no individual harm will befall Alice for adding her data to the dataset. For example, a government may view data that leads them to censor access to Wikipedia (including for Alice), but Alice’s participation in the dataset would have no bearing on the dataset output, and her information would not be leaked in the process.
Importantly, differential privacy provides a quantifiable amount of privacy loss for each dataset release. Unlike any other existing form of anonymization — k-anonymity, vectorization, etc. — it allows for a precise measure of exactly how risky a dataset release is.
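That "precise measure" is the privacy-loss parameter epsilon: for any possible output, the probability of seeing it with a given person's data included and with it excluded differs by at most a factor of e^epsilon. A minimal numerical check, using hypothetical counts that differ by one person's single pageview:

```python
import math

def laplace_pdf(x: float, mu: float, scale: float) -> float:
    # Density of a Laplace(mu, scale) distribution at x.
    return math.exp(-abs(x - mu) / scale) / (2.0 * scale)

epsilon = 0.5
scale = 1.0 / epsilon              # sensitivity 1: one pageview per person
with_alice, without_alice = 100, 99  # hypothetical true counts

# For any released value x, the two output densities differ by at
# most a factor of e^epsilon in either direction.
for x in [95.0, 99.5, 100.0, 104.0]:
    ratio = laplace_pdf(x, with_alice, scale) / laplace_pdf(x, without_alice, scale)
    assert math.exp(-epsilon) - 1e-9 <= ratio <= math.exp(epsilon) + 1e-9
```

This bound is what makes the risk of a release quantifiable: choosing epsilon fixes, in advance, the worst-case evidence any output can provide about one person.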
Over the last decade, differential privacy has become more feasible to implement, and has been used in production systems at Apple, Google, Facebook, and others. We propose to develop a differential privacy infrastructure for the safe release of information that could otherwise be harmful to WMF users.
Differential privacy in practice
The first test case for differential privacy at Wikimedia is the private release of (project, country, article, views) tuples. We currently release the most viewed pages broken down by project and country through the pageview API, but that data does not provide the granularity needed to understand the browsing dynamics of multilingual countries with lower connectivity or smaller populations.
For instance, India, South Africa, and Nigeria drove 691 million, 70 million, and 44 million visits to English Wikipedia in March 2021, respectively. Besides English, those three countries are home to at least 33 other languages in widespread use that have existing Wikipedia projects. With access to country-level English pageview data for Nigeria, editors writing in Igbo, Hausa, and Yoruba would be able to identify and close salient gaps in their Wikipedia projects, with up-to-date information on recent events. This data would also help analysts disaggregate the impacts of nationally specific events (e.g. the January 6th storming of the US Capitol, which was among the top ten most viewed English Wikipedia articles for two weeks in January) from project-wide trends.
Without differential privacy, the release of this data could constitute a risk to users who are extreme linguistic minorities and/or from small countries — for example, Malay speakers in San Marino might find their browsing behavior easy to disaggregate and re-identify.
With differential privacy (depending on the hyperparameters set in the process of counting pageviews), we could in theory protect the browsing privacy of users visiting pages with as few as ~30 views (per hour, day, month, etc.). At that low level, however, the data would be mostly noise; a more realistic threshold for retaining meaningful information would likely be 100 or 150 views. Regardless, this approach would give the world a view into Wikipedia browsing activity at a granularity orders of magnitude finer than the current status quo.
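The noise-plus-threshold idea described above can be sketched as follows. This is a simplified illustration, not the actual release pipeline: the epsilon value, the threshold of 100, and the example tuples are all hypothetical.

```python
import math
import random

def laplace_sample(scale: float) -> float:
    # Inverse-CDF sampling from a Laplace(0, scale) distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def release_pageviews(counts: dict, epsilon: float = 1.0, threshold: int = 100) -> dict:
    """Add Laplace noise to each (project, country, article) count and
    suppress rows whose noisy count falls below the release threshold."""
    released = {}
    for key, views in counts.items():
        noisy = views + laplace_sample(1.0 / epsilon)  # sensitivity 1 per view
        if noisy >= threshold:
            released[key] = round(noisy)
    return released

# Hypothetical counts illustrating the Malay-speakers-in-San-Marino risk:
counts = {
    ("en.wikipedia", "NG", "Lagos"): 5400,
    ("en.wikipedia", "SM", "Malay_language"): 7,  # too rare: suppressed
}
public = release_pageviews(counts)
```

Rows with very few views almost never clear the noisy threshold, so the rare, re-identifiable browsing behavior never appears in the published data, while high-traffic rows survive with only small perturbation.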
Open questions and future challenges
One important aspect of using differential privacy effectively is clearly defining the unit of privacy that we are trying to mask. Currently, for the release of (project, country, article, views) tuples, we are debating between individual pageviews and user sessions as the privacy unit. In the future, there should be a formalized process for clearly and mathematically defining the privacy unit for any proposed data release.
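The choice of privacy unit matters because it determines the sensitivity of each count, and therefore how much noise is needed. One common way to bound sensitivity at the session level (sketched here as an illustration; the cap of one view per article per session is a hypothetical parameter, not a settled design) is to clip each session's contribution before counting:

```python
from collections import Counter

def clipped_counts(sessions: list, max_per_session: int = 1) -> Counter:
    """Count article views with each session contributing at most
    `max_per_session` views per article, so that removing one entire
    session changes any single count by at most `max_per_session`."""
    totals = Counter()
    for views in sessions:              # one list of viewed articles per session
        per_article = Counter(views)
        for article, n in per_article.items():
            totals[article] += min(n, max_per_session)
    return totals

# Hypothetical sessions: a session that reloads a page many times
# still adds at most one view to that article's count.
sessions = [["Lagos", "Lagos", "Lagos"], ["Lagos", "Abuja"]]
print(clipped_counts(sessions))  # Counter({'Lagos': 2, 'Abuja': 1})
```

With the pageview as the unit, no clipping is needed but each person's many views are only individually protected; with the session as the unit, clipping bounds the sensitivity so that all of a session's activity is protected at once.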
There are other salient open questions about the decision-making, communications, and context around utilizing differential privacy:
- What constitutes an acceptable balance of privacy and usability of the dataset?
- What should the process look like for deciding the differential privacy hyperparameters (i.e. epsilon, delta, sensitivity) for a given data release?
- Given that users are impacted by these technical decisions, should discussions about the tradeoffs between privacy and usability include the user community?
- How can we apply differential privacy principles to more complex datasets with more unique features, like reader sessions?
- How can we accurately and effectively communicate highly technical privacy decisions to less- or non-technical audiences, especially the users and editors who might be most impacted by potential data releases?
As we gain more technical expertise in this field and it becomes more established as a privacy technique at the Foundation, we will need to have conversations addressing these questions and establish processes for evaluating when and how differential privacy should be used.