Research:Wikipedia and data protection

From Meta, a Wikimedia project coordination wiki
Sean Rosenberg
Duration:  2019-11 – 2020-11
Data protection

This page documents a planned research project.
Information may be incomplete and change before the project starts.

The purpose of this research project is first to identify all those who process personal data on behalf of the Wikimedia Foundation (the systems for processing personal data) across all its projects. Once this is complete, we will score each system on a scale from least protective to most protective of personal data. This scoring will provide a benchmark of how compliant the Wikimedia Foundation’s projects are with new and emerging data protection law and will identify areas for improvement.


The research will be conducted in three phases which are set out below:

Phase one

Example dataset comprising information obtained in phases one and three of the research project.

Phase one of the research project will seek to identify all those who process personal data on behalf of the Wikimedia Foundation. We will utilize electronic methods of discovery which include using Wikimedia software to:

  1. Find out which systems (automated or non-automated) process personal data on Wikimedia Foundation projects; these can include Wikimedia Foundation project users and automated software that processes personal data
  2. Categorize the systems that process personal data into their appropriate groups: automated or non-automated decision making, non-decision making, profiling and non-profiling
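
The categorization in step 2 could be recorded with a small data structure such as the one below. This is only a sketch: it assumes that each system is described by three yes/no attributes (automated, decision making, profiling) and that a group is the combination of the three. The class name `SystemRecord`, its field names and the identifier `sys-001` are hypothetical and are not part of the project plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemRecord:
    """One system that processes personal data (hypothetical structure)."""

    identifier: str        # unique identifier assigned by the researchers
    automated: bool        # True for software, False for a human processor
    decision_making: bool  # True if the system makes decisions about people
    profiling: bool        # True if the system profiles people

    def group(self) -> str:
        """Return a label for the group this system belongs to."""
        automation = "automated" if self.automated else "non-automated"
        decision = "decision making" if self.decision_making else "non-decision making"
        profiling = "profiling" if self.profiling else "non-profiling"
        return f"{automation} {decision}, {profiling}"


# Example: a human processor who makes decisions and profiles.
record = SystemRecord("sys-001", automated=False, decision_making=True, profiling=True)
print(record.group())
```

Deriving the group label from the attributes, rather than storing it separately, keeps a system's label consistent with the way it was categorized.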

Phase two

Phase two of the project will identify and record the types of personal data processed by each of the systems. Examples of personal data types are IP addresses, names, gender, age and email addresses. Most of this phase will rely on researchers manually collecting information from electronic sources such as the Wikimedia Foundation project sites. We may also survey people who process personal information for the Wikimedia Foundation to help complete this part of our research. The survey will be published publicly once phase one has been completed.

Phase three

In phase three, every system identified in phase one as processing personal data will be evaluated against a simple “true or false” checklist. Once the checklist is completed for a system, the system will be given a score from 0 to 100, with 0 being the least protective of personal data and 100 being the most protective. We will then use the scores of the individual systems to calculate an overall score for the group each system belongs to. For example, the scores of all non-automated decision making and profiling systems will be used to calculate an overall score for that group.

Table A. Checklist showing the criteria for evaluating the data protection score of a system (automated and non-automated) that processes personal data
Code Criteria Yes or no
1.1 Has the system identified an appropriate lawful basis (or bases) for processing?
1.2 Does the system process special category data or criminal offence data? If yes, has it identified a condition for processing this type of data?
1.3 Is there no evidence that the system does anything generally unlawful with personal data?
2.1 Does the processing of personal data have any adverse effects on those concerned? If yes, is there a justification for each adverse effect?
2.2 Does the system only handle people’s data in ways they would reasonably expect? If no, has it explained why any unexpected processing is justified?
2.3 Does the system avoid deceiving or misleading people when it collects their personal data?
2.4 Is the system open and honest, and compliant with the transparency obligations of the right to be informed?
3.1 Has the system clearly identified its purpose or purposes for processing personal information in its public privacy information?
4.1 Does the system only collect the personal data it actually needs for the specific purposes identified in 3.1?
4.2 Does the system periodically review the data it holds and delete anything it does not need?
5.1 Does the system ensure the accuracy of any personal data it creates?
5.2 Can the system start appropriate processes to check the accuracy of the data it collects and to record the source of that data?
5.3 Does the system keep a record of a mistake? If yes, is it clearly identifiable as a mistake?
5.4 Does the system create a record clearly identifying any matters of opinion and, where appropriate, whose opinion it is and any relevant changes to the underlying facts?
5.5 Does the system comply with the individual’s right to rectification and carefully consider any challenges to the accuracy of the personal data?
6.1 Does the system know what personal data it holds and why it is needed?
6.2 Is there evidence that the system has considered the length of time it holds personal data?
6.3 Does the system have an applicable policy that sets out standard retention periods where possible, in line with documentation obligations?
6.4 Does the system regularly review its information and erase or anonymise personal data when it no longer needs it?
6.5 Does the system have appropriate processes in place to comply with individuals’ requests for erasure under ‘the right to be forgotten’?
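
The project plan does not specify how checklist answers are converted into a score, so the following is only one possible sketch. It assumes that each of the 20 criteria in Table A is recorded as a single true or false value meaning "criterion satisfied", that all criteria carry equal weight, and that a group score is the arithmetic mean of its systems' scores. The function names `system_score` and `group_scores` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Criterion codes taken from Table A.
CRITERIA = [
    "1.1", "1.2", "1.3",
    "2.1", "2.2", "2.3", "2.4",
    "3.1",
    "4.1", "4.2",
    "5.1", "5.2", "5.3", "5.4", "5.5",
    "6.1", "6.2", "6.3", "6.4", "6.5",
]


def system_score(answers: dict[str, bool]) -> float:
    """Score one system from 0 to 100.

    Assumption: equal weighting, so the score is the percentage of
    criteria recorded as satisfied (True).
    """
    missing = [code for code in CRITERIA if code not in answers]
    if missing:
        raise ValueError(f"checklist incomplete, missing: {missing}")
    satisfied = sum(1 for code in CRITERIA if answers[code])
    return 100 * satisfied / len(CRITERIA)


def group_scores(scored_systems: list[tuple[str, float]]) -> dict[str, float]:
    """Average the system scores within each group.

    Assumption: the overall group score is the arithmetic mean.
    """
    by_group = defaultdict(list)
    for group, score in scored_systems:
        by_group[group].append(score)
    return {group: mean(scores) for group, scores in by_group.items()}


# Example: a system that satisfies the first 15 of the 20 criteria.
answers = {code: index < 15 for index, code in enumerate(CRITERIA)}
print(system_score(answers))
```

Several criteria are two-part questions (for example 1.2 and 2.1), so an assessor would need a documented rule for reducing each of them to one true or false value before a calculation like this could be applied consistently.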

Why this research will benefit the Wikimedia Foundation projects communities

People all over the globe are still coming to terms with the impact of the mishandling of personal data by several large multinational organisations, and we are still learning new lessons every day. In response to these substantial issues, various legislative bodies have acted to regulate how organisations process personal data. For now, the Wikimedia Foundation and its most popular project, Wikipedia, have escaped the spotlight. Our research project will benefit the Wikimedia project communities by providing a comprehensive and accurate assessment of the systems used to process personal data, and by presenting the level of compliance of those systems with new and emerging data protection law in an easy-to-read scoring format. We hope that the results can help identify areas of strength and weakness in how the Wikimedia Foundation, and those who process personal data on its behalf, handle personal data.

Policy, Ethics and Human Subjects Research

Some phases of this research will involve interactions with human participants from the Wikimedia Foundation’s projects. These interactions will be limited to the extent required to obtain data. All research which involves human participants and personal data will be conducted in compliance with the University of Cambridge Policy on the Ethics of Research Involving Human Participants and Personal Data. At this time, we only anticipate interacting directly with human participants when surveying in phase two to discover what types of personal information are processed for the Wikimedia Foundation. We will not collect personal data of survey participants.

In phase one of this project, we will be identifying systems on Wikimedia Foundation projects which process personal data, including systems that are human. The research will require assigning each system a unique identifier. For example, a person who processes personal data will be identified by username so that they can be categorized in phase one and scored in phase three. Once the person has been scored in phase three, there will be no need to link the data to a username. We will therefore remove the username and replace it with a data-set identifier which cannot be linked to a username.
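
The replacement of usernames described above could be sketched as follows. This is an illustration under stated assumptions, not the project's actual procedure: records are assumed to be simple dictionaries with a `username` key, the function name `anonymise_records` and the `dataset_id` field are hypothetical, and the identifier is drawn at random rather than derived from the username, since a hash of a username could be reversed by hashing candidate usernames.

```python
import secrets


def anonymise_records(records: list[dict]) -> list[dict]:
    """Return copies of the records with 'username' replaced by a random id.

    No mapping from username to identifier is kept, and the output order is
    shuffled so that position in the list does not reveal the original record.
    """
    anonymised = []
    used_ids = set()
    for record in records:
        dataset_id = secrets.token_hex(8)  # 16 random hexadecimal characters
        while dataset_id in used_ids:      # regenerate on the unlikely collision
            dataset_id = secrets.token_hex(8)
        used_ids.add(dataset_id)
        cleaned = {key: value for key, value in record.items() if key != "username"}
        cleaned["dataset_id"] = dataset_id
        anonymised.append(cleaned)
    secrets.SystemRandom().shuffle(anonymised)
    return anonymised


# Example with made-up usernames and scores.
scored = [
    {"username": "ExampleUserA", "score": 80.0},
    {"username": "ExampleUserB", "score": 55.0},
]
for row in anonymise_records(scored):
    print(row)
```

Removing the username is not sufficient on its own if the remaining fields could single out a person, so any other fields kept in the data set would need the same scrutiny.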

It is possible that usernames on Wikimedia project spaces are associated with a unique user page which may also contain personal data. For this reason, researchers will be cautious not to process personal data other than the username on user pages belonging to research participants. Researchers are explicitly prohibited from processing any personal data encountered while conducting their research other than that required to identify the system being researched. We will log each encounter with personal data as and when necessary to conduct the research project.

Any dataset that is linked to a unique identifier for the purposes of this research will be stored on an isolated computer, one that has no connection to any network which transports data from one computer to another. This ensures that these identifiers are kept secure and that only authorized researchers can access them. Only datasets in which the unique identifiers have been truly anonymized will be moved from the isolated computer.

No data identifying individuals, including usernames, will be published in the final publications of the research findings.

Policy regarding interaction with the Wikimedia Foundation and its personal data processors

We believe that there is a high risk of participant bias if researchers and project overseers have more interaction with the Wikimedia Foundation and its processors than is absolutely required to conduct the research. One of our primary purposes in conducting this research is to measure accurately the Wikimedia Foundation’s compliance with existing and emerging data protection laws without the risk of appearing biased in favor of or against a specific research finding.

Key definitions

Table B. Terms and definitions for the purposes of Wikipedia Data Protection Research
Term Definition
Wikimedia Foundation "Wikimedia Foundation Inc."
Data processor Those responsible for a series of actions or steps taken using personal data, on behalf of the "controller", in order to achieve a particular end
Controller Wikimedia Foundation
Personal data Any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person

