Discovery/Data access guidelines

From Meta, a Wikimedia project coordination wiki

This page describes the data access and analysis guidelines that the Discovery team follows around its data sources, and that other teams follow when working with Discovery's data sources.

This document answers the following questions:

  1. What data sources do we have?
  2. Where are those data sources located?
  3. What information is in each data source?
  4. What confidentiality standards apply to each data source?
  5. Who has access to each data source?
  6. What is the process for sharing data within the organization?
  7. What is the process for sharing data outside of the organization?

Executive summary

  • Do not share any data externally without first receiving approval from Discovery, Security, and Legal.
  • We have three main data sources: request logs; event logs; and search logs. In general, getting access requires that you (1) be under a nondisclosure agreement (NDA), (2) have manager approval, and (3) undergo a standard review period (as explained below). Other requirements below may also apply.
  • To share and move data internally, you must transfer the data through one of the Wikimedia servers via SSH. Do not transfer data through any other means. There are additional requirements below for transferring sensitive data.
  • To share and move data externally, you must sanitize the data by stripping it of all personally identifiable information or by obfuscating the data to preserve user privacy. To obfuscate the dataset, you can anonymize or aggregate the data.

Data sources

This section describes the internal data sources that the Discovery team uses to improve the user experience around Search and related features.

Request logs

Request logs are logs of every HTTP or HTTPS request made to a Wikimedia site. We store them in a Hadoop cluster inside the Analytics network.

Security around the cluster works as follows: to access a machine with a connection to the Hadoop cluster, a staffer must first connect to our Analytics network, which requires a private SSH key and an associated individual password authorized by the Operations team. To access and read from the Hadoop cluster itself, they must additionally have specific, Operations-approved access privileges.

Operations approval requires that users of the Hadoop cluster:

  • demonstrate a legitimate need for this data;
  • be under an NDA;
  • have approval from their direct manager to access the data; and
  • undergo a 3-day waiting period in which Operations engineers can raise any concerns they may have about the user’s handling of data.

The request logs contain various pieces of personally identifiable information. (Personally identifiable information is any information that could be used to identify an individual.) Of particular note are users’ IP addresses, user agents, and in certain cases - such as with the Wikimedia mobile apps - unique identifiers. Additionally, IP addresses and user agents are considered “personal information” under the Wikimedia Foundation’s privacy policy and should not be disclosed publicly or to third parties unless the information has been sufficiently anonymized or the disclosure falls under a permissible disclosure exception under the privacy policy. Please check with Legal to be sure.


Event logs

EventLogging is a system for tracking user-side events on the Wikimedia projects. We use it for things like identifying how often people actually click through to search results, or which options on a page people tend to select. These logs are stored in a MySQL database inside the Analytics network.

Similar to the process for accessing request logs, accessing the MySQL database first requires a private SSH key (approved by the Operations team) to access the Analytics network; this can be the same key used to access the request logs in Hadoop. Staff and contractors under NDA can get SSH keys with manager approval and Operations approval, after a routine waiting period. Once a staffer has access to the Analytics network, they also need the shared group password to access the MySQL database itself.

The information that an EventLogging table contains varies depending on what the EventLogging system is being used for, but tables include some personally identifiable information by default (specifically, user agents) and may contain more depending on the use. Personal information should be kept confidential, as explained above.

Search logs

Search logs are logs of searches made by users - the actual text that was searched for. These logs only include the text searched for and are not connected to any other personally identifying user information, such as IP addresses or user agents. They are stored in CirrusSearchRequestSet, a table in the wmf_raw database on our Hadoop cluster. To access the logs via Hive, a staffer must have an authorized SSH key.

To get that SSH key authorized, a staffer needs to:

  • be under NDA,
  • have manager approval, and
  • go through a standard waiting period.

The SSH key gives a staffer access to “stat1002” in the Analytics network, where they can use Hive to work with Cirrus search logs.

Search logs may include personally identifiable information in the case that a user searches for personally identifiable information, such as their address or social security number. Search logs do not currently contain other information such as the user agent or IP address associated with the request.

Information from search logs should never be publicly disclosed or shared with contractors or third parties who are not under NDA. We do not currently have effective means of excluding PII, so any release of this information could have very negative consequences. A staffer with any questions should ask Legal.

Sharing and moving data internally

Example #1

An engineer needs to debug a particular problem with the search system - but the dataset that highlights it is found only on the Analytics network, which not all engineers have access to. The solution is for an analyst or engineer who does have access to that network to copy the dataset onto a machine the engineer does have access to. The engineer accesses it remotely (or downloads it to their WMF-issued laptop), does the work that needs to be done, and ensures the copied data is destroyed once it is no longer of immediate use.

Sometimes the work we do at WMF requires staff and contractors with direct access to sensitive data to share it with other staff and contractors. On other occasions, data needs to be transferred between machines - perhaps to a person’s local machine, rather than a server - for analysis or ad-hoc processing. This section provides guidance on how to handle these scenarios.

Sharing data with other staff and contractors

There are situations that require sharing restricted data with other WMF staff and contractors.

If a staffer finds themselves in a scenario - like the ones above - where they need to share sensitive data with other staff and contractors, or those staff and contractors need to share sensitive data, staff are expected to keep the data as secure as possible. In particular, no sensitive data should ever be transferred via email, IRC, direct file transfer, or any other unauthorized method.

Example #2

An analyst has performed some research, discovered something that needs to be passed to the engineers, and it can only be effectively represented in a form that contains personally identifiable information. They copy this data (including the PII) to fluorine, a machine that engineers can access, and go through the same pattern above. The data is held until the problem is understood, and then destroyed.

Should a staffer need to transfer this data, it should only be transferred through one of the Wikimedia servers via SSH. Within the Analytics network, “stat1002” and “stat1003” are commonly used; within the Production network, “fluorine.” Find a server that works for both the sender and the receiver, place the files in your home directory on that server, confirm that they have been retrieved, and delete the server-side copy.

Files containing sensitive data should only be transferred to staff and contractors who have signed an NDA and who have demonstrated a legitimate use for the data in a Phabricator ticket requesting it. For example, transferring search data to a WMF staff engineer developing a new search suggestion engine could be acceptable, while transferring the same data to a volunteer, or to a staff engineer who is simply curious about it, would not be. The Phabricator ticket itself should be under the “NDA” tag, which provides some coverage and security.

The Labs cluster is never an appropriate intermediary for transferring data between staff and contractors, or for anything else involving sensitive data, because Labs cannot guarantee that other system users are unable to access a user's data.

Storing data locally

Sometimes a staffer needs to store data on a local machine, rather than on one of our servers, for custom, ad-hoc analysis (as a fairly common example, it’s pretty hard to prototype data visualizations when all you have is a command line). If this data contains any sensitive information whatsoever, it should only be stored on a machine that is:

  1. Owned by the Wikimedia Foundation;
  2. Exclusively used by a staff member or contractor under the Wikimedia Foundation’s NDA; and
  3. Subject to Full Disk Encryption (FDE) (which should be on by default for all WMF computers).

Additionally, such information should only be stored for 30 days. Details on the information stored, the person storing it, and the rationale for storing it - along with a confirmation that it has been deleted when the time comes - should be provided to the Discovery Analysis team, to Legal, or both.

Sharing data externally

One of the things the Wikimedia Foundation prides itself on is its openness and transparency. This extends to our data; we try to give back to the wider research community to enable research that improves our work and the Internet as a whole.

At the same time, another thing we pride ourselves on is our privacy protections. Sharing data externally should be done cautiously with every effort expended to ensure that personally identifying information is never released. This section provides guidance on how to do that.

Sanitizing data

Case study: anonymization

We want to release data that uses IP addresses as a unique ID. Without these IP addresses, the dataset is useless. To sanitize the data we anonymize it by strongly hashing the IP addresses, with a salt, so that they are unique identifiers but cannot be used as a basis for things like geolocation or connecting read requests to edits or editors. The salt is one-time (and immediately disposed of), and a 'strong' hashing algorithm like SHA-256 or SHA-512 is recommended.
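The salted-hash anonymization described above can be sketched in Python. This is an illustrative sketch, not an existing WMF tool; the function and field names are invented for the example:

```python
import hashlib
import secrets

def anonymize_ips(records, ip_field="ip"):
    """Replace IP addresses with salted SHA-256 digests.

    The salt is generated once per release and then discarded, so digests
    are consistent *within* the released dataset (usable as unique IDs)
    but cannot be recomputed later or linked across releases.
    """
    salt = secrets.token_bytes(32)  # one-time salt; never stored or released
    anonymized = []
    for rec in records:
        rec = dict(rec)  # copy, so the raw input is left untouched
        digest = hashlib.sha256(salt + rec[ip_field].encode("utf-8")).hexdigest()
        rec[ip_field] = digest
        anonymized.append(rec)
    return anonymized
```

Because the salt is random and immediately disposed of, the same IP produces the same digest within one release but a different digest in any other release, which blocks cross-dataset linking.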

Principle: When releasing user data, we must find ways to effectively anonymize the data to preserve user privacy.

Case study: aggregation

We want to release user agent data so that people can be aware of what sorts of browsers we should be supporting, based on how many people are using each browser. We aggregate this data, to preserve user privacy, by compressing it to each unique user agent and the number of times that agent appears, and then setting a threshold - say, 1% of users - below which the data is not included in any form.
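A minimal sketch of this kind of thresholded aggregation, assuming a flat list of raw user-agent strings as input (illustrative code, not a WMF tool):

```python
from collections import Counter

def aggregate_user_agents(user_agents, threshold=0.01):
    """Compress raw user-agent strings to (agent, count) pairs, dropping
    any agent that accounts for less than `threshold` of all requests."""
    counts = Counter(user_agents)
    total = sum(counts.values())
    cutoff = total * threshold  # e.g. 1% of all requests
    return {agent: n for agent, n in counts.items() if n >= cutoff}
```

The threshold is what protects privacy here: a rare user agent can act as a fingerprint for a single user, so anything below the cutoff is excluded from the release entirely rather than reported with a small count.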

Principle: When releasing user data, we must find ways to effectively aggregate the data to preserve user privacy.

Raw data containing personally identifiable information should never, under any circumstances, be released publicly or to individuals not covered by the Wikimedia NDAs. If a staffer is unsure whether a type of data constitutes personally identifiable information, they should reach out to the Discovery Analytics team. If this doesn’t result in an answer, contact Legal.

Data can be sanitized in two ways:

  • stripping datasets of identifiable information; or
  • obfuscating datasets, if the identifiable information must be included in the release.

Obfuscation can consist of either anonymization or aggregation.

  • Anonymization changes the information so that it is not personally identifiable.
  • Aggregation changes the information so that it cannot be traced back to any one person.
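Stripping, the simpler of the two approaches, can be sketched as follows. The field names are hypothetical and would need to match the actual schema of the dataset being released:

```python
# Hypothetical identifying fields; adjust to the schema of the dataset at hand.
PII_FIELDS = {"ip", "user_agent", "app_install_id"}

def strip_pii(records, pii_fields=PII_FIELDS):
    """Return copies of `records` with identifying fields removed entirely.

    Use this when the release does not need the identifying information at
    all; if it does, obfuscate (anonymize or aggregate) instead of stripping.
    """
    return [
        {key: value for key, value in rec.items() if key not in pii_fields}
        for rec in records
    ]
```

Note that stripping is lossy by design: once the fields are dropped, nothing in the released dataset can be traced back to them.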

Checking sanitization

A staffer coming up with a way of sanitizing the data that seems robust to them isn't sufficient; they have to check it with our Security and Legal teams. This means contacting the lawyers and security engineers, along with the senior Discovery analyst, to explain what the data looks like, provide examples (see “sharing data with other staff and contractors,” above), describe the sanitization strategy, and flag any risks they can think of that they don't yet have an answer for.

The relevant people will then come back with their thoughts on whether the sanitization works, and with any additional checks they can think of to make the data even safer. Only when Security, Legal, and Analytics have agreed that the data is safe to release should it be released.

The Wikimedia Foundation makes reasonable efforts to provide accurate and reliable information. However, we do not endorse, approve, or certify such information, nor do we guarantee accuracy, completeness, timeliness, efficiency, or correct sequencing of the information. The information appearing on this site is for general informational purposes only and is not intended to be legally binding or to provide legal advice to any individual or entity. Use of such information is voluntary, and reliance on it should only be undertaken after an independent review of its accuracy, completeness, efficiency, and timeliness. Wikimedia is not responsible for, and expressly disclaims all liability for, damages of any kind arising out of use, reference to, or reliance on such information.