Discovery/Data access guidelines
This page describes the data access and analysis guidelines that the Discovery team follows for its data sources, and that other teams follow when working with Discovery data sources.
This document answers the following questions:
- What data sources do we have?
- Where are those data sources located?
- What information is in each data source?
- What confidentiality standards apply to each data source?
- Who has access to each data source?
- What is the process for sharing data within the organization?
- What is the process for sharing data outside of the organization?
In summary:
- Do not share any data externally without first receiving approval from Discovery, Security, and Legal.
- We have three main data sources: request logs, event logs, and search logs. In general, getting access requires that you (1) are under a nondisclosure agreement (NDA), (2) have manager approval, and (3) undergo a standard review period (as explained below). Other requirements below may apply.
- To share and move data internally, you must transfer the data through one of the Wikimedia servers via SSH. Do not transfer data through any other means. There are additional requirements below for transferring sensitive data.
- To share and move data externally, you must sanitize the data by stripping it of all personally identifiable information or by obfuscating the data to preserve user privacy. To obfuscate the dataset, you can anonymize or aggregate the data.
This section describes the internal data sources that the Discovery team uses to improve the user experience around Search and related features.
Request logs are logs of every HTTP or HTTPS request made to a Wikimedia site. We store them in a Hadoop cluster inside the Analytics network.
The cluster is secured as follows. To access a machine that has a connection to the Hadoop cluster, a staffer must first connect to our Analytics network, which requires a private SSH key, with an associated individual password, that has been authorized by the Operations team. To access the Hadoop cluster itself and read from it, they must also have specific access privileges, likewise approved by Operations.
Operations approval requires that users of the Hadoop cluster:
- demonstrate a legitimate need for this data;
- be under an NDA;
- have approval from their direct manager to access the data; and
- undergo a 3-day waiting period in which Operations engineers can raise any concerns they may have about the user’s handling of data.
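The requirements above can be read as a simple checklist. The following is a minimal sketch of that checklist, not a description of how Operations actually tracks requests: the `AccessRequest` record and its field names are hypothetical, and it assumes the 3-day waiting period is counted in calendar days from the date of the request.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: the 3-day waiting period is counted in calendar days.
WAITING_PERIOD = timedelta(days=3)


@dataclass
class AccessRequest:
    """Hypothetical record of a request for Hadoop cluster access."""
    legitimate_need: bool
    under_nda: bool
    manager_approved: bool
    submitted_on: date
    concerns_raised: bool = False  # set if an Operations engineer objects


def unmet_requirements(request: AccessRequest, today: date) -> list[str]:
    """Return the listed requirements that the request does not yet meet."""
    unmet = []
    if not request.legitimate_need:
        unmet.append("legitimate need")
    if not request.under_nda:
        unmet.append("NDA")
    if not request.manager_approved:
        unmet.append("manager approval")
    if today - request.submitted_on < WAITING_PERIOD:
        unmet.append("waiting period")
    if request.concerns_raised:
        unmet.append("unresolved Operations concerns")
    return unmet
```

An empty list means every listed requirement is met; the final decision still rests with the Operations team.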
EventLogging is a system for tracking user-side events on the Wikimedia projects. We use this for things like identifying how often people actually click through to search results, or which options on a page people tend to select. These logs are stored in a MySQL database inside the Analytics network.
Similar to the process for accessing request logs, accessing the MySQL database first requires a private SSH key, approved by the Operations team, to access the Analytics network. Staff and contractors under NDA can get an SSH key with manager approval and approval by the Operations team, after a routine waiting period; this can be the same key used to access the request logs in Hadoop. Once a staffer has access to the Analytics network, they then need the shared group password to access the MySQL database itself.
The information that an EventLogging table contains varies depending on what the EventLogging system is being used for, but every table includes some personally identifiable information by default (specifically, user agents) and may contain other kinds depending on the use. Personal information should be kept confidential, as explained above.
Search logs are logs of searches made by users - the actual text that was searched for. These logs only include the text searched for and are not connected to any other personally identifying user information, such as IP addresses or user agents. They are stored in CirrusSearchRequestSet, a table in the wmf_raw database on our Hadoop cluster. To access the logs via Hive, a staffer must have an authorized SSH key.
To get that SSH key authorized, a staffer needs to:
- be under NDA,
- have manager approval, and
- go through a standard waiting period.
Search logs may include personally identifiable information in the case that a user searches for personally identifiable information, such as their address or social security number. Search logs do not currently contain other information such as the user agent or IP address associated with the request.
Information from search logs should never be publicly disclosed or shared with contractors or third parties who are not under NDA. We do not currently have effective means for excluding PII, so any release of this information could have very negative consequences. If a staffer has any questions, they should ask Legal.
Sharing and moving data internally
Sometimes the work we do at WMF requires staff and contractors with direct access to sensitive data to share it with other staff and contractors. On other occasions, data needs to be transferred between machines - perhaps to a person’s local machine, rather than a server - for analysis or ad-hoc processing. This section provides guidance on how to handle these scenarios.
Sharing data with other staff and contractors
There are situations that require sharing restricted data with other WMF staff and contractors.
If a staffer needs to share sensitive data with other staff and contractors, as in the scenarios above, or those staff and contractors need to share sensitive data themselves, everyone involved is expected to keep the data as secure as possible. In particular, sensitive data should never be transferred via email, IRC, direct file transfer, or any other unauthorized method.
Should a staffer need to transfer this data, they should transfer it only through one of the Wikimedia servers via SSH. Within the Analytics network, “stat1002” and “stat1003” are commonly used; within the Production network, “fluorine.” Find a server that works for the sender and receiver, place the files in your home directory on that server, confirm that they have been retrieved, and then delete the server-side copy.
Files containing sensitive data should only be transferred to staff and contractors who have signed an NDA and who have demonstrated legitimate use for the data in a Phabricator ticket asking for it. For example, transferring search data to a WMF staff engineer developing a new search suggestion engine could be acceptable, while transferring the same data to a volunteer, or a staff engineer who is simply curious about it, would not. The Phabricator ticket itself should be under the “NDA” tag, which provides some coverage and security.
The Labs cluster is never an appropriate intermediary for transferring data between staff and contractors, or for anything else involving sensitive data, because Labs does not guarantee that other system users cannot access a user's data.
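The sender's side of the transfer steps in this section (place the file in a home directory on a Wikimedia server, then delete the server-side copy once retrieval is confirmed) could be sketched as follows. This is an illustration under assumptions rather than a prescribed tool: it assumes the short server name resolves through the sender's own SSH configuration, and the function names are hypothetical. The functions only build the commands; they do not run them.

```python
import os
import shlex


def upload_command(local_path: str, server: str) -> list[str]:
    """Build an scp command that copies a file into the sender's
    home directory on the server (the bare "server:" target)."""
    return ["scp", local_path, f"{server}:"]


def delete_command(local_path: str, server: str) -> list[str]:
    """Build an ssh command that removes the server-side copy.

    The file name is quoted because ssh passes the remote command
    through a shell on the server.
    """
    filename = os.path.basename(local_path)
    return ["ssh", server, "rm -- " + shlex.quote(filename)]


if __name__ == "__main__":
    # Hypothetical file; "stat1002" is one of the servers named above.
    print(upload_command("/tmp/sample.tsv", "stat1002"))
    print(delete_command("/tmp/sample.tsv", "stat1002"))
```

Each command list can be passed to `subprocess.run(cmd, check=True)`. The delete step should only be run after the receiver has confirmed that they retrieved the file.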
Storing data locally
Sometimes a staffer needs to store data on a local machine, rather than on one of our servers, for custom, ad-hoc analysis (as a fairly common example, it’s pretty hard to prototype data visualizations when all you have is a command line). If this data contains any sensitive information whatsoever, it should only be stored on a machine that is:
- Owned by the Wikimedia Foundation;
- Exclusively used by a staff member or contractor under the Wikimedia Foundation’s NDA; and
- Subject to Full Disk Encryption (FDE) (which should be on by default for all WMF computers).
Additionally, such information should only be stored for 30 days. Details of the information stored, the person storing it, the rationale for storing it, and a confirmation that it has been deleted when the time comes should be provided to the Discovery Analysis team or to Legal (or both!).
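One way to keep track of the details requested above is a small record per locally stored dataset. The sketch below is only an illustration: the `LocalStorageRecord` class and its field names are hypothetical, and it assumes the 30 days are calendar days counted from the date the data was copied to the local machine.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# The 30-day limit stated above; counting in calendar days is an assumption.
RETENTION = timedelta(days=30)


@dataclass
class LocalStorageRecord:
    """Hypothetical record of sensitive data kept on a local machine."""
    description: str                    # what information is stored
    stored_by: str                      # who is storing it
    rationale: str                      # why it needs to be stored locally
    stored_on: date                     # when it was copied locally
    deleted_on: Optional[date] = None   # set once deletion is confirmed

    @property
    def delete_by(self) -> date:
        """Last day on which the data may still be stored."""
        return self.stored_on + RETENTION

    def is_overdue(self, today: date) -> bool:
        """True if the data should already have been deleted but was not."""
        return self.deleted_on is None and today > self.delete_by
```

A record like this covers the four items the guidelines ask for: the information stored, the person storing it, the rationale, and the deletion confirmation.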
Sharing data externally
One of the things the Wikimedia Foundation prides itself on is its openness and transparency. This extends to our data; we try to give back to the wider research community to enable research that improves our work and the Internet as a whole.
At the same time, another thing we pride ourselves on is our privacy protections. Sharing data externally should be done cautiously with every effort expended to ensure that personally identifying information is never released. This section provides guidance on how to do that.
Raw data containing personally identifiable information should never, under any circumstances, be released publicly or to individuals not covered by the Wikimedia NDAs. If a staffer is unsure whether a type of data constitutes personally identifiable information, they should reach out to the Discovery Analytics team. If this doesn’t result in an answer, contact Legal.
Sanitizing data can be done in two ways:
- stripping datasets of identifiable information; or
- obfuscating datasets, if the identifiable information must be included in the release.
Obfuscation can consist of either anonymization or aggregation.
- Anonymization changes the information so that it is not personally identifiable.
- Aggregation changes the information so that it cannot be traced back to any one person.
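To make stripping and aggregation concrete, here is a minimal sketch on made-up event rows. The field names (`date`, `user_agent`, `ip`, `clicked`) are hypothetical, since real table schemas vary, and the `min_count` threshold for dropping small groups is an assumption of this example rather than a rule from these guidelines. Code like this does not make a dataset safe by itself; any release still has to go through the review process described in this section.

```python
from collections import Counter

# Hypothetical field names; real schemas vary by use.
IDENTIFYING_FIELDS = {"user_agent", "ip"}


def strip_identifiers(rows):
    """Return copies of the rows without the identifying fields."""
    return [
        {key: value for key, value in row.items() if key not in IDENTIFYING_FIELDS}
        for row in rows
    ]


def aggregate_by_day(rows, min_count):
    """Count events per day, dropping days with fewer than min_count events.

    Small groups are dropped because a count of one or two events can
    still point at an individual (an assumption of this sketch).
    """
    counts = Counter(row["date"] for row in rows)
    return {day: n for day, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    rows = [
        {"date": "2016-01-01", "user_agent": "UA-1", "clicked": True},
        {"date": "2016-01-01", "user_agent": "UA-2", "clicked": False},
        {"date": "2016-01-02", "user_agent": "UA-1", "clicked": True},
    ]
    print(strip_identifiers(rows))
    print(aggregate_by_day(rows, min_count=2))
```

Stripping keeps one row per event but removes the identifying columns, while aggregation replaces the rows with per-day totals.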
A sanitization method that seems robust to the staffer who devised it is not sufficient on its own; it has to be checked with our Security and Legal teams. This consists of contacting the lawyers and security engineers, along with the senior Discovery analyst, to explain what the data looks like, provide examples (see “Sharing data with other staff and contractors,” above), describe the sanitization strategy, and list any risks the staffer can think of that they don’t have an answer for yet.
The relevant people will then come back with their thoughts on whether the sanitization works, and what additional checks they can think of to make the data even safer. Only when security, legal, and analytics have agreed that the data is safe to release should it be released.
The Wikimedia Foundation makes reasonable efforts to provide accurate and reliable information. However, we do not endorse, approve, or certify such information, nor do we guarantee accuracy, completeness, timeliness, efficiency, or correct sequencing of the information. The information appearing on this site is for general informational purposes only and is not intended to be legally binding or to provide legal advice to any individual or entity. Use of such information is voluntary, and reliance on it should only be undertaken after an independent review of its accuracy, completeness, efficiency, and timeliness. Wikimedia is not responsible for, and expressly disclaims all liability for, damages of any kind arising out of use, reference to, or reliance on such information.