As part of the WMF research policy, we will be asking researchers who receive significant support from the Foundation to publicly release the data they produce as part of their research under an open license. This applies in particular to projects that receive the following kinds of support from the Foundation:

  • technical support for data collection
  • special API permissions
  • hosting or direct financial support
  • institutional endorsement
  • subject recruitment (with some restrictions)

Publishing raw datasets about Wikimedia projects occasionally implies releasing information associated with registered editor usernames. The Foundation's official privacy policy states that by participating in Wikimedia projects, "editors create a published document, and a public record of every word added, subtracted, or changed. This is a public act, and editors are identified publicly as the author of such changes. All contributions made to a Project, and all publicly available information about those contributions, are irrevocably licensed and may be freely copied, quoted, reused and adapted by third parties with few restrictions." (emphasis mine)

What this implies is that datasets referring to individual usernames (e.g. lists of editors ranked by some given criteria) fall in the category of derivative content extracted from publicly logged user contributions. Data obtained by processing the XML dumps or by querying the Wikimedia API with standard user access privileges (i.e. with the exclusion of nonpublic data) fall under the above definition. As a subset of the log of contributor activity that Wikimedia hosts in publicly accessible databases, these data can be freely copied, quoted, reused and adapted by third parties.
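As an illustration of what "querying the Wikimedia API with standard user access privileges" can look like, here is a minimal sketch that builds a request for a user's public contribution log through the MediaWiki `usercontribs` query module. The English Wikipedia endpoint, the helper name `build_usercontribs_url` and the username `ExampleUser` are assumptions chosen for the example; the sketch only constructs the URL and does not send any request.

```python
from urllib.parse import urlencode

# Assumption for this example: the English Wikipedia API endpoint.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_usercontribs_url(username, limit=50):
    """Build a URL listing a user's public contributions (hypothetical helper).

    Only publicly logged information is requested; no special
    permissions are needed for this query.
    """
    params = {
        "action": "query",
        "list": "usercontribs",
        "ucuser": username,
        "uclimit": limit,
        # Fields that are part of the public contribution log.
        "ucprop": "ids|title|timestamp|comment",
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    # "ExampleUser" is a placeholder, not a reference to a real editor.
    print(build_usercontribs_url("ExampleUser", limit=10))
```

Any dataset assembled from responses to requests of this kind is a subset of the public log described above.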

Members of the Wikimedia editor community have in the past expressed their desire to have their username individually removed from public lists of editors that proliferate on Wikimedia projects. These lists ceased to be contentious as soon as editors were given the possibility of having their name replaced by a placeholder (see this example). This solution, however, only makes it harder (not impossible) to obtain editor-related information. It doesn't prevent this information from being extracted from the public logs and republished elsewhere, which the WMF privacy policy explicitly allows. Any attempt at anonymizing public data will fail for the same reason.
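The following sketch illustrates why replacing usernames with placeholders does not anonymize data derived from public logs. It assumes a toy dataset in which each edit is described by a timestamp and a page title; all records, usernames and the function name `reidentify` are invented for the example. Because the same timestamp and page title appear in the public log next to the real username, a simple lookup recovers the mapping.

```python
# Toy public log: (revision timestamp, page title, username).
# All values are invented for illustration.
PUBLIC_LOG = [
    ("2011-03-01T10:15:00Z", "Page A", "Alice"),
    ("2011-03-01T11:02:00Z", "Page B", "Bob"),
    ("2011-03-02T09:30:00Z", "Page A", "Alice"),
]

# "Anonymized" research dataset: the same edits, with usernames
# replaced by placeholders.
ANONYMIZED = [
    ("2011-03-01T10:15:00Z", "Page A", "editor_1"),
    ("2011-03-02T09:30:00Z", "Page A", "editor_1"),
    ("2011-03-01T11:02:00Z", "Page B", "editor_2"),
]


def reidentify(anonymized, public_log):
    """Map each placeholder back to a username by matching on
    (timestamp, page title), which both sources share."""
    index = {(ts, page): user for ts, page, user in public_log}
    mapping = {}
    for ts, page, placeholder in anonymized:
        user = index.get((ts, page))
        if user is not None:
            mapping[placeholder] = user
    return mapping


if __name__ == "__main__":
    print(reidentify(ANONYMIZED, PUBLIC_LOG))
```

Any attribute that remains in the released data and also appears in the public log can serve as such a join key, so removing more fields only reduces the usefulness of the dataset without closing the gap.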

One solution to meet these concerns would be to publish research data in an aggregate form only, but this defeats the very purpose of making data openly available to promote further research. Publishing the data in a raw but anonymized form, on the other hand, is pointless insofar as effective anonymization of public data is not possible. We would like to recommend that any raw research dataset derived from publicly available logs of editor activity – collected via tools such as the toolserver, wikilytics or any third-party script querying the API with standard privileges or processing the XML dumps – be published without anonymization. We would like to include in the above recommendation excerpts of discussions from user talk pages, article talk pages or policy-related discussions that researchers may quote verbatim as part of datasets for qualitative research.

Real anonymization cannot be achieved unless a central mechanism is implemented that would effectively allow editors to opt out of all public logs, thereby entirely hiding their activity from these logs. This, however, would undermine one of the tenets of Wikimedia projects as a commons-based peer production system that publicly releases all user contributions under an open license allowing unconstrained reuse of these contributions. We think the RCom should play an active role in persuading our community that having researchers republish datasets that may include public editor-related information as part of their work is not only legitimate from the point of view of the WMF privacy policy, but also beneficial to the community in the long term.

