Research:Wikimedia Research Best Practices Around Privacy Whitepaper/Draft

How to provide feedback and comments

We're gathering feedback on the draft Research Ethics Privacy White Paper until 30 April 2024.

  • Join us for a Conversation Hour on 23 April 2024 at 15:00 UTC. This conversation will be guided by some questions to encourage actionable feedback. Join via Google Meet.
  • We encourage you to use the talk page/discussion feature to provide your input. If you need a private space to communicate your feedback, you can do so by sending an email to research-feedback@wikimedia.org with "privacy white paper" in the subject line.
  • We are asking for your feedback before the draft becomes stable and as we are revising the draft (hence the "todo" notes you'll see as you read the page). We believe now is the right time as we have done sufficient thinking and discussion on our end to be able to share and gather your feedback to improve the work. This also means that you won't see a completely polished draft.
  • Please don't directly edit this page. This is to make sure everyone is reading the same draft for providing feedback.
  • Consult the discussion page before providing your feedback. We have listed some specific questions for you there.
  • We encourage you to focus your feedback on the technical content of the white paper rather than on copy-editing. We expect the page to change significantly before we finalize the white paper, and we want to be respectful of your time.
  • We will be monitoring the talk page until 30 April 2024. We commit to reading all feedback, but unless we have follow-up questions we do not commit to engaging with every comment. Instead we will use the time on our end to continue improving the draft.
  • If you are more comfortable leaving comments in a language other than English, please feel welcome to do so. Please note that we may utilize machine translation in reviewing feedback provided in languages other than English.

Research and Privacy on Wikipedia [Draft]

Section 1: Introduction

1.1 What is the problem?

TODO: Shorten the introduction as much as possible. Look up and fix the number of actively edited projects and the number of publications.

Wikipedia is the largest encyclopedia and one of the most popular websites on the Web. The free encyclopedia is written and curated by hundreds of thousands of volunteers who come to the website every month, across more than 160 actively edited language editions, to deliberate and write the sum of all human knowledge. Readers access Wikipedia articles more than 15 billion times every month. However, the use of Wikipedia does not stop there.

Over the last decade, researchers have published an annual average of 30,000 articles about Wikipedia or using Wikipedia data. This research is possible because Wikipedia data, whether in the body of an article or in one of its many namespaces[1], is primarily licensed under CC BY-SA 4.0, a license that allows researchers to freely use Wikipedia’s wealth of data to learn about the world, develop the latest technologies, or study how humans govern knowledge at scale. The usage of Wikipedia data is governed by the Wikimedia Foundation’s Terms of Use[2] and by any further ethical requirements of researchers’ home institutions, where applicable. Wikipedia researchers, however, do not have access to guidance on how to conduct research ethically on Wikipedia specifically, in light of the particular values that govern Wikipedia as a global knowledge governance platform. Wikipedians, for their part, do not have a shared understanding of the researchers, their context, and the constraints they must follow. This lack of shared understanding and guidance has resulted over the years in multiple instances of tension between Wikipedians and the researchers who use Wikipedia data. One of the key tension points is privacy, and specifically the ethics of research that impacts the privacy of Wikipedia users.

1.2 Why is it important?

There are a number of benefits of having clearer alignment of understanding around privacy between project participants and researchers. These include reduced risk of harm to project participants, increased efficiency and avoidance of wasted time, as well as the opportunity to maximize benefits to the projects and all those involved.

Foremost is the opportunity to avoid harm to both parties, particularly project participants, such as Wikipedia editors, who may not be aware of the presence of researchers. While anonymous editors have IP addresses assigned to their edits, and logged-in users, per the username policy, should expect all activity to be publicly tied to their username, editors vary significantly in how public and identifiable they wish to be. Some voluntarily disclose their personal identities, either directly or through social cues such as language, location, or profession. Others intentionally avoid disclosing any personally identifiable information, for a wide range of reasons. At least part of the reason we observe variation in how participants choose to enact anonymity is the political, national, and social contexts in which they participate and edit. For example, editors in certain locations may face greater personal risks when editing about political figures or parties, among a long list of other topics that may be considered controversial or contentious.[3] Without an understanding of such issues and the values around privacy that project participants hold, researchers may inadvertently cause harm despite being otherwise well-intentioned.

Second, providing guidance and a clearer shared understanding around privacy yields clear benefits in terms of efficiency for everyone involved in the projects, including outside researchers. When things go wrong, hours of time have to be invested on both ends to address issues and points of tension. Moreover, such situations can escalate and even demand time from administrators and governing bodies on the projects, as well as from the Wikimedia Foundation or Wikimedia affiliates that may be involved. For example, when editors are named without adherence to clear guidance, it may be necessary for a Wikipedia arbitration committee to review the case in detail and determine whether any project policies were violated; an escalated case may require resources from the Wikimedia Foundation’s Legal, Communications, and Research teams; the editor involved may need to spend hours writing cases and escalating their challenge; and so on.

Finally, without shared understanding and clear guidance, there’s a missed opportunity to maximize benefits to everyone involved. From the perspective of the projects, there are clear benefits to having researchers engaged. For example, study of the projects can help identify inequities and opportunities, and eventually translate into concrete product outcomes benefiting current and future participants.[4] As for researchers, supporting more of them in getting involved and investing time and effort into studying the projects with confidence can increase their numbers; the dissemination of their research can in turn raise awareness of the projects and their value, yielding further benefits to the projects. Importantly, increased support for researchers can help them engage more successfully and effectively, while avoiding undesirable outcomes, such as inadvertent violations of project policies or values.

1.3 Existing guidance

While differences occur across global jurisdictions, researchers typically rely on established ethical guidelines or policies regarding the protection of subject privacy to ensure that their studies adhere to ethical standards and legal requirements. In the United States, for example, a set of federal regulations governing research involving human subjects known as the “Common Rule” includes provisions aimed at protecting subject privacy, including obtaining informed consent, de-identifying data, ensuring data security, and related privacy safeguards.[5] Similar regulatory frameworks for the protection of human subjects exist globally, including those published by the Canadian Tri-Council, the Australian Research Council, The European Commission, the United Kingdom’s Economic and Social Research Council, and the Forum for Ethical Review Committees in Asia and the Western Pacific (FERCAP).

While adherence to research policies is commonly guided by an institution’s ethical review board, not all researchers have access to such formal review bodies, and not all research that might utilize Wikipedia data falls under their purview. As a result, various scholarly communities have established their own research ethics committees and guidelines to ensure the ethical collection and use of online data for research purposes. These include the Association of Internet Researchers’ (AoIR) Ethical Guidelines,[6] the ACM Special Interest Group on Computer-Human Interaction (SIGCHI) Research Ethics Committee,[7] and the American Psychological Association’s (APA) Advisory Group on Conducting Research on the Internet.[8] Such guidelines provide general advice for researchers relying on online data to ensure subject privacy, respect the norms and expectations of the communities they study, and minimize risk to subjects.

Building from these general guidelines for the ethical collection and use of online data, scholars have focused on the unique ethical implications of relying on large data sets from online platforms that otherwise appear to be “publicly available”,[9] and have started providing guidance for research relying on data from specific online platforms, such as Reddit[10] and Twitter.[11]

Despite these advances, there has been little guidance developed specifically for research relying on data from Wikipedia, even though Wikipedia shares its data far more extensively than other online platforms. There have been some limited efforts to define best practices, especially around community engagement and recruitment.[12] There have also been a few historical, now mostly inactive, attempts to define research procedures. For example, the no-longer-active Wiki Research Committee[13] and the English Wikipedia research recruitment page[14] describe a process that researchers should follow before asking Wikipedia contributors to participate in research, including holding an RFC (request for comment) to obtain community consensus before moving forward. For the most part, these earlier processes and best-practice guidelines have not addressed privacy in detail.

In this white paper we aim to provide recommendations that will help researchers, research governing bodies and Wikipedians to more successfully navigate the ethics of research on Wikipedia when considering privacy.

The remainder of this paper is organized as follows. In Section 2 we share related work that this white paper builds on. In Section 3 we explore the key questions researchers and Wikipedians face when considering ethical research utilizing Wikipedia data and user privacy. In Section 4 we offer recommendations about how to conduct ethical research on Wikipedia considering privacy as a core value for many Wikipedians. We also offer some recommendations for Wikimedians. We close in Section 5 by sharing potential next steps.

Section 2: Related work

2.1 Privacy and research ethics

TODO: Where we refer to U.S. federal regulations in this section, add a comparison to GDPR's definition. Consider adding a discussion of differences/implications.

Broadly, principles of research ethics mandate that researchers must employ sufficient measures to safeguard the privacy of participants and maintain the confidentiality of collected data. Any privacy violation or breach of confidentiality creates risk for participants, including the potential exposure of personal or sensitive information, disclosure of embarrassing or unlawful behavior, or the release of legally protected data. Preserving the privacy and confidentiality of research participants has typically involved strategies such as collecting data anonymously, implementing access controls for any sensitive data collected, and scrubbing personally identifiable information (PII) from datasets.

This latter strategy of “de-identifying” data is perhaps most common, but is increasingly falling short of providing complete privacy protection. With the growth of internet-based research activities, researchers are increasingly able to collect detailed data about individuals from sources such as social media profiles, publicly accessible discussion forums, and data-sharing platforms where datasets can easily be processed, compared, and combined with other data available online. As a result, datasets presumed to be anonymized are increasingly susceptible to re-identification through the comparison and combination of datasets, sometimes in surprising ways.[15] Such cases reveal that merely stripping traditional “identifiable” information such as a subject’s name or email address is no longer sufficient to ensure data remains anonymous,[16] and they require a reconsideration of what counts as “personally identifiable information”[17] and of whether de-identification is even possible at all.[18]
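To make the linkage risk concrete, below is a minimal sketch in Python using entirely synthetic, hypothetical data (no real dataset or individuals): a "de-identified" table with names stripped can still be re-identified by joining it against a public roster on shared quasi-identifiers such as location and birth year.

  # Toy illustration of re-identification by record linkage (synthetic data only).
  deidentified_edits = [
      {"city": "Reykjavik", "birth_year": 1987, "topic": "local politics"},
      {"city": "Lagos", "birth_year": 1992, "topic": "public health"},
  ]
  public_roster = [
      {"name": "A. Example", "city": "Reykjavik", "birth_year": 1987},
      {"name": "B. Example", "city": "Lagos", "birth_year": 1992},
  ]
  for record in deidentified_edits:
      matches = [p for p in public_roster
                 if p["city"] == record["city"] and p["birth_year"] == record["birth_year"]]
      if len(matches) == 1:  # a unique match re-identifies the "anonymous" record
          print(matches[0]["name"], "likely edited about", record["topic"])

The point of the sketch is simply that neither table contains anything traditionally considered sensitive on its own; the privacy harm emerges from the combination.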

Traditional research ethics also typically only affords protection to “private information,” which U.S. federal regulations define as:

[A]ny information about behavior that occurs in a context in which an individual can reasonably expect that no observation or recording is taking place, and information that has been provided for specific purposes by an individual and that the individual can reasonably expect will not be made public (for example, a medical record).

The nature and understanding of what constitutes “private information”—and thus triggering particular privacy concerns—becomes difficult within the context of big data and internet-based research. Distinctions within the regulatory definition of “private information”—that it only applies to information that subjects reasonably expect is not normally monitored or collected and not normally publicly available—become less clearly applicable when considering the data environments and collection practices that typify many contemporary research practices, such as scraping data from public websites and social media platforms.

Researchers interested in collecting or analyzing online actions of subjects—such as through scraping of social media profiles and feeds or mining online logs of user activities—often argue that subjects do not have a reasonable expectation that such online activities are not routinely monitored since nearly all online transactions and interactions are routinely logged by websites and platforms.[19] Indeed, research protocols relying on such data are often considered exempt from formal ethical review boards since the data is “already public.”[20]

A growing number of scholars have started to confront these complex privacy concerns that pervade research relying on data from online platforms,[21] and multiple scholarly associations have worked to provide ethical guidance for internet researchers, including the American Psychological Association’s Advisory Group on Conducting Research on the Internet,[22] the SIGCHI Research Ethics Committee,[23] and the Association of Internet Researchers (AoIR).[24] Importantly, the AoIR’s popular “Internet Research: Ethical Guidelines 3.0” emphasizes how researchers must engage in “deliberative processes of ethical reflection” when considering the appropriateness of using data even when collected from public sources, a sentiment shared by a growing number of research ethics scholars.[25]

2.2 Privacy and Wikipedia

Successfully navigating the complex privacy aspects of research using online data requires researchers to understand and reflect carefully on contextual factors related to the users and platforms under study. Previous work has explored users’ privacy expectations regarding research using data from Twitter,[26] Facebook,[27] and other social platforms,[28] but to date only a limited number of studies have focused on the use of Wikipedia for research purposes.

2.3 Understanding privacy risks on Wikipedia

While various policy protections are in place on Wikipedia, both generally and on a project-by-project basis, given the fundamentally open nature of Wikipedia it’s important to ask what privacy risks are nonetheless present for participants. For the most part, editors can control what degree of personal information they decide to disclose, whether through their talk page contents or through in-person or online community events. Edit histories, however, are non-negotiably public. Past studies have explored whether otherwise harmless public data (such as the number of edits on broad categories of topics) have the potential to be used to identify personal traits of editors. Rizoiu et al. (2016),[29] for example, use a dataset of 13 years of Wikipedia edit histories (2001-2013) to demonstrate how tracking seemingly harmless features such as edit counts on different topics can uncover personal traits such as gender, religion, and education. Keep in mind that such traits go otherwise undisclosed unless Wikipedians choose to overtly share them through communication in discussions or, more likely, on their user page.[30] Notably, the authors of this study show how prediction of private traits improves over time as editors continue to contribute to the project(s).

Researchers should consider how their work, especially via publication and sharing of datasets, may further ease the ability of off-the-shelf learning algorithms to predict more and more about the personal traits of editors, which in turn may degrade participant privacy and open the door to risks such as doxxing and harassment. To date, Rizoiu et al. note, no studies have shown that editors’ ‘real’ identities can be revealed through editing activity alone. However, this might change as learning models advance, and the study’s authors don’t directly address the open question of how such modeling of personal traits from edit histories could be paired with other data, such as editor communication on talk pages and village pumps, by malicious actors to identify not only personal traits but also personal identities. While currently an open question, this is relevant because, as noted elsewhere in this white paper, editors operate in a wide range of political and social contexts, with varying degrees of freedom and risk of persecution.

Work on personal trait prediction from editing histories also raises the question of how editors (and researchers) may come to the platform with a range of expectations around privacy; in particular, some editors may reasonably expect a high level of privacy if they opt not to disclose information. Consider, for example, an editor who discloses no personally identifiable information nor general personal traits, focusing instead on editing around topics of interest. This research highlights that it is in fact unreasonable to expect that all general personal traits will escape predictability, and that participation over time increases predictability.

Also related to risks, we can ask whether Wikipedians are aware of these risks, and if so, what behaviors they may adopt to mitigate them, both perceived and real. Forte et al. (2017)[31] present a qualitative study examining privacy practices and concerns among contributors to open collaboration projects. Among the top five threats is surveillance and loss of privacy; in particular, participants may not know when, nor by whom, their activities are surveilled. At least some editors are aware that their public edit histories contain sensitive information. Forte et al. identify a number of other perceived risks and implications, including examples of how participants can become the target of discrimination based on their activities and solely what they edit about. Unfortunately, the study also provides examples of cases in which harm went beyond threats to physical violence. As a result, the two most common ways in which participants deal with such risks are modifying their participation and/or enacting anonymity. Researchers need to be aware of the risks and consequences that editors may face, and ensure their work does not contribute to those risks, nor to increased risks that cause participants to modify their behavior.[32]

2.4 Wikipedia (user)names

When creating an account, new editors can receive guidance if they opt to review the “help me select [a username]” option on the account-creation page. This guidance includes details such as how all contributions will be attributed to the chosen username, as well as the risks involved in editing under one’s real name.[33] In practice, usernames fall into a few different categories: first, those that reveal a real-world identity; second, those that nod to a real-world identity but only in part; and third, those used by editors deliberately to obscure their real-world identity. This means that usernames may in part contain personally identifiable information, either directly or indirectly.[34]

An additional consideration for researchers around usernames is what they mean to editors. In considering the question of whether to name research participants, it’s important to consider the value and importance of project usernames for editors. Even for editors who attempt to enact anonymity through their username, that username comes to encapsulate an online social identity that may be just as important as their real-world identity. For example, while editor motivations are varied, one’s username may carry weight in terms of reputation on the projects, and certainly captures the whole of their activity given the public nature of edit history attributions. This is often years of editing work, not to mention a reputation in editor social circles, as well as other essential administrative tasks an editor may carry out. For some, possibly even many, editors, an attack on their username may be perceived as a serious personal attack, one on par with an attack on their real-world name and identity.

Thus, one question frequently encountered by researchers is that of how to treat usernames in their reporting. Human subjects research protocols and Wikimedia policies provide clear guidance on protecting personally identifiable information, but treatment of usernames by researchers can be a fuzzier area to navigate. In reporting, should researchers name usernames? Should they be treated on par with real-life names? There may not be a clear-cut blanket answer to such questions, but drawing on work such as Bruckman et al. (2015),[35] we can identify some general guidelines or considerations that researchers can use on a case-by-case basis when their institution’s human subjects protocols may not provide guidance in this regard.

Bruckman et al. (2015)[36] introduce various complex cases around naming, but begin with the assertion that just because individually identifiable information is a fundamental part of the definition of the human subject, it doesn’t necessarily mean it has to be hidden. They review a case in which not naming a prolific internet community member might be considered unethical given the individual’s widely acknowledged accomplishments. They also ask what should be done in cases where a participant may wish to be named because they are seeking out negative attention for possible gains. One of the main suggestions by these authors is that researchers, who will ultimately have to decide to name or not, engage with research participants and give them the option to consent to be named (or not). They also offer the suggestion to allow participants to respond to accounts of them before publication. The practicality of such suggestions may vary based on the type of research being done, but the general notion of engaging with participants during decisions to name or not is good advice. In the context of Wikipedia, it may be a good suggestion to extend such recommendations beyond the context of real names to usernames as well.

2.5 Navigating the different spaces of Wikipedia

While Wikipedia is most well known for its article content, there are various other editable spaces that participants regularly use, and which present equally interesting research areas. These spaces, such as talk pages and village pumps (i.e., discussion forums), may offer vast areas of study for researchers interested in social phenomena. To list only a few examples of these non-article namespaces, researchers should be aware of article talk pages, more general discussion forum spaces (e.g., village pumps, the community bulletin board), and user pages, among various others.
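For researchers less familiar with how these spaces are organized technically, each corresponds to a MediaWiki namespace, and the full list for any wiki can be retrieved from the MediaWiki Action API. The following minimal sketch (Python, assuming the third-party requests library; the User-Agent string is a placeholder) queries English Wikipedia, but the same pattern applies to other language editions.

  import requests

  # List the namespaces of English Wikipedia (talk, user, project pages, etc.)
  # alongside the main (article) namespace, which has id 0.
  resp = requests.get(
      "https://en.wikipedia.org/w/api.php",
      params={"action": "query", "meta": "siteinfo",
              "siprop": "namespaces", "format": "json"},
      headers={"User-Agent": "privacy-whitepaper-example/0.1 (placeholder contact)"},
  )
  for ns in resp.json()["query"]["namespaces"].values():
      print(ns["id"], ns.get("*") or "(main)")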

While all of these spaces are fundamentally public and open, Pentzold (2017)[37] challenges us to consider a more nuanced view of what a ‘public space’ is. The paper also presents reasons that a researcher should carefully consider the privacy expectations of participants and research subjects relative to the spaces they navigate. While a topic for further work, there may be an intersection between these expectations and different project spaces.

Adopting Nissenbaum’s (2011) concept of privacy as “contextual integrity”, Pentzold argues that an assessment of sensitivity cannot be based purely on the type of information in question, and that there are socially shared meanings guiding decisions about what information is ethical to share. In a similar vein, the paper cites Boellstorff et al. (2012), who argue for respecting “not only what is public versus private from an etic perspective, but also what the people we study emically perceive as public or private.”

In the spirit of considering participant perceptions, the author offers multiple frameworks that we can adopt for thinking about the varied contexts of spaces on Wikipedia. For example, the author proposes different spheres according to sensitivity and privateness. An open sphere is one treated as public with no sensitive information (e.g., arguably, Wikipedia article pages). Conversely, closed spheres include spaces such as personal conversations and research interviews. Somewhere in between is the ‘limited closed sphere’, which could contain sensitive information; possible Wikipedia analogs might include user pages or talk pages. Applied to our current context, this argues, similarly to Bruckman et al. (2015), that researchers should engage more directly and frequently with project participants as they make privacy-related decisions in their research processes. This makes the recommendations in section 4 particularly important for their direct or indirect privacy implications.

2.6 Other on-wiki guidance

Supplementary and general research ethics guidance for best practices on the projects is available in a few different places. Some of it covers the topic of privacy and can serve as good guidance for researchers. For example, there are Meta-Wiki notes on good practice for Wikipedia research,[38] which include recommendations for the anonymization of research participants. Although not active for some years, the Meta-Wiki Research:Committee page[39] provides a template for research projects. More up-to-date, actively maintained guidance is available in the Resources section of the Meta-Wiki Research:Index.[40] A future Wikimedia Research Course is also in development at this MediaWiki draft page.[41] Finally, one key theme to highlight from section 2 of this paper is that researchers need to be proactive about communicating about their project(s) in the appropriate spaces and about avoiding language and communication barriers. Additional recommendations are available in section 4.

Section 3: Exploring key questions

In this section, we explore a range of topics around privacy, in order to support both researchers and Wikipedians. We begin by discussing community values on the platform and then present some of the key policies that researchers should be aware of. Next, we examine what privacy means in the world of open knowledge production and curation, and then discuss researcher motivations, ethical checks, and the benefits of having researchers engaged. Finally, we conclude this section by identifying a number of tensions and providing some points of guidance for researchers for when this document is not sufficient.

3.1 Understanding key values of Wikipedians

TODO: We are struggling with this section at the moment. We want to start with the very basic pillars or values but there is a significant distance between WP pillars and the "privacy" value specifically. We are in need of hearing from Wikipedians on this topic. What values do you think should be mentioned in this section? What community essays, policies, or guidelines should be referenced in communicating key values of Wikipedians?

Wikipedia is grounded in five pillars,[42] or fundamental principles. These are briefly listed below, followed by some details in the context of this white paper.

  1. Wikipedia is an encyclopedia
  2. Wikipedia is written from a neutral point of view[43]
  3. Wikipedia is free content that anyone can use, edit, and distribute
  4. Wikipedia’s editors should treat each other with respect and civility
  5. Wikipedia has no firm rules (but policies and guidelines[44] that may evolve over time)

While none of these directly reference privacy, they contain components that relate directly to values around privacy and even research on the projects. For example, pillar 3 acknowledges both the openness of the projects and their reuse. In this way the projects are by nature open to outside researchers, but it’s important to note the varying degrees of awareness among editors about the presence of researchers. While a small number of editors may be aware that research happens with project data, many are not. In general, Wikipedians value openness and the pursuit of knowledge, and in this spirit, researchers should make information about their projects publicly available for Wikipedians to learn about and from. Research that can positively impact the projects is welcome if done in a respectful way.

Next, through the presence of various pages discussing privacy practices, we observe a direct acknowledgement among Wikimedians of the importance of privacy, particularly the individual’s choice to operate with more or less privacy.[45] What we might reasonably deduce from the presence of these pages and their content is (1) a recognition that editors have varying needs and wishes around privacy (thus the guidelines for self-management of privacy), and (2) that it is perfectly acceptable for editors to enact some level of privacy for themselves if desired. There is no project-wide mandate around how public or private an editor needs to be; it is sufficient that their edit histories and activity are open in the interest of maintaining high-quality content on the projects. In this spirit of individual choice and variation, researchers should be mindful and respectful that there is a spectrum of values around privacy, especially as it pertains to how individual editors value privacy for themselves.

3.2 Doxxing policies across Wikimedia projects

For the most part, the Wikimedia movement is strongly anti-doxxing, with severe sanctions in place. As researchers look to tell stories about their data and provide informative narratives, they should be aware of doxxing policies. While policies and any rare exceptions differ across projects, most factor in an assessment of the intent and context of potential doxxing violations.

To briefly cover a few of the notable policies around doxxing: the Universal Code of Conduct,[46] section 3,[47] prohibits the disclosure of personal data, or sharing of contributors’ private information, without consent. This personally identifiable information (PII) includes data such as name, physical or email address, CV, and place of employment, among others. As part of enforcement, the Universal Code of Conduct factors in both the intent and the context within which PII is disclosed. In addition to the Universal Code of Conduct, the Wikipedia Terms of Use,[48] section 4,[49] includes several principles against doxxing. These principles cover disclosure of PII, but also solicitation of PII for any purpose of harassment, exploitation, or violation of privacy, or for any promotional or commercial purpose not explicitly approved by the Wikimedia Foundation. As for project-by-project policies around doxxing, it’s worth noting that smaller projects may have less complete policies, or may lack them altogether. Certain exceptions exist, mostly on an ad-hoc basis, and these vary by project. One such exception is when communities have permitted doxxing in an effort to halt undisclosed paid editing (e.g., English and Catalan Wikipedia).

Researchers should be mindful that, even if their intent is non-malicious, surfacing certain types of information could spur others to engage in doxxing. For this reason, researchers need to carefully consider not only their own intent in sharing narratives about project data, but how it might be leveraged by others with malicious intent. Also, researchers should be very mindful that the most dangerous implications of doxxing are for editors who reside in locations where there is strict censorship, frequent human rights violations, and political violence, among other serious risks. All of these risks and potential implications are further intensified when editing takes place around contentious topics, such as politics, national and political strife, and even social topics that may be considered contentious or highly debated.

3.3 Understanding parameters of variation for different language versions of Wikipedia

One of the advantages of studying Wikipedia is the richness of the diverse global data, stemming from the fact that there is such a diversity of languages and people involved in the projects. While this presents a richness of opportunities, it also presents a number of issues to be mindful of, especially when it comes to issues of privacy. A broader examination of considerations for multilingual Wikipedia research is covered by Johnson and Lescak [50], but here we highlight some of the considerations that are particularly relevant for privacy.

A key consideration is the fundamental fact that projects are divided by language, not geography. This means that editor communities for projects are diverse in terms of the cultures, political structures, and legal systems that editors are operating within. Thus, researchers should assume both variation across and diversity within projects when it comes to the composition of editor communities. When evaluating privacy risks, and particularly the implications and potential consequences of these risks, a researcher cannot assume homogeneity, even within a single project. This is particularly true for projects corresponding to languages with very wide global reach, such as English, Spanish, and Chinese, among a long list of others. As Forte et al. (2017)[51] note, certain subsets of editors for a project may have privileges due to their nationality, as one example. It is also notable that these authors found surveillance and loss of privacy to be among the top five threats experienced by contributors to open collaboration projects. These threats may carry serious consequences, such as loss of employment, personal safety, or reputation. Finally, as it relates to privacy, the age of editors may vary both within and across projects. This is important because different legal systems may have varying norms for research involving minors, and editors need not disclose their age to participate in the projects.

3.4 To support Wikipedians in understanding researchers

Wikipedia is one of the largest collaborative knowledge platforms on the internet, which makes it a frequent subject of interest for researchers. Its vast repository of articles covering a wide range of topics provides a useful forum for investigating the dynamics of online communities and collaborative knowledge production. Researchers can delve into the intricacies of Wikipedia's editing processes, exploring factors such as editor behavior, motivations, and the mechanisms governing content creation and moderation. By analyzing the patterns of contributions, conflicts, and consensus-building among editors, researchers can gain insights into the mechanisms that drive the creation and maintenance of a large, collaborative, online encyclopedia.

Furthermore, Wikipedia's open nature, combined with the diverse backgrounds of its contributors, makes it an excellent place to examine issues related to bias, representation, and the democratization of knowledge. Researchers investigate the presence of systemic biases in Wikipedia articles, such as gender or cultural biases, and explore strategies to address them. Additionally, the platform's multilingual nature allows researchers to explore cross-cultural communication and language dynamics in online environments. By studying Wikipedia, researchers might gain a deeper understanding of online collaboration and information dissemination while also contributing to a broader discourse on the accessibility and reliability of digital knowledge resources.

Researchers might also leverage Wikipedia as a resource for a vast amount of text data to train machine learning or more broadly AI models, including large language models like ChatGPT and similar generative AI platforms. Leveraging the openness of Wikipedia’s millions of articles across a diverse set of topics, researchers can provide machine learning algorithms a wide variety of domain-specific knowledge, language structures, and vocabularies to enhance these models’ ability to comprehend and generate human-like text.

Broadly, Wikipedia data used for research purposes might originate from published article text (the “main namespace”), but could also include edit histories, user account information, talk page discussions, or other data where personally identifiable information might be included or inferred. When researchers intend to use public data, such as data obtained from sources like Wikipedia, they still need to adhere to ethical standards and may require ethical review depending on the type of data used and the nature of their research.
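As an illustration of how identifying attributes surface even in routine pulls of "public" data, the minimal sketch below (Python with the requests library; the article title is an arbitrary example and the User-Agent is a placeholder) fetches recent revisions of a single article via the MediaWiki Action API. Note that every revision record carries the contributing username, or an IP address for logged-out edits, which is exactly the kind of data researchers should handle with the care discussed in this paper.

  import requests

  # Fetch the five most recent revisions of an article; each record includes
  # the contributor's username (or IP address for logged-out edits).
  resp = requests.get(
      "https://en.wikipedia.org/w/api.php",
      params={"action": "query", "prop": "revisions", "titles": "Privacy",
              "rvprop": "timestamp|user|comment", "rvlimit": 5, "format": "json"},
      headers={"User-Agent": "privacy-whitepaper-example/0.1 (placeholder contact)"},
  )
  page = next(iter(resp.json()["query"]["pages"].values()))
  for rev in page["revisions"]:
      print(rev["timestamp"], rev["user"], rev.get("comment", ""))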

In the United States, researchers may need to submit their research to an Institutional Review Board (IRB), which ensures that studies involving human participants adhere to ethical principles and regulatory requirements. IRBs assess the potential risks and benefits of research, safeguard participants' rights and welfare, and approve or provide guidance to researchers to ensure compliance with ethical standards. Other countries might have similar ethical review boards providing research oversight, with the specific structures, procedures, and regulations varying from country to country based on cultural, legal, and institutional factors.

Section 4: Recommendations

4.1 Recommendations for researchers

TODO: 1. Reduce the number of recommendations and create a better balance between the number in this section and the next one.

2. Some of the current recommendations are repetitive. Consolidate as much as possible.

3. Some of the current recommendations are beyond the focus of this whitepaper which is privacy (for example, create a research page on meta-wiki). We should consider making sure they're captured somewhere else on meta and link to it, and remove it as a dedicated recommendation here.

4. We have discussed potentially creating a mapping between the topics elevated in Section 3 and the recommendations in Section 4. Consider this for the final version.

  1. Policies that researchers need to abide by (Universal Code of Conduct, Terms of Use, etc...) (Exact recommendation TBD)
  2. Consider the size of the project (i.e., language version of Wikipedia) you write about. Editors active on smaller language versions could be at higher risk of doxxing due to the lack of policies and of admins to take action on such cases; thus, researchers' actions could more significantly affect them in negative ways. In addition, researchers should be mindful that a smaller number of active editors necessarily means reduced editor anonymity, both within any social structures that exist among editors and in the broader context of the public visibility of editors and their work. In other words, the potential social identifiability of individuals may be much greater. (A sketch of one way to gauge project size appears after this list.)
  3. Evaluate the geographic distributions of participants for the project. Even if an action taken by a researcher introduces roughly the same level of risk for editors of two different projects, the potential consequences of said risk may vary considerably. To understand why this is the case, and to also assess the potential consequences of the risk, researchers should to the best of their ability consider the political and social implications of privacy risks.
  4. Consider the cost of your action on Wikipedia. (TBD)
  5. Be aware of anti-privacy agendas in sheep's clothing. Agendas may be pushed under the guise of “investigative journalistic work” or “doxxing for good”. Researchers should proceed very cautiously in this area. If you’re encountering a situation in which you or others are considering the applicability of “doxxing for good” or anything by that name, it is best to escalate the concerns to project administrators. Projects have historically dealt with such cases, and are better situated than researchers to evaluate what should be done within the existing policies of the project.
  6. Treat usernames with as much importance and consideration as real-life identities. Although usernames and userpages may sometimes intentionally obscure real-life identities of editors (and therefore not technically qualify as personally identifiable information (PII)), researchers should still treat them as important layers of personal identity. Editors take pride in their reputation, which centers around their username, regardless of whether or not it directly reveals their real-life identity. Assume that any actions with negative impact on specific usernames will carry a real-life negative impact.
  7. Consider the privacy implications of any third-party research tools. When possible, it is often preferable to use open-source tools. However, when this is not possible, the researcher should clearly communicate this to any research participants and when relevant communicate the privacy policy of the third party tool(s) to research participants, ensuring they can make a privacy-informed choice to participate (e.g., survey tools, data storage tools, etc.).
  8. Be proactive about communicating about your project in the appropriate spaces and avoid introducing language and communication barriers. Researchers should plan around this to reduce the burden on participants. It is recommended that researchers communicate in advance about an upcoming project they are planning and design a method of feedback for Wikipedians. This is particularly important if none of the researchers are Wikimedians. Create a page describing your project at https://meta.wikimedia.org/wiki/Research:Index.
  9. Ensure communication reaches the communities where research is taking place. If conducting research across multiple language versions of Wikipedia, there is unlikely to be a single place where you can assume your communication will reach affected parties. Meta-Wiki is a good location for your research project page, but be mindful that editor communities have different norms for communication, with some utilizing in-person and off-wiki platforms (e.g., Telegram, WhatsApp) more than others.
  10. Follow existing human subjects research protocols at your institution and be aware of unknown age demographics that can affect protocols. Keep in mind that active editors may include legal minors, and the age composition of editor communities varies across language editions. To edit or create an account, there is no requirement to disclose one’s age; thus, many editors' ages (among other demographic information) may be unknown. Depending on the type of research you’re doing, this may have more relevance since human subjects research protocols for minors are frequently different than those for adult research participants.
  11. Consider how any research narratives around individual editors could be leveraged by malicious actors. Even if the researcher’s intent is good, their analyses - especially any that include narratives meant to explain data that focus on specific editors - have the potential to be leveraged by malicious actors for doxxing. Researchers should make an attempt to have their reporting reviewed by research participants whenever possible as editors may be able to flag dangers that researchers may not identify.
  12. Be aware of the power dynamics at play. (TBD)
  13. How to incorporate this document into your research protocol? (How to talk with ethics committees, etc...; exact recommendation TBD)
  14. Topic of risks to individuals vs risks to communities. (TBD)
  15. Natural experiment guidelines. (TBD)
  16. Err on the side of caution. (particularly in light of Terms of Use, Section 4)
  17. Consultation/contact options when researchers have more questions? (TBD)
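As referenced in recommendation 2, one quick, approximate way to gauge the relative size of a language edition is to query its site statistics. The minimal sketch below (Python with the requests library; the language codes are arbitrary examples and the User-Agent is a placeholder) compares active-editor counts across a few editions via the MediaWiki Action API.

  import requests

  # Compare active-editor counts across a few language editions; smaller
  # communities can mean reduced anonymity for individual editors.
  for lang in ["en", "is", "yo"]:  # arbitrary example language codes
      resp = requests.get(
          f"https://{lang}.wikipedia.org/w/api.php",
          params={"action": "query", "meta": "siteinfo",
                  "siprop": "statistics", "format": "json"},
          headers={"User-Agent": "privacy-whitepaper-example/0.1 (placeholder contact)"},
      )
      stats = resp.json()["query"]["statistics"]
      print(lang, "active editors:", stats["activeusers"])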

4.2 Recommendations for Wikipedians

TODO: We should hear more from Wikipedians about what recommendations they want to elevate in this section for one another. If you are a Wikipedian, please help us expand this section with your suggestions in the discussion page.

In this section we elevate some of the recommendations Wikipedians have shared with us for one another, as well as recommendations informed by researchers' perspectives and by what researchers need in order to effectively contribute to advancing the understanding of Wikipedia.

  1. Prioritize creating an environment where researchers feel welcome to contribute their expertise to Wikipedia. Similar to many other groups, the large majority of researchers who research Wikipedia want to help improve the project. As one editor shared with us, Wikipedia is a "dominant, international source of information. We have become an important part of the internet and wider society and we will be criticized for various things and some of that criticism will be things we need to hear in order to improve. That means we must avoid chilling and harassment of researchers."
  2. Understand the possible privacy risks. Your editing history alone can provide cues to personal traits. If you are editing in highly contentious areas, you may wish to take extra precautionary measures to protect your identity.
  3. Take measures to protect your social identifiability if you highly value anonymity. Your editing and general communication activity, both on- and off-wiki, may increase the degree to which you can be recognized through behavioral or social cues. Especially if you edit about highly contentious areas, be aware of the privacy risks introduced through your patterns of behavior and editing.
  4. Help educate researchers about policies and norms they may be unaware of. Researchers are encouraged to communicate early about their projects, and Wikipedians are similarly invited to reach out directly to researchers when they observe the potential for violation of policies and norms. Researchers may have varying degrees of understanding of Wikipedia policies and norms, but are generally eager to learn. When communicating about a potential problem, whenever possible, offer an alternative approach or solution.
  5. Reach out to researchers proactively with questions/concerns. Expect researchers to be present on the projects. Research on the projects is important and can many times lead to positive impacts on the projects. It is important to raise concerns directly with researchers as early as possible, and offer suggested solutions whenever possible.
  6. In the spirit of assuming good faith, offer feedback to researchers in the form of guidance and communicate directly with them. Wikipedia is an international source of knowledge and a critical part of the internet and Web. As a result, it is expected that the project may be critiqued for various reasons. We must encourage research on the projects and hear the critique in order to improve it.
  7. Follow established guidance on how to protect your real-world identity if so desired. Wikipedia:How to not get outed on Wikipedia provides various measures that editors can take.
  8. Escalation avenues (for example, if something can’t be resolved directly with researchers; TBD)

4.3 Guidance when this document is insufficient

TODO: It is best practice to indicate what readers should do if they find this document insufficient for their particular purpose. The challenge we're facing is that whatever solution we offer here will need to be scalable. Some options that we're considering, while remaining alert to their scalability:

Book a 1:1 office hour with a member of WMF Research team and seek advice.

Send an email to wiki-research-l mailing list and seek help.

Go to meta-wiki (where exactly?) and seek help.

Go to #wikimedia-research IRC channel and seek help. (risky as the channel is not very active these days.)

what else?

(TBD)

Section 5: Future work

Having explored both related work and key questions in this white paper, we wish to highlight some research opportunities that exist around the topic of ethical privacy on the projects. As highlighted throughout, Wikimedia projects involve a global, diverse range of participants, all of whom come to the projects with a range of expectations, understandings, and desires around privacy. As such, there are considerable opportunities to further study what privacy means to project participants.

For example, further study of perceptions and understandings of privacy, as well as potential risks, should investigate the different components of privacy. Such investigations could be framed by Solove’s (2006)[52] four broad categories of potential privacy violation types (i.e., information collection, information processing, information dissemination, and invasion). Just as importantly, however, researchers should examine editor understandings of privacy as a potential function of the editor’s geographic, political, linguistic, and project experience. We hypothesize that the cultural, geographic, social, and other situational factors of editors may directly affect their expectations and understandings of privacy, including perceived risks and consequences of those risks.

Other open questions include the relevance of different project spaces to privacy expectations. For example, do expectations differ as a function of project space (e.g., talk pages, village pumps, article talk pages, etc.)? They may also vary as a function of editor tenure, among various other editor characteristics that could be explored more deeply. Answering these questions, among others, may also help highlight additional privacy tensions, and researchers could examine frequent areas of conflict or tension, particularly when it comes to relationships not only among editors, but also between editors and non-editors (including readers, journalists, and researchers, among others).

Given the vastness of spaces and projects, a good starting point for further work on privacy would be survey data collection, in tandem with more targeted, in-depth interviews and focus groups.

References and notes

  1. To learn more about namespaces, you can read about them on English Wikipedia at https://en.wikipedia.org/wiki/Wikipedia:Namespace. To learn about namespaces in other Wikipedia languages, go to https://www.wikidata.org/wiki/Q4994250 and choose your language of interest.
  2. https://foundation.wikimedia.org/wiki/Policy:Terms_of_Use
  3. Other authors have detailed some of these risks, which include surveillance and loss of privacy, loss of employment and opportunity, harassment and intimidation, and reputation loss. See Forte, A., Andalibi, N., & Greenstadt, R. (2017). Privacy, Anonymity, and Perceived Risk in Open Collaboration: A Study of Tor Users and Wikipedians. Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, 1800–1811. https://doi.org/10.1145/2998181.2998273 for a more detailed discussion of risks faced by Wikipedia editors.
  4. One of many such examples includes work such as Tripodi’s (2023) mixed-methods work demonstrating notability and inclusion differences across biographies about women, a component of the broader gender gap problem.
  5. See 45 C.F.R. § 46, “Protection of Human Subjects”, and in particular the Common Rule, 45 C.F.R. 46 Subpart A (https://www.hhs.gov/ohrp/regulations-and-policy/regulations/45-cfr-46/index.html)
  6. https://aoir.org/ethics/
  7. https://sigchi.org/people/committees/research-ethics-committee/
  8. https://www.apa.org/science/leadership/bsa/internet
  9. See, for example, Michael Zimmer. 2016. OkCupid Study Reveals the Perils of Big-Data Science. Wired. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science/ Michael Zimmer. 2010. “But the data is already public”: On the ethics of research in Facebook. Ethics and Information Technology 12, 4 (2010), 313–325. https://doi.org/10.1007/s10676-010-9227-5
  10. See, for example, Casey Fiesler, Michael Zimmer, Nicholas Proferes, Sarah Gilbert, and Naiyan Jones. 2024. Remember the Human: A Systematic Review of Ethical Considerations in Reddit Research. Proc. ACM Hum.-Comput. Interact. 8, GROUP, Article 5 (January 2024), 33 pages. https://doi.org/10.1145/3633070 Gliniecka, Martyna. "The ethics of publicly available data research: A situated ethics framework for Reddit." Social Media+ Society 9.3 (2023): 20563051231192021.
  11. See, for example, Takats C, Kwan A, Wormer R, Goldman D, Jones H, Romero D Ethical and Methodological Considerations of Twitter Data for Public Health Research: Systematic Review, J Med Internet Res 2022;24(11). https://www.jmir.org/2022/11/e40380 Klassen, S., & Fiesler, C. (2022). “This Isn’t Your Data, Friend”: Black Twitter as a Case Study on Research Ethics for Public Data. Social Media + Society, 8(4). https://doi.org/10.1177/20563051221144317
  12. See, for example, https://meta.wikimedia.org/wiki/Notes_on_good_practices_on_Wikipedia_research and https://en.wikipedia.org/wiki/Wikipedia:Ethically_researching_Wikipedia.
  13. https://meta.wikimedia.org/wiki/Research:Committee
  14. https://en.wikipedia.org/wiki/Wikipedia:Research_recruitment
  15. See, for example: Barbaro, Michael and Tom Zeller Jr., 2006, “A Face Is Exposed for AOL Searcher No. 4417749”, The New York Times, 9 August 2006, pp. A1. Narayanan, Arvind and Vitaly Shmatikov, 2008, “Robust de-anonymization of Large Sparse Datasets”, Proceedings of the 29th IEEE Symposium on Security and Privacy, Oakland, CA, May 2008, IEEE, pp. 111–125. doi:10.1109/SP.2008. Zimmer, Michael, 2010. "“But the data is already public”: on the ethics of research in Facebook." Ethics and Information Technology 12: 313-325.
  16. Ohm, Paul, 2010, “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization”, UCLA Law Review, 57: 1701–1777. Brasher, E. A. (2018). Addressing the failure of anonymization: guidance from the European union's general data protection regulation. Colum. Bus. L. Rev., 209.
  17. Schwartz, Paul M. and Daniel J. Solove, 2011, “The PII Problem: Privacy and a New Concept of Personally Identifiable Information”, New York University Law Review, 86(6): 1814–1893. Solove, Daniel J.,2024, “Data Is What Data Does: Regulating Based on Harm and Risk Instead of Sensitive Data.” Northwestern University Law Review 1081, GWU Legal Studies Research Paper No. 2023-22, Available at SSRN: https://ssrn.com/abstract=4322198 or http://dx.doi.org/10.2139/ssrn.4322198
  18. Currier Stoffel, Elodie. "The Myth of Anonymity: De-Identified Data as Legal Fiction." New Mexico Law Review 54, no. 1 (2024): 129.
  19. See, broadly, Zimmer, Michael. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired. May 14, 2016. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science/.
  20. See, for example, cases described in: Fiesler, Casey, Michael Zimmer, Nicholas Proferes, Sarah Gilbert, and Naiyan Jones. "Remember the Human: A Systematic Review of Ethical Considerations in Reddit Research." Proceedings of the ACM on Human-Computer Interaction 8, no. GROUP (2024): 1-33. Stommel, Wyke, and Lynn de Rijk. "Ethical approval: None sought. How discourse analysts report ethical issues around publicly available online data." Research Ethics 17, no. 3 (2021): 275-297.
  21. Michael Zimmer. 2018. Addressing conceptual gaps in big data research ethics: An application of contextual integrity. Social Media+ Society 4, 2 (2018), 2056305118768300. Jacob Metcalf and Kate Crawford. 2016. Where are human subjects in big data research? The emerging ethics divide. Big Data & Society 3, 1 (2016), 2053951716650211 Casey Fiesler, Alyson Young, Tamara Peyton, Amy S. Bruckman, Mary Gray, Jeff Hancock, and Wayne Lutters. 2015. Ethics for Studying Online Sociotechnical Systems in a Big Data World. In Conference Companion for the ACM CSCW Conference on Computer Supported Cooperative Work and Social Computing (Vancouver, BC, Canada). New York, NY, USA, 289–292. https://doi.org/10.1145/2685553.2685558
  22. https://www.apa.org/science/leadership/bsa/internet/internet-report
  23. https://sigchi.org/committees/
  24. https://aoir.org/ethics/
  25. See, for example, Shilton, Katie, Emanuel Moss, Sarah A. Gilbert, Matthew J. Bietz, Casey Fiesler, Jacob Metcalf, Jessica Vitak, and Michael Zimmer. 2021. “Excavating Awareness and Power in Data Science: A Manifesto for Trustworthy Pervasive Data Research.” Big Data & Society 8 (2). https://doi.org/10.1177/20539517211040759.
  26. Fiesler, Casey, and Nicholas Proferes. 2018. “‘Participant’ Perceptions of Twitter Research Ethics.” Social Media + Society 4 (1): 2056305118763366. https://doi.org/10.1177/2056305118763366.
  27. Gilbert, S., Vitak, J., & Shilton, K. (2021). Measuring Americans’ Comfort With Research Uses of Their Social Media Data. Social Media + Society, 7(3). https://doi.org/10.1177/20563051211033824
  28. Gilbert, S., Shilton, K., & Vitak, J. (2023). When research is the context: Cross-platform user expectations for social media data reuse. Big Data & Society, 10(1). https://doi.org/10.1177/20539517231164108
  29. Rizoiu, M.-A., Xie, L., Caetano, T., & Cebrian, M. (2016). Evolution of Privacy Loss in Wikipedia. Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, 215–224. https://doi.org/10.1145/2835776.2835798
  30. https://en.wikipedia.org/wiki/Wikipedia:User_pages
  31. Forte, A., Andalibi, N., & Greenstadt, R. (2017). Privacy, Anonymity, and Perceived Risk in Open Collaboration: A Study of Tor Users and Wikipedians. Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, 1800–1811. https://doi.org/10.1145/2998181.2998273
  32. This is particularly relevant from the perspective of researchers since any behavior changes participants may enact in response to perceived risks introduced by research would effectively introduce confounding variables that may interfere with answering research questions.
  33. https://en.wikipedia.org/wiki/Wikipedia:Username_policy#Real_names
  34. For more nuance, see https://en.wikipedia.org/wiki/Wikipedia:On_privacy,_confidentiality_and_discretion#Who_are_we_when_we_edit_Wikipedia?
  35. Bruckman, A., Luther, K., & Fiesler, C. (2015). When should we use real names in published accounts of internet research? In Digital Research Confidential. The MIT Press.
  36. Bruckman, A., Luther, K., & Fiesler, C. (2015). When should we use real names in published accounts of internet research? In Digital Research Confidential. The MIT Press.
  37. Pentzold, C. (2017). ‘What are these researchers doing in my Wikipedia?’: Ethical premises and practical judgment in internet-based ethnography. Ethics and Information Technology, 19(2), 143–155. https://doi.org/10.1007/s10676-017-9423-7
  38. https://meta.wikimedia.org/wiki/Notes_on_good_practices_on_Wikipedia_research
  39. https://meta.wikimedia.org/wiki/Research:Committee
  40. https://meta.wikimedia.org/wiki/Research:Index
  41. https://www.mediawiki.org/wiki/Wikimedia_Research/Course
  42. https://en.wikipedia.org/wiki/Wikipedia:Five_pillars
  43. https://en.wikipedia.org/wiki/Wikipedia:Neutral_point_of_view
  44. https://en.wikipedia.org/wiki/Wikipedia:Policies_and_guidelines
  45. See, for example, information pages such as Wikipedia:Personal security practices, and Wikipedia:On privacy, confidentiality and discretion, Wikipedia:How not to get outed on Wikipedia, all of which discuss privacy and offer information and strategies for managing one’s own privacy.
  46. https://foundation.wikimedia.org/wiki/Policy:Universal_Code_of_Conduct#3_%E2%80%93_Unacceptable_behaviour
  47. https://foundation.wikimedia.org/wiki/Policy:Universal_Code_of_Conduct#3_%E2%80%93_Unacceptable_behaviour
  48. https://foundation.wikimedia.org/wiki/Policy:Terms_of_Use
  49. https://foundation.wikimedia.org/wiki/Policy:Terms_of_Use#4._Refraining_from_Certain_Activities
  50. https://arxiv.org/pdf/2204.02483.pdf
  51. Forte, A., Andalibi, N., & Greenstadt, R. (2017). Privacy, Anonymity, and Perceived Risk in Open Collaboration: A Study of Tor Users and Wikipedians. Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, 1800–1811. https://doi.org/10.1145/2998181.2998273
  52. Solove, D. J. (2006). A taxonomy of privacy. University of Pennsylvania Law Review, 154(3), 477–560.