Research talk:Wikimedia Research Best Practices Around Privacy Whitepaper/Draft

Questions for the Wikipedia Communities and Arbitration Committees

In particular, from this group we are seeking input on the following questions:

1. Starting with a review of some of the basics we've sketched so far in the white paper draft, what should the recommendations be for Wikipedians (section 4.2)? What is missing?

  • response_1
  • response_2
  • ...

2. What values do you think should be mentioned in Section 3.1 'Understanding key values of Wikipedians'? What community essays, policies, or guidelines should be referenced in communicating key values of Wikipedians?

Outside of policy I think the principle of "Transparency for the powerful, privacy for the weak" is one that is strongly held by the community and is what generally gets applied when things like NOTCENSORED and BLP conflict. Horse Eye's Back (talk) 16:31, 9 April 2024 (UTC)

3. What do you see as missing, if anything, as recommendations for researchers?

  • We can and should be expecting researchers who wish to engage with us to be aware of relevant ethics standards (including privacy and transparency activities; things like pre-registration, ethics boards, etc.) in their fields of research. Stuartyeates (talk) 06:45, 9 April 2024 (UTC)
    @Stuartyeates This is a good suggestion. What we're questioning is whether such standards are in practice widely available in different geographies and languages. For example, while ethics/IRB boards are pretty much standard bodies in certain geographies, I have personally seen research/academic institutions in different parts of the world that don't have them, or if they have them, their standard of practice varies significantly when compared to some other institutions. So something like "if it's available to you, make sure you utilize or follow them" is something we can comfortably encourage in the paper. I'm not sure we can say more than that. LZia (WMF) (talk) 23:50, 17 April 2024 (UTC)
    @LZia (WMF) maybe: Researchers are expected to be aware of and follow the ethics guidelines and codes of conduct of both their institution (if they have one) and of the prestige journals in their field. Researchers should be up front about ethics approvals they have obtained or plan to obtain for the planned work and the body granting those approvals. ? Stuartyeates (talk) 04:20, 20 April 2024 (UTC)
  • There used to be some sentiment in the research documentation that researchers should try editing first. Try to be a part of the community before parachuting in. Is this still the recommendation? I know this sentiment is strongly supported in other online communities. That is, take the "external" out of external researcher. Zentavious (talk) 19:28, 18 April 2024 (UTC)
  • There is a distinction between research ambiance and active support from editors. For the recommendations for editors, points 4-6 are about Wikipedians engaging with researchers, which takes effort on editors' part. Extending the idea of removing communication barriers, researchers should make it easy for individual editors to contribute/provide feedback to research. Additionally, we need some shared understanding of what lack of engagement means. What does it mean for a researcher to post on meta-wiki and receive no positive or negative feedback from Wikipedians? Zentavious (talk) 19:28, 18 April 2024 (UTC)
  • Seems that we need to say that en:Participant observation is our preferred research methodology. Stuartyeates (talk)

4. Where would you like to see future work on this topic? What opportunities should we be highlighting for researchers to examine more deeply?

  • response_1
  • response_2
  • next_response
  • ...

5. Do you have ideas about how we can address one or more of the "TODO"s we have listed throughout the draft?

  • response_1
  • response_2
  • next_response
  • ...

Questions for researchers[edit]

In particular, from this group we are seeking input on the following questions:

1. What recommendations are unclear, and maybe therefore unhelpful, for you?

  • Not particularly unclear, just adding more detail: I would consider adding in the sections on PII (2.1, 3.2):
    • Links to the WMF definitions of personal data, as it provides a useful list of common types and how data that isn't typically considered PII can become PII when combined with other data.
    • The enwiki article on personal data actually seems quite thorough and useful as well, and bridges some US and GDPR laws.
    • I would consider specifically naming IP addresses as PII, as many researchers do not think of it this way (possibly in section 3.2 where you list other PII).
    • For section 4.1.7, I would be careful not to conflate open source with privacy (the quick summary from enwiki) in this instance (particularly so that use of open source tools doesn't signify that a research participant shouldn't also consider the privacy implications of the research itself outside of the tool), though I agree open source tools are in general preferred for many reasons. In any case, whether or not the researchers use open or closed tools, they should communicate the relevant privacy policy and terms of use of the tool they're using. :) - TAndic (WMF) (talk) 15:47, 17 April 2024 (UTC)

2. What questions do you still have about privacy on Wikipedia after reading this?

  • As someone who's not done research with Wikipedia (yet!) and so would/will be relying on this page pretty heavily for guidance, there were a couple of things that weren't clear that might be obvious to a seasoned Wikipedia researcher: first, are there existing privacy norms around the type of data collected? For example, would Wikipedians likely feel differently about collection and analysis of article content vs edits vs talk pages? Second, are there privacy considerations that qualitative researchers, including ethnographers, should consider? (For example, an ethnographer using quotes from a talk page or an interviewer reaching out to people to talk?) Finally, is there anything that researchers should be doing to familiarize themselves with norms and values? Is it recommended that they lurk on talk pages? Participate as editors? CAT SarahGilbert (talk) 23:32, 16 April 2024 (UTC)
    • Adding here that I found this article to have some straightforward recommendations for qualitative researchers working with sensitive online data, and could be incorporated to help with part of this question (though I think there is no perfect resolution for online ethnography and discourse analysis that allows for direct quotes even without attribution): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4376240/ in the Conclusion, two listed strategies I've used in the past that I found especially helpful were:
      • "Devising elaborate strategies for disguise involving altering non-essential details."
      • "Presenting extracts from the same participant under multiple pseudonyms." (quoting the authors) -TAndic (WMF) (talk) 15:47, 17 April 2024 (UTC)
    • Whether or not researchers/academics should create Wikipedia accounts is something that I don't think we really have a defined best practice on. There are upsides, such as being able to interview people, but there are also downsides, such as having to comply with our rules and expectations; a researcher who isn't an editor doesn't have to follow any of the rules which an editor has to follow. We can't/don't sanction people who don't edit Wikipedia. If, for example, a researcher's work inherently includes conduct which the community would construe as off-wiki outing (for example historians or international security studies researchers), it would be inadvisable for that researcher to become a Wikipedia editor either openly or semi-anonymously. Those whose work doesn't contravene any community standards or expectations should feel free to wade into the pool to do their research, so to speak. My understanding is that, for example, ethnographic research in a semi-anonymous community would generally maintain that semi-anonymity in the final product. Horse Eye's Back (talk) 17:44, 17 April 2024 (UTC)

3. Do you have ideas about how we can address one or more of the "TODO"s we have listed throughout the draft?

  • response_1
  • response_2
  • ...

Additional comments and feedback

To all reviewers and commenters, please try to organize your feedback by either adding it to an existing topic on this talk page, or by adding new topics so we can keep discussions around similar topics organized.

We will be monitoring this page until 30 April 2024, but won't be able to respond directly to comments. However, all comments will be reviewed and considered in the ongoing drafting and revising process.

If you are more comfortable leaving comments in a language other than English, please feel welcome to do so. Please note that we may utilize machine translation in reviewing non-English content. Thanks for your feedback!

Are there any general recommendations the WMF Research team can give for how external researchers should document their wiki-based research on the projects themselves? Right now, it seems like there's no recommended template for summarizing a research project on (let's say) meta. Creating such a template and strongly encouraging researchers to use it might be useful, since it would
  1. provide a way for the Research Team to explicitly ask if researchers were abiding by these best practices
  2. give Wikimedians (both from outside and inside the Foundation) a talk-page to go to with questions, thoughts, complaints etc.
Just a thought about how we might provide a bit more accountability to both community members and the research team in this process! HTriedman (WMF) (talk) 21:20, 29 April 2024 (UTC)
+1 to a template. The more streamlined this can be the easier it will be for all stakeholders. For research teams it takes extra effort to give well-written, concise results summaries; making it easier is one way to nudge researchers to actually do this. For Wikimedians, these documents will be much easier to understand if they are in a predictable format. Plus, it gives the community a chance to articulate what aspects of results are important to them. Zentavious (talk) 16:10, 30 April 2024 (UTC)
"Right now, it seems like there's no recommended template for summarizing a research project on (let's say) meta" - actually there is: Template:Research project (and some associated ones, like the infobox template used there).
It has been used for hundreds of research projects for over a decade, including numerous current projects (check e.g. the entries at Research:Projects#Recent research projects).
The current draft of the whitepaper in fact already implicitly recommends using this template:

Create a page describing your project at https://meta.wikimedia.org/wiki/Research:Index.

The "+ project" button there leads to Research:New project which guides the user to creating a project page using Template:Research project, including the (voluntary) filling out of various template parameters and page sections defined in Template:Research project/Preload (among them one titled "Policy, Ethics and Human Subjects Research").
This has long been a very effective (see above) way of encouraging use of that template. I guess it could be made a bit more explicit elsewhere. But then again, it remains advisable for researchers to go through Research:New project instead of trying to use the template directly.
On the other hand, the current whitepaper draft awkwardly mentions a separate, entirely outdated template:

Although not active for some years, the Meta Wiki Research:Committee page[1] provides a template for research projects.

That page has indeed been marked as "historical" since 2016. What's more, the "Template for research projects" link there (which appears to be what the whitepaper's authors are referring to) merely goes to the talk page of a page that saw its last content edits in 2011, is almost orphaned and has clearly been superseded by the aforementioned Template:Research project. I suggest removing that sentence.
Regards, HaeB (talk) 07:49, 1 May 2024 (UTC)

Existing guidance

First, why are there no hyperlinks to said guidance documents, or to their Wikipedia articles where they exist? There are footnotes, but the lack of hyperlinks is surprising. See e.g. en:Common Rule. Second, in addition to linking to ethics committee bodies, we should link to the relevant ethics codes directly, and/or quote from them. This WP currently links, for example, to APA's "Advisory Group on Conducting Research on the Internet" ([2]), which is good, but it should also link to APA's "Ethics Code Standard" [3]. Further, I think it would be good to quote the relevant (short) parts of existing codes. Here, from APA: "Psychologists take reasonable steps to avoid harming... research participants..." I'd add two more I found a while ago: the Royal Historical Society's Statement on Good Practice: "taking particular care when research concerns those still living and when the anonymity of individuals is required" and ASA's ethical code: "Sociologists take all reasonable steps to implement protections for the rights and welfare of research participants as well as other persons and groups that might be affected due to the research... In their research, sociologists do not behave in ways that increase risk, or are threatening to the health or life of research participants or others." RHS's is particularly relevant as the paper that led to this WP's being drafted was written by historians. Piotrus (talk) 01:09, 11 April 2024 (UTC)

In agreement with Piotrus here. It is fine to have "Follow existing human subjects research protocols at your institution" but it should also have "Obey the ethical guidelines established by the relevant professional bodies", or similar, with examples. Zero0000 (talk) 01:52, 11 April 2024 (UTC)

Other remarks (by Piotrus)

Overall, I am quite impressed with this WP.

I think we should clearly state somewhere (perhaps in the abstract/nutshell as well as recommendation and conclusion) that the TL;DR best practice is to not name anyone unless they have permitted that. Instead, researchers should refer to volunteers as User1, Wikipedian-B, etc.

It would be good to spell out somewhere that for volunteers who disclose their real name on Wikipedia, privacy concerns exist as well (as in, they should not be named in a paper unless permission has been given). What to do when a researcher wants to link to a diff by such a user is a question to discuss further.

Regarding "For some, possibly even many, editors, an attack on their username may be perceived as a serious personal attack, one on par with an attack on their real-world name and identity." - if you want an academic citation, I am pretty sure this was discussed in en:Common Knowledge?. Ping User:Pundit, the author, who may be able to quickly provide the relevant chapter/page info.

I definitely agree that veteran nicknames are treated like names, they are a source of one's identity and should be treated with respect. I would not quote nicknames, unless from widely known public discussions or when it is essential to give the name. Pundit (talk) 18:14, 11 April 2024 (UTC)

I am curious about what can go to "Escalation avenues". Maybe consider linking to en:Committee on Publication Ethics here, for example - but many journals (including the one that sparked this WP) are not members of COPE. Writing a letter to the journal, or publisher, can be mentioned, but the reality is such letters are likely to be ignored. Support from WMF is somewhat theoretical - WMF declined to comment on the said paper (that led to this WP), for example.

While, as I said, this WP is overall a very good start, I do think one key aspect is missing (partially related to the outlined but not yet written section on "Escalation avenues"). If one feels that their privacy has been violated by a piece of research, what can they do - and what support, if any, can they expect from WMF? Something I would like to see is a public WMF system where people could ask WMF for help, publicly, and where WMF would respond in public. For example, I think WMF should publicly comment, when asked, on non-controversial issues such as stating whether a particular research paper followed best practices such as anonymizing volunteer names, asking them for permission to be named, whether a paper passed through a relevant IRB procedure, and assist in writing a letter to the journal expressing concerns if such best practices were not followed, and if the journal declines to publish such a letter, publish it on WMF's pages. Piotrus (talk) 01:31, 11 April 2024 (UTC)

I think it's debatable whether best practices would extend to not naming editors in circumstances when naming them carries legitimate public interest and academic value. Horse Eye's Back (talk) 16:13, 13 April 2024 (UTC)
One person's public interest and academic value is another person's harassment and trolling. The current WP already correctly observes that "Consider how any research narratives around individual editors could be leveraged by malicious actors. Even if the researcher’s intent is good, their analyses - especially any that include narratives meant to explain data that focus on specific editors - have the potential to be leveraged by malicious actors for doxxing. Researchers should make an attempt to have their reporting reviewed by research participants whenever possible as editors may be able to flag dangers that researchers may not identify." as well as "Be aware of anti-privacy in a sheepskin. Agendas may be pushed under “investigative journalistic work” and “doxxing for good”. Researchers should proceed very cautiously in this area. If you’re encountering a potential situation in which you or others are considering the applicability of “doxxing for good” or anything in that name, it is best to escalate the concerns to project administrators." Related to this is "Consider the cost of your action on Wikipedia. (TBD)", although I think in some cases problematic "research" (aka activism/harassment for greater good) does it and has the real intent of hounding some editors and making them retire from Wikipedia. A key aspect of this WP is to say that the Wikipedia community does not endorse outside parties (including researchers) engaging in “doxxing for good”, and hopefully we will develop a mechanism for publicly criticizing such papers. Electronic Frontier Foundation's Takedown Hall of Shame [4] comes to mind. en:Censorship by copyright (can't believe we were missing this article until a few days ago...) is an issue conceptually similar to what some “doxxing for good” "research" tries to achieve (censorship by doxxing/shaming/harassment?) - influence the content of articles, here, by attracting partisan editors and chasing existing ones away.
Piotrus (talk) 23:39, 13 April 2024 (UTC)

Thanks for this conversation and feedback so far. I find it helpful to sometimes be able to show folks the two ends of a spectrum and then help them see the full spectrum and the grey zone they should navigate. In the case of this paper, and particularly when it comes to handling PII (names, for example), we have said (and we will say) the things researchers should not do, and the things they should exercise caution about. Can we give clear guidance on when it is okay to share this information? Of course one scenario is if the person has given consent. Are there other scenarios?

One of our wishes for the paper is that at the end, it reads as a welcoming space that is helping researchers do better by providing clear guidance as much as possible. As you're thinking about it, if you have ideas that can help us achieve that, please let us know. --LZia (WMF) (talk) 00:12, 18 April 2024 (UTC)

Another scenario to keep in mind regarding the sharing of user names and real names involves editors who have played a central or very influential role in the governance of a community. To take two obvious examples, the on-wiki actions and communications of w:en:User:Jimbo Wales and w:en:User:Larry Sanger have been frequently examined in published research about Wikipedia (see e.g. this paper's discussion of the origins of Wikipedia policies such as the five pillars), without anonymizing them as "User1", "Wikipedian-B" or such. The former also illustrates one scenario where it is fine to connect real names with user names (i.e. the editor has intentionally posted their real name on-wiki in connection with their user name, here "Jimmy Wales" on his user page w:en:User:Jimbo Wales).
If the whitepaper attempts to retroactively accuse these many academic publications of illegitimate "doxxing", it will lose credibility from the start. (Also note that I have referred to the English Wikipedia's founders as a simple and well-known example, but many or most sister projects - and non-English Wikipedias - were not founded by these two people, but by prominent volunteer Wikimedians who were similarly influential in shaping those projects' mission and policies.)
Admittedly it is difficult to draw a clear, well-defined line between such cases and others that we may want to discourage (say calling out an editor by name for making a single problematic change to a niche article). But that doesn't mean that the whitepaper can afford to ignore this grey zone, or that it is OK to denigrate public interest and academic value as "harassment and trolling" in disguise. I agree with Horse Eye's Back's concerns above.
Regards, HaeB (talk) 07:05, 1 May 2024 (UTC)

Conversation Hour to Gather Feedback

Join us for a Conversation Hour on 23 April 2024 at 15:00 UTC to share feedback. This conversation will be guided by some questions to encourage actionable feedback. Join via Google Meet. JKoerner (WMF) (talk) 16:40, 16 April 2024 (UTC)

General Remarks

Section 1.3 talks about existing regulations for human subjects under the Common Rule. While many projects do involve human-generated data, IRBs often won't consider observational forms of research to be human subjects research. I believe this point is brought up later in section 2.1, but I think that is an important motivator for these sets of guidelines.

+1 to the idea that there are limited efforts to define best practices for community engagement. My experience has been very back of the napkin action research of actively talking to people and following their recommendations, which in turn leads to talking to more people with other recommendations. These efforts have been fruitful, but it definitely isn't uniform between projects and is still at risk of violating norms.

I noticed some different usage of Wikimedians and Wikipedians. This may be a moot point, but in what ways are these the same vs. different?

+1 to learning about a future course for onboarding researchers.

Finally, I was a little confused by this line. "As for researchers, by providing support for more to get involved and proceed with confidence with investments of time and effort into studying the projects, the number of researchers can be increased which can in turn also lead to increased awareness of the projects and their value through the dissemination of such research, in turn resulting in further additional benefits to the projects." Is this pointing out a positive cycle: when researchers increase the capacity of editors, it will support more researchers in the community, and so on...? Zentavious (talk) 19:47, 18 April 2024 (UTC)

HouseBlaster's thought

Hi! I had previously left a comment at phabricator about this, but opted to wait until now to share my thoughts. Regarding "What are the key values of Wikipedians? What do they value in research on/about Wikipedia?", a disproportionate number of editors on Wikipedia are themselves members of the academic community. Even if not directly involved in academia, we all care about knowledge for the sake of knowledge. I suspect many Wikipedians would agree that we don't just support, but welcome, research. Of course, we care about things like privacy and such. I think I can safely say that in general we are happy to help answer questions/fill out surveys/etc., especially if the research directly benefits Wikipedia in some way (e.g. a study on something like the effectiveness of anti-vandalism techniques). A link to d:Q11059110 and/or d:Q4026300 (on enwiki, en:WP:TEA and en:WP:HELPDESK, respectively) would probably be a good idea. HouseBlaster (talk) 16:39, 26 April 2024 (UTC)

Analytics Cluster

Hi, I'm a bit surprised that I cannot see any mention of the Analytics Cluster the WMF hosts with a parallel set of private data sources (and a Data Lake), despite the fact that it can only be accessed under formal collaborations between researchers and WMF.

I'm coming from a very data-sensitive community (German Wikipedia) which raised concerns about data collection and aggregation more than a decade ago (e.g., see this German Signpost (“Kurier”) post, a local RfC against systemic surveillance and a global RfC against opt-out solutions for tools, because of GDPR concerns, that was strongly supported by German Wikipedia community members), and I'm very much interested in research about Wikimedia projects (and therefore thankful for a white paper :) ). However, I've only learned from research papers like this that the Wikimedia Foundation collects much more data than regular community members are aware of (in this context, browser sessions) and can also share it with researchers through the aforementioned channels.

Please add this to the section where you discuss community members' awareness of what data is collected, not only publicly but also privately, that could be shared with researchers. This will add more transparency for researchers as well as community members.

On a side note, on German Wikipedia we don't have five pillars but four core principles (“Grundprinzipien”), without the fifth pillar, which is just a recommendation there. I'm pretty sure that there are other projects with adjustments.

Best, --DerHexer (Talk) 16:59, 29 April 2024 (UTC)

Looks like the WMF has a privacy policy that documents this intention alongside the information they collect. This could be linked in the document with additional context. I think this is a good example of how transparency is a combination of being open and commonly visible. It may be reasonable to publicly define guidelines for when outside researchers are granted access to aggregated private information. I like the idea of requiring the onboarding course as a precursor to requesting access. Zentavious (talk) 16:38, 30 April 2024 (UTC)