Research:Civil Behavior Interviews

From Meta, a Wikimedia project coordination wiki
20:33, 12 April 2018 (UTC)
Duration:  2018-04 — 2018-10

This page is an incomplete draft of a research project.
Information is incomplete and is likely to change substantially before the project starts.

The Lead: Introduce and describe your project at a high level in one or two paragraphs. Will the output of this project provide tangible benefits for our community (in the form of data, software, Web services)? If the output of this project mainly consists of scholarly publications, what aspects of Wikimedia projects will they help to understand?

The current work hopes to provide Wikipedians with a deeper understanding of how the truth is discussed and agreed upon by editors, using a lens of civility. This work views civility as a collaborative process that occurs on Wikipedia Talk pages, and uses this view to more fully understand what incivility looks like and how it can be detected in discussions among Wikipedians. Incivility in discussions like those taking place on Wikipedia should be addressed because it can influence how participants arrive at an understanding of the truth, which may have consequences for what information is present on the resulting Wikipedia pages (Papacharissi, 2004).

Civility can be difficult to measure and discuss, partly because the term is so universally used. One definition of civility used by researchers (Papacharissi, 2004) focuses on the collaboration needed to accomplish common goals, rather than mere politeness. Academics typically define civility as discussions that promote respect for all participants and enhance understanding of the issue being discussed, regardless of whether or not minds are changed (Papacharissi, 2004). Civility involves conversations that may be heated but still respect the identity of the individuals involved (Papacharissi, 2004). Wikipedia is set up in such a way as to encourage heated but civil debate, as Wikipedians must collaborate to determine what constitutes truth when writing an entry, as well as ensure that the entry is written neutrally. If these goals are accomplished, and accomplished in such a way that allows all discussants to contribute and be heard, then civility is present. However, well-informed civil discourse is not the only kind of discourse that occurs online or among Wikipedia editors, and online incivility often derails these necessary conversations (Coe, Kenski & Rains, 2014; Yasseri, Sumi, Rung, Kornai & Kertesz, 2012). By definition, the presence of incivility makes the discussions less productive (Papacharissi, 2004). The presence of incivility may also impact how a neutral view is arrived at by editors, as incivility may result in certain demographics being privileged and others being excluded from the discussion, thus skewing which viewpoints are present for consideration (Mouffe, 1999; Papacharissi, 2004). This skewing of viewpoints may also influence how new users are attracted and retained (MacAulay & Visser, 2016; Menking & Erickson, 2015).

Adding a system that automatically flags certain behaviors as civil or uncivil may help users of this tool deal with incivility, but the current definition of incivility, while helpful, is broad and difficult to translate into such a detection tool. Previous works that address this problem often operationalize incivility as profanity, slurs, and name-calling (Kwon & Gruzd, 2017; Santana, 2013; Vargo & Hopp, 2014). These operationalized definitions are far narrower than Papacharissi’s (2004) definition, and may miss more subtle methods of intentionally derailing a conversation (Bishop, 2014; Gervais, 2015; Muddiman & Stroud, 2017). This is particularly true in an environment such as Wikipedia, where edits and reverts can be used productively or unproductively, and where the online setting in general lends itself to forms of trolling and derailing that may be apparent to users but easily missed by systems (Bishop, 2014; Gervais, 2015; Muddiman & Stroud, 2017). As a result, this work will use the broader definition of incivility and begin by asking editors about their experiences in Wikipedia Talk pages. After grounding our initial understanding of civility and incivility on Wikipedia in the experiences of members of the community, we will work to develop a method to help editors detect and deal with incivility on Wikipedia.

Ultimately, this work seeks to more fully understand incivility and civility as they occur on Wikipedia Talk pages, in order to build a tool to help editors and administrators deal with incivility in uniform ways.

Research Team:

This research team is led by

  • E. Whittaker - a PhD student at the University of Michigan
  • Aaron Halfaker - a Principal Research Scientist at Wikimedia


Describe in this section the methods you'll be using to conduct your research. If the project involves recruiting Wikimedia/Wikipedia editors for a survey or interview, please describe the suggested recruitment method and the size of the sample. Please include links to consent forms, survey/interview questions and user-interface mock-ups.

Study 1

Before presenting editors with Talk pages to rate and label, it is important to understand what Wikipedia editors view as a good discussion. To that end, semi-structured interviews will be used to gain a deeper understanding of what interactions Wikipedia editors value, and how they think about those interactions in relation to each other.

Participant Sample and Measures

In order to achieve data saturation, a sample size of roughly 20-25 will likely be needed. Ideally, editors would be randomly selected to be interviewed, but convenience sampling may instead be necessary. Because semi-structured interviews are not standardized measures, an interview protocol has been developed, and is available below.

Here is the Interview Protocol

Data Analysis

The interview responses will then be examined for themes and patterns. Themes may vary by participant type or experience, and this will be considered. The resulting themes will be used to find interactions that can serve as exemplars of the themes. These interactions can later be used to test the effectiveness of the rating system as developed in Study 2.

Study 2

While the Interaction Timeline tool examines interactions between editors wherever they occur, scoping down to a particular form of interaction allows for a deeper focus on what forms of incivility are present in written discussions, as opposed to an examination of the role that edits play in incivility on Wikipedia. Editors will be asked if the Talk page is productive and civil, and asked to discuss why the page is unproductive, or uncivil, if they label it as such. Currently, the use of “Productive” and “Civil” is a provisional recommendation: the interviews may reveal that other themes indicate the quality of the discussion or the presence of incivility better than the terms “Productive” or “Civil”, but these terms will be used as stand-ins for now. There is some theoretical basis for using these terms, as previous research indicates that participants are often hesitant to label anything as uncivil in the absence of slurs or profanity. However, according to Papacharissi’s (2004) definition of incivility, if incivility is present, the discussion itself will be less productive, and productive discussion is also a key component of editing Wikipedia, so productivity may act as a proxy for incivility if participants do have more stringent standards for incivility than for unproductivity (Bishop, 2014; Gervais, 2015; Kwon & Gruzd, 2017; Muddiman & Stroud, 2017; Santana, 2013; Vargo & Hopp, 2014). Asking about productivity may surface behavior unrelated to incivility, but this behavior is likely to have negative consequences for discussion, so its inclusion should not be too distracting. Ultimately, the interview responses may surface different criteria, or may incorporate these ideas, but that remains to be seen.
These labeled Talk pages will be used to conceptualize what behavior or characteristics of the discussion editors view as incivility, and will ultimately be used to develop a tool for use with the Interaction Timeline tool to help editors effectively handle incivility.

Data Sample

Wikipedia has a total of 5,512,755 articles, and a list of 2,595 Controversial pages. Controversial pages and non-controversial pages likely contain differing amounts of incivility, and so both should be examined. However, examining every Talk page of Wikipedia is simply not feasible, so researchers will take a random sample of 256 (roughly 10%) of the Controversial pages and a random sample of roughly 0.01% (551) of the non-controversial article Talk pages that meet a minimum threshold for activity. The Talk pages of the sampled pages should be recorded before recruiting participants to ensure that participants are responding to the same information.
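
As a rough illustration of the sampling step described above, the following Python sketch draws a fixed-size simple random sample from each group of pages. The function name `draw_sample`, the seed values, and the integer placeholder IDs are assumptions made for this example and are not part of the project. The pool of non-controversial pages is a stand-in and is assumed to have been filtered by the activity threshold beforehand.

```python
import random


def draw_sample(page_ids, k, seed):
    """Return k page IDs drawn uniformly at random, without replacement."""
    rng = random.Random(seed)  # seeded so the same draw can be reproduced later
    return rng.sample(list(page_ids), k)


# Placeholder IDs standing in for real page titles (hypothetical data).
controversial_pages = range(2_595)
# Stand-in pool; the real pool of eligible non-controversial pages is not known here.
eligible_other_pages = range(10_000, 60_000)

controversial_sample = draw_sample(controversial_pages, 256, seed=1)
other_sample = draw_sample(eligible_other_pages, 551, seed=2)
print(len(controversial_sample), len(other_sample))
```

Fixing the seed and storing the sampled IDs is one way to make sure every participant is later shown the same set of pages.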

Participant Sample and Measures

In order to achieve moderate statistical power during subsequent analyses, 1,024 participants should be recruited from active Wikipedia editors or, if recruiting such a large number of editors proves impossible, from workers on Mechanical Turk. Four participants will rate each Talk page as “Productive”, on a scale of 1-5, and “Civil”, also on a scale of 1-5. If participants rate the page as “Unproductive” (< 3 on the scale), they will be prompted to explain why the page is unproductive. Similarly, if they rate the page as “Uncivil” (< 3 on the scale), they will be prompted to explain why the page is uncivil.
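
The follow-up rule described above can be sketched as a small Python function. This is only an illustration under stated assumptions: the function name `follow_up_prompts` and the wording of the prompts are hypothetical, and the actual survey instrument has not been designed yet.

```python
def follow_up_prompts(productive, civil):
    """Return the follow-up questions triggered by one participant's ratings.

    Ratings are integers from 1 to 5; a rating below 3 triggers a prompt.
    """
    for name, score in (("productive", productive), ("civil", civil)):
        if score not in range(1, 6):
            raise ValueError(f"{name} rating must be an integer from 1 to 5, got {score!r}")
    prompts = []
    if productive < 3:
        prompts.append("Why is this page unproductive?")  # hypothetical wording
    if civil < 3:
        prompts.append("Why is this page uncivil?")  # hypothetical wording
    return prompts


print(follow_up_prompts(productive=2, civil=4))
```

A rating of exactly 3 does not trigger a prompt under this reading of the "< 3" rule.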

Data Analysis

Talk pages that have been labeled as “Unproductive” or “Uncivil” (or some other negative term as recommended by the interviews) will be examined. Inductive content analysis will be used to analyze the editors’ explanations of why the page was “Unproductive” or “Uncivil”. This will allow me to examine recurrent themes of incivility and unproductivity that are mentioned by the community, and to distill these themes into categories. Examining themes surfaced by the community allows for a deeper understanding of what behaviors the community views as incivility, beyond slurs and name-calling. Furthermore, different forms of incivility may present in different ways, and a detection tool will likely need to take this into account in order to most accurately detect the presence of incivility. After detecting incivility, editors will need to address the incivility, and creating categories of incivility allows for any tool developed to be tailored to the needs of the users, as different expressions of incivility may be best handled in different ways. Previous works on incivility do include categories of incivility, although the focus is largely on slurs and personal attacks, and I believe it is reasonable to extend categorization of incivility to the context of Wikipedia (Bishop, 2014; Gervais, 2015; Kwon & Gruzd, 2017; Muddiman & Stroud, 2017; Santana, 2013; Vargo & Hopp, 2014).

After creating the categories, the categories will be adapted for use as tags that can be used by the Interaction Timeline tool. Assuming that the accuracy is acceptable, the tags can be built into the Interaction Timeline tool. The tool should recommend tags to editors who are looking at the timeline, and allow them to accept or reject the tags, or tag the timeline with another tag entirely. Building tags into the Interaction Timeline tool will allow Wikipedians who are examining a series of interactions for incivility to examine not only edit patterns, but semantic content as well. Providing tags for the incivility present will allow Wikipedians to deal with any incivility present in a more structured way, and should make whether an interaction constitutes incivility clearer and less ambiguous. Reducing ambiguity is important because it will empower Wikipedians to deal with a concept of incivility that moves beyond name-calling, but still has a chilling effect on the discourse that is so necessary for Wikipedians.

Development of the Tool

In order to have a sufficiently large training corpus, all talk pages of Controversial pages should be collected. Of those, 1,500 will be randomly selected as training data, while the rest will make up testing data. 750 Wikipedians or workers on Mechanical Turk should be recruited to label all Controversial talk pages using the tags developed in the previous stage. Each Wikipedian will tag 7 talk pages, and each talk page will be tagged twice.
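
The split and the tagging workload described above can be checked with a short Python sketch. It assumes the 2,595 Controversial pages given in the Data Sample section; the function name `split_pages`, the seed, and the integer placeholder IDs are hypothetical and only serve the illustration.

```python
import random

N_PAGES = 2_595        # Controversial pages, from the Data Sample section
N_TRAIN = 1_500        # pages randomly selected as training data
N_TAGGERS = 750
PAGES_PER_TAGGER = 7
TAGS_PER_PAGE = 2


def split_pages(page_ids, n_train, seed=0):
    """Shuffle page IDs and split them into (training, testing) lists."""
    ids = list(page_ids)
    random.Random(seed).shuffle(ids)
    return ids[:n_train], ids[n_train:]


train, test = split_pages(range(N_PAGES), N_TRAIN)

taggings_needed = N_PAGES * TAGS_PER_PAGE            # each page tagged twice
taggings_available = N_TAGGERS * PAGES_PER_TAGGER    # each tagger handles 7 pages
print(len(train), len(test), taggings_needed, taggings_available)
```

Under these figures the 750 taggers supply 5,250 taggings, slightly more than the 5,190 needed to tag each page twice, and the testing set holds the remaining 1,095 pages.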

These 1,500 training talk pages will be used as a training corpus for a machine learning algorithm that will be trained to detect the forms of incivility that are indicated by the tags. After training the algorithm, it will need to be tested. Editors will have tagged all Controversial talk pages, and in order to test the algorithm, the predictions of the algorithm will be compared with the actual tags previously supplied by editors. I also recommend having editors tag the examples identified in Study 1 using this system to ensure that the tags are useful for identifying the types of discussions they indicated occur frequently. Tag labels should be compared to the original category of the example.
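
One possible way to compare the algorithm's predictions with the editor-supplied tags is per-tag precision and recall, sketched below in Python. The proposal does not specify an evaluation metric, so this choice is an assumption, as are the function name `per_tag_scores` and the example tag names. The sketch also assumes that the two taggings of each page have already been reconciled into a single set of tags per page.

```python
def per_tag_scores(editor_tags, predicted_tags):
    """Return {tag: (precision, recall)} for predictions against editor tags.

    Both arguments are lists of sets, one set of tags per talk page, in the
    same page order. A score is None when its denominator is zero.
    """
    if len(editor_tags) != len(predicted_tags):
        raise ValueError("both lists must cover the same pages")
    all_tags = set().union(*editor_tags, *predicted_tags)
    pairs = list(zip(editor_tags, predicted_tags))
    scores = {}
    for tag in sorted(all_tags):
        tp = sum(1 for e, p in pairs if tag in e and tag in p)      # correctly predicted
        fp = sum(1 for e, p in pairs if tag not in e and tag in p)  # predicted, not tagged
        fn = sum(1 for e, p in pairs if tag in e and tag not in p)  # tagged, not predicted
        precision = tp / (tp + fp) if tp + fp else None
        recall = tp / (tp + fn) if tp + fn else None
        scores[tag] = (precision, recall)
    return scores


# Hypothetical tags for four talk pages.
editor = [{"derailing"}, {"personal attack"}, set(), {"derailing"}]
predicted = [{"derailing"}, set(), {"derailing"}, {"derailing"}]
scores = per_tag_scores(editor, predicted)
print(scores)
```

Reporting scores per tag, rather than one overall figure, would show whether some categories of incivility are harder to detect than others.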


Main deliverables:

  • Incivility tags/labels resulting from the interviews
  • A dataset of talk pages that have been labeled according to these tags
  • A machine learning algorithm that is trained to detect the forms of incivility indicated by the tags

Study 1:

  • Create interview protocol
  • Pre-test interview protocol
  • Data analysis

Study 2:

  • Creating the data sample
  • Creating the measure
  • Recruiting participants
  • Data analysis
  • Developing the tool:
      • Creating the data sample
      • Recruiting participants
      • Creating and training the algorithm
      • Testing the algorithm

Policy, Ethics and Human Subjects Research

IRB forms will be posted here as they are acquired, and before any research is conducted.


Once your study completes, describe the results and their implications here. Don't forget to make status=complete above when you are done.