Research:Applying Value-Sensitive Algorithm Design to ORES
Introduction & Background
Objective Revision Evaluation Service, or ORES, is a web service and API that provides machine learning as a service and is designed to help automate critical wiki-work – for example, vandalism detection and removal. It has been widely accepted and used in the Wikipedia community for some time. However, it still faces several key challenges: (1) ORES operates in diverse work contexts and involves multiple stakeholders, (2) in most work contexts, there are tensions between different stakeholder values, and (3) anecdotal evidence suggests that ORES might compromise some important values, such as fairness.
In an effort to mitigate potential conflicts of interest and to improve ORES performance, we plan to apply Value-Sensitive Algorithm Design to redesign ORES algorithms. The core idea of Value-Sensitive Algorithm Design (VSAD) is to engage relevant stakeholders in the early stages of algorithm creation and incorporate stakeholders’ tacit values, knowledge, and insights into the abstract and analytical process of creating an algorithm. Specifically, we will first conduct qualitative analysis and literature reviews to identify the values of different stakeholders involved in the ORES ecosystem. Second, we will translate stakeholders’ values into optimization targets for machine learning models, and design and fine-tune algorithms accordingly. Lastly, we will present and explain our results to ORES stakeholders to help them choose the best models, and collect their feedback as evaluation.
We will follow a five-step plan to conduct the study: (1) value seeking and collecting from stakeholders, (2) value translation into modeling criteria, (3) model generation, (4) result presentation and interpretation, and (5) evaluation. I will describe each step in the following sections.
Study Step 1: Value Seeking and Collecting from Stakeholders
We will identify all types of potential stakeholders involved in the ORES ecosystem and interview them to understand their experiences using ORES-related applications and to elicit their values. Specifically, we focus on people who are involved in the counter-vandalism applications supported by ORES. We chose this type of application because (1) vandalism is a major, high-stakes issue that bears on the fundamental operation of Wikipedia, and (2) many related applications have been developed, making their impact on Wikipedia non-trivial. By conducting semi-structured interviews, we will collect information about (1) the experiences of different stakeholders when they were involved with ORES applications, (2) the values they hold and prioritize when using ORES applications, and (3) their suggestions for algorithm design.
Here is the set of stakeholders in the ORES ecosystem, related to the applications in scope, that we identified and will interview: (1) ORES creators, i.e., people who created ORES, such as Aaron Halfaker, (2) application builders, i.e., people who built related applications such as Huggle, (3) application operators, i.e., people who actually use ORES, and (4) people affected, i.e., people who are affected by ORES, such as those whose edits were reverted.
We will reach out to and recruit interview participants in ways consistent with the social norms of the Wikipedia community. For example, we will contact potential participants through their Wikipedia user talk pages. We will also describe our study and post it in public forums on Wikipedia to (1) inform the community, (2) recruit more potential participants, and (3) present an interview summary for more feedback. In the meantime, I will submit an application to the IRB for permission to conduct the interviews. Below, I explain our interview plan in detail, including the interview structure and interview questions.
Each interview is expected to last 30-45 minutes. Participation is voluntary and uncompensated. Below is the proposed semi-structured interview flow.
1. Introduce ORES. We will introduce ORES briefly in case participants are not aware of it.
2. Ask for general thoughts on the use of machine learning algorithms in this context.
3. Investigate values related to ORES that participants consider important by discussing specific ORES use cases or workflows.
4. Ask how those values play out in this context, and how they could be abstracted and operationalized.
Note that interview questions are not exact, and will be customized for different stakeholder roles. The following are example questions.
1. How do you describe your role in Wikipedia?
2. What is your general perception of algorithmic systems that make decisions or assist people in making decisions on Wikipedia?
3. How do you feel about using algorithms on Wikipedia to make decisions independently or help others make decisions (e.g., edit reversion, SuggestBot (article recommendation), article improvement suggestion, etc.)?
4. To what extent do you interact with ORES or tools that use ORES?
5. Could you describe your experience using ORES?
6. What is most important to you as a Wikipedia [editor, developer, etc.]? Follow-up by role: when you develop XXX (for developers), when you use XXX (for operators), or when you edit on Wikipedia (for people affected).
7. Have you ever experienced conflict with other ORES users? If so, can you describe that conflict, and how you think it should be resolved?
If interviewees are newcomers to Wikipedia, questions may resemble:
1. Please tell me about your first experience as a Wikipedia Editor.
2. Have you ever revisited your edits on the Wikipedia page(s) you edited? Why?
3. How well do you feel that you understand what algorithms are, and how they relate to Artificial Intelligence?
4. What is your attitude towards algorithms and AI in general?
5. What do you know about how algorithms affect your Wikipedia experience? What is important to you about how such an algorithm might make a decision about your edit?
6. What do you think is most important to the whole Wikipedia community about how such an algorithm might make a decision about your edit?
7. If you could design an algorithm to do anything useful on Wikipedia, what would it do, and why did you choose that particular thing?
Values in ORES Ecosystems
Prior to conducting interviews, we considered two sets of potential values that could play out in the Wikipedia ORES ecosystem. The first set is based on prior studies of Wikipedia, and includes fairness, quality control [2, 3], and newcomer protection. These are domain-specific values of great importance to the Wikipedia community, because they relate to Wikipedia's core mission and its ability to thrive. The second set consists of generic, universal values that the social psychology literature has identified and studied. Examples include self-direction, achievement, power, conformity, and benevolence.
We conducted interviews with 16 stakeholders, including five (5) Wikimedia Foundation employees, two (2) external researchers who use ORES in their work on Wikipedia, two (2) volunteer developers who have built ORES-dependent applications, and seven (7) editors with varying degrees of experience on Wikipedia. Our interview data indicate that there is broad agreement across stakeholder groups about what is most important when designing machine learning algorithms and the applications that depend on them. These values capture elements from prior research on Wikipedia and universal values from the social psychology literature; however, they are specifically suited to the new context of community-based development of artificial intelligence. We derived five major values that converged across stakeholder groups (see next section).
Convergent Community Values
Our data suggest that these "Convergent Community Values" should guide future development efforts related to how algorithms ought to operate on Wikipedia:
1. Algorithmic systems should reduce the effort of community maintenance work.
2. Algorithmic systems should maintain human judgement as the final authority.
3. Algorithmic systems should support the workflows of individual people with different priorities at different times.
4. Algorithmic systems should encourage positive engagement with diverse editor groups, such as newcomers, females, and minorities.
5. Algorithmic systems should establish the trustworthiness of both people and algorithms within the community.
Tensions in Enacting Convergent Community Values
However, there are also fundamental tensions that arise in attempting to implement these values:
1. Valuing experimentation with algorithmic systems such as ORES can create tension with appropriately serving these five community values.
2. The openness of the Wikipedia community can be in tension with the intrinsic centralization of power in the hands of people who build technology.
3. Latency in quality control can be in conflict with undesirable temporal social consequences.
4. Achieving the best model fitness can be in conflict with ethical considerations related to which features should be used to construct models, and the effects of those considerations on the community.
Open Invitation for Community Feedback on our Results
During the week of June 14 - June 21, 2019, we are hosting a community-wide discussion of these results. We invite feedback from everyone on our interpretation of participants' quotes and ideas. You can view a current draft of our results in this Google document. Please leave your comments on the Google document, or write to the authors at FauxNeme or Bobo.03. We are also posting this invitation at Village Pump and on the Wiki Research mailing list.
Study Step 2: Value Translation to Modeling Criteria
Once a set of values has been identified and coded through interviews, we will analyze them to crystallize and finalize a set of inputs and values that are high-stakes and relevant to the ORES counter-vandalism applications in our scope. We will then translate stakeholders’ inputs into formal modeling criteria, that is, optimization targets for machine learning models. In this section, I use some hypothetical examples of stakeholders’ values to illustrate how to translate values into modeling criteria.
We define a positive event as ORES labeling/predicting an edit as a good edit. Following this definition, the entries of the confusion matrix are as follows.
True positive: ORES labels/predicts as a good edit, and it’s a good edit
False positive: ORES labels/predicts as a good edit, but it’s a bad edit
True negative: ORES labels/predicts as a bad edit, and it’s a bad edit
False negative: ORES labels/predicts as a bad edit, but it’s a good edit
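To make the convention above concrete, here is a minimal sketch that tallies the four confusion-matrix cells with "good edit" as the positive class. The function name and the six example labels are hypothetical and are not drawn from ORES data.

```python
def confusion_counts(predicted_good, actually_good):
    """Count TP/FP/TN/FN where the positive class is 'good edit'."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for pred, truth in zip(predicted_good, actually_good):
        if pred and truth:
            counts["tp"] += 1      # predicted good, actually good
        elif pred and not truth:
            counts["fp"] += 1      # predicted good, actually bad
        elif not pred and not truth:
            counts["tn"] += 1      # predicted bad, actually bad
        else:
            counts["fn"] += 1      # predicted bad, actually good
    return counts


# Hypothetical labels for six edits (True = good edit).
predicted = [True, True, False, False, True, False]
truth = [True, False, False, True, True, False]
print(confusion_counts(predicted, truth))
```

Note that this is the reverse of the common convention in vandalism detection, where the damaging edit is usually treated as the positive class, so the meaning of "false positive" here should be read against the definitions above.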
Value Translation Example 1: Quality control - minimize false-positive rate
We abstract quality control as minimizing the false-positive rate for modeling. It is important for Wikipedia as a whole to maintain all articles and edits at high quality. Hence, ORES is expected to catch all damaging edits as well as malicious vandalism. That means minimizing the cases where edits that ORES predicted as good were actually bad, i.e., the false-positive rate under the definition above.
Value Translation Example 2: Protecting editor motivation - minimize false-negative rate
We abstract protecting editor motivation as minimizing the false-negative rate for modeling. One way to protect editor motivation from the perspective of ORES is to avoid reverting good edits by good-faith editors, since editors, especially new editors, are more likely to leave the community when their edits are frequently reverted [x]. That means minimizing the cases where edits that ORES predicted as bad were actually good, i.e., the false-negative rate.
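The two criteria in Examples 1 and 2 can be written as simple ratios over the confusion-matrix counts. The sketch below assumes the standard definitions FPR = FP / (FP + TN) and FNR = FN / (FN + TP), with "good edit" as the positive class. The function names and the counts are hypothetical.

```python
def false_positive_rate(fp, tn):
    """Share of truly bad edits that were predicted good (quality control)."""
    total_bad = fp + tn
    return fp / total_bad if total_bad else 0.0


def false_negative_rate(fn, tp):
    """Share of truly good edits that were predicted bad (editor motivation)."""
    total_good = fn + tp
    return fn / total_good if total_good else 0.0


# Hypothetical counts: 20 truly bad edits and 100 truly good edits.
fpr = false_positive_rate(fp=5, tn=15)
fnr = false_negative_rate(fn=20, tp=80)
print(f"FPR={fpr:.2f} FNR={fnr:.2f}")
```

Under these definitions the two rates have different denominators (bad edits versus good edits), so one cannot be derived from the other without knowing the base rate of good edits.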
Value Translation Example 3: Editor treatment fairness - equalize measurements for protected groups
Fairness is an important concern across Wikipedia, from the Wikimedia Foundation to the ORES development team and the Wikipedia editor communities. However, how to define fairness and who should be treated fairly have not yet been fully investigated. We will work with stakeholders during the interviews to identify how they understand and define fairness, and further explore ways to be fair in the context of counter-vandalism on Wikipedia.
Here we list some possible operationalizations for our hypothetical fairness treatment groups.
1. Unregistered vs. registered editors: Equalize false positive/negative rate between unregistered (anonymous) editors and registered editors. E.g., disparity in false positive rate <5%.
2. Female vs. male editors: Equalize false positive/negative rate between female editors and male editors. E.g., disparity in false positive rate <5%.
3. New vs. experienced editors: Equalize false positive/negative rate between new editors and experienced editors. E.g., disparity in false positive rate <5%.
4. All editor subgroups: Equalize false positive/negative rate across all subgroups (Kearns et al. 2018, 2019). E.g., disparity in false positive rate across all the subgroups <5%.
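As an illustration of how such a disparity criterion could be checked, the following sketch computes the false positive rate per editor group and the largest gap between groups. It assumes that "<5%" means an absolute difference of less than 0.05 between group rates, which is one of several possible readings. The group names, records, and function names are hypothetical.

```python
def group_false_positive_rates(records):
    """records: iterable of (group, predicted_good, actually_good) tuples.
    Returns {group: FPR}, with 'good edit' as the positive class."""
    fp, bad = {}, {}
    for group, pred, truth in records:
        if not truth:  # only truly bad edits enter the FPR denominator
            bad[group] = bad.get(group, 0) + 1
            fp[group] = fp.get(group, 0) + (1 if pred else 0)
    return {g: fp[g] / bad[g] for g in bad}


def max_disparity(rates):
    """Largest absolute gap between any two groups' rates."""
    return max(rates.values()) - min(rates.values())


# Hypothetical data: 20 bad edits per group, plus some good edits.
records = (
    [("unregistered", True, False)] * 4
    + [("unregistered", False, False)] * 16
    + [("registered", True, False)] * 2
    + [("registered", False, False)] * 18
    + [("registered", True, True)] * 10
)
rates = group_false_positive_rates(records)
disparity = max_disparity(rates)
within_tolerance = disparity < 0.05  # assumed reading of the "<5%" criterion
print(rates, disparity, within_tolerance)
```

Because the check takes the gap between the highest and lowest group rate, the same function covers the pairwise cases (items 1 to 3) and a fixed list of subgroups. It does not implement the subgroup-auditing approach of Kearns et al., which searches over a much richer family of subgroups.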
Study Step 3: Model Generation
We plan to generate a set of prediction models with different parameter settings guided by the values translated in the previous step. These models will cover a spectrum of trade-offs across a set of relevant values.
We will start with pairwise value trade-offs. Take the trade-off between quality control and newcomer protection as an example. The two values sit at opposite ends of a spectrum. That is, in the extreme cases, one model handles quality control best but newcomer protection worst, and another model does the opposite, handling quality control worst but newcomer protection best. The set of models generated will be spread across the entire spectrum: some models will favor quality control (false positive rate) at the cost of newcomer motivation (false negative rate), and vice versa, while others will sit somewhere in between.
We will use the available ORES training data with ground-truth labels for model training. The size of the data will be decided based on (1) input from the ORES creators, and (2) the experimental performance of the models, i.e., we want the dataset to be large enough for the models to be effective without making training too slow. The set of models to be trained includes popular models in the literature, such as classic learning models, ensemble models, and neural networks and deep learning models, as well as models from the explainable machine learning literature.
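One simple way to obtain such a spectrum, shown here only as a sketch, is to vary the decision threshold of a single scoring model. The plan above may equally involve training separate models with different parameter settings. The scores, labels, and function name below are hypothetical, and an edit is predicted "good" when its score is at or above the threshold.

```python
def sweep_thresholds(scores, actually_good, thresholds):
    """Return (threshold, false_positive_rate, false_negative_rate) for each
    threshold, predicting 'good' when score >= threshold."""
    n_bad = sum(1 for t in actually_good if not t)
    n_good = sum(1 for t in actually_good if t)
    results = []
    for thr in thresholds:
        fp = sum(1 for s, t in zip(scores, actually_good) if s >= thr and not t)
        fn = sum(1 for s, t in zip(scores, actually_good) if s < thr and t)
        results.append((thr, fp / n_bad, fn / n_good))
    return results


# Hypothetical "probability of good edit" scores and ground-truth labels.
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
truth = [True, True, False, True, True, False, False, False]
spectrum = sweep_thresholds(scores, truth, [0.25, 0.5, 0.75])
for thr, fpr, fnr in spectrum:
    print(f"threshold={thr:.2f}  FPR={fpr:.2f}  FNR={fnr:.2f}")
```

In this toy data, raising the threshold lowers the false positive rate (better quality control) while raising the false negative rate (more good edits flagged), which is the pairwise trade-off described above.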
Study Step 4: Result Presentation and Interpretation
The goal of this step is to present the results and explain how to interpret the statistics of each model in context, helping stakeholders understand the trade-offs (value tensions) and select a suitable model. We will start with tensions between pairs of values. Since the results are non-trivial to understand and interpret, especially for audiences without a technical background, we will show visualizations (plots) accompanied by an appropriate amount of explanatory text.
The results will be presented to our participants along with the evaluation questions discussed in the next study step. We will develop a user-friendly UI for display, either by hosting our own web page or by using a public survey platform.
1. Value tension between quality control and newcomer protection. For the examples of quality control and editor motivation, we will plot the relationship between the modeling criterion for quality control, such as the false positive rate, and the criterion for editor motivation, such as the false negative rate. We expect the plot to show a trade-off between those two values, which stakeholders will have to balance when making decisions.
2. Value tension between fairness and error rate. For the example of editor treatment fairness, we will plot the relationship between unfairness and error rate. According to previous studies, unfairness and error rate tend to be negatively correlated. Again, stakeholders will have to decide on a threshold that balances their interests. Below is a visual example.
Result Explanation and Interpretation
For all the plots, we will illustrate what the numbers mean in context. For instance, we will show how many more edits will be labeled as negative per day if we decrease the false positive rate by 10% (e.g., edits that should be labeled as positive but are instead labeled as negative), and how many editors will be affected (e.g., editors whose edits will be reverted). Potentially using estimates from prior work, we will further show how many editors will churn because their edits were reverted as a result of the change in threshold.
We will also explain how to interpret the curves, either alongside the plot or elsewhere in the result presentation. For instance, in the fairness-error rate plot, we will highlight the key trade-offs, such as what it means for the curve to be steep or flat, which will help stakeholders understand the results and make decisions.
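The arithmetic behind this kind of explanation can be sketched as follows. All volumes and rates are hypothetical placeholders rather than measurements, the "10%" decrease is read as 10 percentage points, and the function name is an assumption for illustration.

```python
def daily_impact(bad_edits_per_day, good_edits_per_day,
                 fpr_before, fpr_after, fnr_before, fnr_after):
    """Translate a change in error rates into approximate edits per day."""
    return {
        # Bad edits that used to slip through as 'good' and are now caught.
        "additional_bad_edits_caught": round(
            bad_edits_per_day * (fpr_before - fpr_after)),
        # Good edits that are now wrongly labeled 'bad' and may be reverted.
        "additional_good_edits_flagged": round(
            good_edits_per_day * (fnr_after - fnr_before)),
    }


# Hypothetical scenario: FPR drops from 0.30 to 0.20 while FNR rises
# from 0.02 to 0.03 as a side effect of the stricter threshold.
impact = daily_impact(
    bad_edits_per_day=1000, good_edits_per_day=20000,
    fpr_before=0.30, fpr_after=0.20, fnr_before=0.02, fnr_after=0.03,
)
print(impact)
```

Because good edits typically far outnumber bad ones, even a small rise in the false negative rate can affect more edits per day than the corresponding drop in the false positive rate, which is why presenting absolute counts alongside rates may help stakeholders.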
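One possible way to quantify "steep" versus "flat" is the slope between neighboring points on the fairness-error curve. The sketch below is only illustrative: the curve points and the function name are hypothetical, and the plan above does not commit to this particular measure.

```python
def segment_slopes(points):
    """points: list of (error_rate, unfairness) pairs sorted by error_rate.
    Returns the change in unfairness per unit of error for each segment."""
    slopes = []
    for (e0, u0), (e1, u1) in zip(points, points[1:]):
        slopes.append((u1 - u0) / (e1 - e0))
    return slopes


# Hypothetical trade-off curve: unfairness falls as error rate rises.
curve = [(0.10, 0.20), (0.12, 0.08), (0.20, 0.04)]
slopes = segment_slopes(curve)
for (e0, _), (e1, _), slope in zip(curve, curve[1:], slopes):
    print(f"error {e0:.2f} -> {e1:.2f}: slope {slope:.1f}")
```

In this toy curve, the first segment is steep (a small increase in error buys a large reduction in unfairness), while the second is flat (further increases in error buy little additional fairness).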
We will also explore ways of visualizing trade-offs across more than two types of values, for instance, to show how to achieve the balance among quality control, editor protection, and fairness of subgroups.
Study Step 5: Evaluations
The goal of this step is to evaluate and collect feedback on our results and presentations. We’d like to understand (1) whether our presentations communicate the results well to stakeholders, (2) whether our models reflect the original inputs and values of different stakeholders, and (3) how stakeholders will ultimately make decisions based on the model output.
We will design and send out a set of survey questions about the understandability of the results, satisfaction with how values are reflected, usefulness in assisting decision making, etc. We will conduct further interviews if necessary. We will invite the same participants we interviewed earlier to take part in the evaluation.
References
1. Burrell, Jenna. "How the machine ‘thinks’: Understanding opacity in machine learning algorithms." Big Data & Society 3.1 (2016): 2053951715622512.
2. Halfaker, Aaron, et al. "The rise and decline of an open collaboration system: How Wikipedia’s reaction to popularity is causing its decline." American Behavioral Scientist 57.5 (2013): 664-688.
3. Halfaker, Aaron, R. Stuart Geiger, and Loren G. Terveen. "Snuggle: Designing for efficient socialization and ideological critique." Proceedings of the SIGCHI conference on human factors in computing systems. ACM, 2014.
4. Halfaker, Aaron, Aniket Kittur, and John Riedl. "Don't bite the newbies: how reverts affect the quantity and quality of Wikipedia work." Proceedings of the 7th international symposium on wikis and open collaboration. ACM, 2011.
6. Schwartz, Shalom H. "Universals in the content and structure of values: Theoretical advances and empirical tests in 20 countries." Advances in experimental social psychology. Vol. 25. Academic Press, 1992. 1-65.
8. Kittur, Aniket, et al. "He says, she says: conflict and coordination in Wikipedia." Proceedings of the SIGCHI conference on Human factors in computing systems. ACM, 2007.
9. Zhu, Haiyi, et al. "Value-Sensitive Algorithm Design: Method, Case Study, and Lessons." Proceedings of the ACM on Human-Computer Interaction 2.CSCW (2018): 194.
Timeline
Dec 2018: IRB Application
Dec-Feb 2019: Step 1: Value Seeking
Jan-Feb 2019: Step 2: Value Translation
Jan-Feb 2019: Step 3: Modeling and Tuning
Mar 2019: Step 4: Result Presentation and Interpretation
Mar-Apr 2019: Step 5: Evaluation
Apr-Jul 2019: Writing Papers
Policy, Ethics and Human Subjects Research
This study was reviewed by the University of Minnesota IRB and approved on February 15, 2019 (see STUDY00005335), meeting criteria for exemption from IRB review.