Research:Applying Value-Sensitive Algorithm Design to ORES
- 1 Introduction & Background
- 2 Study Step 1: Value Seeking and Collecting from Stakeholders
- 3 Study Step 2: Value Translation to Modeling Criteria
- 4 Study Step 3: Model Generation
- 5 Study Step 4: Result Presentation and Interpretation
- 6 Study Step 5: Evaluations
- 7 Reference
- 8 Timeline
Introduction & Background
Objective Revision Evaluation Service, or ORES, is a web service and API that provides machine learning as a service and is designed to help automate critical wiki-work, such as vandalism detection and removal. It has been widely accepted and used in the Wikipedia community for some time. However, it still faces several key challenges: (1) ORES operates in diverse work contexts and involves multiple stakeholders, (2) in most work contexts, there are tensions between different stakeholder values, and (3) anecdotal evidence suggests that ORES might compromise some important values such as fairness.
In an effort to mitigate potential conflicts of interest and to improve ORES performance, we plan to apply Value-Sensitive Algorithm Design to redesign ORES algorithms. The core idea of Value-Sensitive Algorithm Design (VSAD) is to engage relevant stakeholders in the early stages of algorithm creation and incorporate stakeholders’ tacit values, knowledge, and insights into the abstract and analytical process of creating an algorithm. Specifically, we will first conduct qualitative analysis and literature reviews to identify the values of the different stakeholders involved in the ORES ecosystem. Second, we will translate stakeholders’ values into the optimization targets of machine learning models, and design and fine-tune algorithms accordingly. Lastly, we will present and explain our results to ORES stakeholders to help them decide on the best models, and collect their feedback as evaluation.
We will follow a five-step plan to conduct the study: (1) value seeking and collecting from stakeholders, (2) value translation into modeling criteria, (3) model generation, (4) result presentation and interpretation, and (5) evaluation. I will describe each step in the following sections.
Study Step 1: Value Seeking and Collecting from Stakeholders
We will identify all types of potential stakeholders involved in the ORES ecosystem and interview them to understand their experiences using ORES-related applications and to solicit their values. Specifically, we focus on people who are involved in the counter-vandalism applications supported by ORES. We choose this type of application because (1) vandalism is a major, high-stakes issue that bears on the fundamental operation of Wikipedia, and (2) many related applications have been developed, making their impact on Wikipedia nontrivial. By conducting semi-structured interviews, we will collect information about (1) the experiences of different stakeholders with ORES applications, (2) the values they hold and prioritize when they use ORES applications, and (3) their suggestions about the algorithm design from their perspective.
Here is the set of stakeholders in the ORES ecosystem, related to these applications, whom we will interview: (1) ORES creators, i.e., people who created ORES, such as Aaron Halfaker, (2) application builders, i.e., people who built related applications, such as Huggle, (3) application operators, i.e., people who actually use ORES, and (4) people affected, i.e., people who are affected by ORES, such as those whose edits were reverted.
We will reach out to and recruit interview participants in ways consistent with the social norms of the Wikipedia community. For example, we will contact potential participants through their Wikipedia user talk pages. We will also describe our study and post it in public forums on Wikipedia to (1) inform the community, (2) recruit more potential participants, and (3) present an interview summary for more feedback. In the meantime, I will submit a document to the IRB for interview permission. Below, I explain our interview plan in detail, including the interview structure and interview questions.
The interview is expected to be 30-45 minutes. Participation is voluntary and uncompensated. Below is the proposed semi-structured interview flow.
1. Introduce ORES. We will introduce ORES briefly in case participants are not aware of it.
2. Ask for general thoughts on the use of machine learning algorithms in this context.
3. Investigate values related to ORES that participants consider important by discussing specific ORES use cases or workflows.
4. Ask about how those values play out in the context, and how to abstract and operationalize values.
Note that interview questions are not exact, and will be customized for different stakeholder roles. The following are example questions.
1. How do you describe your role in Wikipedia?
2. What is your general perception of algorithmic systems to make decisions or assist people to make decisions in Wikipedia?
3. How do you feel about using algorithms on Wikipedia to make decisions independently or help others make decisions (e.g., edit reversion, SuggestBot (article recommendation), article improvement suggestion, etc.)?
4. To what extent do you interact with ORES or tools that use ORES?
5. Could you describe your experience using ORES?
6. What do you think is most important to you as a Wikipedia [editor, developer, etc.]? Follow-up: when you develop XXX (for developers), when you use XXX (for operators), or when you edit in Wikipedia (for people affected)?
7. Have you ever experienced conflict with other ORES users? If so, can you describe that conflict, and how you think it should be resolved?
If interviewees are newcomers to Wikipedia, questions may resemble:
1. Please tell me about your first experience as a Wikipedia Editor.
2. Have you ever revisited your edits on the Wikipedia page(s) you edited? Why?
3. How well do you feel that you understand what algorithms are, and how they relate to Artificial Intelligence?
4. What is your attitude towards algorithms and AI in general?
5. What do you know about how algorithms affect your Wikipedia experience? What is important to you about how such an algorithm might make a decision about your edit?
6. What do you think is most important to the whole Wikipedia community about how such an algorithm might make a decision about your edit?
7. If you could design an algorithm to do anything useful on Wikipedia, what would it do, and why did you choose that particular thing?
Potential Values in ORES Ecosystems
We consider two sets of potential values that could play out in the Wikipedia ORES ecosystem. The first set is based on prior studies in the Wikipedia domain, such as fairness, quality control [2, 3], and newcomer protection. These are domain-specific values of great importance to the Wikipedia community because they relate to Wikipedia’s core mission and its ability to thrive. The second set consists of generic, universal values that the social psychology literature has identified and studied. Examples include self-direction, achievement, power, conformity, and benevolence. I will go through a couple of examples from both sets to illustrate how they play out in Wikipedia and where value tensions may arise.
Example Values 1: Quality Control and Newcomer Protection
We expect quality control to be one of the most important values in Wikipedia. It is related to counter-vandalism, where Wikipedians guard the community by patrolling the quality of Wikipedia edits and reverting poor-quality edits and those considered vandalism. Newcomer protection has become a more serious concern recently. “Don’t bite newcomers” has even become part of the Wikipedia guidelines. Newcomers are valuable assets to Wikipedia because of their potential to become experienced, highly productive Wikipedia veterans. However, before they gain experience, they tend to produce low-quality edits that do not meet Wikipedia standards. Those edits are more likely to be reverted. If not appropriately approached and mentored, those newcomers may be scared away despite their great potential.
This creates a tension between the two values, quality control and newcomer protection: how to achieve and maintain high-quality edits without demotivating newcomers and scaring them away. The input from our interview participants will guide us in modeling and tuning the value-sensitive algorithms.
Example Values 2: Self-Direction and Wikipedia Policies
Self-direction is more formally defined as independent thought and action: choosing, creating, and exploring. Writing, such as editing in Wikipedia, is a task that requires creativity, thought, and independence. Self-direction is therefore likely to be a critically important value in Wikipedia. At the same time, Wikipedia articles are the artifacts of collaboration and collective effort by millions of volunteer participants. To manage the work of such a large cohort with different backgrounds and cultures, there are explicit policies and guidelines that every Wikipedia editor has to follow, whether editing individually or in collaboration with others.
Therefore, a tension between self-direction and Wikipedia policies, or between editing freedom and constraints, can be expected in Wikipedia. ORES applications will play a role in drawing the line between acceptable free-form edits and what standard editing policies require.
Study Step 2: Value Translation to Modeling Criteria
Once a set of values has been identified and coded through interviews, we will analyze them to crystallize and finalize a set of inputs and values that are high-stakes and relevant to the ORES counter-vandalism applications we scoped. We will then translate stakeholders’ inputs into formal modeling criteria, that is, come up with the optimization targets for machine learning models. In this section, I use some hypothetical examples of stakeholders’ values to illustrate how to translate values into modeling criteria.
We define a positive event as ORES labeling/predicting an edit as a good edit. Following this definition, the variables in the confusion matrix are as follows.
True positive: ORES labels/predicts as a good edit, and it’s a good edit
False positive: ORES labels/predicts as a good edit, but it’s a bad edit
True negative: ORES labels/predicts as a bad edit, and it’s a bad edit
False negative: ORES labels/predicts as a bad edit, but it’s a good edit
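The four cases above can be counted mechanically. The following is a minimal sketch, assuming each edit is represented by two booleans (what ORES predicted and what the ground truth says); the names `ConfusionCounts` and `confusion_counts` are hypothetical and not part of ORES.

```python
from typing import NamedTuple, Sequence


class ConfusionCounts(NamedTuple):
    tp: int  # predicted good, actually good
    fp: int  # predicted good, actually bad
    tn: int  # predicted bad, actually bad
    fn: int  # predicted bad, actually good


def confusion_counts(predicted_good: Sequence[bool],
                     actually_good: Sequence[bool]) -> ConfusionCounts:
    """Count the four outcomes, treating "good edit" as the positive class."""
    if len(predicted_good) != len(actually_good):
        raise ValueError("inputs must have the same length")
    tp = fp = tn = fn = 0
    for pred, truth in zip(predicted_good, actually_good):
        if pred and truth:
            tp += 1
        elif pred and not truth:
            fp += 1
        elif not pred and not truth:
            tn += 1
        else:
            fn += 1
    return ConfusionCounts(tp, fp, tn, fn)
```

Note that this convention is the reverse of one in which "damaging" is the positive class, so the meaning of false positive and false negative flips accordingly.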
Value Translation Example 1: Quality control - minimize false-positive rate
We abstract quality control as minimizing the false-positive rate for modeling. It is important for Wikipedia as a whole to maintain all articles and edits at a high quality. Hence, ORES is expected to catch all damaging edits, including malicious vandalism. That is, it should minimize the cases where edits that ORES predicted as good were actually bad, i.e., the false-positive rate under the definition above.
Value Translation Example 2: Protecting editor motivation - minimize false-negative rate
We abstract protecting editor motivation as minimizing the false-negative rate for modeling. One way to protect editor motivation from the perspective of ORES is to avoid reverting good edits from good-faith editors, since editors, especially new ones, are more likely to leave the community when their edits are frequently reverted [x]. That is, it should minimize the cases where edits that ORES predicted as bad were actually good, i.e., the false-negative rate.
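For reference, the two rates used in Examples 1 and 2 can be written with the standard definitions, assuming the convention that a positive is an edit predicted to be good (FP, TN, FN, and TP are the confusion-matrix counts):

```latex
\mathrm{FPR} = \frac{FP}{FP + TN},
\qquad
\mathrm{FNR} = \frac{FN}{FN + TP}
```

Under this reading, the false-positive rate is the share of bad edits that ORES predicts as good, and the false-negative rate is the share of good edits that ORES predicts as bad.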
Value Translation Example 3: Editor treatment fairness - equalize measurements for protected groups
Fairness is an important concern across Wikipedia, from the Wikimedia Foundation to the ORES development team and the Wikipedia editor communities. However, how to define fairness and who should be treated fairly have not yet been fully investigated. We will work with stakeholders during the interviews to identify their understanding and definition of fairness, and further explore ways to be fair in the context of counter-vandalism in Wikipedia.
Here we list some possible operationalizations for our hypothetical fairness treatment groups.
1. Unregistered vs. registered editors: Equalize false positive/negative rates between unregistered (anonymous) editors and registered editors. E.g., disparity in false positive rate <5%.
2. Female vs. male editors: Equalize false positive/negative rates between female editors and male editors. E.g., disparity in false positive rate <5%.
3. New vs. experienced editors: Equalize false positive/negative rates between new editors and experienced editors. E.g., disparity in false positive rate <5%.
4. All editor subgroups: Equalize false positive/negative rates across all subgroups (Kearns et al. 2018, 2019). E.g., disparity in false positive rate across all subgroups <5%.
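One way the "<5% disparity" criterion in the list above could be checked is sketched below. The record layout (group label, prediction, ground truth), the group names, and both function names are assumptions for illustration; the sketch reads "disparity" as the gap between the highest and lowest group false positive rate, which is only one possible interpretation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each record: (group label, ORES predicted good?, edit actually good?)
Record = Tuple[str, bool, bool]


def false_positive_rate_by_group(records: Iterable[Record]) -> Dict[str, float]:
    """FPR per group: share of actually-bad edits that were predicted good.

    Groups with no bad edits are omitted because their FPR is undefined.
    """
    false_positives = defaultdict(int)
    bad_edits = defaultdict(int)
    for group, predicted_good, actually_good in records:
        if not actually_good:
            bad_edits[group] += 1
            if predicted_good:
                false_positives[group] += 1
    return {g: false_positives[g] / n for g, n in bad_edits.items()}


def max_disparity(rates: Dict[str, float]) -> float:
    """Largest gap between any two groups' rates (0.0 if no groups)."""
    if not rates:
        return 0.0
    return max(rates.values()) - min(rates.values())
```

A model would satisfy the example criterion when `max_disparity(false_positive_rate_by_group(records)) < 0.05`; the same pattern applies to the false negative rate by counting good edits predicted as bad.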
Study Step 3: Model Generation
We plan to generate a set of prediction models with different parameter settings guided by the values translated in the previous step. These models will cover a spectrum of trade-offs across a set of relevant values.
We will start with pairwise value trade-offs. Take the trade-off between quality control and newcomer protection as an example. The two values sit at the two ends of a spectrum. In the extreme cases, one model handles quality control best but newcomer protection worst, and another does the opposite. The performance of the generated models will be spread across the entire spectrum, i.e., some models may favor quality control (a low false positive rate) at the cost of newcomer motivation (a higher false negative rate), and vice versa. Other models may sit somewhere in between.
We will use the available ORES training data with ground-truth labels for model training. The size of the data will be decided based on (1) input from the ORES creators and (2) the experimental performance of the models, i.e., we want the dataset to be large enough for the models to be effective without making training too time-consuming. The models to be trained include popular models from the literature, such as classic learning models, ensemble models, neural networks and deep learning models, as well as models from the explainable machine learning literature.
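As a simple illustration of such a spectrum, the sketch below varies the decision threshold of a single scoring model and records the resulting false positive and false negative rates. This is an assumption made for illustration: the proposal speaks of models with different parameter settings, and a threshold sweep is only the simplest such setting. The score is taken to be a hypothetical "probability that the edit is good", and `sweep_thresholds` is not an ORES function.

```python
from typing import List, Sequence, Tuple


def sweep_thresholds(scores: Sequence[float],
                     actually_good: Sequence[bool],
                     thresholds: Sequence[float]) -> List[Tuple[float, float, float]]:
    """Return (threshold, FPR, FNR) for each threshold.

    An edit is predicted good (positive) when its score >= threshold.
    """
    n_good = sum(1 for g in actually_good if g)
    n_bad = len(actually_good) - n_good
    if n_good == 0 or n_bad == 0:
        raise ValueError("need at least one good and one bad edit")
    points = []
    for t in thresholds:
        # Bad edits that slip through as "good".
        fp = sum(1 for s, g in zip(scores, actually_good) if s >= t and not g)
        # Good edits that get flagged as "bad".
        fn = sum(1 for s, g in zip(scores, actually_good) if s < t and g)
        points.append((t, fp / n_bad, fn / n_good))
    return points
```

Raising the threshold makes the model stricter about what counts as a good edit, which can only lower the false positive rate and raise the false negative rate; each threshold therefore corresponds to one position on the quality control versus newcomer protection spectrum.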
Study Step 4: Result Presentation and Interpretation
The goal of this step is to present the results and explain how to interpret the statistics of each model in context, helping stakeholders understand the trade-offs (value tensions) and select a suitable model. We will start with tensions between pairs of values. Since the results are nontrivial to understand and interpret, especially for audiences without a technical background, we will show visualizations (plots) and add an appropriate amount of explanatory text.
The results will be presented to our participants along with the evaluation questions discussed in the next study step. We will develop a user-friendly UI for display, either by hosting our own web page or by using a public survey platform.
1. Value tension between quality control and newcomer protection. For the example of quality control and editor motivation, we will plot the relationship between the modeling criterion for quality control, such as the false positive rate, and the criterion for editor motivation, such as the false negative rate. We expect to see a trade-off between those two values in the plot, and stakeholders will have to balance them and make decisions.
2. Value tension between fairness and error rate. For the example of editor treatment fairness, we will plot the relationship between unfairness and error rate. According to previous studies, unfairness and error rate tend to be negatively correlated. Again, stakeholders will have to decide on a threshold that balances their interests. Below is a visual example.
Result Explanation and Interpretation
For all the plots, we will illustrate what the numbers mean in context. For instance, we will show how many additional edits per day will be labeled as negative if we decrease the false positive rate by 10% (i.e., edits that would otherwise have been labeled positive are now labeled negative), and how many editors will be affected (i.e., editors whose edits will be reverted). Potentially using estimates from prior work, we will further show how many editors will churn because their edits were reverted as a result of the threshold change.
We will also explain how to interpret the curves, either alongside each plot or elsewhere in the result presentation stage. For instance, in the fairness-error rate plot, we will highlight the key trade-offs, such as what it means for the curve to be steep or flat, which will help stakeholders understand the results and make decisions.
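The kind of arithmetic behind such an illustration might look like the following sketch. All inputs are hypothetical placeholders rather than measured Wikipedia figures, the function and field names are invented for this example, and the rates are passed as before/after values so that the sketch does not assume whether "10%" is an absolute or a relative change.

```python
from typing import NamedTuple


class DailyImpact(NamedTuple):
    extra_bad_edits_caught: float    # bad edits no longer predicted good
    extra_good_edits_flagged: float  # good edits newly predicted bad


def estimate_daily_impact(daily_bad_edits: float,
                          daily_good_edits: float,
                          fpr_before: float, fpr_after: float,
                          fnr_before: float, fnr_after: float) -> DailyImpact:
    """Translate a change in error rates into expected edit counts per day.

    FPR is the share of bad edits predicted good; FNR is the share of good
    edits predicted bad (positive = predicted good).
    """
    return DailyImpact(
        extra_bad_edits_caught=daily_bad_edits * (fpr_before - fpr_after),
        extra_good_edits_flagged=daily_good_edits * (fnr_after - fnr_before),
    )


# Hypothetical example: 1,000 bad and 9,000 good edits per day.
impact = estimate_daily_impact(1000, 9000,
                               fpr_before=0.30, fpr_after=0.20,
                               fnr_before=0.05, fnr_after=0.08)
```

Estimating how many affected editors then leave would additionally require a churn estimate from prior work, which is deliberately left out here.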
We will also explore ways of visualizing trade-offs across more than two types of values, for instance, to show how to achieve the balance among quality control, editor protection, and fairness of subgroups.
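When more than two values are involved, one plausible first step before visualizing is to discard models that are worse than some other model on every criterion. The sketch below uses Pareto (non-dominated) filtering for this; the proposal does not name this technique, so it is offered only as one possible approach. The model names and criteria values are hypothetical, and every criterion is assumed to be "lower is better" (for example false positive rate, false negative rate, and subgroup disparity).

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every criterion and better on one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_dominated(models: Dict[str, Sequence[float]]) -> List[str]:
    """Names of models that no other model dominates (lower is better)."""
    return [
        name for name, criteria in models.items()
        if not any(dominates(other, criteria)
                   for other_name, other in models.items()
                   if other_name != name)
    ]


# Hypothetical candidates: (FPR, FNR, subgroup disparity).
candidates = {
    "A": (0.10, 0.30, 0.04),
    "B": (0.20, 0.15, 0.06),
    "C": (0.25, 0.35, 0.08),  # worse than A on all three criteria
    "D": (0.30, 0.05, 0.02),
}
remaining = non_dominated(candidates)
```

The remaining models each represent a genuine trade-off, so choosing among them is a value judgment left to stakeholders rather than something the filter decides.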
Study Step 5: Evaluations
The goal of this step is to evaluate and collect feedback on our results and presentations. We would like to understand (1) whether our presentations communicate the results well to stakeholders, (2) whether our models reflect the original inputs and values of different stakeholders, and (3) how stakeholders will eventually make decisions based on the model output.
We will design and send out a set of survey questions about the understandability of the results, satisfaction with how values are reflected, usefulness in assisting decision making, etc. We will conduct further interviews if necessary. We will invite the same participants we interviewed earlier to take part in the evaluation.
Reference
1. Burrell, Jenna. "How the machine ‘thinks’: Understanding opacity in machine learning algorithms." Big Data & Society 3.1 (2016): 2053951715622512.
2. Halfaker, Aaron, et al. "The rise and decline of an open collaboration system: How Wikipedia’s reaction to popularity is causing its decline." American Behavioral Scientist 57.5 (2013): 664-688.
3. Halfaker, Aaron, R. Stuart Geiger, and Loren G. Terveen. "Snuggle: Designing for efficient socialization and ideological critique." Proceedings of the SIGCHI conference on human factors in computing systems. ACM, 2014.
4. Halfaker, Aaron, Aniket Kittur, and John Riedl. "Don't bite the newbies: how reverts affect the quantity and quality of Wikipedia work." Proceedings of the 7th international symposium on wikis and open collaboration. ACM, 2011.
6. Schwartz, Shalom H. "Universals in the content and structure of values: Theoretical advances and empirical tests in 20 countries." Advances in experimental social psychology. Vol. 25. Academic Press, 1992. 1-65.
8. Kittur, Aniket, et al. "He says, she says: conflict and coordination in Wikipedia." Proceedings of the SIGCHI conference on Human factors in computing systems. ACM, 2007.
9. Zhu, Haiyi, et al. "Value-Sensitive Algorithm Design: Method, Case Study, and Lessons." Proceedings of the ACM on Human-Computer Interaction 2.CSCW (2018): 194.
Timeline
Dec 2018: IRB Application
Dec-Feb 2019: Step 1: Value Seeking
Jan-Feb 2019: Step 2: Value Translation
Jan-Feb 2019: Step 3: Modeling and Tuning
Mar 2019: Step 4: Result Presentation and Interpretation
Mar-Apr 2019: Step 5: Evaluation
Apr-Jul 2019: Writing Papers
Policy, Ethics and Human Subjects Research
This study was reviewed by the University of Minnesota IRB and approved on February 15, 2019 (see STUDY00005335), meeting criteria for exemption from IRB review.