Research:Ethical and human-centered AI

This page in a nutshell: This report is an initial attempt to identify a minimum viable process for developing machine learning algorithms and other AI products within the Wikimedia Movement in an ethical and human-centered way.
Duration: August 2018 – May 2019

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


AI-driven products have the potential to benefit the Wikimedia Movement in many different ways, but they also come with risks. Products and features powered by machine learning algorithms can help experienced contributors identify important work and perform it more efficiently, as well as socialize new contributors, engage readers, fill content gaps, and improve content quality.

But the scale, speed, and complexity of the technologies involved in these products also present new kinds of risks and challenges. Risks include disrupting important community processes, disempowering contributors, eroding reader trust, discouraging good-faith newcomers, amplifying existing biases in content, and even violating user privacy.

Researchers working in the emerging domain of "ethical AI" have begun to identify processes and procedures for identifying, preventing, and addressing the potential negative impacts of AI products. Many of the proposed approaches reflect core tenets and methods of human-centered, values-centered, and participatory design processes, as well as well-established practices within domains such as information security.

Not all of these approaches will be useful or appropriate for work done within the Wikimedia Foundation or the Wikimedia Movement. Some will be a better fit for our particular problems, strategic goals, values, and technical and operational capacities than others. This document describes several of the most promising approaches, and a rationale for each.

Background

AI Products

Machine learning systems consist of more than just the algorithms themselves. They also include other technological components that allow an algorithm to be trained and used for a particular purpose. Each of these components is designed and released, and each is therefore an AI product in its own right.

AI products developed by the Wikimedia Foundation include (at least):

  1. Machine learning models: algorithms that use patterns in one set of data to make predictions about the characteristics of other data (illustrated by the sketch after this list).
  2. Curated datasets: data collected or labeled to train machine learning models.
  3. Machine learning platforms: machine-learning-as-a-service applications that host models and provide programmatic access to those models.
  4. AI-driven applications: end-user facing apps, gadgets, and features powered by machine learning models.
  5. Data labeling applications: interfaces for humans to (re)classify or dispute model input and output data.
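
As a minimal sketch of how the first three product types relate, consider the toy example below. The feature names, labels, and model choice are invented for illustration and do not describe any actual Wikimedia dataset or model.

    # A curated dataset is used to train a model, which a platform then
    # exposes so applications can request predictions for new, unseen data.
    # All features, labels, and values here are invented for illustration.
    from sklearn.linear_model import LogisticRegression

    # 1. Curated dataset: examples labeled by humans (1 = "damaging" edit).
    X_train = [
        [3, 120, 0],    # hypothetical features: links added, bytes changed, is_anonymous
        [0, -4500, 1],
        [5, 800, 0],
        [0, -20, 1],
    ]
    y_train = [0, 1, 0, 1]

    # 2. Machine learning model: learns patterns from the labeled examples.
    model = LogisticRegression().fit(X_train, y_train)

    # 3. Machine-learning-as-a-service platform: wraps the trained model so
    #    that applications can request scores for new data programmatically.
    def score(features):
        """Return the predicted probability that an unseen edit is damaging."""
        return float(model.predict_proba([features])[0][1])

    print(score([1, -300, 1]))

An AI-driven application (product type 4) would call something like score() and present the result to users, while a data labeling application (type 5) would let humans dispute or correct the labels that feed back into the curated dataset.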

Ethical AI

Establishing a canonical definition of what constitutes ethical AI (or ethical behavior generally) is beyond the scope of this report. The general framework for ethical AI used in this report is based on the widely accepted principles of fairness, accountability, and transparency (FAT), viewed through the lens of the values of the Wikimedia Movement.

Given that framework, a basic definition of FAT might look something like this:

  • Fairness: the AI product does not actively or passively discriminate against groups of people in a harmful way (one way such a check might be operationalized is sketched after this list).
  • Accountability: everyone involved in the development and use of the AI product understands, accepts, and is able to exercise their rights and responsibilities.
  • Transparency: the intended users of an AI product can meaningfully understand the purpose of the product, how it works, and (where applicable) how specific decisions were made.
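
To illustrate the fairness criterion, the sketch below compares how often a model flags content from two groups of contributors. The group labels, scores, threshold, and choice of metric are assumptions made for the example; they are not a Wikimedia standard, and which fairness metric is appropriate depends on the product and its context of use.

    # Compare a model's flag rate across two contributor groups.
    # All values and the 0.5 threshold are illustrative only.
    from collections import defaultdict

    def flag_rates(predictions, threshold=0.5):
        """predictions: list of (group_label, model_score) pairs."""
        flagged, total = defaultdict(int), defaultdict(int)
        for group, score in predictions:
            total[group] += 1
            flagged[group] += score >= threshold
        return {group: flagged[group] / total[group] for group in total}

    preds = [("newcomer", 0.8), ("newcomer", 0.6), ("newcomer", 0.3),
             ("experienced", 0.4), ("experienced", 0.2), ("experienced", 0.7)]
    print(flag_rates(preds))
    # A large gap between groups is a prompt for investigation rather than
    # proof of unfairness: the underlying rates of genuinely damaging edits
    # may also differ between the groups.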

Human-centered AI

Human-centered design is a philosophy and a set of methods for ensuring that any designed thing (artifact, process, system) meets the needs of the people who will use, interact with, or be affected by it. One definition of a human-centered system that applies well to an AI context is:

  1. Designed to address human needs
  2. Based on an analysis of human tasks
  3. Built to account for human skills
  4. Evaluated in terms of human benefit

There are many other definitions of human-centered design, as well as closely related methodologies such as value-sensitive design and participatory design, which prioritize, respectively, the investigation and articulation of designer and stakeholder values and the direct involvement of end users in the design process. The definition presented above doesn't exclude either of these considerations; it's more a matter of focus.

Risk scenarios

Many AI ethics researchers have begun to develop scenarios as a way to communicate how even seemingly mundane or uncontroversial uses of machine learning can have negative consequences, and to spur discussion. Not all of the risk scenarios described by AI researchers are directly applicable in a Wikimedia context—for example, monetization of user data, or the dangers of autonomous cars. The following scenarios are inspired by those developed by Ethical OS[1] and Princeton University[2], but adapted to AI products currently developed by the Wikimedia Foundation (or at least within the realm of possibility for Wikimedia Foundation products).


These scenarios are not intended to suggest that any particular product is biased, harmful, malicious, or fundamentally flawed; rather, they are intended to illustrate some of the types of ethical and human-centered design issues that this type of product could present in a Wikimedia Movement context.

Reinforcing existing biases in article content

Wikimedia builds a section recommendation model to help people expand stub articles. It presents a list of recommendations to add to an article in the sidebar, based on the sections that are present in articles that look (to the algorithm) like this one. The section recommender learns that biographies of women are more likely to have sections with titles like “Personal life” and “Family”, while articles about men are more likely to have sections like “Career” and “Awards and honors”. Users of the tool follow these recommendations and add these sections to stub biographies of notable men and women, unaware that they are perpetuating an existing bias in the way women and men are portrayed on Wikipedia.

Discrimination against new article creators

Wikimedia builds a model that assigns a quality prediction to draft articles, and incorporates it into patrolling tools on English Wikipedia. The model systematically assigns lower quality scores to drafts that are written by people for whom English is a second language, regardless of notability or the prevalence or quality of references (which are ostensibly the primary criteria for draft review). These drafts tend to be deleted at a higher rate, discouraging many good-faith new contributors from continuing to edit.

False positives and feedback mechanisms

Wikimedia builds a revision scoring model that makes predictions about edits, and integrates it into revision history interfaces on Wikipedia. These predictions are public and are regularly consulted by editors who are monitoring their watchlists and the recent changes feed.

An experienced editor notices that most of their recent and historical edits are highlighted by the model as “likely to have problems”, despite being obviously good-faith and non-damaging. The editor does not know how the model arrived at these decisions, or how to contest them. They find that their edits are now more heavily scrutinized and more frequently reverted than they were before, which discourages them from editing.

External re-use and unintended consequences

Wikimedia releases a dataset of Wikipedia talk page comments, labelled by crowdworkers for ‘toxicity’. An external developer builds a machine learning model using this dataset and uses that model in an automated content-moderation tool on an online forum devoted to debating hot-button cultural issues. It turns out that the language used by one group of forum participants is more likely to be labeled as toxic by the model, because they use words that are associated with toxicity on Wikipedia, despite being appropriate to the forum and permitted according to the rules. These users see their comments automatically deleted at a high rate, effectively preventing them from participating in the discussion.

Community disruption (and cultural imperialism?)

Wikimedia builds a tool that recommends articles to translate from big language A to small language B, based on whether that article also exists in languages [C, D, E…]. The language B community is overwhelmed by the volume of poorly-translated and half-finished translations appearing in their language, and must spend the majority of their time fixing errors and completing partial translations, rather than writing the articles that they are interested in writing, or that they believe are important for readers in their language.

Disparate usefulness and user trust

Wikimedia deploys a new “top articles” ranking algorithm for readers on the English Wikipedia Android app. The ranking is based on desktop editing and browsing patterns. Since most desktop editors and readers are American or western European, the rankings have a strong "western" bias: the highest-ranked articles reflect the things that Americans and English-speaking Europeans are interested in reading about and editing. Use of the Android App in English is high in India. Indian users get tired of seeing irrelevant recommendations every time they open the app, and decide that Wikipedia is not a useful source of information for them.


Process proposals

Checklists

"“Checklists connect principle to practice. Everyone knows to scrub down before the operation. That's the principle. But if you have to check a box on a form after you've done it, you're not likely to forget. That's the practice. And checklists aren't one-shots. A checklist isn’t something you read once at some initiation ceremony; a checklist is something you work through with every procedure.”[3]

Overview
A list of important steps that must be taken, or questions that must be answered, at each stage of the product development process.
Audience
The product team
Pros and cons
Pros:
  • Aids in identification of hidden assumptions, potential negative impacts
  • Can cover both concrete requirements ("do this") and softer requirements ("have a conversation about this before proceeding")
  • Facilitates broader participation in decision-making among team members
  • Makes it easier for any member of the product team to "flag" missed steps or considerations without fear of reprisal
  • Encourages articulation of audience, purpose, and context; success metrics and thresholds
  • Increases process consistency between and across teams
  • Tracks progress towards goals
Cons:
  • Need to be flexible enough to work across products and team workflows, but standardized enough to ensure a baseline level of due diligence
  • Example AI checklists exist, but few have been vetted/tested in actual product development contexts
  • Binary outcome ("we talked about FOO") may encourage rubber-stamping
Further reading
  1. Of oaths and checklists[3]
  2. Care about AI ethics? What you can do, starting today[4]
  3. DEON: An Ethics Checklist for Data Scientists[5]
  4. Ethical OS Toolkit[6]
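
To make the checklist approach above concrete, here is a minimal sketch of how a per-stage checklist might be represented and enforced in code. The stage names and items are illustrative, loosely drawn from the concerns discussed in this report; they are not an adopted Wikimedia checklist.

    # A per-stage checklist; stages and items are illustrative only.
    CHECKLIST = {
        "planning": [
            "Audience, purpose, and context of use are written down",
            "Success metrics and thresholds are defined",
            "Potential harms and affected groups have been discussed",
        ],
        "data": [
            "Provenance and labeling process of the training data are documented",
            "Known gaps or biases in the data are recorded",
        ],
        "pre-deployment": [
            "Evaluation includes fairness checks, not only accuracy",
            "A rollback or shutdown plan exists",
        ],
    }

    def check_stage(stage, completed_items):
        """Return the checklist items for a stage that have not been checked off."""
        return [item for item in CHECKLIST[stage] if item not in completed_items]

    # Example: one planning item was skipped, so it is reported as missing.
    missing = check_stage("planning", {
        "Audience, purpose, and context of use are written down",
        "Success metrics and thresholds are defined",
    })
    print(missing)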

Impact assessments

"Algorithms and the data that drive them are designed and created by people -- There is always a human ultimately responsible for decisions made or informed by an algorithm. "The algorithm did it" is not an acceptable excuse if algorithmic systems make mistakes or have undesired consequences."[7]

Overview
A public product plan that includes a detailed rationale, risk assessment, evaluation criteria, and post-deployment maintenance, monitoring, and feedback mechanisms.
Audience
Other WMF teams, Movement volunteers, external entities.
Pros and cons
Pros:
  • Aids in identification of hidden assumptions, potential negative impacts
  • Encourages in-depth justification for design decisions
  • Encourages articulation of audience, purpose, and context; success metrics and thresholds
  • Documentation increases accountability for outcomes
Cons:
  • Time-consuming to create, and unclear whether the expense is justified
  • Few real-world examples of "algorithmic impact statements" available to learn from
  • Substantial overlap with checklists (depending on what's in the checklist)
  • Not always clear who the target audience for the document is


Further reading
  1. Social Impact Assessment: Guidance for assessing and managing the social impacts of projects[8]
  2. Algorithmic Impact Assessments: a Practical Framework for Public Agency Accountability[9]
  3. Principles for Accountable Algorithms and a Social Impact Statement for Algorithms[7]
  4. Ethics & Algorithms Toolkit[10]

Interpretable models

Overview
Algorithms that a) expose the logic behind a particular output or decision, and/or b) expose the general features, procedures, or probabilities implicated in their decision-making in a way that the intended audience can understand.
Audience
AI-driven application developers; External researchers and auditors
Pros and cons
Pros:
  • Facilitates downstream explanations for individual algorithmic decisions
  • Facilitates iterative development and comparative evaluation towards fairness and utility benchmarks, not just accuracy and performance
  • Facilitates external auditing, internal sanity checks, formal user testing, and end-user feedback
Cons:
  • Accuracy and performance may be lower overall compared to more opaque models (e.g. deep learning) for some ML tasks
  • Making the model more interpretable may allow people to "game" the system in deceptive or damaging ways


Further reading
  1. How the machine ‘thinks’: Understanding opacity in machine learning algorithms[11]
  2. The Promise and Peril of Human Evaluation for Model Interpretability[12]
  3. Toward human-centered algorithm design[13]
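
As a sketch of what "exposing the logic behind a particular output" can look like with an inherently interpretable model, the example below reads per-feature contributions straight off a linear model. The feature names and training data are invented for the example and do not correspond to a deployed Wikimedia model.

    # An interpretable linear model whose per-feature contributions can be
    # read off directly. Feature names and data are invented for illustration.
    from sklearn.linear_model import LogisticRegression

    feature_names = ["bad_words_added", "links_removed", "is_anonymous"]
    X = [[2, 0, 1], [0, 3, 0], [5, 1, 1], [0, 0, 0], [1, 4, 1], [0, 1, 0]]
    y = [1, 0, 1, 0, 1, 0]   # 1 = edit judged damaging by human labelers

    model = LogisticRegression().fit(X, y)

    def contributions(example):
        """Per-feature contribution (coefficient * value) to the decision score."""
        return {name: coef * value
                for name, coef, value in zip(feature_names, model.coef_[0], example)}

    example = [3, 0, 1]
    ranked = sorted(contributions(example).items(), key=lambda kv: -abs(kv[1]))
    print(ranked)

The ranked contributions can be inspected directly by auditors, or passed downstream to a UI explanation (see the UI explanations section below).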

End-user documentation

"Because the linguistic data we use will always include pre-existing biases and because it is not possible to build an NLP system in such a way that it is immune to emergent bias, we must seek additional strategies for mitigating the scientific and ethical shortcomings that follow from imperfect datasets. We propose here that foregrounding the characteristics of our datasets can help, by allowing reasoning about what the likely effects may be and by making it clearer which populations are and are not represented."[14]

Overview
Detailed descriptions of the intended audience and use cases for an AI product, with a focus on potential issues, limitations, and other special considerations for use.
Audience
External researchers
Pros and cons
Pros:
  • Can ensure that AI product users have the information they need to make informed decisions about how to use the product and/or interpret its functionality
  • Easily transportable with the data/code, wherever it goes
  • Easily adaptable to the needs of different users (e.g. third-party tool devs vs. data scientists) and different AI products (e.g. training datasets vs. AI platform APIs)
  • Many existing frameworks and best practices from software dev are likely applicable to the AI product context; some new ones have been proposed specifically for AI bias contexts
Cons:
  • Can be costly to create and maintain
  • Writing good documentation is hard
  • Not always clear how much documentation is necessary and sufficient for a given audience; the documentation itself may require user testing
  • People don't always read the docs


Further reading
  1. Data Statements for NLP: Toward Mitigating System Bias and Enabling Better Science[14]
  2. Increasing Trust in AI Services through Supplier's Declarations of Conformity[15]
  3. The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards[16]
  4. Datasheets for Datasets[17]
  5. The Types, Roles, and Practices of Documentation in Data Analytics Open Source Software Libraries[18]
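
A hedged sketch of the kind of structured, machine-readable documentation that the data statements and datasheets proposals describe, shipped alongside a dataset, is shown below. The field names are illustrative, loosely merged from those proposals; they are not a Wikimedia documentation template, and the dataset described is hypothetical.

    # A machine-readable "data statement" shipped alongside a dataset.
    # Field names and all values are illustrative; the dataset is hypothetical.
    import json

    DATA_STATEMENT = {
        "name": "example-talk-page-comments-v1",
        "intended_use": "Research on classifiers for uncivil comments",
        "not_intended_for": ["Automated moderation without human review"],
        "collection": "Sampled from publicly available talk page revisions",
        "labeling": "Each comment rated by multiple crowdworkers; majority vote",
        "known_limitations": [
            "English only; annotator pool is not demographically representative",
            "Sarcasm and in-group language are frequently mislabeled",
        ],
        "maintainer": "example-team@example.org",
    }

    if __name__ == "__main__":
        print(json.dumps(DATA_STATEMENT, indent=2))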

UI explanations

"How can we provide meaningful control over the recommendation process to users, so that they can understand the decisions they make about their recommendations and customize the system to their particular needs?"[19]

Overview
Contextual explanations for algorithmic decisions embedded in the user interface as prose, numbers, or graphical visualizations.
Audience
People using an AI product, at the point of use.
Pros and cons
Pros:
  • Can encourage trust among end users
  • Empowers end users to make informed decisions about how to use an AI product or how to interpret its decisions
  • Extensive research literature on the effectiveness of different textual, numeric, and visual approaches to explanation, at least in some domains (e.g. recommender systems)
  • Testable; it's possible to empirically verify whether an explanation works or not, sometimes even before you build your model or your interface
  • Encourages feedback, auditing, monitoring against drift, and potentially re-training of the model
Cons:
  • Not always clear how much information to include: there is a tension between overwhelming/distracting the user and depriving them of important insights
  • Potential tension between choosing the most correct and the most persuasive explanation
  • Depends on interpretable models (or computational methods for making opaque model output more interpretable)


Further reading
  1. Evaluating the effectiveness of explanations for recommender systems: Methodological issues and empirical studies on the impact of personalization[20]
  2. Explaining data-driven document classifications[21]
  3. User interface patterns in recommendation-empowered content intensive multimedia applications[22]
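
The sketch below turns per-feature contributions (such as those produced in the interpretable models sketch above) into a short prose explanation for display in an interface. The phrasing templates and the cut-off of two reasons are assumptions about what a reader might find useful; real explanations should be user-tested rather than copied from this example.

    # Render a short prose explanation from per-feature contributions.
    # The phrasing and the two-reason cut-off are assumptions for illustration.
    PHRASES = {   # hypothetical mapping from feature names to reader-facing text
        "bad_words_added": "adds words often associated with vandalism",
        "links_removed": "removes existing references or links",
        "is_anonymous": "was made while logged out",
    }

    def explain(contributions, max_reasons=2):
        """contributions: {feature_name: signed contribution to the score}."""
        ranked = sorted(contributions.items(), key=lambda kv: -kv[1])
        reasons = [PHRASES[name] for name, value in ranked[:max_reasons] if value > 0]
        if not reasons:
            return "This edit was not flagged."
        return "This edit was flagged because it " + " and ".join(reasons) + "."

    print(explain({"bad_words_added": 1.9, "links_removed": 0.1, "is_anonymous": 0.4}))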

Iterative prototyping and testing

"Understanding how people actually interact—and want to interact—with machine learning systems is critical to designing systems that people can use effectively. Exploring interaction techniques through user studies can reveal gaps in a designer’s assumptions about their end-users and may lead to helpful insights about the types of input and output that interfaces for interactive machine learning should support."[23]

Overview
A process of making step-by-step refinements to a design based on explicit feedback or observations of use before full deployment.
Audience
The product team, end-users of the product
Pros and cons
Pros:
  • Encourages definition of audience, purpose, and context, as well as success metrics, ahead of time
  • Can be performed with low-fidelity interfaces, early-stage models, or even before any software or ML engineering has begun; can even be performed on documentation for datasets and APIs
  • Allows identification of unanticipated issues (such as issues of bias or harm) before committing extensive resources towards a particular design solution
  • Can help avoid costly failures that require teams to pivot or re-boot late in the design/dev process, or after deployment
Cons:
  • Works best when performed by people with some degree of familiarity with UX design or research methods, a resource not available to all teams
  • Can slow down development in some cases, and can sometimes be challenging to implement in Agile/scrum or other XP paradigms
  • Identification of issues (whether of bias or user experience) requires access to representative test users and an approximation of a typical context of use


Further reading
  1. Power to the People: The Role of Humans in Interactive Machine Learning[23]
  2. User perception of differences in recommender algorithms[24]
  3. The usability and utility of recommender systems can be quantitatively measured through user studies[25]
  4. Making recommendations better: an analytic model for human-recommender interaction[26]

Pilot testing

"Many of our fundamentally held viewpoints continue to be ruled by outdated biases derived from the evaluation of a single user sitting in front of a single desktop computer. It is time for us to look at evaluations in more holistic ways. One way to do this is to engage with real users in 'Living Laboratories', in which researchers either adopt or create real useful systems that are used in real settings that are ecologically valid."[27]

Overview
A fixed-term or limited scale deployment of a final product, where the decision to fully deploy is deferred until the outcome of the pilot is assessed.
Audience
The product team, end-users of the product (individuals or communities)
Pros and cons
Pros:
  • Allows the team to understand the ecological validity ("does it work as intended?") of the AI product before committing to release it into their product ecosystem and maintain it long term
  • Allows the team to measure the ecological impact ("what are the adjacent and downstream effects?") of their AI product on the product ecosystem before committing
  • Allows long-term impact measurement of performance (longitudinal analysis) and comparative measurement (A/B testing) against success criteria with real users
  • Provides baselines for long-term performance monitoring (e.g. detecting model drift)
  • Increases accountability and supports user trust in the product team and the organization ("if it doesn't work, we will turn it off")
Cons:
  • Extends the product development timeline
  • Not a substitute for iterative prototyping and testing
  • Unintended negative consequences impact people's lives


Further reading
  1. A Position Paper on 'Living Laboratories': Rethinking Ecological Designs and Experimentation in Human-Computer Interaction[27]
  2. Behaviorism is Not Enough: Better Recommendations through Listening to Users[19]
  3. Research:Autoconfirmed article creation trial
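
As a sketch of the kind of pre-registered comparison a pilot can support, the example below compares one success metric between a pilot cohort and a control cohort. The metric, cohort sizes, and counts are invented; in a real pilot the success criteria would be fixed before deployment.

    # Compare a pre-registered success metric between pilot and control cohorts.
    # All numbers are invented for illustration.
    from math import sqrt

    def two_proportion_z(successes_a, n_a, successes_b, n_b):
        """Pooled z statistic for the difference between two proportions."""
        p_a, p_b = successes_a / n_a, successes_b / n_b
        p = (successes_a + successes_b) / (n_a + n_b)
        se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
        return (p_a - p_b) / se

    # Hypothetical metric: newcomers still editing after 30 days,
    # with (pilot) and without (control) the AI-driven feature.
    z = two_proportion_z(successes_a=130, n_a=1000,   # pilot cohort
                         successes_b=110, n_b=1000)   # control cohort
    print(round(z, 2))

The decision rule (for example, "fully deploy only if the pilot cohort does significantly better and no harm metric regresses") belongs in the plan before the pilot starts, not after the numbers are in.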

Support for auditing

"Algorithm transparency is a pressing societal problem. Algorithms provide functions like social sorting, market segmentation, personalization, recommendations, and the management of traffic flows from bits to cars. Making these infrastructures computational has made them much more powerful, but also much more opaque to public scrutiny and understanding. The history of sorting and discrimination across a variety of contexts would lead one to believe that public scrutiny of this transformation is critical. How can such public interest scrutiny of algorithms be achieved?"[28]

Pros and cons
Pros:
  • Increases transparency and underscores organizational commitment to ethical and human-centered AI
  • Facilitates identification of potentially problematic edge- and corner-cases by external experts and power users
  • Facilitates early detection of model drift
Cons:
  • Most effective when paired with interpretable models and UI explanations
  • Can expose the organization to public embarrassment based on individual examples of failure, whether or not those examples are representative of larger or problematic error patterns (e.g. unfair bias against a group)
  • May require dedication of substantial platform or personnel resources to support ad hoc use
  • Support requirements vary depending on the capabilities of the auditor and the nature of the audit: do they need a fully-featured web application that supports arbitrary input and provides UI explanations, a well-documented API that exposes model and decision-level metadata, or just a sample dataset and a public GitHub repository?


Further reading
  1. Algorithmic Accountability Reporting: On the Investigation of Black Boxes[29]
  2. Auditing Algorithms : Research Methods for Detecting Discrimination on Internet Platforms[28]
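
As a sketch of what auditing support can enable, the example below compares a model's score distribution across two slices of input, as an external auditor might. The score function is a stand-in: depending on what the organization exposes, an auditor might wrap a public scoring API, a documented model endpoint, or a released model file. The slices, threshold, and toy model are invented for illustration.

    # Compare a model's scores across two slices of input, as an external
    # auditor might. `score` is a stand-in for whatever access is provided
    # (a public scoring API, a documented endpoint, or a released model).
    from statistics import mean

    def audit(score, slice_a, slice_b, label_a="slice A", label_b="slice B"):
        """Report mean score and flag rate (score >= 0.5) for each slice."""
        report = {}
        for label, items in ((label_a, slice_a), (label_b, slice_b)):
            scores = [score(item) for item in items]
            report[label] = {"mean_score": mean(scores),
                             "flag_rate": mean([s >= 0.5 for s in scores])}
        return report

    # Toy stand-in model, for demonstration only.
    fake_score = lambda edit: 0.9 if edit["is_anonymous"] else 0.2
    newcomer_edits = [{"is_anonymous": True}, {"is_anonymous": True}]
    veteran_edits = [{"is_anonymous": False}, {"is_anonymous": False}]
    print(audit(fake_score, newcomer_edits, veteran_edits, "newcomers", "veterans"))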

Feedback mechanisms

"Behavioral data without proper grounding in theory and in subjective evaluation might just result in local optimization or short term quick wins, rather than long term satisfaction. When can we know from the behavior of a user if the recommendations help to fulfill their needs and goals?"[19]

Overview
Mechanisms for correcting, contesting, refining, discussing, validating, or auditing the output of an algorithm.
Audience
product team, end-users, external researchers
Pros and cons
Pros:
  • Can be used to re-train the machine learning model
  • Helps the team flag emerging issues of bias, harm, or other unintended consequences
  • Helps the team quickly identify technical and UX issues
  • Increases trust and user acceptance
  • Can yield insights into user expectations, workflows, and context of use
Cons:
  • Usefulness of feedback depends heavily on the design of the feedback mechanism
  • Takes resources to monitor, triage, respond to, and make use of feedback (depending on the mechanism for feedback collection and the kind of feedback collected)
  • Privacy considerations around how feedback is captured and stored, and who has access


Further reading
  1. Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments[30]
  2. JADE: The Judgement and Dialogue Engine
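
A minimal sketch of the kind of record a feedback mechanism might capture, so that disputed predictions can later be triaged, audited, or used for re-training, is shown below. The field names are illustrative rather than an actual JADE or Wikimedia schema, and a real implementation would need to address the privacy considerations noted above.

    # Capture a user's dispute of a model prediction for later triage,
    # auditing, or re-training. Field names are illustrative; storage and
    # access control are deliberately out of scope here.
    from dataclasses import dataclass, field, asdict
    from datetime import datetime, timezone

    @dataclass
    class PredictionFeedback:
        model: str               # e.g. an edit-quality model
        model_version: str
        input_id: str            # e.g. a revision ID
        predicted_label: str
        predicted_score: float
        user_judgement: str      # "agree", "disagree", or a proposed label
        comment: str = ""        # optional rationale from the user
        created_at: str = field(
            default_factory=lambda: datetime.now(timezone.utc).isoformat())

    feedback_log = []            # stand-in for a real datastore

    def record_feedback(item: PredictionFeedback):
        feedback_log.append(asdict(item))

    record_feedback(PredictionFeedback(
        model="damaging", model_version="0.0.1", input_id="123456789",
        predicted_label="damaging", predicted_score=0.87,
        user_judgement="disagree",
        comment="This edit reverts vandalism; it does not add it."))
    print(feedback_log[0]["user_judgement"])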

See also

Subpages of this page

Pages with the prefix 'Ethical and human-centered AI' in the 'Research' and 'Research talk' namespaces:


References

  1. "Ethical OS Toolkit". Ethical OS: A guide to anticipating future impacts of today's technologies (in en-US). Retrieved 2019-01-24. 
  2. "Princeton Dialogues on AI and Ethics". Princeton Dialogues on AI and Ethics (in en-US). 2018-04-19. Retrieved 2019-01-24. 
  3. a b Patil, DJ (2018-07-17). "Of oaths and checklists". O'Reilly Media. Retrieved 2018-12-17. 
  4. Adler, Steven (2018-09-25). "Care about AI ethics? What you can do, starting today". Medium. Retrieved 2019-01-25. 
  5. "Deon: An Ethics Checklist for Data Scientists - DrivenData Labs". drivendata.co. Retrieved 2019-01-25. 
  6. "Ethical OS Toolkit" (in en-US). Retrieved 2019-01-25. 
  7. a b "Principles for Accountable Algorithms and a Social Impact Statement for Algorithms :: FAT ML". www.fatml.org. Retrieved 2019-01-25. 
  8. Franks, Daniel; Aucamp, Ilse; Esteves, Ana Maria; Vanclay, Francis (2015-04-01). "Social Impact Assessment: Guidance for assessing and managing the social impacts of projects". 
  9. Reisman, D., Schultz, J., Crawford, K., & Whittaker, M. (2018). Algorithmic Impact Assessments: a Practical Framework for Public Agency Accountability. Retrieved from https://ainowinstitute.org/aiareport2018.pdf
  10. "Ethics & Algorithms Toolkit (beta)". ethicstoolkit.ai. Retrieved 2019-01-25. 
  11. Burrell, Jenna (2016-01-05). "How the machine ‘thinks’: Understanding opacity in machine learning algorithms". Big Data & Society 3 (1): 205395171562251. ISSN 2053-9517. doi:10.1177/2053951715622512. 
  12. Herman, Bernease (2017-11-20). "The Promise and Peril of Human Evaluation for Model Interpretability". arXiv:1711.07414 [cs, stat]. 
  13. Baumer, Eric PS (2017-07-25). "Toward human-centered algorithm design". Big Data & Society 4 (2): 205395171771885. ISSN 2053-9517. doi:10.1177/2053951717718854. 
  14. a b Bender, Emily; Friedman, Batya (2018-09-24). "Data Statements for NLP: Toward Mitigating System Bias and Enabling Better Science". Transactions of the ACL. 
  15. Hind, Michael; Mehta, Sameep; Mojsilovic, Aleksandra; Nair, Ravi; Ramamurthy, Karthikeyan Natesan; Olteanu, Alexandra; Varshney, Kush R. (2018-08-22). "Increasing Trust in AI Services through Supplier's Declarations of Conformity". arXiv:1808.07261 [cs]. 
  16. Holland, Sarah; Hosny, Ahmed; Newman, Sarah; Joseph, Joshua; Chmielinski, Kasia (2018-05-09). "The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards". arXiv:1805.03677 [cs]. 
  17. Gebru, Timnit; Morgenstern, Jamie; Vecchione, Briana; Vaughan, Jennifer Wortman; Wallach, Hanna; Daumé III, Hal; Crawford, Kate (2018-03-23). "Datasheets for Datasets". arXiv:1803.09010 [cs]. 
  18. Geiger, R. Stuart; Varoquaux, Nelle; Mazel-Cabasse, Charlotte; Holdgraf, Chris (2018-12-01). "The Types, Roles, and Practices of Documentation in Data Analytics Open Source Software Libraries". Computer Supported Cooperative Work (CSCW) 27 (3): 767–802. ISSN 1573-7551. doi:10.1007/s10606-018-9333-1. 
  19. a b c Ekstrand, Michael D.; Willemsen, Martijn C. (2016). "Behaviorism is Not Enough: Better Recommendations Through Listening to Users". Proceedings of the 10th ACM Conference on Recommender Systems. RecSys '16 (New York, NY, USA: ACM): 221–224. ISBN 9781450340359. doi:10.1145/2959100.2959179. 
  20. Tintarev, Nava; Masthoff, Judith (2012-10-01). "Evaluating the effectiveness of explanations for recommender systems". User Modeling and User-Adapted Interaction 22 (4): 399–439. ISSN 1573-1391. doi:10.1007/s11257-011-9117-5. 
  21. "MIS Quarterly". misq.org. doi:10.25300/misq/2014/38.1.04. Retrieved 2019-01-25. 
  22. Cremonesi, Paolo; Elahi, Mehdi; Garzotto, Franca (2017-02-01). "User interface patterns in recommendation-empowered content intensive multimedia applications". Multimedia Tools and Applications 76 (4): 5275–5309. ISSN 1573-7721. doi:10.1007/s11042-016-3946-5. 
  23. a b Amershi, Saleema; Cakmak, Maya; Knox, William Bradley; Kulesza, Todd (2014-12-22). "Power to the People: The Role of Humans in Interactive Machine Learning". AI Magazine 35 (4): 105–120. ISSN 2371-9621. doi:10.1609/aimag.v35i4.2513. 
  24. Ekstrand, Michael D.; Harper, F. Maxwell; Willemsen, Martijn C.; Konstan, Joseph A. (2014). "User Perception of Differences in Recommender Algorithms". Proceedings of the 8th ACM Conference on Recommender Systems. RecSys '14 (New York, NY, USA: ACM): 161–168. ISBN 9781450326681. doi:10.1145/2645710.2645737. 
  25. Ricci, Francesco; Rokach, Lior; Shapira, Bracha; et al., eds. (2011). "Recommender Systems Handbook". doi:10.1007/978-0-387-85820-3. 
  26. McNee, Sean M.; Riedl, John; Konstan, Joseph A. (2006-04-21). "Making recommendations better: an analytic model for human-recommender interaction". ACM. pp. 1103–1108. ISBN 1595932984. doi:10.1145/1125451.1125660. 
  27. a b Chi, Ed H. (2009). Jacko, Julie A., ed. "A Position Paper on ’Living Laboratories’: Rethinking Ecological Designs and Experimentation in Human-Computer Interaction". Human-Computer Interaction. New Trends. Lecture Notes in Computer Science (Springer Berlin Heidelberg): 597–605. ISBN 9783642025747. doi:10.1007/978-3-642-02574-7_67. 
  28. a b Sandvig, C., Hamilton, K., Karahalios, K., & Langbort, C. (2014). Auditing Algorithms : Research Methods for Detecting Discrimination on Internet Platforms. Data and Discrimination: Converting Critical Concerns into Productive Inquiry, a preconference at the 64th Annual Meeting of the International Communication Association. Seattle, Washington, USA. Retrieved from http://www-personal.umich.edu/~csandvig/research/Auditing%20Algorithms%20--%20Sandvig%20--%20ICA%202014%20Data%20and%20Discrimination%20Preconference.pdf
  29. Diakopoulos, Nicholas (2014). "Algorithmic Accountability Reporting: On the Investigation of Black Boxes". doi:10.7916/D8ZK5TW2. 
  30. Elsayed, Tamer; Kutlu, Mucahid; Lease, Matthew; McDonnell, Tyler (2016-09-21). "Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments". Fourth AAAI Conference on Human Computation and Crowdsourcing.