Wikimedia Foundation Scoring Platform team

This page in a nutshell: This page describes a proposal for directing Wikimedia Foundation resources towards the team that maintains ORES, Wiki labels, and related technologies. This proposal was presented as part of the 2018 fiscal year's annual plan. It has been funded and the team's page can be found at mw:Wikimedia Scoring Platform team.

There are two divergent conversations about artificial intelligence: in one, robots will save us from ourselves, and in the other, they destroy us. AI has great potential to help our projects scale by reducing the work that our editors need to do and enhancing the value of our content to readers, but AIs also have the potential to perpetuate biases and silence voices in novel and insidious ways. Imagine a world where AIs are powerful, open, accessible, auditable tools that Wikimedians use to make their work easier. ORES and the related technologies are a means of unlocking that future. We're an experimental, research-focused, community-supported, AI-as-a-service team. Our work focuses on balancing efficiency and accuracy with transparency, ethics, and fairness. ORES is a high-capacity machine learning prediction service that is already heavily adopted and experiencing increasing demand. In this proposal, we ask for the resources that we need to keep up with this demand.

The core problem: Efficient, scalable wiki processes

Processes need to scale & machine learning can help^[1]. (Stories: WikiProject Medicine & RecentChanges Patrolling)
Innovation around intelligent UIs is slow/stagnant and this has caused massive social/production problems and frustrations in our communities^[2].
The Wikimedia Foundation's Product department can't effectively leverage these technologies. Doing so needs a mixture of process and machine learning expertise, and research is needed to explore strategies for implementing transparency, enabling audits, etc.

The proposed solution: ORES, community-driven machine prediction as a service

ORES is a modular machine prediction service that enables other user-facing intelligent technologies. It started as an on-wiki proposal in 2014 and then became a volunteer project. It is led by one of the most prominent researchers studying Wiki social/production problems. It needs to balance lots of concerns around accuracy, efficiency, ethics/bias, and openness to our communities^[3].
ORES is already a breakaway success. Our beta feature has 10k active users, about 20 third-party tools use the service, and it is running in production^[4]. Our work has inspired many and has provided a major, positive, newsworthy development covered in Wired^[5], MIT Tech Review^[6], BBC^[7], etc.^[8]
The Wikimedia Foundation's Product department is just starting to adopt ORES and requests for new types of predictions are growing. We have several independent collaborations with external research groups to develop new types of prediction models.
Our talks and workshops focusing on the ethical concerns of AI in practice have been well received and we've become a leader in conversations around detecting and mitigating biases^[3]^[9]. We've built up collaborations with researchers at UC-Berkeley, UMN, CMU, Télécom Bretagne, and Northwestern.

Logo for revscoring, our framework for building prediction models.

The complication: One funded team member is simply not enough

While volunteer efforts have brought us far, it's clear that this project needs full-time attention. Developing and maintaining this service requires a lot of consistent effort and vision -- work that really only suits paid staff (or very dedicated volunteers).
Wikimedia Research doesn't currently have the resources to dedicate to this project. Aaron has operated as a manager, communicator, researcher, community liaison, tech lead, engineer, and designer. While a consistent pool of volunteers has been maintained, Aaron and one dedicated volunteer (Amir Sarabadani) have been the only consistent, long-term contributors.
As a consequence, the project's growth has been slowed and Aaron's research skill set is being under-utilized. Despite success and a large amount of demand, we haven't managed to find a way to get necessary resources allocated to the project.

The proposal: Hire for a Scoring Team in Technology led by Aaron

ORES and the related technologies are a research platform that works -- Aaron continues his primary research agenda, which centers on the use of advanced technologies to support open production practices.
Hire 2 engineers, 1 engineering manager, and a tech-writer/liaison over the course of 3 years. Use contracting budget to bring in interns, ad-hoc design work, and to trade resources with other teams. Travel budget of 8k per full-time staff member per year.
Product, Research, external researchers, and 3rd party tool developers work with the Scoring Team as needed to develop new models and implement Product/tool-focused functionality.

FAQ

Shouldn't we be putting someone else in charge of ORES so that Aaron can focus on research? (discussion)

Aaron is an unusual type of Research Scientist. He's a system builder and ORES is an effective research platform^[10] for exploring the integration of social production community practices and advanced technologies. ORES is not a complete solution; it's a means of arriving at product/technology recommendations. We still don't know how to effectively maintain AIs while mitigating their potentially negative effects, but ORES is allowing us to take great strides. One day, if we're successful, ORES and what we've learned from it will be ready for a "hand-off" to product teams.

Aaron has already been acting in a product management capacity for the ORES project and he's been managing small grants (proposals and reports) to keep a small group of people partially funded to work on the project. Any support will extend the time and energy he can use to focus more broadly.

The research needs of ORES are already relatively distributed. In order to operate more effectively as a principal researcher, Aaron advises many external research collaborations. If additional support is secured, he'll have more time for leading and advising research on ORES and its context. These research projects will form the foundation of the documentation necessary for accountability and the vision necessary for direction-setting. They'll also provide direction for new development.

Why isn't Product support enough support? (discussion)

Product teams who are using ORES need to use it for a specific, user-value focused purpose. We've had success in the past with trading resources (Aaron's consulting for Engineering support for ORES & related projects), but that support is generally specific to a particular component of the project and it doesn't involve any sort of long-term commitment.

The project needs engineers working on ORES to be able to focus on ORES so that they can think slowly and subconsciously about the direction of the project and therefore make long-term proposals and contributions. This is essential for the project's success. Product is not in a position to assign engineering resources to the project on a multi-year timescale.

Does ORES need to be "productionized"? (discussion)

ORES is already running in production at high capacity and with a high level of uptime. The code has passed security review and has had a more casual review from senior engineers at the Wikimedia Foundation. It doesn't need to be "productionized"; rather, the system needs to grow and be extended to accommodate the needs that emerge. Bringing new models to production, improving performance, and extending accountability are among our goals.

To be fair, Wiki labels does need to be productionized and there's currently a collaboration with the Collaboration Team to start that process.

Meta-ORES, a robust false-positive and feedback gathering system, is only a proposal at this point, but clear user needs have made its necessity evident.

What kind of expertise will the engineers assigned to this team need to have? (discussion)

Most of the modeling and analysis skills can come from Aaron himself, the Research Team, and our external collaborators. Most of the engineering that needs support is basic web development and some distributed processing systems work. So, anyone with a solid background in software engineering around web technologies and a tolerance of the Python ecosystem should be able to quickly gain the type of competencies that we need.

At least one engineer will need to be at the "senior" level so that they can draw from experience to help architect the system and make decisions about which technologies to adopt.

What would happen in the next 3 years if we manage to get support and have Aaron lead? (discussion)

There will be three focuses of effort over the next three years.

Implement accountability mechanisms: In our past work, we've seen the need for accountability mechanisms emerge. These mechanisms will both empower our users to refute ORES' predictions and give us a better opportunity to discover problems with prediction models. Every problem is an opportunity to improve fitness. By implementing open accountability mechanisms, we'll provide a legitimate, alternative view of how algorithms ought to operate in online spaces (as opposed to Google, Facebook, Twitter and other big tech companies).
New prediction models: Expand the types of predictions that ORES makes. Currently, ORES makes predictions about edit quality and article quality. We're working on modeling the aggressiveness of conversations, the types of changes made in edits, the quality of new page creations, the quality of Wikidata items, and the importance of Wikipedia articles. With more resources, we'll finish up those projects and get models deployed so that developers can start experimenting with them. With each new model, we open the doors to new types of products and technologies for making Wikipedians' work easier.
Support more wikis: Currently, we have basic support for 23 wikis and advanced support for 8. In order to speed up the rate at which we extend support to new wikis, we need to improve the tools that Wikipedians use to help record their judgements for training the models. We'll also need to dedicate more time to liaising with communities and recruiting confederates to help adoption & false-positive reporting.
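
As a rough illustration of how the first two focus areas fit together, the sketch below shows a patrolling tool consuming an edit-quality prediction and letting an editor dispute it. The `Score`, `Dispute`, and `ReviewQueue` shapes, the "damaging" model name, and the 0.8 threshold are all assumptions made for this example; they are not ORES's actual response format or the accountability mechanism that would be built.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Score:
    """Hypothetical shape of one prediction for one revision."""
    rev_id: int
    model: str          # assumed model name, e.g. "damaging"
    probability: float  # estimated probability that the label applies


@dataclass
class Dispute:
    """Hypothetical record of an editor refuting a prediction."""
    rev_id: int
    model: str
    reporter: str
    note: str


@dataclass
class ReviewQueue:
    threshold: float = 0.8  # assumed cut-off for flagging a revision
    disputes: List[Dispute] = field(default_factory=list)

    def needs_review(self, score: Score) -> bool:
        """Flag a revision when its probability meets the threshold."""
        return score.probability >= self.threshold

    def dispute(self, score: Score, reporter: str, note: str) -> Dispute:
        """Record a human judgement that the prediction was wrong."""
        record = Dispute(score.rev_id, score.model, reporter, note)
        self.disputes.append(record)
        return record

    def disputes_by_model(self) -> Dict[str, int]:
        """Count disputes per model, a crude signal of where a model misfires."""
        counts: Dict[str, int] = {}
        for record in self.disputes:
            counts[record.model] = counts.get(record.model, 0) + 1
        return counts


queue = ReviewQueue()
flagged = Score(rev_id=1001, model="damaging", probability=0.93)
fine = Score(rev_id=1002, model="damaging", probability=0.12)

print(queue.needs_review(flagged))  # True
print(queue.needs_review(fine))     # False

queue.dispute(flagged, reporter="ExampleUser", note="Good-faith copy-edit")
print(queue.disputes_by_model())    # {'damaging': 1}
```

Counting disputes per model is only one possible use of such reports; the point is that each refuted prediction becomes data that can be audited and fed back into model training.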

How will other teams be affected by this proposal? (discussion)

Legal: We publish datasets that often raise privacy and other sensitive-information concerns. In the past we have worked with Legal to make sure that these publications have been reviewed. If we increase our capacity as planned, we'll likely double the rate at which we publish these kinds of datasets.
Technical writing: Documentation of the systems we develop will grow along with the systems. We'll need part of a technical writer's time to help keep our documentation high quality.
Community engagement: A large part of our work involves engaging with various Wikimedia communities so that they can help us gather labeled data to train new models and so that they can tell us about problems/opportunities that they see. We'll need liaisons to help with recruiting local confederates in different wiki communities to help translate and advocate.
Research: In order to maintain a steady stream of new development and to take advantage of the research platform, our team will regularly need to work with external researchers and to recruit highly skilled interns. We'll need to work with the Wikimedia Research team to recruit these collaborators and to interview/vet interns. There will also likely be direct collaborations with Wikimedia Research on the development and deployment of new models (Research-and-Data) as well as evaluations of the models' utility to users (Design-Research).
Security: The ORES infrastructure has been reviewed by the security team and does not currently pose a substantial security risk to our private data or our users. However, new developments like accountability systems will enable direct contribution from users that will include the use of free-form text. Reviewing the security and privacy of these mechanisms will require substantial effort on the part of the security team initially, and then follow-up reviews for any large changes to the contribution, review, and suppression mechanisms. (Note that a preliminary document describing these concerns has already been filed.)

How would this be sustainable at such a small scale? (discussion)

Two reasons: external collaborations and contracts/internships.

A lot of the work for ORES involves external collaborations. We receive substantial, though sporadic, contributions from a large set of volunteers who find ORES to be useful. Further, most of our new model development is done with external researchers. For example, the modeling initiatives that we have going right now are:

edit types (Kraut & Yang @ CMU) -- predicts the type of change made in an edit (e.g. copy-edit, simplification, process, wikification, etc.)
draft quality (N. Jullien @ Télécom Bretagne) -- flags new page creations for spam, vandalism, and personal attacks and allows slower review of other, less problematic, article drafts.
detox (Ellery et al.) -- flags aggressive talk page comments
article importance (Warncke-Wang @ UMN) -- predicts the importance of an article to the whole project and within a specific subject-space.
academic/pop-culture/other (Terveen & Shiroo @ UMN) -- predicts whether an article is generally about an academic subject, a pop culture subject, or other.

We'll use the contracting budget to invite some of our collaborators (volunteers and external researchers) to work with us as contractors and interns. This will allow us to give young subject-matter experts opportunities to address problems core to ORES and potentially to pursue a path towards a more substantial engagement with the team. Aaron has had a lot of success in the past working with researchers and volunteers on short-term contracts through IEG.

Where should this team live and why? (discussion)

Technology is responsible for platforms. Platforms serve many audiences and many different use-cases. Technology prioritizes and provides long-term, sustainable infrastructure.

Product develops features that serve a specific user-value. Product prioritizes delivering on the highest-impact use-cases.

The ORES service is a platform -- an infrastructure for other work. The technologies around ORES (revscoring & Wiki labels) provide an ecosystem for the platform. ORES' audiences include researchers, developers, tool-builders, and product teams. Each model that ORES uses to produce a new score is itself a platform with many audiences. For example, the "edit quality" models are used in several different wiki tools and for many different research projects. The vision of ORES includes discovering effective means for developing long-term infrastructure for a large, general class of valuable tools and analysis.

References

[1] Geiger, R. S., & Halfaker, A. (2013, August). When the levee breaks: without bots, what happens to Wikipedia's quality control processes? In Proceedings of the 9th International Symposium on Open Collaboration (p. 6). ACM.
[2] Halfaker, A., Geiger, R. S., & Terveen, L. G. (2014, April). Snuggle: Designing for efficient socialization and ideological critique. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 311-320). ACM.
[3] Halfaker, Aaron. "Deploying and maintaining AI in a socio-technical system". Presented at the Wikimedia Research Showcase, December 2016.
[4] m:Research:Revision_scoring_as_a_service#Tools_that_use_ORES
[5] https://www.wired.com/2015/12/wikipedia-is-using-ai-to-expand-the-ranks-of-human-editors/
[6] https://www.technologyreview.com/s/544036/artificial-intelligence-aims-to-make-wikipedia-friendlier-and-better/
[7] http://www.bbc.com/news/technology-34982570
[8] m:Research:Revision_scoring_as_a_service/Media
[9] Algorithmic dangers and transparency -- Best practices
[10] Terveen, L., Konstan, J. A., & Lampe, C. (2014). Study, Build, Repeat: Using Online Communities as a Research Platform. In Ways of Knowing in HCI (pp. 95-117). Springer New York.
