Grants:IEG/Revision scoring as a service/Renewal/Midpoint

Welcome to this project's midpoint report! This report shares progress and learnings from the Individual Engagement Grantee's first 3 months.

Summary[edit]

In the last 3 months, we focused primarily on our goal of figuring out a long-term strategy for revscoring. We also spent substantial time on new development work for ORES and on community outreach to tool developers and new multilingual volunteers. Finally, we spent some time researching the potential harms of "subjective social algorithms" like ours so that we can avoid causing social problems.

Our main effort toward a long-term strategy for revscoring has gone into preparing the system for productionization. We've made the system easier to scale and maintain, and we're working aggressively to address the constraints preventing ORES from moving into the production Wikimedia cluster. In parallel, we're working on a MediaWiki extension that takes advantage of ORES's scores so that we can release it as a beta feature on Wikimedia's wikis.

Our new development work has focused on improved language support -- specifically for multilingual wiki communities like WikiData, Commons and Meta. We've more than doubled our language support (from 4 languages to 9), and thanks to our outreach work, we're close to adding support for 11 more. During this time, we've also boosted the count of tools using ORES/revscoring from 2 to 10 -- including Huggle, one of the dominant quality control tools we were initially targeting with the original proposal a year ago.

Finally, we've come to the conclusion that we have a moral obligation to look into the potential for systems like ORES to cause harm. We've been digging into the research on algorithmic bias perpetuation, and we have been using clustering strategies to learn about the different types of activities that our algorithms lump into one giant "damaging" category. We'd like to put substantial effort into this in the coming three months and target the publication of a research paper (outside the scope of this IEG) based on what we learn.

Methods and activities[edit]

Methods
  • We use the #wikimedia-ai IRC channel to coordinate on a daily basis.
  • We meet once per week to groom our Phabricator board.
  • We set aside time to work on Saturdays ("hack sessions"). We use this time to coordinate reviews and to invest focused effort on getting things done.
  • We primarily work through our GitHub repositories, but a lot of the work in the last three months has involved our projects on Wikimedia Labs.
ORES productionization -- Secure the future of revscoring. Increase opportunities to integrate ORES into existing tools.
  • We have significantly improved the scalability of our project since the last report.
    • We use a "pre-caching" strategy to generate scores that are likely to be needed before they are requested (see the sketch after this list).
    • We deployed Celery (an asynchronous task/job queue) to dramatically increase our parallel processing power and to scale the system to meet demand.
    • We re-use parallel scoring requests, so that concurrent requests for the same score (a common pattern among bots and tools that track RecentChanges) are served by a single computation.
  • Created a Vagrant image for ORES and Revision Scoring which simplifies development and makes it easy for others to contribute code
  • Creation of Debian packages for dependencies (WMF constraint for production deploy)
  • We are developing an ORES extension for MediaWiki
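To illustrate the scaling work above, here is a minimal Python sketch of the three ideas together: Celery workers for parallelism, re-use of in-flight requests, and pre-caching. The score_revision() task and recent_changes() feed are hypothetical stand-ins; the real implementation lives in the wiki-ai/ores repository.

from celery import Celery

app = Celery("ores_sketch", broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task
def score_revision(rev_id):
    """Score one revision.  (Placeholder: the real task runs feature
    extraction and a trained model; here we return a dummy score.)"""
    return {"rev_id": rev_id, "damaging": 0.5}

_pending = {}  # rev_id -> AsyncResult: concurrent requests share one task

def request_score(rev_id):
    """Re-use in-flight work: if several tools ask for the same score at
    once (common for bots that track RecentChanges), only one task runs."""
    task = _pending.get(rev_id)
    if task is None:
        task = score_revision.delay(rev_id)  # enqueue for a Celery worker
        _pending[rev_id] = task
    return task.get()  # block until the shared computation finishes

def precache(recent_changes):
    """Pre-caching: score each new edit as it happens, before anyone asks,
    so that later requests hit the result cache instead of recomputing."""
    for change in recent_changes:
        score_revision.delay(change["rev_id"])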
New Development -- More flexible infrastructure. More languages supported.
  • We have generalized our utilities for generating article quality models so that they can be used outside of English Wikipedia. See github.com/wiki-ai/wikiclass and, specifically, the Extractor abstraction.
  • We have extended our language support from 4 languages (en, fa, pt, tr) to 9 (+fr, he, id, es, vi).
  • We have developed a workflow for automatically generating language resources (likely badwords, informals, stop words, etc.) for new languages by processing local Wikipedia content (see the sketch after this list).
  • We refactored how revscoring uses language utilities to remove the requirement for a "local language", so that we can better support multilingual wikis like Commons, Meta and WikiData and catch, for example, English-language vandalism on Vietnamese Wikipedia.
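To give a flavor of the resource-generation workflow mentioned above, here is a rough sketch of one plausible approach: surface "likely badwords" by comparing word frequencies in reverted vs. surviving edits. The corpora, threshold, and function names are illustrative assumptions, not the project's exact code.

import re
from collections import Counter

WORD_RE = re.compile(r"\w+", re.UNICODE)

def word_freqs(texts):
    """Relative frequency of each word across a corpus of edit texts."""
    counts = Counter(w.lower() for t in texts for w in WORD_RE.findall(t))
    total = sum(counts.values()) or 1
    return {w: n / total for w, n in counts.items()}

def likely_badwords(reverted_texts, surviving_texts, ratio=10.0):
    """Words that are far more frequent in reverted edits than in
    surviving edits are good candidates for a language's badwords list."""
    reverted = word_freqs(reverted_texts)
    surviving = word_freqs(surviving_texts)
    floor = 1 / (len(surviving) + 1)  # smoothing for unseen words
    return sorted(w for w, f in reverted.items()
                  if f / surviving.get(w, floor) >= ratio)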
Research -- Cause no harm by getting ahead of bias and other problems in prediction. Inform the literature/academic community with what we learn.
  • We surveyed the literature to learn about social concerns related to the application of "subjective algorithms"
  • We surveyed the literature about bias detection for machine learning algorithms
  • We used unsupervised learning strategies to extract patterns in the types of edits that are reverted
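For example, here is a minimal sketch of that unsupervised strategy using scikit-learn's k-means; the feature set shown is an illustrative assumption, not revscoring's real feature vectors.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Each row describes one reverted edit:
# (chars added, chars removed, badword count, editor is anonymous)
X = np.array([[120, 4, 0, 1],
              [0, 2400, 0, 1],    # blanking-like
              [35, 0, 3, 1],      # slur-insertion-like
              [500, 20, 0, 0]])   # good-faith but reverted

X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels)  # cluster ids hint at distinct kinds of "damage"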
Community Outreach -- Increase adoption and integration of ORES into other tools.

Midpoint outcomes[edit]

Finances[edit]

We have spent half of our total funds as planned and have not requested or received additional resources.

Learning[edit]

What are the challenges[edit]

Usage measurements
  • While ORES has generated ~10 million scores, many of those scores were generated by our "precaching" system, which loads scores in anticipation of them being requested. It's difficult to use our current logs to know how many scores were actually generated in response to a request from a human or a wiki tool. So, we've prioritized proper metrics collection and are just about to deploy it. See github.com/wiki-ai/ores/pull/96 and the sketch below.
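As a toy illustration of the distinction we need to capture (the real instrumentation is in the pull request above), the metrics must count precache-driven scores separately from externally requested ones:

from collections import Counter

metrics = Counter()

def record_score(rev_id, source):
    """source is "precache" or "request", depending on what triggered
    the scoring job (illustrative labels, not our real schema)."""
    metrics["scores." + source] += 1

record_score(123456, "precache")
record_score(123456, "request")
print(metrics)  # Counter({'scores.precache': 1, 'scores.request': 1})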
Wikilabels UI overcommitment
  • We planned to flesh out the Wikilabels interface as part of this IEG. Since we picked up Arthur as a grantee and his skill set was more closely suited to modeling work, we've refocused on bias detection rather than improvements to Wikilabels. We think that this will be great for revscoring/ORES and our users, but it was difficult to work out the details of the switch.
Timing of disbursements
  • Given that grant disbursements pay rent, the timing at which they are received is very important. It has been difficult to make sure that the timing worked out for us, but we were able to work with the Grants and Finance teams to make it work so far.

What is working well[edit]

Adoption rate
  • We have been working on setting up the infrastructure to handle large quantities of requests. This is a critical aspect if we are going to get the wide-scale adoption we intend to have. We will now focus on measuring our adoption as well as how often our scores are used externally.
Adaptive Re-prioritization
  • While the re-prioritization away from Wikilabels work could be seen as a shortcoming, I think it speaks volumes that the team was able to shift focus based on the resources we had available and find important work to do.

Next steps and opportunities[edit]

What are the next steps and opportunities you’ll be focusing on for the second half of your project? Please list these as short bullet points.

  • We will focus on identifying bias in our approaches. Machine learning algorithms can develop a bias if they are trained on biased examples, and our handcoders themselves can be biased as well. We are working toward measuring this bias and will then work toward mitigating the problem (see the first sketch below).
  • We will develop more multi-class classification algorithms, such as Error-Correcting Output Codes (ECOC), which will be useful for Automatic Assessment of Article Quality (A3Q) as well as for Edit Quality Scores (EQS) (see the second sketch below).
  • We will focus on making the infrastructure more robust so that it can handle an even greater workload.
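As a first sketch of what "measuring this bias" might look like (the editor groups and data below are illustrative assumptions, not our settled methodology), one simple check compares the model's false-positive rate across editor groups:

import numpy as np

def false_positive_rate(y_true, y_pred):
    """Share of genuinely good edits that the model flags as damaging."""
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return fp / (fp + tn) if (fp + tn) else 0.0

y_true = np.array([0, 0, 1, 0, 0, 1, 0, 0])          # actually damaging?
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])          # model prediction
is_anon = np.array([1, 1, 1, 1, 0, 0, 0, 0], bool)   # editor group

for group, mask in [("anonymous", is_anon), ("registered", ~is_anon)]:
    print(group, false_positive_rate(y_true[mask], y_pred[mask]))
# A large gap between the two rates would signal a biased model.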
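And here is a minimal sketch of ECOC for multi-class article quality prediction, using scikit-learn's OutputCodeClassifier; the features and labels are randomly generated stand-ins, not real article data.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OutputCodeClassifier

classes = ["Stub", "Start", "C", "B", "GA", "FA"]  # WP 1.0 quality scale
rng = np.random.RandomState(0)
X = rng.rand(60, 5)                      # toy article features
y = rng.randint(len(classes), size=60)   # toy quality labels

# ECOC encodes each class as a binary codeword and trains one binary
# classifier per bit; code_size > 1 adds redundancy that lets the
# ensemble correct individual classifiers' mistakes.
ecoc = OutputCodeClassifier(LogisticRegression(max_iter=1000),
                            code_size=2, random_state=0)
ecoc.fit(X, y)
print(classes[ecoc.predict(X[:1])[0]])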

Grantee reflection[edit]

  • I am most excited to see the community reaction - particularly that of third party tool developers that wish to use our system to supplement their development process. Being able to serve even more communities is also very exciting! -- とある白い猫 chi? 00:37, 28 September 2015 (UTC)
  • My first few weeks were devoted primarily to setup of a Debian development environment and looking at the ores / revscoring project in wide scope. Aside from routine installation tasks, this included:
  1. gaining a basic understanding of the revscoring source code,
  2. testing various environments for using the ores webservice (leading to the discovery of at least one bug in the ores build),
  3. contributing to a discussion of recent relevant papers by Tufekci and by Sandvig et al. around the currently hotly-debated issue of algorithm/data transparency, and
  4. gaining a rigorous, formal understanding of the machine learning models used by the current revscoring project and considering new models for future use.
Recently, I have turned my attention to the project of clustering edits. I have been working to explicate the SigClust algorithm to the other engineers on the team, and I have recently started working with Amir on implementing SigClust in Python. --Aetilley copy-pasted from email by EpochFail (talk) 19:46, 30 September 2015 (UTC)