Grants:IEG/Revision scoring as a service/Renewal/Midpoint

Welcome to this project's midpoint report! This report shares progress and learnings from the Individual Engagement Grantee's first 3 months.

Summary[edit]

In the last 3 months, we focused primarily on our goal of figuring out a long-term strategy for revscoring. We also spent substantial time on new development work for ORES and on community outreach to tool developers and new multilingual volunteers. Finally, we spent some time researching the potential harms of "subjective social algorithms" like ours so that we can avoid causing social problems.

Our main effort toward a long-term strategy for revscoring has gone into preparing the system for productionization. We've made the system easier to scale and maintain, and we're working aggressively to address the constraints preventing ORES from moving into the production Wikimedia cluster. In parallel, we're working on a MediaWiki extension that takes advantage of ORES's scores so that we can release it as a beta feature on Wikimedia's wikis.

Our new development work has focused on improved language support -- specifically for multilingual wiki communities like WikiData, Commons and Meta. We've more than doubled our language support (from 4 languages to 9), and thanks to our outreach work, we're close to adding support for 11 more. During this time, we've also boosted the count of tools using ORES/revscoring from 2 to 10 -- including Huggle, one of the dominant quality control tools we were initially targeting with the original proposal a year ago.

Finally, we've come to the conclusion that we have a moral obligation to look into the potential for systems like ORES to cause harm. We've been digging into the research on algorithmic bias perpetuation, and we have been using clustering strategies to learn about the different types of activities that our algorithms lump into one giant "damaging" category. We'd like to put substantial effort into this in the coming three months and target the publication of a research paper (outside the scope of this IEG) based on what we learn.

Methods and activities[edit]

Methods
  • We use the #wikimedia-ai IRC channel to coordinate on a daily basis.
  • We meet once per week to groom our Phabricator board.
  • We set aside time to work on Saturdays ("hack sessions"). We use this time to coordinate reviews and to invest focused effort on getting things done.
  • We primarily work through our GitHub repositories, but a lot of the work in the last three months has involved our projects on Wikimedia Labs.
ORES productionization -- Secure the future of revscoring. Increase opportunities to integrate ORES into existing tools.
  • We have significantly improved the scalability of our project since the last report.
    • We use a "pre-caching" strategy to generate scores that are likely to be needed before they are requested (see the sketch after this list).
    • We deployed Celery (an asynchronous task/job queue) to dramatically increase our parallel processing power and to scale the system to meet demand.
    • We re-use parallel scoring requests, so that concurrent requests for the same score (a common pattern among bots and tools that track RecentChanges) are served by a single computation.
  • Created a Vagrant image for ORES and Revision Scoring which simplifies development and makes it easy for others to contribute code
  • Creation of Debian packages for dependencies (WMF constraint for production deploy)
  • We are developing an ORES extension for MediaWiki
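To illustrate the scaling work above, here is a minimal Python sketch of the three ideas together: Celery workers for parallelism, re-use of in-flight requests, and pre-caching. The score_revision() task and recent_changes() feed are hypothetical stand-ins; the real implementation lives in the wiki-ai/ores repository.

from celery import Celery

app = Celery("ores_sketch", broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task
def score_revision(rev_id):
    """Score one revision.  (Placeholder: the real task runs feature
    extraction and a trained model; here we return a dummy score.)"""
    return {"rev_id": rev_id, "damaging": 0.5}

_pending = {}  # rev_id -> AsyncResult: concurrent requests share one task

def request_score(rev_id):
    """Re-use in-flight work: if several tools ask for the same score at
    once (common for bots that track RecentChanges), only one task runs."""
    task = _pending.get(rev_id)
    if task is None:
        task = score_revision.delay(rev_id)  # enqueue for a Celery worker
        _pending[rev_id] = task
    return task.get()  # block until the shared computation finishes

def precache(recent_changes):
    """Pre-caching: score each new edit as it happens, before anyone asks,
    so that later requests hit the result cache instead of recomputing."""
    for change in recent_changes:
        score_revision.delay(change["rev_id"])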
New Development -- More flexible infrastructure. More languages supported.
  • We have generalized our utilities for generating article quality models so that they can be used outside of English Wikipedia. See github.com/wiki-ai/wikiclass and, specifically, the Extractor abstraction.
  • We have extended our language support from 4 languages (en, fa, pt, tr) to 9 (+fr, he, id, es, vi).
  • We have developed a workflow for automatically generating language resources (likely badwords, informals, stop words, etc.) for new languages by processing local Wikipedia content (see the sketch after this list).
  • We refactored how revscoring uses language utilities to remove the requirement for a "local language", so that we can better support multilingual wikis like Commons, Meta and WikiData and catch, for example, English-language vandalism on Vietnamese Wikipedia.
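To give a flavor of the resource-generation workflow mentioned above, here is a rough sketch of one plausible approach: surface "likely badwords" by comparing word frequencies in reverted vs. surviving edits. The corpora, threshold, and function names are illustrative assumptions, not the project's exact code.

import re
from collections import Counter

WORD_RE = re.compile(r"\w+", re.UNICODE)

def word_freqs(texts):
    """Relative frequency of each word across a corpus of edit texts."""
    counts = Counter(w.lower() for t in texts for w in WORD_RE.findall(t))
    total = sum(counts.values()) or 1
    return {w: n / total for w, n in counts.items()}

def likely_badwords(reverted_texts, surviving_texts, ratio=10.0):
    """Words that are far more frequent in reverted edits than in
    surviving edits are good candidates for a language's badwords list."""
    reverted = word_freqs(reverted_texts)
    surviving = word_freqs(surviving_texts)
    floor = 1 / (len(surviving) + 1)  # smoothing for unseen words
    return sorted(w for w, f in reverted.items()
                  if f / surviving.get(w, floor) >= ratio)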
Research -- Cause no harm by getting ahead of bias and other problems in prediction. Inform the literature/academic community with what we learn.
  • We surveyed the literature to learn about social concerns related to the application of "subjective algorithms"
  • We surveyed the literature about bias detection for machine learning algorithms
  • We used unsupervised learning strategies to extract patterns in the types of edits that are reverted
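For example, here is a minimal sketch of that unsupervised strategy using scikit-learn's k-means; the feature set shown is an illustrative assumption, not revscoring's real feature vectors.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Each row describes one reverted edit:
# (chars added, chars removed, badword count, editor is anonymous)
X = np.array([[120, 4, 0, 1],
              [0, 2400, 0, 1],    # blanking-like
              [35, 0, 3, 1],      # slur-insertion-like
              [500, 20, 0, 0]])   # good-faith but reverted

X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels)  # cluster ids hint at distinct kinds of "damage"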
Community Outreach -- Increase adoption and integration of ORES into other tools.

Midpoint outcomes[edit]

Finances[edit]

We have spent half of our total funds as planned and have not requested or received additional resources.

Learning[edit]

What are the challenges[edit]

Usage measurements
  • While ORES has generated ~10 million scores, many of those scores were generated by our "precaching" system, which loads scores in anticipation of them being requested. It's difficult to use our current logs to know how many scores were actually generated in response to a request from a human or a wiki tool. So, we've prioritized proper metrics collection and are just about to deploy it. See github.com/wiki-ai/ores/pull/96 and the sketch below.
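As a toy illustration of the distinction we need to capture (the real instrumentation is in the pull request above), the metrics must count precache-driven scores separately from externally requested ones:

from collections import Counter

metrics = Counter()

def record_score(rev_id, source):
    """source is "precache" or "request", depending on what triggered
    the scoring job (illustrative labels, not our real schema)."""
    metrics["scores." + source] += 1

record_score(123456, "precache")
record_score(123456, "request")
print(metrics)  # Counter({'scores.precache': 1, 'scores.request': 1})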
Wikilabels UI overcommitment
  • We planned to flesh out the Wikilabels interface as part of this IEG. Since we picked up Arthur as a grantee and his skill set was more closely suited to modeling work, we've refocused on bias detection rather than improvements to Wikilabels. We think that this will be great for revscoring/ORES and our users, but it was difficult to work out the details of the switch.
Timing of disbursements
  • Given that grant disbursements pay rent, the timing at which they are received is very important. It has been difficult to make sure that the timing worked out for us, but we were able to work with the Grants and Finance teams to make it work so far.

What is working well[edit]

Adoption rate
  • We have been working on setting up the infrastructure to handle large quantities of requests. This is a critical aspect if we are going to get the wide-scale adoption we intend to have. We will now focus on measuring our adoption as well as how often our scores are used externally.
Adaptive Re-prioritization
  • While the re-prioritization away from Wikilabels work could be seen as a shortcoming, I think it speaks volumes that the team was able to shift focus based on the resources we had available and find important work to do.

Next steps and opportunities[edit]

What are the next steps and opportunities you’ll be focusing on for the second half of your project? Please list these as short bullet points.

  • We will focus on identifying bias in our approaches. Machine learning algorithms can develop a bias if they are trained on biased examples, and our handcoders themselves can be biased as well. We are working toward measuring this bias and will then work toward mitigating the problem (see the first sketch below).
  • We will develop more multi-class classification algorithms, such as Error-Correcting Output Codes (ECOC), which will be useful for Automatic Assessment of Article Quality (A3Q) as well as for Edit Quality Scores (EQS) (see the second sketch below).
  • We will focus on making the infrastructure more robust so that it can handle an even greater workload.
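As a first sketch of what "measuring this bias" might look like (the editor groups and data below are illustrative assumptions, not our settled methodology), one simple check compares the model's false-positive rate across editor groups:

import numpy as np

def false_positive_rate(y_true, y_pred):
    """Share of genuinely good edits that the model flags as damaging."""
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return fp / (fp + tn) if (fp + tn) else 0.0

y_true = np.array([0, 0, 1, 0, 0, 1, 0, 0])          # actually damaging?
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])          # model prediction
is_anon = np.array([1, 1, 1, 1, 0, 0, 0, 0], bool)   # editor group

for group, mask in [("anonymous", is_anon), ("registered", ~is_anon)]:
    print(group, false_positive_rate(y_true[mask], y_pred[mask]))
# A large gap between the two rates would signal a biased model.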
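And here is a minimal sketch of ECOC for multi-class article quality prediction, using scikit-learn's OutputCodeClassifier; the features and labels are randomly generated stand-ins, not real article data.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OutputCodeClassifier

classes = ["Stub", "Start", "C", "B", "GA", "FA"]  # WP 1.0 quality scale
rng = np.random.RandomState(0)
X = rng.rand(60, 5)                      # toy article features
y = rng.randint(len(classes), size=60)   # toy quality labels

# ECOC encodes each class as a binary codeword and trains one binary
# classifier per bit; code_size > 1 adds redundancy that lets the
# ensemble correct individual classifiers' mistakes.
ecoc = OutputCodeClassifier(LogisticRegression(max_iter=1000),
                            code_size=2, random_state=0)
ecoc.fit(X, y)
print(classes[ecoc.predict(X[:1])[0]])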

Grantee reflection[edit]

  • I am most excited to see the community reaction - particularly that of third party tool developers that wish to use our system to supplement their development process. Being able to serve even more communities is also very exciting! -- とある白い猫 chi? 00:37, 28 September 2015 (UTC)
  • My first few weeks were devoted primarily to setup of a Debian development environment and looking at the ores / revscoring project in wide scope. Aside from routine installation tasks, this included:
  1. gaining a basic understanding of the revscoring source code,
  2. testing various environments for using the ores webservice (leading to the discovery of at least one bug in the ores build),
  3. contributing to a discussion of recent relevant papers by Tufekci and by Sandvig et al. around the currently hotly-debated issue of algorithm/data transparency, and
  4. gaining a rigorous, formal understanding of the machine learning models used by the current revscoring project and considering new models for future use.
Recently, I have turned my attention to the project of clustering edits. I have been working to explicate the SigClust algorithm to the other engineers on the team, and I have recently started working with Amir on implementing SigClust in Python. --Aetilley copy-pasted from email by EpochFail (talk) 19:46, 30 September 2015 (UTC)