Grants:IEG/Revision scoring as a service/Final

Welcome to this project's final report! This report shares the outcomes, impact and learnings from the Individual Engagement Grantee's 6-month project.

Part 1: The Project[edit]

Summary[edit]

Our project aims to create an Artificial Intelligence (AI) infrastructure for Wikimedia sites as well as any other MediaWiki installation. We have successfully deployed a working revision scoring service that currently supports 5 languages (enwiki, fawiki, frwiki, ptwiki and trwiki), and we have made the onboarding of new languages relatively trivial (specify language-specific features, or simply choose to use only language-independent features). In order to gather labeled data for new languages and machine learning problems, we also developed and deployed a generalized crowd-sourced labeling system (see Wiki labels) with translations for all of our supported languages.

We currently support two types of models:

  • A "reverted" model, which predicts whether an edit will need to be reverted (used for counter-vandalism triage)
  • An article quality model, which predicts WP 1.0-style assessment classes

An "edit_type" model is currently in development.

To accomplish this, we have developed and released a set of libraries and applications that are openly licensed (MIT):

  • revscoring -- a Python library for building machine learning models to score MediaWiki revisions (see the illustrative sketch after this list)
  • ores -- a Python application for hosting revscoring models behind a web API
  • wikilabels -- a web application built in Python/HTML/CSS/JavaScript that uses OAuth to integrate an on-wiki gadget with a WMFLabs-hosted back-end, providing a convenient interface for hosting crowd-sourced labeling tasks on Wikimedia sites
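
To illustrate the kind of pipeline that revscoring packages up, here is a minimal, hand-rolled sketch using scikit-learn directly. The feature values and training data are hypothetical, and this is not the revscoring API itself -- revscoring wraps this sort of pipeline behind reusable feature definitions and model utilities.

  # Illustrative only: a hand-rolled version of the pipeline revscoring provides
  # (feature extraction -> classifier -> probability score). The feature set and
  # training data below are hypothetical examples, not our real data.
  from sklearn.ensemble import GradientBoostingClassifier

  # Hypothetical language-independent features of an edit:
  # [bytes_changed, chars_added, markup_chars_added, is_section_edit]
  X_train = [
      [1203, 950, 40, 1],
      [-2300, 0, 0, 0],
      [15, 12, 0, 0],
  ]
  y_train = [False, True, False]  # True = the edit was later reverted

  model = GradientBoostingClassifier()
  model.fit(X_train, y_train)

  # Score a new revision's feature vector: probability that it will be reverted.
  new_edit = [[-1800, 0, 0, 0]]
  print(model.predict_proba(new_edit))  # [[P(not reverted), P(reverted)]]

In revscoring itself, the language-specific or language-independent feature definitions mentioned above take the place of the hand-written feature vectors in this sketch.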

This project is hardly done and will likely (hopefully) never be done. We leave the IEG period working on developing new models, increasing the system's scalability, and coordinating with tool developers (e.g. en:WP:Huggle) to switch to using the revscoring system.

Methods and activities[edit]

This project was very technically focused, so primarily, we built stuff.

Diagram: the three components of the project (Revision Scoring, ORES, and Wiki Labels).
  • The technical work was split across three components:
    1. Revscoring (Revision Scoring): houses the AI back-end, with a number of machine learning classifiers.
    2. ORES (Objective Revision Evaluation Service): our API for tool developers and researchers; users, tools and bots alike are able to query it (see the example request after this list).
    3. Wiki labels: unlike existing AI tools, we made an extra effort to crowdsource handcoding to contributors. This serves as a means for us to generate training sets while also gathering feedback from the communities we serve, so that we can better train our machine learning algorithms.
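
As an example of how a tool or bot might query the API, here is a small sketch using Python's requests library. The host and URL pattern shown are assumptions based on the labs-hosted deployment and may differ from the actual endpoint; the revision IDs are hypothetical.

  # Illustrative ORES request for 'reverted' scores on two English Wikipedia
  # revisions. The host/URL pattern is an assumption, not the documented endpoint.
  import requests

  ORES_URL = "https://ores.wmflabs.org/scores/enwiki/reverted/"  # assumed
  rev_ids = [1234567, 7654321]  # hypothetical revision IDs

  response = requests.get(ORES_URL, params={"revids": "|".join(map(str, rev_ids))})
  response.raise_for_status()

  # The response maps each revision ID to a prediction and class probabilities.
  for rev_id, score in response.json().items():
      print(rev_id, score)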

With technical work comes documentation. We wrote a lot of documentation.

Finally, we also engaged in substantial work with "the community". We discussed in our Midpoint report how we were featured in the Signpost on several wikis. Since the midpoint, most of our community engagement has revolved around Wiki labels campaigns -- reporting on progress, answering questions, fixing bug reports and encouraging continued participation. E.g., see the talk page for our English Wikipedia labeling campaign.

Outcomes and impact[edit]

Outcomes[edit]

As we have covered previously, we have deployed the systems we set out to build and deploy. The only area in which we did not reach our goal was that we were not able to complete the labeling campaigns quickly enough. That means our models are trained on reverted edits rather than human-labeled data. However, this is temporary and will be rectified as soon as the campaigns are completed.

As far as measurable outcomes, we stated two categories: model fitness and adoption rate.

Progress towards stated goals[edit]

Planned measure of success (include numeric target, if applicable) | Actual result | Explanation
Comparable to state-of-the-art model fitness | We are matching or beating the reported state-of-the-art (84% AUC) on 5 wikis with our revert prediction models. | Woo! (See below for a brief illustration of how AUC is computed.)
Adoption by tool developers | We did not instrument our API to measure usage; it seemed that our time was better spent engineering the system to handle demand than counting individual tools. However, we have some key indications of adoption and planned adoption. | WEF and WikiProject X are making use of our article quality predictor. We are working with en:WP:Huggle developers to help them convert to using ORES. We have also developed our own tool based on revision scores (see RCScoreFilter), which currently has 11 users despite its demo-level quality.
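
For readers unfamiliar with AUC (area under the ROC curve), the sketch below shows how such a figure can be computed with scikit-learn. The labels and scores are made up for illustration and are not our evaluation data.

  # Illustrative AUC computation; the labels and scores are made-up examples.
  # AUC is the probability that a randomly chosen reverted edit gets a higher
  # score than a randomly chosen good edit (1.0 = perfect, 0.5 = chance).
  from sklearn.metrics import roc_auc_score

  y_true = [0, 0, 1, 1, 0, 1]                 # 1 = edit was reverted
  y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]   # model's predicted P(reverted)

  print(roc_auc_score(y_true, y_score))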


Global Metrics[edit]

These metrics have limited relevance for our project: we are primarily building back-end infrastructure, so we do not have a direct impact on content creation. Furthermore, the ultimate goal of our project is to process on-wiki tasks such as counter-vandalism or WP 1.0 assessments and to predict future cases as accurately as possible. At this stage we rely on the feedback of more experienced users, which significantly limits our interaction with newer users. As such, we do not achieve these global metrics so much as provide the means of support to help others do so.

On reflection, this list seems more relevant to edit-a-thons and other in-person events, where it is easier to count individuals and to track the work that they do. However, we try to address the prompts as well as we can below.

For more information and a sample, see Global Metrics.

Metric | Achieved outcome | Explanation
1. Number of active editors involved | 51 active editors helping us label revisions; several others involved in discussions of machine learning in Wikipedia in general. | How many users do WikiProject Medicine and the WEF account for?
2. Number of new editors | 0 | We aren't yet at the stage where our project will have impacted this.
3. Number of individuals involved | Measurement of this is not tractable.
4. Number of new images/media added to Wikimedia articles/pages | N/A | We do not add new content. Media related to our project: Commons:Revision scoring as a service (category).
5. Number of articles added or improved on Wikimedia projects | 0 | Our system supports triage, so you might count every article where a user used our tool to perform a revert, but that is probably not what is desired here.
6. Absolute value of bytes added to or deleted from Wikimedia projects | N/A | We don't edit directly -- we just evaluate revisions in support of editors doing the work themselves. We did, however, create a landing page on each of the wikis we operate on:
  1. az:Vikipediya:Etiketləmə
  2. en:Wikipedia:Labels
  3. fa:ویکی‌پدیا:برچسب‌ه
  4. fr:Wikipédia:Label
  5. pt:Wikipédia:Projetos/Rotulagem
  6. tr:Vikipedi:Etiketleme


Learning question
Did your work increase the motivation of contributors, and how do you know?
  • This is a larger goal of User:Halfak (WMF)'s research agenda. In his expert opinion, this project provides a critical means of stopping the mass demotivation of newcomers[1]. So, to answer the exact question: we did nothing to increase contributor motivation. To answer the spirit of the question: we have done some very critical things to preserve contributor motivation. See Halfak's recent presentation for more discussion.

Indicators of impact[edit]

How did you improve quality on one or more Wikimedia projects?
  • We built a system that provides the essential functionality to triage quality work -- AI-based prediction of quality and quality changes.
  • Our system provides this quality dimension in a way that is extremely easy for tool developers to work with.
  • Our system provides this quality dimension to wikis that have historically suffered from a lack of triage support.

Project resources[edit]

Mockups

Architecture. An early mock of the Wiki Labels architecture
Integrated gadget. Early mockup of the Wiki labels gadget interface
Integrated gadget (fullscreen). Early mockup of the fullscreen diff coding interface for Wiki labels
Integrated gadget (pre-install). Early mockup of the Wiki labels gadget interface before the gadget is installed

Screenshots

Stand-alone gadget. Screenshot of the functioning Wiki labels stand-alone gadget
Gadget. Screenshot of the functioning Wiki labels gadget (on-wiki)
Form builder. Screenshot of the Wiki labels form builder interface
OAuth. Screenshot of the Wiki labels OAuth handshake
Gadget (fullscreen). Screenshot of the fullscreen Wiki labels diff coding interface
First French label. First French Wikipedia Wiki labels tagging at the Lyon Wikimedia Hackathon, 2015
Dependency graph. A rendering of the revscoring feature dependency tree

Presentations

Wikimedia Foundation's metrics meeting in January -- presentation of the prototype system.
Research Showcase presentation on the vision of Revscoring (May 2015). See also the video of the presentation.


ORES performance plans & analysis

Basic flow. The basic ORES request flow. All processing happens within a single thread (limited to a single CPU core). No caching is done.
Basic + caching. The basic flow augmented with caching. All processing is still single-threaded, but a cache is used to store previously generated scores. This enables a quick response for requests that include previously generated scores.
Basic + caching & celery. The basic flow augmented by caching and celery. Processing of scores is farmed out to a celery computing cluster. Re-processing of a revision is prevented by tracking open tasks and retrieving AsyncResults. (A rough sketch of this pattern appears below.)
Server response timing. Empirical probability density functions for ORES scoring times, generated using the 'reverted' model for English Wikipedia and 5k revisions batched into 50-revision requests. Groups represent different iterations of performance improvements for the ORES service.
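
The caching and task-tracking pattern described in the diagrams above can be sketched roughly as follows. The names (score_revision, compute_score, CACHE, IN_FLIGHT) and the broker configuration are placeholders for illustration, not the actual ORES code.

  # Rough sketch of the "basic + caching & celery" flow; names and configuration
  # here are placeholders, not the actual ORES implementation.
  from celery import Celery
  from celery.result import AsyncResult

  app = Celery("ores_sketch", broker="redis://localhost", backend="redis://localhost")

  CACHE = {}      # previously generated scores, keyed by (wiki, model, rev_id)
  IN_FLIGHT = {}  # open task ids, so the same revision is not re-processed

  @app.task
  def compute_score(wiki, model, rev_id):
      # Placeholder for feature extraction + model scoring on a worker node.
      return {"prediction": False, "probability": {"true": 0.07, "false": 0.93}}

  def score_revision(wiki, model, rev_id):
      key = (wiki, model, rev_id)
      if key in CACHE:                  # quick response from the cache
          return CACHE[key]
      if key in IN_FLIGHT:              # a worker is already processing it
          result = AsyncResult(IN_FLIGHT[key], app=app)
      else:                             # farm the work out to the celery cluster
          result = compute_score.delay(wiki, model, rev_id)
          IN_FLIGHT[key] = result.id
      score = result.get(timeout=15)    # wait for the worker to finish
      CACHE[key] = score
      IN_FLIGHT.pop(key, None)
      return score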

Project management

Project documentation

Repositories

Wiki labels WikiProjects

Wikimedia Labs project

Nova_Resource:Revscoring

Outreach

Learning[edit]

What worked well[edit]

SCRUM methodology
We held weekly meetings to adjust and/or reprioritize our workload to best meet our longer-term goals. Do this! This methodology was particularly helpful when something unexpected happened and we needed to shift our attention to a specific component of the project.
Project management with trello
This is probably true of any card/swimlane-based system. When it came to prioritizing work and coordinating who would work on what, it was very helpful to be able to assign cards, and then talk about them the next week if they were not finished. Do this. It's good.
Lots of mockups
We didn't have a single disagreement about how something should look or work because we mocked things up in advance. For interfaces, we started mocking them up before the IEG began and went through a few iterations before implementing things. We also "mocked up" APIs by talking about what URLs would be served and what the responses would look like. This let us work quickly when it was time to write the code.
Hackathons
We gathered a lot of attention and many potential long-term collaborators to the project at the Wikimedia Hackathon in Lyon. We're planning to do a similar push at the Wikimania Hackathon since it was such an obvious win.
Presentations
We presented on this project in several different venues. This has helped us get some of the resources that we need. It's common that IEG-like projects are a bit ahead of their time. When that's the case, helping an audience understand *why* it is you are doing what you are doing is critical.
Code review
We followed a simple rule for each code project once it hit a basic level of maturity: you may not merge your own changes into "master". That means we filed pull requests and had to have at least one other person sign off on a change before it would be incorporated into the "master" branch. While this had the potential to slow us down, we minimized the problem by communicating about open pull requests whenever we met. The plus side was that our code quality is high and the combined concerns of our team were expressed effectively on the code-bases we shared.

What didn’t work[edit]

We had some difficulty attracting volunteers; even users with a technical background were a bit disheartened from participating and assisting with handcoding data through Wiki labels.
We believe there are two reasons for this:
  • The complicated, hard-to-understand nature of Artificial Intelligence in general caused users to shy away from participating, even though our work reduces the need to understand how Artificial Intelligence works. We weren't able to convey the social aspects of our technical work quickly enough to gather enough volunteers for Wiki labels.
  • Users are afraid to make mistakes, and hence do not want to make "close calls". We circumvented this problem by adding an "unsure" checkbox.

Other recommendations[edit]

We believe a separate set of global metrics is needed to better quantify the kind of technical infrastructure work we are performing. Even though our project intends to have a high impact on Wikimedia's strategic goals, our impact at this stage will mostly be indirect.
  • For instance, we enable tool developers to build tools using AI. As a result, the impact of such tools should improve; however, this depends entirely on how quickly our system is adopted.
  • Also consider that we are trying to stop the mass demotivation of newcomers and to preserve contributor motivation, which we believe will have a profound impact on the Wikimedia strategic goals but is not measured by the current global metrics. At this stage, we seek to prevent the alienation of newcomers who are brought in by other means, such as other projects or campaigns, rather than seeking to bring in newcomers ourselves. In a sense, we are indirectly supporting all other efforts to attract new users.

Next steps and opportunities[edit]

The limitless nature of the opportunities offered by Artificial Intelligence makes this particularly difficult to summarize. We have so far only uncovered the tip of the iceberg.
  • In these six months, as discussed above, we were able to set up the Artificial Intelligence framework to build on top of.
  • We have also addressed the more back-end but critical hardware and performance aspects of our project.
Hence we see great potential if we can continue this work. In the next six months...

Part 2: The Grant[edit]

Finances[edit]

Actual spending[edit]

Expense | Approved amount | Actual funds spent | Difference
Development & community organizing stipends (とある白い猫) | $7500 | $7500 | $0
Development & community organizing stipends (He7d3r) | $4875 | $4875 | $0
Development & community organizing stipends (He7d3r) | $4500 | $4500 | $0
Total | $16875 | $16875 | $0


Remaining funds[edit]

  • No unspent funds remain.

Documentation[edit]

Did you send documentation of all expenses paid with grant funds to grantsadmin(_AT_)wikimedia.org, according to the guidelines here?

Please answer yes or no. If no, include an explanation.

  • Yes

Confirmation of project status[edit]

Did you comply with the requirements specified by WMF in the grant agreement?

Please answer yes or no.

  • Yes.

Is your project completed?

Please answer yes or no.

  • Yes, we achieved our goals for this 6 month cycle and we are seeking a renewal to build on top of our progress.

Grantee reflection[edit]

  • This project allowed me to utilize my expertise in Artificial Intelligence for the benefit of the Wikimedia family as a whole. I enjoyed this aspect of the IEG the most. -- とある白い猫 chi? 09:50, 24 June 2015 (UTC)
  • Seeing the recent changes feed coloured with predictions of reversions from our project for the first time was really amazing. I enjoyed the process of getting there a lot :-) It allowed me to practice my understanding of AI and Python while creating something which will benefit a lot of users. Helder 14:47, 24 June 2015 (UTC)

Notes[edit]