Grants:IEG/Revision scoring as a service


This project is funded by an Individual Engagement Grant.


This Individual Engagement Grant is renewed.


Revision scoring as a service
status: selected
summary: We will construct a revision scoring service that will use machine learning algorithms to score the quality of revisions. This service is intended to support the development of intelligent tools for Wikipedia editors (e.g. en:WP:STiki and en:WP:Snuggle). The service will be based on open source software and will run from Wikimedia Labs.
target: Non-English Wikipedias
strategic priority: improving quality
theme: tools
amount: $16,875 USD
volunteer: EpochFail, Jonas AGX, Danilo.mac, Ladsgroup
created on: 00:30, 21 September 2014 (UTC)



Project idea

Many of Wikipedia's most powerful tools rely on machine classification of edit quality. Regrettably, few of these tools publish an API for consuming the scores they generate, and those that do cover only English Wikipedia. In this project, we'll construct a public, queryable API of machine-classified scores for revisions. We believe that such a service would make it much easier to build powerful new wiki tools and to extend current tools to new wikis. For example, this project idea was originally proposed by the developer of en:WP:Snuggle in order to deploy the tool beyond English Wikipedia.

What is the problem you're trying to solve?

Current tools

English Wikipedia has a lot of machine learning classifiers applied to individual edits for the purposes of quality control:

  • ClueBot NG is a powerful counter-vandalism bot that uses Bayesian language learning machine classification.
  • Huggle triages edits for human review locally, based on extremely simple metadata.
  • STiki calculates its own vandalism probabilities using metadata, consumes those of ClueBot NG, and makes both available as "queues" for human review in a GUI tool.

Availability of scores

All of these tools rely on machine generated revision quality scores -- yet obtaining such scores is not trivial in most cases. STiki is the only system that provides a queryable API to its internal scorings. ClueBot NG provides an IRC feed of its scores, but not a querying interface. The only one of these tools that runs outside of English Wikipedia is Huggle, but Huggle produces no feed or querying service for its internal scores.

Importance of scores

The lack of a general, accessible revision quality scoring service hinders the development of new tools and the expansion of current tools to non-English wikis. For example, Snuggle takes advantage of STiki's web API to perform its own detection of good-faith newcomers. Porting a system like Snuggle to non-English wikis would require a similar queryable source of revision quality scores.

What is your solution?

We can do better. In this project, we'll develop and deploy a general, queryable scoring service that provides access to quality scoring algorithms and pre-generated models via a web API. We expect that such a system would allow new tools to be developed and current tools to be ported to new projects more easily.

Query
http://revscores.wmflabs.org/?rev_id=34854345&scores=text_svm|stiki_meta|wikiclass
Response
{
  "text_svm": {
    "damaging": 0.87289,
    "good-faith": 0.23009
  },
  "stiki_meta": {
    "damaging": 0.8900102
  },
  "wikiclass": {
    "class": "B",
    "probabilities": {
      "FA": 0.0023,
      "GA": 0.2012,
      "B": 0.4501,
      "C": 0.1810,
      "Start": 0.120,
      "Stub": 0.023
    }
  }
}
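
As a rough illustration of how a tool might consume such an API, the Python sketch below builds the example query URL and picks the highest "damaging" score out of a response. The endpoint and response shape are taken from the example above and are still a proposal, so both are assumptions; the helper names `build_query` and `most_damaging` are hypothetical, and the sketch parses a hard-coded sample instead of making a network request.

```python
import json
from urllib.parse import urlencode

# Base URL taken from the example query above; the service is only proposed,
# so the endpoint and the response shape are assumptions.
BASE_URL = "http://revscores.wmflabs.org/"


def build_query(rev_id, scorers):
    """Return the query URL for one revision and a list of scorer names."""
    params = {"rev_id": rev_id, "scores": "|".join(scorers)}
    # safe="|" keeps the pipe separator unescaped, matching the example query.
    return BASE_URL + "?" + urlencode(params, safe="|")


def most_damaging(response_text):
    """Return (scorer, probability) for the highest "damaging" score, or None."""
    scores = json.loads(response_text)
    damaging = {
        name: result["damaging"]
        for name, result in scores.items()
        if "damaging" in result
    }
    if not damaging:
        return None
    return max(damaging.items(), key=lambda item: item[1])


# Abridged sample response, modelled on the example above.
SAMPLE_RESPONSE = """
{
  "text_svm": {"damaging": 0.87289, "good-faith": 0.23009},
  "stiki_meta": {"damaging": 0.8900102},
  "wikiclass": {"class": "B"}
}
"""

url = build_query(34854345, ["text_svm", "stiki_meta", "wikiclass"])
top = most_damaging(SAMPLE_RESPONSE)
```

Scorers that do not report a "damaging" probability (such as wikiclass here) are simply skipped, so a tool can request several scorers in one call and use whichever ones apply.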

Cross-lingual scoring

The machine learning models within the system will need to be trained on a per-language/wiki basis. If a new wiki-language community wants to have access to scores, we'd ask them to provide us with a random sample of hand-coded revisions (as damaging vs. not-damaging) from which we can train/test new models.

Example
rev_id    damaging   good-faith
93264923  1          0
2383829   0          1
3202937   0          1
30029437  1          1
9030299   0          0
...

To support this work, we may also explore the construction of a human computation interface for enabling Wikipedians to more easily produce human-assessed quality ratings.
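
A hand-coded sample like the one in the example could be loaded and split for training and testing roughly as follows. This is a minimal sketch: the tab-separated layout, the 20% hold-out fraction and the helper names `load_labels` and `train_test_split` are assumptions for illustration, since the proposal does not specify a file format or a split.

```python
import csv
import io
import random

# Rows copied from the example table above; a real sample would be much larger.
SAMPLE_TSV = """rev_id\tdamaging\tgood-faith
93264923\t1\t0
2383829\t0\t1
3202937\t0\t1
30029437\t1\t1
9030299\t0\t0
"""


def load_labels(text):
    """Parse tab-separated labels into a list of dicts with typed values."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {
            "rev_id": int(row["rev_id"]),
            "damaging": row["damaging"] == "1",
            "good_faith": row["good-faith"] == "1",
        }
        for row in reader
    ]


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle deterministically and hold out a fraction of rows for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


labels = load_labels(SAMPLE_TSV)
train, test = train_test_split(labels)
```

Keeping "damaging" and "good-faith" as separate boolean columns preserves rows such as 30029437, which is labeled both damaging and good-faith.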

Project goals

  1. Construct a highly accurate classifier for revision quality (Machine learning classifier)
    • We will develop strategies to make it easy to set up new models for new wikis
  2. Make the classifier's scores available to wiki tool developers (API service)
    • We will construct an API that will allow bots and wiki tools to make use of the service.
  3. Socialize the use of the service (Impact)
    • We'll construct a proof-of-concept gadget and broadcast availability of the tool to wiki tool developers.
    • We'll instrument use of the API and downloads of our datasets to track this impact.
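
The instrumentation in goal 3 could start out as a simple count of requests per client. The sketch below assumes that each API request is logged with its User-Agent string and that tools identify themselves there; the log records, tool versions and the helper `requests_per_tool` are all hypothetical.

```python
from collections import Counter

# Hypothetical log records: (user_agent, rev_id) pairs, one per API request.
REQUEST_LOG = [
    ("Snuggle/1.0", 34854345),
    ("STiki/2.1", 34854345),
    ("Snuggle/1.0", 34854346),
    ("ExampleGadget/0.1", 34854400),
    ("Snuggle/1.0", 34854401),
]


def requests_per_tool(log):
    """Count requests per tool, using the part of the User-Agent before '/'."""
    return Counter(user_agent.split("/", 1)[0] for user_agent, _ in log)


usage = requests_per_tool(REQUEST_LOG)
distinct_tools = len(usage)
```

The number of distinct tools is the adoption figure; the per-tool counts show which consumers depend on the service most heavily.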

Project plan

Activities

Month 1
  • Community consultation in the 4 languages spoken by active participants (en, pt, az & tr) via village pump & tool labs mailing list
  • Summarize the state-of-the-art by reviewing the academic literature on machine-assisted triage (counter-vandalism)
Month 2
  • Model building and feature extraction based on available data
  • Development of revision hand-coding tool (gadget)
Month 3
  • Complete feature extractor strategy (with language dependent components for en, pt and tr)
  • Manage the hand-coding of labeled datasets for enwiki, ptwiki and trwiki
  • Mid-term report due
Month 4
  • Model building and optimization on labeled datasets (complete)
  • Hire new contractor to fill community engagement role
Month 5
  • Development of API for accessing winning models on Wikimedia Labs complete
Month 6
  • Final report due
  • Documentation for scoring service, models and hand-coding tool complete

Budget

The budget for this grant will only need to cover the time investment of the two grantees and potentially a contractor to fill in if He7d3r gets a job in 3 months. If he is still available, we'd like to re-hire him to complete the grant.

  • とある白い猫 -- 12 hours per week @ $25 * 25 weeks = $7500
  • He7d3r -- 15 hours per week @ $25 * 13 weeks = $4875
    • Contractor to replace/cover/support He7d3r -- 12 hours per week @ $25 = $4500
Total
$16,875 USD
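
The line items can be checked against the total with a few lines of arithmetic. The contractor line does not state a number of weeks, so the sketch uses its stated $4500 directly; the 15-week figure it derives is implied by the numbers, not stated in the budget.

```python
HOURLY_RATE = 25  # USD per hour, as stated for every line item

# Hours per week * rate * weeks for the two items that state all three figures.
line_items = {
    "とある白い猫": 12 * HOURLY_RATE * 25,
    "He7d3r": 15 * HOURLY_RATE * 13,
    "Contractor": 4500,  # stated amount; the week count is not given
}

total = sum(line_items.values())

# Weeks implied by the contractor amount at 12 hours per week (derived, not stated).
implied_contractor_weeks = line_items["Contractor"] / (12 * HOURLY_RATE)
```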

All three core participants (とある白い猫, He7d3r and EpochFail) will be contributing code and data to the system. To fill the other project roles, we have split responsibilities as follows:

  • Community engagement -- He7d3r
  • Reporting and documentation -- とある白い猫
  • Project management -- EpochFail

Community engagement

We intend to submit 1-5 paragraphs each month to The Signpost and to similar channels on other wikis (using the MassMessage and Translate extensions). These updates will report the status of the work and attract further community interest, which in turn will give us feedback and more ideas along the way. With community input, the project can be tailored to the needs and circumstances of different wikis. All code will remain publicly viewable on Wikimedia Labs.

Sustainability

This project will be entirely open source. All of our code will be developed in a GitHub repository [1] and the project's API will be hosted on Wikimedia Labs. After the project is completed, the maintenance work will be minimal, except that models will need to be retrained periodically on new labeled data. The production of new labeled data can be left up to the consumers of scores.

Measures of success

We will measure our success in two ways.

Model fitness
We will use standard model training and testing strategies to ensure the accuracy of our models. Here, success means attaining comparable or better fitness than other state-of-the-art classifiers (e.g. en:WP:STiki's classifier).
Adoption rate
We will instrument logging in the API service in order to track how many different tools and services make use of revision scores. Success means gaining wide adoption. Assuming we can build an appropriately fit model, we are already guaranteed adoption by two tool maintainers (Snuggle and en:WP:STiki). We hope to expand this list of pre-adopters through our community consultation.
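
Model fitness on a held-out test set could be summarized in many ways. Precision and recall at a fixed threshold are one common choice, sketched below; the proposal does not commit to a particular metric, and the test data, the 0.5 threshold and the function names here are hypothetical.

```python
def confusion_counts(scored, threshold=0.5):
    """Count (TP, FP, TN, FN) for (probability, is_damaging) pairs."""
    tp = fp = tn = fn = 0
    for probability, is_damaging in scored:
        predicted = probability >= threshold
        if predicted and is_damaging:
            tp += 1
        elif predicted and not is_damaging:
            fp += 1
        elif is_damaging:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def precision_recall(scored, threshold=0.5):
    """Return (precision, recall), using 0.0 when a denominator is zero."""
    tp, fp, _, fn = confusion_counts(scored, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical held-out test set: (model's damaging probability, human label).
TEST_SET = [
    (0.91, True),
    (0.75, True),
    (0.40, True),
    (0.65, False),
    (0.20, False),
    (0.05, False),
]

precision, recall = precision_recall(TEST_SET)
```

Running the same computation on another classifier's scores for the same revisions would give a like-for-like comparison of fitness.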

Get involved

Participants

  • Volunteer I love to deal with such metrics and test of models, while I can help with my coding skills. Jonas AGX (talk) 04:14, 5 October 2014 (UTC)
  • Volunteer Python and SQL programming Danilo.mac talk 23:09, 6 October 2014 (UTC)
  • Volunteer Sysadmin skills for setting this up on labs, and discussing architecture for deployment Yuvipanda (talk) 21:35, 14 November 2014 (UTC)

Community Notification

Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions.

General announcements


Endorsements

Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. (Other constructive feedback is welcome on the talk page of this proposal).

  • Support Though I'm not technically able to assess this proposal in detail, as far as I can understand and have heard from other qualified people it's targeting the challenge of improving and making antivandalism tools more customizable to be implemented globally. And this is key to increase quality in Wikimedia projects in which resources for specific and dedicated efforts such as these are lacking. The Brazilian volunteer I know engaged in it has been also committed to improving engagement in Wikipedia for a while. --Oona (talk) 07:30, 1 October 2014 (UTC)
  • Support Vandalism control is a major issue in all wikis. Though I do not know the other applicant, I have known Helder from Wikipedia and Wikibooks for several years and I believe he has all the qualifications required to accomplish this task. Lechatjaune (talk) 20:44, 1 October 2014 (UTC)
  • Support Helder is a very trusted technical user in our community and I'm convinced his approach would be not only helpful, but efficient cost-wise. José Luiz talk 23:22, 1 October 2014 (UTC)
  • Support. That is a very interesting proposal. That brings the obvious idea that not only English Wikipedia needs something like that. I like the plans and goals sections. This proposal didn't forget to invite and to keep the broad community participating in the construction and follow-up of the project, which is also important. I am also not able to make technical criticisms and won't also mention costings, but I support the idea of improving antivandalism tools on these terms.—Teles «Talk to me ˱C L @ S˲» 23:52, 1 October 2014 (UTC)
  • Support I have experience of both STiki and Snuggle on the English Wikipedia and feel these tools would be useful on other projects and languages. Also, the use of an API opens things up to other developers, which is good. I do have two comments you may wish to take on board:
    -There isn't a dichotomy between damaging and good-faith edits. I think it would be useful to allow our human classifiers to also specify "damaging but probably good-faith". The machine learning would then calculate probabilities of both. Then tools that present edits for users to consider (like STiki) could use the sum of the two probabilities and the tools that look for good faith editors (like Snuggle) would just look at bad-faith damaging edits. (I would suggest that, if the human classifier is saying that an edit is damaging, they should select whether or not it is good-faith based on the balance of probabilities, rather than assuming good faith, as this tool is not directly reverting edits. But I can see there is room for discussion on that.)
    -I think the information provided by a lot of the meta-data used by STiki will be independent of language. Data collected through this project could show this to be the case (or even merely approximately true). We could then develop a language-independent probability scoring system. This could then enable an API to be created for languages that haven't had a training set of classifications created.
    Yaris678 (talk) 08:08, 2 October 2014 (UTC)
  • Support -- as developer of en:wp:STiki and someone who has done academic research into edit quality. I should note here my endorsement might have minor conflicts-of-interest: I have pledged to consult on this project along with EpochFail and my tools plan to consume the data being produced by this proposal. That aside, I think this represents a tremendous opportunity. EN.WP is rich with tools, but other projects and languages are massively under-served. A language independent revision scoring service like that proposed will make it straightforward to port my en:wp:STiki tool (having 600,000+ reverts on en.wp) to other projects and languages. Bringing this capability under the WMF umbrella is also a good thing in terms of reliability and accountability. Opening up this capability via API will only encourage more research into Wikipedia and hopefully will spin off some cool new tools. Thanks, West.andrew.g (talk) 15:26, 3 October 2014 (UTC)
  • I would really love to get the ClueBotNG scores or something alike via an API to prefilter vandalism in my work of looking at editor interaction. Fabian Flöck (talk) 16:44, 4 October 2014 (UTC)
  • Support The proposal attacks an important and multi-language issue, let's do more of those! --Jonas AGX (talk) 04:12, 5 October 2014 (UTC)
  • Support Considering it is unnecessary to mention the importance of tools to combat vandalism, as a technical user with experience in AI and Machine learning, I endorse this proposal. Mainly because their goals are clear and feasible, giving a perfect notion to the community of what should be expected and how to measure the success of the project. I do not know the user とある白い猫 but I have recently been following the work of Helder (one of the grantees) and he has great skills in software development and in the internals of the MediaWiki software, not to mention that he is one of the most active technical users in pt.wikipedia. Based on these considerations, I think this project meets the requirements to achieve the desired objectives, and should be selected to an IEG. --Diego Queiroz (talk) 15:36, 14 October 2014 (UTC)
  • Support I have developed many retro-patrolling tools to find hoaxes or critical edits on it.wikipedia and I know how important they are. There is a lot of potential that needs to be addressed.--Alexmar983 (talk) 20:37, 20 April 2015 (UTC)
  • Support This system is a good start, to start a more effective combat in the future it could be replicated in all languages. --The Photographer (talk) 20:10, 25 June 2015 (UTC)
  • Support I believe this project will reduce the load on patrollers and reviewers. en.wikipedia is using bots like Cluebot NG but other wikis have to use limited human resources. Currently there are about 7000 pages waiting to be patrolled in tr.wikipedia. Reviewing and patrolling pages takes all our time. Mavrikant (talk) 18:35, 3 October 2015 (UTC)