Grants talk:IEG/Revision scoring as a service

From Meta, a Wikimedia project coordination wiki

All ongoing project discussion will take place at Research talk:Revision scoring as a service.

archived discussions


User:EpochFail I just suggested the heck out of your proposal [1] and am hoping you can get endorsements. 15:51, 23 September 2014 (UTC)

Thanks!  :) --EpochFail (talk) 15:55, 23 September 2014 (UTC)
I am stoked to be working with you, too. How do you feel about [2]? 22:47, 24 September 2014 (UTC)
This looks like hot or not[3] for edits. It seems like we could do something interesting here. However, it would be hard to make the edits easy to compare. Still, I wonder if we could get some decent signal for the desirability of edits from it. We might use something like Elo to aggregate ratings. --EpochFail (talk) 01:48, 26 September 2014 (UTC)
HotOrNot does not use pairwise comparison, which is generally easier and more consistent than scale rankings. Did you see in the Signpost today? Whether or not Gediz shows up, I will try to get Susan on this proposal. 15:56, 28 September 2014 (UTC)
I assume this is the Signpost link: w:Wikipedia:Wikipedia Signpost/2014-09-24/Recent research. Helder 15:56, 1 October 2014 (UTC)
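
To illustrate the Elo idea raised in this thread, here is a minimal sketch of how pairwise "which edit is better?" judgments could be folded into one rating per revision. It assumes each judgment names a preferred and a non-preferred revision. The function names, the starting rating of 1500, the K-factor of 32, and the revision labels are illustrative assumptions, not anything specified in the proposal.

```python
def expected_score(rating_a, rating_b):
    """Elo-model probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0, initial=1500.0):
    """Apply one pairwise judgment in place; unseen items start at `initial`."""
    rating_w = ratings.get(winner, initial)
    rating_l = ratings.get(loser, initial)
    # The winner gains what the loser gives up, so the total stays constant.
    delta = k * (1.0 - expected_score(rating_w, rating_l))
    ratings[winner] = rating_w + delta
    ratings[loser] = rating_l - delta
    return ratings


def aggregate(comparisons, k=32.0):
    """Fold a sequence of (preferred, not_preferred) pairs into ratings."""
    ratings = {}
    for winner, loser in comparisons:
        update_elo(ratings, winner, loser, k=k)
    return ratings


# Hypothetical judgments: rev_a beat rev_b and rev_c, and rev_b beat rev_c.
comparisons = [("rev_a", "rev_b"), ("rev_a", "rev_c"), ("rev_b", "rev_c")]
ratings = aggregate(comparisons)
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

Elo ratings depend on the order in which judgments arrive, so a batch model fitted to all comparisons at once may suit an offline dataset better. This sketch only shows the basic update.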

Finalize your proposal by September 30!

Hi とある白い猫. Thank you for drafting this proposal!

  • We're hosting one last IEG proposal help session in Google Hangouts this weekend, so please join us if you'd like to get some last-minute help or feedback as you finalize your submission.
  • Once you're ready to submit it for review, please update its status (in your page's Probox markup) from DRAFT to PROPOSED, as the deadline is September 30th.
  • If you have any questions at all, feel free to contact me (IEG committee member) or Siko (IEG program head), or just post a note on this talk page and we'll see it.

Cheers, Ocaasi (talk) 20:16, 25 September 2014 (UTC)

Existing tool

An overview that neglects to mention RC patrol leaves much to be desired, IMHO. Cf. [4]. --Nemo 14:45, 1 October 2014 (UTC)

I'm not sure if we are neglecting anything. We're really building infrastructure here for tool developers. The ability to score revisions by the likelihood that they are damaging is critical for quality control tools -- many of which may be used for RC patrol. We want to make constructing such quality control tools trivial. Ideally, one should be able to write a gadget that queries the service we are planning to implement.
However, if you are suggesting that we should contact RC patrollers and counter-vandalism tool devs, then I agree. You are right and we could use your help. --EpochFail (talk) 14:51, 1 October 2014 (UTC)


Hi. Which algorithms are used to obtain these scorings? I'm really interested in these.

And for completeness: I wrote a scoring script for the German Wikipedia nearly eight years ago. It could be found at The score is only generated for edits by IPs. A spam probability is generated using a very simple naive Bayes approach. It is trained automatically by looking at new articles created by IPs. If an article is deleted within seven days (speedy deletion), its words are learned as spam; if it still exists after seven days, its words are learned as ham. Over the years, nearly 790 000 articles created by IPs were learned, with 78 million words (2.9 million different "words"). For example, the word "fuck" was used 12388 times and its spam probability is 98.6%. The word "und" (and) was used 1.7 million times and its spam probability is "only" 60.4%. Maybe this word database is useful for adapting other tools for the German Wikipedia. --APPER (talk) 13:48, 3 October 2014 (UTC)

APPER, I feel like mw:Extension:BayesianFilter would benefit from your feedback. :) Some developers of another wiki farm have been working on it a bit lately. --Nemo 21:01, 3 October 2014 (UTC)
Hi APPER. I'm currently working from the academic lit. -- mostly User:West.andrew.g's work. See [5] for a list of features that he's developed for WP:STiki's classifier. I'm planning to run my first tests with a linear SVM classifier, but I'll experiment with a few others too. I'd love to make use of your badwords database. That is one of the highly manual parts of building new classifiers that is difficult for non-native speakers (such as myself). One of our early goals in this project is to gather such badwords lists.
Given your background in this area, I'd also be interested in having you stick around as an advisor or volunteer if you have the time.  :) --EpochFail (talk) 22:29, 3 October 2014 (UTC)
Thanks for the link to the paper. I can dump the word database for you or I can grant you read access on tool labs. --APPER (talk) 11:17, 4 October 2014 (UTC)
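
The training scheme APPER describes (words from deleted articles count as spam, words from surviving articles count as ham) can be sketched as follows. This is not APPER's script. It is a minimal multinomial naive Bayes with add-one smoothing, and the class name, method names, and sample words are all hypothetical. `word_spam_share` mirrors the kind of per-word figure quoted above, under the assumption that it is the share of a word's occurrences seen in deleted articles.

```python
import math
from collections import Counter


class NaiveBayesSpamModel:
    """Toy spam/ham model trained on the words of new articles."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def learn(self, words, deleted):
        """Learn an article's words as spam if it was deleted, else as ham."""
        label = "spam" if deleted else "ham"
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1

    def word_spam_share(self, word):
        """Share of a word's occurrences that came from deleted articles."""
        spam = self.word_counts["spam"][word]
        ham = self.word_counts["ham"][word]
        total = spam + ham
        return spam / total if total else None

    def spam_probability(self, words):
        """Posterior probability of spam for a bag of words."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Smoothed class prior plus smoothed per-word likelihoods.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for word in words:
                score += math.log(
                    (counts[word] + 1) / (total_words + len(vocab) + 1)
                )
            log_scores[label] = score
        # Clamp the log-odds so math.exp cannot overflow on long texts.
        diff = max(-700.0, min(700.0, log_scores["ham"] - log_scores["spam"]))
        return 1.0 / (1.0 + math.exp(diff))


model = NaiveBayesSpamModel()
model.learn(["buy", "cheap", "pills"], deleted=True)
model.learn(["history", "of", "the", "town"], deleted=False)
model.learn(["the", "river", "and", "the", "town"], deleted=False)
```

With this toy data, `model.spam_probability(["cheap", "pills"])` comes out above 0.5 and `model.spam_probability(["the", "town"])` below it.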

Eligibility confirmed, round 2 2014

This Individual Engagement Grant proposal is under review!

We've confirmed your proposal is eligible for round 2 2014 review. Please feel free to ask questions and make changes to this proposal as discussions continue during this community comments period.

The committee's formal review for round 2 2014 begins on 21 October 2014, and grants will be announced in December. See the schedule for more details.

Questions? Contact us.

Siko (WMF) (talk) 23:12, 3 October 2014 (UTC)

Reusability of the datasets

How would you make sure the datasets produced in this project will be reusable? You might want to make sure the datasets can be CC0. To keep them reusable over the longer term, you might want to include text in the datasets, not just IDs of revisions, which could be deleted or suppressed. In that case, the licensing of the datasets could be a little more complicated, though. whym (talk) 02:10, 18 October 2014 (UTC)

One of the goals of this project is just that: to unify current (and possibly future) scoring, allowing reusability. As for licensing, I think sticking to CC-BY-SA may be more sound for the reasons you have mentioned, as it would be compatible with the most restrictive licensing (CC-BY-SA) used on parts of the data. -- とある白い猫 chi? 19:18, 5 November 2014 (UTC)
Thanks for confirming, とある白い猫. A CC-BY-SA licensed one makes sense. I wonder if it is feasible/worthwhile to create a CC0-licensed reduced version without text. It might attract more use cases that don't require text (such as network analysis), since it is significantly more permissive than BY-SA. whym (talk) 02:08, 8 November 2014 (UTC)
On second thought, network analysis is probably not the best example. A more realistic one would be to provide revision IDs, scoring results and metadata (timestamp, username, etc.) for some sort of trend analysis. whym (talk) 13:47, 16 November 2014 (UTC)
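
As a sketch of the reduced, text-free version discussed above, the snippet below keeps only a revision ID, a score, and basic metadata from each record. The field names (`rev_id`, `score`, `timestamp`, `username`, `text`) and the sample record are hypothetical, since no dataset schema has been defined here. Whether such a subset could actually be released under CC0 is a licensing question that the code does not settle.

```python
# Hypothetical field names; the project has not defined a dataset schema.
REDUCED_FIELDS = ("rev_id", "score", "timestamp", "username")


def reduce_record(record, keep=REDUCED_FIELDS):
    """Return a copy of `record` holding only the fields named in `keep`."""
    return {key: record[key] for key in keep if key in record}


def reduce_dataset(records, keep=REDUCED_FIELDS):
    """Apply reduce_record to every record in an iterable."""
    return [reduce_record(record, keep) for record in records]


full = [
    {
        "rev_id": 1001,
        "score": 0.92,
        "timestamp": "2014-11-16T13:47:00Z",
        "username": "ExampleUser",
        "text": "Revision text that stays under its original license.",
    }
]
reduced = reduce_dataset(full)
```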


This may be too detailed to discuss at this phase, but I just wondered: is there any idea on how to implement (or use implementations of) tokenization in different languages? Some languages have word spacing while others (Chinese, Japanese, etc) don't. Even when they have word spacing, you might want to split some long words into components (e.g. long nouns in German, composed of shorter nouns). I am sure there are ready-to-use tools for well-studied languages (such as en and de, I'm not too sure about az and tr), but when considering freely licensed ones only, your choice might have to be limited. A character-level n-gram tokenization might work as a language-independent fallback.

Furthermore, assuming you keep a suitable abstraction at the level of tokenizer and make it pluggable, I wonder if the system can be extended to support non-text content (such as data items of Wikidata, or images on Commons) with a reasonable amount of adaptation. whym (talk) 09:08, 20 October 2014 (UTC)

For Turkish, there are tools such as the Snowball (tr) stemmer, which is licensed under the BSD license. Its license statement reads:
"All the software given out on this Snowball site is covered by the BSD License (see ), with Copyright (c) 2001, Dr Martin Porter, and (for the Java developments) Copyright (c) 2002, Richard Boulton.
Essentially, all this means is that you can do what you like with the code, except claim another Copyright for it, or claim that it is issued under a different license. The software is also issued without warranties, which means that if anyone suffers through its use, they cannot come back and sue you. You also have to alert anyone to whom you give the Snowball software to the fact that it is covered by the BSD license.
We have not bothered to insert the licensing arrangement into the text of the Snowball software."
I am unsure if that is "free enough", but if needed, explicit permission can be requested.
Expanding to Unicode languages (Chinese, Japanese, Korean, Malay, Thai, Arabic, etc.) would be an interesting expansion of this project at a later point. If it is successful, I feel this project should by no means be confined to the languages mentioned in the proposal. It is, however, important to stress test the code with languages the programmers are more familiar with, to better judge how successful it really is.
-- とある白い猫 chi? 19:29, 5 November 2014 (UTC)
Thanks for your response. I'm sure that the BSD license is acceptable, as Labs accepts all licenses approved by OSI. I agree that it would be better to focus first on the languages familiar to the programmers. An adequate abstraction to allow future expansion would be good enough, I suppose. whym (talk) 02:08, 8 November 2014 (UTC)
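
The pluggable tokenizer with a character-level n-gram fallback suggested in this thread could look roughly like the sketch below. Everything here is an assumption for illustration: the registry, the function names, the choice of trigrams, and mapping `en` and `de` to a plain word splitter. A real German tokenizer might also need the compound splitting mentioned above.

```python
import re


def word_tokenizer(text):
    """Split on non-word characters; suits languages with word spacing."""
    return re.findall(r"\w+", text.lower())


def char_ngram_tokenizer(text, n=3):
    """Language-independent fallback: overlapping character n-grams."""
    normalized = " ".join(text.lower().split())  # collapse runs of whitespace
    if len(normalized) < n:
        return [normalized] if normalized else []
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


# Hypothetical registry: languages with a dedicated tokenizer plugged in.
TOKENIZERS = {"en": word_tokenizer, "de": word_tokenizer}


def tokenize(text, lang):
    """Use the registered tokenizer for `lang`, else fall back to n-grams."""
    tokenizer = TOKENIZERS.get(lang, char_ngram_tokenizer)
    return tokenizer(text)


english_tokens = tokenize("Hello, World", "en")
japanese_tokens = tokenize("日本語の文", "ja")
```

Support for another language then amounts to registering a function in `TOKENIZERS`. The same lookup could in principle dispatch on content type rather than language, though that goes beyond what this sketch shows.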

Technical responsibilities

Could you please clarify how the technical work will be shared among the three people named in this proposal? GitHub commits seem to suggest that EpochFail has been the main contributor. Will this continue to be so, despite his volunteer position here? If the plan is that とある白い猫 and He7d3r will take on more of the technical work, some pointers to their previous work would help the IEG review. I can see that en:User:EpochFail nicely summarizes his (volunteer) work, but I couldn't get such information easily from User:とある白い猫 and User:He7d3r's userpages. whym (talk) 03:21, 22 October 2014 (UTC)

Although I haven't done much in that GitHub repo so far, I've been reviewing related things, such as the stemming code used in nltk, and working on a few gists to build the list of badwords in Portuguese. As for previous work, I'm active on WMF wikis as a maintainer of user scripts and gadgets (look for .js on my global contribs). Helder 21:46, 22 October 2014 (UTC)
I think the bulk of the technical work will be shared among all three of us. While I have contributed to various Wikimedia sites for nearly a decade now, my technical work on AI thus far has been mostly academic. I programmed the original anti-vandalism IRC bots used by CVU/CVN, but I passed the torch for that to other developers quite some time ago. I intend to focus on tasks related to algorithms, but depending on the feel of things I may get involved with more technical tasks as needed/required. -- とある白い猫 chi? 19:37, 5 November 2014 (UTC)
I started writing code for this project before the IEG proposal, but Helder and とある白い猫 have already contributed substantially to managing technical concerns. E.g. your concerns about stemming above have been a focal point of our recent discussions and tests. Don't let the github commit log fool you. :) --EpochFail (talk) 23:20, 6 November 2014 (UTC)
Thanks for clarifying these, and thank you for what appear to be long-standing technical contributions, He7d3r and とある白い猫. :) whym (talk) 02:08, 8 November 2014 (UTC)

"provide us with a random sample of hand-coded revisions (as damaging vs. not-damaging)" - Gesichtete Versionen?

If a new wiki-language community wants to have access to scores, we'd ask them to provide us with a random 
sample of hand-coded revisions (as damaging vs. not-damaging) from which we can train/test new models.

Isn't that what the flaggedrevisions extension provides, for dozens of wikis, for years? The reviewing users decide: accept or undo new revision. Huge samples in Polish, Finnish, German, Russian, Arabic, Turkish etc. Did I miss something, or shouldn't this be mentioned/explored in the proposal? --Atlasowa (talk) 12:59, 12 November 2014 (UTC)

I believe the emphasis should be on random sample, which is used to train the machine learning models so that these things can work well. Helder 19:09, 12 November 2014 (UTC)
+1 Random sample is essential to make sure that the classifier will be as accurate as possible. We'll certainly make use of flagged revisions and other implicit signals in testing, but I think it is important that we train with the best data available when standing up a production system. We're currently discussing ways to make it easier to produce a labelled dataset. See Research_talk:Revision_scoring_as_a_service#Revision_handcoding_.28mockups.29 --EpochFail (talk) 15:47, 14 November 2014 (UTC)
Handcoder home (mock)
"...train with the best data available...", but which isn't available currently, the random sample. Looking at the mockups, i read "English Wikipedia 2014 - 10k sample" - is that the scale we are talking about? 10.000 ratings x 2 different hand-coders = 20.000 ratings = how many volunteer hours? I think german WP does ~60.000 Sichtungen every month by ~3.000 different reviewing users, so this project is asking for a third of our monthly workload, additionally? Did i get this right? This service for potential tools seems quite expensive for a wiki-language community? ;-P --Atlasowa (talk) 15:07, 15 November 2014 (UTC)
Hey Atlasowa. First, 10k may be many more observations than we need. This is something we'll find out once we generate the enwiki sample. However, I have personally hand-coded large samples[6] and I've organized other large-scale hand-codings in the past (see Research:Article_feedback/Stage_2/Quality_assessment & Research:Article_feedback/Stage_2/Quality_assessment), so I have a good sense of how much work is involved. Splitting 10k evaluations among a small group isn't too bad -- especially since you get something immediately valuable for your work. In the past, the only benefit was a scientific result and we never really had trouble recruiting participants.
Now, you say that German WP only has about 60k sightings by 3 users (180k then?). Dewiki saw about 854k revisions in October of this year[7]. That would mean that the other 674k revisions to dewiki went unchecked. --EpochFail (talk) 14:40, 18 November 2014 (UTC)
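
To make the scale discussed in this thread concrete, here is a minimal sketch of drawing a uniform random sample of revisions and assigning each one to two different hand-coders. The function names, the fixed seed, the coder names, and the small revision ID range are illustrative assumptions. The real sample size and the number of coders per revision are still open questions in the discussion above.

```python
import random


def draw_sample(rev_ids, sample_size, seed=0):
    """Draw a reproducible uniform random sample of revision IDs."""
    rng = random.Random(seed)
    return rng.sample(list(rev_ids), sample_size)


def assign_coders(sample, coders, per_revision=2, seed=0):
    """Give each sampled revision `per_revision` distinct coders."""
    rng = random.Random(seed)
    return {rev_id: rng.sample(coders, per_revision) for rev_id in sample}


def workload(assignments):
    """Count how many ratings each coder is asked to provide."""
    counts = {}
    for assigned in assignments.values():
        for coder in assigned:
            counts[coder] = counts.get(coder, 0) + 1
    return counts


# Hypothetical numbers: 100 of 1,000 revisions, double-coded by 5 volunteers.
coders = ["coder_1", "coder_2", "coder_3", "coder_4", "coder_5"]
sample = draw_sample(range(1000, 2000), sample_size=100)
assignments = assign_coders(sample, coders)
ratings_per_coder = workload(assignments)
total_ratings = sum(ratings_per_coder.values())
```

Double coding doubles the number of ratings (here 100 revisions give 200 ratings, just as 10k revisions would give 20k), which is the cost side of the trade-off raised above.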

Aggregated feedback from the committee for Revision scoring as a service

Scoring criteria (see the rubric for background), each scored from 1 (weak alignment) to 10 (strong alignment)
(A) Impact potential
  • Does it fit with Wikimedia's strategic priorities?
  • Does it have potential for online impact?
  • Can it be sustained, scaled, or adapted elsewhere after the grant ends?
(B) Innovation and learning
  • Does it take an innovative approach to solving a key problem?
  • Is the potential impact greater than the risks?
  • Can we measure success?
(C) Ability to execute
  • Can the scope be accomplished in 6 months?
  • How realistic/efficient is the budget?
  • Do the participants have the necessary skills/experience?
(D) Community engagement
  • Does it have a specific target community and plan to engage it often?
  • Does it have community support?
  • Does it support diversity?
Comments from the committee:
  • It is a wise idea to separate the scoring engine and user-facing tools and save tool developers' time overall.
  • Experiences from the community engagement work in this project could inform adoption strategies for other kinds of tools that are under-utilized.
  • In terms of diversity, focusing on the two non-English projects is probably a good idea as the first step.
  • Would like to see more communities actively engaged with this tool and related tools that support editing, perhaps as a subsequent project.
  • This could potentially be a great way to find out more about existing editors as well as newbies and vandals
  • Love the idea of harvesting information that gets collected anyway through existing workflows
  • Looks like a nice team is already on board with this idea
  • Lots of endorsers

Thank you for submitting this proposal. The committee is now deliberating based on these scoring results, and WMF is proceeding with its due-diligence. You are welcome to continue making updates to your proposal pages during this period. Funding decisions will be announced by early December. -- ΛΧΣ21 16:53, 13 November 2014 (UTC)

Round 2 2014 decision

Congratulations! Your proposal has been selected for an Individual Engagement Grant.

The committee has recommended this proposal and WMF has approved funding for the full amount of your request, $16,875.

Comments regarding this decision:
We appreciate seeing the talented technical team, strong advisors, and solid community engagement working under a clear need from multilingual users waiting for the service. Looking forward to seeing your focus on the hand-coding and volunteer recruitment up front to best manage the timeline.

Next steps:

  1. You will be contacted to sign a grant agreement and set up a monthly check-in schedule.
  2. Review the information for grantees.
  3. Use the new buttons on your original proposal to create your project pages.
  4. Start work on your project!
Questions? Contact us.

--Siko (WMF) (talk) 18:24, 5 December 2014 (UTC)

Thanks to WMF for this decision, which I support too. It'll be great to have powerful anti-vandalism tools available in wikis other than This grant will make it possible to advance in this direction. Automation has a high rate of return for communities, and anti-vandalism tools are among the most important. This is a giant leap for our projects around the world. Kim richard (talk) 23:20, 7 December 2014 (UTC)