This looks like hot or not for edits. It seems like we could do something interesting here. However, it would be hard to make the edits easy to compare. Still, I wonder if we could get some decent signal for the desirability of edits from it. We might use something like Elo to aggregate ratings. --EpochFail (talk) 01:48, 26 September 2014 (UTC)
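A minimal sketch of how pairwise "which edit is better?" judgments could be folded into Elo-style ratings. The function names, the starting rating of 1500, the K-factor of 32 and the revision IDs are all illustrative assumptions, not part of any existing tool.

```python
from collections import defaultdict

K = 32  # assumed update step; larger values react faster to each comparison


def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(ratings, winner, loser, k=K):
    """Apply one pairwise judgment: `winner` was preferred over `loser`."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - exp_win)
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical revision IDs; every edit starts at the same assumed rating.
ratings = defaultdict(lambda: 1500.0)
judgments = [("rev_a", "rev_b"), ("rev_a", "rev_c"), ("rev_c", "rev_b")]
for winner, loser in judgments:
    update(ratings, winner, loser)

# Edits ordered from most to least preferred.
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

Because each update adds to one rating exactly what it subtracts from the other, the total stays constant, so ratings remain comparable as more judgments arrive.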
An overview which neglects to mention RC patrol leaves, IMHO, much to be desired. Cf. . --Nemo 14:45, 1 October 2014 (UTC)
I'm not sure if we are neglecting anything. We're really building infrastructure here for tool developers. The ability to score revisions by the likelihood that they are damaging is critical for quality control tools -- many of which may be used for RC patrol. We want to make constructing such quality control tools trivial. Ideally, one should be able to write a gadget that queries the service we are planning to implement.
However, if you are suggesting that we should contact RC patrollers and counter-vandalism tool devs, then I agree. You are right and we could use your help. --EpochFail (talk) 14:51, 1 October 2014 (UTC)
Hi. Which algorithms are used to obtain these scorings? I'm really interested in these.
And for completeness: I wrote a scoring script for the German Wikipedia nearly eight years ago. It can be found at http://tools.wmflabs.org/ipp/. The score is only generated for edits by IPs. A spam probability is generated using a very simple naive Bayes approach. It is trained automatically by looking at new articles created by IPs. If an article is deleted within seven days (speedy deletion), the words within are learned as spam; if it still exists after seven days, the words are learned as ham. Over the years, nearly 790 000 articles created by IPs were learned with 78 million words (2.9 million different "words"). For example, the word "fuck" was used 12388 times, and its spam probability is 98.6%. The word "und" (and) was used 1.7 million times, and its spam probability is "only" 60.4%. Maybe this word database is useful for adapting other tools for the German Wikipedia. --APPER (talk) 13:48, 3 October 2014 (UTC)
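A rough sketch of the kind of word-count training described above. The actual script's formula is not given in this thread, so the add-one smoothing, the equal spam/ham prior, the whitespace tokenization and all function names here are assumptions, and the training texts are invented toy data.

```python
import math
from collections import Counter

spam_counts = Counter()  # word -> occurrences in articles deleted within 7 days
ham_counts = Counter()   # word -> occurrences in articles that survived 7 days


def learn(text, deleted):
    """Add the words of one article to the spam or ham counts."""
    target = spam_counts if deleted else ham_counts
    target.update(text.lower().split())


def word_spam_probability(word):
    """Share of a word's occurrences seen in deleted articles (add-one smoothed)."""
    s = spam_counts[word]
    h = ham_counts[word]
    return (s + 1) / (s + h + 2)


def text_spam_probability(text):
    """Combine per-word probabilities naively, assuming word independence."""
    log_odds = 0.0
    for word in text.lower().split():
        p = word_spam_probability(word)
        log_odds += math.log(p) - math.log(1.0 - p)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Toy training data, invented for illustration only.
learn("buy cheap pills now", deleted=True)
learn("cheap pills cheap offer", deleted=True)
learn("the river flows through the town", deleted=False)
learn("the town has a cheap market", deleted=False)

spammy = text_spam_probability("cheap pills")
neutral = text_spam_probability("the town")
```

A word that was never seen gets a probability of 0.5 under this smoothing, so it neither raises nor lowers the score of a text.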
Hi APPER. I'm currently working from the academic lit. -- mostly User:West.andrew.g's work. See  for a list of features that he's developed for WP:STiki's classifier. I'm planning to run my first tests with a linear SVM classifier, but I'll experiment with a few others too. I'd love to make use of your badwords database. That is one of the highly manual parts of building new classifiers that is difficult for non-native speakers (such as myself). One of our early goals in this project is to gather such badwords lists.
Given your background in this area, I'd also be interested in having you stick around as an advisor or volunteer if you have the time. :) --EpochFail (talk) 22:29, 3 October 2014 (UTC)
Thanks for the link to the paper. I can dump the word database for you or I can grant you read access on tool labs. --APPER (talk) 11:17, 4 October 2014 (UTC)
How would you make sure the datasets produced in this project will be reusable? You might want to make sure the datasets can be CC0. To keep them reusable in the longer term, you might want to include text in the datasets, not just IDs of revisions, which could be deleted or suppressed. In that case, the licensing of the datasets could be a little bit more complicated, though. whym (talk) 02:10, 18 October 2014 (UTC)
One of the goals of this project is just that: to unify current (and possibly future) scoring, allowing reusability. As for licensing, I think sticking to CC-BY-SA may be more sound for the reasons you have mentioned, as it would be compatible with the most restrictive licensing (CC-BY-SA) used on parts of the data. -- とある白い猫chi? 19:18, 5 November 2014 (UTC)
Thanks for confirming, とある白い猫. A CC-BY-SA licensed one makes sense. I wonder if it is feasible/worthwhile to create a CC0-licensed reduced version without text. It might attract more use cases that don't require text (such as network analysis), since it is significantly more permissive than BY-SA. whym (talk) 02:08, 8 November 2014 (UTC)
On second thought, network analysis is probably not the best example. A more realistic one would be to provide revision IDs, scoring results and metadata (timestamp, username, etc.) for some sort of trend analysis. whym (talk) 13:47, 16 November 2014 (UTC)
This may be too detailed to discuss at this phase, but I just wondered: is there any idea on how to implement (or use implementations of) tokenization in different languages? Some languages have word spacing while others (Chinese, Japanese, etc) don't. Even when they have word spacing, you might want to split some long words into components (e.g. long nouns in German, composed of shorter nouns). I am sure there are ready-to-use tools for well-studied languages (such as en and de, I'm not too sure about az and tr), but when considering freely licensed ones only, your choice might have to be limited. A character-level n-gram tokenization might work as a language-independent fallback.
Furthermore, assuming you keep a suitable abstraction at the level of tokenizer and make it pluggable, I wonder if the system can be extended to support non-text content (such as data items of Wikidata, or images on Commons) with a reasonable amount of adaptation. whym (talk) 09:08, 20 October 2014 (UTC)
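To make the idea of a pluggable tokenizer with a character-level n-gram fallback concrete, here is a minimal sketch. The registry, the function names and the choice of n=3 are hypothetical; a real setup would presumably register language-specific tokenizers (for example ones that split German compounds) in the same way.

```python
import re


def whitespace_tokenize(text):
    """Split on runs of non-word characters; suits languages with word spacing."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def char_ngram_tokenize(text, n=3):
    """Language-independent fallback: overlapping character n-grams."""
    chars = "".join(text.lower().split())  # drop whitespace before slicing
    if len(chars) < n:
        return [chars] if chars else []
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


# Hypothetical registry keyed by language code; unlisted languages fall back
# to character n-grams.
TOKENIZERS = {"en": whitespace_tokenize, "de": whitespace_tokenize}


def tokenize(text, lang):
    """Look up the tokenizer for `lang`, defaulting to character n-grams."""
    return TOKENIZERS.get(lang, char_ngram_tokenize)(text)
```

With this shape, `tokenize("Hello, world!", "en")` uses the whitespace splitter, while a language without an entry, such as `"ja"`, is cut into overlapping three-character pieces.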
"All the software given out on this Snowball site is covered by the BSD License (see http://www.opensource.org/licenses/bsd-license.html ), with Copyright (c) 2001, Dr Martin Porter, and (for the Java developments) Copyright (c) 2002, Richard Boulton. Essentially, all this means is that you can do what you like with the code, except claim another Copyright for it, or claim that it is issued under a different license. The software is also issued without warranties, which means that if anyone suffers through its use, they cannot come back and sue you. You also have to alert anyone to whom you give the Snowball software to the fact that it is covered by the BSD license. We have not bothered to insert the licensing arrangement into the text of the Snowball software."
I am unsure if that is "free enough" but if needed explicit permissions may be asked.
Expanding to Unicode languages (Chinese, Japanese, Korean, Malay, Thai, Arabic, etc.) would be an interesting expansion of this project at a later point. If successful, I feel this project should by no means be confined to the languages mentioned in the proposal. It is, however, important to stress-test the code with languages the programmers are more familiar with, to better judge how successful it really is.
Thanks for your response. I'm sure that the BSD license is acceptable, as Labs accepts all licenses approved by OSI. I agree that it would be better to focus first on the languages familiar to the programmers. An adequate abstraction to allow future expansion would be good enough, I suppose. whym (talk) 02:08, 8 November 2014 (UTC)
Could you please clarify how the technical work will be shared by the three people mentioned in this proposal? GitHub commits seem to suggest that EpochFail has been the main contributor. Will this continue to be so, despite his volunteer position here? If the plan is that とある白い猫 and He7d3r will undertake more of the technical work, some pointers to their previous work would help the IEG review. I can see en:User:EpochFail nicely summarizes his (volunteer) work, but I couldn't get such information easily from User:とある白い猫 and User:He7d3r's userpages. whym (talk) 03:21, 22 October 2014 (UTC)
I think the bulk of the technical work will be shared among all three of us. While I have contributed to various Wikimedia sites for nearly a decade now, my technical work on AI thus far has been mostly academic. I programmed the original anti-vandalism IRC bots used by CVU/CVN, but I passed the torch for that to other developers quite some time ago. I intend to focus on tasks related to algorithms, but depending on the feel of things I may get involved with more technical tasks as needed. -- とある白い猫chi? 19:37, 5 November 2014 (UTC)
I started writing code for this project before the IEG proposal, but Helder and とある白い猫 have already contributed substantially to managing technical concerns. E.g. your concerns about stemming above have been a focal point of our recent discussions and tests. Don't let the GitHub commit log fool you. :) --EpochFail (talk) 23:20, 6 November 2014 (UTC)
Thanks for clarifying these, and thank you for what appear to be long-standing technical contributions, He7d3r and とある白い猫. :) whym (talk) 02:08, 8 November 2014 (UTC)
"provide us with a random sample of hand-coded revisions (as damaging vs. not-damaging)" - Gesichtete Versionen?
If a new wiki-language community wants to have access to scores, we'd ask them to provide us with a random
sample of hand-coded revisions (as damaging vs. not-damaging) from which we can train/test new models.
Isn't that what the Flagged Revisions extension has provided, for dozens of wikis, for years? The reviewing users decide: accept or undo the new revision. Huge samples in Polish, Finnish, German, Russian, Arabic, Turkish, etc. Did I miss something, or shouldn't this be mentioned/explored in the proposal? --Atlasowa (talk) 12:59, 12 November 2014 (UTC)
I believe the emphasis should be on random sample, which is used to train the machine learning models so that these things can work well. Helder 19:09, 12 November 2014 (UTC)
+1. A random sample is essential to make sure that the classifier will be as accurate as possible. We'll certainly make use of flagged revisions and other implicit signals in testing, but I think it is important that we train with the best data available when standing up a production system. We're currently discussing ways to make it easier to produce a labelled dataset. See Research_talk:Revision_scoring_as_a_service#Revision_handcoding_.28mockups.29 --EpochFail (talk) 15:47, 14 November 2014 (UTC)
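As an illustration of what drawing a random sample for hand-coding could look like, here is a small sketch. The revision IDs, coder names, sample size and function names are made up for the example; it only shows uniform sampling without replacement and a simple way to divide the result among coders.

```python
import random


def sample_revisions(rev_ids, k, seed=0):
    """Draw k revisions uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return rng.sample(rev_ids, k)


def split_among_coders(sample, coders):
    """Deal the sample round-robin so each coder gets a near-equal share."""
    return {coder: sample[i::len(coders)] for i, coder in enumerate(coders)}


# Hypothetical revision IDs and coder names, for illustration only.
all_revisions = list(range(100000, 101000))
sample = sample_revisions(all_revisions, k=100)
assignments = split_among_coders(
    sample, ["coder_1", "coder_2", "coder_3", "coder_4"]
)
```

Sampling uniformly over all revisions is what distinguishes this from review logs, which only cover the revisions that reviewers happened to look at.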
"...train with the best data available...", but which isn't available currently, the random sample. Looking at the mockups, i read "English Wikipedia 2014 - 10k sample" - is that the scale we are talking about? 10.000 ratings x 2 different hand-coders = 20.000 ratings = how many volunteer hours? I think german WP does ~60.000 Sichtungen every month by ~3.000 different reviewing users, so this project is asking for a third of our monthly workload, additionally? Did i get this right? This service for potential tools seems quite expensive for a wiki-language community? ;-P --Atlasowa (talk) 15:07, 15 November 2014 (UTC)Reply[reply]
Hey Atlasowa. First, 10k may be many more observations than we need. This is something we'll find out once we generate the enwiki sample. However, I have personally hand-coded large samples and I've organized other large-scale hand-codings in the past (see Research:Article_feedback/Stage_2/Quality_assessment & Research:Article_feedback/Stage_2/Quality_assessment), so I have a good sense for how much work is involved. Splitting 10k evaluations between a small group isn't too bad -- especially since you get something immediately valuable for your work. In the past, the only benefit was a scientific result and we never really had trouble recruiting participants.
Now, you say that German WP only has about 60k sightings by 3 users (180k then?). Dewiki saw about 854k revisions in October of this year. That would mean that the other 674k revisions to dewiki went unchecked. --EpochFail (talk) 14:40, 18 November 2014 (UTC)
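The figures quoted in this exchange can be checked with a few lines of arithmetic. The sketch below only restates the numbers given above (taking the 180k reading at face value); the variable names and the `volunteer_hours` helper are assumptions added for illustration, and no time per rating is given anywhere in the thread.

```python
sample_size = 10_000             # revisions in the mocked-up enwiki sample
coders_per_revision = 2          # each revision rated by two hand-coders
monthly_reviews_dewiki = 60_000  # approximate monthly reviews quoted above

total_ratings = sample_size * coders_per_revision
share_of_month = total_ratings / monthly_reviews_dewiki  # about one third

# The subtraction from the reply above.
dewiki_revisions_october = 854_000
checked_if_180k = 180_000
unchecked = dewiki_revisions_october - checked_if_180k


def volunteer_hours(ratings, seconds_per_rating):
    """Convert a rating count into hours for an assumed time per rating."""
    return ratings * seconds_per_rating / 3600
```

Calling `volunteer_hours(total_ratings, seconds_per_rating)` with a measured time per rating would turn the 20,000 ratings into an hours estimate; any value plugged in without measurement is a guess.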
Aggregated feedback from the committee for Revision scoring as a service
Does it fit with Wikimedia's strategic priorities?
Does it have potential for online impact?
Can it be sustained, scaled, or adapted elsewhere after the grant ends?
(B) Innovation and learning
Does it take an innovative approach to solving a key problem?
Is the potential impact greater than the risks?
Can we measure success?
(C) Ability to execute
Can the scope be accomplished in 6 months?
How realistic/efficient is the budget?
Do the participants have the necessary skills/experience?
(D) Community engagement
Does it have a specific target community and plan to engage it often?
Does it have community support?
Does it support diversity?
Comments from the committee:
It is a wise idea to separate the scoring engine and user-facing tools and save tool developers' time overall.
Experiences from the community engagement work in this project could inform adoption strategies of other kinds of tool that are under-utilized.
In terms of diversity, focusing on the two non-English projects is probably a good idea as a first step.
Would like to see more communities actively engaged with this tool and related tools that support editing, perhaps as a subsequent project.
This could potentially be a great way to find out more about existing editors as well as newbies and vandals
Love the idea of harvesting information that gets collected anyway through existing workflows
Looks like a nice team is already on board with this idea
Lots of endorsers
Thank you for submitting this proposal. The committee is now deliberating based on these scoring results, and WMF is proceeding with its due-diligence. You are welcome to continue making updates to your proposal pages during this period. Funding decisions will be announced by early December. — ΛΧΣ21 16:53, 13 November 2014 (UTC)
Round 2 2014 decision
Congratulations! Your proposal has been selected for an Individual Engagement Grant.
The committee has recommended this proposal and WMF has approved funding for the full amount of your request, $16,875.
Comments regarding this decision: We appreciate seeing the talented technical team, strong advisors, and solid community engagement working under a clear need from multilingual users waiting for the service. Looking forward to seeing your focus on the hand-coding and volunteer recruitment up front to best manage the timeline.
You will be contacted to sign a grant agreement and set up a monthly check-in schedule.
Thanks to WMF for this decision, which I support too. It'll be great to have powerful anti-vandalism tools available on wikis other than en.wiki. This grant will make it possible to advance in this direction. Automation has a high rate of return for communities, and anti-vandalism tools are among the most important. This is a giant leap for our projects around the world. Kim richard (talk) 23:20, 7 December 2014 (UTC)