Research talk:Revision scoring as a service/2015

From Meta, a Wikimedia project coordination wiki

Revision handcoding (mockups)

Hey folks. I got some mockups together of a hand-coding gadget interface. What do you guys think?

Revision handcoder (mock).svg
Coder interface. 
Handcoder home. A mockup of a revision handcoder home interface is presented within Special:UserContribs.

--EpochFail (talk) 19:45, 18 October 2014 (UTC)

I made some updates to the mocks and worked on a generalizable configuration strategy. I propose something like this for a campaign:
YAML campaign configuration
name: Revision Coding -- English Wikipedia 2014 10k sample
source: enwiki 2014 revisions -- 10k random sample
author:
    name: Aaron Halfaker
    email: aaron.halfaker@gmail.com

coder:
    class: revcoding.coders.RevisionDiff
    form:
        fields:
            - damaging
            - good-faith

fields:
    damaging:
        class: revcoding.ui.RadioButtons
        label: Damaging?
        help: Did this edit cause damage to the article?
        options:
            -
                label: "yes"
                value: "yes"
                tooltip: Yes, this edit is damaging and should be reverted.
            -
                label: "no"
                value: "no"
                tooltip: >
                         No, this edit is not damaging and should not be
                         reverted.
            -
                label: unsure
                value: unsure
                tooltip: >
                         It's not clear whether this edit damages the article or
                         not.
    good-faith:
        class: revcoding.ui.RadioButtons
        label: Good faith?
        help: >
              Does it appear as though the author of this edit was
              trying to contribute productively?
        options:
            -
                label: "yes"
                value: "yes"
                tooltip: Yes, this edit appears to have been made in good-faith.
            -
                label: "no"
                value: "no"
                tooltip: No, this edit appears to have been made in bad-faith.
            -
                label: unsure
                value: unsure
                tooltip: >
                         It's not clear whether or not this edit was made in
                         good-faith.
A server running in WMF Labs would make sources of rev_ids available. The above configuration describes a campaign. The gadget running in the user's browser will have a hard-coded campaign list page (e.g. en:User:EpochFail/Revcoding/CampiagnList.js). The campaigns listed there will appear in gadget users' Special:UserContribs page. The WMF Labs server will be responsible for (1) delivering the campaign definition (described above) and (2) tracking and delivering work sets and accepting submissions for them. --EpochFail (talk) 17:48, 19 October 2014 (UTC)
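The two server responsibilities described above can be sketched in a few lines of Python. This is only a minimal in-memory sketch under stated assumptions, not the project's implementation: the class name CampaignServer, its method names, the workset size and the user name are all hypothetical, and a real server would persist its state and identify users properly.

```python
import random


class CampaignServer:
    """In-memory stand-in for the Labs server (hypothetical API)."""

    def __init__(self, definition, rev_ids, workset_size=3, seed=0):
        self.definition = definition          # parsed campaign configuration
        self.pending = list(rev_ids)          # rev_ids not yet assigned
        random.Random(seed).shuffle(self.pending)
        self.workset_size = workset_size
        self.assigned = {}                    # user -> rev_ids awaiting labels
        self.submissions = []                 # (user, rev_id, labels)

    def get_definition(self):
        """(1) Deliver the campaign definition to the gadget."""
        return self.definition

    def get_workset(self, user):
        """(2a) Hand out the user's open work set, or assign a new one."""
        if not self.assigned.get(user):
            batch = self.pending[:self.workset_size]
            self.pending = self.pending[self.workset_size:]
            self.assigned[user] = batch
        return list(self.assigned[user])

    def submit(self, user, rev_id, labels):
        """(2b) Accept a coding for a rev_id in the user's work set."""
        if rev_id not in self.assigned.get(user, []):
            raise ValueError("rev_id %r is not assigned to %r" % (rev_id, user))
        self.assigned[user].remove(rev_id)
        self.submissions.append((user, rev_id, dict(labels)))


server = CampaignServer({"name": "demo campaign"}, rev_ids=range(1, 8))
workset = server.get_workset("ExampleUser")
server.submit("ExampleUser", workset[0], {"damaging": "no", "good-faith": "yes"})
print(len(server.get_workset("ExampleUser")))  # two rev_ids left in the set
```

Keeping the open work set per user means a reload of the gadget gets the same revisions back instead of a fresh batch.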
Decided to hack together a quick diagram.
Server architecture. The server architecture for the revscores system is presented.
--EpochFail (talk) 18:01, 19 October 2014 (UTC)
I've realized a problem. When a user requests or submits a coding, how does the server know who they are? I wonder if we can get an oauth handshake in here somehow. If we open a popup window to the server that performs the oauth handshake and sets up a session with the user's browser, then subsequent requests will be identifiable. So... that means that a logged-in Wikipedia editor could be a logged-out Revcoder. Here's what that might look like:
Handcoder home (logged out). A mockup of the handcoder home while logged out.
--EpochFail (talk) 18:16, 19 October 2014 (UTC)
@EpochFail: newbie question: how/where do we use a YAML file like this? Helder 22:34, 20 October 2014 (UTC)
Good Q. In the past, I have designed configuration strategies that build forms. See this one I use in en:WP:Snuggle: [1] (look for "user_actions:", it corresponds to Media:Snuggle.UI.user_menu.png). We'll have to write a form interpreter ourselves, but that's not too difficult. --EpochFail (talk) 23:14, 20 October 2014 (UTC)
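One piece of such a form interpreter could look like the sketch below. It assumes the YAML campaign configuration has already been parsed into a Python dict (abbreviated here), and it only checks a submitted coding against the options each field declares. The function names allowed_values and validate_submission are hypothetical.

```python
# Parsed campaign configuration, abbreviated from the YAML above.
CAMPAIGN = {
    "coder": {"form": {"fields": ["damaging", "good-faith"]}},
    "fields": {
        "damaging": {
            "class": "revcoding.ui.RadioButtons",
            "label": "Damaging?",
            "options": [{"label": "yes", "value": "yes"},
                        {"label": "no", "value": "no"},
                        {"label": "unsure", "value": "unsure"}],
        },
        "good-faith": {
            "class": "revcoding.ui.RadioButtons",
            "label": "Good faith?",
            "options": [{"label": "yes", "value": "yes"},
                        {"label": "no", "value": "no"},
                        {"label": "unsure", "value": "unsure"}],
        },
    },
}


def allowed_values(campaign):
    """Map each field on the form to the set of values it accepts."""
    return {
        name: {opt["value"] for opt in campaign["fields"][name]["options"]}
        for name in campaign["coder"]["form"]["fields"]
    }


def validate_submission(campaign, submission):
    """Return a list of problems; an empty list means the coding is valid."""
    allowed = allowed_values(campaign)
    problems = []
    for name, values in allowed.items():
        if name not in submission:
            problems.append("missing field: " + name)
        elif submission[name] not in values:
            problems.append("invalid value for %s: %r" % (name, submission[name]))
    for name in submission:
        if name not in allowed:
            problems.append("unknown field: " + name)
    return problems


print(validate_submission(CAMPAIGN, {"damaging": "no", "good-faith": "yes"}))  # []
print(validate_submission(CAMPAIGN, {"damaging": "maybe"}))
```

Because the accepted values are read from the configuration, adding a field or an option to a campaign would not require touching the validation code.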
A few ideas:
  1. It might be worth allowing users to add a note about a specific revision when reviewing it, mainly when the user is "unsure" about the correct label.
  2. Maybe we could save a click for each review by not having a submit button? Then, when the user clicks to select the second label, the review is also submitted to the system.
  3. Keyboard bindings/shortcuts
    Shortcuts :) let's put some emacs C-J-W-1-2-3 bindings --Jonas AGX (talk) 00:44, 17 November 2014 (UTC)
  4. Done: Use colors in the vertical bars to indicate whether the revision is damaging or not, good-faith or not, etc. (the bottom half of the bar could be used for one feature and the upper half for the other)
    • This was implemented in the gadget by splitting each vertical bar in blocks (one for each field).
  5. Not done: Move the "unsure" button to the middle (yes, unsure, no), so the ordering of the "scale" is more intuitive (+1, 0, -1)
    • This does not scale to non-binary things (e.g. article quality class). However, moving the unsure option into a separate checkbox makes sense, as it requires the user to make a best guess while still signalling that there are doubts. Helder 18:27, 30 January 2015 (UTC)
Helder 18:52, 16 November 2014 (UTC)
A mockup of a revision diff hand-coder interface.
Here is a screenshot of the interface as implemented in the gadget. Some notes:
  1. Each vertical rectangle corresponds to a revision, and it is split into boxes, where the boxes in the first row correspond to "damaging" and those in the second row correspond to "good-faith".
  2. If new fields are added to the spec:
    • New boxes are added below the two current boxes of each revision
    • New styles (e.g. colors) need to be added manually to the CSS file.
      • Maybe it is a good idea to provide a default set of 10 colors which would be used for, say, the 10 first options of a field.
  3. I'm treating "unsure" as being different from "not evaluated yet", and I assume "unsure" would be stored as an actual value in the database
  4. The workset wraps automatically when the browser window is too small.
  5. I exemplified how to get a dataset of revids from recent changes
  6. The buttons do not do anything yet, but they would use CORS to make API calls to e.g. Danilo.mac's prototype on Labs, to store the values provided by a user.
    • Update: The submit button updates the progress bar with the values for the current diff (selected using the other buttons) and in the future it would use CORS to make API calls to e.g. Danilo's prototype on Labs, to store the values provided by a user. Helder 01:05, 14 January 2015 (UTC)
    • Update 2: The submit button updates the progress bar with the values for the current diff (selected using the other buttons) and uses jsonp to make API calls to Danilo's prototype on Labs, to store the values provided by the user. Helder 18:20, 22 January 2015 (UTC)
This looks very good to me. I appreciate that you are thinking about how the visualization will extend for more fields. I'm worried about entering into some crazy visuals if someone adds a lot of fields to the form, but then again, they probably shouldn't have that many fields. --EpochFail (talk) 17:24, 20 January 2015 (UTC)
@EpochFail: Yeah, if there are too many fields, the users will also have a hard time evaluating each revision. Helder 18:22, 22 January 2015 (UTC)
Is it safe to openly publish the quality ratings we are going to collect? I mean, while we learn how vandals work, those vandals could learn (by reading such datasets) how we learn from them and find new ways to hide in the ocean of approved revisions. OK, it's much more a sci-fi question than a practical issue. --Jonas AGX (talk) 00:44, 17 November 2014 (UTC)
@Jonas AGX: I think yes. It is common knowledge what vandalism is, and I don't think vandals will change anything in their behaviour just because they know that we consider their actions disruptive. Helder 18:28, 22 January 2015 (UTC)

@Jonas AGX, EpochFail, とある白い猫: I made a first draft of a gadget based on these mockups. It is the first one available on testwiki:Special:Preferences#mw-prefsection-gadgets and its result can be seen in a diff page such as testwiki:Special:Diff/219084. Helder 20:47, 24 November 2014 (UTC)

Is somebody developing something for the database? I can try to make an API in toollabs:ptwikis to collect the data sent by the gadget in a database. I have already made a tool that registers data: it is used for voting on the latest WLE photos, and the votes are registered in a database, but it doesn't use OAuth. I still have to learn how to use OAuth, and as ptwikis is a tool for Portuguese projects I will initially make this only for ptwiki, OK? (sorry for the bad English) Danilo.mac talk 02:47, 26 November 2014 (UTC)
halfak and gwicke were talking about something like this (I think) yesterday / today on #wikimedia-research. Maybe they have something to add here? Helder 16:17, 26 November 2014 (UTC)
Just to explain better, I'm not trying to make something definitive; it is just for tests. I have made this tool that shows data saved using this API. Danilo.mac talk 16:53, 27 November 2014 (UTC)
I don't think we want to use RESTBase (the stuff I was discussing with Gwicke) to store our training set data. We'll probably want to maintain our own system, and the testing that Danilo.mac is doing is helping us get that up and running. Do you guys have any design docs put together yet? I have some mockups that I'd like to share. --EpochFail (talk) 17:24, 20 January 2015 (UTC)
@EpochFail: Nope. Feel free to share your mockups. Helder 18:30, 22 January 2015 (UTC)
Shouldn't we use radio buttons (see the WMF living styleguide) instead of button groups? With radio buttons, we could just use the class "mw-ui-radio" and the selected option in each group would be styled automatically, while with buttons we would need to define some new class with styles for the selected buttons. Helder 19:54, 12 January 2015 (UTC)
Or, depending on the kinds of fields that we will have, checkboxes instead of radio buttons, to allow for multiple values (tags) for a single revision... Helder 23:03, 12 January 2015 (UTC)
+1 for following whatever standards exist in MediaWiki. Otherwise, I'd like to optimize for usability. Radio buttons are small and hard to click. We can also surround the radio with clickable space. Something like this: [( ) Label ] vs [(*) Label ]. --EpochFail (talk) 17:24, 20 January 2015 (UTC)
I don't see the difference in the example [( ) Label ] vs [(*) Label ], but I usually make the label of radio buttons clickable to make it easier to select an option.
Looks like there are two living style guides, the other one being about OOjs UI. This one has support for Button selects and options. Helder 12:33, 3 February 2015 (UTC)
I ask this having in mind the mockup you have on your Google Drive (Revision scoring › coding › Revision handcoder), showing edit types. How would that be stored in the database? What if someone decides to add new options to a field like these? Helder 18:40, 22 January 2015 (UTC)
@EpochFail: ^. Helder 18:34, 30 January 2015 (UTC)
It's this pattern that makes me want to use JSON for the form field. For example, we could have a form field that stores a list of values:
{"edit_type": ["copy", "refactor", "addition"]}
Or we could have a collection of booleans:
{"copy": true, "add_citation": false, "addition": true, "refactor": true, "removal": false}
As I mentioned in the related trello card [2], PostgreSQL's JSONB type supports this as well as querying and indexing of JSON elements -- though I suspect that we won't really want to index form data directly. --EpochFail (talk) 19:25, 30 January 2015 (UTC)
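To compare the two JSON shapes shown above, here is a small Python sketch that converts between them. The helper names list_to_flags and flags_to_list are hypothetical, and the option list is simply taken from the boolean example.

```python
import json

EDIT_TYPES = ["copy", "add_citation", "addition", "refactor", "removal"]


def list_to_flags(form_data, options=EDIT_TYPES):
    """{"edit_type": [...]} -> one boolean per known option."""
    selected = set(form_data["edit_type"])
    return {option: option in selected for option in options}


def flags_to_list(flags):
    """One boolean per option -> {"edit_type": [...]} (keeps key order)."""
    return {"edit_type": [name for name, on in flags.items() if on]}


stored = json.loads('{"edit_type": ["copy", "refactor", "addition"]}')
flags = list_to_flags(stored)
print(json.dumps(flags))
print(json.dumps(flags_to_list(flags)))
```

One difference worth noting: if a new option is added to the field later, rows stored in the list form stay valid as they are, while rows stored as booleans simply lack the new key and a reader has to decide whether a missing key means false or "not asked".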

Handcoder home

Revcoder home mockup. A mockup of the revcoder home is presented.

Hey folks, I created a new mockup for a home for revcoding work. Such an interface could give our volunteer handcoders a window into the system's labeled data needs and may provide easy access to the revision handcoder to add new data.

  • [propose] buttons would take users to the bug tracker to file a bug.
  • The "training data" histograms on the right would visually present the recency of available data for training classifiers. More recent data is better.
  • [add data] button would load up a random sample of recent revisions into the handcoder for the user to process

I imagine that we'd have a suite of admin tools that would allow us to train/test/deploy new classifiers from the web interface. --EpochFail (talk) 17:32, 20 January 2015 (UTC)

Cool! There could be a link to report bugs/make requests for the existing stemmers too (e.g. https://github.com/nltk/nltk/issues). Helder 18:54, 22 January 2015 (UTC)
I do like this a lot. I just have a minor point about the color blue in the screenshot. It's a tad bit too bright. A tad bit darker shade would be better, like the color of the lines in the graph. Is this a possibility? -- とある白い猫 chi? 21:03, 31 January 2015 (UTC)
I should just make my mockups be black and white. :P But seriously, I wouldn't mind pulling in someone with some visual design experience. In the meantime, I'll make sure you have access to the Google drawing to make changes on your own. --Halfak (WMF) (talk) 17:14, 1 February 2015 (UTC)

Halfak's approach

I'd like to throw in my two cents on the matter. After Halfak's explanation yesterday I find his approach quite sound. I think there is great benefit in having a gadget with several campaigns, each divided into tasks that expire if people sit on them. I think this approach suits our crowd-sourcing culture at Wikimedia projects better. After all, crowd-sourcing itself is a divide-and-conquer strategy to begin with. -- とある白い猫 chi? 09:04, 21 February 2015 (UTC)

I think that the thread you are looking for is Research talk:Revision scoring as a service/Coder. :P --Halfak (WMF) (talk) 16:37, 21 February 2015 (UTC)
I think you are probably right. -- とある白い猫 chi? 18:19, 21 February 2015 (UTC)

(Bad)Words as features

One thing I saw in the Machine Learning course was an application of a linear SVM to spam detection, where we took a list of ~2000 English words (the ones which appeared more than 100 times in a subset of the SpamAssassin Public Corpus) and used the presence/absence of each of their stems in an e-mail as a (binary) feature. So, given a word list of the form (foo, bar, baz, quux, ...) and an e-mail whose text is "Get a discount for bar now!", we would represent the e-mail by a vector (0, 1, 0, 0, ...) whose dimension is the number of words in our list, and which contains ones in the entries corresponding to each word found in the given e-mail. Then we used a set of 4000 examples to train an SVM and tested it on 1000 other examples, getting ~98% accuracy. After that, we also sorted our vocabulary by the weights learned by the model to get a list of the top predictors of spam.

In the context of vandalism detection, top predictors could be used to improve the lists of badwords used by abuse filters, Salebot and similar tools which do not use machine learning. In the specific case of the Salebot list, we could even use the learned weights to fine tune the weights used by the bot.

This approach differs from the one currently in use on Revision-Scoring, where we just count the number (and the proportion, etc) of badwords added in a revision. It is as if all words in the list had the same weight, which doesn't look quite right. Helder 13:10, 6 January 2015 (UTC)
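The difference between the two approaches can be illustrated with a short Python sketch. It is a simplified illustration only: the tokenizer skips stemming, the vocabulary is the toy list from the example above, and the weights are made-up numbers standing in for what a trained linear model might learn.

```python
import re


def tokenize(text):
    """Lowercase and split on non-letters (no stemming in this sketch)."""
    return re.findall(r"[a-z]+", text.lower())


def binary_features(text, vocabulary):
    """1 if the vocabulary word occurs in the text, else 0."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]


def weighted_score(features, weights):
    """Linear score: sum of the weights of the words that are present."""
    return sum(f * w for f, w in zip(features, weights))


vocabulary = ["foo", "bar", "baz", "quux"]
features = binary_features("Get a discount for bar now!", vocabulary)
print(features)  # [0, 1, 0, 0]

# Hypothetical weights, as a trained linear model might produce.
weights = [0.1, 2.5, -0.3, 0.0]
print(weighted_score(features, weights))  # 2.5
print(sum(features))                      # 1, the plain count
```

With a plain count every listed word contributes 1, whereas the weighted score lets a strong predictor such as "bar" in this toy example count for much more than a weak one.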


Papers

These are a few papers on vandalism detection which might be of interest to us:

  • Khoi-Nguyen Tran & Peter Christen. Cross Language Learning from Bots and Users to detect Vandalism on Wikipedia. 2014.
  • Santiago M. Mola Velasco. Wikipedia Vandalism Detection Through Machine Learning: Feature Review and New Proposals. 2012.
  • Jeffrey M. Rzeszotarski & Aniket Kittur. Learning from history: predicting reverted work at the word level in Wikipedia. 2012.
  • Andrew G. West & Insup Lee. Multilingual Vandalism Detection using Language-Independent & Ex Post Facto Evidence. 2011.
  • Kelly Y. Itakura & Charles LA Clarke. Using Dynamic Markov Compression to Detect Vandalism in the Wikipedia. 2009.
  • F. Gediz Aksit. Wikipedia Vandalism Detection using VandalSense 2.0. 2011.
  • Sara Javanmardi & David W. McDonald & Cristina V. Lopes. Vandalism detection in Wikipedia: a high-performing, feature-rich model and its reduction through Lasso. 2011.
  • B. Thomas Adler & Luca de Alfaro & Santiago Mola-Velasco & Paolo Rosso & Andrew G. West. Wikipedia vandalism detection: Combining natural language, metadata, and reputation features. 2011.
  • Koen Smets & Bart Goethals & Brigitte Verdonk. Automatic vandalism detection in Wikipedia: Towards a machine learning approach. 2008.
  • Martin Potthast & Benno Stein & Robert Gerling. Automatic Vandalism Detection in Wikipedia. 2008.
  • Jacobi Carter. ClueBot and Vandalism on Wikipedia. 2007.

There is also this one, about PR and ROC curves:

  • Jesse Davis & Mark Goadrich. The relationship between Precision-Recall and ROC curves. 2006.

Other sources for similar articles:

@Aaron: this should be a good start... :) Helder 18:30, 9 January 2015 (UTC)

Progress report: 2015-01-11

Hey folks,

We've been doing a lot of hacking over the holiday season. Since the last update, we:

User:とある白い猫 and User:He7d3r, please check out the bits above to see if I missed anything. --Halfak (WMF) (talk) 18:33, 11 January 2015 (UTC)

Performed a minor correction. Looks good aside from that. :) -- とある白い猫 chi? 04:55, 15 January 2015 (UTC)

Installing and checking requirements.txt

I tested an installation of the revscoring system and dependencies on Ubuntu 14.04 with Python 3.4.0. See my work log below.

# First thing I would like to do is set up a python virtual environment.  I like to store my virtual 
# environment in a "venv" folder, so I'll make one in my home directory.
$ mkdir ~/venv
$ cd ~/venv

# Regretfully, pip is broken in the current version of venv, so we have to install it manually. 
~/venv $ pyvenv-3.4 3.4 --without-pip

# Before we try to install pip, we'll need to activate the virtualenv
~/venv $ source 3.4/bin/activate

# Then use the installer script to install the most recent version
(3.4) ~/venv $ wget -O - https://bootstrap.pypa.io/get-pip.py | python

# Now we have pip in our venv so we can install our dependencies. 
(3.4) ~/venv $ pip install deltas
(3.4) ~/venv $ pip install mediawiki-utilities
(3.4) ~/venv $ pip install nltk
(3.4) ~/venv $ pip install numpy

# Numpy failed because we are missing some headers for Python, so we install them and try again
(3.4) ~/venv $ sudo apt-get install python3-dev
(3.4) ~/venv $ pip install numpy

# OK back to the list
(3.4) ~/venv $ pip install pytz
(3.4) ~/venv $ pip install scikit-learn
(3.4) ~/venv $ pip install scipy

# Installing scipy fails due to missing libraries
(3.4) ~/venv $ sudo apt-get install gfortran libopenblas-dev liblapack-dev

# And now we try again
(3.4) ~/venv $ pip install scipy

# Now, before we get going, we should download the nltk data we need.  
(3.4) ~/venv $ python
>>> import nltk
>>> nltk.download()
Downloader> d
  Identifier> wordnet
Downloader> d
  Identifier> omw
Downloader> q
>>> ^d

# OK now it is time to set up the revscoring project.  I like to pull in all my projects -- whether library
# or analysis -- into a "projects" directory.
(3.4) ~/venv $ mkdir ~/projects/
(3.4) ~/venv $ cd ~/projects
(3.4) ~/projects $ git clone https://github.com/halfak/Revision-Scoring revscoring
(3.4) ~/projects $ cd revscoring
(3.4) ~/projects/revscoring $ python demonstrate_extractor.py

# And it works!

--EpochFail (talk) 17:30, 15 January 2015 (UTC)

On Linux Mint 17.1 (64 bits):
  • pip install numpy fails with RuntimeError: Broken toolchain: cannot link a simple C program, but sudo apt-get install python3-dev fixes it.
  • pip install scikit-learn fails with "sh: 1: x86_64-linux-gnu-g++: not found", so sudo apt-get install g++ needs to be executed before it "Successfully installed scikit-learn-0.15.2"
  • Before running pip install scipy, it was necessary to run sudo apt-get install liblapack-dev (due to numpy.distutils.system_info.NotFoundError: no lapack/blas resources found) and sudo apt-get install gfortran (due to error: library dfftpack has Fortran sources but no Fortran compiler found). The package libopenblas-dev was not necessary.
  • Before cloning the repository, I had to install git too: sudo apt-get install git
Helder 19:31, 1 February 2015 (UTC)
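A quick way to check an environment after following either set of notes is to ask Python which of the required modules it can find. The sketch below is an assumption-laden helper, not part of the repository: the function name missing_modules is hypothetical, and the list uses import names, which can differ from pip names (scikit-learn is imported as sklearn). mediawiki-utilities is left out because the log does not show its import name.

```python
import importlib.util

# Import names for the packages installed in the work log above.
REQUIRED = ["deltas", "nltk", "numpy", "pytz", "sklearn", "scipy"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing: " + ", ".join(missing))
else:
    print("All requirements are importable.")
```

This only checks that the modules can be located; it does not verify versions or that compiled extensions load correctly.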

Logo design

So there are three designs so far. -- とある白い猫 chi? 22:14, 18 January 2015 (UTC)

@Danilo.mac: I think your suggestion would be perfect if the blue circle were in the center of the white circle, and the whole gear at the bottom were the same color (red/brown). Helder 18:53, 23 January 2015 (UTC)
I uploaded a new version with these changes. Helder 21:20, 24 January 2015 (UTC)
@He7d3r: You might want to upload your version as a new file rather than overwriting. It helps us show and see different versions to compare. whym (talk) 03:52, 26 January 2015 (UTC)
@whym: Done. Helder 08:56, 26 January 2015 (UTC)
I particularly like the animated version. I think different aspects of our project can have different logos. For instance this animation can be used as the "loading" animation since I imagine some queries would take time. Or perhaps it could be the logo of revscores itself. -- とある白い猫 chi? 11:57, 26 January 2015 (UTC)
@Helder: Thanks for fixing that! I'm not sure if my !vote counts, but I personally like Danilo.mac's (the 3rd one) because it apparently symbolizes the tool by a robot eyeball monitoring products. whym (talk) 11:05, 29 January 2015 (UTC)


Symbolism behind logo design

ORES (Objective Revision Evaluation Service) logo

So we agreed on this as the logo for ORES (Objective Revision Evaluation Service) for the time being. I want to explain the symbolism/story behind it.

First off, the abbreviation of our system's name is an acronym that spells the plural of the word "ore". Because we datamine raw data (data ore, if you will), we felt this name fit our system best.

Gold ranks among the most valuable ores. In the 17th century the Philosopher's stone was a legendary substance believed to be capable of turning inexpensive metals into gold and was represented by an alchemical glyph. Our logo is inspired by this glyph since the service we intend to provide will convert otherwise worthless ore data (raw data) to gold ore data. Mind that it is still ore for others to process, as this service's main goal is to enable other, more powerful tools.

The idea behind the logo came from User:Mareklug and User:Ekips39 was kind enough to draft two versions of the logo.

-- とある白い猫 chi? 11:53, 26 January 2015 (UTC)

Progress report: 2015-01-16

Hey folks, we got another week of work in, which means it is time for a status update.

  • We welcomed User:Jonas_AGX to the project team and he submitted a pull request to improve our feature set. [9]
  • We completed a readthrough of AGWest's work [10]
  • We wrote up installation notes for Ubuntu, Mint and Windows 7. [11]
  • We've worked through the train/test/classify structure so that the whole team is familiar with it [12]
  • We've done some substantial work testing and refining our classifiers on real world data [13]
    • In a somewhat contrived environment, I've been able to demonstrate 0.85 AUC on English Wikipedia reverts -- which puts us on par with STiki. --16:32, 19 January 2015 (UTC)
  • We attended the IEG office hour [14]
  • We presented on revscoring at the January Metrics Meeting (video, slides) [15]
  • We started a new repo for the Revision Handcoder [16]
  • We fixed some issues with model file reading/writing that will make the system easier to generalize [17]
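For readers unfamiliar with the AUC figure quoted in the list above, here is a minimal Python sketch of what it measures: the probability that a randomly chosen reverted revision gets a higher score than a randomly chosen non-reverted one. The function name roc_auc and the toy data are assumptions for illustration; this is not the evaluation code used by the project.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: True means the revision was reverted; scores are made up.
reverted = [True, True, False, True, False, False]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
print(roc_auc(reverted, scores))  # 8 of the 9 positive/negative pairs are ordered correctly
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking. The pairwise loop is quadratic, so it only suits small samples.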

And I wrote a bunch of documentation to mark off our first month. Ping User:He7d3r, User:とある白い猫 and User:Jonas_AGX--Halfak (WMF) (talk) 16:32, 19 January 2015 (UTC)

Progress report: 2015-01-23

Hey folks. Again, I'm late with this report. You can blame the mw:MediaWiki Developer Summit 2015. I talked to a lot of people about the potential of this project there and learned a bit about operational concerns related to service proliferation. Anyway, last week:

Revcoder home mockup. A mockup of the revcoder home is presented.
  • We discussed the design of revcoder -- including a new mockup of a revcoder homepage. [18] [19]
  • We tested sending data between the revcoder and a service running on labs and worked out some of the details of storage strategies. [20]
  • We pushed the accuracy of the enwiki revert classifier to .83 AUC. See an ipython notebook demonstrating the work. [21]
  • We merged a feature that generalized is_mainspace to is_content_namespace [22]
  • We filed a monthly report for December on the IEG page [23]
  • We also finished up a little bit of other remaining IEG info bits [24]
  • We made it easier to load and share model files [25]

That's all I've got. User:とある白い猫 and User:He7d3r, please review. :) --00:04, 30 January 2015 (UTC)

This is the list of issues and pull requests we closed during that week:
Helder 01:07, 17 February 2015 (UTC)

Progress report: 2015-01-30 (draft)

Last week:

Helder 01:07, 17 February 2015 (UTC)

Progress report: 2015-02-06 (draft)

This week:

Helder 22:26, 6 February 2015 (UTC)

Progress report: 2015-02-20

This week we:

See our changes to the repos during this week.

--Halfak (WMF) (talk) 16:57, 21 February 2015 (UTC)

Progress report: 2015-02-27

Hey folks. Progress report time. Here we go. During the last week, we:

  • Filed a bug against NLTK to add a stemmer for Turkish [42]. They would like us to submit a pull request [43], so that will need to wait.
  • Dug into work to create models for Turkish and Azerbaijani wikis by creating a test corpus of reverted revisions [44]
  • We performed some refactoring of the scoring system so that it can handle multiple models at a time and share features across them. [45] This provides for substantial performance improvements when a request is made for scores from multiple models. We also fixed some bugs with the dependency solver to improve caching behavior. [46]
  • We also refactored languages so that they are expressed as a set of language utilities. [47] This allows for languages to be partially specified, but still useful.
  • We implemented advanced operator modifiers "* / min, max, log, ==, !=" [48] This allows one to express compound features in an intuitive way.
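To illustrate what expressing compound features through operators can look like, here is a self-contained Python sketch. It is not the revscoring API: the Feature class, the helpers constant, raw, max_ and log, and the feature names are all hypothetical, and only "*", "/", max and log are shown.

```python
import math


class Feature:
    """A named value computed from a dict of raw measurements (hypothetical)."""

    def __init__(self, name, compute):
        self.name = name
        self.compute = compute

    def __call__(self, data):
        return self.compute(data)

    def _combine(self, other, symbol, op):
        # Plain numbers are wrapped so they can be mixed with features.
        other = other if isinstance(other, Feature) else constant(other)
        name = "(%s %s %s)" % (self.name, symbol, other.name)
        return Feature(name, lambda data: op(self(data), other(data)))

    def __mul__(self, other):
        return self._combine(other, "*", lambda a, b: a * b)

    def __truediv__(self, other):
        return self._combine(other, "/", lambda a, b: a / b)


def constant(value):
    return Feature(repr(value), lambda data: value)


def raw(key):
    return Feature(key, lambda data: data[key])


def max_(*features):
    name = "max(%s)" % ", ".join(f.name for f in features)
    return Feature(name, lambda data: max(f(data) for f in features))


def log(feature):
    return Feature("log(%s)" % feature.name, lambda data: math.log(feature(data)))


badwords_added = raw("badwords_added")
words_added = raw("words_added")
proportion = badwords_added / max_(words_added, constant(1))
print(proportion.name)
print(proportion({"badwords_added": 2, "words_added": 8}))  # 0.25
```

The sketch leaves out "==" and "!=" on purpose: overriding __eq__ on a class changes how its instances compare and hash, which needs extra care if features are also used as dictionary keys for caching.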

That's all for this week. --EpochFail (talk) 17:42, 7 March 2015 (UTC)

Progress report: 2015-03-06

Hey folks. Our article ran in the Signpost! This week we:

  • Improved diff algorithm performance so that we can generate those features more quickly. [49]
  • Implemented partial language utilities for trwiki and azwiki [50], built feature sets for classifying reverts [51], built revert models for them and found that we could get relatively high AUC despite the lack of language features [52]
  • Translated our Signpost article for the Portuguese Signpost [53].

That's all for now. --EpochFail (talk) 18:03, 7 March 2015 (UTC)

Progress report: 2015-03-13

This week was packed with refactoring and translation work.

  • We completed (but didn't quite merge) a major refactor of the revscoring code base that brought better structure, more features and more tests. [54]
  • We completed translations of our Signpost article for Turkish and Azerbaijani wikis [55]
  • We also made substantial progress towards connecting the back-end for our revision coder with our gadget prototype. See Research talk:Revision scoring as a service/Coder.

--Halfak (WMF) (talk) 19:54, 20 March 2015 (UTC)

Progress report: 2015-03-20

Hey folks,

This week, we finished up quite a lot of work that was in progress last week.

  • We officially added Ladsgroup to the project. [56]
  • reza1615 translated our coding gadget to Farsi [57] and Gediz translated the gadget to Turkish and Azerbaijani [58]
  • We published our IEG midpoint report [59] and our monthly report for February [60]
  • We initialized the paths for the revision coder backend server [61] and configured the coder gadget to re-write MediaWiki pages [62]
  • We merged a major refactoring and expansion of features into revscoring [63]

That's all folks. Stay tuned. :) --Halfak (WMF) (talk) 21:31, 27 March 2015 (UTC)

Feedback from Raylton

@Halfak: This is the conversation I had with Raylton some time ago about the revcoder mockups and other aspects of the project (it is in Portuguese, sorry):

chat log

Friday, January 23, 2015:
Raylton:
como tá o ieg?
tem coisa que eu possa ajudar (que não demore muito tempo)
?

Helder:
tá indo bem, eu acho
andei prototipando um gadget

Raylton:
huum

Helder:
e testando integração com o Labs
se quiser fazer uns testes rápidos (clicar aqui e ali, de um jeito que
descubra bugs...hehe)
É o primeiro dessa lista:
https://test.wikipedia.org/wiki/Special:Preferences#mw-prefsection-gadgets
nele estou testando como poderia ser a interface para permitir que
usuários avaliassem um conjunto de edições como sendo boas/ruins, de
boa/má fé...

Raylton:
<gadget-QualityCoding>?

Helder:
isso

Raylton:
parece que não tem mensagem traduzida né?

Helder:
é... deu preguiça
...pra que os algoritmos de aprendizado de maquina tenham um dataset
sobre o qual serão treindados
pode testar aqui:
https://test.wikipedia.org/wiki/Sandbox?diff=0&uselang=pt
depois de ativar o gadget
ele mostra uma barra de progresso, em que cada retangulo corresponde a
uma revisão do conjunto sob análise
o editor responde às opções apresentadas (rotula/tagueia) e submete, e
o gadget abre a próxima revisão pra analise

Raylton:
tem uma coisa
ele lembra a resposta anterior

Helder:
a submissão enviaria algo para o Labs
q por enquanto
vai parar nessa tabelinha que o danilo fez no ptwikis:
https://tools.wmflabs.org/ptwikis/dev/Pontua%C3%A7%C3%A3o
Pois é... notei essa memória depois de uns tantos testes.

Raylton:
é ruim ele lembrar a resposta anterior

Raylton:
nesse pequeno teste gostei da interface

Helder:
to usando mw.ui


Raylton:
embora eu não goste de boa vs má fé

Helder:
do conceito ou da terminologia que foi usada para descrevê-lo?

Raylton:
o unico problema que enxerguei até agora é a memoria. que pode trazer
falsos positivos

Helder:
concordo
enquanto era só eu q tinha pensado nisso, fui adiando... mas vou ver
se apago a memória dele....
</momento MIB>

Raylton:
it's that "good faith" assumes we can efficiently measure users' faith.

Helder:
by the way, I accept pull requests:
https://github.com/he7d3r/mw-gadget-RevisionCoding

Raylton:
actually, "good faith" will end up steeped in moralism. like.. improper
words will be considered bad faith and more elaborate errors will be
considered good

Helder:
yes, there's a dose of subjectivity

Raylton:
uh-huh

Helder:
hey, Aaron gave a presentation at a somewhat internal meeting of the
Wikimedia folks
about the IEG...
I think it should be on YouTube

Raylton:
send it

Helder:
we took some notes on vandalism/good/bad faith today,
line 13: http://etherpad.wikimedia.org/p/revscoring
in principle the two fields are orthogonal (independent), but not
entirely, since it's hard to imagine a bad-faith edit that is
good (constructive)
/me looks for the video
https://www.youtube.com/watch?v=53bG9mYMYE8#t=11m20s

Raylton:
I think that for this quick-glance step, what would be most useful to me is
something like: "is the edit constructive?" []yes []no

"needs improvement?"
[]yes (which) []no

Helder:
That reminds me that I still haven't found a place to allow notes
(free text, written out)

Raylton:
it would be good if, besides free text, the "yes" part had some of the
most common "tags"
like the WordPress ones

Helder:
Hey, I'd really appreciate it if you could post your most important
impressions of what to improve on a talk page on Meta-Wiki where
we are discussing these mockups
ah, Aaron did something like that in one of his mockups
(not sure which one)

Raylton:
I will. I just don't know if I'll manage to engage in the discussion.

Helder:
I welcome opinions here, about anything you find important, whether
agreeing, disagreeing or suggesting changes:
https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service#Revision_handcoding_.28mockups.29
no problem if you can't follow it more closely

Raylton:
I'll comment there on that page of yours

Raylton:
I think I'll make a mockup of the interface suggestion I gave you
then I'll call you

Raylton:
is there a library in MediaWiki for adding those WordPress-style tags?
so I can make a mockup for you
(or rather... for us hehe)

Helder:
not that I know of

Raylton:
I'll poke around the CSS

Helder:
heh
Done:
https://github.com/he7d3r/mw-gadget-RevisionCoding/commit/8e1de6723109f9d417c761ebbad356c7158849ba
and
https://test.wikipedia.org/w/index.php?diff=221230

Raylton:
perfect
I'll give you the mockup shortly

Monday, February 9, 2015:

Raylton:
hey dude!
did you put those mockups in the discussion?

Helder:
no

Raylton:
ok...
I'll post them

Helder:
I ended up not doing it

Raylton:
which topic do they belong in again?

Helder:
https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service#Revision_handcoding_.28mockups.29
remember to ping Aaron and Gediz
/me checks the usernames
https://meta.wikimedia.org/wiki/User:%E3%81%A8%E3%81%82%E3%82%8B%E7%99%BD%E3%81%84%E7%8C%AB
https://meta.wikimedia.org/wiki/User:EpochFail
those two
+me
+Danilo.mac

Raylton:
but as a preliminary opinion... do you think this approach makes sense?
the one I proposed?

Helder:
I did mention something to Aaron about the suggestion of allowing people
to indicate "types of improvements"

Raylton:
(I mean... would you rather use this or something else?)

Helder:
and I believe we'll end up having a free-text field to collect
that kind of thing, and eventually, if patterns emerge in the suggestions,
maybe add specific tags for the most common suggestions
and there was another thing we discussed, which I haven't put in the
gadget yet, which is to separate "I'm not sure" from the list of binary
options
[Yes][No]
[ x ] I'm not sure

Raylton:
sorry
what is the discussion about?

Helder:
which one?

Raylton:
the yes/no/unsure one

Helder:
the notes from that day are here: http://etherpad.wikimedia.org/p/revscoring
== 01/30 ==

Raylton:
what does it mean?
I read it but didn't understand

Helder:
instead of allowing the person to choose neither "yes" nor "no"
(by choosing "I'm not sure" instead), it will be more useful if the
person is "forced" to choose between "yes" and "no", but with the option
of saying they were not very confident in their assessment (by checking
"[x] I'm not sure")

Raylton:
that would be a separate question, right?

Helder:
each aspect of an edit being analyzed (constructive?
good faith? quality class) would have its own "[ ] I'm not
sure about this"

Raylton:
I think it's unnecessary complexity

Helder:
it got simpler, rather than more complicated
or rather,
the complexity didn't change
it only increased the usefulness of "uncertain" assessments
Choosing "I don't know" in
"[yes][no][I don't know]"
tells the algorithms nothing useful, but saying
"Yes" and not "I'm sure", or saying "no" and "I'm not sure", is useful

Raylton:
got it

Helder:
typo: 'and not "I'm sure"'
--> 'and "I'm not sure"'

Raylton:
but from a user's point of view I'd prefer a skip button
actually this sure-or-not thing would only work for less subjective things
good or bad faith is impossible to define with certainty

Helder:
"you son of a ..." is easy to be sure about
and it's not a problem if most assessments on this
"good faith" criterion are made without certainty
some algorithms also work with probabilities that something is
yes, or is no...

Raylton:
no... that's a Manichaean typification of good faith. people who
swear in their daily lives, or who didn't find a comments section
as prominent as the edit button, may have done that
in good faith... how is it possible to measure people's faith? hehe

Helder:
in the cases where humans are in doubt, the algorithm will be too
and in the cases where they are certain, the algorithm will also be more certain
the algorithm will only learn to call good faith whatever editors in
general call good faith...

Raylton:
my issue with the term is that it focuses on the subject of the action... and the
object of analysis naturally shouldn't be that, but rather the action in
question... in this case the edit
there are no good-faith edits because edits don't have faith
there are edits with problems

Helder:
have you tried/heard of Snuggle and Stick (I don't remember which of the two now)?

Raylton:
that is an objective analysis
(in my opinion)
(yes, I've heard of them)

Helder:
one of them does a sort of triage of new editors, so that the good-faith
ones (those who will be most useful) can be better welcomed/mentored
and what's behind it is an algorithm trained on a set of
edits labeled as good or bad faith...
(if I remember correctly)
and since the API we intend to build should serve as a unified
way of providing data to implement this kind of tool
on other wikis, we would need to have this kind of assessment done for
the wikis that are interested
if, for example, on Wikibooks we don't want this... we simply don't
enable that criterion there..

Raylton:
there's no problem in defining an editor as good or bad faith based on their
number of problematic or bad edits... but the edit shouldn't be
called bad faith... rather, an editor who repeatedly, to some degree,
engages in such actions should be

Helder:
ah, but then it's just a matter of changing the wording of the message that
appears in the interface, isn't it?
"edit made in good faith" instead of "good-faith edit"
or even "well-intentioned editor"?
I don't know
[I don't remember what the current text is]

Raylton:
no
no "good-faith edit" at all
edit = has or doesn't have problems
users who cause many problems = bad faith

Helder:
"is this edit typical of bad-faith editors?"

Raylton:
but why on earth do we need to assume the editor's bad faith

Helder:
"if you had to guess whether this editor has been acting constructively,
based exclusively on this current edit, what would you say?"

Raylton:
it's the guy's first edit

Helder:
no...
it's a random edit taken from the history (dump)

Raylton:
if he's warned that that's not how it's done and he carries on, then yes, there's bad faith
random, including first edits
right?
if so, then my concern is still valid

Helder:
if you're lucky...
depending on the size of the sample of edits we take for
analysis, more or fewer "first edits" may show up in the
training set
if it's small, it's possible that none will appear
(but small sets aren't very useful for training the algorithms)
so, yes, there will probably be first edits in the set

Raylton:
if I had to guess, I would describe the problem and let the
algorithm decide, based on severity and recurrence, who has
good or bad faith...
naturally a single edit is far too small a sample to
draw a conclusion
it's only enough for a hunch
a guess, as you said

Raylton:
do you understand what I mean?
does it make sense?

Helder:
I think so

Raylton:
because, man... knowing that my assessment will define things as good or
bad faith in such a subjective way would be pretty bad
even more so considering that the object of analysis has no faith...
hehe
this "good faith" would be like an argumentum ad hominem
there
now I've got a good analogy

Helder:
except that we wouldn't necessarily show who the man in question (the author
of the revision) is when evaluating a revision from the set

Raylton:
and for that very reason faith shouldn't be in question
in this specific situation

Helder:
in short, I think you consider the field related to "good faith"
quite questionable, and the things you said would apply in a
discussion where a given wiki was deciding whether or not to use that
field. Whereas the field related to being "constructive" would not
be subject to the same criticisms, since it relates only to the
content of the revisions.

Raylton:
yes...
exactly

Helder:
BUUUT..

Raylton:
and here we are questioning the relationship between good or bad faith and the content

Helder:
remember that the algorithms will also be receiving "features" that
have to do with the editor, not with the content...
https://github.com/halfak/Revision-Scoring/tree/master/revscoring/features

Raylton:
but there are issues related to using the term for users that
might have made the conversation very confusing if I had brought them up

Helder:
IF the algorithm notices that a large share of the edits made by
anonymous users tend to be in bad faith, it will give a low weight to other
features such as "page is in the main namespace or not"

Raylton:
there are two problems with this term besides its relation to the object of
analysis (which we've already covered),
And here I'm talking only about the problems of using it for users...
since we've already addressed the problems of using it for edits.
The first is that it isn't concrete.
the second is that good and evil is Manichaean.
so I think the most sensible definition for me is "Potentially
destructive users"
or something like that
or even destructive users
in case we are more certain about the recurrence of the destructiveness of the edits
criteria for defining how destructive the editor is could be
did they make any destructive edit?
How severe was the destructive edit?
Were they warned about the destructive edit?
Is the destructive action recurrent after the warning?

or in more database-like terms
Destructive-severity
Destructive-was warned
Destructive-recurrent
PS: the severity would depend on that field describing the problem in the page
edit*

Helder:
but again all of this seems to concern only the text that
we should put in the interface shown to whoever evaluates a
set of edits
because for the algorithms it doesn't matter what interpretation
humans give to the classes (yes/no) of a classification problem

Raylton:
yes.. from the start this has been a discussion about user experience
the implementation would be trivial in this context...
but it would probably undergo changes too

Helder:
given revisions r1, r2, r3, ..., r100000, with their various attributes
(the ~40 features in the folder I linked above), and the respective labels
(yes/no) indicating the class each revision belongs to, the
algorithms just learn to make an analogous classification for new
revisions (that were not in the set), as similar as
possible to the way humans would, but there is no
interpretation involved.
there is nothing subjective from the algorithm's point of view. It
will simply find the model that best fits the data
provided by humans, so that they can do whatever they please with the
automatic classifications that can be obtained from the
(already trained) algorithm
Things like: Huggle filtering Recent Changes to show only
the most likely vandals, or the most likely acts of vandalism, etc..
But, of course, whatever the interpretation of the classes is, it has to
be used consistently in both stages: the classification of N
revisions by humans, and the predictions provided by a trained
algorithm. Otherwise whoever uses the data we provide in our API
may draw unwarranted conclusions about what a property such as
"good faith: 0.85" means

Raylton:
note: every time I said object of analysis, I was referring to the analysis by the
user who will provide the data
by subjective I meant in the classification

Helder:
you just confused me

Raylton:
it is objective whether or not the user considers the edit good faith... but the
term good faith is not objective

Helder:
object of analysis = the editor who made one of the edits that happened to be
drawn to be evaluated by a human who will help us
provide data for the algorithms?

Raylton:
it's abstract
or at least it's not concrete
look, when I say analysis it probably has nothing to do with the
mathematical definition of analysis
since I'm a layman

Helder:
[I don't think I brought the meaning of "mathematical analysis" into the conversation]

Raylton:
the object of analysis I was talking about was:
Editor[ ]
Edit[ ]
for an edit we can't talk about good faith because it's kind of impossible to define
the faith of an edit...
not as objectively as we can define whether an edit has or
doesn't have a problem
or which problem it has

Helder:
yes, but I think we already agreed on that
faith is an attribute of editors, not of edits

Raylton:
and in the case of users it couldn't be used, mainly in order to
assume good faith of editors in the first place, and also because faith remains
abstract in any context involving really
tangible things... and second, if we were to receive data about the nature of the
edits, it would be more sensible to say whether the editor usually makes good
edits or not, or even what the severity or recurrence of their bad
edits is
so... now I was questioning the term faith as used for editors
hence the summary of the first part
in theory, if it can be used, it's for editors... but I also don't find it
useful to use it for editors

Helder:
"usually makes good edits" is just "one more" feature to be implemented
and which, once implemented, may very well be included among those
used to train the algorithms

Raylton:
so then... if we're talking about destructive or non-destructive
we can talk about destructive or non-destructive users, or even
degrees of destructiveness of the edit or of the editor
and we can take faith out of the picture

Helder:
but it will need a set of labeled edits in order to make
any prediction about it

Raylton:
the point is that if faith can't be measured, what's the benefit of continuing
to use the term
we could have destructive or non-destructive edits and users
and degrees of that
with a single edit not defining an editor as destructive
but rather a set of factors
Helder:
Pick any question whose possible answers are "yes" and
"no", and label some 10000 items as "yes" or "no" with respect to
that question. Once that's done, the algorithms can answer "yes" or
"no" to the same question about other items in a way very
close to how humans would answer. You only need to provide the
algorithms with "features" that could be used (by a human) to
correctly answer "yes" or "no" to that question about the items.

What you're saying seems to be that the metadata of the **edits** are not
sufficient to decide whether the correct answer about an
**editor** is "yes" or "no". But then all that's needed is to implement
new features that make it possible (for a human) to decide that, and the
algorithm will imitate the human's performance on the same
classification task.
Such new features could be things like the editor's number of reverts,
the number of warnings they have received, etc
(editor's reverts = edits of theirs that were reverted by others)
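
A minimal sketch of the label-then-learn loop Helder describes, assuming two made-up editor features (number of reverted edits, number of warnings received) and a handful of invented labels. The nearest-centroid rule used here is only a stand-in chosen for brevity; it is not the algorithm used by the revscoring project, and all names and data are hypothetical.

```python
from math import dist
from statistics import mean

# Hypothetical labeled training data: one feature vector per editor,
# (number of reverted edits, number of warnings received), with a
# human-provided yes/no answer to "is this editor destructive?".
labeled = [
    ((0, 0), "no"),
    ((1, 0), "no"),
    ((0, 1), "no"),
    ((8, 3), "yes"),
    ((6, 4), "yes"),
    ((9, 2), "yes"),
]

def train(examples):
    """Compute one centroid (mean feature vector) per label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*vectors))
        for label, vectors in by_label.items()
    }

def predict(centroids, features):
    """Return the label whose centroid is closest to the features."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

centroids = train(labeled)
print(predict(centroids, (7, 3)))  # an editor that was not in the training set
print(predict(centroids, (0, 0)))
```

The model never interprets what "yes" means; it only reproduces the pattern in the human labels, which is the point made in the conversation.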

Raylton:
yes... I believe that's it

Helder:
and.. by the way
every year academic research papers are published on what
would be good features for predicting certain things (answering certain
questions) automatically

Raylton:
for me it would be utterly impossible to decide the faith of an edit and very
hard to decide the faith of an editor... but it would be easy to decide whether an
edit has problems and whether a user is prone to making problematic
edits

Helder:
in the papers I've been reading there are tables showing how well each feature
can predict the answer to certain questions, which ones
best predict vandalism, etc...

Raylton:
features are things to better define (for humans) a single
positive or negative answer, right?

Helder:
examples of "features" of an edit:

  • number of words
  • number of swear words inserted
  • number of uppercase letters inserted
  • etc
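
The three example features listed above could be computed roughly as follows. This is an illustrative sketch only: the bad-word list is invented, "inserted" is approximated by a multiset difference between the new and old text, and none of the function names come from the revscoring code.

```python
import re
from collections import Counter

# Hypothetical bad-word list, for illustration only.
BADWORDS = {"stupid", "idiot"}

def tokens(text):
    """Lowercased word tokens of a text."""
    return re.findall(r"\w+", text.lower())

def inserted(old_items, new_items):
    """Multiset of items that occur more often in the new revision than in the old one."""
    return Counter(new_items) - Counter(old_items)

def edit_features(old_text, new_text):
    added_words = inserted(tokens(old_text), tokens(new_text))
    added_chars = inserted(old_text, new_text)
    return {
        "words": len(tokens(new_text)),
        "badwords_added": sum(n for w, n in added_words.items() if w in BADWORDS),
        "uppercase_added": sum(n for c, n in added_chars.items() if c.isupper()),
    }

features = edit_features("The cat sat.", "The cat sat. YOU ARE STUPID")
print(features)
```

A real extractor would work from a proper diff rather than a multiset difference, but the output has the same shape: a small dictionary of numbers per edit.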


Raylton:
but those are the ones detected by computer

Helder:
the most recent papers show statistics for models trained with
some 60 features

Raylton:
those would serve to suggest things to humans
for example... a machine could detect that and tell me the
tendency of the edit
and I would confirm it or not

Helder:
given that these are "machine learning" problems, it's rather natural that they are
things that can be computed by a computer

Raylton:
yes

Helder:
"those would serve to suggest things to humans": yes. That sums up the
goal of our IEG.
to build an API for these automated suggestions, which can be used
by humans (and bots, or gadgets, or applications like Huggle) to
do whatever they want with them

Raylton:
hmmm
that's good
and "do whatever they want" could even mean that if the machine
assessments are very accurate I can let them work
on their own in some cases, right?
sort of a reciprocal human-machine exchange
until the machines finally take over everything :)

Helder:
rewriting an earlier sentence: "if the goal is to treat certain
classification problems present in the day-to-day life of the wikis as
"machine learning" problems, in which computers can help
humans in some way, it's rather natural that the features used
by the algorithms are things that can be computed by a
computer, so that they can imitate the performance of humans on the
classification task"
yes to letting them work on their own in some cases
specifically, the ClueBot NG bot comes to mind
https://en.wikipedia.org/w/index.php?title=User:ClueBot_NG#Statistics
Just look at this sentence from the page above: "Selecting a false positive rate of 0.25%
(old setting), the bot catches approximately 55% of all vandalism."
the automated predictions can be used precisely so "that the
machines take over everything", at least in the cases where they are able to
decide with great confidence the correct answer to a question
(say, "do I need to revert this edit?")

Raylton:
I see...
sounds cool

Helder:
depending on how many false positives the community is willing to
tolerate from a bot that uses the API, it will catch more or less
vandalism
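
The trade-off between tolerated false positives and vandalism caught can be illustrated with a small threshold search. This is a sketch under assumptions: the ten scores and labels are invented, and `pick_threshold` is a hypothetical helper, not part of ClueBot NG or of the planned API.

```python
def pick_threshold(scores, labels, max_fpr):
    """Find the most lenient cut-off whose false positive rate stays within max_fpr.

    scores: predicted probability that each edit is vandalism.
    labels: True where a human labeled the edit as vandalism.
    Returns (threshold, false_positive_rate, recall), or None if even the
    strictest cut-off exceeds max_fpr.
    """
    negatives = sum(1 for label in labels if not label)
    positives = sum(1 for label in labels if label)
    best = None
    # Try each observed score as a cut-off, from strictest to most lenient.
    for threshold in sorted(set(scores), reverse=True):
        flagged = [label for score, label in zip(scores, labels) if score >= threshold]
        fpr = sum(1 for label in flagged if not label) / negatives
        recall = sum(1 for label in flagged if label) / positives
        if fpr > max_fpr:
            break  # lowering the threshold further can only add false positives
        best = (threshold, fpr, recall)
    return best

# Hypothetical scores and human labels for ten edits.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [True, True, False, True, False, True, False, False, False, False]

strict = pick_threshold(scores, labels, max_fpr=0.0)
lenient = pick_threshold(scores, labels, max_fpr=0.2)
print("no false positives tolerated:", strict)
print("up to 20% false positives:   ", lenient)
```

With this toy data, tolerating no false positives catches half of the vandalism, while tolerating some catches three quarters, which mirrors the behaviour described for a bot that consumes the scores.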

Raylton:
"PS: there.. still on UX... vandal is a good name for someone who is
given almost exclusively to destructive edits, but not for someone
who made one or two."

Helder:
the definition of "vandal" presupposes that there is "bad faith"+"destruction" (rather
than "good faith"+"destruction")

Raylton:
hahaha good faith
recurrence + ignoring warnings + destruction = vandal

Helder:
"good faith" implies "not ignoring warnings"

Raylton:
I'll look it up in the dictionary... but that's not it, is it
it seems much more subjective
and if that were it, we couldn't define good- or bad-faith edits... as
I supposed at the beginning

Helder:
why?

Raylton:
because edits don't receive warnings
naturally

Helder:
but the presence of warnings on the talk page of an edit's author is a
piece of metadata of the edit

Raylton:
come on... you know you're stretching it

Helder:
seriously, given a revision you can check who the author was, and then look at
the content of their talk page, and see whether there was any warning there

Raylton:
it's still not the edit that receives the warning, is it?

Helder:
(human reverters do this)

Raylton:
but you can't it's not the edit that sees the warning or not and goes on being
destructive or not
hehe
I thought we had already agreed on that
ignore the "but you can't"

Helder:
but the edit says something about the editor's good or bad intentions

Raylton:
yes...
I agree
that's what we said above

Helder:
For example, suppose X damaged the content of an article, and
received a warning.
1. If he makes a new edit, and it is constructive, that's a "good sign"
2. If he makes a new edit, but it is destructive, that's a "bad sign"

Raylton:
if X is an editor I still agree
as before

Helder:
And taking this example to the extreme, if there was more than one warning about
destroying content, and the editor still made a new edit that
destroys content, the "bad sign" is stronger

Raylton:
yes...
everything revolves around the editor receiving the warning and repeating or not
the bad edit

Helder:
that is, to solve the problem you have in mind, it's just a matter of
implementing the features that take a revision, look up the author,
inspect their talk page (or contribution history) and count the
number of warnings, how many times they insisted on destructive actions
after the warnings, etc, which are computable things.

Raylton:
[y] just a note
I think everything revolves around destructive or not

Helder:
yes yes

Raylton:
from there, to increase precision, we can add problems
and classify them as destructive or not
or even add weights for their destructiveness

Helder:
the algorithms usually take care of some weights (for the features),
I don't know if that's the same thing you're thinking of...

Raylton:
I'm thinking of a weight for each problem
or a weight for each group of problems
but it depends on how useful that would be
but I think a swear word can be more destructive than a spelling
mistake, for example

Helder:
let me give an example of what I meant by the weights for the features
actually, I had already posted the example on Meta too
/me grabs the link
See this example:
https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service#.28Bad.29Words_as_features
I tried to summarize something I saw on Coursera
I think it's related to what you're thinking of

Raylton:
by the way... I'm thinking here that problems may turn out to be
independent of the destructiveness of the edit
after all, having problems doesn't necessarily mean the edit is destructive
this is the kind of thing that deserves a nice project with mockups
before implementing
maybe you're already past that stage
but anyway
/me signed up for the machine learning course from all of Helder's talking about it...

Helder:
about the kind of problem you're talking about, consider for example the
"balance" of references of an edit relative to the previous one: if it is
positive, the editor was adding references to the text (something good); if it is
negative, they were removing references (possibly bad). This is a
feature that isn't implemented yet
in that GitHub code
but which, if I'm not mistaken, has already been used in some of the papers on the use
of "machine learning" for vandalism detection...
yay!!!! I bet you'll like it
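
The reference "balance" feature mentioned above could be sketched like this. It is an assumption-laden illustration: references are counted by matching opening `<ref>` tags in wikitext with a regular expression, which ignores citation templates and other ways of citing, and the function names are hypothetical rather than taken from revscoring.

```python
import re

# Matches opening <ref> tags, including <ref name="..."> and the self-closing
# <ref name="..." />, but not </ref> or <references />.
REF_TAG = re.compile(r"<ref(?:\s[^>]*)?/?>", re.IGNORECASE)

def count_refs(wikitext):
    """Number of reference tags opened in the wikitext."""
    return len(REF_TAG.findall(wikitext))

def reference_balance(parent_text, revision_text):
    """Positive when the edit added references, negative when it removed them."""
    return count_refs(revision_text) - count_refs(parent_text)

parent = 'Fact one.<ref>Source A</ref> Fact two.<ref name="b">Source B</ref>'
revision = "Fact one.<ref>Source A</ref> Fact two."
print(reference_balance(parent, revision))
```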

Raylton:
hmmm
the number of characters can help define that

Raylton:
but not on its own, right

Helder:
that feature is already on the list

Raylton:
I can improve the article by removing some things

Helder:
(characters)

Raylton:
got it

Helder:
right... I myself often take advantage of the edits in which I revert
vandalism to fix one thing or another too...
this has a drawback, which is that it makes it impossible to identify
reverts by comparing the sha1 "hashes" of the revisions
likewise, your improvements that remove things "hurt" the
use of character counts as a measure of some things...
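
Hash-based revert detection, and the drawback Helder points out, can be shown in a few lines. This is a sketch with an invented page history; `find_identity_reverts` is a hypothetical helper that hashes the text itself, whereas a real tool would more likely compare the sha1 values already stored with each revision.

```python
import hashlib

def sha1(text):
    """Hex SHA-1 digest of a revision's text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def find_identity_reverts(revisions):
    """Yield (reverting_id, restored_id) pairs.

    revisions: list of (rev_id, text) in chronological order. A revision
    counts as a revert when its hash equals that of an earlier revision
    and at least one other revision sits in between.
    """
    last_seen = {}  # hash -> (index, rev_id) of the latest revision with that hash
    for index, (rev_id, text) in enumerate(revisions):
        digest = sha1(text)
        if digest in last_seen:
            earlier_index, earlier_id = last_seen[digest]
            if index - earlier_index > 1:
                yield rev_id, earlier_id
        last_seen[digest] = (index, rev_id)

# Hypothetical page history.
history = [
    (1, "Good text."),
    (2, "Good text. VANDALISM"),
    (3, "Good text."),                    # exact restore of revision 1
    (4, "Good text. MORE VANDALISM"),
    (5, "Good text, plus a small fix."),  # vandalism removed, but text also changed
]
reverts = list(find_identity_reverts(history))
print(reverts)
```

Revision 3 is detected because its hash matches revision 1. Revision 5 also undoes vandalism, but because the reverter fixed something else in the same edit, no earlier hash matches and the revert goes unnoticed.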

Raylton:
hmmm.. I'm getting it... based on this conversation I think I can
make a change to the mockup.. from the user's point of view.
PS... I have a problem with "vandalism"

Raylton:
I can even live with the word vandal, but vandalism falls into that same
good and bad faith thing

Raylton:
I'll try to make a mockup based on this conversation of ours
one thing
is there a user interface for the data obtained from the assessments?

Helder:
not yet
the project on Labs was just created, we still need to write code to put there

Raylton:
man... design first

Helder:
Aaron made a mockup of the main page

Raylton:
design first, code later
I learned that at GSoC
the savings are huge
in time and work

Helder 01:00, 1 April 2015 (UTC)

Progress report: 2015-03-27[edit]

Hey folks,

This week we:

We also have some substantial work in progress:

  • We've been propagating a recent refactor of the revscoring library across our other projects.
  • We're cleaning up annoying little bugs like this guy in the revscoring library that first appeared in the wild.
  • Helder and I have settled on a configuration strategy for the coding gadget. [67]

That's all. Stay tuned. --EpochFail (talk) 15:02, 4 April 2015 (UTC)

Progress report: 2015-04-03[edit]

This week we finished off a lot of stuff. :)

That's all. Stay tuned. --EpochFail (talk) 15:08, 4 April 2015 (UTC)

Featured in a Washington Post Article.[edit]

See http://www.washingtonpost.com/news/the-intersect/wp/2015/04/15/the-great-wikipedia-hoax/

Excerpt:

Wikipedians have proposed other reforms, too. The Wikimedia Foundation is funding research into more robust bots that could score the quality of site revisions and refer bad edits to volunteers for review. Another proposed bot would crawl the site and parse suspicious passages into questions, which editors could quickly research and either reject or approve.

Cool! --Halfak (WMF) (talk) 15:28, 16 April 2015 (UTC)

I just noticed the link back to our project. I am at a loss for words. -- とある白い猫 chi? 12:30, 26 April 2015 (UTC)

Progress report: 2015-04-10[edit]

Hey folks,

$ revscoring -h

Provides access to a set of utilities for working with revision scorer models.

Utilities

* score             Scores a set of revisions
* extract_features  Extracts a list of features for a set of revisions
* train_test        Trains and tests a MLScorerModel with extracted features.

Usage:
    revscoring (-h | --help)
    revscoring <utility> [-h|--help]

Revscoring utility documentation

This week was another productive one with a lot of tasks coming together.

  • We centralized the utility scripts that support revscoring within the revscoring project and made a cute general utility to make them easy to work with [77]. We also took the opportunity to make file reading/writing easier in Windows [78].
  • We deployed a form builder interface for writing new form configurations used in the revcoder.
  • We implemented a means for extracting all labels from the coder server [79]
  • We added a feature to revscoring that prints the dependency tree for a feature. [80] This is useful when debugging dependency issues in feature extraction.
  • We added a simple revert detector script to the ORES project [81]. This in combination with the centralized revscoring utilities provides automation for training new classifiers.
  • We implemented an OAuth login for the revcoder system [82]. You can test the workflow by going to http://ores-test.wmflabs.org/coder/auth/initiate.

That's all for this week. Stay tuned. --Halfak (WMF) (talk) 22:25, 17 April 2015 (UTC)

Progress report: 2015-04-17[edit]

Hello all,

I will be filing the progress report for a short while in place of Halfak.

This week:

  • We now have a function OO.ui.instantiateFromParameters() which takes some JSON configuration and constructs an OOjs UI field. This also populates a fieldMap with "name"/widget pairs that can be used later.[83] You can get a sense of it by trying our form builder.
  • We refactored ORES for language-specific features to reflect the changes made to Revision Scoring. We also reorganized the features list to both reuse code and improve performance and accuracy. [84]
  • We created a MediaWiki gadget to filter the recent changes feed by reverted score. [85]
  • We renamed the service Revision handcoder to Wiki-Tagger. [86]

That is this week's summary. Stay tuned. -- とある白い猫 chi? 12:54, 26 April 2015 (UTC)

Progress report: 2015-04-24[edit]

Hello all,

  • We updated the ORES server to the newer version of revscoring and included new models. [87]
  • We renamed Wiki-Tagging to Wiki-Labels. We hope to name our hand coding campaigns with a "Wiki labels foo" format, a bit like Wiki Loves Monuments.[88] We have also defined dependencies for Wiki-Labels.[89]
  • We have explored additional methods for automatically detecting badwords.[90]
  • We have investigated advanced bag-of-words approaches such as TF-IDF and, by extension, Latent Semantic Analysis (LSA). [91]

That is this week's summary. Stay tuned. -- とある白い猫 chi? 13:41, 26 April 2015 (UTC)

Notes on ORES performance.[edit]

So, there's been some discussion recently of ORES's performance and how it's not nearly as fast as we would like when someone requests scores for a bunch of revisions. I'd like to take the opportunity to document a few things that I know are slow. I'll sign each section I create so that we can have a conversation about each point.

Looking for misspellings[edit]

Looking for misspellings is one of our most substantial bottlenecks. Right now, we're using nltk's "wordnet" to look up words in English and Portuguese. This is slow. On some pages, scanning for misspellings can take up to 4 seconds on my i5. That's way too much -- especially because we end up scanning at least two revisions for misspellings. So, I've been doing some digging and I think that 'pyenchant' might be able to help us out here. It uses your Unix-installed dictionaries to do lookups and it is much faster. Here's a performance comparison looking for misspellings in enwiki:4083720:

$ python demonstrate_spelling_speed.py 
Sending requests with default User-Agent.  Set 'user_agent' on api.Session to quiet this message.
Wordnet check took 3.7539222240448 seconds
Enchant check took 0.008267879486083984 seconds

So, it looks like we can get a speedup of roughly 450x there (close to 3 orders of magnitude). It looks like we can get a lot of dictionaries too. Here's apt-get's listing of myspell dictionaries:

myspell-af              myspell-el-gr           myspell-fo              myspell-hy              myspell-ns              myspell-st              myspell-ve
myspell-bg              myspell-en-au           myspell-fr              myspell-it              myspell-pl              myspell-sv-se           myspell-xh
myspell-ca              myspell-en-gb           myspell-fr-gut          myspell-ku              myspell-pt              myspell-sw              myspell-zu
myspell-cs              myspell-en-us           myspell-ga              myspell-lt              myspell-pt-br           myspell-th              
myspell-da              myspell-en-za           myspell-gd              myspell-lv              myspell-pt-pt           myspell-tl              
myspell-de-at           myspell-eo              myspell-gv              myspell-nb              myspell-ru              myspell-tn              
myspell-de-ch           myspell-es              myspell-he              myspell-nl              myspell-sk              myspell-tools           
myspell-de-de           myspell-et              myspell-hr              myspell-nn              myspell-sl              myspell-ts              
myspell-de-de-oldspell  myspell-fa              myspell-hu              myspell-nr              myspell-ss              myspell-uk

No Turkish or Azerbaijani, but we can do Farsi, English and Portuguese. :) --EpochFail (talk) 16:29, 26 April 2015 (UTC)
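
A minimal sketch of how the misspelling count could be kept independent of the dictionary backend, so that wordnet and enchant lookups can be swapped and timed against each other. The `count_misspellings` helper and the toy word list are hypothetical and exist only so the example runs without extra packages.

```python
import re

def count_misspellings(text, is_word):
    """Count alphabetic tokens that the supplied dictionary check rejects."""
    words = re.findall(r"[^\W\d_]+", text)
    return sum(1 for word in words if not is_word(word))

# Stand-in dictionary so the sketch runs on its own.
KNOWN = {"the", "quick", "brown", "fox"}

def toy_check(word):
    return word.lower() in KNOWN

misspelled = count_misspellings("The quikc brown fox", toy_check)
print(misspelled)
```

Assuming pyenchant and an en_US dictionary are installed, `enchant.Dict("en_US").check` could be passed as `is_word` in place of `toy_check`.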

API latency and the need to perform multiple requests/score[edit]

Right now, we gather data for extracting features one revision at a time. For a common 'reverted' scoring, we'll perform the following requests:

  1. Get the content of the revision under scrutiny,
  2. Get the content of the preceding revision (lookup based on parent_id)
  3. Get metadata from the first edit to the page (for determining the age of the page, lookup based on page_id, ordered by timestamp)
  4. Get metadata about the editing user (lookup based on user_text)
  5. Get metadata about the editing user's last edit (lookup based on user_text, ordered by timestamp)

One way that we can improve this is by batching all of the requests in advance before we provide the data to the feature extractor. So, let's say we receive a request to score 50 revisions: we would make one batch request to the API for content from those 50 revisions. Then we would make another batch request to retrieve the content of all parent revisions. I think we can also batch the requests for a first edit to a page (specifying multiple page_ids to prop=revisions with rvlimit=1). We can batch the requests to list=users and list=usercontribs too. We'd then have to use the extractor's dependency injection to supply these bits for each revision after the fact. For example:

features = extractor.extract(rev_id, features, cache={revision.doc: <our doc>, parent_revision.doc: <our doc>, ... })

It makes me a bit sad to do this, since at this point in the code we don't know whether revision.doc and parent_revision.doc are actually necessary. We might want to provide some functionality at the ScorerModel level to allow us to check this. E.g.

if scorer_model.requires(revision.doc):
    cache[revision.doc] = session.revisions.query(revids=...)

--EpochFail (talk) 16:29, 26 April 2015 (UTC)
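The batching idea above could be sketched roughly as follows. This is only an illustration and not the revscoring or ORES code: the helper names (chunked, revision_query_params, plan_requests) are made up for this example, and the batch size of 50 simply mirrors the example in the text. The sketch only builds the parameters for action=query requests; it does not send anything.

```python
from typing import Dict, Iterable, Iterator, List


def chunked(ids: Iterable[int], size: int = 50) -> Iterator[List[int]]:
    """Yield successive lists of at most `size` ids."""
    batch: List[int] = []
    for item in ids:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def revision_query_params(rev_ids: List[int]) -> Dict[str, str]:
    """Build the parameters for one action=query request covering many revisions."""
    return {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        # The MediaWiki API takes multiple values separated by "|".
        "revids": "|".join(str(rev_id) for rev_id in rev_ids),
        "rvprop": "ids|timestamp|user|content",
    }


def plan_requests(rev_ids: Iterable[int], size: int = 50) -> List[Dict[str, str]]:
    """Turn a list of revision ids into one parameter dict per batch (hypothetical helper)."""
    return [revision_query_params(batch) for batch in chunked(rev_ids, size)]
```

With this, scoring 120 revisions would need 3 content requests instead of 120. The same pattern would be repeated for the parent revisions and for the user lookups.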

Other than the batching of requests, which seems very appropriate, would it help if the system used database access instead of API requests to extract the features? Helder 17:32, 26 April 2015 (UTC)
+1 to Helder - 1 and 2 require the API, but steps 3-5 can all be done via the database, batched, very quickly. Yuvipanda (talk) 18:09, 26 April 2015 (UTC)

Caching

Right now, there's no caching at all. If a score is requested, it is calculated, returned and then forgotten. This is sad because we could probably store scores for the entire history of all the wikis in ~ 50-75GB. We could also make use of a simple LRU cache in memory (e.g. https://docs.python.org/3/library/functools.html#functools.lru_cache). This would work really well for managing the load of the set of bots/tools tracking the recentchanges feed. --EpochFail (talk) 16:29, 26 April 2015 (UTC)
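For illustration, an in-memory LRU cache with functools.lru_cache might look like the sketch below. The compute_score function is a hypothetical stand-in for the real feature extraction and scoring, and the CALLS counter exists only to show that repeated requests do not recompute anything.

```python
from functools import lru_cache

# Counts how often the expensive path runs (for demonstration only).
CALLS = {"count": 0}


def compute_score(model: str, rev_id: int) -> float:
    """Hypothetical stand-in for expensive feature extraction and scoring."""
    CALLS["count"] += 1
    return (rev_id % 100) / 100.0


@lru_cache(maxsize=1024)
def cached_score(model: str, rev_id: int) -> float:
    """Return a score, keeping the most recently used results in memory."""
    return compute_score(model, rev_id)
```

Because the cache lives inside the process, it is lost on restart and is not shared between processes.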

+1 for computing the scores for all revisions of all (supported) wikis and storing them in some kind of database, with associated version numbers to identify which model was used to compute the scores. As a user of the system I would like to be able to get e.g. a list of revisions whose previous score was a false positive (i.e. a constructive edit scored as vandalism) and whose score generated by a more recent model is now a true negative (similarly for other combinations of true/false positives/negatives). This would allow us to get an idea of how the system is improving over time, and to identify regressions in the quality of the scores we provide for users. Helder 17:32, 26 April 2015 (UTC)
I'd suggest a local install of redis for caching over in-process caching. This ensures that you can restart your process willy-nilly without having to worry about losing cache. Yuvipanda (talk) 18:10, 26 April 2015 (UTC)
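A rough sketch of storing scores together with the model version, so that two versions can be compared as suggested above. Everything here is hypothetical (the class name, the key layout and the 0.5 threshold), and a plain dict stands in for whatever database or redis instance would actually hold the scores.

```python
from typing import Dict, List, Optional, Tuple

# (wiki, model name, model version, rev_id); this key layout is an assumption.
ScoreKey = Tuple[str, str, str, int]


class VersionedScoreStore:
    """Keep scores from different model versions side by side."""

    def __init__(self) -> None:
        self._scores: Dict[ScoreKey, float] = {}

    def put(self, wiki: str, model: str, version: str,
            rev_id: int, score: float) -> None:
        self._scores[(wiki, model, version, rev_id)] = score

    def get(self, wiki: str, model: str, version: str,
            rev_id: int) -> Optional[float]:
        return self._scores.get((wiki, model, version, rev_id))

    def changed_predictions(self, wiki: str, model: str, old: str, new: str,
                            threshold: float = 0.5) -> List[int]:
        """List rev_ids whose thresholded prediction differs between two versions."""
        changed = []
        for (w, m, version, rev_id), old_score in sorted(self._scores.items()):
            if (w, m, version) != (wiki, model, old):
                continue
            new_score = self.get(wiki, model, new, rev_id)
            if new_score is None:
                continue  # not yet scored by the newer model
            if (old_score >= threshold) != (new_score >= threshold):
                changed.append(rev_id)
        return changed
```

Joining the result with hand-coded labels would then tell us which of the changed predictions are fixes and which are regressions.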

Pre-caching

Since we know that the majority of our requests are going to be for recent data, we could try to beat our users to the punch by generating scores and caching them before they are requested. Assuming caching is in place, we'd just need to listen to something like RCStream and simply submit requests to ORES for changes as they happen. If we're fast enough, we'll beat the bots/tools. If we're too slow, we might end up needing to generate a score twice. It would be nice to be able to delay a request if a score is already being generated so that we only do it once. --EpochFail (talk) 16:29, 26 April 2015 (UTC)
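The "only do it once" part could be sketched like this, using plain threads and concurrent.futures.Future rather than any particular task queue. The InFlightScores class and its method names are invented for this example. A second request for a revision that is already being scored waits for the first result instead of starting a new computation.

```python
import threading
from concurrent.futures import Future
from typing import Callable, Dict


class InFlightScores:
    """Make sure each revision is being scored at most once at any given time."""

    def __init__(self, score_func: Callable[[int], float]) -> None:
        self._score_func = score_func
        self._lock = threading.Lock()
        self._pending: Dict[int, Future] = {}

    def get(self, rev_id: int) -> float:
        with self._lock:
            future = self._pending.get(rev_id)
            owner = future is None
            if owner:
                # Nobody is scoring this revision yet, so this caller will.
                future = Future()
                self._pending[rev_id] = future
        if owner:
            try:
                future.set_result(self._score_func(rev_id))
            except Exception as error:
                future.set_exception(error)
            finally:
                with self._lock:
                    del self._pending[rev_id]
        # Owners return immediately; everyone else blocks until the owner is done.
        return future.result()
```

This only prevents duplicate work while a score is in flight. Remembering finished scores is the job of the cache discussed in the previous section.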

This looks like an interesting thing to do. Helder 17:32, 26 April 2015 (UTC)

Misclassifications

Moved to Research:Revision scoring as a service/Misclassifications/Edit quality --EpochFail (talk) 20:11, 8 December 2015 (UTC)

Progress report: 2015-05-01

Hello all,

We decided that I shall carry out weekly reports from now on.

  • We had major ongoing work for Wiki-Labels. We intend to have everything up and running by 8 May, when we will have the first hand-coder input from Wiki-Labels. Once this is achieved it will be a milestone for our project.
  • We held a general discussion on the landing page and also designed the page (w:Wikipedia:Labels). [92]
  • We wrote general documentation for Wiki-Labels here on meta: Wiki labels. [93]

That is this week's summary. Brace yourselves for more. -- とある白い猫 chi? 00:05, 7 May 2015 (UTC)

Badwords

@Ladsgroup: what is the condition for keeping/removing a string in the list? I believe it is so subjective, and that there are so many criteria, that I don't really know what to do with lists like these. I even kept the list which I generated as is, due to the lack of objective criteria for removing items from it.

I believe we need to have some kind of labelling effort for adding (multiple) tags for each string in the lists, so that we can have different categories. I also don't know which would be the common categories, but there are many reasons why an edit adding a given string might be considered damaging:

  • It talks to the reader (e.g. "you", "go **** yourself", "<someone>, I love you!"), and this is not acceptable in an encyclopedic article
  • It is related to sex, and the article isn't
  • It is about a part of the human body (likely inappropriate in an article about Math)
  • It is in a language other than the article's or wiki's language (if it is a "badword" in that other language, should it be in the list for this language too?)
  • It is not a word ("lol", "hahaha", "kkkkkk")
  • It is a personal attack/name calling
  • It is an acronym only used in informal talking (e.g. on chats)
  • It is a website or brand (e.g. "easyspace", "redtube")
  • It discriminates against some group of people, for whatever reason (ideology, beliefs, etc...)
  • It is a misspelled (bad)word
  • It is an uncommon word
  • It is something else
  • It is many of the above things simultaneously

Helder 13:44, 8 May 2015 (UTC)

Anyway, here is a possible split of the Portuguese list you generated: https://gist.github.com/he7d3r/6a5ecf56941a323cb568. Helder 13:47, 8 May 2015 (UTC)
Helder: Thank you for your feedback. It generates a list automatically; it's up to us whether we use some words and not others. I'm using a more advanced technique to create better results. I will update it; please review it and tell me whether it's improved or not. Best Amir (talk) 23:46, 8 May 2015 (UTC)
The problem is that I don't have clear criteria for doing such a review, per above. Helder 12:46, 10 May 2015 (UTC)

Progress report: 2015-05-08

Hello all,

Checking in with the weekly report.

  • We have achieved our milestone: our hand coder, Wiki-Labels, is live as of 8 May in four languages: English, Persian, Portuguese, Turkish.
    • We engaged in further community engagement prior to 8 May to promote the first wiki labels campaign: quality labeling. [94]
    • We updated the gadget to point to ORES server. [95]
    • We deployed Wiki labels to labels.wmflabs.org and updated the docs [96]
    • We loaded 20k samples for all languages. This was over-saturated by very high number of SUL notifications so we resampled from a year back. [97]
    • We configured Wiki labels to run from wmflabs [98]
      • We generalized paths so that wiki labels works from localhost or labels.wmflabs as expected. [99]
    • Wikilabels bugfix: Worksets were shuffled when user navigates away from the page and then back. Now worksets are sorted by task ID. [100]
    • We translated the interface of Wiki labels to Turkish, Persian and Portuguese. [101] [102] [103]
    • We implemented a language fallback chain in wikilabels for cases where translation isn't available. [104]
  • We detailed documentation on individual revscoring features that are used by classifiers. [105]
  • We implemented enchant spell checker for Revscoring. [106]
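A language fallback chain like the one mentioned above can be sketched as follows. This is not the Wiki labels implementation; the chains, message keys and strings below are invented for illustration, and the assumption is that every chain eventually ends in English.

```python
from typing import Dict, List, Optional

# Hypothetical fallback chains; each one ends in English.
FALLBACKS: Dict[str, List[str]] = {
    "pt-br": ["pt", "en"],
    "pt": ["en"],
    "tr": ["en"],
    "fa": ["en"],
}

# Hypothetical interface messages; "pt" is only partially translated.
MESSAGES: Dict[str, Dict[str, str]] = {
    "en": {"submit": "Submit", "skip": "Skip"},
    "pt": {"submit": "Enviar"},
}


def translate(key: str, lang: str) -> Optional[str]:
    """Return the first available message along the fallback chain."""
    for code in [lang] + FALLBACKS.get(lang, ["en"]):
        message = MESSAGES.get(code, {}).get(key)
        if message is not None:
            return message
    return None
```

Here a "pt-br" user would see the Portuguese "submit" string but the English "skip" string, since the latter has no translation yet.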

That was your weekly report. -- とある白い猫 chi? 08:32, 13 May 2015 (UTC)

@とある白い猫, Halfak: So, if the 20k samples are from 2014, we should probably rename the campaigns, because they are saying the edits are from 2015. Helder 13:44, 13 May 2015 (UTC)
The edits are from the year ending in 2015-04-15, so some may be from 2014. --Halfak (WMF) (talk) 22:37, 13 May 2015 (UTC)
@Halfak: and the previous sample was from which period? Helder 09:18, 14 May 2015 (UTC)
I'm sorry. What previous sample? Could you be thinking of the sample we trained on revert/not-reverted? That was also extracted in 2015. It turns out that our test dataset for loading Wiki labels contains campaign names that reference 2014, but that's just test data. --Halfak (WMF) (talk) 15:17, 14 May 2015 (UTC)
@Halfak (WMF): I'm referring to this:

This was over-saturated by very high number of SUL notifications so we resampled from a year back.

Helder 15:57, 14 May 2015 (UTC)
Ahh yes. So the last sample was from the last 30 days since that is what the recentchanges table keeps. But the new sample uses the revision table and a whole year's worth of revisions. So, a "year back" from 2015-04-15. --Halfak (WMF) (talk) 16:15, 14 May 2015 (UTC)
Got it! Thanks for clarifying. Helder 18:17, 14 May 2015 (UTC)

Progress report: 2015-05-15

Hello all,

Checking in with the weekly report.

  • Wikilabels bugfix: We had a strange bug where the full screen button appeared twice. Upon further investigation, this was because global.js was being loaded twice. We modified our code to prevent double-loading of the UI. [107]
  • We generated bad word lists for the az, en, fa, pt and tr wikis. Unlike the previous lists, these are generated from known bad revisions. [108]
  • We have updated the translation for Portuguese. [109]
  • We have conducted some maintenance and administrative work concerning Wikilabels. [110]

That was your weekly report. -- とある白い猫 chi? 17:25, 21 May 2015 (UTC)

Progress report: 2015-05-22

Hello all,

Checking in with the weekly report.

  • We have conducted some maintenance and administrative work concerning Wikilabels. [111]
  • We have filtered likely-non-damaging edits from tasks on en, fa, pt, tr wikis. [112]
  • We posted a progress report on enwiki, fawiki, ptwiki and trwiki on the first campaign.[113][114]
  • We will attend Mediawikiwiki:Wikimedia Hackathon 2015 to recruit and hack-a-thon away...[115]

That was your weekly report. -- とある白い猫 chi? 15:11, 24 May 2015 (UTC)

ORES performance improvements

Hey folks,

I've been working with Yuvipanda to work out some performance and scalability improvements for ORES. I've captured our discussions about upcoming work in a series of diagrams that describe the work we plan to do.

Basic flow. The basic ORES request flow. All processing happens within a single thread (limited to a single CPU core). No caching is done.
Basic + caching. The basic flow augmented with caching. All processing is still single-threaded, but a cache is used to store previously generated scores. This enables a quick response for requests that include previously generated scores.
Basic + caching & celery. The basic flow augmented by caching and celery. Processing of scores is farmed out to a celery computing cluster. Re-processing of a revision is prevented by tracking open tasks and retrieving AsyncResults.

My plan is to work from left to right, implementing improvements incrementally and testing them against the server's performance. I've actually already been doing that, as we have been implementing improvements along the way. Right now, the basic flow has seen substantial improvements in the misspellings look-up speed and request batching against the API.

Server response timing. Empirical probability density functions for ORES scoring, generated using the 'reverted' model for English Wikipedia and 5k revisions batched into 50-revision requests. Groups represent different iterations of performance improvements for the ORES service.

I plan to have this figure updated as we roll out upcoming performance improvements. --EpochFail (talk) 11:03, 29 May 2015 (UTC)

Progress report: 2015-05-29

Hello all,

Checking in with the weekly report.

  • We have conducted some maintenance and administrative work concerning Wikilabels. [116]
  • We completed Portuguese translation of documentation on meta. [117]
  • We observed a CSS conflict. User:Hedonil/XTools/XTools.js conflicts with Wikilabels and prevents the display of Wikilabels UI. [118]
  • We started up a compute server on labs (testing redis/celery & generating models). [119]
  • We attended Wikimedia Hackathon 2015
    • Day 1: Lots of hacking, pywikipediabot meeting, first ever in person meeting. [120], [121]
    • Day 2: More hacking, Hackathon hack session (T90034). [122][123]
    • Day 3: Even more hacking, added French language specific utilities to revision scoring, live demo of French language of revision scoring at closing showcase. [124]
      • We added French language specific utilities. Special thanks goes to fr:User:Paannd a for the French translation and review of the French bad word list.[125]

That was your weekly report. -- とある白い猫 chi? 02:44, 8 June 2015 (UTC)

Progress report: 2015-06-05

Hello all,

Checking in with the weekly report.

  • We refactored revision scoring dependency management. While this did not make a difference in the interface, it has improved performance and management. A total of 46 files were modified in some capacity with 847 added lines and 703 removed. [126][127]

This was your weekly report. -- とある白い猫 chi? 02:52, 8 June 2015 (UTC)

Also, we now have docs at pythonhosted.org/revscoring. --Halfak (WMF) (talk) 14:50, 9 June 2015 (UTC)

Fluid animation

[Image: Objective Revision Evaluation Service logo.gif, tiled many times across several rows]

While sorting media I noticed the above property of our animated ORES logo. :) -- とある白い猫 chi? 13:48, 14 June 2015 (UTC)

LOL. Helder 14:31, 14 June 2015 (UTC)
So awesome. :) --Halfak (WMF) (talk) 15:10, 14 June 2015 (UTC)
I wonder if we can use this property for our purposes. :/ -- とある白い猫 chi? 18:01, 11 July 2015 (UTC)

Article quality scores

Hello User:Halfak (WMF), pinging you after your Wikimania talk. Super interesting stuff! I asked about direct database access to a full set of article quality scores. I'm maintaining the WikiMiniAtlas, for which I need a metric to prioritize articles shown on the map at a given zoom level. I'd prefer to show high quality articles as prominently as possible (my current ranking is just based on article size). --Dschwen (talk) 15:47, 19 July 2015 (UTC)

Hi Dschwen! We're looking into it now. If you are just sizing things in the interface, I wonder whether requesting scores from ORES would work for you in the short term. In the long term, we have a phab ticket that you can subscribe to. See Phab:T106278. We're currently working out what the initial table will contain. --Halfak (WMF) (talk) 19:55, 22 July 2015 (UTC)
I'm not just resizing stuff (or needing a small set of data at a time); I need to process the entire set of all geocoded Wikipedia articles at once to build a database of map labels for each zoom level. I will subscribe to the Phabricator ticket. Thanks! --Dschwen (talk) 21:14, 26 July 2015 (UTC)

New stats on ORES speed.

Not much time to type. Here's the plot. Should be self explanatory. Woot!

API Response speed (by type). The density of API response time is plotted for requests to ORES (score 50 enwiki revisions).

--EpochFail (talk) 18:14, 19 July 2015 (UTC)

Perpetuating bias

This is great work, add me to the list of people who hope it opens the door to a much more permissive wiki culture!

Apologies if this discussion is already under way somewhere else... I wanted to ask about the training data for the revscoring:reverted models, and whether you plan to unpack the various motivations behind the "undo" action. In short, I think it's imperative that we present a multiple-choice field during undo, to allow the editor to categorize their reason for reverting. This will allow us to provide much higher quality predictions in the future.

To make a sloppy analogy, what we're currently doing is like training a neural network on the yes-no question, "is this a letter of the alphabet?". What we want to do is have it learn which specific letter is which.

Meanwhile, to continue with the analogy, imagine the written language is evolving and new letters are being invented, old ones are changing form.

I'd like to see documentation on exactly how the training data is harvested, because I'm concerned that our revert model is actually capturing something ephemeral about on-wiki culture, which has shifted over time. Clearly you've considered this problem, the introduction to ORES/reverted suggests as much. For historical data, we would probably need to correlate reverts with debate about the revert--was this a contentious revert? Can we guess whether it was done in good faith? Was the outcome to rollback the revert? Was there an edit war? Also, was the revert destructive, was the original author offended? Engaged? Retained? This seems like a really hard problem, which is why I'd suggest we focus on recent data only, to capture current norms around acceptable article style, and also introduce a self-reporting mechanism where the editor can categorize the reason for their revert.

Adamw (talk) 05:45, 23 July 2015 (UTC)

+1 to all that you've said.
I think it is an interesting idea to have a multiple choice option for 'undo'. It seems like different actions should be taken given the user's reasoning. E.g. if blatantly offensive vandalism, level 4 warning & revert. If playful vandalism, level 1 (or N+1) warning & revert. If test edit (key mash, "hi there", etc.), then test edit warning & revert. If good-faith, but still does not belong, revert and post reason on talk page. If good, no undo for you! Assuming that judgements made by editors had sufficient coverage (there's reason to believe that reverts have coverage), then we could use this to train and deploy better prediction models.
Right now, we're looking at using our Wiki labels campaign to answer the questions: "Is this damaging?" and "Is this good-faith?" so that we can (1) check the biases in our 'reverted' model and (2) train a better classifier that focuses on damage. I'd really like to be able to stand up a classifier that specializes in bad-faith damage for quality control purposes.
Your point about recency is well received as well. One substantial concern we must manage is the periodic nature of vandalism. E.g. when school is in session in North America, we seem to get a lot more vandalism in enwiki and of a different type. Right now, we're training our models based on revisions from the entire year of 2014 because we started work in January, 2015 -- but we could also sample from the entire year before yesterday. I think we'll find it difficult to extend our wiki-labeling campaign once per year since it involves a substantial amount of effort. This might work for enwiki where we have been lucky to find many volunteer labelers, but I suspect that less active wikis will fall behind unless we integrate with mediawiki's undo/rollback. --EpochFail (talk) 18:29, 23 July 2015 (UTC)

Progress report: 2015-06-05 - 2015-08-02

We have neglected reporting lately but we have been working hard on expanding and developing our project.

The list below is in reverse chronological order, with the newest item at the top.

  • Init for wb_vandalism [128]
  • Fix Wikilabels DB config issues [129]
  • Add full-page view to Wikilabels [130]
  • Remove USB 2.0 driver to resolve an issue with ORES vagrant [131]
  • Revision scoring in production discussion [132]
  • Create a Debian package for python3-jsonpify [133]
  • Create a python package for ‘stopit’ module [134]
  • Create a Mediawiki utilities Debian package [135]
  • Create Debian package for yamlconf [136]
  • Setup Wikilabels infrastructure and deployment [137]
  • Fix revision scoring requests version issues [138]
  • Fix language cache issue in API Extractor [139]
  • Implemented Regex Language generalization in revscoring [140]
  • Scored Revisions to use test server [141]
  • Develop Vagrant for Revision Scoring [142]
  • Fix DocumentNotFound error for article/page creations [143]
  • Spec out research for edit type classifier [144]
  • Select imports for languages (revision scoring) [145]
  • Add basic metric/ outcome goals to IEG renewal [146]
  • Add promise of longer-term viability plan to renewal [147]
  • Indonesian, Spanish, Vietnamese language utilities [148] [149][150]
  • We gave a presentation, “Revision Scoring Service – Exposing Quality Wiki tools”, at Wikimania 2015 [151]
  • We gave a presentation, “Would you like some AI with that”, at Wikimania 2015 [152]
  • We implemented a pre-caching daemon for ORES [153]
  • HACKING Tools that use revision scoring [154]
  • We implemented puppet for ORES celery [155]
  • We implemented distributed processing for ORES. [156]
  • We got Wikilabels off of NFS that was causing some instability on labs [157]
  • Community outreach at Wikimania 2015 to explain what AI can do for the communities [158]
  • We proposed the renewal of the Revision Scoring IEG [159] and also posted a plan for it [160]
  • We created documentation for the Wikilabels Visual Editor Experiment Campaign [161]
  • We published the Revision Scoring IEG final report [162][163]
  • We added TF-IDF generated bad word lists for French, Spanish, German, and Russian [164][165]
  • We expanded our TF-IDF generated badword lists to 250 words. [166][167]
  • We migrated to Phabricator! Work board
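For readers unfamiliar with the TF-IDF approach behind the generated badword lists, here is a minimal sketch. It is an assumption about the general technique, not the script the project actually used: each edit's added text is treated as a tokenized document, term frequency is counted over known-bad edits only, and inverse document frequency is computed over all edits. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_scores(bad_docs: List[List[str]],
                 all_docs: List[List[str]]) -> Dict[str, float]:
    """Score words that are frequent in bad edits but rare across all edits.

    Assumes every document in `bad_docs` also appears in `all_docs`.
    """
    n_docs = len(all_docs)
    doc_freq: Counter = Counter()
    for doc in all_docs:
        doc_freq.update(set(doc))  # count each word once per document
    term_freq = Counter(word for doc in bad_docs for word in doc)
    total = sum(term_freq.values())
    return {
        word: (count / total) * math.log(n_docs / doc_freq[word])
        for word, count in term_freq.items()
        if doc_freq[word] > 0
    }


def top_words(scores: Dict[str, float], limit: int = 250) -> List[str]:
    """Return the `limit` highest scoring words (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in ranked[:limit]]
```

Words that appear in nearly every edit (such as "the") get a score near zero, which is why a generated list still needs human review for the remaining false hits.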

That was your Revscoring report.

-- とある白い猫 chi? 18:29, 2 September 2015 (UTC)

Wikitrust

Hi. Wikitrust has been, in my opinion, one of the most advanced tools for scoring users and revisions. I'm surprised not to see an analysis of it here; it's not even in the list of tools. What is the reason for that? Regards Kelson (talk) 12:33, 3 September 2015 (UTC)

Hi Kelson, the short answer is that WikiTrust solves a different type of problem. In this project, we score revisions. WikiTrust does not score revisions directly and it doesn't do its scoring in real time. It scores editors and applies a trustworthiness score to their contributions and applies an implicit review pattern. WikiTrust is actually just one of many algorithms that use this strategy. For my work in this space, see R:Measuring value-added. For a summary of other content persistence algorithms, see R:Content persistence. --Halfak (WMF) (talk) 13:21, 3 September 2015 (UTC)
Also, FYI, you can do cross-wiki links like this: en:WikiTrust --Halfak (WMF) (talk) 13:21, 3 September 2015 (UTC)
Thank you for your quick&clear answer. Kelson (talk) 13:59, 3 September 2015 (UTC)

FYI: some ORES downtime today.

See Talk:Objective Revision Evaluation Service and the post mortem. --EpochFail (talk) 17:11, 8 September 2015 (UTC)

Progress report: 2015-09-19

Hey folks. It's been about a month since you've gotten a progress report. I figured one was due.

  • We minimized the rate of duplicate score generation in ores [168]. Parallel requests to score the same revision will now share the same celery AsyncResult.
  • We turned pylru and redis into optional dependencies of ORES [169]. This makes deployment a little easier since we don't have to make Debian packages for libraries we don't use in production.
  • We did a bunch of homework around detecting systemic bias in subjective algorithms (like our classifiers). See our notes here: [170]
  • We made the color scheme in ScoredRevisions configurable. [171]
  • We primed lists of stopwords by applying a en:TFiDF strategy to edits in various Wikipedias (af, ar, az, de, et, fa, he, hy, it, nl, pl, ru, uk) [172]
  • We added language features for Hebrew [173] and Vietnamese [174] to revscoring
  • We read and discussed critiques of subjective algorithms in computer-mediated social spaces [175]
  • We implemented a regex-based badwords detector that handles multiple-token badwords (important for Turkish and Persian) [176]
  • We *deployed* a series of performance improvements to ores.wmflabs.org. [177]
  • We implemented a diff-detecting system for Wikidata's JSON data format [178]
  • We accidentally nuked our Celery Flower monitoring system for ores.wmflabs.org and then brought it back online [179]
  • We built a script for extracting data from the recently-finished Wikilabels edit quality campaigns [180] and we're now working on building models with the data.
  • We substantially improved the stability of ORES worker nodes and the redis backend that they use [181]
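As an illustration of the regex-based badwords detector for multiple-token entries mentioned in the list above, here is a minimal sketch. It is not the revscoring implementation; the function names and the sample phrases are invented, and real Turkish or Persian lists may need extra handling (for example for zero-width joiners) that this sketch ignores.

```python
import re
from typing import Iterable, List


def compile_badwords(phrases: Iterable[str]) -> "re.Pattern[str]":
    """Compile single- and multi-token phrases into one case-insensitive regex."""
    alternatives = []
    # Longest phrases first, so "go away" is tried before a shorter entry.
    for phrase in sorted(phrases, key=len, reverse=True):
        tokens = [re.escape(token) for token in phrase.split()]
        if tokens:
            # Allow any run of whitespace between the tokens of a phrase.
            alternatives.append(r"\s+".join(tokens))
    if not alternatives:
        raise ValueError("at least one non-empty phrase is required")
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def find_badwords(pattern: "re.Pattern[str]", text: str) -> List[str]:
    """Return every match in order of appearance."""
    return [match.group(0) for match in pattern.finditer(text)]
```

Matching on the raw text, rather than on a list of single tokens, is what lets an entry such as "go away" be detected as one unit.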

Also worth noting, ORES has been adopted by Huggle and we've been working with them to address performance issues by suggesting they request scores in parallel. ORES can take it! --EpochFail (talk) 16:22, 19 September 2015 (UTC)

FYI: Code for extracting features re. edit type classification

See work here: https://bitbucket.org/diyiy11/wiki_edit

We'll need to adapt and export the feature extraction code for use inside revscoring. F-scores for each class are comparable to the state of the art. --EpochFail (talk) 18:00, 23 September 2015 (UTC)

@EpochFail: I get this on that URL: "You do not have access to this repository.Use the links at the top to get back." Helder 21:33, 23 September 2015 (UTC)
Gotcha. I'll have to talk to Diyi about opening it up. She may prefer that I rewrite before publishing. I'll report back when I can get to it. --EpochFail (talk) 21:51, 23 September 2015 (UTC)

Progress report: 2015-09-26

So your weekly reports should actually be weekly reports now. Sorry for the last hiccup.

  • We clustered reverted edits using a k-means algorithm. T110581
  • We prepared a summary of SigClust and other methods for choosing the number of clusters. T113057
  • We started working on our midpoint report, which should provide a good overview of our work since the last tri-monthly report. We hope to have a draft by 1st of October. T109845
  • We are in the process of deploying the results of data collected through the Wikilabels edit quality campaign. We will compare the results of the newer model generated from this data with the older revert-based model in the upcoming weeks. T108679
  • We are winding down our community outreach efforts and will focus on assisting more responsive communities in the meantime. T107609
  • We are re-generating stop words by omitting interwiki links from them. Interwiki links do not provide an indicative signal for edit quality. T109844
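
As a rough illustration of the k-means step mentioned in the first item, here is a minimal pure-Python sketch of Lloyd's algorithm. It is not the code used for T110581: the two-dimensional tuples below are made-up stand-ins for real feature vectors of reverted edits, and `kmeans`, `assign` and `squared_distance` are hypothetical names.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Group points by the index of their nearest centroid."""
    clusters = [[] for _ in centroids]
    for point in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(point, centroids[i]))
        clusters[nearest].append(point)
    return clusters


def kmeans(points, k, iterations=20, seed=0):
    """Alternate between assigning points and recomputing centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        clusters = assign(points, centroids)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, assign(points, centroids)


# Made-up feature vectors forming two obvious groups (assumption).
edits = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(edits, k=2)
```

Note that `k` has to be chosen up front, which is why methods for picking the number of clusters matter.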

We have a lot of things going on in parallel. Stay tuned for next week!

-- とある白い猫 chi? 01:10, 28 September 2015 (UTC)

Better usage statistics in graphite

Graphite logging (precached speed). A screenshot of the graphite logging interface shows the 50, 75, 95 and 99th percentiles of response times for our precaching system.

Hey folks,

We just had a good work session on our midterm report for the IEG. It became blatantly obvious that our methods for gathering metrics on requests, cache-hit/miss and activity of our precached service left a lot to be desired. So I put in a couple of marathon sessions this week and got our new metrics collection system (using graphite.wmflabs.org) up and running.

Check out the screenshot to the right. You can also find similar statistics by navigating to graphite.wmflabs.org. --EpochFail (talk) 19:52, 3 October 2015 (UTC)
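
For readers unfamiliar with percentile summaries like the ones in the screenshot, here is a minimal sketch of how the 50th, 75th, 95th and 99th percentiles of a set of response times can be computed with the nearest-rank method. This is only an illustration: it is not how graphite computes its series, and the timings below are invented.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the observations are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Invented response times in milliseconds (assumption).
response_times = [120, 80, 95, 300, 110, 90, 105, 2500, 100, 85]
summary = {p: percentile(response_times, p) for p in (50, 75, 95, 99)}
```

The high percentiles are the interesting ones for a precaching service, since a single slow request barely moves the median but dominates the 95th and 99th.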

Progress report: 2015-10-03

Your weekly report.

  • ORES celery workers are now quieter thanks to better error handling. This makes unexpected errors more prominent, since known issues no longer drown them out. T112472[182]
  • Batch feature extraction is now implemented. This will expedite model creation. T114248
  • The midpoint report was drafted early to fit the schedules of the IEG and grantees and so avoid delays. T109845
  • We included the model version in the ORES response structure so that scores are regenerated when a model is updated, instead of outdated scores from the earlier model persisting. New scores will be generated on demand. T112995
  • Model testing statistics and model_info utilities are added to revscoring. T114535
  • Metrics collection added for ORES. This will help quantify ORES usage. T114301
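
To illustrate why carrying the model version matters for cached scores, here is a minimal sketch of a cache keyed on the version. It is not the ORES implementation: `ScoreCache` and `dummy_scorer` are hypothetical, and the real cache backend and response format may differ.

```python
class ScoreCache:
    """Cache scores under (model name, model version, revision ID)."""

    def __init__(self, score_func):
        self._score_func = score_func  # callable(model_name, rev_id) -> score
        self._cache = {}
        self.misses = 0  # counts how often a score had to be generated

    def get(self, model_name, model_version, rev_id):
        key = (model_name, model_version, rev_id)
        if key not in self._cache:
            # A new version means a new key, so the score is regenerated
            # on demand instead of serving the earlier model's output.
            self.misses += 1
            self._cache[key] = {
                "version": model_version,
                "score": self._score_func(model_name, rev_id),
            }
        return self._cache[key]


def dummy_scorer(model_name, rev_id):
    # Hypothetical stand-in for running a real model.
    return {"prediction": False}


cache = ScoreCache(dummy_scorer)
cache.get("reverted", "0.1.0", 12345)  # generated
cache.get("reverted", "0.1.0", 12345)  # served from cache
cache.get("reverted", "0.2.0", 12345)  # regenerated for the new version
```

Including the version in the response also lets clients tell which model produced a score they stored earlier.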

That was your weekly report. -- とある白い猫 chi? 15:49, 11 October 2015 (UTC)

Progress report: 2015-10-10

Your weekly report.

  • We have trained and deployed the "damaging" and "good faith" models, built on the handcoded data gathered through the now complete edit quality campaigns via Wikilabels. These have been deployed for enwiki, fawiki and ptwiki. T108679
  • The Turkish Wikilabels campaign has completed and is ready for modelling, provided AUC confirms a gain.
  • We are preparing many wikis for their own edit quality campaigns. The Wikilabels interface awaits translation by local communities.
  • Midpoint report was approved.

That was your weekly report. -- とある白い猫 chi? 15:57, 11 October 2015 (UTC)

See ORES for details on the new models. --EpochFail (talk) 16:17, 11 October 2015 (UTC)

Dealing with Wikidata

Per the revscoring sync meeting, here are my thoughts on the matter. First off, we have established that bots are almost never reverted on Wikidata. Secondly, I think everyone can agree that on Wikidata bots FAR outweigh humans in terms of number of edits. As a consequence, we are dealing with an over-fitting problem where probably all human edits will be treated as bad, because the algorithm will give too much weight to features that basically distinguish bots from humans. This is more of an intuitive assessment than actual analysis. I could very well be wrong. I came to this assessment because on Wikidata the vast majority of good edits will come from bots, and bots will always dominate the random sample set. We can perform a more selective sampling, but honestly I do not see the benefit of it.

Based on the two assessments above, I propose a different type of modelling for Wikidata than what we use on Wikipedias. First off, we need to segregate bot edits from other edits. Indeed, this has the potential to introduce a bias, but I will explain how this can be avoided. So we would have two models for Wikidata, one for bots and the other for non-bots. Two independent classifiers would be trained, and they could even be of different types: for instance, bot edits could be scored with Naive Bayes while human edits are processed with an SVM. This is kind of a top-level decision tree which delegates its first two branches to classifiers.

If ORES is asked to score a revision and the edit is from a bot, the bot model would generate the score. If ORES is asked to score a revision and the edit is not from a bot, the non-bot model would generate a score. The output would still be "damaging/not damaging" and "good faith/bad faith". Bias would be avoided because the way bots edit and the way humans edit are very different. If humans make bot-like edits and these aren't reverted (and vice versa), they would still be treated as good.
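
A minimal sketch of that routing, assuming each model exposes a `score(features)` method. Everything here is hypothetical (`StubModel`, `score_revision`, the shape of the output); it only shows the bot flag being used to pick a model rather than as a feature inside one.

```python
class StubModel:
    """Hypothetical stand-in for a trained classifier."""

    def __init__(self, name, damaging_probability):
        self.name = name
        self.damaging_probability = damaging_probability

    def score(self, features):
        return {
            "model": self.name,
            "damaging": self.damaging_probability >= 0.5,
        }


def score_revision(features, bot_model, human_model):
    # The bot flag only selects which classifier handles the edit.
    model = bot_model if features.get("user.is_bot") else human_model
    return model.score(features)


bot_model = StubModel("bot", damaging_probability=0.01)
human_model = StubModel("non-bot", damaging_probability=0.60)

bot_score = score_revision({"user.is_bot": True}, bot_model, human_model)
human_score = score_revision({"user.is_bot": False}, bot_model, human_model)
```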

-- とある白い猫 chi? 13:42, 17 October 2015 (UTC)

Seems like this is putting the cart before the horse to me. If we are worried about winding up in a bots vs. humans situation, then we can leave the user.is_bot flag out of the feature set. We need representative training data to train any model (your suggestion or a more straightforward approach) and that is the issue I brought up at our meeting. We don't want to have to extract features for 2m edits. That would take forever. Also, bots are *never* reverted, so in order to get *any* signal about bots, we'd need to have humans handcode ~2m edits to get a representative sample of bot damage. IMO, we should build a sample stratified on whether the edit was reverted or not and exclude the user.is_anon and user.is_bot flags. We can do this in a relatively straightforward way now that we know the rough percentage of edits that are reverted by processing the XML dumps.
Honestly, I'm really hoping that we don't need to have a hierarchical model because that implies a substantial increase in code complexity. Right now, we don't even know that we do have a problem with bias yet. --EpochFail (talk) 15:48, 17 October 2015 (UTC)
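
A minimal sketch of the stratified sample described above, assuming each edit is a dict with a `reverted` flag and a `features` mapping. The data layout and the helper names (`stratified_sample`, `strip_excluded`) are hypothetical; only the two excluded flag names come from this discussion.

```python
import random

EXCLUDED_FEATURES = {"user.is_anon", "user.is_bot"}


def stratified_sample(edits, per_stratum, seed=0):
    """Draw up to per_stratum edits from each stratum (reverted or not)."""
    rng = random.Random(seed)
    reverted = [e for e in edits if e["reverted"]]
    kept = [e for e in edits if not e["reverted"]]
    return (rng.sample(reverted, min(per_stratum, len(reverted))) +
            rng.sample(kept, min(per_stratum, len(kept))))


def strip_excluded(features):
    """Drop the flags that would let a model separate bots from humans."""
    return {k: v for k, v in features.items() if k not in EXCLUDED_FEATURES}


# Made-up population: 5 reverted edits out of 100 (assumption).
edits = [
    {"rev_id": i,
     "reverted": i < 5,
     "features": {"user.is_bot": i % 2 == 0, "bytes_changed": i}}
    for i in range(100)
]
sample = stratified_sample(edits, per_stratum=5)
training_rows = [strip_excluded(e["features"]) for e in sample]
```

Sampling per stratum means feature extraction only runs on the sampled edits, not on the full set.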

Major update to revscoring package documentation

Hey folks,

I just updated the revscoring package documentation for 0.6.7. It's got a new theme (alabaster is the new default), better examples, and simplified access patterns for basic types (e.g. from revscoring import ScorerModel vs. from revscoring.scorer_models import ScorerModel). Check it out. :) --EpochFail (talk) 13:34, 22 October 2015 (UTC)