I made some updates to the mocks and worked on a generalizable configuration strategy. I propose something like this for a campaign:
YAML campaign configuration
name: Revision Coding -- English Wikipedia 2014 10k sample
source: enwiki 2014 revisions -- 10k random sample
author:
  name: Aaron Halfaker
  email: aaron.halfaker@gmail.com
coder:
  class: revcoding.coders.RevisionDiff
  form:
    fields:
      - damaging
      - good-faith
fields:
  damaging:
    class: revcoding.ui.RadioButtons
    label: Damaging?
    help: Did this edit cause damage to the article?
    options:
      - label: "yes"
        value: "yes"
        tooltip: Yes, this edit is damaging and should be reverted.
      - label: "no"
        value: "no"
        tooltip: >
          No, this edit is not damaging and should not be
          reverted.
      - label: unsure
        value: unsure
        tooltip: >
          It's not clear whether this edit damages the article or
          not.
  good-faith:
    class: revcoding.ui.RadioButtons
    label: Good faith?
    help: >
      Does it appear as though the author of this edit was
      trying to contribute productively?
    options:
      - label: "yes"
        value: "yes"
        tooltip: Yes, this edit appears to have been made in good-faith.
      - label: "no"
        value: "no"
        tooltip: No, this edit appears to have been made in bad-faith.
      - label: unsure
        value: unsure
        tooltip: >
          It's not clear whether or not this edit was made in
          good-faith.
A server running in WMF Labs would make sources of rev_ids available. The above configuration describes a campaign. The gadget running in the user's browser will have a hard-coded campaign list page (e.g. en:User:EpochFail/Revcoding/CampaignList.js). The campaigns listed there will appear on gadget users' Special:UserContribs page. The WMF Labs server will be responsible for (1) delivering the campaign definition (described above) and (2) tracking, delivering, and accepting submissions from work sets. --EpochFail (talk) 17:48, 19 October 2014 (UTC)
Decided to hack together a quick diagram.
Server architecture. The server architecture for the revscores system is presented.
I've realized a problem: when a user requests or submits a coding, how does the server know who they are? I wonder if we can get an OAuth handshake in here somehow. If we open a popup window to the server that performs the OAuth handshake and sets up a session with the user's browser, then subsequent requests will be identifiable. So... that means that a logged-in Wikipedia editor could be a logged-out Revcoder. Here's what that might look like:
Handcoder home (logged out). A mockup of the handcoder home while logged out.
It might be worth allowing users to add a note about a specific revision when reviewing it, mainly when the user is "unsure" about the correct label.
Maybe we could save a click per review by not having a submit button? Then, when the user clicks to select the second label, the review is also submitted to the system.
Done Use colors in the vertical bars to indicate whether the revision is damaging or not, good faith or not, etc. (the bottom half of the bar could be used for one feature and the upper half for the other)
This was implemented in the gadget by splitting each vertical bar into blocks (one for each field).
Not done Move the "unsure" button to the middle (yes, unsure, no), so the ordering of the "scale" is more intuitive (+1, 0, -1)
This does not scale to non-binary things (e.g. article quality class). However, moving the unsure option into a separate checkbox makes sense, as it requires the user to make their best guess while still indicating that there are doubts. Helder 18:27, 30 January 2015 (UTC)
A mockup of a revision diff hand-coder interface. Here is a screenshot of the interface as implemented in the gadget. Some notes:
Each vertical rectangle corresponds to a revision and is split into boxes, where the boxes in the first row correspond to "Damaging" and those in the second row correspond to "good-faith".
New boxes are added below the two current boxes of each revision
New styles (e.g. colors) need to be added manually to the CSS file.
Maybe it is a good idea to provide a default set of 10 colors which would be used for, say, the 10 first options of a field.
I'm treating "unsure" as being different from "not evaluated yet", and I assume "unsure" would be stored as an actual value in the database
The workset wraps automatically when the browser window is too small.
I gave an example of how to get a dataset of rev_ids from recent changes
The buttons do not do anything yet, but they would use CORS to make API calls to e.g. Danilo.mac's prototype on Labs, to store the values provided by a user.
Update: The submit button updates the progress bar with the values for the current diff (selected using the other buttons), and in the future it would use CORS to make API calls to e.g. Danilo's prototype on Labs, to store the values provided by a user. Helder 01:05, 14 January 2015 (UTC)
Update 2: The submit button updates the progress bar with the values for the current diff (selected using the other buttons), and uses JSONP to make API calls to Danilo's prototype on Labs, to store the values provided by the user. Helder 18:20, 22 January 2015 (UTC)
This looks very good to me. I appreciate that you are thinking about how the visualization will extend for more fields. I'm worried about entering into some crazy visuals if someone adds a lot of fields to the form, but then again, they probably shouldn't have that many fields. --EpochFail (talk) 17:24, 20 January 2015 (UTC)
Is it safe to publish openly the quality ratings we are going to collect? I mean, while we learn how vandals operate, those vandals will learn (by reading such datasets) how we learn from them, and will be able to find new ways to stay hidden in the ocean of approved revisions. OK, it's more of a sci-fi question than a practical issue. --Jonas AGX (talk) 00:44, 17 November 2014 (UTC)
@Jonas AGX: I think so, yes. It is common knowledge what vandalism is, and I don't think vandals will change anything in their behaviour just because they know that we consider their actions disruptive. Helder 18:28, 22 January 2015 (UTC)
Is somebody developing something for the database? I can try to make an API in toollabs:ptwikis to collect the data sent by the gadget in a database. I have already made a tool that registers data -- a tool for voting on the latest WLE photos, where the votes are registered in a database -- but it doesn't use OAuth. I still have to learn how to use OAuth, and as ptwikis is a tool for Portuguese projects, I will initially make this for ptwiki only, ok? (sorry for the bad English) Danilo.mac talk 02:47, 26 November 2014 (UTC)
I don't think we want to use RestBASE (the stuff I was discussing with Gwicke) to store our training set data. We'll probably want to maintain our own system, and the testing that Danilo.mac is doing is helping us get that up and running. Do you guys have any design docs put together yet? I have some mockups that I'd like to share. --EpochFail (talk) 17:24, 20 January 2015 (UTC)
Shouldn't we use radio buttons (see the WMF living styleguide) instead of button groups? With radio buttons, we could just use the class "mw-ui-radio" and the selected option in each group would be styled automatically, while with buttons we would need to define some new class with styles for the selected buttons. Helder 19:54, 12 January 2015 (UTC)
Or, depending on the kinds of fields that we will have, checkboxes instead of radio buttons, to allow multiple values (tags) for a single revision... Helder 23:03, 12 January 2015 (UTC)
+1 for following whatever standards exist in MediaWiki. Otherwise, I'd like to optimize for usability. Radio buttons are small and hard to click. We can also surround the radio with clickable space. Something like this: [( ) Label ] vs [(*) Label ]. --EpochFail (talk) 17:24, 20 January 2015 (UTC)
I don't see the difference in the example [( ) Label ] vs [(*) Label ], but I usually make the label of radio buttons clickable to make it easier to select an option.
I ask this having in mind the mockup you have on your Google Drive (Revision scoring › coding › Revision handcoder), showing edit types. How would that be stored in the database? What if someone decides to add new options to a field like those? Helder 18:40, 22 January 2015 (UTC)
As I mentioned in the related Trello card [2], Postgres' JSONB type supports this as well as querying and indexing of JSON elements -- though I suspect that we won't really want to index form data directly. --EpochFail (talk) 19:25, 30 January 2015 (UTC)
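For illustration only (this is not the project's actual schema; the table and column names are made up), storing each submitted form as a JSON blob means new fields or options need no schema change. A minimal sketch with SQLite's stdlib driver -- Postgres' JSONB would additionally allow indexing and querying inside the blob:

```python
import json
import sqlite3

# Hypothetical schema: one row per (rev_id, coder), whole form stored as JSON.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE labels (rev_id INTEGER, coder TEXT, form TEXT)")

form = {"damaging": "no", "good-faith": "yes", "tags": ["typo-fix"]}
conn.execute("INSERT INTO labels VALUES (?, ?, ?)",
             (123456, "Helder", json.dumps(form)))

# Options added to a field later don't require altering the table:
form2 = {"damaging": "unsure", "good-faith": "yes", "edit-type": "revert"}
conn.execute("INSERT INTO labels VALUES (?, ?, ?)",
             (123457, "Helder", json.dumps(form2)))

# Reading back, each row's form is decoded independently.
rows = [(rev_id, json.loads(blob)) for rev_id, _, blob in
        conn.execute("SELECT * FROM labels ORDER BY rev_id")]
```

The trade-off matches the comment above: flexibility for free-form fields, at the cost of not indexing form contents directly (which JSONB could do if ever needed).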
Revcoder home mockup. A mockup of the revcoder home is presented.
Hey folks, I created a new mockup for a home for revcoding work. Such an interface could give our volunteer handcoders a window into the system's labeled data needs and may provide easy access to the revision handcoder to add new data.
[propose] buttons would take users to the bug tracker to file a bug.
The "training data" histograms on the right would visually present the recency of available data for training classifiers. More recent data is better.
The [add data] button would load a random sample of recent revisions into the handcoder for the user to process.
I do like this a lot. I just have a minor point about the color blue in the screenshot. It's a tad bit too bright. A tad darker shade would be better, like the color of the lines in the graph. Is this a possibility? -- とある白い猫 chi? 21:03, 31 January 2015 (UTC)
I should just make my mockups black and white. :P But seriously, I wouldn't mind pulling in someone with some visual design experience. In the meantime, I'll make sure you have access to the Google drawing to make changes on your own. --Halfak (WMF) (talk) 17:14, 1 February 2015 (UTC)
I'd like to throw in my two cents on the matter. After Halfak's explanation yesterday, I find his approach quite sound. I think there is great benefit in having a gadget with several campaigns, divided into tasks that expire if people sit on them. I think this approach suits our crowd-sourcing culture at Wikimedia projects better. After all, crowd-sourcing itself is a divide-and-conquer strategy to begin with. -- とある白い猫 chi? 09:04, 21 February 2015 (UTC)
One thing I saw in the Machine Learning course was an application of a linear SVM to spam detection, where we took a list of ~2000 English words (the ones which appeared more than 100 times in a subset of the SpamAssassin Public Corpus) and used the presence/absence of each of their stems in an e-mail as a (binary) feature. So, given a word list of the form (foo, bar, baz, quux, ...) and an e-mail whose text is "Get a discount for bar now!", we would represent the e-mail by a vector (0, 1, 0, 0, ...) whose dimension is the number of words in our list, and which contains ones in the entries corresponding to each word found in the given e-mail. Then we used a set of 4000 examples to train an SVM and tested it on 1000 other examples, getting ~98% accuracy. After that, we also sorted our vocabulary by the weights learned by the model to get a list of the top predictors of spam.
In the context of vandalism detection, top predictors could be used to improve the lists of badwords used by abuse filters, Salebot and similar tools which do not use machine learning. In the specific case of the Salebot list, we could even use the learned weights to fine tune the weights used by the bot.
This approach differs from the one currently in use in Revision-Scoring, where we just count the number (and the proportion, etc.) of badwords added in a revision. It is as if all words in the list had the same weight, which doesn't seem quite right. Helder 13:10, 6 January 2015 (UTC)
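The binary representation described above can be sketched in pure Python (the vocabulary and e-mail are made-up examples; a real pipeline would also stem the words):

```python
# Hypothetical vocabulary of frequent words, in a fixed order.
vocabulary = ["foo", "bar", "baz", "quux", "discount", "now"]

def binary_features(text, vocabulary):
    """Return a 0/1 vector: entry i is 1 iff vocabulary[i] occurs in text."""
    words = set(text.lower().replace("!", "").split())
    return [1 if word in words else 0 for word in vocabulary]

vector = binary_features("Get a discount for bar now!", vocabulary)
# -> [0, 1, 0, 0, 1, 1]

# After training a linear model on such vectors, sorting the vocabulary by the
# learned per-word weights yields the top predictors -- exactly the trick that
# could be reused to assign individual weights in a badword list, rather than
# treating every listed word equally.
```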
These are a few papers on vandalism detection which might be of interest to us:
Khoi-Nguyen Tran & Peter Christen. Cross Language Learning from Bots and Users to detect Vandalism on Wikipedia. 2014.
Santiago M. Mola Velasco. Wikipedia Vandalism Detection Through Machine Learning: Feature Review and New Proposals. 2012.
Jeffrey M. Rzeszotarski & Aniket Kittur. Learning from history: predicting reverted work at the word level in Wikipedia. 2012.
Andrew G. West & Insup Lee. Multilingual Vandalism Detection using Language-Independent & Ex Post Facto Evidence. 2011.
Kelly Y. Itakura & Charles LA Clarke. Using Dynamic Markov Compression to Detect Vandalism in the Wikipedia. 2009.
F. Gediz Aksit. Wikipedia Vandalism Detection using VandalSense 2.0. 2011.
Sara Javanmardi & David W. McDonald & Cristina V. Lopes. Vandalism detection in Wikipedia: a high-performing, feature-rich model and its reduction through Lasso. 2011.
B. Thomas Adler & Luca de Alfaro & Santiago Mola-Velasco & Paolo Rosso & Andrew G. West. Wikipedia vandalism detection: Combining natural language, metadata, and reputation features. 2011.
Koen Smets & Bart Goethals & Brigitte Verdonk. Automatic vandalism detection in Wikipedia: Towards a machine learning approach. 2008.
Martin Potthast & Benno Stein & Robert Gerling. Automatic Vandalism Detection in Wikipedia. 2008.
Jacobi Carter. ClueBot and Vandalism on Wikipedia. 2007.
There is also this one, about PR and ROC curves:
Jesse Davis & Mark Goadrich. The relationship between Precision-Recall and ROC curves. 2006.
Hey folks,
We've been doing a lot of hacking over the holiday season. Since the last update, we:
Cleaned up the structure of the Feature Extractor so that it is easier to use. [3]
We chose a name for the system that will serve revscores -- The Objective Revision Evaluation System -- or ORES. [4]
We built an ipython notebook demonstrating the use of our LinearSVC scorer. [5]
We implemented file writing and reading for the scorer models so that storing and sharing models will be easy [6]
Danilo tested a set of classifiers for detecting reverts in PTwiki and showed some success. We'll likely need to refine a bit.
Part of that refining will be using modifiers on our input data. For that, we've implemented "modifiers" as Pseudo-Features that can be used in training classifiers. See [7].
We've read User:West.andrew.g's research on building machine learning models of damage and gathered a list of related work to pick through -- see #Papers. [8]
I tested an installation of the revscoring system and dependencies on Ubuntu 14.04 with Python 3.4.0. See my work log below.
# First thing I would like to do is set up a Python virtual environment. I like to store my virtual
# environments in a "venv" folder, so I'll make one in my home directory.
$ mkdir ~/venv
$ cd ~/venv
# Regretfully, pip is broken in the current version of venv, so we have to install it manually.
~/venv$ pyvenv-3.4 3.4 --without-pip
# Before we try to install pip, we'll need to activate the virtualenv
~/venv$ source 3.4/bin/activate
# Then use the installer script to install the most recent version
(3.4) ~/venv$ wget -O - https://bootstrap.pypa.io/get-pip.py | python
# Now we have pip in our venv, so we can install our dependencies.
(3.4) ~/venv$ pip install deltas
(3.4) ~/venv$ pip install mediawiki-utilities
(3.4) ~/venv$ pip install nltk
(3.4) ~/venv$ pip install numpy
# numpy failed because we are missing some headers for Python -- install them and try again
(3.4) ~/venv$ sudo apt-get install python3-dev
(3.4) ~/venv$ pip install numpy
# OK, back to the list
(3.4) ~/venv$ pip install pytz
(3.4) ~/venv$ pip install scikit-learn
(3.4) ~/venv$ pip install scipy
# Installing scipy fails due to missing libraries
(3.4) ~/venv$ sudo apt-get install gfortran libopenblas-dev liblapack-dev
# And now we try again
(3.4) ~/venv$ pip install scipy
# Now, before we get going, we should download the NLTK data we need.
(3.4) ~/venv$ python
>>> import nltk
>>> nltk.download()
Downloader> d
Identifier> wordnet
Downloader> d
Identifier> omw
Downloader> q
>>> ^D
# OK, now it is time to set up the revscoring project. I like to pull all my projects -- whether library
# or analysis -- into a "projects" directory.
(3.4) ~/venv$ mkdir ~/projects/
(3.4) ~/venv$ cd ~/projects
(3.4) ~/projects$ git clone https://github.com/halfak/Revision-Scoring revscoring
(3.4) ~/projects$ cd revscoring
(3.4) ~/projects/revscoring$ python demonstrate_extractor.py
# And it works!
pip install numpy fails with RuntimeError: Broken toolchain: cannot link a simple C program, but sudo apt-get install python3-dev fixes it.
pip install scikit-learn fails with "sh: 1: x86_64-linux-gnu-g++: not found", so sudo apt-get install g++ needs to be executed first; after that, it reports "Successfully installed scikit-learn-0.15.2".
Before running pip install scipy, it was necessary to run sudo apt-get install liblapack-dev (due to numpy.distutils.system_info.NotFoundError: no lapack/blas resources found) and sudo apt-get install gfortran (due to error: library dfftpack has Fortran sources but no Fortran compiler found). The package libopenblas-dev was not necessary.
Before cloning the repository, I had to install git too: sudo apt-get install git
@Danilo.mac: I think your suggestion would be perfect if the blue circle were in the center of the white circle and the whole gear at the bottom were the same color (red/brown). Helder 18:53, 23 January 2015 (UTC)
I particularly like the animated version. I think different aspects of our project can have different logos. For instance, this animation could be used as the "loading" animation, since I imagine some queries will take time. Or perhaps it could be the logo of revscores itself. -- とある白い猫 chi? 11:57, 26 January 2015 (UTC)
@Helder: Thanks for fixing that! I'm not sure if my !vote counts, but I personally like Danilo.mac's (the 3rd one) because it apparently symbolizes the tool by a robot eyeball monitoring products. whym (talk) 11:05, 29 January 2015 (UTC)
So we agreed on this as the logo for ORES (Objective Revision Evaluation Service) for the time being. I want to explain the symbolism/story behind it.
First of all, the abbreviation of our system's name forms an acronym with the plural of the word "ore". Because we datamine raw data (data ore, if you will), we felt this name fit our system best.
Gold ranks among the most valuable ores. In the 17th century, the Philosopher's Stone was a legendary substance believed to be capable of turning inexpensive metals into gold, and it was represented by an alchemical glyph. Our logo is inspired by this glyph, since the service we intend to provide will convert otherwise worthless ore data (raw data) into gold ore data. Mind that it is still ore for others to process, as this service's main goal is to enable other, more powerful tools.
The idea behind the logo came from User:Mareklug and User:Ekips39 was kind enough to draft two versions of the logo.
We wrote up installation notes for Ubuntu, Mint and Windows 7. [11]
We've worked through the train/test/classify structure so that the whole team is familiar with it [12]
We've done some substantial work testing and refining our classifiers on real world data [13]
In a somewhat contrived environment, I've been able to demonstrate 0.85 AUC on English Wikipedia reverts -- which puts us on par with STiki. --16:32, 19 January 2015 (UTC)
Hey folks. Again, I'm late with this report. You can blame the mw:MediaWiki Developer Summit 2015. I talked to a lot of people about the potential of this project there and learned a bit about operational concerns related to service proliferation. Anyway, last week:
Hey folks. Progress report time. Here we go. During the last week, we:
Filed a bug against NLTK to add a stemmer for Turkish [42]. They would like us to submit a pull request [43], so that will need to wait.
We dug into the work of creating models for the Turkish and Azerbaijani wikis by creating a test corpus of reverted revisions [44]
We performed some refactoring of the scoring system so that it can handle multiple models at a time and share features across them. [45] This provides substantial performance improvements when scores are requested from multiple models. We also fixed some bugs in the dependency solver to improve caching behavior. [46]
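A toy illustration of why sharing features across models helps (this is not revscoring's actual API; all names here are made up): each feature is extracted at most once per revision and cached, so a second model requesting overlapping features costs almost nothing extra.

```python
class FeatureCache:
    """Compute each feature at most once per revision (illustrative only)."""
    def __init__(self, extractors):
        self.extractors = extractors  # feature name -> function(revision)
        self.cache = {}
        self.computed = 0  # counts actual extractions, for demonstration

    def get(self, name, revision):
        key = (name, revision["rev_id"])
        if key not in self.cache:
            self.cache[key] = self.extractors[name](revision)
            self.computed += 1
        return self.cache[key]

# Hypothetical extractors (real ones would parse diffs, metadata, etc.).
extractors = {
    "chars_added": lambda rev: len(rev["text"]),
    "is_anon": lambda rev: rev["user"] is None,
}
cache = FeatureCache(extractors)
rev = {"rev_id": 1, "text": "some added text", "user": None}

# Two "models" request overlapping feature sets...
damaging_input = [cache.get(f, rev) for f in ("chars_added", "is_anon")]
goodfaith_input = [cache.get(f, rev) for f in ("chars_added",)]
# ...but each feature was extracted only once.
```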
We also refactored languages so that they are expressed as sets of language utilities. [47] This allows a language to be partially specified but still useful.
We implemented advanced operator modifiers (*, /, min, max, log, ==, !=). [48] These allow one to express compound features in an intuitive way.
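The compound-feature idea can be sketched as follows (an illustration under assumed names, not revscoring's actual classes): feature objects overload operators, so arithmetic on features lazily builds new derived features.

```python
import math

class Feature:
    """A named, lazily evaluated feature (illustrative only)."""
    def __init__(self, name, fn):
        self.name, self.fn = name, fn

    def __call__(self, rev):
        return self.fn(rev)

    def __truediv__(self, other):
        # Dividing two features yields a new feature, evaluated on demand.
        return Feature("({} / {})".format(self.name, other.name),
                       lambda rev: self(rev) / other(rev))

def log(feature):
    return Feature("log({})".format(feature.name),
                   lambda rev: math.log(feature(rev)))

# Hypothetical base features.
chars_added = Feature("chars_added", lambda rev: rev["chars_added"])
chars_removed = Feature("chars_removed", lambda rev: rev["chars_removed"])

# Compound features read like the math they express:
ratio = chars_added / chars_removed
log_added = log(chars_added)

rev = {"chars_added": 100, "chars_removed": 25}
```

Here `ratio(rev)` evaluates to 4.0 without the ratio ever being stored in the data; the same pattern extends to min, max, == and the other modifiers mentioned above.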
We improved diff algorithm performance so that we can generate those features more quickly. [49]
We implemented partial language utilities for trwiki and azwiki [50], built feature sets for classifying reverts [51], and built revert models for them, finding that we could get relatively high AUC despite the lack of language features. [52]
We translated our Signpost article for the Portuguese Signpost [53].
@Halfak: This is the conversation I had with Raylton some time ago about the revcoder mockups and other aspects of the project (it is in Portuguese, sorry):
chat log
Friday, January 23, 2015: Raylton:
como tá o ieg?
tem coisa que eu possa ajudar (que não demore muito tempo)
?
Helder:
tá indo bem, eu acho
andei prototipando um gadget
Raylton:
huum
Helder:
e testando integração com o Labs
se quiser fazer uns testes rápidos (clicar aqui e ali, de um jeito que
descubra bugs...hehe)
É o primeiro dessa lista: https://test.wikipedia.org/wiki/Special:Preferences#mw-prefsection-gadgets
nele estou testando como poderia ser a interface para permitir que
usuários avaliassem um conjunto de edições como sendo boas/ruins, de
boa/má fé...
Raylton:
<gadget-QualityCoding>?
Helder:
isso
Raylton:
parece que não tem mensagem traduzida né?
Helder:
é... deu preguiça
...pra que os algoritmos de aprendizado de maquina tenham um dataset
sobre o qual serão treindados
pode testar aqui: https://test.wikipedia.org/wiki/Sandbox?diff=0&uselang=pt
depois de ativar o gadget
ele mostra uma barra de progresso, em que cada retangulo corresponde a
uma revisão do conjunto sob análise
o editor responde às opções apresentadas (rotula/tagueia) e submete, e
o gadget abre a próxima revisão pra analise
Raylton:
tem uma coisa
ele lembra a resposta anterior
Helder:
a submissão enviaria algo para o Labs
q por enquanto
vai parar nessa tabelinha que o danilo fez no ptwikis: https://tools.wmflabs.org/ptwikis/dev/Pontua%C3%A7%C3%A3o
Pois é... notei essa memória depois de uns tantos testes.
Raylton:
é ruim ele lembrar a resposta anterior
Raylton:
nesse pequeno teste gostei da interface
Helder:
to usando mw.ui
Raylton:
embora eu não goste de boa vs má fé
Helder:
do conceito ou da terminologia que foi usada para descrevê-lo?
Raylton:
o unico problema que enxerguei até agora é a memoria. que pode trazer
falsos positivos
Helder:
concordo
enquanto era só eu q tinha pensado nisso, fui adiando... mas vou ver
se apago a memória dele....
</momento MIB>
Raylton:
é que boa fé assume que nós conseguimos medir eficientemente a fé dos usuários.
Raylton:
na verdade o boa fé vai ser impregnado de moralismos. tipo.. palavras
improprias serão consideradas má fé e erros mais elaborados serão
considerados boa
Helder:
sim, tem uma dose de subjetividade
Raylton:
uhum
Helder:
ei, o Aaron fez uma apresentação em um encontro meio interno do povo
da wikimedia
sobre o IEG...
acho q deve ter no youtube]
Raylton:
eu acho que o melhor essa etapa de bater o olho pra mim seria útil
algo do tipo edição "é construtiva?" []sim []não
"precisa de melhorias?"
[]sim (quais) []não
Helder:
Isso me lembra que ainda não arranjei um lugar pra permitir notas
(texto livre,por extenso)
Raylton:
era bom que nessa parte do sim tivesse além de texto livre umas "tags"
mais comuns
tipo aquelas do wordpress
Helder:
Ei, eu agradeceria muito se vc pudesse colocar essas suas impressões
mais importantes do que melhorar em uma talk page no meta-wiki onde
estamos discutindo esses mockups
ah, o Aaron fez algo do gênero em um dos mockups dele
(não sei bem qual)
Raylton:
ponho sim. só não sei bem se conseguir me engajar na discussão.
Raylton:
mas como uma pre opinião... vc acha que essa abordagem faz sentido?
a que eu propus?
Helder:
eu cheguei a comentar algo com o Aaron sobre a sugestão de permitir
indicar "tipos de melhorias"
Raylton:
(i mean... vc preferia usar isso ou outro)
?
Helder:
e acredito que acabaremos tendo um campo de texto livre para coletar
esse tipo de coisa, e eventualmente, se surgirem padrões de sugestões,
talvez colocar tags específicas pra sugestõe mais comuns
e houve uma outra coisa que discutimos, mas que ainda não coloquei no
gadget, que é separar o "não tenho certeza" da lista de opções
binárias
[Sim][Não]
[ x ] Não tenho certeza
Helder:
em vez de permitir que a pessoa não escolha "sim" nem escolha "não"
(escolhendo "não tenho certeza" em vez delas), será mais útil que a
pessoa seja "forçada" a escolher entre "sim" e "não", mas com a opção
de dizer que não estava muito confiante da avaliação que fez (marcando
"[x] não tenho certeza")
Raylton:
isso seria uma outra questão né?
Helder:
cada aspecto de uma edição que estivesse sendo analisado (construtiva?
de boa fé? classe de qualidade) teria o seu próprio "[ ] não tenho
certeza disso"
Raylton:
acho complexidade desnecessária
Helder:
ficou mais simples, em vez de complicado
ou melhor,
não mudou a complexidade
só aumentou a utilidade de avaliações "incertas"
Escolher "não sei" em
[sim][não][não sei]"
não diz nada que seja útil para os algoritmos, mas dizer
"Sim" e não "tenho certeza", ou dizer "não" e "não tenho certeza", é útil
Raylton:
entendi
Helder:
typo: 'e não "tenho certeza"'
--> 'e "não tenho certeza"'
Raylton:
mas do ponto de vista de usuário eu preferia um botão pular
na verdade esse tenho certeza ou não só funcionaria com coisas menos subjetivas
boa ou má fé é impossível de definir com certeza
Helder:
"seu filho da ..." é fácil de ter certeza
e não tem problema se a maioria das avaliações em relação a esse
critério "boa fé" forem sem certeza
alguns algoritmos tb trabalham com probabilidades de que algo seja
sim, ou seja não...
Raylton:
não... isso é uma tipificação maniqueísta de boa fé. para pessoas que
usam palavrão em seu cotidiano ou que não encontraram uma seção de
comentários com tanto destaque que o botão editar pode ter feito isso
de boa fé... como é possível medir a fé das pessoas hehe
?
Helder:
os casos em que os humanos ficam em dúvida, o algoritmo tb ficará
e nos casos em que eles tiverem certeza, o algoritmo tb terá mais certeza
o algoritmo só aprendera a chamar de boa fé o que as os editores em
geral chamarem de boa fé...
Raylton:
minha questão com o termo é que ele é focado no sujeito da ação... e o
objeto de analise naturalmente não deve se esse e sim a ação em
questão... no caso a edição
não existem edições de boa fé pq edições não tem fé
exitem edições com problemas
Helder:
já conhece/testou/ouviu falar do Snuggle e do Stick (não lembro qual dos agora)?
Raylton:
isso é uma analise objetiva
(na minha opinião)
(já ouvi sim)
Helder:
um deles faz uma certa triagem dos novos editores, para que os de boa
fé (os que serão mais úteis) possam ser melhor recebidos/tutorados
e o que está por trás disso é um algoritmo treinado em um conjunto de
edições rotuladas entre de boa e de má fé...
(se eu me lembro bem)
e como a API q pretendemos construir, deve servir como uma forma
unificada de fornecer dados para implementar esse tipo de ferramenta
em outras wikis, precisaríamos ter esse tipo de avaliação feito para
as wikis que estiverem interessadsa
se, por exemplo, no Wikilivros não quisermos isso... é só não
ativarmos esse critério lá..
Raylton:
não tem problema em definir um editor de boa ou má fé com base a seu
numero de edição problematicas ou ruins... mas a edição não devia ser
chamado de má fé... mas sim um editor que é recorrente em algum grau
em praticar tais ações devia ser
Helder:
ah, mais aí é só uma questão de mudar a redação da mensagem que
aparece na interface, não?
"edição feita com boa fé" em vez de "edição de boa fé"
ou até "editor bem intencionado"?
sei lá
[não lembro qual o texto atual]
Raylton:
não
nada de edição de boa fé
edição = tem ou não tem problemas
usuários que causa muitos problemas = má fé
Helder:
"esta edição é típica de editores de má fé?"
Raylton:
mas qual a necessidade de assumir a má fé do editor gente do céu
Helder:
"se fosse pra vc chutar se este editor tem agido de forma construtiva,
com base exclusivamente nesta edição atual, o que diria?"
Raylton:
é a primeira edição do cara
Helder:
não...
é uma edição aleatória tirada do histórico (dump)
Raylton:
se ele for avisado que não é assim e ele seguir aí sim tem má fé
aleatória incluindo as primeiras edições
cento?
se sim então minha inquietação ainda e valida
Helder:
se der sorte...
dependendo do tamanho da amostra de edições que pegarmos para
analisar, poderão aparecer mais ou menos "primeiras edições" no
conjunto
de treino
se for pequeno, é possível que não apareça nenhuma
(mas conjuntos pequenos não são muito uteis para treinar os algoritmos)
então, sim, provavelmente terá primeiras edições no conjunto
Raylton:
se fosse pra eu chutar eu descreveria o problema e deixaria o
algoritimo escolher com base na gravidade e na recorrencia quem tem
boa ou má fé...
naturalmente uma unica edição é uma taxa de amostragem muito baixa pra
tirar uma conclusão
da apenas pra ter um palpite
um chute como vc disse
Raylton:
vc entendeu o que quero dizer?
faz sentido?
Helder:
acho q sim
Raylton:
pq poxa... saber que minha avaliação vai definir coisas como de boa ou
má fé de forma tão subjetiva seria bem ruim
ainda mais considerando que o objeto de analise não tem fé...
hehe
esse boa fé seria como um argumentum ad hominem
pronto
agora consegui uma boa analogia
Helder:
exceto que não necessariamente mostraríamos quem e' o tal homem (autor
da revisão) na hora de avaliar uma revisão do conjunto
Raylton:
e por isso mesmo a fé não deveria estar em questão
nessa conjuntura específica
Helder:
em resumo, acho que você considera o campo relacionado à "boa fé"
bastante questionável, e as coisas que disse se aplicariam em uma
discussão onde certa wiki estivesse decidindo se utilizaria esse campo
ou não. Enquanto que o campo relacionado a ser "construtiva" não
estaria sujeito às mesmas críticas, por ter relação apenas com o
conteúdo das revisões.
Raylton:
sim...
exatamente
Helder:
MAAAS..
Raylton:
e aqui estamos problematizando a relação do boa ou má fé com o conteúdo
Raylton:
mas existem questões relacionadas ao uso do termo em usuários que
talvez deixasse a conversa muito confusa caso ou citasse
Helder:
SE o algoritmo perceber que grande parte das edições feitas por
anônimos costumam ser de ma fé, ele dará um peso baixo a outros
features como "página está no domínio principal ou não"
Raylton:
tem dois problemas com esse termo além a relação com o objeto de
analise (como já tratamos),
E aqui estou falando apenas dos problemas de usar isso em usuários...
já que já abordamos os problemas de usar em edições.
O primeiro é que não é concreto.
o segundo é que bom e mal é que é maniqueísta.
por isso acho que a definição mais sensata pra mim é. "Usuários
potencialmente destrutivos"
ou algo do gênero
ou até usuários destrutivos
em caso de termos mais certeza das da recorrencia da destrutividade das edições
critérios pra definir o qual destrutivo é o editor poderiam ser
fez alguma edição destrutiva?
Qual a gravidade da edição destrutiva?
Foi avisado da edição destrutiva?
A ação destrutiva é recorrente depois do aviso?
"o quão"
ou em termos mais de banco de dados
Destrutiva-gravidade
Destrutiva-foi avisado
Destrutiva-recorrente
Ps: a gravidade dependeria daquele campo que descreve o problema na página
edição*
Helder:
but again, all of this seems to concern only the text we should
put in the interface that will be shown to whoever evaluates a
set of edits
because for the algorithms it doesn't matter what interpretation
humans give to the classes (yes/no) of a classification problem
Raylton:
yes.. from the start this has been a discussion about user experience
the implementation would be trivial in this context...
but it would probably undergo changes too
Helder:
given revisions r1, r2, r3, ..., r100000, with their various attributes
(the ~40 features from the folder I linked above), and the respective
labels (yes/no) indicating the class each revision belongs to, the
algorithms just learn to perform an analogous classification for new
revisions (which weren't in the set) as similarly as possible to
the way humans would do it, but there is no
interpretation involved.
there is nothing subjective, from the algorithm's point of view. It
will simply find the best model that fits the data provided by
humans, so that they can do whatever they like with the
automatic classifications that can be obtained from the
(already trained) algorithm
Things like: Huggle filtering Recent Changes to show only
the most likely vandals, or the most likely acts of vandalism, etc..
But, of course, whatever the interpretation of the classes is, it has
to be used consistently in both stages: the classification of N
revisions by humans, and the predictions provided by a trained
algorithm. Otherwise, whoever uses the data we provide through our API
could draw unwarranted conclusions about what a property like
"good faith: 0.85" means
Raylton:
note: every time I said object of analysis, I was referring to the
analysis by the user who will provide the data
by subjective I meant in the classification
Helder:
you've just confused me
Raylton:
it's objective whether or not the user considers the edit to be in
good faith... but the term good faith is not objective
Helder:
object of analysis = the editor who made one of the edits that
happened to be sampled for evaluation by a human who will help us
provide data for the algorithms?
Raylton:
it's abstract
or at least it's not concrete
look, when I say analysis it probably has nothing to do with the
mathematical definition of analysis
since I'm a layman
Helder:
[I don't think I brought the meaning of "mathematical analysis" into the conversation]
Raylton:
the object of analysis I was talking about was:
Editor [ ]
Edit [ ]
for an edit we can't invoke good faith because it's kind of
impossible to define the faith of an edit...
not as objectively as we can define whether or not an edit has
a problem
or which problem it has
Helder:
yes, but I think we already agree on that
faith is an attribute of editors, not of edits
Raylton:
and in the case of users it couldn't be used, first and foremost
because we should assume good faith on the part of editors, and also
because faith remains abstract in any context involving really
tangible things... and second, if we were to receive data about the
nature of the edits, it would make more sense to say that the editor
tends to make good edits or not, or else what the severity or
recurrence of their bad edits is
well... just now I was questioning the term faith as applied to editors
hence the summary of the first part
in theory, if it can be used at all, it's for editors... but I don't
think it's useful to use it for editors either
Helder:
"tends to make good edits" is just "one more" feature to be
implemented, and once implemented it can perfectly well be included
among those used to train the algorithms
Raylton:
well then... if we're talking about destructive or non-destructive,
we can talk about destructive or non-destructive users, or even about
degrees of destructiveness of the edit or the editor
and we can take faith out of the picture
Helder:
but you'll need a set of labeled edits in order to make
any prediction about it
Raylton:
the point is, if faith can't be measured, what's the benefit of
continuing to use the term
we could have destructive or non-destructive edits and users
and degrees of that
with a single edit not defining an editor as destructive,
but rather a combination of factors
Helder:
Pick any question whose possible answers are "yes" and
"no", and label some 10,000 items as "yes" or "no" with respect to
that question. Once that's done, the algorithms can answer "yes" or
"no" to the same question about other items in a way very close to
how humans would answer. You just need to provide the algorithms
with "features" that could be used (by a human) to answer "yes" or
"no" to that question about the items correctly.
What you seem to be saying is that the metadata of the **edits** is
not sufficient to decide whether the correct answer about an
**editor** is "yes" or "no". But then all that's needed is to
implement new features that make it possible (for a human) to decide
that, and the algorithm will imitate the human's performance on the
same classification task.
Such new features can be things like the editor's number of reverts,
the number of warnings they have already received, etc
(the editor's reverts = their edits that were reverted by others)
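The workflow described above (humans label items yes/no; an algorithm imitates the answers using features) can be sketched in miniature. The feature vectors, labels, and the nearest-neighbor rule below are all invented for illustration; the project's real models are more sophisticated:

```python
import math

# Invented feature vectors for human-labeled revisions:
# [words_added, badwords_added, uppercase_letters_added]
labeled = [
    ([12, 0, 1], "no"),    # ordinary copyedit
    ([3, 2, 40], "yes"),   # shouting plus profanity
    ([55, 0, 3], "no"),    # substantial content addition
    ([1, 3, 0], "yes"),    # profanity inserted
]

def classify(features):
    """Answer the same yes/no question for an unseen revision by copying
    the label of the most similar labeled revision (1-nearest-neighbor)."""
    _, label = min(labeled, key=lambda item: math.dist(item[0], features))
    return label

print(classify([2, 3, 35]))  # → yes (resembles the labeled "yes" examples)
```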
Raylton:
yes... I believe that's it
Helder:
and.. by the way
every year academic research papers are published on which
features would be good for predicting certain things (answering
certain questions) automatically
Raylton:
for me it would be pretty much impossible to decide the faith of an
edit and very hard to decide the faith of an editor... but it would be
easy to decide whether an edit has problems and whether a user is
prone to making problematic edits
Helder:
in the papers I've been reading there are tables showing how well each
feature can predict the answer to certain questions, which ones
best predict vandalism, etc...
Raylton:
features are things to better determine (for humans) a single
positive or negative answer, right?
Helder:
examples of "features" of an edit:
number of words
number of swear words inserted
number of uppercase letters inserted
etc
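A toy illustration of computing these example features from the before/after texts of an edit. The badword list and texts are made up; the project's real extractors are far more careful:

```python
import re

# An invented badword list; the project generates real ones per language.
BADWORDS = {"idiot", "stupid"}

def edit_features(old_text, new_text):
    """Compute the three example features for an edit from its
    before/after revision texts."""
    old_words = re.findall(r"\w+", old_text.lower())
    new_words = re.findall(r"\w+", new_text.lower())
    return {
        "words_added": max(len(new_words) - len(old_words), 0),
        "badwords_added": max(
            sum(w in BADWORDS for w in new_words)
            - sum(w in BADWORDS for w in old_words), 0),
        "uppercase_added": max(
            sum(c.isupper() for c in new_text)
            - sum(c.isupper() for c in old_text), 0),
    }

print(edit_features("A fine article.", "A fine article. YOU ARE stupid!"))
```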
Raylton:
but those are the ones detected by computer
Helder:
the most recent papers show statistics for models trained with
about 60 features
Raylton:
those would serve to suggest things to humans
for example... a machine could detect this and tell me the
tendency of the edit
and I would confirm it or not
Helder:
given that these are "machine learning" problems, it's only natural
that they be things that can be computed by a computer
Raylton:
yes
Helder:
"those would serve to suggest things to humans": yes. That sums up
the goal of our IEG.
to build an API of these automated suggestions, which can be used
by humans (and bots, or gadgets, or applications like Huggle) to
do whatever they want with them
Raylton:
hmmm
that's good
and in fact doing whatever you want could even mean that, if the
machine's assessments are accurate enough, I can let them work on
their own in some cases, right?
kind of a reciprocal human-machine exchange
until the machines finally take over everything :)
Helder:
rewriting an earlier sentence: "if the goal is to treat certain
classification problems present in the day-to-day life of the wikis
as "machine learning" problems, in which computers can help humans
in some way, it's only natural that the features used by the
algorithms be things that can be computed by a computer, so that
they can match the performance of humans on the classification
task"
yes to that idea of letting them work on their own in some cases
specifically, the ClueBot NG bot comes to mind https://en.wikipedia.org/w/index.php?title=User:ClueBot_NG#Statistics
Just look at this sentence from that page: "Selecting a false positive rate of 0.25%
(old setting), the bot catches approximately 55% of all vandalism."
one can use the automated predictions precisely so "that the
machines take over everything", at least in the cases where they can
decide the correct answer to a question with great confidence
(say, "do I need to revert this edit?")
Raylton:
got it...
sounds cool
Helder:
depending on how many false positives the community is willing to
tolerate from a bot that uses the API, it will catch more or fewer
acts of vandalism
Raylton:
"ps: done.. still about UX... vandal is a good name for someone who
engages almost exclusively in destructive edits, but not for someone
who made one or two."
Helder:
the definition of "vandal" presupposes "bad faith"+"destruction" (as
opposed to "good faith"+"destruction")
Raylton:
hahaha good faith
recurrence + ignoring warnings + destruction = vandal
Helder:
"good faith" implies "not ignoring warnings"
Raylton:
I'll check the dictionary... but no, that's not it
it seems much more subjective
and if that were it, we couldn't define good- or bad-faith edits... as
I assumed at the beginning
Helder:
why?
Raylton:
because edits don't receive warnings
naturally
Helder:
but the presence of warnings on the talk page of an edit's author is
a piece of metadata of the edit
Raylton:
come on... you know you're stretching it
Helder:
seriously: given a revision you can check who the author was, then
look at the content of their talk page, and see whether there was any
warning there
Raylton:
it's still not the edit that receives the warning, right?
Helder:
(human reverters do this)
Raylton:
but you can't it's not the edit that does or doesn't see the warning
and remains destructive or not
hehe
I thought we had already agreed on this
ignore the "but you can't"
Helder:
but the edit says something about the good or bad intentions of the editor
Raylton:
yes...
I agree
that's what we said above
Helder:
For example, suppose X damaged the content of an article, and
received a warning.
1. If he makes a new edit, and it is constructive, it's a "good sign"
2. If he makes a new edit, but it is destructive, it's a "bad sign"
Raylton:
if x is an editor I still agree
as before
Helder:
And taking this example to the extreme: if there was more than one
warning about content destruction, and the editor still made a new
edit that destroys content, the "bad sign" is more intense
Raylton:
yes...
it all revolves around the editor receiving the warning and the bad
edit being recurrent or not
Helder:
in other words, to solve the problem you have in mind, it's just a
matter of implementing the features that take a revision, look up the
author, inspect their talk page (or contribution history), and count
the number of warnings, how many times they insisted on destructive
actions after the warnings, etc, which are computable things.
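A rough sketch of such a feature, assuming the author's talk-page wikitext has already been fetched. The template pattern is a hypothetical placeholder for whatever warning templates a given wiki actually uses:

```python
import re

# Hypothetical warning-template pattern; real wikis use families such as
# {{uw-vandalism1}} ... {{uw-vandalism4}} on English Wikipedia.
WARNING_TEMPLATE = re.compile(r"\{\{\s*uw-\w+?(\d)\s*\}\}", re.IGNORECASE)

def warning_features(talk_wikitext):
    """Count the warnings on an author's talk page: a computable stand-in
    for the 'was warned, kept going' signal discussed above."""
    levels = [int(m.group(1)) for m in WARNING_TEMPLATE.finditer(talk_wikitext)]
    return {"warning_count": len(levels),
            "max_warning_level": max(levels, default=0)}

sample = "Welcome!\n{{uw-vandalism1}}\nPlease stop.\n{{uw-vandalism3}}"
print(warning_features(sample))
```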
Raylton:
[y] just one note
I think everything revolves around destructive or not
Helder:
yes yes
Raylton:
then, to increase precision, we can add problems
and classify them as destructive or not
or even add weights for their destructiveness
Helder:
the algorithms usually take care of some weights (for the features)
themselves, I don't know if that's the same thing you're thinking of...
Raylton:
I'm thinking of one weight per problem
or one weight per group of problems
but it depends on how useful that turns out to be
but I think a swear word can be more destructive than a spelling
error, for example
Raylton:
by the way... I'm thinking here that problems may eventually be
independent of the destructiveness of the edit
after all, having problems doesn't necessarily mean the edit is destructive
this is the kind of thing to plan out in a nice design with mockups
before implementing
maybe you're already past that stage
but anyway
/me signed up for the machine learning course after hearing Helder talk about it so much...
Helder:
about that type of problem you're talking about, consider for example
the "balance" of references of an edit relative to the previous one:
if it's positive, the editor was adding references to the text (a
good thing); if it's negative, they were removing references
(possibly bad). That's a feature that isn't implemented yet
in that github code
but which, if I'm not mistaken, has already been used in one of the
papers on using "machine learning" for vandalism detection...
yay!!!! I bet you'll like it
Raylton:
hmmm
the number of characters can help determine that
Raylton:
but not by itself, right
Helder:
that feature is already on the list
Raylton:
I can improve the article by removing some things
Helder:
(characters)
Raylton:
got it
Helder:
right... I myself often take the opportunity, when reverting
vandalism, to fix one thing or another too...
that has a disadvantage: it makes it impossible to identify
reverts by comparing the sha1 "hashes" of the revisions
likewise, your improvements that remove things "hurt" the
use of character counts as a measure of some things...
Raylton:
hmmm.. I see... based on this conversation I think I can make a
change to the mockup.. from the user's point of view.
ps... I have a problem with "vandalism"
Raylton:
I can even live with the word vandal, but vandalism falls into that
same good/bad faith issue
I'll try to make a mockup based on this conversation of ours
one thing
is there a user interface for the data obtained from the assessments?
Helder:
not yet
the project on Labs was just created; we still need to write code to put there
Raylton:
man... do the design first
Helder:
Aaron made a mockup of the main page
Raylton:
design first, code later
I learned that at GSoC
it saves a lot
of time and work
Latest comment: 9 years ago · 1 comment · 1 person in discussion
Hey folks,
This week we:
Stood up a dummy revision coder server to develop the gadget against. [64] It returns standard JSON for every request, but the responses are hard-coded.
Wikipedians have proposed other reforms, too. The Wikimedia Foundation is funding research into more robust bots that could score the quality of site revisions and refer bad edits to volunteers for review. Another proposed bot would crawl the site and parse suspicious passages into questions, which editors could quickly research and either reject or approve.
Latest comment: 9 years ago · 1 comment · 1 person in discussion
Hey folks,
$ revscoring -h
Provides access to a set of utilities for working with revision scorer models.
Utilities
* score Scores a set of revisions
* extract_features Extracts a list of features for a set of revisions
* train_test Trains and tests a MLScorerModel with extracted features.
Usage:
revscoring (-h | --help)
revscoring <utility> [-h|--help]
Revscoring utility documentation
This week was another productive one with a lot of tasks coming together.
We centralized the utility scripts that support revscoring within the revscoring project and made a cute general utility to make them easy to work with [77]. We also took the opportunity to make file reading/writing easier in Windows [78].
We deployed a form builder interface for writing new form configurations used in the revcoder.
We implemented a means for extracting all labels from the coder server. [79]
We added a feature to revscoring that prints the dependency tree for a feature. [80] This is useful when debugging dependency issues in feature extraction.
We added a simple revert detector script to the ORES project [81]. This in combination with the centralized revscoring utilities provides automation for training new classifiers.
Latest comment: 9 years ago · 1 comment · 1 person in discussion
Hello all,
I will be filing the progress report for a short while in place of Halfak.
This week we:
We now have a function OO.ui.instantiateFromParameters() which takes some JSON configuration and constructs an OOjs UI field. It also populates a fieldMap with "name"/widget pairs that can be used later. [83] You can get a sense of it by trying our form builder.
We refactored ORES for language specific features to reflect the changes made to Revision Scoring. We also reorganized the features list to both reuse code and to improve on performance and accuracy. [84]
We created a MediaWiki gadget to filter the recent changes feed by reverted score. [85]
We renamed the service Revision handcoder to Wiki-Tagger. [86].
Latest comment: 9 years ago · 1 comment · 1 person in discussion
Hello all,
We updated the ORES server to the newer version of revscoring and included new models. [87]
We renamed Wiki-Tagging to Wiki-Labels. We hope to name our hand coding campaigns with a "Wiki labels foo" format, a bit like Wiki Loves Monuments. [88] We have also defined dependencies for Wiki-Labels. [89]
We have explored additional methods for automatically detecting badwords. [90]
Latest comment: 3 years ago · 12 comments · 4 people in discussion
So, there's been some discussion recently of ORES's performance and how it's not nearly as fast as we would like when requesting that a bunch of revisions get scored. I'd like to take the opportunity to document a few things that I know are slow. I'll sign each section I create so that we can have a conversation about each point.
Looking for misspellings is one of our most substantial bottlenecks. Right now, we're using nltk's "wordnet" in order to look for words in English and Portuguese. This is slow. On some pages, scanning for misspellings can take up to 4 seconds on my i5. That's way too much -- especially because we end up scanning at least two revisions for misspellings. So, I've been doing some digging and I think that 'pyenchant' might be able to help us out here. The system uses your unix-installed dictionaries to do lookups and it is much faster. Here's a performance comparison looking for misspellings in enwiki:4083720:
$ python demonstrate_spelling_speed.py
Sending requests with default User-Agent. Set 'user_agent' on api.Session to quiet this message.
Wordnet check took 3.7539222240448 seconds
Enchant check took 0.008267879486083984 seconds
So, it looks like we can get back 3 orders of magnitude there. It looks like we can get a lot of dictionaries too. Here's apt-get's listing of myspell dictionaries:
James Salsman, you'll need to provide a user_agent argument to the mwapi.Session() constructor. A good "user_agent" includes an email address to contact you at and a short description of what you are using the API session for. E.g. "Demonstrating spell check speed - aaron.halfaker@somedomain.com" --EpochFail (talk) 20:25, 2 September 2020 (UTC)
@EpochFail: <3 How would you change [92] to do that? I'm not sure where the user agent string is set.
Right now, we gather data for extracting features one revision at a time. For a common 'reverted' scoring, we'll perform the following requests:
Get the content of the revision under scrutiny,
Get the content of the preceding revision (lookup based on parent_id)
Get metadata from the first edit to the page (for determining the age of the page, lookup based on page_id, ordered by timestamp)
Get metadata about the editing user (lookup based on user_text)
Get metadata about the editing user's last edit (lookup based on user_text, ordered by timestamp)
One way that we can improve this is by batching all of the requests in advance before we provide the data to the feature extractor. So, let's say we receive a request to score 50 revisions: we would make one batch request to the API for the content of those 50 revisions. Then we would make another batch request to retrieve the content of all parent revisions. I think we can also batch the requests for the first edit to a page (specifying multiple page_ids to prop=revisions with rvlimit=1). We can batch the requests to list=users and list=usercontribs too. We'd then have to use the extractor's dependency injection to fill in these bits for each revision after the fact. For example:
It makes me a bit sad to do this since we don't know whether revision.doc and parent_revision.doc are actually necessary in the code. We might want to provide some functionality at the ScorerModel level to allow us to check this. E.g.
if scorer_model.requires(revision.doc):
    cache[revision.doc] = session.revisions.query(revids=...)
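The batching idea can be sketched as a helper that turns a list of rev_ids into a few combined queries instead of one request per revision. The helper itself is hypothetical; only the parameter names come from the MediaWiki action API:

```python
def batched_revision_queries(rev_ids, batch_size=50):
    """Yield API query parameter dicts that fetch revision content in
    batches, instead of one request per revision. A real extractor would
    send each dict through its API session and distribute the results."""
    rev_ids = list(rev_ids)
    for i in range(0, len(rev_ids), batch_size):
        batch = rev_ids[i:i + batch_size]
        yield {
            "action": "query",
            "prop": "revisions",
            "rvprop": "ids|content",
            "revids": "|".join(str(r) for r in batch),
        }

queries = list(batched_revision_queries(range(1, 121)))
print(len(queries))  # 120 revisions → 3 batched requests instead of 120
```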
Other than the batching of requests, which seems very appropriate, would it help if the system used database access instead of API requests to extract the features? Helder 17:32, 26 April 2015 (UTC)
Right now, there's no caching at all. If a score is requested, it is calculated, returned, and then forgotten. This is sad because we could probably store scores for the entire history of all the wikis in ~50-75GB. We could also make use of a simple LRU cache in memory (e.g. https://docs.python.org/3/library/functools.html#functools.lru_cache). This would work really well for managing the load of the set of bots/tools tracking the recentchanges feed. --EpochFail (talk) 16:29, 26 April 2015 (UTC)
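A minimal sketch of the in-memory option using functools.lru_cache, with a dummy stand-in for the real scorer:

```python
from functools import lru_cache

calls = []  # track how often we actually compute a score

@lru_cache(maxsize=10_000)
def score_revision(model_name, rev_id):
    """Hypothetical stand-in for the real scorer; the cache returns
    previously generated scores without recomputing them."""
    calls.append(rev_id)
    return {"prediction": False}  # dummy score

score_revision("reverted", 4083720)
score_revision("reverted", 4083720)  # served from the cache
print(len(calls))  # → 1
```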
+1 for computing the scores for all revisions of all (supported) wikis and storing them in some kind of database, with associated version numbers to identify which model was used to compute the scores. As a user of the system, I would like to be able to get e.g. a list of revisions whose previous score was a false positive (i.e. a constructive edit scored as vandalism) and whose score generated by a more recent model is now a true negative (similarly for other combinations of true/false positives/negatives). This would allow us to get an idea of how the system is improving over time, and to identify regressions in the quality of the scores we provide for users. Helder 17:32, 26 April 2015 (UTC)
I'd suggest a local install of redis for caching over in-process caching. This ensures that you can restart your process willy-nilly without having to worry about losing the cache. Yuvipanda (talk) 18:10, 26 April 2015 (UTC)
Since we know that the majority of our requests are going to be for recent data, we could try to beat our users to the punch by generating scores and caching them before they are requested. Assuming caching is in place, we'd just need to listen to something like RCStream and simply submit requests to ORES for changes as they happen. If we're fast enough, we'll beat the bots/tools. If we're too slow, we might end up needing to generate a score twice. It would be nice to be able to delay a request if a score is already being generated so that we only do it once. --EpochFail (talk) 16:29, 26 April 2015 (UTC)
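A rough sketch of the precaching loop with in-flight deduplication. The event list and scoring function are hypothetical stand-ins for an RCStream listener and an ORES call:

```python
in_flight = set()
cache = {}

def on_recent_change(rev_id, score_fn):
    """Score a newly seen revision unless a score already exists or is
    currently being generated."""
    if rev_id in cache or rev_id in in_flight:
        return
    in_flight.add(rev_id)
    try:
        cache[rev_id] = score_fn(rev_id)
    finally:
        in_flight.discard(rev_id)

computed = []

def fake_score(rev_id):
    computed.append(rev_id)
    return {"score": 0.1}

for rev in [101, 102, 101]:  # duplicate event for revision 101
    on_recent_change(rev, fake_score)

print(len(computed))  # → 2 (revision 101 is only scored once)
```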
Latest comment: 9 years ago · 1 comment · 1 person in discussion
Hello all,
We decided that I shall carry out the weekly reports from now on.
We had major ongoing work on Wiki-Labels; we intend to have everything up and running by 8 May, when we will have our first hand-coder input from Wiki-Labels. Once this is achieved it will be a milestone for our project.
We held a general discussion on the landing page and also designed the page (w:Wikipedia:Labels). [93]
We wrote general documentation for Wiki-Labels here on meta: Wiki labels. [94]
@Ladsgroup: what is the condition for keeping/removing a string in the list?
I believe it is so subjective, and there are so many criteria, that I don't really know what to do with lists like these. I even kept the list I generated as-is, due to the lack of an objective criterion for removing items from it.
I believe we need to have some kind of labelling effort for adding (multiple) tags to each string in the lists, so that we can have different categories.
I also don't know which would be the common categories, but there are many reasons why an edit adding a given string might be considered damaging:
It talks to the reader (e.g. "you", "go **** yourself", "<someone>, I love you!"), and this is not acceptable in an encyclopedic article
It is related to sex, and the article isn't
It is about a part of the human body (likely inappropriate in an article about Math)
It is in a language other than the article's or wiki's language (if it is a "badword" in that other language, should it be in the list for this language too?)
It is not a word ("lol", "hahaha", "kkkkkk")
It is a personal attack/name calling
It is an acronym only used in informal talk (e.g. on chats)
It is a website or brand (e.g. "easyspace", "redtube")
It discriminates against some group of people, for whatever reason (ideology, beliefs, etc...)
Helder: Thank you for your feedback. It generates a list automatically; it's on us whether we want to use some words and not others. I'm using a more advanced technique to create better results. I will update it; please review it and tell me whether it's improved or not. Best Amir (talk) 23:46, 8 May 2015 (UTC)
I'm sorry. What previous sample? Could you be thinking of the sample we trained on revert/not-reverted? That was also extracted in 2015. It turns out that our test dataset for loading Wiki labels contains campaign names that reference 2014, but that's just test data. --Halfak (WMF) (talk) 15:17, 14 May 2015 (UTC)
Ahh yes. So the last sample was from the last 30 days since that is what the recentchanges table keeps. But the new sample uses the revision table and a whole year's worth of revisions. So, a "year back" from 2015-04-15. --Halfak (WMF) (talk) 16:15, 14 May 2015 (UTC)
Latest comment: 9 years ago · 1 comment · 1 person in discussion
Hello all,
Checking in with the weekly report.
Wikilabels bugfix: We had a strange bug where the full screen button appeared twice. Upon further investigation, this was because global.js was being loaded twice. We modified our code to prevent double-loading of the UI. [108]
We generated bad word lists for the az, en, fa, pt and tr wikis. Unlike the previous lists, these are generated from known bad revisions. [109]
We have updated the translation for Portuguese. [110]
We have conducted some maintenance and administrative work concerning Wikilabels. [111]
Latest comment: 9 years ago · 1 comment · 1 person in discussion
Hey folks,
I've been working with Yuvipanda to work out some performance and scalability improvements for ORES. I've captured our discussions about upcoming work in a series of diagrams that describe the work we plan to do.
Basic flow. The basic ORES request flow. All processing happens within a single thread (limited to a single CPU core). No caching is done.
Basic + caching. The basic flow augmented with caching. All processing is still single-threaded, but a cache is used to store previously generated scores. This enables a quick response for requests that include previously generated scores.
Basic + caching & celery. The basic flow augmented by caching and celery. Processing of scores is farmed out to a celery computing cluster. Re-processing of a revision is prevented by tracking open tasks and retrieving AsyncResults.
My plan is to work from left to right, implementing improvements incrementally and testing against the server's performance. I've already been doing that, actually, as we have been implementing improvements along the way. Right now, the basic flow has seen substantial improvements in the misspellings look-up speed and request batching against the API.
Server response timing. Empirical probability density functions for ORES scoring time, generated using the 'reverted' model for English Wikipedia and 5k revisions in batched 50-revision requests. Groups represent different iterations of performance improvements for the ORES service.
Day 1: Lots of hacking, pywikipediabot meeting, first ever in person meeting. [121], [122]
Day 2: More hacking, Hackathon hack session (T90034). [123][124]
Day 3: Even more hacking, added French language specific utilities to revision scoring, live demo of French language of revision scoring at closing showcase. [125]
We added French language specific utilities. Special thanks goes to fr:User:Paannd a for the French translation and review of the French bad word list.[126]
Latest comment: 9 years ago · 2 comments · 2 people in discussion
Hello all,
Checking in with the weekly report.
We refactored revision scoring dependency management. While this did not make a difference in the interface, it has improved performance and management. A total of 46 files were modified in some capacity with 847 added lines and 703 removed. [127][128]
Latest comment: 8 years ago · 3 comments · 2 people in discussion
Hello User:Halfak (WMF), pinging you after your Wikimania talk. Super interesting stuff! I asked about direct database access to a full set of article quality scores. I'm maintaining the WikiMiniAtlas, for which I need a metric to prioritize articles shown on the map at a given zoom level. I'd prefer to show high-quality articles as prominently as possible (my current ranking is just based on article size). --Dschwen (talk) 15:47, 19 July 2015 (UTC)
Hi Dschwen! We're looking into it now. If you are just sizing things in the interface, I wonder whether requesting scores from ORES would work for you in the short term. In the long term, we have a phab ticket that you can subscribe to. See Phab:T106278. We're currently working out what the initial table will contain. --Halfak (WMF) (talk) 19:55, 22 July 2015 (UTC)
I'm not just resizing stuff (or needing a small set of data at a time); I need to process the entire set of all geocoded Wikipedia articles at once to build a database of map labels for each zoom level. I will subscribe to the Phabricator ticket. Thanks! --Dschwen (talk) 21:14, 26 July 2015 (UTC)
Latest comment: 8 years ago · 2 comments · 2 people in discussion
This is great work, add me to the list of people who hope it opens the door to a much more permissive wiki culture!
Apologies if this discussion is already under way somewhere else... I wanted to ask about the training data for the revscoring:reverted models, and whether you plan to unpack the various motivations behind the "undo" action. In short, I think it's imperative that we present a multiple-choice field during undo, to allow the editor to categorize their reason for reverting. This will allow us to provide much higher quality predictions in the future.
To make a sloppy analogy, what we're currently doing is like training a neural network on the yes-no question "is this a letter of the alphabet?". What we want to do is have it learn which specific letter is which.
Meanwhile, to continue with the analogy, imagine the written language is evolving and new letters are being invented, old ones are changing form.
I'd like to see documentation on exactly how the training data is harvested, because I'm concerned that our revert model is actually capturing something ephemeral about on-wiki culture, which has shifted over time. Clearly you've considered this problem, the introduction to ORES/reverted suggests as much. For historical data, we would probably need to correlate reverts with debate about the revert--was this a contentious revert? Can we guess whether it was done in good faith? Was the outcome to rollback the revert? Was there an edit war? Also, was the revert destructive, was the original author offended? Engaged? Retained? This seems like a really hard problem, which is why I'd suggest we focus on recent data only, to capture current norms around acceptable article style, and also introduce a self-reporting mechanism where the editor can categorize the reason for their revert.
I think it is an interesting idea to have a multiple choice option for 'undo'. It seems like different actions should be taken given the user's reasoning. E.g. if blatantly offensive vandalism, level 4 warning & revert. If playful vandalism, level 1 (or N+1) warning & revert. If test edit (key mash, "hi there", etc.), then test edit warning & revert. If good-faith, but still does not belong, revert and post reason on talk page. If good, no undo for you! Assuming that judgements made by editors had sufficient coverage (there's reason to believe that reverts have coverage), then we could use this to train and deploy better prediction models.
Right now, we're looking at using our Wiki labels campaign to answer the questions: "Is this damaging?" and "Is this good-faith?" so that we can (1) check the biases in our 'reverted' model and (2) train a better classifier that focuses on damage. I'd really like to be able to stand up a classifier that specializes in bad-faith damage for quality control purposes.
Your point about recency is well received as well. One substantial concern we must manage is the periodic nature of vandalism. E.g. when school is in session in North America, we seem to get a lot more vandalism on enwiki, and of a different type. Right now, we're training our models based on revisions from the entire year of 2014 because we started work in January 2015 -- but we could also sample from the entire year before yesterday. I think we'll find it difficult to extend our wiki-labeling campaign once per year since it involves a substantial amount of effort. This might work for enwiki, where we have been lucky to find many volunteer labelers, but I suspect that less active wikis will fall behind unless we integrate with MediaWiki's undo/rollback. --EpochFail (talk) 18:29, 23 July 2015 (UTC)
Hi. WikiTrust has been, in my opinion, one of the most advanced tools for user & revision scoring. I'm surprised not to see an analysis of it here; it's not even in the list of tools. What is the reason for that? Regards Kelson (talk) 12:33, 3 September 2015 (UTC)
Hi Kelson, the short answer is that WikiTrust solves a different type of problem. In this project, we score revisions. WikiTrust does not score revisions directly, and it doesn't do its scoring in real time. It scores editors, applies a trustworthiness score to their contributions, and applies an implicit review pattern. WikiTrust is actually just one of many algorithms that use this strategy. For my work in this space, see R:Measuring value-added. For a summary of other content persistence algorithms, see R:Content persistence. --Halfak (WMF) (talk) 13:21, 3 September 2015 (UTC)
Hey folks. It's been about a month since you've gotten a progress report. I figured one was due.
We minimized the rate of duplicate score generation in ores [169]. Parallel requests to score the same revision will now share the same celery AsyncResult.
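The sharing idea can be sketched without celery. This illustrative stand-in (not the actual ORES code) keys pending futures by rev_id so concurrent requests for the same revision reuse one in-flight computation:

```python
# Illustrative sketch of deduplicating parallel score requests: all requests
# for the same rev_id share a single in-flight Future, analogous to ORES
# sharing one celery AsyncResult per revision.
import threading
from concurrent.futures import ThreadPoolExecutor

class ScoreDeduplicator:
    def __init__(self, score_fn):
        self.score_fn = score_fn
        self.in_flight = {}            # rev_id -> Future (shared result)
        self.lock = threading.Lock()
        self.pool = ThreadPoolExecutor(max_workers=4)

    def request_score(self, rev_id):
        with self.lock:
            future = self.in_flight.get(rev_id)
            if future is None:
                # first request for this revision: start the work
                future = self.pool.submit(self.score_fn, rev_id)
                self.in_flight[rev_id] = future
            return future

calls = []
def score(rev_id):
    calls.append(rev_id)               # record how often we actually compute
    return {"damaging": 0.07}

dedup = ScoreDeduplicator(score)
f1 = dedup.request_score(12345)
f2 = dedup.request_score(12345)        # same Future; no duplicate work
```

A production version would also evict finished entries from `in_flight` so the map doesn't grow without bound.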
We turned pylru and redis into optional dependencies of ORES [170]. This makes deployment a little easier since we don't have to make Debian packages for libraries we don't use in production.
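One common way to express optional dependencies is setuptools "extras". A hypothetical sketch of what such a declaration could look like (this is not the actual ORES setup.py; names and versions are placeholders):

```python
# Hypothetical setup.py fragment: pylru and redis become "extras" that are
# only installed when explicitly requested, e.g. `pip install ores[redis]`.
from setuptools import setup

setup(
    name="ores",
    version="0.0.0",                    # placeholder
    install_requires=["revscoring"],    # always required
    extras_require={
        # cache backends are optional; production deployments opt in
        "redis": ["redis", "pylru"],
    },
)
```

With this arrangement, a plain `pip install ores` pulls in only the hard requirements, which is what makes Debian packaging simpler.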
We did a bunch of homework around detecting systemic bias in subjective algorithms (like our classifiers). See our notes here: [171]
We made the color scheme in ScoredRevisions configurable. [172]
We primed lists of stopwords by applying an en:TFiDF strategy to edits in various Wikipedias (af, ar, az, de, et, fa, he, hy, it, nl, pl, ru, uk) [173]
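The TF-iDF intuition behind stopword priming can be shown with a toy pure-Python example (the documents are made up): a term that appears in every document gets an idf of zero, so it scores lowest and becomes a stopword candidate.

```python
# Toy illustration of TF-iDF-based stopword priming. Terms appearing in
# nearly every document score near zero; the lowest scorers are candidates.
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog ate the homework".split(),
    "the dog and the cat".split(),
]

def tfidf(term, doc, docs):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)      # 0 when the term is in every doc
    return tf * idf

vocabulary = set(word for doc in docs for word in doc)
# score each term by its best (maximum) tfidf over the documents containing it
scores = {t: max(tfidf(t, d, docs) for d in docs if t in d) for t in vocabulary}
stopword_candidates = [t for t, s in sorted(scores.items(), key=lambda kv: kv[1])][:3]
```

The real campaign ran this kind of analysis over edit diffs per wiki, which is why the lists differ across languages.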
We added language features for Hebrew [174] and Vietnamese [175] to revscoring
We read and discussed critiques of subjective algorithms in computer-mediated social spaces [176]
We implemented a regex-based badwords detector that handles multiple-token badwords (important for Turkish and Persian) [177]
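A minimal sketch of the multi-token idea (the phrases below are invented placeholders, not real badword lists): collapsing internal spaces to `\s+` lets a single regex match a phrase as one unit, which per-token matching would miss.

```python
# Sketch of a multi-token badword matcher. Each space in a phrase becomes
# \s+ so the phrase matches across arbitrary whitespace.
import re

badwords = ["stupid face", "silly sausage"]   # hypothetical multi-token entries
badword_re = re.compile(
    r"\b(?:" + "|".join(b.replace(" ", r"\s+") for b in badwords) + r")\b",
    re.IGNORECASE,
)

matches = badword_re.findall("You have a stupid   face")   # extra whitespace is fine
```

Agglutinative morphology in Turkish and Persian would additionally call for suffix-tolerant patterns, which this sketch leaves out.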
We built a script for extracting data from the recently-finished Wikilabels edit quality campaigns [181] and we're now working on building models with the data.
We substantially improved the stability of ORES worker nodes and the redis backend that they use [182]
Also worth noting, ORES has been adopted by Huggle and we've been working with them to address performance issues by suggesting they request scores in parallel. ORES can take it! --EpochFail (talk) 16:22, 19 September 2015 (UTC)
FYI: Code for extracting features re. edit type classification
We'll need to adapt and export the feature extraction code for use inside revscoring. F-scores for each class are comparable to the state of the art. --EpochFail (talk) 18:00, 23 September 2015 (UTC)
Gotcha. I'll have to talk to Diyi about opening it up. She may prefer that I rewrite before publishing. I'll report back when I can get to it. --EpochFail (talk) 21:51, 23 September 2015 (UTC)
So your weekly reports should actually be weekly reports now. Sorry for the last hiccup.
We clustered reverted edits using a k-means algorithm. T110581
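For illustration, here is a minimal pure-Python k-means of the sort that could group reverted-edit feature vectors; the actual work would use an established library and much richer features than the toy 2-d points below.

```python
# Minimal k-means sketch: assign points to the nearest center, then move each
# center to the mean of its cluster, and repeat.
def kmeans(points, k, iters=10):
    centers = points[:k]               # naive init: first k points (fine for a sketch)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Two obvious blobs standing in for feature vectors of reverted edits
points = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, clusters = kmeans(points, k=2)
```

Choosing `k` is the hard part, which is what the SigClust summary in the next item addresses.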
We prepared a summary of SigClust and other methods for choosing the number of clusters. T113057
We started working on our midpoint report, which should provide a good overview of our work since the last tri-monthly report. We hope to have a draft by the 1st of October. T109845
We are in the process of deploying models built from the data collected through the wiki labels edit quality campaign. We will compare the results of the newer model generated from this data with the older revert-based model in the upcoming weeks. T108679
We are winding down our community outreach efforts and will focus on assisting the more responsive communities in the meantime. T107609
We are re-generating stop words, omitting interwiki links from them. Interwiki links do not provide an indicative signal for edit quality. T109844
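A rough sketch of the kind of preprocessing this implies (the regex is an approximation for illustration, not the actual revscoring code): strip interwiki-link targets from wikitext before computing term statistics.

```python
# Remove interwiki links like [[fr:Chat]] before term counting, since their
# targets carry no signal about edit quality. The pattern is approximate.
import re

INTERWIKI_RE = re.compile(r"\[\[[a-z-]{2,12}:[^\]]+\]\]")

def strip_interwiki(wikitext):
    return INTERWIKI_RE.sub("", wikitext)

cleaned = strip_interwiki("Cats are small. [[fr:Chat]] [[de:Hauskatze]]")
```

Note the lowercase prefix requirement keeps namespaced links like `[[Category:...]]` intact in this sketch.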
We have a lot of things going on in parallel. Stay tuned for next week!
Graphite logging (precached speed). A screenshot of the graphite logging interface shows the 50, 75, 95 and 99th percentiles of response times for our precaching system.
Hey folks,
We just had a good work session on our midterm report for the IEG. It became blatantly obvious that our methods for gathering metrics on requests, cache-hit/miss and activity of our precached service left a lot to be desired. So I put in a couple of marathon sessions this week and got our new metrics collection system (using graphite.wmflabs.org) up and running.
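As a sketch of how such metrics typically reach Graphite: the application emits statsd-style plaintext lines ("name:value|type") over UDP, and statsd/Graphite aggregate them into the percentile series shown in the screenshot. The metric name and port below are hypothetical, not the actual configuration.

```python
# Fire-and-forget statsd-style timing metric over UDP. The statsd daemon
# (conventionally on port 8125) aggregates these into percentiles.
import socket
import time

def send_timing(name, ms, host="127.0.0.1", port=8125):
    message = "%s:%d|ms" % (name, ms)   # statsd plaintext timing format
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(message.encode("utf-8"), (host, port))
    sock.close()
    return message

start = time.time()
# ... score a revision here ...
elapsed_ms = int((time.time() - start) * 1000)
msg = send_timing("ores.precached.response_time", elapsed_ms)
```

Because the transport is UDP, instrumentation never blocks or breaks the request path even when the metrics collector is down.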
Your weekly report.
ORES celery workers were made quieter through better error handling. This allows unexpected errors to be more prominent, since known issues won't bog down the logs. T112472[183]
Batch feature extraction is now implemented. This will expedite model creation. T114248
The midpoint report was drafted early to meet the schedules of IEG and grantees so that delays are avoided. T109845
We included the model version in the ORES response structure so that scores will be regenerated with an updated model instead of persisting outdated scores from the earlier model. New scores will be generated on demand. T112995
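The version-stamping idea can be sketched like this (field names and version strings are illustrative, not the exact ORES response schema): a cached score is reused only if it was produced by the currently deployed model version; otherwise it is regenerated on demand.

```python
# Sketch of version-aware score caching: stale scores from an older model
# version are regenerated instead of being served forever.
CURRENT_MODEL_VERSION = "0.2.0"

cache = {
    # rev_id -> score stamped with the model version that produced it
    123456: {"damaging": {"prediction": False, "version": "0.1.0"}},
}

def get_score(rev_id, score_fn):
    cached = cache.get(rev_id)
    if cached and cached["damaging"]["version"] == CURRENT_MODEL_VERSION:
        return cached                   # produced by the current model: reuse
    # stale or missing: regenerate on demand and re-stamp
    score = {"damaging": dict(score_fn(rev_id), version=CURRENT_MODEL_VERSION)}
    cache[rev_id] = score
    return score

score = get_score(123456, lambda rev_id: {"prediction": True})
```

This avoids a full cache flush at deploy time: old entries age out lazily as they are requested.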
Model testing statistics and model_info utilities were added to revscoring. T114535
Metrics collection added for ORES. This will help quantify ORES usage. T114301
Your weekly report.
We have trained and deployed "damaging" and "good-faith" models based on the handcoded data gathered through the now-complete edit quality campaigns in wikilabels. These have been deployed for enwiki, fawiki and ptwiki. T108679
The Turkish wikilabels campaign has completed and is ready for modeling, provided the AUC confirms a gain.
We are preparing many wikis for their own edit quality campaigns. The wikilabels interface awaits translation by local communities.
Per the revscoring sync meeting, here are my thoughts on the matter. First off, we have established that bots are almost never reverted on wikidata. Secondly, I think everyone can agree that on wikidata bots FAR outweigh humans in terms of number of edits. As a consequence, we are dealing with an over-fitting problem where probably all human edits will be treated as bad, because the algorithm will give too much weight to features that basically distinguish bots from humans. This is more of an intuitive assessment than actual analysis; I could very well be wrong. I came to this assessment because on wikidata the vast majority of good edits will come from bots, and bots will always dominate a random sample set. We could perform a more selective sampling, but honestly I do not see the benefit of it.
Based on the two assessments above, I propose a different type of modelling for wikidata than what we use on wikipedias. First off, we need to segregate bot edits from other edits. Indeed, this has the potential for bias, but I will explain how that can be avoided. So we would have two models for wikidata: one for bots and the other for non-bots. Two independent classifiers would be trained, and different classifier types could even be used; for instance, bot edits could get Naive Bayes while human edits are processed with an SVM. This is kind of a top-level decision tree which delegates its first two branches to classifiers.
If ORES is asked to score a revision and the edit is from a bot, the bot model would generate the score; if the edit is not from a bot, the non-bot model would. The output would still be "damaging/not damaging" and "good faith/bad faith". Bias would be avoided because the ways bots and humans edit are very different; if humans make bot-like edits that aren't reverted (and vice versa), they would still be treated as good.
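The proposed dispatch is essentially a one-level decision tree. A stub sketch of the routing (classifier internals are placeholders; a real version would wrap e.g. a trained Naive Bayes model for bots and an SVM for humans):

```python
# Stub sketch of the proposed two-model dispatch for wikidata: route on the
# bot flag at the top level, then delegate scoring to the matching model.
def make_hierarchical_scorer(bot_model, human_model):
    def score(edit):
        model = bot_model if edit["user_is_bot"] else human_model
        return model(edit)
    return score

def bot_model(edit):
    return {"damaging": 0.01, "model": "bot"}      # placeholder score

def human_model(edit):
    return {"damaging": 0.30, "model": "human"}    # placeholder score

score = make_hierarchical_scorer(bot_model, human_model)
result = score({"user_is_bot": True})
```

The output schema stays the same either way; only the model that produces it differs.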
Seems like this is putting the cart before the horse to me. If we are worried about winding up in a bots-vs.-humans situation, then we can leave the user.is_bot flag out of the feature set. We need representative training data to train any model (your suggestion or a more straightforward approach), and that is the issue I brought up at our meeting. We don't want to have to extract features for 2m edits. That would take forever. Also, bots are *never* reverted, so in order to get *any* signal about bots, we'd need to have humans handcode ~2m edits to get a representative sample of bot damage. IMO, we should build a sample stratified on whether the edit was reverted or not and exclude the user.is_anon and user.is_bot flags. We can do this in a relatively straightforward way now that we know the rough percentage of edits that are reverted from processing the XML dumps.
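The stratified-sampling idea can be sketched as follows (toy data; the ~5% revert rate below is illustrative): sample the reverted and non-reverted strata separately so the minority class is well represented without extracting features for millions of edits.

```python
# Sketch of sampling stratified on revert status: draw a fixed number of
# edits from each stratum instead of a single random sample dominated by
# non-reverted edits.
import random

def stratified_sample(edits, n_per_stratum, seed=42):
    rng = random.Random(seed)
    reverted = [e for e in edits if e["reverted"]]
    kept = [e for e in edits if not e["reverted"]]
    return (rng.sample(reverted, min(n_per_stratum, len(reverted))) +
            rng.sample(kept, min(n_per_stratum, len(kept))))

# Toy population with a 5% revert rate
edits = [{"rev_id": i, "reverted": i % 20 == 0} for i in range(10000)]
sample = stratified_sample(edits, n_per_stratum=200)
```

Knowing the population's revert rate (from the XML dumps) is what makes it possible to re-weight such a sample back to population proportions at training time.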
Honestly, I'm really hoping that we don't need a hierarchical model, because that implies a substantial increase in code complexity. Right now, we don't even know whether we have a problem with bias yet. --EpochFail (talk) 15:48, 17 October 2015 (UTC)
Hey folks,
I just updated the revscoring package documentation for 0.6.7. It's got a new theme (alabaster is the new default), better examples, and simplified access patterns for basic types (e.g. from revscoring import ScorerModel vs. from revscoring.scorer_models import ScorerModel). Check it out. :) --EpochFail (talk) 13:34, 22 October 2015 (UTC)