Research talk:Autoconfirmed article creation trial/Work log/2018-01-11


Thursday, January 11, 2018

Today I presented our initial results on account age at article/draft creation and our initial results on article quality at the Research Group meeting, and wrapped up the December 18 work log, which looked at the survival of article and draft creators.

I'm currently digging into the ORES draft quality model to better understand our article quality results and how we might improve them by using a probability threshold from that model to flag articles as "OK". Our previous results used a simple majority vote to flag an article as "OK".

Draft quality model

The purpose of the ORES draft quality model is to predict whether a revision would be deleted under one of three speedy deletion criteria: spam, attack, or vandalism. I found T148038 through a search on Phabricator, and that task appears to contain the training results for the model (ref).

There are a few things to note about the training of the model. First, the number of instances varies across the four classes: adding up the rows in the confusion matrix, I get 26,257 OK, 2,059 attack, 17,699 spam, and 6,505 vandalism instances.

The overall accuracy is high (85.4%), while accuracy per class differs greatly. OK and spam are the two large classes, and those appear to have high training accuracy (OK: 94.6%; spam: 89.9%). Accuracy for the other two classes is much lower (attack: 12.1%; vandalism: 58.8%), something which is also reflected in the F1 scores for those two classes.

Looking more closely at the confusion matrix, we see that instances labelled "attack" are often predicted as "vandalism" (59.7%) or "spam" (24.2%). Instances labelled "vandalism" are often predicted as "spam" (27.9%) or "OK" (10%). The number of "spam" instances predicted as "OK" is roughly the same as the number of "vandalism" instances predicted as "OK", but because there are far more "spam" instances in the dataset, the proportion is much lower (3.5%).
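To make these computations concrete, here is a minimal sketch of how the row totals, per-class accuracy, and cross-class proportions are derived from a confusion matrix. The cell counts are toy values, not the real numbers from T148038.

```python
# Minimal sketch, assuming a confusion matrix with true labels as rows and
# predicted labels as columns (as in T148038). Cell counts are toy values.
labels = ["OK", "attack", "spam", "vandalism"]
confusion = [
    [940,  2,  40,  18],   # true "OK" (toy values)
    [  8, 12,  24,  56],   # true "attack" (toy values)
    [ 30,  1, 890,  79],   # true "spam" (toy values)
    [ 60,  7, 160, 373],   # true "vandalism" (toy values)
]

for i, true_label in enumerate(labels):
    row_total = sum(confusion[i])
    recall = confusion[i][i] / row_total
    print(f"{true_label}: {row_total} instances, per-class accuracy {recall:.1%}")
    for j, predicted_label in enumerate(labels):
        if j != i:
            share = confusion[i][j] / row_total
            print(f"  labelled {true_label!r} but predicted {predicted_label!r}: {share:.1%}")
```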

Given these training results, it is reasonable to conclude that our previous approach of using the majority class is flawed. Particularly for revisions predicted as spam, attack, or vandalism, we would want to know the probability for each of those three classes; that gives us more information, for example when the model assigns similar probabilities to two of the classes.
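For instance, the ORES scoring endpoint returns the full probability distribution alongside the predicted class. The snippet below is a sketch assuming the standard ORES v3 endpoint and its usual response layout; the revision ID is just an example.

```python
# Sketch of pulling per-class probabilities from ORES instead of only the
# majority-class prediction (assumes the ORES v3 response layout).
import requests

ORES_URL = "https://ores.wikimedia.org/v3/scores/enwiki/"

def draft_quality_probabilities(rev_id):
    response = requests.get(ORES_URL,
                            params={"models": "draftquality", "revids": rev_id})
    response.raise_for_status()
    score = response.json()["enwiki"]["scores"][str(rev_id)]["draftquality"]["score"]
    # 'probability' maps each class (OK, attack, spam, vandalism) to its probability
    return score["prediction"], score["probability"]

prediction, probabilities = draft_quality_probabilities(12345678)  # example rev ID
print(prediction)
for cls, prob in sorted(probabilities.items(), key=lambda kv: -kv[1]):
    print(f"  {cls}: {prob:.3f}")
```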

Second, we would want to examine ORES' statistics to ensure that we treat revisions correctly when the model predicts a revision to be "OK". When I chatted with A. Halfaker, he pointed me to this URL, which provides information about the draft quality model at a given recall level for the negative class. The negative class in our case is predicting revisions as something other than "OK". For documentation of the various measurements, see the comments in the source code.

To utilize the ORES API, I wrote a Python script that grabs the draft quality model statistics based on the URL described above. This way we get full statistics for inverse recall from 0.5 to 0.99. I then plotted all of the statistics in a single graph:
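A rough sketch of such a script is below. The requests call and the JSON path are what I'd expect from the ORES v3 model_info endpoint, but the exact threshold-optimization query string is an assumption on my part; the URL A. Halfaker provided is the authoritative form.

```python
# Rough sketch: query ORES model_info for draft quality threshold statistics
# on the "OK" class at a range of inverse recall (recall of the non-"OK"
# classes) levels, and collect them for plotting.
# NOTE: the model_info query string below is an assumed form of ORES'
# threshold-optimization syntax, not a verified one.
import requests

ORES_URL = "https://ores.wikimedia.org/v3/scores/enwiki/"

def ok_threshold_stats(min_inverse_recall):
    query = ('statistics.thresholds.OK.'
             '"maximum recall_at_!recall(min_!recall={:.2f})"'.format(min_inverse_recall))
    response = requests.get(ORES_URL,
                            params={"models": "draftquality", "model_info": query})
    response.raise_for_status()
    model_info = response.json()["enwiki"]["models"]["draftquality"]
    # Each threshold entry holds fields such as threshold, precision, recall,
    # !precision, !recall, !f1, and fpr.
    return model_info["statistics"]["thresholds"]["OK"][0]

# Full statistics for inverse recall levels from 0.50 to 0.99.
levels = [round(0.5 + 0.01 * i, 2) for i in range(50)]
stats = {level: ok_threshold_stats(level) for level in levels}
```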

Although the inverse recall is the value we supply to the model, there are three parts of the graph we want to focus on:

  1. Precision, recall, and F1 score for the "OK" class. These are all towards the top of the graph and appear to be stable, except for extreme inverse recall values (0.98–0.99).
  2. False Positive Rate (FPR). This is the complement of the inverse recall (FPR = 1 - inverse recall; see the short derivation after this list), meaning that an FPR of 10% (0.1) corresponds to an inverse recall of 0.9.
  3. The inverse precision and F1 scores. The inverse precision score decreases slowly and roughly linearly up to an inverse recall of about 0.8, then drops more rapidly. The inverse F1 score increases and peaks at an inverse recall of about 0.75, then starts dropping off due to the drop in inverse precision.
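To make the relationship in point 2 explicit, here is the short derivation, treating "OK" as the positive class and everything that is not "OK" as the negative class:

```latex
% FP: non-"OK" revisions predicted as "OK"; TN: non-"OK" revisions predicted as non-"OK".
\[
\mathrm{FPR} = \frac{FP}{FP + TN}, \qquad
\text{inverse recall} = \mathrm{recall}_{\neg\mathrm{OK}} = \frac{TN}{TN + FP},
\]
\[
\Rightarrow\quad \mathrm{FPR} = \frac{(FP + TN) - TN}{FP + TN} = 1 - \text{inverse recall}.
\]
```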

Given the stability in the scores for the "OK" class, we are more interested in the other statistics. The inverse precision and F1 scores are perhaps the first ones to look at, and as mentioned we see that the inverse F1 score peaks around an inverse recall of 0.75. While we might want to maximize that score, we also note that it comes with a relatively low inverse recall and fairly high FPR for the "OK" class.

The peak in the inverse F1 score is mainly due to the faster drop in inverse precision towards the right end of the graph. Should we accept this drop in inverse precision in order to have a higher inverse recall and a lower FPR? In this case, I think the answer is "yes", because the latter two have strong benefits: they mean that we can be much more confident that we get to learn the quality of content that passes the bar, while at the same time covering more of the complementary class. Due to the rapid decrease in inverse precision for an inverse recall above 0.9, we decide that an inverse recall of 0.9 and an FPR of 0.1 give the best result. That also means that we require the model's "OK" probability to be above a threshold of 0.664 in order to treat a prediction as "OK".
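As a rough illustration of what that decision means in practice, the sketch below applies the 0.664 cutoff to the probabilities returned by the hypothetical draft_quality_probabilities() helper from the earlier sketch; anything that does not clear the bar falls back to the most probable non-"OK" class.

```python
# Minimal sketch of applying the chosen cutoff, reusing the hypothetical
# draft_quality_probabilities() helper sketched earlier.
OK_THRESHOLD = 0.664  # chosen to match an inverse recall of 0.9 / FPR of 0.1

def classify_revision(rev_id):
    _, probabilities = draft_quality_probabilities(rev_id)
    if probabilities["OK"] > OK_THRESHOLD:
        # Confident enough to treat the revision as "OK".
        return "OK"
    # Otherwise, report the most probable of the remaining classes.
    non_ok = {cls: p for cls, p in probabilities.items() if cls != "OK"}
    return max(non_ok, key=non_ok.get)
```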