Evaluating bias encoded in ORES

I built a data pipeline to measure ORES's bias against newcomers and anonymous editors. Human labels from the training data are the "ground truth."

I retrieved the human-labeled edits from Wiki_labels, scored the edits using the ORES API, and obtained edit metadata from the Wikimedia API (it turns out that I could have used the data lake instead). Next I pushed this data to the data lake and identified newcomers through their edit histories using a Spark script. I defined "newcomers" as editor accounts that have been active for less than a month and/or have fewer than 5 edits. "Anons" are editors who are not logged in when they edit. I also defined a group of "normal" editors who are neither newcomers nor anonymous.
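The grouping rule can be sketched in a few lines of Python. This is only an illustration of the definitions above, not the Spark script itself: the function name, its arguments, the reading of "a month" as 30 days, and the reading of "and/or" as a logical OR are all assumptions.

```python
from datetime import datetime, timedelta

# Cutoffs from the definitions in the text; 30 days for "a month" is an assumption.
NEWCOMER_MAX_AGE = timedelta(days=30)
NEWCOMER_MAX_EDITS = 5


def editor_group(is_anonymous, account_created, prior_edit_count, edit_time):
    """Assign the editor of one edit to 'anon', 'newcomer', or 'normal'.

    Hypothetical helper: the argument names are illustrative, not fields
    from the actual pipeline.
    """
    if is_anonymous:
        return "anon"
    account_age = edit_time - account_created
    # "and/or" is read here as a logical OR of the two newcomer conditions.
    if account_age < NEWCOMER_MAX_AGE or prior_edit_count < NEWCOMER_MAX_EDITS:
        return "newcomer"
    return "normal"
```

For example, a logged-in editor whose account is ten days old counts as a newcomer under this sketch even with hundreds of edits, because either condition alone is sufficient.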

I considered two kinds of ORES models. The first (the damaging classifier) predicts whether the community would consider an edit to be damaging. However, just because an edit is damaging doesn't imply that the editor caused damage on purpose. The second model (the goodfaith classifier) attempts to predict the intent behind the edit, that is, whether it was made in good faith.

After building this dataset, I pulled it out of the data lake and made the plots below to evaluate how ORES algorithms perform in terms of fairness. I considered two notions of "fairness" that can be applied to classification systems. Both notions adopt the frame that fairness consists of equal treatment regardless of status. To people thinking about algorithms in the criminal justice system, "status" might refer to a person's race. Here we consider how the algorithm treats editors who might be newcomers or anonymous.

When I think about fairness, I care about more than whether the algorithm says that newcomer and anonymous editors make worse edits on average. This might be true (it almost certainly is), and it would be surprising if an algorithm not explicitly designed to favor such editors didn't scrutinize their edits. Such an algorithm might be considered fair or unfair depending on how we define "fairness," which is not an easy or obvious thing to do, as we will see.

I considered two possible ways of defining fairness. The first is calibration. A well-calibrated classifier predicts similarly accurate probabilities regardless of status. It is free to predict that newcomers (or anons) have a higher probability of making damaging edits compared to normal editors. But if it systematically over-estimates this probability, then we would say that it is biased against newcomers (or anons) in terms of calibration.

The second fairness criterion I consider is balance. As with calibration, a balanced classifier is free to associate newcomers and anons with a greater risk of making damaging edits. But instead of looking at the predicted probabilities, balance considers the kinds of errors the algorithm makes. A false-positive error occurs when the model predicts damage, but the edit was not truly damaging. Since having your edit labeled damaging is a bad thing, a model with a higher false-positive rate (fpr) for newcomers (or anons) compared to normal (neither newcomer nor anonymous) editors is biased in terms of false-positive balance. If ORES has this kind of bias then good edits by newcomers (or anons) will be more likely to be labeled as damaging than good edits by other editors.

Similarly, a false-negative error occurs when the model predicts that the edit is good, but the edit was actually damaging. A model with a lower false-negative rate (fnr) for newcomers (or anons) compared to normal editors is biased in terms of false-negative balance. If ORES has this kind of bias then damaging edits by newcomers (or anons) will be less likely to be labeled as good than damaging edits by other editors.

It turns out that, while both of these notions of classifier fairness might seem reasonable or intuitive, it isn't possible in practice to have them both (unless your model is a perfect predictor or the status is irrelevant). ^[1]

The code is in my fork of the editquality repository here, on github.

Calibration

To assess the calibration of ORES models, for each of the 26 wikis that have enabled ORES models, I first estimated the likelihood that an edit is damaging (or goodfaith) within each group of editors (newcomers, anons, normal) simply by taking the mean over the human-labeled edits for each group. I compared these estimates to the mean probability output by the model for each group. Taking the difference of these two means provides a measure of calibration.
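As a rough sketch of this measure, the Python below computes the gap between the mean model score and the mean human label for each group. The function names and the (group, label, score) row layout are hypothetical. The sign convention (positive means the model overestimates) and the paired normal-approximation interval are assumptions; the confidence intervals in the figures may have been computed differently.

```python
import math


def calibration_gap(labels, scores, z=1.96):
    """Mean predicted probability minus observed rate, with an approximate 95% CI.

    labels: 0/1 human labels (1 = damaging, or 1 = goodfaith)
    scores: model probabilities for the same edits, in the same order
    Needs at least two edits, since it uses the sample variance.
    """
    n = len(labels)
    # Labels and scores are paired per edit, so work with per-edit differences.
    diffs = [s - y for s, y in zip(scores, labels)]
    gap = sum(diffs) / n
    variance = sum((d - gap) ** 2 for d in diffs) / (n - 1)
    half_width = z * math.sqrt(variance / n)
    return gap, (gap - half_width, gap + half_width)


def gaps_by_group(rows):
    """rows: iterable of (group, label, score) tuples; returns {group: (gap, ci)}."""
    grouped = {}
    for group, label, score in rows:
        labels, scores = grouped.setdefault(group, ([], []))
        labels.append(label)
        scores.append(score)
    return {g: calibration_gap(l, s) for g, (l, s) in grouped.items()}
```

Under this convention, a damaging model that is biased against anons in terms of calibration would show a larger positive gap for the "anon" group than for the "normal" group.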

Figure 1.1. ORES damaging models are biased against newcomers and anonymous editors in terms of calibration. We can see that most of the damaging models tend to overestimate the chances that an edit is damaging, and that the predicted probabilities that newcomer and anonymous editors make damaging edits are overestimated even more than those for edits by normal editors. Error bars show 95% confidence intervals for estimated differences in sample means.

Figure 1.2. ORES goodfaith models are biased against newcomer and anonymous editors. We can see that most of the goodfaith models tend to underestimate the chances that an edit is made in good faith. Moreover, the predicted probabilities that newcomer and anonymous editors make edits in good faith are often underestimated even more than those for edits made by normal editors. Error bars show 95% confidence intervals for estimated differences in sample means. Compared to the damaging models, many of the goodfaith models (for ar, bs, ca, wikidata, es.wikibooks, he, hu, lv, pl, and sr) appear well calibrated across the editor groups.

Balance

The above evaluation in terms of calibration uses the raw probability scores output by the model. But to measure balance we have to choose a threshold in order to convert probability predictions into discrete classifications. Choosing these thresholds is somewhat arbitrary, but the ORES models are being used by people, and the thresholds that they use to make discrete decisions (for example, to define filters in RecentChanges) are published in Special:ORESModels. These define 4 different thresholds for each model, corresponding to different levels of confidence in the classification. Different Wikipedia communities choose different thresholds according to their preferences, and not every wiki uses 4 thresholds for both models.

To make the plots below I estimated the false-positive rate and false-negative rate for each group of editors on each wiki using the human-labeled edits.
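The estimate for one group on one wiki might look like the following Python sketch. The function name is hypothetical, and treating a threshold as a single cutoff on the score is a simplifying assumption; the way Special:ORESModels maps thresholds to filters may differ.

```python
def error_rates(labels, scores, threshold):
    """Return (false-positive rate, false-negative rate) at one threshold.

    labels: 0/1 human labels (1 = positive class, e.g. damaging)
    scores: model probabilities for the positive class
    An edit is classified positive when its score is >= threshold
    (a single cutoff is an assumption made for this sketch).
    """
    false_pos = false_neg = negatives = positives = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if label:
            positives += 1
            false_neg += not predicted_positive
        else:
            negatives += 1
            false_pos += predicted_positive
    # A rate is undefined when a group has no edits of the relevant true class.
    fpr = false_pos / negatives if negatives else float("nan")
    fnr = false_neg / positives if positives else float("nan")
    return fpr, fnr
```

Running this separately on the newcomer, anon, and normal edits of a wiki and comparing the resulting rates gives the kind of comparison shown in the figures below.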

Damaging models

Figure 2.1. The ORES damaging models are biased against newcomer and anonymous editors in terms of false-positive balance. Each of the 4 quadrants shows a different confidence threshold used by each wiki. For the "Very likely have problems" quadrant, the number of false positives is often very low for all groups of editors, and we only see bias against anonymous editors. For every other threshold, however, we see that many models are more likely to wrongly classify edits by newcomers and anons as problematic compared to edits by others.

Figure 2.2. The evidence of bias against newcomers and anons in terms of false-negative imbalance is less clear than for false-positive imbalance. For the threshold designed to maximize precision of non-damaging edits, we indeed see substantially lower false-negative rates for newcomers and anonymous editors compared to other editors. On the other hand, for thresholds designed for high recall of damaging edits (on the bottom row), it is typical for ORES false-negative rates to be higher for newcomers and anons. Note that the false-negative rate in the top-right quadrant is very low and the range of the y-axis is compressed into (0, 0.08).

Goodfaith models

Figure 3.1. The ORES goodfaith models are generally biased against newcomers and anons in terms of false-positive balance. ORES models are more likely to incorrectly classify normal editors as acting in good faith compared to newcomers and anons except at the threshold chosen for high precision at goodfaith classification (top left).

Figure 3.2. Similarly, the ORES goodfaith models are generally biased against newcomers and anons in terms of false-negative balance. ORES models are more likely to incorrectly classify newcomer and anonymous editors as acting in bad faith compared to other editors.

Discussion

I found evidence that the ORES models are typically systematically biased against newcomer and anonymous editors. In terms of calibration, all of the damaging models are biased against newcomers and anons except for Arabic, Bosnian, wikidata, and Finnish. So are the goodfaith models for Czech, English, Spanish, French, Italian, Korean, Dutch, Portuguese, Romanian, Russian, Albanian, Swedish, and Turkish.

Whether a model is biased in terms of false-positive or false-negative balance can depend on the choice of threshold for mapping between probabilities and classifications. I used the thresholds that wikis are using in practice to power tools like the filters on en:Special:RecentChanges. For most wikis and thresholds, the damaging models are biased in terms of false-positive balance and the goodfaith models are biased in terms of both false-positive and false-negative balance.

We shouldn't be surprised that the goodfaith models that were unbiased in terms of calibration are biased in terms of balance. In fact, there is an inherent tradeoff between these two different notions of algorithmic fairness.^[1] Kleinberg et al. present a rigorous proof of this, but I'll try to offer an intuitive explanation here:

The requirements of calibration and balance place constraints on the types of errors that the classifier can make, but these constraints can only be satisfied together if the model is a perfect predictor, or if the two groups of editors make damaging edits at the same rate.

For the model to be calibrated, the proportion of edits by anons labeled as damaging must equal the true rate of damaging edits by anons, and symmetrically the proportion of edits by non-anons labeled as damaging must equal the true rate of damaging edits by non-anons. This means that the errors within each group of editors have to be symmetrical. Errors that over-estimate the probability that an edit is damaging must be offset by errors that under-estimate it.

On the other hand, the balance constraint links errors in one group to errors in the other group. A balanced model can predict a higher rate of damaging edits for anons compared to non-anons, but it must do so without increasing the rate of false positives for anons above that of non-anons. This means that balance requires that edits that are truly non-damaging have the same average score whether they were made by anons or by non-anons. The same goes for edits that are truly damaging.

Since anons are truly more likely to make damaging edits compared to non-anons, a calibrated model will assign their edits higher scores on average. But this means that the model will give non-damaging edits by anons higher scores than non-damaging edits by non-anons. So the model is not balanced. If the model could predict perfectly, then it wouldn't have to do this since all damaging edits could be assigned a score of 1 and all non-damaging edits would have a score of 0. But in the tragic world where we can't predict perfectly, we're stuck with a tradeoff between balance and calibration.
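A toy calculation makes the tradeoff concrete. The Python below uses made-up numbers (not ORES data) for a hypothetical model that is perfectly calibrated within each group but only ever outputs two scores, 0.5 and 0.1. The function name and the bin layout are assumptions made for illustration.

```python
def mean_score_by_outcome(bins):
    """bins: list of (score, n_edits) pairs for one group of editors.

    Assumes perfect calibration: in each bin, a fraction `score` of the
    edits is truly damaging. Returns the mean score of the truly
    non-damaging edits and the mean score of the truly damaging edits.
    """
    good_count = sum(n * (1 - s) for s, n in bins)
    bad_count = sum(n * s for s, n in bins)
    good_mean = sum(n * (1 - s) * s for s, n in bins) / good_count
    bad_mean = sum(n * s * s for s, n in bins) / bad_count
    return good_mean, bad_mean


# Hypothetical groups with different base rates of damage.
anons = [(0.5, 500), (0.1, 500)]      # 30% of edits truly damaging
non_anons = [(0.5, 100), (0.1, 900)]  # 14% of edits truly damaging

anon_good, anon_bad = mean_score_by_outcome(anons)
other_good, other_bad = mean_score_by_outcome(non_anons)
print(round(anon_good, 3), round(other_good, 3))
print(round(anon_bad, 3), round(other_bad, 3))
```

In this example the truly non-damaging edits by anons get a mean score of about 0.24, versus about 0.12 for non-anons, even though the scores are calibrated in both groups. Balance fails only because the base rates differ.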

I also did a [similar analysis] looking for biases against edits to articles on women or about places in the Global South. However, in that case I did not observe a consistent pattern of bias.

References

[1] Kleinberg, Jon; Mullainathan, Sendhil; Raghavan, Manish (2016-09-19). "Inherent Trade-Offs in the Fair Determination of Risk Scores". arXiv:1609.05807 [cs, stat]. Retrieved 2019-04-02.
