User:AlgoAccountabilityBot/Enwiki Good Faith Model Card
Enwiki Good Faith Model Card
This model card was created and written entirely by the Algorithmic Accountability Bot account. The bot is operated by Hal Triedman. It regularly creates and updates pages about the provenance and statistical performance of machine learning models and datasets owned by the Wikimedia Foundation. Note: Any and all edits to this page will be overwritten the next time it is updated. Please put all questions and discussion of this algorithmic component on the talk page, or contact Hal or the WMF ML team directly.
Qualitative Analysis
What is the motivation behind creating this model?
Not all damaging edits are vandalism. This model is intended to differentiate between edits that are intentionally harmful (badfaith/vandalism) and edits that are damaging but made with good intentions (goodfaith damage). The model offers a guess at whether or not a given revision was made in good faith, along with probabilities that serve as a measure of its confidence. This model was inspired by research on Wikipedia's quality control system and the potential for vandalism detection models to also be used as "goodfaith newcomer" detection systems.[1]
Who created this model?
Aaron Halfaker (User:EpochFail) and Amir Sarabadani (amir.sarabadani@wikimedia.de).
Who currently owns/is responsible for this model?
WMF Machine Learning Team (ml@wikimediafoundation.org)
Who are the intended users of this model?
English Wikipedia uses the model as a service to facilitate efficient edit review and newcomer support. On an individual basis, anyone can submit a properly-formatted API call to ORES for a given revision and get back the result of this model.
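For example, a minimal sketch of such a request in Python with the requests library, assuming the standard ORES v3 scores response shape (the revision ID below is arbitrary):

```python
import requests

# Ask ORES to score one enwiki revision with the goodfaith model.
rev_id = 1004348428  # arbitrary example revision ID
url = "https://ores.wikimedia.org/v3/scores/enwiki/"
response = requests.get(url, params={"models": "goodfaith", "revids": rev_id})
response.raise_for_status()

# Drill into the nested response for this revision's goodfaith score.
score = response.json()["enwiki"]["scores"][str(rev_id)]["goodfaith"]["score"]
print(score["prediction"])           # True if the edit looks goodfaith
print(score["probability"]["true"])  # model confidence that the edit is goodfaith
```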
What should this model be used for?
- This model should be used for prioritizing the review and potential reversion of vandalism on English Wikipedia.
- This model should be used for detecting goodfaith contributions by editors on English Wikipedia.
What should this model not be used for?
This model should not be used as an ultimate arbiter of whether or not an edit ought to be considered good faith. The model has been shown to be useful in Simple English Wikipedia, but it should be used outside of English Wikipedia with caution.
What community approval processes has this model gone through?
English Wikipedia decided (note: don't know where/when this decision was made, would love to find a link to that discussion) to use this model. Over time, the model has been validated through use in the community. The links below are just an example to show what this product might look like.
Dates of consideration forums
What internal or external changes could make this model deprecated or no longer usable?
- Data drift means training data for the model is no longer usable.
- Doesn't meet desired performance metrics in production.
- English Wikipedia community decides to not use this model anymore.
How should this model be licensed?
MIT License
If this model is retrained, can we see how it has changed over time?
To my knowledge, this model has not been retrained over time — it still uses the original dataset from 2014-2015.
How does this model mitigate data drift?
This model does not mitigate data drift.
Which service(s) rely on this model?
This model is one of many models that power ORES, the Wikimedia Foundation's machine learning API.
Learn more about ORES here
Which dataset(s) does this model rely on?
This model was trained using hand-labeled training data from 2014-2015. It was tested on a small sample of data from a later hand-labeling campaign from 2015-2016.
The training dataset is available for download here
The test dataset is available for download here
Quantitative Analysis
How did the model perform on training data?
counts (n=19230):

| label | n | predicted True | predicted False |
|---|---|---|---|
| True | 18724 | 18404 | 320 |
| False | 506 | 261 | 245 |

rates:

| | True | False |
|---|---|---|
| sample | 0.974 | 0.026 |
| population | 0.967 | 0.033 |

metrics (micro/macro averages and per-label values):

| metric | micro | macro | True | False |
|---|---|---|---|---|
| match_rate | 0.937 | 0.5 | 0.968 | 0.032 |
| filter_rate | 0.063 | 0.5 | 0.032 | 0.968 |
| recall | 0.967 | 0.734 | 0.983 | 0.484 |
| !recall | 0.501 | 0.734 | 0.484 | 0.983 |
| precision | 0.966 | 0.736 | 0.982 | 0.49 |
| !precision | 0.506 | 0.736 | 0.49 | 0.982 |
| f1 | 0.966 | 0.735 | 0.983 | 0.487 |
| !f1 | 0.503 | 0.735 | 0.487 | 0.983 |
| accuracy | 0.967 | 0.967 | 0.967 | 0.967 |
| fpr | 0.499 | 0.266 | 0.516 | 0.017 |
| roc_auc | 0.924 | 0.924 | 0.924 | 0.924 |
| pr_auc | 0.979 | 0.735 | 0.997 | 0.474 |
How does the model perform on test/real world data across different geographies, different devices, etc.?
Subset | AUC Score | Overall accuracy | Negative sample precision | Negative sample recall | Negative sample f1-score | Negative sample support | Positive sample precision | Positive sample recall | Positive sample f1-score | Positive sample support | True Positives | True Negatives | False Positives | False Negatives | True Positive Rate (Sensitivity) | True Negative Rate (Specificity) | False Positive Rate | False Negative Rate | Positive Predictive Value | Negative Predictive Value
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
All data | 0.645 | 0.792 | 0.272 | 0.176 | 0.214 | 17 | 0.852 | 0.910 | 0.880 | 89 | 81 | 3 | 14 | 8 | 0.910 | 0.176 | 0.823 | 0.089 | 0.852 | 0.272 |
New editors (<1 year) | 0.630 | 0.767 | 0.272 | 0.2 | 0.230 | 15 | 0.84 | 0.887 | 0.863 | 71 | 63 | 3 | 12 | 8 | 0.887 | 0.2 | 0.8 | 0.112 | 0.84 | 0.272 |
Experienced editors (>=1 year) | 0.638 | 0.9 | 0.0 | 0.0 | 0.0 | 2 | 0.9 | 1.0 | 0.947 | 18 | 18 | 0 | 2 | 0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.9 | nan |
Anonymous editors | 0.702 | 0.75 | 0.272 | 0.25 | 0.260 | 12 | 0.842 | 0.857 | 0.849 | 56 | 48 | 3 | 9 | 8 | 0.857 | 0.25 | 0.75 | 0.142 | 0.842 | 0.272 |
Named editors | 0.563 | 0.868 | 0.0 | 0.0 | 0.0 | 5 | 0.868 | 1.0 | 0.929 | 33 | 33 | 0 | 5 | 0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.868 | nan |
Mobile editors | 0.654 | 0.6 | 0.25 | 0.166 | 0.2 | 6 | 0.687 | 0.785 | 0.733 | 14 | 11 | 1 | 5 | 3 | 0.785 | 0.166 | 0.833 | 0.214 | 0.687 | 0.25 |
Desktop editors | 0.607 | 0.837 | 0.285 | 0.181 | 0.222 | 11 | 0.886 | 0.933 | 0.909 | 75 | 70 | 2 | 9 | 5 | 0.933 | 0.181 | 0.818 | 0.066 | 0.886 | 0.285 |
Model Information
What is the architecture of this model?
{
"type": "GradientBoosting",
"version": "0.5.1",
"params": {
"scale": true,
"center": true,
"labels": [
true,
false
],
"multilabel": false,
"population_rates": null,
"ccp_alpha": 0.0,
"criterion": "friedman_mse",
"init": null,
"learning_rate": 0.01,
"loss": "deviance",
"max_depth": 7,
"max_features": "log2",
"max_leaf_nodes": null,
"min_impurity_decrease": 0.0,
"min_impurity_split": null,
"min_samples_leaf": 1,
"min_samples_split": 2,
"min_weight_fraction_leaf": 0.0,
"n_estimators": 700,
"n_iter_no_change": null,
"presort": "deprecated",
"random_state": null,
"subsample": 1.0,
"tol": 0.0001,
"validation_fraction": 0.1,
"verbose": 0,
"warm_start": false
}
}
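The "GradientBoosting" type and the parameter names above correspond to scikit-learn's gradient boosting classifier as wrapped by revscoring. As a rough sketch (not the exact training code), the underlying estimator implied by these parameters would look roughly like this; the revscoring-specific keys (scale, center, labels, multilabel, population_rates) are handled by the wrapper rather than scikit-learn:

```python
from sklearn.ensemble import GradientBoostingClassifier

# Rough reconstruction of the underlying estimator from the parameters above.
clf = GradientBoostingClassifier(
    loss="log_loss",          # listed as "deviance" above; same loss, renamed in scikit-learn 1.1
    learning_rate=0.01,
    n_estimators=700,
    max_depth=7,
    max_features="log2",
    criterion="friedman_mse",
    subsample=1.0,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    ccp_alpha=0.0,
    validation_fraction=0.1,
    tol=0.0001,
    random_state=None,
)
```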
What is the score schema this model returns?
{
"title": "Scikit learn-based classifier score with probability",
"type": "object",
"properties": {
"prediction": {
"description": "The most likely label predicted by the estimator",
"type": "boolean"
},
"probability": {
"description": "A mapping of probabilities onto each of the potential output labels",
"type": "object",
"properties": {
"true": {
"type": "number"
},
"false": {
"type": "number"
}
}
}
}
}
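For illustration, a returned score can be checked against this schema with the jsonschema library; the score values below are invented purely for the example:

```python
import jsonschema

# The score schema shown above, limited to the documented fields.
schema = {
    "title": "Scikit learn-based classifier score with probability",
    "type": "object",
    "properties": {
        "prediction": {"type": "boolean"},
        "probability": {
            "type": "object",
            "properties": {
                "true": {"type": "number"},
                "false": {"type": "number"},
            },
        },
    },
}

# An invented example score; validate() raises ValidationError if the
# object does not conform to the schema.
score = {"prediction": True, "probability": {"true": 0.94, "false": 0.06}}
jsonschema.validate(instance=score, schema=schema)
```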
1. Halfaker, A., Geiger, R. S., & Terveen, L. G. (2014, April). Snuggle: Designing for efficient socialization and ideological critique. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 311-320).