Enwiki WP 1.0 Model Card

This model card was created and written entirely by the Algorithmic Accountability Bot account. The bot is operated by Hal Triedman. It regularly creates and updates pages about the provenance and statistical performance of machine learning models and datasets owned by the Wikimedia Foundation.

Note: Any and all edits to this page will be overwritten the next time it is updated. Please put all questions and discussion of this algorithmic component in the talk page, or else contact Hal or the WMF ML team directly.

This is a Tier 2 model card. That means it is generated by retrieving a detailed picture of the performance of the model when it was trained. This provides a decent sense of model performance at training time, but cannot tell you anything about how the model is performing on the MediaWiki platform right now. It also gives a full description of model architecture.

This tier also includes an in-depth explanation of the model rationale, owners, creators, provenance, etc.

Qualitative Analysis

What is the motivation behind creating this model?

Article quality in Wikipedia is of critical concern. Many wikis implement a featured article process (e.g. en:Wikipedia:Featured article) for identifying high-quality content, and many WikiProjects use quality assessments to prioritize and direct work. But this assessment work is extremely time-intensive, and assessments become out of date. This model makes predictions about article quality to support these processes.

Who created this model?

Aaron Halfaker (aaron.halfaker@gmail.com) and Amir Sarabadani (amir.sarabadani@wikimedia.de).

Who currently owns/is responsible for this model?

WMF Machine Learning Team (ml@wikimediafoundation.org)

Who are the intended users of this model?

English Wikipedia uses this model as a service for facilitating efficient reviews of article quality. On an individual basis, anyone can submit a properly formatted API call to ORES for a given article and get back the result of this model.
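As a rough illustration of what a "properly formatted API call" might involve, the sketch below builds a request URL and unpacks a response. The base URL, the `articlequality` model name, and the nesting of the response envelope are assumptions, not details stated on this page, and the sample response is hand-written with illustrative numbers. No network request is made.

```python
from urllib.parse import urlencode

# Assumed endpoint and model name; confirm both against the ORES documentation.
ORES_BASE = "https://ores.wikimedia.org/v3/scores"
WIKI = "enwiki"
MODEL = "articlequality"


def build_score_url(rev_ids):
    """Return a URL asking for article quality scores for the given revision IDs."""
    query = urlencode({"models": MODEL, "revids": "|".join(str(r) for r in rev_ids)})
    return f"{ORES_BASE}/{WIKI}/?{query}"


def extract_score(response, rev_id):
    """Pull the score object for one revision out of a decoded JSON response.

    The nesting used here (wiki -> "scores" -> revision -> model -> "score")
    is an assumption about the response envelope.
    """
    return response[WIKI]["scores"][str(rev_id)][MODEL]["score"]


# Hand-written stand-in for a decoded response; the numbers are illustrative only.
sample_response = {
    "enwiki": {
        "scores": {
            "123456": {
                "articlequality": {
                    "score": {
                        "prediction": "Start",
                        "probability": {
                            "Stub": 0.2, "Start": 0.5, "C": 0.2,
                            "B": 0.06, "GA": 0.03, "FA": 0.01,
                        },
                    }
                }
            }
        }
    }
}

url = build_score_url([123456])
score = extract_score(sample_response, 123456)
print(url)
print(score["prediction"])
```

Keeping URL construction separate from response parsing means the parsing step can be exercised offline, as it is here, before any real call is attempted.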

What should this model be used for?

This model should be used for facilitating article quality reviews on English Wikipedia.

What should this model not be used for?

This model should not be used as the ultimate arbiter of whether an article is high or low quality; a human should make that decision. It should not be used on any English-language wiki other than English Wikipedia, and it should not be used for other languages.

What community approval processes has this model gone through?

English Wikipedia decided (note: don't know where/when this decision was made, would love to find a link to that discussion) to use this model. Over time, the model has been validated through use in the community. The link below is just an example to show what this product might look like.

Dates of consideration forums

2021-09-07

What internal or external changes could make this model deprecated or no longer usable?

Data drift makes the training data for the model no longer usable.
The model doesn't meet desired performance metrics in production.
The English Wikipedia community decides to stop using this model.

How should this model be licensed?

Creative Commons Attribution ShareAlike 3.0

If this model is retrained, can we see how it has changed over time?

To my knowledge, this model has not been retrained; it still uses the original dataset from June 2015.

How does this model mitigate data drift?

This model does not mitigate data drift.

Which service(s) rely on this model?

This model is one of many models that power ORES, the Wikimedia Foundation's machine learning API.

Learn more about ORES here

Which dataset(s) does this model rely on?

This model was trained using hand-labeled training data from 2015. More details are available in the Makefile of the articlequality GitHub repository.

The training dataset is available for download here

Quantitative Analysis

How did the model perform on training data?

counts (n=32284):
		label       n         ~Stub    ~Start    ~C    ~B    ~GA    ~FA
		-------  ----  ---  -------  --------  ----  ----  -----  -----
		'Stub'   5415  -->     4589       791    23    11      1      0
		'Start'  5440  -->      670      3504   852   349     63      2
		'C'      5466  -->       65       943  2703  1063    603     89
		'B'      5472  -->       35       651  1362  2176    892    356
		'GA'     5495  -->        3        38   306   330   3527   1291
		'FA'     4996  -->        1         2    22   235    900   3836
	rates:
		              'Stub'    'Start'    'C'    'B'    'GA'    'FA'
		----------  --------  ---------  -----  -----  ------  ------
		sample         0.168      0.169  0.169  0.169    0.17   0.155
		population     0.576      0.322  0.054  0.035    0.01   0.003
	match_rate (micro=0.386, macro=0.189):
		  Stub    Start      C      B     GA     FA
		------  -------  -----  -----  -----  -----
		 0.501    0.269  0.117  0.085  0.097  0.066
	filter_rate (micro=0.614, macro=0.811):
		  Stub    Start      C      B     GA     FA
		------  -------  -----  -----  -----  -----
		 0.499    0.731  0.883  0.915  0.903  0.934
	recall (micro=0.745, macro=0.632):
		  Stub    Start      C      B     GA     FA
		------  -------  -----  -----  -----  -----
		 0.847    0.644  0.495  0.398  0.642  0.768
	!recall (micro=0.945, macro=0.926):
		  Stub    Start      C      B     GA     FA
		------  -------  -----  -----  -----  -----
		 0.971     0.91  0.904  0.926  0.908  0.936
	precision (micro=0.83, macro=0.372):
		  Stub    Start      C      B     GA     FA
		------  -------  -----  -----  -----  -----
		 0.976    0.772  0.229  0.161  0.065  0.031
	!precision (micro=0.845, macro=0.935):
		  Stub    Start      C      B     GA     FA
		------  -------  -----  -----  -----  -----
		 0.824    0.843  0.969  0.977  0.996  0.999
	f1 (micro=0.775, macro=0.388):
		  Stub    Start      C      B     GA    FA
		------  -------  -----  -----  -----  ----
		 0.907    0.702  0.313  0.229  0.118  0.06
	!f1 (micro=0.891, macro=0.928):
		  Stub    Start      C      B    GA     FA
		------  -------  -----  -----  ----  -----
		 0.892    0.875  0.935  0.951  0.95  0.967
	accuracy (micro=0.875, macro=0.893):
		  Stub    Start      C      B     GA     FA
		------  -------  -----  -----  -----  -----
		   0.9    0.824  0.882  0.908  0.906  0.936
	fpr (micro=0.055, macro=0.074):
		  Stub    Start      C      B     GA     FA
		------  -------  -----  -----  -----  -----
		 0.029     0.09  0.096  0.074  0.092  0.064
	roc_auc (micro=0.942, macro=0.906):
		  Stub    Start      C      B    GA     FA
		------  -------  -----  -----  ----  -----
		 0.978    0.905  0.857  0.831  0.91  0.954
	pr_auc (micro=0.842, macro=0.401):
		  Stub    Start      C      B     GA     FA
		------  -------  -----  -----  -----  -----
		 0.983    0.788  0.251  0.175  0.128  0.078
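To show how the recall figures relate to the counts table, the following sketch recomputes per-class recall from the confusion matrix (rows are true labels, columns are predicted labels). The macro value is the plain mean across classes. Treating the micro value as a population-rate-weighted mean is an assumption about how the report was produced, although it does land on the reported figure when rounded. The other statistics are not reproduced here.

```python
LABELS = ["Stub", "Start", "C", "B", "GA", "FA"]

# Rows are true labels, columns are predicted labels (copied from the counts table).
CONFUSION = [
    [4589, 791, 23, 11, 1, 0],
    [670, 3504, 852, 349, 63, 2],
    [65, 943, 2703, 1063, 603, 89],
    [35, 651, 1362, 2176, 892, 356],
    [3, 38, 306, 330, 3527, 1291],
    [1, 2, 22, 235, 900, 3836],
]

# Population rates copied from the rates table.
POPULATION = [0.576, 0.322, 0.054, 0.035, 0.01, 0.003]


def per_class_recall(matrix):
    """Recall for class i is the diagonal count divided by the row total."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def macro_average(values):
    """Unweighted mean across classes."""
    return sum(values) / len(values)


def weighted_average(values, weights):
    """Mean across classes weighted by the given rates (assumed micro definition)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


recall = per_class_recall(CONFUSION)
macro_recall = macro_average(recall)
micro_recall = weighted_average(recall, POPULATION)

for label, value in zip(LABELS, recall):
    print(f"{label:>5}: {value:.3f}")
print(f"macro: {macro_recall:.3f}")
print(f"micro: {micro_recall:.3f}")
```

Because Stub and Start make up most of the population, the weighted figure sits much closer to their recall than the macro mean does, which explains the gap between the two averages.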

Model Information

What is the architecture of this model?

{
    "type": "GradientBoosting",
    "version": "0.9.2",
    "params": {
        "scale": true,
        "center": true,
        "labels": [
            "Stub",
            "Start",
            "C",
            "B",
            "GA",
            "FA"
        ],
        "multilabel": false,
        "population_rates": null,
        "ccp_alpha": 0.0,
        "criterion": "friedman_mse",
        "init": null,
        "learning_rate": 0.01,
        "loss": "deviance",
        "max_depth": 7,
        "max_features": "log2",
        "max_leaf_nodes": null,
        "min_impurity_decrease": 0.0,
        "min_impurity_split": null,
        "min_samples_leaf": 1,
        "min_samples_split": 2,
        "min_weight_fraction_leaf": 0.0,
        "n_estimators": 500,
        "n_iter_no_change": null,
        "presort": "deprecated",
        "random_state": null,
        "subsample": 1.0,
        "tol": 0.0001,
        "validation_fraction": 0.1,
        "verbose": 0,
        "warm_start": false
    }
}
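
The parameter dump above mixes two kinds of settings. As a hedged sketch, the helper below separates the keys that appear to configure the wrapper around the estimator (`scale`, `center`, `labels`, `multilabel`, `population_rates`; this grouping is an assumption) from the keys that look like scikit-learn gradient boosting arguments. It also drops `presort` and `min_impurity_split` and renames the `deviance` loss to `log_loss`, which is how recent scikit-learn releases name these options. The function name `split_params` is hypothetical.

```python
import json

# Keys assumed to configure the wrapper rather than the estimator itself.
WRAPPER_KEYS = {"scale", "center", "labels", "multilabel", "population_rates"}

# Keys that recent scikit-learn releases no longer accept.
REMOVED_KEYS = {"presort", "min_impurity_split"}

# The "params" object from the dump above, reproduced compactly.
PARAMS_JSON = """
{"scale": true, "center": true,
 "labels": ["Stub", "Start", "C", "B", "GA", "FA"],
 "multilabel": false, "population_rates": null,
 "ccp_alpha": 0.0, "criterion": "friedman_mse", "init": null,
 "learning_rate": 0.01, "loss": "deviance", "max_depth": 7,
 "max_features": "log2", "max_leaf_nodes": null,
 "min_impurity_decrease": 0.0, "min_impurity_split": null,
 "min_samples_leaf": 1, "min_samples_split": 2,
 "min_weight_fraction_leaf": 0.0, "n_estimators": 500,
 "n_iter_no_change": null, "presort": "deprecated",
 "random_state": null, "subsample": 1.0, "tol": 0.0001,
 "validation_fraction": 0.1, "verbose": 0, "warm_start": false}
"""


def split_params(params):
    """Split a parameter dump into (wrapper settings, estimator keyword arguments)."""
    wrapper = {k: v for k, v in params.items() if k in WRAPPER_KEYS}
    estimator = {
        k: v
        for k, v in params.items()
        if k not in WRAPPER_KEYS and k not in REMOVED_KEYS
    }
    # "deviance" was the older name for the multinomial log loss.
    if estimator.get("loss") == "deviance":
        estimator["loss"] = "log_loss"
    return wrapper, estimator


wrapper_settings, estimator_kwargs = split_params(json.loads(PARAMS_JSON))
print(sorted(wrapper_settings))
print(estimator_kwargs["n_estimators"], estimator_kwargs["learning_rate"])
```

With scikit-learn installed, `estimator_kwargs` could then be passed to `sklearn.ensemble.GradientBoostingClassifier(**estimator_kwargs)`. Doing so would only approximate the configuration: it would not reproduce the trained model, which also depends on the feature extraction and training data.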

What is the score schema this model returns?

{
    "title": "Scikit learn-based classifier score with probability",
    "type": "object",
    "properties": {
        "prediction": {
            "description": "The most likely label predicted by the estimator",
            "type": "string"
        },
        "probability": {
            "description": "A mapping of probabilities onto each of the potential output labels",
            "type": "object",
            "properties": {
                "Stub": {
                    "type": "number"
                },
                "Start": {
                    "type": "number"
                },
                "C": {
                    "type": "number"
                },
                "B": {
                    "type": "number"
                },
                "GA": {
                    "type": "number"
                },
                "FA": {
                    "type": "number"
                }
            }
        }
    }
}
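
A consumer of this schema may want to check a score object before using it. The sketch below is one minimal way to do that in plain Python: it verifies that `prediction` is a string and that `probability` holds a number for each of the six labels, then picks the label with the highest probability. The function names `check_score` and `most_likely` are hypothetical, and the sample score is hand-written for illustration.

```python
LABELS = ("Stub", "Start", "C", "B", "GA", "FA")


def check_score(score):
    """Return a list of problems found in a score object; an empty list means it passed."""
    problems = []
    if not isinstance(score.get("prediction"), str):
        problems.append("prediction is not a string")
    probability = score.get("probability")
    if not isinstance(probability, dict):
        problems.append("probability is not an object")
        return problems
    for label in LABELS:
        value = probability.get(label)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"probability[{label!r}] is not a number")
    return problems


def most_likely(probability):
    """Return the label with the highest probability."""
    return max(LABELS, key=lambda label: probability[label])


# Hand-written example score; the numbers are illustrative only.
sample_score = {
    "prediction": "GA",
    "probability": {
        "Stub": 0.01, "Start": 0.04, "C": 0.15,
        "B": 0.2, "GA": 0.45, "FA": 0.15,
    },
}

print(check_score(sample_score))
print(most_likely(sample_score["probability"]))
```

Returning a list of problems rather than raising on the first one makes it easier to log every issue with a malformed score in a single pass.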