Research talk:Automated classification of article quality/Work log/2016-06-07
Tuesday, June 7, 2016
Working on ruwiki stuff today.
$ cat ruwiki.observations.first_labelings.20160501.json | json2tsv label | sort | uniq
fa
ga
I
II
III
IV
sa
$ cat ruwiki.observations.first_labelings.20160501.json | grep '"fa"' | wc
1155 12110 263524
$ cat ruwiki.observations.first_labelings.20160501.json | grep '"ga"' | wc
1759 16733 328268
$ cat ruwiki.observations.first_labelings.20160501.json | grep '"I"' | wc
4486 43542 871572
$ cat ruwiki.observations.first_labelings.20160501.json | grep '"II"' | wc
14371 136840 2732236
$ cat ruwiki.observations.first_labelings.20160501.json | grep '"III"' | wc
56042 538541 10415274
$ cat ruwiki.observations.first_labelings.20160501.json | grep '"IV"' | wc
75315 701607 12855088
$ cat ruwiki.observations.first_labelings.20160501.json | grep '"sa"' | wc
1432 13978 282051
So it looks like we can take about 1.1k observations per class (the smallest class, "fa", has 1,155) and keep the sample balanced. --EpochFail (talk) 15:54, 7 June 2016 (UTC)
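The balancing idea can be sketched in Python: group the observations by label, then downsample every class to the size of the smallest one. This is only a minimal sketch, not what revscoring does internally. The function name `balance_by_label` is hypothetical, and the assumption that each observation is a dict with a "label" key is inferred from the `json2tsv label` call rather than confirmed.

```python
import random
from collections import Counter, defaultdict


def balance_by_label(observations, seed=0):
    """Downsample every class to the size of the smallest class.

    `observations` is assumed to be a list of dicts that each carry a
    "label" key (hypothetical schema, inferred from `json2tsv label`).
    """
    by_label = defaultdict(list)
    for obs in observations:
        by_label[obs["label"]].append(obs)

    # The smallest class decides how many observations every class keeps.
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], n))
    return balanced


# Toy data standing in for the real file; these counts are made up.
toy = [{"label": "fa"}] * 3 + [{"label": "ga"}] * 5 + [{"label": "IV"}] * 9
sample = balance_by_label(toy)
print(Counter(obs["label"] for obs in sample))
```

With the real counts, "fa" is the smallest class at 1,155, so a fully balanced sample over 7 classes holds at most 7 × 1,155 = 8,085 observations, or roughly 8k.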
$ make models/ruwiki.wp10.rf.model
cat datasets/ruwiki.features_wp10.8k.tsv | \
revscoring train_test \
revscoring.scorer_models.RF \
wikiclass.feature_lists.ruwiki.wp10 \
--version 0.0.1 \
-p 'n_estimators=501' \
-p 'min_samples_leaf=8' \
-s 'table' -s 'accuracy' -s 'roc' -s 'f1' \
--balance-sample \
--center --scale > \
models/ruwiki.wp10.rf.model
2016-06-07 17:33:53,641 INFO:revscoring.utilities.train_test -- Training model...
2016-06-07 17:34:08,186 INFO:revscoring.utilities.train_test -- Testing model...
ScikitLearnClassifier
- type: RF
- params: random_state=null, scale=true, verbose=0, min_samples_leaf=8, n_estimators=501, n_jobs=1, center=true, criterion="gini", bootstrap=true, balanced_sample=true, min_samples_split=2, balanced_sample_weight=false, warm_start=false, class_weight=null, max_features="auto", max_depth=null, min_weight_fraction_leaf=0.0, oob_score=false, max_leaf_nodes=null
- version: 0.0.1
- trained: 2016-06-07T17:34:08.180792
Table:
       ~I    ~II    ~III    ~IV    ~fa    ~ga    ~sa
---  ----  -----  ------  -----  -----  -----  -----
I      36     46      24      3     23     47     39
II     28     75      31      6      8     20     37
III     7     48     117     36      2      0     22
IV      1      7      57    157      1      0      5
fa      6      1       1      1    158     50      4
ga     19      5       7      3     50    143     16
sa      6     12       3      0      0     17    207
Accuracy: 0.561
ROC-AUC:
-----  -----
'I'    0.73
'II'   0.782
'III'  0.868
'IV'   0.956
'fa'   0.939
'ga'   0.888
'sa'   0.956
-----  -----
F1:
---  -----
II   0.376
III  0.496
I    0.224
IV   0.724
ga   0.55
fa   0.683
sa   0.72
---  -----
That looks useful. We have a low F1 for "I" (0.224); I'd guess that this rating falls between "ga" and "II". --EpochFail (talk) 19:03, 7 June 2016 (UTC)
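As a sanity check on the reported statistics, the accuracy and per-class F1 can be recomputed from the confusion table alone, and the same table shows where class "I" gets lost. This is a minimal sketch in plain Python, not revscoring's own code. It assumes that rows are actual labels and columns (the `~` headers) are predicted labels; the helper names are hypothetical.

```python
# Confusion matrix copied from the table in the model output.
# Assumption: rows are actual labels, columns are predicted labels.
LABELS = ["I", "II", "III", "IV", "fa", "ga", "sa"]
MATRIX = [
    [36, 46, 24, 3, 23, 47, 39],
    [28, 75, 31, 6, 8, 20, 37],
    [7, 48, 117, 36, 2, 0, 22],
    [1, 7, 57, 157, 1, 0, 5],
    [6, 1, 1, 1, 158, 50, 4],
    [19, 5, 7, 3, 50, 143, 16],
    [6, 12, 3, 0, 0, 17, 207],
]


def accuracy(matrix):
    """Fraction of observations on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def f1_scores(labels, matrix):
    """Per-class F1, using F1 = 2*TP / (row sum + column sum)."""
    scores = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        actual = sum(matrix[i])                    # TP + FN
        predicted = sum(row[i] for row in matrix)  # TP + FP
        denom = actual + predicted
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def top_confusions(labels, matrix, label, k=3):
    """The k classes that `label` is most often mistaken for."""
    i = labels.index(label)
    wrong = [(labels[j], n) for j, n in enumerate(matrix[i]) if j != i]
    return sorted(wrong, key=lambda pair: pair[1], reverse=True)[:k]


print(round(accuracy(MATRIX), 3))
print({k: round(v, 3) for k, v in f1_scores(LABELS, MATRIX).items()})
print(top_confusions(LABELS, MATRIX, "I"))
```

This reproduces the reported accuracy of 0.561 and the F1 values in the table. Under the row/column assumption above, "I" articles are most often predicted as "ga" (47) and "II" (46), which fits the guess that "I" sits between those two ratings, although "sa" (39) is close behind.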