Research talk:Automated classification of edit quality/Work log/2017-05-04
Thursday, May 4, 2017
Today, I'm exploring an issue that was reported by the Collab team. Apparently there's very little overlap between "goodfaith" and "damaging" edits for English Wikipedia, but other wikis have enough overlap to target goodfaith newcomers who are running into issues.
In order to examine this problem, I gathered a random sample of 10k edits from recentchanges in enwiki. I filtered out bot edits because those are uninteresting for recentchanges patrolling. Here's my query: https://quarry.wmflabs.org/query/18386
Now I'm working on a script that uses ores.api.Session to query live ORES and get scores for the sample of edits. I've just got this script in my little analysis repo, but we should probably add it as a utility to ORES soon.

"""
Scores a set of revisions

Usage:
    score_revisions (-h|--help)
    score_revisions <ores-host> <context> <model>... [--debug] [--verbose]

Options:
    -h --help    Prints this documentation
    <ores-host>  The host name for an ORES instance to use in scoring
    <context>    The name of the wiki to execute model(s) for
    <model>      The name of a model to use in scoring
"""
import json
import logging
import sys

import docopt

from ores import api

logger = logging.getLogger(__name__)


def main():
    args = docopt.docopt(__doc__)

    logging.basicConfig(
        level=logging.INFO if not args['--debug'] else logging.DEBUG,
        format='%(asctime)s %(levelname)s:%(name)s -- %(message)s'
    )

    ores_host = args['<ores-host>']
    context = args['<context>']
    model_names = args['<model>']
    verbose = args['--verbose']

    rev_docs = [json.loads(l) for l in sys.stdin]

    run(ores_host, context, model_names, rev_docs, verbose)


def run(ores_host, context, model_names, rev_docs, verbose):
    session = api.Session(ores_host, user_agent="ahalfaker@wikimedia.org")

    rev_ids = [d['rev_id'] for d in rev_docs]
    scores = session.score(context, model_names, rev_ids)

    for rev_doc, score_doc in zip(rev_docs, scores):
        rev_doc['score'] = score_doc
        json.dump(rev_doc, sys.stdout)
        sys.stdout.write("\n")

        if verbose:
            sys.stderr.write(".")
            sys.stderr.flush()


if __name__ == "__main__":
    main()
I ran it on my 10k sample and only got 9 errors.
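As an aside, since the script is mostly a thin wrapper: the underlying ores.api.Session call can be exercised on its own. Here's a minimal sketch — the host, user_agent, and rev_ids are illustrative stand-ins, while the constructor and score() call mirror the script's usage:

# Minimal sketch of calling ores.api.Session directly. The host, user_agent,
# and rev_ids below are illustrative; swap in your own contact address.
from ores import api

session = api.Session("https://ores.wikimedia.org", user_agent="you@example.org")

# score(context, model_names, rev_ids) yields one score document per rev_id,
# in order -- the same contract the script above relies on.
for score_doc in session.score("enwiki", ["damaging", "goodfaith"],
                               [778068153, 778323385]):
    print(score_doc)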
$ cat enwiki.scored_revision_sample.nonbot_10k.json | grep error | wc
      9     197    2342
$ cat enwiki.scored_revision_sample.nonbot_10k.json | grep error | json2tsv score.damaging.error.type
TextDeleted
TextDeleted
TimeoutError
TextDeleted
TextDeleted
TextDeleted
TextDeleted
RevisionNotFound
TimeoutError
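The same tally can be done in Python if the shell tools aren't handy. This is just a sketch that reads the scored JSON lines and counts the score.damaging.error.type field shown above:

# Sketch: count error types in the scored output. Assumes every line has a
# score.damaging entry, which matches the output format shown above.
import json
from collections import Counter

error_types = Counter()
with open("enwiki.scored_revision_sample.nonbot_10k.json") as f:
    for line in f:
        damaging = json.loads(line)["score"]["damaging"]
        if "error" in damaging:
            error_types[damaging["error"]["type"]] += 1

print(error_types)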
Looks like a handful of deleted-text and timeout errors. The data looks good, so I'll work with it.
OK, the next step is to extract the fields I want into a TSV so that I can load them into R for some analysis.
$ cat enwiki.scored_revision_sample.nonbot_10k.json | grep -v error | json2tsv rev_id score.damaging.score.probability.true score.goodfaith.score.probability.true --header | head
rev_id     score.damaging.score.probability.true  score.goodfaith.score.probability.true
778068153  0.0341070039332118                     0.9631482216073363
778323385  0.06079271012144102                    0.9183819888507275
774264535  0.018699456923994003                   0.9848181505213502
774896131  0.32644924496861927                    0.5472383417030015
775918221  0.12748914158045266                    0.8296519735326966
775977649  0.05609497811177157                    0.8352973506092333
775539875  0.01176361409844698                    0.9837210953518821
777263348  0.5899814608767912                     0.5644538254856134
776059314  0.02054486212356617                    0.9772033930188049
OK that looks good. Time for some analysis. --EpochFail (talk) 17:38, 4 May 2017 (UTC)
Analysis
OK! I've got the gist of what's going on. See my code here: https://github.com/halfak/damaging-goodfaith-overlap
We can see from these plots that, while each model's scores often reach the extremes, there's little overlap where both models produce extremely high or low values at the same time.
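The plots themselves come from the R code in the repo above. As an illustrative alternative (not the analysis code), a quick matplotlib scatter of the two probabilities from the scored file shows the same pattern — almost nothing lands in the high-damaging, high-goodfaith corner:

# Illustrative sketch only -- the real plots were made in R (see the repo).
# Scatter P(damaging) against P(goodfaith) for the non-error rows.
import json
import matplotlib.pyplot as plt

p_damaging, p_goodfaith = [], []
with open("enwiki.scored_revision_sample.nonbot_10k.json") as f:
    for line in f:
        doc = json.loads(line)
        if "error" in doc["score"]["damaging"] or "error" in doc["score"]["goodfaith"]:
            continue  # skip TextDeleted / TimeoutError / RevisionNotFound rows
        p_damaging.append(doc["score"]["damaging"]["score"]["probability"]["true"])
        p_goodfaith.append(doc["score"]["goodfaith"]["score"]["probability"]["true"])

plt.scatter(p_damaging, p_goodfaith, s=3, alpha=0.3)
plt.xlabel("P(damaging)")
plt.ylabel("P(goodfaith)")
plt.savefig("damaging_goodfaith_scatter.png")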
#High probability pairs makes the issue plain. There's just no overlap at the confidence levels that the Collab team has told me they expect (damaging min_precision=0.6, goodfaith min_precision=0.99). However, if I set the damaging threshold to abide by more moderate rules (damaging min_recall=0.75, goodfaith min_precision=0.99), I get some results, as can be seen in #Moderate probability pairs.
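For concreteness, pulling such a cross-section out of the scored sample looks something like this. The cutoff values below are illustrative placeholders; the real probability cutoffs are whatever satisfy damaging min_recall=0.75 and goodfaith min_precision=0.99 in the models' test statistics:

# Hypothetical sketch: select "moderate pair" edits from the scored sample.
# The cutoffs are placeholders, not the actual thresholds derived from the
# models' test statistics.
import json

DAMAGING_CUTOFF = 0.4    # placeholder for the damaging min_recall=0.75 threshold
GOODFAITH_CUTOFF = 0.6   # placeholder for the goodfaith min_precision=0.99 threshold

with open("enwiki.scored_revision_sample.nonbot_10k.json") as f:
    for line in f:
        doc = json.loads(line)
        damaging = doc["score"]["damaging"]
        goodfaith = doc["score"]["goodfaith"]
        if "error" in damaging or "error" in goodfaith:
            continue
        p_d = damaging["score"]["probability"]["true"]
        p_g = goodfaith["score"]["probability"]["true"]
        if p_d >= DAMAGING_CUTOFF and p_g >= GOODFAITH_CUTOFF:
            print(doc["rev_id"], p_d, p_g)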
OK, but is the moderate cross-section useful for anything? Let's check! The following table is a random sample of edits that meet the moderate pair thresholds with my annotations:
revision | damaging probability | goodfaith probability | notes |
---|---|---|---|
en:Special:Diff/776491504 | 0.4011175 | 0.6523189 | maybe damaging, goodfaith (newcomer, mobile edit) |
en:Special:Diff/776561939 | 0.5577317 | 0.6381191 | maybe damaging, goodfaith (anon) |
en:Special:Diff/773901225 | 0.4808844 | 0.6326436 | not damaging, goodfaith (anon) |
en:Special:Diff/776192598 | 0.5090065 | 0.7602717 | not damaging, goodfaith (anon) |
en:Special:Diff/775184319 | 0.5168659 | 0.6679756 | not damaging, goodfaith (anon) |
en:Special:Diff/776909321 | 0.4109281 | 0.8508490 | damaging, goodfaith (newcomer) |
en:Special:Diff/773839838 | 0.4705899 | 0.6161455 | damaging, goodfaith (newcomer) |
en:Special:Diff/775681846 | 0.3980012 | 0.8870231 | not damaging, goodfaith (anon) |
en:Special:Diff/777385056 | 0.4906228 | 0.6944950 | damaging, goodfaith (anon) |
en:Special:Diff/775954857 | 0.4083657 | 0.7240080 | damaging, goodfaith (newcomer) |
en:Special:Diff/778629261 | 0.4156775 | 0.7470698 | not damaging, goodfaith (anon) |
en:Special:Diff/777972078 | 0.4976089 | 0.6170718 | not damaging, goodfaith (newcomer) |
en:Special:Diff/776171391 | 0.5123592 | 0.8396888 | not damaging, goodfaith (anon, counter-vandalism) |
en:Special:Diff/775954413 | 0.3981722 | 0.6712455 | damaging, goodfaith (anon) |
en:Special:Diff/774703855 | 0.4264561 | 0.7632287 | not damaging, goodfaith (anon, adding category) |
en:Special:Diff/777069077 | 0.4241885 | 0.6990100 | damaging, goodfaith (newcomer) |
en:Special:Diff/777864924 | 0.4098085 | 0.6073056 | not damaging, goodfaith (anon, counter-vandalism) |
en:Special:Diff/774911971 | 0.4021984 | 0.6594416 | damaging, goodfaith (anon, misplaced talk post) |
en:Special:Diff/775082597 | 0.6174247 | 0.6371081 | damaging, goodfaith (anon, misplaced talk post) |
en:Special:Diff/778161116 | 0.4311144 | 0.6327798 | not damaging, goodfaith (newcomer) |
en:Special:Diff/776781184 | 0.4929796 | 0.6192534 | damaging, goodfaith (newcomer, BLP) |
en:Special:Diff/774472865 | 0.4664499 | 0.6066368 | damaging, goodfaith (newcomer) |
en:Special:Diff/774799454 | 0.4839814 | 0.7210619 | damaging, goodfaith (anon) |
en:Special:Diff/775569040 | 0.5607529 | 0.6193204 | damaging, goodfaith (newcomer) |
en:Special:Diff/775292667 | 0.4404379 | 0.8778261 | damaging, goodfaith (anon, failing to fix table) |
en:Special:Diff/775535192 | 0.4850735 | 0.6673567 | damaging, goodfaith (anon) |
en:Special:Diff/775352387 | 0.4932909 | 0.6775150 | damaging, goodfaith (anon) |
en:Special:Diff/776968902 | 0.4367727 | 0.6644402 | not damaging, goodfaith (anon, mobile) |
en:Special:Diff/776072339 | 0.5684984 | 0.6742460 | damaging, maybe badfaith (anon) |
en:Special:Diff/776084132 | 0.4516739 | 0.8753995 | damaging, goodfaith (newcomer-ish) |
So that looks pretty useful. My recommendation: Don't set such strict thresholds. Models will still be useful at lower levels of confidence. --EpochFail (talk) 18:48, 4 May 2017 (UTC)