Research talk:Automated classification of edit quality/Work log/2017-05-04


Thursday, May 4, 2017

Today, I'm exploring an issue that was reported by the Collab team. Apparently there's very little overlap between the "goodfaith" and "damaging" predictions for English Wikipedia (almost no edits score high on both), but other wikis have enough overlap to target goodfaith newcomers who are running into trouble.

In order to examine this problem, I gathered a random sample of 10k edits from recentchanges in enwiki. I filtered out bot edits because those are uninteresting for recentchanges patrolling. Here's my query: https://quarry.wmflabs.org/query/18386

Now I'm working on a script that uses ores.api.Session to query the live ORES service and get scores for the sample of edits. For now, the script just lives in my little analysis repo, but we should probably add it to ORES as a utility soon.

"""
Scores a set of revisions

Usage:
    score_revisions (-h|--help)
    score_revisions <ores-host> <context> <model>...
                    [--debug]
                    [--verbose]

Options:
    -h --help    Prints this documentation
    <ores-host>  The host name for an ORES instance to use in scoring
    <context>    The name of the wiki to execute model(s) for
    <model>      The name of a model to use in scoring
"""
import json
import logging
import sys

import docopt
from ores import api

logger = logging.getLogger(__name__)


def main():
    args = docopt.docopt(__doc__)

    logging.basicConfig(
        level=logging.INFO if not args['--debug'] else logging.DEBUG,
        format='%(asctime)s %(levelname)s:%(name)s -- %(message)s'
    )

    ores_host = args['<ores-host>']
    context = args['<context>']
    model_names = args['<model>']
    verbose = args['--verbose']

    # Read one JSON revision document (each with a 'rev_id' field) per line from stdin
    rev_docs = [json.loads(l) for l in sys.stdin]

    run(ores_host, context, model_names, rev_docs, verbose)


def run(ores_host, context, model_names, rev_docs, verbose):
    # A session for scoring revisions against the live ORES host
    session = api.Session(ores_host, user_agent="ahalfaker@wikimedia.org")

    # Score all of the sampled revision IDs with the requested models
    rev_ids = [d['rev_id'] for d in rev_docs]
    scores = session.score(context, model_names, rev_ids)

    # Attach each score document to its revision doc and write it out as a JSON line
    for rev_doc, score_doc in zip(rev_docs, scores):
        rev_doc['score'] = score_doc
        json.dump(rev_doc, sys.stdout)
        sys.stdout.write("\n")
        if verbose:
            sys.stderr.write(".")
            sys.stderr.flush()


if __name__ == "__main__":
    main()
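
For reference, the script is invoked roughly like this (the script and input file names here are illustrative, https://ores.wikimedia.org is just the public endpoint, and the input is one JSON document with a rev_id field per line):

$ cat enwiki.revision_sample.nonbot_10k.json | \
  python score_revisions.py https://ores.wikimedia.org enwiki damaging goodfaith \
  > enwiki.scored_revision_sample.nonbot_10k.json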

I ran it on my 10k sample and only got 9 errors.

$ cat enwiki.scored_revision_sample.nonbot_10k.json | grep error | wc
      9     197    2342
$ cat enwiki.scored_revision_sample.nonbot_10k.json | grep error | json2tsv score.damaging.error.type
TextDeleted
TextDeleted
TimeoutError
TextDeleted
TextDeleted
TextDeleted
TextDeleted
RevisionNotFound
TimeoutError

Looks like a few deleted or missing revisions and a couple of timeouts. Nothing concerning, so I'll keep working with this data.

OK next step is to extract the fields I want into a TSV so that I can load them into R for some analysis.

$ cat enwiki.scored_revision_sample.nonbot_10k.json | grep -v error | json2tsv rev_id score.damaging.score.probability.true score.goodfaith.score.probability.true --header | head
rev_id	score.damaging.score.probability.true	score.goodfaith.score.probability.true
778068153	0.0341070039332118	0.9631482216073363
778323385	0.06079271012144102	0.9183819888507275
774264535	0.018699456923994003	0.9848181505213502
774896131	0.32644924496861927	0.5472383417030015
775918221	0.12748914158045266	0.8296519735326966
775977649	0.05609497811177157	0.8352973506092333
775539875	0.01176361409844698	0.9837210953518821
777263348	0.5899814608767912	0.5644538254856134
776059314	0.02054486212356617	0.9772033930188049

OK that looks good. Time for some analysis. --EpochFail (talk) 17:38, 4 May 2017 (UTC)

Analysis

OK! I've got the gist of what's going on. See my code here: https://github.com/halfak/damaging-goodfaith-overlap
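
The actual analysis code is in that repo (R, per the note above about loading the TSV into R). Purely as an illustration of what the density figure below shows, a rough Python equivalent might look like this (the TSV file name is a placeholder; the column names come from the json2tsv output above):

import csv
import matplotlib.pyplot as plt

# Placeholder file name for the TSV extracted above
path = "enwiki.scored_revision_sample.nonbot_10k.tsv"

damaging, goodfaith = [], []
with open(path) as f:
    for row in csv.DictReader(f, delimiter='\t'):
        damaging.append(float(row['score.damaging.score.probability.true']))
        goodfaith.append(float(row['score.goodfaith.score.probability.true']))

# Overlay the two score distributions (normalized histograms stand in for densities)
plt.hist(damaging, bins=50, density=True, histtype='step', label='damaging')
plt.hist(goodfaith, bins=50, density=True, histtype='step', label='goodfaith')
plt.xlabel('ORES probability')
plt.ylabel('density')
plt.legend()
plt.savefig('prediction_density.png')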

[Figure: Density of predictions. Damaging and goodfaith ORES score densities are plotted for a random sample of edits from English Wikipedia.]
[Figure: Prediction pairs scatter-plot. Damaging and goodfaith ORES scores are plotted for a random sample of edits from English Wikipedia.]

We can see from these plots that, while each model's scores often reach the extremes, there's very little overlap where both models produce extremely high (or extremely low) values for the same edit.

[Figure: High probability pairs. Damaging and goodfaith ORES scores are plotted for a random sample of edits from English Wikipedia where damaging >= 0.879 and goodfaith >= 0.86 (both very high probability). No points == no overlap.]
[Figure: Moderate probability pairs. Damaging and goodfaith ORES scores are plotted for a random sample of edits from English Wikipedia where damaging >= 0.398 and goodfaith >= 0.601 (both moderate probability).]

The #High probability pairs plot makes the issue plain. There's just no overlap at the confidence levels the Collab team has told me they expect (damaging min_precision=0.6, goodfaith min_precision=0.99). However, if I set the damaging threshold to a more moderate rule (damaging min_recall=0.75, goodfaith min_precision=0.99), I get some results, as can be seen in #Moderate probability pairs.
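
To make that comparison concrete, here's a minimal sketch of the overlap count under each pair of cutoffs (the cutoffs are the probability values from the plot captions above; the TSV file name is a placeholder; the real analysis lives in the linked repo):

import csv

def overlap(path, damaging_min, goodfaith_min):
    # Count edits whose damaging AND goodfaith probabilities both clear the cutoffs
    with open(path) as f:
        return sum(
            1 for row in csv.DictReader(f, delimiter='\t')
            if float(row['score.damaging.score.probability.true']) >= damaging_min and
               float(row['score.goodfaith.score.probability.true']) >= goodfaith_min)

path = "enwiki.scored_revision_sample.nonbot_10k.tsv"  # placeholder file name
print("strict:", overlap(path, 0.879, 0.86))     # cutoffs behind the high probability plot
print("moderate:", overlap(path, 0.398, 0.601))  # cutoffs behind the moderate probability plot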

OK, but is the moderate cross-section useful for anything? Let's check! The following table is a random sample of edits that meet the moderate pair thresholds, along with my annotations:

revision    damaging prob.    goodfaith prob.    notes
en:Special:Diff/776491504 0.4011175 0.6523189 maybe damaging, goodfaith (newcomer, mobile edit)
en:Special:Diff/776561939 0.5577317 0.6381191 maybe damaging, goodfaith (anon)
en:Special:Diff/773901225 0.4808844 0.6326436 not damaging, goodfaith (anon)
en:Special:Diff/776192598 0.5090065 0.7602717 not damaging, goodfaith (anon)
en:Special:Diff/775184319 0.5168659 0.6679756 not damaging, goodfaith (anon)
en:Special:Diff/776909321 0.4109281 0.8508490 damaging, goodfaith (newcomer)
en:Special:Diff/773839838 0.4705899 0.6161455 damaging, goodfaith (newcomer)
en:Special:Diff/775681846 0.3980012 0.8870231 not damaging, goodfaith (anon)
en:Special:Diff/777385056 0.4906228 0.6944950 damaging, goodfaith (anon)
en:Special:Diff/775954857 0.4083657 0.7240080 damaging, goodfaith (newcomer)
en:Special:Diff/778629261 0.4156775 0.7470698 not damaging, goodfaith (anon)
en:Special:Diff/777972078 0.4976089 0.6170718 not damaging, goodfaith (newcomer)
en:Special:Diff/776171391 0.5123592 0.8396888 not damaging, goodfaith (anon, counter-vandalism)
en:Special:Diff/775954413 0.3981722 0.6712455 damaging, goodfaith (anon)
en:Special:Diff/774703855 0.4264561 0.7632287 not damaging, goodfaith (anon, adding category)
en:Special:Diff/777069077 0.4241885 0.6990100 damaging, goodfaith (newcomer)
en:Special:Diff/777864924 0.4098085 0.6073056 not damaging, goodfaith (anon, counter-vandalism)
en:Special:Diff/774911971 0.4021984 0.6594416 damaging, goodfaith (anon, misplaced talk post)
en:Special:Diff/775082597 0.6174247 0.6371081 damaging, goodfaith (anon, misplaced talk post)
en:Special:Diff/778161116 0.4311144 0.6327798 not damaging, goodfaith (newcomer)
en:Special:Diff/776781184 0.4929796 0.6192534 damaging, goodfaith (newcomer, BLP)
en:Special:Diff/774472865 0.4664499 0.6066368 damaging, goodfaith (newcomer)
en:Special:Diff/774799454 0.4839814 0.7210619 damaging, goodfaith (anon)
en:Special:Diff/775569040 0.5607529 0.6193204 damaging, goodfaith (newcomer)
en:Special:Diff/775292667 0.4404379 0.8778261 damaging, goodfaith (anon, failing to fix table)
en:Special:Diff/775535192 0.4850735 0.6673567 damaging, goodfaith (anon)
en:Special:Diff/775352387 0.4932909 0.6775150 damaging, goodfaith (anon)
en:Special:Diff/776968902 0.4367727 0.6644402 not damaging, goodfaith (anon, mobile)
en:Special:Diff/776072339 0.5684984 0.6742460 damaging, maybe badfaith (anon)
en:Special:Diff/776084132 0.4516739 0.8753995 damaging, goodfaith (newcomer-ish)

So that looks pretty useful. My recommendation: Don't set such strict thresholds. Models will still be useful at lower levels of confidence. --EpochFail (talk) 18:48, 4 May 2017 (UTC)