Research:Automated classification of draft quality
The average time between edits when an editor is "in-session" is 7 minutes, but the median time until a new article is tagged for deletion is 2 minutes. The most common reason for deletion tagging (in English Wikipedia) is A7: "No indication of importance". It seems likely that newcomer article creators are *adding* a credible assertion of importance in a second edit, and that this edit is blocked by an edit conflict with the deletion tag. Research suggests that this early, negative feedback is one of the leading predictors that a newcomer will stop editing Wikipedia entirely.
The reason we need to review new page creation so quickly is to get rid of spam and egregious vandalism. Most other types of potentially undesirable new articles would not cause damage were they to be left alone for a little while -- enough time to allow the creator to finish their initial sequence of edits. We can split the feed of newly created pages using a machine learning classifier so that we can have two review backlogs: one for fast review of spam and egregious vandalism and another for slower review of all other new articles.
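The split into two backlogs can be sketched as a simple routing step. This is only an illustration, assuming a classifier that returns a probability per `draft_quality` label; the names `ReviewQueues` and `route_new_page` and the 0.5 threshold are hypothetical and are not taken from the project.

```python
from dataclasses import dataclass, field

# Classes that need fast review; names follow the dataset's draft_quality labels.
URGENT_LABELS = ("spam", "vandalism", "attack")


@dataclass
class ReviewQueues:
    """Two backlogs: one reviewed quickly, one reviewed after a delay."""
    fast: list = field(default_factory=list)
    slow: list = field(default_factory=list)


def route_new_page(title, probabilities, queues, threshold=0.5):
    """Send a page to the fast queue when the combined probability of the
    urgent classes reaches the (hypothetical) threshold; otherwise slow."""
    urgent_score = sum(probabilities.get(label, 0.0) for label in URGENT_LABELS)
    if urgent_score >= threshold:
        queues.fast.append(title)
        return "fast"
    queues.slow.append(title)
    return "slow"


queues = ReviewQueues()
route_new_page("Example draft", {"OK": 0.9, "spam": 0.1}, queues)
route_new_page("Buy cheap widgets", {"OK": 0.2, "spam": 0.7, "attack": 0.1}, queues)
```

In practice the threshold would be chosen from the trade-off between missed spam in the slow queue and the size of the fast queue.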
The labeling query (see below) was run for each month between Aug. 2015 and Aug. 2016 to acquire a dataset with 907,415 observations:
- 881,159 "OK"
- 26,256 otherwise
  - 6,506 "vandalism"
  - 2,451 "attack"
  - 17,704 "spam"
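The counts above imply a strongly imbalanced dataset. The arithmetic can be checked with a short sketch; the inverse-frequency weighting shown here is an assumption about how one might compensate for the imbalance, not something the source describes.

```python
# Top-level class counts as reported for the Aug. 2015 to Aug. 2016 dataset.
counts = {"OK": 881_159, "otherwise": 26_256}

total = sum(counts.values())

# Share of each class in the dataset.
share = {label: n / total for label, n in counts.items()}

# Inverse-frequency weights (assumed scheme): total / (n_classes * class_count).
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in counts:
    print(f"{label}: share={share[label]:.4f}, weight={weights[label]:.2f}")
```

Fewer than 3% of observations fall outside "OK", so a classifier trained without any rebalancing could score well on accuracy while rarely flagging problem pages.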
The full dataset can be downloaded from the GitHub repository: https://github.com/wiki-ai/draftquality/tree/master/datasets
SELECT
  page_title,
  rev_id,
  rev_timestamp AS creation_timestamp,
  FALSE AS archived,
  "OK" AS draft_quality
FROM revision
INNER JOIN page ON rev_page = page_id
WHERE
  rev_timestamp BETWEEN @start AND @end AND
  rev_parent_id = 0 AND
  page_namespace = 0
UNION ALL
SELECT
  ar_title AS page_title,
  ar_rev_id AS rev_id,
  ar_timestamp AS creation_timestamp,
  TRUE AS archived,
  IF(log_comment REGEXP "WP:CSD#G3\\|", "vandalism",
    IF(log_comment REGEXP "WP:CSD#G10\\|", "attack",
      IF(log_comment REGEXP "WP:CSD#G11\\|", "spam", "OK"))) AS draft_quality
FROM archive
LEFT JOIN logging speedy_delete ON
  log_namespace = ar_namespace AND
  log_title = ar_title AND
  log_type = "delete" AND
  log_action = "delete" AND
  log_comment LIKE "[[WP:CSD#%" AND
  log_comment REGEXP "WP:CSD#(G3|G10|G11)\\|" AND
  log_timestamp > ar_timestamp
WHERE
  ar_timestamp BETWEEN @start AND @end AND
  log_timestamp BETWEEN @start AND @end AND
  ar_parent_id = 0 AND
  ar_namespace = 0
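The nested `IF` in the query maps a deletion log comment to a label by checking the speedy-deletion criteria in a fixed order (G3, then G10, then G11). A minimal Python sketch of the same mapping follows; the function name `label_from_log_comment` and the sample comments are hypothetical, and case-sensitive matching is assumed.

```python
import re

# Checked in the same order as the nested IF in the labeling query.
# "\|" matches the literal pipe that follows the criterion in a wikilink.
CSD_LABELS = (
    (re.compile(r"WP:CSD#G3\|"), "vandalism"),
    (re.compile(r"WP:CSD#G10\|"), "attack"),
    (re.compile(r"WP:CSD#G11\|"), "spam"),
)


def label_from_log_comment(log_comment):
    """Return the draft_quality label implied by a deletion log comment."""
    if log_comment is None:
        return "OK"
    for pattern, label in CSD_LABELS:
        if pattern.search(log_comment):
            return label  # first matching criterion wins
    return "OK"


# Hypothetical log comments, for illustration only.
print(label_from_log_comment("[[WP:CSD#G11|G11]]: Unambiguous advertising"))
print(label_from_log_comment("[[WP:CSD#A7|A7]]: No indication of importance"))
```

Because the first match wins, a comment that cites several criteria is labeled by the earliest one in the list, which mirrors the order of the `IF` expression.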