Research:Automated classification of draft quality

From Meta, a Wikimedia project coordination wiki

This page is an incomplete draft of a research project.
Information is incomplete and is likely to change substantially before the project starts.

The average time between edits when an editor is "in-session" is 7 minutes, but the median time before a new article is tagged for deletion is 2 minutes. The most common reason for deletion tagging (in English Wikipedia) is A7: "No indication of importance". It seems likely that newcomer article creators are *adding* a credible assertion of importance in a second edit, and that this edit is blocked by an edit conflict with the deletion tag. Research suggests that this early, negative feedback is one of the leading predictors that a newcomer will stop editing Wikipedia entirely.



The reason we need to review new page creation so quickly is to get rid of spam and egregious vandalism. Most other types of potentially undesirable new articles would not cause damage were they to be left alone for a little while -- enough time to allow the creator to finish their initial sequence of edits. We can split the feed of newly created pages using a machine learning classifier so that we can have two review backlogs: one for fast review of spam and egregious vandalism and another for slower review of all other new articles.
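The split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `NewPage`, `predicted_label`, `URGENT_LABELS`, and `route` are hypothetical names, and the label strings are borrowed from the labeled dataset described further down this page, not from any existing ORES model output.

```python
from dataclasses import dataclass

# Assumed label set: the three non-"OK" labels used in the labeled data.
URGENT_LABELS = {"spam", "vandalism", "attack"}


@dataclass
class NewPage:
    title: str
    predicted_label: str  # assumed output of a draft quality classifier


def route(pages):
    """Split new pages into a fast-review and a slow-review backlog."""
    fast, slow = [], []
    for page in pages:
        if page.predicted_label in URGENT_LABELS:
            fast.append(page)   # needs immediate attention
        else:
            slow.append(page)   # can wait until the creator finishes editing
    return fast, slow
```

A real deployment would more likely route on a probability threshold than on a hard label, so that reviewers can trade recall for workload.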

The ORES service would be a great place to build and host such a model, and the Research:Revision scoring as a service project team would be interested in providing support and advice.

Labeled data


The labeling query (see below) was run for each month between Aug. 2015 and Aug. 2016 to acquire a dataset with 907,415 observations:

  • 881,159 "OK"
  • 26,256 otherwise
    • 6,506 "vandalism"
    • 2,451 "attack"
    • 17,704 "spam"

The full dataset can be downloaded from the GitHub repository: https://github.com/wiki-ai/draftquality/tree/master/datasets
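The top-level counts show how imbalanced the labels are, which matters when training and evaluating a classifier. A small sketch using only the "OK" and "otherwise" totals reported above (the variable names are illustrative):

```python
# Counts as reported for the Aug. 2015 to Aug. 2016 dataset.
total = 907_415
ok = 881_159
flagged = 26_256  # every observation not labeled "OK"

# The two top-level groups account for the whole dataset.
assert ok + flagged == total

flagged_share = flagged / total
print(f"share of non-OK observations: {flagged_share:.2%}")
print(f"OK observations per non-OK observation: {ok / flagged:.1f}")
```

With under 3% of observations in the non-"OK" group, plain accuracy would be a misleading metric; a model that always predicts "OK" would already score above 97%.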

Labeling query
-- Pages that still exist: first revisions of main-namespace pages are labeled "OK".
SELECT
  page_title,
  rev_id,
  rev_timestamp AS creation_timestamp,
  FALSE AS archived,
  "OK" AS draft_quality
FROM revision
INNER JOIN page ON
  rev_page = page_id
WHERE
  rev_timestamp BETWEEN @start AND @end AND
  rev_parent_id = 0 AND
  page_namespace = 0
UNION ALL
-- Deleted pages: label by the speedy deletion criterion in the deletion log comment.
SELECT
  ar_title AS page_title,
  ar_rev_id AS rev_id,
  ar_timestamp AS creation_timestamp,
  TRUE AS archived,
  IF(log_comment REGEXP "WP:CSD#G3\\|", "vandalism",
    IF(log_comment REGEXP "WP:CSD#G10\\|", "attack",
      IF(log_comment REGEXP "WP:CSD#G11\\|", "spam", "OK"))) AS draft_quality
FROM archive
LEFT JOIN logging speedy_delete ON
  log_namespace = ar_namespace AND
  log_title = ar_title AND
  log_type = "delete" AND
  log_action = "delete" AND
  log_comment LIKE "[[WP:CSD#%" AND
  log_comment REGEXP "WP:CSD#(G3|G10|G11)\\|" AND
  log_timestamp > ar_timestamp
WHERE
  ar_timestamp BETWEEN @start AND @end AND
  -- Archive rows with no matching log entry have a NULL log_timestamp and are dropped here.
  log_timestamp BETWEEN @start AND @end AND
  ar_parent_id = 0 AND
  ar_namespace = 0
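
The nested IF(...) expression in the query maps a deletion log comment to a label by checking the speedy deletion criteria in a fixed order. The same mapping can be sketched in Python for use outside the database; `CSD_LABELS` and `draft_quality` are hypothetical names, and case-sensitive matching is an assumption.

```python
import re

# Patterns mirror the IF(...) chain in the query; they are checked in this order.
# The trailing literal "|" keeps "G3" from matching other criteria.
CSD_LABELS = [
    (re.compile(r"WP:CSD#G3\|"), "vandalism"),
    (re.compile(r"WP:CSD#G10\|"), "attack"),
    (re.compile(r"WP:CSD#G11\|"), "spam"),
]


def draft_quality(log_comment):
    """Return the label for a deletion log comment.

    A comment of None (no matching log row) or a comment that cites
    none of the three criteria falls through to "OK", as in the query.
    """
    if log_comment is None:
        return "OK"
    for pattern, label in CSD_LABELS:
        if pattern.search(log_comment):  # unanchored, like SQL REGEXP
            return label
    return "OK"
```

For instance, a made-up comment such as "[[WP:CSD#G11|G11]]: advertising" would be labeled "spam" by this function.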