Research talk:Automated classification of draft quality/Work log/2016-09-26

Monday, September 26, 2016

I've finally got some time to look at this problem. My goal today is to run a few queries that will allow me to extract a random sample of article creations from a recent time period (this year?) with labels for spam, vandalism, or "other". I'll focus on English Wikipedia for now. I'll be using the deletion log to get the sample of bad pages. We'll need to figure out some page_id bounds for gathering a sample of good pages too.

First things first, how do we find the bad page creations?

Query: https://quarry.wmflabs.org/query/12780
Referencing: en:Wikipedia:Criteria_for_speedy_deletion

I think we're generally interested in:

  • WP:CSD#G3 -- Pure vandalism and blatant hoaxes
  • WP:CSD#G10 -- Pages that disparage, threaten, or harass
  • WP:CSD#G11 -- Unambiguous advertising
  • WP:CSD#A11 -- Obviously invented

With this query, we get 52,810 results, which is a pretty good set of "positive" examples:

-- Page deletions from Sep 2015 through Sep 2016 whose deletion summary
-- cites one of the damaging speedy deletion criteria.
SELECT log_id, log_title, log_comment, log_namespace
FROM logging
WHERE
    log_type = "delete" AND
    log_action = "delete" AND
    log_timestamp BETWEEN "20150901" AND "20160901" AND
    log_comment LIKE "[[WP:CSD#%" AND                  -- standard CSD summary format
    log_comment REGEXP "WP:CSD#(G3|G10|G11|A11)\\|";   -- just the criteria we care about

So that seems to work nicely. Now we need a sample of articles that were deleted for innocuous reasons and articles that weren't deleted at all. --EpochFail (talk) 22:28, 26 September 2016 (UTC)
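
A sketch of what those two queries might look like (not run yet; the page_id bounds, the sample sizes, and the ORDER BY RAND() approach are all placeholders to be tuned):

-- Innocuous deletions: same window, but the summary does NOT cite
-- one of the damaging CSD criteria.
SELECT log_id, log_title, log_comment
FROM logging
WHERE
    log_type = "delete" AND
    log_action = "delete" AND
    log_namespace = 0 AND                              -- articles only
    log_timestamp BETWEEN "20150901" AND "20160901" AND
    log_comment NOT REGEXP "WP:CSD#(G3|G10|G11|A11)\\|"
ORDER BY RAND()
LIMIT 1000;

-- Surviving articles: draw from a page_id range that roughly matches
-- the same creation window. 47000000 and 51000000 are placeholder
-- bounds; the real ones still need to be figured out.
SELECT page_id, page_title
FROM page
WHERE
    page_namespace = 0 AND
    page_is_redirect = 0 AND
    page_id BETWEEN 47000000 AND 51000000
ORDER BY RAND()
LIMIT 1000;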


Oooh. I made a table of deletion reasons we get in this set with this query: https://quarry.wmflabs.org/query/12782

deletion_reason    count
attack              3427
hoax                2132
spam               42498
vandalism          10144
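
For reference, something like this could produce that breakdown (a sketch of what the linked query might do; the mapping from criteria to labels is my guess, with A11 and hoax-flavored G3 deletions counted as "hoax" and the remaining G3 deletions as "vandalism"):

SELECT
    CASE
        WHEN log_comment REGEXP "WP:CSD#G10\\|" THEN "attack"
        WHEN log_comment REGEXP "WP:CSD#G11\\|" THEN "spam"
        WHEN log_comment REGEXP "WP:CSD#A11\\|"
             OR log_comment LIKE "%hoax%" THEN "hoax"
        ELSE "vandalism"
    END AS deletion_reason,
    COUNT(*)
FROM logging
WHERE
    log_type = "delete" AND
    log_action = "delete" AND
    log_timestamp BETWEEN "20150901" AND "20160901" AND
    log_comment LIKE "[[WP:CSD#%" AND
    log_comment REGEXP "WP:CSD#(G3|G10|G11|A11)\\|"
GROUP BY deletion_reason;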

--EpochFail (talk) 22:29, 26 September 2016 (UTC)