Research talk:Automated classification of draft quality/Work log/2016-09-26

Monday, September 26, 2016

I've finally got some time to look at this problem. My goal today is to run a few queries that will allow me to extract a random sample of article creations from a recent time period (this year?) with labels for spam, vandalism, or "other". I'll focus on English Wikipedia for now. I'll be using the deletion log to get the sample of bad pages. We'll need to figure out some page_id bounds for gathering a sample of good pages too.

First things first, how do we find the bad page creations?

Query: https://quarry.wmflabs.org/query/12780

Referencing: en:Wikipedia:Criteria_for_speedy_deletion

I think we're generally interested in:

WP:CSD#G3 -- Pure vandalism and blatant hoaxes
WP:CSD#G10 -- Pages that disparage, threaten or harass
WP:CSD#G11 -- Unambiguous advertising
WP:CSD#A11 -- Obviously invented

With this query, we get 52810 results, which is a pretty good set of "positive" examples:

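SELECT log_id, log_title, log_comment, log_namespace
FROM logging
WHERE
    -- Only actual page deletions from the deletion log
    log_type = "delete" AND
    log_action = "delete" AND
    -- One-year window ending September 1, 2016
    log_timestamp BETWEEN "20150901" AND "20160901" AND
    -- Comment starts with a CSD link and names one of the criteria listed above
    log_comment LIKE "[[WP:CSD#%" AND
    log_comment REGEXP "WP:CSD#(G3|G10|G11|A11)\\|";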
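
As a sanity check on the comment filter, here is a minimal Python sketch of the same two conditions (the LIKE prefix and the REGEXP with its escaped pipe). The function name `is_bad_creation` and the sample comments are made up for illustration; they are not taken from the actual log, and the sketch ignores any database collation details.

```python
import re

# Same pattern as the REGEXP clause: one of the four criteria, followed by
# the literal "|" that closes the link target in "[[WP:CSD#G11|G11]]".
CSD_PATTERN = re.compile(r"WP:CSD#(G3|G10|G11|A11)\|")


def is_bad_creation(log_comment: str) -> bool:
    """Return True if the comment would pass both filters in the query."""
    # Equivalent of: log_comment LIKE "[[WP:CSD#%"
    has_csd_prefix = log_comment.startswith("[[WP:CSD#")
    # Equivalent of: log_comment REGEXP "WP:CSD#(G3|G10|G11|A11)\\|"
    names_target_criterion = CSD_PATTERN.search(log_comment) is not None
    return has_csd_prefix and names_target_criterion


# Hypothetical deletion comments, for illustration only.
samples = [
    "[[WP:CSD#G11|G11]]: Unambiguous advertising or promotion",
    "[[WP:CSD#G3|G3]]: Vandalism",
    "[[WP:CSD#A7|A7]]: No indication of importance",
    "[[WP:CSD#G1|G1]]: Patent nonsense",
    "Spam: [[WP:CSD#G11|G11]]",
]

for comment in samples:
    print(is_bad_creation(comment), comment)
```

The trailing pipe ties the match to the end of the link target, and the prefix check drops comments where the CSD link is not the first thing in the comment (the last sample).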

So that seems to work nicely. Now we need a sample of articles that were deleted for innocuous reasons and articles that weren't deleted at all. --EpochFail (talk) 22:28, 26 September 2016 (UTC)

Oooh. I made a table of deletion reasons we get in this set with this query: https://quarry.wmflabs.org/query/12782

deletion_reason	COUNT(*)
attack	3427
hoax	2132
spam	42498
vandalism	10144
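
A rough Python sketch of how log comments could be bucketed into reasons like the ones in this table. The criterion-to-label mapping below is an assumption, not the logic of the Quarry query: G3 itself covers both vandalism and hoaxes, so the real query may split them differently. The names `LABEL_BY_CRITERION`, `deletion_reason`, and `count_reasons` and the sample comments are hypothetical.

```python
import re
from collections import Counter

# ASSUMPTION: a one-to-one guess at how criteria map to labels.
# The actual query may assign hoax/vandalism differently.
LABEL_BY_CRITERION = {
    "G3": "vandalism",
    "G10": "attack",
    "G11": "spam",
    "A11": "hoax",
}

CRITERION_RE = re.compile(r"WP:CSD#(G3|G10|G11|A11)\|")


def deletion_reason(log_comment):
    """Return the label for the first matching criterion, or None."""
    match = CRITERION_RE.search(log_comment)
    if match is None:
        return None
    return LABEL_BY_CRITERION[match.group(1)]


def count_reasons(comments):
    """Tally labels over an iterable of comments, skipping non-matches."""
    reasons = (deletion_reason(comment) for comment in comments)
    return Counter(reason for reason in reasons if reason is not None)


# Hypothetical comments, for illustration only.
comments = [
    "[[WP:CSD#G11|G11]]: Unambiguous advertising or promotion",
    "[[WP:CSD#G11|G11]]: Spam",
    "[[WP:CSD#G10|G10]]: Attack page",
    "[[WP:CSD#A7|A7]]: No indication of importance",
]

for reason, count in sorted(count_reasons(comments).items()):
    print(reason, count, sep="\t")
```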

--EpochFail (talk) 22:29, 26 September 2016 (UTC)