Research talk:Automated classification of draft quality/Work log/2016-09-26
Monday, September 26, 2016
I've finally got some time to look at this problem. My goal today is to run a few queries that will allow me to extract a random sample of article creations from a recent time period (this year?) with labels for spam, vandalism, or "other". I'll focus on English Wikipedia for now. I'll be using the deletion log to get the sample of bad pages. We'll need to figure out some page_id bounds for gathering a sample of good pages too.
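The page_id-bounds idea for the "good" sample could look something like this, a minimal sketch in Python. The bounds and sample size here are placeholders I made up for illustration, not real values derived from the log:

```python
import random

# Hypothetical page_id bounds roughly corresponding to the time period
# of interest; the real bounds still need to be figured out from the data.
MIN_PAGE_ID, MAX_PAGE_ID = 47000000, 51000000
SAMPLE_SIZE = 5

random.seed(0)  # make the draw reproducible
# Draw distinct random page_ids within the bounds.
sample_ids = random.sample(range(MIN_PAGE_ID, MAX_PAGE_ID), SAMPLE_SIZE)
```

The sampled ids would then be joined back against the page table to filter to surviving (non-deleted) article creations.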
First things first, how do we find the bad page creations?
I think we're generally interested in:
- WP:CSD#G3 -- Pure vandalism and blatant hoaxes
- WP:CSD#G10 -- Pages that disparage, threaten or harass
- WP:CSD#G11 -- Unambiguous advertising
- WP:CSD#A11 -- Obviously invented
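Turning these codes into spam/vandalism/"other" labels might look like the sketch below. It assumes the deletion comment embeds a wikilink like `[[WP:CSD#G11|...]]` (the same pattern the deletion-log query matches on); the exact code-to-label mapping is my own assumption, not something settled yet:

```python
import re

# Matches the CSD code inside a deletion comment such as
# "[[WP:CSD#G11|G11]]: Unambiguous advertising ...".
CSD_RE = re.compile(r"WP:CSD#(G3|G10|G11|A11)\|")

# Assumed mapping from CSD code to a training label; where G10 and A11
# should land is an open question.
LABELS = {"G3": "vandalism", "G10": "attack", "G11": "spam", "A11": "other"}

def label_for(log_comment):
    """Return a label for a deletion log comment, or None if no CSD code matches."""
    match = CSD_RE.search(log_comment)
    return LABELS[match.group(1)] if match else None
```

For example, `label_for("[[WP:CSD#G11|G11]]: Unambiguous advertising")` yields `"spam"`, while a comment with no CSD code yields `None`.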
With the following query, we get 52,810 results, which makes a pretty good set of "positive" examples:
 SELECT log_id, log_title, log_comment, log_namespace
 FROM logging
 WHERE log_type = "delete"
   AND log_action = "delete"
   AND log_timestamp BETWEEN "20150901" AND "20160901"
   AND log_comment LIKE "[[WP:CSD#%"
   AND log_comment REGEXP "WP:CSD#(G3|G10|G11|A11)\\|";
Oooh. I made a table of the deletion reasons that appear in this set with this query: https://quarry.wmflabs.org/query/12782