Research talk:Automated classification of draft quality/Work log/2016-09-27

Wednesday, September 28, 2016

Today I'm working to gather page creations (not deleted) that occurred in the last year. Unfortunately, there's no historical log of page creations, but I can filter revisions for those that have no parent revision (rev_parent_id = 0).

SELECT rev_timestamp, page_title, rev_len, rev_user_text 
FROM revision 
INNER JOIN page ON 
    rev_page = page_id
WHERE
    rev_timestamp BETWEEN "20150927" AND "20160927" AND
    rev_parent_id = 0 AND
    page_namespace = 0
LIMIT 10;
rev_timestamp   page_title                                               rev_len  rev_user_text
20150926000023  General_Todorov                                               37  Ketiltrout
20150926000028  Scott_hoying                                                 982  Lwp2004
20150926000319  Parque_de_la_Bombilla_(Mexico_City)                          772  Josedricoa
20150926000435  Mogilno_Falsification                                       1591  Tymek
20150926000643  Temple_of_Venus                                              316  LlywelynII
20150926000727  Motorslug                                                   2921  Soul Crusher
20150926000736  Temple_of_Venus_(Baalbek)                                     27  LlywelynII
20150926000840  John_R._McDermott                                             36  CactusWriter
20150926000940  The_Hard_Easy                                                 60  23W
20150926001001  Conference_of_Secretaries_of_World_Christian_Communions     1260  1549bcp

OK. So, I'm thinking that we can get a sample of good pages this way.

Ultimately, I think we'll want a representative sample of pages that are:

  • Not deleted
  • Deleted for less concerning reasons (e.g. no assertion of importance)
  • Deleted for immediately concerning reasons
    • Spam
    • Vandalism
    • Attack
    • Hoax

I think I'd like to lump the first two together eventually, but first I'll need to sample each group separately. I want the following columns:

  • page_title
  • creation_rev_id
  • creation_timestamp
  • archived (did we find the page in the `archive` table?)
  • creation_quality (OK, spam, vandalism, attack, hoax)
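
A minimal sketch of how rows with these columns could be assembled, using toy stand-ins for the `revision`, `page`, and `archive` tables in an in-memory SQLite database. The table contents are made up for illustration, and using `ar_parent_id = 0` to find the first revision of a deleted page is an assumption that mirrors the `rev_parent_id = 0` filter above; it is not taken from the actual Quarry queries. Deleted pages get a NULL creation_quality here because no deletion reason is known at this stage.

```python
import sqlite3

# Toy stand-ins for the MediaWiki `page`, `revision`, and `archive` tables.
# Only the columns this sketch needs are included; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, page_title TEXT);
CREATE TABLE revision (rev_id INTEGER, rev_page INTEGER,
                       rev_parent_id INTEGER, rev_timestamp TEXT);
CREATE TABLE archive (ar_rev_id INTEGER, ar_namespace INTEGER, ar_title TEXT,
                      ar_parent_id INTEGER, ar_timestamp TEXT);
INSERT INTO page VALUES (1, 0, 'Surviving_article');
INSERT INTO revision VALUES (10, 1, 0, '20160801120000');
INSERT INTO revision VALUES (11, 1, 10, '20160802120000');
INSERT INTO archive VALUES (20, 0, 'Deleted_article', 0, '20160803120000');
INSERT INTO archive VALUES (21, 0, 'Deleted_article', 20, '20160803130000');
""")

# First half: creations of pages that still exist (labeled OK).
# Second half: creations found in `archive` (label left NULL for now).
SAMPLE_QUERY = """
SELECT page_title,
       rev_id AS creation_rev_id,
       rev_timestamp AS creation_timestamp,
       0 AS archived,
       'OK' AS creation_quality
FROM revision
INNER JOIN page ON rev_page = page_id
WHERE rev_parent_id = 0 AND page_namespace = 0
UNION ALL
SELECT ar_title, ar_rev_id, ar_timestamp, 1, NULL
FROM archive
WHERE ar_parent_id = 0 AND ar_namespace = 0
ORDER BY creation_timestamp
"""

rows = conn.execute(SAMPLE_QUERY).fetchall()
for row in rows:
    print(row)
```

Only the first revision of each page appears in the result; later revisions are dropped by the parent-ID filter.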

We should only sample pages that were created at least 30 days before the query runs, so that each page will have had a chance to be deleted. See my work in R:Wikipedia article creation for justification of the 30-day threshold. Here's a query to get the good creations: https://quarry.wmflabs.org/query/12795 "English article creations that have survived at least 1 month"
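
The 30-day rule can be sketched as a small helper that computes the bounds of the creation window in MediaWiki's 14-digit timestamp format. The function name `creation_window` and the 365-day window length are assumptions for illustration (the log only says "the last year"); the 30-day survival period is the threshold discussed above.

```python
from datetime import datetime, timedelta

MW_TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # MediaWiki's YYYYMMDDHHMMSS layout


def creation_window(query_time, survival_days=30, window_days=365):
    """Return (start, end) MediaWiki timestamps for sampling creations.

    The window ends `survival_days` before `query_time`, so every page
    created inside it has had at least that long to be deleted.
    """
    end = query_time - timedelta(days=survival_days)
    start = end - timedelta(days=window_days)
    return start.strftime(MW_TIMESTAMP_FORMAT), end.strftime(MW_TIMESTAMP_FORMAT)


# Hypothetical query date; the bounds would go into a BETWEEN clause
# on rev_timestamp (or ar_timestamp for deleted pages).
start, end = creation_window(datetime(2016, 9, 27))
print(start, end)
```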


Here's a query to get all of the deleted article creations in the same time period. https://quarry.wmflabs.org/query/12796 I set creation_quality to NULL there because I'll need to join this with the logging table later in order to get a deletion reason for labeling. --EpochFail (talk) 00:51, 28 September 2016 (UTC)
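
One possible way to turn a deletion reason into a creation_quality label, once the join with the logging table is done, is to look for English Wikipedia speedy deletion criteria in the log comment. This is only a sketch under assumptions: the function name `label_deletion`, the example comments, and the choice to fall back to "OK" for less concerning reasons (following the tentative plan to lump those with surviving pages) are all hypothetical. The mapping relies on G3 covering both vandalism and hoaxes, G10 covering attack pages, and G11 covering advertising.

```python
import re

# Assumed mapping from speedy deletion criteria to the labels listed above.
# G3 covers both vandalism and hoaxes, so "hoax" is detected by keyword.
CRITERION_LABELS = {"G3": "vandalism", "G10": "attack", "G11": "spam"}

# Matches criterion codes such as G3, G11, or A7 as whole tokens.
CRITERION_RE = re.compile(r"\b([GA]\d{1,2})\b", re.IGNORECASE)


def label_deletion(comment, default="OK"):
    """Guess a creation_quality label from a deletion log comment."""
    if "hoax" in comment.lower():
        return "hoax"
    for code in CRITERION_RE.findall(comment):
        label = CRITERION_LABELS.get(code.upper())
        if label is not None:
            return label
    # No concerning criterion found (e.g. A7, no indication of importance).
    return default


# Illustrative comments, not taken from real log entries.
examples = [
    "[[WP:CSD#G11|G11]]: Unambiguous advertising or promotion",
    "[[WP:CSD#G3|G3]]: Blatant hoax",
    "[[WP:CSD#A7|A7]]: No credible indication of importance",
]
for comment in examples:
    print(label_deletion(comment), "<-", comment)
```

Free-text deletion comments that cite no criterion would all fall through to the default, so a real labeling pass would likely need manual review of that bucket.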