Research talk:Wikipedia article creation/Work log/Monday, November 25th

From Meta, a Wikimedia project coordination wiki
Jump to navigation Jump to search

Monday, November 25th[edit]

Today, I'm going to try to replicate the change in the proportion of deleted pages for non-English wikis. The real problem is that the database servers for non-English Wikipedia do not allow for temporary table -- the kind of scratch space I need in order to move forward.

My plan is to extract exactly what I need to a local database server that I can use to perform the requisite joins. I've figured out how to extract page info as well as user stats, but I can't seem to figure out how to extract the revision that created the page so I can join with the user table and split newcomers from oldtimers.


Page status query
SELECT
    page_id,
    page_namespace,
    page_title,
    revisions,
    first_revision.
    last_revision,
    archived,
    (
        archived AND 
        UNIX_TIMESTAMP(last_revision) - UNIX_TIMESTAMP(first_revision) < 60*60*24*30 
    ) AS archived_quickly,
    creation.rev_id AS creating_rev_id,
    creation.rev_timestamp AS creating_rev_timestamp,
    user_id AS creator_id,
    IFNULL(creatore.user_name, creation.rev_user_text) AS creator_name,
    IFNULL(direct_create.log_action, indirect_create.log_action) AS creator_creation_action,
    user_registration AS creator_registration
FROM (
    SELECT
        ar_page_id AS page_id,
        ar_namespace AS page_namespace,
        ar_title AS page_title,
        COUNT(*) AS revisions,
        MIN(ar_timestamp) AS first_revision,
        MAX(ar_timestamp) AS last_revision,
        True AS archived,
        MIN(ar_rev_id) AS first_rev_id
    FROM archive
    WHERE ar_timestamp < "20131105000000"
    UNION
    SELECT
        page_id,
        page_namespace,
        page_title,
        COUNT(*) AS revisions,
        MIN(rev_timestamp) AS first_edit,
        MAX(rev_timestamp) AS last_edit,
        False AS archived,
        MIN(rev_id) AS first_rev_id
    FROM revision
    INNER JOIN page ON page_id = rev_page
    WHERE rev_timestamp < "20131105000000"
    GROUP BY 1,2,3
) as page
LEFT JOIN revision creation ON rev_id = first_rev_id
LEFT JOIN user creator ON rev_user = user_id
LEFT JOIN logging AS direct_create ON
    direct_create.log_type = "newusers" AND
    creator.user_id = direct_create.log_user AND
    direct_create.log_action IN ("create", "autocreate", "newusers")
LEFT JOIN logging AS indirect_create ON
    indirect_create.log_type = "newusers" AND
    indirect_create.log_action IN ("byemail", "create2") AND
    REPLACE(creator.user_name, " ", "_") = indirect_create.log_title;

Alright. This one is a monster. Here's what it is supposed to get:

(one row per page)

  • page_id: Page identifier (if not archived)
  • page_namespace: "
  • page_title: "
  • revisions: Number of revisions saved
  • first_revision: Date of the first revision
  • last_revision: Date of the last revision
  • archived: True if in archive table, false otherwise
  • archived_quickly: True if archived within 30 days of creation (judging by last rev_timestamp)
  • creating_rev_id: rev_id of the creating revision
  • creating_rev_timestamp: "
  • creator_id: user_id of the creator (or NULL for anons)
  • creator_name: user_name of the creator (or IP for anons)
  • creator_action: log_action for relevant account creations
  • creator_registration: registration date of creator (or NULL for anons)


Time to see how quickly I can get in trouble for unleashing this.

19:39, 25 November 2013 (UTC)