Research talk:Wikipedia article creation/Work log/Tuesday, November 19th

From Meta, a Wikimedia project coordination wiki

Tuesday, November 19th[edit]

All of my work yesterday went directly to the main page, so I didn't end up taking any notes. Today, I'm working on looking specifically at newcomer page creations. In order to determine newcomerness, I need to know two things, when page creators registered their accounts and how they did so. Since accounts are autocreated when global users on one wiki browse to another, I need to filter those accounts out. Here's the query I'm running right now to gather all relevant page creators, their registration date and account creation style:

SELECT
    user.user_id,
    user.user_name,
    user_registration,
    IFNULL(direct_create.log_action, indirect_create.log_action) AS account_creation_action
FROM (SELECT DISTINCT user_id FROM halfak.nov13_creation) AS creator
INNER JOIN enwiki.user AS user USING (user_id)
LEFT JOIN enwiki.logging AS direct_create ON
    direct_create.log_type = "newusers" AND
    creator.user_id = direct_create.log_user AND
    direct_create.log_action IN ("create", "autocreate")
LEFT JOIN enwiki.logging AS indirect_create ON
    indirect_create.log_type = "newusers" AND
    indirect_create.log_action IN ("byemail", "create2") AND
    REPLACE(user.user_name, " ", "_") = indirect_create.log_title
GROUP BY 1,2,3;

It's sad, but since some account creations are indirect ("byemail" - account requested by email and "create2" - account created by an already registered editor), I needed to match on the log_title instead.

Note that I also had to switch spaces to underscores to match "titles" since usernames have spaces and titles have underscores -- even when a username is being placed in the title field. Luckily, there's an index on log_title that I get to use, so this won't substantially affect the query speed.

20:42, 19 November 2013 (UTC)


I just thought of a potential issue. The logging table probably does not capture new user creations all of the way back until 2001.

 select min(log_timestamp) from logging where log_action = "create" and log_type = "newusers";
+--------------------+
| min(log_timestamp) |
+--------------------+
| 20060417041459     |
+--------------------+
1 row in set (1 min 14.71 sec)

That's to be expected. I suspect that the most concerning user creation action, "autocreate" didn't even exist before that point.

21:46, 19 November 2013 (UTC)


I just finished updating my page status query to get the status of the user who created the page so that we can do some filtering for newcomers.

SELECT
    LEFT(page.first_revision, 8) AS date,
    page_namespace,
    IF(
        creator.user_id IS NULL OR creator.user_id = 0,
        "anon",
        IF(
            creator.account_creation_action IS NULL,
            "unknown",
            creator.account_creation_action
        )
    ) AS account_type,
    IF(
        UNIX_TIMESTAMP(page.first_revision) - 
        UNIX_TIMESTAMP(creator.user_registration) < 60*60*24,
        "day",
        IF(
            UNIX_TIMESTAMP(page.first_revision) - 
            UNIX_TIMESTAMP(creator.user_registration) < 60*60*24*7,
            "week",
            IF(
                UNIX_TIMESTAMP(page.first_revision) - 
                UNIX_TIMESTAMP(creator.user_registration) < 60*60*24*30,
                "month",
                "oldtimer"
            )
        )
    ) AS experience,
    COUNT(*) AS pages,
    SUM(archived) AS archived
FROM halfak.nov13_page AS page
INNER JOIN halfak.nov13_creation AS creation USING (page_id)
LEFT JOIN halfak.nov13_user_stats AS creator USING (user_id)
GROUP BY 1,2,3,4;

This query should give us a daily timeseries of article creations stratified by namespace, account_creation_type and experience level of the creating editor.

22:11, 19 November 2013 (UTC)


Just for fun, I wanted to know what type of accounts create pages. Here are the counts from my user stats table that tracks how user accounts were created using the logging table.

> select account_creation_action, count(*) from nov13_user_stats group by 1;
+-------------------------+----------+
| account_creation_action | count(*) |
+-------------------------+----------+
| NULL                    |   325731 |
| autocreate              |    73212 |
| byemail                 |      672 |
| create                  |  2827264 |
| create2                 |    17198 |
+-------------------------+----------+
5 rows in set (2.23 sec)

The NULLs represent users who were created before the logging table started tracking user creations. Autocreates are all of the global users who browsed over to enwiki from another wiki. "Create"s are the usual user type. Presumably, the "create" and NULL groups are practically the same. Byemail can get lumped in there too. Create2 I'm not so sure about. We'll see how they show up in the timeseries.

22:14, 19 November 2013 (UTC)


Hmm... I take the above statement about nulls back. It looks like we get a sizable portion of NULL account_creation_actions in recent years.

> select LEFT(user_registration, 4), account_creation_action, count(*) FROM halfak.nov13_user_stats group by 1,2;
+----------------------------+-------------------------+----------+
| LEFT(user_registration, 4) | account_creation_action | count(*) |
+----------------------------+-------------------------+----------+
| NULL                       | NULL                    |    27112 |
| 2001                       | NULL                    |      193 |
| 2002                       | NULL                    |     1536 |
| 2003                       | NULL                    |     6258 |
| 2004                       | NULL                    |    27640 |
| 2004                       | autocreate              |        1 |
| 2005                       | NULL                    |   108625 |
| 2005                       | create2                 |        7 |
| 2006                       | NULL                    |   132644 |
| 2006                       | create                  |   389185 |
| 2006                       | create2                 |     1602 |
| 2007                       | NULL                    |      179 |
| 2007                       | create                  |   577920 |
| 2007                       | create2                 |     3087 |
| 2008                       | NULL                    |     1551 |
| 2008                       | autocreate              |     8440 |
| 2008                       | create                  |   421036 |
| 2008                       | create2                 |     3235 |
| 2009                       | NULL                    |     4437 |
| 2009                       | autocreate              |    16311 |
| 2009                       | create                  |   366379 |
| 2009                       | create2                 |     2426 |
| 2010                       | NULL                    |     3864 |
| 2010                       | autocreate              |    14467 |
| 2010                       | create                  |   300206 |
| 2010                       | create2                 |     1948 |
| 2011                       | NULL                    |     7396 |
| 2011                       | autocreate              |     9753 |
| 2011                       | create                  |   313508 |
| 2011                       | create2                 |     1912 |
| 2012                       | NULL                    |     3117 |
| 2012                       | autocreate              |    13575 |
| 2012                       | create                  |   269253 |
| 2012                       | create2                 |     1983 |
| 2013                       | NULL                    |     1179 |
| 2013                       | autocreate              |    10665 |
| 2013                       | byemail                 |      672 |
| 2013                       | create                  |   189777 |
| 2013                       | create2                 |      998 |
+----------------------------+-------------------------+----------+
39 rows in set (2.91 sec)

Given the strangeness of data before 2006, I might just filter out the NULLs.

22:21, 19 November 2013 (UTC)


Pages started in sandbox[edit]

OK. While I'm waiting for queries to get counts of article creations by user experience, I'll start working out how I'm going to figure out if an article started in a user sandbox. Sadly, because of lameness in the logging table (see #50784) we can't track page moves easily, but due to a brilliant observation by superm401, I found that page moves result in a revision being stored with a machine generated edit summary. This automatically generated edit summary should lend itself to easy analysis. Sadly, it will also involve a scan of the entire revision and archive tables.

First, let's look for an example page move comment and then lets figure out how long MediaWiki has been doing this. I just grabbed the appropriate revision from en:GroupLens Research corresponding to the move earlier.

> select rev_comment from revision where rev_id = 336269292;
+-----------------------------------------------------------------------------------------------------+
| rev_comment                                                                                         |
+-----------------------------------------------------------------------------------------------------+
| moved [[User:EpochFail/Grouplens Lab]] to [[GroupLens Research]]: Article ready for main namespace. |
+-----------------------------------------------------------------------------------------------------+
1 row in set (0.04 sec)
/moved \[\[([^\]]+)\]\] to \[\[([^\]]+)\]\]:.*/

I think I might just grab all of the revisions that match this regex and throw them into a table that I can work with easier.


Here we go:

SELECT 
    rev_id,
    page_id,
    page_namespace,
    page_title,
    rev_user AS user_id,
    rev_user_text AS user_text,
    rev_timestamp,
    rev_comment
FROM revision
INNER JOIN page ON rev_page = page_id
WHERE rev_comment RLIKE "moved \[\[([^\]]+)\]\] to \[\[([^\]]+)\]\]:.*"
UNION
SELECT
    ar_rev_id AS rev_id,
    ar_page_id AS page_id,
    ar_namespace AS page_namespace,
    ar_title AS page_title,
    ar_user AS user_id,
    ar_user_text AS user_text,
    ar_timestamp AS rev_timestamp,
    ar_comment AS rev_comment
FROM archive
WHERE ar_comment RLIKE "moved \[\[([^\]]+)\]\] to \[\[([^\]]+)\]\]:.*";