Research talk:Wikipedia article creation/Work log/Thursday, November 21st

From Meta, a Wikimedia project coordination wiki

Thursday, November 21st[edit]

Today, I'm trying to look at timeseries of newcomer page creations. My queries finished since the last update, so I got right to plotting the timeseries. I split up page creations by the time since a users registration at the time of page creation.

  • day = < 24 hours since registration
  • week = < 7 days since registration
  • month = < 30 days since registration
  • oldtimer = >= 30 days since registration
The number of article created are plotted by month and whether they were archived at the time of query for registered users at different levels of experience.
Article counts by experience. The number of article created are plotted by month and whether they were archived at the time of query for registered users at different levels of experience.

That looks weird. We see a big spike in deletions in 2008 -- especially for day.

The number of article created are plotted by month and whether they were archived at the time of query for registered users at different levels of experience.
Article counts (day experience only). The number of article created are plotted by month and whether they were archived at the time of query for registered users at different levels of experience.

Yeah, that's a strange time to see a spike. I've confirmed that the spike in deleted pages does not correspond to the time when page creation was turned off for anons (some time in 2005). I filtered this set to include only self-created accounts to make sure I didn't have issues with autocreated users. Let's look at all the new account types to see is a certain account creation type could be to blame.

The number of article created are plotted by month and whether they were archived at the time of query for registered users at different levels of experience.
Article counts by account type (day experience only). The number of article created are plotted by month and whether they were archived at the time of query for registered users at different levels of experience.

So here's what I think happened: The archive table didn't exist before 2008 and the pages that we see in that table are pages that were deleted after 2008.

21:28, 21 November 2013 (UTC)


After hacking on this for a while, it's become clear that I should probably just focus on post 2008 since it looks like the data is sane from that point forward. Here's what the newbie (< one day) counts look like with the new cutoff:

The number of article created are plotted by month and whether they were archived at the time of query.
Article counts by account type (day experience only, post 2008). The number of article created are plotted by month and whether they were archived at the time of query.

Looks good. Now lets look at the proportion of articles that were deleted over time based on the creators experience level.

The proportion of surviving articles is plotted for articles created by editors at 4 different levels of experience (time since registration).
Article survival rate. The proportion of surviving articles is plotted for articles created by editors at 4 different levels of experience (time since registration).

Interestingly, it looks like the survival rate is increasing for all experience classes. This could be an artifact of how long it takes to deleted a page. See en:censoring (statistics). Time to apply an artificial sunset. 22:13, 21 November 2013 (UTC)


From my time to deletion analysis (see main page), it looks like most pages get deleted within 1 month since creation. I just updated my page status query to capture all archived pages as well as pages that are archived withing 30 days of creation as "archived_quickly" so that we can repeat the analysis with those pages.

SELECT
    LEFT(page.first_revision, 8) AS date,
    page_namespace,
    IF(
        creator.user_id IS NULL OR creator.user_id = 0,
        "anon",
        IF(
            creator.account_creation_action IS NULL,
            "unknown",
            creator.account_creation_action
        )
    ) AS account_type,
    IF(
        UNIX_TIMESTAMP(page.first_revision) - 
        UNIX_TIMESTAMP(creator.user_registration) < 60*60*24,
        "day",
        IF(
            UNIX_TIMESTAMP(page.first_revision) - 
            UNIX_TIMESTAMP(creator.user_registration) < 60*60*24*7,
            "week",
            IF(
                UNIX_TIMESTAMP(page.first_revision) - 
                UNIX_TIMESTAMP(creator.user_registration) < 60*60*24*30,
                "month",
                "oldtimer"
            )
        )
    ) AS experience,
    COUNT(*) AS pages,
    SUM(archived) AS archived,
    SUM(
        archived AND 
        (
            UNIX_TIMESTAMP(last_revision) - 
            UNIX_TIMESTAMP(first_revision)
        ) < 60*60*24*30
    ) AS archived_quickly
FROM halfak.nov13_page AS page
INNER JOIN halfak.nov13_creation AS creation USING (page_id)
LEFT JOIN halfak.nov13_user_stats AS creator USING (user_id)
GROUP BY 1,2,3,4;

While I'm waiting for this, I'm going to get back to figuring out which pages started out in user space. So I created this table that contains all revisions that match the auto-generated comment style produced by a page move.

> select count(*) from nov13_move_revision;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.04 sec)
  • facepalm* What went wrong?
> SELECT 'moved [[Foo]] to [[Bar]]:' RLIKE 'moved \[\[([^\]]+)\]\] to \[\[([^\]]+)\]\]:.*' as move;
+------+
| move |
+------+
|    0 |
+------+
1 row in set (0.03 sec)

So that doesn't work.

> SELECT 'moved [[Foo]] to [[Bar]]:' RLIKE 'moved \\[\\[([^\]]+)\\]\\] to \\[\\[([^\]]+)\\]\\]:.*' as move;
+------+
| move |
+------+
|    1 |
+------+
1 row in set (0.04 sec)

That is incredibly strange. I have to double escape brackets ("[" and "]") outside of the group, but not inside. I've kicked of the query again. this is likely going to take all night. Oh well. Here we are.

22:48, 21 November 2013 (UTC)