Research talk:Wikipedia article creation/Work log/Friday, January 17th

From Meta, a Wikimedia project coordination wiki

Friday, January 17th[edit]

Today I'm working on the last piece in the puzzle of newcomer page creations -- figuring out the scale at which all newcomers enter the page creation process.

So first things first, I need to generate a funnel for observing how many registered editors --> R:New editors --> New page creators --> New article creators --> New draft article creators.

SELECT
    DATE(CONCAT(LEFT(user_registration, 6), "01")) AS registration_month,
    COUNT(*) AS registered_users, /* registered users */
    SUM(revision_stats.day_revisions > 0) AS new_editors,
    SUM(page_stats.pages > 0) AS new_page_creators, /* users who started a any page in first 30 days */
    SUM(page_stats.article_pages > 0) AS new_article_creators, /* users who started an article page in first 30 days  */
    SUM(page_stats.draft_pages > 0) AS new_draft_creators /* users who started a draft article page in first 30 days  */
FROM (
    SELECT
        user_id,
        user_registration,
        SUM(day_revisions) AS day_revisions
    FROM (
        (
            SELECT
                user_id,
                user_registration,
                SUM(rev_id IS NOT NULL)  AS day_revisions
            FROM user
            LEFT JOIN revision ON 
                rev_user = user_id AND
                rev_timestamp < DATE_FORMAT(DATE_ADD(user_registration, INTERVAL 1 DAY), "%Y%m%d%H%i%S")
            WHERE
                user_registration > "2008"
            GROUP BY 1
        )
        UNION
        (
            SELECT
                user_id,
                user_registration,
                SUM(ar_id IS NOT NULL) AS day_revisions
            FROM user
            LEFT JOIN archive ON 
                ar_user_text = user_name AND
                ar_timestamp < DATE_FORMAT(DATE_ADD(user_registration, INTERVAL 1 DAY), "%Y%m%d%H%i%S")
            WHERE
                user_registration > "2008"
            GROUP BY 1,2
        )
    ) AS revision_archive_stats
    GROUP BY 1,2
) AS revision_stats
LEFT JOIN (
    SELECT
        user_id,
        SUM(c.user_id IS NOT NULL) AS pages,
        SUM(ap.page_title IS NOT NULL) AS article_pages,
        SUM(ap.original_namespace != 0) AS draft_pages
    FROM user
    LEFT JOIN staging.nov13_dewiki_creation c USING (user_id)
    LEFT JOIN staging.nov13_dewiki_article_page ap USING (page_id)
    WHERE
        user_registration > "2008" AND 
        c.rev_timestamp < DATE_FORMAT(DATE_ADD(user_registration, INTERVAL 30 DAY), "%Y%m%d%H%i%S")
    GROUP BY 1
) AS page_stats USING (user_id)
GROUP BY 1

I cheated a little bit. I got the above query wrong several times before I got it right.

Here's the results. First, the monthly proportion of users that make it to each part of the funnel. Since Enwiki is super big and takes forever to run, I'll start off with German.

New editor page creator proportions (dewiki). 

~ 30% of newly registered users will make an edit within 24h. ~40% of these new editors will create some page. ~60% of them will have created an article (or a draft article). And depending on where you look, about %3 of those users will have created a draft (non ns=0 page that became an article). There's a concerning dip in the number of draft creators that has me thinking that there might be some anomaly in my data.

Check out the raw numbers:

Newly registered users (dewiki). 
New editors (dewiki). 
New page creators (dewiki). 
New article creators (dewiki). 
New draft creators (dewiki). 

The number of editors who create drafts each months drops from 30 to 0-2 in the middle of 2011. My best guess is that there was a substantial change in the way that page moves were recorded at that point. I'll have to pick through the data to see if I can figure out what's going on. --Halfak (WMF) (talk) 23:56, 17 January 2014 (UTC)[reply]