Research talk:Wikipedia article creation/Work log/Monday, December 16th

From Meta, a Wikimedia project coordination wiki

Monday, December 16th[edit]

Back to hacking on my "article page" table. Time to gather some monthly stats for different types of article creators. I want a dataset that has one row per group of user-pages that were published (moved to Main) in the same month, originated in the same namespace, were created by the same type of user with the same amount of experience. Identifying columns (for aggregation) are underlined.

month_published
The month that the page was first moved to the main namespace
original_namespace
The namespace that the page started in
creator_type IN ("anon", "self-created", "auto-created", NULL)
The type of user who created the article
creator_tenure IN ("-day", "day-week", "week-month", "month-", NULL)
The amount of tenure (time between registration and page creation) the creating user had at the time of publishing
unique_authors
The aggregate number of unique page creators
articles_created
The aggregate number of pages
articles_unpublished_quickly
The aggregate number of pages that were deleted or moved out of Main within 30 days of publishing (to deal with right truncation issue)

Time to write the query. --Halfak (WMF) (talk) 18:25, 16 December 2013 (UTC)[reply]


Here we go. I changed the names of some fields to reduce my own confusion later. I'll change the field names in the description up above too.

Monthly article page status query
SELECT
    DATE(CONCAT(LEFT(published, 6), "01")) AS month_published,
    original_namespace,
    IF(
        creator.user_id IS NULL OR creator.user_id = 0,
        "anon",
        IF(
            creator.account_creation_action = "create" OR
            creator.account_creation_action = "byemail",
            "self-created",
            IF(
                 creator.account_creation_action = "autocreate",
                 "autocreate",
                 NULL
            )
        )
    ) AS creator_type,
    IF(
        UNIX_TIMESTAMP(page.created) - 
        UNIX_TIMESTAMP(creator.user_registration) < 60*60*24,
        "-day",
        IF(
            UNIX_TIMESTAMP(page.created) - 
            UNIX_TIMESTAMP(creator.user_registration) < 60*60*24*7,
            "day-week",
            IF(
                UNIX_TIMESTAMP(page.created) - 
                UNIX_TIMESTAMP(creator.user_registration) < 60*60*24*30,
                "week-month",
                IF(
                    UNIX_TIMESTAMP(page.created) - 
                    UNIX_TIMESTAMP(creator.user_registration) >= 60*60*24*30,
                    "month-",
                    NULL
                )
            )
        )
    ) AS creator_tenure,
    COUNT(DISTINCT creation.user_text) AS unique_authors,
    COUNT(*) AS articles_published,
    SUM(
        (
            UNIX_TIMESTAMP(published) - 
            UNIX_TIMESTAMP(unpublished)
        ) < 60*60*24*30
    ) AS articles_unpublished_quickly
FROM halfak.nov13_article_page page
INNER JOIN halfak.nov13_creation AS creation USING (page_id)
LEFT JOIN halfak.nov13_user_stats AS creator USING (user_id)
GROUP BY 1,2,3,4;

And we're running. While this is running, I'm getting back to work on Schema:GettingStartedOnRedirect. --Halfak (WMF) (talk) 20:09, 16 December 2013 (UTC)[reply]


Waiting on EL's creator to talk schemas. I had a quick discussion over IRC with Steven (WMF) and superm401 about when the tenure of a page creator should be measured.

IRC chat log
<halfak> StevenW: I'm considering the experience level of an editor as it relates to article creation.  I'm splitting the time of article creation from the time that the article was moved to main space.  Should I be more concerned about an editor's experience level at the time of draft creation or the time the draft was published to main?
<superm401> Another possibility is "last edit before move to main".
<halfak> That's going to be infeasible.  :\
<superm401> Too hard to query?
<halfak> I'd have to re-run the pipeline from the beginning. 
<halfak> About 2 days worth of query time. 
<halfak> But then again, why would that be a more relevant timestamp?
<superm401> Because in theory, their last edit while it's a draft benefits from all the experience up to then.
<superm401> And that last edit might be required to make it ready for main ('ready' according to someone, at least).
<superm401> Even a beginner could gain a decent amount of experience just while it's a draft.
<superm401> For a long-time user, it could be a draft for a year, which could make a big difference.
<halfak> Hmmm fair enough, but wouldn't this be true at the time of the move too?
<superm401> Yeah, luckily "time of the move" is probably close to "last edit before move" in the majority of cases.
<halfak> But why I don't see why we'd rather have "last edit before move".
<halfak> minced words there.  Nevermind the "but why". 
<superm401> The time of the move just tells you when someone decided to move it to main.
<halfak> Yeah, but presumably, they evaluated the content of the article before moving it. 
<superm401> Yeah, but hypothetically assume:
<superm401> 1. Joe signs up in January 2013.
<superm401> 2. He works on the article for a month, so the last edit is in February 2013.
<superm401> 3. He puts it up for review.
<superm401> 4. He goes back and forth improving the article in response to review until March 2013
<superm401> 5. AFC is really backlogged now, so no one looks at it until June 2013.
<superm401> 6. It then gets approved and moved to main.
<halfak> But the last edit before the move will be the edit that signifies the acceptance. 
<superm401> It doesn't make sense to consider Joe a user with six months experience, since he didn't do anything since March.
<halfak> And that will happen right before the move. 
<halfak> So really, I'd need the last edit by the article creator before the move. 
<halfak> Hmm... That's even more difficult.  
<superm401> I think time of move would be a reasonable approximation, just trying to explain my perspective.
<halfak> Don't get me wrong.  You're making a good point. 
<superm401> Probably the gap (e.g. March to June in the example) is not that big in most real cases.
<halfak> When I think about the questions we want to answer, I find myself being more drawn to experience at time of draft creation.
<halfak> Since we're considering funneling newbies to draft creation 
<halfak> Or at least interested in what doing so might entail. 
<superm401> That's certainly relevant, too.
<superm401> Since their initial experience will structure the start of the draft.
<superm401> In some cases, it may improve a lot later (in response to review or otherwise).
<superm401> In other cases, it might basically remain quite similar to the first version of the draft (possibly because the reviewer can't see beyond the issues with the first version)
<halfak> OK.  I think I'm going to just settle on one and get us some plots.  We should talk again with swalling after standup.  
<halfak> Thanks for the input. :) 
<superm401> No problem
<StevenW> halfak: missed your first ping. My recommendation would be at the point of draft creation.
<halfak> Hokay.  That's what I was going to do anyway.  :) 

So I updated the query above to measure tenure between registration and the page creation date. I just kicked off the query again. --Halfak (WMF) (talk) 21:25, 16 December 2013 (UTC)[reply]


Lots of weirdness going on. It looks like I have about 12 million rows in my "article_page" table, but I have 13 million rows in my "page" table that are currently in namespace zero.

> select count(*) from halfak.nov13_page_origin where original_namespace = 0;
+----------+
| count(*) |
+----------+
| 13463952 |
+----------+
1 row in set (18.82 sec)

> select count(*) from halfak.nov13_article_page;
+----------+
| count(*) |
+----------+
| 12562628 |
+----------+
1 row in set (21.95 sec)

This can't be right. Every page that started in ns=0 is an "article page" and many that did not start in ns=0 are too. the only filtering condition I have on the article_page table is the following:

WHERE
    original_namespace = 0 OR
    publish.page_id IS NOT NULL;

I can't see why that would be causing the problem. I think I'll have to re-generate the table again to see if I encounter the same problem.

> SELECT NOW() AS generated, COUNT(*) FROM halfak.nov13_article_page_new;
+---------------------+----------+
| generated           | COUNT(*) |
+---------------------+----------+
| 2013-12-16 22:54:11 | 12562628 |
+---------------------+----------+
1 row in set (6.59 sec)

Time to do some more digging.  :\ --Halfak (WMF) (talk) 22:56, 16 December 2013 (UTC)[reply]