Research talk:Surviving new editor/Work log/2014-02-05

Add topic
From Meta, a Wikimedia project coordination wiki

Wednesday, Feb. 5th[edit]

So the first thing I want to do is to get a sense for how a few different survival periods (the time between and </math>t_3</math>) affect the measurement of "surviving" new editors.

I'd also like to throw a few simple measures out there:

  • 3 to 4 weeks - Measures short term (month) retention assuming that editors will tend to edit in the last week of the month
  • 1 to 2 months - Measures medium term retention
  • 1 to 2 years - Measures long term retention
The proportion of surviving new editors is plotted over time for English and German Wikipedia using several different survival periods.
Surviving new editors by survival period. The proportion of surviving new editors is plotted over time for English and German Wikipedia using several different survival periods.

One thing you'll notice right away is that there are more editors who appear in the 1 to 2 year survival period than survive in the 3 to 4 weeks period. At first this seems counter-intuitive. However, the 3 to 4 weeks survival period only considers edits that occur within a one week timespan while 1 to 2 years considers any edit that takes place within a whole year. However the large difference here points toward an important problem. If we chose 3 to 4 weeks as a survival period, we'd be missing many editors who will not edit during that time, but would edit again within a year.

Really, the key story here though is related to the similar trends seen in all metrics. The figure below plots factor difference between the surviving proportion for each "survival period" as compared to "2 to 6 months". I chose "2 to 6 months" since it lands between other potential metrics in proportion. A horizontal line at 1 would represent a perfect correlation.

The factor difference between the surviving proportion for each "survival period" as compared to "2 to 6 months" is plotted for English and German
Factor comparison of surviving editor proportion. The factor difference between the surviving proportion for each "survival period" as compared to "2 to 6 months" is plotted for English and German

There are a few things to pick out here:

  • "1 to 2 years" and "1 year + 1 month" seem to be measuring something different than "2 to 6 months" before 2007.
  • Other than this deviation, the proportional relationship between "2 to 6 months" and other survival periods holds relatively constant.
  • About half the number of editors that appear in the "3 to 4 weeks" survival period will appear in "2 to 6 months"

I'm really not sure that this analysis gives us a clear winner or allows us to pick an arbitrary survival period and be done with it. Next, I'll extract some edit counts for all months for a users first year so that I can do some more nuanced comparisons.

SQL for monthly activity
SELECT
    DATABASE() AS wiki_db,
    user_id,
    SUM(revisions_month_1) AS revisions_month_1,
    SUM(revisions_month_2) AS revisions_month_2,
    SUM(revisions_month_3) AS revisions_month_3,
    SUM(revisions_month_4) AS revisions_month_4,
    SUM(revisions_month_5) AS revisions_month_5,
    SUM(revisions_month_6) AS revisions_month_6,
    SUM(revisions_month_7) AS revisions_month_7,
    SUM(revisions_month_8) AS revisions_month_8,
    SUM(revisions_month_9) AS revisions_month_9,
    SUM(revisions_month_10) AS revisions_month_10,
    SUM(revisions_month_11) AS revisions_month_11,
    SUM(revisions_year_1) AS revisions_year_1,
    SUM(revisions_year_2) AS revisions_year_2
FROM (
    (SELECT 
        user_id,
        SUM(
            rev_timestamp IS NOT NULL AND 
            DATEDIFF(rev_timestamp, user_registration) BETWEEN 30*1+1 AND 30*2
        ) AS revisions_month_1,
        SUM(
            rev_timestamp IS NOT NULL AND 
            DATEDIFF(rev_timestamp, user_registration) BETWEEN 30*2+1 AND 30*3
        ) AS revisions_month_2,
        SUM(
            rev_timestamp IS NOT NULL AND 
            DATEDIFF(rev_timestamp, user_registration) BETWEEN 30*3+1 AND 30*4
        ) AS revisions_month_3,
        SUM(
            rev_timestamp IS NOT NULL AND 
            DATEDIFF(rev_timestamp, user_registration) BETWEEN 30*4+1 AND 30*5
        ) AS revisions_month_4,
        SUM(
            rev_timestamp IS NOT NULL AND 
            DATEDIFF(rev_timestamp, user_registration) BETWEEN 30*5+1 AND 30*6
        ) AS revisions_month_5,
        SUM(
            rev_timestamp IS NOT NULL AND 
            DATEDIFF(rev_timestamp, user_registration) BETWEEN 30*6+1 AND 30*7
        ) AS revisions_month_6,
        SUM(
            rev_timestamp IS NOT NULL AND 
            DATEDIFF(rev_timestamp, user_registration) BETWEEN 30*7+1 AND 30*8
        ) AS revisions_month_7,
        SUM(
            rev_timestamp IS NOT NULL AND 
            DATEDIFF(rev_timestamp, user_registration) BETWEEN 30*8+1 AND 30*9
        ) AS revisions_month_8,
        SUM(
            rev_timestamp IS NOT NULL AND 
            DATEDIFF(rev_timestamp, user_registration) BETWEEN 30*9+1 AND 30*10
        ) AS revisions_month_9,
        SUM(
            rev_timestamp IS NOT NULL AND 
            DATEDIFF(rev_timestamp, user_registration) BETWEEN 30*10+1 AND 30*11
        ) AS revisions_month_10,
        SUM(
            rev_timestamp IS NOT NULL AND 
            DATEDIFF(rev_timestamp, user_registration) BETWEEN 30*11+1 AND 365
        ) AS revisions_month_11,
        SUM(
            rev_timestamp IS NOT NULL AND 
            DATEDIFF(rev_timestamp, user_registration) BETWEEN 365*1+1 AND 365*2
        ) AS revisions_year_1,
        SUM(
            rev_timestamp IS NOT NULL AND 
            DATEDIFF(rev_timestamp, user_registration) BETWEEN 365*2+1 AND 365*3
        ) AS revisions_year_2
    FROM user
    LEFT JOIN revision ON 
        rev_user = user_id AND
        rev_timestamp >= DATE_FORMAT(
            DATE_ADD(user_registration, INTERVAL 31 DAY), 
            "%Y%m%d%H%i%S"
        )
    GROUP BY 1)
    UNION
    (SELECT 
        user_id,
        SUM(
            ar_timestamp IS NOT NULL AND 
            DATEDIFF(ar_timestamp, user_registration) BETWEEN 30*1+1 AND 30*2
        ) AS revisions_month_1,
        SUM(
            ar_timestamp IS NOT NULL AND 
            DATEDIFF(ar_timestamp, user_registration) BETWEEN 30*2+1 AND 30*3
        ) AS revisions_month_2,
        SUM(
            ar_timestamp IS NOT NULL AND 
            DATEDIFF(ar_timestamp, user_registration) BETWEEN 30*3+1 AND 30*4
        ) AS revisions_month_3,
        SUM(
            ar_timestamp IS NOT NULL AND 
            DATEDIFF(ar_timestamp, user_registration) BETWEEN 30*4+1 AND 30*5
        ) AS revisions_month_4,
        SUM(
            ar_timestamp IS NOT NULL AND 
            DATEDIFF(ar_timestamp, user_registration) BETWEEN 30*5+1 AND 30*6
        ) AS revisions_month_5,
        SUM(
            ar_timestamp IS NOT NULL AND 
            DATEDIFF(ar_timestamp, user_registration) BETWEEN 30*6+1 AND 30*7
        ) AS revisions_month_6,
        SUM(
            ar_timestamp IS NOT NULL AND 
            DATEDIFF(ar_timestamp, user_registration) BETWEEN 30*7+1 AND 30*8
        ) AS revisions_month_7,
        SUM(
            ar_timestamp IS NOT NULL AND 
            DATEDIFF(ar_timestamp, user_registration) BETWEEN 30*8+1 AND 30*9
        ) AS revisions_month_8,
        SUM(
            ar_timestamp IS NOT NULL AND 
            DATEDIFF(ar_timestamp, user_registration) BETWEEN 30*9+1 AND 30*10
        ) AS revisions_month_9,
        SUM(
            ar_timestamp IS NOT NULL AND 
            DATEDIFF(ar_timestamp, user_registration) BETWEEN 30*10+1 AND 30*11
        ) AS revisions_month_10,
        SUM(
            ar_timestamp IS NOT NULL AND 
            DATEDIFF(ar_timestamp, user_registration) BETWEEN 30*11+1 AND 365
        ) AS revisions_month_11,
        SUM(
            ar_timestamp IS NOT NULL AND 
            DATEDIFF(ar_timestamp, user_registration) BETWEEN 365*1+1 AND 365*2
        ) AS revisions_year_1,
        SUM(
            ar_timestamp IS NOT NULL AND 
            DATEDIFF(ar_timestamp, user_registration) BETWEEN 365*2+1 AND 365*3
        ) AS revisions_year_2
    FROM user
    LEFT JOIN archive ON 
        ar_user = user_id AND
        ar_timestamp >= DATE_FORMAT(
            DATE_ADD(user_registration, INTERVAL 31 DAY), 
            "%Y%m%d%H%i%S"
        )
    GROUP BY 1)
) user_span_revisions
GROUP BY 1,2;