Research talk:Teahouse long term new editor retention/Work log/2015-09-28

Monday, September 28, 2015


I just got my dataset of 14766 editors from J-Mo. Now I'm going to load them into the DB. I had to write a little bit of sed to switch the format to MySQL's dialect of TSV.

sed -r 's/,/\t/g; s/"//g' th_retention_sample20150928.csv > th_experimental_user.tsv
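The sed approach works as long as no field contains a comma or a quote character. As a rough alternative sketch, the same conversion could be done with Python's csv module, which parses quoted fields properly. The function name `csv_to_tsv` and the sample row below are illustrative assumptions; the column names are taken from the CREATE TABLE statement, not from the actual CSV header.

```python
import csv
import io


def csv_to_tsv(csv_text):
    """Convert CSV text to tab-separated text, dropping the CSV quoting.

    Hypothetical helper, not part of the original workflow.
    """
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        for field in row:
            # A raw tab or newline inside a field would corrupt the TSV,
            # so refuse to guess and fail loudly instead.
            if "\t" in field or "\n" in field:
                raise ValueError("field contains a tab or newline: %r" % field)
        lines.append("\t".join(row))
    return "".join(line + "\n" for line in lines)


# Illustrative input only; these values are made up.
sample = '"user_id","invite_date","bucket"\n"1","2015-01-05","invited"\n'
print(csv_to_tsv(sample), end="")
```

Unlike the global comma substitution, this keeps a quoted field such as "a,b" in one column.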

Now to load the data into the database:

CREATE TABLE th_experimental_user (user_id INT, invite_date DATE, bucket VARCHAR(255));

$ mysqlimport --local -h analytics-store.eqiad.wmnet -u research --ignore-lines=1 staging datasets/th_experimental_user.tsv 
staging.th_experimental_user: Records: 14766  Deleted: 0  Skipped: 0  Warnings: 0

OK. Cool. Now I want some retention measures. It looks like the last set of invites went out in January, so that means I have 8 months before the hard sunset to analyze. In my work on R:Surviving new editor, I took measures with different trial and survival periods. Here I'll use the ones for which enough time has passed since the invites.

SELECT
    user_id,
    SUM(revisions_3_to_4_weeks) AS revisions_3_to_4_weeks,
    SUM(revisions_1_to_2_months) AS revisions_1_to_2_months,
    SUM(revisions_2_to_6_months) AS revisions_2_to_6_months
FROM (
    (SELECT 
        user_id,
        SUM(
            rev_timestamp IS NOT NULL AND 
            DATEDIFF(rev_timestamp, user_registration) BETWEEN 21 AND 28
        ) AS revisions_3_to_4_weeks,
        SUM(
            rev_timestamp IS NOT NULL AND 
            DATEDIFF(rev_timestamp, user_registration) BETWEEN 30 AND 60
        ) AS revisions_1_to_2_months,
        SUM(
            rev_timestamp IS NOT NULL AND 
            DATEDIFF(rev_timestamp, user_registration) BETWEEN 60 AND 180
        ) AS revisions_2_to_6_months
    FROM staging.th_experimental_user as user
    INNER JOIN user USING (user_id)
    LEFT JOIN revision ON 
        rev_user = user_id AND
        rev_timestamp >= DATE_FORMAT(
            DATE_ADD(user_registration, INTERVAL 21 DAY), 
            "%Y%m%d%H%i%S"
        )
    GROUP BY 1)
    UNION ALL -- keep both rows even when a user's revision and archive counts are identical
    (SELECT 
        user_id,
        SUM(
            ar_timestamp IS NOT NULL AND 
            DATEDIFF(ar_timestamp, user_registration) BETWEEN 21 AND 28
        ) AS revisions_3_to_4_weeks,
        SUM(
            ar_timestamp IS NOT NULL AND 
            DATEDIFF(ar_timestamp, user_registration) BETWEEN 30 AND 60
        ) AS revisions_1_to_2_months,
        SUM(
            ar_timestamp IS NOT NULL AND 
            DATEDIFF(ar_timestamp, user_registration) BETWEEN 60 AND 180
        ) AS revisions_2_to_6_months
    FROM staging.th_experimental_user AS user
    INNER JOIN user USING (user_id)
    LEFT JOIN archive ON 
        ar_user = user_id AND
        ar_timestamp >= DATE_FORMAT(
            DATE_ADD(user_registration, INTERVAL 21 DAY), 
            "%Y%m%d%H%i%S"
        )
    GROUP BY 1)
) user_span_revisions
GROUP BY user_id;

Here's a sample of the output:

user_id revisions_3_to_4_weeks  revisions_1_to_2_months revisions_2_to_6_months
22890795        0       0       0
22891039        0       0       0
22891690        0       0       0
22892606        0       0       0
22892705        0       0       0
22892807        0       0       0
22893113        0       0       0
22893263        7       67      106
22895159        0       2       6
22895278        0       0       3
22895570        0       0       0
22895602        0       7       57
22895911        0       0       0
22896534        2       16      116
22897567        0       1       0
22897845        0       0       0
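To make the windowing in the query concrete, here is a minimal Python sketch of the same counting logic for one editor. It assumes MediaWiki-style YYYYMMDDHHMMSS timestamp strings and mirrors DATEDIFF by comparing only the date parts. The names `WINDOWS`, `parse_mw_timestamp` and `count_window_revisions` are hypothetical. The day ranges are copied from the BETWEEN clauses, so day 60 counts toward both the 1-2 month and the 2-6 month measures, as it does in the SQL.

```python
from datetime import datetime

# Inclusive day ranges, copied from the BETWEEN clauses in the query.
WINDOWS = {
    "revisions_3_to_4_weeks": (21, 28),
    "revisions_1_to_2_months": (30, 60),
    "revisions_2_to_6_months": (60, 180),
}


def parse_mw_timestamp(ts):
    """Parse a YYYYMMDDHHMMSS timestamp string (assumed format)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def count_window_revisions(registration, revision_timestamps):
    """Count revisions per window, measured in whole days since registration.

    Pass timestamps from both live and archived revisions to mirror the
    query, which sums over the revision and archive tables.
    """
    reg_date = parse_mw_timestamp(registration).date()
    counts = {name: 0 for name in WINDOWS}
    for ts in revision_timestamps:
        # Like DATEDIFF, ignore the time of day and compare dates only.
        days = (parse_mw_timestamp(ts).date() - reg_date).days
        for name, (first_day, last_day) in WINDOWS.items():
            if first_day <= days <= last_day:
                counts[name] += 1
    return counts
```

An editor with no revisions in any window comes back with all zeros, which matches the zero-filled rows in the sample output.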

Cool. Now it's time to load up R. --Halfak (WMF) (talk) 21:43, 28 September 2015 (UTC)

Okay!

1+ edits survival

For each of the survival periods, I considered an editor "surviving" if they saved at least one edit.

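bucket	n	3-4 weeks k	3-4 weeks p	1-2 months k	1-2 months p	2-6 months k	2-6 months p
control	3092	247	0.080*	360	0.116	397	0.128
invited	11674	1068	0.091*	1512	0.130	1651	0.141

n = editors in the bucket, k = surviving editors, p = k/n. * marks a difference between buckets that is significant at the 0.05 level.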
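As a rough illustration of how the n, k and p columns can be derived from the per-editor counts, here is a small Python sketch. The row layout (one dict per editor with a 'bucket' key plus the revision counts) and the function name `bucket_survival` are assumptions for illustration. The `min_edits` parameter is the survival threshold, which is 1 for this table.

```python
from collections import defaultdict


def bucket_survival(rows, measure, min_edits=1):
    """Return {bucket: (n, k, p)} for one survival measure.

    rows      -- iterable of dicts, one per editor (assumed layout)
    measure   -- key holding the revision count for the survival period
    min_edits -- an editor "survives" with at least this many edits
    """
    n = defaultdict(int)  # editors per bucket
    k = defaultdict(int)  # surviving editors per bucket
    for row in rows:
        n[row["bucket"]] += 1
        if row[measure] >= min_edits:
            k[row["bucket"]] += 1
    return {bucket: (n[bucket], k[bucket], k[bucket] / n[bucket]) for bucket in n}


# Toy rows, made up purely to show the shape of the input.
toy_rows = [
    {"bucket": "control", "revisions_3_to_4_weeks": 0},
    {"bucket": "control", "revisions_3_to_4_weeks": 2},
    {"bucket": "invited", "revisions_3_to_4_weeks": 7},
    {"bucket": "invited", "revisions_3_to_4_weeks": 0},
]
print(bucket_survival(toy_rows, "revisions_3_to_4_weeks"))
```

Raising `min_edits` gives a stricter definition of survival without changing anything else.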

Chi^2 tests

> prop.test(bucket.survival$revisions_3_to_4_weeks.k, bucket.survival$n)

	2-sample test for equality of proportions with continuity correction

data:  bucket.survival$revisions_3_to_4_weeks.k out of bucket.survival$n
X-squared = 3.9142, df = 1, p-value = 0.04788
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.0226998168 -0.0005037463
sample estimates:
    prop 1     prop 2 
0.07988357 0.09148535 

> prop.test(bucket.survival$revisions_1_to_2_months.k, bucket.survival$n)

	2-sample test for equality of proportions with continuity correction

data:  bucket.survival$revisions_1_to_2_months.k out of bucket.survival$n
X-squared = 3.6658, df = 1, p-value = 0.05554
alternative hypothesis: two.sided
95 percent confidence interval:
 -2.61353e-02 -4.28903e-05
sample estimates:
   prop 1    prop 2 
0.1164295 0.1295186 

> prop.test(bucket.survival$revisions_2_to_6_months.k, bucket.survival$n)

	2-sample test for equality of proportions with continuity correction

data:  bucket.survival$revisions_2_to_6_months.k out of bucket.survival$n
X-squared = 3.3658, df = 1, p-value = 0.06656
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.0266128531  0.0005537941
sample estimates:
   prop 1    prop 2 
0.1283959 0.1414254

That looks pretty promising. In all three tests, the invited bucket has the higher survival proportion. Only the short-term (3-4 weeks) difference is significant at the 0.05 level (p = 0.048), but the other two are very close (p = 0.056 and p = 0.067).
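For readers who want to see what the two-sample prop.test calls above compute, here is a minimal Python sketch of the same statistic: a chi-squared test on the 2x2 table of survivors and non-survivors with Yates' continuity correction. It is a sketch of the standard calculation, not R's source code, and the function name `two_sample_prop_test` is hypothetical. With one degree of freedom, the upper tail probability of the chi-squared distribution equals erfc(sqrt(x/2)), so only the standard library is needed.

```python
import math


def two_sample_prop_test(k1, n1, k2, n2):
    """Chi-squared test for equal proportions, with continuity correction.

    Returns (statistic, p_value) for k1 successes out of n1 versus
    k2 successes out of n2.
    """
    p_pooled = (k1 + k2) / (n1 + n2)
    observed = [k1, n1 - k1, k2, n2 - k2]
    expected = [
        n1 * p_pooled,
        n1 * (1 - p_pooled),
        n2 * p_pooled,
        n2 * (1 - p_pooled),
    ]
    # Yates' correction, capped so it never exceeds the observed deviation.
    yates = min(0.5, abs(k1 - n1 * p_pooled), abs(k2 - n2 * p_pooled))
    statistic = sum(
        (abs(o - e) - yates) ** 2 / e for o, e in zip(observed, expected)
    )
    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# 3-4 week counts from the table: control 247/3092, invited 1068/11674.
# The R output above reports X-squared = 3.9142, p-value = 0.04788.
stat, p = two_sample_prop_test(247, 3092, 1068, 11674)
print(round(stat, 4), round(p, 5))
```

The confidence intervals in the R output use the unpooled standard error, which is why an interval can exclude zero even when the p-value is slightly above 0.05.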

5+ edits survival


What if we don't consider someone surviving unless they save at least 5 edits in a survival period?

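bucket	n	3-4 weeks k	3-4 weeks p	1-2 months k	1-2 months p	2-6 months k	2-6 months p
control	3092	117	0.038	191	0.062	229	0.074*
invited	11674	496	0.042	841	0.072	1008	0.086*

n = editors in the bucket, k = surviving editors, p = k/n. * marks a difference between buckets that is significant at the 0.05 level.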

Chi^2 tests

> prop.test(bucket.survival5$revisions_3_to_4_weeks.k, bucket.survival5$n)

	2-sample test for equality of proportions with continuity correction

data:  bucket.survival5$revisions_3_to_4_weeks.k out of bucket.survival5$n
X-squared = 1.213, df = 1, p-value = 0.2707
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.012508867  0.003212881
sample estimates:
    prop 1     prop 2 
0.03783959 0.04248758 

> prop.test(bucket.survival5$revisions_1_to_2_months.k, bucket.survival5$n)

	2-sample test for equality of proportions with continuity correction

data:  bucket.survival5$revisions_1_to_2_months.k out of bucket.survival5$n
X-squared = 3.8085, df = 1, p-value = 0.05099
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.0201681321 -0.0003681001
sample estimates:
    prop 1     prop 2 
0.06177232 0.07204043 

> prop.test(bucket.survival5$revisions_2_to_6_months.k, bucket.survival5$n)

	2-sample test for equality of proportions with continuity correction

data:  bucket.survival5$revisions_2_to_6_months.k out of bucket.survival5$n
X-squared = 4.6468, df = 1, p-value = 0.03111
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.02303135 -0.00153591
sample estimates:
    prop 1     prop 2 
0.07406210 0.08634573

Very interesting. We see a significant effect in the long-term (2-6 months) period here (p = 0.031), but no significance at 3-4 weeks (p = 0.271); 1-2 months is borderline (p = 0.051). Wow! --Halfak (WMF) (talk) 22:22, 28 September 2015 (UTC)