Research talk:Teahouse long term new editor retention/Work log/20150928
Monday, September 28, 2015
I just got my dataset of 14766 editors from JMo. Now I'm going to load them into the DB. I had to write a little bit of sed to switch the format to MySQL's dialect of TSV.
cat th_retention_sample20150928.csv | sed -r "s/,/\t/g" | sed -r "s/\"//g" > th_experimental_user.tsv
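For comparison, here is a minimal Python sketch of the same CSV-to-TSV conversion using the standard csv module. One difference from the sed pipeline is that it does not split on commas inside quoted fields, whereas the sed version replaces every comma. The function name csv_to_tsv and the sample row (including its date) are assumptions for illustration only; the real header of the CSV is not shown in this log.

```python
import csv
import io


def csv_to_tsv(csv_text):
    """Convert CSV text to tab-separated text, dropping the quoting."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    # QUOTE_NONE mirrors the second sed call, which strips the double quotes.
    # Assumption: no field contains a tab, quote or newline; the csv module
    # raises an error if one does, because no escape character is set.
    writer = csv.writer(out, delimiter="\t", quoting=csv.QUOTE_NONE,
                        lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


# Hypothetical rows in the shape of the th_experimental_user table.
sample = '"user_id","invite_date","bucket"\n"22890795","2014-10-20","control"\n'
print(csv_to_tsv(sample), end="")
```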
Now to load the data into the database:
CREATE TABLE th_experimental_user (user_id INT, invite_date DATE, bucket VARCHAR(255));
$ mysqlimport --local -h analytics-store.eqiad.wmnet -u research --ignore-lines=1 staging datasets/th_experimental_user.tsv
staging.th_experimental_user: Records: 14766 Deleted: 0 Skipped: 0 Warnings: 0
OK. Cool. Now I want some retention measures. It looks like the last set of invites went out in January, so that means I have 8 months before the hard sunset to analyze. In my work on R:Surviving new editor, I took measures with different trial and survival periods. Here I'll use the ones for which enough time has elapsed since the invites.
SELECT
    user_id,
    SUM(revisions_3_to_4_weeks) AS revisions_3_to_4_weeks,
    SUM(revisions_1_to_2_months) AS revisions_1_to_2_months,
    SUM(revisions_2_to_6_months) AS revisions_2_to_6_months
FROM (
    (SELECT
        user_id,
        SUM(
            rev_timestamp IS NOT NULL AND
            DATEDIFF(rev_timestamp, user_registration) BETWEEN 21 AND 28
        ) AS revisions_3_to_4_weeks,
        SUM(
            rev_timestamp IS NOT NULL AND
            DATEDIFF(rev_timestamp, user_registration) BETWEEN 30 AND 60
        ) AS revisions_1_to_2_months,
        SUM(
            rev_timestamp IS NOT NULL AND
            DATEDIFF(rev_timestamp, user_registration) BETWEEN 60 AND 180
        ) AS revisions_2_to_6_months
    FROM staging.th_experimental_user AS user
    INNER JOIN user USING (user_id)
    LEFT JOIN revision ON
        rev_user = user_id AND
        rev_timestamp >= DATE_FORMAT(
            DATE_ADD(user_registration, INTERVAL 21 DAY),
            "%Y%m%d%H%i%S"
        )
    GROUP BY 1)
    UNION
    (SELECT
        user_id,
        SUM(
            ar_timestamp IS NOT NULL AND
            DATEDIFF(ar_timestamp, user_registration) BETWEEN 21 AND 28
        ) AS revisions_3_to_4_weeks,
        SUM(
            ar_timestamp IS NOT NULL AND
            DATEDIFF(ar_timestamp, user_registration) BETWEEN 30 AND 60
        ) AS revisions_1_to_2_months,
        SUM(
            ar_timestamp IS NOT NULL AND
            DATEDIFF(ar_timestamp, user_registration) BETWEEN 60 AND 180
        ) AS revisions_2_to_6_months
    FROM staging.th_experimental_user AS user
    INNER JOIN user USING (user_id)
    LEFT JOIN archive ON
        ar_user = user_id AND
        ar_timestamp >= DATE_FORMAT(
            DATE_ADD(user_registration, INTERVAL 21 DAY),
            "%Y%m%d%H%i%S"
        )
    GROUP BY 1)
) user_span_revisions
GROUP BY user_id;
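To make the three survival windows concrete, here is a rough Python sketch of the per-revision bucketing that the BETWEEN clauses perform. The day ranges are copied from the query; the function name count_by_window and the example editor are hypothetical. Note that the ranges are inclusive, so a revision exactly 60 days after registration counts toward both the 1-2 month and the 2-6 month measures, and a revision on day 29 counts toward none.

```python
from datetime import datetime

# Inclusive day ranges copied from the BETWEEN clauses in the query.
WINDOWS = {
    "revisions_3_to_4_weeks": (21, 28),
    "revisions_1_to_2_months": (30, 60),
    "revisions_2_to_6_months": (60, 180),
}


def count_by_window(registration, rev_timestamps):
    """Count revisions per window using whole-day differences, like DATEDIFF."""
    counts = {name: 0 for name in WINDOWS}
    for ts in rev_timestamps:
        # DATEDIFF compares calendar dates only and ignores the time of day.
        days = (ts.date() - registration.date()).days
        for name, (low, high) in WINDOWS.items():
            if low <= days <= high:
                counts[name] += 1
    return counts


# Hypothetical editor registered on 2014-10-01.
reg = datetime(2014, 10, 1, 12, 0, 0)
revs = [
    datetime(2014, 10, 26, 9, 0, 0),  # day 25: 3-4 weeks
    datetime(2014, 10, 30, 9, 0, 0),  # day 29: falls between windows
    datetime(2014, 11, 30, 9, 0, 0),  # day 60: lands in two windows
    datetime(2015, 4, 19, 9, 0, 0),   # day 200: after all windows
]
print(count_by_window(reg, revs))
```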
Here's a sample of the output:
user_id   revisions_3_to_4_weeks  revisions_1_to_2_months  revisions_2_to_6_months
22890795  0                       0                        0
22891039  0                       0                        0
22891690  0                       0                        0
22892606  0                       0                        0
22892705  0                       0                        0
22892807  0                       0                        0
22893113  0                       0                        0
22893263  7                       67                       106
22895159  0                       2                        6
22895278  0                       0                        3
22895570  0                       0                        0
22895602  0                       7                        57
22895911  0                       0                        0
22896534  2                       16                       116
22897567  0                       1                        0
22897845  0                       0                        0
Cool. Now it's time to load up R. Halfak (WMF) (talk) 21:43, 28 September 2015 (UTC)
Okay!
1+ edits survival
For each of the survival periods, I considered an editor "surviving" if they saved at least one edit.
bucket   n      3-4 weeks      1-2 months     2-6 months
                k     p        k     p        k     p
control  3092   247   0.080*   360   0.116    397   0.128
invited  11674  1068  0.091*   1512  0.130    1651  0.141
Chi^2 tests


> prop.test(bucket.survival$revisions_3_to_4_weeks.k, bucket.survival$n)

	2-sample test for equality of proportions with continuity correction

data:  bucket.survival$revisions_3_to_4_weeks.k out of bucket.survival$n
X-squared = 3.9142, df = 1, p-value = 0.04788
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.0226998168 -0.0005037463
sample estimates:
    prop 1     prop 2
0.07988357 0.09148535

> prop.test(bucket.survival$revisions_1_to_2_months.k, bucket.survival$n)

	2-sample test for equality of proportions with continuity correction

data:  bucket.survival$revisions_1_to_2_months.k out of bucket.survival$n
X-squared = 3.6658, df = 1, p-value = 0.05554
alternative hypothesis: two.sided
95 percent confidence interval:
 -2.61353e-02 -4.28903e-05
sample estimates:
   prop 1    prop 2
0.1164295 0.1295186

> prop.test(bucket.survival$revisions_2_to_6_months.k, bucket.survival$n)

	2-sample test for equality of proportions with continuity correction

data:  bucket.survival$revisions_2_to_6_months.k out of bucket.survival$n
X-squared = 3.3658, df = 1, p-value = 0.06656
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.0266128531  0.0005537941
sample estimates:
   prop 1    prop 2
0.1283959 0.1414254
That looks pretty promising. In all three tests, the invited condition has the higher survival proportion. Only the short-term (3-4 week) period reaches significance at p < 0.05, but the other two are very close (p = 0.056 and p = 0.067).
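As a cross-check on the R output, the chi-squared statistic for a 2x2 comparison can be recomputed from the k and n values in the table. The sketch below uses the textbook Yates-corrected formula and the closed-form p-value for one degree of freedom. It is an assumption that this matches prop.test in general (R caps the correction slightly differently), but for counts like these the two should agree closely. The function name two_sample_prop_test is made up for illustration.

```python
import math


def two_sample_prop_test(k1, n1, k2, n2):
    """Yates-corrected chi-squared test (df = 1) for equal proportions."""
    a, b = k1, n1 - k1  # bucket 1: survived, did not survive
    c, d = k2, n2 - k2  # bucket 2: survived, did not survive
    total = n1 + n2
    # Continuity correction, floored at zero so it cannot overshoot.
    corrected = max(abs(a * d - b * c) - total / 2, 0)
    chi2 = total * corrected ** 2 / (n1 * n2 * (a + c) * (b + d))
    # Upper tail of a chi-squared distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# 3-4 week counts from the 1+ edits table: control 247/3092, invited 1068/11674.
chi2, p = two_sample_prop_test(247, 3092, 1068, 11674)
print(round(chi2, 4), round(p, 5))
```

With the 3-4 week counts, the result should land close to the X-squared = 3.9142 and p-value = 0.04788 reported above.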
5+ edits survival
What if we don't consider someone surviving unless they save at least 5 edits in a time period?
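The survival definition with an adjustable edit threshold can be sketched as follows. This is only an illustration of the k, n and p columns in the tables, not the R code that produced them. The function name survival_by_bucket, the row layout and the example rows are all hypothetical.

```python
from collections import defaultdict


def survival_by_bucket(rows, column, min_edits=1):
    """Return {bucket: (k, n, p)}, where k editors saved at least min_edits."""
    n = defaultdict(int)
    k = defaultdict(int)
    for row in rows:
        n[row["bucket"]] += 1
        if row[column] >= min_edits:
            k[row["bucket"]] += 1
    return {bucket: (k[bucket], n[bucket], k[bucket] / n[bucket]) for bucket in n}


# Hypothetical editors; real rows would come from joining the query output
# with the bucket column of th_experimental_user.
rows = [
    {"bucket": "control", "revisions_2_to_6_months": 0},
    {"bucket": "control", "revisions_2_to_6_months": 6},
    {"bucket": "invited", "revisions_2_to_6_months": 3},
    {"bucket": "invited", "revisions_2_to_6_months": 106},
]
print(survival_by_bucket(rows, "revisions_2_to_6_months", min_edits=1))
print(survival_by_bucket(rows, "revisions_2_to_6_months", min_edits=5))
```

Raising min_edits from 1 to 5 only changes which editors count toward k; n stays fixed per bucket.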
bucket   n      3-4 weeks      1-2 months     2-6 months
                k     p        k     p        k     p
control  3092   117   0.038    191   0.062    229   0.074*
invited  11674  496   0.042    841   0.072    1008  0.086*
Chi^2 tests


> prop.test(bucket.survival5$revisions_3_to_4_weeks.k, bucket.survival5$n)

	2-sample test for equality of proportions with continuity correction

data:  bucket.survival5$revisions_3_to_4_weeks.k out of bucket.survival5$n
X-squared = 1.213, df = 1, p-value = 0.2707
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.012508867  0.003212881
sample estimates:
    prop 1     prop 2
0.03783959 0.04248758

> prop.test(bucket.survival5$revisions_1_to_2_months.k, bucket.survival5$n)

	2-sample test for equality of proportions with continuity correction

data:  bucket.survival5$revisions_1_to_2_months.k out of bucket.survival5$n
X-squared = 3.8085, df = 1, p-value = 0.05099
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.0201681321 -0.0003681001
sample estimates:
    prop 1     prop 2
0.06177232 0.07204043

> prop.test(bucket.survival5$revisions_2_to_6_months.k, bucket.survival5$n)

	2-sample test for equality of proportions with continuity correction

data:  bucket.survival5$revisions_2_to_6_months.k out of bucket.survival5$n
X-squared = 4.6468, df = 1, p-value = 0.03111
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.02303135 -0.00153591
sample estimates:
    prop 1     prop 2
0.07406210 0.08634573
Very interesting. We see a clear, significant effect in the long-term (2-6 month) period here (p = 0.031), but no significance at 3-4 weeks (p = 0.27). Wow! Halfak (WMF) (talk) 22:22, 28 September 2015 (UTC)