Research talk:Teahouse long term new editor retention/Work log/2015-09-28
Monday, September 28, 2015
I just got my dataset of 14766 editors from J-Mo. Now I'm going to load them into the DB. I had to write a little bit of sed to switch the format to MySQL's dialect of TSV.
cat th_retention_sample20150928.csv | sed -r "s/,/\t/g" | sed -r "s/\"//g" > th_experimental_user.tsv
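The sed pipeline above turns every comma into a tab and drops every double quote, which is fine as long as no field contains an embedded comma or quote. As a hedged alternative, here is a minimal Python sketch that parses the CSV quoting rules before writing tabs. The function name `csv_to_tsv` and the sample values are made up for illustration; only the column names come from the CREATE TABLE statement in this log.

```python
import csv
import io


def csv_to_tsv(csv_text):
    """Convert CSV text to tab-separated text, honoring CSV quoting rules."""
    # newline="" lets the csv module handle line endings itself.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    # Assumption: no field contains a literal tab or newline.
    return "".join("\t".join(fields) + "\n" for fields in reader)


# Hypothetical sample: the values are illustrative, not rows from the real dataset.
sample = '"user_id","invite_date","bucket"\n"1001","2014-10-20","control"\n'
print(csv_to_tsv(sample), end="")
```

Unlike the sed version, a quoted field such as `"a,b"` stays in one column here.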
Now to load the data into the database:
CREATE TABLE th_experimental_user (user_id INT, invite_date DATE, bucket VARCHAR(255));
$ mysqlimport --local -h analytics-store.eqiad.wmnet -u research --ignore-lines=1 staging datasets/th_experimental_user.tsv
staging.th_experimental_user: Records: 14766 Deleted: 0 Skipped: 0 Warnings: 0
OK. Cool. Now I want some retention measures. It looks like the last set of invites went out in January, so that means I have 8 months before the hard sunset to analyze. In my work on R:Surviving new editor, I took measures with different trial and survival periods. Here, I'll use the ones for which we have enough follow-up time.
SELECT
  user_id,
  SUM(revisions_3_to_4_weeks) AS revisions_3_to_4_weeks,
  SUM(revisions_1_to_2_months) AS revisions_1_to_2_months,
  SUM(revisions_2_to_6_months) AS revisions_2_to_6_months
FROM (
  (SELECT
    user_id,
    SUM(
      rev_timestamp IS NOT NULL AND
      DATEDIFF(rev_timestamp, user_registration) BETWEEN 21 AND 28
    ) AS revisions_3_to_4_weeks,
    SUM(
      rev_timestamp IS NOT NULL AND
      DATEDIFF(rev_timestamp, user_registration) BETWEEN 30 AND 60
    ) AS revisions_1_to_2_months,
    SUM(
      rev_timestamp IS NOT NULL AND
      DATEDIFF(rev_timestamp, user_registration) BETWEEN 60 AND 180
    ) AS revisions_2_to_6_months
  FROM staging.th_experimental_user AS user
  INNER JOIN user USING (user_id)
  LEFT JOIN revision ON
    rev_user = user_id AND
    rev_timestamp >= DATE_FORMAT(
      DATE_ADD(user_registration, INTERVAL 21 DAY),
      "%Y%m%d%H%i%S"
    )
  GROUP BY 1)
  UNION ALL
  (SELECT
    user_id,
    SUM(
      ar_timestamp IS NOT NULL AND
      DATEDIFF(ar_timestamp, user_registration) BETWEEN 21 AND 28
    ) AS revisions_3_to_4_weeks,
    SUM(
      ar_timestamp IS NOT NULL AND
      DATEDIFF(ar_timestamp, user_registration) BETWEEN 30 AND 60
    ) AS revisions_1_to_2_months,
    SUM(
      ar_timestamp IS NOT NULL AND
      DATEDIFF(ar_timestamp, user_registration) BETWEEN 60 AND 180
    ) AS revisions_2_to_6_months
  FROM staging.th_experimental_user AS user
  INNER JOIN user USING (user_id)
  LEFT JOIN archive ON
    ar_user = user_id AND
    ar_timestamp >= DATE_FORMAT(
      DATE_ADD(user_registration, INTERVAL 21 DAY),
      "%Y%m%d%H%i%S"
    )
  GROUP BY 1)
) user_span_revisions
GROUP BY user_id;
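To make the window logic in the query easier to check, here is a rough Python sketch of the same bucketing for a single editor. It assumes timestamps in MediaWiki's 14-digit `YYYYMMDDHHMMSS` format and mimics `DATEDIFF` by comparing calendar dates only. The names `WINDOWS` and `count_window_revisions` and the example timestamps are hypothetical. Since `BETWEEN` is inclusive, a revision on day 60 lands in both the 1-2 month and the 2-6 month counts.

```python
from datetime import datetime

MW_FORMAT = "%Y%m%d%H%M%S"  # MediaWiki-style timestamp, e.g. "20150928214300"

# Inclusive day ranges, copied from the BETWEEN clauses in the query.
WINDOWS = {
    "revisions_3_to_4_weeks": (21, 28),
    "revisions_1_to_2_months": (30, 60),
    "revisions_2_to_6_months": (60, 180),
}


def days_since_registration(registration, timestamp):
    """Whole-day difference between two timestamps, ignoring time of day."""
    reg = datetime.strptime(registration, MW_FORMAT).date()
    ts = datetime.strptime(timestamp, MW_FORMAT).date()
    return (ts - reg).days


def count_window_revisions(registration, timestamps):
    """Count one editor's revisions falling into each survival window."""
    counts = {name: 0 for name in WINDOWS}
    for ts in timestamps:
        days = days_since_registration(registration, ts)
        for name, (first_day, last_day) in WINDOWS.items():
            if first_day <= days <= last_day:
                counts[name] += 1
    return counts


# Hypothetical editor: revisions on day 9, day 21 and day 60 after registration.
example = count_window_revisions(
    "20150101120000",
    ["20150110000000", "20150122000500", "20150302080000"],
)
print(example)
```

The day-9 revision is ignored, the day-21 revision counts toward 3-4 weeks, and the day-60 revision is counted twice because the last two windows share that boundary.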
Here's a sample of the output:
user_id | revisions_3_to_4_weeks | revisions_1_to_2_months | revisions_2_to_6_months |
---|---|---|---|
22890795 | 0 | 0 | 0 |
22891039 | 0 | 0 | 0 |
22891690 | 0 | 0 | 0 |
22892606 | 0 | 0 | 0 |
22892705 | 0 | 0 | 0 |
22892807 | 0 | 0 | 0 |
22893113 | 0 | 0 | 0 |
22893263 | 7 | 67 | 106 |
22895159 | 0 | 2 | 6 |
22895278 | 0 | 0 | 3 |
22895570 | 0 | 0 | 0 |
22895602 | 0 | 7 | 57 |
22895911 | 0 | 0 | 0 |
22896534 | 2 | 16 | 116 |
22897567 | 0 | 1 | 0 |
22897845 | 0 | 0 | 0 |
Cool. Now it's time to load up R. --Halfak (WMF) (talk) 21:43, 28 September 2015 (UTC)
Okay!
1+ edits survival
For each of the survival periods, I considered an editor "surviving" if they saved at least one edit.
bucket | n | 3-4 weeks k | 3-4 weeks p | 1-2 months k | 1-2 months p | 2-6 months k | 2-6 months p |
---|---|---|---|---|---|---|---|
control | 3092 | 247 | 0.080* | 360 | 0.116 | 397 | 0.128 |
invited | 11674 | 1068 | 0.091* | 1512 | 0.130 | 1651 | 0.141 |
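The n, k and p columns above can be derived from per-editor revision counts along these lines. This is only a sketch: the analysis in this log was done in R, and the function name `survival_by_bucket`, its `min_edits` parameter and the sample rows are assumptions made for illustration.

```python
from collections import Counter


def survival_by_bucket(rows, min_edits=1):
    """Return {bucket: (n, k, p)} for (bucket, edit_count) pairs.

    n is the number of editors, k the number with at least `min_edits`
    edits in the survival period, and p the proportion k / n.
    """
    n = Counter()
    k = Counter()
    for bucket, edits in rows:
        n[bucket] += 1
        if edits >= min_edits:
            k[bucket] += 1
    return {bucket: (n[bucket], k[bucket], k[bucket] / n[bucket]) for bucket in n}


# Hypothetical rows: (bucket, revisions saved in one survival period).
rows = [
    ("control", 0), ("control", 3),
    ("invited", 0), ("invited", 1), ("invited", 7), ("invited", 0),
]
print(survival_by_bucket(rows, min_edits=1))
```

Raising `min_edits` gives a stricter definition of survival without changing anything else.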
Chi^2 tests

> prop.test(bucket.survival$revisions_3_to_4_weeks.k, bucket.survival$n)

2-sample test for equality of proportions with continuity correction

data: bucket.survival$revisions_3_to_4_weeks.k out of bucket.survival$n
X-squared = 3.9142, df = 1, p-value = 0.04788
alternative hypothesis: two.sided
95 percent confidence interval:
-0.0226998168 -0.0005037463
sample estimates:
prop 1 prop 2
0.07988357 0.09148535

> prop.test(bucket.survival$revisions_1_to_2_months.k, bucket.survival$n)

2-sample test for equality of proportions with continuity correction

data: bucket.survival$revisions_1_to_2_months.k out of bucket.survival$n
X-squared = 3.6658, df = 1, p-value = 0.05554
alternative hypothesis: two.sided
95 percent confidence interval:
-2.61353e-02 -4.28903e-05
sample estimates:
prop 1 prop 2
0.1164295 0.1295186

> prop.test(bucket.survival$revisions_2_to_6_months.k, bucket.survival$n)

2-sample test for equality of proportions with continuity correction

data: bucket.survival$revisions_2_to_6_months.k out of bucket.survival$n
X-squared = 3.3658, df = 1, p-value = 0.06656
alternative hypothesis: two.sided
95 percent confidence interval:
-0.0266128531 0.0005537941
sample estimates:
prop 1 prop 2
0.1283959 0.1414254
That looks pretty promising. In all three tests, the invited bucket has the higher survival proportion. Only the short-term (3-4 weeks) difference is significant at the 0.05 level (p = 0.048), but the other two are very close (p = 0.056 and p = 0.067).
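As a cross-check on the reported statistics, the X-squared and p-value can be recomputed from the k and n values in the table. The sketch below is a plain-Python version of a 2x2 chi-squared test with Yates' continuity correction, which is what a two-sample test "with continuity correction" is assumed to mean here. The function name `two_proportion_chisq` is hypothetical. For one degree of freedom, the upper tail of the chi-squared distribution equals `erfc(sqrt(x / 2))`, so no statistics library is needed.

```python
import math


def two_proportion_chisq(k, n):
    """Chi-squared test (df = 1) for equal proportions, with Yates' correction.

    k: successes per group, n: group sizes. Returns (statistic, p_value).
    """
    total = n[0] + n[1]
    column_totals = [k[0] + k[1], total - (k[0] + k[1])]
    observed = [[k[i], n[i] - k[i]] for i in range(2)]
    expected = [[n[i] * column_totals[j] / total for j in range(2)] for i in range(2)]
    cells = [(observed[i][j], expected[i][j]) for i in range(2) for j in range(2)]
    # The correction is capped so it never exceeds the observed deviation.
    yates = min(0.5, min(abs(o - e) for o, e in cells))
    statistic = sum((abs(o - e) - yates) ** 2 / e for o, e in cells)
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# 1+ edits, 3-4 weeks: control k=247 of n=3092, invited k=1068 of n=11674.
statistic, p_value = two_proportion_chisq((247, 1068), (3092, 11674))
print(round(statistic, 4), round(p_value, 5))
```

With the counts from the table, this should agree with the first test in the R output up to rounding.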
5+ edits survival
What if we don't consider someone surviving unless they score at least 5 edits in a time period?
bucket | n | 3-4 weeks k | 3-4 weeks p | 1-2 months k | 1-2 months p | 2-6 months k | 2-6 months p |
---|---|---|---|---|---|---|---|
control | 3092 | 117 | 0.038 | 191 | 0.062 | 229 | 0.074* |
invited | 11674 | 496 | 0.042 | 841 | 0.072 | 1008 | 0.086* |
Chi^2 tests

> prop.test(bucket.survival5$revisions_3_to_4_weeks.k, bucket.survival5$n)

2-sample test for equality of proportions with continuity correction

data: bucket.survival5$revisions_3_to_4_weeks.k out of bucket.survival5$n
X-squared = 1.213, df = 1, p-value = 0.2707
alternative hypothesis: two.sided
95 percent confidence interval:
-0.012508867 0.003212881
sample estimates:
prop 1 prop 2
0.03783959 0.04248758

> prop.test(bucket.survival5$revisions_1_to_2_months.k, bucket.survival5$n)

2-sample test for equality of proportions with continuity correction

data: bucket.survival5$revisions_1_to_2_months.k out of bucket.survival5$n
X-squared = 3.8085, df = 1, p-value = 0.05099
alternative hypothesis: two.sided
95 percent confidence interval:
-0.0201681321 -0.0003681001
sample estimates:
prop 1 prop 2
0.06177232 0.07204043

> prop.test(bucket.survival5$revisions_2_to_6_months.k, bucket.survival5$n)

2-sample test for equality of proportions with continuity correction

data: bucket.survival5$revisions_2_to_6_months.k out of bucket.survival5$n
X-squared = 4.6468, df = 1, p-value = 0.03111
alternative hypothesis: two.sided
95 percent confidence interval:
-0.02303135 -0.00153591
sample estimates:
prop 1 prop 2
0.07406210 0.08634573
Very interesting. We see a significant effect in the long-term (2-6 months) bucket here (p = 0.031), but no significance at 3-4 weeks (p = 0.271); 1-2 months falls just short (p = 0.051). Wow! --Halfak (WMF) (talk) 22:22, 28 September 2015 (UTC)
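The confidence intervals in the prop.test output can also be recomputed from the counts. The sketch below assumes the interval is the difference in proportions plus or minus a normal quantile times the unpooled standard error, widened by a continuity correction. That is an assumption about how R builds the interval, not something stated in this log, and the function name `diff_proportion_ci` is hypothetical. If the assumption holds, it would also explain why the 1-2 months interval excludes zero while its p-value is 0.051: the test pools the two groups, the interval does not.

```python
import math
from statistics import NormalDist


def diff_proportion_ci(k, n, conf_level=0.95):
    """Confidence interval for p1 - p2 with a continuity correction.

    k: successes per group, n: group sizes. Returns (lower, upper).
    """
    p1, p2 = k[0] / n[0], k[1] / n[1]
    delta = p1 - p2
    inverse_n = 1 / n[0] + 1 / n[1]
    # Continuity correction, capped so it cannot exceed the observed difference.
    yates = min(0.5, abs(delta) / inverse_n)
    z = NormalDist().inv_cdf((1 + conf_level) / 2)
    # Unpooled standard error: each group keeps its own variance estimate.
    std_err = math.sqrt(p1 * (1 - p1) / n[0] + p2 * (1 - p2) / n[1])
    width = z * std_err + yates * inverse_n
    return max(delta - width, -1.0), min(delta + width, 1.0)


# 5+ edits, 2-6 months: control k=229 of n=3092, invited k=1008 of n=11674.
lower, upper = diff_proportion_ci((229, 1008), (3092, 11674))
print(lower, upper)
```

Both bounds are negative for the 2-6 months comparison, meaning the control proportion is lower than the invited one across the whole interval.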