Research talk:Onboarding new Wikipedians/OB6/Work log/2013-11-27

Wednesday, November 27th - Survival model

I hacked together a survival model based on the number of sessions completed. I considered a user "death" to occur when they don't come back to edit for at least 1 week. The first model predicts the hazard by test condition alone:

Call:
coxph(formula = Surv(sessions, !censored) ~ bucket, data = user.metas)

  n= 26920, number of events= 26657 

                coef exp(coef)  se(coef)      z Pr(>|z|)
buckettest -0.009573  0.990473  0.012250 -0.781    0.435

           exp(coef) exp(-coef) lower .95 upper .95
buckettest    0.9905       1.01     0.967     1.015

Concordance= 0.503  (se = 0.005 )
Rsquare= 0   (max possible= 1 )
Likelihood ratio test= 0.61  on 1 df,   p=0.4345
Wald test            = 0.61  on 1 df,   p=0.4345
Score (logrank) test = 0.61  on 1 df,   p=0.4345

Basically, this says that being in the test condition slightly lowers the hazard (of leaving), but there is little evidence that the effect is real (p=0.43).
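As a sanity check on that reading, the hazard ratio, its 95% interval and the p-value can be recomputed by hand from the printed coefficient and standard error. This is a minimal sketch in plain Python, not the code used for the analysis; the function name `wald_summary` and the dictionary keys are my own, and it assumes the usual normal (Wald) approximation.

```python
import math

def wald_summary(coef, se, z_crit=1.959964):
    """Recompute Wald statistics for one Cox coefficient (illustrative helper)."""
    z = coef / se
    return {
        # exp(coef) is the hazard ratio relative to the baseline group
        "hazard_ratio": math.exp(coef),
        # 95% interval on the hazard-ratio scale
        "lower_95": math.exp(coef - z_crit * se),
        "upper_95": math.exp(coef + z_crit * se),
        "z": z,
        # two-sided p-value under a standard normal: 2 * (1 - Phi(|z|))
        "p_value": math.erfc(abs(z) / math.sqrt(2)),
    }

# Values copied from the "buckettest" row of the first model above
bucket = wald_summary(coef=-0.009573, se=0.012250)
for name, value in bucket.items():
    print(f"{name}: {value:.4f}")
```

The interval straddles a hazard ratio of 1, which is the same statement as the non-significant p-value: the data are compatible with the test condition slightly lowering, slightly raising, or not changing the hazard.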

Next, I tried to control for initial investment by including the amount of time spent editing in the first session as a predictor. I've used this approach in the past to explain some of the noise around less important predictors (see R:First edit session).

Call:
coxph(formula = Surv(sessions, !censored) ~ bucket + first_session_duration, 
    data = user.metas[sessions > 0, ])

  n= 8999, number of events= 8736 

                           coef exp(coef) se(coef)       z Pr(>|z|)    
buckettest             -0.01580   0.98433  0.02142  -0.737    0.461    
first_session_duration -0.96998   0.37909  0.02052 -47.272   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

                       exp(coef) exp(-coef) lower .95 upper .95
buckettest                0.9843      1.016    0.9438    1.0265
first_session_duration    0.3791      2.638    0.3641    0.3946

Concordance= 0.826  (se = 0.011 )
Rsquare= 0.377   (max possible= 1 )
Likelihood ratio test= 4265  on 2 df,   p=0
Wald test            = 2235  on 2 df,   p=0
Score (logrank) test = 1438  on 2 df,   p=0

Note that first session duration is highly significant and explains a large share of the variation in survival between sessions. The R^2 of this model jumped from effectively zero to 0.377. Sadly, the test condition still fails to reach significance (p=0.461).
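The jump in R^2 can be reproduced from the likelihood ratio statistics printed in the two summaries. The sketch below assumes the "Rsquare" line is the likelihood-ratio based (Cox-Snell style) pseudo-R^2, 1 - exp(-LR / n); that is my assumption about what the summary reports, though it does match both printed values. The function name `pseudo_r2` is made up for illustration.

```python
import math

def pseudo_r2(lr_stat, n):
    """Likelihood-ratio based pseudo-R^2: 1 - exp(-LR / n) (assumed definition)."""
    return 1.0 - math.exp(-lr_stat / n)

# LR statistics and sample sizes copied from the two model summaries above
r2_bucket_only = pseudo_r2(lr_stat=0.61, n=26920)
r2_with_duration = pseudo_r2(lr_stat=4265, n=8999)

print(f"bucket only:             {r2_bucket_only:.6f}")
print(f"bucket + first session:  {r2_with_duration:.3f}")
```

Note that the two models are fit on different sets of users (n=26920 versus n=8999 for `sessions > 0`), so the two values are not a like-for-like comparison on the same data.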

--Halfak (WMF) (talk) 21:53, 27 November 2013 (UTC)