Research talk:Autoconfirmed article creation trial/Work log/2018-02-07

Wednesday, February 7, 2018

Today I'll continue working on our analysis to reach conclusions for our hypotheses. The main focus is H5, surviving editors.

H5: The proportion of surviving new editors who make an edit in their fifth week is unchanged.

Per the description of H5 on our research page, we use the concept of a surviving new editor to measure retention. Based on the time span of the trial, we define a "surviving new editor" as an account that makes at least one edit in both the first and the fifth week after registration.
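The definition above can be sketched in code as follows. This is a minimal illustration, not the query used in the analysis: the function name is_surviving is hypothetical, and we assume that weeks are counted as consecutive 7-day windows starting at the registration timestamp, so that the fifth week covers days 28 through 34.

```python
from datetime import datetime, timedelta

WEEK = timedelta(days=7)

def is_surviving(registration, edit_times):
    """Return True if the account edited in both its first and fifth week.

    Assumption: week k is the half-open window
    [registration + (k-1)*7 days, registration + k*7 days).
    """
    first_week_end = registration + WEEK
    fifth_week_start = registration + 4 * WEEK
    fifth_week_end = registration + 5 * WEEK
    edited_first = any(registration <= t < first_week_end for t in edit_times)
    edited_fifth = any(fifth_week_start <= t < fifth_week_end for t in edit_times)
    return edited_first and edited_fifth

# Hypothetical account: registered 2017-09-14, one edit the next day,
# and one edit 29 days after registration (inside the fifth week).
reg = datetime(2017, 9, 14)
survivor = is_surviving(reg, [datetime(2017, 9, 15), datetime(2017, 10, 13)])

# Same registration, but the second edit falls in the third week instead.
non_survivor = is_surviving(reg, [datetime(2017, 9, 15), datetime(2017, 9, 30)])
```

An account that only edits in the fifth week is not counted as surviving under this definition, since it was not active in its first week.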

We started our preliminary analysis of this in our August 17 work log. Looking at historical data, we found that splitting the data into autocreated and non-autocreated accounts gave us more information. We also found it useful, as before, to measure the proportion of surviving new editors relative to accounts that made at least one edit in their first week. If we instead base it on all registered accounts, we are to a certain degree just measuring the proportion of accounts that make edits in their first week.
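To illustrate why the choice of denominator matters, here is a small sketch with a made-up cohort. The function name and the cohort numbers are hypothetical and are not taken from our data. It shows that the proportion based on registrations is the product of the first-week activity rate and the proportion based on first-week editors.

```python
def survival_proportions(accounts):
    """accounts: list of (edited_first_week, edited_fifth_week) booleans,
    one tuple per registered account.

    Returns (survivors / registered, survivors / first-week editors).
    """
    registered = len(accounts)
    active_first = sum(1 for first, _ in accounts if first)
    survivors = sum(1 for first, fifth in accounts if first and fifth)
    per_registered = survivors / registered if registered else 0.0
    per_active = survivors / active_first if active_first else 0.0
    return per_registered, per_active

# Hypothetical cohort: 100 registrations, 30 edit in week one,
# and 3 of those also edit in week five.
cohort = [(True, True)] * 3 + [(True, False)] * 27 + [(False, False)] * 70
per_registered, per_active = survival_proportions(cohort)

# The registration-based proportion mixes two effects:
activity_rate = 30 / 100  # share of accounts editing in week one
decomposed = activity_rate * per_active
```

In this example the registration-based proportion is 3% while the proportion among first-week editors is 10%. A change in how many accounts edit at all in their first week would move the former even if retention among active editors stayed the same.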

First, we update the historical graph with data from the second half of 2017:

We can see in the graph that the proportions are generally higher for the autocreated accounts. This is largely due to few accounts making edits. There is also an increase in the survival of non-autocreated accounts in the second half of 2017, which might be around the time ACTRIAL started. As we have done for other hypotheses, we look more closely at 2016 and 2017, and add a trend line to the graph:

This more focused plot suggests that the proportion for autocreated accounts is fairly consistent across time, while there is more variation in the proportion for non-autocreated accounts. For the latter we can also see increases in survival proportions around August and September of each year. This trend is also found in the historical plot, although it appears that the increase might be larger in more recent years.

As with previous hypotheses, we wish to investigate the trend in this data, and move from daily to monthly calculations in order to build forecasting models. The graph of the monthly proportion of surviving editors looks like this:
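The move from daily to monthly proportions can be sketched as below. This is an illustration under assumptions: the function name and input layout are hypothetical, and we assume the monthly proportion is computed by pooling the daily counts within a month rather than by averaging the daily proportions. The counts shown are made up.

```python
from collections import defaultdict
from datetime import date

def monthly_survival(daily):
    """daily: iterable of (registration_date, n_first_week_editors, n_survivors).

    Returns {(year, month): survivors / first-week editors}, with counts
    pooled across all registration dates in the month.
    """
    totals = defaultdict(lambda: [0, 0])
    for day, active, survivors in daily:
        key = (day.year, day.month)
        totals[key][0] += active
        totals[key][1] += survivors
    # Months with no first-week editors get None rather than a division error.
    return {k: (s / a if a else None) for k, (a, s) in sorted(totals.items())}

# Hypothetical daily counts, not taken from our data.
daily = [
    (date(2017, 9, 1), 100, 3),
    (date(2017, 9, 2), 50, 3),
    (date(2017, 10, 1), 80, 2),
]
result = monthly_survival(daily)
```

Pooling the counts weights each day by how many accounts were active, so days with very few registrations do not dominate the monthly figure.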

For autocreated accounts, the monthly plot shows that their survival rate appears to have shifted from around 5% down to around 3.75% between 2013 and 2014. This could be due to the SUL finalization, in that the notion of a "global account" is different after that project. Once all accounts are connected, movement between Wikipedias is easier. When it comes to non-autocreated accounts, we see that survival has historically been around 2.5%, but that the variation in survival has increased in more recent years. There are peaks of survival in the second half of 2015 and 2016, going up to about 4%. Lastly, we can see a similar peak in the second half of 2017. We are curious to see whether our models suggest that 2017 is as expected or not.

We first analyze the data for autocreated accounts. Here we find that the time series is not stationary, which should be expected given the shifts in means across time. Secondly, we do not find evidence of a seasonal component. R's auto.arima function suggests an ARIMA(0,1,1) model with drift. Given the shift in mean between 2013 and 2015, allowing drift is sensible. We did investigate competing models, but found that none of them improve significantly over the proposed one, and we therefore use it. The forecast for the first three months of ACTRIAL then looks as follows:
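For reference, an ARIMA(0,1,1) model with drift can be written out as below. The symbols are our own notation and do not come from the fitted model output: y_t is the monthly survival proportion, B the backshift operator, c the drift, theta the moving-average coefficient, and epsilon_t the error term. The second line gives the point forecast h months ahead from the last observed month T.

```latex
\begin{align}
  (1 - B)\, y_t &= c + \varepsilon_t + \theta\, \varepsilon_{t-1} \\
  \hat{y}_{T+h \mid T} &= y_T + h\, c + \theta\, \varepsilon_T , \qquad h \ge 1
\end{align}
```

The forecast is therefore a straight line with slope c, offset by the last estimated error, which is why the drift term is what allows the model to follow a gradual change in the mean.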

We can see from the graph above that the survival rate during ACTRIAL is within what we would expect. The increase in October is higher than predicted, while for the two other months the true proportion is very close to the forecasted value. These findings suggest that the survival rate for autocreated accounts is unaffected by ACTRIAL.

Next we make a similar analysis for non-autocreated accounts. In the graph of monthly survival proportion, we can notice an increased variation in the data in more recent years, particularly around the time of year when ACTRIAL started. We investigated transforming the data in order to reduce the variation, but found that it had no effect in this case. The time series appears to have a 12-month cycle, and is stationary after controlling for seasonality. Applying R's auto.arima function suggested an ARIMA(0,0,3)(2,1,0)[12] model. We also built models based on studying the ACF and PACF graphs, but could not find any significant improvements. Using the suggested model to forecast the first three months in which ACTRIAL was active gives the following graph:
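Written out, an ARIMA(0,0,3)(2,1,0)[12] model has the form below. The notation is ours: B is the backshift operator, theta_1 to theta_3 are the non-seasonal moving-average coefficients, Phi_1 and Phi_2 the seasonal autoregressive coefficients, and epsilon_t the error term. We assume no constant term here, since the log does not say whether one was included.

```latex
\begin{equation}
  \left(1 - \Phi_1 B^{12} - \Phi_2 B^{24}\right)\left(1 - B^{12}\right) y_t
  = \left(1 + \theta_1 B + \theta_2 B^{2} + \theta_3 B^{3}\right) \varepsilon_t
\end{equation}
```

The factor (1 - B^12) is the single seasonal difference, meaning the model works on the change relative to the same month in the previous year. This matches the observation that the series is stationary once the 12-month cycle is controlled for.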

Here we can see that for all three months, the actual survival rate is higher than forecasted, and typically outside of the 95% confidence interval. This suggests that ACTRIAL could be having a positive effect on retention of non-autocreated accounts. Further research is needed to understand the causality, particularly due to the increased variance in retention that has been seen in recent years for the months around when ACTRIAL started.
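The comparison between observed values and the forecast interval can be sketched as follows. The function name is hypothetical, and the numbers are placeholders chosen only to show the mechanics; they are not the results of the trial.

```python
def outside_interval(actual, lower, upper):
    """Return True if the observed value lies outside [lower, upper]."""
    return not (lower <= actual <= upper)

# Hypothetical placeholder values, NOT the trial's results:
# (observed proportion, lower 95% bound, upper 95% bound)
example = {
    "month 1": (0.045, 0.020, 0.040),
    "month 2": (0.038, 0.018, 0.039),
    "month 3": (0.042, 0.017, 0.037),
}

# Flag the months where the observation falls outside the interval,
# and separately record whether it is above the upper bound.
flags = {month: outside_interval(*values) for month, values in example.items()}
above = {month: values[0] > values[2] for month, values in example.items()}
```

With three forecasted months, one observation outside a 95% interval is weak evidence on its own. Several months falling on the same side of the forecast is what makes the pattern worth investigating further.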