Research talk:Autoconfirmed article creation trial/Work log/2018-03-09

Friday, March 9, 2018

Today I'll wrap up our analysis by completing H21.

H21: The quality of newly created articles after 30 days will be unchanged.

In order to answer this question, we gathered datasets of non-redirecting article creations using the Data Lake and the log database, similarly to how we have gathered this data for other hypotheses. Because other hypotheses also investigate aspects of article quality, survival, and deletion (e.g. H14, H18, and H20), we limit the analysis of H21 to studying the change in quality of articles that lasted at least 30 days. In other words, we do not use "the article was deleted" as a characteristic of its quality. After limiting the dataset that way, we gathered predictions for each article's initial revision and its most recent revision 30 days after creation, using ORES' draft quality and WP 1.0 article quality models. Because we need reliable data on when articles were deleted, we limit the dataset to only contain articles created on or after July 1, 2014; we have used this limitation in other hypotheses for precisely the same reason.
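As an illustrative sketch of the scoring step (not the exact pipeline code), revisions can be scored in batches against ORES' public scoring endpoint. The revision IDs below are hypothetical placeholders, and the model names are the ones the English Wikipedia models used at the time:

```r
# Minimal sketch: score revisions with ORES' v3 scoring endpoint for enwiki.
library(httr)
library(jsonlite)

score_revisions <- function(rev_ids) {
  resp <- GET(
    "https://ores.wikimedia.org/v3/scores/enwiki/",
    query = list(
      models = "draftquality|wp10",
      revids = paste(rev_ids, collapse = "|")
    )
  )
  fromJSON(content(resp, as = "text"), simplifyVector = FALSE)
}

# Score an article's creating revision and its revision at the 30-day mark
# (hypothetical revision IDs):
scores <- score_revisions(c(123456789, 234567890))
```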

In this analysis, we measure "change in quality" as the change in the weighted sum of the WP 1.0 quality predictions from ORES. This calculation uses the "weighted quality sum" from Halfaker's 2017 research paper, an approach we also used in the analysis of H20 and H22. The weighted quality sum can be seen as an estimate of an article's overall quality: a score of 1.0 means the article is roughly Start-class, 2.0 is C-class, and so on.
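As a minimal sketch of the calculation: each WP 1.0 assessment class gets an integer weight, and we take the probability-weighted sum over the classes. Here `probs` is assumed to be a named vector of class probabilities from the WP 1.0 model:

```r
# Weighted quality sum: Stub = 0, Start = 1, C = 2, B = 3, GA = 4, FA = 5,
# summed with each class's predicted probability as its weight.
weighted_sum <- function(probs) {
  weights <- c(Stub = 0, Start = 1, C = 2, B = 3, GA = 4, FA = 5)
  sum(probs[names(weights)] * weights)
}

# Example: an article ORES considers mostly Start-class with some C-class:
weighted_sum(c(Stub = 0.05, Start = 0.6, C = 0.25, B = 0.07, GA = 0.02, FA = 0.01))
# ~ 1.44, i.e. between Start- and C-class
```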

Our dataset contains 817,273 articles created between July 1, 2014 and November 30, 2017. Of these, we find 29,797 to be redirects at the 30-day threshold. Because these are no longer "articles", we remove them from the dataset. We then calculate the weighted sum at creation and after thirty days for all remaining article creations, as well as the difference between the two. First, we want to understand the general distribution of this data, so we make a histogram:
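A rough sketch of this step, assuming a hypothetical data frame `articles` with the weighted sums at creation (`ws_creation`) and at 30 days (`ws_30days`), plus a flag for redirects at the 30-day mark:

```r
library(ggplot2)

# Drop articles that are redirects at the 30-day mark, then compute the
# change in weighted quality sum for each remaining article.
articles <- subset(articles, !is_redirect_30)
articles$ws_delta <- articles$ws_30days - articles$ws_creation

# Histogram of the change in quality:
ggplot(articles, aes(x = ws_delta)) +
  geom_histogram(binwidth = 0.2) +
  labs(x = "Change in weighted quality sum after 30 days",
       y = "Number of articles")
```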

The histogram can tell us several things. First, we see a single spike just around 0, indicating that a lot of articles see very little change in quality. That spike covers changes between -0.1 and 0.1, i.e. less than 1/10 of the gap between quality classes on the WP 1.0 assessment scale. Secondly, a larger proportion of the distribution lies to the right of 0, meaning that most articles that survive for at least 30 days also improve in quality during that time. Some of them improve quite significantly, moving up one or two quality classes (e.g. from Start-class to C-class or B-class). Lastly, the margins of the histogram are quite wide, spanning from -5 to 5. That is necessary to capture the entire dataset, as some articles have a very large increase or decrease in quality.

The width of the distribution is of some concern, and we therefore first compute the 1st and 99th percentiles of the distribution in order to determine whether values outside them are outliers that should be removed. The 1st percentile is a change of -0.78, and the 99th is 2.3. We would expect articles to be more likely to gain quality than to lose it, and, when an article does lose quality, for the loss to be limited. Given how narrow the distribution is within these percentiles, we define articles outside this range as outliers and remove them from the dataset. We then calculate a daily mean and plot it across time:
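Continuing the sketch from above with the hypothetical `articles` data frame, and assuming it also carries a `creation_date` column:

```r
# Treat everything outside the 1st-99th percentile range as outliers.
cutoffs <- quantile(articles$ws_delta, probs = c(0.01, 0.99))
trimmed <- subset(articles,
                  ws_delta >= cutoffs[[1]] & ws_delta <= cutoffs[[2]])

# Mean change in quality per creation date.
daily_mean <- aggregate(ws_delta ~ creation_date, data = trimmed, FUN = mean)
```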

What we see in the graph above is that the average change is fairly stable across time. There is some variation in 2016 and 2017, and what appears to be a slight increase in the second half of 2017. Focusing on 2016 and 2017 gives us the following graph:

We can see in the more focused graph that there is some variation, for example an increase in the beginning of the year. There is also some more movement in the second half of 2017, and the changes during ACTRIAL appear to be roughly where one would expect them to be.

We first compare the first two months of ACTRIAL as a whole against similar periods in prior years, both for the daily averages and for the average change across the whole set of articles. The overall trend is fairly stable across the dataset, so we use the same months of 2014–2016 as our comparison. In both cases, we find a small but statistically significant increase. The average change for ACTRIAL is 0.21, compared to 0.17 for the same months in 2014–2016. T-tests for both variants were statistically significant (across the whole dataset: t=-12.58, df=48078, p << 0.001; for daily averages: t=-9.55, df=156.33, p << 0.001).
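The fractional degrees of freedom suggest these are Welch's two-sample t-tests, which is R's default. A sketch, where `comparison_articles` and `comparison_daily` are hypothetical subsets covering the comparison months and `is_actrial` is an assumed indicator flag:

```r
# Welch two-sample t-tests (R's default), ACTRIAL vs. the same months
# of 2014-2016.
t.test(ws_delta ~ is_actrial, data = comparison_articles)  # whole dataset
t.test(ws_delta ~ is_actrial, data = comparison_daily)     # daily averages
```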

We would also like to examine this using a forecasting model, since there appear to have been increases in quality prior to ACTRIAL starting that we would like to account for. As we have done for many other hypotheses, we switch from measuring the daily mean to measuring it on a bimonthly basis, as that mitigates some of the challenges of making single-day predictions. The bimonthly graph looks as follows:
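A sketch of the aggregation, assuming "bimonthly" here means two data points per month (days 1–15 and 16 onwards) and reusing the hypothetical `trimmed` data frame:

```r
library(lubridate)

# Label each creation date with its half-month (e.g. "2017-09-H1").
trimmed$half_month <- paste0(
  format(trimmed$creation_date, "%Y-%m-"),
  ifelse(day(trimmed$creation_date) <= 15, "H1", "H2")
)

# Mean change in quality per half-month.
bimonthly_mean <- aggregate(ws_delta ~ half_month, data = trimmed, FUN = mean)
```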

Here we see a fair amount of stability in 2015, but more variation in 2016 and early 2017. Secondly, we again see an increase in July and August, prior to ACTRIAL starting. Continuing the analysis, we first inspect the stationarity and seasonality of the time series, finding that it is non-stationary but does not appear to have a strong seasonal component. Using R's auto.arima function, we get a candidate ARIMA(0,1,3) model to use as a basis. We iteratively inspect other candidate models, both with and without a seasonal component, and find that an ARIMA(0,1,1) model provides similar performance with lower complexity. Applying Occam's razor, we choose the latter model. Using it to forecast the first two months of ACTRIAL gives the following result:
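A sketch of the model selection and forecast using the forecast package, assuming `bimonthly_ts` is the bimonthly series as a ts object ending just before ACTRIAL:

```r
library(forecast)

# auto.arima suggested ARIMA(0,1,3); we keep the simpler ARIMA(0,1,1).
auto_fit <- auto.arima(bimonthly_ts)
fit <- Arima(bimonthly_ts, order = c(0, 1, 1))

# Forecast four half-month periods (roughly the first two months of
# ACTRIAL) with a 95% confidence interval, and plot the result.
fc <- forecast(fit, h = 4, level = 95)
plot(fc)
```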

We can see that the change during ACTRIAL is within the 95% confidence interval. This result contradicts our previous analysis, which found a significant difference during ACTRIAL. As mentioned previously, the fluctuations prior to ACTRIAL starting should be taken into consideration, meaning that we give more weight to the results of the forecasting model. In other words, we do not see a clear indication that there is a significant change during ACTRIAL.

The way H21 is phrased, it states that articles do not improve in quality during the first 30 days. The analysis above reveals that articles that survive at least 30 days tend, on average, to see a small increase in quality, meaning that H21 is not supported. As we saw in the historical graph, this has been the case going back at least as far as July 1, 2014.