Research talk:Autoconfirmed article creation trial/Work log/2018-02-24
Saturday, February 24, 2018
Today I'll focus on analyzing H22 using our dataset of predictions for AfC submissions. If time allows, I'll also gather data for H21.
H22: The quality of articles entering the AfC queue will be unchanged.
We have gathered predictions from ORES' draft quality and article quality models for all revisions submitted to AfC from the Draft namespace. Based on our analysis of H16 and H17 (see the February 20 and February 21 work logs, respectively), we use our processed dataset of submissions made between July 1, 2014 and November 30, 2017 where the submission appears to have been reviewed.
The first indicator of draft quality we will look at is whether we are able to retrieve the content of a draft's submitted revision. If the revision is unavailable, it suggests there were quality issues with it, such as a copyright violation. Overall, we find very few deleted revisions: out of 80,389 total revisions, only 954 (1.2%) are unavailable. Has this proportion been stable over time? We calculate it on a daily basis and plot it over time, both from July 2014 onwards and from January 2016 onwards:
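The daily proportion underlying these plots is a simple group-by over the submission records. A minimal Python sketch, using hypothetical records in place of the real dataset (the actual analysis works from the processed AfC submission data):

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (submission date, whether the revision's content
# could be retrieved). These stand in for the real AfC dataset.
submissions = [
    (date(2017, 9, 15), True),
    (date(2017, 9, 15), False),
    (date(2017, 9, 16), True),
    (date(2017, 9, 16), True),
]

def daily_unavailable_proportion(records):
    """Per-day proportion of submitted revisions whose content is unavailable."""
    totals = defaultdict(int)
    unavailable = defaultdict(int)
    for day, available in records:
        totals[day] += 1
        if not available:
            unavailable[day] += 1
    return {day: unavailable[day] / totals[day] for day in totals}

props = daily_unavailable_proportion(submissions)
```

With few submissions per day, a single deleted revision moves the proportion a lot, which is consistent with the high day-to-day variation seen in the graphs.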
We can see in the graph that the proportion is typically low, and that there are many days where it is zero. There is also a lot of variation, which is likely due to the number of submissions per day being fairly low. There are also no clear trends in the proportion; for example, it does not follow a specific yearly cycle.
In the plot from 2016 onwards, we can again see that there are a lot of days with no permanent deletions. There appears to be an increase in the proportion around the end of 2016 and beginning of 2017. That might simply be because the number of AfC submissions is low then, as Wikipedia is a fairly quiet place around the holidays. Lastly, we see a bump in the trend after the start of ACTRIAL: there are some days with a fairly consistently high proportion. Note that during ACTRIAL the proportion is less prone to drop to zero; this is likely due to the increase in submissions during ACTRIAL, which we analyzed in H16.
Because the overall trend shows that the proportion has been stable, we investigate whether there has been a significant change in the proportion during ACTRIAL. Based on the previous graphs, we can compare it to the same time period in each of the years 2014–2016. We first investigate the distribution of the data and find that it is very skewed. This is due to the large number of days with no deletions. Note that during the first two months of ACTRIAL there have been nine days with no permanent deletions, or 14.8%. During the same time periods of 2014–2016 there were 106 days with no deletions, which out of 183 days in total equals 57.9%. Regardless of how we treat the days with no deletions, there appears to be a clear change during ACTRIAL, likely due to the increase in submissions.
Due to the skewness, we use a Mann-Whitney U test to compare the two time periods and find that there has been a significant change (W=3720, p << 0.001). Looking at the pre-ACTRIAL period, we have a median of 0% (as mentioned above, almost 58% of the days have no deletions) and a mean of 1.025%. During ACTRIAL we have a median of 1.16% and a mean of 1.69%. These findings do not support H22, which hypothesizes no change. We also notice that the proportion appears to shift during ACTRIAL itself; further study could determine whether the later months of the trial are a regression back towards the previous mean.
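For illustration, the U statistic underlying this test can be computed directly from midranks. A minimal pure-Python sketch (the actual analysis presumably used R's wilcox.test, which reports the equivalent W statistic):

```python
def rank(values):
    """Assign ranks 1..n, giving tied values their midrank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        # Find the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """U statistic for sample x versus sample y."""
    ranks = rank(list(x) + list(y))
    r1 = sum(ranks[: len(x)])
    return r1 - len(x) * (len(x) + 1) / 2

u = mann_whitney_u([1, 2, 2], [2, 3, 4])
```

In practice one would also compute the p-value via the normal approximation with tie correction, which a statistical package handles for us.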
Next, we investigate to what extent the AfC submissions that were retrievable would be labelled as "OK" by ORES' draft quality model. As we did for H20, we require the model to have a confidence of at least 66.4% in order to label a revision "OK". We calculate the proportion of revisions flagged as "OK" per day and plot them over time, again both across the whole dataset and from 2016 onwards:
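The thresholding step can be sketched as follows; the prediction dictionaries are hypothetical, but the class names match ORES' draft quality model:

```python
# A revision is labelled "OK" only if the model's confidence in the
# "OK" class reaches the 66.4% threshold used in our analysis.
OK_THRESHOLD = 0.664

def flag_ok(prediction):
    """True if the draft quality prediction confidently labels the revision OK."""
    return prediction.get("OK", 0.0) >= OK_THRESHOLD

# Hypothetical per-class probabilities from the draft quality model.
preds = [
    {"OK": 0.91, "spam": 0.05, "vandalism": 0.02, "attack": 0.02},
    {"OK": 0.55, "spam": 0.40, "vandalism": 0.03, "attack": 0.02},
]
proportion_ok = sum(flag_ok(p) for p in preds) / len(preds)
```

Note that a revision whose most probable class is "OK" but with confidence below 66.4% (like the second prediction above) is not counted as "OK".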
The first thing that is very interesting to notice here is the trend of an increasing proportion over time. Secondly, we notice another increase in the second half of 2017. This latter trend is perhaps easier to see in the graph starting in 2016, where we find what appears to be a shift upwards around the start of ACTRIAL. It is unclear from the graph whether this increase is caused by ACTRIAL. Given the increasing trend across time, we do not want to compare the first two months of ACTRIAL as a whole against previous years, and instead turn to forecasting models.
For our forecasting model, we switch to calculating the proportion on a bimonthly basis in order to alleviate issues with daily forecasts. The graph of the bimonthly proportion across time looks as follows:
We can again see the increasing trend over time and the further increase in the second half of 2017. In this graph it appears clear that there's an increase prior to ACTRIAL starting, thereby suggesting that the higher level during ACTRIAL is not unexpected.
In order to understand more about this, we build forecasting models of the time series. We first investigate stationarity and seasonality of the time series and find that it is non-stationary, which is expected due to the increasing trend over time, and that it is unclear whether it has a seasonal component (i.e. a yearly cycle). Using R's auto.arima function, we get a candidate ARIMA(3,0,0)(0,1,0) model that allows for drift. We iteratively investigate other models both with and without seasonal components, but find that none of them perform better than the suggested one. Using that model to forecast the first two months of ACTRIAL results in the following graph:
We can see that three of the four bimonthly periods in the first two months fall within the 95% confidence interval of the forecast. The movement during ACTRIAL is slightly off from what is expected, as we can see the second half of October is outside the forecast. At the same time, we see an expected regression towards the mean in the forecast, but it is not clear that this is occurring during ACTRIAL. There is also quite a lot of uncertainty in the forecast, perhaps due to the increase in the proportion prior to ACTRIAL. Combining these signals, we find it unclear whether ACTRIAL is causing a significant change in the proportion of submissions flagged as "OK". This finding supports H22.
The last quality indicator we will investigate is the general quality of drafts that are labelled "OK". To estimate the general quality, we use ORES' article quality model. This model is trained on article quality assessments following the WP 1.0 assessment scale, and will predict which of six assessment ratings an article falls into (e.g. Start-class). As we did for H20, we adopt Halfaker's weighted sum as a representation of quality. A weighted sum of 1.0 means the article appears to be Start-class, 2.0 is C-class, and so on. We calculate this weighted sum for each submission that is labelled "OK" by the draft quality model, and plot it over time:
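The weighted sum can be computed directly from the article quality model's class probabilities. A minimal sketch with a hypothetical prediction (the class weights follow the scale described above, with Stub as 0):

```python
# Weight per predicted assessment class: a weighted sum of 1.0 suggests
# Start-class, 2.0 suggests C-class, and so on up the WP 1.0 scale.
WEIGHTS = {"Stub": 0, "Start": 1, "C": 2, "B": 3, "GA": 4, "FA": 5}

def weighted_sum(probabilities):
    """Collapse per-class probabilities into a single quality score."""
    return sum(probabilities[cls] * w for cls, w in WEIGHTS.items())

# Hypothetical article quality prediction for one submission.
pred = {"Stub": 0.5, "Start": 0.3, "C": 0.1, "B": 0.05, "GA": 0.03, "FA": 0.02}
score = weighted_sum(pred)
```

A score of 0.87 for this example would indicate a draft somewhere between Stub- and Start-class, leaning towards Start.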
There are three things to note here. First, there appears to be a trend of increasing quality over time as the blue trend line moves slowly upwards. This trend is not as clear as it was for H20. Secondly, it is unclear whether there is a change in the average quality sum during ACTRIAL. We can see that the variation is reduced, which is expected since the number of submissions has increased, but it is not clear that there is a significant change in the average once ACTRIAL starts. Lastly, comparing this plot to the same for article creations (see the February 22 work log), the average quality appears to be quite a bit higher. During the first two months of ACTRIAL, the average weighted sum of article creations labelled "OK" is 0.74, while for Draft AfC submissions it's 0.90. Understanding what causes this difference is not part of the current project, but could be part of a follow-up study.
We take two approaches to understanding whether the average weighted sum during ACTRIAL is different from what we would expect. First, we look at the first two months of ACTRIAL as a whole and compare them to similar time periods of 2014–2016. Secondly, we use forecasting models, because we saw in the 2014–2017 graph above that there appears to be a slight increase in the sum across time.
Looking at the first two months of ACTRIAL and comparing them to 2014–2016, we find only a slight amount of skewness in the distribution of the daily average, so a t-test should work well. This is also reflected in the summary statistics: pre-ACTRIAL mean 0.81, median 0.77; ACTRIAL mean 0.90, median 0.89. We find that the increase in mean during ACTRIAL is statistically significant (t-test: t=-2.74, df=237.74, p < 0.01; Mann-Whitney U test: W=4282, p < 0.01). This suggests that the quality of AfC submissions during ACTRIAL is higher than expected.
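The non-integer degrees of freedom above indicate Welch's unequal-variance t-test, which is what R's t.test runs by default. A minimal pure-Python sketch of that statistic (the sample data here is hypothetical):

```python
import math

def welch_t(x, y):
    """Welch's two-sample t statistic and its degrees of freedom."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    # Unbiased sample variances.
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    t = (mx - my) / math.sqrt(vx / nx + vy / ny)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (vx / nx + vy / ny) ** 2 / (
        (vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1)
    )
    return t, df

# Hypothetical daily averages for the pre-ACTRIAL and ACTRIAL groups.
t, df = welch_t([1, 2, 3, 4], [2, 3, 4, 5])
```

A statistical package would then compute the p-value from the t distribution with df degrees of freedom.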
Next, we switch to calculating the average weighted sum on a bimonthly basis and use that to build a forecasting model. Let's look at how this measurement has developed over time:
We can again see the increasing trend over time in the above graph, but there is also quite a lot of variation. Secondly, it is not clear that we are seeing an increase during ACTRIAL; instead, it might be as one would expect. Examining the time series, we find that it is non-stationary, and that it is unclear whether it has a seasonal component (e.g. a yearly cycle). Using R's auto.arima function, we find a candidate ARIMA(1,1,2) with drift model. We iterate through candidate models to see whether a seasonal model improves performance and whether simpler models perform as well. Seasonal models perform poorly on this time series, and we find that allowing for drift does not improve the model. Using the ARIMA(1,1,2) model to forecast the first two months of ACTRIAL leads to the following graph:
We can see that the average weighted sum during ACTRIAL is within the confidence interval of the forecast. Previously, we found that measuring this on a daily basis suggested a significant increase during ACTRIAL, but as the forecast takes longer-term trends into consideration, we give more weight to it and conclude that there does not appear to be a significant effect from ACTRIAL.
Summarizing our findings for H22, we find partial support for the hypothesis that the quality of articles entering the AfC queue is unchanged. We see an increase in the proportion of AfC submissions that are permanently deleted, but we do not see a change in the proportion of drafts labelled "OK" by the draft quality model, nor do we see a change in the average quality of drafts labelled "OK".