Research talk:Autoconfirmed article creation trial/Work log/2018-02-14

From Meta, a Wikimedia project coordination wiki

Wednesday, February 14, 2018

Today I'll continue work on H15, H17, and H22, after I did some initial work on those yesterday.

H15: The rate of article growth will be reduced.

Determining the rate of article growth requires a few definitions. First, what do we count as an "article"? Second, we know that some articles get deleted quickly after creation; should we count those? Third, do we count only pages created directly in the Main namespace, or should we also count pages that are moved from other namespaces?

When it comes to what constitutes an article, there are various definitions. The count of "content pages" on Special:Statistics appears to count all pages in the Main namespace that are not redirects. When discussing Wikipedia's size, definitions of an article might take page size and number of links to other pages into consideration, limiting it to pages above a certain size threshold and with at least one link to another page. Are disambiguation pages and lists "encyclopedic articles"? One challenge with filtering out lists and disambiguation pages is that they cannot all be identified through their title, but instead are members of certain categories. Determining historical category memberships requires parsing page and template histories, which is a task that is outside the scope of this project.

A lot of articles get created but then deleted, some of them shortly after creation. That is partly a motivation for ACTRIAL. When we seek to measure article growth, it would be useful to filter out articles that are determined to not fit the encyclopedia. In order to do so, we would define a time period after creation during which an article is in an "undefined" state. If it is not deleted during this time period, we will count it as an article. Adding this filter requires the data to be "right-truncated", meaning that article creations done on a given day are withheld until the time period has passed, a process known as "censoring".

In addition to creating articles directly in the Main namespace, pages can also be moved from other namespaces. The two most likely namespaces are User and Draft, which are commonly used to develop articles before publishing them (either directly or through AfC). If we count moves, we would also need to decide whether to treat all of them as articles, or to apply the same notion of an "undefined" state as discussed in the previous paragraph.

In this analysis, we will define an "article" as any page in the Main namespace that is not a redirect. We could limit the notion of an "article" to pages with certain characteristics (e.g. at least some amount of content), but doing so can quickly result in the need for parsing page content (see the discussion of lists and disambiguation pages above). We should expect that articles develop over time, and that articles that do not develop will be deleted. By requiring articles to survive for some amount of time, we can identify those articles that appear to have made a positive contribution to the growth of Wikipedia. Per H14, we refer to Schneider et al. and define "survival" as an article not being deleted for at least 30 days. Lastly, we will account for moves from the User and Draft namespaces, and require moved articles to survive for at least 30 days after the move for them to contribute to the article growth.
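The definition above can be sketched in code. This is a minimal illustration, not the code used for the analysis: the function name `daily_growth`, the `(entered_main, deleted)` event format, the `data_end` cutoff, and the sample dates are all assumptions made for the example. A page counts towards a day's growth if it entered the Main namespace on that day (by creation or by a move) and was not deleted within the 30-day window; pages that entered Main too recently for the window to have passed are withheld.

```python
from collections import Counter
from datetime import date, timedelta

# Survival window from the text: 30 days after creation or move.
SURVIVAL_WINDOW = timedelta(days=30)


def daily_growth(events, data_end, window=SURVIVAL_WINDOW):
    """Count, per day, the pages that entered Main and survived the window.

    `events` is an iterable of (entered_main, deleted) pairs (hypothetical
    format): `entered_main` is the date the page was created in Main or
    moved into it, `deleted` is its deletion date or None.
    `data_end` is the last date for which deletion data is available.
    """
    growth = Counter()
    for entered_main, deleted in events:
        # The window has not fully passed yet, so the page is still in
        # the "undefined" state and is withheld.
        if entered_main + window > data_end:
            continue
        if deleted is None or deleted - entered_main >= window:
            growth[entered_main] += 1
    return growth


# Made-up sample data, for illustration only.
sample_events = [
    (date(2017, 9, 1), None),               # never deleted: counts
    (date(2017, 9, 1), date(2017, 9, 3)),   # deleted after 2 days: dropped
    (date(2017, 9, 2), date(2017, 11, 1)),  # deleted after 60 days: counts
    (date(2017, 11, 10), None),             # too recent to judge: withheld
]
growth = daily_growth(sample_events, data_end=date(2017, 11, 15))
```

Treating creations and moves as the same kind of event keeps the survival rule identical for both, which matches the requirement that moved pages survive 30 days after the move.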

We restrict our dataset so that it starts on July 1, 2014, because our analysis of survival rate for autoconfirmed article creations (H14) indicated that we have reliable deletion data from that point on. Calculating growth by day from that date and through the first two months of ACTRIAL gives the following graph:

The graph above suggests that article growth can vary consistently with the seasons. Growth appears to be larger during the first half of the year. There also tends to be an increase during the fall, although the pattern is not very clear. We can also see what might look like a reduction in growth during the summer of 2017, and a further reduction during ACTRIAL.

In order to determine whether the reduction in growth during ACTRIAL is significant, we build forecasting models. As we have done for many other hypotheses, we switch to calculating growth on a bimonthly basis (two periods per month). That gives us a clean split for when ACTRIAL starts, and it reduces the challenge of predicting growth on a daily basis. The graph for bimonthly growth looks as follows:
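The switch from daily to bimonthly counts can be sketched as follows. This assumes that "bimonthly" means two periods per month, which the seasonal period of 24 in the model discussed further down would suggest. The function names and the split day are assumptions for the example; the work log does not state on which day of the month the split falls.

```python
from collections import defaultdict
from datetime import date


def half_month_period(day, split_day=15):
    """Map a date to (year, month, half).

    `split_day` is an assumption: days before it fall in the first half
    of the month, the remaining days in the second half.
    """
    half = 1 if day.day < split_day else 2
    return (day.year, day.month, half)


def aggregate_by_half_month(daily_counts, split_day=15):
    """Sum daily growth counts into half-month periods, in time order."""
    totals = defaultdict(int)
    for day, count in daily_counts.items():
        totals[half_month_period(day, split_day)] += count
    return dict(sorted(totals.items()))


# Made-up daily counts, for illustration only.
daily = {
    date(2017, 9, 1): 5,
    date(2017, 9, 14): 7,
    date(2017, 9, 15): 3,
    date(2017, 9, 30): 2,
}
bimonthly = aggregate_by_half_month(daily)
```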

The graph displays the same growth trends that we also discussed for the daily graph, e.g. that growth is higher during the winter months. We can also see that the trend of a decrease in the summer and fall of 2017 is more pronounced.

To build a forecasting model for this data, we first investigate the seasonality and stationarity of the time series. We find that it is non-stationary, but that it is not clearly seasonal. The ACF and PACF suggest that a seasonal component is present, but it is not strong. Because of this, we should test models both with and without the seasonal component.
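As an illustration of the quantities involved, the sketch below computes differences and the sample autocorrelation function (ACF) in plain Python. It is not the analysis code, which was done in R; the function names and the toy series are assumptions. Differencing at lag 1 removes a trend, and differencing at the seasonal lag (24 for two periods per month) removes a repeating yearly pattern.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both exist."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag.

    Assumes the series is not constant, since a constant series has
    zero variance and the ratio is undefined.
    """
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[i] * centered[i - k] for i in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


# Toy series, for illustration only.
trend = [1, 2, 3, 4, 5]                   # pure linear trend
seasonal = [i % 4 for i in range(16)]     # repeating pattern, period 4

trend_diff = difference(trend)            # first difference removes the trend
seasonal_diff = difference(seasonal, 4)   # seasonal difference removes the pattern
trend_acf = acf(trend, 1)
```

A slowly decaying ACF hints at non-stationarity, while spikes at multiples of the seasonal lag hint at a seasonal component.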

We first use R's auto.arima function to find a candidate model, and it suggests an ARIMA(0,1,3) model. This model is consistent with our analysis of the time series' ACF and PACF. During testing of seasonal models, we find that they provide a much better fit for our data. Examining the results of several candidate models, we find that an ARIMA(3,1,1)(0,1,1)[24] model provides a good fit, and that the ACF and PACF graphs of its residuals do not indicate clear patterns. The Augmented Dickey-Fuller test also suggests the residuals are uncorrelated.
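For reference, an ARIMA(3,1,1)(0,1,1)[24] model can be written in backshift notation as shown here. The symbols are notation introduced for this illustration: y_t is the bimonthly growth, B is the backshift operator (B y_t = y_{t-1}), the phi terms are the three autoregressive coefficients, theta_1 and Theta_1 are the non-seasonal and seasonal moving-average coefficients, and epsilon_t is the error term. The plus signs on the moving-average terms follow the convention R uses. The work log does not report the fitted coefficient values.

```latex
\[
  \left(1 - \phi_1 B - \phi_2 B^2 - \phi_3 B^3\right)
  \left(1 - B\right)\left(1 - B^{24}\right) y_t
  =
  \left(1 + \theta_1 B\right)\left(1 + \Theta_1 B^{24}\right) \varepsilon_t
\]
```

The factor (1 - B) is the single non-seasonal difference and (1 - B^{24}) is the single seasonal difference over 24 periods.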

Using this model to predict the growth for the first two months of ACTRIAL gives the following result:

We can see in the graph that the true growth falls within the forecast's confidence intervals. While the true growth is below the forecast, the two are fairly close to each other. As discussed previously, there is a reduction in article growth in 2017 prior to ACTRIAL starting, which means that the further reduction during ACTRIAL is within what we would expect. Further analysis could look into which factors caused this reduction earlier in 2017. It might also be worth looking into the extent to which article growth is driven by contributors with autopatrol rights, for example to see whether the reduction in growth during ACTRIAL is more significant for non-autopatrolled creators.

In short, we find that H15 is not supported: there does not appear to be a significant reduction in article growth during the first two months of ACTRIAL.

Survival time window

We also did an analysis of how extending the survival time window affects article growth. Our analysis indicates that a window beyond 30 days does not provide much additional information as the difference in number of articles that survive is small.

As mentioned above, we refer to Schneider et al.'s research on AfC when it comes to defining an article as surviving once it has not been deleted within 30 days. During the discussion of preliminary data on patrolling, the concern was raised that 30 days might not be long enough. With data on article creations, moves, and deletions, we could quite easily check if extending the window to 60 or 90 days would significantly alter the growth rate. In other words, we would learn to what extent articles get deleted beyond the 30-day window.

We calculated the growth for three different windows: 30, 60, and 90 days. The difference between these is very small. Comparing a 30-day window to a 90-day window gives the following summary statistics for the daily difference:

             Minimum  1st quartile  Median  Mean  3rd quartile  Maximum
Counts           0.0           6.0     9.0   9.6          12.0     77.0
Proportions     0.0%          1.2%    1.7%  1.9%          2.3%    12.7%

We can see that in both raw counts and proportions, the third quartile is low. On 75% of the days, the difference is 12 articles or fewer, and in proportions it is 2.3% or less. Out of 1,233 days in the dataset, only three have a difference above 10%, and only 25 have a difference above 5%. This lack of a large difference can also be seen in the plot below. In order to smooth out the graph a bit, we use seven-day moving medians.
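The comparison above can be sketched with Python's standard library. This is an illustration under stated assumptions, not the original analysis: the function names are hypothetical, the proportion is assumed to be the difference divided by the 30-day count, the quartiles use the "inclusive" method, and the moving median uses a trailing window (the original may have used a centered one). All numbers in the example are made up.

```python
from statistics import mean, median, quantiles


def window_differences(counts_30, counts_90):
    """Per-day difference between 30- and 90-day survival counts.

    Returns (differences, proportions); the proportion is taken relative
    to the 30-day count (an assumption) and is 0.0 when that count is 0.
    """
    diffs = [c30 - c90 for c30, c90 in zip(counts_30, counts_90)]
    props = [d / c30 if c30 else 0.0 for d, c30 in zip(diffs, counts_30)]
    return diffs, props


def summarize(values):
    """Six-number summary: min, quartiles, median, mean, max."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values), "q1": q1, "median": q2,
        "mean": mean(values), "q3": q3, "max": max(values),
    }


def moving_median(values, width=7):
    """Trailing moving median over `width` consecutive values."""
    return [
        median(values[i - width + 1 : i + 1])
        for i in range(width - 1, len(values))
    ]


# Made-up numbers, for illustration only.
diffs, props = window_differences([500, 400], [490, 400])
summary = summarize([1, 2, 3, 4, 5])
smoothed = moving_median([1, 9, 2, 8, 3, 7, 4, 100, 5])
```

A moving median is less sensitive than a moving mean to single-day spikes, such as the made-up value 100 in the example.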

In summary, our data suggests that there is little additional information to be gained by extending the 30-day window, as few articles get deleted beyond that window.