Research talk:Autoconfirmed article creation trial/Work log/2017-08-22

From Meta, a Wikimedia project coordination wiki

Tuesday, August 22, 2017

Today I'll first work on setting up ReportUpdater to build a dataset of article creations based on the query I created yesterday.

ReportUpdater

I got ReportUpdater set up to gather statistics on page creations, but also used it to generate datasets for H9 and H10, the number of patrol actions per day, and the number of active patrollers. I'll continue working with the Analytics team to finalize our configuration for the measures that we're interested in.

Patrol actions and active patrollers

Note: This log was updated on 2017-08-27 after discovering that our article creation datasets only included surviving articles. We gathered a new dataset and updated two graphs: the article creation graph, and the graph of the proportion of patroller actions to article creations. The trends in article creation changed, while the conclusions about the relationship between patroller actions and article creations remained unchanged.

H9 and H10 concern the number of patrol actions performed and the number of active patrollers, and we've gathered daily data on both. We also have data on the number of created articles, as well as on how many drafts are published through moves from the User and Draft (also Wikipedia talk) namespaces. This allows us to do a historical analysis of H9 and H10, as well as the related measures for both.

The plot above shows the number of patrol actions per day, starting in October 2012, the first full month for which data from the PageTriage extension is available. As we can see, the number fluctuates a lot, but there are also sustained periods of higher or lower activity. There are distinct peaks here and there, which suggest that work to reduce the backlog happens in bursts. However, we'll understand more about that when we compare it to the number of created articles later. Let's have a look at the number of active patrollers first:

The number of active patrollers also fluctuates a lot, but the overall trend is positive, from around 25 per day in 2013 to around 40 in the second quarter of 2017. We do not yet know whether most of these patrollers do only a small number of review actions, but we will have data on that later, as it is covered by H11. Let's instead look at the number of articles created per day:

The plot above shows the number of articles created per day, split into those created "directly" in the main namespace and those that are published drafts moved from either the User or Draft (and Wikipedia talk) namespaces. There are several things to note in this plot:

  • Generally, the number of articles created directly in the main namespace is fairly stable. There are a couple of plateaus, one from 2012 to mid-2014, and one from mid-2014 onwards. The former averages around 975 articles per day, the latter averages around 1,100 articles per day.
  • The number fluctuates quite a bit, and there are days where it peaks well above the general trend.
  • The number of drafts that are published each day is generally stable, but we see a big increase in moves from the Draft namespace in 2017. I'll look into whether that is a backlog drive or for instance related to the Education Program. The increase comes from efforts at AfC to work through old drafts, confirmed by Legacypac in en:Special:Diff/797420077.

Let's have a closer look at the publication of drafts from the other two namespaces, since the scale of those is much smaller than articles created directly:

On this plot it is much easier to see that the number of published drafts stays roughly the same. The daily average of moves into main from the User namespace is almost 21, whereas from the Draft namespace it's slightly above 25. The plot also makes it a lot easier to see the large increase in moves from the Draft namespace in 2017.

So, how do the number of patrol actions and the number of active patrollers relate to the number of created articles? We sum up the number of direct creations and moves into main for each day, and then calculate two ratios:

  1. The proportion of patrol actions to number of articles. If this is above 100% it should indicate that the patrollers are keeping up with the influx of articles, and if it is below 100% the backlog increases.
  2. The ratio of new articles to active patrollers. This gives us the average number of reviews an active patroller has to do to balance out the influx of new articles.
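As a minimal sketch of the two ratios described above, assuming one record of counts per day: the class and field names are hypothetical and are not taken from the actual ReportUpdater datasets, and the example counts are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class DailyCounts:
    """One day of counts; all field names are hypothetical."""
    direct_creations: int   # articles created directly in the main namespace
    moves_into_main: int    # drafts moved in from User, Draft, or Wikipedia talk
    patrol_actions: int     # patrol actions logged that day
    active_patrollers: int  # distinct patrollers active that day


def daily_ratios(day):
    """Return (patrol actions as % of new articles, new articles per active patroller)."""
    new_articles = day.direct_creations + day.moves_into_main
    # Guard against division by zero on days with no articles or no patrollers.
    patrol_pct = 100.0 * day.patrol_actions / new_articles if new_articles else None
    per_patroller = new_articles / day.active_patrollers if day.active_patrollers else None
    return patrol_pct, per_patroller


# Made-up counts for a single day, not taken from the datasets.
example = DailyCounts(direct_creations=1000, moves_into_main=50,
                      patrol_actions=840, active_patrollers=35)
pct, per_patroller = daily_ratios(example)
print(f"{pct:.1f}% patrolled, {per_patroller:.1f} articles per patroller")
# prints "80.0% patrolled, 30.0 articles per patroller"
```

In this made-up example the patrol proportion is below 100%, so by the reasoning above the backlog would grow that day, and each active patroller would need to review 30 articles to balance the influx.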

First, the ratio of patrol actions to new article creations:

There are again several trends to note here. Perhaps most important is that the proportion is well below 100% across several long periods of time. Because this is all post-PageTriage, there is no expiry on reviews, so the backlog grows whenever the proportion is below 100%.
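To illustrate why a proportion below 100% means a growing backlog, here is a rough sketch that accumulates the daily difference between new articles and patrol actions. It assumes that every patrol action clears exactly one article from the backlog, which this log does not confirm; the function name and the counts are made up for illustration.

```python
from itertools import accumulate


def backlog_change(new_articles, patrol_actions):
    """Running total of new articles minus patrol actions, day by day.

    Assumes (hypothetically) that one patrol action clears one article.
    """
    daily_net = [a - p for a, p in zip(new_articles, patrol_actions)]
    return list(accumulate(daily_net))


# Made-up daily counts: two days below 100%, one above, one exactly at 100%.
new_articles = [1000, 1100, 900, 1000]
patrol_actions = [800, 1300, 900, 700]
print(backlog_change(new_articles, patrol_actions))
# prints [200, 0, 0, 300]
```

Under that assumption, a burst of patrolling (the second day here) can cancel out an earlier shortfall, but any sustained period below 100% adds to the running total.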

Secondly, we can see some indications of seasonality. For example, there's a strong dip in the proportion during the summer of 2016, while there does not appear to be a strong increase in the number of articles created at the same time.

Lastly, the peaks in the plot show that the patrollers do heavy lifting in certain periods. As we discussed for the number of active patrollers, it will be interesting to see how (un)evenly the work is distributed, both in general and particularly during those peaks.

Next, we plot the ratio of new article creations to active patrollers:

We previously saw that the number of active patrollers per day has slowly but surely increased over time, and we've also seen that the number of articles created per day is slowly decreasing. It is therefore expected that the ratio of new articles to active patrollers is also decreasing over time, as this plot clearly shows. In the second quarter of 2017 we see it dip below 25, meaning that each patroller would need to review fewer than 25 articles per day on average to keep up with demand. However, we also see peaks around 50 earlier in the same year, which again indicates that it might be worth asking what the "right" number of active patrollers should be. It might be much higher than it currently is.