Research talk:Autoconfirmed article creation trial/Work log/2017-08-31

From Meta, a Wikimedia project coordination wiki

Thursday, August 31, 2017

Today I've been working on improving our patroller statistics and have gathered data on the time between article creation and review. I'll use the latter to understand how quickly reviews happen, and I'll also work on our dataset of user activity to learn about accounts that start out by creating articles.

How quickly are reviews done?

In order to understand this, I first gathered a dataset of article creations similar to the one we've used earlier as a preliminary dataset, but this time removing creations by users in the "bot", "sysop" and "autoreviewed" user groups. These three groups of users have their edits automatically patrolled, meaning that their article creations should not require reviewing.
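The filtering step can be sketched as follows. This is only an illustration, not the code used to build the dataset: the record layout (a `user_groups` list per creation) and the name `needs_review` are assumptions, and it glosses over whether group membership is taken at creation time or at query time.

```python
# User groups whose article creations are automatically patrolled.
AUTOPATROLLED_GROUPS = {"bot", "sysop", "autoreviewed"}


def needs_review(creation):
    """Return True if the creator is in none of the auto-patrolled groups."""
    return not (set(creation["user_groups"]) & AUTOPATROLLED_GROUPS)


# Hypothetical sample records; the real dataset has a different shape.
creations = [
    {"page_id": 1, "user_groups": []},
    {"page_id": 2, "user_groups": ["sysop"]},
    {"page_id": 3, "user_groups": ["autoreviewed", "extendedconfirmed"]},
    {"page_id": 4, "user_groups": ["extendedconfirmed"]},
]

to_review = [c for c in creations if needs_review(c)]
```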

I then processed the dataset by looking for the first logged review of each article that happens after the article was created. This has the caveat that an article might be deleted and recreated before a review happens. We might later want to look at how often that turns out to be an issue. For now we will disregard it, as we are more interested in the reviews that happen fairly quickly after an article is created, and we suspect that those are the vast majority of reviews.
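A minimal sketch of the "first review after creation" lookup, assuming each article comes with a creation timestamp and a list of logged review timestamps. The function name and input shape are hypothetical, and treating "after" as strictly later than the creation time is also an assumption.

```python
from bisect import bisect_right
from datetime import datetime


def first_review_after(created, review_times):
    """Return the earliest review timestamp strictly after `created`.

    Returns None if no review is logged after the creation time.
    """
    times = sorted(review_times)
    # bisect_right skips any entries equal to `created`, so the element
    # at the returned index (if any) is strictly later than `created`.
    idx = bisect_right(times, created)
    return times[idx] if idx < len(times) else None


created = datetime(2017, 1, 10, 12, 0)
reviews = [
    datetime(2016, 12, 1),         # review of an earlier incarnation, ignored
    datetime(2017, 1, 10, 13, 30),
    datetime(2017, 2, 1),
]
first = first_review_after(created, reviews)
```

Note that this sketch has the same caveat as the text: it does not detect a deletion and recreation that happens between the creation and the review it picks.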

The first question we investigate is: to what extent do articles get reviewed? We first look at whether we appear to have reliable data, and find that the data since the introduction of the PageTriage extension in October 2012 appears to be most reliable. Using the data from Oct 1, 2012 to Jul 1, 2017, we find that overall 19.1% of the articles did not get reviewed. We do not yet know whether these are all articles that would go on to be deleted without review, or to what extent they are part of the backlog.

We then measure the proportion of articles created on a given day that go on to get reviewed. The plot is as follows:

As we can see, this is generally in the 80–90% range. There is a big drop in 2017, which might be correlated with the influx of articles being moved from the Draft and User namespaces. As we've seen earlier (see the Aug 22 work log), the overall rate of article creation during the same time period has been stable.
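The per-day proportion could be computed along these lines. This is a sketch under the assumption that each article is reduced to a pair of (creation date, whether a review was found); the function name is hypothetical.

```python
from collections import defaultdict
from datetime import date


def daily_review_rate(records):
    """Map each creation date to the fraction of its articles that got reviewed.

    `records` is an iterable of (creation_date, was_reviewed) pairs.
    """
    totals = defaultdict(int)
    reviewed = defaultdict(int)
    for day, was_reviewed in records:
        totals[day] += 1
        if was_reviewed:
            reviewed[day] += 1
    return {day: reviewed[day] / totals[day] for day in sorted(totals)}


records = [
    (date(2017, 1, 1), True),
    (date(2017, 1, 1), False),
    (date(2017, 1, 2), True),
]
rates = daily_review_rate(records)
```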

The second question we are interested in is: what is the typical time between article creation and its subsequent review?

We know that 19.1% of articles don't get reviewed, and filter those out of the subsequent analysis because we are interested in knowing the time to review for articles that do get reviewed. First, we look at the distribution of time to review across the entire dataset:

The plot above suggests there's a total of three modes. First, a lot of articles get reviewed within a day, or at most a week. Then, there's another mode at one month. Could this be due to filtering in the new pages feed? We know that articles do not get automatically removed, so the one month time could be backlog work. Lastly we see some articles get reviewed somewhere between a month and a year. Note that there's also a fourth mode, that the article does not get reviewed, which is not part of the graph.

Since we are interested in understanding the nature of Time-Consuming Judgement Calls, it appears that one week is a reasonable cutoff for that. 78% of the articles that get reviewed are reviewed within a week.
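A share like this one can be computed as sketched below, assuming the time to review is available as a `timedelta` per article and unreviewed articles are represented as `None`. The function name, the inclusive cutoff, and the sample values are all illustrative assumptions.

```python
from datetime import timedelta


def share_within(deltas, cutoff=timedelta(days=7)):
    """Fraction of reviewed articles whose time to review is at most `cutoff`.

    Entries that are None (never reviewed) are excluded from the denominator.
    Returns None if no article was reviewed at all.
    """
    reviewed = [d for d in deltas if d is not None]
    if not reviewed:
        return None
    return sum(d <= cutoff for d in reviewed) / len(reviewed)


# Made-up sample: four reviewed articles and one that was never reviewed.
deltas = [
    timedelta(hours=1),
    timedelta(days=2),
    timedelta(days=7),
    timedelta(days=30),
    None,
]
share = share_within(deltas)
```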

Last question: How has the median time to review changed over time? We have previously seen how patroller participation has changed over time (ref the Aug 22 work log, again), and it might be that the reduction in number of active patrollers after the introduction of the NPR user right affected median time to review. We calculate the median time to review of all articles that got reviewed and plot it by day of article creation:

In this plot we see some indication of a monthly threshold in the data up to early 2014, since the median time fluctuates around one month. We can also see periods where the median time is somewhere between a day and a week. Since Q2 2014, the median time to review drops and stabilizes somewhere between an hour and a day, often being around an hour.

There is a drop in median time in the first half of 2016, which appears to coincide with a large amount of activity by a specific patroller. After they stop doing so much reviewing, the median time increases and again stabilizes slightly above an hour. Lastly, we see that median time appears to have increased since the beginning of 2017. The median of the daily medians since Jan 1, 2017 is 3 hours and 9 minutes (the mean is 6 hours and 25 minutes).
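The daily median and the summary of medians can be sketched as follows. The input shape (creation date plus time to review for reviewed articles only) and both function names are assumptions, and the sample values are made up rather than taken from the dataset.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean, median


def daily_median_seconds(records):
    """Median time to review, in seconds, per day of article creation.

    `records` is an iterable of (creation_date, time_to_review) pairs
    covering only articles that did get reviewed.
    """
    by_day = defaultdict(list)
    for day, delta in records:
        by_day[day].append(delta.total_seconds())
    return {day: median(values) for day, values in sorted(by_day.items())}


def summarize_since(daily_medians, start):
    """Return (median, mean) of the daily medians from `start` onwards."""
    recent = [m for day, m in daily_medians.items() if day >= start]
    return median(recent), mean(recent)


records = [
    (date(2016, 12, 31), timedelta(hours=100)),  # before the cutoff date
    (date(2017, 1, 1), timedelta(hours=1)),
    (date(2017, 1, 1), timedelta(hours=3)),
    (date(2017, 1, 1), timedelta(hours=5)),
    (date(2017, 1, 2), timedelta(hours=1)),
    (date(2017, 1, 2), timedelta(hours=2)),
    (date(2017, 1, 3), timedelta(hours=10)),
]
medians = daily_median_seconds(records)
med, avg = summarize_since(medians, date(2017, 1, 1))
```

As in the figures above, the mean of the daily medians can sit well above their median when a few days have very long review times.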