Research talk:Autoconfirmed article creation trial/Work log/2018-02-02
Friday, February 2, 2018
Today I'll continue the AfC analysis with H17, use our deletion data to get some answers for H14, start working on H22, and continue work on getting results posted.
Draft publication rate
Before I wrapped up work yesterday, I quickly revisited our finding from Wednesday where we looked at publication rate of drafts. In our analysis, we found that 1,550 (1.2%) of 126,957 pages created between July 1, 2014 and December 1, 2017 had been published. That seemed suspiciously low considering that the category of accepted AfC submissions contains almost 80,000 pages. While some of those are redirects, it's unreasonable to expect almost all of them to be.
Since I have page IDs of all the draft pages created, I figured I could do a spot check by counting the number of pages in the dataset that are currently live, meaning they have a matching page ID in the page table in the Main namespace:
SELECT count(*) FROM nettrom_drafts JOIN enwiki.page USING (page_id) WHERE creation_timestamp >= '2014-07-01 00:00:00' AND page_namespace=0;
The query found 964 pages as of Feb 1. That suggests our finding of 1,550 publications is reasonable, as one would expect some published drafts to not survive AfD/PROD (e.g. there have been issues with paid publications). I think it's worth pondering what the implications of this finding are for AfC and the Draft namespace.
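As a quick sanity check on the numbers above, here is a minimal Python sketch of the arithmetic. It assumes the 964 live pages are a subset of the 1,550 published drafts, which the query does not strictly guarantee since it has no upper bound on creation time. The variable and function names are mine, not from the analysis code.

```python
# Figures quoted in this log entry.
drafts_created = 126_957  # drafts created Jul 1, 2014 to Dec 1, 2017
published = 1_550         # drafts found to have been published
live_in_main = 964        # drafts with a matching Main-namespace page on Feb 1


def pct(part, whole):
    """Return part/whole as a percentage, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Share of all drafts that were published at some point.
publication_rate = pct(published, drafts_created)

# Share of published drafts still live (assumes live pages are a subset).
still_live_rate = pct(live_in_main, published)

print(publication_rate, still_live_rate)
```

Under that assumption, roughly six in ten published drafts are still in the Main namespace, which fits the expectation that some publications are later deleted.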
I've also worked on analyzing the AfC backlog (H17), but am not sure if our dataset enables us to estimate the backlog in a meaningful way. Based on reading some older AfC information, it appears our data on the number of submissions is not off by much. The dataset should enable us to measure the backlog, but it requires a slightly different approach than what I've tried so far. Will try again tomorrow, as I think I have an idea on how to do it.
H14: Autoconfirmed article deletions
I've gathered a dataset of non-autopatrolled, non-redirect article creations by users who had reached autoconfirmed status. This dataset is mentioned in our Jan 29 work log, where I discovered that we have an issue with being able to track deletions for older creations. Prior to early June 2014, page IDs do not correctly identify page deletions. While having older data would be useful, we can make good use of the 3.5 years of data that we do have.
H14 states that the survival rate of articles created by autoconfirmed users will remain stable. We will investigate this from a few different angles. First, let's look at survival rate:
The graph above shows the proportion of all creations that survived for at least 30 days, which corresponds with the definition of survival in H14. The trend line suggests that up until 2017, the proportion stayed fairly stable around 87%, although there are some days with large swings as well as some seasonal variations. Since the start of 2017, the proportion appears to fall, but that might largely be driven by the low proportions in July 2017.
Because the proportion is generally above 80%, plotting the inverse graph might be more useful. In other words, switching the definition to instead measure the proportion of articles deleted within 30 days. The plot then looks like this:
Since we're now only plotting the lower end of the scale, the variation appears larger. The trend line also suggests that the proportion might have moved from around 10% up to around 15%, and in Q2 2017 it moves up towards 20%. Based on the trends in the graph, we might find that the proportion has increased during ACTRIAL compared to similar time periods of previous years. However, the increasing trend in 2017 means that the increase might not have been caused by ACTRIAL. We'll investigate that with a forecast model later. Note also that the overall deletion rate is on par with an earlier analysis done prior to ACTRIAL starting, suggesting that our data is sound.
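For reference, the two measures plotted above (the proportion surviving at least 30 days, and its complement) could be computed per creation day along these lines. This is a sketch only: the record layout of (creation time, deletion time or None) and the function name are assumptions for illustration, not the actual schema of our dataset.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SURVIVAL_WINDOW = timedelta(days=30)  # H14's definition of survival


def daily_deletion_proportion(creations):
    """Map each creation date to the proportion deleted within 30 days.

    creations: iterable of (created, deleted) datetimes, where deleted
    is None for pages that have not been deleted (hypothetical layout).
    """
    created_per_day = defaultdict(int)
    deleted_per_day = defaultdict(int)
    for created, deleted in creations:
        day = created.date()
        created_per_day[day] += 1
        if deleted is not None and deleted - created < SURVIVAL_WINDOW:
            deleted_per_day[day] += 1
    return {
        day: deleted_per_day[day] / n
        for day, n in sorted(created_per_day.items())
    }


# Made-up example records, not real data.
example = [
    (datetime(2017, 7, 1, 10), datetime(2017, 7, 3, 9)),  # deleted quickly
    (datetime(2017, 7, 1, 12), None),                     # still live
    (datetime(2017, 7, 1, 15), datetime(2017, 9, 1)),     # deleted after 30 days
    (datetime(2017, 7, 1, 18), None),
    (datetime(2017, 7, 2, 8), datetime(2017, 7, 2, 20)),
]
rates = daily_deletion_proportion(example)
survival = {day: 1.0 - rate for day, rate in rates.items()}
```

The survival proportion is simply one minus the deletion proportion, which is why the second plot shows the same data on a different part of the scale.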
Next, we'll take a look at how deletion rates differ depending on the age of the account. The aggregate data might hide big differences, as some accounts are young and some are very old. To begin with, let's split the data by over/under 30 days since registration:
The graph above suggests that there are clear differences between younger and older accounts, and that we should investigate that more. Our initial split of over/under 30 days is crude, so we instead choose to use a bucketing scheme based on account age relative to the autoconfirmed cutoff of four days. To focus on the younger accounts, we apply a logarithmic scheme based on powers of two. We then calculate the overall proportion of articles deleted within 30 days, resulting in the table below.
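To make the bucketing concrete, here is a minimal Python sketch of one way such a scheme could work. The exact boundaries (4, 8, 16, 32, ... days, each bucket twice as wide as the previous one) are my assumption for illustration, as are the function names and the record layout.

```python
AUTOCONFIRMED_DAYS = 4  # autoconfirmed cutoff mentioned above


def age_bucket(account_age_days):
    """Return the (lower, upper) bounds in days of the account's bucket.

    Buckets double in width: [4, 8), [8, 16), [16, 32), ...
    (assumed boundaries, not necessarily those used in the table).
    """
    if account_age_days < AUTOCONFIRMED_DAYS:
        raise ValueError("account is younger than the autoconfirmed cutoff")
    lower = AUTOCONFIRMED_DAYS
    while account_age_days >= lower * 2:
        lower *= 2
    return lower, lower * 2


def deletion_table(records):
    """Aggregate (account_age_days, deleted_within_30_days) records.

    Returns rows of (bucket, n_creations, n_deletions, percent_deleted).
    """
    counts = {}
    for age, deleted in records:
        bucket = age_bucket(age)
        n, d = counts.get(bucket, (0, 0))
        counts[bucket] = (n + 1, d + int(deleted))
    return [
        (bucket, n, d, round(100.0 * d / n, 1))
        for bucket, (n, d) in sorted(counts.items())
    ]


# Made-up example records, not real data.
records = [(4.5, True), (6, False), (9, True), (40, False), (3000, False)]
table = deletion_table(records)
```

Doubling the bucket width gives fine resolution for accounts just past the cutoff while still covering accounts that are many years old in a handful of rows.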
|Account age||N article creations||N deletions in 30 days||Proportion deleted (%)|