Research talk:Autoconfirmed article creation trial/Work log/2018-01-19
Friday, January 19, 2018
Today I'll continue the analysis of deletions in Main and User namespaces, formalize my data gathering of quality predictions in Draft and AfC, and write the Python code to gather Draft/AfC data.
Improving the data gathering
The initial analysis of deletions in the Draft namespace (see Wednesday's work log) was fairly straightforward, mainly because the number of reasons for deletion in that namespace is small. While working on deletions in the Main namespace, I realized that there was room for improvement in our data gathering and analysis. The two areas of concern were how we handle redirects, and whether our regular expression for capturing references to deletion criteria misses some of them.
A spot check of some log comments from early 2009 suggested that changing our regular expression from matching only at the beginning of the comment to matching anywhere within the comment would capture more references. We also found some usage of "WP:Criteria for speedy deletion#" and not just "WP:CSD#". I therefore altered the regular expression for CSD so that it allows both variants, and at the same time allows both the "WP:" and "Wikipedia:" prefixes. Note that we expect most deletions to be done through tools that leave standardized comments, meaning that our approach will capture the vast majority of these references. Diving further into how policy is referenced in deletion comments is outside the scope of this project.
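To make the change concrete, here is a minimal sketch of what such an unanchored pattern could look like. The pattern, the name `CSD_RE`, and the helper `find_csd_criteria` are assumptions for illustration and not the regular expression actually used in the project. Allowing underscores in place of spaces and matching case-insensitively are also assumptions.

```python
import re

# Hypothetical pattern (not the project's actual regex). It accepts both
# the "WP:" and "Wikipedia:" prefixes and both the short "CSD#" and long
# "Criteria for speedy deletion#" forms, and captures the criterion code
# (one letter followed by one or two digits, e.g. "A7" or "G11").
CSD_RE = re.compile(
    r"(?:WP|Wikipedia):"
    r"(?:CSD|Criteria[ _]for[ _]speedy[ _]deletion)"
    r"#([A-Z]\d{1,2})",
    re.IGNORECASE,
)


def find_csd_criteria(comment):
    """Return every CSD code referenced anywhere in a log comment.

    findall() scans the whole string, so a reference does not have to
    appear at the beginning of the comment to be captured.
    """
    return [code.upper() for code in CSD_RE.findall(comment)]


if __name__ == "__main__":
    # Made-up example comments in the style of deletion log entries.
    print(find_csd_criteria("[[WP:CSD#G11|G11]]: Unambiguous advertising"))
    print(find_csd_criteria(
        "Speedy deleted per [[Wikipedia:Criteria for speedy deletion#A7]]"))
```

Comments that do not reference a criterion in one of these forms return an empty list and would fall into the "Other" category.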
In my initial analysis of Main namespace deletions yesterday, I noticed some spikes. These can come from deletion of redirects. It is not straightforward to identify whether a deleted page was a redirect (there is no boolean "is_redirect" flag in the archive table like there is in the page table). We can instead pick up reasons for deletion that refer to redirects and filter them out that way. Inspecting the criteria for speedy deletion we can see that R2 and R3 refer directly to redirects, and X1 does as well. Lastly, G6 and G8 also often refer to redirects (e.g. a redirect pointing to a deleted page, or a disambiguation page being deleted). I therefore propose that we capture these and remove them from consideration, as the other reasons are more likely to refer to deletion of articles.
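The proposed filter can be sketched as follows. The record layout (a title paired with a list of criterion codes) and the function names are hypothetical. Dropping a deletion as soon as any one of its criteria is redirect-related is an assumption about how to treat comments that cite several criteria.

```python
# Criteria that directly or frequently concern redirects, as listed above.
REDIRECT_RELATED = {"R2", "R3", "X1", "G6", "G8"}


def is_redirect_related(criteria):
    """True if any of the given criterion codes is in the redirect set."""
    return any(code.upper() in REDIRECT_RELATED for code in criteria)


def drop_redirect_deletions(deletions):
    """Keep only deletions whose criteria are not redirect-related.

    `deletions` is an iterable of (title, criteria) pairs, where
    `criteria` is a list of CSD codes found in the log comment
    (hypothetical layout, for illustration only).
    """
    return [d for d in deletions if not is_redirect_related(d[1])]


if __name__ == "__main__":
    # Made-up records, not real log data.
    sample = [("Foo", ["A7"]), ("Bar", ["R3"]), ("Baz", ["G8", "A7"])]
    print(drop_redirect_deletions(sample))
```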
Updated draft deletion analysis
Removing G6 and G8, we get the following sorted list of reasons and usage:
|Category||Reason||Number of deletions||%|
|G13||Abandoned draft or AfC||65,007||43.3|
|Other||Not matching another category||47,420||31.6|
|G11||Unambiguous advertisement or promotion||14,086||9.4|
|G12||Unambiguous copyright infringement||6,675||4.4|
|G7||Author requests deletion||5,366||3.6|
|G3||Pure vandalism and blatant hoaxes||3,527||2.4|
|G5||Creations by banned or blocked users||1,605||1.1|
|AfD||Articles for Deletion||1,074||0.7|
|G4||Recreation of a deleted page||259||0.2|
As before, I combine G9, G4, and G1 into "other" and keep the other categories. That gives me a total of 11 categories, and I can plot these as before:
The two graphs of the total number of Draft deletions, over time and from Jan 1, 2017, were also updated:
Using the new dataset, a comparison of the first two and a half months of ACTRIAL against the same time period in 2015 and 2016 continues to find a significant increase (a median of 86.5 for 2015 and 2016 versus 113 in 2017).
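A median comparison of this kind can be sketched with the standard library. The sketch assumes the medians are taken over daily deletion counts, which the log does not state explicitly. The function name, the input layout, and the numbers in the example are invented for illustration.

```python
from datetime import date, timedelta
from statistics import median


def daily_counts_in_window(counts, start, end):
    """Return the daily counts from `start` up to (not including) `end`.

    `counts` maps a date to the number of deletions on that date
    (hypothetical layout). Days with no entry are counted as zero, so
    that quiet days are not silently left out of the median.
    """
    n_days = (end - start).days
    return [counts.get(start + timedelta(days=i), 0) for i in range(n_days)]


if __name__ == "__main__":
    # Made-up counts for a four-day window, not real data.
    counts = {
        date(2017, 9, 14): 3,
        date(2017, 9, 15): 5,
        date(2017, 9, 17): 1,
    }
    window = daily_counts_in_window(counts, date(2017, 9, 14), date(2017, 9, 18))
    print(window, median(window))
```

Pooling two earlier years would amount to concatenating their window lists before calling `median`. Whether the difference is significant would still need a separate statistical test, which is not shown here.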
We make a similar update of the graph of total deletions in the Main namespace as well:
Secondly, we update the table of usage:
|Category||Reason||Number of deletions||%|
|A7||No indication of importance||525,329||29.9|
|G11||Unambiguous advertisement or promotion||189,306||10.8|
|Other||All other reasons||175,028||10.0|
|AfD||Articles for Deletion||171,028||9.7|
|G3||Pure vandalism and blatant hoaxes||102,057||5.8|
|G12||Unambiguous copyright infringement||78,966||4.5|
|G7||Author requests deletion||66,256||3.8|
|G5||Creations by banned or blocked users||40,510||2.3|
|A10||Duplicates existing topic||25,946||1.5|
|G4||Recreated deleted page||16,594||0.9|
|A9||No indication of importance (music)||7,855||0.4|
We split all 22 categories up into two groups of eleven and make a plot of the activity in each category from January 1, 2017 as that allows us to focus on how ACTRIAL affected deletions (a longer history plot is forthcoming). First, the eleven most common reasons:
Then the least common reasons (note that G9 and G13 are not shown due to their low usage):
Based on the graph of the most common speedy deletion criteria, it appears that several have decreased noticeably during ACTRIAL: A7 (no indication of importance), G3 (pure vandalism and blatant hoaxes), G12 (unambiguous copyright infringement), A3 (no content), A1 (no context), and G10 (attack pages). For the least common criteria, we also see some indications of lower usage: G2 (test pages), A10 (duplicates existing topic), G1 (patent nonsense), and A11 (obviously invented).
To understand how the rates and proportions have changed, we compare ACTRIAL against the same period of time in each of the five preceding years. We choose five years because our earlier graph of page creations in the Main namespace shows that the creation rate is fairly stable across those five years. Going further back in time, the page creation rates appear to be higher. We use the first month and a half of ACTRIAL, as that is the same period we use for our survival analysis.
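The comparison periods can be generated by shifting the ACTRIAL window back one year at a time. In the sketch below, the start date of September 14, 2017 and the window length of 45 days are assumptions for illustration, since the log does not state either value. The function name is hypothetical.

```python
from datetime import date, timedelta


def comparison_windows(start, days, years_back=5):
    """Return (start, end) date pairs for the same calendar window in
    each of the `years_back` years preceding `start`.

    The end date is exclusive. Note that date.replace() raises
    ValueError for a February 29 start date in a non-leap year; that
    case does not arise for the dates assumed here.
    """
    windows = []
    for k in range(1, years_back + 1):
        earlier_start = start.replace(year=start.year - k)
        windows.append((earlier_start, earlier_start + timedelta(days=days)))
    return windows


if __name__ == "__main__":
    # Assumed values: ACTRIAL start and "a month and a half" as 45 days.
    for win_start, win_end in comparison_windows(date(2017, 9, 14), 45):
        print(win_start, win_end)
```

Keeping the calendar dates fixed rather than aligning on weekdays is a design choice. It keeps seasonal effects comparable, at the cost of the windows starting on different days of the week.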