Research talk:Autoconfirmed article creation trial/Work log/2018-01-29

From Meta, a Wikimedia project coordination wiki

Monday, January 29, 2018

Today I plan to focus on H14, the survival rate of articles created by autoconfirmed users, as well as the AfC-related hypotheses we have (H16, H17, and H22).

H14: Autoconfirmed article creation survival rate

I gathered a dataset of non-autopatrolled, non-redirect page creations in the Main namespace from the Data Lake, and exported the subset containing only creations made by autoconfirmed users (>= 10 edits, account >= 4 days old). This subset covered creations from Jan 1, 2009 through July 2017. I combined it with a similar dataset from the page creation table in the log database to extend my data through the end of 2017. Lastly, I wrote a Python script that goes through this data and determines whether each page has been deleted.
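The filtering and deletion check described above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the `Creation` record layout, the field names, and the `survival_rate` helper are all hypothetical, and "survived" here simply means the page ID does not appear in a set of deleted page IDs, whereas the H14 measure may well use a time window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Thresholds from the autoconfirmed definition: >= 10 edits, account >= 4 days old.
MIN_EDITS = 10
MIN_ACCOUNT_AGE = timedelta(days=4)


@dataclass(frozen=True)
class Creation:
    """One page creation event (hypothetical record layout)."""
    page_id: int
    created: datetime              # when the page was created
    creator_registered: datetime   # when the creator's account was registered
    creator_edit_count: int        # creator's edit count at creation time


def is_autoconfirmed(c: Creation) -> bool:
    """True if the creator met both thresholds when the page was created."""
    account_age = c.created - c.creator_registered
    return c.creator_edit_count >= MIN_EDITS and account_age >= MIN_ACCOUNT_AGE


def survival_rate(creations, deleted_page_ids):
    """Share of autoconfirmed creations whose page ID is not in the deleted set.

    Returns None when there are no autoconfirmed creations to evaluate.
    """
    eligible = [c for c in creations if is_autoconfirmed(c)]
    if not eligible:
        return None
    survived = sum(1 for c in eligible if c.page_id not in deleted_page_ids)
    return survived / len(eligible)
```

Keeping the deleted page IDs in a set makes each lookup constant time, which matters when the creation dataset spans several years.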

In this case, I relied on page IDs as the authoritative identifier. This turns out to be problematic, because the logging table does not appear to correctly store the page ID of a deleted page for deletions logged prior to early June 2014. I'll therefore have to investigate alternative approaches to getting older deletions, or otherwise limit the dataset. I should know more tomorrow and will do some follow-up analysis then.
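One way to gauge the extent of the page ID problem, and to apply the fallback of limiting the dataset, might look like the sketch below. It assumes deletion log rows are available as `(timestamp, page_id)` pairs and that a missing page ID shows up as `None` or `0`. Both are assumptions about how the gap manifests, and the function names are hypothetical. The cutoff is left as a parameter because the exact date at which the data becomes reliable still has to be established.

```python
from collections import Counter
from datetime import datetime


def missing_id_share_by_month(deletions):
    """Map 'YYYY-MM' to the share of deletion rows that month lacking a page ID.

    `deletions` is an iterable of (timestamp, page_id) pairs; a page_id of
    None or 0 is treated as missing (an assumption, see the text above).
    """
    total = Counter()
    missing = Counter()
    for ts, page_id in deletions:
        month = ts.strftime("%Y-%m")
        total[month] += 1
        if not page_id:
            missing[month] += 1
    return {month: missing[month] / total[month] for month in sorted(total)}


def limit_to_reliable(creations, cutoff):
    """Keep only (page_id, created) pairs created on or after `cutoff`."""
    return [(pid, created) for pid, created in creations if created >= cutoff]
```

Restricting by creation date rather than deletion date is the conservative choice here: a page created after the cutoff can only have been deleted after it, so its deletion should carry a usable page ID.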