Research talk:Autoconfirmed article creation trial/Work log/2017-09-05

Tuesday, September 5, 2017

Today I'll do another spot check of our dataset, looking at years before the "dip" we fixed yesterday to see if there are page creations we should have filtered out but are missing. I'll also start building a dashboard for our page creation datasets, which are now available and updated daily. Lastly, I'll start summarizing our findings so far to make sure we have a good overview and ideas about what additional research would be useful, and continue documenting our methods and datasets.

Dataset check

There are some days in 2009 with a lot of article creations, and I want to know what went on at that time to see if we need to alter our queries. I first identified candidate dates by filtering for days with an unusually high number of article creations, and found that Feb 21, Feb 23, Mar 5, Apr 2, and Apr 16 were the key dates to look at. I then grabbed all creation events from April 16, finding that many of them were bot-created pages, both redirects and actual articles. Next I looked at Feb 23, again finding a lot of bot-created redirects.
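
For illustration, here's a minimal sketch of this kind of outlier filtering, assuming a pandas DataFrame `daily` with one row per day and a `num_creations` column (the names, and the rolling-median approach itself, are placeholders rather than what the actual queries use):

    import pandas as pd

    def flag_outlier_days(daily, window=91, n_mads=5):
        """Flag days whose creation count sits far above a rolling median."""
        counts = daily['num_creations']
        med = counts.rolling(window, center=True, min_periods=1).median()
        mad = (counts - med).abs().rolling(window, center=True, min_periods=1).median()
        daily = daily.copy()
        daily['is_outlier'] = counts > med + n_mads * mad
        return daily[daily['is_outlier']]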

Because some of the pages created by bots are or were actual articles while others were redirects, we cannot simply filter out all bot creations. Additionally, the history table in the data lake does not yet contain historic information about redirect status (that is forthcoming, ref T161146), which is why our queries apply regular expressions to the edit comments to identify redirect creations. The edit comments used for the bot-created pages back in 2009 do not reflect whether the page is a redirect or an actual article, so we cannot use the same approach there.
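
As an example, a redirect-creation check along these lines might look as follows; the actual regular expressions in our queries differ, and the pattern here is only illustrative:

    import re

    # Match typical redirect-creation summaries, e.g. the automatic
    # "Redirected page to [[...]]" comment, or a default creation summary
    # that quotes the page text and so contains "#REDIRECT".
    REDIRECT_COMMENT = re.compile(r'redirected page to|#redirect', re.IGNORECASE)

    def looks_like_redirect_creation(comment):
        """True if an edit comment suggests the page was created as a redirect."""
        return comment is not None and REDIRECT_COMMENT.search(comment) is not None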

At this point, I'm inclined to keep things as they are. We filter out bots in our non-autopatrolled dataset, which I'm perfectly happy with. We also have a fairly good idea about what's going on with these peaks and can explain them. One concern is whether this type of bot creation affects the overall trend, but for now I am happy with what we have going back 5–6 years and would rather focus on other things.

The story, so far…

H1: Number of accounts registered

Code for this is in newaccounts.py. Our preliminary analysis of this part is in the August 8 work log, based on newaccounts.R.

Key findings here are:

  1. Many spikes in registrations in 2014 and 2015, due to SUL finalization and the mobile app prompting people to register in order to use it.
  2. The number of accounts created by another user (logged as "create2" or "byemail") is generally stable at around 30 per day for each type.
  3. The number of regular account creations in 2017 is at around 5,000 per day, with about 2,500 accounts auto-created every day. These numbers follow general Wikipedia activity trends: peaks in spring and fall, slumps in the summer and around the holiday season at the end of the year.
  4. The proportion of auto-created accounts appears to be increasing over time. This appears to be driven by an increase in the number of auto-created accounts per day; the number of regularly created accounts is fairly stable.

Because the number of accounts that are created for someone else is low and stable, we've decided to combine them with the regularly created accounts in our subsequent analysis. We see the auto-created accounts as meaningfully different from other accounts because the user has an existing relationship with another Wikimedia wiki, and we expect accounts that were created for someone to behave similarly to regularly registered accounts. Thus, we end up with two types of accounts in our analysis: auto-created and "regular".
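
For reference, these account types come from the MediaWiki logging table, where registrations have log_type 'newusers'. A simplified sketch of the kind of query involved (newaccounts.py may structure this differently):

    # Daily account registrations by type, from the logging table.
    QUERY = """
    SELECT LEFT(log_timestamp, 8) AS day,  -- timestamps are YYYYMMDDHHMMSS
           log_action,                     -- 'create', 'create2', 'byemail', 'autocreate'
           COUNT(*) AS num_accounts
    FROM logging
    WHERE log_type = 'newusers'
      AND log_action IN ('create', 'create2', 'byemail', 'autocreate')
    GROUP BY day, log_action
    """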

H2: Accounts with non-zero edits

Code for this is in registrations.py and first30.py. Our preliminary analysis of this is in the August 16 work log, based on useractivity.R. We measure the number of edits in the first 30 days after account registration.

Key findings here are:

  1. There is a distinct difference in the proportion of accounts with non-zero edits between auto-created accounts and others, with auto-created accounts much less likely to make an edit. This is to be expected: auto-creation does not involve any significant user interaction, and it might simply be someone coming to the English Wikipedia to read something.
  2. The proportion of regular accounts making at least one edit is fairly stable at slightly above 30%.
  3. The proportion of auto-created accounts making at least one edit is slowly decreasing. This might just be correlated with there being more auto-created accounts per day over time, meaning that the raw number of auto-created editors is roughly stable.

It's also important to note that a lot of our remaining hypotheses for RQ-New Accounts are activity-related, and whether an account makes at least one edit dominates those: if we don't condition on it, every plot ends up looking like the non-zero-edit proportion plot, because that's the determining factor. Filtering out all zero-edit accounts therefore provides more meaningful statistics and plots.
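
A minimal sketch of that filter, assuming a DataFrame `users` with illustrative columns `reg_date`, `account_type`, and `edits_30` (edits in the first 30 days):

    def nonzero_edit_proportion(users):
        """Proportion of accounts with at least one edit in their first
        30 days, per registration day and account type."""
        return (users.assign(edited=users['edits_30'] > 0)
                     .groupby(['reg_date', 'account_type'])['edited']
                     .mean())

    def active_accounts(users):
        """The subset used for the activity-related hypotheses below."""
        return users[users['edits_30'] > 0]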

H3: Accounts reaching autoconfirmed status

Code for this is in registrations.py and first30.py, linked earlier, because we know that if an account made at least ten edits in the first 30 days, it also had to reach autoconfirmed status: on the English Wikipedia, autoconfirmed requires an account to be at least four days old with at least ten edits, and a 30-day window always satisfies the age requirement. Preliminary analysis is in the August 16 work log.
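
In code form the inference is simple (a sketch; `edits_30` is an illustrative column name for edits in the first 30 days):

    def autoconfirmed_proportion(users):
        """Proportion of accounts with at least one edit that also reached
        autoconfirmed status in the first 30 days. Ten edits within 30 days
        imply autoconfirmed, since the 4-day age requirement is always met
        inside that window."""
        active = users[users['edits_30'] > 0]
        return (active['edits_30'] >= 10).mean()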

Key findings here are:

  1. Proportion of regular accounts with at least one edit in the first 30 days that go on to reach autoconfirmed status in the same timespan is stable at around 10%.
  2. The proportion of auto-created accounts that meet the same criteria is generally higher, perhaps around 12–13%, with a lot more variation. The variation could be attributed to the low proportion of auto-created accounts that make edits, which leaves a small daily sample and noisier estimates. Even so, the proportion is fairly stable across time.

Since these measurements are fairly stable, they should make good indicators of whether something changes once the trial starts.

H4: Median time to autoconfirmed

Code for this is in autoconfirmed.py, and it builds upon registrations.py and first30.py. Preliminary analysis is in the August 17 work log.

The key finding here is that regular accounts that go on to reach the autoconfirmed threshold most often reach it in four days, the minimum possible: they make their ten edits in less than four days, so the four-day age requirement is what determines when they become autoconfirmed. For auto-created accounts there is a lot of variation in the median, and it regularly hits about ten days.
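
Here's a minimal sketch of the per-account computation, assuming we have the registration timestamp and a list of edit timestamps (illustrative names; autoconfirmed.py may compute this differently):

    from datetime import timedelta

    def time_to_autoconfirmed(registration, edit_timestamps):
        """Time from registration to autoconfirmed status, or None if the
        account never made ten edits. An account qualifies once it is both
        four days old and has ten edits, so the time to reach the status is
        the later of the two; a tenth edit made in under four days still
        means waiting out the four-day requirement."""
        if len(edit_timestamps) < 10:
            return None
        tenth_edit = sorted(edit_timestamps)[9]
        return max(timedelta(days=4), tenth_edit - registration)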

H5: Proportion of surviving new editors

Code for this is in survival.py and the preliminary analysis is in the August 17 work log.

Key findings:

  1. Because editor survival is calculated from edits in both the first and fifth week, it is not as strongly affected by the overall proportion of accounts that edit at all. It is still affected to some degree, though, meaning that restricting the analysis to accounts that edited in the first week provides more meaningful results (see the sketch after this list).
  2. Regular account survival is fairly stable at around 2.5% for accounts that made at least one edit in the first week after registration.
  3. Survival of auto-created accounts is higher, perhaps around 5%, but appears to be decreasing over time.
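
A sketch of the survival measure itself, under the assumption that "surviving" means editing in the first week after registration and again in the fifth week (names and exact week boundaries are illustrative):

    from datetime import timedelta

    def survives(registration, edit_timestamps):
        """True if the account edited in week 1 and again in week 5."""
        week = timedelta(days=7)
        offsets = [t - registration for t in edit_timestamps]
        edited_week1 = any(d < week for d in offsets)
        edited_week5 = any(4 * week <= d < 5 * week for d in offsets)
        return edited_week1 and edited_week5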

This hypothesis also has a related measure; code for that is in articlesurvival.py, and the preliminary analysis is in the September 1 work log and the September 3 work log. In this analysis we only look at accounts that made at least one edit, because an account that didn't make an edit couldn't possibly have created an article. We also restricted the analysis to the past three years (July 2014 to July 2017) due to an issue with the dataset (that issue has since been resolved).
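
A sketch of the two checks involved, with illustrative field names rather than the ones articlesurvival.py actually uses:

    from datetime import timedelta

    def starts_with_article_creation(first_edit):
        """True if the account's first edit created a main-namespace page."""
        return first_edit['is_page_creation'] and first_edit['namespace'] == 0

    def article_survived(created_at, deleted_at):
        """True if the article was not deleted within 30 days of creation."""
        return deleted_at is None or deleted_at - created_at >= timedelta(days=30)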

Key findings from the related measure analysis:

  1. The vast majority of accounts that make edits do not start out by creating an article. About 8% of the regular accounts that make at least one edit start out by creating an article, and for auto-created accounts the proportion is 13%.
  2. If an account starts out by creating an article, that article is unlikely to survive for 30 days. For regular accounts, only about one in five articles survives. Auto-created accounts do better, but still only 30% of their articles survive.
  3. Accounts that start out by creating an article are less likely to survive; this holds for both types of account creation.
  4. Accounts that start out by creating an article and have that article deleted are less likely to survive, and this is consistent across account types.

H6: Diversity of participation

Code for this is in registrations.py and first30.py, linked earlier. Preliminary analysis was done in the August 17 work log. We measure diversity using the average number of namespaces edited in the first 30 days, and the average number of unique pages edited in the first 30 days. Due to skewness in the latter measure, we use a geometric mean. In both cases we find that analysis is helped by only looking at accounts that make at least one edit.
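
For the skewed page-count measure, the geometric mean can be computed along these lines (a sketch; `pages_30` is an illustrative column name, and the same approach applies to the edit counts in H7 below):

    from scipy.stats import gmean

    def mean_pages_edited(users):
        """Geometric mean of unique pages edited in the first 30 days,
        restricted to accounts with at least one edit (all counts >= 1,
        so no zero-handling is needed)."""
        active = users[users['pages_30'] >= 1]
        return gmean(active['pages_30'])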

Key findings:

  1. Average number of namespaces edited is roughly stable for both regular and auto-created accounts at around 1.25. This means that most users only edit in a single namespace, but some users do edit multiple ones.
  2. Average number of pages edited has been fairly steady since 2012 at about 1.5 pages for regular accounts, and slightly higher for auto-created accounts. Before 2012 there is a decreasing trend, with a larger decrease for auto-created accounts.

H7: Average number of edits

Code for this is in registrations.py and first30.py, linked earlier. Preliminary analysis was done in the August 17 work log. We measure the average number of edits done in the first 30 days since account registration, and use a geometric mean due to skewness in the data (some accounts make a lot of edits).

The key finding here is that this measure has been stable across time, with regular accounts making about 2.6 edits, and auto-created accounts making slightly more edits.