Research talk:Autoconfirmed article creation trial/Work log/2017-09-01

Friday, September 1, 2017

Today I'll do a preliminary analysis of whether accounts start out by creating articles or not, and how that affects the proportion of surviving editors (ref H5's "further segmentation" section).

Starting out creating articles

To determine whether a user started out by creating an article, we compare their first edit (using both the revision and archive tables to account for deleted edits) against our dataset of article creations. If the revision ID of a creation matches the user's first edit, we flag the user as having started out by creating an article. We then use the logging table to check whether the article was deleted within 30 days of being created, and set an additional flag if it was.
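The flagging logic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionaries `first_edits`, `creations` and `deletions` and the function `flag_user` are hypothetical stand-ins for the results of querying the revision/archive tables, the article creation dataset and the logging table, and the records in them are made up.

```python
from datetime import datetime, timedelta

# Hypothetical, hand-made records; the real analysis queries the
# revision, archive and logging tables instead.
first_edits = {  # user_id -> revision ID of the user's first edit (live or deleted)
    1: 100,
    2: 205,
    3: 310,
}
creations = {  # revision ID that created a page -> (page_id, creation timestamp)
    100: (11, datetime(2017, 1, 1)),
    310: (12, datetime(2017, 1, 5)),
}
deletions = {  # page_id -> deletion timestamp taken from the deletion log
    11: datetime(2017, 1, 10),
}


def flag_user(user_id, window_days=30):
    """Return (created_article, deleted_within_window) for one user."""
    rev_id = first_edits[user_id]
    if rev_id not in creations:
        # The first edit was not a page creation.
        return (False, False)
    page_id, created_at = creations[rev_id]
    deleted_at = deletions.get(page_id)
    deleted = (
        deleted_at is not None
        and deleted_at - created_at <= timedelta(days=window_days)
    )
    return (True, deleted)


flags = {user_id: flag_user(user_id) for user_id in first_edits}
print(flags)
```

Matching on revision ID rather than on page title keeps the comparison unambiguous when a page is later moved or recreated.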

Q1: To what extent do users create an article in their first edit?

We ignore all users who did not make any edits because, as we've seen earlier, the statistics and graphs are otherwise dominated by whether users made an edit or not.

Overall, 25.2% of all users made at least one edit in the first 30 days. Of those, 10.2% started out by creating an article. Has that proportion changed over time? Let's calculate it and plot it:

What's going on in this graph? It closely mimics the pattern in our article creation graph (ref the Aug 22 work log). It looks like we might not be catching redirect creations well enough prior to 2012. Then there's a suspicious dip from Q1 2012 to the end of Q2 2014, which corresponds to a similar plateau in the article creation graph. From the second half of 2014 onwards the data appears to be decent, showing a 2.5-year increase followed by a dip and stabilization in late 2016/early 2017.

I calculated the number of accounts creating an article in their first edit for January 1 and July 1, 2014. The first date is in the middle of the dip, the second is past it. The number of article creators, 19 and 130 respectively, appears to correlate closely with the difference in the average number of articles created during the same periods (975 vs 1,100, again discussed in the Aug 22 work log). It appears that we're not capturing creations correctly during the dip period, so I'll have to look into the article creation statistics.

Another thing to note here is that the massive number of account creations due to SUL finalization in 2014 and 2015 does not appear to affect this graph at all. Instead, it affected the proportion of accounts that make at least one edit. Since we calculate the proportion of article creators based on the accounts that make at least one edit, we don't see that effect here.

Q2: Is the extent of article creations in the first edit different between different types of account creations?

| Account type | Edited (in %) | Created article (in %) |
| create | 32.2 | 10.2 |
| create2 | 18.0 | 9.1 |
| byemail | 13.4 | 5.9 |
| autocreate | 4.5 | 10.5 |

We already knew that autocreated accounts tend to edit less, so that's not news. Generally we prefer to collapse account types into two groups, autocreated vs others. In this case we see that while edit proportions differ, there is not much difference in the proportion of accounts creating articles. The exception is accounts that get their password by email, which are much less likely to create an article in their first edit. Could this be from hackathons or the Education Program, where they'll use sandboxes and move articles?

How has the proportion changed over time? We'll collapse "create2" and "byemail" into the "create" group because there are so few accounts of the former two types. Then we plot the proportion for "create" and "autocreate" side-by-side:
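As a rough sketch of the collapsing step, the snippet below maps each account creation type onto one of the two groups and computes the proportion of first-edit article creators per group. The `accounts` list is hypothetical toy data and `GROUP` is just an illustrative name; the real calculation runs over all accounts that made at least one edit, bucketed by registration period.

```python
from collections import Counter

# Map each account creation type onto the two groups used in the plot.
GROUP = {
    "create": "create",
    "create2": "create",
    "byemail": "create",
    "autocreate": "autocreate",
}

# Hypothetical accounts that made at least one edit:
# (account creation type, whether the first edit created an article).
accounts = [
    ("create", True),
    ("create", False),
    ("create2", False),
    ("byemail", False),
    ("autocreate", True),
    ("autocreate", False),
]

editors = Counter()   # accounts with at least one edit, per group
creators = Counter()  # accounts whose first edit created an article, per group
for creation_type, created_article in accounts:
    group = GROUP[creation_type]
    editors[group] += 1
    if created_article:
        creators[group] += 1

# Percentage of editing accounts that started out by creating an article.
proportions = {group: 100 * creators[group] / editors[group] for group in editors}
print(proportions)
```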

Here we can see that the dip affects both types of accounts, making it clear that we need to look into what happened with the data during that period. We can also see that the decrease in proportion from 2009 to 2012 does not affect autocreated accounts. It might be that we're correctly capturing article creation events, and that in those years there were more articles to create for newly registered accounts.

We can also see that the proportion of article creators among autocreated accounts is slightly higher than the overall proportion suggests, due to the dip in the dataset. The proportion tends to be stable between 10 and 15%, and appears to have increased somewhat in the more recent years. For the first half of 2017, the average is 13.1%.

Lastly, when it comes to other types of accounts, we see that the article creator proportion is often below 10%, although it does vary from year to year. In the first half of 2017, the average is 8.1%.

Q3: If they create an article, to what extent does it survive?

First we calculate the overall survival percentage and find that it's 75.8%. That seems incorrect. Let's plot survival percentage across time:

Well, that plot explains a lot about our article creation plot. Somehow the dataset appears to contain no deleted articles prior to some point in June 2014. On the other hand, we see that the survival percentage is very stable from Q3 2014 onwards.

We'll first calculate the overall survival percentage from July 1, 2014 onwards, then plot a split graph by account creation type from the same date. The overall survival rate is 21.3%. Split up by type of account creation, the rate is 20.6% for regularly created accounts and 29.9% for autocreated accounts. That should make for an interesting plot, which looks like this:

For regularly created accounts, the survival rate is very stable around 20%, although it sometimes drops as low as 10% and sometimes reaches 30% or more. So… if we ballpark some numbers: let's say 5,000 accounts get created in a day (not counting autocreated accounts). About 25% of those make an edit in the first 30 days, which is 1,250 accounts. Of those, 8% created an article in their first edit, which is 100 accounts. Of those articles, 20% survived, which is 20 articles. This is much lower than the number of surviving articles created by non-autoconfirmed accounts per day (which we've estimated at 50–100), which suggests that many non-autoconfirmed accounts either create multiple articles or make a few edits before going on to create one. Something to perhaps look into further down the line.
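The ballpark above is a simple chain of multiplications, written out here so the inputs are easy to vary. All figures are the rough ones quoted in the paragraph; the variable names are mine.

```python
accounts_per_day = 5000          # non-autocreated registrations per day (ballpark)
edit_rate = 0.25                 # share that edits within the first 30 days
first_edit_creation_rate = 0.08  # share of editors whose first edit creates an article
article_survival_rate = 0.20     # share of those articles not deleted within 30 days

editing_accounts = accounts_per_day * edit_rate
first_edit_creators = editing_accounts * first_edit_creation_rate
surviving_articles = first_edit_creators * article_survival_rate

# Earlier estimate of surviving articles per day by non-autoconfirmed accounts.
estimated_low, estimated_high = 50, 100

print(round(editing_accounts), round(first_edit_creators), round(surviving_articles))
# How many times larger the earlier estimate is than this ballpark.
print(estimated_low / surviving_articles, estimated_high / surviving_articles)
```

With these inputs the earlier estimate of 50–100 is roughly 2.5 to 5 times the ballpark of 20, which is the gap that first-edit creations alone leave unexplained.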

Q4: How does creating an article affect editor survival?

Say that someone starts out by creating an article. Are they more or less likely to survive than someone who doesn't? Because of our article creation data anomaly, we'll analyze this only for the three years from July 1, 2014 to July 1, 2017.

We create a 2x2 contingency table for this, and for simplicity we've calculated the proportion of the total number of accounts (1,930,590) that lands in each cell:

| | Edited in week 1 | Edited in week 5 |
| Didn't create article | 88.7% | 2.4% |
| Created article | 8.8% | 0.2% |

We already knew that the vast majority of accounts don't create an article, so perhaps slightly more interesting is the same 2x2 contingency table with proportions based on the sum of each row:

| | Edited in week 1 | Edited in week 5 |
| Didn't create article | 97.4% | 2.6% |
| Created article | 98.1% | 1.9% |
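For illustration, here is how row-based proportions follow from the cell proportions of the first table. The function name `row_percentages` is mine. Because the inputs are already rounded to one decimal, the "Created article" row comes out as 97.8% / 2.2% rather than 98.1% / 1.9%; the table values were presumably computed from the raw counts, which is an assumption on my part.

```python
# Cell proportions (in % of all accounts) from the first contingency table.
cells = {
    "Didn't create article": (88.7, 2.4),
    "Created article": (8.8, 0.2),
}


def row_percentages(row):
    """Rescale one row so that its cells sum to 100%."""
    total = sum(row)
    return tuple(round(100 * value / total, 1) for value in row)


by_row = {label: row_percentages(row) for label, row in cells.items()}
print(by_row)
```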

Here we can see more clearly the difference in the proportion of survival between the two groups. Both of them have a low proportion of survival, which can arguably be attributed to single weeks being short timespans for calculating this. However, we do see a significantly lower probability of survival for accounts that started out by creating an article: Chi-square goodness-of-fit test χ² = 390.06, df = 1, p < 0.001.
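A minimal sketch of this kind of goodness-of-fit test, using only the Python standard library, is shown below. The function names are mine, and the counts are rebuilt from the rounded percentages in the tables above rather than taken from the raw data, so the resulting statistic will not match the reported 390.06 exactly.

```python
import math


def chi_square_gof(observed, expected_probs):
    """Pearson chi-square statistic for observed counts vs. expected proportions."""
    total = sum(observed)
    return sum(
        (obs - total * p) ** 2 / (total * p)
        for obs, p in zip(observed, expected_probs)
    )


def p_value_df1(statistic):
    """Upper-tail p-value of a chi-square distribution with one degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Approximate counts rebuilt from rounded percentages (an assumption, not raw data).
total_accounts = 1_930_590
creators = round(0.090 * total_accounts)  # 8.8% + 0.2% of all accounts
survived = round(0.019 * creators)        # 1.9% of creators edited in week 5
observed = [creators - survived, survived]
expected_probs = [0.974, 0.026]           # row proportions for non-creators

statistic = chi_square_gof(observed, expected_probs)
p_value = p_value_df1(statistic)
print(statistic, p_value)
```

The expected proportions come from the "didn't create article" row, so the test asks whether article creators survive at the same rate as everyone else.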

How does survival differ between types of accounts? We know that regularly created accounts far outnumber autocreated accounts: there are only 95,074 autocreated accounts that edited in the first week, compared to the previous total of 1,930,590. It is therefore reasonable to skip this calculation for regularly created accounts, because the result would be similar to the previous one.

We therefore focus on autocreated accounts and create the two contingency tables like we did previously:

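Proportion of all autocreated accounts:

| | Edited in week 1 | Edited in week 5 |
| Didn't create article | 84.3% | 3.0% |
| Created article | 12.4% | 0.3% |

Proportion of each row:

| | Edited in week 1 | Edited in week 5 |
| Didn't create article | 96.6% | 3.4% |
| Created article | 97.5% | 2.5% |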

For autocreated accounts, the proportion that created articles is higher, as expected. We also see higher survival rates than what we found previously. This should not be too surprising since we'd expect autocreated accounts that make edits in the first week to have some Wikipedia experience already. However, we again see that the survival rate of those that start out by creating an article is lower. We should go examine those accounts in more detail to see what's happening there.

Is the difference in survival statistically significant? We perform the same Chi-square goodness-of-fit test on the "created article" row, based on the probabilities from the "didn't create article" row. The result is again a statistically significant difference: χ² = 33.16, df = 1, p < 0.001. In this case, it's perhaps worth noting that across the three years of our dataset, only 299 autocreated accounts created an article and edited in the fifth week. That's about 2 accounts every week.