The goal of this sprint is to explore properties of anonymous edits on Wikipedia. Are there general trends that we wouldn't expect intuitively? RQ5: Research:How do editors work anonymously?
First, we look at the number of anonymous revisions compared to non-anonymous revisions over time. The data is generated using a simple query that retrieves the user and the timestamp of each revision. The data is then aggregated by counting the number of revisions per month for both types of revisions.
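The monthly aggregation described above can be sketched as follows. This is a minimal illustration, not the query or script actually used for the sprint. It assumes each revision arrives as a `(user_id, timestamp)` pair, that a `user_id` of `0` marks an anonymous edit, and that timestamps are 14-digit `YYYYMMDDHHMMSS` strings. The function names and the sample rows are hypothetical.

```python
from collections import Counter


def month_key(timestamp):
    """Return 'YYYY-MM' from a 14-digit 'YYYYMMDDHHMMSS' timestamp string."""
    return f"{timestamp[:4]}-{timestamp[4:6]}"


def count_revisions_per_month(revisions):
    """Count anonymous and non-anonymous revisions per month.

    `revisions` is an iterable of (user_id, timestamp) pairs.
    Assumption: user_id == 0 means the edit was made anonymously.
    """
    anon = Counter()
    registered = Counter()
    for user_id, timestamp in revisions:
        target = anon if user_id == 0 else registered
        target[month_key(timestamp)] += 1
    return anon, registered


# Made-up sample rows, for illustration only.
sample = [
    (0, "20070301120000"),
    (42, "20070315080000"),
    (0, "20070402000000"),
    (42, "20070420235959"),
    (7, "20070421000000"),
]
anon, registered = count_revisions_per_month(sample)
```

The two counters can then be plotted as separate time series, one point per month.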
Second, we do a similar analysis comparing the revision length of anonymous vs. non-anonymous edits. We create 100 uniformly sized bins that span a reasonable range of revision length (e.g. 0-50000). The data is generated by putting each revision in the corresponding bin. The resulting histogram gives an indication of the distribution of edits over the revision length.
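The binning step can be sketched like this. It is an illustration under stated assumptions: the range is 0 to 50000 bytes as in the example above, bins are half-open intervals, and revisions outside the range are simply dropped (the text does not say how such revisions were handled). The function name and sample lengths are hypothetical.

```python
def bin_revision_lengths(lengths, num_bins=100, max_length=50000):
    """Count revisions per bin, using `num_bins` uniform bins on [0, max_length).

    Assumption: lengths outside the range are ignored.
    """
    width = max_length / num_bins  # 500 bytes per bin with the defaults
    counts = [0] * num_bins
    for length in lengths:
        if 0 <= length < max_length:
            counts[int(length // width)] += 1
    return counts


# Made-up revision lengths in bytes, for illustration only.
counts = bin_revision_lengths([0, 499, 500, 2500, 49999, 50000, 60000])
```

Running the function once on the anonymous revisions and once on the non-anonymous ones gives the two histograms to compare.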
Results and discussion
Revisions over time
The two plots below show the number of revisions over time for the English Wikipedia (an aggregation of 362912934 revisions), with anonymous edits on the left and non-anonymous edits on the right. Note that the scale on the y-axis is not the same. Both graphs show the exponential growth of Wikipedia from 2002 until 2007. For both kinds of edits, the number of revisions per month peaks around March 2007. This growth pattern is consistent with the number of active editors found in the Editors Trend Study, which shows a decline in the number of active editors after that point in time.
The number of anonymous revisions is declining faster than the number of edits by logged-in users, i.e. the slope of the decline is steeper for the former. The plot below shows the percentage of anonymous edits over time for the English Wikipedia. Until the beginning of 2003, the percentage of anonymous edits is large and displays high variability. This could be explained by the fact that there were relatively few editors at the beginning and automated edits represented a larger part of all revisions. Additionally, there were presumably many automated scripts and bots at work, as the development of the platform moved at a faster pace than today. Surprisingly, the percentage of anonymous edits was at its lowest, below 20%, between 2003 and August 2004. After a spike to over 30% at the end of 2005, the percentage of anonymous revisions slowly but steadily declined to around 20% in June 2011.
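The percentage of anonymous edits per month can be derived from the two monthly counts. The sketch below assumes the counts are available as dictionaries keyed by month. The function name and the toy numbers are hypothetical and are not taken from the actual data.

```python
def anonymous_share(anon_counts, registered_counts):
    """Return the percentage of anonymous revisions for each month.

    Both arguments map a month label to a revision count.
    Months with no revisions at all are skipped.
    """
    shares = {}
    for month in sorted(set(anon_counts) | set(registered_counts)):
        anon = anon_counts.get(month, 0)
        registered = registered_counts.get(month, 0)
        total = anon + registered
        if total:
            shares[month] = 100.0 * anon / total
    return shares


# Toy counts, for illustration only.
shares = anonymous_share(
    {"2004-01": 1, "2004-02": 3},
    {"2004-01": 3, "2004-02": 1, "2004-03": 5},
)
```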
As expected, the number of revisions decreases with the length of the edit. Interestingly, while the decrease for logged-in users seems to be exponential, the anonymous revisions spike at a length of about 2500 bytes. This means that anonymous revisions are more likely to be larger revisions than shorter ones. A possible explanation is that acts of vandalism tend to either add or remove large chunks of text. It is also worth pointing out that the number of very short revisions (the bar on the left of both plots) is similar for both kinds of edits (~400000 edits). It is likely that these edits are minor corrections of typos by people who spot them, regardless of whether they are editors or not.
A central premise of the Summer of Research is that the decline of the retention of active editors needs to be reversed. While of course I agree that new editors are important for Wikipedia, the following question presents itself when taking into consideration the 'Revisions over time' plots. First, I assume that the quality of Wikipedia as an encyclopedia has not declined since 2007. The hypothesis of the SoR is that, as the number of active editors is declining, more work rests on the shoulders of the remaining editors. As the retention rate is declining and the editor population ages, this will obviously lead to problems in the future. However, in the plots above we can see that the total number of revisions is also declining. This means that, independent of the fact that fewer editors might contribute more work (references?), there seems to be less work necessary to ensure a high quality of the English Wikipedia. The English Wikipedia seems to have reached a certain saturation. So the question becomes: how much work is necessary to guarantee the quality, and how many editors are needed to do the work?
We could examine if frequently edited articles are more likely to be revised by anonymous editors. We use the following variables:
- Choose $x$ articles that have been edited over $y1$ times in the time unit $z$
- Choose $x$ articles that have been edited fewer than $y2$ times in the time unit $z$
- $x$ is the number of articles, e.g. 1000
- $y1$ should be a fairly high number, but chosen such that finding $x$ articles is not a problem
- $y2$ is a low number
- $z$ can be a full year, or a month, or even all years
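The selection of the two article groups could look roughly like the sketch below. It assumes the number of edits per article within the time unit $z$ has already been counted and is available as a dictionary. Drawing the $x$ articles at random is an assumption, since the list above only says "choose". All names and numbers are hypothetical.

```python
import random


def select_article_groups(edit_counts, x, y1, y2, seed=0):
    """Pick x frequently edited and x rarely edited articles.

    `edit_counts` maps an article title to its number of edits in the
    chosen time unit z. Articles with more than y1 edits are candidates
    for the first group, articles with fewer than y2 edits for the second.
    Assumption: the x articles are drawn uniformly at random.
    """
    rng = random.Random(seed)
    frequent = sorted(a for a, n in edit_counts.items() if n > y1)
    rare = sorted(a for a, n in edit_counts.items() if n < y2)
    if len(frequent) < x or len(rare) < x:
        raise ValueError("not enough articles; adjust x, y1 or y2")
    return rng.sample(frequent, x), rng.sample(rare, x)


# Toy edit counts, for illustration only.
toy_counts = {"A": 500, "B": 320, "C": 3, "D": 1, "E": 40, "F": 700}
frequent_group, rare_group = select_article_groups(toy_counts, x=2, y1=300, y2=5)
```

The `ValueError` reflects the constraint that $y1$ must be chosen such that finding $x$ articles is not a problem.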
The data is aggregated by counting the number of anon and non-anon revisions for both groups of articles. The result is shown as a scatter plot with one point for each article in the dataset, where the x-axis is the number of anon edits and the y-axis is the number of non-anon edits. The goal is to see whether there is some sort of clustering effect. As the group of articles with fewer revisions will naturally have a smaller count of edits, we have to normalize the counts so that the two groups become comparable. Alternatively, we can remove that bias by collecting two groups of different size but with the same number of revisions, instead of collecting $x$ articles for each group.
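One possible normalization, sketched here as an assumption rather than the method settled on, is to divide each article's counts by the total number of revisions in its group. The scatter points of each group then sum to 1 and can be drawn on the same axes. The function name and the toy counts are hypothetical.

```python
def normalize_group(counts):
    """Scale per-article counts by the group's total number of revisions.

    `counts` maps an article title to an (anon, non_anon) pair.
    Returns the same mapping with both values divided by the group total,
    i.e. the coordinates of each article in the scatter plot.
    """
    total = sum(anon + non_anon for anon, non_anon in counts.values())
    if total == 0:
        return {article: (0.0, 0.0) for article in counts}
    return {
        article: (anon / total, non_anon / total)
        for article, (anon, non_anon) in counts.items()
    }


# Toy counts for one group, for illustration only.
points = normalize_group({"A": (1, 3), "B": (2, 2)})
```

Other choices, such as dividing by the mean number of revisions per article in the group, would serve the same purpose.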