We aim to explore the editing trends of cohorts over time, where a cohort is defined as a group of editors who started editing Wikipedia in a given month. We can also use additional features to refine the definition of a cohort, for example by filtering out known bots or by considering only editors who have accumulated a certain number of edits in their wikicareers. The visualization method presented here can be used to analyze trends for any cohort-centric statistic; we introduce the idea using the contributions of cohorts in terms of bytes added.
Stacked bar chart visualization
In this section we describe the visualization method used to explore trends within Wikipedia using a simple example: the contributions of cohorts measured in bytes added (which is equivalent to characters added). We use wikilytics to generate this data for every registered user in each cohort, saving the aggregated values in a square matrix of dimension '# of cohorts' x '# of time units'. The matrix is square because there is one cohort for every time unit. Each row contains one data point (i.e. the number of bytes added) per time unit (i.e. per month) for one cohort. Each column therefore holds the contributions of every cohort that was active in a given month, and its sum is equal to all bytes added to Wikipedia during that month. We visualize this matrix as a stacked bar plot. There is one new cohort in every month (the cohort of editors who made their first edit in that month), so every stacked bar has one more element than the bar to its left. To make the contributions of individual cohorts visually distinguishable, we use a rainbow gradient to color-code the time dimension. The color bar on the right of the plot below helps to determine the approximate age of a cohort in the plot.
The plot above displays the total contributions by registered users in terms of bytes added to all namespaces. The topmost rectangle in every bar is the contribution of the newest cohort, and the rectangles below it are the contributions of the other cohorts, ordered by increasing age. By following the same color across the graph, we can see the absolute size of the contributions by cohorts of a certain age. The last bar on the right is the latest data point, December 2010, and it contains a cohort rectangle for every month since January 2001; thus all colors in the rainbow gradient are visible on that bar. Note that this absolute representation allows us to observe the level of total contributions over time, but it does not preserve the relative contributions of individual cohorts well. For example, the older cohorts (from red to yellow) consist of few contributors who joined before the exponential growth of Wikipedia, and as a consequence their total contributions are barely visible on that first graph. The plot below is based on the same data as the first one, but each bar is scaled to the same length. This way, we can observe the relative contributions of cohorts as a percentage of the total contributions in a given month, and therefore the percentage of the total work done by cohorts of similar age over time. In other words, we can see how fast newer cohorts take over the work. This section served to introduce the visualizations themselves; we explain some of the results in the following sections.
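The matrix and the stacking order described in this section can be sketched in a few lines of Python. This is only an illustration on made-up numbers: the record layout `(first_edit_month, edit_month, bytes_added)` and the helper names `month_index`, `cohort_matrix` and `stacked_segments` are assumptions for this example and are not part of wikilytics.

```python
def month_index(ym, origin="2001-01"):
    """Convert a 'YYYY-MM' string into a 0-based month offset from `origin`."""
    year, month = map(int, ym.split("-"))
    o_year, o_month = map(int, origin.split("-"))
    return (year - o_year) * 12 + (month - o_month)


def cohort_matrix(records, n_months, origin="2001-01"):
    """Aggregate bytes added into a square matrix.

    Row = cohort (month of first edit), column = month of the edit.
    `records` is a hypothetical list of
    (first_edit_month, edit_month, bytes_added) tuples.
    """
    matrix = [[0] * n_months for _ in range(n_months)]
    for first_edit, edit_month, bytes_added in records:
        row = month_index(first_edit, origin)
        col = month_index(edit_month, origin)
        matrix[row][col] += bytes_added
    return matrix


def stacked_segments(matrix, col):
    """Return (cohort, bottom, height) for one bar, oldest cohort at the bottom."""
    segments, bottom = [], 0
    for cohort in range(col + 1):  # cohorts newer than `col` do not exist yet
        height = matrix[cohort][col]
        segments.append((cohort, bottom, height))
        bottom += height
    return segments


# Toy data, not real Wikipedia numbers.
records = [
    ("2001-01", "2001-01", 500),
    ("2001-01", "2001-02", 300),
    ("2001-02", "2001-02", 700),
    ("2001-01", "2001-03", 100),
    ("2001-02", "2001-03", 200),
    ("2001-03", "2001-03", 900),
]
matrix = cohort_matrix(records, n_months=3)
third_bar = stacked_segments(matrix, col=2)
```

The `bottom` and `height` values of each segment are what a plotting library would need to draw one rectangle of a stacked bar; the cohort index can be mapped to a position on the rainbow gradient.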
Cohort contributions in bytes
When measuring the contributions of editors, besides the already established simple edit count, we can look at contributions in terms of bytes added to or removed from the encyclopaedia. The contribution of an editor in a cohort is calculated as the sum of the bytes added to Wikipedia in a given month. Additionally, we can break down the contribution using features such as the namespace in which the contribution was made. Alternatively, instead of looking at bytes added, we can aggregate bytes removed or the net contribution (added minus removed). The number of bytes added is equivalent to the number of characters added, and we use both expressions interchangeably.
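As a rough sketch of the three measures and the namespace breakdown, the snippet below sums per-edit size changes. It assumes that each edit is available as a `(namespace, delta)` pair, where `delta` is the change in page size in bytes, and that a positive delta counts as bytes added and a negative one as bytes removed. Both the input layout and the function name `contributions_by_namespace` are assumptions for illustration, not a description of how wikilytics computes these values.

```python
from collections import defaultdict


def contributions_by_namespace(edits):
    """Sum bytes added, removed and net (added minus removed) per namespace.

    `edits` is a hypothetical iterable of (namespace, delta) pairs for one
    editor (or one cohort) in one month.
    """
    totals = defaultdict(lambda: {"added": 0, "removed": 0})
    for namespace, delta in edits:
        if delta > 0:
            totals[namespace]["added"] += delta
        else:
            totals[namespace]["removed"] += -delta
    return {
        ns: {**t, "net": t["added"] - t["removed"]}
        for ns, t in totals.items()
    }


# Toy data: three edits in the main namespace (0), two in the project namespace (4).
edits = [(0, 120), (0, -30), (4, 45), (0, -200), (4, 10)]
summary = contributions_by_namespace(edits)
```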
Not all bytes are created equal
The bot policy has been adapted over time, but in the early years editors were allowed to do automated work using their regular accounts. At some point, possibly in 2005, clear rules about bot names and usage were introduced, which allows us to create a list of known bots. To our knowledge there is no comprehensive list of all bots. As a consequence, filtering out the known bots introduces a bias towards contributions made before the updated policy, because automated edits made from regular accounts before that point remain in the data.
The number of bytes added or removed does not equal the amount of content added or removed. For example, when an editor adds a template, the markup creates a lot of added bytes, but in terms of content such an edit is worth less than a block of text of the same length. A system that classifies edits based on the actual diff between the two revisions would allow us to filter contribution types in more detail. Unfortunately this leads us directly to the future work section.
Below is a series of three graphs that display the contributions added, removed and net (added minus removed).
Total characters added
Net characters added / removed
The trends for characters added, removed and net are all similar (apart from a few outliers), so in the rest of this analysis we focus on bytes added and consider this metric representative of the general trend within the Wikipedia ecosystem.
|Total contributions added, bots filtered|
|Total contributions added, only bots|
|Total contributions added, main namespace only (ns=0)|
|Total contributions added, bots filtered, main namespace only, 4 colors only|
|Total contributions added, bots filtered, main namespace only, per year|
To inspect the relative workload that cohorts contribute, we normalize the plots above to see the percentage of the total bytes added by a cohort in a given month. As indicated above, the trends are similar for bytes added, removed and net, so we only created these plots for the bytes added category.
|Percentage of added contributions, bots filtered, main namespace only, 4 colors only|
|Percentage of added contributions, bots filtered, main namespace only, per year|
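The normalization behind the percentage plots amounts to dividing every cell of the cohort matrix by its column total. The following is a minimal sketch on a made-up 3 x 3 matrix (rows are cohorts, columns are months); the function name `column_percentages` is an assumption for this example.

```python
def column_percentages(matrix):
    """Scale each column (month) of a cohort matrix so that it sums to 100."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    totals = [sum(matrix[r][c] for r in range(n_rows)) for c in range(n_cols)]
    return [
        [
            # Months without any contribution stay at 0 to avoid dividing by zero.
            100.0 * matrix[r][c] / totals[c] if totals[c] else 0.0
            for c in range(n_cols)
        ]
        for r in range(n_rows)
    ]


# Toy matrix, not real Wikipedia numbers.
toy_matrix = [
    [500, 300, 100],
    [0, 700, 200],
    [0, 0, 900],
]
percentages = column_percentages(toy_matrix)
```

After this step every bar has the same height, so the share of the newest cohort in a month can be read off directly, at the cost of hiding the absolute level of contributions.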
Relative to namespace
We also produced secondary versions of these charts in order to compare contributions to different namespaces of the encyclopedia. Since namespaces very roughly divide up topical areas of activity (e.g. discussion vs. content), this gives us an idea of what different cohorts are doing in the encyclopedia.
The above graph shows the absolute contributions by cohort to the main namespace in megabytes. As you can see, new cohorts tend to contribute quite a significant amount of content to this namespace, and despite the overall decline in bytes contributed there appears to be a natural slow tapering off of contributions by a cohort over time. This latter trend is not unexpected, given the natural decline in edits or other contributions as a cohort ages.
The above graph shows the absolute contributions by cohort to namespaces 4 and 5 in megabytes. NS4 is the "Project" or "Wikipedia" namespace, which contains maintenance pages, policies and guidelines, deletion discussions, dispute resolution and more. NS5 is the associated set of Talk pages for these project pages. As you can see, there is a distinct and clearly visible lower level of participation from younger cohorts in the project namespaces.
Observations
- In October 2002, Rambot was used to create U.S. county and city articles on a large scale.
- In August 2010, more than 10,000 pages were deleted in a copyright infringement case.
- Looking only at the contributions of bots, we can see two peaks, in February 2008 and March 2009. The deep drop between these two modes is explained by a certain level of botophobia after the exponential growth of bots in the previous years, as well as by the fact that a lot of the work for which the bots were designed (e.g. spell checking, enforcing standard notation) had been completed. It is also interesting to note that the majority of bots that are active today were introduced around 2006, and that there are few, and only sporadically active, bots created after 2008.
- The percentage of work done by cohorts that started editing after January 2009 is larger when we filter out bots. This is a consequence of the fact that most active bots are from 2006/2007, but it also leads us to underestimate the work done by newer cohorts when looking at trends that don't filter bots.
- It seems that the number of characters added to Wikipedia saturated between March 2007 and April 2008; since then it has been stable or slightly decreasing.
- Editors that first edited after January 2009 contribute more than 40% of the bytes to the main namespace.
- Editors that first edited after March 2007 contribute more than 60% of the bytes to the main namespace.
- Older editors who started editing before October 2005 have consistently contributed about 20% of all bytes since January 2008.
- Overall, it seems that newer cohorts continue to take over their share of the work in the Wikipedia ecosystem. Following the logic of the editor trend study, we would expect an unhealthy distribution of work among cohorts to be visible in this kind of contribution trend visualization.
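Statements of the form "editors that first edited after a given month contribute more than X% of the bytes" correspond to summing the lower part of one column of the cohort matrix and dividing by the column total. The sketch below shows this on a made-up matrix; the function name `share_since` and the numbers are assumptions for illustration and do not reproduce the figures listed above.

```python
def share_since(matrix, month, first_cohort):
    """Fraction of the bytes in `month` contributed by cohorts that
    started in month `first_cohort` or later (both are 0-based indices)."""
    column = [row[month] for row in matrix]
    total = sum(column)
    if total == 0:
        return 0.0
    return sum(column[first_cohort:]) / total


# Toy matrix: rows are cohorts, columns are months.
toy_matrix = [
    [500, 300, 100],
    [0, 700, 200],
    [0, 0, 900],
]
newest_only = share_since(toy_matrix, month=2, first_cohort=2)
two_newest = share_since(toy_matrix, month=2, first_cohort=1)
```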