The long-term existence of Wikimedia projects relies on those who actively contribute to them by spell-checking, formatting, writing new content, patrolling, and socializing. However, as shown by other researchers, the number of active editors in Wikipedia as a whole has been declining. Editor decline studies so far have focused on understanding the decline and describing the reasons for it. This study instead operationalizes our knowledge of editor decline through prediction models: our goal is to predict whether an active editor will be active again at a pre-specified point in the future. Such prediction models can help us:
- become more proactive than reactive;
- assess the effectiveness of the tools we develop with very large volumes of data and on an ongoing basis;
- design customized experiences for the users.
- Active editor
- An active editor in month m is one who makes 5 or more edits in any namespace in that month.
- An active editor survives month m if they make 5 or more edits in that month.
- Prediction model
- A model used to predict the probability of an outcome.
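The two definitions above can be sketched in a few lines. This is a minimal illustration, not the study's code: the record layout `(user_id, month, namespace, edits)` and the function names are assumptions made for the example.

```python
from collections import defaultdict

ACTIVE_THRESHOLD = 5  # "5+ edits in any namespace" from the definition


def monthly_totals(records):
    """Sum edits over all namespaces for each (user_id, month).

    `records` is an iterable of (user_id, month, namespace, edits) tuples;
    this layout is hypothetical.
    """
    totals = defaultdict(int)
    for user_id, month, _namespace, edits in records:
        totals[(user_id, month)] += edits
    return dict(totals)


def is_active(totals, user_id, month):
    """True if the editor reached the threshold in that month."""
    return totals.get((user_id, month), 0) >= ACTIVE_THRESHOLD


# Editor 1 has 3 + 2 edits across two namespaces; editor 2 has only 4.
records = [
    (1, "2014-01", 0, 3),
    (1, "2014-01", 1, 2),
    (2, "2014-01", 0, 4),
]
totals = monthly_totals(records)
```

Under these definitions, surviving month m is the same check as being active in month m, so `is_active` covers both.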
We build prediction models for active editor survival that will help us:
- Predict future activity on the projects along with reported accuracy;
- Cluster the active editor space based on the probability of survival or risk of not surviving;
- Identify the most important variables (characteristics) that can help us predict survival.
16,...,... records representing the monthly activity of 7,333,182 editors on enwiki are used for this analysis. The data covers editors who registered between 2001-01-21 and 2014-03-31. The monthly edit data aggregates each user's edits in each namespace per month.
The following independent variables are used to build the prediction models.
Vintage
- For each (edit month, user id), the number of months since registration
Active editor score
- Total number of edits by the active editor prior to the last edit month
- Number of months between the registration date and the edit month in which the editor made 5+ edits
- Proportion of the editor's total months in the system in which the editor made 5+ edits
Active editor span
- Maximum number of consecutive months in which the editor made 5+ edits
Active editor break
- Number of months between the current edit month and the most recent month in which the editor made 5+ edits
Current level of activity
- Total number of content namespace edits in the current edit month
Early level of activity
- Total number of content namespace edits in the registration month
- Registration day of the week
- Registration day of the month
- Registration week of the month
- Registration month of the year
- Registration year
First time editor
- A binary variable indicating whether the current edit month is the editor's first month with 5+ content namespace edits after registration
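Several of the variables listed above can be derived from one editor's month-by-month edit counts. The sketch below is only an illustration under stated assumptions: `monthly_edits[0]` is taken to be the registration month and the last entry the current edit month; "total months in the system" counts both of those months; the active editor break looks back from the current month to the most recent earlier active month; and week of the month is computed as `(day - 1) // 7 + 1`. None of these choices is confirmed by the text, and the function and key names are hypothetical.

```python
from datetime import date

ACTIVE_THRESHOLD = 5  # 5+ edits per month


def activity_features(monthly_edits, threshold=ACTIVE_THRESHOLD):
    """Features from total edits per month, registration month first."""
    if not monthly_edits:
        raise ValueError("need at least the registration month")
    active = [n >= threshold for n in monthly_edits]

    # Active editor span: longest run of consecutive active months.
    longest = run = 0
    for flag in active:
        run = run + 1 if flag else 0
        longest = max(longest, run)

    # Active editor break: months back to the most recent earlier
    # active month (None if there is no earlier active month).
    months_since_active = None
    for back in range(1, len(active)):
        if active[-1 - back]:
            months_since_active = back
            break

    return {
        "vintage": len(monthly_edits) - 1,
        "total_prior_edits": sum(monthly_edits[:-1]),
        "active_months": sum(active),
        "active_proportion": sum(active) / len(active),
        "active_span": longest,
        "active_break": months_since_active,
    }


def registration_features(registered):
    """Calendar features of the registration date."""
    return {
        "day_of_week": registered.weekday(),  # Monday = 0, Sunday = 6
        "day_of_month": registered.day,
        "week_of_month": (registered.day - 1) // 7 + 1,  # assumed definition
        "month": registered.month,
        "year": registered.year,
    }


features = activity_features([6, 0, 7, 8, 2, 5])
reg = registration_features(date(2001, 1, 21))
```

For the example history `[6, 0, 7, 8, 2, 5]`, the editor is active in four of six months, the longest active streak is two months, and the last active month before the current one lies two months back.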
The data is divided into two sets: training and test. The training set contains 80% of the records and is used to train the prediction models. The test set contains the remaining 20% of the records and is used to test the performance of the models.
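The text does not say how records are assigned to the two sets. A minimal sketch, assuming a uniformly random split at the record level with a fixed seed for reproducibility (both are assumptions, as is the function name):

```python
import random


def split_records(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records and cut it into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = split_records(range(10))
```

A record-level split can place the same editor's months in both sets; splitting by user id or by time would avoid that, but whether the study did so is not stated.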
Classification trees, specifically GUIDE, are used to build the prediction models. They have very few tuning parameters. Furthermore, the output of the algorithm is a binary tree that is easy to implement and visualize.
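To make the idea of a binary tree concrete, the sketch below finds a single split on one numeric variable by minimizing weighted Gini impurity. This is a CART-style criterion swapped in for illustration; it is not the GUIDE algorithm, which selects split variables differently, and the function names and toy data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best binary split.

    Returns (None, impurity of the unsplit node) if no split improves on it.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(list(labels)))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Toy data: editors with many edits this month all survive.
current_edits = [1, 2, 3, 10, 12, 15]
survived = [0, 0, 0, 1, 1, 1]
split = best_split(current_edits, survived)
```

A full tree applies such a split recursively to each resulting subset; each internal node is then a simple rule of the form "variable <= threshold", which is what makes the output easy to implement and visualize.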
Prediction models for the 1-month, 6-month, and 12-month time periods are built. For a model predicting n-month survival, the outcome is 1 if the editor is active in the month that falls n months after the reference month, and 0 otherwise. Note that we do not require the editor to stay continuously active during the n-month period.
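The outcome definition can be sketched as follows. Months are represented as `(year, month)` tuples and an editor's history as the set of months with 5+ edits; both representations and the function names are assumptions made for the example.

```python
def add_months(month, n):
    """Shift a (year, month) tuple forward by n months."""
    year, mon = month
    index = year * 12 + (mon - 1) + n
    return (index // 12, index % 12 + 1)


def survival_label(active_months, reference_month, n):
    """1 if the editor is active exactly n months after the reference month.

    Activity in the months in between is deliberately ignored.
    """
    return 1 if add_months(reference_month, n) in active_months else 0


# Active in Jan and Feb 2013, then inactive until Jan 2014.
active_months = {(2013, 1), (2013, 2), (2014, 1)}
reference = (2013, 1)
labels = {n: survival_label(active_months, reference, n) for n in (1, 6, 12)}
```

In this example the editor counts as a 12-month survivor despite being inactive for most of the year, which is the consequence of not requiring continuous activity.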