Research:Survival

Contact

Wikimedia Foundation

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

The long-term existence of Wikimedia projects relies on those who actively contribute to them by spell-checking, formatting, writing new content, patrolling, and socializing. However, as other researchers have shown, the number of active editors on Wikipedia as a whole has been declining. Editor decline studies so far have focused on understanding the decline phenomenon and describing the reasons for it. This study focuses on operationalizing our knowledge of editor decline through prediction models: our goal is to predict whether an active editor will return at a pre-specified point in the future. Such prediction models can help us:

become more proactive than reactive;
assess the effectiveness of the tools we develop with very large volumes of data and on an ongoing basis;
design customized experiences for the users.

Definitions

Active editor: An active editor in month m is one who makes 5+ edits in any namespace in that month.
Survival: An active editor survives month m if they make 5+ edits in that month.
Prediction model: A model used to predict the probability of an outcome.
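The first two definitions can be sketched in a few lines of Python. This is only an illustration: the function names and the input layout (a mapping from namespace id to edit count, nested under a month key) are assumptions, and "5+ edits in any namespace" is read here as 5 or more edits summed over all namespaces.

```python
from typing import Dict

ACTIVE_THRESHOLD = 5  # "5+ edits" from the definitions above


def is_active(edits_by_namespace: Dict[int, int]) -> bool:
    """True if the edits summed over all namespaces reach the threshold.

    Assumption: the threshold applies to the total across namespaces.
    """
    return sum(edits_by_namespace.values()) >= ACTIVE_THRESHOLD


def survives(monthly: Dict[str, Dict[int, int]], month: str) -> bool:
    """True if the editor makes 5+ edits in the given month (e.g. "2014-03")."""
    return is_active(monthly.get(month, {}))
```

For example, an editor with 3 edits in namespace 0 and 2 edits in namespace 1 in a given month counts as active under this reading.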

Research goals

We build prediction models for active editor survival that will help us:

Predict future activity on the projects along with reported accuracy;
Cluster the active editor space based on the probability of survival or risk of not surviving;
Identify the most important variables (characteristics) that can help us predict survival.

Data

Sampling

16,...,... records representing 7,333,182 editors' monthly activities on enwiki are used for this analysis. The data span the registration period from 2001-01-21 to 2014-03-31. The monthly edit data show each user's edits aggregated per namespace per month.

Variables

The following independent variables are used to build the prediction models.

Vintage: For each (edit month, user id),

The number of months since registration.
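As a minimal sketch, vintage can be computed as a calendar-month difference. The function name is hypothetical, and ignoring the day of the month is an assumption; the study may count months differently.

```python
from datetime import date


def vintage(registration: date, edit_month: date) -> int:
    """Whole calendar months from the registration month to the edit month.

    Assumption: only year and month are compared; the day is ignored.
    """
    return (edit_month.year - registration.year) * 12 + (
        edit_month.month - registration.month
    )
```

Under this convention an editor has vintage 0 in the registration month.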

Active editor score

Total number of edits by the active editor prior to the last edit month
Number of months between the registration date and the edit month in which the editor made 5+ edits
Proportion of the editor's total months in the system in which the editor made 5+ edits
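The three score variables could be derived from a per-editor list of monthly edit totals, as in the sketch below. The input layout (index 0 is the registration month, the last entry is the current edit month), the inclusive month range, and all names are assumptions made for illustration.

```python
from typing import Dict, List


def active_editor_score(monthly_edits: List[int], threshold: int = 5) -> Dict[str, float]:
    """Compute the three score variables from monthly edit totals.

    Assumption: monthly_edits runs from the registration month (index 0)
    to the current edit month (last entry), both inclusive.
    """
    if not monthly_edits:
        raise ValueError("monthly_edits must contain at least one month")
    active_months = sum(1 for n in monthly_edits if n >= threshold)
    return {
        # Edits made before the last (current) edit month.
        "prior_edits": sum(monthly_edits[:-1]),
        # Months with 5+ edits between registration and the edit month.
        "active_months": active_months,
        # Share of all months in the system that had 5+ edits.
        "active_proportion": active_months / len(monthly_edits),
    }
```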

Active editor span

Maximum length of time, in months, during which the editor continuously made 5+ edits/month

Active editor break

Number of months between the current edit month and the last month in which the editor made 5+ edits
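Span and break can both be read off the same monthly series. The sketch below assumes a list of monthly edit totals ending at the current edit month, and assumes that the break looks only at months before the current one; the function names are hypothetical.

```python
from typing import List, Optional


def active_span(monthly_edits: List[int], threshold: int = 5) -> int:
    """Longest run of consecutive months with 5+ edits."""
    longest = current = 0
    for n in monthly_edits:
        current = current + 1 if n >= threshold else 0
        longest = max(longest, current)
    return longest


def active_break(monthly_edits: List[int], threshold: int = 5) -> Optional[int]:
    """Months between the current (last) month and the most recent earlier
    month with 5+ edits; None if there is no such earlier month.

    Assumption: the current month itself is excluded from the search.
    """
    current_index = len(monthly_edits) - 1
    for i in range(current_index - 1, -1, -1):
        if monthly_edits[i] >= threshold:
            return current_index - i
    return None
```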

Current level of activity

Total number of content namespace edits in the current edit month

Early level of activity

Total number of content namespace edits in the registration month

Registration date

Registration day of the week
Registration day of the month
Registration week of the month
Registration month of the year
Registration year
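The registration-date variables are simple calendar fields. A possible extraction is sketched below; the week-of-month rule (days 1 to 7 are week 1, days 8 to 14 are week 2, and so on) and the Monday = 1 weekday numbering are assumptions, since the page does not say how these were encoded.

```python
from datetime import date
from typing import Dict


def registration_features(registered: date) -> Dict[str, int]:
    """Split a registration date into the five calendar variables."""
    return {
        "day_of_week": registered.isoweekday(),  # 1 = Monday ... 7 = Sunday
        "day_of_month": registered.day,
        # Assumption: week of month counts 7-day blocks from the 1st.
        "week_of_month": (registered.day - 1) // 7 + 1,
        "month_of_year": registered.month,
        "year": registered.year,
    }
```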

First time editor

A binary variable indicating whether the current edit month is the editor's first month with 5+ content namespace edits after registration

Methods

The data is divided into two sets: training and test. The training set contains 80% of the records and is used to train the prediction models. The test set contains the remaining 20% of the records and is used to test the performance of the models.
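An 80/20 split like the one described could be sketched as follows. The page does not say how records were assigned to each set, so the uniform random shuffle, the fixed seed, and the function name are assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(
    records: Sequence[T], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the records and cut them into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]
```

Note that splitting at the record level, as done here, can place different months of the same editor in both sets; whether the study split by record or by editor is not stated.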

Classification trees, specifically GUIDE^[1], are used to build the prediction models. These trees have very few tuning parameters. Furthermore, the output of the algorithm is a binary tree that is easy to implement and visualize.
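To illustrate why a fitted binary tree is easy to implement, the sketch below walks a small tree from the root to a leaf. It does not implement GUIDE itself, only the application of an already fitted tree. The node classes, feature names, thresholds, and probabilities are all hypothetical and are not results of this study.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    probability: float  # estimated probability of survival at this leaf


@dataclass
class Split:
    feature: str
    threshold: float
    left: Union["Split", Leaf]   # followed when value <= threshold
    right: Union["Split", Leaf]  # followed when value > threshold


def predict(node: Union[Split, Leaf], features: Dict[str, float]) -> float:
    """Follow the splits down to a leaf and return its probability."""
    while isinstance(node, Split):
        go_left = features[node.feature] <= node.threshold
        node = node.left if go_left else node.right
    return node.probability


# Hypothetical tree for illustration only; the numbers are made up.
EXAMPLE_TREE = Split(
    "current_edits", 10,
    Leaf(0.2),
    Split("active_break", 1, Leaf(0.8), Leaf(0.5)),
)
```

Each prediction is a short chain of comparisons, which is what makes such a model straightforward to port into production code or to draw as a diagram.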

Prediction models are built for 1-month, 6-month, and 12-month time periods. For a model predicting n-month survival, the outcome is 1 if the editor is active in the month that falls n months after the reference month, and 0 otherwise. Note that we do not require the editor to stay continuously active during the n-month period.
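The outcome definition can be sketched as a lookup on the target month only. The input layout (a mapping from an integer month index to the editor's total edits in that month) and the function name are assumptions for illustration.

```python
from typing import Dict


def survival_outcome(
    edits_by_month: Dict[int, int], reference_month: int, n: int, threshold: int = 5
) -> int:
    """Return 1 if the editor has 5+ edits in month reference_month + n, else 0.

    Months between the reference month and the target month are ignored,
    so continuous activity is not required.
    """
    target_edits = edits_by_month.get(reference_month + n, 0)
    return 1 if target_edits >= threshold else 0
```

For instance, an editor who is active in month 0, silent in months 1 to 5, and active again in month 6 has outcome 0 for the 1-month model and outcome 1 for the 6-month model.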

Results

References

[1] Loh, Wei-Yin (2009). "Improving the precision of classification trees". Annals of Applied Statistics 3: 1710–1737. doi:10.2307/27801568.