Research:Ignored period and retention
Given the recent decline in the retention of new editors, it is important to see which styles of community response have a positive or negative effect on them. It might be that the longer a new editor goes without interaction with other people, the less motivated he/she is to contribute. Or it could be that too much communication with the community early on is stressful for a new editor. I would like to find out whether the data show a tendency supporting either of the two theories.
- Does the length of ignored period affect editor retention? (RQ1.1)
The variables below are collected from the edit histories of articles and user talk pages for each user who started editing in each of the years between 2003 and 2011, using the Wikilytics databases and Toolserver databases. I attempt to see whether the length of the ignored period differs between "retained" and "leaving" editors by observing the trends in the length of ignored periods over time.
- Ignored period: the period between a new editor's first edit and the first message to him/her.
- Retention: a binary variable showing whether the editor makes more than N edits per month one year after his/her first edit, where N is 10, 30 or 50.
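The two definitions above can be written down as a minimal Python sketch. The function names and the idea of passing dates and counts directly are assumptions made for illustration; they are not taken from the original analysis scripts.

```python
from datetime import datetime, timedelta


def ignored_period(first_edit: datetime, first_message: datetime) -> timedelta:
    """Time between an editor's first edit and the first message to him/her.

    The result is negative when the message is recorded before the first edit.
    """
    return first_message - first_edit


def is_retained(edits_per_month_one_year_later: int, n: int = 10) -> bool:
    """True if the editor makes more than n edits per month one year after
    the first edit (n is 10, 30 or 50 in this study)."""
    return edits_per_month_one_year_later > n
```

Note that the threshold is strict: an editor with exactly N edits per month counts as "leaving".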
The resulting table of the relevant variables consists of the following columns:
- year: the year the editor started editing
- num_edits: the number of edits the editor made in the first year
- editor_id: MediaWiki internal ID of the editor
- first_edit_date: the date the editor started editing
- fe_rev_id: the revision ID of the first edit (the revision's URL is 'http://en.wikipedia.org/w/index.php?oldid=' + fe_rev_id)
- message_date: the date the editor received the first message on his/her talk page
- m_rev_id: the revision ID of the first message
- delta: the duration between the first edit and first message
- edits_1yl: the number of edits per month, one year after the first edit
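As an illustration, one row of this table could be modelled as below. This is only a sketch: the class name `EditorRecord`, the field types, and the choice to derive `delta` instead of storing it are assumptions. The helper builds the revision URL with MediaWiki's `oldid` query parameter.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class EditorRecord:
    """One row of the result table (field types are assumed)."""
    year: int                  # year the editor started editing
    num_edits: int             # edits made in the first year
    editor_id: int             # MediaWiki internal ID of the editor
    first_edit_date: datetime  # date of the first edit
    fe_rev_id: int             # revision ID of the first edit
    message_date: datetime     # date of the first talk page message
    m_rev_id: int              # revision ID of the first message
    edits_1yl: int             # edits per month, one year after the first edit

    @property
    def delta(self) -> timedelta:
        """Ignored period: first message date minus first edit date."""
        return self.message_date - self.first_edit_date


def revision_url(rev_id: int) -> str:
    """URL of a revision on the English Wikipedia."""
    return f"http://en.wikipedia.org/w/index.php?oldid={rev_id}"
```

For example, `revision_url(record.fe_rev_id)` gives the link to an editor's first edit, which is handy when checking individual anomalies by hand.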
To gauge the effectiveness of the analysis, a preliminary experiment was done by randomly picking 1% of all users in the English Wikipedia and running the analysis on that limited user set.
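A random 1% sample of this kind can be drawn reproducibly as in the following sketch. The function name, the fixed seed and the sorting step are assumptions for illustration; the original sampling procedure is not documented here.

```python
import random


def sample_editors(editor_ids, fraction=0.01, seed=0):
    """Return a reproducible random sample of editor IDs.

    fraction=0.01 corresponds to the 1% sample; seed is an arbitrary
    value chosen only so that the sample can be regenerated.
    """
    ids = sorted(editor_ids)  # fixed order so the seed fully determines the sample
    if not ids:
        return []
    k = max(1, round(len(ids) * fraction))
    return random.Random(seed).sample(ids, k)
```

Fixing the seed means the reduced dataset can be rebuilt later, which makes anomalies seen in the sample easier to re-examine.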
Results and discussion
- Average length of ignored period is decreasing
- All charts above suggest that the ignored period is shortening over the years, although several anomalies are found. One theory to explain this trend is the increasing use of bots and templates, which deliver messages to talk pages efficiently. An exceptionally long ignored period is observed in 2006, when the Wikipedia community had too many people to welcome. In early 2007, the English Wikipedia had the largest number of new Wikipedians per month.
- Shift in the effect of early messages
- 2006 is also the point when the effect of early messages reversed. Before 2006, retained editors got their first interactions earlier than leaving editors did, which is most clearly seen in the second chart (with the threshold of 30 edits between retained/leaving); after 2006 the trend is the opposite.
- Efficiency of the analysis on the randomly reduced dataset
- The analysis on the randomly reduced dataset saved 95% of the computational time that was required when I used the whole dataset. However, most likely due to the lack of enough samples, the reduced dataset contained more anomalies that were not found in the whole dataset. Preliminary charts from those samples can be found at commons:File:Ignored period 1% sample 10 edits EN wiki.svg, commons:File:Ignored period 1% sample 30 edits EN wiki.svg and commons:File:Ignored period 1% sample 50 edits EN wiki.svg.
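The comparison behind these results, the average ignored period per starting year for retained and leaving editors, can be sketched as follows. The function name and the `(year, delta_days, edits_1yl)` tuple layout are assumptions, and the use of the arithmetic mean is inferred from the phrase "average length"; the charts may have been produced differently.

```python
from collections import defaultdict
from statistics import mean


def mean_delta_by_year(rows, n=30):
    """Average ignored period (in days) per (year, group).

    rows: iterable of (year, delta_days, edits_1yl) tuples.
    n:    retention threshold; editors with more than n edits per month
          one year later are labelled "retained", the rest "leaving".
    """
    groups = defaultdict(list)
    for year, delta_days, edits_1yl in rows:
        label = "retained" if edits_1yl > n else "leaving"
        groups[(year, label)].append(delta_days)
    return {key: mean(values) for key, values in groups.items()}
```

Running it with n set to 10, 30 and 50 gives one set of per-year values for each of the three thresholds.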
This analysis indicates that some earlier interactions can have a negative impact on the retention of new editors. Contrary to the speculation that early messages motivate new editors to contribute, after 2006 retained editors are found to have longer ignored periods than leaving editors do.
- What types of early messages have a positive impact on retention? The negative effect seen after 2006 might be caused by the increasing number of edits by bots. The same analysis with bot edits excluded from this dataset would give insight into how much non-automated messages matter.
- Clearer presentation of the 'shift'? The charts I have now might not be clear enough to show the 'shift'. We might be able to think of other ways to look at the dataset.
- Other types of signals to capture the start of interactions? When finding the point in time at which an editor starts interacting, talk page edits made by him/her should be included.
- How different are the trends in different languages? Although the current dataset is built on top of the Wikilytics databases, it is not hard to use the Toolserver databases to create the same dataset for other languages, especially those with fewer revisions than the largest Wikipedias.
- Unexplained anomalies. The bars for 2003 and 2005 seem to be off the trend. The drop in 2003 could be explained by the lack of records in the database. Currently I have no clue about the drop in 2005.
- Excluding negative ignored periods? A negative ignored period is found when the record shows that an editor received a message before his/her first edit. In this dataset, roughly 10% of new editors have negative ignored periods, probably because their first edits were deleted.
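If negative ignored periods are to be excluded, the filtering step could look like the sketch below. The function name and the representation of ignored periods as a list of day counts are assumptions for illustration.

```python
def drop_negative_periods(deltas_days):
    """Split off negative ignored periods.

    deltas_days: list of ignored periods in days (may contain negatives).
    Returns (kept, negative_share), where kept holds the non-negative
    values and negative_share is the fraction of values that were dropped.
    """
    kept = [d for d in deltas_days if d >= 0]
    if not deltas_days:
        return kept, 0.0
    negative_share = (len(deltas_days) - len(kept)) / len(deltas_days)
    return kept, negative_share
```

Reporting the dropped share alongside the filtered results makes it possible to see whether the exclusion affects some starting years more than others.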