Research:Newbie reverts and article length

From Meta, a Wikimedia project coordination wiki
This page in a nutshell: This research tested the theory that new editors' edits are increasingly reverted because newbies are more likely to have their good-faith edits reverted when they edit longer pages. It concluded that newbies are editing longer pages than they used to, and that edits to longer pages have always been more likely to be reverted.
Research project
Newbie reverts and article length
Main contact
Start 2011-05
End 2011-05
Fields computer science
psychology
Open data This project has published open-licensed data
Open access This project has open access publications
WMF support
Wikimedia research projects

This sprint will examine the relationship between page length (as a proxy for the completeness/quality of an article) and the probability that a new editor's edit is reverted.

  • The sample that I'll gather will be divided by the year of an editor's first edit (2001-2010)
  • I'll be removing probable vandals from the dataset using the V_LOOSE+V_STRICT function (see Priedhorsky et al. from GROUP'07).
  • I intend to plot the results and examine the effects using a logistic regression (the dependent variable is True if an editor's sampled edit was reverted and False otherwise).

If page_length is a positive predictor of the probability of being reverted, that should support my hypothesis that new editors' work is more likely to be rejected when they edit more complete articles. If year is a positive predictor, that will refute the gold rush hypothesis (or at least suggest that the gold rush alone is not sufficient to explain the increased proportion of newbie reverts).

Methods

Sample of editors

To gather a sample of editors, I ran a query against a database constructed from the January 30, 2010 dump that I had access to in the GroupLens research lab. First, I grouped editors by the year in which they made their first edit to the English Wikipedia. Then I randomly selected up to 20,000 editors from each first-edit-year group.

SELECT * FROM enwp_dump_20100130.editor
WHERE EXTRACT(YEAR FROM TO_TIMESTAMP(first_edit)) = %(YEAR)s
AND user_id IS NOT NULL
ORDER BY RANDOM()
LIMIT 20000;
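
The same per-year sampling step can be sketched in Python for cases where the editor records are already in memory. This is only an illustration, not the code used for the study: the function name `sample_editors_by_year`, the `seed` parameter, and the use of UTC to derive the year are assumptions, and `first_edit` is assumed to be a Unix timestamp, as the TO_TIMESTAMP call in the query suggests.

```python
import random
from collections import defaultdict
from datetime import datetime, timezone


def sample_editors_by_year(editors, per_year=20000, seed=None):
    """Group registered editors by first-edit year and sample up to per_year from each.

    `editors` is an iterable of dicts with 'user_id' and 'first_edit' keys
    (hypothetical in-memory stand-ins for rows of the editor table).
    """
    rng = random.Random(seed)
    by_year = defaultdict(list)
    for editor in editors:
        if editor["user_id"] is None:  # mirrors "user_id IS NOT NULL"
            continue
        # ASSUMPTION: first_edit is seconds since the epoch, interpreted in UTC.
        year = datetime.fromtimestamp(editor["first_edit"], tz=timezone.utc).year
        by_year[year].append(editor)
    # Sample without replacement; small groups are returned whole (in random order).
    return {
        year: rng.sample(group, min(per_year, len(group)))
        for year, group in by_year.items()
    }
```

Years with fewer than `per_year` editors simply contribute every editor they have, which matches the "up to 20,000" wording above.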

For each of these randomly selected editors, I examined the first 10 edits that they made to main namespace articles.

SELECT 
	r.revision_id,
	r.text_length,
	rvtd.revision_id IS NOT NULL AS reverted,
	rvtd.is_vandalism
FROM enwp_dump_20100130.revision_by_user r
LEFT JOIN enwp_dump_20100130.reverted rvtd USING (revision_id)
WHERE r.username = %(username)s
AND r.namespace = 0
ORDER BY r.TIMESTAMP
LIMIT 10

Sample of editors' work as newbies

From this list of the first 10 edits per editor, I did two things. First, I computed the proportion of these edits that were marked as vandalism (see above). Second, I randomly selected one of the edits to represent a newbie experience for the editor. The selected edit records the length of the article at the time of the edit, whether the edit was reverted, and whether that revert was for vandalism.

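import random

# revs holds this editor's first (up to) 10 main-namespace revisions from the query above.
# After shuffling, revs[0] is a uniformly random pick (assumes revs is non-empty).
random.shuffle(revs)
sampled = revs[0]
row = {
	'user_id': editor['user_id'],
	'edits': len(revs),
	# is_vandalism is NULL (None) for edits that were never reverted
	'v_edits': sum(1 for r in revs if r['is_vandalism']),
	'revision_id': sampled['revision_id'],
	'text_length': sampled['text_length'],
	'reverted':    sampled['reverted'],
	'vandalism':   sampled['is_vandalism']
}
writer.write(row)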

This data was used for the plot and regression below. Note that editors with more than 20% of their first 10 edits marked as vandalism were discarded in an attempt to weed out damaging editors from the analysis.
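
The 20% cutoff can be sketched as a small filter over the rows written above. This is a hedged illustration rather than the original code: the function name `keep_probable_good_faith` is hypothetical, and computing the proportion over the edits an editor actually made (which may be fewer than 10) is an assumption. The `vandal_prop` name follows the column used in the regression call in the Results section.

```python
def keep_probable_good_faith(rows, max_vandal_prop=0.2):
    """Drop editors whose share of vandalism-flagged early edits exceeds the cutoff.

    Each row is a dict with 'edits' and 'v_edits' keys, as in the rows above.
    """
    kept = []
    for row in rows:
        if row["edits"] == 0:
            continue  # no main-namespace edits, so there is nothing to analyze
        # ASSUMPTION: the denominator is the number of sampled edits, not a fixed 10.
        vandal_prop = row["v_edits"] / row["edits"]
        if vandal_prop <= max_vandal_prop:
            kept.append(dict(row, vandal_prop=vandal_prop))
    return kept
```

An editor with exactly 2 of 10 edits flagged is kept, since only editors with more than 20% are discarded.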

Results

Newbie edits by page length and year

The distribution of page length of newbie edits by the year that the newbie started editing.

The plot was generated by randomly sampling one of the first 10 edits to articles (main namespace) by editors who started editing in a given year. It should thus be representative of the types of edits that newbies perform shortly after registering an account. The distribution of page lengths (number of characters) creeps higher exponentially each year (although the growth appears linear due to the log scaling of the x axis). This suggests that the average size of the articles that newbies edit in their first 10 edits has quadrupled since 2003.
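
As a rough check on why exponential growth looks linear on a log axis, the sketch below converts the quadrupling into a constant yearly multiplier and the equal log-scale step it implies. The function name is illustrative, and the 7-year span (2003 through 2010) is an assumption, since the text only says "since 2003".

```python
import math


def annual_growth_factor(total_factor, years):
    """Constant yearly multiplier that compounds to total_factor over `years` years."""
    return total_factor ** (1.0 / years)


# ASSUMPTION: the quadrupling spans 7 years (2003 through 2010).
years = 7
factor = annual_growth_factor(4.0, years)

# On a log10 axis, a constant yearly multiplier shifts the distribution by the
# same distance every year, which is why the trend looks like a straight line.
log10_step = math.log10(factor)

print(f"yearly multiplier: {factor:.3f}")
print(f"log10 shift per year: {log10_step:.3f}")
```

Under that assumed span, a quadrupling corresponds to page lengths growing by roughly 22% per year.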



Regression

Call:
glm(formula = reverted ~ sc(text_length) * sc(first_edit_year), 
    family = binomial(link = "logit"), data = newbie_initial_edit[newbie_initial_edit$vandal_prop <= 
        0.2, ])

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.6194  -0.4146  -0.3430  -0.2756   2.8578  

Coefficients:
                                    Estimate Std. Error  z value Pr(>|z|)    
(Intercept)                         -2.72615    0.01207 -225.801  < 2e-16 ***
sc(text_length)                      0.23049    0.01076   21.429  < 2e-16 ***
sc(first_edit_year)                  0.40717    0.01188   34.272  < 2e-16 ***
sc(text_length):sc(first_edit_year) -0.02878    0.01026   -2.805  0.00504 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 64886  on 131225  degrees of freedom
Residual deviance: 62782  on 131222  degrees of freedom
AIC: 62790

Number of Fisher Scoring iterations: 6

This regression shows that both the size of the article being edited (text_length) and the year in which an editor started editing (first_edit_year) are significant, positive predictors of the probability of being reverted (i.e., the longer the article, the more likely a newbie's edit will be reverted, and the more recently they started editing, the more likely a newbie's edit will be reverted).

However, the interaction of these two variables (text_length:first_edit_year) is a significant, negative predictor of the probability of being reverted, meaning that the more positive either variable becomes, the smaller the effect of the other. This could mean that it is actually easier for newbies to successfully edit long articles than it used to be. It could also be interpreted as a change in the notion of what makes an article "long".
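
To make the coefficients more concrete, the sketch below plugs them into the logistic model. It is a reading aid under stated assumptions: `sc()` is not defined on this page, so it is assumed here to standardize each variable to mean 0 and standard deviation 1, and the function names are illustrative.

```python
import math

# Coefficients copied from the regression output above.
INTERCEPT = -2.72615
B_LENGTH = 0.23049
B_YEAR = 0.40717
B_INTERACTION = -0.02878


def revert_probability(z_length, z_year):
    """Predicted P(reverted) for standardized inputs.

    ASSUMPTION: sc() yields z-scores, so 0 means "average" and 1 means
    one standard deviation above average.
    """
    log_odds = (
        INTERCEPT
        + B_LENGTH * z_length
        + B_YEAR * z_year
        + B_INTERACTION * z_length * z_year
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


def length_slope(z_year):
    """Change in log-odds per one-unit increase in sc(text_length) at a given year."""
    return B_LENGTH + B_INTERACTION * z_year


print(f"average length, average year: {revert_probability(0, 0):.3f}")
print(f"length slope one unit before the mean year: {length_slope(-1):.3f}")
print(f"length slope one unit after the mean year:  {length_slope(1):.3f}")
```

Under that assumption, an edit to an average-length page by an editor from an average cohort has a predicted revert probability of about 6%. The slope for text_length shrinks from about 0.26 to about 0.20 as the cohort year moves from one unit below to one unit above the mean, so it weakens over time but stays positive.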

Summary

Newbies are editing longer pages in their first 10 edits than they used to. The plot above shows that the average length of pages edited by a newbie in their first 10 edits has quadrupled since 2003. Newbies are also getting reverted more in recent years than they used to be. The regression suggests that these two effects are largely independent (i.e., newbies were more likely to be reverted for editing long pages even when Wikipedia was young, and newbies are more likely to be reverted for editing short pages now than before), with only a small negative interaction between them.

Future Work

  • What is it about long articles that makes them hard for newbies to successfully edit?
    • Is it just harder for a newbie to contribute productively to an already-complete article?
    • Is there a confounding factor (like the number of other editors active on an article) that results in newbies being reverted?
  • Can we predict how long a newbie will stick with Wikipedia from the length of the pages they start editing?
  • What types of edits do newbies make to long/short articles and does the difference in edit type matter more than where it is done?