Research:Template A/B testing

This page is an interwiki hub for previous and ongoing A/B testing experiments with Wikipedia templates, especially user talk templates delivered via bots or semi-automated tools such as Huggle, Twinkle, and Extension:NewUserMessage.

Background

Motivation: Why user talk templates matter

As Wikipedia projects have evolved over the past 11 years, templated messages have become an increasingly common tool for communicating with editors, especially those who are new and anonymous. Prior to approximately 2006, most of the user talk page messages left for new editors were normal human conversation, customized for each person and for their contribution history. In the face of enormous growth in the number of editors and in the volume of communication needed between editors and about articles, the community developed several messaging systems, which rely on templates, to simplify and speed up communication about common issues.

The Wikimedia Summer of Research in 2011 provided new insight into the changing nature of communication on Wikipedia (Summary of Findings). Today, for example, close to 80% of first messages to new English Wikipedians are sent using a bot or semi-automated editing tool (see fig. 1).

Hypothesis: what we tested

Our core hypothesis, which is informed by the motivation above, is that the increase in automated template messages to new users strongly correlates with the decline in retention of new editors to Wikipedia, and that improving these templates will lead to an increase in editor retention.

In particular, we tested the hypothesis that by changing the style, tone, and content of a user talk template message, we would be able to improve communication with new or anonymous editors, and thereby encourage positive contributions and/or discourage unwanted contributions.

Projects with tests

How template A/B testing works

Two summer researchers, Aaron Halfaker and R. Stuart Geiger, designed the original experiment to test this hypothesis. The method used is roughly as follows:

A switch parser function pseudo-randomly selects either the test or the control template, based on the current time.
The chosen template is substituted onto the user talk page and appears as a normal message.
Blank templates transcluded in each test or control message make it possible to track which version each editor received.
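
The steps above can be modelled in a few lines of Python. This is only a sketch of the idea, not the wikitext used in the tests: the template and tracker names are hypothetical, and taking the seconds field of the timestamp modulo the number of variants is an assumed way of switching "based on time".

```python
from datetime import datetime, timezone

# Hypothetical template and tracker names, for illustration only.
VARIANTS = {
    0: {"template": "Example-warning-control", "tracker": "Example-tracker-control"},
    1: {"template": "Example-warning-test", "tracker": "Example-tracker-test"},
}


def pick_bucket(now: datetime, n_buckets: int = 2) -> int:
    """Choose a bucket from the seconds field of the timestamp.

    Assumption: the time-based switch behaves like seconds modulo the
    number of variants. This is pseudo-random with respect to the
    recipient, not cryptographically random.
    """
    return now.second % n_buckets


def build_message(now: datetime) -> str:
    """Return the wikitext a delivery tool would save to the talk page.

    The message template is substituted, so it reads like a normal
    message; the blank tracker template stays transcluded, so pages that
    received this variant can be listed later.
    """
    variant = VARIANTS[pick_bucket(now, len(VARIANTS))]
    return "{{subst:%s}}{{%s}}" % (variant["template"], variant["tracker"])
```

For example, calling `build_message(datetime.now(timezone.utc))` yields the control wikitext on even seconds and the test wikitext on odd seconds, so over many deliveries the two groups end up roughly the same size.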

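Please note: These were live experiments run on real user talk pages. Assignment to the test or control group was determined by the time-based switch described above, not by the person or tool delivering the message.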

How we analyzed the data

We used mixed methods to analyze these tests. Our base metric was whether new versions of templates led to more edits to articles by a contributor. Increasing the quantity of edits was not our only or primary goal, however, and most of the tests also required a closer look at certain kinds of editing activity, depending on the context or activity for which an editor received a specific template message. For example, if a user was being warned for vandalism, increased editing activity (which could be more vandalism) would not necessarily be a positive outcome.

In our early tests, we also spent a considerable amount of time categorizing the type and quality of contributions by hand. Later, we relied more on quantitative methods to isolate changes in the activity of good-faith editors. For example, controlling for the relative experience of an editor before they were warned greatly increased our statistical confidence in the results.

Important note: for all tests where we did not do any hand-checking for quality, we did not include editors who went on to be blocked for any period of time, so any increase or decrease in editing activity was among editors who were not vandalism-only accounts. We also always focused on editors for whom this was their first warning. However, some tests involved measuring the rate at which users went on to be blocked after receiving a template message on their talk page. In those tests, blocked editors were necessarily included in the analysis.
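
The inclusion rules above can be sketched as follows. This is an illustrative outline only, not the analysis code used for the tests: the `Recipient` record and its field names are hypothetical, and a simple per-group mean stands in for whatever statistics were actually computed.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Recipient:
    """One editor who received a template (hypothetical record layout)."""
    user: str
    group: str                # "test" or "control"
    article_edits_after: int  # article edits made after the message
    was_blocked: bool         # blocked for any period after the message
    prior_warnings: int       # warnings received before this one


def mean_edits_by_group(recipients):
    """Mean article edits per group for first-warning, never-blocked editors."""
    by_group = {}
    for r in recipients:
        if r.prior_warnings == 0 and not r.was_blocked:
            by_group.setdefault(r.group, []).append(r.article_edits_after)
    return {group: mean(edits) for group, edits in by_group.items()}


def block_rate_by_group(recipients):
    """Share of first-warning editors later blocked; blocked editors stay in."""
    totals, blocked = {}, {}
    for r in recipients:
        if r.prior_warnings == 0:
            totals[r.group] = totals.get(r.group, 0) + 1
            blocked[r.group] = blocked.get(r.group, 0) + int(r.was_blocked)
    return {group: blocked[group] / totals[group] for group in totals}
```

Keeping the two functions separate mirrors the note above: the edit-count comparison drops blocked editors, while the block-rate comparison has to keep them.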

Test results

Read our summary of findings from all the completed tests to date.

Credits

These tests have mostly been run by Maryana Pinchuk and Steven Walling, with data analysis led by Ryan Faulkner. The testing methodology was created by Aaron Halfaker and R. Stuart Geiger, who also performed other key analyses. However, these tests would not have been possible without help from many Wikimedians, especially:

Want to start a test on your project?

You're strongly encouraged to contact User:Maryana (WMF) and User:Steven (WMF) if you'd like to run a test, as they are available to help coordinate new tests and analyze data. Some projects are also eligible for enhanced tracking stats on any templates, such as anonymized logging of who actually read their talk page messages.