Template A/B testing/Results

This is a final report and summary of findings from more than six months of A/B testing user talk templates on Wikimedia projects. Conclusions are bolded below for easier skimming, but you're highly encouraged to compare the different versions of the templates and read our more detailed analysis.

Warnings about vandalism[edit]

Further reading: Huggle data analysis notes

Huggle is a powerful vandal-fighting tool used by many editors on English, German, Portuguese, and other Wikipedias. The speed and ease of using Huggle means that reverts and warnings can be issued in very high numbers, so it is not surprising that Huggle warnings account for a significant proportion of first messages to new editors.

Together with the original Huggle test run during the Summer of Research, we ran a total of six Huggle tests on two Wikipedias: English and Portuguese.

Example templates tested:

Fig. 3 - This is the current default level 1 vandalism warning on English Wikipedia
Fig. 4 - This version was one of our most successful.

English[edit]

Our first experiment tested many different variables in the standard level one vandalism warning: image versus no image; and more teaching language, more personalization, or a mix of both. The results showed us that it is better to test a small number of variables, but there were some indications that increased personalization led to better outcomes. This test reached 3241 registered and unregistered editors, of whom 1750 clicked through from the "new messages" banner. More details about our first analysis round are available.

First experiment templates:

Iterating on the results from the first Huggle test, we tested the default level one vandalism warning against the personalized warning, as well as an even more personalized version that also removed some of the "directives"-style language and links to general policy pages. There were 2451 users in this test, whom we hand-coded into the following subcategories:

  • 420 were 'vandals' whose edits obviously should have been reverted and who may have merited an immediate block. Examples: 1, 2.
  • 982 were 'bad faith editors', people committing vandalism for whom the simple level 1 warning they received was appropriate. Examples: 1, 2
  • 702 were 'test editors', people making test edits that should be reverted for quality but who were not obvious vandals. Examples: 1 and 2
  • 347 were 'good faith editors', who were clearly trying to improve the encyclopedia in an unbiased, factual way. Examples: 1, 2

Our results clearly showed that the two new warning messages did better than the default at retaining good-faith users (those who were not subsequently blocked) in the short term. However, as time passed (and, presumably, those users received more default template messages on their talk pages for other editing activity, messages that were not personalized), retention rates for the control and test groups converged.

Second experiment templates:

Read about an editor from this test.

In our last test of vandalism warnings, we tested our best performing template message from the second test against the default and a much shorter new message written by a volunteer. Results showed that both new templates did better at encouraging good faith editors to contribute to articles than the default in the short term. The personal/no-directives version was a little more successful than the short message at getting users to edit outside of the article space (e.g., user talk), due to its invitation to ask questions of the warning editor.

Third experiment templates:

Portuguese[edit]

Working with community members from Portuguese Wikipedia, we designed a test similar to our second Huggle experiment, testing a personalized version, a personalized version with no directives, and the default level one vandalism warning.

Our total sample size in this test was 1857 editors. The small number of registered editors warned during the course of the test (half as many as in the English Huggle 2 test) meant that the sample was too small to detect quantitative effects of the templates.

Additionally, both registered and unregistered editors who were warned were extremely unlikely to edit after the warnings, which also made it difficult to see any effect from the change in templates. Qualitatively, this points to Huggle's role as the most common vandal-fighting tool on Portuguese Wikipedia, and to the negative impact on new editor retention that has limited the size of the lusophone Wikipedia community.

Templates:

Warnings about specific issues, such as blanking, lack of sources, etc.[edit]

Further reading: Huggle data analysis notes

Based on our findings from previous experiments on the level one vandalism warning in Huggle, we decided to use our winning message strategy (a personalized message with no directives) to create new versions of all the other level one user warnings used by Huggle. These issue-specific warnings cover test edits, spam, non-neutral point of view, unsourced content, and attack edits, as well as unsourced edits to biographies of living people, purposefully inserting errors, and blanking articles or deleting large portions of text without an explanation. Some examples:

Fig. 5 - The current default warning for inserting unsourced content
Fig. 6 - One test version, which aimed to be friendlier and encourage people to help build the encyclopedia

Templates:

One of the key initial results of this test was that some issue-specific warnings are very rarely used: the common issues encountered by Hugglers are unsourced additions, removal of content without an edit summary, test edits, and spam. This suggests that the most common mistakes made by new or anonymous editors are not particularly malicious, and should be treated with good faith for the most part.

Our explanation for this is that the new template's style (extremely friendly, reminding editors of the community aspect of encyclopedia building, and encouraging them to edit again to fix their mistakes) was only effective for registered editors who had already shown a minimum of commitment. The confidence in the result for registered editors also increased when looking at all namespaces, which supports the theory that the invitation to the warning editor's talk page and the other links were helpful.

In our next test of issue-specific templates, we ran the default against a very short version of each message written by an experienced Wikipedian. This version was friendlier than the current default in some respects, but did not begin by stating who the reverting editor was; it focused almost solely on the action the warned editor should take.

Fig. 7 - The current default warning for inserting unsourced content
Fig. 8 - A short new version, which simply instructed editors how to fix the issue that got them reverted

Templates:

Warnings about external links[edit]

In addition to the test of spam warnings via Huggle, we experimented with new messages for XLinkBot, which warns users who insert inappropriate external links into articles.

Example templates:

Fig. 9 - The default level one spam warning left by XLinkBot
Fig. 10 - A new shorter, friendlier version

Warnings:

Welcome messages:

Read about an editor from this test.

The results were inconclusive on a quantitative basis, with no significant retention-related effects for the 3121 editors in this test. One thing we noted in this test was that the bot sends multiple warnings to IPs, which combine with Huggle- and Twinkle-issued spam warnings and pile up on IP talk pages, making their effects hard to tease apart.

Though there was no statistically significant change in editor retention in the test, qualitative data suggests that:

  1. more boldly declaring that the reverting editor was a bot that sometimes makes mistakes may encourage editors to revert it more or otherwise contest its actions.
  2. it is clear that the approach used for vandalism warnings in prior tests may not be effective in the case of editors adding external links which are obviously unencyclopedic in some way (even if they are not always actually spam). These complex issues require very specific feedback to editors, and XLinkBot uses a complex cascading set of options to determine the most appropriate warning to deliver.

Warnings about test edits[edit]

Read about an editor from this test.

In addition to our experiment with test edit warnings in Huggle, we worked with 28bot by User:28bytes. 28bot is an English Wikipedia bot which reverts and warns editors who make test edits: it automatically identifies very common accidental or test edits, such as the default text added when clicking the bold or italics buttons in the toolbar.
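To illustrate the kind of pattern matching this involves, here is a minimal sketch (in Python, and not 28bot's actual code) that flags an edit as a likely test edit when the added text still contains unmodified placeholder strings inserted by the edit toolbar; the exact placeholder list below is an assumption for illustration.

```python
import re

# Placeholder strings inserted by the edit toolbar. This list is an
# illustrative assumption; 28bot's real detection rules differ.
TOOLBAR_PLACEHOLDERS = [
    r"'''Bold text'''",
    r"''Italic text''",
    r"\[\[Link title\]\]",
    r"== Headline text ==",
]
PLACEHOLDER_RE = re.compile("|".join(TOOLBAR_PLACEHOLDERS))

def looks_like_test_edit(added_text: str) -> bool:
    """Return True if the added text contains an unmodified toolbar placeholder."""
    return bool(PLACEHOLDER_RE.search(added_text))

print(looks_like_test_edit("'''Bold text''' was added here"))  # True
print(looks_like_test_edit("Added a sourced sentence."))        # False
```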

While 28bot is not an extremely high-volume bot, the users it reverts are of particular interest, since they are not malicious at all. Improving the quality of this bot's warnings could help encourage test editors to improve and stick around on Wikipedia.

Templates:

In comparing the default and test versions of the templates used by 28bot, there was an unexpected bias in the sample: editors in the control group had almost all received warnings prior to the warning from 28bot, while editors who received test messages had almost none. From previous tests, we know that prior warnings tend to dilute the impact of any new messages. Additionally, the quantitative data from this test demonstrated that prior warnings are a very significant predictor of future warnings and blocks. While this invalidated any attempt to compare the effectiveness of the templates against each other, it may be fruitful for producing algorithmic methods for identifying problematic users.
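As a purely illustrative sketch of what such an algorithmic method could look like (this was not built as part of the test, and the data below is invented), one could fit a simple logistic regression that predicts future blocks from the number of prior warnings on an editor's talk page:

```python
# Hypothetical sketch: predicting future blocks from prior warning counts.
# Feature values and labels are invented for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [number of prior warnings on the editor's talk page]
X = np.array([[0], [0], [1], [2], [3], [5], [6], [8]])
# 1 = editor was later blocked, 0 = not blocked (invented labels)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)
# Estimated block probability for an editor with 4 prior warnings
print(model.predict_proba([[4]])[0][1])
```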

Though comparing 28bot templates proved unsuccessful, we did have a valuable alternative: comparing the survival of those reverted and warned by 28bot to those reverted by RscprinterBot. This comparison is particularly interesting because RscprinterBot, for the most part, does not warn the test editors it reverts, yet it targets a population of test editors almost exactly the same as 28bot's. We looked at a combined sample of 236 editors from these bots.

You can read more in our data analysis journal.

Warnings about files[edit]

Read about an editor from this test.
Further reading: our data analysis

In this test, we worked with English Wikipedia's ImageTaggingBot to test warnings delivered to editors who upload files that are missing some or all of the vital information about licensing and source. We ran our test with 1133 editors given a warning by ImageTaggingBot.

Templates:

Fig. 11 - The default version
Fig. 12 - This new version of the template was the least successful

Warnings about copyright[edit]

In this test, we worked with English Wikipedia's CorenSearchBot to test different kinds of warnings to editors whose articles were very likely to be copied, in part or wholesale, from another website. CorenSearchBot relied on a Yahoo! API and is no longer in operation, but in its prime it was the English Wikipedia's number one identifier of copyright violations in new articles. Read more about our data analysis in our journal notes.

The templates we tested were the welcome template given by the bot, a warning about copy-pasted excerpts, and a warning about entire articles that appeared to be copy-pasted. We also separated these warnings out by experience level, with anyone who made more than 100 edits prior to the warning receiving a special message for very experienced editors.

The results of this test were largely inconclusive, due in part to the bot breaking down in the midst of the test.

Deletion notifications[edit]

Fig. 13 - Deletion notifications are one of the two most common uses of Twinkle
Further reading: our data analysis notes and qualitative feedback from the users who had been affected by these tests.

In this test, we attempted to rewrite the language of two of the three kinds of deletion notices used on English Wikipedia, to test whether clearer, friendlier language was better at retaining users and helping them properly contest a deletion of the article they authored.

Articles for deletion (AFD)[edit]

The goal of the new version we wrote for the AFD notification was to gently explain what AFD was, and to encourage authors of articles to go and add their perspective to the discussion.

For a quantitative analysis, we looked at editing activity in the Wikipedia (or "project") namespace among the 693 editors in both the test and control group, and found no statistically significant difference in the amount of activity there.

However, qualitative feedback we received from new editors suggested that simply reducing the number of excess links present in the current AFD notification was an improvement; the default includes links to a list of all policies and guidelines and to documentation on all deletion policies.

Templates:

Fig. 14 - The default version of the notification
Fig. 15 - The new version of the notification

Proposed deletion (PROD)[edit]

Read about an editor from this test

Since the primary method for objecting to deletion via a PROD is simply to remove the template, we used the API to search for edits by contributors in our test and control groups for this action (shortly after receiving the message, not over all time).
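The report does not include the query code itself, but a minimal sketch of how such a check might look against the MediaWiki action API is below; the template text matched ("Proposed deletion"), the mainspace-only filter, and the fixed time window are assumptions, and deleted revisions would require additional access rights.

```python
# Minimal sketch (not the original analysis code): checking whether a user
# removed a proposed-deletion tag shortly after being notified, using the
# MediaWiki action API.
import requests

API = "https://en.wikipedia.org/w/api.php"

def get_revision(revid):
    """Fetch the wikitext and parent revision ID for a given revision."""
    r = requests.get(API, params={
        "action": "query", "prop": "revisions", "revids": revid,
        "rvprop": "ids|content", "rvslots": "main",
        "format": "json", "formatversion": 2,
    }).json()
    rev = r["query"]["pages"][0]["revisions"][0]
    return rev["slots"]["main"]["content"], rev.get("parentid")

def removed_prod_tag(username, start, end):
    """True if any of the user's mainspace edits between start and end
    (ISO 8601 timestamps) removed text containing 'Proposed deletion'."""
    contribs = requests.get(API, params={
        "action": "query", "list": "usercontribs", "ucuser": username,
        "ucstart": start, "ucend": end, "ucdir": "newer",
        "ucnamespace": 0, "uclimit": 50,
        "format": "json", "formatversion": 2,
    }).json()["query"]["usercontribs"]

    for edit in contribs:
        new_text, parentid = get_revision(edit["revid"])
        if not parentid:          # page creation, nothing could be removed
            continue
        old_text, _ = get_revision(parentid)
        if "Proposed deletion" in old_text and "Proposed deletion" not in new_text:
            return True
    return False
```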

When looking at revisions to articles not deleted, there was no significant difference:

  • 16 (2.8%) of the 557 editors who received the test message removed a PROD tag.
  • 13 (2.2%) of the 579 editors who received the control message removed a PROD tag.

So while there was a small increase in the amount of removed PROD templates in the test group, it was not statistically significant.

However, when examining deleted revisions (i.e. articles that were successfully deleted via any method), we see a stronger suggestion that the new template more effectively told editors how and why to remove a PROD template:

  • 96 out of 550 (17%) users whose articles were deleted removed the PROD tag if they got the test message.
  • 70 out of 550 users (13%) removed the PROD tag if they got the standard message we used as a control.

Note that the absolute numbers are also a lot higher than in non-deleted revisions (where roughly 2% of each group removed the tag). This makes sense, since most new articles on Wikipedia get deleted in one way or another.
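For readers who want to reproduce the comparison, the counts above can be checked with a standard two-proportion test; the sketch below uses a chi-square test of independence (the report does not state which test was used, so this choice is an assumption).

```python
# Rough re-check of the PROD comparisons with a chi-square test of independence.
from scipy.stats import chi2_contingency

def removal_p_value(removed_test, n_test, removed_control, n_control):
    """P-value for the 2x2 table of (removed / did not remove) by group."""
    table = [
        [removed_test, n_test - removed_test],
        [removed_control, n_control - removed_control],
    ]
    chi2, p, _, _ = chi2_contingency(table)
    return p

# Non-deleted articles: 16/557 (test) vs 13/579 (control)
print(removal_p_value(16, 557, 13, 579))   # p well above 0.05: not significant
# Deleted articles: 96/550 (test) vs 70/550 (control)
print(removal_p_value(96, 550, 70, 550))   # p roughly 0.03-0.04: a suggestive difference
```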

Templates:

Fig. 16 - This is the current default proposed deletion notification on English Wikipedia
Fig. 17 - This new version was preferred by new users.

Speedy deletion (CSD)[edit]

Read about an editor from this test.

English Wikipedia's SDPatrolBot is a bot that warns editors for removing a speedy deletion tag from an article they started. We hypothesized that the main reason this happens is not a bad-faith action on the part of the article author, but because the author is new to Wikipedia and doesn't understand how to properly contest a speedy deletion.

To test this hypothesis, we created warnings that focused less on reproaching the user and more on inviting them to contest the deletion on the article's talk page, and tested the messages with a total of 858 editors.

Templates:

Fig. 18 - The default version of the warning
Fig. 19 - The new version of the warning

Welcome messages[edit]

Portuguese Wikipedia[edit]

In Portuguese, we tested a plain-text welcome template very similar to the default welcome on English Wikipedia against the extremely graphical default welcome that is the current standard on the lusophone Wikipedia. Further reading: our data analysis.

Templates:

When looking at the contributions of all editors who received these two welcome templates, we initially did not see a significant difference in their editing activity, with between 22% and 25% of the 429 editors making an edit after being welcomed.

Some possible explanations for the result include:

  • Simple bias in the groups of editors sampled. There were slightly more editors in the control sample and this seemed to minutely skew the results in favor of the control in a non-significant way, even before we filtered the analysis to only those who made edits before being welcomed.
  • The test message was not read as much, as it was not graphical in any way.
  • The current standard Portuguese welcome message begins with directives to Be Bold and collaborate with other editors, and only includes recommended reading ("Leituras recomendadas") at the bottom of the template. The test version presents a more direct and upfront list of rules. This may be discouraging, even though the large, graphical style of the current welcome is quite overwhelming.

German Wikipedia[edit]

The German Wikipedia welcome test was designed to measure whether reducing the length and number of links in the standard welcome message would lead new users to edit more. Though the sample size was very small (about 90 users total), the results were encouraging:

  • 64% of editors who received the standard welcome message went on to make at least one edit on German Wikipedia;
  • 88% of those who received the same welcome message but with fewer links went on to edit;
  • 84% of those who received a very short, informal welcome message went on to edit.

Templates:

Fig. 20 - This is one of the current default welcome messages on German Wikipedia
Fig. 21 - A radically shorter new version.

General talk page-related findings[edit]

  • Very experienced Wikipedians do get templates a lot too. Though we began this testing project thinking in terms of new users, it quickly became clear over the course of many of these tests – especially Twinkle, ImageTaggingBot, and CorenSearchBot – that user warnings and deletion notices affect a huge number of long-time Wikipedians, too. Yet despite the considerable difference in Wikipedia knowledge between a brand-new user and someone who's been editing for years, in most cases people get the same message regardless of their experience.
  • The more warnings on a talk page, the less effective they are. Especially for anonymous editors, the presence of old irrelevant warnings on a page strongly diluted the effectiveness of a warning or other notification. We strongly encourage regular archiving of IP talk pages by default, since these editors are generally not able to request it for themselves under usual circumstances. We began testing this via a SharedIPArchiveBot.
  • When reverting registered editors, no explanation is even worse than a template. Quantitative data from comparing test editors reverted by 28bot and Rscprinterbot strongly backed up previous surveys suggesting that the most important thing to do when reverting someone is to explain why. In the 2011 Editor Survey, only 9% of respondents said that reverts with an explanation were demotivating, while 60% said that reverts with no explanation were demotivating.

Future work[edit]

Since the results of the various tests were each quite different, and in some cases inconclusive, we do not recommend any single blanket change to all user talk messages. However, we do recommend further testing or new tests in some areas, as well as permanent revisions to some templates.

Promising future tests include, but are not limited to:

  1. The speedy deletion (CSD) warnings left by Twinkle
  2. Block notices, especially username policy blocks which are in good faith
  3. More tests of welcome template content, especially outside of English

Templates we recommend be revised directly based on successful testing:

  1. Level one warnings in English Wikipedia, especially those used by Huggle and Twinkle
  2. Twinkle articles for deletion (AFD) and proposed deletion (PROD) notifications

Perhaps most importantly, none of these experiments tested whether handwritten messages were superior or inferior to templates of any kind. A test of whether asking for custom feedback in some cases (such as through Twinkle) produces better results is merited, and is part of the idea backlog for the Editor Engagement Experiments team.