Template A/B testing/Huggle Analyses
Overview
The Huggle testing experiment descriptions can be found here.
Each experiment compared template sets consisting of a control and a test template. The control templates are existing Huggle warning templates already in use in the community, while the test templates are versions modified in measurable ways (length of warning, directives, personalization, etc.). Below is a link to the description of what each experiment is meant to test:
This analysis groups together Huggle warnings issued in response to test, spam, delete, and unsourced revisions.
Whether the changes made to the test templates had an effect was assessed primarily using logistic regression analysis in R. Editor groups were prefiltered based on:
 registered / non-registered
 minimum number of edits made before the posting
 maximum number of edits made before the posting
 minimum number of edits made after the posting (usually at least 1)
The results of these analyses along with the conditions on the editor activity samples can be seen in the experiment results in the next section.
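The prefiltering conditions listed above can be sketched as a simple selection step. This is a minimal illustration in Python, not the original R code: the `Editor` record and the `prefilter` function are hypothetical, and the parameter names only mirror the `execute.main(...)` arguments quoted in the results tables.

```python
from dataclasses import dataclass
import math


@dataclass
class Editor:
    # Hypothetical record layout; the real data schema is not shown in the source.
    registered: bool
    edits_before: int   # edits made before the warning was posted
    edits_after: int    # edits made after the warning was posted


def prefilter(editors, registered, min_edits_before=1,
              max_edits_before=math.inf, min_edits_after=0):
    """Keep editors that match the registration flag and the edit-count bounds."""
    return [
        e for e in editors
        if e.registered == registered
        and min_edits_before <= e.edits_before <= max_edits_before
        and e.edits_after >= min_edits_after
    ]


# Made-up sample data, for illustration only.
sample = [
    Editor(True, 12, 3),
    Editor(True, 2, 0),
    Editor(False, 7, 1),
    Editor(True, 60, 5),
]
selected = prefilter(sample, registered=True,
                     min_edits_before=5, max_edits_before=50)
print(len(selected), "editor(s) selected")
```

Using `math.inf` as the default upper bound corresponds to the `max_edits_before = Inf` setting that appears in several of the experiment parameters.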
In the plots accompanying the analyses, the mean ratio of edits before and after the posting and the mean absolute number of edits made after the posting are plotted against the minimum number of edits before the posting used to select the sample group. Companion plots show the sample sizes for each data point in the edit trend (both control and test).
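As a rough sketch of how the data points behind such plots could be produced, the snippet below recomputes the sample size and both measures for each minimum-edits threshold. The function name, the record layout, and the reading of "mean ratio" as a per-editor after/before ratio are assumptions made for illustration; the original plotting scripts are not shown in the source.

```python
from statistics import mean


def edit_trend(records, thresholds):
    """Summarise one template group at each minimum-edits-before threshold.

    records: (edits_before, edits_after) pairs, one per editor (assumed layout).
    Returns rows of (threshold, sample_size, mean_ratio, mean_edits_after).
    """
    rows = []
    for threshold in thresholds:
        # Thresholds are assumed to be >= 1 so the per-editor ratio is defined.
        group = [(b, a) for b, a in records if b >= threshold]
        if not group:
            continue  # no editors left at this threshold
        ratio = mean(a / b for b, a in group)
        rows.append((threshold, len(group), ratio, mean(a for _, a in group)))
    return rows


# Made-up data, for illustration only.
control = [(1, 0), (2, 1), (4, 2), (10, 5)]
rows = edit_trend(control, thresholds=[1, 3, 20])
for row in rows:
    print(row)
```

Running the same function on the control and test groups gives the two curves and the matching sample-size series.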
Analyses Results
Huggle test one (July 19 – August 5)
We tested a variety of factors against the default control versions, all of which can be seen at: uwvandalrand1/Experiment1. The primary tests were whether clearer instructional language, personalization of the messages, and images enhanced the impact of the message.
 Data and metrics
Detailed results and raw stats are currently up at Research:Warning Templates in Huggle. The total sample was 3241, of which 1750 clicked through to the new messages banner.
 Key findings
A more human-readable summary was posted on the Village Pump, but the key findings were:
 Adding "personalization", e.g. including usernames, speaking in the first person rather than passive voice, and generally making it obvious who the reverting/warning editor is, had a marginally significant impact on continued good faith editing. More testing was necessary to try and confirm this outcome.
 "Teaching messages", i.e. including no personalization but making the instructional content of a warning simpler and more direct, had no positive impact on retention of good faith editors, but was best at discouraging outright vandals.
 The new teaching-focused template had a significantly higher rate of retaliatory vandalism than the current versions used as control in our study. Simply improving the clarity of instruction is not effective and may in fact cause negative outcomes.
 Personalization of messages, especially including a more explicit invitation to ask questions of other editors, increased the amount of positive contact between those being warned and those reverting the editor.
 Other findings
 Whether templates include an icon or not made no statistically significant difference in the further editing behavior of those being warned.
 The single biggest factor in whether an editor continues to participate after being warned is the amount of editing they did prior to getting the message. Previous experience is the best predictor of continued editing, so in order to get accurate results, you need to perform regression analyses that account for this.
Huggle test two (September 25 – October 10)
Using the templates uwvandalrand1/Experiment2 we compared...
 the current default template
 a template with "personalization" plus instructional language, such as a suggestion to use the sandbox and read the introduction to editing
 a template with personalization but no instructional/teaching language or links to policy, plus a more explicit thank you for editing
We tested these variations because the first test showed that personalization (adding usernames, speaking in the first person instead of the passive voice, and inviting people to ask questions on the vandal-fighter's talk page) was better, but we wanted to know whether instructional-type language works or not. Our hypothesis was that attempting to assign tasks and pointing people to long, complicated policy pages not only fails to get them to improve, but actually discourages them from editing entirely.
 Data and metrics
A total of 2451 messages were delivered, most of them to IPs. We broke these down further through a couple of rounds of qualitative coding on a four-point scale:
 420 were 'vandals', whose edits obviously should have been reverted and may have merited an immediate block. Examples: 1, 2.
 982 were 'bad faith editors', people committing vandalism for whom the simple level 1 warning they received was appropriate. Examples: 1, 2
 702 were 'test editors', people making test edits that should be reverted for quality but who aren't obvious vandals. Examples: 1 and 2
 347 were 'good faith editors', who were clearly trying to improve the encyclopedia in an unbiased, factual way. Examples: 1, 2
We then analysed each of these groups separately using several metrics. Raw statistical outputs from R are available here.
 Key findings
 We didn’t see an improvement in the long term retention rates of editors, which wasn’t a big surprise. We didn’t really expect one different template to make people stick around months later.
 We did see an improvement in whether new Wikipedians kept editing articles or not. For people who’d already cut their teeth as editors (~10 edits before getting warned), both new templates did a significantly better job of encouraging them to keep editing in the main namespace. Recipients of these templates made further edits equal to 20% of their prior contributions. Considering all warnings generally discourage further editing, this is a positive outcome. Note that this effect was only found in the groups we coded as test editors or those editing in good faith, not vandals.
 Another piece of encouraging news is that the ‘no-directives’ template was the best overall for encouraging people of all kinds to communicate more. Statistically speaking, people who received that template performed one more user talk edit than others after receiving the message. Considering one user talk edit can mean a completed message to another editor, this is good news.
 We found only 1% of the user talk edits made after being warned were retaliation directed at other editors. That’s good, since we have had some concerns that giving vandal-fighters’ usernames more exposure might increase vandalism directed at them.
Huggle short 2
Data Munging / Filtering:
Only tracking edits in the first three days after posting
5 <= edits before <= Inf (registered),
1 <= edits before <= Inf (non-registered),
Edits deleted before >= 0,
Blocks after = 0 (no blocks after seeing template)
namespace = 0
first_warning = TRUE
Findings:
For non-registered users the drop in edits from the control template was less than from the test (71.38% vs. 79.91%), with a confidence level of 88.1%.
For registered users the drop in edits from the test template was less than from the control (55.63% vs. 88.27%), with a confidence level of 92.25%.
Therefore there was an observable effect in favour of the test template for registered users, and in favour of the control template for non-registered users.
It is also noteworthy that for non-registered users the difference was heavily skewed towards those who made very few edits before the template posting. It should be further noted that much of the effect being seen is due to editors who don't make any edits after the posting.
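The confidence levels quoted in these findings appear to correspond to one minus the two-sided p-value of the `edits_decrease` coefficient in the R output (for example, a p-value of 0.0775 alongside 92.25%). The sketch below reproduces that arithmetic from a coefficient estimate and its standard error, using a normal approximation. The function name is hypothetical, and the reading of "confidence" as 1 − p is an inference from the reported numbers, not something the source states explicitly.

```python
import math


def wald_confidence(estimate, std_error):
    """Return (z, two-sided p-value, 1 - p) for a regression coefficient."""
    z = estimate / std_error
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value, 1.0 - p_value


# Estimate and standard error of edits_decrease as printed in the
# registered-users output for this experiment.
z, p, confidence = wald_confidence(2.0091, 1.1379)
print(f"z = {z:.3f}, p = {p:.4f}, confidence = {confidence:.2%}")
```

Because the calculation uses the absolute value of z, it is unaffected by the sign of the estimate.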
Modelling Analysis, Non-Registered Users - R Output

Call:
glm(formula = template ~ edits_decrease, family = binomial(link = "logit"), data = all_data)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
  1.157   1.157   1.023   1.198   1.518
Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)       0.2116     0.1073   1.972   0.0486 *
edits_decrease    0.1635     0.1035   1.579   0.1143
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1079.8 on 779 degrees of freedom
Residual deviance: 1077.2 on 778 degrees of freedom
AIC: 1081.2
Number of Fisher Scoring iterations: 4
Reduction in edits Test:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 3.4290  1.0000  1.0000  0.7991  1.0000  1.0000
Reduction in edits Control:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 6.0000  0.9250  1.0000  0.7138  1.0000  1.0000
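The model in the output above regresses template membership on `edits_decrease` with a logit link. For readers who want to see what such a fit involves, here is a minimal single-predictor logistic regression fitted by Newton-Raphson in plain Python. It is a sketch, not the analysis code used for these experiments: `fit_logistic` is a hypothetical helper and the toy data are made up.

```python
import math


def fit_logistic(x, y, iterations=50, tol=1e-10):
    """Fit P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x))) by Newton-Raphson.

    Returns (b0, b1, se_b0, se_b1). Assumes the data are not perfectly separable.
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient w.r.t. intercept
            g1 += (yi - p) * xi     # gradient w.r.t. slope
            h00 += w                # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information matrix) * delta = gradient.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    # Standard errors from the inverse information matrix of the last step.
    return b0, b1, math.sqrt(h11 / det), math.sqrt(h00 / det)


# Toy data: edits_decrease of 0.0 (no drop) or 1.0 (stopped editing);
# template is 0 for control and 1 for test.
edits_decrease = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
template = [1, 0, 0, 0, 1, 1, 1, 0]
b0, b1, se0, se1 = fit_logistic(edits_decrease, template)
print(f"intercept = {b0:.4f} (SE {se0:.4f}), slope = {b1:.4f} (SE {se1:.4f})")
```

With a two-valued predictor the fitted coefficients can be checked by hand: the intercept is the log-odds of the test template when `edits_decrease` is 0, and the slope is the difference in log-odds between the two groups.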
Modelling Analysis, Registered Users - R Output

Call:
glm(formula = template ~ edits_decrease, family = binomial(link = "logit"), data = all_data)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
 1.1224  0.8574  0.7639  1.1268  1.6578
Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)       0.9267     0.9848   0.941   0.3467
edits_decrease    2.0091     1.1379   1.766   0.0775 .
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 80.201 on 62 degrees of freedom
Residual deviance: 74.571 on 61 degrees of freedom
AIC: 78.57
Number of Fisher Scoring iterations: 5
Reduction in edits Test:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 3.4400  0.5500  0.9706  0.5563  1.0000  1.0000
Reduction in edits Control:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.5263  0.8333  0.9359  0.8827  1.0000  1.0000
Huggle short 1 & 2
Data Munging / Filtering:
Only tracking edits in the first three days after posting
1 <= edits before <= Inf,
Edits deleted before >= 0,
Blocks after = 0 (no blocks after seeing template)
namespace = 0
first_warning = TRUE
Findings:
For non-registered users the drop in edits from the test template was less than from the control (73.15% vs. 76.66%), with a confidence level of 88.1%.
For registered users the drop in edits from the test template was less than from the control (68.79% vs. 85.42%), with a confidence level of 91.05%. There were 70 and 41 samples from test and control respectively.
Therefore there was an observable effect of this template in favour of the test for non-registered users.
Modelling Analysis, Decrease in edits - Non-Registered Users - R Output

execute.main(min_edits_before = 3, max_edits_before = 50, registered = TRUE)
glm(formula = template ~ edits_decrease, family = binomial(link = "logit"), data = all_data)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
 1.7200  1.3945  0.9663  0.9748  0.9748
Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)      0.56299    0.04348  12.948   <2e-16 ***
edits_decrease   0.06576    0.04222   1.558    0.119
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6482.6 on 4901 degrees of freedom
Residual deviance: 6480.1 on 4900 degrees of freedom
AIC: 6484
Number of Fisher Scoring iterations: 4
Reduction in edits Test:
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 20.0000  1.0000  1.0000  0.7315  1.0000  1.0000
Reduction in edits Control:
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 10.0000  1.0000  1.0000  0.7666  1.0000  1.0000
Modelling Analysis, Decrease in edits - Registered Users - R Output

execute.main(min_edits_before = 4, max_edits_before = Inf, registered = TRUE)
glm(formula = template ~ edits_decrease, family = binomial(link = "logit"), data = all_data)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
 1.9876  1.3166  0.8896  1.0443  1.0443
Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)       1.3245     0.5276   2.510   0.0121 *
edits_decrease    1.0031     0.5908   1.698   0.0895 .
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 146.21 on 110 degrees of freedom
Residual deviance: 142.38 on 109 degrees of freedom
AIC: 146.38
Number of Fisher Scoring iterations: 4
Reduction in edits Test:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 1.2000  0.6667  0.9206  0.6879  1.0000  1.0000
Reduction in edits Control:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.5000  0.8667  1.0000  0.8542  1.0000  1.0000
Modelling Analysis, Edits 0-3 Days After - Registered Users - R Output

execute.main(min_edits_before = 3, max_edits_before = 50, registered = TRUE)
Call:
glm(formula = template ~ edits_decrease, family = binomial(link = "logit"), data = all_data)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
  1.816   1.331   0.915   1.031   1.031
Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)      0.35459    0.20200   1.755   0.0792 .
edits_decrease   0.12007    0.06823   1.760   0.0785 .
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 183.64 on 139 degrees of freedom
Residual deviance: 179.47 on 138 degrees of freedom
AIC: 183.47
Number of Fisher Scoring iterations: 4
Mean edits Test:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   0.000   1.000   2.427   2.000  26.000
Mean edits Control:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   0.000   0.000   1.196   1.500   9.000
Huggle short 1 & 2 (84, 85, 86 only)
Data Munging / Filtering:
Only tracking edits in the first three days after posting
4 <= edits before <= Inf, (z84 vs. z85)
5 <= edits before <= Inf, (z84 vs. z86)
Blocks after = 0 (no blocks after seeing template),
namespace = 0,
first_warning = TRUE
Findings:
z85 (35 samples) had a significant (80.00% confident) difference in the decrease in edits after posting over z84 (36 samples), 67.35% decrease vs. 83.40% decrease.
z86 had a semi-significant (86.50% confident) difference in the decrease in edits after posting over z84, 72.22% decrease vs. 85.08% decrease.
z85 (45 samples) had a significant (84.80% confident) difference in the mean edit count after posting over z84 (45 samples), 2.356 vs. 1.289.
z86 (40 samples) had a significant (84.20% confident) difference in the mean edit count after posting over z84 (45 samples), 2.500 vs. 1.289.
Modelling Analysis, Decrease in edits - Registered Users z84 vs. z85 - R Output

glm(formula = template ~ edits_decrease, family = binomial(link = "logit"), data = all_data)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
  1.546   1.111   1.099   1.258   1.258
Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)       0.4934     0.4822   1.023    0.306
edits_decrease    0.6812     0.5316   1.281    0.200
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 98.413 on 70 degrees of freedom
Residual deviance: 96.552 on 69 degrees of freedom
AIC: 100.55
Number of Fisher Scoring iterations: 4
Percentage decrease in edits Test:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 1.2000  0.7936  1.0000  0.6735  1.0000  1.0000
Percentage decrease in edits Control:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.5000  0.8399  0.9582  0.8340  1.0000  1.0000
Modelling Analysis, Decrease in edits - Registered Users z84 vs. z86 - R Output

glm(formula = template ~ edits_decrease, family = binomial(link = "logit"), data = all_data)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
 1.5624  1.0258  0.9575  1.2497  1.4145
Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)       1.0279     0.8957   1.148    0.251
edits_decrease    1.5699     1.0515   1.493    0.135
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 74.192 on 53 degrees of freedom
Residual deviance: 71.616 on 52 degrees of freedom
AIC: 75.616
Number of Fisher Scoring iterations: 4
Percentage decrease in edits Test:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.3333  0.6667  0.8149  0.7222  0.9557  1.0000
Percentage decrease in edits Control:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.1000  0.8355  0.9132  0.8508  1.0000  1.0000
Modelling Analysis, Edits 0-3 Days After - Registered Users z84 vs. z85 - R Output

execute.main(min_edits_before = 3, max_edits_before = 50, registered = TRUE)
Call:
glm(formula = template ~ edits_decrease, family = binomial(link = "logit"), data = all_data)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
 1.5109  1.1014  0.2871  1.2554  1.2554
Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)      0.18145    0.24340   0.745    0.456
edits_decrease   0.10422    0.07281   1.431    0.152
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 124.77 on 89 degrees of freedom
Residual deviance: 122.40 on 88 degrees of freedom
AIC: 126.40
Number of Fisher Scoring iterations: 4
Mean edits Test:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   0.000   0.000   2.356   2.000  20.000
Mean edits Control:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   0.000   0.000   1.289   2.000   9.000
Modelling Analysis, Edits 0-3 Days After - Registered Users z84 vs. z86 - R Output

execute.main(min_edits_before = 3, max_edits_before = 50, registered = TRUE)
Call:
glm(formula = template ~ edits_decrease, family = binomial(link = "logit"), data = all_data)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
  1.497   1.044   1.044   1.266   1.317
Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)      0.32157    0.25627   1.255    0.210
edits_decrease   0.11638    0.08233   1.413    0.158
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 117.54 on 84 degrees of freedom
Residual deviance: 114.88 on 83 degrees of freedom
AIC: 118.88
Number of Fisher Scoring iterations: 4
Mean edits Test:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0     0.0     1.0     2.5     2.0    26.0
Mean edits Control:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   0.000   0.000   1.289   2.000   9.000
Huggle 3 analysis on specific template postings - (60,62,66,76 VS. 61,63,67,77)
Data Munging / Filtering:
Only tracking edits in the first three days after posting
Blocks after = 0 (no blocks after seeing template),
namespace = 0,
first_warning = TRUE
> Non-registered:
3 <= edits before <= Inf
test datapoints = 214
control datapoints = 170
> Registered:
5 <= edits before <= Inf
test datapoints = 30
control datapoints = 30
Findings:
For non-registered users, the mean decrease in edits for the test group exceeded that for the control (83.83% vs. 75.02%). The result is 94.59% confident.
For registered users, the mean decrease in edits for the control group exceeded that for the test (83.20% vs. 70.58%). The result is 84.00% confident.
The direction of the effect is reversed between registered and non-registered users.
Modelling Analysis, Non-Registered Users - R Output

Call:
glm(formula = template ~ edits_decrease, family = binomial(link = "logit"), data = all_data)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
  1.319   1.319   1.043   1.043   1.596
Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)       0.1510     0.2243   0.673   0.5007
edits_decrease    0.4769     0.2476   1.926   0.0541 .
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 527.28 on 383 degrees of freedom
Residual deviance: 523.36 on 382 degrees of freedom
AIC: 527.36
Number of Fisher Scoring iterations: 4
Percentage decrease in deleted edits Test:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 1.6670  0.8260  1.0000  0.8384  1.0000  1.0000
Percentage decrease in deleted edits Control:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 1.6670  0.6818  1.0000  0.7502  1.0000  1.0000
Modelling Analysis, Registered Users - R Output

Call:
glm(formula = template ~ edits_decrease, family = binomial(link = "logit"), data = all_data)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
 1.5728  1.0761  0.1728  1.2366  1.2894
Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)       0.8939     0.6968   1.283    0.200
edits_decrease    1.1533     0.8205   1.406    0.160
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 83.178 on 59 degrees of freedom
Residual deviance: 81.049 on 58 degrees of freedom
AIC: 85.049
Number of Fisher Scoring iterations: 4
Percentage decrease in deleted edits Test:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.2703  0.4309  0.9071  0.7058  1.0000  1.0000
Percentage decrease in deleted edits Control:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.7770  0.9857  0.8320  1.0000  1.0000
Plots
Plots - Ratio of decrease of edits after (0-3 days) posting to edits before
Plots - Sample counts of edit decrease ratio plots
Plots - Number of edits after (0-3 days) posting to edits before (<= 50)
Plots - Number of edits after (0-30 days) posting to edits before (<= 50)
Plots - Sample counts of number of edit plots
Summary
Legend of Terms:
Variables:
 Mean_Diff_Edits_Normalized - Mean difference of edits before and edits 0-3 days after posting, normalized by edits before. def: (AVG(Edits before posting) - AVG(Edits 0-3 days after posting)) / AVG(Edits before posting). Lower values are better.
 Diff_Edits_After_0_3 - Mean number of edits 0-3 days after posting. def: AVG(Edits 0-3 days after posting)
 Diff_Edits_After_0_30 - Mean number of edits 0-30 days after posting. def: AVG(Edits 0-30 days after posting)
Main Namespace Only
| Experiment | Registered | Variable(s) | Control Result | Test Result | Winner | % Increase | Confidence | Sample Size | Params |
| Huggle 3 (60,62,66,76 vs. 61,63,67,77) | TRUE | Mean_Diff_Edits_Normalized | 0.8320 | 0.7058 | test | 15.2% (fewer) | 84.00% | 30 (test), 30 (control) | execute.main(min_edits_before = 5, max_edits_before = Inf, registered = TRUE) |
| Huggle 3 (60,62,66,76 vs. 61,63,67,77) | FALSE | Mean_Diff_Edits_Normalized | 0.7502 | 0.8383 | control | 10.5% (fewer) | 94.59% | 214 (test), 170 (control) | execute.main(min_edits_before = 3, max_edits_before = Inf, min_edits_after = 0, registered = FALSE) |
| Huggle Short 1 & 2 | TRUE | Mean_Diff_Edits_Normalized | 0.8542 | 0.6879 | test | 19.5% (fewer) | 91.05% | 70 (test), 41 (control) | execute.main(min_edits_before = 4, max_edits_before = Inf, registered = TRUE) |
| Huggle Short 1 & 2 | TRUE | Diff_Edits_After_0_3 | 1.196 | 2.427 | test | 50.7% | 92.15% | 89 (test), 51 (control) | execute.main(min_edits_before = 3, max_edits_before = 50, registered = TRUE) |
| Huggle Short 1 & 2 | FALSE | Mean_Diff_Edits_Normalized | 0.7666 | 0.7315 | test | 4.58% | 88.10% | hundreds (test), hundreds (control) | execute.main(min_edits_before = 1, max_edits_before = Inf, registered = FALSE) |
| Huggle Short 1 & 2 (84, 85) | TRUE | Mean_Diff_Edits_Normalized | 0.8340 | 0.6735 | test | 19.2% (fewer) | 80.00% | 35 (test), 36 (control) | execute.main(min_edits_before = 5, max_edits_before = Inf, registered = TRUE) |
| Huggle Short 1 & 2 (84, 86) | TRUE | Mean_Diff_Edits_Normalized | 0.8508 | 0.7222 | test | 15.1% (fewer) | 86.50% | 24 (test), 30 (control) | execute.main(min_edits_before = 4, max_edits_before = Inf, registered = TRUE) |
| Huggle Short 1 & 2 (84, 85) | TRUE | Diff_Edits_After_0_3 | 1.289 | 2.356 | test | 82.8% | 84.80% | 45 (test), 45 (control) | execute.main(min_edits_before = 3, max_edits_before = 50, registered = TRUE) |
| Huggle Short 1 & 2 (84, 86) | TRUE | Diff_Edits_After_0_3 | 1.289 | 2.500 | test | 93.9% | 84.20% | 40 (test), 45 (control) | execute.main(min_edits_before = 3, max_edits_before = 50, registered = TRUE) |
| Huggle Short 2 | TRUE | Mean_Diff_Edits_Normalized | 0.8827 | 0.5563 | test | 37.0% (fewer) | 92.25% | 21 (test), 42 (control) | execute.main(min_edits_before = 5, max_edits_before = Inf, registered = TRUE) |
| Huggle Short 2 | FALSE | Mean_Diff_Edits_Normalized | 0.7138 | 0.7991 | control | 10.7% (fewer) | 88.10% | 373 (test), 407 (control) | execute.main(min_edits_before = 1, max_edits_before = Inf, registered = FALSE) |
All Namespaces
| Experiment | Registered | Variable(s) | Control Result | Test Result | Winner | % Increase | Confidence | Sample Size | Params | Notes |
| Huggle 3 (60,62,66,76 vs. 61,63,67,77) | TRUE | Mean_Diff_Edits_Normalized | 0.8220 | 0.5631 | test | 31.5% (fewer) | 98.76% | 32 (test), 32 (control) | execute.main(min_edits_before = 5, max_edits_before = Inf, registered = TRUE) | |
| Huggle 3 (60,62,66,76 vs. 61,63,67,77) | FALSE | Mean_Diff_Edits_Normalized | 0.7368 | 0.8321 | control | 11.5% (fewer) | 96.13% | 223 (test), 175 (control) | execute.main(min_edits_before = 3, max_edits_before = Inf, registered = FALSE) | |
| Huggle Short 1 & 2 | TRUE | Mean_Diff_Edits_Normalized | 0.8023 | 0.4425 | test | 44.8% (fewer) | 92.74% | 82 (test), 48 (control) | execute.main(min_edits_before = 4, max_edits_before = Inf, registered = TRUE) | |
| Huggle Short 1 & 2 | TRUE | Diff_Edits_After_0_3 | 1.932 | 4.181 | test | 53.8% | 89.60% | 94 (test), 59 (control) | execute.main(min_edits_before = 3, max_edits_before = 50, registered = TRUE) | |
| Huggle Short 1 & 2 | FALSE | Mean_Diff_Edits_Normalized | 0.7896 | 0.8331 | control (flipped) | 5.2% (fewer) | 73.69% | 657 (test), 447 (control) | execute.main(min_edits_before = 1, max_edits_before = Inf, registered = FALSE) | |
| Huggle Short 1 & 2 (84, 85) | TRUE | Mean_Diff_Edits_Normalized | 0.7945 | 0.1359 | test | 82.9% (fewer) | 86.5% | 29 (test), 35 (control) | execute.main(min_edits_before = 5, max_edits_before = Inf, registered = TRUE) | |
| Huggle Short 1 & 2 (84, 86) | TRUE | Mean_Diff_Edits_Normalized | 0.7793 | 0.5293 | test | 32.1% (fewer) | 87.1% | 39 (test), 44 (control) | execute.main(min_edits_before = 4, max_edits_before = Inf, registered = TRUE) | |
| Huggle Short 1 & 2 (84, 85) | TRUE | Diff_Edits_After_0_3 | 2.057 | 4.958 | test | 111.9% | 87.5% | 49 (test), 54 (control) | execute.main(min_edits_before = 3, max_edits_before = 50, registered = TRUE) | |
| Huggle Short 1 & 2 (84, 86) | TRUE | Diff_Edits_After_0_3 | 2.057 | 3.452 | test | 67.8% | 77.2% | 43 (test), 54 (control) | execute.main(min_edits_before = 3, max_edits_before = 50, registered = TRUE) | |
| Huggle Short 2 | TRUE | Mean_Diff_Edits_Normalized | 0.8456 | 0.4190 | test | 50.4% (fewer) | 96.85% | 26 (test), 44 (control) | execute.main(min_edits_before = 5, max_edits_before = Inf, registered = TRUE) | |
| Huggle Short 2 | FALSE | Mean_Diff_Edits_Normalized | 0.7947 | 0.7094 | test (flipped) | 10.7% (fewer) | 78.4% | 204 (test), 207 (control) | execute.main(min_edits_before = 2, max_edits_before = Inf, registered = FALSE) | |
Test Descriptions
| Experiment | Description |
| Huggle 3 (60,62,66,76 vs. 61,63,67,77) | Huggle 3 experiments - spam, error, unsourced, and delete warnings. Shortened templates. |
| Huggle Short 1 & 2 | spam, error, unsourced |
| Huggle Short 1 & 2 (84, 85) | vandal - Control against "without directives" |
| Huggle Short 1 & 2 (84, 86) | vandal - Control against "short" |
| Huggle Short 2 | neutral point of view, biographical information about living persons, attack, blank content, delete - all long and short |
Discussion & Lessons
The plots above depict the change in edit activity metrics against the minimum number of edits all editors in each group must have made before seeing the template. The metrics shown are defined as follows:
 Mean Difference of edits before and edits 0-3 days after posting, normalized by edits before (ME1) - def: (AVG(Edits before posting) - AVG(Edits 0-3 days after posting)) / AVG(Edits before posting) => Lower values are better
 Mean Number of Edits 0-3 days after posting (ME2) - def: AVG(Edits 0-3 days after posting)
 Mean Number of Edits 0-30 days after posting (ME3) - def: AVG(Edits 0-30 days after posting)
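The three metric definitions above translate directly into a few lines of code. The sketch below follows the definitions exactly as written (ME1 as a ratio of group averages); the function name and dictionary keys are hypothetical, and the original R scripts may have computed the normalized difference per editor before averaging instead.

```python
from statistics import mean


def activity_metrics(editors):
    """Compute ME1, ME2 and ME3 for one group of editors.

    editors: list of dicts with the (assumed) keys
    "edits_before", "edits_after_0_3" and "edits_after_0_30".
    """
    avg_before = mean(e["edits_before"] for e in editors)
    avg_after_3 = mean(e["edits_after_0_3"] for e in editors)
    avg_after_30 = mean(e["edits_after_0_30"] for e in editors)
    return {
        # ME1: normalized drop in activity; lower values are better.
        "ME1": (avg_before - avg_after_3) / avg_before,
        # ME2 and ME3: mean edits in the 0-3 and 0-30 day windows.
        "ME2": avg_after_3,
        "ME3": avg_after_30,
    }


# Made-up data, for illustration only.
group = [
    {"edits_before": 10, "edits_after_0_3": 2, "edits_after_0_30": 5},
    {"edits_before": 6, "edits_after_0_3": 0, "edits_after_0_30": 1},
    {"edits_before": 8, "edits_after_0_3": 4, "edits_after_0_30": 6},
]
metrics = activity_metrics(group)
print(metrics)
```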
Note the interesting trend in the effect that the template has on newer editors (generally those with fewer than 5 edits). In each case the effect flattens among more experienced editors, and the templates themselves also seem to make less of a difference (the curves converge), with the exception of the Huggle short 2 experiment. In that case the effect could be delayed to a higher threshold, but it remains an interesting exception (WHY?). The test group outperformed the control in all but the "Huggle 3" experiment and the "Huggle Short 2" experiment for non-registered users (WHY?). However, in the Huggle 3 experiment there was a good deal of variance among the template types themselves (i.e. templates that warn users about different policy violations or behaviour), so Huggle 3 warrants more attention to fully understand the effect that the test templates could have had.
There is a lower threshold before the effect of the template becomes significant. There is also an interesting swapping effect for these templates where the control results in stronger mean edit count when including editors with only a single edit before posting. The test becomes the dominant template significantly for mean edit count when filtering on larger edit numbers.
Huggle short 2
For non-registered users the drop in edits from the control template was less than from the test (71.38% vs. 79.91%), with a confidence level of 88.1%. For registered users the drop in edits from the test template was less than from the control (55.63% vs. 88.27%), with a confidence level of 92.25%.
Therefore there was an observable effect in favour of the test template for registered users, and in favour of the control template for non-registered users.
It is also noteworthy that for non-registered users the difference was heavily skewed towards those who made very few edits before the template posting. It should be further noted that much of the effect being seen is due to editors who don't make any edits after the posting.
Note: Some templates in this test are very rarely used and may have added noise to the data. In particular: z107 & z108 (Uwnpov), z109 & z11 (Uwbio), z111 & z112 (Uwattack). We should rerun the numbers without these to see if the results are clearer.
Huggle short 1 & 2
For non-registered users the drop in edits from the test template was less than from the control (73.15% vs. 76.66%), with a confidence level of 88.1%. For registered users the drop in edits from the test template was less than from the control (20.22% vs. 25.32%); however, the p-value (=> 23.5% confident) indicates that this result is not significant.
Therefore there was an observable effect of this template in favour of the test for non-registered users.
Huggle short 1 & 2 (84 VS. 85, 86)
z85 (24 samples) had a significant (88.80% confident) difference in the decrease in edits after posting over z84 (30 samples), 68.43% decrease vs. 83.40% decrease. z86 had a semi-significant (72.5% confident) difference in the decrease in edits after posting over z84, 64.91% decrease vs. 76.44% decrease. z85 (45 samples) had a significant (84.80% confident) difference in the mean edit count after posting over z84 (45 samples), 2.356 vs. 1.289. z86 (38 samples) had a significant (84.20% confident) difference in the mean edit count after posting over z84 (38 samples), 2.500 vs. 1.289.
There is a lower threshold before the effect of the template becomes significant. There is also an interesting swapping effect for these templates where the control results in stronger mean edit count when including editors with only a single edit before posting. The test becomes the dominant template significantly for mean edit count when filtering on larger edit numbers.
Huggle 3 (60,62,66,76 VS. 61,63,67,77)
For non-registered users, the mean decrease in edits for the test group exceeded that for the control (83.83% vs. 75.02%). The result is 94.59% confident. For registered users, the mean decrease in edits for the control group exceeded that for the test (83.20% vs. 70.58%). The result is 84.00% confident.
The direction of the effect is reversed between registered and non-registered users.