Research:Postedit feedback/PEF2
This page documents the results of the second iteration of the Postedit feedback experiment. The goal of the experiment was to determine whether receiving feedback had any significant desirable or undesirable effect on the volume and quality of contributions by new registered users, compared to the control group.
We measured the effects on volume by analyzing the number of edits, contribution size and time to threshold for participants in each experimental condition; we measured the impact of the experiment on quality by looking at the rate of reverts and blocks in each experimental condition.
Prior to performing the analysis we generated a clean dataset from the entire population of participants in the experiment to filter out known outliers and focus on genuinely new registered users.
Unless otherwise noted, all analyses refer to a 2week interval since registration time to include a supplementary week after the 1week treatment period. We report when significant differences emerge comparing the intreatment and posttreatment period.
Research questions[edit]
 RQ1. Does receiving feedback increase the number of edits?
 RQ2. Does feedback lead to larger contributions?
 RQ3. Does receiving feedback shorten the time to the second contribution?
 RQ4. Does feedback affect the rate at which newcomers are blocked?
 RQ5. Does feedback affect the success rate of newcomers?
Treatment  1 edit  5 edits  10 edits  25 edits  50 edits  100 edits 

Control  3535  843  389  118  42  13 
Historical  3607  853  371  99  33  11 
Edit volume  Edit Count & Bytes Added[edit]
The following analysis address the question:
 RQ1. Does receiving feedback increase the number of edits?
Edit count is the most direct measure of editor activity. We measured the total edit counts of new editors that were added by experimental condition in the first 14 days of activity since registration. New users would not receive the treatment message until after completing their first edit. Therefore, for each editor included in the experiment the first contribution was omitted.
 RQ2: Does feedback lead to larger contributions?
The bytes added are computed in four ways for each editor:
 Net  the net sum of bytes added or removed
 Positive  the sum of bytes added
Below are the means of byte changed normalized by edit count for each group. We considered logarithmic transformations of bytes changed to work with normally distributed data. In order to perform the log operation on the distribution over "Net" byte count the net negative samples were ommited (about 15% of total samples). The samples for total bytes added for any given editor were normalized by edit count, so for example, if an editor had made five edits contributing 100,200,300,50, and 50 bytes the sample for this editor would be . Furthermore, the byte count data was verified to be lognormal under the Shapiro–Wilk test for each treatment and bytes added metric and, given rejection of the null hypothesis (), ttests were performed over the transformed data sets.
Finally, the sample group was subsampled based on the milestones reached and analysis was executed separately for each of these groups.
At least one edit:
R Output Edit Count


[1] "Processing Metric edit_count ..." [1] "Processing treatment control ..." [1] "Sample Size: 5067" [1] "Mean: 5.73810933491218" [1] "Processing treatment historical ..." [1] "Sample Size: 5106" [1] "Mean: 4.76615746180964" [1] "Ttest for treatment1 historical under edit_count" Welch Two Sample ttest data: t1 and ctrl t = 1.332, df = 6475.449, pvalue = 0.1829 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 2.4024405 0.4585367 sample estimates: mean of x mean of y 4.766157 5.738109 
R Output Bytes Added (Net & Positive)


[1] "Processing Metric bytes_added_net ..." [1] "Processing treatment control ..." [1] "Sample Size: 4172" [1] "Mean: 675.897552233935" [1] "Log Mean: 4.84501234490462" [1] "Log SD: 1.80334700944111" [1] "Processing treatment historical ..." [1] "Sample Size: 4189" [1] "Mean: 631.088538597629" [1] "Log Mean: 4.8046557700368" [1] "Log SD: 1.83063245067292" [1] "Ttest for treatment1 historical under bytes_added_net" Welch Two Sample ttest data: t1 and ctrl t = 1.0154, df = 8357.998, pvalue = 0.3099 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.11826257 0.03754942 sample estimates: mean of x mean of y 4.804656 4.845012 [1] "Processing Metric bytes_added_pos ..." [1] "Processing treatment control ..." [1] "Sample Size: 4521" [1] "Mean: 650.379701700953" [1] "Log Mean: 4.76932470827655" [1] "Log SD: 1.86205220432389" [1] "Processing treatment historical ..." [1] "Sample Size: 4517" [1] "Mean: 662.897631197254" [1] "Log Mean: 4.77219397012983" [1] "Log SD: 1.84844301028586" [1] "Ttest for treatment1 historical under bytes_added_pos" Welch Two Sample ttest data: t1 and ctrl t = 0.0735, df = 9035.624, pvalue = 0.9414 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.07363828 0.07937680 sample estimates: mean of x mean of y 4.772194 4.769325 
At least five edits:
R Output Edit Count


[1] "Processing Metric edit_count ..." [1] "Processing treatment control ..." [1] "Sample Size: 1159" [1] "Mean: 19.2950819672131" [1] "Processing treatment historical ..." [1] "Sample Size: 1141" [1] "Mean: 15.4005258545136" [1] "Ttest for treatment1 historical under edit_count" Welch Two Sample ttest data: t1 and ctrl t = 1.2374, df = 1469.371, pvalue = 0.2161 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 10.068340 2.279228 sample estimates: mean of x mean of y 15.40053 19.29508 
R Output Bytes Added (Net & Positive)


[1] "Processing Metric bytes_added_net ..." [1] "Processing treatment control ..." [1] "Sample Size: 1028" [1] "Mean: 335.308856601793" [1] "Log Mean: 4.95773516964509" [1] "Log SD: 1.47819614339751" [1] "Processing treatment historical ..." [1] "Sample Size: 993" [1] "Mean: 430.754503040083" [1] "Log Mean: 4.91251031795925" [1] "Log SD: 1.58892759783986" [1] "Ttest for treatment1 historical under bytes_added_net" Welch Two Sample ttest data: t1 and ctrl t = 0.6619, df = 1996.286, pvalue = 0.5081 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.17921677 0.08876707 sample estimates: mean of x mean of y 4.912510 4.957735 [1] "Processing Metric bytes_added_pos ..." [1] "Processing treatment control ..." [1] "Sample Size: 1145" [1] "Mean: 354.219910966528" [1] "Log Mean: 4.96745394691687" [1] "Log SD: 1.52978642175134" [1] "Processing treatment historical ..." [1] "Sample Size: 1129" [1] "Mean: 583.61043706938" [1] "Log Mean: 4.93766493534276" [1] "Log SD: 1.60921251009533" [1] "Ttest for treatment1 historical under bytes_added_pos" Welch Two Sample ttest data: t1 and ctrl t = 0.4523, df = 2262.548, pvalue = 0.6511 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.15894169 0.09936366 sample estimates: mean of x mean of y 4.937665 4.967454 
At least ten edits:
R Output Edit Count


[1] "Processing Metric edit_count ..." [1] "Processing treatment control ..." [1] "Sample Size: 552" [1] "Mean: 33.4420289855072" [1] "Processing treatment historical ..." [1] "Sample Size: 519" [1] "Mean: 26.1599229287091" [1] "Ttest for treatment1 historical under edit_count" Welch Two Sample ttest data: t1 and ctrl t = 1.1091, df = 703.703, pvalue = 0.2678 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 20.172774 5.608562 sample estimates: mean of x mean of y 26.15992 33.44203 
R Output Bytes Added (Net & Positive)


[1] "Processing Metric bytes_added_net ..." [1] "Processing treatment control ..." [1] "Sample Size: 505" [1] "Mean: 313.728768315528" [1] "Log Mean: 5.00375381388526" [1] "Log SD: 1.38156556001068" [1] "Processing treatment historical ..." [1] "Sample Size: 460" [1] "Mean: 262.417910824173" [1] "Log Mean: 4.90365018609712" [1] "Log SD: 1.38545229416708" [1] "Ttest for treatment1 historical under bytes_added_net" Welch Two Sample ttest data: t1 and ctrl t = 1.1225, df = 954.157, pvalue = 0.2619 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.27510813 0.07490088 sample estimates: mean of x mean of y 4.903650 5.003754 [1] "Processing Metric bytes_added_pos ..." [1] "Processing treatment control ..." [1] "Sample Size: 552" [1] "Mean: 345.986279404488" [1] "Log Mean: 5.0894331845964" [1] "Log SD: 1.39942519838873" [1] "Processing treatment historical ..." [1] "Sample Size: 517" [1] "Mean: 598.114833976045" [1] "Log Mean: 5.01104570988809" [1] "Log SD: 1.35577471269409" [1] "Ttest for treatment1 historical under bytes_added_pos" Welch Two Sample ttest data: t1 and ctrl t = 0.9301, df = 1065.776, pvalue = 0.3525 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.24376173 0.08698678 sample estimates: mean of x mean of y 5.011046 5.089433 
At least twentyfive edits:
R Output Edit Count


[1] "Processing Metric edit_count ..." [1] "Processing treatment control ..." [1] "Sample Size: 159" [1] "Mean: 79.6352201257862" [1] "Processing treatment historical ..." [1] "Sample Size: 130" [1] "Mean: 59.8307692307692" [1] "Ttest for treatment1 historical under edit_count" Welch Two Sample ttest data: t1 and ctrl t = 0.8755, df = 208.599, pvalue = 0.3823 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 64.39786 24.78896 sample estimates: mean of x mean of y 59.83077 79.63522 
R Output Bytes Added (Net & Positive)


[1] "Processing Metric bytes_added_net ..." [1] "Processing treatment control ..." [1] "Sample Size: 145" [1] "Mean: 226.068984205479" [1] "Log Mean: 4.96643026742924" [1] "Log SD: 1.06034302486461" [1] "Processing treatment historical ..." [1] "Sample Size: 114" [1] "Mean: 209.51882457994" [1] "Log Mean: 4.89118138165821" [1] "Log SD: 1.0920944340629" [1] "Ttest for treatment1 historical under bytes_added_net" Welch Two Sample ttest data: t1 and ctrl t = 0.5575, df = 239.385, pvalue = 0.5777 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.3411229 0.1906251 sample estimates: mean of x mean of y 4.891181 4.966430 [1] "Processing Metric bytes_added_pos ..." [1] "Processing treatment control ..." [1] "Sample Size: 159" [1] "Mean: 274.915579020194" [1] "Log Mean: 5.08194732006123" [1] "Log SD: 1.26429886569131" [1] "Processing treatment historical ..." [1] "Sample Size: 130" [1] "Mean: 235.993573979585" [1] "Log Mean: 5.02020573080006" [1] "Log SD: 1.03040697199858" [1] "Ttest for treatment1 historical under bytes_added_pos" Welch Two Sample ttest data: t1 and ctrl t = 0.4574, df = 286.998, pvalue = 0.6477 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.3274235 0.2039403 sample estimates: mean of x mean of y 5.020206 5.081947 
At least fifty edits:
R Output Edit Count


[1] "Processing Metric edit_count ..." [1] "Processing treatment control ..." [1] "Sample Size: 45" [1] "Mean: 195.955555555556" [1] "Processing treatment historical ..." [1] "Sample Size: 41" [1] "Mean: 116.560975609756" [1] "Ttest for treatment1 historical under edit_count" Welch Two Sample ttest data: t1 and ctrl t = 1.0463, df = 54.669, pvalue = 0.3 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 231.48240 72.69324 sample estimates: mean of x mean of y 116.5610 195.9556 
R Output Bytes Added (Net & Positive)


[1] "Processing Metric bytes_added_net ..." [1] "Processing treatment control ..." [1] "Sample Size: 40" [1] "Mean: 239.267255195948" [1] "Log Mean: 4.95482642279293" [1] "Log SD: 1.16683813935989" [1] "Processing treatment historical ..." [1] "Sample Size: 40" [1] "Mean: 187.267537169432" [1] "Log Mean: 4.71418715169494" [1] "Log SD: 1.06172502152303" [1] "Ttest for treatment1 historical under bytes_added_net" Welch Two Sample ttest data: t1 and ctrl t = 0.9647, df = 77.315, pvalue = 0.3377 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.7373014 0.2560228 sample estimates: mean of x mean of y 4.714187 4.954826 [1] "Processing Metric bytes_added_pos ..." [1] "Processing treatment control ..." [1] "Sample Size: 45" [1] "Mean: 277.629441771944" [1] "Log Mean: 5.22840919178115" [1] "Log SD: 0.931856342633689" [1] "Processing treatment historical ..." [1] "Sample Size: 41" [1] "Mean: 229.291974855973" [1] "Log Mean: 4.94473999286836" [1] "Log SD: 1.00534667958695" [1] "Ttest for treatment1 historical under bytes_added_pos" Welch Two Sample ttest data: t1 and ctrl t = 1.3531, df = 81.65, pvalue = 0.1797 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.7007350 0.1333966 sample estimates: mean of x mean of y 4.944740 5.228409 
At least one hundred edits:
R Output Edit Count


[1] "Processing Metric edit_count ..." [1] "Processing treatment control ..." [1] "Sample Size: 14" [1] "Mean: 484.285714285714" [1] "Processing treatment historical ..." [1] "Sample Size: 12" [1] "Mean: 247.666666666667" [1] "Ttest for treatment1 historical under edit_count" Welch Two Sample ttest data: t1 and ctrl t = 1.037, df = 16.079, pvalue = 0.3151 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 720.1616 246.9235 sample estimates: mean of x mean of y 247.6667 484.2857 
R Output Bytes Added (Net & Positive)


[1] "Processing Metric bytes_added_net ..." [1] "Processing treatment control ..." [1] "Sample Size: 11" [1] "Mean: 229.406669680937" [1] "Log Mean: 4.91198268865895" [1] "Log SD: 1.1065715024704" [1] "Processing treatment historical ..." [1] "Sample Size: 11" [1] "Mean: 195.969411899016" [1] "Log Mean: 4.58418080579895" [1] "Log SD: 1.32018683152677" [1] "Ttest for treatment1 historical under bytes_added_net" Welch Two Sample ttest data: t1 and ctrl t = 0.6311, df = 19.408, pvalue = 0.5353 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 1.4133482 0.7577444 sample estimates: mean of x mean of y 4.584181 4.911983 [1] "Processing Metric bytes_added_pos ..." [1] "Processing treatment control ..." [1] "Sample Size: 14" [1] "Mean: 296.674331844149" [1] "Log Mean: 5.15561020776912" [1] "Log SD: 1.06501799606155" [1] "Processing treatment historical ..." [1] "Sample Size: 12" [1] "Mean: 259.162261249939" [1] "Log Mean: 4.84816581571695" [1] "Log SD: 1.27657700824566" [1] "Ttest for treatment1 historical under bytes_added_pos" Welch Two Sample ttest data: t1 and ctrl t = 0.6603, df = 21.55, pvalue = 0.5161 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 1.2742982 0.6594094 sample estimates: mean of x mean of y 4.848166 5.155610 
There were no significant differences in bytes added per edit when considering positive and negative bytes added. Nor was there a significant result for edits. However, the control group had consistently larger edit counts. Given more data the observed effect size may prove to be significant.
Time to threshold[edit]
We measured the time to threshold as the number of minutes between historical milestones. Only editors that reached successive milestones were included in this analysis. Using this metric helps us address the following:
 RQ3. Does receiving milestone feedback shorten the time to reach the next milestone?
Milestone: 1st Edit  5th Edit
R Output


[1] "Processing treatment control ..." [1] "Sample Size: 1328" [1] "Mean: 428.634789156626" [1] "SD: 552.596301197372" [1] "Processing treatment historical ..." [1] "Sample Size: 1303" [1] "Mean: 404.027628549501" [1] "SD: 535.706250327613" [1] "Ttest for treatment1 historical under time_diff" Welch Two Sample ttest data: ctrl and t1 t = 1.1598, df = 2628.62, pvalue = 0.2463 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 16.99781 66.21213 sample estimates: mean of x mean of y 428.6348 404.0276 
Milestone: 5th Edit  10th Edit
R Output


[1] "Processing treatment control ..." [1] "Sample Size: 682" [1] "Mean: 450.475073313783" [1] "SD: 547.879891974263" [1] "Processing treatment historical ..." [1] "Sample Size: 647" [1] "Mean: 411.343122102009" [1] "SD: 533.742223682875" [1] "Ttest for treatment1 historical under time_diff" Welch Two Sample ttest data: ctrl and t1 t = 1.3188, df = 1326.063, pvalue = 0.1875 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 19.07783 97.34173 sample estimates: mean of x mean of y 450.4751 411.3431 
Milestone: 10th Edit  25th Edit
R Output


[1] "Processing treatment control ..." [1] "Sample Size: 241" [1] "Mean: 694.435684647303" [1] "SD: 557.765681286775" [1] "Processing treatment historical ..." [1] "Sample Size: 203" [1] "Mean: 668.704433497537" [1] "SD: 557.733693265889" [1] "Ttest for treatment1 historical under time_diff" Welch Two Sample ttest data: ctrl and t1 t = 0.4843, df = 429.28, pvalue = 0.6284 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 78.70409 130.16659 sample estimates: mean of x mean of y 694.4357 668.7044 
Milestone: 25th Edit  50th Edit
R Output


[1] "Processing treatment control ..." [1] "Sample Size: 78" [1] "Mean: 764.538461538462" [1] "SD: 506.461693076038" [1] "Processing treatment historical ..." [1] "Sample Size: 67" [1] "Mean: 697.328358208955" [1] "SD: 533.078747609886" [1] "Ttest for treatment1 historical under time_diff" Welch Two Sample ttest data: ctrl and t1 t = 0.7745, df = 137.283, pvalue = 0.4399 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 104.3783 238.7985 sample estimates: mean of x mean of y 764.5385 697.3284 
Milestone: 50th Edit  100th Edit
R Output


[1] "Processing treatment control ..." [1] "Sample Size: 30" [1] "Mean: 631.8" [1] "SD: 518.036638299541" [1] "Processing treatment historical ..." [1] "Sample Size: 17" [1] "Mean: 976.647058823529" [1] "SD: 581.519877258773" [1] "Ttest for treatment1 historical under time_diff" Welch Two Sample ttest data: ctrl and t1 t = 2.0307, df = 30.251, pvalue = 0.05115 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 691.53713 1.84301 sample estimates: mean of x mean of y 631.8000 976.6471 
The table below contains the mean values and sample sizes of timetothreshold for each milestone event:
Treatment  Sample Size  1st  5th Edit  5th  10th Edit  10th  50th Edit  * 50th  100th Edit 

Control Sample  1328  682  241  78  30 
Historical Sample  1303  647  203  67  17 
Control Mean (minutes)  428.63  450.48  694.44  764.54  631.8 
Historical Mean (minutes)  404.03  411.34  668.70  697.33  976.65 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
The only milestone event that came close to a significant result was observed under the 50th100th edit event. However, the significance is still on the fringe of being marginal and the sample sizes are relatively small. This may motivate a greater interest in investigating more targeted responses among more active early editors. It should finally be noted that the time to threshold was shorter for the control treatment not receiving any feedback.
Quality[edit]
While feedback may increase the volume of newcomer edits, it might do so at the cost of decreased quality. This is concerning since increasing the amount of edits that will need to be reverted is counter productive. Similarly increasing the amount of newcomers who eventually need to be blocked would increase the burden on en:WP:AIV.
To explore whether the changes in the volume of newcomer work were coming at the cost of decreased quality, we examined the work that newcomers performed in their first two weeks of editing. We identified two aspects of newcomers and the work that they perform: the proportion of newcomers who were eventually blocked from editing and the rate at which newcomers' contributions were rejected (reverted or deleted). We used these metrics to answer the following questions:
 RQ4. Does feedback affect the rate at which newcomers are blocked?
 RQ5. Does feedback affect the success rate of newcomers?
Block rate[edit]
To determine which newcomers were blocked, we processed the logging
table of the enwiki database to look for block events for newcomers in the experimental conditions. We decided that a newcomer had been blocked if there was any event for them with log_type="block" AND log_action="block"
between the beginning of the experimental period and midnight GMT Sept. 5th. Blocked newcomers plots the proportion of newcomers were blocked by experimental condition. As the plot suggests, the difference in proportions varies insignificantly (around 0.072) which suggests that the experimental treatment had no meaningful effect on the rate at which newcomers were blocked from editing.
To make sure that this result wasn't due to blocks of editors who hadn't earned them through a series of badfaith edits, we examined the relationship between the number of edits these newcomers saved and the proportion of them that were blocked. Blocked newcomers by revisions shows a steady increase in the proportion of newcomers that were blocked between 1 and 4 revisions. This seems likely due to the 4 levels of warnings that are used on the English Wikipedia.
R Output Block Rates


[1] "Ttest for treatment1 historical under block_count" Welch Two Sample ttest data: ctrl and t1 t = 0.4718, df = 42322.56, pvalue = 0.6371 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.0004647933 0.0007594753 sample estimates: mean of x mean of y 0.0009875657 0.0008402246 
The observed block rates for the two groups were extremely small: control = .099% and historical = .084%. Although, the block rate for the historical group was smaller the result was not significant.
Success rate[edit]
We examined the en:SHA1 checksum associated with the content of revisions to determine which revisions were reverted (see Research:Revert detection) by other editors. By comparing the number of revisions saved with the number of revisions reverted, we can build a proportion of reverted revisions (see Research:Metrics/revert_rate) and the success rate (the proportion of revisions saved by an editor that were not reverted). We use the success rate of an editor as a proxy for the quality of their work and a direct measure of the additional work their activities necessitate from Wikipedians.
To look for evidence of a causal relationship between PEF and the quality of newcomer work, we calculated the mean success rate for newcomer for each experimental condition.
R Output Revert Rates


[1] "Treatment: control" [1] "Samples: 5235" [1] "Mean: 0.0245267195984868" [1] "Treatment: historical" [1] "Samples: 5279" [1] "Mean: 0.0292926869221453" [1] "Ttest for treatment1 historical under revert_rate" Welch Two Sample ttest data: ctrl and t1 t = 1.625, df = 10420.69, pvalue = 0.1042 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.0105149549 0.0009830203 sample estimates: mean of x mean of y 0.02452672 0.02929269 
The revert rate for the control was actually lower, however the result is not significant.