Banner Test: "Jimmy Face VS Jimmy Arms Crossed"

Banners

"Arms Crossed". This is a fader banner and features two text strings which toggle by fading into each other for equal intervals. These strings are: *"If everyone reading this donated $5 our fundraiser would be over today. Please donate to keep Wikipedia free"* and *"Only 2 days left to make a tax-deductible contribution to keep Wikipedia free. Please help Wikipedia pay its bills in 2011."*

"Face". This is a fader banner and features two text strings which toggle by fading into each other for equal intervals. These strings are: *"If everyone reading this donated $5 our fundraiser would be over today. Please donate to keep Wikipedia free"* and *"Only 2 days left to make a tax-deductible contribution to keep Wikipedia free. Please help Wikipedia pay its bills in 2011."*

Result

Test Time: 2010-12-30 16:04:00 UTC - 2010-12-30 16:52:00 UTC
Sampling Interval = 2 minutes
Testing Interval = 12 minutes
Total Number of Samples per Class = 25

Donations / Impression*: "Arms Crossed" banner won by 12.19%.
Amount50 / Impression**: "Arms Crossed" banner won by 17.74%.

WINNER: "Arms Crossed" Banner

(*) The rate of donations per banner impression over a fixed time interval.

(**) The rate of amount50 per banner impression over a fixed time interval. Amount50 is the dollar amount raised from donations initiated under a given banner where all donations of more than $50 are recorded as $50 donations. This counters the skewing effect of outlier donations.

Methodology

Data Cleaning: Check the test period in CiviCRM for duplicate donations. This ensures that the donation records are consistent.
Data Cleaning: Break down Donations, Impressions, Amount50 / Impression, and Donations / Impression, and investigate any anomalous data. Discard any data where impressions or donations may have been lost due to human error in setting up the test, system errors arising from the central notice system, or other uncontrolled interference with the normal operation of the Fundraiser donation pipeline.
Using the data analysis above, determine which data to use in confidence testing, and choose a suitable sampling rate and testing interval for hypothesis testing. More information can be found here.
Perform Welch's T-Test over averaged parameters from each testing interval, derived from sampled values for Amount50 / Impression and Donations / Impression. The reasons for choosing Welch's T-Test and for modelling with a Student's T distribution are given below.
The full testing period runs from 16:04 UTC to 16:54 UTC on December 30th, 2010.

Data Analysis

This section analyzes and interprets the results of the tests.

Data Consistency and Cleaning

During the Fundraiser a great deal of noise was introduced into the normal processing of donations. Campaign banners and landing pages were reset on an hourly schedule at most points, which opened the opportunity for both human and hardware error to affect the system. Such errors could arise from internal causes, such as incorrectly named artifacts, artifacts turned on late or turned off early, or server outages, or from external factors, such as spikes or drops in donations or banner clicks caused by activity elsewhere on the web. The data was checked thoroughly to determine whether any of these factors were present. The plots below display the counts of the data sources over the testing period as verification of the consistency of the donation pipeline data used in testing. A certain amount of natural variance is expected in donation counts and is no cause for alarm. The bottom two figures depict the total views and donations along with those coming from each banner. This illustrates when the campaign becomes active and is also a useful tool for determining where anomalous behaviour may be visible in the data.

The data is analyzed over a period at least as long as the full testing period, and the testing period was chosen to cover the time where significant hits and donations were observed. Finally, the CiviCRM donation database was checked for duplicate donations and none were found over the test period.

Based on the above plots, the donation and impression data appear to be quite regular over the interval 2010-12-30 16:04:00 UTC - 2010-12-30 16:52:00 UTC. Therefore, two-minute intervals are used for sampling over this period as the source for the Welch's T-Test that assesses confidence in the winner.

Modelling and Hypothesis Testing

The data sampled from Donations / Impression and Amount50 / Impression is modelled using a Student's T distribution. The assumptions involved are discussed further here. The hypothesis test chosen was Welch's T-Test, which is similar to a Student's T-Test but without the assumption that the variances of each class (banner) are equal. Welch's T-Test is suitable in this case for determining how likely it is that the rates due to different banners come from truly different distributions (i.e. that the difference in the results is not simply due to random noise). In the context of these tests, the nominal variable is the banner class to which donations belong and the measurement variables are samples of Amount50 / Impression and Donations / Impression. The level of confidence here is taken to mean the confidence that we have in rejecting the null hypothesis (that the donation rates are actually equal) given the data. When a confidence level of X% is observed, it is taken to mean that if the experiment were performed identically and independently, the alternative hypothesis (that donation rates really are different) is expected to be observed X% of the time ^[1]. Note that over short time periods, where anomalous external factors or time-localized factors can have an effect, the alternative hypothesis becomes more likely.
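As an illustration of the test described above, the following is a minimal sketch of the Welch's T-Test statistic and its Welch-Satterthwaite degrees of freedom, using only the Python standard library. The function name `welch_t` and the sample values are hypothetical; this is not the script that produced the results in this report.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance t-test.

    t is positive when sample_b has the larger mean.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each class mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_b) - mean(sample_a)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical per-sample donation rates for two banners (not real data).
banner_a = [0.00030, 0.00041, 0.00035, 0.00029, 0.00038]
banner_b = [0.00040, 0.00044, 0.00036, 0.00039, 0.00047]
t_stat, dof = welch_t(banner_a, banner_b)
print(t_stat, dof)
```

The resulting t value is then compared against the Student's T distribution with `df` degrees of freedom to obtain a confidence level. Unlike the equal-variance Student's T-Test, `df` shrinks when one class is much noisier than the other.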

Care was taken in choosing the sampling intervals. These were made as small as possible while still ensuring that at least a single donation would be observed in each interval (two-minute intervals in most cases). Smaller intervals yield more samples, which enables higher confidence in the tests. It should be noted that with the Student's T-Test, all other things being equal, having more data to model with tends to increase the confidence in the result ^[2].
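The sampling step can be sketched as follows. This is an illustrative outline only: the function name `rate_samples`, the event tuple layout, and the example counts are assumptions, and the actual pipeline queried the donation and impression databases rather than an in-memory list.

```python
import math
from datetime import datetime, timedelta


def rate_samples(events, start, end, step=timedelta(minutes=2)):
    """Bin (timestamp, impressions, donations) events into fixed-width
    sampling intervals and return donations / impression per interval.

    Intervals with no impressions are skipped to avoid dividing by zero.
    """
    n_bins = math.ceil((end - start) / step)
    impressions = [0] * n_bins
    donations = [0] * n_bins
    for timestamp, n_impressions, n_donations in events:
        if start <= timestamp < end:
            index = int((timestamp - start) / step)
            impressions[index] += n_impressions
            donations[index] += n_donations
    return [d / i for d, i in zip(donations, impressions) if i > 0]


# Hypothetical counts for one banner over a six-minute window.
start = datetime(2010, 12, 30, 16, 4)
events = [
    (datetime(2010, 12, 30, 16, 4, 30), 1000, 1),
    (datetime(2010, 12, 30, 16, 5, 10), 1000, 0),
    (datetime(2010, 12, 30, 16, 6, 0), 2000, 1),
    (datetime(2010, 12, 30, 16, 9, 0), 500, 1),
]
samples = rate_samples(events, start, start + timedelta(minutes=6))
print(samples)
```

Shrinking `step` increases the number of samples but also increases the chance that an interval contains no donations at all, which is the trade-off described above.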

The plots below depict the mean and standard deviation of each banner over four ten-minute testing intervals. Samples were taken over two-minute periods. The hypothesis testing was performed over the average mean and variance of the test intervals. "Jimmy Arms Crossed" won in each case for donations/impression and amount50/impression, with increases of 12.19% and 17.74% respectively. Welch's T-Test was used to assess confidence over each metric, and the confidence in the winner for donations/impression and amount50/impression is at least 97.5% and 99.5% respectively. Although this seems to be a very conclusive result, it should be noted that the actual gains, while significant, are small (~15%). In addition, the "Face" banner had been running for much longer beforehand, and an effect had been observed where banners running for longer periods tended to slump in donations while fresh banners performed better. So while this result is clearly a significant win, factors beyond the banners themselves may be at play.


TOTAL DONATIONS "Face": 248
TOTAL DONATIONS "Arms Crossed": 321

TOTAL AMOUNT50* RAISED "Face": $3862.66
TOTAL AMOUNT50* RAISED "Arms Crossed": $5235.00

* AMOUNT50 indicates the total amount raised where all donations greater than $50 are taken to be a donation of $50.


DONATIONS PER IMPRESSION:

Between 97.5% and 99.0% confident about the winner.

Jimmy Arms Crossed VS Jimmy Face -- 2010-12-30 16:04:00 - 2010-12-30 16:52:00

item 1  = "Face"
item 2  = "Arms Crossed"

The winner "Arms Crossed" had a 10.13% increase.

interval	mean1		mean2		stddev1		stddev2

0		0.00029		0.00039		0.00006		0.00003
1		0.00043		0.00044		0.00013		0.00021
2		0.00040		0.00033		0.00006		0.00015
3		0.00028		0.00038		0.00001		0.00010


Overall Parameters:

mean1		mean2		stddev1		stddev2
0.00035		0.00039		0.00008		0.00014
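The "Overall Parameters" row above is consistent with averaging the interval means and taking the square root of the averaged interval variances. The sketch below reproduces that calculation from the rounded table values; the function name `overall_parameters` is hypothetical, and the aggregation rule is inferred from the table rather than taken from the original analysis script.

```python
import math


def overall_parameters(interval_means, interval_stddevs):
    """Average the interval means, and average the interval variances
    before converting back to a standard deviation."""
    overall_mean = sum(interval_means) / len(interval_means)
    mean_variance = sum(s * s for s in interval_stddevs) / len(interval_stddevs)
    return overall_mean, math.sqrt(mean_variance)


# Rounded donations / impression values from the table above.
face_mean, face_std = overall_parameters(
    [0.00029, 0.00043, 0.00040, 0.00028],
    [0.00006, 0.00013, 0.00006, 0.00001],
)
arms_mean, arms_std = overall_parameters(
    [0.00039, 0.00044, 0.00033, 0.00038],
    [0.00003, 0.00021, 0.00015, 0.00010],
)
print(face_mean, face_std)
print(arms_mean, arms_std)
```

Because the inputs are already rounded to five decimal places, the outputs agree with the reported row only up to rounding.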


AMOUNT50 PER IMPRESSION:

Between 99.5% and 99.95% confident about the winner.

Jimmy Arms Crossed VS Jimmy Face -- 2010-12-30 16:04:00 - 2010-12-30 16:52:00

item 1  = "Face"
item 2  = "Arms Crossed"

The winner "Arms Crossed" had a 22.68% increase.

interval	mean1		mean2		stddev1		stddev2

0		0.00375		0.00658		0.00126		0.00193
1		0.00665		0.00788		0.00296		0.00430
2		0.00683		0.00487		0.00182		0.00079
3		0.00390		0.00658		0.00001		0.00265


Overall Parameters:

mean1		mean2		stddev1		stddev2
0.00528		0.00648		0.00185		0.00273
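The reported percentage increases can be approximated from the interval means in the two tables above. This is a rough check using the rounded published values, so it lands close to, but not exactly on, the reported 10.13% and 22.68%. The function name `percent_increase` is hypothetical.

```python
def percent_increase(baseline_means, winner_means):
    """Percentage by which the winner's average rate exceeds the baseline's."""
    baseline = sum(baseline_means) / len(baseline_means)
    winner = sum(winner_means) / len(winner_means)
    return 100.0 * (winner - baseline) / baseline


# Donations / impression: item 1 = "Face", item 2 = "Arms Crossed".
donations_gain = percent_increase(
    [0.00029, 0.00043, 0.00040, 0.00028],
    [0.00039, 0.00044, 0.00033, 0.00038],
)

# Amount50 / impression, same ordering of banners.
amount50_gain = percent_increase(
    [0.00375, 0.00665, 0.00683, 0.00390],
    [0.00658, 0.00788, 0.00487, 0.00658],
)
print(donations_gain, amount50_gain)
```

With the rounded inputs this gives roughly 10.0% for donations per impression and roughly 22.6% for Amount50 per impression.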

Discussion

This test involved two banners featuring an image of Jimmy. The primary differences are (1) the scaling: "Arms Crossed" features a distant shot while "Face" primarily frames the face; (2) the tone and facial expression: "Arms Crossed" is more stern while "Face" is more pensive and thoughtful; (3) the clothing: "Arms Crossed" uses a black polo shirt while "Face" uses a blue dress shirt. Most of the data over the test period looked consistent and useful. However, this result may not generalize, as donation patterns change over hours and days, which could render it invalid. As noted above, the actual advantage for "Arms Crossed" was small and the conditions may not have been totally fair. Ideally, this sort of test comparing "Jimmy" poses would be repeated more thoroughly and over a more general set of conditions. That being said, a case may be made for serving banners with the features of the winner if more tests over longer periods lend support to the conclusions of this test.

References

[1] Statistical hypothesis testing - http://en.wikipedia.org/wiki/Hypothesis_testing
[2] Welch's T-Test - http://en.wikipedia.org/wiki/Welch's_t_test

Endnotes

Campaign = "20101230JA091_US"
"Arms Crossed" utm_source = "20101230_JAFS006fader_US"
"Face" utm_source = "20101229_JAFS002fader_US"
