Fundraising 2010/Report/Example Test 1

From Meta, a Wikimedia project coordination wiki

Banners

"Jimmy". This is a fader banner and features two text strings which toggle by fading into each other for equal intervals. These strings are: "If everyone reading this donated $5 our fundraiser would be over today. Please donate to keep Wikipedia free" and "Only 2 days left to make a tax-deductible contribution to keep Wikipedia free. Please help Wikipedia pay its bills in 2011."


"Moka Hands". This is a fader banner and features two text strings which toggle by fading into each other for equal intervals. These strings are: "If everyone reading this donated $5 our fundraiser would be over today. Please donate to keep Wikipedia free" and "Only 2 days left to make a tax-deductible contribution to keep Wikipedia free. Please help Wikipedia pay its bills in 2011."


Result

Test Time: 2010-12-30 14:04:00 UTC - 2010-12-30 14:44:00 UTC
Sampling Interval = 2 minutes
Testing Interval = 10 minutes
Total Number of Samples per Class = 20

Donations / Impression*: "Jimmy" banner won by 42.68%.
Amount50 / Impression**: "Jimmy" banner won by 49.49%.

WINNER: "Jimmy" Banner


(*) The rate of donations per banner impression over a fixed time interval.

(**) The rate of amount50 per banner impression over a fixed time interval. Amount50 is the dollar amount raised from donations initiated under a given banner where all donations of more than $50 are recorded as $50 donations. This counters the skewing effect of outlier donations.
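
The two metrics defined above can be illustrated with a minimal sketch. The donation amounts and impression count below are hypothetical, and the function names `amount50` and `per_impression` are chosen here for illustration; they are not taken from the fundraising tooling.

```python
def amount50(donations, cap=50.0):
    """Sum donation amounts, counting any donation above `cap` as exactly `cap`."""
    return sum(min(amount, cap) for amount in donations)


def per_impression(total, impressions):
    """Rate of a total (donation count or Amount50) per banner impression."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return total / impressions


# Hypothetical donations (USD) and impression count for one banner in one interval.
donations = [5.0, 20.0, 50.0, 250.0, 10.0]
impressions = 20000

capped_total = amount50(donations)  # the $250 donation is counted as $50
donation_rate = per_impression(len(donations), impressions)
amount50_rate = per_impression(capped_total, impressions)
```

Without the cap, the single $250 donation would account for most of the total raised in this interval; with it, no one donation contributes more than $50 to the rate.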

Methodology

  • Data Cleaning: Check test period in CiviCRM for duplicate donations. This is done to ensure consistency among our donations.
  • Data Cleaning: Break down Donations, Impressions, Amount50 / Impression, and Donations / Impression over the test period and investigate any anomalous data. Discard any data where impressions or donations may have been lost due to human error in setting up the test, systems errors arising from the central notice system, or other uncontrolled interference with the normal operation of the Fundraiser donation pipeline.
  • Using the above data analysis, determine which data to use in confidence testing, and choose a suitable sampling rate and testing interval for hypothesis testing. More information can be found here.
  • Perform Welch's T-Test over averaged parameters from each testing interval, derived from sampled values for Amount50 / Impression and Donations / Impression. The reasons for choosing Welch's T-Test and for modelling with a Student's T distribution are given below.
  • The full testing period runs from 14:04 UTC to 14:44 UTC on December 30th, 2010.

Data Analysis

This section analyzes and interprets the results of the tests.

Data Consistency and Cleaning

During the Fundraiser a great deal of noise was introduced into the natural processing of donations. Campaign banners and landing pages were being reset on an hourly schedule at most points, which opened the opportunity for both human and hardware error to affect the system. Errors could result from internal causes, such as incorrectly named artifacts, artifacts turned on late or turned off early, or server outages, or from external factors, such as spikes or drops in donations or banner clicks caused by other activity on the web. The data was checked thoroughly to determine whether any of these factors were present. The top four plots below display the counts of the data sources over the testing period as verification of the consistency of the donation pipeline data used in testing. A certain amount of natural variance is expected in donation counts and is no cause for alarm. The bottom two figures depict the total views and donations together with those coming from each banner. This illustrates when the campaign becomes active and is also a useful tool for determining where anomalous behaviour may be visible in the data.

It should be noted that the data is analyzed over a period at least as large as the full testing period and that the testing period was chosen based on the period of time where significant hits and donations were observed. Finally, the CiviCRM donation database was checked for duplicate donations and none were found over the test period.

Impressions broken out in two minute intervals over the test period.


Donations broken out in two minute intervals over the test period.


Donations/Impression broken out in two minute intervals over the test period.


Amount50/Impression broken out in two minute intervals over the test period.


The views corresponding to each banner and the total views over the campaign over two minute intervals.


The donations corresponding to each banner and the total donations over the campaign.

Analyzing the above plots, the donation and impression data appear to be quite regular over the interval 2010-12-30 14:04:00 UTC - 2010-12-30 14:44:00 UTC. Therefore, two minute intervals will be used for sampling over this period as a source for the Welch's T-Test to assess confidence in the winner.

Modelling and Hypothesis Testing

The samples of Donations / Impression and Amount50 / Impression are modelled using a Student's T distribution. The assumptions involved are discussed further here. The hypothesis test chosen was Welch's T-Test, which is similar to a Student's T-Test but does not assume that the variances of each class (banner) are equal. Welch's T-Test is a suitable test in this case to determine how likely it is that the rates due to different banners come from truly different distributions (i.e. the difference in the results is not simply due to random noise). In the context of these tests, the nominal variable is the banner class to which donations belong and the measurement variables are samples of Amount50 / Impression and Donations / Impression. The level of confidence here is taken to mean the confidence that we have in rejecting the null hypothesis (that the donation rates are actually equal) given the data. A confidence level of X% is taken to mean that, if the experiment were repeated identically and independently, the alternative hypothesis (that donation rates really are different) would be expected to be observed X% of the time [1]. Note that over short time periods, where anomalous external factors or time-localized factors can have an effect, the alternative hypothesis becomes more likely.
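
As an illustration of the test described above, the following sketch computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom from the overall means and standard deviations reported later in this report. Several things here are assumptions rather than facts stated by the report: that each class contributes n = 20 samples to the test (the "Total Number of Samples per Class" from the Result section), that the test is one-tailed, and that confidence brackets are read from the df = 30 row of a standard t-table. The function names are chosen for illustration only.

```python
import math


def welch_t(mean1, std1, n1, mean2, std2, n2):
    """Return (t, df) for Welch's unequal-variance t-test."""
    v1, v2 = std1 ** 2 / n1, std2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# One-tailed critical values of Student's t at 30 degrees of freedom, from a
# standard t-table. Assumption: using the df = 30 row whenever df >= 30, which
# is slightly conservative.
CRITICAL_DF30 = [(0.99, 2.457), (0.995, 2.750), (0.9995, 3.646)]


def confidence_floor(t):
    """Largest tabulated one-tailed confidence level that |t| reaches, or None."""
    reached = [level for level, crit in CRITICAL_DF30 if abs(t) >= crit]
    return max(reached) if reached else None


N = 20  # assumed samples per class

# Overall parameters reported for Donations / Impression.
t_don, df_don = welch_t(0.00043, 0.00015, N, 0.00030, 0.00010, N)
# Overall parameters reported for Amount50 / Impression.
t_amt, df_amt = welch_t(0.00741, 0.00278, N, 0.00496, 0.00333, N)

print(round(t_don, 2), round(df_don, 1), confidence_floor(t_don))
print(round(t_amt, 2), round(df_amt, 1), confidence_floor(t_amt))
```

Under these assumptions the statistic for Donations / Impression clears the 99.5% critical value but not the 99.95% one, and the statistic for Amount50 / Impression clears 99.0% but not 99.5%. This agrees with the brackets quoted in the results, although it does not confirm that the original analysis was computed this way.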

Care was taken in choosing the sampling intervals. These were made as small as possible while still ensuring that at least one donation would be observed in each interval (two minute intervals in most cases). Smaller intervals yield more samples, which helps in achieving higher confidence in the tests. With the Student's T-Test, all other things being equal, having more data to model with tends to increase the confidence in the result [2].
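
The sampling scheme described above can be sketched as follows: events are counted in consecutive two minute bins over the test period, and the bins are then grouped into ten minute testing intervals of five samples each. The timestamps are hypothetical and `bucket_counts` is an illustrative helper, not part of the original analysis code.

```python
from datetime import datetime, timedelta


def bucket_counts(timestamps, start, end, width=timedelta(minutes=2)):
    """Count events in consecutive half-open bins of `width` covering [start, end)."""
    n_bins = (end - start) // width
    counts = [0] * n_bins
    for ts in timestamps:
        if start <= ts < end:
            counts[(ts - start) // width] += 1
    return counts


start = datetime(2010, 12, 30, 14, 4)
end = datetime(2010, 12, 30, 14, 44)

# Hypothetical donation timestamps, given as seconds after the start of the test.
donation_times = [start + timedelta(seconds=s) for s in (30, 90, 125, 600, 2399)]

counts = bucket_counts(donation_times, start, end)

# Group the two minute samples into ten minute testing intervals (5 samples each).
samples_per_interval = 5
intervals = [counts[i:i + samples_per_interval]
             for i in range(0, len(counts), samples_per_interval)]
```

A 40 minute test period sampled every two minutes gives 20 samples per class, which matches the sample count stated in the Result section.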

The plots below depict the mean and standard deviation of each banner over four ten-minute testing intervals. Samples were taken over two minute periods. The hypothesis testing was performed over the average mean and variance of the test intervals. "Jimmy" won in each case for donations/impression and amount50/impression with increases of 42.68% and 49.49% respectively. The t-test was used to assess confidence over each metric, and the confidence in the winner for donations/impression and amount50/impression is at least 99.5% and 99.0% respectively. Therefore we can be confident that "Jimmy" significantly outperforms "Moka Hands".

Mean and standard deviation over test intervals over Donations / Impression.


Mean and standard deviation over test intervals over Amount50 / Impression

TOTAL DONATIONS "Moka Hands": 115
TOTAL DONATIONS "Jimmy": 152

TOTAL AMOUNT50* RAISED "Moka Hands": $1917.00
TOTAL AMOUNT50* RAISED "Jimmy": $2556.50

* AMOUNT50 indicates the total amount raised where all donations greater than $50 are taken to be a donation of $50.


DONATIONS PER IMPRESSION:

Between 99.5% and 99.95% confident about the winner.

Jimmy VS Moka Hands -- 2010-12-30 14:04:00 - 2010-12-30 14:44:00

item 1  = "Jimmy"
item 2  = "Moka Hands"

The winner "Jimmy" had a 42.68% increase.

interval	mean1		mean2		stddev1		stddev2

0		0.00039		0.00028		0.00004		0.00004
1		0.00046		0.00022		0.00016		0.00014
2		0.00043		0.00033		0.00015		0.00002
3		0.00043		0.00036		0.00020		0.00014


Overall Parameters:

mean1		mean2		stddev1		stddev2
0.00043		0.00030		0.00015		0.00010


AMOUNT50 PER IMPRESSION:

Between 99.0% and 99.5% confident about the winner.

Jimmy VS Moka Hands -- 2010-12-30 14:04:00 - 2010-12-30 14:44:00

item 1  = "Jimmy"
item 2  = "Moka Hands"

The winner "Jimmy" had a 49.49% increase.

interval	mean1		mean2		stddev1		stddev2

0		0.00731		0.00368		0.00344		0.00124
1		0.00731		0.00385		0.00315		0.00418
2		0.00786		0.00625		0.00048		0.00295
3		0.00714		0.00605		0.00298		0.00407


Overall Parameters:

mean1		mean2		stddev1		stddev2
0.00741		0.00496		0.00278		0.00333
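
The "Overall Parameters" rows appear to be consistent with averaging the per-interval means and taking the square root of the averaged per-interval variances. The sketch below checks this reading against the Amount50 / Impression table; it is an inference from the published (rounded) figures, not a description of the original script.

```python
import math

# (mean1, mean2, stddev1, stddev2) for testing intervals 0-3,
# copied from the Amount50 / Impression table.
rows = [
    (0.00731, 0.00368, 0.00344, 0.00124),
    (0.00731, 0.00385, 0.00315, 0.00418),
    (0.00786, 0.00625, 0.00048, 0.00295),
    (0.00714, 0.00605, 0.00298, 0.00407),
]


def average_mean(values):
    """Arithmetic mean of the per-interval means."""
    return sum(values) / len(values)


def averaged_std(stddevs):
    """Square root of the average per-interval variance."""
    return math.sqrt(sum(s ** 2 for s in stddevs) / len(stddevs))


mean1 = average_mean([r[0] for r in rows])
mean2 = average_mean([r[1] for r in rows])
std1 = averaged_std([r[2] for r in rows])
std2 = averaged_std([r[3] for r in rows])

# Relative increase of the first item over the second, in percent.
increase = 100.0 * (mean1 - mean2) / mean2

print(f"{mean1:.5f} {mean2:.5f} {std1:.5f} {std2:.5f} {increase:.2f}%")
```

Because the table entries are rounded to five decimal places, the recomputed increase is close to, but not exactly, the reported 49.49%.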

Discussion

The "Jimmy" banner came out the conclusive winner. However, the problem with generalizing this result is that donation patterns change over hours and days, which could render this result invalid under other conditions. Ideally, a test comparing a "Jimmy" banner to a "non-Jimmy" image banner would be run more thoroughly and over a more general set of conditions.

References

Endnotes

  1. Campaign = "20101230JA089_US"
  2. "Moka Hands" utm_source = "20101230_JAFS004fader_US"
  3. "Jimmy" utm_source = "20101229_JAFS002fader_US"