Fundraising 2011/Report/Example Methodology testing methodology and reports

Overview

The Wikimedia Foundation meets its financial needs with an annual online fundraiser. We communicate our needs to users through banners at the top of projects, and through appeal letters read by those who click on the banners. The campaign relies on constant testing and optimization of different themes, messages, images, and donation form designs. Because we want to run the most efficient and timely campaign possible, we must keep improving our methods.

Methodology

This section briefly covers the approaches used in testing and assessing results.

Testing Strategy

To measure the performance of the fundraising campaign, three kinds of data artifacts are collected: banner impressions, landing page views, and donations (occurrences and amounts). Here an artifact refers to any one of impressions, landing page views, or donations, while donation pathway artifacts refer to the actual banners and landing pages. Each banner impression and landing page view is uniquely defined by a timestamp, source-name, campaign, country, project, and language. Squid logs are used to track every landing page view and banner impression, while donations are collected separately in the CiviCRM relationship management software. For each donation, CiviCRM stores the donor contact information, donation amount, source-name (which uniquely identifies banners and landing pages), testing campaign, and country of origin. The Squid logs containing the banner and landing page data are parsed, and the data is retained in a MySQL database for further processing.

Once the data is available, artifacts at different stages in the donor pipeline are typically joined by source-name, timestamp, and campaign. From this, several metrics of interest arise that can be used to measure the effect of banner and landing page designs and messages. Some of the more frequently used metrics are listed here:

Metrics used in landing page testing:

D / LPi - Donations per Landing Page Impression
A / LPi - Amount per Landing Page Impression
An / LPi - Amount normal per Landing Page Impression (each donation counts at its actual amount up to the average; larger donations are capped at the average)

Metrics used in banner testing:

D / Bi - Donations per Banner Impression
A / Bi - Amount per Banner Impression
An / Bi - Amount normal per Banner Impression (each donation counts at its actual amount up to the average; larger donations are capped at the average)
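
The three metrics can be illustrated with a short Python sketch. It is not taken from the fundraiser scripts: the function name donation_metrics and the sample figures are hypothetical, and "the average" is assumed here to mean the mean donation amount within the same test.

```python
def donation_metrics(amounts, impressions):
    """Compute D, A and An per impression for one banner or landing page.

    amounts     -- donation amounts attributed to the artifact
    impressions -- banner impressions (Bi) or landing page views (LPi)
    """
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    if not amounts:
        return {"D": 0.0, "A": 0.0, "An": 0.0}

    # Assumption: the cap is the mean donation amount of this test.
    average = sum(amounts) / len(amounts)
    capped = [min(amount, average) for amount in amounts]

    return {
        "D": len(amounts) / impressions,   # donations per impression
        "A": sum(amounts) / impressions,   # amount per impression
        "An": sum(capped) / impressions,   # capped amount per impression
    }


# Hypothetical example: four donations from 1000 impressions.
example = donation_metrics([10, 20, 30, 100], 1000)
```

In this example the single large donation of 100 is capped at the average of 40, so An is less sensitive to outliers than A.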

Some of the prominent features of the Fundraiser are listed below:

Tests are typically exclusively banner or landing page oriented.
Tests are univariate; that is, only a single feature of the items being tested is varied and only a single metric is measured. Multivariate testing may, however, be used to make the strategy more flexible.
Testing runs continuously throughout the fundraiser, and the best-performing donation pathway artifacts are retained.
A test typically lasts between 1 and 2 hours.

Data Modelling and Determining Statistical Significance

The data are generated by sampling counts of banner impressions, landing page views, and donations at fixed intervals over a predefined period of time. For example, n data points are generated over a time period of length T by sampling over intervals of T/n time units. This strategy assumes that the underlying distribution of the rate samples (e.g., Gaussian^[1], Student's t^[2], chi-squared^[3]) is constant over the period T. While this does not hold over long periods, it has been observed to hold approximately over periods of about an hour, which is the length generally used in the testing here.
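
The fixed-interval sampling described above can be sketched as follows. This is only an illustration under stated assumptions: event timestamps are taken to be numbers of seconds, and the function names sample_counts and rate_samples are hypothetical rather than part of the fundraiser scripts.

```python
def sample_counts(timestamps, start, period, n):
    """Count events in n consecutive intervals of length period / n.

    timestamps -- event times in seconds (assumed format)
    start      -- start of the period, in the same units
    period     -- total length T of the period
    n          -- number of sampling intervals
    """
    width = period / n
    counts = [0] * n
    for ts in timestamps:
        offset = ts - start
        if 0 <= offset < period:
            # Guard against float rounding pushing the index to n.
            index = min(int(offset / width), n - 1)
            counts[index] += 1
    return counts


def rate_samples(numerators, denominators):
    """Per-interval rates, e.g. donations per impression.

    Intervals with a zero denominator are skipped.
    """
    return [num / den for num, den in zip(numerators, denominators) if den > 0]


# Hypothetical example: a 10-minute period split into 5 two-minute intervals.
donations = sample_counts([5, 130, 140, 400, 590], start=0, period=600, n=5)
impressions = sample_counts(range(0, 600, 2), start=0, period=600, n=5)
rates = rate_samples(donations, impressions)
```

Each entry of rates is one data point; the n data points together form the sample used in the statistical tests.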

The difficulty with our data is that they are not generated under a fixed distribution. Therefore, although a massive amount of data is at our disposal, simple binomial analyses of donation rates do not always apply, except for measurements that are highly localized in time. This leads us to treat the parameters of the model as themselves being distributed according to some function. Hypothesis testing over the samples is primarily done with a Student's t-test^[4] (strictly, the variant known as Welch's t-test, since the variances of the classes are not assumed to be equal). The test is a means of judging whether the observed difference between the rates in question is larger than chance alone would be expected to produce. In these tests, the nominal variable is the donation pipeline artifact (e.g., banner or landing page), and the measurement variable is the presence of a banner click, "donate" button click, or donation, or a donation amount.

For the tests in the table below, Student's t-test is used to assess whether the difference in performance between artifacts is likely beyond what may be attributed to chance. It may be instructive to test other assumptions about the data using other tests. Further analysis could also reveal other distributions that fit the data better.

The analysis was implemented in Python; the code can be found in the SVN repository^[5].
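
The repository code is not reproduced here. As an independent illustration of the Welch's t-test mentioned above, the following sketch computes the t statistic and the Welch-Satterthwaite degrees of freedom using only the Python standard library. The function name welch_t and the sample rates are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's t-test.

    The two samples may have unequal sizes and unequal variances.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")

    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    if se2_a + se2_b == 0:
        raise ValueError("both samples have zero variance")

    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)

    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical donation-rate samples for two banners.
banner_a = [0.0010, 0.0012, 0.0011, 0.0013, 0.0009]
banner_b = [0.0015, 0.0014, 0.0017, 0.0016, 0.0013]
t_stat, dof = welch_t(banner_a, banner_b)
```

The p-value then follows from the t distribution with dof degrees of freedom. If SciPy is available, scipy.stats.ttest_ind called with equal_var=False performs the same test and returns the p-value directly.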

Hypothesis Testing Strategy

The tests chosen were judged to clarify salient features of the donation process that can be used as a guide for future fundraisers. A Test Interval is an interval over which an estimate of a measurement is computed from a number of samples (e.g., five samples drawn over a test interval of 10 minutes). The samples in a test interval are drawn at uniform spacing; in the example, one every two minutes. The Sampling Interval is the time taken to make a single measurement from a distribution: two minutes in the example. That is, donations per impression can be measured over two minutes of donations and impressions to yield one sample. Given a set of samples, the mean and variance of the metric over the test interval can be computed. An entire test period consists of a set of consecutive test intervals, each of which is subdivided into consecutive sampling intervals.

Note that the sampling scheme is chosen such that (1) each test interval includes at least two samples; (2) test intervals are short enough that the samples are not subject to time-dependent fluctuations in measurement; and (3) there is at least a single measurement per sample. It should be emphasized that this part of the analysis was determined empirically and should be interpreted alongside the observed trends. In the plots generated for each test, each point represents the mean, with error bars showing the standard deviation, for one test interval within the testing period. The title of the plot indicates the full length of the testing period.
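
The relationship between sampling intervals, test intervals, and the plotted points can be sketched as follows. This is a minimal illustration, not the plotting code used for the reports: the function name summarize_test_period is hypothetical, and the input is assumed to be a list of per-sampling-interval metric values in time order.

```python
from statistics import mean, stdev


def summarize_test_period(samples, samples_per_interval):
    """Group consecutive samples into test intervals.

    Returns one (mean, standard deviation) pair per complete test
    interval; a trailing incomplete interval is dropped.
    """
    if samples_per_interval < 2:
        # Condition (1): a test interval needs at least two samples.
        raise ValueError("a test interval needs at least two samples")

    summaries = []
    last_start = len(samples) - samples_per_interval
    for i in range(0, last_start + 1, samples_per_interval):
        chunk = samples[i:i + samples_per_interval]
        summaries.append((mean(chunk), stdev(chunk)))
    return summaries


# Hypothetical example: ten two-minute samples of donations per impression,
# grouped into two 10-minute test intervals of five samples each.
rates = [0.010, 0.012, 0.011, 0.013, 0.009,
         0.015, 0.014, 0.017, 0.016, 0.013]
points = summarize_test_period(rates, samples_per_interval=5)
```

Each pair in points corresponds to one plotted point: the mean gives its position and the standard deviation gives the length of its error bar.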

Reports

Reports on a small sample of tests can be found below:

References

[1] Wald test - http://en.wikipedia.org/wiki/Wald_test
[2] Welch's t-test - http://en.wikipedia.org/wiki/Welch%27s_t_test
[3] Chi-square test - http://en.wikipedia.org/wiki/Chi-square_test
[4] Handbook of Biological Statistics: Student's t-test - http://udel.edu/~mcdonald/statttest.html
[5] Python implementation of confidence testing - http://svn.wikimedia.org/viewvc/wikimedia/trunk/fundraiser-analysis/fundraiser-scripts/compute_confidence.py?view=log
