Case study 2013-02-27

From Meta, a Wikimedia project coordination wiki

This page contains a draft report for an A/B test -- plus some extra graphs to report on an experiment we did on this data sample. We're working to come up with a standard report that we can then apply to a whole bunch of past tests. The reports should do three things:

  1. expose any anomalies in the test data -- for example, if one banner was taken down by accident for part of the test.
  2. show which variable won, and make predictions about the "true" difference that variable will make in the long run.
  3. provide visualizations of the underlying data, the result of the test, and the degree of statistical confidence achieved that can be understood by anyone with a tiny bit of training.

We have many kinds of tests (banner, landing page, design, text, multivariate, etc...). Reports will look different depending on the test type. The type of test represented below is part of a two-variable multivariate banner test. The report here only looks at one of the variables: "All bold" vs "Some bold". One banner had all text bolded, which is our normal style. The other only had some key phrases bolded. Here's what the two banners looked like (in Italian):

[Two images: the All Bold and Some Bold banners, in Italian]

This test ran in many countries at the same time for more than seven days. First, the highest-level results. The graph below may look overwhelming at first if you don't have a background in statistics, but understanding it doesn't require any specialized knowledge of math or stats. Once you understand it, you'll be able to quickly understand future test reports that use similar graphs.

Here's a step-by-step guide to understanding the graph below:

  1. Look at the red dotted line. That is how much All Bold beat Some Bold by when all the data from this very long test are included. The Y axis on the right is the relative percentage by which All Bold won -- almost 10%. A 10% win means that All Bold did 1.10 times as well as Some Bold.
  2. Look at the purple and turquoise lines at the right of the graph. They fall about 2 percentage points above and below the blue line. Those two lines represent the "Confidence Interval". The interval we arrived at by the end of the test is a 95% confidence interval: if we repeated this same test over and over under the same conditions and computed the interval the same way each time, about 95% of those intervals would contain the "true" difference. The plain and simple way of relating to this confidence interval is to say "All Bold beat Some Bold by between about 8% and 12%."
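
To make the two steps above concrete, here is a minimal sketch of how a relative win and a 95% confidence interval around it could be computed from donation and impression counts. It is not necessarily the method used for the graph: it assumes a normal approximation on the log of the ratio of the two donation rates, and the function name and all counts are hypothetical.

```python
import math

def relative_lift_ci(don_a, imp_a, don_b, imp_b, z=1.96):
    """Return (lift, low, high) for rate_a / rate_b - 1.

    Assumption: a normal approximation on log(rate_a / rate_b),
    treating the two banners as independent binomial samples.
    z=1.96 corresponds to a 95% interval.
    """
    ratio = (don_a / imp_a) / (don_b / imp_b)
    # Standard error of the log of a ratio of two proportions
    se = math.sqrt(1 / don_a - 1 / imp_a + 1 / don_b - 1 / imp_b)
    low = math.exp(math.log(ratio) - z * se)
    high = math.exp(math.log(ratio) + z * se)
    return ratio - 1, low - 1, high - 1

# Hypothetical counts, chosen only so that the lift comes out at 10%
lift, low, high = relative_lift_ci(11_000, 10_000_000, 10_000, 10_000_000)
print(f"lift {lift:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

With these made-up counts the interval is wider than the one described above; more donations per banner narrow it.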

Here's a quick verification that the two banners were served roughly equally:

TestA.png

The header of the graph explains that one banner was served more frequently than the other by more than can be expected to happen by chance. We know of one reason why this would happen, and we are beginning to correct for it in our testing. That problem is explained here. There may be other reasons for this skew. Nevertheless, the difference isn't so great that it would significantly affect the results of the test even if we weren't correcting for it. Moreover, all results are calculated in terms of donations per banner impression -- so the skew shouldn't matter at all. There are, however, some potential problems with our data in regard to calculating donations per banner impression that we're still looking into.
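
As an illustration of what "more than can be expected to happen by chance" means, the sketch below tests whether an impression split is consistent with a 50/50 serve. It assumes a two-sided z-test on a binomial proportion, which may not be the test behind the graph header, and the counts are hypothetical.

```python
import math

def split_z_test(imp_a, imp_b):
    """Test whether impressions are consistent with an even 50/50 split.

    Assumption: each impression is an independent coin flip, so the share
    served to banner A is approximately normal around 0.5.
    Returns (share_a, z, two_sided_p_value).
    """
    n = imp_a + imp_b
    share_a = imp_a / n
    z = (share_a - 0.5) / math.sqrt(0.25 / n)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return share_a, z, p_value

# Hypothetical counts: a 50.1% / 49.9% split over ten million impressions
share_a, z, p_value = split_z_test(5_010_000, 4_990_000)
print(f"share A = {share_a:.3%}, z = {z:.2f}, p = {p_value:.2g}")
```

With samples this large, even a tiny skew is statistically detectable, which is why a significant skew can still be too small to matter once results are expressed per banner impression.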

Here's a quick look at which banner motivated the most donations.

TestB.png

We think that providing clear, simple descriptive views of the data is important for ease of interpretation. The graph below shows how many banners we served over the course of the test. This test was run in a large set of countries all around the world. The ups and downs in the graph of banner impressions per hour below correspond to different parts of the world waking up and going to sleep. At the peaks, the highest number of people are awake. (Elsewhere, in other research, we show some data on how traffic in different countries varies over the course of days and weeks.)

This graph is a good way to spot any anomalies that might affect the test results. For example, if one banner was accidentally turned off during a test, you would see one line drop to zero. In this test, you can see that both banners were turned off in a couple of places for periods of the test (in this case for system maintenance). Seeing those problems in the data can help us work around them or reject the test outright. In this case, we slightly modified some of our tests below to account for the gaps in the data.
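
The same check can be done on the numbers directly. The sketch below assumes the data is available as one list of hourly impression counts per banner; the banner keys, the function name, and the counts are hypothetical. It flags every hour in which at least one banner served nothing, and separately the hours in which only one of them was down.

```python
def find_gaps(hourly):
    """Map hour index -> list of banners with zero impressions in that hour.

    hourly: dict of banner name -> list of impressions per hour.
    All lists are assumed to cover the same hours in the same order.
    """
    n_hours = len(next(iter(hourly.values())))
    gaps = {}
    for hour in range(n_hours):
        down = [name for name, series in hourly.items() if series[hour] == 0]
        if down:
            gaps[hour] = down
    return gaps

# Hypothetical hourly impressions for six hours of a test
hourly = {
    "all_bold":  [900, 950, 0, 0, 870, 910],
    "some_bold": [905, 940, 0, 0, 0, 915],
}
gaps = find_gaps(hourly)

# Hours where only some banners were down are the ones that bias a comparison
one_sided = [hour for hour, down in gaps.items() if len(down) < len(hourly)]
print(gaps)
print(one_sided)
```

Hours in which both banners were off (such as maintenance windows) remove data from both sides equally, while one-sided gaps are the case that can skew a result.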

TestD.png

The next graph presents the same data as above, but as average banner impressions per hour of the day over all seven days. The averages only include hours that had banner impressions. This graph is probably not particularly useful for analyzing this test, but we find that just looking at the data in many different ways increases our familiarity with it and improves our ability to interpret results and test better in the future.
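
A minimal sketch of that averaging, assuming a flat list of hourly impression counts whose first entry falls on hour 0 of a day (the function name and data are hypothetical). Hours with no impressions are skipped rather than counted as zeros, so an outage does not drag the average down.

```python
from collections import defaultdict

def mean_by_hour_of_day(hourly_counts):
    """Average impressions for each hour of day, ignoring empty hours.

    Assumption: hourly_counts[0] is hour 0 of the first day and the list
    has no missing entries, so index % 24 gives the hour of day.
    """
    totals = defaultdict(int)
    hours_seen = defaultdict(int)
    for index, count in enumerate(hourly_counts):
        if count == 0:
            continue  # banner was off: leave this hour out of the average
        hour_of_day = index % 24
        totals[hour_of_day] += count
        hours_seen[hour_of_day] += 1
    return {h: totals[h] / hours_seen[h] for h in sorted(totals)}

# Hypothetical data: two days, with an outage in the first hour of day two
counts = [100] * 24 + [200] * 24
counts[24] = 0
averages = mean_by_hour_of_day(counts)
print(averages[0], averages[1])
```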

TestC.png

The next graph shows the donation rates (donations/banner impressions) for each banner type (All Bold vs Some Bold) by hour of the day. Comparing this graph to the above graph shows that donation rates vary in a different pattern from banner impressions. That is because different parts of the world (more specifically: different timezones) have different response rates.

20130410 Allbold v Somebold TestE.png

The next graph shows by how much All Bold beat Some Bold by hour of day. Showing these data by hour of day is a little arbitrary -- it's only a little different from simply dividing up the sample into 24 different random samples. Nevertheless, it helps us understand what's happening with our data anytime we can see a clear visualization of something. In this case, it is interesting to see that there is no clear relationship between the pattern of response rate and the pattern of one banner's victory over the other. It's also encouraging to see that All Bold wins in every hour. But that is not a statistically valid way of measuring the winner. That comes next!
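
For readers who want to reproduce this kind of view, here is a rough sketch of a per-hour comparison next to the pooled one. It assumes donations and impressions are already summed into hour-of-day buckets for each banner; the function names and numbers are hypothetical, and only three buckets are shown to keep it short.

```python
def lift_by_hour(don_a, imp_a, don_b, imp_b):
    """Relative lift of banner A over banner B in each hour-of-day bucket.

    Returns None for a bucket where a rate cannot be compared
    (no impressions, or no donations for banner B).
    """
    lifts = []
    for da, ia, db, ib in zip(don_a, imp_a, don_b, imp_b):
        if ia == 0 or ib == 0 or db == 0:
            lifts.append(None)
            continue
        lifts.append((da / ia) / (db / ib) - 1)
    return lifts

def pooled_lift(don_a, imp_a, don_b, imp_b):
    """Relative lift computed once over all buckets combined."""
    rate_a = sum(don_a) / sum(imp_a)
    rate_b = sum(don_b) / sum(imp_b)
    return rate_a / rate_b - 1

# Hypothetical buckets (three hours only)
don_a, imp_a = [12, 33, 22], [10_000, 20_000, 10_000]
don_b, imp_b = [10, 30, 20], [10_000, 20_000, 10_000]

lifts = lift_by_hour(don_a, imp_a, don_b, imp_b)
overall = pooled_lift(don_a, imp_a, don_b, imp_b)
print([f"{x:.0%}" for x in lifts], f"{overall:.1%}")
```

Each per-hour lift rests on a small slice of the data, so the buckets are noisy on their own. That is one reason counting hourly wins is no substitute for a test on the pooled result.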

This test is one of many that we ran for a very long time for the purpose of experimentation. Now

20130410 Allbold v Somebold TestF.png

20130410 Allbold v Somebold TestG.png

20130410 Allbold v Somebold TestH.png

20130410 Allbold v Somebold TestI.png

TestJ.Sample.1.png

TestJ.Sample.2.png

TestJ.Sample.3.png