Research:MoodBar/Experimental design

Introduction

The present report describes the experimental conditions of the MoodBar study. Besides documenting the MoodBar study itself, it is also intended as a short primer about experimental design in an online setting. The text should be accessible to a general audience of people interested in the new editor engagement initiative. However, some background knowledge in basic statistics (en:sample size, en:standard deviation, etc.) is required to understand the more technical parts of the document. General information on MoodBar can be found here; background details about research on MoodBar can be found in the parent page and its other subpages.

Motivations for performing experiments

Our objective is to assess whether the possibility for users to report their mood and to receive feedback about their editing experience has any effect on editor retention. The idea is that reporting feedback about editing, and possibly receiving a meaningful response, may help newbies overcome their initial problems with participating in Wikipedia. We choose to measure the retention rate of newly-registered editors at 1, 2, 5, 10, and 30 days since the first time a newly-registered user clicks on the edit button.[1]

A way to test for the above, which is also the de-facto standard in terms of scientific rigor, is a randomized experiment. The idea is simple: we randomly choose two groups of users. We make MoodBar available to users in the first group (the treatment), and do nothing for the second group (the control). We then observe the rates of retention across the two groups and see whether they are, in a statistical sense, significantly different (and, of course, better for the treatment group!).

The first question is: do we really need to perform an experiment at all? Can't we just take the current population of MoodBar users and the broader population of Wikipedia editors and compare their retention rates? The second question (given we answer the first affirmatively) is whether this is the best design for our needs.

We tackle the first question first. The answer is of course yes, but to explain why we have to talk about confounding factors. This will also help us understand how to answer the second question: because of certain inherent characteristics of the MoodBar treatment, the best we can do is to run a special type of randomized experiment, called a natural experiment.

Self-selection and confounding factors

Why do we need to perform an experiment? And why do we need to take random groups? The technical reason is that we want to control for confounding factors that might play a role in determining the retention of our editors. Factors related to self-selection are, in particular, the obvious suspects.

Let us give an example to illustrate why self-selection is a problem. Because one has to include a short text message describing the feedback, people who have sent feedback through MoodBar might be more communicative than average.[2] If you are a better communicator, you might run into fewer problems when interacting with fellow editors, and this would translate into higher chances of long-term participation.

Another possible confounding factor related to self-selection is tech-savviness. We know it is not easy to spot the link that opens MoodBar, and this might be evidence that those who send feedback are also more knowledgeable about wikis (and about geeky stuff in general!). Again, this translates into higher chances of long-term activity on Wikipedia.

In a nutshell, the very fact that a user decides to send a feedback message might correlate with factors that translate into higher chances of long-term retention; thus, without a proper randomized experiment, we would run the risk of wrongly attributing to MoodBar the merit of improving retention when that is not really the case.

A natural experimental design

In the previous section we argued that confounding factors can be a problem when one wants to compare certain characteristics (in our case, the retention rate) of a self-selected group with the general population from which that group is drawn. We also argued that randomization is usually the cure for the problem of self-selection, and we described the classic design of a en:randomized trial. We can now answer the second question: is this design the best suited to our situation? The answer is negative: even by making MoodBar available to only one of two randomly chosen groups, we would not avoid the phenomenon of self-selection. In fact, the decision to send feedback is (by definition!) voluntary. Thus we would still have self-selection at work, within the treatment group only.

There are also technical and legal limitations that prevent us from performing a fully randomized trial,[3][4] but the main reason is simply that this design would not cure the problem for which we wanted to employ it. How do we circumvent this problem?

Enter natural experiments. A natural experiment is an experiment in which randomization is not performed by the experimenters (that is, us), but by ________ (fill in the blank with your favourite: Nature, Chance, the Flying Spaghetti Monster, etc.). The idea is, again, simple: we designate a certain period of time (let us call it the treatment window) and we add a logical switch to the code of MoodBar that prevents it from enabling itself for all users (and only those) who registered in that period. Everything else will be exactly the same as what any other Wikipedia user experiences. The only difference is that the users in the treatment group will never have the chance to see MoodBar.[5]
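
To make the mechanism concrete, here is a minimal sketch of the switching logic, written in R purely for illustration (the real switch lives in the MoodBar extension code, and the window boundaries below are made-up placeholders):

window_start <- as.POSIXct("2012-07-01 00:00:00", tz = "UTC")   # hypothetical window start
window_end   <- as.POSIXct("2012-07-15 00:00:00", tz = "UTC")   # hypothetical window end

moodbar_enabled <- function(user_registration) {
  # MoodBar stays disabled for (and only for) accounts created inside the window
  !(user_registration >= window_start & user_registration < window_end)
}

moodbar_enabled(as.POSIXct("2012-07-05 12:00:00", tz = "UTC"))   # FALSE: treatment group
moodbar_enabled(as.POSIXct("2012-08-01 12:00:00", tz = "UTC"))   # TRUE: MoodBar shown as usual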

The assumption underlying this scheme is that all relevant confounding factors will act on both groups with the same strength or, in simpler words, that in both groups the rates of tech-savvy people, of better vs. poorer communicators, etc., will be approximately the same.[6] Of course, not all possible confounding factors can be controlled in this way. For obvious reasons, retention is in general worse during the summer than during the winter. However, if the treatment window is not too long, then we can safely assume that changes in the population of users due to seasonal variation will not influence the response variable (that is, the retention rate) too much.[7]

Estimation of treatment window length

In this part we go a bit deeper into the technical details; in particular, we describe how to estimate the size of the two groups of users we want to compare. This parameter will tell us how long the treatment window must be, and thus for how long we “switch off” MoodBar.[8]

Why we need to care about the sample size

The first consideration is about the kind of en:statistical test we want to use in order to perform our assessment of the retention rate. Our variables -- the retention rates -- are average proportions of users still active after a certain number of days. The simplest way to test for differences in average proportions is the en:t-test for the location parameter of the distributions of two samples. This will tell us whether the differences we see are just due to natural random fluctuations of the variables (this is the so-called en:null hypothesis, H₀), or whether the two underlying distributions are truly different (which means that H₀ is disproved). In the latter case we shall conclude that MoodBar does have an effect on editor retention.[9] But before rushing to collect data and crunch the numbers with R, we should first decide what size our two samples should be.
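
As a minimal sketch, assuming two groups of users with 0/1 retention indicators (the retention rates below are made-up numbers, not MoodBar data), such a test can be run in R as follows:

set.seed(42)
control   <- rbinom(5000, 1, 0.10)   # hypothetical control group, 10% retention
treatment <- rbinom(5000, 1, 0.11)   # hypothetical treatment group, 11% retention

t.test(treatment, control)                   # two-sample t-test on the 0/1 indicators
prop.test(c(sum(treatment), sum(control)),   # equivalent comparison of the two proportions
          c(length(treatment), length(control)))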

Why do we need to decide on a sample size beforehand? The reason is simple: we want to minimize the chances of making a mistake when testing whether the null hypothesis (H₀: MoodBar has no effect) is true or not. This is something many practitioners fail to appreciate, but statistical tests, being probabilistic decisions, can be wrong. The good news is that, by their very nature, we know how often a given test will be wrong, and we can set the rates of the possible errors we might incur.

How many types of errors can we make? The answer is two. We have two options (either H₀ is true or it is not) and the test returns one of two outcomes, so we have a total of four possible situations. In two of them the test gives us the right answer: either H₀ was true and the test tells us exactly that (it fails to reject a true null hypothesis: a true negative), or H₀ was false and the test correctly rejects it (a true positive). In the other two situations we commit an error: either H₀ was true but the test rejects it, coming to the wrong conclusion (a false positive, or en:Type I Error), or H₀ was false but the test fails to reject it (a false negative, or en:Type II Error). So, to answer our original question, we set the sample size so as to control, at the same time, the rate of false positives and the rate of false negatives.
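
The following small simulation (with made-up retention figures) shows what controlling the false positive rate means in practice: when the null hypothesis is true, a test run at significance level 0.05 should wrongly reject it in about 5% of the experiments.

set.seed(1)
alpha <- 0.05
false_positive_rate <- mean(replicate(2000, {
  a <- rbinom(1, 1000, 0.10)   # retained users in group A (true rate 10%)
  b <- rbinom(1, 1000, 0.10)   # retained users in group B (same true rate: H0 holds)
  prop.test(c(a, b), c(1000, 1000))$p.value < alpha
}))
false_positive_rate   # approximately 0.05, as expected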

In order to do so, we need to have an idea of the magnitude of the difference between the treatment group and the control group -- the so-called en:effect size -- that we expect to see. The estimation of the effect size and of the sample size is called en:power analysis of a test.
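
Before turning to the actual MoodBar figures in the next section, here is what an a priori power analysis looks like in base R, with purely hypothetical retention rates:

# How many users per group are needed to detect a difference between a 10% and
# an 11% retention rate, at significance level 0.05 and with 80% power?
power.prop.test(p1 = 0.10, p2 = 0.11, sig.level = 0.05, power = 0.80)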

An example of power analysis: estimation of effect and sample size for the MoodBar experiment

Let us now give an example based on the MoodBar experiment. The rate of false positives is controlled by the significance level of the test, α. It is customary to accept at most 1 such error every 20 tests, which translates into a value of α = 0.05.

The rate of false negatives is controlled by the parameter β; the quantity 1 − β is called the power of the test. A power of at least 1 − β = 0.8 is usually considered acceptable in most applications. Given these two parameters, the formula for estimating the minimum sample size n required by our test is the following:

n \geq \frac{2\sigma^2 \left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{d^2}

In the above formula, d is the en:effect size, σ² is the variance of the data, and z_q denotes the q-th quantile of the standard normal distribution. But how can we estimate these parameters before we even run the experiment? One way is to do a pre-assessment (a priori en:power analysis) using the MoodBar usage data we already have. We can compute the retention rate with the following query:

-- Retention flag per new user: 1 if the user's most recent revision falls at
-- least ? days after their first edit click (ept_timestamp), 0 otherwise.
SELECT
    ept.ept_user as user_id,
    IFNULL(MAX(rev_timestamp) - INTERVAL ? DAY >= DATE(ept.ept_timestamp), 0) as retention
FROM
    edit_page_tracking ept
JOIN
    user u
ON
    u.user_id = ept.ept_user
-- LEFT JOIN keeps users with no revisions at all; IFNULL maps them to retention = 0
LEFT JOIN
    revision r
ON
    ept.ept_user = r.rev_user
WHERE
    -- restrict to accounts registered after the latest MoodBar iteration was deployed
    DATE(u.user_registration) >= @min_registration
GROUP BY
    ept.ept_user

In the above code, the question mark ? will be substituted with the actual number of days at which we want to compute the retention. The variable @min_registration lets us select only users who used the latest developmental iteration of MoodBar. We use the above query to create a table with all our variables (retention at 1, 2, 5, 10, and 30 days respectively) and then join it with the list of MoodBar users to compute the average retention. For more information see the source code here and here.
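
As a sketch of how the query output can then be used (the file and column names below are assumptions, not the actual pipeline), we can split new users into those who sent at least one MoodBar feedback and everybody else, and compute the average retention of each group:

all_users   <- read.csv("retention_30d.csv")            # query output: user_id, retention (0/1)
moodbar_ids <- read.csv("moodbar_users.csv")$user_id    # users who sent at least one feedback

treatment <- all_users[all_users$user_id %in% moodbar_ids, ]
control   <- all_users[!(all_users$user_id %in% moodbar_ids), ]

mean(treatment$retention)   # average 30-day retention of MoodBar users
mean(control$retention)     # average 30-day retention of all other new users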

Once we have the retention figures for our two groups (treatment: users who sent a MoodBar feedback, control: the rest of the new-user population), we can compute the effect size d and the standard deviation σ. There are several ways to estimate the effect size in our case. We choose the formula due to Hedges (see the previous link), which takes into account the variances of both groups (pooled variance).
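
A sketch of the pooled-variance calculation, reusing the two hypothetical data frames from the previous sketch (retention is a 0/1 variable, so means and variances follow directly from the observed proportions):

n1 <- nrow(treatment); n2 <- nrow(control)
m1 <- mean(treatment$retention); m2 <- mean(control$retention)
v1 <- var(treatment$retention);  v2 <- var(control$retention)

sd_pooled <- sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))   # pooled standard deviation
d <- m1 - m2        # raw difference in retention rates
d / sd_pooled       # standardized (Hedges-style) effect size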

Having estimated the effect size d, we can compute, by means of the above formula, the minimum sample size n. Given this parameter, and considering that there are on average 1,500 new live accounts every day, we finally obtain the length of the treatment window.[10]
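
A back-of-the-envelope version of this calculation in R, with hypothetical values for the effect size and the pooled standard deviation (the actual figures live in the spreadsheet mentioned below):

alpha <- 0.05; power <- 0.80
d     <- 0.01    # assumed difference in retention (one percentage point)
sigma <- 0.30    # assumed pooled standard deviation of the 0/1 retention variable

n <- ceiling(2 * sigma^2 * (qnorm(1 - alpha / 2) + qnorm(power))^2 / d^2)
n                   # minimum sample size per group (about 14,000 with these assumed values)
ceiling(n / 1500)   # treatment window length in days, at ~1,500 new live accounts per day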

You can see all the details of the calculation in this spreadsheet (currently only available to Wikimedia Foundation staff). Together with the group of MoodBar users, the spreadsheet also reports estimates for the sample of MoodBar users who received at least one response, and for those who received a useful response.

The results are not as good as we expected. The estimated effect size d is in fact very small, mostly because the fraction of MoodBar users over the total population of editors is very small, which means that the estimated retention of the control group (that is, the general population minus the MoodBar users) differs by less than one percent from that of the treatment group. Increasing the number of MoodBar users would increase d, bringing the window length down to much more reasonable values. The above spreadsheet has a boost factor cell that controls the ratio of MoodBar users to the total sample of editors for each treatment. Increasing this value (for example from 1x to 2x) will give the corresponding treatment window length.

Conclusions & Recommendations

Let's imagine you come up with an idea for increasing editor engagement. Together with the community you design a new MediaWiki feature, make it bug-free, and deploy it. You then advertise it community-wide. Finally, you sit back and watch whether those who use it are indeed more engaged than average. Are you doing it right?

After reading this document, you should at least consider the following issues:

  1. Is there any external factor that might contribute to the retention patterns you see? If the feature requires that editors perform certain actions, and adoption is on a voluntary basis, then you should consider whether other characteristics of your self-selected sample could equally explain any higher (or lower!) retention rate you see.
  2. Do you want to perform a controlled randomized experiment? Again, plan it in such a way that your samples are truly randomized. If you ask users to perform a certain action and you collect data only on those who actually did it, then there are good chances that you are running into self-selection bias again.
  3. Choose your statistical methodology beforehand, and try to estimate how large a difference you expect to see. Historical data can be useful for this, but also keep in mind that those estimates can be conservative, and thus you should also perform a post-hoc power analysis after you have collected the data (see the sketch below).
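
For example (with made-up observed figures, and assuming base R), a post-hoc power calculation is a one-liner:

# Given the group size and retention rates actually observed, how much power did the test have?
power.prop.test(n = 15000, p1 = 0.10, p2 = 0.11, sig.level = 0.05)$power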

In the specific case of MoodBar, the a priori power analysis told us that the effect size is going to be very small, and thus that we need to collect a large sample of users to test for differences in retention at 30 days since first edit click between users who sent feedback and the general population.[11] Because this would translate into a large treatment window, we face a decision. We can either:

  1. Focus only on variables with a high effect size (for example, dropping the retention at 30 days); or
  2. Increase the rate of MoodBar adopters by improving the visibility of the MoodBar link, so that the effect size increases as well.

Or, of course, we can do both. Each option has side effects that we should also take into account. For example, increasing the visibility of MoodBar is likely to change the composition of the sample, and this may have counterproductive consequences on the effect size d. Another power analysis after the UI changes are deployed is thus advisable.

References & Notes

  1. This is the event that actually triggers the activation of MoodBar.
  2. Feedback messages are limited to 150 characters (Twitter-like), so having the gift of conciseness helps a lot in getting a meaningful reply.
  3. The legal ones are mostly related to our privacy policy.
  4. On the technical side, the main problem is that we can access the information about which group a user belongs to only when that user makes a contribution to the site (e.g. an edit, a mood feedback, etc.), but not when he/she doesn't perform any.
  5. Technically, we are manipulating an en:instrumental variable.
  6. More precisely, that differences in the response variable across the two groups due to confounding factors alone are negligible.
  7. This is not to say that we should not control for seasonal trends, but only that if the treatment window is not too long it is reasonable to assume that the change due to this kind of external factor will be negligible.
  8. Users with an account registered before the treatment window will see MoodBar as usual. During the treatment window, only newly registered accounts will have MoodBar disabled. After the treatment window is over, newly registered accounts will get MoodBar back, but the group that registered during the window will still have it disabled. This is required because we want to measure retention, which is a longitudinal feature.
  9. Technically speaking, we would really conclude that the null hypothesis was disproved. Science advances by throwing away bad hypotheses.
  10. For the figure on the daily rate of new live accounts, see the second graph here.
  11. in the order of , with .