Research:MoodBar/Time to first feedback

From Meta, a Wikimedia project coordination wiki

MoodBar

Pilot: Early data
(July 2011 - September 2011)
Stage 1: Usage and UX
(September 2011 - May 2012)
Stage 2: Impact on editor engagement
(May 2012 - September 2012)
This page in a nutshell: This report presents an analysis of the usage of MoodBar by registered users on the English Wikipedia. The analysis found that the first mood feedback is usually posted within the first four days after the activation of the feature. Moreover, feedback with a confused or happy mood tends to be reported earlier than feedback with a sad mood.

This report presents an analysis of how new registered users use MoodBar, a new editor engagement feature that allows them to report their mood. Users can report one of three moods (sad, confused, or happy) along with a short comment.

MoodBar is enabled for all new accounts registered since July 25, 2011. However, by default it is not visible to newly registered users until they click on the “Edit” button. We call this event the activation of MoodBar. We analyze the time to first feedback, that is, the time elapsed between the activation of MoodBar and the moment the user posts their first mood feedback.

The main result of this analysis is the estimation of the time window during which users are more likely to send their first feedback. The secondary result is the difference in time to first feedback for different moods. This analysis may inform further design decisions regarding the prominence of the MoodBar UI.

Research Questions[edit]

  1. When do new users report their first feedback?
  2. Which design decisions influence the rate of adoption of MoodBar?
  3. Are some moods reported earlier than others?

Methods[edit]

Figure 1.
MoodBar activation and data censoring. The vertical red line represents the time data are collected.

We perform a survival analysis of the time to first feedback on the set of all Wikipedia editors who have had the MoodBar extension activated. Survival analysis owes its name to the statistical study of the lifetime of patients in clinical trials. It is generally used to study any data that represent the time to a certain kind of event. In this context we do not use it to study editor retention patterns, but simply to estimate how long it takes for a newly registered editor to start using MoodBar.

From a statistical standpoint, our sample is affected by right censoring due to the staggered entry of new users into Wikipedia (also known as generalized type I censoring). Figure 1 illustrates this concept. Consider four users A, B, C and D. All four register an account at different times (light gray segments). Later, A, B, and C click on the “Edit” button, thereby activating MoodBar. Shortly thereafter both A and B post their mood (black segments). User C, instead, has not yet reported any mood by the time we collect the data (vertical red line). We say this observation is censored: all we know about the time to first feedback of user C is that it is greater than the length of the black segment, i.e. C has “survived” up to this point without producing an event. In contrast to these three, user D has never clicked on the “Edit” button, and therefore has never seen MoodBar. We need not take D into account in our analysis.
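The censoring scheme above can be sketched in code. The records below (users A through D, timestamps, field layout) are hypothetical and purely illustrative; the point is only how each user is turned into a (time, event-indicator) pair, with user D excluded entirely:

```python
from datetime import datetime

# Hypothetical records: (activation time, first-feedback time).
# A user with no feedback by collection time yields a right-censored observation.
collection_time = datetime(2012, 4, 25)

users = {
    "A": (datetime(2011, 8, 1), datetime(2011, 8, 2)),    # posted a mood -> event
    "B": (datetime(2011, 9, 10), datetime(2011, 9, 10)),  # posted a mood -> event
    "C": (datetime(2012, 1, 5), None),                    # no feedback -> censored
    "D": (None, None),                                    # never activated -> excluded
}

observations = []
for name, (activated, first_feedback) in users.items():
    if activated is None:
        continue  # user D never clicked "Edit": MoodBar was never shown
    if first_feedback is not None:
        time, event = first_feedback - activated, True    # exact time observed
    else:
        time, event = collection_time - activated, False  # only a lower bound
    observations.append((name, time, event))
```

After this loop, user C contributes only the information that their time to first feedback exceeds the time between activation and data collection.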

In addition to this, we should note that several accounts are inactive (i.e. they will never produce any edit or transaction) and it is highly unlikely they will send any feedback in the future. We filter out inactive accounts from this analysis. The details on data filtering are given later in this section.

Dataset[edit]

We use the data collected since the deployment of the MoodBar extension on the English Wikipedia, which occurred on July 25, 2011. Our dataset spans the first 9 months of usage of MoodBar. It contains the first feedbacks sent by 19,219 users who registered an account after that date. Despite its large size, this group is just a tiny fraction (1.21%) of the 1,589,255 users who registered an account in that period. Finally, selecting only users who clicked at least once on “Edit” (e.g. A, B, and C in Figure 1) reduces the sample to 861,011 users, yielding a percentage of non-censored observations equal to 2.32%.

Filtering inactive accounts[edit]

Figure 2.
Data histograms.

We include a censored observation if either of the following conditions applies:

  • The user performed at least one edit in the 30 days before the time we collected the data, i.e. the user is still active.
  • The user has zero edits and their account was registered at most 5 days before the time we collected the data, i.e. the follow-up time for this user has not yet expired.

Since most users give up editing almost immediately, we choose 30 days as a conservative estimate for the first condition. The second condition is motivated by the fact that many mood feedbacks are sent by users with 0 edits. These users would be discarded if we only considered the first condition.
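The two inclusion rules can be sketched as a simple filter. The record layout (dicts with `edit_count`, `last_edit`, and `registered` keys) is hypothetical, not the schema used in the actual analysis:

```python
from datetime import datetime, timedelta

ACTIVITY_WINDOW = timedelta(days=30)  # condition 1: edited in the last 30 days
FOLLOW_UP = timedelta(days=5)         # condition 2: zero edits, account still young

def keep_censored(user, collection_time):
    """Return True if a censored observation should stay in the sample."""
    recently_active = (user["edit_count"] > 0
                       and collection_time - user["last_edit"] <= ACTIVITY_WINDOW)
    in_follow_up = (user["edit_count"] == 0
                    and collection_time - user["registered"] <= FOLLOW_UP)
    return recently_active or in_follow_up

# Illustrative records: one active editor, one stale editor, one fresh zero-edit account.
now = datetime(2012, 4, 25)
active = {"edit_count": 3, "last_edit": now - timedelta(days=10),
          "registered": now - timedelta(days=200)}
stale = {"edit_count": 1, "last_edit": now - timedelta(days=90),
         "registered": now - timedelta(days=200)}
fresh = {"edit_count": 0, "last_edit": None,
         "registered": now - timedelta(days=2)}
```

Under these rules `active` and `fresh` are kept, while `stale` is filtered out as presumed inactive.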

To choose a threshold for the follow-up time we focused on the subset of mood posts sent by users with 0 edits (all of which are uncensored). We estimated the hazard rate in the same way as the full sample (see the Results section below for more information), and we took the interval of maximum hazard as the follow-up time.

This final step reduces the dataset to 95,585 observations, 20.11% of which are uncensored, i.e. actual mood feedbacks. Figure 2 reports a histogram of the data (in which censored observations are treated as normal observations). We can see that many feedbacks are reported in the first 50 days, but also that many are reported much later, up to 250 days after activation. So which is it? Early or late? A simple histogram does not tell us much, so we need a better-suited method.

Results[edit]

When do new users report their first feedback?[edit]

Figure 3.
Estimated survivor function
Figure 4.
Estimated survivor function (logarithmic scale)
Figure 5.
Estimated hazard rate

Let us denote with T the time to first feedback of a generic user. The survivor function S(t) is the probability that the user will “survive” up to time t without sending the first feedback, that is, S(t) = Pr(T > t). We compute the estimate of the survivor curve using the Kaplan-Meier estimator (Figures 3 and 4). In both figures a censored observation is denoted with the sign '+'. Figure 3 shows that the probability of survival drops sharply on the first day and then decreases steadily[1]. To give a better sense of what happens in the first minutes, we re-plot the survivor function in Figure 4 using a logarithmic time scale (where unit increments denote a 10x change). Because the curve has a first drop between the first seconds (10^-4 days is approximately 8.6 seconds) and the first 15 minutes (10^-2 days is equal to 864 seconds) we can infer that many feedbacks are sent in that period.
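The Kaplan-Meier estimator itself fits in a few lines of Python. This is an illustrative re-implementation, not the code used for the analysis: at each distinct event time it multiplies the running survival estimate by (1 − d/n), where d is the number of events and n the number of users still at risk.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of S(t) from right-censored data.

    times  : observed times (event or censoring)
    events : True if the feedback was observed, False if censored
    Returns a list of (t, S(t)) points at the distinct event times.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    s, curve, i = 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        deaths = at_t = 0
        while i < len(data) and data[i][0] == t:
            at_t += 1
            deaths += data[i][1]  # True counts as 1
            i += 1
        if deaths:
            s *= 1.0 - deaths / n_at_risk
            curve.append((t, s))
        n_at_risk -= at_t  # both events and censorings leave the risk set
    return curve

curve = kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])
```

For this toy sample the curve steps down to 0.8, 0.6, and 0.3 at times 1, 2, and 3; the censored observations at times 2 and 4 produce no step but shrink the risk set.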

However, the survivor curve does not tell us much about when people send their first feedback. It could let us compute the median time to first feedback, provided that certain conditions are met, and this would help us understand what a typical time to first feedback looks like. The bad news is that in our case these conditions are not met: more than 50% of our observations are censored, and thus the median survival time is not defined.[2]

The hazard rate can come to our rescue. The hazard h(t) that our generic user sends a feedback at time t is the probability density f(t) of T rescaled by the probability of surviving up to t, that is h(t) = f(t) / S(t).[3]

The hazard is always defined, and we can plot its estimate using a smoothing kernel to see when users are more likely to send a mood feedback. The estimate of the hazard function is reported in Figure 5 (blue solid line) together with 95% confidence bands (dark gray area) computed via bootstrap. The plot shows unequivocally that the period of highest hazard for posting a mood feedback is the four days immediately after MoodBar is activated. The hazard of sending a feedback is almost constant through the first day, and drops rapidly on the second day. After the fourth day the hazard remains roughly at a baseline level that is much lower than during the first four days. For example, on the tenth day a user is 14 times less likely to send a feedback than on the first day since activation.
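The kernel-smoothing step can be sketched as follows. This is a minimal version assuming a Gaussian kernel and a fixed bandwidth (the actual analysis may have used a different kernel and bandwidth selection): the raw hazard increments d/n from the Nelson-Aalen estimator are smoothed and evaluated on a time grid.

```python
import math

def smoothed_hazard(times, events, grid, bandwidth):
    """Kernel-smoothed hazard estimate from Nelson-Aalen increments.

    At each distinct event time t_i the raw increment is d_i / n_i
    (events over number at risk); h(t) is estimated as the sum of
    these increments weighted by a Gaussian kernel centered at t.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    increments, i = [], 0
    while i < len(data):
        t = data[i][0]
        deaths = at_t = 0
        while i < len(data) and data[i][0] == t:
            at_t += 1
            deaths += data[i][1]
            i += 1
        if deaths:
            increments.append((t, deaths / n_at_risk))
        n_at_risk -= at_t

    def kernel(u):  # standard Gaussian density
        return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

    return [sum(kernel((t - ti) / bandwidth) * dn for ti, dn in increments)
            / bandwidth
            for t in grid]

# Toy data with events clustered early and heavy late censoring:
h = smoothed_hazard([1, 1, 2, 3, 30, 30, 30, 30],
                    [True, True, True, True, False, False, False, False],
                    grid=[2, 15], bandwidth=2.0)
```

With events clustered in the first days, the estimated hazard near day 2 is far larger than near day 15, which is the qualitative shape of Figure 5.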

Which design decisions influence the rate of adoption of MoodBar?[edit]

Figure 6.
A mock-up of the MoodBar notification tooltip. From the MoodBar design document.
Figure 7.
Estimated hazard rate, without tooltip notification.
Figure 8.
Estimated hazard rate, with tooltip notification

What could be the cause of this? Upon activation MoodBar notifies the user with a tooltip (see Figure 6). Is it possible that the behavior we see in Figure 5 is attributable to this feature? Luckily there is a way to test this question.

The tooltip feature was in fact introduced a few months after the deployment of MoodBar, in an effort to increase the saliency of MoodBar within the general MediaWiki UI.[4] Because the tooltip appears only once, all users who activated MoodBar before that date never saw it, while the majority of the other users saw the tooltip exactly once.[5] So, if the increased likelihood of reporting a feedback is due to the increased saliency of the MoodBar link, for this earlier group we should see a flatter hazard.

We thus divided our dataset of users into two groups, depending on when they had MoodBar activated. The first group (without the tooltip) consists of the 9,961 users who had MoodBar activated before the tooltip notification feature was introduced, while the second consists of the 79,744 users who had it activated after this date.[6] We estimated the hazard rate again using the same technique; the results are shown in Figures 7 and 8. A comparison of the two figures confirms our hypothesis: before the introduction of the tooltip the hazard peaks at a lower value, and takes longer to reach the baseline. Moreover, for the pre-tooltip group the baseline hazard after 50 days is higher than in the post-tooltip group.

A note about the pre-tooltip period[edit]

A flatter hazard rate means that moods are sent more uniformly in time, and this is certainly a good thing, because we would like to hear from editors also at later stages of their participation in Wikipedia. So, does this mean that before the tooltip we were better off? Luckily for us this is not the case, and here is why.

As can be seen from the MoodBar data dashboard (in particular the second plot), the number of users who sent a mood before the introduction of both the icon and the tooltip is much lower than after. As we would expect, the introduction of the tooltip notification – and to some extent of the icon before it – increased the saliency of the MoodBar link within the MediaWiki UI. However, this has two distinct effects on user adoption: the first is that, because more people are aware of the possibility to report their mood, we receive (unsurprisingly!) more feedbacks. The second is that, because a one-off notification is eventually forgotten, the vast majority of the feedbacks we get are sent in the first few days after the tooltip is shown.

Are some moods reported earlier than others?[edit]

Figure 9.
Survivor function by mood type

Obviously, we cannot observe the mood type for the censored observations in our dataset, and thus have to discard them all before carrying out this analysis. Figure 9 shows the estimated survival curves of the three samples, together with 95% confidence bands.

In this plot we show only the first 20 days, for consistency with the previous plots. The survival curve for “sad” feedbacks is consistently above the other two for most of the time. To test for differences among the three survival curves we perform a logrank-type test. Because the previous analysis of the full sample tells us that users are more likely to report mood feedbacks very early, we use the Peto variant of the test, which gives more weight to observations in the head of the distribution than to those in the tail. The null hypothesis is that all three samples are drawn from the same distribution. The test rejects the null hypothesis (two-tailed p ≈ 0, χ2 = 82.8). The three samples therefore do not come from the same distribution.
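The logic of the test can be illustrated with a plain two-sample, unweighted logrank statistic. The Peto variant used in the analysis additionally weights each term by the pooled survival estimate; that weighting is omitted in this sketch for brevity:

```python
def logrank_statistic(times1, events1, times2, events2):
    """Two-sample, unweighted logrank chi-square statistic.

    At each distinct event time, compare the observed number of events
    in group 1 with the number expected under the null hypothesis that
    both groups share the same survival distribution.
    """
    pooled = sorted([(t, e, 1) for t, e in zip(times1, events1)] +
                    [(t, e, 2) for t, e in zip(times2, events2)])
    n1, n2 = len(times1), len(times2)
    o_minus_e = var = 0.0
    i = 0
    while i < len(pooled):
        t = pooled[i][0]
        d = d1 = leave1 = leave2 = 0
        while i < len(pooled) and pooled[i][0] == t:
            _, e, g = pooled[i]
            d += e
            if g == 1:
                d1 += e
                leave1 += 1
            else:
                leave2 += 1
            i += 1
        n = n1 + n2
        if d and n > 1:
            o_minus_e += d1 - d * n1 / n                        # observed - expected
            var += d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)  # hypergeometric variance
        n1 -= leave1
        n2 -= leave2
    return o_minus_e ** 2 / var if var > 0 else 0.0

same = logrank_statistic([1, 2, 3], [True] * 3, [1, 2, 3], [True] * 3)
separated = logrank_statistic([1, 2, 3, 4], [True] * 4,
                              [10, 11, 12, 13], [True] * 4)
```

Identical samples give a statistic of zero, while clearly separated samples exceed the 5% chi-square critical value of 3.84 with one degree of freedom.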

Because the effect of the data on the tail of the distribution is low, there are no particular reasons to suspect that discarding the censored observations (which, by definition, influence the tail of the distribution) might bias the result of the test.[7] It should be noted however that discarding the censored observations usually results in underestimating the distribution parameters. Keeping this caveat in mind, the estimated median times for the three moods are reported in the table below:

mood     | observations | median (hh:mm:ss) | 95% lower C.L. (hh:mm:ss) | 95% upper C.L. (hh:mm:ss)
sad      | 2,333        | 2:01:06           | 1:19:03                   | 2:52:39
confused | 5,281        | 0:34:42           | 0:30:05                   | 0:39:10
happy    | 11,605       | 0:31:06           | 0:28:30                   | 0:34:33

Summary[edit]

The first result of the present analysis is that users tend to report their mood only in the period immediately after the activation of MoodBar; after the first four days since activation the chances of sending a feedback become slim (on day 10 the hazard is 14 times lower than on day 1). We impute this behavior to the effect of the tooltip notification that MoodBar shows upon its activation. Our second result, a comparison of the hazard rate before and after the introduction of this feature, seems to support this conclusion.

Of course this second result has to be read in the general context of the usage of MoodBar. The tooltip notification increases the prominence of the MoodBar UI on the screen and, in fact, since its introduction the number of unique daily posts has increased by roughly 50%.[8] Thus what our analysis says is that the beneficial effects of the tooltip reminder on user adoption of MoodBar endure only for a few days. Right now only 3% of all users who have had MoodBar activated have ever sent a feedback.[9] The present analysis suggests that the overall adoption of MoodBar could be improved if users were reminded of the possibility to send feedback multiple times instead of only once – for example by showing the tooltip again after the initial 4-day period.

The third result is that the “happy” and “confused” moods tend to be posted substantially earlier than the “sad” mood. We speculate that the first two might be used either to express satisfaction (for example after a successful first edit) or to report attempts at solving problems, while the latter might cover cases that do not necessarily have to do with an actual (attempted) contribution. Another explanation is that “sad” may be used in the context of more complex transactions, which take longer than a simple edit, and hence the difference in time to first feedback.

To test both hypotheses we should look at the feedback comments themselves and assess whether the “happy” and “confused” moods correspond to more concrete UX issues than those flagged with “sad”.

Source code[edit]

A github repository with the scripts used in this analysis can be found here.

Notes[edit]

  1. By definition of the survivor function, S(t) = Pr(T > t), and Pr(T > t') ≤ Pr(T > t) whenever t' > t, so the survivor function is always monotonically non-increasing.
  2. By definition the median survival time is the value on the x axis where the survivor curve intersects y = 0.5.
  3. The density and the survivor function are related to each other by the property that f(t) is the negative derivative of S(t), i.e. f(t) = −dS(t)/dt.
  4. The actual date of introduction of the tooltip notification is December 14, 2011.
  5. MoodBar stores a cookie to prevent from showing the tooltip every time it is loaded. This means that users who do not save cookies across sessions will see it more than once. However, it is safe to assume that this group of users is only a minority of the total population.
  6. The first group additionally excludes observations taken after the introduction of the icon next to the MoodBar link, which happened on 2011-11-01. This was needed because we cannot tell whether those users saw the tooltip or not. This means that the increased saliency of the MoodBar UI is also due to this other element.
  7. In particular, the difference is still significant also under the standard Logrank test, which weighs all observations equally.
  8. See our MoodBar data dashboard, in particular the second graph.
  9. See the graph of the MoodBar data dashboard about this metric.