Discovery/Testing

From Meta, a Wikimedia project coordination wiki

This page documents the Discovery team's A/B testing guidelines and approach, as well as individual A/B tests that have been run, are being run, or will be run.

Guidelines

We try to run our A/B tests in a standardised way (although we're always learning new things) and document that standardised way here. The top-level summary of each stage in our process is:

Stage | Responsibility | Description | Needs a Phabricator ticket?
----- | -------------- | ----------- | ---------------------------
Definition | Product Manager, Analysts | Identify what we want to test, how, and over what time range. | No
Specification | Analysts | Specify exactly what fields we're going to need, how they should be structured, and how the sampling should work. | No (but that's the deliverable)
Implementation | Engineering | Build a test according to the analysts' specification, communicating back (in email or via Phabricator) any limitations. | Yes
Description | Analysis, Engineering | Describe the implemented model of the test on this page. | Yes
Deployment and checking | Analysis, Engineering | Deploy the test and check it is performing as expected. | Yes
Disable and analyse | Analysis, Engineering | Disable the test and analyse the final results. | Yes
Announce test | Analysis | Announce the test to our users by adding it to the next status update. | No
Retrospective | Analysis | Schedule a 25-minute retrospective, inviting the relevant Product Manager(s), developers, and other parties. By default, it can be facilitated by the Team Practices member who is supporting Discovery. | No

Definition (Product + Analysis)

The first requirement is to define the test: what KPI or metric are we trying to address? What is the proposed change that we expect to alter that KPI or metric? Do we consider an increase or decrease in it to be a success? This step is undertaken by Product (to set the high-level requirements) and Analysis (to talk about what is or is not possible, or how we'd go about testing things). It usually happens in a meeting, rather than on a Phabricator ticket.

As well as the "meta" questions, this stage should also identify when we should test. Seasonality can be incredibly important to user behaviour, and so timing has an impact on how generalisable our results might be.

Specification (Analysis)

Once the high-level questions have been answered, Analysis creates a Phabricator ticket that specifies the data we need to collect. This includes:

  1. Whether raw logs or EventLogging should be used;
  2. What, specifically, that schema should look like;
  3. The buckets;
  4. The sampling rates for each bucket;
  5. The population the schema or logs should be aimed at. This is especially important, since we should not be testing features on populations who would not be affected by those features.

If Engineering has questions about any of these elements (or elements not listed here) they should be directed at the Analysis team at the earliest possible opportunity.

Implementation (Engineering)

The specification is then handed off to Engineering for implementation, that is, building the schema or raw log setup. This should have a Phabricator ticket associated with it.

Engineers, once they're done with the implementation, should come back to the Analysis team to confirm that it meets the specification, and to report any practical variation they're aware of - for example, data collection being directed at people with particular browser capabilities because of the test's requirements.

Another example is an A/B test wherein we rearrange the top 10 language links around the Wikipedia globe logo after detecting the user's primary preferred language to see how their experience changes compared to the default English-centric design. An unbiased (purely random) sampling system would sample the whole population, even if, say, 72% of the sampled population is made up of English-only speakers. Under random sampling, 72% of the control and test groups would not be affected by what we're trying to test. Therefore, the sampling system for that particular test should actually take into consideration the user's preferred languages before including them in the test. Bucketing should still (and always) be done randomly, though.
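The language-links example above can be sketched in code. This is a minimal illustration, not the implementation used for any real test: the function name, the `sampling_rate` default, and the rule "at least one non-English preferred language" are assumptions made for the example. The point it shows is that eligibility is decided first, and bucketing stays purely random.

```python
import random


def sample_for_test(preferred_languages, rng, sampling_rate=0.1):
    """Return "control", "test", or None if the user is not in the test.

    Hypothetical sketch: the eligibility rule and sampling_rate are
    assumptions for illustration only.
    """
    # Eligibility: skip users the change could not affect
    # (here: users whose preferred languages are all English variants).
    eligible = any(
        lang.lower().split("-")[0] != "en" for lang in preferred_languages
    )
    if not eligible:
        return None

    # Sampling: only a fraction of eligible users enter the test.
    if rng.random() >= sampling_rate:
        return None

    # Bucketing: always random, never based on user attributes.
    return rng.choice(["control", "test"])
```

For example, `sample_for_test(["en-GB", "en"], random.Random())` always returns `None`, while a user with `["de", "en"]` enters the test with probability `sampling_rate` and is then assigned to either bucket with equal probability.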

Description (Analysis + Engineering)

At this point we know what we intend to test, how we intend to test, and not only the theoretical parameters for that test but what it actually looks like. The next step is to document it on this page (see the examples below). There should be a Phabricator ticket for that summary.

Describing the test is primarily an Analysis task, but Engineering may need to be involved too if there are bits of the implementation that demand very-much-insider knowledge.

Deployment and Checking (Engineering + Analysis)

So we've got our test defined, implemented and documented. Next, we deploy on a specified date. This should have a Phabricator ticket associated with it, as well as a specified end date that allows the test to run for however long the analysts think is necessary. This is never going to be less than a week, due to temporal differences in user behaviour. Any proposal for a test less than 1 week in length will almost certainly end in headshakes.

Once Engineering confirms the deployment was a success, Analysts should start working on validating the data coming in. Again, this should have a Phabricator ticket associated with it. Validation consists of:

  1. Checking there is data;
  2. Checking the data represents all test groups;
  3. Checking the data represents all user options;
  4. Checking the data represents all test groups approximately equally.
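
The four validation checks above could be automated along these lines. This is only a sketch: the event fields `group` and `option`, the reading of "user options" as the set of actions a user can take, and the 10% relative `tolerance` for "approximately equally" are all assumptions, not part of our schemas.

```python
from collections import Counter


def validate_events(events, expected_groups, expected_options, tolerance=0.1):
    """Run the four checks on a list of event dicts.

    Hypothetical event shape: {"group": ..., "option": ...}.
    """
    group_counts = Counter(event["group"] for event in events)
    options_seen = {event["option"] for event in events}

    has_data = len(events) > 0
    all_groups = set(expected_groups) <= set(group_counts)
    all_options = set(expected_options) <= options_seen

    # "Approximately equally": each group's share of events is within
    # `tolerance` (relative) of an even split.
    balanced = False
    if has_data and all_groups:
        expected_share = 1 / len(expected_groups)
        balanced = all(
            abs(group_counts[g] / len(events) - expected_share)
            <= tolerance * expected_share
            for g in expected_groups
        )

    return {
        "has_data": has_data,
        "all_groups": all_groups,
        "all_options": all_options,
        "balanced": balanced,
    }
```

A test passes validation only if every value in the returned dictionary is `True`; a failure on any one check is a reason to go back to Engineering before more data is collected.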
Switching-off and analysis (Analysis + Engineering)

After N days (whatever the Analysis team specified), the test should be disabled; all users should be returned to the default user experience. Again, a Phabricator ticket should exist for this task - and it should also exist for the final piece of the puzzle, the analysis of the results.

This work should be performed by the Analysis team and documented on Meta, on this page. Hosting for the code and (publicizable, aggregated) data is on GitHub, under the Wikimedia Research organization; we default to MIT and CC-0 or CC-BY-SA licensing for code and reports. Analysis is not complete until it has not only been handed off to the Product Manager with the Analysis team's recommendation, but also been placed on Commons in a human-readable form and documented on this page.

Tests/Experiments

This documents the actual experiments we have run, over our various projects, using some variant on the methods above. Adding an entry to this section is a mandatory part of both the implementation and reporting stages of our process.

Testing (generally) occurs using URLs like these:

Sections here provide:

  1. What the purpose was;
  2. What we did;
  3. What the result was;
  4. Where the code lives;
  5. Where the report lives.

2015

July-September ("FY 2015-16 Q1")
Confidence
Status: Done
Project: Cirrus
Analysis Codebase: GitHub
Report: Meta
Result: Failure

Corresponding Phab tickets: Deploy A/B test, Analyse initial results, Report final results

For the experimental group, we reduced confidence from 2.0 to 1.0, and changed the smoothing algorithm from "stupid_backoff" to "laplace". Our rationale for this was that reducing confidence brings in more suggestions, while increasing the smoothing of the suggester makes it prefer better ones. We expected that some search users who would have received zero results and no suggestion would now receive suggestion(s) along with the corresponding results. We did not find an unambiguous improvement in outcome for users subject to the experimental condition, and recommend disabling the experiment and returning users to the default search experience - coupled with some pointers on future A/B test design.
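
To make the two conditions concrete, here is a sketch of what the corresponding Elasticsearch phrase suggester requests might look like. It uses the generic Elasticsearch request shape, not the actual CirrusSearch configuration: the suggestion name `did_you_mean`, the field name `suggest`, and the `discount` and `alpha` values (Elasticsearch's documented defaults) are assumptions; only the confidence values and smoothing model names come from the test description.

```python
import json


def phrase_suggest_body(text, confidence, smoothing, field="suggest"):
    """Build a phrase-suggester request body (hypothetical field name)."""
    return {
        "suggest": {
            "text": text,
            "did_you_mean": {
                "phrase": {
                    "field": field,
                    "confidence": confidence,
                    "smoothing": smoothing,
                }
            },
        }
    }


# Control: confidence 2.0 with stupid backoff smoothing.
control = phrase_suggest_body(
    "exmaple query", 2.0, {"stupid_backoff": {"discount": 0.4}}
)
# Experimental group: confidence 1.0 with Laplace smoothing.
experiment = phrase_suggest_body(
    "exmaple query", 1.0, {"laplace": {"alpha": 0.5}}
)

print(json.dumps(experiment, indent=2))
```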

Phrase Slop Parameter
Status: Done
Project: Cirrus
Analysis Codebase: GitHub
Report: Meta
Result: Failure

Corresponding Phab tickets: Preliminary analysis, Deploy, Analyze initial results, Final analysis & report

The slop parameter indicates how far apart terms are allowed to be while still considering the document a match. "Far apart" means: how many times do you need to move a term in order to make the query and document match? We hypothesized that increasing the slop parameter would yield fewer zero-result queries. This was based on our intuition that less restrictive (more flexible) phrase matching should find more documents that match the terms. In summary, we did not find a consistent improvement in outcome for users subject to the experimental conditions, and recommend disabling the experiment and returning users to the default search experience - coupled with some investigation suggestions.
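
As an illustration of the "how many moves" idea, the sketch below computes the smallest slop at which a phrase query would match a document. It is a simplified model of sloppy phrase matching as it is commonly described, not the code the search backend runs, and the function name `min_slop` is made up for this example. Each query term is paired with a position in the document, and the slop needed is the spread between how far the terms had to shift relative to their place in the query.

```python
from itertools import product


def min_slop(query_terms, doc_terms):
    """Smallest slop at which the phrase matches, or None if a term is missing.

    Simplified model: does not reproduce every edge case of a real
    search engine (e.g. stop words or analysis steps).
    """
    candidate_positions = []
    for term in query_terms:
        positions = [i for i, t in enumerate(doc_terms) if t == term]
        if not positions:
            return None
        candidate_positions.append(positions)

    best = None
    for combo in product(*candidate_positions):
        if len(set(combo)) < len(combo):
            continue  # a document position can stand in for only one query term
        # Shift of each term = document position minus position in the query.
        shifts = [pos - offset for offset, pos in enumerate(combo)]
        spread = max(shifts) - min(shifts)
        if best is None or spread < best:
            best = spread
    return best
```

Under this model, `["quick", "fox"]` needs a slop of 1 to match "the quick brown fox" (one intervening word), and swapping two adjacent terms costs a slop of 2.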

Completion Suggester
Status: Done
Project: Cirrus
Analysis Codebase: GitHub
Report: phab:F2757462
Result: Success

Corresponding Phab tickets: Deploy, Verify initial data, Fix EL and restart test, Verify new data, Analyze

The completion suggester is meant to replace prefix searching, with the aim of increasing recall. Completion suggester benefits include: fewer typos suggested (e.g. if the redirect is close to the canonical title we will display the canonical title), fuzzy searching, and ignoring stop words. The drawbacks of completion suggester are: precision is sometimes very bad because the scoring formula is not well designed for some kinds of pages, fuzzy searching can display weird results, and page titles with punctuation or special characters only are not searchable.

  • Zero results rate in the searches using completion suggester goes down by 7.8%, with a 95% probability that the difference is between 6.4% and 9.1%. These searches are also 1.1-1.2 times more likely to get nonzero results than those in the control group.
  • Completion suggester was actually more effective when searching German Wikipedia:
    • Zero results rate goes down by 6.3%-11.8% in German Wikipedia searches using the completion suggester. The odds of getting nonzero results when using the suggester were 1.5-2.2 times those of controls.
    • In English Wikipedia, the zero results rate goes down by 5.8%-9% when using the completion suggester. The odds of getting nonzero results were 1.5-1.8 times the odds of controls, which is still good, but not as good as in German Wikipedia searches.
  • Further research is needed to assess the suggester's impact on the quality of the results, not just their quantity.
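
The three kinds of figures quoted above (change in zero results rate, "times more likely", and odds) are different measures. The sketch below shows how each point estimate is computed from raw counts; the function name is an assumption, any counts passed to it are illustrative rather than data from this test, and it does not reproduce the intervals in the report, which require a full statistical model.

```python
def compare_zero_results(control_zero, control_total, test_zero, test_total):
    """Point estimates comparing a test group against a control group."""
    # Probability of getting at least one result in each group.
    p_control = (control_total - control_zero) / control_total
    p_test = (test_total - test_zero) / test_total

    return {
        # Difference in zero results rate (negative = fewer zero results).
        "zero_rate_change": test_zero / test_total - control_zero / control_total,
        # "X times more likely to get nonzero results" (relative risk).
        "relative_risk": p_test / p_control,
        # "The odds were X times those of controls" (odds ratio).
        "odds_ratio": (p_test / (1 - p_test)) / (p_control / (1 - p_control)),
    }
```

With made-up counts of 300 zero-result searches out of 1,000 in the control group and 200 out of 1,000 in the test group, the zero results rate drops by 10 percentage points, the relative risk is about 1.14, and the odds ratio is about 1.71. This is why an odds ratio can look much larger than a "times more likely" figure for the same data.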
October-December ("FY 2015-16 Q2")
AND Operator Relaxation
Status: Paused
Project: Cirrus

Corresponding Phab tickets: [EPIC], Code the test, Turn on, Verify, Turn off, Analyze

Language Search
Status: In-progress
Project: Cirrus
Analysis Codebase: GitHub
Report: Commons
Result: Failure

Corresponding Phab tickets: Change satisfaction to lang search, Turn off test, Analyze

This was an A/B test to determine the impact of switching search languages in the case that a query produces zero results. Our hope was that applying language detection methods to failed queries would let us, for a subset, identify that they were using the wrong language, switch them over to the correct one, and get them results.

We launched a test that collected data from 4 November 2015 to 11 November 2015, running over 1 in 10 queries. Half of the search queries (the control group – variation “A”) were given the status quo; half (the test group – variation “B”) would have language detection methods and a second search query if their query produced 0 results. Over a week of testing this produced 25,348,283 events.

In practice, we found no evidence that this had a substantial impact on the zero results rate. Our hypothesis is that this is due to a combination of poor language detection and the small proportion of queries that both produced zero results and were not in the language of the project they were made on.

Fewer Than 3 Results
Status: Done
Project: Cirrus
Analysis Codebase: GitHub
Report: PDF on Commons
Result: Success

Corresponding Phab tickets: Measure clickthroughs from multiple wikis, Write test, Turn on, Verify, Turn off, Analyze

This was an A/B test to determine the impact of switching search languages in the case that a query produces zero results, and in the case that a query produces fewer than 3 results, depending on the test group. This followed an earlier test that just applied this to queries producing zero results. Our hope was that applying language detection methods to failed queries would let us, for a subset, identify that they were using the wrong language, switch them over to the correct one, and get them results - and that new data collection methods would let us produce a more certain result than the last test.

To check our hypothesis we ran a second A/B test, from 21 November to 27 November 2015. This contained two test groups: one had language detection applied to queries with zero (<1) results, the other to queries with fewer than 3 (<3) results. In both cases, query metadata was added to indicate whether a search would have had language detection applied. We hoped that with these additional datapoints (and a wider population) we could see the impact. This test generated 33,282,741 queries.

We found very promising evidence that, for those queries the language detection could actually be applied to, this makes a remarkable difference to the zero results rate. We recommend that a user-side A/B test be run to look at the clickthrough rate when language detection is applied, to ensure that it produces useful results, not just "some" results.

Language detection via Accept-Language
Status: Done
Project: Cirrus
Analysis Codebase: GitHub
Report: PDF on Commons
Result: Success

Corresponding Phab tickets: Documentation, Write test, Turn on, Verify, Turn off, Analyze

This test has a top-level sampling rate of 1 in 7 requests. The buckets split the top-level sample equally, giving each bucket a 1 in 21 sample of requests. While API requests are included in this test, the API user must pass an explicit flag opting into the general query rewriting feature, which almost no one enables. Any measurable effect will be entirely within users of the web-based search. Users participating in this test will be marked as participating in the multilang-accept-lang test.

Three buckets are defined:

  • A - control bucket.
  • B - perform language search based on accept-language then es-plugin when user has zero results.
  • C - perform language search based on accept-language then es-plugin when user has fewer than three results.
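
The sampling arithmetic for this test can be checked with a short sketch. The 1-in-7 rate and the three buckets come from the description above; the function names and the use of a pseudo-random draw per request are assumptions for illustration, not how the production sampling is implemented.

```python
import random
from fractions import Fraction

SAMPLE_ONE_IN = 7            # top-level sampling: 1 in 7 requests
BUCKETS = ("A", "B", "C")    # equal split of the sampled requests


def per_bucket_rate():
    """Share of all requests that land in any one bucket."""
    return Fraction(1, SAMPLE_ONE_IN) / len(BUCKETS)


def assign_bucket(rng):
    """Return a bucket name, or None if the request is not sampled."""
    if rng.randrange(SAMPLE_ONE_IN) != 0:
        return None              # 6 in 7 requests are left out of the test
    return rng.choice(BUCKETS)   # random, equal-probability bucketing


print(per_bucket_rate())  # 1/21
```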

If the accept-language detection cannot find a valid language because, for example, the user is on enwiki and only has English accept-language headers, we will fall back to the es-plugin detection method. As opposed to the last test, which only recorded {"langdetect": true} to indicate that we attempted language detection, this new test can record three different values in that field. The presence of any of these three values is equivalent to the prior true value. These values are:

  • accept-lang - Found a valid language to use via the accept-language detection method
  • es-plugin - Found a valid language to use via the elasticsearch language detection plugin
  • failed - Attempted to detect a language via accept-lang and es-plugin, but no language could be decided upon.

If no langdetect parameter is included, that means we did not attempt language detection for some reason. That could be because they had more results than the threshold, because query rewriting was turned off, or because they used some special query syntax (such as intitle:). This detection is attempted for all requests in the test, even the control bucket. For the control bucket, while we detect a language and record the results of the detection, we do not perform the follow-up query against the second wiki.

The control bucket A uses the same parameters as bucket C. The control bucket can be filtered to match bucket B by only considering those requests that returned no results.
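
When analysing the logs, the rules above translate into two small helpers, sketched here. The `langdetect` field and its values are as described above; the record fields `bucket` and `hits` are hypothetical names used only for this illustration.

```python
DETECTION_VALUES = {"accept-lang", "es-plugin", "failed"}


def detection_attempted(record):
    """True if language detection was attempted for this request.

    Accepts both the old format ({"langdetect": true}) and the new
    three-valued format; a missing field means no attempt was made.
    """
    value = record.get("langdetect")
    return value is True or value in DETECTION_VALUES


def comparable_control(records, test_bucket):
    """Control (bucket A) requests comparable to test bucket "B" or "C".

    Bucket A uses the same threshold as C (fewer than three results),
    so it only needs filtering when compared against B (zero results).
    """
    control = [r for r in records if r["bucket"] == "A"]
    if test_bucket == "B":
        return [r for r in control if r["hits"] == 0]
    return control
```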

We found evidence that the Accept-Language header detection makes a slight positive difference to the zero results rate, with 3.18-3.34% more requests getting some results in the test group than the control group, and the test group being 1.026-1.029 times more likely to get some results than the control group when we found a valid language via Accept-Language detection.

2016

January-March ("FY 2015-16 Q3")
Phrase Rescore Boosting
Status: Done
Project: Cirrus
Analysis Codebase: GitHub
Report: PDF on Commons
Result: Failure?

Corresponding Phab tickets: EPIC, Write test, Turn on, Verify, Turn off, Analyze, Re-analyze

From 15 March 2016 to 22 March 2016 the Discovery/Search team ran an A/B test to assess how changing the phrase rescore boost from its current value of 10 to a proposed value of 1 would affect our users' behavior. Phrase rescore reorders the returned results, ranking results that have the same phrase higher. It appeared to be overboosted, so we hypothesized that it may yield sub-optimal results. It is important to note right up front that the differences in metrics/distributions between the test group (users with phrase rescore boost of 1) and the control (users with phrase rescore boost of 10) were close to 0 and were not statistically significant, even after making sure we only analyzed the eligible (affected) sessions, that is, those with queries of two or more words. Even in the lab it was a very small effect (PaulScore of 0.59 to 0.60).

  • The test group had a slightly higher proportion of sessions with only 1 or 2 search results pages than the control group, which had a slightly higher proportion of multi-search sessions.
  • The control group had a 1.3% higher probability of clickthrough and was 1.05 (1.03–1.08) times more likely to click on a result than the test group.
  • Most of the users clicked on the first result they were presented with. We had some expectation that this might change, but it is unsurprising that the two groups behaved almost exactly the same. Most of the users clicked on a search result within the first 25 seconds, with 10s being the most common first clickthrough time. This did not vary by group.
  • Number of results visited did not change by much between the two groups, although it looks like a slightly larger proportion of the test group visited fewer (1-2) results than the control group, which had a larger percentage of sessions with 3+ clickthroughs. It also looks like more test group users had shorter sessions than the control group, with a slightly greater number of test users having sessions lasting 10-30s and a slightly greater number of control users having sessions lasting more than 10 minutes. Users in the test group remained just a little bit longer on pages they visited than the control group, but barely so.

Putting the close-to-0 differences aside, if we take a very naive look at the differences and focus on their direction, we still cannot determine whether the change has a positive or negative (however small) impact for our users. Fewer searches may mean better results, or it may (cynically) mean users figuring out faster that they're not going to find what they're looking for. Certainly that's what a lower clickthrough rate and a slightly shorter average session length imply. Perhaps in this particular case it may be worth making the config change decision based on how the different phrase rescore boost values affect the performance and computation time, since it doesn't appear to affect the user's behavior, at least in terms of the metrics analyzed in this report.

Wikipedia.org Portal Testing

2016

More information on the Portal tests can be found here.


January-March ("FY 2015-16 Q3")
Portal Search Box Tests
Status: Done
Project: Portal
Analysis Codebase: GitHub
Report: PDF on Commons
Result: Success

This test covers two potential improvements to the Wikipedia portal (www.wikipedia.org), namely:

  1. The expansion of the search box so that it is more prominent;
  2. The inclusion of a small image and Wikidata description to each search result.

The test ran for a week from 13 January 2016, hitting 0.3% of portal visitors. One third of them got the default experience, one third the search box expansion, and one third both the expansion and the included elements in the search results.

Portal Preferred Browser Language Detection
Status: Done
Project: Portal
Analysis Codebase: GitHub
Report: PDF on Commons
Result: Success

Corresponding Phab tickets: EPIC (phab:T121567), Write test, Implement, Deploy, Verify, Turn off, Analyze

This was an A/B test that detected the user's browser language(s) and re-sorted the links around the globe image according to their preferred language settings. The user's most preferred language was displayed in the top-left link. If the user's browser did not have as many language preferences as there are available links to display (10), the remaining links around the globe were filled with the "top" links that are not in their language preferences. The test ran for approximately three weeks starting on 22 March 2016, because the sample of users whose browsers have preferred languages other than English is fairly small for both the test group and the control group.
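
The re-sorting rule described above can be sketched as follows. This is an illustration only, not the Portal's actual code: the function name, the assumption that every preferred language has a Wikipedia to link to, and the contents of the default "top" list in the usage note are all hypothetical.

```python
def order_portal_links(preferred_languages, top_links, slots=10):
    """Order the primary links: preferred languages first, then "top" links.

    preferred_languages: language codes in the user's order of preference.
    top_links: the default primary links, in their default order.
    """
    ordered = []
    for lang in list(preferred_languages) + list(top_links):
        if lang not in ordered:      # never show the same link twice
            ordered.append(lang)
        if len(ordered) == slots:    # only `slots` links fit around the globe
            break
    return ordered
```

For a user preferring `["de", "fr"]` and an illustrative default list of `["en", "ja", "es", "de", "ru", "fr", "it", "zh", "pt", "pl"]`, the result starts with `de` (the first, top-left position) and `fr`, followed by the remaining default links in their usual order.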

Users who received these dynamic primary links were more likely to engage with those primary links than the users who received the default, static experience, albeit not by a lot. The biggest impact is actually found in where those users went to from the Portal. When they were presented with primary links that reflected their preferred languages, they had a 7.5-16.1% higher probability of visiting (and were 1.15-1.3 times more likely to visit) a Wikipedia in their most preferred language (or one of their preferred languages): 7.5% in the case of multilingual users, 16.1% in the case of users whose Accept-Language did not include English. We believe this is evidence of localization having a positive effect on the users' experience and engagement with the Portal.