Why A/B testing

A/B testing is a great way to test which of two or more variants (A or B) of a website or product performs better. It’s frequently used as an experimentation methodology because it eliminates a number of statistical biases and isolates the signal (a real change in a metric) from the noise (random fluctuations in the metric). The power of A/B testing comes from randomization: users are allocated at random to either the control group, A (no change), or the treatment group, B (a proposed improvement).
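As an illustration, here is a minimal sketch of how random allocation is often implemented in practice, assuming each user has a stable identifier; the function name and bucketing scheme are hypothetical and not tied to any particular experimentation library.

```python
# Minimal sketch of deterministic random assignment (hypothetical helper).
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Map a user to a variant by hashing their id with the experiment name.

    The hash spreads users roughly uniformly across variants, the same user
    always lands in the same variant, and different experiments are
    randomized independently of each other.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

print(assign_variant("user-42", "new-signup-flow"))  # e.g. "B"
```

Hashing rather than flipping a coin on each request keeps the allocation stable across sessions, so a given user doesn't bounce between variants mid-experiment.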

Measuring a metric before/after a change (pre/post) sometimes doesn’t work, because many things are happening at the same time. If we launch a marketing campaign AND a feature to improve retention, and we see retention decrease, how do we know whether it’s caused by poorly qualified traffic from the marketing campaign or by the feature underperforming? By A/B testing the feature, we can still measure its impact in this situation, because both variants see, on average, the same mix of traffic from existing marketing channels and the new campaign.

When to A/B test

A/B testing is a good fit when there are several competing valid options with no clear answer, a clear metric to measure success, and a large enough number of users to test on. Evaluate:

Impact

A/B tests take time and resources, and we can only run a few at a time to avoid interference between tests. We should reserve A/B testing for high-impact metric optimizations that match the other conditions listed here.

Uncertainty

A/B testing is useful when there are several options to choose from, possibly with competing explanations of why they would work, but no clear winner.

It’s not useful to A/B test versions if we already know the result with high probability. Sometimes, though, we will want to A/B test changes we already believe matter simply to quantify their impact (is this worth maintaining?), rather than merely to validate (yes/no) that there is an impact.

Measurement

A/B testing requires a clear quantitative way to measure what is "better". Common advice is to pick a single, clear metric so that there is no ambiguity in deciding if the change passed or failed.

Volume of users to reach statistical significance

A/B testing requires a large enough number of users to get statistically significant results. The smaller the number of users in the A/B testing cohort, the bigger the impact the change needs to have on the metric to be measurable. This is because the signal (the impact of the change) needs to be larger than the noise (the random fluctuation in the metric).

If you want to A/B test a part of the product that does not receive a lot of traffic, a much bigger impact on the metric is needed to reach statistical significance, so the variant tested generally needs to be much more ambitious and thoughtful (e.g. removing 75% of the steps in a sign-up flow instead of changing the color of a button).
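To make the trade-off between traffic and detectable effect size concrete, here is a minimal power-analysis sketch using statsmodels; the baseline rate, lifts, and thresholds below are made-up numbers for illustration only.

```python
# Rough sample-size estimate for a proportion metric (illustrative numbers).
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.30            # assumed baseline conversion/retention rate
analysis = NormalIndPower()

for lift in (0.02, 0.10):  # minimum detectable effect, in percentage points
    effect = proportion_effectsize(baseline + lift, baseline)
    n = analysis.solve_power(effect_size=effect, alpha=0.05, power=0.8)
    print(f"+{lift:.0%} lift needs ~{n:,.0f} users per variant")
```

With these assumptions, detecting a +2 percentage point lift takes on the order of 8,000+ users per variant, while a +10 point lift takes only a few hundred, which is why low-traffic areas call for bolder changes.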

For example, retention fluctuates week over week for "random" reasons we don't understand. The more users, the smaller those "random" fluctuations will be: this makes it possible to decide whether a change observed during an experiment was likely random or likely caused by the experiment.
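As a back-of-the-envelope illustration (with a made-up 30% retention rate), the random wobble in an observed proportion shrinks with the square root of the number of users:

```python
# Standard error of a proportion for different cohort sizes (illustrative).
import math

p = 0.30  # assumed retention rate
for n in (1_000, 10_000, 100_000):
    se = math.sqrt(p * (1 - p) / n)
    print(f"{n:>7,} users -> retention typically wobbles by about ±{2 * se:.1%}")
```

Going from 1,000 to 100,000 users shrinks the typical fluctuation from roughly ±3 points to roughly ±0.3 points, so a +1 point change is invisible in the first case and obvious in the second.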

Running A/B tests on too small a number of users is inefficient (time is wasted setting up the A/B test), inconclusive (the result will be that no conclusion can be drawn) and sometimes counterproductive (if miscommunicated, an insignificant A/B test can lead teammates to draw conclusions that aren't valid and are hard to debunk). In that case, it's better to choose another method, such as user interviews.

Examples

Good A/B test candidates