A/B Testing

An A/B Test (also known as a Randomized Controlled Trial) is a controlled experiment aimed at measuring the impact of a product change, with the intent of proving causality. Essentially, an A/B test is a reliable way of judging, with high probability, whether an observed user behaviour is caused by a specific modification in how a product looks or how it works. When A/B tests are applied systematically, they can be used to release valuable product changes with confidence, and to prevent low-value or damaging changes from being released.

The fundamental idea of a controlled experiment is to expose two or more groups of people to different experiences and then measure the difference in their behaviours. If the groups behave the same, then there is a high probability that the difference in experiences is not causing a behavioural impact. If the groups behave differently, there is a high probability that the change in experiences is causing the impact.

In software product management, an A/B test is the simplest version of a controlled experiment, where one group is usually experiencing an existing system (version A) and the other group is experiencing some change in design or functionality (version B). By collecting key user behaviour metrics (such as system interactions, task completion or purchases), and observing the difference in those metrics between the groups, an experiment can show if the change under test is potentially valuable or not. The name A/B test implies two versions, but the same name is informally used for tests involving more than two variants (also called A/B/n tests).

Benefits of A/B tests

The primary benefits of A/B tests are to increase confidence in product releases and to reduce the risk of making changes that negatively impact user experience or business value. By testing different versions of a feature or interface on a subset of users before rolling it out more widely, product teams can gather data-driven insights and evidence to make informed decisions. This approach helps to ensure that any changes introduced are likely to lead to improvements or, at the very least, not cause harm.

Controlled experiments can quickly detect small changes to important business metrics, and help to inform product management decisions and discussions with stakeholders. Such small changes might pass undetected when some other global trend is affecting the product or the market. For example, if there is a surge in sales during a Christmas period, a new feature that actually damages revenue might not be noticed for a while, unless it is deployed with a controlled experiment. By running A/B tests, product delivery teams can pinpoint exactly which changes lead to positive or negative outcomes, providing a clear understanding of their impact.

A/B tests can be simple, looking only at a few key metrics, or they can have complex scorecards involving hundreds or thousands of metrics. More complex metric scorecards also help to detect unexpected and unintended impacts, and prevent product teams from releasing features that might improve one metric in the short term but cause side-effects that damage other important business metrics in the long term. Similarly, they can help product teams spot unintended positive side-effects (for example when a new design intended to improve user engagement also causes customers to increase purchases), and then exploit those opportunities to further optimize the product. These hidden benefits can provide valuable insights for future product development and contribute to long-term business growth.

Prerequisites for running A/B tests

Effectively running A/B tests requires a clear quantifiable target (also called Overall Evaluation Criteria). A key prerequisite for A/B tests is that the overall evaluation criteria can actually be measured during the test and evaluated at the end of the experiment. This often requires using proxy metrics. For example, profit before tax is an important business metric, but it usually gets calculated at the end of a tax year and involves specialist accountants. It is not the right metric to use in an A/B test that lasts a few days. Instead of end-of-year profit, the number of purchases per customer can be tracked quickly and serve as a proxy for profit.

A/B testing also requires a clear experimental unit, so that different units can be assigned to different variants. A typical experimental unit is a user, but it could also be a client company or a group of users in an enterprise application. There are three key prerequisites for experimental units in A/B tests:

  1. units can reliably be assigned to test variants (see the sketch after this list);
  2. units are independent and will not interfere with other units (for example, a user assigned to one version of the application should not be able to cause other users to behave differently);
  3. there are enough units to interpret the results in a statistically relevant way.
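
In practice, reliable assignment (the first prerequisite) is often implemented by hashing a stable unit identifier together with the experiment name, so the same unit always lands in the same variant without any shared state. The sketch below illustrates this approach; the function name, experiment name and split weights are illustrative assumptions rather than part of any particular tool.

```python
import hashlib

def assign_variant(unit_id: str, experiment: str,
                   variants=("A", "B"), weights=(0.5, 0.5)) -> str:
    """Deterministically assign an experimental unit to a variant.

    Hashing the unit id together with the experiment name gives a stable,
    pseudo-random bucket in [0, 1), so the same unit always sees the same
    variant and different experiments are bucketed independently.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:15], 16) / 16 ** 15  # roughly uniform value in [0, 1)
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if bucket < cumulative:
            return variant
    return variants[-1]  # guard against floating-point rounding

# Example: the same user keeps the same variant across sessions
print(assign_variant("user-42", "checkout-redesign"))
```

Keying the hash on the experiment name as well as the unit identifier keeps assignments independent across parallel experiments.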

The number of units required for an experiment depends on the target metrics (especially their variance), the minimum meaningful change the test needs to detect, and the error tolerance. Generally, A/B tests can detect arbitrarily small changes, as long as enough units are involved in the test, but more sensitive tests need significantly larger populations. A/B tests evaluate a sample of the full population to predict how the whole population will behave, and any such prediction comes with potential errors. The smaller the tolerance for errors, the longer and larger the tests need to be.

Running an A/B test often requires a trade-off between sensitivity, error tolerance, the number of people involved and the duration of the test. Many free online test calculators, such as the CXL A/B test calculator, can help you estimate the requirements.
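
The calculation behind such calculators is usually a standard power analysis. The sketch below estimates, under simplified assumptions, the number of units needed per variant to detect a given absolute change in a conversion-style metric using a two-proportion z-test; the baseline rate and minimum detectable effect in the example are purely illustrative.

```python
from statistics import NormalDist

def sample_size_per_variant(baseline_rate: float, minimum_detectable_effect: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate units needed per variant for a two-proportion z-test.

    baseline_rate: expected conversion rate of variant A (e.g. 0.04 = 4%)
    minimum_detectable_effect: smallest absolute change worth detecting (e.g. 0.005)
    alpha: tolerated false-positive rate; power: 1 minus tolerated false-negative rate
    """
    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / minimum_detectable_effect ** 2
    return int(n) + 1

# Detecting a 0.5 percentage-point lift on a 4% baseline needs tens of thousands of units per variant
print(sample_size_per_variant(0.04, 0.005))
```

Halving the minimum detectable effect roughly quadruples the required sample, which is why detecting smaller changes requires much larger populations.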

Because the test participants in a controlled experiment are usually a small representative sample of the entire group of users or customers, and different groups of people are exposed to different experiences, interpreting the results of an A/B test requires statistical analysis. Just comparing the raw metric numbers between the two variants is not enough.
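
As a minimal illustration of that analysis, the sketch below applies a two-proportion z-test to hypothetical conversion counts; real experimentation platforms typically use more sophisticated methods, so treat this only as a simplified example.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conversions_a: int, visitors_a: int,
                          conversions_b: int, visitors_b: int) -> float:
    """Return the two-sided p-value for the difference between two conversion rates."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# B looks better in raw numbers (5.2% vs 5.0%), but the difference is not statistically significant here
p_value = two_proportion_z_test(500, 10_000, 520, 10_000)
print(f"p-value: {p_value:.3f}")
```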

Applicability and limitations of A/B tests

A/B tests are usually not applicable to very early stage products, as it’s difficult to get enough test units (users) to participate in a reasonable period of time. With early stage products that have some traffic, but not enough for highly-sensitive tests, it’s possible to reduce the required sample size by increasing the minimum detectable effect (only testing for larger changes), or choosing proxy metrics with lower variance.

As a quantitative experiment, an A/B test can show what happened, but not necessarily why it happened. This is particularly relevant when a test detects a side-effect or an unintended change to some metrics. Because of that, it’s sometimes necessary to run multiple tests or combine A/B tests with user interviews and observations.

A/B tests involve humans who are affected by various short and long term biases. People can show extraordinary interest in a new feature due to temporary causes such as the Novelty Effect, or be reluctant to try new things. Such causes do not impact long-term usage, but may skew the results of a quick test. Likewise, people can be affected by their past experiences, and participation in previous tests can influence the actions a person takes in future tests. User Learning Effects can cause long-term biases.

A/B tests tend to identify individual users through proxy identifiers, which can cause issues, especially for longer-running tests. For example, one typical proxy identifier is a web cookie, but people are not cookies. Users might access the same application from multiple devices, or through multiple entry points, causing multiple cookies to be assigned to the same person. Over a longer period of time, people are likely to clear cookies (or their browsers will automatically discard old cookies). Some people will browse in private mode, causing a new cookie to be assigned for each session or even each page view. This leads to the same person showing up several times in the same test. If a significant portion of people gets assigned to multiple test cohorts, that may distort the results.

Running A/B tests at scale

Getting started with A/B tests does not necessarily require any specific infrastructure, as changes in metrics can usually be detected in logs or with existing analytics infrastructure, collected manually and then interpreted using free online calculators. However, this is time-consuming, error-prone and limits the capability for experimentation. To run tests reliably at scale, teams usually need to automate cohort assignment, data collection and evaluation, and even the starting and stopping of parallel experiments. The Experimentation Growth Model describes a usual set of stages organizations go through when scaling up A/B testing.
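
At that level of automation, experiments are often described as data rather than code, so a platform can assign cohorts, collect metrics and start or stop tests without manual work. The sketch below shows one possible shape of such a definition; the field names and defaults are assumptions for illustration, not the schema of any specific experimentation platform.

```python
from dataclasses import dataclass

@dataclass
class ExperimentDefinition:
    """Declarative description of an A/B test that a platform can act on automatically."""
    name: str
    variants: tuple = ("A", "B")
    traffic_split: tuple = (0.5, 0.5)                # share of eligible units per variant
    primary_metric: str = "purchases_per_customer"
    guardrail_metrics: tuple = ("page_load_time", "support_contact_rate")
    max_duration_days: int = 14                      # stop automatically after this long

checkout_test = ExperimentDefinition(name="checkout-redesign")
print(checkout_test)
```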

When running a small number of tests, teams usually set overall evaluation criteria for each test separately, typically as a set of key metrics. Over time, as the number of tests increases, the OEC tends to converge and gradually evolves into a more complex scorecard. At large scale, in order to control test execution and deployment automatically, teams usually model and capture the acceptable trade-offs in a single combined numerical rating.
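
One simple way to capture such trade-offs is a weighted combination of relative metric changes, with guardrail metrics given negative weights. The sketch below is a hypothetical illustration of that idea, not a formula prescribed by any particular platform; the metric names and weights are example assumptions.

```python
def combined_oec_score(metric_changes: dict, weights: dict) -> float:
    """Collapse relative metric changes (e.g. +0.02 = +2%) into a single rating.

    Positive weights reward improvements in value metrics; negative weights
    penalize regressions in guardrail metrics such as latency or complaints.
    """
    return sum(weights[name] * change for name, change in metric_changes.items())

# Hypothetical trade-off: more purchases, slightly slower pages, slightly more support contacts
changes = {"purchases_per_customer": 0.02, "page_load_time": 0.01, "support_contact_rate": 0.005}
weights = {"purchases_per_customer": 1.0, "page_load_time": -0.5, "support_contact_rate": -2.0}
print(combined_oec_score(changes, weights))  # 0.02 - 0.005 - 0.01 = 0.005, a net positive rating
```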

Learn more about A/B Testing