Overall Evaluation Criteria

The Overall Evaluation Criteria (also called the Overall Evaluation Criterion, or a Fitness Function) is a quantitative measure of the objective of a product experiment. An OEC can range from a single metric to a set of complementary metrics, and it can be specific to a single experiment, related to a set of experiments, or focused on longer-term aspects of product success.

Selecting a good OEC is critical for conducting A/B tests and similar product experiments effectively. The OECs used by product teams typically vary widely and change often when a team first starts experimenting, but stabilise as experimentation practices mature and experiments become more frequent.

Attributes of a good OEC

In Trustworthy Online Controlled Experiments, Kohavi, Tang and Xu suggest that an OEC must be:

When running online experiments, getting numbers is easy; getting numbers you can trust is hard

Kohavi, Tang and Xu

Evolving OEC over time

The OEC usually evolves from experiment-specific to product-specific as the number of related experiments and the frequency of experiments increase. According to the Experimentation Growth Model, the typical progression is:

The evolution from experiment-specific to product-oriented, and from multiple criteria to a single metric, allows for better automation and speeds up decision making, which in turn lets product teams run tests more frequently. For example, automated alerting about experiments that cause negative impacts is typically introduced in the Run phase. In the Fly phase, it’s typical to introduce automated rollback of experiments that cause negative impacts, or to automatically expand and accept experiments that create a significant positive effect.
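As a rough illustration of the kind of automation described above, the sketch below maps an experiment result to an action. It assumes (this is not stated in the source) that the result is summarised as a confidence interval on the relative change in the OEC between treatment and control. The function name `automated_action` and the action labels are hypothetical.

```python
def automated_action(ci_low: float, ci_high: float) -> str:
    """Pick an action from a confidence interval on the relative OEC change.

    Assumption: ci_low and ci_high bound the treatment-vs-control change
    in the OEC (for example -0.02 means a 2% drop). The rule is illustrative.
    """
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    if ci_high < 0:
        # The whole interval is negative: a significant negative impact.
        return "rollback"
    if ci_low > 0:
        # The whole interval is positive: a significant positive effect.
        return "expand"
    # The interval contains zero: no clear effect yet.
    return "continue"


# Hypothetical intervals, not taken from any real experiment.
decisions = [
    automated_action(-0.05, -0.01),
    automated_action(0.01, 0.04),
    automated_action(-0.01, 0.02),
]
```

A single consolidated OEC is what makes a rule like this practical: with several unweighted metrics, the same experiment could satisfy the rollback and the expand conditions at once.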

From signals to metrics

From the Crawl to the Walk phase, an OEC usually evolves from ad-hoc signals chosen for each experiment to a more structured set of metrics.

In The Evolution of Continuous Experimentation in Software Product Development, Fabijan, Dmitriev, Olsson and Bosch provide examples of signals considered at Microsoft for web products.

Signals are turned into metrics by dividing them by a unit of analysis. The Evolution paper provides examples of units for Microsoft web products, such as

Kohavi and co-authors suggest that metrics need to consider potential overall impacts. For example, measuring click-through on a single button on a page might not capture the fact that click-throughs on other calls to action on the same page dropped. Measuring the whole-page click-through rate is better than looking at a single button. Ideally, metrics should capture some measure of success (such as a purchase) or the time it takes a user to perform the successful action.
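The difference between a single-button metric and a whole-page metric can be sketched as follows. The click counts, element names and the `rate` helper are hypothetical. Whole-page click-through is simplified here to total clicks divided by page views, which is an assumption rather than a definition from the source.

```python
def rate(signal_count: int, units: int) -> float:
    """Turn a raw signal into a metric by dividing by the unit of analysis."""
    if units <= 0:
        raise ValueError("units must be positive")
    return signal_count / units


# Hypothetical click counts per page element; page views are the unit of analysis.
page_views = 10_000
control = {"buy_button": 400, "other_calls_to_action": 300}
treatment = {"buy_button": 450, "other_calls_to_action": 200}

# Single-button click-through rate: the treatment looks like a win.
control_button_ctr = rate(control["buy_button"], page_views)
treatment_button_ctr = rate(treatment["buy_button"], page_views)

# Whole-page click-through rate: the treatment is actually worse overall.
control_page_ctr = rate(sum(control.values()), page_views)
treatment_page_ctr = rate(sum(treatment.values()), page_views)
```

With these made-up numbers the button metric rises from 4% to 4.5% while the whole-page metric falls from 7% to 6.5%, which is the effect the single-button measurement would hide.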

The HEART framework can be used to choose good signals and metrics relating to specific product goals.

From metrics to a single criterion

During the Run phase, the OEC usually turns from multiple metrics into a single, consolidated measurement. A common option for this transition is to normalize each metric to a predefined range (0-100 or 0-1), then assign weights to different metrics and calculate the weighted average or weighted sum. In that way, the OEC represents a trade-off between different metrics.
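A minimal sketch of the normalize-then-weight approach, assuming min-max scaling into the 0-1 range and a weighted average. The metric names, ranges and weights are hypothetical placeholders, not values recommended by the source.

```python
def normalize(value: float, low: float, high: float) -> float:
    """Min-max scale a value into [0, 1], clamping anything outside the range."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return min(1.0, max(0.0, (value - low) / (high - low)))


def weighted_oec(metrics: dict, ranges: dict, weights: dict) -> float:
    """Weighted average of normalized metrics; the weights encode the trade-off."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(
        weights[name] * normalize(metrics[name], *ranges[name])
        for name in weights
    )
    return weighted_sum / total_weight


# Hypothetical metrics for one variant, with their predefined (low, high) ranges.
metrics = {"conversion_rate": 0.03, "sessions_per_user": 4.0}
ranges = {"conversion_rate": (0.0, 0.10), "sessions_per_user": (0.0, 10.0)}
weights = {"conversion_rate": 0.75, "sessions_per_user": 0.25}

oec = weighted_oec(metrics, ranges, weights)  # 0.75 * 0.3 + 0.25 * 0.4
```

Metrics where lower is better (page load time, for instance) would need to be inverted before weighting, for example by using one minus the normalized value.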

Selecting the right weights usually takes time, because the team needs to understand the impacts and trade-offs and to iterate over various combinations. Kohavi and colleagues suggest narrowing down the number of metrics to no more than five, then building an understanding of how those metrics correlate to success by using a four-step decision process:

When creating a single OEC, especially if it will lead to automated releases, it’s important to consider long-term effects as well as indicators from a specific test. In particular, User Learning Effects may need to be included in an OEC to compensate for long-term changes.

An example single-criteria OEC

In the Trustworthy Online Controlled Experiments book, Kohavi and co-authors provide an example of a single-criteria OEC for e-mail notifications at Amazon. The goal of the experiment was to check how sending e-mail notifications impacts long-term revenue from users. Some users would click through the notifications and purchase additional items, increasing the revenue. Some customers would get annoyed by additional e-mails and unsubscribe from further notifications, and Amazon would lose the opportunity to contact them with promotions in the future. The balance of these two forces was set as the OEC for experiments around e-mail promotions:

OEC = (Revenue - Unsubscribes ✕ Unsubscribe_Lifetime_Loss) ÷ Number_Of_Users

This calculation was applied to each variant in the test, and effectively measured additional revenue per user in a variant, accounting for potential future revenue loss when users unsubscribe.
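The formula can be applied per variant as in the sketch below. The function name and all numbers are hypothetical, chosen only to show how a variant with higher raw revenue can still score lower once unsubscribes are priced in.

```python
def email_oec(
    revenue: float,
    unsubscribes: int,
    unsubscribe_lifetime_loss: float,
    number_of_users: int,
) -> float:
    """OEC = (Revenue - Unsubscribes x Unsubscribe_Lifetime_Loss) / Number_Of_Users."""
    if number_of_users <= 0:
        raise ValueError("number_of_users must be positive")
    return (revenue - unsubscribes * unsubscribe_lifetime_loss) / number_of_users


# Hypothetical figures; the lifetime loss per unsubscribe is an assumed estimate.
LIFETIME_LOSS = 20.0

control_oec = email_oec(
    revenue=50_000, unsubscribes=100,
    unsubscribe_lifetime_loss=LIFETIME_LOSS, number_of_users=10_000,
)
treatment_oec = email_oec(
    revenue=52_000, unsubscribes=300,
    unsubscribe_lifetime_loss=LIFETIME_LOSS, number_of_users=10_000,
)
```

In this made-up case the treatment earns more revenue (52,000 against 50,000) but its OEC is 4.6 per user against 4.8 for the control, so the extra e-mails would not be accepted.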
