Overall Evaluation Criteria
The Overall Evaluation Criteria (also called the Overall Evaluation Criterion or a Fitness Function, abbreviated OEC) is a quantitative measure of the objective of a product experiment. An OEC can range from a single metric to a set of complementary metrics, and it can be specific to a single experiment, related to a set of experiments, or focused on longer-term aspects of product success.
Selecting a good OEC is critical for conducting A/B tests and similar product experiments effectively. The OECs used by product teams typically vary a lot and change often early on, when a team starts experimenting, but stabilize as experimentation practices mature and experiments become more frequent.
Attributes of a good OEC
In Trustworthy Online Controlled Experiments, Kohavi, Tang and Xu suggest that an OEC must be:
- Sensitive: it should detect changes that matter, and not be influenced too much by other factors
- Timely: measurable in the short-term
- Attributable: it should be possible to measure it for each test variant
- Relevant: reasonably predicting success for long-term strategic objectives
“When running online experiments, getting numbers is easy; getting numbers you can trust is hard” (Kohavi, Tang and Xu)
Evolving OEC over time
The OEC usually evolves from experiment-specific to product-specific as the number of related experiments and the frequency of experimentation increase. According to the Experimentation Growth Model, the typical progression is:
- Crawl - a few key signals specific to each experiment
- Walk - a structured set of metrics instead of ad-hoc signals, no longer tied to a single experiment
- Run - metrics evolve to capture more abstract concepts related to overall product success, ideally turning into a single metric that captures trade-offs between multiple metrics
- Fly - a single metric that is stable and changes only infrequently (once per year or so)
The evolution from experiment-specific to product-oriented criteria, and from multiple criteria to a single metric, allows for better automation and speeds up decision-making, enabling product teams to run tests even more frequently. For example, in the Run phase it’s typical to introduce automated alerting about experiments that cause negative impacts. In the Fly phase, it’s typical to introduce automated rollback of experiments that cause negative impacts, or to automatically expand and accept experiments that create a significant positive effect.
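As a rough sketch of what this kind of automation might look like, the following Python function maps an experiment’s measured OEC movement to an action. The function name, inputs, and significance threshold are illustrative assumptions, not the API of any particular experimentation platform.

```python
# Illustrative sketch only: names and thresholds are assumptions,
# not part of any specific experimentation platform.

def next_action(control_oec: float, variant_oec: float,
                p_value: float, alpha: float = 0.05) -> str:
    """Map an experiment's OEC movement to an automated action."""
    if p_value >= alpha:
        return "keep running"   # no statistically significant movement yet
    if variant_oec < control_oec:
        return "roll back"      # significant negative impact: stop the variant
    return "expand"             # significant positive impact: widen the rollout

print(next_action(control_oec=2.10, variant_oec=1.85, p_value=0.01))  # roll back
```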
From signals to metrics
From the Crawl to the Walk phase, an OEC usually evolves from ad-hoc signals chosen for each experiment to a more structured set of metrics.
In The Evolution of Continuous Experimentation in Software Product Development, Fabijan, Dmitriev, Olsson and Bosch provide examples of signals considered at Microsoft for web products:
- action signals: clicks, page views, visits…
- time signals: session duration, total time on site, page load time…
- value signals: revenue, units purchased, ads clicked…
The signals are turned into metrics by dividing them by a unit of analysis (see the sketch after this list). The Evolution paper provides examples of units for Microsoft web products, such as:
- user (e.g. clicks per user),
- session (e.g. minutes per session),
- user-day (e.g. page views per day),
- experiment (e.g. clicks per page view).
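As a minimal illustration of this signal-to-metric conversion, the Python sketch below counts a signal and divides it by the number of distinct values of a unit of analysis. The event records and field names are invented for the example.

```python
# Hypothetical event log: one record per logged user action.
events = [
    {"user": "u1", "session": "s1", "action": "click"},
    {"user": "u1", "session": "s1", "action": "page_view"},
    {"user": "u2", "session": "s2", "action": "click"},
    {"user": "u2", "session": "s3", "action": "click"},
]

def per_unit(events, signal, unit):
    """Average count of a signal per distinct value of a unit of analysis."""
    distinct_units = {e[unit] for e in events}   # e.g. distinct users or sessions
    signal_count = sum(1 for e in events if e["action"] == signal)
    return signal_count / len(distinct_units)

print(per_unit(events, "click", "user"))     # clicks per user    -> 1.5
print(per_unit(events, "click", "session"))  # clicks per session -> 1.0
```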
Kohavi and co-authors suggest that metrics need to consider potential overall impacts. For example, measuring click-through on a single button on a page might not capture the fact that click-throughs on other calls to action on the same page dropped. Measuring the whole-page click-through rate is better than looking at a single button. Ideally, metrics should capture some measure of success (such as a purchase) or the time it takes a user to perform the successful action.
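The following toy calculation (with made-up click counts) shows how a single button can look better while the page as a whole gets worse:

```python
page_views = 10_000

# Made-up click counts for two variants of the same page.
control = {"buy_button": 500, "other_ctas": 700}
variant = {"buy_button": 650, "other_ctas": 400}

# The buy button alone looks like a win...
print(500 / page_views, 650 / page_views)    # 0.05 -> 0.065

# ...but whole-page click-through dropped, because other calls to action lost clicks.
print(sum(control.values()) / page_views,    # 0.12
      sum(variant.values()) / page_views)    # 0.105
```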
The HEART framework can be used to choose good signals and metrics relating to specific product goals.
From metrics to a single criterion
During the Run phase, the OEC usually turns from multiple metrics into a single, consolidated measurement. A common option for this transition is to normalize each metric to a predefined range (0-100 or 0-1), assign weights to the different metrics, and calculate a weighted average or weighted sum. That way, the OEC represents a trade-off between the metrics.
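A minimal sketch of this normalize-and-weight approach, assuming made-up metrics, ranges, and weights (real weights take iteration to get right, as discussed below):

```python
metrics = {"sessions_per_user": 4.2, "revenue_per_user": 1.8, "page_load_time_s": 2.4}

# Assumed (min, max) range for each metric, used to scale it to 0..1.
ranges = {"sessions_per_user": (0, 10), "revenue_per_user": (0, 5), "page_load_time_s": (0, 8)}

# Weights encode trade-offs; slower page loads hurt, hence the negative weight.
weights = {"sessions_per_user": 0.4, "revenue_per_user": 0.5, "page_load_time_s": -0.1}

def normalize(value, lo, hi):
    return (value - lo) / (hi - lo)

oec = sum(weights[m] * normalize(metrics[m], *ranges[m]) for m in metrics)
print(round(oec, 3))  # 0.318: a single number that trades off all three metrics
```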
Selecting the right weights usually requires time to understand the impacts and trade-offs, and iteration over various combinations. Kohavi and colleagues suggest narrowing the number of metrics down to no more than five, then evolving the understanding of how those metrics correlate with success using a four-step decision process (sketched in code after this list):
- if at least one metric changed in a positive way, and the other metrics did not change in a statistically significant way, then release the change
- if at least one metric changed in a negative way, and the others did not change in a statistically significant way, then do not release
- if some metrics changed in a positive way and some in a negative way, then consider the trade-offs and decide whether to release; document this decision to be able to evolve the metric weights later
- if no metrics changed in a statistically significant way, then don’t release the change, and consider extending the experiment or stopping it
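A hedged Python sketch of that decision process; the data structure (a list of per-metric deltas paired with a significance flag) is an assumption made for the example.

```python
def decide(changes):
    """changes: list of (delta, is_significant) pairs, one per OEC metric."""
    ups   = [d for d, sig in changes if sig and d > 0]
    downs = [d for d, sig in changes if sig and d < 0]
    if ups and not downs:
        return "release the change"
    if downs and not ups:
        return "do not release"
    if ups and downs:
        return "weigh trade-offs, decide, and document the decision"
    return "no significant movement: extend or stop the experiment"

print(decide([(+0.8, True), (-0.1, False)]))  # release the change
print(decide([(+0.8, True), (-0.5, True)]))   # weigh trade-offs, decide, ...
```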
When creating a single OEC, especially if it will lead to automated releases, it’s important to consider long-term effects as well as indicators from a specific test. In particular, User Learning Effects may need to be included in an OEC to compensate for long-term changes.
An example single-criterion OEC
In the Trustworthy Online Controlled Experiments book, Kohavi and co-authors provide an example of a single-criterion OEC for e-mail notifications at Amazon. The goal of the experiment was to check how sending e-mail notifications impacts long-term revenue from users. Some users would click through the notifications and purchase additional items, increasing the revenue. Some customers would get annoyed by additional e-mails and unsubscribe from further notifications, and Amazon would lose the opportunity to contact them with promotions in the future. The balance of these two forces was set as the OEC for experiments around e-mail promotions:
OEC = (Revenue - Unsubscribes ✕ Unsubscribe_Lifetime_Loss) ÷ Number_Of_Users
This calculation was applied to each variant in the test, and effectively measured additional revenue per user in a variant, accounting for potential future revenue loss when users unsubscribe.
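The formula translates directly into code. In the sketch below only the formula itself comes from the book’s example; the input values are invented, to show how a variant can raise revenue yet still lose on the OEC:

```python
def email_oec(revenue, unsubscribes, unsubscribe_lifetime_loss, n_users):
    """Revenue per user, net of the estimated future loss from unsubscribes."""
    return (revenue - unsubscribes * unsubscribe_lifetime_loss) / n_users

# Variant that sends more e-mails: more revenue now, but many more unsubscribes.
print(email_oec(revenue=12_000, unsubscribes=300, unsubscribe_lifetime_loss=25, n_users=10_000))  # 0.45
# Control: less revenue, far fewer unsubscribes.
print(email_oec(revenue=10_000, unsubscribes=50, unsubscribe_lifetime_loss=25, n_users=10_000))   # 0.875
```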
Learn more about the Overall Evaluation Criteria
- Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing, ISBN 978-1108724265, by Ron Kohavi, Diane Tang, Ya Xu (2020)
- The Evolution of Continuous Experimentation in Software Product Development, International Conference on Software Engineering (ICSE), Buenos Aires, Argentina, by Aleksander Fabijan, Pavel Dmitriev, Helena Holmström Olsson, Jan Bosch (2017)