User Learning Effects
User Learning Effects are changes to user behaviour and habits caused by interactions with a product. Users modify their behaviour based on their experiences with a digital product or service, which can bias A/B testing results and influence long-term behaviour beyond the time frame of a typical experiment. Considering user learning effects during A/B testing is important to prevent short-term gains that cause long-term losses, and to increase the relevance of shorter experiments. A/B tests need to account both for immediate user reactions and for how users adapt over time.
- The Origins of User Learning: Thorndike’s Law of Effect
- Measuring User Learning
- Including User Learning into experiment evaluation
The Origins of User Learning: Thorndike’s Law of Effect
User Learning Effects are explained by Thorndike’s Law of Effect, proposed in the early 20th century by Edward Thorndike after experiments on animals. Thorndike’s research led to the S-R (Stimulus-Response) framework of behavioural psychology. The Law of Effect describes how stimuli and responses become associated: behaviours followed by positive outcomes are more likely to be repeated, while those followed by negative outcomes are less likely to occur again.
Applied to modern product management, this means that users tend to repeat actions that produce beneficial outcomes and avoid actions that lead to frustration or failure. As users interact with a product, their behaviour is shaped by both positive and negative experiences, which gradually reinforces certain behaviours and discourages others, and over time leads users to form habits. For example, people who experienced better outcomes clicking on ads are more likely to click on ads in the future, and people who experienced worse outcomes are less likely to click on ads in later experiments. The relevance and quality of the advertised materials therefore impacts user behaviour on the platform showing the ads in the long term. These learned behaviours can manifest as ads blindness and ads sightedness.
Measuring User Learning
There are several ways to measure long-term effects such as user learning, including post-period learning measurement, time-staggered treatments and reverse experiments.
Post-Period Learning Measurement
In the paper Focusing on the Long-term: It’s Good for Users and Business, Henning Hohnhold, Deirdre O’Brien, and Diane Tang (all working at Google at the time) propose to “sandwich the treatment period between two A/A test periods”. Running an A/A test before the A/B test establishes a behavioural baseline for the cohorts (and should generally confirm that both cohorts behave in the same way). Running an A/A test after the A/B test then tracks the behaviour of the original cohorts separately, and shows whether there are significant differences between them. Since both cohorts experience the same product during the post-period A/A test, any differences in their behaviour are caused not by the product, but by the behavioural influence of the exposure during the A/B test. This method provides insight into how much users have “learned”, or changed their behaviour, due to the experiment.
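The post-period comparison can be sketched with simulated data. This is a minimal illustration, not the paper’s methodology: the per-user click-through rates and the small residual gap between former cohorts are made up, and a simple two-sample z-statistic stands in for whatever significance test an experimentation platform would actually run.

```python
import random
import statistics

random.seed(42)

# Hypothetical post-period data: both cohorts now see the identical product,
# so any persistent gap in their metric reflects learned behaviour, not the UI.
# Here we assume the treatment taught its cohort to click slightly more often.
control_post = [random.gauss(0.100, 0.02) for _ in range(5000)]    # per-user CTR
treatment_post = [random.gauss(0.104, 0.02) for _ in range(5000)]  # residual habit

learning_effect = statistics.mean(treatment_post) - statistics.mean(control_post)

# Rough standard error of the difference in means, for a two-sample z-statistic.
se = (statistics.pvariance(control_post) / len(control_post)
      + statistics.pvariance(treatment_post) / len(treatment_post)) ** 0.5
z = learning_effect / se

print(f"post-period learning effect: {learning_effect:+.4f} (z = {z:.2f})")
```

A significant post-period gap is attributed to learning, because the product shown to both cohorts is now identical.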
Although post-period learning measurement is quite simple, Hohnhold, O’Brien and Tang point out several challenges with the approach. Over a longer period of time, the link between an individual experiment participant and the cohort might disappear (for example, if cookies are used for cohort assignment, users are more likely to clear cookies over longer periods). Because both the experimental and control cohorts see the same version of the product after the experiment, their behaviour is likely to converge over time (effectively, the control cohort starts to learn if the experiment was a success and the tested change was deployed, or the experimental cohort starts to unlearn if the experiment failed and the change was not deployed). An additional issue is that post-period learning can only be measured after an experiment ends; it cannot be taken into consideration during the experiment.
Time-Staggered treatments
In Trustworthy Online Controlled Experiments, Kohavi and co-authors suggest measuring learning effects by having two experimental cohorts and one control cohort, but starting the treatment for the second experimental cohort with a delay. By measuring the difference in behaviour between the two experimental cohorts, we can spot effects of the first cohort forming habits.
A similar method is proposed by Hohnhold, O’Brien and Tang, but they suggest launching several experimental cohorts, one every day. The first is actually used for the A/B testing results and kept for the duration of the experiment. The others launch staggered, one on each day of the test, and get re-randomized daily. The assumption underlying this method is that people in the additional cohorts receive some exposure to the tested change, but not enough to accumulate learning effects to the same degree as the cohort that receives continuous exposure. This allows you to take measurements throughout the experiment, rather than waiting until the end as in the post-period learning measurement method. The original test cohort is assigned using a cookie, and the additional ones are assigned using a combination of a cookie and a date, hence the name Cookie/Cookie-Day (CCD).
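The two assignment schemes can be sketched with deterministic hash bucketing, a common way to implement cohort assignment. This is a hedged illustration, not the paper’s implementation: the salt string, the 50/50 split, and the function names are assumptions.

```python
import hashlib
from datetime import date

def bucket(key: str, salt: str = "experiment-42") -> str:
    """Deterministically map a key to 'treatment' or 'control' (50/50 split)."""
    h = int(hashlib.sha256(f"{salt}:{key}".encode()).hexdigest(), 16)
    return "treatment" if h % 2 == 0 else "control"

def cookie_assignment(cookie: str) -> str:
    # Persistent cohort: the same cookie always lands in the same arm,
    # so this cohort accumulates learning over the whole experiment.
    return bucket(cookie)

def cookie_day_assignment(cookie: str, day: date) -> str:
    # Re-randomized daily: exposure is intermittent, so little learning accrues.
    return bucket(f"{cookie}:{day.isoformat()}")

c = "user-cookie-abc123"
print(cookie_assignment(c))                        # stable across days
print(cookie_day_assignment(c, date(2024, 1, 1)))  # may flip from day to day
print(cookie_day_assignment(c, date(2024, 1, 2)))
```

Comparing metrics between the persistent cookie cohort and the daily-reshuffled cookie-day cohorts then gives an ongoing estimate of how much of the observed effect is accumulated learning.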
The benefit of time-staggered treatments and CCD is that they can run concurrently with an experiment, but the downside is that they require a more complex testing infrastructure.
Reverse experiments
Another potential option to measure long-term effects is to run a reverse experiment, as suggested in Trustworthy Online Controlled Experiments. Weeks or months after a change is deployed, a percentage of users is shown the original control version (effectively losing the benefits of the deployed change). By that point, all users have been exposed to the change for a while, and may have formed new habits because of it. By removing the change for an experimental population of users, we can detect behavioural changes caused by the learning rather than by the software itself.
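A reverse experiment amounts to a small holdback routed back to the old version after full launch. The sketch below uses the same hash-bucketing idea; the holdback percentage, salt, and arm names are illustrative assumptions, not prescriptions from the book.

```python
import hashlib

def reverse_experiment_arm(user_id: str, holdback_pct: float = 5.0,
                           salt: str = "reverse-exp-1") -> str:
    """After full launch, route a small holdback back to the old version.

    Any metric gap between 'holdback' and 'launched' users is now driven by
    the habits users formed while everyone had the feature, not by novelty.
    """
    h = int(hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest(), 16)
    return "holdback_old_version" if (h % 10000) < holdback_pct * 100 else "launched"

# Hypothetical user population: check the split lands near the configured 5%.
arms = [reverse_experiment_arm(f"user-{i}") for i in range(100_000)]
share = arms.count("holdback_old_version") / len(arms)
print(f"holdback share: {share:.3f}")
```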
This approach is useful when there is market or business pressure to deploy a successful change to everyone. The downside is that users who are later shown the older version might be confused.
Including User Learning into experiment evaluation
Hohnhold, O’Brien and Tang recommend using both post-period A/A tests and re-randomization (Cookie/Cookie-Day), and then predicting long-term behaviour changes from short-term indicators, by establishing how CCD factors relate to the longer-term changes detected by post-period measurements. Those predictions can then be used to adjust the Overall Evaluation Criteria for experiments.
It’s important to watch out for subtle differences in how users learn across different platforms. Hohnhold, O’Brien and Tang reported that their estimates for user learning were close to actual values on mobile devices, but the same models underestimated learning for laptop and desktop users by a factor of 1.5 to 3. In the population of users that they tested, laptop/desktop users either learned more slowly than mobile users, or had more intensive learning effects over a longer period.
Learn more about User Learning Effects
- Animal Intelligence: An experimental study of the associative processes in animals, In Psychological Monographs: General and Applied, 2(4) by E.L. Thorndike (1898)
- Trustworthy Online Controlled Experiments: A practical guide to A/B testing, ISBN 978-1108724265, by Ron Kohavi, Ya Xu (2020)
- Focusing on the Long-term: It's Good for Users and Business, from the Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining by Henning Hohnhold, Deirdre O'Brien, Diane Tang (2015)