
Holdout Testing

A method for measuring the cumulative impact of all shipped experiments by withholding changes from a small percentage of users.

What Is Holdout Testing?

Holdout testing (also called holdback testing) is how you prove the aggregate value of your experimentation program. You exclude a small group — typically 5–10% of users — from all shipped changes for an extended window, then compare their metrics against the group receiving every optimization. This measures the true cumulative impact of your program, which is almost always different from the sum of individual test lifts.

Individual A/B tests measure single changes. Holdouts measure the system. Because of interaction effects, regression to the mean, and changing user behavior, the cumulative impact of 20 shipped changes over a quarter almost never equals the sum of their individual reported lifts.
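To see why summed lifts mislead, consider a toy arithmetic sketch. All numbers here are illustrative: two hypothetical tests each fix the same friction point, so each reports +10% when measured alone but they largely overlap when shipped together.

```python
base = 0.10        # baseline trial-to-paid conversion rate (illustrative)

# Measured alone, each hypothetical test lifts conversion from 10% to 11%.
rate_with_a = 0.11
rate_with_b = 0.11

# Shipped together, the changes overlap: the combined rate is 11.2%,
# not the 12.1% that stacking both relative lifts would predict.
rate_with_both = 0.112

lift_a = rate_with_a / base - 1          # +10% relative
lift_b = rate_with_b / base - 1          # +10% relative
naive_sum = lift_a + lift_b              # +20% "claimed" by test reports
actual = rate_with_both / base - 1       # +12% a holdout would measure

print(f"sum of individual lifts: {naive_sum:.0%}")    # 20%
print(f"measured against holdout: {actual:.0%}")      # 12%
```

Only the holdout comparison captures the overlapping effect; the individual test reports have no way to see it.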

Also Known As

  • Marketing: Global holdback, campaign holdout
  • Sales: Control group, untreated cohort
  • Growth: Growth holdout, program holdback
  • Product: Feature holdback, legacy experience group
  • Engineering: Control bucket, persistent control
  • Data: Cumulative impact measurement, global control

How It Works

A SaaS company shipped 14 winning tests over Q1 that, summed individually, claimed a 22% increase in trial-to-paid conversion. Their 7% holdout group saw the product as it was on January 1. At the end of the quarter, comparing the two groups revealed the actual cumulative lift was 11% — about half of what the individual tests claimed. The difference came from overlapping tests, interaction effects, and a few winners that stopped working over time.

The 11% was still enormous and worth celebrating. More importantly, the holdout gave leadership a defensible answer to "is experimentation actually working?" that no individual test could provide.
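The quarter-end comparison is a simple ratio of conversion rates between the two groups. A minimal sketch with illustrative counts (invented to match the 7% holdout and ~11% lift in the example, not actual figures):

```python
# Illustrative quarter-end counts for a 7% holdout vs the optimized 93%.
holdout_users, holdout_paid = 14_000, 1_400     # 10.0% trial-to-paid
treated_users, treated_paid = 186_000, 20_646   # 11.1% trial-to-paid

holdout_rate = holdout_paid / holdout_users
treated_rate = treated_paid / treated_users

# The program's cumulative lift is treated vs holdout — not a sum of
# the individual test reports shipped during the window.
cumulative_lift = treated_rate / holdout_rate - 1
print(f"cumulative lift: {cumulative_lift:.0%}")  # 11%
```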

Best Practices

  • Start with a 5–10% holdout; smaller groups take too long to reach statistical significance.
  • Run holdouts in quarterly or half-yearly cycles — long enough to accumulate multiple shipped winners, short enough to release the holdout users to the optimized experience.
  • Track the same primary and guardrail metrics as your individual experiments.
  • Document which tests shipped during the holdout window so you can diagnose disappointing results.
  • Rotate holdout cohorts across cycles to avoid persistent inequity for specific users.
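Assignment and rotation are often implemented with a salted hash, so membership is deterministic within a cycle but reshuffles between cycles. A sketch of that pattern — the function name and salt format are assumptions, not any specific tool's API:

```python
import hashlib

def in_holdout(user_id: str, cycle: str, holdout_pct: float = 0.07) -> bool:
    """Deterministically place ~holdout_pct of users in the holdout.

    Salting the hash with the cycle label (e.g. "2026-Q1") rotates
    which users are held out each cycle, so no cohort is stuck with
    the unoptimized experience forever.
    """
    digest = hashlib.sha256(f"{cycle}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return bucket < holdout_pct

# Membership is stable within a cycle:
assert in_holdout("user-42", "2026-Q1") == in_holdout("user-42", "2026-Q1")
```

For B2B products, hashing an account ID instead of a user ID gives the account-level holdouts mentioned under Industry Context below.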

Common Mistakes

  • Contaminating the holdout by accidentally exposing holdout users to new features through marketing emails or in-app announcements.
  • Expecting the sum of individual lifts — teams get defensive when holdouts show smaller aggregate impact, but that's the point of running them.
  • Running holdouts too short to detect significance, producing unreliable program-level conclusions.
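How short is "too short" depends on traffic and the lift you hope to detect. A standard two-proportion power approximation (two-sided 5% significance, 80% power; the helper below is an illustrative sketch, not a substitute for a proper power analysis) shows why small program-level lifts demand large holdouts:

```python
from math import ceil

def users_per_arm(base_rate: float, rel_lift: float,
                  z_alpha: float = 1.96, z_power: float = 0.84) -> int:
    """Rough per-arm sample size for a two-proportion z-test
    (two-sided 5% significance, 80% power)."""
    p1 = base_rate
    p2 = base_rate * (1 + rel_lift)
    var_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * var_sum / (p2 - p1) ** 2)

# Detecting a 5% relative lift on a 10% base rate takes ~15x the
# holdout traffic that a 20% lift would:
print(users_per_arm(0.10, 0.05))  # 57695
print(users_per_arm(0.10, 0.20))  # 3834
```

If your holdout cannot reach the required size within a cycle, either lengthen the cycle or accept that only larger aggregate effects are detectable.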

Industry Context

SaaS/B2B: Holdouts are especially valuable for measuring the cumulative impact of activation and retention experiments where individual test lifts are small but compound. Low traffic makes percentage holdouts statistically painful, so many B2B teams use account-level rather than user-level holdouts.

Ecommerce/DTC: High-traffic sites can run tighter 5% holdouts and still reach significance quickly. Holdouts here often reveal that checkout optimization wins overlap substantially — you cannot stack 10 checkout lifts and expect linear additivity.

Lead gen: Holdouts work well for measuring the cumulative impact of landing page optimizations, especially when combined with downstream revenue attribution.

The Behavioral Science Connection

Holdout testing is a structural defense against the illusion of control — the cognitive bias where people overestimate their influence over outcomes. Teams running successful experiments naturally attribute lifts to their work; holdouts force a reality check by showing what would have happened anyway. This honest accounting is culturally expensive but strategically essential.

Key Takeaway

Holdout testing answers the executive question "is our experimentation program actually moving the business?" with evidence no individual test can provide.