What a holdout test proves (and what it doesn't)

If you're under pressure to grow revenue, "our ROAS looks good" isn't proof. It's a story that shows correlation, not causal impact. Incrementality testing answers the real question: did we create revenue that wouldn't have happened anyway?

A holdout test is the gold standard of incrementality testing: a controlled experiment where you intentionally withhold a treatment from a randomly assigned control group. The treatment might be ads, an email sequence, an in-app paywall, a promo, a sales assist, or an AI-driven personalization model.

Holdouts are especially useful when marketing attribution lies to you, which is most of the time. Last-click attribution in retargeting is the classic case: many of those buyers were already on the path to purchase. The platform gets credit, but your bank account doesn't change.

What a holdout test is great at

  • Measuring incremental revenue and incremental conversions, not just clicks
  • Making budget decisions where credited conversions overstate impact
  • Measuring product-led growth motions like lifecycle nudges, paywall changes, and onboarding interventions

What it's bad at

  • Diagnosing why something worked. It tells you whether the treatment worked at scale; it's not a UX microscope
  • Short tests with fast-changing demand. If your baseline is unstable, your conclusion will be too

Designing a holdout test that Finance will accept

Holdout tests usually fail for boring reasons: bad randomization, too-small samples, or interference between groups. If you fix those, the rest is mostly arithmetic and discipline.

  1. Pick one primary outcome.

If you're proving revenue, use revenue (not CTR). If revenue is lagged, use paid conversion with a clear assumption for average revenue per account (ARPA).

  2. Define the unit of randomization.

Options include user, account, household, or geo. Pick the unit that matches how the treatment is delivered.

  3. Choose holdout size based on risk.

Start with 5% to 20%. A larger holdout increases precision but costs more in foregone upside.

  4. Run a pre-period baseline check.

Before the treatment starts, Test and Holdout should look similar on the outcome and key leading indicators.

  5. Commit to a test duration.

Don't stop early because the chart looks good. That's how you buy false certainty.

Most teams underpower these tests. Plan duration and minimum detectable effect up front with a calculator, then decide if the test is worth running.
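As a rough illustration of what such a calculator does, here is a minimal sketch of a minimum detectable effect (MDE) estimate for a difference in mean revenue per user. It assumes a two-sided test, a normal approximation, and the same standard deviation in both groups. The function name, the 100,000-user audience, and the $25 standard deviation are hypothetical inputs, not figures from a real test.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(n_test, n_holdout, std_dev, alpha=0.05, power=0.80):
    """Smallest true difference in mean revenue per user that a two-sided
    test at level `alpha` would detect with probability `power`.

    Assumes a normal approximation and equal standard deviation in both groups.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    standard_error = std_dev * sqrt(1 / n_test + 1 / n_holdout)
    return (z_alpha + z_power) * standard_error


# Hypothetical inputs: 100,000 users, $25 standard deviation of revenue per user.
TOTAL_USERS = 100_000
for holdout_share in (0.05, 0.10, 0.20):
    n_holdout = round(TOTAL_USERS * holdout_share)
    n_test = TOTAL_USERS - n_holdout
    mde = minimum_detectable_effect(n_test, n_holdout, std_dev=25.0)
    print(f"holdout {holdout_share:.0%}: MDE is about ${mde:.2f} per user")
```

If the MDE comes out larger than any lift you could plausibly expect, the test is underpowered at that size and duration, and it may not be worth running as designed.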

Two practical design notes

First, privacy regulations are making user-level isolation harder. Where you can't reliably assign individual users, use geographic testing with geo-split designs. It's messier, but still workable.

Second, applied AI helps but it doesn't replace design. AI can help with audience selection, stratified randomization, and anomaly alerts. Still, the assumptions must hold:

  • Stable first-party data tracking
  • Clean assignment
  • Minimal spillover between groups

If you can't explain how someone ends up in Holdout, you don't have a causal test. You have a dashboard.
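One way to make assignment explainable is to derive it deterministically from a hash of the unit ID. The sketch below is an assumption about how that could look, not a prescribed method: `assign_group`, the salt string, and the user IDs are all hypothetical names. The same ID always lands in the same group, and changing the salt reshuffles assignment for a new experiment.

```python
import hashlib


def assign_group(unit_id: str, experiment_salt: str, holdout_share: float = 0.10) -> str:
    """Deterministically map a unit (user, account, geo) to 'holdout' or 'test'."""
    key = f"{experiment_salt}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Use the first 8 hex digits as a number in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return "holdout" if bucket < holdout_share else "test"


# Hypothetical IDs and salt, for illustration only.
groups = [assign_group(f"user-{i}", "retargeting-holdout") for i in range(20_000)]
share = groups.count("holdout") / len(groups)
print(f"Observed holdout share: {share:.1%}")
```

Because the rule is a pure function of the ID and the salt, anyone can re-derive which group a unit belongs to, which makes audits and leakage checks much easier.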

Calculating incremental revenue and making the call

The cleanest calculation compares the test audience with the control group during the test period. In practice, adjust for baseline because real markets move.

A simple way is difference-in-differences:

  1. Measure revenue per user in pre-period for both groups
  2. Measure revenue per user in test period for both groups
  3. Incremental lift is the change in test audience minus the change in control group

Concrete example

  • Test audience: 80,000 users

Pre-period revenue per user: $10.00

Test-period revenue per user: $11.40

Change: +$1.40

  • Control group: 20,000 users

Pre-period revenue per user: $10.10

Test-period revenue per user: $10.60

Change: +$0.50

  • Incremental revenue per user: $1.40 − $0.50 = $0.90
  • Total incremental revenue: $0.90 × 80,000 = $72,000 for that window
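The same arithmetic can be sketched in a few lines of Python. The function name is hypothetical; the inputs are the figures from the example above.

```python
def did_lift_per_user(test_pre, test_post, control_pre, control_post):
    """Difference-in-differences: change in test minus change in control."""
    return (test_post - test_pre) - (control_post - control_pre)


# Figures from the example above.
lift = did_lift_per_user(
    test_pre=10.00, test_post=11.40,
    control_pre=10.10, control_post=10.60,
)
total_incremental_revenue = lift * 80_000  # scaled to the test audience

print(f"Incremental revenue per user: ${lift:.2f}")               # $0.90
print(f"Total incremental revenue: ${total_incremental_revenue:,.0f}")  # $72,000
```

Note that the lift is scaled by the test audience only, since the holdout never received the treatment.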

From there, compute incremental ROAS (iROAS), which is incremental revenue divided by the cost of the program, and compare it to your threshold.

Decision rules

  • If iROAS clears your threshold (say, 3×, meaning incremental revenue is at least three times the cost of the program), scale
  • If the confidence interval includes materially negative values, stop or redesign
  • If it's positive but small, look for a cheaper variant to avoid diminishing returns
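These rules can be written down as a small function so the call is made before anyone sees the results. This is a sketch under stated assumptions: the $20,000 program cost, the confidence-interval lower bound of 1.2, and the `materially_negative` cutoff are hypothetical, and the 3× threshold is only the example value used above.

```python
def incremental_roas(incremental_revenue, program_cost):
    """iROAS = incremental revenue divided by the cost of the program."""
    if program_cost <= 0:
        raise ValueError("program_cost must be positive")
    return incremental_revenue / program_cost


def decide(iroas, ci_low_iroas, scale_threshold=3.0, materially_negative=-0.5):
    """Apply the decision rules in order.

    ci_low_iroas is the lower bound of the confidence interval for iROAS.
    materially_negative is a hypothetical cutoff; set it with Finance.
    """
    if ci_low_iroas <= materially_negative:
        return "stop or redesign"
    if iroas >= scale_threshold:
        return "scale"
    if iroas > 0:
        return "look for a cheaper variant"
    return "stop or redesign"


# $72,000 comes from the example above; cost and CI bound are hypothetical.
iroas = incremental_roas(incremental_revenue=72_000, program_cost=20_000)
print(f"iROAS: {iroas:.1f}x, decision: {decide(iroas, ci_low_iroas=1.2)}")
```

Writing the thresholds into code (or a doc) ahead of time is the point: it keeps the decision from drifting once the chart is on screen.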

Before you trust the result, check:

  • Assignment integrity (no leakage between test and holdout)
  • Missing conversions or tracking gaps
  • Spillover effects (e.g., geo or household contamination)
  • Whether the treatment changed mix (discounted orders vs full price). A holdout can prove lift while still lowering profit.
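One common way to spot-check assignment integrity is a sample ratio mismatch test: compare the observed group sizes with the split you intended. The sketch below uses a chi-square goodness-of-fit test with one degree of freedom. It is one possible check rather than a complete audit, and the function name and the 81,000 / 19,000 counts are hypothetical.

```python
from math import erfc, sqrt


def sample_ratio_p_value(n_test, n_holdout, expected_test_share):
    """P-value for observed group sizes under the intended split.

    Chi-square goodness-of-fit with 1 degree of freedom, for which the
    upper tail probability equals erfc(sqrt(chi2 / 2)).
    """
    total = n_test + n_holdout
    expected_test = total * expected_test_share
    expected_holdout = total - expected_test
    chi2 = ((n_test - expected_test) ** 2 / expected_test
            + (n_holdout - expected_holdout) ** 2 / expected_holdout)
    return erfc(sqrt(chi2 / 2))


# Intended split is 80/20. The second set of counts is hypothetical.
print(sample_ratio_p_value(80_000, 20_000, expected_test_share=0.80))
print(sample_ratio_p_value(81_000, 19_000, expected_test_share=0.80))
```

A very small p-value suggests units are being dropped, duplicated, or leaking between groups, and the lift estimate should not be trusted until the cause is found.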

Conclusion: the decision I'd make this week

When you need to prove incremental revenue, run incrementality testing with a holdout that includes:

  • Clean assignment
  • Baseline checks
  • A pre-committed duration

Then translate lift into profit and pick a clear action.

Next step: choose one channel or lifecycle trigger that's expensive or politically protected, set a 10% known-audience split, and define the profit threshold that earns scale.

Actionable takeaway: if you can't write down (1) your counterfactual, (2) your baseline check, and (3) your profit threshold, don't run the test yet. Fix the design first.
