How Holdout Tests Prove Incremental Revenue

Atticus Li

What holdout tests actually prove about incremental revenue, when to use them, and how to defend results under stakeholder pressure.

By Atticus Li January 22, 2025 4 min read

What a holdout test proves (and what it doesn't)

If you're under pressure to grow revenue, "our ROAS looks good" isn't proof. It's a story that shows correlation, not causal impact. Incrementality testing answers the real question: did we create revenue that wouldn't have happened anyway?

A holdout test is the gold standard of incrementality testing: a controlled experiment where you intentionally withhold a treatment from a randomly assigned control group. The treatment might be ads, an email sequence, an in-app paywall, a promo, a sales assist, or an AI-driven personalization model.

Holdouts are especially useful when marketing attribution lies to you, which is most of the time. Retargeting under last-click attribution is the classic case: many of those buyers were already on the path to purchase. The platform gets credit, but your bank account doesn't change.

What a holdout test is great at

Measuring incremental revenue and incremental conversions, not just clicks
Making budget decisions where credited conversions overstate impact
Measuring product-led growth motions like lifecycle nudges, paywall changes, and onboarding interventions

What it's bad at

Diagnosing why something worked. It tells you how much lift you got at scale, not which UX detail caused it
Short tests with fast-changing demand. If your baseline is unstable, your conclusion will be too

Designing a holdout test that Finance will accept

Holdout tests usually fail for boring reasons: bad randomization, too-small samples, or interference between groups. If you fix those, the rest is mostly arithmetic and discipline.

Pick one primary outcome.

If you're proving revenue, use revenue (not CTR). If revenue is lagged, use paid conversion with a clearly stated average revenue per account (ARPA) assumption.

Define the unit of randomization.

User, account, household, geo. Pick the unit that matches how the treatment is delivered.

Choose holdout size based on risk.

Start with 5% to 20%. A larger holdout tightens the estimate for the control group but costs more in foregone upside.

Run a pre-period baseline check.

Before the treatment starts, Test and Holdout should look similar on the outcome and key leading indicators.

Commit to a test duration.

Don't stop early because the chart looks good. That's how you buy false certainty.

Most teams underpower these tests. Plan duration and minimum detectable effect up front with a calculator, then decide if the test is worth running.
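
As a rough illustration of that up-front planning, here is a minimal sketch of a sample-size calculation for a difference in mean revenue per user. It assumes a two-sided z-test, a known standard deviation of revenue per user, and an unequal test/holdout split. The function name, the `sd` value, and the defaults for `alpha` and `power` are assumptions for illustration, not figures from this article.

```python
from math import ceil
from statistics import NormalDist


def required_sample_sizes(sd, mde, holdout_share, alpha=0.05, power=0.80):
    """Return (n_test, n_holdout) needed to detect an absolute difference
    `mde` in mean revenue per user (two-sided z-test, normal approximation).

    sd            -- assumed standard deviation of revenue per user
    mde           -- minimum detectable effect, in the same units as sd
    holdout_share -- fraction of units assigned to holdout, e.g. 0.10
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Var(difference) = sd^2 * (1/n_test + 1/n_holdout)
    #                 = sd^2 / (N * p * (1 - p)), with p = holdout_share
    p = holdout_share
    total = (z_alpha + z_power) ** 2 * sd ** 2 / (mde ** 2 * p * (1 - p))
    return ceil(total * (1 - p)), ceil(total * p)


# Hypothetical inputs: sd of $40 per user, detect a $0.90 lift, 20% holdout
n_test, n_holdout = required_sample_sizes(sd=40.0, mde=0.90, holdout_share=0.20)
print(n_test, n_holdout)
```

Divide the required total by your eligible traffic per week to get a duration. Halving the minimum detectable effect roughly quadruples the sample you need, which is usually the point where you decide whether the test is worth running.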

Two practical design notes

First, privacy regulations are making user-level isolation harder. Where you can't reliably assign or track individual users, fall back to geographic testing with geo-split designs. It's messier, but still workable.

Second, applied AI helps but it doesn't replace design. AI can help with audience selection, stratified randomization, and anomaly alerts. Still, the assumptions must hold:

Stable first-party data tracking
Clean assignment
Minimal spillover between groups

If you can't explain how someone ends up in Holdout, you don't have a causal test. You have a dashboard.
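
One way to make assignment explainable is to derive it from a hash of the unit ID, so anyone can recompute which group a user belongs to. This is a minimal sketch rather than a prescribed method. The function name, the experiment label, and the 10% default are hypothetical.

```python
import hashlib


def assign_group(unit_id: str, experiment: str, holdout_share: float = 0.10) -> str:
    """Deterministically map a unit (user, account, geo) to 'holdout' or 'test'.

    The same unit_id and experiment name always give the same group, and
    salting with the experiment name keeps assignments in different
    experiments from lining up with each other.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).hexdigest()
    # First 8 hex characters -> integer in [0, 2**32), scaled to [0, 1)
    bucket = int(digest[:8], 16) / 2**32
    return "holdout" if bucket < holdout_share else "test"


print(assign_group("user-12345", "retargeting-holdout"))
```

Because the mapping depends only on the ID and the experiment name, it can be audited after the fact without storing a separate assignment table.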

Calculating incremental revenue and making the call

The cleanest calculation compares the test audience against the control (holdout) group during the test period. In practice, adjust for baseline because real markets move.

A simple way is difference-in-differences:

Measure revenue per user in pre-period for both groups
Measure revenue per user in test period for both groups
Incremental lift is the change in test audience minus the change in control group

Concrete example

Test audience: 80,000 users

Pre-period revenue per user: $10.00

Test-period revenue per user: $11.40

Change: +$1.40

Control group: 20,000 users

Pre-period revenue per user: $10.10

Test-period revenue per user: $10.60

Change: +$0.50

Incremental revenue per user: $1.40 − $0.50 = $0.90
Total incremental revenue: $0.90 × 80,000 = $72,000 for that window
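
The same arithmetic as a short script, using the numbers from the example above. The class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    users: int
    pre_rpu: float   # revenue per user in the pre-period
    test_rpu: float  # revenue per user in the test period

    @property
    def change(self) -> float:
        return self.test_rpu - self.pre_rpu


def did_lift_per_user(test: GroupStats, control: GroupStats) -> float:
    """Difference-in-differences: change in test minus change in control."""
    return test.change - control.change


test = GroupStats(users=80_000, pre_rpu=10.00, test_rpu=11.40)
control = GroupStats(users=20_000, pre_rpu=10.10, test_rpu=10.60)

lift = did_lift_per_user(test, control)
total = lift * test.users  # lift applies to the users who got the treatment

print(f"Incremental revenue per user: ${lift:.2f}")   # $0.90
print(f"Total incremental revenue: ${total:,.0f}")    # $72,000
```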

From there, compute incremental ROAS (iROAS), which is incremental revenue divided by what the program cost over the same window, and compare it to your threshold.

Decision rules

If iROAS clears your threshold (say, 3.0, meaning incremental revenue is 3× the cost of the program), scale
If the confidence interval includes materially negative values, stop or redesign
If it's positive but small, look for a cheaper variant to avoid diminishing returns
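
These rules can be written down before the test starts so nobody renegotiates them afterwards. The sketch below is one possible encoding, not a standard. The program cost, the standard error of the lift, and the "materially negative" cutoff are hypothetical inputs, and checking the confidence interval before the iROAS threshold is an assumption about rule order.

```python
from statistics import NormalDist


def decide(lift_per_user, se_lift, treated_users, program_cost,
           iroas_threshold=3.0, material_loss_per_user=-0.25, confidence=0.95):
    """Return (action, iroas, (ci_low, ci_high)) for a finished holdout test.

    se_lift                -- standard error of the per-user lift estimate
    material_loss_per_user -- lift level that counts as 'materially negative'
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    ci_low = lift_per_user - z * se_lift
    ci_high = lift_per_user + z * se_lift
    iroas = lift_per_user * treated_users / program_cost

    if ci_low <= material_loss_per_user:
        action = "stop or redesign"       # interval reaches a material loss
    elif iroas >= iroas_threshold:
        action = "scale"
    elif iroas > 0:
        action = "look for a cheaper variant"
    else:
        action = "stop or redesign"
    return action, iroas, (ci_low, ci_high)


# $0.90 lift on 80,000 users; the $20,000 cost and 0.30 standard error are made up
action, iroas, ci = decide(0.90, 0.30, 80_000, 20_000)
print(action, round(iroas, 2))
```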

Before you trust the result, check:

Assignment integrity (no leakage between test and holdout)
Missing conversions or tracking gaps
Spillover effects (e.g., geo or household contamination)
Whether the treatment changed mix (discounted orders vs full price). A holdout can prove lift while still lowering profit.
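
For the first check, a sample ratio mismatch test is a common quick screen (a technique brought in here, not one the checklist names). It compares observed group sizes against the intended split with a chi-square test. It only catches skewed counts, not individual users leaking between groups, so treat it as a smoke test. The function name and the alarm level in the comment are assumptions.

```python
from math import erfc, sqrt


def srm_p_value(n_test: int, n_holdout: int, expected_holdout_share: float) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the
    observed test/holdout counts against the intended split."""
    total = n_test + n_holdout
    exp_holdout = total * expected_holdout_share
    exp_test = total - exp_holdout
    chi2 = ((n_test - exp_test) ** 2 / exp_test
            + (n_holdout - exp_holdout) ** 2 / exp_holdout)
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2))
    return erfc(sqrt(chi2 / 2))


# Intended 20% holdout; 1,000 holdout users have gone missing
p = srm_p_value(n_test=80_000, n_holdout=19_000, expected_holdout_share=0.20)
print(p < 0.001)  # a very small p-value suggests broken assignment or tracking
```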

Conclusion: the decision I'd make this week

When you need to prove incremental revenue, run incrementality testing with a holdout that includes:

Clean assignment
Baseline checks
A pre-committed duration

Then translate lift into profit and pick a clear action.

Next step: choose one channel or lifecycle trigger that's expensive or politically protected, set a 10% holdout on a known audience, and define the profit threshold that earns scale.

Actionable takeaway: if you can't write down (1) your counterfactual, (2) your baseline check, and (3) your profit threshold, don't run the test yet. Fix the design first.

Figure: Timeline chart comparing revenue per user for test and control groups in a holdout experiment. Difference-in-differences: incremental lift is the change in the test group minus the change in the holdout group over the same period.

Tags: causal inference, holdout tests, incrementality, revenue measurement

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified behavioral economist (1 of ~1,000 worldwide). 200+ A/B tests across energy, SaaS, fintech, e-commerce, and marketplace verticals.
