The teams that successfully scale experimentation programs all do the same thing in the first 90 days. They do not try to run more tests. They do not try to cover more surface area. They do not try to build sophisticated tooling. They pick a few high-leverage tests, run them cleanly, report the dollar value in language the C-suite understands, and use that as the foundation for every budget request that follows.

The teams that fail to scale do the opposite. They try to run everything at once, produce a flurry of activity with unclear business impact, and then wonder why leadership is reluctant to give them more budget, more headcount, or better tools.

Proving value before you scale is not caution. It is leverage.

"Building an experimentation program is really about initially getting some quick winnings and running clean tests — so we can prove that experimentation has a dollar value attached to the stuff we're doing." — Atticus Li

Why Scaling Without Proof Fails

When a program scales before proving value, what happens is predictable. You start running more tests. Different stakeholders come in with ideas, many of which are low-quality or HiPPO-driven (the highest-paid person's opinion). Your win rate drops because test quality is uneven. Your operational cost goes up because you are running more tests with the same team. Your storytelling gets harder because you are producing more results but fewer of them are compelling.

Eventually, someone in finance looks at the spend and says "what are we getting for this?" And you do not have a clean answer because the program's value was never framed in dollars from day one. At that point, budget arguments become defensive instead of offensive, and the program stalls.

The inverse path is much cleaner. Start small. Run three to five high-leverage tests. Calculate the projected and realized revenue impact of each one. Build a single scorecard that says "in the first quarter, our program generated $X in realized revenue and $Y in learning value." Walk that scorecard into every executive conversation for the next year.

The Pre-Test Revenue Calculation (in Detail)

This is the single most important habit I teach experimentation leads. Before a test is approved to run, calculate its expected revenue impact.

The math is simple:

Inputs you need:

  • Baseline conversion rate on the primary metric
  • Weekly sessions on the page being tested
  • Revenue per converted user (or LTV if appropriate)
  • Minimum detectable effect (MDE) you are powering for

The calculation:

Weekly baseline conversions = Weekly sessions × Baseline conversion rate

Weekly baseline revenue from this page = Weekly baseline conversions × Revenue per conversion

If the variant wins at the MDE you are powering for:

Weekly lift in revenue = Weekly baseline revenue × MDE

Annualized projected impact = Weekly lift × 52

Now you have a single number: "this experiment is projected to generate $X in annualized revenue if it wins at the minimum detectable effect." That number is your executive translation.
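
Here is that projection as a short Python sketch. Every input value below is an illustrative placeholder, not a number from a real program:

# Pre-test projection: annualized revenue impact if the variant
# wins at exactly the MDE. All inputs are illustrative.
weekly_sessions = 120_000          # weekly sessions on the tested page
baseline_cr = 0.032                # baseline conversion rate
revenue_per_conversion = 85.0      # dollars per converted user
mde = 0.05                         # 5% relative minimum detectable effect

weekly_conversions = weekly_sessions * baseline_cr
weekly_revenue = weekly_conversions * revenue_per_conversion
weekly_lift = weekly_revenue * mde
annualized_projection = weekly_lift * 52

print(f"Projected annualized impact: ${annualized_projection:,.0f}")
# -> Projected annualized impact: $848,640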

The Post-Test Realized Value Calculation

The post-test version is equally important. After the test concludes, you should report the actual realized impact, which will almost always be different from the projection.

  • If the variant lost, the realized impact is zero (or negative if you let a losing variant run all the way to significance, though proper guardrails should have stopped it earlier).
  • If the variant won but the lift was smaller than projected, the realized impact scales down proportionally.
  • If the variant won and the lift was larger than projected, the realized impact scales up.

The key discipline is to not pretend. If you projected $440k and delivered $180k, report $180k. Over time, your credibility depends on the accuracy of your reporting, not the optimism of your projections.
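
The realized version is the same arithmetic with the measured lift swapped in for the MDE. A sketch, reusing the illustrative numbers from the pre-test example:

# Post-test realized value: swap the MDE for the measured lift.
# A losing variant realizes zero; report whatever actually happened.
weekly_baseline_revenue = 326_400.0  # from the pre-test sketch, illustrative
measured_lift = 0.021                # measured relative lift (2.1%), illustrative

realized_weekly = weekly_baseline_revenue * max(measured_lift, 0.0)
realized_annualized = realized_weekly * 52

print(f"Realized annualized impact: ${realized_annualized:,.0f}")
# -> Realized annualized impact: $356,429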

I also recommend running a post-launch holdout for 30-60 days on high-value winners. This lets you verify that the lift persists over time, which it often does not. Novelty effects fade. Seasonal patterns distort. A 21% lift measured during a test window might be a 9% lift measured over a full year. The holdout tells you which one you actually have.
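
The holdout comparison itself is simple arithmetic. A sketch with made-up numbers (the in-test lift is computed the same way over the test window):

# Holdout verification: compare revenue per visitor for the launched
# experience against a 30-60 day holdout kept on the old experience.
# Both figures are illustrative.
holdout_rpv = 2.61      # revenue per visitor, holdout group (old experience)
launched_rpv = 2.84     # revenue per visitor, launched group (new experience)

persistent_lift = launched_rpv / holdout_rpv - 1
print(f"Lift that actually persists: {persistent_lift:.1%}")
# -> roughly 8.8%, regardless of what the test window showed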

"Using revenue per customer, we can calculate — before the test runs — the projected value of this test based on MDE. After the test runs, we look at the actual stats, the actual lift, how much it generates during the test, and how much it's going to over a year." — Atticus Li

Why Experimentation Wins the ROI Argument

Here is a conversation I have had with CFOs more than once. The finance team is comparing experimentation to brand marketing and asking why they should fund more of the former.

The answer is not that brand marketing does not work. It does. But the ROI of brand marketing comes with a lot of asterisks — brand lift studies, estimated impression counts, multi-touch attribution assumptions, share-of-voice models. Every one of those numbers has a range of uncertainty around it.

Experimentation has almost none of that uncertainty. You ran a randomized experiment with a control and a variant, and you measured the lift directly on the same population during the same time window. The noise is minimized by design. When you say "this test drove a 4.2% lift in revenue per visitor," you can defend that number at a level of rigor that almost no other marketing function can match.

This is not an argument against brand marketing. It is an argument that experimentation has a structural advantage in the ROI conversation if you choose to frame it that way.

"A/B testing has scientific rigor. We know exactly what was lifted, what changes were made. There's way less noise compared to brand marketing — where companies spend a lot of time and money, but can't really get a very clear answer on the impact." — Atticus Li

The First 90 Days

If you are scaling an experimentation program from scratch, here is how I would structure the first quarter:

Weeks 1-2: Choose your proving-ground tests.

Pick 3-5 tests on high-traffic, high-revenue pages where you can plausibly reach significance within 4 weeks. Avoid novelty tests. Avoid HiPPO tests. The goal is to stack the deck so you have clean, reportable results fast.

Weeks 3-4: Instrumentation audit.

Before running anything, validate your tracking. Run A/A tests. Check for sample ratio mismatch. Audit the data pipeline from event fire to dashboard. You cannot report dollar impact if your instrumentation is broken.
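
As one example of what this audit catches, here is a quick sample-ratio-mismatch check using a chi-squared test. This is a sketch using scipy; the assignment counts are hypothetical:

# SRM check: under a 50/50 split, assignment counts should match the
# expected ratio. A tiny p-value means the split itself is broken and
# no downstream result can be trusted.
from scipy.stats import chisquare

control_n, variant_n = 50_812, 49_103   # observed assignments, illustrative
total = control_n + variant_n
expected = [total / 2, total / 2]       # expected counts under a 50/50 split

stat, p_value = chisquare([control_n, variant_n], f_exp=expected)
if p_value < 0.001:
    print(f"Possible SRM (p = {p_value:.2g}): halt and audit assignment.")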

Weeks 5-10: Run the tests.

With pre-test revenue calculations attached to each one. Document every hypothesis, every projected impact, every assumption. Treat this phase as the foundation for the narrative you will tell later.

Weeks 11-12: Build the scorecard.

One page. Top section: total projected impact across all tests run. Next section: total realized impact (be honest). Next section: biggest learnings (what we know now that we did not know 90 days ago). Final section: what we could do with more budget.

Week 13: Present to leadership.

Not to report. To ask for the next level of investment, with a specific plan for what you would do with it.

The Standardization Advantage

One reason proving value works is that it forces the team to standardize early. When you are calculating projected and realized impact for every test, you have to agree on:

  • How baseline conversion is measured
  • How revenue per conversion is attributed
  • Which MDE is acceptable for which types of tests
  • How long the post-test holdout runs
  • Who signs off on the final realized-value number

These decisions become your program's standard operating procedure. And because they are tied to reportable dollar value from day one, they do not get skipped. Standards imposed after a program is already scaling are almost impossible to enforce. Standards built into the reporting from the start survive.
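
One lightweight way to make those standards hard to skip is to encode them directly in the reporting pipeline. A sketch of what that could look like; every field name and value here is a hypothetical example, not a prescribed standard:

# Program standards as a single reusable config, so every test report
# is forced through the same definitions. All values illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class ProgramStandards:
    baseline_window_days: int = 28              # how baseline conversion is measured
    revenue_attribution: str = "last_touch_7d"  # how revenue per conversion is attributed
    default_mde: float = 0.05                   # acceptable MDE for standard tests
    holdout_days: int = 45                      # post-launch holdout length (30-60 day range)
    signoff_role: str = "analytics_lead"        # who signs off the realized-value number

STANDARDS = ProgramStandards()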

FAQ

What if the first few tests lose?

That is still valuable as long as you learned something and the losses were cheap. Report the learning in dollar terms: "this test cost $12k in implementation time, but it stopped us from shipping a change that would have cost $X had it gone live untested." Losses that teach are investments, not failures.

How conservative should pre-test projections be?

Conservative enough that you rarely over-deliver by more than 2x. If your projections consistently undershoot by a wide margin, you are sandbagging and leadership will notice. If they consistently overshoot, you are losing credibility. Calibration is the goal.
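
A simple way to track calibration is the realized-to-projected ratio across completed tests. A sketch; the figures are made up:

# Calibration: realized / projected across completed tests.
# Near 1.0 is well calibrated; far below 1.0 means over-promising,
# far above 1.0 means sandbagging. All figures illustrative.
projected = [440_000, 120_000, 310_000]
realized  = [180_000, 150_000, 260_000]

ratios = [r / p for r, p in zip(realized, projected)]
avg = sum(ratios) / len(ratios)
print(f"Average calibration ratio: {avg:.2f}")
# -> 0.83: projections are running slightly hot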

What if I do not have a clean revenue-per-conversion number?

Work with finance to get one. If finance cannot give you one, that is a bigger problem than your experimentation program — it means nobody in the company can answer basic unit economics questions. Solving this pays for itself many times over.

How do you handle tests whose value is indirect?

For tests that affect engagement, retention, or activation, use a two-step translation. Step one: calculate the lift in the proximal metric. Step two: apply a learned or assumed conversion rate from that metric to revenue. Document the assumption. Leadership will push back on it, and that conversation is healthy.
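
A sketch of that two-step translation. The activation-to-revenue conversion factor below is exactly the kind of assumption you would document and defend:

# Two-step translation for an indirect metric: measured lift in a
# proximal metric, then an assumed conversion from that metric to revenue.
weekly_activations = 9_000          # baseline weekly activations, illustrative
activation_lift = 0.06              # measured 6% relative lift in activation
revenue_per_activation = 14.0       # ASSUMPTION: learned from historical cohorts

extra_activations = weekly_activations * activation_lift
annualized = extra_activations * revenue_per_activation * 52
print(f"Projected annualized impact: ${annualized:,.0f}")
# -> Projected annualized impact: $393,120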

Turn Your Program Into a Revenue Function

If your experimentation program feels like a cost center, the problem is almost always framing. You have the data. You have the results. What is missing is the discipline to translate every test into the language leadership uses to allocate capital.

I built GrowthLayer with pre-test and post-test revenue calculations as a first-class feature. Every experiment you log is automatically tied to a projected and realized dollar value, which makes executive reporting trivial instead of the most painful part of the quarter.

If you are hiring for experimentation leaders who know how to frame programs in revenue terms, or looking to build those skills, explore open roles on Jobsolv.

Or book a consultation and I will help you design a 90-day proving-ground plan for your program.

Atticus Li

Leads applied experimentation at NRG Energy. $30M+ in verified revenue impact through behavioral economics and CRO.