Atticus Li is the experimentation lead at NRG Energy, where he scaled the company's testing program from roughly 20 experiments per year to over 100 per year in 2025, with 150+ total experiments completed across five retail energy brands. This article details exactly how that happened — the frameworks, the politics, the failures, and the specific wins that made C-suite finance pay attention.

The State of Things When I Arrived

When I joined NRG, the experimentation program existed, but it was small: around 20 tests a year, mostly driven by agency partners. The tests weren't bad, but they suffered from a problem I've seen at every enterprise I've worked with: the HiPPO problem.

HiPPO stands for Highest-Paid Person's Opinion. And at a company managing NRG's five retail brands, there were a lot of HiPPOs with a lot of opinions.

The typical test request looked like this: a VP sees a competitor's homepage and says "let's test that layout." No data backing the hypothesis. No understanding of whether the current design was actually underperforming. No pre-test analysis to determine if the test could even reach statistical significance with available traffic.

I knew from my time at SVB that experimentation only scales when you remove opinion from the equation and replace it with process. The question was how to do that across five brands with different traffic volumes, different customer segments, and different stakeholder groups.

Step One: Bringing Everything In-House

The first major shift was bringing experimentation fully in-house. The agency model had served its purpose, but it created a dependency that slowed everything down. Turnaround times were long. Context was lost between handoffs. And the agency team didn't live inside our data the way an internal team needs to.

I built out the internal capability piece by piece. That meant standardizing our tooling — Adobe Analytics for measurement, Optimizely for test execution, Contentsquare for behavioral analytics, and Tealium as our CDP layer. Each tool had a specific role, and I documented exactly how they connected so that anyone joining the team could get productive within their first week.

This wasn't glamorous work. It was writing SOPs, building test request templates, creating pre-test analysis frameworks, and establishing review cadences. But it was the foundation everything else was built on.

Step Two: Tying Every Test to Revenue

Here's the thing most CRO content on the internet won't tell you: at real companies, you don't have millions of monthly users. You don't get to run 50 concurrent tests with clean traffic allocation. You face sample size constraints that make many "best practice" recommendations from CRO influencers completely irrelevant.

When you're working with the traffic volumes of a regional energy brand — not Google, not Amazon — every test slot is precious. You can't waste one on a test that has no chance of reaching significance, or worse, one that reaches significance but can't be tied back to dollars.

So I built a pre-test framework centered on Minimum Detectable Effect (MDE) projections tied to revenue per customer. Before any test gets greenlit, we calculate:

  1. Current conversion rate for the page or flow being tested
  2. Traffic volume over the planned test duration
  3. Revenue per converted customer for that specific brand and product
  4. MDE threshold — the smallest lift we'd need to detect to justify the test
  5. Projected annual revenue impact if the test wins at the MDE level

This changed the conversation entirely. Instead of "let's test this because the VP wants to," it became "this test has a projected annual impact of $200K if it achieves a 5% lift, and we have enough traffic to detect a 3% lift at 95% confidence."
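
To make that concrete, here is a minimal Python sketch of the pre-test calculation, assuming a standard two-proportion z-test. Every number, threshold, and function name in it is hypothetical for this article, not an actual NRG figure or internal tool.

# A minimal sketch of the pre-test math, assuming a two-proportion z-test.
# All inputs below are hypothetical, not actual NRG figures.
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def sample_size_per_arm(p_base, rel_lift, alpha=0.05, power=0.80):
    """Visitors needed per arm to detect a relative lift over p_base."""
    p_var = p_base * (1 + rel_lift)
    p_bar = (p_base + p_var) / 2
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_power = Z.inv_cdf(power)
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_power * (p_base * (1 - p_base) + p_var * (1 - p_var)) ** 0.5) ** 2
    return numerator / (p_var - p_base) ** 2

def minimum_detectable_lift(p_base, visitors_per_arm, step=0.005):
    """Smallest relative lift the planned traffic can detect (the MDE)."""
    lift = step
    while sample_size_per_arm(p_base, lift) > visitors_per_arm:
        lift += step
    return lift

def projected_annual_impact(annual_visitors, p_base, rel_lift, revenue_per_customer):
    """Projected annual revenue if the variant wins at the given lift."""
    return annual_visitors * p_base * rel_lift * revenue_per_customer

# Hypothetical inputs for one brand and one planned test window.
p_base = 0.03                   # current conversion rate on the flow
visitors_per_arm = 150_000      # traffic per arm over the test duration
annual_visitors = 1_000_000     # annual traffic to the flow
revenue_per_customer = 140      # revenue per converted customer, in dollars

mde = minimum_detectable_lift(p_base, visitors_per_arm)
print(f"MDE: {mde:.1%} relative lift")
print(f"Impact if we win at the MDE: "
      f"${projected_annual_impact(annual_visitors, p_base, mde, revenue_per_customer):,.0f}/yr")

The exact formula matters less than the discipline: every request has to clear the same traffic-and-revenue bar before it takes up a test slot.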

Finance understood that language. And when finance understands your language, budget follows.

Step Three: Proving Value to the C-Suite

Growing from 20 tests to 100+ per year didn't happen because I asked nicely. It happened because I proved ROI in terms the CFO's team could validate.

Here are four wins from 2025 that made the case:

Enrollment Flow Optimization: We redesigned the enrollment flow based on Contentsquare session replay analysis that showed users dropping off at the plan comparison step. The test delivered a 12% conversion lift, translating to approximately $299K in projected annual revenue.

Homepage Phone Number Placement: This one surprised everyone. By testing the placement and prominence of the phone number on the homepage, we saw a 300% lift in call-driven sales. Projected annual impact: $523K. The change itself was simple. The insight that drove it — that the brand's customer demographic skewed older and preferred phone enrollment — came from cross-referencing call center data with web analytics.

HBE Copy Redesign: A copy-focused test on the home battery experience page delivered a 60% lift on mobile devices. Annual projected impact: $177K. This test was born from a dead-click analysis in Contentsquare — users were tapping on non-interactive elements, signaling confusion about what was clickable.

Hero Layout Test: A hero section layout change on the homepage produced a 7% lift in enrollment starts, worth approximately $212K annually. The hypothesis came from heat map data showing that users weren't scrolling past the hero on mobile — the CTA wasn't visible without scrolling on most phone screens.

These four tests alone projected over $1.2M in annual impact. That's the kind of number that gets you more headcount, better tools, and a seat at the planning table.

The System That Makes It Repeatable

The hardest part of scaling experimentation isn't running more tests. It's making the program run without you being the bottleneck.

I built standardized processes so that anyone joining the team can follow the PRISM framework from hypothesis to post-test analysis without needing to reinvent the wheel. That includes:

  • Test request intake forms that force requestors to state the problem, not the solution
  • Pre-test analysis templates with MDE calculations built in
  • QA checklists for every test deployment across Optimizely
  • Post-test reporting templates that include statistical results, projected annual lift, and recommendations for holdout testing
  • A centralized test repository so stakeholders across all five brands can see what's been tested, what won, and what lost

The repository is especially important in a multi-brand environment. A test that wins on one brand might inform a hypothesis for another. A pattern that fails on one brand might save others from wasting a test slot.
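
For illustration, a repository record doesn't need to be complicated. Here is a hypothetical sketch of what one could look like; the field names are mine for this article, not the actual schema behind our repository.

# A hypothetical sketch of a test-repository record. Field names are
# illustrative only, not the actual schema we use internally.
from dataclasses import dataclass
from datetime import date

@dataclass
class TestRecord:
    test_id: str
    brand: str                      # one of the five retail brands
    page: str                       # page or flow under test
    hypothesis: str                 # stated as a problem, not a solution
    start: date
    end: date
    outcome: str                    # "win", "loss", or "inconclusive"
    relative_lift: float            # observed relative lift, e.g. 0.12 for 12%
    projected_annual_impact: float  # dollars, best estimate

def prior_learnings(repo: list[TestRecord], page: str) -> list[TestRecord]:
    """Surface what other brands have already tested on a comparable page."""
    return [record for record in repo if record.page == page]

Even a flat structure like this answers the two questions stakeholders actually ask: has anyone tested this before, and what happened when they did.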

What I've Learned About Experimentation at Scale

Most CRO advice is written for companies with millions of users. If you're at a real enterprise with five brands and varying traffic levels, you need a completely different playbook. Sample size constraints are your biggest strategic challenge, not your testing tool's feature set.

Experimentation isn't about running tests. It's about making better decisions. Every test is a decision-making tool. The output isn't a green or red result — it's a dollar value attached to a specific change, with a confidence interval that tells you how much to trust it.
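
Here is a rough sketch, with made-up numbers, of what that looks like in practice: a raw test readout translated into a dollar range rather than a pass/fail verdict.

# A minimal sketch, with hypothetical numbers, of turning a test readout into
# a projected dollar range. Uses a normal-approximation confidence interval.
from statistics import NormalDist

def lift_dollar_range(conv_c, n_c, conv_v, n_v,
                      annual_visitors, revenue_per_customer, conf=0.95):
    """Confidence interval on the conversion-rate difference, translated
    into a projected annual revenue range."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    se = (p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v) ** 0.5
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    low_diff = (p_v - p_c) - z * se
    high_diff = (p_v - p_c) + z * se
    scale = annual_visitors * revenue_per_customer
    return low_diff * scale, high_diff * scale

# Hypothetical readout: 1,200 of 40,000 converted in control, 1,350 of 40,000 in the variant.
low, high = lift_dollar_range(1200, 40_000, 1350, 40_000,
                              annual_visitors=1_000_000, revenue_per_customer=140)
print(f"Projected annual impact: ${low:,.0f} to ${high:,.0f}")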

The HiPPO problem never fully goes away. But when you have a framework that translates every hypothesis into projected revenue impact, you give stakeholders a common language. They can still advocate for their ideas, but now those ideas compete on merit, not org-chart position.

Your win rate matters, but context matters more. Our 24%+ win rate in 2025 is well above the industry average of around 12%. But the reason it's high isn't because we're smarter — it's because we invest heavily in the research phase before any test launches. We use heat maps, session replays, click-rate analysis, rage-click detection, and qualitative research to build hypotheses. A well-researched hypothesis has a much higher chance of winning than a gut-feel hypothesis.

These are best estimates based on available data. I want to be honest: projected revenue numbers are estimates. They're calculated using the best data we have — conversion lifts, traffic volumes, revenue per customer — but they're not audited financial statements. More precision is possible, but it comes at the cost of flexibility and speed. In a fast-moving experimentation program, I'd rather be directionally right and fast than precisely right and slow.

What's Next

The program continues to grow. We're expanding into personalization testing, exploring AI-assisted hypothesis generation, and building more sophisticated holdout methodologies to validate long-term impact beyond the initial test window.

If you're building an experimentation program at a real company — not a Silicon Valley unicorn with unlimited traffic — I'd love to hear how you're handling sample size constraints. Reach out at [email protected].

You can see more about my work at NRG on my NRG case study page, or read about the PRISM framework I developed for running revenue-driven experiments.

Atticus Li

Leads applied experimentation at NRG Energy. $30M+ in verified revenue impact through behavioral economics and CRO.