Atticus Li leads Applied Experimentation at NRG Energy (Fortune 150), where he scaled the program from 20 to 100+ experiments per year and generated $30M in verified revenue impact in 2025. He writes about the operational mechanics of experimentation at enterprise scale.
When I took over the experimentation program at NRG, we were running about 20 tests a year. Good tests, mostly. Solid methodology. Reasonable win rate. But for a Fortune 150 company with millions of monthly visitors and dozens of digital products, 20 tests a year is dramatic under-testing.
The mandate was clear: scale. More tests, more coverage, more revenue impact. What wasn't clear — and what nobody warned me about — was that scaling an experimentation program doesn't break in the ways you'd expect. It doesn't break all at once, either. It breaks in a specific sequence, and if you don't address each failure point in order, the whole thing collapses.
After taking NRG from 20 to over 100 tests per year, here's exactly what breaks, in the order it breaks, and how to fix each one.
The First Thing That Breaks: Idea Quality
When you're running 5 tests a quarter, the ideas come from a small group of people who understand experimentation. The hypotheses are thoughtful. The expected impact is real. The tests are worth running.
Then you announce that the experimentation program is scaling and accepting ideas from across the organization. Suddenly you have 40 test requests in the backlog. Product managers, brand managers, UX designers, executives — everyone has an idea they want tested. And the HiPPO problem comes roaring back.
HiPPO — the Highest Paid Person's Opinion — is the single biggest threat to test quality at scale. The SVP of marketing wants to test a new tagline because they workshopped it at a leadership offsite. The VP of product wants to test a feature redesign because a competitor launched something similar. These aren't bad people making irrational requests. They genuinely believe their ideas will win. But without a systematic way to evaluate and prioritize, the loudest voice or the most senior title determines what gets tested.
The result: you burn your testing capacity on low-probability, politically motivated tests. Your win rate drops. Your revenue impact per test drops. And ironically, the program starts looking less effective right when it should be demonstrating more value.
The fix: ICE or RICE scoring on every single idea. No exceptions. No bypass for seniority.
ICE scores every idea on Impact (how much will this move the metric if it wins), Confidence (how much evidence do we have that it will win), and Ease (how quickly can we build and launch it). RICE adds Reach — how many users will the test affect. Both frameworks work. Pick one and apply it consistently.
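If it helps to see how mechanical this can be, here is a minimal sketch of a scorer in Python. The field names, the 1-to-10 scale, and the multiplicative scoring convention are illustrative choices, not NRG's actual rubric.

```python
from dataclasses import dataclass

@dataclass
class Idea:
    name: str
    impact: float      # 1-10: how far the metric moves if the test wins
    confidence: float  # 1-10: strength of the supporting evidence
    ease: float        # 1-10: how quickly it can be built and launched
    reach: float = 1.0 # RICE only: relative share of users the test touches

def ice(idea: Idea) -> float:
    # One common convention is to multiply the three scores; some teams average them instead.
    return idea.impact * idea.confidence * idea.ease

def rice(idea: Idea) -> float:
    # Classic RICE divides by Effort; here Ease stands in for 1/Effort to keep the inputs consistent.
    return idea.reach * idea.impact * idea.confidence * idea.ease

def prioritize(backlog: list[Idea], scorer=ice) -> list[Idea]:
    """Sort the backlog highest score first; the top of the list gets resourced."""
    return sorted(backlog, key=scorer, reverse=True)

backlog = [
    Idea("SVP tagline refresh", impact=6, confidence=4, ease=5),
    Idea("Form field reduction (backed by funnel data)", impact=7, confidence=7, ease=8),
]
for idea in prioritize(backlog):
    print(f"{idea.name}: ICE {ice(idea):.0f}")
```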
What makes scoring work is that it depersonalizes the conversation. You're not telling the SVP their idea is bad. You're showing them that their idea scores a 4 on Confidence because there's no supporting data, while the idea from the junior analyst scores a 7 because it's backed by three months of funnel analytics. The framework makes the decision, not you. That political cover is essential.
At NRG, every idea that enters the backlog gets scored. We review the prioritized list in a monthly planning session. The top-scoring ideas get resourced. Everything else goes back in the queue. This single change more than doubled our revenue-per-test within two quarters.
The Second Thing That Breaks: Hypothesis Integrity
At small scale, one person often owns a test from hypothesis through analysis. They know what the test is supposed to measure because they designed the whole thing. As you scale, handoffs multiply — and with every handoff, the hypothesis drifts.
Here's how it plays out. The CRO manager writes a hypothesis: "If we reduce the form from 7 fields to 4, we will increase form completion rate by 15% because users abandon long forms." Clear, testable, measurable.
The hypothesis goes to the designer. The designer interprets "reduce the form" as an opportunity to redesign the entire form layout. They add progressive disclosure, change the button copy, move the form above the fold, and reduce the fields. Now you're testing five changes simultaneously, not one.
The design goes to the developer. The developer can't implement the progressive disclosure pattern within the sprint, so they simplify it. The final variant has fewer fields and a new layout, but not the progressive disclosure the designer intended.
The test launches. It wins. But what won? The field reduction? The layout change? The button copy? The placement? You have no idea, which means you can't apply the learning to any other form on the site. The hypothesis drifted from a single-variable test to a multi-variable mess across three handoffs.
The fix: The hypothesis brief becomes a contract.
Before any design work begins, the hypothesis brief must be approved and locked. The brief specifies exactly what changes, what stays the same, and what the primary metric is. At every handoff — hypothesis to design, design to development, development to QA — the receiving party reviews the brief and confirms that what they're building matches the original intent.
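Here is a minimal sketch of what "locked" can mean in practice, assuming a simple in-house script rather than any particular tool; the field names and the hash-based check are illustrative.

```python
from dataclasses import dataclass
import hashlib, json

@dataclass(frozen=True)
class HypothesisBrief:
    test_name: str
    change: str          # the single variable being changed
    held_constant: str   # what must NOT change across handoffs
    primary_metric: str
    expected_lift: str

    def fingerprint(self) -> str:
        """Hash of the approved brief; recompute at each handoff to confirm nothing drifted."""
        payload = json.dumps(self.__dict__, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

brief = HypothesisBrief(
    test_name="Lead form field reduction",
    change="Reduce the form from 7 fields to 4",
    held_constant="Layout, button copy, form placement",
    primary_metric="Form completion rate",
    expected_lift="+15%",
)
approved = brief.fingerprint()

# At the design, development, and QA handoffs, the receiving team re-confirms against the approval.
assert brief.fingerprint() == approved, "Hypothesis drift: brief no longer matches what was approved"
```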
The review itself sounds bureaucratic. In practice it takes maybe 15 minutes per handoff, and it has saved us from countless wasted tests where the final variant bore little resemblance to the original hypothesis. When you're running 100+ tests a year, preventing hypothesis drift in even 10% of your tests saves you 10 tests' worth of capacity. That's an entire quarter of testing for a smaller program.
The Third Thing That Breaks: Team Capacity
This one is pure math, and teams consistently underestimate it.
A single CRO manager can probably own 4 to 6 tests at various stages simultaneously — some in hypothesis, some in design review, some running, some in analysis. A single analyst can support analysis for maybe 8 to 10 concurrent tests. A single developer can build 2 to 3 test variants per sprint alongside their regular roadmap work. A single designer can produce variants for about 3 tests per sprint.
When you're running 5 tests a quarter, one person in each role can handle it. When you try to run 25 tests a quarter, the math breaks immediately. You don't need fractionally more people. You need multiples.
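A back-of-the-envelope version of that math, using the per-role figures above; the sprint cadence and test cycle time are assumptions I've added for the sake of the example.

```python
import math

TARGET_TESTS_PER_QUARTER = 25
SPRINTS_PER_QUARTER = 6          # assumption: two-week sprints
TEST_CYCLES_PER_QUARTER = 2      # assumption: a test takes roughly six weeks end to end

# Per-person throughput in tests per quarter, derived from the figures above.
throughput = {
    "CRO manager": 5 * TEST_CYCLES_PER_QUARTER,    # owns 4-6 tests concurrently
    "analyst": 9 * TEST_CYCLES_PER_QUARTER,        # supports 8-10 concurrent analyses
    "developer": 2.5 * SPRINTS_PER_QUARTER,        # builds 2-3 variants per sprint
    "designer": 3 * SPRINTS_PER_QUARTER,           # produces variants for ~3 tests per sprint
}

for role, per_person in throughput.items():
    needed = math.ceil(TARGET_TESTS_PER_QUARTER / per_person)
    print(f"{role}: ~{per_person:.0f} tests/quarter per person, need {needed} to hit {TARGET_TESTS_PER_QUARTER}")
```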
The capacity ceiling usually manifests as one specific bottleneck, and it's different for every organization. At NRG, our first bottleneck was development capacity. We had hypotheses and designs ready to go, but developers were allocated to product roadmap work and could only build 2 test variants per sprint. The entire pipeline backed up behind them.
The fix: Map capacity by role, then make the case with revenue data.
First, audit where the bottleneck actually is. Don't assume — measure. Track how long each test spends in each stage: hypothesis, design, development, QA, running, analysis. The stage with the longest queue is your bottleneck.
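A minimal sketch of that measurement, assuming you log the date a test enters each stage; the data layout and the sample numbers are illustrative.

```python
from collections import defaultdict
from datetime import date

STAGES = ["hypothesis", "design", "development", "qa", "running", "analysis"]

# One record per test: the date it entered each stage. Data is illustrative.
tests = {
    "form-field-reduction": {
        "hypothesis": date(2025, 1, 6), "design": date(2025, 1, 13),
        "development": date(2025, 1, 20), "qa": date(2025, 3, 3),
        "running": date(2025, 3, 10), "analysis": date(2025, 4, 7),
    },
    # ...every other test in the pipeline
}

dwell = defaultdict(list)
for timestamps in tests.values():
    ordered = [s for s in STAGES if s in timestamps]
    for current, nxt in zip(ordered, ordered[1:]):
        dwell[current].append((timestamps[nxt] - timestamps[current]).days)

for stage in STAGES:
    if dwell[stage]:
        avg = sum(dwell[stage]) / len(dwell[stage])
        print(f"{stage:<12} avg {avg:.0f} days before moving on")
# The stage with the longest average dwell time is the queue to attack first.
```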
Then make the business case for headcount using revenue data. This is where having rigorous measurement pays off. When I went to leadership to ask for additional development capacity, I didn't say "we need more developers for testing." I said: "In 2025, the experimentation program generated $30M in verified revenue impact with a team of four. Our constraint is development velocity — we have a backlog of 15 scored and designed tests waiting for dev resources. Adding two dedicated experimentation developers would increase our testing throughput by approximately 60%, which projects to an additional $12M to $18M in annual revenue impact."
That's a conversation executives can engage with. Revenue per headcount is a language every CFO speaks. If your program can demonstrate clear revenue attribution, the headcount case writes itself. If it can't, that's a measurement problem you need to solve before you try to scale.
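For what it's worth, the arithmetic behind that kind of projection is simple enough to sanity-check in a few lines; the haircut on the low end is an assumption standing in for diminishing returns, not a model NRG uses.

```python
# Illustrative projection using the figures quoted above.
verified_impact_2025 = 30_000_000   # verified revenue impact, dollars
throughput_gain = 0.60              # expected lift from two dedicated experimentation developers

upper = verified_impact_2025 * throughput_gain            # linear scaling: $18M
lower = verified_impact_2025 * throughput_gain * (2 / 3)  # with a one-third haircut: $12M

print(f"Projected incremental impact: ${lower/1e6:.0f}M to ${upper/1e6:.0f}M per year")
```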
The Fourth Thing That Breaks: Tooling
When you're running a handful of tests, you can manage them in a spreadsheet. Hypothesis in column B, status in column C, results in column D. Someone maintains a PowerPoint deck for the monthly readout. It's not elegant, but it works.
At 20+ tests, the spreadsheet becomes a liability. Version control is nonexistent — someone overwrites someone else's formula. The PowerPoint takes half a day to update because you're manually pulling screenshots and formatting results tables. Historical test data lives across three different spreadsheets because the original one got too unwieldy and someone started a new one for Q3.
The problems compound. You can't search past tests effectively. New team members have no way to learn what's already been tested. Stakeholders ask "didn't we test something like this before?" and nobody can answer confidently because the institutional memory lives in scattered documents and people's heads.
The fix: A proper test repository with structured intake.
You need a centralized system that handles the full test lifecycle: intake, prioritization, hypothesis documentation, status tracking, results storage, and reporting. It doesn't have to be fancy. It does have to be searchable, consistent, and the single source of truth.
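As one possible shape for that, here is a minimal sketch of a repository schema, assuming a plain SQLite backing store; the table and column names are illustrative, not how GrowthLayer or NRG actually model it.

```python
import sqlite3

# One row per test, from intake through results, so the whole lifecycle is queryable.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tests (
    id             INTEGER PRIMARY KEY,
    name           TEXT NOT NULL,
    hypothesis     TEXT NOT NULL,
    primary_metric TEXT NOT NULL,
    ice_score      REAL,
    status         TEXT CHECK (status IN
        ('backlog','prioritized','design','development','qa','running','analysis','complete')),
    result         TEXT,          -- 'win', 'loss', 'flat', or NULL while still running
    learnings      TEXT,
    launched_on    DATE,
    completed_on   DATE
);
"""

conn = sqlite3.connect("experiments.db")
conn.executescript(SCHEMA)

# "Didn't we test something like this before?" becomes a query instead of a memory exercise.
previous = conn.execute(
    "SELECT name, result, learnings FROM tests WHERE hypothesis LIKE ?", ("%form fields%",)
).fetchall()
```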
At NRG, we went through three iterations before getting this right. The first was a Notion database — workable but clunky for reporting. The second was a custom Airtable setup — better, but the reporting still required manual export. What we really needed was something purpose-built for experimentation workflow: structured intake forms so every request comes in the same format, automated prioritization scoring, hypothesis briefs that lock before development, and results that tie back to the original brief automatically.
That need is exactly why I built GrowthLayer. It handles the intake, prioritization, and test repository workflow so teams can focus on running good tests instead of managing spreadsheets. But whether you use GrowthLayer, Airtable, Notion, or a custom internal tool, the principle is the same: your tooling needs to scale with your program, and Excel is not that tool.
The Pattern: Standardize Before You Scale
Every one of these failure points has a common thread. The fix isn't more effort or more people (though you will need more people eventually). The fix is standardization.
Standardize how ideas enter the pipeline. Standardize how hypotheses are documented. Standardize the handoff process between roles. Standardize how results are stored and reported. Then scale.
When I was transitioning NRG from an agency-managed model to a fully in-house program, the first six months were entirely about standardization. We wrote SOPs for every stage of the testing lifecycle. We created templates for hypothesis briefs, design specs, QA checklists, and result readouts. We documented the scoring framework and the decision rules for prioritization.
It felt slow. We weren't running more tests. Stakeholders were impatient. But when we started scaling, those standards meant that new hires could be productive within their first month. The agency team members who transitioned in-house had clear processes to follow. The quality of tests remained consistent even as volume tripled.
The organizations that try to scale before standardizing inevitably hit all four failure points simultaneously. They increase volume, quality drops, hypotheses drift, the team gets overwhelmed, and the tooling can't keep up. Then they pull back, declare that "scaling didn't work," and return to running 5 tests a quarter. The problem was never the scaling. The problem was the sequence.
The Practical Scaling Roadmap
If I were starting from scratch — taking a program from 5 tests per quarter to 25+ — here's the sequence I'd follow.
Quarter one: Standardize. Write the SOPs. Build the hypothesis brief template. Implement ICE/RICE scoring. Set up whatever test repository you're going to use, even if it's a well-structured spreadsheet to start. Keep running tests at your current pace, but run them through the new process.
Quarter two: Remove the first bottleneck. Map your capacity constraints, identify the tightest one, and either add resources or find efficiency gains. Usually this means getting dedicated dev capacity or formalizing the design handoff. Increase test volume by 50%, not 200%.
Quarter three: Stress-test the process. At 50% more volume, you'll find the cracks in your standards. Some SOPs won't survive contact with reality. Some templates will need revision. Fix them while the volume increase is still manageable.
Quarter four: Scale for real. With proven processes, identified capacity, and a working repository, you can start pushing toward your target volume. This is also when you make the headcount case, because you now have two quarters of data showing what the program delivers per test.
This entire sequence assumes you have the measurement infrastructure to demonstrate revenue impact. If you don't, step zero is building that, because everything else — headcount, tooling, executive buy-in — depends on being able to answer one question: what is this program worth?
The answer to that question, backed by rigorous data, is how you get from 5 experiments to 20, from 20 to 50, and from 50 to 100+. It's how you build a program that scales without breaking.