Atticus Li is the experimentation lead at NRG Energy, where he scaled the company's testing program from roughly 20 experiments per year to over 100 per year in 2025, with 150+ total experiments completed across five retail energy brands. This article details exactly how that happened — the frameworks, the politics, the failures, and the specific wins that made C-suite finance pay attention.

The State of Things When I Arrived

When I joined NRG, the experimentation program existed, but it was small: around 20 tests a year, mostly driven by agency partners. The tests weren't bad, but they suffered from a problem I've seen at every enterprise I've worked with: the HiPPO problem.

HiPPO stands for Highest-Paid Person's Opinion. And at a company managing NRG's five retail brands, there were a lot of HiPPOs with a lot of opinions.

The typical test request looked like this: a VP sees a competitor's homepage and says "let's test that layout." No data backing the hypothesis. No understanding of whether the current design was actually underperforming. No pre-test analysis to determine if the test could even reach statistical significance with available traffic.

I knew from my time at SVB that experimentation only scales when you remove opinion from the equation and replace it with process. The question was how to do that across five brands with different traffic volumes, different customer segments, and different stakeholder groups.

Step One: Bringing Everything In-House

The first major shift was bringing experimentation fully in-house. The agency model had served its purpose, but it created a dependency that slowed everything down. Turnaround times were long. Context was lost between handoffs. And the agency team didn't live inside our data the way an internal team needs to.

I built out the internal capability piece by piece. That meant standardizing our tooling — Adobe Analytics for measurement, Optimizely for test execution, Contentsquare for behavioral analytics, and Tealium as our CDP layer. Each tool had a specific role, and I documented exactly how they connected so that anyone joining the team could get productive within their first week.

This wasn't glamorous work. It was writing SOPs, building test request templates, creating pre-test analysis frameworks, and establishing review cadences. But it was the foundation everything else was built on.

Step Two: Tying Every Test to Revenue

Here's the thing most CRO content on the internet won't tell you: at real companies, you don't have millions of monthly users. You don't get to run 50 concurrent tests with clean traffic allocation. You face sample size constraints that make many "best practice" recommendations from CRO influencers completely irrelevant.

When you're working with the traffic volumes of a regional energy brand — not Google, not Amazon — every test slot is precious. You can't waste one on a test that has no chance of reaching significance, or worse, one that reaches significance but can't be tied back to dollars.

So I built a pre-test framework centered on Minimum Detectable Effect (MDE) projections tied to revenue per customer. Before any test gets greenlit, we calculate:

  1. Current conversion rate for the page or flow being tested
  2. Traffic volume over the planned test duration
  3. Revenue per converted customer for that specific brand and product
  4. MDE threshold — the smallest lift we'd need to detect to justify the test
  5. Projected annual revenue impact if the test wins at the MDE level

This changed the conversation entirely. Instead of "let's test this because the VP wants to," it became "this test has a projected annual impact of $200K if it achieves a 5% lift, and we have enough traffic to detect a 3% lift at 95% confidence."
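
To make that concrete, here is a minimal Python sketch of the pre-test calculation, assuming a standard two-proportion z-test. Every number, threshold, and function name in it is hypothetical for this article, not an actual NRG figure or internal tool.

# A minimal sketch of the pre-test math, assuming a two-proportion z-test.
# All inputs below are hypothetical, not actual NRG figures.
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def sample_size_per_arm(p_base, rel_lift, alpha=0.05, power=0.80):
    """Visitors needed per arm to detect a relative lift over p_base."""
    p_var = p_base * (1 + rel_lift)
    p_bar = (p_base + p_var) / 2
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_power = Z.inv_cdf(power)
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_power * (p_base * (1 - p_base) + p_var * (1 - p_var)) ** 0.5) ** 2
    return numerator / (p_var - p_base) ** 2

def minimum_detectable_lift(p_base, visitors_per_arm, step=0.005):
    """Smallest relative lift the planned traffic can detect (the MDE)."""
    lift = step
    while sample_size_per_arm(p_base, lift) > visitors_per_arm:
        lift += step
    return lift

def projected_annual_impact(annual_visitors, p_base, rel_lift, revenue_per_customer):
    """Projected annual revenue if the variant wins at the given lift."""
    return annual_visitors * p_base * rel_lift * revenue_per_customer

# Hypothetical inputs for one brand and one planned test window.
p_base = 0.03                   # current conversion rate on the flow
visitors_per_arm = 150_000      # traffic per arm over the test duration
annual_visitors = 1_000_000     # annual traffic to the flow
revenue_per_customer = 140      # revenue per converted customer, in dollars

mde = minimum_detectable_lift(p_base, visitors_per_arm)
print(f"MDE: {mde:.1%} relative lift")
print(f"Impact if we win at the MDE: "
      f"${projected_annual_impact(annual_visitors, p_base, mde, revenue_per_customer):,.0f}/yr")

The exact formula matters less than the discipline: every request has to clear the same traffic-and-revenue bar before it takes up a test slot.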

Finance understood that language. And when finance understands your language, budget follows.

Step Three: Proving Value to the C-Suite

Growing from 20 tests to 100+ per year didn't happen because I asked nicely. It happened because I proved ROI in terms the CFO's team could validate.

Here are four wins from 2025 that made the case:

Enrollment Flow Optimization: We redesigned the enrollment flow based on Contentsquare session replay analysis that showed users dropping off at the plan comparison step. The test delivered a 12% conversion lift, translating to approximately $299K in projected annual revenue.

Homepage Phone Number Placement: This one surprised everyone. By testing the placement and prominence of the phone number on the homepage, we saw a 300% lift in call-driven sales. Projected annual impact: $523K. The change itself was simple. The insight that drove it — that the brand's customer demographic skewed older and preferred phone enrollment — came from cross-referencing call center data with web analytics.

HBE Copy Redesign: A copy-focused test on the home battery experience page delivered a 60% lift on mobile devices. Annual projected impact: $177K. This test was born from a dead-click analysis in Contentsquare — users were tapping on non-interactive elements, signaling confusion about what was clickable.

Hero Layout Test: A hero section layout change on the homepage produced a 7% lift in enrollment starts, worth approximately $212K annually. The hypothesis came from heat map data showing that users weren't scrolling past the hero on mobile — the CTA wasn't visible without scrolling on most phone screens.

These four tests alone projected over $1.2M in annual impact. That's the kind of number that gets you more headcount, better tools, and a seat at the planning table.

The System That Makes It Repeatable

The hardest part of scaling experimentation isn't running more tests. It's making the program run without you being the bottleneck.

I built standardized processes so that anyone joining the team can follow the PRISM framework from hypothesis to post-test analysis without needing to reinvent the wheel. That includes:

  • Test request intake forms that force requestors to state the problem, not the solution
  • Pre-test analysis templates with MDE calculations built in
  • QA checklists for every test deployment across Optimizely
  • Post-test reporting templates that include statistical results, projected annual lift, and recommendations for holdout testing
  • A centralized test repository so stakeholders across all five brands can see what's been tested, what won, and what lost

The repository is especially important in a multi-brand environment. A test that wins on one brand might inform a hypothesis for another. A pattern that fails on one brand might save others from wasting a test slot.
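
For illustration, a repository record doesn't need to be complicated. Here is a hypothetical sketch of what one could look like; the field names are mine for this article, not the actual schema behind our repository.

# A hypothetical sketch of a test-repository record. Field names are
# illustrative only, not the actual schema we use internally.
from dataclasses import dataclass
from datetime import date

@dataclass
class TestRecord:
    test_id: str
    brand: str                      # one of the five retail brands
    page: str                       # page or flow under test
    hypothesis: str                 # stated as a problem, not a solution
    start: date
    end: date
    outcome: str                    # "win", "loss", or "inconclusive"
    relative_lift: float            # observed relative lift, e.g. 0.12 for 12%
    projected_annual_impact: float  # dollars, best estimate

def prior_learnings(repo: list[TestRecord], page: str) -> list[TestRecord]:
    """Surface what other brands have already tested on a comparable page."""
    return [record for record in repo if record.page == page]

Even a flat structure like this answers the two questions stakeholders actually ask: has anyone tested this before, and what happened when they did.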

What I've Learned About Experimentation at Scale

Most CRO advice is written for companies with millions of users. If you're at a real enterprise with five brands and varying traffic levels, you need a completely different playbook. Sample size constraints are your biggest strategic challenge, not your testing tool's feature set.

Experimentation isn't about running tests. It's about making better decisions. Every test is a decision-making tool. The output isn't a green or red result — it's a dollar value attached to a specific change, with a confidence interval that tells you how much to trust it.
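
Here is a rough sketch, with made-up numbers, of what that looks like in practice: a raw test readout translated into a dollar range rather than a pass/fail verdict.

# A minimal sketch, with hypothetical numbers, of turning a test readout into
# a projected dollar range. Uses a normal-approximation confidence interval.
from statistics import NormalDist

def lift_dollar_range(conv_c, n_c, conv_v, n_v,
                      annual_visitors, revenue_per_customer, conf=0.95):
    """Confidence interval on the conversion-rate difference, translated
    into a projected annual revenue range."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    se = (p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v) ** 0.5
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    low_diff = (p_v - p_c) - z * se
    high_diff = (p_v - p_c) + z * se
    scale = annual_visitors * revenue_per_customer
    return low_diff * scale, high_diff * scale

# Hypothetical readout: 1,200 of 40,000 converted in control, 1,350 of 40,000 in the variant.
low, high = lift_dollar_range(1200, 40_000, 1350, 40_000,
                              annual_visitors=1_000_000, revenue_per_customer=140)
print(f"Projected annual impact: ${low:,.0f} to ${high:,.0f}")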

The HiPPO problem never fully goes away. But when you have a framework that translates every hypothesis into projected revenue impact, you give stakeholders a common language. They can still advocate for their ideas, but now those ideas compete on merit, not org-chart position.

Your win rate matters, but context matters more. Our 24%+ win rate in 2025 is well above the industry average of around 12%. But the reason it's high isn't because we're smarter — it's because we invest heavily in the research phase before any test launches. We use heat maps, session replays, click-rate analysis, rage-click detection, and qualitative research to build hypotheses. A well-researched hypothesis has a much higher chance of winning than a gut-feel hypothesis.

These are best estimates based on available data. I want to be honest: projected revenue numbers are estimates. They're calculated using the best data we have — conversion lifts, traffic volumes, revenue per customer — but they're not audited financial statements. More precision is possible, but it comes at the cost of flexibility and speed. In a fast-moving experimentation program, I'd rather be directionally right and fast than precisely right and slow.

What's Next

The program continues to grow. We're expanding into personalization testing, exploring AI-assisted hypothesis generation, and building more sophisticated holdout methodologies to validate long-term impact beyond the initial test window.

If you're building an experimentation program at a real company — not a Silicon Valley unicorn with unlimited traffic — I'd love to hear how you're handling sample size constraints. Reach out at [email protected].

You can see more about my work at NRG on my NRG case study page, or read about the PRISM framework I developed for running revenue-driven experiments.

Atticus Li

Leads applied experimentation at NRG Energy. $30M+ in verified revenue impact through behavioral economics and CRO.