Atticus Li built the governance framework for running 100+ A/B tests per year across NRG Energy's five retail brands serving 7M+ customers in 24 states. This framework standardized how tests are designed, prioritized, executed, and evaluated across brands with fundamentally different traffic volumes, stakeholders, and priorities.
The Challenge Nobody Warns You About
Every piece of CRO content on the internet assumes you're running tests on one website with one team and one set of stakeholders. My reality at NRG was five brands, each with its own marketing team, its own priorities, its own customer base, and its own opinions about what should be tested.
Each brand serves a different market segment with varying traffic levels and customer profiles. Some target mass-market retail, others serve specific customer preferences like renewable energy, and each has its own demographic mix.
When I started building the experimentation program, the biggest challenge wasn't statistical methodology. It was organizational. How do you create a testing culture across five brands where each brand's marketing team has its own roadmap, its own VP, and its own definition of success?
The answer is governance. Not bureaucracy — governance. There's a difference, and getting it right determines whether your program scales or stalls.
Standardizing the Fundamentals
The first thing I established was a set of non-negotiable standards that apply across every brand. These aren't suggestions. They're requirements.
Statistical confidence level: 95%. Every test must reach 95% confidence before we call a winner. Not 90% because it's more convenient. Not 80% because the stakeholder is impatient. 95%. Full stop.
Statistical power: 80%. Before launching a test, we verify that the planned test duration and traffic volume give us at least 80% power to detect the minimum detectable effect. If the math doesn't work, we don't run the test — we find a higher-traffic page, extend the duration, or test a bigger change.
Minimum Detectable Effect (MDE): calculated per test, justified in the brief. There's no universal MDE threshold because traffic volumes differ dramatically across brands. What's detectable on a high-traffic brand's enrollment page in two weeks might require three months on a smaller brand's site. The MDE is calculated for every test, documented in the test brief, and reviewed before launch.
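To make the power requirement concrete, here's a minimal sketch of the pre-launch math using a standard two-proportion power calculation. The baseline rate, MDE, and traffic numbers are illustrative placeholders, not figures from any NRG brand.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Illustrative inputs only -- swap in the page's real baseline and traffic.
baseline_rate = 0.040          # current enrollment conversion rate
mde_relative  = 0.10           # smallest lift worth detecting: +10% relative
target_rate   = baseline_rate * (1 + mde_relative)

# Cohen's h effect size for the two proportions.
effect_size = proportion_effectsize(target_rate, baseline_rate)

# Visitors needed per variant at alpha = 0.05 (95% confidence) and 80% power.
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)

daily_visitors_per_variant = 1_500   # assumed traffic share for this page
days_needed = n_per_variant / daily_visitors_per_variant
print(f"~{n_per_variant:,.0f} visitors per variant, ~{days_needed:.0f} days")
```

If the days required exceed the window the brand can commit to, the brief gets reworked before launch rather than after.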
Test duration: minimum two full business cycles. For energy retail, that typically means at least two full weeks to capture both weekday and weekend patterns. Tests that end mid-week or mid-billing-cycle produce results that don't generalize.
When to stop: never early for a positive result. This is the rule that generates the most pushback. When a test variant is "winning" after three days, the brand team wants to ship it. But early stopping inflates false positive rates dramatically. We run every test to the planned duration unless there's a clear negative impact on a guardrail metric that requires stopping for business reasons.
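How badly does peeking hurt? A quick A/A simulation (no real difference between variants, significance checked every day) makes the inflation obvious. This is an illustrative sketch, not our production monitoring code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
p = 0.04                 # same true conversion rate in both arms: an A/A test
daily_n = 1_500          # visitors per arm per day (illustrative)
days, trials = 14, 2_000

stopped_early, fixed_horizon = 0, 0
for _ in range(trials):
    a = rng.binomial(daily_n, p, size=days).cumsum()   # cumulative conversions, arm A
    b = rng.binomial(daily_n, p, size=days).cumsum()   # cumulative conversions, arm B
    n = daily_n * np.arange(1, days + 1)               # cumulative visitors per arm
    pooled = (a + b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (a - b) / n / se
    pvals = 2 * stats.norm.sf(np.abs(z))
    if (pvals < 0.05).any():          # peeking: ship at the first "significant" day
        stopped_early += 1
    if pvals[-1] < 0.05:              # fixed horizon: only look at the planned end date
        fixed_horizon += 1

print(f"False positive rate with daily peeking: {stopped_early / trials:.1%}")
print(f"False positive rate at fixed horizon:   {fixed_horizon / trials:.1%}")
```

With fourteen daily looks, the "winner after three days" rate lands far above the 5% you signed up for, which is exactly why the planned duration is the planned duration.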
These standards exist so that any person joining the team — whether they're a new analyst, a brand marketer submitting a test request, or an agency partner executing a test — can follow the process without needing to derive the methodology from first principles.
Test Collision Avoidance
When multiple teams are running tests simultaneously across overlapping user populations, you get test collisions. Two tests competing for the same traffic. One test's variant affecting another test's control group. Interaction effects that make both tests' results uninterpretable.
At the scale of 100+ tests per year across five brands, this isn't a theoretical concern — it's a weekly operational challenge.
I built a collision avoidance system with three layers:
Layer 1: The test calendar. Every test across all brands goes on a shared calendar with traffic allocation percentages. Before any test launches, I review the calendar for overlaps. If two tests target the same page or the same audience segment, they get sequenced — not run simultaneously.
Layer 2: Traffic allocation rules. No single test gets more than 50% of page traffic unless it's a critical business page with a clear risk/reward justification. The remaining traffic stays on the default experience. This limits the blast radius of any single test and preserves baseline measurement.
Layer 3: Mutual exclusion groups. For tests that absolutely must run concurrently, we use Optimizely's mutual exclusion feature to ensure each user only sees one test at a time. This eliminates interaction effects at the cost of reducing effective sample size for each test.
The calendar is the most important layer. It sounds low-tech, and it is. But having visibility into what's running where, across all five brands, prevents more collisions than any automated system I've seen.
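For teams that do want to automate part of the review, the overlap check itself is simple. The sketch below uses a hypothetical data model for calendar entries; the logic is just "do the date ranges overlap, and do the tests compete for the same page or the same segment?"

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations

@dataclass
class PlannedTest:
    name: str
    brand: str
    page: str            # page or template being tested
    segment: str         # audience segment, e.g. "all" or "new_visitors"
    start: date
    end: date
    traffic_pct: int     # share of the page's traffic allocated to the test

def collides(a: PlannedTest, b: PlannedTest) -> bool:
    """Two tests collide if their date ranges overlap and they target
    the same page or the same audience segment."""
    dates_overlap = a.start <= b.end and b.start <= a.end
    same_target = a.page == b.page or a.segment == b.segment
    return dates_overlap and same_target

# Illustrative calendar entries, not real NRG tests or dates.
calendar = [
    PlannedTest("Hero copy", "Brand A", "/enroll", "all", date(2024, 3, 4), date(2024, 3, 18), 50),
    PlannedTest("Plan cards", "Brand A", "/enroll", "all", date(2024, 3, 11), date(2024, 3, 25), 50),
    PlannedTest("Nav layout", "Brand B", "/home", "new_visitors", date(2024, 3, 4), date(2024, 3, 18), 40),
]

for a, b in combinations(calendar, 2):
    if collides(a, b):
        print(f"COLLISION: '{a.name}' and '{b.name}' overlap -- sequence them.")
```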
Stakeholder Management Across Brands
Here's where governance gets political.
Each brand has a marketing team that believes their priorities should come first. Each brand has a VP who saw a competitor's landing page and wants to test a copycat version. The HiPPO problem — Highest-Paid Person's Opinion — doesn't just multiply across five brands. It compounds.
I manage this through three mechanisms:
Mechanism 1: The test brief. Every test request, from every brand, goes through the same test brief template. The brief requires: a data-backed hypothesis (not an opinion), a projected EBITDA impact (not a gut feeling), and a pre-test power analysis (not a hope).
The brief is the great equalizer. When a VP says "I want to test a new hero image," the brief forces them to articulate why — what data suggests the current image is underperforming, what lift the new image is expected to generate, and what the financial impact would be. Half the time, the process of filling out the brief reveals that the test idea doesn't have a data-backed foundation, and the requestor either strengthens the hypothesis or withdraws the request.
Mechanism 2: The prioritization framework. All approved test briefs get scored on three dimensions: projected financial impact (from the EBITDA model), test feasibility (can we actually run this test cleanly?), and strategic alignment (does this test address a known conversion bottleneck?). Tests are ranked and scheduled in priority order, not in "who-asked-first" order.
This removes my personal opinion from the equation too. I'm not deciding which brand's test is more important — the framework is. And because the inputs are transparent, anyone can challenge a prioritization decision by questioning the inputs rather than arguing about politics.
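A stripped-down version of the scoring looks like the snippet below. The weights and the 1-to-5 scales are illustrative, not the exact values we use; the point is that every brand's request is scored on the same transparent inputs.

```python
# Illustrative weights -- the real weights have been adjusted based on team feedback.
WEIGHTS = {"financial_impact": 0.5, "feasibility": 0.3, "strategic_alignment": 0.2}

def priority_score(brief: dict) -> float:
    """Weighted score across the three prioritization dimensions (each rated 1-5)."""
    return sum(WEIGHTS[dim] * brief[dim] for dim in WEIGHTS)

# Hypothetical test briefs for illustration only.
briefs = [
    {"name": "Brand A enrollment CTA", "financial_impact": 5, "feasibility": 4, "strategic_alignment": 5},
    {"name": "Brand B hero image",     "financial_impact": 2, "feasibility": 5, "strategic_alignment": 2},
    {"name": "Brand C plan compare",   "financial_impact": 4, "feasibility": 3, "strategic_alignment": 4},
]

# Schedule in priority order, not who-asked-first order.
for brief in sorted(briefs, key=priority_score, reverse=True):
    print(f"{priority_score(brief):.1f}  {brief['name']}")
```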
Mechanism 3: Cross-functional workshops. I ran 10+ cross-functional workshops with Marketing, Product, and UX teams across the brands. These workshops served two purposes: educating stakeholders on how the testing process works, and surfacing test ideas from people who are closest to the customer.
The workshops generated genuine buy-in. When a brand marketing manager understands why we require 95% confidence — and has seen a real example of a "winning" test that turned out to be a false positive at 90% confidence — they stop pushing to cut corners. When a UX designer sees how their session replay insights can be translated into a high-impact test hypothesis, they become an active contributor rather than a passive spectator.
The result: 90%+ executive and stakeholder buy-in across all brands. Not because I was persuasive, but because the process was transparent and the results were documented.
When New People Join
Process documentation matters most when the person who built it isn't in the room. I designed the governance framework to be transferable, not dependent on my institutional knowledge.
Every standard, every template, every decision framework is documented and accessible. When a new analyst joins the team, they can read the documentation and be productive within their first sprint. They don't need three months of shadowing to understand how things work.
But here's the nuance: the process isn't a religion. When new people join with experience from other programs — maybe they ran testing at a company with different traffic profiles, or they've used different statistical methodologies — I want to hear their recommendations.
The standards I set are defaults, not commandments. If someone can make a compelling case that we should use sequential testing instead of fixed-horizon testing for certain test types, or that we should adopt a Bayesian framework for low-traffic brands, I'm open to that. The test brief template has been revised four times based on team feedback. The prioritization scoring weights have been adjusted twice.
Good governance absorbs better methods without losing its structural integrity. Rigidity isn't a feature of governance — it's a failure of governance.
Guardrail Metrics: What to Watch When Scaling
Optimizing for a primary metric without watching guardrails is how experimentation programs cause damage. A test that increases enrollment confirmations by 8% but drops 30-day retention by 15% isn't a win — it's a disaster that takes months to identify.
At NRG, the guardrail metrics I monitor across every test are:
Enrollment confirmations. The primary metric for most enrollment funnel tests. But it's also a guardrail for tests on upstream pages — if we test a new homepage layout and enrollment confirmations drop, the homepage test failed regardless of what the homepage metrics show.
Deposit completion rates. For energy retail, the enrollment isn't complete until the customer completes their deposit. A test that increases enrollment starts but decreases deposit completions is moving friction, not removing it.
NPS impact. We monitor post-enrollment NPS scores segmented by test exposure. This is a lagging indicator, but it catches tests that optimize conversion at the expense of customer experience. A confusing but high-converting enrollment flow will show up in NPS before it shows up in churn — and by the time it shows up in churn, the damage is done.
Revenue per customer post-enrollment. This catches tests that attract lower-value customers. If a test increases conversion rate but the customers it converts generate less revenue, the EBITDA impact may be negative even though the conversion rate went up.
Guardrails aren't optional. They're part of the test definition. Every test brief specifies which guardrail metrics will be monitored and what threshold constitutes a stop condition.
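One way to make that concrete is to treat the guardrail section of the brief as structured data rather than prose. The schema and thresholds below are a hypothetical sketch, not our exact template.

```python
from dataclasses import dataclass

@dataclass
class Guardrail:
    metric: str
    max_relative_drop: float   # worst tolerated change vs. control before we stop
    lagging: bool = False      # lagging metrics are reviewed post-test, not mid-test

# Example guardrails attached to an enrollment-funnel test brief (thresholds illustrative).
GUARDRAILS = [
    Guardrail("enrollment_confirmations", max_relative_drop=0.02),
    Guardrail("deposit_completion_rate",  max_relative_drop=0.02),
    Guardrail("revenue_per_customer",     max_relative_drop=0.05),
    Guardrail("post_enrollment_nps",      max_relative_drop=0.05, lagging=True),
]

def breached(g: Guardrail, control_value: float, variant_value: float) -> bool:
    """Stop condition: the variant underperforms control by more than the tolerance."""
    drop = (control_value - variant_value) / control_value
    return drop > g.max_relative_drop

# Mid-test check: flag any non-lagging breach for a stop-condition review.
if breached(GUARDRAILS[1], control_value=0.81, variant_value=0.76):
    print("Deposit completion breach -- flag for stop-condition review.")
```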
What Governance Actually Looks Like Day to Day
Abstract frameworks are easy to describe and hard to sustain. Here's what governance looks like in practice:
Monday: Review the test calendar for the week. Check for any collisions or allocation issues. Review any tests that reached their planned end date over the weekend.
Tuesday: Test brief reviews for new submissions. This usually involves conversations with brand teams about hypothesis refinement, sample size calculations, and scheduling.
Wednesday: Mid-week check on running tests. Review guardrail metrics. Flag any anomalies for investigation.
Thursday: Results review for completed tests. Calculate EBITDA impact for winners (the arithmetic is sketched below). Document learnings for the test library.
Friday: Stakeholder communications. Weekly summary of results, upcoming tests, and program metrics. This is where the brand teams get visibility into what's happening across the portfolio.
This cadence isn't glamorous. It's operational rhythm. And it's what makes 100+ tests per year across five brands possible without the program collapsing under its own complexity.
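For reference, the Thursday EBITDA calculation for a winner is straightforward arithmetic once the lift is trustworthy: apply the observed lift to baseline volume and per-enrollment contribution, then annualize. The numbers below are purely illustrative, not NRG figures.

```python
# Purely illustrative inputs.
baseline_monthly_enrollments = 10_000
observed_lift = 0.04                 # +4% enrollment confirmations at 95% confidence
contribution_per_enrollment = 120.0  # estimated margin contribution per enrollment, USD

incremental_enrollments_per_year = baseline_monthly_enrollments * 12 * observed_lift
annual_ebitda_impact = incremental_enrollments_per_year * contribution_per_enrollment
print(f"~${annual_ebitda_impact:,.0f} projected annual EBITDA impact")
```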
The Payoff
The governance framework I built at NRG is my PRISM Method in practice: structured measurement and iterative improvement applied at enterprise scale. It's not about controlling people. It's about creating an environment where good experimentation happens consistently, not accidentally.
When I hear about experimentation programs that ran a lot of tests but can't show business impact, the root cause is almost always governance. Not bad testers. Not bad ideas. Bad process.
Good governance means every test has a purpose, every result has context, and every stakeholder understands both.
Building a multi-brand or multi-team experimentation program? Reach me at [email protected].