Atticus Li leads Applied Experimentation at NRG Energy (Fortune 150), where he runs 100+ experiments per year and generated $30M in verified revenue impact in 2025. He writes about the operational reality of building experimentation programs that survive contact with organizational politics.

Articles about A/B testing never quite get it right. And I think I know why: they're written for everyone, so they aim to be accurate instead of realistic. They describe what experimentation looks like in theory, under ideal conditions, with unlimited resources. They don't describe what it looks like on Tuesday afternoon when your VP overrides the test results, your sample size is too small for the test your stakeholder wants, and you're the only person on the team who understands statistics.

That's the version I want to talk about. The real one.

Political Constraints: When Stakeholders Override Data

Let me start with the constraint that no statistics textbook covers: organizational politics.

I have seen test results overridden by stakeholders more times than I can count. Not hypothetically. Literally. A test produces a clear negative result — the variant hurts conversion — and a senior leader decides to ship it anyway. Because it looks better. Because it aligns with the brand direction. Because they already promised their VP it was happening.

The first time this happened to me, I was furious. I had the data. The data was clear. The variant was worse. And they shipped it anyway.

Over time, I've learned that the political override is a feature of operating in an organization, not a bug in the experimentation program. You can have perfect methodology and still get overruled by someone who controls the budget. The question is how you handle it.

Here's what I do now. I document the recommendation and the override. I write it down: "Experimentation team recommends Control based on [data]. Decision maker X approved shipping Variant based on [reason]." I don't fight it in the moment. I document it. And when the predicted negative impact materializes — which it usually does — I have the record to show that the data said what the data said.

This documentation serves two purposes. First, it protects the program's credibility. When the numbers come in and the variant underperforms, leadership can see that the experimentation team called it correctly. Second, it builds the political capital to prevent future overrides. After a few documented cases of "we told you so," stakeholders start listening more carefully.

But make no mistake: political constraints are real, they're permanent, and no amount of statistical rigor eliminates them. The best you can do is build enough trust over time that the overrides become less frequent.

Sample Size Constraints: Most Companies Can't Test Button Colors

The most commonly cited A/B testing example in the CRO world is the button color test. Change a button from green to red and measure the click-through rate difference. It's a great teaching example. It's also wildly impractical for most companies.

Button color changes produce small effect sizes, and detecting a small effect with statistical significance requires enormous sample sizes. Suppose your baseline conversion rate is 3% and you want to detect an 8% relative lift (from 3% to about 3.25%). At 80% power and 95% significance, you need roughly 80,000 users per variant. That's 160,000 total users for one test.
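You can check that math yourself. Here's a minimal sketch of the standard two-proportion power calculation (normal approximation); the function name and defaults are mine, not any particular tool's:

```python
# Minimal two-proportion sample size estimate (normal approximation).
# Reproduces the figures above: 3% baseline, 8% relative lift,
# 80% power, 95% two-sided significance.
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

print(sample_size_per_variant(0.03, 0.08))  # ~82,000 per variant
```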

Most companies don't have 160,000 users hitting a single page per month. Many B2B companies are testing with 5,000 to 20,000 monthly visitors. At those traffic levels, you can't detect small effects. Period.

This means the entire philosophy of "test everything, including minor UI tweaks" that dominates CRO content is irrelevant for most companies. If you have 10,000 visitors a month, you can only test changes that produce large effects — typically 20%+ relative improvements. That limits you to big, bold changes: fundamentally different page layouts, entirely new value propositions, major workflow redesigns.
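You can flip the same arithmetic around: given the traffic you actually have, what's the smallest lift you can realistically detect? A common rule of thumb (n ≈ 16·p(1−p)/δ² for 80% power at 95% significance) inverts cleanly. A sketch, with illustrative numbers:

```python
from math import sqrt

def minimum_detectable_relative_lift(baseline, n_per_variant):
    # Inverts the rule of thumb n = 16 * p * (1 - p) / delta^2
    # (80% power, 95% two-sided significance).
    delta_abs = 4 * sqrt(baseline * (1 - baseline) / n_per_variant)
    return delta_abs / baseline

# 10,000 monthly visitors split across two variants, 3% baseline:
print(f"{minimum_detectable_relative_lift(0.03, 5_000):.0%}")  # ~32%
```

At that traffic level, one month of data only resolves a roughly 32% relative lift; getting down to 20% takes about two and a half months. That's why big, bold changes are the only ones worth testing on low traffic.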

I've worked with teams that tried to run textbook CRO programs on low-traffic sites. They'd launch five micro-optimizations simultaneously, none of which had the statistical power to reach significance. After six months of inconclusive results, leadership concluded that experimentation doesn't work. It wasn't that experimentation didn't work — it was that the tests were designed for a traffic level the company didn't have.

Use a sample size calculator before you scope a test. If the required sample size exceeds your monthly traffic, either scope a bigger change that targets a larger effect size, or test on a higher-traffic page. Don't run underpowered tests and pretend the results mean something.

Resource Constraints: 2-3 Person Teams Running Entire Programs

The CRO influencer world has a blind spot, and it's this: most of the people giving advice work with companies that have millions of monthly sessions and dedicated experimentation teams of 10+. They've never run a program where one person does the research, writes the hypothesis, creates the test design, coordinates with development, monitors the data, runs the analysis, and presents the results to leadership.

That's what most experimentation programs actually look like. Two to three people. Maybe one dedicated experimentation manager and a shared analyst. Maybe just one person doing everything.

At that scale, the CRO playbook of "run 20+ tests per month with rigorous research backing each one" is fantasy. You're lucky to get four to six good tests per month. And "good" means adequately powered, properly QA'd, and aligned with a real business hypothesis. Not "we changed a headline because someone had an idea in a meeting."

The resource constraint forces prioritization that the CRO content world rarely addresses. You can't test everything. You have to ruthlessly prioritize the tests with the highest expected value — highest potential impact multiplied by the probability of success. That means saying no to stakeholder requests that would consume test slots on low-impact ideas.
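In practice I keep the scoring simple. Here's a minimal sketch of expected-value prioritization, with made-up numbers and field names; the inputs are subjective estimates, not measurements:

```python
from dataclasses import dataclass

@dataclass
class TestIdea:
    name: str
    impact_if_win: float   # estimated monthly $ impact if the variant wins
    p_success: float       # subjective probability the variant wins

    @property
    def expected_value(self):
        return self.impact_if_win * self.p_success

backlog = [
    TestIdea("Checkout flow redesign", 200_000, 0.20),
    TestIdea("New pricing page layout", 120_000, 0.25),
    TestIdea("Headline copy tweak", 8_000, 0.40),
]

for idea in sorted(backlog, key=lambda t: t.expected_value, reverse=True):
    print(f"{idea.name}: EV = ${idea.expected_value:,.0f}/month")
```

The point isn't precision. The point is that the headline tweak loses to the bigger bets by an order of magnitude, which makes the "no" to that stakeholder request defensible.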

I've found that the best small teams operate more like internal consultants than service providers. They don't run every test that's requested. They evaluate each request against the program's capacity and the expected business impact, and they push back on low-value tests. This requires organizational authority that takes time to build, but it's the only way a small team stays effective.

Knowledge Constraints: The CRO Manager Is Also the Analyst, PM, and Presenter

Related to the resource constraint is the knowledge constraint. In most organizations, the person running the experimentation program is expected to be an expert in statistics, user experience, data analysis, project management, stakeholder communication, and business strategy. Simultaneously.

Nobody is an expert in all of those. The typical CRO manager is strongest in one or two areas and adequate in the rest. The knowledge gaps are real and they lead to real problems.

I've seen programs where the manager was a strong analyst but a poor communicator. The tests were rigorous, but leadership couldn't understand the results, so the program was perceived as underperforming. I've seen the reverse: excellent communicators who ran methodologically sloppy tests, got lucky with some wins, and then couldn't explain why the program stopped delivering when the luck ran out.

The fix isn't "become an expert in everything." It's knowing your gaps and compensating for them. If you're weak on statistics, use calculators and automated checks that enforce rigor. If you're weak on communication, build templates and get feedback from someone who's good at it. If you're weak on research, partner with the UX team.

The Speed vs. Rigor Tradeoff Nobody Writes About

Here's the one that keeps me up at night. In theory, every experiment should be perfectly designed, adequately powered, run for the full duration, and analyzed with proper statistical methods. In practice, you're constantly making tradeoffs between rigor and speed.

The stakeholder needs a decision by Friday. The test hasn't reached significance. Do you call it early? Technically, no — peeking at results and making early calls inflates your false positive rate. Practically, sometimes you have to. Because the business doesn't wait for your confidence interval to converge.
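If you want to see the inflation for yourself, here's a small simulation I'd reach for. It runs A/A tests (no true difference between variants), peeks after every batch, and counts how often a "significant" result appears anyway. All the parameters are illustrative:

```python
import random
from statistics import NormalDist

def p_value(conv_a, n_a, conv_b, n_b):
    # Two-proportion z-test (normal approximation), two-sided.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = abs(conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(z))

random.seed(7)
RATE = 0.03               # both variants identical: any "winner" is a false positive
BATCH, LOOKS, RUNS = 1_000, 10, 1_000

false_calls = 0
for _ in range(RUNS):
    ca = cb = na = nb = 0
    for _ in range(LOOKS):
        ca += sum(random.random() < RATE for _ in range(BATCH))
        cb += sum(random.random() < RATE for _ in range(BATCH))
        na += BATCH
        nb += BATCH
        if p_value(ca, na, cb, nb) < 0.05:   # peek and call it early
            false_calls += 1
            break

print(f"False positive rate with 10 peeks: {false_calls / RUNS:.0%}")
# Typically lands in the 15-20% range, not the nominal 5%.
```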

I've developed a framework for these tradeoffs. High-stakes decisions — anything involving significant revenue, major product changes, or customer-facing commitments — get full rigor. No early peeking. No relaxed significance thresholds. No shortcuts. The decision waits for the data.

Low-stakes decisions — UI tweaks, copy variations, minor layout changes — get pragmatic rigor. I'll look at directional results earlier. I'll use sequential testing methods that allow valid early stopping. I'll accept lower confidence levels for changes that are easily reversible.
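The crudest valid version of early stopping, if you don't want to implement a proper group-sequential design: decide your number of peeks up front and split your alpha across them. It's a conservative stand-in (real sequential methods like O'Brien-Fleming boundaries give back most of the lost power), but it keeps the overall false positive rate at or below 5%. A sketch, reusing the setup from the simulation above:

```python
# Bonferroni-style correction for planned peeks: conservative, but valid.
ALPHA_PER_LOOK = 0.05 / LOOKS   # 10 planned peeks -> test each at 0.005

false_calls = 0
for _ in range(RUNS):
    ca = cb = na = nb = 0
    for _ in range(LOOKS):
        ca += sum(random.random() < RATE for _ in range(BATCH))
        cb += sum(random.random() < RATE for _ in range(BATCH))
        na += BATCH
        nb += BATCH
        if p_value(ca, na, cb, nb) < ALPHA_PER_LOOK:
            false_calls += 1
            break

print(f"False positive rate with corrected peeks: {false_calls / RUNS:.0%}")
# Now at or below the nominal 5%.
```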

This isn't in any textbook. But it's how every experienced practitioner actually operates. The ones who don't admit it are either lying or working with unlimited resources.

The CRO Influencer Blind Spot

I want to say this directly because I think it needs to be said. A significant portion of CRO content is produced by people who work with high-traffic e-commerce sites or SaaS products with millions of monthly users. Their advice is optimized for that context.

When they say "test everything," they're speaking from a world where traffic supports dozens of simultaneous tests. When they say "never peek at results," they're speaking from a world where tests reach significance in days, not months. When they recommend complex multivariate testing programs, they're assuming team sizes and traffic volumes that most companies simply don't have.

This doesn't make their advice wrong. It makes it narrow. And the lack of context causes real damage when someone at a company with 15,000 monthly visitors tries to implement a program designed for 15 million.

The reality of experimentation at most companies is messier, slower, more politically constrained, and more resource-limited than what you read online. Acknowledging that reality is the first step toward building a program that actually works within your constraints instead of one that looks good on paper and fails in practice.

Working Within Your Constraints Instead of Fighting Them

The programs that succeed aren't the ones with the most resources. They're the ones that honestly assess their constraints and design around them.

Low traffic? Test fewer, bigger changes. Limited team? Prioritize ruthlessly and say no to low-value requests. Political pressure? Document everything and build trust incrementally. Knowledge gaps? Use tools and templates that enforce methodology.

Constraints aren't excuses for sloppy work. They're the parameters that define what good work looks like in your specific context. The experimentation program that runs 15 well-designed tests per quarter on low traffic and delivers measurable revenue impact is more valuable than the one that runs 50 underpowered tests and delivers noise.

Work with what you have. Be honest about what you don't have. And stop comparing your program to the ones you read about on LinkedIn.

---

_Whatever your traffic level, make sure your experiments are properly powered. GrowthLayer's sample size calculator tells you exactly how many users you need — so you stop running underpowered tests and start running experiments that can actually deliver answers._
