Atticus Li leads Applied Experimentation at NRG Energy (Fortune 150), where he's run 150+ experiments with a 24%+ win rate — double the industry average. After scaling NRG's program from 20 to 100+ tests per year and consulting with experimentation teams of every size, he's identified the five failure modes that kill most A/B testing programs before they can deliver value.
Most A/B tests fail — and I don't mean the test variants lose to the control. I mean the tests themselves are broken. The process that produced them, ran them, and interpreted their results is fundamentally flawed. After running over 150 experiments across multiple products and consulting on experimentation programs for teams of every size, I've seen the same five failure modes kill testing programs over and over.
The good news: every one of these is fixable. The bad news: fixing them requires changing how your organization thinks about experimentation, not just how it runs tests.
Failure Mode 1: No Pre-Test Calculations
This is the most common and most destructive failure mode. A team has a test idea, builds the variants, launches the test, and then watches the results dashboard like a stock ticker. After a week, they see a "winner" and ship it.
The problem: they never calculated how long the test needed to run. They never determined the minimum sample size required to detect a meaningful effect. They never defined what "meaningful" even means in the context of their business.
Here's what actually happens when you skip pre-test calculations:
You stop tests too early. You see a 12% lift after three days, declare victory, and ship. But you needed 25,000 visitors per variant to detect a real 5% effect, and you only had 3,000. That "12% lift" is noise. It will regress to zero — or worse — once you ship to 100% of traffic.
You run tests too long. Without a stopping rule, tests just... keep running. I've seen companies with tests running for six months because nobody calculated when they should be called. Those six months of traffic were split between variants for no reason, costing real revenue.
You can't detect real effects. A test might produce a genuine 3% lift, but if your sample size can only detect effects of 10% or larger, you'll call it inconclusive. That 3% lift, compounded across your entire funnel, might be worth millions. But you'll never know because you didn't do the math upfront.
The fix: Before any test launches, calculate three things. First, your baseline conversion rate for the primary metric. Second, the minimum effect size you care about detecting (this is a business decision, not a statistical one). Third, the sample size and runtime required at 80% power and 95% confidence. If the required runtime exceeds what's practical, either find a larger-effect hypothesis or don't run the test. Running an underpowered test is worse than running no test.
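The arithmetic here is mechanical enough to script. Below is a minimal sketch of the pre-test calculation using the standard normal-approximation formula for a two-sided, two-proportion test; the function name and the example numbers (a 20% baseline, a 5% relative minimum detectable effect) are illustrative, not from any particular platform.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_rate, min_relative_lift,
                            alpha=0.05, power=0.80):
    """Per-variant sample size for a two-sided, two-proportion test.

    Standard normal-approximation formula: alpha=0.05 corresponds to 95%
    confidence and power=0.80 to 80% power, the defaults discussed above.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 at 95% confidence
    z_power = NormalDist().inv_cdf(power)          # ~0.84 at 80% power
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)
```

At a 20% baseline, `sample_size_per_variant(0.20, 0.05)` comes out around 25,600 visitors per variant, in line with the 25,000-per-variant figure above; at a 4% baseline, the same 5% relative lift needs well over 100,000 per variant, which is exactly the "find a larger-effect hypothesis" situation.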
Failure Mode 2: Not Calibrating Your Tools
Would you trust a scale that's never been calibrated? Then why do you trust your A/B testing platform without verifying it works correctly?
I've seen teams run hundreds of tests on platforms that had fundamental tracking issues. The JavaScript snippet wasn't firing correctly on certain browsers. The event tracking was double-counting conversions on page reloads. The randomization wasn't actually random — users were being assigned to variants based on predictable patterns.
AA tests first. Before you run a single A/B test, run an AA test — a test where both "variants" are identical. If your platform reports a statistically significant difference between two identical experiences, something is broken. Your tracking, your randomization, your analytics pipeline — something is producing false signals.
I recommend running AA tests quarterly, not just once. Platform updates, tag manager changes, and site redesigns can all introduce tracking issues that weren't there before.
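If your platform can export raw assignment and conversion counts, you can sanity-check what a healthy AA test looks like. The sketch below simulates one end to end with a plain two-proportion z-test; the function names and numbers are illustrative, not any vendor's API.

```python
import random
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion counts."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulated_aa_test(n_per_variant=20000, true_rate=0.05, seed=0):
    """Both 'variants' share one true rate; a significant p is a false alarm."""
    rng = random.Random(seed)
    conv_a = sum(rng.random() < true_rate for _ in range(n_per_variant))
    conv_b = sum(rng.random() < true_rate for _ in range(n_per_variant))
    return two_proportion_p_value(conv_a, n_per_variant, conv_b, n_per_variant)
```

Over many seeds, roughly 5% of runs should come back "significant" at the 0.05 level. If your real platform's AA tests flag differences much more often than that, something upstream (tracking, randomization, the analytics pipeline) is producing false signals.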
SRM checks on every test. Sample Ratio Mismatch (SRM) is the canary in the coal mine. If you set a 50/50 traffic split and one variant gets 51.3% of traffic while the other gets 48.7%, that's a problem: at typical test volumes, even that small an imbalance is statistically improbable under true 50/50 randomization. Imbalances like this can indicate bot traffic, caching issues, or redirect problems that invalidate your results.
Check SRM within the first 24 hours of every test. If the ratio is off by more than a percentage point, pause the test and investigate before the contaminated data gets worse.
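The SRM check itself is a one-degree-of-freedom chi-square test, small enough to automate in daily monitoring. A minimal sketch (the function name and the 0.001 alarm threshold are my conventions, not a standard API):

```python
from statistics import NormalDist

def srm_p_value(visitors_a, visitors_b, expected_split=0.5):
    """Chi-square (1 df) goodness-of-fit test for sample ratio mismatch.

    Returns the probability of an imbalance this large under the intended
    split. Values below ~0.001 are a conventional SRM alarm.
    """
    total = visitors_a + visitors_b
    expected_a = total * expected_split
    expected_b = total * (1 - expected_split)
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    # Chi-square with 1 df: p = P(|Z| > sqrt(chi2)) for standard normal Z.
    return 2 * (1 - NormalDist().cdf(chi2 ** 0.5))
```

On 100,000 visitors, the 51.3/48.7 split above yields a p-value far below 0.001. The same percentage gap on a few thousand visitors can be perfectly normal, which is why you test rather than eyeball the ratio.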
End-to-end tracking validation. Walk through every step of your test tracking manually. Submit a test conversion yourself. Check that it appears in your analytics correctly, attributed to the right variant. Do this on multiple browsers and devices. The number of tests I've seen invalidated by tracking bugs that a 15-minute manual check would have caught is genuinely depressing.
Failure Mode 3: Solutionizing Before Finding the Problem
"Let's test a new hero image." Why? "Because the creative team made a new one and it looks better."
This is the most common way tests get conceived, and it's backwards. You're starting with a solution and working backward to justify it. The test might "work" — the new image might win — but you'll have no idea why, which means you can't learn anything transferable.
The right order: problem first, then solution.
Start by identifying the specific problem in your funnel. Use quantitative data (analytics, heatmaps, session recordings) and qualitative data (user research, surveys, support tickets) to understand what's actually blocking conversion. Then — and only then — ideate solutions.
And here's the part most teams skip: get diverse perspectives on the solution. When one person identifies a problem and proposes a solution, you get one person's mental model. When you bring in perspectives from design, engineering, customer support, and actual users, you get solutions you never would have considered.
I've seen this play out repeatedly. The CRO specialist proposes changing button color. The support team says "Users keep asking if we offer X — they don't see the feature callout." The engineer says "The page takes 4 seconds to load on mobile, maybe we should fix that before changing colors." The support team was right. Fixing the information architecture produced a 14% lift. The button color would have moved the needle by maybe 1%, if at all.
The method: When you've identified a real problem, give the problem statement — without any proposed solutions — to five different people from different functions. Ask each of them: "What would you do about this?" The best solution often comes from the person furthest from the problem, because they're not anchored to the obvious answer.
Failure Mode 4: Wrong KPIs
This one is insidious because the tests look perfectly valid. The methodology is sound. The sample sizes are adequate. The results are statistically significant. But you're optimizing for the wrong thing.
I've watched teams celebrate a 23% lift in click-through rate on a CTA button, only to discover that downstream conversion — the metric that actually generates revenue — didn't move at all. More people clicked. The same number bought. All that happened was that more people entered the next step of the funnel and then dropped off.
Even worse: I've seen teams optimize for metrics that are actively disconnected from business outcomes. Email open rate improvements that don't translate to purchases. Session duration increases driven by users being confused and searching for information, not by engagement. Page view increases from click-bait internal linking that fragments the user journey.
The fix: trace every test metric back to revenue. If you can't draw a direct line from your primary metric to dollars, you're probably measuring the wrong thing.
Here's my hierarchy for test metrics:
- Revenue per visitor — the gold standard. If your test lifts RPV with adequate sample size, ship it.
- Conversion rate for revenue-generating actions — purchases, plan upgrades, paid feature adoption. Good proxy when RPV sample sizes are impractical.
- Qualified micro-conversions — actions that have a demonstrated, measured correlation with eventual purchase. Not "clicks" or "engagement" — specific actions you've validated actually predict revenue.
- Everything else — engagement metrics, scroll depth, time on page. These are diagnostic tools, not test metrics. Use them to understand user behavior. Don't use them to make shipping decisions.
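To see why the top of the hierarchy matters, here is the CTA scenario from above in numbers. Everything in this sketch is hypothetical, chosen only to illustrate how click-through rate and revenue per visitor can diverge:

```python
def revenue_per_visitor(total_revenue, visitors):
    """RPV, the top of the metric hierarchy: revenue over all visitors."""
    return total_revenue / visitors

# Hypothetical test results: clicks jump, purchases barely move, so CTR
# "wins" decisively while RPV stays essentially flat.
control = {"visitors": 50000, "clicks": 4000, "revenue": 100000.0}
variant = {"visitors": 50000, "clicks": 4920, "revenue": 100300.0}

ctr_lift = variant["clicks"] / control["clicks"] - 1  # +23% click-through
rpv_lift = (revenue_per_visitor(variant["revenue"], variant["visitors"])
            / revenue_per_visitor(control["revenue"], control["visitors"])
            - 1)                                      # +0.3% revenue
```

A team reporting on CTR ships this as a 23% win; a team reporting on RPV correctly calls it a wash. Same data, different metric, opposite decisions.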
Companies frequently optimize for things that are important to one team but irrelevant to the business's actual growth. The content team optimizes for time on page. The product team optimizes for feature adoption. The marketing team optimizes for lead volume. None of them are checking whether their optimizations contribute to the bottom line. This is how organizations end up with impressive dashboards and stagnant revenue.
Failure Mode 5: Stakeholder Override
This is the failure mode nobody wants to talk about. The test results clearly show that variant B — the stakeholder's preferred design — loses to the control. The data is clean, the sample size is adequate, the effect is statistically significant. And then someone in leadership says "Ship variant B anyway, I think the data is wrong" or "The test doesn't capture the long-term brand impact."
I've been in this room. Multiple times. It's excruciating.
When stakeholders override test results, they don't just waste the resources that went into that test. They destroy the credibility of the entire experimentation program. The team learns that data doesn't actually matter — what matters is whether the most important person in the room likes the design. After that happens twice, nobody takes testing seriously anymore. It becomes theater.
The fix starts before the test, not after. Get explicit alignment on three things before you launch:
- What is the primary metric? Everyone agrees, in writing, before the test starts.
- What constitutes a winner? Define the decision criteria upfront. "If variant B shows a statistically significant lift of 3% or more in RPV at 95% confidence, we ship it. Otherwise, we ship the control."
- Who has decision authority? This needs to be established before results are available. If the VP of Marketing has final say regardless of data, acknowledge that openly. At least then the team knows the testing program is advisory, not decisive.
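Pre-commitment is easier to enforce when the decision rule exists as a written artifact rather than a memory. One sketch of what that could look like, using the hypothetical 3%-lift-at-95%-confidence criterion from the list above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DecisionRule:
    """Pre-committed shipping criteria, recorded before the test launches.

    Thresholds mirror the hypothetical example above: ship variant B only
    on a statistically significant RPV lift of at least 3%.
    """
    min_relative_lift: float = 0.03
    alpha: float = 0.05  # 95% confidence

    def decide(self, relative_lift: float, p_value: float) -> str:
        if p_value < self.alpha and relative_lift >= self.min_relative_lift:
            return "ship variant B"
        return "ship control"
```

The `frozen=True` is the point: once results are in, nobody edits the thresholds. A 4.1% lift at p = 0.02 ships variant B; the same lift at p = 0.20, or a 1% lift at any p-value, ships the control.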
The hardest part of this conversation is pre-commitment. People are happy to agree to "let the data decide" when they think the data will support their preference. The moment it doesn't, the goalposts start moving. "Well, the test didn't capture brand perception." "The test period was unusual because of the holiday." "I don't trust the tracking."
Pre-commitment doesn't eliminate these objections, but it makes them visible. When someone has to explicitly say "I know we agreed to follow the data, and the data says my preferred option lost, but I want to override that anyway," the cost of overriding becomes real and public.
The Real Fix: Stop Being a Tester. Start Being a Consultant.
Here's the uncomfortable truth about experimentation programs: the technical execution is the easy part. Calculating sample sizes, implementing tests, analyzing results — these are learnable skills. Any reasonably analytical person can master them in months.
The hard part is organizational. It's getting stakeholders to care about data. It's persuading product managers that their "obvious" improvement needs testing. It's building the political capital to push back when someone wants to ship a loser. It's making the experimentation program relevant to the people who control budget and roadmap.
This means your job isn't "A/B test operator." Your job is internal consultant.
Build relationships before you need them. Meet with stakeholders regularly, not just when you have test results to share. Understand their goals, their pressures, their definition of success. When you understand what the VP of Product is evaluated on, you can frame your experimentation results in terms that matter to them.
Learn data storytelling. A p-value doesn't persuade anyone. A narrative does. "We tested the team's hypothesis that a simpler checkout would increase conversion. Here's what we found, here's what it means for Q3 revenue targets, and here's what I recommend we do next." That's a story. A spreadsheet of statistical outputs is not.
Surface what matters, filter what doesn't. Your stakeholders don't need to know about every test. They need to know about the tests that change their decisions. Develop a sense for which results are strategically important and lead with those. Save the methodological details for people who care about methodology.
Make recommendations, not just reports. "The test was inconclusive" is accurate and useless. "The test was inconclusive, which tells us the effect — if it exists — is smaller than 2%. I recommend we move to higher-impact hypotheses in the checkout flow, where our data suggests a 15% drop-off that behavioral analysis attributes to payment anxiety." That's a recommendation. It tells people what to do next.
Tying It All Together
The NRG framework gives you a way to measure the overall health of your experimentation program — are you running enough tests, at sufficient quality, to drive meaningful growth? The PRISM method gives you the structure for individual experiments — from problem identification through measurement.
But frameworks don't fix culture. You fix culture by demonstrating value, building trust, and being the person in the room who cares more about finding the truth than being right.
I've run over 150 experiments, and my win rate is around 24%. That means I'm wrong about three out of four hypotheses. If that sounds like a low number, consider this: most teams don't track their win rate at all, which means they don't know how often they're wrong. They just ship whatever the highest-paid person in the room prefers and hope for the best.
A 24% win rate with 150+ experiments means we found roughly 36 genuine improvements. Each one is validated, measured, and permanent. That's 36 changes we know — not think, not hope, know — made the product better. Compare that to the average product team that ships 50 changes a year with zero measurement of impact.
The experimentation program that runs 100+ tests per year and wins 24% of the time will outperform the team that runs 10 tests and "wins" 80% of the time — because the second team is almost certainly fooling itself with underpowered tests, wrong metrics, or confirmation bias.
Where to Start
If your experimentation program is struggling — or if you don't have one yet — start here:
- Fix your tracking. Run AA tests. Validate your analytics pipeline. You cannot learn anything from broken instruments.
- Define your primary metric. Pick one metric that ties to revenue. Optimize for that. Everything else is secondary.
- Do the math before every test. Calculate sample size. Determine runtime. If the math doesn't work, find a better hypothesis.
- Get pre-commitment from stakeholders. Agree on decision criteria before launching. Put it in writing.
- Build the consultant muscle. Learn to tell stories with data. Build relationships. Make recommendations.
The experimentation framework page has the complete methodology. But methodology is 20% of success. The other 80% is having the organizational credibility to act on what you learn.
Stop running tests. Start building an experimentation program. The difference is everything.