The testing industry has a dirty little secret. Not the kind buried in fine print or hidden behind paywalls — the kind that sits in plain sight across thousands of optimization programs, quietly ignored because acknowledging it would undermine the narrative that sells testing platforms, conference tickets, and consulting engagements. Here it is: most A/B tests do not produce a statistically significant winner. And if your experimentation program is well-run, that is exactly what you should expect.

I know this because we lived it. Across a portfolio of 97 controlled experiments for a mid-market energy provider, 59 of them — a full 61% — returned inconclusive results. Not winners. Not losers. Inconclusive. The kind of outcome that makes stakeholders fidget in quarterly reviews and prompts well-meaning executives to ask whether the testing program is “working.”

It is working. The problem is not with the data. The problem is with what we have been trained to expect from it.

The Conventional Wisdom: Every Test Should Produce a Winner

Open any optimization blog, attend any CRO conference, and you will absorb a specific worldview: A/B testing is a machine that produces winners. The implied promise is that if you follow the right process — conduct user research, build hypotheses, design variants — you will reliably generate lifts. A 15% increase here, a 22% improvement there. The case studies pile up like trophies in a display cabinet.

This framing is not accidental. It serves the commercial interests of everyone in the optimization ecosystem. Platform vendors need to justify subscription costs. Agencies need to demonstrate ROI. Consultants need to fill case study decks. The result is a systematic survivorship bias that would make Abraham Wald — the statistician who famously identified survivorship bias in WWII aircraft damage analysis — deeply uncomfortable.

The narrative goes something like this: smart teams test, testing produces winners, winners compound into growth. The unspoken corollary: if your tests are not producing winners, you are doing it wrong.

This is a seductive story. It is also fundamentally misleading.

The Evidence Against It: What 97 Experiments Actually Tell Us

Let me lay out the numbers from our experimentation portfolio, because specifics matter more than narratives.

Across the 97 A/B experiments we ran for this energy provider, the outcomes broke down as follows:

  • 59 inconclusive (61%)
  • 26 winners (27%)
  • 12 losers (12%)

That is a win rate of roughly 1 in 4. The majority of our experiments — well-researched, carefully hypothesized, properly executed — did not move the needle in a statistically significant way.

But the category-level data reveals something even more instructive.

The homepage category ran 16 experiments and produced 5 winners, zero losers, and 11 inconclusive results. A 31% win rate on the highest-traffic page of the site — the page with the most visibility and the most scrutiny from leadership.

The pricing page category ran 13 experiments: 2 winners, 3 losers, 8 inconclusive. A 15% win rate on the page most directly tied to revenue. The pricing page also produced the highest ratio of losers to winners in our portfolio, suggesting that pricing is a domain where user expectations are entrenched and changes carry real downside risk.

The product comparison category ran 19 experiments: 7 winners, 3 losers, 9 inconclusive. The highest absolute win count, but still only a 37% win rate — meaning nearly two-thirds of our ideas for improving product comparison did not produce significant results.

Mobile tests returned 5 winners from 13 experiments — a 38% win rate, the highest in the portfolio. This makes intuitive sense. Mobile experiences in legacy industries like energy tend to be underinvested, creating more headroom for improvement.

The analysis page category ran 5 experiments, all inconclusive. Navigation ran 2, both inconclusive. Checkout produced 0 winners from 5 experiments. Social proof yielded an inconclusive result from its single test.

These are not the numbers of a failing program. These are the numbers of an honest one.

Why We Got It Wrong: The Psychology of Expecting Winners

The gap between what experimentation actually delivers and what we expect it to deliver is not a knowledge gap. It is a cognitive gap, driven by well-documented psychological biases.

Survivorship Bias in CRO Literature

Daniel Kahneman and Amos Tversky’s foundational work on cognitive biases explains much of our distorted expectations around testing. Their concept of the availability heuristic — the tendency to judge probability based on how easily examples come to mind — is particularly relevant here.

When you read ten case studies about successful A/B tests and zero case studies about inconclusive ones, your brain calibrates its expectations accordingly. You begin to believe that winning is the default outcome because winning is the only outcome anyone publishes. This is not a conspiracy. It is the predictable result of publication bias operating across an entire industry.

No one submits a conference talk titled “We Ran 16 Homepage Tests and 11 Were Inconclusive.” No one writes a blog post celebrating the pricing experiment that confirmed the current design was already near-optimal. The inconclusive results vanish from the collective memory of the profession, leaving behind a distorted record of perpetual victory.

The Narrative Fallacy and Experimentation

Nassim Nicholas Taleb, building on Kahneman and Tversky’s work, identified what he calls the narrative fallacy — our compulsion to construct coherent stories from random or complex data. In experimentation, the narrative fallacy manifests as the belief that every test should tell a clear story: we identified a problem, we designed a solution, the solution worked.

But conversion behavior is not a narrative. It is a complex adaptive system influenced by hundreds of variables, most of which we cannot observe, let alone control. When a test comes back inconclusive, it is not a failure of storytelling. It is the system telling you that the signal you were looking for does not exist at the magnitude you hypothesized — or that it exists but is too small to detect with your current sample size and test duration.

Loss Aversion in Program Management

Kahneman and Tversky’s prospect theory demonstrates that losses loom larger than gains — roughly twice as large, according to their original estimates. In experimentation, this creates an asymmetric response to outcomes. An inconclusive result feels like a loss. It consumed resources, occupied calendar time, and produced nothing you can put in a deck. The emotional weight of that perceived loss drives teams toward two destructive behaviors: abandoning disciplined testing in favor of “quick wins,” or manipulating test parameters to manufacture significance.

Both responses make things worse. Quick-win testing sacrifices learning velocity for short-term optics. P-hacking — adjusting sample sizes, peeking at results, or running tests to arbitrary endpoints — produces false positives that erode trust in the entire program.

The Nuanced Truth: What Actually Matters About Experimentation

If a 27% win rate is normal — and the evidence from our portfolio, from Microsoft’s experimentation platform, and from the broader academic literature suggests it is — then the value proposition of experimentation needs to be reframed entirely.

Experimentation Is Risk Management, Not Growth Hacking

The most important function of an experimentation program is not producing winners. It is preventing losers from reaching production.

In our portfolio, 12 of 97 tests — about 12% — identified changes that would have actively harmed conversion rates. On the pricing page alone, 3 of 13 tests were losers. Without a controlled testing framework, those changes might have been shipped based on stakeholder opinion, competitor benchmarking, or designer intuition. The revenue impact of deploying three conversion-killing changes to a pricing page is not theoretical. It is a concrete, quantifiable business risk that experimentation eliminated.

Ron Kohavi, who led experimentation at Microsoft and Amazon, has written extensively about this protective function. His research suggests that roughly one-third of ideas that teams believe will improve metrics actually make them worse. Experimentation is the filter that catches those ideas before they reach users at scale.

Inconclusive Results Are Information, Not Absence of Information

An inconclusive result tells you something precise: the effect size of your proposed change, if it exists, is smaller than your test was powered to detect. This is genuinely useful information, especially when interpreted correctly.

Consider our analysis page results: 5 tests, all inconclusive. The conventional interpretation — “nothing worked” — misses the point entirely. The accurate interpretation is that the analysis page experience is relatively robust against the types of changes we tested. User behavior on that page is stable. The existing design is doing its job.

That is a finding. It tells you where not to allocate future resources. It redirects attention toward areas with more variance and more headroom — like mobile, where our 38% win rate suggests significant opportunity remains.

The Portfolio Model of Experimentation

Intelligent experimentation programs should be managed like investment portfolios, not like individual bets. In venture capital, the expected distribution of returns follows a power law: most investments return nothing, a few return modestly, and a tiny fraction produce outsized returns. No serious investor evaluates individual investments in isolation. They evaluate portfolio performance.

Our experimentation portfolio generated 26 winners from 97 tests. The compounding effect of those 26 improvements — even modest ones in the 3-7% range — represents significant cumulative value. More importantly, the 12 losers we caught prevented value destruction that would have partially or fully offset those gains.

Michael Mauboussin’s research on skill versus luck in competitive domains is instructive here. In his framework, activities with high skill components show tight distributions of outcomes, while luck-dominated activities show wide distributions. The relatively stable win rates across our four largest test categories — between 15% and 38% — suggest that experimentation outcomes are influenced by both skill (in hypothesis generation and test design) and irreducible uncertainty (in user behavior). Expecting a skill-only distribution — where careful research reliably produces winners — is expecting the wrong distribution.

When the Old Framing Still Works

Fairness demands acknowledging the scenarios where a higher win rate is both achievable and reasonable.

Deeply broken experiences. If your mobile checkout has a 94% abandonment rate and your forms require 14 fields to complete a purchase, you are not optimizing — you are fixing. Fix-it testing against severely degraded baselines will produce higher win rates because the starting point is so poor that almost any thoughtful change constitutes an improvement. Our mobile tests, with their 38% win rate, likely reflect this dynamic.

High-traffic, high-sensitivity pages. Pages with enormous traffic volumes allow you to detect smaller effect sizes, which mechanically increases your win rate. If your homepage receives 10 million visits per month, you can detect a 0.5% conversion lift with confidence (see the sketch at the end of this section). At that resolution, more of your hypotheses will clear the significance threshold.

First-generation optimization. If an organization has never run A/B tests, the initial wave of experiments will often produce higher win rates. The lowest-hanging fruit gets picked first. Win rates naturally decline as the experience matures and the remaining opportunities get smaller and harder to find.

None of these scenarios invalidates the broader point. They represent specific conditions under which the distribution of outcomes temporarily shifts. As programs mature, win rates converge toward the 20-30% range that our data and the broader literature suggest is normal.

What to Do Instead: Building an Experimentation Program That Survives Contact with Reality

If you accept that most tests will be inconclusive — and you should, because the data is unambiguous — then program design needs to change in several concrete ways.

Redefine Success Metrics for Your Program

Stop measuring your experimentation program by win rate. Win rate is a vanity metric for testing programs, equivalent to measuring a baseball player by batting average alone while ignoring on-base percentage, slugging, and defensive contribution.

Instead, measure:

  • Learning velocity — How quickly does your organization generate actionable insights? An inconclusive test that redirects resources away from a dead-end category in two weeks is more valuable than a winning test that takes three months to reach significance.
  • Risk prevention value — Quantify the revenue impact of losers you caught before deployment. In our portfolio, the 12 losers represent concrete value protection that never appears in optimization case studies but absolutely appears in the P&L.
  • Decision quality — Are product and design decisions being informed by experimental evidence, even when that evidence is “no significant difference”? If so, decision quality is improving even when test outcomes are inconclusive.

Design for Learning, Not Just Winning

Structure your test roadmap to maximize information value, not expected win rate. This means deliberately including higher-risk, higher-uncertainty tests alongside safer incremental ones.

Our product comparison tests — 19 experiments, the largest category — reflected this philosophy. With 7 winners, 3 losers, and 9 inconclusive results, the category produced the most learning per test. The losers were as valuable as the winners because they established boundaries for what users would and would not accept in comparative shopping contexts.

Contrast this with a program optimized for win rate, which would run only safe, incremental tests on high-traffic pages. That program would look impressive in quarterly reviews and produce almost no meaningful learning.

Build Institutional Patience

The most underrated capability in experimentation is organizational patience. George Akerlof’s work on information asymmetry is relevant here — the people running the testing program have fundamentally different information than the executives evaluating it. The practitioners understand that inconclusive results are normal and valuable. The executives, informed by the same biased CRO literature everyone reads, expect a steady stream of winners.

Bridging this gap requires proactive communication. Share the distribution of outcomes across the industry. Explain why your 27% win rate is evidence of rigor, not failure. Frame inconclusive results in terms of their decision value, not their absence of a headline metric.

Richard Thaler’s concept of mental accounting is useful here. When executives mentally account for each test individually, inconclusive results feel like losses. When they mentally account for the program as a portfolio, the same results feel like normal variance within a profitable strategy.

Invest in Statistical Infrastructure

Many inconclusive results are not truly inconclusive — they are underpowered. Invest in:

  • Proper power analysis before every test. Know the minimum detectable effect size and make a deliberate decision about whether detecting that effect size is worth the test duration. (A sketch follows this list.)
  • Sequential testing methods that allow valid peeking at results without inflating false positive rates.
  • Bayesian frameworks that provide probability distributions of effect sizes rather than binary significant/not-significant verdicts. A test that shows a 70% probability of at least a 2% lift is far more useful for decision-making than a test that says “not significant at p < 0.05.” (A sketch of this framing follows the next paragraph.)

Our checkout tests — 5 experiments, 0 winners, 1 loser, 4 inconclusive — might have told a different story with larger sample sizes or longer test durations. Checkout pages often have low traffic volume relative to the site overall, making it harder to detect real effects. The inconclusive results may be hiding genuine small effects that a better-powered test could reveal.

The Uncomfortable Competitive Advantage

Here is the ultimate contrarian position: a 61% inconclusive rate is a competitive advantage, not a liability.

Organizations that understand this will invest in experimentation for the right reasons — risk management, learning velocity, decision quality — and sustain that investment through the inevitable quarters where the win column looks thin. Organizations that expect every test to produce a winner will abandon their programs at the first sign of “underperformance,” cycling through agencies and platforms in search of someone who will promise them a higher hit rate.

The latter group will eventually find what they are looking for: an agency willing to run underpowered tests with relaxed significance thresholds, producing a steady stream of “winners” that do not replicate in production. They will celebrate their improved win rate all the way to degraded conversion performance.

Meanwhile, the organizations that accepted the 61% inconclusive rate will have built something far more valuable: a body of institutional knowledge about what their customers actually respond to, grounded in rigorous evidence rather than optimistic storytelling.

That knowledge compounds. And unlike manufactured test wins, it does not evaporate when you look at it closely.

Frequently Asked Questions

What is a normal A/B test win rate?

Based on our data from 97 experiments and consistent with published research from Microsoft, Google, and Booking.com, a well-run experimentation program should expect a win rate between 20% and 35%. Our portfolio produced a 27% win rate, with the majority of tests (61%) returning inconclusive results. If your win rate is significantly higher than 35%, it is worth examining whether you are stopping tests as soon as they show significance, your significance thresholds are too relaxed, or you are only running safe, low-risk tests that generate wins but limited learning.

Does an inconclusive A/B test mean the test failed?

No. An inconclusive result means the difference between your variant and control, if one exists, is smaller than your test was designed to detect. This is a valid and informative finding. It tells you either that your proposed change does not meaningfully affect user behavior, or that the effect is too small to justify the development cost of implementing it. Both are useful conclusions that inform future resource allocation and hypothesis development.

How should I report inconclusive test results to stakeholders?

Frame inconclusive results in terms of decision value and risk management rather than wins and losses. Report the confidence interval around the estimated effect size — for example, “We are 95% confident the true effect of this change is between -1.2% and +1.8%, meaning any impact is likely too small to prioritize.” Also report the cumulative value of losers prevented. In our portfolio, 12% of tests identified changes that would have harmed performance — catching those before deployment has a concrete dollar value that should be part of every program review.

How many A/B tests should I run before expecting a winner?

With a typical win rate of 25-30%, you should expect roughly 3-4 tests for every winner. However, this varies significantly by test category and the maturity of the experience being tested. Our mobile tests produced winners at a 38% rate (roughly 1 in 3), while checkout and analysis page tests produced zero winners from a combined 10 experiments. The right question is not “how many tests until a winner” but “are we generating enough learning velocity to improve our hypothesis quality over time?”

Should I lower my statistical significance threshold to get more winners?

Absolutely not. Lowering your significance threshold — for example, from 95% to 90% confidence — mechanically increases your win rate by accepting more false positives. This is the statistical equivalent of lowering the bar in a high jump competition and claiming your athletes improved. The “winners” you produce at relaxed thresholds are more likely to be noise than signal, and they will not replicate when you ship the changes to 100% of traffic. Maintain rigorous thresholds and accept the inconclusive rate that comes with them. Your future self — and your conversion rates — will thank you.

Written by Atticus Li

Revenue & experimentation leader — behavioral economics, CRO, and AI. CXL & Mindworx certified. $30M+ in verified impact.