The Most Misunderstood Number in Experimentation

Every experimentation platform puts a significance number front and center. Teams celebrate when it crosses a threshold. They kill tests when it does not. And most of them have no idea what the number actually means.

Statistical significance is not the probability that your variant is better. It is not the chance your results are correct. It is not a measure of how big your effect is. Getting this wrong leads to bad decisions, wasted traffic, and a false sense of confidence that erodes trust in your entire experimentation program.

Let us fix that.

What Statistical Significance Actually Measures

Statistical significance answers one very specific question: if there were truly no difference between your control and variant, how likely would you be to see a result at least as extreme as the one you observed?

That is it. Nothing more.

When someone says a test reached significance at a given confidence level, they are saying the observed difference would be unlikely to appear by random chance alone, assuming the null hypothesis is true. The null hypothesis is the assumption that there is no real difference — that your variant does nothing.

This is a conditional probability statement. It tells you something about the data given a specific assumption. It does not tell you the probability that the assumption is true or false.
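To make that conditional statement concrete, here is a minimal sketch of the calculation behind a simple two-proportion test. The function and all traffic and conversion numbers are hypothetical, invented purely for illustration.

```python
# A minimal sketch of the question a p-value answers, using a two-proportion
# z-test. All numbers are hypothetical.
from math import sqrt, erfc

def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """P(seeing a difference at least this extreme | no true difference)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)       # conversion rate if the null is true
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se                           # how extreme the observed gap is
    return erfc(abs(z) / sqrt(2))                  # two-sided tail probability

# Hypothetical test: 10,000 users per arm, 500 vs 560 conversions.
print(two_sided_p_value(500, 10_000, 560, 10_000))  # ~0.06
```

The output is not the probability that the variant works. It is the probability of data this extreme in a world where the variant does nothing.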

Why This Distinction Matters

Here is where teams get into trouble. They run a test, see a significant result, and conclude: "There is a high probability that our variant is better."

That conclusion does not follow from the math. What the math says is: "If the variant had zero effect, this data would be rare." Those are not the same statement.

The difference matters because:

  • Base rates matter. If you test wild ideas that rarely work, most of your significant results will be false positives — even at conventional confidence levels. The prior probability of the hypothesis being true affects how you should interpret the result (the sketch after this list puts rough numbers on this).
  • Effect size matters. A significant result with a tiny effect may not be worth shipping. Significance tells you the result is unlikely due to chance. It does not tell you the result is meaningful for your business.
  • Multiple comparisons matter. If you test dozens of metrics, some will appear significant by chance alone. Significance on one metric does not account for the total number of comparisons you ran.
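To put rough numbers on the first point about base rates, here is a back-of-the-envelope sketch. The prior, power, and threshold are assumptions chosen purely for illustration.

```python
# A rough illustration of why base rates matter. Assumed numbers: only 10%
# of tested ideas truly work, tests run at alpha = 0.05 with 80% power.
prior_true = 0.10   # assumed share of hypotheses that are actually true
alpha = 0.05        # false positive rate when the null is true
power = 0.80        # chance a real effect reaches significance

true_positives = prior_true * power          # real effects that come back significant
false_positives = (1 - prior_true) * alpha   # null effects that come back significant anyway

share_real = true_positives / (true_positives + false_positives)
print(f"Share of significant results that are real wins: {share_real:.0%}")  # ~64%
```

Under those assumptions, roughly a third of your significant results are false positives, even though every individual test was run at a conventional threshold.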

The Threshold Problem

Most teams use a conventional confidence level without questioning why. The threshold is arbitrary. It is a convention, not a law of nature.

The right threshold depends on your context:

  • High-cost changes (redesigns, pricing changes, infrastructure migrations) warrant a stricter threshold because the cost of being wrong is high.
  • Low-cost, easily reversible changes (button color, copy tweaks) can tolerate a looser threshold because you can always roll back.
  • High-traffic products can afford stricter thresholds because they accumulate sample size quickly.
  • Low-traffic products may need to accept looser thresholds or use alternative methods entirely.

There is no universal right answer. The threshold should reflect the decision you are making, not a number you inherited from a statistics textbook.

How Teams Misuse Significance in Practice

Stopping Tests Early

The most common mistake is checking results daily and stopping the test as soon as significance appears. This dramatically inflates your false positive rate because significance fluctuates — it can appear and disappear multiple times before stabilizing.

A result that is significant on day three might not be significant on day seven. If you stop at day three, you are locking in a noisy estimate and calling it truth.
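A small simulation makes the inflation visible. The traffic numbers and the fourteen-day horizon below are arbitrary assumptions; the point is only the gap between the nominal and realized false positive rates.

```python
# Simulated A/A tests (no true difference) with a naive daily peek:
# stop and declare a winner the first time |z| crosses 1.96.
import numpy as np

rng = np.random.default_rng(0)
n_tests, days, users_per_day, rate = 2_000, 14, 1_000, 0.05
false_positives = 0

for _ in range(n_tests):
    a = rng.binomial(users_per_day, rate, size=days).cumsum()  # cumulative conversions, control
    b = rng.binomial(users_per_day, rate, size=days).cumsum()  # cumulative conversions, variant
    n = users_per_day * np.arange(1, days + 1)                 # cumulative users per arm
    pooled = (a + b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    z = ((b - a) / n) / se
    if np.any(np.abs(z) > 1.96):                               # "significant" on any daily check
        false_positives += 1

print(false_positives / n_tests)   # well above the nominal 5%
```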

Ignoring Non-Significant Results

When a test does not reach significance, teams often conclude "the variant had no effect." That is not what a non-significant result means. It means you did not find sufficient evidence of an effect. The effect might exist but be too small for your sample size to detect.

Absence of evidence is not evidence of absence. An underpowered test cannot distinguish between "no effect" and "small effect we cannot detect."
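A quick power check shows how little an underpowered test can say. The baseline rate, lift, and sample size below are hypothetical, and the normal-approximation formula is a back-of-the-envelope method, not your platform's exact calculation.

```python
# Approximate power of a two-proportion test at alpha = 0.05.
from math import sqrt
from statistics import NormalDist

def approx_power(p_base, relative_lift, n_per_arm, alpha=0.05):
    p_var = p_base * (1 + relative_lift)
    se = sqrt((p_base * (1 - p_base) + p_var * (1 - p_var)) / n_per_arm)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z_effect = (p_var - p_base) / se
    return 1 - NormalDist().cdf(z_crit - z_effect)

# 5% baseline, 5% relative lift, 10,000 users per arm:
print(approx_power(0.05, 0.05, 10_000))   # roughly 12%, far below the usual 80% target
```

With power that low, a non-significant result was the most likely outcome whether or not the effect was real, so it tells you almost nothing either way.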

Cherry-Picking Metrics

When the primary metric is not significant, teams sometimes search through secondary metrics until they find one that is. This is multiple comparisons by another name, and it inflates the false positive rate for the same reason.

If you define the primary metric after seeing the data, you are not testing a hypothesis. You are storytelling.
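The arithmetic behind the inflation is simple. Assuming independent metrics and a 5% threshold:

```python
# Chance that at least one of k truly unaffected metrics looks significant.
alpha = 0.05
for k in (1, 5, 10, 20):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:>2} metrics: {p_any:.0%} chance of at least one false positive")
# 1 -> 5%, 5 -> 23%, 10 -> 40%, 20 -> 64%
```

Real metrics are correlated, so the true inflation is smaller than this independence assumption suggests, but the direction of the problem is the same.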

Confusing Statistical and Practical Significance

A result can be statistically significant but practically meaningless. With enough traffic, even a negligible difference will eventually reach significance. The question is whether that difference matters for your business.

Always pair significance with effect size. A significant result that translates to a fractional improvement in your key metric is probably not worth the engineering effort to ship permanently.
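Here is a sketch of that effect, using the same kind of hypothetical two-proportion setup as earlier: a 1% relative lift that no one would ship on its own becomes "significant" purely by accumulating traffic.

```python
# How a negligible difference reaches significance at scale (hypothetical rates).
from math import sqrt, erfc

def p_value(p_a, p_b, n_per_arm):
    pooled = (p_a + p_b) / 2
    se = sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
    return erfc(abs(p_b - p_a) / se / sqrt(2))

# 5.00% vs 5.05% conversion: a 1% relative lift.
for n in (100_000, 1_000_000, 10_000_000):
    print(f"n per arm = {n:>10,}: p = {p_value(0.0500, 0.0505, n):.2g}")
# The p-value keeps shrinking as traffic grows, but the effect never gets bigger.
```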

A Better Framework for Interpreting Results

Instead of asking "is it significant?" try asking these questions:

  1. What is the estimated effect size, and does it matter for our business? Look at the confidence interval, not just the point estimate. If the interval includes effects that are too small to care about, you may need more data (a sketch of this check follows the list).
  2. How powered was this test? If you ran with low statistical power, non-significant results tell you almost nothing. You were unlikely to detect the effect even if it existed.
  3. How many comparisons did we make? If you tested multiple variants, segments, or metrics, adjust your interpretation accordingly.
  4. What was our prior expectation? If the hypothesis was a long shot, even a significant result should be treated with more skepticism.
  5. What is the cost of being wrong in each direction? If shipping a false positive is expensive, demand stronger evidence. If missing a true positive is expensive, consider a looser threshold.
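For the first question, here is a minimal sketch of reading the interval against a business threshold. The normal-approximation interval and all numbers are illustrative assumptions, not a prescription.

```python
# 95% confidence interval for a difference in conversion rates.
from math import sqrt
from statistics import NormalDist

def diff_ci(conv_a, n_a, conv_b, n_b, level=0.95):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

low, high = diff_ci(500, 10_000, 560, 10_000)
print(f"95% CI for the absolute lift: [{low:+.4f}, {high:+.4f}]")
# If lifts below +0.003 are not worth shipping, an interval that straddles
# that line means the data has not yet answered the business question.
```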

The Organizational Challenge

The hardest part of getting significance right is not mathematical. It is cultural.

Organizations that reward "winning" tests create incentive structures where people game significance. They peek at results, cherry-pick metrics, and design tests to confirm existing beliefs rather than genuinely learn.

Building a healthy experimentation culture means rewarding learning, not just wins. It means celebrating well-designed tests with null results because those tests saved the company from shipping something that would not have worked. It means being honest about uncertainty instead of pretending that a significance threshold eliminates it.

Moving Beyond Significance

Statistical significance is a useful tool when understood correctly. But it is just one input into a decision. The best experimentation programs combine significance with:

  • Confidence intervals for understanding the range of plausible effects
  • Power analysis for designing tests that can actually detect meaningful differences (a rough sizing sketch follows this list)
  • Decision frameworks that account for the costs and benefits of each possible action
  • Qualitative insight from user research, session recordings, and customer feedback
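For the power-analysis point, a rough sizing sketch looks like this. The formula is a standard normal approximation, and the baseline rate and minimum detectable effect are assumptions you would set per experiment.

```python
# Approximate users needed per arm for a two-proportion test.
from math import ceil
from statistics import NormalDist

def n_per_arm(p_base, relative_mde, alpha=0.05, power=0.80):
    p_var = p_base * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return ceil(variance * (z_alpha + z_power) ** 2 / (p_var - p_base) ** 2)

# Detecting a 5% relative lift on a 4% baseline conversion rate:
print(n_per_arm(0.04, 0.05))   # roughly 150,000 users per arm
```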

Significance tells you whether the data is surprising under the null hypothesis. Your job is to decide what to do with that information — and that requires judgment, context, and a willingness to engage with uncertainty rather than hide behind a threshold.

FAQ

Is a higher confidence level always better?

Not necessarily. Higher confidence levels require larger sample sizes and longer test durations. The right level depends on the cost of a wrong decision and the traffic available. For most product experiments, conventional thresholds work well.

Can a test be significant but wrong?

Yes. Significance means the result is unlikely under the null hypothesis, but unlikely events do happen. At conventional confidence levels, roughly one in twenty tests of a change that truly does nothing will still come back significant, and the share of your significant results that are false positives can be far higher once you account for base rates, peeking, and multiple comparisons.

Should I use one-tailed or two-tailed tests?

Two-tailed tests are almost always the right choice. You should care whether your variant made things worse, not just whether it made things better. A one-tailed test cannot flag harm at all, and switching to one after seeing which way the data leans effectively doubles your false positive rate.
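To see the practical difference, consider the same hypothetical data evaluated both ways. The numbers are invented, and the halving only applies when the observed effect lands in the favored direction.

```python
# One data set, two verdicts: two-tailed vs one-tailed p-value.
from math import sqrt, erfc

def z_stat(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

z = z_stat(500, 10_000, 558, 10_000)
p_two = erfc(abs(z) / sqrt(2))
p_one = p_two / 2   # counts only the favored direction
print(f"two-tailed p = {p_two:.3f}, one-tailed p = {p_one:.3f}")
# The same data clears a 0.05 bar one-tailed but not two-tailed.
```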

What do I do if my test never reaches significance?

First, check whether the test was adequately powered. If not, the test was unlikely to detect the effect regardless. Consider whether the minimum detectable effect was realistic, whether the test ran long enough, and whether the traffic allocation was sufficient.

Written by Atticus Li

Revenue & experimentation leader — behavioral economics, CRO, and AI. CXL & Mindworx certified. $30M+ in verified impact.