What Everyone Gets Wrong About Statistical Significance

Ask ten CRO practitioners what "95% statistical significance" means and you'll get roughly six different answers. Most of them will be wrong. The most common answer — "we're 95% sure the variant is better" — is incorrect in a specific way that matters enormously for how you make decisions.

Getting this wrong doesn't just make you sound imprecise at conferences. It leads to shipping losing variants, misreporting test results to stakeholders, and building a CRO program on a foundation that quietly erodes trust every time a "winner" fails to hold in production.

Here's the correct definition, why it matters, and how to use it to make better decisions — not just pass a statistics quiz.

The Correct Definition

Statistical significance tests a specific question: given that there is NO real difference between control and variant, how likely is it that we'd observe a difference this large (or larger) by chance?

That probability is the p-value. When we say p < 0.05, we're saying: if the null hypothesis (no real effect) were true, we'd see a result this extreme less than 5% of the time due to random sampling variation.

What it does NOT mean:

  • "There's a 95% chance the variant is better"
  • "We're 95% confident our lift estimate is accurate"
  • "There's only a 5% chance we're wrong"

The distinction matters because p-values are a statement about the probability of data given a hypothesis, not a statement about the probability of a hypothesis given data. If that feels like a hairsplitting philosophical difference, read on — it has real consequences for how you interpret test results.
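A quick back-of-the-envelope Bayes calculation makes the distinction concrete. The inputs below (a 10% prior that any given variant has a real effect, 80% power) are illustrative assumptions, not data from any real program:

```python
# P(real effect | significant result) via Bayes' rule.
# All inputs are illustrative assumptions, not measured values.
prior_real = 0.10   # assume 10% of tested variants have a real effect
power = 0.80        # P(significant | real effect)
alpha = 0.05        # P(significant | no real effect)

p_significant = prior_real * power + (1 - prior_real) * alpha
p_real_given_sig = (prior_real * power) / p_significant

print(f"P(significant)               = {p_significant:.3f}")
print(f"P(real effect | significant) = {p_real_given_sig:.2f}")
```

Under these assumptions, only about 64% of "significant" results reflect a real effect — nowhere near the 95% the common misreading implies. The p-value alone cannot give you the probability of the hypothesis given the data; that requires a prior.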

The Coin Flip Analogy

Here's the intuition without the notation.

Imagine you flip a coin 100 times and get 60 heads. You suspect the coin is biased. Statistical significance testing asks: if this coin were actually fair (null hypothesis = 50/50), what's the probability of getting 60 or more heads in 100 flips just by chance?

The answer is about 2.8% — so p ≈ 0.028, which is below the conventional 0.05 threshold. By the standard rule, you'd "reject the null hypothesis" and conclude the coin probably isn't fair.

Now flip it 1,000 times and get 520 heads. Same question: if the coin were fair, how often would you get 520 or more heads? About 11% of the time, so p ≈ 0.11. Not significant. The larger sample is a sharper instrument, but the deviation it's measuring shrank from 10 points to 2, and a 2-point gap is well within ordinary sampling noise even at that sample size.

This is the core of why "the numbers look good" is not sufficient justification for calling a test. It's not how big the gap is — it's how much data you have relative to the variance in the metric.
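Both coin examples can be checked exactly with the binomial distribution, using only the standard library:

```python
import math

def binom_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the one-sided p-value."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

print(f"60+ heads in 100 flips:    p = {binom_tail(100, 60):.4f}")
print(f"520+ heads in 1,000 flips: p = {binom_tail(1000, 520):.4f}")
```

The exact tail probabilities reproduce the figures discussed above: a 10-point deviation in 100 flips clears the 0.05 bar; a 2-point deviation in 1,000 flips does not.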

**Pro Tip:** Statistical significance is about the relationship between effect size, sample size, and variance. A tiny effect can be significant with enough data. A large effect can be non-significant with too little data. Always look at all three, not just the significance flag.

What p < 0.05 Means in Business Terms

In business terms, p < 0.05 means: if I ran this exact experiment 100 times on a null effect, I'd get a false positive about 5 times.

So when you test a variant that has no real effect at 95% confidence, there's still roughly a 1-in-20 chance it comes up as a "winner" by luck of the sample. At scale, if you run 50 tests per year and most of your variants have no real effect, that's about 2-3 false winners per year even with perfect methodology. (Note: this is not the same as saying 5% of your declared winners are false. That fraction depends on how often your variants have real effects, which is the p-value misreading all over again.)

This is the base rate of the statistical framework. It's not a bug, it's a feature — the threshold of evidence required to make a decision. The question is whether your process treats it that way.

Most teams do not. They treat statistical significance as a binary threshold: below 95% = uncertain, above 95% = proven. The more accurate mental model is a probability dial that shifts your confidence, but doesn't deliver certainty.

Statistical Significance vs. Practical Significance

Here's a failure mode that almost never gets discussed in introductory CRO content: you can have a result that is statistically significant but completely meaningless for business decisions.

If you have 10 million visitors per month hitting a page and you run a test long enough, you can detect a 0.001% lift with statistical significance. p < 0.05. "Winner." But a 0.001% lift on a $20M/year revenue baseline is about $200 per year. The engineering cost to implement the variant exceeds that many times over. The variant is a statistically significant non-result.

Practical significance, closely related to "effect size" and operationalized before a test as your minimum detectable effect (MDE), asks whether the magnitude of the lift actually matters to the business. That's a business judgment, not a statistics calculation.

Before running any test, decide: what is the minimum lift that would justify implementing this change? Write that number down. That's your MDE. A result below your MDE is a non-result, regardless of its p-value.

**Pro Tip:** In high-traffic programs, set your MDE higher than the default. A statistically significant 0.5% lift on checkout conversion is real but may not clear the ROI bar once you factor in engineering hours to maintain the variant. Focus statistical power on lifts that would actually change a business decision.

Type I and Type II Errors: The Two Ways You Can Be Wrong

There are exactly two types of mistakes in hypothesis testing:

Type I error (false positive): You conclude the variant is better, but it isn't. The statistical significance was driven by random chance. When there is truly no effect, your false positive rate equals your alpha (typically 5% at 95% confidence).

Type II error (false negative): The variant actually is better, but you failed to detect it. You conclude "no significant difference" and move on, leaving real lift on the table. Your false negative rate is (1 - statistical power), where power is typically set to 80%.

In business terms:

  • Type I error: you implement a bad change. You waste engineering resources, may degrade the experience for users, and ship a variant that doesn't actually convert better. Over time, your team's results in production diverge from test results — which destroys trust in the program.
  • Type II error: you miss a real win. You fail to improve conversion. Over time, your program looks less impactful than it really is, and you leave revenue on the table.

Both errors have costs. Teams typically obsess over Type I (false positives) because they're more visible — the variant ships and doesn't perform. But Type II errors are just as expensive; they're just invisible. A CRO program with 90% statistical power is correctly identifying real effects 9 times out of 10. At 60% power (which many underpowered tests effectively have), you're missing 40% of your real wins.

**Pro Tip:** When evaluating your sample size, optimize for 80% power as a baseline. If you're in a high-stakes domain (pricing, checkout), bump to 90%. Lower power saves time but increases the rate of missed wins — and missed wins compound into a program that looks weaker than it is.
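To make the power trade-off concrete, here is the standard two-proportion sample-size formula as a sketch. The baseline rate and MDE below are hypothetical; confirm real plans against your testing platform's calculator:

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline, mde_relative, alpha=0.05, power=0.80):
    """Visitors needed per arm for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + mde_relative)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. ~1.96 at alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # e.g. ~0.84 at 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return math.ceil(n)

# Hypothetical example: 5% baseline conversion, 10% relative MDE
print(sample_size_per_arm(0.05, 0.10))              # ~31,000 per arm at 80% power
print(sample_size_per_arm(0.05, 0.10, power=0.90))  # noticeably more at 90% power
```

Notice how the traffic cost of bumping power from 80% to 90% is substantial — which is exactly why it's worth paying only in high-stakes domains.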

Why 95% Is a Convention, Not a Law

The 95% confidence threshold (alpha = 0.05) traces back to R.A. Fisher's 1925 Statistical Methods for Research Workers, where he treated it as a convenient cutoff for judging significance. It has no special mathematical status. It became an industry norm because journal editors found it useful, and it spread from academic research into applied statistics.

In practice, the right threshold depends on the cost asymmetry of your error types:

When 90% confidence (p < 0.10) is defensible:

  • Low-stakes tests: copy changes, color tweaks, image swaps
  • High-traffic, fast-iterating programs where you can quickly verify the winner in a follow-up test
  • Early-stage programs where learning velocity matters more than precision

When you should hold to 95%:

  • Most standard A/B tests in established programs
  • Tests where the variant will be hard to roll back once shipped

When you need 99% confidence (p < 0.01):

  • Pricing changes
  • Checkout flow changes
  • Any test where a false positive has a substantial financial downside
  • Tests where the losing variant could cause regulatory or compliance issues

The framework is: calibrate your alpha to the cost of a Type I error in your specific context. Don't blindly apply 95% because that's what the tool defaults to.

**Pro Tip:** When you present test results to stakeholders, report both statistical significance AND practical significance (the lift estimate and its confidence interval). "Statistically significant at 97% confidence, with an estimated lift of 8-14%" is a complete result. "Statistically significant" by itself is not.
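A complete result of that shape can be computed directly. This sketch uses a standard Wald interval for the difference in conversion rates; the visitor and conversion counts are hypothetical:

```python
from statistics import NormalDist

def lift_with_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Absolute lift (variant minus control) with a Wald confidence interval."""
    pa, pb = conv_a / n_a, conv_b / n_b
    se = (pa * (1 - pa) / n_a + pb * (1 - pb) / n_b) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = pb - pa
    return diff, diff - z * se, diff + z * se

# Hypothetical test: control 5,000/100,000, variant 5,400/100,000
diff, lo, hi = lift_with_ci(5_000, 100_000, 5_400, 100_000)
print(f"lift: {diff:.4f} (95% CI {lo:.4f} to {hi:.4f})")
```

Reporting the interval, not just the point estimate, is what tells stakeholders how much the true lift could plausibly differ from the number on the dashboard.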

The Peeking Problem: How Daily Check-ins Inflate Your False Positive Rate

This is one of the most common and costly mistakes in applied CRO. The scenario: a test launches Monday. By Thursday, the dashboard shows p = 0.03. Someone in the business sees it and wants to ship. The CRO lead approves because "it hit significance."

What happened statistically: the test wasn't designed to make a decision at Thursday's sample size. If you check results at 5 evenly spaced interim points, and you're willing to stop at the first significant result, your effective false positive rate is no longer 5%. It's roughly 14%, nearly triple the nominal rate, and it keeps climbing with every additional look.

The intuition: p-values fluctuate. Early in a test, there's high variance, which means the p-value will cross 0.05 frequently just due to random fluctuation — even when there's no real effect. If you stop whenever it crosses, you're cherry-picking a favorable fluctuation and calling it a result.
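You can watch this inflation happen in a small Monte Carlo simulation. The sketch below runs thousands of A/A "tests" (no real effect, modeled as pure-noise batches) and compares a single pre-planned look against peeking at five evenly spaced interim points:

```python
import math
import random
from statistics import NormalDist

random.seed(42)
SIMS, LOOKS = 4000, 5
crit = NormalDist().inv_cdf(0.975)  # two-sided 5% critical value, ~1.96

peeking_fp = fixed_fp = 0
for _ in range(SIMS):
    s = 0.0
    crossed = False
    for look in range(1, LOOKS + 1):
        s += random.gauss(0.0, 1.0)  # one batch of null (no-effect) data
        z = s / math.sqrt(look)      # standardized test statistic at this look
        if abs(z) > crit:
            crossed = True           # a peeker would stop here and "ship"
    peeking_fp += crossed            # significant at ANY of the 5 looks
    fixed_fp += abs(z) > crit        # significant at the FINAL look only

print(f"fixed-horizon false positive rate: {fixed_fp / SIMS:.3f}")
print(f"peek-at-5-looks false positive rate: {peeking_fp / SIMS:.3f}")
```

The peeking rate lands around triple the fixed-horizon rate, even though every individual look applied the same 5% threshold.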

The fix for classical testing: decide your sample size before you launch, check results once after you've hit that sample size. No interim looks.

The fix for sequential testing: use a tool with a valid sequential testing engine. Optimizely's Stats Engine is built on this methodology — it uses always-valid confidence sequences that maintain your target false positive rate regardless of when you look. You can check daily and the system's confidence bounds are calibrated to account for the sequential nature of the data.

**Pro Tip:** If your stakeholders are pushing for early decisions (and they always are), the right answer is to switch to a sequential testing methodology — not to cave on sample size requirements. Frame it as "our testing platform is designed for this" rather than "I won't give you an answer."

Significance Theater: The Quiet Epidemic in CRO

I'll be direct: a significant portion of A/B test "winners" reported in industry case studies would not replicate under rigorous conditions. This isn't fraud — it's the accumulated effect of significance theater.

Significance theater describes test programs where results are presented as significant without meeting the actual mathematical requirements:

  • Calling a winner at 60% confidence because the deadline is Friday
  • Running tests until they hit significance, regardless of pre-set sample sizes
  • Reporting results selectively (publishing wins, quietly archiving losses)
  • Using multiple metrics as outcome variables and reporting the one that happened to be significant

The cumulative effect: your "win rate" looks great. Your post-launch impact is disappointing. Leadership loses faith in the program. The CRO team blames "regression to the mean" when the real issue is the quality of the original test conclusions.

The fix is cultural and process-driven. Pre-register your primary metric and sample size before launch. Lock the test dashboard so only the program lead can see results before the test completes. Report all results — wins, losses, and inconclusive — in your program archive.

What to Do Next

  1. Audit your last 5 test "winners." For each one: was the primary metric pre-registered before launch? Was the result checked only after hitting the planned sample size? Was statistical power above 80%? If not, the result is questionable.
  2. Set confidence thresholds by test type. Use 99% for pricing and checkout. 95% for standard tests. 90% only for low-stakes tests where you can validate the result quickly.
  3. Implement a pre-registration process. Before any test launches, document: primary metric, MDE, required sample size, planned end date. Lock it. Don't change it based on interim results.
  4. If you're using Optimizely, confirm Stats Engine is enabled. This gives you valid sequential testing — daily monitoring without inflating your false positive rate.
  5. When reporting results to stakeholders, always include: confidence level, lift estimate with confidence interval, and sample size achieved vs. planned. "We hit significance" is not a complete result.

For a complete methodology on test design, metric selection, and statistical setup in Optimizely, see the Optimizely Practitioner Toolkit at atticusli.com/guides/optimizely-practitioner-toolkit.

Written by Atticus Li

Revenue & experimentation leader — behavioral economics, CRO, and AI. CXL & Mindworx certified. $30M+ in verified impact.