You ran a test. The dashboard shows 95% statistical significance. Your variant is up 4%. You're ready to ship.

Stop.

That number does not mean what most people think it means. And if you're making ship/no-ship decisions based on a misreading of statistical significance, you are guaranteed to ship losing tests eventually — probably already have.

After 100+ experiments, I've watched smart teams freeze at p-values and confident teams barrel through with fundamentally broken interpretations. This article fixes both problems.

What Statistical Significance Actually Means

Here's the definition most practitioners have internalized: "95% confidence means there's a 95% chance my variant is better than control."

That is wrong.

The correct interpretation is this: if there were truly no difference between the variant and control (the null hypothesis), we would see results at least this extreme only 5% of the time by random chance.

It's a statement about your data under a hypothetical world where your test had zero effect — not a statement about the probability that your test actually worked.

The distinction matters enormously. Statistical significance tells you how surprised you should be if nothing is real. It does not tell you how likely it is that something is real.

Think of it like a smoke detector. It tells you whether it's statistically unlikely that there's no fire. It doesn't tell you whether there's definitely a fire, how big it is, or whether you should evacuate.

The Null Hypothesis Framework

Classical A/B testing is built on null hypothesis significance testing (NHST). You start with an assumption: "there is no difference between variant and control." Then you collect data. Then you ask: how weird is this data, assuming the null hypothesis is true?

The p-value is that weirdness score. A p-value of 0.05 means: if there truly were no effect, you'd see results at least this extreme 5% of the time due to random variation alone.

When p < 0.05, you "reject the null hypothesis." You do not "confirm the alternative hypothesis." You do not "prove your variant is better." You simply have evidence that random chance alone is an unlikely explanation.

**Pro Tip:** When explaining stats to stakeholders, try this framing: "If we ran this experiment 100 times on a page with no actual difference, we'd expect to see results this dramatic in only 5 of those runs. We're seeing it now, which makes pure chance an unlikely explanation."
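That framing can be sanity-checked with a quick simulation. The sketch below (pure Python, not tied to any testing tool) runs a batch of A/A tests, where the null hypothesis is true by construction, and counts how often a standard two-proportion z-test fires at p < 0.05. With honest, single-look testing, the answer hovers near 5%.

```python
import math
import random

def two_prop_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a two-proportion z-test (normal approximation)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))  # P(|Z| >= |z|) under the null

random.seed(42)
runs, n, rate = 1000, 2000, 0.03  # 1,000 A/A tests: both arms convert at 3%
significant = 0
for _ in range(runs):
    conv_a = sum(random.random() < rate for _ in range(n))
    conv_b = sum(random.random() < rate for _ in range(n))
    if two_prop_pvalue(conv_a, n, conv_b, n) < 0.05:
        significant += 1

print(significant / runs)  # hovers near 0.05
```

Every "significant" result here is a false positive by construction — which is exactly what the 5% in "95% confidence" promises and nothing more.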

Statistical Significance vs. Practical Significance

This is where experiments go wrong in ways that cost real money.

A result can be statistically significant but practically meaningless. Here's a worked example.

You run an experiment on your checkout page. Baseline conversion rate: 3.00%. Variant conversion rate: 3.03%. That's a 1% relative lift. You have 500,000 visitors in the test. Your p-value: 0.001. Extremely statistically significant.

Now ask: should you ship this?

At 3.03% conversion on 100,000 monthly visitors, you're generating an additional 30 conversions per month. At a $50 average order value, that's $1,500/month in incremental revenue — or $18,000/year. If implementing this variant required $25,000 in development time, you'd be underwater for 17 months before breaking even.

Statistically significant. Practically not worth it.
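The arithmetic behind that verdict is worth making explicit. A throwaway sketch using the figures from the scenario above:

```python
# Figures from the worked checkout-page example
baseline_rate = 0.0300
variant_rate = 0.0303
monthly_visitors = 100_000
avg_order_value = 50          # dollars
implementation_cost = 25_000  # dollars of development time

extra_conversions = monthly_visitors * (variant_rate - baseline_rate)
monthly_revenue = extra_conversions * avg_order_value
annual_revenue = monthly_revenue * 12
payback_months = implementation_cost / monthly_revenue

print(round(extra_conversions))   # ~30 extra conversions/month
print(round(monthly_revenue))     # ~$1,500/month
print(round(annual_revenue))      # ~$18,000/year
print(round(payback_months, 1))   # ~16.7 months to break even
```

Notice that no p-value appears anywhere in this calculation. Practical significance is pure unit economics.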

The reverse is also true: a test can fail to reach statistical significance while representing a meaningful business effect — your sample was just too small to detect it reliably.

Practical significance means asking: "Is this lift large enough to be worth implementing given the cost and risk?" Statistical significance doesn't answer that. You have to.

**Pro Tip:** Before every test, define your minimum meaningful lift — the smallest effect that would justify shipping the variant. This is your Minimum Detectable Effect (MDE), and it should be set before you run a single visitor through the test.

The Peeking Problem

Here's a scenario that plays out in every experimentation program eventually.

You launch a test on Monday. By Wednesday, your variant is showing 94% statistical significance. Your CEO asks you Thursday morning whether the test is winning. You say "almost." By Friday, it hits 95%. You call it.

You've just made a potentially catastrophic statistical error.

When you check your results repeatedly during a test — what statisticians call "peeking" — you dramatically inflate your false positive rate. Here's why.

Imagine flipping a fair coin 1,000 times. At some point during those 1,000 flips, random variation will almost certainly produce a streak where the coin appears biased at 95% confidence — even though it's perfectly fair. If you stop the moment you hit that threshold, you'll call a fair coin biased.
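The coin intuition is easy to verify (my sketch, not from any particular study): flip many fair coins 1,000 times each, run a significance test after every flip past a small warm-up, and count how many coins look "biased" at 95% confidence at some point along the way.

```python
import math
import random

def crosses_threshold(n_flips=1000, warmup=30):
    """True if a fair coin's running z-test ever dips below p = 0.05."""
    heads = 0
    for flips in range(1, n_flips + 1):
        heads += random.random() < 0.5
        if flips < warmup:
            continue  # skip tiny samples where the normal approximation is poor
        z = abs(heads / flips - 0.5) / math.sqrt(0.25 / flips)
        if math.erfc(z / math.sqrt(2)) < 0.05:
            return True
    return False

random.seed(7)
coins = 200
crossed = sum(crosses_threshold() for _ in range(coins))
print(crossed / coins)  # far more than the nominal 5%
```

Every one of those coins is fair; the excess flags come entirely from stopping at the first lucky streak.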

A/B tests work the same way. Conversion rates fluctuate. Early in a test, the sample is small and variance is high. Hit refresh enough times and you'll catch a moment where random fluctuation looks like a real effect.

Simulations of repeated testing show that if you check a test 5 times during its run and stop whenever p < 0.05, your actual false positive rate roughly triples, from 5% to around 14%; with continuous monitoring it can climb past 20%.
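The same inflation shows up with scheduled peeks. This sketch (assumed numbers: 3% conversion rate, five evenly spaced interim checks) runs A/A experiments with no real effect and stops the moment any peek shows p < 0.05:

```python
import math
import random

def two_prop_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a two-proportion z-test (normal approximation)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(1)
runs, peeks, step, rate = 1000, 5, 500, 0.03  # A/A: no real difference exists
stopped_early = 0
for _ in range(runs):
    conv_a = conv_b = n = 0
    for _ in range(peeks):
        for _ in range(step):
            conv_a += random.random() < rate
            conv_b += random.random() < rate
        n += step
        if two_prop_pvalue(conv_a, n, conv_b, n) < 0.05:
            stopped_early += 1  # false positive: stopped on a random streak
            break

print(stopped_early / runs)  # well above the nominal 0.05
```

A single look at the end of these runs would flag about 5% of them; five looks with early stopping roughly triples that in this setup.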

**Pro Tip:** Set your experiment duration before launch based on sample size calculations, then don't check results until you hit that sample size. This is hard discipline. Build it into your process by only reviewing results in weekly read-outs, not daily dashboard checks.

How Optimizely's Stats Engine Solves the Peeking Problem

Optimizely built its Stats Engine specifically to address the peeking problem — and this is one of its most underappreciated features.

Traditional NHST (what you learned in stats class) assumes you collect all your data, then analyze it once. That's the fixed-horizon approach. Peeking invalidates this assumption.

Optimizely's Stats Engine uses sequential testing — a methodology that allows valid inference at any point during the experiment. It's built on a mixture sequential probability ratio test (an extension of Wald's classic SPRT) combined with false discovery rate control.

The practical effect: you can check your Optimizely results every day without inflating your false positive rate. The Stats Engine adjusts its thresholds dynamically to account for continuous monitoring. When it shows 95% statistical significance, that 95% is valid even if you've been watching the test run.

This is meaningfully different from running a chi-squared test in a spreadsheet after you hit your sample size. The Stats Engine is doing more sophisticated math in real time.

**Pro Tip:** When you see Optimizely's results fluctuate — sometimes showing statistical significance, then losing it, then regaining it — that's the Stats Engine working correctly. It's accounting for sample accumulation over time. Ordinary movement is nothing to worry about. Repeated oscillation in and out of significance, though, usually signals a novelty effect or a segment issue.

Why 95% Is a Convention, Not a Law

The 95% confidence threshold (corresponding to a p-value of 0.05) was essentially invented by Ronald Fisher in the 1920s as a convenient rule of thumb. It has no special mathematical significance.

Whether 90% or 99% is appropriate depends on the consequences of being wrong.

When 90% is fine:

  • Low-stakes copy tests (button labels, minor headlines)
  • Tests where you'll run follow-up validation before a major commitment
  • High-traffic pages where you'll accumulate data quickly regardless
  • Teams in early stages building an experimentation culture (speed matters more than perfect rigor)

When you need 99%:

  • Pricing changes (false positives here can destroy margin)
  • Regulatory or compliance contexts
  • Tests that will trigger irreversible technical changes
  • Experiments with significant engineering implementation costs

Optimizely allows you to set significance thresholds between 70% and 99%. Match the threshold to the stakes, not to convention.

**Pro Tip:** Create a simple decision table for your team: test type vs. required confidence level. Low-stakes UX test = 90%. Revenue-impacting test = 95%. Pricing or feature-gate test = 99%. Document it in your experimentation charter and enforce it consistently.

Type I and Type II Errors: The Business Translation

Statistics textbooks call them Type I and Type II errors. Here's what they actually mean to your business.

Type I Error (False Positive): You conclude your variant is better when it isn't. You ship a change that does nothing — or worse, actively hurts performance. Your significance threshold controls this. At 95% confidence, you accept a 5% false positive rate.

Type II Error (False Negative): You conclude there's no effect when there actually is one. You kill a genuinely good idea because your test didn't reach significance. Your statistical power controls this. At 80% power (the standard), you accept a 20% chance of missing a real effect.

Here's the business tension: reducing Type I errors (raising your confidence threshold) increases Type II errors if you keep the same sample size. You need more data to maintain the same power at a higher confidence level.

Most teams obsess over false positives and ignore false negatives. That's backwards for high-traffic experimentation programs. Missing a real 10% lift because you underpowered your test is just as expensive as shipping a losing variant — it just doesn't show up as clearly in your results.

The Relationship Between Sample Size, Effect Size, and Confidence

These three variables are locked together. Change one and you must adjust at least one other.

  • Larger sample size → can detect smaller effects at the same confidence level
  • Larger effect size → requires smaller sample to reach significance
  • Higher confidence requirement → requires larger sample for the same effect size

A 10% relative lift on a 3% baseline (0.3 percentage points absolute) requires roughly 50,000 visitors per variation at 95% confidence and 80% power. If you only have 10,000 visitors per week, that test runs for 10 weeks. That's probably too long.
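The 50,000-visitor figure comes from a standard two-proportion sample size formula. Here's a sketch in pure Python (exact results vary slightly depending on the formula and calculator you use):

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors per variation for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# 10% relative lift on a 3% baseline, 95% confidence, 80% power
print(sample_size_per_variation(0.03, 0.10))              # ~53,000 per variation
# Same test at 99% confidence demands substantially more traffic
print(sample_size_per_variation(0.03, 0.10, alpha=0.01))
```

The second call also quantifies the Type I/Type II tension from the previous section: moving from 95% to 99% confidence at the same power requires roughly half again as much traffic.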

The solution isn't to lower your confidence threshold arbitrarily. It's to either accept longer test durations or set your MDE to a larger effect that's detectable with your available traffic.

**Pro Tip:** Run your sample size calculation before you write the test hypothesis, not after. If your traffic won't support detection of a meaningful lift in a reasonable timeframe, don't run the test — you'll get noise at best and a confident false positive at worst.

Common Mistakes

Mistake 1: Stopping as soon as you hit significance. Unless you're using Optimizely's Stats Engine (which supports early stopping), hitting 95% confidence on day 3 of a planned 4-week test means nothing. You planned 4 weeks for a reason.

Mistake 2: Treating statistical significance as binary. A p-value of 0.049 and a p-value of 0.051 are not meaningfully different. The universe doesn't change at the 0.05 boundary. Results near the threshold should trigger more data collection, not a coin flip decision.

Mistake 3: Ignoring effect size. "We reached significance" is an incomplete sentence. The full sentence is "We reached significance on a [X%] lift." A 0.2% lift reached significance on 5 million visitors. A 15% lift didn't reach significance on 200 visitors. These require completely different responses.

Mistake 4: Running tests without setting a hypothesis first. Statistical significance testing is confirmatory, not exploratory. You state a hypothesis, collect data, test it. If you look at 20 metrics and find 1 that's significant, that's not a discovery — that's noise (see false discovery rate).

Mistake 5: Confusing confidence level with accuracy. 95% confidence doesn't mean your estimate of the lift is 95% accurate. It means, roughly, that you've constructed a process that produces correct inferences 95% of the time across many experiments.

What to Do Next

  1. Audit your last 5 called tests. For each one, document: what was the lift? What was the sample size? Was the lift practically significant given implementation cost? If you can't answer these questions, your decision process has a gap.
  2. Set significance thresholds by test type. Create a one-page reference sheet mapping test categories to required confidence levels. Default to 95%, use 90% for low-stakes tests, require 99% for pricing and feature-gating.
  3. Stop peeking — or switch to Stats Engine. If your team is using classical NHST, implement a rule: no results review until the pre-planned sample size is reached. If you're on Optimizely's Stats Engine, you can check results freely — but still commit to a minimum runtime (at least one full business cycle).
  4. Add "practical significance" to every readout. Before any test result is presented to stakeholders, the presenter must state: "This lift represents $X in annual revenue impact" or "This would generate N additional conversions per month." Make the business case explicit every time.
Written by Atticus Li

Revenue & experimentation leader — behavioral economics, CRO, and AI. CXL & Mindworx certified. $30M+ in verified impact.