
Statistical Significance

What statistical significance actually means in A/B testing, why p-values are misunderstood, and how to set the right threshold for your experimentation program.

Statistical significance tells you how surprising your observed difference between test variants would be if there were actually no difference at all. In experimentation, we typically express this through a p-value — the probability of seeing results at least as extreme as yours if there were no real difference between variants. Note what this is not: it is not the probability that your result is real, and not the probability that the variant is better.
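Concretely, for a two-variant conversion test the p-value usually comes from a two-proportion z-test. This sketch uses made-up conversion counts to show the calculation (the function name and numbers are illustrative, not from any specific tool):

```python
from math import erfc, sqrt

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates
    (pooled two-proportion z-test, normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = abs(p_a - p_b) / se
    # P(|Z| >= z) if the variants truly convert at the same rate
    return erfc(z / sqrt(2))

# Hypothetical counts: 320/10,000 conversions vs 376/10,000
p = two_proportion_p_value(320, 10_000, 376, 10_000)
print(round(p, 4))
```

A small p-value here says only that these counts would be unlikely under "no difference" — it makes no claim about how likely the lift is to be real.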

What practitioners get wrong

Most teams treat statistical significance as a binary pass/fail gate. They pick a p-value threshold (usually 0.05), run the test until they hit it, and ship. This creates two problems.

First, peeking. If you check your results daily and stop the test the moment you see p < 0.05, you’re inflating your false positive rate dramatically. A test that would produce a 5% false positive rate at a fixed sample size can produce 20-30% false positives with repeated peeking.
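The inflation is easy to see by simulation. This sketch (hypothetical traffic numbers) runs A/A tests — both arms convert at the same rate, so any "win" is a false positive — and stops at the first of ten interim looks where p < 0.05:

```python
import random
from math import erfc, sqrt

def two_sided_p(conv_a, conv_b, n_per_arm):
    """Pooled two-proportion z-test, two-sided (normal approximation)."""
    pooled = (conv_a + conv_b) / (2 * n_per_arm)
    se = sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    if se == 0:
        return 1.0
    z = abs(conv_a - conv_b) / n_per_arm / se
    return erfc(z / sqrt(2))

random.seed(0)
SIMS, LOOKS, PER_LOOK, RATE = 1000, 10, 400, 0.05
false_positives = 0
for _ in range(SIMS):
    a = b = n = 0
    for _ in range(LOOKS):
        # Both arms draw from the SAME conversion rate: no true effect
        a += sum(random.random() < RATE for _ in range(PER_LOOK))
        b += sum(random.random() < RATE for _ in range(PER_LOOK))
        n += PER_LOOK
        if two_sided_p(a, b, n) < 0.05:  # peek and stop on "significance"
            false_positives += 1
            break
print(false_positives / SIMS)  # well above the nominal 0.05
```

With ten peeks, the realized false positive rate lands in the high teens rather than the 5% the threshold promises.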

Second, threshold worship. A p-value of 0.049 is not meaningfully different from 0.051. The 0.05 threshold is a convention, not a law of physics. What matters is whether the evidence is strong enough to justify the business decision you’re about to make.

How to use it correctly

Set your significance threshold before the test starts. For most CRO experiments, 95% confidence (p < 0.05) is reasonable. For high-stakes changes — pricing, checkout flow — consider 99%. For low-risk tests like copy variations, 90% can be appropriate.

More importantly, pair your significance threshold with a minimum detectable effect and calculate the required sample size upfront. This gives you a fixed test duration. Run the test for that duration. Don’t peek, don’t stop early, don’t extend because you’re “almost there.”

If you need the flexibility to stop early, use sequential testing or Bayesian methods — both are designed for continuous monitoring without inflating error rates.
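As a sketch of how sequential designs work, a Pocock-style plan replaces the per-look 0.05 threshold with a stricter constant one — roughly 0.0158 for five equally spaced looks, per Pocock's published tables — so the overall false positive rate stays near 5%. Rerunning the A/A simulation with that threshold (hypothetical numbers again):

```python
import random
from math import erfc, sqrt

# Nominal per-look threshold for 5 equally spaced looks at overall
# alpha = 0.05 (from Pocock's 1977 tables; a sketch, not a full design)
POCOCK_THRESHOLD = 0.0158

def two_sided_p(conv_a, conv_b, n_per_arm):
    """Pooled two-proportion z-test, two-sided (normal approximation)."""
    pooled = (conv_a + conv_b) / (2 * n_per_arm)
    se = sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    if se == 0:
        return 1.0
    z = abs(conv_a - conv_b) / n_per_arm / se
    return erfc(z / sqrt(2))

random.seed(1)
SIMS, LOOKS, PER_LOOK, RATE = 1000, 5, 400, 0.05
false_positives = 0
for _ in range(SIMS):
    a = b = n = 0
    for _ in range(LOOKS):
        # A/A test: both arms share the same true conversion rate
        a += sum(random.random() < RATE for _ in range(PER_LOOK))
        b += sum(random.random() < RATE for _ in range(PER_LOOK))
        n += PER_LOOK
        if two_sided_p(a, b, n) < POCOCK_THRESHOLD:
            false_positives += 1
            break
print(false_positives / SIMS)  # close to the nominal 0.05
```

The stricter per-look threshold buys you the right to look early, at the cost of needing somewhat stronger evidence at each look.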

Practical example

You’re testing a new checkout page layout. Your current conversion rate is 3.2%. You want to detect a 10% relative lift (to 3.52%). At 95% significance (two-sided) with 80% power, you need roughly 50,000 visitors per variant. That’s your test duration — run it, then read the results. Not before.
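The required sample size can be checked with the standard two-proportion formula (normal approximation); Python's stdlib `statistics.NormalDist` supplies the z-values, so no external library is needed. The function name is illustrative:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(p_base, rel_lift, alpha=0.05, power=0.80):
    """Visitors per variant for a two-sided two-proportion test
    (normal approximation)."""
    p_test = p_base * (1 + rel_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for 95%
    z_beta = NormalDist().inv_cdf(power)           # e.g. 0.84 for 80% power
    variance = p_base * (1 - p_base) + p_test * (1 - p_test)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_base - p_test) ** 2)

# Checkout example: 3.2% baseline, 10% relative lift target
n = sample_size_per_variant(0.032, 0.10)
print(n)  # on the order of 50,000 per variant
```

Low baseline rates and small relative lifts are exactly the combination that drives sample sizes this high — halving the detectable lift roughly quadruples the required traffic.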
