Statistical Significance vs Practical Significance
Why p < 0.05 doesn't mean you should ship. A business economics analysis of statistical vs practical significance, MDE as a business decision, and the real cost of ignoring effect size.
Strengths of statistical significance:

- Provides a clear, binary decision criterion
- Controls false positive rate when used correctly
- Universally understood (if often misunderstood) standard
- Enables reproducible analysis across teams
- Required for regulatory and academic contexts

Strengths of practical significance:

- Directly tied to business outcomes and ROI
- Forces teams to define 'what matters' before testing
- Prevents shipping trivially small improvements
- Accounts for implementation and maintenance costs
- Encourages strategic resource allocation

Weaknesses of statistical significance:

- Says nothing about the size or value of the effect
- Can be achieved with trivially small effects given enough data
- Widely misinterpreted as 'probability the result is real'
- Creates a false dichotomy — barely significant isn't different from barely not
- Incentivizes p-hacking and selective reporting

Weaknesses of practical significance:

- Subjective — reasonable people disagree on thresholds
- Requires business context that analysts may not have
- Can be used to rationalize ignoring uncomfortable results
- Harder to standardize across an organization
- May lead to inaction when cumulative small effects matter
Statistical significance is your bouncer — it keeps random noise out of your decision pipeline. Practical significance is your CFO — it decides whether the real effect is worth the investment. You need both, but most teams over-index on the bouncer and forget to consult the CFO. Every test result should answer two questions: 'Is this effect real?' and 'Is this effect worth the cost of capturing it?' If you can't answer the second question, your testing program is generating insights without generating value.

— Atticus Li
The Most Expensive Mistake in Experimentation
Here is a scenario I've seen play out dozens of times: A team runs an A/B test for four weeks. They achieve statistical significance with p = 0.03. The conversion rate improved by 0.08%. They ship the variant. It takes two engineering sprints to productionize. Six months later, nobody can find the impact in the revenue numbers.
What went wrong? They confused statistical significance with practical significance. They proved the effect was real but never asked whether it was worth capturing.
This is the most expensive mistake in experimentation because it doesn't look like a mistake. The test "won." The data "supported" the decision. But the opportunity cost — those two engineering sprints that could have built something with 10x the impact — is invisible on the dashboard.
What Statistical Significance Actually Tells You
Statistical significance answers one narrow question: is the observed difference likely to be caused by something other than random chance?
When you achieve p < 0.05, you're saying: if there were truly no difference between A and B, there would be less than a 5% probability of observing a difference at least this extreme by chance alone. That's it. That's the entire claim.
Statistical significance does not tell you:
- How big the effect is
- Whether the effect matters for your business
- Whether you should allocate resources to capture the effect
- Whether the effect will persist over time
- Whether the effect generalizes to other contexts
It's a filter, not a decision. And yet, in most experimentation programs I've audited, achieving "stat sig" is treated as the decision itself.
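To make "can be achieved with trivially small effects given enough data" concrete, here is a quick stdlib-only sketch (the function name and traffic numbers are illustrative, not from the text) showing that a 0.08-percentage-point lift, like the one in the opening scenario, clears p < 0.05 at scale:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(p_a, p_b, n_per_arm):
    """Two-sided z-test p-value for the difference between two conversion
    rates, using the pooled-variance normal approximation (equal arms)."""
    pooled = (p_a + p_b) / 2
    se = sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    z = abs(p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(z))

# A 4.00% -> 4.08% conversion lift with 2M users per arm:
p = two_proportion_p_value(0.0400, 0.0408, 2_000_000)
print(f"p = {p:.5f}")  # comfortably below 0.05, yet the lift is 0.08 points
```

The test "wins" decisively, and the statistics say nothing about whether 0.08 points is worth two engineering sprints.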
What Practical Significance Actually Means
Practical significance asks: is this effect large enough to justify action given the costs involved?
This requires understanding at least three things:
1. The magnitude of the effect. A 0.08% conversion rate improvement and a 3.2% conversion rate improvement are both "statistically significant," but they represent vastly different business realities.
2. The revenue exposure. A 0.5% improvement on a page that processes $100M annually is $500K. The same improvement on a page with $100K in throughput is $500. Same statistical result, completely different business significance.
3. The cost of capturing the effect. Engineering time, QA cycles, increased code complexity, ongoing maintenance, opportunity cost of not building something else. These costs are real and should be factored into every ship decision.
The MDE Is a Business Decision, Not a Statistics One
The minimum detectable effect (MDE) is typically presented as a statistical parameter — the smallest effect your test is powered to detect. But framing it purely as a statistics question is a fundamental error.
Your MDE should be set by answering this question: What is the smallest improvement that generates positive ROI after accounting for all implementation costs?
Here is how I calculate it:
Step 1: Estimate implementation cost. Include engineering time, QA, code review, deployment risk, and ongoing maintenance. For a typical frontend change, this might be $15,000-$40,000 in fully loaded labor costs.
Step 2: Calculate the revenue per percentage point of conversion. If your page processes $20M annually with a 4% conversion rate, each percentage point of conversion is worth $5M. A 1% relative improvement (0.04 absolute percentage points) is worth $200K annually.
Step 3: Set MDE at breakeven. If implementation costs $30,000 and you want payback within 6 months, you need $60,000 in annualized value. On the page above, that's a 0.3% relative improvement. That's your MDE — anything smaller isn't worth detecting because it isn't worth capturing.
Step 4: Calculate required sample size from MDE. Now — and only now — do you engage the statistical machinery. Given your MDE, baseline conversion rate, and desired power, calculate the sample size. If you can't achieve it in a reasonable timeframe, you need to either accept a larger MDE or acknowledge that this page can't be effectively tested with your traffic.
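The four steps above can be sketched in a few lines. This is a simplified model (the function names and the two-proportion power approximation are mine, not the author's), assuming revenue scales linearly with conversion rate:

```python
from math import ceil
from statistics import NormalDist

def business_mde(impl_cost, payback_months, annual_revenue):
    """Steps 1-3: smallest relative lift whose annualized revenue value
    pays back the implementation cost within `payback_months`."""
    required_annual_value = impl_cost * 12 / payback_months
    return required_annual_value / annual_revenue  # relative lift, as a fraction

def sample_size_per_arm(baseline_cr, mde_rel, alpha=0.05, power=0.80):
    """Step 4: per-arm sample size for a two-proportion z-test powered
    to detect a relative lift of `mde_rel` over `baseline_cr`."""
    p1, p2 = baseline_cr, baseline_cr * (1 + mde_rel)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# The example page: $30K cost, 6-month payback, $20M revenue, 4% baseline.
mde = business_mde(30_000, 6, 20_000_000)
print(f"MDE: {mde:.1%} relative lift")  # 0.3%, matching Step 3
print(f"Users per arm: {sample_size_per_arm(0.04, mde):,}")  # tens of millions
```

The sample size comes out in the tens of millions per arm, which is exactly the Step 4 outcome: at this traffic level, the page can't be powered down to its business MDE, so you either accept a larger MDE or stop testing it.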
Most teams do this backwards. They calculate sample size from an arbitrary MDE (often the platform default), run the test, and then try to make a business case for whatever effect they find. This is like deciding how much to spend on a house after you've already signed the mortgage.
The Revenue Impact Framework
I use a simple framework to translate any test result into a business decision:
Annual Revenue Impact = (Relative Lift) x (Baseline Revenue Throughput) x (Confidence Adjustment)
The confidence adjustment accounts for regression to the mean — observed effects in A/B tests typically shrink by 20-40% once deployed to 100% of traffic. I use a 0.7 multiplier as a conservative default.
Then I compare this against the total cost of implementation:
ROI = (Annual Revenue Impact - Implementation Cost) / Implementation Cost
If ROI is negative or marginally positive, don't ship — regardless of what the p-value says. If ROI is strongly positive, ship with confidence.
This framework makes practical significance concrete and auditable. Every test result gets a dollar value, and every ship decision has an expected ROI attached to it.
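The two formulas translate directly into code. A minimal sketch (function names are mine; the 0.7 multiplier is the text's conservative default for regression to the mean):

```python
def annual_revenue_impact(relative_lift, baseline_revenue, confidence_adj=0.7):
    """Expected annual value of a lift, shrunk for post-launch
    regression to the mean (conservative 0.7 default)."""
    return relative_lift * baseline_revenue * confidence_adj

def roi(annual_impact, impl_cost):
    """(Annual Revenue Impact - Implementation Cost) / Implementation Cost."""
    return (annual_impact - impl_cost) / impl_cost

# A 1% relative lift on a $20M page, against a $30K implementation:
impact = annual_revenue_impact(0.01, 20_000_000)  # $140K after shrinkage
print(f"ROI: {roi(impact, 30_000):.1f}x")         # strongly positive: ship
```

Attaching this two-number summary to every winning test is what makes the ship decision auditable rather than vibes-based.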
When Small Effects Matter: The Compounding Argument
The most common pushback I get is: "But small effects compound! Ten 0.5% improvements add up to a 5% improvement."
This is mathematically true and strategically dangerous. Here's why:
The compounding argument assumes independence. In reality, changes interact. Your brilliant headline test result may evaporate when combined with next month's layout change. Effects don't just add — they interfere.
The compounding argument ignores costs. Ten small improvements that each cost $30K to implement represent $300K in engineering resources. Could that $300K generate more value in a single high-impact project? Almost always yes.
The compounding argument creates a perverse incentive. If "every small win matters," teams optimize for volume of shipped tests rather than impact. They chase easy wins on low-traffic elements instead of tackling hard problems on high-leverage pages.
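As a quick check on the arithmetic behind the pushback, and the cost side the pushback omits (the $30K-per-ship figure reuses the earlier example and is illustrative):

```python
compounded = (1 + 0.005) ** 10 - 1           # ten 0.5% lifts, if truly independent
total_cost = 10 * 30_000                      # ten ships at ~$30K each
print(f"Compounded lift: {compounded:.2%}")   # about 5.1%
print(f"Engineering spend: ${total_cost:,}")  # $300,000
```

The lift math holds only under independence; the $300K spend is real either way.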
That said, there are legitimate cases where small effects matter:
- High-traffic pages with low implementation cost. A one-line copy change on a homepage processing $500M annually? Even a 0.1% lift is $500K. Ship it.
- Cumulative degradation prevention. If you're not detecting and reverting small negative changes, they compound in the wrong direction.
- Learning value. Some tests generate strategic insights worth more than their direct revenue impact.
The key is being explicit about which argument you're making when you decide to ship a small effect.
Opportunity Cost: The Hidden Variable
The most underappreciated factor in ship decisions is opportunity cost. Every engineering sprint spent productionizing a marginally significant test result is a sprint not spent on something else.
I frame this as a portfolio optimization problem. Your experimentation program generates a pipeline of potential improvements, each with an estimated revenue impact and implementation cost. Your engineering team has a fixed capacity. The question isn't "should we ship this winner?" — it's "is this the highest-value use of our next engineering sprint?"
When you frame it this way, many "statistically significant" results fall off the priority list. Not because they're wrong, but because there's something more valuable waiting in the queue.
Building a Practical Significance Culture
Shifting from "stat sig equals ship" to "practical significance drives decisions" requires organizational change:
1. Require ROI estimates for every ship decision. Before a winning test goes to engineering, someone must attach a dollar value to the expected impact and compare it against implementation cost. This single practice eliminates 30-40% of low-value ships in my experience.
2. Set organizational MDE standards. Different page types and traffic levels warrant different thresholds. Create a lookup table: for each page category, what is the minimum relative lift worth detecting? Review and update quarterly.
3. Track post-implementation performance. Measure whether shipped variants deliver the value predicted by the test. This creates a feedback loop that calibrates your team's judgment over time. Most teams never do this, which means they never learn whether their significance thresholds are correct.
4. Celebrate informed no-ship decisions. When a team decides not to ship a statistically significant result because the practical impact doesn't justify the cost, that's a win. It means your experimentation culture is mature enough to distinguish between "interesting" and "valuable."
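Practices 1 and 2 can be encoded so they run as a gate rather than a guideline. A sketch with hypothetical categories and thresholds (yours would come from the quarterly review, not from this table):

```python
# Hypothetical per-category MDE standards: minimum relative lift worth shipping.
MDE_STANDARDS = {
    "homepage":      0.003,  # huge traffic: small lifts still pay back
    "checkout":      0.005,
    "settings_page": 0.020,  # low revenue exposure: demand a big lift
}

def ship_decision(category, observed_lift, expected_annual_value, impl_cost):
    """Gate a winning test on both the MDE standard and positive ROI."""
    if observed_lift < MDE_STANDARDS[category]:
        return "no-ship: below MDE standard"
    if expected_annual_value <= impl_cost:
        return "no-ship: negative ROI"
    return "ship"

print(ship_decision("homepage", 0.004, 120_000, 30_000))     # ship
print(ship_decision("settings_page", 0.004, 5_000, 30_000))  # no-ship
```

The point is not the specific numbers but that the standard lives in one reviewable place instead of in each analyst's head.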
The p < 0.05 Trap in Practice
The fixation on p < 0.05 creates several perverse dynamics:
Tests run too long on low-impact elements. Teams wait weeks for significance on a button color test that, even if it wins, would generate $2,000 in annual revenue. The test itself cost more than $2,000 in analyst time.
Winners are shipped without effect size consideration. "It's significant!" becomes the rallying cry, regardless of whether the effect is 0.05% or 5%.
Losses are ignored. Negative results are discarded as "not significant" when they might contain valuable learning about customer behavior.
Resources concentrate on easy wins. Statistical significance is easier to achieve on high-traffic pages, so teams over-test homepages and under-test high-intent pages deeper in the funnel where the real leverage lives.
A Better Decision Framework
Replace the binary "significant/not significant" gate with a three-dimensional assessment:
Dimension 1: Statistical confidence. Is the effect real? (Statistical significance)
Dimension 2: Effect magnitude. Is the effect large enough to matter? (Practical significance)
Dimension 3: Strategic value. Does this change align with our product strategy and create compounding advantages? (Strategic significance)
A test result worth shipping scores high on all three dimensions. A result that's statistically significant but practically insignificant gets documented and deprioritized. A result that's practically significant but statistically uncertain gets more traffic and a retest.
This framework forces the right conversations and prevents the most common failure mode: shipping everything that clears p < 0.05 while ignoring the economic reality of what you're building.
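The three-dimensional gate might look like this in code (the thresholds and return strings are illustrative placeholders, not the author's):

```python
def assess(p_value, annual_impact, impl_cost, strategic_fit, alpha=0.05):
    """Score a test result on all three dimensions and return a decision."""
    statistically_confident = p_value < alpha          # Dimension 1
    practically_significant = annual_impact > impl_cost  # Dimension 2: positive ROI
    # Dimension 3 (strategic_fit) arrives as a judgment call, passed in as a flag.
    if statistically_confident and practically_significant and strategic_fit:
        return "ship"
    if statistically_confident and not practically_significant:
        return "document and deprioritize"
    if practically_significant and not statistically_confident:
        return "extend test and retest"
    return "discard"

print(assess(0.03, 140_000, 30_000, strategic_fit=True))  # ship
print(assess(0.03, 2_000, 30_000, strategic_fit=True))    # document and deprioritize
```

Even this toy version forces the conversation the binary gate skips: a significant result with negative ROI exits as documentation, not as an engineering ticket.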
The Bottom Line
Statistical significance tells you whether an effect is real. Practical significance tells you whether it's worth pursuing. The gap between these two concepts is where most experimentation programs leak value. Close the gap by making every ship decision an economic decision, not just a statistical one.