Statistical Significance vs Practical Significance
Why p < 0.05 doesn't mean you should ship. A business economics analysis of statistical vs practical significance, MDE as a business decision, and the real cost of ignoring effect size.
Strengths of statistical significance:

- Provides a clear, binary decision criterion
- Controls false positive rate when used correctly
- Universally understood (if often misunderstood) standard
- Enables reproducible analysis across teams
- Required for regulatory and academic contexts

Strengths of practical significance:

- Directly tied to business outcomes and ROI
- Forces teams to define 'what matters' before testing
- Prevents shipping trivially small improvements
- Accounts for implementation and maintenance costs
- Encourages strategic resource allocation

Weaknesses of statistical significance:

- Says nothing about the size or value of the effect
- Can be achieved with trivially small effects given enough data
- Widely misinterpreted as 'probability the result is real'
- Creates a false dichotomy — barely significant isn't different from barely not
- Incentivizes p-hacking and selective reporting

Weaknesses of practical significance:

- Subjective — reasonable people disagree on thresholds
- Requires business context that analysts may not have
- Can be used to rationalize ignoring uncomfortable results
- Harder to standardize across an organization
- May lead to inaction when cumulative small effects matter
Statistical significance is your bouncer — it keeps random noise out of your decision pipeline. Practical significance is your CFO — it decides whether the real effect is worth the investment. You need both, but most teams over-index on the bouncer and forget to consult the CFO. Every test result should answer two questions: 'Is this effect real?' and 'Is this effect worth the cost of capturing it?' If you can't answer the second question, your testing program is generating insights without generating value.

— Atticus Li
The Most Expensive Mistake in Experimentation
Here is a scenario I've seen play out dozens of times: A team runs an A/B test for four weeks. They achieve statistical significance with p = 0.03. The conversion rate improved by 0.08%. They ship the variant. It takes two engineering sprints to productionize. Six months later, nobody can find the impact in the revenue numbers.
What went wrong? They confused statistical significance with practical significance. They proved the effect was real but never asked whether it was worth capturing.
This is the most expensive mistake in experimentation because it doesn't look like a mistake. The test "won." The data "supported" the decision. But the opportunity cost — those two engineering sprints that could have built something with 10x the impact — is invisible on the dashboard.
What Statistical Significance Actually Tells You
Statistical significance answers one narrow question: is the observed difference likely to be caused by something other than random chance?
When you achieve p < 0.05, you're saying: if there were truly no difference between A and B, there would be less than a 5% probability of observing a difference at least this extreme by chance alone. That's it. That's the entire claim.
Statistical significance does not tell you:
- How big the effect is
- Whether the effect matters for your business
- Whether you should allocate resources to capture the effect
- Whether the effect will persist over time
- Whether the effect generalizes to other contexts
It's a filter, not a decision. And yet, in most experimentation programs I've audited, achieving "stat sig" is treated as the decision itself.
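To make "can be achieved with trivially small effects given enough data" concrete, here is a quick stdlib-only sketch (the function name and traffic numbers are illustrative, not from the text) showing that a 0.08-percentage-point lift, like the one in the opening scenario, clears p < 0.05 at scale:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(p_a, p_b, n_per_arm):
    """Two-sided z-test p-value for the difference between two conversion
    rates, using the pooled-variance normal approximation (equal arms)."""
    pooled = (p_a + p_b) / 2
    se = sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    z = abs(p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(z))

# A 4.00% -> 4.08% conversion lift with 2M users per arm:
p = two_proportion_p_value(0.0400, 0.0408, 2_000_000)
print(f"p = {p:.5f}")  # comfortably below 0.05, yet the lift is 0.08 points
```

The test "wins" decisively, and the statistics say nothing about whether 0.08 points is worth two engineering sprints.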
What Practical Significance Actually Means
Practical significance asks: is this effect large enough to justify action given the costs involved?
This requires understanding at least three things:
1. The magnitude of the effect. A 0.08% conversion rate improvement and a 3.2% conversion rate improvement are both "statistically significant," but they represent vastly different business realities.
2. The revenue exposure. A 0.5% improvement on a page that processes $100M annually is $500K. The same improvement on a page with $100K in throughput is $500. Same statistical result, completely different business significance.
3. The cost of capturing the effect. Engineering time, QA cycles, increased code complexity, ongoing maintenance, opportunity cost of not building something else. These costs are real and should be factored into every ship decision.
The MDE Is a Business Decision, Not a Statistics One
The minimum detectable effect (MDE) is typically presented as a statistical parameter — the smallest effect your test is powered to detect. But framing it purely as a statistics question is a fundamental error.
Your MDE should be set by answering this question: What is the smallest improvement that generates positive ROI after accounting for all implementation costs?
Here is how I calculate it:
Step 1: Estimate implementation cost. Include engineering time, QA, code review, deployment risk, and ongoing maintenance. For a typical frontend change, this might be $15,000-$40,000 in fully loaded labor costs.
Step 2: Calculate the revenue per percentage point of conversion. If your page processes $20M annually with a 4% conversion rate, each percentage point of conversion is worth $5M. A 1% relative improvement (0.04 absolute percentage points) is worth $200K annually.
Step 3: Set MDE at breakeven. If implementation costs $30,000 and you want payback within 6 months, you need $60,000 in annualized value. On the page above, that's a 0.3% relative improvement. That's your MDE — anything smaller isn't worth detecting because it isn't worth capturing.
Step 4: Calculate required sample size from MDE. Now — and only now — do you engage the statistical machinery. Given your MDE, baseline conversion rate, and desired power, calculate the sample size. If you can't achieve it in a reasonable timeframe, you need to either accept a larger MDE or acknowledge that this page can't be effectively tested with your traffic.
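The four steps above can be sketched in a few lines. This is a simplified model (the function names and the two-proportion power approximation are mine, not the author's), assuming revenue scales linearly with conversion rate:

```python
from math import ceil
from statistics import NormalDist

def business_mde(impl_cost, payback_months, annual_revenue):
    """Steps 1-3: smallest relative lift whose annualized revenue value
    pays back the implementation cost within `payback_months`."""
    required_annual_value = impl_cost * 12 / payback_months
    return required_annual_value / annual_revenue  # relative lift, as a fraction

def sample_size_per_arm(baseline_cr, mde_rel, alpha=0.05, power=0.80):
    """Step 4: per-arm sample size for a two-proportion z-test powered
    to detect a relative lift of `mde_rel` over `baseline_cr`."""
    p1, p2 = baseline_cr, baseline_cr * (1 + mde_rel)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# The example page: $30K cost, 6-month payback, $20M revenue, 4% baseline.
mde = business_mde(30_000, 6, 20_000_000)
print(f"MDE: {mde:.1%} relative lift")  # 0.3%, matching Step 3
print(f"Users per arm: {sample_size_per_arm(0.04, mde):,}")  # tens of millions
```

The sample size comes out in the tens of millions per arm, which is exactly the Step 4 outcome: at this traffic level, the page can't be powered down to its business MDE, so you either accept a larger MDE or stop testing it.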
Most teams do this backwards. They calculate sample size from an arbitrary MDE (often the platform default), run the test, and then try to make a business case for whatever effect they find. This is like deciding how much to spend on a house after you've already signed the mortgage.
The Revenue Impact Framework
I use a simple framework to translate any test result into a business decision:
Annual Revenue Impact = (Relative Lift) x (Baseline Revenue Throughput) x (Confidence Adjustment)
The confidence adjustment accounts for regression to the mean — observed effects in A/B tests typically shrink by 20-40% once deployed to 100% of traffic. I use a 0.7 multiplier as a conservative default.
Then I compare this against the total cost of implementation:
ROI = (Annual Revenue Impact - Implementation Cost) / Implementation Cost
If ROI is negative or marginally positive, don't ship — regardless of what the p-value says. If ROI is strongly positive, ship with confidence.
This framework makes practical significance concrete and auditable. Every test result gets a dollar value, and every ship decision has an expected ROI attached to it.
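The two formulas translate directly into code. A minimal sketch (function names are mine; the 0.7 multiplier is the text's conservative default for regression to the mean):

```python
def annual_revenue_impact(relative_lift, baseline_revenue, confidence_adj=0.7):
    """Expected annual value of a lift, shrunk for post-launch
    regression to the mean (conservative 0.7 default)."""
    return relative_lift * baseline_revenue * confidence_adj

def roi(annual_impact, impl_cost):
    """(Annual Revenue Impact - Implementation Cost) / Implementation Cost."""
    return (annual_impact - impl_cost) / impl_cost

# A 1% relative lift on a $20M page, against a $30K implementation:
impact = annual_revenue_impact(0.01, 20_000_000)  # $140K after shrinkage
print(f"ROI: {roi(impact, 30_000):.1f}x")         # strongly positive: ship
```

Attaching this two-number summary to every winning test is what makes the ship decision auditable rather than vibes-based.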
When Small Effects Matter: The Compounding Argument
The most common pushback I get is: "But small effects compound! Ten 0.5% improvements add up to a 5% improvement."
This is mathematically true and strategically dangerous. Here's why:
The compounding argument assumes independence. In reality, changes interact. Your brilliant headline test result may evaporate when combined with next month's layout change. Effects don't just add — they interfere.
The compounding argument ignores costs. Ten small improvements that each cost $30K to implement represent $300K in engineering resources. Could that $300K generate more value in a single high-impact project? Almost always yes.
The compounding argument creates a perverse incentive. If "every small win matters," teams optimize for volume of shipped tests rather than impact. They chase easy wins on low-traffic elements instead of tackling hard problems on high-leverage pages.
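As a quick check on the arithmetic behind the pushback, and the cost side the pushback omits (the $30K-per-ship figure reuses the earlier example and is illustrative):

```python
compounded = (1 + 0.005) ** 10 - 1           # ten 0.5% lifts, if truly independent
total_cost = 10 * 30_000                      # ten ships at ~$30K each
print(f"Compounded lift: {compounded:.2%}")   # about 5.1%
print(f"Engineering spend: ${total_cost:,}")  # $300,000
```

The lift math holds only under independence; the $300K spend is real either way.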
That said, there are legitimate cases where small effects matter:
- High-traffic pages with low implementation cost. A one-line copy change on a homepage processing $500M annually? Even a 0.1% lift is $500K. Ship it.
- Cumulative degradation prevention. If you're not detecting and reverting small negative changes, they compound in the wrong direction.
- Learning value. Some tests generate strategic insights worth more than their direct revenue impact.
The key is being explicit about which argument you're making when you decide to ship a small effect.
Opportunity Cost: The Hidden Variable
The most underappreciated factor in ship decisions is opportunity cost. Every engineering sprint spent productionizing a marginally significant test result is a sprint not spent on something else.
I frame this as a portfolio optimization problem. Your experimentation program generates a pipeline of potential improvements, each with an estimated revenue impact and implementation cost. Your engineering team has a fixed capacity. The question isn't "should we ship this winner?" — it's "is this the highest-value use of our next engineering sprint?"
When you frame it this way, many "statistically significant" results fall off the priority list. Not because they're wrong, but because there's something more valuable waiting in the queue.
Building a Practical Significance Culture
Shifting from "stat sig equals ship" to "practical significance drives decisions" requires organizational change:
1. Require ROI estimates for every ship decision. Before a winning test goes to engineering, someone must attach a dollar value to the expected impact and compare it against implementation cost. This single practice eliminates 30-40% of low-value ships in my experience.
2. Set organizational MDE standards. Different page types and traffic levels warrant different thresholds. Create a lookup table: for each page category, what is the minimum relative lift worth detecting? Review and update quarterly.
3. Track post-implementation performance. Measure whether shipped variants deliver the value predicted by the test. This creates a feedback loop that calibrates your team's judgment over time. Most teams never do this, which means they never learn whether their significance thresholds are correct.
4. Celebrate informed no-ship decisions. When a team decides not to ship a statistically significant result because the practical impact doesn't justify the cost, that's a win. It means your experimentation culture is mature enough to distinguish between "interesting" and "valuable."
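Practices 1 and 2 can be encoded so they run as a gate rather than a guideline. A sketch with hypothetical categories and thresholds (yours would come from the quarterly review, not from this table):

```python
# Hypothetical per-category MDE standards: minimum relative lift worth shipping.
MDE_STANDARDS = {
    "homepage":      0.003,  # huge traffic: small lifts still pay back
    "checkout":      0.005,
    "settings_page": 0.020,  # low revenue exposure: demand a big lift
}

def ship_decision(category, observed_lift, expected_annual_value, impl_cost):
    """Gate a winning test on both the MDE standard and positive ROI."""
    if observed_lift < MDE_STANDARDS[category]:
        return "no-ship: below MDE standard"
    if expected_annual_value <= impl_cost:
        return "no-ship: negative ROI"
    return "ship"

print(ship_decision("homepage", 0.004, 120_000, 30_000))     # ship
print(ship_decision("settings_page", 0.004, 5_000, 30_000))  # no-ship
```

The point is not the specific numbers but that the standard lives in one reviewable place instead of in each analyst's head.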
The p < 0.05 Trap in Practice
The fixation on p < 0.05 creates several perverse dynamics:
Tests run too long on low-impact elements. Teams wait weeks for significance on a button color test that, even if it wins, would generate $2,000 in annual revenue. The test itself cost more than $2,000 in analyst time.
Winners are shipped without effect size consideration. "It's significant!" becomes the rallying cry, regardless of whether the effect is 0.05% or 5%.
Losses are ignored. Negative results are discarded as "not significant" when they might contain valuable learning about customer behavior.
Resources concentrate on easy wins. Statistical significance is easier to achieve on high-traffic pages, so teams over-test homepages and under-test high-intent pages deeper in the funnel where the real leverage lives.
A Better Decision Framework
Replace the binary "significant/not significant" gate with a three-dimensional assessment:
Dimension 1: Statistical confidence. Is the effect real? (Statistical significance)
Dimension 2: Effect magnitude. Is the effect large enough to matter? (Practical significance)
Dimension 3: Strategic value. Does this change align with our product strategy and create compounding advantages? (Strategic significance)
A test result worth shipping scores high on all three dimensions. A result that's statistically significant but practically insignificant gets documented and deprioritized. A result that's practically significant but statistically uncertain gets more traffic and a retest.
This framework forces the right conversations and prevents the most common failure mode: shipping everything that clears p < 0.05 while ignoring the economic reality of what you're building.
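The three-dimensional gate might look like this in code (the thresholds and return strings are illustrative placeholders, not the author's):

```python
def assess(p_value, annual_impact, impl_cost, strategic_fit, alpha=0.05):
    """Score a test result on all three dimensions and return a decision."""
    statistically_confident = p_value < alpha          # Dimension 1
    practically_significant = annual_impact > impl_cost  # Dimension 2: positive ROI
    # Dimension 3 (strategic_fit) arrives as a judgment call, passed in as a flag.
    if statistically_confident and practically_significant and strategic_fit:
        return "ship"
    if statistically_confident and not practically_significant:
        return "document and deprioritize"
    if practically_significant and not statistically_confident:
        return "extend test and retest"
    return "discard"

print(assess(0.03, 140_000, 30_000, strategic_fit=True))  # ship
print(assess(0.03, 2_000, 30_000, strategic_fit=True))    # document and deprioritize
```

Even this toy version forces the conversation the binary gate skips: a significant result with negative ROI exits as documentation, not as an engineering ticket.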
The Bottom Line
Statistical significance tells you whether an effect is real. Practical significance tells you whether it's worth pursuing. The gap between these two concepts is where most experimentation programs leak value. Close the gap by making every ship decision an economic decision, not just a statistical one.