Most teams pick their MDE the same way: they open a sample size calculator, see what number makes the test length look reasonable, and enter that. This is backwards.
MDE — minimum detectable effect — is the smallest lift your test is designed to reliably detect. Set it correctly and your test program produces actionable results. Set it wrong and you spend six months collecting data for tests that were never going to tell you anything useful.
I've spent seven years debugging experiments where the MDE was the actual problem. Here's the framework I use.
What MDE Actually Means
MDE is not a prediction of how much lift you'll get. It's a statement about what your test is capable of measuring.
If you set an MDE of 10% relative lift on a 3% baseline conversion rate, you're saying: "I want my test to have an 80% probability of detecting a real effect, if the true effect is a 10% relative improvement or larger." You're not saying the effect will be 10%. You're saying you need the effect to be at least 10% for your test to reliably find it.
This is why MDE determines sample size. A smaller MDE, meaning you want to detect tinier effects, requires disproportionately more data: required sample size grows with the inverse square of the MDE. There's no free lunch.
Relative vs. Absolute MDE: The Confusion That Breaks Calculators
This distinction trips up even experienced practitioners. Get it wrong and your sample size calculation is completely off.
Absolute MDE: The change in raw percentage points. Relative MDE: The change as a percentage of the baseline rate.
Example: Your baseline checkout conversion rate is 3%.
- A 10% relative MDE = 0.3 percentage points absolute = you're trying to detect a change from 3.0% to 3.3%
- A 1% absolute MDE = you're trying to detect a change from 3.0% to 4.0% — which is a 33% relative change
Most sample size calculators ask for relative MDE. Most practitioners think in absolute terms. The mismatch causes teams to enter "0.5%" thinking they mean absolute, when the calculator interprets it as 0.5% relative — a tiny effect that requires millions of visitors to detect.
**Pro Tip:** Always double-check which type your calculator uses. Before running any calculation, manually verify: if I enter "10" for MDE, does the tool interpret that as 10 percentage points absolute (3% → 13%) or 10% relative (3% → 3.3%)? These are dramatically different.
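A two-line helper removes the ambiguity entirely. A minimal sketch (the function names are illustrative, not from any particular tool):

```python
def rel_to_abs(baseline, rel_mde):
    # relative MDE -> change in raw percentage points (as a fraction)
    return baseline * rel_mde

def abs_to_rel(baseline, abs_mde):
    # absolute MDE (in points, as a fraction) -> relative lift
    return abs_mde / baseline

baseline = 0.03                                    # 3% baseline conversion rate
print(f"{rel_to_abs(baseline, 0.10):.3%}")         # 10% relative -> 0.300% absolute (3.0% -> 3.3%)
print(f"{abs_to_rel(baseline, 0.01):.0%}")         # 1 point absolute -> 33% relative
```

Run both directions on your numbers before touching a calculator; if the two interpretations differ by an order of magnitude, so will your sample sizes.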
Worked Example: What the Math Actually Produces
Let's use real numbers.
- Baseline CVR: 3.0%
- MDE: 10% relative (meaning you want to detect a 10% relative lift or higher)
- Statistical significance: 95% (alpha = 0.05, two-tailed)
- Statistical power: 80% (beta = 0.20)
Step 1: Convert relative MDE to absolute. 10% of 3.0% = 0.3 percentage points. Your target variant CVR is 3.3%.
Step 2: Plug into a sample size formula. For a two-proportion z-test with these parameters, you need approximately 53,000 visitors per variation. (If a calculator reports roughly half that, around 26,000, it has quietly dropped the power term; that's the number for 50% power, not 80%.)
Step 3: Calculate test duration. If your test page gets 10,000 visitors/week and you're running a 50/50 split:
- Visitors per variation per week = 5,000
- Weeks required = 53,000 / 5,000 ≈ 10.6 weeks
That's about two and a half months: long, but feasible at this traffic level. Now watch what happens when teams change the MDE.
At 5% relative MDE (detecting a smaller effect): ~208,000 visitors per variation → ~42 weeks. Nearly 10 months.
At 20% relative MDE (only detecting larger effects): ~14,000 visitors per variation → ~2.8 weeks. Fast, but you'll miss improvements below 20% relative lift.
The MDE is a dial that controls the fundamental tradeoff between test duration and sensitivity.
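The whole walkthrough can be sketched end to end. A minimal version using the standard unpooled two-proportion z-test formula (z-values from Python's stdlib `NormalDist`; the 10,000 visitors/week and 50/50 split are the example's assumptions):

```python
from statistics import NormalDist

def visitors_per_variation(baseline, rel_mde, alpha=0.05, power=0.80):
    """Per-variation sample size for a two-tailed two-proportion z-test."""
    inv = NormalDist().inv_cdf
    p1, p2 = baseline, baseline * (1 + rel_mde)   # Step 1: relative MDE -> target rate
    z = inv(1 - alpha / 2) + inv(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return z ** 2 * variance / (p2 - p1) ** 2     # Step 2: required n per variation

for rel_mde in (0.05, 0.10, 0.20):
    n = visitors_per_variation(0.03, rel_mde)
    weeks = n / 5_000                             # Step 3: 10k visitors/week, 50/50 split
    print(f"{rel_mde:.0%} relative MDE: ~{n:,.0f}/variation, {weeks:.1f} weeks")
```

Turning the MDE dial in this loop reproduces the tradeoff above: each halving of the MDE roughly quadruples the visitors and the calendar time.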
**Pro Tip:** For most e-commerce checkout optimizations, a 10-15% relative MDE is the practical sweet spot. Below that, test duration becomes unreasonable for most traffic levels. Above that, you're missing improvements that would be meaningful to the business.
Why Teams Set MDE Too Low
The most common mistake I see: teams set 1-2% relative MDE "to be safe," end up needing hundreds of thousands or even millions of visitors per variation, and either run tests for 6 months or call them early when they look promising.
Why does this happen?
- Aspirational thinking. "Even a 1% lift on our revenue would be worth $500K/year, so I want to detect 1%." This is economically correct but statistically expensive.
- Not understanding the math. The relationship between MDE and sample size isn't linear: required sample size scales with the inverse square of the MDE, so halving the MDE quadruples it.
- Not accounting for implementation costs. A 1% lift might justify the development effort, but the test itself costs 6 months of opportunity cost, which needs to be factored in.
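The inverse-square scaling in the second bullet is easy to see numerically. A sketch, assuming the ~53,000-per-variation figure the standard two-proportion formula gives for a 3% baseline at 10% relative MDE (95% significance, 80% power); the scaling is exact only to first order, since the variance term shifts slightly with the target rate:

```python
n_at_10pct = 53_000   # per variation; assumed from the standard formula for the 3% baseline example

def scale(rel_mde, ref_mde=0.10, ref_n=n_at_10pct):
    # sample size is proportional to 1 / MDE^2, everything else held fixed
    return ref_n * (ref_mde / rel_mde) ** 2

print(f"{scale(0.05):,.0f}")   # halve the MDE -> 4x the sample: 212,000
print(f"{scale(0.02):,.0f}")   # one-fifth the MDE -> 25x the sample: 1,325,000
```

This is why "detect 1-2% to be safe" quietly turns a five-week test into a multi-year one.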
The Right Way to Set MDE: Start From Business Value
Here's the framework I actually use.
Question 1: What lift would make this test worth shipping, given development cost?
Say a developer spent 2 weeks building the variant. At $150/hour fully loaded, that's $12,000. If the feature touches a page that converts 3% of 50,000 monthly visitors at $80 average order value:
- Baseline monthly revenue: 50,000 × 3% × $80 = $120,000
- Monthly revenue impact of a 1% relative lift: $120,000 × 1% = $1,200
- Payback period: $12,000 / $1,200 = 10 months
That 1% lift barely justifies the build cost. A 10% relative lift ($12,000/month impact) pays back the build in 1 month. That's your MDE floor.
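The payback arithmetic above, as a sketch (all figures are the example's assumptions, not benchmarks):

```python
visitors, cvr, aov = 50_000, 0.03, 80    # monthly visitors, baseline CVR, average order value
build_cost = 12_000                      # 2 developer-weeks at $150/hr fully loaded

baseline_revenue = visitors * cvr * aov  # $120,000/month
for lift in (0.01, 0.10):
    extra = baseline_revenue * lift      # incremental monthly revenue at this relative lift
    print(f"{lift:.0%} lift: ${extra:,.0f}/month, payback {build_cost / extra:.0f} months")
```

Running this for a range of lifts before designing the test tells you where the economic floor sits; that floor is the smallest MDE worth powering for.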
Question 2: What's the minimum lift we'd actually ship?
Sometimes the answer has nothing to do with statistics. If a variant creates visual debt, increases maintenance burden, or adds tech complexity, you'd only ship it if the lift were substantial. That threshold becomes your MDE.
**Pro Tip:** Set your MDE at the minimum lift you'd actually implement. There's no point in designing a test to detect a 2% lift if you'd only ship the variant at 10%+. Your MDE should match your shipping threshold.
MDE Benchmarks by Test Type
Based on 100+ experiments, here are realistic MDE ranges by test category. These reflect the typical magnitude of effects actually observed — not aspirational targets.
- Layout and UX changes: 5-15% relative. Significant visual changes move the needle more than copy tweaks.
- Copy and messaging tests: 5-20% relative. Headline tests on high-intent pages can hit the high end; body copy changes rarely exceed 5%.
- Pricing and discount tests: 2-5% relative. These have high economic value even at small lifts, and traffic is often high enough to detect them.
- Navigation and information architecture: 8-15% relative. Finding and fixing a navigation problem can produce large effects.
- Form optimization: 5-15% relative. Reducing field count, changing labels, or improving error messages produces measurable but modest effects.
- Social proof additions: 3-10% relative. Review counts and trust badges work, but effects are often smaller than expected.
**Pro Tip:** These ranges are for when the test idea is solid and validated by qualitative data. If you're testing a gut-feel idea without user research backing, add a 50% buffer to your expected MDE — gut ideas produce real effects less often than data-backed ones.
Understanding 80% Statistical Power
You'll see "80% power" as a default in every sample size calculator. Here's what it actually means, in plain terms:
If there's a true effect of exactly your MDE size, and you ran this same experiment 100 times with fresh samples, you would detect a statistically significant result approximately 80 times. The other 20 times, you'd get an inconclusive result and incorrectly conclude there's no effect.
This 20% miss rate (the false negative rate, or Type II error) is the tradeoff. You could set power to 90% or 95% — those give you fewer false negatives, but they require 35-65% more visitors per variation.
For most commercial experimentation, 80% power is the right default. For experiments with significant downside risk (you're considering removing a feature), 90% power is worth the extra sample size cost.
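The "run it 100 times" definition can be checked directly with a Monte Carlo sketch (assumes numpy; the ~53,200 per-variation sample size is what the standard two-proportion formula gives for the 3% baseline, 10% relative MDE example):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 53_200                             # per variation; assumed from the standard formula
p_control, p_variant = 0.030, 0.033    # true effect exactly at the 10% relative MDE
trials, z_crit = 5_000, 1.959964       # alpha = 0.05, two-tailed critical value

conv_a = rng.binomial(n, p_control, size=trials)   # conversions in each simulated test
conv_b = rng.binomial(n, p_variant, size=trials)
pooled = (conv_a + conv_b) / (2 * n)
se = np.sqrt(2 * pooled * (1 - pooled) / n)        # pooled two-proportion z-test
z = (conv_b - conv_a) / n / se
power = np.mean(np.abs(z) > z_crit)
print(f"significant in {power:.0%} of simulated tests")   # ~80%
```

Rerunning with a smaller `n` shows the miss rate climbing, which is exactly the underpowered-test problem discussed next.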
What to Do When a Test Ends Before Hitting MDE Sample Size
It happens constantly: the test gets called early, the timeline is compressed, or traffic underperformed the projection. You hit 40% of your required sample size. What does the data tell you?
Honest answer: less than you'd hope.
At 40% of the required sample size, a test designed for 80% power has only about 40-45% power. That's close to a coin flip on whether you'd detect a real effect at your MDE. Your confidence intervals are wide. You can see direction, but you can't make a confident inference.
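That power estimate follows from the fact that the standardized effect shrinks with the square root of the sample fraction. A sketch using Python's stdlib `NormalDist`, for a test planned at 95% significance (two-tailed) and 80% power:

```python
from statistics import NormalDist

nd = NormalDist()
z_alpha = nd.inv_cdf(0.975)   # two-tailed alpha = 0.05
z_beta = nd.inv_cdf(0.80)     # planned 80% power

def power_at_fraction(f):
    # with fraction f of the planned sample, the standardized effect shrinks by sqrt(f)
    return nd.cdf((z_alpha + z_beta) * f ** 0.5 - z_alpha)

print(f"{power_at_fraction(0.4):.0%}")   # ~43%
```

This ignores the negligible far-tail rejection probability; at f = 1.0 it recovers the planned 80% exactly.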
The correct interpretation: "The data suggests a directional lift of X%, but we don't have sufficient statistical power to confirm it. We'd need Y more visitors to reach a conclusive result."
The incorrect interpretation: "It was trending positive with 75% confidence, good enough, let's ship it."
**Pro Tip:** If you know you'll face time constraints, set your MDE higher from the start so the test is adequately powered within your realistic timeline. A well-powered short test beats an underpowered long one every time.
Common Mistakes
Mistake 1: Using total site traffic in your sample size calculator. If your homepage gets 100,000 visitors/week but only 40% see the tested element, your effective traffic is 40,000. Use traffic exposed to the test, not total sessions.
Mistake 2: Not adjusting MDE for multiple variations. Adding a third variation doesn't just require more traffic — it requires significantly more, because you're now making multiple comparisons. A 3-variation test with 95% significance per comparison needs family-wise error rate correction.
Mistake 3: Changing MDE mid-test. If results look promising and you raise your MDE mid-test so that the sample you've already collected counts as sufficient, you've invalidated the analysis. The MDE must be set before the test starts.
Mistake 4: Setting the same MDE for every test. Different test types have different realistic effect sizes. Use the benchmarks above rather than defaulting to 10% for everything.
What to Do Next
Use the MDE framework above before your next test design. Before opening a sample size calculator, answer: what's the minimum lift I'd actually ship? That's your MDE.
For a complete experimentation workflow including sample size planning, check the Optimizely Practitioner Toolkit — it includes a sample size planning template that connects MDE to business value automatically.