Here's a conversation I've had more times than I can count.
"Our test has been running for 8 weeks and hasn't reached significance."
"What MDE did you set?"
"What's MDE?"
That's the problem. Minimum Detectable Effect is the single most important number in experiment design, and it's the number teams are most likely to either skip entirely or set arbitrarily. The result is either tests that run forever chasing an effect too small to matter, or tests that declare winners on effects too small to detect reliably.
This is the guide I wish existed when I started running experiments.
What MDE Actually Is
The Minimum Detectable Effect is the smallest difference between your variant and control that your experiment is statistically powered to detect.
Note what it is not: it is not a prediction of how much lift you'll see. It is not a goal. It is a floor — the smallest effect that your experimental design has a reasonable chance of identifying as real, given your traffic, your baseline conversion rate, and your chosen confidence level.
If your MDE is 10% relative lift and your variant actually produces a 5% relative lift, your test will very likely fail to reach statistical significance — not because the effect isn't real, but because your experiment wasn't designed to see effects that small.
Think of MDE like the resolution on a camera. A 10-megapixel camera can capture a lot of detail. A 2-megapixel camera cannot. If the detail you care about requires 8 megapixels to resolve, the 2-megapixel camera will produce a blurry result. Your experiment works the same way. Insufficient sample size produces a test that can't resolve small effects — they appear as noise.
The Formula That Controls Your Experiment Duration
MDE, sample size, statistical power, and significance level are locked together by a single relationship. Change any one and you must adjust at least one other.
The core sample size formula (simplified for binary metrics like conversion rate):
n = (Z_alpha + Z_beta)^2 × 2 × p × (1 − p) / MDE_absolute^2
Where:
- n = visitors needed per variation
- Z_alpha = Z-score for your significance level (1.96 for a two-sided 95%, 1.645 for 90%)
- Z_beta = Z-score for your statistical power (0.84 for 80% power)
- p = baseline conversion rate
- MDE_absolute = your MDE expressed as an absolute change in conversion rate
Let's work through a real example.
Setup: Checkout page. Baseline conversion rate: 3% (p = 0.03). You want to detect a 10% relative lift (so MDE_absolute = 0.003, since 10% of 3% = 0.3 percentage points). Significance: 95%. Power: 80%.
Calculation: n = (1.96 + 0.84)^2 × 2 × 0.03 × 0.97 / 0.003^2 = 7.84 × 0.0582 / 0.000009 ≈ 50,700 visitors per variation.
For a two-variation test, that's roughly 101,400 total visitors. At 10,000 visitors per week to the checkout page, your test runs for approximately 10 weeks.
Now change the MDE to 5% relative lift (MDE_absolute = 0.0015):
n = 7.84 × 0.0582 / 0.00000225 ≈ 202,800 visitors per variation.
That's 40+ weeks. Halving your MDE quadrupled the required sample size, and with it your experiment duration. This is why MDE is the most consequential number in your experiment plan.
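This arithmetic is easy to script. Here is a minimal Python sketch of the simplified formula above (the function name and defaults are mine, not from any particular tool):

```python
from statistics import NormalDist

def sample_size_per_variation(baseline, relative_mde,
                              significance=0.95, power=0.80):
    """Visitors per variation for a two-sided test on a binary metric,
    using the simplified two-proportion formula above."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - significance) / 2)  # 1.96 at 95%
    z_beta = NormalDist().inv_cdf(power)                        # 0.84 at 80% power
    mde_abs = baseline * relative_mde                           # relative -> absolute
    return (z_alpha + z_beta) ** 2 * 2 * baseline * (1 - baseline) / mde_abs ** 2

# Checkout example: 3% baseline, 10% vs. 5% relative MDE
n_10 = sample_size_per_variation(0.03, 0.10)  # ~50,800 (unrounded z-scores)
n_05 = sample_size_per_variation(0.03, 0.05)  # ~203,000, exactly 4x n_10
```

Halving the MDE always quadruples the sample size, because the MDE enters the formula squared.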
**Pro Tip:** Optimizely's sample size calculator handles this math for you. Find it in the Stats Configuration panel when setting up a Frequentist (Fixed Horizon) test. Enter your baseline metric value, MDE, significance level, and number of variations. The calculator outputs visitors needed per variation. Use this number to estimate your test duration before you launch.
Why Teams Set MDE Too Low (And Suffer For It)
The most common MDE mistake I see: teams set the MDE to whatever lift would make them feel good, not to what's realistic or measurable.
"We want to see a 2% lift." Fine, but if your baseline is 3%, a 2% relative lift is an absolute change of just 0.06 percentage points, and detecting it at 95% confidence and 80% power takes roughly 1.27 million visitors per variation. If your page gets 50,000 visitors a month, you've just committed to a multi-year test.
Teams don't consciously choose to run multi-year tests. They just don't do the math upfront. They set an aspirational MDE, launch the test, and discover the problem months in when their boss asks why there's still no result.
The other failure mode: teams pick a high MDE to shorten the test, then miss real effects below that threshold. If you set MDE = 20% but your variant actually produces 12% lift, your test will probably not reach significance — and you'll incorrectly conclude the variant doesn't work.
The MDE is a design choice with real consequences. You're choosing what effects your experiment can and cannot see.
**Pro Tip:** Run your sample size calculation before you even write the test hypothesis. If the math says you need 300,000 visitors and your page gets 20,000/month, don't start that test. Either find a higher-traffic page, combine it with other pages in a multi-page test, or accept that this effect size is not measurable with your current traffic.
The ROI-Based Approach to Setting MDE
Here's the framework I use for every experiment: MDE = the lift that makes implementation worth it.
Before setting your MDE, answer this question: what is the minimum improvement that would justify the cost and risk of implementing this variant?
Work backward from business value:
Example: You're testing a new product recommendation algorithm. Development cost: $15,000. Your ecommerce revenue: $2M/month. Current cart page conversion rate: 8%.
If a 1% relative lift (0.08 percentage points absolute) on cart conversion = $16,000/month in incremental revenue, that breaks even in one month. Easy yes.
But a 1% relative lift on a baseline of 8% requires roughly 1.8 million visitors per variation to detect at 95% confidence and 80% power. At 100,000 cart page visits per week split across two variations, that's about 36 weeks.
Is a 36-week test worth it to detect a $16,000/month improvement? Possibly, if you're disciplined about running the full duration. But you're tying up your highest-traffic page for most of a year.
Now ask: what if the variant is beautifully designed but realistically produces only a 0.5% lift? The sample requirement quadruples to roughly 7.2 million visitors per variation, nearly three years of traffic, for an improvement worth only $8,000/month. The payback on the $15K investment would still be fast, but the test itself is no longer feasible. You're making a different decision.
Setting MDE = minimum commercially significant lift forces you to think about business value before you design the test. This reframes MDE from a statistical input to a strategic decision.
**Pro Tip:** When setting MDE, ask your development team what it would cost to implement the variant permanently. Then calculate how large the lift would need to be to recoup that cost within 6 months. That's your floor — the MDE should not be set below this number.
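That payback logic fits in a few lines of Python. The figures and the 6-month window below are the illustrative assumptions from this section, and the linear revenue scaling is a simplification:

```python
def mde_floor(implementation_cost, monthly_revenue, payback_months=6):
    """Smallest relative lift that recoups the build cost within the
    payback window, assuming revenue scales linearly with the metric."""
    required_monthly_gain = implementation_cost / payback_months
    return required_monthly_gain / monthly_revenue

# Recommendation-algorithm example: $15K build, $2M/month revenue
floor = mde_floor(15_000, 2_000_000)  # 0.00125, i.e. a 0.125% relative lift
```

In practice you'd set the MDE well above this floor; the floor only tells you where "commercially significant" begins.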
Relative vs. Absolute MDE: The Confusion That Breaks Sample Calculators
This trips up almost everyone the first time they use a sample size calculator.
Absolute MDE is expressed in the same units as your metric. If your conversion rate is 3%, an absolute MDE of 0.3 percentage points means you want to detect a change from 3.0% to 3.3%.
Relative MDE is expressed as a percentage of your baseline. A 10% relative MDE on a 3% baseline = 0.3 percentage points absolute change.
They sound different. They describe the same change.
The problem: most sample calculators ask for MDE without clearly specifying which format they expect. Some want absolute (0.003), others want relative (10%), some accept either.
Optimizely's built-in calculator asks for "Minimum Detectable Effect" expressed as a relative improvement percentage. So you'd enter 10 (for 10%), not 0.003.
If you accidentally swap these, your sample size calculation is off by a factor of 10-100, and your test duration estimate is completely wrong.
**Pro Tip:** Always sanity check your MDE input by computing both the relative and absolute values and confirming they match before you finalize your sample size. If your baseline is 3% and you enter a 10% relative MDE, the absolute change should be 0.3 percentage points — from 3.0% to 3.3%. Confirm that's what you intend.
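A trivial pair of helpers makes that sanity check mechanical (assuming a binary conversion metric; the names are mine):

```python
def relative_to_absolute(baseline, relative_mde):
    """10% relative on a 3% baseline -> 0.003 (0.3 percentage points)."""
    return baseline * relative_mde

def absolute_to_relative(baseline, absolute_mde):
    """0.003 absolute on a 3% baseline -> 0.10 (10% relative)."""
    return absolute_mde / baseline

baseline = 0.03
print(relative_to_absolute(baseline, 0.10))   # ~0.003: a move from 3.0% to 3.3%
print(absolute_to_relative(baseline, 0.003))  # ~0.10: the same change, stated relatively
```

If the two directions don't round-trip to the numbers you expect, you've mixed up the formats somewhere.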
Practical MDE Benchmarks by Test Type
Not every test type is equally sensitive to small changes. Here are the benchmarks I use based on 100+ experiments:
Layout and UX changes (navigation, page structure, hero section redesigns): 5-15% relative lift is realistic and worth detecting. These tests have high implementation cost, so you need meaningful evidence of substantial improvement.
Copy and headline changes (CTAs, value propositions, headlines): 10-20% relative lift. Good copy changes can move the needle significantly. Set MDE in this range to filter out small copy variations that aren't worth the ongoing maintenance.
Pricing and offer tests (pricing display, discount framing, urgency elements): 2-5% relative lift. These tests are high-stakes and worth running longer to detect smaller effects. Even a 2% lift in conversion on a $500 product has huge revenue impact.
Checkout flow tests (form simplification, payment methods, trust signals): 3-8% relative lift. Checkout is high-leverage real estate. Worth running for smaller detectable effects.
Personalization and recommendation tests: 5-15% relative lift. Effects vary dramatically by quality of implementation.
These are starting points, not rules. Always anchor your MDE to your specific business ROI calculation.
The Power Concept in Plain English
Statistical power is the probability that your test will detect a real effect of your specified MDE size, given your sample size and significance level.
80% power — the standard default — means: if there truly is an effect equal to or larger than your MDE, your test has an 80% chance of successfully reaching statistical significance.
The flip side: 20% of the time, even a real effect of your specified size will slip through undetected. This is a Type II error, or false negative.
Think of it like a metal detector at airport security. An 80%-power metal detector successfully identifies 80% of metal objects. 20% of metal objects pass through undetected. Improving to 90% power (a better detector) means fewer misses — but it takes longer to process each traveler (larger sample size).
Higher power costs sample size, and the relationship isn't linear. Going from 80% to 90% power on a typical conversion test increases sample size by roughly a third. Going from 80% to 95% power increases it by roughly two-thirds.
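The cost of extra power falls straight out of the formula: holding significance and MDE fixed, only the (Z_alpha + Z_beta)^2 term changes, so the sample size multiplier is the ratio of those squared sums. A quick Python check (the function name is mine):

```python
from statistics import NormalDist

def power_cost_ratio(power_new, power_old=0.80, significance=0.95):
    """Sample size multiplier when raising power, holding significance
    and MDE fixed: only the (Z_alpha + Z_beta)^2 term changes."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - (1 - significance) / 2)
    return ((z_alpha + z(power_new)) / (z_alpha + z(power_old))) ** 2

print(power_cost_ratio(0.90))  # ~1.34: 80% -> 90% power costs ~34% more sample
print(power_cost_ratio(0.95))  # ~1.66: 80% -> 95% power costs ~66% more sample
```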
For most commercial experimentation programs, 80% power is the right default. You're not a medical device trial; you're testing a homepage. The cost of a false negative is missing an improvement, which you can address by running more tests. The cost of dramatically increasing sample requirements (by pursuing 95% power) is slower iteration.
**Pro Tip:** If you're running a test where a false negative is expensive — say, a major site-wide redesign that you need to evaluate thoroughly before committing — increase power to 90% and accept the longer test duration. For routine CRO tests, 80% is appropriate.
What Happens When Your Test Ends Before Reaching MDE Sample Size
This comes up constantly. You planned a 6-week test. Business pressure forces you to call it at 3 weeks. You haven't hit your pre-specified sample size. Now what?
If you're using Optimizely's Stats Engine or Bayesian approach, you have more flexibility. Stats Engine provides valid continuous inference — if it's showing significance at 3 weeks, that inference is valid. The question is whether you have practical confidence in the result given the available data.
If you're using Frequentist Fixed Horizon and you stop early, your statistical guarantee is compromised. The correct interpretation: "We have suggestive but inconclusive evidence." Do not call a winner. Do not call a loser. Document the inconclusive result and either re-run with proper duration or treat it as a signal worth investigating further.
For results that fall short of your MDE sample size but show directional data:
- If the variant shows clear directional movement toward your MDE, it may be worth extending the test rather than calling it
- If you've collected twice your planned sample size and there's still no signal, the true effect is likely below your MDE; call it flat and move on
- Never interpret "didn't reach significance with partial data" as "no effect" — it means "underpowered to detect an effect"
**Pro Tip:** Build a "minimum runtime" into every experiment regardless of MDE. Optimizely's Frequentist configuration lets you set a minimum duration in days. Set it to at least 7 days to capture a full business cycle (weekday vs. weekend behavior patterns often differ significantly). A test that reaches its sample size in 3 days but only ran Mon-Wed is capturing biased traffic.
Common Mistakes
Mistake 1: Setting MDE after calculating sample size. MDE should determine your sample size, not be reverse-engineered from your available traffic. Start with the business question: what's the minimum meaningful lift? Then see if your traffic can support detecting it.
Mistake 2: Using the same MDE for every test. A 10% MDE on a 1% baseline and a 10% MDE on a 20% baseline require wildly different sample sizes. And the business implications are completely different. Calibrate MDE to the metric and the stakes every time.
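To put numbers on "wildly different," here is the sample size formula from earlier applied to the same 10% relative MDE at both baselines (a Python sketch; the function name is mine):

```python
from statistics import NormalDist

def n_per_variation(baseline, relative_mde, significance=0.95, power=0.80):
    """Visitors per variation, simplified binary-metric formula."""
    z = NormalDist().inv_cdf
    z_sum = z(1 - (1 - significance) / 2) + z(power)
    mde_abs = baseline * relative_mde
    return z_sum ** 2 * 2 * baseline * (1 - baseline) / mde_abs ** 2

print(n_per_variation(0.01, 0.10))  # ~155,000 per variation at a 1% baseline
print(n_per_variation(0.20, 0.10))  # ~6,300 per variation at a 20% baseline
```

Same relative MDE, roughly a 25x difference in required traffic.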
Mistake 3: Confusing relative and absolute MDE in the calculator. Covered above — but worth repeating because this single error sends teams on weeks-long wild goose chases when their actual test duration is 10x what they calculated.
Mistake 4: Ignoring MDE when interpreting results. If your test didn't reach significance, ask: was the true effect likely above or below your MDE? If your MDE was 10% and your observed lift (non-significant) is 2%, the effect is probably real but below your detection threshold. That's a different situation from observing a 0.1% lift.
Mistake 5: Setting MDE without considering velocity. An MDE that requires 20 weeks of data is not a useful MDE for a fast-paced team. Either increase your MDE (accept detecting only larger effects) or find more traffic. A test that can't complete in a reasonable timeframe shouldn't be designed that way in the first place.
What to Do Next
- Run the sample size calculation for your next three planned tests before you write the hypotheses. Use Optimizely's calculator or Evan Miller's A/B testing calculator. Record: baseline CVR, MDE, required sample per variation, estimated weeks to completion. If any test will take more than 8 weeks, reassess the MDE or traffic allocation.
- Create an MDE policy for your experimentation program. Document the minimum MDE by test type (using the benchmarks above as a starting point), your standard power (80%), and your standard significance threshold. Apply it consistently so your test results are comparable across the portfolio.
- Conduct an MDE retrospective on your last 10 experiments. For each: what MDE was planned? What lift was observed? Did the test reach significance? Use this data to calibrate whether your MDE benchmarks match your actual experience — and adjust accordingly.
- Add MDE to your experiment planning template. Every experiment plan should include: hypothesis, primary metric, baseline value, MDE (absolute and relative), required sample size per variation, estimated duration, and the ROI threshold that justifies implementation. If you can't fill in all of these before launch, the experiment isn't ready.