Before I run any experiment, I calculate the required sample size. Not as a formality — as a genuine feasibility check. At least 30% of the test ideas I've seen would have required six months to run properly given the available traffic. Knowing that upfront saves time, money, and false confidence.
Most teams skip this step or do it wrong. Here's how to do it right.
Why Sample Size Matters: The Underpowered Test Problem
An underpowered test is one that doesn't have enough data to reliably detect the effect you're looking for — even if that effect is real.
Suppose you're testing a new product page design. The new design genuinely improves conversion rate by 8% relative. But your test only reached 40% of the required sample size before someone called it. The test shows no significant result. You kill the variant. You just killed a real winner because you didn't collect enough data.
This is a Type II error — a false negative. You miss real effects. And unlike a false positive (shipping a winner that wasn't), a false negative is invisible. Nobody knows what you lost.
The rate of false negatives is controlled by statistical power. At 80% power and your required sample size, you'll correctly detect a true effect 80% of the time. Stop at 40% of the required sample size and power drops to roughly 43%: worse than a coin flip at detecting an effect that's really there.
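The falloff is easy to compute from the normal approximation. A minimal sketch, assuming a two-tailed test planned at 80% power (`power_at_fraction` is an illustrative name, not a library function):

```python
from statistics import NormalDist

def power_at_fraction(fraction, alpha=0.05, planned_power=0.80):
    """Approximate power when a test stops at `fraction` of the sample
    size it was planned for (two-tailed, normal approximation)."""
    z = NormalDist().inv_cdf
    # Noncentrality parameter at the full planned sample size
    ncp = z(1 - alpha / 2) + z(planned_power)
    # The detectable signal shrinks with the square root of the sample
    return NormalDist().cdf(ncp * fraction ** 0.5 - z(1 - alpha / 2))

print(round(power_at_fraction(1.0), 2))  # 0.8: the planned power
print(round(power_at_fraction(0.4), 2))  # 0.43 at 40% of the sample
```

This slightly understates the loss if anything, since it ignores the tiny chance of significance in the wrong direction.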
**Pro Tip:** Run your sample size calculation before writing a single line of test code. If the test isn't feasible given your traffic, find out now — not after 6 weeks of running it.
The Four Inputs Every Sample Size Calculator Needs
Every credible sample size calculator — Optimizely's, Evan Miller's, AB Testguide — needs exactly four inputs. Mess up any one of them and your calculation is wrong.
1. **Baseline conversion rate.** The current conversion rate of the metric you're optimizing. This must be the actual rate on the specific page being tested, not a site-wide average. If you're testing your checkout page, use checkout page conversion rate — not your overall site CVR.
2. **Minimum Detectable Effect (MDE).** The smallest relative lift you want the test to be able to detect. See the MDE guide for how to set this correctly — the short version is: set it at the minimum lift you'd actually ship. Check whether your calculator uses relative or absolute MDE.
3. **Statistical significance threshold.** Almost always 95% (alpha = 0.05, two-tailed). Some teams use 90% for exploratory tests to reduce sample size. Use 99% when the stakes are very high (pricing tests, checkout flow changes on high-revenue pages).
4. **Statistical power.** Almost always 80% (beta = 0.20). Use 90% for high-stakes tests where missing a real effect is costly.
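The four inputs map one-to-one onto a short function. A sketch using the unpooled-variance normal approximation for a two-proportion z-test; the function name is mine, not any calculator's API:

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline, mde_rel, alpha=0.05, power=0.80):
    """Per-variation sample size for a two-tailed two-proportion z-test,
    using the unpooled-variance normal approximation."""
    z = NormalDist().inv_cdf
    p1 = baseline
    p2 = baseline * (1 + mde_rel)  # relative MDE
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    delta = p2 - p1
    return math.ceil((z(1 - alpha / 2) + z(power)) ** 2 * variance / delta ** 2)

# The four inputs: baseline CVR, relative MDE, significance, power
print(sample_size_per_variation(0.03, 0.10, alpha=0.05, power=0.80))  # roughly 53,200
```

Online calculators use slightly different variants of this formula (pooled variance, continuity corrections), so expect results to agree within a few percent, not exactly.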
Worked Example with Real Numbers
Scenario: You're testing a new checkout page layout on an e-commerce site.
- Baseline CVR: 3.0%
- MDE: 10% relative (you want to detect a lift from 3.0% to 3.3% or larger)
- Significance: 95% (two-tailed)
- Power: 80%
Running this through a standard two-proportion z-test:
Required sample size: ~53,000 visitors per variation
Now calculate test duration:
- Weekly traffic to the checkout page: 20,000 sessions
- Split: 50/50 (10,000 per variation per week)
- Weeks required: 53,000 / 10,000 = 5.3 weeks
That's a manageable test. You'd plan for 6 weeks to be safe, check all four stopping conditions, and call it in week 6 if sample size and stability conditions are both met.
Now run the same calculation with a 5% relative MDE (you want to detect smaller effects): Required per variation: ~208,000 → 20.8 weeks at the same traffic level. Nearly 5 months. That test almost certainly shouldn't run.
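Both durations can be reproduced in a few lines; `weeks_needed` is an illustrative helper, and 10,000/week is the scenario's per-variation traffic, not a real dataset:

```python
from statistics import NormalDist

def weeks_needed(baseline, mde_rel, weekly_per_variation,
                 alpha=0.05, power=0.80):
    """Duration in weeks: required per-variation sample size (two-tailed
    two-proportion z-test, normal approximation) / weekly traffic."""
    z = NormalDist().inv_cdf
    p2 = baseline * (1 + mde_rel)
    variance = baseline * (1 - baseline) + p2 * (1 - p2)
    n = (z(1 - alpha / 2) + z(power)) ** 2 * variance / (p2 - baseline) ** 2
    return n / weekly_per_variation

print(round(weeks_needed(0.03, 0.10, 10_000), 1))  # 5.3 weeks
print(round(weeks_needed(0.03, 0.05, 10_000), 1))  # 20.8 weeks
```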
**Pro Tip:** If a sample size calculation produces a test duration over 8 weeks, stop and revisit the MDE. Either raise the MDE to something your traffic can support, find a higher-traffic page to test the same hypothesis on, or reconsider whether the test is worth running at all.
The Two Most Common Mistakes
Mistake 1: Using total site traffic instead of traffic to the tested page.
This is the most expensive mistake in sample size planning. If your site gets 100,000 sessions/week but only 15% of those users visit your checkout page, the effective traffic for a checkout test is 15,000/week — not 100,000. Using the wrong denominator makes tests look feasible when they aren't.
How to fix it: Find the specific page's session count in your analytics tool, filtered to the same audience segment your test will target. That's your weekly traffic figure.
Mistake 2: Not accounting for multiple variations.
Adding variations doesn't just "split the traffic more." It requires you to adjust for multiple comparisons, or your false positive rate inflates.
With two variations (A and B), a 95% significance threshold gives you a 5% false positive rate per comparison.
With three variations (A, B, and C), you have two comparisons (B vs. A, C vs. A). If you use 95% significance for each without adjustment, your family-wise false positive rate rises to approximately 9.75%.
The sample size implication: with a Bonferroni correction, a 3-variation test splits alpha across two comparisons (0.05 / 2 = 0.025 each), which raises the required per-variation sample size by roughly 20%. Add the third bucket on top of that and total required traffic is roughly 1.8× that of an A/B test — not the 1.5× you'd expect from naively splitting traffic three ways. Most teams budget for the extra bucket; they don't realize the per-variation requirement also increases.
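Both effects, the inflated family-wise rate and the extra per-variation sample the Bonferroni split demands, fall out of the same normal approximation (function names here are mine):

```python
from statistics import NormalDist

def fwer(alpha, comparisons):
    """Family-wise false positive rate with no correction applied."""
    return 1 - (1 - alpha) ** comparisons

def bonferroni_inflation(alpha=0.05, power=0.80, comparisons=2):
    """Growth factor of the per-variation sample size when alpha is split
    across comparisons; n scales with (z_alpha + z_power) squared."""
    z = NormalDist().inv_cdf
    base = (z(1 - alpha / 2) + z(power)) ** 2
    corrected = (z(1 - alpha / (2 * comparisons)) + z(power)) ** 2
    return corrected / base

print(round(fwer(0.05, 2), 4))           # 0.0975 family-wise
print(round(bonferroni_inflation(), 2))  # 1.21 per variation
# Total traffic vs. a 2-variation test: 3 buckets * 1.21 / 2 buckets ~ 1.8x
```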
**Pro Tip:** Unless you have a specific reason to run more than two variations, stick to A/B (not A/B/C). Adding a third variation requires roughly 1.8× the total traffic and significantly extends test duration. The marginal value of testing two variants simultaneously is rarely worth the cost.
MDE Sensitivity: The Number That Changes Everything
Understanding how sensitive sample size is to MDE changes helps you make better tradeoffs.
Using the same baseline (3% CVR, 95% significance, 80% power):
- 5% relative MDE → ~208,000 visitors per variation
- 10% relative MDE → ~53,200 visitors per variation (≈75% reduction)
- 15% relative MDE → ~24,200 visitors per variation
- 20% relative MDE → ~13,900 visitors per variation
Going from 5% to 10% MDE cuts required sample size by roughly 75%. Going from 10% to 20% cuts it by roughly 75% again. The relationship is an inverse square: sample size scales with 1/MDE², so doubling the MDE cuts the requirement to about a quarter.
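The table regenerates from the same two-proportion z-test formula (an illustrative sketch, not any specific calculator's implementation):

```python
import math
from statistics import NormalDist

def n_per_variation(baseline, mde_rel, alpha=0.05, power=0.80):
    """Two-tailed two-proportion z-test sample size (normal approximation)."""
    z = NormalDist().inv_cdf
    p2 = baseline * (1 + mde_rel)
    variance = baseline * (1 - baseline) + p2 * (1 - p2)
    return math.ceil((z(1 - alpha / 2) + z(power)) ** 2
                     * variance / (p2 - baseline) ** 2)

# Same baseline as above: 3% CVR, 95% significance, 80% power
for mde in (0.05, 0.10, 0.15, 0.20):
    print(f"{mde:.0%} MDE -> {n_per_variation(0.03, mde):,} per variation")
```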
This is the most powerful lever in your test design. Use it deliberately.
How to Read and Use the Sample Size Calculator in Optimizely
Optimizely's built-in sample size calculator (accessible in the test setup flow) defaults to 95% statistical significance and 80% power, and accepts a relative MDE.
The key output to focus on: unique visitors per variation. This is the number each bucket needs to reach before you have adequate power. Multiply by the number of variations to get total required exposure.
Optimizely also shows a projected test duration based on historical traffic to the target URL. This estimate assumes stable traffic — if you're running a test during a campaign period or a historically volatile traffic window, apply a buffer.
One thing Optimizely's calculator doesn't do by default: adjust for multiple variations with family-wise error rate correction. If you're running 3+ variations, calculate the Bonferroni-corrected per-comparison alpha manually (alpha / number of comparisons) and use that adjusted significance level in the calculator.
**Pro Tip:** Cross-check Optimizely's traffic estimate against your analytics platform. Optimizely uses its own tagged session counts, which may differ from your analytics source of truth if your tag coverage isn't complete.
What to Do When Your Traffic Is Too Low
You run the calculation. The test would take 6 months. Your options:
Option 1: Increase the MDE. Accept that you'll only detect larger effects. If business context justifies this (a 10% lift would be meaningful), this is the right call.
Option 2: Test a higher-traffic page. If you're trying to validate a hypothesis about value messaging, test it on your homepage (high traffic) before testing on your product page (lower traffic). Use a higher-traffic proxy to validate the direction, then refine on the actual target page.
Option 3: Consider qualitative research instead. If the page gets 500 sessions/week, A/B testing isn't the right tool. Run usability testing, session recordings, or user interviews. Get qualitative signal. Redesign based on that. Then monitor the before/after with analytics. That's not a test, but it's better than running an underpowered test for a year.
Option 4: Run a partial test with a directional hypothesis. If you have 50% of the required sample size, you have roughly 50% power. You won't reach a definitive conclusion, but you can say "directionally positive at [confidence level]" and use that as input for the product roadmap without claiming a winner.
**Pro Tip:** Be honest with your stakeholders about what low-traffic tests can produce. "We'll run this for 2 weeks and see" on a 500-session/week page is not a test — it's guessing with extra steps. Setting that expectation correctly upfront is better than defending inconclusive results 2 weeks later.
The "We'll Run It for 2 Weeks and See" Fallacy
This is the most common mistake in experimentation practice. "Two weeks" is not a statistical input. It's an arbitrary duration driven by sprint cycles, reporting cadences, and impatience.
A 2-week test might be perfectly calibrated for your traffic and MDE. Or it might give you 30% of the required sample size. The duration doesn't tell you anything on its own. The sample size does.
The correct process: calculate required sample size → divide by available traffic per variation → that gives you required duration. Then round up to the nearest full week to capture weekly cycles. That's your test duration. Not "2 weeks" as a default.
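That process fits in one small function; the numbers in the usage line are illustrative, and the sample size formula is the standard two-proportion normal approximation:

```python
import math
from statistics import NormalDist

def planned_duration_weeks(baseline, mde_rel, weekly_sessions,
                           variations=2, alpha=0.05, power=0.80):
    """Required sample size -> divide by per-variation weekly traffic ->
    round up to full weeks (two-tailed two-proportion z-test)."""
    z = NormalDist().inv_cdf
    p2 = baseline * (1 + mde_rel)
    variance = baseline * (1 - baseline) + p2 * (1 - p2)
    n = (z(1 - alpha / 2) + z(power)) ** 2 * variance / (p2 - baseline) ** 2
    return math.ceil(n / (weekly_sessions / variations))

# e.g. 3% baseline, 10% relative MDE, 20,000 sessions/week split 50/50
print(planned_duration_weeks(0.03, 0.10, 20_000))  # 6
```

The duration is an output of the calculation, not an input you pick.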
If the calculated duration is longer than you have time for, that's a signal to revisit your MDE — not to run the test anyway and hope for the best.
Common Mistakes
Mistake 1: Not running the sample size calculation before the test starts. Post-hoc sample size justifications are unreliable. Calculate upfront, commit to the number, and don't revisit it until the test is over.
Mistake 2: Treating the calculator's duration estimate as a guarantee. Traffic fluctuates. Campaigns change. A test that should take 4 weeks based on historical traffic might take 6 weeks during a slow period. Build buffer.
Mistake 3: Including sessions where the variant wasn't visible. If your change is below the fold and 30% of users never scroll to it, they shouldn't count toward your sample. Filter your analysis to users who were actually exposed to the experiment.
Mistake 4: Not checking the calculator's power assumptions. Some calculators default to 90% power instead of 80%. This isn't wrong, but it means you'll need more traffic. Know what your calculator assumes.
What to Do Next
Run sample size calculations for your next 3 planned tests before building anything. If any of them come back with durations over 8 weeks, use that as the trigger to revisit MDE or page selection.
For a complete test design template including sample size planning, stopping rules, and result documentation, see the Optimizely Practitioner Toolkit.