Why Sample Size Must Be Calculated Before the Test Begins
Calculating sample size before running an A/B test is not optional. It is a fundamental requirement of valid hypothesis testing. Without a predetermined sample size, you have no principled basis for deciding when to stop collecting data, which opens the door to the peeking problem and inflated false positive rates.
Think of sample size calculation as setting the rules of the game before play begins. Just as you would not start a basketball game and then decide mid-play whether it should be four quarters or three, you should not start collecting data and then decide how much is enough based on whether the results look favorable.
This guide walks through exactly how sample size calculation works, what inputs you need, and how to apply it in practice.
The Three Inputs to Sample Size Calculation
Every sample size calculation requires three inputs: your baseline conversion rate, your minimum detectable effect, and your desired statistical power. (The significance level also enters the calculation, but it is usually fixed at the conventional 5%; see the common mistakes section below.) Understanding each of these is essential.
Baseline Conversion Rate
Your baseline conversion rate is the current performance of whatever you are testing. If your pricing page converts 4% of visitors to trial signups, then 4% is your baseline. This number matters because it determines the amount of noise in your data. Lower conversion rates produce noisier data, which means you need more observations to detect real effects.
To understand why, imagine you have a page that converts at 50%. Half your visitors convert and half do not. The data is highly informative because each visitor tells you a lot about the conversion process. Now imagine a page that converts at 0.5%. For every converting visitor, you have 199 non-converters. The signal-to-noise ratio is much worse, and you need far more data to reliably detect changes.
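To put a number on that intuition, here is a minimal Python sketch (the function name and visitor counts are illustrative) comparing the standard error of an estimated conversion rate, expressed relative to the rate itself. With the same 10,000 visitors, a 50% rate is pinned down to within about 1% of its own value, while a 0.5% rate is only pinned down to within about 14%.

```python
import math

def relative_standard_error(rate, visitors):
    # Standard error of an estimated conversion rate, expressed as a
    # fraction of the rate itself (binomial / normal approximation).
    return math.sqrt(rate * (1 - rate) / visitors) / rate

visitors = 10_000
print(relative_standard_error(0.50, visitors))   # ~0.01 -> about 1% relative noise
print(relative_standard_error(0.005, visitors))  # ~0.14 -> about 14% relative noise
```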
Minimum Detectable Effect (MDE)
The minimum detectable effect is the smallest improvement you want your test to be able to detect. This is a business decision, not a statistical one. You need to decide what size of improvement is worth the effort of implementing a change.
A 1% relative improvement on a page that generates $10 million per year is worth $100,000 annually. A 1% improvement on a page generating $100,000 per year is worth $1,000. The MDE should reflect the minimum impact that justifies the cost of implementation and the opportunity cost of running the test.
MDE is typically expressed as a relative change. A 10% MDE on a 5% baseline means you want to detect a shift from 5.0% to 5.5% (or down to 4.5%). Smaller MDEs require dramatically larger sample sizes: the required sample grows roughly with the inverse square of the effect, so detecting a 1% relative lift requires roughly 100 times more data than detecting a 10% lift.
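You can see that scaling with a quick sketch of the standard normal-approximation sample size formula for comparing two proportions (the 2.80 constant is approximately z_alpha/2 + z_beta for a 5% two-sided significance level and 80% power; the exact value barely changes the ratio):

```python
# Normal-approximation sample size for a two-proportion test:
# n ≈ (z_alpha/2 + z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
def n_needed(p1, p2, z_total=2.80):   # 2.80 ≈ z for alpha=0.05 two-sided + 80% power
    return z_total**2 * (p1*(1 - p1) + p2*(1 - p2)) / (p2 - p1)**2

print(round(n_needed(0.05, 0.055)))   # 10% relative lift: ~31,000 per variation
print(round(n_needed(0.05, 0.0505)))  # 1% relative lift:  ~3,000,000 per variation
# The ratio is about 96x, close to the 100x predicted by the inverse-square rule.
```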
Statistical Power
Statistical power is the probability that your test will correctly detect a real effect when one exists. A power of 80% means that if there truly is an effect of the specified size, you have an 80% chance of your test identifying it. Conversely, you have a 20% chance of a Type II error, or false negative, where you miss the real effect entirely.
The 80% power standard was established as a practical convention. It balances the need for sensitivity against the cost of larger sample sizes. Higher power (90% or 95%) provides more protection against false negatives but requires substantially more data. For most business applications, 80% is a reasonable tradeoff.
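You can also run the calculation in the other direction: fix the sample size and ask how much power you actually have. A sketch under the usual normal approximation (the function name and visitor counts are illustrative):

```python
from scipy.stats import norm

def achieved_power(baseline, relative_mde, n_per_variation, alpha=0.05):
    # Approximate power of a two-sided two-proportion z-test.
    p2 = baseline * (1 + relative_mde)
    se = ((baseline * (1 - baseline) + p2 * (1 - p2)) / n_per_variation) ** 0.5
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(abs(p2 - baseline) / se - z_crit)

# A 10% lift on a 5% baseline with ~31,000 visitors per variation: ~80% power.
print(achieved_power(0.05, 0.10, 31_000))
# The same test with only 10,000 visitors per variation: roughly 35% power,
# i.e. about a 65% chance of missing a real effect (Type II error).
print(achieved_power(0.05, 0.10, 10_000))
```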
How the Three Inputs Interact
The relationship between these three inputs follows a clear pattern. All else being equal:
A lower baseline conversion rate requires a larger sample. More noise in the data means you need more data to find the signal.
A smaller minimum detectable effect requires a larger sample. Detecting subtle effects demands more precision, which requires more observations.
Higher statistical power requires a larger sample. Greater certainty about detecting real effects demands more evidence.
These relationships create important practical constraints. If you have low traffic and a low conversion rate, you may not be able to detect small effects with reasonable test durations. In these situations, you have a few options: target larger effects, accept lower power, or find ways to increase your traffic during the test period.
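If you have statsmodels available, its power utilities make these tradeoffs visible by varying one input at a time. In the sketch below (the specific rates and targets are illustrative), each call solves for the required visitors per variation:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def visitors_per_variation(baseline, relative_mde, power=0.80, alpha=0.05):
    # Cohen's h effect size for two proportions, then solve for the group size.
    effect = proportion_effectsize(baseline * (1 + relative_mde), baseline)
    return NormalIndPower().solve_power(effect_size=effect, alpha=alpha,
                                        power=power, alternative='two-sided')

print(visitors_per_variation(0.05, 0.10))              # reference case: ~31,000
print(visitors_per_variation(0.01, 0.10))              # lower baseline:  ~163,000
print(visitors_per_variation(0.05, 0.05))              # smaller MDE:     ~122,000
print(visitors_per_variation(0.05, 0.10, power=0.95))  # higher power:    ~52,000
```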
A Practical Walkthrough
Let us walk through a concrete example. Suppose you run an e-commerce site with the following characteristics:
Your product page has a 3% add-to-cart rate (baseline conversion rate). You want to detect a 15% relative improvement, meaning a shift from 3.0% to 3.45% (minimum detectable effect). You want 80% statistical power. Using standard sample size formulas, you would need approximately 23,000 visitors per variation, or about 46,000 visitors total for a two-variant test.
If your product page receives 2,000 visitors per day, you would need 23 days of traffic. Rounding up to complete weeks gives you 28 days, or four full weeks. This becomes your predetermined test duration.
Now consider what happens if you tighten your MDE to 5% relative (from 3.0% to 3.15%). The required sample size jumps to approximately 210,000 per variation. At 2,000 visitors per day, this would take 210 days, or about seven months. This is the harsh reality of trying to detect small effects with limited traffic.
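Here is a sketch of the walkthrough arithmetic end to end (normal-approximation formula; the exact totals shift by a few percent depending on which approximation your calculator uses, which is why you may see anywhere from roughly 23,000 to 24,000 quoted for the first case):

```python
import math
from scipy.stats import norm

def visitors_per_variation(baseline, relative_mde, power=0.80, alpha=0.05):
    # Normal-approximation sample size for a two-proportion test.
    p2 = baseline * (1 + relative_mde)
    z_total = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    variance = baseline * (1 - baseline) + p2 * (1 - p2)
    return math.ceil(z_total ** 2 * variance / (p2 - baseline) ** 2)

daily_visitors = 2_000

for mde in (0.15, 0.05):
    per_variation = visitors_per_variation(0.03, mde)
    total = 2 * per_variation               # control plus one variant
    days = math.ceil(total / daily_visitors)
    weeks = math.ceil(days / 7)
    print(f"MDE {mde:.0%}: {per_variation:,} per variation, {days} days (~{weeks} weeks)")

# MDE 15%: ~24,000 per variation and ~25 days -> the same four full weeks as above.
# MDE 5%:  ~208,000 per variation and ~208 days -> roughly seven months.
```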
How to Increase Statistical Power Without Infinite Traffic
When your sample size calculation reveals an impractically long test duration, you have several options to increase power without waiting forever:
Target larger effects. Instead of testing small tweaks, test bold changes that could produce larger lifts. A headline rewrite is more likely to produce a detectable effect than a button color change.
Test higher-traffic pages. Your homepage or main landing page will accumulate sample size faster than a niche product page. Prioritize tests where you can reach adequate sample size within a reasonable timeframe.
Use more sensitive metrics. Click-through rate is typically higher and less noisy than purchase rate. If your treatment affects upstream behavior, measuring an upstream metric can give you more statistical power with the same sample.
Reduce the number of variations. Testing five variations against a control requires dividing your traffic six ways. Testing one variation against a control only divides it two ways. Fewer variations mean more traffic per variation and faster tests, as the sketch below shows.
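The duration cost of extra variations is easy to quantify. A quick sketch (the daily traffic and required per-variation count are illustrative, taken from a calculation like the walkthrough above):

```python
import math

daily_visitors = 2_000           # traffic to the tested page
needed_per_variation = 24_000    # output of a sample size calculation

for arms in (2, 3, 6):           # the control counts as one arm
    daily_per_arm = daily_visitors / arms
    days = math.ceil(needed_per_variation / daily_per_arm)
    print(f"{arms} arms: {days} days to reach sample size")

# 2 arms: 24 days.  3 arms: 36 days.  6 arms (control + five variants): 72 days.
```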
Why Small Sample Sizes Produce Unreliable Results
When you run an underpowered test (one with insufficient sample size), two things happen, both bad. First, you are more likely to miss real effects because you do not have enough data to distinguish the signal from noise. Second, and more subtly, any effects you do detect are likely to be exaggerated.
This second phenomenon is called the winner's curse or Type M (magnitude) error. In an underpowered test, the only way a true effect clears the significance threshold is if random noise happens to amplify it. This means your "statistically significant" result overstates the true effect size. You might see a 25% lift in the test, implement the change, and then discover the actual improvement is only 5%.
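You can watch the winner's curse happen in simulation. The sketch below (numpy; every number is illustrative) runs thousands of underpowered tests in which the true lift is 10%, keeps only the ones that clear the significance threshold, and reports the average lift those "winners" show:

```python
import numpy as np

rng = np.random.default_rng(42)

baseline, true_lift = 0.03, 0.10         # true effect: 3.0% -> 3.3%
n_per_arm = 3_000                        # far below the required sample size
trials = 20_000

control = rng.binomial(n_per_arm, baseline, size=trials)
treatment = rng.binomial(n_per_arm, baseline * (1 + true_lift), size=trials)

p1, p2 = control / n_per_arm, treatment / n_per_arm
se = np.sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
significant = (p2 - p1) / se > 1.96      # apparent wins at the usual threshold

observed_lift = (p2 - p1) / p1
print(f"Share of tests that find a winner: {significant.mean():.0%}")   # ~10%
print(f"Average lift among those winners:  "
      f"{observed_lift[significant].mean():.0%}")                       # ~35%, not 10%
```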
This creates a toxic cycle: underpowered tests produce inflated estimates, which make stakeholders skeptical of future tests, which leads to pressure for faster (even more underpowered) tests. Breaking this cycle requires committing to proper sample size calculation, even when the answer is inconvenient.
Common Mistakes in Sample Size Calculation
Using the wrong baseline. Make sure your baseline conversion rate is measured over a representative period, not just the last few days. Weekly and seasonal fluctuations can significantly shift the number.
Confusing relative and absolute MDE. A 10% relative improvement on a 5% baseline is a shift to 5.5% (absolute increase of 0.5 percentage points). A 10% absolute improvement would mean a shift to 15%, which is a very different test. Always clarify which you mean.
Ignoring the significance level. Sample size also depends on your chosen significance level (alpha). The standard is 5% (p < 0.05). Lower alpha levels (more stringent significance thresholds) require larger samples.
Forgetting about one-tailed vs. two-tailed tests. A one-tailed test requires a smaller sample than a two-tailed test at the same significance level. Make sure you use the correct formula for your chosen test type; the sketch below shows how both this choice and the significance level move the required sample size.
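A sketch of those two choices using statsmodels (the 3.0% to 3.45% figures reuse the walkthrough example; results are approximate):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.0345, 0.03)   # 3.0% -> 3.45%, as in the walkthrough
solver = NormalIndPower()

print(solver.solve_power(effect_size=effect, alpha=0.05, power=0.80,
                         alternative='two-sided'))   # ~24,000 per variation
print(solver.solve_power(effect_size=effect, alpha=0.05, power=0.80,
                         alternative='larger'))      # one-tailed: ~19,000
print(solver.solve_power(effect_size=effect, alpha=0.01, power=0.80,
                         alternative='two-sided'))   # stricter alpha: ~36,000
```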
Integrating Sample Size into Your Testing Roadmap
Sample size calculation should inform your entire testing strategy, not just individual tests. Before prioritizing a test, calculate the required sample size and estimated duration. This helps you:
Assess feasibility. Some tests simply are not viable with your current traffic levels. Knowing this upfront prevents wasting time on tests that can never produce reliable results.
Prioritize effectively. If two tests have similar expected impact, prioritize the one that can reach statistical significance faster.
Set expectations with stakeholders. When leadership asks how quickly you can test something, sample size calculation gives you a data-driven answer instead of a guess.
Key Takeaways
Sample size calculation is non-negotiable for reliable A/B testing. The three inputs (baseline conversion rate, minimum detectable effect, and statistical power) together determine how much data you need. Running underpowered tests wastes resources and produces misleading results. Calculate your sample size before every test, use it to determine test duration, and resist the temptation to cut corners. This is the foundation on which all valid experimentation is built.