Why Most A/B Tests Are the Wrong Size

The single most common mistake in A/B testing is not calculating sample size before running the test. Teams launch experiments with no idea how long they need to run, check results daily, and either stop too early or run too long. Both are expensive.

Stopping too early means your results are unreliable — you are making decisions based on noise. Running too long means you wasted traffic that could have been allocated to other tests. Either way, you are leaving value on the table.

Sample size calculation solves this. It tells you, before you start, how much data you need to detect a meaningful effect with acceptable reliability. It turns experimentation from guesswork into engineering.

The Four Inputs You Need

Every sample size calculation requires four inputs. Change any one of them and the required sample size changes.

1. Baseline Conversion Rate

This is the current performance of whatever metric you are testing. If, say, 4% of checkout sessions currently complete, your baseline checkout completion rate is 4%. The further the baseline is from the extremes, the more variable your metric is — a proportion's variance p(1 − p) peaks at 50% — and the more data you need.

Get this from your analytics. Use a long enough window to account for seasonal variation — a few weeks of data usually suffices for stable metrics.

2. Minimum Detectable Effect (MDE)

This is the smallest improvement you want to be able to detect. It is a business decision, not a statistical one. If your test produces a real improvement below this threshold, you are okay with missing it because the effect is too small to be worth the effort of shipping.

Smaller MDEs require dramatically more sample size. Cutting the MDE in half roughly quadruples the required sample. This is the most important input because it directly reflects the trade-off between sensitivity and speed.

3. Significance Level

This controls the false positive rate — the probability of detecting an effect that does not exist. The conventional level is α = 0.05, meaning you accept a 5% chance of a false positive. Stricter thresholds reduce false positives but increase required sample size.

Most teams use 0.05 and move on. Adjusting it is warranted when the stakes are unusually high or low.

4. Statistical Power

This controls the false negative rate — the probability of missing an effect that does exist. The conventional level is 80% power, which means accepting a 20% chance of missing a real effect at the MDE. Higher power requires more sample.

Power is the most neglected input. Teams obsess over significance thresholds but ignore power, which means they run underpowered tests that are unlikely to detect real effects.

How These Inputs Interact

The relationship between these inputs follows a consistent pattern:

  • Smaller MDE → larger sample. If you want to detect smaller effects, you need more data. Sample size scales roughly with the inverse square of the MDE: halving it quadruples the sample.
  • Higher power → larger sample. Being more certain you will catch real effects costs more data.
  • Stricter significance → larger sample. Reducing false positives costs more data.
  • Baseline closer to extremes → smaller sample. Metrics near the boundaries of their range are less variable, so you need less data to detect changes.

The practical implication: you cannot have a tiny MDE, high power, strict significance, and a short test duration. Something has to give. The art of experiment design is choosing which trade-off to make.
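These relationships are easy to verify directly. The sketch below uses a simple unpooled normal approximation for a two-sided, two-proportion z-test; the 5% baseline and the 1–2 percentage point lifts are hypothetical numbers chosen for illustration:

```python
import math
from statistics import NormalDist

def n_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Visitors per variant for a two-sided two-proportion z-test
    (simple unpooled normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance quantile
    z_beta = NormalDist().inv_cdf(power)           # power quantile
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Halving the MDE (a 2-point lift vs. a 1-point lift on a 5% baseline)
# roughly quadruples the required sample per variant:
print(n_per_variant(0.05, 0.07))  # detect +2 percentage points
print(n_per_variant(0.05, 0.06))  # detect +1 percentage point
```

Raising `power` or lowering `alpha` in the same function shows the other two relationships: both inflate the required sample.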

Calculating Sample Size Step by Step

Here is the general approach:

Step 1: Define the metric. Be specific about what you are measuring and how. Conversion rate, revenue per visitor, clicks per session — the metric determines the type of calculation.

Step 2: Get the baseline. Pull the current value of the metric from your analytics, using a representative time window.

Step 3: Choose the MDE. Work with stakeholders to determine the smallest effect worth detecting. This is a business question: what improvement would justify the cost of implementation?

Step 4: Set significance and power. Use conventional levels unless you have a specific reason to deviate.

Step 5: Run the calculation. Use an online calculator, your experimentation platform's built-in tool, or a statistical package. The formula for proportions (conversion rates) involves the baseline rate, the MDE, the significance threshold, and the power level.

Step 6: Convert to duration. Divide the required sample size by your daily traffic to get the test duration. Account for weekday/weekend variation — run in full-week increments when possible.
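Steps 5 and 6 can be sketched end to end. The formula is the standard normal approximation for comparing two proportions; the baseline (4%), the absolute MDE (1 percentage point), and the traffic figure (1,000 visitors per day) are all hypothetical:

```python
import math
from statistics import NormalDist

def required_sample_per_variant(baseline, mde_abs, alpha=0.05, power=0.80):
    """Visitors per variant for a two-sided two-proportion z-test."""
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

# Step 5: 4% baseline, detect an absolute lift of 1 percentage point
n = required_sample_per_variant(baseline=0.04, mde_abs=0.01)

# Step 6: convert to duration for a two-variant test
daily_traffic = 1_000             # hypothetical visitors/day on this page
total_needed = n * 2              # both variants need the full sample
days = math.ceil(total_needed / daily_traffic)
weeks = math.ceil(days / 7)       # run in full-week increments
```

A dedicated calculator or your platform's tool may use a slightly different (e.g. pooled-variance) formula, so expect small differences in the exact number.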

Common Scenarios and Rules of Thumb

Low-traffic pages

If your sample size calculation says you need more visitors than you will get in a reasonable timeframe, you have three options:

  • Increase the MDE. Accept that you can only detect larger effects.
  • Use a more sensitive metric. Click-through rate is often more sensitive than conversion rate for the same change.
  • Aggregate across pages. If the change applies to multiple pages, pool the traffic.

Multiple variants

Testing more than two variants increases the required sample size. Each additional variant needs its own allocation, and you may need to adjust for multiple comparisons. A good rule: avoid testing more than three or four variants unless you have abundant traffic.
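One rough way to see the multiple-comparisons cost is a Bonferroni correction, which divides the significance level by the number of comparisons. It is a conservative choice, and the 5% baseline and 1-point MDE here are hypothetical:

```python
import math
from statistics import NormalDist

def n_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Visitors per variant, two-sided two-proportion z-test approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Three treatments against one control = three comparisons.
plain = n_per_variant(0.05, 0.06)                   # no correction
adjusted = n_per_variant(0.05, 0.06, alpha=0.05 / 3)  # Bonferroni
```

The corrected sample per variant is noticeably larger — and that larger figure applies to every one of the four arms, which is why variant count multiplies total traffic requirements quickly.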

Ratio metrics

Metrics like revenue per user or pages per session have higher variance than simple conversion rates, which means they require larger samples. If your primary metric is a ratio, expect longer test durations.
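For a continuous metric like revenue per user, the required sample scales with the metric's variance, which is why high-variance metrics need longer tests. A minimal sketch using the standard two-sample-means formula, with a hypothetical $40 standard deviation and a $2 MDE:

```python
import math
from statistics import NormalDist

def n_per_variant_mean(sigma, mde, alpha=0.05, power=0.80):
    """Per-variant sample for comparing two means (e.g. revenue per user),
    assuming both variants share standard deviation sigma."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2)

# Hypothetical: revenue per user with sigma = $40, detect a $2 lift
n_revenue = n_per_variant_mean(sigma=40, mde=2)
```

Because the result is proportional to sigma squared, a metric with twice the spread needs four times the sample for the same MDE.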

Sequential testing

If you want to monitor results throughout the test (not just at the end), use sequential testing methods. These allow early stopping with statistical validity but require larger sample sizes than fixed-horizon designs for equivalent power.
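To see why unadjusted peeking needs these methods, a small A/A simulation (two identical variants, so any "win" is a false positive) can estimate how often "stop at the first significant look" fires. The parameters below — 8 interim checks, a 5% conversion rate, 500 simulated tests — are arbitrary illustration values:

```python
import math
import random
from statistics import NormalDist

def peeking_false_positive_rate(n_sims=500, n_per_arm=2000, checks=8,
                                p=0.05, alpha=0.05, seed=7):
    """A/A simulation: fraction of no-effect tests that 'stop when
    significant' declares a winner at some interim check."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    step = n_per_arm // checks
    hits = 0
    for _ in range(n_sims):
        a = b = 0  # conversions in each arm (identical true rate p)
        for i in range(1, n_per_arm + 1):
            a += rng.random() < p
            b += rng.random() < p
            if i % step == 0:  # an interim "peek"
                pooled = (a + b) / (2 * i)
                se = math.sqrt(2 * pooled * (1 - pooled) / i)
                if se > 0 and abs(a / i - b / i) / se > z_crit:
                    hits += 1
                    break
    return hits / n_sims
```

Even with only eight peeks, the realized false positive rate comes out well above the nominal 5% — the inflation that sequential designs are built to control.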

The Practical Trade-Offs

Speed vs. sensitivity

Smaller MDEs give you more sensitivity but longer tests. In fast-moving organizations, a test that runs for months blocks other experiments on the same surface. Sometimes it is better to accept a larger MDE and run more tests than to optimize one test for a tiny effect.

Power vs. throughput

Higher power makes each test more reliable but reduces the number of tests you can run per quarter. Some organizations intentionally run tests at moderate power to increase throughput, accepting that they will miss some real effects.

Significance vs. false discovery rate

Stricter significance reduces false positives per test but increases sample size per test. If you run many tests, the cumulative false discovery rate may be more important than the per-test significance level.

What Happens When You Skip the Calculation

Teams that do not calculate sample size in advance typically:

  • Run tests for an arbitrary duration ("let's check in two weeks") that may be far too short or unnecessarily long.
  • Stop when significant regardless of whether the sample size is adequate, inflating false positive rates.
  • Declare null results when the test was not powered to detect a reasonable effect, wasting the learning opportunity.
  • Lose trust in experimentation because results are unreliable and do not replicate.

Sample size calculation is not academic overhead. It is the foundation that makes every other part of the experimentation process work.

Building Sample Size Into Your Process

The best experimentation teams make sample size calculation a mandatory step in the experiment design process. Before any test is approved to launch:

  1. The hypothesis is documented with a clear primary metric.
  2. The MDE is agreed upon by the product and data teams.
  3. The sample size is calculated and converted to a test duration.
  4. The duration is evaluated against the experimentation roadmap.
  5. The test is approved or scoped differently based on feasibility.

This process takes fifteen minutes per experiment. It saves weeks of wasted traffic and prevents the organizational damage that comes from acting on unreliable results.

FAQ

Do I need to calculate sample size for every test?

Yes. Even a rough calculation is better than none. Without it, you have no basis for deciding when to stop the test or how to interpret the results.

What if I do not know the baseline conversion rate?

Estimate it from the closest data you have. An imprecise baseline produces an imprecise sample size estimate, but it is still far better than no estimate. You can also run a brief measurement period before launching the test.

Should I use per-visitor or per-session sample size?

Use the unit of randomization. If you randomize by visitor (most A/B tests), calculate per-visitor sample size. If you randomize by session, use per-session. Mixing units produces incorrect calculations.

How do I handle tests with multiple goals?

Calculate sample size for the primary metric. Secondary metrics are evaluated with whatever power the primary metric's sample provides. If a secondary metric needs its own power guarantee, include it in the calculation as an additional constraint.

Written by Atticus Li

Revenue & experimentation leader — behavioral economics, CRO, and AI. CXL & Mindworx certified. $30M+ in verified impact.