The Most Dangerous Mistake in A/B Testing: Stopping Too Early

If you have ever stopped an A/B test the moment it hit statistical significance, you have likely made decisions based on false positives. This is not a minor issue. It is the single most common mistake in online experimentation, and it leads to implementing changes that have no real effect on your business.

The question of how long to run an A/B test seems straightforward, but the answer involves understanding several interconnected statistical concepts, business realities, and the nature of web traffic itself. This guide will give you a clear framework for determining test duration that protects you from false discoveries.

Why You Should Never Stop at Significance Alone

Here is a fact that surprises most people: if you run an A/A test (where both versions are identical, meaning there is no real difference), and you check for statistical significance repeatedly as data accumulates, roughly 771 out of 1,000 tests will reach significance at the 90% confidence level at some point during the test. That means the vast majority of tests will falsely appear significant if you peek at the results and stop whenever you see significance.
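You can watch this effect in a small simulation. The sketch below assumes a hypothetical A/A setup: two identical variants with a 3% true conversion rate, 1,000 visitors per variant per day, a two-proportion z-test checked once per day for 30 days, and a 90% confidence threshold. The exact false positive count depends on how often and how long you peek, so treat the parameters as illustrative rather than a reproduction of the figure above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def aa_test_peeks(p=0.03, visitors_per_day=1000, days=30, alpha=0.10):
    """Run one A/A test with identical variants, peeking at a two-proportion
    z-test after each day. Returns True if any peek crosses significance."""
    z_crit = norm.ppf(1 - alpha / 2)  # two-sided 90% threshold, ~1.645
    conv_a = conv_b = n = 0
    for _ in range(days):
        conv_a += rng.binomial(visitors_per_day, p)
        conv_b += rng.binomial(visitors_per_day, p)
        n += visitors_per_day
        pooled = (conv_a + conv_b) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        if se > 0 and abs(conv_a - conv_b) / n / se > z_crit:
            return True  # stopped "at significance" -- a false positive by design
    return False

hits = sum(aa_test_peeks() for _ in range(1000))
print(f"{hits} of 1,000 A/A tests showed 90% significance at some peek")
```

Even with only 30 daily peeks, the false positive rate lands far above the nominal 10%; more frequent checks or a longer horizon push it higher still.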

This happens because of a phenomenon called the peeking problem. Statistical significance is designed to be evaluated at a single, predetermined point. When you check results continuously, you are running multiple comparisons, and each check gives you another chance to find a false positive. The math simply does not support the practice of stopping whenever results look good.

Think of it this way: if you flip a fair coin ten times, you might get seven heads. That looks like a biased coin. But if you flip it 1,000 times, you will converge on roughly 50/50. Early results are noisy. They fluctuate wildly. And if you stop at a convenient moment when the noise happens to favor your hypothesis, you will draw the wrong conclusion.

Predetermine Your Sample Size Before the Test Begins

The solution to the peeking problem is simple in principle: decide how many visitors (or conversions) you need before you start the test, and do not look at results until you reach that number. This is called predetermining your sample size.

Sample size calculation requires three inputs:

Baseline conversion rate: Your current conversion rate. If your page converts at 3%, that is your baseline.

Minimum detectable effect (MDE): The smallest improvement you care about detecting. If you only care about detecting lifts of 10% or more (from 3% to 3.3%), that is your MDE.

Statistical power: The probability that you will detect a real effect when one exists. The standard is 80%, meaning you accept a 20% chance of missing a real effect.

Once you calculate your required sample size, divide it by your daily traffic to get the minimum number of days you need to run the test. But this is only the starting point for determining duration.
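As a sketch, here is the standard two-proportion sample size formula (normal approximation) in Python. Note that calculators also fix a significance level, conventionally 5%, alongside the three inputs above. The function name and the 5,000-visitors-per-day figure in the usage lines are my own illustrations, not from the article.

```python
import math
from scipy.stats import norm

def required_sample_size(baseline, mde_relative, power=0.80, alpha=0.05):
    """Visitors needed per variation to detect a relative lift of
    mde_relative over baseline (two-proportion z-test, normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + mde_relative)
    z_alpha = norm.ppf(1 - alpha / 2)  # significance threshold (two-sided)
    z_beta = norm.ppf(power)           # power requirement
    p_bar = (p1 + p2) / 2
    top = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(top / (p2 - p1) ** 2)

# The example above: 3% baseline, 10% relative MDE, 80% power.
n = required_sample_size(0.03, 0.10)  # on the order of 53,000 per variation
days = math.ceil(2 * n / 5000)        # two variations, 5,000 visitors/day
print(n, days)                        # roughly 22 days of traffic
```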

The Two-Week Minimum: Running for Full Business Cycles

Even if your sample size calculation says you need only five days of traffic, you should run your test for at least two full weeks. Here is why.

Website traffic is not uniform throughout the week. Monday visitors behave differently from Saturday visitors. They arrive through different channels, have different intent, and convert at different rates. If you run a test from Monday to Friday, you have completely excluded weekend traffic. Your results describe only weekday behavior, not your actual user base.

Two full weeks gives you two complete cycles of all seven days. This means your sample includes the full spectrum of user behavior across your weekly traffic pattern. It also helps smooth out any single-day anomalies like a random spike in bot traffic or an unusual referring campaign.

For businesses with monthly cycles (such as SaaS with renewal patterns or e-commerce with pay-cycle effects), you may need to extend this to a full month. The principle is the same: capture at least one complete business cycle to ensure your sample is representative.

Why Magic Numbers Like 100 Conversions Are Dangerous

You will sometimes hear rules of thumb like "wait until you have 100 conversions per variation." These are dangerous oversimplifications that ignore the relationship between your baseline rate, your expected effect size, and the required statistical power.

Consider two scenarios. A page with a 50% conversion rate needs far fewer visitors to detect a 10% relative improvement than a page with a 2% conversion rate. The math is fundamentally different. A flat rule of 100 conversions would dramatically underpower the low-conversion test while potentially overshooting the high-conversion one.
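Here is the contrast in numbers, reusing the same two-proportion formula as the earlier sketch (function name again illustrative): at a 10% relative MDE and 80% power, the 50% page needs on the order of 1,600 visitors per variation, while the 2% page needs on the order of 80,000.

```python
import math
from scipy.stats import norm

def required_sample_size(p1, lift, power=0.80, alpha=0.05):
    """Same two-proportion formula as the earlier sketch."""
    p2 = p1 * (1 + lift)
    za, zb = norm.ppf(1 - alpha / 2), norm.ppf(power)
    pbar = (p1 + p2) / 2
    top = (za * math.sqrt(2 * pbar * (1 - pbar))
           + zb * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(top / (p2 - p1) ** 2)

for baseline in (0.50, 0.02):
    n = required_sample_size(baseline, 0.10)
    print(f"{baseline:.0%} baseline, 10% lift: {n:,} visitors per variation")
# Prints ~1,565 for the 50% page and ~80,700 for the 2% page:
# a ~50x difference that no single magic number can cover.
```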

The only reliable approach is proper sample size calculation based on your specific baseline, your specific MDE, and your desired power level. There are no shortcuts that maintain statistical rigor.

External Factors That Affect Test Duration

Beyond sample size and business cycles, several external factors can contaminate your test results if not properly accounted for:

Day-of-Week Effects

Traffic composition shifts dramatically across the week. B2B sites see heavy weekday traffic and minimal weekend traffic; e-commerce sites often see the reverse. If your test does not capture complete weekly cycles, your results are biased toward whichever days your test happened to run.

Traffic Source Variations

A new paid campaign launching mid-test can fundamentally alter your traffic composition. Paid visitors typically have different behavior patterns than organic visitors. If your variation happens to perform better (or worse) specifically with the new paid traffic, your test results reflect the campaign timing rather than the actual treatment effect.

Return Visitors

Return visitors who have already seen the original page may react differently to your variation than first-time visitors. If your test is too short to accumulate a representative mix of new and returning visitors, your results may not generalize to your full audience.

Holidays and Seasonal Events

Running a test during Black Friday and extrapolating results to normal periods is a recipe for disappointment. Holiday behavior is fundamentally different from normal behavior. People have different purchase intent, different price sensitivity, and different browsing patterns. Unless you specifically want to optimize for holiday traffic, avoid running tests across major seasonal events.

Marketing Campaigns

An email blast, a viral social post, or a press mention can temporarily flood your site with visitors who behave nothing like your typical audience. These traffic spikes can dominate your test results if they happen to coincide with your testing period. The best practice is to note when such events occur and consider their impact on your results.

The Peeking Problem: A Deeper Look

The peeking problem deserves additional attention because it is so prevalent. When you check test results daily (or worse, multiple times per day), you are essentially running a new statistical test each time. Each check has a small probability of producing a false positive. Over many checks, these probabilities compound.

If your significance threshold is 5% (p < 0.05) and you check daily for 30 days, your actual false positive rate is far higher than 5%. It can be 20%, 30%, or even higher depending on how the data accumulates. This means that a sizable share of your "winning" tests are actually false positives.

There are legitimate methods for monitoring tests in progress, such as sequential analysis or alpha spending functions, which mathematically adjust for multiple looks at the data. But the default approach of simply watching a dashboard and stopping when results look good is not one of them.
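As an illustration of the idea (not a production recipe), a simplified O'Brien-Fleming-style rule divides the usual z-threshold by the square root of the information fraction, so early looks must clear a much higher bar that relaxes toward the standard threshold at the final look. Real implementations compute calibrated boundaries with group-sequential software; the look count here is an assumption for the sketch.

```python
from math import sqrt
from scipy.stats import norm

alpha = 0.05
looks = 5  # planned interim analyses, equally spaced (illustrative)
z_final = norm.ppf(1 - alpha / 2)  # ~1.96, the usual fixed-horizon threshold

for k in range(1, looks + 1):
    t = k / looks                 # information fraction at look k
    boundary = z_final / sqrt(t)  # strict early, relaxing to ~1.96 at the end
    print(f"look {k}: stop only if |z| > {boundary:.2f}")
# look 1 demands |z| > ~4.38, so an early lucky streak cannot end the test
```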

A Practical Framework for Test Duration

Here is the framework that combines all of these considerations:

Step 1: Calculate required sample size based on baseline conversion rate, minimum detectable effect, and 80% power.

Step 2: Divide the required sample size by daily traffic to get the minimum number of days.

Step 3: Round up to the nearest complete business cycle (at minimum two full weeks).

Step 4: Check for known external events (holidays, campaigns, product launches) and adjust timing to avoid them or account for them.

Step 5: Commit to the predetermined duration and do not stop early based on intermediate results.
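Here is a sketch of steps 1 through 3 wired together. The helper repeats the two-proportion formula from the earlier sketch; the two-week floor and round-to-whole-weeks policy implement step 3, and the traffic figure in the example call is hypothetical.

```python
import math
from scipy.stats import norm

def required_sample_size(baseline, lift, power=0.80, alpha=0.05):
    """Step 1: visitors per variation (two-proportion z-test formula)."""
    p2 = baseline * (1 + lift)
    za, zb = norm.ppf(1 - alpha / 2), norm.ppf(power)
    pbar = (baseline + p2) / 2
    top = (za * math.sqrt(2 * pbar * (1 - pbar))
           + zb * math.sqrt(baseline * (1 - baseline) + p2 * (1 - p2))) ** 2
    return math.ceil(top / (p2 - baseline) ** 2)

def planned_duration_days(baseline, lift, daily_traffic, variations=2):
    """Steps 2-3: convert the sample size to days, then round up to whole
    weeks with a two-week floor so every weekday is equally represented."""
    n = required_sample_size(baseline, lift)
    raw_days = math.ceil(variations * n / daily_traffic)  # step 2
    weeks = max(2, math.ceil(raw_days / 7))               # step 3
    return weeks * 7  # steps 4-5 are calendar checks and commitment, not math

print(planned_duration_days(0.03, 0.10, daily_traffic=5000))  # 28
```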

This framework will not always be convenient. Sometimes it will tell you that you need to run a test for six weeks when you wanted an answer in three days. But convenience and statistical rigor are often at odds, and choosing convenience means accepting a much higher rate of false discoveries.

When Short Tests Are Acceptable

There are legitimate cases where shorter test durations work. High-traffic sites with simple binary metrics (like click-through rate on a button) can accumulate sufficient sample sizes quickly. If your sample size calculation says you need 10,000 visitors per variation and you get 50,000 per day, you might reach your required sample size in a few days.

However, even in these cases, the two-week minimum for capturing full business cycles still applies. Having a large enough sample size does not solve the problem of non-representative samples. You need both sufficient volume and representative composition.

The Business Cost of Getting Duration Wrong

Getting test duration wrong has real business consequences in both directions. Stop too early, and you implement changes based on noise. These false winners do not improve your metrics, but you burn engineering resources implementing them and you lose the opportunity to test something that might actually work.

Run too long, and you waste time that could be spent on the next test. In experimentation, your testing velocity matters. The faster you can run valid tests, the faster you learn and improve. The key is running tests that are long enough to be valid but no longer than necessary.

The predetermined sample size approach optimizes this tradeoff. It tells you exactly how long you need, and not a day more.

Key Takeaways

Never stop a test just because it reached statistical significance. Predetermine your sample size before starting. Run for at least two full business cycles. Account for external factors that can contaminate results. And resist the temptation to peek at results and stop early. These practices will dramatically improve the reliability of your experimentation program and ensure that the changes you implement actually move your metrics in the right direction.
