When to Stop an A/B Test: Stopping Rules That Won't Wreck Your Data

"It's clearly winning — can we just ship it?"

"We've been running for 2 weeks and it's not significant. Kill it."

"Finance wants to stop the test because the variation is hurting revenue."

These are the three scenarios where experiments get stopped for the wrong reasons. All three can lead to decisions you'll regret: shipping a false positive that costs revenue, killing a real winner for lack of patience, or ending a test based on metric movements that are within normal noise.

Here's a framework for stopping experiments that won't wreck your data.

Why Early Stopping Is Dangerous in Classical Statistics

In standard frequentist A/B testing, the statistical guarantee of 95% confidence rests on a fixed protocol: you decide on a sample size before the test starts, run until you hit it, and look at results exactly once at the end.

The moment you start peeking at interim results and making stop/continue decisions based on what you see, you inflate your Type I error rate. Research by Evan Miller and others shows that if you check results continuously during a test and stop when you first hit p<0.05, your actual false positive rate is around 22-26% — not 5%. You'll think you have a winner when you have noise roughly one-quarter of the time.

This is called the peeking problem, and it's why so many "winning" tests fail to hold up after shipping.
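
You can verify the inflation yourself with a quick simulation. Here is a minimal sketch (the 3% conversion rate, 500 daily visitors per arm, and 30 daily looks are all placeholder assumptions): it runs A/A tests where both arms are identical, applies a two-proportion z-test every day, and counts how often any interim look dips below p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def peeking_false_positive_rate(n_sims=2000, days=30, visitors_per_day=500, cvr=0.03):
    """Simulate A/A tests (no true effect) with a daily significance check.

    Returns the fraction of simulations where at least one interim
    z-test on conversion rates reached p < 0.05.
    """
    false_positives = 0
    for _ in range(n_sims):
        # Cumulative conversions per arm after each day
        a = rng.binomial(visitors_per_day, cvr, size=days).cumsum()
        b = rng.binomial(visitors_per_day, cvr, size=days).cumsum()
        n = visitors_per_day * np.arange(1, days + 1)
        with np.errstate(divide="ignore", invalid="ignore"):
            p_pool = (a + b) / (2 * n)                   # pooled CVR each day
            se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)  # pooled standard error
            z = (a - b) / (n * se)                       # z-stat on CVR difference
            p_values = 2 * stats.norm.sf(np.abs(z))
        if np.nanmin(p_values) < 0.05:
            false_positives += 1
    return false_positives / n_sims

print(peeking_false_positive_rate())  # typically ~0.15-0.25, far above the nominal 0.05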

How Optimizely's Stats Engine Changes the Equation

Optimizely built Stats Engine specifically to solve the peeking problem. It uses a sequential testing methodology called the mixture Sequential Probability Ratio Test (mSPRT), which allows you to check results at any time without increasing your false positive rate.

The significance badges you see in Optimizely are "always-valid p-values" — they account for the fact that you're looking at the test multiple times. The 95% confidence threshold in Optimizely genuinely means 95% confidence at any point during the test, not just at a predetermined endpoint.

This is a real and meaningful difference from tools using classical statistics. You are not peeking in the problematic sense when you check your Optimizely results page.

However — and this is critical — Stats Engine does not replace the need for minimum run time. Statistical validity is not the same thing as business validity.

**Pro Tip:** Even with Stats Engine, a test that hits significance after 36 hours is almost certainly showing you weekend-to-weekday variation, a promotional spike, or early-adopter bias. The significance is real — the effect may not generalize to your full visitor population over time.
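
To make "always-valid" less mysterious, here is a simplified sketch of the idea behind the mSPRT, based on the published always-valid-inference literature rather than Optimizely's actual implementation. It covers the single-stream Gaussian case with known variance; the mixing variance tau2 is an assumed tuning parameter.

```python
import numpy as np

def always_valid_p_values(x, sigma2, tau2=1.0):
    """Always-valid p-values for H0: mean = 0, Gaussian data with known
    variance sigma2, using a N(0, tau2) mixture over the alternative mean.

    A simplified sketch of the mSPRT: p_n can be checked after every
    observation without inflating the false positive rate.
    """
    n = np.arange(1, len(x) + 1)
    xbar = np.cumsum(x) / n
    # Mixture likelihood ratio after n observations (closed form for the
    # Gaussian case); it grows as the running mean drifts away from 0.
    lam = np.sqrt(sigma2 / (sigma2 + n * tau2)) * np.exp(
        (n ** 2) * tau2 * xbar ** 2 / (2 * sigma2 * (sigma2 + n * tau2))
    )
    # Invert the running maximum of the likelihood ratio into a p-value;
    # the running max makes the sequence monotone (never un-significant).
    return np.minimum(1.0, 1.0 / np.maximum.accumulate(lam))

# Example: per-visitor outcome differences between variation and control
rng = np.random.default_rng(0)
diffs = rng.normal(0.1, 1.0, size=5000)   # assumed true lift of 0.1
p = always_valid_p_values(diffs, sigma2=1.0)
print(np.argmax(p < 0.05))  # first index where we could stop (0 = never crossed)
```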

Business Stopping Criteria vs. Statistical Stopping Criteria

You need two independent sets of criteria to stop a test responsibly. Both must be satisfied.

Statistical criteria (Optimizely handles these for you):

  • Reached your pre-specified confidence threshold (95% or 99%)
  • Confidence interval does not cross zero
  • Result holds for your pre-specified primary metric, not a secondary metric picked after the fact

Business criteria (you must define these):

  • Minimum run time: at least 1 full business cycle (7 days minimum; 14 days preferred)
  • Sufficient sample size: the number of visitors needed to detect your minimum detectable effect at your chosen power level
  • Trend stability: the confidence interval has stabilized over the last 3-5 days and is not still moving meaningfully

The most common error is satisfying statistical criteria and ignoring business criteria. A test that hits 95% confidence on Day 3 with 1,200 visitors per arm does not have business validity — the effect size is being estimated from a non-representative sample of your visitor population.
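
If you want that pre-specified sample size computed rather than guessed, a standard power calculation does it. A minimal sketch with statsmodels, assuming a 3% baseline CVR and a 10% relative minimum detectable effect (swap in your own numbers):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_cvr = 0.03    # your current conversion rate (assumed)
mde_relative = 0.10    # smallest relative lift worth detecting (assumed)
target_cvr = baseline_cvr * (1 + mde_relative)

# Cohen's h effect size for the two proportions
effect_size = proportion_effectsize(target_cvr, baseline_cvr)

# Visitors needed per arm at alpha = 0.05 (two-sided) and 80% power
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, ratio=1.0
)
print(f"{n_per_arm:,.0f} visitors per arm")  # roughly 53,000 for these inputs
```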

Minimum Viable Run Time: The One Rule Everyone Ignores

Regardless of significance, your test needs to run through at least one full business cycle.

For most e-commerce and SaaS products, that's 7 days — enough to include both weekday and weekend behavior. For B2B products with strong Monday-Thursday concentration, it might be 14 days to capture two full weekly cycles.

Why does this matter if Stats Engine handles the statistics?

Because your visitors on Day 1 of a test are different from your visitors on Day 7. Early traffic skews toward your most engaged, most frequent visitors. They're more likely to convert regardless. The effect size you measure in the first 48-72 hours is almost always larger than the effect you'll see in steady state.

**Pro Tip:** Check your test's time-trend chart in Optimizely. If the confidence interval was wide and moving significantly in the first 3 days and then stabilized, that's a healthy pattern. If it's still moving significantly at Day 12, the test may be picking up seasonal or external variation — extend the run.
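
If you export the daily interval bounds, you can make that stability judgment mechanical instead of visual. A hypothetical sketch; the 3-day window and 1-point tolerance are assumptions to tune, not Optimizely defaults:

```python
def ci_is_stable(lower_bounds, upper_bounds, window=3, tolerance=0.01):
    """Return True if neither CI endpoint moved more than `tolerance`
    (in absolute lift, e.g. 0.01 = 1 point) over the last `window` days.

    lower_bounds / upper_bounds: one value per day of the test, oldest first.
    """
    if len(lower_bounds) < window + 1:
        return False  # not enough days to judge stability
    recent_lo = lower_bounds[-(window + 1):]
    recent_hi = upper_bounds[-(window + 1):]
    lo_drift = max(recent_lo) - min(recent_lo)
    hi_drift = max(recent_hi) - min(recent_hi)
    return lo_drift <= tolerance and hi_drift <= tolerance

# Days 5-12 of a hypothetical test, lift CI in absolute points
lower = [-0.20, -0.11, -0.06, -0.03, 0.01, 0.01, 0.01, 0.02]
upper = [ 0.25,  0.19,  0.16,  0.14, 0.12, 0.12, 0.11, 0.11]
print(ci_is_stable(lower, upper))  # True: both endpoints have settled
```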

The "It's Clearly Losing" Case: When Is It OK to Stop Early?

Sometimes a test is genuinely harmful and needs to stop. Here's how to make that call responsibly.

First, calculate the business cost of continuing. Suppose your variation is showing -8% relative CVR on your primary metric with 95% confidence, your baseline conversion rate is 3%, you get 5,000 visitors per day on a 50/50 split, and your average order value is $50. The absolute CVR loss is 3% × 8% = 0.24 percentage points, so continuing costs roughly 2,500 variation visitors × 0.0024 × $50 = $300 per day.

If you have 7 days left in the planned run, that's about $2,100, and you have a real decision to make.
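
This back-of-envelope math is worth scripting so every stop-a-loser conversation uses the same arithmetic. A minimal sketch using the assumed numbers above:

```python
def cost_of_continuing(daily_visitors, baseline_cvr, relative_lift,
                       avg_order_value, days_remaining, variation_share=0.5):
    """Estimated revenue lost by keeping a losing variation live.

    relative_lift: e.g. -0.08 for an 8% relative CVR drop.
    variation_share: fraction of traffic bucketed into the variation.
    """
    absolute_cvr_change = baseline_cvr * relative_lift
    daily_cost = daily_visitors * variation_share * absolute_cvr_change * avg_order_value
    return daily_cost * days_remaining

loss = cost_of_continuing(
    daily_visitors=5_000, baseline_cvr=0.03, relative_lift=-0.08,
    avg_order_value=50, days_remaining=7,
)
print(f"${-loss:,.0f} over the remaining week")  # $2,100 with these inputs
```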

The framework:

  1. Is the losing result statistically significant (not just directional)?
  2. Is the confidence interval entirely below zero and not moving toward zero?
  3. Does the magnitude of the loss exceed your pre-defined guardrail threshold?
  4. Are there no secondary metrics suggesting the loss is noise or measurement error?

If all four are yes, stopping early for a clear loser is defensible. Document your reasoning, note the early stop in your experiment record, and treat the result with appropriate caution: the effect size will be slightly overestimated because of the early stop.

**Pro Tip:** Before stopping a losing test, check the device and new/returning segments. A test that loses overall but wins on your most valuable segment (say, logged-in returning users on desktop) may be worth continuing with a modified targeting strategy rather than stopping entirely.

The Extended-Run Problem: Why Running Tests Too Long Is Also Wrong

Most practitioners worry about stopping too early. They should also worry about stopping too late.

When you run a test for 6-8 weeks, you accumulate a different problem: returning visitors who have been exposed to their assigned experience over many sessions. A visitor who saw the control version in Week 1, came back in Week 3 still in control, and returned again in Week 6 has built up five weeks of habituation to the control. Their behavior in Week 6 is not representative of a new visitor's first impression.

This is selection bias, and it gets worse the longer you run. The segment of visitors who return repeatedly over a 6-week window is not representative of your full visitor population — it skews toward power users, loyal customers, and people who are engaged enough to return multiple times.

The practical rule: don't run tests past 4-6 weeks without a specific reason. If you haven't hit significance by 4 weeks and your power calculation suggested you should have, the effect is probably smaller than your minimum detectable effect — and you should stop, document it as inconclusive, and size a larger test if the hypothesis still seems worth testing.
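
You can make "smaller than your minimum detectable effect" concrete by inverting the power calculation: given the sample you actually collected, what lift were you powered to detect? A sketch with statsmodels, assuming 15,000 visitors per arm and a 3% baseline (both placeholders):

```python
import numpy as np
from statsmodels.stats.power import NormalIndPower

# Smallest effect size (Cohen's h) detectable at 80% power with the
# sample you actually have; a smaller true effect was never going to resolve.
detectable_h = NormalIndPower().solve_power(
    effect_size=None, nobs1=15_000, alpha=0.05, power=0.80, ratio=1.0
)

baseline_cvr = 0.03  # assumed
# Convert Cohen's h back into the detectable variation CVR
detectable_cvr = np.sin(np.arcsin(np.sqrt(baseline_cvr)) + detectable_h / 2) ** 2
print(f"Smallest detectable lift: {detectable_cvr / baseline_cvr - 1:+.1%}")
# roughly +19% relative lift at these inputs
```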

Practical Stopping Rules for Mature Testing Programs

Here is the complete stopping rule framework I use. All conditions must be met before calling a result.

To call a winner:

  1. Minimum 7 days of runtime (14 for B2B)
  2. Reached pre-specified sample size for the primary metric
  3. Primary metric confidence interval entirely above zero at 95%+ confidence
  4. No guardrail metrics in statistically significant negative territory
  5. Result stable for at least 3 consecutive days (confidence interval not substantially moving)

To call a loser:

  1. Minimum 7 days of runtime
  2. Reached pre-specified sample size
  3. Primary metric confidence interval entirely below zero at 95%+ confidence
  4. OR: Test has run for 2× the planned duration and remains inconclusive with the confidence interval centered near zero

To call inconclusive and stop:

  1. Ran for planned duration (or 2× if results were directionally promising)
  2. Confidence interval crosses zero and shows no trend toward resolution
  3. Segment analysis has been completed and documented

**Pro Tip:** When you reach significance, don't stop the test immediately. Let it run for 2-3 more days to confirm the confidence interval is stable. Sometimes a test will briefly cross the significance threshold during a data spike and then regress back to inconclusive. Waiting for stabilization costs you 2-3 days but saves you from a false ship.

Decision Flowchart: Should I Stop This Test Right Now?

Use this in order (the same logic is sketched as code after the list):

  1. Has it run fewer than 7 days? → Do not stop. Continue.
  2. Is there a Sample Ratio Mismatch (SRM) greater than 2%? → Stop. Fix the implementation. Restart the test.
  3. Is a guardrail metric in statistically significant negative territory? → Stop. Investigate before continuing.
  4. Is the primary metric statistically significant with the CI fully above zero? → Is it stable for 3+ days? If yes: call the winner and stop. If no: continue for 3 more days and re-evaluate.
  5. Is the primary metric statistically significant with the CI fully below zero AND past minimum runtime? → Run the business cost calculation. Stop if cost of continuing exceeds threshold.
  6. Has it run past 2× planned duration with an inconclusive result? → Stop. Document as inconclusive.
  7. Otherwise: Continue.
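
Here is that flowchart as code. Field names and thresholds are illustrative assumptions, not Optimizely API values:

```python
from dataclasses import dataclass

@dataclass
class TestState:
    """Snapshot of a running experiment. All fields are illustrative."""
    days_run: int
    planned_days: int
    srm_mismatch_pct: float           # deviation from the planned traffic split
    guardrail_significantly_negative: bool
    primary_significant: bool
    ci_lower: float                   # CI on lift for the primary metric
    ci_upper: float
    stable_days: int                  # consecutive days the CI has held steady

def stopping_decision(t: TestState, min_days: int = 7) -> str:
    if t.days_run < min_days:
        return "continue: minimum runtime not reached"
    if t.srm_mismatch_pct > 2.0:
        return "stop: sample ratio mismatch, fix implementation and restart"
    if t.guardrail_significantly_negative:
        return "stop: guardrail breach, investigate before continuing"
    if t.primary_significant and t.ci_lower > 0:
        if t.stable_days >= 3:
            return "stop: call the winner"
        return "continue: wait ~3 days for the interval to stabilize"
    if t.primary_significant and t.ci_upper < 0:
        return "run the business cost calculation; stop if it exceeds threshold"
    if t.days_run >= 2 * t.planned_days:
        return "stop: document as inconclusive"
    return "continue"

print(stopping_decision(TestState(
    days_run=10, planned_days=14, srm_mismatch_pct=0.4,
    guardrail_significantly_negative=False, primary_significant=True,
    ci_lower=0.02, ci_upper=0.11, stable_days=4,
)))  # stop: call the winner
```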

Common Mistakes

Stopping at first significance. The first time your test crosses 95% confidence is the worst time to stop. Stats Engine is designed for continuous monitoring, but the effect size estimate is most volatile early on. Give it 3 more days.

Using a fixed calendar deadline as the stopping rule. "We stop everything at 30 days" ignores whether you actually have enough data. A test with 500 visitors per arm at Day 30 is not interpretable. Either get more traffic to the page being tested or don't test it.

Not having pre-specified stopping criteria. When you define stopping rules after you see results, you're unconsciously choosing criteria that support the conclusion you want. Write your stopping rules in your experiment brief before launch.

Stopping a losing test before the business math justifies it. "It's trending negative" is not a stopping criterion. Directional data with wide confidence intervals is noise. Wait for the interval to cross zero with significance before acting.

Ignoring velocity. If your confidence interval was at -20% to +25% on Day 5 and is now at -2% to +18% on Day 12, that's a narrowing interval trending positive. It needs more time. If it was -2% to +18% on Day 5 and is still -2% to +18% on Day 12, the interval is stable — no new information is coming in.

What to Do Next

  1. Review your last 3 experiments and check whether they hit your minimum run time criteria. Did you stop any before 7 days? What would the result have looked like at Day 14?
  2. Document your stopping criteria in your next experiment brief before the test launches. Get stakeholder sign-off on them before results come in.
  3. Calculate the business cost of continuing for your current running test that's in a negative trend. Is it worth the data quality to run the full planned duration?
  4. Read the results page walkthrough to ensure you're reading the confidence intervals correctly before making stopping decisions.
Written by Atticus Li

Revenue & experimentation leader — behavioral economics, CRO, and AI. CXL & Mindworx certified. $30M+ in verified impact.