The question I get asked most often after "how do I set up a test" is "can I stop this test yet?" It's the most common place where experimentation programs go wrong — and the damage is invisible. You stop too early, you ship a false positive. You stop too late, you introduce seasonal drift. You stop for political reasons, you break your statistical model entirely.

I've reviewed dozens of experiment portfolios where the stopping decisions were the primary source of misleading results. Here's the framework I now use with every team.

The Three Wrong Reasons Teams Stop Tests

Before the framework, let's name the failure modes I see constantly.

Wrong Reason 1: "It looks like it's winning." A variant shows 72% probability of beating control at day 4. Team ships it. Three months later, they wonder why their overall CVR didn't improve. The test was right directionally, but the effect was smaller than it looked at day 4 — and they shipped a version that underdelivered.

Wrong Reason 2: "We need the slot for another test." Resource pressure is real, but this produces a predictable error: you stop tests that haven't accumulated enough data to be trustworthy, then fill your roadmap with results you can't rely on. A year later your test history is a list of false wins that never moved the revenue needle.

Wrong Reason 3: "The PM is asking daily." Stakeholder pressure is the most common stopping trigger in practice. The correct response is to show them the pre-set stopping conditions and hold the line. Every test that gets stopped because someone was impatient is a data point you paid for and threw away.

Why Stopping at Significance in Classical Stats Is Wrong

Here's the peeking problem with actual numbers, because it's important enough to understand precisely.

Suppose you're running a frequentist test at 95% significance (alpha = 0.05). You've determined you need 20,000 visitors per variation. You check the p-value every day.

If there's truly no difference between A and B (null is true), you'd expect to see p < 0.05 by random chance 5% of the time at your final analysis. But if you check the test every day for 14 days and stop the moment you see p < 0.05, the probability that you'll find a false positive at some point during those 14 checks is approximately 23% — not 5%.

You've increased your false positive rate by nearly 5x just by checking daily.

This is not a hypothetical concern. In a program running 50 tests per year, this means you'd ship roughly 11-12 false positives instead of the 2-3 you thought you were accepting. Those false positives compound: some block future tests from running, some get attributed as wins in quarterly reviews, some become the "proven" UX patterns you apply elsewhere.
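
The inflation is easy to reproduce yourself. Here is a minimal Monte Carlo sketch (the visitor counts, number of looks, and simulation size are illustrative assumptions, not figures from any real program) that recomputes a z-statistic daily under a true null and "ships" at the first significant look:

```python
import math
import random
from statistics import NormalDist

def peeking_false_positive_rate(days=14, visitors_per_day=1000,
                                alpha=0.05, sims=4000, seed=7):
    """Simulate daily peeking on an A/B test where the null is true.

    Each day adds a batch of unit-variance noise; we recompute the
    z-statistic on the cumulative data and stop the moment |z| crosses
    the fixed-horizon critical value.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(sims):
        total, n = 0.0, 0
        for _ in range(days):
            # Sum of visitors_per_day iid N(0, 1) outcomes for the day.
            total += rng.gauss(0.0, math.sqrt(visitors_per_day))
            n += visitors_per_day
            if abs(total) / math.sqrt(n) > z_crit:
                false_positives += 1
                break
    return false_positives / sims

print(peeking_false_positive_rate())  # lands far above the nominal 0.05
```

With 14 daily looks the simulated rate lands in the low twenties of percent, consistent with the ~23% figure above; with a single look at the end (`days=1`) it returns roughly 0.05.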

**Pro Tip:** If you're running frequentist tests and you've been peeking, do an audit of your last 12 months of winners. Look for tests that were stopped within the first half of their planned runtime. Those results are suspect.

Why Sequential Testing Changes This

Sequential testing, specifically the "always valid" inference approach used in Optimizely's Stats Engine and related methods such as Evan Miller's simple sequential testing, mathematically corrects for the peeking problem.

The mechanism: the confidence interval is inflated early in the test when you have little data, and narrows as data accumulates. At every checkpoint, the inference remains valid. You can check daily. You can stop when results cross the significance threshold. The false positive rate stays at your stated alpha.

This is not the same as saying "you can stop whenever you feel like it." The statistical validity of early stopping is guaranteed, but you still need to account for:

  • Novelty effects: Users interact with new experiences differently in the first few days. Lift from novelty isn't real lift.
  • Day-of-week effects: If you start on a Tuesday and stop on a Tuesday 8 days later, you've captured two Tuesdays but only one of every other day, including a single weekend. Your sample is not representative.
  • Interaction with other changes: Campaigns, site updates, and seasonality can all contaminate early results.

Sequential testing solves the statistical problem. You still need business logic for the stopping decision.
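
To make the mechanism concrete, here is an illustrative sketch of a mixture sequential probability ratio test (mSPRT), the construction behind always-valid inference. The one-sample normal model with known variance and the mixture parameter tau are simplifying assumptions for illustration; tools like Stats Engine apply the same idea to two-sample conversion data.

```python
import math

def msprt_boundary(n, sigma=1.0, tau=1.0, alpha=0.05):
    """Threshold on |sample mean| at which the mSPRT rejects theta = 0.

    Mixture likelihood ratio for X_i ~ N(theta, sigma^2) with prior
    theta ~ N(0, tau^2):
        Lambda_n = sqrt(sigma^2 / (sigma^2 + n tau^2))
                   * exp(n^2 tau^2 mean^2 / (2 sigma^2 (sigma^2 + n tau^2)))
    Rejecting when Lambda_n >= 1/alpha keeps the false positive rate at
    alpha no matter how often you look (Ville's inequality).
    """
    s2, nt2 = sigma ** 2, n * tau ** 2
    log_thresh = math.log(1.0 / alpha) + 0.5 * math.log((s2 + nt2) / s2)
    return math.sqrt(2.0 * s2 * (s2 + nt2) * log_thresh / (n ** 2 * tau ** 2))

# The sequential boundary is wider than the fixed-horizon half-width at
# every n, and both shrink as data accumulates; the gap is the premium
# you pay for the right to peek at every checkpoint.
for n in (100, 1_000, 10_000):
    fixed = 1.96 / math.sqrt(n)  # fixed-horizon 95% half-width
    print(n, round(msprt_boundary(n), 4), round(fixed, 4))
```

This is exactly the "inflated early, narrows later" behavior described above: the rejection region is conservative when n is small and tightens as evidence accumulates.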

The Four-Condition Stopping Framework

I use this checklist before calling any test. All four conditions must be met.

Condition 1: Minimum duration met. Run every test for at least 7 calendar days, regardless of traffic. This captures at least one full week of day-of-week variation. For tests on pages with significant weekend vs. weekday behavior differences (most e-commerce), 14 days is the minimum.

Condition 2: Sample size target hit. This means hitting the per-variation visitor count from your pre-test sample size calculation, not some percentage of it. For a test designed at the standard 80% power, collecting 80% of the required sample leaves you with only about 70% power at your stated MDE, and 70% of the sample leaves you near 65%. That's not enough to call a test.
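
The penalty for stopping at a fraction of the planned sample follows from the normal approximation for a two-sided z-test. A sketch (the 80%-power design baseline and z-test approximation are assumptions):

```python
import math
from statistics import NormalDist

def power_at_fraction(c, designed_power=0.80, alpha=0.05):
    """Approximate power when only a fraction c of the planned sample
    is collected.

    At the planned n, the standardized effect equals
    z_{1-alpha/2} + z_{designed_power}; collecting c*n shrinks it
    by sqrt(c).
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_beta = nd.inv_cdf(designed_power)
    return nd.cdf(math.sqrt(c) * (z_alpha + z_beta) - z_alpha)

for c in (1.0, 0.8, 0.7, 0.5):
    print(f"{c:.0%} of planned sample -> {power_at_fraction(c):.0%} power")
```

Half the planned sample leaves a test designed at 80% power with roughly a coin flip's chance of detecting the effect it was sized for.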

Condition 3: Results stable for 3+ days. The conversion rate in each variation, and the point estimate of the difference, should not be changing materially over the last 3 days of the test. If the lift estimate has moved from +8% to +3% to +6% to +9% over the last 4 days, the test isn't stable. "Stable" means the trend direction is consistent and the magnitude is converging, not oscillating.

Condition 4: Secondary metrics reviewed. Before stopping, check that secondary metrics don't tell a conflicting story. A test that lifts CVR by 8% but reduces average order value by 12% is a net negative. A test that improves clicks but tanks downstream conversion is not a win. Look at revenue per visitor, not just conversion rate, before calling it.

**Pro Tip:** Create a one-page stopping checklist in whatever project management tool your team uses. Require sign-off on all four conditions before shipping any test result. This single process change eliminates most premature stopping in experimentation programs.

When It's OK to Stop a Losing Test Early

The four-condition framework is for tests where you're evaluating a potential winner. Losing tests are different.

It is acceptable — and often correct — to stop a test early when:

  • The variant is performing materially worse than control (e.g., the entire confidence interval on relative lift sits below -5%) and has been for at least 5 days
  • The test is causing user experience harm (errors, increased support contacts, degraded performance metrics)
  • An external factor has contaminated the test (a major sale, a site outage affecting one variation, a traffic source change)

Stopping a clearly harmful test early isn't statistical malpractice — it's responsible program management. Document why you stopped it, note that results are inconclusive, and don't use the partial data to claim the null hypothesis was proven.

**Pro Tip:** Set a "harm threshold" before every test: if the variant CVR is more than X% below control for Y consecutive days, we auto-stop. This removes politics from the stopping decision for losing tests.
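
That harm threshold is mechanical enough to encode directly. A sketch (the -5% / 5-day defaults echo the example above; `daily_relative_lift`, a list of per-day lift readings, is an assumed input shape):

```python
def should_auto_stop(daily_relative_lift, harm_threshold=-0.05,
                     consecutive_days=5):
    """True if the variant's daily lift vs. control has been below the
    harm threshold for the last `consecutive_days` readings."""
    recent = daily_relative_lift[-consecutive_days:]
    return (len(recent) == consecutive_days
            and all(lift < harm_threshold for lift in recent))

# Five straight days worse than -5%: auto-stop, document as inconclusive.
print(should_auto_stop([-0.02, -0.06, -0.07, -0.08, -0.06, -0.09]))  # True
# One recovery day inside the window resets the trigger.
print(should_auto_stop([-0.06, -0.07, 0.01, -0.08, -0.06]))          # False
```

Because the rule is agreed before launch and evaluated by code, nobody has to argue about whether a losing test "deserves more time."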

The Extended-Run Problem

Most stopping-rules content focuses on stopping too early. The opposite problem gets less attention: running tests too long.

After 4-6 weeks, several contamination sources accumulate:

Returning visitor bias: Users who were exposed to your variant early in the test have now been exposed many times. Their behavior in week 6 reflects familiarity with the variant, not initial response to it. This is particularly acute for navigation and UX tests.

Seasonal drift: Running a test from mid-October through mid-December captures two fundamentally different shopping periods. The combined CVR is a meaningless blend of both.

Other site changes: Four weeks is long enough for a new marketing campaign, a pricing update, or a site redesign to partially deploy alongside your test. These interactions corrupt the isolated measurement you're trying to get.

Practical guideline: Cap test runtime at 4 weeks for most tests. If you haven't hit your sample size in 4 weeks, your MDE was set too low for your traffic level. Stop the test, re-evaluate your MDE, and re-run with a more realistic effect size threshold.
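
Re-evaluating the MDE is a direct calculation. A sketch using the standard two-proportion sample-size approximation (the baseline conversion rate, traffic figures, and 50/50 split are illustrative assumptions):

```python
import math
from statistics import NormalDist

def min_detectable_relative_lift(weekly_visitors, weeks=4, baseline_cvr=0.03,
                                 alpha=0.05, power=0.80):
    """Smallest relative lift a 50/50 test can detect given its traffic.

    Inverts n_per_arm = 2 * p(1-p) * (z_{1-a/2} + z_power)^2 / delta^2
    for the absolute effect delta, then expresses it relative to the
    baseline conversion rate p.
    """
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)
    n_per_arm = weekly_visitors * weeks / 2
    p = baseline_cvr
    delta = z * math.sqrt(2 * p * (1 - p) / n_per_arm)
    return delta / p

# If 4 weeks of traffic can only power a ~20%+ relative MDE, a hoped-for
# +5% effect is not testable on this page: fix the MDE, not the runtime.
print(round(min_detectable_relative_lift(5_000), 3))
```

Quadrupling the traffic halves the detectable lift, which is why the honest fix for a slow test is usually a bigger swing or a busier page, not a longer runtime.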

**Pro Tip:** For seasonal businesses, be especially careful about tests that span campaign periods (Black Friday, back-to-school, etc.). Either complete the test before the event starts, or start it after the event ends. Tests that run through major traffic events are rarely reliable.

What "Results Stabilized" Actually Means

"Stability" in test results is often described vaguely. Here's a precise definition I've found useful:

A test is stable when:

  1. The conversion rate for each variation has varied by less than 15% of its mean value over the last 5 days
  2. The relative lift estimate (B vs. A) has varied by less than 20% of its mean value over the last 3 days
  3. The confidence interval is narrowing, not widening

If variant B has shown lifts of +2.1%, +3.8%, +5.1%, +2.4%, +4.7% over the last 5 days, the test is not stable — even if the average is +3.6% and p < 0.05. The variance in the daily estimate suggests you're still seeing high day-to-day noise.

Stability is not about the final confidence level. A test can be at 97% confidence but still unstable if the lift estimate is moving around. Wait for convergence.
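
The three rules translate directly into a check. A sketch (one reading of the definition: "varied by" is interpreted here as the range divided by the mean, and CI narrowing as non-increasing widths; both interpretations are assumptions the definition leaves open):

```python
def relative_range(values):
    """Spread of a series as a fraction of its mean."""
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / abs(mean)

def is_stable(daily_cvrs, daily_lifts, ci_widths):
    """Apply the three stability rules.

    daily_cvrs: dict of variation -> last 5 daily conversion rates
    daily_lifts: last 3+ daily relative-lift estimates (B vs. A)
    ci_widths: recent confidence-interval widths, oldest first
    """
    rule1 = all(relative_range(c[-5:]) < 0.15 for c in daily_cvrs.values())
    rule2 = relative_range(daily_lifts[-3:]) < 0.20
    rule3 = all(a >= b for a, b in zip(ci_widths, ci_widths[1:]))
    return rule1 and rule2 and rule3

# The oscillating example from the text fails rule 2:
lifts = [2.1, 3.8, 5.1, 2.4, 4.7]
cvrs = {"A": [0.050, 0.051, 0.049, 0.050, 0.050],
        "B": [0.051, 0.053, 0.052, 0.051, 0.052]}
print(is_stable(cvrs, lifts, [1.4, 1.2, 1.1]))  # False
```

On the example series the last three lifts (+5.1%, +2.4%, +4.7%) span roughly 66% of their mean, far beyond the 20% tolerance, so the check correctly refuses to call the test stable.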

Decision Flowchart: Should I Stop This Test Right Now?

Answer these six questions in order:

  1. Has it run for at least 7 calendar days (14 for tests with strong day-of-week patterns)? If no — keep running.
  2. Has each variation reached the pre-calculated sample size target? If no — keep running.
  3. Has the lift estimate been stable for 3+ consecutive days? If no — keep running.
  4. Have all secondary metrics been reviewed and show no negative offsetting effects? If no — review them first.
  5. Is there a clear winner at your significance threshold, or a clear loser below your harm threshold? If neither — keep running.
  6. Are there any external factors (seasonal events, campaigns, site changes) that may have contaminated the data? If yes — document and assess before stopping.

If all six pass: stop the test and proceed to the shipping decision.
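
The flowchart reads naturally as an ordered gate function, returning the first blocking answer. A sketch (the field names are invented for illustration; the thresholds come from the framework above, and a real version would pull these fields from your experimentation platform):

```python
def stop_decision(test):
    """Walk the six questions in order; return the first blocking
    answer, or a stop verdict when every gate passes."""
    min_days = 14 if test["strong_day_of_week_pattern"] else 7
    if test["days_run"] < min_days:
        return "keep running: minimum duration not met"
    if test["visitors_per_variation"] < test["required_sample_size"]:
        return "keep running: sample size target not hit"
    if test["stable_days"] < 3:
        return "keep running: lift estimate not stable"
    if not test["secondary_metrics_reviewed"]:
        return "review secondary metrics first"
    if not (test["clear_winner"] or test["clear_loser"]):
        return "keep running: no clear result"
    if test["external_contamination"]:
        return "document and assess contamination before stopping"
    return "stop: proceed to the shipping decision"

example = {"strong_day_of_week_pattern": False, "days_run": 9,
           "visitors_per_variation": 21_000, "required_sample_size": 20_000,
           "stable_days": 4, "secondary_metrics_reviewed": True,
           "clear_winner": True, "clear_loser": False,
           "external_contamination": False}
print(stop_decision(example))  # stop: proceed to the shipping decision
```

The early-return structure mirrors the "answer these in order" instruction: a test never reaches question 4 until questions 1 through 3 have passed.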

What to Do After Stopping

If there's a winner: Ship it, document the lift and confidence level, and schedule a follow-up test to iterate from the new baseline.

If there's a loser: Document what you learned about why the hypothesis was wrong. This is as valuable as a win — it updates your model of what works for your users.

If inconclusive: Document as "no significant effect detected at [MDE]." This is a valid result. It means the effect, if there is one, is probably smaller than your MDE; it does not prove there is no effect. Plan a follow-up only if qualitative evidence suggests the hypothesis is still worth testing on a higher-traffic page.

The partial rollout option: For high-traffic sites, you can ship to 50% of users while collecting additional signal. This is particularly useful when the confidence is borderline (85-92%) — you get real-world validation while limiting exposure risk.

Common Mistakes

Mistake 1: Keeping pre-change data after modifying the variant. If you change the test variant mid-run, the clock resets: any data collected before the change is contaminated and should be discarded.

Mistake 2: Using "statistical significance" and "stability" interchangeably. You can have significance without stability and stability without significance. Both conditions are required.

Mistake 3: Stopping immediately when you hit significance. Even with sequential testing, hitting significance on day 3 doesn't mean you should stop on day 3. Minimum duration and stability conditions still apply.

Mistake 4: Not documenting stopping decisions. Every stopped test needs a documented reason. "Test stopped per stopping framework — all 4 conditions met" is a valid entry. "PM asked us to stop it" is not acceptable without also noting it as a methodological concern.

What to Do Next

If your team doesn't have a formal stopping policy, start there. A one-page checklist with the four conditions above will immediately improve your program's data quality.

For teams using Optimizely, the Optimizely Practitioner Toolkit includes guidance on interpreting Stats Engine results alongside the stopping framework — including how to read the "chance to beat baseline" metric in context of the full four-condition checklist.

Written by Atticus Li

Revenue & experimentation leader — behavioral economics, CRO, and AI. CXL & Mindworx certified. $30M+ in verified impact.