Early stopping is the single most common statistical mistake in new CRO programs. Fix it and you remove the largest source of false positives in your testing pipeline. This guide explains the mechanism in plain language, gives you the quick stats to remember, and ends with a Monday-morning checklist you can apply this week.

What You'll Learn

  • What "peeking" is and why it inflates your false positive rate by 4-6x
  • The simple math behind the 5% → 25% inflation
  • Why the 10-day test pattern shows up everywhere in DTC
  • The two stopping protocols that actually work (fixed-horizon and sequential)
  • A Monday-morning checklist you can apply to your own tests

Quick Stats Reference

Memorize these numbers:

- 5% — nominal false positive rate at 95% confidence (run correctly).
- 20-30% — empirical false positive rate when you peek and stop at peak.
- 3-5x — typical inflation of reported lift magnitude under peek-and-stop.
- 2 full weekly cycles (14 days minimum) — recommended floor for any DTC test.
- 0% — the amount of statistical inference your tool's "probability to be best" gives you under continuous monitoring (unless it explicitly uses always-valid inference).

The Core Definition

Peeking = observing a test's interim results and deciding whether to stop based on what you see. The statistical inference of a test is only valid if you commit to your stopping rule before the test launches. Peeking and stopping at the first favorable observation is a different protocol — one with a much higher false positive rate than the dashboard displays.

Why This Matters for New Analysts

If you are new to CRO, this is probably the single most valuable thing you can learn in your first month. Three reasons:

It is the most common source of false positives in real programs. Fixing peeking removes more false-positive risk than any other single methodological change a team can make.

It is the cheapest fix. Pre-committing to a sample target before launch costs nothing. The hard part is the discipline, not the technique.

Your tool is not protecting you. Most consumer A/B testing platforms display a "probability variant B beats A" indicator that updates continuously, which strongly implies you should stop when it crosses some threshold. Without always-valid inference, this indicator is misleading. The dashboard does not warn you about this.

---

The Mechanism in Plain Language

A two-arm A/B test running at 95% confidence makes a specific probabilistic claim: if the variants do not actually differ, the test will incorrectly declare a winner about 5% of the time by chance. That 5% is the price you pay for finite samples.

The guarantee assumes one specific protocol: pre-specify a sample size, run the test to completion, and look at the result once. One look. One decision.

The moment you replace that protocol with continuous monitoring — checking the dashboard at days 3, 5, 7, deciding stop-or-continue based on what you see — the inferential properties change.

Quick mental model: Each look at an interim result is another chance for a noisy run to cross the significance threshold. With daily peeking on a 14-day test, the empirical false positive rate climbs from 5% to roughly 25-30%: about one in four tests with no underlying effect will still hand you a "winner."

This is not a contested claim. It is one of the most well-established results in sequential statistics. It is, however, almost never displayed in the dashboards of consumer A/B testing tools.
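You can verify the inflation yourself with the short simulation below. The setup is illustrative, not figures from any real program: an A/A test with a 2% conversion rate on both arms, 1,000 sessions per arm per day, a 14-day horizon, and a naive 95% threshold. It compares an analyst who looks once at day 14 against one who peeks daily and stops at the first apparent winner.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative assumptions (not figures from this article): A/A test with no
# true difference, 2% conversion on both arms, 1,000 sessions/arm/day, 14 days.
BASELINE, DAILY_N, DAYS = 0.02, 1_000, 14
Z_CRIT, N_SIMS = 1.96, 10_000          # naive two-sided 95% threshold

def z_stat(conv_a, conv_b, n):
    """Two-proportion z statistic with a pooled standard error (equal n per arm)."""
    p_a, p_b = conv_a / n, conv_b / n
    pooled = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    return 0.0 if se == 0 else (p_b - p_a) / se

peek_fp = fixed_fp = 0
for _ in range(N_SIMS):
    conv_a = conv_b = 0
    peeked_winner = False
    for day in range(1, DAYS + 1):
        conv_a += rng.binomial(DAILY_N, BASELINE)
        conv_b += rng.binomial(DAILY_N, BASELINE)
        if not peeked_winner and abs(z_stat(conv_a, conv_b, day * DAILY_N)) > Z_CRIT:
            peeked_winner = True           # the peeker stops here and ships a "win"
    peek_fp += peeked_winner
    # The fixed-horizon analyst looks exactly once, at day 14.
    fixed_fp += abs(z_stat(conv_a, conv_b, DAYS * DAILY_N)) > Z_CRIT

print(f"fixed-horizon false positive rate: {fixed_fp / N_SIMS:.1%}")  # stays near 5%
print(f"daily-peeking false positive rate: {peek_fp / N_SIMS:.1%}")   # roughly 4-5x the nominal 5%
```

Run it and the single-look analyst stays near the advertised 5%, while the daily peeker produces spurious winners at several times that rate.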

---

Why the 10-Day Test Pattern Is Everywhere

If you have read DTC CRO content, you have noticed that case studies tend to report tests that ran for about 10 days. The pattern is not a coincidence.

A 10-day duration is what emerges when an analyst monitors continuously and stops at the first favorable observation. Specifically:

  • Day 1-3: noise dominates, results look mixed.
  • Day 4-7: the variant has had enough time to drift into either tail; the analyst sees a favorable run.
  • Day 7-10: the analyst declares a winner during this window.

The statistical signature of "watch and stop when it looks good" is a test that ends sometime between day 7 and day 14, with most clustering around day 10. When you encounter a published case study with that duration, the methodology should be considered suspect by default.

Tip for new analysts: A test that ran for "approximately 10 days" without a pre-registered sample target is almost certainly a peek-and-stop test. Treat the reported lift as an upper bound, not as an estimate.

---

The Second Effect: Inflated Lift Magnitude

The peeking problem has a second, less-discussed effect that matters more for the headline numbers in case studies: the lifts you report are systematically biased upward, even when the underlying effect is real.

Here's the intuition. Imagine the true lift is 0% — no real effect. The observed lift varies day to day around zero. Some days it's slightly positive; some days slightly negative. If you apply a peeking rule — "stop the first day the observed lift crosses 15% with apparent significance" — you will eventually get a day where noise pushes the lift above the threshold. You declare a winner. The reported lift is at least 15% by construction.

What just happened: you sampled the upper tail of the distribution of possible observations. The reported "lift" is not an estimate of the true effect. It is a measurement of how far noise traveled before you stopped.

Quick rule of thumb: Reported lifts on peek-and-stop tests are typically 3-5x larger than the true underlying effect. A reported 25% lift is more honestly interpreted as a true effect in the 5-7% range.
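A short simulation makes the bias concrete. The numbers are again my own illustrative assumptions, not measurements: a 2% control conversion rate, a genuine +5% relative lift, 1,000 sessions per arm per day, and daily peeking at a naive 95% threshold for 14 days. The sketch records the lift a peek-and-stop analyst would write up on each "win."

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative assumptions: 2% control CVR, a real +5% relative lift,
# 1,000 sessions/arm/day, daily peeking at a naive 95% threshold, 14 days.
P_A, TRUE_LIFT = 0.02, 0.05
P_B = P_A * (1 + TRUE_LIFT)
DAILY_N, DAYS, Z_CRIT, N_SIMS = 1_000, 14, 1.96, 20_000

reported_lifts = []
for _ in range(N_SIMS):
    conv_a = conv_b = 0
    for day in range(1, DAYS + 1):
        conv_a += rng.binomial(DAILY_N, P_A)
        conv_b += rng.binomial(DAILY_N, P_B)
        n = day * DAILY_N
        p_a, p_b = conv_a / n, conv_b / n
        pooled = (conv_a + conv_b) / (2 * n)
        se = np.sqrt(pooled * (1 - pooled) * 2 / n)
        if se > 0 and (p_b - p_a) / se > Z_CRIT:
            reported_lifts.append((p_b - p_a) / p_a)   # the lift that gets written up
            break                                       # peeker stops on the first "win"

print(f"true lift: {TRUE_LIFT:.0%}")
print(f"share of runs stopped early as 'wins': {len(reported_lifts) / N_SIMS:.0%}")
print(f"mean reported lift on those wins: {np.mean(reported_lifts):.0%}")  # several times the true 5%
```

The winner-only average lands several times above the true 5%, because a run can only stop once noise has already pushed the observed gap over the threshold.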

The cost in production. When the change is rolled out and the impact is measured under different conditions, the noise component receives a fresh draw, which is on average near zero. The realized impact is therefore much smaller than the headline. The rollout was sized using a number that included the noise tail; the realized impact under-delivers; the team typically cannot identify the cause.

---

The Diagnostic Signal: Failed Replications

Most analysts who peek and stop will, at some point, attempt to validate a finding by running the same change on a comparable surface. About half the time, the validation returns flat or marginally negative.

The typical first response is to attribute the divergence to audience differences, seasonality, or traffic-mix variation. Sometimes those explanations are correct. More often, the simpler explanation is that the original was an upper-tail draw and the validation sampled a different region of the same distribution.

When a result fails to replicate on a comparable surface, the prior probability that the original was a false positive is materially higher than the prior probability that two similar audiences responded differently for unstated reasons.

For new analysts, the takeaway is to read these replication failures literally. They are the most reliable practical diagnostic that your testing pipeline has a peeking problem.

---

The Two Fixes

Fix 1: Fixed-Horizon Testing (start here)

The simplest and most reliable fix. Five-step discipline:

  1. Specify the MDE (minimum detectable effect). What is the smallest effect size that would matter for the business? Common choices are 5%, 10%, or 20%, depending on the test surface. "Anything positive" is not a valid MDE — it leads to chasing noise.
  2. Calculate required sample size. Use any sample-size calculator, or adapt the sketch at the end of this fix. Inputs: baseline CVR, MDE, 95% confidence, 80% power. Output: sessions per arm.
  3. Convert to days. Divide sessions per arm by your daily traffic. Round up to the nearest full week.
  4. Pre-register. Document the target sample, expected end date, primary metric, MDE, and decision rule. Commit before launch.
  5. Defer evaluation until the end. Observe interim if you must, but do not act. Stop only when the pre-committed sample is reached.

Common objection: "This will slow us down." In practice, the marginal cost is small. A test that would have run 7 days under peek-and-stop runs about 14 days under fixed-horizon. You lose one week per cycle. You gain a result you can interpret.
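As a worked version of steps 2 and 3, here is a minimal sketch using the standard two-proportion sample-size formula. The baseline rate, MDE, and traffic figures are placeholders to swap for your own numbers.

```python
from math import ceil
from scipy.stats import norm

# Placeholder inputs (illustrative, not recommendations): swap in your own numbers.
baseline_cvr = 0.03              # 3% baseline conversion rate
mde = 0.20                       # smallest relative lift worth detecting (20%)
alpha, power = 0.05, 0.80        # 95% confidence, 80% power
daily_sessions_per_arm = 1_500

p1 = baseline_cvr
p2 = baseline_cvr * (1 + mde)

# Standard two-proportion sample-size formula for a two-sided test.
z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
n_per_arm = ceil(
    (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
)

# Step 3: convert to days, then round up to a full week.
days = ceil(ceil(n_per_arm / daily_sessions_per_arm) / 7) * 7

print(f"sessions per arm: {n_per_arm:,}")   # ~13,900 with these placeholder inputs
print(f"planned duration: {days} days")     # 14 days with these placeholder inputs
```

With these placeholder inputs the horizon lands at 14 days, which is the number you would pre-register in step 4 and hold to in step 5.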

Fix 2: Sequential Testing (when you really need to monitor)

A more sophisticated alternative that allows interim analyses without inflating false positives. Two principal families:

  • Group sequential designs (e.g., O'Brien-Fleming, Pocock boundaries) — pre-specify a small number of interim checkpoints with adjusted critical values.
  • Always-valid inference (e.g., mixture-SPRT, e-values) — permit continuous monitoring with valid inference at every observation.

Trade-off: stricter critical values mean detecting a win requires somewhat more data than under fixed-horizon. Benefit: you can stop as soon as evidence is decisive.
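To make that trade-off concrete, here is a minimal simulation sketch (my own illustration, not any particular vendor's method) that finds a Pocock-style constant boundary for four equally spaced looks, i.e. the single per-look critical value that keeps the overall false positive rate at 5%.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative design: 4 equally spaced looks at a z-statistic, overall
# two-sided alpha of 5%. Under the null, the z-statistic at look k is a
# cumulative sum of unit-normal increments scaled by sqrt(k).
LOOKS, ALPHA, N_SIMS = 4, 0.05, 200_000
increments = rng.standard_normal((N_SIMS, LOOKS))
z_paths = np.cumsum(increments, axis=1) / np.sqrt(np.arange(1, LOOKS + 1))

def overall_alpha(z_crit):
    """Chance that at least one of the looks crosses the boundary under the null."""
    return np.mean(np.any(np.abs(z_paths) > z_crit, axis=1))

# Bisect for the constant per-look critical value (Pocock-style boundary).
lo, hi = 1.96, 3.5
for _ in range(40):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if overall_alpha(mid) > ALPHA else (lo, mid)

print(f"naive 1.96 at every look spends: {overall_alpha(1.96):.1%}")  # ~13%, not 5%
print(f"per-look critical value needed:  {lo:.2f}")                   # ~2.36 vs 1.96 for one look
```

The per-look threshold comes out near 2.36 rather than 1.96: each interim look is held to a stricter standard so the design as a whole still spends only 5% alpha, which is why a sequential win needs somewhat more evidence than a fixed-horizon one.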

Critical caveat: Most DTC A/B testing tools do _not_ implement sequential testing rigorously. Many display Bayesian-style probabilities ("probability to be best > 95%") that teams then treat as stopping thresholds, and those thresholds do not control the false positive rate under continuous monitoring. If a tool's documentation does not explicitly specify validity for continuous monitoring, default to fixed-horizon.

---

A Practical Stopping Rule for DTC

For teams not yet adopting sequential frameworks, the following defaults are a reasonable starting point:

| Test type                            | Minimum duration          | Sample sizing      | Stopping rule                                    |
| ------------------------------------ | ------------------------- | ------------------ | ------------------------------------------------ |
| Conversion rate test on product page | 14 days (2 weekly cycles) | Sized for ≥10% MDE | Run to fixed horizon                             |
| Pricing or offer test                | 21 days (3 weekly cycles) | Sized for ≥10% MDE | Run to fixed horizon, weekly seasonality check   |
| Post-purchase upsell or email        | 14 days                   | Sized for ≥15% MDE | Run to fixed horizon                             |
| High-stakes redesign                 | 28 days                   | Sized for ≥5% MDE  | Run to fixed horizon, pre-committed segment cuts |

These durations are not absolute requirements. They are defaults designed to capture full weekly cycles, generate sufficient sample to detect business-meaningful effects, and remain short enough to not impede the roadmap.

What matters more than the specific durations is that they are documented before launch and not adjusted based on interim observations.

---

Monday-Morning Checklist

Apply this to your team's next test, today:

  • [ ] Document the MDE before launch (specific number, written down).
  • [ ] Calculate required sample size using a sample-size calculator.
  • [ ] Convert to days; round up to a full week.
  • [ ] Add the target end date to your team's testing template.
  • [ ] Set a calendar reminder for the target end date — not before.
  • [ ] If you must look at interim results, do not act on them.
  • [ ] At the end date, evaluate. Then ship or kill, with full reporting.

For your existing test history:

  • [ ] Pull your last 5 "wins."
  • [ ] Note the run time of each. Tests under 14 days are suspect.
  • [ ] For any test without a pre-registered sample target, treat it as a candidate finding requiring replication, not as established evidence.

---

Quick Tips for New Testing Teams

  • The dashboard is not protecting you. A "94% probability variant B beats A" reading is not a stopping criterion under continuous monitoring unless your tool explicitly supports always-valid inference.
  • Failed replications are signal. When the same change tested elsewhere returns flat, the original was likely a false positive. Do not explain it away as "different audience" without a pre-registered hypothesis.
  • Pre-registration costs nothing. Documenting your test plan before launch is the single most valuable habit a new analyst can develop. It removes more false-positive risk than any other single change.
  • A 14-day test is a baseline, not a maximum. Larger samples or smaller MDEs require longer runs. Size for the effect you care about, not for the duration that feels short.

---

FAQ

What if I don't have enough traffic to run a 14-day test?

Your testing program is sized incorrectly relative to traffic. Options: reduce test cadence so each test reaches sufficient sample, shift to higher-MDE tests where smaller samples are adequate, or adopt sequential frameworks designed for low-traffic conditions. Do not run lots of underpowered tests and report the wins — those wins are predominantly noise.

My tool reports "94% probability variant B beats A." Can I stop?

Only if the tool explicitly states validity for continuous monitoring or references always-valid inference. Most tools do not. Read the documentation. If unclear, default to fixed-horizon.

Isn't 14 days excessive when our customers' purchase cycle is 3 days?

Purchase cycle and weekly seasonality are different problems. Even if a customer who lands on Tuesday purchases by Friday, the population landing on Tuesday differs systematically from the population landing on Saturday. A 14-day test captures the full weekly mix.

How can I tell if my historical wins were peeking-induced false positives?

Re-run a representative sample. Pick five tests from the past year that produced large reported lifts. Re-implement under proper fixed-horizon discipline. Approximately half will replicate at materially smaller magnitudes than the originals. That gap is the peeking tax.

Is there ever a justification for stopping a test early?

Yes — for futility (the test is so flat that even continuation will not reach significance) or for harm (the variant is meaningfully degrading the metric). Both should be pre-committed in the test plan with explicit thresholds. Stopping because the variant appears to be winning is a different category of decision and is the source of the false positive problem.

---

Build the Habit Early

For new CRO analysts, fixed-horizon testing is one of the highest-leverage habits you can develop in your first month. It costs little once standardized, and it removes the largest single source of false positives in your testing program.

I built GrowthLayer to make pre-registered, fixed-horizon testing the default for new analysts and small teams. The platform sizes the sample for your specified MDE, runs to that horizon, and reports the result with a confidence interval — not a peeking-based "probability to be best" that produces a false positive rate several times the nominal value.

To find experimentation roles where this discipline is the operating standard, explore open positions on Jobsolv.

Or book a consultation for a methodological audit of an existing test history, or for help establishing a disciplined testing program from scratch.

Atticus Li

Experimentation and growth leader. Builds AI-powered tools, runs conversion programs, and writes about economics, behavioral science, and shipping faster.