The Most Common Way Teams Invalidate Their Own Tests
Here is a scenario that plays out in experimentation programs every day: a product manager launches an A/B test on Monday. By Wednesday, the dashboard shows a positive result with a significant p-value. The PM screenshots the result, shares it with the team, and declares the variant a winner. The test is stopped, the variant is shipped.
Three weeks later, someone notices that the conversion metric did not actually improve. The projected lift never materialized. What happened?
The team fell victim to the peeking problem — one of the most pervasive and least understood threats to A/B test validity.
What Peeking Actually Does to Your Error Rate
Standard frequentist hypothesis tests are designed to be evaluated once, at a predetermined sample size. The significance threshold is calibrated to control the false positive rate at that single evaluation point.
When you check results multiple times during the test and stop when you see significance, you are running a different experiment than the one you designed. Each check is an opportunity for random noise to produce a spurious significant result. The more times you check, the more opportunities there are, and the higher the cumulative false positive rate becomes.
The math is straightforward. A test designed for a 5% false positive rate only delivers that guarantee at a single, pre-planned evaluation. If you check at five equally spaced intervals and stop as soon as one check is significant, the cumulative false positive rate climbs to roughly 14%. With daily checks on a multi-week test, it can exceed 20%.
In other words, what you think is a 5% false positive rate is actually several times higher than intended. A meaningful fraction of your "winning" tests may be false positives.
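This inflation is easy to demonstrate by simulation. The sketch below (function name and parameters are illustrative) runs A/A tests — identical control and variant, so there is no true effect — and applies a two-proportion z-test at five equally spaced interim looks, stopping at the first significant result:

```python
import numpy as np

rng = np.random.default_rng(0)

def peeking_fpr(n_sims=2000, n_total=5000, n_checks=5, p=0.1):
    """Fraction of A/A tests (no true effect) declared significant
    at ANY of n_checks equally spaced interim looks."""
    z_crit = 1.959964  # two-sided critical value for alpha = 0.05
    checkpoints = np.linspace(n_total // n_checks, n_total, n_checks, dtype=int)
    hits = 0
    for _ in range(n_sims):
        a = (rng.random(n_total) < p).cumsum()  # running conversion counts
        b = (rng.random(n_total) < p).cumsum()
        for n in checkpoints:
            pa, pb = a[n - 1] / n, b[n - 1] / n
            pooled = (a[n - 1] + b[n - 1]) / (2 * n)
            se = np.sqrt(2 * pooled * (1 - pooled) / n)
            if se > 0 and abs(pb - pa) / se > z_crit:
                hits += 1  # a "winner" found by pure noise
                break
    return hits / n_sims

print(peeking_fpr())  # well above the nominal 0.05
```

Even though each individual look uses the standard 5% threshold, the stop-at-first-significance policy lets noise win far more often than 5% of the time.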
Why Peeking Produces Inflated Estimates
The peeking problem does not just inflate the false positive rate. It also inflates the estimated effect size of the results you do see.
Here is why: early in a test, the estimate of the treatment effect is noisy. It bounces around. Sometimes it is too high, sometimes too low. If you stop the test the first time the estimate happens to be high enough to reach significance, you are systematically selecting for moments when noise pushed the estimate upward.
This means the lift number you see at the moment of early significance is an overestimate. When you ship the variant and measure the actual long-run effect, it will almost certainly be smaller than what the test showed. In some cases, it will be zero — the result was entirely noise.
This pattern has a name: the winner's curse. The expected effect of results selected via early stopping is biased upward.
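The winner's curse can also be simulated: give the variant a small true lift, stop the first time it looks significantly better, and compare the average estimated lift among stopped tests to the truth. A minimal sketch, with illustrative names and parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def early_stop_bias(n_sims=2000, n_total=5000, n_checks=5,
                    p_control=0.10, true_lift=0.01):
    """Average estimated lift among tests stopped at the first
    significant interim look, compared with the true lift."""
    z_crit = 1.959964  # two-sided critical value for alpha = 0.05
    checkpoints = np.linspace(n_total // n_checks, n_total, n_checks, dtype=int)
    estimates = []
    for _ in range(n_sims):
        a = (rng.random(n_total) < p_control).cumsum()
        b = (rng.random(n_total) < p_control + true_lift).cumsum()
        for n in checkpoints:
            pa, pb = a[n - 1] / n, b[n - 1] / n
            pooled = (a[n - 1] + b[n - 1]) / (2 * n)
            se = np.sqrt(2 * pooled * (1 - pooled) / n)
            if se > 0 and (pb - pa) / se > z_crit:
                estimates.append(pb - pa)  # lift recorded at stopping time
                break
    return float(np.mean(estimates)), true_lift

est, truth = early_stop_bias()
print(est, truth)  # average stopped estimate exceeds the true lift
```

The selected estimates had to clear the significance bar at their stopping point, so they are exactly the moments when noise happened to push the observed lift upward.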
Why Everyone Peeks Anyway
Dashboards make it irresistible
Most experimentation platforms show live results. The p-value is right there, updating in real time. Telling people not to look at a number that is displayed on their screen is a losing battle.
Organizational pressure to ship
Product teams have roadmaps, deadlines, and stakeholders asking for updates. When a test shows a significant result on day three, there is immense pressure to call it done and move on.
Misunderstanding of what the numbers mean
Many people believe that if the p-value is significant at any point, the result is reliable. They do not understand that the significance guarantee only holds at the planned endpoint.
No consequences for incorrect conclusions
Most teams do not systematically track whether shipped variants deliver their projected lift. Without this feedback loop, there is no mechanism for learning that early stopping produces inflated estimates.
How to Solve the Peeking Problem
Option 1: Discipline and process
The simplest solution is to calculate sample size before the test, commit to the endpoint, and evaluate only at that point. No early stopping, no early decisions.
This works in theory but fails in practice for most teams. The temptation to check is too strong, and there are legitimate reasons to monitor a test (detecting bugs, catching negative effects on guardrail metrics).
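For teams that do follow this discipline, the up-front commitment is a standard two-proportion power calculation. A minimal sketch using the normal approximation (the function name and defaults are illustrative):

```python
import math
from statistics import NormalDist

def sample_size_per_arm(p_base, min_lift, alpha=0.05, power=0.8):
    """Per-arm sample size for a two-proportion z-test (normal
    approximation) to detect an absolute lift of min_lift."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    var = p_base * (1 - p_base) + (p_base + min_lift) * (1 - p_base - min_lift)
    return math.ceil((z_alpha + z_beta) ** 2 * var / min_lift ** 2)

# e.g. a 10% baseline rate and a 1-point minimum detectable lift
print(sample_size_per_arm(0.10, 0.01))
```

The point of running this before launch is that the endpoint is now fixed: the test is evaluated once, when each arm reaches the computed size, and not before.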
Option 2: Sequential testing methods
Sequential testing methods are designed for multiple looks at the data. They use adjusted thresholds at each check point — requiring stronger evidence for early significance — so that the overall false positive rate stays at the desired level.
Common approaches include:
- Group sequential testing. Pre-plan a fixed number of interim analyses with spending functions that allocate the overall error rate across checks.
- Alpha spending functions. Continuously adjust the significance threshold as more data accumulates, allowing checks at any time while controlling the overall error rate.
- Always-valid confidence sequences. Construct confidence intervals that are simultaneously valid at every sample size, allowing unlimited monitoring.
The trade-off: sequential methods require slightly larger sample sizes than fixed-horizon tests for the same power. The premium is typically modest — and far less expensive than the cost of inflated false positive rates from uncontrolled peeking.
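To illustrate the group sequential idea, the sketch below reuses the A/A simulation approach and applies a single, stricter threshold at every look. The constant used is the Pocock boundary for five equally spaced looks at a two-sided 5% level, roughly 2.413 per standard group sequential tables; the simulation and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def sequential_fpr(z_crit, n_sims=2000, n_total=5000, n_checks=5, p=0.1):
    """A/A false positive rate when the same threshold z_crit is
    applied at each of n_checks interim looks."""
    checkpoints = np.linspace(n_total // n_checks, n_total, n_checks, dtype=int)
    hits = 0
    for _ in range(n_sims):
        a = (rng.random(n_total) < p).cumsum()
        b = (rng.random(n_total) < p).cumsum()
        for n in checkpoints:
            pooled = (a[n - 1] + b[n - 1]) / (2 * n)
            se = np.sqrt(2 * pooled * (1 - pooled) / n)
            if se > 0 and abs(b[n - 1] / n - a[n - 1] / n) / se > z_crit:
                hits += 1
                break
    return hits / n_sims

print(sequential_fpr(1.96))   # naive threshold at every look: inflated
print(sequential_fpr(2.413))  # Pocock-style constant: close to 0.05
```

The stricter per-look threshold is the "premium" mentioned above: early stopping now requires stronger evidence, which in turn requires somewhat more traffic for the same power.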
Option 3: Bayesian methods
Bayesian analysis naturally handles continuous monitoring because the posterior probability is valid at every point in time. There is no peeking penalty because the framework does not rely on a fixed-horizon assumption.
The trade-off is that Bayesian methods require prior specification and do not provide the same explicit error rate guarantees as frequentist sequential methods.
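As a sketch of what this looks like for conversion rates, a Beta-Binomial model gives a posterior probability that the variant beats control, which can be computed at any point in the test. This assumes independent Beta priors (uniform here) and uses Monte Carlo draws from the posteriors; the function name is illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior=(1, 1), draws=200_000):
    """Posterior P(rate_B > rate_A) under independent Beta-Binomial
    models with the given Beta prior (uniform by default)."""
    a_post = rng.beta(prior[0] + conv_a, prior[1] + n_a - conv_a, draws)
    b_post = rng.beta(prior[0] + conv_b, prior[1] + n_b - conv_b, draws)
    return float((b_post > a_post).mean())

# 100/1000 conversions in control vs. 125/1000 in the variant
print(prob_b_beats_a(100, 1000, 125, 1000))
```

The posterior is a coherent summary of the evidence at any sample size, but — as noted above — a "ship when the posterior crosses 95%" stopping rule still selects for favorable noise, so the caveat about selection effects applies.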
Option 4: Platform-level guardrails
The most effective solution is often to build peeking protection into the platform itself:
- Hide p-values until the planned endpoint. Show effect estimates and confidence intervals but do not display significance until the test reaches its target sample size.
- Use always-valid statistics by default. Switch the platform's statistical engine to one that supports valid continuous monitoring.
- Require minimum run times. Enforce a policy that tests cannot be stopped before a minimum duration, regardless of what the numbers show.
- Show adjusted statistics. If the platform supports sequential testing, display the adjusted thresholds alongside the current results.
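A minimal sketch of what the first two guardrails might look like in platform code — the policy, names, and fields here are all hypothetical, not any particular platform's API:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class TestConfig:
    target_sample_size: int   # per-arm sample size committed before launch
    min_duration: timedelta   # minimum run time regardless of results

def visible_stats(config, started_at, current_n, now):
    """Decide which statistics the dashboard may display.
    Hypothetical policy: effect estimates are always shown;
    p-values unlock only after both gates are satisfied."""
    ran_long_enough = now - started_at >= config.min_duration
    enough_samples = current_n >= config.target_sample_size
    return {
        "effect_estimate": True,                        # always visible
        "p_value": ran_long_enough and enough_samples,  # gated
    }

cfg = TestConfig(target_sample_size=15_000, min_duration=timedelta(days=14))
print(visible_stats(cfg, datetime(2024, 1, 1), 4_000, datetime(2024, 1, 3)))
```

Gating at the platform layer removes the discipline problem entirely: the tempting number simply is not on the screen until the test has earned it.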
Monitoring Without Peeking
Stopping early based on significance is dangerous, but there are valid reasons to monitor a running test:
Detecting bugs
If a variant has a broken checkout flow, you want to know immediately. Monitor error rates, page load times, and other technical metrics. If a variant is clearly broken, stop it — this is not a statistical question.
Protecting guardrail metrics
If a variant is significantly harming a guardrail metric (revenue, engagement, error rate), you may want to stop early. Use separate guardrail thresholds that are stricter than your primary metric threshold to account for the multiple monitoring problem.
Detecting extremely large effects
Sequential methods allow you to stop early when the effect is much larger than expected. This is the responsible version of "peeking" — the thresholds are set so that early stopping only happens when the evidence is overwhelming.
The key distinction: monitoring for safety and stopping for significance are different activities with different statistical requirements. Monitor as much as you want. But only make significance-based stopping decisions using methods designed for that purpose.
Building an Anti-Peeking Culture
The peeking problem is ultimately a cultural and process issue, not just a statistical one.
- Educate the team. Most people peek because they do not understand the consequences. A thirty-minute training session on why early stopping inflates false positives can change behavior permanently.
- Separate monitoring from evaluation. Make it clear that checking for bugs and checking for winners are different activities.
- Celebrate well-run tests, not just wins. If the culture only rewards positive results, people will find ways to manufacture them — including early stopping.
- Track the replication rate. After shipping variants, measure whether the observed lift materializes. When teams see that early-stopped tests underperform, the lesson sticks.
FAQ
How many times can I safely check results?
With standard frequentist methods, the answer is once — at the planned endpoint. With sequential testing methods, you can check as often as you like, but the thresholds are adjusted to maintain validity.
Does peeking matter if I use Bayesian methods?
Bayesian posteriors are valid at any time, so continuous monitoring does not inflate the posterior probability in the same way. However, stopping rules still affect the expected properties of the decisions you make. Bayesian methods handle peeking more gracefully but are not entirely immune to selection effects.
How much does peeking actually inflate the false positive rate?
It depends on how often you check and when you stop. With daily checks on a standard multi-week test, the actual false positive rate can be multiple times the nominal level. The inflation increases with the number of checks.
Should I rerun a test if I suspect we peeked?
If the decision was made based on an early significant result, the estimate is likely inflated. If the decision matters, rerunning with a proper sequential design is worth the traffic investment.