Standard frequentist A/B testing assumes one look at the data. Every additional peek inflates the false-positive rate above the nominal 5%. Continuous monitoring of a fixed-horizon test pushes it past 30% --- and that is the single most common statistical mistake in industry experimentation.

It is 9:47 AM on a Wednesday. A product manager opens the experimentation dashboard for a button-color test that launched three days ago. The pre-registered sample size is 18,000 visitors per arm; the test is currently sitting at about 5,400 per arm. The dashboard shows a 4.1% lift on the primary conversion metric, p = 0.043. The PM screenshots the result, posts it in the product-launches Slack with a green-check emoji, calls the test, and ships the winning variant to 100% of traffic by lunch. Three weeks later, the post-launch monitoring shows no detectable lift. The “winning” variant is performing exactly the same as the control. The PM concludes that the production environment must somehow be different from the test environment, files a ticket about possible instrumentation drift, and moves on to the next experiment.

There is nothing wrong with the production environment. There was nothing different about the deployment. The test simply did not detect a real effect, because there was no real effect to detect. What the test detected was random fluctuation, dressed up as a p-value below 0.05, surfaced to a stakeholder who was watching the dashboard for exactly the moment when the random fluctuation crossed the magic line.

This is the peeking problem, and it is mathematically guaranteed to produce this outcome at a rate far higher than the 5% the methodology nominally promises. The promise of “5% false-positive rate” depends on a single specific behavior --- looking at the data once, at a pre-specified sample size, and making one go/no-go decision. Look at the data twice, even just twice, and you have already broken the methodology. Look at it daily in a two-week test and the false-positive rate is no longer 5%; it is somewhere between roughly 20% and 40%, depending on how many times you looked and at what intervals. Look at it continuously through a real-time dashboard, the way most modern experimentation tools are designed to encourage, and the methodology has effectively no false-positive control at all.

The disturbing implication is that for any experimentation program still running on classical fixed-horizon Null Hypothesis Significance Testing (NHST) --- which is most internal experimentation programs and a substantial fraction of commercial ones --- a quarter to a third of the “winning” tests are statistical artifacts. They are not detecting real product improvements. They are detecting noise that happened to cross a threshold while someone was watching.

This article exists because the peeking problem is the most common statistical mistake in industry A/B testing, the one with the largest empirical impact on the trustworthiness of experimentation programs, and the one that practitioners are least likely to have been formally trained to avoid. The math is simple. The cognitive temptation to peek is overwhelming. And the methodology fixes that actually work --- always-valid inference, sequential probability ratio tests, Bayesian designs with proper priors --- are increasingly available in commercial platforms but still rare in homegrown experimentation stacks. If you run an experimentation program and you have not explicitly addressed peeking, you have a calibration problem you almost certainly do not know the size of.

What “Peeking” Actually Means Statistically

To understand why peeking is a problem, you have to understand what a p-value is actually claiming.

A p-value of 0.05 in a standard A/B test does not mean “there is a 95% probability that the variant beats the control.” It means something narrower and more procedural: assuming the null hypothesis (no real difference between variant and control) is true, the probability of observing a test statistic at least this extreme is 5%. The whole guarantee is conditional on a specific data-generating procedure --- a fixed sample size, decided in advance, with one statistical test performed at the end.

The fixed-sample-size requirement is the part that the dashboard-watching practitioner is breaking. The reason this matters is statistical, not philosophical. When you compute a test statistic on a sample of data, that statistic has a sampling distribution under the null --- a probability distribution that describes what values it would take across hypothetical repetitions of the experiment. The “5% false-positive rate” is the proportion of that sampling distribution that lies in the rejection region (the tails where p < 0.05). If you make one decision based on one test statistic, you fail to reject the null when the statistic is in the central 95% and reject it when the statistic is in the tail 5%. That is the contract.

When you peek, you are doing something the math did not authorize. You are not computing one test statistic; you are computing a sequence of test statistics over time, and you are stopping the experiment whenever any of them happens to cross the rejection threshold. The relevant probability is no longer “what is the chance that a single test statistic falls in the tails?” It is “what is the chance that the maximum of a sequence of correlated test statistics falls in the tails?” That probability is necessarily larger, because the maximum of a sequence is always at least as extreme as any single element of the sequence.

In the limit of continuous monitoring with no cap on test duration --- looking at the test statistic literally every time new data arrives --- the probability that the cumulative test statistic ever crosses the p = 0.05 threshold under the null is not 5%. It is 100%. The law of the iterated logarithm guarantees that the standardized statistic of a random walk exceeds any fixed critical value with probability one if you wait long enough. This is not a mild correction; it is a total breakdown of the false-positive guarantee.
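
A sketch of why the limit is 100%, assuming independent observations that are standardized to mean zero and unit variance under the null. The symbols S_n (running sum), Z_n (z-statistic after n observations), and c (critical value, e.g. 1.96) are introduced here for illustration only:

```latex
S_n = \sum_{i=1}^{n} X_i, \qquad Z_n = \frac{S_n}{\sqrt{n}}

\limsup_{n \to \infty} \frac{S_n}{\sqrt{2 n \log\log n}} = 1 \quad \text{a.s.}
\;\Longrightarrow\;
\limsup_{n \to \infty} \frac{Z_n}{\sqrt{2 \log\log n}} = 1 \quad \text{a.s.}

\sqrt{2 \log\log n} \to \infty
\;\Longrightarrow\;
\Pr\left( Z_n > c \ \text{for some } n \right) = 1 \quad \text{for every fixed } c
```

The growth rate is very slow, which is why the inflation over a realistic test duration is large but finite rather than total.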

The Always Valid Inference paper frames this with mathematical precision: Johari, R., Pekelis, L., & Walsh, D. J. (2017). “Always Valid Inference: Bringing Sequential Analysis to A/B Testing.” arXiv:1512.04922. Their framing is that classical p-values are tests of a single sample-size design; treating them as if they could be evaluated at any point along the data-accumulation process is a category error. The fix is not to fiddle with the sample size or apply ad-hoc corrections; the fix is to use inference procedures specifically designed to remain valid under continuous monitoring.

This is the core mathematical fact that practitioners need to internalize: the 5% false-positive guarantee is a guarantee about a single specific procedure, and almost no real-world A/B test follows that procedure. The dashboards encourage looking; the stakeholders demand looking; the cognitive temptation to call a test early is enormous. And every look that informs a stopping decision invalidates the guarantee.

The Mathematics --- How Much The Inflation Actually Is

The theoretical worst case is unbounded. The empirical question is how bad the inflation gets under realistic peeking patterns. The answer is bad enough that it has reshaped how the serious commercial A/B testing platforms now handle their statistics.

The Optimizely team published the canonical industry treatment in Pekelis, L., Walsh, D., & Johari, R. (2015). “The New Stats Engine.” Optimizely white paper. Their analysis modeled what happens when a customer running a fixed-horizon A/B test with α = 0.05 checks the result at regular intervals and stops the test as soon as any check shows p < 0.05. The simulation numbers they reported are the ones every practitioner should have at the front of their mind:

  • Check the data every 500 visitors: about 26% false-positive rate under the null
  • Check every 1,000 visitors: about 20% false-positive rate under the null
  • Continuous monitoring with no minimum gap: false-positive rate climbs toward 40%+ over realistic test durations

These numbers come from Monte Carlo simulations of the null condition --- two arms that are truly identical, where any “significant” result is by definition a false positive. The Optimizely team ran the simulations explicitly because their customer data showed the same pattern empirically: A/B tests that were called early on the strength of an intermediate p < 0.05 reading were, on follow-up, frequently failing to replicate the claimed lift.
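
A minimal sketch of this kind of null simulation, not a reproduction of Optimizely's own. The function name, the 10% baseline conversion rate, and the 10,000-visitor horizon are assumptions chosen for illustration; the published figures depend on settings the white paper summary above does not spell out.

```python
import numpy as np

def peeking_false_positive_rate(check_every, max_visitors=10_000, base_rate=0.10,
                                z_crit=1.96, n_sims=4_000, seed=0):
    """Share of simulated A/A tests that look 'significant' at any interim check."""
    rng = np.random.default_rng(seed)
    n_checks = max_visitors // check_every
    # Cumulative conversions per arm at each check; both arms share the same true rate.
    a = rng.binomial(check_every, base_rate, size=(n_sims, n_checks)).cumsum(axis=1)
    b = rng.binomial(check_every, base_rate, size=(n_sims, n_checks)).cumsum(axis=1)
    n = check_every * np.arange(1, n_checks + 1)  # visitors per arm at each check
    pooled = (a + b) / (2 * n)
    # Pooled two-proportion z-test; the floor avoids division by zero.
    se = np.sqrt(np.maximum(pooled * (1 - pooled), 1e-12) * 2 / n)
    z = (b - a) / n / se
    # A test counts as a false positive if ANY check crosses the threshold.
    return float((np.abs(z) > z_crit).any(axis=1).mean())

if __name__ == "__main__":
    for interval in (10_000, 1_000, 500):
        rate = peeking_false_positive_rate(interval)
        print(f"check every {interval:>6} visitors: {rate:.1%}")
```

With a single check at the horizon the rate should sit near the nominal 5%; shortening the interval between checks pushes it up.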

Their summary statement after deploying the new methodology, quoted from their public communications about the transition, was that the false-positive rate on their platform dropped “from over 20% to under 5%” --- meaning that prior to the new statistics engine, more than one in five claimed “winners” on their platform was a statistical artifact of peeking, even when the test had been technically designed with α = 0.05.

The peer-reviewed academic treatment is Johari, R., Koomen, P., Pekelis, L., & Walsh, D. (2017). “Peeking at A/B Tests: Why It Matters, and What To Do About It.” KDD ’17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1517-1525. DOI: 10.1145/3097983.3097992. The paper formalizes the simulation results, demonstrates the mathematical structure of the inflation, and validates the mSPRT (mixture sequential probability ratio test) framework as a procedure that preserves the false-positive guarantee under continuous monitoring. The authors had direct access to Optimizely’s customer A/B test data, which makes their empirical claims about industry practice unusually grounded in actual production usage.

The take-away is that under realistic patterns of practitioner behavior --- checking the dashboard once a day or once every few hours during a multi-week test --- the false-positive rate of classical fixed-horizon NHST inflates from the nominal 5% to roughly the 20-40% range. The exact number depends on how often you peek and how long the test runs, but the order of magnitude is what matters. You are not getting the false-positive control your test design claims. You are getting roughly 4x to 8x more false positives than you think you are.

For a CRO or PM running 50-100 tests a year, in the worst case where none of the tested changes has a real effect, this is the difference between a clean 2-5 false winners per year (the nominal claim) and 15-30 false winners per year (the actual rate under naive peeking). It is the difference between an experimentation program that produces a reliable evidence base for product decisions and one that produces a stream of “wins” that quietly fail to replicate post-launch and gradually erode organizational trust in experimentation as a discipline.

The Famous Demonstrations

The peeking problem is a specific instance of a broader pattern that the replication-crisis literature has been documenting in academic psychology for over a decade --- the same fundamental flaw that produced thousands of unreplicable findings in the social sciences before the methodological reckoning hit. The single most influential academic paper on this broader flexibility-in-data-analysis problem is Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.” Psychological Science, 22(11), 1359-1366. DOI: 10.1177/0956797611417632.

Simmons, Nelson, and Simonsohn made a deceptively simple argument. They demonstrated, through both Monte Carlo simulation and an actual experiment (in which they “found” that listening to “When I’m Sixty-Four” makes people younger), that a small set of innocuous-looking degrees of freedom in data analysis --- collecting a few extra observations if the first analysis is non-significant; measuring two outcome variables and reporting the one that works; adding a covariate such as gender when it helps the result; dropping one of several experimental conditions --- could inflate the nominal 5% false-positive rate to 61% in the worst-case combination they simulated.

The Simmons paper is canonically about academic psychology, but the underlying mathematics is identical to the A/B-testing case. The researcher who continues collecting data until a significant result appears and then stops is doing the same thing as the PM who continues running the A/B test until the dashboard shows p < 0.05 and then ships. Both are exploiting an undisclosed degree of freedom in the data-collection process to inflate the apparent significance of the result. Both are producing a nominal 5% false-positive rate that, in reality, is several times higher.

What makes the industry A/B testing case distinct from the academic-psychology case is that the peeking is built into the workflow. Academic researchers had to actively decide to keep collecting data; the temptation existed but the additional data collection took weeks and money. Industry experimentation platforms automated the friction away. The dashboard updates in real time. The stakeholder asks for the running result in the standup. The PM is praised for “iterating quickly” when they call a test early. The very design of modern experimentation tools, until quite recently, made the peeking-induced false-positive inflation almost inevitable.

The empirical demonstration on actual production A/B test data came in the KDD 2017 paper, where Johari, Koomen, Pekelis, and Walsh analyzed the false-positive inflation across Optimizely’s customer base. They documented that a substantial fraction of “significant” results in real customer tests reflected peeking behavior rather than genuine treatment effects, which was the motivation for Optimizely’s full migration to always-valid inference. The academic-industry feedback loop here was unusual and healthy: Optimizely had the production data, the Stanford team had the statistical methodology, and the published paper resulted from a genuine collaboration in which the empirical patterns drove the theoretical framing.

The broader academic context is that the peeking problem in A/B testing is a special case of the optional-stopping problem that has been understood in statistics since at least Wald’s 1947 founding work on sequential analysis (Wald, A. (1947). Sequential Analysis. John Wiley & Sons.). Wald developed the original sequential probability ratio test as a procedure that explicitly allowed early stopping while preserving the false-positive guarantee, originally for industrial quality control during the Second World War. The mathematical machinery for handling continuous monitoring of experiments has existed for nearly 80 years. The clinical-trials literature has been using group-sequential methods (Pocock 1977, O’Brien-Fleming 1979, Lan-DeMets alpha spending) for almost 50 years. What is new is not the statistical theory; what is new is the industry A/B testing world finally adopting it after a decade of running tests with provably inflated false-positive rates.

Why The Industry Quietly Switched Methodology

The clearest indicator that the peeking problem is real and consequential is that the major commercial A/B testing platforms have already changed their underlying statistics to address it. The change was not loudly advertised, because admitting “our previous statistics were producing 20%+ false-positive rates” is not a great marketing message, but the methodology shift is documented in the technical literature.

The first major commercial migration was Optimizely’s launch of the New Stats Engine in 2015. The platform moved from classical fixed-horizon NHST (with t-tests at a single sample size) to a sequential testing methodology based on the mSPRT framework, with false discovery rate (FDR) control added to handle tests that track multiple metrics and variations. The publicly available technical white paper is the 2015 Pekelis-Walsh-Johari document, and the academic version is the 2017 KDD paper.

Optimizely’s stated motivation in the customer-facing communications was the peeking problem specifically. The new methodology allowed customers to look at the dashboard whenever they wanted, stop the test whenever they wanted, and still have valid statistical inference at the stopping point. The cost was that statistical power was somewhat reduced at any given sample size compared to fixed-horizon NHST evaluated at its predetermined end --- you have to “buy” the right to look continuously by accepting slightly slower statistical detection of true effects --- but the trade was clearly worth it for the validity guarantee.

VWO followed with a similar migration to Bayesian statistics, which sidestep the optional-stopping problem differently (the Bayesian posterior probability that a treatment is better is a coherent quantity to compute at any sample size, although it has its own subtleties around prior calibration and what decision rule to apply to the posterior).

The Microsoft Experimentation Platform, documented at length in Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press, takes a different approach again. Microsoft’s platform uses fixed-horizon NHST but enforces the fixed horizon strictly --- the platform does not surface “significance” indicators during the test, the test runs to its predetermined sample size, and the analysis is conducted once at the end. The peeking problem is solved organizationally rather than statistically, by removing the temptation to peek from the user interface. Kohavi’s book is explicit that this is a deliberate choice and that peeking is one of the major reliability threats to industry experimentation.

The notable laggard in adopting peeking-resistant methodology is the long tail of homegrown experimentation systems. Companies that built their own A/B testing infrastructure during the 2010s, often as a side project of a data-engineering team, typically implemented classical fixed-horizon NHST without the organizational discipline of Microsoft-style fixed-horizon enforcement. These systems are still in heavy use across mid-market SaaS, e-commerce, and consumer-app companies, and they are still producing the inflated false-positive rates that the commercial platforms have moved away from.

If you are running on a homegrown A/B testing platform built before 2018, the working assumption should be that your statistics engine does not handle peeking correctly and that your false-positive rate is meaningfully above 5%. Quantifying exactly how far above requires looking at your team’s actual peeking patterns, but the order-of-magnitude estimate based on the Optimizely simulations is 20-40%.

The Practitioner Reactions That Don’t Actually Work

When practitioners first encounter the peeking problem, the first reactions are usually intuitively appealing but mathematically wrong. It is worth walking through the common bad fixes, because each of them is a trap that experienced experimentation teams still fall into.

“I’ll just check less often.” This reduces the inflation but does not eliminate it. If you check twice during the test and stop on the first significant reading, your false-positive rate is roughly 8% rather than 5%. If you check ten times, it is roughly 19%. The exact numbers depend on the specific test design, but the principle is that any peeking that informs a stopping decision inflates the false-positive rate above the nominal level. “Checking less” is a directionally correct intuition that does not produce the actual fix.

“I’ll apply a Bonferroni correction.” This is the intuition that says “if I’m doing N looks at the data, I should require p < 0.05/N at each look.” Bonferroni does protect against the inflation, but in a much-too-conservative way that costs statistical power. The correction is a worst-case bound that holds under any dependence between the N tests, and sequential tests on accumulating data are far from the worst case: they are highly correlated, because the test statistic at time t+1 is mostly the same data as the test statistic at time t, with a little more data added. Naive Bonferroni applied to a sequential design therefore sets thresholds stricter than necessary and gives up power that a properly designed sequential method would keep. The clinical-trials literature worked this out in the 1970s --- Pocock 1977 and O’Brien-Fleming 1979 boundaries are the correct corrections for group-sequential designs, not Bonferroni.
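
The gap between these options can be sketched with a short simulation. This is an illustration under simplifying assumptions (equally spaced looks, normally distributed batch sums, a two-sided test); the function names are hypothetical, and the "Pocock" value here is a Monte Carlo approximation of a constant boundary rather than a value taken from published tables.

```python
import numpy as np
from statistics import NormalDist

def max_abs_z(n_looks, n_sims=200_000, seed=0):
    """Largest |z| seen across equally spaced looks at accumulating null data."""
    rng = np.random.default_rng(seed)
    # Each look adds one equal-sized batch; standardized batch sums are N(0, 1).
    running_sum = rng.standard_normal((n_sims, n_looks)).cumsum(axis=1)
    z = running_sum / np.sqrt(np.arange(1, n_looks + 1))
    return np.abs(z).max(axis=1)

def compare_corrections(n_looks, alpha=0.05):
    m = max_abs_z(n_looks)
    naive_crit = NormalDist().inv_cdf(1 - alpha / 2)
    bonf_crit = NormalDist().inv_cdf(1 - alpha / (2 * n_looks))
    # Constant boundary calibrated so the overall false-positive rate is alpha.
    pocock_crit = float(np.quantile(m, 1 - alpha))
    return {
        "naive_fpr": float((m > naive_crit).mean()),
        "bonferroni_crit": bonf_crit,
        "bonferroni_fpr": float((m > bonf_crit).mean()),
        "pocock_crit": pocock_crit,
    }

if __name__ == "__main__":
    for k in (2, 10):
        print(k, compare_corrections(k))
```

Under these assumptions the naive rates for two and ten looks should land near the roughly 8% and 19% quoted above. The Bonferroni threshold comes out stricter than the calibrated constant boundary and spends less than the full 5%, which is the power it leaves on the table.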

“I’ll trust my gut on which signals are real and which are noise.” This is the most common practitioner reaction and the most dangerous. The trained intuition of an experienced experimentation lead does correlate, weakly, with which results are real --- people who have run a lot of tests develop pattern recognition for what a real lift looks like versus what a noise spike looks like. But the correlation is much weaker than the experienced practitioner believes, and the systematic bias is in the direction of believing too much. The same confirmation-bias machinery that drives the entire replication crisis in academic psychology operates in industry experimentation. Once a PM is invested in the hypothesis that their treatment is a winner, the dashboard showing p < 0.05 will look like signal regardless of whether it actually is. The right response to “trust your gut” is the response Daniel Kahneman gave to it across a career: intuition is calibrated only in domains with fast, accurate feedback, and A/B testing intuition is not getting fast, accurate feedback on which “wins” were noise versus signal. Post-launch performance is the feedback, and it arrives months later with substantial confounds. The intuition does not improve from the experience.

“I’ll just extend the test if I’m not sure.” Extending a fixed-horizon test past the pre-registered sample size is itself a form of peeking. The decision to extend is informed by the current state of the data, which creates the same optional-stopping problem in the opposite direction --- tests with promising-but-not-significant intermediate results get extended; tests with discouraging intermediate results get cancelled. Either pattern violates the fixed-horizon assumption and inflates the false-positive rate.

“I’ll just use a higher threshold like p < 0.01.” Tightening the threshold reduces the inflation but does not eliminate the structural problem. Under continuous monitoring of a true-null A/B test, the cumulative probability of ever crossing p < 0.01 over a long test is still well above 1%, just by less than the p < 0.05 case. The structural fix has to address the multiple-looks problem, not just adjust the threshold.

None of these intuitive fixes work because they all start from the assumption that fixed-horizon NHST is the right statistical framework and try to patch it. The actual fix is to use a different statistical framework specifically designed for sequential decision-making.

What Actually Works

There are four classes of solutions that genuinely address the peeking problem, either by making interim looks statistically valid or by preventing them. The choice between them is mostly about which trade-offs your experimentation program can absorb and which is easier to implement in your stack.

Always-Valid Inference (mSPRT). This is the methodology Optimizely adopted in 2015 and the canonical academic reference is the Johari-Pekelis-Walsh paper. The mixture sequential probability ratio test produces p-values and confidence intervals that are valid at any stopping time, including continuous monitoring. The user can look at the dashboard whenever they want, stop the test whenever the result looks compelling, and the published p-value is honest. The cost is reduced statistical power at any fixed sample size compared to fixed-horizon NHST evaluated at its predetermined end --- you pay for the right to look continuously by accepting a slightly larger sample to achieve the same detection power on true effects. For most modern experimentation programs with abundant traffic, this is the right trade.
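
A minimal sketch of the idea, assuming the simplest textbook setting: a stream of paired differences (variant minus control) that are normally distributed with known variance, and a normal mixing distribution over the effect size. The function names and the parameters `var_diff`, `tau2`, and `theta0` are illustrative choices; this is not Optimizely's production implementation.

```python
import numpy as np

def msprt_p_values(diffs, var_diff, tau2=1.0, theta0=0.0):
    """Always-valid p-value sequence for a stream of paired differences."""
    diffs = np.asarray(diffs, dtype=float)
    n = np.arange(1, diffs.size + 1)
    s = np.cumsum(diffs - theta0)
    denom = var_diff + n * tau2
    # Log mixture likelihood ratio with a N(theta0, tau2) mixing distribution.
    log_lr = 0.5 * np.log(var_diff / denom) + tau2 * s**2 / (2.0 * var_diff * denom)
    p = np.minimum(1.0, np.exp(-log_lr))
    return np.minimum.accumulate(p)  # the reported p-value never goes back up

def monitored_false_positive_rates(n_obs=2_000, n_sims=1_000, alpha=0.05, seed=0):
    """Compare mSPRT and a naive z-test when checked after every observation."""
    rng = np.random.default_rng(seed)
    n = np.arange(1, n_obs + 1)
    msprt_hits = 0
    naive_hits = 0
    for _ in range(n_sims):
        d = rng.standard_normal(n_obs)  # null: mean 0, variance 1
        msprt_hits += int(msprt_p_values(d, var_diff=1.0)[-1] < alpha)
        z = np.cumsum(d) / np.sqrt(n)
        naive_hits += int((np.abs(z) > 1.96).any())
    return msprt_hits / n_sims, naive_hits / n_sims

if __name__ == "__main__":
    print(monitored_false_positive_rates())
```

Because the p-value sequence is a running minimum, checking the last value is equivalent to asking whether it ever dropped below alpha. In this sketch the always-valid rate should stay at or below the nominal level while the naive z-test checked at every observation does not. The choice of `tau2` affects power, not validity.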

Group-Sequential Designs (Pocock, O’Brien-Fleming, alpha spending). This is the clinical-trials approach, formalized in Pocock, S. J. (1977). “Group sequential methods in the design and analysis of clinical trials.” Biometrika, 64(2), 191-199. DOI: 10.1093/biomet/64.2.191 and refined extensively in the decades since. The test design specifies in advance a fixed number of looks at the data (typically 3-5) and adjusts the rejection threshold at each look so that the cumulative false-positive rate across all looks is bounded at the nominal level. The alpha-spending-function refinement (Lan-DeMets 1983) allows the number of looks to be flexible while still preserving the guarantee. This is more rigid than always-valid inference --- the looks have to be pre-specified, and continuous monitoring is not supported --- but it delivers higher statistical power at the looks that are specified. For experimentation programs where stakeholders can be held to a pre-specified review schedule (e.g., “we’ll look at week 1, week 2, and week 4”), this is a strong option.
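
To make the alpha-spending idea concrete, here is a sketch of the two spending functions most often associated with Lan-DeMets, written for a two-sided test. The function names and the example look schedule are assumptions for illustration. The code shows only how much of the error budget is released at each look; turning those increments into critical values requires numerical integration over the joint distribution of the interim statistics, which is not shown.

```python
from math import e, log, sqrt
from statistics import NormalDist

_NORMAL = NormalDist()

def obf_spending(t, alpha=0.05):
    """O'Brien-Fleming-type spending: cumulative alpha used by information fraction t."""
    return 2.0 * (1.0 - _NORMAL.cdf(_NORMAL.inv_cdf(1 - alpha / 2) / sqrt(t)))

def pocock_spending(t, alpha=0.05):
    """Pocock-type spending: cumulative alpha used by information fraction t."""
    return alpha * log(1.0 + (e - 1.0) * t)

def alpha_increments(spend, look_fractions, alpha=0.05):
    """Alpha newly released at each look, for increasing fractions in (0, 1]."""
    cumulative = [spend(t, alpha) for t in look_fractions]
    previous = [0.0] + cumulative[:-1]
    return [c - p for c, p in zip(cumulative, previous)]

if __name__ == "__main__":
    looks = [0.25, 0.5, 1.0]  # e.g. week 1, week 2, week 4 of a four-week test
    print("OBF-type:   ", alpha_increments(obf_spending, looks))
    print("Pocock-type:", alpha_increments(pocock_spending, looks))
```

The O'Brien-Fleming-type function releases almost nothing early and saves most of the budget for the final look, while the Pocock-type function spends more evenly. Both reach the full alpha at the end of the test.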

Bayesian A/B Testing. A Bayesian analysis computes the posterior probability that treatment is better than control given the observed data, and that probability is a coherent quantity to evaluate at any sample size. There is no “p-value inflation” because there are no p-values; the posterior is whatever the posterior is. The catch is that you need a defensible prior, and the choice of prior materially affects the posterior in low-data regimes. The decision rule also matters: shipping the moment the posterior crosses a fixed threshold does not by itself control the frequentist false-positive rate, so the rule deserves the same scrutiny as the prior. Bayesian methods sidestep the peeking problem statistically but introduce a different methodological discipline question (what is the right prior, and how do you defend it to stakeholders who do not have a background in Bayesian inference). VWO and a number of other commercial platforms have gone this route.
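
A minimal sketch of the posterior calculation for a conversion metric, assuming a Beta prior on each arm's conversion rate and independent arms. The function name, the default flat prior, and the example counts are illustrative; commercial platforms use their own models and priors.

```python
import numpy as np

def prob_variant_beats_control(conv_a, n_a, conv_b, n_b,
                               prior_alpha=1.0, prior_beta=1.0,
                               n_draws=200_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A | data) under Beta priors."""
    rng = np.random.default_rng(seed)
    # Beta prior + binomial data gives a Beta posterior for each arm.
    post_a = rng.beta(prior_alpha + conv_a, prior_beta + n_a - conv_a, n_draws)
    post_b = rng.beta(prior_alpha + conv_b, prior_beta + n_b - conv_b, n_draws)
    return float((post_b > post_a).mean())

if __name__ == "__main__":
    # Hypothetical interim data: 5,400 visitors per arm.
    flat = prob_variant_beats_control(540, 5_400, 600, 5_400)
    # Same data with an informative prior centered on a 10% conversion rate.
    informed = prob_variant_beats_control(540, 5_400, 600, 5_400,
                                          prior_alpha=1_000, prior_beta=9_000)
    print(f"flat prior: {flat:.3f}, informative prior: {informed:.3f}")
```

Running the same data through both priors shows the point made above: with limited data, an informative prior pulls the posterior probability noticeably toward "no difference".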

Pre-Registered Stopping Rules With Strict Enforcement (the Microsoft approach). This is the organizational fix rather than the statistical fix. The test specifies a sample size in advance, the platform does not surface significance indicators before that sample size is reached, and the analysis is conducted once at the end. The fixed-horizon NHST methodology is mathematically valid under this discipline, because the looking-at-intermediate-results behavior that breaks the methodology is structurally prevented. This is what Microsoft’s Experimentation Platform does, what disciplined clinical trials have done for decades, and what every academic preregistration framework requires. It is the simplest fix to understand, the hardest fix to enforce organizationally, and the most robust fix when actually enforced.
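
The enforcement idea can be sketched as a reporting function that refuses to compute significance before the planned sample size is reached. The function name, return shape, and pooled two-proportion z-test are assumptions for illustration, not a description of Microsoft's platform. A real system would also freeze the result at the horizon so that it is computed exactly once.

```python
from math import sqrt
from statistics import NormalDist

def fixed_horizon_result(conv_a, n_a, conv_b, n_b, planned_n_per_arm):
    """Return a p-value only once both arms have reached the planned size."""
    if min(n_a, n_b) < planned_n_per_arm:
        # Before the horizon: report progress only, nothing to peek at.
        return {"status": "running", "p_value": None}
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0.0:
        return {"status": "complete", "p_value": 1.0}
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return {"status": "complete", "p_value": p_value}

if __name__ == "__main__":
    print(fixed_horizon_result(540, 5_400, 600, 5_400, planned_n_per_arm=18_000))
    print(fixed_horizon_result(1_800, 18_000, 1_890, 18_000, planned_n_per_arm=18_000))
```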

The choice among these four is partly stylistic and partly about the realities of your organization. If your stakeholders insist on watching dashboards in real time and will not accept a “wait until day 14” discipline, always-valid inference is the natural answer because it is designed to survive the continuous-monitoring use case while preserving false-positive control. If you can enforce pre-registered stopping rules, fixed-horizon NHST is fine and statistically more efficient. The wrong answer is to keep using fixed-horizon NHST with intermediate looks, which is what most internal experimentation programs are currently doing.

How To Detect Peeking In Your Past Test Results

You can roughly diagnose how much your experimentation program has been affected by the peeking problem by looking at the historical record. There are several patterns that, if you see them in your test logs, are strongly suggestive that peeking has been inflating your apparent win rate.

Winners that fail to replicate post-launch. This is the headline diagnostic. If you systematically track the post-launch performance of “winning” variants and find that a substantial fraction of them show no measurable lift in production, the most likely explanation is that the wins were noise. Some fraction of any experimentation program’s wins will fail to replicate even with perfect statistics (the nominal 5% false-positive rate is non-zero, novelty effects fade, the production population differs subtly from the test population), but if your post-launch replication rate is below 70%, peeking is a leading suspect.

Wins concentrated in short-duration tests. Tests that hit “significance” in two or three days are statistically suspicious because the early portion of any test is where random fluctuations are most likely to produce extreme test statistics. A real lift typically becomes more apparent over time as the sample grows; a peeking-driven false positive often shows the largest apparent effect early, before the noise has been averaged out. If your team’s winning tests are systematically called within the first half of the planned test duration, that pattern is more consistent with peeking-driven inflation than with genuine effect detection.
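
If your test log records the planned and actual sample sizes, this check takes a few lines. The record schema and field names below are hypothetical; adapt them to whatever your platform stores.

```python
from dataclasses import dataclass

@dataclass
class ExperimentRecord:
    """Hypothetical log entry for one finished A/B test."""
    name: str
    planned_n: int        # pre-registered sample size per arm
    stopped_n: int        # sample size per arm when the test was called
    declared_winner: bool

def early_winner_share(records, early_fraction=0.5):
    """Share of declared winners called before `early_fraction` of the planned sample."""
    winners = [r for r in records if r.declared_winner]
    if not winners:
        return None
    early = [r for r in winners if r.stopped_n < early_fraction * r.planned_n]
    return len(early) / len(winners)

if __name__ == "__main__":
    log = [
        ExperimentRecord("button-color", 18_000, 5_400, True),
        ExperimentRecord("checkout-copy", 20_000, 20_000, True),
        ExperimentRecord("hero-layout", 15_000, 15_000, False),
    ]
    print(early_winner_share(log))
```

A high share is a prompt to investigate, not proof of peeking on its own.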

Results that flip direction when extended. If you have tests where the variant was “significantly better” at day 5 but ended up “significantly worse” at day 14, or vice versa, that flipping behavior is diagnostic of small samples being driven by noise. A real treatment effect that is genuinely d = +0.05 does not become d = -0.05 by collecting more data; it stays approximately d = +0.05 with shrinking error bars. Flipping results are the signature of effect estimates that are dominated by sampling variance rather than by the underlying signal.

Wins that cluster around test designers’ check-in times. If you have the timestamps of when wins were called and they cluster around 9am Monday, after the weekend’s worth of data has accumulated, or around the standup before the leadership review, that clustering is suggestive that the wins are being detected at the moments when the dashboard was being looked at, not at the moments when the underlying signal genuinely changed. This is the smoking gun of peeking-driven inflation.

Highly variable lift estimates across similar tests. If you have run multiple A/B tests on similar UI changes (e.g., button color, copy, layout) and the estimated lifts vary wildly --- some +15%, some +3%, some -8% --- when there is no obvious reason for the underlying effect to differ that much, the variance is plausibly driven by peeking-induced inflation rather than by real differences in treatment effect. Real treatment effects on similar interventions tend to cluster; noisy estimates spread out.

The diagnostic exercise is not perfect, and the patterns above are suggestive rather than conclusive evidence. But if you have several of these patterns in combination, your program is almost certainly producing more false-positive “wins” than the nominal 5% rate would suggest, and the most likely cause is peeking.

What This Means For Your Experimentation Program

The practical calibration question is: what fraction of your past A/B test winners were real, and what should you change going forward?

For an experimentation program running on classical fixed-horizon NHST with no enforced discipline against peeking --- which is what most internal programs and a substantial fraction of mid-market commercial platforms still are --- the working assumption should be that roughly 25-35% of your “winning” tests are false positives. That is a calibration claim, not a precise measurement, and the exact fraction depends on your specific peeking patterns. But the order of magnitude is what the Optimizely simulations support, what the academic literature predicts, and what the post-launch replication failure rates that we see across enterprise experimentation programs are consistent with.

This has several immediate implications for how you should be running things.

Discount your historical win rate. If your dashboard says your team has shipped 40 winning experiments over the last year, the realistic count of experiments that actually delivered the claimed lift is closer to 25-30. The other 10-15 were noise. This matters for retrospective ROI calculations, for stakeholder reporting, and for prioritization of future experiments based on which ideas seem to work. The base rate of “ideas that genuinely improve the metric” is meaningfully lower than the base rate of “ideas that hit p < 0.05 once.”
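
How large the discount should be depends on assumptions you have to supply yourself. The sketch below uses standard false-discovery arithmetic; the function name and the example inputs (a 50% base rate of genuinely effective ideas and 80% power) are hypothetical, not measurements from any program.

```python
def false_discovery_share(base_rate, power, false_positive_rate):
    """Expected share of declared winners that are false positives.

    base_rate: fraction of tested ideas with a real effect (assumed).
    power: chance a real effect is declared a winner (assumed).
    false_positive_rate: chance a null idea is declared a winner.
    """
    true_wins = base_rate * power
    false_wins = (1.0 - base_rate) * false_positive_rate
    total = true_wins + false_wins
    return false_wins / total if total > 0 else 0.0

if __name__ == "__main__":
    for fpr in (0.05, 0.30):
        share = false_discovery_share(base_rate=0.5, power=0.8, false_positive_rate=fpr)
        print(f"false-positive rate {fpr:.0%}: {share:.0%} of winners are noise")
```

With these example inputs, an inflated rate of 30% gives roughly 27% false winners. A lower base rate of good ideas pushes the share higher, which is why the estimate is a calibration rather than a measurement.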

Stop calling tests early. The single highest-leverage operational change is to enforce pre-registered sample sizes and refuse to call tests before the predetermined endpoint, regardless of how compelling the intermediate result looks. This is hard organizationally because intermediate looks are addictive, and stakeholders will push hard to ship the early winner and move on. The discipline has to be institutional, not individual. The closest commercial analog is what Microsoft does --- the platform itself hides intermediate significance indicators, removing the temptation rather than asking individual PMs to resist it.
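Pre-registration starts with a sample size that is fixed before launch. As a minimal sketch, the standard normal-approximation formula for a two-sided two-proportion test can be computed with the Python standard library. The baseline rate, minimum lift, alpha, and power below are hypothetical inputs, and the function name is an illustration rather than any platform's API.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Visitors per arm for a two-sided two-proportion z-test (normal approximation)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / (p2 - p1) ** 2)


# Hypothetical inputs: 5% baseline conversion, 10% relative lift as the
# smallest effect worth detecting.
n = sample_size_per_arm(baseline_rate=0.05, relative_lift=0.10)
print(f"Pre-register {n} visitors per arm and analyze once at that point.")
```

The number that comes out is the endpoint the organization commits to; nothing before it counts as a result.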

Migrate to a peeking-resistant methodology if you’re running on homegrown infrastructure. This is a meaningful engineering investment but it is the right one if your experimentation program is high-stakes enough to justify it. The mSPRT machinery is well-documented (the Johari-Pekelis-Walsh paper has full technical details), the academic literature has worked out most of the edge cases, and the implementation effort is on the order of a few engineer-months rather than the multi-year project that adopting a serious statistical methodology used to be. If you cannot do the full migration, the alternative is the organizational fix of strict pre-registered stopping rules with no intermediate looks.
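To give a feel for how small the core of the machinery is, here is a minimal sketch of a mixture sequential probability ratio test. It assumes normally distributed outcomes with a known per-arm variance and a zero-centered normal mixing distribution over the true difference; `sigma2`, `tau2`, and the function names are assumptions made for illustration. A production system also has to estimate variance, handle conversion metrics, and deal with unequal arm sizes, none of which is covered here.

```python
import math
import random


def log_mixture_lr(n, mean_diff, sigma2, tau2):
    """Log mixture likelihood ratio against H0: no difference between arms.

    n observations per arm, known per-arm variance sigma2, and a
    N(0, tau2) mixing distribution over the true difference.
    """
    denom = 2 * sigma2 + n * tau2
    return (0.5 * math.log(2 * sigma2 / denom)
            + (n ** 2) * tau2 * mean_diff ** 2 / (4 * sigma2 * denom))


def always_valid_p_values(diffs, sigma2, tau2):
    """Running always-valid p-values: p_n = min(p_{n-1}, 1 / Lambda_n)."""
    p, total, out = 1.0, 0.0, []
    for n, d in enumerate(diffs, start=1):
        total += d  # d = one treatment observation minus one control observation
        p = min(p, math.exp(-log_mixture_lr(n, total / n, sigma2, tau2)))
        out.append(p)
    return out


def null_rejection_rate(n_sims=500, n_obs=500, alpha=0.05,
                        sigma2=1.0, tau2=0.05, seed=7):
    """Share of simulated A/A tests that ever cross alpha, checking every observation."""
    rng = random.Random(seed)
    sd_diff = math.sqrt(2 * sigma2)  # difference of two independent N(0, sigma2) draws
    hits = 0
    for _ in range(n_sims):
        diffs = [rng.gauss(0.0, sd_diff) for _ in range(n_obs)]
        # The running minimum means the last p-value records any earlier crossing.
        if always_valid_p_values(diffs, sigma2, tau2)[-1] <= alpha:
            hits += 1
    return hits / n_sims


aa_rate = null_rejection_rate()
print(f"A/A rejection rate with a look after every observation: {aa_rate:.3f}")
```

Because the p-value is a running minimum of the inverse likelihood ratio, it can only go down, and under the null the chance that it ever drops below alpha stays at or below alpha no matter how often the dashboard is refreshed.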

Educate stakeholders on what the dashboard is and is not telling them. The single most damaging behavior in an experimentation program is the stakeholder who walks up to the PM, sees a green p-value on the dashboard, and asks “why haven’t we shipped this yet?” The stakeholder needs to understand that the intermediate result is meaningless and that calling the test early would produce a false win at high probability. This is a culture change, not a methodology change, and it is the harder of the two.

Build a post-launch replication monitor. The most honest accountability for an experimentation program is to track what fraction of shipped “winning” variants actually deliver the claimed lift in production. This requires instrumentation that compares pre-launch and post-launch performance of the metric on the affected surface, ideally with some form of holdout or control group. It is hard to build, it is uncomfortable to look at, and it is the single most informative diagnostic for how trustworthy your experimentation program actually is. A program with 80%+ post-launch replication is in good shape. A program below 60% has serious calibration problems, and peeking is the leading candidate for the cause.
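A first version of the monitor can be a few lines on top of whatever holdout data you already collect. The sketch below is a hypothetical data model: the record fields, the definition of "replicated" (the post-launch holdout interval excludes zero), and the label for the middle band are all assumptions. Only the 80% and 60% thresholds come from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class ShippedTest:
    name: str
    claimed_lift: float        # relative lift reported when the test was called
    holdout_lift_lower: float  # lower bound of the post-launch holdout interval


def replication_rate(tests):
    """Share of shipped winners whose holdout interval still excludes zero."""
    # Assumption: "replicated" means the post-launch lower bound is above zero.
    replicated = sum(1 for t in tests if t.holdout_lift_lower > 0)
    return replicated / len(tests)


def program_health(rate):
    """Map a replication rate onto the 80% / 60% bands."""
    if rate >= 0.80:
        return "good shape"
    if rate < 0.60:
        return "serious calibration problems"
    return "needs attention"  # illustrative label for the band in between


# Hypothetical records for illustration only.
shipped = [
    ShippedTest("checkout-copy", 0.041, 0.012),
    ShippedTest("button-color", 0.038, -0.009),
    ShippedTest("pricing-layout", 0.052, 0.020),
    ShippedTest("nav-reorder", 0.030, -0.004),
]
rate = replication_rate(shipped)
print(f"{rate:.0%} replicated: {program_health(rate)}")
```

A stricter definition, such as requiring the holdout estimate to recover some fraction of the claimed lift, is a reasonable alternative; what matters is that the definition is fixed before anyone looks at the results.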

The bottom-line message is uncomfortable. If your experimentation program is producing “wins” at a rate that feels too good to be true relative to your post-launch outcomes, the explanation is probably that your statistics are inflated by peeking and a meaningful fraction of those wins were never real. The fixes exist, they are well-documented, and they are increasingly standard in serious commercial platforms. The question is whether your organization has the discipline to adopt them.

Sources

  • Wald, A. (1947). Sequential Analysis. John Wiley & Sons. (The original foundation of sequential hypothesis testing.)
  • Pocock, S. J. (1977). Group sequential methods in the design and analysis of clinical trials. Biometrika, 64(2), 191–199. DOI: 10.1093/biomet/64.2.191
  • O’Brien, P. C., & Fleming, T. R. (1979). A multiple testing procedure for clinical trials. Biometrics, 35(3), 549–556. DOI: 10.2307/2530245
  • Lan, K. K. G., & DeMets, D. L. (1983). Discrete sequential boundaries for clinical trials. Biometrika, 70(3), 659–663. DOI: 10.1093/biomet/70.3.659
  • Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. DOI: 10.1177/0956797611417632
  • Pekelis, L., Walsh, D., & Johari, R. (2015). The New Stats Engine. Optimizely white paper. Available at: optimizely.com
  • Johari, R., Pekelis, L., & Walsh, D. J. (2017). Always valid inference: Bringing sequential analysis to A/B testing. arXiv: 1512.04922. (Published version: Johari, R., Pekelis, L., & Walsh, D. J. (2022). Always valid inference: Continuous monitoring of A/B tests. Operations Research, 70(3), 1806–1821. DOI: 10.1287/opre.2021.2135)
  • Johari, R., Koomen, P., Pekelis, L., & Walsh, D. (2017). Peeking at A/B tests: Why it matters, and what to do about it. KDD ’17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1517–1525. DOI: 10.1145/3097983.3097992
  • Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. ISBN: 978-1108724265.
  • Fabijan, A., Gupchup, J., Gupta, S., Omhover, J., Qin, W., Vermeer, L., & Dmitriev, P. (2019). Diagnosing sample ratio mismatch in online controlled experiments: A taxonomy and rules of thumb for practitioners. KDD ’19. DOI: 10.1145/3292500.3330722

FAQ

What about Bayesian A/B testing? Does it solve the peeking problem?

Bayesian A/B testing does solve the peeking problem in the sense that there is no false-positive rate to inflate --- the posterior probability that treatment beats control is a coherent quantity to evaluate at any sample size, and looking at it earlier does not break the math. The catch is that the choice of prior materially affects the posterior in low-data regimes, and an uninformative prior plus a stopping rule like “stop when posterior probability exceeds 95%” can produce behavior that looks a lot like the frequentist peeking problem in practice. The right Bayesian approach uses defensible priors derived from historical test data on similar interventions and a decision rule that incorporates the value of additional information (expected loss minimization, for example). Bayesian methods are not a free lunch --- they require methodological discipline of their own --- but they do sidestep the specific frequentist failure mode that is the focus of this article.
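As a rough sketch of the two quantities mentioned above, the posterior probability that treatment beats control and the expected loss of shipping it can both be estimated by Monte Carlo from Beta posteriors. The Beta(1, 1) default is only a placeholder, since the whole point of the answer is that a prior built from historical tests is the defensible choice. The counts and function name are hypothetical.

```python
import random


def bayesian_ab(conv_a, n_a, conv_b, n_b, prior_alpha=1.0, prior_beta=1.0,
                draws=20000, seed=11):
    """Monte Carlo estimate of P(B > A) and the expected loss of shipping B."""
    rng = random.Random(seed)
    wins_b, loss_b = 0, 0.0
    for _ in range(draws):
        # Beta posterior for each arm under a Beta(prior_alpha, prior_beta) prior.
        rate_a = rng.betavariate(prior_alpha + conv_a, prior_beta + n_a - conv_a)
        rate_b = rng.betavariate(prior_alpha + conv_b, prior_beta + n_b - conv_b)
        if rate_b > rate_a:
            wins_b += 1
        # Conversion rate given up in the draws where B is actually worse.
        loss_b += max(rate_a - rate_b, 0.0)
    return wins_b / draws, loss_b / draws


# Hypothetical counts: 500 of 10,000 convert in control, 600 of 10,000 in treatment.
prob_b_better, expected_loss_b = bayesian_ab(500, 10_000, 600, 10_000)
print(f"P(B > A) = {prob_b_better:.3f}, expected loss of shipping B = {expected_loss_b:.5f}")
```

A decision rule built on expected loss ("ship when the expected loss falls below a tolerance you set in advance") weighs how much you stand to lose when wrong, which a bare probability threshold does not.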

What if I HAVE to look at the test daily because stakeholders demand it?

This is the most common practical objection, and the answer is that you have two options. Option one is to use always-valid inference (mSPRT or similar) so that the daily looks are statistically harmless. Option two is to look at the test daily but commit not to make any stopping decisions based on the intermediate looks --- only the pre-registered endpoint counts. The second option requires unusual organizational discipline and a lot of “yes I know it looks significant, we agreed not to call it early” conversations. The first option is the easier institutional path because it removes the conflict entirely. The wrong option is to keep using fixed-horizon NHST and let the daily looks inform actual stopping decisions, which is exactly the pattern this article is warning against.
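The gap between the wrong option and option two is easy to simulate. The sketch below runs A/A tests (no true effect) with one look per day for 14 days and compares "call it at the first significant look" against "look daily, decide only at the endpoint". It is an idealized setup: equal daily traffic, a known-variance z-statistic, and a fixed 1.96 threshold are all simplifying assumptions.

```python
import math
import random


def simulate_aa_tests(n_sims=4000, n_looks=14, z_crit=1.96, seed=3):
    """Compare two decision rules on A/A tests with one look per day.

    Each day contributes an independent standard-normal increment, so the
    z-statistic after k days is the running sum divided by sqrt(k).
    """
    rng = random.Random(seed)
    stopped_early = 0  # rule 1: call the test at the first significant look
    final_only = 0     # rule 2: look daily, decide only at the final look
    for _ in range(n_sims):
        running, crossed, z = 0.0, False, 0.0
        for k in range(1, n_looks + 1):
            running += rng.gauss(0.0, 1.0)
            z = running / math.sqrt(k)
            if abs(z) > z_crit:
                crossed = True
        stopped_early += crossed
        final_only += abs(z) > z_crit
    return stopped_early / n_sims, final_only / n_sims


peeking_rate, endpoint_rate = simulate_aa_tests()
print(f"False-positive rate, stop at first significant look: {peeking_rate:.3f}")
print(f"False-positive rate, decide only at the endpoint:    {endpoint_rate:.3f}")
```

Both rules involve exactly the same amount of looking. Only the rule that lets a look trigger a decision inflates the error rate, which is why option two works when the discipline holds.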

What about peeking during the ramp-up phase before full traffic exposure?

Most A/B testing platforms support a ramp-up phase where the variant is exposed to a small percentage of traffic first (say 1%, then 5%, then the full 50/50 split) to check for catastrophic regressions before full exposure. This kind of looking is not the same as the peeking problem this article addresses, because ramp-up monitoring checks for guardrail violations (errors, page-load regressions, drastic conversion drops), not for primary-metric significance. Looking at guardrail metrics during ramp-up is fine and necessary. The peeking problem is specifically about checking primary-metric significance and using it to stop the test early. The two activities should be procedurally separate.

What about novelty-effect monitoring? Don’t I need to look to see if the lift is fading?

Novelty effects (where a treatment shows large initial lift that fades over time as users habituate) are real and worth monitoring. The right way to handle this is to set the test duration long enough to capture the post-novelty equilibrium (typically 2-4 weeks for most consumer surfaces) and analyze the result at the end of that window. Looking at the intermediate trend to assess novelty is informative as long as you do not let the assessment change when the test ends. Stopping early is the obvious violation, but extending counts too: if you see a fading lift and decide to extend the test to confirm, you are back in the peeking trap. The cleaner approach is to pre-register the test duration based on novelty assumptions and stick to it.

Does sequential testing make every test take longer?

Not always. When the true effect is small or absent, sequential testing takes somewhat longer than fixed-horizon NHST evaluated at its predetermined end: you pay for the right to look continuously by accepting a small loss of statistical efficiency. But when the true effect is large, sequential testing stops earlier than the fixed-horizon design would have, because the test statistic crosses the (appropriately adjusted) threshold before the predetermined sample size. So for large real effects, sequential testing is faster; for small effects and null effects, fixed-horizon is faster. The trade is favorable for most realistic experimentation portfolios because the cost of being wrong about a winner (shipping a false positive) is usually larger than the cost of running the test a little longer on a true null. The Johari-Pekelis-Walsh paper has detailed analysis of the power-versus-sample-size trade for various test designs.

Is the peeking problem just a statistical purist’s complaint, or does it actually cost money?

It costs real money in two ways. First, false-positive “wins” that get shipped and then quietly fail to replicate are direct costs --- engineering time was spent on the rollout, the metric did not improve, and the team’s roadmap was distorted by spurious evidence. Second, the cumulative effect of a high false-positive rate is loss of organizational trust in experimentation as a discipline --- after enough “winners” fail to replicate post-launch, stakeholders stop trusting the experimentation program’s results, which undermines the strategic value of building an experimentation infrastructure in the first place. The largest tech companies built their experimentation platforms on rigorous statistical foundations precisely because they understood that an experimentation program that produces unreliable results is worse than no experimentation program at all.

What about Multi-Armed Bandits and other adaptive allocation schemes?

Multi-armed bandit methods (Thompson sampling, UCB, contextual bandits) are designed for a different problem than classical A/B testing. The bandit is optimized for cumulative reward during the test (minimizing regret) rather than for clean inference about which arm is best (hypothesis testing). For experiments where the cost of exposing users to a worse variant is high and the test is being run to optimize cumulative reward, bandits are the right tool. For experiments where the goal is to learn whether the treatment is better than control with calibrated confidence, classical A/B testing (with peeking-resistant statistics) is the right tool. The peeking problem in bandit settings is a different problem with different solutions, and the literature on best-arm identification (rather than regret minimization) is the relevant reference.
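The difference in objective shows up directly in how traffic gets allocated. Here is a minimal sketch of Bernoulli Thompson sampling with uniform Beta(1, 1) priors; the true conversion rates, the number of rounds, and the function name are hypothetical values chosen for illustration.

```python
import random


def thompson_sampling(true_rates, rounds=5000, seed=5):
    """Bernoulli Thompson sampling with Beta(1, 1) priors; returns pulls per arm."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(rounds):
        # Draw one plausible conversion rate per arm from its posterior
        # and send this visitor to the arm with the highest draw.
        samples = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
        arm = samples.index(max(samples))
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


# Hypothetical true conversion rates: control 5%, treatment 10%.
pulls = thompson_sampling([0.05, 0.10])
print(f"Traffic per arm: {pulls}")
```

The allocation drifts toward the better arm as evidence accumulates, which is good for cumulative reward but leaves the weaker arm with few observations. That imbalance is one reason a bandit run does not hand you the clean, calibrated comparison that a fixed 50/50 test does.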

Where can I learn more about implementing this in my own stack?

The Johari-Pekelis-Walsh arXiv paper has the full technical details of the mSPRT methodology and is implementable from the paper. The Kohavi-Tang-Xu book is the best practical guide for industry practitioners and covers the broader trustworthiness questions beyond just peeking. The clinical-trials literature on group-sequential designs (Pocock, O’Brien-Fleming, Lan-DeMets, Jennison-Turnbull) is mature and has good textbook treatments if you want the formal foundation. Commercial platforms (Optimizely, Eppo, Statsig, GrowthBook) have all converged on peeking-resistant methodologies and have documentation that explains their specific implementations. Open-source implementations of mSPRT and Bayesian A/B testing are available on PyPI and CRAN. The harder problem is usually not the methodology itself but the organizational change required to enforce pre-registered stopping rules and resist the temptation to call tests early.

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified behavioral economist (1 of ~1,000 worldwide). 200+ A/B tests across energy, SaaS, fintech, e-commerce, and marketplace verticals.