Here is a number that should alarm every growth team: when 1,000 A/A tests were run — tests where both groups saw the identical experience — 771 of them reached 90% statistical significance at some point during the test. Not 5%. Not 50%. Seventy-seven percent of tests where there was literally nothing to find still "found" a winner.
This isn't a bug. It's not a flawed testing tool. It's the predictable mathematical consequence of how most teams run A/B tests: they check results continuously and stop when the numbers look good. This practice — called "peeking" — is the single most destructive habit in experimentation, and it turns the 5% false positive rate you think you have into something far worse.
What False Positives Actually Are
A false positive (Type I error) occurs when you conclude that your variant outperforms the control, but the measured difference is actually due to random chance. There is no real effect — the variant is identical in performance to the control — but the data, by luck, happened to show a pattern that crossed your significance threshold.
When you set your significance level at 95% (alpha = 0.05), you're accepting a 5% chance of this happening. In other words, if you ran 100 tests where the variant had no real effect, you'd expect about 5 of them to "find" a statistically significant difference anyway. This is the cost of using probability-based inference — sometimes chance looks like signal.
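That base rate is easy to verify by simulation. The sketch below (Python standard library only; the sample sizes and 10% conversion rate are invented so it runs quickly) runs hundreds of null tests and evaluates each one exactly once, at its final sample size. Roughly 5% come back "significant," just as the alpha level promises:

```python
import math
import random

random.seed(42)

def z_stat(conv_a, conv_b, n):
    """Two-proportion z-statistic for equal-sized groups."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return abs(conv_a / n - conv_b / n) / se if se > 0 else 0.0

N_TESTS, N_USERS, RATE = 400, 2000, 0.10  # both arms share the same true rate

false_positives = sum(
    z_stat(
        sum(random.random() < RATE for _ in range(N_USERS)),
        sum(random.random() < RATE for _ in range(N_USERS)),
        N_USERS,
    ) > 1.96  # one look only, at alpha = 0.05 (two-sided)
    for _ in range(N_TESTS)
)
print(f"Significant null tests: {false_positives / N_TESTS:.1%}")
```

Checked once at the planned sample size, the procedure behaves exactly as advertised. The rest of this article is about what happens when you don't check once.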
Five percent sounds manageable. But here's the problem: that 5% rate only holds if you check results exactly once, at a predetermined sample size. The moment you check results more than once, the math changes dramatically.
The A/A Test Proof
An A/A test is the most powerful demonstration of the false positive problem. In an A/A test, both groups see the exact same experience. There is no variant. There is no possible real effect. Any "significant" result is, by definition, a false positive.
When researchers ran 1,000 simulated A/A tests and checked significance at regular intervals throughout each test (mimicking how most teams actually monitor their tests), the results were striking:
771 out of 1,000 tests reached 90% significance at some point. That means if a team were monitoring these tests and stopping them when they "reached significance," they would have declared a winner in 77.1% of tests where no winner existed.
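The behavior is easy to reproduce at small scale. The sketch below (standard library only, with made-up sample sizes; it is not an attempt to replicate the cited 1,000-test study) runs A/A tests and peeks every 100 users against the 90% threshold. Even with only 20 looks per test, far more than 10% of null tests are "significant" at some point, and more frequent checks over longer tests drive the rate higher still:

```python
import math
import random

random.seed(7)

def z_stat(conv_a, conv_b, n):
    """Two-proportion z-statistic for equal-sized groups."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return abs(conv_a / n - conv_b / n) / se if se > 0 else 0.0

N_TESTS, N_USERS, RATE = 300, 2000, 0.10   # A/A: both arms share the same rate
CHECK_EVERY, Z_90 = 100, 1.645             # peek every 100 users, 90% threshold

peeked_fp = 0
for _ in range(N_TESTS):
    conv_a = conv_b = 0
    ever_significant = False
    for n in range(1, N_USERS + 1):
        conv_a += random.random() < RATE
        conv_b += random.random() < RATE
        if n % CHECK_EVERY == 0 and z_stat(conv_a, conv_b, n) > Z_90:
            ever_significant = True    # a peeking team would stop and ship here
    peeked_fp += ever_significant

print(f"Null tests 'significant' at some point: {peeked_fp / N_TESTS:.0%}")
```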
This isn't a flaw in the statistical method. The method works perfectly when used correctly: check results once, at the predetermined sample size. The problem is entirely behavioral — teams can't resist checking early and often, and every check is another opportunity for random variation to cross the significance threshold.
Why Stopping at Significance Is the Number One Mistake
The most common workflow in A/B testing goes something like this: launch the test, check results the next morning, check again after lunch, check every day for a week, and stop the test when the dashboard shows "statistically significant." This feels responsible. It feels data-driven. It is, in fact, the worst possible approach.
Here's why. Statistical significance fluctuates during a test. In the early days, when sample sizes are small, random variation dominates. The conversion rate might be 12% in the control and 15% in the variant on Tuesday, 12% and 11.5% on Wednesday, and 12% and 14% on Thursday. Each of these snapshots might cross or uncross the significance threshold multiple times.
Every time you check, you're running a new statistical test. If you check 10 times during a test, you're not running one test with a 5% false positive rate; you're running 10 overlapping tests, each with its own chance to cross the threshold. Because each check shares data with the previous ones, the checks are correlated, so the inflation is smaller than 10 fully independent tests would produce, but it is still severe. The math is similar to the birthday problem: individually unlikely events become probable with repetition.
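A rough upper bound makes the point concrete: if the checks were fully independent, the chance that at least one of k checks crosses alpha = 0.05 would be 1 − (1 − alpha)^k. Real repeated checks share data, so the true rate lands between 5% and this bound, but the trend is the same:

```python
alpha = 0.05
for k in (1, 2, 5, 10, 20):
    worst_case = 1 - (1 - alpha) ** k  # independence upper bound
    print(f"{k:>2} checks -> up to {worst_case:.1%} chance of a false positive")
```

Ten independent checks would allow a false positive rate of roughly 40%, eight times the rate you think you're running at.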
The solution is simple in theory and difficult in practice: calculate your required sample size before launching the test, and only evaluate significance when that sample size is reached. Don't check early. Don't stop early. Run the test for the full planned duration.
The Business Cost of False Positives
False positives don't just waste time — they actively harm your business through multiple channels:
Wasted engineering resources. Every false positive that gets implemented requires engineering time to build, QA, deploy, and maintain. If 30% of your "winning" tests are actually false positives (a conservative estimate for teams that peek regularly), nearly a third of your optimization engineering effort produces zero value.
Erosion of trust. When implemented "winners" don't move metrics in production, stakeholders lose confidence in the experimentation program. "We tested it and it was supposed to improve conversions by 15%, but nothing changed." This narrative, repeated a few times, undermines the entire culture of data-driven decision-making.
Compounding bad decisions. If a false positive becomes your new control, subsequent tests are built on a flawed baseline. You're optimizing on top of a change that had no real effect, which means your measured improvements may also be artifacts.
Opportunity cost. Time spent implementing and evaluating false positives is time not spent on changes that would have produced real improvements. The opportunity cost is invisible but significant.
How to Protect Against False Positives
Reducing your false positive rate requires changes at the process level, not the tool level. The statistics work correctly when the process is followed correctly.
1. Pre-register your sample size and duration. Before launching any test, calculate the required sample size using a sample size calculator. Document this number. Commit to it. This single practice eliminates the most common source of false positives.
2. Don't check results early. This is the hardest practice to enforce. If you can't resist, use sequential testing methods (like SPRT or always-valid p-values) that are designed to be checked at multiple points without inflating the false positive rate. These methods exist precisely because the human temptation to peek is so strong.
3. Run tests for full business cycles. At minimum, run for one complete week to capture day-of-week effects. Two weeks is better. Tests that run for only 3-4 days are susceptible to day-of-week composition biases that look like treatment effects.
4. Use A/A tests to calibrate your system. Before trusting your testing infrastructure, run several A/A tests. If they show significant results more than 5% of the time (when checked only at the predetermined sample size), your system has a problem — perhaps a cookie mismatch, a population imbalance, or an instrumentation error.
5. Look at confidence intervals, not just p-values. A test that shows a significant result with a confidence interval of [+0.1%, +25%] is telling you that the effect is real but you have almost no idea how large it is. This wide interval is a sign that you don't have enough data for a reliable estimate, even if the p-value technically crosses the threshold.
6. Be skeptical of large effects. In mature optimization programs, genuine effects are typically in the 1-5% range for incremental changes. A test showing a 30% improvement should trigger skepticism, not celebration. Large measured effects are more likely to be inflated by noise, especially if the sample size is small. Regression to the mean will almost certainly pull the actual effect downward.
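Point 5 can be made concrete with a minimal Wald interval for the lift between two conversion rates (the counts below are invented for illustration). A result can clear the significance bar while the interval is still far too wide to act on:

```python
from math import sqrt

def diff_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """95% Wald confidence interval for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Control: 120/1000 converted. Variant: 155/1000 converted.
lo, hi = diff_ci(120, 1000, 155, 1000)
print(f"Lift on conversion rate: [{lo:+.3f}, {hi:+.3f}]")
```

The interval excludes zero, so a dashboard would label this test "significant," yet the plausible lift runs from about half a percentage point to six and a half: more than a tenfold range. That spread, not the p-value, is what should drive the decision to keep collecting data.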
Sequential Testing: A Structured Alternative to Peeking
If checking results early is inevitable for your team (and for most teams, it is), sequential testing methods offer a principled alternative. These methods are designed to maintain a valid false positive rate even when results are checked continuously.
The most common approach is group sequential testing, which pre-specifies a set of "looks" at the data (e.g., at 25%, 50%, 75%, and 100% of the planned sample size). At each look, a stricter significance threshold is used to compensate for the multiple checks. The final look uses a threshold close to the standard 0.05, but earlier looks require much stronger evidence (e.g., p < 0.001 at the first look).
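The shape of those thresholds can be sketched with the classic O'Brien-Fleming approximation, in which the z-boundary at look k of K scales like sqrt(K/k). The constant 2.024 for four equally spaced looks comes from standard group-sequential tables; treat this as an illustration of the pattern, not a substitute for a proper alpha-spending implementation:

```python
from math import sqrt
from statistics import NormalDist

def obrien_fleming_bounds(n_looks, z_final=2.024):
    """Approximate O'Brien-Fleming z-boundaries for equally spaced looks.

    z_final ~ 2.024 gives an overall two-sided alpha of ~0.05 for 4 looks
    (constant taken from standard group-sequential tables).
    """
    norm = NormalDist()
    bounds = []
    for k in range(1, n_looks + 1):
        z_k = z_final * sqrt(n_looks / k)        # stricter at early looks
        p_k = 2 * (1 - norm.cdf(z_k))            # nominal two-sided p-value
        bounds.append((z_k, p_k))
    return bounds

for k, (z_k, p_k) in enumerate(obrien_fleming_bounds(4), start=1):
    print(f"Look {k}/4: reject only if |z| > {z_k:.3f} (nominal p < {p_k:.4f})")
```

The first look demands evidence at roughly the p < 0.0001 level, while the final look sits near the familiar 0.043, which is how the procedure spends almost no alpha early and keeps the overall false positive rate at about 5%.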
Another approach is always-valid p-values (also called anytime-valid inference), which use a different mathematical framework that produces valid conclusions regardless of when you check. These are increasingly popular in modern experimentation platforms because they match how teams actually work — checking results frequently — while maintaining statistical integrity.
The Cultural Challenge
The false positive problem is fundamentally a cultural problem, not a statistical one. The statistics are clear and well-understood. The challenge is building organizational habits that respect the math:
Don't celebrate wins too quickly. A test that "reaches significance" on day three probably hasn't. Wait for the full sample size before discussing results.
Don't incentivize win rates. If your experimentation team is measured by the percentage of tests that produce winners, they'll be tempted (consciously or not) to peek, stop early, and declare marginal results as wins. Measure testing velocity and learning quality instead.
Share the A/A test data. When stakeholders understand that 77% of A/A tests "find" a winner if you peek, the abstract concept of false positives becomes viscerally real. This single data point has changed more minds than any statistical lecture.
The Bottom Line
False positives are the silent killer of experimentation programs. They waste resources, erode trust, and compound into increasingly misguided optimization strategies. The fix is not more sophisticated statistics — it's more disciplined process: pre-register your sample size, resist the urge to peek, run tests for full business cycles, and treat large effects with healthy skepticism.
The 771-out-of-1,000 A/A test statistic isn't a curiosity. It's a warning. If your team is checking results daily and stopping at significance, your experimentation program is producing far more false positives than you realize. The changes you've implemented based on those results may not have moved the needle at all. And the sooner you confront that reality, the sooner you can start building an experimentation practice that actually works.