Multiple Comparisons Problem
The increased probability of falsely identifying a significant result when conducting multiple simultaneous statistical tests, as each test carries its own chance of a Type I error.
What Is the Multiple Comparisons Problem?
When you run many statistical tests at once, the probability that at least one returns a false positive grows much faster than intuition suggests. With m independent tests at significance level alpha, the family-wise error rate is 1 - (1 - alpha)^m, not alpha. Left unchecked, multiple comparisons produce bogus wins at alarming rates.
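A quick sketch of the arithmetic (illustrative Python; the thresholds and test counts are arbitrary examples):

```python
# Family-wise error rate (FWER) under m independent tests:
# P(at least one false positive) = 1 - (1 - alpha)**m
alpha = 0.05

for m in (1, 5, 10, 20, 100):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:>3} tests -> FWER = {fwer:.1%}")
```

Even at 10 tests, the chance of at least one false positive is already around 40%, eight times the nominal 5%.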
Also Known As
- Data science teams: multiplicity, look-elsewhere effect, multiple testing
- Growth teams: peeking problem, variant-count problem
- Marketing teams: "why our winners keep losing in retests"
- Engineering teams: FWER inflation
How It Works
Imagine running an A/B test with 10,000 visitors per variant and checking 10 secondary metrics at alpha = 0.05. The probability that at least one metric is falsely significant under a true null is 1 - (0.95)^10, or about 40%. Now imagine you also segment by device (3 categories), geography (5 regions), and traffic source (4 channels): you have effectively hundreds of tests. Unsurprisingly, you find "significant" segments. Most are illusions.
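The 40% figure above can be checked by simulation. This is a minimal sketch of an A/A-style null experiment: under a true null, each metric's p-value is uniform on [0, 1], so we just draw uniforms and count how often the minimum dips below alpha. The metric count and simulation size are illustrative choices.

```python
import random

random.seed(0)
ALPHA = 0.05
N_METRICS = 10     # the 10 secondary metrics from the example
N_SIMS = 20_000

# Count experiments where at least one null metric looks "significant".
hits = 0
for _ in range(N_SIMS):
    pvals = [random.random() for _ in range(N_METRICS)]
    if min(pvals) < ALPHA:
        hits += 1

print(f"At least one false win in {hits / N_SIMS:.1%} of null experiments")
# Analytically: 1 - 0.95**10, about 40%.
```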
Best Practices
- Do declare a single primary metric before the test and evaluate it at standard alpha.
- Do apply Bonferroni for small sets of planned comparisons.
- Do apply FDR control for large exploratory programs.
- Do not slice data into segments post-hoc and report the best one as a finding.
- Do not peek repeatedly during a test without sequential testing corrections.
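The Bonferroni and FDR corrections above can be sketched in a few lines. This is an illustrative implementation of Bonferroni and the Benjamini-Hochberg step-up procedure, not a reference library; the example p-values are made up. For real analyses, a tested library routine is preferable.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H0 for p-values below alpha / m (controls FWER)."""
    m = len(pvals)
    return [p < alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up procedure (controls FDR for independent tests).

    Find the largest rank k with p_(k) <= (k / m) * alpha, then
    reject the k smallest p-values.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            cutoff = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            reject[i] = True
    return reject

pvals = [0.001, 0.009, 0.012, 0.041, 0.27, 0.60]
print(bonferroni(pvals))          # [True, False, False, False, False, False]
print(benjamini_hochberg(pvals))  # [True, True, True, False, False, False]
```

Note the trade-off the example shows: Bonferroni's strict alpha/m threshold rejects only the smallest p-value, while BH tolerates a controlled fraction of false discoveries and keeps three, which is why FDR control suits large exploratory programs.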
Common Mistakes
- Reporting surprise segment wins without multiplicity correction.
- Confusing "exploratory" with "unaccountable"; exploration still needs a controlled false discovery rate.
- Treating mid-test peeking as harmless when it dramatically inflates false positives.
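The peeking mistake above is multiplicity in time, and it can be demonstrated with a small simulation. This sketch runs A/A tests (both arms draw from the same distribution), checks a z-statistic at 10 interim looks, and stops at the first "significant" result; all parameters are illustrative.

```python
import math
import random

random.seed(1)
Z_CRIT = 1.96    # two-sided z threshold for nominal alpha = 0.05
PEEKS = 10       # interim looks at the data
BATCH = 100      # new observations per arm between looks
SIMS = 5_000

def null_experiment():
    """A/A test: declare a win if any interim z exceeds the threshold."""
    diff_sum = 0.0
    n = 0
    for _ in range(PEEKS):
        for _ in range(BATCH):
            diff_sum += random.gauss(0, 1) - random.gauss(0, 1)
            n += 1
        z = diff_sum / math.sqrt(2 * n)  # each difference has variance 2
        if abs(z) > Z_CRIT:
            return True                  # "significant" -> stop and ship
    return False

false_wins = sum(null_experiment() for _ in range(SIMS))
print(f"False-positive rate with {PEEKS} peeks: {false_wins / SIMS:.1%}")
# Well above the nominal 5%; a single pre-planned final look holds at ~5%.
```

Sequential testing methods (e.g. group-sequential boundaries or always-valid inference) exist precisely to make repeated looks safe.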
Industry Context
- SaaS/B2B: Funnel-stage segmentation creates silent multiplicity; watch for it.
- Ecommerce/DTC: Category-level cuts are a common source of false wins.
- Lead gen/services: Long sales cycles encourage repeat peeking, which is multiplicity in time.
The Behavioral Science Connection
This is the "Texas sharpshooter fallacy": firing bullets into a barn wall, then painting a bullseye around the densest cluster. Kahneman's work on how people construct causal stories shows that humans are compelled to explain noise. Multiplicity correction is a discipline that forces the organization to remember how many shots were fired.
Key Takeaway
More tests mean more false positives; either correct for them formally or narrow your scope to a single primary metric.