
Multiple Comparisons Problem

The increased probability of falsely identifying a significant result when conducting multiple simultaneous statistical tests, because each test carries its own chance of a Type I error.

What Is the Multiple Comparisons Problem?

When you run many statistical tests at once, the probability that at least one returns a false positive grows much faster than people expect. Each test has its own alpha, and the combined false-positive rate compounds. Unchecked, multiple comparisons produce bogus wins at alarming rates.
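The compounding is easy to see in a small Monte Carlo sketch (the function name, trial count, and seed here are illustrative, not from the source): under a true null each test's p-value is uniform on [0, 1], so the chance that at least one of many tests dips below alpha climbs quickly.

```python
import random

def simulated_fwer(num_tests, alpha=0.05, trials=20_000, seed=0):
    """Estimate the family-wise error rate: the chance that at least one
    of num_tests true-null tests comes back 'significant' by luck."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Under a true null, each p-value is uniform on [0, 1],
        # so it falls below alpha with probability alpha.
        if any(rng.random() < alpha for _ in range(num_tests)):
            hits += 1
    return hits / trials

for k in (1, 5, 10, 20):
    print(k, "tests ->", round(simulated_fwer(k), 3))
```

Running this shows the rate starting near 0.05 for a single test and growing steeply as tests are added.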

Also Known As

  • Data science teams: multiplicity, look-elsewhere effect, multiple testing
  • Growth teams: peeking problem, variant-count problem
  • Marketing teams: "why our winners keep losing in retests"
  • Engineering teams: FWER inflation

How It Works

Imagine running an A/B test with 10,000 visitors per variant and checking 10 secondary metrics at alpha = 0.05. The probability that at least one metric is falsely significant under a true null is 1 - (0.95)^10, or about 40%. Now imagine you also segment by device (3 categories), geography (5 regions), and traffic source (4 channels): you have effectively hundreds of tests. Unsurprisingly, you find "significant" segments. Most are illusions.
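The arithmetic above can be written down directly; `family_wise_error_rate` is a hypothetical helper name, not a standard API:

```python
def family_wise_error_rate(num_tests, alpha=0.05):
    """Probability of at least one false positive across
    num_tests independent tests, each run at level alpha."""
    return 1 - (1 - alpha) ** num_tests

# The 10-secondary-metric example from above:
print(round(family_wise_error_rate(10), 3))   # → 0.401
```

Layering device, geography, and traffic-source cuts on top of the 10 metrics multiplies the test count further, pushing the same formula toward certainty of at least one false win.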

Best Practices

  • Do declare a single primary metric before the test and evaluate it at standard alpha.
  • Do apply Bonferroni for small sets of planned comparisons.
  • Do apply FDR control for large exploratory programs.
  • Do not slice data into segments post-hoc and report the best one as a finding.
  • Do not peek repeatedly during a test without sequential testing corrections.
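A minimal sketch of the two corrections named above, Bonferroni for small planned sets and Benjamini-Hochberg for FDR control in exploratory programs (function names and the example p-values are illustrative):

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni: reject only p-values at or below alpha / m.
    Controls the family-wise error rate, at the cost of power."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: find the largest rank k
    (in sorted order) with p_(k) <= (k / m) * alpha, then reject every
    hypothesis at or below that rank. Controls the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= (rank / m) * alpha:
            cutoff = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            reject[i] = True
    return reject

p = [0.001, 0.008, 0.027, 0.041, 0.27]
print(bonferroni(p))          # strictest: keeps only the smallest p-values
print(benjamini_hochberg(p))  # less strict: admits more discoveries
```

On this example Bonferroni (threshold 0.05 / 5 = 0.01) rejects two hypotheses while Benjamini-Hochberg rejects three, which is the usual trade-off: FWER control is conservative, FDR control buys power by tolerating a controlled fraction of false discoveries.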

Common Mistakes

  • Reporting surprise segment wins without multiplicity correction.
  • Confusing "exploratory" with "unaccountable"; exploratory analysis still needs false discovery rate control.
  • Treating mid-test peeking as harmless when it dramatically inflates false positives.
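The peeking mistake is easy to demonstrate with an A/A simulation (all names and parameters here are illustrative): a test with no real effect, checked after every batch with an uncorrected z-test and stopped at the first "significant" look, wins far more often than the nominal 5%.

```python
import math
import random

def peeking_false_positive_rate(looks=10, n_per_look=100, trials=2_000, seed=1):
    """Simulate an A/A test (both variants draw from the same N(0, 1))
    where a two-sided z-test at alpha = 0.05 is run after every batch,
    stopping at the first significant look. Returns the fraction of
    trials that ever declared a (false) winner."""
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided critical value at alpha = 0.05
    false_positives = 0
    for _ in range(trials):
        sum_a = sum_b = 0.0
        n = 0
        for _ in range(looks):
            for _ in range(n_per_look):
                sum_a += rng.gauss(0, 1)
                sum_b += rng.gauss(0, 1)
            n += n_per_look
            se = math.sqrt(2 / n)  # std error of the mean difference (sigma = 1)
            z = (sum_a / n - sum_b / n) / se
            if abs(z) > z_crit:
                false_positives += 1
                break  # stopping at the first 'win' is the peeking error
        # trials that never cross the threshold count as correct non-rejections
    return false_positives / trials

print(peeking_false_positive_rate())  # well above the nominal 0.05
```

Because the interim looks are correlated, the inflation is milder than for independent tests, but with ten looks it still lands several times above the nominal rate, which is why sequential testing corrections exist.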

Industry Context

  • SaaS/B2B: Funnel-stage segmentation creates silent multiplicity; watch for it.
  • Ecommerce/DTC: Category-level cuts are a common source of false wins.
  • Lead gen/services: Long sales cycles encourage repeat peeking, which is multiplicity in time.

The Behavioral Science Connection

This is the "Texas sharpshooter fallacy" — firing bullets into a barn, then drawing a bullseye around the densest cluster. Kahneman's work on narrative bias shows humans are compelled to explain noise. Multiplicity correction is a discipline that forces the organization to remember how many shots were fired.

Key Takeaway

More tests mean more false positives; either correct for them formally or narrow your scope to a single primary metric.