More Variants, More Problems

It sounds intuitive: if testing one variant against a control is good, testing five variants must be five times better. You cover more ground, test more ideas, and increase the chance of finding a winner.

Except the math does not work that way. Every additional comparison you make increases the probability that at least one will appear significant by chance alone. Run enough comparisons and you are virtually guaranteed to find a "winner" — one that is nothing more than random noise dressed up as a real effect.

This is the multiple comparisons problem, and it is one of the most common sources of false discoveries in experimentation programs.

The Core Problem

When you set a conventional significance threshold, you are accepting a certain probability of a false positive for each individual comparison. This seems reasonable — a small chance of being wrong.

But that error rate applies to each comparison separately. When you make multiple comparisons, the probability that at least one produces a false positive is much higher.

With just five independent comparisons at a 5% significance level, the probability of at least one false positive is already about 23%. By fourteen comparisons it exceeds 50%, so a false positive becomes more likely than not.

Think of it like rolling dice. One roll is unlikely to come up with a specific number. But roll enough times and that outcome becomes inevitable. Each comparison is another roll.
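The inflation is easy to compute directly: with k independent comparisons each tested at level alpha, the chance of at least one false positive is 1 − (1 − alpha)^k.

```python
# Chance of at least one false positive across k independent
# comparisons, each tested at alpha = 0.05.
alpha = 0.05

for k in [1, 5, 10, 20]:
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:>2} comparisons: {p_any:.1%}")
# ->  1 comparisons: 5.0%
#     5 comparisons: 22.6%
#    10 comparisons: 40.1%
#    20 comparisons: 64.2%
```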

Where Multiple Comparisons Hide

The obvious case is testing multiple variants against a control. But multiple comparisons lurk in many less obvious places:

Multiple metrics

Most tests track a primary metric plus several secondary and guardrail metrics. If you test significance across all of them and report whichever ones are significant, you are making multiple comparisons even in a simple A/B test.

Segment analysis

After a test, teams often analyze results by segment: mobile vs. desktop, new vs. returning users, different geographic regions. Each segment comparison is an additional comparison. With enough segments, some will appear significant by chance.

Time-based analysis

Checking results at multiple time points is another form of multiple comparisons (this overlaps with the peeking problem). Each check is a separate comparison with its own false positive risk.
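The inflation from repeated checks is easy to demonstrate with a small simulation. This sketch uses synthetic Gaussian data with no true effect (an A/A comparison), checked at four interim points; a run counts as a "win" if any check crosses p < 0.05.

```python
# Illustrative simulation: peeking at an A/A test at several interim
# points inflates the false positive rate well above the nominal 5%.
import math
import random
from statistics import NormalDist

random.seed(1)
norm = NormalDist()

def peeking_false_positive_rate(n_sims=2000, n_obs=1000,
                                checks=(250, 500, 750, 1000)):
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        ever_significant = False
        for i in range(1, n_obs + 1):
            total += random.gauss(0, 1)  # null is true: mean is 0
            if i in checks:
                z = total / math.sqrt(i)          # running z-statistic
                p = 2 * (1 - norm.cdf(abs(z)))    # two-sided p-value
                if p < 0.05:
                    ever_significant = True
        if ever_significant:
            hits += 1
    return hits / n_sims

print(peeking_false_positive_rate())  # noticeably above 0.05
```

With four looks, the effective false positive rate lands in the low double digits rather than at 5%, even though each individual check uses the nominal threshold.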

Post-hoc hypothesis testing

When a team looks at the data, notices an interesting pattern, and then tests whether that pattern is significant, they have made an implicit comparison. The discovery was driven by the data, not by a pre-specified hypothesis, and the significance test does not account for the exploration that led to it.

Multiple test iterations

Running the same test multiple times until it reaches significance is another form of multiple comparisons. If you re-run a test three times at a 5% level, the chance that at least one run comes up significant under the null is about 14%, nearly triple the nominal rate.

The Family-Wise Error Rate vs. False Discovery Rate

There are two ways to think about controlling multiple comparisons:

Family-wise error rate (FWER)

The FWER is the probability of making one or more false discoveries across all comparisons. Controlling the FWER means that the chance of any false positive remains at the desired level, regardless of how many comparisons you make.

This is the most conservative approach. Methods like the Bonferroni correction control the FWER by dividing the significance threshold by the number of comparisons. If you make several comparisons, each one must reach a proportionally stricter threshold to be called significant.

The downside: Bonferroni is very conservative. With many comparisons, the per-comparison threshold becomes so strict that you lose the ability to detect real effects. You trade false positives for false negatives.
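As a sketch, with hypothetical p-values, the Bonferroni rule is a one-liner: divide the threshold by the number of comparisons.

```python
# Bonferroni: test each of m comparisons at alpha / m instead of alpha.
alpha = 0.05
p_values = [0.001, 0.004, 0.012, 0.030, 0.200]  # hypothetical results
m = len(p_values)
threshold = alpha / m  # 0.01 with five comparisons

significant = [p for p in p_values if p < threshold]
print(significant)  # only 0.001 and 0.004 survive the stricter bar
```

Note that 0.012 and 0.030 would have been called significant at the unadjusted 0.05 threshold; after correction they are not.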

False discovery rate (FDR)

The FDR is the expected proportion of discoveries that are false. If the FDR is controlled at, say, 10%, then on average at most 10% of the results you flag as significant are expected to be false positives.

FDR control is less conservative than FWER control. It accepts that some false discoveries will occur but limits their proportion among all discoveries. Methods like the Benjamini-Hochberg procedure control the FDR.

For most experimentation programs, FDR control is more practical than FWER control. You care about the proportion of your wins that are real, not about eliminating every possible false positive.
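The Benjamini-Hochberg procedure is simple enough to sketch directly: sort the p-values, find the largest rank k such that the k-th smallest p-value is at most (k/m)·q, and reject the k smallest. The p-values below are hypothetical.

```python
# Benjamini-Hochberg step-up procedure (a minimal sketch).
def benjamini_hochberg(p_values, q=0.05):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Step-up: keep the LARGEST rank satisfying the condition,
        # even if some smaller rank failed it.
        if p_values[i] <= rank / m * q:
            cutoff = rank
    rejected = set(order[:cutoff])
    return [i in rejected for i in range(m)]

# Hypothetical p-values from ten metric comparisons.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.368]
print(benjamini_hochberg(pvals, q=0.05))
# -> [True, True, False, False, False, False, False, False, False, False]
```

For production use, `statsmodels.stats.multitest.multipletests` with `method="fdr_bh"` implements the same procedure.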

Practical Corrections for A/B Testing

For multiple variants

When testing multiple variants against one control:

  • Use Dunnett's test. It is specifically designed for comparing multiple treatments to a single control and is less conservative than Bonferroni.
  • Adjust the significance threshold. Divide your threshold by the number of variant-control comparisons. Simple and effective for a small number of variants.
  • Use a hierarchical approach. First test whether any variant differs from control (omnibus test). Only if that is significant, proceed to individual comparisons.

For multiple metrics

  • Designate a primary metric. The primary metric is evaluated at the full significance level. Secondary metrics are evaluated with a correction or treated as exploratory.
  • Use the Holm-Bonferroni method. Less conservative than standard Bonferroni, it adjusts thresholds in a step-down fashion based on the rank of p-values.
  • Pre-register your metric hierarchy. Decide before the test which metrics matter most and how you will handle conflicts between them.
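Holm-Bonferroni is also easy to sketch with hypothetical p-values: test the smallest at alpha/m, the next at alpha/(m−1), and step down until one fails.

```python
# Holm-Bonferroni step-down procedure (a minimal sketch).
def holm_bonferroni(p_values, alpha=0.05):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, i in enumerate(order):
        if p_values[i] <= alpha / (m - step):
            reject[i] = True
        else:
            break  # step-down: once one fails, all larger p-values fail
    return reject

# Hypothetical p-values: primary metric first, then secondaries.
pvals = [0.004, 0.011, 0.030, 0.470]
print(holm_bonferroni(pvals))
# -> [True, True, False, False]
```

Note that 0.030 clears the naive 0.05 bar and even the final-step bar of 0.05/2 = 0.025 narrowly fails it, which is exactly the kind of marginal result the step-down procedure is designed to screen out.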

For segment analysis

  • Pre-specify segments of interest. If you plan to look at specific segments, include them in the test design and adjust for the additional comparisons.
  • Treat post-hoc segments as exploratory. Segment analyses not planned before the test should be labeled as hypothesis-generating, not hypothesis-confirming.
  • Require a significant overall effect first. Only drill into segments if the overall test is significant. This gating reduces the number of comparisons.

For multiple test iterations

  • Count prior attempts. If a test has been run before without significance, the current run is not independent. Account for the total number of attempts.
  • Use sequential testing methods. If you want the option to re-run, design the original test as a sequential experiment with planned interim analyses.

When to Worry and When Not To

Not every instance of multiple comparisons demands a formal correction. Context matters.

Worry when:

  • You are making a binary ship/no-ship decision based on multiple comparisons.
  • The cost of a false positive is high (pricing changes, major redesigns).
  • You are reporting results to stakeholders who will act on them without additional validation.
  • The number of comparisons is large relative to the number of true effects.

Worry less when:

  • You are doing exploratory analysis to generate hypotheses for future tests.
  • You have a strong prior that the effect exists and you are estimating its magnitude.
  • The cost of a false positive is low and easily reversible.
  • You will validate significant results with a follow-up test.

The key is transparency. If you make multiple comparisons without correction, say so. If results are exploratory, label them as such. Problems arise when uncorrected multiple comparisons are presented as confirmed findings.

The Organizational Cost of Uncorrected Multiple Comparisons

When teams routinely make uncorrected multiple comparisons, the experimentation program accumulates false discoveries. Over time, this creates several problems:

  • Shipped changes that do not work. False discoveries become permanent features. They add complexity without adding value.
  • Inflated program metrics. If the team counts every significant result as a win, the program looks more effective than it is.
  • Erosion of trust. When shipped variants consistently fail to deliver their projected lift, stakeholders lose faith in experimentation.
  • Wasted development effort. Building, maintaining, and supporting features that were shipped based on false discoveries is pure waste.

A disciplined approach to multiple comparisons is not statistical perfectionism. It is quality control for your experimentation program.

FAQ

How many variants should I test at once?

As many as you can afford to power after correction. With moderate traffic, two to three variants plus a control is usually the sweet spot. More than that dilutes traffic per variant and requires stronger evidence per comparison.

Do I need to correct for metrics I only look at informally?

Strictly speaking, yes — any comparison you make and act on carries false positive risk. Practically, if you are only looking at secondary metrics to inform future hypotheses (not to make decisions), formal correction is less critical. But be honest about the distinction.

What about multivariate testing (MVT)?

MVT tests multiple factors simultaneously, which creates a large number of comparisons. The multiple comparisons problem is especially severe in MVT because the number of factor combinations grows exponentially. MVT requires either very large samples or careful restriction of the comparisons you plan to evaluate.

Can Bayesian methods avoid the multiple comparisons problem?

Bayesian methods handle multiple comparisons differently but do not eliminate the issue. The prior acts as a natural regularizer, and hierarchical models can pool information across comparisons. But if you run many Bayesian tests and always act on the most extreme posterior, you will still accumulate false discoveries.

Written by Atticus Li

Revenue & experimentation leader — behavioral economics, CRO, and AI. CXL & Mindworx certified. $30M+ in verified impact.