The Multiple Comparisons Problem in A/B Testing: Why Running 20 Variants Guarantees A False Winner

Atticus Li

At alpha = 0.05, running 20 independent A/B tests produces at least one false positive with probability 64%, even when no variant has a real effect. The math is 89 years old and unambiguous. Most CRO programs apply no correction at all. This is one of the simplest reasons "winners" do not replicate.

By Atticus Li, May 25, 2026

It is the Thursday review meeting after a button-color sprint. The PM walks through the experiment dashboard. The team tested eight variants of the primary call-to-action --- five color treatments and three copy variations, all against a single control --- on the same checkout-conversion metric. Variant E (orange button, “Get Started” copy) shows a lift of 4.1% with p = 0.04. Six of the other seven variants are flat or slightly negative; one shows a marginally positive lift at p = 0.12. The dashboard’s significance shading highlights variant E in green. The team decides to ship E to 100% of traffic. Three weeks of production data later, the post-launch monitor shows the orange button is performing identically to the original control. The PM files a ticket about possible instrumentation drift, and the team moves on to the next set of experiments.

There is no instrumentation drift. There is also no real effect. Variant E was the chance winner of a multiple-testing tournament, and the 4.1% lift was always a coin flip dressed up in a p-value. With eight comparisons at the conventional alpha = 0.05 threshold, the probability that at least one variant would clear the significance bar by random noise alone --- with all eight variants being substantively identical to the control --- is approximately 34%. The fact that one variant did clear is not evidence of an underlying effect. It is the expected output of testing eight things at once and stopping at the first one that looks lucky.

This is the multiple comparisons problem in A/B testing, and it is the most arithmetically unambiguous of the systematic failure modes that produce non-replicating winners. The math is 89 years old. The corrections are well-established and computationally trivial. Every undergraduate statistics curriculum covers the issue. And yet most internal experimentation programs run multi-arm experiments, check dozens of secondary metrics for significance, slice results across many subgroups, and run hundreds of tests per year --- all without applying any correction for the resulting inflation of false positives. The consequence is that a meaningful fraction of every program’s headline wins each quarter are chance discoveries that do not reflect any real effect, and the noise compounds across the portfolio as the false winners propagate through quarterly reports, executive briefings, and forward-looking impact forecasts.

The framing for CRO and growth leaders is calibration. If your program runs 100 tests per year at alpha = 0.05 with no correction, the expected number of false-positive “winners” from pure chance, before counting any genuine effects, is approximately 5. If you report 8 to 12 wins per year as the program’s annual output, roughly 40% to 60% of those wins (5 of 12 up to 5 of 8) may have been produced by the multiple-comparisons mechanism rather than by real underlying effects. The program is honestly running tests; the platform is honestly reporting p-values; the analysts are honestly reading the dashboards. The math is doing the rest.

What The Multiple Comparisons Problem Actually Is

The multiple comparisons problem (sometimes called the multiplicity problem, the multiple-testing problem, or the look-elsewhere effect in physics) is the inflation of the cumulative false-positive rate when many independent or near-independent hypothesis tests are conducted simultaneously. The mechanism is elementary. If you run a single hypothesis test at the conventional significance threshold alpha = 0.05, you have committed to a 5% probability of declaring a “significant” effect when no effect exists. If you run two such tests on independent data, the probability that at least one of them produces a false positive is no longer 5% --- it is approximately 9.75%. If you run twenty tests, the probability is approximately 64%. If you run one hundred tests, the probability is approximately 99.4%, which is to say it is effectively guaranteed.

The arithmetic is the standard probability calculation for the union of independent events. If each individual test has a 5% chance of a Type I error (false positive) under the null hypothesis of no effect, then the probability that each individual test correctly fails to reject the null is 95%. The probability that all N tests correctly fail to reject the null is (0.95)^N. The probability that at least one test produces a false positive --- the family-wise error rate, or FWER --- is the complement: FWER = 1 - (0.95)^N for N independent tests at alpha = 0.05.

The substantive interpretation is that the conventional alpha = 0.05 threshold is a guarantee about the false-positive rate of a single pre-specified test, not a guarantee about the false-positive rate of a collection of tests. The guarantee does not survive aggregation. When you run a portfolio of tests and report the ones that came in significant, the significance threshold you are effectively applying is much weaker than the nominal 5% --- you are reporting the maximum of N noisy estimates and asking whether that maximum exceeds the per-test threshold. The maximum of N noisy estimates is systematically higher than any individual estimate, and the larger N is, the higher the maximum tends to be.

The mathematical framing was made rigorous by Bonferroni, C. E. (1936). Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 8, 3-62, the foundational Italian-language treatise on the statistical theory of classes. Bonferroni’s contribution that survived into modern multiple-testing methodology is the union-bound inequality: for any collection of events, the probability of their union is bounded above by the sum of their individual probabilities. Applied to hypothesis testing, this gives the Bonferroni correction --- divide the per-test significance threshold by the number of tests to control the family-wise error rate. The mathematical observation is older than industrial A/B testing by 70 years.
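Written out, the step from the union bound to the correction is short. The following is a sketch under the global null, where p_i (notation introduced here, not in the original) is the p-value of test i and each test is run at the per-test threshold alpha / N:

$$\text{FWER} = P\left(\bigcup_{i=1}^{N} \{p_i \le \alpha/N\}\right) \le \sum_{i=1}^{N} P\left(p_i \le \alpha/N\right) \le N \cdot \frac{\alpha}{N} = \alpha$$

No independence assumption enters the bound, which is why the correction holds for any dependence structure among the tests.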

The relevance to A/B testing is direct and underappreciated. Every modern experimentation program has at least one of the structural features that triggers the multiple-comparisons problem: multi-arm experiments testing more than two variants simultaneously, multi-metric analyses checking primary and secondary metrics for significance, multi-segment analyses looking for differential treatment effects across user subgroups, or sequential test portfolios running dozens of experiments per quarter. The FWER calculation applies to each of these structures, and the cumulative inflation across all of them is substantial.

The Math --- FWER For Independent Tests

The numerical examples are stark enough to be worth working through explicitly. For N independent tests at significance threshold alpha, the family-wise error rate under the global null hypothesis (no test has a real effect) is:

$$\text{FWER} = 1 - (1 - \alpha)^N$$

At alpha = 0.05, the numerical values are:

N = 1: FWER = 5.0%. The single-test baseline.
N = 2: FWER = 9.75%. Already nearly double.
N = 5: FWER = 22.6%. The probability of at least one false positive across five tests is more than one in five.
N = 10: FWER = 40.1%. Roughly two-in-five.
N = 20: FWER = 64.2%. With twenty independent tests, false-positive output is the more-likely-than-not outcome under the global null.
N = 50: FWER = 92.3%.
N = 100: FWER = 99.4%. Essentially guaranteed.
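
The values above can be reproduced in a few lines. This is a minimal sketch; the function name `fwer_independent` is illustrative, and the calculation assumes independent tests under the global null, exactly as in the formula.

```python
def fwer_independent(n_tests: int, alpha: float = 0.05) -> float:
    """Family-wise error rate for n independent tests under the global null."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    # Same test counts as the list in the text.
    for n in (1, 2, 5, 10, 20, 50, 100):
        print(f"N = {n:>3}: FWER = {fwer_independent(n):.1%}")
```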

The numbers map directly onto industry A/B testing practice. A multi-arm experiment with eight variants against a single control involves eight pairwise comparisons against the control (and potentially 36 pairwise comparisons among all nine arms). A multi-metric analysis checking ten metrics for significance involves ten hypothesis tests. A subgroup analysis looking at treatment effects across twenty user segments involves twenty hypothesis tests. The numbers compound when these structures are combined: a multi-arm experiment with five variants checked across five metrics in five segments involves 5 x 5 x 5 = 125 implicit comparisons, and the corresponding FWER at alpha = 0.05 is about 99.8% under the global null.

The “independent tests” assumption is mathematically convenient but not strictly required for the FWER inflation to occur. For positively correlated tests (which is the usual case in A/B testing --- multiple metrics on the same users are correlated; multiple subgroups on the same experiment are correlated), the FWER inflation is somewhat smaller than the independent-tests formula predicts but still substantial. For negatively correlated tests, the inflation can be larger. The independent-tests calculation is a useful baseline; the actual FWER for a specific portfolio depends on the correlation structure of the tests within it.
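
A small simulation illustrates the effect of positive correlation in the most common A/B setting, where every variant is compared with the same control. This is a sketch under stated assumptions, not a model of any particular platform: each arm's standardized sample mean is drawn as a standard normal (a large-sample approximation with equal sample sizes), all eight variants are truly identical to control, and the function name, seed, and simulation count are arbitrary choices made here.

```python
import random
from statistics import NormalDist


def simulate_shared_control_fwer(
    n_variants: int = 8,
    alpha: float = 0.05,
    n_sims: int = 20_000,
    seed: int = 0,
) -> float:
    """Estimate the FWER when every variant is tested against one shared control.

    Assumes the global null and unit-variance, normally distributed arm means.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    runs_with_false_positive = 0
    for _ in range(n_sims):
        control = rng.gauss(0.0, 1.0)
        for _ in range(n_variants):
            variant = rng.gauss(0.0, 1.0)
            # The difference of two unit-variance means has variance 2.
            z = (variant - control) / 2 ** 0.5
            if abs(z) > z_crit:
                runs_with_false_positive += 1
                break  # one false positive is enough for this run
    return runs_with_false_positive / n_sims


if __name__ == "__main__":
    independent = 1 - (1 - 0.05) ** 8
    shared = simulate_shared_control_fwer()
    print(f"independent-tests formula: {independent:.3f}")
    print(f"shared-control simulation: {shared:.3f}")
```

Because all comparisons share the control's noise, they are positively correlated, and the simulated FWER should come out below the value from the independent-tests formula while remaining far above the nominal 5%.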

The standard reference for the formal derivations and the dependence-structure corrections is Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B (Methodological), 57(1), 289-300. DOI: 10.1111/j.2517-6161.1995.tb02031.x, which articulates the distinction between controlling the FWER (the probability of any false positive) and controlling the false discovery rate or FDR (the expected proportion of false positives among rejected hypotheses). The distinction is consequential for industry A/B testing and is addressed in detail below.

The Three A/B Testing Contexts Where Multiple Comparisons Hits

The multiple comparisons problem manifests in three structurally distinct A/B testing patterns, each of which is widespread and each of which materially inflates the false-positive rate when uncorrected.

Multi-arm experiments. Testing more than two variants simultaneously (the eight button colors, the five copy variations, the four pricing displays) is one of the most common patterns in modern A/B testing. A multi-arm experiment with K variants against a single control involves K pairwise comparisons against control, plus potentially K(K+1)/2 pairwise comparisons among all K + 1 arms. At K = 4 variants plus control, that is 4 vs-control comparisons and up to 10 pairwise comparisons; at K = 8 plus control, it is 8 vs-control and up to 36 pairwise. Even restricting attention to the vs-control comparisons and treating them as independent, the FWER inflation is substantial: at K = 4, FWER under the global null is 18.5%; at K = 8, it is 33.7%; at K = 16, it is 56.0%. The eight-variant button experiment from the opening vignette has roughly a one-in-three chance of producing a false-positive winner before any real effects are considered.
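
The counts in this paragraph follow from treating the experiment as K variants plus one control, so K + 1 arms in total. A minimal sketch of that bookkeeping is below; the function name and the returned dictionary keys are illustrative, and the FWER uses the independent-tests baseline on the vs-control comparisons only.

```python
def multi_arm_comparisons(k_variants: int, alpha: float = 0.05) -> dict:
    """Count comparisons for k variants plus one control and the implied FWER."""
    arms = k_variants + 1                 # variants plus the control
    vs_control = k_variants               # one comparison per variant
    all_pairs = arms * (arms - 1) // 2    # every arm against every other arm
    fwer_vs_control = 1 - (1 - alpha) ** vs_control
    return {
        "vs_control": vs_control,
        "all_pairs": all_pairs,
        "fwer_vs_control": fwer_vs_control,
    }


if __name__ == "__main__":
    for k in (4, 8, 16):
        r = multi_arm_comparisons(k)
        print(
            f"K = {k:>2}: {r['vs_control']} vs-control, "
            f"{r['all_pairs']} pairwise, FWER = {r['fwer_vs_control']:.1%}"
        )
```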

Multi-metric analyses. The typical A/B testing dashboard reports the experiment’s effect on a primary conversion metric, plus a long list of secondary metrics --- revenue per user, session duration, pages per session, bounce rate, signup completion rate, email-open rate, time-to-first-action, and so on. A team that scans all reported metrics for significance is effectively conducting one hypothesis test per metric. At ten metrics with alpha = 0.05, the FWER under the global null for that single experiment is 40.1%. The selection of which metric to highlight in the post-experiment write-up --- “the test moved revenue per user by 3.2%, p = 0.04” --- is a multiple-comparisons selection that inflates the apparent significance of whichever metric happens to clear the bar.

The pattern is so widespread that Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366. DOI: 10.1177/0956797611417632 named it as part of the “researcher degrees of freedom” framework that they identified as a primary driver of the psychology replication crisis. The Simmons-Nelson-Simonsohn analysis demonstrated through computer simulation that even modest combinations of flexibility in outcome-measure selection, sample size, and analysis path can inflate the effective false-positive rate to over 60% within a single study. The A/B testing analog is direct: a team that checks ten metrics, stops the experiment when one of them clears significance, and writes up that metric as the primary outcome is operating under conditions where the actual false-positive rate dwarfs the nominal 5%.

Multi-segment analyses. Looking for differential treatment effects across user subgroups --- new users vs returning users, mobile vs desktop, North America vs Europe, free-tier vs paid-tier, by acquisition channel, by tenure cohort, and so on --- is a standard part of post-experiment analysis. The motivation is reasonable: the average treatment effect may mask important heterogeneity, and finding that a treatment works particularly well or particularly poorly for a specific segment is valuable product information. The methodological problem is that each subgroup analysis is an additional hypothesis test, and a team that examines treatment effects across twenty segments is running twenty additional hypothesis tests. The FWER under the global null of no segment-level effects is 64.2%, and the cherry-picked subgroup --- “the experiment worked great for mobile users in Europe!” --- is more often than not a chance pattern in the noise rather than a real heterogeneous treatment effect.

The pattern interacts dangerously with the natural tendency to look for explanations when an overall experiment is flat. A team whose primary metric shows no significant effect will frequently dig into subgroup analyses looking for “where the win was hiding.” The deeper the dive, the higher the probability of finding a spurious significant subgroup, and the resulting “the experiment worked for X” finding is then used as the basis for a targeted launch decision. The targeted launch is a launch decision made on a false-positive finding, and the production performance for the targeted subgroup will revert to no-effect once enough new data accumulates.

The compounding effect across all three structures is the part that most experimentation programs underappreciate. A program that runs a five-variant multi-arm experiment, checks the results across ten metrics, and slices each metric across ten segments has implicitly conducted 5 x 10 x 10 = 500 hypothesis tests on a single experiment. At alpha = 0.05, the expected number of false positives from pure chance is 25. The dashboard’s “significance” highlights are not signal; they are the predictable output of the testing volume.

The Programmatic Variant --- Why 100 Tests A Year Guarantees False Winners

The structurally subtler version of the multiple comparisons problem operates at the program level rather than within any individual experiment. An experimentation program that runs 100 hypothesis tests per year at alpha = 0.05, with no correction for multiplicity, will produce approximately 5 statistically significant findings per year from pure chance, even if none of the tested variants has any real effect. The math here is an expected-count calculation rather than the FWER formula: across 100 tests, the expected number of false-positive significant findings under the global null is 100 x 0.05 = 5, while the probability of at least one such finding is the 99.4% FWER computed earlier for N = 100.

The program-level inflation is more insidious than the within-experiment inflation because it is invisible to anyone looking at the individual test reports. Each individual test was correctly designed, properly powered, and correctly analyzed. Each individual significance call was correct relative to the per-test alpha threshold. The program-level claim, “we shipped 12 winning variants this year,” aggregates these individual results into a portfolio claim, and the portfolio claim is contaminated by the false-positive inflation across the testing volume. Of the 12 wins, roughly 5 are expected to be chance discoveries if almost all tested variants are true nulls (the expected count is 0.05 times the number of variants with no real effect), and somewhat fewer if a larger share of the tested variants have genuine effects.
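
The expected count of 5 is only the center of a distribution. As a rough illustration, the sketch below computes how likely a program is to see at least k chance winners in a year, assuming 100 independent tests that are all true nulls (a deliberately pessimistic simplification); the function name is illustrative.

```python
from math import comb


def prob_at_least_k_false_positives(
    k: int, n_tests: int = 100, alpha: float = 0.05
) -> float:
    """P(at least k false positives) when all n_tests are independent true nulls."""
    # Binomial probability of seeing fewer than k false positives.
    p_fewer = sum(
        comb(n_tests, j) * alpha ** j * (1 - alpha) ** (n_tests - j)
        for j in range(k)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    n_tests, alpha = 100, 0.05
    print(f"expected false positives: {n_tests * alpha:.1f}")
    for k in (1, 3, 5, 8):
        print(f"P(at least {k}) = {prob_at_least_k_false_positives(k):.3f}")
```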

The pattern is documented at industry scale in the experimentation-platform literature. Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. ISBN: 978-1108724265 discusses how Microsoft’s Experimentation Platform handles the programmatic multiple-comparisons problem through a combination of post-launch hold-out validation and explicit correction procedures for the most common multi-comparison scenarios. The Kohavi-Tang-Xu treatment is unambiguous that uncorrected portfolio-level reporting overstates the true cumulative impact of an experimentation program, and the magnitude of the overstatement is large enough to matter for executive decision-making.

The natural response to “5 of your 12 wins are chance” is that the magnitudes of real wins should still be larger than the magnitudes of chance wins, so the program’s aggregate impact should be dominated by the real effects. This argument has a kernel of truth and is also wrong in an important way. The kernel of truth is that real effects on the genuine winners will indeed often be larger than chance effects on the false winners, so the dollar-weighted contribution of real wins to the portfolio’s claimed impact will be larger than the dollar-weighted contribution of chance wins. The way it is wrong is that the chance wins, by virtue of having been selected for having the largest observed lifts (the multiple-comparisons mechanism), have systematically inflated headline numbers --- the Winner’s Curse, addressed separately in the winner’s curse article, compounds with the multiple-comparisons problem in exactly this way. The chance wins have inflated headline lifts not because they have real effects but because the selection rule favors variants whose noise realization was unusually large. The aggregate portfolio claim is contaminated by both the count inflation (more wins than there should be) and the magnitude inflation (the false wins look bigger than they should).

The Corrections --- Bonferroni, Benjamini-Hochberg, Holm

The methodological literature on correcting for multiple comparisons is mature, the procedures are computationally trivial, and the choice among them is well-understood. The three most-used corrections in industry practice are the Bonferroni correction, the Benjamini-Hochberg false discovery rate (FDR) procedure, and the Holm-Bonferroni step-down procedure.

Bonferroni correction. Divide the per-test significance threshold by the number of tests. If you are running 20 tests and want to maintain a family-wise error rate of 0.05, the per-test threshold becomes 0.05 / 20 = 0.0025. A test is declared significant only if its p-value is below 0.0025, not below 0.05. The procedure provably controls the FWER at the target level for any dependence structure among the tests. It is the most conservative of the widely used corrections: easy to implement and mathematically rigorous, but at a substantial cost in statistical power. The Bonferroni procedure is the right correction when (a) the cost of a single false positive is very high, (b) the tests are very few or near-independent, or (c) the analyst wants the simplest possible defensible procedure. The reference is the 1936 Bonferroni paper plus the formal modern restatement in any standard mathematical-statistics text.
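
The procedure fits in a few lines. This is a minimal sketch with an illustrative function name and hypothetical p-values, using the usual "at or below the threshold" convention.

```python
def bonferroni_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Flag each p-value that is at or below the Bonferroni threshold alpha / N."""
    n_tests = len(p_values)
    if n_tests == 0:
        return []
    threshold = alpha / n_tests
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Hypothetical family of 20 tests: three nominally significant, 17 clearly not.
    p_values = [0.001, 0.004, 0.03] + [0.5] * 17
    decisions = bonferroni_reject(p_values)
    print(f"per-test threshold: {0.05 / len(p_values):.4f}")
    print(f"nominally significant (p < 0.05): {sum(p < 0.05 for p in p_values)}")
    print(f"significant after Bonferroni:     {sum(decisions)}")
```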

Benjamini-Hochberg false discovery rate (BH-FDR). The BH procedure controls the expected proportion of false positives among rejected hypotheses, rather than the probability of any false positive. The procedural mechanics: sort the p-values from smallest to largest, find the largest rank k whose p-value is at or below (k / N) x alpha, where N is the total number of tests, and declare significant every test ranked k or lower (including any whose p-value exceeds its own rank threshold). The threshold for the smallest p-value is alpha / N (the Bonferroni threshold); the threshold for the largest is alpha (the unadjusted threshold); intermediate thresholds scale linearly with rank. At the same nominal level the procedure rejects at least as many hypotheses as Bonferroni, and its power advantage is largest when the proportion of true effects in the test set is non-trivial.
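
A compact sketch of the procedure follows. It uses the step-up form from the 1995 paper: find the largest rank whose p-value passes its threshold, then reject everything at or below that rank. The function name and the ten p-values are illustrative, not taken from any real experiment.

```python
def benjamini_hochberg_reject(p_values: list[float], fdr: float = 0.05) -> list[bool]:
    """Benjamini-Hochberg step-up decisions, returned in the original order."""
    n_tests = len(p_values)
    order = sorted(range(n_tests), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / N) * fdr.
    largest_passing_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / n_tests * fdr:
            largest_passing_rank = rank
    # Reject every hypothesis ranked at or below that rank.
    reject = [False] * n_tests
    for idx in order[:largest_passing_rank]:
        reject[idx] = True
    return reject


if __name__ == "__main__":
    # Hypothetical panel of ten metrics from one experiment.
    p_values = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.35, 0.5, 0.7, 0.9]
    for p, rejected in zip(p_values, benjamini_hochberg_reject(p_values)):
        print(f"p = {p:<5} FDR-significant: {rejected}")
```

In this made-up panel, the metrics at p = 0.03 and p = 0.04 are nominally significant but do not survive the correction, because their rank thresholds (0.020 and 0.025) are below their p-values and no later rank passes.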

The BH procedure is the modern standard for multi-metric and multi-segment A/B testing analyses where the analyst is willing to tolerate some false positives in exchange for more statistical power on the true effects. The 1995 Benjamini-Hochberg paper provides the formal proofs and the procedure description; subsequent extensions (BH under positive regression dependence, the Benjamini-Yekutieli procedure for arbitrary dependence) handle the dependence-structure complications that arise in practice. The Wikipedia entry on the false discovery rate provides an accessible operational summary; the original BH paper is the canonical reference.

Holm-Bonferroni step-down. The Holm procedure, from Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65-70, is a uniformly more powerful version of Bonferroni that controls the FWER at the same target level. The mechanics: sort the p-values from smallest to largest, then test the smallest p-value against alpha / N, the second smallest against alpha / (N-1), the third smallest against alpha / (N-2), and so on; stop at the first p-value that exceeds its threshold, and retain that hypothesis and every one after it. The Holm procedure rejects every hypothesis that Bonferroni rejects and sometimes more, for the same target FWER, with no loss of error-rate control. It is essentially a free lunch relative to Bonferroni and should be the default whenever FWER control (rather than FDR control) is the goal.
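
The step-down mechanics can be sketched as follows. The function name and the four p-values are illustrative; the example is chosen so that the difference from plain Bonferroni is visible.

```python
def holm_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Holm step-down decisions, returned in the original order."""
    n_tests = len(p_values)
    order = sorted(range(n_tests), key=lambda i: p_values[i])
    reject = [False] * n_tests
    for step, idx in enumerate(order):
        # Thresholds run alpha / N, alpha / (N - 1), ..., alpha / 1.
        if p_values[idx] <= alpha / (n_tests - step):
            reject[idx] = True
        else:
            break  # first failure: retain this and every larger p-value
    return reject


if __name__ == "__main__":
    # Hypothetical four-variant experiment, each variant compared with control.
    p_values = [0.011, 0.02, 0.012, 0.04]
    bonferroni = [p <= 0.05 / len(p_values) for p in p_values]
    holm = holm_reject(p_values)
    print(f"Bonferroni rejections: {sum(bonferroni)}")
    print(f"Holm rejections:       {sum(holm)}")
```

With these made-up values Bonferroni rejects only the two p-values below 0.0125, while Holm rejects all four, because each successive threshold is looser than the last.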

The choice among the three is governed by the structure of the inference problem. Bonferroni is the most conservative and is appropriate when even a single false positive is unacceptably costly --- regulatory filings, medical-device approval, safety-critical decisions. BH-FDR is the right choice for exploratory analyses where the cost of a single false positive is moderate and statistical power matters --- most multi-metric A/B testing analyses, most subgroup analyses, most program-level reporting. Holm is the right choice for confirmatory analyses where FWER control is required but Bonferroni’s conservatism is excessive --- many regulatory-grade industry analyses use Holm rather than Bonferroni for this reason.

When To Use Which Correction

The methodological tradeoffs map onto the practical question of how A/B testing teams should choose among the corrections.

For multi-arm experiments with K variants against control, the standard recommendation is Holm correction on the K vs-control comparisons. The number of tests is small, the analyst typically wants FWER control on the “did any variant beat control” claim, and Holm’s improved power over Bonferroni matters when K is modest. The Dunnett procedure (a multi-arm-specific FWER-control method that exploits the correlation induced by the shared control) is the purpose-built tool for this exact setting and is supported by most statistical packages, but Holm-Bonferroni usually produces similar results in practice and is simpler to implement.

For multi-metric analyses, BH-FDR is the appropriate correction. The number of metrics is typically large enough that Bonferroni’s conservatism wastes too much power, and the analyst is typically willing to tolerate a controlled proportion of false positives in exchange for detecting real effects on more metrics. The standard implementation is to apply BH at FDR = 0.05 across the full panel of metrics for each experiment, and to report which metrics are FDR-significant rather than which metrics are nominally p < 0.05. The change in which metrics are reported is sometimes dramatic --- a metric with p = 0.03 may not be FDR-significant if many other metrics have similar or larger p-values.

For multi-segment analyses, BH-FDR is again the appropriate correction. The number of subgroups is typically large, the goal is to identify which subgroups have differentially significant effects rather than to make a yes/no claim about any single subgroup, and the cost of a controlled-proportion of false positives is acceptable. A common pattern is to apply BH at FDR = 0.10 (rather than 0.05) for subgroup analyses, on the grounds that subgroup analysis is inherently exploratory and the higher tolerance is appropriate.

For program-level reporting across many experiments, the methodologically cleaner approach is the online FDR control framework developed by Yang, F., Ramdas, A., Jamieson, K. G., & Wainwright, M. J. (2017). A framework for multi-A(rmed)/B(andit) testing with online FDR control. Advances in Neural Information Processing Systems 30 (NIPS 2017). arXiv: 1706.05378 and refined by Tian, J., & Ramdas, A. (2019). ADDIS: An adaptive discarding algorithm for online FDR control with conservative nulls. Advances in Neural Information Processing Systems 32 (NeurIPS 2019). arXiv: 1905.11465. The online-FDR framework handles the sequential structure of an experimentation program --- tests arrive over time, decisions about significance need to be made at each test’s conclusion, and the analyst does not know in advance how many future tests will be conducted. The framework provides FDR-control guarantees that hold cumulatively across the program’s full history, not just within any single batch of tests. The implementation is more involved than the batch-mode BH procedure, but the resulting program-level reporting is genuinely calibrated for the cumulative multiple-comparisons load.
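
To give a feel for what "online" means here, the sketch below implements a much simpler rule than either paper above: LOND (Javanmard and Montanari), which is designed to control FDR for a stream of independent p-values. It is swapped in purely for illustration and is not the Yang et al. framework or ADDIS. The class name, the choice of the spending sequence proportional to 1/t^2, and the p-value stream are all assumptions made for this example.

```python
import math


class LondOnlineFdr:
    """LOND-style rule: test level = beta_t * (discoveries so far + 1)."""

    def __init__(self, fdr: float = 0.05):
        self.fdr = fdr
        self.tests_seen = 0
        self.discoveries = 0

    def next_level(self) -> float:
        """Significance level that the next test in the stream must meet."""
        t = self.tests_seen + 1
        # beta_t = fdr * 6 / (pi^2 * t^2); these sum to fdr over t = 1, 2, ...
        beta_t = self.fdr * 6.0 / (math.pi ** 2 * t ** 2)
        return beta_t * (self.discoveries + 1)

    def test(self, p_value: float) -> bool:
        """Record one finished experiment and return whether it is a discovery."""
        level = self.next_level()
        self.tests_seen += 1
        rejected = p_value <= level
        if rejected:
            self.discoveries += 1
        return rejected


if __name__ == "__main__":
    tracker = LondOnlineFdr(fdr=0.05)
    # Hypothetical p-values arriving one experiment at a time.
    for p in [0.0001, 0.2, 0.004, 0.03, 0.0005]:
        level = tracker.next_level()
        print(f"p = {p:<7} level = {level:.5f} discovery: {tracker.test(p)}")
```

The design point to notice is that each decision depends only on the tests seen so far: the level shrinks as the stream gets longer and loosens after each discovery, so no one needs to know in advance how many experiments the program will eventually run.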

The interaction with the peeking problem is worth noting. The Yang-Ramdas-Jamieson-Wainwright framework is built on top of always-valid sequential inference procedures that solve the peeking problem at the per-test level, and the Tian-Ramdas extension handles the case where many of the per-test p-values are conservative (which is typical when the always-valid inference machinery is applied uniformly across a portfolio). A program that wants to handle both the peeking problem and the multi-test problem with full mathematical rigor should adopt the online-FDR framework rather than trying to compose ad-hoc per-test sequential inference with ad-hoc batch-mode multiple-comparisons corrections.

Why Most Programs Don’t Apply Any Correction

The empirical observation that most industry A/B testing programs don’t apply any multiple-comparisons correction (not Bonferroni, not BH-FDR, not Holm, not the online-FDR framework) has three reinforcing explanations, none of which is a good defense.

Speed and operational pressure. The conventional alpha = 0.05 threshold is fast: the dashboard shows green or red, the team makes a ship decision, the next experiment starts. Applying any correction adds analytic complexity --- the team needs to know how many tests are in the family, the per-test thresholds change based on the testing volume, and the resulting “significant” determinations are stricter than the dashboard’s nominal threshold. Programs optimized for rapid iteration tend to view multiple-comparisons corrections as friction that slows down the ship cycle and reduces the perceived hit rate of the experimentation program. The cost of the friction is real; the cost of the false positives is invisible until production performance disappoints, at which point it is attributed to implementation issues or external factors rather than to the structural inflation of the apparent win rate.

Methodological gap between platform and analyst. Modern A/B testing platforms typically display per-test p-values and per-test confidence intervals, often with no automatic correction for the family of tests being conducted. The analyst is expected to apply corrections manually, but the platform’s dashboard makes the uncorrected p-values salient and easy to act on. The gap between what the platform reports and what should be reported is a methodological literacy gap that platform vendors have been slow to close. Some platforms (notably newer entrants like Statsig and Eppo) provide built-in FDR correction options; support among longer-established platforms is uneven (Optimizely’s Stats Engine applies false-discovery-rate control, but many other tools leave the correction as a downstream analyst responsibility that is frequently skipped). The Pinterest engineering team’s published account of their experimentation platform (the Apache Flink real-time experimentation analytics post) is unusually explicit about applying Bonferroni correction as part of their sequential-testing methodology; most platforms are not.

Statistical illiteracy at the leadership level. The decision-makers who consume A/B testing reports --- product managers, growth leaders, marketing executives --- typically have not been trained in the multiple-comparisons problem and do not ask whether the reported p-values are corrected for multiplicity. The pressure for correction has to come from analysts who understand the issue, and analysts who raise the topic are frequently told that the corrections are “too conservative” or “would slow us down” or “are an academic concern that doesn’t apply to industry.” The pushback is methodologically incorrect but operationally common, and it reflects the structural problem that the leadership audience for A/B testing reports does not have the background to evaluate the multiple-comparisons argument. The result is that the topic dies in the methodological-discussion phase and never propagates into operational practice.

The Simmons-Nelson-Simonsohn paper on researcher degrees of freedom was written for an academic-psychology audience and was a major contributor to that field’s reckoning with its replication problems. The industry-A/B-testing analog of the same paper has not yet had its impact, and the structural conditions --- speed pressure, platform-tooling gaps, leadership statistical illiteracy --- that prevented academic psychology from reforming its practice for decades are arguably more severe in industry. The reform requires either tooling changes (platforms that default to corrected p-values) or cultural changes (leadership audiences that ask the right questions about methodology) or both. Neither change is happening at the speed that would be required to meaningfully reduce the false-positive rate of the industry’s headline experimentation outputs in the near term.

What Industry Practitioners Are Doing

The industry experimentation literature provides several examples of platforms that have implemented multiple-comparisons corrections in production, and the patterns of which corrections are applied where are informative.

Microsoft’s Experimentation Platform is documented in the Kohavi-Tang-Xu book and in Deng, A., Lu, J., & Chen, S. (2016). Continuous monitoring of A/B tests without pain: Optional stopping in Bayesian testing. 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 243-252. A related Microsoft paper, Deng, A., Lu, J., & Litz, J. (2017). Trustworthy analysis of online A/B tests: Pitfalls, challenges and solutions. WSDM ’17: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 641-649, addresses the broader trustworthy-analysis question. The Microsoft approach is a multi-layered methodology: per-test sequential inference for the peeking problem, BH-FDR for multi-metric analyses, post-launch hold-out validation for portfolio-level claims, and explicit pre-registration of primary metrics to prevent cherry-picking. The platform exposes uncorrected p-values for analyst inspection, but the reporting layer that feeds into executive briefings uses the corrected analyses.

LinkedIn’s experimentation platform has been documented in multiple publications, including the broader treatment of FDR control in their internal methodology. The LinkedIn pattern emphasizes BH-FDR for secondary-metric analyses and uses explicit pre-registration of primary metrics to avoid the multi-metric problem on the primary-metric reporting. The LinkedIn team has also been active in the academic literature on online controlled experiments, with multiple co-authorships on the trustworthy-experimentation papers in the KDD and WWW proceedings.

Pinterest’s experimentation platform, as described in the Pinterest engineering blog post on their Apache Flink-based real-time experimentation analytics system, uses Bonferroni correction explicitly on the sequential-testing portion of their methodology. The Pinterest write-up is unusually candid about the methodological choices: they considered Gambler’s Ruin, Bayesian A/B testing, and alpha-spending methods, and ultimately chose t-test plus Bonferroni correction with a pre-determined number of tests for their initial implementation. The Pinterest choice is the conservative end of the spectrum --- Bonferroni rather than BH-FDR --- and reflects their preference for strict FWER control over the higher power that BH would provide.

Facebook’s experimentation infrastructure has been documented through the Coey-Cunningham experiment-splitting work (more directly relevant to the Winner’s Curse than to multiple comparisons per se, but the underlying infrastructure handles both), and through other Facebook Data Science publications. The Facebook approach uses experiment splitting for high-stakes shipping decisions, BH-FDR for multi-metric analyses, and a sophisticated platform-level reporting layer that adjusts for selection effects across the portfolio. Facebook’s scale of experimentation (thousands of concurrent experiments) makes the multiple-comparisons problem particularly acute, and their platform is correspondingly more elaborate than the typical industry implementation.

Airbnb’s Experiment Reporting Framework (ERF), documented in the Lee-Shen 2018 paper and related work, uses empirical Bayes shrinkage for portfolio-level reporting and explicit FDR control for multi-metric analyses. The Airbnb approach is notable for treating the portfolio-level overstatement as a first-class concern in the reporting layer rather than an analyst-side correction.

The convergence across these platforms is that the major experimentation programs at the largest tech companies all apply some form of multiple-comparisons correction, typically through a combination of BH-FDR for batch analyses and online-FDR or empirical Bayes methods for program-level reporting. The smaller programs --- mid-market SaaS companies, e-commerce sites running 10 to 50 experiments per year, marketing teams running multi-arm experiments without dedicated data-science support --- typically apply no correction at all. The methodological gap between the leading platforms and the typical industry program is substantial, and the typical program’s headline outputs are correspondingly less reliable than the published industry benchmarks would suggest.

What This Means For Your Experimentation Program

The operational implications for CRO, growth, and product leaders running experimentation programs follow directly from the math.

Calibrate program-level expectations for the FWER inflation. A program that runs 100 hypothesis tests per year at alpha = 0.05 with no correction will produce approximately 5 chance-driven significant findings per year under the global null (100 x 0.05). Under a non-null portfolio, roughly 5% of the tests with no real effect will still come up significant, on top of the genuine wins. If your program reports 12 wins per year, expect that 3 to 5 of those wins are likely chance discoveries rather than real effects. The implication is not that the program is failing --- it is that the headline win count is an overestimate of the true win rate, and the program’s executive reporting should reflect that calibration explicitly.
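The arithmetic behind this calibration can be sketched in a few lines of Python. The function names and the 90% share of true-null tests are illustrative assumptions rather than figures from any particular program, and the FWER formula assumes the tests are independent.

```python
def fwer(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across n independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_false_positives(n_tests: int, alpha: float = 0.05,
                             null_share: float = 1.0) -> float:
    """Expected number of chance-significant results.

    null_share is the assumed fraction of tests with no real effect;
    1.0 corresponds to the global null.
    """
    return n_tests * null_share * alpha


# Global null: every one of the 100 tests has no real effect.
chance_wins_global_null = expected_false_positives(100)

# Hypothetical non-null portfolio: 90 of the 100 tests have no real effect.
chance_wins_mixed = expected_false_positives(100, null_share=0.9)

# Chance that at least one of 100 independent null tests looks significant.
program_fwer = fwer(100)
```

Under the assumed 90% null share, the expected number of chance wins is 4.5, which is where an estimate of 3 to 5 chance discoveries among a dozen reported wins comes from.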

Apply BH-FDR to multi-metric analyses on every experiment. The lowest-cost intervention is to add BH-FDR correction to the standard analysis workflow for multi-metric experiments. The implementation is a few lines of Python or R applied to the panel of secondary-metric p-values; most statistical packages include the procedure as a built-in function. The resulting “FDR-significant” determinations replace the nominal p < 0.05 determinations in the post-experiment writeup. Some metrics that previously looked significant will no longer clear the bar; the metrics that do clear are more credible and the analysis is more honest.
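As a minimal sketch of what those few lines look like, the function below implements the Benjamini-Hochberg step-up procedure in plain Python. The metric names and p-values are hypothetical and exist only to show the mechanics; the procedure's FDR guarantee is usually stated for independent (or positively dependent) p-values.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], fdr: float = 0.05) -> List[bool]:
    """Return one reject flag per p-value, in the original order."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject


# Hypothetical secondary-metric p-values, for illustration only.
secondary = {
    "add_to_cart": 0.001,
    "avg_order_value": 0.012,
    "bounce_rate": 0.030,
    "pages_per_session": 0.041,
    "time_on_site": 0.200,
    "return_visits": 0.600,
}
flags = benjamini_hochberg(list(secondary.values()), fdr=0.05)
fdr_significant = [name for name, keep in zip(secondary, flags) if keep]
```

In this made-up panel, four metrics are nominally below 0.05 but only the two smallest survive at FDR = 0.05. Calling the same function with fdr=0.10 gives the more tolerant setting used for exploratory work. Libraries such as statsmodels ship an equivalent routine (multipletests with method="fdr_bh"), which is likely the better choice in production.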

Apply Holm or Dunnett correction to multi-arm experiments. For experiments that test K variants against a single control, use Holm-Bonferroni on the K vs-control comparisons (or use Dunnett’s procedure if your statistical package supports it). The correction is computationally trivial and produces FWER-controlled significance determinations on multi-arm tests. The eight-button-color experiment from the opening vignette should have been analyzed under Holm correction, which would have tightened the threshold for the smallest p-value to 0.05 / 8 = 0.00625 and would have correctly classified variant E (p = 0.04) as not significant. The team would have moved on to another experiment rather than shipping a non-effect.
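The Holm step-down rule can be sketched as follows. Only the p-values 0.04 (variant E) and 0.12 come from the opening vignette; the variant labels other than E and the remaining six p-values are hypothetical placeholders chosen to match "flat or slightly negative" results.

```python
from typing import List, Sequence


def holm_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Holm-Bonferroni step-down: one reject flag per p-value, original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        # The (step + 1)-th smallest p-value is compared with alpha / (m - step).
        if p_values[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break  # the first failure stops all further rejections
    return reject


# Eight variants vs. one control. E = 0.04 and the 0.12 value are from the
# vignette; every other number and label here is a hypothetical placeholder.
p_by_variant = {
    "A": 0.48, "B": 0.12, "C": 0.71, "D": 0.55,
    "E": 0.04, "F": 0.83, "G": 0.36, "H": 0.92,
}
flags = holm_reject(list(p_by_variant.values()))
winners = [variant for variant, keep in zip(p_by_variant, flags) if keep]

# Threshold faced by the smallest p-value: alpha divided by the number of comparisons.
first_threshold = 0.05 / len(p_by_variant)
```

Because the smallest p-value (0.04) already fails the first threshold, the procedure stops immediately and no variant is declared a winner, whatever the placeholder values are.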

Apply BH-FDR at FDR = 0.10 to subgroup analyses. Subgroup analyses are inherently exploratory, so the higher FDR tolerance is appropriate. The implementation is the same as the multi-metric case: collect the per-subgroup p-values, apply BH at the higher FDR level, and report which subgroups are FDR-significant. The subgroups that clear the corrected bar are credible candidates for further investigation or targeted launches; the subgroups that don’t clear are not credible evidence of differential treatment effects.

Implement program-level online FDR control for the long-run portfolio. This is the highest-cost intervention but the methodologically cleanest. The implementation requires committing to an online-FDR framework (Yang-Ramdas-Jamieson-Wainwright or Tian-Ramdas as the academic reference, with the implementation typically requiring a few weeks of data-science work to integrate into the platform’s reporting layer). The resulting program-level reports are calibrated for the cumulative multiple-comparisons load across the program’s full history, and the headline win count is genuinely a controlled-FDR claim rather than an FWER-inflated count.
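To give a feel for what "online" means here, the sketch below implements LOND (Javanmard and Montanari), one of the simpler online-FDR rules. It is not the Yang et al. or ADDIS procedure named above; it is swapped in only because it fits in a few lines. The class name, the choice of gamma sequence, and the p-value stream are illustrative assumptions, and LOND's FDR guarantee is stated for independent p-values.

```python
import math


class LondOnlineFDR:
    """Sketch of the LOND rule: test t runs at level
    alpha * gamma_t * (discoveries so far + 1), where the gamma_t sum to 1."""

    def __init__(self, alpha: float = 0.05) -> None:
        self.alpha = alpha
        self.t = 0            # number of tests seen so far
        self.discoveries = 0  # number of rejections so far

    def next_level(self) -> float:
        # gamma_t = 6 / (pi^2 * t^2) is one convenient sequence that sums to 1.
        t = self.t + 1
        gamma_t = 6.0 / (math.pi ** 2 * t ** 2)
        return self.alpha * gamma_t * (self.discoveries + 1)

    def test(self, p_value: float) -> bool:
        """Decide on the next experiment in the stream and update state."""
        level = self.next_level()
        self.t += 1
        rejected = p_value <= level
        if rejected:
            self.discoveries += 1
        return rejected


# Hypothetical stream of primary-metric p-values, in launch order.
stream = [0.0004, 0.03, 0.20, 0.001, 0.04, 0.60]
procedure = LondOnlineFDR(alpha=0.05)
decisions = [procedure.test(p) for p in stream]
```

Each decision is made when the experiment finishes, without waiting for a batch, and earlier discoveries loosen later thresholds. The 1/t^2 sequence used here shrinks quickly, so a real deployment would probably pick a slower-decaying sequence or one of the more powerful procedures cited above.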

Educate stakeholders on why the corrected numbers are smaller. The leadership audience for experimentation reports will notice that “FDR-significant” wins are a smaller set than “nominal p < 0.05” wins, and the natural reaction is to view the corrected numbers as the program underperforming. The framing that works is the calibration framing: the corrected numbers are honest measurements of the program’s true output, and the uncorrected numbers were systematically overstated by the multiple-comparisons mechanism. Over time, leadership stakeholders develop intuition for the gap and come to trust the corrected numbers as more reliable. The transition is uncomfortable for the first quarter or two of corrected reporting and then becomes the new normal.

Treat “test everything” programs as a methodological hazard rather than as feature-rich experimentation. The program structure of “let’s test 20 variants of every change and see what wins” sounds like rigorous experimentation and is in fact a guarantee of false-positive output under the global null. The “test everything” pattern combines multi-arm experimentation (FWER inflation across variants), multi-metric analysis (FWER inflation across metrics), and high test volume (program-level FWER inflation) into a configuration that maximizes the multiple-comparisons load. Programs structured this way produce non-replicating winners at a high rate, and the appropriate methodological response is to reduce the testing volume and tighten the per-test alpha thresholds rather than to add more variants and more metrics. The discipline is uncomfortable for product teams that like to iterate fast but produces a meaningfully higher fraction of replicating winners over time.

The bottom-line operational summary: the multiple-comparisons problem is the most arithmetically rigorous of the systematic failure modes in A/B testing, and the corrections are the most computationally trivial of any of the corresponding fixes. A program that adopts BH-FDR on multi-metric analyses, Holm correction on multi-arm experiments, BH-FDR at FDR = 0.10 on subgroup analyses, and online-FDR control on the program-level portfolio will have a substantially lower false-positive rate than a program that applies no correction. The implementation cost is modest. The methodological literature is mature. The reasons most programs don’t do this are operational and cultural, not technical. The fix is available; it just isn’t being applied.

Sources

Bonferroni, C. E. (1936). Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 8, 3-62.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65-70.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B (Methodological), 57(1), 289-300. DOI: 10.1111/j.2517-6161.1995.tb02031.x
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366. DOI: 10.1177/0956797611417632
Deng, A., Lu, J., & Chen, S. (2016). Continuous monitoring of A/B tests without pain: Optional stopping in Bayesian testing. 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 243-252. DOI: 10.1109/DSAA.2016.33
Deng, A., Lu, J., & Litz, J. (2017). Trustworthy analysis of online A/B tests: Pitfalls, challenges and solutions. WSDM ‘17: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 641-649. DOI: 10.1145/3018661.3018677
Yang, F., Ramdas, A., Jamieson, K. G., & Wainwright, M. J. (2017). A framework for multi-A(rmed)/B(andit) testing with online FDR control. Advances in Neural Information Processing Systems 30 (NIPS 2017). arXiv:1706.05378
Tian, J., & Ramdas, A. (2019). ADDIS: An adaptive discarding algorithm for online FDR control with conservative nulls. Advances in Neural Information Processing Systems 32 (NeurIPS 2019). arXiv:1905.11465
Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. ISBN: 978-1108724265.
Pinterest Engineering. (2018). Real-time experiment analytics at Pinterest using Apache Flink. Pinterest Engineering Blog. Medium URL

Related

Browse the full Replication Crisis Hub for related material on evidence evaluation and statistical inference:

The Peeking Problem in A/B Testing --- a separate, also-pervasive A/B testing failure mode that compounds with the multiple-comparisons problem
Sample Ratio Mismatch in A/B Testing --- the platform-level integrity check that detects when randomization itself has failed
Simpson’s Paradox in A/B Testing --- the subgroup-aggregation problem that compounds with multi-segment analysis
The Winner’s Curse in A/B Testing --- the selection-on-maximum bias that compounds with multi-arm experimentation
Daryl Bem’s Precognition Studies --- the canonical academic illustration of selection-on-significance producing impossible findings
Confirmation Bias --- the cognitive driver of why teams over-interpret cherry-picked subgroups and secondary-metric wins

FAQ

Should I apply Bonferroni or BH-FDR to my multi-metric analyses?

BH-FDR for almost all industry A/B testing contexts. The Bonferroni correction is more conservative and is appropriate for confirmatory analyses where a single false positive is unacceptably costly --- regulatory filings, safety-critical decisions, public-health claims. For exploratory analyses where the cost of a controlled proportion of false positives is acceptable, BH-FDR provides substantially higher statistical power for the same level of overall false-positive control. The standard industry default is BH at FDR = 0.05 for multi-metric analyses; some programs use FDR = 0.10 for highly exploratory subgroup analyses. The Holm-Bonferroni procedure is a uniformly more powerful alternative to Bonferroni for FWER control and should be preferred over plain Bonferroni in any case where FWER control is the goal.

What about Bayesian A/B testing methods? Do they solve the multiple-comparisons problem?

Bayesian methods do not automatically solve the multiple-comparisons problem in the way that is sometimes claimed. A Bayesian A/B test produces a posterior distribution over the true effect for each individual variant, and the posterior is a coherent quantity that does not depend on the number of other tests being conducted in the program. But if the analyst then selects the variant with the highest posterior mean lift, or reports the variants for which the posterior probability of a positive effect exceeds some threshold, the selection rule reintroduces the multiple-comparisons problem at the decision layer. The Bayesian framework provides cleaner mathematical infrastructure for handling the corrections (the posterior-probability-of-meaningful-effect threshold can be calibrated for FDR control), but the corrections themselves are still necessary. Bayesian A/B testing without explicit multiple-comparisons handling at the decision layer has the same false-positive inflation as frequentist A/B testing without correction.
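A small simulation illustrates the decision-layer problem. It assumes a flat prior and a normal likelihood with known standard error, under which the posterior probability of a positive effect for an observed z-score is Phi(z). The 0.95 threshold, the function name, the simulation size, and the independence of the variants are all illustrative assumptions.

```python
import random
from statistics import NormalDist


def share_with_false_winner(n_variants: int, threshold: float = 0.95,
                            n_sims: int = 20_000, seed: int = 7) -> float:
    """Fraction of simulated all-null experiments in which at least one
    variant has posterior P(effect > 0) above the threshold."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    hits = 0
    for _ in range(n_sims):
        # Under the null, each variant's z-score is pure noise.
        z_scores = [rng.gauss(0.0, 1.0) for _ in range(n_variants)]
        # Flat prior + normal likelihood: posterior P(effect > 0) = Phi(z).
        posteriors = [std_normal.cdf(z) for z in z_scores]
        if max(posteriors) > threshold:
            hits += 1
    return hits / n_sims


one_variant = share_with_false_winner(1)
eight_variants = share_with_false_winner(8)
```

With a single variant, the share of null experiments that cross the threshold should sit near 5%. With eight independent variants it should land near 1 - 0.95^8, or about 34%, which is the same inflation as in the frequentist FWER calculation.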

Does the multiple-comparisons problem apply to A/B tests with only one variant?

Within a single two-arm experiment, no --- the single hypothesis test has exactly the nominal alpha = 0.05 false-positive rate, with no inflation. The multiple-comparisons problem appears in two-arm experiments when the analyst is also checking multiple secondary metrics for significance (multi-metric inflation), looking at treatment effects across multiple user subgroups (multi-segment inflation), or aggregating the results into a program-level report across many experiments (program-level inflation). A two-arm experiment that is analyzed on a single pre-specified primary metric, with no subgroup analysis and no aggregation into a broader program report, has no multiple-comparisons problem. As soon as you check more than one metric, slice by more than one segment, or run more than one experiment, the inflation begins.

What about secondary metrics? Should I just not report them?

You should report them, but with appropriate FDR correction applied across the full panel of secondary metrics, and with clear visual or annotational distinction between primary-metric results (which are the pre-registered confirmatory analysis) and secondary-metric results (which are exploratory and subject to FDR correction). The standard practice in well-run experimentation programs is to pre-specify the primary metric for each experiment, apply standard alpha = 0.05 inference to the primary metric, apply BH-FDR to the secondary metrics, and report which secondary metrics are FDR-significant. The secondary metrics that clear the corrected bar are credible candidates for further investigation; the metrics that don’t clear are reported as “no significant effect” rather than being cherry-picked as the headline result of the experiment.

My experimentation program runs only 20 to 30 tests per year. Do I really need multiple-comparisons corrections?

At 20 to 30 tests per year, the program-level FWER under the global null is somewhere between 64% and 79%, which is to say it is more likely than not that at least one test per year will produce a false-positive significant result purely by chance. The expected number of chance wins per year is 1 to 1.5 (calculated as N x alpha). If your program reports 5 to 8 wins per year, expect roughly one of those to be a chance discovery rather than a real effect. Whether that level of contamination matters depends on the stakes of the decisions you are making based on the test results. For low-stakes decisions where each shipped variant is easily rolled back if production performance disappoints, the contamination is operationally tolerable. For high-stakes decisions --- pricing changes, major feature launches, marketing-budget reallocations based on attribution analyses --- the one-false-winner-per-year level is meaningful and warrants correction. The Holm correction on individual experiments and the BH-FDR correction on multi-metric analyses are essentially free to apply and should be the default regardless of program size.

How does the multiple-comparisons problem interact with the Winner’s Curse?

The two problems compound multiplicatively. The multiple-comparisons problem inflates the count of false-positive winners --- variants that aren’t really winners get labeled as winners because the testing volume guarantees some chance significance. The Winner’s Curse inflates the magnitude of the apparent lift on actually-winning variants --- the selection rule “ship the one with the highest observed lift” systematically biases the estimated lift upward. A program with both problems has more wins than there should be (false-positive count inflation) and each of those wins has a headline lift that is bigger than it should be (Winner’s Curse magnitude inflation). Both problems require separate corrections --- BH-FDR or online-FDR for the multiple-comparisons problem, empirical Bayes shrinkage for the Winner’s Curse --- and a serious experimentation program needs both. The compounding effect is large enough that the combined overstatement of program-level value can be 2x to 4x the true value under realistic noise-to-signal ratios.

Why do most commercial A/B testing platforms not apply multiple-comparisons corrections by default?

The historical answer is that the commercial platforms were built primarily for marketing and product teams who wanted simple yes/no significance determinations on individual experiments, and the platform vendors prioritized ease of use over methodological rigor. The newer entrants in the experimentation-platform space (Statsig, Eppo, Optimizely’s newer offerings) have begun to include FDR correction as built-in options, often defaulted on for multi-metric analyses. The legacy platforms (older Optimizely configurations, VWO, AB Tasty, Convert) typically present uncorrected per-test p-values and leave the correction as a downstream analyst responsibility. If you are using a commercial platform, ask your vendor explicitly: “What multiple-comparisons corrections are applied to multi-metric and multi-arm experiments? Is FDR control applied at the program level across the testing history?” The answers will tell you how much correction you need to layer on top of the platform’s reporting.

Is the 5% nominal alpha threshold itself the right number for A/B testing?

This is a separate question from multiple-comparisons correction, but it bears mentioning. The conventional alpha = 0.05 threshold originated in early-twentieth-century agricultural statistics (R. A. Fisher’s work in the 1920s and 1930s) and was never intended as a universal standard for all hypothesis testing contexts. For A/B testing, where the cost of false positives is often modest (a shipped variant that doesn’t actually move the metric is easily rolled back) and the cost of false negatives is meaningful (missed opportunities for genuine improvements), some methodologists argue that alpha = 0.10 is a more appropriate threshold for individual A/B tests. The argument is contested. What is not contested is that whichever per-test threshold is used, the cumulative false-positive rate across multiple tests inflates exactly as the FWER formula predicts, and multiple-comparisons corrections are required to maintain control at the family level. The choice of per-test alpha does not eliminate the need for corrections; it just changes the per-test threshold to which the corrections are applied.


Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified in behavioral economics. Led 100+ in-house experiments at NRG in 2025, with project evidence and limits documented in the case studies.
