You're running 15 active experiments. You've set each one to 95% statistical confidence. You're shipping winners and killing losers with discipline. Your testing program looks mature.

Here's the problem: you're almost certainly making more wrong decisions than you think.

The math of multiple testing means that running many experiments simultaneously — even with rigorous individual confidence thresholds — produces a program-level false positive rate that's far higher than 5%. Understanding this is the difference between a testing program that actually improves your product and one that ships noise.

The Multiple Comparisons Problem, Explained Without the Jargon

When you run a single test at 95% confidence on a change that truly does nothing, there's a 5% chance it shows up as significant anyway: a false positive. That's roughly the same luck as a fair coin coming up heads four or five times in a row.

Now run 20 simultaneous tests. Each one has a 5% false positive rate. The probability that at least one of them produces a false positive by chance alone is:

1 - (0.95)^20 = 64%

At 50 simultaneous tests, it's 92%. At 100 tests, it's 99.4%.

In other words: if you're running 50 experiments a month at 95% confidence, and there are truly no real effects (everything is genuinely null), you'd expect to ship roughly 2-3 "winning" changes that do absolutely nothing. In practice, most experiments have some true effects, so the number is lower — but the direction is the same. High-volume testing programs are making more false discoveries than their confidence thresholds suggest.
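The arithmetic above is easy to sanity-check. A minimal sketch, assuming (purely for illustration) that every test is a true null:

```python
# Program-level false positive math for n simultaneous tests,
# assuming every test is a true null (worst case, for illustration).
alpha = 0.05  # per-test false positive rate at 95% confidence

for n_tests in (20, 50, 100):
    # Probability that at least one test is a spurious "winner"
    p_any_false_positive = 1 - (1 - alpha) ** n_tests
    # Expected number of spurious winners across the whole batch
    expected_false_positives = n_tests * alpha
    print(f"{n_tests} tests: P(any FP) = {p_any_false_positive:.1%}, "
          f"expected FPs = {expected_false_positives:.1f}")
```

At 20 tests this reproduces the 64% figure above; at 50 tests the expected count is 2.5, which is where the "roughly 2-3 winning changes that do nothing" estimate comes from.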

**Pro Tip:** The false positive problem is asymmetric. False positives in testing are often invisible — the "winning" change gets shipped and performance looks stable (because the true effect was zero, just with noise), so no one goes back to check. False negatives (missing real wins) are also invisible. The only way to audit your program's accuracy is to run A/A tests and replication experiments.

A Worked Example: 10 Simultaneous Tests

Your team runs 10 experiments in Q1, all at 95% confidence. Let's say 7 of those tests have genuine, detectable effects and 3 are testing changes that truly do nothing (or changes so small they're beneath your MDE).

Expected outcome:

  • Of the 7 true positives: with 80% statistical power, you'll detect 5-6 of them correctly. You'll miss 1-2 (false negatives).
  • Of the 3 true nulls: at 95% confidence, each has a 5% chance of showing as significant. Expected false positives: 0.15.

So far, so good — 0.15 false positives out of 10 tests isn't bad. But here's the catch: you don't know which of your 10 tests are the 7 true effects and which are the 3 true nulls. You're making decisions under uncertainty across the whole portfolio.

Now scale to 100 tests per quarter. If 30% are testing truly null effects (which is generous — many teams test hunches with no strong prior), you have 30 null tests. Expected false positives: 1.5. Over a year at this pace, you're shipping 6 changes per year that actively do nothing — wasting engineering bandwidth, cluttering your codebase, and polluting future experiments that interact with those changes.
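The expected outcomes in this worked example can be reproduced with a small helper. The 80% power figure comes from the example above; the function name and everything else here are illustrative assumptions:

```python
# Expected portfolio outcomes given assumed power and share of true effects.
def portfolio_expectations(n_tests, true_effect_share, power=0.80, alpha=0.05):
    """Expected detections, misses, and false positives across a test portfolio."""
    n_true = n_tests * true_effect_share
    n_null = n_tests - n_true
    return {
        "true_wins_detected": n_true * power,       # true positives
        "real_wins_missed": n_true * (1 - power),   # false negatives
        "false_positives": n_null * alpha,          # spurious winners
    }

# The 10-test Q1 example: 7 true effects, 3 true nulls
print(portfolio_expectations(10, 0.7))
# The 100-tests-per-quarter scenario: 70 true effects, 30 true nulls
print(portfolio_expectations(100, 0.7))
```

The first call reproduces the 5.6 expected detections and 0.15 expected false positives from the example; the second yields the 1.5 false positives per quarter behind the "6 per year" figure.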

How Optimizely Handles This: FDR Control

Optimizely uses a False Discovery Rate (FDR) control methodology rather than the traditional family-wise error rate (FWER) approach used in academic statistics.

Here's the difference:

  • FWER (Bonferroni correction): Controls the probability of making any false positive across all tests. This is very conservative. If you're running 20 tests, Bonferroni divides your alpha (0.05) by 20, requiring each test to reach 99.75% confidence to be considered significant. Almost nothing will ever win.
  • FDR (Benjamini-Hochberg): Controls the proportion of discoveries that are false. At a 5% FDR, you accept that up to 5% of your "significant" results might be false positives — but you're not crippling your entire program to prevent them.
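The Benjamini-Hochberg step-up rule itself is simple enough to sketch. This is a generic illustration of the procedure, not Optimizely's implementation, and the p-values are made up:

```python
# A minimal Benjamini-Hochberg step-up procedure (illustrative, not
# Optimizely's Stats Engine; p-values below are hypothetical).
def benjamini_hochberg(p_values, q=0.05):
    """Return indices of hypotheses declared significant at FDR level q."""
    m = len(p_values)
    # Sort p-values ascending, remembering original positions
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * q
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= (rank / m) * q:
            k_max = rank
    # The k_max smallest p-values are the discoveries
    return sorted(order[:k_max])

# Ten metrics from one experiment (hypothetical p-values)
p = [0.001, 0.008, 0.012, 0.041, 0.049, 0.20, 0.34, 0.52, 0.74, 0.91]
print(benjamini_hochberg(p, q=0.05))  # → [0, 1, 2]
```

Note what happens here: a naive 0.05 threshold would call five of these ten metrics significant, but BH admits only the first three, because the two borderline p-values (0.041 and 0.049) don't clear their rank-adjusted thresholds.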

Optimizely's Stats Engine applies FDR control across metrics within a single experiment when you're testing multiple metrics simultaneously. So if you have one experiment with 10 metrics, Optimizely will adjust significance thresholds to ensure no more than 5% of your "winning" metrics are false positives.

What Optimizely does not automatically control is the false discovery rate across experiments running simultaneously. Each experiment is evaluated independently. The cross-experiment multiple comparisons problem is yours to manage.

**Pro Tip:** Optimizely's FDR control is most powerful for protecting you against metric fishing within a single experiment — the temptation to keep adding metrics until something reaches significance. For program-level FDR control across many experiments, you need process and prioritization, not just platform settings.

What the FDR Setting in Optimizely Actually Changes

When you adjust Optimizely's significance threshold or FDR settings, you're changing the trade-off between false positives and false negatives:

  • Lower FDR (more conservative, e.g., 1%): Fewer false positives, more false negatives. You'll miss more real effects but ship fewer duds.
  • Higher FDR (less conservative, e.g., 10%): More false positives, fewer false negatives. You'll ship some things that don't work but you'll also catch more real effects.

The right setting depends on your program's goals. Early-stage teams optimizing for learning velocity should lean toward higher FDR — missing fewer real insights matters more than the occasional false positive. Mature teams making high-stakes decisions (pricing, core checkout flow) should lean toward lower FDR.

The Experiment Portfolio Problem

The deeper issue isn't any single experiment's significance threshold — it's how your testing portfolio compounds the risk.

Consider a team running 4 experiments simultaneously, each testing a different part of the purchase funnel. Even if each experiment is methodologically clean, there's interaction risk: Experiment A changes the product page, Experiment B changes the cart, Experiment C changes checkout, Experiment D changes the confirmation page. A visitor who goes through all four is counted in all four experiments. Their behavior in one experiment influences their behavior in subsequent steps.

This creates correlated errors. The statistical model underlying each experiment assumes independence. When experiments interact, as they reliably do when they share a funnel, the independence assumption breaks down.

**Pro Tip:** Use Optimizely's mutual exclusion groups to prevent visitors from being in conflicting experiments simultaneously. This reduces your sample sizes but eliminates the interaction confound. For funnel experiments especially, mutual exclusion is worth the traffic cost.

When to Be More Conservative

Not all tests carry equal stakes. A useful mental model: calibrate your confidence threshold to the reversibility and magnitude of the decision.

Use higher confidence (99%+, be strict) when:

  • Testing pricing changes — false positives here mean shipping a price that actively hurts revenue
  • Making major UX architecture changes that are hard to revert (navigation restructures, checkout flow redesigns)
  • Running tests that will inform multi-million-dollar roadmap investments
  • Testing in segments with low traffic where each visitor is high value (enterprise-only tests)

Standard 95% confidence is appropriate when:

  • Testing component-level changes (CTA copy, image variants, form labels)
  • Running tests where the downside is minimal (you can always revert a button color)
  • Testing hypotheses with strong prior evidence from previous experiments

Lower confidence (85-90%) may be acceptable when:

  • Running exploratory tests to generate hypotheses, not to make shipping decisions
  • The cost of a false negative (missing a real win) exceeds the cost of a false positive
  • You're running A/B/n tests where you have a built-in replication within the experiment

Guardrail Metrics as False Positive Defense

The most practical defense against false positives in a high-volume testing program isn't statistical — it's behavioral. Adding guardrail metrics forces you to check whether a "winning" primary metric comes with a cost.

A classic example: an e-commerce team tests a simplified checkout form that removes the gift message field. Primary metric (checkout completion rate) goes up 8%. Win! Ship it. Three weeks later, average order value drops 12% because gift buyers, who spend more, are converting less. The checkout-rate lift wasn't a genuine UX improvement; it was a traffic mix shift away from high-value buyers, and the primary metric alone couldn't tell the difference.

Guardrail metrics catch this. If your experiment shows revenue per visitor down 5% alongside the checkout completion lift, you have a signal that something is off before you ship.

**Pro Tip:** Set 3-5 guardrail metrics for every experiment. These should include: your primary revenue metric (if you're not already testing it), your primary conversion metric (if you're testing a downstream metric), and any metric that represents a different user segment's behavior. A win that degrades guardrails isn't a win.

Program-Level Implications: Structuring for Fewer Spurious Wins

High-volume testing teams can't solve the multiple comparisons problem with statistics alone. Here's how to structure your program to minimize spurious wins:

  1. Pre-register your hypotheses. Before launching each experiment, write down your prediction and your primary metric. This prevents post-hoc metric fishing — the practice of looking at all your metrics and choosing the one that "won" as your primary metric.
  2. Require replication for high-stakes decisions. Any test result that will drive a decision worth more than $X (set your own threshold) should be replicated in a second experiment before shipping. This catches false positives that passed significance by chance.
  3. Audit your win rate. A healthy testing program expects to win roughly 30-40% of experiments. If you're winning 70%+ of tests, either your ideation is phenomenal (unlikely) or your significance thresholds are too lenient.
  4. Track experiment interactions. Keep a log of which experiments were running simultaneously. When you do a post-mortem on shipped changes, correlate them with the concurrent experiment list.
  5. Use sequential testing properly. Optimizely's Stats Engine is designed for sequential testing (you can peek without inflating false positives). But it's not designed to save you from testing too many things at once.

**Pro Tip:** Run an A/A test quarterly: split your traffic 50/50 with identical experiences on both sides. If your platform is functioning correctly and your metrics have appropriate variance, you should see no significant differences. If you regularly see "significant" A/A results, your false positive rate is higher than your confidence threshold suggests.
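The A/A audit can be rehearsed in simulation before you spend real traffic on it. A rough sketch using a two-proportion z-test; the sample size, conversion rate, and run count are all assumptions:

```python
import math
import random

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z statistic for the difference between two conversion rates."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0
    return (conv_a / n_a - conv_b / n_b) / se

random.seed(7)  # deterministic for the sketch
n, rate, runs = 2_000, 0.05, 400  # visitors per arm, true rate, simulated A/A tests
false_alarms = 0
for _ in range(runs):
    a = sum(random.random() < rate for _ in range(n))  # arm A conversions
    b = sum(random.random() < rate for _ in range(n))  # arm B: identical experience
    if abs(two_proportion_z(a, n, b, n)) > 1.96:  # |z| > 1.96 is p < 0.05
        false_alarms += 1

# Both arms are identical, so roughly 5% of runs still come up "significant"
print(f"{false_alarms / runs:.1%} of A/A runs were spuriously significant")
```

If your real A/A results flag significance much more often than the ~5% this simulation shows, something upstream (bucketing, metric variance, peeking outside the platform) is inflating your error rate.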

Common Mistakes

Treating each experiment's p-value as independent when experiments are running in the same funnel. They're not independent. Interactions are real.

Adding metrics until something reaches significance. Every additional metric you add increases your within-experiment false discovery risk. Commit to your primary metric before you launch.

Running hundreds of small tests without replication. Velocity is good, but a 95% confidence threshold across 200 tests per year can mean shipping up to roughly 10 false positives annually, depending on how many of those tests are true nulls — and you have no way to identify which ones.

Not adjusting for multiple variations. An A/B/C/D test (control plus 3 variants) has 3 chances to produce a false positive, not 1. Your effective false positive rate is closer to 15% if you naively apply a 95% threshold to each comparison independently.
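The multi-variant arithmetic above can be checked directly, along with the Bonferroni split mentioned earlier as one blunt way to restore the family-wise rate (the variant counts here are illustrative):

```python
# Family-wise false positive rate when each of k variants is compared to
# control at a naive 5% threshold, plus the Bonferroni-corrected threshold.
alpha = 0.05
for k in (1, 3, 7):  # A/B, A/B/C/D, and a hypothetical 8-arm test
    family_rate = 1 - (1 - alpha) ** k  # chance of at least one false winner
    bonferroni = alpha / k              # stricter per-comparison alpha
    print(f"k={k}: naive family rate {family_rate:.1%}, "
          f"Bonferroni per-comparison alpha {bonferroni:.4f}")
```

For k=3 the naive family rate is 14.3%, which is where the "closer to 15%" figure comes from.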

Ignoring the FDR vs. FWER distinction. Teams that apply Bonferroni corrections to their experiment results end up with significance thresholds so strict that nothing ever wins. FDR control is the right approach — just understand what it does and doesn't protect.

What to Do Next

  1. Count your simultaneously running experiments. If you have more than 5-7 running at once, calculate your portfolio-level false positive exposure.
  2. Audit your win rate over the last 12 months. If it's above 50%, investigate whether your significance thresholds are appropriate.
  3. Add guardrail metrics to every active experiment. Start with revenue per visitor and your core conversion rate.
  4. Pre-register your next 5 experiments. Write the hypothesis, primary metric, and expected direction before launch. Commit to it.
  5. Plan an A/A test. Run one in the next 30 days as a calibration check.

For the mechanics of why results fluctuate even after statistical significance is reached, see Why Your Optimizely Results Keep Changing. For diagnosing why an experiment isn't reaching significance in the first place, see Why Your Experiment Won't Reach Statistical Significance.

Written by Atticus Li

Revenue & experimentation leader — behavioral economics, CRO, and AI. CXL & Mindworx certified. $30M+ in verified impact.