The Hidden Weakness in Most Experimentation Programs
Every experimentation team talks about false positives. Significance levels, p-values, the risk of shipping a variant that does not actually work — these concerns dominate the conversation. But there is a mirror-image problem that gets far less attention and causes far more damage: false negatives.
A false negative occurs when your test fails to detect a real effect. The variant genuinely improves the metric, but the test says otherwise. You kill a winning change and move on, never knowing you left value on the table.
Statistical power is the probability of avoiding this mistake. And in most experimentation programs, it is dangerously low.
What Statistical Power Means
Power is the probability that your test will detect an effect of a given size, assuming that effect truly exists. Even at the conventional 80% power level, there is still a one-in-five chance you will miss a real improvement of the size you specified.
Conversely, a test with low power has a high chance of returning a null result even when the variant works. You designed an experiment that was unlikely to succeed from the start.
Power depends on four factors:
- Sample size. More data means more power. This is the primary lever.
- Effect size. Larger effects are easier to detect. You have less control over this one.
- Significance level. Stricter significance thresholds reduce power.
- Metric variance. Noisier metrics require more data for the same power.
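To make these four factors concrete, here is a minimal sketch (Python, standard library only) of the approximate power of a two-sided two-proportion z-test. The baseline rate, lift, and sample size are hypothetical illustrations, not recommendations:

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p_base, rel_lift, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    p_var = p_base * (1 + rel_lift)             # variant rate if the effect is real
    se = sqrt(p_base * (1 - p_base) / n_per_arm +
              p_var * (1 - p_var) / n_per_arm)  # std. error of the difference
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Probability the observed difference clears the significance threshold
    return 1 - NormalDist().cdf(z_alpha - (p_var - p_base) / se)

# 5% baseline conversion, hoping to detect a 5% relative lift
# with 20,000 visitors per arm:
print(round(power_two_proportions(0.05, 0.05, 20_000), 2))  # ≈ 0.20
```

Each lever moves this number: more sample shrinks the standard error, a larger lift raises the numerator, a stricter alpha raises the threshold, and a noisier metric widens the standard error.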
Why Most Tests Are Underpowered
Teams do not calculate power in advance
The most common reason tests are underpowered is that nobody calculated the required sample size before launching. The team picks an arbitrary duration — two weeks, one sprint, one business cycle — and hopes it is enough. Often it is not.
Teams optimize for speed over reliability
In fast-moving product organizations, there is constant pressure to ship quickly. Long test durations feel like they slow the team down. So tests get cut short, traffic allocation stays low, and the resulting power is inadequate.
The irony: underpowered tests do not save time. They waste time by producing inconclusive results that require re-testing or, worse, by killing variants that actually worked.
MDEs are set too small
Teams sometimes specify a minimum detectable effect that is unrealistically small for their traffic level. Detecting a tiny improvement in conversion requires enormous sample sizes. When the required duration exceeds what the team is willing to run, the test launches underpowered.
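The required sample grows with the inverse square of the MDE, which is why small MDEs get expensive so fast. A sketch of the standard sample-size approximation for a two-proportion test, with illustrative numbers:

```python
from statistics import NormalDist

def n_per_arm(p_base, rel_lift, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-proportion z-test."""
    p_var = p_base * (1 + rel_lift)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    var_sum = p_base * (1 - p_base) + p_var * (1 - p_var)
    return z ** 2 * var_sum / (p_var - p_base) ** 2

# Required n scales roughly with 1 / MDE^2: a 5x smaller MDE costs ~25x the sample.
print(f"{n_per_arm(0.05, 0.10):,.0f}")  # 10% relative MDE: ~31,000 per arm
print(f"{n_per_arm(0.05, 0.02):,.0f}")  # 2% relative MDE: ~753,000 per arm
```

At a 5% baseline, shrinking the MDE from a 10% to a 2% relative lift multiplies the required sample by roughly 25.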
High-variance metrics dominate the roadmap
Revenue per visitor, average order value, and engagement time are important business metrics, but they have high variance compared to simple conversion rates. Tests targeting these metrics need substantially more sample than teams expect.
The Real Cost of Low Power
You kill winning variants
An underpowered test that returns a null result does not mean the variant failed. It means the test failed. The variant might have improved the metric by a meaningful amount, but you did not have enough data to see it. Ship that variant and you might have moved the needle. Instead, you learned nothing and wasted the opportunity.
Your win rate drops
When most of your tests are underpowered, your observed win rate plummets. This creates a perception that experimentation is not working — that the team cannot come up with good ideas. In reality, the ideas may be fine. The tests just are not built to detect the effects.
False discoveries get amplified
Here is a counterintuitive effect: underpowered tests increase the rate of false discoveries among the results that do reach significance. When power is low, an effect estimate crosses the significance threshold mostly when random noise happens to have inflated it. The "significant" results from underpowered tests therefore tend to overestimate the true effect size — a phenomenon sometimes called the winner's curse.
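This significance filter is easy to simulate. The sketch below (hypothetical numbers, standard library only) draws many noisy effect estimates around a modest true lift, then averages only the estimates that reached significance:

```python
import random
from statistics import mean

random.seed(42)

true_effect = 1.0   # true lift, measured in standard-error units
z_crit = 1.96       # two-sided 5% significance threshold

# Each simulated experiment observes the true effect plus sampling noise.
estimates = [random.gauss(true_effect, 1.0) for _ in range(100_000)]

significant = [e for e in estimates if abs(e) > z_crit]
print(f"power: {len(significant) / len(estimates):.0%}")     # ~17%
print(f"avg significant estimate: {mean(significant):.2f}")  # ~2.4x the truth
```

With only ~17% power, the experiments that happen to reach significance report an average effect more than double the true one — exactly the inflation that makes shipped variants underperform their projected lift.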
This means the variants you do ship based on underpowered tests will likely underperform their projected lift. When the team notices that shipped variants do not deliver the expected improvement, trust in the experimentation program erodes.
You cannot distinguish between "no effect" and "small effect"
An underpowered test cannot tell you whether the variant truly had zero effect or just a small effect you could not detect. This distinction matters. If the variant had no effect, you should move on. If it had a small effect, you might iterate on the concept to amplify it.
How to Diagnose Power Problems
Run a post-hoc power analysis
After a null result, calculate the power your test actually had for a range of plausible effect sizes. If the power was low for effects you care about, the null result is uninformative — it does not mean the effect is zero.
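One way to run this diagnosis, sketched with illustrative numbers: recompute the power the finished test had at several plausible effect sizes, using the sample it actually collected.

```python
from math import sqrt
from statistics import NormalDist

def achieved_power(p_base, rel_lift, n_per_arm, alpha=0.05):
    """Power the completed test had for a hypothetical effect size."""
    p_var = p_base * (1 + rel_lift)
    se = sqrt(p_base * (1 - p_base) / n_per_arm +
              p_var * (1 - p_var) / n_per_arm)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return 1 - NormalDist().cdf(z_alpha - (p_var - p_base) / se)

# A test that ran with 15,000 visitors per arm at a 4% baseline:
for lift in (0.02, 0.05, 0.10):
    print(f"{lift:.0%} relative lift -> power {achieved_power(0.04, lift, 15_000):.2f}")
```

In this hypothetical, even a 10% relative lift had well under 50% power, so the null result says almost nothing about whether such an effect exists.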
Track your observed win rate
If only a small fraction of your tests reach significance, and you believe your team generates decent hypotheses, low power is the most likely explanation. A healthy experimentation program with adequate power should see a meaningful share of tests produce significant results.
Compare expected lift to confidence interval width
If your confidence intervals are consistently wider than the effects you are trying to detect, your tests are underpowered. The intervals should be narrow enough to distinguish between the null hypothesis and your expected effect.
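This check can be done before the test even finishes. A sketch with illustrative numbers, comparing the approximate 95% confidence-interval half-width to the lift you hope to detect:

```python
from math import sqrt

p_base = 0.05                  # baseline conversion rate (illustrative)
n_per_arm = 10_000             # visitors per arm
expected_lift = p_base * 0.05  # hoping for a 5% relative lift

# Std. error of the difference between two proportions near the baseline
se = sqrt(2 * p_base * (1 - p_base) / n_per_arm)
half_width = 1.96 * se

print(f"CI half-width: {half_width:.4f} vs expected lift: {expected_lift:.4f}")
# The ~0.0060 half-width dwarfs the 0.0025 lift: this test cannot resolve it.
```

When the half-width is a multiple of the expected lift, the interval will almost always contain both zero and your hoped-for effect, which is the definition of an uninformative test.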
Audit your MDE choices
Look at the MDEs your team has been using. If they are substantially smaller than the smallest effect that would matter for the business, you are demanding precision the decision does not require — and your tests are almost certainly underpowered for those MDEs.
How to Fix It
Calculate sample size before every test
Make power analysis a mandatory step in experiment design. Use the MDE, baseline rate, and desired power to determine the required sample size. If the duration is unacceptable, adjust the MDE — do not just run an underpowered test.
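This step can be a short helper in the experiment-design checklist. The sketch below (illustrative traffic and rates, not recommendations) converts the required sample into a test duration, making the MDE trade-off explicit:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(p_base, rel_lift, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-proportion z-test."""
    p_var = p_base * (1 + rel_lift)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    var_sum = p_base * (1 - p_base) + p_var * (1 - p_var)
    return ceil(z ** 2 * var_sum / (p_var - p_base) ** 2)

def required_weeks(p_base, rel_lift, daily_traffic, n_arms=2):
    """Weeks of traffic needed to power the test, rounded up."""
    total = n_arms * n_per_arm(p_base, rel_lift)
    return ceil(total / daily_traffic / 7)

# 30,000 eligible visitors/day at a 5% baseline: is a 3% relative MDE feasible?
print(required_weeks(0.05, 0.03, 30_000))  # → 4 weeks: longer than most sprints
print(required_weeks(0.05, 0.08, 30_000))  # → 1 week after relaxing the MDE
```

If the computed duration is unacceptable, the output makes the choice concrete: relax the MDE, add traffic, or do not run the test — launching anyway just produces the uninformative null described above.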
Increase traffic allocation
Many teams allocate only a small fraction of traffic to experiments. Increasing that allocation directly increases power and shortens test duration.
Use variance reduction techniques
Methods like CUPED (Controlled-experiment Using Pre-Experiment Data) can dramatically reduce metric variance, effectively increasing power without increasing sample size. These techniques use pre-experiment data to remove predictable variation.
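A minimal CUPED sketch, with simulated data standing in for real pre-experiment metrics (all numbers hypothetical): estimate the coefficient theta from the covariance between pre- and in-experiment values, then subtract the predictable component.

```python
import random
from statistics import mean, pvariance

random.seed(7)

# Simulated users: pre-experiment spend partly predicts in-experiment spend.
pre  = [random.gauss(50, 10) for _ in range(10_000)]
post = [0.8 * x + random.gauss(10, 5) for x in pre]

# theta = cov(post, pre) / var(pre), estimated from the data
pre_mean, post_mean = mean(pre), mean(post)
cov = sum((x - pre_mean) * (y - post_mean) for x, y in zip(pre, post)) / len(pre)
theta = cov / pvariance(pre, pre_mean)

# CUPED-adjusted metric: subtract the predictable pre-experiment component.
adjusted = [y - theta * (x - pre_mean) for x, y in zip(pre, post)]

print(f"raw variance:      {pvariance(post):.1f}")
print(f"adjusted variance: {pvariance(adjusted):.1f}")  # much smaller
```

The adjusted metric has the same mean as the raw one, so the treatment-effect estimate is unchanged; only its variance shrinks, which in this simulation is equivalent to a severalfold increase in effective sample size.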
Choose more sensitive metrics
Sometimes the primary business metric is too noisy for the available traffic. Consider using a more sensitive proxy metric that responds more clearly to changes. Click-through rate may be a better test metric than downstream conversion when it responds more strongly, relative to its variance, to the change being tested.
Be realistic about MDEs
Work with stakeholders to set MDEs that reflect both what matters for the business and what is feasible for your traffic level. A test powered to detect only large effects tells you less than you might like, but at least it tells you something reliable.
Run fewer, better tests
Instead of running many underpowered tests simultaneously, concentrate traffic on fewer tests with adequate power. Five well-powered tests produce more reliable learning than twenty underpowered ones.
Power in the Context of Experimentation Strategy
Power is not just a statistical concept. It is a strategic one. The choice of how much power to require reflects your organization's experimentation philosophy:
- High power means fewer tests, higher confidence. You run fewer experiments per quarter but can trust the results. Best for high-stakes decisions.
- Moderate power means more tests, some false negatives. You accept that you will miss some real effects in exchange for higher experiment throughput. Best for rapid iteration environments.
- Low power means lots of tests, lots of noise. You run many experiments but learn little from each one. This is almost never the right strategy.
The optimal power level depends on your traffic, the cost of wrong decisions, and how many ideas you want to test. But there is a floor: below a certain power level, the test produces so little information that it is not worth running.
FAQ
What power level should I target?
The conventional target is 80% power, with 90% common for high-stakes decisions. Dropping much below that sharply raises the chance of missing real effects — below 50% power, your test is more likely to miss a real effect than to detect it, which defeats the purpose of testing.
Is it worth doing a post-hoc power analysis?
Post-hoc power analysis on a completed test is controversial among statisticians because the power is mathematically determined by the p-value. However, it can be useful for communicating to stakeholders that a null result from an underpowered test is not the same as evidence of no effect.
How does power interact with sequential testing?
Sequential testing methods allow you to check results at multiple points during the test while maintaining overall error rates. They can sometimes detect effects earlier than fixed-horizon tests, effectively increasing power at a given sample size. But the power guarantee applies at the end of the test, not at any individual check.
Can I increase power by making the significance threshold less strict?
Yes, but this increases your false positive rate. It is usually better to increase power by increasing sample size or reducing variance rather than by lowering the bar for significance.