The Experiment That Teaches You Nothing
Your team runs forty A/B tests per quarter. Only a handful show significant results. Leadership starts questioning whether the experimentation program is worth the investment. The data team insists the ideas are good. Product says the methodology must be flawed.
Neither side considers the simplest explanation: the tests are underpowered. They were never designed to detect the effects that actually exist. They are expensive coin flips disguised as experiments.
Underpowered testing is the most widespread problem in corporate experimentation. It is invisible because it looks like negative results rather than bad methodology. And it silently destroys the ROI of experimentation programs everywhere.
What Makes a Test Underpowered
A test is underpowered when its sample size is too small to reliably detect an effect of the size you care about. The test might still catch a very large effect, but it will miss moderate and small effects, which are exactly the effects that most real-world changes produce.
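To make this concrete, here is a minimal sketch of power for a two-sided two-proportion z-test. All inputs (5,000 visitors per arm, a 5% baseline, the two lift sizes) are illustrative assumptions:

```python
from scipy.stats import norm

def power_two_proportions(n_per_arm, p_control, mde_abs, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    n_per_arm: visitors in each arm
    p_control: baseline conversion rate
    mde_abs:   absolute lift you want to detect (0.01 = one point)
    """
    p_treat = p_control + mde_abs
    se = (p_control * (1 - p_control) / n_per_arm
          + p_treat * (1 - p_treat) / n_per_arm) ** 0.5
    z_crit = norm.ppf(1 - alpha / 2)
    # probability the observed difference clears the critical value
    return norm.sf(z_crit - mde_abs / se) + norm.cdf(-z_crit - mde_abs / se)

# the same test that has decent power for a large lift is nearly
# blind to a small one
print(round(power_two_proportions(5_000, 0.05, 0.01), 2))
print(round(power_two_proportions(5_000, 0.05, 0.002), 2))
```

With these inputs, the test detects a one-point lift only about 60% of the time, and a 0.2-point lift less than one time in ten.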
The symptoms:
- Most tests return null results. Not because the changes have no effect, but because the tests cannot detect the effects that exist.
- Significant results have inflated effect sizes. The few results that reach significance tend to have estimates that are higher than the true effect, because only inflated estimates can cross the significance threshold in underpowered tests.
- Shipped variants underperform projections. When you ship a "winning" variant from an underpowered test, the real-world lift is usually smaller than what the test showed.
- Confidence intervals are very wide. If your intervals routinely span from negative to positive, the test lacks the precision to distinguish between help and harm.
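The inflation symptom above is easy to reproduce by simulation. This sketch uses illustrative parameters (a true half-point lift, 2,000 visitors per arm) to show that the few runs crossing the significance bar overstate the true effect several-fold:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

true_lift = 0.005            # true absolute lift: half a point
p_control, n = 0.05, 2_000   # deliberately underpowered arms
sims = 20_000                # number of simulated experiments

c = rng.binomial(n, p_control, sims) / n
t = rng.binomial(n, p_control + true_lift, sims) / n
se = np.sqrt(c * (1 - c) / n + t * (1 - t) / n)
z = (t - c) / se
winners = z > norm.ppf(0.975)   # clears the two-sided p < .05 bar upward

print(f"share of tests declared winners: {winners.mean():.0%}")
print(f"true lift {true_lift:.3f} vs mean estimated lift among winners "
      f"{(t - c)[winners].mean():.3f}")
```

Only about one run in ten reaches significance, and those that do report an average lift roughly three times the true value: the winner's curse in miniature.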
How Underpowered Tests Damage Your Program
The direct cost: missed wins
Every underpowered null result might be a real improvement that you failed to detect. If your test has low power for a moderate effect size, you will miss most moderate improvements. Those missed improvements represent real revenue, engagement, or conversion gains that your program failed to capture.
The cruel irony: teams often conclude that "nothing works" when the truth is that their measurement tool is broken.
The amplification effect: winner's curse
In an underpowered testing environment, the results that do reach significance are a biased sample. They disproportionately represent cases where random noise inflated the estimate. This is the winner's curse — your apparent winners are not as good as they look.
When you ship these inflated estimates and measure the actual effect, the disappointment compounds the trust problem. "Experimentation said we would see a big lift. We got almost nothing."
The credibility cost: eroding trust
An experimentation program that routinely produces null results and occasionally produces wins that do not replicate will eventually lose organizational credibility. Stakeholders stop trusting the numbers. Product teams stop prioritizing test design. The program becomes a compliance exercise rather than a strategic capability.
This credibility loss is the most expensive consequence because it is the hardest to reverse. Once leadership loses faith in experimentation, rebuilding that faith takes years of consistently reliable results.
The opportunity cost: traffic waste
Traffic allocated to underpowered tests is traffic that could have been used for well-powered tests or allocated to other business purposes. If a test has low power, the expected information value per visitor is minimal. You are paying the opportunity cost of the traffic without getting proportional learning in return.
Root Causes
No pre-test power analysis
The most common cause. Teams launch tests without calculating the required sample size. They run for a fixed duration, check the results, and hope for the best. Without power analysis, there is no way to know whether the test had a reasonable chance of detecting the expected effect.
Unrealistic MDE targets
Teams specify a minimum detectable effect that is far smaller than what their traffic can support. Wanting to detect a tiny improvement is reasonable in theory but requires enormous sample sizes in practice.
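The cost of a small MDE is quadratic, which is why "just detect half the lift" is never half the work. A minimal sketch, using the standard equal-variance sample-size approximation and an assumed 5% baseline:

```python
from scipy.stats import norm

def n_per_arm(p_baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-proportion test
    (equal-variance approximation)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return int(2 * z ** 2 * p_baseline * (1 - p_baseline) / mde_abs ** 2) + 1

# halving the MDE quadruples the required sample
for mde in (0.01, 0.005, 0.0025):
    print(f"absolute MDE {mde:.4f}: {n_per_arm(0.05, mde):,} visitors per arm")
```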
Traffic fragmentation
Running too many tests simultaneously divides traffic across experiments. Each test gets a fraction of the available sample, reducing the power of every concurrent test.
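The fragmentation penalty can be quantified. In this sketch, the daily traffic figure, baseline rate, and MDE are illustrative assumptions; the pattern, not the numbers, is the point:

```python
from scipy.stats import norm

def power(n_per_arm, p, mde_abs, alpha=0.05):
    """Two-proportion z-test power (equal-variance approximation)."""
    se = (2 * p * (1 - p) / n_per_arm) ** 0.5
    return norm.sf(norm.ppf(1 - alpha / 2) - mde_abs / se)

daily_visitors = 10_000   # illustrative site-wide traffic
for concurrent in (1, 2, 5, 10):
    # two-week test, traffic split evenly across concurrent tests
    n = daily_visitors * 14 // (2 * concurrent)
    print(f"{concurrent:>2} concurrent tests -> "
          f"per-test power {power(n, 0.05, 0.005):.2f}")
```

With this traffic, one test at a time is near-certain to detect a half-point lift; ten at a time drops each test to roughly a one-in-four chance.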
High-variance metrics
Some metrics are inherently noisier than others. Revenue per visitor has much higher variance than click-through rate. Tests using high-variance metrics need proportionally more sample — often several times more than teams expect.
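For a fixed relative MDE, the required sample scales with the metric's variance divided by its squared mean. This sketch compares a Bernoulli click metric against a skewed revenue metric; the distribution parameters are illustrative assumptions, not real data:

```python
import numpy as np

rng = np.random.default_rng(1)
visitors = 200_000

# click-through: a simple 5% Bernoulli
clicks = rng.binomial(1, 0.05, visitors)
# revenue per visitor: 5% of visitors buy, order values heavily skewed
# (lognormal parameters chosen for illustration only)
revenue = rng.binomial(1, 0.05, visitors) * rng.lognormal(3.5, 1.0, visitors)

# for a fixed *relative* MDE, required sample scales with var / mean^2
for name, x in (("click-through", clicks), ("revenue/visitor", revenue)):
    print(f"{name}: var/mean^2 = {x.var() / x.mean() ** 2:.1f}")
```

Under these assumptions the revenue metric needs a few times more sample than the click metric for the same relative lift, before any other difference between the tests is considered.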
Insufficient test duration
Organizational pressure to move fast leads to short test durations. Two-week tests are standard in many organizations, but two weeks may not be enough for the metric, traffic level, and MDE in question.
The Diagnostic Checklist
Here is how to determine whether your program has a power problem:
1. Calculate the power of your recent tests. Go back to your last quarter of experiments. For each test, calculate the statistical power for a reasonable range of effect sizes. If most tests had low power for moderate effects, you have a systemic problem.
2. Compare observed win rates to expected rates. A healthy experimentation program with adequate power should see a meaningful proportion of tests produce significant results. If your win rate is much lower, underpowering is the likely explanation.
3. Track post-ship performance. For tests that reached significance, measure the actual post-ship effect. If shipped variants consistently underperform their test estimates, the winner's curse is operating — a hallmark of underpowered tests.
4. Measure confidence interval widths. If your average confidence interval width is larger than the effect sizes you are trying to detect, your tests lack sufficient precision.
5. Count the tests running simultaneously. If you are running many tests at once on the same traffic, individual test power may be far lower than expected.
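Step 1 of the checklist can be scripted. This sketch audits a few hypothetical past tests (the names, sample sizes, and baselines are invented placeholders) across a grid of relative effect sizes:

```python
from scipy.stats import norm

def power(n_per_arm, p, mde_abs, alpha=0.05):
    """Two-sided two-proportion z-test power (equal-variance approximation)."""
    se = (2 * p * (1 - p) / n_per_arm) ** 0.5
    z = norm.ppf(1 - alpha / 2)
    return norm.sf(z - mde_abs / se) + norm.cdf(-z - mde_abs / se)

# hypothetical audit rows: (test name, visitors per arm, baseline rate)
past_tests = [("checkout-copy", 8_000, 0.031),
              ("hero-image", 25_000, 0.052),
              ("pricing-page", 3_500, 0.044)]

for name, n, p in past_tests:
    powers = {f"{rel:.0%}": round(power(n, p, rel * p), 2)
              for rel in (0.02, 0.05, 0.10)}   # relative lifts
    print(f"{name}: power at relative MDE {powers}")
```

If most rows show low power at the 5% relative-lift column, the program's null results are mostly uninformative.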
The Fix: A Power-First Approach
Make power analysis mandatory
No test launches without a documented sample size calculation. The calculation should include the MDE, expected baseline, power level, and significance threshold. The resulting duration should be reviewed and approved before traffic is allocated.
Set realistic MDEs
Work with stakeholders to set MDEs that balance business value against feasibility. Use historical effect sizes as a guide. If no past test has ever produced a large lift, do not design tests to detect only effects that large.
Reduce concurrent tests
If traffic is limited, run fewer tests with adequate power rather than many tests with inadequate power. The math is clear: five well-powered tests produce more reliable learning than twenty underpowered ones.
Use variance reduction
Techniques like CUPED and stratification can reduce effective variance by a meaningful fraction, which increases power without requiring more traffic. These methods use pre-experiment data to control for predictable variation in the metric.
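A minimal CUPED sketch on synthetic data (the covariate, metric, and their relationship are illustrative assumptions): the adjustment subtracts the part of the metric that pre-experiment data already predicts.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# pre-experiment covariate (e.g. each user's prior-month spend) and an
# in-experiment metric correlated with it; synthetic illustrative data
pre = rng.gamma(2.0, 10.0, n)
metric = 0.6 * pre + rng.normal(0, 10, n)

# CUPED: theta is the regression slope of the metric on the covariate
theta = np.cov(pre, metric)[0, 1] / pre.var(ddof=1)
adjusted = metric - theta * (pre - pre.mean())

# variance falls by the squared correlation between covariate and metric
reduction = 1 - adjusted.var() / metric.var()
print(f"variance reduced by {reduction:.0%}")
```

The adjusted metric has the same mean difference between arms in expectation, so the treatment estimate is unchanged while its standard error shrinks.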
Choose appropriate metrics
For each test, consider whether the primary metric is the most sensitive way to measure the change. Sometimes a more proximal metric (closer to the point of change) has lower variance and provides a more powerful test.
Extend test durations strategically
When the required sample size exceeds what a standard sprint can provide, extend the test rather than accepting low power. Communicate to stakeholders why longer tests produce more reliable and valuable results.
Pool related surfaces
If a change applies to multiple pages or flows, consider pooling traffic across all affected surfaces. This increases the effective sample size without requiring more time.
The Cultural Shift
Fixing underpowered testing requires more than statistical corrections. It requires a cultural shift in how the organization values experimentation quality.
- Reward learning, not just wins. A well-powered null result is valuable because it bounds the effect: whatever the change did, it did less than your minimum detectable effect. An underpowered null result tells you nothing.
- Measure program health, not just win rate. Track median power, confidence interval width, and post-ship replication rate alongside the win rate.
- Make power visible. Include the power calculation in every experiment report. Stakeholders should know whether a null result is informative or inconclusive.
- Invest in traffic efficiency. Variance reduction, metric selection, and test prioritization are infrastructure investments that pay dividends across every experiment.
The organizations that get the most value from experimentation are not the ones that run the most tests. They are the ones that run tests capable of producing reliable answers.
FAQ
How do I convince leadership that underpowered tests are a problem?
Translate the problem into business terms. Calculate the expected number of real improvements missed per quarter due to low power. Estimate the revenue impact of those missed improvements. Compare the cost of running properly powered tests to the value of the improvements they would detect.
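The translation above can be made concrete with a back-of-envelope calculation. Every input here is an assumption to replace with your own numbers:

```python
# back-of-envelope: how many real improvements does low power cost?
tests_per_quarter = 40
share_with_real_effect = 0.30   # assumed fraction of ideas that truly move the metric
median_power = 0.35             # assumed median power from a retrospective audit

missed = tests_per_quarter * share_with_real_effect * (1 - median_power)
caught = tests_per_quarter * share_with_real_effect * median_power
print(f"real improvements missed per quarter: {missed:.1f} "
      f"(caught: {caught:.1f})")
```

Under these assumptions the program misses roughly eight real improvements per quarter while catching four; attach an average per-improvement value and the cost of underpowering becomes a revenue figure.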
Is it ever okay to run an underpowered test?
Only if you treat the result as purely exploratory and do not make ship/no-ship decisions based on it. An underpowered test can still provide directional evidence, but it should not be the basis for a definitive decision.
What power level should I target?
The conventional target is 80% power, meaning a real effect of the MDE size is detected four times out of five. Below 50% power you are literally more likely to miss such an effect than to detect it. For high-stakes decisions, targeting 90% is warranted.
How do I know if a specific null result is due to low power or a truly zero effect?
Look at the confidence interval. If the interval is narrow and centered near zero, the effect is likely small. If the interval is wide and spans meaningfully positive and negative values, you cannot distinguish between no effect and a real effect that the test was unable to detect.