The Hidden Problem: Effects You Never Detect
Most discussions about A/B testing errors focus on false positives: thinking something works when it does not. But there is an equally damaging error that receives far less attention. False negatives occur when a real improvement exists, but your test fails to detect it. You conclude "no significant difference," revert the change, and never know you just threw away a winner.
Statistical power is your defense against false negatives. It measures your test's ability to detect a real effect when one exists. Understanding power is essential for designing tests that do not waste your time by being too insensitive to find real improvements.
What Statistical Power Actually Measures
Statistical power is the probability that your test will correctly identify a true effect. If your test has 80% power, it means that when there is a real effect of the size you specified, you have an 80% chance of detecting it and a 20% chance of missing it.
More precisely, power is 1 minus the probability of a Type II error (beta). A Type II error is a false negative: the test says no effect exists when one actually does. With 80% power, your Type II error rate is 20%.
It is important to understand that power is always defined relative to a specific effect size. A test might have 80% power to detect a 20% relative improvement but only 30% power to detect a 5% relative improvement. Power is not a fixed property of a test. It varies depending on the effect you are trying to detect.
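The dependence of power on effect size can be sketched with a few lines of standard-library Python. This is a normal-approximation calculation for a two-sided two-proportion z-test; the 10% baseline rate and 3,900 visitors per arm are illustrative assumptions, not figures from the text.

```python
# Sketch: approximate power of a two-sided two-proportion z-test,
# using only the standard library. Baseline rate and sample size
# below are illustrative assumptions.
from statistics import NormalDist

def approx_power(p_base, relative_lift, n_per_arm, alpha=0.05):
    """Normal-approximation power to detect p_base -> p_base * (1 + lift)."""
    p_var = p_base * (1 + relative_lift)
    se = (p_base * (1 - p_base) / n_per_arm
          + p_var * (1 - p_var) / n_per_arm) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z_effect = abs(p_var - p_base) / se
    # Probability the observed difference clears the critical value
    # (the negligible lower-tail rejection region is ignored).
    return 1 - NormalDist().cdf(z_crit - z_effect)

print(approx_power(0.10, 0.20, 3900))  # ~0.81: ~80% power for a 20% relative lift
print(approx_power(0.10, 0.05, 3900))  # ~0.11: far too little power for a 5% lift
```

Same test, same sample: roughly 80% power against a 20% lift, but barely above the false-positive rate against a 5% lift.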
Type II Errors: The Business Cost of Missing Winners
Type II errors are insidious because they are invisible. When you get a false positive, you eventually discover it because the expected improvement does not materialize. But when you get a false negative, you never know. You simply move on, unaware that you abandoned a change that would have improved your metrics.
Consider this scenario: your team spends two weeks developing and testing a new checkout flow. The test comes back "not significant." You revert to the old design. But what if the new checkout flow actually improved conversion by 8%? With a low-power test, there is a meaningful probability that you missed this improvement entirely.
The business cost is twofold. First, you lost the 8% improvement itself, which on a high-revenue page could be substantial. Second, you wasted the effort of developing and testing the change. Unlike a false positive (where you at least learn the change did not work), a false negative teaches you nothing because you drew the wrong conclusion.
Why 80% Power Is the Standard
The convention of 80% power emerged from a practical balancing act. Higher power requires larger sample sizes, which means longer tests and more traffic. The marginal benefit of increasing power from 80% to 90% is relatively modest compared to the additional cost.
At 80% power, you detect 4 out of 5 real effects of the specified size. At 90%, you detect 9 out of 10. The improvement sounds substantial, but achieving 90% power requires roughly a third more sample size. For most organizations, this additional traffic investment is not justified by catching one extra effect out of ten.
There are exceptions. In high-stakes testing where the cost of missing a winner is very high (such as medical trials or tests with large revenue impact and infrequent testing opportunities), 90% or 95% power may be warranted. But for typical digital experimentation programs running many tests per year, 80% power represents a good tradeoff between sensitivity and velocity.
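The cost of moving from 80% to 90% power can be checked directly: for a z-test, required sample size scales with the square of the sum of the critical value and the power quantile, so the ratio depends only on alpha and the two power levels.

```python
# Sketch: why 90% power costs roughly a third more traffic than 80%.
# For a z-test, required n per arm scales with (z_crit + z_power)^2.
from statistics import NormalDist

z = NormalDist().inv_cdf
alpha = 0.05

def n_scale(power):
    # Proportionality factor for required sample size at this power level.
    return (z(1 - alpha / 2) + z(power)) ** 2

ratio = n_scale(0.90) / n_scale(0.80)
print(f"{ratio:.2f}")  # ~1.34: about 34% more sample at alpha = 0.05
```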
The Power Triangle: How Power, Sample Size, and Effect Size Interact
Power exists in a triangle with sample size and effect size. Understanding this relationship is crucial for test planning:
Larger sample sizes increase power. More data means more precision, which means smaller effects become detectable. This is the most straightforward lever for increasing power.
Larger effects are easier to detect. A 50% improvement is easy to spot even with a small sample. A 2% improvement requires enormous samples. This is why targeting larger effect sizes (through bolder test hypotheses) is a legitimate strategy for running more powerful tests.
A looser significance threshold increases power. If you are willing to accept more false positives (raising alpha from 0.05 to 0.10), you get more power for the same sample size. This is a genuine tradeoff, and the right balance depends on the relative costs of false positives and false negatives.
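The three corners of the triangle can be seen in a single sample-size formula. This sketch uses the unpooled normal approximation for a two-proportion test; the 5% baseline rate is an illustrative assumption.

```python
# Sketch of the power triangle: required sample size per arm as a
# function of MDE, alpha, and power (unpooled normal approximation
# for a two-proportion z-test). The 5% baseline is an assumption.
from statistics import NormalDist

def n_per_arm(p_base, mde_relative, alpha=0.05, power=0.80):
    """Approximate n per arm to detect p_base -> p_base * (1 + mde)."""
    z = NormalDist().inv_cdf
    p_var = p_base * (1 + mde_relative)
    diff = p_var - p_base
    var_sum = p_base * (1 - p_base) + p_var * (1 - p_var)
    return ((z(1 - alpha / 2) + z(power)) ** 2) * var_sum / diff ** 2

print(round(n_per_arm(0.05, 0.20)))              # larger MDE -> thousands per arm
print(round(n_per_arm(0.05, 0.05)))              # small MDE -> over 100k per arm
print(round(n_per_arm(0.05, 0.20, alpha=0.10)))  # looser alpha -> smaller n
```

Halving the MDE roughly quadruples the required sample, which is why "detect any possible effect" is not a workable test design goal.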
How Underpowered Tests Mislead You
The damage from underpowered tests goes beyond simply missing real effects. Underpowered tests distort your entire view of experimentation effectiveness.
When a test is underpowered, the effects that do reach significance tend to be those amplified by random noise, a phenomenon often called the winner's curse. As a result, the significant results from underpowered tests systematically overestimate the true effect size. You see a 30% lift in the test, implement the change, and then observe only a 5% improvement in practice.
Over time, this erosion of trust destroys experimentation programs. Stakeholders see that test results do not materialize in production, conclude that A/B testing does not work, and reduce investment in the testing program. The real problem was never testing itself. It was underpowered tests producing inflated estimates.
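This inflation effect is easy to demonstrate by simulation. The sketch below assumes a hypothetical scenario: a true 5% relative lift tested with a sample far too small to detect it reliably, repeated many times, looking only at the runs that happened to reach significance.

```python
# Sketch: significant results from an underpowered test overestimate
# the true effect. True relative lift is 5%; n is deliberately too low.
import random
from statistics import NormalDist, mean

random.seed(42)
p_a, p_b, n = 0.10, 0.105, 1000   # true lift: 5%; power here is only a few percent
z_crit = NormalDist().inv_cdf(0.975)

significant_lifts = []
for _ in range(2000):
    conv_a = sum(random.random() < p_a for _ in range(n))
    conv_b = sum(random.random() < p_b for _ in range(n))
    ra, rb = conv_a / n, conv_b / n
    se = (ra * (1 - ra) / n + rb * (1 - rb) / n) ** 0.5
    if se > 0 and (rb - ra) / se > z_crit:   # variant "wins" significantly
        significant_lifts.append((rb - ra) / ra)

print(f"share significant: {len(significant_lifts) / 2000:.1%}")
print(f"mean lift among winners (true lift is 5%): {mean(significant_lifts):.0%}")
```

Only a small fraction of runs reach significance, and those that do show observed lifts several times larger than the true 5%, because crossing the significance bar at this sample size requires a large helping of favorable noise.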
How to Ensure Adequate Power in Practice
Here is a practical checklist for ensuring your tests have adequate power:
Always calculate sample size before starting. Use your baseline conversion rate, your minimum detectable effect (MDE), and 80% power to determine the required sample. If the required duration is unreasonable, adjust your MDE or test a higher-traffic page.
Be honest about the MDE. Do not set an unrealistically small MDE just because you want to detect any possible effect. Set it to the smallest effect that would actually justify implementation. This keeps your tests feasible.
Do not split traffic too many ways. Each additional variation reduces the traffic per variation, reducing power. Two or three variations (including the control) are usually the maximum for reasonable power.
Consider the full testing velocity picture. Running one well-powered test that takes three weeks is better than running three underpowered tests that each take one week. The well-powered test gives you one reliable answer. The three underpowered tests give you three unreliable answers.
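The checklist above reduces to simple arithmetic once you have a required sample size. The numbers here are hypothetical: 8,200 visitors per arm from a pre-test power calculation and 2,400 eligible visitors per day.

```python
# Sketch: turning a required sample size into a minimum test duration.
# All figures below are hypothetical assumptions for illustration.
import math

required_per_arm = 8_200   # e.g. from a pre-test power calculation
daily_visitors = 2_400     # total eligible traffic per day, split evenly
n_arms = 2                 # control + one variation

days = math.ceil(required_per_arm * n_arms / daily_visitors)
print(f"Minimum duration, 2 arms: {days} days")

# Adding a second variation splits the same traffic three ways,
# stretching the test proportionally:
days_3arms = math.ceil(required_per_arm * 3 / daily_visitors)
print(f"Minimum duration, 3 arms: {days_3arms} days")
```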
Power Analysis After the Test: What Can You Learn?
Post-hoc power analysis (calculating power after the test is complete) is a controversial practice. Some statisticians argue it provides useful context. Others consider it misleading because it is mathematically redundant with the p-value.
What is genuinely useful is retrospective analysis of your testing program's power characteristics. If you consistently run tests at 40% power, you are missing more than half of real effects. This is a program-level problem that indicates you need to either increase traffic allocation or target larger effects.
Key Takeaways
Statistical power is your protection against false negatives, the silent killers of experimentation programs. The 80% standard means accepting that you will miss 1 in 5 real effects, which is a reasonable tradeoff for most testing programs. Power depends on sample size, effect size, and significance threshold. Underpowered tests do not just miss effects; they produce inflated estimates that erode trust in experimentation. Always calculate sample size before starting a test, and design your testing program to maintain adequate power across all experiments.