
Statistical Power

The probability that a test will correctly detect a true effect when one exists — conventionally set at 0.80.

What Is Statistical Power?

Statistical power is the probability that your experiment will return a significant result when the variant genuinely outperforms control. A test with 0.80 power has an 80% chance of catching a real effect of the specified size — and a 20% chance of missing it (a Type II error). Power is the often-ignored sibling of the p-value, and it is what separates "we didn't find a difference" from "we could not have found a difference even if one existed."

Also Known As

  • Data science: 1 - beta, sensitivity
  • Growth: test sensitivity, "are we going to catch this?"
  • Marketing: confidence to detect
  • Engineering: true positive rate at the target effect

How It Works

Take a checkout with a 5% baseline conversion rate, 20,000 users per variant, alpha 0.05, and a true relative lift of 10% (5.0% to 5.5%). A power calculation returns roughly 0.61, meaning a ~39% chance of missing this real effect. Doubling to 40,000 per variant raises power to ~0.89. Same truth, same alpha, different story about what you will see.
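This kind of calculation can be reproduced with the normal approximation for a two-proportion z-test. A minimal sketch using only the standard library; dedicated tools such as statsmodels' power module handle refinements like continuity corrections:

```python
from statistics import NormalDist

def power_two_proportions(p1, p2, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test
    (normal approximation; an illustrative sketch, not a library API)."""
    nd = NormalDist()
    # Standard error of the difference in observed rates
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_arm) ** 0.5
    z_crit = nd.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    return nd.cdf(abs(p2 - p1) / se - z_crit)

# 5% baseline, 10% relative lift (5.0% -> 5.5%)
print(round(power_two_proportions(0.05, 0.055, 20_000), 2))  # 0.61
print(round(power_two_proportions(0.05, 0.055, 40_000), 2))  # 0.89
```

Note how doubling the sample size does not double power: power moves with the square root of n, which is why the jump from 20,000 to 40,000 buys less than intuition suggests.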

Power is a function of four things: true effect size, sample size, variance, and alpha. You cannot know true effect in advance, so you design around the smallest effect you would care about if it were real.
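Designing around the smallest effect you would care about means inverting the power relationship: fix alpha, target power, and the baseline rate, then solve for the minimum detectable effect (MDE). A hedged sketch using the same normal approximation; `mde_two_proportions` is a hypothetical helper, not a standard library function:

```python
from statistics import NormalDist

def mde_two_proportions(p1, n_per_arm, alpha=0.05, power=0.80):
    """Smallest absolute lift detectable at the given power.
    Normal approximation with variance taken at the baseline rate;
    a design-stage sketch, not a production power library."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # critical value, two-sided test
    z_power = nd.inv_cdf(power)           # quantile for the target power
    se_unit = (2 * p1 * (1 - p1)) ** 0.5  # per-user noise across both arms
    return (z_alpha + z_power) * se_unit / n_per_arm ** 0.5

mde = mde_two_proportions(0.05, 20_000)
print(f"{mde:.4f} absolute, {mde / 0.05:.0%} relative")  # 0.0061 absolute, 12% relative
```

Read the output as a design constraint: with 20,000 users per arm on a 5% baseline, anything smaller than a ~12% relative lift is likely to slip through undetected at 0.80 power.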

Best Practices

  • Target 0.80 as a floor, and 0.90 for decisions that are hard to reverse, like pricing, paywalls, or homepage redesigns.
  • Report power at readout, not just at design. Observed variance often differs from assumed variance.
  • If a test is flat, report post-hoc the effect size you had 80% power to detect. That is the actionable message.
  • Increase power with variance reduction (CUPED, stratification) rather than pure sample size when traffic is constrained.
  • Distinguish power to detect any effect from power to detect a specific direction.
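The CUPED idea from the list above can be sketched in a few lines: subtract from each user's metric the portion predicted by a pre-experiment covariate, which shrinks variance without shifting the mean. An illustrative simulation with invented data; `cuped_adjust` is a hypothetical helper:

```python
import random

def cuped_adjust(y, x):
    """CUPED: remove the variance in metric y explained by a
    pre-experiment covariate x (paired lists, same users)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    theta = cov / (sum((a - mx) ** 2 for a in x) / n)  # OLS slope of y on x
    return [b - theta * (a - mx) for a, b in zip(x, y)]

def variance(v):
    m = sum(v) / len(v)
    return sum((u - m) ** 2 for u in v) / len(v)

random.seed(0)
x = [random.gauss(50, 10) for _ in range(5_000)]  # pre-period spend per user
y = [0.8 * a + random.gauss(10, 5) for a in x]    # in-experiment spend
adj = cuped_adjust(y, x)
print(variance(adj) / variance(y))  # well below 1: same mean, less noise
```

The stronger the correlation between the pre-period covariate and the experiment metric, the larger the variance reduction, and every unit of variance removed is power gained without adding traffic.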

Common Mistakes

  • Treating "not significant" as "no effect." Under low power, a flat result is absence of evidence, not evidence of absence.
  • Running A/A/B tests where the duplicate control arm consumes a third of the traffic and cuts the power of the comparison you actually care about. Use A/A testing periodically, not in every experiment.
  • Assuming power from a pilot study transfers to the full test. Pilots with 500 users estimate variance unreliably.

Industry Context

In SaaS/B2B, chronic underpower is the norm — most B2B teams should run fewer, larger tests rather than many underpowered ones. In ecommerce, power is usually adequate for conversion metrics but insufficient for AOV or revenue per user without variance reduction. In lead gen, power on the MQL metric is typically 0.3–0.5 even when power on form fills is 0.9, which is why teams celebrate "wins" that never show up in pipeline.

The Behavioral Science Connection

Humans treat absence of evidence as evidence of absence — a cognitive error known as omission neglect. When a test comes back flat, teams conclude "the change didn't work" rather than "we couldn't tell." Explicitly reporting power at readout forces the correct inference and protects against prematurely killing ideas that might have worked.

Key Takeaway

Power is how you avoid confusing "didn't win" with "couldn't win." Report it before, during, and after every test.