The Number Behind Every Experimentation Decision
Somewhere in every A/B testing dashboard sits a p-value. It is small, often displayed with several decimal places, and it drives decisions worth significant revenue. Yet if you asked most people who rely on p-values to explain what they measure, the answer would be wrong.
This is not because people are careless. It is because p-values are genuinely counterintuitive. The definition is precise but narrow, and the gap between what it says and what people think it says creates real problems in experimentation programs.
What a P-Value Actually Is
A p-value is the probability of observing data at least as extreme as what you collected, assuming there is no real difference between your control and variant.
Read that again. The p-value is a statement about the data, not about the hypothesis. It answers the question: "If my variant truly does nothing, how surprised should I be by this data?"
A small p-value means the observed data would be unusual if the null hypothesis were true. It does not mean the null hypothesis is false. It does not mean the variant works. It does not tell you the probability that your hypothesis is correct.
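The definition above can be made concrete with a simulation. This is a sketch, not a production test: the conversion counts are hypothetical, and the simulated null distribution stands in for the analytic one a real platform would use. The key point is that the p-value is computed entirely under the assumption of no real difference.

```python
import random

random.seed(0)

def simulated_p_value(conv_a, n_a, conv_b, n_b, sims=2000):
    """How often would chance alone produce a conversion-rate gap at
    least as large as the observed one, if both arms shared one true rate?"""
    observed = abs(conv_b / n_b - conv_a / n_a)
    null_rate = (conv_a + conv_b) / (n_a + n_b)  # null: no real difference
    as_extreme = 0
    for _ in range(sims):
        sim_a = sum(random.random() < null_rate for _ in range(n_a))
        sim_b = sum(random.random() < null_rate for _ in range(n_b))
        if abs(sim_b / n_b - sim_a / n_a) >= observed:
            as_extreme += 1
    return as_extreme / sims

# Hypothetical test: 25/500 conversions on control, 40/500 on variant.
p = simulated_p_value(25, 500, 40, 500)
print(f"p ≈ {p:.3f}")
```

Notice that nothing in the computation refers to the probability that the variant works; every simulated dataset is generated under the null.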
The Classic Misinterpretation
Here is what most teams believe a p-value means: "There is only a small probability that this result happened by chance, so the variant is probably better."
The error is subtle but critical. The p-value is calculated under the assumption that chance is the only explanation. It does not weigh chance against the alternative. To get the probability that your variant actually works, you would need additional information — specifically, the prior probability that the hypothesis is true, which p-values do not incorporate.
Think of it this way. If you test a change that has almost no chance of working — say, changing the server's timezone — and you get a small p-value, should you believe the timezone change improved conversions? Probably not. The p-value does not account for how implausible the hypothesis was to begin with.
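The timezone intuition can be put into numbers with Bayes' rule. The prior probabilities, false positive rate, and power below are illustrative assumptions, not estimates from any real program; the point is how strongly the prior drives the answer.

```python
def prob_variant_works(prior, alpha=0.05, power=0.8):
    """P(variant truly works | significant result), via Bayes' rule.
    alpha is the false positive rate; power is the true positive rate."""
    true_pos = power * prior          # works AND test flags it
    false_pos = alpha * (1 - prior)   # does nothing AND test flags it anyway
    return true_pos / (true_pos + false_pos)

# Same significant result, very different hypotheses:
plausible = prob_variant_works(prior=0.30)   # a well-motivated change
hopeless = prob_variant_works(prior=0.01)    # the timezone change
print(f"plausible change: {plausible:.2f}, implausible change: {hopeless:.2f}")
```

Under these assumptions, a significant result for the well-motivated change implies roughly an 87% chance it works, while the same result for the timezone change implies only about 14%. The p-value alone cannot distinguish the two cases.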
How P-Values Connect to Significance
Statistical significance is just a p-value compared to a threshold. When someone says a test is "significant," they mean the p-value fell below a predetermined cutoff.
The threshold is a line you draw before the test. Results below the line are called significant. Results above are not. The line itself is arbitrary — it is a convention that reflects how much false positive risk you are willing to accept.
Setting the threshold lower means fewer false positives but more false negatives. You will miss real effects because you demanded too much evidence. Setting it higher means more false positives but fewer misses. The right threshold depends on the consequences of each type of error.
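The tradeoff can be sketched for a stylized scenario: a two-sided z-test where the true effect sits 2.5 standard errors from zero (an assumed value chosen only to make the tradeoff visible).

```python
from statistics import NormalDist

def power_at(alpha, effect_z=2.5):
    """Approximate power of a two-sided z-test when the true effect is
    effect_z standard errors from zero (upper tail only; assumed scenario)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return 1 - NormalDist().cdf(z_crit - effect_z)

for alpha in (0.10, 0.05, 0.01):
    print(f"alpha={alpha:.2f}  power={power_at(alpha):.2f}")
```

As the threshold tightens from 0.10 to 0.01, power in this scenario falls from roughly 0.80 to under 0.50: fewer false positives, but many more missed real effects.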
The Five Things P-Values Do Not Tell You
1. The Probability That Your Variant Works
This is the big one. A small p-value does not mean there is a high probability the variant is effective. It means the data would be surprising if the variant were ineffective. Those are different statements with different implications.
2. The Size of the Effect
P-values say nothing about magnitude. A change that improved a metric by a trivial amount and a change that doubled it can produce identical p-values if the sample sizes differ. Always look at the effect size alongside the p-value.
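A quick calculation makes this vivid. The numbers below are invented to engineer the coincidence: a 2% relative lift on a huge sample and a 50% relative lift on a small one land at nearly the same p-value (normal-approximation z-test for proportions).

```python
from math import sqrt
from statistics import NormalDist

def z_test_p(rate_a, rate_b, n):
    """Two-sided p-value for a difference in conversion rates,
    with n users per arm (pooled normal approximation)."""
    pooled = (rate_a + rate_b) / 2
    se = sqrt(2 * pooled * (1 - pooled) / n)
    z = abs(rate_b - rate_a) / se
    return 2 * (1 - NormalDist().cdf(z))

p_small_effect = z_test_p(0.100, 0.102, 180_000)  # a 2% relative lift
p_large_effect = z_test_p(0.10, 0.15, 350)        # a 50% relative lift
print(f"{p_small_effect:.3f} vs {p_large_effect:.3f}")
```

Both p-values come out near 0.046. A dashboard that shows only the p-value hides the fact that one effect is two orders of magnitude more consequential than the other.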
3. Whether the Result Is Reproducible
A significant p-value in one test does not guarantee you will get the same result if you run the test again. P-values are calculated from one sample. Different samples produce different p-values, and the variation can be substantial — especially with smaller sample sizes.
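The spread is easy to demonstrate. The simulation below repeatedly reruns the same hypothetical test, with the same fixed true rates, and records the p-value each replication produces.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)

def replicate_p(true_a=0.10, true_b=0.13, n=400):
    """Draw one fresh A/B sample from fixed true rates; return its p-value."""
    conv_a = sum(random.random() < true_a for _ in range(n))
    conv_b = sum(random.random() < true_b for _ in range(n))
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    z = abs(conv_b - conv_a) / (n * se) if se else 0.0
    return 2 * (1 - NormalDist().cdf(z))

ps = sorted(replicate_p() for _ in range(200))
print(f"min={ps[0]:.4f}  median={ps[100]:.3f}  max={ps[-1]:.3f}")
```

Even though the true effect never changes, the replicated p-values range from deeply "significant" to nowhere close. One observed p-value is a single draw from this spread.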
4. Whether You Should Ship the Variant
The decision to ship involves business context that no statistical test can capture: implementation cost, maintenance burden, user experience implications, opportunity cost, and strategic alignment. A p-value is one input into that decision, not the decision itself.
5. Whether Your Test Was Well-Designed
A small p-value from a poorly designed test is not reliable evidence. If the randomization was flawed, the metric was noisy, or the test ran during an atypical period, the p-value inherits those problems.
How P-Values Behave During a Test
One of the most dangerous properties of p-values is that they fluctuate during a test. If you check a p-value on day one, it might be below the threshold. On day three, it might be above. On day five, below again.
This is normal and expected. Early in a test, sample sizes are small, estimates are noisy, and p-values bounce around. They stabilize as more data accumulates. But if you make decisions based on the first time the p-value crosses the threshold, you are not running a proper experiment — you are exploiting random fluctuation.
This is why fixed-horizon testing exists. You calculate the sample size you need before the test starts, run until you reach it, and then evaluate. Checking early is fine for monitoring, but the decision should happen at the planned endpoint.
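The cost of peeking can be measured directly. The simulation below generates data with no effect at all, checks for significance after every batch, and stops at the first "significant" look; the batch sizes and number of looks are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(2)

def significant_anywhere(looks=10, n_per_look=100, alpha=0.05):
    """Under a true null, peek after each batch and declare a winner the
    first time p < alpha. Returns True on any (false) positive."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    total, n = 0.0, 0
    for _ in range(looks):
        total += sum(random.gauss(0, 1) for _ in range(n_per_look))  # null data
        n += n_per_look
        if abs(total) / sqrt(n) >= z_crit:   # |z| at this interim look
            return True
    return False

rate = sum(significant_anywhere() for _ in range(1000)) / 1000
print(f"false positive rate with peeking: {rate:.3f} (nominal alpha: 0.05)")
```

With ten peeks, the realized false positive rate lands near 20%, roughly four times the nominal 5%. This is the random fluctuation being exploited.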
Practical Guidelines for Using P-Values
Set the threshold before the test
Decide what p-value threshold you will use as your decision criterion before you see any data. Changing the threshold after seeing results is a form of p-hacking.
Report the actual p-value, not just "significant" or "not significant"
There is useful information in the magnitude of the p-value. A result just barely below the threshold is weaker evidence than one far below it. Binary significant/not-significant labels throw away this information.
Pair p-values with confidence intervals
Confidence intervals communicate the same statistical information as p-values but add information about effect size and precision. A confidence interval that ranges from a negligible improvement to a large one tells a different story than one that ranges from moderate to large.
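A sketch of the pairing, using invented conversion counts and a normal-approximation interval for the difference in rates:

```python
from math import sqrt
from statistics import NormalDist

def diff_ci(conv_a, n_a, conv_b, n_b, level=0.95):
    """Normal-approximation confidence interval for the difference
    in conversion rates (variant minus control)."""
    ra, rb = conv_a / n_a, conv_b / n_b
    se = sqrt(ra * (1 - ra) / n_a + rb * (1 - rb) / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = rb - ra
    return (diff - z * se, diff + z * se)

# Hypothetical test: 100/2000 on control, 130/2000 on variant.
lo, hi = diff_ci(100, 2000, 130, 2000)
print(f"95% CI for the lift: [{lo:.4f}, {hi:.4f}]")
```

Here the interval runs from roughly a 0.1-point lift to a 2.9-point lift: nominally significant, but compatible with anything from a negligible improvement to a substantial one. The p-value alone would hide that range.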
Adjust for multiple comparisons
If you test multiple variants, segments, or metrics, the probability that at least one comparison produces a spurious significant result increases rapidly. Apply corrections when making multiple comparisons from the same experiment.
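How quickly the risk grows follows from basic probability, assuming the comparisons are independent and every null is true:

```python
def familywise_error(alpha, m):
    """Chance of at least one spurious 'significant' result across m
    independent comparisons when no real effects exist."""
    return 1 - (1 - alpha) ** m

print(round(familywise_error(0.05, 1), 3))        # one comparison
print(round(familywise_error(0.05, 10), 3))       # ten comparisons, uncorrected
print(round(familywise_error(0.05 / 10, 10), 3))  # Bonferroni correction
```

At a 0.05 threshold, ten uncorrected comparisons carry about a 40% chance of at least one spurious significant result. Dividing the threshold by the number of comparisons (the Bonferroni correction) restores the familywise rate to roughly 5%.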
Do not use p-values as the sole decision criterion
The best experimentation programs use p-values as one input alongside effect size, confidence intervals, practical significance, and business context. A statistically significant result that does not matter for the business is not a win.
When P-Values Work Well and When They Do Not
P-values work well when:
- You have a clear, pre-specified hypothesis
- The sample size is adequate for the effect you expect
- You evaluate at a fixed endpoint
- You test a single primary metric
- You interpret the result in context, not in isolation
P-values work poorly when:
- You peek at results repeatedly and stop when significant
- You test many metrics and report the best one
- Your sample size is too small to detect realistic effects
- You treat the threshold as a bright line between truth and noise
- The decision is high-stakes and deserves a richer analysis
The Bayesian Alternative
Some experimentation programs have moved to Bayesian methods, which directly estimate the probability that one variant is better than another. This is often more intuitive and more aligned with what decision-makers actually want to know.
Bayesian methods require specifying prior beliefs about the likely effect size, which some teams find uncomfortable. But they avoid many of the interpretive pitfalls of p-values and produce outputs that map more naturally to business decisions.
Whether you use frequentist or Bayesian methods, the underlying data is the same. The difference is in how uncertainty is communicated and how decisions are framed.
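A minimal sketch of the Bayesian output for a conversion test, assuming uniform Beta(1, 1) priors on each arm and invented counts. Rather than a p-value, it estimates the quantity decision-makers usually ask for: the probability the variant beats the control.

```python
import random

random.seed(3)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000):
    """Monte Carlo estimate of P(rate_B > rate_A) under uniform
    Beta(1, 1) priors on each arm's true conversion rate."""
    wins = 0
    for _ in range(draws):
        rate_a = random.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = random.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical test: 100/2000 on control, 125/2000 on variant.
p_beat = prob_b_beats_a(100, 2000, 125, 2000)
print(f"P(variant beats control) ≈ {p_beat:.2f}")
```

The uniform prior is itself an assumption; a team with historical data on typical lifts would encode it here, which is exactly the discomfort and the power the section describes.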
FAQ
If p-values are so confusing, why do we still use them?
Historical momentum and tooling. Most experimentation platforms default to frequentist methods with p-values. They work well enough when used correctly, and there is a large body of guidance on how to design tests around them.
What is the difference between a p-value and a confidence level?
A confidence level is the complement of the significance threshold: the confidence level is one minus the threshold, so a 0.05 threshold corresponds to a 95% confidence level. The p-value is the actual number from your data; the threshold is the line you drew before the test.
Can I compare p-values across different tests?
Generally no. P-values depend on sample size, variance, and test design. A smaller p-value from one test is not necessarily stronger evidence than a larger one from a different test with different parameters.
What should I do when stakeholders ask if a test is significant?
Reframe the conversation. Instead of a yes/no answer, share the estimated effect size, the confidence interval, and the practical implications. Significance is a summary, not the full story.