Two Schools, One Decision
Every A/B testing platform makes a philosophical choice about how to analyze your experiments. Some use frequentist methods — p-values, confidence intervals, fixed-horizon tests. Others use Bayesian methods — posterior probabilities, credible intervals, continuous monitoring. A few offer both.
This is not an academic debate. The method you use changes how you design tests, when you can read results, what questions you can answer, and how you communicate uncertainty to stakeholders. Choosing the wrong one for your context creates friction at best and incorrect decisions at worst.
Frequentist Methods: The Traditional Approach
Frequentist statistics treats probability as long-run frequency. A confidence interval is not a statement about the probability that the true value lies within the range. It is a statement about the procedure: if you repeated the experiment many times, a certain proportion of the intervals would contain the truth.
How frequentist A/B testing works
- Before the test: Calculate sample size based on MDE, significance level, and power. Set the significance threshold.
- During the test: Collect data. Do not make decisions based on intermediate results (or use sequential methods if you want valid early looks).
- At the planned endpoint: Calculate the test statistic and p-value. Compare to the threshold. Report the confidence interval.
- Decision: If significant, the effect is statistically detectable. If not, you cannot distinguish the effect from zero.
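The fixed-horizon workflow above can be sketched in a few lines of standard-library Python. All numbers here (a 10% baseline, a 2-point MDE, the conversion counts) are illustrative assumptions, not recommendations, and the helper names are mine:

```python
# Sketch of a fixed-horizon two-proportion z-test with an upfront
# sample-size calculation. Illustrative numbers only.
from statistics import NormalDist

norm = NormalDist()

def sample_size_per_arm(p_base, mde, alpha=0.05, power=0.8):
    """Approximate users per arm needed to detect an absolute lift `mde`."""
    p_var = p_base + mde
    z_alpha = norm.inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.inv_cdf(power)
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return (z_alpha + z_beta) ** 2 * variance / mde ** 2

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Pooled z statistic and two-sided p-value at the planned endpoint."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value

# Plan: baseline 10% conversion, detect a +2-point absolute lift.
n = sample_size_per_arm(0.10, 0.02)
# Endpoint: 1000/10000 conversions in A vs 1100/10000 in B.
z, p = two_proportion_z(1000, 10000, 1100, 10000)
```

The decision step is then a single comparison of `p` against the pre-set threshold; nothing in the procedure depends on intermediate looks at the data.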
Strengths
- Well-understood error guarantees. The false positive rate is controlled at the specified level over the long run. If you follow the protocol, you know your error rate.
- Objective. No subjective prior is required. Two analysts looking at the same data with the same test design will reach the same conclusion.
- Mature tooling. Most experimentation platforms, calculators, and textbooks use frequentist methods. There is extensive guidance on design and analysis.
Weaknesses
- Fixed-horizon requirement. The standard approach requires you to wait until the planned sample size is reached. Early peeking inflates false positives unless you use specialized sequential methods.
- Counterintuitive interpretation. P-values and confidence intervals are notoriously misunderstood. Most stakeholders interpret them as Bayesian probabilities, which they are not.
- Binary outcomes. The result is either significant or not. This all-or-nothing framing can be unhelpful when you need to make nuanced decisions.
- No direct probability of hypotheses. Frequentist methods cannot tell you the probability that variant B is better than variant A. They tell you the probability of observing data at least as extreme as yours if the null hypothesis were true.
Bayesian Methods: The Alternative
Bayesian statistics treats probability as a degree of belief. You start with a prior belief about the likely effect, update it with data, and end with a posterior belief. The result is a direct probability statement about the hypothesis.
How Bayesian A/B testing works
- Before the test: Specify a prior distribution for the effect size. This encodes your belief about the likely effect before seeing data.
- During the test: Update the posterior as data accumulates. You can check results at any time without inflating error rates.
- At any point: Report the posterior probability that the variant is better, the expected effect size, and the credible interval.
- Decision: Ship when the probability of improvement exceeds your threshold, or stop when it becomes clear the variant is unlikely to be beneficial.
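For conversion metrics, the workflow above is often implemented with a Beta-Binomial model, where the posterior has a closed form and "probability B is better" can be estimated by sampling. A minimal sketch, with a flat Beta(1, 1) prior and made-up counts:

```python
# Beta-Binomial Bayesian A/B sketch: under a Beta prior, the posterior for a
# conversion rate after `conv` conversions in `n` users is
# Beta(alpha + conv, beta + n - conv). Illustrative numbers only.
import random

random.seed(0)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b,
                   prior_alpha=1, prior_beta=1, draws=100_000):
    """Monte Carlo estimate of P(rate_B > rate_A) from the two posteriors."""
    wins = 0
    for _ in range(draws):
        rate_a = random.betavariate(prior_alpha + conv_a,
                                    prior_beta + n_a - conv_a)
        rate_b = random.betavariate(prior_alpha + conv_b,
                                    prior_beta + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# 100/1000 conversions in A vs 125/1000 in B.
p_better = prob_b_beats_a(100, 1000, 125, 1000)
```

The decision rule is then direct: ship when `p_better` crosses your threshold (say 0.95), and this quantity can be recomputed every time new data arrives.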
Strengths
- Direct probability statements. "There is a high probability that variant B improves conversion" is exactly what decision-makers want to know, and Bayesian methods provide it directly.
- Continuous monitoring. You can check results at any time without statistical penalties. The posterior is valid whenever you look at it.
- Intuitive interpretation. Credible intervals mean what people think confidence intervals mean: there is a specific probability that the true value is in this range.
- Natural handling of uncertainty. The posterior quantifies uncertainty in a way that maps directly to decision-making under uncertainty.
Weaknesses
- Prior specification. You must choose a prior, and the choice affects the result — especially with small samples. Critics argue this introduces subjectivity.
- Computational complexity. Bayesian methods are more computationally intensive, though modern computing has made this less of an issue.
- Fewer formal error guarantees. Bayesian methods do not provide the same frequentist error rate guarantees, though calibrated Bayesian methods can achieve similar practical performance.
- Sensitivity to prior with small data. If your sample is small, the prior has outsized influence on the posterior. A poorly chosen prior leads to misleading results.
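The small-sample sensitivity is easy to see numerically. Using the Beta-Binomial posterior mean, two deliberately different priors (the specific Beta parameters below are arbitrary choices for illustration) disagree sharply at n = 20 but nearly agree at n = 2000:

```python
# Prior sensitivity illustration: same observed conversion rate (~10%),
# two different Beta priors, small vs large sample. Numbers are assumed.
def posterior_mean(conv, n, alpha, beta):
    """Mean of the Beta(alpha + conv, beta + n - conv) posterior."""
    return (alpha + conv) / (alpha + beta + n)

# An "optimistic" Beta(5, 20) prior vs a "skeptical" Beta(1, 50) prior.
small_opt = posterior_mean(2, 20, 5, 20)        # 2/20 observed
small_skep = posterior_mean(2, 20, 1, 50)
large_opt = posterior_mean(200, 2000, 5, 20)    # 200/2000 observed
large_skep = posterior_mean(200, 2000, 1, 50)

gap_small = abs(small_opt - small_skep)   # sizeable disagreement
gap_large = abs(large_opt - large_skep)   # the prior has washed out
```

With 20 observations the two analysts would report noticeably different rates; with 2000 the priors contribute almost nothing.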
Head-to-Head Comparison
When can you look at results?
Frequentist: At the planned endpoint (unless using sequential methods).
Bayesian: Anytime. The posterior is always valid.
This is the biggest practical difference for most teams. Bayesian methods are naturally suited to continuous monitoring, which is how most product teams want to operate.
What does the result tell you?
Frequentist: Data at least as extreme as what you observed would be unlikely if the null hypothesis were true.
Bayesian: There is a specific probability the variant is better, and the effect likely falls in a given range.
Bayesian outputs are more directly actionable because they answer the question stakeholders actually ask.
How do you handle early stopping?
Frequentist: Stopping early inflates false positives. You need sequential testing corrections.
Bayesian: Early stopping is built into the framework. You stop when the posterior crosses your decision threshold.
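The peeking inflation is worth seeing concretely. The simulation below runs A/A tests (no true difference) and declares a winner whenever any intermediate look crosses the fixed 5% threshold; the parameters (peek schedule, sample size, base rate) are illustrative assumptions:

```python
# Simulation of the peeking problem: repeated looks at an A/A test with an
# uncorrected 5% significance threshold. Illustrative parameters.
import random
from statistics import NormalDist

random.seed(42)
Z_CRIT = NormalDist().inv_cdf(0.975)   # two-sided 5% threshold

def aa_test_with_peeks(n_per_arm=2000, peek_every=200, p=0.1):
    """Run one A/A test, peeking every `peek_every` users per arm."""
    conv_a = conv_b = 0
    for i in range(1, n_per_arm + 1):
        conv_a += random.random() < p
        conv_b += random.random() < p
        if i % peek_every == 0:
            pool = (conv_a + conv_b) / (2 * i)
            se = (2 * pool * (1 - pool) / i) ** 0.5
            if se > 0 and abs(conv_b - conv_a) / (i * se) > Z_CRIT:
                return True            # declared "significant" at some peek
    return False

sims = 500
rate = sum(aa_test_with_peeks() for _ in range(sims)) / sims
# `rate` is the realized false positive rate across simulations; with ten
# looks per test it lands well above the nominal 5%.
```

This is the behavior sequential corrections (or Bayesian stopping rules) are designed to address.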
How sensitive is the method to assumptions?
Frequentist: Sensitive to the assumption that you follow the testing protocol (fixed sample, no peeking, pre-registered hypothesis).
Bayesian: Sensitive to the choice of prior. With enough data, the prior washes out, but with small samples, it matters.
What about error rates?
Frequentist: Long-run error rates are guaranteed by design.
Bayesian: Long-run error rates are not explicitly controlled but can be calibrated to match frequentist performance.
Which Should You Use?
Use frequentist methods when:
- Your team is comfortable with the protocol. You can commit to fixed sample sizes, pre-registered hypotheses, and endpoint analysis.
- Error rate guarantees matter. Regulatory contexts, high-stakes decisions, or environments where long-run false positive rates need explicit control.
- Your tools use them. If your experimentation platform is frequentist, switching methods means switching platforms.
- You want objectivity. No prior specification means no debate about subjective choices.
Use Bayesian methods when:
- You need continuous monitoring. Your team will check results before the planned endpoint, and you want those checks to be valid.
- Stakeholders want probability statements. "There is a high probability B is better" is more useful than "the p-value is below the threshold" in most product conversations.
- You run many small tests. Bayesian methods handle small samples more gracefully when the prior is reasonable.
- You want flexible stopping rules. The ability to stop early when results are clear saves traffic and accelerates learning.
The pragmatic answer
For most product experimentation programs, the differences are smaller than they appear. With adequate sample sizes, well-calibrated priors, and proper sequential testing corrections, both methods lead to similar decisions. The choice often comes down to tooling, team familiarity, and organizational preference.
If your current method is working and your team understands it, there may be no reason to switch. If you are starting from scratch, Bayesian methods are often a better fit for how product teams actually work — checking results frequently, making incremental decisions, and wanting intuitive uncertainty quantification.
The Hybrid Approach
Some sophisticated experimentation programs use both methods:
- Frequentist for high-stakes tests where formal error guarantees are important.
- Bayesian for rapid iteration where continuous monitoring and early stopping save significant time.
This requires the team to understand both frameworks, which is a higher bar. But it gives you the right tool for each situation.
FAQ
Is Bayesian testing less rigorous than frequentist testing?
No. Both are mathematically rigorous. They answer different questions with different assumptions. Bayesian methods are rigorous about probability of hypotheses; frequentist methods are rigorous about long-run error rates.
What prior should I use for A/B testing?
A weakly informative prior centered on zero effect is standard for most product experiments. This expresses the reasonable belief that most changes have small effects. Avoid informative priors unless you have strong evidence to support them.
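One common way to make "weakly informative" concrete for a conversion metric is a Beta prior centered on the baseline rate but worth only a handful of pseudo-observations, so real data dominates quickly. The specific prior below is an illustrative choice, not a recommendation:

```python
# A weakly informative Beta prior centered near a 10% baseline:
# Beta(2, 18) has mean 0.10 but is worth only 20 pseudo-observations,
# so even modest data pulls the posterior toward what was observed.
prior_alpha, prior_beta = 2, 18            # mean 0.10, effective n = 20

conversions, visitors = 50, 400            # observed data, rate 0.125
post_alpha = prior_alpha + conversions
post_beta = prior_beta + visitors - conversions

post_mean = post_alpha / (post_alpha + post_beta)
# The posterior mean sits close to the observed 0.125, nudged only
# slightly toward the 0.10 prior.
```

A strongly informative prior, by contrast, would need hundreds of pseudo-observations to justify, which is exactly the "strong evidence" bar mentioned above.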
Can I switch from frequentist to Bayesian mid-experiment?
Technically you can re-analyze the data using Bayesian methods at any time. But mixing frameworks within a single experiment complicates interpretation. Better to pick one method before the test starts.
Do Bayesian methods require more data?
Not inherently. With a reasonable prior, Bayesian methods can reach conclusions with less data than frequentist methods, especially when the effect is large. With a vague prior, the data requirements are similar.