Bayesian vs Frequentist A/B Testing
A practitioner's comparison of Bayesian and frequentist A/B testing — when to use each framework, the real cost of wrong decisions, and the speed-vs-accuracy tradeoff.
Bayesian Strengths
- Intuitive probability statements (e.g., '92% chance B beats A')
- Can peek at results without inflating error rates
- Incorporates prior knowledge from past experiments
- Natural stopping rules — stop when confidence is sufficient
- Better for communicating risk to non-technical stakeholders
Frequentist Strengths
- Well-understood mathematical guarantees on error rates
- Standardized methodology — reproducible across teams
- Clear pre-registration and analysis plan
- Regulatory and academic acceptance
- Simpler to implement and audit
Bayesian Weaknesses
- Prior selection can introduce bias if not carefully chosen
- Computationally more expensive at scale
- Harder to pre-register and reproduce exactly
- Can give false confidence with poorly chosen priors
- Less standardized — implementations vary across platforms
Frequentist Weaknesses
- Cannot peek at results without sequential testing corrections
- Binary outcomes (significant/not) lose nuance
- P-values are widely misinterpreted, even by practitioners
- Requires fixed sample size commitment upfront
- Slower to reach conclusions on low-traffic pages
I use frequentist for high-stakes decisions, Bayesian for velocity. When I'm testing a new checkout flow that could cost millions in revenue if we get it wrong, I want the mathematical guarantees of a well-powered frequentist test. But for iterating on headline copy or CTA colors where the downside is small, Bayesian lets me move faster. The real mistake is treating this as a religious debate — it's a tool selection problem. Match the framework to the decision's economic weight.
— Atticus Li
The Framework Wars Are a Distraction
Every experimentation conference I attend, someone reignites the Bayesian vs frequentist debate as if picking the wrong framework will invalidate your entire testing program. It won't. What will invalidate your program is making bad business decisions because you chose the wrong tool for the job.
Let me be direct: both frameworks produce valid statistical inferences under their respective assumptions. The question isn't which is "more correct" — it's which maps better to your decision-making context.
How Frequentist Testing Actually Works
Classical frequentist A/B testing is built on the Neyman-Pearson framework. You define a null hypothesis (typically "no difference between A and B"), choose an acceptable false positive rate (alpha, usually 0.05), calculate the sample size needed to detect your minimum detectable effect, and run the test to completion.
The p-value tells you: given that there is no real difference, what is the probability of observing data at least this extreme? If that probability falls below your threshold, you reject the null hypothesis.
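That definition translates directly into code. Below is a minimal sketch of the standard two-proportion z-test, the usual frequentist analysis for conversion rates; the traffic and conversion counts are invented for illustration.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: A and B share one conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate: the best estimate of the common rate under the null.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Probability of a |z| at least this extreme if the null were true.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical counts: 5.0% vs 5.6% conversion on 10k visitors each.
two_proportion_p_value(500, 10_000, 560, 10_000)  # just above 0.05
```

Note what the returned number is: the probability of data this extreme given no difference, not the probability that B is better.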
What most people get wrong: The p-value does not tell you the probability that B is better than A. It does not tell you the probability that your result is a false positive. These are common misinterpretations that lead to bad decisions, and I see them in experimentation programs at every level of maturity.
The frequentist framework's strength is its mathematical guarantees. If you set alpha at 0.05 and power at 0.80, and you run the test correctly, you will make a Type I error no more than 5% of the time and catch true effects at least 80% of the time. These are contractual guarantees — they hold as long as you follow the protocol.
How Bayesian Testing Actually Works
Bayesian A/B testing starts from a fundamentally different question: given the data I've observed, what is the probability that B is better than A?
You begin with a prior distribution — your belief about the likely effect size before seeing any data. As data arrives, Bayes' theorem updates this prior into a posterior distribution. The posterior gives you direct probability statements: "There is a 94% probability that variant B increases conversion rate by at least 0.5%."
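For conversion-rate tests the update is mechanical: with a Beta prior on each arm's rate, the posterior is also a Beta, and the headline probability can be estimated by sampling. A sketch assuming flat Beta(1, 1) priors (the "let the data speak" default) and the same invented counts as before:

```python
import random

random.seed(0)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000):
    """Monte Carlo estimate of P(rate_B > rate_A) under flat Beta(1, 1) priors.

    With a Beta prior and binomial data, the posterior for each arm is
    Beta(conversions + 1, non-conversions + 1).
    """
    wins = 0
    for _ in range(draws):
        theta_a = random.betavariate(conv_a + 1, n_a - conv_a + 1)
        theta_b = random.betavariate(conv_b + 1, n_b - conv_b + 1)
        wins += theta_b > theta_a
    return wins / draws

prob_b_beats_a(500, 10_000, 560, 10_000)  # roughly 0.97
```

An informative prior would simply replace the two `+ 1` pseudo-counts with counts derived from historical experiments.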
The prior is both the strength and the vulnerability. A well-calibrated prior, built from your organization's historical test data, accelerates learning. A poorly chosen prior — or worse, a prior selected after seeing the data — undermines the entire analysis.
Modern Bayesian A/B testing platforms typically use weakly informative priors that let the data dominate quickly. This is a reasonable default, but it means you're leaving one of Bayesian testing's key advantages on the table.
The Business Economics of Each Framework
Here is where the comparison gets practical. Every testing decision has an economic context: the cost of a false positive, the cost of a false negative, the opportunity cost of waiting, and the cost of the test itself.
Cost of Wrong Decisions
Frequentist false positive cost: You ship a change that doesn't actually help (or actively hurts). A 5% alpha means roughly 1 in 20 tests of a variant with no real effect will still produce a "win." For a company running 100 tests per year, most of them nulls, that is on the order of 5 shipped changes that aren't real winners. If each test touches $10M in annual revenue, the expected cost of false positives is significant but bounded.
Bayesian false positive cost: Harder to quantify precisely because it depends on your decision threshold (what probability of beating control do you require?) and your prior calibration. But Bayesian methods let you explicitly model the cost of being wrong, which is a genuine advantage for economic decision-making.
Speed vs Accuracy Tradeoff
The real competitive advantage in experimentation isn't accuracy at any single test level — it's the cumulative value of decisions made across your entire program over time. This is where the frameworks diverge most.
Frequentist programs optimize for individual test validity. Each test has rigorous error control, but the time cost is fixed. If you need 40,000 visitors per variant to detect a 2% relative lift, you wait until you get 40,000 visitors. Period.
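That upfront commitment is itself a straightforward calculation. Here is a sketch of the standard normal-approximation sample-size formula; the critical z values are hardcoded for a two-sided alpha of 0.05 and 80% power, and because the required sample is extremely sensitive to the baseline conversion rate, the 20% baseline below is purely illustrative rather than the source of the 40,000 figure above.

```python
import math

def visitors_per_variant(baseline_rate, relative_lift,
                         alpha_z=1.96, power_z=0.8416):
    """Per-variant sample size for a two-proportion test.

    alpha_z is the critical z for two-sided alpha = 0.05; power_z is the
    z for 80% power. Swap in other quantiles for other settings.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    # Sum of the two arms' binomial variances.
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((alpha_z + power_z) ** 2 * variance / (p2 - p1) ** 2)

visitors_per_variant(0.20, 0.02)  # roughly 158,000 at a 20% baseline
```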
Bayesian programs optimize for portfolio-level returns. You can stop early when evidence is overwhelming, reallocating traffic and engineering resources to the next test sooner. Over a year, this velocity advantage compounds.
I model this as an expected value calculation. If a Bayesian approach lets you run 30% more tests per year, even with slightly higher individual error rates, the portfolio-level expected value is often higher. The math is straightforward — run the Monte Carlo simulation for your specific traffic and revenue numbers.
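That simulation doesn't need to be elaborate. Below is a minimal sketch; every number in it (win rate, detection probabilities, dollar values, the 30% velocity gain) is a hypothetical placeholder to be replaced with your own traffic and revenue figures, not a claim from this article.

```python
import random

random.seed(1)

def program_value(tests_per_year, true_win_rate, detect_prob,
                  false_ship_prob, value_per_win, cost_per_false_ship,
                  sims=10_000):
    """Simulated expected annual value of a testing program.

    true_win_rate: share of variants that genuinely beat control.
    detect_prob: chance a real winner is detected and shipped.
    false_ship_prob: chance a no-effect variant gets shipped anyway.
    """
    total = 0.0
    for _ in range(sims):
        year = 0.0
        for _ in range(tests_per_year):
            if random.random() < true_win_rate:
                if random.random() < detect_prob:
                    year += value_per_win          # real winner shipped
            elif random.random() < false_ship_prob:
                year -= cost_per_false_ship        # null variant shipped
        total += year
    return total / sims

# Hypothetical comparison: a frequentist program vs. a Bayesian program
# trading slightly laxer error control for 30% more tests per year.
freq = program_value(100, 0.10, 0.80, 0.05, 250_000, 100_000)
bayes = program_value(130, 0.10, 0.78, 0.07, 250_000, 100_000)
```

With these particular placeholders the extra velocity wins; with a lower true win rate or costlier false ships it can easily flip, which is exactly why the simulation should be run with your own numbers.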
When the Choice Actually Matters
High-Stakes Decisions: Use Frequentist
When the cost of shipping a false positive is catastrophic — a complete checkout redesign, a pricing model change, a core algorithm update — use frequentist methods. The fixed sample size and predetermined decision rule protect you from your own impatience.
I've seen teams "peek" at Bayesian posterior probabilities and ship changes based on a 78% probability of improvement. On a $200M revenue stream, that 22% chance of being wrong isn't a rounding error — it's a potential $44M mistake.
Velocity Decisions: Use Bayesian
For lower-stakes, high-frequency decisions — headline tests, CTA copy, email subject lines, image variants — Bayesian methods let you learn faster and iterate more. The per-test error rate matters less when you're running dozens of tests and the individual impact of each is bounded.
The Hybrid Approach
The most sophisticated experimentation programs I've worked with use both. They classify tests by economic impact tier:
- Tier 1 (high impact, high risk): Frequentist, pre-registered, full sample size
- Tier 2 (moderate impact): Bayesian with calibrated priors, moderate stopping rules
- Tier 3 (low impact, high velocity): Bayesian with aggressive stopping, multi-armed bandit allocation
This isn't fence-sitting — it's resource optimization. You're allocating your statistical rigor budget where it has the highest return.
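The tier routing above can be captured as a plain function. The revenue threshold mirrors the $10M line used later in the decision tree; the test-frequency threshold is purely illustrative.

```python
def classify_test(annual_revenue_exposure, similar_tests_per_quarter):
    """Route a proposed test to a statistical rigor tier.

    Thresholds are illustrative defaults, not prescriptions.
    """
    if annual_revenue_exposure >= 10_000_000:
        return "Tier 1: frequentist, pre-registered, full sample size"
    if similar_tests_per_quarter >= 10:
        return "Tier 3: Bayesian, aggressive stopping, bandit allocation"
    return "Tier 2: Bayesian, calibrated priors, moderate stopping rules"
```

Encoding the taxonomy as code has a side benefit: the routing decision gets logged with the test, which supports the documentation step described below.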
Common Mistakes in Both Camps
Frequentist Mistakes
- **Peeking without correction:** Checking results daily and stopping when p < 0.05 inflates your actual false positive rate to 20-30%. If you must peek, use sequential testing methods like alpha spending functions.
- **Ignoring practical significance:** A statistically significant 0.1% lift on a low-revenue page isn't worth the engineering resources to productionize. Always frame results in revenue terms.
- **Underpowered tests:** Running tests without adequate sample size, then concluding "no effect" when you simply didn't have enough data to detect one.
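The peeking inflation is easy to demonstrate. The sketch below models the test statistic of an A/A test (no true difference) as a standardized random walk checked at ten equally spaced looks. That is a simplification of real batched traffic, but it reproduces the effect: the realized false positive rate lands around 19-20%, roughly four times the nominal 5%.

```python
import math
import random

random.seed(2)

def peeking_false_positive_rate(peeks=10, sims=100_000, z_crit=1.96):
    """Fraction of A/A tests declared 'significant' at any of `peeks` looks.

    Under the null, each equal-sized batch of traffic adds an independent
    N(0, 1) increment to the cumulative statistic; the z-score after k
    batches is the cumulative sum divided by sqrt(k).
    """
    hits = 0
    for _ in range(sims):
        s = 0.0
        for k in range(1, peeks + 1):
            s += random.gauss(0.0, 1.0)
            if abs(s) / math.sqrt(k) > z_crit:
                hits += 1  # stopped early on a spurious 'win'
                break
    return hits / sims
```

Setting `peeks=1` recovers the disciplined fixed-horizon behavior: a false positive rate near the nominal 5%.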
Bayesian Mistakes
- **Prior hacking:** Choosing priors after seeing the data to get the result you want. This is the Bayesian equivalent of p-hacking and it is equally corrosive.
- **Overconfidence in posterior probabilities:** A 90% probability of improvement sounds compelling, but if your model is misspecified, that 90% is meaningless.
- **Ignoring multiple comparisons:** Testing 10 variants and reporting the one with the highest posterior probability without adjustment is still data dredging.
The Minimum Detectable Effect as a Business Decision
Regardless of which framework you choose, the most important decision in any test isn't statistical — it's defining what effect size is worth detecting.
In frequentist terms, this is your minimum detectable effect (MDE). In Bayesian terms, it's your region of practical equivalence (ROPE). Both serve the same business purpose: they answer the question "What is the smallest improvement that justifies the cost of implementation?"
This is fundamentally a business economics question. If implementing a winning variant costs $50,000 in engineering time, and the variant improves conversion by 0.1% on a page with $5M in annual throughput, the annual benefit is $5,000. That's a negative ROI. Your MDE should be set at the break-even point, not at whatever your platform defaults to.
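That break-even logic fits in one line. A sketch using the article's numbers ($50,000 of implementation cost against $5M of annual throughput), assuming revenue scales linearly with conversion rate and that the variant must pay for itself within one year; the payback horizon is an assumption of this sketch, not something the article specifies.

```python
def break_even_mde(implementation_cost, annual_revenue_throughput,
                   payback_years=1.0):
    """Smallest relative lift whose revenue gain repays the build cost.

    Assumes revenue moves linearly with conversion rate and that the
    winning variant must pay for itself within `payback_years`.
    """
    return implementation_cost / (annual_revenue_throughput * payback_years)

# The article's example: a $50k build on a $5M/year page.
break_even_mde(50_000, 5_000_000)  # 0.01, i.e. a 1% relative lift to break even
```

The article's 0.1% lift is an order of magnitude below this break-even point, which is exactly why it is a negative-ROI ship.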
My Framework Decision Tree
After running hundreds of experiments across multiple organizations, here is my decision process:
Step 1: Classify the economic impact. What is the maximum revenue exposure of this test? If it touches more than $10M in annualized revenue, go to Step 2a. Otherwise, Step 2b.
Step 2a: High-stakes path. Pre-register a frequentist test with adequate power. Define the MDE based on implementation costs. Run to completion. No peeking.
Step 2b: Velocity path. Set up a Bayesian test with priors calibrated from your historical win rate and effect size distribution. Define a probability threshold (I use 95% for moderate decisions, 90% for low-stakes). Monitor continuously and stop when the threshold is met.
Step 3: Document the decision framework, not just the result. Future you (and your successor) need to understand why you chose the approach you did.
The Bottom Line
The Bayesian vs frequentist debate is a proxy war for a more fundamental question: how do you make good decisions under uncertainty at scale? Both frameworks are tools in your decision-making toolkit. The practitioners who generate the most value aren't the ones with the strongest statistical opinions — they're the ones who match the right tool to the right decision context, every time.
Stop arguing about frameworks. Start building decision taxonomies that map statistical rigor to economic impact. That's where the real competitive advantage lives.