Bayesian vs Frequentist A/B Testing
A practitioner's comparison of Bayesian and frequentist A/B testing — when to use each framework, the real cost of wrong decisions, and the speed-vs-accuracy tradeoff.
Bayesian Strengths
- Intuitive probability statements (e.g., '92% chance B beats A')
- Can peek at results without inflating error rates
- Incorporates prior knowledge from past experiments
- Natural stopping rules — stop when confidence is sufficient
- Better for communicating risk to non-technical stakeholders
Frequentist Strengths
- Well-understood mathematical guarantees on error rates
- Standardized methodology — reproducible across teams
- Clear pre-registration and analysis plan
- Regulatory and academic acceptance
- Simpler to implement and audit
Bayesian Weaknesses
- Prior selection can introduce bias if not carefully chosen
- Computationally more expensive at scale
- Harder to pre-register and reproduce exactly
- Can give false confidence with poorly chosen priors
- Less standardized — implementations vary across platforms
Frequentist Weaknesses
- Cannot peek at results without sequential testing corrections
- Binary outcomes (significant/not) lose nuance
- P-values are widely misinterpreted, even by practitioners
- Requires fixed sample size commitment upfront
- Slower to reach conclusions on low-traffic pages
I use frequentist for high-stakes decisions, Bayesian for velocity. When I'm testing a new checkout flow that could cost millions in revenue if we get it wrong, I want the mathematical guarantees of a well-powered frequentist test. But for iterating on headline copy or CTA colors where the downside is small, Bayesian lets me move faster. The real mistake is treating this as a religious debate — it's a tool selection problem. Match the framework to the decision's economic weight.
— Atticus Li
The Framework Wars Are a Distraction
Every experimentation conference I attend, someone reignites the Bayesian vs frequentist debate as if picking the wrong framework will invalidate your entire testing program. It won't. What will invalidate your program is making bad business decisions because you chose the wrong tool for the job.
Let me be direct: both frameworks produce valid statistical inferences under their respective assumptions. The question isn't which is "more correct" — it's which maps better to your decision-making context.
How Frequentist Testing Actually Works
Classical frequentist A/B testing is built on the Neyman-Pearson framework. You define a null hypothesis (typically "no difference between A and B"), choose an acceptable false positive rate (alpha, usually 0.05), calculate the sample size needed to detect your minimum detectable effect, and run the test to completion.
The p-value tells you: given that there is no real difference, what is the probability of observing data at least this extreme? If that probability falls below your threshold, you reject the null hypothesis.
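That definition translates directly into code. Below is a minimal sketch of the standard two-proportion z-test, the usual frequentist analysis for conversion rates; the traffic and conversion counts are invented for illustration.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: A and B share one conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate: the best estimate of the common rate under the null.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Probability of a |z| at least this extreme if the null were true.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical counts: 5.0% vs 5.6% conversion on 10k visitors each.
two_proportion_p_value(500, 10_000, 560, 10_000)  # just above 0.05
```

Note what the returned number is: the probability of data this extreme given no difference, not the probability that B is better.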
What most people get wrong: The p-value does not tell you the probability that B is better than A. It does not tell you the probability that your result is a false positive. These are common misinterpretations that lead to bad decisions, and I see them in experimentation programs at every level of maturity.
The frequentist framework's strength is its mathematical guarantees. If you set alpha at 0.05 and power at 0.80, and you run the test correctly, you will make a Type I error no more than 5% of the time and catch true effects at least 80% of the time. These are contractual guarantees — they hold as long as you follow the protocol.
How Bayesian Testing Actually Works
Bayesian A/B testing starts from a fundamentally different question: given the data I've observed, what is the probability that B is better than A?
You begin with a prior distribution — your belief about the likely effect size before seeing any data. As data arrives, Bayes' theorem updates this prior into a posterior distribution. The posterior gives you direct probability statements: "There is a 94% probability that variant B increases conversion rate by at least 0.5%."
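For conversion-rate tests the update is mechanical: with a Beta prior on each arm's rate, the posterior is also a Beta, and the headline probability can be estimated by sampling. A sketch assuming flat Beta(1, 1) priors (the "let the data speak" default) and the same invented counts as before:

```python
import random

random.seed(0)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000):
    """Monte Carlo estimate of P(rate_B > rate_A) under flat Beta(1, 1) priors.

    With a Beta prior and binomial data, the posterior for each arm is
    Beta(conversions + 1, non-conversions + 1).
    """
    wins = 0
    for _ in range(draws):
        theta_a = random.betavariate(conv_a + 1, n_a - conv_a + 1)
        theta_b = random.betavariate(conv_b + 1, n_b - conv_b + 1)
        wins += theta_b > theta_a
    return wins / draws

prob_b_beats_a(500, 10_000, 560, 10_000)  # roughly 0.97
```

An informative prior would simply replace the two `+ 1` pseudo-counts with counts derived from historical experiments.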
The prior is both the strength and the vulnerability. A well-calibrated prior, built from your organization's historical test data, accelerates learning. A poorly chosen prior — or worse, a prior selected after seeing the data — undermines the entire analysis.
Modern Bayesian A/B testing platforms typically use weakly informative priors that let the data dominate quickly. This is a reasonable default, but it means you're leaving one of Bayesian testing's key advantages on the table.
The Business Economics of Each Framework
Here is where the comparison gets practical. Every testing decision has an economic context: the cost of a false positive, the cost of a false negative, the opportunity cost of waiting, and the cost of the test itself.
Cost of Wrong Decisions
Frequentist false positive cost: You ship a change that doesn't actually help (or actively hurts). A 5% alpha means roughly 1 in 20 tests of a variant with no real effect will still produce a "win." For a company running 100 tests per year, most of them nulls, that is on the order of 5 shipped changes that aren't real winners. If each test touches $10M in annual revenue, the expected cost of false positives is significant but bounded.
Bayesian false positive cost: Harder to quantify precisely because it depends on your decision threshold (what probability of beating control do you require?) and your prior calibration. But Bayesian methods let you explicitly model the cost of being wrong, which is a genuine advantage for economic decision-making.
Speed vs Accuracy Tradeoff
The real competitive advantage in experimentation isn't accuracy at any single test level — it's the cumulative value of decisions made across your entire program over time. This is where the frameworks diverge most.
Frequentist programs optimize for individual test validity. Each test has rigorous error control, but the time cost is fixed. If you need 40,000 visitors per variant to detect a 2% relative lift, you wait until you get 40,000 visitors. Period.
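That upfront commitment is itself a straightforward calculation. Here is a sketch of the standard normal-approximation sample-size formula; the critical z values are hardcoded for a two-sided alpha of 0.05 and 80% power, and because the required sample is extremely sensitive to the baseline conversion rate, the 20% baseline below is purely illustrative rather than the source of the 40,000 figure above.

```python
import math

def visitors_per_variant(baseline_rate, relative_lift,
                         alpha_z=1.96, power_z=0.8416):
    """Per-variant sample size for a two-proportion test.

    alpha_z is the critical z for two-sided alpha = 0.05; power_z is the
    z for 80% power. Swap in other quantiles for other settings.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    # Sum of the two arms' binomial variances.
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((alpha_z + power_z) ** 2 * variance / (p2 - p1) ** 2)

visitors_per_variant(0.20, 0.02)  # roughly 158,000 at a 20% baseline
```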
Bayesian programs optimize for portfolio-level returns. You can stop early when evidence is overwhelming, reallocating traffic and engineering resources to the next test sooner. Over a year, this velocity advantage compounds.
I model this as an expected value calculation. If a Bayesian approach lets you run 30% more tests per year, even with slightly higher individual error rates, the portfolio-level expected value is often higher. The math is straightforward — run the Monte Carlo simulation for your specific traffic and revenue numbers.
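That simulation doesn't need to be elaborate. Below is a minimal sketch; every number in it (win rate, detection probabilities, dollar values, the 30% velocity gain) is a hypothetical placeholder to be replaced with your own traffic and revenue figures, not a claim from this article.

```python
import random

random.seed(1)

def program_value(tests_per_year, true_win_rate, detect_prob,
                  false_ship_prob, value_per_win, cost_per_false_ship,
                  sims=10_000):
    """Simulated expected annual value of a testing program.

    true_win_rate: share of variants that genuinely beat control.
    detect_prob: chance a real winner is detected and shipped.
    false_ship_prob: chance a no-effect variant gets shipped anyway.
    """
    total = 0.0
    for _ in range(sims):
        year = 0.0
        for _ in range(tests_per_year):
            if random.random() < true_win_rate:
                if random.random() < detect_prob:
                    year += value_per_win          # real winner shipped
            elif random.random() < false_ship_prob:
                year -= cost_per_false_ship        # null variant shipped
        total += year
    return total / sims

# Hypothetical comparison: a frequentist program vs. a Bayesian program
# trading slightly laxer error control for 30% more tests per year.
freq = program_value(100, 0.10, 0.80, 0.05, 250_000, 100_000)
bayes = program_value(130, 0.10, 0.78, 0.07, 250_000, 100_000)
```

With these particular placeholders the extra velocity wins; with a lower true win rate or costlier false ships it can easily flip, which is exactly why the simulation should be run with your own numbers.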
When the Choice Actually Matters
High-Stakes Decisions: Use Frequentist
When the cost of shipping a false positive is catastrophic — a complete checkout redesign, a pricing model change, a core algorithm update — use frequentist methods. The fixed sample size and predetermined decision rule protect you from your own impatience.
I've seen teams "peek" at Bayesian posterior probabilities and ship changes based on a 78% probability of improvement. On a $200M revenue stream, that 22% chance of being wrong isn't a rounding error — it's a potential $44M mistake.
Velocity Decisions: Use Bayesian
For lower-stakes, high-frequency decisions — headline tests, CTA copy, email subject lines, image variants — Bayesian methods let you learn faster and iterate more. The per-test error rate matters less when you're running dozens of tests and the individual impact of each is bounded.
The Hybrid Approach
The most sophisticated experimentation programs I've worked with use both. They classify tests by economic impact tier:
- Tier 1 (high impact, high risk): Frequentist, pre-registered, full sample size
- Tier 2 (moderate impact): Bayesian with calibrated priors, moderate stopping rules
- Tier 3 (low impact, high velocity): Bayesian with aggressive stopping, multi-armed bandit allocation
This isn't fence-sitting — it's resource optimization. You're allocating your statistical rigor budget where it has the highest return.
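The tier routing above can be captured as a plain function. The revenue threshold mirrors the $10M line used later in the decision tree; the test-frequency threshold is purely illustrative.

```python
def classify_test(annual_revenue_exposure, similar_tests_per_quarter):
    """Route a proposed test to a statistical rigor tier.

    Thresholds are illustrative defaults, not prescriptions.
    """
    if annual_revenue_exposure >= 10_000_000:
        return "Tier 1: frequentist, pre-registered, full sample size"
    if similar_tests_per_quarter >= 10:
        return "Tier 3: Bayesian, aggressive stopping, bandit allocation"
    return "Tier 2: Bayesian, calibrated priors, moderate stopping rules"
```

Encoding the taxonomy as code has a side benefit: the routing decision gets logged with the test, which supports the documentation step described below.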
Common Mistakes in Both Camps
Frequentist Mistakes
- **Peeking without correction:** Checking results daily and stopping when p < 0.05 inflates your actual false positive rate to 20-30%. If you must peek, use sequential testing methods like alpha spending functions.
- **Ignoring practical significance:** A statistically significant 0.1% lift on a low-revenue page isn't worth the engineering resources to productionize. Always frame results in revenue terms.
- **Underpowered tests:** Running tests without adequate sample size, then concluding "no effect" when you simply didn't have enough data to detect one.
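The peeking inflation is easy to demonstrate. The sketch below models the test statistic of an A/A test (no true difference) as a standardized random walk checked at ten equally spaced looks. That is a simplification of real batched traffic, but it reproduces the effect: the realized false positive rate lands around 19-20%, roughly four times the nominal 5%.

```python
import math
import random

random.seed(2)

def peeking_false_positive_rate(peeks=10, sims=100_000, z_crit=1.96):
    """Fraction of A/A tests declared 'significant' at any of `peeks` looks.

    Under the null, each equal-sized batch of traffic adds an independent
    N(0, 1) increment to the cumulative statistic; the z-score after k
    batches is the cumulative sum divided by sqrt(k).
    """
    hits = 0
    for _ in range(sims):
        s = 0.0
        for k in range(1, peeks + 1):
            s += random.gauss(0.0, 1.0)
            if abs(s) / math.sqrt(k) > z_crit:
                hits += 1  # stopped early on a spurious 'win'
                break
    return hits / sims
```

Setting `peeks=1` recovers the disciplined fixed-horizon behavior: a false positive rate near the nominal 5%.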
Bayesian Mistakes
- **Prior hacking:** Choosing priors after seeing the data to get the result you want. This is the Bayesian equivalent of p-hacking and it is equally corrosive.
- **Overconfidence in posterior probabilities:** A 90% probability of improvement sounds compelling, but if your model is misspecified, that 90% is meaningless.
- **Ignoring multiple comparisons:** Testing 10 variants and reporting the one with the highest posterior probability without adjustment is still data dredging.
The Minimum Detectable Effect as a Business Decision
Regardless of which framework you choose, the most important decision in any test isn't statistical — it's defining what effect size is worth detecting.
In frequentist terms, this is your minimum detectable effect (MDE). In Bayesian terms, it's your region of practical equivalence (ROPE). Both serve the same business purpose: they answer the question "What is the smallest improvement that justifies the cost of implementation?"
This is fundamentally a business economics question. If implementing a winning variant costs $50,000 in engineering time, and the variant improves conversion by 0.1% on a page with $5M in annual throughput, the annual benefit is $5,000. That's a negative ROI. Your MDE should be set at the break-even point, not at whatever your platform defaults to.
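That break-even logic fits in one line. A sketch using the article's numbers ($50,000 of implementation cost against $5M of annual throughput), assuming revenue scales linearly with conversion rate and that the variant must pay for itself within one year; the payback horizon is an assumption of this sketch, not something the article specifies.

```python
def break_even_mde(implementation_cost, annual_revenue_throughput,
                   payback_years=1.0):
    """Smallest relative lift whose revenue gain repays the build cost.

    Assumes revenue moves linearly with conversion rate and that the
    winning variant must pay for itself within `payback_years`.
    """
    return implementation_cost / (annual_revenue_throughput * payback_years)

# The article's example: a $50k build on a $5M/year page.
break_even_mde(50_000, 5_000_000)  # 0.01, i.e. a 1% relative lift to break even
```

The article's 0.1% lift is an order of magnitude below this break-even point, which is exactly why it is a negative-ROI ship.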
My Framework Decision Tree
After running hundreds of experiments across multiple organizations, here is my decision process:
Step 1: Classify the economic impact. What is the maximum revenue exposure of this test? If it touches more than $10M in annualized revenue, go to Step 2a. Otherwise, Step 2b.
Step 2a: High-stakes path. Pre-register a frequentist test with adequate power. Define the MDE based on implementation costs. Run to completion. No peeking.
Step 2b: Velocity path. Set up a Bayesian test with priors calibrated from your historical win rate and effect size distribution. Define a probability threshold (I use 95% for moderate decisions, 90% for low-stakes). Monitor continuously and stop when the threshold is met.
Step 3: Document the decision framework, not just the result. Future you (and your successor) need to understand why you chose the approach you did.
The Bottom Line
The Bayesian vs frequentist debate is a proxy war for a more fundamental question: how do you make good decisions under uncertainty at scale? Both frameworks are tools in your decision-making toolkit. The practitioners who generate the most value aren't the ones with the strongest statistical opinions — they're the ones who match the right tool to the right decision context, every time.
Stop arguing about frameworks. Start building decision taxonomies that map statistical rigor to economic impact. That's where the real competitive advantage lives.