The question I get more than any other from teams setting up Optimizely for the first time: "Should we use Bayesian or Frequentist?"
It's the right question to ask. The answer changes your UI, your interpretation, your stopping rules, and how you communicate results to stakeholders. Get it wrong and you'll either run tests for too long, call winners too early, or confuse your entire leadership team.
Here's what actually changes between the approaches — and a decision framework for picking the right one for your program.
What Optimizely Actually Offers
First, a clarification most guides skip: Optimizely gives you three statistical engines, not two.
Sequential (Stats Engine) — This is Optimizely's default and proprietary approach. It's built on sequential probability ratio testing with false discovery rate control. Despite common misconceptions, it's closer to frequentist than Bayesian in its mathematical foundations, but it was specifically designed to allow continuous monitoring (the thing classical frequentist testing forbids). When you haven't changed anything, this is what you're running.
Frequentist (Fixed Horizon) — A traditional null hypothesis significance testing (NHST) approach. Requires you to pre-specify your sample size. Produces a p-value and a statistical significance percentage. Results aren't statistically valid until the full pre-specified sample has been collected. This is what your stats professor taught you.
Bayesian — Updates probability estimates as data accumulates. Reports "probability that variant beats control" rather than p-values. No fixed sample size required. Can stop early if one variant is clearly winning or losing.
This article compares all three so you can make a real decision.
Frequentist Fixed Horizon: The Rigorous Option
How it works
You set your sample size before the test starts. You run until you hit that number. You analyze once. The output is a p-value and a statement like "statistically significant at 95%."
The math assumes you've committed to analyzing your data exactly once, at a predetermined endpoint. Checking results midway and stopping early invalidates the statistical guarantees.
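The damage from peeking is easy to see in simulation. Here's a minimal sketch (plain Python, with illustrative traffic numbers of my own choosing, not Optimizely's machinery): run A/A tests where both arms share the same true conversion rate, so every declared winner is a false positive, and compare a single final analysis against ten interim checks.

```python
import random
from statistics import NormalDist

def z_significant(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-proportion pooled z-test: True if the difference looks 'significant'."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return False
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (conv_a / n_a - conv_b / n_b) / se
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)

def aa_false_positive_rates(sims=1000, n_per_arm=1000, peeks=10, rate=0.05):
    """Both arms convert at the same true rate, so every 'winner' is a false positive."""
    random.seed(7)
    step = n_per_arm // peeks
    peeking_fp = final_fp = 0
    for _ in range(sims):
        conv_a = conv_b = 0
        peeked_winner = False
        for i in range(1, n_per_arm + 1):
            conv_a += random.random() < rate
            conv_b += random.random() < rate
            if i % step == 0 and z_significant(conv_a, i, conv_b, i):
                peeked_winner = True  # would have stopped and called a winner here
        peeking_fp += peeked_winner
        final_fp += z_significant(conv_a, n_per_arm, conv_b, n_per_arm)
    return peeking_fp / sims, final_fp / sims

peeking, single_look = aa_false_positive_rates()
print(f"10 interim checks: {peeking:.1%} false positives")
print(f"single final analysis: {single_look:.1%} false positives")
```

With these settings the single-look rate stays near the nominal 5%, while checking ten times along the way typically multiplies it several times over.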
What changes in Optimizely
When you select Frequentist (Fixed Horizon) in the Stats Configuration dropdown, Optimizely's sample size calculator becomes your primary tool. You input:
- Metric type (conversion or numeric)
- Baseline metric value (your current conversion rate)
- Minimum Detectable Effect (the smallest lift you want to detect)
- Statistical significance level (typically 90% or 95%)
- Number of variations (including control)
The calculator outputs visitors needed per variation. You copy that into the Sample Size per Variation field. The experiment becomes locked to that endpoint. You can optionally add a minimum duration in days if you want to ensure at least one full business cycle.
The results page shows: statistical significance percentage, p-value, and confidence intervals — the classic frequentist output.
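If you want to sanity-check the calculator, the classic two-proportion sample size formula is a reasonable approximation. Here's a sketch in plain Python (Optimizely's internal formula may differ, for example in how it adjusts for multiple variations):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variation(baseline, mde_relative, alpha=0.05, power=0.80):
    """Visitors needed per variation for a two-sided two-proportion test.

    baseline:      current conversion rate, e.g. 0.05 for 5%
    mde_relative:  smallest relative lift worth detecting, e.g. 0.10 for +10%
    """
    p1 = baseline
    p2 = baseline * (1 + mde_relative)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for 95% significance
    z_beta = NormalDist().inv_cdf(power)           # e.g. 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# 5% baseline, +10% relative MDE, 95% significance, 80% power
print(sample_size_per_variation(0.05, 0.10))  # about 31,000 visitors per variation
```

Note the quadratic sensitivity: halving your MDE roughly quadruples the required sample, which is why an ambitious MDE is the most common cause of tests that never finish.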
When frequentist is the right call
- Regulated industries (pharma, financial services, healthcare) where you need a defensible methodology that outside auditors will recognize
- Compliance and legal contexts where the statistical method may need to be documented and defended
- Academic or rigorous internal research where you want to publish or reference results
- High-stakes irreversible decisions where you want the strictest false positive control and are willing to wait for the required sample size
- Teams with strong statistical expertise who understand NHST deeply and have discipline around pre-registration
**Pro Tip:** If you're running frequentist tests, treat your sample size as a commitment, not a guideline. The moment you start peeking at results and considering early stops, you've compromised your Type I error rate. Either use Fixed Horizon with iron discipline, or switch to Stats Engine which is built for continuous monitoring.
Bayesian: The Intuitive, Fast Option
How it works
Bayesian testing asks a different question than frequentist. Instead of "how surprised would I be if there were no effect?", it asks "given the data I've collected, what's the probability that variant A beats control?"
That probability — called the "chance to beat" or posterior probability — updates continuously as visitors enter your test. When you've seen enough data, the probability stabilizes. You can stop when it's high enough.
Selecting the Bayesian option replaces Optimizely's default Stats Engine with a Bayesian framework. The output is a probability: "87% chance Variant A beats control."
What changes in Optimizely
Configuration is simpler than for Frequentist. In the Stats Configuration dropdown, select "Bayesian." You can optionally adjust the "Chance to beat probability threshold": the minimum probability required before Optimizely marks a variant as a winner. The lowest allowed threshold is 70%; 95% is a common choice.
No sample size pre-specification is required. You monitor the probability and stop when it crosses your threshold — or when the probability stabilizes at a level that tells you the effect is smaller than your MDE.
Results show: probability that each variant beats control, and the estimated lift with credible intervals (Bayesian's version of confidence intervals).
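For conversion metrics, a "chance to beat" is conventionally computed from Beta posteriors. Optimizely doesn't publish its exact implementation, but a standard Beta-Binomial sketch with flat priors (and made-up traffic numbers) shows the idea:

```python
import random

def chance_to_beat(conv_ctrl, n_ctrl, conv_var, n_var, draws=100_000, seed=0):
    """Monte Carlo estimate of P(variant rate > control rate).

    Uses independent Beta(1, 1) (flat) priors, so each posterior is
    Beta(conversions + 1, non-conversions + 1).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p_ctrl = rng.betavariate(conv_ctrl + 1, n_ctrl - conv_ctrl + 1)
        p_var = rng.betavariate(conv_var + 1, n_var - conv_var + 1)
        wins += p_var > p_ctrl
    return wins / draws

# Control: 50/1000 (5.0%). Variant: 65/1000 (6.5%). Illustrative numbers only.
print(f"{chance_to_beat(50, 1000, 65, 1000):.0%} chance variant beats control")
```

With these numbers the estimate lands around 92%: strong evidence, but not the same statement as 95% frequentist confidence.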
When Bayesian is the right call
- Fast-paced product teams where shipping speed matters more than statistical precision
- High-traffic pages where you accumulate data quickly and want to make decisions faster
- Stakeholder communication — "there's a 91% chance this version is better" is far more intuitive to executives than "p = 0.04 at 95% confidence"
- Exploratory testing phases where you're learning rather than making high-stakes decisions
- Teams without deep statistical training who need to act on results without getting lost in p-value interpretation
**Pro Tip:** Bayesian probability thresholds and frequentist significance levels are not equivalent. A 95% Bayesian "chance to beat" is not the same as 95% frequentist confidence. Bayesian 95% means "given the data, there's a 95% probability the variant is better." Frequentist 95% means "if there were truly no effect, we'd see a result at least this extreme only 5% of the time." Don't mix the interpretations.
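A concrete example makes the gap visible. The sketch below runs both readings on the same hypothetical data, using normal approximations throughout (a shortcut for the full Beta posterior): the variant shows roughly a 92% chance to beat, yet the two-sided p-value comes out around 0.15, nowhere near significance at 95%.

```python
from statistics import NormalDist

norm = NormalDist()

def both_readings(conv_a, n_a, conv_b, n_b):
    """Same data, two readings: approximate chance to beat vs two-sided p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Bayesian-style reading (normal approximation to the posterior difference)
    se_unpooled = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    chance_to_beat = norm.cdf((p_b - p_a) / se_unpooled)
    # Frequentist reading (pooled two-proportion z-test)
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se_pooled
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return chance_to_beat, p_value

ctb, p = both_readings(50, 1000, 65, 1000)
print(f"chance to beat: {ctb:.0%}, p-value: {p:.2f}")  # high chance to beat, yet p well above 0.05
```

The flip side: under flat priors the chance to beat roughly equals one minus the one-sided p-value, which is exactly why the two numbers get conflated despite meaning different things.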
Stats Engine (Sequential): The Default for Most Teams
Optimizely built Stats Engine as a middle path — and for most experimentation programs, it's the right default.
Stats Engine is mathematically a sequential test (frequentist-adjacent), but it's designed to support continuous monitoring without inflating false positive rates. It uses false discovery rate (FDR) control across your experiment's metrics, which reduces the chance of calling a winner on noise when you're tracking multiple metrics simultaneously.
The key practical advantage: you get valid real-time results. You can check the dashboard every day. You can share results in weekly readouts. You can stop early if the evidence is overwhelming. The statistical guarantees hold throughout.
For most CRO teams — especially those without dedicated statisticians — Stats Engine provides the best balance of rigor and flexibility.
The Practical Tradeoffs: A Direct Comparison
Speed to decision: Bayesian > Stats Engine > Frequentist
Bayesian can reach its probability threshold faster, especially when effects are large. Frequentist requires you to collect the full pre-specified sample regardless of how clear the signal is. Stats Engine falls in the middle — it allows early stopping but maintains stricter false positive control than Bayesian.
False positive control: Frequentist (if discipline is maintained) ~ Stats Engine > Bayesian
Classical frequentist gives you exact Type I error control — if you don't peek. Stats Engine provides valid continuous monitoring with FDR control. Bayesian is more flexible but can produce more false positives if teams stop tests too early at low probability thresholds.
Ease of interpretation: Bayesian > Stats Engine > Frequentist
"91% probability the variant is better" needs no explanation. "Statistically significant at 95% confidence" requires unpacking. P-values require a lecture.
Stakeholder communication: Bayesian wins here, consistently.
Required statistical sophistication: Frequentist > Stats Engine > Bayesian
Frequentist demands the most discipline — pre-registration, no peeking, strict adherence to sample size. Stats Engine is more forgiving. Bayesian is the most accessible.
**Pro Tip:** Match your statistical engine to your team's maturity, not to what sounds most rigorous. A team that uses frequentist but peeks at results daily has worse statistical properties than a team running Bayesian with an 85% threshold and genuine discipline around stopping rules.
Decision Table: Which Engine for Your Situation
Use Frequentist (Fixed Horizon) when:
- You're in a regulated industry or compliance context
- You need to document methodology for external review
- Your team has strong statistical training and will not peek at results
- You're running high-stakes tests where you want the tightest false positive control
- Your experiment has a single primary metric
Use Bayesian when:
- Your team needs fast decisions and shipping velocity is a priority
- Stakeholders need intuitive probability statements, not p-values
- You're running exploratory or iterative tests where speed of learning matters more than precision
- Your test is lower stakes (copy, UI, layout) and you're comfortable with slightly higher false positive risk
- You want to stop early when results are clearly going one way
Use Stats Engine (default) when:
- You don't have a specific reason to deviate from the default
- Your team checks results continuously and needs valid real-time inference
- You're tracking multiple metrics and need FDR protection
- You want the best balance of flexibility and rigor without committing to strict pre-registration
**Pro Tip:** You can change the statistical method on a test-by-test basis in Optimizely. Don't feel like you need to pick one approach for your entire program. Run frequentist for your pricing test, Bayesian for your hero section copy test, and Stats Engine for your standard feature experiments.
What Optimizely Recommends
Optimizely's default is Stats Engine, and that's deliberate. The company designed it specifically to solve the two most common failures in classical A/B testing programs: peeking and multi-metric false discovery inflation.
They've subsequently added Frequentist and Bayesian as explicit options because enterprise customers with compliance requirements or specific methodological preferences need them. But for a team starting from scratch, Stats Engine's defaults are solid.
Common Mistakes
Mistake 1: Switching methods mid-test. Don't change your statistical engine after a test starts. The analysis assumes a consistent method from the first visitor.
Mistake 2: Using frequentist but monitoring results daily. This is the worst of both worlds — you get the rigidity of frequentist (must pre-specify sample size) without the validity protection (you're peeking). Either commit to no peeking or switch to Stats Engine.
Mistake 3: Setting Bayesian probability thresholds too low. A 70% chance to beat still leaves a 30% chance you're shipping a variant that's no better than control. If you're making product decisions at 75% Bayesian probability, you're shipping losers regularly. Most teams should not go below 90%.
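To see why low thresholds bite, simulate A/A tests (no real difference) where you ship the variant the first time its chance to beat crosses your threshold. This sketch uses a normal approximation to the chance to beat and illustrative traffic numbers, so the exact rates are not Optimizely's:

```python
import random
from statistics import NormalDist

norm = NormalDist()

def ship_rate(threshold, sims=300, n_per_arm=1000, checks=10, rate=0.05, seed=11):
    """Fraction of A/A tests (identical arms) that cross the threshold at any
    interim check, i.e. the rate of shipping a fake winner."""
    rng = random.Random(seed)
    step = n_per_arm // checks
    shipped = 0
    for _ in range(sims):
        conv_a = conv_b = 0
        for i in range(1, n_per_arm + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if i % step == 0:
                p_a, p_b = conv_a / i, conv_b / i
                se = (p_a * (1 - p_a) / i + p_b * (1 - p_b) / i) ** 0.5
                if se > 0 and norm.cdf((p_b - p_a) / se) >= threshold:
                    shipped += 1  # would have declared a winner and stopped
                    break
    return shipped / sims

for t in (0.70, 0.90, 0.95):
    print(f"threshold {t:.0%}: shipped a fake winner in {ship_rate(t):.0%} of A/A tests")
```

The exact rates depend on traffic and check frequency, but the ordering is robust: the lower the threshold, the more fake winners you ship.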
Mistake 4: Conflating the three methods' outputs. A 95% confidence Frequentist result, a 95% chance-to-beat Bayesian result, and a 95% Stats Engine result are not the same thing and should not be communicated interchangeably.
Mistake 5: Not documenting the chosen method in your experiment plan. When you revisit an experiment six months later, you need to know what statistical framework was used to interpret the results correctly. Document it.
What to Do Next
- Audit your current default. Open Optimizely's Stats Configuration on your next experiment. Confirm you know which engine is selected and whether it's appropriate for the test stakes.
- Define team standards. Write a one-page policy: which engine for which test type, what thresholds, who can override. This prevents ad hoc methodology choices that produce incomparable results across your program.
- Train stakeholders on your chosen output format. If you're running Bayesian, teach your leadership team to read probability statements. If you're running Stats Engine, give them the "how to interpret a confidence interval" two-minute briefing. Consistent interpretation across the org matters more than the specific method.
- Run a retrospective on your last 10 called experiments. What method was used? Was it appropriate for the stakes? Were results interpreted correctly? This audit often reveals systematic misinterpretation that's been costing you decision quality for months.