Bandit Algorithm (Multi-Armed Bandit)
An adaptive experiment design that dynamically shifts traffic toward better-performing variants during a test, balancing exploration of new options with exploitation of known winners.
What Is a Bandit Algorithm?
A bandit algorithm (or multi-armed bandit, MAB) is an adaptive experimentation method that continuously reallocates traffic toward better-performing variants while still sending some traffic to underperformers to verify their standing. The name comes from the gambler's dilemma: faced with a row of slot machines ("one-armed bandits") with unknown payouts, how do you maximize winnings across the session? In digital experimentation, it's a way to cut the opportunity cost of long-running tests.
Also Known As
- Marketing teams often call it adaptive testing or auto-optimizing test.
- Growth teams say bandit, MAB, or adaptive allocation.
- Product teams use bandit, optimization algorithm, or adaptive experiment.
- Engineering teams refer to Thompson Sampling, UCB, or epsilon-greedy (specific bandit types).
- Data science teams call it MAB, contextual bandit, or reinforcement learning-style allocation.
How It Works
You have three email subject lines to test. A classic A/B/C test splits traffic evenly for the full duration. A bandit starts at 33/33/33 but re-evaluates after every batch of sends. After 1,000 sends, subject A opens at 22%, B at 28%, C at 19%. The next batch reallocates: A gets 20%, B gets 65%, C gets 15%. As more data arrives, traffic concentrates further on the best performer. By the end you've sent far more traffic to B (and earned more opens) than a fixed A/B/C split would have — at the cost of less statistical certainty about the losers.
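The reallocation step above can be sketched with Thompson Sampling, the most common bandit variant: each arm keeps a Beta posterior over its open rate, and each send in the next batch goes to whichever arm draws the highest sample. This is a minimal illustration, not a production implementation; the open/non-open counts are hypothetical figures roughly matching the example (about 333 sends per subject line).

```python
import random

def thompson_allocate(stats, batch_size, rng):
    """Allocate the next batch of sends across arms via Thompson Sampling.

    stats: {arm: (opens, non_opens)} observed so far.
    Returns {arm: sends} for the next batch.
    """
    allocation = {arm: 0 for arm in stats}
    for _ in range(batch_size):
        # Draw one sample from each arm's Beta(1 + opens, 1 + non_opens)
        # posterior; the arm with the highest sampled open rate wins this send.
        sampled = {
            arm: rng.betavariate(1 + opens, 1 + misses)
            for arm, (opens, misses) in stats.items()
        }
        allocation[max(sampled, key=sampled.get)] += 1
    return allocation

# Hypothetical counts after ~333 sends each: A ~22%, B ~28%, C ~19% open rates.
stats = {"A": (73, 260), "B": (93, 240), "C": (63, 270)}
alloc = thompson_allocate(stats, batch_size=1000, rng=random.Random(0))
```

Notice that B, the current leader, gets most of the next batch, but A and C still receive some sends because their posteriors overlap with B's — that residual exploration is what lets a bandit recover if the early leader was a fluke.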
Best Practices
- Use bandits for optimization, not inference — when you care about maximizing outcomes during the test, not estimating clean effect sizes.
- Prefer Thompson Sampling over epsilon-greedy for better long-run performance.
- Use contextual bandits when different user segments might prefer different variants.
- Monitor guardrail metrics closely — bandits can lock onto a variant that's good on the primary metric but bad on secondary ones.
- Don't use bandits for high-stakes infrastructure tests where you need clean causal inference.
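For contrast with the Thompson Sampling approach recommended above, epsilon-greedy — the simplest bandit named in the engineering terminology — can be sketched in a few lines. It explores uniformly at random with probability epsilon and otherwise exploits the arm with the best observed rate; the flat exploration rate is why it tends to underperform Thompson Sampling in the long run. Names and the stats format here are illustrative assumptions.

```python
import random

def epsilon_greedy_pick(stats, epsilon, rng):
    """Pick one arm: explore uniformly with probability epsilon,
    otherwise exploit the arm with the highest observed success rate.

    stats: {arm: (successes, trials)} observed so far.
    """
    arms = list(stats)
    if rng.random() < epsilon:
        return rng.choice(arms)  # explore: ignore the data entirely

    def observed_rate(arm):
        successes, trials = stats[arm]
        return successes / trials if trials else 0.0

    return max(arms, key=observed_rate)  # exploit the current leader
```

Unlike Thompson Sampling, this keeps exploring losers at the same fixed rate forever, which wastes traffic late in a test — one concrete reason for the best-practice preference above.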
Common Mistakes
- Using bandits on low-traffic tests where early random variation causes premature convergence on a loser.
- Treating bandit results as statistically significant — bandits optimize but don't prove causation cleanly.
- Running bandits on noisy, delayed metrics (like LTV) where the feedback loop is too slow.
Industry Context
- SaaS/B2B: Less common; traffic is usually too low and strategic learning matters more than short-term lift.
- Ecommerce/DTC: Useful for headline, banner, and promotion optimization where traffic is high and decisions are low-stakes.
- Lead gen: Good fit for ad creative optimization and dynamic landing page headline selection.
The Behavioral Science Connection
Bandits operationalize Herbert Simon's satisficing — accepting "good enough" quickly rather than chasing optimal certainty. They embody a systems-level tradeoff most individual experimenters fail to make: less learning per test, more value captured across the portfolio.
Key Takeaway
Use bandits when you want to capture value during optimization — and use A/B tests when you want to learn why something won.