Bandit Algorithm (Multi-Armed Bandit)
An adaptive experiment design that dynamically shifts traffic toward better-performing variants during a test, balancing exploration of new options with exploitation of known winners.
What Is a Bandit Algorithm?
A bandit algorithm (or multi-armed bandit, MAB) is an adaptive experimentation method that continuously reallocates traffic toward better-performing variants while still sending some traffic to underperformers to verify their standing. The name comes from the gambler's dilemma: faced with a row of slot machines ("one-armed bandits") with unknown payouts, how do you maximize winnings across the session? In digital experimentation, it's a way to cut the opportunity cost of long-running tests.
Also Known As
- Marketing teams often call it adaptive testing or auto-optimizing test.
- Growth teams say bandit, MAB, or adaptive allocation.
- Product teams use bandit, optimization algorithm, or adaptive experiment.
- Engineering teams refer to Thompson Sampling, UCB, or epsilon-greedy (specific bandit types).
- Data science teams call it MAB, contextual bandit, or reinforcement learning-style allocation.
How It Works
You have three email subject lines to test. A classic A/B/C test splits traffic evenly for the full duration. A bandit starts at 33/33/33 but re-evaluates after every batch of sends. After 1,000 sends, subject A opens at 22%, B at 28%, C at 19%. The next batch reallocates: A gets 20%, B gets 65%, C gets 15%. As more data arrives, traffic concentrates further on the best performer. By the end you've sent far more traffic to B (and earned more opens) than a fixed A/B/C split would have — at the cost of less statistical certainty about the losers.
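The reallocation step above can be sketched with Thompson Sampling, the most common bandit variant: each arm keeps a Beta posterior over its open rate, and each send in the next batch goes to whichever arm draws the highest sample. This is a minimal illustration, not a production implementation; the open/non-open counts are hypothetical figures roughly matching the example (about 333 sends per subject line).

```python
import random

def thompson_allocate(stats, batch_size, rng):
    """Allocate the next batch of sends across arms via Thompson Sampling.

    stats: {arm: (opens, non_opens)} observed so far.
    Returns {arm: sends} for the next batch.
    """
    allocation = {arm: 0 for arm in stats}
    for _ in range(batch_size):
        # Draw one sample from each arm's Beta(1 + opens, 1 + non_opens)
        # posterior; the arm with the highest sampled open rate wins this send.
        sampled = {
            arm: rng.betavariate(1 + opens, 1 + misses)
            for arm, (opens, misses) in stats.items()
        }
        allocation[max(sampled, key=sampled.get)] += 1
    return allocation

# Hypothetical counts after ~333 sends each: A ~22%, B ~28%, C ~19% open rates.
stats = {"A": (73, 260), "B": (93, 240), "C": (63, 270)}
alloc = thompson_allocate(stats, batch_size=1000, rng=random.Random(0))
```

Notice that B, the current leader, gets most of the next batch, but A and C still receive some sends because their posteriors overlap with B's — that residual exploration is what lets a bandit recover if the early leader was a fluke.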
Best Practices
- Use bandits for optimization, not inference — when you care about maximizing outcomes during the test, not estimating clean effect sizes.
- Prefer Thompson Sampling over epsilon-greedy for better long-run performance.
- Use contextual bandits when different user segments might prefer different variants.
- Monitor guardrail metrics closely — bandits can lock onto a variant that's good on the primary metric but bad on secondary ones.
- Don't use bandits for high-stakes infrastructure tests where you need clean causal inference.
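For contrast with the Thompson Sampling approach recommended above, epsilon-greedy — the simplest bandit named in the engineering terminology — can be sketched in a few lines. It explores uniformly at random with probability epsilon and otherwise exploits the arm with the best observed rate; the flat exploration rate is why it tends to underperform Thompson Sampling in the long run. Names and the stats format here are illustrative assumptions.

```python
import random

def epsilon_greedy_pick(stats, epsilon, rng):
    """Pick one arm: explore uniformly with probability epsilon,
    otherwise exploit the arm with the highest observed success rate.

    stats: {arm: (successes, trials)} observed so far.
    """
    arms = list(stats)
    if rng.random() < epsilon:
        return rng.choice(arms)  # explore: ignore the data entirely

    def observed_rate(arm):
        successes, trials = stats[arm]
        return successes / trials if trials else 0.0

    return max(arms, key=observed_rate)  # exploit the current leader
```

Unlike Thompson Sampling, this keeps exploring losers at the same fixed rate forever, which wastes traffic late in a test — one concrete reason for the best-practice preference above.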
Common Mistakes
- Using bandits on low-traffic tests where early random variation causes premature convergence on a loser.
- Treating bandit results as statistically significant — bandits optimize but don't prove causation cleanly.
- Running bandits on noisy, delayed metrics (like LTV) where the feedback loop is too slow.
Industry Context
- SaaS/B2B: Less common; traffic is usually too low and strategic learning matters more than short-term lift.
- Ecommerce/DTC: Useful for headline, banner, and promotion optimization where traffic is high and decisions are low-stakes.
- Lead gen: Good fit for ad creative optimization and dynamic landing page headline selection.
The Behavioral Science Connection
Bandits operationalize Herbert Simon's satisficing — accepting "good enough" quickly rather than chasing optimal certainty. They embody a systems-level tradeoff most individual experimenters fail to make: less learning per test, more value captured across the portfolio.
Key Takeaway
Use bandits when you want to capture value during optimization — and use A/B tests when you want to learn why something won.