
Reinforcement Learning for Experimentation

Using RL algorithms — multi-armed bandits, contextual bandits, and full RL — to adaptively allocate traffic across variants and maximize cumulative reward.

What Is Reinforcement Learning for Experimentation?

Reinforcement learning (RL) reframes experimentation as sequential decision making under uncertainty. Instead of a fixed-horizon A/B test that commits equal traffic to all variants, RL methods allocate more traffic to variants that appear to be winning while preserving exploration to avoid premature convergence. The spectrum ranges from simple multi-armed bandits (Thompson sampling, UCB) to contextual bandits (per-user feature-aware assignment) to full RL with state, action sequences, and long-horizon rewards.

Also Known As

  • Data science: MAB, contextual bandit, Thompson sampling, UCB, policy gradient
  • Growth: adaptive testing, traffic reallocation
  • Marketing: dynamic optimization
  • Engineering: sequential allocation, bandit orchestration

How It Works

You have 6 CTA copy variants. Instead of sending 1/6 of traffic to each for 4 weeks, use Thompson sampling: each variant has a Beta(alpha, beta) posterior over its conversion rate (starting from a uniform Beta(1, 1) prior). Sample once from each posterior, show the variant with the highest sampled conversion rate, observe the outcome, and update that variant's posterior. Over time, losing variants get less traffic while winners get more. Cumulative regret (lift lost to exploring losers) is dramatically lower than in uniform A/B, and strong winners concentrate naturally.
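The loop above can be sketched in a few lines. This is a minimal illustration, not a production implementation; the class name, variant count, and "true" conversion rates in the simulation are assumptions for demonstration.

```python
import random

class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over conversion-rate variants."""

    def __init__(self, n_variants):
        # Beta(1, 1) uniform priors: alpha counts successes + 1, beta failures + 1.
        self.alpha = [1.0] * n_variants
        self.beta = [1.0] * n_variants

    def choose(self):
        # Draw one sample from each posterior; show the highest-sampled variant.
        samples = [random.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, variant, converted):
        # Fold the observed outcome back into that variant's posterior.
        if converted:
            self.alpha[variant] += 1
        else:
            self.beta[variant] += 1

# Simulated run with 6 CTA variants; variant 2 has the best (hypothetical) rate.
random.seed(0)
true_rates = [0.02, 0.025, 0.05, 0.02, 0.03, 0.022]
sampler = ThompsonSampler(len(true_rates))
pulls = [0] * len(true_rates)
for _ in range(20_000):
    v = sampler.choose()
    pulls[v] += 1
    sampler.update(v, random.random() < true_rates[v])

# The true winner should end up with the bulk of the traffic.
print(pulls)
```

Note how exploration never fully stops: even a dominated variant keeps a small sampling probability as long as its posterior overlaps the leader's.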

For contextual bandits, extend Thompson sampling to linear or tree-based models conditioning on user features. For full RL, model state-action trajectories — appropriate for multi-step onboarding or email sequences where today's action affects tomorrow's reward.
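One common way to realize the contextual extension is linear Thompson sampling: each variant gets a Bayesian ridge-regression posterior over weights, and the variant whose sampled weights score the user's feature vector highest is shown. The sketch below assumes Gaussian noise and illustrative feature names; it is one possible formulation, not a prescribed one.

```python
import numpy as np

class LinTS:
    """Linear Thompson sampling: one Bayesian linear model per variant.

    Posterior over weights is N(mu, noise^2 * A^-1), with
    A = lam * I + sum(x x^T) and mu = A^-1 * sum(reward * x)
    (the ridge-regression posterior under a Gaussian prior).
    """

    def __init__(self, n_variants, dim, lam=1.0, noise=0.5):
        self.A = [lam * np.eye(dim) for _ in range(n_variants)]  # precision matrices
        self.b = [np.zeros(dim) for _ in range(n_variants)]      # sum of reward * x
        self.noise = noise

    def choose(self, x):
        best, best_val = 0, -np.inf
        for k in range(len(self.A)):
            A_inv = np.linalg.inv(self.A[k])
            mu = A_inv @ self.b[k]
            # Sample a weight vector from this variant's posterior.
            w = np.random.multivariate_normal(mu, self.noise ** 2 * A_inv)
            val = float(x @ w)
            if val > best_val:
                best, best_val = k, val
        return best

    def update(self, k, x, reward):
        # Rank-one posterior update for the shown variant.
        self.A[k] += np.outer(x, x)
        self.b[k] += reward * x

# Hypothetical context: [is_mobile, scaled tenure, bias term].
np.random.seed(0)
bandit = LinTS(n_variants=3, dim=3)
x = np.array([1.0, 0.2, 1.0])
k = bandit.choose(x)
bandit.update(k, x, reward=1.0)
```

Tree-based variants follow the same pattern, swapping the linear posterior for per-leaf Beta posteriors.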

Best Practices

  • Use bandits when the cost of sending traffic to losing variants is high and the payoff from exploiting winners accrues in real time (CTA, subject line, hero copy).
  • Prefer A/B tests when you need an unbiased effect estimate — bandits optimize regret, not estimation clarity.
  • Maintain exploration floors to prevent premature convergence, especially with non-stationarity.
  • Log propensities to enable counterfactual evaluation later.
  • Match algorithm to horizon. MAB for simple stationary choice; contextual bandits for heterogeneity; full RL for sequential action — most real problems are contextual bandits, not full RL.
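The propensity-logging practice above pays off when you later want to evaluate a new policy offline. A standard tool is the inverse-propensity-scored (IPS) estimator: reweight each logged event by how much more (or less) often the candidate policy would have taken the logged action. The log schema, rates, and candidate policy below are illustrative assumptions.

```python
import random

def log_interaction(log, action, propensity, reward):
    # Record the probability with which the deployed policy chose `action`.
    log.append({"action": action, "propensity": propensity, "reward": reward})

def ips_value(log, target_policy_prob):
    # Estimate the average reward the *target* policy would have earned,
    # weighting each event by target_prob / logged_propensity.
    total = 0.0
    for event in log:
        weight = target_policy_prob(event["action"]) / event["propensity"]
        total += weight * event["reward"]
    return total / len(log)

# Simulated log: the deployed policy shows variant 0 with prob 0.8, variant 1
# with prob 0.2; hypothetical conversion rates are 0.03 and 0.06.
random.seed(1)
log = []
for _ in range(10_000):
    a = 0 if random.random() < 0.8 else 1
    converted = random.random() < (0.03 if a == 0 else 0.06)
    log_interaction(log, a,
                    propensity=0.8 if a == 0 else 0.2,
                    reward=1.0 if converted else 0.0)

# Evaluate a uniform candidate policy offline from the biased log.
est = ips_value(log, lambda a: 0.5)
# Analytically, the uniform policy's value is 0.5*0.03 + 0.5*0.06 = 0.045.
```

Without the logged propensities, this counterfactual comparison is not recoverable from the data.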

Common Mistakes

  • Using bandits when you actually want an effect estimate. Under a bandit, assignment probabilities depend on past outcomes, so naive ATE estimation on the resulting data is biased.
  • Ignoring non-stationarity. A bandit that has converged on a winner will be slow to adapt when user behavior shifts; use discounted or sliding-window variants.
  • Deploying full RL where a bandit would do. Full RL is hard, data-hungry, and rarely justified by business complexity.
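One way to address the non-stationarity mistake above is a sliding-window Thompson sampler: the posterior is rebuilt from only the most recent outcomes, so stale evidence ages out and the bandit can re-explore after a shift. The window size and simulated rate flip below are illustrative assumptions.

```python
import random
from collections import deque

class SlidingWindowTS:
    """Thompson sampling whose posteriors forget outcomes older than `window`."""

    def __init__(self, n_variants, window=500):
        # deque(maxlen=...) silently drops the oldest outcome on append.
        self.outcomes = [deque(maxlen=window) for _ in range(n_variants)]

    def choose(self):
        samples = []
        for hist in self.outcomes:
            successes = sum(hist)
            failures = len(hist) - successes
            # Beta posterior built only from the in-window outcomes.
            samples.append(random.betavariate(successes + 1, failures + 1))
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, variant, converted):
        self.outcomes[variant].append(1 if converted else 0)

# Simulated shift: variant 0 is better for 10k steps, then the rates flip.
random.seed(2)
ts = SlidingWindowTS(n_variants=2, window=500)
recent = [0, 0]  # pulls during the final 2k steps, after the shift
for t in range(20_000):
    rates = [0.05, 0.02] if t < 10_000 else [0.02, 0.05]
    v = ts.choose()
    ts.update(v, random.random() < rates[v])
    if t >= 18_000:
        recent[v] += 1
```

A vanilla (non-windowed) sampler in the same simulation would keep favoring variant 0 long after the flip, because its posterior never discards the early evidence.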

Industry Context

In SaaS/B2B, bandits excel at in-app messaging, upgrade prompt copy, and activation nudges where exploration cost is high. In ecommerce, Thompson sampling on merchandising and email subject lines is a well-validated production pattern. In lead gen, contextual bandits on ad creative, landing page variants, and nurture sequences outperform fixed A/B in competitive, high-frequency environments.

The Behavioral Science Connection

RL formalizes the explore-exploit tradeoff that humans navigate badly. We either explore too little (overcommitted to last year's winner) or too much (restless optimizer never letting anything compound). Thompson sampling and its cousins make the right tradeoff explicit and data-driven.

Key Takeaway

RL and bandits are the right tool when the cost of exploration is real and the exploitation payoff is continuous. They are the wrong tool when you need a clean causal estimate for a one-time ship decision. Knowing which problem you have is most of the skill.