"We feature-flagged the new checkout to 10% of users and metrics looked fine. So we rolled it out." I hear this at least once a month. It sounds rigorous. It is not. Here is why.
The confusion between feature flags and A/B tests is one of the most common — and most costly — misunderstandings in modern product development. Engineering teams assume that because both tools control who sees what code, they accomplish the same thing. They do not. One is a deployment mechanism. The other is a measurement framework. Treating them as interchangeable leads to shipping changes you think you validated but actually never tested.
Feature Flags: A Deployment Mechanism
A feature flag is a conditional switch in your codebase that controls which users see which version of your code. At its core, it is an if/else statement controlled by a configuration system rather than a code deploy.
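Stripped to its essence, a flag is just that conditional. A minimal sketch (the flag name and in-memory config dict are hypothetical stand-ins for a real flag service):

```python
# A feature flag at its core: an if/else driven by configuration, not by a deploy.
# In practice FLAGS would be fetched from a flag service, not a module-level dict.
FLAGS = {"new_checkout": True}

def checkout(user_id: str) -> str:
    if FLAGS.get("new_checkout", False):
        return "v2-checkout"   # new code path
    return "v1-checkout"       # old code path

# Flipping the flag changes behavior instantly, with no redeploy:
FLAGS["new_checkout"] = False
```

The point is that the branch is controlled by configuration that can change at runtime, which is what makes the rollout and kill-switch patterns below possible.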
Feature flags serve several deployment purposes:
Gradual rollout (canary deployment) lets you expose a new feature to 1% of traffic, then 5%, then 25%, then 100%. At each stage, you monitor error rates, latency, and crash reports. If something breaks, you kill the flag and instantly revert — no emergency deploy required.
Kill switches give you the ability to disable a broken feature in production without a full deployment cycle. When your new payment integration starts throwing 500 errors at 2 AM, you flip the flag instead of waking up the on-call engineer for a hotfix deploy.
User and team-based targeting lets you show features to internal users for dogfooding, beta testers for early feedback, or specific customer segments for phased releases.
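The standard implementation of gradual rollout plus kill switch is stable hashing of the user ID into percentage buckets. A sketch, assuming a homegrown system (`ROLLOUT_PERCENT` and `KILL_SWITCH` are hypothetical config values a real system would fetch remotely):

```python
import hashlib

ROLLOUT_PERCENT = 10   # hypothetical config: bump to 25, 50, 100 over time
KILL_SWITCH = False    # hypothetical config: flip to True to disable instantly

def bucket(user_id: str) -> int:
    """Map a user to a stable bucket 0-99, so each user keeps the same
    variant across sessions instead of flickering between versions."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100

def sees_new_feature(user_id: str) -> bool:
    if KILL_SWITCH:
        return False
    return bucket(user_id) < ROLLOUT_PERCENT
```

Note what this does and does not give you: a stable, roughly 10% slice of traffic, but nothing resembling a measured comparison against the other 90%.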
The tooling ecosystem is mature: LaunchDarkly, Flagsmith, Split, Unleash, and countless homegrown systems. Most modern engineering organizations use feature flags as standard practice.
A/B Tests: A Measurement Framework
An A/B test is fundamentally different. It is not about deploying code safely — it is about learning whether a change improves outcomes. Four components make it a valid experiment:
Random assignment ensures that the users who see the variant and the users who see the control are statistically comparable. Any differences in outcomes can be attributed to the change itself, not to differences between the groups.
Pre-defined hypothesis, metrics, and success criteria establish what you expect to happen and how you will measure it before the test begins. This prevents post-hoc rationalization — the dangerous practice of looking at results and then deciding what they mean.
Statistical analysis determines whether observed differences are real or just noise. A 3% lift in conversion could be a genuine effect or random variation. The statistics tell you which.
Causal conclusions are the end product. A properly run A/B test lets you say "this change caused this outcome" — not just "this change was correlated with this outcome during this time period."
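To make the statistics concrete, here is a rough sketch of the kind of test an experimentation platform runs under the hood: a two-proportion z-test, implemented with only the standard library. The conversion numbers are invented for illustration; they show the same 3% relative lift reading as noise at small sample sizes and as a real effect at large ones.

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates, using a pooled
    standard error. Returns the z statistic and the two-sided p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value

# The same 3% relative lift (10.0% -> 10.3% conversion), at two sample sizes:
_, p_small = two_proportion_ztest(200, 2000, 206, 2000)        # cannot distinguish from noise
_, p_large = two_proportion_ztest(20000, 200000, 20600, 200000)  # clearly significant
```

The observed lift is identical in both cases; only the sample size changes the conclusion. That is exactly the judgment a dashboard glance cannot make.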
The Critical Differences
The distinction comes down to the question each tool answers. Feature flags ask: "Does this break things?" That is monitoring. A/B tests ask: "Does this improve things?" That is measurement.
A feature flag rollout uses non-random allocation, has no formal control group, and applies no statistical test. Users might be assigned based on user ID hashing, geography, account age, or team membership — none of which guarantee comparable groups.
An A/B test uses random allocation, maintains an explicit control group, and produces a statistical conclusion. The randomization is the foundation. Without it, you cannot draw causal inferences.
Where They Overlap
Here is where the confusion starts: feature flags are often the implementation mechanism for A/B tests. Most modern experimentation platforms — Optimizely, Statsig, Eppo, LaunchDarkly Experimentation — combine both capabilities.
The flag controls WHO sees WHAT. The experiment framework measures the IMPACT. You need the flag to split traffic, but the flag alone tells you nothing about whether the change was good.
Think of it like a clinical trial. The pill packaging (feature flag) controls which patients get the drug and which get the placebo. But the packaging is not the study. The study is the randomization protocol, the outcome measurement, and the statistical analysis. Without those, you just handed out pills with no way to know if they worked.
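To see how a flag mechanism becomes a randomization protocol, consider the approach most experimentation platforms take (a sketch; the exact salting scheme varies by platform, and the names here are hypothetical): hash the user ID together with an experiment-specific salt. The assignment is deterministic per user, statistically random across users, and independent of every other flag and experiment because each salt produces an unrelated split.

```python
import hashlib

def assign(user_id: str, experiment: str,
           variants=("control", "treatment")) -> str:
    """Deterministic random assignment: hashing the user ID with an
    experiment-specific salt gives each experiment its own independent split."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```

Contrast this with flag targeting by geography or account age: the hash-plus-salt split is what makes the two groups statistically comparable, and that comparability is what licenses a causal conclusion.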
HTTP Routing and API Feature Testing
The confusion deepens at the infrastructure level. Engineering teams frequently use traffic-splitting mechanisms that look like experiments but are not.
Layer 7 reverse proxy routing sends a percentage of API requests to a new service version. Nginx, HAProxy, and Envoy all support this natively. You configure the proxy to route 10% of requests to v2 and 90% to v1.
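As a sketch of what that proxy configuration looks like, nginx's `split_clients` directive implements exactly this kind of percentage routing (upstream names, addresses, and the split key are hypothetical):

```nginx
# Route ~10% of requests to v2, keyed on the session cookie so a given
# client consistently hits the same version.
split_clients "${cookie_session_id}" $upstream_variant {
    10%     checkout_v2;
    *       checkout_v1;
}

upstream checkout_v1 { server 10.0.0.10:8080; }
upstream checkout_v2 { server 10.0.0.20:8080; }

server {
    listen 80;
    location /checkout {
        proxy_pass http://$upstream_variant;
    }
}
```

Note that nothing in this configuration measures anything. It splits traffic; it does not compare outcomes.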
Kubernetes ingress controllers offer native traffic splitting. You define weighted routing rules that send a fraction of traffic to the canary deployment.
User-agent-based routing delivers different experiences to different clients — mobile versus desktop, internal versus external.
All of these are canary deployments. They tell you if the new version crashes, throws errors, or degrades latency. They do not tell you if the new version is better. The distinction matters enormously.
Canary vs. A/B Test: Same Mechanism, Different Questions
A canary deployment and an A/B test can use identical traffic-splitting infrastructure. The difference is entirely in what you do with the split.
Canary question: "Will this new code cause errors, latency spikes, or crashes?" You answer this by watching monitoring dashboards — error rates, p99 latency, CPU utilization, memory consumption.
A/B test question: "Will this new code improve conversion, engagement, or revenue?" You answer this through statistical analysis — hypothesis testing, confidence intervals, significance calculations.
You need both. The canary comes first for safety. The A/B test comes second for learning. Skipping either one creates risk — the canary prevents outages, the experiment prevents shipping features that look fine technically but quietly hurt the business.
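The two questions call for different checks. A canary guard can be as simple as a threshold comparison on error rates (a sketch; the thresholds and function shape are illustrative, not a production policy, and real guards also watch latency and saturation):

```python
def canary_healthy(errors_v2, requests_v2, errors_v1, requests_v1,
                   max_relative_increase=1.5, min_requests=1000):
    """Simple canary guard: flag the rollout as unhealthy if the new
    version's error rate exceeds 1.5x the old version's."""
    if requests_v2 < min_requests:
        return True  # not enough canary traffic to judge yet
    rate_v1 = errors_v1 / requests_v1
    rate_v2 = errors_v2 / requests_v2
    if rate_v1 == 0:
        return rate_v2 == 0
    return rate_v2 <= rate_v1 * max_relative_increase
```

A threshold check like this answers "is it safe?" in minutes. It is structurally incapable of answering "is it better?", which is why the A/B test stage still has to follow.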
The Dangerous Confusion
"We flagged it to 10% and the dashboard looked normal." I hear this constantly, and it is the exact point where teams go wrong.
First, there is no proper control group. You are comparing 10% on the new version against 90% on the old version, but the groups are not randomly assigned in a way that supports causal inference. The 10% might be early adopters, power users, or a specific geographic segment — all of which behave differently than your average user.
Second, "looked normal" is not a statistically valid conclusion. Dashboard metrics fluctuate daily. A 2% drop in conversion might be invisible in a dashboard view but represent millions in lost annual revenue. Without a statistical test, you have no way to distinguish signal from noise.
Third, self-selection bias is real. If your feature flag targets opted-in beta users, those users are inherently different from your general population. They are more engaged, more forgiving, and more likely to explore new features. Positive metrics from this group do not generalize.
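A quick back-of-the-envelope calculation shows why "looked normal" proves nothing. Under assumed numbers (a 10% baseline conversion rate and 5,000 daily visitors in the flagged group), a 2% relative drop sits well inside one day's ordinary wiggle:

```python
import math

p = 0.10              # baseline conversion rate (assumed)
n = 5000              # daily visitors in the flagged group (assumed)
se = math.sqrt(p * (1 - p) / n)   # standard error of the daily observed rate
noise_band = 1.96 * se            # ~95% of day-to-day fluctuation

drop = 0.02 * p       # a 2% relative drop: 10.0% -> 9.8% conversion
# drop is far smaller than noise_band, so the dashboard cannot show it
# in a single day; only an accumulated statistical test can.
```

With these assumptions the daily noise band is roughly four times the size of the drop you are trying to detect. The dashboard is not lying; it simply lacks the resolution.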
Progressive Delivery: The Right Pipeline
The mature approach combines both tools in a structured pipeline:
Step 1: Feature flag for internal testing. Developers and QA test the feature behind a flag. Dogfooding, exploratory testing, edge case validation.
Step 2: Canary deployment at 1-5%. A small slice of production traffic hits the new code. Engineering monitors for errors, latency regressions, and infrastructure issues. This stage answers: "Is it safe?"
Step 3: A/B test at 50/50. A randomized experiment with a proper control group, pre-defined metrics, and statistical analysis. This stage answers: "Is it better?" This is where the validity considerations matter most.
Step 4: Gradual rollout based on results. If the A/B test is positive, progressively increase the variant allocation: 50% to 75% to 90% to 100%. Monitor for any degradation at scale.
Step 5: Full deployment and cleanup. Remove the feature flag. Delete the branching logic. Pay down the technical debt before it compounds. Abandoned experiment code is a real maintenance burden.
When You Can Skip the A/B Test Step
Not every change needs the full pipeline. You can skip the A/B test for:
Infrastructure changes with no user-facing impact. Database migrations, caching layer upgrades, service decomposition — these affect reliability and performance, not user behavior. A canary is sufficient.
Bug fixes. The experiment would be "does fixing the bug help?" The answer is obviously yes. Ship the fix, confirm it with monitoring, and move on.
Low-risk, easily reversible changes on low-traffic pages. If the page gets 200 visitors a week and the change is a copy tweak, you will never reach statistical significance anyway. Deploy, monitor, iterate.
When you genuinely lack the traffic for statistical validity. If your sample size calculations say you need 50,000 visitors per variant and you get 2,000 a month, the math does not work. Make your best judgment call and track directional metrics.
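The sample size math itself is a short calculation using the standard normal-approximation formula for comparing two proportions (a sketch; the defaults of 5% significance and 80% power are conventions, not requirements):

```python
import math
from statistics import NormalDist

def required_sample_size(baseline, relative_lift, alpha=0.05, power=0.8):
    """Per-variant sample size needed to detect a relative lift in a
    conversion rate with a two-sided test at the given alpha and power."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
```

With a 5% baseline and a 5% relative lift, the per-variant requirement lands in the low six figures, which is exactly why low-traffic pages never get there and directional judgment has to take over.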
The Mistake I See New Analysts Make
New analysts make two related errors. The first is claiming that a feature flag rollout "validated" a hypothesis. No hypothesis was stated. No randomization was applied. No statistical test was run. That is not validation — it is monitoring with a story attached.
The second is treating A/B test infrastructure as unnecessary overhead because feature flags "already do the same thing." This usually comes from engineering-oriented teams where deployment safety is well understood but experimental design is not. The tools look similar from the outside. The epistemological difference is enormous.
My Pro Tip
Use feature flags for deployment safety. Use A/B tests for learning. They are complementary tools, not substitutes.
The companies that do experimentation well use both — flags for the engineering workflow, experiments for the product workflow. The flag gets the code to production safely. The experiment tells you whether that code should stay.
When evaluating experimentation platforms, look for tools that integrate both capabilities. Statsig, Eppo, and LaunchDarkly Experimentation all combine feature flag management with proper statistical analysis. This integration eliminates the gap where teams deploy via flags and forget to measure via experiments.
Career Guidance
Understanding the distinction between deployment and experimentation is a genuine signal of analytical maturity. Engineers think in deployments: "Is it safe? Does it work? Can we roll it back?" Product people think in experiments: "Is it better? How much better? For which users?"
The best analysts bridge both worlds. They understand the engineering pipeline well enough to set up clean experiments, and they understand the statistical framework well enough to draw valid conclusions. If you can walk into a room where engineering says "we already flagged it" and clearly explain why that is not an experiment — without being condescending — you are operating at a level most analysts never reach.
Social platforms add a further layer of complexity: network effects mean that even proper randomization does not guarantee valid results, because treated users interact with control users. But that is a problem for teams that have already mastered the basics. First, get the flag-versus-experiment distinction right. Everything else builds on that foundation.