Every textbook defines A/B testing the same way: "Show version A to one group, version B to another, measure which performs better." That definition is technically correct and practically useless.
I have run hundreds of experiments across SaaS products, e-commerce sites, and lead generation funnels. The teams that struggle with experimentation almost never fail because they do not understand the definition. They fail because they treat A/B testing like "trying stuff" instead of running experiments.
This guide covers what A/B testing actually looks like in practice, the five components every valid test requires, and the mistakes I see new analysts make in their first six months.
What A/B Testing Really Is
A/B testing is a controlled experiment on user behavior. That word "controlled" is doing all the heavy lifting.
You take a population of users, randomly split them into groups, expose each group to a different experience, and measure whether the difference in experience produces a statistically significant difference in behavior. The randomization is what makes it an experiment rather than an observation. Without it, you are just comparing two groups that might be different for reasons you cannot see.
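The random split is usually implemented deterministically rather than with a coin flip, so each user sees the same arm on every visit. Here is a minimal sketch of one common approach, hashing a user ID together with an experiment name (the function and names are illustrative, not a specific tool's API):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, split: float = 0.5) -> str:
    """Deterministically assign a user to 'control' or 'variant'.

    Hashing user_id together with the experiment name gives a stable,
    effectively random bucket: the same user always sees the same arm,
    and assignments are independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # Map the first 8 hex characters to a uniform number in [0, 1).
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "control" if bucket < split else "variant"

# The same user always lands in the same group:
assert assign_variant("user-42", "green-button") == assign_variant("user-42", "green-button")
```

Seeding the hash with the experiment name matters: without it, the same users would land in the treatment group of every test you run, and your groups would stop being comparable.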
The Scientific Method Applied to Digital Products
The structure is identical to how pharmaceutical companies test new drugs. You have a control group that gets the existing treatment. You have a treatment group that gets the new thing. You measure outcomes. You use statistics to determine whether the difference is real or just noise.
The stakes are different — nobody dies if your checkout button is the wrong color — but the rigor should be the same. The moment you cut corners on the methodology, you stop running experiments and start running a confirmation bias machine.
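The "use statistics" step is often a two-proportion z-test on conversion rates. A minimal sketch, with hypothetical conversion counts (this is one standard analysis, not the only valid one):

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
    return z, p_value

# Hypothetical results: control converts 300/10,000, variant 360/10,000.
z, p = two_proportion_z_test(300, 10_000, 360, 10_000)
```

A small p-value says the observed difference would be unlikely under pure noise; it does not by itself say the difference is large enough to matter.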
The Five Components
Every valid A/B test has exactly five components:
- Hypothesis — A testable prediction about what will happen and why
- Control — The current experience (your baseline)
- Variant — The changed experience (your treatment)
- Metric — What you are measuring to determine success
- Traffic allocation — How you split users between groups
Remove any one of these and you do not have an A/B test. You have a guess with analytics attached.
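One practical way to enforce this is to capture all five components as a record before launch. A sketch, with illustrative field names:

```python
from dataclasses import dataclass

@dataclass
class ExperimentSpec:
    """The five components of a valid A/B test, captured before launch."""
    hypothesis: str            # "If we X, then Y will Z because ..."
    control: str               # description of the baseline experience
    variant: str               # description of the changed experience
    primary_metric: str        # the single success metric, fixed up front
    traffic_split: tuple       # (control share, variant share), sums to 1.0

    def is_valid(self) -> bool:
        # Remove any component and it is a guess, not a test.
        filled = all([self.hypothesis, self.control,
                      self.variant, self.primary_metric])
        return filled and abs(sum(self.traffic_split) - 1.0) < 1e-9
```

If a test cannot be written down in this form, it is not ready to launch.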
Split Testing vs. A/B Testing vs. Feature Flagging
These terms get used interchangeably, which creates confusion. Here is how I draw the lines.
A/B testing is the methodology — a controlled experiment with statistical analysis to determine a winner. The intent is to learn whether a change improves a metric.
Split testing is functionally the same thing. Some people use "split testing" to specifically mean splitting traffic between entirely different page URLs, while "A/B testing" means changing elements within the same page. In practice, the distinction rarely matters. The methodology is identical.
Feature flagging is an engineering tool for controlling who sees what code. You can use feature flags to run A/B tests, but feature flagging by itself is not testing. Rolling out a feature to 10% of users and watching your dashboards is monitoring, not experimentation. There is no control group, no statistical rigor, and no hypothesis. It is useful, but it is a different thing.
The confusion matters because teams that think feature flagging equals A/B testing skip the scientific method entirely. They ship features, watch metrics, and convince themselves they are data-driven.
Anatomy of a Valid Test
Let me break down each of the five components in detail, because this is where most teams cut corners.
1. The Hypothesis
A hypothesis is not "let's try a green button." A hypothesis is a testable statement that connects a change to an expected outcome through a mechanism.
Good format: "If we [change], then [metric] will [direction] because [reason]."
Example: "If we add social proof badges below the pricing table, then free-to-paid conversion will increase because prospects in the evaluation stage need third-party validation to overcome risk aversion."
The "because" clause is what separates a hypothesis from a guess. It forces you to articulate your theory of user behavior. When the test concludes, you learn something regardless of whether the variant won. If it lost, your theory about user behavior was wrong, and that is valuable information.
I cover hypothesis construction in detail in the setup guide (/blog/posts/how-to-set-up-ab-test-hypothesis-implementation).
2. The Control
The control is the current experience. It is your baseline. It seems straightforward, but teams mess this up in two ways.
First, they change the control mid-test. Someone pushes a hotfix that modifies the control experience, and now your baseline has shifted. Your test is contaminated.
Second, they do not document the control thoroughly enough. When you go back to analyze results six months later, you need to know exactly what the control experience looked like. Screenshots, recordings, and detailed descriptions are not optional. This is why I am such a strong advocate for maintaining test archives (/blog/posts/ab-test-archives-experimentation-knowledge-base).
3. The Variant
The variant is the changed experience. The most common mistake here is changing too many things at once. If your variant has a new headline, new image, new layout, and new CTA, and it wins, what did you learn? You cannot attribute the win to any specific change.
Change one thing per test. If you need to test combinations, that is multivariate testing (/blog/posts/ab-testing-vs-multivariate-vs-bandit-algorithms), and it requires significantly more traffic.
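The traffic cost of testing combinations is easy to quantify. With hypothetical numbers, four elements with two versions each produce sixteen distinct experiences, and your traffic gets divided across all of them:

```python
# Why "change one thing per test": testing every combination multiplies cells.
# Hypothetical example: 4 page elements, each with 2 versions.
elements = 4
versions_per_element = 2
cells = versions_per_element ** elements        # 16 distinct experiences

monthly_visitors = 20_000
visitors_per_cell = monthly_visitors // cells   # 1,250 per cell per month

# A plain A/B test splits the same traffic across just 2 cells:
ab_visitors_per_cell = monthly_visitors // 2    # 10,000 per cell per month
```

Each cell still needs enough visitors to reach significance on its own, which is why multivariate designs demand so much more traffic than a single-change A/B test.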
4. The Metric
Your primary metric is the single number that determines whether the test succeeded. You should define it before the test starts and not change it after you see results.
You should also track secondary metrics to understand the broader impact. A variant might increase clicks but decrease purchases. If you only measured clicks, you would declare victory and lose revenue.
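The clicks-up, purchases-down scenario can be made concrete. A sketch with fabricated per-user event records (the event shape and numbers are hypothetical):

```python
def funnel_metrics(events):
    """Compute click rate (primary) and purchase rate (secondary)
    from a list of per-user event dicts."""
    n = len(events)
    clicks = sum(1 for e in events if e["clicked"])
    purchases = sum(1 for e in events if e["purchased"])
    return {"click_rate": clicks / n, "purchase_rate": purchases / n}

# Hypothetical outcome: the variant wins on clicks but loses on purchases.
control = [{"clicked": i < 300, "purchased": i < 50} for i in range(1000)]
variant = [{"clicked": i < 400, "purchased": i < 30} for i in range(1000)]

c, v = funnel_metrics(control), funnel_metrics(variant)
assert v["click_rate"] > c["click_rate"]        # looks like a win...
assert v["purchase_rate"] < c["purchase_rate"]  # ...but revenue fell
```

Measured only on clicks, the variant ships; measured on both, it gets rejected. That is the entire argument for secondary metrics.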
I go deep on metric analysis, including segmentation, in the analysis guide (/blog/posts/how-to-analyze-ab-test-results-segmentation).
5. Traffic Allocation
Traffic allocation is how you divide users between control and variant. This seems simple, but the details matter.
The 50/50 Split and When It Changes
The standard A/B test runs a 50/50 split. Half your traffic sees the control, half sees the variant. This maximizes your statistical power — the ability to detect a real difference if one exists.
But 50/50 is not always the right call.
A/B/n tests split traffic equally across more variants. Three variants plus a control means 25% each. The tradeoff is that each group is smaller, so you need more traffic or more time to reach significance. I cover this in the guide on running multiple tests (/blog/posts/running-multiple-ab-tests-simultaneously).
Risk mitigation splits like 90/10 or 80/20 are useful when a variant carries real business risk. If you are testing a fundamental change to your checkout flow, you might not want to expose 50% of your revenue to a potentially broken experience. The 90/10 split reduces risk but dramatically increases the time needed to reach statistical significance.
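The runtime penalty of unequal splits can be estimated. For a fixed detectable effect, the total sample a test needs scales roughly with 1 / (q(1 - q)), where q is the variant's traffic share. A sketch, normalized to the 50/50 baseline (an approximation, not an exact power calculation):

```python
def relative_duration(variant_share: float) -> float:
    """Approximate test runtime relative to a 50/50 split.

    Required total sample scales roughly with 1 / (q * (1 - q)),
    which is minimized at q = 0.5.
    """
    q = variant_share
    return (1 / (q * (1 - q))) / (1 / 0.25)   # 50/50 baseline = 1.0x

for q in (0.5, 0.2, 0.1):
    print(f"{int((1 - q) * 100)}/{int(q * 100)} split -> "
          f"{relative_duration(q):.1f}x the runtime")
# 50/50 split -> 1.0x the runtime
# 80/20 split -> 1.6x the runtime
# 90/10 split -> 2.8x the runtime
```

A 90/10 split nearly triples the time to significance, so reserve it for changes where the downside risk genuinely justifies the wait.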
Bandit algorithms dynamically shift traffic toward winning variants. I cover when this makes sense in the article on A/B testing vs. bandits (/blog/posts/ab-testing-vs-multivariate-vs-bandit-algorithms).
When A/B Testing Is the Wrong Tool
A/B testing is not always the answer. Here are the situations where I tell teams to use something else.
Low traffic. If your site gets fewer than 10,000 visitors per month, most A/B tests will not reach statistical significance in a reasonable timeframe. You need volume for the math to work. Read the sample size and duration guide (/blog/posts/how-long-to-run-ab-test-sample-size) to understand the relationship between traffic, effect size, and test duration.
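You can check the math for your own numbers with the standard two-proportion sample-size approximation. A sketch at the conventional alpha = 0.05 (two-sided) and 80% power, with a hypothetical 3% baseline:

```python
from math import ceil

def visitors_per_arm(baseline: float, mde: float,
                     z_alpha: float = 1.96, z_beta: float = 0.8416) -> int:
    """Approximate visitors needed per arm (alpha=0.05 two-sided, 80% power)
    to detect an absolute lift of `mde` over a `baseline` conversion rate."""
    p = baseline
    return ceil(2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / mde ** 2)

# Hypothetical: 3% baseline conversion, detecting a 0.5-point absolute lift.
per_arm = visitors_per_arm(0.03, 0.005)
total = 2 * per_arm
months_at_10k = total / 10_000   # roughly 3-4 months at 10,000 visitors/month
```

At 10,000 visitors per month, even this modest test ties up your entire site for roughly a quarter of a year, and smaller effects take far longer still.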
No clear metric. If you cannot define what success looks like numerically, you cannot run a meaningful test. "We want to improve the user experience" is not a metric. Bounce rate, task completion rate, and Net Promoter Score are metrics.
Qualitative questions. If you need to understand why users behave a certain way, run user research instead. A/B tests tell you what works, not why. Start with CRO research methods (/blog/posts/cro-research-methods-ab-testing) to build qualitative insight before you test.
Tiny changes on low-impact pages. Testing the shade of gray on your footer links is a waste of infrastructure and attention. Prioritize your tests (/blog/posts/how-to-prioritize-ab-tests-pxl-framework) and focus experimentation resources where they can move meaningful metrics.
The Mistake Every New Analyst Makes
The most common mistake I see from new analysts is treating A/B testing as "trying stuff." They skip the hypothesis. They do not calculate sample size before launching. They peek at results after three days and call a winner.
This is not experimentation. It is theater that produces unreliable conclusions and erodes trust in the testing program.
The antidote is simple: slow down on the front end. Write the hypothesis. Calculate the sample size (/blog/posts/how-long-to-run-ab-test-sample-size). Define the primary metric. Document the control. Set the test duration before you launch and commit to not peeking until it is complete.
The teams that build the strongest experimentation cultures are not the ones running the most tests. They are the ones running the most rigorous tests.
Pro Tip: The Hypothesis Is the Whole Point
Here is the thing most people miss: a well-constructed hypothesis teaches you something whether the variant wins or loses.
If your hypothesis is "green button will increase clicks," you learn nothing from a loss. But if your hypothesis is "reducing cognitive load on the pricing page by collapsing feature comparisons into expandable sections will increase plan selection because users currently experience decision paralysis when confronted with a full feature matrix" — now a loss tells you that decision paralysis is not the barrier. That changes what you test next.
The hypothesis drives the entire program forward. Every test should make your team smarter about user behavior, not just tell you which design asset performed better. Understand the statistics behind this (/blog/posts/ab-testing-statistics-p-values-confidence-intervals) so you can tell the difference between a real learning and noise.
What to Learn Next
This article covers the foundation. Here is where to go from here:
- A/B Testing vs. Multivariate Testing vs. Bandits (/blog/posts/ab-testing-vs-multivariate-vs-bandit-algorithms) — understand when A/B testing is the right approach and when it is not
- The A/B Testing Process (/blog/posts/ab-testing-process-research-prioritize-test-analyze) — the end-to-end workflow from research to analysis
- How to Set Up an A/B Test (/blog/posts/how-to-set-up-ab-test-hypothesis-implementation) — the practical mechanics of configuring and launching a test
- A/B Testing Statistics (/blog/posts/ab-testing-statistics-p-values-confidence-intervals) — the statistical foundations you need to interpret results correctly