Statistics is the language of experimentation. You don’t need a PhD to run A/B tests, but you need statistical literacy. Without it, you’re just looking at numbers on a dashboard and guessing.
I’ve watched smart marketers and product managers make confident decisions based on a complete misunderstanding of what their test results actually mean. The goal of this article is to fix that. I’m going to walk you through the core statistical concepts behind A/B testing in plain language, explain what your tools are actually telling you, and show you where most new analysts go wrong.
This is part of my complete A/B testing guide series (/blog/posts/what-is-ab-testing-practitioners-guide), and it’s one of the most important pieces. If you only read one technical article in this series, make it this one.
Three Foundational Concepts You Need First
Before we get into p-values and confidence intervals, you need to internalize three ideas that underpin all of A/B testing statistics.
Mean: Your Best Guess from Limited Data
When you run a test, you’re measuring a sample of your users, not the entire population. The sample average (mean) is your best estimate of the true conversion rate. But it’s still an estimate. Two different samples from the same population will give you two different means. This is normal, and it’s exactly why we need statistics.
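You can see this for yourself with a quick simulation. The sketch below (Python, with a hypothetical "true" rate of 5% and made-up sample sizes) draws two samples from the same population; they land on different observed means, even though nothing differs underneath:

```python
import random

random.seed(42)

TRUE_RATE = 0.05      # the hypothetical "true" conversion rate we never observe directly
SAMPLE_SIZE = 2_000   # visitors per sample (illustrative)

def sample_mean(rate, n):
    """Simulate n visitors and return the observed conversion rate."""
    conversions = sum(1 for _ in range(n) if random.random() < rate)
    return conversions / n

# Two samples from the same population give two different means.
mean_a = sample_mean(TRUE_RATE, SAMPLE_SIZE)
mean_b = sample_mean(TRUE_RATE, SAMPLE_SIZE)
print(mean_a, mean_b)  # both hover near 0.05, but they rarely match exactly
```

Run it a few times with different seeds and you'll watch the means bounce around the true rate — that bounce is exactly the noise statistics exists to handle.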
Variance: The Natural Noise in Your Data
Variance measures how spread out your data is. High variance means your conversion rates bounce around a lot from session to session. Low variance means they’re relatively stable. Variance matters because it determines how much data you need to detect a real difference. High-variance metrics require bigger samples. This is why calculating sample size before you start (/blog/posts/how-long-to-run-ab-test-sample-size) is non-negotiable.
Sampling: We Never Know the Truth Directly
You can never measure the “true” conversion rate of your page. You can only estimate it from the sample of visitors who happened to show up during your test. The entire field of inferential statistics exists because of this gap between what we observe and what’s actually true. Every number your testing tool shows you is an estimate with uncertainty baked in.
What Is a P-Value? (And Three Wrong Definitions Everyone Uses)
The p-value is the most misunderstood concept in A/B testing. I hear wrong definitions constantly, even from people who should know better.
The Three Wrong Definitions
WRONG: “The probability that B is better than A.” Nope. The p-value says nothing about the probability that either variant is better. That’s a Bayesian question, and the p-value is a frequentist concept.
WRONG: “The probability we’ll make a mistake choosing B.” Also no. The p-value doesn’t tell you the probability of making any specific decision error.
WRONG: “The chance the result is real.” Still wrong. The p-value doesn’t quantify how “real” a result is.
The Actual Definition
RIGHT: “The probability of seeing a result this extreme (or more extreme) if there were actually no difference between A and B.”
Think of it as a surprise meter. You start by assuming there’s no difference between your variants (the null hypothesis). Then you ask: given that assumption, how surprising is the data I actually collected? A small p-value means the data would be very surprising under the “no difference” assumption. A large p-value means the data is perfectly consistent with no difference existing.
Here’s the analogy I use: Imagine a coin you suspect might be loaded. You flip it 100 times and get 60 heads. The p-value answers: “If this coin were perfectly fair, how likely is it I’d see 60 or more heads in 100 flips?” If that probability is very low, you have reason to doubt the coin is fair.
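That coin p-value is easy to compute directly from the binomial distribution — a quick sketch in Python:

```python
from math import comb

def one_tail_p(heads, flips, p_fair=0.5):
    """P(seeing `heads` or more heads) if the coin were fair: the p-value."""
    return sum(comb(flips, k) * p_fair**k * (1 - p_fair)**(flips - k)
               for k in range(heads, flips + 1))

p = one_tail_p(60, 100)
print(f"{p:.4f}")  # roughly 0.028 — 60+ heads would be fairly surprising for a fair coin
```

A p-value around 0.028 means a fair coin would produce a result this lopsided less than 3% of the time, which is why you'd start doubting the coin.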
Why p < 0.05 Is a Convention, Not a Law
The 0.05 threshold is a convention that dates back to Ronald Fisher in the 1920s. There’s nothing magical about it. In some fields, the standard is 0.01 or even 0.001. In conversion optimization, 0.05 is the accepted standard, but you should set your threshold before you start the test and stick with it. I covered this in my article on setting up your test properly (/blog/posts/how-to-set-up-ab-test-hypothesis-implementation).
Statistical Significance: What It Means and What It Doesn’t
“Statistical significance” simply means the p-value fell below your pre-set threshold. That’s it.
What it means: The observed difference is unlikely to have occurred by random chance alone, given your significance threshold.
What it does NOT mean:
- The result is practically important
- The effect is large
- The result will replicate
- You should definitely implement the change
A 0.01% improvement in conversion rate can be “statistically significant” if you have millions of visitors. But is that improvement worth the engineering effort to implement? Probably not. Significance tells you the signal is likely real. It says nothing about whether the signal matters.
And critically, significance is NOT a stopping rule. You cannot peek at your results daily and stop the test the moment you see p < 0.05. That inflates your false positive rate dramatically. I explained why in detail in my article on how long to run your test (/blog/posts/how-long-to-run-ab-test-sample-size).
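If you want to convince yourself (or a skeptical stakeholder) that peeking inflates false positives, here's a small Monte Carlo sketch. It runs simulated A/A tests, where no real difference exists by construction, and compares how often the test "wins" when you peek daily versus when you wait for the planned end date. All the traffic numbers are made up for illustration:

```python
import random
from statistics import NormalDist

random.seed(1)
Z_CRIT = NormalDist().inv_cdf(0.975)  # two-tail critical value for alpha = 0.05

def z_stat(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-statistic with a pooled standard error."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se

def aa_test(days=14, visitors_per_day=500, rate=0.05, peek=True):
    """One A/A test (no real difference); return True if it ever looks 'significant'."""
    conv_a = conv_b = n = 0
    for _ in range(days):
        n += visitors_per_day
        conv_a += sum(random.random() < rate for _ in range(visitors_per_day))
        conv_b += sum(random.random() < rate for _ in range(visitors_per_day))
        if peek and abs(z_stat(conv_a, n, conv_b, n)) > Z_CRIT:
            return True  # stopped early on a false positive
    return abs(z_stat(conv_a, n, conv_b, n)) > Z_CRIT

RUNS = 300
peeking_fp = sum(aa_test(peek=True) for _ in range(RUNS)) / RUNS
waiting_fp = sum(aa_test(peek=False) for _ in range(RUNS)) / RUNS
print(peeking_fp, waiting_fp)  # the peeking rate lands well above the nominal 5%
```

With daily peeking over two weeks, the false positive rate typically lands around two to three times the nominal 5%, while the wait-until-the-end version stays near 5%. The exact figures vary run to run, but the inflation is unmistakable.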
Type I and Type II Errors in Business Terms
Statisticians talk about Type I and Type II errors. Let me translate these into language that matters for your business.
Type I Error (False Positive)
You conclude that your variant is better when it actually isn’t. In business terms: you implement a change that doesn’t help, possibly hurts, and you waste development resources doing it. The probability of a Type I error is your significance level (alpha), typically 5%.
Type II Error (False Negative)
You conclude there’s no difference when a real improvement exists. In business terms: you throw away a winning idea because your test failed to detect the improvement. You leave money on the table. The probability of a Type II error is called beta, and (1 - beta) is your statistical power.
The Tradeoff
Reducing one type of error increases the other, unless you increase your sample size. Want fewer false positives? Set a stricter significance threshold, but you’ll miss more real effects. Want fewer false negatives? You need more power, which means more traffic or longer tests. This is the fundamental tension in planning your test duration (/blog/posts/how-long-to-run-ab-test-sample-size).
Statistical Power: The Concept Most Teams Ignore
Statistical power is the probability of detecting a real effect when one exists. If your test has 80% power (the industry standard), there’s a 20% chance you’ll miss a real improvement of the size your test was designed to detect.
Most testing tools don’t surface power calculations prominently, which is a problem. Teams run underpowered tests constantly. They test for a week, see no significance, and declare the variant a loser, when the real issue is they never had enough data to detect the effect in the first place.
How to increase power:
- Bigger sample size: More visitors means more ability to detect small effects
- Bigger effect size: Test bolder changes that produce larger lifts
- Longer test duration: Run the test until you hit your required sample size
- Lower variance metrics: Choose primary metrics with less natural variability
I always calculate required sample size before launching any test. It’s the difference between experimentation and gambling.
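Here's a minimal sample-size calculator using the standard two-proportion formula. The baseline rate and minimum detectable effect below are illustrative placeholders — plug in your own numbers:

```python
import math
from statistics import NormalDist

def required_sample_size(baseline, mde_relative, alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-tail, two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + mde_relative)  # smallest lift you care to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# e.g. a 5% baseline, hoping to detect a 10% relative lift:
print(required_sample_size(0.05, 0.10))  # roughly 31,000 visitors per arm
```

Notice how quickly the requirement balloons: halve the detectable effect and the sample size roughly quadruples, because the effect size is squared in the denominator.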
Confidence Intervals: The Most Underrated Metric
Your conversion rate is not a single number. It’s a range. A confidence interval tells you: “We’re 95% confident the true conversion rate falls somewhere between X% and Y%.”
This is far more useful than a point estimate. Telling your stakeholder “the variant improved conversion by 5%” sounds precise but is misleading. Telling them “the variant improved conversion by somewhere between 1% and 9%” is honest and gives them the information they need to make a good decision.
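Here's a minimal sketch of how such an interval is computed, using the standard Wald approximation for the difference between two conversion rates. The visitor and conversion counts are hypothetical:

```python
from statistics import NormalDist

def diff_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald confidence interval for the difference in conversion rates (B - A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

low, high = diff_ci(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"B - A is between {low:.2%} and {high:.2%}")
```

In this made-up example the interval just barely crosses zero — which is exactly the kind of nuance a bare "+12% relative lift!" headline would hide from your stakeholders.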
The Driving Analogy
Imagine two people estimate their commute time. Person A says “60 minutes, plus or minus 20 minutes.” Person B says “40 minutes, plus or minus 5 minutes.” Person B’s estimate is more useful, not because the average is lower, but because the margin of error is tighter. You know what to expect.
The same applies to your A/B tests. A test that shows +3% with a tight confidence interval is more useful than one showing +10% with a wide interval that crosses zero.
Overlap and What It Tells You
If the confidence intervals of your two variants overlap significantly, you need more data. Overlapping intervals don’t automatically mean there’s no difference, but they do mean your estimate is too imprecise to be actionable. This is covered extensively in my guide on analyzing test results (/blog/posts/how-to-analyze-ab-test-results-segmentation).
One-Tail vs. Two-Tail Tests
This distinction causes more confusion than it deserves.
One-tail test: You only care if the variant is better than the control. The test only looks for differences in one direction.
Two-tail test: You care about differences in either direction, whether the variant is better or worse.
Testing tools differ here: some default to one-tail tests, since “Is B better than A?” is typically the question practitioners are asking, while others default to two-tail. The practical difference is that a one-tail test will reach significance faster on the same data, because it concentrates all its statistical power in one direction.
My advice: know which one your tool uses, understand that you can roughly convert between them (a one-tail p-value of 0.05 is approximately equivalent to a two-tail p-value of 0.10), and don’t lose sleep over it. There are bigger statistical fish to fry.
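That rough conversion is easy to verify with the normal distribution:

```python
from statistics import NormalDist

norm = NormalDist()
z = norm.inv_cdf(0.95)                 # the z-score whose one-tail p-value is exactly 0.05
one_tail_p = 1 - norm.cdf(z)
two_tail_p = 2 * (1 - norm.cdf(z))     # same z, but counting both directions
print(round(one_tail_p, 3), round(two_tail_p, 3))  # 0.05 and 0.1
```

The two-tail p-value is simply double the one-tail value for the same test statistic, which is why the 0.05/0.10 rule of thumb holds.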
Effect Size: The Metric Your Dashboard Hides
Statistical significance tells you IF there’s likely a real difference. Effect size tells you HOW BIG that difference is. These are completely different questions, and most dashboards emphasize the first while burying the second.
A 0.1% improvement in conversion rate can achieve statistical significance with enough traffic. But implementing a change for 0.1% is rarely worth it when you factor in development costs, QA time, and opportunity cost. Always look at the absolute and relative effect size alongside significance.
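Computing effect size is trivial — the discipline is in reporting it. A small illustration with made-up rates:

```python
def effect_size(p_control, p_variant):
    """Absolute and relative lift — report both alongside the p-value."""
    absolute = p_variant - p_control
    relative = absolute / p_control
    return absolute, relative

abs_lift, rel_lift = effect_size(0.050, 0.0505)
print(f"absolute: {abs_lift:+.2%}, relative: {rel_lift:+.1%}")
# a "significant" +1% relative lift may still be only +0.05 points in absolute terms
```

Reporting both numbers matters because stakeholders anchor on whichever one they see: a "+10% lift" headline hides the fact that it might mean going from 0.5% to 0.55% conversion.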
I’ve seen teams celebrate a “significant” result that amounted to an extra 3 conversions per month. Effect size keeps you honest. When you build your test archive (/blog/posts/ab-test-archives-experimentation-knowledge-base), record effect sizes alongside significance for every test. Over time, you’ll develop intuition for what “big” looks like in your business.
Revenue Per Visitor and Non-Standard Metrics
Here’s where things get tricky. Standard proportion-based significance tests (the ones most tools run) work great for binary metrics like conversion rate (someone either converts or doesn’t). But they break down for revenue-based metrics.
Revenue per visitor (RPV) data is not normally distributed. Most visitors spend zero, a few spend a little, and a tiny number spend a lot. The distribution is heavily skewed, which violates the assumptions of standard tests.
For RPV and similar continuous, skewed metrics, you need non-parametric tests like the Mann-Whitney U test (also known as the Wilcoxon rank-sum test). These tests don’t assume your data follows a normal distribution. If your testing tool doesn’t support these natively, flag this with your data science team. Running a standard test on RPV data can produce unreliable results. This is one area where understanding external validity threats (/blog/posts/ab-testing-external-validity-threats) becomes critical.
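For the curious, here's a simplified sketch of the rank-sum idea from first principles, run against simulated zero-inflated revenue data. It uses the large-sample normal approximation and omits the variance tie-correction, so treat it as illustrative rather than production-grade (real libraries like scipy handle ties properly):

```python
import random
from statistics import NormalDist

def rank_sum_p(sample_a, sample_b):
    """Two-tail p-value for a rank-sum (Mann-Whitney U) test via the large-sample
    normal approximation. Simplified: tied values get average ranks, but the
    tie-correction to the variance is omitted."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled = sorted(sample_a + sample_b)
    ranks = {}                          # average rank for each distinct value
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # mean of 1-indexed ranks i+1 .. j
        i = j
    r_a = sum(ranks[v] for v in sample_a)
    u = r_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = (n_a * n_b * (n_a + n_b + 1) / 12) ** 0.5
    z = (u - mean_u) / sd_u
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(7)

def revenue(n, buy_rate, scale):
    """Zero-inflated, skewed spend: most visitors spend nothing (illustrative)."""
    return [random.expovariate(1 / scale) if random.random() < buy_rate else 0.0
            for _ in range(n)]

p = rank_sum_p(revenue(3000, 0.05, 50), revenue(3000, 0.05, 50))
print(p)  # an A/A comparison: no real difference, so p is usually large
```

The key property to notice: the test only uses the *ordering* of the values, never their magnitudes, which is why one whale customer spending 100x the average can't distort the result the way it would in a t-test on raw revenue.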
The Biggest New Analyst Mistake
Memorizing “p < 0.05 means it worked” without understanding what the p-value actually represents. This creates the single biggest credibility gap when a data scientist or statistician questions your results.
I’ve been in rooms where an analyst presents a test result, a data scientist asks “what does your p-value actually mean here?” and the analyst freezes. Don’t be that person. If you’ve read this far, you already understand more than most practitioners. But understanding and explaining are different skills. Practice explaining these concepts to a colleague who doesn’t know statistics. If you can make it clear to them, you’ve truly internalized it.
Pro Tip: Start with Confidence Intervals
If I could teach new analysts only one thing about A/B testing statistics, it would be this: learn to read confidence intervals before you learn anything else. They tell you more about your test than any other single number.
A confidence interval shows you the range of plausible effect sizes, whether your estimate is precise or noisy, and whether the result is practically meaningful, all in one visual. When you present results to stakeholders, leading with the confidence interval instead of the p-value will make your recommendations more credible and your conversations more productive.
Career Guidance: Statistics Is Your Multiplier
Statistics is the skill that separates “test operators” from “experimentation strategists.” Anyone can click buttons in a testing tool. Understanding the math behind those buttons is what makes you the person who gets consulted on testing strategy, not just test execution.
My recommendation: take an introductory statistics course. Khan Academy, Coursera, whatever works for your learning style. It doesn’t need to be advanced. Just cover hypothesis testing, distributions, and confidence intervals. That foundation will change how you think about every test you run.
What to Learn Next
Now that you have the statistical foundation, here’s where to go deeper:
- How Long to Run Your Test (/blog/posts/how-long-to-run-ab-test-sample-size) covers power analysis and sample size calculations in practical terms
- How to Analyze Results (/blog/posts/how-to-analyze-ab-test-results-segmentation) shows you how to apply these concepts when interpreting real test data
- Validity Threats (/blog/posts/ab-testing-external-validity-threats) explains what can go wrong even when your statistics are technically correct
- Bayesian vs. Frequentist (/blog/posts/bayesian-vs-frequentist-ab-testing) presents an entirely different framework for thinking about test results
Statistics isn’t the fun part of experimentation. But it’s the part that makes everything else trustworthy. Get this right, and every test you run from here on will be on solid ground.