Your A/B testing tool picks a statistical test for you. But do you know which one it picked, or why? I’ve audited dozens of experimentation programs and found the same problem over and over: analysts running tests with the wrong statistical method and not realizing their results are unreliable.

When the default test is wrong for your data, your results are wrong. Not a little wrong — fundamentally wrong. You can get false positives, miss real effects, or make decisions based on p-values that don’t mean what you think they mean.

This article gives you a practical framework for choosing the right statistical test based on what you’re actually measuring. You don’t need to derive the math. But you do need to know which test fits your data.

The Core Question: What Type of Data Are You Measuring?

Every statistical test makes assumptions about the underlying data distribution. When those assumptions are violated, the test’s results become unreliable. The first step in choosing the right test is identifying what kind of data you’re working with.

There are five common data types in A/B testing, and each one maps to a different family of statistical tests. Let me walk through each one.

Continuous Data: Revenue, Time on Site, AOV

When your metric is continuous — it can take any numerical value within a range — the first question is how closely its distribution resembles a normal curve, because that determines which tests you can trust.

Welch’s t-test is the default choice for continuous metrics, and it should be your go-to 90% of the time. Unlike the classic Student’s t-test, Welch’s does NOT assume equal variances between your control and variant groups. This matters more than most people realize. If your variant changes user behavior, it often changes the variance of your metric too. Welch’s handles this gracefully.

Student’s t-test assumes equal variances between groups. Use it only when you’re confident the variances are similar — which, in practice, is almost never in A/B testing. Welch’s t-test is almost always the safer choice because it reduces to Student’s t-test when variances happen to be equal, but doesn’t break when they’re not.

When to use these: average order value, session duration, pages per session, revenue per paying customer (note: revenue per paying customer, not revenue per visitor — that distinction matters, and I’ll explain why below).

The practical difference between Welch’s and Student’s t-test is negligible at large sample sizes. But at smaller samples — say under 5,000 per group — the variance assumption matters. If you remember one thing from this section: default to Welch’s.
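In Python, this is a one-argument difference: a minimal sketch with scipy, using simulated session durations (the numbers are illustrative). Setting `equal_var=False` in `scipy.stats.ttest_ind` gives Welch's test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Simulated session durations (seconds); the variant shifts both mean and variance
control = rng.normal(loc=180, scale=60, size=2000)
variant = rng.normal(loc=186, scale=90, size=2000)

# Welch's t-test: equal_var=False drops the equal-variance assumption
t_welch, p_welch = stats.ttest_ind(control, variant, equal_var=False)

# Student's t-test for comparison (pools the variances)
t_student, p_student = stats.ttest_ind(control, variant, equal_var=True)

print(f"Welch:   t={t_welch:.3f}, p={p_welch:.4f}")
print(f"Student: t={t_student:.3f}, p={p_student:.4f}")
```

With equal group sizes the two statistics actually coincide and only the degrees of freedom differ, which is part of why defaulting to Welch's costs you essentially nothing.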

Binary Data: Conversion Rates and Click-Through Rates

When your metric is a yes/no outcome — did the user convert or not, did they click or not — you’re dealing with binomial data.

Z-test for proportions is the workhorse for conversion rate testing. This is what most A/B testing tools use by default when you’re comparing conversion rates between two groups. It’s fast, well-understood, and works reliably at the sample sizes most websites generate.

Fisher’s exact test is designed for small samples — roughly under 1,000 observations per group. Where the Z-test relies on a normal approximation that breaks down with small numbers, Fisher’s computes exact probabilities. If you’re testing on a low-traffic page or a niche B2B product, Fisher’s is the safer choice.

Barnard’s test is more statistically powerful than Fisher’s for 2x2 tables (the standard A/B test comparison), but it’s computationally heavier. Most tools don’t offer it by default, but it’s worth knowing about when every bit of statistical power matters — like when your sample sizes are constrained (/blog/posts/how-long-to-run-ab-test-sample-size) and you can’t afford to miss a real effect.

When to use these: conversion rate, click-through rate, bounce rate, form completion rate, add-to-cart rate, or any metric where each user either does or doesn’t do something.

The Z-test and the chi-squared test (discussed below) give identical results for two-group comparisons of proportions, as long as the chi-squared test is run without Yates' continuity correction. The difference is that chi-squared generalizes to more than two groups, which makes it useful for multivariate tests (/blog/posts/ab-testing-vs-multivariate-vs-bandit-algorithms).
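You can verify the equivalence numerically. A sketch assuming scipy and statsmodels are installed, with made-up conversion counts; note the chi-squared test must skip the continuity correction for the match to be exact:

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

# Illustrative counts: 5.2% vs 5.8% conversion on 10,000 visitors per arm
conversions = np.array([520, 580])
visitors = np.array([10000, 10000])

# Two-proportion Z-test (pooled variance under the null)
z_stat, p_z = proportions_ztest(conversions, visitors)

# Same comparison as a 2x2 chi-squared test, without Yates' correction
table = np.array([conversions, visitors - conversions]).T
chi2, p_chi2, dof, _ = chi2_contingency(table, correction=False)

print(f"z={z_stat:.4f}, z^2={z_stat**2:.4f}, chi2={chi2:.4f}")
print(f"p (z-test)={p_z:.6f}, p (chi-squared)={p_chi2:.6f}")
```

The squared Z statistic equals the chi-squared statistic, which is why the two p-values match exactly.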

Count Data: Pageviews Per Session, Events Per User

When your metric counts discrete events — how many pages a user viewed, how many items they added to cart, how many support tickets they filed — you’re dealing with count data that typically follows a Poisson distribution.

E-test and C-test are designed for comparing Poisson-distributed counts between groups. These are less commonly discussed than t-tests or Z-tests, but they’re the correct choice when your metric is a count.

When to use these: pageviews per session, items viewed per visit, errors per user, support tickets per customer, interactions per session.

The mistake I see here is analysts treating count data as continuous and throwing a t-test at it. For high counts (averages above 20-30), the normal approximation works reasonably well and a t-test won’t mislead you too badly. But for low counts — like errors per user or purchases per session, where most values are 0, 1, or 2 — the Poisson assumption matters.

Multiple Categories: Menu Choices and Survey Responses

When users choose between multiple options — not just yes/no, but option A, B, C, or D — you need tests designed for multinomial data.

Chi-squared test is the classic choice for testing whether the distribution across categories differs between groups. If you’re running a test where users see different navigation menus and you want to know if the distribution of clicks across menu items changed, chi-squared is your test.

G-test (log-likelihood ratio) serves a similar purpose but is more accurate for smaller samples. It’s mathematically related to chi-squared and gives nearly identical results at large samples, but performs better when some categories have low expected counts.
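Both tests are available through the same scipy function. A sketch with made-up click counts across four menu items, where `lambda_="log-likelihood"` switches the statistic from Pearson's chi-squared to the G-test:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Clicks on four menu items, control vs variant (illustrative counts)
observed = np.array([
    [340, 210, 95, 55],   # control
    [310, 250, 80, 60],   # variant
])

# Classic Pearson chi-squared test of independence
chi2, p_chi2, dof, expected = chi2_contingency(observed)

# G-test: same call, log-likelihood-ratio statistic
g, p_g, _, _ = chi2_contingency(observed, lambda_="log-likelihood")

print(f"chi-squared: stat={chi2:.3f}, p={p_chi2:.4f} (dof={dof})")
print(f"G-test:      stat={g:.3f}, p={p_g:.4f}")
```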

When to use these: which product users select from a category page, navigation click distribution, multi-option survey responses, plan selection (free vs. basic vs. premium).

Chi-squared is also the foundation of multivariate testing analysis (/blog/posts/ab-testing-vs-multivariate-vs-bandit-algorithms), where you’re testing multiple changes simultaneously and need to understand which combinations of changes drive different outcomes.

Non-Normal and Skewed Data: Revenue Per Visitor

Here’s where most A/B testing programs go wrong. Revenue per visitor is the single most important metric for e-commerce, and it’s the metric most commonly analyzed with the wrong test.

The problem: revenue per visitor is NOT normally distributed. Most visitors buy nothing, creating a massive spike at zero. Of those who do buy, the distribution is heavily right-skewed — a few big purchases pull the average way above the median. Running a standard t-test on this data is statistically questionable at best and misleading at worst.

Mann-Whitney U test (also called Wilcoxon rank-sum test) is a rank-based test that makes NO assumptions about the underlying distribution. Instead of comparing means directly, it ranks all observations and tests whether one group tends to have higher ranks than the other. This makes it robust to outliers, skewness, and zero-inflation — exactly the problems that plague revenue data.
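A sketch of Mann-Whitney U on simulated zero-inflated revenue data; the buy rate and lognormal parameters are illustrative, not a model of your traffic:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(7)

def simulate_revenue(n, buy_rate, mean_log, sigma_log):
    """Zero-inflated revenue: most visitors spend nothing."""
    buys = rng.random(n) < buy_rate
    spend = rng.lognormal(mean_log, sigma_log, n)
    return np.where(buys, spend, 0.0)

control = simulate_revenue(20000, 0.040, 3.5, 1.0)
variant = simulate_revenue(20000, 0.045, 3.5, 1.0)

# Rank-based comparison: no normality assumption, robust to the zero spike
u_stat, p_value = mannwhitneyu(control, variant, alternative="two-sided")
print(f"U={u_stat:.0f}, p={p_value:.4f}")
```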

Bootstrap methods take a different approach: they resample your data thousands of times to empirically estimate the sampling distribution of your test statistic. No distributional assumptions required. Bootstrap confidence intervals for the difference in means are increasingly popular at sophisticated experimentation shops.
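A minimal percentile-bootstrap sketch in plain numpy; the simulated revenue arrays are placeholders for real per-visitor revenue:

```python
import numpy as np

rng = np.random.default_rng(11)
# Placeholder zero-inflated revenue; substitute real per-visitor arrays
control = rng.lognormal(3.0, 1.2, 5000) * (rng.random(5000) < 0.05)
variant = rng.lognormal(3.0, 1.2, 5000) * (rng.random(5000) < 0.06)

# Resample each group with replacement and record the difference in means
n_boot = 5000
diffs = np.empty(n_boot)
for i in range(n_boot):
    c = rng.choice(control, size=control.size, replace=True)
    v = rng.choice(variant, size=variant.size, replace=True)
    diffs[i] = v.mean() - c.mean()

# Percentile 95% CI for the lift in mean revenue per visitor
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"95% bootstrap CI for lift: [{lo:.2f}, {hi:.2f}]")
```

If the interval excludes zero, you have evidence of a real revenue difference without ever assuming normality.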

Bayesian approaches model the full posterior distribution of your metric, which naturally handles non-normal data. If you’re using a Bayesian A/B testing framework (/blog/posts/bayesian-vs-frequentist-ab-testing), the choice of prior and likelihood function handles the distribution problem more elegantly than frequentist tests.

When to use these: revenue per visitor, lifetime value, any metric with extreme skew, metrics with lots of zeros, metrics with heavy-tailed distributions.

The Practical Decision Tree

Here’s the framework I use when choosing a statistical test for any A/B experiment:

  1. Is your metric a proportion (CVR, CTR, bounce rate)? Use a Z-test for proportions. If sample size is under 1,000 per group, use Fisher’s exact test.
  2. Is your metric continuous and roughly normally distributed? Use Welch’s t-test. Not Student’s — Welch’s.
  3. Is your metric continuous but heavily skewed (revenue per visitor)? Use Mann-Whitney U test or bootstrap methods.
  4. Is your metric a count of events? Use a Poisson-based test (E-test or C-test).
  5. Are you comparing distributions across multiple categories? Use chi-squared or G-test.
  6. Small sample size (under 1,000)? Lean toward Fisher’s exact test for proportions or Mann-Whitney for continuous data. Avoid tests that rely on normal approximations.

When in doubt about the distribution, Mann-Whitney U is the Swiss army knife. It's less powerful than parametric tests when their assumptions hold — meaning it needs a somewhat larger sample to detect the same effect — and strictly speaking it tests whether one group tends to produce higher values, not whether the means differ. But it won't hand you misleading p-values because a normality assumption was violated, and that tradeoff is almost always worth it.
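The decision tree above can be sketched as a small dispatch function; the returned strings are labels for the tests discussed in this article, not library calls:

```python
def pick_test(metric_type: str, n_per_group: int, skewed: bool = False) -> str:
    """Sketch of the decision tree: map metric type and sample size to a test."""
    if metric_type == "proportion":
        # Small samples break the normal approximation behind the Z-test
        return "fisher_exact" if n_per_group < 1000 else "z_test_proportions"
    if metric_type == "continuous":
        # Heavy skew or small samples: avoid normality-based tests
        if skewed or n_per_group < 1000:
            return "mann_whitney_u"
        return "welch_t_test"
    if metric_type == "count":
        return "poisson_e_test"
    if metric_type == "categorical":
        return "chi_squared_or_g_test"
    raise ValueError(f"unknown metric type: {metric_type}")

print(pick_test("proportion", 500))                  # fisher_exact
print(pick_test("continuous", 20000, skewed=True))   # mann_whitney_u
```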

Z-Tests vs. T-Tests: When the Difference Matters

New analysts often confuse Z-tests and t-tests or use them interchangeably. Here’s the practical distinction.

A Z-test assumes you know the population variance. A t-test estimates the variance from your sample. At large sample sizes (n > 30 per group), the two converge — the t-distribution approaches the normal distribution, and the results are virtually identical.

At small sample sizes, the difference matters. The t-distribution has heavier tails, which means t-tests require slightly stronger evidence to declare significance. This is actually protective — it reduces false positives when your variance estimate is noisy.

For proportions, Z-tests are standard because the variance of a proportion is a direct function of the proportion itself — you don’t need to estimate it separately. For continuous metrics, t-tests are standard because the variance is unknown.
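You can see the convergence directly by computing two-sided p-values for the same test statistic under the normal and t distributions; the statistic value of 2.0 is arbitrary:

```python
from scipy import stats

# Same observed statistic, evaluated under z and t at different sample sizes
for n in (10, 30, 5000):
    statistic = 2.0
    df = 2 * n - 2  # degrees of freedom for a two-sample t-test
    p_z = 2 * stats.norm.sf(statistic)      # Z-test p-value
    p_t = 2 * stats.t.sf(statistic, df)     # t-test p-value
    print(f"n={n:>5}: p_z={p_z:.4f}, p_t={p_t:.4f}")
```

At n=10 the t-test demands noticeably more evidence; by n=5000 the two p-values are indistinguishable.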

Why Most Tools Get Revenue Wrong

The most consequential mistake in A/B testing statistics is applying proportion or normal-distribution tests to revenue data. Here’s why it happens and how to fix it.

Most A/B testing tools treat every metric the same way: compute the mean in each group, estimate the standard error assuming normality, and run a Z-test or t-test. This works fine for conversion rates and well-behaved continuous metrics. It falls apart for revenue per visitor.

Revenue per visitor typically looks like this: 95% of visitors have zero revenue, 4% have revenue between $20 and $100, and 1% have revenue above $100, with a long tail extending to thousands of dollars. The mean is pulled up by outliers. The variance is enormous. The distribution looks nothing like a bell curve.

When you run a t-test on this data, you get confidence intervals (/blog/posts/ab-testing-statistics-p-values-confidence-intervals) that are too narrow when outliers happen to land in one group, and too wide when they’re balanced. Your p-values become unreliable. You might declare a winner that won because one whale happened to land in the variant group.

The fix: use Mann-Whitney U for a frequentist approach, or bootstrap confidence intervals. Better yet, use CUPED (/blog/posts/cuped-variance-reduction-faster-ab-tests) with a pre-experiment covariate to reduce variance before testing. The combination of variance reduction and a distribution-appropriate test gives you reliable revenue analysis.

What New Analysts Get Wrong

The most common mistake is never questioning what test the tool picked. You click “analyze” and get a p-value. You report the p-value. Nobody asks whether the p-value was computed using an appropriate method.

The second most common mistake is using proportion tests on continuous data, or normal-distribution tests on heavily skewed data. Both produce results that look legitimate — you get a p-value, a confidence interval, a neat little graph — but the numbers are unreliable.

The third mistake is ignoring sample size requirements. Fisher’s exact test exists for a reason. When your test has only a handful of conversions per group, the normal approximation underlying a Z-test starts getting shaky. Use exact methods for small samples.

Career Guidance

You don’t need to derive these formulas. I have never once hand-computed a Mann-Whitney U statistic in a professional setting. But you need to know which test your tool uses and why. When a data scientist, a VP of engineering, or a skeptical PM asks “what test did you use?”, saying “whatever the tool picked” is not an acceptable answer.

The right answer sounds like this: “We used Welch’s t-test for session duration because the metric is continuous and we can’t assume equal variances. For conversion rate, we used a Z-test for proportions. For revenue per visitor, we used Mann-Whitney U because revenue is heavily right-skewed with zero-inflation.”

That answer takes five seconds and communicates that you understand what you’re doing. It builds trust with technical stakeholders and protects you from making bad decisions based on inappropriate statistical methods.

Learn the decision tree. Know your data types. Match them to the right test. It’s one of the highest-leverage skills in A/B test analysis (/blog/posts/how-to-analyze-ab-test-results-segmentation), and most analysts never bother to learn it.

Written by Atticus Li

Revenue & experimentation leader — behavioral economics, CRO, and AI. CXL & Mindworx certified. $30M+ in verified impact.