Here is a result that should be impossible but isn't: a variant that wins on the topline, ships, and then underperforms — because it was actually worse for every single segment of users, and only looked better because the traffic mix between the two arms wasn't the same. That's Simpson's paradox, and it's hiding in more test reports than anyone wants to admit.
TL;DR
- An aggregate result can reverse when you split it by segment. A variant that "wins" overall can lose in every meaningful subgroup — the topline is an average that a shift in traffic composition can quietly poison.
- The usual cause is that the two arms aren't comparable. A traffic-source change, a bot wave, or a mix shift mid-test loads one arm with easier-to-convert users. The lift you measured is a mix artifact, not a treatment effect.
- The default test report can't show you this — it shows the pooled number, and the pooled number is exactly the thing the paradox corrupts.
- The diagnostic takes one extra cut: pull the result by your major segments and confirm the direction holds. If the aggregate points one way and every segment points the other, believe the segments.
| Segment | Control conversion rate | Variant conversion rate | Who looks better |
|---|---|---|---|
| High-intent traffic | High | Slightly lower than control | Control |
| Low-intent traffic | Low | Slightly lower than control | Control |
| Pooled (all traffic) | Lower than variant | Higher than control | Variant (!) |
The pooled row disagrees with both segment rows. That's not a rounding quirk — it's Simpson's paradox, and it means the topline is telling you the opposite of the truth.
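To make the table concrete, here is a minimal sketch with made-up counts. Every number below is hypothetical, picked only so that the variant draws most of its traffic from the high-intent segment; nothing here comes from a real test.

```python
# Hypothetical (visitors, conversions) per segment and arm.
# The counts are invented for illustration: the variant arm is
# deliberately loaded with high-intent traffic.
counts = {
    "high_intent": {"control": (1000, 120), "variant": (3000, 330)},
    "low_intent": {"control": (3000, 90), "variant": (1000, 25)},
}
ARMS = ("control", "variant")


def segment_rate(segment, arm):
    """Conversion rate of one arm inside one segment."""
    visitors, conversions = counts[segment][arm]
    return conversions / visitors


def pooled_rate(arm):
    """Conversion rate of one arm across all segments combined."""
    visitors = sum(counts[s][arm][0] for s in counts)
    conversions = sum(counts[s][arm][1] for s in counts)
    return conversions / visitors


for segment in counts:
    c, v = (segment_rate(segment, arm) for arm in ARMS)
    print(f"{segment:12s} control={c:.3f} variant={v:.3f}")

print(f"{'pooled':12s} control={pooled_rate('control'):.3f} "
      f"variant={pooled_rate('variant'):.3f}")
```

With these counts the control converts better in both segments (12.0% vs 11.0%, and 3.0% vs 2.5%), yet the pooled rate favors the variant (about 8.9% vs 5.25%), purely because three quarters of the variant's traffic is high-intent.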
The 50-year-old example that explains your test report
The cleanest illustration is one of the most famous cases in statistics. In 1973, UC Berkeley's graduate admissions looked discriminatory: about 44% of male applicants were admitted versus about 35% of female applicants. The aggregate says bias against women. But when the data is split by department, women were admitted at a higher rate than men in four of the six largest departments (Bickel, Hammel & O'Connell, *Sex Bias in Graduate Admissions*, Science 1975).
The resolution: women disproportionately applied to the most competitive departments, which had low admission rates for everyone. The "bias" in the aggregate was a composition effect (who applied where), not a per-department effect. Split by the variable that actually mattered, the aggregate pattern reversed.
The identical structure appears in medicine. In the well-known kidney stone example, one treatment shows a higher overall success rate than the alternative while being worse for both small and large stones, because the better treatment is used more often on the harder cases. And it appears, constantly, in A/B tests. The lurking variable in your experiment is usually traffic composition: what mix of users landed in each arm.
How the mix breaks mid-test without anyone touching the tool
In a perfectly randomized experiment with stable traffic, Simpson's paradox shouldn't bite — randomization is supposed to balance the arms. The problem is that real programs rarely run in that clean condition. The mix shifts, and the shift isn't always symmetric across arms:
- A traffic-source change mid-test. Marketing launches a campaign halfway through, flooding one part of the funnel with a different intent profile. If that interacts with your randomization unit or timing, the arms stop being comparable.
- A sample ratio mismatch (SRM). If the observed split doesn't match the intended one, which is exactly the failure that SRM checks in experimentation governance exist to catch, one arm can carry a heavier load of a particular segment, and the pooled comparison is corrupted before you even look at it.
- Bot or referral waves hitting one arm harder than the other, inflating or deflating one side's baseline.
- Segment-correlated ramp. Rolling a test out to one geography or platform first, then another, so exposure timing correlates with user type.
Any of these can produce a pooled number that flatters the variant while every honest subgroup tells the opposite story. The tool isn't broken. The randomization assumption is.
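An uneven split is one failure in this list that can be checked mechanically. The sketch below is one way to do it, assuming an intended 50/50 split and a 0.001 alert threshold (a common convention for SRM checks, not something prescribed here). The function name `srm_p_value` and all counts are hypothetical. It runs a chi-square goodness-of-fit test on the arm sizes, first overall and then inside each segment.

```python
import math

ALERT_THRESHOLD = 0.001  # assumed convention; tune to your own governance rules


def srm_p_value(control_n, variant_n, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for an arm split."""
    total = control_n + variant_n
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (variant_n - expected_variant) ** 2 / expected_variant)
    # With 1 degree of freedom the chi-square tail probability
    # equals erfc(sqrt(chi2 / 2)), so no stats library is needed.
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical visitor counts per segment: (control, variant).
segment_visitors = {
    "high_intent": (1000, 3000),
    "low_intent": (3000, 1000),
}

overall_control = sum(c for c, _ in segment_visitors.values())
overall_variant = sum(v for _, v in segment_visitors.values())
overall_p = srm_p_value(overall_control, overall_variant)
print(f"overall: p={overall_p:.3g} flagged={overall_p < ALERT_THRESHOLD}")

segment_p = {name: srm_p_value(c, v) for name, (c, v) in segment_visitors.items()}
for name, p in segment_p.items():
    print(f"{name}: p={p:.3g} flagged={p < ALERT_THRESHOLD}")
```

In this invented case the overall split is a perfect 50/50 and passes, while both segments fail badly, so an overall SRM check alone would not have caught the imbalance. The per-segment version is only meaningful for attributes fixed before assignment, such as device or traffic source.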
The aggregate is an average weighted by traffic mix. If the mix differs between arms, the average is measuring the mix, not the treatment.
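That weighting can be made explicit. The sketch below, on invented counts, first rebuilds each arm's pooled rate as a mix-weighted average of its segment rates, then re-scores both arms against one common mix (the combined traffic of both arms). This reweighting is usually called direct standardization; it is offered as an illustration of the point, not as a prescribed method, and every name and number in it is hypothetical.

```python
# Hypothetical (visitors, conversions) per segment and arm.
counts = {
    "high_intent": {"control": (1000, 120), "variant": (3000, 330)},
    "low_intent": {"control": (3000, 90), "variant": (1000, 25)},
}
ARMS = ("control", "variant")


def segment_rates(arm):
    """Conversion rate per segment for one arm."""
    return {s: counts[s][arm][1] / counts[s][arm][0] for s in counts}


def traffic_mix(arms):
    """Share of visitors per segment, pooled over the given arms."""
    visitors = {s: sum(counts[s][a][0] for a in arms) for s in counts}
    total = sum(visitors.values())
    return {s: n / total for s, n in visitors.items()}


def weighted_rate(rates, mix):
    """Average of segment rates, weighted by a traffic mix."""
    return sum(rates[s] * mix[s] for s in rates)


common_mix = traffic_mix(ARMS)
observed = {}
adjusted = {}
for arm in ARMS:
    # Weighting by the arm's own mix reproduces its pooled rate.
    observed[arm] = weighted_rate(segment_rates(arm), traffic_mix((arm,)))
    # Weighting by the common mix removes the composition difference.
    adjusted[arm] = weighted_rate(segment_rates(arm), common_mix)
    print(f"{arm}: observed={observed[arm]:.4f} mix_adjusted={adjusted[arm]:.4f}")
```

On these numbers the observed pooled rates favor the variant, while the mix-adjusted rates favor the control. The segment rates never changed; only the weights did.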
The diagnostic: cut by segment before you trust the topline
The check is cheap. After any test that matters, before shipping, pull the result broken out by your two or three most meaningful segments, the ones you'd bet actually respond differently: high vs. low intent, mobile vs. desktop, new vs. returning. Then check one thing: does the direction of the effect hold in the segments, or does it reverse against the aggregate?
Three outcomes:
- Aggregate and all segments agree. The direction holds in every cut you checked, so the win is not a mix artifact. Ship with confidence.
- Aggregate positive, segments mixed but net-consistent. A genuine heterogeneous effect — the variant helps some users and not others. That's not a paradox; it's a segmentation insight worth acting on, possibly with a targeted rollout rather than a blanket ship.
- Aggregate positive, every segment negative (or vice versa). Simpson's paradox. The pooled number is a composition artifact. Do not ship on the aggregate — the variant is losing everywhere that a real user actually lives.
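The three outcomes can be written down as a small decision rule. This is a sketch only: the function name `classify_direction` and its labels are hypothetical, and it compares the sign of the lift without any significance testing, so a small or noisy segment can flip sign by chance. Treat the output as a prompt to investigate, not a verdict.

```python
def sign(x):
    """Return 1, -1, or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def classify_direction(aggregate_lift, segment_lifts):
    """Compare the direction of the pooled lift with each segment's lift.

    Lifts are variant rate minus control rate. Only the sign is used.
    """
    agg = sign(aggregate_lift)
    signs = [sign(lift) for lift in segment_lifts.values()]
    if agg == 0 or not signs:
        return "inconclusive"
    if all(s == agg for s in signs):
        return "consistent"
    if all(s == -agg for s in signs):
        return "simpsons_reversal"
    return "mixed"


# Hypothetical lifts: pooled result positive, both segments negative.
result = classify_direction(
    aggregate_lift=0.036,
    segment_lifts={"high_intent": -0.010, "low_intent": -0.005},
)
print(result)
```

A "mixed" result corresponds to the heterogeneous-effect case; whether it is net-consistent still needs a look at segment sizes and effect magnitudes, which this rule deliberately leaves out.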
The third case is the dangerous one, and it's invisible unless you deliberately look. This is the same diagnostic instinct behind channel cannibalization: the topline can be arithmetically correct and still point the wrong way, because it's aggregating over a structure that hides the real mechanism. Cannibalization hides substitution between channels; Simpson's paradox hides mix shift between arms. Different mechanism, same lesson — the pooled number is where truth goes to hide.
The pattern across many tests
After pulling segment cuts on enough experiments, a practitioner-level intuition forms: the tests most vulnerable to this are the ones run during periods of changing traffic — a campaign launch, a seasonal spike, a platform rollout. Stable-traffic tests rarely surprise you on the segment cut. Volatile-traffic tests surprise you often enough that the cut becomes reflexive. The habit isn't "run segment analysis on everything" — it's "run it whenever the traffic composition might have moved during the test window," because that's precisely when the randomization assumption is most likely to have quietly failed.
This is also why finding hidden winners through data segmentation and guarding against Simpson's paradox are two sides of the same practice: the segment cut either reveals a real heterogeneous effect you can exploit, or a composition artifact you need to discard. Either way, the pooled number alone was never enough.
FAQ
If randomization is done right, can Simpson's paradox even happen?
Under clean randomization with stable traffic, it's unlikely — proper randomization balances confounders across arms in expectation. The reason it still shows up is that real experiments drift: sample ratio mismatches, mid-test traffic changes, and staggered rollouts all break the balance randomization was supposed to provide. The paradox isn't evidence that statistics failed; it's evidence that your randomization assumption failed, which is a different and very common problem worth catching directly.
How is this different from a normal heterogeneous treatment effect?
A heterogeneous effect is when the variant genuinely helps some segments and not others — the segments disagree, but the aggregate is an honest weighted summary of real effects. Simpson's paradox is when the aggregate reverses the segments, meaning the pooled direction is an artifact of unequal mix rather than a real average of real effects. The first is a targeting opportunity; the second is a trap. The segment cut distinguishes them.
Which segments should I check?
The two or three where you have a real prior that users respond differently — typically intent level, device, and new-versus-returning. You're not fishing across dozens of slices (that invites false positives of its own); you're confirming the effect direction holds in the handful of cuts that matter for your funnel. If the direction reverses in even one major, high-volume segment against the aggregate, that's your signal to stop and investigate the mix.
Doesn't checking segments risk p-hacking?
It would if you were hunting for a significant subgroup to declare a win. Here you're doing the opposite: using segment cuts to challenge an aggregate win, not to manufacture one. The discipline is that segment analysis is a robustness check on the topline decision, with the direction of the effect as the thing you're verifying — not a search for whichever slice reaches significance. Use it to disqualify fragile wins, not to rescue losers.
Bottom line
An A/B test's aggregate result is a weighted average, and when the weights — your traffic composition — differ between arms, that average can point the exact opposite way from the truth. Simpson's paradox isn't an exotic statistical curiosity; it's a routine consequence of real traffic drifting during a test, and it's completely invisible on the default pooled report. The fix costs one extra query: cut the result by your major segments and confirm the effect direction holds. When the aggregate says one thing and every segment says another, the segments are right and the topline is an artifact. Ship on the segments, not the average.
This kind of diagnostic, the cut that turns a topline into a trustworthy decision, is what I built GrowthLayer to make routine instead of heroic.