A product manager opens the experiment dashboard on a Thursday morning. The test has been running for ten days. Variant B is winning. The aggregate primary-metric lift is +3.1%, p = 0.004, sample of 4.2 million users split 50/50. Everything in the headline view looks like a clean ship decision.

She opens the segment breakdown out of habit. Mobile: variant B loses by −1.4%. Desktop: variant B loses by −0.9%. iOS: variant B loses. Android: variant B loses. Returning users: variant B loses. New users: variant B loses by a hair. Every cut of the data shows variant B losing. The aggregate shows variant B winning by 3%.

She doesn’t ship. She files a ticket asking the data scientist what’s going on. Forty-five minutes later, the answer comes back: the ratio of mobile to desktop traffic in the treatment arm is 38/62, while in the control arm it’s 51/49. Desktop users convert at roughly 2.4x the rate of mobile users on this surface. The aggregate “lift” is not a treatment effect. It is a composition effect — the treatment arm happens to have more desktop users, who convert more, so the average looks higher even though within each device segment, variant B is worse than control.

This is Simpson’s Paradox, and it is the single most important reason segment-level analysis is not optional in A/B testing. An aggregated comparison can reverse direction when broken into subgroups, because weighted averages depend on both the within-group rates and the relative size of each group. When subgroup proportions differ between the conditions you’re comparing, the aggregate can lie about the direction of the effect — not just the magnitude, but the sign.

The phenomenon is named after Edward Simpson, who described it formally in 1951 in the Journal of the Royal Statistical Society, but the canonical real-world demonstration is the 1973 UC Berkeley graduate admissions case analyzed by Bickel, Hammel, and O’Connell in Science (1975). In experimentation, it shows up regularly enough that Kohavi, Tang, and Xu devote a section of Trustworthy Online Controlled Experiments (2020) to it, and Crook, Frasca, Kohavi, and Longbotham listed it as one of the seven pitfalls in their 2009 KDD paper.

This article walks through what Simpson’s Paradox is, the math that makes it work, the specific ways it shows up in A/B tests, why randomization doesn’t always prevent it, the practical diagnostics that catch it, and what your experimentation program should change once you understand it.

What Simpson’s Paradox Actually Is

Simpson’s Paradox is the phenomenon where a trend that appears in several groups of data disappears or reverses when those groups are combined. Edward Simpson’s original 1951 paper, “The Interpretation of Interaction in Contingency Tables,” gave the formal statement: in a 2×2×K contingency table, you can have a positive association between two variables in every one of the K subtables and a negative association in the marginal table you get by summing across K.

The phenomenon had been described earlier — Pearson noted it in 1899 and Yule in 1903 — but Simpson’s framing in a contingency-table context is the one that propagated through 20th-century statistics. Some textbooks call it the “Yule–Simpson effect” to credit both.

The canonical demonstration is the UC Berkeley graduate admissions data from fall 1973, analyzed by Peter Bickel, Eugene Hammel, and J. William O’Connell and published as “Sex Bias in Graduate Admissions: Data from Berkeley” in Science in 1975 (DOI: 10.1126/science.187.4175.398). The university was facing the threat of a discrimination lawsuit because the aggregate numbers looked damning:

  • Men: 8,442 applicants, ~44% admitted.
  • Women: 4,321 applicants, ~35% admitted.

A nine-point gap in admission rate, with sample sizes large enough that the chi-squared test on the difference is significant at any threshold you’d care about. On the aggregate, Berkeley appeared to be discriminating against women at the level of full graduate-school admissions.

Bickel, Hammel, and O’Connell looked at the same data department by department. Berkeley graduate admissions are decentralized (each department runs its own admissions process), so the relevant decision-makers are the departments, not the central university. When you disaggregate by department, the picture inverts. In four of the six largest departments, women were admitted at a higher rate than men. In the other two, men were admitted at a higher rate, and the gaps there were only a few percentage points.

The classic 6-department subset that gets taught in statistics courses (and ships in R’s UCBAdmissions dataset) shows:

  • Men: 2,691 applicants, 1,198 admitted (44.5%).
  • Women: 1,835 applicants, 557 admitted (30.4%).

Yet department by department, the admission rates favored women more often than men. The headline finding from Bickel et al. in Science: “The bias in the aggregated data stems not from any pattern of discrimination on the part of admissions committees, which seem quite fair on the whole, but apparently from prior screening at earlier levels of the educational system. Women are shunted by their socialization and education toward fields of graduate study that are generally more crowded, less productive of completed degrees, and less well funded, and that frequently offer poorer professional employment prospects.” The same abstract reports that, once the data are pooled in a way that respects departmental autonomy, there is a small but statistically significant bias in favor of women.

In plain English: women disproportionately applied to humanities and other departments with low admission rates for everyone. Men disproportionately applied to engineering and physical sciences, which had higher admission rates for everyone. The aggregate gap was a composition effect. The departments themselves were not biased — and if anything were biased the other way.
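The reversal is easy to reproduce from the six-department subset. The sketch below hard-codes the admitted and rejected counts per department as they appear in R’s UCBAdmissions dataset. They were transcribed by hand for this illustration, so treat them as an assumption and check them against the dataset before reusing them. The helper names (admit_rate, aggregate_rate, departments_favoring_women) are made up for the example.

```python
# (admitted, rejected) per department and sex, transcribed from R's
# UCBAdmissions dataset. Assumption: verify against the dataset itself.
UCB = {
    "A": {"men": (512, 313), "women": (89, 19)},
    "B": {"men": (353, 207), "women": (17, 8)},
    "C": {"men": (120, 205), "women": (202, 391)},
    "D": {"men": (138, 279), "women": (131, 244)},
    "E": {"men": (53, 138), "women": (94, 299)},
    "F": {"men": (22, 351), "women": (24, 317)},
}


def admit_rate(admitted, rejected):
    """Share of applicants admitted."""
    return admitted / (admitted + rejected)


def aggregate_rate(sex):
    """Admission rate for one sex, pooled over all six departments."""
    admitted = sum(dept[sex][0] for dept in UCB.values())
    rejected = sum(dept[sex][1] for dept in UCB.values())
    return admit_rate(admitted, rejected)


def departments_favoring_women():
    """Departments where women's admission rate exceeds men's."""
    return [
        name for name, dept in UCB.items()
        if admit_rate(*dept["women"]) > admit_rate(*dept["men"])
    ]


if __name__ == "__main__":
    print(f"aggregate men:   {aggregate_rate('men'):.1%}")
    print(f"aggregate women: {aggregate_rate('women'):.1%}")
    for name, dept in UCB.items():
        print(name,
              f"men {admit_rate(*dept['men']):.1%}",
              f"women {admit_rate(*dept['women']):.1%}")
    print("women ahead in:", departments_favoring_women())
```

The pooled rates come out near 44.5% and 30.4%, matching the figures above, while the per-department comparison puts women ahead in four of the six departments.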

This is the structural shape of Simpson’s Paradox. The aggregate hides the fact that the comparison is across heterogeneous subgroups with different base rates and different proportions in each condition.

The Math — When Subgroup Proportions Differ, The Weighted Average Can Reverse

The mathematical condition is straightforward. Suppose you have two conditions, control (C) and treatment (T), and two subgroups, segment 1 and segment 2. Within each segment, you have a conversion rate.

Let:

  • p_C1, p_T1 be the conversion rates in segment 1 for control and treatment.
  • p_C2, p_T2 be the conversion rates in segment 2 for control and treatment.
  • w_C1, w_C2 be the proportion of control users in segments 1 and 2 (w_C1 + w_C2 = 1).
  • w_T1, w_T2 be the proportion of treatment users in segments 1 and 2 (w_T1 + w_T2 = 1).

The aggregate conversion rate in each condition is the weighted average:

  • p_C = w_C1 · p_C1 + w_C2 · p_C2
  • p_T = w_T1 · p_T1 + w_T2 · p_T2

Simpson’s Paradox occurs when p_T1 < p_C1 and p_T2 < p_C2 (treatment loses in every segment) but p_T > p_C (treatment wins in aggregate), or the symmetric reversal in the other direction.

A worked numerical example. Suppose:

  • Segment 1 (mobile) has low base conversion: control converts at 2.0%, treatment converts at 1.9%.
  • Segment 2 (desktop) has high base conversion: control converts at 5.0%, treatment converts at 4.8%.
  • Control population: 51% mobile, 49% desktop.
  • Treatment population: 38% mobile, 62% desktop.

Aggregate control rate: 0.51 × 2.0% + 0.49 × 5.0% = 1.02% + 2.45% = 3.47%.

Aggregate treatment rate: 0.38 × 1.9% + 0.62 × 4.8% = 0.722% + 2.976% = 3.70%.

Aggregate “lift”: treatment +6.6% relative over control. Treatment “wins” by a substantial margin in aggregate. Yet within mobile, treatment is worse (1.9 < 2.0). Within desktop, treatment is worse (4.8 < 5.0). Both segments show treatment losing. The aggregate win comes entirely from the fact that treatment over-indexes on the high-converting desktop segment.
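The same arithmetic as a short Python sketch. The numbers are the ones from the worked example; the dictionary layout and function names are illustrative, not taken from any experimentation library.

```python
# Conversion rates and segment mixes from the worked example.
rates = {
    "control":   {"mobile": 0.020, "desktop": 0.050},
    "treatment": {"mobile": 0.019, "desktop": 0.048},
}
mix = {
    "control":   {"mobile": 0.51, "desktop": 0.49},
    "treatment": {"mobile": 0.38, "desktop": 0.62},
}


def aggregate_rate(arm):
    """Weighted average of segment rates, using that arm's own segment mix."""
    return sum(mix[arm][seg] * rates[arm][seg] for seg in rates[arm])


def relative_lift(treatment_rate, control_rate):
    """Relative change of treatment over control."""
    return treatment_rate / control_rate - 1.0


p_c = aggregate_rate("control")
p_t = aggregate_rate("treatment")

if __name__ == "__main__":
    for seg in ("mobile", "desktop"):
        print(seg, rates["control"][seg], rates["treatment"][seg])
    print(f"aggregate control:   {p_c:.4%}")
    print(f"aggregate treatment: {p_t:.4%}")
    print(f"aggregate lift:      {relative_lift(p_t, p_c):+.1%}")
```

Treatment trails control in both segments, yet its aggregate rate is higher because each arm is averaged with its own segment mix.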

The reversal hinges on two conditions that must both hold:

  1. The within-segment rates differ across segments (in the example, desktop converts at 2.5x mobile). If all segments had the same base rate, composition wouldn’t matter.
  2. The proportion of each segment differs across conditions (in the example, 51/49 mobile-desktop in control vs 38/62 in treatment). If both arms had identical segment composition, weighted averages couldn’t pull apart.

When both hold, the aggregate becomes a function not just of treatment effect but of composition — and composition can dominate when the treatment effect is small and the across-segment rate gap is large. This is exactly the regime A/B tests operate in: small expected effects (a 1-2% lift is a meaningful win), large between-segment heterogeneity (desktop and mobile, US and rest-of-world, returning and new users routinely differ by 2-10× in conversion rate).
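One way to see how much of an aggregate difference is composition is to split it into two terms. Writing the segment index as s, the identity p_T − p_C = Σ_s w_Ts · (p_Ts − p_Cs) + Σ_s (w_Ts − w_Cs) · p_Cs follows directly from the definitions above. The first sum is the within-segment effect weighted by the treatment mix; the second is what the change in mix alone would do at control rates. This split is one of several equivalent choices (weighting by the control mix instead gives a slightly different breakdown), and the function name below is hypothetical.

```python
def decompose(rates_c, rates_t, mix_c, mix_t):
    """Split p_T - p_C into (within-segment term, composition term).

    within      = sum_s w_Ts * (p_Ts - p_Cs)
    composition = sum_s (w_Ts - w_Cs) * p_Cs
    """
    within = sum(mix_t[s] * (rates_t[s] - rates_c[s]) for s in rates_c)
    composition = sum((mix_t[s] - mix_c[s]) * rates_c[s] for s in rates_c)
    return within, composition


# Numbers from the worked example.
RATES_C = {"mobile": 0.020, "desktop": 0.050}
RATES_T = {"mobile": 0.019, "desktop": 0.048}
MIX_C = {"mobile": 0.51, "desktop": 0.49}
MIX_T = {"mobile": 0.38, "desktop": 0.62}

if __name__ == "__main__":
    within, composition = decompose(RATES_C, RATES_T, MIX_C, MIX_T)
    print(f"within-segment term: {within:+.5f}")
    print(f"composition term:    {composition:+.5f}")
    print(f"aggregate gap:       {within + composition:+.5f}")
```

With the example numbers the within-segment term is about −0.16 percentage points and the composition term about +0.39, which add up to the +0.23 point aggregate gap.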

How Simpson’s Paradox Shows Up In A/B Tests

In a hypothetical perfectly-randomized A/B test where the platform assigns users to control or treatment by a fair coin flip and nothing downstream interferes, you would expect segment composition to be balanced in expectation across arms. Mobile users would land in control and treatment at the same rate as desktop users. Composition imbalances would be small, vanishing as sample size grows.

In practice, several systematic mechanisms create composition imbalances large enough to drive Simpson reversals:

Ramped-up rollouts. Many experimentation programs ramp a treatment from 1% to 5% to 25% to 50% over a few days to limit blast radius. During the ramp, the treatment arm sees users in a different temporal mix than control. If usage patterns shift over the week (weekday-vs-weekend traffic, US-business-hours-vs-international, post-marketing-campaign vs steady state), the ramp produces composition imbalance. Kohavi’s Trustworthy Online Controlled Experiments (2020) is explicit about this in the section on Simpson’s Paradox: “It is easy to incorrectly identify the winning Treatment because of Simpson’s paradox” when ramping. The fix is to compare only the period when both arms were at their final allocation, or to use ramp-aware analysis that weights periods comparably.
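A minimal sketch of the ramp mechanism, using made-up numbers: a low-converting phase in which treatment gets 1% of traffic, followed by a higher-converting phase at the final 50% allocation. The phase sizes, rates, and field names are all assumptions chosen for illustration, and the calculation uses expected conversions rather than simulated users.

```python
# Hypothetical two-phase ramp. Every number here is an illustrative assumption.
PHASES = [
    {"name": "ramp (1%)", "users": 1_000_000, "t_share": 0.01,
     "p_c": 0.020, "p_t": 0.019},
    {"name": "steady (50%)", "users": 1_000_000, "t_share": 0.50,
     "p_c": 0.050, "p_t": 0.048},
]


def pooled_rates(phases):
    """Expected (control, treatment) conversion rates when phases are pooled."""
    t_users = sum(ph["users"] * ph["t_share"] for ph in phases)
    c_users = sum(ph["users"] * (1 - ph["t_share"]) for ph in phases)
    t_conv = sum(ph["users"] * ph["t_share"] * ph["p_t"] for ph in phases)
    c_conv = sum(ph["users"] * (1 - ph["t_share"]) * ph["p_c"] for ph in phases)
    return c_conv / c_users, t_conv / t_users


if __name__ == "__main__":
    c_all, t_all = pooled_rates(PHASES)
    print(f"all phases:  control {c_all:.3%}, treatment {t_all:.3%}")
    # Restrict to the period when both arms were at final allocation.
    c_steady, t_steady = pooled_rates(PHASES[1:])
    print(f"steady only: control {c_steady:.3%}, treatment {t_steady:.3%}")
```

Pooling both phases makes treatment look far better than control, because almost all treatment traffic comes from the high-converting phase. Restricting to the phase at final allocation recovers the within-phase ordering.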

Differential bot filtering. Bot detection runs after exposure. If the treatment changes user engagement (more clicks, longer sessions, more requests), some users who would have been classified as humans in control get classified as bots in treatment and are dropped from the analysis. The treatment cohort that remains is enriched for one type of user (less-engaged) relative to control. If conversion rates differ between those user types, the aggregate now mixes a true treatment effect with a composition effect, and the composition effect can swamp the treatment effect.

Cache or CDN races. A small fraction of treatment users get served the cached control variant before the experiment script overrides it. The treatment arm now contains a mixture of “real treatment” and “accidentally served control” users. If the contamination is non-random with respect to a segment (e.g., users on slower connections are more likely to see the cached version because the experiment script is slow to load), composition imbalance is created mechanically.

Marketing campaigns targeting a variant URL. A paid campaign drives a burst of low-converting top-of-funnel users to a specific URL. If that URL is in one of the test variants, that variant’s composition shifts toward the campaign-segment, who convert at a different rate than baseline. The aggregate now reflects a mixture; segment-level analysis exposes it.

Time-varying user mix. Long-running tests (more than a few weeks) accumulate users whose composition shifts as the user base grows or seasonality kicks in. If the treatment arm has been ramped at a different time than control, or if engagement-driven re-exposure changes which users see treatment more often, the late-period mix differs from the early-period mix, and the aggregate is a mix that doesn’t match either.

Triggered analyses with imperfect triggers. Many experiments only “trigger” for users who hit a specific surface (e.g., the checkout page). If the trigger is implemented slightly differently in control vs treatment (a common bug), the triggered populations differ — not in the treatment but in who’s in the analysis. This is the same root cause class as Sample Ratio Mismatch, but the resulting bias often manifests as segment composition imbalance that produces Simpson reversal even when the headline SRM check passes.

The two-population overlay. Some teams analyze the same experiment across two populations — say, “logged-in users” and “all users” — and report both. The two populations have different conversion rates and different treatment effects. Aggregating them produces a weighted average that can move opposite to either subgroup, especially when one subgroup is much larger but converts at a much lower rate.

The common structure across all of these: anything that causes segment composition to differ across arms more than would happen by random chance is a candidate Simpson-Paradox driver. Even when the underlying randomization is sound, downstream processes — ramping, filtering, caching, triggering, weighting — can produce composition imbalances that the aggregate hides.

Real-World A/B Test Examples (Kohavi 2009, Crook 2009 Microsoft)

The most thoroughly documented practitioner-facing references for Simpson’s Paradox in online experiments are two Microsoft-authored papers and one book:

Kohavi, Longbotham, Sommerfield, and Henne (2009). “Controlled Experiments on the Web: Survey and Practical Guide,” published in Data Mining and Knowledge Discovery Vol 18, pp 140-181 (DOI: 10.1007/s10618-008-0114-1). The paper synthesizes lessons from ~600 controlled experiments at Microsoft and includes Simpson’s Paradox as a recognized phenomenon that shows up when treatments are ramped at different times than controls. The mechanism the paper highlights: a treatment ramped from 1% to 50% over a week will, during that week, be over-represented on weekdays (when ramping happened) and under-represented on weekends. If weekday users convert at a different rate than weekend users, the aggregate during the ramp will reflect a Simpson-style composition effect, not a pure treatment effect.

Crook, Frasca, Kohavi, and Longbotham (2009). “Seven Pitfalls to Avoid When Running Controlled Experiments on the Web,” published in the Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’09) (DOI: 10.1145/1557019.1557139). This is the canonical practitioner reference for things that go wrong in A/B testing at scale. Simpson’s Paradox is one of the seven pitfalls. The paper documents specific Microsoft experiments where aggregate and segment-level results disagreed and the segment-level analysis was the trustworthy one. The paper’s framing: practitioners must build pipelines that “default to segment analysis” rather than treating the aggregate as the headline.

Kohavi, Tang, and Xu (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing, Cambridge University Press (ISBN: 978-1108724265). The textbook for the field has a dedicated discussion of Simpson’s Paradox in the segment-differences section. The book’s framing emphasizes that the paradox is not rare — it occurs frequently enough in production experimentation that any program operating without segment-level diagnostics will routinely ship wrong-direction decisions.

A specific Microsoft example reported in the literature (paraphrased from the Crook 2009 paper and the Kohavi book): a test on a search-results page was ramped over a week. The aggregate showed treatment winning. When analyzed per day, treatment lost on every individual day. The composition shift across the week — produced by the ramp — was the entire source of the aggregate “win.” Once the analysis was restricted to the post-ramp steady-state period with matched compositions, the treatment was correctly identified as a loser.

Another example: a feature targeted at international users was tested. The aggregate showed a small positive lift. The international segment, which the feature was actually designed for, showed a strong positive lift. The US segment showed a slight negative lift. Because US traffic dominated the population, the aggregate lift was small — concealing both the much larger positive effect in the target segment and the negative effect in the larger non-target segment. Shipping the aggregate decision (modest positive, ship to everyone) was the wrong call. The right call was either to ship only to the international segment or to fix the negative US interaction before shipping broadly.

The general pattern Microsoft’s experimentation literature documents: when aggregate and segment-level results disagree, segment-level is almost always closer to the truth about what the treatment does to a given user. The aggregate is a population-weighted average that reflects composition as much as effect, and composition is a confounder.

Why Randomization Doesn’t Always Prevent It

The standard defense of A/B testing as a causal inference tool is that random assignment makes the two arms exchangeable in expectation. Whatever segment composition exists in the population, both arms will have it. So how does Simpson’s Paradox happen if randomization is working?

Three mechanisms keep randomization from saving you:

Randomization is over users, but analysis is over users-who-met-criteria. Many real analyses are not over the fully-randomized population — they’re over a triggered or filtered subset (the users who saw the surface, the users who didn’t bot-out, the users who weren’t filtered for opt-out). If triggering or filtering happens differently across arms — even subtly — the analyzed populations are no longer balanced, and the analysis becomes a comparison of two non-randomized groups. This is the same root cause class as Sample Ratio Mismatch (covered in its own article).

Ramped rollouts violate the implicit “same time, same conditions” assumption. If treatment is ramped from 1% to 50% over a week and control is steady at 50% throughout, you have two arms that did not see the same temporal slice of the user base. Even with perfect randomization within each day, the time-weighted composition differs. Standard A/B test math assumes the comparison is between two simultaneously-running arms. Ramping invalidates that assumption unless the analysis explicitly corrects for it.

Treatment effect heterogeneity plus composition imbalance. When the treatment has different effects on different segments (which is the norm, not the exception, in any real product), the population-weighted aggregate effect depends on what proportion of each segment is in the population. If randomization produces a 50/50 user split but those users happen to over-represent one segment in one arm by even a few percentage points — a sampling fluctuation that randomization cannot eliminate, only minimize — the aggregate effect estimate is biased relative to the average treatment effect on the underlying population.

The third mechanism deserves emphasis because it survives even with perfect infrastructure. Sampling variance in segment composition shrinks as sample size grows, but when base rates or treatment effects differ sharply across segments, even a small composition imbalance can produce a Simpson-style aggregate that disagrees with the segment-level picture. This is why segment-level analysis is not just a sanity check: it is the diagnostic that catches the case where the aggregate misleadingly summarizes a heterogeneous treatment effect.

Pearl’s 2014 commentary in The American Statistician, “Comment: Understanding Simpson’s Paradox,” makes a stronger version of this point. Pearl’s framing: the question “which estimate is correct, the aggregate or the disaggregated?” does not have a purely statistical answer. It depends on the causal structure — what the treatment is intended to affect, what the confounders are, and what counterfactual you’re trying to estimate. In most A/B testing applications, the relevant counterfactual is “what would happen if we shipped this to the user population we’ll actually serve in production?” and the right analysis weights segments by their production prevalence, not by their (possibly skewed) prevalence in the test. Hernán, Clayton, and Keiding (2011, International Journal of Epidemiology, “The Simpson’s paradox unraveled,” DOI: 10.1093/ije/dyr041) develop the same point in an epidemiology context: the resolution is causal, not statistical.

The Practical Diagnostic — Segment-Level Analysis Is Not Optional

The single most important operational change Simpson’s Paradox implies for experimentation programs is this: every experiment result must be read at both the aggregate and the segment level, and any disagreement triggers investigation before any ship decision.

Concretely, “segment-level” means at minimum:

  • Device type (mobile, desktop, tablet — and often further by OS).
  • Geography (at minimum US vs rest-of-world; ideally top-5 markets).
  • User state (logged-in vs logged-out, new vs returning, subscriber vs free).
  • Acquisition source (organic, paid, direct, referral).
  • Time slice (per-day, per-week — especially for tests >1 week).

The diagnostic protocol when the aggregate is, say, +3%:

  1. Compute the same lift in each pre-declared segment.
  2. If all segments show lift in the same direction with similar magnitudes (say, all between +1% and +5%), the aggregate is trustworthy.
  3. If any segment shows the opposite direction and that segment is large enough that its disagreement is statistically meaningful, investigate. The aggregate might be a composition artifact.
  4. If all segments show lift in the same direction but with very different magnitudes (e.g., +12% for international, +0.5% for US), the aggregate is a weighted average that doesn’t tell the full story. Decide whether to ship to all users (using the aggregate), ship to the high-lift segment only, or find out why the weak segment lags before shipping broadly.
  5. Specifically: check whether the segment composition differs across arms. If treatment has 38% mobile and control has 51% mobile, the aggregate is contaminated by composition imbalance and you should re-analyze using a method that controls for composition.
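The first three steps of the protocol above can be sketched as a small helper. It takes per-segment user and conversion counts for each arm, computes the aggregate and per-segment relative lifts, and lists the segments whose sign disagrees with the aggregate. The input layout and the names lift and diagnose are hypothetical, and the sketch ignores statistical significance, which a real check would need before acting on a disagreement.

```python
def lift(c_users, c_conv, t_users, t_conv):
    """Relative lift of treatment over control."""
    return (t_conv / t_users) / (c_conv / c_users) - 1.0


def diagnose(counts):
    """counts[segment][arm] = (users, conversions).

    Returns the aggregate lift, per-segment lifts, and the segments whose
    lift has the opposite sign to the aggregate.
    """
    totals = {
        arm: tuple(sum(seg[arm][i] for seg in counts.values()) for i in (0, 1))
        for arm in ("control", "treatment")
    }
    aggregate = lift(*totals["control"], *totals["treatment"])
    by_segment = {
        name: lift(*seg["control"], *seg["treatment"])
        for name, seg in counts.items()
    }
    disagreeing = [
        name for name, value in by_segment.items()
        if (value > 0) != (aggregate > 0)
    ]
    return {"aggregate": aggregate, "by_segment": by_segment,
            "disagreeing": disagreeing}


# Illustrative counts: one million users per arm, 2.0% vs 1.9% on mobile,
# 5.0% vs 4.8% on desktop, with a 51/49 vs 38/62 mobile/desktop mix.
EXAMPLE_COUNTS = {
    "mobile": {"control": (510_000, 10_200), "treatment": (380_000, 7_220)},
    "desktop": {"control": (490_000, 24_500), "treatment": (620_000, 29_760)},
}

if __name__ == "__main__":
    report = diagnose(EXAMPLE_COUNTS)
    print(f"aggregate lift: {report['aggregate']:+.1%}")
    for name, value in report["by_segment"].items():
        print(f"  {name}: {value:+.1%}")
    print("segments disagreeing with aggregate:", report["disagreeing"])
```

Here every segment lands in the disagreeing list, which is the signal to stop and check composition before reading the aggregate as a win.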

The composition-imbalance check is the SRM-adjacent diagnostic. Compute, for each segment, the chi-squared test of “is the proportion of users in this segment the same in control and treatment as expected?” If any segment fails this test, the composition is imbalanced and the aggregate is suspect.
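A minimal sketch of that per-segment check, using only the standard library. For one segment it builds the 2×2 table of in-segment versus out-of-segment users by arm and computes a Pearson chi-squared statistic with one degree of freedom. The function name and arguments are hypothetical. Two caveats: testing many segments inflates the false-alarm rate unless the threshold is adjusted, and with millions of users even a practically trivial imbalance can produce a tiny p-value, so look at the size of the gap as well.

```python
import math


def segment_balance_test(seg_control, seg_treatment, n_control, n_treatment):
    """Chi-squared test (1 df) that a segment's share is equal in both arms.

    seg_*: users in the segment per arm; n_*: all users per arm.
    Returns (chi2 statistic, p-value).
    """
    observed = [
        [seg_control, n_control - seg_control],
        [seg_treatment, n_treatment - seg_treatment],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Mobile share of 51% in control vs 38% in treatment, one million per arm.
    chi2, p = segment_balance_test(510_000, 380_000, 1_000_000, 1_000_000)
    print(f"chi2 = {chi2:.1f}, p = {p:.3g}")
```

A segment whose share differs this much between arms fails the test decisively, which is the cue to move to a composition-controlled analysis.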

When the composition is imbalanced, the standard fix is stratified or post-stratified analysis: compute the lift within each segment, then take a weighted average using fixed weights (typically the overall population proportions, not the per-arm proportions). This produces an estimate of the average treatment effect that doesn’t depend on which arm happened to over-represent which segment.
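A sketch of post-stratification under simple assumptions: segments are fixed attributes that the treatment cannot change, and the fixed weights are each segment’s share of all users across both arms. The input layout and function name are illustrative.

```python
def post_stratified_rates(counts):
    """counts[segment][arm] = (users, conversions).

    Averages each arm's within-segment rates using the same weights for
    both arms: the segment's share of all users, pooled across arms.
    """
    total_users = sum(
        seg["control"][0] + seg["treatment"][0] for seg in counts.values()
    )
    adjusted = {"control": 0.0, "treatment": 0.0}
    for seg in counts.values():
        weight = (seg["control"][0] + seg["treatment"][0]) / total_users
        for arm in adjusted:
            users, conversions = seg[arm]
            adjusted[arm] += weight * conversions / users
    return adjusted


# Illustrative counts: 51/49 mobile/desktop in control, 38/62 in treatment.
COUNTS = {
    "mobile": {"control": (510_000, 10_200), "treatment": (380_000, 7_220)},
    "desktop": {"control": (490_000, 24_500), "treatment": (620_000, 29_760)},
}

if __name__ == "__main__":
    adj = post_stratified_rates(COUNTS)
    lift = adj["treatment"] / adj["control"] - 1.0
    print(f"control {adj['control']:.4%}, treatment {adj['treatment']:.4%}")
    print(f"post-stratified lift: {lift:+.1%}")
```

On these counts the post-stratified rates are 3.665% for control and about 3.51% for treatment, so the sign flips back to a loss once both arms are averaged with the same weights.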

Variance Reduction Techniques That Help

Beyond segment-level analysis, several techniques and habits reduce the risk of Simpson-style misleading aggregates and improve A/B test sensitivity:

CUPED (Controlled Experiments Using Pre-Experiment Data). Published by Deng, Xu, Kohavi, and Walker in 2013 in WSDM ’13 (“Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data,” DOI: 10.1145/2433396.2433413). The technique uses a user’s pre-experiment behavior as a covariate, reducing variance and making it easier to detect smaller true effects without being misled by composition noise. CUPED doesn’t directly fix Simpson’s Paradox (composition imbalance still biases the estimate), but it reduces the variance of the estimator, which means the segment-level checks are sharper. CUPED has been adopted by Netflix, Booking, Meta, Airbnb, DoorDash, and many others (per Deng et al.’s follow-up papers and industry reports).
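The core CUPED adjustment fits in a few lines. The sketch below assumes a single pre-experiment covariate x per user (for example, the same metric measured before the test) and computes theta as cov(x, y) / var(x) on data pooled across both arms, then subtracts theta · (x − mean(x)) from each user’s in-experiment value y. Function names and the synthetic data are assumptions for illustration; production implementations handle details this sketch skips, such as users with no pre-period data.

```python
import random


def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Sample variance (n - 1 denominator)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def cuped_adjust(y, x):
    """Return y - theta * (x - mean(x)) with theta = cov(x, y) / var(x).

    y: in-experiment metric per user.
    x: the same user's pre-experiment metric, in the same order.
    Both lists should pool control and treatment so theta is shared.
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    theta = cov / variance(x)
    return [b - theta * (a - mx) for a, b in zip(x, y)]


if __name__ == "__main__":
    # Synthetic data: in-experiment value correlates with pre-period value.
    rng = random.Random(0)
    pre = [rng.gauss(10.0, 3.0) for _ in range(5_000)]
    post = [0.8 * p + rng.gauss(0.0, 1.0) for p in pre]
    adjusted = cuped_adjust(post, pre)
    print(f"variance before: {variance(post):.2f}")
    print(f"variance after:  {variance(adjusted):.2f}")
```

The adjusted metric keeps the same overall mean but has lower variance, so per-arm and per-segment comparisons computed on it are tighter.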

Stratification at assignment. If you know in advance that segment composition matters (and for any meaningful product, you do), stratify the random assignment by segment. Stratified randomization forces equal proportions of each segment into each arm, eliminating the source of Simpson reversal at the design stage. The cost is a small amount of complexity in the assignment logic. The benefit is that you never have to defend an analysis from a “but the segments are imbalanced” question.
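A sketch of stratified assignment for the simple case where the full list of users and their strata is known before the test starts. That is a strong assumption: platforms that assign users as they arrive need a different mechanism. The function name and input format are hypothetical.

```python
import random
from collections import defaultdict

ARMS = ("control", "treatment")


def stratified_assign(users, seed=0):
    """users: iterable of (user_id, stratum). Returns {user_id: arm}.

    Shuffles users within each stratum and alternates arms, so the two
    arm sizes within any stratum differ by at most one user.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for user_id, stratum in users:
        by_stratum[stratum].append(user_id)

    assignment = {}
    for stratum in sorted(by_stratum):
        ids = by_stratum[stratum]
        rng.shuffle(ids)
        # Random starting arm so odd-sized strata don't always favor one arm.
        offset = rng.randrange(2)
        for position, user_id in enumerate(ids):
            assignment[user_id] = ARMS[(position + offset) % 2]
    return assignment


if __name__ == "__main__":
    demo_users = [(i, "desktop" if i % 3 == 0 else "mobile") for i in range(20)]
    for user_id, arm in sorted(stratified_assign(demo_users).items()):
        print(user_id, arm)
```

Because the split happens inside each stratum, the segment mix of the two arms matches by construction rather than only in expectation.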

Post-stratification. If stratification at assignment is infeasible (often because segments aren’t known at assignment time), post-stratify in analysis. Compute lift within strata, then aggregate with population weights. Mathematically equivalent in expectation to stratified design, and removes composition imbalance from the analysis.

Capping or trimming heavy-tailed metrics. For revenue or engagement metrics with long right tails, a few extreme users can dominate the aggregate. Cap outliers at a percentile (e.g., 99th) to make the aggregate less sensitive to who ended up in which arm by chance. This is variance reduction, not Simpson-specific, but it makes the same kind of aggregate-vs-segment disagreement less likely.
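A minimal capping sketch using the nearest-rank percentile. It assumes the values passed in are pooled from both arms, so that the same cap applies to control and treatment; the function name is made up for the example.

```python
import math


def cap_at_percentile(values, percentile=99.0):
    """Cap every value at the given percentile (nearest-rank method).

    Values at or below the cap are returned unchanged; larger values are
    replaced by the cap. Input order is preserved.
    """
    if not values:
        return []
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values]


if __name__ == "__main__":
    revenue = [12.0, 0.0, 35.5, 8.0, 4_900.0, 20.0, 0.0, 15.0, 9.5, 30.0]
    print(cap_at_percentile(revenue, percentile=90.0))
```

In the demo call the single extreme value is pulled down to the 90th-percentile value, while every other value is left as it was.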

Twyman’s Law as a culture rule. Twyman’s Law: “Any statistic that appears interesting is almost certainly a mistake.” Kohavi’s book invokes this repeatedly. If the aggregate is much larger than the per-segment lifts, or much smaller, or in the wrong direction, the first hypothesis should be that the aggregate is wrong — a composition artifact, an SRM, a bot effect, a logging bug. Only after those are ruled out should you treat the aggregate as the headline.

What This Means For Your Experimentation Program

The calibration that should fall out of understanding Simpson’s Paradox is straightforward:

Segment-level analysis is not a “nice to have” for advanced teams. It is a core diagnostic for trustworthy A/B test interpretation. Any experimentation program that reads only the aggregate is operating in a regime where some unknown fraction of “winners” are composition artifacts. The fraction is hard to estimate precisely without re-analyzing historical experiments, but the Microsoft literature suggests it is meaningful — not 50% of experiments, but enough that it shows up in published case studies regularly.

The mental model “randomization solves confounding, so the aggregate is fine” is wrong. Randomization protects against confounding in expectation, given perfect implementation. Real experimentation pipelines have ramping, filtering, triggering, caching, marketing-campaign contamination, and treatment effect heterogeneity. Any of these can produce segment composition imbalances large enough to drive Simpson reversal. The defense is not “we randomized” — it is “we randomized and we check segment composition and we read segment-level results.”

The default workflow should make segment-level analysis automatic. Experimentation platforms should surface segment breakdowns alongside the aggregate by default, not as an opt-in deep dive. The cost is a few additional rows in the results UI. The benefit is that the failure mode this article describes becomes visible at a glance rather than requiring an analyst to remember to check.

The cultural rule: when aggregate and segments disagree, segments win until proven otherwise. This is the rule the Microsoft literature has converged on after thousands of experiments. The aggregate is a function of both treatment effect and composition; segment-level is a function of treatment effect (within segment), so it’s the more direct estimate of what the treatment actually does to users. When they disagree, the burden of proof is on the aggregate to explain why composition isn’t the explanation.

The cost-benefit math. For a CRO team or PM with an experimentation roadmap: the engineering work to add segment-level analysis to your results pipeline is small (a few days to a few weeks depending on existing infrastructure). The cost of NOT doing it is some fraction of your shipped winners being composition artifacts that don’t actually improve the metric in production. If that fraction is 5%, and you ship 30 winners a year, you’re shipping 1-2 wrong-direction changes per year on average. Over 5 years of program operation, you’ve accumulated ~5-10 zero-or-negative-impact ships that “look like wins” in the historical experiment ledger, biasing the program’s evidence base toward false confidence. The fix is cheap; the failure mode is silent and compounding.

The deeper point: A/B testing is widely treated as “the gold standard for causal inference in product” because random assignment seems to solve confounding. Simpson’s Paradox is the reminder that randomization is necessary but not sufficient. The aggregate of a randomized experiment can still mislead if composition shifts post-assignment, if treatment effects are heterogeneous, or if the analysis pools across non-equivalent strata. The gold-standard claim is conditional on the analysis layer being competent — and a competent analysis layer treats segment-level disagreement with the aggregate as a hard signal worth investigating.

Sources

  • Bickel, P. J., Hammel, E. A., & O’Connell, J. W. (1975). Sex Bias in Graduate Admissions: Data from Berkeley. Science, 187(4175), 398-404. DOI: 10.1126/science.187.4175.398.
  • Simpson, E. H. (1951). The Interpretation of Interaction in Contingency Tables. Journal of the Royal Statistical Society, Series B, 13(2), 238-241.
  • Kohavi, R., Longbotham, R., Sommerfield, D., & Henne, R. M. (2009). Controlled Experiments on the Web: Survey and Practical Guide. Data Mining and Knowledge Discovery, 18(1), 140-181. DOI: 10.1007/s10618-008-0114-1.
  • Crook, T., Frasca, B., Kohavi, R., & Longbotham, R. (2009). Seven Pitfalls to Avoid When Running Controlled Experiments on the Web. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’09), 1105-1114. DOI: 10.1145/1557019.1557139.
  • Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. ISBN: 978-1108724265. (Sections on segment differences and Simpson’s Paradox.)
  • Pearl, J. (2014). Comment: Understanding Simpson’s Paradox. The American Statistician, 68(1), 8-13.
  • Hernán, M. A., Clayton, D., & Keiding, N. (2011). The Simpson’s Paradox Unraveled. International Journal of Epidemiology, 40(3), 780-785. DOI: 10.1093/ije/dyr041.
  • Deng, A., Xu, Y., Kohavi, R., & Walker, T. (2013). Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data. Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM ’13), 123-132. DOI: 10.1145/2433396.2433413.
  • Kohavi, R., & Longbotham, R. (2010). Unexpected Results in Online Controlled Experiments. ACM SIGKDD Explorations Newsletter, 12(2), 31-35. DOI: 10.1145/1964897.1964905.
  • UCBAdmissions R dataset documentation: vincentarelbundock.github.io/Rdatasets/doc/datasets/UCBAdmissions.
  • Simpson’s Paradox — Wikipedia: en.wikipedia.org/wiki/Simpson%27s_paradox.

FAQ

How do I detect Simpson’s Paradox in my A/B tests?

Run segment-level analysis on every experiment, comparing within-segment lift to aggregate lift. If they disagree in direction (aggregate positive but most segments negative, or vice versa), Simpson reversal is the prime suspect. Also check segment composition: run a chi-squared test on the proportion of each segment in control vs treatment. If any segment is meaningfully over- or under-represented in one arm, the aggregate is composition-contaminated. The Microsoft experimentation literature (Kohavi 2009, Crook 2009, Kohavi/Tang/Xu 2020) treats this as a default diagnostic, not an advanced one.

What about CUPED? Does it solve Simpson’s Paradox?

CUPED (Deng et al. 2013) is a variance-reduction technique that uses pre-experiment user behavior as a covariate. It improves A/B test sensitivity and reduces noise in lift estimates. It does not directly solve Simpson’s Paradox — if segment composition is imbalanced across arms, CUPED’s lift estimate is still biased by composition, just with smaller variance. The fix for Simpson’s Paradox is segment-level analysis and stratification, not variance reduction. CUPED and stratification are complementary: use both.

What about treatment effect heterogeneity? Doesn’t every test have it?

Yes — virtually every real product has heterogeneous treatment effects across segments. A button color change affects mobile and desktop users differently. A copy change affects new and returning users differently. The fact of heterogeneity doesn’t break A/B testing; it just means the aggregate is a population-weighted average of heterogeneous effects. The problem comes when (a) the heterogeneity is large and (b) composition is imbalanced across arms — then the aggregate is biased away from the true average treatment effect. The diagnostic is segment-level analysis. If lifts are similar across segments, the aggregate is meaningful. If they differ dramatically, the aggregate is hiding important information regardless of whether Simpson reversal occurs.
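The interaction between heterogeneity and composition can be written as a short decomposition. The notation is an assumption introduced here: for segment s, w_s^T and w_s^C are the segment's share of users in treatment and control, and r_s^T and r_s^C are the within-segment conversion rates.

```latex
\bar{y}_T - \bar{y}_C
  = \sum_s w_s^T r_s^T - \sum_s w_s^C r_s^C
  = \underbrace{\sum_s w_s^C \left(r_s^T - r_s^C\right)}_{\text{within-segment effect}}
  + \underbrace{\sum_s \left(w_s^T - w_s^C\right) r_s^T}_{\text{composition effect}}
```

When the arms have the same composition, the second term is zero and the aggregate difference is exactly the weighted average of segment effects. When composition differs and rates vary a lot across segments, the second term can be larger than the first and carry the opposite sign, which is a Simpson reversal.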

Should I stratify A/B test assignment by segment?

For segments known at assignment time (device, geography, logged-in status), yes — stratified randomization is cheap and eliminates a source of bias. For segments not known at assignment time (e.g., behavioral segments derived from user activity during the test), use post-stratification in analysis. The combined effect is that segment composition is balanced or analytically controlled, and Simpson reversal becomes much harder to produce. Kohavi/Tang/Xu (2020) discusses both approaches.
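A minimal sketch of stratified randomization for a batch of users whose stratum is known up front. The function name and the batch setting are assumptions; production systems that assign users as they arrive typically need a different mechanism, which this sketch does not cover.

```python
import random
from collections import Counter, defaultdict

def stratified_assign(users, seed=0):
    """users: iterable of (user_id, stratum).
    Returns {user_id: arm}, with arms balanced to within one user per stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for user_id, stratum in users:
        by_stratum[stratum].append(user_id)

    assignment = {}
    for ids in by_stratum.values():
        rng.shuffle(ids)  # random order within the stratum
        for i, user_id in enumerate(ids):
            # Alternate arms down the shuffled list so each stratum splits evenly.
            assignment[user_id] = "control" if i % 2 == 0 else "treatment"
    return assignment

# Hypothetical population: 70% mobile, 30% desktop.
users = [(f"u{i}", "mobile" if i % 10 < 7 else "desktop") for i in range(1_000)]
assignment = stratified_assign(users, seed=7)

counts = Counter((stratum, assignment[uid]) for uid, stratum in users)
print(counts)
```

Because the split happens inside each stratum, the mobile/desktop ratio is the same in both arms by construction, so device mix cannot drive an aggregate-vs-segment disagreement.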

Is Simpson’s Paradox common in real A/B tests, or is it mostly a textbook curiosity?

It is common enough that Microsoft’s published practitioner literature (Crook 2009, Kohavi 2009, Kohavi/Tang/Xu 2020) treats it as a recognized class of problem worth dedicated discussion. Specific prevalence numbers are harder to pin down because catching Simpson’s Paradox requires running segment-level analysis, which programs without that habit don’t do. The Microsoft papers report that ramped rollouts routinely produce Simpson-style aggregate-vs-segment disagreements during the ramp period, and that triggered analyses with imperfect triggers produce them at meaningful frequency. The 6% prevalence of sample ratio mismatch reported by Fabijan et al. (2019) is a lower bound on the rate of composition imbalances that could drive Simpson reversal, because many composition imbalances are too small to fail an SRM check but still bias the aggregate.

If aggregate and segments disagree, which one should I believe?

The Microsoft experimentation literature converges on: segments win, until you can explain why the aggregate disagrees. Pearl’s 2014 commentary in The American Statistician is more nuanced — the right answer depends on the causal structure and the counterfactual you care about. For most A/B testing applications, the counterfactual is “what would happen if I ship this to all users,” and the right analysis weights segments by their production prevalence (post-stratification with population weights), which typically aligns with the segment-level picture rather than the imbalanced aggregate. When in doubt: trust the segment-level results, investigate the composition, and re-run with stratified design if the cause cannot be identified.
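A minimal sketch of post-stratification with population weights, assuming the same kind of hypothetical counts as the opening example. PROD_WEIGHTS stands in for the production prevalence of each segment and is made up for illustration, as are the function names.

```python
# Hypothetical counts, not from a real experiment:
# segment -> arm -> (users, conversions)
DATA = {
    "mobile":  {"control": (51_000, 2_040), "treatment": (38_000, 1_482)},
    "desktop": {"control": (49_000, 4_704), "treatment": (62_000, 5_890)},
}

# Assumed share of each segment in production traffic (must sum to 1).
PROD_WEIGHTS = {"mobile": 0.51, "desktop": 0.49}

def naive_rate(data, arm):
    """Pooled conversion rate: weights each segment by its share in this arm."""
    users = sum(data[s][arm][0] for s in data)
    conversions = sum(data[s][arm][1] for s in data)
    return conversions / users

def post_stratified_rate(data, arm, weights):
    """Conversion rate reweighted so both arms use the same segment mix."""
    return sum(w * data[s][arm][1] / data[s][arm][0] for s, w in weights.items())

naive_lift = naive_rate(DATA, "treatment") / naive_rate(DATA, "control") - 1
post_lift = (post_stratified_rate(DATA, "treatment", PROD_WEIGHTS)
             / post_stratified_rate(DATA, "control", PROD_WEIGHTS) - 1)

print(f"naive lift:           {naive_lift:+.2%}")
print(f"post-stratified lift: {post_lift:+.2%}")
```

With these counts the naive lift is positive and the post-stratified lift is negative, matching the segment-level picture. The choice of weights encodes the counterfactual: production prevalence answers "what happens if I ship this to all users."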

What’s the relationship between Simpson’s Paradox and Sample Ratio Mismatch?

They are siblings in the same family of composition-imbalance problems. SRM is the specific case where the chi-squared test on overall assignment counts (e.g., 50.2/49.8 in a 50/50 test) detects that the arms are not the populations you expected. Simpson’s Paradox can occur even when SRM passes (overall counts look fine) but composition differs at the segment level (mobile/desktop ratio differs across arms even though total counts are balanced). Both invalidate aggregate-only readings. The SRM check should be on by default; the segment-composition check should be the next gate after it. See the SRM article for the related diagnostic.
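A minimal sketch of the overall SRM check, assuming two arms and a planned split. The function name and the example counts are hypothetical. It tests total assignment counts only, so it says nothing about segment mix inside each arm.

```python
import math

def srm_pvalue(control_users, treatment_users, expected_control_share=0.5):
    """Chi-squared goodness-of-fit test (1 df) of observed arm sizes
    against the planned split."""
    n = control_users + treatment_users
    exp_c = n * expected_control_share
    exp_t = n - exp_c
    chi2 = ((control_users - exp_c) ** 2 / exp_c
            + (treatment_users - exp_t) ** 2 / exp_t)
    # With 1 degree of freedom, the chi-squared tail probability is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# The same 50.2/49.8 split at two hypothetical sample sizes.
p_small = srm_pvalue(50_200, 49_800)
p_large = srm_pvalue(2_108_400, 2_091_600)

# Perfectly balanced totals pass SRM regardless of how segments are mixed inside.
p_balanced = srm_pvalue(100_000, 100_000)

print(p_small, p_large, p_balanced)
```

The same percentage split is unremarkable at 100,000 users and a clear mismatch at 4.2 million, so the check has to run on counts rather than on percentages. The balanced case is the one to keep in mind: equal totals pass this gate even if one arm is 38% mobile and the other 51%, which is why the segment-composition check comes next.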

My experimentation platform doesn’t surface segment-level analysis. What do I do?

Build it, or switch platforms. Optimizely, Eppo, Statsig, GrowthBook, AB Tasty, and VWO all surface segment-level analysis by default in 2026. Internal platforms often don’t until v2 or v3. The implementation is straightforward: for each pre-declared segment, compute the same lift you compute for the aggregate, and surface both side by side in the results UI. The hard part is not the math; it is wiring the pipeline to know what the segments are. If you cannot fix the tool, run the segment-level analysis manually on raw data before reading any result as a ship decision. This is the same operational discipline that catches SRM.

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified behavioral economist (1 of ~1,000 worldwide). 200+ A/B tests across energy, SaaS, fintech, e-commerce, and marketplace verticals.