Sample Ratio Mismatch (SRM): The A/B Testing Quality Check Most Teams Skip

Microsoft found ~6% of its A/B tests have a broken traffic split that silently invalidates the results. Most teams never run the one-line check that would catch it. Why SRM is the highest-ROI gate experimentation programs ignore.

By Atticus Li · May 25, 2026 · 25 min read

Imagine your favorite test of the quarter. The one with the 12% lift on a primary conversion metric, a p-value under .01, a sample of three million users split 50/50, and a confident roadmap line item that says “ship.” You ship. Three weeks later, the production cohort behaves nothing like the test predicted. Revenue is flat. The product manager asks the data scientist to double-check. The data scientist pulls the raw assignment counts. The control group has 1,512,488 users. The treatment group has 1,487,512 users. That looks fine (each arm is less than half a percentage point off 50%) until you run a chi-squared test against the configured 50/50 and discover that the probability of a split that lopsided occurring by chance is far below 1 in a trillion.

The lift wasn’t real. The two groups weren’t comparable. Something in your test infrastructure — bot filtering, a redirect failure, a caching layer, an asynchronous config refresh, a bug in the assignment hash — quietly delivered different populations to your control and treatment arms. The “winner” was an artifact. You shipped a tax on the product.

This is Sample Ratio Mismatch, or SRM, and Microsoft research has shown that roughly 6% of their large-scale online controlled experiments exhibit it (Fabijan et al., 2019, KDD ‘19). LinkedIn engineers have reported similar prevalence: about 10% of triggered analyses (Chen, Liu, & Xu, 2019). Booking.com, Yahoo, and Airbnb have published comparable findings. Despite being detectable by a single chi-squared test that takes one second to compute, an SRM check is not on by default in most internal experimentation programs, and most major commercial A/B testing tools didn’t surface it prominently until 2019-2021.

The tension is uncomfortable: SRM detection is mathematically trivial and operationally catastrophic to ignore. If your experimentation program doesn’t gate result-reading on an SRM check, a measurable fraction of your shipped “winners” are not winners. They are bias.

This article walks through what SRM is, the empirical evidence on how common it is, why it invalidates lift measurements, the canonical taxonomy of root causes, why most teams don’t check for it, what to do when it fires, and the calibration this should produce in how you weight any A/B test result.

What SRM Actually Is

A Sample Ratio Mismatch is a statistically significant difference between the expected traffic split between experiment variants (e.g., 50/50, 80/20, 33/33/33) and the observed split at the end of data collection. It is a data-quality diagnostic, not a measurement of the treatment effect.

The detection is a one-line Pearson’s chi-squared goodness-of-fit test against the configured assignment probabilities. Given expected counts E_i and observed counts O_i for each variant i, the test statistic is:

χ² = Σ (O_i − E_i)² / E_i

For a two-variant experiment, χ² with 1 degree of freedom maps to a p-value. If that p-value falls below a chosen threshold, you have an SRM, and the analysis of treatment effects should not be trusted until the cause is found.
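As a minimal sketch of the calculation above, the following computes the statistic for a two-variant experiment with an arbitrary configured split, using only the standard library. The function name `srm_p_value` and the 80/20 example counts are illustrative assumptions, not figures from any cited source. It relies on the fact that, for one degree of freedom, the chi-squared tail probability equals erfc(sqrt(χ²/2)).

```python
import math

def srm_p_value(n_control, n_treatment, p_control=0.5):
    """Chi-squared goodness-of-fit statistic and p-value for two variants (1 df)."""
    total = n_control + n_treatment
    expected = (total * p_control, total * (1.0 - p_control))
    observed = (n_control, n_treatment)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical 80/20 ramp: 100,000 users, control configured at 80%.
chi2, p = srm_p_value(n_control=80_400, n_treatment=19_600, p_control=0.8)
print(f"chi2={chi2:.2f}, p={p:.2e}")
```

In this made-up example the statistic comes out at 10 and the p-value at roughly 0.0016, so a 400-user shortfall in the smaller arm sits just outside a 0.001 threshold. Experiments with more than two variants need the general chi-squared distribution with (number of variants − 1) degrees of freedom, which this sketch does not cover.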

What threshold? Common practice in the industry has converged on a conservative cut-off because the cost of a missed SRM is much larger than the cost of a false alarm:

Eppo uses α = 0.001 (Eppo Docs, 2024).
Fabijan et al. (2019) give an illustrative Microsoft example in which 821,588 vs 815,482 users (a 50.2/49.8 split) already has a chance probability of “less than 1 in 500k”; in other words, they flag splits that would occur by chance far less often than the conventional 0.05 level implies.
Trustworthy Online Controlled Experiments (Kohavi, Tang & Xu, 2020) recommends thresholds in the 0.001 range as a default, recognizing that the test is run on every experiment, so the family-wise error rate matters.

The reason the threshold is so much stricter than the usual 0.05 is twofold. First, you do not want to interrupt analysts on every experiment with false positives. Second — and more importantly — large-sample experiments make even small ratio deviations enormously unlikely under the null. A 50.5/49.5 split at N = 100,000 has p ≈ 0.0016. A 50.2/49.8 split at N = 1.6 million has p < 0.000002. At the sample sizes that make A/B tests sensitive enough to detect a 0.5% lift, even tiny configuration bugs trip the chi-squared loudly. That sensitivity is the feature, not the bug.
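The figures in this paragraph can be checked with a few lines of Python. This is a sketch for a configured 50/50 split only; the helper name `split_p_value` and the small-sample row (51% of 400 users) are assumptions added for contrast.

```python
import math

def split_p_value(treatment_share, n_total):
    """Two-sided p-value for an observed share against a configured 50/50 split."""
    # With two equal arms the chi-squared statistic reduces to
    # 4 * N * (share - 0.5)^2, with 1 degree of freedom.
    chi2 = 4.0 * n_total * (treatment_share - 0.5) ** 2
    return math.erfc(math.sqrt(chi2 / 2.0))

for share, n in [(0.51, 400), (0.505, 100_000), (0.502, 1_600_000)]:
    print(f"{share:.1%} of N={n:>9,}: p = {split_p_value(share, n):.2g}")
```

The same absolute deviation in the ratio becomes less plausible as N grows because the statistic scales linearly with N, which is why a split that looks harmless on a dashboard can be overwhelming evidence of a bug.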

Why It Matters: Comparability Is The Whole Point

The reason A/B testing works as causal inference is that random assignment makes the two groups exchangeable. Any difference in outcomes between control and treatment can be attributed to the treatment because every other factor — user demographics, time of day, browser, device, prior behavior — is balanced across the two arms in expectation.

When the observed assignment ratio diverges from the configured ratio at a statistically impossible level, that exchangeability is the first thing in doubt. The question becomes: what about the user populations is different between control and treatment, beyond the intended manipulation? And the answer, almost always, is that something upstream of the metric calculation is filtering, dropping, mis-bucketing, or otherwise selecting users in a way that correlates with whether they were exposed to the treatment.

This is selection bias. As the Fabijan paper puts it: “SRMs cause a selection bias that invalidates any causal inference that could be drawn from the experiment. If there is a selection bias in treatment or control sample, then the observed metric movements may be due to the selection bias and cannot be attributed to the treatment effect.”

A few concrete ways this plays out:

Bot filtering applied asymmetrically. Your bot detection algorithm classifies extreme-engagement sessions as bots. The treatment increases engagement. More treatment users are filtered out as “bots.” Fewer remaining treatment users are now compared to a denser control sample. The comparison is between “everyone in control” and “the moderately-engaged subset of treatment.”
Redirect failures. A split URL test sends control to /a and treatment to /b. The redirect from the original URL to /b fails for ~2% of users (slow connections, ad blockers, racing scripts). Those users either bounce or get bucketed elsewhere. Now your “treatment” cohort is missing the slow-connection users, who would have converted at a different rate.
Caching layer races. Your CDN caches the control variant. Treatment users sometimes get a cached control response before the experiment script can override it. Some of your “treatment” users actually saw control. The treatment cohort is now a contaminated mixture, biasing the lift estimate toward zero — or in pathological cases, toward arbitrary directions depending on which users get the cache hit.
Asynchronous configuration refresh. Skype’s 2018 SRM (documented in the Fabijan paper): the experimentation system updated config asynchronously mid-session. The variant ID logged on a session changed mid-call, so 30% of treatment logs were attributed to control. The team observed a “significant increase in audio distortion” in treatment that wasn’t real — it was a logging artifact of the broken assignment.

The pattern: when groups aren’t comparable, the lift isn’t a treatment effect — it’s a measure of how the two populations differ on the outcome metric. That measurement is real, in a literal sense. It’s just not what you wanted.
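The asymmetric bot-filtering case can be reproduced in a toy simulation. Everything here is an assumption made for illustration: the engagement model, the `BOT_THRESHOLD` and `ENGAGEMENT_BOOST` values, and the conversion function. Conversion depends only on a latent intent variable, so the true treatment effect on conversion is zero by construction.

```python
import math
import random

random.seed(7)  # fixed seed so the sketch is repeatable

BOT_THRESHOLD = 9.5      # hypothetical: sessions scoring above this are dropped as bots
ENGAGEMENT_BOOST = 0.5   # hypothetical: treatment raises engagement and nothing else

def simulate_arm(n_users, boost):
    """Return (users kept after bot filtering, conversions among kept users)."""
    kept = conversions = 0
    for _ in range(n_users):
        intent = random.random()                        # latent purchase intent
        engagement = 10 * intent + random.gauss(0, 1) + boost
        if engagement > BOT_THRESHOLD:
            continue                                    # filtered out as a "bot"
        kept += 1
        # Conversion depends only on intent, so the true effect is zero.
        conversions += random.random() < 0.6 * intent
    return kept, conversions

c_kept, c_conv = simulate_arm(100_000, boost=0.0)
t_kept, t_conv = simulate_arm(100_000, boost=ENGAGEMENT_BOOST)

total = c_kept + t_kept
chi2 = 2 * (c_kept - total / 2) ** 2 / (total / 2)
p_srm = math.erfc(math.sqrt(chi2 / 2))

print(f"kept: control={c_kept}, treatment={t_kept}, SRM p={p_srm:.1e}")
print(f"conversion: control={c_conv / c_kept:.4f}, treatment={t_conv / t_kept:.4f}")
```

Under these assumptions the filter removes more high-intent users from treatment than from control, so the treatment arm ends up both smaller and lower-converting even though the treatment did nothing to conversion. The count mismatch is the only visible hint that the conversion gap is a selection artifact.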

The Microsoft 6% Finding (Fabijan 2019)

The single most cited prevalence statistic in the SRM literature comes from Fabijan, Gupchup, Gupta, Omhover, Qin, Vermeer, and Dmitriev’s KDD ‘19 paper, “Diagnosing Sample Ratio Mismatch in Online Controlled Experiments: A Taxonomy and Rules of Thumb for Practitioners” (DOI: 10.1145/3292500.3330722).

The paper draws on more than 10,000 historical online controlled experiments across four case-study companies — Microsoft, Booking.com, Outreach.io, and Online Dialog. The headline empirical claim:

“Through quantitative analysis of experiments conducted within the last year we identified that approximately 6% of experiments at Microsoft exhibit an SRM.”

Their Figure 2 breaks this down by product, showing variation from roughly 3% on one product to 13-14% on another. The 6% number is a Microsoft-weighted average. The same section cites:

LinkedIn: Chen, Liu, and Xu (2019) reported that ~10% of triggered analyses at LinkedIn exhibited SRM.
Yahoo: Zhao, Chen, Matheson, and Stone (2016) reported SRMs as a routine concern at scale.

These are not small numbers. A 6% rate, over a program running ~10,000 experiments a year, means ~600 invalid analyses per year that would be silently treated as valid if the chi-squared check were not run. If 30% of those would otherwise have been declared “winners” (which is roughly the win rate at mature programs), that’s ~180 spurious shipped “wins” per year. Each one is a tax: engineering effort to build, code review effort to merge, ongoing maintenance, and the opportunity cost of believing the program is generating value it is not.

The 10,000-OCE corpus also lets Fabijan et al. derive a taxonomy of 25 distinct root causes organized into five categories — the closest the field has to a canonical reference for what causes SRMs in production.

Common Causes — The Five-Category Taxonomy

The Fabijan taxonomy groups SRM root causes by which stage of the experiment pipeline they originate in. This matters because the diagnostic strategy is different for each.

1. Experiment Assignment SRMs. The randomization itself is broken. Bugs in the bucketing function (e.g., a hash function that doesn’t distribute users uniformly across 1,000 buckets and assigns control one fewer bucket than treatment), corrupted user IDs that lead to inconsistent assignment across sessions, carry-over effects from prior experiments using the same seed, or non-orthogonal experiment overlap. These are the most dangerous because they often affect every experiment in the platform at once. The Fabijan paper calls these “systemic” SRMs and flags them with asterisks in the taxonomy.

2. Experiment Execution SRMs. The variants get delivered differently in production. Treatment variant starts at a different time than control. Filter execution is delayed for one variant. Telemetry is added or removed in one variant, causing different events to be logged. The treatment changes performance (faster pages generate more logs per user than slower pages — a treatment that improves speed can generate an SRM purely through more complete telemetry). The treatment changes engagement (more clicks → more logs → more users observed → SRM in the positive direction). Variant-specific crashes (treatment crashes 0.3% more often than control; those users’ sessions never complete logging).

3. Experiment Log Processing SRMs. The raw telemetry is fine but gets mangled in the pipeline. Bot detection runs after treatment exposure and the treatment changes user behavior in a way that pushes some users across the bot threshold (Microsoft’s MSN carousel case: more cards → more engagement → bot detector misclassified high-engagement treatment users as bots → SRM in the opposite direction → “negative” result that was actually a positive). Incorrect joins. Delayed log arrival from one platform (mobile users from one carrier arrive a day late and miss the cooking window). Inconsistent filtering between control and treatment.

4. Experiment Analysis SRMs. The data is fine but the analysis cuts the wrong slice. Missing counterfactual logging — you computed conversion only for users who triggered the new feature, but didn’t compute the analogous counterfactual for control users who would have triggered it. Incorrect starting point of the analysis (started reading results from day 2 instead of day 1, losing a different set of users from each arm). Wrong triggering or filtering condition.

5. Experiment Interference SRMs. Outside actors break the test. Variant interference (an experimenter ramping a variant up or down mid-test, pausing one variant for an hour to fix a bug, manually assigning themselves to treatment for debugging). Telemetry interference (injection attacks, bots posting blank or malformed values that get filtered asymmetrically). Search engine campaigns or paid marketing pointing users at a specific variant URL, force-assigning a chunk of users to one arm.

The taxonomy is descriptive, not exhaustive. Fabijan et al. explicitly note that they recognize 25 distinct causes from their case studies, and that this is “not intended to be a comprehensive list.” New ones get added as new failure modes emerge — particularly as products span more platforms (mobile, IoT, gaming consoles, voice) and as third-party systems (CDNs, ad blockers, browser privacy features) interact with experimentation pipelines.

What unifies the five categories is that none of them are detectable by looking at the metric movement alone. You will see a lift. The lift will look real. The mathematical apparatus of t-tests and confidence intervals will be applied correctly. The only signal that something is wrong is the upstream count mismatch — which is exactly what the chi-squared test is doing.
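To make the assignment-stage example concrete (a bucketing bug that hands control one fewer bucket than treatment), here is a hedged sketch of hash-based bucketing with an off-by-one allocation. The hashing scheme, the salt format, and the function names are assumptions for illustration, not the design of any particular platform. The sketch audits the allocation table directly and then estimates how much traffic it would take for the chi-squared check to notice.

```python
import hashlib
from collections import Counter
from statistics import NormalDist

N_BUCKETS = 1_000

def bucket_for(user_id, experiment_salt):
    """Deterministically map a user to one of N_BUCKETS buckets."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % N_BUCKETS

def buggy_variant(bucket):
    """Off-by-one allocation: bucket 0 is never assigned to control."""
    if 1 <= bucket < 500:
        return "control"      # 499 buckets
    if 500 <= bucket < 1_000:
        return "treatment"    # 500 buckets
    return None               # bucket 0 falls through unassigned

# Audit the allocation table directly instead of waiting for traffic.
allocation = Counter(buggy_variant(b) for b in range(N_BUCKETS))

# Rough traffic needed before the expected split reaches alpha = 0.001.
assigned = allocation["control"] + allocation["treatment"]
deviation = allocation["treatment"] / assigned - 0.5
z_crit = NormalDist().inv_cdf(1 - 0.001 / 2)
users_needed = (z_crit / (2 * deviation)) ** 2

print(dict(allocation))
print(buggy_variant(bucket_for("user-123", "exp-42")))
print(f"users needed to flag: {users_needed:,.0f}")
```

With these numbers a single missing bucket shifts the split by about 0.05 percentage points, and it takes on the order of ten million assigned users before the check is expected to fire. That is one argument for auditing the bucket table itself as well as the resulting counts.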

Why Most Teams Don’t Check It

Given that the math is trivial, the empirical prevalence is well-documented, and the consequences of missing SRM are catastrophic, why is it so common for experimentation programs to not have an SRM gate?

A few overlapping reasons:

The tools didn’t surface it by default. Until roughly 2019-2021, most major commercial A/B testing platforms either didn’t run an SRM check, ran one but buried it in a deep configuration tab, or treated it as an opt-in diagnostic for “advanced users.” Optimizely shipped automatic SRM detection in 2020-2021. VWO, AB Tasty, Convert, Statsig, Eppo, GrowthBook, and others have all added explicit SRM warnings to their results UIs, but the rollouts were staggered through 2019-2022. Teams who built their experimentation programs before that window often inherited workflows where reading the results UI didn’t require thinking about traffic balance.

Internal experimentation platforms ship without it. A common pattern at large companies: an internal experimentation platform is built by an engineering team to enable A/B testing, ships a v1 that handles assignment, exposure logging, and basic stats, and then gets adopted broadly. SRM detection is on the v2 roadmap because the v1 team is busy supporting the growing user base. The v2 ships eighteen months later. In the meantime, an unknown number of analyses are read without an SRM gate. (This is one reason Fabijan et al. wrote the KDD paper — to push internal platforms to ship SRM detection as a default v1 feature.)

“The test ran, ship it.” Cultural pressure. Experimentation programs are often judged by velocity — how many tests are run, how quickly results come in, how fast wins ship. Adding an SRM gate that blocks reading results when traffic counts are off feels, in the moment, like a tax on velocity. It is not. It is a tax on shipping bias-induced “wins” that don’t replicate in production. But the tax on velocity is felt today; the tax on shipping bad changes is felt three months later in a flat metric review, by which point the connection to the bad test is invisible.

Engineering effort to instrument is non-trivial. While the chi-squared test itself is one line, reliably surfacing it requires that the experimentation platform knows the configured assignment ratio (the expected) and accurately counts the observed assignments per variant. In platforms where assignment happens client-side, where bot filtering happens after exposure, where multiple log pipelines feed into the same analysis, getting the observed-count layer right is non-trivial. Teams underestimate this and ship without it.

The diagnostic taxonomy itself is recent. The Fabijan paper is from 2019. Before it, practitioners knew that SRMs happened, but the field lacked a shared vocabulary for the five-category structure and the 25 root causes. Investigating an SRM was bespoke work, often taking days to weeks (the paper quotes practitioners as saying “if it is browser related, it may take me a week; if it is not browser related, it can take me months”). The activation energy to add SRM checks felt high when the response to a positive check was “now spend a month diagnosing it.”

The Fixes — Pre-Launch, Daily Monitoring, Gates

Mature experimentation programs treat SRM as a gate, not a diagnostic. The structure looks like this:

Pre-launch instrumentation. Every experiment must log, per variant, the exact count of users assigned at the moment of assignment, before any downstream filtering, bot detection, or analysis. The assignment count is the ground truth against which everything else is compared. Without this baseline, you can only run the chi-squared check on post-filter counts, and you cannot tell whether a mismatch came from assignment itself or from a later stage.

A/A tests before A/B tests. Run an A/A experiment (both variants are control) before any new A/B test on a new surface, or after any infrastructure change. If the A/A produces an SRM, your infrastructure is broken before you’ve added a treatment. Fix the infrastructure. The Fabijan paper notes that A/A experiments with SRM are a strong signal of widespread, systemic root causes.

Daily SRM monitoring during the experiment. Don’t wait until the end. The chi-squared test should run nightly (or hourly) against accumulated counts. If SRM crosses the threshold mid-experiment, alert the experiment owner. This catches the failure mode where SRM emerges on day 3 because of a new traffic source (e.g., a marketing campaign forcing users to one variant) that wasn’t present on day 1.
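A nightly check on accumulated counts might look like the sketch below. The daily figures are invented, `first_alert_day` is a hypothetical helper, and the configured split is assumed to be 50/50 unless `p_control` says otherwise.

```python
import math

ALPHA = 0.001

def srm_p_value(n_control, n_treatment, p_control=0.5):
    total = n_control + n_treatment
    e_c, e_t = total * p_control, total * (1 - p_control)
    chi2 = (n_control - e_c) ** 2 / e_c + (n_treatment - e_t) ** 2 / e_t
    return math.erfc(math.sqrt(chi2 / 2))  # two variants, 1 degree of freedom

def first_alert_day(daily_counts):
    """Return the 1-based day on which cumulative counts first show SRM, else None."""
    cum_c = cum_t = 0
    for day, (new_c, new_t) in enumerate(daily_counts, start=1):
        cum_c += new_c
        cum_t += new_t
        if srm_p_value(cum_c, cum_t) < ALPHA:
            return day
    return None

# Hypothetical new assignments per day (control, treatment); a campaign
# starts pushing extra users into treatment on day 3.
daily = [(10_000, 10_000), (10_050, 9_950), (10_000, 10_800), (10_000, 10_800)]
alert_day = first_alert_day(daily)
print(alert_day)
```

With these counts the alert lands on day 4, one day after the campaign starts, because the two clean days dilute the cumulative ratio. Checking each day's increment alongside the cumulative total would catch it sooner. Repeated looks also raise the overall false-alarm rate somewhat, which is another reason to keep the threshold strict.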

Automated gates on result-reading. When SRM is present, the experimentation platform should block the results UI from showing the headline lift. Show the SRM warning prominently. Make the analyst click through an “I understand this analysis is invalid” interstitial before they can see the lift number. This sounds heavy-handed; it is. The alternative is that the lift number gets shown, screenshotted, pasted into a Slack channel, and shipped.

The chi-squared check, in code. For completeness, this is the whole detection in Python:

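from scipy.stats import chisquare

# Example inputs: per-variant assignment counts and the configured split.
n_control, n_treatment = 821_588, 815_482
p_control, p_treatment = 0.5, 0.5

total = n_control + n_treatment
observed = [n_control, n_treatment]
expected = [total * p_control, total * p_treatment]  # must sum to total
chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
srm_detected = p_value < 0.001  # True here: p is about 2e-6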

That is the diagnostic. The hard part is not the chi-squared. It is wiring the observed counts in correctly, defining the expected ratio in a place that the analysis pipeline can reliably read, and building the gate into the workflow.
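One way to build the gate into the workflow is to make the lift unreadable unless the check passes or the reader explicitly acknowledges the problem. The sketch below is an assumption about how such a gate could be shaped: `ExperimentResult`, `headline_lift`, and the `acknowledged_invalid` flag are hypothetical names, and a real platform would enforce this in the results service and UI and not in a helper function.

```python
import math
from dataclasses import dataclass

ALPHA = 0.001

@dataclass
class ExperimentResult:
    n_control: int
    n_treatment: int
    p_control: float   # configured share of traffic for control
    lift: float        # headline relative lift from the metric analysis

def srm_p_value(result):
    total = result.n_control + result.n_treatment
    e_c = total * result.p_control
    e_t = total - e_c
    chi2 = (result.n_control - e_c) ** 2 / e_c + (result.n_treatment - e_t) ** 2 / e_t
    return math.erfc(math.sqrt(chi2 / 2))  # two variants, 1 degree of freedom

def headline_lift(result, acknowledged_invalid=False):
    """Return the lift only if the SRM check passes or the reader acknowledged it."""
    p = srm_p_value(result)
    if p < ALPHA and not acknowledged_invalid:
        raise PermissionError(f"SRM detected (p={p:.2e}); results withheld")
    return result.lift

# Counts are the Fabijan et al. example quoted earlier; the lift is made up.
flagged = ExperimentResult(821_588, 815_482, 0.5, lift=0.012)
try:
    headline_lift(flagged)
except PermissionError as err:
    print(err)
```

Raising an exception by default means a dashboard or notebook that forgets to handle SRM fails loudly instead of quietly showing a number.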

What To Do When SRM Is Detected

When the chi-squared fires, the answer is not “relax the threshold until the alert goes away” and not “ship the result and hope it replicates.” The Fabijan paper provides ten rules of thumb for narrowing root-cause categories quickly. The condensed version:

Look at scorecards. If SRM is in the triggered/filtered scorecard but not the full scorecard, the trigger/filter logic is wrong. Relax the filter until the SRM disappears.
Look at segments. If SRM is in one user segment (older browser version, one geography, one device class), the cause is in that segment.
Look at time slices. If SRM gets worse over time, suspect caching, delayed config rollout, or carry-over. If SRM is present from day 1, suspect assignment logic.
Compare engagement metrics. If average engagement per user is higher in treatment, the SRM is excluding less-engaged users from treatment (or including more-engaged users in treatment), depending on the direction of the count discrepancy.
Check for systemic patterns. If multiple unrelated experiments simultaneously show SRM, look upstream at the experimentation platform itself (hashing function, bucket allocation, telemetry pipeline). Don’t waste days investigating each experiment individually.
Run an A/A. An A/A with SRM rules out treatment-effect-related causes and points to platform-level bugs.
Examine downstream pipelines. If your data has multiple processing stages, run the chi-squared at each stage. The stage where SRM first appears is where the bug lives.
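The last rule, running the check at each processing stage, can be sketched as a loop over per-stage counts. The stage names and counts below are hypothetical, and the split is assumed to be a configured 50/50.

```python
import math

ALPHA = 0.001

def srm_p_value(n_control, n_treatment):
    """p-value against a configured 50/50 split (1 degree of freedom)."""
    half = (n_control + n_treatment) / 2
    chi2 = 2 * (n_control - half) ** 2 / half
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical per-stage user counts (control, treatment), in pipeline order.
stages = [
    ("assignment",       (100_000, 100_050)),
    ("after_bot_filter", (97_000, 95_000)),
    ("after_join",       (96_000, 94_100)),
]

first_bad_stage = None
for name, (n_c, n_t) in stages:
    p = srm_p_value(n_c, n_t)
    print(f"{name:<17} p = {p:.3g}")
    if first_bad_stage is None and p < ALPHA:
        first_bad_stage = name

print("SRM first appears at:", first_bad_stage)
```

Here the assignment counts are clean and the mismatch first appears after bot filtering, which narrows the search to that stage. Later stages inherit the imbalance, so only the first flagged stage is informative.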

The investigation can take a day, a week, or months. The cardinal rule is: do not ship the result while the SRM is unresolved. A shipped result derived from an SRM-invalidated test is not a “small bet”; it is a directional bet based on biased data, which is worse than no bet because it accumulates over time as part of the program’s “evidence base.”

If the investigation conclusively identifies the root cause and the cause is benign — for instance, the treatment legitimately reduced crashes, so more treatment users have complete logs (a positive cause; the Fabijan paper has a section titled “SRMs Can Have a Positive Cause”) — you can analyze a subsegment that is comparable across arms, or repeat the experiment with the cause fixed. If the cause is malign — assignment is broken, telemetry is broken, filtering is biased — you re-run the experiment after the fix. The “result” from the broken test is discarded.

What This Means For Your Experimentation Program

The empirical prevalence number is the part of this story that should change calibration.

Roughly 6-10% of A/B test results in mature programs are SRM-invalidated (Microsoft 6%, LinkedIn 10%). If your program does not have an SRM gate, that fraction of your “wins” — and a roughly equal fraction of your “losses” — are not what they appear to be. Some shipped “winners” were artifacts. Some killed “losers” were real wins disguised by bias.

If your program runs 100 tests per year and ships ~30% of them as winners, that’s 30 shipped changes. If 6-10% are SRM-invalidated and half of those would have changed direction with a clean analysis, you have ~1-2 wrong-direction ships per year (30 × 0.06 × 0.5 ≈ 0.9 at the low end, 30 × 0.10 × 0.5 = 1.5 at the high end), each consuming engineering effort, code review, and ongoing maintenance while producing zero or negative business impact. At a program running 1,000 tests per year, the number is roughly 9-15.
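The arithmetic is simple enough to rerun with your own program’s numbers. The function below is a back-of-the-envelope sketch. The parameter names are illustrative, and the 50% flip rate is the article’s working assumption, not a measured quantity.

```python
def wrong_direction_ships(tests_per_year, win_rate, srm_rate, flip_rate):
    """Expected shipped 'winners' per year that a clean analysis would reverse."""
    shipped = tests_per_year * win_rate
    return shipped * srm_rate * flip_rate

for tests in (100, 1_000):
    low = wrong_direction_ships(tests, win_rate=0.30, srm_rate=0.06, flip_rate=0.5)
    high = wrong_direction_ships(tests, win_rate=0.30, srm_rate=0.10, flip_rate=0.5)
    print(f"{tests:>5} tests/year: {low:.1f} to {high:.1f} wrong-direction ships")
```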

There is also a second-order effect. Programs that don’t gate on SRM tend to develop a culture of trust in the experimentation platform that is weakly calibrated to its actual reliability. Teams ship based on the lift number without asking “is this measurement trustworthy?” When the calibration is wrong, the program drifts. After a few years of unchecked SRMs, the gap between “what the dashboard says” and “what the metric actually does in production” widens, and trust in experimentation as a decision tool collapses. The fix at that point — backfilling SRM checks, re-analyzing historical experiments, recalibrating the program’s reported win rate — is much more expensive than putting the gate in upfront.

Practical recommendation. If your program does not have an SRM gate today, this is among the highest-ROI engineering investments you can make. The implementation is a few days of work to wire in the assignment-count logging, the chi-squared check, and the result-UI gate. The payoff is the fraction of tests you are currently mis-reading. For a CRO team, a PM with an experimentation roadmap, or a CEO who reads experiment results in dashboards — this is the version of “be skeptical of your data” that has a concrete fix.

The same logic that makes A/B testing powerful as a decision tool — sensitivity to small effects at large sample sizes — makes SRM detection sharp at scale. Use it.

Sources

Fabijan, A., Gupchup, J., Gupta, S., Omhover, J., Qin, W., Vermeer, L., & Dmitriev, P. (2019). Diagnosing Sample Ratio Mismatch in Online Controlled Experiments: A Taxonomy and Rules of Thumb for Practitioners. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘19), 2156-2164. DOI: 10.1145/3292500.3330722. Open PDF at exp-platform.com.
Chen, N., Liu, M., & Xu, Y. (2019). How A/B tests could go wrong: Automatic diagnosis of invalid online experiments. Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (WSDM ‘19), 501-509. DOI: 10.1145/3289600.3291000.
Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. ISBN: 978-1108724265. Chapter on Sample Ratio Mismatch.
Zhao, Z., Chen, N., Matheson, D., & Stone, M. (2016). Online Experimentation Diagnosis and Troubleshooting Beyond AA Validation. 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 498-507.
Kohavi, R., Deng, A., Longbotham, R., & Xu, Y. (2014). Seven rules of thumb for web site experimenters. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘14), 1857-1866. DOI: 10.1145/2623330.2623341.
Microsoft Experimentation Platform documentation: exp-platform.com.
Optimizely Automatic SRM Detection: Optimizely Support.
Eppo SRM documentation: docs.geteppo.com/statistics/sample-ratio-mismatch.
Sample Ratio Mismatch — Wikipedia: en.wikipedia.org/wiki/Sample_ratio_mismatch.

Related

The Replication Crisis Hub: full collection.
Daryl Bem’s Precognition Studies: how standard methods can produce significant evidence for false hypotheses.
Confirmation Bias: why teams want to believe their winners are real.
Hindsight Bias: why shipped “wins” feel inevitable in retrospect.
The Diederik Stapel Fraud: what happens when data quality checks are absent.

FAQ

What chi-squared p-value threshold should I use for SRM detection?

Industry practice has converged on α = 0.001 as the standard threshold (Eppo uses this explicitly; Kohavi, Tang & Xu 2020 recommends thresholds in this range). The conventional 0.05 produces too many false positives at the volume of experiments a mature program runs, while 0.0001 or stricter misses too many real SRMs. The 0.001 threshold gives ~0.1% false positive rate per experiment while retaining near-100% power to detect ratio deviations large enough to matter at typical sample sizes.
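The trade-off between thresholds can be put in numbers. The sketch below assumes a 50/50 split, a hypothetical program of 1,000 experiments a year, and experiments of one million users each. It reports the expected false alarms among healthy experiments and the smallest imbalance each threshold would flag.

```python
from statistics import NormalDist

def smallest_flagged_share(n_total, alpha):
    """Smallest treatment share above 50% that a 50/50 chi-squared check flags."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # chi2 = 4 * N * (share - 0.5)^2 crosses z_crit^2 at this share.
    return 0.5 + z_crit / (2 * n_total ** 0.5)

EXPERIMENTS_PER_YEAR = 1_000   # hypothetical program size

for alpha in (0.05, 0.001, 0.0001):
    false_alarms = alpha * EXPERIMENTS_PER_YEAR
    share = smallest_flagged_share(1_000_000, alpha)
    print(f"alpha={alpha:<7} false alarms/year={false_alarms:>5.1f} "
          f"flags splits beyond {share:.3%}")
```

At one million users, moving from 0.05 to 0.001 cuts expected false alarms by a factor of fifty while the smallest flagged imbalance only grows from about 50.10% to about 50.16%, which is the sense in which the stricter threshold costs little sensitivity at scale.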

What if my A/B testing tool doesn’t surface SRM warnings?

Switch to one that does, or roll your own. Optimizely, VWO, AB Tasty, Convert, Statsig, Eppo, and GrowthBook all now surface SRM checks (the dates these features shipped vary from 2019 to 2022). If you’re on an internal platform without it, the chi-squared check is one line of code — the hard work is wiring in the per-variant assignment counts at the source. If you cannot fix the tool, run the check yourself on the raw counts before reading any result.

Can SRM be different for mobile vs web traffic?

Yes, and segment-level SRM is one of the diagnostic patterns the Fabijan taxonomy points to. If the overall experiment is balanced but the mobile-only segment shows SRM, the cause is mobile-specific (delayed telemetry, different bot detection, redirect behavior on mobile browsers, app-launch timing). Run SRM checks per platform, per browser, per device class — segment-level SRM that is hidden by the overall view is a common failure mode at programs that only check aggregate counts.
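A per-segment check is the same test run on each slice. In the invented counts below, the aggregate split is perfectly balanced while one platform is clearly off; the platform names and numbers are assumptions chosen to show the masking effect.

```python
import math

ALPHA = 0.001

def srm_p_value(n_control, n_treatment):
    """p-value against a configured 50/50 split (1 degree of freedom)."""
    half = (n_control + n_treatment) / 2
    chi2 = 2 * (n_control - half) ** 2 / half
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical assignment counts per platform: (control, treatment).
segments = {
    "web":     (50_000, 50_600),
    "ios":     (30_000, 30_000),
    "android": (8_000, 7_400),
}

overall_c = sum(c for c, _ in segments.values())
overall_t = sum(t for _, t in segments.values())
overall_p = srm_p_value(overall_c, overall_t)

flagged = [name for name, (c, t) in segments.items() if srm_p_value(c, t) < ALPHA]

print(f"overall p = {overall_p:.3f}")
print("segments with SRM:", flagged)
```

Slicing by many dimensions multiplies the number of tests, so some segment-level alerts will be noise. A strict threshold, or a correction for the number of slices examined, keeps that manageable.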

What about geographic or attribution-window-related SRMs?

Same principle. If your test runs across regions and one region is geo-blocked or routed differently, that region’s assignment ratio can be off even if the global ratio looks fine. Attribution windows (e.g., “credit any conversion within 7 days of exposure”) that vary by variant — perhaps because of how cookies are set or how session length differs — can shift which users count as “in the analysis” and create downstream SRM. The fix is: run the chi-squared at the assignment layer (before any windowing) and ensure attribution logic is variant-independent.

Can SRM be fixed mid-experiment without restarting?

Usually no. If SRM is caused by a config bug, a fix mid-experiment creates two phases, pre-fix and post-fix, that are not comparable to each other, and the experiment effectively becomes two underpowered experiments with a discontinuity. The honest move is to stop, fix the underlying cause, and restart. The Fabijan paper notes one exception: if the SRM is caused by a clearly bounded subpopulation (e.g., one browser version, one geography), you can analyze the experiment on the remaining, comparable population and disclose the restriction. This is a defensible analysis only if the excluded subpopulation isn’t where the treatment was supposed to work.

Is SRM a problem only at large sample sizes?

The diagnostic is most powerful at large sample sizes — which is exactly the regime where you most need it, because that’s where you’re trying to detect small lifts and where small percentage deviations in the split can swamp the real effect. At small N (a few hundred users), a 51/49 split is well within chance and the chi-squared will not fire. At large N (millions of users), a 50.2/49.8 split has p < 0.000002 and the chi-squared will fire loudly. Both behaviors are correct: small N can’t detect small bugs but also can’t detect small lifts, so the comparability concern is proportional. Large N is where SRM matters and where the test is sensitive.

Does SRM detection catch all kinds of experimentation problems?

No. SRM is a single diagnostic — it specifically detects assignment-ratio violations. Other classes of experimentation problems require their own diagnostics: peeking (early-stopping bias) requires sequential testing methods, novelty effects require longer holdouts, primacy effects require user-level (not session-level) analysis, multiple-comparisons inflation requires correction methods. SRM is one critical gate among several. The good news is that the chi-squared check is the cheapest of these gates to implement, so it’s the natural first one to add. The Trustworthy Online Controlled Experiments book (Kohavi, Tang & Xu 2020) covers the full set.

How does SRM relate to A/A testing?

An A/A test assigns users to two arms that both receive the identical control experience. The expectation is that the chi-squared test on the assignment ratio should not fire (because both arms are receiving identical content and there is no treatment to differentially affect telemetry). If an A/A experiment exhibits SRM, the root cause is in the assignment, bucketing, telemetry, or analysis pipeline itself, not the treatment. A/A tests are therefore a standard pre-launch diagnostic for new surfaces or after infrastructure changes. The Fabijan paper recommends running A/A as the first investigation step when systemic SRM is suspected.
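What “should not fire” means across many A/A runs can be illustrated with a small simulation. This sketch uses a normal approximation to the binomial split and a looser threshold of 0.05 so that the flag rate is visible in a few thousand runs. The skewed 50.5% platform is a hypothetical bug, and all names are illustrative.

```python
import math
import random

random.seed(11)  # fixed seed for repeatability

def simulated_aa_p_value(n_users, p_control=0.5):
    """One simulated A/A: draw the control count, then test it against 50/50."""
    # Normal approximation to Binomial(n_users, p_control), adequate at this size.
    sd = math.sqrt(n_users * p_control * (1 - p_control))
    n_control = round(random.gauss(n_users * p_control, sd))
    half = n_users / 2
    chi2 = 2 * (n_control - half) ** 2 / half
    return math.erfc(math.sqrt(chi2 / 2))

def flag_rate(p_control, runs=2_000, n_users=100_000, alpha=0.05):
    hits = sum(simulated_aa_p_value(n_users, p_control) < alpha for _ in range(runs))
    return hits / runs

healthy = flag_rate(p_control=0.5)
skewed = flag_rate(p_control=0.505)   # hypothetical bug: control gets 50.5%
print(f"healthy platform: {healthy:.1%} of A/As flagged at alpha=0.05")
print(f"skewed platform:  {skewed:.1%} of A/As flagged at alpha=0.05")
```

On a healthy platform the flag rate should sit near the chosen alpha, while the skewed platform is flagged in most runs. A real A/A program that flags far more often than its threshold implies is pointing at the platform, not at any single experiment.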

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified in behavioral economics. Led 100+ in-house experiments at NRG in 2025, with project evidence and limits documented in the case studies.
