Why Sample Ratio Mismatch Matters for Business
Sample Ratio Mismatch (SRM) occurs when A/B test variants don't receive their expected user distribution — like a supposedly 50/50 split arriving at 53/47. This signals broken randomization that can invalidate test results entirely.
When SRM exists, measured improvements may reflect audience differences rather than actual treatment effects. Consider a concrete example: a checkout test showing "+4% lift" on 200,000 monthly sessions could falsely suggest 200 extra orders ($24,000/month revenue impact) if allocation bias caused the variant to receive higher-converting traffic segments.
Missing SRM isn't a stats problem. It's a business problem: you ship changes based on phantom lift, and the revenue never materializes.
Core Diagnostic Process
Before interpreting any lift metric, run this three-step verification:
Step 1: Confirm Expected Split for Eligible Population
Account for ramps and verify the randomization unit matches your analysis unit. If you randomize on user_id but analyze sessions, you'll see natural ratio drift even with perfect assignment.
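Accounting for ramps matters because the expected split is rarely a flat 50/50 over the whole test window. A minimal sketch (the day counts and ramp percentages are illustrative assumptions, not from the original):

```python
# Sketch: traffic-weighted expected treatment share for a test that
# ramped mid-flight, assuming roughly constant daily traffic.

def expected_treatment_share(phases):
    """phases: list of (days, treatment_fraction) tuples."""
    total_days = sum(days for days, _ in phases)
    return sum(days * frac for days, frac in phases) / total_days

# Hypothetical schedule: 7 days at a 10% ramp, then 21 days at full 50/50.
share = expected_treatment_share([(7, 0.10), (21, 0.50)])
print(round(share, 3))  # 0.4 — so a 40/60 overall split here is healthy
```

The point: compare observed counts against this blended expectation, not against the final-phase ratio.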
Step 2: Run Chi-Squared Significance Test
Flag imbalances exceeding 1% to 2% at scale or p-values below 0.01 as requiring investigation. At large sample sizes, even small absolute differences become statistically significant — that's the point. SRM at scale means something is systematically wrong.
Step 3: Establish Decision Rules Upfront
Decide whether you'll exclude corrupted periods, rerun, or halt the test entirely. Having this rule before you see results prevents motivated reasoning.
Common Root Causes
Four frequent culprits behind SRM:
- Unit mismatch: Randomizing on user_id but analyzing sessions. Multi-session users inflate counts unevenly.
- Post-assignment filtering: Applying audience rules after allocation, creating uneven exclusions between variants.
- Uneven traffic routing: Different conversion paths (paid vs. organic) hitting assignment differently.
- Instrumentation failures: Bots, ad blockers, caching, or mid-test configuration changes that selectively drop events from one variant.
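The post-assignment filtering cause is easy to demonstrate with a toy simulation (the drop rates and the bot-filter scenario are hypothetical, chosen only to illustrate the mechanism):

```python
# Minimal simulation: both variants start at a clean 50/50, but an
# audience rule applied *after* allocation (here: a hypothetical bot
# filter the treatment page triggers more often) skews the analyzed
# population and produces SRM.
import random

random.seed(42)
counts = {"control": 0, "treatment": 0}
for _ in range(100_000):
    variant = "treatment" if random.random() < 0.5 else "control"
    # Assumed filter: drops 2% of control but 6% of treatment traffic.
    drop_rate = 0.06 if variant == "treatment" else 0.02
    if random.random() >= drop_rate:
        counts[variant] += 1

share = counts["treatment"] / sum(counts.values())
print(f"analyzed treatment share: {share:.3f}")  # lands near 0.49, not 0.50
```

Assignment itself was fair; the analyzed population wasn't. That distinction is what the chi-squared check surfaces.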
Practical Guidance
My decision rule: stop interpreting lift when the observed split is statistically unlikely under the expected ratio (p < 0.01), and isolate the root cause before proceeding.
A same-day triage routine involves:
- Computing daily splits by traffic source
- Identifying the drift date
- Correlating drift with releases, targeting changes, or infrastructure updates
Most SRM issues have a clear start date. Finding it usually points you to the fix.
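The triage routine above can be sketched as a first-deviation scan over daily splits (the dates, counts, and 1% tolerance below are illustrative assumptions):

```python
# Sketch: compute each day's treatment share and flag the first day it
# departs from the expected split by more than a tolerance.

def find_drift_date(daily_counts, expected=0.5, tolerance=0.01):
    """daily_counts: list of (date, control, treatment) tuples.
    Returns the first date whose treatment share deviates from
    `expected` by more than `tolerance`, or None."""
    for date, control, treatment in daily_counts:
        share = treatment / (control + treatment)
        if abs(share - expected) > tolerance:
            return date
    return None

days = [
    ("2024-03-01", 10_050, 9_950),
    ("2024-03-02", 9_980, 10_020),
    ("2024-03-03", 10_600, 9_400),  # hypothetical deploy day: split breaks
    ("2024-03-04", 10_580, 9_420),
]
print(find_drift_date(days))  # 2024-03-03
```

Running the same scan segmented by traffic source (paid, organic, direct) usually narrows the culprit further.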
When to Rerun vs. Salvage
If the SRM source is isolated to a time window (like a bad deploy that was rolled back), you can sometimes trim the corrupted period and analyze the clean data. But be honest: if more than 20% of the test period is affected, rerun.
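The 20% rule is simple to encode as a guard before any trim-and-reanalyze attempt (dates and the threshold are illustrative; the corrupted window is assumed to lie within the test window):

```python
# Sketch: given the test window and the corrupted window (e.g. a bad
# deploy that was later rolled back), decide whether trimming the
# corrupted period is defensible or the test should be rerun.
from datetime import date

def can_salvage(test_start, test_end, bad_start, bad_end, max_affected=0.20):
    test_days = (test_end - test_start).days + 1
    bad_days = (min(bad_end, test_end) - max(bad_start, test_start)).days + 1
    return bad_days / test_days <= max_affected

# 28-day test; a bad deploy corrupted 4 days before rollback.
print(can_salvage(date(2024, 3, 1), date(2024, 3, 28),
                  date(2024, 3, 10), date(2024, 3, 13)))  # True (4/28 ≈ 14%)
```

Even when salvage passes this check, report the trimmed window explicitly so readers know the analysis excludes data.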
If the SRM is caused by the variant itself (for example, the variant crashes for a segment of users, causing them to drop out), that's not fixable with data cleanup. Fix the variant and rerun.
Conclusion
The central principle: treat SRM as a stop sign, not a minor statistical caveat. Protecting experimental integrity means refusing to interpret lift when the foundation — balanced randomization — is cracked.