It is a Tuesday morning standup and the growth team is celebrating. A button-color test on the primary checkout CTA — the boring kind of test that the team runs every other week to feed the variance-reduction model — has read out a +29.4% lift on conversion, p = 0.001, sample size of 1.2 million users. The PM is already drafting the launch announcement. The designer is already screenshotting it for the brag deck. Engineering is already filing the ticket to roll it out to 100%.
The data scientist on the call asks a single question. “Has anyone looked at the raw assignment counts?” The room goes quiet. She pulls them up on her laptop. The treatment cohort is over-attributing conversion events from a redirect chain — the new button URL was implemented as a 302 redirect through a tracking parameter, and the analytics tag was firing the conversion event on both the redirect intermediate page and the final destination for users whose browsers cached the redirect target. Half of the “treatment lift” is the same conversion counted twice. The other half is selection: users on slow connections never made it through the redirect at all and silently dropped out of the treatment cohort, leaving a denser, more engaged subpopulation to be compared against the unfiltered control.
The “win” is a bug. Two bugs, actually. Once fixed, the underlying treatment effect is +0.3%, statistically indistinguishable from zero. The team does not ship.
This is Twyman’s Law in A/B testing, and it is the single most reliable diagnostic rule in industry experimentation. The formulation, attributed to British media and market researcher Tony Twyman and canonically formalized in print by A.S.C. Ehrenberg in his 1975 book Data Reduction, is brutally simple: “Any figure that looks interesting or different is usually wrong” (Ehrenberg, 1975). The corollary in modern online experimentation, articulated by Ron Kohavi across two decades of Microsoft and Bing leadership and codified in Kohavi, Tang & Xu’s Trustworthy Online Controlled Experiments (2020), is even sharper: when a controlled experiment delivers a surprising result, the most likely explanation is an instrumentation bug, a data leak, a sample ratio mismatch, an attribution failure, or a segment contamination — not a genuine breakthrough.
The interesting tension is structural. The results that CRO blogs, vendor case studies, conference talks, and consultancy decks love to publish — the dramatic +30% button-color win, the unexpected +50% pricing-page lift, the headline screenshot that makes the audience gasp — are precisely the results that Twyman’s Law tells you are most likely to be bugs. The discourse you read about A/B testing systematically over-represents bugs as successes. The selection bias in what gets shared makes the field look noisier and more magical than it actually is, and trains practitioners to misinterpret extreme results as evidence of insight rather than as evidence of broken instrumentation.
This article walks through where Twyman’s Law came from, why it is a statistical near-tautology for mature CRO programs, the catalog of common bugs that produce miracle lifts, how Microsoft and other sophisticated experimentation teams have institutionalized investigation-before-action protocols, the “Dirty Dozen” of metric interpretation pitfalls that Twyman’s Law most often catches, and how to build a practical “Twyman triage” gate into your own experimentation program.
What Twyman’s Law Actually Says
The law is named after William Anthony “Tony” Twyman (1934–2014), a British media and market researcher whose career spanned the 1950s through the early 2000s. Twyman was a central figure in UK television and radio audience measurement, contributing to methodological standards for the BBC, the Joint Industry Committee for Television Advertising Research (JICTAR), and the long-running BARB audience measurement system. The Royal Statistical Society’s Journal published his 1967 paper with A.S.C. Ehrenberg, “On Measuring Television Audiences,” widely cited as one of the first rigorous statistical treatments of broadcast audience estimation (Ehrenberg & Twyman, 1967).
By multiple accounts, Twyman never published the eponymous “law” himself. The attribution chain runs through Ehrenberg, who in his 1975 book Data Reduction: Analysing and Interpreting Statistical Data gave the formalization that survives today: “Any figure that looks interesting or different is usually wrong” (Ehrenberg, 1975). Ehrenberg credits the principle to Twyman, who reportedly arrived at it through years of watching audience measurement teams over-interpret what later turned out to be measurement artifacts — a spike in viewers that was a panel selection bug, a regional effect that was a survey instrument change, a demographic skew that was a weighting error.
Variants of the law have appeared in the literature since:
- Ehrenberg (1977): “Any reading which looks interesting or different is probably wrong.”
- Dickson (1999): “Any statistic that appears interesting is almost certainly a mistake.”
- Marsh & Elliott (2009), in Exploring Data, describe Twyman’s Law as “perhaps the most important single law in the whole of data analysis.”
The phrasing varies; the core claim does not. Surprising data is much more likely to reflect a problem with the measurement than a discovery about the world. This is not pessimism. It is a statement about base rates: real, large, repeatable effects are rare, especially in mature measurement systems, while opportunities for measurement to go wrong are abundant. A Bayesian would say: the prior probability of “huge true effect” is low; the prior probability of “measurement artifact” is moderate-to-high; so the posterior after observing “huge result” is dominated by the artifact branch unless the alternative has been actively ruled out.
Kohavi, Tang & Xu (2020), in chapter 3 of Trustworthy Online Controlled Experiments, explicitly elevate Twyman’s Law to a foundational design principle for online experimentation infrastructure. They write that the role of the experimentation platform is to embody Twyman skepticism in the default workflow — surprising results should automatically trigger validity tests, sample ratio checks, segment analysis, and a human review gate before the result is allowed to influence a ship decision. The most expensive failure mode in industry experimentation, on their account, is the team that observes a miracle lift, celebrates, and ships, only to find weeks later that the bug was in the test and not in the world.
Why “Too Good” Is A Statistical Red Flag
To see why Twyman’s Law is so reliable in CRO contexts, you have to understand the base rates of true large effects in mature experimentation programs. The empirical record from the most-published companies is consistent and humbling.
Microsoft’s reported experience, summarized by Kohavi, Tang & Xu (2020) and consistent with the survey by Kohavi, Henne & Sommerfield (2007), is that on a mature surface like Bing’s search results page, the median treatment effect across experiments is approximately zero, and the modal “successful” launched experiment produces a lift in the 0.1% to 1% range on the primary OEC (Overall Evaluation Criterion). Genuine double-digit lifts on a mature, high-traffic surface are vanishingly rare — Kohavi has publicly stated that across thousands of Bing experiments, fewer than 1 in 500 produced a measured lift above 2% on revenue or engagement metrics that survived replication.
Booking.com, in its widely cited internal data, reports similar distributions. Most ideas tested do not move the metric. A meaningful fraction make things worse. The successful interventions tend to be small. Airbnb, Netflix, LinkedIn, and Google have all published variants of the same distributional story: the median true effect is near zero, the tails are short, and double-digit lifts on mature surfaces are almost never real.
This gives Twyman’s Law its statistical bite. If your prior, built from the actual distribution of real lifts in your domain, puts very little probability mass above a +5% lift on a mature surface, and you observe a +30% reading, Bayes’ rule does most of the work for you. The probability that the reading reflects a true +30% effect is proportional to the product of (a) the prior probability of a +30% effect existing in your population and (b) the probability of observing that reading conditional on the effect being real, normalized by the same product summed over every competing explanation. The first term is tiny. The competing branch, “the reading reflects a measurement, instrumentation, or attribution problem,” has a much larger prior because programs that run thousands of experiments per year accumulate many opportunities for things to go wrong.
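To make the base-rate argument concrete, here is a minimal sketch of the two-branch Bayes calculation. Every number in it (both priors and both likelihoods) is an illustrative assumption, not a figure from Kohavi or any published dataset, and the function name is hypothetical.

```python
def posterior_true_effect(prior_true: float, prior_artifact: float,
                          lik_true: float, lik_artifact: float) -> float:
    """Posterior probability that a surprising reading is a real effect.

    Only two explanations compete here: "true large effect" and
    "measurement artifact". The result is normalized over those two branches.
    """
    joint_true = prior_true * lik_true
    joint_artifact = prior_artifact * lik_artifact
    return joint_true / (joint_true + joint_artifact)


# Assumed inputs: 0.2% of tests on this surface have a true +30% effect,
# 5% of tests have a bug capable of producing one, and either cause is
# equally likely to show up as a +30% reading once it is present.
p = posterior_true_effect(prior_true=0.002, prior_artifact=0.05,
                          lik_true=0.5, lik_artifact=0.5)
print(f"{p:.3f}")  # prints 0.038
```

Under these assumed inputs the artifact branch carries about 96% of the posterior. Changing the priors moves the number, but the real-effect branch only wins if true large effects are more common on the surface than bugs capable of mimicking them.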
Kohavi’s working heuristic, repeated across his blog posts and talks: on a mature, high-traffic web surface, any single A/B test reading above approximately +10% relative lift on a major business metric is more likely to be an artifact than a real effect, and should trigger an investigation-first, ship-decision-second protocol. This is not a calibrated Bayes factor; it is a rule of thumb developed from watching the joint distribution of “headline lifts that were ever observed” and “headline lifts that survived replication.”
The greenfield case is different. A brand-new product surface, a new market, a category-defining UX change — these can have larger true effects because the program has not yet found and shipped the easy wins. But the principle still applies: surprising magnitudes always deserve investigation, and the investigation should happen before, not after, the launch decision.
The Common Bugs That Produce Miracle Lifts
The literature on online experimentation has converged on a stable catalog of failure modes that produce Twyman-violating results. Crook, Frasca, Kohavi & Longbotham (2009), in their KDD paper “Seven Pitfalls to Avoid When Running Controlled Experiments on the Web,” gave the canonical first taxonomy. Dmitriev, Gupta, Kim & Vaz (2017) extended it in “A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls.” Fabijan et al. (2019) catalogued the specific subspecies of Sample Ratio Mismatch. Together these papers describe what to actually look for when a result reads too good.
Redirect attribution leaks. The most common cause of large false-positive lifts on URL-split tests. Treatment users navigate through a 301 or 302 redirect to reach the new URL. Some fraction of users (slow connections, ad blockers, browser caching) never complete the redirect — they bounce, get attributed to control, or get attributed nowhere. The treatment cohort becomes the subset of treatment-assigned users who successfully completed the redirect, which selects for fast connections, modern browsers, and lower friction. Compare that selected cohort to the unfiltered control and you get an attribution-induced “lift” that has nothing to do with the page change. Variants of this bug, with subtler attribution chains, have produced reported lifts ranging from a few percent to over 50%.
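The selection mechanism described above can be shown with plain arithmetic. The sketch below assumes two hypothetical user strata (fast and slow connections) with different baseline conversion rates and a treatment with zero true effect. The only difference between arms is that 40% of slow-connection treatment users drop out before they are logged. All counts and rates are made up for illustration.

```python
def observed_rate(n_fast: int, n_slow: int, cr_fast: float, cr_slow: float,
                  slow_survival: float = 1.0) -> float:
    """Conversion rate among the users who actually get logged in an arm.

    slow_survival is the share of slow-connection users who make it through
    the redirect and are counted (1.0 means nobody is lost).
    """
    logged_slow = n_slow * slow_survival
    conversions = n_fast * cr_fast + logged_slow * cr_slow
    return conversions / (n_fast + logged_slow)


# Identical populations and identical true conversion rates in both arms.
control = observed_rate(50_000, 50_000, cr_fast=0.04, cr_slow=0.02)
treatment = observed_rate(50_000, 50_000, cr_fast=0.04, cr_slow=0.02,
                          slow_survival=0.6)

phantom_lift = treatment / control - 1
print(f"control={control:.4f} treatment={treatment:.4f} lift={phantom_lift:.1%}")
```

With these inputs the treatment arm reads about 8% higher even though the page change did nothing. The same dropout also leaves the treatment arm with 80,000 logged users against 100,000 in control, which is why a sample ratio check tends to catch this class of bug.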
Bot filtering applied asymmetrically. A treatment that increases on-site engagement (more pageviews, longer sessions, higher scroll depth) may push some real users over a threshold that the bot-detection algorithm interprets as non-human, removing them from the analysis. The control cohort, with lower engagement, sits below the threshold and remains intact. The post-filter comparison is “the moderately-engaged subset of treatment” versus “all of control.” Net effect: depending on the direction of the threshold, you can produce either a phantom lift or a phantom regression. Dmitriev et al. (2017) describe a real Microsoft case where slideshow engagement appeared to lift dramatically but the gain was being created by inconsistent outlier filtering across arms.
Session-handling bugs. The experimentation platform refreshes user assignment configurations asynchronously. A long-running session that started in control gets re-bucketed to treatment mid-session when a config refresh fires. Conversion events fire after the re-bucket and get attributed to treatment despite the user having seen control for most of the session. Skype’s 2018 SRM, documented in Fabijan et al. (2019), was a real instance of this pattern: roughly 30% of treatment logs were attributed to control because variant IDs changed mid-call, producing an apparent “increase in audio distortion in treatment” that was entirely a logging artifact.
Instrumentation library version mismatches. Treatment and control variants get served by slightly different code paths that load slightly different versions of the analytics SDK. Even small differences in event-firing behavior (debouncing, batching, retry logic) can produce systematic differences in observed metrics that have nothing to do with the user experience. Kohavi has documented Bing cases where added JavaScript on the treatment variant artificially inflated click metrics by giving tracking beacons additional time to reach the server before page unload.
Partial-rollout interaction effects. A treatment is enabled at the same time as another team’s experiment on an upstream surface, and the two interact in a way that affects metric calculation. The reading on either experiment in isolation looks like a large lift; the joint effect is what is actually being measured.
Telemetry loss bias. Changes that alter how reliably events are captured — push notification protocol changes, network-condition handling, retry behavior — produce metric movements that reflect measurement completeness, not user behavior. Dmitriev et al. (2017) describe a Skype case where a push notification protocol change improved call metrics not because users were placing more calls but because the new protocol succeeded in delivering more telemetry events.
Daylight saving and timezone artifacts. Amplitude’s most-cited example of Twyman’s Law: transaction data that appears to be missing during a clock hour that never existed (the hour skipped at a daylight saving transition) reveals a logging issue, not an actual business disruption.
Seasonality and weekday effects masquerading as treatment effects. A test launched on a Friday and read on a Monday draws most of its data from a weekend. If treatment and control are exposed to slightly different weekday distributions due to ramp-up timing, the metric difference is the weekday effect, not the treatment.
Outlier handling differences. A single user with a wildly atypical purchase history (a fraudulent account, a stress test, a misconfigured enterprise customer) lands in one arm and not the other. Their behavior dominates the ratio metric and produces a lift in a single arm that is entirely about that one outlier.
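A small worked example shows how far one account can move a per-user revenue metric. The data below is synthetic, and capping each user's revenue at a fixed value is just one common mitigation chosen for illustration. The cap of 100 and the helper name are assumptions, not a recommendation from the cited papers.

```python
from statistics import mean


def capped_mean(values, cap):
    """Mean after clipping each value at `cap`, applied identically to any arm."""
    return mean(min(v, cap) for v in values)


# Synthetic revenue per user: 1,000 users per arm, 10% of them buy for 30.
control = [0] * 900 + [30] * 100
# Same distribution, except one buyer is replaced by a 6,000 "whale".
treatment = [0] * 900 + [30] * 99 + [6000]

raw_lift = mean(treatment) / mean(control) - 1
capped_lift = capped_mean(treatment, 100) / capped_mean(control, 100) - 1

print(f"raw lift: {raw_lift:.0%}, capped lift: {capped_lift:.1%}")
```

The raw comparison reports a lift of roughly 199%, and the capped comparison reports about 2.3%. The important detail is that the same cap is applied to both arms; filtering outliers in only one arm is itself one of the bugs described here.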
The unifying observation: none of these bugs require malice or even particularly bad engineering. They are the natural consequence of doing experimentation on a complex web stack with many moving parts, asynchronous config delivery, mixed device populations, and adversarial bot traffic. They are common enough that any program running more than a few hundred tests per year will see all of them eventually. The defense is not “engineer harder” — it is “treat surprising results as signals to investigate, not signals to ship.”
How Kohavi And Microsoft Use Twyman’s Law
The operational protocol that Kohavi describes — across Bing, Microsoft Office, LinkedIn, Airbnb, and the unified position summarized in Kohavi, Tang & Xu (2020) — is what he calls investigation before action. The protocol is roughly:
- Surprising readings trigger an automated alert before the analyst sees the headline number. Threshold is contextual — usually defined as N standard deviations above the typical effect distribution for that surface, or absolute deviations beyond historical baselines for the surface and metric pair.
- Sample Ratio Mismatch is checked first, always. The chi-squared test against the configured assignment is computed on every experiment readout. If it fires, the experiment is marked “do not analyze” until the cause is found. Microsoft’s published prevalence of SRM is approximately 6% of all experiments (Fabijan et al., 2019) — meaning even with this gate, a meaningful fraction of all readings need to be set aside.
- Segment analysis is run automatically across browser, country, device class, traffic source, and other standard cuts. A real treatment effect should produce relatively consistent effects across major segments (with known mechanism-based exceptions); a bug-driven “effect” tends to be concentrated in a single segment that reveals the bug’s mechanism (e.g., “the entire lift is in Safari” often indicates a JavaScript compatibility issue affecting only one browser).
- A/A tests are run continuously in the background. If the platform itself is biasing toward false positives, those false positives will show up in null tests where there is no real treatment. Aggregate A/A test statistics are tracked over time as a calibration of the platform’s trustworthiness.
- Replication is required for surprising results. A miracle lift in one experiment is not actionable. A miracle lift that replicates in an independent re-run, with a new random seed, fresh data, and an investigation that has ruled out the obvious mechanisms, is actionable.
- Auto-shutdown of unexpectedly large metric movements. Dmitriev et al. (2017) note that Microsoft’s experimentation system will auto-shut-down experiments that show metric movements above certain thresholds before they reach normal completion, on the explicit Twyman logic that “the result is more likely to be a bug than a discovery, and continuing to run the test wastes resources while the bug propagates.”
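The Sample Ratio Mismatch check in the list above can be sketched with the standard library alone. This is a minimal illustration, not Microsoft's implementation: the function names and the example counts are hypothetical, and the 0.001 default mirrors the threshold used in this article's triage checklist. It relies on the fact that, for one degree of freedom, the chi-squared survival function equals erfc(sqrt(x / 2)).

```python
import math


def srm_p_value(n_control: int, n_treatment: int,
                expected_treatment_share: float = 0.5) -> float:
    """Chi-squared goodness-of-fit p-value for a two-arm assignment split."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # One degree of freedom: P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def has_srm(n_control: int, n_treatment: int,
            expected_treatment_share: float = 0.5,
            alpha: float = 0.001) -> bool:
    """True when the observed split is too unlikely under the configured ratio."""
    return srm_p_value(n_control, n_treatment, expected_treatment_share) < alpha


# Hypothetical readout: 600,000 control users vs 590,000 treatment users.
print(has_srm(600_000, 590_000))  # a 10,000-user gap at this scale is flagged
print(has_srm(600_000, 598_000))  # a 2,000-user gap is within chance
```

A deliberately strict alpha makes sense here because the check runs on every readout, so a looser threshold would produce a steady stream of false alarms.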
The cultural framing matters too. In Kohavi’s published talks, he repeatedly emphasizes that a celebrated bug is an organizational failure, and that the data scientist who catches a Twyman-violating result before it ships should be praised, not seen as the buzzkill who killed a “win.” The incentive structure in many CRO programs is the opposite — the PM who shipped a +30% experiment gets the promotion, the data scientist who refused to validate it gets called difficult. Rebuilding the incentive structure so that not shipping a bug is recognized as the equally valuable outcome is, in Kohavi’s framing, the dominant cultural intervention an experimentation program can make.
The Dirty Dozen (Dmitriev et al., 2017): The Most Common Metric Interpretation Pitfalls
Dmitriev, Gupta, Kim & Vaz (2017), “A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments” (KDD 2017), gives the most useful single reference for the specific failure modes that Twyman investigations turn up. Their twelve pitfalls, with a brief gloss on each:
- Metric Sample Ratio Mismatch. The denominator populations differ between arms. (See our companion article on SRM for the full treatment.)
- Misinterpretation of Ratio Metrics. Moving a module higher on the page decreased CTR by 40% in a Microsoft example — but raw clicks went up because more users saw the module. The “decline” was a denominator artifact.
- Telemetry Loss Bias. A Skype push notification protocol change improved call metrics because it allowed more telemetry events to be reported, not because users called more.
- Underpowered Metrics. Real but small effects don’t reach significance, leading to “null result” conclusions where there was actually a meaningful change.
- Borderline P-Values. Results near p = 0.05 are weak evidence and frequently fail to replicate. Microsoft’s Bing team has at times required replication for any borderline result before shipping.
- Continuous Monitoring and Early Stopping. Peeking inflates false positive rates dramatically. (Companion article: the peeking problem.)
- Heterogeneous Treatment Effects. A feature can help some segments and hurt others. Without segment analysis, the overall null reading hides offsetting effects.
- Segment Misinterpretation. Simpson’s paradox: aggregating across segments can flip the direction of the observed effect. (Companion article: Simpson’s paradox in A/B testing.)
- Outlier Impact. Inconsistent outlier filtering between arms biases the comparison. A single whale user in one arm can dominate a ratio metric.
- Novelty and Primacy Effects. Short-term engagement gains often fade as users habituate. The “win” at day 3 disappears by week 4. (Companion article: novelty and primacy effects.)
- Incomplete Funnel Metrics. Funnel-step metrics need both conditional and unconditional success rates with adequate power at each step.
- Twyman’s Law. Surprising results deserve skepticism. Microsoft auto-shuts down experiments showing unexpectedly large metric movements.
The unifying observation across all twelve: most of these pitfalls produce results that look like dramatic wins (or dramatic losses), which is exactly the class of result that Twyman’s Law tells you to be most skeptical of. If your team is reading lots of dramatic results, you have a Dirty Dozen problem, not an insight pipeline.
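The ratio-metric pitfall in the list above is easy to reproduce with invented numbers. In the sketch below, moving a module higher on the page means more assigned users see it. Click-through rate (clicks per viewer) falls by 40%, matching the size of the drop in the Microsoft example, while clicks per assigned user rise. The counts are hypothetical and chosen only to make the arithmetic clean.

```python
# Hypothetical per-arm counts; "viewers" are users who scrolled far enough
# to see the module, "users" are everyone assigned to the arm.
control = {"users": 100_000, "viewers": 20_000, "clicks": 2_000}
treatment = {"users": 100_000, "viewers": 50_000, "clicks": 3_000}


def relative_change(treatment_value: float, control_value: float) -> float:
    """Relative change of treatment over control."""
    return treatment_value / control_value - 1


# Ratio metric whose denominator is itself changed by the treatment.
ctr_change = relative_change(treatment["clicks"] / treatment["viewers"],
                             control["clicks"] / control["viewers"])

# Same numerator over the randomization unit, which the treatment cannot move.
per_user_change = relative_change(treatment["clicks"] / treatment["users"],
                                  control["clicks"] / control["users"])

print(f"CTR: {ctr_change:+.0%}, clicks per assigned user: {per_user_change:+.0%}")
```

Reading both lines side by side is the point: when a treatment changes who enters the denominator, the metric computed over assigned users is the one that reflects what happened.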
How CRO Industry Discourse Is Selection-Biased
Here is the uncomfortable structural observation about A/B testing as a public discourse. The case studies that get published — by conversion rate optimization agencies, by A/B testing tool vendors, by speakers at growth conferences, by personal-brand consultants — are systematically the most dramatic readings. This is not because the consultants are dishonest. It is because dramatic results are what audiences want to read and what business development requires.
Twyman’s Law tells us those same dramatic results are precisely the ones most likely to be bugs. So the public corpus of “A/B testing wins” is, by selection, enriched for bugs. The signal-to-noise ratio in industry CRO content is structurally worse than the signal-to-noise ratio in the underlying experimentation programs, because the filter on what gets published rewards exactly the population of results that the underlying methodology says are most suspect.
Concretely:
- Vendor case studies. Most vendor case studies report lift magnitudes that, applied to the published metric distributions of mature CRO programs, would be statistical outliers of the kind Twyman’s Law tells you to investigate. The case studies do not typically report sample ratio checks, segment validation, A/A test calibration, post-launch monitoring, or whether the lift was replicated. The omission is not an oversight — it is the selection rule.
- Consultancy decks. The “+45% conversion lift” slide in a sales deck is the result that closed the next contract. It is also, statistically, the result most likely to be an artifact of the kind the consultancy’s process should have caught. The deck does not say which interpretation is correct because the deck is not built to.
- Blog posts and tweet threads. The “this one weird trick lifted my checkout flow 30%” genre is dominated by surfaces too small or too immature to have a stable baseline, attribution chains too fragile to support the inference, and methodology too thin to distinguish a true effect from a bug.
- Conference talks. A talk that opens with “we tested 200 things and 195 were null, 4 hurt, and 1 produced a +0.4% sustained lift” is honest, useful, and unbookable. A talk that opens with “we doubled conversions with one font change” gets the keynote slot.
The aggregate effect on practitioner intuition is corrosive. Junior PMs and analysts read the public corpus, internalize the assumption that double-digit lifts are normal and that A/B testing is a tool for discovering them, and then run experiments expecting to see results that, in a properly calibrated program, almost never appear. When those expected results do appear in their own experiments, the practitioner does not apply Twyman skepticism — because the public discourse has trained them that miracle lifts are the whole point — and they ship the bug.
The defensive posture is to discount any published A/B testing case study by the prior probability of a true effect of that magnitude in that domain. A consultancy claiming a +60% lift on a mature surface should be assumed to have a measurement issue until proven otherwise; the claim is not evidence about what the methodology can deliver, it is evidence about which results that consultancy chose to publish.
A Practical Twyman Triage
What does a working Twyman triage look like in a real CRO program? The protocol below is a synthesis of what Kohavi, the Bing team, the LinkedIn experimentation team, and the published papers describe, adapted for smaller programs that lack Microsoft’s tooling.
Step 1: Define contextual surprise thresholds for your surfaces. For each mature, high-traffic surface, look at the distribution of historical treatment effects on your primary metrics. Identify the magnitude above which a single-test reading would sit beyond the 99th percentile of historically observed effects. That number is your trigger threshold for the surface. Anything above it goes to investigation, regardless of how confident the p-value looks. A rough starting heuristic, applicable to most mature consumer web programs: any reading above +10% relative lift on a primary business metric for a single component-level test on a high-traffic page is a Twyman trigger.
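One way to derive the Step 1 threshold is a nearest-rank percentile over the magnitudes of past readings. The sketch below assumes you have a list of historical relative lifts for one surface and metric. The function names are hypothetical, and taking absolute values (so that dramatic losses also trigger) is a design choice of this sketch rather than something the cited sources prescribe.

```python
import math


def surprise_threshold(historical_lifts, percentile: int = 99) -> float:
    """Nearest-rank percentile of |lift| across past experiments on a surface."""
    magnitudes = sorted(abs(x) for x in historical_lifts)
    if not magnitudes:
        raise ValueError("need at least one historical reading")
    rank = max(1, math.ceil(percentile * len(magnitudes) / 100))
    return magnitudes[rank - 1]


def is_twyman_trigger(observed_lift: float, historical_lifts,
                      percentile: int = 99) -> bool:
    """True when a reading is larger than almost everything seen before."""
    return abs(observed_lift) > surprise_threshold(historical_lifts, percentile)


# Made-up history: 100 past readings spread evenly from +0.1% to +10%.
history = [i / 1000 for i in range(1, 101)]
print(surprise_threshold(history))        # 0.099
print(is_twyman_trigger(0.294, history))  # True
print(is_twyman_trigger(0.004, history))  # False
```

With only a few dozen historical experiments the 99th percentile is just the largest reading ever seen, so a small program may prefer the fixed rule of thumb until its history grows.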
Step 2: Run the validity checklist before reading the headline. Sample Ratio Mismatch (chi-squared against configured ratio, α = 0.001). Segment splits across browser, country, device, traffic source, weekday. A/A baseline calibration for the platform. Outlier diagnostics. Telemetry capture rate consistency between arms. Funnel-step conditional rates.
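For the segment-split item in this checklist, a rough sketch is to compute the relative lift per segment and flag any segment whose lift is far out of line with the rest. The 5x-the-median rule, the function names, and the browser counts below are all assumptions made for illustration; a production check would also account for per-segment sample size and noise.

```python
from statistics import median


def segment_lifts(segments: dict) -> dict:
    """Relative lift per segment.

    segments maps a name to
    (control_users, control_conversions, treatment_users, treatment_conversions).
    """
    lifts = {}
    for name, (c_users, c_conv, t_users, t_conv) in segments.items():
        control_rate = c_conv / c_users
        treatment_rate = t_conv / t_users
        lifts[name] = treatment_rate / control_rate - 1
    return lifts


def concentrated_segments(lifts: dict, ratio: float = 5.0) -> list:
    """Segments whose |lift| exceeds `ratio` times the median |lift|."""
    typical = median(abs(v) for v in lifts.values())
    return [name for name, v in lifts.items() if abs(v) > ratio * typical]


# Hypothetical readout split by browser.
by_browser = {
    "Chrome": (60_000, 1_800, 60_000, 1_818),
    "Safari": (30_000, 900, 30_000, 1_350),
    "Firefox": (10_000, 300, 10_000, 303),
}
lifts = segment_lifts(by_browser)
print(concentrated_segments(lifts))  # ['Safari']
```

Here Chrome and Firefox each show about +1% while Safari shows +50%, which is the "the entire lift is in one browser" pattern that usually points at a compatibility or tagging bug rather than a user-behavior change.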
Step 3: Run the bug-hypothesis checklist. Is there a redirect in the variant path? Is the analytics SDK or tag manager loaded identically in both arms? Did the experiment span a config refresh, a deploy, a marketing campaign launch, a holiday, a DST transition? Is there a partial overlap with another team’s experiment? Did bot filtering rates differ between arms? Did the experiment system experience any errors in the assignment service?
Step 4: If everything passes, replicate. A genuine large effect should re-appear in an independent re-run with fresh random assignment and a different time window. A bug-driven effect typically does not. The cost of replicating is one experiment-cycle of time. The cost of shipping a bug is months of degraded production performance plus the eventual cleanup.
Step 5: Document the investigation, ship-or-not. Whatever the conclusion, record what was checked, what was found, what the final decision was. This builds the institutional dataset that lets future Twyman triages be calibrated to your actual platform.
Step 6: Treat “kill the test, find the bug” as a positive outcome. The team member who catches a Twyman-violating bug saves the program more value than the team member who ships a +0.5% real lift. The promotion criteria should reflect that. They almost never do, and rebuilding the incentive structure is the deepest cultural intervention an experimentation program can make.
The implementation cost of this protocol is modest. Steps 1, 2, and 3 are largely automatable — most modern experimentation platforms (Statsig, Eppo, GrowthBook, internal builds) can be configured to compute SRM, run segment cuts, and surface anomalies in the readout. Step 4 requires organizational discipline more than tooling. Step 6 requires leadership courage.
What This Means For Your Experimentation Program
The calibration that Twyman’s Law should produce, for any CRO program that takes its findings seriously, is uncomfortable but freeing.
Most of the dramatic wins you have heard about are bugs. Not all. But enough of the public corpus of “miracle CRO lifts” is bug-driven that you should treat any claimed dramatic result — your own or someone else’s — as a hypothesis about a measurement artifact until the artifact branch has been actively ruled out. This is not skepticism for its own sake. It is a Bayesian update on the actual joint distribution of true effects and measurement noise.
The mature CRO program produces small, replicable wins. The published distributions from Microsoft, Booking, LinkedIn, Airbnb, and Netflix all tell the same story: the median launched experiment moves the primary metric by less than 1%. The aggregate value of an experimentation program comes from compounding many small wins, not from chasing rare large ones. The team that has set its expectations to “we are looking for +0.3% wins and the occasional +1% surprise” will run a more disciplined program than the team that has set its expectations to “we are looking for the next +30% breakthrough.”
Investigation-before-action is the highest-ROI workflow change you can make. It costs little, catches the most expensive failure mode (shipping bugs as features), and rebuilds your platform’s trustworthiness over time. Microsoft’s stated experience is that auto-shutdown of suspicious experiments saves enough metric movement per quarter to justify the engineering investment in the platform many times over.
Selection bias in your own readings is silent and expensive. Even if every individual experiment in your program is methodologically sound, the act of selecting the “best” reading for shipping introduces the Winner’s Curse, which combines with Twyman’s Law to systematically overstate the effects of the experiments you choose to ship. (See the companion article on the Winner’s Curse for the full treatment.) The defense is the same: investigation, replication, post-launch monitoring, and the cultural willingness to revise headline numbers downward when production data disagrees with test data.
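The overstatement from picking the best reading can be seen in a short simulation. The sketch below assumes every variant has the same small true lift and that each reading is that lift plus Gaussian noise. The true lift, noise level, number of variants, and seed are arbitrary illustrative choices.

```python
import random


def mean_selected_reading(true_lift: float = 0.003, noise_sd: float = 0.01,
                          n_variants: int = 5, n_rounds: int = 2000,
                          seed: int = 7) -> float:
    """Average reading of the 'winning' variant when all variants are equal."""
    rng = random.Random(seed)
    winners = []
    for _ in range(n_rounds):
        readings = [rng.gauss(true_lift, noise_sd) for _ in range(n_variants)]
        winners.append(max(readings))  # ship whichever variant read highest
    return sum(winners) / len(winners)


avg_winner = mean_selected_reading()
print(f"true lift: 0.30%, average reading of the selected winner: {avg_winner:.2%}")
```

No individual reading in this simulation is biased; the inflation comes entirely from the selection step. That is why a post-launch holdout or an independent re-run tends to show a smaller number than the headline that justified the ship decision.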
Build the gate before you need it. The teams that institute Twyman triage before their first shipped-bug-as-feature incident absorb the cultural cost when stakes are low. The teams that institute it after a high-profile bug ship are doing the same work under the weight of a recent failure, which makes the rebuild harder. Spend the engineering time now to compute SRM, segment splits, and contextual threshold alerts on every readout. Spend the leadership capital now to make “killed the test, found the bug” a celebrated outcome. Both investments pay off the first time they catch a real bug, which in any program running more than a few hundred tests per year will happen on a timescale of months, not years.
The deeper point — and the one that connects Twyman’s Law to the rest of the replication crisis literature — is that the most interesting results in any empirical program are the ones that deserve the most scrutiny, not the least. This runs against the grain of how research, business reporting, conference programming, and academic publishing all work. The structural incentives in every venue reward dramatic findings and punish careful disconfirmation. The discipline of working scientists, working analysts, and working CRO teams is to invert that incentive locally: to give the surprising result more skepticism, not less, because the prior says it is more likely to be wrong.
A 30% lift on a button-color test is almost never magic. It is almost always a bug. Twyman knew this in 1970s television audience measurement, Ehrenberg formalized it in 1975, Kohavi has been preaching it across Bing and the academic experimentation literature for two decades, and the field has the infrastructure to act on it. The remaining gap is between what the literature recommends and what the average industry CRO program actually does. Closing that gap is the most leveraged single change you can make to your experimentation practice.
Sources
- Crook, T., Frasca, B., Kohavi, R., & Longbotham, R. (2009). Seven pitfalls to avoid when running controlled experiments on the web. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’09), 1105–1114. DOI: 10.1145/1557019.1557139
- Dickson, P. R. (1999). Marketing Management (2nd ed.). Dryden Press.
- Dmitriev, P., Gupta, S., Kim, D. W., & Vaz, G. (2017). A dirty dozen: Twelve common metric interpretation pitfalls in online controlled experiments. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’17), 1427–1436. DOI: 10.1145/3097983.3098024
- Ehrenberg, A. S. C. (1975). Data Reduction: Analysing and Interpreting Statistical Data. John Wiley & Sons.
- Ehrenberg, A. S. C. (1977). Rudiments of numeracy. Journal of the Royal Statistical Society, Series A, 140(3), 277–297.
- Ehrenberg, A. S. C., & Twyman, W. A. (1967). On measuring television audiences. Journal of the Royal Statistical Society, Series A, 130(1), 1–59.
- Fabijan, A., Gupchup, J., Gupta, S., Omhover, J., Qin, W., Vermeer, L., & Dmitriev, P. (2019). Diagnosing sample ratio mismatch in online controlled experiments: A taxonomy and rules of thumb for practitioners. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19), 2156–2164. DOI: 10.1145/3292500.3330722
- Kohavi, R. (2013). Twyman’s Law and Controlled Experiments. ExP Platform blog. exp-platform.com/Documents/TwymansLaw.pdf
- Kohavi, R., Henne, R. M., & Sommerfield, D. (2007). Practical guide to controlled experiments on the web: Listen to your customers not to the HiPPO. Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’07), 959–967. DOI: 10.1145/1281192.1281295
- Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. Chapter 3: Twyman’s Law and Experimentation Trustworthiness. DOI: 10.1017/9781108653985
- Kohavi, R., & Thomke, S. H. (2017). The surprising power of online experiments. Harvard Business Review, September–October 2017, 74–82. hbr.org/2017/09/the-surprising-power-of-online-experiments
- Marsh, C., & Elliott, J. (2009). Exploring Data: An Introduction to Data Analysis for Social Scientists (2nd ed.). Polity Press.
Related
- Replication Crisis Hub — the full library of 75+ articles on what reliable evidence looks like across disciplines.
- Sample Ratio Mismatch in A/B Testing — the single most important Twyman triage check; Microsoft data showing ~6% of experiments have it.
- The Winner’s Curse in A/B Testing — why even bug-free experiments overstate true effects when you ship the variant with the highest observed lift.
- The Peeking Problem in A/B Testing — early stopping inflates false positives and triggers Twyman investigations on borderline results.
- Multiple Comparisons in A/B Testing — testing many metrics or many segments without correction produces spurious “wins” that fail Twyman scrutiny.
- Novelty and Primacy Effects in A/B Testing — short-term lifts that fade are a common cause of “the test won but production didn’t.”
- Simpson’s Paradox in A/B Testing — aggregation effects that flip the sign of an observed treatment effect when you cut by segment.
FAQ
What’s a reasonable Twyman threshold for my context?
Start by computing the historical distribution of treatment effects on your primary metric for the surface in question. Take the 99th percentile of the absolute values of the lifts you have observed; any result larger in magnitude than that, positive or negative, is a Twyman trigger. For most mature consumer web programs, this lands somewhere between 5% and 15% in magnitude on a primary business metric, depending on surface volatility. For low-traffic surfaces or new product surfaces with no historical distribution to anchor on, use a lower bar: anything above the noise floor of your historical A/A test variance is a Twyman trigger, because you have less prior data to distinguish noise from signal.
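A minimal sketch of that threshold rule, assuming you can export past readouts as a plain list of relative lifts (0.02 meaning +2%). The function names and the nearest-rank percentile method are illustrative choices, not part of any particular experimentation platform.

```python
import math


def twyman_threshold(historical_lifts, percentile=99.0):
    """Return the given percentile of |lift| across past experiments.

    Uses the nearest-rank method so the threshold is always a lift that
    was actually observed. `historical_lifts` is a hypothetical export of
    relative lifts, one per past experiment on the same surface.
    """
    magnitudes = sorted(abs(lift) for lift in historical_lifts)
    if not magnitudes:
        raise ValueError("need at least one historical lift")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    rank = math.ceil(percentile * len(magnitudes) / 100)
    return magnitudes[max(rank, 1) - 1]


def is_twyman_trigger(observed_lift, historical_lifts, percentile=99.0):
    """True when the observed lift is larger in magnitude than the threshold."""
    return abs(observed_lift) > twyman_threshold(historical_lifts, percentile)
```

With a long history the 99th percentile is stable. With only a few dozen past tests it collapses toward the largest lift ever seen, which is one more reason to fall back on the A/A noise floor for young surfaces.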
What about genuine breakthroughs? Don’t you risk missing real wins by treating large results as suspicious?
You risk a delay, not a miss. Twyman triage adds an investigation step before shipping; it does not block shipping. A genuine breakthrough survives investigation: SRM checks pass, segment analysis is consistent with mechanism, the bug-hypothesis checklist comes up empty, and replication confirms the effect. The cost of the investigation is days or weeks. The cost of shipping a bug as a feature can be quarters of degraded production performance plus the cleanup. The asymmetry strongly favors investigating, and genuine breakthroughs are rare enough on mature surfaces that the marginal delay on the rare real one is a small price for catching the much more numerous false ones.
What about new-product greenfield surfaces where everything is novel?
Greenfield surfaces have larger true effect distributions because the program has not yet found and shipped the easy wins. Adjust your threshold upward accordingly — a +25% lift on a brand-new flow is much more plausible than a +25% lift on a mature checkout page. But the investigation protocol still applies: SRM, segment analysis, bug-hypothesis checklist. The threshold for “this deserves investigation” is higher on greenfield, but the standard of evidence for “ship without investigation” is never zero.
How do I train my team on this without making them so skeptical they paralyze the program?
Frame Twyman’s Law as a calibration tool, not a permission gate. The default workflow is: read the test, run the validity checks, ship the result. Twyman triage is the additional step for surprising results only. The framing that works: “we are not skeptical of all results, we are calibrated about which results to spend extra time validating, and the dramatic ones get the extra time because the data tells us they need it.” Pair this with explicit recognition for team members who catch bugs — promotions, written praise, presentation slots at internal meetings. The behavior follows the incentives.
What’s the difference between Twyman’s Law and just “running good A/B tests”?
Twyman’s Law is the specific calibration that the magnitude of a result is itself information about its trustworthiness. Running good A/B tests gets you to “the methodology is sound.” Twyman’s Law adds “the more dramatic the reading, the more aggressively you should validate the methodology on this specific test, because the prior probability of a real effect of that magnitude is low.” It is a Bayesian principle layered on top of frequentist methodology, and it is the layer that catches the failures the frequentist methodology cannot catch on its own — bugs that are silent at the per-test level but obvious when you apply a magnitude-conditioned prior.
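The Bayesian layer can be made concrete with a toy calculation. The sketch below assumes three mutually exclusive explanations for a reading (a real effect of that size, a bug or artifact, or pure noise). Every probability passed in is a made-up illustration, not a measured rate from any program.

```python
def posterior_real_effect(prior_real, prior_bug, lik_real, lik_bug, lik_null):
    """Posterior probability that a reading reflects a real effect.

    prior_real / prior_bug: assumed prior share of tests with a real effect
    of this size, or with a bug; the remainder is treated as "no effect".
    lik_*: assumed probability of seeing this reading under each explanation.
    All inputs are hypothetical and exist only to show the arithmetic.
    """
    prior_null = 1.0 - prior_real - prior_bug
    if min(prior_real, prior_bug, prior_null) < 0:
        raise ValueError("priors must be non-negative and sum to at most 1")
    evidence = (prior_real * lik_real
                + prior_bug * lik_bug
                + prior_null * lik_null)
    if evidence == 0:
        raise ValueError("reading has zero probability under every explanation")
    return prior_real * lik_real / evidence


# Hypothetical inputs: real modest effects are fairly common, real huge ones are rare.
modest = posterior_real_effect(0.10, 0.05, lik_real=0.8, lik_bug=0.3, lik_null=0.01)
dramatic = posterior_real_effect(0.001, 0.05, lik_real=0.8, lik_bug=0.3, lik_null=0.0001)
print(f"modest reading: {modest:.2f}, dramatic reading: {dramatic:.2f}")
```

Under these invented inputs the only thing that changes materially between the two cases is the prior on a real effect of that size, and that alone is enough to move the dramatic reading from probably real to probably a bug.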
How does Twyman’s Law relate to Sample Ratio Mismatch?
SRM is one of the most common mechanisms behind results that trigger Twyman’s Law, and it is also the easiest to detect. Run an SRM check on every experiment as part of the validity checklist. When a result triggers a Twyman investigation, SRM is the first thing to check, because it accounts for a meaningful fraction of all bug-driven miracle lifts. See the companion article on SRM for the full mechanism, prevalence data, and detection protocol.
Is Twyman’s Law applicable outside A/B testing?
Yes, broadly. The original formulation applied to market research and audience measurement. The principle generalizes to any empirical context where measurement noise, instrumentation bugs, attribution issues, or selection effects can produce surprising readings. It is a near-universal heuristic for applied statistics: when a reading is surprising, the most likely explanation is usually something about the measurement rather than something new about the world. The replication crisis literature across psychology, biomedicine, and economics is largely a structural restatement of Twyman’s Law applied at the level of published findings rather than individual experiments.
What’s the single most useful thing I can do tomorrow to operationalize this in my program?
Add an SRM check to every test readout, with an alert that fires whenever the chi-squared p-value is below 0.001. This single change will catch a meaningful fraction of bug-driven miracle lifts before they get celebrated, and it requires perhaps an hour of engineering work on most modern experimentation platforms. After that, work on the contextual surprise thresholds for your major surfaces, then the segment-analysis automation, then the cultural framing. But the SRM check is the highest-ROI starting point.
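As a rough sketch of that check for a two-arm test, assuming you have raw assignment counts per arm and know the intended split. The function names are hypothetical, and the p-value uses the closed form for a chi-squared statistic with one degree of freedom, so no statistics library is needed.

```python
import math


def srm_p_value(control_n, treatment_n, expected_treatment_share=0.5):
    """Chi-squared goodness-of-fit p-value for a two-arm assignment split."""
    if not 0 < expected_treatment_share < 1:
        raise ValueError("expected_treatment_share must be between 0 and 1")
    total = control_n + treatment_n
    if total <= 0:
        raise ValueError("need at least one assigned user")
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (treatment_n - expected_treatment) ** 2 / expected_treatment)
    # With 1 degree of freedom, P(X >= chi2) equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def has_srm(control_n, treatment_n, expected_treatment_share=0.5, alpha=0.001):
    """True when the observed split is implausible under the intended one."""
    return srm_p_value(control_n, treatment_n, expected_treatment_share) < alpha
```

For example, `has_srm(600_000, 590_000)` flags a mismatch on an intended 50/50 split, while `has_srm(50_000, 50_300)` does not. Tests with more than two arms need the general chi-squared test with more degrees of freedom, which this sketch does not cover.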