Statistical vs Practical Significance: Why Your "Significant" A/B Test Result Doesn't Matter

Atticus Li

By Atticus Li, May 25, 2026

Statistical significance (p < 0.05) and practical significance (effect large enough to matter to the business) are independent. High-traffic platforms can detect a 0.02% absolute lift at p < 0.001 — and routinely ship the change. That is the confusion that lets tiny lifts look like wins and kills genuinely useful features that did not happen to clear an arbitrary threshold.

It is the end-of-quarter experimentation review at a consumer subscription business. The growth team is presenting an experiment on the upgrade modal: a one-line copy change, run for two weeks across roughly 4.2 million users per arm. The headline slide reads “Variant B wins. Lift: +0.04% on free-to-paid conversion. p < 0.001.” There is applause. The team has booked the change for next sprint’s deploy.

The CEO, who is not from a quantitative background but has been around the block, asks a question the team did not prepare for: “What does that mean in dollars?” The room calculates. On the business’s actual baseline conversion rate, a 0.04% absolute lift on the upgrade modal translates to about $48,000 of additional annual recurring revenue, against an engineering and design cost of roughly two weeks of three FTEs.

The CEO does not say anything for a while. Then she asks a second question: “What was the lift on the experiment we killed last month, the redesign of the onboarding flow? The one we said failed?” The team pulls up the report. The killed experiment had a point estimate of +1.8% on activation, a 95% confidence interval of [-0.6%, +4.2%], and a p-value of about 0.14. It had been classified as a “null result” and shelved.

The CEO sits with that for another moment, then says: “We just shipped a forty-eight thousand dollar copy change and killed something that, if real, was worth two million. Tell me how the math on this works.”

This is the confusion at the heart of most industry A/B testing programs. Statistical significance is a procedural property of an experiment: assuming the null hypothesis is true, how unlikely is the observed result? Practical significance is an economic property of the change: is the effect large enough that shipping it is the right business decision, net of cost? These two questions are mathematically independent. With enough sample size, you can make any non-zero effect statistically significant. With small enough sample size, you can fail to detect a transformative effect. The relationship between p-values and business value is mediated by traffic, baseline rates, and effect-size assumptions in a way that practitioners systematically underweight, and that A/B testing tools — by surfacing p-values as the headline metric on every experiment dashboard — systematically reinforce.

This article exists because the statistical-versus-practical confusion is the second-most-expensive mistake in industry experimentation (the first being the peeking problem, which produces actual artifacts rather than just misallocated attention). It is also the mistake that the academic statistical literature has been most vocal about for the longest time. The American Statistical Association published a formal position statement on p-values in 2016 and a follow-up special issue in 2019 that, in unusually direct language, asked the scientific community to stop treating “p < 0.05” as a meaningful threshold for any decision. The CRO and experimentation industries have not absorbed that message. The dashboards still lead with p-values. Teams still ship anything green and kill anything red. And businesses keep shipping forty-eight-thousand-dollar copy changes while killing the two-million-dollar onboarding redesign.

What Statistical Significance Actually Is

To understand why statistical significance and practical significance are independent, you have to be precise about what each one is claiming.

A p-value is a conditional probability with a specific structure: assuming the null hypothesis is true (typically, that the experimental treatment has no effect on the outcome), what is the probability of observing a test statistic at least as extreme as the one actually observed? It is a property of the data given a hypothesis, not a property of the hypothesis given the data. The threshold of 0.05 is a convention, not a discovery: Ronald Fisher popularized it in Statistical Methods for Research Workers (1925), during his work on agricultural field trials, as a convenient working cutoff, and repeated it in The Design of Experiments (1935). It has no special mathematical status, and Fisher himself later wrote that researchers should not treat it as a fixed line.

The American Statistical Association’s 2016 statement, Wasserstein, R. L., & Lazar, N. A. (2016). “The ASA Statement on p-Values: Context, Process, and Purpose.” The American Statistician, 70(2), 129–133. DOI: 10.1080/00031305.2016.1154108, was the first formal position the ASA had ever taken on any statistical methodology question, which is itself a signal of how seriously the discipline was taking the misuse. The statement laid out six explicit principles. Two are directly relevant to A/B testing:

P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. A p-value of 0.001 does not mean “there is a 99.9% chance the treatment works.” It means “if the treatment did nothing, we would see a result at least this extreme 0.1% of the time.”
A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. This is the principle that the entire A/B testing industry routinely violates. The size of an effect and the strength of evidence against the null are two different things. A massive sample size can produce overwhelming evidence against the null for a tiny effect; a small sample size can produce weak evidence against the null for a transformative effect.

The follow-up editorial was even more direct. Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). “Moving to a World Beyond ‘p < 0.05.’” The American Statistician, 73(sup1), 1–19. DOI: 10.1080/00031305.2019.1583913 introduced a special issue containing 43 papers from leading statisticians and methodologists on what a post-significance practice should look like. The editorial’s core recommendation was that researchers should stop using the term “statistically significant” entirely, along with the dichotomous practice it encodes of declaring that a result supports or refutes a hypothesis according to which side of a threshold its p-value falls on. The recommendation was not to abandon p-values entirely; it was to stop using the 0.05 threshold as a decision criterion and to report effect sizes with uncertainty intervals as the primary inferential output.

The framing in the 2019 statement is worth quoting because it bears directly on how A/B testing tools surface results: “No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. A label of statistical significance does not mean or imply that an association or effect is highly probable, real, true, or important; nor does a label of statistical non-significance lead to the association or effect being improbable, absent, false, or unimportant.” Read that paragraph next to the headline of a typical A/B test dashboard, which usually reads something like “Significant! Variant B wins!” with a p-value displayed in green. The dashboard is making a claim that the ASA statement explicitly says the methodology cannot support.

What Practical Significance Actually Is

Practical significance is the question the dashboard does not ask: is the size of the effect large enough that shipping the change is the right business decision, net of the costs of shipping it? This is an economic question, not a statistical one. It depends on factors that are completely outside the experimental data: the engineering cost of implementing the variant in production, the design and maintenance overhead of supporting a more complex codebase, the opportunity cost of the team time, the strategic priority of the metric being moved, and the size of the relevant traffic pool to which the change will be applied.

The framework for practical significance in industry A/B testing has been laid out most carefully by Ron Kohavi and his collaborators, drawing on a decade of running thousands of experiments at Microsoft Bing, Amazon, and LinkedIn. Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. ISBN: 9781108724265 is the definitive industry treatment. The book devotes substantial attention to the distinction between statistical and practical significance and is unusually direct about what the recommended practice should be: pre-register a Minimum Detectable Effect (MDE) based on business value, and treat any experimental result whose confidence interval lies entirely below that MDE as a non-result regardless of its p-value.

The Kohavi framing of MDE is the one practitioners should adopt. The MDE is not “the smallest effect the experiment can statistically detect at the available sample size” — that is the statistical definition, which is what most A/B testing tools default to displaying. The MDE in the practical-significance sense is “the smallest effect that would justify the engineering, design, and ongoing maintenance cost of shipping the change.” It is set by the business before the experiment runs, not derived from the data after the experiment ends. A change with a 0.04% lift in conversion may be statistically significant at p < 0.001 on a sample of millions, but if the MDE the business set in advance was 0.5%, the experiment has produced a null result on the question that actually mattered.
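One simple way to turn "the smallest effect that would justify the cost" into a number is a break-even calculation. The sketch below is only that: it assumes value scales linearly with lift, and the function name, the cost figure, and the one-year horizon are hypothetical. The value per percentage point is backed out of the cold open's figures (a 0.04% absolute lift worth about $48,000 a year implies roughly $1.2M per point).

```python
def break_even_mde(total_cost, value_per_point, horizon_years=1.0):
    """Smallest absolute lift (in percentage points) at which shipping pays for itself.

    total_cost:      build + design + maintenance cost over the horizon, in dollars
    value_per_point: annual dollar value of a 1 percentage-point lift (assumed linear)
    horizon_years:   how long the change is expected to keep producing value
    """
    if total_cost < 0 or value_per_point <= 0 or horizon_years <= 0:
        raise ValueError("cost must be non-negative; value and horizon must be positive")
    return total_cost / (value_per_point * horizon_years)


# Implied by the cold open: $48,000 per year for a 0.04-point lift.
value_per_point = 48_000 / 0.04

# Hypothetical all-in cost of shipping and maintaining the change.
mde_points = break_even_mde(total_cost=150_000, value_per_point=value_per_point)
print(f"Break-even MDE: {mde_points:.3f} percentage points")
```

A real MDE would usually sit above break-even to leave room for opportunity cost, but even this crude version gives the team a number to write down before the test starts.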

The same framing applies in the opposite direction, which is the failure mode the CEO in the cold open spotted. An experiment with a point estimate of +1.8% on activation and a confidence interval of [-0.6%, +4.2%] has not produced a null result in the practical-significance sense. It has produced an inconclusive result: the data are consistent with a substantial positive effect and also with a small negative effect, and the question of whether to ship is now a question of expected value and decision theory, not a question of whether the test “passed” some arbitrary threshold. Killing such a result because it failed to clear p < 0.05 treats a decision-theoretic problem as if it were a pass/fail hypothesis test, which is not what the methodology is for.

Why Statistical and Practical Significance Are Independent

The mathematical reason that statistical and practical significance can be arbitrarily decoupled is the sample-size structure of the test statistic. For a two-proportion z-test on conversion rate — the standard A/B testing setup — the test statistic scales with the square root of the sample size. The same absolute effect size produces a larger and larger test statistic as the sample grows, which means the p-value gets smaller and smaller, which means the result gets more and more “statistically significant” without anything about the underlying effect changing.
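To make the square-root scaling concrete, here is a minimal sketch of the pooled two-proportion z statistic for equal arms, z = (p_B - p_A) / sqrt(2 p (1 - p) / n), evaluated at a fixed absolute lift and growing sample sizes. The 5.00% and 5.04% rates are illustrative assumptions, not figures from any real experiment.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(p_control, p_variant, n_per_arm):
    """Pooled two-proportion z statistic and two-sided p-value for equal arms,
    treating the given rates as the observed conversion rates."""
    pooled = (p_control + p_variant) / 2
    se = sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    z = (p_variant - p_control) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Same absolute lift (5.00% -> 5.04%) at growing sample sizes.
for n in (10_000, 1_000_000, 100_000_000):
    z, p = two_proportion_z(0.0500, 0.0504, n)
    print(f"n per arm = {n:>11,}  z = {z:6.2f}  p = {p:.4f}")
```

Every hundredfold increase in traffic multiplies z by ten, so the same 0.04-point lift moves from indistinguishable from noise to overwhelmingly "significant" with no change in the effect itself.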

This is not a subtle effect at industry traffic volumes. At a baseline conversion rate near 50%, a platform running 10,000 visitors per arm needs roughly a 4% relative lift in conversion to detect the effect at p < 0.05 with 80% power; at lower baselines the detectable relative lift is larger, but the square-root scaling is the same. The same platform running 10 million visitors per arm (the kind of traffic Bing or Amazon sees on a major surface) can detect a 0.13% relative lift at the same threshold. The same platform running 100 million visitors per arm can detect a 0.04% relative lift. At Bing’s actual experimentation scale, where individual high-traffic experiments routinely run across hundreds of millions of search sessions, the methodology is capable of detecting absolute effect sizes that are smaller than the day-to-day noise in the underlying business: effects that are real in the sense that they are not chance fluctuations under the null, but that are too small to matter for any business decision.
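The detectable-lift figures above can be approximated with the standard normal-approximation power formula. The sketch below assumes a two-sided test with equal allocation and evaluates the variance at the baseline rate; under that approximation the 4% figure at 10,000 visitors per arm corresponds to a baseline near 50%, and a 5% baseline is included for comparison. The function name and the baselines are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def detectable_relative_lift(baseline, n_per_arm, alpha=0.05, power=0.80):
    """Approximate smallest relative lift detectable in a two-sided,
    equal-allocation two-proportion test (normal approximation, with the
    variance evaluated at the baseline rate)."""
    nd = NormalDist()
    z_total = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)
    absolute = z_total * sqrt(2 * baseline * (1 - baseline) / n_per_arm)
    return absolute / baseline


for n in (10_000, 10_000_000, 100_000_000):
    for baseline in (0.50, 0.05):
        lift = detectable_relative_lift(baseline, n)
        print(f"n = {n:>11,}  baseline = {baseline:.0%}  detectable lift = {lift:.3%}")
```

Whatever the baseline, multiplying traffic by 100 divides the detectable lift by 10, which is why very large platforms can resolve effects far below any plausible business threshold.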

Ron Kohavi has been particularly direct about this in his public writing and at industry talks. In a widely cited LinkedIn post that has become a reference for the experimentation community, Kohavi argued that the appropriate use of the power formula in industry settings is in reverse: do not start with the available sample size and ask “what effect can I detect?” Start with the MDE the business cares about and ask “do I have enough traffic to detect an effect of that size with adequate power?” If the answer is yes, the experiment is worth running; if the answer is no, the experiment will be uninformative regardless of its eventual p-value, and you should not run it.
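Running the power formula "in reverse" might look like the following sketch: fix the business MDE first, compute the traffic it requires, then compare against what is available. It uses the usual normal approximation for a two-sided, equal-allocation two-proportion test; the function names and example numbers are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_arm(baseline, mde_absolute, alpha=0.05, power=0.80):
    """Visitors per arm needed to detect an absolute lift of `mde_absolute`
    over `baseline` (normal approximation, two-sided test, equal arms)."""
    variant = baseline + mde_absolute
    if not 0 < baseline < 1 or mde_absolute <= 0 or variant >= 1:
        raise ValueError("rates must lie strictly between 0 and 1 and the MDE must be positive")
    nd = NormalDist()
    z_total = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)
    variance_sum = baseline * (1 - baseline) + variant * (1 - variant)
    return ceil(z_total ** 2 * variance_sum / mde_absolute ** 2)


def worth_running(baseline, mde_absolute, available_n_per_arm):
    """True when the available traffic can power a test for the business MDE."""
    return available_n_per_arm >= required_n_per_arm(baseline, mde_absolute)


# Hypothetical: 5% baseline, business MDE of 0.5 percentage points.
print(required_n_per_arm(0.05, 0.005))
print(worth_running(0.05, 0.005, available_n_per_arm=10_000))
```

The useful property of this ordering is that the traffic question gets answered before launch, when redesigning or cancelling the test is still cheap.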

The opposite-direction failure, in which small samples with substantial effect sizes fail to reach significance, is symmetric. A genuinely useful product change that produces a 15% lift in some downstream metric, tested on a sample where the available traffic gives the methodology 60% power to detect a 20% lift, is going to fail to reach p < 0.05 about 60% of the time even when the change works exactly as hoped. The frequentist methodology, run naively against an arbitrary threshold, will classify the experiment as a “loss” and the change will be killed. The actual data (point estimate +15%, CI spanning roughly [-3%, +33%], directionally and economically positive but not significant) should support shipping, especially if the cost of the change is low and the cost of the missed opportunity is high. But “ship the null result” is a sentence most experimentation programs cannot bring themselves to say, because the dashboard says red.
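The "fails about 60% of the time" figure can be checked with a short calculation. The sketch assumes the lift estimate is approximately normal, backs out the standard error implied by "60% power to detect a 20% lift," and then evaluates power at the true 15% lift. Function names are illustrative.

```python
from statistics import NormalDist

ND = NormalDist()


def implied_standard_error(design_effect, design_power, alpha=0.05):
    """Standard error at which a two-sided test has `design_power` against
    `design_effect` (ignores the negligible wrong-direction rejection tail)."""
    return design_effect / (ND.inv_cdf(1 - alpha / 2) + ND.inv_cdf(design_power))


def power_at(true_effect, se, alpha=0.05):
    """Probability that a two-sided z-test rejects when the true effect is `true_effect`."""
    z_alpha = ND.inv_cdf(1 - alpha / 2)
    shift = true_effect / se
    return (1 - ND.cdf(z_alpha - shift)) + ND.cdf(-z_alpha - shift)


se = implied_standard_error(design_effect=0.20, design_power=0.60)
miss_rate = 1 - power_at(0.15, se)
half_width = ND.inv_cdf(0.975) * se

print(f"standard error of the lift estimate: {se:.3f}")
print(f"chance of missing a true 15% lift:   {miss_rate:.0%}")
print(f"95% CI around a +15% estimate:       [{0.15 - half_width:+.0%}, {0.15 + half_width:+.0%}]")
```

Under these assumptions the interval around a +15% point estimate dips slightly below zero, which is exactly the "positive but not significant" situation described above.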

The Two Failure Modes In Practice

The combination of these forces produces two distinct, equally common, and equally damaging failure modes in modern experimentation programs.

Failure mode one: significance-as-win. The team ships every experiment whose p-value crosses 0.05. Over the course of a year, they accumulate a large number of “wins,” each of which is statistically significant and economically trivial. The cumulative business impact of the wins is well below what the team’s effort budget should have produced, because most of the wins are detecting effects below the threshold where shipping is the right call. The team’s morale is high (they ship a lot of green dashboards) and the experimentation program’s credibility within the broader business is gradually corroded (the metric impact is not visible in revenue, retention, or any KPI the CFO tracks). Eventually the CFO asks why the experimentation team’s annual budget is the size of an engineering department and the answer is some variant of “we ship many small wins,” which the CFO does not find compelling.

Failure mode two: null-as-failure. The team kills every experiment whose p-value does not cross 0.05. Over the course of a year, they kill a number of features whose point estimates were directionally positive, economically meaningful, and not statistically distinguishable from zero given the available sample. Some fraction of these killed features represent real effects that the test did not have the power to detect. The cost of the false negatives is invisible — the team never sees what would have happened had they shipped — but the cumulative product impact is substantial. The experimentation program is functioning as a brake on product velocity rather than as an evidence base for product decisions, because the brake is calibrated to a threshold that has no relationship to the cost of being wrong.

Most experimentation programs exhibit both failure modes simultaneously. The same team ships the 0.04% lift on the upgrade modal because the p-value cleared the threshold, and kills the 1.8% onboarding redesign because the p-value did not. The decisions are internally consistent with the methodology the team is using, and the methodology is producing systematically misaligned outcomes relative to what would maximize the business’s expected value. This is not a problem with statistical inference; it is a problem with the decision rule the team has wrapped around the inference.

The Gelman-Carlin Type S and Type M Errors Framework

The classical statistical-inference framework focuses on two types of errors. Type I is rejecting the null when the null is true (the false positive, which the 5% threshold nominally controls). Type II is failing to reject the null when the null is false (the false negative, which is controlled by statistical power). The Type I / Type II framing is what most quantitative training emphasizes, and it is the framing that the standard A/B testing methodology is built around.

Andrew Gelman and John Carlin proposed an extension to this framework that is more relevant to how statistically significant results actually fail in practice. Gelman, A., & Carlin, J. (2014). “Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors.” Perspectives on Psychological Science, 9(6), 641–651. DOI: 10.1177/1745691614551642 introduced two additional error types specifically calibrated to the question of what happens when a study produces a statistically significant result:

Type S (sign) error: the experiment produces a statistically significant result with the wrong sign relative to the true effect. The treatment actually decreases the outcome, but the experiment’s significant result claims it increases the outcome (or vice versa).
Type M (magnitude) error: the experiment produces a statistically significant result whose magnitude is substantially inflated relative to the true effect. The treatment has a small positive effect, but the experiment’s significant result claims a large positive effect.

The reason these matter is that Type S and Type M errors are systematic, not random, and they get worse, not better, when the study is underpowered. The mathematical mechanism is selection: in an underpowered study, the only way for a true small effect to clear the significance threshold is for sampling noise to inflate the observed effect, which means significant results are drawn disproportionately from the tails of the noise distribution. The published effect sizes from underpowered significant studies are therefore systematically larger than the true effect sizes, by a factor that Gelman and Carlin call the “exaggeration ratio.” For studies with very low power, this ratio can exceed 3, meaning the typical “significant” effect estimate is more than three times the true underlying effect.

The Type S error is the rarer but more dangerous failure mode. In an underpowered study where the true effect is small and positive, the significance filter selects for observations in both tails of the noise distribution. Most significant results will be correctly positive, but some (the ones where the noise was in the opposite direction and large enough to overcome both the true effect and the threshold) will be significant in the wrong direction. The published literature in such a setting contains a non-zero rate of confidently asserted findings that are pointing the wrong way.
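Both error rates can be computed for a hypothesized true effect and standard error. The sketch below follows the logic of Gelman and Carlin's design calculation under a normal approximation, but uses closed-form truncated-normal means rather than simulation; it is an illustrative reimplementation, and the function name and inputs are assumptions.

```python
from statistics import NormalDist


def retrodesign(true_effect, se, alpha=0.05):
    """Power, Type S rate, and exaggeration ratio for a two-sided z-test,
    given a positive hypothesized true effect and the estimate's standard error."""
    if true_effect <= 0 or se <= 0:
        raise ValueError("true_effect and se must be positive")
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)
    shift = true_effect / se
    upper = z - shift    # standardized cutoff: significant with the correct sign
    lower = -z - shift   # standardized cutoff: significant with the wrong sign
    p_right = 1 - nd.cdf(upper)
    p_wrong = nd.cdf(lower)
    power = p_right + p_wrong
    type_s = p_wrong / power
    # E[|estimate|] restricted to significant outcomes, from truncated-normal means.
    mean_abs_sig = (true_effect * p_right + se * nd.pdf(upper)) \
        + (-true_effect * p_wrong + se * nd.pdf(lower))
    exaggeration = mean_abs_sig / power / true_effect
    return power, type_s, exaggeration


# Hypothetical underpowered test: true effect is half a standard error.
power, type_s, exaggeration = retrodesign(true_effect=0.5, se=1.0)
print(f"power = {power:.1%}, Type S rate = {type_s:.1%}, exaggeration = {exaggeration:.1f}x")
```

As the true effect grows relative to the standard error, the Type S rate goes to zero and the exaggeration ratio approaches 1, which is the quantitative version of "power fixes both problems."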

For A/B testing, the Type S / Type M framing has two specific implications. First, an A/B test that reports p < 0.05 with a barely-clearing point estimate at moderate sample size should be assumed to have an inflated effect size; the post-launch monitoring will routinely show the lift shrinking or disappearing, because the original estimate was selected from the right tail of the sampling distribution. This is a structural feature of significance filtering, not a measurement problem. Second, for any experimentation program where multiple tests are run and only the significant ones are shipped, the aggregate observed lift from shipped variants will systematically overstate the actual business impact, because each individual lift estimate is biased upward by the same selection mechanism. The phenomenon has its own A/B testing literature under the name “winner’s curse,” which is the same mathematical fact applied specifically to the post-test estimation problem.

The deeper point is that statistical significance does not even reliably get the sign right under realistic experimental conditions. The framing of “p < 0.05 means we know the treatment works” is wrong in the strong sense — significant results from underpowered studies are not just inflated, they are sometimes pointing the wrong way, and the methodology gives the practitioner no diagnostic for telling the cases apart.

What Modern Industry Practice Looks Like

The combination of the ASA statements, the Gelman-Carlin framework, the academic critique from the broader replication-crisis literature, and the operational experience of running tens of thousands of A/B tests at major platforms has produced a set of recommended practices that diverge sharply from the default behavior of most A/B testing tools. The recommendations are not new — the Kohavi book that codifies them was published in 2020, drawing on practice that was already mature at Microsoft and similar shops by the mid-2010s — but adoption outside of the major experimentation platforms remains low.

The first recommendation is pre-registered MDE based on business value. Before the experiment launches, the team writes down the smallest effect that would justify shipping the variant, accounting for engineering cost, design cost, ongoing maintenance, and opportunity cost. This number is set by the product owner and the engineering lead, not by the data scientist. The experiment is then powered to detect that MDE at the chosen significance level and power; if the available sample size does not give adequate power for the chosen MDE, the experiment is either redesigned (a more sensitive metric, a longer run time, a larger traffic allocation) or not run at all. Running an underpowered experiment “to see what happens” is treated as a misuse of resources, because the result will be either a false negative on a real effect or a true positive with inflated magnitude — both of which are misleading.

The second recommendation is confidence interval reporting as the primary inferential output. The headline of the experiment report is the point estimate of the lift and its 95% confidence interval, displayed against the pre-registered MDE. The p-value is reported as supplementary information, not as the headline. The decision rule is not “is p < 0.05?” but “where does the confidence interval sit relative to the MDE?” An experiment whose CI is entirely above the MDE is a ship decision. An experiment whose CI is entirely below the MDE is a kill decision, whether the interval sits below the negative of the MDE (the change does real harm) or hugs a trivially small positive lift. An experiment whose CI straddles the MDE is an inconclusive result, and the decision becomes an expected-value calculation under uncertainty rather than a binary classification.
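The CI-versus-MDE rule is simple enough to write down as a function. This sketch makes one interpretive assumption explicit: it treats any interval lying entirely below the MDE as a no-ship, which covers both a harmful change and a precisely estimated but trivial lift such as the 0.04% case against a 0.5% MDE. The function name and the example intervals (other than the onboarding CI from the cold open) are hypothetical.

```python
def classify_result(ci_low, ci_high, mde):
    """Three-tier decision from a confidence interval and a pre-registered MDE.
    All three arguments are in the same units (for example, percentage points)."""
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    if mde <= 0:
        raise ValueError("mde must be positive")
    if ci_low > mde:
        return "ship"          # even the pessimistic end clears the bar
    if ci_high < mde:
        return "kill"          # even the optimistic end falls short of the bar
    return "inconclusive"      # interval straddles the bar: run the expected-value analysis


MDE = 0.5  # percentage points, set by the business before the test

print(classify_result(0.03, 0.05, MDE))   # tiny but precisely measured lift
print(classify_result(-0.6, 4.2, MDE))    # the onboarding redesign's interval
print(classify_result(0.8, 1.6, MDE))     # hypothetical clear winner
```

Note that the p-value never appears: a result can be wildly "significant" and still land in the kill tier, or non-significant and land in the tier that calls for further analysis.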

The third recommendation is Bayesian credible intervals with explicit prior assumptions for high-stakes decisions. The Bayesian framing — what is the posterior probability that the true effect exceeds the MDE, given the data and our prior beliefs about plausible effect sizes? — produces a probability that is directly interpretable as a business decision input. The expected-loss formulations developed by commercial platforms (the “expected loss” decision rule, where you ship the variant when the expected loss from shipping is below an acceptable threshold) operationalize this idea. The classical objection — that priors are subjective and bias the result — is increasingly seen as a feature rather than a bug, because the prior is the place where the team’s existing knowledge about the surface, the metric, and the typical effect sizes of similar changes can be made explicit and inspectable.

The fourth recommendation is expected-value decision rules that account for cost. The relevant business question is not “did the experiment win?” but “what is the cost-adjusted expected value of shipping versus not shipping, integrating over the posterior on the effect size?” This calculation requires explicit numbers for the engineering cost of shipping, the maintenance cost of carrying the change, the value of the metric improvement at the relevant scale, and the time horizon over which the change will produce value. For most A/B test decisions in most businesses, the calculation produces a sharper recommendation than the p-value framing — and surfaces the cases where the experimentation methodology has been answering the wrong question.

Microsoft’s Bing experimentation program is the public case study most often cited as an exemplar. Public talks and writing from the Bing team, including the material in the Kohavi book, document an experimentation operation that runs over 10,000 experiments per year and uses MDE pre-registration and effect-size-with-CI reporting as standard practice rather than as advanced techniques. The published industry impact (Bing’s own attribution of 10–25% annual growth in revenue per search to experimentation-driven changes) is consistent with the claim that the discipline produces large business value when it is run with proper decision rules, even (or especially) when the underlying methodology is statistically conservative.

The Bayesian Alternative

The Bayesian framing of A/B testing deserves a more careful treatment because it has become the dominant alternative offered by modern commercial platforms (VWO, Dynamic Yield, and others now ship Bayesian engines as their default or primary inferential framework), and because it changes the structure of the decision problem in ways that matter for the statistical-versus-practical question.

In a Bayesian A/B test, the analysis produces a posterior distribution over the treatment effect — a probability distribution that combines a prior (the team’s beliefs about plausible effect sizes before the data, often based on the distribution of historical experimental outcomes on similar surfaces) with the likelihood of the observed data. The output is not a p-value; it is a distribution from which any decision-relevant probability can be computed directly. “What is the probability that the lift exceeds 1%?” is answered by integrating the posterior above 1%. “What is the expected value of shipping the variant?” is answered by computing the expected value of the lift weighted by the posterior, against the cost of shipping.
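A minimal sketch of that computation, assuming a normal prior on the lift and a normally distributed estimate (the conjugate normal-normal model, which is a simplification of what commercial engines do). The prior of mean 0 and standard deviation 2 points is a hypothetical choice; the estimate and standard error are backed out of the onboarding experiment's reported +1.8% and [-0.6%, +4.2%] interval.

```python
from math import sqrt
from statistics import NormalDist


def normal_posterior(prior_mean, prior_sd, estimate, se):
    """Posterior over the true lift under a normal prior and normal likelihood."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = 1 / se ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (prior_mean * prior_precision + estimate * data_precision)
    return NormalDist(post_mean, sqrt(post_var))


def prob_exceeds(posterior, threshold):
    """Posterior probability that the true lift is above `threshold`."""
    return 1 - posterior.cdf(threshold)


# Onboarding experiment: +1.8 points, 95% CI half-width of 2.4 points.
se = 2.4 / NormalDist().inv_cdf(0.975)
post = normal_posterior(prior_mean=0.0, prior_sd=2.0, estimate=1.8, se=se)

print(f"posterior mean lift: {post.mean:.2f} points")
print(f"P(lift > 0):         {prob_exceeds(post, 0.0):.0%}")
print(f"P(lift > 1 point):   {prob_exceeds(post, 1.0):.0%}")
```

The threshold passed to `prob_exceeds` plays the role of the MDE, so the practical-significance question is answered by the same object that answers the statistical one.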

The advantage of this framing for the statistical-versus-practical problem is that the practical question is built into the structure of the answer. The MDE is not a side calculation to be remembered separately; it is the integration limit on the posterior. The expected value calculation is not a post-hoc business analysis; it is the natural output of the inference. The decision rule of “ship if the expected lift exceeds the cost threshold” is a one-line specification, not a complex multi-step protocol.

The disadvantage — the one that has slowed Bayesian adoption in industry — is that the prior matters, and most teams do not have a principled way to set it. A vague or improper prior can produce confident posteriors that are wrong; an overly tight prior can swamp the data and produce posteriors that simply reflect the team’s preexisting beliefs. The classical objection from the academic literature, summarized in the McShane et al. paper below, is that the choice of prior is a degree of freedom that can be exploited to produce whatever conclusion the analyst wants, in the same way that flexibility in p-value calculations can produce false positives. The pragmatic response — used by most industry Bayesian implementations — is to use empirical Bayes, where the prior is fit from the historical distribution of experimental outcomes on the same platform, producing a prior that is grounded in the realized distribution of effect sizes rather than in the analyst’s beliefs.
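A method-of-moments version of empirical Bayes can be sketched in a few lines: estimate the prior's mean and spread from past experiments' lift estimates, subtract the average sampling noise, and shrink each new estimate toward the prior mean. This is one simple variant under a normal-normal assumption, not a description of any particular vendor's engine, and the historical numbers are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance


def fit_prior(historical_estimates, historical_ses):
    """Method-of-moments prior: spread of past estimates minus average sampling variance."""
    prior_mean = mean(historical_estimates)
    total_var = variance(historical_estimates)           # sample variance of observed lifts
    noise_var = mean(se ** 2 for se in historical_ses)   # average sampling variance
    prior_var = max(total_var - noise_var, 1e-12)        # floor keeps the prior proper
    return prior_mean, sqrt(prior_var)


def shrink(estimate, se, prior_mean, prior_sd):
    """Posterior mean of the lift: the estimate pulled toward the prior mean."""
    weight = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    return prior_mean + weight * (estimate - prior_mean)


# Hypothetical history of lift estimates (percentage points) and their standard errors.
past_lifts = [-2.0, -1.0, 0.0, 1.0, 2.0]
past_ses = [1.0, 1.0, 1.0, 1.0, 1.0]

prior_mean, prior_sd = fit_prior(past_lifts, past_ses)
print(f"fitted prior: mean {prior_mean:.2f}, sd {prior_sd:.2f}")
print(f"a new +3.0 estimate with se 1.0 shrinks to {shrink(3.0, 1.0, prior_mean, prior_sd):.2f}")
```

The design point is that the prior is an output of the platform's own track record, so it can be inspected and challenged rather than asserted.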

The broader academic argument for moving beyond classical significance testing is laid out in McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2019). “Abandon Statistical Significance.” The American Statistician, 73(sup1), 235–245. DOI: 10.1080/00031305.2018.1527253 and Amrhein, V., Greenland, S., & McShane, B. (2019). “Scientists Rise Up Against Statistical Significance.” Nature, 567(7748), 305–307. DOI: 10.1038/d41586-019-00857-9. The Amrhein paper, signed by more than 800 researchers across disciplines in a remarkable show of consensus, explicitly called for the end of the dichotomous “significant / not significant” framing. The McShane paper proposed concrete alternatives: report effect sizes with uncertainty intervals, treat p-values as continuous measures of evidence rather than as binary decision criteria, and integrate the inferential output with prior knowledge and decision-theoretic considerations.

Geoff Cumming’s “new statistics” agenda, laid out in Cumming, G. (2014). “The New Statistics: Why and How.” Psychological Science, 25(1), 7–29. DOI: 10.1177/0956797613504966, is the constructive program for what experimental practice should look like after the move away from significance testing. The core recommendations are: report effect sizes, report confidence intervals, plan for meta-analysis from the start, and avoid the dichotomous accept/reject framing entirely. The Cumming agenda is sometimes characterized as “estimation thinking” — the goal of the analysis is to estimate the size of the effect with appropriate uncertainty, not to decide whether the effect “exists.” For A/B testing, this maps directly onto the MDE-and-CI framing recommended by Kohavi: the question is not “does the variant beat the control?” but “by how much, with what uncertainty, against what cost?”

What This Means For Your Experimentation Program

The actionable changes for a CRO, experimentation lead, or PM running an A/B testing program are concrete and well-defined. They do not require switching statistical engines, though that may help. They do require changing the decision rule the team is using and the way results are surfaced to stakeholders.

Pre-register an MDE for every experiment. Before the test launches, the team writes down the smallest effect that would justify shipping the change, accounting for the full cost of shipping (engineering, design, maintenance, complexity). This number is a business decision, not a statistical one, and it should come from the product owner and engineering lead rather than from the data scientist. If the available sample size cannot detect that MDE at adequate power, the experiment is either redesigned or not run; underpowered tests on important surfaces are a misuse of traffic and team capacity, because the results — significant or not — will be uninformative for the business decision.

Report results as point estimate plus 95% confidence interval, displayed against the pre-registered MDE. The headline of the experiment review is the lift, with its uncertainty, and where it sits relative to the threshold the business cared about. The p-value is supplementary, not headline. If your experimentation tool defaults to surfacing p-values prominently, suppress that surfacing in your internal templates; make the team work harder to retrieve the p-value than to retrieve the effect size and the CI.

Use a three-tier decision rule rather than a binary significance threshold. An experiment whose CI lies entirely above the MDE is a ship decision. An experiment whose CI lies entirely below the MDE is a kill decision; that covers both a CI below the negative MDE and a tightly estimated lift too small to pay for itself. An experiment whose CI straddles the MDE is an inconclusive result, which triggers a separate expected-value analysis rather than a default action. Most experiments at most companies fall into the inconclusive bucket, and treating them as a third category, rather than forcing them into a binary, produces dramatically better business decisions than the default frequentist framing.

Compute expected value for the inconclusive cases. When the CI straddles the MDE, the right question is the expected value of shipping integrated over the posterior on the effect. If the cost of shipping is low and the upper tail of the CI is large, the expected-value calculation often supports shipping even in the absence of “significance.” If the cost of shipping is high and the lower tail of the CI is negative, the expected-value calculation often supports not shipping even when the point estimate is positive. The calculation is doing real work that the binary threshold is not doing.
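One rough way to run that calculation is to approximate the posterior with a normal distribution recovered from the reported CI, which amounts to assuming a flat prior. The sketch below does that and assumes the payoff is linear in the effect; `value_per_unit` and `ship_cost` are hypothetical business inputs, not figures from the article.

```python
from statistics import NormalDist


def expected_value_of_shipping(point, ci_low, ci_high,
                               value_per_unit, ship_cost, level=0.95):
    """Return (expected net value, probability shipping pays off).

    `point`, `ci_low`, `ci_high` are in the same units of lift;
    `value_per_unit` is the money one unit of lift is worth.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    sd = (ci_high - ci_low) / (2 * z)        # back out the standard error
    posterior = NormalDist(mu=point, sigma=sd)
    # With a linear payoff the expectation depends only on the mean.
    expected_net = value_per_unit * posterior.mean - ship_cost
    break_even = ship_cost / value_per_unit  # lift needed to cover the cost
    p_pays_off = 1 - posterior.cdf(break_even)
    return expected_net, p_pays_off
```

With the shelved redesign's estimate of +1.8 and interval of [-0.6, +4.2] in percentage points, and illustrative inputs of 1,000,000 per point and a 300,000 shipping cost, the function returns a positive expected net value together with the probability that the change clears break-even. Those are the two numbers the binary threshold never surfaces.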

Track post-launch performance against the experimental estimate. The Type M / Type S framework predicts that the realized lift from shipped variants will systematically underperform the experimental estimate. Tracking the realized-versus-predicted ratio across all shipped variants is a calibration check on your experimentation program. If realized lifts are consistently 50–70% of experimental estimates, the program has a Type M problem (you are shipping a meaningful number of inflated-estimate variants and your effort budget is being misallocated). If realized lifts are occasionally negative when experimental estimates were significantly positive, the program has a Type S problem (you are sometimes shipping variants that hurt the metric). Both are tractable with proper power calculations and MDE discipline; both are invisible until you measure them.
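The calibration check can be kept as a small script over the list of shipped variants. This is a sketch under stated assumptions: each record is a hypothetical `(predicted_lift, realized_lift)` pair, only variants shipped on a positive experimental estimate are included, and the median is used so a single outlier does not dominate the ratio.

```python
from statistics import median


def calibration_report(shipped):
    """Summarize realized versus predicted lift across shipped variants.

    `shipped` is a list of (predicted_lift, realized_lift) pairs.
    """
    positive = [(pred, real) for pred, real in shipped if pred > 0]
    if not positive:
        raise ValueError("need at least one variant with a positive prediction")
    ratios = [real / pred for pred, real in positive]
    sign_flips = sum(1 for _, real in positive if real < 0)
    return {
        "n": len(positive),
        # Well below 1.0 suggests inflated estimates (Type M).
        "median_realized_to_predicted": median(ratios),
        # Above 0 means some shipped variants hurt the metric (Type S).
        "sign_flip_rate": sign_flips / len(positive),
    }
```

How the realized lift is measured (holdout group, long-running control, or pre/post comparison) is outside the scope of this sketch and will affect how much weight the ratio deserves.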

The most important calibration change is psychological. The team needs to internalize that p-values are not evidence rankings, p < 0.05 is not a meaningful threshold, and “the experiment won” is not a coherent claim divorced from the size of the effect and the cost of the change. The dashboards will keep displaying p-values, the stakeholders will keep asking whether the test was significant, and the cultural inertia of the experimentation industry will keep pulling the team back toward the binary frame. The teams that resist that pull and run their decisions on effect-size-plus-cost are the ones whose experimentation programs produce demonstrable business impact, retain credibility with the CFO, and avoid the slow-motion erosion of organizational trust in experimentation as a discipline. The teams that do not will ship the forty-eight-thousand-dollar copy change and kill the two-million-dollar onboarding redesign.

Sources

Wasserstein, R. L., & Lazar, N. A. (2016). The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician, 70(2), 129–133. DOI: 10.1080/00031305.2016.1154108
Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a World Beyond “p < 0.05.” The American Statistician, 73(sup1), 1–19. DOI: 10.1080/00031305.2019.1583913
Gelman, A., & Carlin, J. (2014). Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspectives on Psychological Science, 9(6), 641–651. DOI: 10.1177/1745691614551642
Cumming, G. (2014). The New Statistics: Why and How. Psychological Science, 25(1), 7–29. DOI: 10.1177/0956797613504966
Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. ISBN: 9781108724265.
Amrhein, V., Greenland, S., & McShane, B. (2019). Scientists Rise Up Against Statistical Significance. Nature, 567(7748), 305–307. DOI: 10.1038/d41586-019-00857-9
McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2019). Abandon Statistical Significance. The American Statistician, 73(sup1), 235–245. DOI: 10.1080/00031305.2018.1527253
Fisher, R. A. (1935). The Design of Experiments. Oliver and Boyd. (An early statement of the p < 0.05 convention, which Fisher first proposed in Statistical Methods for Research Workers, 1925.)

Related

The Replication Crisis Hub — overview of the broader methodological-failure landscape
The Peeking Problem in A/B Testing — how looking at the dashboard inflates the false-positive rate
The Winner’s Curse in A/B Testing — why the lift from shipped variants is systematically inflated
Multiple Comparisons in A/B Testing — what happens when you run many tests against the same significance threshold
Sample Ratio Mismatch in A/B Testing — the data-integrity check that catches the most common silent failure

Frequently Asked Questions

What MDE should I use for my experimentation program?

There is no single number; the MDE is specific to the surface, the metric, and the business. The practical approach is to compute the engineering, design, and maintenance cost of shipping a typical change for the surface, divide by the relevant revenue or KPI value at the scale the change will apply to, and use the resulting break-even effect size as the floor. For most consumer subscription businesses, MDEs in the 0.5–2% range on primary conversion metrics are typical; for high-traffic engagement surfaces where the cost of change is low, MDEs in the 0.1–0.5% range are reasonable; for major redesigns with substantial engineering investment, MDEs in the 3–10% range are common. The number matters less than the discipline of setting it in advance.
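The break-even arithmetic described above is short enough to write down. The sketch assumes a conversion-driven surface where value scales linearly with the conversion rate; every parameter name (`ship_cost`, `annual_exposed_users`, `value_per_conversion`, `horizon_years`) is a hypothetical input the product owner would supply.

```python
def break_even_mde(ship_cost, annual_exposed_users, baseline_rate,
                   value_per_conversion, horizon_years=1.0):
    """Smallest relative lift that pays back the full cost of shipping."""
    baseline_value = (annual_exposed_users * baseline_rate
                      * value_per_conversion * horizon_years)
    if baseline_value <= 0:
        raise ValueError("baseline value must be positive")
    # A relative lift r adds r * baseline_value, so break-even is cost / value.
    return ship_cost / baseline_value
```

With illustrative inputs of a 60,000 shipping cost, 2,000,000 exposed users a year, a 5% baseline, and 120 of value per conversion, the floor is a 0.5% relative lift. Teams that want a margin over break-even can multiply the result by whatever hurdle factor they pre-register.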

Should I switch to Bayesian A/B testing?

Probably yes, but the switch is less important than fixing the decision rule. A frequentist test with proper MDE pre-registration, CI reporting, and three-tier decision rules will produce better business decisions than a Bayesian test with sloppy priors and no MDE discipline. If you are already running a frequentist test program well, switching to Bayesian gives you cleaner decision-theoretic semantics and natural expected-value calculations; if your current program is undisciplined, switching frameworks will not fix the underlying problem.

What about regulated industries (medical devices, financial products) where p < 0.05 is the formal regulatory threshold?

The regulatory threshold is what the regulator requires for clearance; that is the threshold you have to clear. The statistical-versus-practical distinction still applies, just in addition to the regulatory requirement: a regulated product that clears p < 0.05 with an effect size below the clinically or financially meaningful threshold is not a product worth shipping, even if it is a product the regulator may approve. The MDE pre-registration discipline is even more important in regulated contexts because the cost of being wrong is higher.

Should I report p-values at all?

The 2019 ASA editorial and the McShane et al. “Abandon Statistical Significance” paper argue for no; the practical answer for most industry teams is that p-values can be reported as supplementary information but should not be the headline. The dashboards will display p-values regardless, because the tools default to showing them; the discipline is to train the team to look at the effect size and CI first, the p-value second, and to make decisions on the effect-size-and-cost framing rather than on the p-value threshold.

My A/B testing tool only reports p-values, not confidence intervals. What do I do?

Compute the CI yourself. For two-proportion tests, the CI on the difference is straightforward and can be added as a column in any results spreadsheet. For more complex metrics (revenue per user, multi-variate experiments), bootstrap the CI from the raw data. The tool’s inability to display the CI is a tooling limitation, not a methodological constraint; the underlying data supports the CI computation in every case.
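For a metric such as revenue per user, a percentile bootstrap is one common way to get the interval from raw per-user values. The sketch below assumes independent users and two plain lists of per-user revenue; `bootstrap_diff_ci`, the default of 2,000 resamples, and the fixed seed are illustrative choices.

```python
import random
from statistics import fmean


def bootstrap_diff_ci(control, variant, n_boot=2000, level=0.95, seed=0):
    """Percentile-bootstrap CI for the difference in means (variant - control)."""
    rng = random.Random(seed)              # fixed seed keeps reports reproducible
    diffs = []
    for _ in range(n_boot):
        # Resample each arm with replacement at its original size.
        c = rng.choices(control, k=len(control))
        v = rng.choices(variant, k=len(variant))
        diffs.append(fmean(v) - fmean(c))
    diffs.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = max(lo_idx, int((1 + level) / 2 * n_boot) - 1)
    point = fmean(variant) - fmean(control)
    return point, diffs[lo_idx], diffs[hi_idx]
```

Heavy-tailed revenue data can make percentile intervals unstable, so it is worth checking that the result does not swing when the seed or the number of resamples changes.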

What’s the difference between “minimum detectable effect” as my tool defines it and “MDE” as you’re using it here?

Most tools define MDE as a function of the available sample size and the chosen significance level and power: given the sample size, alpha, and power, the MDE is the smallest effect the test can reliably detect. This is the statistical definition, and it is calculated from the test design and the metric's baseline variance without reference to the business. The framing in this article, the Kohavi framing, uses MDE to mean the smallest effect the business would care about, which is set independently of the data and is a constraint on whether the experiment is worth running at all. Both numbers are useful, but they answer different questions, and the business-MDE is the one that should drive the ship/kill decision.
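The two definitions can be put side by side in code. This sketch computes the statistical MDE for a conversion metric under the normal approximation, assuming equal arm sizes and the baseline variance in both arms, then compares it with a business MDE supplied separately; `statistical_mde` and `worth_running` are hypothetical names, and a given tool's exact formula may differ.

```python
from math import sqrt
from statistics import NormalDist


def statistical_mde(baseline, n_per_arm, alpha=0.05, power=0.80):
    """Smallest absolute lift detectable at the given alpha and power."""
    z_total = (NormalDist().inv_cdf(1 - alpha / 2)
               + NormalDist().inv_cdf(power))
    # Standard error of the difference, using the baseline variance in both arms.
    se = sqrt(2 * baseline * (1 - baseline) / n_per_arm)
    return z_total * se


def worth_running(baseline, n_per_arm, business_mde, alpha=0.05, power=0.80):
    """True if the test can detect an effect as small as the business cares about."""
    return statistical_mde(baseline, n_per_arm, alpha, power) <= business_mde
```

The first function answers "what can this test see?" and the second answers "is that fine enough for the decision we need to make?", which is the distinction the answer above draws.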

How do I get my team to stop calling tests “winners” based on p-values?

Change what the report template looks like. The headline of every experiment review should be the point estimate plus 95% CI, with the pre-registered MDE displayed as a reference line on the same chart. The phrase “statistically significant” should not appear in the headline; if anyone asks, the report can include the p-value as a supplementary number. The team will resist initially because the binary frame is cognitively easier; over a quarter or two of consistent template enforcement, the muscle memory shifts. The thing that does not work is asking the team to “remember the difference between statistical and practical significance” while the dashboard continues to surface p-values as the headline. The cultural change has to be operationalized in the artifact the team produces.

Is this the same problem as the peeking problem?

No, but they compound. The peeking problem produces actual statistical artifacts: optional stopping pushes the real false-positive rate well above the nominal level, so reported p-values are smaller than the evidence warrants. The statistical-versus-practical confusion is a decision-rule problem: even with correctly computed p-values and confidence intervals, the binary significance threshold is not the right decision criterion. A program that fixes the peeking problem (by adopting always-valid inference or pre-registered sample sizes) but does not fix the decision rule still ships the forty-eight-thousand-dollar copy change. A program that fixes the decision rule but does not fix the peeking problem ships changes whose lift estimates are inflated by selection. You need both.

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified in behavioral economics. Led 100+ in-house experiments at NRG in 2025, with project evidence and limits documented in the case studies.
