When you ship the variant with the highest observed lift, you are selecting on true effect plus lucky noise. The noise does not replicate. The math guarantees that aggregate “test wins” overstate aggregate “production wins”; published industry analyses put the typical gap at roughly 20-50%. This is structural, not implementation error.
It is the Wednesday post-mortem after a winning A/B test. The deck shows a button-redesign experiment that read +15.2% on the primary checkout-conversion metric, p = 0.008, sample size pre-registered and met, no peeking, sample ratio mismatch checks clean, segment analysis unsurprising. The variant was shipped to 100% of traffic three weeks ago. The post-launch monitor now shows +6.1%, still inside the 95% confidence interval of the test (though near its lower edge) but less than half the headline lift the leadership team was briefed on. The PM presents three hypotheses: implementation drift between the test environment and production, a confounding seasonality effect that suppressed the underlying improvement, or some interaction with a parallel marketing campaign. The team debates which to investigate. Engineering files a ticket to re-validate the production instrumentation.
There is nothing to investigate. There is no implementation drift, no seasonality confound, no interaction effect. The variant is doing exactly what it was always going to do: delivering a real but smaller-than-claimed lift, because the +15.2% number was never an unbiased estimate of the variant’s true effect once selection is taken into account. The estimator behind it is unbiased for a single variant chosen in advance, but the very act of selecting the variant for shipping (looking across a portfolio of tested variants and choosing the one with the highest observed lift) introduced a systematic upward bias into the headline number. The variant was selected partly for being above its true effect by chance. The chance does not persist.
This is the Winner’s Curse in A/B testing, and it is one of the cleanest examples in applied statistics of a phenomenon that practitioners do not learn about until it has been silently degrading their experimentation program’s calibration for years. The math is rigorous. The mechanism is unambiguous. The fix is well-documented in the literature. And the structural insight is uncomfortable: even with perfect methodological hygiene --- no peeking, proper SRM checks, segment analysis, the full Microsoft-grade discipline of trustworthy experimentation --- the headline lift on the winning variant systematically overstates the true effect, because the selection rule “ship the one with the highest observed lift” is statistically equivalent to “ship the one with the most upward measurement error.”
The implication for CRO leaders and growth executives is calibration: when you report a +15% test win to the leadership team, the realistic expectation for production impact should already be discounted by some shrinkage factor. The published industry analyses suggest that factor is typically in the 20-50% range for individual high-confidence wins and roughly 30-50% for portfolio aggregates. If your experimentation program is currently reporting raw test lifts and treating them as production forecasts, your leadership is being briefed on numbers that are systematically too high by amounts that compound across the portfolio.
What The Winner’s Curse Actually Is
The Winner’s Curse is older than A/B testing. The original framing comes from the auction-theory literature on sealed-bid auctions for items of uncertain value --- the winner of the auction is the bidder whose private valuation of the item was highest, but if all bidders are estimating the same underlying true value with noise, the highest valuation is systematically above the true value. The winner systematically overpays. The mechanism is regression to the mean applied to the selection event.
The same mechanism operates in A/B testing, with the bidders replaced by tested variants and the private valuations replaced by observed lifts. Each variant has some true effect on the primary metric. The A/B test produces an estimate of that effect with measurement noise. The team that runs the portfolio of tests, observes the estimated lifts, and picks the highest-lift variant to ship is doing exactly what the auction winner does --- selecting on the variable that includes both the underlying truth and the lucky upward error.
Decompose the math. For any tested variant $i$, the observed lift can be written as:
$$\hat{\theta}_i = \theta_i + \epsilon_i$$
where $\theta_i$ is the true effect of the variant on the primary metric, and $\epsilon_i$ is the measurement noise from the finite sample size of the test. Under standard A/B testing assumptions, $\epsilon_i$ is approximately normal with mean zero and variance determined by the per-arm sample size and the metric’s underlying variance. The test gives you $\hat{\theta}_i$; you want to know $\theta_i$.
For a single pre-specified variant tested in a single pre-specified experiment, $\hat{\theta}_i$ is an unbiased estimator of $\theta_i$. This is the standard A/B testing guarantee, and it is correct.
The Winner’s Curse appears when the variant being shipped is not pre-specified. If you run $N$ tests, observe $N$ estimated lifts, and select the variant with the maximum observed lift --- call it variant $k$ --- the conditional expectation of $\theta_k$ given that variant $k$ was the maximum is no longer equal to $\hat{\theta}_k$. It is strictly less than $\hat{\theta}_k$. The selection event “this variant had the highest observed lift in the portfolio” provides information that the selected variant probably had a positive realization of the noise term $\epsilon_k$, because variants with positive noise realizations are more likely to be selected as the maximum than variants with negative noise realizations.
The selection bias is purely a property of the selection rule. It has nothing to do with the quality of the individual test. The test was clean; the estimator was unbiased; the false-positive rate was correctly controlled. The bias enters at the moment you select on the observed outcome.
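A small simulation makes the selection effect concrete. The sketch below is illustrative only: it assumes true effects and noise are both normal with made-up standard deviations, and the function name `simulate_winner_bias` is hypothetical. Every individual estimate is unbiased by construction, yet the average gap between the winner's observed lift and its true effect grows with the number of variants competing for the top spot.

```python
import random
import statistics


def simulate_winner_bias(n_variants, true_sd=1.0, noise_sd=1.0,
                         n_portfolios=5000, seed=0):
    """Mean of (observed lift - true effect) for the top-observed variant.

    Illustrative assumptions: true effects ~ Normal(0, true_sd) and
    measurement noise ~ Normal(0, noise_sd), independent across variants.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_portfolios):
        best_obs = None
        best_true = None
        for _ in range(n_variants):
            theta = rng.gauss(0.0, true_sd)           # true effect, theta_i
            obs = theta + rng.gauss(0.0, noise_sd)    # observed lift, theta_hat_i
            if best_obs is None or obs > best_obs:
                best_obs, best_true = obs, theta
        gaps.append(best_obs - best_true)             # noise term of the winner
    return statistics.fmean(gaps)


if __name__ == "__main__":
    for n in (1, 5, 20):
        print(n, round(simulate_winner_bias(n), 3))
```

With a single pre-specified variant the average gap is close to zero; with several variants and a pick-the-maximum rule it is clearly positive, even though nothing about the individual tests has changed.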
This is regression to the mean re-expressed in operational terms. The next time you measure the variant’s effect, the noise realization will be different, and on average it will be closer to the prior mean than the original noise realization that put the variant at the top of the portfolio. The “next measurement” is the production rollout. The shrinkage from headline lift to production impact is the Winner’s Curse cashing in.
The Math --- Empirical Bayes Shrinkage
The formal correction for the Winner’s Curse is empirical Bayes shrinkage of the observed lift estimates. The intuition is that the population of true effects $\theta_i$ across the portfolio of tested variants has some prior distribution $p(\theta)$ --- mostly small effects with occasional larger ones, roughly normally distributed with some hyperparameters that the data can estimate. Given the observed estimate $\hat{\theta}_i$ and the prior, Bayes’ rule produces a posterior on $\theta_i$ that is a weighted average of the observation and the prior mean. The posterior mean is the shrinkage estimator.
The classical form, attributable to Stein, C. (1956). “Inadmissibility of the usual estimator for the mean of a multivariate normal distribution.” Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, and the James-Stein refinement (James, W., & Stein, C. (1961). “Estimation with quadratic loss.” Proceedings of the Fourth Berkeley Symposium), is that for three or more independent normal means, the vector of sample means is inadmissible under total squared-error loss: a shrunk estimator that pulls every sample mean toward a common point has lower expected total squared error whatever the true means are (shrinking toward the grand average of the observations needs at least four means). The means do not even have to be related for the result to hold, although the gain is largest when they cluster together. The James-Stein estimator was the original proof that “naively use the sample mean” is provably suboptimal whenever you estimate several means at once.
In the A/B testing context, the related estimates are the observed lifts on the variants in a portfolio of tested experiments. The shrinkage estimator pulls each observed lift toward zero (or toward the portfolio’s prior mean lift, which is typically small and may even be slightly negative once you account for the long right tail of true positives and the bulk of true zeros). The amount of shrinkage depends on the ratio of measurement noise variance to between-variant true-effect variance. Highly noisy tests --- tests with small samples or noisy metrics --- get shrunk a lot. Low-noise tests --- tests with very large samples and clean metrics --- get shrunk less. The shrinkage is proportional to how much weight Bayes’ rule places on the prior versus the observation.
The conditional expected value of the true effect given the observation, under a normal prior with mean $\mu_0$ and variance $\tau^2$, and observation noise variance $\sigma^2$, is:
$$E[\theta_i | \hat{\theta}_i] = \frac{\sigma^2 \mu_0 + \tau^2 \hat{\theta}_i}{\sigma^2 + \tau^2}$$
If the prior is centered at zero ($\mu_0 = 0$) and the test is noisy enough that $\sigma^2$ is comparable to $\tau^2$, the shrinkage factor on $\hat{\theta}_i$ is roughly one-half --- the posterior mean is about half the observed lift. If the test has very large samples and very low noise, the shrinkage factor approaches one and the posterior mean is approximately the observed lift.
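As a worked illustration of the formula, the sketch below computes the posterior mean for the opening example. The standard error is backed out from the reported p-value under a normal approximation; the prior standard deviation `tau` is a purely hypothetical value chosen for the example, and both function names are made up here.

```python
from statistics import NormalDist


def se_from_p_value(estimate, p_value):
    """Standard error implied by a two-sided p-value (normal approximation)."""
    z = NormalDist().inv_cdf(1 - p_value / 2)
    return abs(estimate) / z


def posterior_mean(observed, noise_var, prior_mean, prior_var):
    """E[theta | theta_hat] for a Normal(prior_mean, prior_var) prior and normal noise."""
    weight = prior_var / (noise_var + prior_var)   # weight placed on the observation
    return weight * observed + (1 - weight) * prior_mean


if __name__ == "__main__":
    se = se_from_p_value(15.2, 0.008)   # implied SE of the +15.2% headline lift
    tau = 4.0                           # hypothetical prior sd, in lift points
    print(round(se, 2), round(posterior_mean(15.2, se ** 2, 0.0, tau ** 2), 2))
```

When `noise_var` equals `prior_var` the weight is exactly one-half, which is the case described above; as `prior_var` grows relative to `noise_var` the weight approaches one and the observation is left nearly untouched.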
This is the formal math behind the “shrink the test winner by 30-50%” rule of thumb that the industry-published analyses converge on. The exact shrinkage in a given program depends on the portfolio’s specific noise-to-signal ratio, but for typical industry A/B testing programs running on moderate sample sizes against noisy primary metrics, the shrinkage is large enough to materially change the reported numbers.
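In practice the prior is not known and has to be estimated from the portfolio itself. The sketch below is one simple way to do that, a method-of-moments fit of the normal-normal model; it is a generic textbook construction under stated assumptions (independent tests, known per-test noise variances), not the estimator of any paper cited here, and the name `empirical_bayes_shrink` is hypothetical.

```python
import statistics


def empirical_bayes_shrink(observed, noise_vars):
    """Shrink observed lifts toward the portfolio mean (normal-normal model).

    observed   : list of observed lifts, one per test
    noise_vars : list of squared standard errors, one per test (all > 0)
    Returns (estimated prior mean, estimated prior variance, shrunk lifts).
    """
    if len(observed) != len(noise_vars) or len(observed) < 2:
        raise ValueError("need matching lists with at least two tests")
    if any(s2 <= 0 for s2 in noise_vars):
        raise ValueError("noise variances must be positive")
    mu0 = statistics.fmean(observed)                 # estimated prior mean
    total_var = statistics.variance(observed)        # spread of observed lifts
    # Observed spread = true-effect spread + average noise, so subtract the noise.
    tau2 = max(total_var - statistics.fmean(noise_vars), 0.0)
    shrunk = []
    for obs, s2 in zip(observed, noise_vars):
        weight = tau2 / (tau2 + s2)                  # weight on the observation
        shrunk.append(weight * obs + (1 - weight) * mu0)
    return mu0, tau2, shrunk
```

For example, `empirical_bayes_shrink([-2, 0, 2, 4, 6], [5, 5, 5, 5, 5])` estimates a prior variance equal to the noise variance and therefore pulls every lift halfway toward the portfolio mean. When the observed spread is no larger than the noise alone would produce, the estimated prior variance is clipped at zero and every estimate collapses to the portfolio mean.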
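Selection on the maximum is where this correction earns its keep. In frequentist terms, the bias of the raw estimate conditional on the selection event “this variant was the maximum” is larger than its unconditional bias (which is zero), and it grows with the number of variants competing for the top spot. The posterior mean above needs no separate selection adjustment when the prior is correct, because once $\hat{\theta}_i$ is known, the fact that the variant was selected carries no further information about $\theta_i$. The practical risk is a misspecified prior, which matters most in the tails where the winners live. The full treatment in Efron, B. (2011). “Tweedie’s formula and selection bias.” Journal of the American Statistical Association, 106(496), 1602–1614. DOI: 10.1198/jasa.2011.tm11181 shows how Tweedie’s formula estimates that posterior mean directly from the marginal distribution of the observed estimates, which corrects for selection without committing to a parametric prior.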
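For reference, Tweedie’s formula (the result named in the Efron title) can be written in the notation used above. This is a sketch for normal noise with known variance $\sigma^2$; the symbol $m(\cdot)$, introduced here for illustration, denotes the marginal density of the observed lifts across the portfolio.

$$E[\theta_i \mid \hat{\theta}_i] = \hat{\theta}_i + \sigma^2 \, \frac{d}{d\hat{\theta}_i} \log m(\hat{\theta}_i)$$

Under the normal prior used earlier, $m$ is the $N(\mu_0, \sigma^2 + \tau^2)$ density, so the derivative of $\log m$ is $-(\hat{\theta}_i - \mu_0)/(\sigma^2 + \tau^2)$ and the expression reduces to:

$$\hat{\theta}_i - \frac{\sigma^2 (\hat{\theta}_i - \mu_0)}{\sigma^2 + \tau^2} = \frac{\sigma^2 \mu_0 + \tau^2 \hat{\theta}_i}{\sigma^2 + \tau^2}$$

which is the same posterior mean as before. Estimates far above the bulk of the portfolio sit where $m$ is falling, so the correction term is negative and pulls them down.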
Type M Errors --- Gelman & Carlin 2014
The cleanest framing of why selection on observed effect inflates effect estimates comes from the methodological-reform literature in psychology, in Gelman, A., & Carlin, J. (2014). “Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors.” Perspectives on Psychological Science, 9(6), 641–651. DOI: 10.1177/1745691614551642.
Gelman and Carlin’s contribution was to extend the traditional Type I / Type II error framework with two additional error types that capture how published effects get distorted under low-power-plus-selection conditions. Their definitions:
- Type S (sign) error: Conditional on having found a statistically significant effect, the probability that the estimated effect has the wrong sign --- i.e., the variant actually hurts the metric, but the test says it helps.
- Type M (magnitude) error: Conditional on having found a statistically significant effect, the expected ratio of the estimated effect magnitude to the true effect magnitude. The “exaggeration ratio.” If the true effect is +5% and the average significant estimate is +12%, the Type M ratio is 2.4 --- significant findings overstate true effects by a factor of 2.4 on average.
The Gelman-Carlin paper is a paradigm shift because it reframes statistical power. Traditional power analysis is about Type II error --- the probability of missing a real effect. Gelman and Carlin show that low-power studies have a separate, more insidious problem: even when they do detect a real effect, they systematically overstate its magnitude, often by large multiples. The lower the power, the larger the Type M inflation, because only the lucky upward noise realizations cross the significance threshold and get reported.
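The three quantities can be computed in closed form once a true effect and a standard error are assumed. The sketch below uses a plain normal approximation with a two-sided test; it is an illustration in the spirit of the Gelman-Carlin design calculation rather than a reproduction of their code, and the function name `retrodesign` and its argument names are choices made here.

```python
from statistics import NormalDist

_STD = NormalDist()


def retrodesign(true_effect, std_error, alpha=0.05):
    """Return (power, Type S rate, Type M ratio) under a normal model.

    Assumes the estimate is Normal(true_effect, std_error) and that a result
    is "significant" when |estimate| exceeds the two-sided critical value.
    """
    if true_effect <= 0 or std_error <= 0:
        raise ValueError("true_effect and std_error must be positive")
    z_crit = _STD.inv_cdf(1 - alpha / 2)
    a = z_crit - true_effect / std_error   # standardized gap to the upper cutoff
    b = z_crit + true_effect / std_error   # standardized gap to the lower cutoff
    p_hi = 1 - _STD.cdf(a)                 # significant with the correct sign
    p_lo = _STD.cdf(-b)                    # significant with the wrong sign
    power = p_hi + p_lo
    type_s = p_lo / power
    # E[|estimate| ; significant], from the truncated-normal mean formula.
    mean_abs = true_effect * (p_hi - p_lo) + std_error * (_STD.pdf(a) + _STD.pdf(b))
    type_m = mean_abs / power / true_effect
    return power, type_s, type_m


if __name__ == "__main__":
    # Hypothetical inputs: a well-powered design and a badly underpowered one.
    print(retrodesign(true_effect=2.8, std_error=1.0))
    print(retrodesign(true_effect=0.5, std_error=1.0))
```

The first call corresponds to a design with roughly 80% power, where the exaggeration ratio is only modestly above one; the second shows how quickly both error types grow when the true effect is small relative to the standard error.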
For A/B testing, the relevance is direct. Underpowered tests (tests with too little sample to reliably detect the effect sizes that actually occur in the program) produce winning variants whose headline lifts are dramatically inflated relative to their true effects. A variant can be a genuine winner and still be reported with a number much larger than its true effect. A program that runs underpowered tests on a wide portfolio of features can systematically report winning lifts that are 2x to 4x larger than the actual production impact, because the only winners that surface are the ones whose noise realizations were lucky enough to clear a significance bar that is very demanding relative to the true effect size.
The portfolio-level implication: if your program designs its A/B tests for 80% statistical power, but the true effect sizes turn out to be substantially smaller than the design assumed, the realized power is far below 80%. Your reported winners are then not just somewhat inflated but severely inflated, and the apparent “win rate” of your program is being held up largely by the upper tail of the noise distribution. The Gelman-Carlin treatment shows how to compute the inflation for any assumed true effect and standard error. It is not a small correction.
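The gap between designed and realized power is easy to quantify. The sketch below assumes a two-sided z-test and a sample size chosen to hit the design power exactly at the assumed effect; `realized_power` is a hypothetical helper name.

```python
from statistics import NormalDist


def realized_power(design_effect, true_effect, design_power=0.80, alpha=0.05):
    """Power actually achieved when the true effect differs from the design effect."""
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)
    # Standard error implied by the design (the tiny wrong-sign rejection
    # region is ignored when sizing, as in the usual sample-size formula).
    std_error = design_effect / (z_crit + std.inv_cdf(design_power))
    shift = true_effect / std_error
    return (1 - std.cdf(z_crit - shift)) + std.cdf(-z_crit - shift)


if __name__ == "__main__":
    # Designed for a 2-point lift; the true lift is half of that.
    print(round(realized_power(2.0, 2.0), 3), round(realized_power(2.0, 1.0), 3))
```

A test sized for 80% power against a given lift has well under half that power if the true lift is only half as large, which is the regime where the exaggeration ratio becomes severe.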
Lee & Shen 2018 --- The Explicit A/B Testing Treatment
The canonical A/B testing-specific treatment of the Winner’s Curse is Lee, M. R., & Shen, M. (2018). “Winner’s curse: Bias estimation for total effects of features in online controlled experiments.” KDD ’18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 491–499. DOI: 10.1145/3219819.3219905. Lee and Shen are Airbnb data scientists, and the paper documents the Winner’s Curse as observed in production at Airbnb’s Experiment Reporting Framework (ERF).
The framing of the paper is that Airbnb runs hundreds of concurrent experiments, and the platform-level reporting aggregates the estimated effects of “shipped” variants to produce a quarterly accounting of how much total value the experimentation program has delivered. The naive total --- summing the observed lifts of variants that were shipped --- is biased upward, because the shipping decision is conditional on observing a significant positive lift. The variants that were shipped are exactly the ones whose observed lifts include positive noise realizations large enough to clear the significance threshold; the variants whose noise realizations were negative or small were not shipped, so they do not contribute to the total. The selection asymmetry produces a portfolio-level overstatement.
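The selection asymmetry described above can be reproduced in a few lines. The sketch below simulates a hypothetical portfolio under assumed normal true effects and noise, ships every variant whose z-score clears a threshold, and compares the naive sum of observed lifts with the sum of true effects for the same shipped set. All parameter values are made up for illustration, and this is a demonstration of the bias, not the paper's estimator.

```python
import random


def simulate_shipped_total(n_tests=2000, true_sd=1.0, noise_sd=2.0,
                           z_ship=1.96, seed=1):
    """Return (number shipped, naive total lift, true total lift)."""
    rng = random.Random(seed)
    shipped = 0
    naive_total = 0.0
    true_total = 0.0
    for _ in range(n_tests):
        theta = rng.gauss(0.0, true_sd)          # true effect of this variant
        obs = theta + rng.gauss(0.0, noise_sd)   # observed lift in the test
        if obs / noise_sd > z_ship:              # ship only significant positive lifts
            shipped += 1
            naive_total += obs                   # what the naive report adds up
            true_total += theta                  # what production actually receives
    return shipped, naive_total, true_total


if __name__ == "__main__":
    print(simulate_shipped_total())
```

Because only variants with large observed lifts enter the sum, the naive total sits far above the true total whenever the noise is large relative to the spread of true effects.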
Lee and Shen formalize this with the equivalent of the auction-theory math: conditional on the selection event “this variant’s observed lift exceeded the shipping threshold,” the conditional expectation of the true effect is below the observed lift. They derive the correction analytically and propose an empirical Bayes implementation that estimates the prior from the observed distribution of test results across the program. The corrected total effect is the sum of the shrunk individual estimates, and it is systematically lower than the naive sum of the raw observed lifts.
The methodological contribution is twofold. First, they make the Winner’s Curse explicit as a structural feature of how A/B testing programs report aggregate value --- the bias is not a flaw in any individual test, but a property of the selection rule applied at the program level. Second, they provide a practical Bayesian correction that can be retrofitted onto an existing experimentation platform without changing the underlying test methodology. The platform still runs frequentist A/B tests; the platform-level reporting layer applies the shrinkage adjustment.
The Lee-Shen treatment is the natural reference for industry teams who want to understand the bias quantitatively. The methodology is sound, the implementation is tractable, and the framing maps directly onto how most experimentation programs structure their reporting.
Coey & Cunningham 2019 --- Experiment Splitting And The 44% Improvement
The most directly persuasive industry evidence on the magnitude of the Winner’s Curse comes from Coey, D., & Cunningham, T. (2019). “Improving treatment effect estimators through experiment splitting.” WWW ’19: The World Wide Web Conference, 285–295. DOI: 10.1145/3308558.3313452. Coey and Cunningham are Facebook data scientists, and the paper analyzes a dataset of 226 Facebook News Feed A/B tests using a methodology called experiment splitting.
Experiment splitting works as follows. Take an A/B test with sample size $n$ in each arm (lowercase here, to keep it distinct from the number of tests $N$ used earlier). Randomly split each arm into two halves of size $n/2$. Compute the lift estimate on the first half ($\hat{\theta}^{(1)}$) and the lift estimate on the second half ($\hat{\theta}^{(2)}$). These are two independent estimates of the same true effect $\theta$. The first half can be used to “select” winners or fit shrinkage priors; the second half provides an unbiased validation estimate that is not contaminated by the selection event. The methodology is the experimental analog of a holdout set in machine learning: it preserves an unbiased estimate of generalization performance by holding back a portion of the data from the model-selection step.
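The splitting step itself is mechanically simple. The sketch below is a minimal single-split illustration on simulated data, not the repeated-splitting estimator from the paper: `split_lift_estimates` and `selected_means` are hypothetical names, outcomes are assumed normal with unit variance, and the lift is measured as a difference in means.

```python
import random
import statistics


def split_lift_estimates(control, treatment, rng):
    """Randomly split each arm in half; return the two half-sample lift estimates."""
    c = list(control)
    t = list(treatment)
    rng.shuffle(c)
    rng.shuffle(t)
    hc, ht = len(c) // 2, len(t) // 2
    first = statistics.fmean(t[:ht]) - statistics.fmean(c[:hc])
    second = statistics.fmean(t[ht:]) - statistics.fmean(c[hc:])
    return first, second


def selected_means(n_tests=300, n_per_arm=200, seed=2):
    """Pick the top 10% of tests on the first half; average three quantities over them."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_tests):
        theta = rng.gauss(0.0, 0.05)   # assumed distribution of true effects
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        treatment = [rng.gauss(theta, 1.0) for _ in range(n_per_arm)]
        first, second = split_lift_estimates(control, treatment, rng)
        rows.append((first, second, theta))
    rows.sort(key=lambda r: r[0], reverse=True)
    top = rows[: n_tests // 10]        # "winners" chosen on the first half only
    # Mean first-half estimate, mean second-half estimate, mean true effect.
    return tuple(statistics.fmean(col) for col in zip(*top))


if __name__ == "__main__":
    print(selected_means())
```

Among the selected winners, the first-half average is inflated by the selection, while the second-half average tracks the true effects because it played no part in choosing them.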
The empirical finding from the 226 Facebook News Feed A/B tests is the headline result: a shrinkage estimator based on repeated experiment splitting had 44% lower mean squared predictive error than the conventional, unshrunk treatment effect estimator. The 44% number measures how much predictive accuracy the raw estimates give up at Facebook News Feed scale; it is a reduction in squared prediction error, not a direct estimate of how inflated any individual winner is. It is not a theoretical worst-case estimate; it is what they measured in a real, large-scale experimentation program. They also report that their estimator had 18% lower mean squared predictive error than the classical James-Stein shrinkage estimator, showing that even the standard shrinkage methodology was leaving substantial improvement on the table.
The methodological contribution of Coey-Cunningham is that experiment splitting gives you a non-parametric way to estimate the right amount of shrinkage to apply, without having to make distributional assumptions about the prior. The flexibility means the methodology can be applied across very different test portfolios --- portfolios with mostly small effects, portfolios with heavy tails, portfolios with strong inter-test correlations --- and produce calibrated shrinkage estimates appropriate to each. The trade-off is that you “spend” half your sample size on the splitting; the gain in calibration has to be worth the loss in raw statistical power.
For an industry team trying to understand how much shrinkage their winning variants need, the Coey-Cunningham 44% MSE reduction is the cleanest available benchmark. It implies that a typical Facebook News Feed test, treated naively, had a mean squared prediction error roughly 80% higher than what was achievable with a properly shrunk estimator. The implied estimation error is large, and it was measured on a real test portfolio rather than in a simulation.
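The “roughly 80% higher” figure follows from inverting the 44% reduction. A short check, writing $\mathrm{MSE}_{\text{raw}}$ and $\mathrm{MSE}_{\text{shrunk}}$ (labels introduced here) for the two mean squared predictive errors:

$$\mathrm{MSE}_{\text{shrunk}} = (1 - 0.44)\,\mathrm{MSE}_{\text{raw}} \quad\Rightarrow\quad \frac{\mathrm{MSE}_{\text{raw}}}{\mathrm{MSE}_{\text{shrunk}}} = \frac{1}{0.56} \approx 1.79$$

so the raw estimator's error is about 79% higher than the shrunk estimator's.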
Empirical Demonstrations Beyond Facebook
The Lee-Shen Airbnb paper and the Coey-Cunningham Facebook paper are the cleanest individual case studies, but the broader pattern shows up across every major experimentation program that has published portfolio-level analyses.
Microsoft’s Experimentation Platform is documented at length in Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. ISBN: 978-1108724265, and the book is explicit that the Winner’s Curse is one of the standard methodological hazards that the Microsoft platform’s analysts have to actively correct for in their reporting. Kohavi’s broader work on experimentation at Microsoft and Amazon includes systematic discussions of how naive aggregation of shipped-experiment effects overstates the true cumulative impact, and how the platform’s reporting layer needs to apply shrinkage or holdout validation to produce honest numbers.
The earlier Crook, T., Frasca, B., Kohavi, R., & Longbotham, R. (2009). “Seven pitfalls to avoid when running controlled experiments on the web.” KDD ’09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1105–1114. DOI: 10.1145/1557019.1557139 raised the selection bias issue in print at the very beginning of the modern industry-A/B-testing literature, framing it as one of the standard pitfalls that platform designers and analysts need to internalize. The 2009 paper is striking in retrospect because it identified the structural problem nearly a decade before the formal corrections were published, but the methodological discipline was slow to propagate across the industry.
Microsoft’s Bing data also forms the basis of Azevedo, E. M., Deng, A., Olea, J. L. M., Rao, J., & Weyl, E. G. (2020). “A/B testing with fat tails.” Journal of Political Economy, 128(12), 4614–4672. DOI: 10.1086/710607, a paper that analyzes the distribution of true effects across Bing’s experimentation portfolio. The finding most relevant to the Winner’s Curse is that the underlying distribution of true effects is fat-tailed: most experiments have very small true effects, a small fraction have large true effects, and the tail is heavy enough that portfolio-optimal strategies look different from what classical frequentist designs would recommend. The fat-tail finding interacts with the Winner’s Curse: when most true effects are small and a few are large, the variants that barely clear the significance threshold on noisy tests are disproportionately ones that drew from the upper tail of the noise distribution, and their headline lifts can be more inflated than the simpler Gaussian framing would suggest.
The broader publication-bias literature in economics provides the methodological foundation for treating selection-on-significance as a quantifiable, correctable bias. Andrews, I., & Kasy, M. (2019). “Identification of and correction for publication bias.” American Economic Review, 109(8), 2766–2794. DOI: 10.1257/aer.20180310 develops the formal framework for estimating the conditional probability of publication as a function of a study’s result and correcting estimated effect sizes for the resulting bias. The Andrews-Kasy methodology was developed for academic-economics meta-studies, but the mathematical machinery is identical to what an industry experimentation program would need to correct portfolio-level reporting. The selection event “this paper was published” maps directly to “this variant was shipped”; the same correction approaches apply.
The convergence across these literatures is the most important empirical fact for industry practitioners. Independently developed analyses --- by Airbnb data scientists, Facebook data scientists, Microsoft experimentation researchers, and academic economists --- all arrive at the same conclusion: when you select on observed effect, you systematically overstate the true effect, and the magnitude of the bias is large enough to matter for portfolio-level decisions. The 30-50% shrinkage range is the rough consensus, with the exact value depending on the noise-to-signal ratio of the specific program.
Why “Better Methodology” Doesn’t Eliminate The Winner’s Curse
The most counterintuitive aspect of the Winner’s Curse is that it is robust to improvements in test methodology. A team can adopt the entire toolkit of trustworthy experimentation --- always-valid inference to handle peeking, sample ratio mismatch detection, careful segment analysis, pre-registered hypotheses, the works --- and still have a Winner’s Curse problem of the same approximate magnitude. The structural source of the bias is the selection rule, not the test methodology.
Consider a thought experiment. Suppose you run a portfolio of A/B tests with perfect statistical methodology: every test uses always-valid inference, so the false-positive rate is held at or below 5% even under continuous monitoring; every test has been checked for SRM and passes; the primary metrics are clean and the segment analyses don’t reveal any heterogeneity issues; the sample sizes are correctly powered for the realistic effect sizes you expect; the analyses are conducted by a careful, methodologically sophisticated team. The tests themselves are unimpeachable.
You select the variant with the highest observed lift to ship. The Winner’s Curse applies. The expected production impact is below the observed lift, by an amount that depends on the noise-to-signal ratio of your test portfolio and on how many variants competed for selection, not on the quality of any individual test. The methodology is necessary but not sufficient. Even with the cleanest possible tests, the act of selecting on observed lift introduces the bias.
This is the structural insight that the published literature converges on and that most practitioners do not appreciate until they see it explained. The Winner’s Curse is not caused by sloppy statistics; it is caused by the operation of selecting on an outcome variable that is measured with error. The selection event would still bias the estimator even if the measurement error were perfectly characterized, the false-positive rate were perfectly controlled, and the sample sizes were optimally chosen. The selection rule “pick the maximum” inevitably favors variants whose noise realization was unusually positive, and the noise realization does not persist.
The corollary is that practitioners who think they have “solved” the Winner’s Curse by adopting better statistical methodology are mistaken. The methodology improvements are real and important --- they address the peeking problem, the SRM problem, the segment-analysis problem, and other genuine flaws --- but they do not address the selection-on-maximum problem. The selection problem requires a separate correction layer: either empirical Bayes shrinkage of the reported estimates, or experiment splitting to produce holdout validation, or post-launch monitoring to compare against test predictions, or some combination.
The Microsoft approach, as documented in Kohavi-Tang-Xu, is to apply both layers: the platform enforces methodological discipline in the test designs, and the reporting layer applies appropriate shrinkage and holdout validation to portfolio-level summaries. The two layers are complementary, and neither substitutes for the other. A team that gets the test methodology right but not the reporting layer will still produce inflated portfolio-level numbers; a team that gets the reporting layer right but not the test methodology will apply a sound correction to individual test results that were unreliable to begin with, so the corrected totals inherit the problem. Both are needed.
What Actually Works
The fixes for the Winner’s Curse fall into four classes, all of which are documented in the academic and industry literature and all of which are tractable for experimentation programs that want to address the bias.
Empirical Bayes shrinkage of test estimates. This is the Lee-Shen approach: apply a Bayesian correction to the observed lifts before they are aggregated into portfolio-level reports. The prior distribution can be estimated from the observed empirical distribution of test results across the program (hence “empirical Bayes”), and the shrinkage is applied automatically based on the noise-to-signal ratio of each individual test. The implementation is a layer on top of the existing experimentation platform, not a replacement for it. The headline numbers reported to leadership are the shrunk estimates; the raw observed lifts are still available for diagnostic purposes. This is the lowest-cost, highest-leverage intervention available to most programs.
Experiment splitting. This is the Coey-Cunningham approach: split each test’s sample into a training half (used to fit the shrinkage prior or to identify candidate winners) and a validation half (used to produce unbiased lift estimates on the selected winners). The validation half is a clean holdout that is not contaminated by the selection event, so its lift estimate is unbiased. The cost is that you spend half your sample on the holdout, which reduces statistical power; the benefit is that you get truly unbiased estimates of the selected variant’s effect rather than relying on a shrinkage correction that depends on the prior being correctly specified. For large-sample programs where the loss of half the sample is tolerable, this is the cleanest methodology available.
Pre-registered confirmation tests. Run an initial test to identify candidate winners; then run a separate, pre-registered confirmation test on the candidate winner before fully shipping. The confirmation test is run on a fresh sample, and the analysis is pre-registered to the specific hypothesis “is this variant’s true effect $\geq$ X% lift?” The confirmation test’s estimate is not contaminated by the original selection event, so it provides an unbiased measurement of the variant’s true effect. This is the experimental analog of “replication” in the academic literature. The cost is that you have to run two tests for every shipping decision, doubling the time-to-launch on winning experiments. For high-stakes decisions where the cost of overstating the impact is large, the discipline is worth it.
Post-launch hold-out validation. Ship the winning variant to most of the traffic but retain a small hold-out group that continues to see the original control. Measure the post-launch effect on the primary metric by comparing the treated group against the held-out control group. This is the cleanest possible measurement of the variant’s actual production impact, because the comparison is on the same population and the same time period, with no contamination from the original selection event. The implementation cost is non-trivial --- the platform has to support a permanent or long-running hold-out group, the analysis has to be set up to compare the treated and held-out populations, and the held-out users see a worse experience indefinitely (if the variant really is an improvement) --- but the resulting measurement is the gold standard. Microsoft and a number of other large experimentation programs run hold-out groups on their shipped wins as a discipline.
The choice among these four is partly about cost and partly about the structure of the experimentation program. Empirical Bayes shrinkage is the lowest-cost intervention and should be the default. Experiment splitting and pre-registered confirmation tests are higher-cost but cleaner. Post-launch hold-out validation is the most expensive but provides the most defensible measurement. A serious program adopts a combination of these methods, applying the cheaper ones to all tests and the more expensive ones to high-stakes shipping decisions.
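The post-launch hold-out comparison described above reduces to a two-sample difference in conversion rates. The sketch below is one minimal way to compute it, assuming independent users, a binary conversion outcome, and a normal-approximation interval; `holdout_lift` and its argument names are hypothetical, and the counts in the usage line are invented for illustration.

```python
from statistics import NormalDist


def holdout_lift(treated_conv, treated_n, holdout_conv, holdout_n, confidence=0.95):
    """Treated-minus-hold-out conversion difference with a normal-approximation CI.

    Returns (absolute difference, (ci_low, ci_high), relative lift vs hold-out).
    """
    if treated_n <= 0 or holdout_n <= 0 or holdout_conv <= 0:
        raise ValueError("group sizes and hold-out conversions must be positive")
    p_t = treated_conv / treated_n
    p_h = holdout_conv / holdout_n
    diff = p_t - p_h
    # Unpooled standard error of a difference in two independent proportions.
    se = (p_t * (1 - p_t) / treated_n + p_h * (1 - p_h) / holdout_n) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, (diff - z * se, diff + z * se), diff / p_h


if __name__ == "__main__":
    # Invented counts: 100,000 treated users and a 50,000-user hold-out.
    print(holdout_lift(5300, 100_000, 2500, 50_000))
```

The interval here is on the absolute difference; the relative lift is reported alongside it for comparison with the headline number from the test. A small hold-out group widens the interval, which is the main cost trade-off when sizing it.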
What This Means For Portfolio-Level Expectation Setting
The practical implication for CRO and growth leaders is that portfolio-level reporting needs to be calibrated for the Winner’s Curse, and the calibration needs to be communicated to executive stakeholders explicitly rather than left as an unspoken adjustment.
The headline question is: what fraction of the claimed cumulative lift from the experimentation program is actually delivered in production? Naive accounting --- summing the observed lifts of all shipped variants --- overstates the answer. The realistic post-shrinkage estimate is roughly 50-80% of the naive sum, depending on the noise-to-signal ratio of the program and the specific methodology used to apply shrinkage. A program that has been reporting “$10M annualized impact from experimentation” based on raw test lifts is realistically delivering closer to $5-8M annualized impact in production. The gap is the Winner’s Curse cashing in across the portfolio.
For executive reporting, the right format is to present both numbers --- the raw observed lift sum and the shrinkage-adjusted estimate --- with an explicit footnote explaining the methodology. The raw sum is useful as a “ceiling” estimate; the shrinkage-adjusted estimate is the realistic forecast. Showing both creates calibration over time as stakeholders see the relationship between the two metrics and develop intuition for the gap. Showing only the raw sum systematically overpromises and underdelivers; showing only the shrinkage-adjusted estimate creates apparent underperformance relative to the raw test results and produces stakeholder confusion about why “the test said +15% but the report says +9%.”
For individual shipping decisions, the right format is to discount the headline lift by the program’s typical shrinkage factor before briefing the leadership team. If the historical shrinkage on shipped variants has been 35%, the headline +15% test win should be reported as “+15% in test, approximately +9-10% expected in production.” This is harder to message because it sounds like a hedge, but it is the calibrated forecast. The alternative is presenting +15%, having production come in at +6%, and then having to explain to the leadership team that “the test was right but the production impact was a different number,” which damages the credibility of the experimentation program even though the test itself was sound.
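The two reporting formats described above reduce to simple arithmetic. The following is a minimal Python sketch, assuming shrinkage is expressed as a fraction (0.35 for 35%) and that a single program-wide factor is applied to every test. The function names and the four claimed-impact figures are hypothetical, chosen only so the portfolio sums to the $10M example.

```python
def expected_production_lift(test_lift: float, shrinkage: float) -> float:
    """Discount an observed test lift by a fractional shrinkage (0.35 = 35%)."""
    if not 0.0 <= shrinkage <= 1.0:
        raise ValueError("shrinkage must be a fraction between 0 and 1")
    return test_lift * (1.0 - shrinkage)


def portfolio_report(claimed_impacts: list[float], shrinkage: float) -> dict[str, float]:
    """Return both numbers for executive reporting: raw ceiling and adjusted forecast."""
    raw = sum(claimed_impacts)
    return {
        "raw_sum": raw,
        "adjusted_sum": expected_production_lift(raw, shrinkage),
    }


# Individual shipping decision: +15% in test, 35% historical shrinkage.
forecast = expected_production_lift(0.15, 0.35)
print(f"+15% in test, about +{forecast * 100:.1f}% expected in production")

# Portfolio: hypothetical claimed impacts (in $M) that sum to 10.
claimed = [4.0, 3.0, 2.0, 1.0]
for s in (0.2, 0.5):
    report = portfolio_report(claimed, s)
    print(f"shrinkage {s:.0%}: raw ${report['raw_sum']:.1f}M, "
          f"adjusted ${report['adjusted_sum']:.1f}M")
```

A single flat factor is the crudest version of the adjustment. A per-test adjustment that accounts for each test's noise level is more defensible where the standard errors are available.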
The harder organizational lift is changing the incentive structures around test wins. If the team that runs the experiment is evaluated on the headline lift in the test (which is selection-biased upward) rather than the realized lift in production (which is the unbiased measurement of what the variant actually delivered), the incentives encourage running noisy tests on variants with high upside and low downside, because the noise asymmetry produces inflated apparent wins. The fix is to align evaluation with the post-launch hold-out measurement, which removes the asymmetry. Programs that have made this change report a noticeable shift in the type of experiments their teams propose --- away from speculative “big swing” tests with thin theory, toward more careful tests on changes the team has stronger reasons to believe will work. The shift is healthy. The selection-on-test-win incentive was producing portfolio-level overstatement and rewarding test designs that maximized variance.
What This Means For CROs And Growth Leaders
The bottom-line operational implications for CRO, growth, and PM leaders running experimentation programs:
Build a shrinkage layer into your reporting. The lowest-cost intervention is to add an empirical Bayes shrinkage adjustment to the platform-level reporting of cumulative test wins. The shrinkage factor can be estimated from the program’s historical data using the Lee-Shen methodology. The implementation effort is on the order of a few weeks of data-science work, and the resulting numbers are more honest than the raw aggregation.
Run post-launch hold-out groups on high-stakes shipping decisions. For changes that affect more than 10% of the user base or that are projected to deliver more than $X of annualized impact (where $X is calibrated to your scale), run a permanent or long-running hold-out group that retains the original control variant. The hold-out provides the gold-standard measurement of actual production impact and serves as the accountability mechanism for the experimentation program’s portfolio-level claims.
Calibrate executive reporting for shrinkage. Present both the raw test lift and the shrinkage-adjusted estimate, with an explicit explanation of why they differ. Over time, stakeholders develop intuition for the gap, and the experimentation program’s credibility is reinforced rather than eroded as the realized impact tracks the calibrated forecast rather than the inflated headline.
Discount your historical “wins” portfolio. Apply a 30-50% shrinkage to the cumulative claimed impact from the program’s historical test wins. The resulting number is closer to what the program actually delivered. This is uncomfortable to do for the first time, because the historical accounting was probably overstated and the corrected accounting will look like underperformance relative to what was previously reported. But the corrected figure is the accurate one, and it is better to correct it once explicitly than to have stakeholders independently discover the inflation through the gap between headline tests and observed business outcomes.
Treat the Winner’s Curse as a structural feature, not a methodology flaw. When a shipped variant comes in below the headline test lift, the first hypothesis should be the Winner’s Curse, not implementation drift or instrumentation issues. The Winner’s Curse is the most common explanation for the gap, and investigating implementation issues that don’t exist is a waste of engineering time. The right framing for the post-mortem is “the test result was selection-biased, and the production impact is the calibrated estimate of the true effect” rather than “something went wrong in production.”
Align team incentives with post-launch impact, not test wins. If the team is evaluated on the number and magnitude of test wins, the incentive structure encourages designing tests that maximize variance --- thin-theory speculative changes that produce occasional large apparent wins through noise. If the team is evaluated on the realized post-launch impact (as measured by hold-out validation), the incentive structure encourages designing tests on changes that the team has substantive reasons to believe will work. The latter incentive produces a healthier program and a more honest portfolio of claimed wins.
Educate stakeholders on the structural insight. The Winner’s Curse is counterintuitive enough that it needs to be explained directly to the executive team. The framing that works is the auction analogy: when bidders are each estimating the same uncertain value, the winner tends to overpay, because the winner is the bidder whose estimate of that value was highest, and “highest” includes both the underlying truth and the upward estimation error. The same mechanism makes test winners systematically overstate their true production impact. Once stakeholders understand the structural reason, they accept the calibration adjustment without treating it as evidence of an underperforming experimentation program.
The uncomfortable summary is that most experimentation programs are reporting headline numbers that are systematically inflated by amounts that compound across the portfolio, and the inflation is not anyone’s fault --- it is a structural consequence of the selection rule that defines what “winning” means. The fix is not to run better tests; the fix is to add a shrinkage layer to the reporting. The fix is well-documented in the literature, low-cost to implement, and high-leverage for the credibility of the program. The longer a program runs without applying it, the larger the accumulated overstatement and the larger the eventual reckoning when stakeholders compare cumulative test claims to observed business outcomes.
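To make the shrinkage layer concrete, here is a minimal sketch of a textbook normal-normal empirical Bayes adjustment with method-of-moments estimates. It is not the Lee-Shen estimator and not any platform's implementation. It assumes each observed lift is the true effect plus Gaussian noise with a known standard error, and that true effects across the portfolio follow a common normal distribution. The function name `eb_shrink` and the example data are hypothetical.

```python
from statistics import fmean, variance


def eb_shrink(observed: list[float], std_errors: list[float]) -> list[float]:
    """Shrink each observed lift toward the portfolio mean (normal-normal model).

    Assumes observed_i ~ Normal(true_i, se_i^2) and true_i ~ Normal(mu, tau^2),
    with mu and tau^2 estimated from the whole portfolio by method of moments.
    """
    if len(observed) != len(std_errors) or len(observed) < 2:
        raise ValueError("need matching lists with at least two tests")
    mu = fmean(observed)
    noise_var = fmean([se ** 2 for se in std_errors])
    # Spread of observed lifts = spread of true effects + measurement noise.
    tau2 = max(0.0, variance(observed) - noise_var)
    shrunk = []
    for x, se in zip(observed, std_errors):
        # Weight on the test's own data; noisier tests are pulled harder to mu.
        weight = tau2 / (tau2 + se ** 2) if tau2 > 0 else 0.0
        shrunk.append(mu + weight * (x - mu))
    return shrunk


# Hypothetical portfolio: lifts in percentage points, winners AND losers.
lifts = [15.2, 3.1, -1.0, 0.5, 6.0, -2.4, 1.2, 8.3]
ses = [4.0] * len(lifts)
for raw, adj in zip(lifts, eb_shrink(lifts, ses)):
    print(f"raw {raw:+5.1f}  shrunk {adj:+5.1f}")
```

The portfolio passed in should include every test that was run, not only the shipped winners. Estimating the prior from winners alone would bake the selection bias into the correction.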
Sources
- Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1, 197-206.
- James, W., & Stein, C. (1961). Estimation with quadratic loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1, 361-379.
- Crook, T., Frasca, B., Kohavi, R., & Longbotham, R. (2009). Seven pitfalls to avoid when running controlled experiments on the web. KDD ’09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1105-1114. DOI: 10.1145/1557019.1557139
- Efron, B. (2011). Tweedie’s formula and selection bias. Journal of the American Statistical Association, 106(496), 1602-1614. DOI: 10.1198/jasa.2011.tm11181
- Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641-651. DOI: 10.1177/1745691614551642
- Lee, M. R., & Shen, M. (2018). Winner’s curse: Bias estimation for total effects of features in online controlled experiments. KDD ’18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 491-499. DOI: 10.1145/3219819.3219905
- Andrews, I., & Kasy, M. (2019). Identification of and correction for publication bias. American Economic Review, 109(8), 2766-2794. DOI: 10.1257/aer.20180310
- Coey, D., & Cunningham, T. (2019). Improving treatment effect estimators through experiment splitting. WWW ’19: The World Wide Web Conference, 285-295. DOI: 10.1145/3308558.3313452
- Azevedo, E. M., Deng, A., Olea, J. L. M., Rao, J., & Weyl, E. G. (2020). A/B testing with fat tails. Journal of Political Economy, 128(12), 4614-4672. DOI: 10.1086/710607
- Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. ISBN: 978-1108724265.
Related
Browse the full Replication Crisis Hub for related material on evidence evaluation and statistical inference:
- The Peeking Problem in A/B Testing --- a separate, also-pervasive A/B testing failure mode that compounds with the Winner’s Curse
- Sample Ratio Mismatch in A/B Testing --- the platform-level integrity check that detects when randomization itself has failed
- Daryl Bem’s Precognition Studies --- the canonical illustration of how selection-on-significance produces impossible findings
- The Decline Effect --- the broader pattern in which initial significant findings systematically shrink under replication
- Hindsight Bias --- the cognitive driver of why test winners feel more inevitable in retrospect than they were in prospect
- Confirmation Bias --- why teams over-interpret evidence that supports their preferred ship decisions
FAQ
How do I estimate the shrinkage factor for my own experimentation program?
The most defensible approach is the experiment-splitting methodology from Coey and Cunningham 2019. Take a sample of your historical tests, randomly split each test’s sample into halves, use the first half to identify which variants would have been selected as winners under your shipping criteria, and compute the lift estimates on the held-out second half for those selected variants. The ratio of the held-out lift to the selection-half lift is the fraction of the headline lift that survives; one minus that ratio is the empirical shrinkage for your portfolio. A ratio of 0.65, for example, corresponds to 35% shrinkage. Repeat across many random splits to get a stable estimate. For a quick rule-of-thumb, the published industry analyses suggest 30-50% shrinkage as a baseline, but your specific program’s factor depends on your noise-to-signal ratio and could be higher or lower.
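The splitting procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Coey-Cunningham estimator itself: each test is a pair of per-user outcome lists, the shipping criterion is reduced to a non-negative lift threshold on the selection half, and the data is simulated. The names `split_lifts` and `estimate_shrinkage` and all parameter values are hypothetical.

```python
import random
from statistics import fmean


def split_lifts(control, treatment, rng):
    """Randomly halve both arms; return (selection-half lift, held-out lift)."""
    c, t = control[:], treatment[:]
    rng.shuffle(c)
    rng.shuffle(t)
    hc, ht = len(c) // 2, len(t) // 2
    return fmean(t[:ht]) - fmean(c[:hc]), fmean(t[ht:]) - fmean(c[hc:])


def estimate_shrinkage(tests, threshold, n_splits=20, seed=0):
    """1 - (held-out lift / selection-half lift), pooled over selected variants."""
    if threshold < 0:
        raise ValueError("this sketch assumes a non-negative threshold")
    rng = random.Random(seed)
    selected_sum = held_out_sum = 0.0
    for _ in range(n_splits):
        for control, treatment in tests:
            selection_lift, held_out_lift = split_lifts(control, treatment, rng)
            if selection_lift > threshold:  # stand-in for real shipping criteria
                selected_sum += selection_lift
                held_out_sum += held_out_lift
    if selected_sum <= 0.0:
        raise ValueError("no variant passed the selection threshold")
    return 1.0 - held_out_sum / selected_sum


# Simulated portfolio (hypothetical): 60 tests, 200 users per arm.
gen = random.Random(1)
tests = []
for _ in range(60):
    true_effect = gen.gauss(0.0, 0.2)
    control = [gen.gauss(0.0, 1.0) for _ in range(200)]
    treatment = [gen.gauss(true_effect, 1.0) for _ in range(200)]
    tests.append((control, treatment))

shrinkage = estimate_shrinkage(tests, threshold=0.1)
print(f"estimated shrinkage: {shrinkage:.0%}")
```

One caveat with this sketch: each half has roughly twice the variance of the full test, so selection on a half is noisier than selection on the full sample. The number it produces likely overstates the shrinkage that applies to full-sample headline lifts.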
Does Bayesian A/B testing eliminate the Winner’s Curse?
No. Bayesian A/B testing produces a posterior distribution over the true effect for each individual variant, which is a coherent quantity at any sample size and is not affected by the peeking problem. But if you then select the variant with the highest posterior mean lift to ship, you still have the Winner’s Curse, because the selection rule is operating on a noisy quantity. The Bayesian framework actually provides the natural language for thinking about the correction --- the posterior mean conditional on the selection event is below the unconditional posterior mean --- but the Bayesian methodology by itself doesn’t apply that conditional correction. You still need an explicit shrinkage or holdout layer.
What about high-power tests with very large samples? Don’t they have a smaller Winner’s Curse?
Higher-power tests have less measurement noise per test, so the noise-to-signal ratio is lower and less shrinkage is needed. But the Winner’s Curse does not vanish; it just gets smaller. Even very large tests selected from a portfolio show some upward bias because the selection rule still favors variants with positive noise realizations. For tests with extremely large samples where the noise is much smaller than the typical true effect, the shrinkage may be small enough (less than 10%) to ignore in practice. For typical industry tests with moderate samples, the shrinkage is large enough (20-50%) to matter operationally.
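A small Monte Carlo sketch illustrates how the overstatement falls as noise falls but stays above zero. It rests on assumptions that are not from the source: true effects are drawn from a zero-mean normal distribution, ten variants compete per round, and the one with the highest observed lift is shipped. The function name and parameter values are hypothetical. Because the standard error scales with one over the square root of the sample size, halving it takes roughly four times the sample.

```python
import random


def relative_overstatement(se, n_variants=10, effect_sd=1.0, n_sims=5000, seed=0):
    """Share of the winner's observed lift that is noise, pooled over simulations."""
    rng = random.Random(seed)
    observed_total = noise_total = 0.0
    for _ in range(n_sims):
        true = [rng.gauss(0.0, effect_sd) for _ in range(n_variants)]
        observed = [t + rng.gauss(0.0, se) for t in true]
        # Ship the variant with the highest observed lift.
        winner = max(range(n_variants), key=observed.__getitem__)
        observed_total += observed[winner]
        noise_total += observed[winner] - true[winner]
    return noise_total / observed_total


# Standard errors relative to the spread of true effects (effect_sd = 1.0).
results = {se: relative_overstatement(se) for se in (1.0, 0.5, 0.1)}
for se, share in results.items():
    print(f"standard error {se:.1f}: overstatement {share:.1%}")
```

Under this particular model the expected share is se² / (effect_sd² + se²), which works out to about 50%, 20%, and 1% for the three settings. Real portfolios with heavier-tailed effect distributions will behave differently.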
Should I run confirmation tests on every winning variant?
Confirmation tests are expensive --- they double the time-to-launch on winning experiments --- so the right answer is to apply them selectively. For high-stakes changes (large traffic, large projected impact, irreversible decisions), confirmation tests are worth the cost. For low-stakes changes that can be easily rolled back if production performance disappoints, the empirical Bayes shrinkage in the reporting layer is sufficient. The cost-benefit depends on the asymmetry between the cost of overstating the impact and the cost of running two tests; programs with high overstatement cost (e.g., publicly reported earnings impacts) should run more confirmation tests than programs with low overstatement cost.
What if my organization is using a commercial A/B testing platform? Do they handle this?
The major commercial platforms have varying degrees of Winner’s Curse correction built into their reporting layers. Some platforms apply empirical Bayes shrinkage by default; others present raw observed lifts and leave the correction to the customer. You should ask your platform vendor explicitly: “What shrinkage methodology, if any, is applied to the reported lift estimates for shipped variants? Are the platform-level cumulative impact reports adjusted for selection bias?” If the answers are “none” and “no,” you need to layer your own shrinkage adjustment on top of the platform’s reporting before briefing executives. The commercial platforms have made meaningful progress on the peeking problem; the Winner’s Curse has been a slower migration and is still inconsistently handled across vendors.
How does the Winner’s Curse interact with the peeking problem?
The two problems are distinct, and their effects compound. The peeking problem inflates the false-positive rate on individual tests: variants that aren’t really winners get labeled as winners because of intermediate looks at the data. The Winner’s Curse inflates the magnitude of the apparent lift on actually-winning variants. A program with both problems has individual tests that overstate their probability of being true winners and aggregated portfolio reports that overstate the magnitude of the true winners’ impact. The two problems are addressed by separate methodological fixes (always-valid inference for peeking, empirical Bayes shrinkage for the Winner’s Curse), and a serious program needs both.
Is the 44% MSE reduction from Coey-Cunningham generalizable to my program?
The 44% number is specific to Facebook News Feed A/B tests in the 2019 sample, and the exact magnitude depends on the noise-to-signal ratio of that specific portfolio. For your program, the actual MSE improvement from applying shrinkage will depend on your portfolio’s specific characteristics --- mostly small effects with occasional large ones gives different shrinkage than mostly moderate effects, and noisy primary metrics give different shrinkage than clean ones. The order of magnitude is plausible for most industry A/B testing programs, but the exact number requires running the experiment-splitting analysis on your own historical data. The methodology generalizes; the specific 44% does not.
Should I report shrinkage-adjusted lifts to engineering teams who built the variant?
Yes, and explain the methodology. The engineering team that built the variant should understand both the raw observed lift (which is what their work delivered in the test environment) and the shrinkage-adjusted estimate (which is the realistic forecast for production). The framing should not be “your work delivered less than the headline number” --- the work delivered exactly what the test measured. The framing should be “the headline number is selection-biased upward, and the shrinkage-adjusted estimate is the calibrated forecast for what the same work will deliver in production.” Engineers generally appreciate the statistical clarity. The framing problem is harder with executive stakeholders who do not have the methodological background; the framing problem with engineers is largely a non-issue.