The Winner’s Curse: Why Big A/B Test Wins Rarely Hold Up

Atticus Li

The lift in your test report and the lift finance sees a quarter later rarely match. Winner's curse, novelty decay, and regression to the mean all shrink it.

By Atticus Li, July 3, 2026

The test report says the winning variant lifted conversion by a clean, quotable number. That number goes in the deck, gets multiplied out to an annual revenue figure, and enters the forecast. A quarter later, the realized impact is reliably smaller — sometimes much smaller — and almost nobody goes back to find out why, because the test already "won."

TL;DR

The tested lift systematically overstates the lift you'll actually realize. Three forces (winner's curse, novelty decay, and regression to the mean) all push in the same direction: the number shrinks after launch.
This isn't measurement error; it's structural. You selected the variant because its estimate looked good, and selecting on a noisy estimate means the true effect is, on average, smaller than the measured one.
The gap is invisible because nobody re-measures. The test report becomes the permanent record, the projected annual figure gets booked, and the realized number is never checked against the projection.
The fix is a post-launch re-measure and a haircut on projections: confirm the lift held, and discount the tested number before you forecast on it.

Force	What it is	Effect on the tested lift
Winner's curse	You selected the variant because its estimate was high	Overstates
Novelty effect	Users react to "new," then habituate	Overstates (fades)
Regression to the mean	Extreme measured results drift toward the true average	Overstates
Net effect on your forecast	Tested lift > realized lift	Realized lift comes in smaller

Every one of these pushes the same way. That's why the realized number is almost never higher than the tested one — the deck is systematically optimistic.

Winner's curse: you selected on noise

Start with the force nobody accounts for. You ran several variants, or several tests, and you shipped the one with the best measured result. That selection step is the problem. A measured lift is the true lift plus noise, and by choosing the highest measured result, you've preferentially selected the variants where the noise happened to be positive. So the winner's measured effect is, on average, an overstatement of its true effect. This is the "winner's curse," a documented, recurring problem in large-scale experimentation: the selected winner's reported effect is biased upward precisely because it was selected for being high (Kohavi et al., *Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained*).
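The selection effect is easy to reproduce in a simulation. The sketch below is a minimal illustration, not a model of any real program: it assumes every variant has the same true lift and independent Gaussian noise, and the values of `true_lift`, `noise_sd`, and `n_variants` are made up for the example.

```python
import random
import statistics

def simulate_winner_bias(n_variants=5, true_lift=0.02, noise_sd=0.02,
                         n_trials=20_000, seed=0):
    """Return (average measured lift of the shipped winner, true lift).

    Assumption: all variants share one true lift, so any gap between the
    two returned numbers comes purely from selecting on noise.
    """
    rng = random.Random(seed)
    winners = []
    for _ in range(n_trials):
        # Measured lift = true lift + noise, once per variant.
        measured = [true_lift + rng.gauss(0.0, noise_sd) for _ in range(n_variants)]
        # Ship the best-looking variant and record what its report said.
        winners.append(max(measured))
    return statistics.fmean(winners), true_lift

avg_winner, truth = simulate_winner_bias()
print(f"true lift: {truth:.3f}, average measured lift of the winner: {avg_winner:.3f}")
```

With `n_variants=1` there is nothing to select on and the average measured lift sits at the true lift. The gap opens as soon as you pick the best of several.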

The effect is strongest exactly where teams are most excited: underpowered tests and multi-variant tests. The less power a test had, the noisier its estimate, and the more of the "winning" lift is likely to be noise you selected for. A borderline-significant winner from a small test is the most likely to disappoint at launch — and the most likely to be celebrated hardest, because a big number from a small test feels like a discovery.
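To see how power changes the size of the overstatement, the following sketch compares two hypothetical tests with the same true lift but different standard errors, keeping only the runs that clear a significance bar. The one-sided `z > 1.96` filter is an assumption standing in for however your program declares a winner, and all numbers are illustrative.

```python
import random
import statistics

def exaggeration_ratio(true_lift, standard_error, n_trials=50_000,
                       z_crit=1.96, seed=1):
    """Average measured lift among 'significant winners', divided by the true lift.

    A ratio of 1.0 means significant results are unbiased; higher means
    the winners that get reported overstate the real effect.
    """
    rng = random.Random(seed)
    significant = []
    for _ in range(n_trials):
        estimate = rng.gauss(true_lift, standard_error)
        # Keep only runs that would have been declared a win.
        if estimate / standard_error > z_crit:
            significant.append(estimate)
    return statistics.fmean(significant) / true_lift

# Same true lift; only the precision of the test differs.
well_powered = exaggeration_ratio(true_lift=0.02, standard_error=0.005)
underpowered = exaggeration_ratio(true_lift=0.02, standard_error=0.02)
print(f"well powered: {well_powered:.2f}x, underpowered: {underpowered:.2f}x")
```

Under these assumptions the well-powered test reports close to the truth when it wins, while the underpowered test can only win by drawing a large positive error, so its reported wins are inflated by construction.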

Novelty: the reaction to "new" that isn't the reaction to the change

The second force is behavioral. When users encounter a changed experience, some of their initial response is to its newness, not its substance. A novel element draws attention and engagement simply for being unfamiliar, and that bump fades as users habituate (Novelty and Primacy: A Long-Term Estimator for Online Experiments). Run the test for two weeks and you capture the novelty-inflated response. Ship it, wait a quarter, and the novelty is gone — leaving only the substantive effect, which is smaller.
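A toy decay model makes the window problem concrete. This is a sketch under strong assumptions: the lift is a durable component plus a novelty bump that halves every `half_life_days`, everyone is first exposed on day 0, and the parameter values are invented for illustration rather than taken from any measurement.

```python
def lift_on_day(day, durable_lift, novelty_lift, half_life_days):
    """Durable lift plus a novelty bump that halves every `half_life_days`."""
    return durable_lift + novelty_lift * 0.5 ** (day / half_life_days)

def average_lift(first_day, last_day, **params):
    """Mean daily lift over an inclusive range of days."""
    days = range(first_day, last_day + 1)
    return sum(lift_on_day(d, **params) for d in days) / len(days)

# Hypothetical parameters: 2% durable effect, 3% initial novelty bump.
params = dict(durable_lift=0.02, novelty_lift=0.03, half_life_days=10)

test_window = average_lift(0, 13, **params)     # what a two-week test sees
next_quarter = average_lift(14, 103, **params)  # the 90 days after the test
print(f"two-week test: {test_window:.3f}, following quarter: {next_quarter:.3f}")
```

In this setup the two-week average is dominated by the bump, and the following quarter settles close to `durable_lift`. A real rollout exposes users at different times, so the decay would be smeared rather than this clean.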

Novelty has a mirror twin, the primacy effect, that runs the other way: sometimes users initially resist a change and warm to it over time, so the tested effect understates the long-run one. But novelty is the more common direction for the kinds of attention-grabbing changes teams love to test, which is why the net bias across a portfolio skews toward overstatement. The point isn't that short tests are always optimistic — it's that a two-week window measures a transient response, and the long-run number can differ in either direction, so you can't assume the tested lift is the durable one.

A two-week test measures how users respond to a change that's new. A quarter later, the change isn't new anymore — and neither is the response.

Regression to the mean: extreme results drift back

The third force is one of the oldest ideas in statistics. Francis Galton noticed in the 1880s that the sons of very tall fathers tended to be tall, but less tall than their fathers: extreme measurements are followed by less-extreme ones, because part of any extreme observation is transient luck that doesn't repeat. Kahneman and Tversky documented how badly humans intuit this: we attribute a spectacular result to a durable cause and are then surprised when it "declines," when the decline was the predictable regression of an extreme reading toward the true value.

An unusually large tested lift is an extreme measurement. Some of its size is durable signal; some is the transient luck of that particular sample and window. The durable part persists; the luck doesn't. So the realized lift regresses toward the true, more modest effect — not because anything broke, but because extreme readings are, by their nature, partly transient. The bigger and more exciting the tested number, the more regression you should expect, and the harder the eventual "decline" will feel to a team that booked the full figure.
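One common way to put a number on expected regression is a normal-normal shrinkage estimate, which pulls a measured lift toward the typical lift in proportion to how noisy the measurement was. The sketch below is one possible approach, not a prescribed method: `prior_mean` and `prior_sd` are assumptions describing how true lifts are distributed across your past tests, and the defaults are placeholders rather than recommendations.

```python
def shrink_toward_prior(measured_lift, standard_error,
                        prior_mean=0.0, prior_sd=0.01):
    """Posterior mean of the true lift under a normal prior and normal noise.

    weight is the share of the measured deviation treated as durable signal;
    the remaining (1 - weight) is treated as luck that will not repeat.
    """
    signal_var = prior_sd ** 2
    noise_var = standard_error ** 2
    weight = signal_var / (signal_var + noise_var)
    return prior_mean + weight * (measured_lift - prior_mean)

# The same spectacular 8% reading from a noisy test and from a precise one.
noisy = shrink_toward_prior(measured_lift=0.08, standard_error=0.02)
precise = shrink_toward_prior(measured_lift=0.08, standard_error=0.005)
print(f"noisy test: {noisy:.3f}, precise test: {precise:.3f}")
```

The same headline number is trusted far less when it comes from the noisier test, which matches the intuition that the most exciting small-sample results deserve the largest discount.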

Why the gap stays invisible: nobody re-measures

All three forces are well understood. The reason they keep surprising teams is organizational: the test report is treated as the final word, so the realized number is never compared to it. The test wins, the projected annual lift is computed by multiplying the two-week effect out across a year, that figure enters the forecast, and the team moves to the next test. A quarter later, when the realized revenue comes in lighter, it's attributed to market conditions, seasonality, or "execution" — anything but the fact that the original number was a selected, novelty-inflated, un-regressed estimate that was never going to fully materialize.

When I re-measure winners after launch — a step I've made standard — the pattern is consistent across contexts: the realized lift is a meaningful fraction below the tested lift more often than not, and the gap is widest for the tests that looked most spectacular. Naming that pattern is uncomfortable in a program that celebrates wins — but it's the difference between forecasts you can defend to finance and forecasts that quietly miss every quarter. It's the same reason holdout tests are the honest way to prove incremental revenue: they measure what actually happened at launch, not what a prior short test predicted would happen.
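A post-launch read against a holdback can be as simple as a ratio of conversion rates with an interval around it. The following is a minimal sketch assuming a randomly assigned holdback and binary conversions; the function name, the counts, and the 12% `tested_lift` are all hypothetical, and the interval uses the standard large-sample approximation for a log rate ratio.

```python
import math

def realized_lift(treated_conv, treated_n, holdback_conv, holdback_n, z=1.96):
    """Relative lift of treated over holdback, with an approximate 95% interval.

    Returns (lift, low, high) as fractions, e.g. 0.05 for +5%.
    """
    p_t = treated_conv / treated_n
    p_h = holdback_conv / holdback_n
    log_ratio = math.log(p_t / p_h)
    # Large-sample standard error of the log of a ratio of two proportions.
    se = math.sqrt((1 - p_t) / treated_conv + (1 - p_h) / holdback_conv)
    low = math.exp(log_ratio - z * se) - 1
    high = math.exp(log_ratio + z * se) - 1
    return p_t / p_h - 1, low, high

tested_lift = 0.12  # hypothetical number from the original test report
lift, low, high = realized_lift(treated_conv=52_250, treated_n=950_000,
                                holdback_conv=2_600, holdback_n=50_000)
lift_held = low <= tested_lift <= high
print(f"realized: {lift:.1%} ({low:.1%} to {high:.1%}); tested lift plausible: {lift_held}")
```

In this made-up example the variant is still a real winner, but the interval sits below the tested figure, which is the signal to replace the projected number. A smaller holdback widens the interval, so size it for the decision you need to make.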

The fix: re-measure, then haircut your projections

Two practices close the gap:

Re-measure winners after launch. Keep a holdback on your most consequential ships, or run a clean before/after read a month or two in, to confirm the lift held at full traffic over real time. This catches novelty decay and lets you replace the projected number with a realized one before it's too deep into the forecast.
Apply a projection haircut. Before multiplying a two-week lift into an annual figure, discount it for winner's curse and expected regression — more aggressively for smaller, noisier, more spectacular tests. A projection that assumes the full tested lift will persist forever is the most optimistic assumption in the building. A disciplined program books a fraction of the tested lift and treats the re-measure as the true number.
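The arithmetic of a haircut is trivial, which is part of why it is worth writing down explicitly. The sketch below assumes a flat multiplicative discount; the revenue base, the 8% lift, and the 50% haircut are placeholders chosen for illustration, not recommended values.

```python
def annual_projection(baseline_annual_revenue, tested_lift, haircut=0.0):
    """Projected incremental annual revenue after discounting the tested lift.

    haircut is the fraction of the tested lift you do NOT expect to realize
    (0.0 books the full tested lift, 1.0 books nothing).
    """
    if not 0.0 <= haircut <= 1.0:
        raise ValueError("haircut must be between 0 and 1")
    return baseline_annual_revenue * tested_lift * (1.0 - haircut)

# Hypothetical inputs: $10M baseline and an 8% tested lift.
naive = annual_projection(10_000_000, 0.08)
discounted = annual_projection(10_000_000, 0.08, haircut=0.5)
print(f"undiscounted: {naive:,.0f}, with haircut: {discounted:,.0f}")
```

Once a post-launch re-measure exists, pass the realized lift with `haircut=0.0` instead, so the forecast rests on the measured number rather than a discounted guess.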

This connects directly to why a 15% win-rate is normal, not low: once you account for lift decay, the honest realized value of a program is lower than the sum of its tested lifts, and a program that forecasts on undiscounted test numbers will chronically overpromise. And it's why claimed tested lift is a vanity metric until it's confirmed in production — the number that belongs in the forecast is the realized one, measured after the novelty faded and the noise regressed out. Across a large portfolio, the programs that keep finance's trust are the ones whose realized numbers match their projections, and that discipline is what running tests at scale actually teaches.

FAQ

Does lift decay mean the test was wrong?

No — the test was probably right that the variant is better; it was just optimistic about how much better. The direction usually holds; the magnitude shrinks. That distinction matters: you should still ship a real winner, you just shouldn't forecast on the full tested lift. Treat the test as reliable evidence of direction and a biased-high estimate of magnitude.

How much should I discount a tested lift?

There's no universal number — it depends on the test's power, how many variants you selected from, and how novelty-prone the change is. The principle is to discount more for smaller, noisier, multi-variant, attention-grabbing tests and less for large, well-powered, single-variant tests of substantive changes. The most reliable approach isn't a fixed haircut at all; it's a post-launch re-measure that replaces the estimate with a realized number.

Isn't a novelty effect sometimes real, durable value?

Occasionally the "new" experience genuinely resets user behavior in a lasting way — but you can't tell that from the tested window, because during the test the change is new for everyone. The only way to distinguish durable value from a novelty bump is to re-measure after users have habituated. If the lift holds a quarter later, it was real; if it fades, it was novelty. Assuming durability without the re-measure is how the forecast breaks.

Why don't more teams re-measure?

Because the test already "won," and re-measuring a win has an unappealing risk profile: the best case confirms what you already claimed, and the worst case forces you to revise a number you already booked and celebrated. That asymmetry is exactly why it gets skipped — and exactly why the teams that do it have forecasts that hold. The discomfort of re-measuring is the price of not chronically overpromising.

Bottom line

The lift in your test report is a biased-high estimate of the lift you'll realize, and three forces guarantee it: winner's curse (you selected on noise), novelty decay (users react to "new," then habituate), and regression to the mean (extreme readings drift back toward the truth). All three push the same way, so the realized number is reliably smaller than the tested one — and it stays invisible because the test report becomes the permanent record and nobody re-measures. The fix is unglamorous: keep a holdback to re-measure your consequential winners after the novelty fades, and haircut your projections before you multiply a two-week lift into an annual forecast. The direction of a good test holds; the magnitude doesn't. Forecast on the realized number, not the celebrated one.

Building re-measurement and honest projection into a repeatable process is part of why I made GrowthLayer. For more on the gap between what tests promise and what programs deliver, subscribe to Lean Experiments.

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified in behavioral economics. Led 100+ in-house experiments at NRG in 2025, with project evidence and limits documented in the case studies.
