Most failed experiments I review at the enterprise level did not fail in the test window. They failed at the planning stage, three weeks earlier, when an analyst pulled a single SQL query and confused an eventual recovery rate with a baseline conversion rate.

It is the most common analysis mistake I see, and it is almost invisible because the math looks clean, the dashboard renders, and the number sounds about right. The test gets greenlit. Two weeks later the team is staring at a flat result and wondering why a change they were sure would move the needle barely registered.

The problem is almost never the creative. The problem is that the baseline was never a baseline in the first place.

"Most new analysts make this mistake: they use what is easy to measure instead of what is correct to model." — Atticus Li

The Setup That Traps Everyone

Here is the pattern I see in almost every CRO team I have audited. The analyst is asked to plan a test on a multi-step flow — a checkout, an application, a signup. They open the warehouse, pull 90 days of data, and write something like this:

  • Users who started the flow but did not finish: 4,000
  • Users who eventually completed the flow: 2,400
  • Calculated baseline: 2,400 / 4,000 = 60%

Then they write "baseline conversion rate = 60%" in the test plan doc and move on.

On paper this is correct arithmetic. In practice, that 60% number is not a baseline conversion rate. It is a 90-day eventual recovery rate, and the difference is the reason the experiment will miss.

What That 60% Actually Measures

The 60% figure answers a different question than the analyst thinks it does. It answers: "Of the users who started the flow in the last 90 days and did not finish in-session, what fraction eventually crossed the finish line at any point during the window?"

That is a historical recovery number. It includes users who:

  • Came back three hours later on a different device
  • Returned four weeks later after a retargeting email
  • Bookmarked the page and finished on their lunch break
  • Received a discount code from a separate lifecycle campaign
  • Completed for reasons that have nothing to do with the page you are about to test

None of those users will be affected by the button color, the headline, the form field count, or whatever else is in the variant. Your test touches a single moment. The 60% number rolls up every moment across three months.

When you use that number as the expected conversion rate of the control, you are implicitly assuming the variant will also get credit for every retargeting email, every lifecycle campaign, and every long-tail return visit. The variant will not. The assumption is the trap.

The Mental Model That Fixes It

The clean way to think about this is to break the observed behavior into the actual steps it contains:

Abandon → Return → Restart → Convert

Your warehouse data, in most setups, shows you only the ends of that chain — abandon and convert. The middle two steps, the return rate and the conversion-after-return rate, are compressed into a single aggregate number. That aggregate number is the 60%. And it is not what your experiment influences.
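To see why the aggregate hides the middle steps, here is a minimal sketch in Python. The numbers are illustrative, not from the dataset above; the point is that two very different funnels can produce the same 60% eventual recovery rate:

```python
# Eventual recovery compresses two distinct steps into one aggregate:
# P(eventually convert | abandoned) = P(return) * P(convert | return)

def eventual_recovery(return_rate: float, conv_after_return: float) -> float:
    """Aggregate recovery rate from the two hidden middle steps."""
    return return_rate * conv_after_return

# Funnel A: most abandoners come back, but many stall again on the page itself.
# This funnel has room for an on-page test to move the needle.
funnel_a = eventual_recovery(return_rate=0.90, conv_after_return=0.667)

# Funnel B: few abandoners return, but nearly all who do convert, e.g. pulled
# back by a lifecycle email with a discount code. Outside an on-page test's reach.
funnel_b = eventual_recovery(return_rate=0.65, conv_after_return=0.923)

print(round(funnel_a, 2), round(funnel_b, 2))  # both land at ~0.60
```

Same warehouse query, same 60%, completely different experiment programs. The aggregate alone cannot tell you which funnel you have.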

Your experiment influences something smaller. Usually it influences one of these:

  • The probability of finishing in-session (no return required)
  • The probability of finishing after returning via your specific surface
  • The probability of finishing after seeing a specific variant of a specific screen

The correct baseline is the conversion rate of the slice your test can actually reach — not the blended three-month recovery rate of everyone who ever touched the flow.

A Worked Example With Real Numbers

Let me walk through this end-to-end with the same numbers from above, because the fix is more concrete than the abstract version suggests.

The raw data (90 days, one checkout flow):

Metric                                                | Count
Total users in the system                             | 1,000,000
Users who entered the checkout                        | 6,400
Users who abandoned the checkout                      | 4,000
Users who eventually completed (any channel, any day) | 2,400
Weeks in the window                                   | 13

The naive analysis:

baseline = 2,400 / 4,000 = 60%
traffic  = 1,000,000 users in the system

Plugged into a power calculator, that combination tells the analyst they can detect a 3% relative lift in a week. Everyone is excited. The test ships.

What is actually wrong:

First, the traffic number is completely off. The test cannot touch 1,000,000 users. It can only touch users who are in the checkout flow, because that is where the variant lives. So the correct traffic denominator is:

eligible traffic = 4,000 abandoners / 13 weeks ≈ 308 users/week

That is orders of magnitude smaller than what the power calculator was fed. In a world of 308 users per week, a 3% relative lift is not detectable in a week. It is not detectable in a quarter.

Second, the baseline is too high. The 60% includes lifecycle and retargeting conversions that the test cannot influence. A realistic working range looks more like this:

Bound    | Estimate | Meaning
Upper    | 60%      | Everyone who eventually converts, from any source, over 90 days
Midpoint | 35–45%   | Users who return and convert through surfaces the test can reach
Lower    | 25%      | Users who convert in-session or immediately after the test variant

The correct inputs for the test plan are the midpoint and the eligible weekly traffic:

baseline ≈ 40%
traffic  ≈ 300 users/week
expected lift ≈ 5–10% relative (realistic UX improvement)

Rerun the power calc with those three numbers and the picture changes completely. The detectable effect size is much larger. The runtime is much longer. And crucially, the analyst now knows that before shipping — not after.
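As a sanity check on that claim, here is a sketch of the rerun using the standard two-proportion sample-size formula, stdlib only. The 7.5% relative lift is an assumed midpoint of the 5–10% range above, not a number from the data:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_arm(p1: float, p2: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-proportion z-test (normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_b = NormalDist().inv_cdf(power)           # desired power
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

baseline = 0.40            # midpoint testable baseline
variant = baseline * 1.075 # assumed 7.5% relative lift, middle of the 5-10% range

n = sample_size_per_arm(baseline, variant)
weekly_per_arm = 300 / 2   # ~300 eligible users/week, split across two arms
weeks = ceil(n / weekly_per_arm)

print(f"{n} users per arm, ~{weeks} weeks of runtime")
```

With honest inputs, the runtime lands in the range of a couple of quarters, not a week. That is the conversation the team needs to have before the test ships, not after.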

The Two Errors Are Different, and Both Matter

Notice that there were two separate mistakes in the naive analysis, and they compound:

  1. Wrong numerator for the baseline. Conflating eventual recovery with testable conversion. This inflates the baseline and makes the variant look like it has less room to move than it does.
  2. Wrong denominator for the traffic. Using all users in the system instead of users eligible for the test. This inflates the traffic estimate by orders of magnitude and makes the runtime look tiny.

Either one on its own will mis-size a test. Together they produce a test plan that is almost guaranteed to ship, almost guaranteed to run underpowered, and almost guaranteed to produce an "inconclusive" result the team will then have to argue about.

I have seen experimentation roadmaps stall for an entire quarter because of this exact pairing. The fix, every time, is in the planning doc — not in the variant.

The Framework I Use for Baseline Estimation

When I sit down to plan an experiment and the data is imperfect (which is almost always), I walk through four steps in order.

Step 1: Anchor on what you can measure

Start with the number you actually have. In the example, that is 60%. Do not throw it away — it is useful information, it is just not the answer. Treat it as the upper bound. The true testable conversion rate is almost certainly lower than this, because eventual recovery includes sources your test cannot influence.

Step 2: Discount for causal reach

Ask what fraction of the 60% is actually within the influence of the test. Retargeting emails go away. Cross-device returns go away. Lifecycle discount codes go away. Anything that would have converted the user regardless of your variant goes away.

This is not a precise calculation. It is a judgment call informed by what you know about the flow. In most enterprise checkouts I have worked on, the testable fraction is somewhere between 40% and 75% of the eventual recovery number. So 60% eventual recovery becomes a 25–45% testable baseline, depending on how strong the lifecycle engine is.

Step 3: Build a range, not a point estimate

Write down a lower bound, a midpoint, and an upper bound. Use the midpoint for planning. Use the lower bound to sanity-check the runtime. Use the upper bound to sanity-check the detectable effect.

lower    = 25%   → conservative, use for minimum runtime
midpoint = 40%   → use for actual test sizing
upper    = 60%   → use only to verify sanity

The range is the honest representation of what you know. A single number is a story you are telling yourself.
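The discount-then-range procedure of Steps 1 through 3 fits in a few lines. The 0.40–0.75 discount band is the rule of thumb from Step 2, and the final planning numbers are a judgment call inside that band, not a mechanical output:

```python
# Step 1: anchor on the measured number. It is the upper bound, not the baseline.
eventual = 0.60

# Step 2: discount for causal reach. In mature enterprise checkouts the
# testable fraction is roughly 40-75% of eventual recovery (a judgment call).
testable_band = (eventual * 0.40, eventual * 0.75)   # roughly (0.24, 0.45)

# Step 3: round to working numbers and pick a planning midpoint inside the band.
lower, midpoint, upper = 0.25, 0.40, eventual

print(f"band={testable_band}, plan on midpoint={midpoint:.0%}")
```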

Step 4: Sanity-check the output

Before you ship, take the power calculator's output and ask two questions:

  1. Is that traffic actually going to show up? (Check seasonality, marketing spend, release calendar.)
  2. Is that lift actually realistic for the change I am making? (A button color does not produce 20% relative lift. A pricing change might.)

If either answer is "probably not," go back and fix the inputs.

The same framework applies beyond test planning: any decision made with incomplete data should be worked in ranges, not point estimates.

The Traffic Denominator Question

The denominator mistake deserves its own section because it is the one I see most often at mid-sized teams that have just gotten access to a full data warehouse.

The impulse is to use the biggest denominator available, because bigger numbers feel more statistically powerful. This is exactly backwards. The correct denominator is the smallest population your test can actually reach, because every user outside that population is noise.

Ask yourself a single question: "Can this user experience the variant?"

  • If they never hit the page the test lives on → no, exclude them
  • If they hit the page but are in a segment the test is not targeting → no, exclude them
  • If they hit the page on a device or browser the variant does not render on → no, exclude them
  • If they are in the holdout or a mutually exclusive test → no, exclude them

What is left is your real traffic. That is the number that goes into the power calculator. It is almost always much smaller than the analyst's first instinct, and it is almost always the number that determines whether the test is plannable at all.
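The eligibility question can be written as a literal filter. A sketch with hypothetical user records; the field names are illustrative, not a real schema:

```python
# Hypothetical user records. Each fails (or passes) one of the four exclusions.
users = [
    {"id": 1, "hit_test_page": True,  "segment": "returning", "renders": True,  "in_holdout": False},
    {"id": 2, "hit_test_page": False, "segment": "returning", "renders": True,  "in_holdout": False},
    {"id": 3, "hit_test_page": True,  "segment": "new",       "renders": True,  "in_holdout": False},
    {"id": 4, "hit_test_page": True,  "segment": "returning", "renders": False, "in_holdout": False},
    {"id": 5, "hit_test_page": True,  "segment": "returning", "renders": True,  "in_holdout": True},
]

def can_experience_variant(u: dict, target_segment: str = "returning") -> bool:
    """The single question: can this user experience the variant?"""
    return (u["hit_test_page"]                 # reaches the page the test lives on
            and u["segment"] == target_segment # in the segment the test targets
            and u["renders"]                   # variant renders on their device/browser
            and not u["in_holdout"])           # not in a holdout or exclusive test

eligible = [u for u in users if can_experience_variant(u)]
print(len(eligible))  # only user 1 clears all four exclusions
```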

What a 60% Eventual Recovery Rate Actually Tells You

Here is a more useful way to read that 60% number — not as a baseline, but as a diagnostic about the shape of your funnel.

If 60% of abandoners eventually come back and convert, the problem in your flow is usually not demand. The users want the product. They are willing to complete the purchase. They just did not complete it in the session you wanted them to.

That reframes the entire experiment program. You are not trying to convince users who do not want to convert. You are trying to reduce the reasons the users who already want to convert had to leave. That is a very different optimization target, and it points you at different tests:

  • Friction reduction — fewer fields, clearer steps, faster load
  • Cognitive load reduction — fewer choices, clearer language, better defaults
  • Reassurance — trust signals, objection handling, payment clarity
  • Speed — the biggest lever in almost every mature funnel I have ever seen

You are optimizing for time to convert and in-session completion rate, not for total demand. That is a much more tractable problem, and the experiments that target it are also the ones most likely to move the testable baseline — the 25–45% range — rather than the already-saturated 60% upper bound.

The Checklist I Run Before Every Test Plan Ships

Whenever I audit a test plan, I run through the same four questions before I sign off. If any answer is "no," the plan goes back for revision.

  1. Am I using eligible users only in the denominator? Not total users. Not monthly actives. The users who can actually experience the variant.
  2. Am I treating eventual behavior as immediate behavior? If my baseline number is a 90-day recovery rate, am I implicitly assuming the variant gets credit for 90 days of lifecycle activity?
  3. Does my baseline reflect only the conversions the test can influence? Or is it a blended number that includes sources the test cannot touch?
  4. Am I using a range instead of a single number? Lower, midpoint, upper. Plan on the midpoint, sanity-check with the edges.

The test plan that clears all four is not guaranteed to win. But it is guaranteed to be honest about what it can and cannot detect, and that is the thing that separates a program that learns from a program that just ships tests.
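The four questions translate directly into a pre-flight check. A sketch, with illustrative field names on the plan:

```python
# Pre-flight check for a test plan. Any non-empty result sends the plan back.
def preflight(plan: dict) -> list:
    failures = []
    if not plan["denominator_is_eligible_only"]:
        failures.append("denominator includes users who cannot see the variant")
    if plan["baseline_is_eventual_recovery"]:
        failures.append("baseline treats eventual behavior as immediate behavior")
    if not plan["baseline_is_testable_slice"]:
        failures.append("baseline includes conversions the test cannot influence")
    if not plan["uses_range_estimate"]:
        failures.append("plan uses a point estimate instead of a range")
    return failures

plan = {
    "denominator_is_eligible_only": True,
    "baseline_is_eventual_recovery": False,
    "baseline_is_testable_slice": True,
    "uses_range_estimate": True,
}
print(preflight(plan))  # empty list: clear to ship
```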

What You Are Not Seeing

Here are four more traps that sit just behind the baseline mistake. They share the same root cause: using what is easy to measure instead of what is correct to model.

1. Return behavior might be the real bottleneck. If your "eventual recovery" number is high but the return rate itself is low, the test has limited reach no matter how good the variant is. Measure return rate separately.

2. You may already be near a ceiling. A high eventual conversion rate means fewer users left to capture. Incremental gains get smaller as you approach the ceiling. Lift assumptions that were realistic at 40% baseline are not realistic at 60%.

3. Your lift assumption is probably too optimistic. UX improvements in mature funnels rarely produce more than 10–20% relative lift. If your plan is built on a 25% lift assumption, the plan is built on a fantasy. Be honest in writing before you ship.

4. Measurement limitations will hide real effects. Without proper sequencing, deduplication, and attribution, the test will report noise as signal and signal as noise. If you cannot trust the measurement, you cannot trust the result — regardless of whether the baseline was right.

Bottom Line

The baseline conversion trap is not a statistics problem. It is a modeling problem disguised as a statistics problem. The analyst is doing correct arithmetic on the wrong numbers.

The fix is three moves:

  1. Anchor in reality. Start with the number you can measure, treat it as an upper bound.
  2. Adjust for causality. Discount for the fraction of that number your test can actually influence.
  3. Work in ranges, not points. Lower, midpoint, upper. Plan on the midpoint.

That is how mature teams plan experiments. Everyone else is running tests against a baseline that is not a baseline, on traffic that is not really eligible, for a lift that is not realistic, and then being surprised when the result is inconclusive.

The surprise is the dashboard. The mistake was three weeks earlier, in the planning doc.

Key Takeaways

  • Eventual recovery is not the same as testable conversion. A 90-day recovery rate includes lifecycle, retargeting, and cross-device returns that no A/B test can influence. Use it as an upper bound, not a baseline.
  • Use the smallest eligible denominator, not the biggest. Traffic is the users who can actually experience the variant — not total site visitors, not monthly actives.
  • Plan on ranges, not point estimates. Lower bound, midpoint, upper bound. Plan on the midpoint, sanity-check with the edges.
  • Realistic UX lift is 5–10% relative, not 25%. If the plan assumes more, it is fantasy planning, not experiment planning.
  • A 60% eventual recovery rate is a diagnostic, not a baseline. It tells you the problem is friction and speed, not demand — which points at a different class of experiments.
  • The four-question checklist: eligible users only, immediate vs. eventual, testable slice only, ranges not points. Any "no" goes back to the drafting phase.

FAQ

What is the difference between baseline conversion rate and eventual recovery rate?

Baseline conversion rate is the fraction of users who convert through the specific slice of behavior your test can influence. Eventual recovery rate is the fraction of users who eventually convert through any channel over the full measurement window. The two numbers can be very different — eventual recovery is usually much higher because it includes retargeting, lifecycle email, and cross-device returns the test cannot touch.

Why is using total site visitors as the traffic denominator wrong for an A/B test?

Because most A/B tests only run on a specific page, segment, or flow. A user who never reaches that page cannot experience the variant, so they are statistical noise. Including them inflates the apparent sample size by orders of magnitude and makes the power calculation produce runtimes that are not achievable in reality.

How do I estimate a baseline when I cannot measure in-session conversion directly?

Anchor on the number you can measure (usually the eventual recovery rate), then discount it for the fraction of that recovery your test can causally influence. Build a range — lower bound, midpoint, upper bound — and plan on the midpoint. A reasonable starting discount for mature funnels is to assume the testable portion is 40–75% of the eventual recovery number, adjusted for how strong the lifecycle marketing engine is.

What is a realistic lift to expect from a UX-focused A/B test?

In mature funnels, 5–10% relative lift is realistic for a typical UX improvement. Tests that claim 20%+ relative lift are usually either measurement artifacts, novelty effects, or changes that are really pricing or positioning changes in disguise. If your test plan assumes more than 10–15% relative lift, the assumption should be written down and challenged before shipping.

How do I size an experiment when traffic is low?

First, confirm the traffic estimate is using the correct eligible denominator — many "low traffic" problems are actually "wrong denominator" problems. If the eligible traffic is genuinely low (under a few hundred per week), either pick a much larger effect to detect, plan for a longer runtime, use a sequential or Bayesian approach that is more efficient at low sample sizes, or consider whether the test should be deprioritized in favor of one with more reach.

Atticus Li

Experimentation and growth leader. Builds AI-powered tools, runs conversion programs, and writes about economics, behavioral science, and shipping faster.