Across 200+ A/B tests run over two years inside an enterprise CRO program, the aggregate data tells a story most CRO content silently contradicts. Statistical significance is rare. Methodology choice matters more than test count. Retention is starved while acquisition is over-tested. Here's what real practitioner data shows — and what to do with it.
TL;DR
- 2% stat-sig rate. Only 4 of 200+ tests reached traditional p<0.05. The CRO field's published advice ("wait for stat-sig before shipping") is workable on a small fraction of real-world tests. The rest ship on directional + secondary signals or get parked.
- ~90% of wins are non-stat-sig. At most ~4 of the ~49 shipped winners reached traditional stat-sig; real programs ship variants under holdout-validated, non-inferiority, and directional methodologies far more often than under traditional A/B inference. Most don't document the methodology distinction.
- Single-device tests outperform "combined" tests by 3×. Win rate on device-segmented tests: 22-28%. Win rate on combined-platform tests (no device segmentation): under 10%. Aggregating mobile + desktop hides the wins.
- 89% acquisition, 8% retention. A heavily acquisition-skewed test budget is the industry default. Retention testing is structurally underweighted relative to its value to LTV-based businesses.
- Inconclusive ≈ 10%, "Optimization Opportunity" backlog ≈ 16%. Real programs have a substantial work-in-progress queue alongside their shipped wins. The visible win count is a small fraction of total program activity.
How the data was collected and anonymized
This analysis covers 200+ A/B tests run across an enterprise CRO program over a two-year window. All identifying data was removed before any aggregates were computed:
| Field | Anonymization rule |
| --------------------- | ---------------------------------------------------------- |
| Brand names | Stripped — no company, product, or sub-brand identifiers |
| Test IDs | Internal IDs replaced with public Exp-XXX aliases |
| Specific dates | Bucketed to quarter; exact run dates removed |
| Sample sizes per test | Bucketed to wide ranges (1k-10k, 10k-100k, 100k+) |
| Lift percentages | Reported as ranges, not exact values |
| Internal funnel codes | Genericized (account dashboard, plan selection page, etc.) |
| Personnel names | Stripped except author |
Public-research benchmarks from Baymard Institute, Nielsen Norman Group, CXL, Goodui, and Optimizely are referenced where they triangulate or contrast with the proprietary aggregate. Specific external statistics are cited to their published source; aggregate counts and rates are from the proprietary portfolio.
The aggregate data is published as ranges and proportions only. No individual test result can be reconstructed from this analysis, and no organization is identifiable from the patterns described.
Finding 1: The 2% problem — statistical significance is rare in real programs
Across 200+ completed tests, the rate of tests reaching traditional frequentist significance at p<0.05:
| Stat-sig category | Count | Share |
| ------------------ | ----- | ----- |
| Stat-sig at p<0.05 | ~4 | ~2% |
| Non-stat-sig (NS) | ~199 | ~98% |
The 2% number is the data point most CRO advice ignores. Industry publications routinely tell teams to "wait for statistical significance" — advice that's workable only for the small fraction of real tests where the math actually allows stat-sig within a reasonable runtime.
For the other 98%, three patterns determine what happens:
| Pattern | What teams do |
| ------------------------------------------------- | ----------------------------------------------------------------------- |
| Directional positive, low traffic / high baseline | Most teams ship on directional signal but mislabel as "stat-sig win" |
| Directional negative, properly powered | Most teams revert and document as "loss" without specifying methodology |
| Inconclusive (flat) | Most teams park; some ship on faith |
The methodology mislabel is the actual problem. It produces a test repository where "winner" means three different things — sometimes stat-sig win, sometimes directional ship, sometimes "we liked it and shipped it." Future readers can't tell which is which.
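One way to keep the label honest is to make it mechanical. Below is a minimal sketch, assuming a standard two-sided two-proportion z-test; the function name, the ~1% "flat" cutoff, and the example counts are illustrative, not part of any specific program's tooling.

```python
from math import sqrt
from statistics import NormalDist

def classify_result(control_conv, control_n, variant_conv, variant_n, alpha=0.05):
    """Label a completed test so the repository records what kind of result it was.

    Returns one of: 'stat-sig win', 'stat-sig loss', 'directional positive',
    'directional negative', 'inconclusive'. Two-sided two-proportion z-test.
    """
    p1 = control_conv / control_n
    p2 = variant_conv / variant_n
    pooled = (control_conv + variant_conv) / (control_n + variant_n)
    se = sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / variant_n))
    if se == 0:
        return "inconclusive"
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    if p_value < alpha:
        return "stat-sig win" if p2 > p1 else "stat-sig loss"
    # Not significant: record the direction explicitly instead of calling it a win.
    rel_lift = (p2 - p1) / p1 if p1 > 0 else 0.0
    if abs(rel_lift) < 0.01:  # under ~1% relative lift, treat as flat (arbitrary cutoff)
        return "inconclusive"
    return "directional positive" if rel_lift > 0 else "directional negative"

# 10,000 visitors per arm, 2.0% vs 2.2% conversion: a real lift, but not stat-sig
print(classify_result(200, 10_000, 220, 10_000))  # -> 'directional positive'
```

A repository that stores this label alongside the ship decision lets a future reader tell a stat-sig win from a directional ship at a glance.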
Public benchmark comparison: Optimizely's published case studies and Goodui's open repository show similarly low stat-sig rates when tests are honestly classified — though most public CRO content systematically over-counts wins by treating directional results as confirmed lifts. The 2% rate is consistent with what the methodology literature would predict; it contradicts what most marketing CRO content claims.
Finding 2: Single-device tests outperform combined tests by ~3×
Looking at 200+ tests, win rates differ meaningfully by how the test platform was scoped:
| Test platform | Tests | Win rate |
| ---------------------------------------------------- | ----- | -------- |
| Desktop-only | ~86 | ~22-26% |
| Mobile-only | ~84 | ~24-28% |
| Both Desktop AND Mobile (segmented results required) | ~22 | ~30-34% |
| Combined (single aggregate, no device segmentation) | ~11 | <10% |
The "Combined" category had the lowest win rate by a wide margin. The mechanism is consistent across the dataset: combined tests bury opposite-direction effects on the two device classes, so even when something genuinely won on one device, the aggregate often showed inconclusive.
This contradicts the default test design at most CRO programs, which is "run sitewide and read aggregate." The data says the opposite — segment by device class, read per-device, and ship conditionally.
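A quick numeric sketch of the masking mechanism; the per-device counts below are invented solely to show how two real, opposite-direction device effects can cancel into a flat aggregate.

```python
def rel_lift(ctrl_conv, ctrl_n, var_conv, var_n):
    """Relative lift of the variant over the control."""
    ctrl_rate, var_rate = ctrl_conv / ctrl_n, var_conv / var_n
    return (var_rate - ctrl_rate) / ctrl_rate

# Hypothetical per-device counts for a single test (illustrative only)
desktop_ctrl, desktop_var = (500, 20_000), (560, 20_000)   # wins on desktop
mobile_ctrl,  mobile_var  = (900, 30_000), (830, 30_000)   # loses on mobile

print(f"desktop  {rel_lift(*desktop_ctrl, *desktop_var):+.1%}")
print(f"mobile   {rel_lift(*mobile_ctrl, *mobile_var):+.1%}")

# Aggregate readout: pool both devices before computing lift
print(f"combined {rel_lift(500 + 900, 50_000, 560 + 830, 50_000):+.1%}")
# desktop +12.0%, mobile -7.8%, combined about -0.7%: the aggregate reads as noise
```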
Public benchmark comparison: Nielsen Norman Group's research on mobile vs desktop UX consistently flags meaningful behavioral differences between device classes, particularly on form-fill and modal-mediated flows. Baymard Institute's checkout research shows mobile abandonment rates approximately 10-15 percentage points higher than desktop on equivalent flows. Both findings reinforce that aggregating device classes obscures real per-device dynamics.
Finding 3: Methodology distribution — 50/50 dominates, holdout is rare
The test methodology distribution across the portfolio:
| Methodology | Tests | Share |
| -------------------------------------------- | ----- | -------------------------------- |
| 50/50 A/B                                    | ~70   | ~70%                             |
| Multi-arm (33/33/33 or A/B/C) | ~5 | ~5% |
| 100% pre/post (no concurrent control) | ~8 | ~8% |
| Non-inferiority | ~3 | ~3% |
| Holdout-validated rollout (90/10 or similar) | ~1 | ~1% |
| Methodology not documented | ~13 | ~13% |
The pattern is consistent across CRO programs at this scale: 50/50 A/B is the default regardless of whether it's the right regime for the test conditions. High-baseline pages, thin-traffic surfaces, and non-inferiority scenarios are all routinely tested under 50/50 methodology that mathematically can't power them. The result is a high proportion of inconclusive tests that get shipped on faith.
The cost is institutional: programs accumulate "we shipped this and the data was noisy" entries in their test repository. A year later, nobody can defend why a particular variant is in production. The methodology was wrong for the conditions, but the documentation never made that distinction.
The fix is methodology selection by condition:
| Test conditions | Right regime |
| ------------------------------------------------------------ | --------------------------------- |
| Baseline <50%, traffic ≥5K/arm/week | Standard 50/50 A/B |
| Baseline ≥80%, OR traffic <5K/arm/week | Holdout-validated rollout (90/10) |
| Variant ships for non-conversion reasons (compliance, brand) | Non-inferiority |
Programs that match methodology to conditions ship more defensible wins than programs that force every test into 50/50.
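The "mathematically can't power them" point is checkable with the standard two-proportion sample-size approximation. The sketch below (α = 0.05 two-sided, 80% power) uses illustrative baselines, lifts, and traffic levels, not the program's actual conditions; for high-baseline pages the binding constraint is headroom, so the realistically detectable lift is tiny, which is what drives the required runtime up.

```python
from statistics import NormalDist

def weeks_to_power(baseline, rel_lift, weekly_per_arm, alpha=0.05, power=0.80):
    """Weeks of runtime for a two-sided two-proportion z-test to detect
    a relative lift on a given baseline at the stated weekly traffic per arm."""
    p1, p2 = baseline, baseline * (1 + rel_lift)
    z = NormalDist().inv_cdf
    n_per_arm = ((z(1 - alpha / 2) + z(power)) ** 2
                 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)
    return n_per_arm / weekly_per_arm

# Illustrative conditions only; the table's thresholds are program heuristics.
cases = [
    ("healthy traffic, 20% baseline, 10% relative lift",  0.20, 0.10,  5_000),
    ("thin traffic,     2% baseline,  5% relative lift",  0.02, 0.05,  1_000),
    ("high baseline,   90% baseline, 0.5% relative lift", 0.90, 0.005, 5_000),
]
for label, b, lift, weekly in cases:
    print(f"{label}: ~{weeks_to_power(b, lift, weekly):.0f} weeks")
# Roughly 1, 315, and 14 weeks respectively: the last two can't realistically
# be run to stat-sig under a 50/50 split, which is the case for a holdout rollout
# or a non-inferiority framing instead.
```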
Finding 4: Acquisition vs retention imbalance — 89% vs 8%
The lifecycle distribution of tests:
| Lifecycle stage | Tests | Share |
| ------------------- | ----- | ----- |
| Acquisition | ~182 | ~89% |
| Retention | ~16 | ~8% |
| Other / unspecified | ~5 | ~3% |
Acquisition testing dominates. Retention testing is a structural minority, despite retention's role in LTV math.
This pattern is consistent with most CRO programs at this scale. Acquisition is more visible to leadership (top-of-funnel metrics map cleanly to dashboards), more politically funded (marketing teams have larger experimentation budgets than product teams), and easier to test (entry-point pages have higher traffic).
The opportunity cost shows up in the LTV math. A 5% acquisition lift on a base of $100M in new revenue produces $5M of incremental gross revenue. A 5% retention lift on the existing customer base, which for most subscription businesses is multiples larger than annual new-acquisition revenue, produces correspondingly more. The CRO programs that compound their advantage over time are the ones that test retention proportionally to its revenue contribution, not its dashboard visibility.
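A rough sketch of that arithmetic. The $100M new-revenue base is from the paragraph above; the 3× existing-customer base, the 80% baseline retention rate, and the five-year horizon are hypothetical assumptions chosen only to show how a retention lift compounds while a one-off acquisition lift does not.

```python
def retained_revenue(base, annual_retention, years):
    """Revenue retained from today's customer base over a horizon (compounds yearly)."""
    return sum(base * annual_retention ** t for t in range(1, years + 1))

new_revenue_base = 100_000_000        # from the paragraph above
existing_base = 3 * new_revenue_base  # assumption: existing base is 3x annual new revenue

acquisition_gain = 0.05 * new_revenue_base                      # one-off 5% lift: $5M
retention_gain = (retained_revenue(existing_base, 0.84, 5)      # 80% -> 84% retention,
                  - retained_revenue(existing_base, 0.80, 5))   # i.e. a 5% relative lift

print(f"5% acquisition lift, one year:    ${acquisition_gain:,.0f}")
print(f"5% retention lift, 5-yr horizon:  ${retention_gain:,.0f}")  # ~$110M under these assumptions
```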
Public benchmark comparison: Frederick Reichheld's foundational LTV research (Bain) and the broader subscription-economy literature show retention's compounding effect on LTV is typically 5-10× the comparable acquisition lift on equivalent test volume. The 89/8 split in this dataset is closer to industry default than to LTV-optimal allocation.
Finding 5: The work-in-progress queue is larger than the shipped-win count
The status distribution across all tracked tests, including those not yet completed:
| Status | Approximate count |
| --------------------------------------------- | ----------------- |
| Winner (shipped) | ~49 |
| Loser (reverted) | ~10 |
| Inconclusive | ~21 |
| Optimization Opportunity (pre-launch backlog) | ~32 |
| Live (currently running) | ~10 |
| Hypothesis / discovery | ~6 |
| Pre-design / development | ~10 |
| Various blocked / waiting states | ~25 |
The shipped wins (~49) are a small fraction of total program activity. The work-in-progress queue (Optimization Opportunity + various pre-launch states) is comparable in size. Inconclusive results (21) approach half the win count.
This is the institutional reality of CRO programs at scale. Most external CRO content presents a clean win → ship → measure narrative. The actual shape of the work is much messier, with substantial discovery, pre-design, legal review, and waiting-on-stakeholder states between hypothesis and shipped variant.
Implication for program leaders: the right metric for program health is not "wins shipped per quarter." It's the velocity through the pipeline — hypothesis → discovery → spec → development → launch → analysis → ship. Programs with high win counts but stagnant pipelines are running through their backlog of low-hanging tests. Programs with healthy pipeline velocity continue compounding even when individual quarters show fewer shipped wins.
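One concrete way to track that velocity is to compute stage-to-stage cycle times from the test tracker's timestamps. A minimal sketch, assuming each test record carries the date it entered each pipeline stage; the field names and dates below are hypothetical.

```python
from datetime import date
from statistics import median

# Hypothetical tracker records: the date each test entered a pipeline stage.
tests = [
    {"hypothesis": date(2024, 1, 3),  "launch": date(2024, 2, 20), "ship": date(2024, 3, 15)},
    {"hypothesis": date(2024, 1, 10), "launch": date(2024, 4, 1),  "ship": None},
]

def median_cycle_days(records, start, end):
    """Median days between two pipeline stages, over tests that reached both."""
    spans = [(r[end] - r[start]).days for r in records if r.get(start) and r.get(end)]
    return median(spans) if spans else None

print("hypothesis -> launch:", median_cycle_days(tests, "hypothesis", "launch"), "days")
print("hypothesis -> ship:  ", median_cycle_days(tests, "hypothesis", "ship"), "days")
```

Watching these medians quarter over quarter is a more honest health metric than counting shipped wins.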
Finding 6: The failure-mode taxonomy
Across the inconclusive and loser tests in the portfolio, ~12 distinct failure modes recur. Most CRO content treats these as one-offs; the data shows they're systematic patterns.
| Failure mode | Frequency in portfolio | Detection signal |
| ------------------------------------- | ---------------------- | ------------------------------------------------------------------------------ |
| Wrong-intent clicks | High | Sub-baseline click-to-conversion from specific source pages |
| Friction injection at destination | High | Modal-mediated CTA has sub-baseline click-to-conv vs equivalent direct-routing |
| CTA cannibalization | High | Existing CTAs lose volume in variant; new CTA at sub-baseline conv rate |
| Confounded-variable trap | Medium-High | Upstream + downstream metrics in opposite directions |
| Form-vs-content mismatch | Medium | Test optimizes form when user research said content |
| Above-the-fold competition | Medium | Multiple CTAs at same intent/commitment/audience |
| Aggregate-mask asymmetry | Medium | Aggregate flat or noisy; segmented results show opposite directions |
| Trust-badge ambiguity | Medium | Time-on-page up + FAQ attractiveness up + conversion flat |
| Time-on-page misread | Medium | Metric interpreted without engagement context |
| Underpowered stat-sig forcing | Medium | 50/50 A/B run on conditions math can't power |
| Copy-intent mismatch | Medium | High CTR + low click-to-conv on intent-mismatched source pages |
| Visibility failure | Low-Medium | Low CTR, high click-to-conv when clicked |
The taxonomy matters because each failure mode has a specific detection signal and a specific fix. Treating them as distinct patterns — instead of as generic "the test didn't work" — produces meaningfully higher win rates on follow-up iterations.
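A sketch of what that looks like in practice: a few of the detection signals from the table encoded as automated flags over a test's readout metrics. The metric keys and thresholds below are hypothetical and would need tuning to your own instrumentation; only four of the twelve modes are shown.

```python
def diagnose(m: dict) -> list[str]:
    """Map a test's readout metrics to likely failure modes from the taxonomy above.
    Keys and thresholds are hypothetical; adapt to your own instrumentation."""
    flags = []
    # Wrong-intent clicks / friction at destination: clicks convert well below baseline
    if m["variant_click_to_conv"] < 0.8 * m["baseline_click_to_conv"]:
        flags.append("wrong-intent clicks or friction injection at destination")
    # CTA cannibalization: existing CTAs lose volume while the new CTA underperforms
    if m["existing_cta_click_delta"] < -0.05 and m["new_cta_click_to_conv"] < m["baseline_click_to_conv"]:
        flags.append("CTA cannibalization")
    # Aggregate-mask asymmetry: flat overall, opposite directions per device
    if abs(m["aggregate_lift"]) < 0.01 and m["desktop_lift"] * m["mobile_lift"] < 0:
        flags.append("aggregate-mask asymmetry")
    # Visibility failure: few clicks, but those who do click convert at or above baseline
    if m["variant_ctr"] < 0.5 * m["baseline_ctr"] and m["variant_click_to_conv"] >= m["baseline_click_to_conv"]:
        flags.append("visibility failure")
    return flags or ["no named pattern matched; review manually"]

# Example readout (hypothetical numbers)
print(diagnose({
    "variant_click_to_conv": 0.04, "baseline_click_to_conv": 0.09,
    "existing_cta_click_delta": -0.02, "new_cta_click_to_conv": 0.04,
    "aggregate_lift": 0.002, "desktop_lift": 0.06, "mobile_lift": -0.05,
    "variant_ctr": 0.03, "baseline_ctr": 0.04,
}))
```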
This site has a dedicated article on each of these failure modes; collectively they form the practitioner's atlas of how CTA tests actually fail.
Finding 7: The gap between CRO advice and CRO reality
A few patterns where mainstream CRO content systematically misrepresents what real programs experience:
| CRO advice consensus | What 200+ tests actually show |
| --------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- |
| "Wait for statistical significance before shipping" | 2% of tests reach stat-sig; the other 98% need a different methodology |
| "Run a 50/50 A/B test for every change" | 50/50 is wrong for high-baseline or thin-traffic conditions; programs that force it accumulate inconclusive results |
| "More tests = more wins" | Program win rates are roughly fixed by methodology quality and pre-test screen discipline; volume alone doesn't move the win rate |
| "Sticky mobile CTAs always work" | Sticky CTAs work when the friction they remove is real; on pages without that friction they're noise or harm |
| "Adding a trust badge is a free win" | Trust badges win when the page is also equipped to answer the questions they raise; otherwise they create friction |
| "Use modals to capture leads" | Modals cost 20-70% of click-to-destination throughput; the captured-lead value rarely exceeds the lost direct-funnel value |
Public benchmark comparison: Goodui's open A/B test repository documents win rates roughly consistent with the proprietary portfolio's: most tests don't move the needle, win rates cluster in the 20-30% range across surfaces, and the most-cited "winning patterns" have substantial variance across implementations. CXL's research blog and Speero's published case studies show similar patterns when results are honestly reported.
The gap between mainstream CRO content and real program data is mostly about reporting honesty. Tests that "kind of worked" get reported as wins. Tests that produced inconclusive results don't get published. Programs that follow advice based only on published winners systematically over-estimate the field's win rate and under-estimate the methodology discipline required to ship defensible variants.
What this means for practitioners
For CRO managers, analysts, and growth product managers, the data has a few specific implications:
| Implication | Action |
| ----------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
| Stat-sig discipline matters more than test volume | Build methodology selection into pre-test review. Match regime to conditions. |
| Per-device segmentation outperforms aggregate reading | Make device segmentation a default test-report column. Decide ship/revert per device. |
| Retention is structurally undertested | Audit your test mix. If acquisition is >80% of test volume and retention is <10%, the program is leaving compounding LTV gains on the table. |
| Failure modes are systematic, not random | Use the failure-mode taxonomy to diagnose inconclusive tests. Most "didn't work" results have a specific named pattern with a specific fix. |
| Mainstream CRO content over-states win rates | Calibrate stakeholder expectations. Real win rates are 20-30%, not the 50%+ implied by published case studies. |
For founders and VPs of Growth evaluating CRO investment:
| Question | What the data says |
| -------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Should I build a CRO program in-house or hire an agency? | Either works; methodology discipline matters more than which path. The 2% stat-sig rate applies to both. |
| How fast should we expect wins? | 20-30% of tests produce shippable wins. At a 10-test-per-month cadence, expect 2-3 shipped wins per month after the program is mature. |
| What's a sign the program is working? | Pipeline velocity (hypothesis → ship), not win count. Programs with high velocity compound. |
| How do I evaluate a CRO consultant or agency? | Ask for their stat-sig rate and how they classify shipped non-stat-sig wins. The honest answer is ~2% stat-sig and explicit methodology classification (holdout, non-inferiority, directional). The dishonest answer is everything they ship is "stat-sig." |
Bottom line
Real CRO programs at scale produce a much messier picture than published CRO advice suggests. Statistical significance is rare. Methodology selection matters more than test volume. Device segmentation outperforms aggregation. Retention is undertested relative to its LTV impact. Failure modes follow systematic patterns, not random noise.
The teams that consistently grow funnel revenue are the ones that:
- Match methodology to test conditions (50/50, holdout-validated, non-inferiority — pick the right regime)
- Segment by device class as a default, not a custom-cut analysis
- Run pre-test screens that catch failure modes before experiment budget is committed
- Document methodology accurately so future readers know what shipped under what evidence
- Audit their test-mix balance between acquisition and retention quarterly
Programs that adopt these disciplines see win rates climb from 20-30% to 60-75% on candidates that pass the pre-test screens. The volume of tests doesn't change much. The defensibility of each shipped win — and the institutional knowledge that compounds across the program — changes substantially.
This article is the hub for a practitioner's atlas of A/B testing. Each finding above is covered in depth in a dedicated spoke article — sticky CTAs, click-to-conversion ratio, copy intent, modal vs direct routing, holdout-validated shipping, device asymmetry, time-on-page interpretation, trust-badge architecture, confounded-variable recovery, form vs content variable selection, above-the-fold hierarchy, and CTA cannibalization detection. Read whichever pattern matches the test you're about to ship.
The data is updated annually. The next update covers Q1-Q4 of the current testing year and refines the aggregates with new tests as the portfolio grows.