Across 200+ A/B tests run over two years inside an enterprise CRO program, the aggregate data tells a story most CRO content silently contradicts. Statistical significance is rare. Methodology choice matters more than test count. Retention is starved while acquisition is over-tested. Here's what real practitioner data shows — and what to do with it.
TL;DR
- 2% stat-sig rate. Only 4 of 200+ tests reached traditional p<0.05. The CRO field's published advice ("wait for stat-sig before shipping") is workable on a small fraction of real-world tests. The rest ship on directional + secondary signals or get parked.
- ~90% of wins are non-stat-sig. At most ~4 of the ~49 shipped winners reached traditional stat-sig; real programs ship variants under holdout-validated, non-inferiority, and directional methodologies far more often than under traditional A/B inference. Most don't document the methodology distinction.
- Single-device tests outperform "combined" tests by 3×. Win rate on device-segmented tests: 22-28%. Win rate on combined-platform tests (no device segmentation): under 10%. Aggregating mobile + desktop hides the wins.
- 89% acquisition, 8% retention. A heavily acquisition-skewed test budget is the industry default. Retention testing is structurally underweighted relative to its value to LTV-based businesses.
- Inconclusive ≈ 10%, "Optimization Opportunity" backlog ≈ 16%. Real programs have a substantial work-in-progress queue alongside their shipped wins. The visible win count is a small fraction of total program activity.
How the data was collected and anonymized
This analysis covers 200+ A/B tests run across an enterprise CRO program over a two-year window. All identifying data was removed before any aggregates were computed:
| Field | Anonymization rule |
| --------------------- | ---------------------------------------------------------- |
| Brand names | Stripped — no company, product, or sub-brand identifiers |
| Test IDs | Internal IDs replaced with public Exp-XXX aliases |
| Specific dates | Bucketed to quarter; exact run dates removed |
| Sample sizes per test | Bucketed to wide ranges (1k-10k, 10k-100k, 100k+) |
| Lift percentages | Reported as ranges, not exact values |
| Internal funnel codes | Genericized (account dashboard, plan selection page, etc.) |
| Personnel names | Stripped except author |
Public-research benchmarks from Baymard Institute, Nielsen Norman Group, CXL, Goodui, and Optimizely are referenced where they triangulate or contrast with the proprietary aggregate. Specific external statistics are cited to their published source; aggregate counts and rates are from the proprietary portfolio.
The aggregate data is published as ranges and proportions only. No individual test result can be reconstructed from this analysis, and no organization is identifiable from the patterns described.
Finding 1: The 2% problem — statistical significance is rare in real programs
Across 200+ completed tests, the rate of tests reaching traditional frequentist significance at p<0.05:
| Stat-sig category | Count | Share |
| ------------------ | ----- | ----- |
| Stat-sig at p<0.05 | ~4 | ~2% |
| Non-stat-sig (NS) | ~199 | ~98% |
The 2% number is the data point most CRO advice ignores. Industry publications routinely tell teams to "wait for statistical significance" — advice that's workable only for the small fraction of real tests where the math actually allows stat-sig within a reasonable runtime.
For the other 98%, three patterns determine what happens:
| Pattern | What teams do |
| ------------------------------------------------- | ----------------------------------------------------------------------- |
| Directional positive, low traffic / high baseline | Most teams ship on directional signal but mislabel as "stat-sig win" |
| Directional negative, properly powered | Most teams revert and document as "loss" without specifying methodology |
| Inconclusive (flat) | Most teams park; some ship on faith |
The methodology mislabel is the actual problem. It produces a test repository where "winner" means three different things — sometimes stat-sig win, sometimes directional ship, sometimes "we liked it and shipped it." Future readers can't tell which is which.
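One way to keep the label honest is to make it mechanical. Below is a minimal sketch, assuming a standard two-sided two-proportion z-test; the function name, the ~1% "flat" cutoff, and the example counts are illustrative, not part of any specific program's tooling.

```python
from math import sqrt
from statistics import NormalDist

def classify_result(control_conv, control_n, variant_conv, variant_n, alpha=0.05):
    """Label a completed test so the repository records what kind of result it was.

    Returns one of: 'stat-sig win', 'stat-sig loss', 'directional positive',
    'directional negative', 'inconclusive'. Two-sided two-proportion z-test.
    """
    p1 = control_conv / control_n
    p2 = variant_conv / variant_n
    pooled = (control_conv + variant_conv) / (control_n + variant_n)
    se = sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / variant_n))
    if se == 0:
        return "inconclusive"
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    if p_value < alpha:
        return "stat-sig win" if p2 > p1 else "stat-sig loss"
    # Not significant: record the direction explicitly instead of calling it a win.
    rel_lift = (p2 - p1) / p1 if p1 > 0 else 0.0
    if abs(rel_lift) < 0.01:  # under ~1% relative lift, treat as flat (arbitrary cutoff)
        return "inconclusive"
    return "directional positive" if rel_lift > 0 else "directional negative"

# 10,000 visitors per arm, 2.0% vs 2.2% conversion: a real lift, but not stat-sig
print(classify_result(200, 10_000, 220, 10_000))  # -> 'directional positive'
```

A repository that stores this label alongside the ship decision lets a future reader tell a stat-sig win from a directional ship at a glance.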
Public benchmark comparison: Optimizely's published case studies and Goodui's open repository show similarly low stat-sig rates when tests are honestly classified — though most public CRO content systematically over-counts wins by treating directional results as confirmed lifts. The 2% rate is consistent with what the methodology literature would predict; it contradicts what most marketing CRO content claims.
Finding 2: Single-device tests outperform combined tests by ~3×
Looking at 200+ tests, win rates differ meaningfully by how the test platform was scoped:
| Test platform | Tests | Win rate |
| ---------------------------------------------------- | ----- | -------- |
| Desktop-only | ~86 | ~22-26% |
| Mobile-only | ~84 | ~24-28% |
| Both Desktop AND Mobile (segmented results required) | ~22 | ~30-34% |
| Combined (single aggregate, no device segmentation) | ~11 | <10% |
The "Combined" category had the lowest win rate by a wide margin. The mechanism is consistent across the dataset: combined tests bury opposite-direction effects on the two device classes, so even when something genuinely won on one device, the aggregate often showed inconclusive.
This contradicts the default test design at most CRO programs, which is "run sitewide and read aggregate." The data says the opposite — segment by device class, read per-device, and ship conditionally.
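A quick numeric sketch of the masking mechanism; the per-device counts below are invented solely to show how two real, opposite-direction device effects can cancel into a flat aggregate.

```python
def rel_lift(ctrl_conv, ctrl_n, var_conv, var_n):
    """Relative lift of the variant over the control."""
    ctrl_rate, var_rate = ctrl_conv / ctrl_n, var_conv / var_n
    return (var_rate - ctrl_rate) / ctrl_rate

# Hypothetical per-device counts for a single test (illustrative only)
desktop_ctrl, desktop_var = (500, 20_000), (560, 20_000)   # wins on desktop
mobile_ctrl,  mobile_var  = (900, 30_000), (830, 30_000)   # loses on mobile

print(f"desktop  {rel_lift(*desktop_ctrl, *desktop_var):+.1%}")
print(f"mobile   {rel_lift(*mobile_ctrl, *mobile_var):+.1%}")

# Aggregate readout: pool both devices before computing lift
print(f"combined {rel_lift(500 + 900, 50_000, 560 + 830, 50_000):+.1%}")
# desktop +12.0%, mobile -7.8%, combined about -0.7%: the aggregate reads as noise
```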
Public benchmark comparison: Nielsen Norman Group's research on mobile vs desktop UX consistently flags meaningful behavioral differences between device classes, particularly on form-fill and modal-mediated flows. Baymard Institute's checkout research shows mobile abandonment rates approximately 10-15 percentage points higher than desktop on equivalent flows. Both findings reinforce that aggregating device classes obscures real per-device dynamics.
Finding 3: Methodology distribution — 50/50 dominates, holdout is rare
The test methodology distribution across the portfolio:
| Methodology | Tests | Share |
| -------------------------------------------- | ----- | -------------------------------- |
| 50/50 A/B                                    | ~70   | ~70%                             |
| Multi-arm (33/33/33 or A/B/C) | ~5 | ~5% |
| 100% pre/post (no concurrent control) | ~8 | ~8% |
| Non-inferiority | ~3 | ~3% |
| Holdout-validated rollout (90/10 or similar) | ~1 | ~1% |
| Methodology not documented | ~13 | ~13% |
The pattern is consistent across CRO programs at this scale: 50/50 A/B is the default regardless of whether it's the right regime for the test conditions. High-baseline pages, thin-traffic surfaces, and non-inferiority scenarios are all routinely tested under 50/50 methodology that mathematically can't power them. The result is a high proportion of inconclusive tests that get shipped on faith.
The cost is institutional: programs accumulate "we shipped this and the data was noisy" entries in their test repository. A year later, nobody can defend why a particular variant is in production. The methodology was wrong for the conditions, but the documentation never made that distinction.
The fix is methodology selection by condition:
| Test conditions | Right regime |
| ------------------------------------------------------------ | --------------------------------- |
| Baseline <50%, traffic ≥5K/arm/week | Standard 50/50 A/B |
| Baseline ≥80%, OR traffic <5K/arm/week | Holdout-validated rollout (90/10) |
| Variant ships for non-conversion reasons (compliance, brand) | Non-inferiority |
Programs that match methodology to conditions ship more defensible wins than programs that force every test into 50/50.
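The "mathematically can't power them" point is checkable with the standard two-proportion sample-size approximation. The sketch below (α = 0.05 two-sided, 80% power) uses illustrative baselines, lifts, and traffic levels, not the program's actual conditions; for high-baseline pages the binding constraint is headroom, so the realistically detectable lift is tiny, which is what drives the required runtime up.

```python
from statistics import NormalDist

def weeks_to_power(baseline, rel_lift, weekly_per_arm, alpha=0.05, power=0.80):
    """Weeks of runtime for a two-sided two-proportion z-test to detect
    a relative lift on a given baseline at the stated weekly traffic per arm."""
    p1, p2 = baseline, baseline * (1 + rel_lift)
    z = NormalDist().inv_cdf
    n_per_arm = ((z(1 - alpha / 2) + z(power)) ** 2
                 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)
    return n_per_arm / weekly_per_arm

# Illustrative conditions only; the table's thresholds are program heuristics.
cases = [
    ("healthy traffic, 20% baseline, 10% relative lift",  0.20, 0.10,  5_000),
    ("thin traffic,     2% baseline,  5% relative lift",  0.02, 0.05,  1_000),
    ("high baseline,   90% baseline, 0.5% relative lift", 0.90, 0.005, 5_000),
]
for label, b, lift, weekly in cases:
    print(f"{label}: ~{weeks_to_power(b, lift, weekly):.0f} weeks")
# Roughly 1, 315, and 14 weeks respectively: the last two can't realistically
# be run to stat-sig under a 50/50 split, which is the case for a holdout rollout
# or a non-inferiority framing instead.
```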
Finding 4: Acquisition vs retention imbalance — 89% vs 8%
The lifecycle distribution of tests:
| Lifecycle stage | Tests | Share |
| ------------------- | ----- | ----- |
| Acquisition | ~182 | ~89% |
| Retention | ~16 | ~8% |
| Other / unspecified | ~5 | ~3% |
Acquisition testing dominates. Retention testing is a structural minority, despite retention's role in LTV math.
This pattern is consistent with most CRO programs at this scale. Acquisition is more visible to leadership (top-of-funnel metrics map cleanly to dashboards), more politically funded (marketing teams have larger experimentation budgets than product teams), and easier to test (entry-point pages have higher traffic).
The opportunity cost shows up in the LTV math. A 5% acquisition lift on a base of $100M in new revenue produces $5M of incremental gross revenue. A 5% retention lift on the existing customer base, which for most subscription businesses is multiples larger than annual new-acquisition revenue, produces correspondingly more. The CRO programs that compound their advantage over time are the ones that test retention proportionally to its revenue contribution, not its dashboard visibility.
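A rough sketch of that arithmetic. The $100M new-revenue base is from the paragraph above; the 3× existing-customer base, the 80% baseline retention rate, and the five-year horizon are hypothetical assumptions chosen only to show how a retention lift compounds while a one-off acquisition lift does not.

```python
def retained_revenue(base, annual_retention, years):
    """Revenue retained from today's customer base over a horizon (compounds yearly)."""
    return sum(base * annual_retention ** t for t in range(1, years + 1))

new_revenue_base = 100_000_000        # from the paragraph above
existing_base = 3 * new_revenue_base  # assumption: existing base is 3x annual new revenue

acquisition_gain = 0.05 * new_revenue_base                      # one-off 5% lift: $5M
retention_gain = (retained_revenue(existing_base, 0.84, 5)      # 80% -> 84% retention,
                  - retained_revenue(existing_base, 0.80, 5))   # i.e. a 5% relative lift

print(f"5% acquisition lift, one year:    ${acquisition_gain:,.0f}")
print(f"5% retention lift, 5-yr horizon:  ${retention_gain:,.0f}")  # ~$110M under these assumptions
```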
Public benchmark comparison: Frederick Reichheld's foundational LTV research (Bain) and the broader subscription-economy literature show retention's compounding effect on LTV is typically 5-10× the comparable acquisition lift on equivalent test volume. The 89/8 split in this dataset is closer to industry default than to LTV-optimal allocation.
Finding 5: The work-in-progress queue is larger than the shipped-win count
The status distribution across all tracked tests, including those not yet completed:
| Status | Approximate count |
| --------------------------------------------- | ----------------- |
| Winner (shipped) | ~49 |
| Loser (reverted) | ~10 |
| Inconclusive | ~21 |
| Optimization Opportunity (pre-launch backlog) | ~32 |
| Live (currently running) | ~10 |
| Hypothesis / discovery | ~6 |
| Pre-design / development | ~10 |
| Various blocked / waiting states | ~25 |
The shipped wins (~49) are a small fraction of total program activity. The work-in-progress queue (Optimization Opportunity + various pre-launch states) is comparable in size. Inconclusive results (21) approach half the win count.
This is the institutional reality of CRO programs at scale. Most external CRO content presents a clean win → ship → measure narrative. The actual shape of the work is much messier, with substantial discovery, pre-design, legal review, and waiting-on-stakeholder states between hypothesis and shipped variant.
Implication for program leaders: the right metric for program health is not "wins shipped per quarter." It's the velocity through the pipeline — hypothesis → discovery → spec → development → launch → analysis → ship. Programs with high win counts but stagnant pipelines are running through their backlog of low-hanging tests. Programs with healthy pipeline velocity continue compounding even when individual quarters show fewer shipped wins.
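One concrete way to track that velocity is to compute stage-to-stage cycle times from the test tracker's timestamps. A minimal sketch, assuming each test record carries the date it entered each pipeline stage; the field names and dates below are hypothetical.

```python
from datetime import date
from statistics import median

# Hypothetical tracker records: the date each test entered a pipeline stage.
tests = [
    {"hypothesis": date(2024, 1, 3),  "launch": date(2024, 2, 20), "ship": date(2024, 3, 15)},
    {"hypothesis": date(2024, 1, 10), "launch": date(2024, 4, 1),  "ship": None},
]

def median_cycle_days(records, start, end):
    """Median days between two pipeline stages, over tests that reached both."""
    spans = [(r[end] - r[start]).days for r in records if r.get(start) and r.get(end)]
    return median(spans) if spans else None

print("hypothesis -> launch:", median_cycle_days(tests, "hypothesis", "launch"), "days")
print("hypothesis -> ship:  ", median_cycle_days(tests, "hypothesis", "ship"), "days")
```

Watching these medians quarter over quarter is a more honest health metric than counting shipped wins.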
Finding 6: The failure-mode taxonomy
Across the inconclusive and loser tests in the portfolio, ~12 distinct failure modes recur. Most CRO content treats these as one-offs; the data shows they're systematic patterns.
| Failure mode | Frequency in portfolio | Detection signal |
| ------------------------------------- | ---------------------- | ------------------------------------------------------------------------------ |
| Wrong-intent clicks | High | Sub-baseline click-to-conversion from specific source pages |
| Friction injection at destination | High | Modal-mediated CTA has sub-baseline click-to-conv vs equivalent direct-routing |
| CTA cannibalization | High | Existing CTAs lose volume in variant; new CTA at sub-baseline conv rate |
| Confounded-variable trap | Medium-High | Upstream + downstream metrics in opposite directions |
| Form-vs-content mismatch | Medium | Test optimizes form when user research said content |
| Above-the-fold competition | Medium | Multiple CTAs at same intent/commitment/audience |
| Aggregate-mask asymmetry | Medium | Aggregate flat or noisy; segmented results show opposite directions |
| Trust-badge ambiguity | Medium | Time-on-page up + FAQ attractiveness up + conversion flat |
| Time-on-page misread | Medium | Metric interpreted without engagement context |
| Underpowered stat-sig forcing | Medium | 50/50 A/B run on conditions math can't power |
| Copy-intent mismatch | Medium | High CTR + low click-to-conv on intent-mismatched source pages |
| Visibility failure | Low-Medium | Low CTR, high click-to-conv when clicked |
The taxonomy matters because each failure mode has a specific detection signal and a specific fix. Treating them as distinct patterns — instead of as generic "the test didn't work" — produces meaningfully higher win rates on follow-up iterations.
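A sketch of what that looks like in practice: a few of the detection signals from the table encoded as automated flags over a test's readout metrics. The metric keys and thresholds below are hypothetical and would need tuning to your own instrumentation; only four of the twelve modes are shown.

```python
def diagnose(m: dict) -> list[str]:
    """Map a test's readout metrics to likely failure modes from the taxonomy above.
    Keys and thresholds are hypothetical; adapt to your own instrumentation."""
    flags = []
    # Wrong-intent clicks / friction at destination: clicks convert well below baseline
    if m["variant_click_to_conv"] < 0.8 * m["baseline_click_to_conv"]:
        flags.append("wrong-intent clicks or friction injection at destination")
    # CTA cannibalization: existing CTAs lose volume while the new CTA underperforms
    if m["existing_cta_click_delta"] < -0.05 and m["new_cta_click_to_conv"] < m["baseline_click_to_conv"]:
        flags.append("CTA cannibalization")
    # Aggregate-mask asymmetry: flat overall, opposite directions per device
    if abs(m["aggregate_lift"]) < 0.01 and m["desktop_lift"] * m["mobile_lift"] < 0:
        flags.append("aggregate-mask asymmetry")
    # Visibility failure: few clicks, but those who do click convert at or above baseline
    if m["variant_ctr"] < 0.5 * m["baseline_ctr"] and m["variant_click_to_conv"] >= m["baseline_click_to_conv"]:
        flags.append("visibility failure")
    return flags or ["no named pattern matched; review manually"]

# Example readout (hypothetical numbers)
print(diagnose({
    "variant_click_to_conv": 0.04, "baseline_click_to_conv": 0.09,
    "existing_cta_click_delta": -0.02, "new_cta_click_to_conv": 0.04,
    "aggregate_lift": 0.002, "desktop_lift": 0.06, "mobile_lift": -0.05,
    "variant_ctr": 0.03, "baseline_ctr": 0.04,
}))
```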
This site has a dedicated article on each of these failure modes; collectively they form the practitioner's atlas of how CTA tests actually fail.
Finding 7: The gap between CRO advice and CRO reality
A few patterns where mainstream CRO content systematically misrepresents what real programs experience:
| CRO advice consensus | What 200+ tests actually show |
| --------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- |
| "Wait for statistical significance before shipping" | 2% of tests reach stat-sig; the other 98% need a different methodology |
| "Run a 50/50 A/B test for every change" | 50/50 is wrong for high-baseline or thin-traffic conditions; programs that force it accumulate inconclusive results |
| "More tests = more wins" | Program win rates are roughly fixed by methodology quality and pre-test screen discipline; volume alone doesn't move the win rate |
| "Sticky mobile CTAs always work" | Sticky CTAs work when the friction they remove is real; on pages without that friction they're noise or harm |
| "Adding a trust badge is a free win" | Trust badges win when the page is also equipped to answer the questions they raise; otherwise they create friction |
| "Use modals to capture leads" | Modals cost 20-70% of click-to-destination throughput; the captured-lead value rarely exceeds the lost direct-funnel value |
Public benchmark comparison: Goodui's open A/B test repository documents win rates roughly consistent with the proprietary portfolio's: most tests don't move the needle, win rates cluster in the 20-30% range across surfaces, and the most-cited "winning patterns" have substantial variance across implementations. CXL's research blog and Speero's published case studies show similar patterns when results are honestly reported.
The gap between mainstream CRO content and real program data is mostly about reporting honesty. Tests that "kind of worked" get reported as wins. Tests that produced inconclusive results don't get published. Programs that follow advice based only on published winners systematically over-estimate the field's win rate and under-estimate the methodology discipline required to ship defensible variants.
What this means for practitioners
For CRO managers, analysts, and growth product managers, the data has a few specific implications:
| Implication | Action |
| ----------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
| Stat-sig discipline matters more than test volume | Build methodology selection into pre-test review. Match regime to conditions. |
| Per-device segmentation outperforms aggregate reading | Make device segmentation a default test-report column. Decide ship/revert per device. |
| Retention is structurally undertested | Audit your test mix. If acquisition is >80% of test volume and retention is <10%, the program is leaving compounding LTV gains on the table. |
| Failure modes are systematic, not random | Use the failure-mode taxonomy to diagnose inconclusive tests. Most "didn't work" results have a specific named pattern with a specific fix. |
| Mainstream CRO content over-states win rates | Calibrate stakeholder expectations. Real win rates are 20-30%, not the 50%+ implied by published case studies. |
For founders and VPs of Growth evaluating CRO investment:
| Question | What the data says |
| -------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Should I build a CRO program in-house or hire an agency? | Either works; methodology discipline matters more than which path. The 2% stat-sig rate applies to both. |
| How fast should we expect wins? | 20-30% of tests produce shippable wins. At a 10-test-per-month cadence, expect 2-3 shipped wins per month after the program is mature. |
| What's a sign the program is working? | Pipeline velocity (hypothesis → ship), not win count. Programs with high velocity compound. |
| How do I evaluate a CRO consultant or agency? | Ask for their stat-sig rate and how they classify shipped non-stat-sig wins. The honest answer is ~2% stat-sig and explicit methodology classification (holdout, non-inferiority, directional). The dishonest answer is everything they ship is "stat-sig." |
Bottom line
Real CRO programs at scale produce a much messier picture than published CRO advice suggests. Statistical significance is rare. Methodology selection matters more than test volume. Device segmentation outperforms aggregation. Retention is undertested relative to its LTV impact. Failure modes follow systematic patterns, not random noise.
The teams that consistently grow funnel revenue are the ones that:
- Match methodology to test conditions (50/50, holdout-validated, non-inferiority — pick the right regime)
- Segment by device class as a default, not a custom-cut analysis
- Run pre-test screens that catch failure modes before experiment budget is committed
- Document methodology accurately so future readers know what shipped under what evidence
- Audit their test-mix balance between acquisition and retention quarterly
Programs that adopt these disciplines see win rates climb from 20-30% to 60-75% on candidates that pass the pre-test screens. The volume of tests doesn't change much. The defensibility of each shipped win — and the institutional knowledge that compounds across the program — changes substantially.
This article is the hub for a practitioner's atlas of A/B testing. Each finding above is covered in depth in a dedicated spoke article — sticky CTAs, click-to-conversion ratio, copy intent, modal vs direct routing, holdout-validated shipping, device asymmetry, time-on-page interpretation, trust-badge architecture, confounded-variable recovery, form vs content variable selection, above-the-fold hierarchy, and CTA cannibalization detection. Read whichever pattern matches the test you're about to ship.
The data is updated annually. The next update covers Q1-Q4 of the current testing year and refines the aggregates with new tests as the portfolio grows.