I've reviewed a lot of A/B testing roadmaps. The pattern is consistent: a long list of things someone wants to test, no rationale for why, no priority order beyond "we thought of these," and no connection to what the business needs to learn.

That's not a roadmap. It's a backlog with the word "experiment" in the title.

A real testing roadmap is a research agenda. It answers: what are the most important questions about our user behavior, what tests are designed to answer those questions, and in what order should we run them to build on each other's findings?

Here's how to build one.

Why Most Testing Roadmaps Fail

The fundamental problem: most roadmaps list tests, not hypotheses.

"Test new hero image" is not a hypothesis. "Test new hero image because our heatmap data shows users aren't reading below the fold, suggesting the hero isn't creating enough interest to scroll" is a hypothesis. One is an action item. The other is a research question with supporting evidence.

The practical difference: when the test with a hypothesis produces an unexpected result (no lift, or a lift but in the wrong metric), you have something to update. When the test without a hypothesis produces an unexpected result, you have nothing. You don't know if you tested the wrong thing, measured the wrong outcome, or learned something meaningful.

A roadmap built on hypotheses is testable, iterable, and defensible. A roadmap built on ideas is just a queue.

**Pro Tip:** For every item on your roadmap, require this format: "We believe [change] will [outcome] because [evidence]. We will know this is true when we see [measurable metric] move by [target amount]." If a test can't be described this way, it's not ready to be on the roadmap.
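
If you track your roadmap in a script or spreadsheet export, you can encode that format directly so incomplete items are visible at a glance. A minimal Python sketch of the idea; the `Hypothesis` class, its field names, and the example values are illustrative, not from any specific tool:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """One roadmap item in the required format."""
    change: str    # what we will change
    outcome: str   # what we expect to happen
    evidence: str  # why we believe it (research, analytics, past test)
    metric: str    # the metric that proves it
    target: str    # the movement we need to see

    def statement(self) -> str:
        return (
            f"We believe {self.change} will {self.outcome} "
            f"because {self.evidence}. We will know this is true "
            f"when we see {self.metric} move by {self.target}."
        )

# Illustrative example; the numbers are made up.
h = Hypothesis(
    change="shortening the checkout form from 9 fields to 5",
    outcome="increase checkout completion",
    evidence="funnel data shows 38% drop-off on the form step",
    metric="checkout completion rate",
    target="3 percentage points or more",
)
print(h.statement())
```

An item that leaves any field blank isn't ready for the roadmap, which is exactly the filter the format is meant to enforce.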

The Four Essential Roadmap Categories

A healthy testing program has a portfolio of test types, not just a queue of optimization ideas. I organize every roadmap into four categories.

Category 1: Quick Wins

High-confidence ideas backed by strong qualitative and quantitative evidence, requiring low development effort. These are your program's heartbeat — they keep velocity high while the bigger tests are running.

Examples: Button copy changes backed by user research, form field removal based on drop-off data, error message improvements identified in session recordings.

Aim for: 30-40% of your roadmap. These should have a high win rate (60-70%+) because they're well-evidenced.

Category 2: Big Bets

High-impact, high-effort tests. These are full redesigns, new product features, or significant UX overhauls. They take longer to build and run, but their potential upside is large.

Examples: Full checkout flow redesign, new pricing page structure, complete navigation overhaul.

Aim for: 20-30% of your roadmap. Win rate will be lower (30-40%), but when they win, the impact is significant.

Category 3: Learning Experiments

Tests designed to answer a strategic question rather than directly generate lift. These are valuable even when they don't win — you come away with a defensible answer to something your team was debating.

Examples: Testing whether social proof or scarcity messaging works better for your audience, whether users respond more to feature-led or outcome-led copy, whether adding a product video increases or decreases conversion.

Aim for: 20-25% of your roadmap. Don't optimize these for win rate — optimize them for information value.

Category 4: Anti-Regression Tests

Tests that verify a known winner is still performing. UX patterns become stale. Traffic mix changes. Seasonal effects shift user behavior. Running occasional anti-regression tests catches when a previous winner has stopped working.

Aim for: 10-15% of your roadmap. Run these quarterly on your highest-impact shipped variants.

**Pro Tip:** If your roadmap has no learning experiments, you're optimizing tactically but not strategically. Learning experiments are how you update your mental model of what your users respond to. That updated model makes all your future quick wins and big bets more likely to succeed.
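
If your roadmap lives in a spreadsheet or script, checking the mix against these bands takes a few lines. A minimal sketch, with the target ranges taken from the percentages above; the category labels are just illustrative strings:

```python
from collections import Counter

# Target share of roadmap per category (low, high), from the ranges above.
TARGETS = {
    "quick_win":       (0.30, 0.40),
    "big_bet":         (0.20, 0.30),
    "learning":        (0.20, 0.25),
    "anti_regression": (0.10, 0.15),
}

def check_mix(roadmap: list[str]) -> None:
    """Print each category's actual share of the roadmap vs. its target band."""
    counts = Counter(roadmap)
    total = len(roadmap)
    for category, (low, high) in TARGETS.items():
        share = counts.get(category, 0) / total
        flag = "ok" if low <= share <= high else "REBALANCE"
        print(f"{category:16s} {share:5.0%}  target {low:.0%}-{high:.0%}  {flag}")

# Example: a 20-slot roadmap that sits inside all four bands.
check_mix(
    ["quick_win"] * 7 + ["big_bet"] * 5 + ["learning"] * 5 + ["anti_regression"] * 3
)
```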

How to Source Ideas

A roadmap with diverse sources produces better results than one dominated by any single input. Here's what I use.

Analytics data (quantitative): Where are users dropping off? Session recordings and funnel analysis identify pages where conversion is below benchmark. Exit surveys on high-exit pages tell you why. These are your highest-priority sources because they point to real friction in the current experience.

User research (qualitative): Usability tests and user interviews reveal why users behave the way they do. A funnel analysis can tell you that 40% of users abandon on the payment page — user research tells you they're confused by the shipping cost reveal. Combining the two gives you a concrete, testable hypothesis.

Past experiment learnings: What have you already learned? Each test result updates your model. If social proof consistently outperforms scarcity messaging in your tests, that's a signal about your audience that should influence future roadmap items.

Behavioral science frameworks: Frameworks like loss aversion, social proof, and commitment and consistency aren't just academic — they generate hypotheses. "Users are more motivated by what they might lose than what they might gain" → test loss-framed vs. gain-framed CTAs on your upgrade page.

Competitive analysis: What are similar products doing that you aren't? This isn't "copy competitors" — it's "if multiple credible players in your market have converged on a pattern (e.g., persistent cart in the header), that's signal worth testing."

**Pro Tip:** Run a quarterly "hypothesis sprint" — a 2-hour session where the team (product, design, research, analytics) generates 20-30 hypotheses using all five sources above. Stack-rank them. The top 12 become your next quarter's roadmap. This replaces ad hoc roadmap additions with a structured, team-aligned process.

ICE Scoring for Prioritization

ICE (Impact, Confidence, Ease) is the most commonly used prioritization framework in CRO. When used correctly, it's genuinely useful. When gamed, it's worthless.

Impact (1-10): If this test wins, how large is the expected revenue or conversion impact? Rate against your other roadmap items, not in absolute terms. A 10 is your highest-potential-impact test.

Confidence (1-10): How strong is the evidence supporting the hypothesis? Direct user research behind it = 9-10. Gut feel = 1-3. Analytics data without qualitative backing = 5-6.

Ease (1-10): How easy is the test to build and run? A button color change = 9. A full checkout redesign = 2.

Score = Impact + Confidence + Ease (or multiply — either works, though addition produces less extreme scores).

How to avoid gaming it: The most common manipulation is rating low-evidence ideas highly on Confidence because you "really believe in them." Require each Confidence score to cite specific evidence: a user research finding, an analytics data point, a past experiment result. Without cited evidence, Confidence defaults to 3.
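
In code, the evidence-gating rule looks like this. A sketch only: the forced default of 3 comes from the paragraph above, while the class shape and the example items are mine:

```python
from dataclasses import dataclass, field

@dataclass
class RoadmapItem:
    name: str
    impact: int                  # 1-10, relative to other roadmap items
    ease: int                    # 1-10, build/run effort
    confidence: int = 3          # claimed confidence, 1-10
    evidence: list[str] = field(default_factory=list)

    def ice(self) -> int:
        # Without cited evidence, Confidence is forced back to the default of 3.
        confidence = self.confidence if self.evidence else 3
        return self.impact + confidence + self.ease  # additive variant

items = [
    RoadmapItem("Checkout form reduction", impact=8, ease=7, confidence=9,
                evidence=["May usability study", "funnel drop-off data"]),
    RoadmapItem("Red CTA buttons", impact=6, ease=9, confidence=9),  # no citation
]
for item in sorted(items, key=RoadmapItem.ice, reverse=True):
    print(f"{item.ice():2d}  {item.name}")
```

The second item claims Confidence 9 but cites nothing, so it scores 18 instead of 24 and falls behind the evidenced test — the gaming defense, mechanized.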

**Pro Tip:** ICE scores should be updated as you learn more. A test that started at Confidence 5 (weak analytics signal) moves to Confidence 8 after you run a user study that confirms the hypothesis. Use ICE as a living document, not a one-time ranking.

Sequencing Strategy: Don't Run Your Best Idea First

This is counterintuitive, but important: don't lead with your highest-ICE-scored test.

Experimentation programs have startup costs — tool configuration, team learning, process establishment. Your first few tests will run less smoothly than the tests you run six months in. Running your biggest, highest-risk test first maximizes the chance that execution problems contaminate your most important result.

Build the program first, then bet big on it.

Recommended sequencing:

  • Months 1-2: Run quick wins and small learning experiments. Build team familiarity with the tool, QA process, and result interpretation. Establish the stopping framework. These tests are valuable but forgiving.
  • Month 3: Run your first medium-effort test. You've now run 6-8 tests. You know your common failure points. Your QA process is tighter.
  • Month 4+: Layer in big bets. The program is running smoothly. A failed big bet doesn't derail the whole roadmap. A won big bet delivers real impact on top of an already-healthy velocity.

Balancing the Roadmap

Spread tests across the funnel, not just the bottom.

Most teams over-index on checkout optimization because it's closest to revenue. This makes sense, but a mature roadmap has coverage across the funnel:

  • Top of funnel (landing pages, blog, SEO pages): 25-30% of tests
  • Middle of funnel (product pages, pricing, features): 30-35%
  • Bottom of funnel (cart, checkout, upsell): 25-30%
  • Post-purchase (onboarding, retention flows): 10-15%

Also balance by device type. Desktop and mobile often have very different UX problems. A change that dramatically improves mobile checkout might be irrelevant on desktop. When possible, run mobile-specific tests rather than device-pooled tests that average out device-specific effects.

Stakeholder Management: Keeping Leadership Out of the Roadmap

The fastest way to destroy a testing program is to let leadership use it as a channel for shipping their pet ideas without rigorous testing.

Every leadership-requested test competes with a hypothesis-backed test for a testing slot. The leadership-requested test usually has lower confidence (it's based on opinion, not evidence) and often gets reported differently ("we ran this for the CEO and it underperformed" vs. "we ran this based on user research and it underperformed").

How to handle it:

  1. Welcome leadership hypothesis submissions — but require the same hypothesis format as all other roadmap items. "I want to test red buttons" gets reformatted into a proper hypothesis or gets deprioritized.
  2. Allocate 10-15% of roadmap capacity to leadership-requested tests. This is enough to maintain political buy-in without letting requests dominate.
  3. Report leadership-requested tests the same way you report all tests. No special treatment in reporting.
  4. When a leadership-requested test underperforms, use it as an evidence point: "This is why we have a hypothesis-first process. The user research would have flagged this before we spent 4 weeks testing it."

**Pro Tip:** Publish your testing roadmap to all stakeholders monthly. Transparency reduces "surprise" test requests because people can see the queue and understand why their idea isn't running yet.

Experiment Velocity: Quality Over Quantity

Teams often treat experiment velocity as a pure volume metric. More tests = better program. This is wrong.

Four well-designed tests per month with proper hypotheses, adequate sample sizes, and rigorous stopping rules produce more learning than twelve tests with sloppy hypotheses, underpowered runs, and premature stopping.

A test that runs to proper completion and produces a trustworthy result — even if it's a null result — advances your understanding of user behavior. A test that gets stopped at 40% of required sample size produces nothing you can act on.
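
"Proper sample size" is a number you can compute before launch. Here's a minimal two-proportion sample size sketch using only the Python standard library; the 3% baseline and 10% relative lift are illustrative, not recommendations:

```python
from statistics import NormalDist

def sample_size_per_arm(p_base: float, rel_lift: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed per variant for a two-proportion z-test."""
    p_var = p_base * (1 + rel_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    n = (z_alpha + z_beta) ** 2 * variance / (p_base - p_var) ** 2
    return int(n) + 1

# Example: 3% baseline conversion, aiming to detect a 10% relative lift.
print(sample_size_per_arm(0.03, 0.10))  # ~53,000 visitors per arm
```

Stopping at 40% of that number is how a month of traffic turns into a result nobody can trust.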

The velocity metric I track: conclusive tests per month. A conclusive test is one that ran to proper sample size, met all stopping conditions, and produced a result (win, loss, or null) you can act on. Not just "tests launched."

**Pro Tip:** Track your conclusive test ratio: conclusive tests / tests launched. Healthy programs run at 70-80% (some tests get legitimately stopped early for harm or contamination reasons). Programs below 50% are either underpowered or experiencing systematic stopping problems.
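
The ratio itself is one line of arithmetic; the value is in logging every launch and recomputing it monthly. A toy sketch with made-up records:

```python
# Hypothetical test log for one month: (test name, was it conclusive?)
log = [
    ("Hero copy test",        True),
    ("Checkout trust badges", True),
    ("PDP video test",        False),  # stopped at 40% of required sample
    ("Cart upsell timing",    True),
]

conclusive = sum(1 for _, ok in log if ok)
ratio = conclusive / len(log)
print(f"Conclusive test ratio: {ratio:.0%}")  # 75%, inside the healthy 70-80% band
```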

What a Mature 90-Day Roadmap Looks Like

Here's an example structure for a mid-size e-commerce program with ~200,000 monthly sessions and a team of 2-3 CRO practitioners.

Month 1 (4 test slots):

  • Checkout: Button label copy test (quick win, UX research-backed)
  • PDP: Social proof placement test (learning experiment)
  • Cart: Shipping threshold messaging test (quick win, analytics-backed)
  • Homepage: Hero value proposition test (big bet, 3-week runway)

Month 2 (4 test slots):

  • Checkout: Trust badge redesign (quick win)
  • PDP: Product image quantity test (learning experiment)
  • Cart: Upsell timing test (medium effort, revenue-focused)
  • Anti-regression: Re-test last quarter's checkout winner at current traffic mix

Month 3 (4 test slots):

  • Navigation: Category structure test (big bet)
  • PDP: Video vs. no video test (learning experiment)
  • Checkout: Form field reduction test (quick win)
  • Mobile-specific: Mobile checkout layout test (device-targeted big bet)

That's 12 tests, covering all four categories, spread across funnel stages and devices, with sequencing that builds from simple to complex.

Common Mistakes

Mistake 1: No hypothesis documentation. Without documented hypotheses, you can't learn from losses. Every losing test should update your model — but only if you had a model to begin with.

Mistake 2: Running too many tests simultaneously. If you're splitting traffic across 6 concurrent tests on the same pages, interaction effects between tests make results unreliable. Limit concurrent tests to 2-3 on any single user journey.

Mistake 3: Ignoring null results. A null result — no significant effect — is information. If you hypothesized a 15% lift based on user research and found zero effect, the research was telling you something different from what you thought. That's worth investigating.

Mistake 4: Never revisiting the roadmap. A 90-day roadmap should be reviewed every 30 days. New analytics data, user research, and test results change priorities. Treat the roadmap as a living document, not a fixed commitment.

What to Do Next

Start your next roadmap sprint by running a hypothesis audit: take every item on your current roadmap and write a full hypothesis statement for each one. Any item that can't generate a clear hypothesis isn't ready to be a test.

For a complete roadmap template, test documentation framework, and the prioritization worksheet I use with CRO clients, see the Optimizely Practitioner Toolkit.

Written by Atticus Li

Revenue & experimentation leader — behavioral economics, CRO, and AI. CXL & Mindworx certified. $30M+ in verified impact.