Every Team Needs a Prioritization Framework (Just Not a Broken One)
You have more test ideas than you can run. This is the universal condition of every experimentation program. The question is not whether to prioritize — it is how.
The ICE framework is the most popular answer. Rate each test idea on Impact, Confidence, and Ease, multiply the scores, and rank by the result. It is simple, intuitive, and widely adopted. It is also deeply flawed in ways that most teams never notice.
Understanding those flaws does not mean abandoning ICE. It means using it wisely and knowing when to supplement it with something better.
How ICE Works
ICE scores each test idea on three dimensions, typically on a scale of one to ten:
Impact: How large will the effect be if the test wins? A ten means a transformational change to a key metric. A one means a negligible improvement.
Confidence: How certain are you that the test will produce the predicted result? A ten means you have strong data supporting the hypothesis. A one means it is pure speculation.
Ease: How simple is the test to implement? A ten means it can be launched in a day. A one means it requires months of engineering work.
The ICE score is the product of the three dimensions; some teams use the average instead. Higher scores get prioritized first.
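The arithmetic is simple enough to live in a spreadsheet, but a short sketch makes the mechanics concrete. This is a minimal Python example; the idea names and scores are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Idea:
    name: str
    impact: int      # 1-10: how large the effect would be if the test wins
    confidence: int  # 1-10: how strong the supporting evidence is
    ease: int        # 1-10: how quickly the test can be implemented

    def ice_score(self) -> int:
        # Multiplicative ICE; swap in an average if that is your team's convention.
        return self.impact * self.confidence * self.ease

# Hypothetical backlog entries, for illustration only.
backlog = [
    Idea("Homepage headline copy", impact=4, confidence=6, ease=9),
    Idea("Checkout flow redesign", impact=9, confidence=5, ease=2),
    Idea("Pricing page layout", impact=7, confidence=4, ease=6),
]

for idea in sorted(backlog, key=lambda i: i.ice_score(), reverse=True):
    print(f"{idea.name}: {idea.ice_score()}")
```

Note the ranking this produces: the quick copy change outscores the checkout redesign even though the redesign has more than twice the expected impact. That is a preview of the Ease bias discussed below.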
Why Teams Love It
ICE is attractive because:
- It is easy to explain to anyone
- It forces a structured conversation about each idea
- It produces a numerical ranking that feels objective
- It can be done in a spreadsheet in thirty minutes
These are real benefits. For teams just starting to prioritize their test backlog, ICE is better than no framework at all.
Where ICE Breaks Down
The problems with ICE are not obvious until you use it at scale. Here are the most damaging ones.
Problem 1: Subjective Scoring Creates False Precision
What does an Impact score of seven mean versus a six? There is no calibration. Two people on the same team will score the same idea differently, and neither is wrong because the scale is undefined.
This creates a dangerous illusion of objectivity. The numbers feel precise, but they are opinions dressed up as data. Teams debate whether an idea is a six or a seven when the distinction is meaningless.
The deeper issue: ICE scores are not comparable across people, across time, or across idea categories. An Impact score of eight for a pricing test and an eight for a homepage test represent completely different magnitudes of business impact.
Problem 2: Ease Bias Kills Strategic Experiments
Because Ease is weighted equally with Impact and Confidence, quick-to-implement ideas systematically outrank ambitious ones. A small copy change with moderate expected impact will score higher than a fundamental flow redesign with high expected impact but significant implementation effort.
Over time, this produces a test program dominated by incremental tweaks. The big bets — the tests that could transform a metric by tens of percent — never make it to the top of the queue because they are never easy.
This is especially dangerous because the relationship between implementation effort and test impact is not linear. Some of the highest-impact experiments require substantial engineering investment. ICE structurally penalizes them.
Problem 3: Confidence Is Poorly Defined
What does confidence mean in the context of a test you have never run? You might be confident because you saw a similar test work at another company. Or because a user interview suggested the idea. Or because your gut tells you it will work.
These are very different types of confidence, but ICE treats them as interchangeable. A confidence score based on rigorous data analysis gets the same weight as one based on anecdotal evidence.
The paradox: The tests you are most confident about are often the least interesting. If you are confident a change will work, you probably already know the answer — the experiment is confirmatory, not exploratory. The most valuable experiments are the ones where you genuinely do not know what will happen.
Problem 4: No Learning Value Component
ICE optimizes for impact on the metric. It does not account for how much you learn from the experiment, regardless of the outcome.
Some experiments are worth running even if they are unlikely to win, because the result will fundamentally change your understanding of user behavior. A test that reveals users do not care about a feature you thought was essential is extremely valuable — but ICE would rate it low on Confidence and Impact.
Problem 5: No Portfolio Thinking
ICE ranks ideas individually. But experimentation programs should be managed as portfolios, with a mix of:
- High-confidence incremental tests (reliable but small gains)
- Medium-confidence moderate tests (balanced risk and reward)
- Low-confidence big bets (unlikely to win but transformational if they do)
ICE produces a ranked list, not a balanced portfolio. Teams that follow ICE strictly end up with a homogeneous test program that lacks strategic diversity.
Better Alternatives (and When to Use Each)
RICE: Adding Reach
RICE adds a fourth dimension — Reach — which accounts for how many users the change will affect, and replaces Ease with an explicit Effort estimate in the denominator. This is a meaningful improvement over ICE because it prevents teams from prioritizing tests that affect only a tiny segment of users.
RICE formula: (Reach x Impact x Confidence) / Effort
Use RICE when: You have reliable data on the number of users who encounter each touchpoint you are testing.
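A minimal sketch of the calculation, assuming Reach is measured per quarter and Effort in person-months; those units are a common convention, not a requirement, as long as the team applies them consistently.

```python
def rice_score(reach: float, impact: float, confidence: float, effort: float) -> float:
    """RICE score: (Reach x Impact x Confidence) / Effort.

    Assumed units for illustration: reach = users affected per quarter,
    impact = a small multiplier (e.g. 0.25 to 3), confidence = 0 to 1,
    effort = person-months of work.
    """
    if effort <= 0:
        raise ValueError("effort must be positive")
    return (reach * impact * confidence) / effort

# Hypothetical example: 20,000 users per quarter, medium impact,
# 80 percent confidence, two person-months of effort.
print(rice_score(reach=20_000, impact=1.0, confidence=0.8, effort=2))  # 8000.0
```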
Weighted Scoring Models
Instead of giving equal weight to each dimension, assign weights based on your strategic priorities. If learning is important, add a Learning Value dimension and weight it heavily. If speed matters more than magnitude, weight Ease higher.
The key is making the weights explicit and reviewing them quarterly. Different business stages call for different weights.
Use weighted scoring when: Your team has enough experimentation experience to know which dimensions matter most for your context.
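Here is a hedged sketch of what a weighted model might look like in code. The dimensions, weights, and scores below are illustrative assumptions, not a recommended configuration.

```python
# Illustrative weights for a team that values impact and learning over ease.
# The dimension names and numbers are assumptions; the point is that the
# weights are written down and easy to revisit each quarter.
WEIGHTS = {"impact": 0.35, "confidence": 0.20, "ease": 0.15, "learning": 0.30}

def weighted_score(scores: dict[str, float]) -> float:
    """Weighted average of 1-10 dimension scores using the explicit weights."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# Hypothetical big bet: high impact and learning value, low ease.
print(weighted_score({"impact": 9, "confidence": 4, "ease": 2, "learning": 9}))  # 6.95
```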
The Opportunity Cost Framework
Instead of scoring ideas in isolation, compare them against each other by asking: "If I run this test instead of that one, what am I giving up?"
This forces direct comparison rather than absolute scoring. It also naturally accounts for portfolio balance — after selecting three incremental tests, the opportunity cost of another incremental test is high relative to a big bet.
Use opportunity cost when: Your team is mature enough to have robust debates about test selection without needing a numerical crutch.
Expected Value Calculation
For teams with enough data, calculate the expected value of each test:
Expected Value = Probability of Winning x Size of Win x Number of Users Affected
This is more rigorous than ICE because it uses actual estimates rather than arbitrary scores. The probability of winning can be informed by historical win rates for similar test types. The size of win can be estimated from the minimum detectable effect.
Use expected value when: You have historical experiment data to calibrate your estimates.
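A minimal sketch of the calculation, assuming the size of a win can be expressed as incremental value per affected user; the inputs are hypothetical.

```python
def expected_value(p_win: float, win_size_per_user: float, users_affected: int) -> float:
    """Expected Value = Probability of Winning x Size of Win x Number of Users Affected.

    p_win: historical win rate for similar tests (0 to 1)
    win_size_per_user: estimated value per user if the test wins, e.g. derived
                       from the minimum detectable effect and baseline value
    users_affected: number of users exposed to the change after rollout
    """
    return p_win * win_size_per_user * users_affected

# Hypothetical inputs: 20 percent win rate, $0.40 incremental value per user,
# 50,000 users affected.
print(expected_value(0.20, 0.40, 50_000))  # 4000.0
```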
Making ICE Work Better
If you stick with ICE (and many teams should, because its simplicity has real value), here are ways to compensate for its weaknesses:
Calibrate Your Scales
Define what each score means for your team. Write it down:
- Impact 10: Moves the primary KPI by more than fifteen percent relative
- Impact 7: Moves the primary KPI by five to fifteen percent relative
- Impact 4: Moves the primary KPI by one to five percent relative
- Impact 1: No meaningful impact expected
Do this for all three dimensions. Calibrated scales produce more consistent and meaningful scores.
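Once the scale is written down, an Impact score can even be derived from an estimated lift rather than assigned by gut feel. A sketch using the thresholds above (the intermediate scores would need their own definitions):

```python
def impact_score(estimated_relative_lift: float) -> int:
    """Map an estimated relative lift on the primary KPI to a calibrated
    Impact score, using the thresholds written down above."""
    if estimated_relative_lift > 0.15:
        return 10
    if estimated_relative_lift >= 0.05:
        return 7
    if estimated_relative_lift >= 0.01:
        return 4
    return 1

print(impact_score(0.08))  # 7: a five to fifteen percent relative lift
```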
Add a Strategic Override
Reserve a portion of your testing capacity (around twenty to thirty percent) for strategic experiments that ICE would rank low but that align with long-term learning goals. This prevents the Ease bias from dominating your program.
Score as a Team
Individual scoring is inconsistent. Score ideas together as a team, discussing each dimension before assigning a number. The conversation is more valuable than the score.
Review and Recalibrate Monthly
ICE scores go stale. An idea that was hard to implement last quarter may be easy now because the engineering team built the necessary infrastructure. Review and update scores monthly.
Add a Learning Dimension
Expand ICE to ICEL: Impact, Confidence, Ease, Learning. Score each idea on how much you will learn from the result, regardless of whether it wins. This counteracts the confidence paradox and ensures your program includes genuinely exploratory experiments.
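A sketch of what the expanded score might look like; the aggregation choice (average versus product, equal versus custom weights) is an assumption your team would settle for itself.

```python
def icel_score(impact: int, confidence: int, ease: int, learning: int) -> float:
    """ICEL: Impact, Confidence, Ease, Learning, each scored 1-10.

    Learning rewards experiments whose results are informative even if they lose,
    which counteracts the confidence paradox. An unweighted average is used here
    for simplicity; a multiplicative or weighted version works the same way.
    """
    return (impact + confidence + ease + learning) / 4

# Hypothetical exploratory test: unlikely to win, but highly informative either way.
print(icel_score(impact=5, confidence=3, ease=4, learning=9))  # 5.25
```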
The Bigger Picture
Prioritization frameworks are tools, not truths. The goal is not to produce the perfect ranking. The goal is to ensure your experimentation program allocates resources to the highest-value tests while maintaining strategic balance.
Any framework that forces a structured conversation about what to test and why is better than no framework. ICE is a good starting point. Just be honest about its limitations and compensate for them deliberately.
FAQ
Is ICE good enough for a small team just starting with experimentation?
Yes. For teams running fewer than five tests per month, ICE provides enough structure without the overhead of more complex frameworks. Add sophistication as your program matures.
How do I prevent the highest-paid person's opinion from dominating ICE scoring?
Use blind scoring. Each team member scores independently before the group discussion. Reveal scores simultaneously and discuss divergences.
Should I factor in implementation risk as separate from Ease?
Yes, this is a good enhancement. A test might be easy to implement but risky (for example, changing a payment flow). Separate "effort" from "risk" to avoid conflating them.
How often should I re-prioritize my test backlog?
Monthly at minimum. Major re-prioritization should happen quarterly when business goals shift. Individual test scores should be updated whenever new information becomes available.