Most A/B test hypotheses I review are not hypotheses. They're guesses wearing a hypothesis costume — "we believe changing the button color will increase conversions" — with no mechanism, no specificity, and no way to learn anything whether the test wins or loses.
After running and reviewing several hundred experiments, I've come to think the hypothesis is the highest-leverage artifact in the entire testing process. A sharp hypothesis forces you to articulate why you expect a change to work, which makes the test worth running even when it loses. A vague hypothesis produces tests that are inconclusive by construction — you couldn't learn anything from them no matter the result.
This is the structure I use, the failure modes I watch for, and how to tell a real hypothesis from a guess.
What a real A/B test hypothesis contains
A hypothesis worth testing has four parts:
1. A specific change: exactly what's different in the treatment
2. A predicted effect: which metric moves, and in which direction
3. A mechanism: *why* the change should produce that effect (the theory of the user)
4. A magnitude: roughly how much you expect the metric to move
The mechanism is the part most teams skip, and it's the most important. A hypothesis without a mechanism is just a prediction. With a mechanism, you're testing a *belief about your users* — and that's what compounds across experiments.
A complete hypothesis reads like this:
Because new users can't tell which plan fits them (mechanism), adding a "most popular" badge to the mid-tier plan (change) will increase mid-tier selection (effect + direction) by 10-20% (magnitude), because the badge reduces choice-paralysis by signaling a default (mechanism, restated as the user theory).
Compare that to what usually crosses my desk:
We think adding a badge will improve conversion.
The second version can't fail informatively. If it wins, you don't know why. If it loses, you don't know whether the badge was wrong, the placement was wrong, or the whole premise was wrong. You ran a test and learned nothing.
The "Because / We expect / Measured by" template
The template I hand teams who are starting out:
Because [observation about user behavior], we expect that [specific change] will cause [specific metric] to [increase/decrease] by [magnitude], measured by [primary metric over time window].
Worked example:
Because 40% of trial users never connect a data source and those users churn at 3× the rate of users who do (observation), we expect that adding a one-click sample-data option to onboarding (change) will cause 14-day activation rate (metric) to increase by 15-25% (magnitude), measured by the percentage of signups who reach the activation event within 14 days.
Everything in that sentence is testable, falsifiable, and grounded in an observation. The observation ("40% never connect a source, they churn 3×") is what makes it a hypothesis rather than a hunch. You looked at data, found a pattern, and formed a theory about how to change it.
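One way to keep the template from being filled in vaguely is to treat it as a record with required slots. The sketch below is illustrative only: the `Hypothesis` class, its field names, and the validation rules are assumptions made up for this example, not part of any experimentation tool.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical record: one field per slot in the template."""
    observation: str     # what you saw in data or research
    change: str          # exactly what differs in the treatment
    metric: str          # the single primary metric
    direction: str       # "increase" or "decrease"
    min_lift_pct: float  # low end of the expected relative lift
    max_lift_pct: float  # high end of the expected relative lift
    measured_by: str     # metric definition and time window

    def missing_parts(self) -> list:
        """Return the names of slots that are blank or unusable."""
        missing = []
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(value, str) and not value.strip():
                missing.append(f.name)
        if self.direction not in ("increase", "decrease") and "direction" not in missing:
            missing.append("direction")
        if not 0 < self.min_lift_pct <= self.max_lift_pct:
            missing.append("magnitude")
        return missing

    def render(self) -> str:
        """Write the hypothesis out as the template sentence."""
        return (
            f"Because {self.observation}, we expect that {self.change} "
            f"will cause {self.metric} to {self.direction} by "
            f"{self.min_lift_pct:g}-{self.max_lift_pct:g}%, "
            f"measured by {self.measured_by}."
        )


sharp = Hypothesis(
    observation="40% of trial users never connect a data source "
                "and those users churn at 3x the rate",
    change="adding a one-click sample-data option to onboarding",
    metric="14-day activation rate",
    direction="increase",
    min_lift_pct=15,
    max_lift_pct=25,
    measured_by="the percentage of signups who reach the "
                "activation event within 14 days",
)
vague = Hypothesis(
    observation="",
    change="adding a badge",
    metric="conversion",
    direction="improve",
    min_lift_pct=0,
    max_lift_pct=0,
    measured_by="",
)

print(sharp.missing_parts())  # empty list: every slot is filled
print(vague.missing_parts())  # lists the slots the guess left blank
print(sharp.render())
```

A check like this only catches empty slots. Whether the observation is real and the magnitude is defensible still takes a human reviewer.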
Where the observation comes from
The strongest hypotheses start from an observation, not from an idea. The observation is usually one of four kinds:
Quantitative funnel data. A drop-off in your activation or conversion funnel. "73% of users abandon at the payment step" is the seed of a dozen hypotheses.
Qualitative signal. Session recordings, support tickets, user interviews. "Users repeatedly try to click the non-clickable plan name" tells you something the funnel numbers can't.
A behavioral principle. A known mechanism — defaults, loss aversion, social proof — that you have reason to believe applies to a specific decision your users make. (I've written separately about which behavioral principles actually replicate in commercial A/B tests; most don't, so use this source carefully.)
A competitor or analog. Something that worked in an adjacent context, with an explicit theory of why it would transfer to yours.
Hypotheses that start from "I have an idea" with no observation behind them are the ones that produce the most inconclusive tests. The idea might be good — but without an observation, you have no way to calibrate whether it's worth the test slot.
The three failure modes
Across the hypotheses I review, the same three failures recur.
Failure 1: No mechanism
"Changing X will improve Y." No *why*. These tests are coin flips. When they win, you can't generalize the learning to the next test. When they lose, you can't tell what was wrong. You spent two weeks of traffic to learn one binary fact about one specific change, with no transferable insight.
The fix: force the sentence "because [user behavior]." If you can't complete it, you don't have a hypothesis yet — you have a candidate change looking for a justification.
Failure 2: Compound changes
"We'll redesign the pricing page" is not a hypothesis. It's at least five hypotheses tangled together: new layout, new copy, new badge, new tier order, new CTA. If it wins, which change drove it? If it loses, did one terrible change mask three good ones?
The fix: isolate the variable. One mechanism per test. If you genuinely need to test a bundle (sometimes you do — a coherent redesign can have interaction effects), say so explicitly and accept that you're testing "this whole package vs the old one," not learning which element mattered.
Failure 3: No magnitude, so no power
"Will increase conversions." By how much? Without a magnitude estimate, you can't compute the sample size, which means you don't know how long the test needs to run. Then you'll either stop too early (peeking at interim results and stopping at the first significant reading inflates false positives) or run forever on an underpowered test that can't detect the effect you're hoping for.
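The fix: estimate the magnitude before you start, even roughly. A 2% lift and a 20% lift require wildly different sample sizes: the required sample scales roughly with the inverse square of the effect, so the 2% test needs on the order of 100 times the traffic. The magnitude estimate is what turns "let's run it and see" into a powered experiment with a defined stopping point.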
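To make the link between magnitude and sample size concrete, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions. The function name, the 5% baseline conversion rate, and the defaults (two-sided alpha of 0.05, 80% power) are assumptions chosen for illustration; a real test plan should use whatever your analysis method actually requires.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-sided two-proportion z-test.

    baseline_rate: control conversion rate, e.g. 0.05 (assumed value)
    relative_lift: expected relative change, e.g. 0.20 for a 20% lift
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    if relative_lift == 0 or not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("rates must be in (0, 1) and the lift must be nonzero")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of per-arm variances

    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


for lift in (0.02, 0.20):
    n = sample_size_per_arm(baseline_rate=0.05, relative_lift=lift)
    print(f"{lift:.0%} relative lift: about {n:,} users per arm")
```

Under these assumptions the 20% lift needs several thousand users per arm and the 2% lift needs several hundred thousand. Dividing the total across both arms by your eligible weekly traffic gives a run length you can commit to before the test starts.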
A real hypothesis vs a guess: the test
When I'm deciding whether a proposed hypothesis is worth a test slot, I ask three questions:
1. Can it lose informatively? If the test loses, will we have learned something about our users — or just that this one change didn't work? Real hypotheses teach you something either way.
2. Is the mechanism falsifiable? Could the data plausibly show the mechanism is wrong? "Users want simplicity" is unfalsifiable. "Users abandon because there are too many plan options" is falsifiable — reducing options will or won't move the metric.
3. Would I bet on the magnitude? If I had to put money on the predicted lift range, would I? If the honest answer is "I have no idea how big the effect would be," the hypothesis isn't grounded enough yet.
A proposed test that passes all three is worth running. One that fails any of them goes back for sharpening before it takes a traffic slot.
Prioritizing hypotheses once you have several
A good experimentation program generates more hypotheses than it can test. The constraint is traffic, not ideas. The frameworks people reach for — ICE (Impact, Confidence, Ease), PIE (Potential, Importance, Ease) — are fine scaffolding, but the input that matters most is the one teams fudge: confidence.
Confidence should come from the strength of the observation behind the hypothesis, not from how much you like the idea. A hypothesis backed by "40% drop-off with 3× churn" deserves higher confidence than one backed by "this worked at my last company." Rate confidence by the evidence, not the enthusiasm, and your prioritization stops being theater.
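A rough sketch of what "rate confidence by the evidence" could look like in an ICE-style score: confidence is looked up from the kind of observation behind the hypothesis instead of being typed in by whoever proposed it. The rubric values, the `Candidate` class, and the choice to average the three inputs are all assumptions for illustration, not calibrated numbers or a standard.

```python
from dataclasses import dataclass

# Hypothetical rubric: confidence (1-10) by evidence type.
# The values are illustrative assumptions; set your own.
EVIDENCE_CONFIDENCE = {
    "funnel_data": 8,
    "qualitative_signal": 6,
    "behavioral_principle": 4,
    "analog": 3,
    "no_observation": 1,
}


@dataclass
class Candidate:
    name: str
    impact: int    # 1-10, estimated from the predicted magnitude
    ease: int      # 1-10, higher means cheaper to build
    evidence: str  # key into EVIDENCE_CONFIDENCE


def ice_score(candidate):
    """Average impact, evidence-derived confidence, and ease."""
    confidence = EVIDENCE_CONFIDENCE[candidate.evidence]
    return (candidate.impact + confidence + candidate.ease) / 3


def prioritize(candidates):
    """Return candidates sorted from highest to lowest score."""
    return sorted(candidates, key=ice_score, reverse=True)


backlog = [
    Candidate("most-popular badge (worked at my last company)",
              impact=7, ease=5, evidence="analog"),
    Candidate("one-click sample data (40% drop-off, 3x churn)",
              impact=7, ease=5, evidence="funnel_data"),
]

for c in prioritize(backlog):
    print(f"{ice_score(c):.1f}  {c.name}")
```

With identical impact and ease, the funnel-backed hypothesis ranks first purely because of its evidence, which is the behavior you want from the confidence input.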
How the hypothesis shapes everything downstream
The reason I treat the hypothesis as the highest-leverage artifact is that it determines the quality of everything after it:
• The primary metric falls out of the predicted effect. A sharp hypothesis names its metric; a vague one leaves you metric-shopping after the fact (a recipe for false positives).
• The sample size falls out of the magnitude estimate. No magnitude, no power analysis, no defined stopping point.
• The analysis falls out of the mechanism. If you predicted *why*, you know which segments to check and which guardrails matter.
• The learning falls out of all of it. A test run from a sharp hypothesis adds to your model of your users whether it wins or loses. A test run from a guess adds a single disposable fact.
Teams that write sharp hypotheses run fewer tests and learn more from each. Teams that write guesses run more tests and accumulate a pile of inconclusive results they can't act on.
The one-sentence discipline
If you take one thing from this: before any test, make yourself complete this sentence out loud, with no blanks:
Because \_\_\_\_ (something you observed), we expect \_\_\_\_ (specific change) to \_\_\_\_ (move a named metric, a named direction, a rough amount), because \_\_\_\_ (your theory of the user).
If you can't fill every blank with something specific and grounded, you don't have a hypothesis yet. You have a guess. Sharpen it before it costs you two weeks of traffic.
That single discipline — refusing to run a test until the sentence is complete — has done more for the experimentation programs I've worked with than any tool, framework, or statistical technique.