The single biggest quality problem in most experimentation programs isn't statistical—it's conceptual. Teams run tests without hypotheses. They have ideas, and they call those ideas hypotheses, but an idea is not a hypothesis.
"Let's test a bigger CTA button" is not a hypothesis. It's a design preference dressed up as an experiment. And when it loses—which it usually does—you learn nothing useful.
After 100+ experiments, one pattern is clear: the quality of your hypothesis determines the value of your test, win or lose. This guide will show you exactly how to write hypotheses that produce real learning regardless of outcome.
Why "Let's Test a Bigger CTA Button" Is Not a Hypothesis
A hypothesis is a falsifiable prediction derived from an observation, grounded in a theory about why the prediction should be true.
"Let's test a bigger CTA button" has none of those elements:
- There's no observation driving it (what data suggested the button size was a problem?)
- There's no prediction (bigger in what way? On what metric?)
- There's no theory (why would size affect behavior?)
When this test loses, you conclude "button size doesn't matter" and move on. But you haven't actually answered anything. You tested one specific size change in one specific context without understanding the mechanism. The next person who wants to test "a bigger button" will run the same test and reach the same wrong conclusion.
A proper hypothesis would sound like: "Because our heatmap data shows only 12% of users scroll far enough to see our CTA, we believe moving the CTA above the fold will increase form submission rate for first-time visitors because it reduces the effort required to take action and makes the conversion opportunity visible without requiring active engagement."
Now when you test this and it loses, you learn something: scroll depth isn't the primary barrier. The friction is something else—maybe the copy, maybe the trust signals around the form, maybe the value proposition. Your next hypothesis is smarter because this one failed intelligently.
The Hypothesis Structure That Actually Works
Here's the framework I use for every test:
"Because [observation/data], we believe [change] will [measurable outcome] for [audience] because [behavioral/psychological mechanism]."
Break it down:
[Observation/data] — What evidence prompted this idea? Heatmap data, funnel drop-off rates, session recordings, user research, support tickets, survey responses. Cite the specific data point. Not "users seem to struggle" but "our funnel data shows 67% of users who add an item to cart abandon before entering shipping information."
[Change] — What, specifically, are you testing? Not "a different checkout flow" but "displaying shipping cost estimates on the product page before users initiate checkout."
[Measurable outcome] — What metric will move, and in what direction? "Will increase checkout completion rate." "Will decrease bounce rate." "Will increase revenue per visitor." Pick one primary metric. Be specific about direction.
[Audience] — Who does this apply to? All users? Mobile users only? New visitors? Users referred from paid search? Segmenting your hypothesis forces you to think about whether the behavioral mechanism applies universally or only in specific contexts.
[Behavioral/psychological mechanism] — This is the most important part. Why will this change produce that outcome for that audience? What theory of human behavior or decision-making predicts this result?
**Pro Tip:** If you can't articulate the behavioral mechanism, you don't have a hypothesis—you have a guess. The mechanism is what makes a test learnable. It's the difference between "this worked" and "this worked because people respond to X in context Y."
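One lightweight way to enforce the five-part structure is to encode it as a small data object and refuse to greenlight a test until every field is filled in. A minimal Python sketch (the class and field names are illustrative, not part of any real tool):

```python
from dataclasses import dataclass, fields

@dataclass
class Hypothesis:
    """The five required components of a testable hypothesis."""
    observation: str   # e.g. "Heatmap: only 12% of users scroll to the CTA"
    change: str        # e.g. "Move the CTA above the fold"
    outcome: str       # e.g. "Increase form submission rate"
    audience: str      # e.g. "First-time visitors"
    mechanism: str     # e.g. "Reduced effort: conversion visible without scrolling"

    def missing_components(self) -> list[str]:
        """Names of any components left empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def is_complete(self) -> bool:
        return not self.missing_components()

# "Let's test a bigger CTA button" fails the check immediately:
guess = Hypothesis(observation="", change="Bigger CTA button",
                   outcome="", audience="", mechanism="")
print(guess.missing_components())  # ['observation', 'outcome', 'audience', 'mechanism']
```

The point isn't the code itself; it's that "a guess" and "a hypothesis" become mechanically distinguishable, so a half-formed idea can't slip into the build queue.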
The Behavioral Mechanism Library
These are the mechanisms I return to most often. Knowing them turns vague ideas into testable predictions:
Loss aversion — People feel losses roughly twice as strongly as equivalent gains. Applications: showing what users lose by not acting (vs. what they gain), free trial expiration messaging, abandoned cart recovery.
Social proof — People look to others' behavior when uncertain. Applications: review counts, "X people bought this today," trust badges, customer logos.
Cognitive load — Every additional decision or piece of information increases mental effort and reduces conversion likelihood. Applications: reducing form fields, simplifying pricing tables, progressive disclosure, removing navigation from landing pages.
Anchoring — First numbers seen influence perception of subsequent numbers. Applications: showing original price before sale price, displaying highest plan first on pricing pages, showing full contract value before monthly equivalent.
Completion pressure — People are motivated to complete things they've started. Applications: progress bars, "almost there" messaging, step-count indicators in multi-step flows.
Authority — Credentials, certifications, and expertise reduce uncertainty. Applications: "As seen in" press mentions, security certifications in checkout, professional credentials near claims.
Urgency and scarcity — Limited availability or time increases decision motivation. Applications: low stock indicators, countdown timers, limited-time offers. (Use carefully—misuse erodes trust.)
**Pro Tip:** Keep a behavioral mechanism reference card in your hypothesis template. When someone proposes a test without a mechanism, the card gives you vocabulary to help them find one. Most good ideas have a real mechanism buried in them—you just need to excavate it.
Five Bad Hypotheses Rewritten
Example 1: Homepage Headline
Bad: "Let's test a different homepage headline."
Good: "Because our exit survey data shows 34% of homepage bounces cite 'not sure what this product does' as their reason for leaving, we believe changing the headline from [current product-forward copy] to [benefit-forward copy focused on the primary use case] will decrease bounce rate for new visitors because reducing ambiguity lowers the cognitive effort required to determine product relevance."
Example 2: Checkout Form
Bad: "Let's simplify the checkout form."
Good: "Because session recordings show users pausing significantly at the 'Company Name' field during personal purchases, and 23% of cart abandonment happens on the billing information page, we believe making 'Company Name' an optional field (visually de-emphasized) will increase checkout completion rate for non-business purchasers because removing an irrelevant required field reduces friction without affecting users who need the field."
Example 3: Pricing Page
Bad: "Let's add testimonials to the pricing page."
Good: "Because our funnel shows 41% of users who visit the pricing page leave without clicking any plan, and our post-signup survey shows 'not sure if it will work for my situation' as the top pre-purchase concern, we believe adding three customer case studies segmented by industry to the pricing page will increase plan selection click-through for users who scroll below the pricing table because social proof from relevant peers reduces uncertainty about fit, addressing the specific objection our survey identified."
Example 4: Email Signup
Bad: "Let's test a different CTA for the email newsletter."
Good: "Because our heatmap shows users engage with the newsletter sign-up section (11% of homepage visitors scroll to it) but our signup rate is only 0.4%, we believe changing the CTA from 'Subscribe' to a benefit-specific label ('Get weekly CRO insights') will increase email signup rate for engaged scrollers because specificity sets accurate expectations and reduces the ambiguity that creates hesitation at opt-in moments."
Example 5: Product Page
Bad: "Let's test showing more product images."
Good: "Because customer support receives weekly questions about [product] dimensions and installation requirements, and our return rate for this product is 2x the category average, we believe adding a contextual scale photo (showing the product next to a common object for size reference) and a brief 'What's included' section to the product page will increase add-to-cart rate and reduce post-purchase returns because addressing pre-purchase uncertainty reduces both abandonment and post-purchase regret—both driven by the same information gap."
Hypothesis vs. Prediction — Know the Difference
A hypothesis explains why something should happen. A prediction specifies what you expect to observe in the data.
You need both.
Hypothesis: "Because showing the number of recent purchases creates social proof, we believe adding a '[X] people bought this in the last 24 hours' indicator will increase product page CVR for users who view products with 10+ recent purchases, because social proof reduces purchase uncertainty in contexts where users can't physically evaluate the product."
Prediction: "We expect to see a 5–10% relative lift in add-to-cart rate, with stronger effects on products with higher recent purchase counts. We do not expect the effect on products with fewer than 5 recent purchases (where the social proof signal would be weak) to be statistically significant."
The prediction tells you what data to look at during analysis. Without it, you'll data-fish—looking at every metric until something looks significant.
Write both before you build the test. Lock them in writing. Then analyze against them.
**Pro Tip:** Add a "prediction" field to your Optimizely experiment description alongside your hypothesis. When you analyze results, start by checking whether the prediction was accurate—not by mining for any positive signal.
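Checking a prediction against results can be as mechanical as comparing the observed relative lift to the range you locked in before launch. A short sketch, with made-up rates for illustration:

```python
def relative_lift(control_rate: float, variant_rate: float) -> float:
    """Relative lift of variant over control, e.g. 0.05 means +5%."""
    return (variant_rate - control_rate) / control_rate

def prediction_held(control_rate: float, variant_rate: float,
                    low: float, high: float) -> bool:
    """Did the observed lift land inside the pre-registered range?"""
    lift = relative_lift(control_rate, variant_rate)
    return low <= lift <= high

# Pre-registered prediction: 5-10% relative lift in add-to-cart rate.
observed = relative_lift(0.040, 0.043)  # control 4.0%, variant 4.3% -> +7.5%
print(f"{observed:.1%}", prediction_held(0.040, 0.043, 0.05, 0.10))
```

Starting analysis from "did the prediction hold?" rather than "did anything move?" is exactly the discipline that prevents data-fishing.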
How Good Hypotheses Make Losing Tests More Valuable
When you lose a test with a weak hypothesis: "The bigger button didn't win. Move on."
When you lose a test with a strong hypothesis: "We hypothesized that visibility was the barrier (users weren't seeing the CTA). The test moved the CTA to where 100% of users would see it, and CVR didn't change. Therefore, visibility is not the problem. The barrier is earlier or more fundamental—probably the value proposition or trust. Next test: address trust signals on the same page."
That's compounding knowledge. Every test builds on the last one because you understand not just what happened but why. Within three or four tests, you're running the experiment it would have taken 10 random tests to reach.
This is why the behavioral mechanism is the most important part of the hypothesis. It's the thread you pull when the test loses.
Documenting Hypotheses in Optimizely
Use Optimizely's description field religiously. Here's the template I put in every experiment:
HYPOTHESIS: Because [observation + data citation], we believe [change] will [metric + direction] for [audience] because [behavioral mechanism].
PREDICTION: Expected lift: [range]. Primary metric: [exact metric name in Optimizely]. Secondary metrics: [list]. Guardrail metrics: [what we must not harm].
DATA SOURCE: [Link to analytics dashboard, session recording, survey data].
TEST PARAMETERS: Traffic allocation: [%]. Target sample size: [n per variant]. Expected runtime: [weeks]. Scheduled analysis date: [date].
This takes five minutes per experiment. It saves hours of archaeology when you're analyzing results six weeks later.
**Pro Tip:** Create a custom Optimizely tag for the behavioral mechanism type (loss-aversion, social-proof, cognitive-load, etc.). This lets you query your experiment history by mechanism and find patterns: "We've run four social proof tests. Three won on mobile, zero won on desktop. Why?"
Building a Hypothesis Library
Every completed experiment is raw material for future hypotheses. The teams that compound their learning fastest are the ones that systematically harvest insights from past tests.
A hypothesis library is a searchable record of:
- The original hypothesis (observation + change + mechanism)
- What happened (won/lost/inconclusive, lift magnitude)
- What we learned (the insight, regardless of outcome)
- Implications for future tests (what hypotheses does this generate?)
Build this in Notion, Confluence, Airtable—wherever your team actually works. Link each entry to the Optimizely experiment. Tag by mechanism, page type, audience, and date.
The library becomes your most valuable asset after 6–12 months. When a new team member joins, they can read the library and understand your customers in a week. When you're planning next quarter's roadmap, the library tells you where you have evidence gaps.
**Pro Tip:** End every experiment analysis with a "What hypotheses does this generate?" section. Even a winning test generates hypotheses: "Since urgency messaging worked on the cart page, does it work on the product page? Does it work differently for high-consideration vs. impulse purchases?"
The Hypothesis Review: Before You Build Anything
Before a test goes to development, run it through this checklist:
- Does the hypothesis have all five components? (Observation, change, metric, audience, mechanism)
- Is the observation cited with actual data, not intuition?
- Is the metric specific and trackable in Optimizely?
- Is the behavioral mechanism real and plausible?
- Is the prediction written down?
- Has the sample size been calculated?
- Are the guardrail metrics defined?
If any answer is no, send it back for refinement. A test that fails the hypothesis review will produce results you can't interpret, no matter how well it's built.
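For the sample-size item on the checklist, the standard two-proportion power calculation is easy to script. A sketch using the usual normal-approximation formula, assuming a two-sided α of 0.05 (z ≈ 1.96) and 80% power (z ≈ 0.84):

```python
from math import ceil, sqrt

def sample_size_per_variant(baseline: float, mde_relative: float,
                            z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Approximate n per variant for a two-proportion z-test.

    baseline:     control conversion rate, e.g. 0.05 for 5%
    mde_relative: minimum detectable relative lift, e.g. 0.10 for +10%
    z_alpha:      z for two-sided alpha = 0.05; z_beta: z for 80% power
    """
    p1 = baseline
    p2 = baseline * (1 + mde_relative)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# 5% baseline conversion, detecting a 10% relative lift:
print(sample_size_per_variant(0.05, 0.10))  # roughly 31,000 per variant
```

Running this before the review meeting turns "has the sample size been calculated?" from a debate into a yes/no question, and it makes painfully clear why small relative lifts on low-baseline metrics need long runtimes.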
Common Mistakes
Writing the mechanism after the test — Some teams write hypothesis documentation after the test is built, to justify what's already been decided. This defeats the entire purpose. The hypothesis should exist before anyone writes a single line of code.
Mechanism = "users will prefer it" — "Users will prefer the new design because it looks cleaner" is not a behavioral mechanism. "Users will convert at a higher rate because reducing visual complexity decreases cognitive load at the decision moment" is a mechanism. Preference is not a mechanism.
Testing too many changes at once — If you change the headline, the image, the CTA copy, and the button color simultaneously, you can't attribute a win or loss to any individual change. Test one change per hypothesis unless you're doing a full-page multivariate test with explicit interaction effect analysis.
Ignoring audience segmentation — A hypothesis that's true for new visitors may be false for returning customers. A mechanism that works on desktop may fail on mobile. Segment your hypothesis and your analysis.
Copying competitors without a mechanism — "Our competitor does X, so we should test X" is not a hypothesis. It might generate one: "If our competitor's high-converting checkout shows progress indicators, and progress indicators reduce abandonment by activating completion pressure, then..." But the competitor's behavior is data, not a mechanism.
What to Do Next
- Take your three top test ideas and rewrite them using the full hypothesis structure.
- For each, identify the behavioral mechanism from the framework list above.
- Write a prediction for each—what you expect to observe in the data, including magnitude.
- Set up the Optimizely description template for all future experiments.
- Start your hypothesis library now, with your three oldest completed tests as the first entries.
Once your hypotheses are solid, the next step is making sure you're measuring the right things. Read *Metrics by Revenue Model: What to Measure Based on Your Business Type* to make sure your primary metrics actually reflect business outcomes.