Why Most A/B Tests Start Wrong

The average CRO test idea sounds like this: "Let's test a bigger button." "Let's try a different headline." "Can we A/B test the hero image?"

These are ideas, not hypotheses. They tell you what to change, but not why it should work, who it should work for, or what mechanism drives the expected behavior change. When tests built on these non-hypotheses fail — and many do — you walk away with nothing. You can't learn from a test where you didn't state what you expected to learn.

The difference between a test program that compounds knowledge over time and one that churns through variants without building insight comes down almost entirely to how you write hypotheses.

What a Hypothesis Actually Is

A hypothesis is a specific, falsifiable prediction with a stated mechanism. It's not an idea. It's not a hunch. It's a structured claim that the data can either support or contradict — and either outcome generates knowledge.

The key word is mechanism. If your hypothesis doesn't state why a change should produce a behavioral outcome, you can't interpret a negative result. You changed the button and conversion didn't improve — so what? Without a mechanism, you don't know if your change was wrong, your mechanism was wrong, or your measurement was flawed.

With a mechanism, a losing test tells you something specific: either the mechanism isn't operating (users aren't experiencing the cognitive load you assumed), the mechanism is operating but isn't influencing this decision (loss aversion exists but doesn't drive this choice), or the mechanism exists but your execution failed to trigger it (the new copy was meant to create urgency but users read it as pressure and disengaged).

The Full Hypothesis Template

Here's the template I've used across 100+ tests:

**"Because [observation/data], we believe [specific change] will [measurable outcome] for [target audience], because [behavioral/psychological mechanism]."**

The five components:

  1. Because [observation/data]: What evidence prompted this hypothesis? A heatmap showing users ignore the CTA? An exit survey where users said they didn't understand the pricing? A funnel analysis showing 40% drop-off at step 2? The observation grounds the hypothesis in data, not preference.
  2. We believe [specific change]: What exactly are you changing? Not "a better headline" — but "changing the headline from 'Energy Plans for Your Home' to 'Cut Your Energy Bill by 20% — Guaranteed.'" Specificity matters because vague changes produce uninterpretable results.
  3. Will [measurable outcome]: What metric moves? Clicks on the CTA? Checkout starts? Free trial sign-ups? Define the metric before you run the test. If you define it after, you're at risk of selecting the metric that happened to improve.
  4. For [target audience]: All users? New visitors only? Users on mobile? Returning users who haven't converted yet? Segmenting your expected outcome makes the hypothesis more testable and often reveals that changes work for some audiences and not others.
  5. Because [behavioral/psychological mechanism]: This is the most important part. Why would this change affect behavior? Cognitive load reduction, loss aversion, social proof, anchoring, the endowment effect, decision paralysis, reciprocity — what principle predicts the behavioral change you expect?

**Pro Tip:** Write the mechanism last, after you've written everything else. If you can't articulate a behavioral mechanism for a change you've proposed, that's a signal the change isn't ready to test. Either do more research or choose a different test.
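
If your team tracks tests in a tool or spreadsheet, the template also maps cleanly onto a structured record. Here's a minimal sketch in Python — the field names and example values are illustrative (loosely based on the CTA example later in this article), not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """One test hypothesis, stored as its five template components."""
    observation: str   # Because [observation/data]
    change: str        # we believe [specific change]
    outcome: str       # will [measurable outcome]
    audience: str      # for [target audience]
    mechanism: str     # because [behavioral/psychological mechanism]

    def statement(self) -> str:
        """Render the five components as the full hypothesis sentence."""
        return (
            f"Because {self.observation}, we believe {self.change} "
            f"will {self.outcome} for {self.audience}, "
            f"because {self.mechanism}."
        )

# Illustrative values only — not a real test record
h = Hypothesis(
    observation="only 12% of users who scroll to pricing click the CTA",
    change="changing the CTA copy to 'Start Saving on Your Energy Bill'",
    outcome="increase CTA clicks",
    audience="users who have been on the page more than 30 seconds",
    mechanism="outcome-led copy reduces the cognitive cost of evaluating the action",
)
print(h.statement())
```

Storing the components separately (rather than only the finished sentence) is what makes the library analyses described later in this article possible.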

5 Bad Hypotheses Rewritten as Good Ones

1. CTA Button Test

Bad: "Let's test changing the button color to orange."

Good: "Because our heatmap data shows only 12% of users who scroll past the pricing section click the current green CTA, and user session recordings show users pausing at the button for 2-3 seconds before leaving, we believe changing the CTA copy from 'Get Started' to 'Start Saving on Your Energy Bill' will increase CTA clicks from visitors who've scrolled to pricing, for users who've been on the page more than 30 seconds, because outcome-led copy reduces the cognitive cost of evaluating whether the action is worth taking."

The bad hypothesis gives you a result but no learning. The good hypothesis, if it loses, tells you either that outcome-led copy isn't the issue here, or that the scroll-and-pause behavior has a different cause you haven't identified yet.

2. Form Reduction Test

Bad: "Let's remove the phone number field from the lead form."

Good: "Because our form analytics show a 34% field abandonment rate at the phone number field (compared to 8% on all other fields), and post-conversion surveys indicate users report concerns about unwanted sales calls, we believe removing the phone number field will increase form completions for first-time visitors on our commercial energy landing pages, because reducing the perceived cost of form completion lowers the risk perception that is currently the primary abandonment driver."

3. Pricing Page Layout

Bad: "Test a highlighted 'most popular' tier on the pricing page."

Good: "Because our pricing page exit rate is 58% (vs. 32% sitewide) and our Hotjar data shows users spending 40% of time on the pricing page comparing the middle and upper tiers, we believe adding a 'Most Popular' badge to the Business tier and visually elevating it above the others will increase Business tier selection rate for new visitors comparing plans, because social proof anchoring ('other people chose this') reduces the decision complexity of choosing between undifferentiated options."

4. Hero Headline Test

Bad: "Test feature-led vs. benefit-led headline."

Good: "Because our user interviews (n=12) showed 8 of 12 users couldn't articulate what our product does after viewing the current feature-focused homepage headline, and our bounce rate on homepage is 71% for organic search visitors, we believe changing the headline from 'Advanced Energy Management Platform for Commercial Properties' to 'Cut Commercial Energy Costs by 15-30% — Without Changing Your Operations' will reduce homepage bounce rate for organic search visitors, because outcome-led copy answers the visitor's implicit question ('what will this do for me?') before they invest reading time."

5. Social Proof Test

Bad: "Add more testimonials above the fold."

Good: "Because our click-through rate from homepage to sign-up is 4.2% and user session data shows 67% of users who reach the testimonial section (below the fold) are 3x more likely to click through than those who don't, we believe moving the testimonials section above the fold — directly below the hero — will increase homepage-to-signup CTR for new organic visitors, because social proof from identified users in similar roles reduces purchase risk perception before the user has decided whether to invest attention in the rest of the page."

How a Good Hypothesis Turns a Losing Test into a Learning

This is the ROI of good hypothesis writing that almost no one talks about.

When "let's test a bigger button" loses, you have no information. You know button size probably doesn't matter — but that's it.

When your well-structured hypothesis loses, you have a map of what to test next:

If the mechanism was "outcome-led copy reduces cognitive cost" and the test lost, you have three leads:

  1. The cognitive cost framing was wrong — maybe the actual barrier isn't cognitive cost but trust
  2. The mechanism is right but your execution was off — the copy was outcome-led but not specific enough to the user's actual outcome
  3. The mechanism is right and it is operating — users respond to the new copy, but something else in the funnel is preventing conversion

Each of these leads to a better next hypothesis. The test program compounds.

This is why the behavioral mechanism is the most important part of the template. It's not just philosophical rigor — it's the branching logic for your next test.

**Pro Tip:** After every test, regardless of outcome, write a "post-test hypothesis" — one sentence explaining what you now believe to be true (or newly uncertain) based on the result. This discipline, run consistently, builds an institutional knowledge base that junior team members can use to onboard faster and senior team members can use to avoid re-running tests you've already resolved.

Connecting Hypotheses to Behavioral Science Principles

Good hypotheses almost always reference an established behavioral principle. Here are the six most commonly applicable in CRO work:

Loss aversion: People weight losses roughly twice as heavily as equivalent gains. Apply when: framing offers as "what you'll lose by not acting" vs. "what you'll gain by acting." Relevant for: CTAs, pricing page copy, trial expiration messaging.

Social proof: People look to others' behavior to resolve uncertainty about what to do. Apply when: adding testimonials, user counts, review scores, or "most popular" signals. Relevant for: pricing pages, sign-up pages, landing pages with undifferentiated offers.

Cognitive load: People have limited mental bandwidth. Reducing the number of decisions, fields, or options presented at once reduces friction. Apply when: form design, navigation structure, option count on pricing pages.

Anchoring: People rely heavily on the first number or piece of information presented when making subsequent judgments. Apply when: pricing order (show highest price first), discount framing (show original price), feature comparison tables.

Commitment and consistency: People behave consistently with prior commitments and self-perception. Apply when: multi-step form design (foot-in-the-door), free trial-to-paid conversion, onboarding flows.

Reciprocity: People feel compelled to return a favor after receiving something of value. Apply when: lead magnet design, freemium feature gating, content-before-conversion flows.

When your hypothesis references one of these, you can search the academic literature and prior test data for evidence of the mechanism's effect size. That makes your duration estimate and minimum detectable effect (MDE) more grounded.
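
As a rough illustration of how a grounded effect-size estimate feeds the duration math, here's a sketch using statsmodels' power calculations. The baseline mirrors the 4.2% CTR from example 5 above; the expected lift and weekly traffic are placeholder assumptions you would replace with your own numbers:

```python
# Sketch: translate an expected effect into a sample-size / duration estimate.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline_rate = 0.042           # current homepage-to-signup CTR (example 5)
expected_rate = 0.050           # lift you believe the mechanism can produce (assumption)
weekly_visitors_per_arm = 5000  # traffic available to each variant (assumption)

# Cohen's h effect size for the two conversion rates
effect_size = proportion_effectsize(expected_rate, baseline_rate)

# Visitors needed per arm for 80% power at alpha = 0.05 (two-sided)
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, ratio=1.0
)

weeks = n_per_arm / weekly_visitors_per_arm
print(f"~{n_per_arm:,.0f} visitors per arm, ~{weeks:.1f} weeks at current traffic")
```

If the literature and your own archive suggest the mechanism rarely produces a lift that large, you'll see it immediately in an unrealistic test duration — before you've built anything.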

The Hypothesis Review Checklist

Before you build any test, run through five questions:

  1. Is the observation data-grounded? Can you point to a specific analytics finding, user research result, or session recording that motivates this test? If the answer is "it just seems like a good idea," do more research first.
  2. Is the change specific enough to isolate? If you're testing three changes at once (copy, color, and layout), you can't attribute a result to any one change. Make the change specific enough that a significant result is interpretable.
  3. Is the metric primary and pre-registered? The metric you're optimizing needs to be defined before the test launches. If you add a metric after seeing partial results, that metric is not valid.
  4. Is the mechanism falsifiable? "Users will like it better" is not a mechanism. "Reducing form fields below 5 reduces cognitive load and increases completion rate" is falsifiable. You can test and disprove it.
  5. What would you conclude if this test loses? If you can't answer this question, your hypothesis isn't specific enough. The answer to "what would you conclude if this loses" should always be: "then we've ruled out [specific mechanism] as the driver, which means the issue is more likely [alternative mechanism]."

**Pro Tip:** Build a shared hypothesis library in your documentation tool of choice. Every hypothesis — win or loss — gets archived with its mechanism, result, and post-test interpretation. Within 6-12 months of consistent practice, this library becomes the highest-value asset your CRO program has.
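
If your hypotheses live in a structured form, the completeness half of this review is easy to enforce mechanically; judgment questions (is the mechanism genuinely falsifiable? is the change isolated?) still need a human reviewer. A minimal sketch, with field names that simply mirror the five questions:

```python
# Sketch of a pre-launch completeness check. It only catches missing pieces;
# falsifiability and change isolation still require human review.
REQUIRED_FIELDS = [
    "observation",      # Q1: data-grounded observation
    "change",           # Q2: specific, isolatable change
    "primary_metric",   # Q3: pre-registered primary metric
    "mechanism",        # Q4: falsifiable behavioral mechanism
    "if_it_loses",      # Q5: what you'd conclude from a losing result
]

def review(hypothesis: dict) -> list[str]:
    """Return the checklist fields that are missing or empty."""
    return [f for f in REQUIRED_FIELDS if not hypothesis.get(f, "").strip()]

# Illustrative draft — this one goes back for revision
draft = {
    "observation": "34% field abandonment at the phone number field",
    "change": "remove the phone number field",
    "primary_metric": "form completion rate",
    "mechanism": "",       # no mechanism articulated yet
    "if_it_loses": "",
}
missing = review(draft)
if missing:
    print("Back for revision — missing:", ", ".join(missing))
```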

The Hypothesis Library: Making Knowledge Compound

The single biggest difference between a two-year-old CRO program that has run 100 tests and built a strong institutional knowledge base, and one that has run 100 tests but learned very little, comes down to hypothesis documentation.

If you archive your hypotheses and their mechanisms, you can:

  • Identify patterns (loss aversion framing works for us on checkout pages but not on landing pages — why?)
  • Avoid re-testing what you've already resolved
  • Onboard new team members to real institutional knowledge, not just tool training
  • Build meta-analyses of your own data (what effect size do social proof tests typically achieve in our product category?)

The structure for each archived hypothesis: the hypothesis statement, the mechanism, the test design, the primary metric and result, the post-test interpretation, and the next hypothesis it generated.
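
In practice a documentation tool is enough, but the same structure works as a flat, queryable record if you want to filter by mechanism or result later. A sketch of one archived entry — the values are illustrative, loosely drawn from the CTA example above, not real results:

```python
# Sketch of one hypothesis-library entry, keyed by the fields listed above.
# Values are illustrative placeholders, not real test results.
archived_test = {
    "hypothesis": "Because only 12% of users who scroll to pricing click the CTA, "
                  "we believe outcome-led CTA copy will increase CTA clicks for "
                  "engaged visitors, because it reduces the cognitive cost of "
                  "evaluating the action.",
    "mechanism": "cognitive load reduction (outcome-led copy)",
    "test_design": "A/B, 50/50 split, users on page > 30 seconds",
    "primary_metric": "CTA click-through rate",
    "result": "no significant difference at the planned sample size",
    "post_test_interpretation": "cognitive cost is unlikely to be the barrier here; "
                                "trust is the next candidate mechanism",
    "next_hypothesis": "test a risk-reducing element (e.g., guarantee) near the CTA",
}
```

With entries shaped like this, the pattern questions above become simple filters over the mechanism and result fields.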

Common Mistakes

Mistake 1: Post-hoc hypothesis writing. Writing the hypothesis after you see the data. This is the CRO equivalent of drawing the target around the arrow. It looks like rigor but produces no learning.

Mistake 2: Confusing an observation with a hypothesis. "Users aren't clicking the CTA" is an observation. "Because users aren't clicking the CTA, we believe..." is the start of a hypothesis. Observations motivate hypotheses — they aren't hypotheses themselves.

Mistake 3: Mechanism-free hypotheses. "We believe a shorter form will increase completions." Why? What mechanism? "Because reducing fields below 5 lowers the cognitive cost of form completion for users who are comparison-shopping across multiple vendors." That's a mechanism. Without it, you can't interpret a negative result.

Mistake 4: Testing gut feelings from stakeholders without translating them into hypotheses. "The CEO wants to test a video on the homepage" is a directive. Your job is to translate it: "Because new visitors are spending under 15 seconds on the homepage (session data), we believe adding a 60-second explainer video will increase time-on-page and sign-up rate for organic search visitors, because video reduces the cognitive effort of understanding a complex value proposition." Now it's a hypothesis.

What to Do Next

  1. Take your last 3 test ideas from your backlog and rewrite them using the full template. If you can't write the mechanism, put that test idea in a "needs more research" bucket and do a user interview or session recording review before scheduling it.
  2. Create a hypothesis library document. Archive your last 10 test results with the mechanism and post-test interpretation. Identify any patterns.
  3. Establish a pre-launch review process: before any test gets built, the hypothesis goes through the 5-question checklist. If it fails any question, it goes back for revision.

For templates, worked examples, and a hypothesis scoring framework integrated with Optimizely's project setup, see the Optimizely Practitioner Toolkit at atticusli.com/guides/optimizely-practitioner-toolkit.

Written by Atticus Li

Revenue & experimentation leader — behavioral economics, CRO, and AI. CXL & Mindworx certified. $30M+ in verified impact.