The redesign looked fantastic. Stakeholders loved it. The primary metric held steady. And enrollment completions quietly dropped by double digits for weeks before anyone noticed.

This is the story of how my team shipped a homepage variant that objectively improved the user experience and destroyed downstream conversion in the process. I have spent years leading experimentation at a Fortune 150 energy company, where our optimization program has driven over $30M in verified revenue impact in 2025 alone. This was still one of the most instructive failures I have encountered.

If you run A/B tests and judge winners by a single metric, this case study is a warning.

What We Changed

The test was a homepage redesign with three major changes bundled together:

1. Shorter page with brand-forward messaging. We cut roughly 40% of the page length. The new version led with emotional brand copy instead of product-focused content. It felt premium and modern.

2. Modal overlay replaced dedicated plan search pages. In the control, clicking “Find a Plan” took users to a full dedicated page with filters, results, and a clear path to enrollment. In the variant, that same action opened a modal overlay on top of the homepage. Same functionality, fewer page loads, theoretically less friction.

3. Simplified navigation hierarchy. We consolidated several entry points into a streamlined flow.

The design team was proud of the work. The variant tested well in usability sessions. Internal stakeholders preferred it unanimously.

The Flat Primary Metric Trap

Our primary metric was engagement with the plan search flow — specifically, the rate at which homepage visitors initiated a plan search. After two weeks at full traffic allocation, the variant showed a flat result: no statistically significant difference from control.
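For concreteness, this is roughly what that kind of read looks like as a two-proportion z-test. The counts below are hypothetical, not our actual test data, and the check uses statsmodels rather than whatever your experimentation platform runs internally.

```python
# Hypothetical counts -- not the real test data -- showing how a "flat"
# primary-metric read is typically reached with a two-proportion z-test.
from statsmodels.stats.proportion import proportions_ztest

searches = [9_840, 9_905]        # plan-search initiations: control, variant
visitors = [120_000, 120_000]    # homepage visitors per arm

z_stat, p_value = proportions_ztest(count=searches, nobs=visitors)
print(f"control rate: {searches[0] / visitors[0]:.4f}")
print(f"variant rate: {searches[1] / visitors[1]:.4f}")
print(f"p-value:      {p_value:.3f}")  # well above 0.05 -> "flat" on the primary metric
```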

The team interpreted “flat” as “safe to ship.” The logic was simple: if the primary metric is not worse, and the design is better, then we should ship the new experience. This reasoning felt airtight.

It was completely wrong.

Flat on the primary metric does not mean flat everywhere. It means you have not yet found where the impact is hiding. In experimentation, flat is not a verdict. It is an invitation to look deeper.

The Signal Nobody Checked

Three weeks after the test launched, our analytics team flagged an anomaly in a weekly business review. Enrollment confirmations — the final step where a customer actually signs up for an energy plan — were down. Not by a trivial amount. The decline was consistent and patterned, tracking precisely with the test’s traffic allocation.

We had been so focused on the primary metric that nobody had pulled the downstream funnel data until it showed up in an unrelated report.

This is the most dangerous pattern in experimentation: flat primary, declining downstream. It is more dangerous than a clear loss on the primary metric, because a clear loss triggers investigation. A flat primary metric triggers complacency.

The enrollment decline was large enough to represent meaningful lost revenue every week the test ran. And because the primary metric looked fine, the test had been left running at full traffic allocation while we “monitored” the flat result.

The Diagnosis

We pulled the full funnel data and the picture became clear immediately.

Control flow: User clicks “Find a Plan” and lands on a dedicated page. This page transition is a commitment signal. The user has left the homepage. They are now in task-completion mode. The dedicated page has one job: help you find and enroll in a plan. Cognitive context is narrow and focused. Users who reached this page completed enrollment at a strong rate.

Variant flow: User clicks “Find a Plan” and a modal opens on top of the homepage. The homepage is still visible behind the overlay. The modal feels temporary and dismissible — because it is. The user has not made a commitment. They are browsing. The X button is right there. Closing the modal returns them to exactly where they were, with zero cost.

The modal had the same fields, the same filters, the same plan results. Functionally identical. Psychologically, it was a completely different experience.

Why Better UX Killed Conversion

This is the core insight that changed how I think about experimentation: UX quality and conversion effectiveness are not the same axis.

The modal was genuinely better UX by conventional standards. It was faster, required fewer page loads, and kept the user in context. Every UX heuristic would rate it higher than the control. But conversion is not about usability scores. It is about behavioral commitment.

In behavioral science, this maps to escalation of commitment, a close relative of the sunk cost effect: each step a user takes deeper into a flow increases their psychological investment. Navigating to a dedicated page is a step. It costs attention and effort. That cost is precisely what makes the user more likely to complete the next step.

The modal eliminated that cost. And in doing so, it eliminated the commitment signal that preceded enrollment. The user never crossed a threshold. They peeked through a window instead of walking through a door.

This is why you cannot evaluate interaction pattern changes using the same metric as content changes. When you change how users engage with a flow — not just what they see — you must measure behavioral outcomes, not just engagement rates.

The Hidden Second Effect

There was a second, subtler problem we almost missed.

The shorter homepage in the variant redistributed scroll behavior. With less content above the fold and fewer sections to scroll through, users reached the bottom of the page faster. At the bottom was a zipcode entry widget — a secondary conversion path.

In the control, most users never scrolled far enough to see this widget. In the variant, a significantly higher percentage of users interacted with it. This zipcode widget led to a different flow that had a lower enrollment completion rate.

So the variant did not just weaken the primary path. It accidentally strengthened a weaker secondary path. The net effect compounded the enrollment decline.

This is a common pattern in page-length experiments that nobody talks about. When you shorten a page, you do not just remove content. You change the probability distribution of where users end up. Every element that was previously below the scroll threshold becomes more visible and more likely to capture attention.
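To make the compounding effect concrete, here is a toy model with made-up numbers. It only illustrates the mechanics: when the same number of users initiate a plan-finding action but more of them land in a weaker path, blended enrollment falls even though the primary engagement metric reads flat.

```python
# Toy numbers only -- made up to illustrate the mechanics, not our actual funnel rates.
def enrollments(path_counts, path_cvrs):
    """Total enrollments given per-path initiation counts and completion rates."""
    return sum(n * cvr for n, cvr in zip(path_counts, path_cvrs))

# Per 1,000 homepage visitors, 80 users initiate a plan-finding action in both arms,
# so the primary metric reads flat. What changes is which path they take and how
# well each path completes.
#                 (dedicated page / modal, zipcode widget)
control = enrollments(path_counts=(70, 10), path_cvrs=(0.30, 0.12))
variant = enrollments(path_counts=(55, 25), path_cvrs=(0.22, 0.12))

print(f"control enrollments per 1,000 visitors: {control:.1f}")
print(f"variant enrollments per 1,000 visitors: {variant:.1f}")
print(f"net change: {(variant - control) / control:.1%}")  # roughly -32% in this toy example
```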

What I Do Differently Now

This experiment changed four things about how I run tests:

Rule 1: Never bundle content changes with interaction pattern changes. Changing copy, page length, or visual design is one category. Changing how users navigate between steps — modals vs. pages, inline vs. redirect, tabs vs. accordions — is a fundamentally different category. Test them separately.

Rule 2: Always pull downstream metrics before calling a test. Every test analysis now includes at least two steps beyond the primary metric. If any downstream metric moves in the opposite direction of the primary, the test gets flagged for deep investigation.

Rule 3: Size your tests against downstream metrics, not just the primary. We used to calculate sample size based on the primary metric’s conversion rate. Now we also calculate the sample needed to detect a meaningful change in the downstream metric. If the downstream metric needs more traffic to reach significance, we run the test longer.

Rule 4: Use a 3-phase redesign sequence. Phase 1 tests content and messaging only with the same interaction patterns. Phase 2 tests interaction pattern changes only with the winning content. Phase 3 combines them. This takes longer but produces clearer signals and reduces the risk of shipping a hidden regression.

Key Takeaways

  • A flat primary metric is not a green light — it is a signal to investigate downstream metrics immediately
  • Modals and dedicated pages are not interchangeable; they create fundamentally different psychological commitment levels
  • Removing friction from a conversion flow can remove the commitment signals that drive completion
  • Shorter pages redistribute attention to elements that were previously below the scroll threshold
  • Never bundle content changes with interaction pattern changes in the same test
  • Size your experiments to detect meaningful shifts in downstream revenue metrics, not just top-of-funnel engagement

Frequently Asked Questions

How do you decide which downstream metrics to track for a homepage test?

Start with the business outcome your homepage exists to drive. For us, that is enrollment confirmation. Then work backward: what are the two or three intermediate steps between the homepage and that outcome? Those become your required downstream metrics. Every homepage test should track at least plan search initiation, plan selection, and enrollment completion.

Can modals ever work for high-intent conversion flows?

Yes, but only when the modal represents a micro-commitment, not the entire conversion path. A modal works well for a single-field action like entering a zipcode. It fails when it tries to contain a multi-step flow that benefits from dedicated focus. The key question is whether the action inside the modal requires sustained attention or just a quick input.

How long should you run a test before checking downstream metrics?

Do not wait. Include downstream metrics in your test dashboard from day one. The mistake we made was treating downstream metrics as a follow-up analysis rather than a core part of the test monitoring plan. Set up automated alerts for any downstream metric that diverges from the primary metric direction by more than your pre-defined threshold.
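As a sketch of what that monitoring can look like, here is a minimal divergence check. The metric names, lifts, and the threshold are placeholders; wire the inputs to whatever your experimentation platform actually reports.

```python
# Minimal sketch of a daily divergence alert. Metric names, lifts, and the
# threshold are placeholders -- adapt them to your own platform.
DIVERGENCE_THRESHOLD = 0.02  # flag if a downstream lift diverges from the primary by > 2 points

def check_divergence(primary_lift: float, downstream_lifts: dict[str, float]) -> list[str]:
    """Return downstream metrics whose relative lift diverges from the primary metric's."""
    alerts = []
    for metric, lift in downstream_lifts.items():
        if abs(lift - primary_lift) > DIVERGENCE_THRESHOLD:
            alerts.append(f"{metric}: {lift:+.1%} vs primary {primary_lift:+.1%}")
    return alerts

# Example read: primary looks flat while enrollment completion is down sharply.
alerts = check_divergence(
    primary_lift=0.002,
    downstream_lifts={"plan_selection": -0.015, "enrollment_completion": -0.12},
)
for alert in alerts:
    print("ALERT:", alert)
```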

What is the minimum sample size needed to detect downstream metric changes?

It depends on the baseline conversion rate of the downstream metric. Downstream metrics typically have lower conversion rates than primary metrics, which means you need larger samples to detect the same relative effect size. Run a power analysis on each downstream metric independently. If the downstream metric requires three times the sample of the primary, plan your test duration accordingly.
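Here is a minimal sketch of that per-metric power analysis using statsmodels. The baseline rates and the 5% relative lift are illustrative assumptions, not our actual numbers.

```python
# Illustrative baselines and effect size -- not our actual rates.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

def visitors_per_arm(baseline, relative_lift, alpha=0.05, power=0.8):
    """Sample size per arm needed to detect a relative lift on a conversion rate."""
    effect = proportion_effectsize(baseline * (1 + relative_lift), baseline)
    return NormalIndPower().solve_power(effect_size=effect, alpha=alpha, power=power,
                                        alternative="two-sided")

# Primary metric (plan search initiation) vs. downstream metric (enrollment completion).
primary_n = visitors_per_arm(baseline=0.08, relative_lift=0.05)
downstream_n = visitors_per_arm(baseline=0.015, relative_lift=0.05)

print(f"per-arm sample for the primary metric:    {primary_n:,.0f}")
print(f"per-arm sample for the downstream metric: {downstream_n:,.0f}")
# The downstream metric needs several times the traffic, so plan test duration around it.
```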

Related Reading

This case study illustrates one of the six distinct A/B test outcomes most practitioners never learn to identify. For the full decision framework covering all six result types — from clear winners to mixed signals — read The 6 Types of A/B Test Results Nobody Explains Clearly.

The deeper issue behind this test was hypothesis quality — the team bundled too many changes without isolating variables. For the framework that prevents this, see Why A/B Tests Fail Before They Start.

Atticus Li

Leads applied experimentation at NRG Energy. $30M+ in verified revenue impact through behavioral economics and CRO.