Atticus Li leads Applied Experimentation at NRG Energy (Fortune 150), where he runs 100+ experiments per year and generated $30M in verified revenue impact in 2025. He writes about the operational reality of building experimentation programs that survive contact with organizational politics.

I need to tell you about the thing that almost killed my experimentation program. Not a bad test. Not a platform failure. Not budget cuts. It was something much more insidious: the slow, invisible erosion of credibility that happens when your team starts shopping for winning metrics after seeing results.

Post-hoc metric shopping is when you change your success metric after seeing the data to make a test look like a winner. And it's happening in most of the experimentation programs I've consulted on, often without anyone recognizing it as a problem.

What Metric Shopping Actually Looks Like

Here's the scenario. You run a test on a landing page redesign. Your primary metric is form submissions. After three weeks, the test comes back flat — no meaningful difference in form submissions between control and variant.

But someone on the team notices that time on page went up 12%. Another person sees that scroll depth increased. Someone else pulls up a secondary click metric that moved 8%.

Suddenly the readout becomes: "The redesign didn't impact form submissions directly, but it significantly increased engagement. Time on page was up 12%, scroll depth improved, and downstream click-through on the value proposition section rose 8%. We recommend shipping the variant based on these engagement signals."

That's metric shopping. And it feels completely rational in the moment. The test clearly did _something_. The data is real. Nobody is fabricating numbers. But the conclusion is manufactured.

You decided the test was about form submissions. The form submission rate didn't move. The test lost. Everything else is noise until you design a follow-up to test those engagement hypotheses explicitly.

Why Every Team Does It

I've watched this pattern play out across dozens of organizations, and the root cause is always the same: pressure.

Stakeholder pressure. Someone senior championed this redesign. They allocated design and development resources. They told _their_ boss it was happening. Now you're telling them it didn't work? That's a hard conversation. It's much easier to find a metric that moved and spin a softer narrative.

Sunk cost. The test took six weeks to design, build, QA, and run. The team invested real time and energy. Calling it a loss feels like all that work was wasted. Finding a winning metric — any winning metric — makes the investment feel justified.

Program survival instinct. This is the most dangerous one. When you're building an experimentation program, especially in the first year or two, you feel enormous pressure to demonstrate value. Every test that "fails" is ammunition for someone who thinks experimentation is a waste of resources. So you start unconsciously looking for wins in every dataset.

I understand all of these pressures. I've felt every one of them. But the solution to organizational pressure is not to compromise on truth. Because that path leads somewhere much worse than a flat test.

How It Destroys Programs Over Time

Metric shopping doesn't blow up your program in one dramatic failure. It erodes it slowly, like termites in a foundation. Here's the progression I've seen play out multiple times.

Stage one: Selective storytelling. The experimentation team starts presenting results in the best possible light. Losing tests get reframed as "learnings with positive engagement signals." Win rates magically climb because you're counting these reframed tests as wins.

Stage two: Stakeholder confusion. Different teams start reporting different "wins" from the same test. Marketing says the test won on engagement. Product says it was flat on conversion. The data team can't reconcile the narratives. People start asking which metric actually matters, and nobody has a consistent answer.

Stage three: Loss of trust. Executives notice that the experimentation team always seems to find a win. They start questioning the methodology. When you present a genuine, high-confidence win — the kind that should drive major investment — it gets the same skeptical reception as your reframed engagement metrics. You've trained people to discount your results.

Stage four: Program marginalization. The experimentation team becomes "the team that always says their stuff works." You've become the center of spin instead of the center of truth. Budget gets harder to justify. Headcount requests get denied. The program dies not because it was bad, but because it destroyed its own credibility.

I watched this exact progression happen at two different companies before I joined NRG. It takes about 18 months from stage one to stage four. And once you're at stage four, it's nearly impossible to recover without a complete reset of the team and its leadership.

The Organizational Dynamics Nobody Talks About

Here's what makes this problem so intractable: metric shopping often starts with the people who should be preventing it.

The CRO manager wants to show their boss that the program is working. The director of marketing wants to report wins to the VP. The VP wants to show the CMO that digital experimentation is driving results. At every layer, there's an incentive to find the most favorable interpretation of the data.

Early in a program's life, when you're trying to build proof of concept, the temptation is overwhelming. "We just need a few early wins to secure next year's budget, then we'll tighten up the methodology." I've heard this exact rationalization. It never works. You can't build a rigorous culture on a foundation of expedient narratives.

The other dynamic is cross-functional friction. When the experimentation team owns the methodology but a stakeholder owns the strategy, you get competing incentives. The product manager who requested the test wants it to win because they have a roadmap to defend. The experimentation analyst wants to report the truth because their professional credibility depends on it. Guess who usually loses that argument?

The One Rule That Prevents It

After all the programs I've built and consulted on, I've landed on one rule that prevents post-hoc metric shopping. It's brutally simple and non-negotiable.

One primary metric, locked before launch, documented in the test brief.

Every test has a single primary metric. It's chosen during the hypothesis phase, before anyone sees any data. It's written into the test brief alongside the hypothesis, the sample size calculation, and the expected runtime. Once the brief is approved and the test launches, the primary metric cannot change.

Everything else — engagement metrics, secondary conversions, segment-level cuts — is explicitly labeled as exploratory. Exploratory findings are interesting. They can inform future hypotheses. But they do not determine whether this test won or lost.
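If it helps to picture the artifact, here's a minimal sketch of a locked test brief as a data structure. The field names are illustrative, not taken from any particular tool; the point is that the primary metric and the exploratory list are both written down before launch and can't be edited afterward.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)  # frozen: once the brief is approved, its fields cannot be reassigned
class TestBrief:
    test_name: str
    hypothesis: str
    primary_metric: str                 # the single success criterion, chosen before any data exists
    minimum_detectable_effect: float    # relative lift the test is powered to detect
    required_sample_size: int           # per variant, from the sample size calculation
    expected_runtime_days: int
    exploratory_metrics: List[str] = field(default_factory=list)  # never decide win/loss

# Illustrative brief for the landing page scenario above
brief = TestBrief(
    test_name="landing-page-redesign",
    hypothesis="A redesigned hero section will increase form submissions",
    primary_metric="form_submission_rate",
    minimum_detectable_effect=0.05,
    required_sample_size=42_000,
    expected_runtime_days=21,
    exploratory_metrics=["time_on_page", "scroll_depth", "value_prop_ctr"],
)
```

The frozen flag is the whole point: after approval, the only way to change the primary metric is to write a new brief for a new test.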

When I present results, there's one truth. "The test targeted form submission rate. The result was a 1.2% lift that did not reach statistical significance at our 95% confidence threshold. The test is inconclusive on its primary metric." Then, and only then, do I mention exploratory findings, clearly labeled as such.

This rule does three things. First, it forces the team to think harder about what actually matters before building anything. If you can only have one primary metric, you'd better choose the right one. Second, it removes the temptation to spin after the fact because everyone agreed on the success criterion upfront. Third, it builds the credibility that makes your real wins land with impact.

How to Handle Losing Tests Honestly

If you lock your primary metric and report honestly, you're going to report a lot of losses. That's normal. Industry win rates for well-designed A/B tests range from 15% to 30%. At NRG, we run at about 24%, which is strong but still means three out of four tests don't produce a statistically significant improvement on the primary metric.

The key is framing losses correctly. A test that doesn't move the primary metric isn't a failure — it's a decision. You now know that this particular change, at this particular point in the funnel, does not meaningfully impact the metric you care about. That's valuable. It prevents you from investing further in a direction that doesn't work.

I structure every test readout in the same format, whether the test won or lost. Hypothesis. Primary metric result. Confidence level. Business implication. Recommended next action. The format is identical for wins and losses. This normalization is crucial because it removes the emotional charge from "losing" tests.
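To ground the "primary metric result" and "confidence level" lines, here's a rough sketch of the kind of check that sits behind them, using a standard two-proportion z-test. The counts are invented for illustration and don't describe NRG's actual analysis stack; they just show why a small relative lift on a modest base rate often comes back inconclusive at a 95% threshold.

```python
# Illustrative primary-metric check with a two-proportion z-test (statsmodels).
# The counts below are made up: a ~1.2% relative lift on a 4% base rate.
from statsmodels.stats.proportion import proportions_ztest

control_conversions, control_visitors = 2_000, 50_000   # 4.00% form submission rate
variant_conversions, variant_visitors = 2_024, 50_000   # 4.05%, a ~1.2% relative lift

z_stat, p_value = proportions_ztest(
    count=[variant_conversions, control_conversions],
    nobs=[variant_visitors, control_visitors],
)

alpha = 0.05  # 95% confidence threshold
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
print("Statistically significant on the primary metric" if p_value < alpha
      else "Inconclusive on the primary metric")
```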

When a stakeholder pushes back — "But the engagement metrics went up, shouldn't we ship it?" — I have a consistent answer: "Those are exploratory findings. If we believe scroll depth is the metric that matters, let's design the next test with scroll depth as the primary metric and power it appropriately. I'm not willing to retroactively change the success criterion because that undermines every result we report."

That conversation is uncomfortable exactly once. After that, people understand the standard.

Why This Matters for Program Survival

Credibility is the experimentation team's most valuable asset. More valuable than the testing platform. More valuable than the traffic volume. More valuable than the analysts on the team.

When an experimentation team has credibility, their wins drive real investment. "The experimentation team says this change will generate $2M in annual revenue" becomes a statement that moves budget. When they don't have credibility, that same statement gets met with "Well, they always say their tests win."

I've been able to generate $30M in verified revenue impact in 2025 because when I present a result, people believe it. They believe it because they've seen me present losing tests with the same rigor and transparency as winning tests. They believe it because the methodology has been consistent since day one. They believe it because I've never once changed a primary metric after seeing results.

That trust took years to build. It would take one month of metric shopping to destroy.

Build the System That Prevents the Temptation

The best way to prevent metric shopping isn't willpower — it's process. Build the system so the temptation never arises.

Lock the primary metric in the test brief. Make test briefs a required artifact before any development begins. Store every brief in a test repository that's accessible to anyone in the organization. When you present results, reference the original brief. Make the audit trail visible.
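One lightweight way to make that audit trail enforceable rather than aspirational is to check every readout against its original brief before it goes out. The sketch below is hypothetical, not the implementation of GrowthLayer or any other tool; it just shows the shape of the guardrail: a readout can only declare a result on the metric that was locked in the brief.

```python
# Hypothetical guardrail: a readout may only declare a win or loss on the locked
# primary metric, and may only report metrics that were declared in the brief.
def validate_readout(brief: dict, readout: dict) -> list:
    """Return a list of violations; an empty list means the readout is consistent."""
    violations = []

    if readout["decision_metric"] != brief["primary_metric"]:
        violations.append(
            f"Decision metric '{readout['decision_metric']}' does not match "
            f"the locked primary metric '{brief['primary_metric']}'."
        )

    for metric in readout.get("reported_metrics", []):
        if metric != brief["primary_metric"] and metric not in brief.get("exploratory_metrics", []):
            violations.append(f"Metric '{metric}' was never declared in the brief.")

    return violations

# Classic metric shopping: deciding on scroll depth after the fact gets flagged.
brief = {
    "primary_metric": "form_submission_rate",
    "exploratory_metrics": ["time_on_page", "scroll_depth"],
}
readout = {
    "decision_metric": "scroll_depth",
    "reported_metrics": ["scroll_depth", "time_on_page"],
}
print(validate_readout(brief, readout))
```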

If you're running more than a handful of tests, you need tooling that enforces this workflow — intake, hypothesis documentation, metric locking, and results tied back to the original brief. That's exactly what we built GrowthLayer to do: automate the test brief process so the primary metric is locked, documented, and visible before the first line of test code gets written.

The organizations that build lasting experimentation programs are the ones that choose honesty over optics from day one. It's harder in the short term. It's the only thing that works in the long term.
