Atticus Li leads Applied Experimentation at NRG Energy (Fortune 150), where he runs 100+ experiments per year; his portfolio generated $30M in verified revenue impact in 2025. He writes about the operational reality of building experimentation programs that survive contact with organizational politics.
I've been in rooms where three different stakeholders presented the same A/B test result and somehow reached three entirely different conclusions. Not because anyone was lying. Because each person was selecting the slice of data that supported their existing position.
This is the dirty secret of enterprise experimentation: the data doesn't speak for itself. People speak for the data. And when you let that happen without guardrails, you end up with an experimentation program that produces ammunition instead of insights.
How Results Actually Get Spun
Let me walk you through the most common patterns I've seen. These aren't hypothetical. I've watched every one of these happen in real organizations, including my own.
Selective metric reporting. The test had a primary metric and six secondary metrics. The primary metric was flat. But one secondary metric moved 14%. Guess which number makes it into the executive readout? The stakeholder who championed the change suddenly becomes very interested in "engagement metrics" they've never cared about before.
Timeframe manipulation. The test ran for four weeks. Week three had a spike because of a seasonal promotion. Someone pulls the week-three data in isolation and presents it as "the result." Or they truncate the test period to exclude the regression in week four. The data is technically accurate. The conclusion is manufactured.
Attribution gymnastics. Marketing claims the conversion lift from a test that optimized the checkout flow. Product claims the same lift because they redesigned the UI. The experimentation team measured a 3% improvement. Somehow both teams report it upward as their win. Nobody is coordinating, and the company believes it got a 6% improvement when it actually got 3%.
Reframing the hypothesis. The original hypothesis was about increasing purchases. The test didn't move purchases. But someone rewrites history: "Actually, we were testing whether the new messaging resonated with users. And the click-through rate on the CTA increased, so it clearly resonated." That was never the hypothesis. But try proving that six weeks later when the original brief has been buried in a Confluence page nobody reads.
The confidence interval stretch. A result comes back at 87% confidence. Not statistically significant by any reasonable standard. But the stakeholder presents it as "directionally positive" and recommends shipping. Directionally positive is not a statistical concept. It's a political one.
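To make the "87% confidence" case concrete, here's a minimal sketch of the arithmetic, assuming a standard two-sided, two-proportion z-test and a pre-registered 95% threshold. The counts are invented for illustration only.

```python
# Hypothetical counts chosen to land near "87% confidence" -- not real data.
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)              # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))          # two-sided p-value
    return z, p_value

z, p = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=548, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.3f}, 'confidence' = {1 - p:.0%}")
# Roughly 87% "confidence" (p ~ 0.13). Against a pre-registered 95% bar,
# the only honest call is "not significant" -- not "directionally positive".
```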
Why This Happens Everywhere
The root cause isn't bad people. It's misaligned incentives layered on top of organizational pressure.
Product managers need to show their roadmap items delivered results. Marketing leaders need to justify campaign spend. Design teams need to prove their redesigns weren't just aesthetic exercises. Engineering leads need to show the sprint was worth the investment.
Every single one of these people has a legitimate reason to want the test to succeed. And when the raw data doesn't cooperate, the temptation to find the angle that supports the narrative is overwhelming. Not because they're dishonest — because their performance reviews, their budgets, and their credibility are on the line.
I've watched senior leaders who I deeply respect subtly reframe test results to protect their teams. It's human nature. Which is exactly why you can't rely on human nature to keep the data clean.
The Center of Excellence Must Be the Single Source of Truth
This is where an experimentation Center of Excellence earns its existence. Not as a consulting team. Not as a support function. As the authoritative source of what happened.
Here's what that means in practice. One team owns the final readout. One team determines whether the test met its primary success metric. One team writes the conclusion. Everyone else can provide context, suggest follow-ups, and challenge the methodology. But they don't get to write an alternative version of reality.
At NRG, every experiment has a standardized report that includes the primary metric result, the pre-registered hypothesis, the sample size, the confidence level, and the business recommendation. That report is the canonical record. If your readout to leadership differs from what's in that report, you're going to get called on it.
This isn't about control. It's about credibility. An experimentation program lives and dies on trust. The moment stakeholders stop trusting the results — because they've seen the same data presented five different ways — you're done. Nobody funds a program they don't trust.
The Standardized Reporting Fix
Here's the specific framework I use to prevent spinning.
One primary metric, declared before the test launches. Not after. Before. In writing. In the experiment brief. If you want to track secondary metrics, great. But the test is evaluated on the primary metric. Period.
Pre-registered hypotheses. You write down what you expect to happen and why before you see any data. This makes it much harder to retroactively claim the test was about something else. Pre-registration isn't just for academia. It's the single most effective defense against organizational spin.
Standardized readout template. Every test gets the same format. Primary metric result. Statistical significance. Confidence interval. Sample size. Duration. Business recommendation. Secondary observations. No creative reframing. No narrative embellishment. The template forces clarity.
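For what it's worth, here is a minimal sketch of what that canonical record could look like as a data structure. The field names are my own illustration, not NRG's actual schema; the point is that every test fills in the same fields and the record is written once.

```python
# Illustrative readout record -- field names are assumptions, not NRG's schema.
from dataclasses import dataclass, field
from typing import Literal

@dataclass(frozen=True)  # frozen: the canonical record is not edited after the fact
class ExperimentReadout:
    experiment_id: str
    preregistered_hypothesis: str                     # quoted verbatim from the brief
    primary_metric: str                               # the one metric the test is judged on
    primary_metric_lift: float                        # observed relative lift, e.g. 0.03 = +3%
    confidence_interval: tuple[float, float]          # on the primary metric lift
    p_value: float
    sample_size: int
    duration_days: int
    outcome: Literal["win", "loss", "inconclusive"]   # set only by the experimentation team
    business_recommendation: str
    secondary_observations: list[str] = field(default_factory=list)  # context, never the verdict
```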
Win/loss classification by the experimentation team. Not by the stakeholder. The team that ran the test calls it a win, a loss, or inconclusive. That classification goes into the program database and doesn't change. If a stakeholder disagrees, they can challenge the methodology or request a follow-up test. They don't get to reclassify the result.
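A hedged sketch of how that call can be made mechanical, assuming a pre-registered significance threshold of 0.05. The rule depends only on the primary metric, which is the point: a stakeholder can challenge the methodology, but there is nothing to renegotiate in the label itself.

```python
def classify_result(primary_lift: float, p_value: float, alpha: float = 0.05) -> str:
    """Win/loss/inconclusive on the primary metric only; secondary metrics never change the call."""
    if p_value >= alpha:
        return "inconclusive"   # covers the "87% confidence" case from earlier
    return "win" if primary_lift > 0 else "loss"
```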
Transparent communication of losses. This is the hardest one. You have to normalize losing. Most tests lose. That's not failure — that's the scientific method working. If your program only reports wins, you've already lost the credibility battle. I present our full portfolio performance, including every flat and losing test, in every quarterly review. Leadership respects the honesty, and it makes the wins more credible.
The Story Matters — But It Has to Be Honest
I want to be clear about something. I'm not saying narrative doesn't matter. It absolutely does. Especially early in a program when you're fighting for budget and headcount, the way you frame your results determines whether the program survives.
But there's a massive difference between smart framing and spin.
Smart framing says: "This test didn't move the primary metric, but it eliminated a hypothesis that would have consumed two sprints of development time. That's a $200K decision we didn't waste." That's honest. That's valuable. And it's a legitimate way to communicate the value of experimentation.
Spin says: "While the primary metric was inconclusive, we saw strong engagement signals that suggest the variant resonated with users." That's not honest. That's someone trying to avoid the word "lost."
The difference is simple. Smart framing presents the actual result and adds business context. Spin presents a different result than what actually happened.
What Happens When You Don't Fix This
I've seen programs collapse under the weight of their own spin. It follows a predictable pattern. Results get spun. Leadership notices the inconsistencies. Trust erodes. Funding questions start. The program gets restructured, downsized, or killed.
The irony is that the spin was meant to protect the program. Stakeholders thought that presenting wins would secure the budget. Instead, the gap between the results that were presented and the results the business actually saw destroyed the program's reputation.
The experimentation programs that survive long-term are the ones that tell the truth, build trust through transparency, and let the cumulative impact speak for itself. A portfolio of 100 experiments with a 24% win rate and $30M in verified revenue impact is a far more compelling story than a handful of cherry-picked "wins" that nobody quite believes.
Build the single source of truth. Enforce the reporting standard. Tell the story honestly. That's how you build a program that lasts.
---
_If you're building an experimentation program and want tools that enforce rigor from day one — sample size calculators, significance tests, and SRM diagnostics — check out GrowthLayer's free experimentation calculators. They're built for operators, not textbooks._