Atticus Li leads Applied Experimentation at NRG Energy (Fortune 150), where he runs 100+ experiments per year and generated $30M in verified revenue impact in 2025. He writes about the operational reality of building experimentation programs that survive contact with organizational politics.
There's a conversation I have during the first week at every new organization. It goes like this: "Before I run a single experiment, I need to understand your data dictionary. Not the documentation — the reality. How do you actually define a user? What counts as a session? When you say 'conversion,' what event fires and under what conditions?"
The answers are never straightforward. And the gap between what people think the data says and what it actually says is where experiments go wrong.
There's Always Going to Be Weirdness
Every company's data dictionary has quirks. I don't say this as a criticism — it's an inevitable consequence of how analytics implementations evolve over time. Someone sets up tracking five years ago. Requirements change. New features get added. Edge cases accumulate. Workarounds become permanent. Nobody documents the decisions that were made at 11pm the night before a launch.
The result is a data dictionary that's technically correct but practically misleading unless you understand the context behind every definition.
At SVB, we used Google Analytics. A "unique user" in GA was based on the client ID cookie. If someone visited on their phone and then on their laptop, that was two users. If they cleared their cookies, that was a new user. If they used incognito mode, new user. The "unique users" metric was really "unique browser instances," which could overcount actual humans by 30-40% depending on the audience.
At NRG, we use Adobe Analytics. Adobe's visitor identification works differently — it uses a combination of cookies, device fingerprinting, and authenticated states. The same person visiting from two devices might or might not be deduplicated, depending on whether they logged in. "Unique visitors" in Adobe is closer to actual humans than GA's version, but only for authenticated users. For anonymous traffic, it has similar limitations.
Same concept — "unique users" — with fundamentally different definitions between the two platforms. If you ran an experiment at SVB and measured "unique users who converted," then moved to NRG and applied the same mental model, your conversion rates would look different even if nothing about the actual customer behavior changed. Because you're counting different things.
The Definitions That Trip Up Experiments
Let me walk through the specific definitions that cause the most problems in experimentation.
Sessions. GA4 defines a session as a group of user interactions within a given time frame. A session expires after 30 minutes of inactivity (by default, configurable). Adobe has a similar concept but calls it a "visit" and the timeout rules can differ based on implementation. Some companies override the default timeout. Some have custom session logic for single-page applications. If your experiment measures "conversion rate per session," you need to know exactly what triggers a new session in your specific implementation.
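To see why the timeout matters, here's a minimal sessionization sketch. The 30-minute constant is GA4's default; everything else (the function, the sample hits) is illustrative, not any vendor's actual logic.

```typescript
// Illustrative sessionization: groups one user's hit timestamps (epoch ms)
// into sessions using an inactivity timeout. Change the timeout and every
// "per session" metric downstream changes with it.
const DEFAULT_TIMEOUT_MS = 30 * 60 * 1000; // GA4's default; often overridden

function countSessions(timestamps: number[], timeoutMs = DEFAULT_TIMEOUT_MS): number {
  const sorted = [...timestamps].sort((a, b) => a - b);
  let sessions = 0;
  let last = -Infinity;
  for (const t of sorted) {
    if (t - last > timeoutMs) sessions += 1; // a long enough gap starts a new session
    last = t;
  }
  return sessions;
}

// The same hit stream, two session counts, depending only on the timeout rule:
const hits = [0, 20, 60, 110].map((min) => min * 60 * 1000);
console.log(countSessions(hits));                 // 3 sessions with the 30-minute default
console.log(countSessions(hits, 60 * 60 * 1000)); // 1 session with a 60-minute override
```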
Pageviews. Sounds simple, right? A pageview fires when a page loads. Except in single-page applications, where navigation doesn't trigger a traditional page load. Some implementations fire virtual pageviews on route changes. Some don't. Some fire them on hash changes. Some only fire on full URL changes. I've seen SPAs where navigating from /products to /products/detail counted as one pageview in some implementations and two in others. If your experiment is on a specific page and you're measuring pageviews, you need to verify that the pageview fires when you think it does.
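Here's a sketch of one common virtual-pageview pattern: wrapping history.pushState so client-side route changes emit a pageview. The trackVirtualPageview function is a stand-in for whatever your analytics layer actually calls; whether your app does anything like this is exactly what needs verifying.

```typescript
// One common SPA pattern: wrap history.pushState so client-side route
// changes emit a "virtual pageview". If your implementation does NOT do
// something like this, route changes produce no pageviews at all.
function trackVirtualPageview(path: string): void {
  // Stand-in for your real tracking call (gtag, dataLayer.push, s.t(), etc.).
  console.log(`virtual pageview: ${path}`);
}

function instrumentSpaNavigation(): void {
  const originalPushState = history.pushState.bind(history);

  history.pushState = (state, title, url) => {
    originalPushState(state, title, url);
    if (url) trackVirtualPageview(String(url)); // fires on pushState navigations
  };

  // Back/forward navigation doesn't call pushState; it fires popstate instead.
  window.addEventListener("popstate", () => {
    trackVirtualPageview(window.location.pathname);
  });
}
```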
Conversions. This is the big one. "Conversion" means whatever someone decided it means during the implementation. At one company, a conversion fires when the user clicks the submit button. At another, it fires when the server confirms the submission was successful. At a third, it fires when the payment is processed. These are materially different events that happen at different points in the funnel, with different failure rates between them.
I've seen experiments where the "conversion rate" was inflated because the conversion event fired on button click, not on successful submission. Users who clicked submit but got a validation error were counted as conversions. The experiment looked like a winner because the variant had fewer validation errors — not because it actually drove more successful submissions.
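In code terms, the difference looks roughly like this. The endpoint and handler are hypothetical; what matters is where the tracking call sits relative to validation and server confirmation.

```typescript
// Hypothetical enrollment form handler showing two places a "conversion"
// event can fire. Definition A counts validation errors and failed requests
// as wins; Definition B counts only confirmed completions.
function trackConversion(definition: string): void {
  console.log(`conversion fired: ${definition}`); // stand-in for the real tracking call
}

async function onSubmitClick(form: HTMLFormElement): Promise<void> {
  trackConversion("A: button_click"); // fires even if nothing actually succeeds

  if (!form.checkValidity()) return; // validation error: no real conversion happened

  const response = await fetch("/api/enroll", { method: "POST", body: new FormData(form) });

  if (response.ok) {
    trackConversion("B: server_confirmed"); // fires only on confirmed success
  }
}
```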
Revenue. You'd think revenue would be unambiguous. It's not. Does revenue include tax? Shipping? Does it net out discounts and coupons? Does it account for returns and refunds? Is it recognized at the point of purchase or at the point of fulfillment? Every company answers these questions differently, and the answer changes what "revenue per user" actually means in your experiment.
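A toy illustration of how far apart two defensible definitions can land on the same order (the fields and numbers are made up):

```typescript
// The same order, two defensible "revenue" numbers. Field names are made up;
// the questions (tax? shipping? discounts? refunds?) are the ones that matter.
interface Order {
  subtotal: number;
  tax: number;
  shipping: number;
  discount: number;
  refunded: number;
}

const grossRevenue = (o: Order): number => o.subtotal + o.tax + o.shipping;
const netRevenue = (o: Order): number => o.subtotal - o.discount - o.refunded;

const order: Order = { subtotal: 100, tax: 8, shipping: 10, discount: 15, refunded: 0 };

console.log(grossRevenue(order)); // 118
console.log(netRevenue(order));   // 85, so "revenue per user" differs by almost 30%
```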
What Happens When You Don't Check
Here's a real scenario. A team I worked with ran an experiment measuring "add to cart rate." The test showed a 9% improvement. They shipped it. Two weeks later, finance flagged that revenue hadn't moved. The team investigated and discovered that their "add to cart" event was firing on a JavaScript state change, not on the actual API call that added the item to the cart. The variant had a UI change that triggered the state change without actually adding the item. The 9% lift was entirely phantom conversions.
Another example. An experimentation team measured "unique users who completed enrollment." The win was 6%. But "unique users" in their implementation counted each browser independently. Users who started enrollment on their phone and finished on their laptop were split into two "users": one who apparently abandoned (the phone) and a brand-new one who completed (the laptop). Because of how those fragments happened to be bucketed, the abandonments piled up disproportionately in the control. The variant was actually performing the same; the cross-device splitting inflated the control's apparent failure rate.
These aren't edge cases. These are the kinds of issues I encounter regularly. The more complex the analytics implementation, the more opportunities for definitional mismatches to corrupt your results.
The Pre-Experiment Data Audit
Before I run any experiment in a new environment, I do a data audit. Here's what that looks like.
Step one: read the implementation, not the documentation. Documentation is often outdated or aspirational. I look at the actual tracking code. What events fire? Under what conditions? What parameters are attached? Where are the edge cases?
Step two: query the raw data. I pull raw event data for the metrics I plan to use in experiments and manually verify that they match my understanding. If I think "conversion" means "successful purchase," I check whether the conversion event fires before or after payment confirmation. I check whether it fires on retries. I check whether it fires for $0 transactions.
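Here's a minimal sketch of that kind of check, assuming the raw events have been exported to a JSON file with hypothetical eventName, orderId, and value fields:

```typescript
import { readFileSync } from "node:fs";

// Hypothetical raw-event export; your field names will differ.
interface RawEvent {
  eventName: string;
  orderId: string;
  value: number;
}

const events: RawEvent[] = JSON.parse(readFileSync("raw_events.json", "utf8"));

const conversions = events.filter((e) => e.eventName === "conversion");
const confirmedOrders = new Set(
  events.filter((e) => e.eventName === "payment_confirmed").map((e) => e.orderId),
);

// Conversions with no matching payment confirmation: likely firing too early.
const unconfirmed = conversions.filter((e) => !confirmedOrders.has(e.orderId));
// $0 "conversions": decide explicitly whether these should count.
const zeroValue = conversions.filter((e) => e.value === 0);

console.log(`conversions: ${conversions.length}`);
console.log(`no payment confirmation: ${unconfirmed.length}`);
console.log(`zero-value: ${zeroValue.length}`);
```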
Step three: reconcile with business data. I compare the analytics numbers to the source-of-truth business systems. If analytics says 10,000 conversions last month and the CRM says 9,200, I need to understand where the 800-conversion gap lives before I trust any experiment that uses that metric.
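The comparison itself can be mechanical; the real work is explaining the gap. A sketch, assuming both systems can export the month's order IDs as JSON arrays:

```typescript
import { readFileSync } from "node:fs";

// Hypothetical monthly exports: each file is a JSON array of order IDs.
const loadIds = (path: string): Set<string> =>
  new Set(JSON.parse(readFileSync(path, "utf8")) as string[]);

const analytics = loadIds("analytics_conversions.json");
const crm = loadIds("crm_orders.json");

// Conversions the analytics platform reports that the CRM never saw, and vice versa.
const onlyInAnalytics = [...analytics].filter((id) => !crm.has(id));
const onlyInCrm = [...crm].filter((id) => !analytics.has(id));

console.log(`analytics: ${analytics.size}, crm: ${crm.size}`);
console.log(`in analytics only: ${onlyInAnalytics.length}`); // the gap to explain
console.log(`in CRM only: ${onlyInCrm.length}`);
```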
Step four: document the quirks. Every implementation has them. I create a living document that catalogs every known quirk in the data: events that double-fire under certain conditions, metrics that exclude certain user segments, definitions that differ from common understanding. This document becomes required reading for anyone who runs experiments.
The Cross-Platform Translation Problem
If you've worked at multiple companies — or even multiple teams within the same company — you know that metrics don't translate cleanly. A "conversion rate" at one company is not comparable to a "conversion rate" at another, even if they're in the same industry. The definitions are different. The funnel shapes are different. The tracking implementations are different.
This matters for experimentation because a lot of our intuition is built on prior experience. You might think "a 5% conversion rate lift is typical for a good test" based on your experience at a previous company. But if that company defined conversion differently, your calibration is off.
I recalibrate my intuition every time I join a new organization or start working with a new analytics platform. The first month is all about understanding the data before I try to move it.
Build the Habit
Every metric in your experimentation program should have a verified definition that the entire team understands. Not the textbook definition. The actual definition in your specific implementation.
This isn't glamorous work. Nobody gets promoted for auditing event tracking. But it's the foundation that everything else rests on. An experiment built on misunderstood metrics isn't an experiment — it's guesswork with a statistical veneer.
Understand the dictionary before you try to write with it.
---
_Once your metrics are solid, make sure your experiments are properly powered to detect real differences. GrowthLayer's A/B test calculator helps you plan sample sizes based on the metrics that actually matter to your business._