Atticus Li reduced experimentation analysis time by 40% at NRG Energy by integrating AI tools including Claude, ChatGPT, and Optimizely AI into the testing workflow, while implementing AI safety and governance standards to ensure responsible deployment across the organization.
This Is Not an "AI Is Amazing" Post
I want to be upfront about what this post is and isn't. This is not a breathless recounting of how AI changed everything. AI didn't change everything. It changed specific parts of my workflow in measurable ways, and it introduced new risks that required governance frameworks I didn't have before.
What I'm going to share are the exact workflows where AI tools save me meaningful time, the places where I tried to use AI and it made things worse, and the governance standards I built to keep the whole thing responsible.
I run 150+ experiments across seven brands at NRG Energy using my PRISM Method. Each experiment generates data that needs to be analyzed, synthesized, and communicated to stakeholders. That's where the 40% time savings came from — not from AI running experiments, but from AI accelerating the analysis layer.
Session Replay Analysis: Where AI Earns Its Keep
Before AI tools entered my workflow, session replay analysis looked like this: open Contentsquare or Hotjar, watch 50-100 recordings of users going through an enrollment flow, manually tag friction points, write up findings. For a single test, this could take 4-6 hours.
Now the workflow looks different. Tools like Contentsquare's AI summarization, Microsoft Clarity's AI-powered insights, and Hotjar's AI analysis can surface patterns across hundreds of sessions in minutes. They identify common drop-off points, rage clicks, dead clicks, and hesitation patterns without me watching every recording.
But here's the important caveat: AI summarization of session replays is a first pass, not a final analysis. I still watch 15-20 recordings manually for every test. Why? Because AI can tell me "42% of users hesitated at the plan comparison step." It cannot tell me why. Was it confusion about pricing? Was the comparison table loading slowly on mobile? Were users toggling between plans trying to understand the difference between a fixed-rate and variable-rate plan?
The behavioral mechanism — the why behind the behavior — still requires a human who understands the product, the customer, and the context. AI gives me a starting point that's 10 times better than starting from scratch. It doesn't give me a conclusion.
Time saved: Session replay analysis went from 4-6 hours to 1.5-2 hours per test. That's roughly 60% faster on this specific task.
Test Analysis Acceleration
This is where the 40% overall number comes from. After a test concludes, there's a stack of analysis work:
- Statistical validation (sample size adequacy, significance levels, Bayesian probability; a first-pass check is sketched after this list)
- Segment analysis (device, browser, traffic source, customer type)
- Revenue projection (lift applied to traffic volume and revenue per customer)
- Insight synthesis (what happened, why, and what it means)
- Executive report generation
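To make the statistical validation step concrete, here is roughly the kind of first-pass check I mean. This is a minimal sketch in Python with made-up numbers, not our production tooling:

```python
import numpy as np
from scipy import stats

def validate_results(conv_c, n_c, conv_v, n_v):
    """First-pass validation of a finished A/B test: frequentist significance
    plus the Bayesian probability that the variant beats control."""
    p_c, p_v = conv_c / n_c, conv_v / n_v

    # Frequentist: two-proportion z-test on the observed difference
    pooled = (conv_c + conv_v) / (n_c + n_v)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_v))
    p_value = 2 * stats.norm.sf(abs(p_v - p_c) / se)

    # Bayesian: sample Beta posteriors (uniform priors) and compare draws
    rng = np.random.default_rng(7)
    post_c = rng.beta(conv_c + 1, n_c - conv_c + 1, 100_000)
    post_v = rng.beta(conv_v + 1, n_v - conv_v + 1, 100_000)

    return {
        "relative_lift": (p_v - p_c) / p_c,
        "p_value": float(p_value),
        "prob_variant_beats_control": float((post_v > post_c).mean()),
    }

# Example with illustrative numbers: ~4.4% control vs ~4.8% variant conversion
print(validate_results(2100, 48_000, 2310, 48_200))
```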
I use Claude and ChatGPT differently depending on the task.
Claude for statistical analysis and synthesis: I feed Claude the raw test results — conversion rates, sample sizes, confidence intervals — and ask it to validate my calculations, identify segments worth investigating, and draft the narrative portion of the analysis. Claude is particularly strong at connecting test results to broader patterns. If three tests across different brands all showed mobile underperformance, Claude will flag that pattern, one I might have missed because I was analyzing each test in isolation.
ChatGPT for report generation: ChatGPT is faster at producing structured report drafts. I feed it a template and the key findings, and it generates a first draft of the executive report that I then edit for accuracy and tone. Editing takes about 20 minutes, versus the 45-60 minutes it used to take to write each report from scratch.
Optimizely's AI features: Optimizely has been building AI into its platform for personalization and test analysis. The AI-powered stats engine helps with faster significance detection, and the recommendation features suggest follow-up tests based on results. These are useful as prompts for thinking, not as decision-makers.
What I don't do: I never let AI tools make the final call on whether to ship a winning variant. AI helps validate the statistical analysis, but the decision to ship involves organizational context, brand considerations, engineering capacity, and downstream effects that no AI model has visibility into.
Time saved across the full analysis workflow: From approximately 8 hours per test to approximately 5 hours. That's roughly 37-40% faster.
Hypothesis Generation: AI as a Brainstorming Partner
Here's a workflow that surprised me with its effectiveness. Before each sprint planning cycle, I need to generate hypotheses for the next round of experiments. Historically, this came from three sources: session replay analysis, stakeholder requests, and my own observations from monitoring analytics dashboards.
Now I add a fourth source: AI-assisted hypothesis generation.
The prompt engineering matters enormously here. "Give me ideas for improving our enrollment flow" generates generic garbage. But this prompt structure works:
"Given that our enrollment flow has a 23% drop-off between plan selection and account creation, and session replays show users hesitating at the email verification step, and our mobile conversion rate is 40% lower than desktop, generate five hypotheses for reducing friction in the account creation step. Each hypothesis should include a specific behavioral mechanism and a measurable prediction."
That produces hypotheses I can actually evaluate. Not all of them are good — I'd estimate 2 out of 5 are worth investigating. But that's 2 hypotheses I might not have considered, generated in 3 minutes instead of a 30-minute brainstorming session.
The critical point: The human still needs to validate every AI-generated hypothesis against the data. AI doesn't know your traffic volumes, your MDE thresholds, or whether a similar test already ran and lost last quarter. It generates possibilities. You filter them through reality.
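Because the structure matters more than the wording, I keep it as a template. A minimal sketch, with hypothetical field names, of how that prompt gets assembled:

```python
HYPOTHESIS_PROMPT = """\
Given that {funnel_fact},
and session replays show {behavioral_observation},
and {segment_fact},
generate {n} hypotheses for reducing friction in {target_step}.
Each hypothesis should include a specific behavioral mechanism
and a measurable prediction."""

# Fill the slots from real analytics observations, never from memory
prompt = HYPOTHESIS_PROMPT.format(
    funnel_fact="our enrollment flow has a 23% drop-off between plan selection and account creation",
    behavioral_observation="users hesitating at the email verification step",
    segment_fact="our mobile conversion rate is 40% lower than desktop",
    n=5,
    target_step="the account creation step",
)
print(prompt)
```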
AI-Assisted Personalization
This is where the integration between tools starts compounding. At NRG, we use Tealium as our Customer Data Platform and Optimizely for experimentation and personalization. Adding AI to this stack changed what was possible.
Dynamic content delivery: Tealium collects behavioral data in real time — what pages a user visits, how long they spend, what they click, whether they've visited before. Optimizely's personalization engine uses this data to serve different content to different segments. AI helps by identifying micro-segments that a human analyst wouldn't think to test.
For example, AI analysis revealed that users who visited the plan comparison page more than three times without converting responded differently to social proof messaging than first-time visitors. That's not a segment I would have hypothesized on my own. But when we tested personalized messaging for this segment, we saw a 23% increase in lead acquisition from the personalized experience.
Real-time behavioral segmentation: The combination of Tealium's real-time data collection and AI-driven segment discovery means we can identify and act on behavioral patterns within a single session, not just across sessions. A user showing exit intent on mobile gets a different intervention than a user who's been methodically comparing plans for 10 minutes.
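To give a flavor of what an in-session rule looks like, here is an illustrative sketch. The event fields and treatment names are hypothetical; the real logic lives in Tealium and Optimizely, not a standalone script:

```python
from dataclasses import dataclass

@dataclass
class SessionState:
    device: str            # "mobile" or "desktop"
    plan_page_views: int   # visits to the plan comparison page
    session_minutes: float
    exit_intent: bool      # e.g., cursor moving rapidly toward the close button
    converted: bool

def pick_intervention(s: SessionState) -> str:
    """Map an in-session behavioral pattern to a personalization treatment."""
    if s.converted:
        return "none"
    if s.plan_page_views > 3:
        return "social_proof_message"        # repeat comparers responded to social proof
    if s.exit_intent and s.device == "mobile":
        return "mobile_exit_intervention"
    if s.session_minutes >= 10:
        return "plan_difference_explainer"   # methodical comparers need clarity, not urgency
    return "default_experience"

print(pick_intervention(SessionState("mobile", 4, 6.0, False, False)))
```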
This is powerful, and it's also where governance becomes critical.
AI Safety and Governance
I want to spend real time on this section because it's the part most "AI in marketing" content skips entirely. When you integrate AI tools into workflows that touch customer data and business decisions, you need governance standards. Period.
Here's what I built at NRG, informed by the broader governance frameworks we developed for the experimentation program:
Data privacy guardrails: No personally identifiable information goes into AI prompts. When I feed test results to Claude, the data is aggregated — conversion rates and sample sizes, not individual user records. This sounds obvious, but I've seen analysts copy-paste raw data exports into ChatGPT without thinking about what's in those exports.
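A lightweight automated check catches most of those copy-paste mistakes before a prompt leaves the building. This is a sketch of the idea with illustrative regex patterns, not a substitute for a real data-loss-prevention tool:

```python
import re

# Illustrative patterns only; a production check would be far more thorough
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def assert_aggregated(prompt_text: str) -> None:
    """Refuse to send a prompt that looks like it contains individual-level PII."""
    hits = [name for name, pattern in PII_PATTERNS.items() if pattern.search(prompt_text)]
    if hits:
        raise ValueError(f"Prompt appears to contain PII ({', '.join(hits)}); aggregate first.")

assert_aggregated("Variant B: 4.8% conversion on n=48,200; control: 4.4% on n=48,000")
```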
Model governance standards: We document which AI tools are approved for which use cases. Claude and ChatGPT are approved for analysis and report generation. They are not approved for making deployment decisions or generating customer-facing copy without human review. Optimizely's built-in AI is approved for personalization within the guardrails we've defined.
Bias detection: AI models can amplify biases in the data. If our test data shows that a particular variant performed better for desktop users (who tend to be higher-income in our customer base), the AI might recommend optimizing for that segment without flagging the equity implications. We review AI recommendations through a bias lens: who benefits, who's excluded, and are we comfortable with that tradeoff?
Transparency: When I present AI-assisted analysis to stakeholders, I'm explicit about which parts were AI-generated and which parts were human analysis. Not because anyone asked me to, but because trust matters more than efficiency. If a VP discovers that the analysis they used to make a $500K decision was generated by ChatGPT without disclosure, that's a trust violation that damages the entire program.
Guardrails for personalization: Not every behavioral pattern should be acted on. We have rules about what personalization triggers are acceptable (showing relevant plan information based on browsing behavior) versus what crosses the line (dynamic pricing based on inferred income level). AI makes the second category easier to implement, which means governance has to be more vigilant, not less.
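One way to keep those rules enforceable rather than aspirational is an explicit allowlist that the personalization layer consults before activating any trigger. A hypothetical sketch:

```python
# Approved vs. banned personalization triggers; unknown triggers are denied by default
APPROVED_TRIGGERS = {
    "plan_info_by_browsing": True,        # relevant plan content based on pages viewed
    "social_proof_repeat_visitor": True,
    "pricing_by_inferred_income": False,  # explicitly banned; crosses the line
}

def can_activate(trigger: str) -> bool:
    # New triggers require governance review before they can be added here
    return APPROVED_TRIGGERS.get(trigger, False)

assert can_activate("plan_info_by_browsing")
assert not can_activate("pricing_by_inferred_income")
assert not can_activate("some_unreviewed_trigger")
```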
Claude Code for Building Internal Tools
This is a different use of AI that's worth its own section. Beyond using AI for analysis, I use Claude Code to build internal tools that make the experimentation program more efficient.
Sample size calculators: I built a custom calculator that factors in our specific traffic distributions across brands, accounts for seasonal variation (energy usage spikes in summer and winter), and outputs both frequentist and Bayesian sample size requirements. Claude Code wrote the core logic in about 30 minutes. Building it from scratch would have taken me a full day.
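The frequentist half of that calculator is the standard two-proportion power calculation; the part that makes it ours is the season-adjusted traffic input. A stripped-down sketch with illustrative seasonal factors:

```python
import math
from scipy import stats

# Illustrative multipliers; real values come from historical traffic data
SEASONAL_TRAFFIC_FACTOR = {"winter": 1.3, "spring": 0.9, "summer": 1.4, "fall": 0.8}

def n_per_arm(base_rate: float, relative_mde: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-arm sample size for a two-proportion test at the given relative MDE."""
    p1, p2 = base_rate, base_rate * (1 + relative_mde)
    z_a, z_b = stats.norm.ppf(1 - alpha / 2), stats.norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

def weeks_to_power(base_rate, relative_mde, weekly_visitors, season, arms=2):
    """Estimated runtime in weeks once traffic is adjusted for the season."""
    adjusted = weekly_visitors * SEASONAL_TRAFFIC_FACTOR[season]
    return arms * n_per_arm(base_rate, relative_mde) / adjusted

# e.g., 4% base conversion, 10% relative MDE, 60k weekly visitors in summer
print(n_per_arm(0.04, 0.10), round(weeks_to_power(0.04, 0.10, 60_000, "summer"), 1))
```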
Automated report generation: I built scripts that pull test results from our analytics platforms, calculate key metrics, and generate draft reports in a standardized format. The reports still need human review and narrative, but the data compilation and formatting are handled automatically.
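The report script itself is nothing exotic: pull the numbers, compute the metrics, fill a template, and stamp it as a draft. A hypothetical skeleton with made-up figures:

```python
REPORT_TEMPLATE = """\
Test: {name}
Result: {lift:+.1%} relative lift (p = {p_value:.3f}, P(beats control) = {prob:.0%})
Projected annual revenue impact: ${revenue:,.0f}
Status: DRAFT - pending human review and narrative
"""

def draft_report(name, lift, p_value, prob, annual_traffic, base_rate, revenue_per_customer):
    # Revenue projection: lift applied to traffic volume and revenue per customer
    incremental_customers = annual_traffic * base_rate * lift
    return REPORT_TEMPLATE.format(
        name=name, lift=lift, p_value=p_value, prob=prob,
        revenue=incremental_customers * revenue_per_customer,
    )

print(draft_report("Enrollment CTA copy", 0.091, 0.012, 0.97, 2_400_000, 0.044, 310))
```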
Internal dashboards: Quick visualization tools for monitoring active experiments, tracking the testing pipeline, and surfacing tests that are approaching significance. These aren't production-quality dashboards — they're internal tools that save me from manually checking Optimizely and Adobe Analytics separately for each active test.
The common thread is that Claude Code lets me build tooling that would normally require a dedicated developer. As a one-person experimentation team (with agency support for execution), this is genuinely transformative for my productivity.
What AI Cannot Do
I want to end with the limitations because they're more important than the capabilities.
AI cannot build stakeholder relationships. The reason my experimentation program grew from 20 tests to 100+ per year isn't because I had better tools. It's because I built relationships with product managers, brand directors, and finance leaders who trusted me to guide their decisions. That trust was earned in conference rooms, not in chat interfaces.
AI cannot navigate organizational politics. Every enterprise has politics. Which team owns the homepage. Which brand gets priority for testing resources. How to position a losing test so the VP who championed it doesn't kill the program. These are judgment calls that require understanding the human dynamics of your specific organization.
AI cannot understand organizational context. An AI model doesn't know that the CMO just changed, and the new one has different priorities. It doesn't know that last quarter's reorg moved the enrollment flow team to a different department. It doesn't know that the developer who maintains the test implementation layer is on paternity leave and his replacement doesn't know Optimizely.
AI cannot make judgment calls on test priorities. Given limited testing slots and competing requests, which test should run next? That decision involves traffic forecasts, stakeholder relationships, strategic priorities, technical complexity, opportunity cost, and organizational timing. AI can provide data inputs to that decision. It cannot make it.
AI cannot replace the consultant mindset. The most valuable thing I do isn't analysis — it's acting as a consultant, not a reporter. Stakeholders don't need more data. They need someone who understands their business context well enough to make recommendations they trust. AI is a tool in that process, not a substitute for it.
The 40% speed gain is real and it's valuable. It means I can run more tests, analyze results faster, and spend more time on the strategic work that actually grows the program. But the strategic work itself — the relationship building, the political navigation, the judgment calls — remains stubbornly, beautifully human.
If you want to talk about integrating AI into your experimentation workflow responsibly, reach out at [email protected]. I'm happy to share what's worked and, more importantly, what hasn't.