I run over 100 experiments a year at a Fortune 150 company. Before that, I spent years building experimentation programs from scratch.
Most "A/B testing best practices" articles read like they were written by someone who's never had to defend a losing test in a business review. They tell you to "test one variable at a time" without mentioning that's mathematically impossible for 95% of companies. They tell you to "share results" without addressing the political reality of stakeholders reframing your data.
Here's what I actually follow. These aren't theoretical — they're the practices that survived contact with real traffic, real org politics, and real budget pressure.
1. Pre-Register Your Success Metric or You'll Lie to Yourself
Before any test goes live, I write down: the metric, the baseline, and the target. That document doesn't change after results come in.
This isn't just good science — it's organizational self-defense. There's massive pressure to prove that the experiment you spent resources on is a winner. Without a locked metric, teams start shopping: "Well, the primary metric didn't move, but look at this secondary metric!" Suddenly every test is a "win" and nobody trusts the program.
I've seen this kill experimentation programs. The center of excellence becomes the center of spin. Stakeholders stop believing the numbers. Resources dry up.
My rule: One primary metric, locked before launch. Everything else is exploratory. When I present results, there's one truth. People come to my team because they know exactly how something performed — no stories, no framing games.
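If it helps to make that concrete, here's a minimal sketch of a locked brief kept as code. The schema and field names are my own illustration, not a standard; the point is that the record is immutable and timestamped before the first visitor gets bucketed.

```python
# Illustrative pre-registration record: written before launch, never edited after.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)  # frozen: the record can't be mutated once created
class TestBrief:
    test_name: str
    primary_metric: str        # the ONE metric that decides win/lose
    baseline: float            # current conversion rate
    target: float              # minimum lift worth shipping
    secondary_metrics: tuple   # exploratory only, never promoted after the fact
    registered_at: str

brief = TestBrief(
    test_name="checkout-hero-benefit-led",
    primary_metric="payment_completed_rate",
    baseline=0.042,
    target=0.046,
    secondary_metrics=("add_to_cart_rate", "time_on_page"),
    registered_at=datetime.now(timezone.utc).isoformat(),
)

# Persist it somewhere append-only (a repo, a shared drive) before the test goes live.
with open(f"{brief.test_name}.json", "w") as f:
    json.dump(asdict(brief), f, indent=2)
```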
2. High Traffic, High Conversion Potential — Everything Else Is UX Research
I don't A/B test pages with low traffic. Full stop.
You can't get statistical significance on a page with 500 monthly visitors for any effect size you'd realistically see. The math doesn't work. That doesn't mean I ignore those pages — I improve them using UX research, qualitative feedback, and design best practices. But I don't pretend I'm "testing" when the sample size makes the result meaningless.
For A/B testing, I focus where the leverage is: high-traffic pages with measurable conversion events that flow to actual business outcomes. Not just clicks. Not just pageviews. Customer acquired. Payment completed. Revenue attributed.
The downstream check I always run: Does this conversion metric actually connect to the business outcome we care about? A landing page lift that doesn't translate to more customers is a vanity metric wearing a lab coat.
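To put numbers behind "the math doesn't work," here's a rough sketch using statsmodels, assuming an illustrative 4% baseline. At 250 visitors per arm, the smallest lift you could detect at 80% power comes out as a multiple of the baseline, not the single-digit improvement a page change actually produces.

```python
# Smallest detectable lift on a 500-visitor page (250 per arm), alpha=0.05, power=0.8.
from math import asin, sin, sqrt
from statsmodels.stats.power import NormalIndPower

baseline = 0.04        # assumed current conversion rate (illustrative)
n_per_arm = 250        # 500 monthly visitors split 50/50

# Solve for the minimum detectable effect size (Cohen's h) at this sample size.
h = NormalIndPower().solve_power(
    effect_size=None, nobs1=n_per_arm, alpha=0.05, power=0.8,
    ratio=1.0, alternative="two-sided",
)

# Convert Cohen's h back into the variant conversion rate it implies.
detectable_rate = sin(h / 2 + asin(sqrt(baseline))) ** 2
print(f"Smallest detectable variant rate: {detectable_rate:.1%} vs {baseline:.1%} baseline")
```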
3. The "One Variable" Rule Is a Luxury Most Teams Can't Afford
This is the most repeated — and most inaccurate — best practice in experimentation.
"Only test one variable at a time." Sure, if you're Amazon with billions of sessions. For most companies, testing a button color or a single word change will never reach statistical significance. The minimum detectable effect is too small relative to your traffic.
What I actually do: I group related changes into a single, testable hypothesis. "A benefit-led hero section converts better than a feature-led hero section" — that's one hypothesis with multiple supporting design changes. It's measurable. It's meaningful. And it produces an insight the team can build on.
I save granular single-variable tests for follow-ups on proven winners, where the baseline is already strong and the question is genuinely about one element.
4. Sample Size Isn't Optional — It's the Entire Point
I calculate minimum sample size before designing the test, not after. Power calculators exist for a reason. Plug in your traffic, your baseline conversion rate, and the minimum effect you want to detect. The output tells you how long to run.
Most tests need 1-2 full business weeks minimum. I account for weekday/weekend cycles, seasonal patterns, and traffic dips. If the math says 4 weeks, I run 4 weeks.
The trap I see constantly: teams start a test, see "80% confidence" after 3 days, and declare victory. That's not how statistics works. You set the significance threshold and the required sample size before the test, and you don't call the result until the full sample is in.
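Here's a minimal version of that calculation, assuming statsmodels, an illustrative 4% baseline, a 10% relative lift as the smallest effect worth detecting, and a made-up weekly traffic figure. Swap in your own numbers.

```python
# Required sample size and run length for a two-sided test at alpha=0.05, power=0.8.
from math import ceil
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.040
target = 0.044               # minimum effect worth detecting (4.0% -> 4.4%)
weekly_visitors = 30_000     # traffic reaching the tested page each week

effect = proportion_effectsize(target, baseline)  # Cohen's h for two proportions
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, ratio=1.0, alternative="two-sided",
)

# Round the run length up to full weeks to respect weekday/weekend cycles.
weeks = ceil(2 * n_per_arm / weekly_visitors)
print(f"Need ~{ceil(n_per_arm):,} visitors per arm -> run at least {weeks} full week(s)")
```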
5. A/A Test Before You A/B Test
Before I run real traffic through a new test setup, I run an A/A test — identical pages, split traffic. If the tool reports a "winner" between two identical experiences, something is broken. Tracking pixel misconfigured. URL redirect leaking. Cookie logic wrong.
This catches problems that would otherwise contaminate weeks of data. I also verify: no duplicate URLs or redirect misalignments, traffic split is within acceptable range (doesn't have to be perfect 50/50, but can't trigger a sample ratio mismatch), and all conversion events fire correctly on both variations.
Then I do a 24-hour check-in after launch. Data flowing? Splits clean? Nothing broken? Good. Now the test is real.
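For the A/A stage, one check worth sketching (statsmodels, with made-up counts) is to run the same significance test you'd use on a real test and confirm the identical arms don't "differ." One caveat: a single A/A run will still come up significant about 5% of the time by pure chance, so treat this as one signal alongside the QA list above, not a verdict on its own.

```python
# Sanity check on A/A data: two identical experiences shouldn't show a significant gap.
from statsmodels.stats.proportion import proportions_ztest

conversions = [412, 398]       # arm A vs arm A' (illustrative)
visitors = [10_050, 9_940]

stat, p_value = proportions_ztest(conversions, visitors)
if p_value < 0.05:
    print(f"p = {p_value:.3f}: identical pages 'differ' -- check tracking, redirects, cookies")
else:
    print(f"p = {p_value:.3f}: no spurious winner detected")
```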
6. The 50/50 Split Doesn't Have to Be Perfect — But It Has a Ceiling
New analysts on my team sometimes flag that the split isn't exactly 50/50. It doesn't need to be. There's an acceptable range of variation between the two sample sizes.
But — and this is critical — it absolutely must stay within range. If it triggers a sample ratio mismatch in your testing tool's diagnostic, stop. QA everything. Figure out why traffic is leaking to one side. A dirty split produces dirty data, and no amount of post-hoc analysis fixes that.
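The standard diagnostic here is a chi-square test against the intended allocation. The sketch below assumes scipy, a 50/50 split, and illustrative counts; the alarm threshold is deliberately strict (a p-value below roughly 0.001) because some day-to-day wobble is expected and shouldn't trip it.

```python
# Sample ratio mismatch check against an intended 50/50 allocation.
from scipy.stats import chisquare

observed = [10_128, 9_872]                 # visitors bucketed into A and B (illustrative)
expected_ratio = [0.5, 0.5]
total = sum(observed)
expected = [r * total for r in expected_ratio]

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:
    print(f"p = {p_value:.2e}: sample ratio mismatch -- stop and QA the split")
else:
    print(f"p = {p_value:.3f}: split is within normal variation")
```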
7. Don't Peek. Period.
Peeking at intermediate results is the most common way to invalidate a test. Early data is noisy. A directional signal on day two with incomplete sample sizes is statistically meaningless.
I've seen teams stop tests because the variant "looked like it was winning" at 40% of required sample size. That's not data-driven — that's anxiety-driven.
The only reason to stop early: A technical issue — sample ratio mismatch, broken tracking, implementation bug. Fix it, QA, restart clean.
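If you want to see the damage, here's a small simulation sketch (numpy and statsmodels, made-up traffic numbers). Both arms share the same true rate, yet stopping at the first "significant" daily check produces a false winner far more often than the nominal 5%.

```python
# Simulate A/A tests checked daily: peeking inflates the false positive rate.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(7)
true_rate, daily_visitors, days, n_sims = 0.04, 2_000, 14, 2_000

peek_wins = end_wins = 0
for _ in range(n_sims):
    # Cumulative conversions per arm, same true rate in both.
    a = rng.binomial(daily_visitors, true_rate, size=days).cumsum()
    b = rng.binomial(daily_visitors, true_rate, size=days).cumsum()
    n = daily_visitors * np.arange(1, days + 1)
    pvals = [proportions_ztest([a[d], b[d]], [n[d], n[d]])[1] for d in range(days)]
    peek_wins += any(p < 0.05 for p in pvals)   # stop at the first "significant" day
    end_wins += pvals[-1] < 0.05                # evaluate only at the full sample

print(f"Peeking daily: {peek_wins / n_sims:.1%} of A/A tests get a false 'winner'")
print(f"Evaluating only at full sample: {end_wins / n_sims:.1%}")
```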
8. The Story You Tell About Results Matters as Much as the Results
This is the part nobody writes about.
Early in a program's life, how you communicate results determines whether experimentation survives. Not every test wins. Win rates across the industry hover around 15-30%. Stakeholders need to understand that learning is the output, not just lifts.
But — and this is the nuance — you also need to build proof and authority early. Frame honestly, but strategically. Show the business impact when there is one. Show what you learned when there isn't. Get results in front of decision-makers through every channel: presentations, business reviews, email summaries, Slack updates.
The teams that bury their results in a spreadsheet are the teams that lose their budget.
9. Document Like Your Program Depends on It — Because It Does
Every test gets a full write-up: hypothesis, methodology, results, statistical significance, limitations, new hypotheses generated, and recommended next steps.
This isn't busywork. It's your experimentation knowledge base. Six months from now, when someone proposes a test you already ran, you need that institutional memory. When you're advocating for resources, you need the cumulative evidence.
Most teams track tests in Excel or PowerPoint. That works for logging, but those tools aren't built for institutional memory. The documentation layer is where programs compound over time: the insights, the recommendations, the follow-ups.
10. Iterate on Winners. That's Where the Compounding Happens.
A winning test isn't the finish line — it's the starting point.
Roll out the winner on the tested page first. Monitor at full traffic. Then expand to related pages incrementally, customizing minimally per context. Each expansion is a chance to validate that the insight generalizes.
Follow-up tests that build on a proven insight compound faster than starting fresh each time. The best experimentation programs don't run random tests — they build on what they've already proven works.
If you want the tools to actually run this process — test briefs, sample size calculators, result distribution, knowledge base — that's what GrowthLayer does.