Every experimentation program I have ever built eventually hit the same wall. The team that started the program is not the team that scales it. People join. People leave. Priorities shift. Stakeholders change. The only thing that survives all of that turnover is the process — and only if you wrote it down.
"You need standard processes so that no matter who comes in, they understand how we run tests. What confidence means. What power level we use. What's an acceptable MDE, what's not. When do we run tests, when do we turn them off. People come and go, but the process stays." — Atticus Li
Standards are not bureaucracy. They are the thing that lets a two-person team eventually become a ten-person team without every new hire reinventing the rigor from scratch.
The Scaling Wall
"As you scale up tests, you're going to hit a ceiling. It might be a tool issue, a people issue, a process issue, or a culture issue. You're going to find out it's one of those. But one of the biggest mistakes is teams running a lot of tests without focus — just running tests for the sake of running tests." — Atticus Li
When I scaled one experimentation program from 20 tests a year to more than 100, the first thing that broke was not the tooling or the traffic. It was the informal knowledge. Things that "everyone knew" when the team was three people stopped being known when the team was seven.
- A new analyst did not know what confidence threshold to apply because nobody had written it down.
- A new developer did not know which QA steps were mandatory and which were optional.
- A new CRO manager did not know how to handle a sample ratio mismatch because the convention had been verbal.
Each of these is a small gap. Together they cause test quality to drop just as the volume is rising. Win rates fall. Stakeholders start to lose confidence. The credibility you built at 20 tests a year gets eroded in the scramble to ship 100.
The fix is not to slow down. It is to write the standards down, aggressively, before you scale — and to treat the standards as living documents that every new team member reads on day one.
What Needs to Be Standardized
Here are the decisions every experimentation program needs to make explicitly, in writing, before it can scale:
1. Confidence threshold
What significance level (or Bayesian probability) is required to call a test a winner? Is it different for low-risk vs. high-risk tests? Document the threshold, the reason, and the exceptions.
2. Statistical power
What power are you targeting in pre-test calculations? The industry default is 80%, but some programs use a higher target for high-stakes tests. Whatever you choose, document it.
3. Minimum detectable effect (MDE) policy
What is the smallest lift you are willing to try to detect? This determines which tests are worth running at all. A page with 3k weekly sessions cannot detect a 2% relative lift in a reasonable timeframe — do not pretend otherwise.
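The "do not pretend otherwise" claim is easy to verify with the standard two-proportion sample-size formula. This sketch uses illustrative assumptions not stated above — a 3% baseline conversion rate, 95% confidence, 80% power — to show why a 2% relative lift on 3k weekly sessions is out of reach:

```python
from statistics import NormalDist

def sample_size_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # e.g. 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ((z_alpha + z_beta) ** 2 * variance) / (p2 - p1) ** 2

# Illustrative assumptions: 3% baseline conversion, 2% relative lift
n = sample_size_per_arm(baseline=0.03, relative_lift=0.02)
# Both arms split the page's 3k weekly sessions
weeks = (2 * n) / 3000
print(f"{n:,.0f} sessions per arm -> {weeks:,.0f} weeks")
```

Under these assumptions the test needs over a million sessions per arm — hundreds of weeks at 3k sessions per week. An MDE policy exists precisely to kill that test at intake, before anyone builds it.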
4. Test duration policy
When do you stop a test? At significance, at time elapsed, at sample size, or at some combination? Predetermined stopping rules are the difference between rigorous testing and p-hacking.
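A stopping policy is simple enough to express as a single function, which also makes it hard to argue with mid-test. This is a sketch of one possible rule, not a prescription — the gate values (14 days, 20k per arm, alpha of 0.05) are hypothetical defaults a team would set in its own standard:

```python
def can_stop(days_running, n_per_arm, p_value,
             min_days=14, required_n=20_000, alpha=0.05):
    """Pre-registered stopping rule: the duration gate AND the sample-size
    gate must both pass before significance is even consulted.
    This is what prevents 'peeking' at an early, noisy p-value."""
    if days_running < min_days or n_per_arm < required_n:
        return False, "keep running: gates not met"
    if p_value <= alpha:
        return True, "stop: significant winner"
    return True, "stop: gates met, no detectable effect"

# A significant p-value on day 7 does NOT stop the test
print(can_stop(days_running=7, n_per_arm=30_000, p_value=0.01))
# The same p-value after the gates are met does
print(can_stop(days_running=21, n_per_arm=25_000, p_value=0.01))
```

The key design choice is ordering: significance is only evaluated after the objective gates pass, so nobody can stop early because the chart looks good.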
5. Sample ratio mismatch policy
What is the threshold at which SRM triggers a test invalidation? How is SRM investigated? Who gets notified?
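An SRM policy is easiest to enforce when the check itself is mechanical. The conventional approach is a chi-square goodness-of-fit test against the intended split; the p < 0.001 trigger below is a widely used convention, but it is an assumption here, since the text leaves the threshold to each team:

```python
import math

def srm_check(observed_a, observed_b, expected_ratio=0.5, threshold=0.001):
    """Chi-square goodness-of-fit test (1 degree of freedom) for
    sample ratio mismatch against the intended traffic split."""
    total = observed_a + observed_b
    exp_a = total * expected_ratio
    exp_b = total * (1 - expected_ratio)
    chi2 = (observed_a - exp_a) ** 2 / exp_a + (observed_b - exp_b) ** 2 / exp_b
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p_value, p_value < threshold

# Hypothetical counts: a 50/50 split that actually delivered 50,700 vs 49,300
p, mismatch = srm_check(50_700, 49_300)
print(f"p = {p:.2e}, SRM flagged: {mismatch}")
```

A 0.7% imbalance on 100k visitors looks harmless but fails this check decisively — which is exactly why the policy has to be automated rather than eyeballed.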
6. QA checklist
Every test goes through a QA checklist before it launches. What is on the checklist? Who signs it off? What is the blocker policy if something fails?
7. Reporting format
Every test result is reported in a standard format. Same sections, same dollar-value framing, same disclosure of limitations. Consistency is how stakeholders learn to read results quickly.
8. Intake and prioritization criteria
How does an idea get from a backlog into the next test slot? What scoring framework do you use (ICE, PIE, PRISM)? Who has authority to override prioritization?
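Whichever framework the team picks, the standard should pin down the exact arithmetic so two people scoring the same idea get the same number. As one illustration, here is ICE with the averaging variant (some teams multiply the three scores instead); the backlog items and ratings are invented for the example:

```python
def ice_score(impact, confidence, ease):
    """ICE score: mean of 1-10 ratings for Impact, Confidence, Ease.
    (An alternative convention multiplies the three instead.)"""
    return (impact + confidence + ease) / 3

# Hypothetical backlog: (idea, impact, confidence, ease)
backlog = [
    ("Simplify checkout form", 8, 6, 4),
    ("New hero headline", 4, 7, 9),
    ("Rework pricing page", 9, 5, 3),
]

ranked = sorted(backlog, key=lambda item: ice_score(*item[1:]), reverse=True)
for name, *scores in ranked:
    print(f"{ice_score(*scores):.1f}  {name}")
```

Note how the low-impact but easy headline test outranks the high-impact pricing rework — which is precisely the kind of outcome the override authority in the standard exists to adjudicate.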
9. Holdout policy
Do you run post-launch holdouts on winners? For how long? For which kinds of tests? This is critical for validating that lifts persist beyond the test window.
10. Learning documentation
Every test, win or lose, produces a learning. Where is that learning stored? How is it searchable? How do new team members find prior learnings relevant to their current test?
Standards Are Not Dogma
One reason teams resist standards is a fear that they will calcify the program. "If we write everything down, we cannot adapt." This is the wrong concern. Good standards are versioned. They are revised. They are the starting point, not the ceiling.
"If someone new comes in with a different way of running tests, they can look at our process and give recommendations — but they'll also understand why we do the things we do. If their method is better, we take it, implement it, and learn from it." — Atticus Li
The point of the standard is to make disagreement productive. Without a documented baseline, every new hire's preferences become personal battles. With a documented baseline, preferences become proposals to revise a process. That is healthy. It is also much faster.
Writing Standards That People Actually Read
Most experimentation standards documents fail because they are written like regulatory filings. They are long, dense, structured like legal contracts, and nobody reads them after the first day. Here is how to write standards that stick:
Make it one page per topic, maximum. If the confidence threshold policy needs more than one page to explain, something is wrong. Tight standards survive. Sprawling ones die.
Use examples, not just rules. For every standard, show a concrete example of how to apply it. "Here is a test where we used 80% confidence. Here is a test where we used 95%. Here is why they are different."
Note the rationale. Why is the standard what it is? A team member who understands the reason can reason their way through edge cases. A team member who only knows the rule cannot.
Name an owner. Every standard has a single human who is responsible for keeping it current. If nobody owns it, it decays.
Version it. Every change to a standard is a versioned change with a date and a reason. That way you can look back and see how the program's thinking has evolved.
The Onboarding Multiplier
Here is the return on investment of good standards: a new hire can ramp to full productivity in weeks instead of months.
When the process is documented, the new CRO manager reads the confidence threshold policy, reads the QA checklist, reads the reporting format, and within a day they know how to run a test. They are not waiting for an available senior person to answer basic questions. They are not inventing their own version of rigor. They are executing on a standard the team has already agreed to.
That onboarding advantage compounds as the team grows. A program with three people does not care much about onboarding speed. A program that wants to grow to ten people cares a lot. The difference between great standards and mediocre standards becomes the difference between shipping 100 tests a year and shipping 40.
The Dependencies Between Standards
One thing I have learned the hard way: standards depend on each other, and you cannot skip the dependencies.
You cannot have a real confidence threshold policy without a real MDE policy, because the MDE determines how long you have to run to hit confidence. You cannot have a real intake process without a real scoring framework, because scoring is how the intake gets triaged. You cannot have a real post-test reporting format without agreement on how realized impact is calculated.
Write your standards as a connected document, not a series of independent pages. Show the relationships explicitly. A new hire should be able to trace the flow from "idea arrives" to "result reported" and see how each standard feeds the next.
FAQ
Should standards be the same for high-traffic and low-traffic pages?
Mostly yes, with explicit exceptions. The core principles — confidence, power, QA — should be consistent. But duration, MDE, and test type will necessarily differ based on traffic. Document the exceptions and the reasoning.
How do you update standards without disrupting the team?
Run a monthly or quarterly review. Anyone on the team can propose a revision. Proposals get discussed and versioned, and the team decides to adopt or reject them. Standards that evolve predictably are respected. Standards that change randomly are resented.
What if leadership resists standardization?
Show them the cost of inconsistency. Pull examples of tests where different team members made different decisions on the same edge case. Tie those inconsistencies to lost learnings or bad outcomes. Leadership will see that standardization is how you scale without losing quality.
How do you handle cases where the standard doesn't apply?
Every standard should have an "escalation path" section. When in doubt, escalate to the owner of the standard. That way the gap gets noticed and the standard gets updated. Silent workarounds are the start of process decay.
Build the Standards That Let You Scale
Most experimentation programs stall at the same point: when the team grows past the size where informal knowledge works. The fix is to write the standards down before you need them, not after.
I built GrowthLayer specifically to encode these standards into a working tool — confidence thresholds, QA checklists, prioritization scoring, reporting templates, and a versioned learnings library. It is the system I wish I had when I was scaling enterprise programs from scratch.
If you are hiring or growing into roles that require structured program management, explore open CRO and experimentation roles on Jobsolv.
Or book a consultation and I will help you audit and build the standards that let your program scale without losing rigor.