A/B Testing Documentation Framework: Templates, Metadata Standards, and the Hypothesis Reuse Rate
TL;DR: The point of documenting experiments isn't to record what happened. It's to make the next similar hypothesis sharper than this one was. The Hypothesis Reuse Rate measures whether you're pulling that off.
Key Takeaways
- Experiment documentation fails when it's treated as record-keeping instead of input for the next test
- The Hypothesis Reuse Rate measures whether new test hypotheses are built on past test insights or written from scratch — most programs land below 30% and don't realize it
- Problem statements, hypotheses, primary/secondary/guardrail metrics, audience targeting, and power analysis should all appear in one standardized template at launch, not after results
- Test IDs with consistent naming conventions separate searchable archives from filing-cabinet archives
- Automated data collection and monitoring are what make documentation sustainable at volume — manual tracking decays within a quarter
Documentation Isn't Record-Keeping
Most experiment documentation is written as if the primary audience is a future auditor. Problem statement, hypothesis, variants, results, decision. All there, all accurate, all essentially useless when a new test gets designed six months later — because nobody reads it that way.
The right framing: documentation is an input to the next hypothesis. Every documented experiment should make the next similar hypothesis faster to write and sharper to design. If it doesn't, the documentation is filler.
"You document experiments because your brain can't hold them. That's the whole job." — Atticus Li
This reframing changes what gets recorded. Record-keeping optimizes for completeness. Input optimizes for retrieval — what will a practitioner want to know when they're designing a related test next quarter?
The Hypothesis Reuse Rate
Here's the metric:
HRR = New hypotheses that reference at least one past experiment / Total new hypotheses written
Reference means: the hypothesis document cites a specific prior test, pulls a baseline from an archive entry, or explicitly builds on or contradicts an earlier result.
Interpretation thresholds:
- HRR above 60% — Strong archive usage. Documentation is doing its job.
- HRR between 30% and 60% — Typical for well-run programs. Archive is used but not systematically.
- HRR between 10% and 30% — Most programs land here. The archive exists; it's mostly unread.
- HRR below 10% — Each new test starts from scratch. Documentation is decorative.
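The definition and thresholds above can be computed directly from an archive export. A minimal sketch, assuming each hypothesis record carries a `references` list of cited test IDs (the field name and record shape are illustrative, not a standard):

```python
def hypothesis_reuse_rate(hypotheses):
    """HRR = hypotheses citing at least one past experiment / total hypotheses."""
    if not hypotheses:
        return 0.0
    reused = sum(1 for h in hypotheses if h.get("references"))
    return reused / len(hypotheses)

def classify_hrr(hrr):
    """Map an HRR value onto the interpretation thresholds above."""
    if hrr > 0.60:
        return "strong archive usage"
    if hrr >= 0.30:
        return "typical, not systematic"
    if hrr >= 0.10:
        return "archive exists, mostly unread"
    return "decorative documentation"

# One quarter's hypothesis intake (hypothetical data).
quarter = [
    {"id": "H-101", "references": ["EXP-042"]},
    {"id": "H-102", "references": []},
    {"id": "H-103", "references": ["EXP-017", "EXP-042"]},
    {"id": "H-104", "references": []},
]
print(f"HRR: {hypothesis_reuse_rate(quarter):.0%}")  # HRR: 50%
print(classify_hrr(hypothesis_reuse_rate(quarter)))
```

Counting only hypotheses with a non-empty `references` list keeps the metric honest: a link pasted into a doc counts, a vague "we've tested this area before" does not.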
What to Include in an Experiment Doc
Problem Statement. Specific user or business problem being addressed. Metrics that indicate the problem (conversion rate, bounce rate, session duration). Assumptions about causes.
Hypothesis. Format: "If we [change], [user segment] will [outcome] because [mechanism]." Link to the behavioral or business mechanism being tested.
Primary Metric. Single metric that drives the decision. Tied to a business outcome, not an activity count.
Secondary Metrics. 2-3 metrics that capture related effects. Used to interpret primary-metric results.
Guardrail Metrics. Pre-declared metrics that trigger rollback if they degrade. Protects against primary winners that cause secondary disasters.
Variants. Description of each variant with screenshots for UI tests. Link to PRDs for functional tests. Clear labeling of control vs. treatment.
Audience Targeting. User segments included and excluded. Device types, regions, new vs. returning visitors.
Allocation and Power Analysis. Traffic split, expected sample size, calculated MDE, expected runtime.
Monitoring Plan. Which dashboards track which metrics. What triggers an alert. Who owns the watch.
Results. Observed effects with confidence intervals. SRM check outcome. Decision made and reasoning.
Ten sections. Every test. No exceptions. Consistency is what makes documentation compound.
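One way to make the ten sections genuinely non-optional is to encode the template as a typed record, so a test can't be registered until every pre-launch field is filled. A minimal sketch; the field names mirror the sections above but are an assumption, not a standard schema:

```python
from dataclasses import dataclass, fields

@dataclass
class ExperimentDoc:
    # The ten sections. Everything except results is required at launch.
    problem_statement: str
    hypothesis: str            # "If we [change], [segment] will [outcome] because [mechanism]"
    primary_metric: str
    secondary_metrics: list
    guardrail_metrics: list
    variants: dict             # e.g. {"control": "...", "treatment": "..."}
    audience_targeting: str
    power_analysis: dict       # traffic split, sample size, MDE, expected runtime
    monitoring_plan: str
    results: str = ""          # appended after the run; the plan stays fixed

    def launch_ready(self):
        """True only when every pre-launch field is non-empty."""
        required = [f.name for f in fields(self) if f.name != "results"]
        return all(getattr(self, name) for name in required)

# Hypothetical example entry.
doc = ExperimentDoc(
    problem_statement="Checkout abandonment spikes at the payment step",
    hypothesis="If we surface saved cards first, returning users will "
               "complete payment more often because entry friction drops",
    primary_metric="payment_completion_rate",
    secondary_metrics=["time_to_pay"],
    guardrail_metrics=["refund_rate"],
    variants={"control": "current card order", "treatment": "saved cards first"},
    audience_targeting="returning users, all devices, all regions",
    power_analysis={"split": "50/50", "mde": 0.02, "runtime_days": 14},
    monitoring_plan="daily check on payments dashboard",
)
print(doc.launch_ready())  # True
```

The `launch_ready` gate is what turns the "no exceptions" rule from a norm into a mechanism: intake tooling can simply refuse to create a test ID for a document that fails it.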
Metadata Standards That Matter
Test ID. Unique identifier for every test. Never reused, never reassigned.
Naming Convention. Consistent format: "Product_Surface_Feature_Metric_Date" or similar. Searchable by any dimension.
Test Duration. Start and end dates. Enables seasonality analysis across archives.
Data Sources. Every analytics platform or data warehouse the test drew from. Links to dashboards, Figma files, and PRDs.
Audience Segmentation. Consistent vocabulary for user segments. "Mobile-first users" not "mobile people."
Outcome Classification. Win, loss, inconclusive, or cancelled. With win-rate categorization by hypothesis type, this becomes meta-analysis-ready.
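The naming convention and outcome classification above are cheap to enforce mechanically at intake. A minimal sketch, assuming the hypothetical "Product_Surface_Feature_Metric_Date" convention with an eight-digit date:

```python
import re

# Hypothetical convention: Product_Surface_Feature_Metric_YYYYMMDD
TEST_ID_PATTERN = re.compile(r"^[A-Za-z]+_[A-Za-z]+_[A-Za-z]+_[A-Za-z]+_\d{8}$")

# The four outcome classes from the section above.
OUTCOMES = {"win", "loss", "inconclusive", "cancelled"}

def validate_metadata(test_id, outcome):
    """Return a list of problems that would make the archive un-searchable."""
    errors = []
    if not TEST_ID_PATTERN.match(test_id):
        errors.append(f"test ID does not follow the naming convention: {test_id!r}")
    if outcome not in OUTCOMES:
        errors.append(f"outcome is not a controlled value: {outcome!r}")
    return errors

print(validate_metadata("Checkout_Mobile_SavedCards_ConversionRate_20250114", "win"))  # []
print(validate_metadata("checkout test v2", "winner"))
```

Rejecting bad metadata at write time is much cheaper than normalizing it at read time, which is when most teams discover the drift.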
Common Mistakes in Documentation
Inconsistent templates. Different team members documenting differently. Archive becomes un-searchable within a quarter.
Writing documentation after results are in. Hindsight reshapes the hypothesis. Pre-launch documentation captures what was actually believed, which is more useful for future hypothesis review.
Recording only the decision. "We shipped variant B." Next person has no idea why.
Free-text tags. "Checkout" and "Checkout Flow" and "Checkout funnel" should be one tag. Controlled vocabularies prevent this.
No reference to prior tests. New hypotheses written in isolation suggest the archive isn't being consulted. This is exactly what HRR measures.
Automation That Sustains Documentation
Manual documentation decays. The teams that maintain quality documentation at scale automate the repetitive parts:
Template-driven intake. Test launch triggers creation of a structured document with required fields. A test can't launch until they're filled in.
Dashboard linking. Automatic linking of experiment IDs to monitoring dashboards, so the results section populates itself as the test runs.
Metadata enforcement. Controlled vocabularies enforced at tag selection, not free-text input. Prevents drift.
Search integration. Past experiments automatically surfaced during new hypothesis intake when tags or audience definitions match.
Programs that automate see HRR rise. Programs that rely on manual discipline see it fall over time.
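The metadata-enforcement step above can be sketched as a normalization map applied when a tag is selected: free-text variants collapse to one canonical tag, and anything unknown is rejected rather than silently stored. The vocabulary below is illustrative:

```python
# Illustrative controlled vocabulary: free-text variants map to one canonical tag.
CANONICAL_TAGS = {
    "checkout": "checkout_flow",
    "checkout flow": "checkout_flow",
    "checkout funnel": "checkout_flow",
    "mobile people": "mobile_first_users",
    "mobile-first users": "mobile_first_users",
}

def normalize_tag(raw):
    """Return the canonical form of a tag; reject tags outside the vocabulary."""
    key = raw.strip().lower()
    if key in CANONICAL_TAGS:
        return CANONICAL_TAGS[key]
    if key in CANONICAL_TAGS.values():
        return key  # already canonical
    raise ValueError(f"unknown tag {raw!r}: add it to the vocabulary first")

print(normalize_tag("Checkout Funnel"))  # checkout_flow
```

The deliberate design choice is the `ValueError`: drift happens when unknown tags are accepted, so expanding the vocabulary must be an explicit act, not a side effect of typing.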
Common Failures in Testing Workflows
The patterns that break documentation:
Tests started without full specs. The team says they'll fill in the details later. They don't.
Results interpretation happens in chat threads. Insights documented in Slack are insights that will be gone in six months.
Inconsistent ownership. When documentation ownership rotates without handoff, quality drops at every transition.
Archive spring-cleaning. Teams that periodically "clean up" the archive by deleting old tests destroy meta-analysis capacity.
Advanced: Searchable Qualitative Learnings
Beyond structured metadata, archives benefit from free-text qualitative learnings — one to two paragraphs per test describing what was surprising, what the numbers missed, what a follow-up test should consider. These are the fields that inform hypothesis reuse most directly.
Make them searchable through full-text indexing. The archive should support both "show me all tests on checkout" and "show me all tests where the secondary metric moved opposite to the primary."
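The full-text side can be sketched with a small inverted index over the learnings field; a real archive would sit behind a search engine, but the mechanics are the same. Field names and the sample entries are assumptions:

```python
import re
from collections import defaultdict

def build_index(archive):
    """Inverted index: lowercase word -> set of test IDs whose learnings mention it."""
    index = defaultdict(set)
    for entry in archive:
        for word in re.findall(r"[a-z]+", entry["learnings"].lower()):
            index[word].add(entry["test_id"])
    return index

def search(index, query):
    """Return test IDs whose learnings contain every word in the query."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical archive entries.
archive = [
    {"test_id": "EXP-017",
     "learnings": "Secondary metric moved opposite to the primary; suspect cannibalization."},
    {"test_id": "EXP-042",
     "learnings": "Checkout redesign won on conversion but slowed page load."},
]
idx = build_index(archive)
print(search(idx, "secondary metric opposite"))  # {'EXP-017'}
```

Even this crude AND-of-words retrieval answers the second kind of query in the paragraph above, which pure tag metadata cannot.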
Frequently Asked Questions
What's the minimum documentation per test?
Problem statement, hypothesis, primary metric, sample size, result, decision. About 30 minutes of writing per test. Below this, nothing compounds.
Should documentation be written before or after the test?
Before. Pre-registration prevents post-hoc interpretation. Results get added at the end, but the plan stays fixed.
How do I get the team to actually maintain documentation?
Two things: make it easy (templates, automation, required fields) and measure HRR quarterly. When the team sees that tests referencing the archive have higher win rates, adoption follows.
What's the right tool?
Notion, Confluence, or purpose-built platforms (GrowthLayer) all work. The tool matters less than the discipline. Any tool that supports templates and search works.
How often should documentation be reviewed?
Active documentation during the test. Retrospective documentation within one week of results. Archive-level audits quarterly to catch drift and ensure tags are normalized.
Methodology note: Hypothesis Reuse Rate thresholds reflect patterns observed across mid-market experimentation programs. Specific figures are presented as ranges. Template structures draw on established practice in experimentation management.
---
Structured documentation compounds into a searchable archive. Browse the GrowthLayer test library for examples of what consistent documentation looks like at scale.