A/B Test Repository Architecture: Schema, Tagging, and the Retrieval Time Budget
TL;DR: A/B test repositories don't fail because the schema is wrong. They fail because nobody can find what they need in under two minutes. The Retrieval Time Budget is what separates a useful repository from a graveyard.
Key Takeaways
- Repositories get used when they contain learnings that save time — everything else is filing
- The Retrieval Time Budget is the single strongest predictor of repository adoption: median time from "I need a past test" to "I found the relevant one"
- Schema design with 26 core fields (Experiment ID, hypothesis, metrics, funnel stage, decision) supports both retrieval and meta-analysis
- Tagging frameworks require controlled vocabularies — free-text tags degrade into chaos within a year and make retrieval unreliable
- Spreadsheets and generic task tools (Jira, Notion) fail at repository scale because they optimize for writing, not retrieval
What Makes a Repo Get Used
Most A/B test repositories exist. Most aren't used. The difference between the two isn't schema elegance or feature completeness. It's whether a practitioner designing a new test can find relevant past tests fast enough to actually do it.
"Nobody opens a repo for data. They open it for learnings." — Atticus Li
The question to ask of any repository: when someone is writing a new hypothesis on pricing-page design, can they retrieve every past pricing-page test in under two minutes? If yes, the repository gets used. If no, it doesn't — regardless of how complete the schema is.
This reframes the design problem. Schema, tagging, and retrieval aren't three separate things to optimize. They're three components of one thing: retrieval speed at scale.
The Retrieval Time Budget
Here's the metric:
RTB = Median time from "I need to find a past test" to "I found the relevant ones"
Interpretation thresholds:
- RTB under 2 minutes — Strong. Practitioners reach for the archive by default.
- RTB 2-10 minutes — Usable but friction-heavy. Archive used for deliberate searches, not quick checks.
- RTB 10-30 minutes — Too slow. Practitioners only use the archive when forced.
- RTB over 30 minutes — Effectively unusable. The archive is storage, not infrastructure.
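The thresholds above can be sketched as a small classifier; the function name is illustrative, and the cutoffs come straight from the list:

```python
def classify_rtb(rtb_minutes: float) -> str:
    """Map a median retrieval time (minutes) to the adoption thresholds above."""
    if rtb_minutes < 2:
        return "strong"      # practitioners reach for the archive by default
    if rtb_minutes <= 10:
        return "usable"      # friction-heavy; deliberate searches only
    if rtb_minutes <= 30:
        return "too slow"    # used only when forced
    return "unusable"        # storage, not infrastructure

print(classify_rtb(1.5))  # strong
```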
Cut RTB by half and adoption typically doubles. This is the leverage point most teams miss.
Schema Design: 26 Core Fields
A repository schema that supports retrieval and meta-analysis should capture these fields per experiment:
Identity and ownership: Experiment ID, Name, Owner, Team/Pod, Status.
Timing: Start Date, End Date, Duration.
Scope: Product Area/Surface, Funnel Stage, User Segment, Eligibility Rules.
Hypothesis and design: Problem Statement, Hypothesis, Variant Descriptions, Traffic Allocation.
Metrics and power: Primary Metric, Secondary Metrics, Guardrail Metrics, Minimum Detectable Effect (MDE), Sample Size.
Results and decisions: Directional Outcome, Statistical Significance, Confidence Interval, Ship/Kill/Iterate Decision, Decision Rationale.
Timestamp every event and fact to support auditability. Recording decisions as "ship," "kill," or "iterate" gives future analysts the transparency they need. Linking schema data to business tables enables accurate performance-metric computation across segments.
Immutable assignment logs. Never modify assignment tables. If configuration changes mid-flight, record a new version. Canonical records also keep duplicated tests from contaminating later analyses.
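The append-only versioning rule can be sketched as follows; the class and method names are hypothetical, but the invariant is the one stated above: a mid-flight change creates a new version and never rewrites history:

```python
from datetime import datetime, timezone

class ExperimentConfigLog:
    """Append-only configuration log: changes create new versions,
    earlier entries are never mutated (illustrative sketch)."""

    def __init__(self):
        self._versions = []  # list of (version, utc_timestamp, config)

    def record(self, config: dict) -> int:
        version = len(self._versions) + 1
        # Store a copy so callers cannot mutate recorded history.
        self._versions.append((version, datetime.now(timezone.utc), dict(config)))
        return version

    def current(self) -> dict:
        return dict(self._versions[-1][2])

    def history(self):
        return list(self._versions)

log = ExperimentConfigLog()
log.record({"traffic": {"control": 0.5, "variant": 0.5}})
log.record({"traffic": {"control": 0.3, "variant": 0.7}})  # mid-flight change -> version 2
```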
Tagging Framework That Scales
Consistent tagging is the precondition for retrieval. Inconsistent tags — "signup" vs. "registration" vs. "onboarding" for the same concept — fragment the archive and make search unreliable.
Tag dimensions that matter:
- Audience type (new users, returning, power users, free-tier)
- Device platform (mobile, desktop, tablet, responsive)
- Funnel stage (acquisition, activation, retention, revenue, referral)
- Hypothesis type (reduce friction, add social proof, clarify value, adjust pricing)
- Risk level (high-risk, medium, low — affects review process)
AI-generated tags require review. Automated tagging can produce candidates, but consistency requires human normalization. Clean tags regularly to prevent drift.
Controlled vocabulary over free text. A fixed tag list with clear definitions beats flexibility every time. The tradeoff — some tests don't fit perfectly — is worth the retrieval reliability you gain.
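A controlled vocabulary is easy to enforce in code. This is a minimal sketch; the tag list and alias table are invented examples, but the pattern — normalize known aliases, reject anything outside the vocabulary — is the point:

```python
# Canonical tags plus an alias table mapping free-text variants onto them.
CANONICAL_TAGS = {"signup", "checkout", "pricing", "onboarding"}
ALIASES = {
    "registration": "signup",
    "sign-up": "signup",
    "checkout flow": "checkout",
}

def normalize_tag(raw: str) -> str:
    """Map a raw tag to its canonical form, or fail loudly."""
    tag = raw.strip().lower()
    tag = ALIASES.get(tag, tag)
    if tag not in CANONICAL_TAGS:
        raise ValueError(f"unknown tag {raw!r}: add it to the vocabulary first")
    return tag

print(normalize_tag("Registration"))   # signup
print(normalize_tag("checkout flow"))  # checkout
```

Raising on unknown tags is deliberate: the failure mode of free-text tagging is silent divergence, and a hard error at write time is what keeps the vocabulary authoritative.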
Retrieval Optimization
Search by multiple dimensions. Filter by funnel stage AND user segment AND result direction. Simple keyword search isn't enough at archive scale.
Normalized tags, not free text. If "checkout" and "checkout flow" don't merge, the archive is already broken.
Canonical records with version history. One authoritative entry per experiment, with changes tracked. Duplicates and fragmented records destroy retrieval.
Scalability built in. What works at 50 tests annually fails at 500 annually. Design for the volume you'll have in 18 months, not the volume you have now.
Integration with design tools. Linked Figma files, PRD references, dashboard pointers. Context is what turns data into learning.
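Multi-dimension filtering over normalized records reduces to an AND across fields. A minimal sketch, with illustrative field names and sample records:

```python
def find_tests(archive, **filters):
    """Return records matching every given dimension (AND semantics).
    `archive` is a list of dicts keyed by normalized schema fields."""
    return [
        rec for rec in archive
        if all(rec.get(key) == value for key, value in filters.items())
    ]

archive = [
    {"id": "EXP-101", "funnel_stage": "revenue", "segment": "new", "result": "win"},
    {"id": "EXP-102", "funnel_stage": "revenue", "segment": "returning", "result": "loss"},
    {"id": "EXP-103", "funnel_stage": "activation", "segment": "new", "result": "win"},
]

hits = find_tests(archive, funnel_stage="revenue", segment="new", result="win")
print([r["id"] for r in hits])  # ['EXP-101']
```

Note that this only works because the field values are normalized; with free-text entries, equality filters silently miss records, which is the failure described above.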
Common Pitfalls
Over-reliance on spreadsheets and generic tools. Spreadsheets fail with inconsistent terminology and no normalization. Jira buries experimental insights in task discussions. Confluence and Notion produce duplicate entries and unstructured searches. None scale past 30-50 tests.
Lack of standardization between teams. Product, marketing, and UX each use different schemas and tags. Cross-team retrieval becomes impossible. The cost of fragmentation compounds over time.
Ignoring retrieval in the design phase. Teams design for capture (what needs to be recorded) without designing for retrieval (what will need to be found). This is the most common architectural failure.
Best Practices for Long-Term Success
Regular updates and maintenance. Bi-weekly or monthly schema review. Tag normalization. Archive of outdated or failed tests. Maintenance is required, not optional.
Cross-team collaboration. Shared tagging frameworks. Canonical experiment records with clear ownership. Standardized schema enforced across product, marketing, and UX.
Training on proper use. Onboarding that includes schema walkthrough. Tagging conventions taught during first-week experiment launches. Retrieval practice with real past tests. Adoption is a training problem as much as a tool problem.
Tools and Platforms
Custom-built solutions offer control but require significant engineering investment and ongoing maintenance. They fit orgs with unique requirements or strict compliance needs.
Third-party tools:
- Optimizely — enterprise-focused, strong integrations, steep learning curve.
- Eppo — warehouse-native, connects to Snowflake/BigQuery/Redshift/Databricks.
- LaunchDarkly — real-time feature flags and client-side experimentation.
- GrowthBook — open-source flexibility, SQL-first analysis.
- Split.io — multivariate and complex test management.
- VWO — session recording, heatmaps, client-side optimization.
Integration with experimentation platforms. Real-time processing via Kafka or equivalent. Automated metric computation against data warehouses. Bayesian methods, CUPED variance reduction, SRM detection. Audit trails and version control.
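Of the checks listed above, SRM detection is the simplest to sketch. This is a two-arm chi-square goodness-of-fit test (df=1, where 3.841 is the standard 5% critical value); the function name is illustrative:

```python
def srm_check(observed: dict, expected_ratio: dict, crit: float = 3.841) -> bool:
    """Sample Ratio Mismatch check for a two-arm test: compare observed
    assignment counts against the planned split with a chi-square statistic
    (df=1; crit=3.841 is the 5% critical value). True means suspicious skew."""
    total = sum(observed.values())
    stat = sum(
        (observed[arm] - total * expected_ratio[arm]) ** 2
        / (total * expected_ratio[arm])
        for arm in observed
    )
    return stat > crit

# Planned 50/50 allocation, but the assignment log shows a skewed split:
print(srm_check({"control": 5200, "variant": 4800},
                {"control": 0.5, "variant": 0.5}))  # True -> investigate before analyzing
```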
Common Questions
What's the minimum viable repository?
A structured spreadsheet with 10 required fields: test ID, name, hypothesis, primary metric, start date, end date, result (win/loss/inconclusive), significance, decision, reasoning. Below this, you have storage, not a repository.
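Even a spreadsheet-backed minimum viable repository benefits from a completeness check on those ten fields. A sketch, with the field names taken from the list above:

```python
REQUIRED_FIELDS = [
    "test_id", "name", "hypothesis", "primary_metric", "start_date",
    "end_date", "result", "significance", "decision", "reasoning",
]

def validate_record(record: dict) -> list[str]:
    """Return the required fields that are missing or empty in a record."""
    return [f for f in REQUIRED_FIELDS if not record.get(f)]

row = {"test_id": "EXP-042", "name": "Pricing page CTA", "hypothesis": "Shorter copy lifts clicks"}
print(validate_record(row))  # the seven fields this row still lacks
```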
How do I migrate from spreadsheets without losing history?
Export to structured format (CSV). Map columns to the target schema. Import with preserved timestamps. Rebuild tags using controlled vocabulary. Budget 2-4 weeks for full migration.
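The column-mapping step can be sketched in a few lines; the legacy header names and target keys below are invented examples, not a standard mapping:

```python
import csv
import io

# Illustrative mapping from legacy spreadsheet headers to target schema keys.
COLUMN_MAP = {"Test Name": "name", "Metric": "primary_metric", "Outcome": "result"}

def migrate(csv_text: str) -> list[dict]:
    """Read a legacy CSV export and re-key each row to the target schema,
    passing values (including timestamps) through unchanged."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        {COLUMN_MAP.get(col, col): value for col, value in row.items()}
        for row in rows
    ]

legacy = "Test Name,Metric,Outcome\nPricing CTA,conversion,win\n"
print(migrate(legacy))
# [{'name': 'Pricing CTA', 'primary_metric': 'conversion', 'result': 'win'}]
```

Tag rebuilding is deliberately left out of this sketch: legacy free-text tags should go through the controlled-vocabulary normalization described earlier, not be copied over as-is.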
Who owns the repository?
A senior practitioner with 10-20% of their time dedicated to schema, tagging, and meta-analysis. Without an owner, archives decay within two quarters.
How do I measure RTB?
Time 10 random practitioners each searching for a specific past test. Take the median of the results. Re-measure quarterly to catch drift.
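The computation itself is one line; the timings below are invented sample data:

```python
import statistics

# Stopwatch timings (minutes) from ten practitioners each finding a past test.
timings = [1.2, 0.8, 3.5, 1.9, 2.2, 1.1, 6.0, 1.5, 2.8, 1.0]

rtb = statistics.median(timings)
print(round(rtb, 2))  # 1.7 -> under the 2-minute threshold
```

The median matters here: a single 30-minute search would drag a mean far above the typical experience, while the median reflects what most practitioners actually hit.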
Should we have one repository or multiple?
One. Multiple repositories produce cross-search failures and duplicate documentation. Even cross-team programs benefit from a single canonical source with role-based access.
Methodology note: Retrieval Time Budget thresholds reflect experience across mid-market experimentation programs. Specific figures are presented as ranges. Schema and tagging patterns draw on established practice in experiment archive design.
---
A repository gets used when learnings are findable. Browse the GrowthLayer test library for examples of archives designed around fast retrieval.