Centralized A/B Testing Database Design: The Trace Ratio That Predicts Program Health

TL;DR: A centralized A/B testing database is only as useful as the fraction of experiments you can fully reconstruct from it. Most orgs build storage and call it a database, which is why they keep rerunning the same tests.

Key Takeaways

  • A centralized A/B test database isn't measured by rows stored — it's measured by how many past experiments you can reconstruct end-to-end from hypothesis to decision
  • The Experiment Trace Ratio quantifies this: the percentage of archived tests where hypothesis, result, guardrail metrics, and ship/kill decision are all recoverable
  • Schema design that prioritizes assignment logs, immutable history, and SRM detection columns prevents the silent failure modes that corrupt results at scale
  • Warehouse-native architectures (Snowflake, BigQuery) beat application databases for test storage because they support the aggregation math that A/B testing actually requires
  • Privacy and compliance aren't adjacent concerns — they determine whether your schema can keep user-level assignment data for the audit periods experimentation requires

The Real Test of a Test Database

Most teams with "centralized A/B testing databases" have something closer to a warehouse of rows. Experiments are logged. Assignments are stored. Results are computed. But when someone asks "why did we ship variant B instead of A in the checkout test last year?" — the answer is usually in a Slack thread, an old Notion doc, or nobody's memory.

This is the failure mode worth designing against. The value of a test database isn't storage. It's reconstruction. Six months after a test ships, can you recover the hypothesis, the variants, the audience, the guardrail metrics, the directional result, and the decision rationale — without pinging the person who ran it?

"A centralized database is what lets multiple teams, brands, and regions work from the same learnings. Without it, every team starts over." — Atticus Li

Teams that treat experimentation data as an institutional asset build schemas around reconstruction. Teams that treat it as operational telemetry build schemas around real-time performance. The two goals produce very different databases.

The Experiment Trace Ratio

Here's the framework I use to score how well a test database actually works:

ETR = Experiments fully reconstructable / Total experiments archived

Fully reconstructable means all of the following can be recovered from the database without interviewing anyone:

  • Original hypothesis (not just the test name)
  • Variant descriptions including the change applied
  • Eligible audience and exclusion rules
  • Primary metric, secondary metrics, guardrails
  • Directional result with sample size and statistical significance
  • Ship/kill/iterate decision and reasoning

Interpretation thresholds:

  • ETR above 80% — Strong database design. Institutional memory compounds.
  • ETR between 50% and 80% — Typical for well-intentioned programs. Gaps exist in decision traces and guardrail capture.
  • ETR between 20% and 50% — Common. The database is a storage layer, not a memory layer.
  • ETR below 20% — You have a log file, not a database. Rewrite the schema or accept that most learning decays inside 12 months.
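The ratio above is mechanical once each archived experiment is checked against the six-item checklist. A minimal sketch in Python — field names are illustrative, not a prescribed schema:

```python
# Minimal Experiment Trace Ratio (ETR) calculator.
# REQUIRED_FIELDS mirrors the six-item checklist; the names are
# assumptions — adapt them to your own metadata schema.
REQUIRED_FIELDS = [
    "hypothesis", "variant_descriptions", "audience_rules",
    "metrics", "result", "decision_rationale",
]

def is_reconstructable(experiment: dict) -> bool:
    """An experiment counts only if every trace field is present and non-empty."""
    return all(experiment.get(f) for f in REQUIRED_FIELDS)

def experiment_trace_ratio(archive: list[dict]) -> float:
    """Fraction of archived experiments that are fully reconstructable."""
    if not archive:
        return 0.0
    return sum(is_reconstructable(e) for e in archive) / len(archive)
```

Scoring is deliberately all-or-nothing: an experiment with a hypothesis but no decision rationale fails the check, which is exactly the gap the ETR is designed to surface.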

Core Schema Components That Matter

Experiment metadata table. One row per experiment, with ID, name, hypothesis text, feature area, funnel stage, owner, team, start date, end date, status, decision, and decision rationale. This table is the spine — everything else joins to it.

Variant table. One row per variant per experiment, with variant ID, description, traffic allocation, and link to the experiment. Keeps variant logic versioned.

Assignment log. One row per user per experiment assignment, timestamped. This is the largest table by volume and the most important for reconstruction. Never mutate assignment logs. Append-only, with full history preserved.

Metrics table. Pre-aggregated metrics by experiment, variant, and day. Computed from the assignment log joined against business event tables. Pre-aggregation makes analysis tolerable at scale; computing fresh every time kills query performance.

Guardrail tracking. Explicit columns for guardrail metric values, thresholds, and whether any triggered a fail-stop. Guardrails that aren't archived in structured form are guardrails that don't exist for future reference.
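The tables above can be sketched concretely. Here is one possible DDL, run through SQLite via Python purely for portability — column names are assumptions, and a production deployment would split these across Postgres and a warehouse as discussed in the next section:

```python
import sqlite3

# Illustrative core schema for an experiment archive.
# SQLite is used only so the sketch is self-contained and runnable.
DDL = """
CREATE TABLE experiments (
    experiment_id       TEXT PRIMARY KEY,
    name                TEXT NOT NULL,
    hypothesis          TEXT NOT NULL,   -- full hypothesis, not just a test name
    feature_area        TEXT NOT NULL,   -- controlled vocabulary, not free text
    funnel_stage        TEXT NOT NULL,
    owner               TEXT,
    start_date          TEXT,
    end_date            TEXT,
    status              TEXT,
    decision            TEXT,            -- ship / kill / iterate
    decision_rationale  TEXT             -- the most-overlooked column
);
CREATE TABLE variants (
    experiment_id       TEXT REFERENCES experiments(experiment_id),
    variant_id          TEXT,
    description         TEXT NOT NULL,   -- the change applied, not just a label
    traffic_allocation  REAL,
    PRIMARY KEY (experiment_id, variant_id)
);
CREATE TABLE assignments (               -- append-only: never UPDATE or DELETE
    user_id             TEXT,
    experiment_id       TEXT,
    variant_id          TEXT,
    assigned_at         TEXT
);
CREATE TABLE metrics_daily (             -- pre-aggregated by experiment/variant/day
    experiment_id       TEXT,
    variant_id          TEXT,
    day                 TEXT,
    metric_name         TEXT,
    metric_value        REAL,
    is_guardrail        INTEGER DEFAULT 0,
    guardrail_threshold REAL,
    guardrail_triggered INTEGER
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```

Note that guardrails live as structured columns on the metrics table rather than in free text, so "did any guardrail trigger?" stays a query, not an interview.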

Choosing the Right Storage Layer

Relational databases (Postgres, MySQL) work for experiment metadata and variant tables. They struggle at assignment-log volume: millions of rows per test will overwhelm an application database.

Data warehouses (Snowflake, BigQuery) are the right home for assignment logs and metrics. They support the aggregation patterns experimentation requires: group-bys on billions of rows, windowing over time, joins with business event tables. Warehouse-native experimentation platforms exist for exactly this reason.

NoSQL stores (MongoDB, Cassandra) can handle write-heavy assignment streams but don't give you the SQL aggregation power you need for analysis. Use them for ingestion if volume demands, but replicate to a warehouse for analytical work.

The hybrid pattern most mature programs land on: metadata in Postgres, assignment logs streaming through Kafka into a warehouse, pre-aggregated metrics exposed to dashboards.

Data Privacy and Historization

Assignment logs contain user IDs. This creates two obligations: protect the data, and preserve it for audit periods long enough to support experiment analysis.

Historization matters. Never overwrite an assignment record. If an experiment's configuration changes mid-flight, record a new version — don't mutate the original. This preserves the ability to reconstruct what was live when.

Retention policies must balance experimentation needs (often 12-24 months for meta-analysis) against privacy requirements (GDPR, CCPA, HIPAA where applicable). Document retention in the schema itself — a column indicating when data should be anonymized or deleted makes compliance programmatic instead of aspirational.
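One way to make retention programmatic, as the paragraph above suggests: stamp each assignment with its own anonymize-after date at write time, so a scheduled job can act on policy without a human interpreting it. A sketch, where the 540-day window and row shape are assumptions:

```python
from datetime import date, timedelta

# Sketch of programmatic retention. Each assignment row carries its own
# anonymize-after date; a scheduled job selects rows past that date and
# nulls the user_id. The ~18-month window is an illustrative assumption.
RETENTION = timedelta(days=540)

def anonymize_after(assigned_on: date) -> date:
    """The date after which this row's user_id should be anonymized."""
    return assigned_on + RETENTION

def due_for_anonymization(rows: list[dict], today: date) -> list[dict]:
    """Rows whose retention window has elapsed as of `today`."""
    return [r for r in rows if r["anonymize_after"] <= today]
```

The key design choice is that the deadline is data, stored alongside the record it governs, rather than a policy document someone has to remember to apply.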

Access controls on user-level data should be tighter than on aggregated metrics. Most analysis work can happen on pre-aggregated data without accessing raw assignments.

Common Mistakes in Database Design

Treating the database as telemetry. Optimizing for ingestion speed at the expense of reconstruction. You end up with fast writes and no recoverable context.

Skipping the decision trace. Every ship/kill decision should write back to the experiment metadata table with reasoning. When this is missed, the archive becomes half-useful.

Free-text tagging. Storing feature areas, funnel stages, and hypothesis types as free text produces unqueryable chaos within a year. Use controlled vocabularies.

Joining assignment logs to operational tables only. Connect to business event tables as well — purchases, retention signals, usage metrics — so that long-term impact analysis is possible without re-engineering later.

Under-investing in SRM detection. Build chi-squared SRM checks into the pipeline, not the analyst's manual workflow. If SRM detection is opt-in, it won't run on most tests.
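A pipeline-side SRM check is small enough that there is no excuse to leave it manual. A minimal sketch of the chi-squared goodness-of-fit test against the configured traffic split, with the standard 0.05-alpha critical values inlined to keep it dependency-free:

```python
# Minimal sample ratio mismatch (SRM) check: chi-squared goodness-of-fit
# of observed assignment counts against the configured allocation.
# Critical values are the standard chi-squared table at alpha = 0.05,
# keyed by degrees of freedom (number of variants minus one).
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def srm_detected(observed: list[int], allocation: list[float]) -> bool:
    """True if observed counts deviate from allocation more than chance allows."""
    total = sum(observed)
    expected = [p * total for p in allocation]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat > CHI2_CRIT_05[len(observed) - 1]
```

Running this on every active test at every pre-aggregation pass, and writing the result into the guardrail columns, is what turns SRM detection from an opt-in habit into a property of the system.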

Advanced: Real-Time Processing and Pre-Aggregation

High-velocity programs running 50+ tests per quarter need real-time dashboards that don't query raw assignment logs. The pattern:

  • Event ingestion via Kafka or equivalent streams assignment events and business events.
  • Real-time pre-aggregation computes rolling metrics by experiment and variant — conversion rate, sample size, SRM status, guardrail status.
  • Analytical queries hit pre-aggregated tables instead of raw logs. This is 10-100x faster and supports interactive exploration.
  • Audit-quality queries (for end-of-test analysis) can still hit raw logs when needed, but the common path is pre-aggregated.
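The pre-aggregation step above amounts to maintaining small rolling counters per experiment and variant so dashboards never scan raw logs. A sketch of the idea, with an assumed event shape:

```python
from collections import defaultdict

# Sketch of streaming pre-aggregation: ingest assignment and conversion
# events, maintain O(1)-readable counters per (experiment, variant).
# The event dict shape is an assumption for illustration.
class RollingMetrics:
    def __init__(self):
        self.agg = defaultdict(lambda: {"assigned": 0, "converted": 0})

    def ingest(self, event: dict) -> None:
        key = (event["experiment_id"], event["variant_id"])
        if event["type"] == "assignment":
            self.agg[key]["assigned"] += 1
        elif event["type"] == "conversion":
            self.agg[key]["converted"] += 1

    def conversion_rate(self, experiment_id: str, variant_id: str) -> float:
        c = self.agg[(experiment_id, variant_id)]
        return c["converted"] / c["assigned"] if c["assigned"] else 0.0
```

A real deployment would hold these aggregates in warehouse tables updated by the stream processor; the point is that the dashboard's read path touches summaries, not the assignment log.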

This architecture supports the velocity mature programs need without sacrificing reconstructability.

Frequently Asked Questions

How long should I retain assignment logs?

18-24 months is typical. Long enough to support meta-analysis of past tests and seasonality patterns. Beyond that, aggregate-only retention usually suffices. Balance against privacy and storage costs.

Do I need a purpose-built experimentation platform or can I build this?

Warehouse-native platforms (GrowthLayer, Eppo, Statsig) handle the schema, ingestion, and aggregation work. Building from scratch is possible but typically takes 6-12 months of engineering investment for something a platform provides off the shelf. The decision depends on scale and specialization needs.

What's the single most-overlooked column?

The ship/kill decision rationale. Teams capture the result but rarely capture why the result was acted on. A year later, this is often the most valuable piece of context and the hardest to reconstruct.

How do I handle multi-variant tests (A/B/n)?

Variant table design should support arbitrary variant counts. Assignment logs should reference variant ID rather than binary test/control flags. Results analysis extends naturally to multi-variant if the schema is designed for it.

What about cross-experiment interactions?

If multiple tests run simultaneously on overlapping audiences, interaction effects can corrupt results. Track overlap in the schema — which other tests were active during this test's run — so post-hoc analysis can flag potential interactions.
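Tracking overlap can start as simply as flagging pairs of experiments whose run windows intersect. A sketch under that assumption — a full version would also compare audience eligibility rules:

```python
from datetime import date

# Sketch of cross-experiment overlap tracking: flag pairs of tests whose
# run windows intersect, as candidates for interaction analysis.
# (Time overlap only; audience overlap needs the eligibility rules too.)
def runs_overlap(a: tuple[date, date], b: tuple[date, date]) -> bool:
    """True if the two (start, end) windows intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def overlapping_pairs(experiments: dict[str, tuple[date, date]]) -> list[tuple[str, str]]:
    """All pairs of experiment IDs whose run windows overlap."""
    ids = sorted(experiments)
    return [(x, y) for i, x in enumerate(ids) for y in ids[i + 1:]
            if runs_overlap(experiments[x], experiments[y])]
```

Persisting these pairs into the experiment metadata at close-out means the question "what else was running?" is answerable years later without log archaeology.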

Methodology note: Experiment Trace Ratio thresholds reflect patterns observed across mid-market experimentation programs. Specific figures are presented as ranges. Schema patterns draw on established warehouse-native experimentation practices.

---

See how a well-structured repository changes what teams can learn from past tests. Browse the GrowthLayer test library for reference examples of organized experiment archives.

Atticus Li

Experimentation and growth leader. Builds AI-powered tools, runs conversion programs, and writes about economics, behavioral science, and shipping faster.