How to Run Meta-Analysis Across Historical A/B Test Data: The Aggregation Validity Threshold

TL;DR: Meta-analysis isn't about combining experiments — it's about knowing when you have enough similar tests for the aggregate to tell you something true. The Aggregation Validity Threshold is how to avoid pattern-matching on noise.

Key Takeaways

  • Meta-analysis combines results across tests to surface patterns, but below a minimum cluster size the aggregate is more misleading than any single test
  • The Aggregation Validity Threshold is the floor: how many similar tests you need before combined effects become trustworthy
  • Publication bias is the most corrosive failure mode — archives that document wins and skip losses produce inflated, unreproducible aggregates
  • Standardizing metrics like Revenue Per Visitor across experiments is the precondition for meaningful aggregation; without it, you're comparing tests that aren't comparable
  • Meta-analysis works best when you also include inconclusive and cancelled tests — those are the data points that protect against confirmation bias

What Meta-Analysis Actually Is

Meta-analysis isn't a fancy word for "looking at a bunch of tests at once." It's a statistical method for combining effect sizes from independent experiments to estimate an overall effect with more power than any single test has.

Done well, meta-analysis tells you whether a winning pattern in one test holds across multiple tests, segments, or contexts. Done badly, it tells you a flattering story your archive was pre-selected to produce.

"One test tells you what happened. A hundred tell you whether it was a pattern or luck." — Atticus Li

The trap most teams fall into: they meta-analyze wins, miss the losses, and conclude their hypotheses work better than they do. The statistical corrections exist and are well-documented, but archives have to be designed to support them from the start.

Why Most Meta-Analyses Fail

Three failure modes dominate:

Publication bias. The archive systematically excludes losses. When you aggregate across wins only, the average effect looks strong because every input was pre-selected for strength.

Inconsistent metrics. Test A measured conversion rate, Test B measured revenue per visitor, Test C measured something else. Aggregating across different metrics produces numbers that don't mean anything.

Too few comparable tests. Combining five tests on loosely related topics doesn't produce statistical power — it produces noise with confidence intervals wide enough to drive a truck through.

The defense against all three is archive discipline: include everything (wins, losses, inconclusive), standardize metrics at capture time, and cluster tests only when they're genuinely comparable.

The Aggregation Validity Threshold

Here's the threshold framework:

AVT = Minimum number of comparable tests required before an aggregated effect estimate is trustworthy

Interpretation thresholds:

  • Below 10 comparable tests — Aggregation is premature. Individual test inspection is more reliable than the combined estimate.
  • 10-25 comparable tests — Aggregation produces useful direction but wide confidence intervals. Good for identifying patterns worth testing further.
  • 25-50 comparable tests — Aggregation produces meaningful effect estimates. Can inform hypothesis prioritization and roadmap decisions.
  • 50+ comparable tests — Aggregation supports strategic decisions. Patterns detected here are usually durable.

Comparable means: same or similar hypothesis type, same or comparable metric, overlapping audience definitions. A pricing test cluster should not mix B2B SaaS pricing experiments with e-commerce checkout experiments.
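The bands above can be wired into a pre-flight check before any aggregation is launched. This is a minimal sketch of that idea; the band labels paraphrase the article's thresholds, and the boundary handling (50 counted as "50+") is an assumption where the ranges overlap.

```python
# Sketch: the AVT bands as a lookup, run before launching an aggregation.
# Band labels paraphrase the article's thresholds; treating exactly 50
# tests as the "50+" band is an assumption.

def avt_band(n_comparable: int) -> str:
    """Map a comparable-test count to its Aggregation Validity band."""
    if n_comparable < 10:
        return "premature: inspect tests individually"
    if n_comparable < 25:
        return "directional: wide confidence intervals"
    if n_comparable < 50:
        return "meaningful: can inform prioritization"
    return "strategic: patterns usually durable"

avt_band(7)   # a 7-test cluster is below the practical floor
avt_band(32)  # a 32-test cluster can inform roadmap decisions
```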

Preparing Historical Data for Meta-Analysis

Step 1 — Identify the cluster. Pick a hypothesis type and a metric. Pull all tests matching both from the archive. If the count is below 10, stop — the cluster is too small.

Step 2 — Validate test quality. Exclude tests with known execution issues: sample ratio mismatch (SRM) flags, sample sizes below the calculated minimum, early stopping, tracking bugs. The aggregate is only as good as its inputs.

Step 3 — Standardize effect sizes. Convert results to a common metric. For conversion rate tests, relative lift is usually the right unit. For revenue, Revenue Per Visitor normalized against baseline.
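For conversion tests, the standardization in Step 3 is a one-line calculation. A minimal sketch, assuming each archived record stores control and variant conversion rates (the field names here are illustrative, not a real archive schema):

```python
# Sketch of Step 3: converting heterogeneous conversion results to
# relative lift. Field names ("control_rate", etc.) are illustrative.

def relative_lift(control_rate: float, variant_rate: float) -> float:
    """Relative lift of the variant over the control baseline."""
    if control_rate <= 0:
        raise ValueError("control rate must be positive")
    return (variant_rate - control_rate) / control_rate

tests = [
    {"name": "checkout-copy", "control_rate": 0.040, "variant_rate": 0.042},
    {"name": "pricing-page",  "control_rate": 0.025, "variant_rate": 0.024},
]
lifts = {t["name"]: relative_lift(t["control_rate"], t["variant_rate"])
         for t in tests}
# checkout-copy -> +5% relative lift; pricing-page -> -4%
```

The same shape works for Revenue Per Visitor: replace the rates with per-arm RPV and the lift is normalized against baseline automatically.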

Step 4 — Check for publication bias. Funnel plots or trim-and-fill methods flag whether the cluster systematically over-represents wins. If your archive shows only wins, the cluster is biased before analysis begins.
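A full trim-and-fill analysis belongs in a statistics package (R's metafor implements it), but the obvious symptom is cheap to flag: in a wins-only archive, the noisiest (highest standard error) tests report the biggest effects. A crude sketch, assuming each test record carries an effect size and its standard error:

```python
# Sketch: a crude funnel-asymmetry flag, assuming each test stores an
# effect size and a standard error. This is not trim-and-fill -- it only
# checks whether noisy tests systematically report large positive effects,
# the signature of a wins-only archive.

from statistics import mean

def asymmetry_correlation(effects, std_errors):
    """Pearson correlation between effect size and standard error."""
    ex, sx = mean(effects), mean(std_errors)
    cov = sum((e - ex) * (s - sx) for e, s in zip(effects, std_errors))
    norm_e = sum((e - ex) ** 2 for e in effects) ** 0.5
    norm_s = sum((s - sx) ** 2 for s in std_errors) ** 0.5
    return cov / (norm_e * norm_s)

# A suspicious cluster: the noisiest tests show the biggest lifts.
effects = [0.01, 0.02, 0.05, 0.09]
ses     = [0.005, 0.010, 0.030, 0.060]
r = asymmetry_correlation(effects, ses)  # near 1.0 -> likely biased
```

An unbiased cluster should show effects scattered symmetrically regardless of standard error; a strong positive correlation like this one is a cue to rebuild the cluster before aggregating.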

Step 5 — Aggregate with appropriate weighting. Weight each test by its precision (inverse variance, which tracks sample size), using a fixed-effect model if tests are highly homogeneous or a random-effects model if they vary. Both approaches are standard.
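The fixed-effect version of Step 5 fits in a few lines. A sketch, assuming each test supplies a standardized effect and its standard error; a random-effects model would add a between-test variance term (tau squared) to each weight's denominator, which packages like metafor or statsmodels handle for you:

```python
# Sketch of Step 5: inverse-variance weighted (fixed-effect) pooling.
# Effect sizes and standard errors are illustrative inputs; a
# random-effects model adds tau^2 to each weight's denominator.

import math

def fixed_effect_pool(effects, std_errors):
    """Pooled effect and pooled standard error under a fixed-effect model."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

effects = [0.03, 0.01, 0.05]   # relative lifts from three comparable tests
ses     = [0.02, 0.01, 0.04]   # their standard errors
est, se = fixed_effect_pool(effects, ses)
ci = (est - 1.96 * se, est + 1.96 * se)  # 95% confidence interval
```

Notice how the pooled estimate sits closest to the most precise test (the one with SE 0.01): precision weighting is doing exactly what Step 5 asks for.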

Key Research References

Two studies frame what the evidence actually looks like at scale:

Miller & Hosanagar (2020) analyzed 2,732 A/B tests across 252 U.S. e-commerce companies. Their findings: price promotions produced the largest effects early in the funnel, while shipping promotions produced the best effects later. Scarcity messaging showed a 2.9% average lift in Revenue Per Visitor. These are the kinds of patterns that only emerge from meta-analysis.

Browne & Swarbrick Jones (2017) studied 6,700 e-commerce experiments and found that only about 10% of large-scale tests produced revenue lifts exceeding 1.2%. This is a critical baseline: the base rate of meaningful wins is lower than most teams assume, and meta-analysis is how you calibrate expectations.

Common Meta-Analysis Mistakes

Analyzing wins-only archives. Selection bias on the input end makes the output useless. Include everything.

Comparing incomparable tests. Mixing SaaS and e-commerce tests in one cluster produces a number. The number isn't meaningful.

Ignoring false discovery rate. When you run multiple aggregations, some will look significant by chance. Bonferroni correction or FDR control is required when analyzing more than a handful of clusters.
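The Benjamini-Hochberg procedure is the standard FDR control and is short enough to sketch in full. The cluster p-values below are illustrative; statsmodels' `multipletests(method="fdr_bh")` is the off-the-shelf equivalent:

```python
# Sketch: Benjamini-Hochberg FDR control across cluster-level p-values.
# The p-values are illustrative; statsmodels.stats.multitest.multipletests
# with method="fdr_bh" does the same job in one call.

def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices of discoveries that survive FDR control at level alpha."""
    m = len(pvalues)
    ranked = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(ranked, start=1):
        if pvalues[i] <= rank * alpha / m:
            cutoff = rank  # largest rank whose p-value clears its threshold
    return set(ranked[:cutoff])

pvals = [0.001, 0.008, 0.039, 0.041, 0.20]  # five aggregation clusters
significant = benjamini_hochberg(pvals)      # -> {0, 1}
```

Note that clusters 3 and 4 (p = 0.039, 0.041) would pass a naive 0.05 cutoff but fail once the multiple-comparison penalty is applied: exactly the noise-chasing this mistake warns about.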

Treating meta-analysis as a one-time exercise. Patterns shift as the product evolves. Quarterly re-aggregation catches drift.

Skipping confidence intervals. An aggregated effect of "+3.5%" with a 95% CI of [-1%, 8%] is very different from [3%, 4%]. Confidence intervals are where the real story lives.

Using Meta-Analysis to Inform Future Tests

The strongest use of meta-analysis isn't retrospective reporting. It's forward-looking test design:

Sample size calibration. Historical effect size distributions inform what minimum detectable effect (MDE) is realistic. If your cluster has a median effect of 2%, designing future tests for 10% MDE is aiming at fiction.
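The traffic cost of that mismatch is easy to make concrete. A sketch using the standard two-proportion normal approximation (the 4% baseline and the MDE values are illustrative, not from the source data):

```python
# Sketch: traffic required per arm for a given relative MDE, using the
# two-proportion normal approximation. Baseline rate and MDEs are
# illustrative examples, not figures from the article.

import math
from statistics import NormalDist

def samples_per_arm(baseline, mde_rel, alpha=0.05, power=0.80):
    """Visitors needed per arm to detect a relative lift of mde_rel."""
    p1 = baseline
    p2 = baseline * (1 + mde_rel)
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_b = NormalDist().inv_cdf(power)            # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_a + z_b) ** 2 * variance / (p2 - p1) ** 2)

# On a 4% baseline: a 10% MDE needs ~40k visitors per arm, while the
# cluster's realistic 2% median effect needs roughly 24x more.
n_optimistic = samples_per_arm(0.04, 0.10)
n_realistic  = samples_per_arm(0.04, 0.02)
```

Running the comparison makes the calibration point concrete: a test sized for the optimistic 10% MDE is badly underpowered against the effect sizes the archive says actually occur.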

Hypothesis prioritization. Clusters with high win rates are worth testing more. Clusters with low win rates are worth stopping or redirecting.

Segment strategy. When meta-analysis reveals that a pattern holds for Segment A but not Segment B, the product decision may be to differentiate the experience rather than find one universal winner.

Advanced: Tools and Automation

Statistical software (R, Python with scipy/statsmodels) handles the math for ad-hoc meta-analyses. At archive scale, automated platforms like GrowthLayer can support cluster-level analysis by standardizing metadata and metrics at capture time.

Cloud data warehouses (Snowflake, BigQuery) make the join between experiment archives and business event tables tractable — which is what meta-analysis across long time horizons requires.

Frequently Asked Questions

What's the minimum cluster size for a real meta-analysis?

10 comparable tests is the practical floor. Below that, the confidence intervals are too wide to extract meaning.

How do I handle tests with different primary metrics?

Standardize to a common unit (relative lift is usually the best choice for conversion tests). If metrics can't be reconciled, split into sub-clusters instead of forcing them together.

Can I run meta-analysis on an archive that excluded losses?

Not meaningfully. The result will be biased upward. Rebuild the archive to include everything first, or explicitly caveat the meta-analysis as win-only.

What software should I use?

R with the metafor package is the academic standard. Python with statsmodels works similarly. For teams without statistical staff, platforms with built-in meta-analysis features (GrowthLayer, Eppo) handle the math.

How often should I re-run meta-analyses?

Quarterly for active programs. Product changes and seasonality drift make older aggregates less predictive. Annual re-aggregation is the minimum.

Methodology note: Threshold patterns reflect experience across experimentation programs with archives ranging from 100 to 10,000+ tests. Specific figures are presented as ranges. Research references include Miller & Hosanagar (2020) and Browne & Swarbrick Jones (2017).

---

Structured archives are the precondition for meaningful meta-analysis. Browse the GrowthLayer test library for examples organized by hypothesis type and funnel stage.

Atticus Li

Experimentation and growth leader. Builds AI-powered tools, runs conversion programs, and writes about economics, behavioral science, and shipping faster.