How to Run Meta-Analysis Across Historical A/B Test Data: The Aggregation Validity Threshold
TL;DR: Meta-analysis isn't about combining experiments — it's about knowing when you have enough similar tests for the aggregate to tell you something true. The Aggregation Validity Threshold is how to avoid pattern-matching on noise.
Key Takeaways
- Meta-analysis combines results across tests to surface patterns, but below a minimum cluster size the aggregate is more misleading than any single test
- The Aggregation Validity Threshold is the floor: how many similar tests you need before combined effects become trustworthy
- Publication bias is the most corrosive failure mode — archives that document wins and skip losses produce inflated, unreproducible aggregates
- Standardizing metrics like Revenue Per Visitor across experiments is the precondition for meaningful aggregation; without it, you're comparing tests that aren't comparable
- Meta-analysis works best when you also include inconclusive and cancelled tests — those are the data points that protect against confirmation bias
What Meta-Analysis Actually Is
Meta-analysis isn't a fancy word for "looking at a bunch of tests at once." It's a statistical method for combining effect sizes from independent experiments to estimate an overall effect with more power than any single test has.
Done well, meta-analysis tells you whether a winning pattern in one test holds across multiple tests, segments, or contexts. Done badly, it tells you a flattering story your archive was pre-selected to produce.
"One test tells you what happened. A hundred tell you whether it was a pattern or luck." — Atticus Li
The trap most teams fall into: they meta-analyze wins, miss the losses, and conclude their hypotheses work better than they do. The statistical corrections exist and are well-documented, but archives have to be designed to support them from the start.
Why Most Meta-Analyses Fail
Three failure modes dominate:
Publication bias. The archive systematically excludes losses. When you aggregate across wins only, the average effect looks strong because every input was pre-selected for strength.
Inconsistent metrics. Test A measured conversion rate, Test B measured revenue per visitor, Test C measured something else. Aggregating across different metrics produces numbers that don't mean anything.
Too few comparable tests. Combining five tests on loosely related topics doesn't produce statistical power — it produces noise with confidence intervals wide enough to drive a truck through.
The defense against all three is archive discipline: include everything (wins, losses, inconclusive), standardize metrics at capture time, and cluster tests only when they're genuinely comparable.
The Aggregation Validity Threshold
Here's the threshold framework:
AVT = Minimum number of comparable tests required before an aggregated effect estimate is trustworthy
Interpretation thresholds:
- Below 10 comparable tests — Aggregation is premature. Individual test inspection is more reliable than the combined estimate.
- 10-25 comparable tests — Aggregation produces useful direction but wide confidence intervals. Good for identifying patterns worth testing further.
- 25-50 comparable tests — Aggregation produces meaningful effect estimates. Can inform hypothesis prioritization and roadmap decisions.
- 50+ comparable tests — Aggregation supports strategic decisions. Patterns detected here are usually durable.
Comparable means: same or similar hypothesis type, same or comparable metric, overlapping audience definitions. A pricing test cluster should not mix B2B SaaS pricing experiments with e-commerce checkout experiments.
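As a quick sanity check before any aggregation, the tiers above can be encoded in a small helper. This is an illustrative sketch, not part of any library; the tier names are made up here:

```python
def avt_tier(n_comparable_tests: int) -> str:
    """Map a cluster size to its Aggregation Validity Threshold tier."""
    if n_comparable_tests < 10:
        return "premature"    # inspect tests individually instead
    if n_comparable_tests < 25:
        return "directional"  # useful direction, wide intervals
    if n_comparable_tests < 50:
        return "meaningful"   # can inform prioritization decisions
    return "strategic"        # patterns here are usually durable
```

Running `avt_tier(7)` returns `"premature"`, which is the signal to stop and inspect tests one by one rather than aggregate.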
Preparing Historical Data for Meta-Analysis
Step 1 — Identify the cluster. Pick a hypothesis type and a metric. Pull all tests matching both from the archive. If the count is below 10, stop — the cluster is too small.
Step 2 — Validate test quality. Exclude tests with known execution issues: sample ratio mismatch (SRM) flags, sample size below the calculated minimum, early stopping, tracking bugs. The aggregate is only as good as its inputs.
Step 3 — Standardize effect sizes. Convert results to a common metric. For conversion rate tests, relative lift is usually the right unit. For revenue, Revenue Per Visitor normalized against baseline.
Step 4 — Check for publication bias. Funnel plots or trim-and-fill methods flag whether the cluster systematically over-represents wins. If your archive shows only wins, the cluster is biased before analysis begins.
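One lightweight version of the Step 4 check is an Egger-style regression: regress each test's standardized effect (effect divided by its standard error) on its precision (one over the standard error). If the intercept sits far from zero, small noisy tests skew in one direction, which is the funnel-asymmetry signature of publication bias. A minimal sketch in plain Python, assuming effects and standard errors are already standardized and that at least two tests have distinct standard errors:

```python
def egger_intercept(effects, ses):
    """Egger-style asymmetry check: regress standardized effect
    (effect/se) on precision (1/se) via ordinary least squares.
    An intercept far from zero suggests funnel asymmetry, i.e.
    possible publication bias in the cluster."""
    z = [e / s for e, s in zip(effects, ses)]  # standardized effects
    x = [1 / s for s in ses]                   # precisions
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    slope = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z)) / \
        sum((xi - mx) ** 2 for xi in x)
    return mz - slope * mx  # OLS intercept
```

For a symmetric cluster (same true effect, varying precision) the intercept is near zero; a cluster of wins-only archives typically pushes it well away from zero. Production use would add a significance test on the intercept, which `statsmodels` OLS provides.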
Step 5 — Aggregate with appropriate weighting. Weight each test by the inverse of its variance (in practice, a close proxy for sample size), using a fixed-effects model if tests are highly homogeneous or a random-effects model if they vary. Both approaches are standard.
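Step 5's inverse-variance weighting fits in a few lines. This sketch assumes effects have already been standardized per Step 3 (e.g. relative lifts) with known standard errors, and implements the fixed-effects case; libraries like R's `metafor` handle the random-effects extension:

```python
import math

def pool_fixed_effects(effects, ses):
    """Fixed-effects inverse-variance pooling of standardized effects.

    effects: per-test effect sizes (e.g. relative lifts)
    ses:     their standard errors
    Returns (pooled_effect, (ci_low, ci_high)) with a 95% CI.
    """
    weights = [1 / se ** 2 for se in ses]       # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))     # SE of the pooled estimate
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, ci
```

Two tests at +2% and +4% with equal precision pool to +3%, and the CI the function returns is exactly the interval the "Skipping confidence intervals" warning below is about: report it alongside the point estimate.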
Key Research References
Two studies frame what the evidence actually looks like at scale:
Miller & Hosanagar (2020) analyzed 2,732 A/B tests across 252 U.S. e-commerce companies. Their findings: price promotions produced the largest effects early in the funnel, while shipping promotions performed best later. Scarcity messaging showed a 2.9% average lift in Revenue Per Visitor. These are the kinds of patterns that only emerge from meta-analysis.
Browne & Swarbrick Jones (2017) studied 6,700 e-commerce experiments and found that only about 10% of large-scale tests produced revenue lifts exceeding 1.2%. This is a critical baseline: the base rate of meaningful wins is lower than most teams assume, and meta-analysis is how you calibrate expectations.
Common Meta-Analysis Mistakes
Analyzing wins-only archives. Selection bias on the input end makes the output useless. Include everything.
Comparing incomparable tests. Mixing SaaS and e-commerce tests in one cluster produces a number. The number isn't meaningful.
Ignoring false discovery rate. When you run multiple aggregations, some will look significant by chance. Bonferroni correction or FDR control is required when analyzing more than a handful of clusters.
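The standard FDR procedure is Benjamini-Hochberg: sort the p-values, find the largest rank k whose p-value clears the line k/m · q, and reject the k smallest. A minimal sketch, no library assumed:

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg FDR control at level q.

    Returns the indices of hypotheses rejected: the k smallest
    p-values, where k is the largest rank r with p_(r) <= r/m * q.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # ranks by p-value
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k = rank  # keep the largest qualifying rank
    return sorted(order[:k])
```

With cluster p-values of 0.001, 0.02, 0.03, and 0.5, the first three survive at q = 0.05; under plain per-cluster testing at 0.05 the same three would pass, but as the number of clusters grows, BH starts filtering out the chance "discoveries" that per-cluster testing lets through.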
Treating meta-analysis as a one-time exercise. Patterns shift as the product evolves. Quarterly re-aggregation catches drift.
Skipping confidence intervals. An aggregated effect of "+3.5%" with a 95% CI of [-1%, 8%] is very different from [3%, 4%]. Confidence intervals are where the real story lives.
Using Meta-Analysis to Inform Future Tests
The strongest use of meta-analysis isn't retrospective reporting. It's forward-looking test design:
Sample size calibration. Historical effect size distributions inform what MDE is realistic. If your cluster has a median effect of 2%, designing future tests for 10% MDE is aiming at fiction.
Hypothesis prioritization. Clusters with high win rates are worth testing more. Clusters with low win rates are worth stopping or redirecting.
Segment strategy. When meta-analysis reveals that a pattern holds for Segment A but not Segment B, the product decision may be to differentiate the experience rather than find one universal winner.
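The sample-size calibration point can be made concrete with a standard approximation: for a two-proportion test at α = 0.05 and 80% power, the "rule of 16" heuristic gives n per arm ≈ 16 · p(1−p) / Δ², where Δ is the absolute lift to detect. A sketch under that assumption:

```python
import math

def samples_per_arm(baseline_rate: float, relative_mde: float) -> int:
    """Rough sample size per arm for a two-proportion test at
    alpha = 0.05, power = 0.80, via the 'rule of 16' heuristic:
    n ~= 16 * p * (1 - p) / delta^2, delta = absolute lift."""
    delta = baseline_rate * relative_mde
    return math.ceil(16 * baseline_rate * (1 - baseline_rate) / delta ** 2)
```

With a 5% baseline conversion rate, detecting a 2% relative lift (the kind of median effect a realistic cluster shows) takes roughly 760,000 visitors per arm; designing for a 10% MDE instead gives a far smaller, cheaper test, but one that is badly underpowered for the 2% effects history says to expect.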
Advanced: Tools and Automation
Statistical software (R, Python with scipy/statsmodels) handles the math for ad-hoc meta-analyses. At archive scale, automated platforms like GrowthLayer can support cluster-level analysis by standardizing metadata and metrics at capture time.
Warehouse-native platforms (Snowflake, BigQuery) make the join between experiment archives and business event tables tractable — which is what meta-analysis across long time horizons requires.
Frequently Asked Questions
What's the minimum cluster size for a real meta-analysis?
10 comparable tests is the practical floor. Below that, the confidence intervals are too wide to extract meaning.
How do I handle tests with different primary metrics?
Standardize to a common unit (relative lift is usually the best choice for conversion tests). If metrics can't be reconciled, split into sub-clusters instead of forcing them together.
Can I run meta-analysis on an archive that excluded losses?
Not meaningfully. The result will be biased upward. Rebuild the archive to include everything first, or explicitly caveat the meta-analysis as win-only.
What software should I use?
R with the metafor package is the academic standard. Python with statsmodels works similarly. For teams without statistical staff, platforms with built-in meta-analysis features (GrowthLayer, Eppo) handle the math.
How often should I re-run meta-analyses?
Quarterly for active programs. Product changes and seasonality drift make older aggregates less predictive. Annual re-aggregation is the minimum.
Methodology note: Threshold patterns reflect experience across experimentation programs with archives ranging from 100 to 10,000+ tests. Specific figures are presented as ranges. Research references include Miller & Hosanagar (2020) and Browne & Swarbrick Jones (2017).
---
Structured archives are the precondition for meaningful meta-analysis. Browse the GrowthLayer test library for examples organized by hypothesis type and funnel stage.
Related reading: