The Garden of Forking Paths: Gelman-Loken 2013 On Implicit Multiple Comparisons

In 2013, Andrew Gelman and Eric Loken showed that good-faith researchers using standard methodology can produce noise-driven findings without any conscious p-hacking. The mechanism is the garden of forking paths — implicit multiple comparisons created by data-dependent decisions made one at a time.

By Atticus Li, May 25, 2026

In November 2013, Andrew Gelman of Columbia and Eric Loken of Pennsylvania State posted a working paper to the Department of Statistics archive at Columbia with a title designed to be quoted: “The garden of forking paths: Why multiple comparisons can be a problem, even when there is no ‘fishing expedition’ or ‘p-hacking’ and the research hypothesis was posited ahead of time.” The paper extended a methodological critique that Simmons, Nelson, and Simonsohn had launched two years earlier in Psychological Science, and it did so by removing the most convenient defense the targets of that critique had been using. The defense was: “I am not p-hacking. I had my hypothesis in mind before I looked at the data. I made one set of analytic choices and I reported the result.” Gelman and Loken’s contribution was to show that even when every word of that defense is true, the inferences a researcher draws can still be statistically equivalent to running many implicit comparisons and reporting only the one that worked.

The argument has a borrowed name. The “garden of forking paths” is a 1941 short story by Jorge Luis Borges in which a labyrinth is not a physical maze but a structure of decisions — every choice point branches off a universe of consequences, and the story’s protagonist is reading a book in which every fork is realized as a separate narrative. Gelman and Loken applied the image to statistical analysis. The garden is the universe of analyses a researcher would have run, given the data they actually observed. The path they took is the analysis they actually published. The branches they did not take — the comparisons they would have made if the data had looked slightly different — exist as counterfactual analyses that nevertheless contribute to the false-positive rate of the path that was taken.

This article walks through what Gelman and Loken actually argued, why their extension of the Simmons-Nelson-Simonsohn framework was important, how the garden-of-forking-paths mechanism produces noise-driven findings even in good-faith research programs, what the modern methodological response has been, and how working strategists and consumers of research can apply the framework when evaluating any study that reports a single “significant” finding from a single specified analysis.

The Setup: What Simmons, Nelson, and Simonsohn Established In 2011

The Gelman-Loken paper cannot be understood without the paper it extended. In November 2011, Joseph Simmons, Leif Nelson, and Uri Simonsohn published “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant” in Psychological Science (DOI: 10.1177/0956797611417632). The paper is one of the founding documents of the modern replication-crisis methodology literature.

Simmons and colleagues made a structural argument with two halves. The first half was a Monte Carlo simulation showing that if a researcher exercises four common analytic flexibilities (choosing between two dependent variables, adding participants after a first look at the results, controlling for a covariate such as gender, and dropping or keeping one of three experimental conditions), the false-positive rate at a nominal alpha = 0.05 inflates to over 60 percent. The second half was a now-famous empirical demonstration in which the authors conducted a real experiment, exercised those flexibilities in the analysis, and produced a “significant” finding that listening to a specific Beatles song reduced participants’ chronological age. The finding was impossible by definition. The flexibilities had been enough to manufacture it from noise.
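The logic of that kind of simulation can be sketched in a few lines of Python. This is not the authors' code and it does not reproduce their design. It assumes only two flexibilities (a second outcome measure correlated with the first at r = 0.5, and one round of adding 10 participants per cell after a non-significant first look at 20 per cell), it uses a z-test with known variance rather than a t-test, and every function name is illustrative.

```python
import random
from statistics import NormalDist, fmean

ALPHA = 0.05
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)  # two-sided critical value, about 1.96


def draw_participant(rng, r=0.5):
    """Two null outcome measures, each with variance 1, correlated at r."""
    shared = rng.gauss(0, 1)
    y1 = shared
    y2 = r * shared + (1 - r * r) ** 0.5 * rng.gauss(0, 1)
    return y1, y2


def significant(group_a, group_b):
    """Two-sided z-test for a mean difference (equal n, variance known to be 1)."""
    n = len(group_a)
    z = (fmean(group_a) - fmean(group_b)) / (2 / n) ** 0.5
    return abs(z) > Z_CRIT


def one_study(rng, flexible):
    """Simulate one study under the null; return True if it reports p < ALPHA."""
    treat = [draw_participant(rng) for _ in range(20)]
    ctrl = [draw_participant(rng) for _ in range(20)]

    def any_hit():
        # The fixed analyst tests outcome 0 only; the flexible one tests both.
        dvs = (0, 1) if flexible else (0,)
        return any(
            significant([p[d] for p in treat], [p[d] for p in ctrl]) for d in dvs
        )

    if any_hit():
        return True
    if flexible:
        # Optional stopping: add 10 per cell and look once more.
        treat += [draw_participant(rng) for _ in range(10)]
        ctrl += [draw_participant(rng) for _ in range(10)]
        return any_hit()
    return False


def false_positive_rate(flexible, n_sims=4000, seed=1):
    rng = random.Random(seed)
    return sum(one_study(rng, flexible) for _ in range(n_sims)) / n_sims


fixed_rate = false_positive_rate(flexible=False)
flexible_rate = false_positive_rate(flexible=True)
print(f"fixed analysis:    {fixed_rate:.3f}")
print(f"flexible analysis: {flexible_rate:.3f}")
```

The fixed analysis should land near the nominal 0.05, and the flexible one noticeably above it, even though no true effect exists anywhere in the simulated data. Adding further flexibilities to the sketch pushes the rate higher still.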

The Simmons-Nelson-Simonsohn paper named something the field already knew about but had not previously had a clean term for. The paper’s own label was “researcher degrees of freedom”; the shorter term that stuck, p-hacking, was popularized by the same authors soon afterward. The term was useful because it was specific. It referred to the practice of running an analysis, observing the result, modifying the analysis based on the result, and repeating until a significant p-value was produced, and then reporting the final analysis as if it had been the only one. The paper’s recommendations focused on disclosure: researchers should report all manipulations, all measures, all data exclusions, and all decision rules about stopping data collection. If a researcher had exercised flexibility, they should say so, and the p-value should be interpreted in light of the flexibility.

The 2011 paper was widely read and widely cited. It also produced a predictable defense. Researchers who had not engaged in iterative analysis-tuning could legitimately say: “I did not p-hack. I had my hypothesis in mind. I ran the analysis I planned to run. I reported the result. The Simmons-Nelson-Simonsohn critique does not apply to me.” The defense was sincere. It was also, Gelman and Loken would argue two years later, statistically insufficient.

The Gelman-Loken Insight: Data-Dependent Decisions Without Iteration

The central move in the 2013 working paper is to define a class of analytic flexibility that does not require iteration on the same dataset. The class includes decisions like: which subjects to include or exclude based on properties of the data, how to code or categorize a continuous variable based on the observed distribution, which of several theoretically relevant control variables to include in the model, how to handle outliers based on visual inspection, and which subgroups to break out for secondary analysis based on which subgroups have appreciable sample sizes.

These decisions share a common structure. They are made once. They are made based on properties of the data the researcher has in front of them. They are usually defensible in isolation — there is a reasonable methodological argument for each choice. And they are usually undocumented in the published paper, because the published paper reports the analysis that was actually performed, not the counterfactual decision tree that would have produced different analyses on different data.

The garden-of-forking-paths argument is that this last property is the problem. The p-value of a “significant” finding is mathematically the probability of obtaining a result at least as extreme as the one observed, under the null hypothesis. The reference distribution for that probability is the set of results the researcher would have obtained under repeated sampling from the null. If the researcher’s analytic decisions depend on properties of the sample — even if they make those decisions only once per dataset — then the reference distribution for the actual analysis is not the same as the reference distribution for the planned analysis. It is the reference distribution for the rule that maps observed data to analytic choices. And that distribution is wider than the distribution for any single fixed analysis, because it incorporates the variance of analytic outcomes across the counterfactual datasets the researcher would have encountered.

The conclusion is that a researcher who runs one analysis on their data, but who would have run a different analysis if the data had looked different, faces the same multiple-comparisons problem as a researcher who tried several analyses and reported the one that produced a significant result. The single analysis they actually performed belongs to a multiple-comparisons family that includes every analysis they would have performed under different data.
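A small simulation makes the point about the reference distribution tangible. The sketch below is an illustration under stated assumptions, not anything taken from the Gelman-Loken paper: the data are pure noise, there are two equally sized subgroups, the test is a z-test with known variance, and the researcher's rule (test the effect in whichever subgroup shows the larger observed difference) is a hypothetical stand-in for a data-contingent choice. Exactly one test is run per dataset in both conditions.

```python
import random
from statistics import NormalDist, fmean

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% critical value


def z_stat(treat, ctrl):
    """z statistic for a mean difference, outcome variance known to be 1."""
    return (fmean(treat) - fmean(ctrl)) / (1 / len(treat) + 1 / len(ctrl)) ** 0.5


def simulate(n_sims=5000, n_cell=25, seed=7):
    rng = random.Random(seed)
    planned_hits = 0
    contingent_hits = 0
    for _ in range(n_sims):
        # Four cells of pure noise: (women, men) x (treatment, control).
        cells = {
            (g, c): [rng.gauss(0, 1) for _ in range(n_cell)]
            for g in ("women", "men")
            for c in ("treatment", "control")
        }
        # Planned analysis: one pooled test, fixed before seeing any data.
        pooled = z_stat(
            cells["women", "treatment"] + cells["men", "treatment"],
            cells["women", "control"] + cells["men", "control"],
        )
        planned_hits += abs(pooled) > Z_CRIT
        # Data-contingent analysis: still one test per dataset, but run in
        # whichever subgroup happens to show the larger observed difference.
        z_by_group = {
            g: z_stat(cells[g, "treatment"], cells[g, "control"])
            for g in ("women", "men")
        }
        chosen = max(z_by_group, key=lambda g: abs(z_by_group[g]))
        contingent_hits += abs(z_by_group[chosen]) > Z_CRIT
    return planned_hits / n_sims, contingent_hits / n_sims


planned_rate, contingent_rate = simulate()
print(f"planned pooled test:      {planned_rate:.3f}")
print(f"data-contingent subgroup: {contingent_rate:.3f}")
```

With two independent subgroups, the rule's long-run false-positive rate is 1 - 0.95^2, or about 0.0975, roughly double the nominal rate, although the simulated researcher never runs a second test on any dataset.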

This is the garden. The paths that were not taken are still part of the structure.

A Concrete Example: The Implicit Multiple Comparisons In A “Simple” Analysis

To make the abstraction concrete, consider a hypothetical social psychology study. A researcher hypothesizes — before collecting any data — that exposure to a particular emotional prime will affect a particular behavioral outcome. The hypothesis is preregistered in the loose sense of being committed to writing before the experiment runs. The researcher collects 100 participants, randomly assigns half to the prime condition and half to control, and measures the outcome.

When the data comes in, the researcher inspects it. They notice that three participants in the control condition have outcome values that are more than three standard deviations from the mean. The methodology textbooks discuss outlier removal. The researcher decides to exclude these participants on the standard rationale that extreme values may reflect inattention or measurement error. The researcher also notices that the outcome distribution is somewhat skewed. They decide to log-transform the outcome variable, which is a defensible response to skewness in a parametric analysis. They then notice that the effect appears stronger in the female participants than in the male participants. The hypothesis did not specify a gender interaction, but the researcher reports the main effect for the full sample and also notes the gender-stratified analysis in a follow-up paragraph.

The researcher publishes a paper reporting a significant main effect at p = 0.04, with the outlier exclusion, log transformation, and gender stratification all described transparently in the methods section. No analysis was iterated. The hypothesis was not modified after seeing the data. The reported analysis is the only analysis performed.

The garden-of-forking-paths argument is that this paper has implicit multiple comparisons even though the researcher behaved entirely in good faith. The implicit comparisons are: the analysis that would have been performed if no participants had been outliers (no exclusion would have happened); the analysis that would have been performed if the outcome distribution had been symmetric (no transformation would have happened); the analysis that would have been performed if there had been no apparent gender effect (no stratification would have been mentioned). Each of these counterfactual analyses would have been the published analysis under slightly different data. The p-value of the actual analysis is therefore not interpretable as the false-positive rate of a single specified test. It is the false-positive rate of a decision tree of tests, conditional on the path actually walked.

The inflation produced by a single forking decision is small. The inflation produced by three or four such decisions, compounded multiplicatively, can be enough to convert a nominal alpha of 0.05 into an effective alpha well above 0.20. A “significant” finding at p = 0.04 under those conditions provides much weaker evidence against the null than the raw p-value suggests.
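A rough benchmark shows how the compounding works. It rests on an idealization that is an assumption here rather than a result from the paper: treat the garden as k paths whose tests are statistically independent, each run at level alpha. The chance that at least one path reaches significance under the null is then:

```latex
\[
\alpha_{\text{eff}} = 1 - (1 - \alpha)^{k}
\]
% With \alpha = 0.05:
%   k = 2  gives  1 - 0.95^2 \approx 0.098
%   k = 4  gives  1 - 0.95^4 \approx 0.186
%   k = 8  gives  1 - 0.95^8 \approx 0.337
```

Three binary forks define up to 2^3 = 8 paths, which is how a handful of individually small decisions can carry the effective rate past 0.20. Real forks reuse the same data and are usually positively correlated, so the actual inflation is typically smaller than this independent-paths benchmark.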

Why The Distinction From Explicit P-Hacking Matters

The Gelman-Loken paper is sometimes misread as a restatement of Simmons-Nelson-Simonsohn. It is not. The distinction it makes is important both for methodology and for how the literature is interpreted.

Explicit p-hacking is a behavior. It involves repeated analysis on the same dataset, with each analysis informed by the result of the previous one, until a significant finding is produced. Researchers who engage in explicit p-hacking can in principle stop doing so. They can preregister their analyses, follow the preregistration, and report the result whether it is significant or not. The Simmons-Nelson-Simonsohn recommendations work because they are aimed at a behavior that can be modified.

The garden of forking paths is structural. It is the mathematical consequence of the standard practice of letting analytic choices depend on properties of the data. Even a researcher who is not iterating, who is not selecting from multiple analyses, who has the hypothesis in mind before collecting data, is producing inflated false-positive rates whenever any aspect of the analysis is contingent on what the data show. The structure exists whether or not the researcher is aware of it. The structure exists even if the researcher would honestly answer “I did one analysis” if asked.

The implication is that the standard defense against the Simmons-Nelson-Simonsohn critique (“I had my hypothesis ahead of time and I did one analysis”) describes conditions that are necessary but not sufficient for a valid p-value. A researcher who had the hypothesis ahead of time but made analytic choices in response to the data is still operating inside the garden. The p-value they report still overstates the evidence, because the effective false-positive rate is higher than the nominal one. The finding is still more likely to fail to replicate than the reported significance would suggest.

This helps explain a puzzle in the early-2010s replication literature. By 2013, several large replication projects were beginning to show that the rate of failure to replicate published psychology findings was substantially higher than the nominal alpha of 0.05 would predict. The Open Science Collaboration’s Reproducibility Project, which would report its results in Science in 2015, eventually found that only about 36% of attempted replications of high-profile psychology findings produced statistically significant effects in the same direction as the original. If all the original studies had been well-powered one-shot tests of preregistered hypotheses, the expected replication rate would have been much higher. The garden-of-forking-paths argument suggested why the gap existed even in studies whose authors were not consciously p-hacking. Good-faith researchers, exercising ordinary analytic judgment on the data in front of them, were nevertheless producing findings whose effective false-positive rates could be several times the nominal 0.05.

How Good-Faith Research Produces Noise-Driven Findings

The garden-of-forking-paths mechanism is worth understanding at a finer grain because it generates noise in a way that is invisible from inside the research program that is producing it. The researchers in such a program are not engaged in any behavior they would identify as scientifically problematic. They are running studies, examining the data, making methodological decisions on the basis of training and convention, and reporting results. The structure of the noise is created by the interaction between data-dependent decisions and the existence of counterfactual decisions that would have been made under different data.

A 2016 paper by Jelte Wicherts and colleagues, “Degrees of Freedom in Planning, Running, Analyzing, and Reporting Psychological Studies: A Checklist to Avoid p-Hacking” (DOI: 10.3389/fpsyg.2016.01832), enumerated 34 degrees of freedom across the four phases of a study. The list is sobering. In the planning phase, researchers choose theoretical framework, hypothesis direction, target sample size, exclusion criteria, and dependent variables, and the choices made before data collection are themselves often based on prior data the researcher has seen. In the running phase, researchers choose when to stop collecting data, whether to add participants if initial results are ambiguous, how to handle attrition, and how to handle equipment failures or experimenter mistakes. In the analyzing phase, researchers choose how to handle outliers, which transformations to apply, which covariates to include, which subgroups to examine, which statistical tests to use, and how to handle missing data. In the reporting phase, researchers choose which analyses to feature, which to relegate to supplementary materials, how to describe the methods, and how to frame the results.

Each of these choices can be made in good faith. Many of them are made on the basis of considerations that have nothing to do with the result of any single analysis. But all of them are made by researchers who can see the data, and most of them are made in ways that would have been different under different data. The cumulative degrees of freedom across a single study produce a garden whose path-count, even under conservative assumptions, runs into the hundreds. Reporting one path from such a garden as if it were a single specified test is the source of the inflation.
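The arithmetic behind a path-count in the hundreds is just a product of the options available at each fork. The sketch below uses made-up option counts for six analysis-phase forks. The numbers are hypothetical and chosen only to show the order of magnitude, and they are not taken from the Wicherts checklist.

```python
from math import prod

# Hypothetical number of defensible options at each analysis-phase fork.
forks = {
    "outlier rule": 3,
    "transformation": 2,
    "covariate set": 4,
    "subgroup breakdown": 3,
    "statistical test": 2,
    "missing-data handling": 2,
}

# Every combination of options is one distinct path through the garden.
path_count = prod(forks.values())
print(path_count)  # 288
```

Six modest forks already yield 288 distinct analyses, before any planning-phase or reporting-phase choices are counted.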

The Wicherts checklist is an attempt to operationalize the Gelman-Loken insight as a procedural intervention. The checklist asks researchers, for each degree of freedom, either to specify the choice in advance (closing the fork) or to disclose the choice and the alternatives that were considered (illuminating the fork). The procedural goal is to reduce the number of paths the researcher could have taken from the data they actually observed, by constraining the decisions either ex ante through preregistration or ex post through disclosure.

The Modern Response: Detailed Preregistration, Blinded Analysis, Registered Reports

The methodological response to the garden-of-forking-paths argument has had three main components, each addressing a different aspect of the structural problem.

The first is detailed analysis-plan preregistration. Conventional preregistration as it was practiced before about 2013 typically committed researchers to a hypothesis and a design but left analytic choices unspecified. The Gelman-Loken extension of the Simmons-Nelson-Simonsohn critique made it clear that hypothesis preregistration was insufficient. Modern preregistration platforms like the Open Science Framework’s standardized templates, the AsPredicted template, and the more elaborate Registered Reports template (developed by Chris Chambers at Cardiff and now adopted by over 300 journals) require researchers to specify in advance not only the hypothesis but the exclusion criteria, the variable coding rules, the primary analysis, the covariates that will and will not be included, the rule for handling outliers, and the criteria for any subgroup analyses. The goal is to close as many forks as possible before the data is observed, so that the path the researcher walks is the only path they could have walked from the planned starting point.
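One lightweight way to picture a detailed analysis plan is as a frozen record whose fingerprint is fixed before the data are opened. The sketch below is an illustration only: the `AnalysisPlan` class, its field names, and the example values are all hypothetical and do not correspond to any platform's actual template or API.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class AnalysisPlan:
    """Hypothetical record of the forks closed before data collection."""
    hypothesis: str
    primary_outcome: str
    exclusion_rule: str
    transformation: str
    covariates: tuple
    subgroup_analyses: tuple
    test: str
    alpha: float = 0.05

    def fingerprint(self):
        # Hash a canonical JSON rendering so any later change is detectable.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


plan = AnalysisPlan(
    hypothesis="The prime increases the behavioral outcome",
    primary_outcome="outcome_score",
    exclusion_rule="none",
    transformation="none",
    covariates=(),
    subgroup_analyses=(),
    test="two-sample t-test, two-sided",
)
registered_fingerprint = plan.fingerprint()  # recorded before seeing data

# A later, data-driven change produces a different fingerprint.
amended = replace(plan, exclusion_rule="drop values beyond 3 SD")
deviates = amended.fingerprint() != registered_fingerprint
print(deviates)  # True
```

The design choice is that a deviation is not forbidden, only made visible: the amended plan can still be run, but it can no longer pass as the registered one.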

The second is blinded analysis. In a blinded analysis, the researcher constructs and tests the analytic pipeline on a version of the data in which the key variables are scrambled, permuted, or replaced with simulated values. The structural advantage is that the researcher cannot make data-dependent decisions about the analysis because they cannot see the data that would inform those decisions. Once the pipeline is locked, the blind is lifted and the pipeline is applied to the real data. The technique is standard in high-energy physics, where the Higgs boson search analyses were developed blind before the signal region was examined, and it is increasingly adopted in human subjects research where the data can be partitioned into a planning sample and a confirmatory sample.
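The permutation variant of blinding can be sketched as follows. This is a minimal illustration that assumes a flat list of records with `condition` and `outcome` fields. The helper names `blind` and `pipeline` are hypothetical, and the data are simulated noise rather than anything from a real study.

```python
import random
from statistics import fmean


def blind(records, seed):
    """Return a copy of the records with condition labels shuffled."""
    labels = [r["condition"] for r in records]
    random.Random(seed).shuffle(labels)
    return [{**r, "condition": lab} for r, lab in zip(records, labels)]


def pipeline(records):
    """Locked analysis: mean outcome in treatment minus mean in control."""
    treat = [r["outcome"] for r in records if r["condition"] == "treatment"]
    ctrl = [r["outcome"] for r in records if r["condition"] == "control"]
    return fmean(treat) - fmean(ctrl)


# Simulated stand-in for a real dataset (pure noise, alternating assignment).
rng = random.Random(0)
real = [
    {"id": i, "condition": "treatment" if i % 2 else "control",
     "outcome": rng.gauss(0, 1)}
    for i in range(100)
]

# Develop and debug the pipeline on the blinded copy only.
blinded = blind(real, seed=12345)
blinded_estimate = pipeline(blinded)

# After the pipeline is locked, run it once on the real labels.
final_estimate = pipeline(real)
```

Because the shuffled labels carry no information about the true group difference, choices about exclusions or transformations made while looking at `blinded` cannot be steered, even unintentionally, by the result they would produce.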

The third is registered reports. A registered report is a paper whose introduction, methods, and analysis plan are peer-reviewed and accepted in principle by a journal before the data is collected. After the data is collected, the authors execute the planned analysis, and the journal commits to publishing the result regardless of whether the finding is statistically significant. The structural intervention is to make the publication contract independent of the result, which removes much of the incentive that creates the garden in the first place. By 2024 over 300 journals had adopted some form of the registered reports format, and the early evidence suggests that registered-report findings hold up better than conventional papers in the same fields, which is consistent with the underlying methodological argument.

These responses are not perfect substitutes for each other. Preregistration is the most widely adopted but the least binding — researchers can deviate from preregistered plans, and the deviations may or may not be transparent in the published paper. Blinded analysis is the most structurally rigorous but requires substantial technical setup and is most feasible in fields where the planning sample can be artificially generated. Registered reports are the most structurally complete because they remove the publication incentive that produces the garden, but they require institutional buy-in from journals and a willingness from researchers to commit to a result they cannot yet see.

The Strategist Takeaway

A working strategist or consumer of research can apply the garden-of-forking-paths framework as a discount on any finding that has not been preregistered with a detailed analysis plan, or replicated by an independent team. The discount is not because the original researchers were dishonest. The discount is because the structural conditions under which the finding was produced did not constrain the analytic decisions enough to make the reported p-value a reliable estimate of the false-positive rate. The garden was present. The path was walked. The other paths the researcher could have walked are not visible from the paper, but they are part of the reference distribution that the reported p-value should be interpreted against.

The practical heuristics are these. First, treat any single-study finding at p between 0.01 and 0.05 with substantial skepticism unless the study was preregistered with a detailed analysis plan or has been independently replicated. The garden inflation is enough at those p-values to convert nominal significance into effective non-significance under realistic assumptions about the number of forks. Second, prefer studies that report null results alongside significant ones — a research program that reports both is less likely to be selectively showing the surviving path from a garden. Third, prefer registered reports and replications over single-study original findings whenever both exist for the same claim. The structural conditions under which registered reports and replications are produced make the resulting p-values more interpretable. Fourth, treat post-hoc subgroup analyses as hypothesis-generating rather than hypothesis-confirming — the garden is widest at the subgroup level, where the count of plausible breakdowns multiplies the count of plausible analyses within each breakdown. Fifth, when you yourself are running an analysis on data you have collected, write down the planned analysis before opening the dataset, and treat any deviation from the plan as a flag that the resulting p-value should be interpreted in light of the deviation.

The deeper takeaway is that the absence of conscious p-hacking is not the same as the presence of valid inference. A research culture in which every individual researcher is acting in good faith can still produce a literature that fails to replicate, because the structural conditions of how studies are designed, analyzed, and selected for publication create implicit multiple comparisons that the individual researcher cannot see from inside their own work. The Gelman-Loken argument is an argument about systems, not about individuals. The response has to be at the system level: through preregistration, through blinded analysis, through registered reports, and through replication as a routine institutional practice rather than an exception.

Sources

Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. Working paper, Department of Statistics, Columbia University. http://www.stat.columbia.edu/~gelman/research/unpublished/forking.pdf
Gelman, A., & Loken, E. (2014). The statistical crisis in science. American Scientist, 102(6), 460-465. https://www.americanscientist.org/article/the-statistical-crisis-in-science
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366. DOI: 10.1177/0956797611417632
Wicherts, J. M., Veldkamp, C. L. S., Augusteijn, H. E. M., Bakker, M., van Aert, R. C. M., & van Assen, M. A. L. M. (2016). Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p-hacking. Frontiers in Psychology, 7, 1832. DOI: 10.3389/fpsyg.2016.01832
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. DOI: 10.1126/science.aac4716
Chambers, C. D. (2013). Registered reports: A new publishing initiative at Cortex. Cortex, 49(3), 609-610. DOI: 10.1016/j.cortex.2012.12.016

Related

P-Hacking and Researcher Degrees of Freedom: the Simmons-Nelson-Simonsohn 2011 paper that Gelman and Loken extended, and the explicit-iteration class of analytic flexibility.
HARKing: Hypothesizing After Results Are Known: Norbert Kerr’s 1998 framework for the closely related practice of presenting post-hoc hypotheses as if they had been specified in advance, which interacts with the garden mechanism.
The Multiple Comparisons Problem in A/B Testing: the same mechanism applied to applied experimentation, where multi-arm tests and secondary-metric checking produce explicit gardens with quantifiable inflation.
Ioannidis 2005: Why Most Published Research Findings Are False: the Bayesian-base-rate framework that subsumes the garden-of-forking-paths mechanism as one of several inflation sources in the bias term of the PPV formula.
The Peeking Problem in A/B Testing: the sequential-testing variant of the garden, where data-dependent decisions about when to stop collecting more participants create implicit forks in the temporal dimension.

FAQ

What is the difference between p-hacking and the garden of forking paths?

P-hacking is the explicit iterative practice of running multiple analyses on the same dataset and selecting the one that produces a significant result. The garden of forking paths is the structural condition in which a researcher runs one analysis but makes analytic decisions that depend on properties of the data, such that they would have run a different analysis under different data. P-hacking is a behavior. The garden is a structure. The garden produces inflated false-positive rates even when the researcher is not p-hacking, because the counterfactual analyses the researcher would have run under different data are statistically part of the reference distribution for the analysis they actually ran.

Does preregistering the hypothesis solve the garden problem?

No, not by itself. Preregistering the hypothesis closes the forks related to which hypothesis is tested, but it leaves open the forks related to how the hypothesis is tested — which covariates are included, how outliers are handled, which subgroups are analyzed, how variables are coded. Detailed analysis-plan preregistration, in which the analytic decisions are specified in advance, is what the modern methodological literature recommends. The Open Science Framework templates, the AsPredicted template, and the Registered Reports format all require analytic detail beyond the hypothesis itself.

Are the original researchers acting in bad faith when they produce garden-driven findings?

No. The whole point of the Gelman-Loken argument is that the garden produces inflated false-positive rates in research programs whose participants are entirely sincere. The researchers do not see the counterfactual decisions they would have made under different data, because they only see the data they actually collected. The methodological response is structural — preregistration, blinding, registered reports — precisely because individual good faith is not enough to prevent the inflation. The argument is about systems, not about individuals.

Why is the discount on garden-driven findings hard to quantify?

The size of the inflation depends on the number of forks in the garden and the correlation structure among the alternative analyses, both of which are typically not reported in the published paper and may not be fully accessible to the original researcher in retrospect. Simulations like the Simmons-Nelson-Simonsohn Monte Carlo can produce point estimates for stylized cases (in their simulation, choosing between two dependent variables correlated at r = 0.5 raised the false-positive rate from 0.05 to about 0.095, and combining all four flexibilities raised it above 0.60), but real research programs vary in the number and structure of forks. The practical response is to use registered reports and direct replications as the empirical gold standard, since these structural features remove most of the inflation by construction rather than by estimation.
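The dependence on both the number of forks and their correlation can be explored with a stylized simulation. The sketch below assumes k test statistics that are standard normal under the null and share a common pairwise correlation rho. That equicorrelated structure, the function name `effective_alpha`, and the chosen values of k and rho are illustrative assumptions, not estimates for any real study.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% critical value


def effective_alpha(k, rho, n_sims=20000, seed=11):
    """Share of null datasets where at least one of k tests is significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        shared = rng.gauss(0, 1)
        # Each statistic mixes a shared component with its own noise, giving
        # variance 1 and pairwise correlation rho.
        zs = [
            rho ** 0.5 * shared + (1 - rho) ** 0.5 * rng.gauss(0, 1)
            for _ in range(k)
        ]
        hits += any(abs(z) > Z_CRIT for z in zs)
    return hits / n_sims


independent = effective_alpha(k=4, rho=0.0)
correlated = effective_alpha(k=4, rho=0.9)
print(f"4 independent paths:       {independent:.3f}")
print(f"4 highly correlated paths: {correlated:.3f}")
```

Independent paths give a rate near 1 - 0.95^4, about 0.186, while highly correlated paths stay much closer to the nominal 0.05. Since a published paper rarely reveals either k or rho, the discount for a given finding cannot be pinned down from the outside.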

Has the garden-of-forking-paths argument been empirically validated?

Indirectly, yes. The Open Science Collaboration’s 2015 Reproducibility Project found that only about 36% of high-profile psychology findings replicated in independent samples, and the average effect size in the replications was less than half the average effect size in the originals. If the original findings had been produced by single specified tests with valid p-values, the replication rate should have been much higher. The gap is consistent with the predictions of the Gelman-Loken argument — that good-faith research using standard methodology produces effective false-positive rates well above the nominal 0.05. Subsequent replication projects in economics, cancer biology, and neuroscience have produced similar patterns, with replication rates well below what the nominal p-values of the original studies would predict.


Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified in behavioral economics. Led 100+ in-house experiments at NRG in 2025, with project evidence and limits documented in the case studies.
