In 2011, three researchers published a paper claiming that listening to “When I’m Sixty-Four” by The Beatles literally made people younger. Not “feel younger.” Younger. Their age — by 1.5 years — at p < .05. The control song was “Kalimba,” a generic instrumental that ships with Windows 7.

The result was statistically significant. It was also, by design, a false positive. Joseph Simmons, Leif Nelson, and Uri Simonsohn ran the study honestly, collected the data honestly, and used only analytical choices that any reviewer in 2011 would have accepted as standard practice. They got an impossible result anyway. Then they wrote a paper explaining exactly how they did it, and in doing so launched the critique that now goes by a name the same authors introduced in later work: p-hacking.

Their 2011 Psychological Science paper, “False-Positive Psychology,” is the methodological bombshell that made the replication crisis impossible to ignore. The Bem precognition paper showed something was broken. Ioannidis 2005 explained why most published findings were probably wrong. But Simmons et al. did something different: they demonstrated, by simulation, that the false-positive rate under standard psychology practice was not 5% as advertised. With four common flexibilities combined, it was over 60%.

This article walks through what they did, why the math is unforgiving, and how the concept of “researcher degrees of freedom” has reshaped how you should read any quantitative claim — academic, commercial, or otherwise.

The Beatles study: a demonstration in bad faith done in good faith

The setup was almost insultingly simple. Thirty-four undergraduates listened to one of three songs: “When I’m Sixty-Four,” “Kalimba,” or “Hot Potato” by The Wiggles. Afterwards, they reported their date of birth and their father’s age (used as a covariate). The hypothesis: listening to “When I’m Sixty-Four,” a song explicitly about aging, would make participants chronologically younger. The published comparison, between the first two songs, included 20 of those participants.

The result: participants who heard “When I’m Sixty-Four” were 1.5 years younger than those who heard “Kalimba,” controlling for father’s age, F(1, 17) = 4.92, p = .040.

This is, of course, biologically impossible. Songs do not modify birth certificates. The point of the demonstration was that Simmons, Nelson & Simonsohn (2011) got this impossible result while using only the following four flexibilities — each of which was, at the time, considered acceptable methodology in mainstream psychology:

  1. They chose between two dependent variables (felt age vs. chronological age) after seeing the data.
  2. They added 10 more observations and re-checked significance.
  3. They chose between three song conditions after seeing which one produced the cleanest contrast.
  4. They added father’s age as a covariate after testing it both with and without.

None of these moves would have been disclosed in a typical 2011 methods section. None of them required dishonesty. Every single one was a routine decision that a working researcher would have justified as “trying things to see what the data tells us.” Stacked together, they produced a false positive at the conventional p < .05 threshold.

The simulation: 60.7% false positives, not 5%

The Beatles study was a parlor trick. The real contribution of the paper was the Monte Carlo simulation that proved this wasn’t a fluke. Simmons et al. simulated 15,000 datasets in which the null hypothesis was true by construction — there was, by stipulation, zero real effect. They then applied combinations of four common researcher flexibilities to each simulated dataset and recorded how often the analyst would have found a “significant” result at p < .05.

The expected false-positive rate, if the data had been analyzed in a single pre-specified way, is 5%. That is what the alpha threshold guarantees.

The simulated false-positive rates were:

  • Flexibility in stopping rule alone (collect 20, check, collect 10 more if not significant): 7.7%
  • Two dependent variables, report whichever is significant: 9.5%
  • Three conditions, choose the comparison post hoc: 12.6%
  • Adding or dropping a covariate based on whichever helps: 11.7%

Combining all four — exactly the menu of choices any working researcher had in 2011 — produced a false-positive rate of 60.7%.

Read that again. With four small “looks like a normal analytical decision” knobs, you get a false-positive rate that is twelve times what your statistical machinery is supposed to deliver. Under those conditions, a researcher chasing an effect that does not exist is more likely than not to find something significant to report.
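The compounding can be reproduced in miniature. The sketch below is not the authors' simulation code. It is a simplified stand-in that assumes normally distributed outcomes with known unit variance (so a z-test replaces the paper's t-test), two outcome measures correlated at 0.5, and one extra look after adding 10 observations per cell. Only two of the four flexibilities are modelled, and all function names are illustrative.

```python
import random
from statistics import NormalDist, fmean

ALPHA = 0.05
NORMAL = NormalDist()

def p_value(a, b):
    """Two-sided z-test p-value for a difference in means (unit variance assumed)."""
    z = (fmean(a) - fmean(b)) / (1 / len(a) + 1 / len(b)) ** 0.5
    return 2 * (1 - NORMAL.cdf(abs(z)))

def draw_cell(rng, n, rho=0.5):
    """n participants, each with two outcome measures correlated at rho. No true effect."""
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = rho * x + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        rows.append((x, y))
    return rows

def significant(ctrl, treat, two_dvs):
    """True if any outcome the analyst is willing to report reaches p < ALPHA."""
    n_outcomes = 2 if two_dvs else 1
    return any(
        p_value([r[k] for r in ctrl], [r[k] for r in treat]) < ALPHA
        for k in range(n_outcomes)
    )

def false_positive_rate(two_dvs, peek, n_sims=10_000, seed=1):
    """Share of null experiments that yield at least one 'significant' result."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        ctrl, treat = draw_cell(rng, 20), draw_cell(rng, 20)
        found = significant(ctrl, treat, two_dvs)
        if not found and peek:
            # Optional stopping: add 10 per cell and test once more.
            ctrl += draw_cell(rng, 10)
            treat += draw_cell(rng, 10)
            found = significant(ctrl, treat, two_dvs)
        hits += found
    return hits / n_sims

for two_dvs, peek in [(False, False), (True, False), (False, True), (True, True)]:
    rate = false_positive_rate(two_dvs, peek)
    print(f"two DVs={two_dvs!s:5}  peek={peek!s:5}  false-positive rate={rate:.3f}")
```

With neither flexibility switched on, the rate should sit near the nominal 5%. Each flexibility pushes it up, and the two together push it higher than either alone. The exact figures will differ from the paper's because the setup is simplified.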

This was the moment the methodology section of psychology — and by extension, every behavioral science that borrowed psychology’s playbook, including marketing, UX research, and business experimentation — got rewritten.

The four flexibility dimensions in detail

The paper’s contribution was naming and quantifying the moves. Here are the four researcher degrees of freedom Simmons et al. focused on, and why each one is corrosive even when used innocently.

1. Flexibility in choosing the dependent variable

You measure both “how much participants liked the product” and “intent to purchase.” You analyze both. One is significant, one isn’t. You write up the significant one as your primary finding and either omit the other or relegate it to a footnote.

Each additional outcome you measure but only report when significant adds an opportunity for a false positive. With two independent outcomes, the chance of at least one false positive roughly doubles, to just under 10%. With five, it rises to about 23%. The reader has no way of knowing how many outcomes you measured. They only see the one that worked.
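For the special case of k independent outcomes, each tested at alpha = .05 with no true effect, the chance that at least one comes up significant is 1 - (1 - alpha)^k. Real outcome measures are usually correlated, which gives a somewhat lower figure, so treat this as an upper-end illustration. The function name is illustrative.

```python
def familywise_error(k, alpha=0.05):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k

for k in (1, 2, 5, 10):
    print(f"{k:>2} outcomes: {familywise_error(k):.3f}")
```

The growth is close to linear for small k and flattens as k rises, so each extra outcome adds a little less than a full five percentage points.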

2. Flexibility in stopping (optional stopping)

You collect data from 30 participants, run your test, see p = .11. You collect 10 more, re-run, see p = .08. You collect 10 more, see p = .04. You stop and report.

This is mathematically guaranteed to inflate false positives. The p-value calculation assumes you fixed your sample size in advance. Peeking and continuing if not significant is a different statistical procedure — one that has a much higher Type I error rate than the nominal alpha. This is the same mechanism that makes naive A/B test peeking dangerous in product analytics. See the A/B testing peeking problem for the commercial-side version of this exact mistake.
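The three-look sequence in the example can be simulated directly. This sketch assumes a two-group comparison with normally distributed, unit-variance outcomes and no true effect. It reads the example's sample sizes as per-group counts and uses a z-test for simplicity. None of these details come from the paper.

```python
import random
from statistics import NormalDist, fmean

NORMAL = NormalDist()

def p_value(a, b):
    """Two-sided z-test p-value for a difference in means (unit variance assumed)."""
    z = (fmean(a) - fmean(b)) / (1 / len(a) + 1 / len(b)) ** 0.5
    return 2 * (1 - NORMAL.cdf(abs(z)))

def peeking_false_positive_rate(looks, alpha=0.05, n_sims=10_000, seed=2):
    """Share of null experiments declared significant at any of the per-group sample sizes in `looks`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        a, b = [], []
        for n in looks:
            # Top up both groups to the next checkpoint, then test.
            a += [rng.gauss(0, 1) for _ in range(n - len(a))]
            b += [rng.gauss(0, 1) for _ in range(n - len(b))]
            if p_value(a, b) < alpha:
                hits += 1
                break  # the analyst stops and reports
    return hits / n_sims

print("fixed n=50:          ", peeking_false_positive_rate([50]))
print("peek at 30, 40, 50:  ", peeking_false_positive_rate([30, 40, 50]))
print("peek every 10 to 100:", peeking_false_positive_rate(list(range(10, 101, 10))))
```

The single pre-planned look should land near 5%. The more checkpoints the analyst allows, the further the rate drifts above it, even though every individual test is computed correctly.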

3. Flexibility in choosing covariates

You run your main effect. It’s not significant. You add gender as a covariate. Now it is. You add age. Now it isn’t. You drop age, keep gender. You report the final model.

Each covariate you can include or exclude doubles the number of analytical paths through the data. Three optional covariates means eight possible analyses. If even one of those eight gives you p < .05 and you only report that one, your effective alpha is no longer 5%.
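The doubling is easy to see by listing the specifications. In this small sketch the covariate names are hypothetical (the text mentions gender and age; income is added only to reach three), and the model strings are just labels, not fitted models.

```python
from itertools import combinations

covariates = ["gender", "age", "income"]  # hypothetical optional covariates

# Every subset of the optional covariates is a distinct analysis the researcher could report.
specs = [
    subset
    for size in range(len(covariates) + 1)
    for subset in combinations(covariates, size)
]

for subset in specs:
    print("outcome ~ treatment" + "".join(f" + {name}" for name in subset))

print(len(specs), "possible analyses")
```

Three optional covariates give 2^3 = 8 specifications, and a fourth would give 16.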

4. Flexibility in reporting experimental conditions

You run a three-arm study: control, treatment A, treatment B. You contrast A vs. control (significant), B vs. control (not), A vs. B (not). You write up A vs. control as your main finding and explain that B was “a manipulation check” or “exploratory.”

This is one of the moves behind the Beatles result. Three conditions means three pairwise contrasts. Without a multiple comparisons correction (which most papers in 2011 did not apply at the condition level), the chance that at least one of those contrasts is spuriously significant under the null is well above the nominal 5%. The paper’s simulation of this flexibility gave 12.6%. This is the same problem that destroys credibility in marketing tests that compare many variants without correction; see A/B testing multiple comparisons for the commercial parallel.
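A rough simulation of the three-arm case shows the size of the problem and the effect of a correction. The sketch assumes three arms of 20 with unit-variance normal outcomes, no true differences, and z-tests. Bonferroni is used as one familiar correction; the text above does not name a specific one.

```python
import random
from itertools import combinations
from statistics import NormalDist, fmean

NORMAL = NormalDist()

def p_value(a, b):
    """Two-sided z-test p-value for a difference in means (unit variance assumed)."""
    z = (fmean(a) - fmean(b)) / (1 / len(a) + 1 / len(b)) ** 0.5
    return 2 * (1 - NORMAL.cdf(abs(z)))

def any_contrast_rate(alpha, n_per_arm=20, n_sims=10_000, seed=3):
    """Share of null three-arm studies where at least one pairwise contrast has p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        arms = [[rng.gauss(0, 1) for _ in range(n_per_arm)] for _ in range(3)]
        if any(p_value(a, b) < alpha for a, b in combinations(arms, 2)):
            hits += 1
    return hits / n_sims

print("uncorrected, alpha = .05:   ", any_contrast_rate(0.05))
print("Bonferroni, alpha = .05 / 3:", any_contrast_rate(0.05 / 3))
```

Testing each contrast at .05 should give a familywise rate noticeably above 5%, while dividing alpha by the number of contrasts brings it back to roughly the nominal level.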

The 21-word solution: Simmons et al.’s disclosure remedy

After demonstrating the problem, Simmons, Nelson & Simonsohn proposed a fix. Not the only fix (preregistration, large-scale replication, and p-curve all followed), but the simplest and most immediately actionable: require authors to disclose, in every paper, the choices that inflate false-positive rates.

They proposed six requirements for authors and four guidelines for reviewers. In a short follow-up the next year, they condensed the author-side disclosure into a single sentence, the famous “21-word solution”:

“We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study.”

Twenty-one words. If a paper cannot truthfully make that claim, the reader needs to apply much stronger skepticism to the conclusions. If a paper does make the claim, the reader can at least see every measure, manipulation, and exclusion, and judge whether the researcher degrees of freedom catalogued in the paper could have produced the result.

The 21-word solution has since been adopted as a standard disclosure by journals like Psychological Science, Journal of Experimental Social Psychology, and dozens of others. It is now a normal part of methods sections. In 2011, it was a radical demand.

The six requirements for authors (Simmons et al. 2011, Table 2) are:

  1. Decide on the rule for terminating data collection before data collection begins and report this rule in the article.
  2. Collect at least 20 observations per cell or provide a compelling cost-of-data-collection justification.
  3. List all variables collected in a study.
  4. Report all experimental conditions, including failed manipulations.
  5. If observations are eliminated, also report what the statistical results are if those observations are included.
  6. If an analysis includes a covariate, report the statistical results of the analysis without the covariate.

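The reviewer-side guidelines begin with: enforce the above. They also ask reviewers to tolerate imperfect results, to require authors to show that findings do not hinge on arbitrary analytic decisions, and to require an exact replication when the justification for a data collection or analysis choice is not compelling. Don’t let authors retreat into vague language about “preliminary studies” or “robustness checks.” Require the full menu of choices to be laid out so readers can assess them.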

The garden of forking paths: Gelman and Loken’s generalization

Two years after Simmons et al., statisticians Andrew Gelman and Eric Loken wrote a working paper, “The Garden of Forking Paths,” that pushed the argument further. The standard p-hacking critique assumes the researcher consciously tried many analyses and reported only the significant one. But Gelman and Loken (2013) pointed out that even a researcher who runs exactly one analysis can be implicated in the same statistical problem.

How? Because the analysis they ran was conditional on the data they saw. Had the data looked different, they would have run a different analysis. The forking paths are not paths actually taken — they are paths that would have been taken in counterfactual datasets.

Concretely: imagine a researcher who, upon seeing the data, decides that the effect is moderated by gender, so they split the sample. Had the gender split not been significant, they might have split by age. Had the age split not worked, they might have dropped outliers. Each of these conditional decisions, even if only one is actually executed, contributes to the multiple-comparison problem because the decision rule depends on the data.

This is why preregistration matters and why “we just did one analysis” is not a sufficient defense. The analysis you did was selected from a garden of possible analyses, and the selection mechanism was the data itself. The Type I error rate of your one analysis is not the nominal alpha. It is the probability that your data-dependent procedure, across all the datasets you might have seen, ends in a significant result, and that probability grows with the number of paths through the garden you would have been willing to take.

The Gelman-Loken paper is short, accessible, and worth reading in full. It generalizes Simmons et al.’s point from “researchers who try many things” to “researchers who make any data-dependent decisions at all.”

P-curve: the diagnostic Simonsohn built next

Once you accept that the literature is contaminated by p-hacking, the obvious question is: how do you tell which specific findings are real? Uri Simonsohn — one of the three authors of the 2011 paper — built a diagnostic to answer this. The result was P-curve, published as Simonsohn, Nelson & Simmons (2014) in Journal of Experimental Psychology: General.

The intuition is elegant. If a research literature contains real effects, the distribution of significant p-values (those below .05) should be right-skewed — concentrated near zero, tapering off toward .05. This is because true effects, when they hit statistical significance, tend to clear the bar comfortably.

If a research literature is mostly p-hacked, the distribution of significant p-values should be left-skewed — clustered just below .05. This is because p-hacking is a process of barely crossing the threshold: you stop tweaking the moment you get p = .049, not p = .002.

So you take a set of papers on a topic, extract every significant p-value, and plot the histogram. A real-effect literature shows a steep right-skew. A p-hacked literature shows a hump near .05. A null literature shows a flat distribution.
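Two of these three shapes are easy to generate by simulation. The sketch below is not the P-curve procedure itself, which involves formal tests of skew. It only bins the significant p-values from simulated studies, assuming two groups of 20, unit-variance normal outcomes, a z-test, and an illustrative true effect of half a standard deviation. The p-hacked case is left out because its shape depends on how the hacking is modelled.

```python
import random
from statistics import NormalDist, fmean

NORMAL = NormalDist()

def p_value(a, b):
    """Two-sided z-test p-value for a difference in means (unit variance assumed)."""
    z = (fmean(a) - fmean(b)) / (1 / len(a) + 1 / len(b)) ** 0.5
    return 2 * (1 - NORMAL.cdf(abs(z)))

def significant_p_histogram(effect, n=20, n_sims=20_000, seed=4):
    """Counts of significant p-values in five bins of width .01, from [0, .01) up to [.04, .05)."""
    rng = random.Random(seed)
    bins = [0] * 5
    for _ in range(n_sims):
        ctrl = [rng.gauss(0, 1) for _ in range(n)]
        treat = [rng.gauss(effect, 1) for _ in range(n)]
        p = p_value(ctrl, treat)
        if p < 0.05:
            # min() guards against floating-point rounding at the top edge.
            bins[min(int(p * 100), 4)] += 1
    return bins

for label, effect in [("true effect (0.5 SD)", 0.5), ("no effect", 0.0)]:
    print(f"{label:22}", significant_p_histogram(effect))
```

With a real effect, the lowest bin should hold far more p-values than the bin just under .05. With no effect, the five bins should come out roughly equal.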

P-curve has been applied to dozens of literatures with varying results. Some literatures pass — strong right-skew, real effects. Some literatures fail catastrophically — flat or left-skewed, suggesting nothing real underneath. The technique has the virtue of being applicable to existing published work without requiring new experiments. You can audit a research field with a calculator and a list of p-values.

The P-curve methodology has its critics: it makes assumptions about how p-hacking proceeds, and sophisticated p-hackers can defeat it. But it added a tool to the methodological toolkit that didn’t exist before, and it has been used both to support some literatures and to cast serious doubt on others.

How widespread is p-hacking? John, Loewenstein & Prelec 2012

A natural follow-up question: do researchers actually p-hack? Or is this a theoretical concern that doesn’t reflect real behavior?

Leslie John, George Loewenstein, and Drazen Prelec answered this with an anonymous survey of 2,155 academic psychologists, published as John et al. (2012) in Psychological Science. They used a “Bayesian truth serum” technique to incentivize honest reporting and asked researchers whether they had ever engaged in 10 specific questionable research practices (QRPs).

The self-admission rates were striking:

  • Failing to report all dependent measures: 63.4% admitted to it
  • Deciding whether to collect more data after looking at results: 55.9% admitted to it
  • Failing to report all conditions of a study: 27.7% admitted to it
  • Stopping data collection early because results were significant: 15.6% admitted to it
  • Rounding off p-values (e.g., reporting p = .054 as p < .05): 22.0% admitted to it
  • Selectively reporting studies that “worked”: 45.8% admitted to it
  • Deciding whether to exclude data after looking at impact: 38.2% admitted to it
  • Reporting an unexpected finding as predicted from the start (HARKing): 27.0% admitted to it
  • Claiming results are unaffected by demographic variables when actually unsure: 3.0% admitted to it
  • Falsifying data: 1.7% admitted to it

The headline finding: questionable research practices are the norm, not the exception. Self-admission rates in the group given the Bayesian truth serum incentive were higher than in the control group for most practices, which suggests the raw figures understate the true prevalence.

This matters because it confirms that the false-positive simulation in Simmons et al. (2011) is not a hypothetical worst case. It is roughly the actual analytical environment that produced the psychology literature pre-2011. The replication crisis is not a mystery — it is the predictable downstream consequence of a publishing system that incentivized researchers to deploy researcher degrees of freedom and a peer-review system that did not require disclosure of those choices.

What this means for you as a strategist, marketer, or operator

If you build products, design experiments, or interpret research to make business decisions, the Simmons et al. paper is required reading for three reasons.

First, it changes how you read every behavioral science claim. Before 2011, you could roughly assume that a published, peer-reviewed psychology finding was probably real. After 2011, and especially after the replication failures documented in the OSC 2015 reproducibility project, you should default to skepticism for any claim that hasn’t been independently replicated, preregistered, or assessed by p-curve. A single p < .05 from a single study is weak evidence. Depending on how plausible the hypothesis was to begin with, how well powered the study was, and how much analytical flexibility went undisclosed, it can easily be more likely a false positive than a true effect.

Second, the same researcher degrees of freedom are present in your own work. Every commercial A/B test, every marketing experiment, every product metric review contains the same flexibilities Simmons et al. catalogued. If you peek at your test results midway and decide whether to keep going, you are exercising flexibility in stopping. If you run a multivariate test and report only the segments where the treatment won, you are exercising flexibility in conditions. If you re-segment your test results by demographic and find a “winning” segment, you are walking the garden of forking paths. Your effective false-positive rate is not 5%. It is whatever a simulation like the one in Simmons et al. would compute for your specific menu of analytical moves. In the paper’s own simulations, combining three of the flexibilities gave 30.9% and combining all four gave 60.7%.

Third, the disclosure remedy applies to commercial research too. When you read a vendor’s case study claiming a 40% lift, ask the 21-word equivalent: did they report how they determined sample size? Did they report all variants tested, not just the winner? Did they report all metrics, not just the one that moved? Did they report the test if they re-ran it on a different cohort? If the answer to any of these is “they don’t say,” you are looking at a marketing artifact, not evidence. Apply the same skepticism a 2026 peer reviewer would apply to a 2011-era psychology paper.

The deeper point is cultural. The replication crisis is not a story about dishonest researchers. It is a story about an entire field’s analytical norms producing predictably broken outputs. The same dynamics produce predictably broken outputs in commercial behavioral science — marketing, product, UX research, customer success analytics — wherever the incentive to find significant results exists and the disclosure norms are weak. The fix in research was disclosure, preregistration, and replication. The fix in commercial behavioral science is the same: pre-specify your hypotheses, report all variants and metrics, and require replication before treating a finding as decision-grade evidence.

Sources

  • Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366. https://doi.org/10.1177/0956797611417632
  • Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. Department of Statistics, Columbia University. http://www.stat.columbia.edu/~gelman/research/unpublished/forking.pdf
  • Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143(2), 534-547. https://doi.org/10.1037/a0033242
  • John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524-532. https://doi.org/10.1177/0956797611430953
  • Nelson, L. D., Simmons, J., & Simonsohn, U. (2018). Psychology’s renaissance. Annual Review of Psychology, 69, 511-534. https://doi.org/10.1146/annurev-psych-122216-011836

FAQ

What does “p-hacking” mean exactly?

P-hacking is the practice of trying multiple analytical approaches (different dependent variables, different sample sizes, different covariates, different sub-samples) until one produces a statistically significant result (typically p < .05), and then reporting only that result as if it were the only analysis performed. Simmons, Nelson & Simonsohn (2011) described the practice under the label “researcher degrees of freedom” and demonstrated that combining four common flexibilities inflates the false-positive rate from 5% to over 60%. The same authors introduced the term “p-hacking” in later work.

What are “researcher degrees of freedom”?

Researcher degrees of freedom are the analytical choices a researcher makes after seeing the data: which outcome variable to feature, whether to add or drop covariates, whether to exclude outliers, when to stop collecting data, which experimental conditions to report. In the Simmons et al. (2011) simulations, each of the four flexibilities used alone raised the false-positive rate from 5% to between 7.7% and 12.6%, and all four together raised it to 60.7%. Gelman and Loken (2013) generalized the concept to any data-dependent analytical choice.

Is p-hacking the same as fraud?

No. Fraud is fabricating or falsifying data. P-hacking is using only standard, accepted analytical techniques but selecting among them based on which produces significant results. The John, Loewenstein & Prelec (2012) survey found that less than 2% of psychologists admitted to outright fraud, but more than 60% admitted to failing to report all dependent measures and more than 55% admitted to deciding whether to collect more data after looking at results. P-hacking is a methodology problem, not a character problem — which is why fixing it requires changing incentives and disclosure norms, not just calling out bad actors.

Why does the 21-word solution work?

The 21-word solution (“We report how we determined our sample size, all data exclusions, all manipulations, and all measures”) works because the entire mechanism of p-hacking depends on selective reporting. If researchers must disclose every measure, every condition, every exclusion, and the sample size rule, the multiple-comparison problem becomes visible to readers and reviewers. Hiding the menu of choices is what makes the menu dangerous. Disclosure does not eliminate the choices — it just lets readers price them in.

Does p-hacking apply to commercial A/B testing?

Yes, and severely. Commercial experimentation contains exactly the same researcher degrees of freedom as academic psychology: choice of primary metric, peeking and continuing, segment analysis, covariate adjustment, condition comparison. The mechanism that inflates false-positive rates is the same. A marketing team that runs an A/B test, peeks daily, segments results by device type, and reports only the segments where the treatment won is exercising the same Simmons-Nelson-Simonsohn flexibilities and getting the same kind of false-positive inflation. The commercial fixes are also the same: pre-specify the primary metric, fix the sample size in advance (or use sequential testing methods designed for peeking), report all segments, and require independent replication on a fresh cohort before treating a finding as ship-ready.

What’s the difference between p-hacking and HARKing?

HARKing — Hypothesizing After Results are Known — is closely related but distinct. P-hacking involves trying multiple analyses on the same data and selecting the significant one. HARKing involves observing a pattern in the data and then writing the paper as if you had predicted that pattern from the start. Both are problematic. HARKing additionally misleads readers about the theoretical strength of the prediction, turning exploratory pattern-finding into faux-confirmatory hypothesis testing. The fix for both is the same: preregistration that separates confirmatory from exploratory analyses.

How did the field respond to Simmons et al. 2011?

Slowly at first, then dramatically. Within five years, Psychological Science adopted preregistration badges, the 21-word disclosure became standard at many journals, the Open Science Framework launched as a preregistration infrastructure, and the OSC 2015 Reproducibility Project empirically confirmed the predictions of Simmons et al. by failing to replicate the majority of tested psychology findings. By 2020, preregistration was standard practice for high-quality empirical work in psychology, and meta-research had become its own subfield. Nelson, Simmons & Simonsohn (2018) document this transition in their Annual Review paper “Psychology’s Renaissance.” The 2011 paper is, retrospectively, the inflection point.

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified behavioral economist (1 of ~1,000 worldwide). 200+ A/B tests across energy, SaaS, fintech, e-commerce, and marketplace verticals.