In January 2011, the Journal of Personality and Social Psychology — the flagship outlet of the entire field — published a paper by Cornell social psychologist Daryl Bem titled “Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect.” The paper reported nine experiments in which college undergraduates appeared to predict random events that had not yet occurred. Eight of the nine experiments produced statistically significant results, with a mean effect size of d = 0.22, comfortably in the range of effects routinely published in social psychology.

Bem was 72 years old, a long-tenured Cornell professor, a past editor of JPSP, and a respected experimental social psychologist. He had used standard methods, standard paradigms, and standard statistical tests. The paper went through normal peer review with four reviewers. The journal’s editor, Charles Judd, published an accompanying editorial explaining the decision and inviting replication.

Then The New York Times ran Benedict Carey’s piece “Journal’s Paper on ESP Expected to Prompt Outrage.” It did. Within a week, three groups had begun the most consequential argument in modern social psychology, and within five years, the methodological foundations of the field would be visibly different.

The interesting thing about Bem’s paper is not whether precognition is real — almost no working psychologist thought it was. The interesting thing is that the methods used to “demonstrate” it were the same methods used to publish thousands of other findings in the field’s top journals. If the standard toolkit could produce significant evidence for time-reversed causation, what did that say about everything else the toolkit had produced?

This article walks through what Bem actually did, what the response actually showed, and why a paper most psychologists found absurd ended up doing more for the credibility of psychological science than most “successful” findings ever have.

What Bem’s 2011 Paper Actually Tested

Bem’s nine experiments were structurally elegant. Each took a well-established psychological paradigm and reversed the time-order of stimulus and response. If a known effect existed in the normal direction — for instance, that emotionally charged words are responded to faster after being primed by a related image — Bem ran the same experiment with the prime presented after the response. The question: would subjects’ behavior still be influenced by the not-yet-shown prime?

Bem grouped the nine experiments into four basic paradigms, the first of which (precognitive approach and avoidance) had two variants:

Precognitive detection of erotic stimuli (Experiment 1). Subjects sat in front of two curtained images on a screen, one of which concealed a randomly selected erotic photograph and the other concealed a neutral image. Subjects had to guess which curtain hid the erotic image. The selection of which side hid the target was made by the computer after the subject’s choice was logged. Bem reported subjects guessed correctly 53.1% of the time when the hidden image was erotic, against the 50% chance baseline (p = .01).

Precognitive avoidance of negative stimuli (Experiment 2). Same paradigm, negative images. The reported effect was smaller and weaker.

Retroactive priming (Experiments 3 and 4). Subjects judged whether a picture was pleasant or unpleasant; after their judgment, a prime word (positive or negative) was flashed on screen. Bem reported that response times were affected by primes that appeared after the response — congruent post-primes sped responses, incongruent post-primes slowed them.

Retroactive habituation (Experiments 5, 6, 7). Subjects rated their preference between two images; afterward, the computer randomly selected one of the two for subliminal repeated exposure. Bem reported subjects preferred, in advance, the image that would later be subliminally exposed — i.e., a retroactive mere-exposure effect.

Retroactive facilitation of recall (Experiments 8 and 9). Subjects studied a list of words and were tested on free recall. After the recall test, the computer randomly selected a subset of the studied words for the subject to retype as a “practice exercise.” Bem reported that words selected for post-test practice were recalled at higher rates during the earlier test than words not selected for practice — i.e., practicing words in the future facilitated remembering them in the past.

The headline numbers: across nine experiments with over 1,000 subjects, eight produced p-values below .05 in the predicted direction. The mean effect size of d ≈ 0.22 was in line with effects routinely published in social and cognitive psychology. Bem reported the analyses with what looked like the field’s standard rigor.

This is the part that mattered. The a priori implausibility was, in a sense, the point. Bem was explicit: he was using the field’s accepted procedures to test something the field expected to be false, and reporting positive results.

Why The Methodology Looked Normal

Reading Bem’s 2011 paper without the controversy attached, you would not immediately notice anything pathological. He used:

  • standard between- and within-subjects designs
  • common reaction-time paradigms drawn from cognitive psychology
  • the same kinds of dependent measures used throughout social psychology
  • standard frequentist statistical tests (t-tests, ANOVAs)
  • p < .05 as the reporting threshold
  • effect sizes (Cohen’s d) in line with the field’s published norms
  • a sample size per experiment (~100) that was, in 2010, typical for social psychology

The peer-review process was normal. JPSP is the highest-impact journal in the discipline. The editorial team did not lower its standards to publish the paper; they applied normal standards and the paper passed them. Editor Charles Judd’s accompanying note made this point directly: the paper had been evaluated using the same criteria applied to all submissions, and either the criteria let it through (in which case the criteria deserved scrutiny) or the effects were real (in which case the field had bigger problems).

That was the trap. The same methodological choices that produced “feeling the future” had produced “elderly priming slows your walk” (Bargh et al., 1996), “power posing raises your testosterone” (Carney et al., 2010), “willpower depletes like a muscle” (Baumeister et al., 1998), and a thousand smaller findings that filled textbooks and TED talks. If the methods could yield positive results for clearly absurd hypotheses, then positive results for plausible hypotheses meant a lot less than the field had assumed.

A handful of methodologists — Eric-Jan Wagenmakers among them — had been arguing this for years and being mostly ignored. Bem’s paper made the argument unignorable.

The Wagenmakers Critique

Eric-Jan Wagenmakers, Ruud Wetzels, Denny Borsboom, and Han van der Maas published their reply in the same issue of JPSP, under the title “Why Psychologists Must Change the Way They Analyze Their Data: The Case of Psi” (2011, DOI: 10.1037/a0022790).

Their critique had two layers.

The Bayesian reanalysis. They reanalyzed Bem’s reported data using default Bayesian t-tests with what they argued was a reasonable prior on effect size. Their conclusion: when the prior probability of psi is treated honestly — i.e., not assumed to be 50/50 in advance — the actual evidence the data provide for the psi hypothesis ranges from “weak” to “non-existent.” Most of Bem’s experiments yielded Bayes factors that favored the null hypothesis or were inconclusive. Only one or two experiments showed evidence above the conventional threshold for “substantial” evidence in favor of psi.

The math here matters less than the conceptual point: a p-value below .05 is a measure of how surprising the data are if the null is true, not a measure of how likely psi is to be real. For a hypothesis that virtually no working scientist gives meaningful prior probability to, the bar for “convincing” evidence should be considerably higher than the bar for “publishable.”
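
The gap between “surprising under the null” and “probably true” can be made concrete with Bayes’ rule in odds form: posterior odds equal prior odds times the Bayes factor. The sketch below is illustrative arithmetic only. The function name, the Bayes factor of 20, and the two priors are assumptions chosen for the example, not figures from Bem or from Wagenmakers and colleagues, whose default Bayesian t-test is considerably more involved.

```python
def posterior_probability(prior: float, bayes_factor: float) -> float:
    """Posterior P(H1) given a prior P(H1) and a Bayes factor for H1 over H0."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

# The same hypothetical evidence (Bayes factor of 20) under two different priors.
for prior in (0.5, 0.001):
    posterior = posterior_probability(prior, 20)
    print(f"prior = {prior}: posterior = {posterior:.3f}")
```

Under these assumed numbers, evidence that moves an even-odds hypothesis to roughly 95% leaves a one-in-a-thousand hypothesis at about 2%.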

The methodological flexibility critique. Wagenmakers and colleagues pointed to a list of practices in Bem’s paper that, individually, looked harmless and were standard in the field — but collectively, allowed many degrees of analytic freedom:

  • Exploratory framed as confirmatory. Bem’s experiments often included multiple dependent measures, multiple groups of subjects, and multiple sub-analyses. Reading the paper, it was difficult to tell which tests had been planned in advance and which had been chosen after the data came in.
  • One-sided p-values. Several of Bem’s tests were reported as one-tailed, which makes .05 easier to clear. That is legitimate only if the direction was specified in advance; for exploratory work, two-tailed is the honest choice.
  • Optional stopping. Sample sizes varied across experiments without clear pre-specification of stopping rules. If a researcher checks results periodically and stops when significance is achieved, the false-positive rate inflates substantially above the nominal 5%.
  • The garden of forking paths. Even with good intentions, researchers can make many small analytical choices — which subjects to exclude, which trials to drop as outliers, which covariates to include — that, in aggregate, hugely expand the chance of finding something significant.
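
The optional-stopping point in the list above is easy to check by simulation. The following is a minimal sketch, not a reconstruction of anything Bem did: it assumes a one-sample z-test with a known standard deviation of 1 (a simplification that keeps the code in the standard library), a true effect of exactly zero, and a hypothetical researcher who peeks after every 10 subjects up to 100. All names and parameter values are illustrative.

```python
import random
from statistics import NormalDist

NORMAL = NormalDist()

def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, assuming a known sd of 1."""
    z = sum(sample) / len(sample) ** 0.5
    return 2 * (1 - NORMAL.cdf(abs(z)))

def false_positive_rate(peek, n_max=100, step=10, sims=2000, seed=1):
    """Share of null datasets declared significant at p < .05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        data = [rng.gauss(0, 1) for _ in range(n_max)]  # the null is true
        # Peeking: test after every `step` subjects and stop at the first p < .05.
        looks = range(step, n_max + 1, step) if peek else [n_max]
        if any(z_test_p(data[:n]) < 0.05 for n in looks):
            hits += 1
    return hits / sims

fixed_rate = false_positive_rate(peek=False)
peeking_rate = false_positive_rate(peek=True)
print(f"fixed n: {fixed_rate:.3f}   with peeking: {peeking_rate:.3f}")
```

Because both runs use the same seed, they see the same datasets, so any gap between the two rates comes from the stopping rule alone.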

Wagenmakers’ argument was not that Bem had committed any single act of fraud. It was that the standard practices of the field permitted enough flexibility that motivated researchers could, in good faith, produce significant evidence for almost any hypothesis. The remedy, they argued, was preregistration: stating in advance what you plan to test and how you plan to analyze it, so that the data analysis is genuinely confirmatory rather than exploratory dressed up as confirmation.

Bem and colleagues replied that Wagenmakers had used an inappropriate prior and that, with what they considered a more realistic prior, the Bayesian evidence supported psi. This response was technically defensible but did not engage with the deeper critique about flexibility and confirmation.

The Replication Crash

Within 18 months, independent teams had run pre-specified direct replications of Bem’s effects, some with samples far larger than the originals. None of them worked.

Galak, LeBoeuf, Nelson, and Simmons (2012, JPSP, DOI: 10.1037/a0029709), “Correcting the Past: Failures to Replicate Psi,” reported seven replication attempts of Bem’s Experiments 8 and 9 (the retroactive recall facilitation paradigm). The combined sample was 3,289 participants — multiples of the original sample size. Across all seven experiments, the average effect size was d = 0.04, statistically indistinguishable from zero. A meta-analysis combining their replications with all known prior attempts (including Bem’s originals) produced an overall effect not different from zero.

Two of the authors, Nelson and Simmons, had already published the “False-Positive Psychology” paper (with Uri Simonsohn) by the time this came out. They knew what they were looking for and they ran the kind of study they had argued the field needed: pre-specified analyses, large samples, transparent reporting. The effect did not survive.

Ritchie, Wiseman, and French (2012, PLOS ONE, DOI: 10.1371/journal.pone.0033423), “Failing the Future: Three Unsuccessful Attempts to Replicate Bem’s ‘Retroactive Facilitation of Recall’ Effect,” reported three independent preregistered replications of the same paradigm. Combined N = 150, combined p = .83 (one-tailed). No support for the effect. Notably, the team had trouble publishing their failed replication at all — several journals, including JPSP itself, declined the paper before PLOS ONE accepted it. That was a separate scandal that the field then had to confront: failed replications of high-profile findings were difficult to publish, which created exactly the publication bias that inflated published effect sizes in the first place.

By 2013, the working consensus among methodologically literate psychologists was that Bem’s effects were artifacts of researcher degrees of freedom, not evidence of psi. The interesting question was what to do about the methods.

Bem’s Own 2015 Meta-Analysis Defense

Bem did not concede. In 2015, he and collaborators Patrizio Tressoldi, Thomas Rabeyron, and Michael Duggan published “Feeling the Future: A Meta-Analysis of 90 Experiments on the Anomalous Anticipation of Random Future Events” in F1000Research (DOI: 10.12688/f1000research.7177.2). The paper claimed to aggregate 90 experiments from 33 laboratories across 14 countries, reporting an overall Hedges’ g of 0.09, z = 6.40, p = 1.2 × 10⁻¹⁰, and a Bayes Factor of 5.1 × 10⁹ — “decisive evidence” in Bayesian terms.

On its face, this was a strong defense. A meta-analytic effect that significant, drawn from independent labs, would be hard to dismiss.

Daniel Lakens’ detailed critique (published on his blog and widely cited) identified the issues. The meta-analysis suffered from the same problem that originally motivated the field’s reform: it could not distinguish a real-but-small effect from a small effect produced by publication bias and selective reporting. P-curve analyses — which assess whether the distribution of significant p-values in a literature is consistent with the existence of a real effect versus selective reporting — were mixed, and the estimated true effect size of 0.20–0.24 in p-curve analyses was suspiciously similar to the effect size Bem originally reported. That kind of consistency is not what a clean signal in noisy data tends to look like.
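
The intuition behind p-curve can be sketched in a few lines. Under a true null, significant p-values are spread evenly between 0 and .05, so about half fall below .025; under a real effect they pile up near zero. The simulation below is only that intuition, not the formal p-curve procedure or Lakens’ analysis. It assumes each study’s test statistic is a z-score with unit variance, and the value 2.8 (roughly 80% power for a two-sided test) is an illustrative choice.

```python
import random
from statistics import NormalDist

def share_below_025(true_z, sims=40000, seed=3):
    """Among significant results (two-sided p < .05), the share with p < .025."""
    rng = random.Random(seed)
    normal = NormalDist()
    significant = below = 0
    for _ in range(sims):
        z = rng.gauss(true_z, 1)              # test statistic from one simulated study
        p = 2 * (1 - normal.cdf(abs(z)))
        if p < 0.05:
            significant += 1
            below += p < 0.025
    return below / significant

null_share = share_below_025(true_z=0.0)      # no real effect
effect_share = share_below_025(true_z=2.8)    # real effect, about 80% power
print(f"null: {null_share:.2f}   real effect: {effect_share:.2f}")
```

A literature built on selective reporting of null effects should look like the first number; a literature built on a real effect should look like the second. Mixed results sit uncomfortably in between.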

More importantly, the meta-analysis did not include the high-powered preregistered failures from Galak et al. or comparable null work as a counterweight to lab-affiliated psi research. The lab populations represented in the 90 experiments skewed toward researchers who had reasons to find effects, and the selection criteria for inclusion were, in places, opaque. A meta-analysis that aggregates flexibility-prone studies will still reflect the flexibility, even when each individual study looks clean.

The 2015 defense did not change the field’s working view. By that time, the methodological reform agenda had taken on a life independent of the psi question.

The Bigger Consequence: False-Positive Psychology

The same week Bem’s paper hit the press, Joseph Simmons, Leif Nelson, and Uri Simonsohn were finishing a paper that would do more for the field than anything else published that decade.

“False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant,” published in Psychological Science later in 2011 (DOI: 10.1177/0956797611417632), did not mention Bem directly. It didn’t need to. The paper used a deliberately absurd example: it reported an experiment in which subjects who listened to The Beatles’ “When I’m Sixty-Four” became chronologically younger than subjects who listened to a control song. They demonstrated this with p < .05 using realistic-looking analytical choices.

The point: researcher degrees of freedom (flexibility in choosing dependent variables, choosing sample sizes, using covariates, reporting subsets of conditions) could inflate the false-positive rate from the nominal 5% to over 60%. In their simulations, each apparently reasonable analytic decision raised the false-positive rate only modestly on its own, but four of them combined were enough to make almost any hypothesis presentable as “significant.”
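
One of those degrees of freedom, choosing among dependent variables, can be sketched directly. This is a simplified illustration in the spirit of the paper’s simulations, not a reproduction of them: it assumes two groups of 20, two outcome measures correlated at 0.5, no true effect, and z-tests with a known standard deviation of 1 in place of t-tests. The “flexible” analyst reports whichever of three tests (measure A, measure B, or their average) comes out best.

```python
import random
from statistics import NormalDist

NORMAL = NormalDist()

def two_sided_p(z):
    return 2 * (1 - NORMAL.cdf(abs(z)))

def group_means(rng, n, rho):
    """Group means of two outcome measures with correlation rho, under the null."""
    total_a = total_b = 0.0
    for _ in range(n):
        a = rng.gauss(0, 1)
        b = rho * a + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        total_a += a
        total_b += b
    return total_a / n, total_b / n

def false_positive_rates(sims=4000, n=20, rho=0.5, seed=7):
    rng = random.Random(seed)
    se = (2 / n) ** 0.5                       # SE of a mean difference when sd = 1
    se_avg = se * ((1 + rho) / 2) ** 0.5      # SE when the two measures are averaged
    planned = flexible = 0
    for _ in range(sims):
        a1, b1 = group_means(rng, n, rho)
        a2, b2 = group_means(rng, n, rho)
        p_a = two_sided_p((a1 - a2) / se)
        p_b = two_sided_p((b1 - b2) / se)
        p_avg = two_sided_p(((a1 + b1) / 2 - (a2 + b2) / 2) / se_avg)
        planned += p_a < 0.05                       # one test, chosen in advance
        flexible += min(p_a, p_b, p_avg) < 0.05     # best of three, chosen afterward
    return planned / sims, flexible / sims

planned_rate, flexible_rate = false_positive_rates()
print(f"planned: {planned_rate:.3f}   flexible: {flexible_rate:.3f}")
```

A single extra choice like this does not reach 60% by itself; the inflation compounds when it is stacked with optional stopping, covariates, and dropped conditions.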

The paper closed with six concrete requirements for authors and four guidelines for reviewers: disclose stopping rules, disclose all conditions, disclose all measures, disclose data exclusions, and so on. These became the foundation of what is now standard methodological practice in better journals: preregistration, transparent reporting, sample-size pre-specification, multiverse analyses.

The historical accident of timing matters. The Bem paper made the urgency of the False-Positive Psychology paper visible to anyone paying attention. If you could explain how Bem’s methods produced “evidence” for psi using normal lab practice, you had explained how thousands of other published findings might have been produced. That changed what the methodological community was arguing against — not a fringe paper, but business as usual.

By 2015, the Open Science Collaboration’s “Reproducibility Project: Psychology” had completed direct replications of 100 published studies. About 36% of the replications reached statistical significance, versus 97% of the originals. The field had its empirical confirmation. The Bem paper had triggered the reform; the Reproducibility Project quantified the damage.

What’s Honest To Say About Psi And About Methods Now

Two distinct conclusions, often conflated, follow from this episode.

On psi: preregistered high-powered replications of Bem’s effects have failed. The most rigorous tests of the precognition hypothesis have not produced reliable evidence. The Bayesian reanalysis of Bem’s own data, with realistic priors, did not support the existence of the effect. The 2015 meta-analytic defense has the same structural weaknesses as the original literature. The most parsimonious explanation of Bem’s results is that researcher degrees of freedom, multiple testing, optional stopping, and the garden of forking paths produced spurious significance in good-faith experiments. Psi, as an empirical hypothesis, remains essentially unfalsifiable — every failure can be attributed to experimenter “psi-suppression,” every success to “psi-facilitation” — which is itself a methodological tell.

On the field’s methods: the more important lesson is not about ESP. It is that a respected researcher, using standard methods, with standard rigor, in the field’s top journal, produced significant evidence for a hypothesis everyone agreed was almost certainly false. That outcome was diagnostic. It said that the standard methods could produce significant evidence for almost any hypothesis a motivated researcher chose to test. The fact that “feeling the future” got published was not a scandal about Bem; it was a scandal about the methods. The fix was not to remove Bem from publication — it was to fix the methods so that the next Bem-style paper would have to clear a higher bar.

That fix happened, partially. Preregistration is now standard in the better journals. Sample sizes have grown. Replication is no longer career suicide. Transparency norms — data sharing, materials sharing, analysis-script sharing — have improved. The reform is incomplete and unevenly distributed across subfields, but it is real.

What This Means For Strategists

You don’t run experiments on undergraduates. You make decisions based on findings other people ran. The Bem episode is a useful lens for evaluating any behavioral-science claim that arrives at your desk.

If the methodology could plausibly have produced significant evidence for a clearly absurd hypothesis, treat the result with caution. This is the Bem test. Before you act on a claim like “subjects who saw the warm color converted 20% better than subjects who saw the cool color,” ask: could the same methods, applied to a hypothesis you find ridiculous, also have produced a publishable positive result? If the answer is yes — because the sample was small, the analysis was post-hoc, the comparisons were many — the result is not telling you what it claims to tell you.

Preregistration matters because it removes the analytical flexibility. A study where the hypothesis and analysis plan were specified in advance, and the data analysis followed the plan, is structurally different from a study where the analysis was chosen after the data came in. Look for “preregistered,” “registered report,” or explicit OSF links. Their absence is not proof of error, but their presence is meaningful.

Multi-lab replications are the strongest evidence. A single significant result, even in a top journal, is a hypothesis. A finding that replicates across multiple independent labs with preregistered protocols and adequate sample sizes is something you can act on. The asymmetry of evidence here is large — one well-powered replication failure is worth a dozen single-lab originals.

Sample size is not optional. Bem’s experiments averaged about 100 subjects per study. Galak’s seven replications used 3,289 in total. The first number is the field’s old norm. The second is closer to what it takes to actually test a small effect. If a result is reported with n < 200 per cell and an effect size of d < 0.2, the study was underpowered to detect what it claims to have detected, and underpowered studies that produce significant results are exactly the studies most likely to be inflating their effect sizes.
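
The “per cell” rule of thumb can be checked with the standard normal-approximation formula for a two-group comparison, n per group ≈ 2(z₁₋α/₂ + z_power)² / d². The sketch below assumes a two-sided test at α = .05 and a simple two-cell design; the function names are illustrative, and an exact t-based calculation would give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist

NORMAL = NormalDist()

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate subjects per cell needed to detect effect size d."""
    z_alpha = NORMAL.inv_cdf(1 - alpha / 2)
    z_power = NORMAL.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)

def approx_power(d, n, alpha=0.05):
    """Approximate power of a two-group test with n subjects per cell."""
    z_alpha = NORMAL.inv_cdf(1 - alpha / 2)
    return NORMAL.cdf(d * (n / 2) ** 0.5 - z_alpha)

print(n_per_group(0.2))                   # 393 per cell under this approximation
print(round(approx_power(0.2, 100), 2))   # power with 100 per cell
```

Under these assumptions, a d = 0.2 effect needs close to 400 subjects per cell for 80% power, and 100 per cell gives roughly a 30% chance of detecting it.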

Effect sizes drift downward when methods improve. This is the most important regularity in the post-2011 literature. Effects measured in better-controlled, larger-sample, preregistered studies are systematically smaller than the originals. Sometimes they survive but with effect sizes 30-60% smaller. Sometimes they vanish. If a behavioral-science claim is being sold to you with effect sizes from the original 1990s–2000s literature, the real-world effect is almost certainly smaller, possibly much smaller.
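
The downward drift follows mechanically from selective publication. A minimal sketch, under assumed numbers: a true effect of d = 0.2, 50 subjects per group, a known standard deviation of 1, and a literature that publishes only results that are significant in the predicted direction. Observed effects are drawn directly from their sampling distribution rather than from simulated raw data, and every parameter here is hypothetical.

```python
import random

def simulate_literature(true_d=0.2, n_per_group=50, studies=5000, seed=11):
    """Mean observed effect across all studies vs. across 'published' ones."""
    rng = random.Random(seed)
    se = (2 / n_per_group) ** 0.5            # SE of the observed d when sd = 1
    observed = [rng.gauss(true_d, se) for _ in range(studies)]
    # Publication filter: significant (z > 1.96) in the predicted direction only.
    published = [d for d in observed if d / se > 1.96]
    return sum(observed) / len(observed), sum(published) / len(published)

all_mean, published_mean = simulate_literature()
print(f"all studies: {all_mean:.2f}   published only: {published_mean:.2f}")
```

With these settings a study can only clear the significance filter if its observed effect is nearly double the true one, so the published average is inflated by construction. A large preregistered replication has no such filter and lands closer to the true value.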

The “absurd hypothesis” diagnostic is portable. When you read a marketing case study, a UX research report, a vendor pitch deck citing behavioral science, ask the diagnostic question: if the experimenter had been testing the opposite hypothesis, would the same data have supported that conclusion too? If yes, you are looking at narrative dressed as evidence, not at evidence.

Bem’s paper is, in its strange way, the most useful single artifact in the modern history of behavioral science. It is the cleanest possible demonstration that the methods of the field, used in good faith, could produce significant evidence for a hypothesis that was almost certainly false. Every behavioral-science claim made to a strategist or operator since 2011 should be evaluated against that demonstration.

The cost of taking this seriously is that you will discount many published findings you previously trusted. The benefit is that you stop spending money and credibility on effects that will not show up when you actually try to use them.

Sources

  • Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407-425. DOI: 10.1037/a0021524
  • Wagenmakers, E. J., Wetzels, R., Borsboom, D., & van der Maas, H. L. (2011). Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100(3), 426-432. DOI: 10.1037/a0022790
  • Galak, J., LeBoeuf, R. A., Nelson, L. D., & Simmons, J. P. (2012). Correcting the past: Failures to replicate psi. Journal of Personality and Social Psychology, 103(6), 933-948. DOI: 10.1037/a0029709
  • Ritchie, S. J., Wiseman, R., & French, C. C. (2012). Failing the future: Three unsuccessful attempts to replicate Bem’s ‘Retroactive Facilitation of Recall’ effect. PLOS ONE, 7(3), e33423. DOI: 10.1371/journal.pone.0033423
  • Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366. DOI: 10.1177/0956797611417632
  • Bem, D., Tressoldi, P., Rabeyron, T., & Duggan, M. (2015). Feeling the future: A meta-analysis of 90 experiments on the anomalous anticipation of random future events. F1000Research, 4, 1188. DOI: 10.12688/f1000research.7177.2
  • Carey, B. (2011, January 5). Journal’s paper on ESP expected to prompt outrage. The New York Times. Article archived at NYTimes.com
  • Lakens, D. (2015). Why a meta-analysis of 90 precognition studies does not provide convincing evidence of a true effect. The 20% Statistician. daniellakens.blogspot.com
  • Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. DOI: 10.1126/science.aac4716

FAQ

Does this mean ESP is real?

No. The most rigorous preregistered replications of Bem’s effects, run with sample sizes many times larger than the originals, have not produced reliable evidence for precognition. Bayesian reanalysis of Bem’s own data with realistic priors does not support the existence of the effect. The most parsimonious explanation of the original positive results is researcher degrees of freedom and selective analysis, not psi. The 2015 meta-analytic defense suffers from the same selection effects as the original literature.

Is statistics broken?

The statistical tools are not broken; the way they were routinely used was. A p-value below .05 measures how surprising the data are if the null hypothesis is true — it does not measure how likely the alternative hypothesis is to be correct, especially for hypotheses with very low prior probability. Combined with flexibility in data collection (optional stopping), analysis (multiple comparisons, choice of covariates, exclusion rules), and reporting (publishing only significant results), the practical false-positive rate in the field was far higher than the nominal 5%. The fix is not to throw out statistics; it is to constrain the flexibility through preregistration and transparent reporting.

What is preregistration?

Preregistration means stating, in writing and timestamped before data collection begins, what hypothesis you will test, how many subjects you will run, what your dependent measure is, how you will analyze the data, what counts as a positive result, and what exclusion rules you will apply. After the data are collected, your analysis must follow the registered plan. Any deviation must be flagged as exploratory. The Open Science Framework (osf.io) hosts most preregistrations and makes them publicly accessible. A “registered report” goes further: peer review of the design happens before data collection, and the journal commits to publish regardless of whether the result is positive or null.

What should I trust now in behavioral science?

Findings with multiple preregistered direct replications across independent labs. Effects with consistent results in studies that used adequate sample sizes (typically 1,000+ for psychological effects in the d = 0.2 range). Findings published as registered reports. Results that show up in well-controlled field experiments, not just lab studies. Effects that survive when you constrain analytical flexibility. Be skeptical of: single-study findings, especially with n < 200; effects that vary substantially across labs; results with large effect sizes (d > 0.5) in domains where such effects would be implausible; findings that are widely cited but rarely directly tested.

Was Bem committing fraud?

No. Multiple investigators who have engaged with Bem’s work have concluded that he was not faking data. He used standard methods for the era. His analyses, while flexible in ways that inflated false-positive rates, were not unusual by 2010 standards. The point of the episode is exactly that you don’t have to commit fraud to produce a misleading result — the field’s normal practices were enough to do it. That is what made the episode consequential. If Bem had been a fraudster, the field could have written off the paper as one bad actor. Because he wasn’t, the field had to look at its own habits.

Why did the journal publish the paper?

JPSP editor Charles Judd’s accompanying editorial addressed this directly. The paper had been peer-reviewed by four reviewers, judged by standard journal criteria, and met those criteria. The editorial team decided that rejecting the paper because the conclusion was implausible would amount to using prior beliefs to override evaluation of the methods — a worse precedent than publishing it. Judd’s reasoning was: if our standard review process passes a paper claiming psi, the appropriate response is to interrogate our standard review process, not to suppress the paper. In retrospect this looks like exactly the right call, even though it was uncomfortable in the moment.

Why did failed replications struggle to get published?

This was a second-order scandal. When Ritchie, Wiseman, and French submitted their failed replication of Bem’s effect to JPSP, the journal declined on the grounds that it did not publish “mere replications.” This was standard practice across high-impact psychology journals at the time: original findings were publishable, failed replications were not. The result was a publication-bias engine that systematically inflated the effect sizes in the published literature, because positive results survived selection and negative results did not. The post-2011 reform agenda included pressuring journals to publish high-quality replications regardless of outcome, and several journals — including Psychological Science and Perspectives on Psychological Science — substantially changed their policies.

If standard methods produce false positives this easily, why is any behavioral science result trustworthy?

Some are. The findings most likely to be real are those that have been directly replicated in preregistered studies across multiple labs with large samples and have survived when analytical flexibility was constrained. Examples include certain core findings in cognitive psychology (working memory limits, certain reaction-time effects), some well-tested findings in judgment and decision-making (loss aversion in some contexts, anchoring effects with clear procedures), and effects that hold up in field experiments with consequential outcomes. The post-replication-crisis literature is, on average, more conservative in its claims and more empirically grounded than the pre-2011 literature. The pre-2011 literature should be read with skepticism unless specific findings have been re-tested under modern standards. That is the practical legacy of Bem’s paper.

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified behavioral economist (1 of ~1,000 worldwide). 200+ A/B tests across energy, SaaS, fintech, e-commerce, and marketplace verticals.