The Reproducibility Project: Psychology (OSC 2015) — When The Field Replicated Itself And Found 39%

Atticus Li

In August 2015, 270 researchers published the largest direct-replication project in the history of psychology. They tried to reproduce 100 prominent studies. 39% replicated. Here is what the Open Science Collaboration actually did, what its critics got right and wrong, and how the field changed afterward.

By Atticus Li · May 25, 2026

In August 2015, the journal Science published a paper with 270 authors and a single, brutal number: 39%.

That was the share of 100 high-profile psychology studies, drawn from three of the field’s leading journals, that the Open Science Collaboration’s replicating teams judged to have reproduced the original result, using the same materials wherever possible, the same procedures, and samples as large as or larger than the originals. The remaining 61% produced effects in the wrong direction, effects too small to reach statistical significance, or effects so attenuated they no longer told the same story.

Across the replications as a whole, the average effect size came in at roughly half the magnitude reported in the original papers.

The Open Science Collaboration’s “Estimating the Reproducibility of Psychological Science” was the largest direct-replication project ever attempted in the social sciences. It was also the first time a research community had voluntarily, transparently, and at scale put its own published record on trial.

The result was not a quiet correction. It was a permission slip. Within five years of OSC 2015, preregistration, registered reports, multi-lab Many Labs efforts, and an entire generation of open-data norms would harden into infrastructure — much of which is now baseline expectation in any well-run psychology lab.

This article walks through what OSC 2015 actually measured, what its headline number really means, why Gilbert and colleagues’ high-profile critique mattered less than it appeared to, and how a strategist or executive should discount the published psychology literature in light of what the field’s own attempt at self-audit revealed.

What The OSC Actually Did

The Open Science Collaboration was an organizational structure as much as a study. It was coordinated through the Center for Open Science (Brian Nosek’s nonprofit, founded in 2013 in Charlottesville, Virginia), and it pooled the labor of hundreds of researchers across more than a dozen countries.

The methodology was deliberately conservative — designed less to maximize replication success and more to be defensible against the inevitable critiques.

Study selection. The team sampled 100 studies from three journals: Psychological Science, the Journal of Personality and Social Psychology, and the Journal of Experimental Psychology: Learning, Memory, and Cognition. All three are flagship outlets. All three are competitive. The papers were drawn from 2008 issues, which gave the original work several years to be cited, discussed, and, if anyone wanted to, replicated independently.

The selection was not random in a strict sense. The OSC used a structured process: studies had to be feasible to replicate (no field studies requiring access to specific populations or expensive longitudinal data), the lead replicating team had to be willing to take it on, and the original authors had to be contactable. Within those constraints, the project tried to cover the breadth of what the journals publish.

Materials and protocol. Wherever possible, replicating teams obtained the original materials directly from the authors. Each replication protocol was written up in advance, sent back to the original authors for review, and revised based on the authors’ feedback before data collection began. This is the part that often gets lost in summary write-ups: original authors had real input into the protocol, even though not every protocol ended up with their formal endorsement. If they said “you can’t run this in undergraduates, it needs to be online workers,” the OSC accommodated. If they said “the manipulation check has to be the same,” it was the same.

Sample sizes. Replications were powered at 92% on average to detect the original effect, substantially higher than the original studies, which had a mean power of around 50% to detect their own reported effects. In other words, the replicating teams typically ran samples at least as large as the originals, and usually larger.
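To make the power figures concrete, here is a minimal sketch of the sample size needed to detect a correlation at 92% power, using the standard Fisher z approximation. This is an illustration, not the OSC's own procedure: the replicating teams ran their power analyses study by study, and the function names `n_for_power` and `power_for_n` are hypothetical.

```python
from math import atanh, ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def n_for_power(r: float, power: float = 0.92, alpha: float = 0.05) -> int:
    """Approximate N needed to detect correlation r in a two-sided test (Fisher z)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(((z_alpha + z_power) / atanh(r)) ** 2 + 3)

def power_for_n(r: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power for a given N (ignores the negligible wrong-direction tail)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = atanh(r) * sqrt(n - 3)
    return 1 - STD_NORMAL.cdf(z_alpha - shift)

# Illustrative inputs: r = 0.40 is an effect of typical original size,
# r = 0.20 is what remains if the true effect is only half as large.
n_original_sized = n_for_power(0.40)
n_half_sized = n_for_power(0.20)
print(n_original_sized, n_half_sized)
print(round(power_for_n(0.20, n_original_sized), 2))
```

The design point is that a replication powered to detect the reported effect can still be badly underpowered if the true effect is half that size, because the required sample grows roughly fourfold when the effect is halved.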

Pre-registration. Every replication protocol was registered on the Open Science Framework before data collection began. The analyses were specified in advance. The success criteria were specified in advance. There was no flexibility for the replicators to slice the data multiple ways and report the slice that worked.

Multiple success metrics. Because no single number tells the whole story, the OSC reported success across five different criteria: statistical significance (p < .05 in the original direction), whether the original effect size fell inside the replication’s 95% confidence interval, effect-size comparison, meta-analytic combination of the original and replication results, and subjective assessment by the replicating team. The infamous 39% is the last of those: the share of effects the replicating teams judged to have replicated. The p < .05 criterion gave a slightly lower figure.

This matters because critics later focused heavily on the p < .05 criterion, but the OSC was explicit from the start that this was one lens among five.

The Headline Findings

The numbers, in the order people most often cite them:

35 of 97 effects (36%) reached p < .05 in the original direction. Three of the 100 original studies had not reported a statistically significant result in the first place, leaving 97 in the denominator. (The often-cited “39%” is a different metric. The OSC’s own abstract reports both that “39% of effects were subjectively rated to have replicated the original result” and a separate statistical-significance figure of 36%. The lesson is to read the original table, not the round number.)

Mean effect size in the replication was roughly half the original. Of the 97 studies, the replication effect size averaged r ≈ 0.20, versus the original mean of r ≈ 0.40. This was the more important finding from a scientific standpoint. Even where a replication “worked” by the p-value criterion, the effect was usually substantially smaller than the literature had recorded.

Only 47% of the original effect sizes fell inside the 95% confidence interval of the corresponding replication. If the original and the replication were two equally precise, unbiased estimates of the same true effect, one estimate should land inside the other’s 95% interval about 83% of the time. The observed overlap was little more than half of that.
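The ~83% benchmark follows from a short calculation. The sketch below assumes two unbiased, normally distributed estimates of the same true effect; `capture_probability` is a hypothetical helper name, and the equal-precision case is an idealization (the OSC replications were usually larger than the originals, which changes the exact benchmark).

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def capture_probability(se_interval: float, se_other: float, level: float = 0.95) -> float:
    """P(an estimate with standard error se_other falls inside the confidence
    interval built around an independent estimate with standard error
    se_interval), when both estimate the same true effect without bias."""
    crit = STD_NORMAL.inv_cdf(0.5 + level / 2)
    # The difference of the two estimates has variance se_interval^2 + se_other^2.
    ratio = crit * se_interval / sqrt(se_interval**2 + se_other**2)
    return 2 * STD_NORMAL.cdf(ratio) - 1

equal_precision = capture_probability(1.0, 1.0)
print(f"{equal_precision:.1%}")  # about 83%
```

The figure is below 95% because both estimates carry sampling error: the interval is wide enough for one noisy estimate of the truth, not for the gap between two noisy estimates.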

Original p-values closer to .05 replicated worse. The studies whose originals had p-values in the .04 to .05 range replicated at far lower rates than those with p < .01. This is exactly what you would predict if a significant fraction of the original literature consists of borderline findings that crossed the threshold by chance or by light analytic flexibility.
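A back-of-the-envelope calculation shows why borderline originals are fragile even before any bias enters. The sketch below assumes, generously, that the true effect equals the original estimate and that the replication uses the same sample size; `same_size_replication_power` is a hypothetical name, and the normal approximation is a simplification.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def same_size_replication_power(original_p: float, alpha: float = 0.05) -> float:
    """Chance that an identical-size replication reaches significance in the
    same direction, assuming the true effect equals the original estimate."""
    z_observed = STD_NORMAL.inv_cdf(1 - original_p / 2)  # two-sided p to |z|
    z_critical = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 1 - STD_NORMAL.cdf(z_critical - z_observed)

for p in (0.05, 0.04, 0.01, 0.001):
    print(f"original p = {p}: replication power = {same_size_replication_power(p):.2f}")
```

An original sitting exactly at p = .05 gives a same-size replication only a coin-flip chance of success under these assumptions. Because published estimates near the threshold tend to overstate the true effect, the realistic figure is lower still.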

Replication success correlated with prediction-market accuracy. Researchers who bet, in a small prediction market, on which studies would replicate were modestly accurate. The field’s collective intuition about which findings were robust was not random — psychologists already had some sense, before data collection, of which famous findings were on shaky ground.

The headline that traveled was “39% of psychology studies replicate.” The more accurate headline is: across multiple metrics, somewhere between a third and two-thirds of the high-profile psychology literature from 2008 either failed to replicate, replicated at a much smaller effect size, or produced confidence intervals that retrospectively look wildly overconfident.

Social Psychology vs Cognitive Psychology

One of the most consequential subgroup findings was the gap between subfields.

The OSC reported a clear difference:

Cognitive psychology studies replicated at roughly 50%.
Social psychology studies replicated at roughly 25%.
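As a rough check on how much weight the subfield gap can bear, the sketch below puts Wilson confidence intervals around each rate and runs a two-proportion z-test. The counts (21 of 42 cognitive, 14 of 55 social) are taken from the OSC paper's significance tally rather than from the text above, and the helper names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)

def wilson_interval(successes: int, n: int, z: float = Z95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """Pooled z statistic for the difference between two proportions."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / se

# Assumed counts: significant replications out of significant originals.
COUNTS = {"cognitive": (21, 42), "social": (14, 55)}

for name, (k, n) in COUNTS.items():
    low, high = wilson_interval(k, n)
    print(f"{name}: {k}/{n} = {k / n:.0%}, 95% CI {low:.0%} to {high:.0%}")

z_gap = two_proportion_z(21, 42, 14, 55)
print(f"z for the subfield gap: {z_gap:.2f}")
```

The intervals are wide, as expected with roughly 50 studies per subfield, but the gap itself is larger than sampling noise alone would comfortably explain.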

The cognitive-psychology studies came predominantly from Journal of Experimental Psychology: Learning, Memory, and Cognition — a journal whose papers tend to feature larger sample sizes, simpler manipulations, within-subject designs, response-time measures with thousands of trials per subject, and well-characterized paradigms.

The social-psychology studies came from JPSP and the social sections of Psychological Science — a literature dominated by between-subject designs, small samples (often n < 50 per cell), single behavioral measures, manipulations that depend on the cover story working, and effects whose theoretical interpretation often runs ahead of the data.

This gap was not novel. Cognitive psychologists had been quietly complaining about social-psychology methods for years before OSC 2015. The OSC just put a number on it. The number was that the average social-psychology finding from a top journal in 2008 had roughly a 1-in-4 chance of being directly reproducible with the original materials, in a protocol the original authors had vetted, by a team powered to detect the original effect.

It is hard to overstate how much this changed internal field dynamics. The “priming” subliterature in social psychology — elderly priming, money priming, professorial priming, and the larger story that subtle environmental cues drive significant downstream behavior — was particularly hard-hit in the years that followed.

The Gilbert Critique

In March 2016, Daniel Gilbert (Harvard), Gary King (Harvard), Stephen Pettigrew (Harvard), and Tim Wilson (Virginia) published a Comment in Science arguing that the OSC’s headline conclusion was wrong. Their paper, “Comment on ‘Estimating the reproducibility of psychological science,’” appeared in Science Vol. 351, No. 6277, p. 1037, with the DOI 10.1126/science.aad7243.

The argument had three main thrusts.

One: protocol infidelity. Gilbert et al. argued that many of the replications used materials or populations that differed materially from the originals — different countries, different demographics, different versions of stimuli. They presented examples where replicating teams had run experiments on populations the original effect should not be expected to generalize to, or used translated materials whose equivalence to the originals was unverified. On a strict reading of “direct replication,” they argued, many of the OSC’s attempts were not direct replications at all.

Two: statistical underestimation of agreement. Gilbert et al. argued that the probability of two studies both reaching p < .05 in the same direction, even when both are sampling from the exact same true effect, is mathematically limited. They proposed alternative statistical frameworks under which the OSC’s results were consistent with a true replication rate substantially higher than 39%.

Three: cherry-picked endorsement. They noted that the OSC had asked original authors to “endorse” the replication protocols, and showed that the studies whose authors had endorsed the replication protocols replicated at substantially higher rates than studies whose authors had not endorsed. From this, they argued the unendorsed protocols were systematically lower-quality, and that the “true” reproducibility rate should be calculated only over endorsed studies — yielding a much higher number.

The Comment was high-profile, made the rounds in the popular press, and provided cover for researchers who wanted to dismiss the OSC results as overhyped.

The Anderson Response

The OSC team, led by Christopher Anderson, replied in the same issue of Science — “Response to Comment on ‘Estimating the reproducibility of psychological science,’” Vol. 351, No. 6277, p. 1037, DOI 10.1126/science.aad9163.

The response, in roughly the same number of words, dismantled the Gilbert et al. argument point by point.

On protocol infidelity: the OSC noted that protocols had been sent to original authors in advance for review, and that any “infidelity” was, in many cases, the original authors’ own approved version. The examples Gilbert et al. raised were often cases where the original author had reviewed and accepted the protocol modifications. Further, when the OSC re-analyzed restricting only to protocols rated highly faithful by independent raters, the replication rate barely moved — the headline 39% was not driven by sloppy replications of robust effects.

On statistical underestimation: the OSC pointed out that Gilbert et al.’s alternative statistical framework assumed the originals’ point estimates were unbiased — but the whole question of the replication crisis is whether the originals were biased upward by publication bias, p-hacking, and small-sample volatility. If you assume the originals are unbiased to argue the originals replicate, you have assumed your conclusion.
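The circularity is easy to see in a toy simulation. The sketch below is an illustration under stated assumptions, not a model of the OSC data: every study estimates the same modest true effect (one standard error in size), only results significant in the predicted direction get "published," and each published study is then replicated once at the same precision. The function name and all parameters are hypothetical.

```python
import random
from statistics import NormalDist, mean

Z_CRIT = NormalDist().inv_cdf(0.975)

def simulate_published_effects(true_z: float, n_studies: int, seed: int = 0):
    """Return (published original z-scores, their replication z-scores)."""
    rng = random.Random(seed)
    originals = [rng.gauss(true_z, 1.0) for _ in range(n_studies)]
    # Publication filter: only significant results in the predicted direction.
    published = [z for z in originals if z > Z_CRIT]
    # Replications draw from the same true effect, with no filter applied.
    replications = [rng.gauss(true_z, 1.0) for _ in published]
    return published, replications

TRUE_Z = 1.0  # assumed true effect, in standard-error units
published, replications = simulate_published_effects(TRUE_Z, 20_000)
replication_rate = sum(z > Z_CRIT for z in replications) / len(replications)

print(f"true effect:            {TRUE_Z:.2f}")
print(f"mean published effect:  {mean(published):.2f}")
print(f"mean replication:       {mean(replications):.2f}")
print(f"replication rate:       {replication_rate:.0%}")
```

Nothing in this setup involves fraud or a sloppy replication, yet the published estimates sit well above the truth and most replications "fail." Treating the published point estimates as unbiased would misread that pattern as a problem with the replications.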

On endorsement: the response noted that endorsement was not random. Studies whose authors were available, responsive, and confident in their effects were more likely to engage with the replication team. Studies whose authors had moved on, retired, or quietly suspected their original effects were not robust were less likely to endorse. Selecting on endorsement does not measure the reproducibility of the literature — it measures the reproducibility of the literature that authors think is robust enough to bother defending.

The technical exchange was, to most methodologists who followed it, a fairly clean win for the OSC. The Gilbert et al. critique made some valid local points — some specific replications were indeed weak — but its global claim that the OSC underestimated reproducibility did not survive the exchange.

The public-perception fight was murkier. The Gilbert critique gave a lot of researchers permission to dismiss the broader reproducibility findings as “controversial” or “contested.” In some quarters that framing stuck. But in the operational sense — what funders, journals, and methodologists actually did over the next five years — the field reformed as if the OSC was right.

What Changed After 2015

The most important consequence of OSC 2015 was not the number itself. It was the infrastructure that crystallized around accepting it.

Pre-registration as default. By 2020, preregistration had moved from a niche practice to a standard expectation in many psychology subfields. Journals introduced badges. Funders started asking. Cohorts of graduate students were trained in preregistration the way prior cohorts had been trained in APA formatting.

Registered reports. Nosek and Lakens’ 2014 Social Psychology paper “Registered reports: A method to increase the credibility of published results” (Vol. 45, Issue 3, pp. 137–141) laid out the registered-report model: peer review of methods before data collection, with publication contingent on executing the agreed protocol rather than on finding significant results. By the early 2020s, more than 300 journals offered registered reports as a submission format. Within the registered-report literature, “positive result” rates dropped from the >90% typical of conventional publication to roughly 40–50%, closer to what you would expect from a literature where the data, not the analyst’s discretion, drives the conclusion.

Many Labs and large-scale replication. The OSC sat inside a wave of coordinated efforts: the Many Labs series (the first of which appeared in 2014, with Many Labs 2 through 5 following); the Reproducibility Project: Cancer Biology; the Pipeline Project; ManyBabies; and a series of pre-registered, multi-lab collaborations that systematically tested specific theoretical claims at scale. Ego depletion, social priming, the facial-feedback hypothesis, and others were subjected to coordinated replication efforts. The headline result of most of these was the same as OSC 2015: original effects, retested at scale with preregistration, were typically much smaller than the literature had recorded, and often not distinguishable from zero.

Data and materials sharing. Open data requirements became standard at journals like Psychological Science, Cognition, and several APA outlets. The Open Science Framework hosts millions of files; in 2008, no analogous infrastructure existed.

Statistical reform. The use of confidence intervals over point-and-p-value reporting expanded. Power analyses, once perfunctory, became gating requirements at many journals. The 2016 American Statistical Association statement on p-values, while not directly caused by OSC 2015, was shaped by the same intellectual current.

Career-incentive shifts. Hiring committees in many psychology departments began explicitly valuing methodological rigor and replication work — a substantial change from the pre-2015 norm where “I replicated someone else’s experiment” was career-irrelevant or worse.

None of this means psychology is “fixed.” Plenty of poorly-powered, single-lab, original-only studies still get published at the field’s middle and bottom tiers. But the top journals, the major funders, and the most-cited subfields look meaningfully different than they did in 2008.

The Strategist’s Bayesian Discount

Now the operational question: if you are a strategist, an executive, a marketer, or a product manager who reads a psychology study and is asked to make a decision based on it, what should OSC 2015 do to your prior?

Roughly this:

For social-psychology findings published before ~2014 with n < 100 and a single behavioral measure, default to ~25% prior that the headline effect is real at anything close to the reported magnitude. That is the literal OSC subgroup result. Asch’s conformity work, Milgram’s obedience studies, Bandura’s Bobo doll, and several of the field’s other genuinely durable findings are exceptions, but you should make the exception based on independent meta-analytic evidence, not the original paper’s narrative.

For cognitive-psychology findings from the same era with larger samples and within-subject designs, default to ~50% prior. Cognitive psychology is not immune to the crisis, but the methodological norms run cleaner. Stroop effects, recognition-vs-recall asymmetries, working-memory capacity limits, and similar bread-and-butter cognitive findings are robust.

Halve the effect size you read. Across its replications, OSC 2015 found effects averaging about half the original size, and findings that cleared the significance bar were usually smaller than reported too. If a paper reports d = 0.4, plan for d ≈ 0.2 in your operating environment. If the business decision only makes sense at d ≥ 0.4, the decision is more fragile than the paper suggests.

Treat any psychology finding with p in the .03 to .05 range as effectively unproven. OSC 2015 made clear that originals close to the threshold replicate much worse than originals well below it. A p = .04 result in a 2009 social-psychology paper, in the absence of preregistered direct replication, is worth less than the paper’s narrative will suggest.

Strongly upweight preregistered, large-sample, multi-lab evidence over single-lab original findings — even if the multi-lab evidence is recent and the original is famous. The Many Labs efforts, registered-report literature, and large-N direct replications are signal in a way that 2008-vintage social-psychology papers are not. When the two conflict, bet on the preregistered replication.

Be skeptical of any business-book or consulting-deck claim grounded in a single famous psychology study from before 2012, especially in social psychology. Power-pose, marshmallow-test-as-life-predictor, broken-windows-as-causal, decision-fatigue-as-glucose, willpower-as-finite-tank, and similar narratives all draw on original findings whose replication record is poor or actively negative. The story may still be useful as a metaphor; it is not load-bearing evidence for a strategic decision.

The point is not to dismiss psychology — it is to internalize that the rate at which the field’s published claims survive direct, preregistered, well-powered retest is closer to a third than to “settled science.” Adjust accordingly.
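The rules of thumb above can be written down as a small calculation. This is a minimal sketch under loud assumptions: the priors and the 50% shrink are the heuristics from this section, not estimates from any dataset; a finding that fails is treated as a true effect of exactly zero; and every name (`Finding`, `discounted_effect`, `expected_effect`, `decision_survives`) is hypothetical.

```python
from dataclasses import dataclass

# Heuristic values from the discussion above, not fitted parameters.
PRIOR_REAL = {"social": 0.25, "cognitive": 0.50}
SHRINK = 0.5

@dataclass
class Finding:
    reported_d: float  # effect size as reported in the original paper
    subfield: str      # "social" or "cognitive"

def discounted_effect(finding: Finding) -> float:
    """Effect size to plan for if the finding is real at all."""
    return finding.reported_d * SHRINK

def expected_effect(finding: Finding) -> float:
    """Prior-weighted effect, treating a non-replicating finding as zero."""
    return PRIOR_REAL[finding.subfield] * discounted_effect(finding)

def decision_survives(finding: Finding, minimum_useful_d: float) -> bool:
    """Does the decision still make sense at the discounted effect size?"""
    return discounted_effect(finding) >= minimum_useful_d

paper = Finding(reported_d=0.4, subfield="social")
print(discounted_effect(paper))                         # plan for about 0.2
print(expected_effect(paper))                           # about 0.05 in expectation
print(decision_survives(paper, minimum_useful_d=0.4))   # False
```

Keeping the discounted effect and the prior-weighted effect separate is deliberate: the first answers "how big is it if it is real," the second answers "what should I expect on average," and a decision can be sensitive to either.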

Sources

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. DOI: 10.1126/science.aac4716
Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on “Estimating the reproducibility of psychological science.” Science, 351(6277), 1037. DOI: 10.1126/science.aad7243
Anderson, C. J., Bahník, Š., Barnett-Cowan, M., Bosco, F. A., Chandler, J., Chartier, C. R., et al. (2016). Response to Comment on “Estimating the reproducibility of psychological science.” Science, 351(6277), 1037. DOI: 10.1126/science.aad9163
Nosek, B. A., & Lakens, D. (2014). Registered reports: A method to increase the credibility of published results. Social Psychology, 45(3), 137–141. DOI: 10.1027/1864-9335/a000192
Center for Open Science. Open Science Framework. https://osf.io

Related Articles In This Hub

Ioannidis 2005: Why Most Published Research Findings Are False — the theoretical paper that anticipated everything OSC 2015 measured empirically.
Daryl Bem’s Precognition Studies — the 2011 paper that showed the field’s standard methods could produce “significant” evidence for the impossible.
Diederik Stapel: Anatomy of a Fraud — the parallel scandal that exposed how little the field’s normal oversight could catch fabrication.
Money Priming: Why The Effect Mostly Vanished — one of the highest-profile social-priming subliteratures and how it fared under direct replication.
Ego Depletion: How Willpower Became A Glucose Tank — And Why That Story Collapsed — the multi-lab preregistered failure of one of social psychology’s most influential frameworks.

FAQ

Was the 39% number ever revised or retracted?

No. The OSC’s reported figures have stood up across the subsequent decade of analysis. The Gilbert et al. critique did not result in any correction or retraction. Independent re-analyses by other methodologists have generally confirmed the OSC’s headline findings, with some variation depending on which inclusion rules and success criteria are used. The number in the headlines is essentially the number you should still use today.

Did the OSC pick the worst studies on purpose?

No. The sampling process aimed for representative coverage of the three journals’ 2008 output, subject to feasibility constraints (the lead replicating team had to be willing, the original authors contactable, the materials obtainable). If anything, the constraints biased toward studies whose original authors were still active, available, and willing to engage — i.e., somewhat better-supported original work, not worse.

Why three journals and not the whole field?

Bandwidth. The OSC was already the largest direct-replication project ever attempted; expanding the sample would have made it impossible to complete on any reasonable timeline. The three chosen journals are flagship outlets in three major subareas, and the consensus among methodologists has been that the result generalizes reasonably to the broader top-tier literature of the era. Subsequent multi-lab efforts in other subareas (developmental, clinical, organizational) have produced broadly similar patterns.

Has psychology improved since 2015?

Operationally, yes. The methodological infrastructure that exists in 2026 — preregistration norms, registered reports, multi-lab consortia, data-sharing requirements, statistical-reform pressure — is substantially better than 2008. Whether the current literature replicates at higher rates than 2008 is an open empirical question. Some recent partial-replication efforts suggest improvement, particularly within registered reports; others are more equivocal. The honest answer is that the infrastructure is in place to find out, and several large-scale projects are underway.

Should I distrust all of psychology?

No. The robust findings of cognitive and developmental psychology, classical learning theory, large-scale individual-differences work (the Big Five, psychometric intelligence research, well-validated clinical instruments), and meta-analytic syntheses of large multi-study literatures are largely unaffected by the OSC’s findings. What the OSC documented is specifically a problem with single-lab, single-study, often-small-sample social-psychology and lower-tier experimental work from the pre-reform era. The fact that any one such finding is unreliable does not mean the entire discipline is.

What is the most actionable single takeaway?

Apply a 50% effect-size discount and a 25–50% real-effect prior to any pre-2014 single-lab psychology finding before basing a decision on it. If the decision still makes sense under that discount, proceed. If it doesn’t, the original study was never strong enough evidence to support the decision in the first place; the OSC just made that visible.


Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified in behavioral economics. Led 100+ in-house experiments at NRG in 2025, with project evidence and limits documented in the case studies.
