The Many Labs Replication Projects: The Field's Self-Audit

Atticus Li

The Many Labs projects coordinated dozens of labs replicating classic psychology findings with huge samples and pre-registered protocols. Roughly half replicated. The other half didn't. Here is what survived, what failed, and how strategists should use the results to evaluate any psychology-based claim.

By Atticus Li · May 25, 2026 · 21 min read

In 2011, psychology had a credibility problem that no one quite knew how to solve. Daryl Bem had published evidence of precognition in a top journal. Diederik Stapel had been caught fabricating dozens of studies. The first failed replications of priming effects were starting to leak out, and the response from the original authors was usually a combination of indignation and what came to be called the “you didn’t replicate it because you didn’t believe in it” defense.

The structural problem was that no individual lab could settle anything. If one lab tried to replicate an effect and failed, the original authors could (and did) attribute the failure to a hundred minor protocol deviations, sample differences, or hidden moderators. If one lab replicated it, that was treated as confirmation, but the publication system was so biased toward positive results that nobody trusted single-lab replications either.

The fix was structural. Instead of asking one lab to settle a claim, ask thirty labs to run the same study simultaneously, with the protocol pre-registered and signed off by the original authors before any data was collected. If thirty labs running 6,000 participants couldn’t find an effect, the “you didn’t believe in it” defense collapsed.

This was the Many Labs idea. Between 2014 and 2020, four major Many Labs projects ran, each with a different question and a different design, and together they constitute the closest thing the field has to a coordinated self-audit. This article is about what those projects did, what they found, what replicated, what didn’t, and how strategists should use the results to evaluate any psychology-based claim about human behavior.

The Coordination Problem

Before Many Labs, the standard replication attempt looked like this. A researcher in one lab decided to test a published claim. They ran the study on whatever sample they could recruit (usually undergraduates at their own university). They used the published protocol, with whatever local adaptations were needed. They got a result. They tried to publish it. If the result was a successful replication, journals were uninterested (“we already knew that”). If it was a failure, the original authors fought it, journals were skittish, and the failure ended up in someone’s file drawer.

This was not a system designed to settle disputes. It was a system designed to generate publications, and publications meant findings, and findings meant positive results.

The Many Labs design solved most of these problems at once. The key features were these. First, multi-site simultaneity: dozens of labs running the same protocol at the same time, so cross-lab variation could be directly measured rather than inferred. Second, pre-registration: the protocol was signed off in advance, ideally by the original authors, so there was no post-hoc moving of the goalposts. Third, transparent reporting: all data, code, and analyses public, so the whole thing was auditable. Fourth, large combined samples: tens of thousands of participants, so the statistical power to detect even small effects was overwhelming.

When this was done well, the verdict was hard to evade. If 36 labs running more than 6,000 people couldn’t find an effect that was supposed to be there, the most parsimonious explanation was that the effect was not there, or was much smaller than originally reported.
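The power argument can be made concrete with a rough calculation. The sketch below uses a normal approximation for a two-group comparison; the effect size (d = 0.2) and the per-group sample sizes are illustrative assumptions, not figures taken from any Many Labs paper.

```python
from statistics import NormalDist


def two_sample_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample z-test.

    d is the assumed true standardized mean difference (Cohen's d).
    Uses a normal approximation, so it is slightly optimistic for tiny samples.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic when the true effect is d.
    ncp = d * (n_per_group / 2) ** 0.5
    # Probability of landing beyond either critical value.
    return (1 - z.cdf(z_crit - ncp)) + z.cdf(-z_crit - ncp)


# Hypothetical comparison: one small lab vs. a pooled multi-site sample.
single_lab = two_sample_power(d=0.2, n_per_group=40)
pooled = two_sample_power(d=0.2, n_per_group=3000)
print(f"single lab power: {single_lab:.2f}")
print(f"pooled power:     {pooled:.2f}")
```

Under these assumptions a single small lab would miss a real small effect most of the time, while the pooled sample would almost never miss it. That asymmetry is why a pooled null is far more informative than a single-lab null.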

Many Labs 1 (Klein et al. 2014)

The first Many Labs project was led by Richard Klein, then a graduate student, with the Center for Open Science as the coordinating institution. The paper was Klein, R. A., et al. (2014), “Investigating variation in replicability: A ‘many labs’ replication project,” Social Psychology, 45(3), 142–152, DOI: 10.1027/1864-9335/a000178.

The design was deliberately broad. Thirty-six labs around the world (mostly in the United States and Europe) ran the same set of thirteen classic psychology effects, with a combined sample of 6,344 participants. The effects were chosen to span a mix of canonical findings: anchoring (multiple variants), retrospective gambler’s fallacy, framing effects (gain vs. loss framing of decisions), flag priming, currency priming, sunk costs, the IAT (a quality-control check), and several others.

The headline pattern in the results was clean and informative. Effects with large original effect sizes, like the classic anchoring effects, gain-loss framing, and the gambler’s fallacy, replicated reliably across nearly every site. Effect sizes were close to the original published values. The effects that had originally been reported as “small but real” were a different story. Some replicated weakly. Some didn’t replicate at all. Currency priming and flag priming, in particular, came out essentially flat, a finding that anticipated the much larger collapse of the priming literature that would unfold over the next several years.

The cross-lab variation was, importantly, smaller than many critics had predicted. For the effects that did replicate, the effect sizes were remarkably consistent across sites. This was the first systematic piece of evidence that “samples and settings vary” was being used as an excuse for non-replication when the real explanation was that the effects were never as robust as the original papers claimed.
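One way to see what “consistent across sites” means in practice is a standard heterogeneity calculation. The sketch below computes Cochran’s Q, the I² statistic, and a DerSimonian-Laird estimate of between-site variance. It is a generic random-effects calculation, not necessarily the exact analysis the Many Labs teams ran, and the per-site numbers are made up for illustration.

```python
def heterogeneity(effects, variances):
    """Summarize cross-site variation in effect sizes.

    effects:   per-site effect estimates (e.g. Cohen's d)
    variances: per-site sampling variances of those estimates
    """
    weights = [1 / v for v in variances]
    total_w = sum(weights)
    # Inverse-variance weighted (fixed-effect) pooled estimate.
    pooled = sum(w * y for w, y in zip(weights, effects)) / total_w
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # DerSimonian-Laird between-site variance, truncated at zero.
    c = total_w - sum(w ** 2 for w in weights) / total_w
    tau2 = max(0.0, (q - df) / c)
    # I^2: share of total variation attributable to real between-site differences.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return {"pooled": pooled, "Q": q, "tau2": tau2, "I2": i2}


# Hypothetical data: five sites with similar estimates and equal precision.
site_effects = [0.48, 0.52, 0.50, 0.47, 0.53]
site_variances = [0.01] * 5
print(heterogeneity(site_effects, site_variances))
```

When I² and tau² come out near zero, as they do for these made-up sites, the spread across labs is about what sampling error alone would produce. In that situation a hidden-moderator explanation has little variation left to explain.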

The interpretive frame Klein and colleagues used was deliberately neutral. They did not claim that any specific effect was “debunked.” What they did was provide the field with the first systematic dataset on which effects could be relied on and which could not. For 10 of the 13 effects they tested (about 77%), the answer was “this is reliable across labs.” For the remaining three, the answer was “this is much weaker than the literature suggested, or absent.”

Many Labs 2 (Klein et al. 2018)

The second project was an order of magnitude bigger. Klein, R. A., et al. (2018), “Many Labs 2: Investigating variation in replicability across samples and settings,” Advances in Methods and Practices in Psychological Science, 1(4), 443–490, DOI: 10.1177/2515245918810225.

This time, 125 labs across 36 countries on six continents collaborated. The combined sample was 15,305 participants. The protocol tested 28 effects drawn from across social and cognitive psychology, again with pre-registration and original-author sign-off where possible.

The headline replication rate was sobering. Of the 28 effects tested, roughly half replicated in the predicted direction with statistical significance and a meaningful effect size. The other half either failed to reach significance or showed effects so small that they were practically indistinguishable from zero. Across the replicated effects, the average effect size was substantially smaller than in the original publications — typically half the original magnitude, sometimes less. This is the “shrinkage” pattern that the field had been quietly worried about for a decade and that the Many Labs 2 dataset documented in stark, quantitative terms.

The cross-cultural finding deserves its own paragraph. One of the standing defenses of small or null replications was that the effect was culturally bound — that classic American findings would behave differently in Asian or European or Latin American samples. The Many Labs 2 design, with its 36-country sampling, was the cleanest test of that hypothesis ever conducted. The result: very limited cross-cultural variation. For most effects, the differences across countries were smaller than the differences across labs within the same country. Where an effect replicated, it tended to replicate everywhere. Where it failed, it tended to fail everywhere. The “cultural moderation” defense did not survive the data.

Among the specific effects, the patterns matched what the field had been quietly converging on. Classic anchoring effects (Tversky and Kahneman 1974) replicated robustly across sites and cultures. Gain-loss framing (Tversky and Kahneman 1981) replicated. Currency priming was weak but detectable. Several social priming effects, including some that had been celebrated mainstays of the textbook canon, came out flat or near-flat. The structural shape of the literature, after Many Labs 2, was that classic cognitive psychology held up better than classic social psychology, and that effects with large originally reported effect sizes were more likely to survive than effects with originally small or borderline ones.

Many Labs 3 (Ebersole et al. 2016)

The third project asked a different question. Ebersole, C. R., et al. (2016), “Many Labs 3: Evaluating participant pool quality across the academic semester via replication,” Journal of Experimental Social Psychology, 67, 68–82, DOI: 10.1016/j.jesp.2015.10.012.

The question Many Labs 3 was built to answer was a hidden-moderator question. One of the things that critics of replication failures had been arguing was that participant pools varied across the academic semester. At the start of the semester, you got eager freshmen. At the end, you got exhausted upper-classmen who had run out of credits and were participating reluctantly. If effects were sensitive to participant motivation, focus, or fatigue, the timing of the data collection inside the semester might matter a lot.

So the Many Labs 3 team coordinated 20 labs running 10 effects at multiple time points across the academic semester, with a combined sample of approximately 3,000 participants. The design specifically allowed time-of-semester to be analyzed as a moderator.

The result was, again, sobering for the moderator defense. Across the 10 effects tested, time-of-semester explained essentially none of the variance in replication outcomes. Effects that replicated replicated at the start, middle, and end of the semester. Effects that failed failed throughout. The “tired late-semester students” hypothesis turned out to be a hypothesis with very little empirical support.

The broader lesson from Many Labs 3 was that contextual sensitivity, at least for the kinds of effects this project tested, was much lower than the dominant narrative in social psychology had assumed. The field had spent a decade explaining away null results by appealing to subtle contextual factors. The data did not support that explanation for most of the effects examined.

Many Labs 5 (Ebersole et al. 2020)

The fifth project (there was a Many Labs 4 on terror management theory; not the focus here) took on a meta-question about the replication process itself. Ebersole, C. R., et al. (2020), “Many Labs 5: Testing pre-data-collection peer review as an intervention to increase replicability,” Advances in Methods and Practices in Psychological Science, 3(3), 309–331, DOI: 10.1177/2515245920958687.

The question was this. Some of the failed replications in earlier projects had been criticized on the grounds that the replication protocols were not ideal — they had used approximations of the original methods rather than the exact original procedures, and the protocols had not been peer-reviewed by experts before data collection. If we ran the same replications again but this time with rigorously peer-reviewed protocols developed in consultation with the original authors, would the replication rate go up?

Many Labs 5 tested this directly. The team selected 10 effects from the 2015 Open Science Collaboration’s Reproducibility Project (most of which had failed to replicate the first time) and re-ran them with improved, peer-reviewed protocols developed in consultation with the original authors. The combined sample across the new replications was approximately 8,000 participants.

The result was small. Across the 10 effects, the peer-reviewed-protocol versions did slightly better than the original replication attempts, but not by much. The average effect-size gain from the new protocols was close to zero, on the order of .01 in correlation units, and the overall replication rate did not change dramatically. Most of the effects that failed the first time also failed the second time, even with improved protocols and original-author input.

The implication was clarifying. The “the replications were sloppy” defense, like the “samples vary” and “the semester matters” defenses before it, did not survive direct empirical test. When the field rebuilt the protocols carefully and re-ran them, most of the failed effects stayed failed.

What Replicated, What Failed

Across the four Many Labs projects, plus the closely related Registered Replication Reports and the original Open Science Collaboration project (Open Science Collaboration 2015, Science, 349(6251), aac4716, DOI: 10.1126/science.aac4716), a fairly consistent map of the social/cognitive psychology canon has emerged. The boundaries are not perfectly clean, but the broad shape is consistent enough to be useful.

Findings that replicated robustly across multi-site projects: classical anchoring effects (numerical anchors influencing subsequent estimates); gain-loss framing (decisions made differently when outcomes are described as gains vs. losses); retrospective gambler’s fallacy (overestimating runs of independent events); the IAT as a measure of associative responses (though its predictive validity for behavior remains contested separately, see IAT predictive validity); core findings in basic cognitive psychology including memory effects, Stroop, and standard heuristics-and-biases demonstrations.

Findings that failed in multi-site preregistered replications: ego depletion (the willpower-as-depletable-resource construct, see ego depletion), as documented in Hagger, M. S., et al. (2016), “A multilab preregistered replication of the ego-depletion effect,” Perspectives on Psychological Science, 11(4), 546–573, DOI: 10.1177/1745691616652873; facial feedback (the hypothesis that physically smiling makes you feel happier, Wagenmakers et al. 2016 RRR); elderly priming and other “social priming” effects in the Bargh tradition; the free-will manipulation effect (Vohs and Schooler 2008); several specific “social judgment” effects.

Findings that shrank but partially replicated: most effects with originally medium effect sizes came back at roughly half their original magnitude. The effects still existed in many cases, but they were not as practically meaningful as the original literature had claimed.

Stand back, and the pattern can be summarized this way. The cognitive psychology of judgment and decision-making in the Kahneman-Tversky tradition has held up much better than the social psychology of unconscious influence and priming. Effects with originally large effect sizes are more reliable than effects with originally medium or small ones. And cross-cultural moderation, late-semester effects, and protocol sloppiness do not explain very much of the variance in what replicated.

How a Strategist Should Use This

If you are using any psychology-based claim to make a real decision — about how to design an experience, set up a pricing page, structure an incentive, or build a customer flow — the Many Labs corpus gives you a usable diagnostic.

The diagnostic is: before you act on a psychology-based claim, ask whether it has been tested in a Many Labs-style coordinated multi-site preregistered replication. If yes, use the post-replication effect size, not the original. If no, apply an Ioannidis-style discount to your prior.

This is more conservative than it sounds and far more empirical than the alternative. The alternative, in most marketing and product circles, is to treat the most cited textbook version of a finding as if it were robust because it is in a textbook. The Many Labs results are emphatic that textbook citation is a very poor proxy for actual robustness. Many of the most-cited findings in social psychology textbooks have failed multi-site preregistered replications. Continuing to design products and experiences around those findings is, at best, expensive cargo-culting.

The post-replication effect size matters because it tells you what your real expected lift is. If you build a pricing page around an anchoring effect and the post-replication estimate is roughly half the original published value, your expected revenue lift is roughly half of what the original literature would have predicted. That changes what experiments are worth running, what changes are worth defending against the inevitable internal challenge, and what investments in design and copy actually have positive ROI.
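The arithmetic behind that paragraph is simple enough to sketch. The function names, the 0.5 shrinkage factor, and the revenue and cost figures below are all hypothetical placeholders; the point is only that the go/no-go decision can flip once the original effect is discounted.

```python
def discounted_lift(original_lift: float, shrinkage: float = 0.5) -> float:
    """Scale a literature-based lift estimate by a post-replication factor.

    shrinkage=0.5 mirrors the "roughly half the original" pattern described
    in the text; it is an assumption, not a measured constant.
    """
    return original_lift * shrinkage


def worth_running(baseline_revenue: float, original_lift: float,
                  cost: float, shrinkage: float = 0.5):
    """Return (decision, expected_gain) for a proposed change."""
    expected_gain = baseline_revenue * discounted_lift(original_lift, shrinkage)
    return expected_gain > cost, expected_gain


# Hypothetical pricing-page change: the literature suggests a 4% lift,
# baseline revenue is 100,000, and the change costs 3,000 to build and test.
naive = worth_running(100_000, 0.04, 3_000, shrinkage=1.0)
calibrated = worth_running(100_000, 0.04, 3_000, shrinkage=0.5)
print("naive:     ", naive)
print("calibrated:", calibrated)
```

With these made-up numbers the naive estimate clears the cost bar and the calibrated one does not. In practice the shrinkage factor should come from the replication estimate for the specific effect, where one exists.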

For findings that have not been tested in a multi-site replication, the right move is not to dismiss them, but to discount your confidence. Ioannidis (2005) argued, from a simple model of statistical power, prior odds, and bias, that most published research findings are false, and the Many Labs and OSC data have largely vindicated that argument within psychology specifically. If the claim has not survived a coordinated multi-site test, treat it as a hypothesis that might be true rather than as established fact. (See why most published research findings are false for the underlying argument.)
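For readers who want a number rather than a vague discount, the simplest form of the Ioannidis (2005) model gives the positive predictive value of a significant finding as PPV = (1 − β)R / ((1 − β)R + α), where R is the pre-study odds that the tested relationship is true, 1 − β is power, and α is the significance threshold. The sketch below omits the bias term from the paper, and the input values are illustrative assumptions rather than estimates for any particular literature.

```python
def ppv(prior_odds: float, power: float = 0.8, alpha: float = 0.05) -> float:
    """Positive predictive value of a 'significant' finding.

    prior_odds: pre-study odds R that the tested relationship is true
    power:      1 - beta, the chance of detecting a true effect
    alpha:      false-positive rate of the significance test
    Simplest form of the Ioannidis (2005) model, with no bias term.
    """
    true_positives = power * prior_odds
    return true_positives / (true_positives + alpha)


# Hypothetical scenarios, chosen only to show how the inputs interact.
well_powered_plausible = ppv(prior_odds=1.0, power=0.8)
underpowered_longshot = ppv(prior_odds=0.1, power=0.2)
print(f"well-powered, plausible claim: {well_powered_plausible:.2f}")
print(f"underpowered long shot:        {underpowered_longshot:.2f}")
```

The second scenario, low power combined with low prior odds, is the profile that produces a published finding more likely to be false than true. That is the profile to assume for an untested claim until a multi-site replication says otherwise.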

This is not about being negative. It is about being calibrated. The Many Labs corpus is one of the field’s most valuable assets precisely because it gives you a way to be empirically calibrated rather than emotionally calibrated about what psychology actually knows.

What Many Labs Did Not Solve

For completeness, a few honest caveats about what these projects did and did not accomplish.

They did not test every important finding. There are still hundreds of effects in the social psychology literature that have not been examined in multi-site coordinated replications. Absence of a multi-site test is not the same as a failed test. It just means the prior should not be as confident as a textbook makes it sound.

They did not solve the file-drawer problem at the level of the broader literature. The Many Labs projects produced clean tests of specific effects, but the underlying ecosystem that produced the original inflated literature — publication bias, p-hacking, HARKing, weak preregistration, the careerist pressure to find positive results — was the actual root cause of the replication crisis. The Many Labs projects exposed the symptoms. Fixing the root cause required the infrastructure changes that the broader open-science movement has been pursuing over the past decade: pre-registration norms, registered reports, mandatory data sharing, and the gradual cultural shift toward treating replications as scientifically valuable.

They have not been universally accepted. Some original authors of failed effects continue to dispute the methodology of the replications. Most of these defenses, on close reading, reduce to the same hidden-moderator arguments that the Many Labs designs were specifically built to test and that the data did not support. But the dispute is not over for everyone, and it is worth being aware that the literature you read on these effects depends heavily on whose telling you are reading.

They focused mostly on Western and educated samples even when they sampled internationally. The Many Labs 2 sampling, while genuinely international, still over-represented universities and under-represented the kinds of populations that most behavioral interventions are actually targeting. Where cross-cultural variation does exist for an effect, it is most likely to show up in the populations that were least represented in these studies. This is a real limit, even if it is also a smaller limit than the original “the effect is American-specific” defenses claimed.

What This Means for the Field

The honest summary is that the field of social psychology in 2026 looks very different from the field in 2011. About half of the canonical social-psychology findings that the Many Labs projects examined did not replicate cleanly. The other half did. The methodology of the discipline has been reshaped in response — preregistration, registered reports, mandatory data sharing, and a much more sober prior about new findings have all become standard at top journals. None of this would have been possible without the kind of coordinated, transparent, falsifiable evidence that Many Labs provided.

If you are a strategist or a builder, the Many Labs era is not a reason to give up on psychology. It is a reason to use it carefully. The findings that replicated, replicated. The ones that didn’t, didn’t. The field has done its self-audit. The job, now, is to use the audit results.

Sources

Klein, R. A., et al. (2014). Investigating variation in replicability: A “many labs” replication project. Social Psychology, 45(3), 142–152. DOI: 10.1027/1864-9335/a000178
Klein, R. A., et al. (2018). Many Labs 2: Investigating variation in replicability across samples and settings. Advances in Methods and Practices in Psychological Science, 1(4), 443–490. DOI: 10.1177/2515245918810225
Ebersole, C. R., et al. (2016). Many Labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology, 67, 68–82. DOI: 10.1016/j.jesp.2015.10.012
Ebersole, C. R., et al. (2020). Many Labs 5: Testing pre-data-collection peer review as an intervention to increase replicability. Advances in Methods and Practices in Psychological Science, 3(3), 309–331. DOI: 10.1177/2515245920958687
Hagger, M. S., et al. (2016). A multilab preregistered replication of the ego-depletion effect. Perspectives on Psychological Science, 11(4), 546–573. DOI: 10.1177/1745691616652873
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. DOI: 10.1126/science.aac4716
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2(8), e124. DOI: 10.1371/journal.pmed.0020124

Related

The Open Science Collaboration 2015 Reproducibility Project - the foundational mass-replication paper that triggered the field’s reckoning
Why Most Published Research Findings Are False (Ioannidis 2005) - the structural argument for why most published claims are wrong
P-Hacking and the Garden of Forking Paths - the mechanism behind the inflated original literature
Publication Bias and the File-Drawer Problem - why the published literature systematically over-represents positive results
Ego Depletion: How Willpower Became a Glucose Tank - the single highest-profile failure in a multi-site preregistered replication

FAQ

Q: What is the Many Labs project? A: The Many Labs projects are a series of coordinated multi-laboratory replication studies in psychology, led by Richard Klein, Charles Ebersole, and colleagues at the Center for Open Science between 2014 and 2020. Each project had dozens of independent labs simultaneously running the same pre-registered protocols testing classic psychology findings, with combined samples ranging from 3,000 to over 15,000 participants. The goal was to give the field a coordinated empirical self-audit rather than relying on single-lab replication attempts that could always be disputed.

Q: What percentage of psychology findings replicated in Many Labs? A: Roughly half to three-quarters, depending on the specific project. Many Labs 1 (2014) reported that 10 of the 13 tested effects replicated consistently. Many Labs 2 (2018) reported that about half of the 28 tested effects replicated, with substantial cross-cultural consistency. The effects that did replicate were typically about half the magnitude of the original published claims, a phenomenon often called “effect size shrinkage.”

Q: Which famous psychology effects failed to replicate? A: The most prominent failures in multi-site preregistered replications include ego depletion (Hagger et al. 2016), facial feedback (Wagenmakers et al. 2016), elderly priming and several other “social priming” effects in the Bargh tradition, and the free-will manipulation effect (Vohs and Schooler 2008). Many other smaller-scale social-judgment effects also failed.

Q: Which findings did replicate cleanly? A: Anchoring effects, gain-loss framing, retrospective gambler’s fallacy, and most core cognitive psychology effects in the heuristics-and-biases tradition. Broadly: classical Kahneman-Tversky-style cognitive psychology held up much better than classical social psychology.

Q: Did Many Labs find cross-cultural variation in psychology effects? A: Surprisingly little. Many Labs 2, with its 36-country sample, was specifically designed to test the “the effect is culturally bound” defense for failed replications. The result was that cross-cultural variation was generally small — smaller than variation across labs within the same country. Where an effect replicated, it tended to replicate across cultures. Where it failed, it tended to fail across cultures.

Q: How should I use Many Labs results when evaluating a psychology-based claim? A: Apply this diagnostic: ask whether the claim has been tested in a Many Labs-style coordinated multi-site preregistered replication. If yes, use the post-replication effect size, not the original. If no, apply an Ioannidis-style discount to your prior confidence. This is more conservative than treating textbook claims as established fact, and far more aligned with what the field’s own self-audit actually showed.

Q: Does failed replication mean the original finding was fraud? A: No. The vast majority of replication failures reflect the systemic problems of the pre-2011 literature — publication bias, p-hacking, weak preregistration, underpowered studies, and the field-wide bias toward positive results — rather than intentional fraud. A small number of replication failures have turned out to involve fraud (Stapel, Hauser, LaCour, Wansink in different ways), but most failures are about the structural ecosystem that produced inflated effect sizes, not about individual misconduct.

Q: What is the Many Labs 5 finding about peer-reviewed protocols? A: Many Labs 5 (Ebersole et al. 2020) tested whether failed replications would succeed if the protocols were carefully peer-reviewed and developed in consultation with the original authors before data collection. The answer was: at most a negligible improvement in average effect size (on the order of .01 in correlation units), and most failed effects stayed failed. The “protocols were sloppy” defense did not survive direct empirical test.

Q: Is the replication crisis over? A: The acute methodological crisis has largely been addressed. Preregistration, registered reports, mandatory data sharing, and a more skeptical prior toward new findings have become standard at top psychology journals. The deeper challenge — re-evaluating the existing canon and untangling the parts that survived from the parts that didn’t — is a multi-decade project that the Many Labs corpus is one of the main inputs to.

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified in behavioral economics. Led 100+ in-house experiments at NRG in 2025, with project evidence and limits documented in the case studies.
