A reference hub for famous behavioral science studies that did not survive replication --- Stanford Prison, Power Posing, Marshmallow Test, Ego Depletion, Bystander Effect, Mozart Effect, and more. Each entry cites primary research, explains what really happened, and shows what leaders should learn about evaluating behavioral science evidence.

If you have ever cited a behavioral-science study to justify a business decision --- a hiring framework, a pricing experiment, a leadership ritual, a UX pattern --- there is a meaningful chance that the study you cited has not survived rigorous replication.

This is not a hypothetical. Over the last fifteen years, large preregistered replication efforts have systematically tested the most famous findings in social and developmental psychology. The results have been brutal. Ego depletion. Power posing. Facial feedback. Stanford Prison. The marshmallow test as a long-run predictor. Money priming. The 2007 glucose mechanism of willpower. The strong popular version of the bystander effect. Most have either failed to replicate at all or have had their effect sizes substantially revised downward.

The popular versions of these findings --- the ones that reach you through TED talks, business books, podcasts, and conference keynotes --- are systematically behind the academic consensus, sometimes by a decade or more. The work that propagates culturally is filtered for storytelling potential rather than evidential strength. The replications that revise the story do not get the same cultural amplification as the original headline findings.

This hub is a reference for what actually survives. Each entry walks through a famous behavioral-science study, what the original paper actually claimed, what the major replications found, what the meta-analytic verdict is today, and what any of it means for leaders, founders, consultants, and anyone whose decisions depend on understanding human behavior.

The hub is opinionated about evidence. It also tries to be fair. Where original authors have defensible positions on contested replications, those positions are represented. Where popular versions of a finding overstate the academic version, the gap is named explicitly. The goal is to give you a calibrated, citation-backed picture of what you can actually rely on --- not to dunk on famous researchers or declare behavioral science dead.

Why this hub exists

Three things motivated building this. Each is a reason it matters for anyone who uses behavioral science in their work.

1. The popular-culture version of behavioral science is years behind the academic version. This gap is not a marginal issue. Many widely cited findings --- the ones in MBA case studies, leadership books, and pop-psychology articles --- have been substantially revised or contradicted by the academic literature, and the revisions have not propagated. If your mental model of “how people work” was built in the 2010s on a diet of popular treatments, parts of that model are now out of date.

2. Single-paper findings drive disproportionate downstream consequences. The Mozart Effect was launched by a one-page note in Nature with 36 college students; it triggered state legislation and a half-billion-dollar product category. The Stanford Prison Experiment was published in a low-tier criminology journal with 24 participants; it shaped public understanding of institutional behavior for fifty years. Power posing came from a single small study; it generated one of the most-watched TED talks of all time. The pattern repeats: thin original evidence, massive cultural amplification, slow correction.

3. The cost of building strategy on collapsing findings is real. If you have built hiring processes, organizational design choices, marketing strategies, or product features on behavioral-science assumptions that have since failed replication, you are paying ongoing costs you may not be measuring. The fix is not to abandon behavioral science. It is to upgrade the discipline with which you evaluate which claims have earned the right to influence decisions.

How to use this hub

Each entry follows the same structure. Skim the TL;DR for the verdict. Read the body when you need the citations, the nuance, or the practical takeaway.

If you are evaluating a specific claim --- “should I cite this study in a board deck?” --- find the entry, check the verdict line, and read the “Honest Verdict Today” section. That is the most efficient path.

If you want to recalibrate your general behavioral-science priors, read the Tier 1 entries in order. They are sequenced to build a coherent picture of how findings get amplified, what kinds of evidence are worth trusting, and what a calibrated reader’s posture should look like.

If you are using a finding as a teaching example or a presentation hook, the FAQ section of each entry contains the questions most readers ask. Use those to anticipate audience objections.

The Tier 1 entries (the most-cited findings)

These fourteen studies are the foundation of the hub. They are the most famous findings, the ones most likely to come up in a business or strategy context, and the ones with the clearest replication-evidence stories. Tier 1 ranges from the Stanford Prison Experiment and Power Posing through to Vicary’s subliminal-advertising fraud.

Social and Personality Psychology

Stanford Prison Experiment --- For 50 years, the canonical proof that “ordinary people become evil under bad systems.” Audio tapes from the original study, made available through the archives in 2018, show the guards were coached. Le Texier’s 2019 American Psychologist paper documented the methodological problems in detail. The BBC Prison Study (Reicher & Haslam 2006) found nearly the opposite under cleaner conditions.

Power Posing --- A two-minute “high-power” pose was supposed to raise testosterone, lower cortisol, and increase risk tolerance. The Ranehill 2015 replication (n=200) failed to reproduce the hormonal effects. First author Dana Carney disavowed the original finding in 2016 (“I do not believe that ‘power pose’ effects are real”). A small subjective “felt power” effect probably survives; the strong physiological claim is dead.

Ego Depletion --- The idea that willpower is a depletable resource that runs down through the day. The Hagger 2016 multi-lab RRR (23 labs, n=2,141) and the Vohs 2021 paradigmatic test (36 labs, n=3,531) both found essentially null effects. The glucose-mechanism version is dead. A smaller attention-control variant (Garrison 2019, d≈0.20) survives.

Facial Feedback Hypothesis --- The Strack 1988 pen-in-mouth study claimed that a forced smile made cartoons seem funnier. The Wagenmakers 2016 RRR (17 labs, n=1,894) found nothing. The Many Smiles 2022 collaboration (n=3,878 across 19 countries) is the strongest evidence to date: the specific pen-in-mouth task is inconclusive, but the broader facial-feedback hypothesis has support for a small effect.

Marshmallow Test --- A four-year-old who can wait for a second marshmallow is supposed to be on track for life success. Watts 2018 (n=918, NICHD SECCYD sample) found the predictive effect largely disappeared after controlling for family background. Sperber 2024 (n=702, followed to age 26) found that associations with adult outcomes do not survive controls. The popular version is not supported.

Bargh Elderly Priming --- Incidental exposure to elderly-stereotype words (embedded in a scrambled-sentence task) was supposed to make people walk more slowly. Doyen 2012 and Shanks 2013 failed to replicate. The broader semantic priming literature holds up; the specific behavioral-priming version does not.

Bystander Effect & Kitty Genovese --- The “38 witnesses watched and did nothing” story was largely fictionalized; at least one person called police, and a neighbor held Genovese as she died. The lab effect (Darley & Latané 1968) is real under specific conditions. But the Philpot 2020 CCTV study of real public conflicts found bystanders intervened in 90.9% of incidents --- directly contradicting the popular framing.

Mozart Effect --- A 1993 one-page note in Nature with 36 college students became state legislation in Georgia and a $400M Baby Einstein industry. The Pietschnig 2010 meta-analysis (k=39) found essentially null effects. Whatever small transient effects exist are general arousal effects from preferred music, not specific to Mozart.

Growth Mindset --- Carol Dweck’s intervention research has had a tougher empirical run than popular treatments suggest. Sisk 2018 meta-analysis found r≈0.10. Yeager 2019 large RCT found small effects in specific subgroups. Foliano 2019 UK RCT was null. Nuanced --- not zero, but much smaller than popular claims.

Behavioral Economics

Loss Aversion Universality --- Kahneman and Tversky’s 2:1 ratio claim is not a universal constant. Gal & Rucker 2018 challenged it; Mrkva 2020 defended a moderated version. Real but heavily moderated, not a universal multiplier.

Choice Overload / Jam Study --- Iyengar & Lepper 2000 showed a 10x conversion drop when shoppers saw 24 jams vs 6. The Scheibehenne 2010 meta-analysis (k=50) found essentially null overall. Chernev 2015 found the effect appears under specific moderators (complexity, novice users, decision difficulty). Conditional, not universal.

CRO / UX / Marketing

F-Pattern Reading --- Jakob Nielsen’s 2006 NN/g claim about how users scan web pages. NN/g’s own 2017 update explicitly reframes it: F-pattern is a failure mode of poor design, not an inevitable user behavior. Well-formatted pages produce different and better patterns. (Practitioner-focused --- on Growth Layer.)

8-Second Attention Span --- The “humans now have shorter attention than goldfish” claim. The BBC traced the citation chain in 2017; the underlying study doesn’t exist. There is no measurable “average attention span” that the construct refers to. (Practitioner-focused --- on Growth Layer.)

Vicary Subliminal Advertising --- James Vicary’s 1957 “Eat Popcorn” study generated FCC inquiries and a panic about subliminal influence. Vicary confessed in a 1962 Advertising Age interview that the study was fabricated. Subliminal priming has small lab-condition effects (Karremans 2006); the popular version is fraud. (Practitioner-focused --- on Growth Layer.)

The Tier 2 entries (the deeper bench)

These twenty-nine additional studies extend the hub across social psychology, behavioral economics, CRO/UX, neuromarketing, and persuasion. Each has been written with the same citation standard as Tier 1: verified primary sources, real DOIs, distinct narrative angles, no fabricated data.

Social Psychology & Persuasion (Atticus Li)

Stereotype Threat --- Steele & Aronson 1995’s dramatic finding has shrunk under meta-analysis. Flore & Wicherts 2015 found near-zero corrected effects; Shewach 2019 (212 studies) verdict: “negligible to small.” Real but much smaller than the public narrative.

Milgram Obedience Experiments --- Gina Perry’s 2013 archival work in the Yale boxes documented participant suspicion, off-script prods, and condition cherry-picking. The famous 65% was condition 1 of 24; the average across conditions was 43.6%. Burger 2009 partial replication only went to 150V.

The Implicit Association Test --- Greenwald’s IAT became the world’s most-used “bias measure,” but Oswald 2013 meta-analysis (r ≈ .15) and Forscher 2019 (492 studies) showed IAT scores barely predict actual discriminatory behavior. Greenwald himself has walked back individual-level prediction.

Pygmalion Effect --- Rosenthal & Jacobson 1968’s “expectations create reality.” Thorndike’s TOGA-test critique was right; Raudenbush 1984 meta-analytic d ≈ 0.11; Jussim’s reframing shows accurate teacher perception dominates the small expectancy residual.

Hawthorne Effect --- Levitt & List 2011 obtained the original 1924-1932 Western Electric data and found the famous lighting-effects pattern is “entirely fictional.” The term was coined by Landsberger in 1958, 25+ years after the data, based on a particular reinterpretation.

Money Priming --- Vohs, Mead & Goode 2006’s influential Science paper. Rohrer 2015 nine preregistered replications (n > 1,800) all null. Lodder 2019 meta-analysis preregistered subset: g = 0.01. Parallel kill case to Bargh elderly priming.

Broken Windows Theory --- Wilson & Kelling’s 1982 Atlantic essay reshaped policing for 100M+ people. Levitt 2004 attributed the 1990s crime decline mostly to other factors; Braga 2015 meta-analysis showed disorder-policing effects are modest and concentrated in community/problem-oriented (not zero-tolerance) approaches.

Power of “Because” (Langer 1978) --- The famous copy-machine study. The 93% placebic-reason result was specific to SMALL requests; the same paper’s large-request condition showed placebic compliance collapses to ~24%. The marketing canon kept the headline and dropped the boundary.

The Decoy Effect --- Huber, Payne & Puto 1982’s asymmetric dominance. Frederick 2014 (38 replications with realistic stimuli) and Yang & Lynn 2014 (91 experiments) showed the effect mostly disappears or reverses outside abstract two-attribute lab paradigms. The Economist case is largely apocryphal.

Foot-in-the-Door --- Freedman & Fraser 1966. Burger 1999 meta-analysis: r ≈ .15-.17, d ≈ .30. Real but modest --- not the “massive lift” the popular framing implies.

Door-in-the-Face --- Cialdini 1975. Genschow & Westfal 2021 direct preregistered replication succeeded (51% vs. 34%). Feeley 2012 meta-analysis r ≈ .126 verbal / .052 behavioral. Real and replicates, but requires specific conditions (same requester, minimal delay, same cause).
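
The entries above quote effect sizes in two different metrics, a correlation r and Cohen’s d (for example, Foot-in-the-Door at r ≈ .15-.17, d ≈ .30). Moving between them is simple arithmetic. The following is a minimal sketch, assuming the standard textbook conversion for two equal-sized groups, d = 2r / sqrt(1 - r^2). The function names are hypothetical and chosen for illustration; the cited meta-analyses may have used more refined conversions.

```python
import math


def r_to_d(r: float) -> float:
    """Convert a correlation r to Cohen's d, assuming two equal-sized groups."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must be strictly between -1 and 1")
    return 2.0 * r / math.sqrt(1.0 - r * r)


def d_to_r(d: float) -> float:
    """Inverse conversion, under the same equal-group-size assumption."""
    return d / math.sqrt(d * d + 4.0)


if __name__ == "__main__":
    # Correlations of the size quoted in the entries above.
    for r in (0.10, 0.15, 0.17):
        print(f"r = {r:.2f}  ->  d = {r_to_d(r):.2f}")
```

Under this assumption, r = .15 works out to d of roughly 0.30, which matches the pairing quoted in the Foot-in-the-Door entry. It also shows why an r near .10 and a d near .20 describe effects of similar, modest size.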

CRO / UX / Marketing (Growth Layer)

Learning Styles --- Pashler et al. 2008 conclusively showed VARK has no scientific support. Newton & Salvi 2020: ~89% of teachers still believe in it. A multi-billion-dollar L&D industry sells products with no evidence base.

Above-the-Fold Myth --- Print-newspaper terminology transplanted to web. NN/g 2010 + Chartbeat 2013 + Schade 2015 “Fold Manifesto” all show users scroll; the fold is no longer the cliff the original rule implied.

Mehrabian 7-38-55 Rule --- Mehrabian himself has spent decades trying to kill the generalized application. His 1967 studies were about communicating INCONSISTENT feelings via single words --- not general communication. He calls the popular usage “absurd.”

The 3-Click Rule --- Never had any research basis. Porter 2003 UIE study (44 participants, 620 tasks): persistence depends on perceived progress, not click count. Survives 20+ years after being debunked.

Color Psychology in CRO --- “Red button beats green” was always one company’s one page on one day. Universal “color psychology” claims have weak academic basis. What matters: contrast, salience, brand context, WCAG compliance.

Hick’s Law for UX --- A real law (Hick 1952) misapplied. The original requires equiprobable, reactive, non-semantic choice. Landauer & Nachbar 1985 found broader/flatter menus were FASTER than deeper/narrower --- opposite of what “Hick’s Law says simplify” predicts.

Miller 7±2 Memory Rule --- Miller’s 1956 paper was about three converging cognitive paradigms, NONE of which was about navigation menus. Miller himself called the “magical number” a “pernicious, Pythagorean coincidence.” Cowan 2001 updated to ~4 chunks anyway.

10% of the Brain Myth --- No source, no founder, no study. Yet 65% of US adults still endorse it (Brown 2014). The most confidently stated lie in pop neuroscience.

Brain Training Games --- The 2016 FTC $2M Lumosity settlement, the 2014 Stanford 70-neuroscientist consensus, and the 2016 Simons PSPI 175-page review all reach the same conclusion: marketing claims outrun the science.

Cialdini’s Influence Principles --- Practitioner calibration guide. Three pillars (authority, social proof in specific contexts, door-in-the-face) have strong evidence; three (foot-in-the-door, reciprocity outside lab, manufactured scarcity) are wobblier than the framework’s reputation implies. Bohner & Schlüter 2014 did not reproduce the famous Goldstein hotel-towel result.

Left vs Right Brain Personality --- Built on a misreading of Sperry’s Nobel-winning split-brain research. Nielsen et al. 2013 PLOS ONE (n=1,011 fMRI scans) directly tested whether individuals have lateralized activity patterns --- they don’t. Sperry himself warned against the pop-psych reading.

Mirror Neurons in Marketing --- Real basic-neuroscience finding (Rizzolatti 1992) extrapolated by marketers into a persuasion framework the underlying neuroscience does not support. Hickok 2014 “Myth of Mirror Neurons” is the canonical critique.

“95% of Decisions Are Unconscious” --- The Zaltman attribution is unverified; the specific 95% figure doesn’t come from a study. ZMET is qualitative interview methodology, not quantitative measurement of conscious/unconscious decision proportions.

Banner Blindness --- Real and well-replicated (Benway 1998, NN/g 2007 & 2018), but practitioners use it as a catch-all label for any low-engagement element. Most cases are actually weak content, missing information scent, or poor headline craft --- different fixes.

Page Speed Conversion Claims --- The famous Amazon “100ms = 1% sales lost” (Linden 2006), Walmart 2012, and Google 2008 stats are 13-18 years old: single-company snapshots cited as universal laws. Modern Core Web Vitals research shows much smaller and more conditional effects.

Pricing Anchoring --- Tversky & Kahneman 1974’s classic anchoring effect is robust. The SaaS-pricing application is more conditional: anchor credibility, buyer expertise, and market familiarity all moderate the effect. Implausible anchors backfire.

What survives --- the anti-examples

Not everything has fallen apart. A useful way to calibrate is to look at the findings that have held up under careful replication. The hub includes dedicated articles on the most useful anti-examples:

  • Fitts’s Law (target acquisition time as a function of distance and target size) --- robust across 70+ years and many devices; the cleanest example of a UX “law” that earned the name. Useful both as a practical tool and as a calibration tool against folk laws like Hick’s, Miller’s, and the 3-Click Rule.
  • The Default Effect / Status Quo Bias (Samuelson & Zeckhauser 1988; Madrian & Shea 2001; Johnson & Goldstein 2003) --- the behavioral-economics finding that actually reshaped policy. Jachimowicz 2019 meta-analysis: d = 0.68. Survived the Maier 2022 critique that demolished other nudge categories. The highest-confidence behavioral intervention strategists can deploy.
  • The Sunk Cost Fallacy (Arkes & Blumer 1985; Staw 1976) --- robust across species, lab paradigms, and corporate field data (Guenzel 2025 Journal of Finance). Predicts recurring patterns of bad organizational decision-making.
  • Ultimatum Game cross-cultural variation (Henrich et al. 2001) --- robust, replicated across small-scale societies (forthcoming).
  • Semantic and evaluative priming (the cognitive forms of priming, distinct from social/behavioral priming) --- robust.
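
Fitts’s Law, the first anti-example above, is compact enough to write down directly. The following is a minimal sketch, assuming the widely used Shannon formulation, MT = a + b * log2(D / W + 1), where D is the distance to the target and W is its width. The constants a and b are hypothetical placeholders here; in practice they are fitted from measured data for a given device and user population.

```python
import math


def index_of_difficulty(distance: float, width: float) -> float:
    """Shannon formulation of the index of difficulty, in bits."""
    if distance < 0 or width <= 0:
        raise ValueError("distance must be >= 0 and width must be > 0")
    return math.log2(distance / width + 1.0)


def movement_time(distance: float, width: float, a: float, b: float) -> float:
    """Predicted movement time a + b * ID; a and b must be fitted empirically."""
    return a + b * index_of_difficulty(distance, width)


if __name__ == "__main__":
    # Hypothetical constants (seconds and seconds per bit), for illustration only.
    A, B = 0.1, 0.15
    small_far = movement_time(distance=700, width=20, a=A, b=B)
    large_near = movement_time(distance=150, width=50, a=A, b=B)
    print(f"small, far target:  {small_far:.3f} s")
    print(f"large, near target: {large_near:.3f} s")
```

The practical reading is the familiar one: larger and closer targets are faster to hit, and the relationship is logarithmic, so doubling a button’s size helps less each time it is done.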

These anti-examples are useful because they show what good behavioral-science evidence looks like: multiple independent replications, large samples, robust effects across populations and conditions, mechanisms that don’t depend on a single charismatic researcher’s lab. They are also concrete reminders that the replication crisis didn’t invalidate behavioral science --- it sharpened it.

How to read behavioral-science claims after this hub

Five heuristics fall out of the cumulative evidence in this hub.

1. Distrust findings with too-clean stories. Real human behavior is messy, contextual, and full of moderators. Findings that are unusually clean and memorable should make you more skeptical, not less. The Mozart Effect, power posing, and the marshmallow test were all unusually clean stories. They are also all examples of overclaims.

2. Single-paper findings deserve provisional belief at best. Even in prestige journals. Nature, Science, and PNAS publish lots of papers that don’t replicate. Until a finding has been independently replicated in well-powered preregistered designs, hold it provisionally regardless of where it was first published.

3. Look for evidence outside the original paradigm. The strongest test of a behavioral-science claim is often a kind of data that the original researchers didn’t use and that doesn’t share their framing assumptions. CCTV footage testing the bystander effect. Preregistered multi-lab RRRs testing ego depletion. Cross-national replications testing facial feedback. These out-of-paradigm tests are usually more decisive than another iteration within the original framing.

4. Confounds usually do more work than the headline variable. The marshmallow test’s predictive power was largely SES. The Stanford Prison Experiment’s brutality was largely leadership signaling. The lab bystander effect was largely about ambiguous low-stakes conditions. Whenever a single measurement appears to predict an outcome, expect that most of the predictive power is carried by confounds --- variables that produce both the measurement and the outcome through separate channels.

5. Cultural belief lags scientific revision by years to decades. The version of behavioral science that reaches you through popular media is systematically behind the version that exists in the journals. Periodically auditing your own behavioral-science assumptions --- perhaps annually --- is one of the highest-ROI cognitive habits available to anyone whose work depends on understanding people.
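
Heuristic 4 is easier to internalize with a toy simulation. The following is a minimal sketch on purely synthetic data, not a reanalysis of any study cited here. A background variable (standing in for something like family SES) drives both an early measurement and a later outcome, and the measurement has no direct effect on the outcome at all. The 0.7 coefficients, the sample size, and all function names are arbitrary assumptions chosen for illustration.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation between x and y after controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def simulate(n=5000, seed=0):
    """Synthetic data: background causes both variables; no direct link."""
    rng = random.Random(seed)
    background = [rng.gauss(0, 1) for _ in range(n)]
    measure = [0.7 * b + rng.gauss(0, 1) for b in background]
    outcome = [0.7 * b + rng.gauss(0, 1) for b in background]
    return background, measure, outcome


if __name__ == "__main__":
    background, measure, outcome = simulate()
    print(f"raw correlation:         {pearson(measure, outcome):.2f}")
    print(f"controlling for confound: {partial_corr(measure, outcome, background):.2f}")
```

In this setup the raw correlation is clearly positive, yet it shrinks to roughly zero once the background variable is controlled for. That is the shape of the marshmallow-test revision: a measurement that appears predictive, with most of the predictive power carried by a variable nobody was looking at.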

Get the next entry

The hub currently covers 44 studies across psychology, behavioral economics, UX, CRO, and marketing. New entries are added as new replication evidence appears or as additional studies merit coverage.

Subscribe to the newsletter to get each new entry in your inbox. Or book a consultation if you’d like a structured audit of the behavioral-science assumptions baked into your current strategy.

Methodology

Every entry cites primary research with verified DOIs. Each was independently fact-checked by deep-research agents against journal sources, replication papers, and current meta-analyses before drafting. Where the academic consensus is contested, both sides are represented fairly. Where the popular version overstates the academic version, the gap is named explicitly.

If you find an error or have a citation that should be considered, send it in. The hub is a living document and will be updated as new replication evidence is published.

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified behavioral economist (1 of ~1,000 worldwide). 200+ A/B tests across energy, SaaS, fintech, e-commerce, and marketplace verticals.