The Replication Crisis Hub: Famous Behavioral Science Studies and What Actually Survives

Atticus Li



A reference hub for famous behavioral science studies that did not survive replication — Stanford Prison, Power Posing, Marshmallow Test, Ego Depletion, Bystander Effect, Mozart Effect, and more. Each entry cites primary research, explains what really happened, and shows what leaders should learn about evaluating behavioral science evidence.

By Atticus Li, May 7, 2026


If you have ever cited a behavioral-science study to justify a business decision --- a hiring framework, a pricing experiment, a leadership ritual, a UX pattern --- there is a meaningful chance that the study you cited has not survived rigorous replication.

This is not a hypothetical. Over the last fifteen years, large preregistered replication efforts have systematically tested the most famous findings in social and developmental psychology. The results have been brutal. Ego depletion. Power posing. Facial feedback. Stanford Prison. The marshmallow test as a long-run predictor. Money priming. The 2007 glucose mechanism of willpower. The strong popular version of the bystander effect. Most have either failed to replicate at all or have had their effect sizes substantially revised downward.

The popular versions of these findings --- the ones that reach you through TED talks, business books, podcasts, and conference keynotes --- are systematically behind the academic consensus, sometimes by a decade or more. The work that propagates culturally is filtered for storytelling potential rather than evidential strength. The replications that revise the story do not get the same cultural amplification as the original headline findings.

This hub is a reference for what actually survives. Each entry walks through a famous behavioral-science study, what the original paper actually claimed, what the major replications found, what the meta-analytic verdict is today, and what any of it means for leaders, founders, consultants, and anyone whose decisions depend on understanding human behavior.

The hub is opinionated about evidence. It also tries to be fair. Where original authors have defensible positions on contested replications, those positions are represented. Where popular versions of a finding overstate the academic version, the gap is named explicitly. The goal is to give you a calibrated, citation-backed picture of what you can actually rely on --- not to dunk on famous researchers or declare behavioral science dead.

Why this hub exists

Three things motivated building this. Each is a reason it matters for anyone who uses behavioral science in their work.

1. The popular-culture version of behavioral science is years behind the academic version. This gap is not a marginal issue. Many widely cited findings --- the ones in MBA case studies, leadership books, and pop-psychology articles --- have been substantially revised or contradicted by the academic literature, and the revisions have not propagated. If your mental model of “how people work” was built in the 2010s on a diet of popular treatments, parts of that model are now out of date.

2. Single-paper findings drive disproportionate downstream consequences. The Mozart Effect was launched by a one-page note in Nature with 36 college students; it triggered state legislation and a half-billion-dollar product category. The Stanford Prison Experiment was published in a low-tier criminology journal with 24 participants; it shaped public understanding of institutional behavior for fifty years. Power posing came from a single small study; it generated one of the most-watched TED talks of all time. The pattern repeats: thin original evidence, massive cultural amplification, slow correction.

3. The cost of building strategy on collapsing findings is real. If you have built hiring processes, organizational design choices, marketing strategies, or product features on behavioral-science assumptions that have since failed replication, you are paying ongoing costs you may not be measuring. The fix is not to abandon behavioral science. It is to upgrade the discipline with which you evaluate which claims have earned the right to influence decisions.

How to use this hub

Each entry follows the same structure. Skim the TL;DR for the verdict. Read the body when you need the citations, the nuance, or the practical takeaway.

If you are evaluating a specific claim --- “should I cite this study in a board deck?” --- find the entry, check the verdict line, and read the “Honest Verdict Today” section. That is the most efficient path.

If you want to recalibrate your general behavioral-science priors, read the Tier 1 entries in order. They are sequenced to build a coherent picture of how findings get amplified, what kinds of evidence are worth trusting, and what a calibrated reader’s posture should look like.

If you are using a finding as a teaching example or a presentation hook, the FAQ section of each entry contains the questions most readers ask. Use those to anticipate audience objections.

The Tier 1 entries (the most-cited findings)

These fourteen studies are the foundation of the hub. They are the most famous findings, the ones most likely to come up in a business or strategy context, and the ones with the clearest replication-evidence stories. Tier 1 ranges from the Stanford Prison Experiment and Power Posing through to Vicary’s subliminal-advertising fraud.

Stanford Prison Experiment --- For 50 years, the canonical proof that “ordinary people become evil under bad systems.” Audio tapes from the original study, released in archives in 2018, show the guards were coached. Le Texier’s 2019 American Psychologist paper documented the methodology. The BBC Prison Study (Reicher & Haslam 2006) found nearly the opposite under cleaner conditions.

Power Posing --- A two-minute “high-power” pose was supposed to raise testosterone, lower cortisol, and increase risk tolerance. The Ranehill 2015 replication (n=200) failed to reproduce the hormonal effects. First author Dana Carney disavowed the original finding in 2016 (“I do not believe that ‘power pose’ effects are real”). A small subjective “felt power” effect probably survives; the strong physiological claim is dead.

Ego Depletion --- The idea that willpower is a depletable resource that runs down through the day. The Hagger 2016 multi-lab RRR (23 labs, n=2,141) and the Vohs 2021 paradigmatic test (36 labs, n=3,531) both found essentially null effects. The glucose-mechanism version is dead. A smaller attention-control variant (Garrison 2019, d≈0.20) survives.

Facial Feedback Hypothesis --- The Strack 1988 pen-in-mouth study claimed forced-smile expressions made cartoons rated funnier. The Wagenmakers 2016 RRR (17 labs, n=1,894) found nothing. The Many Smiles 2022 collaboration (n=3,878 across 19 countries) is the strongest verdict; the specific pen-in-mouth task is inconclusive, the broader facial-feedback hypothesis has small support.

Marshmallow Test --- A four-year-old who can wait for a second marshmallow is supposed to be on track for life success. Watts 2018 (n=918, NICHD SECCYD sample) found the predictive effect largely disappeared after controlling for family background. Sperber 2024 (n=702, followed to age 26) found the associations with adult outcomes do not survive controls. The popular version is not supported.

Bargh Elderly Priming --- Incidental exposure to elderly-stereotype words was supposed to make people walk more slowly. Doyen 2012 and Shanks 2013 failed to replicate. The broader semantic priming literature holds up; the specific behavioral-priming version does not.

Bystander Effect & Kitty Genovese --- The “38 witnesses watched and did nothing” story was largely fictionalized; at least one person called police, and a neighbor held Genovese as she died. The lab effect (Darley & Latané 1968) is real under specific conditions. But the Philpot 2020 CCTV study of real public conflicts found bystanders intervened in 90.9% of incidents --- directly contradicting the popular framing.

Mozart Effect --- A 1993 one-page note in Nature with 36 college students became state legislation in Georgia and a $400M Baby Einstein industry. The Pietschnig 2010 meta-analysis (k=39) found essentially null effects. Whatever small transient effects exist are general arousal effects from preferred music, not specific to Mozart.

Growth Mindset --- Carol Dweck’s intervention research has had a tougher empirical run than popular treatments suggest. Sisk 2018 meta-analysis found r≈0.10. Yeager 2019 large RCT found small effects in specific subgroups. Foliano 2019 UK RCT was null. Nuanced --- not zero, but much smaller than popular claims.

Behavioral Economics

Loss Aversion Universality --- Kahneman and Tversky’s 2:1 ratio claim is not a universal constant. Gal & Rucker 2018 challenged it; Mrkva 2020 defended a moderated version. Real but heavily moderated, not a universal multiplier.

Choice Overload / Jam Study --- Iyengar & Lepper 2000 showed a 10x conversion drop when shoppers saw 24 jams vs 6. The Scheibehenne 2010 meta-analysis (k=50) found essentially null overall. Chernev 2015 found the effect appears under specific moderators (complexity, novice users, decision difficulty). Conditional, not universal.

CRO / UX / Marketing

F-Pattern Reading --- Jakob Nielsen’s 2006 NN/g claim about how users scan web pages. NN/g’s own 2017 update explicitly reframes it: F-pattern is a failure mode of poor design, not an inevitable user behavior. Well-formatted pages produce different and better patterns. (Practitioner-focused --- on Growth Layer.)

8-Second Attention Span --- The “humans now have shorter attention than goldfish” claim. The BBC traced the citation chain in 2017; the underlying study doesn’t exist. There is no measurable “average attention span” that the construct refers to. (Practitioner-focused --- on Growth Layer.)

Vicary Subliminal Advertising --- James Vicary’s 1957 “Eat Popcorn” study generated FCC inquiries and a panic about subliminal influence. Vicary confessed in 1962 Advertising Age that the study was fabricated. Subliminal priming has small lab-condition effects (Karremans 2006); the popular version is fraud. (Practitioner-focused --- on Growth Layer.)

The Tier 2 entries (the deeper bench)

These twenty-nine additional studies extend the hub across social psychology, behavioral economics, CRO/UX, neuromarketing, and persuasion. Each has been written with the same citation standard as Tier 1: verified primary sources, real DOIs, distinct narrative angles, no fabricated data.

Stereotype Threat --- Steele & Aronson 1995’s dramatic finding has shrunk under meta-analysis. Flore & Wicherts 2015 found near-zero corrected effects; Shewach 2019 (212 studies) verdict: “negligible to small.” Real but much smaller than the public narrative.

Milgram Obedience Experiments --- Gina Perry’s 2013 archival work in the Yale boxes documented participant suspicion, off-script prods, and condition cherry-picking. The famous 65% was condition 1 of 24; the average across conditions was 43.6%. Burger 2009 partial replication only went to 150V.

The Implicit Association Test --- Greenwald’s IAT became the world’s most-used “bias measure,” but Oswald 2013 meta-analysis (r ≈ .15) and Forscher 2019 (492 studies) showed IAT scores barely predict actual discriminatory behavior. Greenwald himself has walked back individual-level prediction.

Pygmalion Effect --- Rosenthal & Jacobson 1968’s “expectations create reality.” Thorndike’s TOGA-test critique was right; Raudenbush 1984 meta-analytic d ≈ 0.11; Jussim’s reframing shows accurate teacher perception dominates the small expectancy residual.

Hawthorne Effect --- Levitt & List 2011 obtained the original 1924-1932 Western Electric data and found the famous lighting-effects pattern is “entirely fictional.” The term was coined by Landsberger in 1958, 25+ years after the data, based on a particular reinterpretation.

Money Priming --- Vohs, Mead & Goode 2006’s influential Science paper. Rohrer 2015 nine preregistered replications (n > 1,800) all null. Lodder 2019 meta-analysis preregistered subset: g = 0.01. Parallel kill case to Bargh elderly priming.

Broken Windows Theory --- Wilson & Kelling’s 1982 Atlantic essay reshaped policing for 100M+ people. Levitt 2004 attributed the 1990s crime decline mostly to other factors; Braga 2015 meta-analysis showed disorder-policing effects are modest and concentrated in community/problem-oriented (not zero-tolerance) approaches.

Power of “Because” (Langer 1978) --- The famous copy-machine study. The 93% placebic-reason result was specific to SMALL requests; the same paper’s large-request condition showed placebic compliance collapses to ~24%. The marketing canon kept the headline and dropped the boundary.

The Decoy Effect --- Huber, Payne & Puto 1982’s asymmetric dominance. Frederick 2014 (38 replications with realistic stimuli) and Yang & Lynn 2014 (91 experiments) showed the effect mostly disappears or reverses outside abstract two-attribute lab paradigms. The Economist case is largely apocryphal.

Foot-in-the-Door --- Freedman & Fraser 1966. Burger 1999 meta-analysis: r ≈ .15-.17, d ≈ .30. Real but modest --- not the “massive lift” the popular framing implies.

Door-in-the-Face --- Cialdini 1975. Genschow & Westfal 2021 direct preregistered replication succeeded (51% vs. 34%). Feeley 2012 meta-analysis r ≈ .126 verbal / .052 behavioral. Real and replicates, but requires specific conditions (same requester, minimal delay, same cause).
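
The entries above quote effect sizes in two different metrics, a correlation r and Cohen's d, which makes them hard to compare at a glance. The following is a minimal sketch for moving between the two, assuming the standard conversion for two equal-sized groups, d = 2r / sqrt(1 - r^2). The function names are illustrative and are not taken from any of the cited papers.

```python
import math

def r_to_d(r: float) -> float:
    """Convert a correlation r to Cohen's d, assuming two equal-sized groups."""
    return 2 * r / math.sqrt(1 - r ** 2)

def d_to_r(d: float) -> float:
    """Inverse conversion under the same equal-group-size assumption."""
    return d / math.sqrt(d ** 2 + 4)

if __name__ == "__main__":
    # Correlations of the size quoted in the entries above.
    for r in (0.10, 0.15, 0.17):
        print(f"r = {r:.2f}  ->  d = {r_to_d(r):.2f}")
```

Under this assumption, r ≈ .15 corresponds to d ≈ 0.30, which lines up with the Burger 1999 figures quoted for foot-in-the-door. With unequal group sizes the conversion changes, so treat the output as a rough guide.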

CRO / UX / Marketing (Growth Layer)

Learning Styles --- Pashler et al. 2008 conclusively showed VARK has no scientific support. Newton & Salvi 2020: ~89% of teachers still believe in it. A multi-billion-dollar L&D industry sells products with no evidence base.

Above-the-Fold Myth --- Print-newspaper terminology transplanted to web. NN/g 2010 + Chartbeat 2013 + Schade 2015 “Fold Manifesto” all show users scroll; the fold is no longer the cliff the original rule implied.

Mehrabian 7-38-55 Rule --- Mehrabian himself has spent decades trying to kill the generalized application. His 1967 studies were about communicating INCONSISTENT feelings via single words --- not general communication. He calls the popular usage “absurd.”

The 3-Click Rule --- Never had any research basis. Porter 2003 UIE study (44 participants, 620 tasks): persistence depends on perceived progress, not click count. Survives 20+ years after being debunked.

Color Psychology in CRO --- “Red button beats green” was always one company’s one page on one day. Universal “color psychology” claims have weak academic basis. What matters: contrast, salience, brand context, WCAG compliance.

Hick’s Law for UX --- A real law (Hick 1952) misapplied. The original requires equiprobable, reactive, non-semantic choice. Landauer & Nachbar 1985 found broader/flatter menus were FASTER than deeper/narrower --- opposite of what “Hick’s Law says simplify” predicts.

Miller 7±2 Memory Rule --- Miller’s 1956 paper was about three converging cognitive paradigms, NONE of which was about navigation menus. Miller himself called the “magical number” a “pernicious, Pythagorean coincidence.” Cowan 2001 updated to ~4 chunks anyway.

10% of the Brain Myth --- No source, no founder, no study. Yet 65% of US adults still endorse it (Brown 2014). The most confidently stated lie in pop neuroscience.

Brain Training Games --- The 2016 FTC $2M Lumosity settlement, the 2014 Stanford 70-neuroscientist consensus, and the 2016 Simons PSPI 175-page review all reach the same conclusion: marketing claims outrun the science.

Cialdini’s Influence Principles --- Practitioner calibration guide. Three pillars (authority, social proof in specific contexts, door-in-the-face) have strong evidence; three (foot-in-the-door, reciprocity outside lab, manufactured scarcity) are wobblier than the framework’s reputation implies. Bohner & Schlüter 2014 did not reproduce the famous Goldstein hotel-towel result.

Left vs Right Brain Personality --- Built on a misreading of Sperry’s Nobel-winning split-brain research. Nielsen et al. 2013 PLOS ONE (n=1,011 fMRI scans) directly tested whether individuals have lateralized activity patterns --- they don’t. Sperry himself warned against the pop-psych reading.

Mirror Neurons in Marketing --- Real basic-neuroscience finding (Rizzolatti 1992) extrapolated by marketers into a persuasion framework the underlying neuroscience does not support. Hickok 2014 “Myth of Mirror Neurons” is the canonical critique.

“95% of Decisions Are Unconscious” --- The Zaltman attribution is unverified; the specific 95% figure doesn’t come from a study. ZMET is qualitative interview methodology, not quantitative measurement of conscious/unconscious decision proportions.

Banner Blindness --- Real and well-replicated (Benway 1998, NN/g 2007 & 2018), but practitioners use it as a catch-all label for any low-engagement element. Most cases are actually weak content, missing information scent, or poor headline craft --- different fixes.

Page Speed Conversion Claims --- The famous Amazon “100ms = 1% sales lost” (Linden 2006), Walmart 2012, and Google 2008 stats are 14 to 20 years old: single-company snapshots cited as universal laws. Modern Core Web Vitals research shows much smaller and more conditional effects.

Pricing Anchoring --- Tversky & Kahneman 1974’s classic anchoring effect is robust. The SaaS-pricing application is more conditional: anchor credibility, buyer expertise, and market familiarity all moderate the effect. Implausible anchors backfire.

What survives --- the anti-examples

Not everything has fallen apart. A useful way to calibrate is to look at the findings that have held up under careful replication. The hub includes dedicated articles on the most useful anti-examples:

Fitts’s Law (target acquisition time as a function of distance and target size) --- robust across 70+ years and many devices; the cleanest example of a UX “law” that earned the name. Useful both as a practical tool and as a calibration tool against folk laws like Hick’s, Miller’s, and the 3-Click Rule.
The Default Effect / Status Quo Bias (Samuelson & Zeckhauser 1988; Madrian & Shea 2001; Johnson & Goldstein 2003) --- the behavioral-economics finding that actually reshaped policy. Jachimowicz 2019 meta-analysis: d = 0.68. Survived the Maier 2022 critique that demolished other nudge categories. The highest-confidence behavioral intervention strategists can deploy.
The Sunk Cost Fallacy (Arkes & Blumer 1985; Staw 1976) --- robust across species, lab paradigms, and corporate field data (Guenzel 2025 Journal of Finance). Explains recurring patterns of bad organizational decision-making.
Ultimatum Game cross-cultural variation (Henrich et al. 2001) --- robust, replicated across small-scale societies (entry forthcoming).
Semantic and evaluative priming (the cognitive forms of priming, distinct from social/behavioral priming) --- robust.

These anti-examples are useful because they show what good behavioral-science evidence looks like: multiple independent replications, large samples, robust effects across populations and conditions, mechanisms that don’t depend on a single charismatic researcher’s lab. They are also concrete reminders that the replication crisis didn’t invalidate behavioral science --- it sharpened it.
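
Why large samples and independent replication carry so much weight can be sketched with elementary Bayesian arithmetic. The snippet below is an illustration, not a result from any study in this hub. It computes the share of statistically significant findings that reflect a true effect, given an assumed prior probability that a tested hypothesis is true, the study's statistical power, and the significance threshold. All input numbers are hypothetical.

```python
def positive_predictive_value(prior: float, power: float, alpha: float = 0.05) -> float:
    """Probability that a significant result reflects a true effect.

    prior: assumed share of tested hypotheses that are actually true
    power: probability that a study detects a true effect
    alpha: false-positive rate when there is no effect
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)

if __name__ == "__main__":
    # Hypothetical scenarios: an underpowered study vs. a well-powered one,
    # both testing a surprising claim with an assumed 10% prior of being true.
    for label, power in (("power 0.20", 0.20), ("power 0.90", 0.90)):
        ppv = positive_predictive_value(prior=0.10, power=power)
        print(f"{label}: PPV = {ppv:.2f}")
```

With these assumed inputs, a significant result from the low-powered study is more likely false than true, while the well-powered one is roughly twice as credible. The sketch ignores selective reporting and analytic flexibility, both of which would lower the figures further.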

How to read behavioral-science claims after this hub

Five heuristics that fall out of the cumulative evidence in this hub.

1. Distrust findings with too-clean stories. Real human behavior is messy, contextual, and full of moderators. Findings that are unusually clean and memorable should make you more skeptical, not less. The Mozart Effect, power posing, and the marshmallow test were all unusually clean stories. They are also all examples of overclaims.

2. Single-paper findings deserve provisional belief at best. Even in prestige journals. Nature, Science, and PNAS publish lots of papers that don’t replicate. Until a finding has been independently replicated in well-powered preregistered designs, hold it provisionally regardless of where it was first published.

3. Look for evidence outside the original paradigm. The strongest test of a behavioral-science claim is often a kind of data that the original researchers didn’t use and that doesn’t share their framing assumptions. CCTV footage testing the bystander effect. Preregistered multi-lab RRRs testing ego depletion. Cross-national replications testing facial feedback. These out-of-paradigm tests are usually more decisive than another iteration within the original framing.

4. Confounds usually do more work than the headline variable. The marshmallow test’s predictive power was largely SES. The Stanford Prison Experiment’s brutality was largely leadership signaling. The lab bystander effect was largely about ambiguous low-stakes conditions. Whenever a single measurement appears to predict an outcome, expect that most of the predictive power is carried by confounds --- variables that produce both the measurement and the outcome through separate channels.

5. Cultural belief lags scientific revision by years to decades. The version of behavioral science that reaches you through popular media is systematically behind the version that exists in the journals. Periodically auditing your own behavioral-science assumptions --- perhaps annually --- is one of the highest-ROI cognitive habits available to anyone whose work depends on understanding people.

Get the next entry

The hub currently covers 44 studies across psychology, behavioral economics, UX, CRO, and marketing. New entries are added as new replication evidence appears or as additional studies merit coverage.

Subscribe to the newsletter to get each new entry in your inbox. Or book a consultation if you’d like a structured audit of the behavioral-science assumptions baked into your current strategy.

Methodology

Every entry cites primary research with verified DOIs. Each was independently fact-checked by deep-research agents against journal sources, replication papers, and current meta-analyses before drafting. Where the academic consensus is contested, both sides are represented fairly. Where the popular version overstates the academic version, the gap is named explicitly.

If you find an error or have a citation that should be considered, send it in. The hub is a living document and will be updated as new replication evidence is published.

Complete Index — All 161 Articles

The essay above introduces the hub's framework with hand-picked examples. Below is the comprehensive index — every article in the hub, grouped by category, in alphabetical order within each category.

Famous Studies Reinterpreted (20)

Asch Conformity Experiments: Real, But More Conditional Than The Popular Framing — Asch conformity studies replicate well — but the popular "people will always go along with the group" framing is wrong. The actual finding is conditional, public-not-private, and one dissenter destroys it.
Bargh Elderly Priming: The Day a Nobel Laureate Wrote a Letter Warning the Field of a "Train Wreck Looming" — In 1996, John Bargh published a study showing that subliminally exposing people to words about old age made them walk more slowly down a hallway. The finding became one of the most-cited demonstrations of unconscious…
Cognitive Dissonance: A Robust Theory That Got Oversold Into A Theory Of Everything — Festinger & Carlsmith 1959 is one of social psychology's most-cited findings. The core replicates under specific conditions. The marketing version, which explains everything from buyer's remorse to brand loyalty, is a di…
Daryl Bem's Precognition Studies: The Paper That Broke Social Psychology — In 2011, a Cornell psychologist published nine experiments in psychology's top journal showing people could feel the future. The methods looked normal. The replications failed. The fallout rebuilt the field.
Ego Depletion: How Willpower Became a Glucose Tank — And Why That Story Collapsed — For two decades, "willpower is a depletable resource you can run out of" was treated as settled science. Then 23 labs tried to replicate it and found nothing. Here is what actually happened to ego depletion, what surv…
James-Lange Theory of Emotion: The 1884 Framework, Updated And Partly Vindicated — In 1884 William James proposed that emotions are perceptions of bodily reactions. Cannon-Bard demolished the strong version in 1927. Then Damasio (1994) and Barrett (2017) brought a refined version back. The honest ca…
Milgram Obedience Experiments: What The Yale Archives Actually Show — The famous "65% of people will shock a stranger to death" finding is a real result from a single condition out of 24. When Gina Perry combed Milgram's Yale archives, the picture got more complicated — and more interes…
Money Priming: The Influential 2006 Effect That Modern Replications Cannot Find — In 2006, a Science paper showed that subtly reminding people of money made them less helpful and more self-sufficient. The effect was replicated through the late 2000s, cited thousands of times, and built a research p…
Power Posing: How One of TED's Most-Watched Talks Outlasted Its Own Science — Amy Cuddy's 2012 TED talk on power posing has been viewed over 75 million times. Her own co-author publicly disavowed the original finding in 2016. A large replication failed to find the hormonal effects. Here is what…
Rosenhan's "On Being Sane in Insane Places": The Foundation Study Cahalan Showed Was Largely Fabricated — In 1973, eight "pseudopatients" allegedly checked into 12 mental hospitals, fooled every psychiatrist, and triggered a half-century of policy reform. In 2019, Susannah Cahalan went looking for the evidence. Most of it…
Schachter & Singer 1962: The Foundational Emotion Theory That Failed Replication — Schachter & Singer 1962 is the canonical textbook account of how emotion works — physiological arousal plus cognitive label. Two careful replications in 1979 failed to reproduce either condition, and the 1983 comprehe…
Sherif's Robber's Cave Experiment: What Gina Perry Found In The Archives — For seventy years, Robber's Cave was the canonical proof that intergroup conflict emerges spontaneously when groups compete. Then Gina Perry pulled Sherif's notes out of the archive. What strategists building team fra…
The Bystander Effect and Kitty Genovese: How One Mistaken Newspaper Story Created a Field — and What CCTV Footage Reveals Now — For sixty years, the Kitty Genovese case anchored the most-taught finding in social psychology: that bystanders do nothing when someone is in trouble. The original news story was factually wrong, the lab effect is rea…
The Facial Feedback Hypothesis: A Pen in Your Teeth, a Camera in the Room, and a Detective Story About a Methodological Moderator — Hold a pen between your teeth and cartoons are supposed to seem funnier. Seventeen labs ran the experiment and found nothing. Then the original author proposed that a hidden video camera was the reason — and the next…
The Hawthorne Effect: The Famous Workplace Story That The Data Don't Support — For 70 years, every methodology textbook has cited the same story: workers at the Hawthorne Works got more productive when they were watched, regardless of what changed. The data the story was supposedly based on were…
The Jam Study and Choice Overload: When the Moderators Matter More Than the Main Effect — In 2000, a tasting booth at a California grocery store became one of the most-cited studies in consumer psychology. Six varieties of jam sold ten times better than twenty-four. The finding launched a thousand SaaS pri…
The Little Albert Experiment (Watson 1920): Methodological Disaster That Founded Behaviorism — Watson and Rayner's 1920 conditioning of a nine-month-old infant to fear a white rat is cited in every introductory psychology textbook as foundational proof of classical conditioning in humans. The actual study was a…
The Marshmallow Test: What Willpower at Age Four Actually Predicts (And What It Does Not) — For fifty years, the marshmallow test was cited as proof that willpower at age four predicts adult success. The 2018 replication, the 2020 critique, and the 2024 follow-up have rewritten the story. Here is what the ev…
The Stanford Prison Experiment: How a Famous "Truth" About Human Nature Came Apart — For 50 years, the Stanford Prison Experiment was the canonical proof that "ordinary people become evil under bad systems." Then the audio tapes came out. What strategists, founders, and consultants should learn from o…
Triplett 1898 Social Facilitation: The First Experimental Psychology Study, Re-Examined — Triplett 1898 is the canonical "first social psychology experiment." Stroebe 2012 re-examined the actual data and showed the original analysis would not survive modern significance testing. The phenomenon is real — bu…

Fraud & Misconduct (8)

Andrew Wakefield's MMR-Autism Fraud: The Most Consequential Medical Fraud Of The Century — A 12-child case series in The Lancet, undisclosed payments from plaintiffs' lawyers, a competing patent, and falsified timelines triggered a global vaccine panic that cost the UK its measles-elimination status. The retrac…
Brian Wansink and the Mindless Eating Lab: A Self-Inflicted Fraud Investigation — Brian Wansink shaped MyPlate, school lunches, and a generation of consumer-behavior research. Then he wrote a blog post praising a student for "salvaging" a failed dataset into four papers — and three statisticians wh…
Diederik Stapel: The 58-Retraction Fraud That Reshaped Social Psychology — A celebrated Dutch social psychologist fabricated entire datasets for more than a decade. Three junior colleagues finally raised the alarm. The 58 retractions that followed exposed how prestige and "clean" findings ca…
Hwang Woo-Suk Stem Cell Fraud: How A National Hero Fabricated The Cloning Breakthrough — In 2004-2005, Hwang Woo-Suk was on Korean postage stamps and topped every "Scientist of the Year" list — his Science papers claimed the first cloned human embryonic stem cell line. By 2006, both papers were retracted…
Jan Hendrik Schön Bell Labs Physics Fraud: 16 Papers Retracted From Science And Nature — Between 1998 and 2002, Jan Hendrik Schön at Bell Labs published a Nobel-trajectory string of breakthroughs in organic electronics — single-molecule transistors, organic superconductors, plastic lasers. In 2002 two out…
Marc Hauser And The Harvard Cognition Lab: The Fraud Case That Foreshadowed The Replication Crisis — In 2010, Harvard found one of its most celebrated cognitive psychologists "solely responsible" for eight instances of scientific misconduct. The case retracted papers in Cognition, forced corrections in Science, and a…
Michael LaCour and the Gay Marriage Canvassing Fraud: The Cleanest Unmasking In Social Science — In December 2014 a UCLA graduate student published a paper in Science claiming brief conversations with gay canvassers durably changed minds on same-sex marriage. Five months later, two Berkeley graduate students prov…
Theranos: The Fraud That Bypassed Scientific Peer Review Entirely — Theranos reached a $9B valuation around blood-testing claims that had never been published, peer-reviewed, or independently replicated. The canonical case study of how scientific-validation evasion can sustain a multi…

Methodology Landmarks (15)

Antidepressant Publication Bias: Turner 2008 NEJM On The SSRI Evidence Distortion — Erick Turner pulled FDA records on 74 antidepressant trials covering 12,564 patients, then compared them to what got published. The published literature inflated apparent efficacy by about a third. The methodology of…
Cargo Cult Science: Feynman's 1974 Caltech Address On Scientific Integrity — In June 1974, Richard Feynman delivered the Caltech commencement address that coined the phrase "cargo cult science." The talk was about fields that follow the outward forms of science — publication, statistics, termi…
Finance's Replication Crisis: Harvey-Liu-Zhu 2016 On Why Most "Anomalies" Don't Replicate — Cam Harvey, Yan Liu and Heqing Zhu reviewed 316 documented stock-return factors published over 50 years and concluded that the conventional t > 2 threshold is too low — most "anomalies" are noise. Hou-Xue-Zhang 2020 t…
HARKing — Hypothesizing After the Results are Known (Kerr 1998) — In 1998, Norbert Kerr gave a name to a practice every researcher recognized but no one had isolated: HARKing — presenting post-hoc findings as if they had been predicted in advance. The label changed the conversation…
Ioannidis 2005: "Why Most Published Research Findings Are False" Landmark — In 2005, a Stanford epidemiologist published an eight-page paper in PLOS Medicine with a title that read like a manifesto. The math was elementary Bayesian reasoning. The conclusion — that most "significant" findings…
Machine Learning's Reproducibility Crisis: NeurIPS Checklist And Code-Release Reforms — Machine learning has its own reproducibility crisis: papers without code, undocumented hyperparameters, cherry-picked seeds, undisclosed compute. Pineau 2021, Henderson 2018, and Lipton-Steinhardt 2019 mapped the fail…
Open Data And Code Sharing: The Infrastructure Reform Of The Replication Crisis Era — The replication crisis did not just diagnose a problem — it built infrastructure to fix it. Open data mandates, code sharing requirements, preregistration platforms, and funder data management plans are the structural…
P-Hacking and Researcher Degrees of Freedom: Simmons 2011 "False-Positive Psychology" — How Simmons, Nelson & Simonsohn 2011 proved researcher flexibility inflates false-positive rates from 5% to 60%+. The paper that coined "p-hacking" and rewrote what counts as evidence in behavioral science.
Publication Bias and the File Drawer Problem: Rosenthal 1979 — Rosenthal 1979 formalized the file drawer problem: null results sit in desk drawers while positive findings get published. Franco 2014 quantified it — nulls are 40% less likely to be written up. Turner 2008 showed ant…
Registered Reports: The Journal Format That Makes Replication Reform Stick — Registered Reports flip the order of peer review: the methodology is reviewed and accepted before the data are collected, and the journal commits to publishing the eventual paper regardless of results. The empirical r…
The Decline Effect: Jonah Lehrer's 2010 New Yorker Story That Predicted The Replication Crisis — Five months before Daryl Bem published his precognition paper, Jonah Lehrer told New Yorker readers that famous scientific findings tend to shrink over time as replications fail. The 2010 article named a pattern th…
The Garden of Forking Paths: Gelman-Loken 2013 On Implicit Multiple Comparisons — In 2013, Andrew Gelman and Eric Loken showed that good-faith researchers using standard methodology can produce noise-driven findings without any conscious p-hacking. The mechanism is the garden of forking paths — imp…
The Many Labs Replication Projects: The Field's Self-Audit — The Many Labs projects coordinated dozens of labs replicating classic psychology findings with huge samples and pre-registered protocols. Roughly half replicated. The other half didn't. Here is what survived, what fai…
The Reproducibility Project: Psychology (OSC 2015) — When The Field Replicated Itself And Found 39% — In August 2015, 270 researchers published the largest direct-replication project in the history of psychology. They tried to reproduce 100 prominent studies. 39% replicated. Here is what the Open Science Collaboration…
The Sokal Hoax (1996): A Physicist Punctures Postmodern Cultural Studies — In 1996 NYU physicist Alan Sokal published a deliberately nonsensical paper in the cultural-studies journal Social Text — then revealed the hoax the same day. The episode triggered the Science Wars and exposed that "p…
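
The "elementary Bayesian reasoning" mentioned in the Ioannidis 2005 entry above can be illustrated with a short calculation. This is a minimal sketch of the positive predictive value of a significant result, ignoring the bias term that the paper also models. The function name `positive_predictive_value` and the example inputs (prior probability, power) are hypothetical values chosen for illustration, not estimates taken from the paper.

```python
def positive_predictive_value(prior: float, power: float = 0.8, alpha: float = 0.05) -> float:
    """Share of 'significant' findings that reflect a true effect.

    prior: assumed fraction of tested hypotheses that are actually true
    power: probability a true effect reaches significance (1 - beta)
    alpha: probability a null effect reaches significance
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical scenarios: (prior, power). None of these are measured values.
    scenarios = {
        "well-powered test of a plausible hypothesis": (0.5, 0.8),
        "well-powered test of a long-shot hypothesis": (0.1, 0.8),
        "underpowered test of a long-shot hypothesis": (0.1, 0.2),
    }
    for label, (prior, power) in scenarios.items():
        ppv = positive_predictive_value(prior, power)
        print(f"{label}: PPV = {ppv:.2f}")
```

Under these assumed inputs, the share of significant results that are true falls below one half once low prior plausibility and low power combine, which is the core of the argument.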

Behavioral Economics & Decision-Making (31)

Base Rate Neglect: The Robust Reasoning Error In Diagnosis And Decisions (Anti-Example) — Tversky and Kahneman 1973 showed people ignore prior probabilities when given individuating descriptive information. Casscells 1978 found Harvard physicians made the same mistake on mammography. Bar-Hillel 1980 system…
Confirmation Bias: One Of The Most Robust Findings In Cognitive Psychology (Anti-Example) — Most behavioral findings in this hub collapsed under scrutiny. Confirmation bias did not. Sixty years of replication, thousands of studies, multiple paradigms, and effect sizes large enough to be operationally meaning…
Equity Premium Puzzle (Mehra-Prescott 1985): The Anomaly Standard Economics Cannot Explain — Rajnish Mehra and Edward Prescott showed in 1985 that the historical US equity premium of ~6% is roughly seventeen times larger than standard consumption-based asset pricing would predict. Forty years later the puzzle…
Hindsight Bias: One Of The Most Robust Findings In Cognitive Psychology (Anti-Example) — Most findings in this hub did not survive scrutiny. Hindsight bias did. Fischhoff 1975, hundreds of replications, a 122-study meta-analysis, and a confirmed mechanism — the "I knew it all along" effect is real, large,…
Hyperbolic Discounting / Present Bias: Real, Replicated, Magnitude Disputed — Present bias is one of the better-supported findings in behavioral economics — and one of the most over-extended. The qualitative effect replicates. The specific quantitative claims, including the famous gym-membershi…
Implicit Egotism / Name-Letter Effect: The "Dennis The Dentist" Claim Demolished — In 2002, Brett Pelham published a JPSP paper claiming people are disproportionately likely to become dentists if named Dennis, to marry partners with matching initials, and to move to cities resembling their own names…
Loss Aversion: What Survives of Behavioral Economics' Most Famous Idea — Losses loom twice as large as gains, and the human mind is built around that asymmetry — or so behavioral economics taught for forty years. A 2018 challenge in the Journal of Consumer Psychology argued the claim was o…
Mental Accounting: Thaler's Real Framework, Often Stretched Past Its Evidence — Richard Thaler's mental accounting framework is real, useful, and Nobel-recognized. Specific effects (the house money effect, income-source asymmetry, sunk-cost-by-account) have empirical support. Many popular m…
Money Illusion: Why Nominal Pay Beats Real Pay In Workers' Heads — Shafir, Diamond & Tversky 1997 showed people systematically evaluate money in nominal rather than inflation-adjusted terms. The effect replicated in Fehr-Tyran lab experiments and in Card-Hyslop and Akerlof labor-mark…
Prospect Theory: The Behavioral-Economics Framework That Actually Replicates (Anti-Example) — Most behavioral-science findings in this hub did not survive scrutiny. Prospect theory did. Across 47 years, multiple paradigms, and dozens of countries, Kahneman and Tversky's 1979 framework remains the gold standard…
Representativeness Heuristic: Tversky-Kahneman's Foundational Bias Framework (Anti-Example) — Tversky and Kahneman 1971/1972/1974 built a framework in which people judge probability by similarity to a prototype rather than by base rates. It generated the Linda problem, the gambler's fallacy, base-rate neglect,…
Save More Tomorrow (Thaler-Benartzi 2004): The Behavioral Intervention That Actually Worked — Thaler and Benartzi's 2004 Save More Tomorrow program is the rare behavioral economics intervention that produced a measurable, scalable, decades-durable real-world result. Field-tested at Ispat Inland Steel, codified into the…
The Affect Heuristic: Slovic's Real-But-Conditional Decision-Theory Finding — Paul Slovic's affect heuristic is one of the rare behavioral-economics findings that has held up across two decades of replication, but the magnitude of the effect depends heavily on time pressure, expertise, and the s…
The Allais Paradox: The Decision Theory Result That Forced Economics To Update (Anti-Example) — In 1953, French economist Maurice Allais demonstrated systematic violations of expected utility theory using simple choice pairs. The result replicated for 70 years across labs, countries, and stakes. It motivated pro…
The Cognitive Reflection Test: A 3-Question Anti-Example That Predicts Decision Quality — Shane Frederick’s 3-question Cognitive Reflection Test (CRT) is one of the cleanest anti-examples in behavioral science — a tiny measure that robustly predicts heuristics-and-biases susceptibility, intertemporal patie…
The Conjunction Fallacy / Linda Problem: Real Effect, Contested Interpretation — Tversky and Kahneman 1983 showed people judge "Linda is a bank teller and a feminist" more probable than "Linda is a bank teller", violating basic probability. The effect is robust. But Hertwig and Gigerenzer 1999…
The Decoy Effect: The Pricing-Page Tactic That Doesn't Replicate — The decoy effect — adding a worse option to push customers toward a target — became a foundational SaaS pricing tactic on the strength of Huber, Payne & Puto (1982) and Dan Ariely's Economist example. In 2014, two lar…
The Default Effect: The Behavioral-Economics Finding That Actually Holds Up — Most behavioral-science findings in this hub did not survive scrutiny. The default effect did. Across decades of replication, across countries, across decision domains — automatic enrollment and opt-out architecture r…
The Dictator Game: Cross-Cultural Generosity That Replicates (Anti-Example) — The dictator game strips away the strategic reasoning of the ultimatum game and asks: when a person has total power and zero rejection risk, do they still give? Across 616 studies, 20,813 dictators, and dozens of cult…
The Disposition Effect: The Robust Investor Bias That Explains Holding Losers, Selling Winners — In 1985, Hersh Shefrin and Meir Statman formalized one of the most robust findings in behavioral finance: investors sell winners too early and hold losers too long. Odean 1998 replicated it on 10,000 brokerage account…
The Dunning-Kruger Effect: Real Phenomenon Or Mostly A Statistical Artifact? — Kruger & Dunning's 1999 paper became one of the most memed findings in psychology: incompetent people supposedly don't know they're incompetent. The empirical reality is more uncomfortable. The classic Dunning-Kruger…
The Easterlin Paradox + Killingsworth Reversal: Income And Happiness, Updated — The famous "$75K plateau" still gets cited as fact. A 2023 adversarial collaboration between Kahneman and Killingsworth reconciled the apparent contradiction — and the modern consensus is more nuanced than the pop-psy…
The Ellsberg Paradox: Ambiguity Aversion And Robust Decision Theory (Anti-Example) — Daniel Ellsberg 1961 demonstrated systematic ambiguity aversion violating subjective expected utility. Robustly replicated for 60+ years. Explains equity premium, insurance puzzles, and home bias. An anti-example of a…
The Endowment Effect: Real, But More Conditional Than Pricing Playbooks Imply — Kahneman, Knetsch & Thaler 1990 showed Cornell students demanded roughly twice as much to sell a coffee mug as students without mugs would pay for one. The finding became the backbone of behavioral pricing. The 2005 P…
The Framing Effect: A Robust Anti-Example From Prospect Theory — Most behavioral findings in this hub collapsed under scrutiny. Tversky and Kahneman 1981's Asian disease problem did not. Forty-plus years of replication, two meta-analyses, and a Many Labs result of 31 of 36 labs hit…
The Gambler's Fallacy: A Robust Cognitive Bias That Predicts Real Behavior (Anti-Example) — After Miller & Sanjurjo 2018 reversed the hot hand fallacy, some readers asked whether the gambler's fallacy was also a statistical illusion. It is not. Across lab studies, casino video, lottery purchase records, and…
The Halo Effect: One Of The Most Robust Findings In A Century Of Social Psychology — Most behavioral-science findings in this hub did not survive scrutiny. Thorndike 1920 did. A century after the original paper, the halo effect remains one of the largest, most replicated, and most operationally conseq…
The Just-World Hypothesis: A Real Bias That Gets Stretched Beyond The Evidence — Melvin Lerner's 1980 just-world hypothesis is real but conditional — the lab effect replicates, but population-level political applications are speculative. Here is what the evidence actually supports and what it does…
The Sunk Cost Fallacy: The Bias That Predicts Why Bad Projects Survive — Most behavioral findings in this hub did not survive scrutiny. The sunk cost fallacy did — across labs, species, and a 2025 Journal of Finance paper that tracked the bias inside real corporate acquisitions. Here is wh…
The Ultimatum Game Across Cultures: The Behavioral-Economics Finding That Was Real But Not Universal (Anti-Example) — The ultimatum game replicates in every society researchers have tested it in. The specific quantitative pattern from Western undergraduates does not. Henrich and colleagues 2001 ran the game in 15 small-scale societie…
The Wason Selection Task: Where Context Beats Abstract Logic — Most people fail the abstract Wason card task; only around 10 to 25 percent solve it. Reframe the same logical structure as a social-contract violation and performance jumps to 70 to 90 percent. The empirical pattern is rob…

Cognitive Psychology & Perception (9)

Change Blindness: Anti-Example Companion To The Invisible Gorilla — Simons and Levin 1998 Door Study showed about half of people miss a stranger swapping in mid-conversation. Three decades of replication confirmed the result. Change blindness is robust, mechanistically grounded, and q…
Inattentional Blindness / The Invisible Gorilla: A Famous Demonstration That Replicates (Anti-Example) — Most famous psychology demonstrations did not survive the replication crisis. The invisible gorilla did. Simons and Chabris 1999 inattentional blindness held up across hundreds of studies, including a 2013 radiologist…
Libet's Free Will Experiments: The "Readiness Potential" Reinterpreted — Libet 1983 showed brain activity preceded conscious intention by 350ms. The popular reading: "your brain decides before you do." Four decades of follow-up neuroscience, including Schurger 2012 and 2021, no longer supp…
Loftus Eyewitness Memory: The Robust Cognitive Psychology That Reshaped Criminal Law (Anti-Example) — Most of this hub catalogs cognitive-psychology findings that did not hold up. Elizabeth Loftus's 50-year program on eyewitness memory and the misinformation effect is the opposite case — replicated across thousands of…
The Mere Exposure Effect: Familiar Equals Liked (Anti-Example) — Most behavioral-science findings in this hub did not survive scrutiny. Zajonc's 1968 mere exposure effect did. Across five decades of replication, hundreds of studies, and a 2017 meta-analysis of 256 effects, the basic…
The Spotlight Effect: The Robust Bias That You Stand Out Less Than You Feel — The spotlight effect (Gilovich, Medvec & Savitsky 2000) is one of social psychology's survivors of the replication crisis. People overestimate how much others notice them by roughly 2x. The Barry Manilow t-shirt stu…
The Stroop Effect: Cognitive Psychology's Most-Replicated Finding (Anti-Example) — Most psychology findings in this hub did not survive scrutiny. The Stroop effect did — millions of times across labs, populations, and decades. It is the benchmark for what a robust cognitive finding actually looks li…
Wegner's Illusion of Conscious Will: Real Insight Or Overstated Philosophy? — Daniel Wegner's 2002 book argued the feeling of willing is a constructed inference, not direct perception of mental causation. The lab paradigm replicates. The "free will is an illusion" headline does not follow. Here…
Wegner's White Bear / Ironic Process: Real, But Less "Ironic" Than The Popular Framing — Wegner's 1987 "don't think of a white bear" study spawned ironic process theory, became gospel in clinical psychology, sports coaching, and self-help. The lab effect replicates modestly. The "theory of everything bad…

Asch Conformity Cross-Cultural Variation: What Bond & Smith 1996 Showed (Anti-Example) — Asch conformity replicates everywhere — but its magnitude varies dramatically by culture. Bond & Smith 1996 meta-analyzed 133 studies across 17 countries (24,617 participants). The headline US number is not the univer…
Door-In-The-Face: The Persuasion Technique That Actually Replicates (With Specific Conditions) — Door-in-the-face (Cialdini 1975) is one of the better-replicated persuasion paradigms — a 2021 preregistered replication recovered the original effect 46 years later. But it works only under specific conditions: same…
Festinger's Social Comparison Theory: One Of Social Psychology's Most Robust Frameworks (Anti-Example) — Festinger's 1954 social comparison theory has held up across 70 years of evidence: the Suls, Martin & Wheeler 2002 review and modern social-media work by Vogel and Fardouly all converge. It serves as an anti-example for evidence evaluation.
Foot-In-The-Door: The Persuasion Tactic That Works (Just Less Than Marketers Think) — Freedman & Fraser 1966 launched a billion-dollar funnel doctrine. The meta-analytic effect is r ≈ 0.17 — real, but a fraction of what the popular framing suggests. Calibrate before you build a strategy on it.
Granovetter's Strength of Weak Ties: The Sociology Finding That Holds Up (Anti-Example) — Most behavioral-science findings cataloged in this hub did not survive scrutiny. Granovetter's 1973 strength-of-weak-ties hypothesis did — and in 2022 a 20-million-user LinkedIn experiment published in Science deliver…
Six Degrees of Separation: Milgram's Small-World Study, Cracks And Vindication — The famous "six degrees of separation" claim came from a 1969 Milgram-Travers study where only 64 of 296 chains completed. Judith Kleinfeld's 2002 critique gutted the original evidence. But Watts 2003 and Backstrom 20…
The Bystander Intervention Effect: More Conditional Than The Famous Story Suggests — The Darley-Latane bystander effect is one of the most-cited findings in social psychology. The Fischer 2011 meta-analysis says it is real but conditional; the Philpot 2020 CCTV study found intervention in 91% of real…
The Trolley Problem And Moral Psychology: Greene 2001 fMRI Findings And Their Limits — Greene 2001 used fMRI and the trolley problem to argue emotion drives deontological moral judgments while reason drives utilitarian ones. The original imaging effects have analytic concerns, the dual-process model has…
WEIRD Critique: Why Psychology's "Universal" Findings Came From The Weirdest Population — Roughly 96% of psychology samples come from countries housing only 12% of the world's population. Henrich, Heine & Norenzayan 2010 showed those samples are systematic outliers — visual perception, fairness, cooperation,…
Wisdom of Crowds: A Real Phenomenon With Important Conditions (Anti-Example) — Galton 1907 showed that 787 fairgoers estimating an ox's weight collectively landed within 1% of the actual value. Surowiecki popularized it. Lorenz 2011 PNAS showed social influence destroys the effect. Conditions matter — independence…

Medical Reversals (9)

Acupuncture Sham-Trial Evidence: Real Effect, But Not From Where You'd Think — Decades of acupuncture trials with sham-needle controls show a consistent pattern: real acupuncture beats no-treatment, but barely beats sham. Most of the apparent benefit is the therapeutic ritual, not the meridian-p…
Aspirin For Primary Prevention: The 2018 Reversal Of Decades-Long Cardiology Guidance — For decades, healthy adults were told to take baby aspirin to prevent heart attacks. In 2018, three large trials (ASPREE, ARRIVE, and ASCEND), published essentially simultaneously, reversed the guidance entirely.
Beta-Carotene And Cancer Prevention: The CARET Trial Reversal — In the 1980s, observational epidemiology was unanimous: beta-carotene prevented cancer. Two large randomized trials — CARET and ATBC — were stopped early in the 1990s because the supplement was killing the people it w…
Cancer Screening Overdiagnosis: When Finding Cancer Earlier Doesn't Save Lives — For decades "screening saves lives" was treated as obviously true. Modern epidemiology has documented a systematic problem: overdiagnosis. The Korean thyroid epidemic, mammography reviews, and PSA evidence all show th…
Hormone Replacement Therapy: The WHI Trial Reversal That Changed Women's Medicine — For a decade, observational studies suggested hormone replacement therapy cut postmenopausal heart disease by a third. In July 2002, the largest randomized trial ever run on HRT was stopped early because the treatment…
PREDIMED Trial: The Mediterranean Diet Study That Got Retracted And Re-Published — In 2013, a 7,447-person Spanish RCT showed the Mediterranean diet cut major cardiovascular events by ~30%. Five years later, the New England Journal of Medicine retracted it after anesthetist John Carlisle ran a stati…
The DSM And Diagnostic Reliability: When The Manual Itself Replicates Poorly — The DSM-5 field trials (2010-2012) tested whether two clinicians evaluating the same patient would assign the same diagnosis. Major Depressive Disorder came back with a kappa of 0.28. Generalized Anxiety Disorder came…
The Saturated Fat / Diet-Heart Hypothesis: The Nutritional Consensus That Got Substantially Revised — For half a century, US dietary guidelines told Americans to avoid saturated fat to prevent heart disease. Then meta-analyses of 21, then 76, then dozens more cohorts came in — and the association largely was not there…
The Stress-Causes-Ulcers Myth: How Medical Consensus Was Wrong For Decades — For most of the 20th century, medicine taught that stress, diet, and excess acid caused peptic ulcers. In 1982, two Australian researchers discovered a bacterium that explained almost all of them. The field resisted…

Economics & Policy (8)

Acemoglu-Robinson Institutions: The 2024 Nobel-Winning Program (Anti-Example) — Most empirical economics in this hub falls apart on close inspection. The Acemoglu-Johnson-Robinson research program on institutions did not. Twenty-five years of critique, replication, and counter-critique have left…
Axelrod's Tit-for-Tat Tournament: The 40-Year-Old Game Theory Result That Still Holds (Anti-Example) — Most of the famous findings in this hub did not survive scrutiny. Axelrod's 1980 tit-for-tat tournament did. Forty years of adversarial testing, biological field replications, and mathematical refinements have left th…
Broken Windows Theory: The Atlantic Essay That Reshaped Policing On Weak Evidence — In 1982, two academics published a nine-page essay in The Atlantic Monthly arguing that visible disorder causes serious crime. It was not empirical research. It was a theoretical argument illustrated by Philip Zimbard…
Card & Krueger's Minimum Wage Study: An Anti-Example Of Robust Empirical Economics — Most empirical findings in this hub did not survive replication. Card and Krueger 1994 did. Thirty years of hostile attempts to overturn the New Jersey-Pennsylvania fast-food study: payroll re-analysis, contiguous-…
Reinhart-Rogoff "90% Debt Threshold": The Excel Error That Shaped Global Austerity — In 2010, two Harvard economists published a short paper claiming that public debt above 90% of GDP collapses growth. Politicians from Paul Ryan to Olli Rehn cited it to justify austerity. Three years later, a graduate…
Tetlock's Superforecasting: An Anti-Example Of Rigorous Behavioral Science That Actually Predicts — Most behavioral-science findings cataloged in this hub did not survive scrutiny. Tetlock's superforecasting research did. Large preregistered samples, multi-year tournaments, identifiable traits, and independent repli…
The Hot Hand Fallacy: The "Cognitive Bias" That Two Statisticians Reversed In 2018 — For thirty years, every behavioral economics textbook taught the "hot hand fallacy" — basketball players never have streaky shooting, fans only think they do. In 2015, two statisticians found a subtle methodological b…
The Lucas Critique: Why Macroeconomic Models Break When Policy Changes — In 1976, Robert Lucas Jr. published a 28-page paper that destroyed an entire generation of macroeconomic forecasting models. His argument was simple, devastating, and impossible to ignore: you cannot use historical da…

A/B Testing & Experimentation (9)

Novelty and Primacy Effects in A/B Testing: Why Your First-Week Lift Disappears — Even with perfect statistical hygiene, a five-day test on a UI change is measuring "users react to change" not "users prefer this design." Novelty fades within weeks; primacy resolves within months. Weekly-cadence tes…
Sample Ratio Mismatch (SRM): The A/B Testing Quality Check Most Teams Skip — Microsoft found ~6% of its A/B tests have a broken traffic split that silently invalidates the results. Most teams never run the one-line check that would catch it. Why SRM is the highest-ROI gate experimentation prog…
Simpson's Paradox in A/B Testing: When Your Overall Lift Hides A Loss For Most Users — A variant can show a +3% lift overall while losing in every device segment, every geography, and every cohort. This is Simpson's Paradox — the same statistical reversal that made 1973 UC Berkeley look like it was disc…
Statistical vs Practical Significance: Why Your "Significant" A/B Test Result Doesn't Matter — Statistical significance (p < 0.05) and practical significance (effect large enough to matter to the business) are independent. High-traffic platforms detect 0.02% lifts at p < 0.001 — and ship them. That is the confu…
The Multiple Comparisons Problem in A/B Testing: Why Running 20 Variants Guarantees A False Winner — At alpha = 0.05, running 20 independent A/B tests produces at least one false positive with probability 64%, even when no variant has a real effect. The math is 89 years old and unambiguous. Most CRO programs apply no…
The Peeking Problem in A/B Testing: The Statistical Mistake That Inflates Your False-Positive Rate To 40%+ — Standard frequentist A/B testing assumes one look at the data. Every additional peek inflates the false-positive rate above the nominal 5%. Continuous monitoring of a fixed-horizon test pushes it past 30% — and that i…
The Surrogate Metric Trap in A/B Testing: When Your Test Win Hurts Long-Term Outcomes — Short-term proxy metrics in A/B testing can move in the opposite direction of the long-term outcomes they were meant to predict. Hohnhold 2015 at Google showed short-term ad-revenue lifts erode long-term ad-clicks. Pr…
The Winner's Curse in A/B Testing: Why Your Test Wins Always Get Smaller In Production — When you ship the variant with the highest observed lift, you are selecting on true effect plus lucky noise. The noise does not replicate. The math guarantees that aggregate "test wins" overstate aggregate "production…
Twyman's Law in A/B Testing: When Your Results Are Too Good To Be True, They Probably Are — British market researcher Tony Twyman gave us the most reliable diagnostic rule in applied statistics: any figure that looks interesting or different is usually wrong. In A/B testing, a 30% lift on a button-color test…
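
The 64% figure in the multiple-comparisons entry above can be checked directly. The following is a minimal Python sketch, assuming the 20 tests are independent and that no variant has a real effect. The function names `familywise_error_rate` and `bonferroni_alpha` are illustrative, and the Bonferroni correction is shown only as one common, conservative remedy, not as the method any entry in this hub prescribes.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests
    when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_alpha(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / num_tests


if __name__ == "__main__":
    n = 20
    uncorrected = familywise_error_rate(n)
    per_test = bonferroni_alpha(n)
    corrected = familywise_error_rate(n, per_test)
    print(f"{n} tests at alpha=0.05: P(at least one false winner) = {uncorrected:.0%}")
    print(f"Bonferroni per-test threshold: {per_test}")
    print(f"Familywise rate after correction: {corrected:.3f}")
```

With the correction, each of the 20 variants has to clear p < 0.0025 instead of p < 0.05, which pulls the familywise rate back under 5% at the cost of statistical power.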

Education, Self-Help & Pop-Psychology (24)

Bandura's Bobo Doll: A Foundational Study Whose Real Findings Were Much More Modest — Bandura's 1961 and 1963 Bobo doll studies are cited as definitive evidence that watching aggressive models causes generalized aggression — including the media-violence chain that runs through video games. The original…
Bandura's Self-Efficacy: The Personality Construct That Actually Replicates (Anti-Example) — Self-esteem collapsed under Baumeister's 2003 review. Grit dissolved into conscientiousness. Self-efficacy did neither. Stajkovic and Luthans found r = 0.38 across 114 workplace studies. Multon found r = 0.38 across a…
Birth Order Personality Effects: The Family Folklore That Disappeared In A 377,000-Person Sample — For a century, parenting books, family therapy curricula, and leadership commentary have leaned on the claim that birth order shapes personality in important ways. Modern large-sample studies — including one with 377,…
Class Size Effects: The Tennessee STAR Experiment And What Education Actually Knows — Most education research is observational, confounded, and contested. Tennessee STAR was different — a true randomized controlled trial of 11,500 kids across 79 schools, tracked into adulthood. The qualitative claim su…
CliftonStrengths / StrengthsFinder: The Gallup Product Academic Psychology Doesn't Use — CliftonStrengths has been taken 30+ million times and is embedded in corporate L&D at 90% of Fortune 500 companies. Yet academic psychology essentially does not use the instrument — independent peer-reviewed validatio…
Csikszentmihalyi's Flow: A Beloved Construct With A Surprisingly Thin Lab Foundation — Mihaly Csikszentmihalyi's "flow" is a real subjective experience that replicates reliably in self-report. The eight-component model and the famous skill-challenge channel are far thinner in the lab than the consultanc…
Goleman's Emotional Intelligence: The Best-Selling Construct Academic Psychology Doesn't Recognize — Daniel Goleman's 1995 book sold 5+ million copies and launched a multibillion-dollar EI coaching and assessment industry. The empirical reality: meta-analyses (Joseph & Newman 2010; O'Boyle 2011) show "mixed-model" EI…
Growth Mindset: When the Effect Is Real But a Tenth the Size You Were Told — Carol Dweck's growth mindset theory has shaped curricula at hundreds of school districts, training programs at Fortune 500 companies, and a generation of parenting advice. Large meta-analyses now show effects of r = 0…
Howard Gardner's Multiple Intelligences: A Theory That Won The Classroom Without Winning The Lab — Howard Gardner's 1983 Multiple Intelligences theory transformed K-12 curriculum and corporate L&D — but the most rigorous direct empirical test (Visser, Ashton & Vernon 2006) found all eight intelligences load on a si…
Kübler-Ross's Five Stages Of Grief: The Most-Taught Change Framework With No Empirical Foundation — Denial, anger, bargaining, depression, acceptance — the famous five stages are the backbone of grief counseling, palliative care training, and corporate "change curve" decks. The empirical record, from Maciejewski 200…
Maslow's Hierarchy of Needs: The Pyramid Maslow Never Drew, The Evidence He Never Produced — Abraham Maslow's hierarchy of needs is the most-taught motivation framework in management education. Two facts the slide deck leaves out: Maslow himself never drew the famous pyramid (Bridgman 2019 traces it to 1960s…
Neuro-Linguistic Programming (NLP): The Sales-Coaching Framework Academic Psychology Considers Pseudoscience — NLP is the sales-coaching framework that academic psychology has tested most cleanly and rejected most decisively. The preferred-representational-system hypothesis fails. Eye-accessing cues do not work. A 2012 systematic…
Polyvagal Theory: Popular In Trauma Therapy, Contested By Mainstream Neuroscience — Polyvagal theory is a wellness-industry blockbuster invoked in millions of therapy sessions and corporate "trauma-informed" trainings — and a framework that mainstream comparative neuroanatomy has substantially reject…
Pygmalion Effect: The Self-Fulfilling Prophecy That Mostly Wasn't — Rosenthal & Jacobson 1968 became management gospel — expect great things and people will deliver. The actual 1968 effects were largely confined to grades 1-2, the IQ test was psychometrically inappropriate for young c…
Spaced Repetition and Testing Effect: The Learning Science That Actually Replicates (Anti-Example) — Two of the most robust findings in cognitive psychology — spaced repetition and retrieval practice — have a century of replication, large effect sizes, mechanistic grounding, and applied tools that work. Corporate L&D…
The 10,000 Hours Rule: What Ericsson Studied vs. What Gladwell Popularized — Malcolm Gladwell's 2008 Outliers introduced the '10,000 hours rule' to a generation of managers. The researcher whose work it cited, K. Anders Ericsson, spent the next decade publicly disowning the framing. Here's wha…
The Big Five Personality Traits: An Anti-Example Of A Personality Model That Actually Replicates — Most popular personality assessments — MBTI, DISC, Enneagram, Insights Discovery — have weak or no empirical foundations. The Big Five (OCEAN) is the exception. It was discovered through factor analysis, replicates ac…
The Fredrickson-Losada 3:1 Positivity Ratio: When Pseudo-Math Pretends To Be Science — Barbara Fredrickson sold a precise mathematical threshold for human flourishing — 2.9013 positive emotions per negative one — derived from differential equations. In 2013, a grad student, a physicist, and a happiness…
The IKEA Effect: A Real "Labor-Of-Love" Bias With Specific Conditions — The IKEA effect is one of the few consumer-psychology findings that has held up under scrutiny, but only under specific conditions: completion has to succeed, the effort has to be non-trivial, and the user has to a…
The Mozart Effect: How a 36-Person Study Became a State Policy — And Why It Was Never There — A 1993 single-page note in Nature with 36 college students sparked a baby-genius industry, a state law in Georgia mandating classical music for newborns, and a generation of confident parenting advice. The meta-analys…
The Myers-Briggs (MBTI): A Personality Test Academic Psychology Considers Pseudoscience — The MBTI is used by an estimated 88% of Fortune 500 companies and generates $20+ million/year in revenue, yet academic personality psychology essentially does not use it. Pittenger (1993, 2005) showed ~50% of takers g…
The Reading Wars: Phonics vs Whole Language And The "Science Of Reading" Resolution — For three decades US elementary schools taught reading using methods the research evidence did not support. The National Reading Panel report in 2000 and the Castles, Rastle & Nation review in 2018 both concluded syst…
The Self-Esteem Movement: The Universal Solvent That A Comprehensive Review Said Wasn't — In 1986, California funded a task force to study self-esteem as a "social vaccine" against crime, drug abuse, and welfare. A generation of policy followed. In 2003, a 44-page review of ~15,000 studies concluded the fo…
Tuckman's Forming-Storming-Norming-Performing: The Team Development Model With Weak Evidence — Bruce Tuckman's 1965 four-stage model dominates leadership training despite weak empirical foundations. The original paper was a literature review of therapy groups, not a longitudinal study of work teams — and Gersic…

Other (18)

Begley & Ellis 2012: When Amgen Tried To Replicate Cancer Research And Found 89% Failure — In 2012, two cancer biologists published a Nature commentary that detonated a quiet bombshell. Amgen had tried to replicate 53 landmark cancer biology papers. Six confirmed. Forty-seven did not. The 89% failure rate b…
Bem's Self-Perception Theory: Real Alternative Or Complementary Framework? — Daryl Bem's 1967 self-perception theory was framed as a rival to cognitive dissonance — same predictions, no inner discomfort. Fazio 1977 showed both are real, operating in different domains. The synthesis is the usef…
Cohen's d And The Misuse Of "Small/Medium/Large" Effect Sizes — Cohen offered d=0.2/0.5/0.8 "with some trepidation" as a last resort for fields with no better referent. The field turned a borrowed convention into a law of nature: dismissing "small" effects that compound and inflat…
Festinger's "When Prophecy Fails": The Case Study That Created A Theory — Festinger, Riecken and Schachter's 1956 study of a UFO doomsday cult is one of the most famous case studies in social psychology — and one of the most methodologically weak findings still routinely cited as evidence f…
Grit: Real, But Barely Distinguishable From Conscientiousness — Angela Duckworth's grit research became one of the most successful behavioral-science exports of the 2010s — TED talk, MacArthur grant, bestselling book, school curricula, military screening. The 2017 Credé meta-analy…
Linguistic Relativity (Sapir-Whorf): The Strong Version Is Dead, The Weak Version Lives — The strong claim that language determines what you can think is dead — killed by the Eskimo-snow hoax and color-term universals. The weak claim that language nudges perception and memory survives in careful experiment…
Mussweiler-Strack Numeric Anchoring: Wide-Ranging Robust Effects Beyond Pricing (Anti-Example) — Most behavioral findings in this hub collapsed under scrutiny. Numeric anchoring did not. From the Strack-Mussweiler 1997 mechanism paper through Englich on judges, Galinsky on negotiation, Northcraft on real-estate e…
Naive Realism: The Conviction That You See Reality As It Is — Naive realism — Ross & Ward (1996) — is the conviction that you perceive the world objectively and that disagreement signals bias or ignorance. The robust, well-replicated root of the hostile media effect, the bias bl…
Net Promoter Score (NPS): The "One Number You Need To Grow" That Doesn't — Frederick Reichheld's 2003 HBR article introduced NPS as the single growth-predicting metric, and ~80% of Fortune 1000 companies adopted it. Independent peer-reviewed work — Keiningham et al. (2007), Morgan & Rego (20…
Replication Markets: Prediction Markets For Whether A Finding Will Replicate — Dreber 2015 PNAS showed researcher prediction markets forecast replication outcomes with ~71% accuracy. Camerer 2018 confirmed it in Nature Human Behaviour. The implication: experts collectively know which findings are weak, but publ…
Stereotype Threat: The Effect That Got Smaller Every Time We Looked — Steele & Aronson 1995 became the empirical foundation for a generation of DEI policy, diversity training, and educational reform. Meta-analyses by Stoet & Geary, Flore & Wicherts, and Shewach & Sackett showed the unde…
The Availability Heuristic: A Foundational Finding That Has Held Up For 50 Years — Most behavioral-economics findings in this hub did not survive scrutiny. Tversky and Kahneman's 1973 availability heuristic did. Across five decades of replication, in domains as varied as risk perception, jury decisio…
The Barnum/Forer Effect: Why Personality Tests And Horoscopes Feel So Accurate — In 1949, Bertram Forer gave 39 students a "personalized" profile to rate. The mean was 4.26 of 5 — yet every student got the identical sketch, lifted from an astrology book. Why generic descriptions feel uniquely true…
The False Consensus Effect: Why You Think Everyone Agrees With You — The false consensus effect — Ross, Greene & House (1977) — is the egocentric bias of overestimating how many others share your beliefs and choices. The "Eat at Joe's" sandwich-board study and what it predicts about ma…
The Implicit Association Test: The Bias Tool That Doesn't Predict Bias — The IAT is the world's most-used measure of implicit bias — over 40 million tests completed, deployed in DEI training across federal agencies and Fortune 500 companies. Independent meta-analyses since 2013 show IAT sc…
The Overjustification Effect: When Rewards Backfire (And When They Don't) — In 1973, Stanford preschoolers who were promised a "Good Player" award for drawing later spent half as much free time drawing as kids who got nothing. The takeaway became "rewards kill motivation." The real science is…
The Power Of "Because": Langer's Copy-Machine Study, Honestly Read — The Langer 1978 copy-machine study is the most-cited justification in copywriting for "always include the word because." The original paper actually shows the placebic-reason magic disappears the moment the request st…
Yerkes-Dodson Law: The "Inverted U" Performance Claim That's Real But Oversold — In 1908, Robert Yerkes and John Dodson reported that mice learned a discrimination task fastest at intermediate shock intensities. A century later, that finding has become the "Yerkes-Dodson Law" — an inverted-U curve…

replication-crisis Behavioral Science evidence-evaluation Leadership hub

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified behavioral economist (1 of ~1,000 worldwide). 200+ A/B tests across energy, SaaS, fintech, e-commerce, and marketplace verticals.
