In late August 2011, three junior researchers at Tilburg University in the southern Netherlands took a problem to the head of their department that they had been afraid to name. The man whose data they were quietly questioning was the dean of their own faculty, the School of Social and Behavioral Sciences: Diederik Stapel, a Dutch social psychologist whose career had been an uninterrupted ascent through the field’s most prestigious journals and most coveted awards.

What they wanted to say, and had hesitated for months to say, was that Stapel’s data felt wrong. Not “edge-case wrong” — wrong in the way only fabricated data can be wrong. The effects were too clean. The descriptive statistics were too tidy. The patterns lined up with hypotheses with a regularity that real human behavior rarely permits.

Within days, the rector had suspended Stapel from his duties. Within weeks, Stapel confessed. By late 2012, three investigative committees (one at Tilburg chaired by the eminent psycholinguist Willem Levelt, plus committees at the University of Amsterdam and the University of Groningen) had documented what is still arguably the most consequential research fraud case in modern social psychology. Over the years that followed, Retraction Watch tallied 58 retractions attributable to Stapel’s fabrications, spanning more than a decade of work, dozens of co-authors, and at least 10 doctoral dissertations supervised by him.

This is the story of how it happened, what his colleagues missed for ten years, what the field did about it — and what every strategist who cites “research-backed” findings should learn from it.

Who Stapel Was

Diederik Alexander Stapel was not a marginal figure. By the late 2000s he was one of the most visible social psychologists in Europe.

He earned his PhD at the University of Amsterdam in 1997 and joined the University of Groningen in 2000, where he was promoted to full professor. In 2006 he moved to Tilburg University as a chaired professor of social and cognitive psychology, founding the Tilburg Institute for Behavioral Economics Research (TiBER). In September 2010 — less than a year before his fraud unraveled — he was appointed Dean of the School of Social and Behavioral Sciences at Tilburg.

His publication record was a who’s-who of the field’s most prized venues: the Journal of Personality and Social Psychology (JPSP), Personality and Social Psychology Bulletin (PSPB), Journal of Experimental Social Psychology, and in April 2011 a co-authored paper in Science — “Coping with Chaos: How Disordered Contexts Promote Stereotyping and Discrimination” (Stapel & Lindenberg, 2011, Science, 332(6026), 251–253) — which reported that exposure to messy environments (litter, broken sidewalks) increased stereotyping behavior. The paper was widely covered in the popular press as a tidy real-world extension of “broken windows” theory.

He had won the Career Trajectory Award from the Society of Experimental Social Psychology in 2009, an award given to mid-career psychologists for sustained, exceptional contributions. He was producing graduate students. He was on editorial boards. His Tilburg appointment as dean placed him in administrative authority over the very junior colleagues who would, less than a year later, come forward with the suspicion that ended his career.

This is the prestige profile that matters for what comes next. Stapel was not a peripheral figure whose work was easy to dismiss. He was, by every external measure the field uses to evaluate excellence, exactly the kind of researcher whose work you would assume had been rigorously vetted by reviewers, co-authors, editors, and the apparatus of normal science.

The Whistleblowers And The Investigation

Three junior researchers — never publicly named, identified in the Levelt commission’s reports only as the people whose actions prompted the investigation — had been noticing things for months before they decided to act.

The reporting in Yudhijit Bhattacharjee’s New York Times Magazine investigation (“The Mind of a Con Man,” April 26, 2013) and in the Levelt committee’s interim report (October 31, 2011) describes the kind of small, accumulating anomalies that are easy to second-guess in isolation but become impossible to ignore in aggregate. Rows of data with implausibly identical scores. Effect sizes that were too consistent across studies that should have produced more variance. Surveys reportedly conducted at high schools where, when one researcher tried to verify, the schools did not appear to have ever hosted the studies. Raw data files that Stapel was reluctant to share, or could not produce when asked. Studies that one of the whistleblowers had collaborated on personally — and where the data Stapel returned did not match the methodology that had been agreed.

The hesitation to raise these concerns is part of the story. The accusers were untenured. The accused was the dean of their faculty, an internationally prominent scholar, and someone who controlled their professional futures. The Levelt report and subsequent reporting both note that the whistleblowers had been gathering corroborating evidence for an extended period before approaching the administration — precisely because the cost of being wrong, in social and career terms, would have been catastrophic.

On September 7, 2011, Tilburg University suspended Stapel. Following the rector’s announcement, three formal investigative committees were chartered: the Levelt Committee at Tilburg (chaired by Willem J. M. Levelt, former president of the Royal Netherlands Academy of Arts and Sciences), the Noort Committee at the University of Groningen (chaired by Ed Noort), and the Drenth Committee at the University of Amsterdam (chaired by Pieter J. D. Drenth). Together they covered every institution where Stapel had held an appointment.

The interim report appeared on October 31, 2011 and confirmed what the whistleblowers had alleged: data fabrication, on a scale and across a duration that none of the investigators had initially imagined. Stapel had already, by this point, confessed in writing to having “manipulated research data” and “faked studies” — not once, but repeatedly, over years. He voluntarily surrendered his PhD title to the University of Amsterdam in November 2011.

The final joint report — Flawed Science: The Fraudulent Research Practices of Social Psychologist Diederik Stapel — was published on November 28, 2012. It is the canonical document on this case, and it is unsparing.

What Stapel Actually Did

It is important to distinguish what Stapel did from the broader replication-crisis pattern of overstated, p-hacked, or selectively reported findings. He did not exaggerate real data. He invented it.

The Levelt report and Stapel’s own confession — including the account in his 2012 Dutch-language memoir Ontsporing (“Derailment”), translated into English in 2014 as Faking Science: A True Story of Academic Fraud — describe a methodology that, at scale, looks almost mundane. Stapel would propose a study to a junior collaborator or graduate student. They would design the experiment together. Stapel would then say he would handle the data collection himself — at a high school, or via a research assistant, or in some setting the co-author had no way to directly observe. He would return, sometimes weeks later, with a tidy spreadsheet of “collected” data that supported the hypothesis cleanly. The collaborator, typically a more junior researcher trusting the senior dean, would run the analyses, write the paper, and the work would proceed to publication.

In a substantial number of cases, the data did not exist. There had been no high school visit. No participants. No actual study. Stapel had sat at his computer and typed numbers into a file that he then handed to a trusting collaborator.

In other cases, the misconduct was subtler: studies had occurred but the numbers had been altered, augmented, or selectively curated to produce the desired effect. The Levelt report distinguishes between papers with confirmed fabrication, papers with strong evidence of fabrication, and papers where the data are unrecoverable and so the question cannot be definitively settled.

The mechanism is worth pausing on. Stapel did not need to compromise an entire research team. He needed to be the person who handled the data — alone. In the prevailing norms of social psychology at the time, that was not an unusual division of labor. A senior researcher who claimed to be doing the data collection was trusted to do it. There was no widespread expectation that raw data files be archived, shared, or independently verified before publication. The fraud was possible because the field’s verification infrastructure was built on the assumption that researchers would not lie.

How It Went Undetected For A Decade

The Levelt report devotes substantial attention to the question of how this could have continued for so long. The answer is not flattering to the field — and it is precisely why this case matters beyond the scandal itself.

The findings were too clean to question. Stapel’s results were striking, narratively satisfying, and produced large, clean effect sizes that fit the hypotheses tightly. In a healthy scientific culture, that pattern should have triggered scrutiny — real human behavior is messy, and effect sizes in social psychology are typically modest with substantial noise. But in the actual publication culture, clean findings were rewarded with publication in high-impact journals, and noisy findings often were not. Reviewers and editors were not trained to be suspicious of results that were too good. They were trained, in effect, to prefer them.

Co-authors trusted the senior researcher. The Levelt report’s joint summary explicitly concludes that the co-authors of Stapel’s papers were not accessories to the fraud. They had no realistic way to detect that data they were given by a senior, established colleague had been invented. The norm of trusting your collaborators’ contributions is foundational to scientific collaboration — and Stapel exploited it.

There was no preregistration, no data sharing, no independent verification. In the social psychology of the 2000s and early 2010s, preregistration was rare, raw data was almost never made publicly available, and replication studies were difficult to publish and largely uncredited. The combination meant that a researcher could publish a striking finding and face no realistic threat that anyone would independently re-collect the data, compare it to the published claims, or attempt to reproduce the analysis from raw files.

Prestige and authority suppressed concerns. The Levelt report and subsequent commentary describe a culture in which questioning a senior, prize-winning researcher was professionally risky for junior colleagues. The whistleblowers themselves spent months gathering evidence before approaching the administration, knowing that the cost of being wrong about a dean was potentially career-ending. In a slightly different institutional culture — one without the prestige hierarchy that made the accused person nearly untouchable — the concerns might have surfaced years earlier.

The field’s incentive structure rewarded volume. Stapel was prolific. He produced papers, graduate students, and grants at a rate that was, in retrospect, biologically implausible for any honest researcher doing actual data collection at the scale he claimed. The field’s hiring, tenure, and grant-funding committees rewarded that productivity. There was no countervailing pressure on “how is this person physically producing this much data?”

None of these conditions are unique to social psychology. They describe, in various combinations, the verification infrastructure of most of academic research at the time. The Stapel case is the one that finally forced the field to confront them.

The 58 Retractions

The retraction count grew over years as the investigative committees worked through Stapel’s full bibliography of approximately 137 papers, interviewing co-authors, requesting raw data, and assessing the evidence for each individual study.

The pace of retractions tracked the investigations: 7 by the end of 2011, 31 by late 2012, 55 by August 2015, and 58 by the end of 2015, the count Retraction Watch has continued to maintain. The retractions include the Science paper “Coping with Chaos” (retracted December 1, 2011), multiple papers in JPSP, papers in PSPB, papers in Journal of Experimental Social Psychology, Social Cognition, and several book chapters that built on the fabricated work.

The damage extended past Stapel himself. The Levelt commission’s joint report concluded that at least 10 doctoral dissertations supervised by Stapel contained fabricated data, meaning that at least ten graduate students had defended PhDs in part on data that Stapel had invented and handed to them. A 2014 working paper by Mongeon and Larivière (“Costly Collaborations: The Impact of Scientific Fraud on Co-authors’ Careers”) estimated the citation and career impact across more than 30 of Stapel’s co-authors, many of whom saw publication records gutted by retractions on work they had performed in good faith.

The legal consequences for Stapel were modest by US standards. In June 2013, the Dutch prosecutor reached an agreement with him: 120 hours of community service and the surrender of benefits equivalent to roughly 1.5 years of salary. He did not serve prison time. He has subsequently worked as a teacher and consultant, and has given media interviews and lectures about the case.

What This Catalyzed

The Stapel case did not, on its own, create the methodological reform movement in psychology. The Bem precognition controversy (also 2011) and the Bargh elderly-priming replication failures (2012) unfolded in the same period. What Stapel did was provide the field with a case so clear-cut, so undeniable, and so impossible to attribute to “honest disagreement about methods” that calls for structural reform could no longer be dismissed.

In the years that followed:

  • The Center for Open Science was founded in 2013 by Brian Nosek and colleagues, with the explicit mission of promoting preregistration, data sharing, and the kind of open verification infrastructure whose absence had allowed Stapel’s fraud to persist.
  • Preregistration moved from a fringe practice to a norm. Major journals — Psychological Science among the earliest — added preregistration tracks. The Registered Reports format, where reviewers evaluate study design before data collection and accept the paper conditional on the protocol being followed, was developed in part as a structural answer to the Stapel-class failure mode.
  • Data sharing policies tightened at major journals. By the mid-2010s, Psychological Science, JPSP under new editorial leadership, and others required or strongly encouraged authors to make raw data and analysis code available alongside publication.
  • The Reproducibility Project: Psychology (Open Science Collaboration, 2015, Science, 349(6251), aac4716) attempted to replicate 100 published psychology findings. Only about 36% of the replications produced a statistically significant result, and replication effect sizes averaged roughly half the size of the originals. The Stapel case had primed the field to take this result seriously rather than dismiss it.
  • Dutch psychology in particular underwent substantial introspection. The Tilburg School of Social and Behavioral Sciences restructured its research integrity oversight. The Royal Netherlands Academy of Arts and Sciences issued formal recommendations on research integrity. Multiple Dutch universities revised their data archiving and verification requirements.

The Stapel case did not solve the problem. Research misconduct continues to be discovered. But the infrastructure for catching it — preregistration, data sharing, replication initiatives, independent statistical scrutiny of published work — is materially stronger than it was in 2011, and Stapel’s case is one of the clearest reasons why.

What’s Honest To Say About Research Misconduct Now

It would be a mistake to read the Stapel case as evidence that “most” social psychology is fraudulent. It is not. The Levelt commission’s joint report was careful on this point: deliberate fabrication is rare. The far more common problem is the cluster of practices — selective reporting, p-hacking, undisclosed flexibility in data analysis, publication bias — that the replication crisis exposed across many fields. Fraud is the dramatic case; methodological sloppiness is the systemic case.

But the Stapel case demonstrates something important even when applied to the more common pattern: the field’s verification infrastructure depends, in practice, on trust rather than on independent checks. Peer review does not re-run the analysis. Reviewers do not see the raw data. Editors do not commission replication studies before accepting a paper. Co-authors typically trust each other to handle their assigned portions of the work honestly. None of these are bad-faith failures — they are the natural shape of a collaborative system built on the assumption of good faith.

The corollary, which is uncomfortable but true: a published finding in a peer-reviewed journal is a claim, not a verification. The verification — independent replication, raw data scrutiny, methodological audit — typically happens, when it happens, after publication, in the months and years that follow. The institutional check that catches outright fabrication is usually not editorial peer review. It is usually a junior collaborator who notices that something is off.

The Stapel case also illustrates the disproportionate role of whistleblower protection in actually catching misconduct. Three junior researchers, untenured and dependent on Stapel’s goodwill for their careers, made a calculation that the institutional reward for being right would exceed the institutional cost of being wrong. They were right in this case. The Levelt report and subsequent commentary have been explicit that without institutional protections — anonymous reporting, due-process investigation, and the explicit backing of senior administrators — junior researchers in similar positions could easily make the opposite calculation, and the fraud could continue.

What This Means For Strategists Evaluating Research-Backed Claims

If you are a CEO, consultant, or strategist who relies on academic research to inform decisions — about marketing, organizational design, persuasion, behavioral interventions — the Stapel case is not an esoteric academic scandal. It is the cleanest available proof that the chain of trust from “published in a peer-reviewed journal” to “a real, reliable finding” can break completely, with the breakage invisible from the outside for more than a decade.

The practical implications:

Treat single studies, even in prestigious journals, as provisional. The mechanism Stapel exploited — co-authors trusting his data, reviewers trusting the methods section, editors trusting peer review — applies even when no fraud is involved. A first-publication finding has not yet been independently verified. Treat it as a hypothesis that requires further evidence, not as established truth.

Weight replicated findings far more heavily than novel ones. The Open Science Collaboration’s reproducibility project found that fewer than 40% of replication attempts in psychology produced a statistically significant result, and that replication effect sizes averaged roughly half the originals. The findings that did replicate, and that have been confirmed across independent labs, are the ones worth building strategy on. Novelty is not evidence.

Be suspicious of “clean” findings. Stapel’s data was suspicious to his junior colleagues precisely because it was too tidy. Real-world human behavior is noisy. When a study reports a striking, clean, dramatic effect with no caveats and no failed conditions, that is a signal to dig deeper, not a signal to cite it more enthusiastically.

Ask what the preregistration looked like. A study that was preregistered — where the hypothesis, methods, and analysis plan were specified before data collection — is materially more credible than one that was not. For studies published after about 2015, this question is legitimate to ask. For studies published before, the absence of preregistration is not damning, but the absence of subsequent independent replication is.

Ask whether the raw data is publicly available. Studies that have made raw data available have, in effect, opened themselves to scrutiny. Studies that have not shared their data have not accepted that scrutiny. This is not a verdict, but it is information about the level of accountability the authors were willing to accept.
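
When raw data is public, even a very simple screen becomes possible. As a minimal sketch in Python, assuming the data is a list of per-participant response rows, one can count rows that repeat exactly, which is one of the anomalies reported in Stapel’s files. The function name and sample values here are hypothetical and not drawn from any real study.

```python
from collections import Counter


def repeated_rows(rows, min_count=2):
    """Return each response row that appears at least min_count times,
    mapped to the number of times it appears."""
    # Lists are not hashable, so convert each row to a tuple before counting.
    counts = Counter(tuple(row) for row in rows)
    return {row: count for row, count in counts.items() if count >= min_count}


# Hypothetical survey responses: one row per participant, one column per item.
sample = [
    [4, 5, 3, 4],
    [2, 1, 5, 3],
    [4, 5, 3, 4],
    [4, 5, 3, 4],
]
print(repeated_rows(sample))
```

Identical rows can arise honestly, especially on short scales with few response options, so a hit from a screen like this is a reason to ask questions, not a finding of misconduct.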

Be skeptical when a consultant cites a striking finding without naming the replication evidence. “A Harvard study showed that…” is a signal that someone is using the prestige of an institution as a substitute for the verification status of the underlying claim. Ask: which study, what year, what was the sample size, has it been replicated, what was the effect size in the original versus in subsequent replications?

The Stapel case is the canonical proof that published “science” can be entirely fictional and the institutional checks that should catch it can fail. The right inference is not cynicism about science as a whole — the verification infrastructure has improved substantially since 2011. The right inference is calibrated humility about any single finding, particularly any single finding that is striking, clean, novel, and being cited to support a strategic decision that matters.

Related Reading

  • The Replication Crisis hub — the full set of cases, methods, and decision frameworks for strategists evaluating “research-backed” claims.
  • James Vicary and the Subliminal Advertising Hoax — a different shape of fraud: a single fabricated study whose claims persisted in popular culture for decades despite immediate scientific dismissal.
  • John Bargh, Elderly Priming, And The Failed Replications — not fraud, but the same field, the same era, and the same failure mode: a striking finding, an inability to replicate, and a long delay before the field updated.
  • Money Priming And The Vohs Failures — the cumulative failure of a research program, again in social psychology, again involving co-authors and a long chain of citation built on findings that did not hold.
  • Daryl Bem And Precognition — published in JPSP in 2011, the same year Stapel was unmasked. The two cases together forced the field to ask whether its publication standards were detecting anything at all.

FAQ

How common is outright fraud in psychology and the social sciences?

Rare, in the sense of deliberate fabrication of data of the kind Stapel committed. The Levelt commission and subsequent surveys consistently estimate that deliberate fraud accounts for a small minority of methodological problems in psychology. The far larger problem is the cluster of “questionable research practices” — selective reporting of conditions, undisclosed flexibility in analysis, p-hacking, publication bias — that produced the replication crisis. Fraud is the dramatic case; methodological sloppiness is the systemic case. Both undermine the trustworthiness of published findings, but they require different solutions.

Could this happen to a study I’m citing today?

In principle, yes — but the infrastructure that catches it has improved meaningfully since 2011. Preregistration, data sharing requirements, replication initiatives, and independent statistical audit tools (such as Statcheck and the GRIM test) have all gained traction. The practical implication: studies preregistered, with publicly available data, and ideally independently replicated, are materially more credible than studies without these features. For older studies, the absence of these features is not damning, but it does mean independent replication evidence carries disproportionate weight in evaluating the claim.
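
To give a sense of what such an audit tool does, here is a rough Python sketch in the spirit of Statcheck, not its actual implementation. It recomputes a two-tailed p-value from a reported z statistic and checks whether the reported p-value agrees to within rounding. The function names and the tolerance rule are illustrative assumptions, and the real tool also handles other tests such as t and F.

```python
import math


def two_tailed_p_from_z(z: float) -> float:
    """Two-tailed p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def reported_p_is_consistent(z: float, reported_p: float, decimals: int = 3) -> bool:
    """Check whether reported_p could be the recomputed p-value
    rounded to the given number of decimals (assumed tolerance rule)."""
    recomputed = two_tailed_p_from_z(z)
    half_step = 0.5 * 10 ** -decimals
    # Small epsilon guards against floating-point noise at the boundary.
    return abs(recomputed - reported_p) <= half_step + 1e-12


# Hypothetical reported results, not taken from any real paper.
print(reported_p_is_consistent(1.96, 0.050))  # recomputed p is about 0.050
print(reported_p_is_consistent(1.50, 0.030))  # recomputed p is about 0.134
```

A mismatch of this kind is far more often a typo or a rounding slip than evidence of fraud. What the check offers is a cheap, automatic prompt to look closer.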

What happened to Stapel’s co-authors?

The Levelt commission’s joint report explicitly concluded that none of Stapel’s co-authors had been accessories to his fraud. They had been given fabricated data by a trusted senior colleague and had performed their analyses and writing in good faith. The reputational and career costs were nevertheless severe: many had publications retracted, saw their citation counts collapse, and faced public association with a notorious case. A 2014 analysis by Mongeon and Larivière documented measurable career impact on more than 30 co-authors. The Levelt report’s framing, that the co-authors were victims rather than perpetrators, was important institutionally but did not fully insulate them from professional consequences.

What happened to Stapel’s graduate students?

The investigative committees concluded that at least 10 doctoral dissertations supervised by Stapel contained fabricated data. Several students retained their PhDs after demonstrating that they had performed their own work in good faith using data they had been given. Some chose to leave academia. Some have spoken publicly about the difficulty of building a career on the foundation of a discredited mentor. The case became a reference point in discussions about doctoral supervision, mentor-student power dynamics, and the institutional obligation to protect trainees from supervisor misconduct.

How can journals catch this kind of fraud before publication?

Pre-publication detection of deliberate fabrication is genuinely difficult — peer review is a structural check on logic and methodology, not on data authenticity. The most effective post-Stapel responses have been infrastructural rather than evaluative: requiring raw data to be archived and available, requiring preregistration of hypotheses, and supporting independent replication initiatives. Tools like Statcheck (which automates the detection of inconsistencies between reported statistics) and GRIM (which checks whether reported means are mathematically possible given the sample size) now flag suspicious patterns automatically. These were retroactively applied to Stapel’s bibliography and identified additional anomalies. None of these substitutes for replication, but they make outright fabrication harder to sustain undetected for a decade.
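
The GRIM idea is simple enough to sketch in a few lines of Python. This is a minimal illustration, not the published tool. It assumes every response is a whole number (for example, a single Likert item), that the mean is passed as the string printed in the paper so its decimal places are preserved, and that either common rounding convention is acceptable. The function name is hypothetical.

```python
from decimal import Decimal, ROUND_FLOOR, ROUND_HALF_EVEN, ROUND_HALF_UP


def grim_consistent(reported_mean: str, n: int) -> bool:
    """Return True if some whole-number total across n responses
    would round to the reported mean."""
    target = Decimal(reported_mean)
    # Precision of the reported value, e.g. Decimal("0.01") for "5.19".
    quantum = Decimal(1).scaleb(target.as_tuple().exponent)
    # The true total must be a whole number next to mean * n.
    floor_total = int((target * n).to_integral_value(rounding=ROUND_FLOOR))
    for total in (floor_total, floor_total + 1):
        exact_mean = Decimal(total) / Decimal(n)
        # Accept a match under either rounding convention (an assumption).
        for mode in (ROUND_HALF_UP, ROUND_HALF_EVEN):
            if exact_mean.quantize(quantum, rounding=mode) == target:
                return True
    return False


print(grim_consistent("3.48", 25))  # True: 87 / 25 = 3.48
print(grim_consistent("5.19", 28))  # False: 145/28 rounds to 5.18, 146/28 to 5.21
```

The check is only informative for small samples. With two reported decimals and a sample of a hundred or more, essentially every mean is attainable, so a pass tells you little.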

Did Stapel face criminal charges?

In June 2013, the Dutch prosecutor’s office reached a settlement with Stapel: 120 hours of community service and the surrender of benefits equivalent to roughly 1.5 years of his university salary. He did not serve prison time. Scientific fraud is rarely prosecuted criminally in most jurisdictions because the legal theories (fraud, theft of grant funds, false statements) are difficult to apply to academic publishing. The institutional sanctions (loss of position, surrender of PhD, withdrawal of his award) were arguably more consequential than the criminal disposition.

Did Stapel ever explain why he did it?

He has, repeatedly. His 2012 Dutch-language memoir Ontsporing (translated into English in 2014 as Faking Science), his interviews with Bhattacharjee for the New York Times Magazine, and subsequent public lectures all offer his own account. The themes he returns to are: a desire for clean, narratively satisfying findings that real data rarely provided; the cumulative pressure of producing publishable results in a field that rewarded volume and clarity; an escalation pattern in which initial small manipulations made larger fabrications easier; and what he describes as a moral and psychological breakdown that he was unable to halt on his own. These are his accounts. Whether they are complete or self-serving is a question readers must answer for themselves. What is not in dispute, including in his own telling, is what he did.

What is the single most important lesson for someone outside academia?

A published finding is a claim, not a verification. The verification — replication, independent data scrutiny, methodological audit — typically happens after publication, sometimes years after, and sometimes does not happen at all. When you cite a study to support a business decision, you are implicitly assuming the verification chain has held. The Stapel case is the cleanest available evidence that this assumption can be wrong for more than a decade in cases of outright fraud, and is frequently overstated even in cases of honest research. Treat striking single findings as hypotheses worth investigating further. Reserve confidence for findings that have replicated across independent labs.

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified behavioral economist (1 of ~1,000 worldwide). 200+ A/B tests across energy, SaaS, fintech, e-commerce, and marketplace verticals.