On December 13, 2010, The New Yorker published a long-form science feature by Jonah Lehrer with the title “The Truth Wears Off: Is there something wrong with the scientific method?” The piece opened with a description of a meeting in 2004 between Jonathan Schooler, a psychologist at the University of California, Santa Barbara, and Michael Jennions, an evolutionary biologist at the Australian National University. They had been comparing notes on an unsettling pattern that each had noticed independently in his own subfield. Famous published effects, the kind that get textbook treatment and undergraduate citations, tended to get smaller every time someone tried to replicate them. The pattern was strong enough that Schooler had given it a name, half-joking and half-serious: the decline effect. Lehrer’s article, written for a general magazine audience, was the first time the term and the pattern reached a public beyond academic specialty journals.
The article appeared at an unusual moment in the history of the replication crisis. It was roughly three months before The Journal of Personality and Social Psychology would publish, in its March 2011 issue, Daryl Bem’s now-infamous paper claiming evidence for precognition, an event that would force a generation of psychologists to confront the methodology of their own field. It was eleven months before Joseph Simmons, Leif Nelson, and Uri Simonsohn would publish “False-Positive Psychology” and introduce the phrase “researcher degrees of freedom” to the field. It was nearly five years before the Open Science Collaboration would publish its 100-study reproducibility project. And it was already five years after John Ioannidis had published “Why Most Published Research Findings Are False” in PLOS Medicine, a paper that the popular press had largely ignored. The Lehrer article sat between the technical Ioannidis warning and the academic confrontation that was about to come. It told general-magazine readers, in plain language, that the methodology of empirical science had a generic problem that produced inflated findings as a matter of routine.
The next decade of academic work vindicated the article to a degree that few magazine pieces about science can claim. Almost every specific case Lehrer described (antipsychotic drug efficacy in schizophrenia, the symmetry-attractiveness link in evolutionary psychology, fluctuating asymmetry in biology, observational versus experimental epidemiology) turned out to be either a genuine instance of the pattern he described or a reasonable illustration of it. The structural arguments Lehrer made, about publication bias, about regression to the mean, and about how the incentive structure of academic science produces inflated initial findings, turned out to be precisely the diagnostic framework that the formal academic literature would adopt over the next five years. By 2015, when the Open Science Collaboration’s “Estimating the Reproducibility of Psychological Science” reported that only 36 percent of its 100 attempted replications produced statistically significant effects in the same direction as the original studies, the field had circled back to the framework Lehrer had articulated for non-specialist readers five years earlier.
What complicates the story of this article is what happened to Jonah Lehrer himself in the two years after he wrote it. In July 2012, journalist Michael Moynihan reported that Lehrer had fabricated Bob Dylan quotes in his book Imagine: How Creativity Works. Investigations that summer also established a pattern of self-plagiarism (recycling material from his own previous New Yorker blog posts, Wired columns, and Wall Street Journal pieces without disclosure) and instances of factual misstatement in his published work. Lehrer resigned from The New Yorker on July 30, 2012, and the publisher subsequently withdrew Imagine from sale. His later attempts at a return to public writing have been controversial. None of this changes the validity of the December 2010 article, whose substantive arguments and case studies were sourced from named researchers, attributed quotations, and primary journal citations, and which has stood up under subsequent academic scrutiny. The 2010 article is what it always was: a journalistically clean and substantively prescient account of a phenomenon that academic science would spend the next decade rigorously confirming. A strategist evaluating the article today has to hold both facts simultaneously: the diagnosis was correct, and the diagnostician was later disgraced over unrelated, subsequent work.
This is the story of the 2010 article, the specific cases it described, the explanations Lehrer offered for the pattern, how the academic literature subsequently validated the thesis, the Lehrer fabrication scandal and how to think about it in relation to the 2010 piece, and what a strategist evaluating any empirical literature should take from the whole arc.
The Article And The Schooler Interview
Lehrer’s article is built around an extended interview with Jonathan Schooler, a psychologist who had spent the first half of his career building a reputation on a specific finding called verbal overshadowing. The original 1990 study, conducted by Schooler and Tonya Engstler-Schooler when Schooler was a graduate student at the University of Washington, reported that participants who described the face of a person they had previously seen --- writing out a verbal description of the face’s features --- subsequently performed substantially worse on a recognition task than control participants who had not been asked to verbalize. The verbal description, in other words, overshadowed the visual memory and impaired later recognition. The effect was large in the original study, was published in a top journal, and became part of the standard literature on memory and perception. Other laboratories cited it. Schooler continued to publish related work building on the original effect.
Schooler told Lehrer, in a candid moment that would have been remarkable in academic writing but was characteristic of Lehrer’s interviewing in 2010, that the verbal overshadowing effect had been steadily shrinking in his own replications and the replications of others over the subsequent two decades. The original effect size, which had been close to half a standard deviation, had drifted downward in replication after replication, with later studies finding effects half the size of the original or smaller. Schooler did not say that the effect did not exist; he said that the effect appeared to be substantially smaller than the original literature had suggested, and that he found this both intellectually disorienting and personally troubling. He had been collecting unpublished replication data on the effect, much of it null or weakly positive, and he was beginning to suspect that the pattern was generic across psychology and possibly across other empirical sciences.
The article then expanded the frame. Lehrer described conversations with John Crabbe, a behavioral geneticist at Oregon Health & Science University, who had attempted in the late 1990s to test the genetic basis of behavior in mice by running identical experiments in three separate laboratories --- in Portland, in Edmonton, and in Albany --- with the same mouse strains, the same equipment, the same protocols, the same time of day, the same handlers’ procedures. Crabbe’s 1999 Science paper reported that despite the unusually rigorous standardization across sites, the laboratories produced systematically different results on the behavioral measures. The implication was not that any one laboratory was wrong but that behavioral genetics findings were unexpectedly dependent on factors that could not be fully captured even by elaborate standardization protocols. The genetic mouse-behavior literature, viewed through Crabbe’s results, had been treating findings as more robust than the underlying biology supported.
Lehrer then ran through a series of further cases, each illustrating a different mechanism of decline. He described how the published literature on second-generation antipsychotic drugs --- a class of medications including risperidone, olanzapine, quetiapine, ziprasidone, and others, marketed in the 1990s as substantial improvements over the older first-generation antipsychotics --- had reported initial efficacy advantages that newer trials were finding to be roughly half their original magnitude. The 2005 CATIE trial, the 2006 CUtLASS trial, and a series of subsequent independent comparative studies had collectively suggested that the second-generation antipsychotics were not dramatically more effective than the older drugs they were intended to replace. The original literature had reported a much larger advantage. Lehrer described the symmetry-attractiveness literature in evolutionary psychology, where the early studies of the 1990s had reported strong correlations between facial symmetry and rated attractiveness, with later studies finding the correlations to be substantially weaker. He described the long history of observational epidemiology --- vitamin E and heart disease, hormone replacement therapy and cardiovascular protection, beta-carotene and cancer prevention --- where the initial observational associations had been overturned or sharply attenuated when later randomized trials were conducted. And, somewhat provocatively, he described the parapsychology literature on extrasensory perception, where Schooler himself had recently become interested in the methodological questions because the ESP findings exhibited the same decline pattern: large initial effects that shrank toward the null in later replications.
The case selection was deliberate, and it covered a wide enough range of fields that the pattern could not be explained as a quirk of any one methodology. Psychology, pharmacology, biology, epidemiology, and the contested boundary of parapsychology all showed the same shape. The article was making the case that the pattern was systemic.
The Explanations Lehrer Offered
Lehrer’s article was unusually careful, for a popular-magazine piece, in laying out the candidate explanations for the decline effect and assessing their adequacy. The first explanation he discussed was regression to the mean, the statistical principle that observations at the extreme end of a distribution tend to be followed by observations closer to the average. A field that selectively publishes initial findings that are statistically significant has, by construction, a bias toward results that landed at the upper end of the sampling distribution. Subsequent replications, even if conducted on the same true effect, will tend to land closer to the actual population value, which by definition is smaller than the published extreme. This is not a flaw in the replications; it is a feature of the selection process that produced the original publication. Lehrer treated this as the most parsimonious explanation for a substantial portion of the decline pattern, and the subsequent academic literature has largely endorsed this view as the dominant mechanism.
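The selection argument above can be made concrete with a small simulation. What follows is a minimal sketch, not anything taken from Lehrer’s article: it assumes a true standardized effect of 0.3, 20 participants per group, and a normal approximation to the sampling distribution (all illustrative choices), and the function name `simulate_decline` is hypothetical.

```python
import math
import random
import statistics

def simulate_decline(true_d=0.3, n_per_group=20, n_studies=20000, seed=0):
    """Return (mean published effect, mean replication effect).

    Assumption: each study's observed effect is drawn from a normal
    distribution centered on the true effect (a large-sample approximation).
    """
    rng = random.Random(seed)
    # Approximate standard error of a standardized mean difference.
    se = math.sqrt(2.0 / n_per_group)
    published, replications = [], []
    for _ in range(n_studies):
        observed = rng.gauss(true_d, se)  # original study's estimate
        if observed / se > 1.96:          # only significant positive results get published
            published.append(observed)
            # Independent replication of the same true effect, with no filter.
            replications.append(rng.gauss(true_d, se))
    return statistics.mean(published), statistics.mean(replications)

pub, rep = simulate_decline()
print(f"published mean: {pub:.2f}, replication mean: {rep:.2f}")
```

Under these assumptions the published mean should land well above 0.3, while the unfiltered replications average close to 0.3. The true effect never changes; the apparent decline comes entirely from the significance filter applied to the original studies.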
The second explanation he discussed was publication bias, the systematic tendency of journals to publish positive findings while negative findings are rejected or never submitted. This produces a published literature that is enriched in false positives and short on disconfirming evidence. Schooler, in his subsequent published response to the article (a 2011 Nature correspondence titled “Unpublished results hide the decline effect”), specifically argued that the decline effect could be substantially explained by the file drawer of unpublished negative replications, which if released would shrink the apparent meta-analytic effect immediately and substantially. The Turner 2008 NEJM study on antidepressant publication bias, published nearly three years before Lehrer’s article, was the model case: when the FDA’s complete trial files were compared to the published literature, the published literature inflated the antidepressant effect size by roughly a third. Lehrer cited this work and used it as the canonical illustration of the bias mechanism.
The third explanation he discussed was researcher flexibility, what would later be formalized in Simmons, Nelson, and Simonsohn’s 2011 paper as “researcher degrees of freedom” and in the subsequent literature as p-hacking. The argument is that a study analyst making many small choices --- which outcome to emphasize, which subjects to exclude, which statistical model to fit, which covariates to include --- can generate false-positive results without conscious dishonesty, simply through the cumulative effect of selecting choices that improve the appearance of the data. The decline effect, on this account, occurs because the original study’s effect was inflated by a particular set of analytic choices that the replicators do not make. Lehrer described this mechanism in non-technical language, and the subsequent academic literature on p-hacking has confirmed and quantified the effect with a precision that the 2010 article anticipated but could not yet provide.
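A toy simulation illustrates how flexibility alone inflates false positives. This sketch covers only one degree of freedom, choosing among several independent outcome measures when no true effect exists, and is a simplification rather than the procedure Simmons, Nelson, and Simonsohn used. The function name and parameters are hypothetical.

```python
import random

def false_positive_rate(n_outcomes, n_studies=50000, seed=1):
    """Share of null studies that report at least one 'significant' outcome.

    Assumption: outcomes are independent and each test statistic is
    standard normal under the null hypothesis.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # The analyst reports a finding if ANY outcome crosses p < 0.05 (two-sided).
        if any(abs(rng.gauss(0.0, 1.0)) > 1.96 for _ in range(n_outcomes)):
            hits += 1
    return hits / n_studies

for k in (1, 2, 5):
    print(k, round(false_positive_rate(k), 3))
```

For independent outcomes the simulated rate should track 1 - 0.95^k: about 5% with a single outcome and roughly 23% with five. Real analytic choices are correlated and more varied, so this is an illustration of the mechanism and not an estimate of its size in any actual literature.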
The fourth explanation Lehrer entertained, and was criticized for entertaining, was what he called “a kind of cosmic randomness” --- the possibility that there might be aspects of the apparent decline that resisted explanation by any of the conventional mechanisms. He did not endorse this strongly, and the article was carefully hedged on the point, but he raised it as a residual possibility. Critics of the article, including some statisticians, argued that the residual was not necessary; the conventional mechanisms were enough to account for the entire observed pattern. The criticism was substantively correct. The 2015 OSC reproducibility project, the subsequent Many Labs projects, and the surrounding meta-analytic work have demonstrated that publication bias, regression to the mean, and researcher degrees of freedom together account for the bulk of the decline effect. There is no compelling evidence for a “cosmic” residual. Lehrer’s willingness to entertain the cosmic-residual hypothesis was probably the article’s weakest substantive choice. It did not undermine the rest of the piece, but it provided ammunition to readers who wanted to dismiss the article as journalistically loose.
How Academic Science Subsequently Validated The Thesis
The decline effect as Lehrer described it was substantially validated by academic work over the five to ten years following the article’s publication. The most direct validation came from the Open Science Collaboration’s 2015 Science paper, “Estimating the reproducibility of psychological science.” The OSC, an international consortium coordinated through the Center for Open Science, identified 100 studies published in three top psychology journals in 2008 --- Psychological Science, the Journal of Personality and Social Psychology, and the Journal of Experimental Psychology: Learning, Memory, and Cognition --- and conducted high-powered, preregistered replications of each one. The headline result: 36% of the replications produced statistically significant effects in the same direction as the original, compared to 97% of the original studies that had reported significant results. The average effect size in the replications was approximately half the size of the original effect, which is the decline pattern Lehrer had described, quantified across a representative sample of the field.
The OSC paper was followed by the Many Labs projects (including Klein et al.’s Many Labs 2), the Reproducibility Project: Cancer Biology, and the Camerer et al. 2018 Nature Human Behaviour replication of social-science papers from Nature and Science, each of which produced broadly compatible results: replication rates substantially below the original publication rates, replicated effect sizes substantially smaller than the originals, and a pattern of decline consistent with Lehrer’s 2010 description. Begley and Ellis, writing in Nature in 2012, reported that scientists at Amgen had attempted to replicate 53 landmark preclinical cancer studies and had successfully reproduced the findings of only 6, an 11% replication rate that suggested the decline effect in the cancer literature was, if anything, more severe than the psychology figures. Each of these papers, in its different field, was a vindication of the diagnostic frame Lehrer had articulated for non-specialist readers in 2010.
The mechanisms Lehrer had identified --- publication bias, regression to the mean, and researcher flexibility --- were also formally validated. Simmons, Nelson, and Simonsohn’s 2011 paper demonstrated that researchers exploiting a modest set of common analytic flexibilities could produce false-positive results at rates above 60%. The literature on p-hacking grew through the 2010s into one of the most cited bodies of methodological research in the social sciences, with formal techniques for detecting p-hacking (Simonsohn’s p-curve analysis, the p-uniform method, the Z-curve method) becoming standard tools in meta-analysis. Erick Turner and others extended the FDA-file methodology beyond antidepressants to additional drug classes. The cumulative pattern was that nearly every mechanism Lehrer had cited as a candidate explanation for the decline effect turned out to be a robust, formally documented feature of how empirical literatures generate inflated initial findings.
The exception, again, was the “cosmic randomness” hypothesis, which did not receive academic validation and which the formal literature has generally treated as unnecessary. The mainstream view today is that the decline effect is adequately explained by the combination of selection mechanisms Lehrer described in the article, without any need for a residual cosmic component. This refinement does not detract from the article’s main argument; it sharpens it. Lehrer’s central claim was that famous published findings systematically shrink over time. That claim has been thoroughly validated. His subsidiary speculation about a cosmic residual has not, and the subsequent literature has shown that no such residual is needed.
The Lehrer Scandal And How To Read It
In June 2012, journalist Michael C. Moynihan, then writing for Tablet Magazine, contacted Lehrer with questions about a passage in Lehrer’s recently published book Imagine: How Creativity Works, in which Lehrer had quoted Bob Dylan. Moynihan was a Dylan researcher and had been unable to verify the quotes. Over a series of exchanges, Lehrer initially defended the quotes, then progressively revised his account, and ultimately admitted that he had fabricated or composited several of the Dylan quotes in the book. Moynihan published his investigation on July 30, 2012, and Lehrer resigned from The New Yorker the same day. Houghton Mifflin Harcourt, the book’s publisher, withdrew Imagine from sale and offered refunds. Subsequent investigations established that Lehrer had also engaged in self-plagiarism, recycling material from his New Yorker blog posts, Wired columns, and Wall Street Journal pieces without disclosure, across a range of his published work in 2011 and 2012. A 2013 attempted comeback talk for the Knight Foundation was widely criticized as evasive and self-pitying, and Lehrer’s reputation in the science-journalism community has not substantially recovered.
The relevant question for a reader of the 2010 New Yorker article is how to think about the article in light of the author’s subsequent disgrace. The cleanest answer is that the 2010 article is independently verifiable in a way that most of Lehrer’s later work was not. The article’s substantive claims are sourced from named researchers, the quotations are attributed (Schooler, Crabbe, others), the primary research is cited to specific peer-reviewed papers that can be checked, and the underlying empirical pattern --- the decline effect, the inflation-then-shrinkage trajectory of famous findings --- has been independently validated by a large and well-organized subsequent academic literature. The article was, by its substantive content, correct. Nothing in the article relied on the specific journalistic moves --- composite quotations, recycled material, fabricated supporting detail --- that brought Lehrer down two years later. The 2010 article was a clean piece of explanatory journalism that has aged extraordinarily well; the 2012 scandal was about a different and substantially worse set of journalistic practices in different and later work.
The reasonable strategist position, then, is to read the 2010 article as a primary source for the public-press emergence of the decline-effect framing, to credit the article for its prescience and substantive accuracy, to verify any specific factual claim in it against the primary literature it cites (which is straightforwardly available), and to refuse to extend that credit to Lehrer’s later work without independent verification of each substantive claim. The article and the author are separable. A reader who throws out the 2010 article because of the 2012 scandal is making a mistake that the substantive content of the article does not warrant. A reader who treats Lehrer’s later writing as having the same credibility as the 2010 article is making the opposite mistake. Both errors are common and both are avoidable with the simple discipline of evaluating individual works on the basis of their independently verifiable content.
The Strategist Takeaway
For anyone who builds decisions on top of empirical findings --- product strategists evaluating behavioral research, clinicians evaluating treatment evidence, policymakers evaluating intervention studies, investors evaluating efficacy claims --- the Lehrer 2010 article has two enduring lessons that go beyond the specific cases it describes.
The first lesson is that when a popular-press science journalist with serious sources begins noticing that famous academic findings keep shrinking, the right response is to take the observation seriously rather than to dismiss it as journalistic exaggeration. Lehrer’s article was not an academic paper, but its substantive argument was correct, and the formal academic literature spent the next five years confirming the diagnosis that the popular-press article had articulated first. The general lesson is that signals from non-specialist observers who are paying attention to a field can be early indicators of structural problems that the field’s own internal debates have not yet surfaced. The information asymmetry between a field and the lay press runs in both directions: usually the field knows more, but occasionally the lay press notices a pattern from which the field has been collectively averting its eyes. A strategist’s job is to weight both signals appropriately and to update when they converge.
The second lesson is that famous findings should be assigned a stronger prior of inflation than less famous findings, and that the inflation factor is not small. The Lehrer article suggested, and the subsequent OSC and Many Labs literatures have confirmed, that the typical inflation is on the order of a factor of two: original effect sizes in famous published studies tend to be roughly twice the replicated effect sizes when independent high-powered replications are conducted. A strategic decision built on a famous finding without accounting for this inflation is, with high probability, built on an effect size that is materially smaller than the original publication suggests. The operational implication is to discount the magnitude of any published effect that has been heavily publicized and to consult preregistered replications, registered reports, and high-powered Many Labs-style consortium studies before treating any specific published effect as a reliable basis for a substantial commitment of resources.
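The operational cost of that discount can be sketched with a standard power calculation. This is a minimal illustration under stated assumptions: the factor-of-two inflation is taken as a rough prior from the paragraph above, the sample-size formula is the normal approximation for a two-group comparison at a two-sided alpha of 0.05 and 80% power, and the helper names `discounted_effect` and `n_per_group` are hypothetical.

```python
import math
from statistics import NormalDist

def discounted_effect(published_d, inflation_factor=2.0):
    """Shrink a published standardized effect by an assumed inflation factor."""
    return published_d / inflation_factor

def n_per_group(d, alpha=0.05, power=0.80):
    """Per-group sample size for a two-group comparison (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

published = 0.5                       # illustrative published effect size
planned = discounted_effect(published)
print(n_per_group(published), n_per_group(planned))
```

With these inputs the calculation gives 63 participants per group at the published effect of 0.5 and 252 per group at the discounted effect of 0.25. Halving the assumed effect roughly quadruples the sample needed to detect it, which is why a plan sized to the original publication is likely to be underpowered.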
The Lehrer article did not invent these lessons. It articulated them in a publicly accessible form five years before the academic literature would systematically confirm them. It is one of the small number of popular-press science articles that functions, in retrospect, as a leading indicator of a major paradigm shift in the underlying field. That is the substance of the article. The author’s separate later journalism scandal does not change what the 2010 article said, when it said it, or how thoroughly subsequent evidence has confirmed it.
Sources
- Lehrer, J. (2010, December 13). The truth wears off: Is there something wrong with the scientific method? The New Yorker. Available at https://www.newyorker.com/magazine/2010/12/13/the-truth-wears-off
- Schooler, J. W. (2011). Unpublished results hide the decline effect. Nature, 470(7335), 437. https://doi.org/10.1038/470437a
- Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124
- Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
- Begley, C. G., & Ellis, L. M. (2012). Drug development: Raise standards for preclinical cancer research. Nature, 483(7391), 531-533. https://doi.org/10.1038/483531a
- Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366. https://doi.org/10.1177/0956797611417632
- Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407-425. https://doi.org/10.1037/a0021524
- Turner, E. H., Matthews, A. M., Linardatos, E., Tell, R. A., & Rosenthal, R. (2008). Selective publication of antidepressant trials and its influence on apparent efficacy. New England Journal of Medicine, 358(3), 252-260. https://doi.org/10.1056/NEJMsa065779
- Crabbe, J. C., Wahlsten, D., & Dudek, B. C. (1999). Genetics of mouse behavior: Interactions with laboratory environment. Science, 284(5420), 1670-1672. https://doi.org/10.1126/science.284.5420.1670
- Moynihan, M. C. (2012, July 30). Jonah Lehrer’s deceptions. Tablet Magazine. Available at https://www.tabletmag.com/sections/news/articles/jonah-lehrers-deceptions
Related Reading
- Why Most Published Research Findings Are False: The Ioannidis 2005 PLOS Medicine Paper
- Estimating the Reproducibility of Psychological Science: The OSC 2015 Project
- Daryl Bem’s 2011 Precognition Paper and What It Did to Psychology
- P-Hacking and Researcher Degrees of Freedom: Simmons, Nelson, Simonsohn 2011
- Publication Bias and the File Drawer Problem
FAQ
What is the decline effect?
The decline effect is the empirical pattern that famous published scientific findings tend to produce smaller effect sizes in subsequent replications than they did in the original studies. The term was coined informally by psychologist Jonathan Schooler in the 2000s and was introduced to the general reading public by Jonah Lehrer’s December 2010 New Yorker article “The Truth Wears Off.” The pattern has been documented in psychology, pharmacology, biology, medicine, and other empirical fields. The standard explanations are publication bias, regression to the mean, and researcher flexibility in data analysis (p-hacking), and these mechanisms together account for the bulk of the observed decline pattern.
Was Lehrer’s 2010 article correct?
The substantive arguments and case studies in the 2010 article have been largely confirmed by subsequent academic research. The Open Science Collaboration’s 2015 Science paper found that only 36 percent of 100 attempted replications in psychology produced statistically significant effects in the same direction as the originals, and the average replication effect was about half the size of the original. The Begley and Ellis 2012 Nature paper reported that only 6 of 53 landmark preclinical cancer studies could be replicated. The Many Labs projects, the Reproducibility Project: Cancer Biology, and the Camerer et al. 2018 social-science replications produced compatible results. The diagnostic frame Lehrer articulated in 2010 has been validated by the formal academic literature of 2011 through the early 2020s.
Does the Lehrer fabrication scandal invalidate the 2010 article?
No. The 2012 scandal involved fabricated Bob Dylan quotes in Lehrer’s book Imagine: How Creativity Works and a pattern of self-plagiarism in his later journalism. The 2010 New Yorker article on the decline effect is sourced from named researchers, attributed quotations, and primary peer-reviewed citations that can be independently verified. None of the issues that brought Lehrer down in 2012 are present in the 2010 article. A reader should evaluate the 2010 article on its independently verifiable content, which has been thoroughly validated by subsequent academic work, while applying separate skepticism to Lehrer’s later writing.
What is regression to the mean and how does it cause the decline effect?
Regression to the mean is the statistical principle that extreme observations tend to be followed by less extreme observations. When a scientific field publishes only findings that meet a threshold of statistical significance, typically p < 0.05, the published findings are by construction biased toward the upper end of the sampling distribution. If a study published an effect of magnitude 0.5 when the true population effect is 0.3, the original sample happened to land at an extreme of the sampling distribution. Subsequent replications, even of the same true effect, will tend to land closer to 0.3, the actual population value. This produces the appearance of a “declining” effect even though the underlying effect has not changed. Regression to the mean is widely considered the dominant mechanism underlying the decline effect.
What is the relationship between the decline effect and publication bias?
Publication bias is the systematic tendency of journals to publish positive findings (those with statistically significant results in the predicted direction) and to leave negative findings unpublished. This produces a published literature that is enriched in false-positive findings, inflated effect sizes, and selectively confirmatory evidence. When subsequent researchers attempt to replicate published findings, they face the actual underlying effect size, which is smaller than the published literature suggests. The result is the appearance of decline. Jonathan Schooler’s 2011 Nature correspondence specifically argued that publication bias accounts for a substantial share of the decline-effect pattern, and the subsequent literature on the file-drawer problem has confirmed this view.
Have other fields shown the same pattern?
Yes. The decline effect has been documented in psychology (OSC 2015, Many Labs 1-3, Camerer et al. 2018), preclinical cancer research (Begley and Ellis 2012, Reproducibility Project: Cancer Biology), pharmacology (the Turner 2008 antidepressant audit and subsequent FDA-file replications), economics (Camerer et al. 2016, replication studies of major laboratory experiments), neuroscience (multiple studies documenting underpowered designs and inflated effect sizes), and biomedical research more generally. The cross-field generality of the pattern was one of the central claims of Lehrer’s 2010 article, and it has been confirmed by the literature.
What should a strategist do when evaluating empirical evidence?
First, treat famous findings with stronger skepticism than less famous findings, because the selection process that made them famous also selected for inflated effect sizes. Second, weight preregistered replications and registered reports more heavily than original publications. Third, consult Many Labs-style consortium studies and large meta-analyses with formal corrections for publication bias before treating any specific effect as a reliable basis for a substantial decision. Fourth, when a popular-press science journalist with serious sources is noticing a systematic problem in a field, take the observation seriously rather than dismissing it. The Lehrer 2010 article is the canonical example of a popular-press signal that turned out to be a correct leading indicator of an academic crisis. Treat such signals as Bayesian evidence rather than as journalistic noise.