In March 2012, the journal Nature published a three-page commentary by C. Glenn Begley, a cancer biologist who had recently left a senior position at Amgen, and Lee Ellis, an oncologist at the University of Texas MD Anderson Cancer Center. The title was understated: “Drug development: Raise standards for preclinical cancer research.” The lead-in paragraph buried what would become one of the most-cited statistics in the entire literature on scientific reproducibility.
Over a roughly ten-year period, Amgen’s haematology and oncology department had attempted to confirm the findings of 53 “landmark” cancer biology papers: the kind of high-profile, frequently cited studies that were being used as the foundation for drug-discovery programs. Of the 53 attempts, the scientific findings could be confirmed in only six cases. Six out of fifty-three. Eleven percent. Or, framed the other way, eighty-nine percent of landmark cancer biology findings failed to replicate when an industrial research group with every commercial reason to want them confirmed tried to reproduce them.
The 89% failure rate would become the founding statistic for the recognition that biology had a replication crisis: not just psychology, not just nutritional epidemiology, but the wet-lab biomedical research that underwrites the pharmaceutical industry’s entire pipeline. The number was shocking. It was also, as a body of subsequent work has shown, not an outlier. The year before, a comparable statistic had appeared in an adjacent corner of the literature: Prinz, Schlange, and Asadullah at Bayer had reported in 2011 that internal replication attempts on 67 projects, drawn mainly from oncology along with women’s health and cardiovascular research, had produced fully consistent findings in only about 20–25% of cases. Nearly a decade after the Amgen commentary, the Reproducibility Project: Cancer Biology (a coordinated multi-laboratory effort to replicate the headline findings of 53 high-impact cancer papers using pre-registered protocols) would publish its final reports in eLife, finding a median effect-size shrinkage of roughly 85% across the replications it completed, with only a minority of attempts replicating cleanly.
The three findings together — Begley & Ellis 2012, Prinz et al. 2011, and the Reproducibility Project: Cancer Biology 2021–2022 — converge on the same uncomfortable conclusion: preclinical cancer biology, as published in the high-impact journals that strategic actors rely on, has systematically lower reproducibility than its citation patterns suggest. The implication for anyone whose business depends on preclinical evidence — biotech venture investors, pharmaceutical business-development teams, equity analysts covering oncology pipelines, strategic planners at hospital systems weighing investments in emerging therapies — is that the discount rate applied to a “promising preclinical finding” needs to be much steeper than the discount rate the published literature, on its surface, would suggest.
This article walks through what Begley and Ellis actually reported, why their methodology was specifically credible, how Prinz 2011 and the later Reproducibility Project converged on the same broad finding, what the economic cost is, what methodological standards Begley and Ellis recommended that the field has only partially adopted, and how a strategist evaluating biotech and pharmaceutical evidence should integrate the replication-risk fact into the decision frame.
The 89% Headline: What Begley And Ellis Actually Reported
The headline statistic is famous; the structure of how Begley and Ellis arrived at it is less so, and the structure is what makes the finding credible.
Begley was, until 2011, the head of haematology and oncology research at Amgen, one of the largest biotechnology companies in the world. The replication exercise he describes in the 2012 Nature paper was not an academic project; it was the routine internal due diligence Amgen’s discovery team performed before committing tens or hundreds of millions of dollars to a drug-discovery program built on top of a published external finding. The team would select a landmark paper — typically one that proposed a new cancer target or a new mechanism of cancer biology — and would attempt to reproduce the key in vitro and in vivo experiments in Amgen’s own laboratories before launching a program based on the finding.
The set of 53 papers had two important properties. First, they were “landmark” papers, by which Begley and Ellis meant high-profile publications in top-tier journals, typically with substantial subsequent citation, frequently used as the foundation for new drug-discovery programs across multiple companies and academic groups. Second, the papers were selected because Amgen was interested enough in the underlying claim to invest the resources to replicate it. This is a prior that runs strongly in the direction of the original findings being true: Amgen had selected for findings that looked promising enough to bet on. The set was not a random sample of cancer biology; it was a sample enriched for findings the field collectively believed.
The replication standard was straightforward: did the key experiments in the paper produce the same qualitative result when run in Amgen’s labs, using the same cell lines and procedures where possible, with the original authors’ direct involvement and assistance where the original authors were willing to engage? Amgen was not looking to debunk anyone. The team had a strong commercial interest in the original findings being true: a replicated finding could become the basis for a multi-year drug program; a non-replicated finding meant Amgen had wasted weeks or months of bench scientist time confirming that a published claim did not hold up.
Of the 53 attempts, the scientific findings could be confirmed in six cases. The other 47 failed to replicate. Begley and Ellis described the failure modes in the commentary: in many cases, the originally reported experiment was reproducible only under a specific subset of conditions that had not been disclosed in the original paper; in many cases, the original finding was an artifact of a specific cell line whose behavior was idiosyncratic; in many cases, key control experiments had been omitted; in many cases, the statistical analysis had been performed in a way that, when redone with proper blinding and pre-specification, produced a null result.
Begley and Ellis also reported a striking sociological observation about the engagement with the original authors. When the Amgen team contacted authors of papers that had failed to replicate, the most common response was not “let us investigate why your replication failed” but rather “we cannot share our raw data” or “we agree the experiment depends on subtle conditions we did not describe in the paper.” In some cases, the original authors acknowledged privately that the published figures had been selected from a larger set of experiments that included null and negative results. The published figures, in those cases, were the favorable outliers from a process that produced both positive and negative results — and only the positive ones had been published.
The commentary is careful about what it does and does not claim. It does not name the 53 papers. It does not name the authors. It does not provide a detailed breakdown of which kinds of failure mode predominated. This is a methodological weakness — independent verification of Amgen’s replication attempts is not possible without the paper-level data — and it has been a substantive criticism of Begley and Ellis 2012 in the years since. The defense is that Amgen had contractual confidentiality obligations to the original authors who had cooperated with the replication attempts, and that publishing a list of “papers that did not replicate” would have been incompatible with the spirit of the collaborative replication exercise. The commentary therefore reports the aggregate statistic without the case-level detail, and the field has had to take the aggregate statistic on the credibility of Begley’s name and Amgen’s reputation. That is a real epistemic limitation, but it does not obviously falsify the headline number.
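Because only the aggregate count is available, one way to gauge how much statistical uncertainty sits around the 6-of-53 figure is a binomial confidence interval. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not something the commentary reports. It also assumes the 53 attempts can be treated as independent draws, which the source does not establish.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return centre - half_width, centre + half_width

# Counts taken from the commentary: 6 confirmed findings out of 53 attempts.
confirmed, attempted = 6, 53
low, high = wilson_interval(confirmed, attempted)
print(f"confirmed: {confirmed / attempted:.1%}")
print(f"approx. 95% interval for the confirmation rate: {low:.1%} to {high:.1%}")
```

Under these assumptions the interval runs from roughly 5% to 23%, so even the optimistic end of the range leaves most landmark findings unconfirmed.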
Why The Amgen Setup Was Specifically Credible
The Begley and Ellis result has been more credible than a typical methodological-critique paper for three reasons specific to Amgen’s institutional position.
First, Amgen had no incentive to fabricate a failure-to-replicate. Independent academic replication efforts can be questioned on the grounds that the replicating team has career incentives to publish a high-profile null result. The reproducibility crisis has become, in part, an academic-prestige game, and “I replicated a famous study and found nothing” is now a publishable result in its own right. Amgen, by contrast, had a strong commercial interest in the original findings being true: every failed replication represented a drug-discovery program that would not be launched, a target that would not be pursued, and a research portfolio whose addressable market shrank. The Amgen team was specifically not selected for skepticism; the bench scientists doing the replications were doing them because they wanted to build programs on top of the findings. The high failure rate emerged despite the team’s commercial interest in success, not because of it.
Second, the team had the technical capability to do the replications correctly. Failed replications can be questioned on the grounds that the replicating team lacked the technical expertise to reproduce the specific experimental conditions. Amgen’s haematology and oncology research group was, throughout the decade in which the replications were run, one of the most technically sophisticated cancer-biology laboratories in the world. The team had the cell lines, the antibodies, the imaging equipment, the in vivo facilities, and the deep methodological expertise to reproduce essentially any wet-lab cancer biology experiment that was reproducible in principle. When the Amgen team failed to replicate a finding, the explanation was not lack of technical capability.
Third, Begley personally engaged with the original authors. A common defense of failed replications is that the original effect requires specific conditions that the original authors know but did not fully describe in the paper. The Amgen team’s replication protocol included direct outreach to the original authors, offers to send Amgen scientists to visit the original lab to learn the protocol firsthand, and offers to have original authors visit Amgen to oversee the replication attempts. In many cases, the original authors engaged constructively, and the Amgen team replicated their experiments using exactly the cells, reagents, and procedures that the original lab specified. The failures that emerged from those collaborative replications cannot be explained as “you didn’t know the secret protocol.” The secret protocol was applied. The finding still failed to replicate.
Taken together, these three properties make Begley and Ellis 2012 an unusually strong piece of evidence. It is not a random sample of cancer biology, it lacks paper-level detail, and the aggregate statistic depends on Amgen’s good-faith reporting. But the methodological setup runs in the direction of finding more replications, not fewer, and the failure rate emerged in spite of those favorable conditions. The 89% number is therefore more likely to understate than to exaggerate the non-replication rate in landmark cancer biology, with the caveat that a failed replication does not by itself prove the original finding was a false positive.
Prinz, Schlange, And Asadullah (2011): The Bayer Parallel
The Begley and Ellis result did not appear in a vacuum. Six months earlier, in September 2011, Nature Reviews Drug Discovery had published a short correspondence by Florian Prinz, Thomas Schlange, and Khusru Asadullah, three scientists at Bayer HealthCare, titled “Believe it or not: How much can we rely on published data on potential drug targets?” The piece described an internal Bayer exercise structurally similar to Amgen’s. Bayer had surveyed its target-identification scientists about the success rate of in-house replication attempts on published external findings used as the basis for drug-discovery programs.
The Bayer survey covered 67 such projects, drawn primarily from oncology and women’s health (with cardiovascular as a smaller slice). The reported aggregate result: in only about 20–25% of the projects could the published findings be completely reproduced. In the remaining 75–80%, the findings were either partially reproduced (with major caveats that altered the strategic implications) or not reproduced at all.
The Bayer paper is structurally weaker than Begley and Ellis 2012 — it is a brief correspondence rather than a full commentary, the methodology is a survey of in-house scientists rather than a documented replication protocol, and the paper does not specify what counted as “completely reproducible.” But the qualitative finding lined up. Two of the largest pharmaceutical and biotechnology companies in the world, working independently with different sets of papers and different internal teams, had reached the same conclusion: the majority of preclinical findings they tried to build programs on did not hold up to internal replication.
The convergence between Bayer (2011) and Amgen (2012) is what made the broader narrative of a preclinical replication crisis stick. A single anecdotal report from one company could be explained away as a peculiarity of that company’s selection criteria or replication protocol. Two reports, from two of the most sophisticated industrial research operations in the world, with consistent qualitative findings, were much harder to dismiss. The combined message was that even with substantial technical capability, strong commercial motivation to succeed, and direct collaboration with the original authors, large fractions of headline preclinical findings could not be reproduced. That message marked the moment the broader scientific and biotech communities started taking the issue seriously.
The Bayer paper did not provide a detailed breakdown of which therapeutic areas or which types of studies were most problematic. Its DOI (10.1038/nrd3439-c1) resolves to the brief correspondence, which is paywalled but widely cited in the meta-research literature. Subsequent surveys — including the Freedman, Cockburn, and Simcoe 2015 estimate of the U.S. economic cost of irreproducible preclinical research, which we will return to below — have leaned on the Bayer and Amgen numbers as the foundational data points.
Reproducibility Project: Cancer Biology (2014–2022)
The Begley and Ellis paper had the limitations any internal industrial exercise has: opacity, no paper-level data, no independent verification. In 2013, two academic organizations — the Center for Open Science and Science Exchange — launched the Reproducibility Project: Cancer Biology (RP:CB) as an explicit attempt to do what Begley and Ellis had done, but openly, with pre-registered protocols, with full paper-level transparency, and with independent academic replicators rather than industrial researchers.
The project’s initial design called for the replication of 53 high-impact cancer biology papers published between 2010 and 2012, selected on the basis of citation counts and journal prestige. Each replication would be pre-registered as a Registered Report in eLife, with the experimental design and statistical analysis plan reviewed and approved before any data were collected. The replication experiments would be performed by independent academic laboratories using cell lines, reagents, and protocols agreed in advance with the original authors where possible.
The project ran from 2014 to 2022 and did not unfold as planned. The full timeline, documented in extensive eLife papers in 2021 and 2022, illustrates how difficult preclinical replication is even when an organized academic project is trying its hardest to do it right.
Findings from the project.
Of the 53 papers initially targeted, only 23 reached a completed replication attempt. For the other 30, the RP:CB team could not reach an agreement with the original authors on a replication protocol, could not obtain the original reagents, or could not reconstruct the original experimental conditions in sufficient detail to mount a meaningful replication attempt. The first finding of the project was, in effect, that more than half of the targeted high-impact cancer biology papers could not be replicated by an external team for procedural reasons alone. The obstacle was not that the findings were necessarily wrong, but that the published descriptions of methods were not sufficient to enable an independent team to reproduce the experiments.
Of the 23 papers for which replications were attempted, 50 individual experiments were performed and analyzed. Across these 50 experiments, the headline meta-analytic findings reported in eLife in 2021–2022 included:
- A median effect-size shrinkage of approximately 85% from the original to the replication. Effects that were reported in the original papers were substantially smaller in the replications — often consistent with no effect at all once the appropriate statistical comparison was performed.
- Only a minority of experiments produced clean replications. The exact “replication rate” depends on the metric used (significance, effect size, direction, magnitude); under the more permissive definitions, a larger fraction qualified, and under the stricter definitions (which approximate the question “would a drug-discovery program built on this finding succeed?”), the fraction was small.
- For the subset of experiments that did replicate, the replication effect sizes were systematically smaller than the originals — the same “decline effect” that has been documented in other replication contexts. A replicated finding is, on average, smaller than the original claim, even when the qualitative direction holds.
Errington et al.’s summary paper in eLife (DOI: 10.7554/eLife.71601, with the methodological companion paper at DOI: 10.7554/eLife.67995) framed the conclusion with appropriate caution. The replications had not “proven” the original findings false; they had demonstrated that the published evidence base was substantially weaker than the publication record suggested, and that the median effect size in the field was substantially smaller than the median published effect size implied. These are the same two qualitative claims that Begley and Ellis had made nine years earlier, with paper-level transparency that Begley and Ellis had been unable to provide.
The RP:CB experience also illuminated a structural feature of the field that Begley and Ellis had hinted at but not fully developed: the published methods sections of high-impact cancer biology papers were systematically insufficient to enable independent replication. This is not a charge of fraud; it is a charge that the publication norms of the field treat replicability as a low priority. The decision to drop more than half of the originally targeted papers because no agreed replication protocol could be constructed is, in some ways, the most damning finding of the entire project.
Freedman, Cockburn, And Simcoe (2015): The $28 Billion Economic Cost
The fourth critical paper in this cluster is Freedman, Cockburn, and Simcoe 2015, “The economics of reproducibility in preclinical research,” published in PLOS Biology (DOI: 10.1371/journal.pbio.1002165). Where Begley and Ellis had documented the failure rate and Prinz et al. had confirmed it from a second industrial source, Freedman and colleagues asked the dollar-and-cents follow-up question: what does the irreproducibility of preclinical biomedical research cost the U.S. economy?
Their estimate, based on a combination of survey data, expert elicitation, and decomposition of preclinical research spending across four categories (biological reagents and reference materials, study design, laboratory protocols, and data analysis and reporting), was that approximately 50% of U.S. preclinical research could not be reproduced as reported. The total U.S. spending on preclinical research, drawing on government and industry expenditure data, was approximately $56 billion per year as of the mid-2010s. Their headline calculation: the irreproducibility-attributable cost of U.S. preclinical research was approximately $28 billion per year.
The Freedman et al. number is a back-of-the-envelope estimate built on assumptions that can each be questioned. The 50% irreproducibility rate is itself an aggregate of disparate sub-estimates; the attribution of “cost” to irreproducibility (as opposed to, say, the inherent attrition of drug-discovery programs) involves judgment calls; the $56 billion total is sensitive to which spending categories are included. The authors are explicit about these caveats. But the order of magnitude — tens of billions of dollars per year of U.S. preclinical research spending that produces findings that cannot be reproduced — has been broadly accepted in the meta-research literature as a reasonable first-pass estimate.
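The headline arithmetic, and its sensitivity to the two inputs, can be sketched in a few lines. The 50% share and the $56 billion total come from the estimate described above. The alternative values in the grid are hypothetical, chosen here only to show how quickly the result moves when either assumption is varied.

```python
def irreproducibility_cost(spend_billion: float, irreproducible_share: float) -> float:
    """Annual spending (in $ billions) attributed to irreproducible research."""
    if not 0.0 <= irreproducible_share <= 1.0:
        raise ValueError("irreproducible_share must be between 0 and 1")
    return spend_billion * irreproducible_share

# Headline inputs as reported: about $56B per year, about 50% irreproducible.
headline = irreproducibility_cost(56.0, 0.50)
print(f"headline estimate: ${headline:.0f}B per year")

# Hypothetical alternative inputs, for sensitivity only (not from the paper).
for spend in (40.0, 56.0, 70.0):
    for share in (0.25, 0.50, 0.75):
        cost = irreproducibility_cost(spend, share)
        print(f"spend ${spend:.0f}B, share {share:.0%}: ${cost:.1f}B per year")
```

Across this illustrative grid the result stays in the tens of billions of dollars, which is the order-of-magnitude point the estimate is usually cited for.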
The strategic implication of the Freedman calculation is the one that should sit at the front of a biotech investor’s mind. The U.S. preclinical research enterprise produces a vast amount of published evidence, and a substantial fraction of that evidence — by Freedman’s estimate, half — does not hold up when independently replicated. The economic cost of that irreproducibility falls on the downstream actors who try to build drug programs, allocate venture capital, value biotech companies, or make portfolio decisions on the basis of the published preclinical record. The cost is not zero. It is in the tens of billions of dollars per year.
For a venture capitalist allocating a $100 million biotech fund across, say, 20 portfolio companies, each of which has built its scientific case on a small number of foundational preclinical papers, the Freedman framework implies that something like half of the portfolio is built on evidence that would not survive rigorous independent replication. The implication is not that biotech investing is a fool’s errand; the implication is that the diligence process needs to specifically interrogate the replication status of foundational findings, and the financial models need to account for replication risk as a first-order factor rather than a footnote.
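As a rough illustration of what “half of the portfolio” means in distributional terms, the sketch below treats each of the 20 companies as resting on one foundational finding that replicates with probability 0.5, independently of the others. Both simplifications are assumptions made here for illustration. Real portfolios rest on several papers per company, and outcomes within a therapeutic area are likely correlated.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def prob_at_most(k_max: int, n: int, p: float) -> float:
    """Probability of k_max or fewer successes in n independent trials."""
    return sum(binomial_pmf(k, n, p) for k in range(k_max + 1))

n_companies = 20      # portfolio size from the example above
p_replicates = 0.5    # Freedman-style share, applied per company (assumption)

expected_solid = n_companies * p_replicates
print(f"expected companies on replicable evidence: {expected_solid:.0f}")
print(f"P(5 or fewer replicate):  {prob_at_most(5, n_companies, p_replicates):.3f}")
print(f"P(15 or more replicate): {1 - prob_at_most(14, n_companies, p_replicates):.3f}")
```

The point of the exercise is the spread as much as the mean: under these assumptions, a portfolio where nearly everything replicates is about as unlikely as one where almost nothing does.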
Recommended Standards: What Begley And Ellis Asked The Field To Do
The 2012 Begley and Ellis commentary was not just a complaint. It included a specific list of methodological standards that the authors argued should become preconditions for publication of preclinical cancer biology findings. The list is short, technically specific, and aimed at the failure modes the Amgen team had repeatedly encountered in its replication attempts.
Blinded experiments. Many of the failed replications, the Amgen team had found, involved experiments where the researcher who scored the outcome knew which condition each sample came from. Blinding — the practice of having the outcome scored by a researcher who does not know the condition assignment — is standard in clinical trials and is technically feasible in most preclinical experiments. Begley and Ellis argued that preclinical cancer biology should adopt blinding as a default standard, particularly for experiments where the outcome is a subjective or semi-quantitative judgment (tumor scoring, immunohistochemistry assessment, behavioral assays).
Multiple cell lines. Many failed replications turned out to depend on the idiosyncratic behavior of a single cell line, often a line with known genetic instability or atypical biology. Findings reported in only one cell line are particularly vulnerable to cell-line-specific artifacts; findings that hold across multiple cell lines representing the relevant biological context are much more likely to be true and reproducible. Begley and Ellis recommended that preclinical cancer biology adopt a multi-cell-line standard for any finding being claimed as a general phenomenon, with the corollary that single-cell-line findings should be framed as preliminary rather than established.
Dose-response relationships. Many drug-target findings were reported at a single dose, with no systematic exploration of the dose-response curve. A genuine pharmacological effect typically shows a coherent dose-response — a smoothly varying response that increases with dose up to a saturation point, with a measurable EC50 or IC50. Findings that work at one specific dose but show no coherent dose-response are more likely to be off-target artifacts or chance findings than to be genuine pharmacological effects. Begley and Ellis recommended that preclinical drug-target findings include dose-response data as a default standard.
Multiple investigators. Many failed replications had been performed by a single graduate student or postdoc, with no independent confirmation within the originating lab. Findings that depend on a single person’s hands have a much higher rate of being technique-specific artifacts than findings that have been independently reproduced by multiple investigators within the same group. Begley and Ellis recommended that key findings be confirmed by at least two independent investigators before publication.
Pre-planned statistical analysis. Many failed replications turned out, on inspection, to involve post-hoc statistical analyses where the choice of test, the choice of exclusion criteria, or the choice of subgroup had been influenced by looking at the data. Pre-specification of the analysis plan — ideally pre-registered, or at minimum documented internally before data collection — eliminates this source of false positives. Begley and Ellis recommended pre-planned statistical analysis as a default standard for preclinical experiments, with particular attention to pre-specifying exclusion criteria, primary outcome measures, and the planned statistical tests.
Each of the five standards is technically feasible, cheap to implement, and individually capable of substantially reducing the false-positive rate. The combination of all five would, by Begley and Ellis’s argument, produce a published preclinical literature whose findings could be replicated at substantially higher rates than the current 11–25% baseline.
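To make the dose-response standard concrete, the sketch below uses the Hill equation, a common pharmacological model of a saturating dose-response curve. The commentary does not prescribe a particular model, so the choice of the Hill form, the parameter names, and the example doses are all assumptions made here for illustration.

```python
def hill_response(dose: float, ec50: float, hill_slope: float = 1.0,
                  bottom: float = 0.0, top: float = 1.0) -> float:
    """Hill-equation response at a given dose (same units as ec50)."""
    if dose < 0 or ec50 <= 0:
        raise ValueError("dose must be non-negative and ec50 must be positive")
    if dose == 0:
        return bottom
    return bottom + (top - bottom) / (1 + (ec50 / dose) ** hill_slope)

def is_monotonic_increasing(responses: list[float]) -> bool:
    """Crude coherence check: response never falls as dose rises."""
    return all(later >= earlier for earlier, later in zip(responses, responses[1:]))

# Hypothetical dose series spanning four orders of magnitude around the EC50.
doses = [0.1, 1.0, 10.0, 100.0, 1000.0]
curve = [hill_response(d, ec50=10.0) for d in doses]
for d, r in zip(doses, curve):
    print(f"dose {d:>7.1f}: response {r:.3f}")
print("coherent dose-response:", is_monotonic_increasing(curve))
```

A finding reported at a single dose supplies one point on such a curve, which is not enough to tell a saturating pharmacological effect apart from a one-off artifact.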
In the years since the 2012 commentary, adoption of these standards has been uneven. Some top journals (including Nature and Science) have introduced checklists that require authors to specify whether key experiments were blinded, whether multiple replicates were performed, and whether statistical methods were pre-specified. Some funders (NIH most prominently, through its Rigor and Reproducibility requirements introduced in 2016) have made adherence to versions of these standards a condition of grant funding. The penetration of the standards into the working norms of preclinical labs has been slower than the policy infrastructure suggests, however. RP:CB, although its final reports appeared in 2021–2022, examined papers published between 2010 and 2012, so it cannot by itself show whether replication rates have improved since; a measurable field-wide improvement has not yet been demonstrated.
The lesson for a strategist evaluating a preclinical finding in the mid-2020s is that you cannot assume the standards have been applied. A “landmark paper” in a top journal in 2024 is not, in the absence of specific evidence to the contrary, dramatically more replicable than a landmark paper in 2010 was. You have to check.
Strategist Takeaway: Discount Rates For Preclinical Evidence
For a strategist whose decisions depend on preclinical biomedical evidence — a venture capitalist evaluating biotech investments, a pharmaceutical business-development executive evaluating in-licensing opportunities, an equity analyst valuing an oncology pipeline, a foundation program officer funding translational research — the Begley and Ellis cluster of findings translates into a practical adjustment to how the evidence base should be read.
The adjustment is not “ignore preclinical research.” Preclinical research is the foundation of the entire drug-discovery enterprise, and a substantial fraction of it is correct, replicable, and predictive of clinical outcomes. The adjustment is to apply a calibrated discount rate that reflects the empirical replication base rate, and to weight specific structural features of a given finding when deciding how aggressively to discount it.
The base rate is roughly 10–25% for landmark preclinical findings in oncology. That is what the Amgen number reported by Begley and Ellis, the Bayer number reported by Prinz et al., and the RP:CB final reports collectively imply. The prior probability that a given high-impact preclinical cancer biology finding will fully replicate, before anything else is known about it, is somewhere in this range. This is the starting point. It is much lower than the implicit prior that most strategic actors operate with, and that gap between the assumed prior and the empirical prior is the source of most of the strategic mistakes that get made on the basis of preclinical evidence.
Adjust upward for findings that meet the Begley standards. A finding that has been blinded, replicated across multiple cell lines, characterized with a coherent dose-response, confirmed by multiple independent investigators, and analyzed with a pre-specified statistical plan is substantially more likely to replicate than a finding that lacks these features. The published paper may not always make clear which standards were applied; this is where a thorough diligence process pays off. Read the methods section. Read the supplementary materials. If the standards are not visibly applied, the discount is the empirical base rate; if the standards are applied, the discount is meaningfully smaller.
Adjust upward for findings that have been independently replicated. The single most powerful signal that a preclinical finding is real is independent replication by a group with no stake in the original. If a finding was published in 2018 and an independent academic group or a different industrial lab has reproduced the key experiments by 2026, the finding has cleared a major epistemic bar. Findings that have not been independently replicated should be discounted to the base rate; findings that have been should be discounted much less.
Adjust downward for findings that depend on a single cell line, a single laboratory, or a single principal investigator. The most common failure mode the Amgen team encountered was a finding that turned out to be specific to one lab’s particular cell line or one investigator’s particular technique. Findings whose published support comes entirely from a single source warrant a discount steeper than the base rate.
Adjust downward for findings in hot, competitive fields. Corollary 6 of Ioannidis 2005 applies with full force to preclinical biology. When many groups are racing to publish in a hot area (a new oncology target, a new mechanism of cancer stemness, a new immunotherapy modality), the first papers tend to be the most extreme positive results, and the field’s headline findings tend to decline over time. Apply a steeper discount to findings in hot fields than to findings in established subfields with stable research programs.
Adjust downward for findings with strong financial or career interests aligned with the result. Findings reported by founders of biotech companies, by groups with substantial industry funding for the specific finding, or by researchers whose career trajectory is tied to a particular result warrant the steeper discount that Corollary 5 of Ioannidis 2005 describes. This is not a charge of dishonesty; it is the structural adjustment that the empirical literature on funding-source effects supports.
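One way to turn the adjustments above into a number is to start from the base rate, convert it to odds, and multiply by a factor for each structural feature of the finding. The sketch below does this. The multipliers are hypothetical placeholders chosen only to show the mechanics; none of the cited papers supplies such values, and a real diligence process would have to calibrate them.

```python
def adjust_probability(base_rate: float, multipliers: list[float]) -> float:
    """Apply odds multipliers to a base-rate probability and return a probability."""
    if not 0.0 < base_rate < 1.0:
        raise ValueError("base_rate must be strictly between 0 and 1")
    odds = base_rate / (1.0 - base_rate)
    for m in multipliers:
        odds *= m
    return odds / (1.0 + odds)

# Hypothetical multipliers (illustrative only, not drawn from any source).
MULTIPLIERS = {
    "meets_begley_standards": 2.0,   # > 1 adjusts upward
    "independently_replicated": 4.0,
    "single_cell_line_or_lab": 0.5,  # < 1 adjusts downward
    "hot_competitive_field": 0.7,
    "aligned_financial_interest": 0.7,
}

base_rate = 0.20  # within the 10-25% range discussed above

replicated = adjust_probability(base_rate, [MULTIPLIERS["independently_replicated"]])
fragile = adjust_probability(
    base_rate,
    [MULTIPLIERS["single_cell_line_or_lab"], MULTIPLIERS["hot_competitive_field"]],
)
print(f"base rate:                        {base_rate:.0%}")
print(f"independently replicated finding: {replicated:.0%}")
print(f"single-lab finding in hot field:  {fragile:.0%}")
```

Working in odds keeps the result between 0 and 1 however many adjustments are stacked, which a simple percentage-point bonus or penalty would not.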
The single most useful operational rule that falls out of the Begley and Ellis cluster is the same as the rule that falls out of Ioannidis 2005, applied to a specific field: do not commit substantial capital to a preclinical-evidence-based program on the basis of a single published paper, no matter how prestigious the journal or how famous the author. Wait for independent replication. If a finding is real and important, an independent group will reproduce it. If no independent group has reproduced it, treat the finding as a hypothesis that may not survive a rigorous test. The cost of this rule, applied consistently, is delay on a small number of programs that would have worked out anyway. The cost of not applying it, accumulated across a portfolio of decisions, is substantial — by Freedman, Cockburn, and Simcoe’s estimate, in the tens of billions of dollars per year at the national level, and at a portfolio level for any individual investor, large enough to matter materially.
Sources
- Begley, C. G., & Ellis, L. M. (2012). Drug development: Raise standards for preclinical cancer research. Nature, 483(7391), 531–533. DOI: 10.1038/483531a.
- Prinz, F., Schlange, T., & Asadullah, K. (2011). Believe it or not: How much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery, 10(9), 712. DOI: 10.1038/nrd3439-c1.
- Errington, T. M., Mathur, M., Soderberg, C. K., Denis, A., Perfito, N., Iorns, E., & Nosek, B. A. (2021). Investigating the replicability of preclinical cancer biology. eLife, 10, e71601. DOI: 10.7554/eLife.71601.
- Errington, T. M., Denis, A., Perfito, N., Iorns, E., & Nosek, B. A. (2021). Reproducibility in cancer biology: Challenges for assessing replicability in preclinical cancer biology. eLife, 10, e67995. DOI: 10.7554/eLife.67995.
- Freedman, L. P., Cockburn, I. M., & Simcoe, T. S. (2015). The economics of reproducibility in preclinical research. PLOS Biology, 13(6), e1002165. DOI: 10.1371/journal.pbio.1002165.
- Begley, C. G. (2013). Reproducibility: Six red flags for suspect work. Nature, 497(7450), 433–434. DOI: 10.1038/497433a.
- Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2(8), e124. DOI: 10.1371/journal.pmed.0020124.
- Collins, F. S., & Tabak, L. A. (2014). Policy: NIH plans to enhance reproducibility. Nature, 505(7485), 612–613. DOI: 10.1038/505612a.
Related Reading
- Ioannidis 2005: “Why Most Published Research Findings Are False” Landmark — the Bayesian framework that explains why preclinical replication rates land where Begley and Ellis observed them.
- OSC 2015: The Reproducibility Project Of Psychological Science — the parallel coordinated replication effort in psychology, with a methodologically similar finding pattern.
- Lehrer’s “Decline Effect” (2010): When Findings Shrink Over Time — the phenomenon of effect-size shrinkage that Begley and Ellis and RP:CB both documented in preclinical biology.
- Publication Bias And The File-Drawer Problem — the structural mechanism that drives the inflated effect sizes in the original publications and the failure-to-replicate in subsequent attempts.
- PREDIMED And The Mediterranean-Diet Retraction — a parallel case in nutritional epidemiology where headline findings did not survive a more rigorous re-analysis.
Frequently Asked Questions
Did Begley and Ellis publish a list of the 53 papers Amgen tried to replicate?
No, and this has been the most substantive methodological criticism of the 2012 commentary. The Amgen team had confidentiality obligations to the original authors who had cooperated with the replication attempts, and publishing a “list of papers that did not replicate” would have been incompatible with the collaborative spirit of the replication exercise. The aggregate statistic therefore depends on Begley’s good-faith reporting and Amgen’s institutional credibility. The Reproducibility Project: Cancer Biology, by contrast, was designed from the outset to be transparent at the level of individual papers, and its final reports, published in eLife in 2021, specify exactly which experiments were attempted and what the outcomes were. The two efforts together (Begley and Ellis’s aggregate-level data and RP:CB’s paper-level data) converge on the same qualitative finding.
Is the 89% failure rate the same as saying 89% of cancer biology is fraudulent?
No, and this is an important distinction. Failure to replicate is not the same as fraud. The Begley and Ellis paper describes failure modes that include: missing control experiments, cell-line-specific artifacts, undisclosed experimental conditions, post-hoc statistical analyses, and selective reporting of positive results from a larger set of mixed experiments. Some of these failure modes shade into research misconduct, but most of them are better described as ordinary methodological weakness amplified by publication incentives. The aggregate statistic reflects a structural failure of the field’s norms, not a claim that 89% of cancer biologists are dishonest.
Has the situation improved since 2012?
Modestly. The NIH introduced Rigor and Reproducibility requirements in 2016. Top journals have introduced methodological checklists. Some funders require pre-registration for confirmatory studies. The Reproducibility Project: Cancer Biology, which began publishing in 2014 and issued its final reports in 2021, found broadly the same pattern Begley and Ellis described: substantial effect-size shrinkage, low rates of clean replication, and structural barriers to even attempting replication. The policy infrastructure has improved; the working norms of preclinical labs have improved less. The empirical replication base rate in the mid-2020s is probably somewhat better than it was in 2012, but not dramatically so.
Is preclinical cancer biology specifically bad, or is this representative of preclinical biomedical research more broadly?
Preclinical cancer biology is the field where the replication failure rate has been most carefully measured, but the available evidence from other preclinical biomedical fields (cardiovascular biology, neurobiology, metabolic disease) suggests broadly similar patterns. The Prinz et al. 2011 paper covered oncology and cardiovascular projects with similar findings across both. The Reproducibility Project: Cancer Biology focused on cancer biology specifically. The Freedman et al. 2015 economic estimate covers preclinical biomedical research broadly. The pattern is not unique to cancer biology, but cancer biology is where it has been most thoroughly documented.
What is the single most useful operational rule from this cluster of findings?
For any strategic decision that depends on a published preclinical finding, do not commit substantial capital before the finding has been independently replicated by a group with no stake in the original. The base rate for “first paper replicates cleanly” in high-impact preclinical cancer biology is roughly 10–25%. The probability that a finding holds up once an unaffiliated group has already reproduced it is much higher. Waiting for independent replication adds delay to a small number of programs that would have worked out; it eliminates a large fraction of programs that would have failed. For portfolio-level allocation across many decisions that rest on preclinical evidence, the rule pays for itself many times over.
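Why an independent replication moves the odds so much can be seen with a small Bayes'-rule sketch. The prior is taken from the 10–25% range quoted above. The replication's power (0.8) and false-positive rate (0.05) are conventional textbook values assumed here for illustration, not figures reported by Begley and Ellis or RP:CB, and the function name is hypothetical.

```python
def posterior_after_replication(prior, power, false_positive_rate):
    """Probability a finding is real, given one successful independent replication.

    prior: probability the finding was real before the replication
    power: chance the replication succeeds when the finding is real
    false_positive_rate: chance it succeeds when the finding is not real
    """
    true_pos = prior * power
    false_pos = (1.0 - prior) * false_positive_rate
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    for prior in (0.11, 0.25):
        post = posterior_after_replication(prior, power=0.8,
                                           false_positive_rate=0.05)
        print(f"prior {prior:.2f} -> posterior {post:.2f}")
```

With these assumed inputs, a prior of 0.11 rises to about 0.66 and a prior of 0.25 rises to about 0.84 after a single successful replication. A less rigorous replication (lower power or a higher false-positive rate) would move the estimate less.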
How does this relate to the broader replication crisis in other fields?
The Begley and Ellis cluster is the preclinical-biology arm of a phenomenon that has been documented across many empirical fields. The underlying mechanism is the same (small studies, large flexibility in methods, publication incentives toward positive results) and the empirical findings are broadly consistent. The specific replication rates vary by field and are not measured on identical criteria (psychology at roughly 36% by OSC 2015; cancer biology at roughly 11% by Begley and Ellis), but the qualitative picture is the same. The cross-field consistency is itself the most important fact. The replication crisis is not a peculiarity of one field’s methods. It is a structural property of any research enterprise that publishes mostly positive findings while allowing large flexibility in design and analysis: such an enterprise will tend to produce a literature whose published claims systematically overstate the effects they describe.
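The structural claim can be illustrated with a small simulation of a significance filter. This is a sketch under simplifying assumptions, not a model of any real literature: every study measures the same small true effect, each uses a two-group design with known unit variance (so a z-test applies), and only positive results with z above 1.96 are “published”. The function name and all parameter values are hypothetical.

```python
import random
import statistics


def simulate_published_effects(true_effect, n_per_group, n_studies, seed=0):
    """Simulate many small studies and keep only the 'significant' ones.

    Returns (all_observed_effects, published_effects).
    """
    rng = random.Random(seed)
    # Standard error of a difference in means when both groups have sd = 1.
    se = (2.0 / n_per_group) ** 0.5
    all_effects, published = [], []
    for _ in range(n_studies):
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        observed = statistics.fmean(treated) - statistics.fmean(control)
        all_effects.append(observed)
        if observed / se > 1.96:  # positive and significant: gets published
            published.append(observed)
    return all_effects, published


if __name__ == "__main__":
    TRUE_EFFECT = 0.2  # assumed small true effect, in sd units
    everything, published = simulate_published_effects(TRUE_EFFECT, 20, 2000)
    print(f"true effect:            {TRUE_EFFECT:.2f}")
    print(f"mean of all studies:    {statistics.fmean(everything):.2f}")
    print(f"mean of published only: {statistics.fmean(published):.2f}")
    print(f"share published:        {len(published) / len(everything):.1%}")
```

Because the significance threshold sits well above the true effect when studies are small, every published estimate in this setup overstates the effect, and a faithful replication will tend to find something smaller. No misconduct is needed to produce the shrinkage; selection on positive results is enough.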