In August 2005, the open-access journal PLOS Medicine published an eight-page paper by John P. A. Ioannidis, an epidemiologist then based at the University of Ioannina in Greece and at Tufts University, and later at Stanford. The title was a sentence: “Why Most Published Research Findings Are False.” It did not say “may be” or “could be” or “are sometimes.” It said most published findings are, in fact, false. The paper’s central argument was a piece of elementary Bayesian reasoning carried to its conclusion, applied to the structural conditions of contemporary research, and presented in a journal whose editors had decided that the title was worth defending.
It became the most-downloaded paper in PLOS Medicine’s history. As of the mid-2020s the article has been viewed well over three million times, cited in tens of thousands of subsequent papers, and discussed in venues ranging from The Economist and The Atlantic to congressional briefings and pharmaceutical-industry retreats. It is, by any reasonable measure, one of the most influential single papers published in any biomedical journal this century.
Its influence does not come from a new statistical result; the math in the 2005 paper is undergraduate-level probability. It comes from the fact that Ioannidis took a piece of reasoning that statisticians had been making to each other in technical journals for decades and translated it into a frame that was forceful enough, clear enough, and credentialed enough to break out of the methodologists’ guild and into the working consciousness of bench scientists, editors, and journalists.
The argument has a shape. If you understand the shape, you understand most of what the broader replication crisis is about. You also acquire something more practically useful: a Bayesian discount rate to apply to any “significant” research finding you encounter, calibrated to the structural conditions of the field that produced it.
This article walks through what Ioannidis actually argued, why the argument matters for the broader replication crisis, how it connects to the empirical reproducibility projects that followed, what extensions Ioannidis himself has made in the two decades since, where the paper has been honestly criticized, and how a working strategist can apply the framework when evaluating any claim that begins with “studies show.”
The Title That Shook Publishing
The first thing to understand about Ioannidis (2005) is that the title was not hyperbole, and it was not a marketing move. It was the literal conclusion of the paper’s central argument, stated in the form that the math actually supported.
Ioannidis began the paper by noting that the credibility of any published research finding depends not only on the p-value the paper reports, but also on the prior probability that the hypothesis being tested is true, the statistical power of the study, and the various forms of bias that may have shaped the analysis. This was not a controversial claim in 2005. Bayesian statisticians had been pointing it out since at least the 1960s. What was different about Ioannidis’s paper was the move he made next: he treated the prior probability, the power, and the bias not as abstractions to be acknowledged in a footnote, but as plug-in parameters that could be estimated for real fields of research. The results were not flattering.
The paper appeared in PLOS Medicine, then a young open-access journal that had launched in late 2004. The editors made a deliberate decision to publish a paper whose title made an unhedged empirical claim about the entire body of published medical research. Virginia Barbour and her co-editors took the bet that the math was correct, that the audience could handle it, and that the conversation it would provoke was worth having in a medical journal rather than a methodology journal. That editorial decision has been vindicated several times over. PLOS Medicine still hosts the paper on its open-access platform; the DOI 10.1371/journal.pmed.0020124 resolves to a stable URL that has been linked to from approximately every methodology syllabus on Earth.
The reaction was immediate and bifurcated. Methodologists and a subset of clinical researchers read the paper with the recognition of “yes, this is what we have been saying, but stated more clearly than we have managed.” Bench scientists and a subset of editors read it with annoyance — the title felt aggressive, the argument felt to some like a generalization based on a model rather than data. The most common version of the dismissal was: “but my findings are real.”
The argument did not depend on any individual finding being false. It depended on the structural properties of how research is conducted across an entire field. It was a statement about the base rate.
Ioannidis’s Bayesian Framework, In English
Strip the math out of the paper and the argument is this. When you run a study and report a “significant” finding, you are not telling the reader the probability that your finding is true. You are telling them the probability that you would have observed evidence at least this strong assuming the null hypothesis is true. These are different quantities. They are not even close to being the same quantity in many practical situations.
The probability that a “significant” finding is actually true — what statisticians call the positive predictive value (PPV) of a significant result — depends on three things beyond the p-value itself:
- The prior odds that the hypothesis being tested is true. Call this R, the ratio of true relationships to no-relationships among the hypotheses tested in a field; the corresponding prior probability is R / (R + 1). In a high-quality randomized trial testing a strongly-grounded mechanistic hypothesis, R might be close to 1 (even odds). In a fishing expedition through a large gene database for associations with a disease, R might be 1 in 10,000.
- The statistical power of the study. Power is the probability of detecting a real effect when one exists. A study with 80% power will detect 80 out of every 100 real effects it tests. A study with 20% power will detect only 20. Underpowered studies miss real effects while still producing false positives at the rate set by the significance threshold, which means that the “significant” findings they do produce are disproportionately likely to be false positives.
- The bias in the analysis. Bias here is a catch-all for any factor that increases the probability of reporting a significant finding when one does not exist. It includes outcome-switching, p-hacking, selective reporting of analyses, multiple-comparisons inflation, and a long list of design choices that operate in the same direction. Ioannidis used the parameter u to represent the fraction of non-true relationships that are claimed as true due to bias.
Plug these into Bayes’s theorem and you get a clean formula for the post-study probability that a “significant” finding is true. Ioannidis worked it out for a range of plausible values:
- In a well-conducted randomized trial of a mechanistically grounded hypothesis (high prior, high power, low bias), PPV can be around 85%. Most significant findings in this regime are true.
- In an underpowered observational study testing a speculative hypothesis (low prior, low power, moderate bias), PPV can fall well below 50%. Most significant findings in this regime are false.
- In a typical hypothesis-generating exploration through a large dataset (very low prior, modest power, high flexibility in design choices), PPV can fall into the single digits. The overwhelming majority of significant findings in this regime are false positives.
This is the formula behind the title. The claim is not “scientists lie” or “the literature is fraudulent.” The claim is that under realistic assumptions about the proportion of true hypotheses tested, the typical power of studies in many fields, and the structural pressures toward bias, the math produces PPVs below 50%. And below-50% PPV means: most published “significant” findings in that field are false.
The argument is mechanical. The conclusion follows from arithmetic. The only question is whether the assumptions about prior, power, and bias are reasonable for any given field. Ioannidis’s claim was that for many fields — including substantial portions of biomedical research, social psychology, nutritional epidemiology, and genomics-era association studies — the assumptions were entirely reasonable, and the conclusion therefore held.
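The arithmetic can be checked directly. The sketch below implements the post-study probability with the three inputs described above (R, power, and the bias parameter u) at a significance threshold of 0.05. The function name and the three scenario settings are illustrative assumptions chosen to match the regimes listed earlier; they are not estimates for any specific field.

```python
def ppv(prior_odds, power, bias=0.0, alpha=0.05):
    """Post-study probability that a claimed finding is true.

    prior_odds : R, ratio of true to null relationships among those tested
    power      : 1 - beta, chance a study detects a true relationship
    bias       : u, share of would-be negative analyses reported as positive
    alpha      : significance threshold (false-positive rate with no bias)
    """
    beta = 1.0 - power
    # Claimed findings that come from true relationships
    # (counted per one null relationship tested).
    true_claims = prior_odds * (power + bias * beta)
    # Claimed findings that come from null relationships.
    false_claims = alpha + bias * (1.0 - alpha)
    return true_claims / (true_claims + false_claims)


# Illustrative settings (assumptions), one per regime described in the text.
scenarios = {
    "well-powered RCT, even prior odds": dict(prior_odds=1.0, power=0.80, bias=0.10),
    "underpowered study, speculative hypothesis": dict(prior_odds=0.2, power=0.20, bias=0.20),
    "exploratory search, massive testing": dict(prior_odds=0.001, power=0.20, bias=0.80),
}

for label, params in scenarios.items():
    print(f"{label}: PPV = {ppv(**params):.3f}")
```

With these inputs the first regime lands near 85%, the second well below 50%, and the third far below 1%. Changing any one input while holding the others fixed is a quick way to see which assumption is driving the result.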
The Six Corollaries
The bulk of the 2005 paper consists of six corollaries that fall directly out of the Bayesian framework. Each corollary describes a structural feature of how research is conducted in some fields, and explains why that feature predicts a higher rate of false positives. Together, they describe the conditions under which the replication crisis was inevitable.
Corollary 1: The smaller the studies conducted in a scientific field, the less likely the research findings are to be true.
Small studies have low statistical power. Low power means a higher fraction of real effects are missed, but it also means that when an effect is detected, the magnitude of the detected effect is biased upward (the “winner’s curse” phenomenon). The combination produces a literature with a high false-positive rate and inflated effect estimates even among the findings that are real. Fields that publish primarily small studies (much of social psychology pre-2015, many fMRI studies of the 2000s, single-laboratory drug-screening studies) are vulnerable.
Corollary 2: The smaller the effect sizes in a scientific field, the less likely the research findings are to be true.
For a fixed sample size, a smaller true effect means lower statistical power; detecting it reliably requires a larger sample. If the true effects in a field are small (which is the case in most social and behavioral sciences, and in much of nutritional epidemiology and clinical effectiveness research), then the same sample sizes produce lower power, and the same Bayesian arithmetic produces lower PPV. The candidate-gene-association literature of the late 1990s and early 2000s is the most famous illustration: the field was looking for small associations, was underpowered, and produced a literature whose published findings overwhelmingly failed to replicate when adequately powered studies finally arrived.
Corollary 3: The greater the number and the lesser the selection of tested relationships in a scientific field, the less likely the research findings are to be true.
Hypothesis-generating fields that test thousands or millions of relationships (genomics, neuroimaging, exploratory epidemiology) face a Bayesian penalty. If you test 100,000 hypotheses and only 100 of them are true, then your prior is 1 in 1,000. With standard p < 0.05 significance criteria and typical power, the vast majority of your “significant” findings will be the roughly 5,000 false positives (0.05 × 99,900) from the 99,900 null hypotheses, because even perfect power could contribute at most 100 true positives. The advent of multiple-comparisons-corrected significance thresholds (genome-wide significance at p < 5 × 10⁻⁸, false discovery rate corrections in neuroimaging) is a response to this corollary. Fields that have not adopted such corrections (and there are still many) sit at high false-positive rates.
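The counts in this example are easy to reproduce. The sketch below uses the numbers from the paragraph above and adds one assumption that the text does not specify: 80% power. It then repeats the calculation at the genome-wide threshold, holding power fixed, which is an optimistic simplification because stricter thresholds reduce power unless samples grow.

```python
n_tested = 100_000   # hypotheses tested (from the example above)
n_true = 100         # hypotheses that are actually true (from the example above)
alpha = 0.05         # conventional significance threshold
power = 0.80         # assumption: not specified in the text

n_null = n_tested - n_true

# Expected counts at p < 0.05.
false_positives = alpha * n_null          # about 4,995
true_positives = power * n_true           # 80
ppv_uncorrected = true_positives / (true_positives + false_positives)

# Same calculation at the genome-wide threshold, with power held fixed
# (assumption: in practice power falls unless sample sizes increase).
alpha_gw = 5e-8
false_positives_gw = alpha_gw * n_null
ppv_corrected = true_positives / (true_positives + false_positives_gw)

print(f"expected false positives at 0.05: {false_positives:.0f}")
print(f"expected true positives:          {true_positives:.0f}")
print(f"PPV at p < 0.05:                  {ppv_uncorrected:.3f}")
print(f"PPV at p < 5e-8 (power fixed):    {ppv_corrected:.5f}")
```

At the conventional threshold, fewer than 2 in 100 significant results are real in this setup. The stricter threshold reverses that by cutting the expected number of false positives to far less than one.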
Corollary 4: The greater the flexibility in designs, definitions, outcomes, and analytical modes in a scientific field, the less likely the research findings are to be true.
This is the corollary that anticipates the entire later literature on “researcher degrees of freedom” and “p-hacking.” A field where researchers can choose among many measures of the outcome, many specifications of the predictor, many subsets of the data, and many statistical models has, in effect, a hidden multiple-comparisons problem. Each “free” choice that is conditioned on the data is another implicit test. The published p-value reflects only the final reported test; the underlying effective rate of false positives is higher, often substantially. The 2011 Simmons, Nelson, and Simonsohn paper “False-Positive Psychology” formalized this argument with simulations; Ioannidis had already named it as Corollary 4 six years earlier.
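A rough sense of the size of this hidden multiple-comparisons problem comes from a simplified model: suppose a researcher can run k analyses of a true-null effect and reports the best one. The sketch below assumes the k analyses are independent, which real analytic choices on the same data are not, so treat the numbers as an illustration of direction rather than an estimate. The function names are hypothetical.

```python
import random


def chance_of_any_false_positive(n_analyses, alpha=0.05):
    """Probability that at least one of n independent null tests has p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_analyses


def simulate(n_analyses, alpha=0.05, n_studies=20_000, seed=0):
    """Monte Carlo check: under the null, each p-value is uniform on [0, 1]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # The researcher keeps only the smallest p-value among the analyses.
        best_p = min(rng.random() for _ in range(n_analyses))
        if best_p < alpha:
            hits += 1
    return hits / n_studies


for k in (1, 5, 20):
    exact = chance_of_any_false_positive(k)
    approx = simulate(k)
    print(f"{k:>2} analyses: exact {exact:.3f}, simulated {approx:.3f}")
```

Under this model the nominal 5% false-positive rate becomes roughly 23% with five analytic options and roughly 64% with twenty, even though each reported p-value still reads as p < 0.05.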
Corollary 5: The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true.
Financial and ideological interests shift the bias parameter u upward. Industry-sponsored drug trials systematically produce more favorable findings than independently sponsored trials of the same drugs; this is one of the most replicated meta-findings in the biomedical literature. Beyond direct financial interests, ideological prior commitments and career incentives operate in the same direction. Fields where the typical researcher has a substantial financial or reputational stake in finding a particular result will have higher bias and therefore lower PPV.
Corollary 6: The hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true.
This is the corollary that is most often misunderstood, and it is also the one that should worry working professionals the most. When many independent teams race to publish in a hot field, the team that publishes first will frequently be the one that got the luckiest. With many teams running similar studies, the variance of results across teams is large; the most extreme positive result is the one that gets accepted to a top journal first; the median result is not what gets published. The same Bayesian arithmetic that drives Corollary 4 applies — many implicit tests across many teams — and the headline first paper in a hot field has, on average, an inflated effect size that subsequent work cannot replicate. The early history of fields like nutrigenomics, microbiome-health associations, and (more recently) GLP-1-everything research has shown this pattern repeatedly.
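The many-teams effect can be written down with the same arithmetic. The sketch below asks: if n teams test the same relationship and a finding is claimed as soon as any one of them reaches significance, how likely is the claim to be true? It assumes the teams are independent, equally powered, and free of bias. The prior odds of 1 in 4 and the 50% power are illustrative assumptions, and the function name is hypothetical.

```python
def ppv_any_team(prior_odds, power, n_teams, alpha=0.05):
    """PPV when a finding is claimed if at least one of n teams gets p < alpha.

    Assumes independent, equally powered teams and no bias.
    """
    beta = 1.0 - power
    # Chance that at least one team detects a true relationship.
    p_claim_if_true = 1.0 - beta ** n_teams
    # Chance that at least one team gets a false positive on a null relationship.
    p_claim_if_null = 1.0 - (1.0 - alpha) ** n_teams
    true_claims = prior_odds * p_claim_if_true
    return true_claims / (true_claims + p_claim_if_null)


# Illustrative assumptions: prior odds 1:4, each team at 50% power.
for n in (1, 2, 5, 10):
    print(f"{n:>2} teams: PPV = {ppv_any_team(0.25, 0.50, n):.3f}")
```

With these inputs the PPV falls from about 71% with a single team to under 40% with ten, because the chance that somebody gets a false positive grows faster than the chance that somebody detects a real effect.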
Each of the six corollaries makes the same Bayesian point applied to a different structural feature. Together they describe the conditions of much of modern science. The 2005 paper was, in this sense, a generalization rather than a discovery. But the generalization had not been made before in a form that the broader research community had been forced to confront.
Why The Paper Matters for the Replication Crisis
The replication crisis as a contemporary phenomenon (the recognition that large fractions of published findings in psychology, biomedicine, economics, and other empirical fields fail to replicate) did not begin with Ioannidis (2005). The methodological concerns it expressed had been articulated by statisticians for decades. Sterling (1959) had described the publication-bias problem; Meehl (1967, 1978) had articulated the structural pathologies of social-psychological null-hypothesis testing; Cohen (1962, 1988) had been documenting the chronic underpowering of behavioral research for more than four decades by the time Ioannidis published.
But Ioannidis’s paper became the load-bearing citation for the replication crisis because it did three things that the prior literature had not managed to do together.
First, it integrated the separate methodological critiques into a single Bayesian framework. Sterling’s publication-bias concern, Meehl’s structural critique, and Cohen’s power critique had been three different conversations in three different sub-literatures. Ioannidis showed that they were three corollaries of the same underlying Bayesian arithmetic. This integration made the critique portable: one could now say, in a single paragraph, what previously required reading three separate methodological literatures.
Second, it stated the conclusion in language that resisted hedging. The title was unhedged. The conclusion section was unhedged. The paper did not say “more research is needed to determine whether published findings are reliable.” It said most are not. This rhetorical choice did substantial work. It is much harder to ignore a paper whose title is a declarative empirical claim than to ignore a paper whose title is “Toward a Bayesian Framework for the Evaluation of Published Findings.”
Third, the author was credentialed enough that the dismissal-by-credential move did not work. Ioannidis was an academic epidemiologist, affiliated at the time with the University of Ioannina and Tufts University, with a substantial publication record in mainstream biomedical journals. He was not a contrarian outsider; he was an insider making the critique from within the credentialing system. This made the critique difficult to dismiss as “an outsider doesn’t understand how science actually works.” It was a card-carrying member of the system saying, with full credentialing weight, that the system was producing predominantly false positives.
The combination of these three factors meant that within a decade, “Ioannidis 2005” had become the standard short-form citation that anyone in any empirical field could invoke when articulating concerns about reliability. Methodologists had been making the underlying points for fifty years. Ioannidis made them in a form that the rest of the research community could no longer ignore.
This is why the replication crisis, when it broke open as a public phenomenon in the early 2010s, took the form it did. The Daryl Bem precognition paper, the Diederik Stapel fraud case, the failed replications of priming effects, the multi-lab attempts at psychological replication — all of these episodes were interpreted through the framework Ioannidis (2005) had provided. The data points were psychological; the interpretive frame was epidemiological and Bayesian. Without the Ioannidis paper, the same data points would have produced a different and probably narrower conversation.
Connection to the Open Science Collaboration’s 2015 Project
The empirical companion piece to Ioannidis’s theoretical argument is the Open Science Collaboration’s 2015 paper “Estimating the reproducibility of psychological science,” published in Science (DOI: 10.1126/science.aac4716). Where Ioannidis (2005) had argued from first principles that PPV in many fields would be below 50%, the OSC project went out and measured it.
The project recruited 270 contributing authors across 64 collaborating teams and replicated 100 experimental and correlational studies published in three top-tier psychology journals in 2008: Psychological Science, the Journal of Personality and Social Psychology, and the Journal of Experimental Psychology: Learning, Memory, and Cognition. Replicating studies used the original materials where possible (often with the original authors’ direct involvement), pre-registered analyses, and substantially larger sample sizes than the originals — typically powered at 90% to detect the originally reported effect size.
The headline results were:
- 97% of the original studies had reported statistically significant findings.
- 36% of the replications produced statistically significant findings in the same direction.
- The average effect size of the replications was approximately half the average effect size of the originals.
- For studies categorized as “surprising” results, the replication rate was even lower than for “expected” results.
The OSC numbers were not a clean test of Ioannidis’s framework — the replication rate depends on the prior, the original power, and the bias, all of which vary across the 100 studies and are difficult to estimate field-wide. But the order of magnitude was consistent with what the Bayesian framework predicted for a field characterized by Corollaries 1 through 6. Psychology in 2008 was a field of small samples, small effects, large numbers of tested hypotheses, large researcher degrees of freedom, modest financial interests, and many competing teams. The Bayesian framework predicted a replication rate substantially below the headline 97% significant-finding rate. The empirical project measured something in the same neighborhood — 36% replicating significantly, with halved effect sizes.
The OSC paper’s appearance in Science in 2015 was the moment the replication crisis stopped being a theoretical concern articulated by methodologists and became an empirically demonstrated property of a major scientific field. The combination of Ioannidis (2005) on the theory side and OSC (2015) on the empirical side has become the canonical citation pair for almost any discussion of the broader phenomenon. The papers do not refer to each other in detailed argumentative ways, but they fit together as a theory-prediction pair: the framework predicted that fields with these structural features would have replication rates well below their stated significance rates, and the field that was measured turned out to have a replication rate well below its stated significance rate.
Ioannidis’s Subsequent Extensions
Ioannidis has not been quiet in the two decades since the 2005 paper. He has continued to extend and refine the framework, with two streams of work that bear directly on the practical application of the original argument.
The 2014 follow-up: “How to make more published research true.”
In October 2014, Ioannidis published a second PLOS Medicine paper, “How to make more published research true” (DOI: 10.1371/journal.pmed.1001747), that took the diagnostic framework of the 2005 paper and asked: what reforms would actually move PPV upward? The paper proposed a set of structural changes whose adoption rate in the decade since has been mixed but real. The list included: pre-registration of analysis plans, increased statistical power as a publication norm, broader adoption of multi-team replication studies, registered reports as a publication format, open data and open code requirements, and the use of meta-analyses (with caveats) to triangulate across primary studies. The 2014 paper is the bridge between the 2005 diagnosis and the contemporary open-science reform movement; many of the specific policies that journals and funders have adopted since 2015 trace their argumentative lineage to this paper.
The 2018 nutritional-epidemiology critique.
In 2018, Ioannidis published a JAMA editorial titled “The challenge of reforming nutritional epidemiologic research” (DOI: 10.1001/jama.2018.11025) that turned the framework on one of biomedical research’s most prolific subfields. Nutritional epidemiology — the field of large-cohort observational studies relating dietary patterns to health outcomes — had produced a vast literature of associations between specific foods and specific health outcomes, frequently with policy implications. Ioannidis argued that the field was a near-ideal case study in the Corollaries: small effect sizes, large numbers of food-outcome associations tested, enormous researcher degrees of freedom in dietary recall, food categorization, covariate adjustment, and subgroup definition, plus financial and ideological interests in particular dietary findings. The PPV predicted by the framework was very low; the replication record of headline nutritional findings was, in fact, very poor. The editorial was a forceful application of the 2005 framework to a specific subfield and generated substantial controversy within that subfield. It is also a model of how to apply the diagnostic framework to a working empirical literature.
Beyond these two specific papers, Ioannidis has written extensively on related themes: the meta-research field he helped establish at Stanford (METRICS, the Meta-Research Innovation Center), the analysis of why specific large-scale replication efforts have produced the results they have, and (more controversially) the application of meta-research thinking to the COVID-19 evidence base. The 2005 framework is the through-line in all of this work.
Honest Reception: Where The Paper Has Been Criticized
It would be incomplete to discuss Ioannidis (2005) without acknowledging that the paper has faced substantive methodological criticism. The criticism does not, in my reading, undermine the central argument — but a strategist who wants to use the framework in practice should know what the disputes are.
The prior is doing all the work. The most serious technical critique of the 2005 paper is that the PPV calculations depend heavily on the assumed prior R for the hypotheses being tested, and that R is difficult to estimate for any real field. If you assume that 1 in 10 hypotheses tested in a field is true, you get one set of numbers; if you assume 1 in 100, you get a very different set. Critics, most notably the statisticians Jager and Leek in a 2014 Biostatistics paper, have argued that for some specific subfields, empirical estimates of R are higher than Ioannidis assumed and that the literal claim “most published findings are false” does not survive a more careful empirical exercise. Ioannidis has responded that the broader qualitative conclusions are robust to a wide range of plausible R values, and the back-and-forth is ongoing in the methodology literature. For working purposes, the safest reading is: the framework correctly identifies which fields are at higher and lower risk; the absolute PPV numbers should be treated as illustrative rather than precise.
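How much the prior matters can be seen by holding everything else fixed and varying only the share of tested hypotheses that are true. The sketch below uses the no-bias version of the calculation at a 0.05 threshold; the power levels and prior values are illustrative assumptions, not estimates for any field, and the function name is hypothetical.

```python
def ppv_from_probability(prior_prob, power, alpha=0.05):
    """No-bias PPV, with the prior given as a probability rather than odds."""
    true_positives = power * prior_prob
    false_positives = alpha * (1.0 - prior_prob)
    return true_positives / (true_positives + false_positives)


priors = {"1 in 2": 0.5, "1 in 10": 0.10, "1 in 100": 0.01}

print(f"{'prior':<10}{'power 80%':>12}{'power 20%':>12}")
for label, prior in priors.items():
    high = ppv_from_probability(prior, 0.80)
    low = ppv_from_probability(prior, 0.20)
    print(f"{label:<10}{high:>12.2f}{low:>12.2f}")
```

Moving from a prior of 1 in 10 to 1 in 100 takes the PPV of a well-powered study from 64% to about 14%, which is why the dispute over realistic values of the prior carries so much weight.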
Bias is treated as a single parameter. The bias parameter u in Ioannidis’s formula is a catch-all that bundles together publication bias, p-hacking, fraud, selective reporting, and a dozen other distinct phenomena. Each of these has different mechanisms and different empirical signatures, and bundling them into a single parameter may obscure the practically useful question of which mechanism is dominant in a given field. This is a fair critique, but it is also the reason the 2005 paper is eight pages rather than 800: the unification into a single parameter was a deliberate simplification to make the central argument legible.
Some fields have been unfairly tarred. Researchers in fields with reasonably high power, reasonably grounded priors, and reasonably low bias (much of physics, much of well-conducted clinical-trial research, much of well-conducted economics) have reasonably objected that the title statement reads as a blanket claim about science when the framework actually identifies specific structural conditions that vary across fields. This is correct. The framework predicts low PPV under specific conditions; it does not predict low PPV everywhere. Reading the paper carefully, this is what Ioannidis says, but the title’s compression of the argument has produced two decades of fights about whether “most published findings are false” applies to one’s own field.
The PPV framework assumes binary hypothesis testing. The Bayesian PPV framework assumes a discrete “true/false” structure for hypotheses, when in reality most empirical claims are continuous (a coefficient with a magnitude and a sign) rather than binary. Critics have argued that “estimation thinking” rather than “PPV thinking” is the more productive frame for many fields. This is true, and the contemporary methodological literature has largely moved in the estimation direction. But the PPV frame remains useful as a first-pass diagnostic, and it is the frame Ioannidis (2005) provided.
Taken together, the criticism is real and partially valid. The framework is a simplification, the bias parameter is a bundle, and specific PPV numbers should be treated as orders of magnitude rather than precise estimates. None of this undermines the qualitative claim that under realistic structural conditions in many fields, most “significant” findings are false. The qualitative claim is what does the work for practical purposes.
Strategist Takeaway: Applying the Bayesian Discount
What does this all mean for a working professional who is not running studies but is reading them — interpreting research claims to make business, product, or policy decisions? The Ioannidis framework provides a portable diagnostic that you can apply, mentally and quickly, to any “significant” research finding you encounter. The diagnostic does not give you a precise probability that the finding is true. It gives you a calibrated direction-and-magnitude for how much you should discount the finding before acting on it.
The diagnostic has four steps.
Step 1: Identify the prior probability that the hypothesis being tested is true in this field.
Ask yourself: how speculative is the hypothesis? Is the underlying mechanism strongly grounded in prior theory, or is the relationship being tested one of many possible relationships in a large hypothesis space? A randomized controlled trial of a drug whose mechanism is well-characterized has a high prior. A finding that “people in city X are more honest than people in city Y” in a single observational study has a low prior. Adjust your starting credence accordingly.
Step 2: Estimate the statistical power of the study (or of the field it comes from).
Power depends on sample size and effect size. Small samples and small effects produce low power. If the study reports a sample of 30 people with a Cohen’s d of 0.2, the power is far below 30%: on the order of 10% for a simple two-group comparison. A “significant” finding from a 30% power study is much more likely to be a false positive than a “significant” finding from an 80% power study. Pre-2015 social psychology averages around 35% power; well-conducted Phase III clinical trials average around 80%. Adjust your credence downward sharply for findings from low-power fields.
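A quick power estimate needs only the effect size and the group size. The sketch below uses a normal approximation for a two-sided comparison of two equal groups, which is slightly optimistic for very small samples compared with an exact t-test calculation. The function names are hypothetical, and treating "30 people" as 15 per group is an assumption.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison (normal approximation)."""
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    # Expected size of the test statistic if the true effect is Cohen's d.
    shift = d * sqrt(n_per_group / 2.0)
    # Probability of crossing either critical value.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def n_per_group_for_power(d, power=0.80, alpha=0.05):
    """Approximate per-group sample size needed to reach the target power."""
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2.0 * ((z_crit + z_power) / d) ** 2)


# 30 people split into two groups of 15, true effect d = 0.2.
print(f"power with 15 per group, d = 0.2: {approx_power(0.2, 15):.3f}")
print(f"per-group n for 80% power, d = 0.2: {n_per_group_for_power(0.2)}")
```

For a small effect of d = 0.2, a 30-person study has power under 10% by this approximation, and reaching 80% power takes close to 400 participants per group.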
Step 3: Assess the researcher-degree-of-freedom flexibility in the analysis.
Is the outcome variable pre-specified, or were multiple outcomes measured and the most significant reported? Is the model specification pre-registered, or was the model selected after looking at the data? Are subgroup analyses pre-specified, or were they generated post hoc? The more flexibility in the analysis, the larger the implicit multiple-comparisons inflation, and the lower the actual PPV. Pre-registered studies get a substantial credibility boost. Exploratory analyses get a substantial credibility discount.
Step 4: Identify financial and ideological interests in the result.
Who funded the study? Who benefits from the result going in the direction it went? This is not a charge of dishonesty — even well-intentioned researchers are influenced by where the rewards lie — but a structural adjustment to your credence. Industry-funded studies of industry products are systematically more favorable than independently funded studies of the same products. The same logic applies in other domains: an ideologically motivated researcher publishing a finding congenial to her ideological priors warrants a similar discount.
Once you have walked through the four steps, you have a rough Bayesian discount rate to apply. A finding from a high-prior, high-power, pre-registered, independently funded study warrants very little discount — you can largely take it at face value. A finding from a low-prior, low-power, exploratory, interested-party-funded study warrants such an aggressive discount that the practical posterior probability of truth approaches zero — you should treat the finding as a hypothesis to be tested, not as an established result.
The discount rate is not a precise calculation. It is a structured habit of mind, calibrated by the Ioannidis framework, that you apply before acting on a research finding. The cost of acquiring this habit is minutes. The cost of not having it, accumulated across a career of decisions that depend on research claims, is substantial.
The single most useful operational rule that falls out of Ioannidis (2005) is this: do not change your behavior based on a single study, in any field, ever. Wait for replication. Wait for meta-analytic triangulation. If the result has not been independently reproduced by at least one team with no stake in the original finding, treat it as a candidate hypothesis rather than as a fact. This is the single rule that captures most of the practical value of the framework, and it is the one that will save you from most of the strategic mistakes that flow from over-trusting published “significant” findings.
Sources
- Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2(8), e124. DOI: 10.1371/journal.pmed.0020124.
- Ioannidis, J. P. A. (2014). How to make more published research true. PLOS Medicine, 11(10), e1001747. DOI: 10.1371/journal.pmed.1001747.
- Ioannidis, J. P. A. (2018). The challenge of reforming nutritional epidemiologic research. JAMA, 320(10), 969–970. DOI: 10.1001/jama.2018.11025.
- Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. DOI: 10.1126/science.aac4716.
- Pashler, H., & Wagenmakers, E.-J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7(6), 528–530. DOI: 10.1177/1745691612465253.
- Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. DOI: 10.1177/0956797611417632.
- Jager, L. R., & Leek, J. T. (2014). An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics, 15(1), 1–12. DOI: 10.1093/biostatistics/kxt007.
- Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance. Journal of the American Statistical Association, 54(285), 30–34. DOI: 10.1080/01621459.1959.10501497.
- Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46(4), 806–834. DOI: 10.1037/0022-006X.46.4.806.
Related Reading
- Daryl Bem’s Precognition Studies: The Paper That Broke Social Psychology — the empirical episode that crystallized the framework’s predictions in psychology.
- Diederik Stapel: Anatomy of Scientific Fraud — when the bias parameter approaches 100%.
- Reinhart-Rogoff “90% Debt Threshold”: The Excel Error That Shaped Global Austerity — single-paper findings driving policy without replication.
- The Peeking Problem in A/B Testing — the working-practitioner version of Corollary 4 in conversion-rate optimization.
- Money Priming: A Replication Crisis Case Study — Corollaries 1, 2, and 6 in a single research program.
Frequently Asked Questions
Is Ioannidis literally claiming that most published findings are false?
The claim is conditional. Under the structural conditions described in the six corollaries — small studies, small effects, large numbers of tested hypotheses, large flexibility in analysis, financial interests, hot fields with many teams — the Bayesian arithmetic produces PPVs below 50%, which means most “significant” findings under those conditions are false. The claim is not that every field is in this regime. Well-conducted high-power randomized clinical trials of mechanistically grounded hypotheses have high PPV. The framework identifies the conditions that produce low PPV; it does not assert that all fields are equally afflicted.
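The 50% threshold in this answer can be written down directly. In the no-bias form of the formula from Ioannidis (2005), with R the pre-study odds, 1 − β the power, and α the significance threshold, a significant finding is more likely true than false exactly when the condition below holds. The numeric values in the second line are illustrative assumptions.

```latex
\[
\mathrm{PPV} = \frac{(1-\beta)R}{R - \beta R + \alpha},
\qquad
\mathrm{PPV} > \tfrac{1}{2} \iff (1-\beta)R > \alpha .
\]
% Illustrative values (assumed): power 1 - beta = 0.20, alpha = 0.05.
\[
0.20\,R > 0.05 \iff R > 0.25 .
\]
```

In words: with 20% power and the conventional 0.05 threshold, a significant finding is more likely true than false only if the pre-study odds exceed 1:4, that is, a prior probability above 20%. Bias raises that bar further.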
Has the situation gotten better since 2005?
In some fields, substantially. Many top psychology and biomedical journals now require or strongly encourage pre-registration for confirmatory studies, offer registered reports, and expect open data and open code, and both fields have seen adversarial collaborations and large-scale replication efforts. The Center for Open Science, the Open Science Framework, and the broader meta-research movement have institutionalized many of the reforms Ioannidis’s 2014 paper called for. The current trajectory is positive. In other fields (particularly nutritional epidemiology, much of single-laboratory biomedical research, and substantial portions of social science outside top-tier psychology) the reforms have penetrated less. The picture is uneven.
Does this mean we should distrust all science?
No, and that is the wrong inference. The framework provides a calibrated direction for skepticism, not a blanket dismissal. Science as a system still produces enormous amounts of true knowledge. The framework tells you which specific structural conditions produce reliable findings and which produce unreliable ones, so that you can adjust your credence accordingly. The replacement for “trust everything published” is “calibrate your credence by the structural conditions of the field that produced the finding.” That is a more useful posture than either blanket trust or blanket distrust.
What is the single most useful operational rule from this framework?
Do not change your behavior based on a single study, ever. Wait for independent replication by teams with no stake in the original finding. The single-study evidence base is unreliable enough across enough fields that this rule, applied consistently, will save you from most of the strategic mistakes that flow from over-trusting published findings. If a result is real and important, it will replicate; if it does not replicate, you have lost nothing by waiting.
Is Ioannidis still active in this work?
Yes. He co-directs the Meta-Research Innovation Center at Stanford (METRICS), continues to publish meta-research papers, and has extended the framework into specific subfields (nutritional epidemiology in 2018, COVID-19 evidence more recently and more controversially). His subsequent work is uneven, with some of it widely accepted and some of it disputed, but the 2005 paper remains the foundational contribution and the most-cited statement of the framework.
How does Bayesian reasoning relate to the p-value?
The p-value is a frequentist quantity. It tells you the probability of observing your data (or more extreme data) under the assumption that the null hypothesis is true. It does not tell you the probability that the null hypothesis is true given your data; that quantity requires Bayesian reasoning. Ioannidis’s framework is the application of Bayes’s theorem to the practical question of how much to trust a “significant” p-value, conditional on the structural features of the field. The significance threshold is one input to the calculation, alongside statistical power, pre-study odds, and bias; the PPV is the output. Treating the p-value as if it were the PPV, which much of the working scientific literature does implicitly, is the misreading the framework is correcting.
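A small simulation makes the gap between the 0.05 threshold and the PPV concrete. The sketch below does not generate data or run real significance tests. It models each study only by its outcome: a true hypothesis comes out significant with probability equal to the power, a false one with probability α. The prior, power, and study count are illustrative assumptions, and bias is left out.

```python
import random


def simulate_ppv(prior, power, alpha=0.05, n_studies=200_000, seed=0):
    """Fraction of 'significant' results that come from true hypotheses,
    estimated by simulating study outcomes (no bias term)."""
    rng = random.Random(seed)
    true_hits = 0
    false_hits = 0
    for _ in range(n_studies):
        is_true = rng.random() < prior
        # True effects are detected with probability `power`;
        # null effects cross the threshold with probability `alpha`.
        p_significant = power if is_true else alpha
        if rng.random() < p_significant:
            if is_true:
                true_hits += 1
            else:
                false_hits += 1
    return true_hits / (true_hits + false_hits)


def analytic_ppv(prior, power, alpha=0.05):
    """Bayes's theorem for the same setup, for comparison."""
    return power * prior / (power * prior + alpha * (1.0 - prior))


# Hypothetical field: 10% of tested hypotheses are true, power is 50%.
simulated = simulate_ppv(prior=0.10, power=0.50)
expected = analytic_ppv(prior=0.10, power=0.50)
print(f"simulated PPV: {simulated:.3f}, analytic PPV: {expected:.3f}")
```

Every "significant" result in this simulated field clears the same 0.05 threshold, yet under these assumed conditions only about 53% of them are true. The threshold is fixed; the PPV depends on everything else.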