Publication Bias and the File Drawer Problem: Rosenthal 1979

Atticus Li

Rosenthal (1979) formalized the file drawer problem: null results sit in desk drawers while positive findings get published. Franco et al. (2014) quantified it: strong results were 40 percentage points more likely to be published than nulls, and 60 percentage points more likely to be written up. Turner et al. (2008) showed that published antidepressant effect sizes were inflated by 32%. The literature is biased by what never appears.

By Atticus Li May 25, 2026 26 min read

Imagine a hundred research teams in different universities, each running the same experiment on the same hypothesis. The hypothesis is false — there is no real effect. By the arithmetic of significance testing at the conventional 0.05 threshold, roughly five of those hundred teams will, by chance alone, observe a “statistically significant” positive result. The other ninety-five will observe nulls.

Now imagine what happens next. The five teams that got significant results write up their findings, submit them to journals, and publish. Their papers enter the citable literature. The ninety-five teams that got null results face a different decision. The journals they would submit to are known to reject null results. Their faculty advisors will tell them the project is “not worth the time” to write up. Their grant reviewers will not be impressed by a CV that lists unpublished nulls. So the manuscripts get filed in desk drawers — and, in the digital era, in shared cloud folders that no one opens after the third month.

A meta-analyst who later surveys the literature on this hypothesis will find five published papers. All five report a positive effect. The literature appears to support the hypothesis with striking consistency. The fact that the underlying truth is the opposite — ninety-five out of a hundred experiments found nothing — is invisible, because the ninety-five were never written down anywhere a meta-analyst could find them.
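The arithmetic of this thought experiment can be checked with a short simulation. The sketch below is illustrative only: the function name, the sample size per team, and the one-sided z-test with a known standard deviation are assumptions made for the example, not details from Rosenthal's paper.

```python
import random
from statistics import NormalDist, mean


def simulate_literature(n_teams=100, n_per_team=20, alpha=0.05, seed=1):
    """Simulate teams testing a hypothesis whose true effect is zero.

    Assumption: each team runs a one-sided z-test on n_per_team draws
    from a standard normal, and only significant positives get published.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    published, filed = [], []
    for _ in range(n_teams):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_per_team)]
        estimate = mean(sample)
        z = estimate * n_per_team ** 0.5  # z-statistic when the sd is known to be 1
        if z > z_crit:
            published.append(estimate)  # enters the citable literature
        else:
            filed.append(estimate)      # goes into the file drawer
    return published, filed


published, filed = simulate_literature()
print(f"published: {len(published)}, filed away: {len(filed)}")
if published:
    print(f"mean published estimate: {mean(published):.2f} (true effect is 0)")
```

With 100 teams the published count fluctuates around five from seed to seed. Whatever the count, every published estimate is positive even though the true effect is zero, which is exactly what the meta-analyst in the story sees.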

This is the file drawer problem. Robert Rosenthal named it in 1979, in a four-page paper in Psychological Bulletin that has become one of the foundational citations of the replication crisis. The argument is so simple that it can be stated in a sentence: the published literature is a non-random sample of conducted research, biased toward positive results, and the magnitude of the bias is large enough to make published meta-analytic conclusions unreliable in fields where null results are systematically suppressed.

This article covers six things: what Rosenthal actually argued in 1979; the empirical work that quantified the bias decades later; the detection toolkit that meta-analysts now use to estimate how much is hidden in the file drawer of a given literature; the canonical case study from psychiatric drug trials, which turned an abstract concern into a public-health embarrassment; the preregistration reforms on which the methodological community has converged; and what a working strategist should do with the framework when evaluating claims from any literature where publication is selective.

Rosenthal 1979: A Four-Page Paper That Named the Problem

Robert Rosenthal was, by 1979, one of the most prolific and methodologically sophisticated social psychologists of his generation. He had spent the preceding two decades doing meta-analytic synthesis of psychological experiments — combining results from many studies to estimate underlying effect sizes — and he had been thinking, for at least a decade by the time of the paper, about a problem he could not solve.

The problem was that his meta-analytic toolkit assumed the studies he was combining were a representative sample of the studies that had been conducted. He could see, from his own experience and from conversations with colleagues, that they were not. Studies with null results were systematically less likely to appear in the journals he was synthesizing from. They sat in researchers’ files, unpublished and uncited, and they were invisible to any meta-analyst who tried to estimate an effect from the published record alone.

In June 1979, Rosenthal published “The ‘File Drawer Problem’ and Tolerance for Null Results” in Psychological Bulletin — at the time the leading review journal in psychology. The paper was four pages long. It made three contributions that have shaped methodological discussion ever since.

First, it gave the problem a name. Before 1979, the concern had been articulated in scattered methodological essays — Theodore Sterling had described a version of it in a 1959 paper in the Journal of the American Statistical Association, and a small clinical-trials literature had discussed selective publication of pharmaceutical studies. But the concern had no canonical label. By calling it the “file drawer problem,” Rosenthal gave the methodological community a piece of vocabulary that fit the empirical situation. The metaphor was accurate (researchers literally filed unpublished manuscripts in drawers) and memorable. Within a few years the phrase had entered the standard methodological vocabulary; it remains there.

Second, it formalized the worst-case calculation. Rosenthal asked: given a meta-analytic conclusion based on k published studies showing a combined effect significant at some chosen threshold, how many unpublished null studies would have to exist in file drawers somewhere to overturn that conclusion? He worked out the arithmetic and called the answer the fail-safe N. If the fail-safe N is small — say, 20 unpublished nulls would be enough to reduce the combined effect to non-significance — then the meta-analytic conclusion is fragile, and the reader should worry. If the fail-safe N is enormous — say, ten thousand unpublished nulls would be required — then the conclusion is robust to plausible levels of publication bias.

The fail-safe N has been criticized over the years (we will come to this), but the underlying logic was important: it forced meta-analysts to think quantitatively about how much rot in the file drawer would be enough to change their conclusions. Before Rosenthal, the discussion was qualitative (“there might be unpublished nulls, who knows”). After Rosenthal, the discussion had a unit of account.
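Rosenthal's calculation combines the studies' standard normal deviates (Stouffer's method) and asks how many additional studies averaging Z = 0 would pull the combined Z down to the one-tailed critical value. A minimal sketch, assuming each published study is summarized by a Z-score; the function name and the example Z-scores are hypothetical:

```python
from statistics import NormalDist


def fail_safe_n(z_scores, alpha=0.05):
    """Number of unpublished Z = 0 studies needed to make the combined
    (Stouffer) Z fall to the one-tailed critical value for alpha.

    Solves sum(Z) / sqrt(k + N) = z_crit for N.
    """
    k = len(z_scores)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # about 1.645 for alpha = 0.05
    n = sum(z_scores) ** 2 / z_crit ** 2 - k
    return max(0.0, n)  # a non-significant literature needs no hidden nulls


# Hypothetical literature: five published studies, each with Z = 2.0.
z_published = [2.0] * 5
print(f"fail-safe N: {fail_safe_n(z_published):.1f}")
```

For these five hypothetical studies the answer is about 32 hidden nulls. Note that the calculation assumes the hidden studies average exactly zero, which is one of the assumptions later critics objected to.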

Third, it raised the structural question of journal incentives. Rosenthal pointed out that the file drawer problem was not a moral failure on the part of individual researchers. Most researchers would happily publish null results if journals would accept them. The problem was that the journals would not, and the reason the journals would not was that readers and citation indices treated null results as uninteresting. The incentive structure of the publishing ecosystem produced the bias.

This was the seed of what would, over the following four decades, grow into a substantial reform agenda — registered reports, results-blind peer review, journals dedicated to null findings, preregistration platforms. The 1979 paper did not propose those reforms. But it identified the structural diagnosis that they would later try to treat. The DOI 10.1037/0033-2909.86.3.638 resolves to the original paper, which remains as readable today as when it appeared.

What Was Missing: Empirical Quantification

Rosenthal’s paper had a limitation that he himself acknowledged. The fail-safe N calculation could tell you how many unpublished nulls would be required to overturn a meta-analytic conclusion, but it could not tell you how many actually existed. The file drawer was, by its nature, invisible. You could speculate about its contents but not enumerate them.

For thirty-five years after Rosenthal’s paper, the methodological literature operated under a presumed but unmeasured publication bias. Methodologists assumed that nulls were under-published; the fail-safe N gave them a way to talk about the magnitude of the assumption; but the empirical question — what fraction of conducted studies actually go unpublished, and what does the difference between conducted and published research look like in real fields? — remained essentially unanswered.

The reason the empirical question was so hard is that you cannot easily observe what was never written down. You need access to a registry of studies that captures the universe of studies begun, not just the subset published. Such registries did not exist in social science for most of the twentieth century. They began to exist, in selected fields, only in the 2000s.

The first widely-cited empirical demonstration of publication bias in social science came in 2014, when Annie Franco, Neil Malhotra, and Gabor Simonovits published a paper in Science called “Publication bias in the social sciences: Unlocking the file drawer.” The methodological field had been waiting for it, in a sense, for thirty-five years.

Franco, Malhotra, and Simonovits 2014: Unlocking the Drawer

What Franco and her collaborators had, that previous researchers had not, was a registry. Beginning in 2002, the National Science Foundation had funded a program called Time-sharing Experiments for the Social Sciences (TESS), which provided researchers with the ability to run experiments on representative samples of the U.S. population using survey infrastructure. Researchers submitted proposals; TESS reviewed them and selected a fraction to fund; funded experiments were conducted; results were returned to the researchers; and the researchers then decided what to do with the results.

Critically, TESS kept records of every funded experiment, the hypothesis being tested, and the outcome. This gave Franco et al. a sample of conducted experiments whose existence and results were knowable independently of whether anyone had published them.

The researchers identified 221 experiments funded through TESS between 2002 and 2012. They classified each experiment’s main result, based on the original investigator’s reported statistical analyses, into one of three categories: strong support for the hypothesis, mixed or partial support, and null. They then traced what had happened to each experiment subsequently: whether the investigator had written it up, whether they had submitted it to a journal, and whether it had been published. The results, published in Science in September 2014 under DOI 10.1126/science.1255484, were the most precise empirical estimate of publication bias in social science that had ever been produced.

The headline finding: strong results were approximately 40 percentage points more likely to be published than null results, and roughly 60 percentage points more likely to be written up at all. The bias was concentrated at the earliest decision point in the publication pipeline. Researchers were not failing to publish nulls mainly because journals rejected them. They were failing to publish nulls because they were not writing them up in the first place.

This was an important refinement of the 1979 framing. Rosenthal had located the bias at the editorial level: journals rejected nulls. Franco et al. located it earlier: researchers had internalized the journal-level bias to such a degree that they were self-censoring before submission. When asked why they had not written up null results, the most common reasons researchers gave in follow-up interviews were that the null was “uninteresting,” that it would not “tell a story,” and that the time spent writing it up would be better spent on a project with a more publishable expected outcome.

The empirical magnitude (a 60-percentage-point gap at the write-up stage and a 40-percentage-point gap in eventual publication) was a quantitative anchor that the field had been missing. It meant that the published literature in fields running TESS-like designs was a sample biased not by a small or moderate amount but by a large one. Meta-analytic conclusions based on such literatures had to be discounted accordingly.

The Franco et al. paper has been replicated and extended in other fields since 2014. Clinical-trials registries (clinicaltrials.gov, ICTRP) allow analogous analyses in pharmaceutical research. The pattern that has emerged across fields is consistent: a substantial fraction of conducted studies — typically between 25 and 60 percent, depending on the field — never reach publication, and the unpublished fraction is disproportionately composed of nulls.

The Detection Toolkit: How Meta-Analysts Estimate File-Drawer Bias

If you cannot observe the unpublished studies directly, can you infer their existence from the shape of the published literature? The answer is yes, and the methods for doing so have become substantially more sophisticated since Rosenthal’s fail-safe N.

A modern meta-analyst evaluating any literature for publication bias will typically run a battery of complementary tests. Each test exploits a different statistical signature that publication bias leaves on the published record.

The funnel plot is the most commonly used and the most intuitive. The meta-analyst plots, for each study in the literature, the estimated effect size against the precision of the estimate (usually the inverse of the standard error). In the absence of publication bias, the plot should resemble an inverted funnel: precise studies cluster tightly around the true effect at the top, while imprecise studies scatter more widely around the true effect at the bottom, with roughly symmetric distribution to the left and right. Publication bias produces an asymmetric funnel — typically, the bottom-left corner of the plot is empty, because small studies that estimated effects near zero (or negative) were less likely to be published than small studies that estimated large positive effects. The asymmetry is the visible footprint of the file drawer.

Egger’s regression test, introduced by Matthias Egger, George Davey Smith, Martin Schneider, and Christoph Minder in a 1997 BMJ paper (DOI 10.1136/bmj.315.7109.629), formalizes the funnel-asymmetry observation into a statistical test. Egger and colleagues showed that the relationship between standardized effect size and precision can be tested with a simple linear regression: if the intercept of the regression line is significantly different from zero, the funnel is asymmetric, and publication bias is the most parsimonious explanation. The test is not a definitive proof of bias — small-study heterogeneity can produce similar patterns — but it is a flag that should make a careful meta-analyst dig deeper.
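The regression described above can be sketched in a few lines. This is a minimal illustration, not a replacement for a meta-analysis package: it regresses each study's standard normal deviate (effect divided by standard error) on its precision (one over the standard error) by ordinary least squares and reports the intercept with its t-statistic. The function name and the example data are hypothetical, and converting the t-statistic to a p-value (n - 2 degrees of freedom) is left out because the standard library has no t-distribution.

```python
from math import sqrt


def egger_test(effects, std_errors):
    """Egger-style regression: (effect / SE) on (1 / SE).

    An intercept far from zero signals funnel-plot asymmetry.
    """
    y = [e / s for e, s in zip(effects, std_errors)]  # standard normal deviates
    x = [1.0 / s for s in std_errors]                  # precisions
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance and the usual OLS standard error of the intercept.
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_intercept = sqrt(ss_res / (n - 2) * (1.0 / n + x_bar ** 2 / sxx))
    t_stat = intercept / se_intercept if se_intercept > 0 else float("nan")
    return {"intercept": intercept, "slope": slope, "t": t_stat, "df": n - 2}


# Hypothetical literature in which small (high-SE) studies report larger effects.
effects = [0.42, 0.35, 0.51, 0.18, 0.22, 0.15]
std_errors = [0.20, 0.15, 0.25, 0.06, 0.08, 0.05]
result = egger_test(effects, std_errors)
print({key: round(value, 3) for key, value in result.items()})
```

In this parameterization the slope estimates the underlying effect and the intercept captures the small-study asymmetry, which is why the test focuses on the intercept.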

P-curve, introduced by Uri Simonsohn, Leif Nelson, and Joseph Simmons in a 2014 paper in Journal of Experimental Psychology: General (DOI 10.1037/a0033242), is a more recent and more powerful tool that exploits a different statistical signature. The insight is that under the null hypothesis, p-values are uniformly distributed between zero and one. Under a true effect with adequate power, p-values are concentrated near zero with a steep right-skew. Under p-hacking or publication bias that hunts specifically for p-values just below the 0.05 threshold, the distribution of significant p-values shows a left-skew or a flat shape — there are unexpectedly many p-values clustered just below 0.05 and unexpectedly few near zero.

By plotting the distribution of statistically significant p-values from a set of studies, the meta-analyst can read off whether the literature reflects a true effect (right-skew toward zero), a null with p-hacking (left-skew toward 0.05), or insufficient evidence to distinguish them. P-curve has become particularly influential in social psychology, where Simonsohn and colleagues have used it to argue that several once-canonical literatures (ego depletion, facial feedback, certain priming effects) show p-curve signatures inconsistent with a true effect and consistent with p-hacking or publication bias.
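The simplest version of this reading is a binomial test: if there is no true effect and no p-hacking, significant p-values are uniform between 0 and 0.05, so half of them should fall below 0.025. The sketch below implements only that binomial comparison, which is a deliberate simplification of the fuller p-curve procedure; the function name and the example p-values are hypothetical.

```python
from math import comb


def p_curve_binomial(p_values, alpha=0.05):
    """Compare the share of significant p-values below alpha / 2 to 50%.

    Many low p-values suggest right skew (evidential value);
    many high ones suggest left skew (a warning sign of p-hacking).
    """
    significant = [p for p in p_values if p < alpha]
    n = len(significant)
    low = sum(1 for p in significant if p < alpha / 2)
    # Exact binomial tail probabilities under Binomial(n, 0.5).
    p_right = sum(comb(n, i) for i in range(low, n + 1)) / 2 ** n
    p_left = sum(comb(n, i) for i in range(0, low + 1)) / 2 ** n
    return {
        "n_significant": n,
        "n_below_half_alpha": low,
        "p_right_skew": p_right,
        "p_left_skew": p_left,
    }


# Hypothetical set of reported p-values; the non-significant one is ignored.
reported = [0.001, 0.002, 0.004, 0.010, 0.012, 0.020, 0.030, 0.040, 0.200]
print(p_curve_binomial(reported))
```

A small `p_right_skew` supports a true effect; a small `p_left_skew` points toward p-values piling up just under the threshold. With only a handful of studies, as here, neither tail is likely to be decisive.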

PET-PEESE (precision-effect test / precision-effect estimate with standard error), developed by Tom Stanley and Hristos Doucouliagos, is a regression-based meta-analytic technique that estimates what the average effect size would be if every study had infinite precision — that is, if there were no small-study effects driving the apparent average. If the PET-PEESE-corrected effect is substantially smaller than the naive meta-analytic average, the gap is a quantitative estimate of how much publication bias is inflating the published literature.
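Both regressions are weighted least squares fits with inverse-variance weights: PET regresses the effect on the standard error, PEESE on its square, and in each case the intercept is the estimated effect at a standard error of zero. A minimal sketch under those assumptions follows; the helper names and data are hypothetical, and the usual conditional rule (report PEESE only when the PET intercept is significantly different from zero) is omitted for brevity.

```python
def wls_intercept(x, y, weights):
    """Intercept of a weighted least squares line y = b0 + b1 * x."""
    w_sum = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / w_sum
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / w_sum
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    sxy = sum(w * (xi - x_bar) * (yi - y_bar)
              for w, xi, yi in zip(weights, x, y))
    return y_bar - (sxy / sxx) * x_bar


def pet_peese(effects, std_errors):
    """Return the naive weighted mean plus the PET and PEESE intercepts."""
    weights = [1.0 / se ** 2 for se in std_errors]  # inverse-variance weights
    naive = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pet = wls_intercept(std_errors, effects, weights)
    peese = wls_intercept([se ** 2 for se in std_errors], effects, weights)
    return {"naive": naive, "pet": pet, "peese": peese}


# Hypothetical literature: reported effects grow with the standard error.
std_errors = [0.05, 0.08, 0.12, 0.20, 0.30]
effects = [0.22, 0.25, 0.31, 0.40, 0.48]
print({name: round(value, 3) for name, value in pet_peese(effects, std_errors).items()})
```

The gap between `naive` and the corrected intercepts is the quantity the paragraph above describes: a rough estimate of how much small-study effects are inflating the average.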

No single one of these tools is definitive. A modern meta-analysis worth reading will report several of them and interpret the convergent (or divergent) signals. The toolkit allows a meta-analyst to estimate, with reasonable confidence, both whether a literature is contaminated by publication bias and by how much. The estimates are imperfect — they require modeling assumptions, and they can be confounded by genuine heterogeneity between studies. But they are vastly better than Rosenthal’s 1979 fail-safe N, and they are vastly better than the absence of tools that preceded him.

Turner 2008: The Antidepressant Case Study

The most consequential empirical demonstration of publication bias in any clinical literature appeared in January 2008 in the New England Journal of Medicine. The paper, by Erick Turner and colleagues at the FDA and Oregon Health & Science University, was titled “Selective Publication of Antidepressant Trials and Its Influence on Apparent Efficacy” (DOI 10.1056/NEJMsa065779). It documented, in unprecedented detail, the difference between what the antidepressant-drug literature said and what the underlying clinical trials actually showed.

Turner and his collaborators had access to a dataset that academic researchers had not previously been able to assemble. Between 1987 and 2004, the FDA had reviewed clinical trial data submitted by pharmaceutical companies in support of new drug applications for twelve antidepressant medications. The FDA’s review process required companies to submit results from every clinical trial they had conducted on each drug, regardless of whether those trials had been published. This gave the FDA — and, through Freedom of Information Act requests, Turner and his colleagues — a comprehensive view of the universe of antidepressant trials conducted in support of those twelve drug approvals.

Turner et al. identified 74 FDA-registered trials. They then searched the published literature to determine which trials had been published, and they compared the published representation of each trial to the FDA’s own evaluation of the trial’s outcome. The results were striking.

Of the 38 trials that the FDA had judged to be positive (showing efficacy of the antidepressant compared to placebo), 37 had been published, a publication rate of 97 percent. Of the 36 trials that the FDA had judged to be either negative (failing to show efficacy) or questionable (showing mixed evidence), only 3 had been published as negative. Another 22 had not been published at all, and the remaining 11 had been published in a form that recast a negative result as a positive one through outcome-switching or selective reporting of subgroup analyses.

In other words, the published literature on antidepressant efficacy showed 48 positive trials and 3 negative trials. The actual FDA-registered evidence base showed 38 positive trials and 36 negative-or-questionable trials. Judged by the published literature, 94 percent of trials were positive; judged by the FDA record, 51 percent were.

Turner and colleagues then did the meta-analysis that the field had been doing for years, twice. The first meta-analysis used only the published literature. The second used the FDA-registered evidence base. The published-literature meta-analysis showed an average effect size for antidepressants over placebo of approximately 0.41 (a moderate effect by clinical standards). The full-evidence meta-analysis showed an average effect size of approximately 0.31. The published literature inflated the apparent efficacy of antidepressants by approximately 32 percent relative to the underlying evidence.
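The arithmetic behind these figures is easy to reproduce from the rounded numbers quoted in this section. The variable names below are illustrative; because the inputs are rounded, the result is only approximately the 32 percent reported in the paper.

```python
# Rounded effect sizes quoted above.
published_effect = 0.41  # meta-analysis of the published literature
fda_effect = 0.31        # meta-analysis of the full FDA-registered record

# Relative inflation of the published estimate over the full-evidence estimate.
inflation = published_effect / fda_effect - 1

# Share of trials that look positive in each evidence base.
share_positive_published = 48 / (48 + 3)  # 48 positive, 3 negative in print
share_positive_fda = 38 / (38 + 36)       # 38 positive, 36 negative or questionable

print(f"effect-size inflation: {inflation:.1%}")                      # 32.3%
print(f"positive share, published: {share_positive_published:.1%}")  # 94.1%
print(f"positive share, FDA record: {share_positive_fda:.1%}")       # 51.4%
```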

This is the empirical demonstration that turned publication bias from a methodological abstraction into a public-health concern. Antidepressants are among the most prescribed medications in the world, and at effect sizes this modest, a gap of that size in the average benefit over placebo is the kind of difference that affects clinical recommendations, prescribing patterns, and individual patients’ treatment decisions. A 32-percent inflation is not a methodologists’ quibble. It is the difference between a moderately effective intervention and a marginally effective one.

The Turner paper had immediate effects. It contributed to regulatory pressure that, in subsequent years, made registration of clinical trials mandatory in many jurisdictions and required public posting of trial results regardless of whether the trial was published in a peer-reviewed journal. The Food and Drug Administration Amendments Act of 2007 (FDAAA), which had been passed before the Turner paper but which acquired sharper enforcement teeth after it, required results posting on clinicaltrials.gov; the European Medicines Agency followed with comparable rules. The clinical-trials registry infrastructure that exists today is, in substantial part, a regulatory response to the kind of bias that Turner documented.

The Preregistration Solution

The methodological reform that has done the most to reduce publication bias is preregistration: the practice of publicly committing, before a study is conducted, to the hypothesis, the design, the analysis plan, and (in the strongest version) the manuscript that will be written regardless of how the results turn out.

Preregistration works because it severs the conditional dependence between the result and the publication decision. Under a preregistered design, the decision to publish has been made — either by the journal, in the case of registered reports, or by a transparent commitment in a registry — before the result is observed. The researcher cannot decide, after seeing a null result, that the project is “not worth writing up.” The commitment to write it up was made before the data came in.
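A small simulation makes the mechanism concrete. When publication depends on significance, the published average overstates a real but modest effect; when every study is published regardless of outcome, the average lands on the truth. The sketch below is a toy model with assumed parameters (a true standardized effect of 0.2, 30 participants per arm, estimates drawn directly from their normal sampling distribution); none of these numbers come from the studies discussed in this article.

```python
import random
from statistics import NormalDist, mean


def simulate_selective_publication(true_effect=0.2, n_per_arm=30,
                                   n_studies=2000, alpha=0.05, seed=7):
    """Compare the mean of all study estimates with the mean of the
    estimates that would survive a 'significant positives only' filter."""
    rng = random.Random(seed)
    se = (2.0 / n_per_arm) ** 0.5                  # SE of a difference in means, unit variance
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    estimates = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    significant = [d for d in estimates if d / se > z_crit]
    return mean(estimates), mean(significant), len(significant)


all_mean, published_mean, n_published = simulate_selective_publication()
print(f"publish everything (preregistered): mean estimate {all_mean:.2f}")
print(f"publish only significant positives: mean estimate {published_mean:.2f} "
      f"from {n_published} studies")
```

In this setup a study can only reach significance if its estimate exceeds roughly 0.5, so the filtered average is necessarily more than double the true effect of 0.2. Committing to publication before the result is known removes the filter.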

The reform has several flavors. The most ambitious is the registered report, a publication format in which the journal reviews the study design and analysis plan before data collection, and (if accepted) commits to publishing the results regardless of the outcome. The format was introduced by Cortex in 2013 and has spread to roughly 300 journals across psychology, biology, social science, and medicine. Reviews of the registered-reports literature consistently find that the proportion of positive findings is dramatically lower than in conventional publications: typically around 40-50 percent, compared to 80-95 percent in conventional papers in the same fields. The gap is the visible footprint of what the file drawer used to swallow.

The lighter-touch versions include preregistration on the Open Science Framework (OSF), where researchers post a timestamped analysis plan before collecting data but the journal makes no advance commitment, and clinicaltrials.gov registration, which is now required for most clinical trials of FDA-regulated interventions. These lighter forms do not eliminate publication bias — researchers can still decide not to write up a null result — but they make it possible for meta-analysts to identify which conducted studies have not been published, and they make outcome-switching and selective analysis dramatically harder to conceal.

The combination of registered reports for the strongest cases, OSF preregistration for routine work, and clinical-trials registration for medical research has produced, in the fields where these practices are widely adopted, a measurable reduction in the apparent publication bias of recent literature. The file drawer has not been emptied, but it has been opened, and the conversation about what is in it can now be quantitative rather than speculative.

Where the Argument Has Been Refined

Rosenthal’s 1979 paper has held up remarkably well, but the methodological community has refined several of its specifics over the following four decades.

The fail-safe N has been substantially criticized. Multiple methodologists have shown that the standard fail-safe N calculation makes optimistic assumptions about the distribution of effect sizes in unpublished studies; it tends to produce reassuringly large numbers (suggesting robustness) even in literatures where more sophisticated analyses indicate substantial bias. Modern meta-analytic practice has moved away from the fail-safe N as a primary diagnostic and toward the funnel-based and p-curve-based tools described above. Rosenthal’s basic intuition — that you should think quantitatively about how much hidden literature would be required to overturn a published conclusion — survives; the specific tool he proposed has been superseded.

The file drawer is not the only mechanism of publication bias. Subsequent research has documented several related mechanisms that operate alongside non-publication of nulls: outcome-switching (changing the reported primary outcome after seeing the data), selective reporting of subgroups (publishing the one subgroup analysis that crossed the significance threshold), HARKing (hypothesizing after the results are known — see the HARKing article for the canonical analysis), and citation bias (positive results being cited more frequently and thus appearing more salient to readers and reviewers). The contemporary discussion treats these as members of a family of selection biases, of which the file drawer is the prototypical case.

The magnitude of bias varies substantially by field. The Franco et al. gaps of 40 to 60 percentage points are specific to TESS-funded social-science experiments. Other fields show different magnitudes. Some areas of medical research, after the implementation of mandatory trial registration, show publication rates that are now closer to 80 percent of conducted studies. Some areas of psychology, before the open-science reforms of the 2010s, almost certainly showed even larger gaps. The pattern is not uniform; an honest meta-analyst will consider what is plausible in the specific field they are working in rather than applying a global discount.

What This Means for a Working Strategist

If you are a working strategist — a marketer, a product leader, a policy analyst, a manager — you will frequently encounter claims drawn from research literatures. Pricing research, behavioral economics, organizational psychology, consumer behavior, public health. The published literatures in all of these fields are non-random samples of conducted research, and the file drawer is real for all of them.

Three practical implications follow.

Apply a Bayesian discount to any single-study claim, especially one with a small sample. The practical consequence of Rosenthal’s insight is that the published literature over-represents positive results in proportion to the bias in the publication pipeline. If you are reading a press release that summarizes “a new study” showing that some intervention works, your prior should be that the reported effect size is inflated and that the underlying claim is less likely to be true than a significant p-value makes it appear. This is not a license to dismiss research; it is a calibration. The same arithmetic is described at length in the Ioannidis 2005 article.

Weight meta-analyses by their bias-correction methodology, not by their headline conclusion. When you read a meta-analysis, the most important section is the publication-bias diagnostics. If the paper presents a funnel plot, an Egger test, a p-curve, and a PET-PEESE-corrected effect estimate, take its headline seriously. If it presents none of these, treat the headline as a naive average of a non-random sample, and adjust your conclusion downward by an amount calibrated to the field’s typical bias.

Run your own work like a registered report. When you conduct an internal experiment — an A/B test, a pricing study, a survey of customers — commit in writing, before you collect data, to the hypothesis, the analysis plan, and the decision you will make at each possible outcome. Document the commitment in a place you cannot quietly revise. When the data come in, run the pre-committed analysis first; report it; then, if you want to explore further, label the exploration explicitly as exploration. This is not bureaucracy. It is the practice that makes your internal experimental record actually informative when you go back to it six months later. It is also the practice that, in the published-research world, has produced the most credible recent reductions in publication bias.
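One lightweight way to make a commitment hard to revise quietly is to fingerprint the written plan and record the fingerprint somewhere timestamped, such as a ticket or a team channel. The sketch below is one possible approach, not an established standard: the function name and the plan fields are hypothetical, and it relies only on Python's `hashlib` and `json` modules.

```python
import hashlib
import json


def fingerprint_plan(plan: dict) -> str:
    """Return a SHA-256 fingerprint of an analysis plan.

    Keys are sorted so that the same plan always yields the same hash;
    any later change to the plan changes the fingerprint.
    """
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical pre-commitment for an internal pricing test.
plan = {
    "hypothesis": "Annual-plan discount raises paid conversion",
    "primary_metric": "paid_conversion_rate",
    "analysis": "two-sided z-test for proportions, alpha = 0.05",
    "sample_size_per_arm": 12000,
    "decision_if_positive": "roll out to all new visitors",
    "decision_if_null": "keep current pricing; archive write-up",
}
print(fingerprint_plan(plan))
```

When the results come in, anyone can recompute the fingerprint from the stored plan and confirm that the hypothesis, metric, and decision rule are the ones committed to before the data existed.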

The file drawer that Rosenthal named in 1979 is still real, both in academic publishing and inside organizations. The Franco, Turner, and p-curve work has given us a substantially better understanding of its contents than the methodological community had for the first thirty-five years after Rosenthal’s paper. The remaining task is to use what we have learned — both when consuming research and when conducting our own.

Sources

Rosenthal, R. (1979). The “file drawer problem” and tolerance for null results. Psychological Bulletin, 86(3), 638-641. DOI: 10.1037/0033-2909.86.3.638
Franco, A., Malhotra, N., & Simonovits, G. (2014). Publication bias in the social sciences: Unlocking the file drawer. Science, 345(6203), 1502-1505. DOI: 10.1126/science.1255484
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143(2), 534-547. DOI: 10.1037/a0033242
Egger, M., Davey Smith, G., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test. BMJ, 315(7109), 629-634. DOI: 10.1136/bmj.315.7109.629
Turner, E. H., Matthews, A. M., Linardatos, E., Tell, R. A., & Rosenthal, R. (2008). Selective publication of antidepressant trials and its influence on apparent efficacy. New England Journal of Medicine, 358(3), 252-260. DOI: 10.1056/NEJMsa065779

Related Reading

P-Hacking and Researcher Degrees of Freedom: The mechanism that turns a single dataset into many implicit tests, inflating false-positive rates and feeding the file drawer.
Ioannidis 2005: Why Most Published Research Findings Are False. The Bayesian framework that makes the file-drawer problem one corollary of a larger structural argument.
HARKing: Hypothesizing After the Results Are Known. The post-hoc narrative move that lets a researcher publish a null as a positive by switching hypotheses after seeing the data.
OSC 2015: The Reproducibility Project in Psychology. The mass-replication study that documented the empirical consequences of the file drawer in a single field.
The A/B Testing Peeking Problem: How an internal-experimentation analogue of publication bias inflates apparent effect sizes when experimenters stop tests as soon as they reach significance.

FAQ

Is publication bias the same as p-hacking?

No, though they are closely related and often co-occur. Publication bias is a selection effect operating at the level of which conducted studies appear in print: nulls are suppressed, positives are amplified. P-hacking is a within-study practice in which a researcher exploits flexibility in design and analysis to obtain a significant result that would not have been significant under a single pre-specified analysis. Both inflate the apparent rate of positive findings in the published literature. The file drawer captures suppression at the study level; p-hacking inflates apparent positivity at the result level. A field with both will look substantially more confirmatory than the underlying evidence warrants.

Has the file drawer problem gotten better or worse since 1979?

Mixed. In fields with strong preregistration cultures (registered reports in psychology, mandatory trial registration in regulated clinical medicine), the bias has measurably decreased. In fields without such cultures (much of economics, organizational research, marketing, and the gray literature of industry-sponsored research that is not subject to FDA-style registration), the bias remains close to its pre-2014 levels. The aggregate trend is toward improvement, but the improvement is concentrated in fields where the institutional infrastructure has been built; elsewhere, the file drawer is as full as ever.

Can I trust a meta-analysis that doesn’t report a funnel plot?

You can trust it as a description of the published literature, but not as an estimate of the underlying true effect. A meta-analysis that does not assess publication bias has, by construction, no way of correcting for it. Its headline number is the average of a non-random sample. In fields with known substantial publication bias (most of social psychology before the 2010s, much of behavioral economics, large parts of nutritional epidemiology, antidepressant trials before the 2008 Turner paper), the uncorrected meta-analytic average is likely to be inflated, sometimes by a factor of 1.3 to 2 relative to the bias-corrected estimate.

How do I know if my own internal A/B tests are subject to a “file drawer” problem?

Look at your experimental archive. If you can find every test you have conducted in the last twelve months, with results, regardless of outcome, you do not have an internal file drawer problem. If the tests you can easily find are disproportionately the ones that “worked” — the winners that got rolled out — and the tests that failed are harder to locate or were never documented at the same level of detail, you have an internal file drawer problem. The fix is the same as in academic publishing: preregister the test, document the result regardless of outcome, and store the documentation in a place that does not depend on the result being interesting.
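If your experimentation platform keeps a launch log that is independent of your write-ups (the internal equivalent of the TESS registry), the check can be made quantitative by comparing documentation rates for winners and non-winners. A minimal sketch, assuming each launched test is a record with boolean `won` and `documented` fields; the function name, field names, and example data are hypothetical.

```python
def documentation_rate_by_outcome(experiments):
    """Share of launched tests with a findable write-up, split by outcome."""
    rates = {}
    for label, outcome in (("won", True), ("lost_or_null", False)):
        group = [e for e in experiments if e["won"] is outcome]
        if group:  # skip outcomes with no launched tests
            rates[label] = sum(e["documented"] for e in group) / len(group)
    return rates


# Hypothetical launch log: 4 winners, 6 losing or null tests.
launch_log = (
    [{"won": True, "documented": True}] * 4
    + [{"won": False, "documented": True}] * 3
    + [{"won": False, "documented": False}] * 3
)
print(documentation_rate_by_outcome(launch_log))
```

A large gap between the two rates is the internal file drawer made visible. In this made-up log every winner has a write-up, but only half of the losing or null tests do.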

Does mandatory clinical-trial registration actually work?

It works, but imperfectly. Studies of post-registration publication rates show that registration substantially increases the probability that a trial’s results become publicly accessible — either through journal publication or through results posting on clinicaltrials.gov. But compliance is incomplete (a substantial minority of registered trials still fail to post results within the required timeframe), and outcome-switching between registration and publication remains common. The infrastructure has reduced the worst version of the problem; it has not eliminated it. The most credible recent estimates suggest that perhaps 70-80 percent of registered clinical trials in regulated jurisdictions now have publicly accessible results, compared to perhaps 50 percent before mandatory registration. This is real progress and also substantial room for further improvement.


Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified in behavioral economics. Led 100+ in-house experiments at NRG in 2025, with project evidence and limits documented in the case studies.
