Cohen offered d=0.2/0.5/0.8 “with some trepidation” as a last resort for fields with no better referent. The field turned a borrowed convention into a law of nature: dismissing “small” effects that compound and inflating “large” ones that do not replicate.

A product team is sitting in a research review. Someone has surfaced a peer-reviewed paper showing that a particular framing of a financial-decision prompt changes behavior with a standardized effect size of d = 0.18. The study is well-designed, pre-registered, and run on a few thousand participants. Across the table, a stakeholder who took one statistics course a decade ago delivers the verdict: “That’s a small effect. Cohen’s d under 0.2 is trivial; it’s basically noise. Let’s not build around it.” The paper is set aside. In the same review, a second paper is celebrated: a lab study with forty undergraduates that produced a d = 0.9 on a contrived button-clicking task. “Now that’s a large effect,” the room agrees, and a roadmap item gets created. Both judgments are backwards. The d = 0.18 effect, applied to millions of decisions, can be worth more than an entire product line. The d = 0.9 effect, measured on forty people on a meaningless task, is almost certainly a wild overestimate that will evaporate on the first attempt to reproduce it. The room has been led to the wrong conclusion on both counts by the single most overused crutch in applied statistics: the small/medium/large labels attached to Cohen’s d.

This is not a story about people being innumerate. The stakeholder did exactly what an entire academic and professional culture trained him to do. He took a number, looked up where it fell on a three-bucket scale, and read off the label. The problem is that the scale was never meant to be used that way, the man who invented it said so explicitly, and the way it actually gets used systematically corrupts the evaluation of evidence in precisely the direction that makes organizations chase noise and ignore signal. For a strategist who reads research to make decisions --- or who runs experiments and reports the results --- understanding what is wrong with “small/medium/large” is one of the highest-leverage pieces of statistical literacy available, because the error is everywhere and the correction is cheap.

What Cohen Actually Said

Jacob Cohen was a quantitative psychologist who spent his career trying to drag his field toward statistical seriousness. His 1969 book and its widely cited 1988 second edition, Statistical Power Analysis for the Behavioral Sciences, gave psychology its working vocabulary for statistical power, and along the way introduced the standardized effect size that bears his name. Cohen’s d is conceptually simple: it is the difference between two group means divided by the pooled standard deviation. Expressing a difference in standard-deviation units strips out the original measurement scale, so an effect on a reaction-time task and an effect on a survey scale can, in principle, be compared on the same ruler. That was a genuine advance, and it is not in dispute.
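The definition above is short enough to write out directly. What follows is a minimal sketch in Python using only the standard library; the function name `cohens_d` and the two lists of scores are made up for illustration and do not come from any study discussed here.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Difference between two group means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical scores for two small groups (illustrative numbers only).
treatment = [5.1, 6.0, 5.6, 6.4, 5.9]
control = [5.0, 5.5, 5.2, 6.1, 5.4]
print(f"d = {cohens_d(treatment, control):.2f}")
```

Because the result is in standard-deviation units, the same function works whether the inputs are reaction times in milliseconds or ratings on a seven-point scale.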

The trouble started with a single passage. Cohen needed his readers to be able to do power calculations even when they had no prior literature to estimate an effect size from --- a researcher planning a study in a brand-new area has nothing to anchor on. So he supplied conventional values: d = 0.2 for a “small” effect, d = 0.5 for “medium,” d = 0.8 for “large.” And he was painfully, explicitly clear about the spirit in which he offered them. In his own words, “there is a certain risk inherent in offering conventional operational definitions for these terms for use in power analysis in as diverse a field of inquiry as behavioral science. This risk is nevertheless accepted in the belief that more is to be gained than lost by supplying a common conventional frame of reference which is recommended for use only when no better basis for estimating the ES index is available.” Read that last clause again: recommended for use only when no better basis is available. The benchmarks were a fallback for the planning stage, not a grading rubric for the interpretation stage.

Cohen also told his readers what the numbers meant in concrete, physical terms, and the examples are revealing. A “small” effect of d = 0.2, he wrote, is roughly the difference in mean height between fifteen- and sixteen-year-old girls --- on the order of half an inch, a difference that exists but is hard to notice by eye. A “medium” effect of d = 0.5 is about the difference between fourteen- and eighteen-year-old girls, a gap “visible to the naked eye.” A “large” effect of d = 0.8 is roughly the difference between thirteen- and eighteen-year-old girls --- obvious to anyone looking. Notice what these examples do not say. They say nothing about whether a half-inch difference matters. Whether the difference between fifteen- and sixteen-year-olds is “trivial” depends entirely on what you are doing with it. If you are buying clothes in bulk for a population, half an inch of systematic difference is the difference between a fitting inventory and a returns nightmare. The size of an effect and the importance of an effect are different questions, and Cohen’s scale only ever addressed the first one --- crudely, and as a last resort.

The reluctance was not a footnote. Cohen returned to the theme in his 1992 “A power primer” in Psychological Bulletin, the five-page paper that, ironically, did more than anything to cement the conventions in the field’s muscle memory by putting them in a convenient lookup table. He restated that the values were conventions, not laws. And by accounts from colleagues, late in his life Cohen privately said he regretted ever having proposed the benchmarks at all, having watched them metastasize from a planning aid into exactly the kind of mechanical, context-free labeling he had warned against. The man built a tool, stapled a warning label to it, repeated the warning, and lived to regret that nobody read the label.

Why The Conventions Mislead

The definitive modern statement of the case against mechanical effect-size labeling is Funder and Ozer’s 2019 paper in Advances in Methods and Practices in Psychological Science, bluntly titled “Evaluating Effect Size in Psychological Research: Sense and Nonsense.” Their argument has two prongs. The first is destructive: the small/medium/large benchmarks, used as a grading scale, are at best uninformative and at worst actively misleading. The second is constructive: there are better referents, and small effects in particular are routinely and wrongly written off.

Start with the destructive prong, because it dismantles the most common move of all: squaring the correlation to get “percent of variance explained.” A correlation of r = 0.30 is often dismissed as accounting for “only 9% of the variance,” which sounds pitiful. Funder and Ozer argue that squaring r “is not merely uninformative; for purposes of evaluating effect size, the practice is actively misleading.” Their illustration is sharp: take two predictors of the same outcome, one correlating with it at r = 0.8944 and the other at r = 0.4472. Squared, those become “80% of variance” and “20% of variance,” tempting you to conclude the first predictor matters four times as much. But the actual ratio of predictive power is 2 to 1, not 4 to 1. Squaring inflates the apparent gap between strong and weak predictors and crushes small effects toward a visual zero. The “9% of variance” framing is one of the most effective rhetorical devices ever deployed for making a consequential effect look like nothing.
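The arithmetic behind that distortion takes a few lines to check. This is a small Python sketch; the variable names are mine, and the only inputs are the correlations quoted above.

```python
# Correlations from the Funder and Ozer illustration quoted above.
r_strong, r_weak = 0.8944, 0.4472

ratio_of_r = r_strong / r_weak                 # ratio on the original scale
ratio_of_r_squared = r_strong**2 / r_weak**2   # ratio after squaring

print(f"ratio of r:   {ratio_of_r:.1f} to 1")
print(f"ratio of r^2: {ratio_of_r_squared:.1f} to 1")

# The same transformation applied to a typical "dismissed" correlation.
r_typical = 0.30
print(f"r = {r_typical} becomes {r_typical**2:.0%} of variance when squared")
```

Squaring doubles the apparent gap between the two predictors, and it turns 0.30 into 9%, which is where the "only 9%" dismissal comes from.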

Now the constructive prong, which is where the real reframing lives. Funder and Ozer propose interpreting effects against the benchmark of accumulated consequences over repeated occasions rather than a single-event yardstick. On their scale, an effect of r = 0.05 is “very small for the explanation of single events but potentially consequential in the not-very-long run”; r = 0.10 is “still small at the level of single events but potentially more ultimately consequential”; r = 0.20 is “a medium effect that is of some explanatory and practical use even in the short run”; and r = 0.30 is “a large effect that is potentially powerful in both the short and the long run.” Crucially, they argue that effects of r = 0.40 or greater in psychological research “are relatively rare” and a reported value much beyond that range is “likely to be a gross overestimate that will rarely be found in a large sample or in a replication.” That last point flips the naive intuition completely: in soft sciences, an enormous reported effect should increase your suspicion, not your confidence.

The reason small effects deserve respect is aggregation. Funder and Ozer revive the example that the psychologist Robert Rosenthal made famous: a clinical trial of aspirin for the prevention of heart attacks produced an effect size of roughly r = 0.03; by the squared-variance logic, that is “0.1% of the variance,” a rounding error. Yet in the trial that effect corresponded to a meaningful number of heart attacks prevented across the sample, and the trial was stopped early on ethical grounds because it was considered unconscionable to keep the control group off a treatment that clearly worked. A correlation a careless reader would call “essentially zero” was, at scale and against the stakes of mortality, decisive. They make the same point with a baseball analogy drawn from Abelson: the correlation between a single at-bat and a player’s batting skill is tiny, yet across a season those tiny effects aggregate into the difference between a team that makes the playoffs and one that finishes last. An effect that is negligible on a single occasion can be overwhelming when it recurs.
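To see how a correlation near 0.03 and a stopped trial can describe the same data, it helps to compute both from one 2x2 table. The sketch below uses the counts commonly reported for the aspirin trial in secondary sources such as Rosnow and Rosenthal (1989). Treat those counts as an assumption on my part and check the primary source before reusing them; the function name `phi_coefficient` is also mine.

```python
import math


def phi_coefficient(a, b, c, d):
    """Correlation (phi) for a 2x2 table with rows [a, b] and [c, d]."""
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator


# Assumed counts (heart attack, no heart attack) per arm; illustrative only.
aspirin_mi, aspirin_no_mi = 104, 10_933
placebo_mi, placebo_no_mi = 189, 10_845

r = phi_coefficient(aspirin_mi, aspirin_no_mi, placebo_mi, placebo_no_mi)

aspirin_rate = aspirin_mi / (aspirin_mi + aspirin_no_mi)
placebo_rate = placebo_mi / (placebo_mi + placebo_no_mi)
prevented_per_10k = (placebo_rate - aspirin_rate) * 10_000

print(f"|r| = {abs(r):.3f}, r^2 = {r**2:.4f}")
print(f"heart attacks prevented per 10,000 treated: {prevented_per_10k:.0f}")
```

The sign of r only reflects how the table is laid out. The point is that a correlation this close to zero coexists with dozens of prevented heart attacks per ten thousand people treated.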

This is the core of why the conventions mislead a decision-maker specifically. A label like “small” answers the question “how big is this effect in standard-deviation units on one occasion?” But the decision-relevant question is almost always “what does this effect do when it is applied to my entire population, repeatedly, over time?” Those questions can have opposite answers. The convention system encourages you to substitute the easy question for the hard one, and the substitution is biased toward discarding exactly the kind of broadly-applied, frequently-recurring small effects that are the bread and butter of any business operating at scale.

Publication Bias And The Winner’s Curse: Why Big Published Effects Lie

There is a second, independent reason to distrust the reflex of “large effect equals important finding,” and it has nothing to do with importance and everything to do with whether the number is even real. Published effect sizes are systematically inflated, and the inflation is worst for exactly the eye-catching, large-d findings that the convention system trains people to celebrate.

The mechanism is a compounding of two forces. The first is publication bias: journals preferentially publish statistically significant results, so the studies that reach print are a filtered, non-representative sample of the studies that were actually run. The second is sampling noise interacting with that filter. In a small study, the estimated effect size bounces around the true value with a wide margin of error. To clear the significance threshold with a small sample, an effect generally has to land on the high side of its own sampling distribution --- a study that happened to draw a below-average estimate simply won’t reach significance and won’t get published. The result is what statisticians call a Type M (magnitude) error, characterized in Gelman and Carlin’s 2014 treatment: among published significant results from underpowered studies, the reported effect size is, on average, an overestimate, sometimes by a factor of two or more. The smaller and noisier the study, the larger the inflation. This is the same statistical machinery as the winner’s curse in A/B testing: when you select on the highest observed value, you are selecting partly on luck, and luck does not persist.
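The filtering mechanism can be reproduced with a small simulation. This is a hedged sketch, not Gelman and Carlin's own procedure: it assumes a true effect of d = 0.2, twenty participants per group, normally distributed outcomes, and a two-sided critical t of about 2.024 (the 5% threshold for 38 degrees of freedom). All of these are choices made here for illustration.

```python
import math
import random


def sample_d(rng, true_d, n):
    """Simulate one two-group study and return its observed Cohen's d."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [rng.gauss(true_d, 1.0) for _ in range(n)]
    mean_c, mean_t = sum(control) / n, sum(treated) / n
    var_c = sum((x - mean_c) ** 2 for x in control) / (n - 1)
    var_t = sum((x - mean_t) ** 2 for x in treated) / (n - 1)
    pooled_sd = math.sqrt((var_c + var_t) / 2)  # equal group sizes
    return (mean_t - mean_c) / pooled_sd


def simulate_type_m(true_d=0.2, n=20, t_crit=2.024, n_sims=4000, seed=1):
    """Return (exaggeration ratio, share significant) among simulated studies."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        d = sample_d(rng, true_d, n)
        t_stat = d * math.sqrt(n / 2)  # t statistic for equal group sizes
        if abs(t_stat) > t_crit:       # the "publication" filter
            significant.append(abs(d))
    exaggeration = (sum(significant) / len(significant)) / true_d
    return exaggeration, len(significant) / n_sims


exaggeration, share_significant = simulate_type_m()
print(f"share of studies reaching significance: {share_significant:.1%}")
print(f"average published |d| / true d: {exaggeration:.1f}x")
```

Under these assumptions only a small minority of studies clear the threshold, and the ones that do report an effect several times larger than the true value. Raising `n` shrinks the exaggeration, which is the pattern the paragraph above describes.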

The empirical fingerprint of this inflation is unmistakable in the large replication efforts. The Open Science Collaboration’s 2015 Science paper, which attempted to reproduce one hundred psychology findings, found that while 97% of the original studies had reported statistically significant results, only 36% of the high-powered replications did, and the replication effect sizes were, on average, about half the magnitude of the originals. Half. The published literature was not a clean record of true effect sizes; it was a record of effect sizes systematically biased upward by the filtering process, and the bias was large enough that the average effect halved on honest re-measurement. A d = 0.8 pulled from a forty-person study published in a prestige journal is not a reliable estimate of a large effect. It is, far more often, a noisy estimate of a moderate effect that won the publication lottery, or of no effect at all.

This is the deep irony that should reorganize how a strategist reads research. The convention system says: trust large effects, dismiss small ones. The statistics of publication say the opposite is closer to safe: a moderate-but-precise effect from a large sample is far more trustworthy than a spectacular effect from a small one, and a “small” effect that survives a high-powered, pre-registered test may be the most bankable finding in the room precisely because there was no room for noise to inflate it. The size of the reported number tells you very little; the precision behind it tells you almost everything.

SESOI: The Fix Is To Decide What Matters Before You Look

If the benchmarks are the disease, the cure is not a better set of universal benchmarks --- it is the abandonment of universal benchmarks in favor of a domain-specific judgment made in advance. The modern methodological literature, anchored by Daniël Lakens, calls this judgment the smallest effect size of interest, or SESOI.

The logic runs through Lakens’s foundational 2013 paper in Frontiers in Psychology, “Calculating and reporting effect sizes to facilitate cumulative science,” which is on its surface a practical primer on computing effect sizes correctly. Two of its lessons matter here. First, Lakens is explicit that Cohen’s labels are not interpretive law: “these values are arbitrary and should not be interpreted rigidly,” and the right way to interpret a d “is to relate it to other effects in the literature, and if possible, explain the practical consequences of the effect.” That is a direct instruction to do the work the convention system lets you skip. Second, Lakens documents a technical point that sharpens the inflation argument: Cohen’s d, computed from sample data, is a biased estimator that overstates the population effect, and the bias is worst in small samples (below roughly n = 20). The bias-corrected version, Hedges’s g, applies a shrinkage factor. The correction is small in large samples and meaningful in small ones --- which means the small-sample studies most prone to publication-driven inflation are also subject to a separate, mechanical upward bias in the effect-size statistic itself. Two distinct forces both push small-sample published effects upward.
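The small-sample correction is easy to apply once d is in hand. The sketch below uses the common approximation for the correction factor, 1 - 3 / (4(n1 + n2) - 9); the function name `hedges_g` and the example sample sizes are assumptions chosen for illustration.

```python
def hedges_g(d, n1, n2):
    """Shrink Cohen's d by the approximate small-sample correction factor."""
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return d * correction


# Hypothetical comparison: the same observed d at two sample sizes.
for n_per_group in (10, 20, 2000):
    g = hedges_g(0.9, n_per_group, n_per_group)
    print(f"n = {n_per_group:>4} per group: d = 0.90 -> g = {g:.3f}")
```

The shrinkage is a few percent at ten per group and vanishes at two thousand per group, so this correction is much smaller than the selection-driven inflation discussed earlier. It stacks on top of that inflation and does not replace it.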

SESOI takes the interpretive burden and moves it to the front of the process. Before running a study or reading one, you ask: what is the smallest effect that would actually change a decision in this context? That number comes from the domain, not from a textbook. For a drug that reduces mortality, the SESOI might be a tiny standardized effect, because at population scale and against the stakes of death, a tiny effect saves many lives --- the aspirin case. For a costly, disruptive organizational change, the SESOI might be quite large, because anything smaller wouldn’t justify the cost of implementation. The same numerical d can be “interesting” in one context and “ignore it” in another, and SESOI forces you to specify which before the result can bias your judgment. Lakens’s later work extends this into equivalence testing --- formal procedures for concluding that an effect is smaller than your SESOI, i.e., for affirmatively declaring an effect too small to care about rather than just failing to find one. The framework replaces the question “is this effect small, medium, or large?” with the only question that was ever decision-relevant: “is this effect bigger than the smallest effect I would act on?”
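One way to make the SESOI question operational is to compare a confidence interval against the pre-committed threshold. The following is a simplified sketch of that idea, not Lakens's full procedure: the function name, the three labels, and the example intervals are all hypothetical. Using a 90% interval for the "too small" check corresponds to the two one-sided tests approach at a 5% alpha level described in the equivalence-testing literature.

```python
def judge_against_sesoi(ci_low, ci_high, sesoi):
    """Classify an effect by where its confidence interval sits relative to the SESOI."""
    if sesoi <= 0:
        raise ValueError("sesoi must be a positive magnitude")
    if ci_low >= sesoi or ci_high <= -sesoi:
        return "meaningful"          # whole interval is beyond the threshold
    if -sesoi < ci_low and ci_high < sesoi:
        return "too small to matter"  # whole interval is inside the threshold
    return "inconclusive"             # interval straddles the threshold


# Hypothetical results judged against a SESOI of d = 0.10 set in advance.
print(judge_against_sesoi(0.12, 0.24, sesoi=0.10))   # precise and above the bar
print(judge_against_sesoi(-0.03, 0.06, sesoi=0.10))  # precise and below the bar
print(judge_against_sesoi(0.05, 1.15, sesoi=0.10))   # too noisy to call
```

The third case is the important one: a wide interval earns "inconclusive" no matter how large the point estimate in the middle of it looks.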

Applied Implications: Reading Research And Running Experiments

For a strategist, the payoff of all this is a short set of habits that, applied consistently, will outperform the entire convention system.

When reading research, never accept “large equals important” or “small equals ignore” at face value. Translate the effect into your own decision context before you let a label move you. Ask three questions of any reported effect. What does this effect do when applied to my full population, repeatedly, over the relevant time horizon --- the aggregation question that turns r = 0.03 into a stopped trial? What is the confidence interval, not just the point estimate --- because a d = 0.6 with an interval running from 0.05 to 1.15 is a different finding from a d = 0.6 with an interval from 0.5 to 0.7, even though the labels are identical? And could the sample even reliably estimate an effect this size --- because a large effect from a tiny sample is a near-guaranteed overestimate, and the right reaction to a spectacular result from forty people is suspicion, not a roadmap item?
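The second and third questions can be answered roughly from numbers most papers report. The sketch below uses a standard large-sample approximation for the standard error of d and a normal-theory 95% interval; both are approximations, and exact intervals based on the noncentral t distribution will differ somewhat in small samples. The sample sizes are assumptions standing in for the two papers in the opening scenario.

```python
import math


def d_confidence_interval(d, n1, n2, z=1.96):
    """Approximate 95% confidence interval for Cohen's d (normal approximation)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se


# Assumed designs: forty people split into two groups, and a few thousand split evenly.
small_low, small_high = d_confidence_interval(0.9, 20, 20)
large_low, large_high = d_confidence_interval(0.18, 2000, 2000)

print(f"d = 0.90, n = 20 per group:   [{small_low:.2f}, {small_high:.2f}]")
print(f"d = 0.18, n = 2000 per group: [{large_low:.2f}, {large_high:.2f}]")
```

Under these assumptions the "large" effect is compatible with anything from a modest effect to an implausibly huge one, while the "small" effect is pinned down to a narrow range that excludes zero.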

When running experiments, the same discipline reorders your priorities. This connects directly to the difference between statistical and practical significance: a conversion lift can be “statistically significant” and decision-irrelevant, or “small” and enormously profitable. A 0.5% lift on a checkout flow that millions of customers pass through every month is, in absolute revenue, frequently larger than a flashy 15% lift on a feature three hundred people use. The convention reflex flags the 15% as the win and the 0.5% as not worth shipping; the aggregation logic says the opposite. Define your SESOI before the test --- the smallest lift that clears your implementation and maintenance cost --- and judge the result against that, not against a generic notion of “big enough.” And because of the winner’s curse, discount the headline lift on whatever variant you selected for being the highest: the number that won the test is, on average, above the true effect, and the gap is widest when the test was small or the metric was noisy.
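The comparison between the two lifts is plain arithmetic, and writing it out removes the temptation to argue from percentages. In this sketch every input is hypothetical: the traffic figures, baseline conversion rates, and the function name are invented for illustration, and the lifts are treated as relative lifts on the baseline rate.

```python
def incremental_conversions(monthly_users, baseline_rate, relative_lift):
    """Extra conversions per month implied by a relative lift on a baseline rate."""
    return monthly_users * baseline_rate * relative_lift


# Hypothetical checkout flow: 2,000,000 users a month, 3% baseline, 0.5% relative lift.
checkout_gain = incremental_conversions(2_000_000, 0.03, 0.005)

# Hypothetical niche feature: 300 users a month, 10% baseline, 15% relative lift.
niche_gain = incremental_conversions(300, 0.10, 0.15)

print(f"checkout flow: about {checkout_gain:.0f} extra conversions per month")
print(f"niche feature: about {niche_gain:.1f} extra conversions per month")
```

With these made-up inputs the "small" lift produces well over fifty times as many extra conversions as the "large" one. A SESOI for either test would be the lift at which this number, multiplied by value per conversion, covers the cost of shipping and maintaining the change.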

Report your own effects honestly enough that you don’t become the next inflated citation. Always report a confidence interval alongside the point estimate. Prefer the bias-corrected statistic in small samples. State your SESOI so a reader can see whether you pre-committed to what would count as meaningful or decided after the fact. None of this is exotic; all of it is the opposite of “we got a large effect, ship it.”

The Strategist Takeaway

The deepest lesson of the Cohen’s d story is not a statistical technique. It is a warning about how a useful tool, detached from the caveats its inventor attached to it, becomes an instrument for confident error. Cohen handed his field a planning aid, said in plain language that it was a last resort to be used only when nothing better was available, repeated the warning, and watched it harden into a grading rubric that he came to regret. The misuse was not a misunderstanding of the math --- the math is trivial --- but a substitution of an easy question (“which bucket does this number fall in?”) for the hard one that actually drives decisions (“what does this effect do in my world, and can I even believe the number?”).

For someone making decisions from evidence, the practical posture follows immediately. Treat “small/medium/large” as a vocabulary of convenience with no authority over your judgment. Respect small effects that apply broadly and recur often, because they aggregate into the largest consequences you will ever encounter. Distrust large effects from small samples, because the publication and sampling machinery manufactures them. Always look at the interval, not just the estimate. And decide what would count as a meaningful effect before the result is in front of you to bias the answer. The field spent fifty years learning, the hard way, that a borrowed convention is not a law of nature. A strategist can learn it in an afternoon and stop being led to the wrong conclusion by a label that its own author wished he had never written down.

Sources

  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates. ISBN: 978-0805802832. (The d = 0.2/0.5/0.8 conventions, the “no better basis for estimating the ES index is available” caveat, and the height-by-age illustrations.)
  • Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–159. DOI: 10.1037/0033-2909.112.1.155
  • Funder, D. C., & Ozer, D. J. (2019). Evaluating effect size in psychological research: Sense and nonsense. Advances in Methods and Practices in Psychological Science, 2(2), 156–168. DOI: 10.1177/2515245919847202
  • Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 863. DOI: 10.3389/fpsyg.2013.00863
  • Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence testing for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259–269. DOI: 10.1177/2515245918770963
  • Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641–651. DOI: 10.1177/1745691614551642
  • Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. DOI: 10.1126/science.aac4716
  • Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44(10), 1276–1284. DOI: 10.1037/0003-066X.44.10.1276

FAQ

Is Cohen’s d itself flawed, or just the small/medium/large labels?

The statistic is fine. Cohen’s d --- the mean difference divided by the pooled standard deviation --- is a perfectly reasonable standardized effect size, and standardizing is genuinely useful when you need to compare effects measured on different scales. The flaw is entirely in the interpretive layer that got bolted on: the three-bucket small/medium/large scale, used as a grading rubric instead of the last-resort planning aid Cohen intended. You can and should keep computing d (or its bias-corrected cousin, Hedges’s g). What you should drop is the reflex of reading a verdict off the bucket the number falls into.

Why is squaring the correlation to get “percent of variance” misleading?

Because squaring is a nonlinear transformation that visually crushes small effects and exaggerates the gap between strong and weak predictors. Funder and Ozer’s example: two predictors with correlations of about 0.89 and 0.45 become “80% of variance” and “20% of variance” when squared --- suggesting a 4-to-1 ratio of importance --- when the actual ratio of predictive power is 2 to 1. The famous “r = 0.03 aspirin effect explains only 0.1% of the variance” framing is the same trick: it takes an effect that prevented heart attacks at population scale and makes it look like a rounding error. Report the correlation itself, or translate it into concrete consequences; don’t square it to evaluate importance.

How can a “small” effect possibly matter more than a “large” one?

Through aggregation. A standardized effect describes the size of a difference on a single occasion. But most consequential effects apply to large populations and recur over time, and small per-occasion effects accumulate. An effect of r = 0.03 sounds negligible until you apply it to tens of thousands of patients (lives saved) or millions of transactions (revenue). Conversely, a large effect on a rare or low-stakes event may be practically irrelevant. The size of the effect and the importance of the effect are different questions; the convention system answers only the first and tempts you to treat it as the answer to the second.

Why should a large effect size from a small study make me more suspicious, not less?

Because of publication bias compounded by sampling noise. Small studies produce noisy estimates; to clear the statistical-significance threshold with a small sample, an effect generally has to land on the high side of its own sampling distribution. Journals then preferentially publish the significant results. The combination guarantees that published effects from small samples are systematically inflated --- the Type M (magnitude) error of Gelman and Carlin 2014 --- often by a factor of two or more. The Open Science Collaboration’s 2015 replication project found replication effects averaged about half the magnitude of the originals. A spectacular d from forty people is far more likely to be a lucky overestimate than a reliable large effect.

What is a smallest effect size of interest (SESOI), and how do I set one?

The SESOI is the smallest effect that would actually change a decision in your specific context, decided before you see the result. You set it from the domain, not from a textbook. For a mortality-reducing treatment applied to millions, the SESOI may be tiny, because even a small effect saves many lives. For a costly, disruptive organizational change, the SESOI may be large, because anything smaller doesn’t justify the cost. In experimentation, your SESOI is the smallest lift that clears the implementation and maintenance cost of shipping. Specifying it in advance prevents the result itself from biasing your judgment of what counts as meaningful, and Lakens’s equivalence-testing methods let you formally conclude an effect is below your SESOI rather than merely failing to detect one.

How does this apply to A/B testing specifically?

Three ways. First, judge a conversion lift against a pre-committed SESOI (the smallest lift worth the cost of shipping), not against a generic notion of “big” --- a 0.5% lift across millions of users often beats a 15% lift on a niche feature in absolute terms. Second, discount the headline lift on the variant you selected for being the winner, because the winner’s curse means the highest observed lift is, on average, above the true effect, with the gap widest for small or noisy tests. Third, report confidence intervals on your lifts so stakeholders see the precision, not just the point estimate. The convention reflex --- “15% is a big win, 0.5% is noise” --- gets both the importance and the reliability backwards.

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified behavioral economist (1 of ~1,000 worldwide). 200+ A/B tests across energy, SaaS, fintech, e-commerce, and marketplace verticals.