Steele & Aronson 1995 became the empirical foundation for a generation of DEI policy, diversity training, and educational reform. Meta-analyses by Stoet & Geary, Flore & Wicherts, and Shewach, Sackett & Quint showed the underlying effect is small, heavily inflated by publication bias, and may be near zero in the kinds of high-stakes settings the policy framework was built to address. The honest verdict is nuanced, and consequential.

If you have sat through a diversity, equity, and inclusion training in the last fifteen years --- at a university, a Fortune 500 company, a hospital system, a government agency --- you have almost certainly encountered the concept. Maybe it was framed as “identity threat.” Maybe it was framed as “the SAT gap that disappears when you remove the stereotype.” Maybe it was framed as “the reason women drop out of STEM.” Whatever the framing, the empirical anchor was almost always the same: a 1995 paper by Claude Steele and Joshua Aronson in the Journal of Personality and Social Psychology showing that when Black college students were told a verbal test measured intellectual ability, they performed worse than when they were told the same test was just a problem-solving exercise.

The finding was elegant. The story was powerful. The implications were enormous. If a brief change in test framing could affect performance, then the achievement gaps that haunted American education were not fixed properties of demographic groups --- they were partly artifacts of the stereotype-laden environments in which the tests were administered. The intervention pathway was tractable. Change the environment, change the gap.

By the late 2000s, stereotype threat had become one of the most cited concepts in social psychology, the empirical justification for a generation of educational interventions, and a load-bearing element in the conceptual architecture of DEI training programs across American institutions. It appeared in textbooks. It appeared in Whistling Vivaldi, Steele’s 2010 popular book that brought the concept to a general audience. It appeared in legal briefs, congressional testimony, and policy memos.

Then the meta-analyses started arriving.

What they found, in successive waves between 2012 and 2019, was not that stereotype threat was fake; it isn’t. It was that the effect was much smaller than the popular and institutional framings implied, that the literature showed substantial signs of publication bias, that the most-cited original study had been widely misdescribed in a way that inflated the apparent finding, and that in the kinds of high-stakes operational testing settings the policy framework was meant to address, the effect “may range from negligible to small.” This is the careful verdict that emerged from the rigorous quantitative reviews. It is not the verdict most of the people who built careers, training programs, and policy frameworks on the concept ever heard.

This article is about how that gap opened, what the honest current evidence actually shows, and what to do as a strategist when an empirical foundation you may have built decisions on turns out to be smaller and shakier than advertised.

The 1995 Original Study (Steele & Aronson)

The foundational paper is Steele, C. M., & Aronson, J. (1995). Stereotype threat and the intellectual test performance of African Americans. Journal of Personality and Social Psychology, 69(5), 797-811. The study reported four experiments. The core paradigm: Black and White Stanford undergraduates took a difficult verbal test composed of items from the Graduate Record Examination. In the “diagnostic” condition, participants were told the test measured intellectual ability. In the “non-diagnostic” condition, they were told it was simply a problem-solving exercise. The hypothesis was that Black students in the diagnostic condition would experience apprehension about confirming a negative stereotype about their group’s intellectual ability, and that this apprehension would degrade their test performance.

The key reported finding: Black participants performed worse than White participants in the diagnostic condition, but the gap shrank substantially in the non-diagnostic condition. Steele and Aronson defined “stereotype threat” as “being at risk of confirming, as self-characteristic, a negative stereotype about one’s group,” a concept that traveled far beyond the specific paradigm of the original study.

So far, so straightforward. Here is the crucial detail that almost every popularization of the study obscured: the analyses controlled statistically for participants’ prior SAT scores. The Black-White gap that appeared to “disappear” in the non-diagnostic condition disappeared only after test performance had been adjusted for participants’ SAT scores. Without that statistical adjustment, a substantial raw gap remained, and the result is much harder to read as evidence that removing the threat closes the gap.

This is the issue that Paul Sackett, Chaitra Hardison, and Michael Cullen raised in their 2004 American Psychologist paper, “On interpreting stereotype threat as accounting for African American-White differences on cognitive tests” (DOI: 10.1037/0003-066X.59.1.7). Their core argument was that an enormous fraction of the published literature describing Steele & Aronson 1995 (they reported approximately 91% of journal articles and approximately 56% of textbook treatments they surveyed) described the finding as showing that Black-White test-score gaps disappeared in the non-stereotype-threat condition. This is not what Steele & Aronson found. What Steele & Aronson found was that, after statistical adjustment for SAT scores, the residual within-condition gap disappeared. The unadjusted scores still showed a substantial gap. Removing the stereotype threat manipulation, in the actual reported data, did not eliminate the raw performance difference.
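The difference between a raw gap and a covariate-adjusted gap is easier to see with numbers. The sketch below is a toy illustration only: the two groups, their scores, and the helper names `pooled_slope` and `adjusted_means` are all made up for this example, and it is not a reconstruction of the Steele & Aronson analysis. It simply shows how two groups can differ in raw test scores while showing no difference once scores are adjusted for a prior-score covariate.

```python
from statistics import mean

# Synthetic (prior_score, test_score) pairs. Every number here is invented
# for illustration; none of it comes from any published study.
groups = {
    "group_a": [(500, 10), (550, 11), (600, 12)],
    "group_b": [(600, 12), (650, 13), (700, 14)],
}

def pooled_slope(groups):
    """Pooled within-group slope of test score on prior score (the ANCOVA slope)."""
    num = 0.0
    den = 0.0
    for rows in groups.values():
        x_bar = mean(x for x, _ in rows)
        y_bar = mean(y for _, y in rows)
        num += sum((x - x_bar) * (y - y_bar) for x, y in rows)
        den += sum((x - x_bar) ** 2 for x, _ in rows)
    return num / den

def adjusted_means(groups):
    """Group means shifted to what they would be at the grand-mean prior score."""
    slope = pooled_slope(groups)
    grand_x = mean(x for rows in groups.values() for x, _ in rows)
    return {
        name: mean(y for _, y in rows) - slope * (mean(x for x, _ in rows) - grand_x)
        for name, rows in groups.items()
    }

raw = {name: mean(y for _, y in rows) for name, rows in groups.items()}
adjusted = adjusted_means(groups)

raw_gap = raw["group_b"] - raw["group_a"]
adjusted_gap = adjusted["group_b"] - adjusted["group_a"]
print(f"raw gap: {raw_gap}, adjusted gap: {adjusted_gap:.3f}")
```

In this toy data the raw gap is 2 points and the adjusted gap is zero (up to floating-point rounding). A summary that reports only the adjusted figure as "the gap disappeared" describes something quite different from what the raw scores show, which is the distinction Sackett and colleagues were drawing.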

Steele and Aronson themselves were not responsible for the misinterpretation --- their paper described the analytic approach clearly. The misinterpretation arose in the secondary literature, in textbook summaries, in popular accounts, and in the institutional rhetoric that grew up around the construct. By the time Sackett, Hardison, and Cullen documented the pattern in 2004, the misinterpretation had become so widespread that it functioned as the conventional understanding of what the study had shown.

Sackett and colleagues followed up over the next decade, documenting that the misinterpretation persisted even after they flagged it. A later analysis found that fifteen years after the 2004 American Psychologist paper, the proportion of journal articles describing the study as showing that stereotype threat accounted for the Black-White gap had declined from approximately 91% to approximately 63% --- a meaningful improvement but still a large majority of papers misdescribing what the underlying data showed.

This is the first piece of the stereotype-threat story: the founding empirical study was elegant and interesting, but the way it was described in the secondary literature was systematically misleading about what it had actually demonstrated.

The Women-in-Math Extension (Spencer 1999)

Four years after the original paper, Steven Spencer, Claude Steele, and Diane Quinn published “Stereotype threat and women’s math performance” in the Journal of Experimental Social Psychology, 35(1), 4-28 (DOI: 10.1006/jesp.1998.1373). This paper extended the stereotype-threat paradigm to a different identity and a different domain. The hypothesis: when women take a difficult math test, the cultural stereotype that women are weaker than men at mathematics creates a stereotype-threat condition that disrupts performance through apprehension about confirming the stereotype.

The paper reported three experiments. In broad terms, the finding was that women high in math identification performed worse than men on difficult math problems when the test was described as one that typically produces gender differences, but performed equivalently to men when the test was described as not producing gender differences. The studies used Michigan undergraduates with strong math backgrounds, and the underlying logic mirrored the Steele & Aronson 1995 framing closely.

Spencer, Steele, and Quinn 1999 became, in many ways, more institutionally consequential than the original Steele & Aronson paper. It provided the empirical anchor for an enormous body of subsequent work on women in STEM, the gender gap in mathematics, and interventions designed to reduce stereotype threat in educational settings. By the 2010s, the Spencer 1999 finding was the standard citation in policy discussions of why women might be underrepresented in mathematics and the physical sciences, and the standard empirical justification for interventions designed to reduce stereotype salience in educational environments.

The Spencer 1999 finding is also the finding that subsequent meta-analyses had the most difficulty reproducing.

What the Meta-Analyses Found

Beginning in the early 2010s, several major meta-analytic reviews tested whether the published literature on stereotype threat --- particularly the women-in-math version --- held up under aggregate quantitative scrutiny.

Stoet & Geary (2012), “Can stereotype threat explain the gender gap in mathematics performance and achievement?” in Review of General Psychology, 16(1), 93-102 (DOI: 10.1037/a0026617), examined attempts to replicate the Spencer, Steele, and Quinn 1999 paradigm. Stoet and Geary reported that only about 55% of relevant replication attempts produced the predicted stereotype-threat effect at all, and that approximately half of the studies that did replicate the effect used a statistical adjustment for prior math performance --- the same kind of covariate adjustment that Sackett, Hardison, and Cullen had flagged as problematic in the Steele & Aronson 1995 context. When Stoet and Geary restricted their analysis to studies that did not rely on this kind of statistical adjustment, only about 30% of replications produced the predicted effect.

The implication was unflattering for the women-in-math version of the construct. A substantial majority of relevant replication attempts, evaluated rigorously, failed to find the effect. The published literature looked the way it did partly because successful replications were preferentially published and unsuccessful ones were preferentially shelved --- the classic file-drawer pattern.

Flore & Wicherts (2015), “Does stereotype threat influence performance of girls in stereotyped domains? A meta-analysis,” in the Journal of School Psychology, 53(1), 25-44 (DOI: 10.1016/j.jsp.2014.10.002), conducted a more comprehensive meta-analysis specifically on stereotype-threat effects in girls and adolescent female students. They examined 47 effect sizes. The estimated mean effect size was approximately d = -0.22, which is small by conventional benchmarks.
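For readers who have not worked with meta-analytic averages, here is a minimal sketch of how a pooled effect size is formed. It uses a fixed-effect, inverse-variance weighted mean, which is a simplification: published meta-analyses commonly use random-effects models, and this is not the specific procedure from the paper. The three effect sizes and variances are hypothetical, and `pool_fixed_effect` is a name chosen for this example.

```python
import math

def pool_fixed_effect(effects):
    """Inverse-variance weighted mean of (d, variance) pairs, plus its standard error."""
    weights = [1.0 / variance for _, variance in effects]
    total_weight = sum(weights)
    pooled = sum(w * d for w, (d, _) in zip(weights, effects)) / total_weight
    return pooled, math.sqrt(1.0 / total_weight)

# Hypothetical studies: small, noisy studies report large effects;
# the one precise study reports an effect close to zero.
studies = [(-0.60, 0.10), (-0.30, 0.05), (-0.05, 0.01)]

pooled_d, pooled_se = pool_fixed_effect(studies)
unweighted_d = sum(d for d, _ in studies) / len(studies)
print(f"unweighted mean d = {unweighted_d:.3f}, pooled d = {pooled_d:.3f}")
```

Because precise studies get more weight, the pooled estimate sits much closer to zero than the simple average of the three reported effects. That is the basic reason a literature dominated by small studies can look more impressive in a narrative review than in a quantitative one.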

But the more important Flore and Wicherts finding was about publication bias. They reported several signs of bias in the literature --- patterns suggesting that the published effects were systematically larger than the true underlying effect would be in an unbiased literature. Their conclusion, in plain terms: the average reported effect of stereotype threat among schoolgirls was small, and after correcting for publication bias, the most likely effect size estimate was near zero. They proposed that a large preregistered replication study would be required to obtain a less biased estimate of the underlying effect.
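The mechanism behind that inflation can be shown with a small simulation. This is a toy model under stated assumptions, not the bias-detection method Flore and Wicherts used: the true effect, the per-group sample size, the number of studies, and the 20% publication rate for non-significant results are all hypothetical parameters, and `simulate_literature` is a name invented for this sketch.

```python
import math
import random
from statistics import mean

def simulate_literature(true_d=-0.05, n_per_group=20, n_studies=2000,
                        publish_prob_nonsig=0.2, seed=0):
    """Simulate study-level effect sizes and a selectively published subset."""
    rng = random.Random(seed)
    se = math.sqrt(2.0 / n_per_group)  # rough standard error of d for equal groups
    all_d = []
    published_d = []
    for _ in range(n_studies):
        d = rng.gauss(true_d, se)
        all_d.append(d)
        # "Significant" here means significant in the predicted (negative) direction.
        significant = d / se < -1.96
        # Significant results are always published; others only some of the time.
        if significant or rng.random() < publish_prob_nonsig:
            published_d.append(d)
    return mean(all_d), mean(published_d)

all_mean, published_mean = simulate_literature()
print(f"mean d, all studies: {all_mean:.3f}")
print(f"mean d, published only: {published_mean:.3f}")
```

With these assumed settings, the average across all simulated studies stays close to the small true effect, while the average across the "published" subset is noticeably more negative. Nothing in the simulation involves misconduct; selective publication alone is enough to make the visible literature overstate the effect.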

That large preregistered replication arrived a few years later. Flore, Mulder, and Wicherts (2018), “The influence of gender stereotype threat on mathematics test scores of Dutch high school students: A registered report,” in Comprehensive Results in Social Psychology (DOI: 10.1080/23743603.2018.1559647), tested the stereotype-threat hypothesis in a Dutch high school sample of more than two thousand students. The study was preregistered and statistically very well powered. The result: no overall effect of stereotype threat on math performance, and no moderated effects through any of the four theoretically motivated moderators (domain identification, gender identification, math anxiety, test difficulty). A direct, well-powered replication of the conceptual framework that Spencer 1999 had launched produced a null result.
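To give a sense of what “very well powered” means here, the sketch below approximates the power of a simple two-group comparison using a normal approximation. The assumptions are mine, not the study’s: an even split of roughly 1,000 students per condition, a simple two-sample comparison that ignores students being nested in classes (which would lower the effective sample size), and a target effect of d = 0.22 borrowed from the uncorrected Flore & Wicherts estimate quoted above. `approx_power` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-sample test for effect size d."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(d) * math.sqrt(n_per_group / 2)
    # Probability of landing in either rejection region when the true effect is d.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

print(f"n = 20 per group:   power = {approx_power(0.22, 20):.2f}")
print(f"n = 1000 per group: power = {approx_power(0.22, 1000):.3f}")
```

Under these assumptions, a typical small lab study with 20 participants per condition would detect an effect of that size only about one time in ten, while a study with around 1,000 per condition would detect it almost every time. A null result from the larger design is therefore far more informative than a null from the smaller one.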

The largest and most recent meta-analysis is Shewach, Sackett, and Quint (2019), “Stereotype threat effects in settings with features likely versus unlikely in operational test settings: A meta-analysis,” in the Journal of Applied Psychology, 104(12), 1514-1534 (DOI: 10.1037/apl0000420). Shewach and colleagues analyzed more than 200 experimental studies of stereotype threat on cognitive ability tests, looking specifically at how the features of laboratory stereotype-threat manipulations compared to the features that would actually be present in real high-stakes testing settings (such as college admissions tests or employment selection tests).

Their core finding: the size of the stereotype-threat effect in laboratory settings was larger than the size of the effect in study conditions that approximated real high-stakes testing environments. When the analysis was restricted to study features more representative of operational testing, the effect was, in their summary, “negligible to small.” They emphasized that blatant manipulations of stereotype threat (such as explicitly telling participants that a test typically produces gender or race differences) produced larger effects than subtle manipulations, but blatant manipulations are rarely a feature of real high-stakes testing environments --- they are a feature of laboratory experiments designed to maximize the chance of detecting the effect.

Their conclusion was careful: they did not claim stereotype threat does not exist. They claimed the evidence does not support strong claims that stereotype threat systematically affects performance on real high-stakes cognitive ability tests in the kind of operational settings the policy framework was built to address.

Three meta-analyses, conducted by different research teams across about seven years, converged on roughly the same picture: real effects in some specific laboratory paradigms; substantial publication bias inflating the apparent literature; much smaller effects in well-controlled and well-powered replications; and effects that may be near zero in the operational testing settings where the policy framework had its largest institutional impact.

Why It Looked So Real

The stereotype-threat story is interesting partly because it is not a story of fraud (Diederik Stapel) or research misconduct (Brian Wansink). The original studies were real studies done by serious researchers. The effects they reported were not invented. The construct they named was at least partially capturing something. So why did the popular and institutional version of the framework so dramatically overshoot the underlying evidence?

The file-drawer problem was particularly severe in this literature. Stereotype-threat studies are relatively cheap to run --- undergraduates in a lab, a brief manipulation, a standard cognitive test. The publication ecosystem in social psychology in the late 1990s and 2000s rewarded novel, surprising findings with social-justice resonance. Successful demonstrations of stereotype threat were highly publishable. Failed replications were widely understood to be much harder to publish. The result was a literature in which the visible effects were systematically inflated relative to the underlying true effect, exactly as Flore and Wicherts later documented quantitatively.

The construct had unusual narrative resonance. The civil-rights era had established a national framework for thinking about how environmental conditions shape the achievement of historically disadvantaged groups. Stereotype threat fit that framework almost perfectly. It provided a mechanism --- apprehension about confirming a stereotype --- that explained why structural disadvantage might persist even in environments without explicit discrimination. The mechanism was psychologically plausible, scientifically respectable, and politically aligned with widely shared aspirations about reducing achievement gaps. The narrative fit was, in retrospect, almost too good. A construct that perfectly explains a deeply felt social problem and is supported by elegant lab demonstrations is exactly the kind of construct that should trigger extra empirical scrutiny.

The original researchers were excellent communicators. Claude Steele’s 2010 book Whistling Vivaldi brought the construct to a general audience in lucid, compelling prose. The framing was memorable. The implications felt actionable. Researchers in adjacent fields adopted the construct rapidly. Policy entrepreneurs translated it into intervention designs. By the time the careful meta-analytic evidence began to arrive, the institutional infrastructure built on the popular version of the construct was substantial.

The misinterpretation that Sackett and colleagues documented in 2004 gave the construct a free upgrade. Because so many secondary descriptions claimed that Steele and Aronson had shown the Black-White test-score gap disappeared in the no-threat condition, the construct was understood by policy audiences to be a much stronger empirical demonstration than it actually was. The actual finding, a difference in covariate-adjusted scores between test-framing conditions, was modest and specific. The widely cited interpretation, a complete elimination of racial gaps through environmental change, was sweeping and dramatic. The popular framework was running on the upgraded version.

The replication evidence took decades to arrive. Steele & Aronson 1995. Stoet & Geary 2012. Flore & Wicherts 2015. Shewach, Sackett & Quint 2019. Twenty-four years from the foundational study to the largest definitive meta-analysis. During those twenty-four years, the institutional and cultural footprint of the construct grew enormously. When the meta-analytic evidence arrived, it had to overcome a fully formed movement with deep institutional roots.

What’s Honest to Say Now

The stereotype-threat story does not fit cleanly into either of the two narratives that dominate replication-crisis discourse. It is not a clean falsification (Vicary’s subliminal advertising) and it is not a successful replication of the original claim. It sits in the same uncomfortable middle territory as growth mindset and ego depletion: the construct captures something real in some specific paradigms, but the popular and institutional versions of the framework dramatically overshot what the rigorous quantitative evidence supports.

Three layers, ordered from most-supported to least-supported.

Layer 1: Some stereotype-threat effects are real in some specific laboratory paradigms. When experimenters use blatant manipulations (explicitly telling participants a test produces gender or race differences), in samples highly identified with the stereotyped domain, on difficult tests, in low-stakes laboratory settings, they sometimes produce effects in the direction the original studies predicted. The construct is not zero. The lab effect, in optimal conditions, exists.

Layer 2: The aggregate effect is smaller than the popular framing implies, and the literature shows substantial publication bias. The most rigorous meta-analyses --- Stoet & Geary 2012, Flore & Wicherts 2015, Shewach, Sackett & Quint 2019 --- converge on the picture of an effect that is small in well-conducted studies, substantially inflated by publication bias in the broader literature, and absent or near-zero in well-powered preregistered replications like Flore, Mulder & Wicherts 2018. The picture is closer to “real but small and heterogeneous” than to “real and substantial enough to drive policy.”

Layer 3: The strong policy claims are not supported. Three claims in particular go beyond the current meta-analytic evidence: that stereotype threat explains a substantial share of real-world achievement gaps in cognitive testing, that interventions designed to reduce stereotype salience can meaningfully change demographic gaps in high-stakes test performance, and that the empirical literature provides a robust foundation for the specific intervention designs that became standard in DEI training. The Shewach 2019 conclusion is the key one here: in the operational testing settings where the policy framework was meant to apply, the effect is at most small and may be negligible.

This is uncomfortable terrain for people who have built careers, training programs, or institutional commitments on the larger version of the construct. The discomfort is legitimate. It does not change what the evidence shows.

What This Means For Strategists

If you are a consultant, executive, or policy professional whose organization has deployed interventions justified in part by reference to stereotype-threat research, here are the strategist takeaways.

1. Distinguish narrative from evidence when evaluating any DEI intervention. The stereotype-threat literature is the clearest available case study in how a powerful narrative can travel much further than the underlying empirical base supports. The narrative --- that environmental cues activate stereotypes which depress performance which can be intervened on by changing the cues --- is psychologically compelling. The underlying evidence is much thinner than the narrative implies. The same gap likely exists for other behavioral-science constructs that have been imported into DEI practice. Before deploying an intervention based on a behavioral-science claim, ask: what does the rigorous meta-analytic evidence actually show about effect sizes? Is the literature subject to publication bias? Are there preregistered replications? Most popular behavioral-science claims, examined this way, look much smaller and more heterogeneous than the popular framing.

2. Be cautious about high-stakes policy decisions justified by laboratory effects. The largest single insight from the Shewach 2019 meta-analysis is the laboratory-versus-operational distinction. Effects that appear in carefully designed laboratory paradigms with optimal conditions for detection often do not appear in operational settings with real stakes, real participants, and real test environments. This is a general lesson, not specific to stereotype threat. Behavioral-science effects often look much larger in the conditions where they were originally demonstrated than in the conditions where you would actually want to use them.

For organizational decisions: if you are considering an intervention based on a published behavioral-science effect, ask whether the conditions of the original study resemble the conditions of your proposed deployment. If the original study used motivated undergraduates in a lab with a low-stakes task, and you are proposing to use the construct in a high-stakes evaluative setting with employees who don’t necessarily care about the manipulation, the realistic expected effect is much smaller than the original.

3. Use rigorous evaluation when you do deploy interventions. If your organization is investing in DEI interventions, the right move is not to abandon them on the strength of a meta-analytic verdict --- many DEI interventions are valuable for reasons other than stereotype-threat reduction, and the broader project of building equitable institutions does not depend on the empirical fate of a single construct. The right move is to evaluate the interventions you deploy with rigorous methods: randomized assignment where possible, preregistered analytic plans, outcome measures that matter for the goals you actually care about, and honest reporting of null results. The same methodological discipline that exposed the limitations of the stereotype-threat literature is the discipline that lets you build interventions that actually do what they’re meant to do.

4. Communicate evidentiary uncertainty rather than collapsing into endorsement or dismissal. When advising a client, a board, or a team on the current state of the stereotype-threat evidence, the tempting moves are (a) endorse the construct because it aligns with values your audience holds, or (b) dismiss the construct because the meta-analyses are unflattering. Both moves are dishonest. The accurate move is to communicate the actual state of the evidence: real but small effects in some specific paradigms, substantial publication bias inflating the broader literature, near-zero effects in well-powered preregistered replications, and effects that are most likely negligible in the high-stakes operational settings the policy framework was meant to address. This is harder to communicate than a clean verdict. It is also a much more useful basis for organizational decisions, because it correctly calibrates expectations for what the intervention can realistically deliver.

The discipline of holding uncertainty without collapsing into false confidence --- in either direction --- is one of the highest-value habits for anyone whose work involves translating behavioral science into institutional practice.
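The evaluation discipline in takeaway 3 can be sketched in a few lines. This is a minimal illustration, not a full evaluation framework: `randomize` and `diff_in_means` are hypothetical helper names, the unit IDs and seed are placeholders, and the confidence interval uses a simple normal approximation that a real analysis plan might replace with something more careful.

```python
import math
import random
from statistics import mean, stdev

def randomize(unit_ids, seed):
    """Shuffle unit IDs with a fixed seed and split them into two arms."""
    rng = random.Random(seed)  # fixed seed makes the assignment auditable
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def diff_in_means(treated_outcomes, control_outcomes):
    """Difference in mean outcome with an approximate 95% confidence interval."""
    diff = mean(treated_outcomes) - mean(control_outcomes)
    se = math.sqrt(
        stdev(treated_outcomes) ** 2 / len(treated_outcomes)
        + stdev(control_outcomes) ** 2 / len(control_outcomes)
    )
    return diff, (diff - 1.96 * se, diff + 1.96 * se)

# Placeholder example: 100 units (teams, sites, cohorts) identified by number.
treatment_ids, control_ids = randomize(range(100), seed=2024)
print(len(treatment_ids), len(control_ids))
```

The design choice that matters is the order of operations: fix the assignment and the outcome measure before the intervention runs, then report the difference and its interval whether or not it excludes zero.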

Sources

This article is part of an ongoing series on famous behavioral-science studies and the replication-crisis evidence around them. Other entries cover the Stanford Prison Experiment, power posing, the marshmallow test, ego depletion, the facial feedback hypothesis, Bargh elderly priming, the bystander effect, the Mozart Effect, and growth mindset --- the closest analog to this article in the “real but smaller than promised” category. The full hub lives at /replication-crisis/.

If your organization has deployed interventions justified by stereotype-threat research and you want a rigorous evidence review of what you can realistically expect, book a consultation.

FAQ

Is stereotype threat “real”? Yes, in the sense that some stereotype-threat effects can be produced in some specific laboratory paradigms, particularly with blatant manipulations, motivated participants, and difficult tests. No, in the sense that the popular and institutional version of the construct claims much more than the rigorous evidence supports. The honest current verdict is that effects are real but small in lab conditions, heavily inflated by publication bias in the broader literature, and most likely negligible in real high-stakes testing settings.

Does this mean Claude Steele’s work was fraudulent or wrong? No. The original studies were real studies, conducted by serious researchers, that captured at least some real phenomenon. The problem is not the original work --- it is the gap between what the original work showed and what the popular and institutional version of the framework claimed. Steele and his colleagues bear some responsibility for not pushing back harder against overinterpretations of their findings, but the larger problem is structural: a publication ecosystem that rewards novel findings, a file-drawer problem that hides null replications, a narrative environment that amplified the construct beyond what the evidence supported.

What about Sackett 2004 --- did Steele and Aronson misrepresent their own findings? No. The Sackett 2004 critique was specifically about the secondary literature --- how textbooks, journal review articles, and popular accounts described Steele & Aronson 1995. Steele and Aronson described their analytic approach (including the SAT covariate adjustment) clearly in the original paper. The misinterpretation was downstream of the original publication, in the secondary literature that summarized it. Sackett’s contribution was documenting how widespread the misinterpretation had become.

Should organizations stop running DEI training programs? This question is bigger than the stereotype-threat evidence and should not be answered solely on the basis of one construct’s empirical fate. DEI programs serve multiple purposes --- legal compliance, signaling of organizational values, providing language for handling workplace conflict, training in inclusive practices --- and the value of those purposes does not depend on whether stereotype threat as a specific construct holds up. The evidence-based move is to be honest about which specific claims in any given DEI program are well-supported and which are not, and to evaluate the actual outcomes of programs you deploy rather than assuming they work because they invoke well-known constructs.

What about the women-in-STEM gap --- is stereotype threat the explanation? The current meta-analytic evidence does not support stereotype threat as a primary explanation for the women-in-STEM gap. The Stoet & Geary 2012 and Flore & Wicherts 2015 analyses both found that the women-in-math version of the construct is weakly supported in well-controlled studies, and the Flore, Mulder & Wicherts 2018 preregistered replication found null effects. This does not mean the women-in-STEM gap is not real, or that environmental factors do not contribute to it --- it means stereotype threat as a specific mechanism is not the main empirical lever for explaining or addressing the gap.

Are there any preregistered replications that succeeded? The preregistered replication record for stereotype threat is mixed and skews negative. The largest and most rigorous preregistered replication --- Flore, Mulder & Wicherts 2018, with more than 2,000 Dutch high school students --- produced null results. Finnigan & Corker 2016 also produced a well-powered failed replication. Smaller-scale preregistered replications have produced more variable results. The pattern across the preregistered replication literature is consistent with the meta-analytic verdict: real effects in some specific conditions, but absent or near-zero in well-powered direct tests.

How should I read older papers citing stereotype threat? Carefully. A large fraction of the secondary literature on stereotype threat --- including textbook treatments, popular accounts, and applied policy documents written before about 2015 --- describes the construct in terms that the rigorous meta-analytic evidence does not support. This is not the authors’ fault; they were summarizing the conventional understanding of the construct at the time. But when reading older work that invokes stereotype threat, you should treat the strong claims with skepticism and look for whether the underlying empirical citation actually supports the strong claim being made.

Is this story like the growth mindset story? Yes, structurally. Both stereotype threat and growth mindset are cases where a behavioral-science construct produced real but small effects in some specific paradigms, was amplified by popular communication and institutional uptake into claims much larger than the underlying evidence supported, and was eventually evaluated by rigorous meta-analyses that found small effects and substantial publication bias. The strategist lesson is the same in both cases: be cautious about constructs that have traveled much further than the rigorous evidence base supports, and use rigorous evaluation when you deploy interventions that invoke them. The growth mindset article in this hub walks through the parallel evidence story in education.

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified behavioral economist (1 of ~1,000 worldwide). 200+ A/B tests across energy, SaaS, fintech, e-commerce, and marketplace verticals.