The IAT is the world’s most-used measure of implicit bias --- over 40 million tests completed, deployed in DEI training across federal agencies and Fortune 500 companies. Independent meta-analyses since 2013 show IAT scores correlate only r ≈ .15 with discriminatory behavior, and changing IAT scores does not change behavior. The original authors have themselves walked back individual-prediction claims. The popular narrative has not updated.
You can take it right now. Open a browser, go to Harvard’s Project Implicit website, click through the consent screen, and within ten minutes a piece of software running at implicit.harvard.edu will tell you whether you have an implicit preference for White faces over Black faces, for thin people over fat people, for young people over old people, for men in careers and women in families. The site delivers your result with the gravity of a medical screening: “Your data suggest a strong automatic preference for European Americans compared to African Americans.” You will likely feel something on reading that line, and that something is the point.
The Implicit Association Test has been completed more than 40 million times through Project Implicit alone, drawn from over 80 million study sessions launched on the site. Implicit-bias modules have been written into corporate DEI curricula at most of the Fortune 500. The National Institutes of Health offers a two-hour implicit-bias module to its scientific workforce. The Federal Judicial Center publishes guidance on implicit-bias trainings for the federal judiciary. Police departments, medical schools, law firms, and university faculties run sessions built around the assumption that the IAT measures something --- call it implicit bias --- that predicts how participants will behave toward members of stigmatized groups.
The empirical reality is more complicated than the deployment pattern suggests.
The largest independent meta-analysis of IAT predictive validity, published in the Journal of Personality and Social Psychology in 2013, found average correlations of ρ̂ ≈ .15 for race-domain criterion behaviors and ρ̂ ≈ .14 overall --- small enough that the authors explicitly concluded the IAT does not predict racial or ethnic discrimination well, and no better than self-report measures for most criterion types. A 2019 meta-analysis of 492 studies and 87,418 participants found that procedures that change IAT scores do not, in general, change downstream behavior. Test-retest reliability of the standard IAT sits around r = .55–.60, well below the threshold conventionally required for individual-level diagnostic instruments (commonly put at .80 or higher). And in 2015, the IAT’s principal architects --- Anthony Greenwald, Mahzarin Banaji, and Brian Nosek --- published a statement in the same journal acknowledging that the psychometric properties of various IATs “render them problematic to use to classify persons as likely to engage in discrimination.”
This is the rare case in the replication-crisis literature where the original creators have themselves walked back the strongest claims about their instrument. The popular narrative, the corporate DEI infrastructure, and a substantial fraction of the bias-training industry have not walked back with them. This article walks through what the IAT actually measures, what the predictive-validity record looks like, how the creators have updated their own position, and what an evidence-respecting strategist should do with the gap between the science and the deployment.
The 1998 Original Study
The IAT was introduced in Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998), “Measuring Individual Differences in Implicit Cognition: The Implicit Association Test,” published in the Journal of Personality and Social Psychology, 74(6), 1464–1480 (DOI: 10.1037/0022-3514.74.6.1464). The paper proposed a reaction-time paradigm intended to measure the strength of automatic associations between mental representations.
The procedure is elegant. Participants sit at a keyboard. On the screen, words and pictures flash one at a time, and the participant sorts them with two keys. In the standard Black/White race IAT, the categories paired with the same key change across blocks. In one block, you press the left key for Black faces and for negative words (“evil,” “failure”), and the right key for White faces and positive words (“joy,” “love”). In another block, the pairings reverse: left key for Black faces and positive words, right key for White faces and negative words.
The score --- the D score, formalized in a later 2003 paper by Greenwald, Nosek, and Banaji --- is built from the difference in average response time between the two pairing blocks, scaled by the participant’s overall response variability. The interpretation: if you respond faster when Black faces share a key with negative words than when they share a key with positive words, you have a stronger automatic association between Black and negative than between Black and positive. The test labels that pattern an “implicit preference” for White over Black.
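The scoring idea described above can be sketched in a few lines. This is a deliberately simplified illustration, not the published 2003 scoring algorithm: it omits trial filtering and error handling, the function name `simplified_d_score` is hypothetical, and the latencies are made-up numbers chosen only to show the arithmetic.

```python
from statistics import mean, stdev

def simplified_d_score(compatible_ms, incompatible_ms):
    """Difference in mean latency between the two pairing blocks, divided by
    the standard deviation of all trials pooled across both blocks.
    Simplified sketch: no trial filtering, no error penalties."""
    pooled_sd = stdev(list(compatible_ms) + list(incompatible_ms))
    return (mean(incompatible_ms) - mean(compatible_ms)) / pooled_sd

# Hypothetical response latencies in milliseconds (illustration only).
compatible = [620, 580, 640, 600, 610]    # block the participant sorts faster
incompatible = [760, 720, 800, 700, 770]  # block with the pairings reversed

d = simplified_d_score(compatible, incompatible)
print(f"simplified D = {d:.2f}")
```

A positive value here means the participant was slower in the second block than in the first. Swapping the two arguments flips the sign, which is why the direction of the score depends entirely on which pairing is labeled "compatible."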
The 1998 paper presented three experiments demonstrating that the IAT could detect predicted associations (flowers being associated more strongly with pleasant than unpleasant; Japanese-American participants showing different associations with Japanese versus Korean categories than Korean-Americans did) and could distinguish between groups. The construct the authors had in mind was implicit attitudes --- evaluative associations that were not necessarily accessible to introspection and that might dissociate from explicitly reported attitudes.
This was, on its face, a measurement contribution. The IAT was demonstrably measuring something --- automatic associations between concepts in memory --- and was reliable enough at the group level to detect differences predicted by theory. None of the 1998 paper’s claims required that the test be useful for predicting individual behavior outside the lab. That extension came later, in the popularization phase, and it is the extension that has not held up.
What Predictive Validity Means And Why It Matters
To make sense of the IAT debate you have to distinguish three properties that test developers care about: reliability, construct validity, and predictive validity. These get conflated in popular coverage of the IAT, and the conflation does real damage.
Reliability is about consistency of measurement. If a test is reliable, you get similar scores when you measure the same person on different occasions (test-retest reliability) or when you measure them with different items meant to assess the same construct (internal consistency). Reliability is a precondition for everything else: an instrument that produces wildly different scores for the same person on different days cannot be useful for individual-level prediction, because you don’t know which day’s score is “real.”
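One rough way to see what a given reliability implies for a single person's score is the standard error of measurement from classical test theory, SEM = SD × sqrt(1 − reliability). The sketch below assumes standardized scores (SD = 1), uses the .55 to .60 test-retest range cited earlier in this article, and adds .90 purely as a hypothetical comparison point.

```python
import math

def standard_error_of_measurement(reliability, sd=1.0):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

# .55 and .60 are the IAT test-retest figures cited in this article;
# .90 is a hypothetical comparison value for a high-reliability instrument.
for r in (0.55, 0.60, 0.90):
    sem = standard_error_of_measurement(r)
    half_width = 1.96 * sem  # approximate 95% band around an observed score
    print(f"reliability={r:.2f}  SEM={sem:.2f} SD  95% band = +/-{half_width:.2f} SD")
```

Under these assumptions, a reliability of .55 leaves an uncertainty band of roughly ±1.3 standard deviations around any one observed score, which covers a large share of the whole score distribution.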
Construct validity is about whether the test measures what it claims to measure. A test of mathematical ability has construct validity to the extent that high scorers really are better at math than low scorers, and to the extent that the test scores correlate with other valid measures of math ability while not correlating excessively with measures of unrelated constructs (discriminant validity).
Predictive validity is about whether test scores predict outcomes that the test is supposed to be useful for predicting. A college-admissions test has predictive validity to the extent that scores predict freshman GPA. A pre-employment cognitive ability test has predictive validity to the extent that scores predict job performance.
The IAT fares differently on each of these dimensions. The IAT measures something --- the reaction-time difference is reliable enough at the group level that this isn’t in dispute. What that something corresponds to is contested (Schimmack 2021 made the case that the IAT cannot be cleanly interpreted as a measure of implicit attitudes specifically, because it fails standard discriminant-validity tests). And the predictive-validity record --- the question of whether IAT scores predict discriminatory behavior in any practically useful way --- is where the empirical bottom drops out.
This matters because the public uses of the IAT --- diagnosing individuals as biased, predicting which trainees will discriminate, justifying personalized DEI interventions --- all require predictive validity at the individual level. Reliability without predictive validity is a thermometer that consistently reads 98.6 degrees regardless of how hot or cold the day is: precisely measuring something that does not correspond to what you wanted to know.
The 2009 Greenwald Meta-Analysis
The first comprehensive meta-analytic look at IAT predictive validity came from the test’s principal architects. Greenwald, A. G., Poehlman, T. A., Uhlmann, E. L., & Banaji, M. R. (2009), “Understanding and Using the Implicit Association Test: III. Meta-Analysis of Predictive Validity,” published in the Journal of Personality and Social Psychology, 97(1), 17–41 (DOI: 10.1037/a0015575), reviewed 122 research reports covering 184 independent samples and 14,900 participants.
The headline numbers are these: the average correlation between IAT measures and behavioral, judgment, and physiological criterion measures was r = .274. For samples involving Black/White interracial behavior specifically --- the domain most relevant to the public’s understanding of the IAT --- the average was r = .236. For comparison, parallel self-report measures available in 156 of the samples produced an average predictive correlation of r = .361.
You can read this finding two ways, and that’s part of what makes the IAT debate productive. The Greenwald team’s framing was that the IAT showed meaningful predictive validity, and that the IAT predicted intergroup behavior with effect sizes that, while modest in absolute terms, were on a par with effect sizes routinely treated as meaningful in personality and social psychology. They argued the IAT was a useful complement to self-report measures, especially for socially sensitive topics where self-report is contaminated by impression management.
The skeptical reading is that r ≈ .24–.27 explains roughly 6–7% of the variance in discriminatory behavior, which is a long way from the popular framing of the IAT as a window into unconscious racism that determines how people will act. And the choice of which studies to include in a meta-analysis, how to define “discrimination” outcomes, and how to handle quality variation across studies turn out to matter a great deal. Both readings would be tested by the 2013 independent meta-analysis.
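The variance-explained arithmetic behind the skeptical reading is just the square of each correlation. A minimal sketch using the three averages reported from the 2009 meta-analysis (the dictionary labels are informal shorthand, not the paper's own terms):

```python
# Average correlations reported in the 2009 meta-analysis, as cited above.
correlations = {
    "IAT, all criterion measures": 0.274,
    "IAT, Black/White interracial behavior": 0.236,
    "Self-report, parallel measures": 0.361,
}

# Share of criterion variance accounted for is r squared.
variance_explained = {label: r ** 2 for label, r in correlations.items()}

for label, share in variance_explained.items():
    print(f"{label}: r^2 = {share:.3f} ({share * 100:.1f}% of variance)")
```

Squaring puts the IAT averages at about 5.6% and 7.5% of criterion variance, in line with the rough 6 to 7 percent figure, and the self-report average at about 13%.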
The 2013 Oswald Meta-Analysis
The first major independent meta-analysis of IAT predictive validity was published as Oswald, F. L., Mitchell, G., Blanton, H., Jaccard, J., & Tetlock, P. E. (2013), “Predicting Ethnic and Racial Discrimination: A Meta-Analysis of IAT Criterion Studies,” in the Journal of Personality and Social Psychology, 105(2), 171–192 (DOI: 10.1037/a0032734).
Oswald and colleagues focused specifically on race and ethnicity domains --- the IAT applications that had received the most public attention. They re-coded the criterion studies in the Greenwald 2009 sample (and additional studies) to evaluate which outcomes actually constituted discrimination --- behaviors that disadvantaged a target on the basis of group membership --- rather than other outcomes that had been folded into earlier predictive-validity claims (such as brain-activity responses, attitudinal self-reports, or policy preferences).
The result was substantially less favorable to the IAT than the Greenwald team’s earlier estimate. The estimated average correlations across criterion domains were ρ̂ = .14 overall, ρ̂ = .15 for the race domain, and ρ̂ = .12 for the ethnicity domain. The IAT predicted “microbehaviors” (small interactional cues like seating distance or eye contact) better than it predicted more consequential outcomes like interpersonal behavior, person perceptions, policy preferences, or hiring-relevant judgments --- and the IAT performed no better than explicit self-report measures for most criterion categories.
The authors’ conclusion was direct: IAT scores are not good predictors of ethnic or racial discrimination, and they explain at most small fractions of the variance in discriminatory behavior even in controlled laboratory settings, where signal-to-noise should be most favorable to the IAT.
A subsequent re-examination by Carlsson, R., & Agerström, J. (2016), “A Closer Look at the Discrimination Outcomes in the IAT Literature,” Scandinavian Journal of Psychology, 57(4), 278–287 (DOI: 10.1111/sjop.12288), tightened the criterion further. Carlsson and Agerström argued that many of the outcomes treated as “discrimination” in the IAT meta-analyses were not actually operationalizations of discrimination --- they included things like brain-activity responses and voting preferences that, while interesting, are not the same construct as discriminatory action toward a person from a stigmatized group. When they restricted the meta-analysis to outcomes that genuinely constituted discrimination, the overall effect was close to zero and highly inconsistent across studies.
The Greenwald team and the Oswald team have continued to dispute methodology --- particularly the relative weight to put on small effects that might still aggregate to societal significance in the right institutional contexts (a defense Greenwald, Banaji, and Nosek elaborated in a 2015 Journal of Personality and Social Psychology response titled “Statistically Small Effects of the Implicit Association Test Can Have Societally Large Effects”). That defense has merit at the population scale. It does not rescue the individual-level prediction claims that drive most popular and applied uses of the test.
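The gap between the population-scale defense and individual-level prediction can be illustrated with a rough calculation. The sketch below assumes IAT score and behavior are bivariate normal with correlation r = .15 (the race-domain estimate cited above), uses a simple median split on both variables, and picks a hypothetical group size of 10,000. Under bivariate normality, the probability that someone above the median on one variable is also above the median on the other is 1/2 + arcsin(r)/π.

```python
import math

def same_side_probability(r):
    """P(above median on Y | above median on X) for a bivariate normal
    pair with correlation r: 1/2 + arcsin(r) / pi."""
    return 0.5 + math.asin(r) / math.pi

r = 0.15  # race-domain estimate from the 2013 meta-analysis, as cited above
p = same_side_probability(r)

n_per_group = 10_000  # hypothetical number of people in each half of the split
expected_high = n_per_group * p        # high scorers above the behavior median
expected_low = n_per_group * (1 - p)   # low scorers above the behavior median

print(f"individual level: {p:.3f} chance a high scorer is above the behavior median")
print(f"aggregate level: {expected_high:.0f} vs {expected_low:.0f} out of {n_per_group}")
```

Under these assumptions a high scorer is only about 55% likely to land above the median on the behavior, barely better than a coin flip, yet across 10,000 people per group the expected difference is several hundred people. Both statements follow from the same correlation, which is why the aggregate argument does not carry over to individuals.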
The 2019 Forscher Meta-Analysis
If the IAT measures implicit bias, and if implicit bias causes discriminatory behavior, then interventions that reduce IAT scores should produce reductions in discriminatory behavior. This is the implicit theory behind every implicit-bias training program ever sold. The test detects the bias, the training reduces the bias, the behavior changes downstream.
Forscher, P. S., Lai, C. K., Axt, J. R., Ebersole, C. R., Herman, M., Devine, P. G., & Nosek, B. A. (2019), “A Meta-Analysis of Procedures to Change Implicit Measures,” in the Journal of Personality and Social Psychology, 117(3), 522–559 (DOI: 10.1037/pspa0000160), tested that implicit theory directly. The meta-analysis included 492 studies and 87,418 participants --- at the time of publication, one of the largest meta-analytic syntheses ever conducted in social psychology.
The findings have two parts that need to be stated separately.
Part one: implicit measures can be changed, but the effects are usually small and short-lived. Procedures designed to shift IAT scores produced average effect sizes of |d| < .30, with the largest effects from manipulations that directly altered the associative structure being measured (e.g., teaching new pairings) or that taxed cognitive resources. Most of the studied procedures focused on brief, single-session interventions, and the meta-analysis did not establish durable change.
Part two --- the part that matters for the bias-training industry: changes in implicit measures did not produce corresponding changes in behavior. Looking at the subset of studies that examined both IAT change and downstream behavior change in the same participants, Forscher and colleagues found little evidence of a causal relationship between changes in implicit measures and changes in explicit measures or behavior. Reducing IAT scores by intervention did not reliably reduce discriminatory behavior. The author team included Brian Nosek, one of the IAT’s principal architects and a co-founder of Project Implicit; the publication of this finding in the field’s flagship journal, with that author on the byline, is a substantial in-paradigm acknowledgment.
This is the empirical core that the bias-training industry has not yet absorbed. The standard pitch for implicit-bias training is: “Your employees probably have implicit biases (measurable with the IAT). Our training will reduce those biases. Reduced bias will translate to fairer treatment of customers and colleagues.” The Forscher meta-analysis decoupled the second and third steps of that pitch from each other. Interventions that change IAT scores do not, on the evidence, change behavior in the directions the pitch promises. The downstream link is the link the field has been least able to demonstrate, and it is the link the commercial pitch most needs.
Greenwald’s Own Walk-Back
What makes the IAT case unusual in the replication-crisis literature is that the test’s principal architects have themselves published statements acknowledging the limits of individual-level prediction. This is not a case of the field disowning a tool while its creators dig in. The creators have explicitly walked back the strongest individual-prediction claims, while continuing to defend the IAT as a useful population-level instrument and as a window into the cognitive processes the test was designed to measure.
The clearest example comes from Greenwald, A. G., Banaji, M. R., & Nosek, B. A. (2015), “Statistically Small Effects of the Implicit Association Test Can Have Societally Large Effects,” Journal of Personality and Social Psychology, 108(4), 553–561. In responding to the Oswald 2013 critique, Greenwald and colleagues acknowledged that the psychometric issues associated with various IATs “render them problematic to use to classify persons as likely to engage in discrimination.” Their defense was that small effects aggregated over many cases --- for example, across many hiring decisions in a large institution --- could still produce socially significant disparities, even if any single IAT score had limited diagnostic value for any single person.
That defense is reasonable on its own terms. It also concedes the entire individual-level diagnostic use of the test. If IAT scores are problematic for classifying persons as likely to engage in discrimination, then it is inappropriate to use IAT scores as personal feedback in a DEI training, as a screening tool in hiring, or as a predictor of which employees will mistreat which customers. Those are exactly the uses that the popular framing of the IAT has encouraged and that a substantial fraction of the bias-training industry has built revenue around.
The walk-back continued in Greenwald, A. G., & Banaji, M. R. (2017), “The Implicit Revolution: Reconceiving the Relation Between Conscious and Unconscious,” American Psychologist, 72(9), 861–871. The 2017 paper reframed the IAT’s contribution as a population-level instrument and as a tool for studying the cognitive structure of associations --- not as a personal diagnostic. Earlier popular framings, including the framing in Banaji and Greenwald’s 2013 book Blindspot: Hidden Biases of Good People (“the Race IAT… is now established as signaling discriminatory behavior” and “predicts discriminatory behavior even among research participants who earnestly espouse egalitarian beliefs”), have not been reaffirmed in this stronger form in the subsequent decade of the authors’ peer-reviewed output.
The trajectory of the creators’ position is a case study in what intellectual honesty looks like in the face of accumulating disconfirming evidence. They have not disowned the instrument; they have not retracted the 1998 paper; they continue to argue for the IAT’s utility at the population scale. They have, in published peer-reviewed work, narrowed the legitimate uses of the test substantially relative to the 2000s-era popular framing. A reasonable inference is that the strongest version of the IAT story --- the IAT diagnoses unconscious racism in individuals and explains who will discriminate against whom --- is not a position the test’s creators are still willing to defend in writing, and has not been the position of the empirical literature for at least a decade.
The popular narrative, the corporate-training market, and the cultural meaning of “your IAT score” in DEI discourse have not received the memo.
What’s Honest To Say About Implicit Bias Now
Strip away the IAT and the question of whether implicit bias exists remains a live one. The honest answer is more nuanced than either “implicit bias is a myth” or “the IAT is just measurement noise.”
Implicit attitudes --- automatic, fast evaluations that can dissociate from explicit reports --- are a real cognitive phenomenon. A large body of work in cognitive and social psychology supports the existence of attitudes that are activated quickly, can run ahead of deliberative judgment, and are not fully accessible to introspection. This is a well-established finding in basic cognitive psychology, independent of the IAT specifically.
Discrimination in real-world settings is real and measurable. Audit studies that send matched fictitious résumés or fictitious house-rental inquiries with names varied by perceived race produce replicable, well-documented discriminatory patterns: equivalent applicants with stereotypically Black names get fewer callbacks than equivalent applicants with stereotypically White names. (Bertrand & Mullainathan’s 2004 audit study, replicated many times since, is the canonical example.) Discriminatory outcomes in housing, lending, hiring, and policing are documented at scale through observational and experimental methods that do not depend on the IAT.
The link between individual IAT scores and individual discriminatory behavior is weak. This is the empirical claim that the Oswald, Carlsson, and Forscher analyses converge on. Knowing a person’s IAT score is not a useful basis for predicting whether that person will discriminate.
The link between population-level IAT distributions and population-level outcomes is more defensible. Greenwald, Banaji, and Nosek’s small-effects-aggregate argument has merit at the institutional and societal scale. If many hiring decisions are influenced by automatic associations, even small effects can produce meaningful aggregate disparities. This is a useful frame for thinking about systemic patterns; it is not a useful frame for diagnosing individual managers.
The construct interpretation of the IAT remains contested. Schimmack’s 2021 critique (Schimmack, U., “The Implicit Association Test: A Method in Search of a Construct,” Perspectives on Psychological Science, 16(2), 396–414, DOI: 10.1177/1745691619863798) argued that the IAT lacks discriminant validity --- that is, IAT scores do not cleanly distinguish “implicit attitudes” from other constructs they should be distinguishable from in a multi-method context. The Kurdi, Ratliff, and Cunningham (2021) response in the same journal defended the IAT’s utility while conceding that more work on construct validation is needed. The debate is ongoing, and the responsible position is to treat the IAT as a measure of something --- automatic associative processing in a specific task context --- without taking on the stronger interpretation that this something is “implicit bias” in the lay sense.
The picture, in short, is this: implicit attitudes are real, discrimination is real, and the IAT measures automatic association in a reliable way at the group level. The IAT is not a reliable individual-level diagnostic, IAT-based interventions do not reliably reduce discriminatory behavior, and the popular framing of the test as a window into unconscious racism that predicts individual conduct is not supported by the empirical literature and is not defended by the test’s own creators in their current peer-reviewed writing. The honest version of the implicit-bias story is smaller and less actionable than the version that DEI vendors and pop-science books have been selling for two decades. It is also closer to what the evidence supports.
What This Means For Strategists Advising on DEI
If you are a consultant, in-house strategist, or executive responsible for diversity and inclusion programs, the IAT picture should change what you recommend.
Stop using individual IAT scores as personal feedback in training. The test’s psychometrics do not support telling an individual employee what their score “means” about their personal bias or their likely treatment of colleagues. The architects of the test have published, in peer-reviewed work, that the test is problematic for this use. A training module that delivers a personal IAT result with the implication that low scorers are unbiased and high scorers need remediation is misusing the instrument.
Stop relying on IAT-score changes as the headline outcome metric for bias trainings. The Forscher 2019 meta-analysis is clear: short-term IAT-score changes do not reliably translate into behavioral changes. A training that “moved IAT scores” but does not document behavioral or outcome changes has not demonstrated the thing the training was supposed to demonstrate.
Direct DEI budget toward interventions with stronger evidence bases. The strongest evidence in the discrimination-reduction literature is for structural interventions, not psychological ones. Examples with empirical support include:
- Structured interviews with predefined questions and scoring rubrics, which reduce the role of intuitive judgment and have decades of industrial-organizational psychology evidence behind them as discrimination-reducing.
- Blind résumé review procedures that strip names, schools, and other potentially identifying signals from initial screening.
- Audit-style internal testing that sends matched dummy applicants or matched inquiries through your hiring or customer-service processes and measures actual differential treatment in real institutional flows.
- Pre-commitment to evaluation criteria before reviewing candidates, which limits post-hoc rationalization of pattern-matching judgments.
- Demographic outcome tracking that monitors which hiring stages, promotion stages, or customer-treatment outcomes show disparate patterns and treats those patterns as the diagnostic, rather than treating individual mental states as the diagnostic.
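As a minimal sketch of what the outcome-tracking item in the list above could look like in practice, the snippet below computes pass rates by group at each hiring stage and the ratio between them. Everything here is hypothetical: the group labels, stage names, record layout, and counts are invented for illustration, and real data would need far larger samples and a proper statistical test before any stage is flagged.

```python
from collections import defaultdict

# Hypothetical records: (group label, hiring stage, passed the stage?)
records = [
    ("A", "screen", True), ("A", "screen", True), ("A", "screen", False), ("A", "screen", True),
    ("B", "screen", True), ("B", "screen", False), ("B", "screen", False), ("B", "screen", True),
    ("A", "interview", True), ("A", "interview", False), ("A", "interview", True),
    ("B", "interview", True), ("B", "interview", True),
]

def pass_rates(rows):
    """Return {(stage, group): share of candidates who passed that stage}."""
    counts = defaultdict(lambda: [0, 0])  # (stage, group) -> [passed, total]
    for group, stage, passed in rows:
        counts[(stage, group)][0] += int(passed)
        counts[(stage, group)][1] += 1
    return {key: passed / total for key, (passed, total) in counts.items()}

rates = pass_rates(records)
for stage in ("screen", "interview"):
    a, b = rates[(stage, "A")], rates[(stage, "B")]
    print(f"{stage}: A={a:.2f}  B={b:.2f}  ratio B/A={b / a:.2f}")
```

The point of the design is that the diagnostic is the stage-level pattern in outcomes, not anyone's test score: a stage where the ratio departs sharply from 1 is where to look at the procedure itself.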
Reframe your bias-training content if you keep doing trainings. Implicit-bias content that frames the problem as “let’s measure and reduce your unconscious bias” is on weak empirical ground. Content that frames the problem as “here are the structural and procedural changes that reduce discriminatory outcomes regardless of individual mental states” is on much stronger empirical ground and is closer to where the evidence is.
Be careful with vendor claims. A bias-training vendor that promises individual IAT diagnostics, that promises behavioral change as a downstream consequence of IAT score change, or that does not engage with the Oswald, Carlsson, Forscher, and Schimmack critiques in its scientific-basis documentation is selling a product whose central premise has been substantively walked back by the test’s own creators. You can ask vendors directly: How do you respond to the Forscher 2019 finding that IAT-change interventions do not reliably change behavior? The quality of the answer tells you a lot.
The bigger frame for strategists is this: in the replication-crisis literature, the IAT is the case where the people closest to the data have been most candid about what their instrument can and cannot do. The popular narrative has not updated. The pricing has not updated. The vendor pitches have not updated. There is a long-running gap between the empirical record and the marketplace, and the strategist’s job, if you advise on this topic at all, is to know which side of that gap you’re standing on when you make recommendations to your clients.
Sources
- Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998). Measuring individual differences in implicit cognition: The Implicit Association Test. Journal of Personality and Social Psychology, 74(6), 1464–1480. DOI: 10.1037/0022-3514.74.6.1464
- Greenwald, A. G., Nosek, B. A., & Banaji, M. R. (2003). Understanding and using the Implicit Association Test: I. An improved scoring algorithm. Journal of Personality and Social Psychology, 85(2), 197–216. DOI: 10.1037/0022-3514.85.2.197
- Greenwald, A. G., Poehlman, T. A., Uhlmann, E. L., & Banaji, M. R. (2009). Understanding and using the Implicit Association Test: III. Meta-analysis of predictive validity. Journal of Personality and Social Psychology, 97(1), 17–41. DOI: 10.1037/a0015575
- Oswald, F. L., Mitchell, G., Blanton, H., Jaccard, J., & Tetlock, P. E. (2013). Predicting ethnic and racial discrimination: A meta-analysis of IAT criterion studies. Journal of Personality and Social Psychology, 105(2), 171–192. DOI: 10.1037/a0032734
- Greenwald, A. G., Banaji, M. R., & Nosek, B. A. (2015). Statistically small effects of the Implicit Association Test can have societally large effects. Journal of Personality and Social Psychology, 108(4), 553–561. DOI: 10.1037/pspa0000016
- Carlsson, R., & Agerström, J. (2016). A closer look at the discrimination outcomes in the IAT literature. Scandinavian Journal of Psychology, 57(4), 278–287. DOI: 10.1111/sjop.12288
- Greenwald, A. G., & Banaji, M. R. (2017). The implicit revolution: Reconceiving the relation between conscious and unconscious. American Psychologist, 72(9), 861–871. DOI: 10.1037/amp0000238
- Forscher, P. S., Lai, C. K., Axt, J. R., Ebersole, C. R., Herman, M., Devine, P. G., & Nosek, B. A. (2019). A meta-analysis of procedures to change implicit measures. Journal of Personality and Social Psychology, 117(3), 522–559. DOI: 10.1037/pspa0000160
- Schimmack, U. (2021). The Implicit Association Test: A method in search of a construct. Perspectives on Psychological Science, 16(2), 396–414. DOI: 10.1177/1745691619863798
- Kurdi, B., Ratliff, K. A., & Cunningham, W. A. (2021). Can the Implicit Association Test serve as a valid measure of automatic cognition? A response to Schimmack (2021). Perspectives on Psychological Science, 16(2), 422–434. DOI: 10.1177/1745691620904080
- Bertrand, M., & Mullainathan, S. (2004). Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination. American Economic Review, 94(4), 991–1013. DOI: 10.1257/0002828042002561
Related
This article is part of the Replication Crisis Hub, a series on famous behavioral science findings that failed replication, were misrepresented, or have been quietly walked back by their creators. If you found this useful, see also:
- The Stereotype Threat Effect: A Decade of Failed Replications --- another high-profile social-psychology finding whose effect size shrank dramatically under replication scrutiny.
- The Ego Depletion Effect: When a Multi-Decade Research Program Failed To Replicate --- the registered replication that prompted the field to reconsider one of its most-cited models.
- The Growth Mindset: How a Decade of Mixed Replication Evidence Met an Industry Worth Billions --- a similar gap between contested evidence base and large commercial deployment.
- The Bargh Elderly Priming Study: When the Original Author Walked Away --- another case study in author response to failed replication.
FAQ
Q: Are you saying implicit bias doesn’t exist? A: No. Implicit attitudes --- automatic, fast evaluations that can dissociate from deliberate reports --- are a well-established finding in cognitive psychology. The narrower claim is that the IAT specifically, as currently constituted, does not reliably predict individual discriminatory behavior, and that interventions which change IAT scores do not reliably change behavior. Implicit cognition as a research area is robust; the IAT as a personal diagnostic tool is not.
Q: Are you saying discrimination doesn’t happen? A: Definitely not. Audit studies that send matched fictitious applications with names varied by perceived race produce replicable evidence of discrimination at the population level in hiring, housing, and lending. The Bertrand & Mullainathan 2004 study is the canonical example and has been replicated many times since. Discrimination is real and measurable; the IAT just is not the right tool for diagnosing it at the individual level.
Q: What’s a defensible use of the IAT? A: Research use in the original spirit of the 1998 paper --- as a paradigm for studying the cognitive structure of automatic associations at the group level. The IAT is reliable enough at the population level to detect predicted between-group differences and to study the formation, persistence, and contextual variation of associative structures in memory. The defensible uses are basic cognitive and social psychology research at the group scale. The indefensible uses are individual diagnosis, hiring screening, and reporting IAT-score change as the headline outcome metric for DEI training.
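A rough way to see why the same instrument can be usable for groups and unusable for individuals is to compare two textbook quantities from classical test theory: the standard error of measurement for one person's score, and the standard error of a group mean. The sketch below assumes standardized scores (SD = 1) and a reliability of 0.55, the lower end of the test-retest range cited earlier in this article. It is a simplification under those assumptions, not a model of the IAT's actual scoring.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory: noise around one person's observed score."""
    return sd * math.sqrt(1.0 - reliability)


def standard_error_of_mean(sd, n):
    """Sampling noise around the mean score of a group of n people."""
    return sd / math.sqrt(n)


SD = 1.0            # assumption: scores expressed in standard-deviation units
RELIABILITY = 0.55  # assumption: lower end of the cited test-retest range

sem = standard_error_of_measurement(SD, RELIABILITY)
print(f"One person: 95% band of about +/- {1.96 * sem:.2f} SD")

for n in (100, 1000, 10000):
    half_width = 1.96 * standard_error_of_mean(SD, n)
    print(f"Group of {n}: 95% band on the mean of about +/- {half_width:.3f} SD")
```

Under these assumptions the band around a single score is wider than one full standard deviation, while the band around a large group's mean is a small fraction of that. Averaging cancels individual measurement noise; it does nothing for the person reading their own result.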
Q: Why does Project Implicit still operate if the IAT has these limitations? A: Project Implicit explicitly states that individual IAT scores should not be used as personal diagnostic tools. The site’s stated mission is research and education about the existence and study of implicit attitudes, and it is a substantial data-generation engine for academic research on associative cognition. The gap is between the site’s own framing (research participation, educational reflection) and how IAT scores are interpreted in popular discourse and corporate-training settings (personal diagnostic, predictor of behavior).
Q: How long have the IAT’s creators been walking back individual-prediction claims? A: The clearest published walk-back is the 2015 Greenwald, Banaji, and Nosek response to the Oswald meta-analysis, which acknowledged the IAT is “problematic to use to classify persons as likely to engage in discrimination.” The 2017 American Psychologist paper by Greenwald and Banaji reframed the IAT’s contribution as a population-level instrument rather than a personal diagnostic. Earlier popular framings, including those in their 2013 book Blindspot, have not been reaffirmed in their subsequent peer-reviewed work in the stronger personal-prediction form.
Q: Should we keep doing implicit-bias trainings? A: If you do them, change what’s measured. Implicit-bias training that uses IAT-score change as the headline outcome is on weak ground; the Forscher 2019 meta-analysis showed those score changes do not reliably translate to behavior. Training that focuses on structural and procedural interventions --- structured interviews, blind résumé review, pre-committed evaluation criteria, audit-style internal testing, demographic outcome tracking --- has substantially stronger evidence behind it. If a vendor cannot articulate how its training affects behavior independent of IAT score change, that is a signal worth taking seriously.
Q: What’s the highest-leverage single change a company can make if it wants to reduce discriminatory outcomes? A: Structured interviewing with predefined questions, predefined scoring criteria, and committee scoring. The industrial-organizational psychology evidence on this is decades old and consistently strong. It does not require any claim about individual mental states; it just constrains the decision process in ways that reduce the role of pattern-matching judgments that produce disparate outcomes. It is also unsexy and not the sort of thing that gets sold as a half-day workshop, which is part of why structural interventions have lost airtime to psychological ones over the past two decades.
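The three ingredients named above (predefined questions, predefined scoring criteria, committee scoring) can be sketched as a small data structure plus a validation step. Everything here is hypothetical and illustrative: the `Question` class, the `committee_score` function, the 0-to-5 scale, and the sample questions are assumptions, not a description of any particular hiring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    key: str
    prompt: str
    max_score: int  # top of the predefined scoring scale for this question


def committee_score(questions, ratings):
    """Average each predefined question across raters.

    ratings maps rater name -> {question key: score}. Raises ValueError if a
    rater skipped a question, scored outside the scale, or scored something
    that is not on the predefined list.
    """
    allowed = {q.key for q in questions}
    for rater, sheet in ratings.items():
        extra = set(sheet) - allowed
        if extra:
            raise ValueError(f"{rater} scored off-list items: {sorted(extra)}")
    per_question = {}
    for q in questions:
        scores = []
        for rater, sheet in ratings.items():
            if q.key not in sheet:
                raise ValueError(f"{rater} did not score '{q.key}'")
            score = sheet[q.key]
            if not 0 <= score <= q.max_score:
                raise ValueError(f"{rater} scored '{q.key}' outside 0-{q.max_score}")
            scores.append(score)
        per_question[q.key] = sum(scores) / len(scores)
    return per_question, sum(per_question.values())


# Hypothetical questions and ratings for one candidate.
questions = [
    Question("conflict", "Describe a disagreement with a colleague.", 5),
    Question("planning", "Walk through how you scoped a recent project.", 5),
]
ratings = {
    "rater_a": {"conflict": 4, "planning": 3},
    "rater_b": {"conflict": 3, "planning": 3},
    "rater_c": {"conflict": 5, "planning": 3},
}
per_question, total = committee_score(questions, ratings)
print(per_question, total)
```

The design choice worth noticing is that the function refuses incomplete or off-list score sheets. The constraint on the process, rather than any inference about a rater's mental state, is what does the work.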
Q: Why haven’t I heard this critique before, despite IAT-based training being widespread? A: For the same reason the learning-styles myth and the 8-second-attention-span myth have persisted: the people producing the public-facing content are not the people producing the empirical critiques, and the commercial infrastructure built around the IAT has no incentive to surface its predictive-validity limitations. The 2013 Oswald meta-analysis and the 2019 Forscher meta-analysis are well known in social psychology research circles. They are not well known in DEI consultancy, corporate training, or popular journalism about implicit bias. The gap between what the science says and what the market sells is durable, and recognizing that gap is part of how strategists earn their fees.