The Stroop Effect: Cognitive Psychology's Most-Replicated Finding (Anti-Example)

Atticus Li


Most psychology findings in this hub did not survive scrutiny. The Stroop effect did — millions of times across labs, populations, and decades. It is the benchmark for what a robust cognitive finding actually looks like, and a calibration tool for evaluating every other "robust" claim in behavioral science.

By Atticus Li | May 25, 2026 | 27 min read

If you have read through this hub, you have watched canonical psychology findings dissolve under inspection. Power posing collapsed under its co-author’s own recantation. Ego depletion failed preregistered replication. Bargh’s elderly-priming effect evaporated, the marshmallow test shrank to little more than a proxy for socioeconomic status, the bystander effect’s founding case study turned out to be largely journalistic invention, and the entire family of social-priming results, one after another, failed to survive scrutiny they should have faced years earlier.

A reader by now might be entitled to conclude that all of psychology is suspect. That conclusion would be wrong, and the cleanest way to demonstrate why is to present the opposite case: a finding so robust, so well-replicated, so mechanistically grounded, and so operationally useful that it serves as a benchmark against which every other “robust” claim should be compared.

That finding is the Stroop effect. It is roughly ninety years old. It has been replicated approximately a million times, across populations from children to centenarians, across more than seventy languages, and across healthy, brain-damaged, ADHD, depressed, bilingual, and elderly subjects, including those with dementia. Its meta-analytic effect size is not a marginal d ≈ 0.2 but a reaction-time difference so reliable that it can be detected in single-subject, single-session experiments without complicated statistics. Its neural mechanism has been localized with converging evidence from fMRI, EEG, single-unit recording, and lesion studies. It grounds practical applications in clinical diagnosis, bilingualism research, aging research, and brain-damage assessment. And the basic paradigm has been extended in dozens of variants, each of which has independently replicated.

This is the anti-example. Read it alongside the takedown articles in this hub to calibrate. Most behavioral findings do not look like the Stroop effect. The ones that do look like the Stroop effect are the ones worth betting on.

Stroop 1935 --- The Original Paper

The Stroop effect did not begin with John Ridley Stroop in 1935. The basic observation --- that incongruent stimuli interfere with naming speed --- was documented by Cattell in the 1880s and by several German researchers in the early 1900s. What Stroop did was operationalize the paradigm in a form so clean, so reproducible, and so easy to administer that it became the standard test of selective attention for the next ninety years.

Stroop, J. R. (1935). “Studies of Interference in Serial Verbal Reactions.” Journal of Experimental Psychology, 18(6), 643-662. DOI: 10.1037/h0054651.

Stroop ran three experiments. The first asked subjects to read color words (RED, BLUE, GREEN, etc.) printed in incongruent colors --- the word RED printed in green ink, the word BLUE printed in yellow ink. Reading the words aloud, ignoring the color, was effectively no slower than reading the same words printed in black ink. Reading is too automatic for ink color to interfere with it.

The second experiment was the one that mattered. Stroop asked subjects to do the opposite: name the color of the ink, ignoring the word. His baseline was a card of solid color patches with no words on it; the congruent condition familiar from modern versions (the word RED printed in red ink) was a later addition to the paradigm. Naming the patches was fast. When the ink color and the word were incongruent (the word RED printed in blue ink), naming was dramatically slower: working through the incongruent card took approximately 74% longer than working through the patch baseline. Subjects could not help reading the word, even when reading the word actively hurt their performance on the task they had been instructed to do.

The third experiment examined practice. After several days of practice at naming the ink colors of incongruent words, the interference shrank but did not disappear. Taken together, the experiments established an asymmetry: words interfered with color-naming, but colors barely interfered with word-reading. Stroop interpreted this as evidence that reading was a more automatized process than color-naming: the more practiced response intruded on the less practiced response, but not the other way around.

The paper grew out of Stroop’s doctoral dissertation at George Peabody College for Teachers and was published in the Journal of Experimental Psychology in 1935, when he was 38 years old. He published little else on the topic and spent the rest of his career as a minister and a professor of religious studies. The finding outlived its discoverer by an enormous margin: Stroop died in 1973, yet the paper has accumulated approximately 16,000 citations as of 2026, placing it among the most-cited papers ever published in psychology, and the citation rate is still climbing.

What makes the original 1935 paper an anti-example, in this hub’s terms, is that everything that subsequent replications confirmed was already visible in Stroop’s original data. He reported the basic phenomenon, the asymmetry, the magnitude, and the qualitative shape of the response-time distribution. He did not p-hack a small effect into significance. He did not run undisclosed comparisons. He did not select among multiple dependent variables. The effect was large enough that the methodology problems that broke later psychology research were not relevant: you can run a Stroop experiment with twenty subjects and detect the effect at p < .001 every time. You can run it with five subjects. You can run it with one subject, on yourself, right now, on a piece of paper, and you will observe the effect.
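The single-subject claim is easy to check without any statistical machinery beyond shuffling. The sketch below is one way to do it, not a standard analysis from the Stroop literature: a one-sided permutation test on a single person's trial times. The function name `permutation_p_value` and the response times are invented for illustration; substitute your own measurements.

```python
import random
import statistics

def permutation_p_value(baseline_ms, incongruent_ms, n_permutations=10_000, seed=0):
    """One-sided permutation test: is the incongruent mean larger than baseline?"""
    rng = random.Random(seed)
    observed = statistics.mean(incongruent_ms) - statistics.mean(baseline_ms)
    pooled = list(baseline_ms) + list(incongruent_ms)
    n_incongruent = len(incongruent_ms)
    at_least_as_large = 0
    for _ in range(n_permutations):
        # Shuffle the condition labels and recompute the difference in means.
        rng.shuffle(pooled)
        diff = (statistics.mean(pooled[:n_incongruent])
                - statistics.mean(pooled[n_incongruent:]))
        if diff >= observed:
            at_least_as_large += 1
    # The add-one correction keeps the estimate away from an impossible p = 0.
    return observed, (at_least_as_large + 1) / (n_permutations + 1)

# Hypothetical single-session data in milliseconds (invented, not from any study).
baseline = [612, 655, 590, 701, 640, 623, 668, 597, 645, 630, 610, 659]
incongruent = [742, 810, 765, 798, 730, 825, 779, 760, 801, 748, 790, 772]

effect_ms, p_value = permutation_p_value(baseline, incongruent)
print(f"interference = {effect_ms:.0f} ms, one-sided p = {p_value:.4f}")
```

With only a dozen trials per condition, data shaped like this leave almost no shuffles that match the observed difference, which is the practical meaning of "detectable in one subject."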

The contrast with the canonical replication failures in this hub is stark. The original Bem precognition paper (2011) reported effect sizes of d ≈ 0.2 that vanished in preregistered replication. The original power-posing paper (2010) reported hormonal effects that other labs could not detect. The original ego-depletion paper (1998) reported effects that disappeared under Hagger’s 2016 multi-lab attempt. The original elderly-priming paper (1996) failed to replicate in Doyen’s 2012 attempt. Each of these reported effects that were too small relative to the noise of psychology research to be reliably detected at all.

Stroop reported an effect that is enormous, that any first-year psychology student can demonstrate to themselves on the back of a napkin, and that has never failed to replicate.

MacLeod 1991 --- Half a Century of Research, Some 400 Studies

By the early 1990s, the Stroop effect had accumulated so much literature that the field needed an integrative review. The canonical synthesis is MacLeod, C. M. (1991). “Half a Century of Research on the Stroop Effect: An Integrative Review.” Psychological Bulletin, 109(2), 163-203. DOI: 10.1037/0033-2909.109.2.163.

The MacLeod review covers some 400 published studies of the Stroop effect spanning 1935 to 1990. The paper is approximately 40 pages long. It is one of the most-cited review articles in cognitive psychology, with roughly 8,000 citations of its own: even a review of the Stroop literature has been cited more often than most landmark psychology papers.

What MacLeod documented, exhaustively, is that the Stroop effect replicates under essentially every condition cognitive psychology has ever tested. It replicates with manual responses (button-press) as well as vocal responses. It replicates with brief stimuli, long stimuli, and self-paced stimuli. It replicates with massed and distributed presentation. It replicates with adults, children old enough to read, elderly adults, and brain-damaged patients across a wide range of pathology. It replicates in every language that has been tested, including languages that use logographic scripts (Chinese, Japanese kanji), languages with different writing directions (Hebrew, Arabic), and languages with morphologically complex word formation (Finnish, Turkish). It replicates in congenitally deaf subjects when sign-language stimuli are substituted for written words. It replicates in alphabet variants and dialect variants and orthographies invented specifically for experimental purposes.

The magnitude is large and stable. Across the studies MacLeod reviewed, the typical Stroop interference effect was in the range of 100 to 200 milliseconds on a baseline response time of approximately 700 milliseconds. The effect is detectable in single subjects in single sessions with sample sizes that are absurdly small by the standards of modern social-psychology power analysis. A standard Stroop experiment with 24 trials per condition has statistical power approaching 1.0 for the basic interference effect with as few as 10 subjects.
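A power claim like this can be sanity-checked by simulation. The sketch below is a minimal Monte Carlo estimate, not a reanalysis of any dataset. Every numeric parameter is an assumption chosen to sit inside the ranges quoted above: a 100 ms mean interference effect on a 700 ms baseline, with an assumed 50 ms spread in the effect between subjects and an assumed 150 ms trial-to-trial noise. The critical t values are the standard two-sided .05 cutoffs for 4, 9, and 19 degrees of freedom.

```python
import math
import random
import statistics

# Two-sided critical t values at alpha = .05, keyed by number of subjects (df = n - 1).
T_CRIT_05 = {5: 2.776, 10: 2.262, 20: 2.093}

def simulate_power(n_subjects=10, n_trials=24, mean_effect_ms=100.0,
                   effect_sd_ms=50.0, trial_sd_ms=150.0,
                   n_experiments=500, seed=1):
    """Share of simulated experiments in which a paired t-test is significant."""
    if n_subjects not in T_CRIT_05:
        raise ValueError(f"no critical value stored for n_subjects={n_subjects}")
    rng = random.Random(seed)
    t_crit = T_CRIT_05[n_subjects]
    significant = 0
    for _ in range(n_experiments):
        diffs = []
        for _ in range(n_subjects):
            # Each subject has their own true effect; individual baseline speed
            # cancels in the paired difference, so a common 700 ms is used.
            true_effect = rng.gauss(mean_effect_ms, effect_sd_ms)
            base = [rng.gauss(700.0, trial_sd_ms) for _ in range(n_trials)]
            incong = [rng.gauss(700.0 + true_effect, trial_sd_ms)
                      for _ in range(n_trials)]
            diffs.append(statistics.fmean(incong) - statistics.fmean(base))
        t_stat = statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n_subjects))
        if abs(t_stat) > t_crit:
            significant += 1
    return significant / n_experiments

for n in (5, 10, 20):
    print(f"{n:>2} subjects: estimated power = {simulate_power(n_subjects=n):.2f}")
```

Under these assumed noise levels the estimate for ten subjects comes out close to 1, in line with the claim above. Noisier assumptions would lower it, so treat the output as an illustration of the logic, not as a published figure.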

MacLeod also catalogued the variables that modulate the effect, which is where the literature gets theoretically interesting. The Stroop interference increases when the response is vocal (saying the color) rather than manual (pressing a button for the color), because the vocal response shares more of its representation with the reading response. The interference decreases with practice, but does not vanish even with very extended practice --- thousands of trials produce reductions of perhaps 30 to 40 percent, not elimination. The interference is reduced (but not eliminated) when the word and the color are spatially separated, or when the color is presented as a colored bar adjacent to a neutral word. The interference is reduced in subjects with reading disabilities, in subjects from cultures with low literacy, and in young children who have not yet fully automatized reading --- which is exactly what the original Stroop framework would predict, because the interference depends on reading being more automatic than color-naming.

The cumulative force of the MacLeod review is overwhelming. The Stroop effect is not a phenomenon that might or might not be real. It is one of the most heavily characterized, parametrically explored, and theoretically integrated findings in experimental psychology. There is no recent meta-analysis showing that publication bias inflated the effect. There is no preregistered multi-lab replication that produced a null result. There is no methodological critique of the original paradigm that cognitive psychologists treat as a serious challenge to the existence of the phenomenon.

There is, in other words, none of the apparatus of doubt that surrounds the takedown findings in this hub. The Stroop effect is what cognitive psychology looks like when it works.

The Neural Mechanism --- ACC and Cognitive Control

The contrast with replication-failure findings sharpens further when you compare neural-mechanism claims. The mechanism story for power posing was that “expansive postures change testosterone and cortisol levels”; that claim did not survive Ranehill 2015 or Carney 2016. The mechanism story for ego depletion was that “willpower depletes a finite cognitive resource”; that claim did not survive Hagger 2016 or Vohs 2021. The mechanism story for elderly priming was that “subliminal exposure to age-related concepts automatically activates motor schemata”; that claim did not survive Doyen 2012. In each case, the mechanism story turned out to be untethered to the underlying biology, partly because the underlying behavioral effect was not real.

The mechanism story for the Stroop effect is the opposite case. MacLeod, C. M., & MacDonald, P. A. (2000). “Interdimensional Interference in the Stroop Effect: Uncovering the Cognitive and Neural Anatomy of Attention.” Trends in Cognitive Sciences, 4(10), 383-391. DOI: 10.1016/S1364-6613(00)01530-8.

By 2000, the field had converged on a model in which the Stroop interference reflects a competition between two pathways --- a fast, automatic reading pathway, and a slower, controlled color-naming pathway --- with a top-down attentional control system that arbitrates between them when the two pathways produce conflicting outputs. The arbitration is effortful, takes measurable additional time, and recruits a specific brain region: the anterior cingulate cortex (ACC), particularly the dorsal ACC.
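The two-pathway account can be made concrete with a toy calculation. The sketch below is deliberately simplistic and is not one of the computational models from the literature. It assumes each pathway pushes evidence toward a response at a fixed rate, that attentional control lets only a fraction (`leak`) of the ignored pathway through, and that response time is the time needed to reach a threshold. All names and parameter values are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToyStroopModel:
    word_strength: float = 1.0      # assumed: reading is highly automatic
    color_strength: float = 0.4     # assumed: color naming is less practiced
    leak: float = 0.15              # assumed share of the ignored pathway that escapes control
    threshold: float = 100.0        # evidence needed to respond (arbitrary units)
    non_decision_ms: float = 300.0  # assumed perception and motor time

    def rt_ms(self, task: str, condition: str) -> float:
        if task not in ("color", "word"):
            raise ValueError("task must be 'color' or 'word'")
        if task == "color":
            relevant, ignored = self.color_strength, self.word_strength
        else:
            relevant, ignored = self.word_strength, self.color_strength
        # The ignored pathway helps on congruent trials and opposes on incongruent ones.
        sign = {"congruent": 1.0, "neutral": 0.0, "incongruent": -1.0}[condition]
        drift = relevant + sign * self.leak * ignored
        if drift <= 0:
            raise ValueError("ignored pathway overwhelms the task-relevant pathway")
        return self.non_decision_ms + self.threshold / drift

    def interference_ms(self, task: str) -> float:
        return self.rt_ms(task, "incongruent") - self.rt_ms(task, "neutral")

model = ToyStroopModel()
print(f"color naming interference: {model.interference_ms('color'):.0f} ms")
print(f"word reading interference: {model.interference_ms('word'):.0f} ms")
```

Even this crude arithmetic reproduces the asymmetry: a strong ignored pathway slows a weak relevant one by a wide margin, while a weak ignored pathway barely dents a strong one. Raising `leak`, which stands in for weaker control, increases the interference on color naming.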

The fMRI evidence for ACC involvement in Stroop interference is among the most consistent findings in cognitive neuroscience. Across dozens of independent fMRI studies, incongruent Stroop trials produce reliably greater dorsal ACC activation than congruent trials, with effect sizes that are robust enough to detect in single subjects. The pattern is not a fluke of one task or one analysis pipeline --- it shows up with vocal responses and manual responses, with classic word-color Stroop and with numerical Stroop and spatial Stroop variants, with adults and with children, with healthy subjects and with neurological patients. ACC activation on incongruent trials is one of the textbook examples of a reliable, replicable, mechanism-grounded neural signature.

The lesion evidence converges. Patients with damage to the dorsal ACC and surrounding medial prefrontal cortex show selectively elevated Stroop interference, with the magnitude of impairment proportional to the extent of damage in the conflict-monitoring region. Patients with damage to other regions --- visual cortex, parietal cortex, lateral prefrontal cortex --- do not show the same selective Stroop deficit. The mechanism prediction is specific, and the lesion data confirm the specificity.

The EEG and intracranial recording evidence also converges. The Stroop conflict signal in EEG is the N450 / sustained-potential complex, with source localization pointing to the ACC. Single-unit recordings from ACC neurons in non-human primates performing conflict tasks show response patterns consistent with the conflict-monitoring role.

The mechanism story for the Stroop effect, in other words, is not “we have a vague verbal account of what might be happening.” It is “we have a specific, anatomically localized, multi-method-convergent account that has been validated against multiple independent evidence types.” This is what cognitive neuroscience looks like when it works. And it works for the Stroop effect because the underlying phenomenon is so robust that there is something real for the neural-mechanism investigation to target.

When the behavioral effect is real, the neural mechanism story can be developed rigorously. When the behavioral effect is not real, the neural mechanism story becomes a search for a brain signature of a non-phenomenon, which is one reason that “neurobehavioral” follow-ups to failed social-psychology findings tend to produce inconsistent results --- they are trying to localize something that is not reliably there.

Applied Uses --- Diagnosis, Executive Function, Aging, Bilingualism

The Stroop effect is not a curiosity confined to undergraduate psychology labs. It is one of the most widely deployed cognitive assessments in clinical practice and research.

Clinical neuropsychology. The Stroop Color-Word Test is one of the standard instruments in the neuropsychological assessment battery. It is used to assess executive dysfunction in patients with traumatic brain injury, stroke, dementia, and a variety of neurological and psychiatric conditions. The interference score (the difference between incongruent-condition response time and a baseline) is sensitive to dysfunction in the frontal and anterior cingulate systems, which is exactly the network that has been implicated by the fMRI and lesion literature. Clinicians use Stroop scores to track disease progression, to differentiate dementia subtypes, and to evaluate treatment response. The test has psychometric validation in dozens of populations and is available in standardized forms for adults and children.

ADHD diagnosis and assessment. Children and adults with ADHD reliably show elevated Stroop interference, consistent with the broader theory that ADHD involves dysfunction in the executive-control system that arbitrates between automatic and controlled responses. Stroop performance is part of the broader executive-function assessment used in ADHD evaluation, alongside continuous-performance tests, working-memory tasks, and parent/teacher behavioral reports. The Stroop interference signal in ADHD is reliable enough that it has been replicated in dozens of studies across age ranges and ADHD subtypes.

Aging research. Stroop interference systematically increases with age across the adult lifespan, with the increase accelerating in late adulthood. The age-related increase in interference is one of the most reliable markers of cognitive aging and has been used as an outcome variable in studies of cognitive training, exercise interventions, and pharmacological treatments for age-related cognitive decline. The effect is large enough to detect in moderate-sized cross-sectional samples and to track in longitudinal designs.

Bilingualism research. Bilingual subjects perform Stroop tasks in both of their languages and exhibit characteristic patterns that have been used to investigate the architecture of bilingual language control. Cross-language Stroop --- where the word is in one language and the response must be in the other --- has been used to study the inhibitory mechanisms that bilinguals use to manage their two languages. The bilingual-Stroop literature is large, well-developed, and one of the cleaner application domains in the broader study of bilingual cognition.

Brain damage assessment. Stroop performance is a sensitive and standardized measure of frontal-system integrity. It is used to localize and quantify the cognitive consequences of stroke, tumor resection, and traumatic injury. The interpretive framework --- that elevated interference suggests dysfunction in the dorsal ACC and adjacent prefrontal regions --- maps directly onto the neuroscience literature and provides a clinically actionable inference.

These applied uses share a feature: each of them depends on the Stroop effect being reliably present in healthy controls so that the comparison to patient or atypical populations is interpretable. If the Stroop effect were as fragile as the replication-failure findings in this hub, none of these applied uses would work. You cannot diagnose ADHD with a behavioral marker that does not replicate. You cannot track dementia progression with a test that does not produce a stable signal. You cannot study bilingual cognition with a paradigm whose effect size varies by laboratory. The applied uses are themselves evidence of robustness --- they have survived the practical scrutiny of clinicians and researchers who need reliable instruments and would have abandoned the Stroop test decades ago if it had not delivered.

Stroop Variants --- Emotional Stroop and Addiction Stroop

The robustness of the basic Stroop paradigm has supported a substantial family of variants in which the structure is preserved but the content is changed to investigate other cognitive processes.

Emotional Stroop. The emotional Stroop task uses emotionally valenced words (threat words, depression-related words, trauma-related words) instead of color words, and asks subjects to name the ink color of these words. The prediction is that emotional content will capture attention in a way that interferes with color-naming, with the magnitude of interference reflecting the emotional salience of the words for the particular subject. The canonical review is Williams, J. M. G., Mathews, A., & MacLeod, C. (1996). “The Emotional Stroop Task and Psychopathology.” Psychological Bulletin, 120(1), 3-24. DOI: 10.1037/0033-2909.120.1.3.

The Williams, Mathews, and MacLeod review documents that anxious subjects show selective interference for threat-related words, depressed subjects show selective interference for depression-related words, and trauma-exposed subjects show selective interference for trauma-related words. The effect has been used as a research tool in anxiety disorders, depression, PTSD, eating disorders, addiction, and a wide range of other psychopathologies. It is not as cleanly mechanistic as the classic Stroop effect --- the interpretive story involves more degrees of freedom about what counts as a “salient” word for a given subject --- but the basic phenomenon has replicated across many populations and represents one of the productive extensions of the original paradigm.

Addiction Stroop. The addiction-Stroop variant uses substance-related words (alcohol words for alcohol-dependent subjects, drug words for drug-dependent subjects) and measures interference as an index of attentional bias to the substance. Addicted subjects show reliably elevated interference for substance-related words compared to neutral words, and the magnitude of the bias has been used as a predictor of treatment outcome --- subjects with larger attentional biases are more likely to relapse. The addiction-Stroop literature is large, the basic phenomenon has replicated repeatedly, and the predictive validity for treatment outcome has been demonstrated in multiple longitudinal designs.

Numerical Stroop, spatial Stroop, animacy Stroop, and others. Researchers have constructed Stroop variants for essentially every dimension of cognitive processing that one might want to investigate. Numerical Stroop varies the physical size of digits versus their numerical magnitude (the digit “9” printed small versus large). Spatial Stroop presents arrows pointing in directions that conflict with their screen position. Each of these variants has independently replicated and contributed to the broader literature on selective attention and cognitive control.

The variants matter because they demonstrate that the underlying paradigm is robust enough to extend. Findings that survive only in the exact form of the original demonstration are usually fragile. Findings that can be modified, extended, and built upon while preserving the basic phenomenon are usually real. The Stroop paradigm has been modified, extended, and built upon for ninety years, and the modifications keep working.

What This Anti-Example Tells Us

Compare the Stroop effect’s profile against the profiles of the canonical replication failures in this hub, and a pattern emerges.

Robust cognitive findings produce large effects. The Stroop interference effect is roughly 100 to 200 milliseconds out of a 700-millisecond baseline --- a Cohen’s d well above 1.0 in most laboratory settings. Compare this to the d ≈ 0.2 effects that dominated the replication-failure literature.
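To see how a millisecond difference becomes a standardized effect size, here is a minimal sketch for a within-subject design. It computes the paired version of Cohen's d (often written d_z): the mean interference score divided by the standard deviation of the interference scores across subjects. This is one of several conventions for d and is not directly comparable to a between-groups d. The per-subject response times are invented for illustration.

```python
import statistics

def paired_effect_size(baseline_ms, incongruent_ms):
    """Return (mean interference in ms, d_z) from per-subject mean response times."""
    if len(baseline_ms) != len(incongruent_ms):
        raise ValueError("need one baseline and one incongruent mean per subject")
    # Interference score: incongruent RT minus baseline RT, per subject.
    diffs = [inc - base for base, inc in zip(baseline_ms, incongruent_ms)]
    mean_diff = statistics.fmean(diffs)
    return mean_diff, mean_diff / statistics.stdev(diffs)

# Hypothetical per-subject means in milliseconds (invented, not from any study).
baseline = [690, 720, 655, 740, 705, 680, 760, 700]
incongruent = [820, 830, 790, 905, 800, 835, 870, 860]

interference, d_z = paired_effect_size(baseline, incongruent)
print(f"mean interference = {interference:.1f} ms, d_z = {d_z:.2f}")
```

The standardized effect is large here because every subject slows down by a broadly similar amount, so the spread of the difference scores is small compared with their mean.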

Robust cognitive findings replicate trivially. You can demonstrate the Stroop effect on yourself with a sheet of paper and a few colored markers. You cannot demonstrate power posing or elderly priming on yourself with any setup, because the original effects were not real in the first place.

Robust cognitive findings have specific neural mechanisms that converge across methods. The ACC story for Stroop interference is supported by fMRI, EEG, single-unit recording, lesion data, and pharmacological manipulation, all pointing to the same anatomical region with the same functional interpretation. The neural-mechanism stories for replication-failure findings tend to be either absent, contradictory across methods, or so vague as to be unfalsifiable.

Robust cognitive findings support applied uses that survive practical scrutiny. The Stroop test is used in clinical neuropsychology, ADHD assessment, aging research, and bilingualism research because it works in the field. Clinicians would have abandoned an unreliable instrument decades ago. Compare this to the applied claims that grew up around power posing (TED-talk advice for interviews) or grit (interventions in K-12 schools) --- those applications were built on findings that did not survive replication, and the applied edifice has accordingly crumbled.

Robust cognitive findings support extensions and variants. The basic Stroop paradigm has spawned emotional Stroop, addiction Stroop, numerical Stroop, spatial Stroop, and dozens of other variants, each of which has independently replicated. Findings that cannot be extended or modified without losing the effect are usually fragile.

Robust cognitive findings have stable effect sizes across decades. The Stroop interference effect documented by Stroop in 1935 is approximately the same magnitude as the effect documented by MacLeod in 1991 and the effect documented in fMRI studies in 2020. Compare this to the steady decline in effect sizes for many social-psychology findings as samples got larger and methodology got tighter.

The decision-useful conclusion is that the diagnostic features of a robust cognitive finding are knowable in advance. You do not have to wait for a replication crisis to identify which findings will survive. You can look at the original effect size, the ease of self-replication, the mechanism specificity, the applied-use trajectory, the extension/variant history, and the stability of effect sizes across decades. Findings that score well on these dimensions tend to survive. Findings that score poorly tend not to.

The replication crisis in psychology was, in some part, a failure to apply this diagnostic to the canonical findings before treating them as established science.

Strategist Implications --- What to Take Away

For an executive evaluating behavioral-science claims, the Stroop effect is a calibration anchor. When a vendor or consultant cites “research shows” for some behavioral intervention, the right comparison is to ask: how does this finding look against the Stroop benchmark?

Specifically:

How large is the original effect size? If the original effect was d ≈ 0.2 (which is typical for failed social-psychology replications), demand much higher evidentiary standards than you would for a finding with a d well above 1.0. Small effects are not necessarily fake, but they are much more likely to be artifacts of methodological flexibility than large effects are.

Can a non-expert demonstrate the effect to themselves? The Stroop effect passes this test trivially. Most behavioral-economics nudges pass it weakly (you can sometimes see defaults work in your own life if you pay attention). Most social-priming and embodied-cognition claims fail this test entirely. The ability of an outside observer to verify the basic phenomenon is a reasonable filter.

Does the mechanism story converge across independent methods? The ACC story for Stroop interference converges across fMRI, EEG, lesion data, and animal recording. If the mechanism story for a behavioral claim is supported by one method or one type of evidence --- a single fMRI study, a single hormonal-correlate finding --- treat it as preliminary regardless of how confidently it is presented.

Are the applied uses surviving practical scrutiny? The Stroop test has been used in clinical neuropsychology for decades and remains in active use because it works. If a behavioral finding has applied claims that have been abandoned by practitioners, or that have not been adopted by practitioners despite years of opportunity, that is informative.

Has the basic paradigm supported successful extensions and variants? The Stroop family of paradigms (classic, emotional, addiction, numerical, spatial) all replicate. A finding that cannot be modified or extended without losing the effect is more fragile than a finding that supports a family of derivative paradigms.

Have effect sizes been stable across decades, or have they shrunk? Stroop interference today is approximately the same magnitude as Stroop interference in 1935. Most failed social-psychology findings show steadily declining effect sizes as samples get larger and methodology improves, which is one of the strongest indicators of original-finding fragility.

These six diagnostic questions are not a guarantee. They are a prior. Apply them to any behavioral claim before treating it as established science. The findings that pass --- the defaults, the Stroop effect, the basic perception and learning paradigms, the prospect-theory framing effects --- are worth betting on. The findings that fail --- and most of the social-psychology canon falls in this category --- are not.
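For readers who want to keep a running tally, the six questions above can be written down as a simple checklist. The sketch below is a hypothetical helper, not a validated scoring instrument. The field names are shorthand for the questions, each answer is a plain yes or no, and the `vendor_claim` answers are invented to show the output.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class RobustnessChecklist:
    """One yes/no answer per diagnostic question, in the order asked above."""
    large_original_effect: bool
    self_demonstrable: bool
    mechanism_converges: bool
    applied_uses_survive: bool
    supports_variants: bool
    stable_effect_sizes: bool

    def passed(self) -> int:
        """Number of questions answered yes."""
        return sum(bool(getattr(self, f.name)) for f in fields(self))

    def open_questions(self) -> list:
        """Names of the questions answered no, in checklist order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

# The Stroop effect, answered as this article describes it.
stroop = RobustnessChecklist(True, True, True, True, True, True)

# A hypothetical vendor claim with invented answers.
vendor_claim = RobustnessChecklist(
    large_original_effect=False,
    self_demonstrable=False,
    mechanism_converges=False,
    applied_uses_survive=True,
    supports_variants=False,
    stable_effect_sizes=True,
)

for name, item in (("Stroop", stroop), ("vendor claim", vendor_claim)):
    print(f"{name}: {item.passed()}/6 passed, open: {item.open_questions()}")
```

The tally is a prior, as the paragraph above says: a low count is a reason to ask for stronger evidence, not proof that a claim is false.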

The most useful thing this hub can do is to give you both anchors. The takedown articles show you what failed and why. The anti-examples show you what worked and why. The diagnostic questions above let you calibrate any new claim against both.

The Stroop effect is the brightest calibration point in the cognitive-psychology firmament. Most findings will not look like it. The ones that do are the ones that will still be true in fifty years.

Sources

Stroop, J. R. (1935). Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 18(6), 643-662. DOI: 10.1037/h0054651.

MacLeod, C. M. (1991). Half a century of research on the Stroop effect: An integrative review. Psychological Bulletin, 109(2), 163-203. DOI: 10.1037/0033-2909.109.2.163.

MacLeod, C. M., & MacDonald, P. A. (2000). Interdimensional interference in the Stroop effect: Uncovering the cognitive and neural anatomy of attention. Trends in Cognitive Sciences, 4(10), 383-391. DOI: 10.1016/S1364-6613(00)01530-8.

Williams, J. M. G., Mathews, A., & MacLeod, C. (1996). The emotional Stroop task and psychopathology. Psychological Bulletin, 120(1), 3-24. DOI: 10.1037/0033-2909.120.1.3.

Related Articles

The Default Effect: The Behavioral-Economics Finding That Actually Holds Up --- The other anti-example in this hub. Default-architecture nudges replicate across decades, domains, and cultures.
Confirmation Bias --- A robust finding with a different profile: well-replicated as a phenomenon but contested at the mechanism level.
Hindsight Bias --- Another well-replicated cognitive bias with stable effect sizes across decades.
Availability Heuristic --- A foundational Tversky-and-Kahneman finding that has held up far better than most heuristics-and-biases applied claims.
Halo Effect --- A century-old social-perception finding with a profile somewhere between the Stroop anchor and the replication-failure cluster.

FAQ

Why is the Stroop effect considered the most-replicated finding in cognitive psychology?

Because it has been replicated approximately a million times across nearly every cognitive-psychology context you can construct: every age group from young children to centenarians, more than seventy languages, every common writing system, healthy controls and brain-damaged patients across a wide range of pathology, manual and vocal responses, brief and long stimulus durations, and dozens of variants that preserve the basic structure while varying the content. The 1991 MacLeod review covered some 400 published studies, and the literature has grown substantially since then.

Can I demonstrate the Stroop effect on myself?

Yes. Take a piece of paper. Write the word RED in green ink, BLUE in red ink, GREEN in blue ink, YELLOW in red ink, and so on for about thirty items. Now write a control list of the same color words printed in matching colors. Time yourself naming the ink colors --- not reading the words --- on both lists. The incongruent list will take noticeably longer, by something like 30 to 70 percent. This is one of the most accessible demonstrations of a major cognitive-psychology finding, and you do not need any equipment or special expertise to do it.

What is the anterior cingulate cortex doing during the Stroop task?

The dorsal ACC is the brain’s conflict-monitoring system. When two response pathways produce conflicting outputs (here, the automatic reading pathway saying “RED” and the controlled color-naming pathway saying “blue”), the ACC detects the conflict and signals the prefrontal cortex to up-regulate cognitive control. The fMRI signal in the ACC during incongruent Stroop trials reflects this conflict-detection function, and the magnitude of activation correlates with the magnitude of interference. The mechanism has been validated across fMRI, EEG, lesion, and single-unit recording studies, making it one of the best-characterized neural mechanisms in cognitive neuroscience.
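The monitor-then-adjust loop described above can be illustrated with a toy calculation. This is a deliberately simplified sketch, not a fitted model from the literature: the activation values, the product-based conflict measure, and the `gain` parameter are assumptions chosen purely for illustration.

```python
def conflict(activations):
    """Toy conflict signal: sum of pairwise products of competing response activations."""
    values = list(activations.values())
    total = 0.0
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            total += values[i] * values[j]
    return total


def adjust_control(control, conflict_signal, gain=0.5):
    """Raise the control level in proportion to detected conflict, capped at 1.0."""
    return min(1.0, control + gain * conflict_signal)


# Assumed activations, for illustration only.
incongruent = {"say_red": 0.8, "say_blue": 0.6}  # word RED in blue ink: both responses active
congruent = {"say_red": 0.0, "say_blue": 0.9}    # word BLUE in blue ink: one response active

for label, trial in (("incongruent", incongruent), ("congruent", congruent)):
    signal = conflict(trial)
    print(label, round(signal, 2), round(adjust_control(0.2, signal), 2))
```

In this sketch the conflict signal is zero when only one response is active and grows when two responses are active together, which is the qualitative pattern the conflict-monitoring account describes.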

Is the Stroop effect used to diagnose ADHD?

It is one component of the broader executive-function assessment used in ADHD evaluation, but it is not used as a standalone diagnostic test. Children and adults with ADHD reliably show elevated Stroop interference, but elevated interference can also reflect other forms of executive dysfunction, so the Stroop test is interpreted in the context of other measures, behavioral observation, and clinical history. Within that broader assessment, Stroop performance is a useful and validated indicator.

Why does the Stroop effect get used in bilingualism research?

Because bilinguals have to manage two competing language systems, and the inhibitory mechanisms they use to suppress the non-target language are conceptually similar to those involved in suppressing the automatic reading response in the Stroop task. Cross-language Stroop, where the word is in one language and the required response is in the other, has been used to investigate how bilinguals control language access. The bilingual-Stroop literature has produced both the “bilingual advantage” hypothesis (that bilinguals show enhanced executive control because of their constant practice with inhibition) and the critiques of that hypothesis. The underlying paradigm is robust enough to support that substantive debate without itself being in question.

How does the Stroop effect compare to the marshmallow test or power posing?

It is in a different reliability tier entirely. The marshmallow test’s original effect on later life outcomes shrank by approximately 75 percent in larger preregistered replications, and the residual effect appears to be largely confounded with childhood socioeconomic status. Power posing’s hormonal claims did not survive Carney’s own 2016 recantation. The Stroop effect, by contrast, has held its effect size approximately constant across ninety years of replication, has converging neural-mechanism evidence, and supports applied uses that have survived decades of practical scrutiny. The Stroop effect is the benchmark. The marshmallow test and power posing are cautionary examples of what happens when a finding gets oversold.

Is there any serious challenge to the existence of the Stroop effect?

No. There are productive theoretical debates about the relative contribution of different mechanisms (response competition versus semantic interference, the role of practice effects, the architecture of attentional control), and there are debates about specific applied questions (whether the bilingual advantage on Stroop tasks is real, whether the addiction-Stroop attentional bias predicts treatment outcome with the magnitude originally claimed). But the basic phenomenon --- that naming the color of incongruent color words takes longer than naming the color of congruent or neutral words --- is not in serious dispute and has not been since Stroop’s 1935 paper. The Stroop effect is one of the few major psychology findings that you can confidently expect to still be in the textbook in 2075.

What is the practical takeaway for someone evaluating behavioral-science claims?

Use the Stroop effect as a calibration anchor. When a vendor or consultant cites “research shows” for some behavioral claim, ask: how does the original effect size compare to Stroop’s? Can a non-expert demonstrate the effect to themselves? Does the mechanism story converge across independent methods? Have applied uses survived practical scrutiny? Has the basic paradigm supported successful extensions and variants? Have effect sizes been stable across decades, or have they shrunk? The findings that look like the Stroop effect on these dimensions are worth betting on. The findings that look like power posing or elderly priming are not.
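One way to make this checklist concrete is to treat it as a simple tally. The sketch below is an illustration only: the `Claim` fields paraphrase the six questions above, and the equal weighting and the example answers are assumptions, not a validated scoring scheme.

```python
from dataclasses import dataclass, fields


@dataclass
class Claim:
    """Yes/no answers to the six calibration questions (hypothetical rubric)."""
    effect_size_comparable: bool  # effect size holds up against a Stroop-like anchor?
    self_demonstrable: bool       # can a non-expert reproduce it on themselves?
    mechanism_converges: bool     # do independent methods agree on the mechanism?
    applied_uses_survived: bool   # have applied uses held up under practical scrutiny?
    variants_replicated: bool     # have extensions of the paradigm worked?
    effect_size_stable: bool      # stable across decades rather than shrinking?


def calibration_score(claim: Claim) -> int:
    """Count the questions answered 'yes' (equal weights assumed)."""
    return sum(bool(getattr(claim, f.name)) for f in fields(claim))


# Illustrative answers only; a real assessment needs evidence behind each field.
all_yes = Claim(True, True, True, True, True, True)
all_no = Claim(False, False, False, False, False, False)
print(calibration_score(all_yes), calibration_score(all_no))  # prints: 6 0
```

A tally like this is only a prompt for asking the questions consistently; the judgment about each answer still has to come from the evidence.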

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified in behavioral economics. Led 100+ in-house experiments at NRG in 2025, with project evidence and limits documented in the case studies.
