The Trolley Problem And Moral Psychology: Greene 2001 fMRI Findings And Their Limits

Atticus Li

Greene et al. (2001) used fMRI and the trolley problem to argue that emotion drives deontological moral judgments while reason drives utilitarian ones. The original analysis has been challenged, the dual-process model has serious competitors, and a 2018 study found that trolley responses do not predict real moral behavior.

By Atticus Li May 26, 2026 21 min read

A self-driving car ethics panel I sat in on last year opened with two slides. The first showed a trolley headed toward five people on a track, with a lever that could divert it onto a side track and kill one. The second showed the now-famous footbridge variation: the trolley still threatens five people, but the only way to stop it is to push a large stranger off a footbridge into its path. The panelist’s framing, which I have heard at least a dozen times in conference rooms across three industries, went like this: “Brain-imaging research shows that we use our emotions when we say no to pushing the stranger, but we use our reason when we say yes to pulling the lever. That’s why people are inconsistent. Building moral AI means choosing which system we want to listen to.”

There is a real piece of research underneath that slide — Joshua Greene’s 2001 Science paper, “An fMRI investigation of emotional engagement in moral judgment.” It launched what became the most influential research program in moral psychology of the past two decades. But the panelist’s confident “brain-imaging research shows” was doing several layers of work that the underlying evidence does not actually support.

The 2001 paper’s stimulus set and reaction-time analysis have been challenged (McGuire et al., 2009). The dual-process model Greene built on top of the imaging, in which deontological judgments are emotional and utilitarian judgments are deliberative, has serious competitors that divide moral cognition along different lines (Cushman, 2013). And, most usefully for any practitioner trying to apply this work, a clever 2018 study found that what people say they would do in a trolley dilemma is essentially uncorrelated with what they actually do when given a real moral choice with real consequences (Bostyn et al., 2018).

This article walks through what Greene actually found, what is contested about it, and how to evaluate the “neuroscience of ethics” or “dual-process moral judgment” claims that show up in everything from AI ethics decks to leadership training materials. The qualitative observation that some moral judgments feel quick and gut-level while others feel slow and calculated is real. The specific brain-region story and the “your emotions versus your reason” framing routinely outrun the evidence.

What Greene 2001 Actually Tested

The trolley dilemmas had been kicked around moral philosophy for more than thirty years before Greene picked them up. Philippa Foot introduced the original trolley case in 1967; Judith Jarvis Thomson developed the footbridge variation, most influentially in her 1985 paper. Philosophers had used the contrast (most people will pull the switch but refuse to push the stranger, even though both actions kill one to save five) as a puzzle about the moral significance of using a person as a means, the doctrine of double effect, agent-relative permissions, and similar normative-theory questions. The puzzle was philosophical: what should the right answer be?

Greene and colleagues asked a different question. What is happening in the brain when people respond to these dilemmas?

They built a stimulus set of 60 dilemmas split into three categories. Moral-personal dilemmas required directly causing serious harm to a specific person in an “up close and personal” way: the footbridge case, smothering a crying baby to prevent a militia from finding hidden refugees, throwing one person out of a sinking lifeboat. Moral-impersonal dilemmas involved morally significant choices that did not require direct personal harm: the switch-track case, keeping money found in a lost wallet, voting against a tax measure. Non-moral dilemmas were practical choices without moral weight, such as which travel route to take or whether to use coupons.

Nine participants read each dilemma in an fMRI scanner and pressed a button indicating whether the proposed action was “appropriate” or “inappropriate.” The team analyzed BOLD signal (the proxy for neural activity used in fMRI) by region and contrasted the three categories.

The headline findings, as Greene et al. (2001) reported them:

Moral-personal dilemmas showed greater activation than moral-impersonal dilemmas in brain regions associated with emotion — medial frontal gyrus, posterior cingulate gyrus, and angular gyrus bilaterally.
Moral-impersonal and non-moral dilemmas showed greater activation in working-memory regions — bilateral middle frontal gyrus and bilateral parietal lobe.
Within moral-personal dilemmas, reaction times were longer when participants judged the harmful action “appropriate” (the utilitarian answer of pushing the man off the bridge) than when they judged it “inappropriate” (the deontological answer of refusing). The interpretation: utilitarian responses to personal dilemmas require overriding a fast emotional signal, which takes time.

Greene then built a broader theoretical framework on top of these results: the dual-process model of moral judgment. Deontological intuitions — “don’t push the man” — are fast, emotional, and automatic. Utilitarian intuitions — “five lives outweigh one” — are slow, deliberative, and effortful. The footbridge case feels different from the switch case because pushing engages an evolved emotional response to up-close-and-personal harm that switch-pulling does not. He elaborated this in dozens of follow-up papers and ultimately in his 2013 book Moral Tribes.

This framework became enormously influential. It was cited as the empirical basis for everything from secular-utilitarian ethics arguments to AI alignment proposals to corporate ethics training. The phrase “your moral brain has two systems” became one of the most cited claims in popular moral psychology in the 2010s.

The empirical reality is more complicated.

The McGuire 2009 Reanalysis

In 2009, Jonathan McGuire, Robyn Langdon, Max Coltheart, and Catriona Mackenzie published a reanalysis in the Journal of Experimental Social Psychology titled “A reanalysis of the personal/impersonal distinction in moral psychology research.” The paper is short, technical, and devastating to the simplest version of Greene’s empirical claim.

McGuire et al. argued that the personal/impersonal distinction Greene used was not capturing what he claimed it was. They pointed out that the stimulus set was unbalanced in ways that confounded the contrast. The moral-personal dilemmas tended to involve unusual, vivid, dramatic scenarios — pushing a stranger off a bridge, smothering a baby. The moral-impersonal dilemmas tended to involve more mundane, less emotionally vivid scenarios — voting on tax policy, finding a wallet. When McGuire and colleagues looked at participant ratings of the dilemmas on dimensions like emotional intensity, severity, and clarity, the personal/impersonal categorization did not cleanly map to a single psychological dimension.

More importantly, McGuire et al. reanalyzed Greene’s own published response-time data and reported that the central reaction-time finding — that utilitarian responses to personal dilemmas take longer — was driven by a small number of unusual items rather than being a general pattern across the full stimulus set. When you removed those items, the reaction-time difference shrank substantially or disappeared.
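To make the item-level concern concrete, here is a minimal Python sketch of a leave-items-out check. The per-item numbers are invented for illustration (they are not Greene’s or McGuire’s data), and the names `rt_diff_ms`, `mean_effect`, and `most_influential` are hypothetical helpers, not anything taken from either paper.

```python
# Hypothetical per-item reaction-time differences in milliseconds:
# (mean RT when the action was judged "appropriate")
#   minus (mean RT when it was judged "inappropriate").
# Invented values for illustration only; NOT data from Greene or McGuire.
rt_diff_ms = {
    "item_01": 100, "item_02": -200, "item_03": 0, "item_04": 100,
    "item_05": -100, "item_06": 200, "item_07": -100, "item_08": 0,
    "item_09": 3000, "item_10": 4000,  # two unusual items carry the effect
}


def mean_effect(diffs, exclude=()):
    """Average RT difference across items, optionally dropping some items."""
    kept = [value for name, value in diffs.items() if name not in exclude]
    return sum(kept) / len(kept)


def most_influential(diffs, k):
    """Names of the k items with the largest absolute RT difference."""
    return sorted(diffs, key=lambda name: abs(diffs[name]), reverse=True)[:k]


full = mean_effect(rt_diff_ms)
outliers = most_influential(rt_diff_ms, k=2)
trimmed = mean_effect(rt_diff_ms, exclude=outliers)

print(f"all items: {full:.0f} ms")
print(f"without {outliers}: {trimmed:.0f} ms")
```

With these made-up values the full-set average is 700 ms and the trimmed average is 0 ms. That is the shape of the problem McGuire et al. described: an effect that looks general in the aggregate but is carried by a handful of items.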

Greene and colleagues responded to McGuire’s critique, and the back-and-forth got technical. The core point that survived the exchange is this: the original 2001 paper’s stimulus set had construction issues that make the “emotion versus cognition” mapping less clean than the headline framing suggested. The neural activations themselves are not in dispute (the brain regions Greene identified do show different activation patterns across the conditions), but the interpretation that those regions correspond to “emotional moral processing” versus “cognitive moral processing” places more inferential weight on the data than they can bear.

This is a recurring problem in fMRI research generally. Reverse inference — concluding “this person is feeling fear because the amygdala lit up” or “this person is doing reasoning because the dorsolateral prefrontal cortex lit up” — is logically invalid unless those regions are uniquely selective for those mental functions, which they almost never are. Russell Poldrack’s influential 2006 critique of reverse inference in fMRI applies directly to Greene’s framework: medial prefrontal cortex activation does not entail “emotion” in any specific sense. It activates during self-referential thinking, theory-of-mind tasks, default-mode rest, and many other operations.
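The selectivity point can be written out with Bayes’ rule, which is how Poldrack framed it. The sketch below is a minimal illustration, not a model of any real dataset: the probabilities are hypothetical, and the function name `posterior_process_given_activation` is my own.

```python
def posterior_process_given_activation(p_act_given_proc,
                                       p_act_given_not_proc,
                                       prior_proc):
    """Bayes' rule: P(mental process | region is active)."""
    numerator = p_act_given_proc * prior_proc
    evidence = numerator + p_act_given_not_proc * (1.0 - prior_proc)
    return numerator / evidence


# Hypothetical numbers, chosen only to show the shape of the argument.
prior = 0.5  # assumed prior that the task engages "emotion" at all

# A highly selective region: rarely active unless the process is engaged.
selective = posterior_process_given_activation(0.8, 0.1, prior)

# A weakly selective region: active in many other tasks too.
unselective = posterior_process_given_activation(0.8, 0.6, prior)

print(f"selective region:   P(process | activation) = {selective:.2f}")
print(f"unselective region: P(process | activation) = {unselective:.2f}")
```

Under these assumed numbers, activation in the selective region moves the estimate from 0.50 to about 0.89, while activation in the unselective region only moves it to about 0.57. A region that responds during many kinds of tasks tells you little about which one is happening.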

The Cushman 2013 Alternative: Action Versus Outcome

Fiery Cushman published a 2013 review in Personality and Social Psychology Review titled “Action, outcome, and value: A dual-system framework for morality.” Despite the title’s “dual-system” phrasing, Cushman’s framework is best understood as a serious competitor to Greene’s emotion-versus-reason model, because it cuts the dual process along an entirely different axis.

Cushman argued that the right way to understand the divide is between action-based and outcome-based evaluation, not between emotion-based and reason-based evaluation. In Cushman’s account, both moral intuitions and moral deliberation can be either action-focused or outcome-focused. A fast deontological response to the footbridge case is not “emotion winning over reason” but a model-free habit system evaluating the action itself (pushing) as bad independent of consequences. A utilitarian endorsement of pushing is a model-based system evaluating the outcome (one death versus five) as better.

This reframing matters because it removes the “emotion versus reason” valence. In Cushman’s view, both systems are doing valid moral computation — they just compute over different aspects of the situation. The dual process is not “irrational gut feeling versus rational calculation” but “two different rational computations that often disagree.”

Other researchers proposed even more reductionist alternatives. Joachim Hennig and colleagues argued that the trolley-versus-footbridge difference can be largely accounted for by differences in physical contact and personal force rather than by a deep emotion/cognition distinction. People object more strongly to using direct physical force against a person than to flipping a remote switch. That is not particularly mysterious or theoretically deep; it is consistent with a simple aversion to direct violence that does not require positing two separate moral computing systems.

The point is not that Cushman or any single critic is right and Greene is wrong. The point is that the field has serious alternative accounts of the same observed phenomena, and the popular framing — which presents Greene’s dual-process model as settled science — substantially overstates its consensus status.

The Bostyn 2018 Mice Study (Hypothetical Versus Real Behavior)

Dries Bostyn, Sybren Sevenhant, and Arne Roets published a 2018 paper in Psychological Science that is, for any practitioner thinking about how to apply moral-psychology research, more important than any of the neuroimaging back-and-forth. The title: “Of mice, men, and trolleys: Hypothetical judgment versus real-life behavior in trolley-style moral dilemmas.”

Bostyn and colleagues built a real-life version of the trolley dilemma. Participants were told they would be asked to deliver a painful electric shock to a mouse to prevent five other mice from receiving shocks. The setup was elaborate, with the mice physically present in cages, and the experimenters went to considerable lengths to make participants believe the shock-delivery was real. (No mice were actually shocked; the apparatus was rigged. The deception was disclosed in debriefing, and the study went through ethics review.)

The same participants were also asked to respond to a hypothetical trolley dilemma framed in parallel terms — one mouse versus five mice. The question: do hypothetical-dilemma responses predict real-behavior responses?

The finding: substantially less than you would expect.

In the hypothetical version, about 66% of participants endorsed the utilitarian action (shock the one to save the five). In the real version, about 84% actually performed the utilitarian action when given the choice. The proportions are not just different — the rank-order correlation between an individual’s hypothetical judgment and their actual behavior was weak.
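The gap between group proportions and individual-level agreement is easy to blur, so here is a small Python sketch of the difference. The 2x2 counts are invented: both tables reproduce the 66% and 84% proportions for an imagined 100 people who gave both responses, which is a simplifying assumption for illustration. The phi coefficient is my choice of association measure, not necessarily the statistic Bostyn and colleagues reported.

```python
from math import sqrt


def phi(a, b, c, d):
    """Phi coefficient for the 2x2 table [[a, b], [c, d]]."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


# Rows: hypothetical judgment (utilitarian, refuse).
# Columns: real behavior (utilitarian, refuse).
# Cell order: (a, b, c, d). Invented counts, NOT the study's data.
weak_link = (55, 11, 29, 5)     # judgment says almost nothing about behavior
strong_link = (66, 0, 18, 16)   # judgment is a strong guide to behavior

for a, b, c, d in (weak_link, strong_link):
    # Both tables share the same marginals: 66 and 84 out of 100.
    assert a + b == 66 and a + c == 84 and a + b + c + d == 100

phi_weak = phi(*weak_link)
phi_strong = phi(*strong_link)

# Refusal rates implied by the reported proportions: 34% versus 16%.
refusal_ratio = (100 - 66) / (100 - 84)

print(f"phi, weak table:   {phi_weak:+.2f}")
print(f"phi, strong table: {phi_strong:+.2f}")
print(f"refusal ratio (hypothetical / real): {refusal_ratio:.3f}")
```

The two tables have identical group-level proportions but very different individual-level associations, which is why the 66% versus 84% gap and the weak correlation are separate findings. The refusal ratio of 2.125 restates the proportions another way: refusing to shock was roughly twice as common in the hypothetical version.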

This is a serious challenge to the external validity of the entire trolley-problem research program. The Greene framework, and the broader moral-psychology literature it spawned, treats hypothetical-dilemma responses as windows into people’s underlying moral computation. The Bostyn finding suggests that hypothetical responses and actual behavior are governed by partially different processes. What you say you would do when reading a dilemma is not what you would do when actually faced with a similar choice — even in a relatively low-stakes lab setting with mice.

If you cannot generalize from hypothetical to real even within the same paradigm in the same lab, the case for generalizing from hypothetical trolley responses to real-world moral choices in business, medicine, or self-driving cars is correspondingly weak. The most important moral choices in real life are made under time pressure, with information uncertainty, by people who have skin in the game and accountability for the outcome. Lab subjects sitting in a scanner pressing buttons about hypothetical strangers on a bridge are not in any of those conditions.

A 2019 follow-up by Bostyn and Roets extended the finding: even when participants were given the chance to revise their hypothetical response after engaging in the real behavior, the two response modes remained substantially independent. People’s “moral computation” in the abstract is not the same machinery they use when actually deciding whether to do harm.

What the Empirical Record Actually Supports

Stripping away the overclaims, here is what I think the evidence does and does not support.

Does support:

Some moral judgments feel intuitive and fast; others feel calculated and slow. This qualitative observation is robust across many studies and does not depend on Greene’s specific neuroimaging.
Different brain regions show different activation patterns across different kinds of moral dilemmas. The 2001 fMRI finding, on its own neural-activation terms, is not seriously disputed.
People show systematic preferences for some dilemma framings over others. The footbridge case really does get more deontological responses than the switch case across many studies and cultures, and that asymmetry deserves explanation.
Direct, hands-on physical harm tends to feel morally worse than indirect, mediated harm of equivalent consequences. This is well-established as a behavioral phenomenon.

Does not adequately support:

The specific reverse-inference mapping from brain regions to “emotion” versus “reason.” That mapping requires unique selectivity in those regions, which the imaging literature has not established.
The dual-process model as a uniquely correct theoretical framework. Alternative accounts (Cushman’s 2013 action/outcome framework, Hennig’s personal-force account) fit the data approximately as well and make different predictions.
The generalization from hypothetical trolley responses to real moral behavior. Bostyn 2018 directly tested this and found weak correlation.
The popular “your emotions interfere with your moral reasoning” framing. This is a value-laden interpretation that builds in a position about which system is “really” doing moral cognition, and the framing does not survive contact with Cushman’s reframing or with normative philosophy.
Practical recommendations about how to “engage the right system” for better moral decisions. The translational distance from the lab data to practical advice is enormous, and the lab data themselves are contested.

Strategist Takeaway: Evaluating “Neuroscience Of Ethics” Claims

Most practitioners encountering Greene’s work do so through a downstream popularization — a TED talk, a leadership book, a consultant’s slide deck, an AI-ethics panel. Here is a short evaluation kit for anything in that genre that invokes brain-imaging studies of moral judgment.

1. Is the speaker citing the original imaging finding or the dual-process interpretation? These are different claims. The imaging finding (different brain regions activate during different dilemma types) is uncontroversial. The interpretation (those regions correspond to “emotion” versus “reason”) is contested. Most popularizations conflate the two.

2. Does the speaker mention reverse-inference limits at all? If a speaker says “the amygdala lit up, which means they felt fear” or “the prefrontal cortex was active, which means they were reasoning,” they are committing the reverse-inference error. This has been a documented critique of neuroimaging interpretation since at least Poldrack (2006). A serious presenter will acknowledge it.

3. Is the speaker generalizing from hypothetical to real behavior? This is the Bostyn 2018 question. If the practical recommendation is built on what people say they would do in a moral dilemma, the recommendation is on shakier ground than if it is built on what people actually do in real moral situations. Field studies, naturalistic observation, and behavioral economics with real stakes carry more weight here than scanner studies.

4. Are competing theoretical frameworks acknowledged? A serious popularization of dual-process moral psychology will at least name competing accounts such as Cushman’s action/outcome framework. A presentation that treats Greene’s model as settled is misrepresenting the field’s actual consensus.

5. What is the practical recommendation, and does it follow from the data? Watch out for a slide deck that uses neuroimaging to support a recommendation that the neuroimaging cannot actually support. “Build a culture of slow deliberative moral reasoning” sounds neuroscience-y, but the empirical support for that recommendation does not come from Greene’s fMRI work — it comes from organizational decision-making research, accountability literature, and behavioral economics, which deserve to be cited on their own terms.

6. Is the recommendation reversible? Greene’s framework is often invoked to argue for utilitarian over deontological reasoning in policy (“trust the slow system”). But Cushman’s framework, applied to the same data, can be used to argue for the opposite — that action-focused intuitions encode useful moral information about agency and means/ends distinctions that pure outcome calculation misses. If the same data set can support opposing recommendations depending on which theoretical framing you adopt, the recommendation is not really being driven by the data.

A useful heuristic: when someone invokes “the neuroscience of ethics,” ask them what their position would be if Greene’s specific framework turned out to be wrong. If their entire argument collapses, it was resting on a contested framework presented as settled. If their argument survives, the neuroscience was decorative rather than load-bearing, which is fine, but you should know which it is.

Greene’s Broader Framework: Moral Tribes And Beyond

I want to be fair to Greene here. His 2013 book Moral Tribes is a serious, careful work of moral philosophy that argues for a particular position — a sophisticated deep-pragmatism utilitarianism — using empirical moral psychology as one piece of a broader argument. The book is much more nuanced than the popularizations of it that ended up in conference rooms. Greene himself has acknowledged limits in the original 2001 findings and engaged seriously with critics over the years.

The dual-process framework, treated as a useful first approximation rather than a settled fact, does capture some genuine asymmetries in how moral judgments unfold. The footbridge-versus-switch difference is real, and demands explanation. Greene’s framework is a candidate explanation among several. Whether you find it more or less convincing than Cushman’s or Hennig’s depends on what theoretical priors you bring.

What I am pushing back on is the second-order popularization — the leadership-training slide, the AI-ethics panel, the consultant deck — that strips off the nuance, presents the imaging findings as established proof of a specific cognitive architecture, and uses that “proof” as the foundation for confident practical recommendations. That popularization runs ahead of the evidence in ways Greene himself, in his careful academic writing, does not. The distance between “Greene 2001 found differential activation patterns across dilemma types” and “neuroscience proves we should listen to our slow utilitarian brain” is measured in many additional inferential steps, and most of those steps are contested.

For anyone evaluating moral-psychology research in an applied setting, the practical advice is to read the original studies and the critical responses together. Greene 2001 alongside McGuire 2009. Moral Tribes alongside Cushman 2013. The published trolley dilemmas alongside Bostyn 2018’s mouse study. The cumulative picture is much more interesting, and much more uncertain, than any single popularization will tell you. That uncertainty is not a reason to dismiss the work. It is a reason to take it seriously as ongoing science rather than treating it as established conclusion.

Sources

Greene, J. D., Sommerville, R. B., Nystrom, L. E., Darley, J. M., & Cohen, J. D. (2001). An fMRI investigation of emotional engagement in moral judgment. Science, 293(5537), 2105-2108. DOI: 10.1126/science.1062872
Greene, J. D. (2013). Moral Tribes: Emotion, Reason, and the Gap Between Us and Them. Penguin Press.
Cushman, F. (2013). Action, outcome, and value: A dual-system framework for morality. Personality and Social Psychology Review, 17(3), 273-292. DOI: 10.1177/1088868313495594
McGuire, J., Langdon, R., Coltheart, M., & Mackenzie, C. (2009). A reanalysis of the personal/impersonal distinction in moral psychology research. Journal of Experimental Social Psychology, 45(3), 577-580. DOI: 10.1016/j.jesp.2009.01.002
Bostyn, D. H., Sevenhant, S., & Roets, A. (2018). Of mice, men, and trolleys: Hypothetical judgment versus real-life behavior in trolley-style moral dilemmas. Psychological Science, 29(7), 1084-1093. DOI: 10.1177/0956797617752640
Poldrack, R. A. (2006). Can cognitive processes be inferred from neuroimaging data? Trends in Cognitive Sciences, 10(2), 59-63. DOI: 10.1016/j.tics.2005.12.004
Foot, P. (1967). The problem of abortion and the doctrine of the double effect. Oxford Review, 5, 5-15.
Thomson, J. J. (1985). The trolley problem. Yale Law Journal, 94(6), 1395-1415.

FAQ

Did Greene 2001 actually fail to replicate?

The original imaging data themselves — the patterns of brain activation across the three dilemma categories — are not in serious dispute. What is contested is the interpretation. McGuire et al. (2009) reanalyzed the stimulus set and reaction-time data and found that several of the central inferential claims rested on a small number of unusual items. The “emotion versus cognition” mapping requires reverse-inference assumptions (Poldrack, 2006) that the imaging field generally treats as invalid without independent evidence. So: the neural findings are real, but the popular framing of “Greene proved deontological judgments are emotional” goes well beyond what those findings can support.

What is the difference between Greene’s dual-process model and Cushman’s framework?

Greene’s model divides moral processing along an emotion-versus-reason axis: deontological intuitions are fast and emotional, utilitarian intuitions are slow and deliberative. Cushman’s framework divides moral processing along an action-versus-outcome axis: one system evaluates the moral character of the action itself (independent of consequences), another system evaluates the moral character of the outcome (independent of how it was produced). Both systems in Cushman’s account can be fast or slow, automatic or deliberative — the dual process is about what is being evaluated, not about emotion versus reason. The two frameworks fit the same trolley data approximately as well and make different predictions in cases like indirect harm and luck-influenced outcomes.

What did Bostyn 2018 actually find about real moral behavior?

Bostyn, Sevenhant, and Roets ran a study where participants believed they would actually deliver a painful electric shock to a mouse to prevent five other mice from being shocked. The apparatus was rigged — no mice were harmed — but participants believed it was real. The same participants also gave a hypothetical judgment about the parallel one-versus-five mouse dilemma. About 66% endorsed the utilitarian choice in the hypothetical version; about 84% performed the utilitarian action in the real version. More importantly, individual-level rank-order correspondence between hypothetical judgment and real behavior was weak. The implication: what people say in trolley dilemmas does not cleanly predict what they actually do in analogous real situations.

Is the trolley problem useful for AI ethics or self-driving cars?

The trolley problem is useful as a teaching example for surfacing intuitions about agent-relative permissions, the doctrine of double effect, and the moral significance of action versus omission. It is much less useful as an empirical foundation for designing AI systems. Real autonomous-vehicle ethics involves probability distributions over outcomes under uncertainty, not clean trolley-style choices between known harms. The intuitions surfaced by trolley cases — collected mostly from undergraduates responding to written dilemmas — should not be treated as authoritative input to AI policy without much more careful translation work than typically happens in industry decks.

Why does the footbridge case feel different from the switch case if not because of “emotion versus reason”?

Several alternative explanations exist. The footbridge case involves direct physical contact with the victim; the switch case does not (Hennig et al. and related work). The footbridge case involves using the victim’s body as a means to an end; the switch case treats the victim’s death as a side effect (doctrine of double effect, going back to Aquinas). The footbridge case requires personal force application; the switch case is mediated through a mechanical device (Greene himself has explored this in follow-up work). The footbridge case is more vivid, dramatic, and unusual; the switch case is closer to mundane mechanical decisions. Any of these — or some combination — can account for the asymmetry without requiring the specific dual-process emotional/cognitive architecture Greene proposed.

What should I take away from this for evaluating “neuroscience of ethics” claims more generally?

Three filters. First, distinguish neural activation findings (often robust) from interpretive claims about what those activations mean (often contested via reverse inference). Second, distinguish hypothetical-judgment data from real-behavior data, and weight the latter more heavily for practical applications (Bostyn 2018 is one example of why). Third, treat the appearance of “two systems” architectures in popular neuroscience with healthy skepticism — the same brain-region patterns can usually be carved into multiple competing two-system frameworks, and the choice between them is more theoretical than empirical. Greene’s specific framework is a serious scientific contribution that has been productive for the field; treating it as settled science is a popularization error, not Greene’s.

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified in behavioral economics. Led 100+ in-house experiments at NRG in 2025, with project evidence and limits documented in the case studies.
