The Marshmallow Test: What Willpower at Age Four Actually Predicts (And What It Does Not)

Atticus Li

For fifty years, the marshmallow test was cited as proof that willpower at age four predicts adult success. The 2018 replication, the 2020 critique, and the 2024 follow-up have rewritten the story. Here is what the evidence actually shows, what fell apart, and what leaders should learn about the gap between cute studies and serious science.

By Atticus Li · May 7, 2026 · 17 min read

A four-year-old sits at a small table in a bare room. In front of her is a single marshmallow on a plate. The researcher tells her she can eat the marshmallow now, or she can wait --- and if she waits until he comes back, she’ll get two. Then the researcher leaves the room. The clock starts. The girl looks at the marshmallow. She squirms. She covers her eyes. She turns around in her chair. She licks her lips. She smells the marshmallow. She holds it up and inspects it like a tiny philosopher. After fifteen minutes --- fifteen agonizing minutes --- the researcher returns. She has not eaten the marshmallow. She gets two.

Fast-forward fourteen years. That same four-year-old, now eighteen, has higher SAT scores. She’s more socially competent. She copes better with stress and frustration. She’s on a path that leads to a successful adult life. The kid at the next table over, who ate the marshmallow within thirty seconds, is doing measurably worse on most of these metrics.

This story --- willpower at age four predicts adult success, and you can measure it with a marshmallow --- is one of the most-cited findings in 20th-century developmental psychology. It launched a thousand parenting books. It anchored arguments about character education, school curricula, and welfare policy. It became the canonical example of a finding that connects a simple childhood behavior to lifelong consequences. Walter Mischel, the Stanford psychologist who designed the test, became one of the most-cited researchers in his field.

Then, in 2018, a team of researchers tried to replicate the famous predictive finding using a much larger and more diverse sample. The story they told was almost completely different from the one in the books.

This article is about what actually happened --- what the original 1972 paper said, what the famous 1990 follow-up said, what the 2018 replication found, what the 2020 critique pushed back on, what the 2024 follow-up showed for adult outcomes, and what any of this means for parents, educators, hiring managers, or strategists who’d like to understand what childhood behavior actually does and doesn’t predict about adult life.

What the Original Studies Actually Said

The first thing to know is that the famous “marshmallow test predicts adult success” claim is not from the 1972 paper most people cite. It’s from a paper published almost two decades later.

The 1972 paper --- Mischel, Ebbesen & Zeiss, “Cognitive and attentional mechanisms in delay of gratification,” Journal of Personality and Social Psychology --- is about something more limited and more interesting. The researchers were studying how children’s strategies affected their ability to wait for a larger reward. They found that children who looked at the reward had a harder time waiting, and children who covered the reward, looked away, or thought about something else (singing songs, imagining the marshmallow as a “fluffy cloud”) could wait longer. The headline finding was about cognitive strategies, not about a stable trait.

Sample size was around 50 children at Stanford’s Bing Nursery School --- children of Stanford faculty and graduate students, an extremely homogeneous sample.

The famous predictive finding came eighteen years later, in Shoda, Mischel & Peake (1990), “Predicting adolescent cognitive and self-regulatory competencies from preschool delay of gratification,” Developmental Psychology. The authors followed up with 185 of the original Bing Nursery cohort, now teenagers. They found that preschool delay-of-gratification time correlated with adolescent SAT scores (around r = 0.4 with verbal SAT and somewhat higher, around r = 0.6, with quantitative SAT) and with parent-rated self-regulation and social competence.

This was the paper that launched the cultural phenomenon. A simple childhood test --- can a four-year-old wait fifteen minutes for a second marshmallow? --- predicted measurable differences in teenage outcomes more than a decade later. The implication, drawn out in countless popular treatments, was that self-control was a stable trait you could measure early and that mattered for life.

Mischel’s 2014 book The Marshmallow Test: Why Self-Control Is the Engine of Success synthesized this into a popular framework. By then, the marshmallow test was a cultural artifact --- featured in TED talks, parenting books, charter-school curricula, and arguments about the importance of “non-cognitive skills.”

There were always problems with the predictive evidence. The 1990 follow-up sample was small (185 of the 653 children in the original Stanford cohort) and homogeneous (Stanford-affiliated families), so the correlations, while interesting, rested on a narrow base. But for almost three decades, no one had attempted a serious replication with a different sample. The cultural belief in the marshmallow test ran far ahead of the empirical foundation.

The 2018 Replication

In 2018, Tyler Watts, Greg Duncan, and Haonan Quan published “Revisiting the Marshmallow Test: A Conceptual Replication Investigating Links Between Early Delay of Gratification and Later Outcomes,” in Psychological Science. They used a much larger, much more diverse sample: 918 children from the NICHD Study of Early Child Care and Youth Development, a long-running longitudinal dataset that tracked children from birth through age fifteen.

The dataset includes a version of the marshmallow test given at age 4.5, and the authors related performance on it to outcomes measured at age 15.

The headline finding was striking. The bivariate correlation between early delay-of-gratification time and adolescent achievement was about half the size of the original. Once they controlled for family background (maternal education, family income, home environment quality) and early cognitive ability, the remaining association shrank by about two-thirds.

The honest implication: the original finding was substantially confounded by socioeconomic status and family environment. Children from wealthier, more educated, more stable families were both better able to wait at age four and more likely to do well at age fifteen. The marshmallow test wasn’t measuring some pure “willpower trait” that caused later success; it was partly capturing the conditions of growing up in a family that taught and rewarded delayed gratification, and those same conditions also produced better adolescent outcomes through dozens of other channels.
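To make the confounding logic concrete, here is a minimal simulation sketch. It is not the Watts analysis and uses no real data: the variable names, the 0.6 coefficients, and the single "family background" index are all assumptions chosen for illustration. In this toy world, wait time has no effect on the outcome at all, yet the two correlate (by construction, at about 0.36 / 1.36 ≈ 0.26) because both depend on family background. Removing the part of each variable that family background explains drives the correlation to roughly zero.

```python
import random


def mean(values):
    return sum(values) / len(values)


def corr(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def residuals(y, x):
    """What is left of y after a simple linear regression on x."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def simulate(n=20000, seed=0):
    rng = random.Random(seed)
    # Hypothetical family-background index (illustrative assumption).
    family = [rng.gauss(0, 1) for _ in range(n)]
    # Wait time and later outcome both depend on family background.
    # Wait time has NO direct effect on the outcome in this toy model.
    delay = [0.6 * f + rng.gauss(0, 1) for f in family]
    outcome = [0.6 * f + rng.gauss(0, 1) for f in family]

    raw = corr(delay, outcome)
    # Partial correlation: correlate what remains after removing family.
    adjusted = corr(residuals(delay, family), residuals(outcome, family))
    return raw, adjusted


raw_r, adjusted_r = simulate()
print(f"raw r = {raw_r:.2f}, adjusted r = {adjusted_r:.2f}")
```

The real data are messier than this: the controlled association in the replication shrank rather than vanishing. The sketch only shows why a raw correlation on its own cannot distinguish "willpower causes success" from "both share a cause."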

The Watts paper was, in academic terms, devastating. The popular version of the marshmallow test --- willpower causes success --- could not survive a study where socioeconomic confounds were controlled.

The 2020 Pushback

The story didn’t end there. In 2020, Falk, Kosse & Pinger published a critique in Psychological Science --- “Re-Revisiting the Marshmallow Test: A Direct Comparison of Studies by Shoda, Mischel, and Peake (1990) and Watts, Duncan, and Quan (2018).”

Their argument was technical but important. They pointed out that some of the controls Watts had used were post-treatment variables --- that is, things that would themselves be affected by a child’s self-control rather than purely independent confounders. Including post-treatment variables in a regression doesn’t just adjust for confounds; it absorbs some of the effect you’re trying to measure, biasing the estimate downward. They argued that with a more careful specification, the predictive relationship between early delay of gratification and adolescent outcomes was still meaningful.
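The post-treatment problem can also be sketched with a toy simulation. Everything here is an assumption made for illustration, including the variable names, the coefficients, and the premise that early delay really does have a causal effect: the point is to show the mechanism the critique describes, not to settle whether it applies to the real data. Early delay affects the outcome directly (0.2) and also through a later skill (0.7 × 0.5 = 0.35), so the total effect is about 0.55. Controlling for the later skill strips out the indirect path and leaves only the 0.2.

```python
import random


def mean(values):
    return sum(values) / len(values)


def slope(x, y):
    """OLS slope of y on x (single predictor, with intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(y, x):
    """What is left of y after a simple linear regression on x."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def simulate(n=20000, seed=1):
    rng = random.Random(seed)
    delay = [rng.gauss(0, 1) for _ in range(n)]
    # Hypothetical later skill that early delay itself helps produce,
    # i.e. a post-treatment variable (illustrative assumption).
    later_skill = [0.7 * d + rng.gauss(0, 1) for d in delay]
    outcome = [
        0.2 * d + 0.5 * m + rng.gauss(0, 1)
        for d, m in zip(delay, later_skill)
    ]

    # Total effect of delay: direct path plus the path through later_skill.
    total = slope(delay, outcome)
    # Coefficient on delay when later_skill is also in the regression,
    # computed by residualizing both variables on later_skill first.
    controlled = slope(
        residuals(delay, later_skill), residuals(outcome, later_skill)
    )
    return total, controlled


total, controlled = simulate()
print(f"total effect = {total:.2f}, after controlling = {controlled:.2f}")
```

Whether a given control is a confounder (as in the family-background case) or a post-treatment variable (as here) is a question about causal ordering that the regression itself cannot answer, which is why the two sides could look at the same data and disagree.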

Watts and Duncan replied in the same issue of the journal. They defended their original specification, argued that the controls were appropriate, and noted that the basic empirical pattern (much smaller correlations in a more diverse sample, with substantial reduction after family-background controls) was robust across alternative specifications.

This is what an honest scientific debate looks like. Both sides agreed the original 1990 effect was substantial. They disagreed about how much of that effect survives careful causal analysis. Reasonable people read the exchange and came down in different places. The simple “marshmallow test debunked” narrative wasn’t quite right; the simple “marshmallow test still predicts adult success” narrative wasn’t right either.

The 2024 Adult-Outcome Study

The most recent and most informative chapter came in 2024. Sperber, Vandell, Duncan, & Watts published “Delay of gratification and adult outcomes: The Marshmallow Test does not reliably predict adult functioning,” in Child Development. They followed 702 participants from the same NICHD dataset out to age twenty-six.

The bivariate correlations were still there: preschool delay-of-gratification time correlated weakly with adult educational attainment (r ≈ 0.17) and with adult body-mass index (r ≈ -0.17). But once family background and early cognitive ability were controlled, almost all of the regression-adjusted coefficients were not statistically significant. The marshmallow test, in a sample tracked into their mid-twenties with thoughtful causal controls, did not reliably predict adult outcomes.
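One rough way to see how weak these correlations are is to square them. In a simple linear relationship, the squared correlation is the share of variance in one variable accounted for by the other. The figures below use only the correlations quoted in this article; they are a back-of-the-envelope reading, not a result reported by any of the papers.

```latex
\[
\begin{aligned}
r = 0.17 &\;\Rightarrow\; r^2 = 0.0289 \approx 3\% \\
r = 0.4  &\;\Rightarrow\; r^2 = 0.16 = 16\% \\
r = 0.6  &\;\Rightarrow\; r^2 = 0.36 = 36\%
\end{aligned}
\]
```

On this reading, the unadjusted adult correlations correspond to about 3% of the variance, before any controls are applied, compared with 16% to 36% for the SAT correlations quoted from the 1990 paper.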

This is the strongest evidence to date on the question that originally made the marshmallow test famous. The pop-science version --- willpower at age four predicts success at age twenty-five --- does not survive a careful look at the data.

Sperber et al. were careful in their interpretation. They noted that the marshmallow test does seem to capture something real about a child at age four. Children who can wait tend, on average, to live in environments and to have early cognitive abilities that predict better outcomes. But the marshmallow test itself is not a window into a deep, causally important “willpower” trait. It’s a correlate of family background and early ability, and the predictive validity it has is largely shared with other indicators of those underlying factors.

Why the Original Looked Real

The marshmallow test had everything going for it as a finding that would propagate culturally even if the underlying evidence was weaker than reported.

It was visually compelling. The footage of children struggling not to eat the marshmallow is genuinely charming and has been replayed in countless YouTube clips, documentaries, and TED talks. The visual story does enormous work that the data alone could not.

The original sample was small and homogeneous. Stanford faculty children are not representative of children generally. Effects measured in such a sample may reflect properties of that population (high baseline cognitive ability, supportive family environments, and so on) rather than properties of “humans in general.” Generalizing from Bing Nursery to all children was always a stretch.

The follow-up was selective. Only 185 of the original cohort were followed up in the 1990 paper. Selective attrition can dramatically inflate correlations, especially in small samples.

The story was perfectly aligned with cultural appetites. American culture in the 1990s and 2000s was hungry for evidence that “character” mattered, that early childhood experience was destiny, and that simple interventions could produce big developmental effects. The marshmallow test gave all of this.

No one ran the right replication for decades. The first major replication didn’t appear until 2018 --- twenty-eight years after the famous 1990 paper. By then the cultural belief was so entrenched that even a rigorous, well-powered, well-controlled replication had a hard time displacing it.

This pattern --- small original sample, charismatic finding, cultural breakout, decades-long delay before serious replication --- is exactly what produced the replication crisis in social and developmental psychology. The marshmallow test is one of its more sympathetic examples (it wasn’t fraud, the original researchers were careful, and the underlying construct probably correlates with something), but it follows the same arc as power posing, ego depletion, and many of the other Tier 1 entries in this hub.

The Honest Verdict Today

Three layers of finding, in order from most-supported to least-supported.

Layer 1: At age four, kids who can delay gratification are different from kids who can’t. This is uncontroversial. The marshmallow test does measure something real about a four-year-old --- some combination of cognitive ability, attentional control, prior experience with delayed rewards, family environment, and possibly some genuine self-regulatory trait. The differences are real at the moment of measurement.

Layer 2: That difference correlates with later outcomes, but mostly via shared causes. Children who can wait at age four do, on average, do better at age fifteen and at age twenty-six. But this correlation is largely explained by the fact that waiting and later success share underlying causes (family environment, cognitive ability, early developmental trajectory) rather than by a direct causal pathway from “willpower” to “success.”

Layer 3: There is no robust, well-established causal effect of preschool delay of gratification on adult life outcomes. The strong popular claim --- that you can identify a four-year-old’s life trajectory from a marshmallow test --- is not supported by rigorous evidence. The remaining causal signal, if any, is small and contested.

This is a more nuanced picture than either “the marshmallow test is real” or “the marshmallow test is debunked.” It’s also a less marketable picture, which is why the simpler versions persist culturally even though the field has moved on.

What This Means If You’re a Strategist

Three takeaways with very direct practical implications.

1. Be deeply suspicious of single-measure predictions about complex life outcomes. The marshmallow test promised something that doesn’t really exist in the evidence: a single, simple, early-life measurement that reliably predicts a complex adult outcome through some causal mechanism. Almost nothing in the social-science literature actually delivers this kind of clean predictive package. When something claims to (a personality test that predicts job performance, a single behavioral question that predicts customer lifetime value, a five-minute assessment that predicts leadership potential), treat the claim with caution: the empirical track record of such claims is poor.

This matters for hiring. It matters for talent assessment. It matters for any system that uses a single early signal to make a high-stakes downstream judgment about a person. The base rate of “small early measurement actually predicts complex outcome” is much lower than most popular treatments suggest. Put much more weight on multi-signal, cumulative evidence than on any single test.

2. Confounds are usually doing more work than the variable you’re studying. The most important thing the Watts replication showed is that family background, cognitive ability, and home environment together did most of the predictive work that had been attributed to “willpower.” This is the modal finding in social science. When a single measurement appears to predict an outcome, the predictive power is usually mostly carried by underlying confounds --- variables that produce both the measurement and the outcome through separate channels.

For organizational decisions: when an HR system, a customer-segmentation model, a marketing attribution model, or any other measurement-based decision system claims that variable X predicts outcome Y, look hard at what else is correlated with X. If X is correlated with the obvious confounds in your domain (tenure, prior experience, customer wealth, geography), then most of the apparent predictive power is probably coming from those, and the policy implication (“we should change X to improve Y”) may not hold.

3. Cultural fame outruns scientific revision by years and sometimes decades. The marshmallow test’s popular version has been culturally famous for thirty-five years. The serious replication evidence that revised the story has existed for less than a decade. Many people whose entire mental model of “willpower” rests on the original story will continue to operate that way for years even after reading articles like this one --- because the story is sticky and the revised version is more nuanced and harder to summarize.

This applies to your own beliefs about behavioral science. The version of “what we know about people” that you absorbed from books, TED talks, and articles in the 2010s is, on average, several years behind the current academic consensus. Some of what you confidently believe about willpower, character, or human development is probably no longer the consensus view. The discipline of periodically auditing your own beliefs --- which findings you cite in arguments, which interventions you recommend, which assumptions you build strategy on --- is a meaningful competitive advantage in any role that depends on understanding people.

Sources

[Mischel, W., Ebbesen, E. B., & Zeiss, A. R. (1972). Cognitive and attentional mechanisms in delay of gratification. Journal of Personality and Social Psychology, 21(2), 204-218. DOI: 10.1037/h0032198](https://psycnet.apa.org/record/1972-20631-001) --- original mechanisms paper.
[Shoda, Y., Mischel, W., & Peake, P. K. (1990). Predicting adolescent cognitive and self-regulatory competencies from preschool delay of gratification. Developmental Psychology, 26(6), 978-986. DOI: 10.1037/0012-1649.26.6.978](https://depts.washington.edu/shodalab/wordpress/wp-content/uploads/2015/05/1990.PredictingAdolescent_Shoda.pdf) --- the famous predictive paper.
[Watts, T. W., Duncan, G. J., & Quan, H. (2018). Revisiting the Marshmallow Test: A Conceptual Replication Investigating Links Between Early Delay of Gratification and Later Outcomes. Psychological Science, 29(7), 1159-1177. DOI: 10.1177/0956797618761661](https://journals.sagepub.com/doi/abs/10.1177/0956797618761661) --- 2018 replication.
[Falk, A., Kosse, F., & Pinger, P. (2020). Re-Revisiting the Marshmallow Test. Psychological Science, 31(1), 100-104. DOI: 10.1177/0956797619861720](https://journals.sagepub.com/doi/10.1177/0956797619861720) --- critique of Watts.
[Watts, T. W., & Duncan, G. J. (2020). Controlling, Confounding, and Construct Clarity: A Response to Criticisms of “Revisiting the Marshmallow Test.” Psychological Science, 31(1), 105-108. DOI: 10.1177/0956797619893606](https://journals.sagepub.com/doi/10.1177/0956797619893606) --- Watts and Duncan’s reply.
[Sperber, J. F., Vandell, D. L., Duncan, G. J., & Watts, T. W. (2024). Delay of gratification and adult outcomes: The Marshmallow Test does not reliably predict adult functioning. Child Development, 95(6), 2015-2029. DOI: 10.1111/cdev.14129](https://srcd.onlinelibrary.wiley.com/doi/10.1111/cdev.14129) --- 2024 adult-outcomes study.

This article is part of an ongoing series on famous behavioral-science studies that did not survive replication. Other entries cover the Stanford Prison Experiment, power posing, ego depletion, the bystander effect, and the Mozart Effect. The full hub lives at /replication-crisis/.

FAQ

Is the marshmallow test “debunked”? The strong popular version (preschool willpower causes adult success) is not supported by rigorous evidence. The weaker version (preschool delay-of-gratification correlates with later outcomes, partly through shared family-environment factors) is supported. “Debunked” overstates it; “the popular version was wrong, the real story is more nuanced and less interesting” is more accurate.

Should I still teach my child to delay gratification? The evidence about whether teaching delay-of-gratification produces measurable life-outcome benefits is weak. Teaching it for its own sake (because patience and self-control are useful skills in everyday life) is fine; doing it because you believe a few extra minutes of marshmallow-resistance will make your child more successful at age twenty-five is not supported by the data.

What does this say about “non-cognitive skills” or “grit” research generally? The broader claim that non-cognitive skills predict outcomes has had a tougher empirical run than the popular versions suggest. Angela Duckworth’s “grit” construct, Carol Dweck’s growth mindset, and the marshmallow test all share a common pattern: a charismatic original finding, a cultural breakout, and then disappointing replication and meta-analytic results. This doesn’t mean non-cognitive skills don’t matter at all; it does mean the strong popular claims about them have not held up.

What about the famous YouTube videos of children taking the test? Those are real. They’re charming. They show that children differ in their ability to wait at age four. None of that is in dispute. The disputed part is whether those differences predict adult life outcomes through a causal pathway that runs through “willpower.” The current evidence says: not really, or only weakly.

If the marshmallow test doesn’t predict outcomes, what does? For long-run life outcomes, the strongest predictors in the developmental literature are family socioeconomic status, parental education, early cognitive ability, and the cumulative quality of early childhood environment. None of these are as cute or simple as a marshmallow on a plate. They are also much harder to intervene on, which is part of why the marshmallow test was so culturally appealing --- it suggested an easy diagnostic for a problem that turns out to be much harder.
