The Cognitive Reflection Test: A 3-Question Anti-Example That Predicts Decision Quality

Atticus Li

Shane Frederick’s 3-question Cognitive Reflection Test (CRT) is one of the cleanest anti-examples in behavioral science — a tiny measure that robustly predicts heuristics-and-biases susceptibility, intertemporal patience, paranormal belief, and fake-news susceptibility across decades.

By Atticus Li · May 26, 2026 · 19 min read

A bat and a ball cost $1.10

Try this before you read any further. Don’t reach for a pen. Don’t open a calculator. Just answer fast, the way you’d answer if a friend asked you in the kitchen:

A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?

If your first instinct said 10 cents, you have a lot of company. Roughly half of the Princeton, Harvard, and MIT undergraduates Shane Frederick tested in Frederick (2005) — the paper that introduced the Cognitive Reflection Test — gave that answer. It is wrong. If the ball cost 10 cents, the bat would cost $1.10 (a dollar more), and the total would be $1.20, not $1.10. The correct answer is 5 cents. The bat costs $1.05. They sum to $1.10. The intuitive answer is wrong, and the reflective answer takes about ten seconds of arithmetic to surface.
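The check in the paragraph above can be written out as a short Python sketch. The function name `solve_bat_and_ball` and its parameters are illustrative assumptions, not anything from Frederick's paper. Prices are kept in whole cents to avoid floating-point rounding.

```python
def solve_bat_and_ball(total_cents=110, difference_cents=100):
    """Return (ball, bat) prices in cents.

    Solves ball + bat = total and bat = ball + difference,
    which reduces to ball = (total - difference) / 2.
    """
    remainder = total_cents - difference_cents
    if remainder < 0 or remainder % 2 != 0:
        raise ValueError("no whole-cent solution for these inputs")
    ball = remainder // 2
    bat = ball + difference_cents
    return ball, bat


ball, bat = solve_bat_and_ball()
print(f"ball = {ball} cents, bat = {bat} cents, total = {ball + bat} cents")

# The intuitive answer fails the same check: a 10-cent ball implies a 110-cent bat.
intuitive_ball = 10
intuitive_total = intuitive_ball + (intuitive_ball + 100)
print(f"intuitive answer implies a total of {intuitive_total} cents")
```

The intuitive answer only satisfies one of the two constraints (the difference), which is why it overshoots the stated total.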

That tiny gap — between the answer that arrives instantly and the answer that arrives after pausing to check — is what the CRT measures. And it turns out to measure quite a lot.

In a literature where most behavioral-science findings have come under attack during the replication crisis (priming, ego depletion, power posing, growth mindset, implicit bias measurement — the casualty list is long and growing), the Cognitive Reflection Test sits in an unusual position. It is one of the few canonical behavioral measures that has held up across two decades, hundreds of studies, dozens of countries, and a meta-review’s worth of predictive-validity tests. The CRT is, in other words, an anti-example: a behavioral measure that actually does what it claims to do.

This article is about why the CRT works, what it predicts, where its limits are, and how a strategist might use it without falling into the familiarity trap that retired the original 3-item version from serious research use.

The three questions in full

Frederick’s 2005 instrument has exactly three items. There is no scoring rubric beyond “number correct out of three.” This is not a typo. The original CRT really is that short.

Item 1: Bat and ball. A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?

Intuitive (wrong) answer: 10 cents.
Reflective (correct) answer: 5 cents.

Item 2: Widgets and machines. If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?

Intuitive (wrong) answer: 100 minutes.
Reflective (correct) answer: 5 minutes.

Item 3: Lily pads. In a lake, there is a patch of lily pads. Every day, the patch doubles in size. If it takes 48 days for the patch to cover the entire lake, how long would it take for the patch to cover half of the lake?

Intuitive (wrong) answer: 24 days.
Reflective (correct) answer: 47 days.

Each question shares the same structure. There is an answer that pattern-matches to the surface of the problem (subtract, scale, halve). And there is an answer that requires you to pause, override the pattern match, and do the arithmetic. The whole instrument is a forced choice between System 1 (fast, automatic, pattern-matching) and System 2 (slow, deliberate, rule-based) — Kahneman’s dual-process vocabulary, which the CRT operationalizes about as cleanly as anything in the literature.
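For items 2 and 3, the override arithmetic is just as short. The sketch below is one way to write it in Python. The function names are hypothetical, and it assumes each machine works independently at a constant rate and that the lily patch doubles exactly once per day.

```python
def minutes_for_widgets(machines, widgets, minutes_per_widget=5.0):
    # From the item: 5 machines make 5 widgets in 5 minutes,
    # so a single machine needs 5 minutes per widget.
    widgets_per_machine = widgets / machines
    return widgets_per_machine * minutes_per_widget


def first_day_at_least(fraction, full_day=48):
    # Coverage on day d is 2 ** (d - full_day): the patch doubles daily
    # and reaches 1.0 (the whole lake) on full_day.
    day = full_day
    while 2.0 ** (day - 1 - full_day) >= fraction:
        day -= 1  # step back one day while the earlier day still covers enough
    return day


widgets_answer = minutes_for_widgets(machines=100, widgets=100)
half_lake_day = first_day_at_least(0.5)
print(f"100 machines, 100 widgets: {widgets_answer:g} minutes")
print(f"half the lake is covered on day {half_lake_day}")
```

In both cases the lure comes from scaling the wrong quantity: total time instead of time per machine, and days instead of coverage.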

Three questions. Each takes about a minute. The total score — 0, 1, 2, or 3 — is what Frederick called a person’s “cognitive reflection.”
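Scoring is simple enough to sketch in a few lines of Python. The answer key comes from the three items above. The `score_crt` function and its item labels are hypothetical, and tallying intuitive answers separately is an added convenience, not part of Frederick's "number correct" rule.

```python
# Answer key from the three items; units are cents, minutes, and days.
REFLECTIVE = {"bat_and_ball": 5, "widgets": 5, "lily_pads": 47}
INTUITIVE = {"bat_and_ball": 10, "widgets": 100, "lily_pads": 24}


def score_crt(responses):
    """Count reflective (correct), intuitive (lure), and other answers."""
    counts = {"reflective": 0, "intuitive": 0, "other": 0}
    for item, correct in REFLECTIVE.items():
        answer = responses.get(item)
        if answer == correct:
            counts["reflective"] += 1
        elif answer == INTUITIVE[item]:
            counts["intuitive"] += 1
        else:
            counts["other"] += 1  # wrong, but not the intuitive lure (or missing)
    return counts


example = score_crt({"bat_and_ball": 10, "widgets": 5, "lily_pads": 47})
print(example)  # the CRT score itself is example["reflective"]
```

Keeping the lure count alongside the score separates a participant who fell for the intuitive answer from one who simply made an arithmetic slip.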

What Frederick (2005) found

Frederick administered the CRT to about 3,400 participants across 35 separate studies — undergraduates at elite universities, online panels, web-recruited samples. The empirical core of Frederick (2005) is that CRT scores predict, with surprising consistency, performance on the major heuristics-and-biases paradigms that Kahneman and Tversky spent thirty years building.

Specifically, higher CRT scores predicted:

Less anchoring. When asked to estimate the height of the tallest redwood after seeing an arbitrary high or low anchor, high-CRT participants were less swayed by the anchor.
Less framing-effect susceptibility. The Asian disease problem (lives saved vs. lives lost framing) produces the classic preference reversal in low-CRT participants but a much smaller reversal in high-CRT participants.
Less conjunction-fallacy susceptibility. On Tversky and Kahneman’s Linda-the-bank-teller problem, high-CRT participants were more likely to correctly identify “Linda is a bank teller” as more probable than “Linda is a bank teller and a feminist.”
More patience in intertemporal choice. Given a choice between $3,400 today and $3,800 in a month, low-CRT participants preferred the immediate payoff; high-CRT participants waited.
Different risk preferences. High-CRT participants were more willing to gamble in the gain domain (where most people are risk-averse) and less willing to gamble in the loss domain (where most people are risk-seeking) — closer to expected-utility maximization, further from prospect-theory behavior.
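To see why the intertemporal item separates patient from impatient respondents, it helps to work out what waiting pays. The Python sketch below uses the dollar amounts from the list above. Annualizing by compounding the one-month return over twelve months is an illustrative assumption, not a calculation taken from Frederick's paper.

```python
now, later = 3400.0, 3800.0

# Return earned by waiting one month for the larger payoff.
monthly_return = later / now - 1

# Assumption: the same monthly return compounds for 12 months.
annualized = (1 + monthly_return) ** 12 - 1

print(f"one-month return from waiting: {monthly_return:.1%}")
print(f"implied annualized rate:       {annualized:.0%}")
```

Waiting pays roughly 12% in a single month, so choosing the immediate payoff implies a very steep personal discount rate.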

The correlations were not huge. Most were in the r ≈ 0.30 to r ≈ 0.50 range. But the consistency was. A three-question instrument that takes three minutes to administer was predicting outcomes on five different decision paradigms in the same direction, in study after study. By the standards of personality psychology — where instruments with 60 items struggle to hit r = 0.30 on outcomes that aren’t the same questionnaire reworded — that is striking.

For comparison: SAT scores, IQ scores, and need-for-cognition scales also predict heuristics-and-biases performance, but the CRT does so with a fraction of the items, and in regressions, the CRT often carries variance that IQ and SAT do not. The instrument is not a backdoor IQ test. It is measuring something narrower and more specific: the disposition to override an intuitive response in favor of a reflective one. Stanovich and West later called this disposition “miserly information processing,” and the trait the CRT measures is a tendency to do less of it.

Why dual-process theory predicts the CRT works

The theoretical scaffolding behind the CRT is Daniel Kahneman’s dual-process model, popularized in Thinking, Fast and Slow. In simplified form: the mind runs two systems. System 1 is fast, automatic, parallel, low-effort, and pattern-matching. System 2 is slow, sequential, effortful, and rule-following. Most cognition runs on System 1; System 2 intervenes only when System 1 produces something that feels wrong, or when external constraints (a teacher, a calculator, a deadline) force it.

The CRT works by constructing problems where System 1’s default answer is wrong and System 2’s correct answer requires only a small amount of work. The bat-and-ball question is not algebraically hard. A bright eighth-grader could solve it in under a minute with pen and paper. What makes it diagnostic is that the wrong answer arrives before you have finished reading the question. To get the right answer, you have to notice that the intuitive answer is wrong, refuse to commit to it, and force System 2 to take over.

That moment — the “wait, that can’t be right, let me check” — is the entire psychological construct. People who do it consistently across all three items score 3/3. People who never do it score 0/3. People in between score 1/3 or 2/3. And that 0-3 score, it turns out, generalizes to a striking range of real-world outcomes.

For a strategist, the value of this framing is that it makes the CRT’s predictive logic clear. The CRT is not asking “how smart are you?” or “how educated are you?” It is asking “when an intuitive answer pops into your head, do you stop and check it?” That single behavioral disposition accounts for much of what separates analytic from intuitive decision-making in adults, and much of what makes someone susceptible or resistant to a wide range of biases.

Pennycook’s extensions: paranormal belief and fake-news susceptibility

If the CRT only predicted performance on other lab puzzles, it would be an elegant curiosity. The work that made it consequential came later, when Gordon Pennycook and colleagues, first at the University of Waterloo and later at Yale, extended the instrument to outcomes that matter outside the lab.

The first extension was paranormal and religious belief. In Pennycook et al. (2012), the authors administered the CRT alongside measures of supernatural belief (belief in God, an afterlife, angels, ghosts, ESP, telekinesis) to a sample of nearly 900 participants. CRT score was a robust negative predictor of paranormal belief: people who reflected more believed less in supernatural claims, controlling for age, education, and political orientation. The effect was not enormous, but it replicated across multiple samples and held up against IQ and education controls.

The second extension was political and informational. In Pennycook & Rand (2019), the authors showed CRT score predicts the ability to distinguish real news from fake news — and, crucially, the relationship runs in the same direction for both Republicans and Democrats. The paper’s title — “Lazy, not biased” — captures the central finding. Susceptibility to partisan fake news is better explained by failure to engage analytic reasoning than by motivated reasoning that confirms ideological priors. People who score 3/3 on the CRT are better at identifying fake news headlines as fake, regardless of whether the headline is pro-Republican or pro-Democrat. People who score 0/3 are worse at it, again regardless of direction.

This is one of the more politically loaded findings in modern behavioral science, and it has been contested. But the basic empirical pattern has replicated across multiple labs and samples: CRT score predicts truth-discrimination on news headlines, with effect sizes around r = 0.2 to r = 0.4. A test built from puzzles like the bat-and-ball problem turns out to predict whether you’ll fall for a fake news headline on Facebook.

The 7-item extension: Toplak (2014)

The original 3-item CRT had a problem that did not become visible until the instrument went viral. By 2010, the three questions were everywhere — popular books, viral websites, internet quizzes, undergraduate psychology classes. Anyone who had taken an intro psych class or read Thinking, Fast and Slow had seen the bat-and-ball problem. And once you’ve seen it, the CRT no longer measures what it’s supposed to measure: it measures your memory.

Toplak, West, and Stanovich (2014) addressed this by extending the CRT to seven items, adding four new questions that preserved the original structure but had not yet entered the popular bloodstream. The new items work the same way: each has an intuitive-but-wrong answer and a reflective-but-correct answer, and each requires a deliberate override to get right. Examples include:

“If John can drink one barrel of water in 6 days, and Mary can drink one barrel of water in 12 days, how long would it take them to drink one barrel of water together?” (Intuitive: 9 days. Correct: 4 days.)
“Jerry received both the 15th highest and the 15th lowest mark in the class. How many students are in the class?” (Intuitive: 30. Correct: 29.)
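Both example items reduce to one line of arithmetic once the intuitive answer is set aside. A minimal Python sketch, with hypothetical function names, assuming constant drinking rates and no tied marks in the class:

```python
from fractions import Fraction


def days_together(*days_alone):
    # Each person's rate is 1/days barrels per day; rates add when they share a barrel.
    combined_rate = sum(Fraction(1, d) for d in days_alone)
    return 1 / combined_rate


def class_size(rank_from_top, rank_from_bottom):
    # Students ranked above + students ranked below + the student counted in both ranks.
    return (rank_from_top - 1) + (rank_from_bottom - 1) + 1


barrel_days = days_together(6, 12)
students = class_size(15, 15)
print(f"John and Mary together: {barrel_days} days")
print(f"Students in Jerry's class: {students}")
```

The lures are averaging the two times (9 days) and adding the two ranks (30). The correct answers come from adding rates and from counting Jerry only once.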

The 7-item CRT (CRT-7) has predictive validity equivalent to the original 3-item CRT, with the advantage of including items most participants haven’t pre-memorized. In the years after 2014, serious research largely migrated to the CRT-7 or to numeracy-extended versions (Cokely et al. 2012’s Berlin Numeracy Test, which mixes CRT-style items with probabilistic reasoning questions).

For a strategist using the CRT today, the lesson is clean: do not use the original 3-item version. The familiarity effect is real, large, and well-documented. Use the CRT-7, or use a custom set of CRT-structure items that participants in your population have not previously encountered.

The familiarity-effect limitation: Stieger (2016)

The clearest documentation of how badly the original CRT now fails comes from Stieger and Reips (2016), who explicitly tested how familiarity with the items distorts scores. The authors administered the CRT to participants who were then asked, after completing it, whether they had encountered any of the questions before. About 30% of their sample reported prior exposure to at least one item. Among that subgroup, scores were substantially inflated, and the correlations with the dependent variables that make the CRT useful (heuristics-and-biases performance, numeracy, education) were attenuated.

The implication is that the original CRT’s predictive validity in any modern sample is now contaminated by item familiarity. If even 30% of your sample has seen the bat-and-ball problem, that 30% will score systematically higher than their true cognitive-reflection disposition would predict, and the validity of the instrument as a whole drops.
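A toy simulation makes the mechanism concrete. Everything in the Python sketch below is a labeled assumption rather than data from Stieger and Reips: the latent "reflection" trait, the item noise, the outcome's 0.5 weight on the trait, and the simplification that a familiar participant recalls all three answers. It shows only the direction of the effect, not its real-world size.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid version-specific helpers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate(n=10_000, familiar_share=0.30, seed=0):
    rng = random.Random(seed)
    clean, observed, outcome = [], [], []
    for _ in range(n):
        reflection = rng.gauss(0, 1)  # hypothetical latent disposition
        # An item is answered correctly when disposition plus item noise is positive.
        score = sum(reflection + rng.gauss(0, 1) > 0 for _ in range(3))
        # Assumption: anyone familiar with the items recalls all three answers.
        familiar = rng.random() < familiar_share
        clean.append(score)
        observed.append(3 if familiar else score)
        # A downstream measure that depends partly on the same disposition.
        outcome.append(0.5 * reflection + rng.gauss(0, 1))
    return {
        "mean_clean": sum(clean) / n,
        "mean_observed": sum(observed) / n,
        "r_clean": pearson(clean, outcome),
        "r_observed": pearson(observed, outcome),
    }


result = simulate()
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the observed mean score rises while its correlation with the outcome falls, because the familiar participants' scores no longer carry information about the trait.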

This is not a failed replication. The CRT did not break. The construct (cognitive reflection, miserly information processing, System 2 override) is still real, still measurable, and still predictive. What broke is the specific instrument: three questions that became too famous to remain diagnostic. The fix is straightforward — use newer items — but it took the field a decade to fully internalize it, and a lot of mid-2010s CRT studies are now considered noisy because they used the original 3-item version on populations that had encountered it.

The story matters because it illustrates a general principle in behavioral measurement that the replication crisis has hammered home: instruments wear out as they become popular. A measure that works in the lab when no one has seen the items can lose validity once the items enter popular culture. This is true of the CRT. It is also true of the Implicit Association Test (whose stimuli are now widely circulated), the Stroop test (whose mechanic is in elementary-school textbooks), and many of the classic priming paradigms (whose effects have not replicated, in part because participants now know what to expect). For a strategist, the lesson is to take any classic behavioral measure’s reported effect size with a grain of salt if the measure has been famous for more than ten years.

Real-world predictive validity

The Toplak et al. (2017) meta-review and subsequent work have established that CRT scores predict a range of real-world outcomes beyond lab puzzles. The effect sizes are typically modest — correlations in the r = 0.10 to r = 0.30 range — but the consistency across outcome categories is what makes them interesting.

Outcomes that CRT scores have been shown to predict include:

Financial decision-making. Higher CRT scores predict less susceptibility to financial fraud and pyramid schemes, better performance on financial-literacy assessments, and lower rates of credit-card delinquency.
Health behavior. Higher CRT scores predict more accurate risk assessment in medical decisions (e.g., interpreting cancer screening statistics, understanding base rates in diagnostic testing).
Belief in conspiracy theories. Multiple studies have found negative correlations between CRT score and endorsement of conspiracy theories, ranging from JFK assassination claims to anti-vaccine narratives.
Vulnerability to “bullshit.” Pennycook’s work on “pseudo-profound bullshit receptivity” (the tendency to rate randomly generated profound-sounding statements as deeply meaningful) found CRT score to be one of the strongest predictors of low bullshit receptivity.

None of these effects is large in the sense that CRT score alone explains most of the variance. But the cross-domain consistency is unusual. A three-minute measure built from bat-and-ball-style puzzles also predicts how susceptible you are to financial scams, medical-statistics confusion, conspiracy beliefs, and randomly generated nonsense. That cross-domain consistency is the signature of a real underlying trait, not a measurement artifact.

For comparison, most personality measures (Big Five facets, grit, growth mindset, emotional intelligence) struggle to demonstrate this kind of cross-domain consistency at all. The CRT is one of a small number of brief behavioral measures that genuinely passes that bar.

Strategist application: calibrating team decision quality

For a senior decision-maker, the practical question is whether the CRT is useful as a tool, not just an interesting research finding. The honest answer is: yes, with caveats.

The first useful application is diagnostic. If you’re evaluating a forecasting team, a research function, a product analytics group, or any team whose output depends on overriding plausible-but-wrong intuitions, the CRT-7 (not the original 3-item version) is one of the cheapest signals available. It takes ten minutes to administer, it correlates with the analytic-disposition trait that distinguishes calibrated forecasters from confident-but-wrong ones, and it’s validated across dozens of populations.

The caveat is that it should be one signal among several, not the basis for any individual decision. CRT scores have within-person variance, are sensitive to mood and fatigue, and (as Stieger and Reips documented) are vulnerable to familiarity contamination. Used as a screen for analytical disposition in conjunction with track record, calibration on forecasting questions, and domain-relevant work samples, the CRT-7 adds incremental signal. Used as a one-shot screen, it adds noise.

The second useful application is meta-cognitive. If you administer the CRT to yourself and your team — anonymously, low-stakes, just to see — the conversation that follows is more valuable than the score. The questions force people to confront the gap between their first answer and the correct answer. That confrontation is the whole point of the instrument. It is also the entry point for a conversation about which decisions in your business are vulnerable to the same pattern: an obvious answer that arrives quickly, and a correct answer that requires somebody to stop and check.

The third application is process design. Once a team has internalized that intuitive answers can be wrong, the natural next step is to build review processes that force System 2 engagement on the decisions that matter. This is what red-teaming, pre-mortems, decision journals, and forecasting tournaments are all variations of: institutional mechanisms for forcing reflection on decisions whose intuitive answers would otherwise go unchallenged. The CRT does not, by itself, improve decision-making. But it provides a vocabulary for explaining why those processes matter, and a baseline measure for whether a team is more or less prone to the kind of cognitive miserliness those processes are designed to correct.

What the CRT does not do is teach you to be more reflective. There is no evidence that CRT scores improve from training. The trait is stable. What does change is the environment — whether the decisions you face have built-in checkpoints that force a pause, or whether they reward speed over accuracy. For most consequential business decisions, the environment is the lever, not the individual. The CRT is a diagnostic, not a curriculum.

Sources

Frederick, S. (2005). Cognitive reflection and decision making. Journal of Economic Perspectives, 19(4), 25-42. DOI: 10.1257/089533005775196732
Toplak, M. E., West, R. F., & Stanovich, K. E. (2014). Assessing miserly information processing: An expansion of the Cognitive Reflection Test. Thinking & Reasoning, 20(2), 147-168. DOI: 10.1080/13546783.2013.844729
Pennycook, G., Cheyne, J. A., Seli, P., Koehler, D. J., & Fugelsang, J. A. (2012). Analytic cognitive style predicts religious and paranormal belief. Cognition, 123(3), 335-346. DOI: 10.1016/j.cognition.2012.03.003
Pennycook, G., & Rand, D. G. (2019). Lazy, not biased: Susceptibility to partisan fake news is better explained by lack of reasoning than by motivated reasoning. Cognition, 188, 39-50. DOI: 10.1016/j.cognition.2018.06.011
Stieger, S., & Reips, U.-D. (2016). A limitation of the Cognitive Reflection Test: Familiarity. PeerJ, 4, e2395. DOI: 10.7717/peerj.2395

Related articles

Representativeness heuristic — the System 1 pattern-matching that CRT items are designed to override
Base-rate neglect — a specific failure mode that correlates with low CRT scores
The conjunction fallacy and the Linda problem — one of the bias paradigms Frederick (2005) showed the CRT predicts
Framing effect — another paradigm where high CRT scores attenuate the classic finding
Tetlock’s superforecasting — what calibrated, reflective decision-making looks like in practice

FAQ

Q: Is the original 3-item CRT still valid? A: Partially. It still measures something real — there is no evidence the underlying construct of cognitive reflection has changed. But because the items are now widely known, scores from any modern sample are contaminated by familiarity. For research and applied use, the 7-item CRT-7 (Toplak et al. 2014) is the current standard.

Q: Does the CRT just measure IQ? A: No. CRT scores correlate with IQ at around r = 0.3 to r = 0.5, but in regressions predicting heuristics-and-biases performance, the CRT carries variance that IQ does not. The instrument measures a more specific disposition — the willingness to override intuitive responses — that is related to but separable from general intelligence.

Q: Can you train people to score higher on the CRT? A: Not durably. CRT scores are stable across testing sessions in adults, and training studies have not produced reliable score improvements. The trait the CRT measures appears to be a relatively fixed cognitive disposition. What can change is the environment: decision processes that force reflection (red-teaming, pre-mortems, structured forecasting) compensate for individual variation in cognitive reflection.

Q: How does the CRT relate to Kahneman’s dual-process theory? A: The CRT is the cleanest single instrument for measuring the System 1 / System 2 distinction. Each item is engineered to produce a wrong System 1 answer and a correct System 2 answer. Score reflects the consistency with which a participant overrides System 1 in favor of System 2.

Q: Has the CRT replicated cross-culturally? A: Yes, in dozens of countries and many translations. The predictive relationships hold across Western, East Asian, and developing-country samples, with some variation in absolute scores but consistent direction of effects.

Q: Should I use the CRT to hire? A: Cautiously. The CRT is one signal among many and should not be used as a sole screen. Used in conjunction with work samples, structured interviews, and track record, the CRT-7 (not the original 3-item version) can add incremental signal for roles where analytical disposition matters — forecasting, research, analytics, risk assessment. It should not be used in roles where the construct is irrelevant, and it should never be used in a way that creates legal exposure for hiring discrimination.

Q: Did the CRT survive the replication crisis? A: Yes, more cleanly than almost any other behavioral measure of the era. The construct, the dual-process framing, the predictive validity across heuristics-and-biases paradigms, the cross-cultural validation — all of it has held up across two decades and hundreds of studies. The one real limitation, familiarity effects on the original three items, was identified, documented, and addressed with the 7-item extension. The CRT is one of the cleanest anti-examples in the replication-crisis literature: a behavioral measure that works.

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified in behavioral economics. Led 100+ in-house experiments at NRG in 2025, with project evidence and limits documented in the case studies.
