Most behavioral-science findings cataloged in this hub did not survive scrutiny. Tetlock’s superforecasting research did. Large preregistered samples, multi-year tournaments, identifiable traits, and independent replication --- here is why this one is different.

If you have read through this hub in any depth, you have watched a long line of canonical behavioral-science findings get systematically dismantled. Power posing collapsed under Carney’s own recantation. Ego depletion failed the Hagger et al. 2016 multi-lab replication. Money priming evaporated in preregistered replications. The Stanford Prison Experiment turned out to be coached theater. The marshmallow test shrank to a household-income confound. Bargh’s elderly-walking priming did not replicate. The bystander effect’s Kitty Genovese mythology turned out to be journalistic fiction. Implicit-association testing showed weak predictive validity at the individual level. Grit collapsed into conscientiousness with worse measurement. Multiple intelligences was never operationalized into testable form in the first place.

A rational executive reading this hub could plausibly conclude that the entire genre of academic behavioral research is too unreliable to use for serious decisions. That conclusion would be wrong, and this article exists to demonstrate why.

Because in the same period that produced all those collapses, a few research programs in behavioral and decision science kept producing findings that held up. They held up across independent replication. They held up at large sample sizes. They held up under preregistered scoring. They held up across years and across application domains. The default-effect literature, covered in this hub’s companion anti-example, is one of them. The Tetlock-Mellers superforecasting program is another.

That program is the empirical study of geopolitical forecasting accuracy: how good people actually are at predicting world events, what traits and behaviors distinguish the consistently best forecasters from the consistently mediocre ones, and whether those traits can be cultivated through training. The headline finding is that yes, identifiable individuals consistently outperform averaged groups, prediction markets, and even classified intelligence analysts by margins that are large enough to matter operationally; the traits and behaviors that distinguish them are identifiable, measurable, and partially teachable; and the entire result has held up across multiple independent forecasting tournaments under preregistered scoring rules that make cherry-picking impossible.

This is the second anti-example article in a hub full of takedowns. It exists for three reasons. First, calibration --- a reader who walks away believing “all behavioral science is broken” has overcorrected, and that error is just as costly as the original error of overcrediting it. Second, decision-usefulness --- the superforecasting program produced concrete, deployable practices for anyone whose job involves making consequential forecasts under uncertainty, which describes essentially every executive role. And third, intellectual honesty --- if you spend an entire hub criticizing weak behavioral research, you owe readers an account of what strong behavioral research looks like by contrast.

So here is the case for Tetlock’s superforecasting program, as honest as I can make it, including the legitimate critiques.

The Good Judgment Project --- IARPA’s Tournament

The superforecasting program did not begin as a pure academic exercise. It began as one of five competing teams in a U.S. intelligence-community-funded forecasting tournament called the Aggregative Contingent Estimation (ACE) program, run by the Intelligence Advanced Research Projects Activity (IARPA) from 2011 through 2015.

IARPA’s premise was empirical and unsentimental. The intelligence community wanted to know which methodologies for aggregating human judgments about uncertain world events actually produced the most accurate forecasts. Rather than commission another literature review, IARPA ran a four-year competitive tournament. Five academic teams were given roughly the same set of questions about geopolitical events --- “Will Greece leave the eurozone by date X?”, “Will North Korea conduct a nuclear test by date Y?”, “Will the Assad regime fall by date Z?” --- and asked to produce probabilistic forecasts. Forecasts were scored using Brier scores, a proper scoring rule that penalizes both overconfidence and underconfidence. “Proper” has a precise meaning here: a forecaster gets the best expected score only by reporting their actual best probability estimate, so there is nothing to gain from shading a forecast in either direction.

The Brier score matters because IARPA fixed it as the scoring rule before any forecasts were made. There is no way for a research team to retroactively decide that a forecast was “really” right or to redefine the prediction after the event. The question was specified up front in unambiguous terms with a verifiable resolution date and outcome variable. The forecast was a single probability. The score was a deterministic function of the probability and the outcome. The design leaves no room for cherry-picking after the fact.
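To make the scoring rule concrete, here is a minimal Python sketch of the binary Brier score and of what "proper" means in practice. The function names and the 70% belief are illustrative assumptions, not the tournament's tooling, and scaling conventions differ (some formulations sum over both outcomes and range from 0 to 2 rather than 0 to 1).

```python
def brier_score(forecast: float, outcome: int) -> float:
    """Squared error between a probability forecast and a 0/1 outcome (lower is better)."""
    return (forecast - outcome) ** 2


def expected_brier(reported: float, true_prob: float) -> float:
    """Expected Brier score if the event really occurs with probability true_prob."""
    return (true_prob * brier_score(reported, 1)
            + (1 - true_prob) * brier_score(reported, 0))


# Hypothetical forecaster who privately believes the event has a 70% chance.
belief = 0.7

# Try every report from 0% to 100% and keep the one with the best expected score.
candidates = [i / 100 for i in range(101)]
best_report = min(candidates, key=lambda r: expected_brier(r, belief))

print("Best report given a 70% belief:", best_report)
```

Under these assumptions the report that minimizes the expected score is the forecaster's own belief, which is the sense in which shading a forecast up or down cannot help.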

This is, by itself, a methodological feature that distinguishes the tournament from almost everything cataloged in this hub. Most failed behavioral findings had researcher degrees of freedom in how the dependent variable was operationalized, how outliers were excluded, how analyses were conducted. The IARPA tournament had none. The question was the question, the forecast was the forecast, the outcome was the outcome, the score was the score.

The Good Judgment Project, led by Philip Tetlock and Barbara Mellers at the University of Pennsylvania, was one of the five teams. Over four years they recruited and assessed approximately 5,000 volunteer forecasters, ran experimental manipulations on subsets of those forecasters (training conditions, team versus individual conditions, aggregation method conditions), and tracked forecast accuracy at the individual level question by question. By the end of the second year of the tournament, the Good Judgment Project was so far ahead of the other four academic teams in aggregate accuracy that IARPA dropped the other four teams from the official tournament. The Good Judgment Project’s best subset of forecasters --- the ones eventually labeled “superforecasters” --- outperformed even the U.S. intelligence community’s own classified-information-equipped analysts on the same questions, working from open-source information alone.

That is an unusual outcome. In most domains, behavioral research that claims to identify high-performing experts produces effect sizes that are small, fragile, and disappear under replication. The Good Judgment Project’s headline accuracy advantage was large, was measured under preregistered rules that made gaming impossible, and was observed against a benchmark (classified intelligence analysts) that has no incentive to lose.

What Made The Methodology Rigorous

Before we get to the substantive findings about what superforecasters actually do, it is worth being specific about what makes the underlying research design unusually credible. A reader trained on this hub’s catalog of behavioral-science failures should be skeptical until shown evidence to the contrary. The evidence to the contrary is the design itself.

Large samples. The 5,000-forecaster pool is two orders of magnitude larger than the typical underpowered behavioral-science study that fails to replicate. Within that pool, individual forecasters typically made hundreds of probability estimates over multi-year periods. The forecast-level dataset for the Good Judgment Project’s four years of IARPA participation included on the order of one million individual forecasts. Statistical power was not a constraint; the effects being identified are visible without requiring statistical heroics.

Preregistered scoring rule. As above. The Brier scoring rule was fixed by IARPA at the start of the tournament. The questions were specified by IARPA in advance with explicit resolution criteria and resolution dates. The forecast was a single number on each question. There is no analytic flexibility to exploit.

Multi-year duration. The tournament ran for four years. Individual forecasters’ accuracy was tracked across multiple cohorts of questions resolved over multiple years. This makes it very unlikely that observed accuracy was the product of lucky guessing on a single batch of questions --- a forecaster who maintained top-decile accuracy across four years of questions has demonstrated something that is hard to explain by chance.
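A back-of-envelope calculation shows why sustained top-decile performance is hard to attribute to luck. The sketch below is not the project's own analysis. It assumes that, if rankings were pure luck, each year's ranking would be independent of the others, and it treats all 5,000 forecasters as participating in all four years, which is a simplification.

```python
n_forecasters = 5_000   # approximate pool size cited in the text
years = 4               # tournament duration
p_top_decile = 0.10     # chance of landing in the top 10% in one year by luck alone

# Assumption: under pure luck, each year's ranking is independent of the others.
p_all_years = p_top_decile ** years
expected_lucky = n_forecasters * p_all_years

print(f"P(top decile every year by luck) = {p_all_years:.4f}")
print(f"Expected lucky forecasters in the pool = {expected_lucky:.1f}")
```

Under those assumptions, luck alone would be expected to produce fewer than one four-year top-decile forecaster in the entire pool.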

Identifiable traits and behaviors. Forecasters were surveyed on dispositional and cognitive variables (active open-mindedness, need for cognition, numeracy, fluid intelligence, knowledge of base rates, etc.) and tracked on behavioral variables (frequency of forecast updates, magnitude of updates, granularity of probability estimates, frequency of post-mortem self-review). The research team could then statistically associate measured traits and observed behaviors with measured accuracy --- using accuracy that had already been quantified by the preregistered Brier scoring rule, on questions specified before the forecasters answered them.

Experimental manipulations within the tournament. The Good Judgment Project did not just observe correlational patterns. Tetlock and Mellers ran experimental manipulations on randomly assigned subsets of the forecaster pool: training conditions (some forecasters received probability-training modules; others did not), team conditions (some forecasters worked in collaborative teams; others worked individually), and aggregation conditions (some forecasts were aggregated using simple averaging; others using extremized weighted aggregation). The randomization eliminated selection confounds. The training, teaming, and aggregation effects could be measured against control conditions within the same tournament.
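The difference between the two aggregation conditions can be sketched in a few lines of Python. This is an illustration, not the project's actual pipeline: the power transform below is one common way to extremize a pooled probability, and the exponent 2.5, the weights, and the function names are all assumptions chosen for the example.

```python
def simple_average(probs):
    """Unweighted mean of individual probability forecasts."""
    return sum(probs) / len(probs)


def weighted_average(probs, weights):
    """Mean of forecasts with each forecaster's weight applied."""
    total = sum(weights)
    return sum(p * w for p, w in zip(probs, weights)) / total


def extremize(p, a=2.5):
    """Push p away from 0.5. a > 1 extremizes; a = 1 leaves p unchanged.
    The value 2.5 is an arbitrary illustrative choice."""
    return p ** a / (p ** a + (1 - p) ** a)


# Hypothetical forecasts on one question, with hypothetical weights.
forecasts = [0.65, 0.70, 0.75, 0.60]
weights = [1.0, 2.0, 2.0, 1.0]

plain = simple_average(forecasts)
pooled = weighted_average(forecasts, weights)
final = extremize(pooled)

print(f"simple average:      {plain:.3f}")
print(f"weighted average:    {pooled:.3f}")
print(f"weighted+extremized: {final:.3f}")
```

The intuition behind extremizing is that averaging forecasters who each hold only part of the available information tends to pull the pooled estimate toward 50%, so the transform pushes it back out.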

Independent benchmark comparison. The forecasters’ accuracy was not just compared to other Good Judgment Project conditions; it was compared to the accuracy of the U.S. intelligence community’s classified-information-equipped analysts answering the same questions. This is an unusually strong external benchmark. A research finding that outperforms intelligence analysts with access to classified information is not a finding that can be dismissed as an artifact of academic-sample peculiarities.

Open data and reproducibility. The Good Judgment Project’s data has been made available to outside researchers through multiple papers and through subsequent academic and commercial forecasting tournaments. Subsequent independent research has been able to interrogate the original findings.

You should be reading this list of design features as a contrast against the dominant pattern in this hub. Most failed behavioral findings had small samples, no preregistration, single-study evidence, vague constructs, and no comparison benchmark. The superforecasting program had the opposite of each. That is why the findings hold.

What Mellers 2014 Identified --- Traits Of The Best Forecasters

The flagship academic publication from the Good Judgment Project’s first two years is Mellers, B., Ungar, L., Baron, J., Ramos, J., Gurcay, B., Fincher, K., Scott, S. E., Moore, D., Atanasov, P., Swift, S. A., Murray, T., Stone, E., & Tetlock, P. E. (2014). “Psychological Strategies for Winning a Geopolitical Forecasting Tournament.” Psychological Science, 25(5), 1106–1115. DOI: 10.1177/0956797614524255.

This paper reports the results of three preregistered experimental interventions run on subsets of the Good Judgment Project’s first-year forecaster pool. The interventions were chosen to test three hypotheses: that probability training improves accuracy, that collaborative teaming improves accuracy, and that tracking the best forecasters and weighting their judgments more heavily improves aggregate accuracy.

All three interventions produced statistically significant and operationally meaningful improvements in Brier scores, against randomly assigned control conditions, on the same set of preregistered questions.

The training intervention was a short probability-training module called CHAMPS KNOW, which taught forecasters specific cognitive techniques: how to use base rates as a starting point for any estimate, how to decompose a complex question into more tractable sub-questions, how to seek out reference classes for the event being predicted, how to average across multiple independent estimates, and how to update incrementally on new information rather than oscillating. Forecasters who received the training improved their Brier scores by roughly 10% relative to control forecasters on subsequent questions. A 10% Brier score improvement is operationally meaningful at this scale --- it is the kind of improvement that, summed across thousands of forecasts on geopolitically consequential questions, is worth funding.

The teaming intervention assigned some forecasters to small collaborative teams where members could see and discuss each other’s reasoning and forecasts. The team forecasters improved their Brier scores by approximately 23% relative to forecasters who worked alone. This is consistent with a broader behavioral-economics literature on collective intelligence: when multiple independent judgments are pooled and the pooling mechanism allows for argumentation and updating, the aggregate is typically more accurate than any individual.

The tracking intervention identified the top 2% of forecasters from the first year by Brier score, grouped them together in elite teams for the second year, and weighted their forecasts more heavily in the aggregated team forecast. This subset --- the eventual superforecasters --- maintained top-decile accuracy in the second year. The first-year identification was predictive of second-year performance, which is exactly the test that would falsify the claim that first-year top performers were just lucky. They were not.

A companion paper, Mellers, B., Stone, E., Atanasov, P., Rohrbaugh, N., Metz, S. E., Ungar, L., Bishop, M. M., Horowitz, M. C., Merkle, E., & Tetlock, P. E. (2015). “The Psychology of Intelligence Analysis: Drivers of Prediction Accuracy in World Politics.” Journal of Experimental Psychology: Applied, 21(1), 1–14. DOI: 10.1037/xap0000040, extends the trait-and-behavior analysis to the larger forecaster pool. Mellers et al. report that forecast accuracy is most strongly predicted by three classes of variables: dispositional active open-mindedness (the willingness to seriously consider arguments against one’s current view), behavioral granularity of probability use (the tendency to report probabilities at fine resolution --- e.g., 73% rather than rounding to 75% or 70%), and behavioral frequency and incremental nature of forecast updates (the tendency to update one’s forecast frequently in response to new information, in small steps rather than large reversals).
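One way to see why granularity could matter is to measure what rounding costs a forecaster whose fine-grained probabilities are already well calibrated. The Python sketch below makes exactly that assumption (each stated probability equals the true probability), which is an idealization and not a claim about the tournament data. The probabilities and function names are hypothetical.

```python
def expected_brier(reported, true_prob):
    """Expected squared-error score when the event occurs with probability true_prob."""
    return true_prob * (reported - 1) ** 2 + (1 - true_prob) * reported ** 2


def round_to_step(p, step):
    """Round a probability to the nearest multiple of step (e.g. 0.05 or 0.25)."""
    return round(p / step) * step


# Hypothetical fine-grained forecasts, assumed to be perfectly calibrated.
fine_grained = [0.03, 0.12, 0.37, 0.62, 0.73, 0.81, 0.94]


def mean_expected_brier(step):
    """Average expected score if every forecast is rounded to the given step."""
    scores = [expected_brier(round_to_step(p, step), p) for p in fine_grained]
    return sum(scores) / len(scores)


for step in (0.01, 0.05, 0.25):
    print(f"rounded to nearest {step:.2f}: {mean_expected_brier(step):.4f}")
```

In this idealized setup coarser rounding can only hurt, and the penalty grows with the size of the rounding step. Whether a real forecaster loses accuracy by rounding depends on whether the extra digits carry real information, which is the empirical question the tournament data addressed.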

Each of these is independently measurable. Each is independently predictive. Each is partially modifiable through training. The construct is not the vague hand-waving of much behavioral science; it is a specific empirical claim about specific traits and behaviors, supported by data from a preregistered tournament with a fixed scoring rule.

What Holds Up Under Independent Replication

The single most common failure mode of behavioral-science findings cataloged in this hub is that the original result fails to replicate when run by an independent team with preregistered methods. The superforecasting program has the opposite track record: the headline findings have been confirmed across multiple independent forecasting tournaments and across subsequent academic publications.

A second IARPA-funded tournament, the Hybrid Forecasting Competition (HFC), ran from approximately 2017 through 2019 and tested human-machine hybrid forecasting designs against pure-human baselines. The HFC re-validated the basic finding that top human forecasters identified by historical Brier scores continued to outperform crowd averages on new questions.

Commercial and publicly accessible forecasting platforms that emerged from the Good Judgment Project (Good Judgment Open, the professional Superforecaster panels operated by Good Judgment Inc., the Hypermind market) have continued to track forecaster accuracy on geopolitical, economic, and pandemic-related questions over the period 2015 through the present. The pattern observed in the original IARPA tournament --- a small subset of identifiable forecasters maintaining accuracy substantially above crowd averages --- has been observed continuously since.

The cleanest single-finding follow-up is Atanasov, P., Witkowski, J., Ungar, L., Mellers, B., & Tetlock, P. (2020). “Small Steps to Accuracy: Incremental Belief Updaters Are Better Forecasters.” Organizational Behavior and Human Decision Processes, 160, 19–35. DOI: 10.1016/j.obhdp.2020.02.001. Atanasov and colleagues took the Good Judgment Project’s full forecast-level dataset and tested whether the behavioral pattern most strongly associated with accuracy --- frequent small updates rather than infrequent large updates --- held up under fine-grained statistical scrutiny. It did. Forecasters who updated their probability estimates in small increments in response to new information were systematically more accurate than forecasters who held a position for long stretches and then made large jumps. This is consistent with Bayesian updating done correctly: each piece of new information should multiply your odds by its likelihood ratio, and for most individual news items that ratio is close to 1, so the resulting shift is small. Forecasters who behave as Bayesians, even informally, do better than forecasters who do not.
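The link between Bayesian updating and small steps is easiest to see in the odds form of Bayes' rule, where the posterior odds are the prior odds multiplied by the likelihood ratio of the new evidence. The Python sketch below is illustrative only: the starting estimate and the likelihood ratios are invented, and real forecasters rarely have explicit likelihood ratios to hand.

```python
def update(prob, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prob / (1 - prob)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


# Hypothetical starting estimate and a stream of mildly diagnostic news items.
# A ratio above 1 favors the event; a ratio below 1 counts against it.
estimate = 0.30
news = [1.2, 0.9, 1.3, 1.1, 1.25]

history = [estimate]
for lr in news:
    estimate = update(estimate, lr)
    history.append(estimate)

print(" -> ".join(f"{p:.2f}" for p in history))
```

Each item moves the estimate by a few percentage points, yet the cumulative shift is substantial. A forecaster who ignored the first four items and reacted only to the fifth would have had to make the whole move in one jump.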

The earlier book-length popularization, Tetlock, P. E., & Gardner, D. (2015), Superforecasting: The Art and Science of Prediction (Crown), is not itself a research publication, but the empirical claims in the book are sourced to the underlying academic papers, and the academic papers have held up. This is a different posture from, say, Blink or Outliers or Thinking, Fast and Slow’s priming chapter --- popular books in which the underlying academic citations turned out to be either retracted, unreplicated, or substantially overstated. Tetlock’s book is unusual in the popular behavioral-science genre in that the underlying citations have held up reasonably well in the decade since publication.

Tetlock’s earlier work also matters here for context. Tetlock, P. E. (2005), Expert Political Judgment: How Good Is It? How Can We Know? (Princeton University Press), is the canonical study of pundit and expert forecasting accuracy that established the empirical baseline against which the superforecasters were measured. In that earlier study, Tetlock spent two decades tracking the predictive accuracy of professional political experts --- academic, journalistic, governmental --- and concluded that the average expert was, in Tetlock’s famous phrasing, “roughly as accurate as a dart-throwing chimpanzee.” That earlier finding, which is itself robust and has not been challenged in any serious way, is what makes the superforecaster result so striking: it identifies a small subset of individuals who are quantifiably much better than the established expert class at the established expert class’s own job.

The earlier work also identified the cognitive style associated with the least accurate expert prediction: what Tetlock called the hedgehog style, in which the expert applies a single overarching theory to every question and predicts in directions consistent with that theory regardless of base rates. The hedgehog/fox distinction --- foxes being the multi-perspective, base-rate-anchored, granularly probabilistic thinkers --- is the precursor to the superforecaster trait profile that emerged from the IARPA tournament. The continuity across two decades of separate studies, using different methodologies and different forecaster pools, is part of why this finding sits in the “holds up” category rather than the “needs more replication” category.

What Distinguishes This From The Failures In This Hub

Step back from the specific results and ask the meta-question this hub keeps returning to: what is structurally different about the superforecasting research program that explains why it survived scrutiny when so much other behavioral science did not?

I think there are five reasons, and they form a useful diagnostic checklist for evaluating any behavioral-science finding you might be considering deploying for real decisions.

The construct is operationally defined. Forecasting accuracy is measured as Brier score on preregistered questions with verifiable resolution criteria. There is no analytic flexibility in how the dependent variable is operationalized. Contrast with grit (a slippery construct that turned out to be conscientiousness), multiple intelligences (never operationalized into testable form), emotional intelligence (multiple competing scales that did not converge), implicit bias (a measure with weak individual-level predictive validity), or power posing (an effect that depended on which dependent variable you measured). The superforecasting program uses a single outcome variable fixed in advance, and because the Brier score is a proper scoring rule, forecasters cannot improve their expected score by misreporting their beliefs.

The scoring rule is preregistered. The Brier score was fixed by IARPA at the start. There is no post-hoc opportunity to redefine what counts as accurate. Contrast with the studies in this hub that engaged in optional stopping, dependent-variable switching, or post-hoc subgroup analysis. Preregistration eliminates the entire class of researcher-degrees-of-freedom problems that drove the replication crisis.

The sample is large and the duration is long. Approximately 5,000 forecasters making approximately one million forecasts across four years, with continuous re-validation in subsequent tournaments and platforms running through the present. Contrast with the typical replication-crisis casualty: a single experiment with 50 to 100 undergraduates run over a single semester. Sample size and study duration matter not just for statistical power but for ruling out the possibility that observed effects are sample-specific or moment-specific artifacts.

The outcome has independent external validation. Forecaster accuracy is not just measured against a researcher-controlled comparison condition; it is measured against the U.S. intelligence community’s classified analysts on the same questions, and against prediction markets, and against averaged-crowd benchmarks. Beating the intelligence community on geopolitical forecasting is a real-world benchmark that cannot be gamed by sample selection. Contrast with most failed behavioral findings, which were measured only against null hypotheses or weak control conditions and which never produced an external real-world validation against a credible competing methodology.

The mechanism is concrete and well-specified. Active open-mindedness, granular probability use, frequent incremental updating, and use of base rates and reference classes are each specific, identifiable cognitive and behavioral practices that can be measured and trained. The CHAMPS KNOW training module operationalizes the mechanism into something teachable. Contrast with the failed findings in this hub, where the mechanism was typically a vague hand-wave at “priming,” “embodied cognition,” “willpower,” or “implicit attitudes” --- constructs that never produced clear training implications because they were never operationalized into specific behaviors.

A finding that satisfies all five of these criteria is a strong candidate for being real. The superforecasting research program is one of a small number of behavioral-science programs that do. Most of the catalog in this hub fails at least three of these criteria.

If you internalize one thing from this hub overall, it should probably be this checklist. Apply it to any behavioral-science claim you encounter before you build operational decisions on top of it. If a researcher cannot tell you the operational definition of the construct, the preregistered scoring rule, the sample size and duration, the external validation benchmark, and the trainable mechanism --- you are looking at a candidate for the next replication-crisis casualty, not a finding you should act on.

What This Means For Strategists Making Forecasts In Business

The practical implications of the superforecasting research for anyone whose job involves making consequential forecasts under uncertainty --- which is essentially every executive role --- are unusually concrete. Unlike most behavioral-science findings, the superforecasting program produced a specific list of cognitive and behavioral practices that the data shows are associated with better forecasting accuracy. You can adopt them.

Use granular probabilities, not vague verbal categories. The CIA’s own internal research, going back decades, has documented that phrases like “likely,” “probably,” “unlikely,” and “almost certain” are interpreted by different readers as ranges from roughly 20% to roughly 80% depending on the reader. The superforecaster finding is that forecasters who report probabilities at fine resolution (37%, 62%, 81%) are more accurate than forecasters who round to 25%, 50%, 75%. The granularity is doing real work: it forces the forecaster to actually take a position on the relative weight of competing considerations rather than retreating to a vague verbal hedge. In business forecasting --- product launches, market entries, hiring decisions, investment commitments --- the discipline of stating a numerical probability you would actually bet on is one of the highest-leverage practices you can adopt.

Anchor on base rates before adjusting. Almost every business question has a base-rate reference class that is more informative than your team’s intuitive guess. “What is the probability this acquisition produces the projected revenue synergies?” has a base rate (typical realization rate of projected synergies in similar acquisitions in this industry). “What is the probability this engineering project ships on time?” has a base rate (your team’s historical on-time delivery rate). “What is the probability this new market segment produces the projected revenue?” has a base rate (typical first-year revenue of similar segment entries by similar companies). The superforecaster behavior is to identify the base rate first, then adjust upward or downward for the specific factors that distinguish this case from the reference class. The common failure mode is to skip the base rate entirely and start from a narrative-driven gut estimate, which produces systematically overconfident forecasts in either direction.

Update incrementally on new information. The Atanasov 2020 finding is that small frequent updates beat large infrequent updates. The behavioral analog in business decision-making is to revisit your probability estimates frequently as new information arrives, moving the estimate by an amount proportional to how diagnostic the new information actually is. The common failure mode is to anchor on the original estimate and refuse to update until evidence is overwhelming, at which point you over-correct. The superforecaster pattern is closer to a continuous Bayesian update: each piece of news shifts the estimate by a few percentage points; over time the estimate converges to a well-calibrated posterior.

Practice active open-mindedness, especially against your current view. The single dispositional trait most strongly associated with superforecasting accuracy is the willingness to seriously consider arguments against one’s current position. In practice this is operationalized as the discipline of explicitly searching for evidence that would change your mind, weighting that evidence on its merits, and updating accordingly. The common organizational failure mode is the opposite: confirmation bias, motivated reasoning, and the tendency to dismiss dissenting evidence as low-quality. Building organizational practices that systematically surface and engage with dissenting evidence --- pre-mortems, red teams, devil’s advocate roles --- is the operational implementation of the active-open-mindedness finding.

Conduct post-mortems on your own forecasts. Superforecasters review their forecast track records and identify systematic patterns in their own errors. Were they consistently overconfident on questions in a particular domain? Did they underweight a particular kind of evidence? Did they fail to update fast enough on a particular type of news? The discipline of post-mortem self-review is what enables continuous calibration improvement. In a business context, this is the practice of going back to forecasts you made six and twelve months ago, comparing them against what actually happened, and asking what you would do differently. Most organizations skip this step entirely, which is why most organizations’ forecasting accuracy does not improve over time.
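A forecast post-mortem can start as a simple calibration table: group past forecasts by stated probability and compare each group's average stated probability with how often the events actually happened. The Python sketch below is one minimal way to do that. The forecast log, the five-bin layout, and the function name are all hypothetical.

```python
from collections import defaultdict


def calibration_table(records, n_bins=5):
    """Group (forecast, outcome) pairs into probability bins and compare the
    mean stated probability with the observed frequency in each bin."""
    bins = defaultdict(list)
    for forecast, outcome in records:
        # A forecast of exactly 1.0 is placed in the top bin.
        index = min(int(forecast * n_bins), n_bins - 1)
        bins[index].append((forecast, outcome))

    table = []
    for index in sorted(bins):
        pairs = bins[index]
        mean_forecast = sum(f for f, _ in pairs) / len(pairs)
        hit_rate = sum(o for _, o in pairs) / len(pairs)
        table.append((index, len(pairs), mean_forecast, hit_rate))
    return table


# Hypothetical log of past forecasts: (stated probability, 1 if it happened else 0).
log = [(0.9, 1), (0.85, 0), (0.95, 1), (0.9, 0),
       (0.15, 0), (0.1, 0), (0.3, 1), (0.65, 1)]

table = calibration_table(log)
for index, count, mean_forecast, hit_rate in table:
    print(f"bin {index}: n={count}, stated={mean_forecast:.2f}, observed={hit_rate:.2f}")
```

In this invented log the forecasts stated at around 90% came true only half the time, which is the kind of overconfidence pattern a post-mortem is meant to surface. With a real log you would want many more forecasts per bin before drawing conclusions.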

Use teams, but in the right way. The Mellers 2014 finding is that collaborative forecast teams outperform individual forecasters, but only when the team has the right structure: members can see and engage with each other’s reasoning, members are encouraged to disagree, and the final aggregation weights informed disagreement rather than suppressing it. The wrong way to use teams is groupthink-prone consensus-seeking, which degrades accuracy. The right way is the explicit aggregation of independently formed judgments with structured argumentation. Most business “team forecasts” are the wrong kind.

Aggregate the best forecasters more heavily. The tracking-and-extremizing finding from the Good Judgment Project is that, once you have a track record of who is consistently accurate, you should weight their forecasts more heavily than the crowd average. In a business context, this implies the practice of explicitly identifying the individuals on your team who have demonstrated forecasting accuracy --- not the most senior, not the most articulate, the most empirically accurate --- and giving their estimates more weight in consequential forecasts. This is harder than it sounds because the most empirically accurate forecaster is often not the most politically powerful, and corporate forecasting processes tend to up-weight political power rather than empirical track record.
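As a rough illustration of weighting by track record, the Python sketch below weights each forecaster by the inverse of their historical mean Brier score. That weighting scheme is an assumption made for this example, not the Good Judgment Project's method, and the names, histories, and current estimates are invented.

```python
def mean_brier(history):
    """Average squared error over a list of (forecast, outcome) pairs."""
    return sum((p - o) ** 2 for p, o in history) / len(history)


def track_record_weights(histories):
    """Hypothetical scheme: weight is inversely proportional to mean Brier score.
    The small constant avoids division by zero for a perfect record."""
    raw = {name: 1.0 / (mean_brier(h) + 1e-6) for name, h in histories.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}


def weighted_forecast(current, weights):
    """Blend current estimates using the track-record weights."""
    return sum(current[name] * weights[name] for name in current)


# Invented track records: (stated probability, 1 if the event happened else 0).
histories = {
    "analyst_a": [(0.8, 1), (0.3, 0), (0.7, 1)],   # modest but accurate
    "vp_b": [(0.95, 0), (0.9, 1), (0.1, 1)],       # confident and often wrong
}
current = {"analyst_a": 0.35, "vp_b": 0.80}

weights = track_record_weights(histories)
blended = weighted_forecast(current, weights)
unweighted = sum(current.values()) / len(current)

print(f"unweighted average: {unweighted:.2f}")
print(f"track-record blend: {blended:.2f}")
```

The blend sits much closer to the analyst's estimate than to the simple average, regardless of seniority. Three resolved questions per person is far too few to weight anyone in practice; a usable track record needs many more.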

None of these practices is a magical hack. Each is a concrete cognitive and behavioral discipline that has empirical evidence for improving forecast accuracy in environments where the scoring rule is honest. If you adopted only one of them --- the discipline of stating numerical probabilities you would actually bet on --- you would be ahead of most organizational forecasting practice.

What This Anti-Example Tells Us About Behavioral Science Overall

The replication crisis is real, and the catalog of canonical-then-collapsed findings in this hub is long. But the superforecasting program demonstrates what behavioral and decision science can produce when the methodology is done properly: rigorous, replicable, operationally useful findings about how people actually think and how thinking can be improved.

What distinguishes the productive corner of behavioral science from the unproductive corner is not the topic, not the researchers’ credentials, and not the prestige of the publication outlet. It is the methodology. Preregistered scoring rules, large samples, multi-year duration, external validation benchmarks, operationally defined constructs, and concrete trainable mechanisms --- these design features predict which findings will hold up. The findings that have them tend to survive. The findings that lack them tend to collapse.

For an executive deciding where to invest scarce attention on behavioral and decision-science research, the implication is to apply the five-criterion checklist explicitly. The default-effect literature passes it. The superforecasting program passes it. A small number of other programs --- prospect theory’s core asymmetry finding, some of the heuristics-and-biases literature on overconfidence and base-rate neglect, the field-experimental work on choice architecture in tax compliance and energy use --- pass it. Most of the rest of what gets cited in management decks and TED talks does not.

The hub you are reading is a guided tour of the failures. This article and the defaults anti-example are the matched account of what is left standing. Together they should give you a workable calibration: behavioral science is mostly weaker than the popular literature implies, but a small and identifiable subset of it is rigorous enough to deploy with confidence, and the criterion for identifying that subset is not subjective. It is the methodology.

That calibration is the point of the hub. The superforecasting program is proof that the calibration is possible.

Sources

  • Tetlock, P. E. (2005). Expert Political Judgment: How Good Is It? How Can We Know? Princeton University Press.
  • Mellers, B., Ungar, L., Baron, J., Ramos, J., Gurcay, B., Fincher, K., Scott, S. E., Moore, D., Atanasov, P., Swift, S. A., Murray, T., Stone, E., & Tetlock, P. E. (2014). Psychological strategies for winning a geopolitical forecasting tournament. Psychological Science, 25(5), 1106–1115. DOI: 10.1177/0956797614524255
  • Mellers, B., Stone, E., Atanasov, P., Rohrbaugh, N., Metz, S. E., Ungar, L., Bishop, M. M., Horowitz, M. C., Merkle, E., & Tetlock, P. E. (2015). The psychology of intelligence analysis: Drivers of prediction accuracy in world politics. Journal of Experimental Psychology: Applied, 21(1), 1–14. DOI: 10.1037/xap0000040
  • Tetlock, P. E., Mellers, B. A., Rohrbaugh, N., & Chen, E. (2014). Forecasting tournaments: Tools for increasing transparency and improving the quality of debate. Current Directions in Psychological Science, 23(4), 290–295. DOI: 10.1177/0963721414534257
  • Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.
  • Atanasov, P., Witkowski, J., Ungar, L., Mellers, B., & Tetlock, P. (2020). Small steps to accuracy: Incremental belief updaters are better forecasters. Organizational Behavior and Human Decision Processes, 160, 19–35. DOI: 10.1016/j.obhdp.2020.02.001
  • Satopää, V. A., Salikhov, M., Tetlock, P. E., & Mellers, B. (2021). Bias, information, noise: The BIN model of forecasting. Management Science, 67(12), 7599–7618. DOI: 10.1287/mnsc.2020.3882
  • Good Judgment Project. (2011–2015). Aggregative Contingent Estimation tournament data and reports. Intelligence Advanced Research Projects Activity (IARPA).


FAQ

Can anyone become a superforecaster, or is it a fixed trait?

The empirical answer is “partially trainable, partially dispositional.” The Mellers 2014 training intervention (CHAMPS KNOW) produced a roughly 10% Brier score improvement in randomly assigned forecasters, which demonstrates that the cognitive techniques are teachable and produce measurable accuracy gains. But the very top tier --- the eventual superforecasters, the top 2% by Brier score --- combined trained techniques with dispositional traits like high active open-mindedness, high numeracy, and intrinsic motivation to engage with hard questions. Most readers of this article will not become superforecasters in the formal sense, but most readers can substantially improve their own forecasting accuracy by adopting the trainable practices: granular probabilities, base-rate anchoring, incremental updating, and post-mortem review.

What about machine prediction --- doesn’t AI just outperform humans now?

For some classes of problems, yes. For others, no, and the boundary is informative. Machine learning systems outperform humans on problems with abundant historical data, stable underlying generative processes, and clearly defined outcome variables --- recommendation systems, image classification, certain financial forecasting niches. They underperform humans on problems with sparse data, novel structural conditions, geopolitical or strategic dynamics, and questions where the relevant base rate has to be constructed from analogical reasoning rather than retrieved from a training set. The IARPA Hybrid Forecasting Competition found that the best designs combined human judgment (which generalized to novel situations) with machine aggregation and consistency-checking (which removed noise from the human forecasts). Pure-machine forecasting did not dominate; pure-human forecasting did not dominate; hybrid systems did best. For most business-strategy forecasting questions, you are in the human-comparative-advantage regime.

Does superforecasting work for business and financial forecasting, or only geopolitical?

The training and the cognitive practices generalize. The evidence is harder to establish for business forecasting specifically, because business forecasting outcomes are usually proprietary and rarely scored under preregistered Brier rules. There is some evidence from forecasting platforms (Good Judgment Open, Hypermind) that the same individuals who score well on geopolitical questions also score well on economic and pandemic questions, which suggests the underlying skill is general rather than domain-specific. The most rigorous test of business-forecasting accuracy would require a corporate forecasting tournament with preregistered scoring; few companies have run one. In the meantime, the practices (granular probabilities, base rates, incremental updating, active open-mindedness) are not tied to geopolitical content and should carry over to any forecasting domain where the scoring rule is honest.

Why does this research hold up when so many other behavioral-science findings don’t?

The methodology. Preregistered scoring rules, large samples (~5,000 forecasters, ~1 million forecasts), multi-year duration, external validation against the U.S. intelligence community, an operationally defined construct (Brier score), randomized experimental interventions within the tournament, and a concrete trainable mechanism (CHAMPS KNOW probability training, granular probability use, incremental Bayesian updating). Most failed behavioral findings lack at least three of those design features. The superforecasting program has all of them.

What is the single most important practice to adopt from this research?

Stating numerical probabilities you would actually bet on, rather than vague verbal categories like “likely” or “probably.” The discipline of being forced to commit to a number does the work: it surfaces the considerations you have not actually weighed, forces base-rate anchoring, and creates the track record that allows post-mortem self-review. Most organizational forecasting culture allows vague verbal hedges that are uncalibrated and unaccountable. Replace those with numerical probabilities and you have already adopted half of the superforecaster practice.
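To make the track-record idea concrete, here is a minimal Python sketch of a personal forecast log scored after the questions resolve. The log format, the function names, and the five-bucket calibration table are assumptions for illustration. The Brier score here is the single-outcome binary form (mean squared error between stated probability and the 0/1 outcome, ranging from 0 to 1); some tournament reports use a two-outcome form whose values are twice as large.

```python
from collections import defaultdict


def brier_score(log):
    """Mean squared error between stated probabilities and 0/1 outcomes.

    log -- list of (probability, outcome) pairs, outcome is 1 if the event
           happened and 0 if it did not. Lower is better; always answering
           0.5 scores 0.25 under this binary form.
    """
    if not log:
        raise ValueError("forecast log is empty")
    return sum((p - o) ** 2 for p, o in log) / len(log)


def calibration_table(log, bins=5):
    """Group forecasts into probability buckets and compare the average
    stated probability with the observed frequency in each bucket."""
    buckets = defaultdict(list)
    for p, o in log:
        idx = min(int(p * bins), bins - 1)  # p == 1.0 goes in the top bucket
        buckets[idx].append((p, o))
    rows = []
    for idx in sorted(buckets):
        probs, outcomes = zip(*buckets[idx])
        rows.append((sum(probs) / len(probs), sum(outcomes) / len(outcomes), len(probs)))
    return rows  # (mean stated probability, observed frequency, count)


if __name__ == "__main__":
    # Hypothetical resolved forecasts: (stated probability, what happened).
    my_log = [(0.9, 1), (0.8, 1), (0.7, 0), (0.2, 0), (0.1, 0)]
    print(f"Brier score: {brier_score(my_log):.3f}")
    for stated, observed, n in calibration_table(my_log):
        print(f"said ~{stated:.2f}, happened {observed:.2f} of the time (n={n})")
```

A vague "likely" cannot be entered in a log like this at all, which is the point: the number is what makes the post-mortem possible. With only a few forecasts per bucket the calibration table is noisy, so it becomes informative only after a reasonable number of resolved questions.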

How big is the superforecaster accuracy advantage over crowd averages?

Roughly 30% improvement in Brier score versus the simple-averaged crowd, and roughly 60% improvement versus the U.S. intelligence community’s classified analysts on the same questions, based on the IARPA tournament’s first two years. These are large effects in a domain where most methodological improvements produce single-digit percentage gains.
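For readers who want to see what a percentage improvement in Brier score means arithmetically, the following Python sketch shows the calculation. Because lower Brier scores are better, an improvement is the fractional reduction relative to the baseline. The baseline value of 0.20 is a made-up number for illustration, not a figure from the tournament, and both function names are hypothetical.

```python
def relative_brier_improvement(baseline, candidate):
    """Fractional reduction in Brier score relative to a baseline.

    Lower Brier is better, so a positive result means the candidate
    forecaster (or method) is more accurate than the baseline.
    """
    if baseline <= 0:
        raise ValueError("baseline Brier score must be positive")
    return (baseline - candidate) / baseline


def implied_brier(baseline, improvement):
    """Brier score implied by a given fractional improvement over a baseline."""
    return baseline * (1.0 - improvement)


if __name__ == "__main__":
    crowd = 0.20  # hypothetical crowd-average Brier score, for illustration only
    better = implied_brier(crowd, 0.30)
    print(f"a 30% improvement on {crowd:.2f} implies a Brier score of {better:.2f}")
    print(f"check: {relative_brier_improvement(crowd, better):.0%} improvement")
```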

Should I trust Tetlock’s earlier “hedgehog/fox” framework from the 2005 book?

The hedgehog/fox distinction --- hedgehogs apply a single overarching theory, foxes integrate multiple perspectives --- has held up reasonably well as a description of the cognitive style differences between consistently accurate and consistently inaccurate forecasters. The fox cognitive style is the precursor to the superforecaster trait profile that emerged from the IARPA tournament. The earlier 2005 study had smaller samples and less rigorous scoring than the IARPA tournament, but the basic finding has been validated by the subsequent work.

Are there commercial applications of this research available to non-academic users?

Yes. Good Judgment Inc., the spin-off company from the Good Judgment Project, operates a panel of professional superforecasters that produces forecasts on questions commissioned by client organizations. Good Judgment Open is a publicly accessible forecasting platform where anyone can participate. Hypermind operates a similar prediction market platform. The CHAMPS KNOW training module and its successors have been adapted into corporate training programs. For an individual reader, the cheapest deployment is to start practicing granular probability estimation on your own consequential forecasts and to keep a track record for post-mortem review.

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified behavioral economist (1 of ~1,000 worldwide). 200+ A/B tests across energy, SaaS, fintech, e-commerce, and marketplace verticals.