Most behavioral science headlines tell stories that died in replication. Power posing did not raise testosterone. Ego depletion did not survive meta-analysis. Implicit-bias scores did not predict behavior. The “marshmallow test” effect shrank dramatically once parental income entered the model.

The wisdom of crowds is different.

It is a rare anti-example in the replication-crisis hub: a behavioral effect from 1907 that not only replicates but has been confirmed across weather forecasting, prediction markets, geopolitical forecasting tournaments, and online estimation experiments with thousands of participants. The Galton ox-weight story is true. Surowiecki’s popularization was directionally correct.

But the same body of research that confirms the effect also delivers a brutal qualifier: the four conditions Surowiecki named are not decoration. They are load-bearing. Violate them — especially independence — and crowds become dumber than individuals, not smarter. Lorenz et al. (2011) showed this experimentally in PNAS. Becker et al. (2017) showed which network structures preserve the effect and which destroy it.

This article is the rare case where the popular version of the story holds up — provided you read past the headline to the conditions.

The ox-weight story (and why it still matters)

In 1906, Francis Galton attended the West of England Fat Stock and Poultry Exhibition in Plymouth. A weight-judging contest invited fairgoers to estimate the dressed weight of an ox after slaughter and butchering. Each ticket cost sixpence. The closest guess won a prize. About 800 tickets were sold, and 787 were legible enough to analyze.

Galton, better known today for pioneering regression and correlation and for founding eugenics, collected the tickets and computed the central tendency. He published the result in Nature the following year under the title “Vox Populi” (Galton, 1907).

The middlemost estimate was 1,207 pounds. The actual dressed weight was 1,198 pounds. The collective guess of a heterogeneous crowd of butchers, farmers, and ordinary fairgoers landed within 0.8% of the true value. The crowd’s middle beat almost every individual in it.
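The 0.8% figure is simple arithmetic on the two numbers just quoted. A minimal sketch of the calculation (the function name is only illustrative):

```python
# Figures as reported in the text: Galton's median and the actual dressed weight.
median_guess_lb = 1207
actual_weight_lb = 1198

def relative_error(estimate: float, truth: float) -> float:
    """Absolute error expressed as a fraction of the true value."""
    return abs(estimate - truth) / truth

error_pct = 100 * relative_error(median_guess_lb, actual_weight_lb)
print(f"Median missed by {median_guess_lb - actual_weight_lb} lb ({error_pct:.2f}%)")
# prints: Median missed by 9 lb (0.75%)
```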

Galton, who had set out partly to demonstrate the unreliability of democratic judgment, instead produced one of the most-cited results in collective intelligence research. He wrote: “This result is, I think, more creditable to the trustworthiness of a democratic judgment than might have been expected.”

Why does this 1907 result matter now?

Because unlike most century-old behavioral findings, it survives modern scrutiny. Wallis (2014) re-analyzed Galton’s original data, corrected a few errors in it, and found the headline result intact. The same effect, that the mean or median of many independent estimates is more accurate than most individuals, has been reproduced many times in jelly-bean-in-the-jar experiments, classroom exercises, weather forecasting ensembles, and online studies.

That is rare. Most “classic” psychology results from the early 20th century either failed to replicate, were never properly tested, or turned out to depend on demand characteristics. The wisdom of crowds is one of the few that still works, in part because the underlying mechanism is statistical, not psychological.

The statistical mechanism (it is not magic)

Here is the math, stripped down.

Each individual guess can be decomposed into the true value plus an error term. The error has two components: bias (systematic over- or under-estimation shared by the crowd) and noise (idiosyncratic individual variation).

When you average many guesses, the noise components cancel — because they are roughly equally likely to be positive or negative around the truth. What remains is bias plus a much smaller residual noise term.
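A small simulation makes the cancellation visible. This is a sketch under stated assumptions, not a model of Galton's fair: the true value, the shared bias, the noise level, and the Gaussian noise shape are all made up for illustration.

```python
import random
import statistics

def simulate_crowd(truth, shared_bias, noise_sd, n, seed=0):
    """Each guess = truth + bias shared by everyone + that person's own noise."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [truth + shared_bias + rng.gauss(0, noise_sd) for _ in range(n)]

TRUTH = 1198.0  # hypothetical target value
guesses = simulate_crowd(TRUTH, shared_bias=-20.0, noise_sd=80.0, n=800)

# Error of the averaged guess versus the typical error of a single guesser.
crowd_error = abs(statistics.fmean(guesses) - TRUTH)
typical_individual_error = statistics.fmean([abs(g - TRUTH) for g in guesses])

print(f"crowd error:              {crowd_error:.1f}")
print(f"typical individual error: {typical_individual_error:.1f}")
```

With these settings the crowd's error should land near the 20-unit shared bias, while a typical individual is off by several times that. Raising `n` shrinks the leftover noise further but never touches the bias.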

Scott Page formalized this as the “diversity prediction theorem” (Page, 2007): the crowd’s squared error equals the average individual squared error minus the diversity of the crowd’s predictions, where diversity is the variance of the individual guesses around the crowd’s mean. Hold individual error constant, and more diverse crowds are more accurate. Hold diversity constant, and crowds with less individual error are more accurate.
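The theorem is an algebraic identity on squared errors, so it can be checked on any list of numbers. A minimal check, using five made-up guesses and a made-up true value:

```python
import statistics

def diversity_prediction_terms(predictions, truth):
    """Return (crowd squared error, average individual squared error, diversity).

    Diversity is the variance of the predictions around the crowd mean.
    """
    crowd = statistics.fmean(predictions)
    crowd_sq_error = (crowd - truth) ** 2
    avg_individual_sq_error = statistics.fmean([(p - truth) ** 2 for p in predictions])
    diversity = statistics.fmean([(p - crowd) ** 2 for p in predictions])
    return crowd_sq_error, avg_individual_sq_error, diversity

predictions = [1100.0, 1250.0, 1180.0, 1300.0, 1150.0]  # made-up guesses
truth = 1198.0                                          # made-up true value

crowd_sq_error, avg_sq_error, diversity = diversity_prediction_terms(predictions, truth)
print(crowd_sq_error, avg_sq_error, diversity)  # 4.0 5068.0 5064.0
```

Here the individuals are off by about 71 units on average (the square root of 5068), yet the crowd mean misses by only 2, because almost all of the individual error is diversity that averages away.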

This is not mystical collective wisdom. It is the law of large numbers applied to estimation, with one caveat: averaging removes noise, not systematic bias. If everyone in the crowd shares the same bias (say, all systematically underestimate the ox), no amount of averaging fixes it. Diversity is the load-bearing input.

The implication is unforgiving: a crowd of clones is not wise. A crowd that has been influenced by the same expert opinion is partially a crowd of clones. A crowd that talks to itself before guessing is increasingly a crowd of clones. We will return to this.

Surowiecki’s four conditions

James Surowiecki popularized the effect in The Wisdom of Crowds (2004), a New Yorker financial columnist’s tour through prediction markets, ant colonies, jellybean-counting classrooms, and submarine search-and-rescue. The book sold widely and embedded the phrase in management vocabulary.

What gets less attention in the popular reception is that Surowiecki was explicit about four conditions a crowd must meet to be wise. They are not optional:

1. Diversity of opinion. Each person should have private information, even if it is just their idiosyncratic interpretation of the publicly available facts. A crowd of identically trained experts is not diverse — they share frameworks, blind spots, and reference classes.

2. Independence. People’s opinions should not be determined by the opinions of those around them. If everyone is watching everyone else, the crowd collapses into informational cascades — Bikhchandani, Hirshleifer, and Welch (1992) showed how rapidly this happens in sequential decision settings.

3. Decentralization. People should be able to specialize and draw on local knowledge. A crowd judging the ox included butchers who knew dressed-weight ratios, farmers who knew live-weight, and laypeople with no particular knowledge — and that mixture is the point.

4. Aggregation. Some mechanism must exist to turn private judgments into a collective decision. Galton used the median. Markets use prices. Polling uses sums. Without aggregation, you just have a room full of opinions.
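The choice of aggregation rule matters when a few guesses are wild. A minimal sketch comparing the median (Galton's choice) with the mean on made-up estimates that include one outlier:

```python
import statistics

def aggregate(estimates, method="median"):
    """Collapse private estimates into one collective number."""
    if method == "median":
        return statistics.median(estimates)
    if method == "mean":
        return statistics.fmean(estimates)
    raise ValueError(f"unknown aggregation method: {method}")

# Made-up estimates; the last one is a deliberately wild guess.
estimates = [1150, 1180, 1200, 1210, 1240, 5000]

print(aggregate(estimates, "median"))  # 1205.0
print(aggregate(estimates, "mean"))    # 1830.0
```

One extreme ticket drags the mean far from the bulk of the guesses, while the median barely moves.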

Surowiecki was correct that when all four conditions are met, crowds are remarkable. He was also correct that they rarely all hold in real institutions — which is why most “crowd-sourced” decisions in practice do not deliver Galton-style accuracy.

Lorenz 2011: how social influence destroys the effect

In 2011, Jan Lorenz, Heiko Rauhut, Frank Schweitzer, and Dirk Helbing published a paper in PNAS titled “How social influence can undermine the wisdom of crowd effect” (Lorenz et al., 2011). It is the single most important empirical qualifier to Galton’s result published in the last 25 years.

The design was clean. Subjects estimated factual quantities: for example, the length of the Swiss-Italian border, the number of new immigrants in Zurich in 2006, and the number of murders in Switzerland the previous year. They made five sequential estimates of each quantity, with monetary incentives for accuracy.

The manipulation was the information condition between rounds:

  • No information. Subjects re-estimated with no feedback. (Independent.)
  • Aggregated information. Subjects saw the group mean before re-estimating.
  • Full information. Subjects saw every other group member’s estimate before re-estimating.

The independent condition replicated Galton: the crowd’s collective estimate was accurate, with low collective error and wide diversity.

In the aggregated and full-information conditions, three things happened together. First, individual estimates converged — diversity collapsed. Second, subjects became more confident in their increasingly homogeneous guesses. Third — and this is the killer finding — collective accuracy did not improve, and in many conditions it got worse. The crowd was now confident, homogeneous, and wrong.

Lorenz et al. named three effects: the social influence effect (diversity shrinks without any gain in collective accuracy), the range reduction effect (the true value drifts toward the edge of the narrowing range of estimates), and the confidence effect (subjects grow more confident even though accuracy has not improved). Together, they describe exactly the failure mode that wrecks crowdsourced decisions in real organizations. The first person to speak in a meeting anchors everyone. The expert who weighs in early collapses the diversity that made the group potentially wise. The Twitter consensus that emerges in the first hour after a news event becomes the reference point everyone defers to.
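A toy model shows how diversity can collapse while accuracy stands still. This sketch is not the model from the paper: it assumes every subject moves the same fraction toward the group mean each round, and the starting guesses, true value, and pull strength are invented.

```python
import statistics

def influence_round(estimates, pull=0.3):
    """Every subject moves a fraction `pull` of the way toward the group mean.

    Assumption for illustration: one shared pull strength for everyone.
    """
    group_mean = statistics.fmean(estimates)
    return [e + pull * (group_mean - e) for e in estimates]

truth = 100.0                                # made-up true value
estimates = [40.0, 60.0, 70.0, 90.0, 140.0]  # made-up first-round guesses

history = []
for round_number in range(1, 6):             # five estimates per question
    collective_error = abs(statistics.fmean(estimates) - truth)
    diversity = statistics.pstdev(estimates)
    history.append((round_number, collective_error, diversity))
    print(f"round {round_number}: collective error {collective_error:.1f}, "
          f"diversity (sd) {diversity:.1f}")
    estimates = influence_round(estimates)
```

In this neutral case the group mean never moves, so the collective error stays at 20 while the spread shrinks by 30% every round. The crowd looks more and more unanimous without getting any closer to the truth. The experiment found cases worse than this, where accuracy also declined.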

The wisdom of crowds is fragile, and the thing it is most fragile to is communication.

What Becker, Brackbill, and Centola added in 2017

Lorenz’s result raised an uncomfortable question: if social influence destroys collective wisdom, what about all the supposedly wise collective decisions in democracies, markets, juries, and online platforms? Are they all worse than uninformed averages?

Damon Centola’s lab at Penn pushed back in a 2017 PNAS paper (Becker, Brackbill, & Centola, 2017). They argued that Lorenz’s result depended on the specific network structure — namely, a “complete” network where everyone sees everyone else’s estimate. That structure is unusual in real social networks.

In their experiment, subjects estimated quantities in three conditions: independent (no influence), centralized networks (a few hubs that everyone sees), and decentralized networks (each subject sees only a few neighbors).

The independent condition replicated Galton again. The centralized condition reproduced the risk Lorenz identified: group estimates were pulled toward the hubs’ beliefs, and crowds got worse when the hubs were wrong. But the decentralized condition produced something different: estimates improved across rounds, individual accuracy went up, collective accuracy went up, and outliers were pulled toward truth by their neighbors.

The mechanism was that in a decentralized network, no single source dominates. Subjects update based on a small, idiosyncratic subset of peers. The aggregate effect is closer to averaging than to cascading.
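The difference between the two structures can be sketched with a simple repeated-averaging model (often called DeGroot updating). This is an illustrative assumption, not a reproduction of the experiment: the network shapes, the 50/50 weighting between one's own estimate and one's neighbors, and the starting guesses are all hypothetical.

```python
import statistics

def degroot_step(estimates, neighbors, self_weight=0.5):
    """Each node keeps `self_weight` on its own estimate and puts the rest
    on the plain average of the neighbors it can see."""
    updated = []
    for i, own in enumerate(estimates):
        seen = [estimates[j] for j in neighbors[i]]
        updated.append(self_weight * own + (1 - self_weight) * statistics.fmean(seen))
    return updated

def run(estimates, neighbors, rounds):
    for _ in range(rounds):
        estimates = degroot_step(estimates, neighbors)
    return estimates

def star(n):
    """Centralized: node 0 is a hub seen by everyone; the hub sees everyone."""
    return {0: list(range(1, n)), **{i: [0] for i in range(1, n)}}

def ring(n):
    """Decentralized: every node sees only its two adjacent nodes."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

n = 20
# Made-up guesses: the hub (node 0) is far off; the others average 1100.
initial = [2000.0] + [1000.0 + 10 * i for i in range(1, n)]

star_final = run(initial, star(n), rounds=200)
ring_final = run(initial, ring(n), rounds=2000)

print(f"plain average of first guesses: {statistics.fmean(initial):.1f}")  # 1145.0
print(f"ring consensus:                 {ring_final[0]:.1f}")              # 1145.0
print(f"star consensus:                 {star_final[0]:.1f}")              # 1550.0
```

In the ring, talking changes nothing about the collective answer: the group converges on the plain average of the first guesses. In the star, the single hub ends up with half of the total influence, so one badly wrong person drags all twenty people with them.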

This is a meaningful refinement, not a refutation. The wisdom of crowds survives in decentralized, weakly-connected social structures with diverse information. It dies in centralized, densely-connected structures where expert opinion or majority sentiment dominates. The internet, depending on which corner, is sometimes one and sometimes the other.

Why prediction markets work (and where they break)

If you want a wisdom-of-crowds mechanism that runs at industrial scale, prediction markets are the cleanest example.

Wolfers and Zitzewitz (2004), in the Journal of Economic Perspectives, surveyed the early evidence. The Iowa Electronic Markets had been forecasting U.S. presidential elections since 1988, generally outperforming polls. Sports betting markets are close to efficient. Corporate prediction markets, at HP early on and later at Google and others, have outperformed internal expert forecasts of sales and project completion dates.

The mechanism is straightforward: traders with private information have a financial incentive to act on it, prices aggregate the information of many traders, and the marginal trader’s payoff structure rewards accuracy rather than conformity. This is Surowiecki’s four conditions implemented in a market: diverse traders, independent positions (each trader profits from their own correct view, not from agreeing), decentralized information, and aggregation via price.
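The incentive can be made concrete with a binary contract that pays 1 if the event happens and 0 otherwise. This is a simplified sketch: it assumes a risk-neutral trader and ignores fees, spreads, and the time value of money, and the price and belief are invented.

```python
def expected_profit_per_share(price: float, belief: float) -> float:
    """Expected profit from buying one share of a contract that pays 1 if the
    event occurs and 0 otherwise, for a trader whose probability is `belief`.

    Simplifies to belief - price.
    """
    win = belief * (1.0 - price)          # event happens: collect 1, paid price
    lose = (1.0 - belief) * (0.0 - price)  # event fails: lose the price paid
    return win + lose

market_price = 0.62   # hypothetical price, readable as a 62% implied probability
private_belief = 0.70  # hypothetical trader's own estimate

edge = expected_profit_per_share(market_price, private_belief)
print(f"expected profit per share: {edge:.2f}")  # 0.08
```

A trader who disagrees with the price expects to make money by trading until the price matches their belief, which is how private information ends up in the price. A trader who simply agrees with the crowd has an expected profit of zero.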

Modern prediction markets — Polymarket, Manifold, Kalshi — extend this to a much wider range of events. Polymarket’s election forecasting and geopolitical prediction markets have at times outperformed major media polls and pundits. The 2024 U.S. election cycle saw Polymarket’s odds track final outcomes more closely than most public polling aggregators, though the comparison is contested.

But prediction markets have failure modes worth naming:

Thin markets. A market with few traders has unreliable prices. Most niche prediction-market contracts suffer from this.

Manipulation. Large traders can move prices to influence reported probabilities (sometimes for off-market reasons). The 2024 cycle had visible attempts at this.

Long-tail events. Markets are poor at pricing rare events because traders rarely have informative private signals.

Reflexivity. When prediction markets are widely reported, they can influence the outcomes they predict — for example, when traders or campaigns react to perceived odds.

Prediction markets are the best practical implementation of wisdom-of-crowds principles. They are also not magic, and the same Surowiecki conditions explain when they work and when they fail.

The Good Judgment Project

If prediction markets are the market-based instantiation of crowd wisdom, the Good Judgment Project is the survey-based one.

Run by Philip Tetlock and Barbara Mellers at Penn with Don Moore at UC Berkeley, starting in 2011, the Good Judgment Project competed in an Intelligence Advanced Research Projects Activity (IARPA) tournament that pitted forecasting teams against each other on geopolitical questions. Tetlock’s team won every year of the four-year tournament and reportedly outperformed intelligence community analysts with access to classified information.

The mechanism, documented in Mellers et al. (2014) and Tetlock and Gardner’s Superforecasting (2015), combined several wisdom-of-crowds principles with deliberate training:

  • Aggregation across many forecasters — Galton’s original insight.
  • Identification of top performers (“superforecasters”) whose estimates were weighted more heavily.
  • Training in probabilistic reasoning — base rates, reference classes, calibration practice.
  • Independent forecasts before any team discussion.
  • Selective use of team deliberation in carefully structured ways that preserved independence.
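The first two items, aggregation and heavier weights for proven forecasters, can be sketched in a few lines. The forecasts, the weights, and the use of a one-sided Brier score (squared error of a probability against a 0/1 outcome) are illustrative assumptions here, not the project's actual algorithm.

```python
import statistics

def brier_score(forecast: float, outcome: int) -> float:
    """Squared error of a probability forecast for a 0/1 outcome; lower is better."""
    return (forecast - outcome) ** 2

def weighted_forecast(forecasts, weights):
    """Weighted average of probability forecasts."""
    return sum(f * w for f, w in zip(forecasts, weights)) / sum(weights)

forecasts = [0.60, 0.70, 0.80, 0.90]  # made-up probabilities for one question
weights = [1, 1, 1, 3]                # hypothetical: last forecaster has the best record
outcome = 1                           # the event happened

simple_crowd = statistics.fmean(forecasts)
weighted_crowd = weighted_forecast(forecasts, weights)
avg_individual_brier = statistics.fmean([brier_score(f, outcome) for f in forecasts])

print(f"average individual Brier: {avg_individual_brier:.4f}")                # 0.0750
print(f"simple crowd Brier:       {brier_score(simple_crowd, outcome):.4f}")   # 0.0625
print(f"weighted crowd Brier:     {brier_score(weighted_crowd, outcome):.4f}") # 0.0400
```

The simple average already beats the average individual. Weighting helps further only when the weights track real skill, which is why the weights need to come from a scored track record rather than from seniority.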

The Good Judgment Project produced two related results. First, the crowd of trained forecasters consistently beat individual experts. Second, the top 2% of forecasters — superforecasters — produced individual forecasts that beat the crowd average. There is wisdom in crowds, and there is also genuine forecasting skill that distinguishes individuals.

For the replication-crisis context, what matters is that this is a genuine, replicated, multi-year demonstration that calibrated collective forecasting outperforms expert intuition. It is one of the cleanest wins for explicit probabilistic reasoning in any domain.

Design implications for product and research teams

If you build products, run research teams, or design any process that aggregates judgments — A/B test reviews, roadmap prioritization, hiring panels, model evaluations — the wisdom-of-crowds literature has direct implications.

Collect independent estimates first. Before any meeting where the group will discuss a quantitative judgment (a launch readiness score, a probability of success, a sizing estimate), have each person write down their estimate independently and silently. Reveal them simultaneously. This is the simplest possible implementation of the Galton mechanism, and it produces measurably better collective judgments than open discussion.

Discuss the diversity, not the consensus. The most informative thing about a set of independent estimates is usually the spread, not the mean. Wide spreads signal that the group is missing shared information or framing the problem differently. Narrow spreads with low accuracy signal that the group shares a blind spot.
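A sketch of what reviewing the spread might look like once the independent estimates are revealed. The estimates are invented, and the `wide_ratio` threshold for flagging a wide spread is a hypothetical rule of thumb, not an established standard.

```python
import statistics

def summarize_estimates(estimates, wide_ratio=0.5):
    """Summarize independently collected estimates.

    `wide_ratio` is a hypothetical threshold: the spread is flagged as wide
    when the range exceeds that fraction of the median.
    """
    median = statistics.median(estimates)
    spread = max(estimates) - min(estimates)
    return {
        "median": median,
        "mean": statistics.fmean(estimates),
        "stdev": statistics.pstdev(estimates),
        "range": spread,
        "wide_spread": spread > wide_ratio * abs(median),
    }

# Made-up answers to "how many weeks until launch?", written down silently.
weeks_to_launch = [6, 8, 8, 9, 10, 12, 30]
summary = summarize_estimates(weeks_to_launch)

print(summary["median"], summary["range"], summary["wide_spread"])  # 9 24 True
```

The useful meeting question here is not "is it 9 weeks?" but "what does the person who wrote 30 know that the rest of us do not?"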

Beware the senior person speaking first. Almost every observed wisdom-of-crowds failure in organizations involves social influence collapsing diversity early. The most-experienced person speaking first reliably anchors the discussion. Defer their input until after the group has shared independent views.

Use prediction markets for genuinely uncertain quantitative questions. Internal prediction markets at engineering organizations (Google, Microsoft) have repeatedly outperformed manager forecasts on project completion dates. The infrastructure is now cheap. The barriers are cultural and political.

Train calibration. Tetlock’s work shows that probabilistic forecasting is a teachable skill. People with calibration training produce better estimates and better collective averages. Most teams do not invest in this.
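Calibration practice needs a scoreboard: for all the times someone said "85%", how often did the event actually happen? A minimal sketch of such a table, with made-up forecasts and outcomes and an assumed ten equal-width bins:

```python
from collections import defaultdict
import statistics

def calibration_table(forecasts, outcomes, n_bins=10):
    """Group forecasts into equal-width probability bins and compare the
    average stated probability with the observed frequency in each bin.

    Returns rows of (mean stated probability, observed frequency, count).
    """
    bins = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[index].append((p, y))
    rows = []
    for index in sorted(bins):
        stated = statistics.fmean([p for p, _ in bins[index]])
        observed = statistics.fmean([y for _, y in bins[index]])
        rows.append((stated, observed, len(bins[index])))
    return rows

# Made-up track record: eight forecasts and whether each event happened.
forecasts = [0.85, 0.85, 0.85, 0.85, 0.15, 0.15, 0.15, 0.15]
outcomes = [1, 1, 1, 0, 0, 0, 0, 1]

for stated, observed, count in calibration_table(forecasts, outcomes):
    print(f"said {stated:.2f}, happened {observed:.2f} of the time (n={count})")
```

In this invented record the forecaster is overconfident at both ends: events called at 85% happened 75% of the time, and events called at 15% happened 25% of the time. Real calibration tables need many more forecasts per bin before the gaps mean anything.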

Avoid faux-crowdsourcing. Many “we asked the team” exercises violate every Surowiecki condition: low diversity (same team, same training, same incumbent priors), zero independence (open Slack thread, senior people speaking first), centralized communication (everyone sees everyone), and no formal aggregation. These exercises produce the appearance of wisdom-of-crowds rigor with none of the mechanism. They are theater.

Why this is the anti-example for the replication-crisis hub

Most articles in this hub document effects that did not survive replication. Wisdom of crowds is the opposite: an effect from 1907 that has been replicated, extended, and qualified — but not falsified.

It survives because:

  1. The underlying mechanism is statistical, not psychological. The law of large numbers and the diversity prediction theorem do not depend on cultural context or demand characteristics.
  2. The boundary conditions are explicit and testable. Surowiecki’s four conditions name what must hold. When they hold, the effect appears. When they fail (Lorenz 2011), the effect disappears predictably.
  3. Independent research traditions converge. Prediction markets (Wolfers and Zitzewitz, 2004), geopolitical forecasting tournaments (Tetlock), social-network experiments (Becker, Brackbill, & Centola, 2017), and meteorological ensemble forecasting all produce variants of the same result.

This is what a healthy behavioral finding looks like: a mechanism, named boundary conditions, multiple independent replications, an honest failure mode (Lorenz), and clear practical implications. Most pop-behavioral-science claims do not look like this. The wisdom of crowds does.

The lesson for the broader replication-crisis hub is not that “Galton was right and everyone else was wrong.” It is that the effects that survive scrutiny tend to have these features — explicit mechanism, named conditions, replications across methods — and the effects that don’t survive tend to be brittle claims about complex psychology with vague boundary conditions and demand-characteristic-dependent measurements.

When you encounter a behavioral claim, the Wisdom of Crowds checklist is a useful diagnostic. Is there a clear mechanism? Are the boundary conditions named? Do independent research traditions converge on the same finding? Has someone tried to make the effect disappear, and how did that go?

If a finding cannot survive those questions, it probably will not survive replication either.

Sources

  • Becker, J., Brackbill, D., & Centola, D. (2017). Network dynamics of social influence in the wisdom of crowds. PNAS, 114(26), E5070–E5076. https://doi.org/10.1073/pnas.1615978114
  • Bikhchandani, S., Hirshleifer, D., & Welch, I. (1992). A theory of fads, fashion, custom, and cultural change as informational cascades. Journal of Political Economy, 100(5), 992–1026.
  • Galton, F. (1907). Vox populi. Nature, 75(1949), 450–451. https://doi.org/10.1038/075450a0
  • Lorenz, J., Rauhut, H., Schweitzer, F., & Helbing, D. (2011). How social influence can undermine the wisdom of crowd effect. PNAS, 108(22), 9020–9025. https://doi.org/10.1073/pnas.1008636108
  • Mellers, B., Ungar, L., Baron, J., et al. (2014). Psychological strategies for winning a geopolitical forecasting tournament. Psychological Science, 25(5), 1106–1115.
  • Page, S. E. (2007). The Difference: How the Power of Diversity Creates Better Groups, Firms, Schools, and Societies. Princeton University Press.
  • Surowiecki, J. (2004). The Wisdom of Crowds. Doubleday.
  • Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.
  • Wallis, K. F. (2014). Revisiting Francis Galton’s forecasting competition. Statistical Science, 29(3), 420–424.
  • Wolfers, J., & Zitzewitz, E. (2004). Prediction markets. Journal of Economic Perspectives, 18(2), 107–126. https://doi.org/10.1257/0895330041371321

FAQ

Was Galton’s original 1907 result actually accurate?

Yes. The original Nature paper reported a median estimate of 1,207 pounds versus an actual dressed weight of 1,198 pounds, within 0.8%. Kenneth Wallis re-analyzed Galton’s source data in 2014 (Statistical Science), corrected a few errors, and found the headline result intact. The mean of all 787 estimates was 1,197 pounds, even closer to the actual weight.

If the wisdom of crowds is real, why do crowds also do stupid things?

Because the four Surowiecki conditions are rarely all met in real social settings. Riots, financial bubbles, viral misinformation, and witch hunts all involve crowds, but they fail the independence condition: people in those settings are heavily influenced by each other in real time. Lorenz et al. (2011) showed in PNAS that social influence destroys the wisdom-of-crowds effect. The mechanism is consistent — wise crowds and stupid crowds obey the same statistical rules, but only one set of conditions produces wisdom.

Should I trust prediction markets like Polymarket?

For high-volume, well-defined, short-horizon questions with diverse trader bases, prediction markets generally outperform polls and pundits — this is well-documented since Wolfers and Zitzewitz’s 2004 JEP survey. For thin markets, long-tail events, and questions where prices can be moved by a few large traders, treat them with skepticism. They are not infallible, but they are usually better than the median pundit.

Does the wisdom of crowds work for non-numeric decisions?

The cleanest cases are numeric estimates and binary forecasts, where averaging or majority voting has clear statistical meaning. For complex qualitative decisions — strategic direction, design choices, hiring — the effect is much weaker because aggregation is harder to define and Surowiecki’s conditions are harder to verify. The literature is honest about this limitation.

What should I take away if I run a team?

Three things. First, collect independent estimates before any group discussion of a quantitative question — this is the easiest evidence-based meeting practice you can adopt. Second, do not let senior people speak first when the group’s collective judgment matters. Third, for genuinely uncertain quantitative questions, consider running an internal prediction market — the infrastructure is cheap and the evidence base from Google, HP, and others is strong.

How does this connect to the rest of the replication-crisis hub?

This is the rare anti-example. Most articles in the hub document effects that did not survive replication. The wisdom of crowds did — because it has an explicit mechanism, named boundary conditions, multiple independent replications across research traditions, and an honest failure mode in Lorenz 2011. That combination is what robust behavioral science looks like. The contrast with brittle effects (power posing, ego depletion, implicit bias as a behavioral predictor) is the broader lesson of the hub.

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified behavioral economist (1 of ~1,000 worldwide). 200+ A/B tests across energy, SaaS, fintech, e-commerce, and marketplace verticals.