Replication Markets: Prediction Markets For Whether A Finding Will Replicate

Atticus Li

Dreber et al. (2015, PNAS) showed that researcher prediction markets forecast replication outcomes with roughly 71% accuracy. Camerer et al. (2018, Nature Human Behaviour) confirmed the result. The implication: experts collectively know which findings are weak, but publication systems do not reward saying so publicly. Modern markets monetize that hidden consensus.

By Atticus Li · May 26, 2026 · 18 min read

A 2015 paper in PNAS asked a question that, if you spend any time inside academic behavioral science, you have probably asked privately: do researchers themselves already know which published findings are likely to be wrong?

The answer turned out to be yes. Embarrassingly yes.

Anna Dreber, Thomas Pfeiffer, Magnus Johannesson, and colleagues ran a prediction market in which active researchers bet small amounts of real money on whether each of 41 high-profile psychology studies from the Open Science Collaboration’s Reproducibility Project would replicate. The market’s closing prices forecast actual replication outcomes with roughly 71% accuracy. Brian Nosek’s Reproducibility Project (OSC, 2015) had not yet finished its replication attempts when the market traded. By the time the replications were complete and published, the market had already separated the survivors from the casualties at well above chance.

Three years later, Colin Camerer, Dreber, and an expanded team repeated the exercise on the Social Sciences Replication Project, which covered 21 experimental papers published in Nature and Science between 2010 and 2015. Markets again forecast accurately. The pattern was robust enough to bake into the design of subsequent reform infrastructure: DARPA’s SCORE program (2018-2023) combined prediction markets and AI-assisted scoring on roughly 3,000 social-science claims, with the Replication Markets project, run by the same intellectual lineage, operating the markets.

This article walks through what those markets demonstrated, what mechanism makes them work, and the uncomfortable strategic implication for anyone evaluating research-based claims: the publication system systematically suppresses information that the research community possesses collectively. The markets are a way of reading that suppressed information at scale.

The cold-open finding: markets beat individuals at forecasting replication

Before the methodology, the headline. In Dreber et al. (2015), 92 researchers traded contracts on 41 OSC studies — the famous psychology cohort that ultimately replicated in only about 36% of cases. Each market contract paid out $1 if the corresponding replication succeeded (defined by the OSC’s pre-specified criteria) and $0 if it did not. Closing prices represented the market’s implied probability of replication.
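Why a closing price can be read as a probability follows from the payout rule. The sketch below is a minimal illustration with made-up numbers (the 30-cent price and the 45% belief are assumptions, not figures from the paper): a trader who thinks replication is more likely than the price implies expects to profit from buying, and one who thinks it is less likely expects to profit from selling, so trading pushes the price toward the point where the marginal trader is indifferent.

```python
def implied_probability(price: float, payout: float = 1.0) -> float:
    """Read a contract price as the market's probability of replication."""
    return price / payout


def expected_profit(price: float, belief: float, payout: float = 1.0) -> float:
    """Expected profit from buying one contract at `price`, given the
    trader's own probability `belief` that the study replicates."""
    return belief * payout - price


# Hypothetical numbers, not taken from the paper.
price = 0.30   # the contract currently trades at 30 cents
belief = 0.45  # this trader thinks replication is 45% likely

print(implied_probability(price))      # the market's implied probability
print(expected_profit(price, belief))  # positive, so buying looks attractive
print(expected_profit(price, 0.20))    # negative, so this trader would sell
```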

Three results matter.

First, market prices correlated with actual replication outcomes at r ≈ 0.59. The market’s predicted probabilities tracked reality. Studies the market judged unlikely to replicate mostly did not. Studies the market judged likely to replicate mostly did. Translating that into the binary forecast — would each finding replicate or not — the market got it right on 71% of studies.
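The two headline statistics are computed differently, and it helps to see them side by side. The following is a minimal sketch on eight invented studies (the prices and outcomes are hypothetical, not the Dreber data): the binary accuracy treats any closing price above 0.5 as a forecast of "will replicate," while the correlation uses the prices as continuous predictions.

```python
from math import sqrt


def binary_accuracy(prices, outcomes, threshold=0.5):
    """Share of studies where 'price above threshold' matches the outcome."""
    hits = sum((p > threshold) == bool(y) for p, y in zip(prices, outcomes))
    return hits / len(prices)


def pearson_r(xs, ys):
    """Pearson correlation; with 0/1 outcomes this is the point-biserial r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical closing prices and replication outcomes (1 = replicated).
prices = [0.82, 0.35, 0.61, 0.22, 0.74, 0.48, 0.15, 0.55]
outcomes = [1, 0, 1, 0, 0, 0, 0, 1]

print(binary_accuracy(prices, outcomes))  # 7 of 8 correct
print(pearson_r(prices, outcomes))        # positive correlation
```

The two numbers answer different questions. Accuracy only asks whether the price landed on the right side of 0.5; the correlation also rewards prices that are further from 0.5 when the market is right.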

Second, the market substantially outperformed the survey forecasts the same researchers had submitted before trading opened. When asked individually “do you think this study will replicate?”, forecasters were noisier and less calibrated. The act of putting small amounts of money on the line, and seeing what other traders were doing, sharpened the collective judgment.
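One common way to put a number on "noisier and less calibrated" is the Brier score, the mean squared gap between a probability forecast and the 0/1 outcome (lower is better). This is an illustrative choice of metric, not necessarily the one the paper reports, and the forecasts below are invented: the survey numbers hug 50%, while the market numbers commit further in either direction.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical data: 1 = replicated, 0 = did not replicate.
outcomes = [1, 0, 1, 0, 0, 0, 0, 1]
survey = [0.60, 0.50, 0.55, 0.45, 0.65, 0.50, 0.40, 0.55]  # cautious, near 50%
market = [0.82, 0.35, 0.61, 0.22, 0.74, 0.48, 0.15, 0.55]  # more decisive

print(brier_score(survey, outcomes))
print(brier_score(market, outcomes))
print(brier_score([0.5] * len(outcomes), outcomes))  # always saying 50% scores 0.25
```

In this toy example the market scores better even though it makes one confident mistake, because its confident calls are right more often than not.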

Third — and this is the result with the largest strategic implications — the market’s accuracy emerged despite individual traders being only modestly informed. Most participants were not field experts on most of the 41 papers. They were researchers in adjacent or overlapping subfields who recognized methodological smells: small samples, single-experimenter labs, conceptual replications dressed up as direct replications, p-values clustered just below 0.05, effect sizes too large to be plausible. The market aggregated those scattered, partial signals into a forecast more accurate than any individual within it.

That last finding is the one to internalize. It is not that any single researcher knew which 26 of the 41 studies were dead. It is that the community collectively encoded enough signal across its members that a properly designed aggregation mechanism could extract a coherent forecast.
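A toy simulation makes the aggregation point concrete. It is a sketch under strong assumptions, none of which come from the paper: every trader receives an independent signal that is right 60% of the time, and the "market" is simplified to a majority vote. Real traders’ signals are correlated, so the gain shown here overstates what an actual market achieves.

```python
import random


def simulate(n_studies=41, n_traders=92, signal_accuracy=0.6, seed=0):
    """Compare average individual accuracy with majority-vote accuracy.

    Assumes independent signals of equal quality (a simplification).
    """
    rng = random.Random(seed)
    # Assume roughly 40% of studies truly replicate.
    truths = [rng.random() < 0.4 for _ in range(n_studies)]
    individual_hits = 0
    crowd_hits = 0
    for truth in truths:
        # Each trader sees the truth with probability `signal_accuracy`.
        votes = [truth if rng.random() < signal_accuracy else not truth
                 for _ in range(n_traders)]
        individual_hits += sum(v == truth for v in votes)
        crowd_says_replicates = sum(votes) > n_traders / 2
        crowd_hits += crowd_says_replicates == truth
    individual_acc = individual_hits / (n_studies * n_traders)
    crowd_acc = crowd_hits / n_studies
    return individual_acc, crowd_acc


individual_acc, crowd_acc = simulate()
print(f"average individual accuracy: {individual_acc:.2f}")
print(f"majority-vote accuracy:      {crowd_acc:.2f}")
```

Lowering `n_traders` or making the errors correlated shrinks the gap, which is why the quality and diversity of the trader pool matters.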

Dreber et al. 2015 PNAS — the original demonstration

The full citation is Dreber, A., Pfeiffer, T., Almenberg, J., Isaksson, S., Wilson, B., Chen, Y., Nosek, B. A., & Johannesson, M. (2015). Using prediction markets to estimate the reproducibility of scientific research. Proceedings of the National Academy of Sciences, 112(50), 15343-15347. DOI: 10.1073/pnas.1516179112.

The design deserves a moment of attention because the methodological care is what makes the result interpretable.

The 41 studies were a subset of those targeted by the Open Science Collaboration’s Reproducibility Project: Psychology (OSC, 2015) — the project that found only 36 of 100 psychology studies showed statistically significant results in their direct replication attempts. Dreber’s team took 41 of those studies for which the replication attempts had not yet been completed or unblinded, and ran a two-stage forecasting exercise.

Stage one was a survey. Researchers were asked to estimate the probability that each study would replicate. This produced a per-study forecast aggregating their individual judgments.

Stage two was a market. The same researchers — plus additional ones recruited from the same community — received an endowment of trading credits and traded contracts on each study for two weeks. Prices floated freely; participants could buy and sell at any time.

The market’s closing prices were then compared against the OSC’s actual replication outcomes, which were not revealed to participants until trading closed. Market prices predicted those outcomes with the 71% accuracy already cited. Survey forecasts were less accurate. The market outperformed the survey aggregation by a meaningful margin even though it was drawing on the same population of forecasters.

Why? The literature on prediction markets gives several reasons. Markets force participants to put a price on belief, which disciplines hedging and clarifies private estimates. They aggregate continuously rather than as a single snapshot, allowing late information to update the consensus. They reward participants who hold non-consensus views and are correct, which incentivizes contrarian forecasters to participate when they have private signal. And they implement a real-money loss function that punishes overconfidence in both directions.
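To see how a single trade moves a price, here is a minimal sketch of one widely used automated market maker, Hanson’s logarithmic market scoring rule (LMSR). The article does not specify which trading mechanism these studies used, so treat this as an illustration of the general idea. The class name and the liquidity value of 100 are assumptions.

```python
from math import exp, log


class LMSRMarket:
    """Two-outcome market ('replicates' vs 'does not') priced by LMSR."""

    def __init__(self, liquidity: float = 100.0):
        self.b = liquidity  # larger b means prices move less per share
        self.q_yes = 0.0    # outstanding 'replicates' shares
        self.q_no = 0.0     # outstanding 'does not replicate' shares

    def _cost(self, q_yes: float, q_no: float) -> float:
        # LMSR cost function: C(q) = b * ln(exp(q_yes/b) + exp(q_no/b))
        return self.b * log(exp(q_yes / self.b) + exp(q_no / self.b))

    def price_yes(self) -> float:
        """Current price of a 'replicates' share, readable as a probability."""
        e_yes, e_no = exp(self.q_yes / self.b), exp(self.q_no / self.b)
        return e_yes / (e_yes + e_no)

    def buy_yes(self, shares: float) -> float:
        """Buy 'replicates' shares; returns what the trader pays."""
        before = self._cost(self.q_yes, self.q_no)
        self.q_yes += shares
        return self._cost(self.q_yes, self.q_no) - before

    def buy_no(self, shares: float) -> float:
        """Buy 'does not replicate' shares; returns what the trader pays."""
        before = self._cost(self.q_yes, self.q_no)
        self.q_no += shares
        return self._cost(self.q_yes, self.q_no) - before


market = LMSRMarket()
print(market.price_yes())  # starts at 0.5
paid = market.buy_yes(50)  # an optimist buys 50 shares
print(paid, market.price_yes())
market.buy_no(120)         # skeptics push back harder
print(market.price_yes())  # price is now below 0.5
```

The design choice worth noticing is that each additional share costs more than the last, so moving the price far from the consensus requires a trader to risk progressively more on being right.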

The Dreber paper is short — five pages in PNAS — and the data are available. The methodology is the kind that holds up to scrutiny precisely because it commits in advance to a single, public, binary outcome (each study either does or does not meet the OSC’s pre-specified replication criterion). There is no room for post-hoc reinterpretation.

Camerer et al. 2018 — the SSRP confirmation

If Dreber 2015 was the demonstration, Camerer et al. (2018) was the confirmation. Camerer, C. F., Dreber, A., Holzmeister, F., et al. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2(9), 637-644. DOI: 10.1038/s41562-018-0399-z.

This is the Social Sciences Replication Project (SSRP). The team selected 21 experimental social-science papers published in Nature or Science during the 2010-2015 window. They replicated each study at high statistical power — typically with sample sizes about five times the original — and they ran a prediction market and a forecasting survey on each.

The replication results, taken on their own, were sobering. Of the 21 high-profile studies, 13 (62%) replicated in the direction predicted by the original paper with statistically significant effects. Eight (38%) did not. Averaged across all 21 replications, effect sizes were approximately 50% of the originals, so the original papers substantially overestimated the magnitude of the effects they reported.

The market results layered on top of that picture made the case for prediction markets stronger. Market prices correlated with the binary replication outcomes (correlation around r ≈ 0.84 across the 21 studies, per the published figures), and the market identified most of the failures in advance. Forecasters who participated were generally researchers who had not worked directly on the studies in question, again confirming that the predictive signal came from methodological pattern-matching rather than insider information.

Combine the two papers — Dreber 2015 with 41 studies and Camerer 2018 with 21 studies — and the cumulative evidence is that researcher prediction markets are an empirically validated mechanism for forecasting which published findings will or will not survive direct replication, across two independent samples of studies, in two different subfields of behavioral science, with two separate teams of traders.

That is roughly the level of evidentiary support that the original findings being forecasted often did not themselves achieve.

The Forsell 2019 extension — Many Labs 2

A third paper extended the result to a different replication infrastructure. Forsell, E., Viganola, D., Pfeiffer, T., Almenberg, J., Wilson, B., Chen, Y., Nosek, B. A., Johannesson, M., & Dreber, A. (2019). Predicting replication outcomes in the Many Labs 2 study. Journal of Economic Psychology, 75, 102117. DOI: 10.1016/j.joep.2018.10.009.

Many Labs 2 (Klein et al., 2018) was a 28-finding multi-site replication of canonical social and cognitive psychology studies, conducted across 125 samples in 36 countries with a combined N of about 15,000 participants. It is one of the largest direct replication efforts ever conducted in psychology.

Forsell’s team ran prediction markets and surveys on whether each of the 28 findings would replicate before Many Labs 2 unblinded its results. Markets again outperformed individual forecasts. The published correlations are slightly weaker than in the original Dreber paper, partly because Many Labs 2 was studying findings on which expert opinion was already more divided, but the directional result is the same: traders systematically separated the studies that replicated from those that did not, at well above chance.

Three papers, three different samples, three independent confirmations. This is roughly what robust evidence looks like in behavioral science — a rarer condition than you might expect.

The implication: researchers privately know which findings are weak

Here is the part of the literature that is uncomfortable for the institutions involved.

If prediction markets composed of mostly non-expert researchers can forecast replication outcomes this well (71% binary accuracy in Dreber 2015, a correlation of about 0.84 with outcomes in Camerer 2018), then the collective research community already possesses substantial private information about which published findings are weak. That information exists. It is not a secret. Researchers talk about it at conferences. Journal clubs dissect papers and identify their methodological flaws. Twitter threads (or Bluesky, depending on the year) tear apart bad statistics in public.

What does not exist — and what the markets effectively work around — is a publication-system mechanism for that private information to enter the official scientific record before a formal replication is completed and published years later.

The publication system rewards positive results in high-impact journals. It does not reward writing a letter to Nature saying “I read this paper and I think it will not replicate, here is why.” Such a letter is unlikely to be published, would attract personal hostility from the original authors, and produces no measurable career return on the time spent writing it. The expected cost of saying “I think this published finding is wrong” is high; the expected return is near zero. So most researchers, most of the time, say nothing publicly even when they hold strong private beliefs.

The market mechanism inverts that. Trading anonymously with small amounts of real money has near-zero professional cost. Being correct in the market produces a measurable financial return. The market becomes a mechanism for surfacing private beliefs that the publication system would have left buried. The result is that the market price summarizes a quantity, the community’s calibrated belief about each finding’s replicability, that exists nowhere else in the scientific infrastructure in legible form.

This is the result Robin Hanson predicted in his 1995 paper Could Gambling Save Science? Encouraging an Honest Consensus, Social Epistemology, 9(1), 3-33. DOI: 10.1080/02691729508578768. Hanson argued that prediction markets on scientific claims would force researchers to put real prices on their beliefs and would surface honest consensus that the peer-review system suppresses. The Dreber and Camerer results are the empirical confirmation of Hanson’s theoretical case, a generation later.

DARPA SCORE and modern implementations

Once Dreber 2015 had been published, the question of whether the result generalized stopped being merely academic. In 2018, DARPA (yes, the military research agency) funded the Systematizing Confidence in Open Research and Evidence (SCORE) program, a multi-year effort to score the credibility of around 3,000 social and behavioral science claims drawn from the published literature. SCORE combined direct replication, structured expert surveys, prediction markets, and machine learning models trained on bibliometric and methodological features of papers.

The SCORE replication subset — about 100 papers directly replicated — produced replication rates broadly consistent with prior efforts in social science: roughly half of high-profile findings did not survive direct replication, with substantial effect-size attenuation among those that did. The market and ML components of SCORE were developed to scale credibility scoring beyond what direct replication alone could economically handle.

The Replication Markets project, run by the Center for Open Science, Replication Markets Ltd., and academic collaborators between 2019 and 2021, traded markets on roughly 3,000 social-science findings as part of SCORE. Participating researchers traded contracts on each claim. Results were used to score the credibility of the underlying papers and to validate the ML models against market-derived ground truth.

A few aggregated findings from SCORE and Replication Markets are worth holding in mind. Markets tended to agree with replications more often than they agreed with the original published p-values. Average market-implied replication probabilities across social-science claims clustered around 45-55%, meaning that even before any replication was attempted, the market was telling the field that roughly half of published claims were unlikely to survive replication. This is consistent with the direct-replication evidence from OSC 2015, SSRP, Many Labs 2, and the SCORE direct-replication sample. There is no honest reading of this corpus that arrives at the conclusion that social-science publication is reliably tracking truth at the per-paper level.

Modern implementations beyond SCORE include platforms like Metaculus, which runs forecasting tournaments on a broader set of scientific and societal questions; Manifold Markets, which hosts user-created prediction markets including some on replication outcomes; and the academic continuation of the Replication Markets program at various institutions.

The infrastructure is now mature enough that for many high-profile published findings in psychology, economics, and other social sciences, you can in principle look up an existing market-implied probability of replication. That number is often far lower than the implied confidence the original publication would suggest. Acting on that gap is the strategist’s job.

The strategist’s takeaway

If you are an executive, investor, policy designer, or product strategist who incorporates social-science research into consequential decisions, the existence of replication markets has three operational implications.

First, treat the published literature as a noisy prior, not as ground truth. Direct replication and market evidence together imply that roughly 40-60% of high-profile social-science findings will not survive rigorous replication, and that replication effect sizes average about half the originally published magnitude, with some shrinkage even among findings that do replicate. Any decision rule that weights published findings as if they were settled fact is implicitly relying on a corpus that is about as reliable as a coin flip at the per-paper level. The corpus contains real signal, but extracting it requires far more skepticism than the typical executive-summary citation pattern admits.

Second, when a published finding is load-bearing for a decision, look for whether it has been subject to direct replication, market forecasting, or both. The OSC 2015 dataset, the SSRP papers, Many Labs 1 and 2, and the SCORE/Replication Markets corpus are all publicly searchable. If your decision-relevant finding is in those datasets and the market judged it unlikely to replicate, that should substantially down-weight your reliance on it. If the finding has not been tested but is in the same family of methodologies as findings that failed (small N, single lab, conceptual rather than direct replication, effect sizes large enough to be implausible), apply a methodological prior accordingly.

Third — and this is the broader inferential move — the existence of researcher-collective-wisdom that contradicts published consensus means that, for any sufficiently high-stakes research-based claim, you should weight your belief conditional on whether the finding is one that experts would bet against in private. The published consensus and the private consensus are systematically different. The market mechanism makes the gap legible. Where the market exists, use it. Where it does not, ask whether the finding has the structural properties (small N, large effect, single experimenter, conceptual replication record) that historically separate the survivors from the casualties.

There is a real version of this discipline already practiced inside the better-run hedge funds, intelligence services, and decision-analysis firms: treat the published literature as informative but unreliable, build internal models of which claim categories are most likely to be wrong, and reserve high-confidence action for claims that survive both publication and out-of-sample stress testing. The Dreber and Camerer papers are the academic version of that workflow, applied to behavioral science. The infrastructure is now public.

The replication crisis is real. Researcher prediction markets demonstrate that the crisis was, at every point in its development, partially visible to the research community itself. The publication system failed to surface that visibility. Markets fix the surfacing problem. Reading the markets — and the structural features that produced the failures the markets correctly forecast — is one of the highest-leverage moves available to a strategist trying to use behavioral science for consequential decisions.

Sources

Dreber, A., Pfeiffer, T., Almenberg, J., Isaksson, S., Wilson, B., Chen, Y., Nosek, B. A., & Johannesson, M. (2015). Using prediction markets to estimate the reproducibility of scientific research. Proceedings of the National Academy of Sciences, 112(50), 15343-15347. DOI: 10.1073/pnas.1516179112
Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T.-H., Huber, J., Johannesson, M., Kirchler, M., Nave, G., Nosek, B. A., Pfeiffer, T., Altmejd, A., Buttrick, N., Chan, T., Chen, Y., Forsell, E., Gampa, A., Heikensten, E., Hummer, L., Imai, T., … Wu, H. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2(9), 637-644. DOI: 10.1038/s41562-018-0399-z
Forsell, E., Viganola, D., Pfeiffer, T., Almenberg, J., Wilson, B., Chen, Y., Nosek, B. A., Johannesson, M., & Dreber, A. (2019). Predicting replication outcomes in the Many Labs 2 study. Journal of Economic Psychology, 75, 102117. DOI: 10.1016/j.joep.2018.10.009
Hanson, R. (1995). Could gambling save science? Encouraging an honest consensus. Social Epistemology, 9(1), 3-33. DOI: 10.1080/02691729508578768
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. DOI: 10.1126/science.aac4716
DARPA Systematizing Confidence in Open Research and Evidence (SCORE) program. Defense Advanced Research Projects Agency, 2018-2023.
Replication Markets project. Center for Open Science / Replication Markets Ltd., 2019-2021.

Related reading

Many Labs replication projects — the multi-site direct replication infrastructure that produced the underlying replication outcomes the markets in Forsell 2019 forecast
Registered reports — the publication reform mechanism most directly suggested by the same evidentiary critique that motivates replication markets
Tetlock’s superforecasting — the parallel literature on identifying individuals and aggregation mechanisms that produce calibrated forecasts in geopolitical domains, with substantial methodological overlap with replication markets
Wisdom of crowds — the underlying statistical mechanism that makes researcher prediction markets work, and the conditions under which it breaks down

FAQ

Do replication markets work because the traders are the original authors of the studies being predicted?

No. The Dreber 2015 design explicitly excluded original authors, and most traders in the markets discussed here were researchers in adjacent or overlapping subfields who had not worked on the specific studies being forecasted. The predictive signal comes from methodological pattern-matching, not from insider knowledge of the original experiments.

If markets are this accurate, why doesn’t the publication system use them?

The publication system rewards positive results in high-impact journals, and the people who staff and run that system have careers built on the existing incentive structure. Adopting a mechanism that systematically discounts a substantial fraction of published findings creates obvious institutional friction. Reform infrastructure like registered reports, preregistration, and replication markets has expanded since 2015, but the core incentive structure of high-impact journal publishing has not fundamentally changed.

Could a market be wrong because everyone is reading the same flawed methodology cues?

Yes, in principle. If the entire community of forecasters shared a systematic bias — for example, over-discounting studies from particular labs for reputational rather than methodological reasons — the market could encode that bias. The empirical record across Dreber 2015, Camerer 2018, Forsell 2019, and the SCORE program is that markets nevertheless track actual replication outcomes well above chance, which constrains how large any such systematic bias can be. But it is not zero.

Can I trade on replication markets myself?

Some platforms (Metaculus, Manifold Markets, others) host forecasting on scientific claims. The original Dreber and Camerer markets were closed academic exercises with vetted participants. The Replication Markets project (2019-2021) was similarly restricted to invited researchers. If you are evaluating a specific published claim, the most accessible move is usually to search whether the claim has been included in any of the published replication datasets (OSC 2015, SSRP, Many Labs 1-2, SCORE) and to read whether the market-implied probability was published.

What replication rate should I assume for a high-profile social-science finding I have not specifically researched?

Approximately coin-flip, plus or minus methodological adjustments. The cumulative evidence from OSC 2015 (36% by the statistical-significance criterion), SSRP (62%), Many Labs 2 (~50%), and SCORE direct-replication subsets (~50%) places the base rate somewhere in the 40-60% range for high-profile published findings in psychology and adjacent social sciences. Individual findings deviate from that base rate in predictable directions: large N, preregistration, direct replication, and small effect sizes shift the conditional probability up. Small N, single-lab origin, conceptual replication history, and implausibly large effect sizes shift it down.
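One way to turn "base rate plus methodological adjustments" into a number is to work in odds. The sketch below starts from a 50% base rate and multiplies the odds by a likelihood ratio for each feature. Every ratio in the table is a made-up placeholder chosen only to point in the direction the answer above describes; none is an estimate from the replication literature. The features are also treated as independent, which is a simplification.

```python
from math import exp, log

# Hypothetical likelihood ratios: above 1 raises the odds of replication,
# below 1 lowers them. Placeholder values, not empirical estimates.
ADJUSTMENTS = {
    "large_n": 1.5,
    "preregistered": 2.0,
    "directly_replicated": 3.0,
    "small_effect": 1.3,
    "small_n": 0.6,
    "single_lab": 0.8,
    "conceptual_replication_only": 0.8,
    "implausibly_large_effect": 0.5,
}


def adjusted_probability(features, base_rate=0.5):
    """Shift a base replication rate by the listed features, in log-odds."""
    log_odds = log(base_rate / (1 - base_rate))
    for feature in features:
        log_odds += log(ADJUSTMENTS[feature])
    return 1 / (1 + exp(-log_odds))


print(adjusted_probability([]))                                       # base rate
print(adjusted_probability(["small_n", "implausibly_large_effect"]))  # lower
print(adjusted_probability(["large_n", "preregistered"]))             # higher
```

Working in log-odds keeps the result between 0 and 1 no matter how many adjustments stack up, which a simple "add ten points per feature" rule would not.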

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified in behavioral economics. Led 100+ in-house experiments at NRG in 2025, with project evidence and limits documented in the case studies.
