In the late 2010s, the public face of machine learning research became the leaderboard. A model was state of the art on ImageNet or it was not. A reinforcement learning agent had beaten the previous record on a suite of Atari games or it had not. A language model had achieved a new top score on GLUE or SuperGLUE or BIG-Bench or it had not. Conference acceptance at NeurIPS, ICML, and ICLR — the field’s top venues — flowed in substantial part from the existence of a number that was higher than the previous number. The leaderboard culture was efficient at one thing: producing publishable claims of incremental and occasionally non-incremental empirical progress at high frequency. It was less efficient at a second thing, which is the question of whether the published numbers were, in any meaningful sense, the result of the methods that the papers claimed to introduce.

By the second half of the decade, a community of skeptical methodologists inside machine learning had begun to assemble the empirical record on that second question. Joelle Pineau at McGill and Facebook AI Research, Peter Henderson and colleagues at McGill, Zachary Lipton at Carnegie Mellon, Jacob Steinhardt at Berkeley, Odd Erik Gundersen in Trondheim, and others published a sequence of papers between 2017 and 2021 that did to machine learning what the original reformers had done to psychology a decade earlier: they took a critical sample of the published literature, attempted to reproduce or verify the central claims, documented systematically what went wrong, and named the patterns. The diagnosis was not subtle. A meaningful fraction of state-of-the-art results in machine learning, including in the highest-prestige subfields and the highest-prestige venues, could not be cleanly reproduced from the published artifact. The reasons were structural. Papers did not release code, did not specify the hyperparameters that produced the headline number, did not report variance across random seeds, compared the new method against weak or non-standard baselines, and used compute resources that academic readers had no way to replicate. On top of these reproducibility problems sat scholarship problems in the papers’ narratives (overclaiming, speculation presented as result, missing context) that compounded the experimental issues with rhetorical ones.

The community response has been one of the more concerted institutional reform efforts that any empirical field has produced in this period. The NeurIPS reproducibility checklist, mandatory for all submissions since 2019, has been the central instrument. Papers With Code, a community-built platform that links papers to executable open-source implementations, has been the infrastructure layer. The ML Reproducibility Challenge, an annual community effort to independently reproduce headline results, has been the verification layer. The combined effect has been a measurable shift in field norms toward releasable code, multi-seed reporting, and standardized benchmark protocols. The unfinished work is large. For a strategist evaluating an “AI breakthrough” claim from a vendor, a research lab, or a press release, the practical implication is threefold. The reproducibility checklist questions are the right questions to ask. The answers indicate how much weight the headline number deserves. And the structural pattern is, by 2026, well established empirically: an AI claim that arrives without released code, seeds, hyperparameters, and standardized benchmark comparisons is substantially less likely to replicate.

This article walks through what the empirical record actually says: Pineau and the NeurIPS reproducibility program, Henderson 2018 on deep reinforcement learning that did not survive seed perturbations, Lipton and Steinhardt 2019 on scholarship issues, Gundersen and Kjensmo 2018 on the baseline state of AI reproducibility, Hutson’s 2018 Science news piece that put the issue on the public agenda, and the institutional infrastructure — checklist, Papers With Code, ML Reproducibility Challenge — that has been built in response. The strategist takeaway at the end is a short checklist for evaluating any specific AI/ML claim against the actually-known failure modes.

The Leaderboard Culture And Its Failure Modes

The structural backdrop is the rise of standardized benchmarks as the public scoring system of machine learning research. ImageNet for image classification, COCO for object detection, the Arcade Learning Environment for reinforcement learning, GLUE and SuperGLUE for natural language understanding, MS MARCO for retrieval, and dozens of domain-specific benchmarks supplied the field with a small set of standardized number-producing protocols against which any new method could be evaluated. The benchmark culture had real virtues: it produced apples-to-apples comparisons, it made empirical progress measurable, and it concentrated the field’s attention on solvable problems. It also produced a structural set of incentives toward a specific class of failure modes.

The first failure mode is code unavailability. A paper that reports a new state-of-the-art number but does not release the code to reproduce that number forces the reader to take the headline at face value. Re-implementation from the paper’s description is theoretically possible but, in practice, frequently fails to reproduce the reported number: sometimes because the paper omits implementation details that turn out to matter, sometimes because the original codebase included undocumented tricks, sometimes because the reported number reflected a specific run that subsequent runs did not match. Gundersen and Kjensmo 2018, in a survey of 400 AI papers sampled from IJCAI and AAAI between 2013 and 2016, documented that the large majority of papers did not share the source code needed to verify their headline results, and that most did not share the data either. These were papers from venues with strong methodological reputations.

The second failure mode is hyperparameter underspecification. A modern deep learning system has dozens of hyperparameters: learning rate, batch size, optimizer choice, weight decay, dropout rate, learning rate schedule, warmup steps, architecture-specific parameters, regularization choices, and training duration. The reported headline number is the result of a specific hyperparameter configuration, often arrived at through extensive search. A paper that reports the final number without reporting the full hyperparameter sweep, the search budget that was used, and the specific configuration that produced the headline forces the reader to estimate from incomplete information. The same method run with the same code but different hyperparameters, or with the reported hyperparameters but a different random seed, frequently produces a meaningfully different number, and on average the reported headline is higher than the typical reproduction.

The third failure mode is test set tuning. The discipline of machine learning rests on the train-validation-test split: the model is trained on the training set, hyperparameters are selected on the validation set, and the final number is reported on a held-out test set that the model has not seen during development. The integrity of the protocol depends on the test set actually being held out. When the same test set is used by many papers, repeatedly, as the basis for hyperparameter and architecture selection, the test set has been implicitly tuned upon, and reported test-set numbers begin to overstate the true generalization performance of the methods. The structural problem is well understood — it is a version of the multiple comparisons problem applied to benchmark culture — and it produces a slow inflation of reported benchmark scores that exceeds the actual progress in the underlying methods. The phenomenon has been documented empirically by Recht and colleagues, who built fresh test sets for ImageNet and CIFAR-10 and showed that the accuracy gap between published models on the original test sets and on the new test sets was systematic.
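The inflation mechanism can be illustrated with a small simulation. The sketch below is an illustration under stated assumptions, not a reconstruction of the Recht et al. experiments: every candidate model is given the same true accuracy, so any difference between candidates on the reused test set is pure sampling noise. The function names and the default values (200 candidates, 1,000 test examples, 70% true accuracy) are hypothetical choices made for the example.

```python
import random


def measured_accuracy(true_acc, n_examples, rng):
    """Observed accuracy on a finite test set for a model with the given true accuracy."""
    correct = sum(rng.random() < true_acc for _ in range(n_examples))
    return correct / n_examples


def reuse_inflation(n_models=200, true_acc=0.70, n_examples=1000, seed=0):
    """Return (best score on the reused test set, winner's score on a fresh test set)."""
    rng = random.Random(seed)
    # Assumption: all candidates are equally good; only test-set noise separates them.
    # Scores are drawn independently, a simplification of scoring on one shared set.
    reused_scores = [measured_accuracy(true_acc, n_examples, rng) for _ in range(n_models)]
    best_on_reused = max(reused_scores)
    # The selected model is then scored once on data nobody tuned against.
    fresh_score = measured_accuracy(true_acc, n_examples, rng)
    return best_on_reused, fresh_score


if __name__ == "__main__":
    best, fresh = reuse_inflation()
    print(f"best score on reused test set: {best:.3f}")
    print(f"same model on fresh test set:  {fresh:.3f}")
```

Under these assumptions the winning score on the reused set sits above the true accuracy even though no candidate is actually better, while the fresh-set score falls back toward it. Real benchmarks differ in that candidates do vary in quality, so the sketch isolates only the selection effect.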

The fourth failure mode is seed cherry-picking. A reinforcement learning experiment, or a deep neural network training run, has substantial run-to-run variance. The same code, the same hyperparameters, and a different random seed often produce different results — sometimes substantially different. A paper that reports a single run, or that selects the best of multiple runs, is implicitly reporting a number that is favored by the noise distribution rather than a number that represents the typical performance of the method. The Henderson 2018 paper, discussed below, made this failure mode concrete for deep reinforcement learning and showed that the headline rankings of competing methods often did not survive a proper accounting of seed variance.

The fifth failure mode is weak baselines. A new method is compared against the previous state-of-the-art. The previous state-of-the-art is often a baseline that was reported by a different group, in a different paper, possibly under a different hyperparameter configuration. The new paper compares its carefully tuned method against the previous group’s possibly-undertuned baseline and reports a gap. The gap may reflect real methodological progress, or it may reflect that the new group spent more compute on tuning their method than the previous group spent on the baseline. The Melis, Dyer, and Blunsom 2018 paper on language modeling famously showed that well-tuned LSTMs matched or exceeded the published numbers of several years’ worth of supposedly-superior neural architectures, suggesting that a substantial fraction of the reported progress in language modeling architectures during that period was an artifact of asymmetric tuning effort rather than methodological superiority.
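A minimal sketch of the asymmetric-tuning effect, assuming that each hyperparameter trial yields an independent Gaussian validation score and that the two methods are identical by construction. The function names, score distribution, and trial budgets are hypothetical; this is not the Melis et al. experimental setup.

```python
import random
import statistics


def best_of_random_search(budget, rng, mean=0.80, sd=0.02):
    """Best validation score found after `budget` random hyperparameter trials."""
    # Assumption: each trial's score is an independent Gaussian draw.
    return max(rng.gauss(mean, sd) for _ in range(budget))


def tuning_gap(new_budget=100, baseline_budget=5, repeats=500, seed=0):
    """Average apparent gap between two methods that share one score distribution."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(repeats):
        new_method = best_of_random_search(new_budget, rng)
        baseline = best_of_random_search(baseline_budget, rng)
        gaps.append(new_method - baseline)
    return statistics.mean(gaps)


if __name__ == "__main__":
    print(f"gap with 100 vs 5 trials: {tuning_gap():.4f}")
    print(f"gap with 5 vs 5 trials:   {tuning_gap(new_budget=5):.4f}")
```

With unequal budgets the "new" method shows a consistent positive gap despite having no real advantage; with equal budgets the gap averages out near zero. That is why reporting the search budget for both the method and its baselines matters.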

The sixth failure mode is undisclosed compute resources. A modern industry-scale experiment may use thousands of GPU-hours of compute, distributed across hundreds of GPUs in a single training run. A paper that reports the result without reporting the compute cost — and crucially, without reporting whether the result is dependent on that compute scale — forces academic readers and smaller labs into a position where reproduction is either impossible or requires a substantially different experimental protocol. The asymmetry has produced a structural concern that headline results from the largest industry labs are systematically less reproducible by the academic community, and that this gap is widening rather than narrowing as foundation-model scale increases.

These six failure modes are the empirical core of what the methodological community in machine learning began documenting in the second half of the 2010s. They are not the only failure modes, but they are the ones that the structural reforms have most directly targeted.

Henderson 2018: Deep Reinforcement Learning That Didn’t Matter

Peter Henderson and colleagues at McGill — including Joelle Pineau, who would go on to lead the NeurIPS reproducibility program — published “Deep Reinforcement Learning that Matters” at AAAI 2018 (DOI: 10.1609/aaai.v32i1.11694). The paper was the most rigorous public attempt to characterize the reproducibility properties of deep reinforcement learning, which was at that point one of the highest-prestige and most commercially interesting subfields of machine learning.

The setup was specific. The team selected four widely-cited deep RL algorithms — Trust Region Policy Optimization (TRPO), Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO), and Actor-Critic with Kronecker-factored Trust Region (ACKTR) — and ran them on a standard benchmark suite (continuous-control tasks from OpenAI Gym’s MuJoCo environments). They ran each algorithm under multiple random seeds, multiple hyperparameter settings, multiple reward scaling choices, multiple network architectures, and multiple existing open-source implementations of nominally-the-same algorithm.

The findings were structural. First, the variance across random seeds within a single algorithm-environment pair was large enough that a head-to-head comparison of two algorithms with a small number of seeds frequently produced a result that contradicted the result with a larger number of seeds. Second, the variance across nominally-equivalent implementations of the same algorithm was substantial: two open-source implementations of, for example, DDPG could produce meaningfully different performance numbers on the same benchmark, with the gap sometimes exceeding the gap between supposedly-different algorithms. Third, the choice of reward scaling and network architecture had effects on reported performance that were comparable in magnitude to the choice of algorithm — meaning that a paper reporting “algorithm A beats algorithm B” was often actually reporting “the implementation, reward scaling, and architecture choices in this paper for algorithm A beat the implementation, reward scaling, and architecture choices in the cited reference for algorithm B.”

The most widely cited finding of the paper is that, under the then-standard practice of comparing average performance across a small number of seeds without proper variance accounting, the published rankings of leading deep RL algorithms were not stable: re-running the comparison with proper seed accounting could and did change which algorithm ranked “best.” The paper did not claim that all deep RL results were wrong; it claimed that the methodology by which deep RL papers had been comparing methods was inadequate to support the published rankings, and that a substantial fraction of the published ranking claims should be treated as unsupported by the evidence the papers actually provided.
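The instability of small-seed rankings can be sketched with a toy simulation. It assumes two algorithms whose final returns are Gaussian with a small true difference and large run-to-run noise; the means, noise level, and seed counts are hypothetical and are not taken from the Henderson et al. experiments.

```python
import random
import statistics


def mean_score(true_mean, noise_sd, n_seeds, rng):
    """Average final return over n_seeds simulated training runs."""
    return statistics.mean([rng.gauss(true_mean, noise_sd) for _ in range(n_seeds)])


def wrong_ranking_rate(n_seeds, trials=2000, mean_a=100.0, mean_b=105.0,
                       noise_sd=30.0, seed=0):
    """Fraction of simulated comparisons in which the truly worse algorithm A looks better."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(trials):
        a = mean_score(mean_a, noise_sd, n_seeds, rng)
        b = mean_score(mean_b, noise_sd, n_seeds, rng)
        if a > b:  # B is better by construction, so this ranking is wrong
            wrong += 1
    return wrong / trials


if __name__ == "__main__":
    for n in (3, 30):
        print(f"{n:>2} seeds per algorithm: wrong ranking in {wrong_ranking_rate(n):.1%} of comparisons")
```

Under these assumptions a three-seed comparison picks the wrong winner far more often than a thirty-seed comparison, which is the practical argument for reporting more seeds together with a variance estimate.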

The Henderson paper had two effects. The first was immediate: it became the standard reference for the proposition that deep RL papers needed to report multi-seed results with proper variance accounting, that this had not been the prior norm, and that the prior norm was producing publication artifacts that did not reflect underlying methodological progress. The second was structural: the paper provided the empirical case for the field-level reproducibility reforms that Pineau and others were beginning to advocate for at NeurIPS and other venues. If even the carefully-engineered subfield of deep RL, with active and well-funded research groups and well-defined benchmarks, was producing papers whose rankings did not survive proper analysis, then the case for institutional intervention at the venue level was strong.

Lipton And Steinhardt 2019: The Scholarship Diagnosis

Where Henderson was specific to deep RL and methodological in focus, Zachary Lipton and Jacob Steinhardt’s 2019 Communications of the ACM paper “Troubling Trends in Machine Learning Scholarship” (DOI: 10.1145/3320612) was broader in scope and more sociological in framing. The paper named four patterns in published machine learning research that, in the authors’ view, were degrading the scholarly quality of the field.

The first pattern was failure to distinguish explanation from speculation. Machine learning papers commonly include sections that offer post-hoc interpretations of why a new method works — theoretical motivations, intuitions about the optimization landscape, metaphors borrowed from biology or physics. Lipton and Steinhardt observed that these sections were frequently presented as if they had the same epistemic status as the empirical results, when in fact they were unfalsifiable speculation that had not been tested against alternative explanations. The pattern was particularly common in deep learning papers, where the empirical phenomenon (the method worked) was supplied with a theoretical narrative (because of attention, because of representation learning, because of inductive bias) that the paper did not actually demonstrate was the mechanism.

The second pattern was failure to identify the sources of empirical gains. A new paper reports an improvement on a benchmark. The paper attributes the improvement to its proposed architectural change. But the paper has typically also changed the training procedure, the hyperparameter sweep, the data preprocessing, the regularization, and the evaluation protocol relative to the baseline. The headline improvement is the cumulative effect of all of these changes, but the paper attributes it specifically to the proposed novel component. Lipton and Steinhardt argued that this attribution was systematically unsupported by the experiments papers were running, and that proper ablation studies — where the proposed component is removed while everything else is held constant — were the only way to isolate the source of the gain. The honest version of many papers was “we found a configuration that worked better; we are not sure which of the changes we made is responsible”; the published version was “our novel component X is responsible for the improvement.”
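One way to make the ablation requirement concrete is to enumerate, for every change between the baseline and the full system, a configuration that reverts only that change. The sketch below assumes configurations are flat dictionaries with identical keys; the helper name and the example settings are hypothetical.

```python
def ablation_configs(baseline, full):
    """Return {changed_key: config}; each config is the full system with one change reverted."""
    if baseline.keys() != full.keys():
        raise ValueError("baseline and full configs must define the same keys")
    changed = [key for key in full if full[key] != baseline[key]]
    configs = {}
    for key in changed:
        config = dict(full)
        config[key] = baseline[key]  # revert only this one change
        configs[key] = config
    return configs


if __name__ == "__main__":
    # Hypothetical example: three things changed at once, one of them the "novel" component.
    baseline = {"architecture": "lstm", "lr_schedule": "constant",
                "augmentation": False, "optimizer": "sgd"}
    full = {"architecture": "proposed_block", "lr_schedule": "cosine",
            "augmentation": True, "optimizer": "sgd"}
    for reverted, config in ablation_configs(baseline, full).items():
        print(f"revert {reverted}: {config}")
```

Training each of these configurations and comparing it with the full system indicates how much of the headline gain each individual change accounts for. It does not capture interactions between changes, which would need a fuller factorial design.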

The third pattern was mathiness. Lipton and Steinhardt borrowed Paul Romer’s term from economics to describe machine learning papers that included formal mathematical notation that did not actually contribute to the argument — that decorated empirical results with theoretical-looking apparatus that did not constrain the claims, did not derive the empirical predictions, and did not survive translation back into plain language. The function of the mathiness was rhetorical: it lent the empirical work the appearance of theoretical depth without supplying the substance. The cost was that readers had to invest substantial effort parsing notation that, when parsed, turned out not to be load-bearing.

The fourth pattern was misuse of language. The pattern took several forms: borrowing technical terms from other fields (“attention,” “memory,” “understanding”) and using them in ways that imported unjustified connotations; describing empirical results in language that overclaimed their significance (“solves,” “achieves human-level,” “understands”); and presenting incremental engineering improvements in language that suggested fundamental conceptual progress. Lipton and Steinhardt argued that this rhetorical inflation was systematic, was rewarded by the field’s prestige economy, and was actively degrading the precision with which machine learning research could be communicated to adjacent communities (cognitive science, neuroscience, applied industry, policy).

The Lipton and Steinhardt paper was less methodologically formal than Henderson’s, but it converted the reproducibility discourse into a broader scholarship discourse. The point was not just that experimental results could not be reproduced; it was that the surrounding rhetorical apparatus of the papers was making them harder to evaluate even on their own terms. The reproducibility reforms have addressed the experimental side of the diagnosis; the scholarship side remains, by the authors’ own subsequent assessments, largely unaddressed.

The NeurIPS Reproducibility Checklist

The institutional centerpiece of the machine learning reproducibility reform has been the NeurIPS reproducibility checklist. The checklist was piloted in 2018, became mandatory for all NeurIPS submissions starting in 2019, and has subsequently been adopted in various forms by ICML, ICLR, AAAI, and a growing number of other machine learning venues. Joelle Pineau, who served as Reproducibility Chair for NeurIPS 2019, was the primary architect of the program. The 2019 program included not only the checklist but a broader reproducibility initiative: a code submission policy, a Reproducibility Challenge run in parallel with the conference, and an explicit set of criteria that reviewers were instructed to consider when evaluating papers.

Pineau and colleagues published the program’s empirical results in 2021 as “Improving reproducibility in machine learning research: A report from the NeurIPS 2019 reproducibility program” in the Journal of Machine Learning Research (volume 22, issue 164). The paper documents what the program required, what authors actually submitted, what reviewers reported, and what the measurable effect was on the rate of code release, hyperparameter specification, and other reproducibility-related practices.

The checklist itself is a structured document that authors must complete and submit with their paper. The questions cover the categories that the Henderson, Lipton-Steinhardt, and Gundersen-Kjensmo work had identified as the central failure modes. For experiments, the checklist asks whether the paper specifies the data the experiments use, the hyperparameter configuration that produced the reported numbers, the range of hyperparameters that were searched, the number of random seeds the results are averaged over, the variance or confidence intervals around the reported numbers, the compute resources used, and the runtime per experiment. For models and methods, the checklist asks for a complete description of the model, the training procedure, the evaluation protocol, and the baselines used for comparison. For code and data, the checklist asks whether the code and data are publicly available, whether they are linked from the paper, and what the license terms are. For theoretical results, the checklist asks for a clear statement of the assumptions and a complete proof.

The structural mechanism is procedural: authors fill out the checklist, the checklist is visible to reviewers, and reviewers are instructed to factor the completeness of the checklist into their evaluation. The checklist does not by itself enforce a publication standard — a paper can in principle be accepted with checklist items marked “not provided” — but it makes the reproducibility profile of every submission visible to the review committee in a way that the prior unstructured norms did not.

The Pineau et al. 2021 report documents the measurable effects. The rate of code submission with NeurIPS papers rose substantially between the pre-checklist baseline and the post-checklist period: by 2019, roughly three-quarters of accepted NeurIPS papers had associated code, up from a substantially lower baseline. The completeness of hyperparameter reporting, ablation studies, and variance reporting improved measurably. The 2019 code submission policy also set an expectation, rather than a hard requirement, that accepted papers provide code with the camera-ready version (with explicit allowances for industry-scale work with proprietary data), which produced an additional ratchet on the rate of executable artifacts being released alongside papers. The report’s overall framing is cautious: checklists are necessary but not sufficient, and the unaddressed work is large. But the empirical case that the checklist intervention measurably moved field norms in the intended direction is documented in the paper.

Subsequent NeurIPS programs have iterated on the checklist, added a separate ethics review track, and extended the reproducibility framework to cover datasets, the responsible release of capable models, and other emergent concerns. The checklist mechanism has been copied, adapted, and extended by other venues. By 2026, the major machine learning conferences operate under reproducibility regimes that are recognizably descended from the Pineau 2019 program.

Papers With Code And The Infrastructure Layer

The checklist is a policy intervention; the infrastructure intervention has been Papers With Code, a community-built platform launched in 2018 that links machine learning papers to executable open-source implementations and to standardized benchmark rankings. The platform was acquired by Facebook AI (now Meta AI) in 2019 and has since become one of the central pieces of infrastructure that the machine learning community uses to locate, compare, and verify methods.

The functional contribution of Papers With Code is twofold. First, it provides a paper-to-code linking layer that makes the actual executable artifact for a paper one click away from the citation. Historically, a reader interested in a method had to either contact the authors (with response rates that, allowing for field-level differences, resembled those Wicherts had documented for psychology) or attempt to re-implement from the paper’s description (with the success rates the methodology community had been documenting). Papers With Code converted code linking into the default rather than the exception. Authors who released code received the infrastructure benefit of being directly linked; authors who did not release code had the absence visible in the same place.

Second, Papers With Code maintains structured leaderboards for hundreds of benchmarks across machine learning subfields, with the central feature that the leaderboard entries are linked to the underlying papers and, where available, to the underlying code. The structural effect is that a claim of state-of-the-art on a specific benchmark is verifiable against a standardized ranking — and the standardized ranking is maintained by the community, not by the paper authors themselves, which removes one of the structural incentives toward unsupported state-of-the-art claims.

The platform has limitations. Coverage is uneven across subfields. Leaderboard entries depend on community submission and curation. The benchmarks themselves can become saturated, gamed, or stale, and the platform does not by itself address the test-set-tuning problem that arises when many papers compete on the same benchmark over many years. But as a piece of community infrastructure that lowers the friction of code release and verification, Papers With Code has been a substantial part of the structural reform.

The ML Reproducibility Challenge

The verification layer of the reform has been the ML Reproducibility Challenge, an annual community effort in which graduate students, independent researchers, and other volunteers attempt to independently reproduce the headline results of recently-published machine learning papers and report the outcomes. The Challenge began as the ICLR 2018 Reproducibility Challenge, was formalized into an annual cross-venue program covering NeurIPS, ICML, ICLR, AAAI, and others, and has produced a corpus of reproducibility reports that constitute the most systematic empirical record of what machine learning reproduction attempts actually find when they are tried.

The structure of the Challenge is straightforward. A pool of recently-published papers is curated. Volunteer teams select papers and attempt to reproduce the headline experimental results. The teams document what they did, what they were able to reproduce, what they were not able to reproduce, and what implementation details were missing or under-specified in the original paper. The resulting reports are published, often in venues like ReScience C, a journal dedicated to reproducibility studies.

The Challenge has two distinct values. The first is the direct verification work it produces: a paper that has been independently reproduced by a Challenge team carries, in the contemporary machine learning literature, a kind of secondary stamp of verification that goes beyond the primary publication. The second is the systematic empirical record of what reproduction attempts find. Across the annual Challenge cohorts, the published reports document a recurring set of patterns: papers that are fully reproducible from the published artifact, papers that are partially reproducible (the qualitative result holds but the precise numerical headline does not), papers that are reproducible only with substantial additional engineering effort (the code as released does not run; with three weeks of debugging, it can be made to run; once it runs, the result holds), and papers that resist reproduction in ways that suggest the published result is fragile or depends on details that were not included in the published artifact.

The cumulative result of the Challenge has been a continuous empirical update on the state of machine learning reproducibility that complements the checklist (policy) and Papers With Code (infrastructure) layers. The reform is multi-layered, and each layer addresses a different mechanism in the failure space.

The Strategist Takeaway: How To Evaluate An AI Breakthrough Claim

The practical implication for a working strategist who is evaluating an AI/ML claim — from a research lab, a vendor, a press release, a consulting deck, a startup pitch — is that the reproducibility-checklist questions are the right questions. The structural pattern that has been documented across the methodological literature is that the checklist categories predict, with reasonable signal, which claims will and will not survive independent verification.

A short evaluation protocol that follows from the empirical record:

Is the code released? If a paper claims a result but does not release the code that produced the result, the prior probability that an independent reproduction will match the headline is meaningfully lower than if the code is released. A central premise of the NeurIPS 2019 program, as reported in Pineau et al. 2021, was that this is among the highest-information questions one can ask about a machine learning paper. The answer is also the easiest to verify: either there is a working code link or there is not.

Are the hyperparameters specified? The headline number is the product of a specific hyperparameter configuration. A paper that does not report the hyperparameter configuration that produced the headline is reporting a number that the reader cannot reproduce even with the code in hand. A paper that reports the configuration but does not report the search budget that was used to arrive at it is omitting information that affects how the result should be weighted.

Are multiple seeds reported with variance? The Henderson 2018 finding is that single-seed comparisons in noisy domains do not reliably reflect the underlying ranking of methods. A claim of “method A is better than method B” that is based on a small number of seeds without variance reporting should be treated as substantially less informative than the same claim with proper multi-seed evidence.

Are the baselines standardized and well-tuned? A new method that beats a weakly-tuned baseline is reporting something different than a new method that beats a strong baseline. The Melis 2018 language modeling result is the canonical demonstration that supposed methodological progress can be substantially an artifact of asymmetric tuning effort. Papers that compare against community-standard baselines from leaderboards like Papers With Code carry more weight than papers that compare against their own re-implementations of prior work.

Is the compute disclosed? A result that depends on industry-scale compute is methodologically informative — it tells the reader something about how methods scale — but it is also a result that the academic community and the smaller-lab community cannot independently verify. The asymmetry should be priced into how the headline is interpreted. A small-compute result that beats a large-compute baseline is unambiguously interesting; a large-compute result that beats a small-compute baseline mainly tells us that more compute helped.

Is the test set genuinely held out? A result on a benchmark that has been published on for years by many groups is operating in a regime where the test set has been implicitly tuned upon. The Recht et al. work on ImageNet and CIFAR-10 fresh test sets is the empirical basis for treating saturated-benchmark state-of-the-art claims with appropriate skepticism. The interesting test set is the one nobody has seen yet.

Is the claim of mechanism backed by ablation? The Lipton-Steinhardt scholarship critique applies. A paper that attributes its empirical gain to a specific architectural component without ablating that component while holding everything else constant is making an attribution claim that the experiments do not support.

For a strategist, the cleanest decision rule is that AI/ML claims that pass the checklist categories — released code, specified hyperparameters, multi-seed variance, strong baselines, disclosed compute, valid test sets, clean ablations — should be treated as substantially better evidence than claims that fail any of those categories. The reform infrastructure of the field exists because the failure modes are real, common, and predictive. Demanding the checklist answers is not a methodological luxury; it is the minimum due diligence the published methodological literature recommends.
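The decision rule above can be written down as a small record that a reviewer fills in per claim. This is a sketch under stated assumptions: the class name, the field names, and the two tier labels are invented here for illustration and do not correspond to any official checklist format.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ClaimChecklist:
    """One yes/no answer per checklist category named in the text."""
    released_code: bool
    specified_hyperparameters: bool
    multi_seed_variance: bool
    strong_baselines: bool
    disclosed_compute: bool
    valid_test_set: bool
    clean_ablations: bool

    def failed_items(self):
        """Return the names of the categories the claim does not satisfy."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def evidence_tier(self):
        """Apply the rule: failing any category means the claim is discounted."""
        return "discount" if self.failed_items() else "stronger evidence"


# Hypothetical claim that reports a single seed and no ablations.
claim = ClaimChecklist(
    released_code=True,
    specified_hyperparameters=True,
    multi_seed_variance=False,
    strong_baselines=True,
    disclosed_compute=True,
    valid_test_set=True,
    clean_ablations=False,
)

print(claim.evidence_tier())
print(claim.failed_items())
```

Listing the failed categories alongside the tier keeps the follow-up questions to the authors or the vendor specific.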

Sources

  • Pineau, J., Vincent-Lamarre, P., Sinha, K., Larivière, V., Beygelzimer, A., d’Alché-Buc, F., Fox, E., & Larochelle, H. (2021). Improving reproducibility in machine learning research: A report from the NeurIPS 2019 reproducibility program. Journal of Machine Learning Research, 22(164), 1–20.
  • Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., & Meger, D. (2018). Deep reinforcement learning that matters. AAAI Conference on Artificial Intelligence, 32(1). DOI: 10.1609/aaai.v32i1.11694
  • Lipton, Z. C., & Steinhardt, J. (2019). Troubling trends in machine learning scholarship. Communications of the ACM, 62(6), 45–53. DOI: 10.1145/3320612
  • Hutson, M. (2018). Artificial intelligence faces reproducibility crisis. Science, 359(6377), 725–726. DOI: 10.1126/science.359.6377.725
  • Gundersen, O. E., & Kjensmo, S. (2018). State of the art: Reproducibility in artificial intelligence. AAAI Conference on Artificial Intelligence, 32(1). DOI: 10.1609/aaai.v32i1.11503
  • Melis, G., Dyer, C., & Blunsom, P. (2018). On the state of the art of evaluation in neural language models. International Conference on Learning Representations.
  • Recht, B., Roelofs, R., Schmidt, L., & Shankar, V. (2019). Do ImageNet classifiers generalize to ImageNet? International Conference on Machine Learning.

FAQ

Q: Hasn’t the rise of large foundation models from industry labs made the academic reproducibility discussion moot?

A: The opposite, if anything. The shift toward foundation-model scale has widened the gap between what the largest industry labs can produce and what the academic community can independently verify, and this gap is itself the reproducibility problem in its most acute form. A claim about emergent capabilities in a frontier model that costs tens of millions of dollars to train cannot be independently reproduced by an academic group; the most the academic community can do is evaluate the released model artifact, which is a substantially weaker form of verification than reproducing the training. The methodological community has begun developing frameworks for “downstream verification” of foundation-model claims — evaluating the released artifact on independent benchmarks, probing for specific failure modes, attempting to replicate behavioral claims — but this is acknowledged to be a weaker substitute for full reproducibility. The reform agenda for foundation models is, if anything, less mature than the reform agenda for the small-scale ML work the NeurIPS checklist was designed for.

Q: How does the ML reproducibility crisis compare to the replication crisis in psychology?

A: The structural pattern is recognizable, but the failure mechanisms are different. Psychology’s replication crisis was driven primarily by underpowered studies, p-hacking, and publication bias against null results — failure modes rooted in classical frequentist statistics applied to noisy human behavioral data. Machine learning’s reproducibility crisis is driven primarily by code unavailability, hyperparameter underspecification, seed cherry-picking, weak baselines, and test-set saturation — failure modes rooted in the engineering complexity of modern ML systems and the leaderboard culture that rewards incremental benchmark claims. The reform responses are correspondingly different: psychology has emphasized preregistration and registered reports; ML has emphasized code release, multi-seed reporting, and standardized benchmarks. Both fields converge on the structural insight that the published artifact must be sufficient for independent verification, and both fields are still some distance from that standard.

Q: Is the NeurIPS reproducibility checklist actually enforced, or is it just a box-checking exercise?

A: The honest answer is partially enforced and partially box-checking, with the proportion shifting over time. The Pineau 2021 report documents that the checklist measurably moved the rate of code submission, hyperparameter reporting, and variance reporting between the pre-checklist baseline and the post-checklist period — meaning that the reviewer-visible accountability the checklist created did affect author behavior. At the same time, the checklist does not by itself reject papers that mark items as “not provided,” and there are documented cases of accepted NeurIPS papers with substantial unaddressed reproducibility gaps. The reform mechanism is closer to “raise the floor” than “enforce a ceiling,” and the floor has been measurably raised. The next-generation reforms — pre-publication code execution by reviewers, structured artifact evaluation, separate reproducibility-track reviews — are at various stages of pilot adoption.

Q: If a vendor pitches me on an AI capability and won’t share the code, hyperparameters, or seed-variance information, what should I do?

A: Treat the demonstrated capability as the evidence and treat the methodological description as marketing. A vendor’s working product is, in itself, a form of empirical evidence: it either does the thing or it does not, and you can test that directly on your own data. The vendor’s published or pitch-deck claims about how the capability was achieved, or how it compares to alternatives, are claims that you have no basis to verify in the absence of the checklist information. The practical implication is to evaluate vendors on their working product against your specific use case rather than on their claims about the underlying method, and to weight those claims at substantially less than face value if the checklist information is not available. The growthlayer-side practice is to require a pilot on internal data with internal metrics before any large procurement decision.

Q: Does the ML reproducibility crisis mean that the published progress in AI is largely illusory?

A: No. The methodological community is careful on this point. The reproducibility crisis is a crisis of the published evidentiary apparatus — the gap between what papers claim and what their experiments support. It is not a claim that no real progress is happening. Substantial fractions of published ML results do reproduce, and the categories of progress that are robust across reproductions, standardized benchmarks, and downstream applications (large language models’ general capability scaling, transformer architectures’ broad applicability, certain reinforcement learning advances in specific domains) are real. The crisis framing is that the published rate of progress overstates the actual rate of progress by some amount; the amount is hard to estimate precisely but is non-trivial. A literature in which the published rate of progress is meaningfully inflated still contains real progress; it just requires more discounting and more verification work than a strict reading of the headline numbers would suggest.

Q: How does this affect AI safety, alignment, and policy work that depends on understanding what models can and cannot do?

A: It affects it substantially, and the AI safety community has been increasingly explicit about the methodological problem. A safety or alignment claim about model capabilities or model behaviors that depends on a published benchmark result is inheriting whatever reproducibility properties the underlying benchmark literature has. The policy implications are sharpest for claims about emergent capabilities, claims about specific behavioral risks, and claims about the effectiveness of specific alignment techniques — all of which require independent verification on multiple model artifacts, with multiple seeds, against meaningful baselines, before being treated as load-bearing inputs to safety arguments or policy proposals. The structural reform agenda for AI safety overlaps substantially with the structural reform agenda for ML reproducibility: standardized evaluations, open releases of artifacts, reproducibility checklists for safety claims, and independent verification by parties other than the original claimants.

Q: Is there a meaningful difference between reproducibility (same code on same data gives same answer) and replicability (different code or different data gives the same scientific conclusion)?

A: Yes, and the machine learning community uses the distinction (sometimes inconsistently with adjacent fields). Reproducibility in the strict sense is the computational property: rerunning the published code on the published data produces the published numbers. Replicability is the stronger scientific property: a different implementation by a different team, possibly on a different but comparable dataset, produces a result that supports the same scientific conclusion. The ML reproducibility reforms have focused primarily on the reproducibility end of the spectrum — releasing code, specifying hyperparameters, enabling computational re-execution — because that is the layer that is most amenable to infrastructure-based fixes. The replicability layer is harder and is closer to where psychology and biomedicine’s reform agendas have concentrated. For a strategist, both layers matter: a result that is not reproducible is unverifiable in the weak sense; a result that is reproducible but not replicable is verified in the computational sense but is brittle in the scientific sense. The full evaluation should ask about both.

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified behavioral economist (1 of ~1,000 worldwide). 200+ A/B tests across energy, SaaS, fintech, e-commerce, and marketplace verticals.