The DSM And Diagnostic Reliability: When The Manual Itself Replicates Poorly

Atticus Li

The DSM-5 field trials (2010-2012) tested whether two clinicians evaluating the same patient would assign the same diagnosis. Major Depressive Disorder came back with a kappa of 0.28. Generalized Anxiety Disorder came back at 0.20. The diagnostic categories that anchor insurance billing, legal proceedings, and clinical research turn out to be substantially more subjective than the system that uses them assumes.

By Atticus Li, May 26, 2026

In January 2013, the American Journal of Psychiatry published a paper by Darrel A. Regier and colleagues titled “DSM-5 Field Trials in the United States and Canada, Part II: Test-Retest Reliability of Selected Categorical Diagnoses.” The paper reported the results of a multi-site, multi-clinician study that the American Psychiatric Association had commissioned to validate the diagnostic criteria for the fifth edition of the Diagnostic and Statistical Manual of Mental Disorders, then nearing publication. The study’s design was the closest psychiatry has come to a formal replication test of its own classification system: at eleven sites across the United States and Canada, between 2010 and 2012, patients seeking mental-health care were evaluated by two different clinicians, working independently and blinded to each other’s conclusions, using the proposed DSM-5 diagnostic criteria. The question was simple. Given the same patient and the same diagnostic manual, would two clinicians arrive at the same diagnosis?

The results, summarized through the kappa statistic that measures inter-rater agreement above chance, were lower than the field had expected and lower than the equivalent reliability figures from the DSM-IV field trials of two decades earlier. Major Depressive Disorder, the most common psychiatric diagnosis in primary care and the diagnostic anchor for the majority of antidepressant prescribing, came back with a kappa of 0.28. Generalized Anxiety Disorder came back with a kappa of 0.20. By one conventional set of interpretive thresholds in the psychometric literature (kappa below 0.40 is “poor,” 0.40 to 0.59 is “fair to moderate,” 0.60 to 0.79 is “good,” 0.80 and above is “excellent”), both of those diagnoses scored in the “poor” range. Other disorders fared better: PTSD, Autism Spectrum Disorder, and ADHD reached the “good” range, and schizophrenia landed in the “fair to moderate” range. But across the diagnostic categories that account for the largest fraction of mental-health visits, prescriptions, and insurance claims, the reliability values were either lower than the DSM-IV equivalents or substantially lower than the rhetorical confidence with which DSM diagnoses are used in clinical, legal, insurance, and research contexts.

This paper did not invent the concern that psychiatric diagnosis is more subjective than its institutional users acknowledge. That concern has a long history, going back through the Rosenhan 1973 “On Being Sane in Insane Places” critique (itself a flawed but directionally influential demonstration) to the anti-psychiatry literature of the 1960s and to internal debates within the DSM revision committees of every era. What the Regier paper did was put precise numbers on the question, using the same field-trial methodology the DSM-IV had used in the early 1990s, with explicit prespecified reliability targets and a transparent statistical framework. The numbers came back lower than the targets. The publication of those numbers, in the flagship journal of American psychiatry, by the vice-chair of the DSM-5 Task Force, was the most consequential admission in the recent history of psychiatric classification that the diagnostic categories the field uses are reliability-limited in ways the institutional infrastructure does not adequately reflect.

For anyone evaluating a mental-health product claim, an employee-assistance-program metric, a healthcare-data outcome study, a legal-competence determination, or an insurance-coverage decision keyed to a DSM category, the Regier 2013 paper is a central methodological reference. The diagnoses that drive those decisions are not as inter-clinician consistent as the system treats them. The consequences ripple through every downstream measurement that takes the diagnostic category as a given. This is the story of the DSM-5 field trials, the reliability numbers they produced, the parallel institutional response by the US National Institute of Mental Health under Thomas Insel that effectively abandoned DSM categories as the framework for research, the HiTOP and RDoC dimensional alternatives that have emerged in the years since, and what a careful evaluator should do with mental-health data whose underlying categorical scaffolding is more imprecise than its appearance suggests.

The DSM-5 Field Trial Methodology

The DSM-5 field trials were designed in the late 2000s under the leadership of Darrel Regier, then the vice-chair of the DSM-5 Task Force and the executive director of the American Psychiatric Institute for Research and Education. The methodology was set out in advance, in a planning paper Regier and colleagues had published in the same journal in 2011, and was meant to be more rigorous and more transparent than the DSM-IV field trials of the early 1990s. The DSM-IV field trials, which had reported reliability kappas in the moderate-to-good range across most diagnoses, had been criticized in the intervening years for methodological choices that may have inflated the apparent reliability: clinicians sometimes worked from the same diagnostic interview rather than independently, the patient samples were sometimes filtered for diagnostic clarity, and the analytic decisions were sometimes underspecified in the published reports.

The DSM-5 field-trial design corrected those issues. Eleven academic medical centers in the United States and Canada were enrolled as sites. Each site recruited patients presenting to its mental-health services. Each patient was independently evaluated by two clinicians, working blind to each other’s diagnostic conclusions, using the proposed DSM-5 criteria as drafted for the upcoming manual. The reliability target, set in advance by the Task Force, was that diagnoses considered “well-established” should achieve kappa in the range of 0.6 to 0.8 (good to excellent), and that newer or more difficult-to-diagnose categories should achieve kappa of at least 0.4 to 0.6 (fair to good). The methodology and prespecified targets were published before the data collection began, making the field trials one of the few prospective preregistered methodological studies in the history of psychiatric classification.

The patient sample across the eleven sites totaled roughly 2,000 adults and roughly 600 children for the diagnostic categories the Task Force chose to test. The diagnoses tested were a selected set: not every DSM category was put through the field trials, because the resource and logistical demands of doing so would have been prohibitive. The Task Force chose to test a mix of the most common categories (Major Depressive Disorder, Generalized Anxiety Disorder), the categories whose criteria had changed most substantially between DSM-IV and DSM-5 (Autism Spectrum Disorder, where the change collapsed several DSM-IV subcategories into a single spectrum), and the categories most central to ongoing controversies about diagnostic boundaries (PTSD, where the criteria had been substantially revised). The selection was reasonable but did mean that the reliability figures from the field trials were not a uniform sample across the manual; certain diagnoses were tested and others were not.

The two clinicians at each site for each patient were not necessarily from the same specialty or training background. The intent was to capture the realistic situation in which a patient might be evaluated by a psychiatrist on one occasion and by a psychologist or licensed clinical social worker on another, or by clinicians of different seniority. This design choice made the field trials more ecologically valid than a study in which both raters were senior psychiatrists from the same academic program, and it correspondingly made the reliability figures harder to dismiss as artifacts of methodological narrowness. The kappa values produced by the field trials are estimates of the reliability that actually obtains in routine mental-health practice, not the reliability that obtains in an idealized research clinic.

The analytic plan, set in advance, was to compute Cohen’s kappa for each diagnostic category, with bootstrap confidence intervals around the point estimates and with subgroup analyses for the largest sites and for the most common diagnostic comparisons. The results were reported in two papers in the American Journal of Psychiatry in early 2013, with Part I (also led by Regier) describing the methodology and overall design, and Part II reporting the actual reliability figures for the selected diagnoses. Both papers were open about the methodological constraints and about the lower-than-hoped-for reliability values that some diagnoses produced. The transparency of the reporting was itself a methodological improvement on the DSM-IV field trials, which had been less explicit about which diagnoses underperformed their targets.
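As an illustration of the kind of analysis this plan describes, here is a minimal sketch of Cohen's kappa with a percentile bootstrap interval. It is not the field trials' analysis code: the counts are hypothetical, the function names `cohen_kappa` and `bootstrap_ci` are invented for this example, and the percentile method is only one of several bootstrap variants an analyst could choose.

```python
import random

def cohen_kappa(pairs):
    """Cohen's kappa for a list of (rater_a, rater_b) binary labels (0 or 1)."""
    n = len(pairs)
    p_observed = sum(a == b for a, b in pairs) / n
    p_a = sum(a for a, _ in pairs) / n  # rater A's rate of positive diagnoses
    p_b = sum(b for _, b in pairs) / n  # rater B's rate of positive diagnoses
    p_chance = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_chance == 1.0:
        return float("nan")  # undefined when both raters use a single label
    return (p_observed - p_chance) / (1 - p_chance)

def bootstrap_ci(pairs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling patients with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resample = rng.choices(pairs, k=len(pairs))
        k = cohen_kappa(resample)
        if k == k:  # drop NaN results from degenerate resamples
            estimates.append(k)
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lower, upper

# Hypothetical counts for 200 patients (NOT field-trial data):
# 40 both-yes, 30 A-only, 30 B-only, 100 both-no.
pairs = [(1, 1)] * 40 + [(1, 0)] * 30 + [(0, 1)] * 30 + [(0, 0)] * 100

kappa = cohen_kappa(pairs)
ci_low, ci_high = bootstrap_ci(pairs)
print(f"kappa = {kappa:.2f}, 95% bootstrap CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The resampling unit is the patient (a pair of ratings), not the individual rating, because the two ratings of one patient are not independent of each other.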

The Numbers That Came Back

The headline reliability figures from Regier 2013 Part II are worth listing carefully, because they get reported in subsequent secondary literature with enough variation that the source numbers are worth pinning down. For adult diagnoses, the kappa values reported were: Major Depressive Disorder 0.28, Generalized Anxiety Disorder 0.20, Schizophrenia 0.46, Schizoaffective Disorder 0.50, Bipolar I Disorder 0.56, Major Neurocognitive Disorder (dementia) 0.78, Post-Traumatic Stress Disorder 0.67, Alcohol Use Disorder 0.40, Borderline Personality Disorder 0.54, and Obsessive-Compulsive Disorder 0.31. For child and adolescent diagnoses, the values were: ADHD 0.61, Autism Spectrum Disorder 0.69, Conduct Disorder 0.46, and Oppositional Defiant Disorder 0.40.

The interpretive convention used in this piece is a common rule of thumb in the psychometric and biomedical literature: kappa below 0.40 is “poor” agreement (essentially, the agreement above what would be expected by chance is small enough that the diagnosis cannot be considered reliably reproducible across clinicians), kappa from 0.40 to 0.59 is “fair to moderate” (reproducible but with substantial inter-clinician variation), kappa from 0.60 to 0.79 is “good” (the diagnostic category is mostly reproducible across clinicians, with the residual variation reflecting genuine borderline cases), and kappa of 0.80 and above is “excellent” (the diagnostic category is highly reproducible). This is not the only scale in use: the widely cited Landis and Koch (1977) scale labels 0.21 to 0.40 “fair,” 0.41 to 0.60 “moderate,” 0.61 to 0.80 “substantial,” and 0.81 and above “almost perfect,” so the same number can carry a different verbal label depending on the source. These thresholds are conventions rather than law, and the appropriate threshold depends on what the diagnosis is being used for. But the band edges at 0.4, 0.6, and 0.8 match the ranges in which the DSM-5 Task Force framed its prespecified targets, and they allow the Regier reliability numbers to be interpreted by an outside reader.

Against those conventions, the field-trial numbers split the DSM-5 categories into three tiers. Dementia, autism spectrum disorder, PTSD, and ADHD landed in the “good” range, with kappas in the 0.6 to 0.8 band. These are the diagnostic categories where two clinicians evaluating the same patient with the DSM-5 criteria can reasonably be expected to arrive at the same diagnosis. A second tier --- schizophrenia, schizoaffective disorder, bipolar I, borderline personality, conduct disorder, oppositional defiant disorder, alcohol use disorder --- landed in the “fair to moderate” range, with kappas between 0.40 and 0.59. These are categories where two clinicians evaluating the same patient will agree more often than chance, but where a substantial fraction of patients will be classified differently by different clinicians. The third tier --- Major Depressive Disorder, Generalized Anxiety Disorder, Obsessive-Compulsive Disorder --- landed below 0.40, in the “poor” range. These are categories where two clinicians evaluating the same patient cannot reliably be expected to arrive at the same diagnosis.
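The tiering above can be reproduced mechanically. The sketch below hard-codes the kappa values exactly as listed in this article and applies the article's interpretive bands; the dictionary name `KAPPAS` and the function `band` are illustrative, and the band labels are conventions rather than anything the source paper mandates.

```python
# Kappa values as listed in this article for Regier et al. (2013), Part II.
KAPPAS = {
    "Major Depressive Disorder": 0.28,
    "Generalized Anxiety Disorder": 0.20,
    "Schizophrenia": 0.46,
    "Schizoaffective Disorder": 0.50,
    "Bipolar I Disorder": 0.56,
    "Major Neurocognitive Disorder": 0.78,
    "Post-Traumatic Stress Disorder": 0.67,
    "Alcohol Use Disorder": 0.40,
    "Borderline Personality Disorder": 0.54,
    "Obsessive-Compulsive Disorder": 0.31,
    "ADHD": 0.61,
    "Autism Spectrum Disorder": 0.69,
    "Conduct Disorder": 0.46,
    "Oppositional Defiant Disorder": 0.40,
}

def band(kappa):
    """Map a kappa value to the interpretive band used in this article."""
    if kappa < 0.40:
        return "poor"
    if kappa < 0.60:
        return "fair to moderate"
    if kappa < 0.80:
        return "good"
    return "excellent"

# Group diagnoses by band.
tiers = {}
for diagnosis, kappa in KAPPAS.items():
    tiers.setdefault(band(kappa), []).append(diagnosis)

for label in ("good", "fair to moderate", "poor"):
    print(f"{label}: {', '.join(sorted(tiers.get(label, [])))}")
```

Two diagnoses (Alcohol Use Disorder and Oppositional Defiant Disorder) sit exactly on the 0.40 boundary, so their tier depends entirely on whether the cut is inclusive, which is a small reminder of how much weight the verbal labels place on arbitrary edges.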

The MDD figure of 0.28 was the result that produced the loudest reaction inside and outside psychiatry. Major Depressive Disorder is, by a wide margin, the most commonly diagnosed psychiatric condition in primary care. It is the indication for the majority of antidepressant prescriptions. It is the diagnostic anchor for insurance reimbursement of a substantial fraction of outpatient mental-health care. It is the category most often used in epidemiological studies of mental health, in pharmaceutical clinical trials of psychiatric medications, and in disability and legal determinations. A kappa of 0.28 means that the diagnostic category that anchors all of that downstream infrastructure is, at the inter-clinician-reliability level, substantially less reproducible than the equivalent category had been in the DSM-IV field trials (where MDD had reported kappa in the 0.6 to 0.8 range).

The GAD figure of 0.20 was even lower and even more striking, though it attracted less commentary because GAD is less of a clinical workhorse than MDD. The OCD figure of 0.31 was a similar story: lower than the DSM-IV equivalent, lower than the prespecified DSM-5 target, in a category that drives a real volume of clinical and pharmaceutical activity. The combination of these three figures --- three of the most common anxiety and mood diagnoses, all in the “poor” range --- meant that the DSM-5 field trials had produced a result that was difficult to reconcile with the institutional confidence in DSM diagnostic categories.

The Regier Part II paper itself was candid about the lower-than-expected reliability values. The authors observed that the lower values, relative to DSM-IV, might reflect the more rigorous methodology rather than a genuine decline in reliability of the actual diagnostic criteria. The DSM-IV field trials had used methodological designs more likely to produce high kappa values; the DSM-5 field trials had corrected those design issues, and the kappas had come down accordingly. This interpretive framing is plausible and probably correct: it is more likely that the DSM-IV reliability figures had been inflated by methodological choices than that the underlying diagnostic categories had become less reproducible in the intervening decade. But the interpretive framing does not change the operational implication. The most rigorous available estimate of inter-clinician reliability for these diagnostic categories, produced under the field’s own preferred methodology by the field’s own DSM-5 Task Force, was substantially lower than the conventions that govern how the categories are used would suggest. The institutional infrastructure had been built on reliability assumptions that the most rigorous available evidence did not support.

What A Kappa Of 0.28 Means In Practice

The technical interpretation of kappa values is intuitive once it is unpacked. Kappa measures the agreement between two raters above what would be expected by chance alone. A kappa of 0.0 means that the two raters agree at exactly the rate that would be expected if they were assigning diagnoses at random subject to the same base rates. A kappa of 1.0 means perfect agreement. The intermediate values can be roughly translated into the practical probability that two clinicians evaluating the same patient will assign the same diagnosis, though the translation depends on the prevalence of the diagnosis in the sample.
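For a yes/no diagnosis rated by two clinicians, the definition above can be written compactly. The symbols are this sketch's own notation, not the source paper's: $p_o$ is the observed proportion of patients on whom the two raters agree, $p_e$ is the agreement expected by chance, and $p_A$ and $p_B$ are the proportions of patients that rater A and rater B, respectively, diagnose as positive.

$$
\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad p_e = p_A\,p_B + (1 - p_A)(1 - p_B)
$$

Setting $p_o = p_e$ gives $\kappa = 0$ (agreement no better than chance), and $p_o = 1$ gives $\kappa = 1$ (perfect agreement). Because $p_e$ depends on $p_A$ and $p_B$, the same kappa corresponds to different raw agreement rates at different prevalences.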

For a moderately prevalent diagnosis like MDD, a kappa of 0.28 corresponds, in rough terms, to a situation where if two clinicians independently evaluate the same patient, they will agree on the presence or absence of MDD about 64% of the time when the chance baseline (driven by the prevalence) would be around 50%, since (0.64 - 0.50) / (1 - 0.50) = 0.28. The agreement above chance is real but small. Stated differently: a substantial fraction of patients, on the order of 30% to 40% depending on the specifics, will be diagnosed differently by two different clinicians using the same DSM-5 criteria. Some patients diagnosed with MDD by Clinician A will not meet criteria for MDD when evaluated by Clinician B. Some patients judged not to have MDD by Clinician A will be diagnosed with MDD by Clinician B.
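The translation from kappa to raw agreement can be checked with a few lines of arithmetic. This is a minimal sketch under a simplifying assumption that is not taken from the field trials: both clinicians diagnose the condition at the same rate, called `prevalence` here. The function names are illustrative.

```python
def chance_agreement(prevalence):
    """Chance agreement on a yes/no diagnosis when both raters say 'yes' at the same rate."""
    return prevalence ** 2 + (1 - prevalence) ** 2

def observed_agreement(kappa, prevalence):
    """Invert kappa = (p_o - p_e) / (1 - p_e) to recover raw agreement p_o."""
    p_e = chance_agreement(prevalence)
    return p_e + kappa * (1 - p_e)

# Assumed prevalences for illustration only.
for prevalence in (0.5, 0.3):
    p_o = observed_agreement(0.28, prevalence)
    print(f"prevalence={prevalence:.0%}  agree={p_o:.1%}  disagree={1 - p_o:.1%}")

# prevalence=50%  agree=64.0%  disagree=36.0%
# prevalence=30%  agree=69.8%  disagree=30.2%
```

At a 50% chance baseline, a kappa of 0.28 implies raw agreement of about 64%. At a 30% prevalence the raw agreement rises to about 70%, but only because chance agreement is higher. Either way, the implied disagreement rate falls in the 30% to 40% range.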

This is not a result that can be straightforwardly absorbed into the clinical, insurance, legal, and research infrastructure that the diagnosis anchors. Insurance reimbursement assumes that a DSM diagnosis is a reliably reproducible classification; otherwise the audit and claims-review process loses its grounding. Clinical research that selects subjects on the basis of MDD assumes that the subjects are a homogeneous population sharing a common condition; if a third of those subjects would have been classified differently by a different clinician, the apparent homogeneity is partly an artifact of which clinicians did the diagnosis. Disability and legal determinations that rest on a psychiatric diagnosis assume that the diagnosis would have been the same if a different clinician had made it; if it would not have been, the legal and disability determinations are partly a function of which clinician the person happened to encounter. Pharmaceutical trial designs that randomize patients diagnosed with MDD to drug or placebo arms assume that the diagnostic classification is consistent across enrolled sites and clinicians; if the classification varies substantially across sites and clinicians, some of the variability in trial outcomes that gets attributed to drug effect or to placebo response or to site variation is actually diagnostic-classification variation.

The Regier paper did not argue that MDD does not exist or that the diagnostic criteria are useless. It argued that the diagnostic criteria, as operationalized in the DSM-5 field trials, produce inter-clinician reliability below the threshold that the downstream institutional infrastructure assumes. This is a softer claim than “the diagnosis is invalid” and a more operational claim than “the diagnosis is contested.” It is a claim that the gap between the actual reliability of the diagnostic category and the assumed reliability built into the institutional uses of the category is large enough to require correction at the institutional level. That correction has been slow and partial in the years since the field trials.

Thomas Insel And The NIMH RDoC Decision

The institutional response inside the National Institute of Mental Health to the DSM-5 process more broadly --- which included but was not limited to the field-trial reliability findings --- was an administrative decision that effectively rewrote how federal psychiatric research funding would be allocated. In April 2013, four months after the Regier Part II paper appeared and about a month before the DSM-5 itself was published, Thomas Insel, then the director of NIMH, posted a brief entry on the NIMH Director’s Blog titled “Transforming Diagnosis.” The post was widely reported in the psychiatric and general press, partly because of what it said and partly because of who was saying it: the director of the federal agency that funds the majority of US psychiatric research was announcing that the agency would no longer use DSM categories as the framework for its research funding decisions.

The post acknowledged that DSM categories had served an important function as a common clinical language and as the basis for insurance reimbursement, but argued that as a research framework the categories had become an obstacle. The text Insel used was direct: DSM diagnoses were “based on a consensus about clusters of clinical symptoms” rather than “any objective laboratory measure,” and the validity of the categories as targets for biological research was unproven. The NIMH would therefore pivot its research framework to the Research Domain Criteria (RDoC) project, which Insel and colleagues had been developing inside the agency since 2009. RDoC organizes psychopathology around functional dimensions --- negative valence systems, positive valence systems, cognitive systems, social processes, arousal and regulatory systems --- that map more cleanly onto neuroscience and genetics than the DSM categorical structure does. The RDoC framework specifies units of analysis at multiple levels (genes, molecules, cells, circuits, physiology, behavior, self-report) and asks researchers to design studies that span these levels rather than starting with a DSM categorical diagnosis as the inclusion criterion.

The Insel announcement was not, formally, a critique of the Regier 2013 reliability findings. The two events were independent in their genesis; RDoC had been in development for years before the field trials reported, and Insel did not cite the field-trial reliability figures directly in the blog post. But the timing and the substance were widely read as connected. The DSM-5 field trials had produced reliability numbers that were uncomfortable for psychiatric classification. The NIMH was now announcing that it would no longer organize its research portfolio around the diagnostic categories that had just been shown to have those reliability limitations. The two events together signaled that the institutional confidence in DSM categories as the basis for both clinical and research work was lower than the surface-level institutional infrastructure suggested.

Insel himself, in subsequent statements and in his 2022 book Healing, has been explicit about his evolving views on the limits of DSM-based research. The RDoC framework has been the NIMH research-funding framework ever since the 2013 transition, with NIMH grant programs explicitly favoring proposals that organize their hypotheses around RDoC domains rather than DSM categories. The framework has been controversial within psychiatry --- some researchers argue that it is too neuroscience-centric and underweights the clinical realities of psychiatric care, others argue that it has not yet produced the breakthroughs in biological understanding of mental illness that it promised --- but it remains the operational framework for federally funded US psychiatric research. The Cuthbert 2014 paper in World Psychiatry is the canonical methodological description of RDoC and its rationale; it explicitly cites diagnostic-reliability limitations as part of the motivation for the dimensional reframing.

HiTOP: The Dimensional Alternative

Parallel to the federal RDoC effort, a separate group of academic clinical psychologists and psychiatrists has been developing the Hierarchical Taxonomy of Psychopathology, or HiTOP, as a dimensional alternative to the DSM categorical structure. The canonical reference is Kotov, Krueger, Watson, Achenbach, Althoff, Bagby, et al. (2017) in the Journal of Abnormal Psychology, titled “The Hierarchical Taxonomy of Psychopathology (HiTOP): A Dimensional Alternative to Traditional Nosologies.” The HiTOP framework, like RDoC, is dimensional rather than categorical, but unlike RDoC it is derived from the empirical structure of psychopathology as observed in clinical samples (using factor-analytic and structural-equation modeling techniques on symptom data) rather than from a top-down neuroscience-informed framework.

The HiTOP framework organizes psychopathology into a hierarchy of dimensions at multiple levels of specificity. At the top of the hierarchy is a general psychopathology factor (sometimes called the “p factor” in analogy to the general intelligence factor) that captures shared variance across all forms of psychopathology. Below it are broad spectra: internalizing (which includes most depression, anxiety, and PTSD-related symptoms), externalizing (which includes substance use, antisocial behavior, and disinhibition-related symptoms, and which the Kotov 2017 model divides into disinhibited and antagonistic forms), thought disorder (which includes psychotic symptoms and related phenomena), detachment (which includes social withdrawal and related phenomena), and somatoform (which includes physical-symptom-related phenomena). Below the spectra are subfactors, and below those are individual symptom dimensions. The framework is meant to be empirically driven: as more data accumulate from clinical and epidemiological samples, the structure is updated to reflect what the data actually show about the covariance patterns of psychopathology, rather than being fixed by committee decision as the DSM categories are.

The reliability argument for HiTOP is straightforward. Dimensional measures generally have higher reliability than categorical classifications derived from the same underlying symptom data. The difference comes from information loss: when you take a continuous distribution of symptom severity and dichotomize it into “has the disorder” versus “does not have the disorder,” you discard information at the cut point, and the resulting categorical classification is more sensitive to small differences in symptom presentation near the cut point than the underlying continuous symptom data are. A patient one symptom above the cut point and a patient one symptom below the cut point are categorically different in the dichotomized framework, but their symptom profiles are nearly identical. Two clinicians evaluating those two patients can easily disagree on which side of the cut point each patient falls, even when they agree closely on the symptom severity. The HiTOP framework’s dimensional structure avoids this loss of information and the inter-clinician disagreement at the cut points that it produces.
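The information-loss argument can be illustrated with a small simulation. This is a sketch under loudly stated assumptions, not an analysis of any clinical dataset: true severity is standard normal, each clinician's rating adds independent normal noise with an assumed standard deviation of 0.5, and the diagnostic cut point is placed at the median. All names and parameter values are invented for the example.

```python
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of 0/1 labels."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (p_o - p_e) / (1 - p_e)

rng = random.Random(42)
N_PATIENTS = 20_000
RATER_NOISE_SD = 0.5  # assumed rating noise, illustrative only
CUT_POINT = 0.0       # assumed diagnostic threshold at the median

true_severity = [rng.gauss(0, 1) for _ in range(N_PATIENTS)]
rater_a = [s + rng.gauss(0, RATER_NOISE_SD) for s in true_severity]
rater_b = [s + rng.gauss(0, RATER_NOISE_SD) for s in true_severity]

# Dichotomize each rater's severity score into "has the disorder" (1) or not (0).
dx_a = [int(score > CUT_POINT) for score in rater_a]
dx_b = [int(score > CUT_POINT) for score in rater_b]

continuous_r = pearson(rater_a, rater_b)
categorical_kappa = cohen_kappa(dx_a, dx_b)
print(f"inter-rater correlation of severity scores: {continuous_r:.2f}")
print(f"inter-rater kappa of dichotomized diagnosis: {categorical_kappa:.2f}")
```

Under these assumptions the two raters' severity scores correlate at roughly 0.8, while the kappa for the dichotomized diagnosis comes out near 0.6. The raters are equally noisy in both cases; the drop comes entirely from patients near the cut point. Correlation and kappa are different statistics, so the comparison is indicative rather than exact.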

The Kotov 2017 paper, and the subsequent HiTOP-consortium publications, have argued that the dimensional reframing produces measures with substantially higher reliability than DSM categorical diagnoses of the equivalent constructs. The argument is based both on theoretical considerations (the information-loss argument above) and on empirical comparisons in clinical samples. The empirical comparisons consistently show that dimensional measures of internalizing or externalizing psychopathology have reliability values --- whether measured as inter-rater agreement, test-retest stability, or internal consistency --- in the 0.7 to 0.9 range, considerably higher than the DSM-5 categorical equivalents for many disorders in the Regier field-trial data. The HiTOP framework has been gaining adoption in research contexts, particularly in clinical-psychology research, and has begun to appear as a complement (not yet a replacement) for DSM categorical diagnoses in some clinical settings.

The institutional position of HiTOP relative to DSM and RDoC, as of the mid-2020s, is roughly as follows. DSM remains the categorical framework that anchors clinical care, insurance reimbursement, legal determinations, and most pharmaceutical-trial inclusion criteria; this institutional role has not been displaced. RDoC is the dimensional framework that anchors federally funded biological-research questions about psychopathology, with the explicit institutional position from NIMH that DSM categories are not the appropriate inclusion criteria for the kinds of mechanistic research the agency wants to fund. HiTOP is the dimensional framework derived empirically from clinical-symptom data, which is increasingly used in clinical-psychology research and which is being proposed as a complement to DSM in clinical settings, particularly for treatment-planning purposes. The three frameworks coexist in tension, with the field gradually moving toward dimensional approaches in research contexts while the categorical DSM framework continues to anchor the operational institutional infrastructure of mental-health care.

What This Means For Insurance, Legal, And Research Systems

The institutional uses of DSM diagnoses extend far beyond clinical care, and the reliability findings from Regier 2013 have implications for each of those uses. For insurance reimbursement, US health insurance billing for mental-health care is keyed to DSM diagnostic codes (in conjunction with the ICD codes that the federal billing infrastructure uses). The reimbursement system assumes that the diagnostic codes are reliably reproducible classifications that audit and claims-review processes can verify. The reliability findings imply that the audit process for mental-health diagnoses cannot reach the level of consistency that the equivalent process can reach for, say, cardiology or oncology diagnoses, where the diagnostic categories are anchored in biological measurements (lab values, imaging findings, biopsy results) that are more inter-rater reproducible than DSM symptom checklists. The mental-health insurance billing system therefore operates on a foundation that is reliability-limited in a way the rest of medical billing is not, and the audit and fraud-detection processes for mental-health billing are correspondingly more difficult.

For legal determinations, DSM diagnoses are routinely used in forensic psychiatric evaluations to determine competence to stand trial, insanity defenses, civil commitment proceedings, child custody disputes, disability determinations, and workers’ compensation claims. In each of these contexts, the legal process treats the DSM diagnosis as a finding of fact about the person being evaluated. The reliability findings imply that the diagnostic finding is, in part, a function of which clinician conducted the evaluation. For categories in the “poor” reliability range (MDD, GAD, OCD), the probability that a different clinician evaluating the same person would have arrived at the same diagnosis is substantially below the level of certainty the legal system typically associates with expert findings of fact. The forensic-psychiatry literature has been internally aware of this for decades and has developed conventions (multiple-evaluator designs, structured diagnostic interviews, explicit reliability disclosures in expert reports) to mitigate the problem, but the underlying institutional assumption that a DSM diagnosis is a reproducible classification remains embedded in the legal infrastructure.

For pharmaceutical clinical trials, DSM diagnoses are used as the standard inclusion criterion for trials of psychiatric medications. A trial of a new antidepressant enrolls patients with MDD, defined by DSM-5 criteria; a trial of a new anxiolytic enrolls patients with GAD; and so on. The reliability findings imply that the trial inclusion criterion is reliability-limited in a way that contributes to the variability of trial outcomes. Some of the variability that gets attributed to drug effect, placebo response, site effects, or patient characteristics is actually diagnostic-classification variation. Trials run at multiple sites with multiple clinicians enrolling patients on the basis of MDD criteria are, in effect, enrolling somewhat different patient populations at different sites, even after stratifying for severity and other measurable factors. This is one of the structural contributors to the difficulty of getting consistent results across antidepressant trials, alongside the publication-bias mechanisms documented in Turner 2008 and the placebo-response trends that have been observed across the antidepressant literature.

For population-level mental-health epidemiology and for the increasingly common analyses of mental-health outcomes in healthcare administrative data, the reliability findings imply that the DSM-coded population in any given dataset is a noisy approximation of the underlying clinical population. The noise is not random; it is structured by the clinicians who did the diagnosing, the settings in which the diagnosis was made, the criteria the clinicians applied, and the diagnostic conventions of the time. Healthcare-data analyses that take the DSM code at face value --- which describes essentially all administrative-data analyses --- are working with a diagnostic classification that has the reliability profile that Regier 2013 documented. The downstream conclusions of these analyses should be understood as conditional on a diagnostic classification with that reliability profile, and the precision of those conclusions should be understood as bounded above by the precision of the underlying classification.

For employee-assistance-program (EAP) metrics and workplace mental-health initiatives, the reliability issues compound with the additional difficulty that EAP-context diagnoses are often made by clinicians with less specialty training than the academic-medical-center clinicians in the field trials. EAP utilization metrics that track diagnoses, treatment plans, and outcomes assume an underlying diagnostic classification with reproducibility that may be lower than the field-trial estimates suggest. Vendor claims about EAP outcomes that hinge on diagnostic categories should be evaluated with the diagnostic-reliability constraints in mind.

Strategist Takeaway: How To Read Mental-Health Data Conditional On Reliability Limits

For a strategist evaluating mental-health products, employee mental-health programs, healthcare data on psychiatric outcomes, or research evidence keyed to DSM categories, the Regier 2013 reliability findings produce a small set of operational principles that improve the quality of inference from these data sources.

Treat DSM diagnostic categories as approximate, not exact. The diagnostic classification is a reliability-limited construct, especially for the most common mood and anxiety disorders. Any analysis that takes the DSM code as a precise classification will overstate the precision of its conclusions. Use the diagnostic category as a useful approximation to the underlying clinical state, but apply wider uncertainty bands to any conclusion that hinges on the classification.

Prefer dimensional measures of symptom severity where they are available. The PHQ-9 for depressive symptom severity, the GAD-7 for anxiety symptom severity, the PCL-5 for PTSD symptom severity, and the equivalent dimensional instruments for other conditions generally have better reliability profiles than the corresponding DSM categorical diagnoses, in large part because they avoid the information loss at the diagnostic cut point: two patients just above and just below the threshold receive different categorical labels despite nearly identical scores. When evaluating a mental-health product, an EAP, or a research study, prefer outcome measures that use dimensional symptom-severity scores rather than (or alongside) changes in categorical diagnostic status.
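The information-loss point can be made concrete with a small simulation. The sketch below is illustrative only: it assumes a hypothetical latent severity score, two raters who each observe it with independent noise, and a diagnostic cut point at the population median. None of these settings come from Regier 2013; they are chosen to show the mechanism, not to reproduce any field-trial number.

```python
import random

N_PATIENTS = 5000   # hypothetical sample size
NOISE_SD = 0.5      # assumed rater noise, relative to a severity SD of 1.0
CUTOFF = 0.0        # assumed diagnostic threshold (the population median)


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two lists of binary (True/False) labels."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    rate_a, rate_b = sum(labels_a) / n, sum(labels_b) / n
    chance = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    return (observed - chance) / (1 - chance)


rng = random.Random(0)  # fixed seed so the run is repeatable
severity = [rng.gauss(0.0, 1.0) for _ in range(N_PATIENTS)]
scores_a = [s + rng.gauss(0.0, NOISE_SD) for s in severity]
scores_b = [s + rng.gauss(0.0, NOISE_SD) for s in severity]

# Dimensional agreement: correlation of the two raters' continuous scores.
r = pearson_r(scores_a, scores_b)

# Categorical agreement: kappa after forcing each score through the cut point.
dx_a = [s >= CUTOFF for s in scores_a]
dx_b = [s >= CUTOFF for s in scores_b]
kappa = cohen_kappa(dx_a, dx_b)

print(f"continuous-score correlation: {r:.2f}")
print(f"kappa on dichotomized labels: {kappa:.2f}")
```

Under these assumptions the continuous scores track each other more closely across raters than the dichotomized labels agree, because patients near the cut point flip category on small differences in score. Correlation and kappa are not the same statistic, so the comparison is indicative rather than exact.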

Discount claims that rest on diagnostic-category outcomes more heavily than claims that rest on biological or behavioral outcomes. A claim that a mental-health intervention reduces the incidence of MDD diagnosis is a weaker claim than an equivalent claim about a cardiology intervention reducing the incidence of myocardial infarction, because the underlying diagnostic classification is less reliable. The same magnitude of effect, expressed in terms of MDD incidence reduction, should be read with wider uncertainty than the equivalent effect expressed in terms of MI incidence reduction. This is not a critique of the mental-health intervention; it is a calibration of how confident a reader should be about the magnitude of the effect.
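One way to see why the MDD and MI claims deserve different readings is a standard misclassification calculation. The sketch below assumes nondifferential error, meaning the outcome is misclassified at the same rates in treated and untreated groups, and it uses hypothetical sensitivity and specificity values. Regier 2013 reports kappa, not sensitivity or specificity, so these figures are illustrative assumptions rather than estimates for either diagnosis.

```python
def observed_rate(true_rate, sensitivity, specificity):
    """Recorded diagnosis rate: true positives plus false positives."""
    return sensitivity * true_rate + (1 - specificity) * (1 - true_rate)


def observed_risk_difference(control_rate, treated_rate, sensitivity, specificity):
    """Control-minus-treated difference in recorded diagnosis rates."""
    return (observed_rate(control_rate, sensitivity, specificity)
            - observed_rate(treated_rate, sensitivity, specificity))


# Hypothetical true incidence: 20% untreated, 10% treated (a 10-point reduction).
CONTROL_RATE, TREATED_RATE = 0.20, 0.10

# Hypothetical accuracy values, not taken from any study.
sharp_outcome = observed_risk_difference(CONTROL_RATE, TREATED_RATE, 0.98, 0.99)
noisy_outcome = observed_risk_difference(CONTROL_RATE, TREATED_RATE, 0.75, 0.90)

print(f"observed reduction, sharply classified outcome: {sharp_outcome:.3f}")
print(f"observed reduction, noisily classified outcome: {noisy_outcome:.3f}")
```

In this simple model the observed difference equals the true difference multiplied by (sensitivity + specificity - 1). The same true 10-point reduction shows up as roughly 9.7 points under the sharp classification and 6.5 points under the noisy one, so an effect reported on a less reliable category tells the reader less about the true effect behind it.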

When evaluating EAP or workplace-mental-health vendor claims, ask how the vendor’s diagnostic and outcome classifications were made. Were they made by licensed clinical staff using structured diagnostic interviews, or by less-formal screening tools applied by non-specialist staff? Were they made using dimensional symptom severity measures, or only categorical diagnostic status? Are the outcome claims expressed in dimensional or categorical terms? Vendors whose claims rest on dimensional severity changes are operating on a more reliable measurement foundation than vendors whose claims rest on diagnostic-status changes.

For research evidence on mental-health interventions, prefer studies that use dimensional outcome measures and that report inter-rater reliability for any clinician-administered measures. Modern mental-health intervention research increasingly reports both categorical and dimensional outcomes; the dimensional outcomes are generally the more reliable signal. Studies that report only categorical outcomes (e.g., “X% of patients no longer met criteria for MDD at follow-up”) are working with a categorical classification whose reliability constraints carry through into the outcome metric.

Recognize that the institutional rhetoric around DSM diagnoses outruns the underlying reliability evidence. The institutional infrastructure --- insurance, legal, clinical care, billing --- treats DSM diagnoses as more reliable than they are. The most rigorous available evidence about the reliability is the Regier 2013 field-trial data, which is publicly available and explicit about the limitations. When evaluating any claim or argument that rests on a DSM diagnostic categorization, the appropriate prior is that the underlying classification is reliability-limited in the ways the field-trial data document, and the appropriate calibration is to apply correspondingly wider uncertainty bands to any decision built on the classification.

The Regier 2013 findings, taken together with the Insel RDoC pivot and the HiTOP dimensional alternative, represent the most honest available institutional acknowledgment that psychiatric classification is more subjective than its surface-level institutional uses suggest. The careful reader, the careful evaluator, the careful strategist working with mental-health data should internalize the implications of that acknowledgment and apply them as a default calibration to any subsequent analysis. The diagnostic categories are useful approximations to clinical reality; they are not precise classifications; and treating them as precise classifications produces overconfident downstream conclusions.

Sources

[Regier, D. A., Narrow, W. E., Clarke, D. E., Kraemer, H. C., Kuramoto, S. J., Kuhl, E. A., & Kupfer, D. J. (2013). DSM-5 field trials in the United States and Canada, Part II: Test-retest reliability of selected categorical diagnoses. American Journal of Psychiatry, 170(1), 59-70. DOI: 10.1176/appi.ajp.2012.12070999](https://ajp.psychiatryonline.org/doi/10.1176/appi.ajp.2012.12070999) --- the primary paper reporting the DSM-5 field-trial reliability figures.
[Clarke, D. E., Narrow, W. E., Regier, D. A., Kuramoto, S. J., Kupfer, D. J., Kuhl, E. A., et al. (2013). DSM-5 field trials in the United States and Canada, Part I: Study design, sampling strategy, implementation, and analytic approaches. American Journal of Psychiatry, 170(1), 43-58. DOI: 10.1176/appi.ajp.2012.12070998](https://ajp.psychiatryonline.org/doi/10.1176/appi.ajp.2012.12070998) --- the companion methodology paper.
Insel, T. (2013, April 29). Transforming Diagnosis. NIMH Director’s Blog. --- the post announcing NIMH’s pivot to RDoC.
[Kotov, R., Krueger, R. F., Watson, D., Achenbach, T. M., Althoff, R. R., Bagby, R. M., et al. (2017). The Hierarchical Taxonomy of Psychopathology (HiTOP): A dimensional alternative to traditional nosologies. Journal of Abnormal Psychology, 126(4), 454-477. DOI: 10.1037/abn0000258](https://psycnet.apa.org/doi/10.1037/abn0000258) --- the canonical HiTOP framework paper.
[Cuthbert, B. N. (2014). The RDoC framework: facilitating transition from ICD/DSM to dimensional approaches that integrate neuroscience and psychopathology. World Psychiatry, 13(1), 28-35. DOI: 10.1002/wps.20087](https://onlinelibrary.wiley.com/doi/10.1002/wps.20087) --- the canonical RDoC methodology paper.
[Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174. DOI: 10.2307/2529310](https://www.jstor.org/stable/2529310) --- a widely cited source of conventional kappa-interpretation benchmarks; its own labels run from “slight” to “almost perfect” in 0.20-wide bands.
Insel, T. (2022). Healing: Our Path from Mental Illness to Mental Health. Penguin Press. --- Insel’s book-length treatment of his evolving views on psychiatric classification and research.
Frances, A. (2013). Saving Normal: An Insider’s Revolt Against Out-of-Control Psychiatric Diagnosis, DSM-5, Big Pharma, and the Medicalization of Ordinary Life. William Morrow. --- the DSM-IV Task Force chair’s critique of DSM-5 process; relevant background on the field’s internal debates about diagnostic categorization.

This article is part of an ongoing series on famous claims, frameworks, and studies that did not survive scrutiny. The mental-health and psychiatric-evidence cluster includes Rosenhan’s On Being Sane in Insane Places, the polyvagal theory critique, Turner 2008 on antidepressant publication bias, the Schachter-Singer two-factor theory of emotion, and the James-Lange theory of emotion. The full hub lives at /replication-crisis/.

If you are evaluating mental-health product claims, healthcare data, or research evidence keyed to psychiatric diagnostic categories and want a careful audit of the underlying classification reliability, book an evidence review.

FAQ

Did the Regier 2013 paper claim that DSM diagnoses are useless? No. The paper reported inter-rater reliability values for a selected set of DSM-5 diagnostic categories, found that some categories landed in the “poor” reliability range and others in “fair to moderate” or “good” ranges, and discussed the implications for the diagnostic system. The paper did not argue that the diagnostic categories were invalid or that they should not be used. It argued, implicitly, that the reliability values needed to be understood and the institutional uses of the categories needed to be calibrated to the actual reliability rather than the assumed reliability. The distinction matters: a diagnostic category can be a useful clinical and administrative construct while still being reliability-limited at the inter-clinician level. The Regier paper put the reliability question on a quantitative footing rather than the rhetorical footing it had previously occupied.

Why did MDD and GAD score so much lower than ADHD and dementia? Several plausible reasons. ADHD and dementia have more behavioral and cognitive markers that can be observed independently of the patient’s subjective report; the clinicians in the field trials therefore had more inter-rater-consistent observations to work from. MDD and GAD, by contrast, rest heavily on the patient’s subjective report of mood, anxiety, and related internal states; two clinicians evaluating the same patient on different occasions may receive somewhat different self-reports (because the patient’s mood and anxiety fluctuate), and even when they receive similar self-reports they may apply the diagnostic threshold differently because the threshold is itself fuzzy. The DSM-5 criteria for MDD and GAD also overlap substantially (both involve symptoms like sleep disturbance, concentration difficulty, fatigue), making the boundary between them inherently more difficult to draw consistently. The lower reliability for these diagnoses is partly a reflection of the inherent difficulty of categorizing internal mood and anxiety states, and partly a reflection of the criterion overlap that makes the diagnostic boundaries fuzzy.

Have the reliability findings been replicated or extended since 2013? The field trials themselves have not been formally repeated, but subsequent studies that tested inter-rater reliability of DSM-5 diagnoses in other samples have been broadly consistent with them. The general pattern (moderate-to-good reliability for some categories, lower reliability for many of the common mood and anxiety disorders) has recurred across those studies. The HiTOP empirical literature, which compares dimensional measures of psychopathology to the DSM categorical equivalents, has consistently found that the dimensional measures have higher reliability than the categorical diagnoses derived from the same underlying symptom data. The Regier findings are part of a larger empirical pattern, not an isolated result.

What is RDoC and how is it different from DSM? RDoC, the Research Domain Criteria, is a framework developed by the National Institute of Mental Health for organizing research on psychopathology around functional dimensions rather than DSM categorical diagnoses. The framework specifies six functional domains (negative valence, positive valence, cognitive systems, social processes, arousal and regulatory systems, sensorimotor systems), each with multiple constructs that can be measured at multiple units of analysis (genes, molecules, cells, circuits, physiology, behavior, self-report). The intent is to organize research questions around biology-aligned dimensions that can connect across diagnostic boundaries, rather than around DSM categories that may not correspond to natural kinds in the underlying biology. Since 2013, NIMH grant programs have favored proposals organized around RDoC. The framework has been controversial within psychiatry (some critics argue it is too neuroscience-centric), but it remains NIMH’s organizing framework for the psychiatric research it funds.

What is HiTOP and how does it relate to RDoC? HiTOP, the Hierarchical Taxonomy of Psychopathology, is a separate dimensional framework developed by an international consortium of clinical psychologists and psychiatrists, originating in the work of Roman Kotov, Robert Krueger, David Watson, and colleagues. Unlike RDoC, which is organized around neuroscience-informed functional domains, HiTOP is derived empirically from the factor structure of psychopathology as observed in clinical and epidemiological samples. The framework organizes psychopathology into a hierarchy with a general factor at the top, broad spectra (including internalizing, externalizing, thought disorder, detachment, and somatoform) below, and progressively more specific dimensions and symptoms below those. HiTOP is intended as a clinical and research framework that can be used alongside or in place of DSM categorical diagnoses. It has been gaining adoption in clinical-psychology research and is increasingly used in treatment-planning contexts. The relationship between RDoC and HiTOP is largely complementary: RDoC is more neuroscience-focused and oriented toward etiological research, while HiTOP is more clinically grounded and oriented toward classification and treatment planning.

Should insurance companies stop using DSM diagnoses for reimbursement? The Regier 2013 findings do not directly support that conclusion. DSM diagnostic codes are an administrative classification that the insurance system uses for billing, claims review, and audit. The reliability limitations of the diagnostic categories are a constraint on how precisely the billing system can verify diagnostic accuracy, not an argument that the billing system should abandon the diagnostic framework. The realistic implication is that the insurance system should incorporate the reliability constraints into its audit processes (recognizing that disagreements between clinicians on a given patient’s diagnosis are common and not necessarily evidence of fraud or error) and into its coverage decisions (recognizing that the diagnostic category is a reliability-limited proxy for the underlying clinical condition). Some insurance systems have begun to incorporate dimensional symptom measures (PHQ-9 scores, for example) into utilization-management decisions; this is a partial response to the reliability constraints, though dimensional measures bring their own administrative challenges.
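To give a sense of how common clinician disagreement would be at the reported MDD reliability, the sketch below inverts the definition of kappa. It assumes both clinicians diagnose MDD at the same base rate, and the 30% base rate is a hypothetical value chosen for illustration, not a field-trial figure. The function name `agreement_profile` is likewise invented for this example.

```python
def agreement_profile(kappa, base_rate):
    """Agreement figures implied by kappa when both raters share one base rate."""
    chance = base_rate ** 2 + (1 - base_rate) ** 2
    observed = chance + kappa * (1 - chance)      # invert kappa's definition
    disagreement = 1 - observed
    # Disagreements split evenly between (yes, no) and (no, yes) when the
    # two raters diagnose at the same rate.
    both_positive = base_rate - disagreement / 2
    return {
        "observed_agreement": observed,
        "disagreement": disagreement,
        "second_confirms_first": both_positive / base_rate,
    }


MDD_KAPPA = 0.28          # field-trial figure cited in this article
ASSUMED_BASE_RATE = 0.30  # hypothetical share of patients each clinician diagnoses

profile = agreement_profile(MDD_KAPPA, ASSUMED_BASE_RATE)
for name, value in profile.items():
    print(f"{name}: {value:.3f}")
```

Under those assumptions the two clinicians would disagree on about 30% of patients, and a second clinician would confirm a first clinician’s MDD diagnosis only about half the time. An audit process that treats every such disagreement as a red flag would flag a large share of ordinary clinical practice.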

What should a healthcare-data analyst do with DSM-coded data? Treat the DSM code as a useful approximation to the underlying clinical condition, not as a precise classification. For analyses where the diagnostic classification is central, consider whether the analysis can be supplemented by dimensional symptom measures from the same dataset (PHQ-9, GAD-7, PCL-5 scores if available), which will provide a more reliable measure of the underlying condition. For analyses where dimensional measures are not available, apply appropriately wide uncertainty bands to conclusions that hinge on the diagnostic classification. Be especially careful with analyses that compare populations defined by different DSM codes (e.g., MDD versus GAD), because the diagnostic boundaries between these conditions have lower reliability than the diagnoses themselves and the comparison may be partly confounded by classification noise. Be cautious with longitudinal analyses where diagnostic categories may have changed conventions or DSM versions across the time period; reliability constraints compound with version-change effects in these contexts.

How does this connect to the broader replication crisis in psychology and psychiatry? The diagnostic-reliability findings are a separate methodological concern from the replication problems that have affected experimental psychology and psychiatric research more broadly, but they share a common underlying theme: the institutional confidence in certain claims and constructs is higher than the most rigorous available evidence supports. The Regier 2013 findings are part of a larger pattern of psychiatric and psychological constructs being shown, under careful empirical examination, to have weaker foundations than the institutional infrastructure assumes. Other entries in this series cover related cases --- the Rosenhan critique of psychiatric diagnosis, the polyvagal theory’s empirical limitations, the publication-bias and effect-size issues in the antidepressant literature, the failures of various classic emotion theories to replicate --- each of which adds to the picture that mental-health constructs are often less well-established than they appear. The strategic implication is consistent across these cases: apply careful evidence-evaluation discipline to any claim resting on a psychiatric or psychological construct, and calibrate confidence to the actual empirical foundation rather than to the institutional rhetoric around it.


Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified in behavioral economics. Led 100+ in-house experiments at NRG in 2025, with project evidence and limits documented in the case studies.
