Hormone Replacement Therapy: The WHI Trial Reversal That Changed Women's Medicine

For more than a decade, observational studies suggested hormone replacement therapy cut postmenopausal heart disease by 40% to 50%. In July 2002, the largest randomized trial ever run on HRT was stopped early because the treatment was doing the opposite. The episode is the cleanest modern demonstration of why observational epidemiology is not a substitute for an RCT.

By Atticus Li May 25, 2026 28 min read

On 31 May 2002, the Data and Safety Monitoring Board of the Women’s Health Initiative (WHI) made one of the most consequential calls in the recent history of clinical medicine, and on 9 July 2002 the decision was announced publicly. The estrogen-plus-progestin arm of the trial --- a randomized, double-blind, placebo-controlled comparison enrolling more than 16,000 healthy postmenopausal women across 40 US clinical centers, originally planned to run until 2005 --- was halted at a planned interim analysis after a mean follow-up of 5.2 years. The reason for stopping was not that the treatment had failed to do the things it had been prescribed to do for two decades. The reason was that the treatment was doing the opposite. In the women randomized to receive conjugated equine estrogens plus medroxyprogesterone acetate, the hazard ratio for coronary heart disease was 1.29 (95% CI 1.02 to 1.63), for stroke was 1.41 (95% CI 1.07 to 1.85), for venous thromboembolism was 2.11 (95% CI 1.58 to 2.82), and for invasive breast cancer was 1.26 (95% CI 1.00 to 1.59). The test statistic for invasive breast cancer crossed its prespecified stopping boundary, and a global index intended to summarize benefit and harm indicated that risks exceeded benefits. The trial was stopped. The principal-results paper was released online with the announcement and appeared in the 17 July 2002 issue of JAMA.

The cultural shock was immediate. Hormone replacement therapy had, throughout the 1990s, become close to a default recommendation for postmenopausal women in the United States. By 2001, an estimated 15 million American women were taking it. The reasons given to patients, by physicians, by professional societies, and by direct-to-consumer marketing, were not narrowly about menopausal symptoms. They were about long-run health: HRT was supposed to prevent heart disease, stroke, osteoporosis, dementia, and possibly colorectal cancer. The evidence base for those long-run preventive claims was overwhelmingly observational --- prospective cohort studies of nurses, of physicians’ wives, of women in HMO registries --- and the observational evidence was consistent and substantial. In the Nurses’ Health Study, the largest of the cohorts, current users of postmenopausal hormones had 40% to 60% lower rates of major coronary disease than never-users, after adjustment for the obvious confounders. The 1996 update from Grodstein and colleagues in the New England Journal of Medicine put the relative risk of major coronary disease at 0.60 (95% CI 0.43 to 0.83) for current users of estrogen alone versus never-users. The number was repeated in textbooks, in editorials, in the AHA’s cardiovascular prevention guidelines, and in the consultation rooms where physicians wrote prescriptions.

The WHI was the test of that claim. It was the most expensive and most ambitious randomized trial of a preventive intervention in women’s health ever attempted. It was set up specifically because the observational evidence, however suggestive, was not enough. Five years in, the trial produced an answer that contradicted the observational literature so directly that it forced an immediate, profession-wide reversal of practice. Prescriptions for HRT in the United States dropped by roughly 50% within two years. Hormone-prescribing patterns in the UK, Australia, and most of Western Europe followed. The Million Women Study in the UK, published the next year in The Lancet, reported similar elevations in breast cancer risk in women on combined HRT and helped consolidate the new clinical posture.

For anyone whose work involves making decisions on the basis of “epidemiology shows X reduces Y” --- which describes much of public health, much of nutrition science, large portions of management research, and most of behavioral economics --- the WHI episode is one of the cleanest cautionary cases in the modern record. A treatment that the observational literature confidently endorsed, with effect sizes on the order of a 40% to 50% reduction in cardiovascular disease, with biological plausibility, with a coherent mechanism, with broad professional consensus, turned out, in a properly powered randomized trial, to cause exactly the harms it was supposed to prevent. The mechanism by which the observational evidence misled is well understood in retrospect and is structural enough that it generalizes to many other settings where investigators rely on adjusted comparisons of users versus non-users to estimate causal effects of an intervention. The mechanism is healthy-user bias. The fix is the randomized trial.

The 1990s HRT Consensus

To understand how the WHI result landed, it is necessary to recover how settled the pre-WHI consensus was. By the mid-1990s, the framework for thinking about hormone replacement was that menopause was a state of estrogen deficiency, that the deficiency had downstream consequences for cardiovascular health and bone density, and that exogenous estrogen --- either alone in women who had had a hysterectomy, or combined with a progestin in women with an intact uterus, to prevent estrogen-induced endometrial hyperplasia --- could correct the deficiency and prevent those consequences. The cardiovascular story was the strongest. Premenopausal women had lower rates of coronary heart disease than age-matched men; that gap narrowed after menopause; estrogen had favorable effects on lipid profiles, raising HDL and lowering LDL; estrogen had favorable effects on vascular endothelium in animal models; observational cohorts found lower cardiovascular event rates in hormone users. Each link in the chain looked solid. Each was being independently reinforced.

The American College of Physicians issued a 1992 clinical practice guideline recommending that all postmenopausal women be considered for hormone replacement therapy for the primary prevention of cardiovascular disease and osteoporosis, with the decision to be individualized but with the presumption tilted toward treatment. The American Heart Association did not formally recommend HRT for cardiovascular prevention in its women-specific guideline updates in the 1990s, but it acknowledged the observational evidence and treated HRT as a reasonable adjunct in women already on it for other reasons. The North American Menopause Society’s position statements through the 1990s endorsed long-term HRT for postmenopausal women with no contraindications. Conjugated equine estrogens --- the brand Premarin, marketed by Wyeth-Ayerst and derived from the urine of pregnant mares --- became one of the most prescribed drugs in the United States. The combined estrogen-progestin formulation Prempro, launched in 1995, was Wyeth’s flagship product. Prescriptions for postmenopausal hormones in the United States rose from roughly 30 million in 1990 to roughly 90 million by 2001.

The framing of HRT in the 1990s also went beyond the cardiovascular and skeletal endpoints. There were claims about cognitive protection, including reductions in Alzheimer’s disease risk in cohort studies; claims about colorectal cancer reduction; and claims about quality-of-life improvements that extended well beyond symptom relief. Direct-to-consumer marketing, particularly Wyeth campaigns descended from the “Feminine Forever” messaging of the 1960s, together with physician-targeted materials, reinforced the framing that HRT was not a treatment for hot flashes but a long-term health-maintenance regimen analogous to statins or antihypertensives. Women asked for it; physicians prescribed it; insurance covered it; the cycle compounded.

A few dissenting voices existed. Diana Petitti and others had warned through the 1990s that the observational evidence on HRT was vulnerable to confounding by characteristics that distinguished women who took hormones from those who did not. Women who initiated HRT in the 1980s and 1990s were disproportionately well-educated, white, slim, non-smokers, with private insurance and regular contact with the health system. They were, on every measurable characteristic and almost certainly on unmeasurable ones, healthier at baseline. The Heart and Estrogen/Progestin Replacement Study (HERS), a secondary-prevention randomized trial in women with existing coronary disease, had reported in 1998 that combined HRT did not reduce cardiovascular events and possibly increased them in the first year. HERS was a small enough trial and a narrow enough population that it did not move the consensus much; it was treated as a finding specific to women with established disease, not a challenge to primary prevention. The WHI was already running by then. The field waited for its result.

The Observational Evidence WHI Overturned

The single most influential study supporting HRT for cardiovascular prevention was the Nurses’ Health Study analysis published by Francine Grodstein and colleagues in the New England Journal of Medicine in 1996. The Nurses’ Health Study, run out of the Harvard School of Public Health, was a prospective cohort of US registered nurses recruited starting in 1976. By the time of the 1996 analysis, it included 59,337 postmenopausal women followed for up to 16 years, with self-reported hormone use updated biennially. The headline result was a relative risk of major coronary heart disease of 0.60 (95% CI 0.43 to 0.83) for current users of estrogen alone versus never-users, and 0.39 (95% CI 0.19 to 0.78) for current users of estrogen plus progestin. The numbers were striking. They were also consistent with a half-dozen other large prospective cohorts.

The Nurses’ Health Study investigators were not careless. The 1996 analysis adjusted for age, body mass index, smoking status, hypertension, diabetes, hypercholesterolemia, family history of myocardial infarction, age at menopause, type of menopause, past oral contraceptive use, and a range of dietary variables. The relative risk barely moved with adjustment. The field treated this as evidence that the protective effect was real: if confounding by health status were the explanation, adjustment for the major confounders should have attenuated the association substantially. It did not.

The problem, in retrospect and in the methodological literature that emerged after WHI, is that the variables a cohort study can measure --- BMI, smoking status, blood pressure, cholesterol, self-reported physical activity --- capture only a fraction of the systematic differences between women who took hormones and women who did not. Women who initiated HRT in the 1980s were the women who saw their physicians regularly, who took preventive recommendations seriously, who exercised, who ate vegetables, who paid attention to their cholesterol numbers, who got mammograms, who showed up for follow-up visits. Many of those characteristics are not captured in a cohort questionnaire, or are captured with substantial error. The cumulative effect of unmeasured differences in health-promoting behavior is what subsequent analyses called the healthy-user effect, and it can comfortably produce a 30% to 50% spurious protective association in observational data even after adjustment for all measured confounders.

The other observational evidence pointed in the same direction. The Lipid Research Clinics Follow-Up Study, the Leisure World cohort in California, the Walnut Creek Contraceptive Drug Study cohort, the Uppsala cohort in Sweden --- each reported lower cardiovascular event rates in hormone users than non-users, with adjusted relative risks generally in the range of 0.5 to 0.7. The consistency was itself treated as evidence of a real effect. The 1991 Stampfer and Colditz meta-analysis in Preventive Medicine pooled 31 observational studies and reported a summary relative risk of 0.56 for coronary heart disease in hormone users. The 1992 Grady meta-analysis in Annals of Internal Medicine, which informed the American College of Physicians guideline, came to similar conclusions and explicitly recommended HRT for cardiovascular prevention in postmenopausal women without contraindications. The observational evidence was not a single weak study. It was a literature.

What WHI Was And Why It Was Built

The Women’s Health Initiative was launched in 1991 by the National Institutes of Health, under the direction of Bernadine Healy, then the first woman director of the NIH. It was a deliberate response to the observation that most of what was known about cardiovascular disease and cancer prevention in older Americans had been learned from studies of men. The WHI was structured as a set of overlapping trials and an observational cohort, eventually enrolling 161,808 women aged 50 to 79 at 40 clinical centers across the United States between 1993 and 1998. The hormone trial was the most ambitious component. It was divided into two arms by surgical history: women with an intact uterus were randomized to combined conjugated equine estrogens (0.625 mg) plus medroxyprogesterone acetate (2.5 mg) or to placebo; women with prior hysterectomy were randomized to conjugated equine estrogens alone or to placebo. The combined-hormone arm enrolled 16,608 women. The estrogen-alone arm enrolled 10,739 women. Both arms were designed for long follow-up, with prespecified interim analyses and stopping rules, with coronary heart disease as the primary outcome and invasive breast cancer as the primary adverse outcome, and with a global index combining major outcomes to detect overall net harm or benefit.

The trial was, by any reasonable standard, well designed. It used the formulations that were actually being prescribed in US clinical practice rather than experimental compounds. It enrolled a population broadly representative of postmenopausal women on or eligible for HRT, with a mean age of 63.2 years at enrollment. It used double-blinding and placebo control. It had an independent Data and Safety Monitoring Board with prespecified stopping rules for both benefit and harm. It was funded with a budget on the order of $625 million across all WHI components. It was, in short, the test that the observational literature had earned and that the field had agreed in principle to run.

The trial enrolled its first participant in 1993. The DSMB began reviewing interim data on schedule. From early on, the hormone arm showed an excess of cardiovascular events in the treatment group during the first year of follow-up, which had also been seen in HERS and which was tentatively attributed to early prothrombotic effects of oral estrogen. The DSMB watched. The cardiovascular signal did not reverse with longer follow-up. The breast cancer signal accumulated. By the spring of 2002, the cumulative excess of invasive breast cancer in the treatment group had crossed the prespecified monitoring boundary, and the global index showed net harm. The DSMB recommended termination of the combined-hormone arm. The trial steering committee accepted the recommendation. Participants were notified by letter beginning 7 July 2002. The principal-results paper appeared in JAMA on 17 July 2002. The estrogen-alone arm was permitted to continue, since the safety signals were specific to the combination, but it was eventually stopped in February 2004, after a mean follow-up of 6.8 years, for increased stroke risk and absence of cardiovascular benefit.

The 2002 Findings In Detail

The principal-results paper reported, for the combined-hormone arm, hazard ratios with nominal 95% confidence intervals as follows. Coronary heart disease: 1.29 (1.02 to 1.63). Invasive breast cancer: 1.26 (1.00 to 1.59). Stroke: 1.41 (1.07 to 1.85). Pulmonary embolism: 2.13 (1.39 to 3.25). Colorectal cancer: 0.63 (0.43 to 0.92). Endometrial cancer: 0.83 (0.47 to 1.47). Hip fracture: 0.66 (0.45 to 0.98). Death from any cause: 0.98 (0.82 to 1.18). The global index, which weighted the major outcomes, showed a hazard ratio of 1.15 (1.03 to 1.28) for harm.
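One way to read these intervals is to convert each one into an approximate z-score. The sketch below is illustrative only: the function name is hypothetical, and it assumes each nominal interval is symmetric on the log scale, as is usual for hazard ratios from a Cox model. The published intervals are rounded, so the results are rough.

```python
import math


def z_from_interval(hr, lower, upper, z_crit=1.96):
    """Approximate z-score for a hazard ratio from its nominal 95% interval.

    Assumes the interval is symmetric around log(hr), so its width on the
    log scale equals 2 * z_crit * standard error.
    """
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    return math.log(hr) / se


# Hazard ratios and nominal 95% intervals quoted in the paragraph above.
results = {
    "coronary heart disease": (1.29, 1.02, 1.63),
    "stroke": (1.41, 1.07, 1.85),
    "colorectal cancer": (0.63, 0.43, 0.92),
    "death from any cause": (0.98, 0.82, 1.18),
}

for outcome, (hr, lower, upper) in results.items():
    z = z_from_interval(hr, lower, upper)
    print(f"{outcome:24s} HR {hr:.2f}  z = {z:+.2f}")
```

A z-score beyond roughly 2 in either direction corresponds to an interval that excludes 1.0. By this reading the coronary signal is clear of the null but not by a wide margin, and the all-cause mortality estimate is essentially indistinguishable from no effect.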

The translation into absolute risk over one year, per 10,000 women on combined HRT versus placebo, was: 7 more coronary heart disease events, 8 more strokes, 8 more pulmonary emboli, 8 more invasive breast cancers, 6 fewer colorectal cancers, 5 fewer hip fractures. The numbers were not enormous in any single category. The combination, across multiple categories, was unequivocally net harmful for a treatment being given for primary prevention. The cardiovascular finding was the most consequential because cardiovascular benefit had been the principal rationale for long-term HRT and because the observational literature had pointed in the opposite direction with high confidence.
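The arithmetic behind "not enormous in any single category" can be made explicit. The sketch below uses only the per-10,000 figures quoted above; the function names are hypothetical, and the simple tally is not the trial's global index, which was defined differently and included other outcomes.

```python
# Extra events per 10,000 women per year on combined HRT versus placebo,
# copied from the paragraph above (negative = fewer events on treatment).
excess_per_10k = {
    "coronary heart disease": 7,
    "stroke": 8,
    "pulmonary embolism": 8,
    "invasive breast cancer": 8,
    "colorectal cancer": -6,
    "hip fracture": -5,
}


def tally(excess):
    """Return (extra harmful events, events avoided, net extra events)."""
    harms = sum(v for v in excess.values() if v > 0)
    avoided = -sum(v for v in excess.values() if v < 0)
    return harms, avoided, harms - avoided


def women_per_event(excess_value):
    """Women treated for one year per one additional (or avoided) event."""
    return 10_000 / abs(excess_value)


harms, avoided, net = tally(excess_per_10k)
print(f"extra harmful events: {harms}, events avoided: {avoided}, net: {net}")
print(f"women treated per extra coronary event: {women_per_event(7):.0f}")
```

On these figures, roughly 1,400 women would need to take the combined regimen for a year to produce one additional coronary event. That is small for an individual and large for a regimen taken by millions of healthy women.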

The professional reaction was unusually swift. The National Heart, Lung, and Blood Institute issued a press release on the day of publication recommending that the combined regimen not be prescribed for primary prevention of chronic disease. The North American Menopause Society revised its position statement within months. Insurers reviewed coverage. Prescribing patterns shifted within weeks. The number of dispensed prescriptions for Prempro in the United States fell from roughly 22 million in 2001 to roughly 7 million by 2004. The number of dispensed prescriptions for any hormone product fell by roughly 50% in the same period. Breast cancer incidence in the United States, particularly hormone-receptor-positive breast cancer in women aged 50 to 69, declined measurably starting in 2003, in a temporal pattern that the SEER program analyses (Ravdin et al., NEJM, 2007) attributed primarily to the population-level reduction in HRT exposure.

Healthy-User Bias As The Culprit

The post-WHI literature on why the observational evidence had been so misleading converged on a single structural answer. Women who initiated HRT in the 1980s and 1990s differed systematically from women who did not, on a long list of characteristics that affect cardiovascular risk, and the differences were not eliminated by statistical adjustment for the variables that cohort studies usually measure. The clearest single demonstration came from a reanalysis of the Nurses’ Health Study data published by Hernán and colleagues in Epidemiology in 2008. They restricted the analysis to women who had not used hormones in the preceding two years, compared initiators with non-initiators, and used what are now called target-trial emulation methods to estimate effects under a protocol modeled on the WHI. Analyzed that way, the Nurses’ Health Study no longer showed a clear protective effect of HRT on coronary heart disease. The estimate moved from a substantial protective association to one compatible with the WHI null-to-harm finding. The implication was that the Nurses’ Health Study had not been measuring a treatment effect; it had been measuring a selection effect, dressed up as a treatment effect by inadequate adjustment.

The structural problem can be stated cleanly. Observational comparisons of users versus non-users of an intervention will be unbiased estimates of the causal effect of that intervention only if treatment is independent of the potential outcomes given the measured covariates, a condition known as conditional exchangeability. In plain language: after adjusting for what you measured, the only systematic difference between users and non-users must be the treatment itself. In practice, when the treatment is a behavioral choice that correlates with general health-orientation --- as HRT initiation did, as vitamin supplementation does, as preventive screening does, as much of nutrition behavior does --- the conditional exchangeability assumption fails. The user group is systematically healthier on unmeasured dimensions. The “treatment effect” estimated from observational data is the sum of the actual treatment effect and the selection effect, and the selection effect can comfortably dominate.
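A toy simulation makes the mechanism concrete. Everything in the sketch below is an assumption chosen for illustration, not an estimate from any cohort: a single unmeasured "health-oriented" trait, the baseline risks, the uptake probabilities, and a true treatment effect that raises risk by 30%. The only difference between the two runs is whether women choose treatment or a coin flip assigns it.

```python
import random

# Hypothetical parameters, chosen only for illustration.
TRUE_RISK_RATIO = 1.3                  # treatment truly raises event risk by 30%
P_HEALTH_ORIENTED = 0.5                # share of women with the unmeasured trait
BASE_RISK = {True: 0.03, False: 0.12}  # untreated event risk, by trait
P_UPTAKE = {True: 0.8, False: 0.2}     # chance of choosing treatment, by trait


def risk_ratio(n, randomized, seed=0):
    """Simulate n women; return event risk in treated divided by untreated."""
    rng = random.Random(seed)
    events = [0, 0]  # index 0 = untreated, index 1 = treated
    counts = [0, 0]
    for _ in range(n):
        health_oriented = rng.random() < P_HEALTH_ORIENTED
        # In the trial a coin flip assigns treatment; otherwise the trait drives it.
        p_treat = 0.5 if randomized else P_UPTAKE[health_oriented]
        treated = int(rng.random() < p_treat)
        risk = BASE_RISK[health_oriented] * (TRUE_RISK_RATIO if treated else 1.0)
        counts[treated] += 1
        if rng.random() < risk:
            events[treated] += 1
    return (events[1] / counts[1]) / (events[0] / counts[0])


naive_rr = risk_ratio(200_000, randomized=False)
trial_rr = risk_ratio(200_000, randomized=True)
print(f"users vs non-users (observational): {naive_rr:.2f}")
print(f"treated vs placebo (randomized):    {trial_rr:.2f}")
```

With these made-up inputs the expected observational ratio is about 0.61 (a risk of 0.0624 among users against 0.102 among non-users), even though the treatment is harmful by construction. Randomization recovers a ratio near 1.3 because the coin flip breaks the link between the trait and the treatment.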

This was not a new diagnosis. Petitti and Freedman, writing in the American Journal of Epidemiology in 2005 in a paper titled “How far can epidemiologists get with statistical adjustment?”, argued that the WHI episode was the cleanest possible demonstration that adjustment for measured confounders is not, in general, sufficient to estimate causal effects from observational data, and that the alternative interpretation --- that the WHI population was somehow unlike the observational cohorts --- did not hold up under examination. Their conclusion was that observational epidemiology is a hypothesis-generating tool and that for any high-stakes preventive intervention, the appropriate test is a randomized trial.

The mechanism is general. It is not specific to HRT and not specific to medicine. Wherever an intervention is taken up voluntarily by people who differ systematically from those who do not take it up, observational comparisons will mis-estimate the causal effect. Vitamin E supplementation, beta-carotene supplementation, vitamin D supplementation, hormone-receptor screening, low-fat diet adherence --- the list of cases where large observational cohorts suggested benefits that randomized trials subsequently failed to confirm or reversed is long enough that the pattern is no longer surprising. It is structural.

Manson 2013 And 2017: Timing And Long-Run Mortality

The story did not end in 2002. The WHI investigators followed the trial cohorts for years after the interventions ended. The 2013 reanalysis by JoAnn Manson and colleagues in JAMA reported intervention-phase and post-intervention findings for both the combined-hormone arm and the estrogen-alone arm, with cumulative follow-up of 13 years. Several refinements emerged. First, the cardiovascular harm signal from combined HRT attenuated after the intervention was stopped; the cumulative hazard ratios remained elevated for the intervention phase but did not continue to diverge afterward. Second, the breast cancer excess in the combined arm persisted into the post-intervention phase. Third, the estrogen-alone arm showed a more favorable profile than the combined arm, with no excess coronary heart disease risk and a reduction in invasive breast cancer over long-term follow-up, although the increased stroke risk that had stopped that arm remained part of the picture.

Fourth, and most consequential for clinical practice, was the timing finding. When the cohort was stratified by age at hormone initiation and by years since menopause, women who initiated HRT within ten years of menopause showed substantially different risk-benefit profiles than women who initiated it later. For women aged 50 to 59 at randomization, the absolute risk differences for most outcomes were smaller, and in some analyses estrogen-alone showed a non-significant trend toward cardiovascular benefit. This became the basis for what is now called the “timing hypothesis”: that HRT initiated early in menopause may have different vascular effects than HRT initiated in women who have already developed substantial atherosclerotic disease. The hypothesis is not a vindication of the pre-WHI consensus. The observational literature had not stratified by timing in this way; the claimed cardiovascular protection had been generic. The timing-stratified WHI results are a refinement, and they do not support primary prevention of cardiovascular disease as an indication for HRT in any subgroup. They do reframe the clinical risk-benefit calculation for short-term use of HRT for symptom relief in early menopause, which is now the dominant clinical use.

The 2017 Manson et al. analysis in JAMA reported all-cause and cause-specific mortality with cumulative follow-up of 18 years. Across both trial arms combined, there was no statistically significant difference in all-cause mortality between hormone-treated and placebo groups. This was, in a sense, reassuring news after a decade of revised practice. It was not, however, evidence that long-term HRT for primary prevention is beneficial. It was evidence that the net effects of the regimen, integrated over many years and many causes of death, are approximately a wash, with elevated risks in some categories and reduced risks in others. The 2017 paper does not undo the 2002 finding. It contextualizes it.

The current clinical posture, codified in the 2022 North American Menopause Society position statement and largely consistent across major societies, is that HRT is indicated for treatment of moderate-to-severe menopausal symptoms in women without contraindications, that dose, route, and duration should be individualized and periodically re-evaluated, that initiation within ten years of menopause has a more favorable risk-benefit profile than later initiation, that transdermal estrogen is preferred over oral when feasible due to lower thrombotic risk, and that HRT is not recommended for the primary prevention of chronic disease. The contrast with the 1990s consensus is total.

The Strategist’s Lesson: RCT Versus Observational Evidence

For strategists, decision-makers, and researchers in fields outside medicine, the WHI episode is one of the most useful case studies in the modern literature. The lesson is not that observational evidence is useless. The lesson is that observational evidence has a structural vulnerability that randomized trials do not have, and the vulnerability is large enough to flip the sign of an estimated effect. When you see a claim of the form “studies show that X reduces Y” --- where X is something people choose to do and Y is a health, business, or behavioral outcome --- the immediate question is whether the studies are observational or randomized. If they are observational, the next question is whether the people who chose X differ systematically from the people who did not, on dimensions that affect Y. If they do, the estimated effect is some combination of the actual effect of X and the selection effect of who chooses X, and the selection effect can be larger than any real effect. Statistical adjustment helps but does not fully solve the problem, because adjustment is only as good as the measurements, and consequential confounders are routinely unmeasured.
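The size of the problem can be checked on the back of an envelope with the standard external-adjustment formula for a single unmeasured binary confounder. The sketch below is a simplification under stated assumptions: one trait, no interaction between trait and treatment on the ratio scale, and a hazard ratio treated as a risk ratio. The function names and all input values are hypothetical.

```python
def bias_factor(p_users, p_nonusers, rr_trait):
    """Factor by which an unmeasured binary trait distorts a user/non-user ratio.

    p_users, p_nonusers: prevalence of the trait among users and non-users.
    rr_trait: risk ratio for the outcome, trait present versus absent.
    """
    return (p_users * (rr_trait - 1) + 1) / (p_nonusers * (rr_trait - 1) + 1)


def observed_rr(true_rr, p_users, p_nonusers, rr_trait):
    """Ratio an unadjusted comparison would show, given the true effect."""
    return true_rr * bias_factor(p_users, p_nonusers, rr_trait)


# Hypothetical and deliberately strong: a protective trait (risk ratio 0.25)
# present in 80% of users but only 20% of non-users.
apparent = observed_rr(true_rr=1.29, p_users=0.8, p_nonusers=0.2, rr_trait=0.25)
print(f"true ratio 1.29, apparent ratio {apparent:.2f}")
```

Under these assumed inputs a truly harmful ratio of 1.29 would appear as roughly 0.61. A single trait this strong and this unevenly distributed is not realistic. The point is that several weaker unmeasured traits acting in the same direction can add up to a distortion of similar size, which is why the first question to ask is who chooses X.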

The cases where this matters in non-medical decision-making are abundant. Observational evidence that companies that adopt a particular management practice outperform those that do not is contaminated by selection on the kind of company that adopts the practice. Observational evidence that employees who use a particular productivity tool are more productive is contaminated by selection on the kind of employee who adopts the tool. Observational evidence that customers who engage with a particular feature have higher retention is contaminated by selection on the kind of customer who engages. The structural vulnerability is the same as in HRT. The strength of association is not, by itself, evidence that the association is causal. The biological or business plausibility of the mechanism is not, by itself, evidence either; HRT had biological plausibility in abundance, with a coherent story about estrogen and lipids and vascular endothelium, and the story turned out to be the wrong story.

The corrective is the same corrective the WHI applied. When the stakes are high enough --- when a recommendation will affect millions of decisions, when the costs of being wrong are large, when the existing evidence is observational --- the appropriate response is a randomized trial. If a randomized trial is impossible or unethical, the next-best response is a quasi-experimental design with a plausible exogenous source of variation in the treatment, such as an instrumental variable, a regression discontinuity, or a natural experiment that mimics random assignment. When neither is possible, the appropriate response is calibrated uncertainty: an explicit acknowledgment that the estimated effect could plausibly be much smaller, zero, or of the opposite sign than the observational data suggest. Treating a strong observational association as actionable, in a setting where confounding is plausible, is the failure mode that produced two decades of HRT prescribing on the basis of evidence that, when actually tested, did not survive contact with the test.

The WHI investigators, when they designed the trial in the early 1990s, knew that they were testing an intervention that the field believed worked. They ran it anyway. That decision turned out to be the single most important clinical-research decision of its decade, because it produced an answer the observational literature could not have produced, and the answer changed the practice of medicine. The lesson is not that all observational evidence is wrong. The lesson is that the strength of observational evidence is not, by itself, sufficient grounds for treating a causal claim as established. The randomized trial is the test. Until the test is run, the claim is provisional.

Sources

Writing Group for the Women’s Health Initiative Investigators. (2002). Risks and benefits of estrogen plus progestin in healthy postmenopausal women: Principal results from the Women’s Health Initiative randomized controlled trial. JAMA, 288(3), 321-333. DOI: 10.1001/jama.288.3.321

Manson, J. E., Chlebowski, R. T., Stefanick, M. L., Aragaki, A. K., Rossouw, J. E., Prentice, R. L., et al. (2013). Menopausal hormone therapy and health outcomes during the intervention and extended poststopping phases of the Women’s Health Initiative randomized trials. JAMA, 310(13), 1353-1368. DOI: 10.1001/jama.2013.278040

Manson, J. E., Aragaki, A. K., Rossouw, J. E., Anderson, G. L., Prentice, R. L., LaCroix, A. Z., et al. (2017). Menopausal hormone therapy and long-term all-cause and cause-specific mortality: The Women’s Health Initiative randomized trials. JAMA, 318(10), 927-938. DOI: 10.1001/jama.2017.11217

Grodstein, F., Stampfer, M. J., Manson, J. E., Colditz, G. A., Willett, W. C., Rosner, B., Speizer, F. E., & Hennekens, C. H. (1996). Postmenopausal estrogen and progestin use and the risk of cardiovascular disease. New England Journal of Medicine, 335(7), 453-461. DOI: 10.1056/NEJM199608153350701

Petitti, D. B., & Freedman, D. A. (2005). How far can epidemiologists get with statistical adjustment? American Journal of Epidemiology, 162(5), 415-418. DOI: 10.1093/aje/kwi224

Hernán, M. A., Alonso, A., Logan, R., Grodstein, F., Michels, K. B., Willett, W. C., Manson, J. E., & Robins, J. M. (2008). Observational studies analyzed like randomized experiments: An application to postmenopausal hormone therapy and coronary heart disease. Epidemiology, 19(6), 766-779. DOI: 10.1097/EDE.0b013e3181875e61

Hulley, S., Grady, D., Bush, T., Furberg, C., Herrington, D., Riggs, B., & Vittinghoff, E. (1998). Randomized trial of estrogen plus progestin for secondary prevention of coronary heart disease in postmenopausal women: Heart and Estrogen/progestin Replacement Study (HERS) Research Group. JAMA, 280(7), 605-613. DOI: 10.1001/jama.280.7.605

Ravdin, P. M., Cronin, K. A., Howlader, N., Berg, C. D., Chlebowski, R. T., Feuer, E. J., Edwards, B. K., & Berry, D. A. (2007). The decrease in breast-cancer incidence in 2003 in the United States. New England Journal of Medicine, 356(16), 1670-1674. DOI: 10.1056/NEJMsr070105

Stampfer, M. J., & Colditz, G. A. (1991). Estrogen replacement therapy and coronary heart disease: A quantitative assessment of the epidemiologic evidence. Preventive Medicine, 20(1), 47-63. DOI: 10.1016/0091-7435(91)90006-p

Million Women Study Collaborators. (2003). Breast cancer and hormone-replacement therapy in the Million Women Study. The Lancet, 362(9382), 419-427. DOI: 10.1016/S0140-6736(03)14065-2

Saturated Fat And The Diet-Heart Hypothesis --- Another case where observational nutrition epidemiology drove decades of population-level recommendations that subsequent randomized trials did not confirm.
The Stress-Causes-Ulcers Myth --- A confidently held medical consensus that survived for decades and then collapsed when a different mechanism was proposed and tested.
SSRI Antidepressants And Publication Bias --- What the regulatory record showed about efficacy estimates when unpublished trials were included.
Reinhart-Rogoff And The 90% Debt Threshold --- An influential empirical claim in macroeconomics that did not survive a replication attempt.
Ioannidis 2005: Why Most Research Findings Are False --- The methodological argument for why observational findings in low-prior-probability domains are unlikely to be true.

FAQ

Was the WHI cohort representative of women typically prescribed HRT?

The WHI enrolled women aged 50 to 79 at baseline, with a mean age of 63.2 years. A common post-publication critique was that this was older than the typical HRT initiator in clinical practice and that the WHI population was therefore not the right test of HRT for women starting therapy in early menopause. The 2013 Manson analysis stratified outcomes by age at randomization and by years since menopause and found that the risk-benefit profile did differ by timing, with women aged 50 to 59 showing smaller absolute risk increases than older women. The timing-stratified results are now part of clinical guidance. They do not, however, restore an indication for HRT for primary prevention of cardiovascular disease in any age subgroup; that indication was the pre-WHI consensus that the trial overturned.

Did the WHI test the same hormone formulations that women are prescribed today?

The WHI tested two regimens: conjugated equine estrogens alone (Premarin, 0.625 mg) and conjugated equine estrogens combined with medroxyprogesterone acetate (Prempro). Both were the dominant formulations in US practice in the 1990s. Contemporary prescribing has shifted toward transdermal estradiol and micronized progesterone for many patients, on the hypothesis that the route of administration and the specific molecules used affect the risk profile, particularly for thrombotic and breast cancer risk. The hypothesis is biologically plausible. The randomized evidence that modern formulations have a substantially different long-term safety profile from the WHI regimen is, as of the most recent reviews, limited. The clinical guidance to prefer transdermal estrogen and short-term use rests on a combination of WHI data, observational comparative-formulation data, and mechanistic reasoning.

How did clinical practice change after the WHI?

US prescriptions for combined HRT fell from approximately 22 million dispensed prescriptions in 2001 to approximately 7 million in 2004. Total HRT prescriptions fell by roughly 50% in the same period. Practice patterns in the UK, Australia, and Western Europe followed similar trajectories. The clinical indication for HRT shifted from long-term primary prevention to short-term treatment of menopausal symptoms. Insurance coverage adjusted. Direct-to-consumer marketing of HRT for cardiovascular prevention essentially ended in the United States.

Did the WHI cause harm by leading women to stop hormones?

A persistent argument in some of the clinical literature is that the WHI was over-interpreted: the absolute risk differences were small, and the fall in HRT prescribing left women with treatable menopausal symptoms untreated and possibly caused other downstream harms. The argument has a kernel of truth. The WHI was a study of long-term primary prevention, not of short-term symptom treatment, and the early post-WHI period likely saw under-prescribing for symptom management as physicians and patients overcorrected. Current guidance from major menopause societies attempts to recalibrate by separating the two indications. The strongest version of the argument, that the WHI was wrong on cardiovascular disease and should not have changed practice, is not supported by the trial data, the 2013 reanalysis, or the 2017 mortality follow-up. The current consensus is that the practice change was directionally correct and that the remaining calibration question concerns appropriate use of HRT for symptom management, not long-term primary prevention.
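The "small absolute risk" point is easier to weigh with the arithmetic written out. The sketch below takes the hazard ratio of 1.29 for coronary heart disease from the trial's principal results and converts it into extra events per 10,000 person-years. The baseline rates are hypothetical values chosen for illustration, not figures from the trial, and treating a hazard ratio as a simple rate ratio is an approximation.

```python
def excess_events_per_10k(baseline_rate_per_10k, hazard_ratio):
    """Approximate extra events per 10,000 person-years on treatment.

    Treats the hazard ratio as a rate ratio, which is a simplification.
    """
    return baseline_rate_per_10k * (hazard_ratio - 1.0)


HR_CHD = 1.29  # coronary heart disease hazard ratio from the principal results

if __name__ == "__main__":
    # Hypothetical baseline event rates per 10,000 person-years (assumptions).
    for baseline in (10, 30, 60):
        extra = excess_events_per_10k(baseline, HR_CHD)
        print(f"baseline {baseline:>2} per 10k -> about {extra:.1f} extra events per 10k person-years")
```

The same relative risk produces a larger absolute excess when the baseline rate is higher. That is why a 29% relative increase can be both a small number per individual woman and a large number across millions of users.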

What is the analogous lesson for non-medical fields?

The general lesson is that observational evidence on a voluntarily adopted intervention is structurally vulnerable to confounding by the characteristics that lead people to adopt the intervention, and that statistical adjustment is not, in general, sufficient to eliminate the bias. In business and management research, this is the structural problem underlying many published findings about the effects of management practices, organizational interventions, hiring algorithms, productivity tools, and behavioral nudges, where the people or firms that adopt the intervention differ systematically from those that do not. The fix, when it is feasible, is a randomized trial. When randomization is not feasible, the next-best fix is a quasi-experimental design with a plausible source of exogenous variation. When neither is feasible, the appropriate response is to treat the estimated effect as a hypothesis to be tested rather than as a result to act on.
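A toy simulation makes the structure of the problem concrete. The sketch below is illustrative only: every probability is a made-up assumption, the intervention is given no causal effect at all, and the single confounder ("healthy") stands in for whatever unmeasured traits drive adoption. It compares the risk ratio an analyst would see among voluntary adopters with the risk ratio from coin-flip assignment in the same population.

```python
import random


def risk_ratio(events_treated, n_treated, events_control, n_control):
    """Risk in the treated group divided by risk in the control group."""
    return (events_treated / n_treated) / (events_control / n_control)


def simulate(n=200_000, seed=0):
    """Return (observational risk ratio, randomized risk ratio)."""
    rng = random.Random(seed)
    # counts[group] = [events, people]; group is True for adopters / treated.
    obs = {True: [0, 0], False: [0, 0]}
    rct = {True: [0, 0], False: [0, 0]}
    for _ in range(n):
        healthy = rng.random() < 0.5  # unmeasured trait (assumption)
        # The outcome depends only on the trait; the intervention does nothing.
        event = rng.random() < (0.05 if healthy else 0.15)
        # Observational world: healthier people adopt more often.
        adopted = rng.random() < (0.7 if healthy else 0.3)
        # Randomized world: assignment ignores the trait.
        assigned = rng.random() < 0.5
        for counts, group in ((obs, adopted), (rct, assigned)):
            counts[group][0] += event
            counts[group][1] += 1
    obs_rr = risk_ratio(*obs[True], *obs[False])
    rct_rr = risk_ratio(*rct[True], *rct[False])
    return obs_rr, rct_rr


if __name__ == "__main__":
    obs_rr, rct_rr = simulate()
    print(f"observational risk ratio: {obs_rr:.2f}")
    print(f"randomized risk ratio:    {rct_rr:.2f}")
```

Under these assumed probabilities the expected risk among adopters is 0.08 and among non-adopters 0.12, so the observational comparison shows an apparent one-third reduction from an intervention that does nothing, while the randomized comparison sits near 1.0. If the trait were measured perfectly, stratifying on it would remove the bias. The practical difficulty is that the traits driving adoption are usually measured partially or not at all.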

replication-crisis · hormone-replacement-therapy · whi-trial · medical-research · evidence-evaluation

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified in behavioral economics. Led 100+ in-house experiments at NRG in 2025, with project evidence and limits documented in the case studies.
