Class Size Effects: The Tennessee STAR Experiment And What Education Actually Knows

Most education research is observational, confounded, and contested. Tennessee STAR was different — a true randomized controlled trial of 11,500 kids across 79 schools, tracked into adulthood. The qualitative claim survived 30 years of scrutiny. The quantitative magnitudes did not. Here is what STAR actually proved.

By Atticus Li, May 25, 2026

In the summer of 1985, the state of Tennessee did something no education-policy researcher had ever done at scale before. It took roughly 11,500 kindergarten students entering 79 elementary schools across the state, and inside each school, it randomly assigned every incoming child to one of three classroom types: a small class of 13 to 17 students, a regular class of 22 to 25 students, or a regular class of 22 to 25 students with a full-time teacher’s aide. Teachers were also randomly assigned to classroom types within each school. The children remained in their assigned classroom type from kindergarten through third grade, a span of four academic years. Standardized achievement tests were administered at the end of each grade. The project — the Student-Teacher Achievement Ratio experiment, universally known as STAR — cost the state of Tennessee approximately $12 million in 1985 dollars and remains, four decades later, the largest randomized controlled trial ever conducted in K-12 education in the United States.

If you have been reading through this hub, you have watched education-adjacent claims collapse with grim regularity. Learning styles failed to replicate. Carol Dweck’s growth-mindset effects shrank from large to negligible under preregistered conditions. The 10,000-hours rule turned out to be a popularization that the underlying research did not support. The Pygmalion effect — Rosenthal and Jacobson’s claim that teacher expectations alone could raise IQ scores — turned out to be one of the messier examples of methodological choice masquerading as discovery. A reader who has worked through that material might reasonably conclude that education research is simply not a serious empirical field and that anything coming out of it should be treated as opinion dressed up as data.

That conclusion would be wrong, and this article exists as the anti-example needed to make the correction. Tennessee STAR is the rare educational claim that meets the methodological bar this hub demands. It was a true RCT with random assignment of students and teachers to conditions, conducted at large scale, in real-world classrooms, across multiple years of follow-up. Its primary qualitative finding, that students assigned to small classes in early grades show measurably higher academic achievement than students assigned to regular-sized classes, has survived three decades of adversarial reanalysis, including reanalysis by economists who came into the question expecting to find weaker effects. The long-term follow-up by Raj Chetty and colleagues in 2011, using IRS records to trace STAR participants into their late twenties, found persistent effects on college attendance and savings behavior, along with earnings effects that were clearer for overall classroom quality than for class size alone.

But this article is also a story about the difference between qualitative replication and quantitative replication. STAR robustly shows that small classes in early grades help. It does not robustly tell you exactly how much, or for which kids, or whether the gains are worth the considerable cost per student of cutting class sizes by roughly a third. The California statewide class-size reduction of 1996, the largest natural experiment in scaling the STAR finding, produced weaker effects than STAR itself, possibly because mass scaling diluted teacher quality, possibly for other reasons. And Eric Hanushek’s reanalysis of the original STAR data raised methodological concerns that remain genuinely contested more than twenty-five years after he raised them. For strategists evaluating any education-policy claim (a category that includes corporate L&D, edtech investment, and the ongoing public debate about how to spend marginal dollars in school systems), the STAR story is the right template for what calibrated belief in an education finding should look like.

What Mosteller 1995 Reported From The Original STAR Data

The standard reference summary of the experiment is Mosteller, F. (1995). “The Tennessee Study of Class Size in the Early School Grades.” The Future of Children, 5(2), 113-127. DOI: 10.2307/1602360. Frederick Mosteller — Harvard statistician, founding chair of the Harvard biostatistics department, and one of the most important figures in twentieth-century applied statistics — wrote this review precisely because he believed STAR was a methodological landmark that deserved to be understood by a broader policy audience than just education researchers.

The experimental design was the part Mosteller emphasized first, because it was the part that distinguished STAR from essentially every prior class-size study. Earlier work on class size had been almost entirely observational. Researchers had compared achievement in naturally occurring small and large classes, attempted to control statistically for differences between the kinds of schools and families that produced small versus large classes, and reached conclusions that depended heavily on which controls were included. The resulting literature was, predictably, all over the place. Some studies found large class-size effects. Others found nothing. Meta-analyses concluded that the literature was too heterogeneous to support strong inferences.

STAR was different because it used random assignment. Within each of the 79 participating schools, students entering kindergarten in 1985 were randomly assigned to small classes, regular classes, or regular classes with an aide. Teachers were also randomly assigned. This solved the selection problem that had plagued earlier work. The students in the small classes were not the students whose parents demanded smaller classes. They were not the students in the wealthier schools that could afford smaller classes. They were a random subset of the population of students entering the participating schools, identical in expectation to the students in the larger classes.

The design also set the unit of randomization at the level where it mattered: the classroom. Each kindergarten classroom was randomly assigned a target size, and students were randomly assigned to classrooms within their school. Random assignment was maintained across grades: students stayed in their assigned condition through third grade, though each year they were reshuffled into new classrooms of the same condition.
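The within-school assignment described above can be sketched in a few lines. This is a minimal illustration, not the STAR team's actual procedure: the function names, the roster, and the seed are all hypothetical, and the equal three-way split is a simplification (in STAR the small classes held fewer students than the regular ones).

```python
import random
from collections import Counter

CONDITIONS = ("small", "regular", "regular_with_aide")

def assign_within_school(student_ids, rng):
    """Randomly assign one school's entering students to the three conditions.

    Shuffling and then dealing students out in rotation keeps the three
    groups as equal in size as possible within the school. This equal
    split is an illustrative simplification, not the STAR allocation.
    """
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    return {sid: CONDITIONS[i % len(CONDITIONS)] for i, sid in enumerate(shuffled)}

def assign_all_schools(roster, seed=1985):
    """roster maps school name -> list of student ids; returns id -> condition."""
    rng = random.Random(seed)
    assignment = {}
    for school in sorted(roster):
        # Randomizing separately inside each school means every school
        # contributes students to all three conditions.
        assignment.update(assign_within_school(roster[school], rng))
    return assignment

# Hypothetical two-school roster, for illustration only.
roster = {
    "school_a": [f"a{i}" for i in range(60)],
    "school_b": [f"b{i}" for i in range(45)],
}
assignment = assign_all_schools(roster)
counts_a = Counter(assignment[s] for s in roster["school_a"])
counts_b = Counter(assignment[s] for s in roster["school_b"])
print(counts_a, counts_b)
```

Because the draw happens inside each school, differences between schools (wealth, location, leadership) cannot line up with class type, which is the property the observational studies lacked.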

The primary outcome was performance on the Stanford Achievement Test, administered at the end of each grade from kindergarten through third grade. The headline result that Mosteller reported was that students in the small-class condition outperformed students in the regular-class and regular-with-aide conditions on achievement tests in every grade tested. The effect size on the composite achievement measure was approximately 0.20 standard deviations — meaning a student at the median of the small-class distribution scored roughly at the 58th percentile of the regular-class distribution. The effect emerged in kindergarten, persisted through third grade, and was visible across reading, mathematics, and word-recognition subtests.
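The percentile translation of a 0.20 standard deviation effect can be checked directly. The sketch below assumes both score distributions are normal with equal variance, which is a textbook simplification rather than a property reported in the source; the function name is illustrative.

```python
from statistics import NormalDist

def percentile_in_comparison_group(effect_size_sd):
    """Percentile rank, within the comparison distribution, of the treated-group median.

    Assumes both groups are normal with equal variance, so the treated
    median sits effect_size_sd standard deviations above the comparison mean.
    """
    return 100 * NormalDist().cdf(effect_size_sd)

star_percentile = percentile_in_comparison_group(0.20)
print(round(star_percentile, 1))  # 57.9
```

Under those assumptions a 0.20 SD shift moves the median student from the 50th to roughly the 58th percentile of the regular-class distribution, which matches the figure quoted above.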

The effect was not uniformly distributed across the student population. Mosteller emphasized in his review that the gains for African American students and for students attending inner-city schools were approximately twice the gains for white students and suburban students. The small-class condition appeared to deliver a meaningful boost on average, but the boost was substantially larger for the students who entered the system most disadvantaged. This finding — that randomized class-size reduction has equity properties beyond its average effect — would become one of the most-cited points from the STAR literature, and a central reason that civil-rights organizations have repeatedly pointed to STAR as evidence for resource-based interventions in low-income school districts.

The regular-with-aide condition, by contrast, produced essentially no measurable benefit relative to the regular-class condition without an aide. Mosteller treated this as a meaningful negative finding. It implied that the benefit of the small-class condition was not simply about adult-to-child ratio — adding an adult to a regular-sized class did not replicate the effect. Something about smaller groups, specifically, was driving the result.

Mosteller’s review was deliberately cautious about cost-effectiveness implications. He noted that small classes cost substantially more than regular classes, that statewide rollout would face implementation challenges the controlled experiment had not faced, and that the precise mechanism by which small classes produced gains remained unclear. But he was unambiguous that the experiment had cleanly demonstrated the qualitative effect, and he framed STAR as the most important piece of education-policy evidence of the late twentieth century.

Krueger 1999 And The Economics Reanalysis

The most influential economist’s reanalysis of the STAR data was Krueger, A. B. (1999). “Experimental Estimates of Education Production Functions.” The Quarterly Journal of Economics, 114(2), 497-532. DOI: 10.1162/003355399556052. Alan Krueger — Princeton labor economist, future chair of the Council of Economic Advisers, and one of the founders of the credibility revolution in empirical economics — came to STAR with the explicit goal of subjecting it to the kind of econometric scrutiny that prior education-research treatments had not applied.

Krueger’s analysis confirmed the central Mosteller finding while sharpening several quantitative estimates. Using the original STAR data and applying production-function methods standard in labor economics, Krueger estimated that students assigned to small classes scored approximately 0.20 standard deviations higher than students in regular classes on the composite achievement measure in the first year of treatment. The estimate matched the magnitude in Mosteller’s summary, but Krueger placed it in the broader context of education-production-function estimates and demonstrated that the magnitude was meaningfully larger than the effects typically attributed to other school-resource interventions.

Krueger also addressed several methodological concerns that critics had raised about the original experiment. One concern was attrition — over the four years of the study, students moved between schools, between conditions, and out of the sample entirely. If attrition was non-random with respect to treatment, the estimated effects could be biased. Krueger demonstrated that the basic class-size effect survived multiple corrections for differential attrition, including instrumental-variables approaches that used initial assignment as an instrument for actual treatment received.

A second concern was the possibility that some students had been reassigned across conditions partway through the experiment in ways that compromised the original randomization. Krueger showed that approximately 10 percent of students experienced changes in their assigned condition between grades, but that intent-to-treat analyses — comparing students by their initial random assignment rather than by the condition they actually experienced — produced essentially the same effect sizes as the as-treated analyses. The randomization had held up well enough across the four years to support causal inference.
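The three estimators in play here (intent-to-treat, as-treated, and instrumental variables with initial assignment as the instrument) can be compared on simulated data. This is a sketch under loud assumptions, not a reproduction of Krueger's analysis: the data are synthetic, the true effect is set to 0.20 SD by construction, 10 percent of students switch condition, and switching is made purely random, which is the favorable case.

```python
import random
from statistics import mean

def simulate(n=100_000, true_effect=0.20, switch_rate=0.10, seed=0):
    """Synthetic students as (assigned_small, attended_small, score) tuples."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        assigned = rng.random() < 0.5
        # Simplifying assumption: switching is random, unrelated to ability.
        attended = (not assigned) if rng.random() < switch_rate else assigned
        score = rng.gauss(0.0, 1.0) + (true_effect if attended else 0.0)
        rows.append((assigned, attended, score))
    return rows

def diff_in_means(rows, flag_index):
    """Mean score where the flag at flag_index is True minus where it is False."""
    treated = [r[2] for r in rows if r[flag_index]]
    control = [r[2] for r in rows if not r[flag_index]]
    return mean(treated) - mean(control)

rows = simulate()
itt = diff_in_means(rows, 0)         # compare by initial random assignment
as_treated = diff_in_means(rows, 1)  # compare by class actually attended

# First stage: how much assignment shifts actual attendance.
first_stage = (mean([float(r[1]) for r in rows if r[0]])
               - mean([float(r[1]) for r in rows if not r[0]]))
wald_iv = itt / first_stage          # assignment used as the instrument

print(f"ITT={itt:.3f}  as-treated={as_treated:.3f}  IV={wald_iv:.3f}")
```

With random switching, the intent-to-treat estimate is diluted toward zero while the as-treated and IV estimates both land near the true effect. If switching were instead related to ability, the as-treated comparison would pick up selection, which is why the intent-to-treat and IV versions are the ones that protect the randomization.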

A third concern was the absence of a true pre-treatment baseline. STAR began at kindergarten entry, before any standardized testing had occurred, so the experiment could not directly demonstrate that the small-class and regular-class groups were equivalent in academic ability at baseline. Krueger addressed this by using the kindergarten-year test scores themselves as a quasi-baseline for older-grade analyses, and by demonstrating that the random assignment had produced groups balanced on every observable demographic characteristic the experimenters had recorded.

The Krueger reanalysis matters because it transformed STAR from a piece of education research into a piece of economics. After Krueger, STAR became one of the canonical examples in the labor-economics literature on causal inference — frequently cited alongside Card and Krueger’s minimum-wage natural experiment as evidence that policy-relevant questions could be answered with credible quasi-experimental and experimental methods. The methodological credibility of the original Mosteller finding, in the eyes of the economics profession, depended substantially on Krueger having reanalyzed the data and found that it held up.

Chetty 2011 And The Long-Term Earnings Effects

The most consequential follow-up to the original STAR experiment was Chetty, R., Friedman, J. N., Hilger, N., Saez, E., Schanzenbach, D. W., & Yagan, D. (2011). “How Does Your Kindergarten Classroom Affect Your Earnings? Evidence from Project STAR.” The Quarterly Journal of Economics, 126(4), 1593-1660. DOI: 10.1093/qje/qjr041. The Chetty et al. team did something that had never been done with an education-research dataset: they linked the original STAR participants to their IRS administrative records, tracing each child’s labor-market and college-attendance outcomes into their late twenties.

The mechanics of the linkage are worth understanding. The STAR study recorded the names, dates of birth, and other identifying information of the original 11,500 participants. By the time the Chetty team was working on the data in the late 2000s, those participants were in their late twenties — approximately ages 27 to 28 — with established work histories and tax records. The team obtained authorization to link STAR participant records to IRS data on labor earnings, college attendance (tracked via 1098-T tuition statements), and other administrative outcomes. The match rate was high enough that the long-term analysis recovered most of the original sample.

The headline test-score effect at the end of the experiment had been approximately 0.20 standard deviations on the composite achievement measure. By eighth grade, however, the test-score advantage of the small-class students had largely faded out, a finding that had been noted in earlier STAR follow-up work and used by critics to argue that the experimental effects were transient. The Chetty team’s central contribution was demonstrating that the test-score fade-out did not predict fade-out on adult outcomes.

Students assigned to small classes were significantly more likely to attend college by age 20: the effect was approximately 1.8 percentage points on a base of 26 percent college attendance, statistically significant in the full sample. Students who had been taught by more experienced kindergarten teachers earned more at age 27. The team also constructed a summary index of other adult outcomes, built from measures like savings behavior, retirement-account participation, and home ownership, on which the small-class students scored significantly higher.

The earnings effects deserve careful treatment because the Chetty team was scrupulous about their interpretation. The point estimates for earnings effects of small-class assignment alone — without conditioning on teacher quality or other classroom characteristics — were positive but small and statistically marginal. The much larger and more robust earnings effects emerged when the team constructed a composite measure of overall kindergarten-classroom quality (combining class size, teacher experience, and peer characteristics) and used it to predict adult earnings. The implication is that early-childhood classroom quality matters for long-run outcomes, but that disentangling the specific contribution of class size from other correlated classroom features is harder than the test-score-only analysis suggested.

Even with that caveat, the Chetty paper is one of the most important results in the empirical-economics literature on education. It demonstrated that a randomized intervention at age 5 could produce measurable effects on outcomes at age 27 — closing one of the longest temporal gaps any educational RCT has ever bridged. It established that early-childhood educational experiences have persistent effects even when the test-score signature of those experiences fades. And it provided the first rigorous causal evidence for the long-suspected but previously unproven claim that early classroom quality has substantial economic value.

Hanushek 1999 And The Methodological Critique

The most thorough adversarial critique of the STAR experiment was Hanushek, E. A. (1999). “Some Findings from an Independent Investigation of the Tennessee STAR Experiment and from Other Investigations of Class Size Effects.” Educational Evaluation and Policy Analysis, 21(2), 143-163. Eric Hanushek — Stanford economist, longtime skeptic of resource-based interventions in education, and the most cited researcher in the literature arguing that school spending has weak effects on student outcomes — reanalyzed the STAR data with the explicit goal of identifying problems that more sympathetic analysts had missed.

Hanushek’s critique operated on several levels. The first-order methodological concern was the integrity of the randomization across grades. The STAR design called for students to be randomly assigned to a condition at kindergarten entry and to remain in that condition through third grade. In practice, Hanushek documented, the randomization was imperfect in several ways. Students moved between schools. Some students were reassigned across conditions within a school for administrative reasons. Teachers were not always randomly assigned, particularly in later grades. The result, Hanushek argued, was that the as-treated analyses inevitably mixed experimental signal with selection effects, and that even the intent-to-treat analyses were less clean than the original design had promised.

The second concern was the pattern of effects across grades. If small classes produced their benefit through a year-by-year accumulation of better instruction, one would expect the test-score gap between small and regular classes to grow over the four years of treatment. The actual pattern in the STAR data showed the effect emerging strongly in kindergarten, persisting at roughly the same magnitude through third grade, and then fading after the experiment ended. Hanushek argued that this pattern was more consistent with a one-time adjustment effect — perhaps a Hawthorne effect, perhaps a novelty effect, perhaps simply better classroom organization at entry — than with the cumulative-instruction model that the small-class advocates favored.

The third concern was external validity. STAR was conducted in one state, in one cohort, across a particular set of schools that had volunteered to participate. The schools that volunteered may have differed systematically from schools that did not. The teachers who taught in the experimental conditions knew they were being studied, which could have affected their behavior in ways that would not replicate under non-experimental conditions. Statewide rollout of class-size reduction, Hanushek argued, would face implementation challenges that the controlled experiment had not faced, and would likely produce smaller effects than STAR itself.

The Hanushek critique has been partially absorbed and partially contested in the subsequent literature. The randomization-integrity concerns have been substantially addressed by the Krueger reanalysis, which showed that intent-to-treat estimates closely matched as-treated estimates. The pattern-of-effects concern is harder to dismiss — the timing of the effect is genuinely ambiguous between competing mechanisms, and the post-experiment fade-out is real. The external-validity concern is the part of the Hanushek critique that has aged best, because the California natural experiment that followed STAR produced exactly the weakened effects Hanushek had predicted.

The California 1996 Natural Experiment And Why Replication Got Harder

In 1996, the state of California passed legislation funding statewide class-size reduction in grades K-3, targeting class sizes of 20 students or fewer. The motivation was substantially the STAR findings — California policymakers cited Tennessee STAR in justifying the program — and the rollout was rapid, comprehensive, and well-funded. By 1998, virtually every public elementary school in California had reduced K-3 class sizes to the legislated target. The intervention reached approximately 1.6 million students per year.

If the STAR findings had reflected a robust quantitative phenomenon, the California rollout should have produced detectable improvements in early-grade achievement at scale. The actual results, as documented in subsequent evaluations including those compiled by the Public Policy Institute of California and the RAND Corporation, were substantially weaker than the STAR experiment had predicted. Achievement gains in California were positive but small, generally in the range of 0.05 to 0.10 standard deviations and so at most half the magnitude observed in STAR, and were heavily attenuated by negative side effects of the rapid rollout.

The leading explanation for the weaker California results is teacher-quality dilution. The rapid expansion of K-3 classrooms required California to hire approximately 25,000 additional elementary teachers in two years. The supply of qualified, credentialed elementary teachers in California was insufficient to fill those positions, and many of the new hires were teachers with less experience, less training, and weaker credentials than the average teacher in the state before the rollout. The dilution was particularly severe in low-income school districts, which struggled to compete for the limited supply of qualified teachers. In some districts, the class-size reduction effectively replaced an experienced teacher of 25 students with a less-experienced teacher of 20 students, and the net effect on instructional quality was ambiguous or negative.

This is the part of the STAR story that strategists need to internalize most carefully. The experimental finding, that small classes produce gains when teacher quality is held constant, was robust. The policy implication that derives from it, that scaling down class sizes will produce gains in the field, depended on assumptions about teacher labor supply that turned out not to hold. The intervention that produced a 0.20 SD effect in the controlled experiment produced roughly a 0.05 to 0.10 SD effect when scaled to a state. The difference was not a failure of the original science. It was a failure of the implementation environment.

The California experience generalizes beyond California. Subsequent natural experiments in Israel (using Maimonides’ Rule, a religious-law-based threshold for class size, as a quasi-instrument) and in several European countries have produced class-size effects that vary from substantially positive to indistinguishable from zero. The cross-context pattern is consistent with the view that class-size reduction is a real intervention with real effects, but that the effects depend heavily on the institutional context, the available teacher labor pool, and the specific implementation choices. The robust qualitative claim — smaller classes can help — is empirically well-supported. The strong quantitative claim — small classes produce 0.20 SD gains anywhere they are implemented — is not.

The Cost-Effectiveness Debate That Is Genuinely Live

The most legitimate ongoing debate about the STAR findings is not about whether the experimental effects were real. It is about whether the magnitude of the effects justifies the cost. Reducing class sizes from 25 to 15 requires about two-thirds more teachers per student and increases per-student spending by approximately 50 to 65 percent, because teacher salaries are the largest line item in school budgets and class-size reduction is fundamentally a teacher-multiplier intervention. The Schanzenbach 2014 review (Schanzenbach, D. W. (2014). “Does Class Size Matter?” National Education Policy Center, Boulder, CO) provides a careful summary of the cost-effectiveness literature.
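The arithmetic behind the teacher-multiplier point is easy to make explicit. The sketch below is a back-of-envelope model, not a budget analysis: it assumes some share of per-student spending scales one-for-one with the number of classrooms and everything else stays fixed. The share values are hypothetical, chosen only to show how sensitive the result is to that assumption.

```python
def teacher_increase(old_size, new_size):
    """Fractional increase in teachers needed per student when class size falls."""
    return old_size / new_size - 1

def spending_increase(old_size, new_size, class_scaled_share):
    """Fractional rise in per-student spending.

    class_scaled_share is the assumed fraction of per-student spending
    that scales with the number of classrooms (teacher pay, benefits,
    classroom space). Everything else is held fixed.
    """
    return class_scaled_share * teacher_increase(old_size, new_size)

extra_teachers = teacher_increase(25, 15)
print(f"teachers per student: +{extra_teachers:.1%}")
for share in (0.50, 0.75, 0.90):  # hypothetical shares, not measured values
    print(f"share {share:.0%}: spending +{spending_increase(25, 15, share):.1%}")
```

Moving from 25 to 15 students per class raises teacher headcount per student by about 67 percent. How much of that shows up in total per-student spending depends on how large a slice of the budget scales with classrooms, so any single cost figure carries an implicit assumption about that share.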

The central question is the opportunity cost. The marginal education dollar can be spent on class-size reduction, on teacher salary increases that attract and retain higher-quality teachers, on early-childhood programs like Head Start, on after-school tutoring programs, on technology, or on any number of other interventions. Empirical estimates of the cost per standard-deviation increase in test scores vary substantially across interventions. Targeted tutoring programs, in particular, have produced effect sizes in some studies that match or exceed STAR’s class-size effects at substantially lower per-student cost. From a pure cost-effectiveness standpoint, the strongest case for class-size reduction is for disadvantaged students in early grades — exactly the subgroup where STAR found the largest effects — rather than as a universal intervention.

The cost-effectiveness debate is the part of the STAR story where reasonable analysts continue to disagree. Researchers like Diane Schanzenbach argue that the long-term earnings effects documented by the Chetty team substantially improve the cost-benefit calculation, because the present value of small lifetime earnings increases can dwarf the per-student cost of class-size reduction. Researchers like Hanushek argue that teacher-quality interventions dominate class-size reduction on cost-effectiveness grounds, and that mass class-size reduction policies tend to dilute teacher quality in ways that undermine the intended benefits. Both positions are defensible. The empirical evidence does not unambiguously settle the policy question.
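The present-value argument can be sketched with a simple discounting calculation. Every number below is a hypothetical placeholder, not an estimate from Chetty, Schanzenbach, Hanushek, or any other source; the point is only to show the mechanics and how strongly the answer depends on the discount rate.

```python
def present_value(annual_amount, first_year, last_year, discount_rate):
    """Discounted sum of a constant annual amount from first_year to last_year inclusive.

    Years are counted from kindergarten entry (year 0).
    """
    return sum(annual_amount / (1 + discount_rate) ** year
               for year in range(first_year, last_year + 1))

# All values are hypothetical placeholders, not estimates from any study.
ANNUAL_EARNINGS_GAIN = 300.0   # extra adult earnings per year, in dollars
WORK_YEARS = (17, 60)          # years after kindergarten entry (about ages 22 to 65)
EXTRA_COST_PER_YEAR = 2000.0   # added cost per student per year of small classes
TREATMENT_YEARS = (0, 3)       # kindergarten through third grade

for rate in (0.00, 0.03, 0.05):
    benefit = present_value(ANNUAL_EARNINGS_GAIN, *WORK_YEARS, rate)
    cost = present_value(EXTRA_COST_PER_YEAR, *TREATMENT_YEARS, rate)
    print(f"discount {rate:.0%}: benefit/cost = {benefit / cost:.2f}")
```

Because the costs arrive in the first four years and the earnings gains arrive decades later, the benefit-cost ratio falls quickly as the discount rate rises. Two analysts using the same effect estimates can therefore reach opposite conclusions, which is part of why the disagreement persists.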

What This Means For Strategists Evaluating Education Claims

The STAR story carries several lessons for anyone evaluating education-research evidence in a strategic or commercial context.

First, the methodological gold standard in education research — large randomized controlled trials with long-term follow-up — is rare, expensive, and disproportionately valuable. Tennessee spent $12 million in 1985 dollars to run STAR. The resulting evidence base has shaped class-size policy debates for four decades. Almost no other education research has been conducted at this rigor and scale. When you encounter claims about educational interventions that are not backed by something approaching STAR-level evidence, treat the claims as preliminary regardless of how confident their proponents sound.

Second, qualitative replication and quantitative replication are not the same thing. STAR’s qualitative finding — small classes in early grades help — has replicated across multiple contexts and survived three decades of adversarial reanalysis. The specific quantitative magnitude — a 0.20 SD effect — has not replicated under scale-up conditions. Strategists need to distinguish between these two senses of replication when they read about any empirical finding. A finding that qualitatively replicates is real. A finding whose magnitude does not survive scale-up is one that requires careful thinking about implementation conditions before it can be applied.

Third, scale-up conditions matter as much as the original intervention. The California experience illustrates a general phenomenon: interventions that work in controlled experimental conditions can produce weaker effects when scaled to entire populations, because conditions that held in the controlled study (an adequate supply of qualified teachers, a particular set of motivated participating schools) often stop holding once the intervention is scaled. If you are evaluating a proposed intervention based on RCT evidence, ask explicitly what would change between the experimental condition and the deployment condition you are contemplating. The most important predictor of scale-up success is often the labor-market constraint on the resources the intervention requires.

Fourth, cost-effectiveness is its own analysis. Even when an intervention demonstrably works, it does not automatically follow that the intervention is the best use of marginal dollars. STAR demonstrated that small classes help. It did not demonstrate that small classes are the most cost-effective way to help. The strategic question is not “does this work” but “does this work better per dollar than the alternatives.” The empirical-economics literature on educational cost-effectiveness is genuinely informative on this question, and it tends to point toward targeted interventions for disadvantaged subgroups rather than universal large-scale resource increases.

Fifth, long-term follow-up is the most underrated form of evidence. The Chetty team’s tracing of STAR participants into their late twenties was methodologically heroic and substantively important. Most educational interventions are evaluated on short-term outcomes — typically end-of-year test scores — because long-term outcomes are expensive and slow to measure. The STAR follow-up demonstrated that short-term effects can mislead in both directions: test-score effects can fade out while long-run economic effects persist. If you are evaluating an educational intervention that has only short-term evidence behind it, treat the absence of long-term data as a real limitation rather than a minor caveat.

The STAR experiment exists in this hub as the example of what serious educational evidence looks like. It does not vindicate every claim that gets attached to it. It does not justify every class-size policy that has cited it. But it demonstrates that careful experimental work, conducted at scale, with long-term follow-up, can produce evidence that survives decades of adversarial scrutiny. That is rare. It is what good educational science looks like when it actually happens. And the contrast between STAR’s robust qualitative replication and its more contested quantitative magnitudes is the right model for how every well-designed RCT should be interpreted.

Sources

Mosteller, F. (1995). “The Tennessee Study of Class Size in the Early School Grades.” The Future of Children, 5(2), 113-127. DOI: 10.2307/1602360
Krueger, A. B. (1999). “Experimental Estimates of Education Production Functions.” The Quarterly Journal of Economics, 114(2), 497-532. DOI: 10.1162/003355399556052
Hanushek, E. A. (1999). “Some Findings from an Independent Investigation of the Tennessee STAR Experiment and from Other Investigations of Class Size Effects.” Educational Evaluation and Policy Analysis, 21(2), 143-163.
Chetty, R., Friedman, J. N., Hilger, N., Saez, E., Schanzenbach, D. W., & Yagan, D. (2011). “How Does Your Kindergarten Classroom Affect Your Earnings? Evidence from Project STAR.” The Quarterly Journal of Economics, 126(4), 1593-1660. DOI: 10.1093/qje/qjr041
Schanzenbach, D. W. (2014). “Does Class Size Matter?” National Education Policy Center, Boulder, CO.

Acemoglu-Robinson Institutions: The 2024 Nobel-Winning Program — another anti-example where adversarial reanalysis left the core finding standing
Card & Krueger’s Minimum Wage Study — the parallel anti-example from labor economics
Learning Styles: The Educational Myth That Won’t Die — the contrast case from education research that did not replicate
Growth Mindset: From Carol Dweck To Meta-Analytic Disappointment — another education claim whose effect sizes shrank under preregistered conditions
Spaced Repetition And The Testing Effect — the rare education-research finding that replicates as robustly as STAR

FAQ

Q: How big was the actual effect size in Tennessee STAR? The composite achievement test effect was approximately 0.20 standard deviations comparing small classes (13-17 students) to regular classes (22-25 students). Effects were roughly twice as large for African American students and students attending inner-city schools.
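To make 0.20 standard deviations more concrete, it can be translated into percentile terms. The sketch below assumes test scores are normally distributed, which is a simplification and not something the STAR papers are being quoted on here; the function name `percentile_after_shift` is purely illustrative.

```python
from statistics import NormalDist


def percentile_after_shift(start_percentile: float, effect_sd: float) -> float:
    """Percentile a student reaches after gaining `effect_sd` standard deviations.

    Assumes scores follow a normal distribution (an illustrative assumption).
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z_start = std_normal.inv_cdf(start_percentile / 100)
    return 100 * std_normal.cdf(z_start + effect_sd)


# A median student under the roughly 0.20 SD composite effect.
print(round(percentile_after_shift(50, 0.20), 1))  # 57.9
```

Under that assumption, a 0.20 SD gain moves a median student to roughly the 58th percentile, and an effect twice as large (0.40 SD) moves the same student to roughly the 66th.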

Q: Did the test-score effects last after the experiment ended? The test-score gap between small-class and regular-class students largely faded by middle school. However, the Chetty 2011 follow-up using IRS records found that small-class students were significantly more likely to attend college by age 20 and showed positive effects on a composite measure of adult outcomes including savings, home ownership, and retirement-account participation.

Q: Why did California’s 1996 statewide class-size reduction produce weaker effects than STAR? The leading explanation is teacher-quality dilution. California had to hire roughly 25,000 additional elementary teachers in two years, and the available pool of qualified, credentialed teachers was insufficient. Many new hires had less experience and weaker credentials, particularly in low-income districts that struggled to compete for the limited teacher supply. The intervention effectively replaced experienced teachers in classes of 25 with less-experienced teachers in classes of 20, and the net effect on instructional quality was substantially weaker than under STAR’s controlled conditions.

Q: Is STAR proof that class-size reduction is good policy? No. STAR proves that small classes can produce gains under controlled experimental conditions, particularly for disadvantaged students. It does not prove that class-size reduction is the most cost-effective use of marginal education dollars. The cost-effectiveness debate is live, and reasonable analysts disagree about whether class-size reduction dominates alternatives like targeted tutoring, teacher-quality investments, or early-childhood programs.

Q: What about the Hanushek critique of STAR? Eric Hanushek raised three concerns: imperfect randomization integrity across grades, ambiguous timing of the effect (consistent with one-time adjustment rather than cumulative instruction), and external-validity limits. The randomization concerns have been largely addressed by Krueger’s reanalysis showing intent-to-treat estimates match as-treated estimates. The timing concern remains genuinely ambiguous. The external-validity concern was substantially vindicated by the weaker California results.
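The difference between intent-to-treat and as-treated estimates is easier to see on simulated data. The sketch below is a toy simulation, not STAR data and not Krueger's regression specification: the sample size, the 10 percent switch rate, the 0.20 SD true effect, and all function names are assumptions chosen for illustration. Students assigned to a regular class sometimes end up in a small class anyway, and here that switching is unrelated to ability.

```python
import random
from statistics import mean


def simulate_students(n, true_effect, switch_rate, seed=0):
    """Return (assigned_small, attended_small, score) tuples for n students."""
    rng = random.Random(seed)
    students = []
    for _ in range(n):
        assigned_small = rng.random() < 0.5
        # Hypothetical noncompliance: some regular-class students switch,
        # independently of their underlying ability.
        switches = rng.random() < switch_rate
        attended_small = assigned_small or switches
        score = rng.gauss(0.0, 1.0) + (true_effect if attended_small else 0.0)
        students.append((assigned_small, attended_small, score))
    return students


def mean_difference(students, group_index):
    """Difference in mean score between the groups defined by one flag."""
    treated = [s[2] for s in students if s[group_index]]
    control = [s[2] for s in students if not s[group_index]]
    return mean(treated) - mean(control)


students = simulate_students(n=40_000, true_effect=0.20, switch_rate=0.10)
itt = mean_difference(students, 0)         # compare by original random assignment
as_treated = mean_difference(students, 1)  # compare by class actually attended
print(f"ITT: {itt:.3f}  as-treated: {as_treated:.3f}")
```

When switching is random, as it is here, the intent-to-treat estimate is only mildly diluted and the two numbers stay close. If switchers were systematically stronger or weaker students, the as-treated estimate would drift away from the intent-to-treat one, which is why agreement between the two is reassuring about randomization integrity.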

Q: How does STAR compare to other education-research findings in this hub? STAR is the strongest piece of education research treated in this hub. It survives the methodological standards that dismantled learning styles, growth mindset, the Pygmalion effect, and the 10,000-hours popularization. Among education-research claims, only the spaced-repetition and testing-effect findings have a comparably robust empirical base.

Q: Should I treat any claim about class-size reduction as settled? The qualitative claim — small classes in early grades can produce achievement gains, particularly for disadvantaged students — is well-supported. The quantitative claim about specific effect magnitudes depends on implementation conditions, particularly the available supply of qualified teachers. Cost-effectiveness depends on the alternatives being compared. Treat the qualitative claim as robust evidence and the quantitative and cost-effectiveness claims as context-dependent.


Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified in behavioral economics. Led 100+ in-house experiments at NRG in 2025, with project evidence and limits documented in the case studies.


