Finance's Replication Crisis: Harvey-Liu-Zhu 2016 On Why Most "Anomalies" Don't Replicate

Atticus Li

Cam Harvey, Yan Liu and Heqing Zhu reviewed 316 documented stock-return factors published over 50 years and concluded that the conventional t > 2 threshold is too low: most "anomalies" are noise. Hou-Xue-Zhang 2020 then re-tested 452 anomalies under a single consistent methodology and found that roughly 65% don't survive.

By Atticus Li, May 26, 2026

Walk into any meeting at a mid-sized institutional asset manager in 2026 and there is a high chance somebody will, at some point in the next hour, use the phrase factor investing or its more marketing-friendly cousin smart beta. The pitch deck will reference half a dozen academic factors with familiar names — value, size, momentum, profitability, low volatility, quality. The footnotes will cite peer-reviewed papers from The Journal of Finance or the Review of Financial Studies. The recommended portfolio allocation will lean on the idea that these factors deliver, in some combination, a long-run risk-adjusted return premium relative to the market.

The marketing is grounded, the narrative goes, in fifty years of academic research. Hundreds of factors documented in peer-reviewed journals. T-statistics above 2.0. Published findings from the most prestigious finance economists at the most prestigious universities. This is not stockbroker folklore; this is science.

In 2016, Campbell R. Harvey of Duke University — a former editor of The Journal of Finance and one of the most-cited empirical finance economists in the world — published a paper, co-authored with Yan Liu and Heqing Zhu, that argued the science was not in nearly as good shape as the brochures implied. Harvey, Liu, and Zhu (2016), “…and the Cross-Section of Expected Returns,” Review of Financial Studies, 29(1), 5–68 (DOI: 10.1093/rfs/hhv059), reviewed 316 distinct “factors” — variables claimed in the published academic literature to predict cross-sectional differences in stock returns — and concluded that, once you account for the multiple-hypothesis-testing problem implied by half a century of factor-hunting, the conventional statistical bar in the field was set far too low. The paper recommended that any newly proposed factor be required to clear a t-statistic of at least 3.0, not the traditional 2.0, before being treated as a real finding.

Four years later, Kewei Hou, Chen Xue, and Lu Zhang took the argument out of the realm of theory and into the realm of measurement. Hou, Xue, and Zhang (2020), “Replicating Anomalies,” Review of Financial Studies, 33(5), 2019–2133 (DOI: 10.1093/rfs/hhy131), assembled the largest single replication exercise the field had ever attempted: 452 anomaly variables, drawn from the published literature, re-tested with a single consistent methodology, a longer sample period, and standardised stock-universe filters. Their headline finding was that the majority of the 452 anomalies they re-tested did not produce statistically significant returns at the conventional 5% level when implemented with consistent methodology.

These two papers, Harvey-Liu-Zhu (2016) on the theory of the multiple-testing problem and Hou-Xue-Zhang (2020) on the empirics of running 452 anomaly tests in one consistent framework, together constitute what is now widely referred to as the factor-zoo critique or, less politely, finance’s replication crisis. They are not the only papers in this literature. Chordia, Goyal, and Saretto (2020), “Anomalies and False Rejections,” Review of Financial Studies, 33(5), 2134–2179 (DOI: 10.1093/rfs/hhaa018), comes at the same problem with different statistical machinery. Linnainmaa and Roberts (2018), “The History of the Cross-Section of Stock Returns,” Review of Financial Studies, 31(7), 2606–2649 (DOI: 10.1093/rfs/hhy030), tests published factors out-of-sample on data from outside the original papers’ sample periods. Bessembinder (2018), “Do Stocks Outperform Treasury Bills?,” Journal of Financial Economics, 129(3), 440–457 (DOI: 10.1016/j.jfineco.2018.06.004), goes after a completely different premise, the idea that the typical individual stock earns the equity premium, and finds something that should disturb anyone who treats a concentrated portfolio of hand-picked stocks as an obviously rational way to capture it.

The marketing industry has not, to put it gently, fully absorbed these findings.

This is the story of how the academic finance literature accumulated several hundred “anomalies” over fifty years, why a majority of those anomalies do not survive serious replication, and what the strategist evaluating any quantitative-finance claim — whether from a smart-beta ETF prospectus, an institutional consultant’s recommendation, or a hedge-fund pitch deck — should learn from it.

The Factor Zoo: Fifty Years of Anomaly-Hunting

To understand the Harvey-Liu-Zhu critique you need to understand what an anomaly is, in the technical sense the academic finance literature uses the word, and why so many of them have been published.

In the canonical asset-pricing framework that dominated the literature in the 1970s and 1980s — the Sharpe-Lintner Capital Asset Pricing Model — the only firm characteristic that should predict cross-sectional differences in expected stock returns is the stock’s beta with the market portfolio. Anything else that predicts returns is, in CAPM terms, an anomaly: a violation of the model that requires either a richer model (more risk factors) or a behavioural explanation (some kind of mispricing).

Starting in the late 1970s, empirical work began documenting variables that appeared to predict returns in addition to market beta. Banz (1981) documented the size effect: small-cap stocks earning higher average returns than large-cap stocks beyond what beta predicted. Basu (1977, 1983) and others documented the value effect: high book-to-market or low price-to-earnings stocks earning higher returns. Jegadeesh and Titman (1993) documented momentum: stocks that had gone up over the prior 3 to 12 months tended to continue going up. In the early 1990s, Fama and French (1992, 1993) proposed a three-factor model (market, size, value) that became the new baseline against which subsequent anomalies were tested.

Then the literature multiplied. The 1990s and 2000s saw a steady accumulation of additional documented characteristics: profitability, investment, asset growth, accruals, idiosyncratic volatility, share issuance, momentum-reversal, post-earnings-announcement drift, long-term reversal, illiquidity, dispersion of analyst forecasts, max-daily-return, lottery-like skewness, and dozens of more obscure variables. By the early 2010s, anyone trying to read the cross-sectional asset-pricing literature was confronting what John Cochrane, in his 2011 American Finance Association presidential address, called a “factor zoo” — hundreds of distinct variables, each claimed in some published paper to add explanatory power on top of whichever model had been the previous benchmark.

Cochrane (2011) framed the question that Harvey-Liu-Zhu would later answer empirically: “We have a zoo of new factors. Which characteristics provide independent information about average returns? Which are subsumed by others?” The implication was that not all of them could be real. If every published factor were genuinely picking up an independent source of expected returns, the typical stock would have an expected return determined by dozens of distinct risk premia, which is not what the data on actual portfolio returns looks like. Some of the factor zoo, Cochrane suggested, was probably noise.

What Harvey, Liu, and Zhu did in 2016 was take that suggestion and put a number on it.

What Harvey-Liu-Zhu 2016 Actually Argued

The Harvey-Liu-Zhu (2016) paper is, at its core, a paper about the multiple-testing problem. The argument runs as follows.

In any single hypothesis test, the conventional threshold for statistical significance is a t-statistic of roughly 2.0 in absolute value, which corresponds to a two-sided p-value of approximately 0.05. The interpretation is that, if the null hypothesis were true (the factor doesn’t really predict returns), you would observe a test statistic this extreme purely by chance in roughly 5% of independent tests. A t-statistic of 2.0 is, in a single isolated test, reasonable evidence against the null.

But if you run 100 independent tests on noise — variables that genuinely have no predictive power for returns — you should expect roughly 5 of them to clear the t > 2 threshold by chance alone. If you run 1,000 such tests, you should expect roughly 50. The academic finance literature, Harvey-Liu-Zhu argued, has effectively been running thousands of such tests over the past fifty years. Researchers have tried characteristic after characteristic, looking for variables that produce significant return predictability. Only the successes — the variables that cleared t > 2 — got published. The failures sit in file drawers.
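The arithmetic in the paragraph above can be checked with a small simulation. This is a minimal sketch, not anything taken from the paper: the 1,000 candidate "factors", the 240-month sample, the 3% monthly volatility, and the function names are all assumptions chosen for illustration. Every simulated factor has a true mean return of exactly zero, so anything that clears the bar is a chance discovery.

```python
import math
import random
import statistics


def t_stat(returns):
    """t-statistic for the null hypothesis that the mean return is zero."""
    n = len(returns)
    std_err = statistics.stdev(returns) / math.sqrt(n)
    return statistics.fmean(returns) / std_err


def count_chance_discoveries(n_factors, n_months, threshold, seed=0):
    """Count pure-noise 'factors' whose |t| clears the threshold by luck."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_factors):
        # Hypothetical monthly long-short returns with a true mean of zero.
        noise = [rng.gauss(0.0, 0.03) for _ in range(n_months)]
        if abs(t_stat(noise)) > threshold:
            hits += 1
    return hits


n_factors, n_months = 1000, 240  # assumed: 1,000 candidates, 20 years of months
hits_at_2 = count_chance_discoveries(n_factors, n_months, threshold=2.0)
hits_at_3 = count_chance_discoveries(n_factors, n_months, threshold=3.0)
print(f"|t| > 2: {hits_at_2} of {n_factors} noise factors look significant")
print(f"|t| > 3: {hits_at_3} of {n_factors} noise factors look significant")
```

The count at the 2.0 bar should land near 5% of the candidates, and the count at 3.0 should be only a handful. The exact numbers depend on the seed.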

This is the well-known publication bias problem, but Harvey-Liu-Zhu pushed it further than the standard treatment. They constructed, as carefully as they could, a census of every factor that had been documented in the major finance and accounting journals from the 1960s through 2012. Their count came to 316 distinct factors, drawn from 313 published papers. They then applied standard multiple-testing corrections (Bonferroni, Holm, and the Benjamini-Hochberg-Yekutieli false-discovery-rate procedure) to the implied universe of tests, treating the 316 published factors as the tip of an iceberg of unknown but larger size, since published factors are by definition the surviving ones.
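For readers who have not met these corrections before, here is a compact sketch of how each one decides which p-values count as discoveries. It uses the textbook definitions of the procedures, not the paper's own code, and the eight p-values are invented for illustration. The `yekutieli` flag is a name assumed here for the dependence adjustment that turns Benjamini-Hochberg into the more conservative Benjamini-Hochberg-Yekutieli variant.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject only p-values below alpha divided by the number of tests."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def holm(pvals, alpha=0.05):
    """Step-down procedure: the bar loosens slightly after each rejection."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first p-value that fails
    return reject


def benjamini_hochberg(pvals, alpha=0.05, yekutieli=False):
    """Step-up false-discovery-rate control; yekutieli=True adds the
    harmonic-sum adjustment used when tests may be dependent."""
    m = len(pvals)
    c = sum(1.0 / k for k in range(1, m + 1)) if yekutieli else 1.0
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0  # largest rank k with p_(k) <= k * alpha / (m * c)
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / (m * c):
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


# Hypothetical p-values for eight candidate factors.
pvals = [0.001, 0.007, 0.012, 0.03, 0.04, 0.2, 0.5, 0.9]
print("unadjusted:", sum(p <= 0.05 for p in pvals))                    # 5
print("Bonferroni:", sum(bonferroni(pvals)))                           # 1
print("Holm:      ", sum(holm(pvals)))                                 # 2
print("BH:        ", sum(benjamini_hochberg(pvals)))                   # 3
print("BHY:       ", sum(benjamini_hochberg(pvals, yekutieli=True)))   # 1
```

Five of the eight toy factors pass the naive 5% test, but every correction keeps fewer of them. That shrinkage is the mechanism behind the higher threshold.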

The conclusion was that the conventional t > 2 threshold was no longer defensible. Under the Bonferroni correction, the equivalent threshold for a single test to be treated as significant rose dramatically. Under the Benjamini-Hochberg-Yekutieli false-discovery-rate framework, which is less conservative, the threshold still rose substantially. Harvey-Liu-Zhu’s headline recommendation, which has since become widely cited as the t > 3.0 standard, was that any newly proposed factor should be required to clear a t-statistic of approximately 3.0, not 2.0, before being treated as a real finding worthy of publication.
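A rough sense of how fast the bar rises with the number of tests comes from the simplest of the corrections. The sketch below converts a Bonferroni-adjusted 5% significance level into the equivalent two-sided critical value, using a normal approximation to the t-distribution. The function name and the list of test counts are assumptions for illustration, and this is only the Bonferroni piece, not the paper's full analysis.

```python
from statistics import NormalDist


def bonferroni_t_threshold(n_tests, alpha=0.05):
    """Two-sided critical value (normal approximation) when the overall
    alpha is split evenly across n_tests hypothesis tests."""
    per_test_alpha = alpha / n_tests
    return NormalDist().inv_cdf(1.0 - per_test_alpha / 2.0)


for n_tests in (1, 10, 100, 316, 1000):
    print(f"{n_tests:>5} tests -> |t| must exceed "
          f"{bonferroni_t_threshold(n_tests):.2f}")
```

With one test the function returns the familiar 1.96. With 316 tests, the size of the published census, it returns a value a little under 3.8. If the true number of attempted tests is larger than the published count, the implied bar is higher still.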

The implication for the historical literature was uncomfortable. A great many published factors had cleared t > 2 but would not clear t > 3. Under the new standard, a substantial fraction of the documented factor zoo would have to be reclassified as likely false positives — variables that had appeared to predict returns in one sample, with one methodology, but were probably not picking up any genuine source of expected returns.

Harvey-Liu-Zhu were careful in their language. They did not claim that any specific named factor was definitely a false positive. They presented an aggregate argument about the field’s statistical standards. But the practical force of the paper was a wholesale downgrade of the credibility of the older anomalies literature.

It is worth noting that Harvey himself was, in the year this paper appeared, president of the American Finance Association, and he had previously served as editor of The Journal of Finance. This was not a methodological critique coming from a hostile outsider; this was finance’s own establishment performing a serious self-audit. The 2016 RFS paper was followed by Harvey’s American Finance Association presidential address, published in 2017 as “Presidential Address: The Scientific Outlook in Financial Economics,” The Journal of Finance, 72(4), 1399–1440 (DOI: 10.1111/jofi.12530), which made the case to the field at the highest possible podium.

Hou-Xue-Zhang 2020: What Happens When You Actually Re-Run 452 Anomalies

The Harvey-Liu-Zhu paper was a theoretical and statistical argument. Hou, Xue, and Zhang’s 2020 paper was something different: a direct empirical replication exercise. They took 452 documented anomaly variables, drawn from the published cross-sectional asset-pricing literature, and re-ran every single one of them under a uniform methodological framework on a long sample of U.S. stock returns through 2016.

The methodology choices mattered enormously, and Hou-Xue-Zhang were explicit about each one.

Control for microcaps. A substantial fraction of the original anomalies literature relied heavily on the smallest stocks in the U.S. universe: micro-cap and nano-cap names that, individually, account for a tiny fraction of total market capitalization but, in equally-weighted portfolios, drive most of the variation in returns. Hou-Xue-Zhang define microcaps as stocks below the 20th percentile of NYSE market capitalization, and their baseline implementation limits the influence of those stocks by sorting on NYSE breakpoints and value-weighting portfolio returns. This treatment alone substantially weakens many of the historical anomalies, because the predictive power of many factors turns out to be concentrated in stocks that are too small and too illiquid for institutional capital to actually trade in size.

Value-weight, don’t equal-weight. Many of the original anomaly papers reported equally-weighted portfolio returns, in which a $10 million micro-cap counts the same as a $1 trillion mega-cap. Hou-Xue-Zhang’s baseline used value-weighted portfolios, which better reflect what an investor could actually implement.

Standardise the sample period. Hou-Xue-Zhang ran every anomaly on the same sample window through 2016, rather than letting each anomaly use the sample window from its original publication. This eliminates the in-sample-only bias of using the period in which the anomaly was originally discovered.
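To see why the weighting scheme and the treatment of microcaps change the answer so much, here is a minimal sketch on a made-up seven-stock cross-section. The tickers, market caps, returns, and the nearest-rank percentile rule are all illustrative assumptions, not Hou-Xue-Zhang's data or code.

```python
import math

# Hypothetical one-month cross-section:
# (ticker, market cap in $ millions, monthly return, listed on NYSE)
stocks = [
    ("A", 10, 0.12, False),
    ("B", 25, 0.09, False),
    ("C", 40, 0.08, True),
    ("D", 300, 0.03, True),
    ("E", 1_500, 0.01, True),
    ("F", 20_000, 0.00, True),
    ("G", 500_000, -0.01, True),
]


def nyse_breakpoint(universe, pct):
    """Nearest-rank percentile of market cap among NYSE-listed stocks only."""
    nyse_caps = sorted(cap for _, cap, _, on_nyse in universe if on_nyse)
    rank = max(1, math.ceil(pct * len(nyse_caps)))
    return nyse_caps[rank - 1]


def equal_weighted(universe):
    """Every stock counts the same, regardless of size."""
    return sum(ret for _, _, ret, _ in universe) / len(universe)


def value_weighted(universe):
    """Each stock counts in proportion to its market capitalization."""
    total_cap = sum(cap for _, cap, _, _ in universe)
    return sum(cap * ret for _, cap, ret, _ in universe) / total_cap


cutoff = nyse_breakpoint(stocks, 0.20)           # microcap boundary
ex_micro = [s for s in stocks if s[1] >= cutoff]  # drop stocks below it

print(f"equal-weighted, all stocks:   {equal_weighted(stocks):+.4f}")
print(f"equal-weighted, ex-microcaps: {equal_weighted(ex_micro):+.4f}")
print(f"value-weighted, all stocks:   {value_weighted(stocks):+.4f}")
```

In this toy universe the equal-weighted return is strongly positive because the two tiny stocks happen to have the best returns. Dropping them roughly halves it, and value-weighting turns it slightly negative because the largest stock dominates. The same portfolio "works" or does not depending on choices that the original papers did not make consistently.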

Under this consistent framework, the headline finding was stark. A majority of the 452 anomalies tested (65% in the paper’s headline tabulation) did not produce a statistically significant return premium at the conventional 5% level (|t| ≥ 1.96). When Hou-Xue-Zhang raised the bar to a multiple-testing hurdle of |t| ≥ 2.78, in the spirit of the Harvey-Liu-Zhu recommendation, the failure rate rose to 82%. The anomalies that cleared both the microcap-controlled, value-weighted implementation and the higher hurdle were a small minority of the 452.

The pattern of which anomalies survived and which didn’t was itself informative. Anomalies based on investment, financing, profitability, and quality variables — the categories Hou, Xue, and Zhang themselves had emphasised in their own q-factor model — tended to survive better than older anomalies based on technical indicators, prior-return-based signals at long horizons, or accounting variables that turn out to be sensitive to small data-cleaning choices. The “factor zoo” was not random noise; some characteristics did seem to robustly predict returns. But the universe of robust survivors was, the paper argued, substantially smaller than the published literature implied.

The Hou-Xue-Zhang paper provoked, predictably, a response. Some of the original anomaly authors pushed back, arguing that filtering out microcaps and value-weighting was throwing out exactly the variation in which their anomalies lived. There is a defensible methodological argument here — if a particular anomaly is genuinely driven by limits-to-arbitrage in small, illiquid stocks, excluding microcaps is excluding the phenomenon. But the broader point that Hou-Xue-Zhang made is harder to dismiss: an anomaly that only works in stocks that institutional capital can’t actually trade in size is not, in any practical sense, a viable strategy. It may be a real empirical regularity, but it is not investable, and it is not what is being sold in factor-investing ETF prospectuses.

Chordia-Goyal-Saretto 2020: The False-Rejection Frame

In the same 2020 special issue of Review of Financial Studies that hosted Hou-Xue-Zhang, Chordia, Goyal, and Saretto (2020) approached the same problem from a different angle. Rather than re-running 452 anomalies under a single methodology, they ran a large-scale data-mining exercise on the universe of accounting variables in the COMPUSTAT database, generated 2 million test statistics from random combinations of financial-statement-item ratios, and used the resulting distribution to map out how easy it is, in principle, to find a “significant” return-predicting variable purely by chance.

The conclusion was sobering. Under the conventional t > 2 threshold, the number of false rejections — variables that look significant by chance — vastly exceeded the number of genuine signals you would expect to detect. Chordia-Goyal-Saretto recommended thresholds roughly in line with the Harvey-Liu-Zhu t > 3 standard for any new variable, but they framed the recommendation in terms of the false-discovery rate of the implied search process. If you knew, in advance, how many candidate variables a researcher had considered before reporting the one they finally published, you could calibrate a defensible threshold. Since researchers generally don’t disclose the size of their search, the field-wide default has to assume the search was wide.
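The logic of a false-discovery rate can be made concrete with back-of-the-envelope arithmetic. In the sketch below, the two million candidates echo the scale of the Chordia-Goyal-Saretto exercise, but the share of genuine signals (0.1%) and the statistical power at each threshold are purely hypothetical inputs, not estimates from the paper.

```python
def false_discovery_share(n_candidates, true_fraction, alpha, power):
    """Share of 'significant' results that are false, given how many
    candidates were searched and how many of them are genuine."""
    n_true = n_candidates * true_fraction
    n_null = n_candidates - n_true
    false_hits = alpha * n_null   # noise variables that clear the bar
    true_hits = power * n_true    # genuine signals that clear the bar
    return false_hits / (false_hits + true_hits)


# alpha = 0.05 corresponds to |t| > 2; alpha = 0.0027 to |t| > 3.
loose = false_discovery_share(2_000_000, 0.001, alpha=0.05, power=0.8)
strict = false_discovery_share(2_000_000, 0.001, alpha=0.0027, power=0.5)
print(f"|t| > 2: {loose:.1%} of discoveries are false")
print(f"|t| > 3: {strict:.1%} of discoveries are false")
```

Under these assumed inputs, almost every discovery at the conventional bar is false, and even the stricter bar leaves most of them false. The point is not the specific percentages but the dependence on the size of the search. The wider the search and the rarer the genuine signals, the higher the bar has to be.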

The combined Hou-Xue-Zhang and Chordia-Goyal-Saretto papers, both in the May 2020 issue of RFS, are usually treated as the empirical companion pieces to Harvey-Liu-Zhu’s theoretical argument. Together they constitute the case that the cross-sectional asset-pricing literature, taken as a whole, has been generating false positives at a rate that the field’s traditional reporting standards do not adequately discount.

Bessembinder 2018: A Different Kind of Result That Survives

Not every uncomfortable finding in modern empirical finance is a story of failed replication. Some findings have, on the contrary, held up remarkably well — and one of the most disturbing of them is Hendrik Bessembinder’s 2018 paper.

Bessembinder (2018), “Do Stocks Outperform Treasury Bills?,” studied the cross-section of long-run individual-stock returns in the CRSP database from 1926 through 2016. The headline finding is one that most retail investors find counterintuitive to the point of disbelief: across the entire universe of common stocks in the U.S. market over 90 years, only about 4% of the individual stocks accounted for the entire net gain (above Treasury-bill returns) of the U.S. stock market.

Put another way: if you had held a randomly selected individual U.S. stock over its full listed lifetime, you would more likely than not have earned less than the return on Treasury bills. The reason the aggregate equity market delivered a premium over T-bills was that the right tail of stock-level returns, a small minority of names, was so extreme that those few winners pulled the market average up. The median stock, over its life, lost money relative to T-bills.
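How the average can beat T-bills while the typical stock does not is easier to see in a toy simulation. The sketch below draws lifetime wealth multiples, relative to T-bills, from a lognormal distribution. The distribution and its parameters are arbitrary assumptions chosen to produce right skew, and they are not calibrated to CRSP data, so the percentages it prints will not match Bessembinder's.

```python
import random
import statistics

rng = random.Random(7)
n_stocks = 20_000

# Hypothetical lifetime wealth multiple of each stock relative to T-bills:
# 1.0 means the stock exactly matched T-bills over its listed life.
multiples = [rng.lognormvariate(-0.2, 1.5) for _ in range(n_stocks)]

mean_multiple = statistics.fmean(multiples)
median_multiple = statistics.median(multiples)
share_beating_bills = sum(m > 1.0 for m in multiples) / n_stocks

# Net wealth created per dollar invested in each stock, best stocks first.
gains = sorted((m - 1.0 for m in multiples), reverse=True)
total_net_gain = sum(gains)

# Smallest group of top performers whose gains alone equal the net total.
running, top_count = 0.0, 0
for gain in gains:
    if running >= total_net_gain:
        break
    running += gain
    top_count += 1
top_share = top_count / n_stocks

print(f"mean multiple:   {mean_multiple:.2f}")
print(f"median multiple: {median_multiple:.2f}")
print(f"share of stocks beating T-bills: {share_beating_bills:.1%}")
print(f"top {top_share:.1%} of stocks account for the entire net gain")
```

The mean multiple sits well above 1.0 while the median sits below it, and a small group of top performers accounts for all of the net gain. The remaining stocks collectively break even against T-bills.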

This finding is robust. Bessembinder followed up with international extensions (Bessembinder, Chen, Choi, & Wei, 2023, “Long-term shareholder returns: Evidence from 64,000 global stocks,” Financial Analysts Journal) confirming the same skewness pattern in non-U.S. markets. The basic asymmetry (equity returns at the individual-stock level are right-skewed, the typical stock underperforms a riskless benchmark, and a small number of extreme winners drive the aggregate equity premium) is now widely accepted.

The strategist takeaway from Bessembinder is the opposite of the takeaway from Harvey-Liu-Zhu. Harvey-Liu-Zhu says: many specific factor claims are false, demand higher statistical bars. Bessembinder says: the basic case for broad diversification is stronger than naïve intuition suggests, because the long-run equity premium is concentrated in a small number of names you cannot reliably pick in advance. Holding 30 stocks instead of 1,000 doesn’t just raise your variance; it raises the probability that your portfolio entirely misses the few names that account for the actual premium. The case for index funds, in the post-Bessembinder framing, is not just about cost minimization or efficient-markets ideology; it is about the empirical fact that the equity premium lives in the right tail and cannot be replicated by concentrated holdings.

Both findings — Harvey-Liu-Zhu’s factor-zoo critique and Bessembinder’s skewness finding — point at the same broader skepticism about anything that promises to systematically beat a low-cost broad-market index.

Linnainmaa and Roberts 2018: The Out-Of-Sample Test

A separate and complementary line of evidence comes from Linnainmaa and Roberts (2018), “The History of the Cross-Section of Stock Returns,” Review of Financial Studies, 31(7), 2606–2649. The paper’s methodology was elegant: take a long list of published cross-sectional anomalies, and test each one on data from before the period the original paper used. The original papers’ samples were the in-sample test; the pre-sample data was the out-of-sample test, free of the selection bias that makes any in-sample finding hard to interpret.

The results were, as the paper put it, “humbling.” Many of the published anomalies that had appeared significant on their original samples produced much weaker results — or no result at all — on the pre-publication out-of-sample data. The pattern is exactly what you would expect if much of the published anomaly literature were picking up sample-specific noise that doesn’t generalise to other periods. The factors that survived out-of-sample testing were a smaller subset of the full anomaly zoo, and they corresponded roughly to the factors that had survived under the Hou-Xue-Zhang and Chordia-Goyal-Saretto replication frameworks. The convergence of methods on a similar shortlist is, in itself, mild evidence that the surviving factors are picking up something real, while the full literature contains substantial noise.

The Linnainmaa-Roberts paper is also methodologically important for a reason that strategists should care about: it is essentially a demand for out-of-sample evidence as a precondition for treating any factor claim seriously. An in-sample t-statistic — even one well above 3.0 — is a much weaker piece of evidence than the same number replicated on data the original author never saw.
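The reason an out-of-sample test is so much more informative can be shown with noise alone. In the sketch below, a hypothetical researcher searches 100 pure-noise factors, reports the one with the largest in-sample t-statistic, and then re-tests that same factor on fresh data. The number of factors, the sample length, the volatility, and the function names are all assumptions for illustration. This is not the Linnainmaa-Roberts procedure, which works with real pre-sample data.

```python
import math
import random
import statistics


def t_stat(returns):
    """t-statistic for the null hypothesis that the mean return is zero."""
    n = len(returns)
    return statistics.fmean(returns) / (statistics.stdev(returns) / math.sqrt(n))


def best_factor_in_and_out(n_factors, n_months, rng):
    """Select the noise factor with the largest in-sample |t|, then
    re-test it on data that played no part in the selection."""
    best_in, best_out = 0.0, 0.0
    for _ in range(n_factors):
        in_sample = [rng.gauss(0.0, 0.03) for _ in range(n_months)]
        out_sample = [rng.gauss(0.0, 0.03) for _ in range(n_months)]
        t_in = abs(t_stat(in_sample))
        if t_in > best_in:
            best_in, best_out = t_in, abs(t_stat(out_sample))
    return best_in, best_out


rng = random.Random(42)
trials = [best_factor_in_and_out(100, 120, rng) for _ in range(10)]
mean_in = statistics.fmean([t_in for t_in, _ in trials])
mean_out = statistics.fmean([t_out for _, t_out in trials])
print(f"average |t| of the selected factor, in-sample:     {mean_in:.2f}")
print(f"average |t| of the same factor, out-of-sample:     {mean_out:.2f}")
```

The selected factor looks impressive in-sample precisely because it was selected. On data it was not selected on, it behaves like the noise it is.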

The Smart-Beta Marketing Problem

Now consider what the asset-management industry has done with all of this.

In 2024, the global “smart beta” / factor ETF market was estimated at roughly $1.5 to $2 trillion in assets, depending on whose definition you use. The marketing materials for these products typically cite academic factor research as their intellectual foundation. The named factors — value, size, momentum, quality, low volatility, profitability — correspond to the academic literature’s most heavily studied anomalies. The implicit pitch is that the products give retail and institutional investors systematic exposure to the same return premia that academic researchers have documented over decades.

The Harvey-Liu-Zhu, Hou-Xue-Zhang, and related literatures complicate this pitch in several ways.

Factors that don’t survive replication probably shouldn’t anchor a product. Some of the factors marketed in smart-beta products are among the better-replicated ones; the value, profitability, and investment factors all show up on the surviving-shortlist in multiple replication frameworks. Others — especially some of the more exotic single-characteristic factors that appear in more specialised products — fall into the categories that Hou-Xue-Zhang found do not survive consistent re-implementation.

Even the survivors look weaker post-publication. A separate empirical regularity, documented in McLean and Pontiff (2016), “Does Academic Research Destroy Stock Return Predictability?,” The Journal of Finance, 71(1), 5–32 (DOI: 10.1111/jofi.12365), is that anomaly returns tend to shrink substantially in the period after publication. On average, anomaly returns are roughly 26% lower out-of-sample (after the original sample ends but before publication) and roughly 58% lower post-publication, both relative to in-sample returns. The natural interpretation is that part of the in-sample premium was statistical artifact (the portion that disappears as soon as the original sample ends) and part was a real inefficiency that investors partially arbitraged away once the paper was public. Either way, the historical in-sample t-statistic is a substantial overstatement of what an investor entering the trade post-publication should expect.

The product wrapper costs change the breakeven. Even if a factor is real and has not been fully arbitraged away, a smart-beta ETF charging 0.30–0.50% in annual fees needs the underlying factor to deliver, after transaction costs and tax inefficiency, enough excess return to make the wrapper worth holding over a low-cost broad-market index. Many of the post-publication factor premia, once you subtract realistic implementation costs, are small enough that the value-add over a simple S&P 500 or total-world index fund becomes ambiguous.

The volume of factors makes “factor diversification” misleading. Some industry rhetoric talks about diversifying across factors. But if many of the candidate factors are noise, weighting an allocation across them is, in part, an allocation across noise.
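The decay and cost points above combine into simple arithmetic. The sketch below is a rough illustration with hypothetical inputs: a 4% annual in-sample premium, an assumed 50% post-publication haircut (a round number in the general range McLean and Pontiff report), a 0.40% fee, and 0.60% of trading and tax drag. None of these figures describes a specific product.

```python
def net_expected_premium(in_sample_premium, decay, fee, trading_cost):
    """Annual premium left after post-publication decay and costs."""
    return in_sample_premium * (1.0 - decay) - fee - trading_cost


def breakeven_in_sample_premium(decay, fee, trading_cost):
    """In-sample premium needed just to cover costs after decay."""
    return (fee + trading_cost) / (1.0 - decay)


# All inputs are hypothetical annual figures expressed as decimals.
net = net_expected_premium(0.04, decay=0.50, fee=0.004, trading_cost=0.006)
breakeven = breakeven_in_sample_premium(decay=0.50, fee=0.004, trading_cost=0.006)
print(f"net expected premium: {net:.2%}")                  # 1.00%
print(f"break-even in-sample premium: {breakeven:.2%}")    # 2.00%
```

Under these assumptions, a factor that looked like a 4% premium in the original paper is worth about 1% a year to the investor. Any factor whose published premium was below 2% would be expected to lose to the plain index after costs.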

None of this is to say that systematic factor investing is fraudulent or that every smart-beta product is mis-sold. The better products, built on the better-replicated factors, with realistic cost structures and honest about expected magnitudes, are reasonable diversifiers in a sophisticated portfolio. The point is that the gap between “what the marketing implies the academic literature has established” and “what the academic literature, including its own self-corrections, has actually established” is substantial — and the burden is on the investor or the strategist evaluating the product to close that gap.

The Strategist’s Takeaway

For any strategist evaluating a quantitative-finance claim — whether the source is a peer-reviewed paper, a hedge-fund pitch, a smart-beta ETF prospectus, or an institutional consultant’s report — Harvey-Liu-Zhu, Hou-Xue-Zhang, and the broader factor-zoo literature suggest a small handful of habits.

Treat in-sample t-statistics with calibrated skepticism. A reported t-statistic of 2.1 from a single in-sample test, with no out-of-sample validation, is not strong evidence of a real effect. The honest field-wide bar for new factor claims is approximately t > 3, and even that may be insufficient if you don’t know how many candidate variables the researcher considered before reporting the one they finally published.

Demand out-of-sample evidence. A factor that has been tested on data the original author never saw — either pre-publication out-of-sample data, post-publication out-of-sample data, or out-of-country data — is much better-supported than a factor whose only evidence is the original in-sample regression. Linnainmaa-Roberts is the methodological template here.

Apply the McLean-Pontiff decay. Even for factors that do replicate, the post-publication return premium is typically substantially smaller than the in-sample number. Treat the in-sample magnitude as an upper bound for what a forward-looking investor should expect.

Distinguish “anomaly that exists in the data” from “anomaly that is investable at scale.” A factor that lives in microcaps, in illiquid names, or in stocks that are hard to short is not the same as a factor that an institutional allocation can capture after transaction costs. Hou-Xue-Zhang’s microcap filter is not arbitrary methodology — it is the filter that distinguishes the academic anomaly from the implementable strategy.

Remember Bessembinder. Whatever clever factor allocation you are evaluating, it is being compared against an alternative — broad, low-cost, diversified equity exposure — that is itself supported by a finding (the right-skewed distribution of individual stock returns) which is more robust than most of the factor literature. The factor strategy has to clear the broad-market alternative, after costs, with real out-of-sample evidence, before it deserves a place in the portfolio.

Read the Harvey 2017 presidential address. Cam Harvey’s American Finance Association presidential address, “The Scientific Outlook in Financial Economics,” is the single most accessible statement of how the field’s own senior leadership thinks about the replication problem. It is not a hostile critique from outside finance; it is finance auditing itself. A strategist who reads it carefully will be better-equipped to evaluate any factor-investing claim they encounter for the rest of their career.

The factor zoo of academic finance, like the social-psychology literature of the 2010s and the medical-research literature analysed by Ioannidis (2005), is a domain where the published evidence base substantially overstates the underlying signal — because the publication process is implicitly selecting on a t-statistic threshold that, given the size of the implicit search, was too low. The corrective is not skepticism toward all of empirical finance. It is calibrated skepticism, with explicit demands for higher statistical bars, out-of-sample evidence, and implementable methodologies. The investment industry that takes this lesson seriously will build better products. The strategist who takes it seriously will be a harder customer to sell to — which is, in this domain, almost certainly the correct posture.

Sources

Primary papers:

Harvey, C. R., Liu, Y., & Zhu, H. (2016). …and the cross-section of expected returns. Review of Financial Studies, 29(1), 5–68. DOI: 10.1093/rfs/hhv059.
Hou, K., Xue, C., & Zhang, L. (2020). Replicating anomalies. Review of Financial Studies, 33(5), 2019–2133. DOI: 10.1093/rfs/hhy131.
Chordia, T., Goyal, A., & Saretto, A. (2020). Anomalies and false rejections. Review of Financial Studies, 33(5), 2134–2179. DOI: 10.1093/rfs/hhaa018.
Bessembinder, H. (2018). Do stocks outperform Treasury bills? Journal of Financial Economics, 129(3), 440–457. DOI: 10.1016/j.jfineco.2018.06.004.
Linnainmaa, J. T., & Roberts, M. R. (2018). The history of the cross-section of stock returns. Review of Financial Studies, 31(7), 2606–2649. DOI: 10.1093/rfs/hhy030.

Methodological and contextual:

Harvey, C. R. (2017). Presidential address: The scientific outlook in financial economics. The Journal of Finance, 72(4), 1399–1440. DOI: 10.1111/jofi.12530.
Cochrane, J. H. (2011). Presidential address: Discount rates. The Journal of Finance, 66(4), 1047–1108. DOI: 10.1111/j.1540-6261.2011.01671.x.
McLean, R. D., & Pontiff, J. (2016). Does academic research destroy stock return predictability? The Journal of Finance, 71(1), 5–32. DOI: 10.1111/jofi.12365.
Bessembinder, H., Chen, T. F., Choi, G., & Wei, K. C. J. (2023). Long-term shareholder returns: Evidence from 64,000 global stocks. Financial Analysts Journal, 79(3), 33–63.

Foundational anomaly papers cited:

Banz, R. W. (1981). The relationship between return and market value of common stocks. Journal of Financial Economics, 9(1), 3–18.
Basu, S. (1977). Investment performance of common stocks in relation to their price-earnings ratios: A test of the efficient market hypothesis. The Journal of Finance, 32(3), 663–682.
Jegadeesh, N., & Titman, S. (1993). Returns to buying winners and selling losers: Implications for stock market efficiency. The Journal of Finance, 48(1), 65–91.
Fama, E. F., & French, K. R. (1992). The cross-section of expected stock returns. The Journal of Finance, 47(2), 427–465.
Fama, E. F., & French, K. R. (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics, 33(1), 3–56.

Frequently Asked Questions

Does the Harvey-Liu-Zhu critique mean factor investing is a scam?

No. The critique says that the academic finance literature has, in aggregate, been generating false positives at a rate that the field’s traditional t > 2 threshold does not adequately discount, and that a large share of the documented factor zoo (a majority, in Hou-Xue-Zhang’s replication) would not survive a higher statistical bar or a consistent re-implementation. It does not say that every factor is noise. A small subset of factors (value, profitability, investment, and a few others) survive in multiple replication frameworks and are reasonable building blocks of a diversified strategy. The critique is a call for calibrated skepticism toward individual factor claims, not for blanket dismissal of systematic equity strategies. The actionable implication is to demand high statistical bars, out-of-sample evidence, and realistic cost-adjusted magnitudes, not to abandon the asset class.

Why is t > 3.0 better than t > 2.0?

Because of the multiple-testing problem. A single isolated test has roughly a 5% probability of producing |t| > 2.0 by chance when the null hypothesis is true. But the academic finance literature has effectively run thousands of tests over fifty years, and only the successes get published while the failures sit in file drawers. If you treat each published t > 2 result as if it were a single isolated test, you systematically underestimate how many of those results are likely to be chance findings. The Bonferroni and Benjamini-Hochberg corrections that Harvey-Liu-Zhu apply imply that, for a field with the implicit search size of cross-sectional asset pricing, the t > 3 threshold gives roughly the same type-I error control that t > 2 was supposed to give for a single isolated test. The t > 3 bar is not a magic number. It is a rough field-level calibration of what a defensible threshold looks like once you account for the search.
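
As a rough illustration of that arithmetic, the sketch below applies a plain Bonferroni adjustment under a normal approximation. It assumes the tests are independent, which factor tests are not, and it is not a reproduction of the Harvey-Liu-Zhu procedure. The function names are illustrative, and the 316 in the loop is simply the factor count from their review.

```python
from statistics import NormalDist


def bonferroni_t_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Two-sided critical value (normal approximation) that holds the
    family-wise error rate at `alpha` across `num_tests` independent tests."""
    per_test_alpha = alpha / num_tests
    return NormalDist().inv_cdf(1.0 - per_test_alpha / 2.0)


def prob_any_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `num_tests` independent true-null tests
    clears the unadjusted single-test bar."""
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # Assumption: search sizes chosen for illustration only.
    for m in (1, 10, 100, 316):
        threshold = bonferroni_t_threshold(m)
        any_fp = prob_any_false_positive(m)
        print(f"tests={m:4d}  adjusted |t| bar={threshold:.2f}  "
              f"P(at least one false positive at the unadjusted bar)={any_fp:.3f}")
```

The required bar rises slowly with the number of tests, which is why a search in the hundreds moves the threshold from about 2 to somewhere above 3 and not to an unreachable level.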

What about the smart-beta ETFs I already own?

The strategist’s answer is: look at which specific factors the product is built on, look at whether those factors are on the surviving shortlist in Hou-Xue-Zhang, look at the post-publication return decay documented by McLean-Pontiff, and compare the post-cost expected magnitude to what a low-cost broad-market index would deliver. Some smart-beta products, built on the better-replicated factors with low fees, are reasonable diversifiers. Others, built on more exotic single-characteristic factors with higher fees, are harder to defend. The replication literature does not by itself tell you what to do with any specific product — but it equips you to ask the right questions of the prospectus.
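
The post-cost comparison can be sketched as simple arithmetic. Every input below is a hypothetical placeholder rather than an estimate from any of the cited papers, and the function name is illustrative. The point is only the order of operations: shrink the published premium for post-publication decay first, then subtract the fee gap and trading costs.

```python
def net_expected_premium(
    published_premium: float,
    post_publication_decay: float,
    fee_gap: float,
    trading_cost: float,
) -> float:
    """Annual premium over a broad-market index after decay and costs.

    published_premium: annual premium reported in the original paper
    post_publication_decay: fraction of that premium assumed to vanish
    fee_gap: product fee minus the fee of a low-cost index fund
    trading_cost: assumed annual drag from turnover
    """
    decayed = published_premium * (1.0 - post_publication_decay)
    return decayed - fee_gap - trading_cost


if __name__ == "__main__":
    # Hypothetical inputs: 4% published premium, half of it assumed to
    # decay, 0.30% fee gap, 0.50% trading drag.
    net = net_expected_premium(0.04, 0.50, 0.003, 0.005)
    print(f"net expected premium: {net:.2%}")
```

If the result is close to zero or negative under assumptions you consider realistic, the product is hard to justify against the index alternative.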

Has the finance industry responded to this critique?

Slowly and unevenly. Some institutional asset managers and academic-aligned research providers — AQR, Dimensional Fund Advisors, and others — have engaged seriously with the factor-zoo and replication literatures, narrowed their factor menus to the better-replicated ones, and discussed the post-publication decay in their own research notes. Other parts of the industry continue to publish marketing materials that lean on the broader factor literature without acknowledging that much of it has been re-evaluated by serious researchers. The retail end of the smart-beta market, in particular, has been slower to absorb the post-2016 critique than the institutional end. The pattern is consistent with what you would expect: where the audience is sophisticated and cost-conscious, the corrections have propagated faster; where the audience is buying on brand and headline factor names, the corrections have lagged.

How does this compare to the replication crisis in psychology or medicine?

The structural parallels are striking. In psychology, the Open Science Collaboration (2015) replication of 100 published studies in major journals found roughly 36–47% replication depending on the criterion. In medicine, John Ioannidis’s 2005 PLOS Medicine essay “Why Most Published Research Findings Are False” laid out a theoretical case that has since been borne out by numerous large-scale replication exercises in biomedicine. In finance, the Hou-Xue-Zhang (2020) re-test of 452 anomalies found that the majority did not survive consistent re-implementation. Each of these fields has its own specific institutional features — laboratory experiments vs. clinical trials vs. observational stock-return data — but the underlying pattern is the same: a publication culture that rewards positive findings, a researcher-degrees-of-freedom problem in analytical choices, weak post-publication replication infrastructure, and a multiple-testing problem implicit in the cumulative search. The good news is that all three fields have, to varying degrees, started to acknowledge the problem and reform the standards. Finance is, in some ways, further along than psychology was in 2015, partly because the field’s most senior figures — Harvey, Cochrane, and others — engaged the problem from inside the establishment rather than treating it as a hostile critique from outside.

What’s the single most important thing a strategist should remember from this literature?

Probably this: the academic finance literature, taken as a whole, has been generating return-predictor claims at a rate that exceeds what the underlying data can support. A peer-reviewed paper with a published t-statistic above 2.0 is a much weaker piece of evidence than its institutional packaging suggests. Before treating any factor claim as a basis for capital allocation, ask: has it been replicated out-of-sample, by a different team, under a consistent methodology, on data the original author never saw, at a magnitude that survives realistic implementation costs? In the cross-sectional asset-pricing literature, the answer for a substantial fraction of published factors is “no.” Calibrate accordingly.


Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified in behavioral economics. Led 100+ in-house experiments at NRG in 2025, with project evidence and limits documented in the case studies.
