In 1998, Personality and Social Psychology Review published a paper by Michigan State psychologist Norbert L. Kerr with a title that turned a familiar bad habit into a permanent piece of methodological vocabulary: “HARKing: Hypothesizing After the Results are Known.” The paper was twenty-two pages long, single-authored, and contained no new experiments. What it did was give a name to a practice that every working researcher recognized but that nobody had quite isolated as its own category of error. Within a decade, HARKing was being taught in graduate methods seminars. Within two decades, it was one of the four named horsemen of the replication crisis, sitting alongside p-hacking, low power, and publication bias as a structural reason why the literature in many fields was less reliable than it appeared.

The act Kerr was naming had been going on for as long as researchers had been writing up studies. You run an experiment. You collect data. You analyze the data. You find some patterns you expected and some you did not. When you sit down to write the paper, you frame the patterns you found — including the ones you did not predict — as if you had predicted them all along. You rewrite the introduction so that the hypothesis section now leads naturally to the results you actually observed. The reviewers see a clean, coherent story: a theory was proposed, an experiment was run, the data confirmed the theory. The exploratory parts of your work — the parts where you genuinely did not know what would happen — have been laundered into confirmatory parts, where the answer was supposedly known in advance and the data merely supplied the evidence.

Kerr’s contribution was not to discover that this happens. His contribution was to give it a name memorable enough to argue about, an analysis serious enough to take seriously, and a catalogue of the specific epistemic costs that follow when a field’s published record is systematically misclassified between exploratory and confirmatory work. The acronym was sticky on purpose. He wanted researchers, students, editors, and readers to have a word they could use the next time they noticed the pattern in someone else’s paper — or, more difficult, in their own.

This article walks through what Kerr actually argued in 1998, why HARKing inflates the false-positive rate of an entire literature, how Mark Rubin’s 2017 typology refined the original analysis into something more usable, what the preregistration response looks like in practice, why HARKing is especially prevalent in business and management research, and how a working strategist can spot HARKed claims in the research they read.

Kerr’s 1998 Paper, in Plain English

Kerr opened with a definition that has been quoted in essentially every subsequent treatment: HARKing is “presenting a post hoc hypothesis (i.e., one based on or informed by one’s results) in one’s research report as if it were, in fact, an a priori hypothesis.” The key word is presenting. The act is not the post hoc thinking itself, which is unavoidable and often valuable. The act is the misrepresentation of when in the research process the hypothesis was formed.

Kerr was careful to distinguish HARKing from several adjacent practices it is sometimes confused with. Generating a new hypothesis after looking at one’s data is not HARKing — it is exploratory data analysis, which has been a respectable scientific activity since the eighteenth century. Running additional analyses suggested by the data is not HARKing — it is following the evidence where it leads, which is also respectable. Reporting an unexpected finding clearly labelled as unexpected is not HARKing — it is honest reporting. What constitutes HARKing is the specific move of taking a finding that emerged from looking at the data and writing the paper as if the finding had been predicted before the data were collected.

The paper enumerated twelve costs of HARKing, ranging from the statistical to the sociological. The statistical costs are the easiest to formalize. The sociological costs — the way HARKing distorts the cumulative record of a field, the way it teaches graduate students that exploratory work is shameful, the way it pressures honest researchers to compete with dishonest ones on a tilted field — were the costs that most concerned Kerr and that have proven the hardest to fix.

The single most consequential of the twelve costs, and the one that connects HARKing directly to the replication crisis, is the inflation of the Type I error rate at the level of the literature. A single significance test conducted at the conventional alpha = 0.05 threshold has a 5% probability of producing a “significant” finding even when nothing is happening. If a researcher runs twenty exploratory tests on their data, picks the one with the smallest p-value, and writes the paper as if that one had been the a priori prediction, the effective alpha on the reported test is no longer 0.05 — it is closer to 0.64, the probability that at least one of twenty independent tests will hit the threshold by chance. The published p-value of 0.03 is a lie about how much evidence the data actually provided.

Multiply this across a literature. If a substantial fraction of “confirmatory” tests in the published record are in fact HARKed exploratory tests, the headline false-positive rate of the literature is much higher than the per-test alpha suggests. Combine this with low statistical power, publication bias toward significant results, and the rest of the four horsemen and you reproduce, from a different starting point, the same conclusion that Ioannidis would reach seven years later by a Bayesian route: a substantial fraction of published “significant” findings in many fields are likely false.
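The literature-level arithmetic can be sketched in a few lines of Python. This is a simplified positive-predictive-value calculation in the spirit of the Ioannidis argument, not a reproduction of his full model. The 10% prior and 50% power are illustrative assumptions, power is held fixed for simplicity, and the function name is hypothetical.

```python
def positive_predictive_value(prior: float, power: float, alpha: float) -> float:
    """Probability that a 'significant' finding reflects a true effect.

    prior: assumed share of tested hypotheses that are actually true
    power: probability that a true effect reaches significance
    alpha: probability that a null effect reaches significance
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# Illustrative inputs only: 10% of tested hypotheses true, 50% power.
nominal = positive_predictive_value(prior=0.10, power=0.50, alpha=0.05)

# Same inputs, but the reported test was the best of twenty looks,
# so the effective alpha is about 0.64 rather than 0.05.
harked = positive_predictive_value(prior=0.10, power=0.50, alpha=0.64)

print(f"PPV at nominal alpha: {nominal:.2f}")
print(f"PPV at HARKed alpha:  {harked:.2f}")
```

Under these assumed inputs, roughly half of the nominally significant findings are true, and the share drops sharply once the effective alpha is inflated. The exact numbers depend entirely on the prior and power you plug in.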

Kerr’s paper landed in a field, social psychology, that was about to undergo a public crisis of confidence over precisely these issues. The 2011 Daryl Bem precognition paper, the Open Science Collaboration’s 2015 attempt to replicate one hundred psychology studies, and the broader meta-research movement of the 2010s would all return to HARKing as one of the structural explanations for why so many published findings did not replicate. Kerr’s paper was the early warning. The field took a decade and a half to start treating it as one.

How Type I Error Inflation Actually Works

The mathematics of HARKing-induced Type I error inflation is straightforward enough that it should be in every undergraduate methods course. Yet it is consistently underappreciated even by working researchers, because the inflation happens at the literature level rather than at the individual-paper level — and individual papers are how researchers think about their own work.

Suppose you have collected data on a sample of subjects and measured ten different outcome variables: reaction time, accuracy, confidence, satisfaction, willingness to recommend, and so on. Suppose further that none of your manipulations actually affect any of the outcomes — the true effect is zero for every comparison. If you run a significance test on each of the ten outcomes at alpha = 0.05, the probability that at least one of them produces a “significant” result purely by chance is approximately 1 − (1 − 0.05)^10 = 0.40. There is a 40% chance you will see at least one false positive across the ten tests, even though nothing is really happening.
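The same calculation generalizes to any number of tests. A minimal sketch, assuming the tests are independent and every null hypothesis is true (the function name is illustrative):

```python
def family_wise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


for k in (1, 5, 10, 20):
    print(f"{k:>2} tests -> {family_wise_error_rate(k):.2f}")
```

Ten tests give the 0.40 figure above, and twenty tests give the 0.64 figure from the previous section. Correlated outcomes, which are common in practice, would inflate the rate somewhat less than this independence assumption implies.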

This is the family-wise error rate problem, and it is well-understood when researchers acknowledge that they have run multiple tests. Standard corrections — Bonferroni, Holm, Benjamini-Hochberg — exist precisely to handle this case. The corrections work when the analyst is transparent about how many tests they ran.
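For readers who want to see what these corrections do mechanically, here is a hand-rolled sketch of the three adjustments in plain Python. The ten p-values are invented for illustration, and a real analysis would more likely rely on an established statistics package than on these helper functions.

```python
def bonferroni(p_values):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Step-down Holm adjustment (controls the family-wise error rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(p_values):
    """Step-up Benjamini-Hochberg adjustment (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        candidate = min(1.0, p_values[i] * m / (rank + 1))
        running_min = min(running_min, candidate)  # keep adjusted values monotone
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from ten outcome measures; only the first looks "significant".
raw = [0.03, 0.21, 0.34, 0.48, 0.52, 0.61, 0.70, 0.77, 0.85, 0.93]

for name, method in [("Bonferroni", bonferroni), ("Holm", holm),
                     ("Benjamini-Hochberg", benjamini_hochberg)]:
    print(f"{name:<20} smallest adjusted p = {min(method(raw)):.2f}")
```

With this made-up set of p-values, none of the three methods leaves the smallest p-value anywhere near 0.05. Every one of them, however, needs the full list of tests as input, which is the information a HARKed write-up withholds.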

HARKing breaks this transparency. The HARKing researcher runs the ten tests, finds the one with the smallest p-value, and writes the paper as if that single test were the only one they had ever planned to run. The reviewers see a paper reporting one significance test at p = 0.03, with no multiple-comparisons correction applied. The reviewers cannot apply a correction because they do not know about the other nine tests. The published claim is that the data provided evidence at the 0.03 level for the reported hypothesis. The actual situation, after honest multiple-comparisons accounting, is that the evidence sits at roughly the 0.30 level: a Bonferroni adjustment multiplies the smallest p-value by the number of tests run, and 10 × 0.03 = 0.30. If the comparison was one of many possible comparisons in a larger hypothesis space, the data provide no useful evidence at all.
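A small simulation makes the point concrete. The sketch below assumes ten independent outcomes with no true effects, in which case each p-value is uniformly distributed between 0 and 1, and counts how often the smallest of the ten would clear the 0.05 bar. The function name, seed, and number of simulated studies are arbitrary choices.

```python
import random


def share_of_null_studies_with_a_hit(n_outcomes=10, alpha=0.05,
                                     n_studies=100_000, seed=0):
    """Fraction of all-null studies whose smallest p-value falls below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Under a true null, a p-value is a uniform draw on [0, 1].
        smallest_p = min(rng.random() for _ in range(n_outcomes))
        if smallest_p < alpha:
            hits += 1
    return hits / n_studies


rate = share_of_null_studies_with_a_hit()
print(f"Studies with a reportable 'finding': {rate:.1%}")
```

The simulated share should land close to the 40% computed analytically. Each of those studies could be written up as a single planned test with a small p-value and no correction.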

The damage compounds because the next researcher who reads the published paper does not know to discount the p-value. They treat it as a 0.03-level finding. They design their next study assuming the effect is real, calculate sample size based on the inflated effect estimate, and either fail to replicate (in which case the failure is published rarely, if at all) or partially replicate via a smaller version of the same HARKing process (in which case the literature appears to be converging on a real effect that does not exist).

The Open Science Collaboration’s 2015 reproducibility project found that fewer than 40% of attempted replications of psychology studies reached significance in the same direction as the original. The replication effect sizes were on average about half the size of the originally published effects. This pattern, in which replicated effects are much smaller than original effects, is exactly what HARKing predicts when applied at scale to a literature. The original literature is reporting the upper tail of an exploratory distribution; the replication literature is sampling from the actual distribution. The gap between the two reflects that inflation, compounded by the publication bias and low power described above.
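The upper-tail effect is easy to reproduce in a toy simulation. The sketch below does not model HARKing specifically; it models any process that publishes only the estimates that reach significance, of which HARKing is one route. The true effect (d = 0.2), the sample size (50 per group), and the normal approximation to the sampling distribution are all assumptions chosen for illustration.

```python
import random
from statistics import NormalDist, mean


def simulate_significance_filter(true_effect=0.2, n_per_group=50,
                                 n_studies=20_000, seed=1):
    """Return (mean published effect, share of studies published)."""
    rng = random.Random(seed)
    # Approximate standard error of a standardized mean difference.
    se = (2.0 / n_per_group) ** 0.5
    z_crit = NormalDist().inv_cdf(0.975)
    published = []
    for _ in range(n_studies):
        observed = rng.gauss(true_effect, se)
        # Only estimates significant in the predicted direction get written up.
        if observed / se > z_crit:
            published.append(observed)
    return mean(published), len(published) / n_studies


published_mean, share_published = simulate_significance_filter()
print(f"True effect:            0.20")
print(f"Mean published effect:  {published_mean:.2f}")
print(f"Share of studies published: {share_published:.1%}")
```

With these inputs the published estimates average well over twice the true effect, so an honest replication that samples from the full distribution would be expected to find something much smaller than the original report.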

Rubin 2017: When Does HARKing Actually Hurt?

Nineteen years after Kerr’s paper, Mark Rubin, a psychologist at the University of Newcastle in Australia, published an important refinement in Review of General Psychology titled “When Does HARKing Hurt? Identifying When Different Types of Undisclosed Post Hoc Hypothesizing Harm Scientific Progress.” Rubin’s argument, which has been widely accepted, is that not all forms of HARKing are equally harmful, and that the field would benefit from a more precise typology than Kerr’s original umbrella term.

Rubin distinguishes between several distinct practices that all fall under Kerr’s original definition but that have different epistemic implications.

CHARKing — Constructing Hypotheses After Results are Known. The researcher analyses the data, observes a pattern, constructs a new theoretical hypothesis to explain the pattern, and presents the new hypothesis as if it had been generated a priori. This is the strongest form of HARKing and the one most clearly damaging. It corrupts the theoretical record by attributing predictive success to a theory that did not actually predict anything.

RHARKing — Retrieving Hypotheses After Results are Known. After seeing the results, the researcher searches the existing literature for a hypothesis that fits them and reports it as if it had been the focal a priori prediction. This is more subtle, because the hypothesis genuinely predates the data, but the write-up still hides from the reader that the hypothesis was selected because it matched the results, which inflates its apparent predictive power.

SHARKing — Suppressing Hypotheses After Results are Known. The researcher had specific a priori hypotheses, found that the data did not support them, and quietly removed them from the paper. The remaining write-up presents the analyses that worked while concealing the ones that did not. This is closely related to publication bias but operates within a single paper rather than across the literature.

THARKing — Transparent Hypothesizing After Results are Known, a label that comes from Hollenbeck and Wright’s 2017 paper rather than from Rubin’s. The researcher analyses the data, observes a pattern, generates a new hypothesis, and explicitly labels the new hypothesis as post-hoc and exploratory in the published paper. THARKing is not only acceptable but valuable: it is the honest version of exploratory data analysis, and it should be encouraged, not stigmatized.

Rubin’s typology was important because it rescued legitimate exploratory work from the blanket condemnation that an overly broad reading of Kerr risked imposing. The problem with HARKing is not that researchers think after they look at their data — they should, and Bayesian analysis explicitly requires it. The problem is the misrepresentation in the write-up of when the hypothesis was formed. Transparent post-hoc hypothesizing solves the problem at zero cost to legitimate exploration. The only thing the field needs to do is end the convention that exploratory work is less prestigious than confirmatory work, which is the convention that creates the incentive to HARK in the first place.

Jens B. Asendorpf and colleagues had pointed in a similar direction in their 2013 “Recommendations for increasing replicability in psychology” in the European Journal of Personality, which set out reforms for authors, reviewers, editors, and institutions. The consensus that has emerged is that the field needs both stronger preregistration norms for confirmatory work and stronger normalization of explicitly-labelled exploratory work. The two reforms are complementary, not in tension.

The Preregistration Response

The modern methodological response to HARKing is preregistration: the practice of publicly recording your hypotheses, analysis plan, sample size, exclusion criteria, and outcome measures before collecting (or, for archival data, before analyzing) the data. Once your study is preregistered, the distinction between confirmatory and exploratory work becomes verifiable. Any analysis that matches the preregistered plan is confirmatory. Any analysis that does not is exploratory. Reviewers can check. Readers can check. The HARKing move — presenting an exploratory analysis as confirmatory — becomes impossible to execute without being immediately falsified by the public preregistration record.
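The checking logic is simple enough to express as a small data-structure sketch. Everything here is hypothetical: real preregistrations are prose documents, and matching a reported analysis to a planned one usually takes human judgment. The `Analysis` record and `audit_report` function are illustrative names, not part of any registry's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One planned or reported test, identified by what it compares."""
    outcome: str
    predictor: str


def audit_report(preregistered, reported):
    """Label each reported analysis and list planned analyses that went missing."""
    plan = set(preregistered)
    labels = {
        analysis: "confirmatory" if analysis in plan else "exploratory"
        for analysis in reported
    }
    unreported = sorted(plan - set(reported),
                        key=lambda a: (a.outcome, a.predictor))
    return labels, unreported


# Hypothetical study: two planned tests, one of which never reached the paper.
plan = [Analysis("turnover", "training"), Analysis("engagement", "training")]
paper = [Analysis("engagement", "training"), Analysis("satisfaction", "tenure")]

labels, unreported = audit_report(plan, paper)
for analysis, label in labels.items():
    print(f"{analysis.outcome} ~ {analysis.predictor}: {label}")
print("Planned but unreported:", [a.outcome for a in unreported])
```

The second return value matters as much as the first. A planned analysis that silently disappears from the paper is the suppression pattern, and a public plan is what makes it visible.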

The Open Science Framework, founded in 2013 by Brian Nosek and the Center for Open Science, became the de facto infrastructure for preregistration in psychology and a growing list of adjacent fields. By the early 2020s, more than three million projects had been registered. AsPredicted.org, a lighter-weight preregistration service started in 2015 by Wharton’s Joseph Simmons and colleagues, provided a friction-free path for researchers running smaller experiments. Major journals — Psychological Science, Royal Society Open Science, Nature Human Behaviour, BMJ — began offering Registered Reports, a publication format in which the study design and analysis plan are peer-reviewed and accepted in principle before the data are collected. Once accepted, the paper will be published regardless of whether the results are significant, removing the incentive both for HARKing and for publication bias in a single move.

The Registered Reports format is the strongest available solution to HARKing because it removes the publication incentive that creates the HARKing temptation. If the paper will be published either way, there is no reward for laundering exploratory findings as confirmatory. Cortex, the journal that pioneered the format under Chris Chambers’s editorship starting in 2013, has since been joined by approximately three hundred other journals. Empirical comparisons of Registered Reports to standard articles consistently find that Registered Reports report null results at substantially higher rates — which is exactly the pattern you would expect if standard articles were systematically suppressing or reframing null findings.

Bishop’s 2019 Nature comment “Rein in the four horsemen of irreproducibility” placed HARKing among the four structural pathologies of contemporary research, alongside publication bias, low statistical power, and p-hacking, that the institutional reforms of the 2010s were designed to address. Bishop argued, persuasively, that each of the four horsemen has a known remedy: preregistration for HARKing, Registered Reports for publication bias, power analysis and adequate funding for low power, and full open data for p-hacking. The remedies are not technically difficult. The barriers to their adoption are institutional and incentive-based, not intellectual. The pace of adoption since 2019 has been faster in some fields (psychology, parts of biomedicine) than in others (much of economics, most of business research, large portions of nutritional epidemiology), and the methodological-reform map of contemporary science is essentially a map of where the four horsemen have and have not been reined in.

HARKing in Business and Management Research

Hollenbeck and Wright’s 2017 paper “Harking, Sharking, and Tharking” in the Journal of Management argued that HARKing was particularly prevalent in management and organizational behavior research and that the conventions of the field had institutionalized the practice in ways that were difficult to dislodge. Their argument is worth taking seriously because business research touches working strategists’ decisions in a more direct way than most academic psychology does.

Three features of business research make HARKing structurally tempting. First, the dependent variables are typically organizational outcomes — performance, turnover, satisfaction, engagement — that are influenced by an enormous number of factors. Any cross-sectional or longitudinal dataset will produce many statistically significant correlations purely by chance, and the temptation to construct theoretical narratives around the significant ones is constant. Second, the field’s top journals have historically expected papers to tell a clean theoretical story with confirmed hypotheses; messy exploratory write-ups have been viewed as unfinished or undertheorized. Third, the practical implications of business research — what executives should do — depend on the findings being treated as confirmatory, which creates downstream pressure to frame all findings as confirmatory regardless of how they were actually generated.

Hollenbeck and Wright introduced “sharking” (Secret Hypothesizing After Results are Known) to describe the most damaging form, in which the researcher does not even acknowledge to themselves that they are doing it. The post-hoc nature of the hypothesis becomes invisible even to the person who generated it, because the cognitive sequence of “look at data, generate theory, write theory as a priori prediction” feels natural and is not flagged as problematic by the field’s professional norms. Their “sharking” should not be confused with Rubin’s SHARKing, where the S stands for suppressing. They contrasted it with “tharking” (Transparent Hypothesizing After Results are Known), the same practice as the THARKing category described above, which they argued the field should normalize.

The empirical evidence on prevalence is sparse but suggestive. Surveys of management researchers asking about their own practices have found that a substantial minority (somewhere between a quarter and a third in most surveys, sometimes higher) acknowledge having presented post-hoc hypotheses as a priori at some point in their careers. The actual prevalence is almost certainly higher than the self-reported rate, both because researchers underreport socially undesirable behavior and because much HARKing is sharking, invisible even to its practitioner. A meta-analytic review that compared how hypotheses were specified with the pattern of results in published management papers suggested that the published literature contains far more “confirmed” hypotheses than would be expected if the hypotheses had genuinely been specified a priori with realistic estimates of effect sizes and power.

The practical implication for a working strategist reading business research is that the published “confirmation” of any management theory should be discounted heavily unless the underlying studies were preregistered. The base rate of HARKing in the field is high enough that an unpreregistered “confirmatory” finding in a management journal carries less evidential weight than a clearly-labelled exploratory finding in a more methodologically conservative field would carry. This is not a charge against any individual researcher; it is a structural adjustment to the prior probability that a given published claim describes a real effect at the size and direction reported.

How a Working Strategist Spots HARKed Research

In most cases you will not be able to check a preregistration record for the study you are reading. You need a set of textual signals that suggest a paper’s “confirmatory” findings are likely to be HARKed, so you can apply an appropriate discount before acting on the claim.

Signal one: the introduction reads like a perfect setup for the results. A non-HARKed paper has an introduction that was written before the results were known. It will typically include some hypotheses that were not supported by the data, some that were partially supported, and some that were supported in unexpected ways. A HARKed paper has an introduction that was rewritten after the results were known and reads, suspiciously, as if the theoretical literature led inexorably to exactly the pattern of findings the paper reports. When every hypothesis is confirmed and the theoretical framing fits the results perfectly, the explanation is rarely that the researcher had perfect theoretical insight. The explanation is usually that the introduction was rewritten to fit the data.

Signal two: the hypotheses are stated at a high level of granularity that exactly matches the analyses. Genuine a priori hypotheses tend to be stated at the level of theoretical constructs — “we predict that high-power individuals will show more risk-taking behavior than low-power individuals.” HARKed hypotheses, because they are generated from specific analyses, tend to be stated at the level of specific operationalizations — “we predict that participants in the high-power condition will choose the risky lottery 60% of the time, compared to 40% in the low-power condition, with a significant effect of condition (p < .05) but no significant interaction with gender.” When the level of detail in the predictions matches the level of detail in the analyses, the predictions were almost certainly generated from the analyses rather than prior to them.

Signal three: surprising findings are presented without surprise. If a paper reports a finding that ought to be surprising given the theoretical framing — an unexpected three-way interaction, a moderator effect in an unusual direction, a result that runs counter to a substantial prior literature — and the paper presents this finding as if it were the obvious expected outcome, the finding is likely a HARKed post-hoc discovery. Real predictions are often wrong in interesting ways. Predictions that are always right, especially in surprising places, are usually retrofits.

Signal four: the paper does not distinguish between confirmatory and exploratory analyses. Methodologically reformed papers now routinely include sections explicitly labelled “Confirmatory analyses” and “Exploratory analyses,” with the former reported strictly as preregistered and the latter labelled as post-hoc. Papers that present everything as confirmatory, or that bury exploratory analyses in the main results without labelling them, are operating in the pre-reform paradigm and warrant the corresponding discount.

Signal five: the sample size justification is missing or absurd. A confirmatory study has a sample size justification grounded in power analysis for the predicted effect size. An exploratory study that was later written up as confirmatory will typically either omit the sample size justification or include one that was clearly back-calculated from the observed effect to look adequate. When the sample size is exactly the number that produces a significant result for the reported effect, that is not a sample size justification — it is a post-hoc rationalization.
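To judge whether a stated sample size is plausible, it helps to know roughly what an a priori power analysis would have asked for. The sketch below uses the standard normal-approximation formula for comparing two group means; exact t-test calculations give slightly larger numbers. The function name and the example effect sizes are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size_d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants needed per group for a two-sided, two-group test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)


for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: about {n_per_group(d)} per group")
```

A medium effect (d = 0.5) needs roughly sixty participants per group at 80% power, and a small effect (d = 0.2) needs several hundred. A paper that claims to have predicted a small effect and tested it on thirty people per group has a justification problem, whatever its p-value says.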

Once you have these signals, you have a structured habit of mind that you can apply to any research claim before changing your strategic behavior based on it. The discount is not a precise calculation. It is a structured skepticism, calibrated by Kerr’s original analysis, that you apply before acting on a claim that depends on the researcher having genuinely predicted the result they report. The operational rule that falls out of the HARKing literature is simple: any “confirmed” finding in an unpreregistered study should be treated as exploratory until independently replicated in a preregistered design. This rule, applied consistently, will save you from most of the strategic mistakes that flow from over-trusting the apparent predictive success of theories that did not actually predict anything.

Sources

  • Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2(3), 196–217. DOI: 10.1207/s15327957pspr0203_4.
  • Rubin, M. (2017). When does HARKing hurt? Identifying when different types of undisclosed post hoc hypothesizing harm scientific progress. Review of General Psychology, 21(4), 308–320. DOI: 10.1037/gpr0000128.
  • Hollenbeck, J. R., & Wright, P. M. (2017). Harking, sharking, and tharking: Making the case for post hoc analysis of scientific data. Journal of Management, 43(1), 5–18. DOI: 10.1177/0149206316679487.
  • Bishop, D. V. M. (2019). Rein in the four horsemen of irreproducibility. Nature, 568(7753), 435. DOI: 10.1038/d41586-019-01307-2.
  • Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. DOI: 10.1177/0956797611417632.
  • Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. DOI: 10.1126/science.aac4716.
  • Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences, 115(11), 2600–2606. DOI: 10.1073/pnas.1708274114.
  • Chambers, C. D. (2013). Registered Reports: A new publishing initiative at Cortex. Cortex, 49(3), 609–610. DOI: 10.1016/j.cortex.2012.12.016.
  • Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2(8), e124. DOI: 10.1371/journal.pmed.0020124.
  • Asendorpf, J. B., et al. (2013). Recommendations for increasing replicability in psychology. European Journal of Personality, 27(2), 108–119. DOI: 10.1002/per.1919.

Frequently Asked Questions

Is HARKing the same as p-hacking?

No, though they are closely related and frequently co-occur. P-hacking is an analysis-stage practice: trying multiple analytical paths through the same data until one produces a significant result, then reporting only that path. HARKing is a reporting-stage practice: presenting a hypothesis that was generated after looking at the data as if it had been specified before the data were collected. The two can occur separately — you can HARK without p-hacking, by simply rewriting your introduction after a single planned analysis produces an unexpected result, and you can p-hack without HARKing, by transparently reporting that you searched the analysis space. In practice the two often co-occur because the same incentive structure rewards both: the publication system rewards clean confirmatory stories, and both p-hacking and HARKing are paths to producing one when the data did not naturally provide one.

Is generating hypotheses from data always wrong?

No, and Rubin’s THARKing category is essential here. Generating new hypotheses from data is one of the most valuable activities a researcher can do — most major theoretical advances begin with someone noticing an unexpected pattern in data. The problem is not the act of generating the hypothesis. The problem is presenting the generated hypothesis as if it had been predicted in advance. Transparent exploratory analysis, clearly labelled as such, is honest, useful, and should be normalized. The reform the field needs is not less exploration but more honest labelling of which work is exploratory and which is confirmatory.

How can I tell if a study was preregistered?

Preregistered studies will typically include a link to the preregistration in the methods section, often with the formula “we preregistered this study on the Open Science Framework at [URL]” or “this study was preregistered on AsPredicted (#XXXXX).” The URL should resolve to a public record of the original hypotheses and analysis plan, dated before data collection. If the paper does not mention preregistration, the study was almost certainly not preregistered, regardless of how confirmatory the framing of the paper feels. The absence of a preregistration link is itself a signal — most reformed-paradigm researchers now mention preregistration explicitly when they have done it, because it is one of the strongest available credibility markers.

Does HARKing happen in fields outside psychology?

Yes, and the structural conditions that produce it are present in essentially every field where the publication system rewards clean confirmatory stories. Surveys of researchers in economics, management, biomedicine, education, and political science have all found substantial self-reported rates of HARKing or HARKing-adjacent practices. The fields with the strongest preregistration infrastructure today — clinical trials, parts of psychology, parts of economics — are also the fields where the prevalence has been measured and where reform efforts have been most concentrated. The fields with weaker preregistration infrastructure — much of management, much of nutritional epidemiology, much of marketing research — should be assumed to have higher prevalence by default, in the absence of evidence to the contrary.

What is the single most useful practical rule from the HARKing literature?

Treat any “confirmed” finding from an unpreregistered study as exploratory until it has been independently replicated in a preregistered design. This rule does not require you to dismiss unpreregistered research — exploratory work is valuable and should be read. It requires you to apply the appropriate discount when deciding whether to act on a claim. The publication system has not yet caught up to the methodological reforms, which means that the published “confirmatory” record in many fields is still substantially contaminated by HARKed exploratory work. The discount rate is the single most useful adjustment a working strategist can make to their consumption of research claims.

Why is HARKing so hard to stop, if everyone agrees it is bad?

Because the incentive structure of academic publishing rewards it. Journal editors and reviewers, on average, prefer papers with clean confirmatory stories. Tenure committees count publications in top journals. Top journals reject papers with messy exploratory findings at higher rates than papers with clean confirmatory ones. The result is a system in which the honest researcher who labels their exploratory work as exploratory is competing for publication slots with the HARKing researcher who launders the same exploratory work as confirmatory. Until the publication incentive structure changes — through Registered Reports, through preregistration requirements, through editorial commitment to publishing null results — the HARKing temptation will persist regardless of how clearly everyone understands that it is harmful. The reforms of the 2010s and 2020s are changing the incentive structure in some fields. The pace is slower than the methodological consensus would suggest it should be, because changing institutional incentives is slower than changing intellectual consensus.

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified behavioral economist (1 of ~1,000 worldwide). 200+ A/B tests across energy, SaaS, fintech, e-commerce, and marketplace verticals.