Spaced Repetition and Testing Effect: The Learning Science That Actually Replicates (Anti-Example)

Atticus Li

Two of the most robust findings in cognitive psychology — spaced repetition and retrieval practice — have a century of replication, large effect sizes, mechanistic grounding, and applied tools that work. Corporate L&D ignores both. Here is why the science is right and the industry is wrong.

By Atticus Li · May 25, 2026 · 27 min read

If you have been reading through this hub, you have watched canonical psychological findings get dismantled one after another. Power posing, ego depletion, the marshmallow test, money priming, learning styles, multiple intelligences, growth mindset at scale, brain-training games, the Mehrabian 7-38-55 rule, almost the entire family of social-priming results — over and over, the most-cited demonstrations of “this is how the mind works” have either failed to replicate, shrunk to something much smaller than their original claim, or turned out to be outright fraud.

A reader by now might conclude that all of psychology is suspect. That conclusion would be wrong, and this article exists to explain why.

Because in the same period that produced all those failures, cognitive psychology was quietly accumulating two findings that have done the opposite. They replicate. They replicate in laboratories, in classrooms, in medical training, in language learning, in vocabulary acquisition, in mathematics, in the most rigorous meta-analyses anyone has run. They replicate across age groups, across cultures, across decades, across measurement paradigms. The effect sizes are not the small-d-equals-0.2 polite cough that haunts most of social psychology — they are moderate to large, large enough to be detectable at the level of an individual classroom. The mechanisms are reasonably well understood. And the practical applications already exist as commercial products that work: Anki, Duolingo, the medical-board-prep apps that essentially every U.S. physician trains on, Quizlet, the spaced-retrieval modules now embedded in Khan Academy.

The findings are spaced repetition — the observation that the same total study time, distributed across spaced intervals, produces dramatically better long-term retention than the same time massed into a single session — and the testing effect, also called retrieval practice, which is the observation that attempting to retrieve information from memory strengthens future retention more than re-studying the same material does.

This is an anti-example article in a hub full of takedowns. It exists for three reasons. First, calibration: readers should leave the hub knowing that “psychology is mostly broken” is wrong; the more accurate claim is that some psychological findings are extraordinarily well-supported, and a much larger number are fragile, and the field’s central failure was treating both categories as if they had the same evidential weight. Second, decision-usefulness: for a CEO, CHRO, or strategist evaluating training programs, certification platforms, employee onboarding, or knowledge retention, these two findings are the highest-ROI interventions on the menu. Third, the more interesting puzzle: if these findings are this well-supported, why does almost every corporate training program ignore them?

So here is the case for spaced repetition and the testing effect, including the legitimate caveats, plus an explanation of why robust science fails to propagate into practice when the practice industry has incentive structures pointing the other direction.

Spaced Repetition — Ebbinghaus 1885 Onward

Spaced repetition is among the oldest empirical findings in experimental psychology. It dates to the decade in which psychology was first organizing itself as a laboratory discipline, and it is older than most university psychology departments.

In 1885, Hermann Ebbinghaus published Über das Gedächtnis (Memory: A Contribution to Experimental Psychology), the founding monograph of empirical memory research. Ebbinghaus performed the experiments on himself, memorizing thousands of nonsense syllables across years, varying the spacing between rehearsal sessions, and measuring how much he retained at later test points. He documented two phenomena that have stood ever since.

The first is the forgetting curve: forgetting is rapid in the hours and days immediately after learning, then slows asymptotically. The second is the spacing effect: the same total amount of study, distributed across multiple sessions separated by gaps, produces substantially better long-term retention than the same total time massed into a single session. Ebbinghaus’s data on this point were not modest correlations. The retention advantage from spacing was large enough to be visible in the raw recall percentages without any inferential statistics, which was fortunate because inferential statistics had not been invented yet.

In the 140 years since, the spacing effect has become one of the most-replicated findings in psychology. Not in the sense of “frequently cited.” In the literal sense: few if any claims in the discipline have as many independent experimental demonstrations behind them.

The canonical modern synthesis is Cepeda, N. J., Pashler, H., Vul, E., Wixted, J. T., & Rohrer, D. (2006). “Distributed practice in verbal recall tasks: A review and quantitative synthesis.” Psychological Bulletin, 132(3), 354–380. DOI: 10.1037/0033-2909.132.3.354.

Cepeda and colleagues conducted a meta-analysis of distributed practice that located 839 separate effect-size estimates across 317 experiments in 184 published articles. This is not a meta-analysis of a hot recent finding scraped from a decade of conferences. It is a meta-analysis of a hundred-plus years of accumulated experimental work, and the conclusion is unambiguous. Distributing study sessions across time produces large, consistent gains in long-term retention compared to massed study, and the magnitude of the gain depends in a predictable way on the relationship between the spacing interval (ISI, inter-study interval) and the retention interval (the gap between final study and final test). As retention interval increases, the optimal spacing interval also increases. There is no single “right” interval; there is a function relating two variables. This is exactly the kind of mechanistic regularity you want to see in a finding you intend to trust.

The Cepeda meta-analysis is one of the cleanest results in the entire psychological literature. Effect sizes are large, the dose-response relationship is orderly, and the phenomenon shows up across every verbal-learning paradigm anyone has tested. Subsequent extensions have replicated it in vocabulary learning, mathematics practice, medical-board content, motor skill acquisition, and language learning. The field-wide replication crisis that demolished so much of social psychology did not touch spacing because spacing had already been confirmed in hundreds of experiments, under conditions stringent enough that no single failed replication could overturn it.

The Testing Effect — Roediger and Karpicke 2006, Karpicke and Blunt 2011

The companion finding, the testing effect, is younger as a focus of organized research but is now arguably the most-replicated learning-science result of the past quarter century.

The canonical review is Roediger, H. L., & Karpicke, J. D. (2006). “The Power of Testing Memory: Basic Research and Implications for Educational Practice.” Perspectives on Psychological Science, 1(3), 181–210. DOI: 10.1111/j.1745-6916.2006.00012.x.

Roediger and Karpicke synthesized a century of scattered laboratory work into a single coherent claim: the act of retrieving information from memory, even without feedback, produces better long-term retention than re-studying the same material for the same amount of time. In the most striking version of the result, from the authors’ companion experiments published the same year, students who studied a passage once and then took three recall tests on it outperformed, on a final test a week later, students who had studied the same passage four times. The contrast is not between testing and nothing. The contrast is between testing and substantially more exposure to the material. The tested group wins, and the advantage has held up across many replications.

Five years later, Karpicke, J. D., & Blunt, J. R. (2011). “Retrieval Practice Produces More Learning than Elaborative Studying with Concept Mapping.” Science, 331(6018), 772–775. DOI: 10.1126/science.1199327 delivered the result that put retrieval practice into mainstream scientific consciousness. Karpicke and Blunt directly pitted retrieval practice against concept mapping — at the time, one of the most heavily promoted “active learning” techniques in education. Across two experiments using science-education texts, the retrieval-practice group outperformed the concept-mapping group by a substantial margin on both factual recall and inference questions, including, remarkably, on a final test that itself required producing a concept map. Retrieval practice was better even at the criterion task that concept mapping was supposedly the right tool for.

A second result from the same paper deserves attention. The students consistently predicted, in advance, that concept mapping would help them more than retrieval practice. They were wrong by a wide margin. The activity that produces the largest learning gains is one that learners systematically underestimate, because it feels harder and feels less productive in the moment. This is the “desirable difficulties” framework that Robert and Elizabeth Bjork formalized in Bjork, R. A., & Bjork, E. L. (2011). “Making things hard on yourself, but in a good way: Creating desirable difficulties to enhance learning.” Psychology and the Real World, 56–64. Conditions that slow down acquisition often improve retention. Conditions that feel productive often are not.

The meta-analytic confirmation arrived as Adesope, O. O., Trevisan, D. A., & Sundararajan, N. (2017). “Rethinking the Use of Tests: A Meta-Analysis of Practice Testing.” Review of Educational Research, 87(3), 659–701. DOI: 10.3102/0034654316689306. Across 272 effect sizes from 188 separate experiments, the weighted mean effect of practice testing versus re-studying was Hedges’ g = +0.51, a moderate effect by the usual conventions and a large one by the standards of education research. Against filler or no-activity controls, the effect was +0.93, which is, frankly, enormous. The moderator analyses were mostly orderly, with one caveat on format: multiple-choice practice tests did at least as well as short-answer tests in this dataset, so the case for generation over recognition does not rest on this meta-analysis. The benefit held at delayed retention intervals and generalized across age groups and content domains.

For comparison: in social psychology, the median effect size across the studies that survived the Many Labs replication efforts was somewhere around d = 0.1 to 0.2. At g = +0.51 against re-study, the testing effect is roughly two and a half to five times the magnitude of the typical social-psychology effect that does replicate, and it shows up in nearly every experiment that has tested for it.

The Dunlosky 2013 Comprehensive Review

The single most important citation in this literature, the one document I would recommend any executive read before commissioning a new training program, is Dunlosky, J., Rawson, K. A., Marsh, E. J., Nathan, M. J., & Willingham, D. T. (2013). “Improving Students’ Learning with Effective Learning Techniques: Promising Directions from Cognitive and Educational Psychology.” Psychological Science in the Public Interest, 14(1), 4–58. DOI: 10.1177/1529100612453266.

Dunlosky and his coauthors performed a systematic review of ten learning techniques in widespread educational use. For each technique, they reviewed the evidence on whether the technique produces durable learning, and they classified each technique as “high utility,” “moderate utility,” or “low utility” based on (a) the strength of the evidence, (b) whether the benefit generalized across learner age groups and content domains, (c) whether the benefit held up at delayed retention intervals rather than just immediate tests, and (d) whether the technique was actually feasible to deploy.

Two techniques received the “high utility” classification: distributed practice (spacing) and practice testing (retrieval practice). These are the two findings this article is about. They are the only two that cleared the bar.

The “moderate utility” tier contained elaborative interrogation (asking yourself “why is this true?”), self-explanation (verbalizing what you are doing as you work), and interleaved practice (mixing different kinds of problems within a study session). All three have decent evidence but with more domain restrictions and smaller or less consistent effects.

The “low utility” tier — and this is where the corporate training industry should pay attention — contained the techniques that most learners and most training designers actually rely on. Highlighting and underlining: low utility. Re-reading: low utility. Summarization: low utility for most learners (it requires training to produce useful summaries). Keyword mnemonics: low utility outside narrow conditions. Imagery for text: low utility. The techniques that feel productive — that feel like studying — are not the techniques that produce learning.

Dunlosky and coauthors are careful, conservative scientists writing for the journal of record in evidence-based psychology. They are not selling a product. They concluded, in print, that the entire study-skills repertoire that most people associate with “studying” is empirically ineffective at producing durable knowledge, and that the two techniques people use least are the two that work best.

That paper came out in 2013. It is the canonical citation. It is taught in graduate education programs. And in the thirteen years since, corporate L&D has barely moved.

What Most Corporate L&D Does Instead

If you have sat through a typical corporate training experience in the last decade, you have observed Dunlosky’s “low utility” methods deployed at industrial scale.

The new-hire onboarding is a one-week event. New employees sit through eight-hour days of presentations, watch videos, read slide decks, and then are tested at the end of the week or not tested at all. The compliance training is annual: a one-hour video, a ten-question multiple-choice quiz that mostly tests whether you can pay attention to the previous sentence, and a certificate. The sales-enablement program is a two-day off-site where account executives watch the product team present features and use cases, then return to their desks. The leadership development program is a four-week intensive at a corporate university.

Every one of these is the spacing effect’s opposite. Mass the material into a short window. Do not revisit it. Do not test retrieval. Score completion, not retention. The empirical prediction is that a substantial fraction of the content will be forgotten within weeks. That prediction is the one that actually obtains.

The “learning experience” platforms that have proliferated in the past decade are mostly worse, not better. Most of them are libraries of videos and slide decks. The interaction model is “watch and click next.” Knowledge checks, when they exist, are multiple-choice with low item-discrimination, taken immediately after the relevant content while the answer is still in working memory. There is no spaced re-test. There is no distributed practice. The technology stack supports rich measurement of consumption — minutes watched, completion percentage, modules finished — and almost no measurement of retention.

This is not because the underlying technology cannot deliver spaced retrieval. The technology to do this has existed for decades. Anki, the open-source flashcard application whose scheduler descends from Piotr Wozniak’s SuperMemo algorithms, has implemented adaptive spaced repetition since the mid-2000s. The medical-board-prep ecosystem (UWorld, AMBOSS, AnKing, the Pepper deck) runs largely on spaced retrieval and supports a large paid consumer market. Duolingo built one of the largest consumer-software successes of the past fifteen years substantially on spaced repetition. The science is implementable. It just is not implemented in the corporate L&D stack.

Why Practice Ignores the Science

The gap between the cognitive-psychology literature and corporate training practice is one of the largest evidence-to-implementation gaps in any applied field. The mechanism, once you look at the incentive structure of the L&D vendor market, is not mysterious.

Training is sold as an event, not a process. A CHRO has a budget cycle. The budget purchases a deliverable: a course, a workshop, a platform license. The deliverable has a delivery date. The metric reported back to the budget owner is “completed.” A vendor whose product requires daily ten-minute spaced-retrieval sessions distributed across six months cannot be packaged as an event. It is harder to sell, harder to expense, harder to put on a slide for the board.

The measurement infrastructure rewards completion, not retention. Almost every learning-management system in corporate use measures completion rate, time-on-platform, and end-of-module quiz scores. None of these have any relationship to long-term retention, and at least one (end-of-module quizzes) is actively misleading: a quiz immediately after the relevant content measures working memory plus reading comprehension, not learning. Retention measurement requires a delayed test, and a delayed test requires you to be willing to discover that most learners have forgotten most of the content, which is not a finding that helps anybody’s quarterly metrics.

The learner-satisfaction signal points the wrong way. Recall the Karpicke and Blunt finding that students systematically prefer concept mapping over retrieval practice and systematically underestimate the latter. The same pattern shows up in adult learners. Spaced retrieval feels worse in the moment. It feels effortful, sometimes embarrassing (you cannot recall something you “learned” six weeks ago), and unproductive (you spend ten minutes on what feels like five flashcards). Re-reading feels productive. Watching a video feels productive. Highlighting feels productive. Course evaluations reward the activities that feel productive, not the activities that produce learning, and vendors optimize for the evaluations.

The procurement model rewards features over outcomes. When a procurement team evaluates an L&D platform, they compare feature lists: number of courses, integration with HRIS, mobile app, analytics dashboard, AI-generated content. “Implements distributed practice with adaptive scheduling” is a feature you have to know to ask about, and most procurement teams do not. The platforms that win deals are the ones that look good in feature comparisons, which has approximately zero correlation with the platforms that produce retention.

The replication-crisis-era reflex of “it’s all probably wrong” is operative here too. A learning-and-development leader, briefed by an enthusiastic vendor on the latest “neuroscience-based learning approach,” has reasonable cause to be skeptical. The unfortunate side effect is that the same skepticism gets applied to the two findings that genuinely earned the right to be trusted, and they end up bracketed alongside the latest brain-training fad or learning-styles assessment as “yet another claim from psychology.” The whole point of this article is to argue against that flattening.

What’s Actually Working Now

The application case is clearest in domains where the consequences of forgetting are immediate and consequential, and where consumers pay out of pocket.

Medical board preparation is the cleanest example. U.S. medical students and residents preparing for USMLE Step 1, Step 2, and Step 3 use spaced-retrieval flashcard decks (AnKing, the Pepper deck) and dedicated paid platforms like AMBOSS and UWorld as the spine of their preparation. Twelve months of daily reviews of thousands of flashcards is the modal prep pattern, because the empirical evidence (and the visible pass-rate difference) makes the case undeniable. This is a consumer market with skin in the game, and it converged on spaced retrieval because it works.

Language learning moved the same direction. Duolingo, Anki, Memrise, and a cohort of competitors all implement spaced-retrieval scheduling as the core mechanic. Duolingo’s product team has talked publicly about A/B testing the spacing algorithm and observing that the spaced version produces better long-term retention than the alternatives they have tested.

K-12 and higher education, especially the parts that have embraced cognitive science as a basis for curriculum design, are increasingly building spaced retrieval into core practice. Khan Academy’s “Mastery” path uses spaced re-testing to schedule when a topic resurfaces. The Retrieval Practice Consortium, organized around Roediger and Karpicke and McDaniel and others, has produced classroom-ready materials that teachers have adopted at scale.

Corporate domains where retention has hard business consequences — cybersecurity awareness, regulated-industry compliance, technical certification — are slowly moving the same way. Cybersecurity-training vendors increasingly send micro-doses of phishing-recognition content on a spaced schedule rather than running an annual one-hour video. The reason is not pedagogical sophistication. It is that the spaced approach measurably reduces click-through rates on test phishes, and that number lands on a CISO’s dashboard.

The pattern is that wherever retention has a sharp measurable consequence that the buyer cares about, the market eventually finds spaced retrieval. Where retention has no measurable consequence, the market sticks with what feels productive — videos, slides, completion checkboxes.

What This Anti-Example Tells Us About Robust Science

The reason spaced repetition and the testing effect survive while so many other psychological findings have collapsed is worth looking at directly. There is a pattern.

A century-plus of independent replication. The original spacing demonstration is from 1885. The testing effect’s earliest published demonstrations are from the early 1900s. The findings did not arrive with a celebrity press tour; they accumulated quietly across a hundred-plus years of laboratory work conducted by different research groups using different paradigms with no incentive to coordinate. This is the opposite of the typical replication-crisis profile, where a single charismatic paper produces a wave of citations and then the wave collapses when independent groups attempt direct replication.

Multiple converging paradigms. Spacing has been demonstrated in word-list learning, vocabulary acquisition, mathematics practice, motor skills, medical content, and second-language acquisition. The testing effect has been demonstrated in laboratory paradigms, classroom field experiments, online courses, and applied training. The findings are not paradigm-specific.

Mechanistic grounding. Both effects have plausible, well-articulated mechanisms (consolidation during inter-study intervals; retrieval-induced reconsolidation; the difficulty of the retrieval attempt as a forcing function on encoding strength). The mechanisms make further predictions — the spacing-by-retention-interval interaction Cepeda confirmed, the relationship between retrieval-attempt difficulty and benefit — that have themselves been tested and supported. This is the hallmark of a finding that is more than a coincidence in a particular dataset.

Effect sizes large enough to detect in real applications. The Adesope 2017 meta-analytic effect size for practice testing was Hedges’ g = +0.51. The spacing effect is in the same range. These are not effects that require N = 100,000 to detect; they are visible at the level of an individual classroom. “Detectable by ordinary inspection” is a property of robust findings.
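To put a number on “visible at the level of an individual classroom,” here is a rough sketch of how many learners per group a simple two-group comparison needs in order to detect a given standardized effect. It assumes the textbook normal-approximation formula, a two-sided alpha of 0.05, and 80% power. Those conventions, the function name n_per_group, and the d = 0.15 stand-in for a small social-psychology effect are my assumptions for illustration, not figures taken from the meta-analyses.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate learners per group for a two-sided, two-sample comparison.

    Uses the normal approximation n = 2 * (z_alpha + z_power)^2 / d^2.
    """
    if d <= 0:
        raise ValueError("d must be a positive standardized effect size")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


if __name__ == "__main__":
    scenarios = [
        ("practice testing vs. re-study (g = 0.51)", 0.51),
        ("practice testing vs. filler (g = 0.93)", 0.93),
        ("assumed small effect (d = 0.15)", 0.15),
    ]
    for label, d in scenarios:
        print(f"{label}: about {n_per_group(d)} learners per group")
```

Under these assumptions the testing-versus-re-study effect needs on the order of 60 learners per group, while the small effect needs roughly 700. An exact t-test calculation would nudge each figure up slightly.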

Applied tools that demonstrably work. The market test has been run. Anki, Duolingo, UWorld, AMBOSS, the medical-board-prep ecosystem — these are commercial products built on these findings, used by tens of millions of people, with measurable user outcomes. If the science were wrong, the products would not produce retention, and the consumer market would not have converged on them.

When you see this combination — long replication history, multiple paradigms, articulated mechanism, large effects, applied products that work — you are looking at a finding you can trust. When you see the opposite — one or two foundational papers, a single paradigm, post-hoc mechanism, small effects, no successful applied tools — you are looking at a finding that probably will not survive scrutiny.

What This Means For Strategists Designing Training Programs

The practical implication for executives is straightforward, and the calibration is the opposite of what most L&D pitches imply.

Replace mass-and-forget training with spaced-and-retrieve training. A two-day off-site followed by nothing produces low retention. The same content delivered as twenty-minute spaced-retrieval sessions distributed across three months produces dramatically higher retention. The content is not different. The schedule is different. The retention difference is large enough that it should show up in performance metrics if you measure them.

Replace re-watching and re-reading with retrieval. Any module in a training program that does not require the learner to attempt to produce information from memory is, on the Dunlosky 2013 evidence, in the “low utility” category. Adding a free-recall component, a short-answer prompt, or an applied scenario in which the learner must generate rather than recognize content produces gains that are large and well-documented.

Stop measuring completion as a proxy for learning. Completion measures whether the learner clicked through the content. It does not measure retention. If retention matters to the business, measure retention — with a delayed assessment, weeks or months after the original delivery. Expect the first measurement to be uncomfortable.
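A minimal sketch of what “measure retention” can look like in practice: compare each learner’s score on the end-of-module quiz with the same learner’s score on a delayed assessment. The function name, the dictionary layout, and the sample scores are all hypothetical; a real program would also need comparable test items at both time points.

```python
from statistics import mean


def retention_summary(immediate: dict, delayed: dict) -> dict:
    """Summarize retention for learners who took both assessments.

    Both arguments map a learner id to the fraction of items answered
    correctly (0.0 to 1.0). Learners missing either score are excluded.
    """
    common = immediate.keys() & delayed.keys()
    if not common:
        raise ValueError("no learner has both an immediate and a delayed score")
    imm = mean(immediate[k] for k in common)
    dly = mean(delayed[k] for k in common)
    return {
        "learners": len(common),
        "immediate_mean": imm,
        "delayed_mean": dly,
        # Share of the immediate score still present at the delayed test.
        "retained_fraction": dly / imm if imm > 0 else float("nan"),
    }


if __name__ == "__main__":
    # Hypothetical scores: end-of-module quiz vs. the same items weeks later.
    end_of_module = {"a": 0.90, "b": 0.80, "c": 1.00}
    weeks_later = {"a": 0.45, "b": 0.40}
    print(retention_summary(end_of_module, weeks_later))
```

The completion dashboard would report all three of these learners as done. The delayed column is the one that says how much they kept.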

Budget for distributed delivery. Spaced-retrieval programs do not fit the one-week-event model. They require the learner to spend small amounts of time across an extended period. This is a project-management problem, not a content problem, and most LMS infrastructures can be configured to handle it if the procurement team asks for that capability.

Set realistic expectations about what training can accomplish. Even with spacing and retrieval, you cannot produce durable expertise from twelve hours of content. The advantage of spacing is not that it produces miracles — it is that the same content investment produces meaningfully more retention than the mass-and-forget alternative. That advantage compounds over years of distributed delivery into something the cram model never reaches.

What This Means For Evaluating L&D Vendors

For procurement teams selecting learning platforms or training vendors, the empirical literature offers a small number of high-signal questions to ask.

Does the platform implement spaced retrieval? Not “spaced delivery” (release a new module every Tuesday) but spaced retrieval (re-test previously-learned content at increasing intervals). SuperMemo’s SM-2 algorithm and its descendants, including the scheduler Anki popularized, are the canonical references. A platform that cannot articulate how it schedules re-tests is selling content management, not learning.
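For readers who want to see what “schedules re-tests at increasing intervals” means mechanically, here is a simplified sketch in the spirit of SM-2. The constants (a starting ease of 2.5, a floor of 1.3, first intervals of 1 and 6 days) follow the commonly published description of the algorithm; the Card class, the function name, and the rounding choice are mine, and production schedulers such as Anki’s differ in their details.

```python
from dataclasses import dataclass


@dataclass
class Card:
    repetitions: int = 0    # consecutive successful recalls so far
    interval_days: int = 0  # gap before the next scheduled review
    ease: float = 2.5       # multiplier that stretches the interval


def review(card: Card, quality: int) -> Card:
    """Return the card's new state after one review.

    quality is a 0-5 self-rating of how well the item was recalled.
    """
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")
    if quality < 3:
        # Failed recall: restart the interval sequence, keep the ease as is.
        return Card(repetitions=0, interval_days=1, ease=card.ease)
    if card.repetitions == 0:
        interval = 1
    elif card.repetitions == 1:
        interval = 6
    else:
        # Assumption: use the ease from before this review, rounded to days.
        interval = round(card.interval_days * card.ease)
    # Ease rises after effortless recalls and falls after shaky ones.
    ease = card.ease + (0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    return Card(card.repetitions + 1, interval, max(1.3, ease))


if __name__ == "__main__":
    card = Card()
    for rating in (5, 5, 5, 5, 2):
        card = review(card, rating)
        print(f"rated {rating}: next review in {card.interval_days} day(s)")
```

The useful property to ask a vendor about is visible in the loop: each successful recall pushes the next re-test further out, and a failed recall pulls it back to the next day.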

Does the platform require retrieval, or just exposure? Watch-and-click-next is exposure. Attempt-to-recall is retrieval. The difference is large enough in the empirical literature that “do your knowledge checks require generation, or just recognition?” is the right discriminating question.

Does the platform measure retention at delay? Immediate-test scores measure working memory, not learning. A platform that only reports completion rates and immediate-quiz performance has no way to demonstrate that learning happened. Ask for delayed-retention measurements.

Does the platform’s pedagogy match a “high utility” technique in Dunlosky 2013? If the answer is “we use video-based microlearning with engagement gamification,” you are looking at a low-utility approach with a marketing layer. If the answer is “we use distributed practice and retrieval-based assessment grounded in Roediger and Karpicke,” you are looking at a vendor whose product has a chance of working.

Does the platform’s own internal evidence include retention measurement? Any L&D vendor making evidence-based claims should have run their own retention experiments — comparing their model to a control, at delayed retention intervals. Most have not. The ones that have tend to lead with it.

The two findings in this article do not solve corporate learning, and they do not eliminate the need for content design, motivation, applied practice, and the dozen other things that go into a competent training program. They do, however, mark the floor. Any program that uses neither spacing nor retrieval practice has chosen to ignore two of the most-replicated findings in cognitive psychology. The cost of that choice is measurable, and it falls on the organization’s ability to retain what it teaches.

Sources

Ebbinghaus, H. (1885). Über das Gedächtnis: Untersuchungen zur experimentellen Psychologie (Memory: A Contribution to Experimental Psychology). Duncker & Humblot.
Dunlosky, J., Rawson, K. A., Marsh, E. J., Nathan, M. J., & Willingham, D. T. (2013). Improving students’ learning with effective learning techniques: Promising directions from cognitive and educational psychology. Psychological Science in the Public Interest, 14(1), 4–58. DOI: 10.1177/1529100612453266
Roediger, H. L., & Karpicke, J. D. (2006). The power of testing memory: Basic research and implications for educational practice. Perspectives on Psychological Science, 1(3), 181–210. DOI: 10.1111/j.1745-6916.2006.00012.x
Karpicke, J. D., & Blunt, J. R. (2011). Retrieval practice produces more learning than elaborative studying with concept mapping. Science, 331(6018), 772–775. DOI: 10.1126/science.1199327
Cepeda, N. J., Pashler, H., Vul, E., Wixted, J. T., & Rohrer, D. (2006). Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychological Bulletin, 132(3), 354–380. DOI: 10.1037/0033-2909.132.3.354
Bjork, R. A., & Bjork, E. L. (2011). Making things hard on yourself, but in a good way: Creating desirable difficulties to enhance learning. In M. A. Gernsbacher, R. W. Pew, L. M. Hough, & J. R. Pomerantz (Eds.), Psychology and the Real World: Essays Illustrating Fundamental Contributions to Society (pp. 56–64). Worth Publishers.
Pashler, H., Bain, P., Bottge, B., Graesser, A., Koedinger, K., McDaniel, M., & Metcalfe, J. (2007). Organizing Instruction and Study to Improve Student Learning. IES Practice Guide (NCER 2007-2004). National Center for Education Research, U.S. Department of Education.
Adesope, O. O., Trevisan, D. A., & Sundararajan, N. (2017). Rethinking the use of tests: A meta-analysis of practice testing. Review of Educational Research, 87(3), 659–701. DOI: 10.3102/0034654316689306

The Replication Crisis Hub — full index of the canonical findings this hub examines.
Learning Styles: The Edu-Myth That Will Not Die — the popular learning-science claim that did not replicate; the opposite-direction case study to this one.
Multiple Intelligences: Howard Gardner’s Untested Hypothesis — another popular learning-science framework with weak empirical support.
Brain-Training Games: The Cognitive Enhancement That Doesn’t Transfer — adjacent applied-cognition claim that has not held up.
Growth Mindset at Scale: The Effect That Shrank — another education-psychology claim whose effect size collapsed under replication.

FAQ

How long should spacing intervals be? The Cepeda 2006 meta-analysis shows that optimal spacing depends on how long you want the learner to retain the material. As a rule of thumb, the optimal inter-study interval is about 10–30% of the desired retention interval, so if you want learners to retain content for a year, study sessions should be spaced a month to several months apart. For most corporate training contexts, where retention is needed for years, expanding intervals that start at days and grow to weeks and then months are appropriate. Adaptive spaced-retrieval algorithms (the SM-2 family used by Anki, FSRS) handle the scheduling automatically based on each learner’s recall accuracy.
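The arithmetic behind that 10–30% rule of thumb can be sketched in a few lines of Python. This is an illustration only: the function name `spacing_window` and its parameters are hypothetical, and the two fractions are the rule-of-thumb bounds quoted above, not a fitted model.

```python
def spacing_window(retention_days, low=0.10, high=0.30):
    """Return the (shortest, longest) suggested gap between study sessions, in days.

    Applies the rule of thumb that the inter-study interval should be
    roughly 10-30% of the desired retention interval. The default
    fractions are assumptions taken from that rule, not measured values.
    """
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return retention_days * low, retention_days * high


# Example: content that should still be retained one year out.
shortest, longest = spacing_window(365)
print(f"Space sessions between {shortest:.1f} and {longest:.1f} days apart")
```

For a one-year retention target this gives a window of about five weeks to about three and a half months, which matches the "a month to several months" guidance. An adaptive scheduler would adjust the gap per learner rather than use a fixed fraction.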

What about Anki specifically — is it the right tool for corporate training? Anki is excellent as a personal study tool and as the canonical implementation reference for SM-2 style scheduling. It is not typically the right tool for corporate L&D because it lacks the administrative infrastructure (cohort management, compliance reporting, content authoring workflows) that enterprise buyers require. Several corporate learning vendors have built Anki-like spaced-retrieval engines into more enterprise-suitable platforms; that is the right pattern to look for.

What about corporate onboarding specifically — is one week of training a lost cause? Not lost, but predictably leaky. The empirical prediction is that a substantial fraction of one-week-event onboarding content will be forgotten within months. The high-leverage redesign is to convert the one-week event into a distributed program — an initial intensive followed by spaced retrieval-based reviews over the following ninety days. The total content does not need to change; the schedule does. Organizations that have made this conversion report better retention measurements and faster ramp times.
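One way to picture the distributed version is an expanding review calendar laid over the ninety days after the initial intensive. The sketch below is a minimal illustration: the function name `expanding_review_days`, the two-day first gap, and the doubling factor are all assumptions chosen for the example, not values taken from the cited research.

```python
def expanding_review_days(window_days=90, first_gap=2, growth=2):
    """List review days, counted from the end of the initial training.

    Each gap between reviews is `growth` times the previous gap, and any
    review that would fall after `window_days` is dropped. The defaults
    are illustrative assumptions, not recommended values.
    """
    if first_gap <= 0 or growth < 1:
        raise ValueError("first_gap must be positive and growth at least 1")
    days = []
    day, gap = 0, first_gap
    while day + gap <= window_days:
        day += gap                 # move forward to the next review
        days.append(round(day))
        gap *= growth              # widen the gap before the following review
    return days


print(expanding_review_days())
```

With the defaults this schedules five short retrieval sessions inside the ninety-day window, with the gaps widening from days to weeks. The content stays the same; only the calendar changes.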

What does this mean for live training, workshops, and conferences? Live training is high-bandwidth and motivationally valuable but produces predictably poor long-term retention without spaced follow-up. The right model treats the live event as the first exposure and builds a distributed retrieval program around it — spaced reviews, applied scenarios, periodic re-tests — in the weeks and months following. The cost of the follow-up is modest relative to the cost of the live event itself, and the retention multiplier is large.

What about adult learners specifically — does the evidence translate? Yes. The Adesope 2017 meta-analysis and prior reviews specifically examined age as a moderator. The testing effect is present and substantial across age groups, including adult learners and adults in professional and continuing-education contexts. The spacing effect is similarly age-invariant. There is no empirical basis for treating “adult learning” as a separate domain requiring different techniques; the cognitive mechanisms are the same.

What about subject-matter differences — does this work for technical content, soft skills, compliance? The strongest evidence is in factual and conceptual content — vocabulary, terminology, procedures, principles, factual relationships. Spaced retrieval is unambiguously the right approach for any training where the learner needs to know things at delay. The evidence is weaker for tacit-skill development (judgment, interpersonal skill, leadership presence) where deliberate practice with feedback is the appropriate framework. For mixed content — most corporate training falls here — spaced retrieval should handle the declarative-knowledge component while applied practice handles the skill component.

Are there any legitimate critiques or limits of these findings? A few. First, the spacing effect requires that the learner actually complete the spaced sessions; adherence is a real constraint, and gamification and reminders matter for real-world deployment. Second, the testing effect requires meaningful retrieval attempts; poorly designed multiple-choice items with low discrimination produce much smaller benefits than free-recall or short-answer formats. Third, the effects concern retention; they do not by themselves produce transfer to genuinely novel applications, which requires additional design. Fourth, the optimal spacing schedule depends on individual variability and content difficulty, which is why adaptive algorithms outperform fixed schedules. None of these critiques contradicts the core findings; they refine the conditions under which the findings produce maximum benefit.

If these techniques are so well-established, why isn’t every training company using them? This is the central puzzle the article addresses. The short answer is that the L&D vendor market has incentive structures pointing the wrong way: training is sold as events rather than processes, completion is easier to measure than retention, learner-satisfaction surveys reward the activities that feel productive rather than the activities that produce learning, and the procurement model rewards platforms that compare well on feature lists. Vendors that have built genuinely spaced-retrieval-based products exist and are growing; they tend to win where the buyer is sophisticated enough to ask about retention rather than completion.

Tags: replication-crisis, spaced-repetition, testing-effect, learning-science, evidence-evaluation

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified in behavioral economics. Led 100+ in-house experiments at NRG in 2025, with project evidence and limits documented in the case studies.

Related Articles

Cohen's d And The Misuse Of "Small/Medium/Large" Effect Sizes

The False Consensus Effect: Why You Think Everyone Agrees With You

The Barnum/Forer Effect: Why Personality Tests And Horoscopes Feel So Accurate
