How to Avoid Repeating Failed Experiments: The Failure Recurrence Rate

TL;DR: Repeated failed experiments aren't a sign of ambition — they're a sign the team isn't reading its own archive. The Failure Recurrence Rate tells you how often you're relearning lessons you already paid for.

Key Takeaways

  • Most high-volume programs repeat failed hypotheses at rates of 15% to 30% because failed tests are systematically under-archived
  • The Failure Recurrence Rate quantifies the leak: percentage of failed tests that retest a hypothesis the team already failed on
  • Root cause analysis separates execution flaws (broken tracking, bad randomization) from conceptual errors (wrong hypothesis, flawed mental model) — different failure types require different defenses
  • Staged testing and automated alerts catch execution failures early; archive-driven hypothesis review catches conceptual repeats before they launch
  • The fix isn't running fewer tests. It's making the failure archive as easy to retrieve as the success archive

Failed Tests Are the Most Valuable Entries You're Not Keeping

Most experimentation programs archive winners in detail and failures in a sentence. "Variant B lost" with no context. Six months later, a new hypothesis proposes something structurally similar. Nobody remembers the earlier failure. The team runs it again.

This isn't a character flaw. It's a documentation asymmetry. Wins get celebrated and documented because they ship. Losses get mentioned and forgotten because there's nothing to implement. The result is an archive that systematically underrepresents what didn't work — which is the part of the archive that contains the most learning density.

"If you keep seeing the same failures, the problem isn't your hypothesis. It's your process, your tools, or your people. Running more tests won't fix it." — Atticus Li

Programs that treat failed tests as equally valuable archive entries build a different kind of institutional memory. They learn where their hypotheses are systematically wrong, which is information no individual test can produce.

The Failure Recurrence Rate

Here's the metric:

FRR = Failed tests that retest a previously failed hypothesis / Total failed tests

Retesting a previously failed hypothesis means the new test, in substance, proposed the same directional change on the same metric targeting the same or similar audience — and the earlier attempt produced a clear negative or inconclusive result.

Interpretation thresholds:

  • FRR below 10% — Strong archive discipline. Teams are reading past tests before designing new ones.
  • FRR between 10% and 25% — Typical. Some retrieval is happening, but failures get missed more than wins.
  • FRR above 25% — Archive is effectively write-only for failures. The program is losing significant learning to repetition.
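The metric is simple to compute once each failed test carries a normalized hypothesis key. Here's a minimal sketch; the `hypothesis_key` field and its slash-delimited format are illustrative assumptions, standing in for whatever encoding of directional change + metric + audience your archive uses.

```python
from dataclasses import dataclass

@dataclass
class FailedTest:
    test_id: str
    # Illustrative key format: feature-change/target-metric, encoding the
    # "same directional change on the same metric" definition above.
    hypothesis_key: str

def failure_recurrence_rate(failed_tests):
    """FRR = failed tests retesting a previously failed hypothesis / total failed tests.

    Assumes tests are ordered chronologically; a test counts as a repeat
    if any earlier failed test shares its hypothesis_key.
    """
    if not failed_tests:
        return 0.0
    seen, repeats = set(), 0
    for t in failed_tests:
        if t.hypothesis_key in seen:
            repeats += 1
        seen.add(t.hypothesis_key)
    return repeats / len(failed_tests)

tests = [
    FailedTest("T1", "checkout/remove-field/conversion"),
    FailedTest("T2", "signup/social-proof/signup-rate"),
    FailedTest("T3", "checkout/remove-field/conversion"),  # repeat of T1
    FailedTest("T4", "pdp/bigger-images/add-to-cart"),
]
print(failure_recurrence_rate(tests))  # 1 repeat out of 4 failures = 0.25
```

The hard part isn't the arithmetic — it's the keying. Fuzzy or overly broad keys either miss repeats or flag false ones, which is why the matching definition above matters.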

Separating Execution Flaws from Conceptual Errors

Not all failures are the same. Treating them identically misses the intervention each needs.

Execution flaws are failures where the test was well-designed but something broke: tracking misfired, randomization was uneven, sample size was too small, a caching layer corrupted assignment, a sample ratio mismatch (SRM) went undetected. The hypothesis might have been right; we just can't tell from the data.

Conceptual errors are failures where the test ran cleanly but the hypothesis was wrong. The change didn't produce the predicted effect because the underlying mental model of user behavior was off.

These require different defenses:

  • Execution flaws are caught by pre-launch QA, automated SRM detection, and guardrail metric monitoring.
  • Conceptual errors are caught by archive-driven hypothesis review — checking whether similar hypotheses have been tested and what was learned.

Teams that conflate the two typically over-invest in technical QA and under-invest in hypothesis archive review. The second category is where most repeated failures actually come from.

Building the Archive That Catches Repeats

Tag every experiment by hypothesis type, not just feature area. "Checkout page" is too broad. "Remove a required field to reduce friction" is specific enough to match against past attempts.

Document what failed and why. A failed test entry should include: the hypothesis, the variant description, the metric that didn't move (or moved wrong), the sample size and significance reached, and a one-sentence interpretation of what the failure suggests.

Surface past failures at intake. The hypothesis entry workflow for new tests should include a search step: "Have we tested this direction before?" If the archive is searchable, this takes 2 minutes and catches most repeats.

Group related experiments into iteration chains. When a test is a follow-up to an earlier one, link them explicitly in the schema. Meta-analysis by chain reveals diminishing returns that no individual test can show.

Normalize failure classification. Clean negative, inconclusive (underpowered), inconclusive (null effect), flat with guardrail issue — these are different failure types with different implications. A single "loss" tag flattens the information.
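The practices above imply a concrete archive schema. This sketch combines the classification taxonomy, the minimum documentation fields, and a tag-overlap intake check; all field names and the `min_overlap` threshold are illustrative assumptions, not a prescribed format.

```python
from dataclasses import dataclass, field
from enum import Enum

class FailureClass(Enum):
    CLEAN_NEGATIVE = "clean negative"
    INCONCLUSIVE_UNDERPOWERED = "inconclusive (underpowered)"
    INCONCLUSIVE_NULL = "inconclusive (null effect)"
    GUARDRAIL_ISSUE = "flat with guardrail issue"

@dataclass
class ArchiveEntry:
    hypothesis: str
    variant: str
    primary_metric: str
    sample_size: int
    p_value: float
    classification: FailureClass
    interpretation: str  # the one-sentence takeaway
    tags: set = field(default_factory=set)  # hypothesis-type tags, not just feature area

def intake_check(archive, new_tags, min_overlap=2):
    """Surface past entries whose hypothesis-type tags overlap the new test's."""
    return [e for e in archive if len(e.tags & set(new_tags)) >= min_overlap]

archive = [
    ArchiveEntry(
        "Removing a required field lifts checkout conversion",
        "3-field checkout form", "checkout_conversion", 48_000, 0.41,
        FailureClass.INCONCLUSIVE_UNDERPOWERED,
        "Any effect is smaller than the minimum detectable effect we powered for.",
        tags={"remove-field", "friction", "checkout"},
    ),
]
# A new signup-form test sharing two hypothesis-type tags gets flagged at intake.
matches = intake_check(archive, ["remove-field", "friction", "signup"])
print([m.hypothesis for m in matches])
```

Note that the match fires across feature areas (checkout vs. signup) because the tags describe the hypothesis type — exactly the specificity the "remove a required field to reduce friction" example calls for.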

Staged Testing to Catch Execution Flaws

For high-volume programs, staging protects against execution failures cheaply:

Internal testing. Employee accounts or QA environments catch configuration bugs before any real user sees the test.

Beta rollout. 0.1% to 1% of traffic for 24-48 hours validates that the infrastructure works and that no obvious bugs emerge.

Graduated ramp. Expand from 5% to 25% to 50% over a few days. Guardrail violations at small scale stop the test before scaling.

Circuit breakers. Automated rollback if critical metrics (page load, error rate, revenue per visit) degrade beyond thresholds.

This sequence filters out the execution-flaw category almost entirely, leaving the archive free to focus on conceptual learning.
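The circuit-breaker step can be sketched as a simple guardrail check run on each monitoring interval. The metric names and thresholds below are placeholder assumptions; real values come from your own baselines.

```python
# Hypothetical guardrail thresholds -- substitute values derived from your baselines.
GUARDRAILS = {
    "error_rate": 0.02,          # max acceptable
    "p95_load_ms": 3000,         # max acceptable
    "revenue_per_visit": 1.80,   # min acceptable
}

def should_roll_back(metrics: dict) -> list:
    """Return the list of violated guardrails; any violation triggers rollback."""
    violations = []
    if metrics["error_rate"] > GUARDRAILS["error_rate"]:
        violations.append("error_rate")
    if metrics["p95_load_ms"] > GUARDRAILS["p95_load_ms"]:
        violations.append("p95_load_ms")
    if metrics["revenue_per_visit"] < GUARDRAILS["revenue_per_visit"]:
        violations.append("revenue_per_visit")
    return violations

healthy = {"error_rate": 0.004, "p95_load_ms": 1800, "revenue_per_visit": 2.10}
broken = {"error_rate": 0.06, "p95_load_ms": 1800, "revenue_per_visit": 2.10}
print(should_roll_back(healthy))  # []
print(should_roll_back(broken))   # ['error_rate']
```

Running this at small ramp percentages is what makes the filter cheap: an execution flaw caught at 1% of traffic never pollutes the archive with an ambiguous "loss."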

Common Mistakes in Failure Management

Documenting failures as "didn't work." Two-word entries teach nothing. The next person writing a similar hypothesis won't find it, and wouldn't learn from it if they did.

Skipping failure meta-analysis. Looking across 10+ failures in a feature area often reveals a pattern — a shared assumption the hypotheses kept making that was wrong. No individual failure surfaces this.

Over-correcting after one bad failure. A single shipped false positive triggers a governance crackdown that slows everything. The right response is targeted: what specific failure mode was this, and what specific archive entry would have caught the next one?

Blaming individuals. Repeat failures are almost always systemic — the archive didn't surface the past learning. Fixing retrieval fixes the pattern. Fixing people doesn't.

Advanced: Using AI for Failure Detection

At sufficient archive scale (a few hundred past tests), semantic search and clustering tools can help surface relevant past failures that exact-match tags might miss. A hypothesis about "making the signup form shorter" can be matched against past tests about "reducing form fields" or "one-click registration" even without explicit tag overlap.

This is worth investing in once the basic archive discipline is in place. Without structured capture, no amount of AI helps.
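To make the idea concrete, here is a deliberately crude lexical baseline — Jaccard overlap on tokens. It is not semantic search: it will miss exactly the paraphrase cases described above ("shorter signup form" vs. "one-click registration"), which is why production systems use embeddings. Function names and the threshold are illustrative.

```python
def jaccard(a: str, b: str) -> float:
    """Lexical token overlap; a stand-in for embedding similarity, not a substitute."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def related_failures(archive_hypotheses, new_hypothesis, threshold=0.2):
    """Rank archived hypotheses by similarity to a new one, highest first."""
    scored = [(h, jaccard(h, new_hypothesis)) for h in archive_hypotheses]
    return sorted((s for s in scored if s[1] >= threshold), key=lambda s: -s[1])

past = [
    "reduce form fields to cut friction",
    "add social proof badges",
]
print(related_failures(past, "remove form fields from signup"))
```

Swapping `jaccard` for cosine similarity over sentence embeddings is the upgrade path — but the point stands either way: without the structured capture described earlier, there is nothing for any similarity function to search.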

Frequently Asked Questions

How do I convince the team to document failures in detail?

Show them the cost. An FRR audit that finds 20% of recent failed tests repeated prior failures is a strong motivator. Repeat failures are cheap to count and visible once surfaced.

Should inconclusive tests count as failures?

Yes, with a different classification. Inconclusive results tell you something about sample size or minimum detectable effect (MDE) choice, which is useful for future test design. Flatten them to "didn't work" and you lose that.

What's the minimum documentation per failure?

Hypothesis, variant description, primary metric with result, sample size and significance, one-sentence interpretation. About 5 minutes of writing. That floor is enough to catch 80% of future repeats.

How do I handle a failure where I suspect execution issues but can't confirm?

Archive it as inconclusive with an execution suspicion note. Don't mark it as a clean negative — future teams should know the data isn't trustworthy.

When is it legitimate to rerun a previously failed test?

When the context has changed materially: different audience, different feature surface, substantially different implementation. The archive entry should make this explicit in the hypothesis: "Unlike the prior test, this one targets X because Y has changed."

Methodology note: Failure Recurrence Rate thresholds reflect experience across mid-market experimentation programs. Specific figures are presented as ranges. Classification patterns draw on established practice in experiment archive design.

---

Past failures compound in value when they're actually searchable. Browse the GrowthLayer test library for examples of how failures can be documented as first-class archive entries.

Atticus Li

Experimentation and growth leader. Builds AI-powered tools, runs conversion programs, and writes about economics, behavioral science, and shipping faster.