Every mature experimentation program is sitting on a graveyard of half-dead feature flags — tests that concluded months or years ago, whose code still ships to production, whose owners have left the company, and which nobody is confident is safe to delete. They don't show up on any dashboard. They show up when a new test produces an impossible result, or when a reused flag reanimates code everyone forgot existed.
TL;DR
- A concluded experiment doesn't disappear when you stop looking at it. The flag, the branching code, and the instrumentation keep shipping until someone deliberately removes them — and "someone" rarely does.
- Zombie flags carry a real, compounding cost: they add code complexity, multiply the combinations QA has to reason about, and — worst — silently contaminate future experiments that unknowingly run on top of a still-live prior test.
- The catastrophic version is a reused flag reactivating dead code. This is not hypothetical; at Knight Capital it lost roughly $440 million in about 45 minutes and effectively ended the company.
- The fix is lifecycle discipline, not heroics: every flag gets an expiry and an owner at creation, cleanup is a definition-of-done step, and a periodic audit kills anything past its expiry.
| Flag age past conclusion | What it's become | Correct action |
|---|---|---|
| A few weeks | Release debt (expected, temporary) | Schedule removal |
| A few months | Technical debt by definition | Remove now |
| A year-plus, no owner | A zombie — risk with no upside | Audit, verify, delete |
| Reused for a new purpose | A live landmine | Stop; never repurpose flags |
The dangerous flags aren't the ones you remember. They're the ones nobody remembers — still executing, still branching traffic, still capable of firing.
A flag is inventory, and inventory has a carrying cost
The canonical treatment of feature toggles on Martin Fowler's site, written by Pete Hodgson, makes the economic framing explicit: toggles let you change behavior without changing code, but they introduce new abstractions and conditional logic, and that complexity is a carrying cost that has to be actively constrained over time (Fowler, *Feature Toggles*). The rule of thumb is blunt: a release toggle still in the codebase a couple of months after launch is debt by definition. The toggle earned its keep during the rollout; after that, it's just risk sitting in production.
The trouble is that the incentive structure guarantees the debt accumulates. Creating a flag is thrilling — it unblocks a test, it ships a feature safely. Removing a flag is thankless — no new value, some regression risk, and a chance you delete the wrong thing. So every program is biased toward creation and against cleanup, and the flag population grows monotonically until someone imposes discipline. Research and industry reports on high-scale feature-flag systems describe the same one-directional drift: flags get created far faster than they're retired, and without an explicit removal process the count only climbs. The natural state of an experimentation program is flag accumulation, the same way the natural state of a garage is filling with boxes.
The three costs, from annoying to catastrophic
1. Cognitive and QA load. Every live flag is a branch. Ten independent on/off flags give 2^10 = 1,024 theoretical combinations of system state, and while most are irrelevant, no one can be sure which. QA can't reason about all of them, so testing quietly stops being exhaustive and starts being hopeful. This is the same tax that makes it hard to run at high experiment velocity without sacrificing quality: uncleaned test infrastructure is drag on every future test.
2. Experiment contamination. This is the one that corrupts your data. A new experiment gets randomized and launched — and unknowingly runs on top of a prior test's flag that's still live for some fraction of users. Now your new test's arms are polluted by an old test's treatment, and you have an interaction effect you can't see. The result looks valid. It isn't. This is a cousin of the sample-integrity failures that experimentation governance exists to catch — except here the contamination source is your own archaeology, not a splitter bug.
3. Reactivation of dead code. The catastrophic tail. On August 1, 2012, Knight Capital deployed new code that repurposed a configuration flag which had previously activated an old, dormant program called "Power Peg": dead code that had sat unused in production for roughly eight years. One of eight servers didn't get the update, so on that server the reused flag reactivated the old code path. In about 45 minutes the runaway algorithm sent millions of erroneous orders and lost roughly $440 million, effectively ending Knight as an independent company (Knight Capital Group). The proximate trigger was a reused flag colliding with un-removed dead code: the two most common zombie-experiment hazards, combined.
Most zombie flags will only ever cost you complexity. The rare one costs you the company. You cannot tell them apart while they're asleep.
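The contamination case in particular lends itself to a pre-launch check. The following is a minimal sketch, not a reference implementation: the `Flag` fields, the notion of a "surface", and the flag keys are all hypothetical, and it assumes you can list which pages or code paths each flag branches on. It reports concluded tests that still split traffic on a surface the new experiment touches.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Flag:
    key: str                      # hypothetical flag identifier
    surfaces: frozenset[str]      # pages or code paths the flag branches on (assumed to be known)
    concluded_on: Optional[date]  # None while the experiment is still running
    rollout_pct: int              # share of users still receiving the treatment


def contaminating_flags(new_surfaces: set[str], flags: list[Flag]) -> list[Flag]:
    """Return concluded flags that still split traffic on a surface the new test touches."""
    return [
        f for f in flags
        if f.concluded_on is not None   # the old test is over...
        and 0 < f.rollout_pct < 100     # ...but users are still split between its arms...
        and f.surfaces & new_surfaces   # ...on a surface the new test also uses
    ]


flags = [
    Flag("checkout_cta_2022", frozenset({"checkout"}), date(2022, 11, 3), 50),
    Flag("pricing_layout_v2", frozenset({"pricing"}), date(2023, 2, 1), 100),
    Flag("cart_badge_test", frozenset({"cart", "checkout"}), None, 50),
]

blockers = contaminating_flags({"checkout", "cart"}, flags)
print([f.key for f in blockers])  # ['checkout_cta_2022']
```

A flag that is concluded but fully rolled out (0% or 100%) is still debt, but it does not split the new test's arms, so this sketch deliberately leaves it to the audit rather than blocking the launch.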
Why nobody owns the cleanup
The reason this debt is so durable is that ownership evaporates. The person who created the flag understood its full context: which users, which code path, what "done" looks like. When they move teams or leave, that context leaves with them, and the flag becomes an orphan: still live, no owner, no documentation, and a deletion that feels risky because nobody remembers what it touches. This is a concrete instance of the general problem of institutional knowledge loss in an A/B testing program. The flag is tribal knowledge encoded as live production behavior, and when the tribe forgets, the knowledge doesn't disappear; it goes feral.
The result is a standoff. Cleanup is obviously correct and individually irrational: the engineer who deletes an orphaned flag inherits all the risk and none of the reward. So the flags stay, and the graveyard grows, and everyone agrees it's a problem for "later."
The fix: lifecycle discipline built into the process
Killing zombie flags is not a cleanup project — a one-time purge just starts the accumulation over. It's a lifecycle policy applied at creation. When I scaled a program from a couple dozen tests a year to over a hundred, the thing that broke first wasn't test design; it was hygiene — different people creating flags with no shared convention, and no expectation that anyone would ever remove them. The fix was standardization, and flag lifecycle was part of it. Four rules do most of the work:
- Every flag gets an owner and an expiry at creation. Not a suggestion — a required field. A flag with no expiry is a flag with no exit plan.
- Cleanup is a definition-of-done step, not a follow-up ticket. The experiment isn't "done" when the decision is made; it's done when the losing code path and the flag are removed. A concluded test with live flag code is an open task, not a closed one.
- Never repurpose a flag. New purpose, new flag. Reuse is exactly the failure mode that reanimated Power Peg. The few characters you save aren't worth reintroducing dead-code risk.
- Run a periodic flag audit — quarterly is enough — that lists every flag past its expiry with no clear owner, and forces a verify-and-delete decision on each. Fowler's own test: if you can't measure your stale-flag count and time-to-removal, you aren't managing flags at all.
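The four rules can be sketched as a small registry. This is an illustration under stated assumptions rather than any particular flag platform's API: `FlagRegistry`, its method names, and the example keys and owners are hypothetical. Owner and expiry are required arguments, retired keys stay reserved so they cannot be repurposed, and `expired` produces the list a quarterly audit would work through.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FlagRecord:
    key: str
    owner: str
    expires_on: date
    purpose: str


class FlagRegistry:
    """Hypothetical in-memory registry; a real one would persist its records."""

    def __init__(self) -> None:
        self._records: dict[str, FlagRecord] = {}
        self._retired: set[str] = set()  # retired keys stay reserved forever

    def create(self, key: str, owner: str, expires_on: date, purpose: str) -> FlagRecord:
        # Rule 1: owner and expiry are required (no defaults, no blanks).
        if not owner.strip():
            raise ValueError("every flag needs a named owner")
        # Rule 3: a key that was ever used, live or retired, cannot be reused.
        if key in self._records or key in self._retired:
            raise ValueError(f"flag key {key!r} was already used; new purpose, new flag")
        record = FlagRecord(key, owner, expires_on, purpose)
        self._records[key] = record
        return record

    def retire(self, key: str) -> None:
        # Rule 2: called once the losing code path and the flag are removed.
        del self._records[key]
        self._retired.add(key)

    def expired(self, today: date) -> list[FlagRecord]:
        # Rule 4: the audit list, oldest expiry first.
        overdue = (r for r in self._records.values() if r.expires_on < today)
        return sorted(overdue, key=lambda r: r.expires_on)


registry = FlagRegistry()
registry.create("checkout_cta_test", "dana", date(2024, 3, 1), "CTA copy test")
registry.create("pricing_grid_test", "lee", date(2024, 9, 1), "pricing layout test")
registry.retire("checkout_cta_test")

try:
    registry.create("checkout_cta_test", "sam", date(2025, 1, 1), "unrelated banner test")
except ValueError as err:
    print(err)  # flag key 'checkout_cta_test' was already used; new purpose, new flag

print([r.key for r in registry.expired(date(2024, 10, 1))])  # ['pricing_grid_test']
```

Keeping retired keys in a reserved set is the design choice that matters here: deleting a flag should free the code, not the name.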
None of this is glamorous. All of it is the difference between a program whose infrastructure you can trust and one whose results are quietly contaminated by its own history. Across enough tests, the programs that stay trustworthy at scale are the ones that treat flag removal as part of the experiment, not a chore for later — a pattern visible in what two years and 200+ tests at scale actually require.
FAQ
How is a zombie experiment different from ordinary technical debt?
Ordinary tech debt slows you down. A zombie experiment can corrupt your data — a live flag from a concluded test can contaminate a new experiment's arms and produce results that look valid but aren't — and in the worst case can reactivate dead code. It's debt with a measurement-integrity hazard and a small catastrophic tail attached, which is why it deserves stricter lifecycle discipline than most debt.
Isn't deleting old flags risky? What if I break something?
Deleting a flag whose context is lost does carry risk — which is exactly the argument for not letting flags reach that state. If you assign an owner and expiry at creation and remove flags as part of finishing each test, deletion happens while the context is fresh and the risk is low. The risk you're worried about is a symptom of having skipped the lifecycle discipline in the first place. For genuinely orphaned flags, verify current behavior behind the flag before removal rather than deleting blind.
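One way to verify behavior before removal is to look at what the flag actually served recently. The sketch below assumes you can export recent evaluations as `(flag_key, variant)` pairs; that log format, the function names, and the flag keys are hypothetical, and the verdict strings are only suggestions for what to check next.

```python
from collections import Counter


def served_variants(flag_key: str, evaluations: list[tuple[str, str]]) -> Counter:
    """Count how often each variant of one flag was served in the log."""
    return Counter(variant for key, variant in evaluations if key == flag_key)


def removal_verdict(flag_key: str, evaluations: list[tuple[str, str]]) -> str:
    counts = served_variants(flag_key, evaluations)
    if not counts:
        # Dormant is not the same as safe: the code may still be reachable.
        return "no recent evaluations: confirm the code path is unreachable before deleting"
    if len(counts) == 1:
        (variant,) = counts  # the single variant every user received
        return f"only {variant!r} is served: keep that branch, delete the rest"
    return "still splitting traffic: resolve to one variant before deleting"


log = [
    ("old_nav_test", "control"),
    ("old_nav_test", "control"),
    ("promo_test", "control"),
    ("promo_test", "treatment"),
]

print(removal_verdict("old_nav_test", log))  # only 'control' is served: keep that branch, delete the rest
print(removal_verdict("promo_test", log))    # still splitting traffic: resolve to one variant before deleting
print(removal_verdict("ghost_flag", log))    # no recent evaluations: confirm the code path is unreachable before deleting
```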
We have hundreds of old flags already. Where do we start?
Start with an audit that sorts by two axes: age past conclusion and blast radius. Prioritize anything that's both old and touches high-traffic or high-stakes paths, and specifically hunt for any flag that's been reused for a second purpose — those are the Knight Capital pattern and the highest risk. You don't have to clear the whole graveyard at once; you have to stop it growing (lifecycle rules on new flags) and drain the most dangerous flags first.
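A minimal sketch of that ordering, with loud assumptions: the `blast_radius` score (1 to 5), the `reused` marker, and the choice to multiply age by blast radius are all illustrative, not a standard formula. Reused flags always sort first; the rest are ranked by age past conclusion weighted by blast radius.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class StaleFlag:
    key: str
    concluded_on: date
    blast_radius: int  # hypothetical score: 1 (low-traffic) to 5 (high-stakes path)
    reused: bool       # True if the key has served more than one purpose


def audit_order(flags: list[StaleFlag], today: date) -> list[StaleFlag]:
    """Most dangerous first: reused flags, then age in days times blast radius."""
    def priority(f: StaleFlag) -> tuple[bool, int]:
        age_days = (today - f.concluded_on).days
        return (f.reused, age_days * f.blast_radius)

    return sorted(flags, key=priority, reverse=True)


stale = [
    StaleFlag("banner_copy", date(2023, 1, 10), blast_radius=1, reused=False),
    StaleFlag("checkout_flow", date(2023, 6, 1), blast_radius=5, reused=False),
    StaleFlag("legacy_router", date(2024, 1, 15), blast_radius=3, reused=True),
]

ordered = audit_order(stale, today=date(2024, 7, 1))
print([f.key for f in ordered])  # ['legacy_router', 'checkout_flow', 'banner_copy']
```

Note that `banner_copy` is the oldest flag but lands last: age alone is a poor proxy for risk once blast radius and reuse are taken into account.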
Who should own flag cleanup — engineering or the experimentation team?
Both, at different points. The experimentation team owns the policy — expiry, no-reuse, audit cadence — because they understand the test lifecycle. Engineering owns the removal execution because they understand the code. The failure mode is when neither owns it and it falls into the gap between them, which is precisely how zombies are born. Name the owner at flag creation and the gap closes.
Bottom line
A concluded experiment leaves a body: flag code that keeps shipping until someone deliberately removes it. Left alone, those bodies accumulate into complexity, contaminate new experiments with invisible interaction effects, and occasionally, when a reused flag hits un-removed dead code, do catastrophic damage. The cause isn't incompetence; it's an incentive structure that rewards creating flags and punishes deleting them, plus the knowledge loss that turns old flags into orphans nobody dares touch. The fix is lifecycle discipline imposed at creation: an owner and expiry on every flag, cleanup as a definition-of-done step, no flag reuse, and a periodic audit that forces the delete. Treat flag removal as part of the experiment, or your program's own history will keep poisoning its future.
Building that lifecycle discipline into a repeatable process is part of why I made GrowthLayer. For more on the operational debt that quietly degrades experimentation programs, subscribe to Lean Experiments.