Someone Just Changed the Running Test
It happens in every testing program. A designer notices a typo in the variant copy. A developer "just tweaks" the CTA color slightly. A product manager updates the traffic allocation because they're worried about revenue impact. A stakeholder asks to add a new metric halfway through.
None of these feel catastrophic in the moment. All of them are.
After running programs where this has happened — and having to explain to executives why a "winning" test quietly reversed after launch — I've become genuinely zealous about experiment integrity. Here's the full explanation of why changes break your data, what's actually salvageable, and the exact workflow to follow when the damage is already done.
Why Changes Invalidate Your Data: Simpson's Paradox
The statistical mechanism that makes mid-test changes dangerous is called Simpson's Paradox: a well-documented phenomenon in which a trend that appears in subgroups of data disappears or reverses when those subgroups are combined.
Here's a concrete example. Say you're testing a new checkout flow. After week one, your variant is converting at 3.8% versus control's 3.2% — a 19% lift. On day eight, your developer changes the variant's payment form layout.
Now your data pool contains two fundamentally different experiments:
- Pre-change period (days 1-7): Variant A vs. Control, original form layout
- Post-change period (days 8+): Variant A' vs. Control, new form layout
When Optimizely aggregates all data, it's comparing the combined performance of two different variants against a consistent control. If the new form layout happens to perform worse, the aggregate result might show only a 5% lift — or no lift at all. If it performs better, you might see a 30% lift in aggregate.
Neither number means anything. You've created a dataset that conflates two different tests, and there's no way to cleanly separate them after the fact.
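A rough sketch of the arithmetic makes the problem concrete. All counts below are hypothetical, loosely matching the scenario above:

```python
# Hypothetical counts illustrating how pooling two variant versions
# produces an aggregate number that describes neither of them.

def rate(conversions, visitors):
    return conversions / visitors

# Days 1-7: original variant layout vs. control (3.8% vs. 3.2%)
pre_variant, pre_control = (380, 10_000), (320, 10_000)
# Days 8+: modified variant layout vs. the same control;
# suppose the new layout quietly underperforms
post_variant, post_control = (290, 10_000), (325, 10_000)

# What an aggregated dashboard reports: one pooled number per arm
pooled_variant = rate(pre_variant[0] + post_variant[0],
                      pre_variant[1] + post_variant[1])
pooled_control = rate(pre_control[0] + post_control[0],
                      pre_control[1] + post_control[1])

print(f"week 1: variant {rate(*pre_variant):.2%} vs. control {rate(*pre_control):.2%}")
print(f"pooled: variant {pooled_variant:.2%} vs. control {pooled_control:.2%}")
# The 19% week-one lift shrinks to roughly 4% in aggregate -- a number
# that describes neither the original layout nor the modified one.
```

Flip the post-change counts the other way and the pooled lift inflates instead; either way, the aggregate is an average of two different experiments.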
**Pro Tip:** Before launching any experiment, document the exact state of the variant — screenshot it, paste the HTML, note the traffic allocation. This becomes your audit trail if someone later claims they "only made a small change."
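One way to make that audit trail tamper-evident is to record a hash of the variant markup alongside the screenshots. A minimal sketch, with an entirely illustrative schema (these field names are not an Optimizely format):

```python
import datetime
import hashlib
import json

# Hypothetical pre-launch snapshot of a variant's state.
# Paste the actual variant HTML here before launch.
variant_html = "<form id='payment'>...</form>"

snapshot = {
    "experiment": "checkout_flow_v2",  # illustrative name
    "captured_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "traffic_allocation": {"control": 0.5, "variant": 0.5},
    # Hashing the markup lets you later prove whether it changed,
    # even if nobody kept a full copy.
    "variant_html_sha256": hashlib.sha256(variant_html.encode()).hexdigest(),
    "screenshots": ["variant_desktop.png", "variant_mobile.png"],
}
print(json.dumps(snapshot, indent=2))
```

If someone later claims they "only made a small change," re-hash the live variant and compare.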
The 5 Changes That Kill Your Test
Not all changes are equally damaging, but these five will always compromise your data:
1. Editing the variant code or design mid-test. Even "fixing a typo" changes what users in the variant cohort experience. Users who entered the experiment before the fix were exposed to the original variant. Users who entered after see the corrected version. They are not the same population.
2. Changing traffic allocation between variations. If you move from 50/50 to 70/30 mid-test, returning visitors stay in their originally assigned bucket. New visitors get allocated under the new ratios. The result is two unequal populations with different time-on-test distributions. Your conversion rates for each variation are now calculated against fundamentally different audience mixes.
3. Adding or removing a variation. Adding a third variation creates a new bucket that can only contain visitors from the change date forward. Removing a variation doesn't delete its existing data — it just stops routing new users there, creating a pool of stranded partial data.
4. Adding metrics after launch. This one is subtle but critical. Optimizely's false discovery rate calculations account for the number of metrics being tracked. Adding a metric after launch retroactively changes the statistical correction applied to all other metrics. Results you thought were significant may no longer be; results that looked insignificant might cross the threshold. The calculations are corrupted.
5. Changing audience or URL targeting. If you narrow your audience mid-test, existing users in the experiment might fall outside the new targeting rules. You end up with a contaminated pre-change period and a clean post-change period, but the pre-change data doesn't disappear — it's still in your results.
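Change #2 is worth making concrete with back-of-envelope arithmetic. Assuming made-up traffic of 1,000 visitors a day and an allocation change on day 8:

```python
# Hypothetical: allocation changed on day 8 from 50/50 to 70/30
# (variant/control). Sticky bucketing means only NEW visitors are
# assigned under the new ratio; returning visitors keep their bucket.
pre_days, post_days, daily = 7, 7, 1_000

variant_pre, control_pre = pre_days * daily * 0.5, pre_days * daily * 0.5
variant_post, control_post = post_days * daily * 0.7, post_days * daily * 0.3

variant_total = variant_pre + variant_post   # 8,400
control_total = control_pre + control_post   # 5,600

# Share of each bucket that entered AFTER the change:
variant_late = variant_post / variant_total  # about 58%
control_late = control_post / control_total  # about 38%

print(f"variant: {variant_late:.0%} of its visitors entered post-change")
print(f"control: {control_late:.0%} of its visitors entered post-change")
# The two arms now have different time-on-test mixes, so their
# conversion rates are no longer an apples-to-apples comparison.
```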
**Pro Tip:** Run a pre-launch checklist with your team before every experiment goes live: correct URLs, correct audiences, correct metrics, correct traffic allocation, correct variation code. Signature from the team lead. This 10-minute process prevents 90% of mid-test emergency changes.
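If your team prefers a gate script over a shared document, the checklist can be expressed as a few lines of code. The items and sign-off structure below are illustrative, not a standard:

```python
# A minimal sketch of the pre-launch checklist as a launch gate.
CHECKLIST = [
    "URLs match the test plan",
    "Audience conditions match the test plan",
    "Primary and secondary metrics are final",
    "Traffic allocation matches the test plan",
    "Variation code matches the approved design",
]

def ready_to_launch(signed_off: dict[str, bool], team_lead_signature: str) -> bool:
    """Return True only if every item is checked and a team lead signed off."""
    missing = [item for item in CHECKLIST if not signed_off.get(item)]
    if missing:
        print("Blocked. Unchecked items:", *missing, sep="\n  - ")
        return False
    if not team_lead_signature.strip():
        print("Blocked. Needs a team lead signature.")
        return False
    return True

ready_to_launch({item: True for item in CHECKLIST}, "A. Lead")  # returns True
```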
Traffic Allocation: Monotonic vs. Non-Monotonic Changes
Optimizely distinguishes between two types of mid-test allocation changes:
Monotonic changes (consistently moving in one direction, e.g., increasing allocation from 20% to 40% to 60%) are technically less damaging because every new visitor enters under a ratio that has only moved one way, so the cohort mix drifts consistently instead of whipsawing. However, they still extend your time-to-significance (you now need to account for the changing effective sample sizes across time periods), and Optimizely still flags them as problematic.
Non-monotonic changes (decreasing then increasing, or vice versa — e.g., 50% → 20% → 80%) are the most dangerous. Reducing traffic allocation and then raising it again creates multiple cohorts with different entry points and different exposure durations. The statistical contamination is severe and, practically speaking, unrecoverable.
The safe rule: never change allocation while a test is running. If your risk tolerance changes, use the pause-duplicate workflow below.
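The distinction can be expressed as a small guard you could run over an experiment's allocation history. This is a sketch; the list-of-floats history format is an assumption, not something Optimizely exports in this shape:

```python
def allocation_history_risk(allocations: list[float]) -> str:
    """Classify a sequence of traffic allocations for one variation."""
    if len(set(allocations)) <= 1:
        return "clean: allocation never changed"
    steps = list(zip(allocations, allocations[1:]))
    monotonic = all(a <= b for a, b in steps) or all(a >= b for a, b in steps)
    if monotonic:
        return "risky: monotonic change -- extends time-to-significance"
    return "severe: non-monotonic change -- treat results as unrecoverable"

print(allocation_history_risk([0.5, 0.5, 0.5]))  # clean
print(allocation_history_risk([0.2, 0.4, 0.6]))  # risky
print(allocation_history_risk([0.5, 0.2, 0.8]))  # severe
```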
What's Actually Salvageable
If someone already made a change to your running experiment, you have limited but non-zero options:
If the change was made less than 24 hours ago and traffic is very low: The contaminated cohort is small. Stop the experiment, note the exact timestamp of the change, and restart from zero. Discarding a day of low-traffic data costs you little, and the fresh run starts clean.
If you can identify the exact timestamp of the change: Optimizely's data export (via the Results API or CSV export) includes timestamps. You can analyze the pre-change and post-change periods as two separate experiments — but only if each period independently has sufficient sample size to reach statistical significance. If neither period alone has enough data, the analysis is inconclusive.
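Once you've split the export at the change timestamp, each period can be checked with a classical two-proportion z-test. Note the hedge: Optimizely's Stats Engine uses sequential statistics, so this fixed-horizon test is only a rough offline sanity check and will not match dashboard numbers exactly. The counts below are hypothetical:

```python
import math

def two_proportion_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Classical two-proportion z-test (fixed-horizon, two-sided).

    A rough offline check only -- Optimizely's Stats Engine uses
    sequential statistics and will report different numbers.
    """
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    # Two-sided p-value from the normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Analyze each period separately (control first, variant second):
pre = two_proportion_test(320, 10_000, 380, 10_000)    # days 1-7
post = two_proportion_test(325, 10_000, 290, 10_000)   # days 8+
print(f"pre-change:  z={pre[0]:.2f}, p={pre[1]:.3f}")
print(f"post-change: z={post[0]:.2f}, p={post[1]:.3f}")
```

With these made-up counts the pre-change period clears p < 0.05 on its own while the post-change period does not, which is exactly the "each period must stand alone" requirement in practice.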
If the change was minor and non-functional (e.g., a backend infrastructure change your users couldn't see): Document it. Keep running. Note it in your test record as a potential confound. Whether it's actually salvageable depends on whether the change could plausibly affect user behavior.
If the change was substantial: Stop the test, discard the data, restart cleanly. It hurts, but the alternative — shipping a decision based on contaminated data — costs more.
**Pro Tip:** Create an "experiment change log" — a shared document (Notion, Confluence, wherever your team lives) that records any change to any running experiment. Even if a change doesn't invalidate the test, having a timestamp record protects you during results analysis.
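If a shared document feels too easy to skip, the change log can also live as an append-only JSON-lines file. A minimal sketch; the schema and file name are illustrative:

```python
import datetime
import json

def log_change(logfile: str, experiment_id: str, change: str, author: str) -> None:
    """Append one timestamped entry to a JSON-lines change log.

    The schema is illustrative -- adapt it to whatever your team records.
    """
    entry = {
        "logged_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "experiment_id": experiment_id,
        "change": change,
        "author": author,
    }
    with open(logfile, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

log_change("experiment_changes.jsonl", "exp_123",
           "Traffic allocation changed 50/50 -> 70/30", "pm@example.com")
```

Append-only JSON lines are deliberately boring: nothing gets overwritten, and every entry carries its own timestamp for later results analysis.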
The Pause-Duplicate Rescue Workflow
If you need to make a change to a running experiment — for any legitimate reason — this is the only safe path:
Step 1: Pause the current experiment. In Optimizely Web Experimentation, click the experiment, then click Pause. This stops new visitors from being bucketed. Existing bucketed users will continue to see their assigned variation (their cookie persists), but no new users enter.
Step 2: Record all current data. Export or screenshot your current results, including visitor counts, conversions, and the current statistical significance for every metric. This is your record of what the original experiment showed before the change.
Step 3: Duplicate the experiment. Use Optimizely's duplicate function (available in the experiment list view). This creates a new experiment with the same setup — same metrics, same audiences, same URL targeting.
Step 4: Make your change in the duplicate. Apply whatever modification was needed. Now you have a clean version with the correct setup.
Step 5: Publish the new experiment. The new experiment starts with zero data. Any previously bucketed users who return will be re-bucketed under the new experiment rules (their old cookie is for the old, paused experiment).
Step 6: Document the restart. Note in your experiment records: "Original experiment [ID] paused on [date] after [X] visitors, [Y]% significance. Restarted as experiment [ID2] due to [reason]."
This workflow costs you the accumulated data from the first run. But it gives you a clean dataset going forward — one you can actually trust when you're deciding whether to ship.
**Pro Tip:** If your original experiment had run long enough to reach significance before the pause, analyze those pre-pause results separately. They're valid data for the original variant configuration. Just be clear when reporting: "Test 1 (original variant): inconclusive/significant at X%. Test 2 (revised variant): [result]."
Common Mistakes
Making "emergency" changes to running tests. There are almost no true emergencies that require changing a running experiment. A conversion-blocking bug? Stop the test and fix it in production. A legal compliance change? Stop the test, apply it to both control and variant, restart. The emergency is never served by contaminating data.
Thinking "I'll just note it in the comments." Notes don't fix the statistical contamination. They just provide documentation of the damage. The data is still invalid.
Assuming Optimizely will automatically correct for mid-test changes. It doesn't, and can't. Stats Engine corrects for peeking. It does not correct for variant modifications or allocation changes.
Adding secondary metrics after launch "just to see." Resist this. Agree on your primary and secondary metrics before launch. Adding them later corrupts the false discovery rate calculation, even if you never intended to make a shipping decision based on the post-hoc metric.
Changing URL targeting to fix an implementation mistake. If you targeted the wrong URLs, your test is already compromised. Stop it, fix the targeting, restart. Adding the correct URLs after the fact doesn't clean up the data from the incorrectly targeted period.
What to Do Next
- Add a "no changes" rule to your experiment launch checklist. Make it explicit: once an experiment is live, the only permitted action is pausing it. All changes require the pause-duplicate workflow.
- Audit your last five experiments. Were any changes made mid-test? If so, flag those results as potentially unreliable in your test log.
- Read our guide on How Long Should You Run an A/B Test? — premature changes often happen because teams feel pressure to speed up slow-moving tests. The answer is always better pre-test planning, not mid-test interference.
- Set up Optimizely experiment change notifications (available via the audit log in Optimizely's settings) and route them to a Slack channel your team monitors.