Tuesday morning: your test shows 94% confidence and a 12% lift. You tell your team it's almost ready to call.
Friday afternoon: 71% confidence. 3% lift. Same experiment.
You didn't change anything. The test is just... different now. Is something wrong with your experiment? With the platform? With your data? Should you be worried?
Usually, no. But sometimes yes. The ability to tell the difference is one of the most underrated skills in experimentation.
The Three Types of Result Changes
Not all result movement is the same. There are three distinct phenomena that look identical in your dashboard but have very different implications:
Type 1: Normal statistical fluctuation. Confidence intervals are wide early in an experiment. Results bouncing around from 60% to 90% confidence in the first few days is expected behavior, not a signal that something is wrong.
Type 2: Novelty effect. Users behave differently when they encounter something new. Your "winning" variant may perform well simply because it's different — and that advantage erodes as users habituate.
Type 3: Actual drift. Something real changed — traffic composition, seasonal behavior, a code deployment, an external event. This is the one that warrants investigation.
Type 1: Normal Statistical Fluctuation
Here's the core insight most practitioners miss: confidence levels are not monotonically increasing. They bounce.
In the early days of an experiment, your sample is small. A few hundred visitors means your confidence intervals are enormous. If your control converts at 3% and your variation converts at 4% in the first 50 visitors per variation, that's a 33% relative lift — but the confidence interval on 50 visitors at a 3% base rate is so wide that 4% and 3% are statistically indistinguishable. Then 200 more visitors come in, the variation dips to 3.5% while the control holds at 3%, and suddenly your "33% lift" looks like a 17% lift.
This isn't your experiment failing. It's statistical inference doing its job: more data narrows the uncertainty. The key is that early-experiment fluctuation is symmetrical — the confidence level should be bouncing up and down, not trending consistently downward.
A worked example: imagine a test with a true underlying lift of 8% (meaning the variation genuinely converts 8% better). In week 1 with 500 visitors per variation, you might see confidence levels ranging from 60% to 88% day by day, driven by random batches of high-converting or low-converting visitors. By week 3 with 3,000 visitors per variation, that range tightens to 89%-96%. The individual days are still noisy, but the trend direction becomes clearer.
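If you want to see this for yourself, a few lines of simulation make the point. The sketch below (plain Python, assuming a 3% baseline, a true 8% relative lift, and 250 visitors per variation per day) tracks the observed lift day by day. It has nothing to do with Optimizely's actual confidence calculation; it just shows how wildly the estimate swings early and how it settles as data accumulates.

```python
# Minimal simulation of how an observed lift bounces around early on.
# Assumes a 3% baseline conversion rate and a true relative lift of 8%;
# this is illustrative noise, not Optimizely's Stats Engine math.
import numpy as np

rng = np.random.default_rng(7)
baseline, true_lift = 0.03, 0.08
daily_visitors = 250  # per variation, assumed for illustration

control_conv = variant_conv = 0
control_n = variant_n = 0
for day in range(1, 22):
    control_conv += rng.binomial(daily_visitors, baseline)
    variant_conv += rng.binomial(daily_visitors, baseline * (1 + true_lift))
    control_n += daily_visitors
    variant_n += daily_visitors
    observed_lift = (variant_conv / variant_n - control_conv / control_n) / (control_conv / control_n)
    print(f"day {day:2d}: {control_n:5d} visitors/arm, observed lift {observed_lift:+.1%}")
```

Run it a few times with different seeds: the first week's estimates routinely swing well past the true 8%, in both directions, before the trend becomes visible.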
**Pro Tip:** Don't look at your experiment daily during the first week. Set a calendar reminder to check at the end of your planned minimum runtime. Early peeking doesn't just lead to bad decisions — it causes anxiety about normal fluctuation that teams misinterpret as problems.
How Stats Engine Handles Sequential Testing
This is where Optimizely's approach differs fundamentally from classical frequentist statistics.
In classical A/B testing (using fixed-horizon tests), looking at your results before the planned end date inflates your false positive rate. Every time you "peek," you're giving yourself additional chances to cross the significance threshold by chance. This is called the optional stopping problem, and it's why traditional statistics courses say "never peek."
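You can demonstrate the problem with a quick simulation. The sketch below runs simulated A/A tests (no true difference) through a classical two-proportion z-test, comparing a single planned look at the end against a peek every day. Every number in it (3% conversion rate, 200 visitors per day, 28 days, 2,000 runs) is an assumption for illustration, not a claim about any platform.

```python
# The optional stopping problem for a fixed-horizon test: simulated A/A
# experiments, checked once at the end vs. checked at every daily peek.
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(42)
rate, per_day, days, runs = 0.03, 200, 28, 2000

def p_value(x1, n1, x2, n2):
    """Two-sided two-proportion z-test p-value."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

final_hits = peeking_hits = 0
for _ in range(runs):
    a = rng.binomial(per_day, rate, size=days).cumsum()
    b = rng.binomial(per_day, rate, size=days).cumsum()
    n = per_day * np.arange(1, days + 1)
    pvals = [p_value(a[i], n[i], b[i], n[i]) for i in range(days)]
    final_hits += pvals[-1] < 0.05      # look only once, at the planned end
    peeking_hits += min(pvals) < 0.05   # declare a winner at any daily peek

print(f"false positives, single look: {final_hits / runs:.1%}")   # ~5%
print(f"false positives, daily peeks: {peeking_hits / runs:.1%}")  # well above 5%
```

With no real difference between the arms, the single planned look stays near the nominal 5% false positive rate, while "ship at the first significant peek" crosses the threshold several times more often. This is the behavior sequential methods are designed to prevent.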
Optimizely's Stats Engine uses a sequential testing framework that mathematically accounts for the fact that you're going to look at results multiple times. The confidence levels you see in Optimizely are valid whenever you look at them — the engine continuously updates its estimates in a way that maintains the intended false positive rate regardless of when you stop.
What this means practically: the daily fluctuations you see in Optimizely are not a sign that peeking is invalidating your results. They're the engine's honest representation of uncertainty at each point in time. The confidence level on Tuesday reflects what the data says on Tuesday. Friday's number reflects Friday's data. Neither is "wrong."
**Pro Tip:** Optimizely's sequential testing validity doesn't mean results are reliable early on — just that the false positive rate is controlled. A 95% confidence result from 200 visitors is statistically valid under sequential testing, but practically unreliable because the sample is too small to reflect a full business cycle or your real traffic mix. Aim for both statistical validity (controlled by Stats Engine) and practical reliability (which requires adequate sample size and runtime).
Type 2: Novelty Effect
This one has ended more than a few careers in experimentation. A practitioner runs a bold redesign, sees 18% lift for the first two weeks, ships it, and watches the lift evaporate over the next 30 days. The post-mortem blame falls on "seasonal changes" or "measurement issues." The real culprit was novelty.
Novelty effect is a well-documented psychological phenomenon: users engage more with new interfaces simply because they're new. The new design gets more clicks, more exploration, more engagement — not because it's better, but because it's different. As users habituate to the new design, their behavior reverts toward baseline.
How long do novelty effects typically last? It varies by product and change magnitude, but research across multiple testing programs suggests:
- Minor UI changes (button color, microcopy): novelty effect, if any, dissipates within 3-5 days
- Moderate changes (new feature placement, redesigned form): 1-2 weeks
- Major redesigns (new navigation, completely reworked flows): 3-6 weeks
How to detect novelty effect in your data: Look at your conversion rate trends over time within the experiment. If your variation started at 12% lift in week 1 and has trended to 4% lift in week 3 while still accumulating more data, that's a novelty effect signature. The lift is decreasing as sample size increases — which is the opposite of what happens with a genuine effect (where lift stabilizes as sample grows).
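If you export weekly counts, the check is a few lines of arithmetic. The numbers in this sketch are invented to show the shrinking-lift signature; swap in your own export.

```python
# Check for the novelty-effect signature: relative lift shrinking week over
# week while the sample keeps growing. The counts below are made-up examples.
weekly_counts = [
    # (week, control_visitors, control_conversions, variant_visitors, variant_conversions)
    (1, 4800, 149, 4750, 167),
    (2, 5100, 158, 5050, 166),
    (3, 4950, 152, 5000, 158),
]

for week, cn, cc, vn, vc in weekly_counts:
    control_rate = cc / cn
    variant_rate = vc / vn
    lift = (variant_rate - control_rate) / control_rate
    print(f"week {week}: control {control_rate:.2%}, variant {variant_rate:.2%}, lift {lift:+.1%}")
# A steadily shrinking lift across weeks is the pattern described above;
# a genuine effect tends to stabilize instead of decaying.
```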
**Pro Tip:** For major redesigns, run your experiment for at least 4 weeks before analyzing results. The first 2 weeks are likely contaminated by novelty. Your real signal is in weeks 3-4 when users who've seen the new design multiple times are making decisions based on the experience itself, not the novelty.
The "returning visitor" analysis: If you can segment by new vs. returning visitors in your experiment, do it. Returning visitors are more habituated to your existing design, so their behavior is less distorted by novelty. If your variation wins strongly for new visitors but shows no lift (or negative lift) for returning visitors, you have a novelty effect. If it wins for both segments, you have a genuine improvement.
A Related Pattern: Regression to the Mean
Related to novelty but mechanically distinct. Regression to the mean explains why that 40% lift in week 1 almost never holds.
When you see an extreme result early in an experiment, the most likely explanation is not that your variation is extraordinarily effective. It's that you got lucky. Extreme early results are most common when your sample is small and noisy. As you collect more data, your estimate regresses toward the true underlying effect.
This is not a statistical flaw — it's statistics working correctly. The problem is when teams treat extreme early results as evidence that's more reliable than later, larger-sample results.
A team testing a new product recommendation algorithm saw 52% lift in revenue per visitor after 3 days. They halted the experiment to "lock in" the win. Two weeks later, after re-running the test properly, the lift was 9%. The 52% was noise. The 9% was the signal.
**Pro Tip:** Treat any result above 30% relative lift with extreme skepticism until you have at least 1,000 visitors per variation. Extreme early lifts are almost always regression-to-the-mean artifacts. Your prior for most CRO changes should be "small positive or no effect" — very large effects are possible but rare enough to warrant extra scrutiny.
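A small simulation shows why that skepticism is warranted. Assuming a 3% baseline and a modest true lift of 5%, the sketch below estimates how often noise alone produces an observed lift above 30% at different sample sizes. The specific numbers are assumptions, but the pattern is general.

```python
# Why huge early lifts usually regress: with a modest true effect, small
# samples frequently produce extreme observed lifts by chance.
import numpy as np

rng = np.random.default_rng(3)
baseline, true_lift, runs = 0.03, 0.05, 20000

def share_of_big_lifts(n_per_arm, threshold=0.30):
    c = rng.binomial(n_per_arm, baseline, size=runs)
    v = rng.binomial(n_per_arm, baseline * (1 + true_lift), size=runs)
    ok = c > 0  # skip runs where the control happened to have zero conversions
    lifts = (v[ok] - c[ok]) / c[ok]
    return np.mean(lifts > threshold)

for n in (150, 500, 1000, 5000):
    print(f"{n:5d} visitors/arm: observed lift > 30% in {share_of_big_lifts(n):.1%} of runs")
```

At a few hundred visitors per arm, ">30% lift" shows up in a meaningful share of runs even though the true effect is 5%; by a few thousand visitors it essentially never does.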
Type 3: Actual Drift
Actual drift is when results change because something in the world genuinely changed. The key is distinguishing it from normal fluctuation.
Signals that suggest actual drift:
- Directional shift, not just variance: Your variation was consistently outperforming for 2 weeks, then consistently underperforming for 2 weeks. That's a structural break, not noise.
- Correlated with an external event: You can point to a date when behavior changed — a promotional email, a competitor launch, a news event, a code deployment.
- Affects both variations similarly: If your control and variation both drop in conversion rate on the same day, it's an external factor, not an experiment problem. If only the variation changes, investigate the variation's implementation.
- Traffic mix change: Check your experiment's traffic sources over time. If week 1 was 70% organic and week 3 is 55% organic (because you ran a paid campaign), your traffic is fundamentally different between periods.
How to investigate: In Optimizely, look at your results segmented by date range. Compare week 1 vs. week 3 head-to-head. If the effect size is dramatically different between time windows, you have temporal drift.
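One drift signal that is easy to script from a raw event export is the traffic-mix check from the list above. The rows, file layout, and column names in this sketch are assumptions; adapt them to whatever your export actually contains.

```python
# Traffic-source mix per week. The records below are made-up visitor-level
# rows; in practice you would load your own export.
import pandas as pd

events = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2024-03-04", "2024-03-05", "2024-03-06", "2024-03-12", "2024-03-13", "2024-03-14"]
        ),
        "traffic_source": ["organic", "organic", "paid", "organic", "paid", "paid"],
    }
)
events["week"] = events["timestamp"].dt.to_period("W")

mix = (
    events.groupby(["week", "traffic_source"]).size()
    .groupby(level="week")
    .transform(lambda s: s / s.sum())   # share of each source within its week
    .unstack("traffic_source")
    .fillna(0.0)
)
print(mix.round(2))
# If organic drops from ~70% of visitors in week 1 to ~55% in week 3, the
# populations you are comparing across weeks are not the same population.
```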
**Pro Tip:** Every time you have a significant code deployment, email campaign, or traffic acquisition change during an experiment, log it in your experiment notes with the date. This is your audit trail for diagnosing drift. Teams that do this consistently can separate genuine drift from fluctuation in minutes rather than hours.
Cyclical patterns: Day-of-week and time-of-day effects are real and often forgotten. A test that ran Monday-Wednesday (mostly business-hours traffic) will have different results than one that continued through the weekend. If your product has strong weekday/weekend behavioral differences, a partial-week sample will look different from a full-week sample — not because anything changed, but because your sample composition changed.
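A trivial guard against partial-week samples is to check the weekday coverage of your analysis window before you read the results. The dates in this sketch are placeholders.

```python
# Check that an analysis window covers whole weeks (every weekday equally often).
from datetime import date, timedelta
from collections import Counter

def weekday_coverage(start: date, end: date) -> Counter:
    """Count how many times each weekday appears in the inclusive range [start, end]."""
    days = (end - start).days + 1
    return Counter((start + timedelta(d)).strftime("%a") for d in range(days))

coverage = weekday_coverage(date(2024, 3, 4), date(2024, 3, 21))  # a Monday to a Thursday
print(coverage)
print("whole weeks only:", len(coverage) == 7 and len(set(coverage.values())) == 1)
```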
The "When to Call It" Decision Framework
The right call isn't just about hitting a confidence threshold. It's about confidence level + time + business cycle + trend direction.
Step 1: Have you run at least one complete business cycle? If no, keep running. A result from 3 days is not reliable regardless of confidence level.
Step 2: What is your confidence level? Above 95%, you can proceed to the next step. Below 85% with the minimum runtime elapsed, consider cutting and reallocating traffic. Between 85% and 95%, keep running and let the trend direction in Step 3 guide the call.
Step 3: What direction is the trend going? Plot your confidence level by day. Is it trending upward (toward significance), stable, or trending downward? A stable 90% confidence after 4 weeks is different from a 90% confidence that was at 95% two weeks ago and has been declining.
Step 4: Is there evidence of novelty effect or drift? Segment your results by week. If week 1 shows 15% lift and week 3 shows 3% lift, you may be seeing novelty effect — and the "true" result may be the week 3 number.
Step 5: What are your guardrail metrics saying? A significant primary metric result with a guardrail that's trending negative warrants caution.
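To make the framework concrete, here is a sketch that encodes the five steps as a checklist function. The 95% and 85% thresholds come straight from the steps above; the data structure and field names are assumptions, not anything Optimizely exposes.

```python
# A checklist helper encoding the five-step "when to call it" framework.
from dataclasses import dataclass

@dataclass
class ExperimentSnapshot:
    full_business_cycles: int      # complete weekly cycles covered so far
    confidence: float              # current confidence level, 0-1
    confidence_trend: str          # "up", "flat", or "down" over recent days
    weekly_lifts: list[float]      # relative lift per week, oldest first
    guardrails_healthy: bool       # no guardrail metric trending negative

def recommend(s: ExperimentSnapshot) -> str:
    if s.full_business_cycles < 1:
        return "keep running: less than one full business cycle"
    if s.confidence < 0.85:
        return "consider stopping and reallocating traffic"
    if s.confidence < 0.95:
        return "keep running: confidence in the 85-95% gray zone; watch the trend"
    if s.confidence_trend == "down":
        return "hold: confidence is declining; check for drift or novelty"
    if len(s.weekly_lifts) >= 2 and s.weekly_lifts[-1] < s.weekly_lifts[0] / 2:
        return "possible novelty effect: weigh the later-week lift, not the early one"
    if not s.guardrails_healthy:
        return "primary metric wins but a guardrail is suffering: escalate before shipping"
    return "call it: significant, stable, full cycle covered, guardrails healthy"

print(recommend(ExperimentSnapshot(3, 0.96, "flat", [0.12, 0.10, 0.09], True)))
```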
What to Do When a Winning Test "Un-Wins"
You had 97% confidence. You called it a win. You started the shipping process. Then confidence dropped to 74%. What do you do?
First: don't panic. A drop from 97% to 74% confidence is consistent with normal fluctuation, especially if sample size is still relatively small. Check whether you've crossed the minimum runtime threshold.
Second: look at the trend. If confidence has been declining consistently over the past week (97% → 90% → 82% → 74%), that's a signal worth taking seriously — especially if you've had an external event that could explain the drift.
Third: segment by time period. Did the variation win in week 1-2 and lose in week 3? That's novelty. Did it win consistently until a specific date and then drop? That's drift.
Decision tree:
- Confidence fluctuating (not consistently declining): Keep running to your planned end date. Don't react to noise.
- Confidence consistently declining with no external event to explain it: Extend runtime and watch for 1 more week before deciding.
- Confidence declining with a clear external event: Stop the experiment. Your data is confounded. Rerun after the event passes.
- Confidence declining and novelty effect detected (week 1 >> week 3 lift): The variation may still be worth shipping for new users. Consider a segmented rollout.
**Pro Tip:** If a "winner" loses significance after you've called it but before you've shipped it, don't rush to call it a loss either. Run a clean replication experiment. Some of the most valuable learnings come from effects that are genuine but noisy — they reveal where you need more traffic, better metrics, or a stronger variant.
Common Mistakes
Stopping when you first see significance, then restarting when confidence drops. This creates selection bias — you're only shipping tests where the early fluctuation happened to be in your favor.
Treating decreasing confidence as evidence the test is "broken." Confidence can decrease legitimately as more data comes in and the true effect becomes clearer. A result that was 96% confident with 200 visitors might correctly settle at 85% with 2,000 visitors if the true effect is smaller than early data suggested.
Not accounting for novelty when testing UI changes. Any time you significantly change the visual design or interaction pattern, budget for at least 3-4 weeks of runtime. Week 1 data for UI tests is highly unreliable.
Conflating experiment and platform issues. When results change dramatically, teams often blame Optimizely before checking their own implementation. Check for code deployments, tag manager changes, and traffic mix shifts first.
Making launch decisions on partial business cycles. A test that ran Monday through Thursday will look different from one that captured a full 7-day week. Always check the day-of-week coverage in your results window.
What to Do Next
- Review your three most recent experiment results for temporal stability. Segment by week. Is the lift consistent across time periods, or was it driven by a specific window?
- Check for novelty effect signals. For any experiment with UI changes, compare week 1 vs. later weeks. Flag any where early lift was more than 2x the later-period lift.
- Audit your guardrail metrics. For experiments you shipped in the last 90 days, check whether post-ship performance matches the test results.
- Build a drift log. Start recording code deployments, campaigns, and traffic changes with dates. This is your future debugging infrastructure.
- Set up the "trend direction" check. Before calling any experiment, plot confidence level by day for the past 2 weeks. Declining trends warrant a second look.
For the statistical mechanics of why experiments take longer than expected, see Why Your Experiment Won't Reach Statistical Significance. For the program-level implications of running many experiments at once, see False Discovery Rate in Optimizely.