The Question Everyone Gets Wrong

The most common question I get from new experimenters isn't "what should I test?" It's "how long should I run this?" And the most common mistake isn't stopping too late — it's stopping the moment the dashboard shows a green uplift number, usually around day three.

After running 100+ tests, I can tell you: that impulse to stop early has probably killed more good testing programs than any other single behavior. It inflates win rates, creates false confidence, and sends teams shipping changes that quietly drag down revenue for months before anyone notices the regression.

Here's the full framework.

The Two Rules That Actually Matter

Before you touch a sample size calculator, internalize these two non-negotiables:

Rule 1: Seven days minimum, no exceptions. Even if your test hits statistical significance on day two, you run it for at least a full week. Why? Because user behavior is not uniform across days. B2B sites see dramatically different behavior Monday vs. Friday. E-commerce sites see weekend spikes. Media sites have Tuesday content drops. If you stop after two days — both weekdays — you've systematically excluded the weekend cohort. Your "winner" is actually just a weekday winner.

Rule 2: One full business cycle. For most sites that's seven days. For B2B SaaS with monthly billing cycles, one business cycle might be a full month. For a retailer running a weekend flash sale, it might literally be the sale period. Define your business cycle before you start, not after you see the results.

**Pro Tip:** Set your end date in your calendar before you launch the test. Write it down: "This test ends on [date] regardless of what the dashboard says before then." This one habit eliminates 80% of premature stopping.

The Sample Size Calculator: A Worked Example

Optimizely's built-in sample size calculator (found in the experiment setup flow) takes four inputs. Let's walk through a real scenario.

Scenario: You're testing a new product page layout. Your current add-to-cart rate is 4.2%. You want to detect a 15% relative improvement (meaning you're looking for the variant to hit 4.83% or higher). Your product page gets 3,500 visitors per week. You're running one variant against control.

Plug into the calculator:

  • Baseline conversion rate: 4.2%
  • Minimum Detectable Effect (MDE): 15% relative (= 0.63 percentage points)
  • Statistical significance: 90% (Optimizely's default)
  • Traffic per week: 3,500
  • Number of variations: 2 (control + one variant)

The calculator returns approximately 18,000 visitors per variation. At 3,500 visitors/week split 50/50, each variation gets ~1,750 visitors/week. That means you need roughly 10 weeks to reach the required sample size.
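If you want to sanity-check the calculator offline, the classical fixed-horizon formula for a two-proportion test lands in the same ballpark. This is a textbook approximation, not Optimizely's sequential Stats Engine math (the function name and the 90% power assumption are mine, since the calculator doesn't expose its power setting):

```python
from statistics import NormalDist

def sample_size_per_arm(baseline, relative_mde, significance=0.90, power=0.90):
    """Classical fixed-horizon sample size for comparing two proportions.
    A ballpark estimate only -- Optimizely's sequential Stats Engine
    uses different math and may require a different number."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    alpha = 1 - significance
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p2 - p1) ** 2

n = sample_size_per_arm(0.042, 0.15)
print(round(n))  # roughly 18,600 -- same ballpark as the calculator's ~18,000
```

The point isn't to replace the calculator; it's to catch fat-finger errors (a misplaced decimal in the baseline rate changes the answer by orders of magnitude) before you commit to a ten-week run.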

**Pro Tip:** If the calculator spits out a number that requires 6+ months of traffic, your MDE is set too small. Either test a bigger change (one that produces a larger effect), or accept that this particular page isn't worth testing right now. Running an underpowered test isn't neutral — it wastes your budget and gives you noise you might act on.

What "15% relative lift" actually means: If you're at 4.2%, a 15% relative lift means reaching 4.83% (4.2 × 1.15). A 15% absolute lift would mean reaching 19.2%, which is almost never what anyone means. Always clarify relative vs. absolute with your stakeholders — this confusion causes real arguments.
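In code, the distinction is a single operator, which is exactly why it slips past people in conversation:

```python
baseline = 0.042  # current add-to-cart rate

relative_lift = baseline * (1 + 0.15)  # 15% relative -> 0.0483 (4.83%)
absolute_lift = baseline + 0.15        # 15% absolute -> 0.192 (19.2%)

print(f"relative: {relative_lift:.4f}, absolute: {absolute_lift:.3f}")
```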

The Peeking Problem, Explained Plainly

"Peeking" is checking your results before the planned end date and stopping the moment you see significance. It's the statistical equivalent of flipping a coin three times, getting two heads, and declaring the coin biased.

Here's what actually happens mathematically: at 90% significance, you accept a 10% false positive rate. But that 10% assumes you check once, at the planned end. If you check your results 10 times during the run (daily), your actual false positive rate balloons to roughly 40%. You're four times as likely to declare a winner that isn't one.
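You can watch the inflation happen with a toy A/A simulation. This is illustrative pure Python, not Optimizely's engine: both arms share the same true conversion rate, we peek daily with a naive fixed-horizon z-test, and we stop at the first "significant" reading (all parameter values are my assumptions for the demo):

```python
import random
from statistics import NormalDist

def peeked_false_positive_rate(n_sims=1000, daily_n=250, days=10,
                               p=0.042, significance=0.90, seed=1):
    """A/A simulation: both arms have the same true rate, so any
    'significant' result is by definition a false positive. We peek
    once per day and stop the moment the z-test crosses the threshold."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - (1 - significance) / 2)  # two-sided
    false_positives = 0
    for _ in range(n_sims):
        a_conv = b_conv = a_n = b_n = 0
        for _ in range(days):
            a_n += daily_n
            b_n += daily_n
            a_conv += sum(rng.random() < p for _ in range(daily_n))
            b_conv += sum(rng.random() < p for _ in range(daily_n))
            pooled = (a_conv + b_conv) / (a_n + b_n)
            se = (pooled * (1 - pooled) * (1 / a_n + 1 / b_n)) ** 0.5
            if se > 0 and abs(a_conv / a_n - b_conv / b_n) / se > z_crit:
                false_positives += 1
                break  # the "stop at the first green checkmark" behavior
    return false_positives / n_sims

print(peeked_false_positive_rate())  # well above the nominal 10%
```

Run it and you'll see the daily-peek false positive rate sit far above the 10% you thought you bought with 90% significance. The exact number depends on traffic and peek count, but the direction never changes.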

Optimizely's Stats Engine uses sequential testing, which continuously corrects for this — it's why Optimizely can show you results while the test is running without the false positive rate exploding. But even with Stats Engine, the trend stabilizes over time. Early significance readings are noisy. The confidence interval is wide. Wait.

**Pro Tip:** If you're using Optimizely's Stats Engine, you can safely check results at any time — the engine adjusts the significance thresholds dynamically. But "safely check" doesn't mean "safely stop." The green checkmark early in a test is directional, not decisive.

The Novelty Effect and Why Day-One Numbers Lie

When you launch a new variant, returning visitors see something different. Some percentage of them will click on it simply because it's new — not because it's better. This is the novelty effect, and it typically inflates your variant's performance in the first 24-72 hours.

Conversely, some variants perform better over time as users learn where things are — the learning curve effect. A simplified navigation might look worse on day one (users are confused by the change) but outperform the control by week two.

Both of these dynamics mean that early results are systematically distorted. The only way to see through them is time.

**Pro Tip:** Segment your results by new vs. returning visitors. If your variant is only winning among new visitors and losing or flat among returning visitors, you're likely seeing novelty effect. Don't ship it yet.

When Is It Actually OK to Stop Early?

Three legitimate scenarios:

1. Clear harm. If your variant is causing a statistically significant drop in revenue or completions — not just conversion rate, but actual business metrics — stop immediately. Don't wait for a scheduled end date when you're actively losing money. Set automated alerts in Optimizely for negative movements exceeding your risk threshold.

2. Technical breakage. If you discover your variant has a JavaScript error on a major browser, your tracking pixel is misfiring, or the variant renders broken on mobile, stop the test. Document what happened, fix it, restart from zero.

3. External contamination. Major site changes, a PR crisis, a viral social post that floods your site with an anomalous audience — if a significant external event contaminates the traffic, the data is compromised. Stop, document the event, restart after the anomaly passes.

Curiosity, impatience, or "it looks like it's winning" are not on this list.

Practical Duration Formula

Here's the quick mental math I use to estimate test duration before running the calculator:

Required weeks = (Visitors per variation needed) / (Weekly traffic × 0.5)

Where 0.5 assumes a 50/50 split. If you're using an uneven split (e.g., 80/20 for risk management), replace 0.5 with the smaller allocation percentage.

For our example: 18,000 / (3,500 × 0.5) = 18,000 / 1,750 = ~10 weeks.
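The mental math translates directly into a few lines of Python (the function name is mine):

```python
def required_weeks(visitors_per_variation, weekly_traffic, allocation=0.5):
    """Weeks for the smaller arm to reach its required sample size.
    allocation = share of traffic that arm receives: 0.5 for a 50/50
    split, 0.2 for the variant side of an 80/20 split."""
    return visitors_per_variation / (weekly_traffic * allocation)

print(required_weeks(18000, 3500))       # ~10.3 weeks at 50/50
print(required_weeks(18000, 3500, 0.2))  # ~25.7 weeks on an 80/20 split
```

Note how the 80/20 "safe" split more than doubles the duration; risk-managed allocations trade speed for safety, and that trade should be explicit before launch.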

If that number is longer than your roadmap cycle, you have a traffic problem, not a testing problem. The solution is higher-impact changes that produce larger effects (and therefore need smaller sample sizes), not shorter test durations.

Common Mistakes

Confusing "statistical significance" with "you can stop now." Significance tells you how unlikely your observed difference would be if there were no real effect. It says nothing about whether you've run long enough to see through novelty effects or business cycle variation.

Setting MDE to match what you hope to see, not what you need to see. If your test can't justify its implementation cost unless it produces a 20% lift, set MDE to 20% and build a test big enough to detect it — or accept it's not testable.

Running a test for one week and calling it "one business cycle." A week only counts as a full cycle if your customers' behavior actually repeats on a weekly cadence. Monthly subscription products need monthly cycles. Seasonal businesses need to account for seasonal windows.

Using the same test duration for every test. Low-traffic pages need longer runs. High-traffic pages can run shorter. Duration follows from sample size math, not from a standard "two-week policy."

Forgetting to account for seasonality when setting end dates. Running a test that straddles a major holiday, a product launch, or a marketing campaign creates a confounded dataset that's impossible to interpret.

What to Do Next

  1. Open your current or next Optimizely experiment. Before you set it live, run the sample size calculator using your actual baseline conversion rate and a realistic MDE. Set a calendar reminder for the calculated end date.
  2. Check your last three test results. How long did those tests actually run? Did any of them stop before reaching the calculated sample size? If yes, treat those results as directional hypotheses, not shipping decisions.
  3. Read our guide on A/B vs Multivariate vs Multi-Page Testing: Which Should You Run? to understand how test type affects duration requirements.
  4. Set up a statistical significance alert in Optimizely — but set a corresponding calendar alert for "minimum end date." Treat both as required before stopping.
Written by Atticus Li

Revenue & experimentation leader — behavioral economics, CRO, and AI. CXL & Mindworx certified. $30M+ in verified impact.