The Question Everyone Asks, and the Answer Nobody Wants

"How long should I run this test?" If you've been doing CRO for more than a week, you've been asked this question. And if you've been doing it for more than a month, you've probably given a bad answer at least once.

The honest answer is: it depends on three things. Your baseline conversion rate, how small a lift you actually care about detecting, and how much traffic hits the page. Get those three numbers, and the duration follows mathematically. Skip the math, and you're guessing — which means you're either calling tests too early (inflating false positives) or running them too long (wasting resources on an already-decided question).

This article gives you the actual framework. No hand-waving, no "run it at least two weeks" non-answers.

The Three Factors That Determine Test Duration

Factor 1: Your Baseline Conversion Rate

The lower your baseline CVR, the more data you need to detect a meaningful change. If you're converting at 8%, a 10% relative lift gets you to 8.8% — a 0.8 percentage point gap that's relatively easy to detect. If you're converting at 1.2%, that same 10% relative lift is only 0.12 percentage points. You need far more observations to distinguish that from random noise.

This is the factor most teams underweight. They pull up a sample size calculator, plug in their traffic numbers, and never verify that the baseline is accurate. If your GA4 conversion rate and your Optimizely baseline diverge by more than 10-15%, figure out why before you run anything.

**Pro Tip:** Always verify your baseline CVR using at least 4 weeks of historical data, not the last 7 days. Seasonality, a recent campaign, or a one-week anomaly can throw your entire duration estimate off by 2-3x.

Factor 2: Minimum Detectable Effect (MDE)

MDE is the smallest lift you actually care about detecting. This is where business judgment enters the equation. If you're running a checkout flow test and your revenue baseline is $2M/month, a 1% relative lift is $20K/month — worth detecting. If you're running a test on a low-traffic support page, a 1% lift may not be worth the development cost of the winning variant.

The smaller your MDE, the larger the required sample size. This relationship is quadratic: halving your MDE roughly quadruples the sample you need. A lot of teams set MDE too small ("we want to detect any improvement") and end up running tests for months before reaching significance.

A practical default: set MDE to 10% relative if you have modest traffic, 5% relative only if you have very high traffic and the business impact justifies a long test.

Factor 3: Traffic Volume

Specifically, unique visitors per week to the page being tested — not site-wide sessions, not pageviews, not daily active users. The metric you're optimizing (conversions, clicks, sign-ups) needs to be tied to the specific funnel step you're testing.

The Sample Size Formula (No Statistics Degree Required)

For a two-tailed test at 95% confidence and 80% statistical power, the sample size per variation is approximately:

n = 16 × p × (1 - p) / (MDE_absolute)^2

Where:

  • p = baseline conversion rate (as a decimal, e.g. 0.032 for 3.2%)
  • MDE_absolute = the absolute change you want to detect (e.g. 0.0032 for a 10% relative lift on a 3.2% baseline)

Worked Example

  • Baseline CVR: 3.2% (p = 0.032)
  • Desired MDE: 10% relative lift = 0.32 percentage points absolute (0.0032)
  • Traffic: 5,000 visitors/week to the test page

Sample size per variation:

n = 16 × 0.032 × (1 - 0.032) / (0.0032)^2
n = 16 × 0.032 × 0.968 / 0.00001024
n = 0.495616 / 0.00001024
n ≈ 48,400 visitors per variation

Total sample needed: 96,800 visitors across control and treatment. At 5,000 visitors/week: approximately 19-20 weeks.
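The arithmetic above is easy to script. Here is a minimal Python sketch (the function names are mine, not from any testing library) that reproduces the worked example and the quadratic MDE relationship:

```python
import math

# Rule-of-thumb sample size per variation for a two-tailed test at
# 95% confidence and 80% power: n = 16 * p * (1 - p) / mde_abs^2.
def required_sample_per_variation(baseline, relative_mde):
    mde_abs = baseline * relative_mde  # convert relative lift to absolute
    return 16 * baseline * (1 - baseline) / mde_abs ** 2

def weeks_to_run(baseline, relative_mde, weekly_visitors, variations=2):
    total = variations * required_sample_per_variation(baseline, relative_mde)
    return math.ceil(total / weekly_visitors)

n = required_sample_per_variation(0.032, 0.10)
print(round(n))                           # 48400 per variation
print(weeks_to_run(0.032, 0.10, 5_000))   # 20 weeks at 5,000 visitors/week

# Halving the MDE roughly quadruples the sample (the quadratic relationship):
print(required_sample_per_variation(0.032, 0.05) / n)  # ≈ 4.0
```

Swap in your own baseline, MDE, and weekly traffic before trusting any duration estimate.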

That's a long time. If that feels too long, you have two levers: increase your MDE (accept that only larger lifts will be detectable) or increase traffic to the page. What you cannot do is stop the test at week 3 because the numbers look good.

**Pro Tip:** Run a sample size calculator before building any test (Optimizely publishes a free one, or use the formula above). If the required duration exceeds 8 weeks, it's often worth reconsidering whether this specific test is worth running at all, or whether you can increase traffic to the page first.

The Minimum Duration Floor: One Full Business Cycle

Here's a rule that overrides the math: never run a test for fewer than 7 full days, regardless of what your sample size calculator says. If you're hitting your required sample by Tuesday of week one because you have high traffic, don't call it yet.

The reason is day-of-week effects. E-commerce sites convert differently on weekends than weekdays. B2B sites have a completely different visitor profile on Monday morning vs. Friday afternoon. SaaS trials convert at different rates at the start vs. the end of a pay period. Any test that doesn't span a full 7-day cycle will be contaminated by this temporal variance.

For high-stakes tests — pricing changes, checkout flow revamps, major homepage redesigns — I'd argue the floor should be two full business cycles (14 days), even if you hit sample size in three days. The additional data costs you almost nothing and dramatically reduces the chance you're looking at a temporal anomaly.
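These floors are mechanical enough to encode. A sketch (my own helper, assuming roughly steady week-to-week traffic) that turns a required sample into a planned duration:

```python
import math

# Convert a required total sample into a planned duration in days, applying
# the floors described above: at least one full 7-day cycle, two for
# high-stakes tests, and whole weeks only so you end on the start weekday.
def planned_duration_days(total_sample, weekly_visitors, high_stakes=False):
    days_for_sample = math.ceil(total_sample / weekly_visitors * 7)
    floor_days = 14 if high_stakes else 7
    days = max(days_for_sample, floor_days)
    return math.ceil(days / 7) * 7  # round up to whole weeks

print(planned_duration_days(96_800, 5_000))        # the worked example: 140 days
print(planned_duration_days(9_000, 40_000, True))  # sample by day 2, floor wins: 14
```

Note the second call: high traffic hits the required sample almost immediately, but the two-cycle floor still holds the test open for 14 days.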

**Pro Tip:** Always start tests on the same day of the week and end them on the same day. Starting on a Monday and ending on a Friday gives you a biased sample that overweights weekday behavior.

The Novelty Effect: Why Early Results Almost Always Lie

When you launch a significant visual or UX change, returning visitors notice it. They engage with it differently — sometimes positively (curiosity drives exploration), sometimes negatively (disruption of learned patterns). Either way, their behavior in weeks 1-2 is not representative of steady-state behavior.

This is the novelty effect, and it's particularly dangerous because it almost always looks like a win. New button is big and shiny? Engagement spikes. New layout is dramatically different? Visitors explore. By week 3, behavior normalizes and your "20% lift" has eroded to 4% — or disappeared entirely.

Empirically, novelty effects tend to dissipate within 2-3 weeks for major layout or visual changes. For copy-only changes (headlines, CTA text, form labels), novelty effects are minimal. For structural changes (navigation redesigns, checkout flow restructuring, pricing page overhauls), plan to run the test for at least 3 weeks before drawing conclusions.

**Pro Tip:** Segment your test results by new vs. returning visitors. If your returning visitor segment shows a different direction of lift than your new visitor segment, you're likely seeing novelty effect contamination. The new visitor result is closer to your long-term truth.

The Peeking Problem: Why Stopping at Week 1 Significance Destroys Your Numbers

Imagine flipping a coin 100 times. After 30 flips you have 18 heads and 12 tails. Looks like a biased coin, right? By flip 100, you're at 51 heads and 49 tails — perfectly fair. The early signal was noise.

This is exactly what happens when you peek at A/B test results before you've reached your required sample size. The p-value is not stable at low sample counts. If you check daily and stop at the first "p < 0.05" reading, you are not running a 95% confidence test. You're running something closer to a 70-75% confidence test — and calling it 95%.

The math: if you peek at results at 5 equally-spaced checkpoints during a test and stop at the first significant result, your effective false positive rate climbs from 5% to roughly 14%, and it keeps climbing the more often you look: check daily over a multi-week test and it can exceed 25%. You'll ship losing variants as winners as often as 1 in 4 times, not 1 in 20.

This is why "we run everything for two weeks and check on Friday" is a better process than "we check the dashboard daily and ship when it looks good." Discipline around peeking is one of the highest-leverage process improvements a CRO team can make.
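You can watch this inflation happen in a simulation. The sketch below runs A/A tests (no true difference between arms, so every "significant" result is a false positive) with five interim z-test looks; the exact rate depends on checkpoint spacing and sample size, but it lands well above the nominal 5%:

```python
import math
import random

# A/A simulation of the peeking problem: both arms convert at the same
# true rate, and we stop at the first interim look where the two-proportion
# z-statistic exceeds 1.96 (nominal p < 0.05).
def peeked_false_positive_rate(sims=1000, n_per_arm=1000, looks=5,
                               true_rate=0.05, seed=7):
    random.seed(seed)
    checkpoints = [n_per_arm * (i + 1) // looks for i in range(looks)]
    false_positives = 0
    for _ in range(sims):
        a = b = drawn = 0
        for cp in checkpoints:
            for _ in range(cp - drawn):       # accrue visitors to this look
                a += random.random() < true_rate
                b += random.random() < true_rate
            drawn = cp
            pooled = (a + b) / (2 * cp)
            se = math.sqrt(2 * pooled * (1 - pooled) / cp)
            if se > 0 and abs(a / cp - b / cp) / se > 1.96:
                false_positives += 1          # "winner" declared on noise
                break
    return false_positives / sims

print(peeked_false_positive_rate())           # well above the nominal 0.05
print(peeked_false_positive_rate(looks=1))    # single planned look: ~0.05
```

With `looks=1` the simulation behaves like a properly pre-registered fixed-horizon test; the gap between the two printed rates is the cost of undisciplined peeking.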

How Sequential Testing Changes the Rules

Classical (fixed-horizon) testing requires you to decide your sample size upfront and not peek. Sequential testing — which is what Optimizely's Stats Engine uses — is designed specifically to allow valid interim looks at data without inflating your false positive rate.

Optimizely's Stats Engine uses a sequential testing methodology (based on the always-valid p-value framework) that continuously adjusts confidence thresholds as data accumulates. This means you can check results daily without the peeking problem — the system's confidence threshold is higher when sample size is small and relaxes as you accumulate more data.

This does not mean you can call tests after two days. It means you can monitor results in real time without invalidating your statistical inference. You still need to reach the required sample size before trusting a result. The difference is that if you hit significance at 60% of your target sample, the Stats Engine's confidence bounds already account for the fact that you're peeking early. Classical stats does not.

For teams that need to make business decisions faster than a fixed-horizon test allows, sequential testing is the right tool. The trade-off is that sequential tests generally require slightly larger total sample sizes to achieve the same power as a perfectly-executed fixed-horizon test.

**Pro Tip:** If you're using a tool without a sequential testing engine (Google Optimize's replacement, custom-built tools, etc.), commit to the Bonferroni correction for interim analyses: divide your alpha by the number of planned looks. If you plan to peek 4 times, use p < 0.0125 per look to maintain an effective p < 0.05 overall.
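The correction itself is one line. The sketch below also checks, under an independence assumption that Bonferroni itself does not even require, that the family-wise error rate stays under the overall alpha:

```python
# Bonferroni correction for planned interim looks: split the overall alpha
# evenly across looks; each look must clear the stricter per-look threshold.
def per_look_alpha(overall_alpha=0.05, planned_looks=4):
    return overall_alpha / planned_looks

# Chance of at least one false positive across independent looks
# (illustrative; Bonferroni's guarantee holds even without independence).
def familywise_rate(per_look, looks):
    return 1 - (1 - per_look) ** looks

alpha = per_look_alpha(0.05, 4)
print(alpha)                      # 0.0125, matching the four-look example
print(familywise_rate(alpha, 4))  # ≈ 0.049, safely under 0.05
```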

When It's OK to Stop Early

There are legitimate business reasons to stop a test before reaching your planned sample:

Stop on clear harm. If a variant is causing a measurable, statistically significant drop in a critical metric (checkout starts, revenue per visitor, account creations), stop it. The business cost of waiting for full sample exceeds the statistical cost of an early call in the negative direction.

Ship on overwhelming signal. If you're at 90% of your required sample and significance has been stable at 99%+ for a week, the additional 10% of sample is unlikely to reverse the finding. This is a judgment call, but a reasonable one.

Kill tests that are clearly going nowhere. If you're at 150% of required sample and your lift estimate is 0.2% (when you needed to detect 10%), there's nothing here. Stop, learn what you can from the data, and move on.

What is never OK: stopping a test at 40% of required sample because the dashboard shows p = 0.049 and someone in leadership is impatient. This is how CRO programs lose credibility.

Decision Table: Test Type vs. Traffic Level

| Test Type | Low Traffic (<2K/wk) | Medium Traffic (2K-20K/wk) | High Traffic (>20K/wk) |
|---|---|---|---|
| Copy change (headline, CTA) | 6-8 weeks | 2-4 weeks | 1-2 weeks |
| UI element change (button, form) | 8-12 weeks | 3-6 weeks | 2-3 weeks |
| Layout/structural change | 10-16 weeks | 4-8 weeks | 2-4 weeks |
| Pricing/offer change | 12+ weeks or don't test | 6-10 weeks | 3-6 weeks |

These assume a 10% relative MDE and 80% power. Halve the MDE, and the required duration roughly quadruples.

Common Mistakes

Mistake 1: Using site-wide traffic instead of page-specific traffic. Your site gets 50,000 visitors/week. Your checkout page gets 3,000. Your test is on the checkout page. Use 3,000.

Mistake 2: Counting visitors who didn't see the test. If your variant is only shown to users who reach step 2 of a funnel, only count visitors who reached step 2. Including upstream traffic inflates your sample count and understates the time you need.

Mistake 3: Running tests through major seasonal events. A test that spans Black Friday, a product launch, or a major external event is not a clean test. Either complete the test before the event or restart after it's over.

Mistake 4: Changing the variant mid-test. Once a test is live, the variant is locked. If you fix a bug in the variant after launch, restart the test. The pre-fix and post-fix data are not comparable.

Mistake 5: Not accounting for test interactions. If you're running two tests that affect the same page simultaneously, their effects can interact. At a minimum, segment your results to see if users exposed to both tests are behaving differently from users exposed to only one.

What to Do Next

  1. Pull your actual baseline CVR for the specific page and step you want to test. Use at least 4 weeks of data from your analytics tool — not your testing platform's default.
  2. Define your MDE based on business impact. What lift would justify the development cost of implementing the winner?
  3. Run the sample size calculation. If duration exceeds 8-10 weeks, reconsider scope — either increase traffic to the page or increase your MDE threshold.
  4. Lock in your start/end day. Start Monday, plan to end Monday. Calendar it before you launch.
  5. If you're using Optimizely, enable Stats Engine (it's on by default) and configure your Experiment Guardrail metrics so a degradation in a secondary metric triggers an alert before you have a catastrophic result.

For a full walkthrough of setting up properly configured tests in Optimizely — including Stats Engine configuration, traffic allocation, and metric setup — see the Optimizely Practitioner Toolkit at atticusli.com/guides/optimizely-practitioner-toolkit.

Written by Atticus Li

Revenue & experimentation leader — behavioral economics, CRO, and AI. CXL & Mindworx certified. $30M+ in verified impact.