Test Duration
The length of time an A/B test should run to produce valid results, balancing the need for sufficient data against the risks of stopping too early (novelty effects, unmeasured day-of-week variation) and of running too long (cookie churn, seasonal contamination).
What Is Test Duration?
Test duration is the length of time you let an experiment run before making a decision. It's a function of three things: the sample size required to detect your expected effect with adequate power, the need to capture a full business cycle (usually one week), and the diminishing returns of extending beyond that. Getting duration right is half art, half arithmetic — and teams consistently err on both sides.
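The arithmetic half is simple enough to sketch. Here's a minimal example in Python (the function name and the one-week floor are illustrative, not a standard): it converts a required total sample size into days, then rounds up to whole weekly cycles so every day of the week is sampled equally.

```python
import math

def test_duration_days(n_total: int, daily_visitors: int, min_days: int = 7) -> int:
    """Translate a required total sample size into a test duration in days,
    enforcing a one-week floor and rounding up to whole weekly cycles."""
    days = max(math.ceil(n_total / daily_visitors), min_days)
    return math.ceil(days / 7) * 7  # round up to full weeks

# e.g. ~80,000 total visitors needed at 5,000/day -> 16 raw days -> 21 days
print(test_duration_days(80_000, 5_000))  # 21
```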
Also Known As
- Marketing teams say test runtime or test window.
- Growth teams call it test duration or experiment length.
- Product teams refer to it as experiment window or trial period.
- Engineering teams use runtime or collection window.
- Statisticians distinguish between sample size and duration (the latter being required sample size divided by daily traffic).
How It Works
Your page gets 5,000 daily visitors split 50/50. To detect a 10% relative lift on a 4% baseline conversion rate (4.0% vs. 4.4%) with 80% power at alpha = 0.05 (two-sided), you need roughly 39,500 visitors per arm (~79,000 total). At 5,000/day, that's 16 days, which you round up to 21 days to cover three full weekly cycles. You launch Monday morning. By Thursday you see a significant lift. You ignore it and keep running, because weekend users behave differently from weekday users, and three days of data doesn't include weekend behavior at all.
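The sample-size step above can be checked with the standard normal-approximation formula for a two-sided two-proportion z-test. A minimal sketch using only Python's standard library (the function name is illustrative; use a proper power calculator for production decisions):

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_arm(p1: float, rel_lift: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm n for a two-sided two-proportion z-test (normal approximation)."""
    p2 = p1 * (1 + rel_lift)
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical value, two-sided
    z_b = NormalDist().inv_cdf(power)           # power quantile
    pbar = (p1 + p2) / 2
    n = ((z_a * sqrt(2 * pbar * (1 - pbar))
          + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) / (p2 - p1)) ** 2
    return ceil(n)

n_arm = sample_size_per_arm(0.04, 0.10)    # ~39,500 per arm
total = 2 * n_arm                          # ~79,000 visitors total
print(n_arm, total, ceil(total / 5_000))   # -> 16 days at 5,000 visitors/day
```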
Best Practices
- Calculate required sample size before launching, and translate it into days based on your traffic.
- Always run at least one full business cycle (usually 7 days) regardless of significance.
- Cap tests at 4–6 weeks to limit cookie churn and external-event contamination.
- Don't extend indefinitely — if you haven't reached significance in 2x expected duration, call it inconclusive.
- Segment by week if running multi-week tests to check for trend changes.
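A minimal sketch of that week-by-week check, using hypothetical weekly aggregates per arm (field names and numbers are illustrative): a lift that shrinks or flips sign across weeks is a warning sign of novelty effects or an external shock.

```python
from collections import defaultdict

# Hypothetical weekly aggregates: (iso_week, arm, visitors, conversions).
records = [
    (1, "control", 2_500, 98),  (1, "variant", 2_500, 112),
    (2, "control", 2_500, 101), (2, "variant", 2_500, 109),
    (3, "control", 2_500, 97),  (3, "variant", 2_500, 111),
]

# Sum visitors and conversions per (week, arm), then compare lift week by week.
totals = defaultdict(lambda: [0, 0])  # (week, arm) -> [visitors, conversions]
for week, arm, visitors, conversions in records:
    totals[(week, arm)][0] += visitors
    totals[(week, arm)][1] += conversions

for week in sorted({w for w, _ in totals}):
    rates = {arm: totals[(week, arm)][1] / totals[(week, arm)][0]
             for arm in ("control", "variant")}
    lift = rates["variant"] / rates["control"] - 1
    print(f"week {week}: control {rates['control']:.2%}, "
          f"variant {rates['variant']:.2%}, lift {lift:+.1%}")
```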
Common Mistakes
- Stopping at three days because significance showed up — missing all weekend behavior.
- Running tests for three months "to be sure," introducing seasonality and cookie-churn bias.
- Confusing test duration with sample size — you need both correct, not just one.
Industry Context
- SaaS/B2B: Low-traffic sites often require 4–8 week tests, which strains organizational patience.
- Ecommerce/DTC: Usually 1–3 weeks; hard cap at 4 to avoid seasonal contamination.
- Lead gen: 2–4 weeks is typical; lead quality data often requires longer attribution windows.
The Behavioral Science Connection
Test duration is where hyperbolic discounting wreaks havoc. A three-day 15% lift feels more exciting than the same test's three-week 4% lift, even though the three-week result is dramatically more reliable. Pre-registering duration is a Ulysses contract: you bind your future self before the siren song of early significance arrives.
Key Takeaway
Run every test for at least one full business cycle, and pre-register the duration so you aren't tempted to stop when the numbers look good.