Week 4. Your test shows a 6% lift. Optimizely says "not enough data." Your PM is asking when you can call it. Sound familiar? You're staring at a dashboard that refuses to cooperate, and the pressure to ship is mounting.
Before you make the wrong call — either shipping a test that might not actually work, or killing something that could have won — let's diagnose what's actually happening. There are five distinct causes for an experiment that won't reach statistical significance, and each has a different fix.
Cause 1: Traffic Is Too Low
This is the most common culprit, and it's the one teams feel most helpless about. You ran a sample size calculator before launch — maybe it said you needed 5,000 visitors per variation — but three weeks in, you've only accumulated 800.
How to diagnose it: Pull your traffic source data. Is the page actually getting the visits you expected? Check your Optimizely experiment report and compare "visitors" to your analytics tool. If there's a big gap, you may have a tracking issue (more on that in Cause 5). If the numbers match and traffic is just low, you have a resource allocation problem.
The math: If your page gets 200 unique visitors per day and you need 10,000 per variation, you're looking at 100 days minimum — before accounting for weekends, seasonality, or any business-cycle effects. That's not a statistics problem. That's a product strategy problem.
What to do:
- Move the test to a higher-traffic page that addresses the same behavioral hypothesis
- Increase your Minimum Detectable Effect (MDE) — if a 6% lift isn't worth waiting for, you can legitimately set your bar higher
- Consider a different metric: clicks are higher volume than purchases, so if you're testing a CTA, use click-through rate as your primary metric and revenue as a guardrail
- Abandon the test and redirect effort to your highest-traffic pages
**Pro Tip:** Before launching any experiment, calculate your expected runtime: (required sample size per variation x 2) / (daily unique visitors x allocation). If it's over 60 days, seriously reconsider the test design. Long-running experiments accumulate more noise from seasonal changes and code deployments.
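The runtime formula in this tip is easy to script. A minimal sketch (the traffic numbers are the illustrative ones from Cause 1, not real data):

```python
def expected_runtime_days(n_per_variation: int,
                          daily_unique_visitors: int,
                          allocation: float = 1.0,
                          variations: int = 2) -> float:
    """(required sample per variation x variations) / (daily visitors x allocation)."""
    return (n_per_variation * variations) / (daily_unique_visitors * allocation)

# The low-traffic page from Cause 1: 10,000 per variation at 200 visitors/day
print(expected_runtime_days(10_000, 200))                  # 100.0 days
# The same test at a 20% allocation "to be safe": 5x longer
print(expected_runtime_days(10_000, 200, allocation=0.2))  # 500.0 days
```

Run this before every launch; if the answer is over 60, redesign the test rather than hoping traffic improves.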
Cause 2: Your MDE Was Set Too Low
The "boiling frog" problem in experimentation. Teams set their MDE to 1% because they're afraid of missing small wins. Then they wonder why they're waiting forever.
Here's the brutal math: detecting a 1% relative lift in conversion rate (from 3.00% to 3.03%) requires roughly 5 million visitors per variation at 80% power and 95% confidence. At 5,000 visitors per day across both variations, that's over 2,000 days. You will never finish this test.
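You can sanity-check sample sizes like this yourself with the standard two-proportion approximation. A stdlib-only sketch (the normal approximation is fine at these scales):

```python
import math
from statistics import NormalDist

def sample_size_per_variation(p_baseline: float, relative_mde: float,
                              alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-variation n for a two-sided two-proportion test (normal approximation)."""
    p_variant = p_baseline * (1 + relative_mde)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    var_sum = p_baseline * (1 - p_baseline) + p_variant * (1 - p_variant)
    return math.ceil(z ** 2 * var_sum / (p_variant - p_baseline) ** 2)

# 1% relative lift on a 3% baseline: millions of visitors per variation
print(sample_size_per_variation(0.03, 0.01))
# 5% relative lift: roughly 25x fewer, since n scales with 1/MDE^2
print(sample_size_per_variation(0.03, 0.05))
```

The inverse-square relationship is the whole argument of this section: doubling your MDE cuts required traffic by 4x.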
The business case for a higher MDE: If your product margin is 40% and a 1% lift in CVR generates $8,000 per year in incremental revenue, ask yourself: is the cost of running this experiment — developer time, opportunity cost of not testing other things, organizational distraction — worth less than $8,000? Often it isn't.
A worked example from practice: A SaaS team tested a pricing page redesign with a 1% MDE set by a cautious analytics manager. After 8 weeks they had 67% confidence and 2.1% lift. They'd burned 8 weeks and still couldn't call it. Reset the MDE to 5% — the actual threshold for a business-relevant lift — and they would have needed 47,000 visitors, achievable in 3 weeks.
**Pro Tip:** Your MDE should reflect the minimum business impact worth acting on, not the minimum technically detectable change. Ask: "If we see a 2% lift, will we actually ship this and allocate resources to scale it?" If yes, set your MDE to 2%. If not, set it higher.
Decision framework for MDE:
- High-traffic, low-stakes test (button color, microcopy): MDE of 2-3% is fine
- Medium-traffic test (checkout flow step): MDE of 5% is appropriate
- Low-traffic, high-stakes test (pricing, onboarding): MDE of 10-15% — or don't test there at all until you have more traffic
Cause 3: High Metric Variance
This is the one that surprises people most. Revenue per visitor is an extremely noisy metric. A single $5,000 B2B order in your test group can swing your results by 15% on a small sample. That's not signal — it's noise.
Why revenue tests need 3-5x more traffic than CVR tests: conversion is a binary outcome (0 or 1 per visitor), so its variance is bounded at p(1 − p). Revenue per visitor follows a highly skewed distribution with a long right tail, and the variance of the metric directly determines your required sample size.
The formula matters here: required sample size scales with variance. If your revenue-per-visitor metric has a standard deviation of $12 around a mean of $8 (a coefficient of variation of 1.5 — extremely common in e-commerce), you'll need roughly 9x more traffic for the same relative MDE than if you were testing a metric with a CV of 0.5.
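To see that scaling concretely, here's the difference-of-means version of the sample-size formula expressed in CV terms (a planning sketch; real planning should use your observed metric distribution):

```python
from statistics import NormalDist

def n_per_arm(cv: float, relative_mde: float,
              alpha: float = 0.05, power: float = 0.80) -> float:
    """Per-arm n for detecting a relative lift in a mean: 2 * z^2 * (CV / MDE)^2."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return 2 * z ** 2 * (cv / relative_mde) ** 2

# CV of 1.5 (the $12-SD, $8-mean revenue metric) vs. a CV-0.5 metric:
print(round(n_per_arm(1.5, 0.05) / n_per_arm(0.5, 0.05)))  # 9 — 9x the traffic
```

Because n scales with CV squared, tripling the coefficient of variation means nine times the traffic.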
What to do about high variance:
Option 1: Switch to a lower-variance metric. If you're testing a product page, use add-to-cart rate instead of revenue. If you're testing checkout, use completion rate. Revenue becomes a guardrail metric you monitor for safety, not your primary metric.
Option 2: Use CUPED (Controlled-experiment Using Pre-Experiment Data). Optimizely supports variance reduction through pre-experiment covariate adjustment. By using each visitor's pre-experiment behavior as a covariate, you can reduce metric variance by 50-80% in some cases — effectively cutting your required sample size in half. This is one of the highest-leverage technical improvements a CRO team can make.
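The mechanism behind CUPED is worth seeing once. This sketch (synthetic data, stdlib only — illustrating the idea, not Optimizely's implementation) subtracts the part of each visitor's in-experiment spend that their pre-experiment spend already predicts:

```python
import random
from statistics import mean, variance

def cuped_adjust(y: list, x: list) -> list:
    """y_adj = y - theta * (x - mean(x)), with theta = cov(x, y) / var(x)."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov / variance(x)
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]

random.seed(42)
# Synthetic visitors: pre-experiment spend x, in-experiment spend y correlated with x
x = [random.gammavariate(2, 10) for _ in range(5000)]
y = [0.8 * xi + random.gauss(0, 5) for xi in x]

ratio = variance(cuped_adjust(y, x)) / variance(y)
print(ratio)  # well below 1: most of y's variance was pre-existing visitor-to-visitor spread
```

Because the adjustment has mean zero, each variation's average is preserved while the noise shrinks — which is exactly why required sample sizes drop.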
Option 3: Trim outliers. Cap revenue values at the 99th percentile before analysis. That $5,000 one-off order becomes $200 (or whatever your cap is), dramatically reducing variance.
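A simple winsorization helper, using a nearest-rank percentile (the revenue list is made up to mirror the $5,000-whale example):

```python
from statistics import pvariance

def cap_at_percentile(values: list, pct: float = 0.99) -> list:
    """Clamp everything above the nearest-rank pct quantile down to that value."""
    ordered = sorted(values)
    cap = ordered[int(pct * (len(ordered) - 1))]
    return [min(v, cap) for v in values]

# 90 non-buyers, a handful of normal orders, and one $5,000 whale
revenue = [0.0] * 90 + [40, 55, 60, 75, 80, 90, 110, 150, 200, 5000]
capped = cap_at_percentile(revenue)
print(max(capped))                             # 200 — the whale is clamped to the cap
print(pvariance(capped) < pvariance(revenue))  # True — variance drops sharply
```

Apply the same cap to both variations, and document it in the experiment notes so the analysis is reproducible.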
**Pro Tip:** When you set up a revenue test in Optimizely, compute the coefficient of variation of your revenue metric (standard deviation divided by mean) from historical data before launch. If it's above 1.0, either switch metrics or plan for 2-3x your normal sample size requirement. Treat revenue as a guardrail and test on a conversion funnel metric instead.
Cause 4: Seasonal Traffic Fluctuations
You launched your test on November 1st. Week 1: 15,000 visitors. Week 2: 14,500 visitors. Week 3: 45,000 visitors (Black Friday). Week 4: 9,000 visitors (post-holiday slump). Your confidence interval is a mess because your traffic mix has been all over the place.
Why business cycles matter: Statistical significance calculations assume your sample is drawn from a consistent underlying population. When your traffic source mix changes — weekend shoppers vs. weekday researchers, holiday buyers vs. regular customers — you're effectively comparing apples to oranges.
The minimum viable business cycle rule: Run every experiment for at least one complete business cycle. For most B2C e-commerce, that's 7 days minimum. For SaaS with monthly billing cycles, it's at least 14 days (ideally 28). For B2B with quarterly budget cycles, purchase decisions can take weeks to show up in your data.
A concrete example: A team tested a new enterprise pricing page. After 3 weeks, they had 91% confidence and 14% lift. They shipped it. Two weeks later, their sales team reported that qualified demos had dropped 8%. What happened? The original 3-week window happened to capture end-of-quarter urgency from enterprise buyers. The test looked like a winner because buyers were already motivated. When normal traffic resumed, the new page actually hurt.
**Pro Tip:** Always check your traffic volume by day of week in your experiment data. If you see a strong day-of-week pattern (typical in SaaS — Monday spikes are common), make sure your test has run through at least one complete week before analyzing results. Partial weeks introduce systematic bias.
How to handle genuine seasonal tests: If you need to test something during a promotional period, accept that results will only generalize to similar promotional periods. Document this explicitly in your experiment notes.
Cause 5: Implementation Bug
The sneaky one. Everything looks fine on the surface, but under the hood your traffic split is 73/27, your event is firing twice on some pages, or the variation only loads for 60% of visitors because of a JavaScript timing conflict.
How to check your traffic split: In Optimizely, go to your experiment results and look at the "Visitors" count for each variation. In a 50/50 split, you expect roughly equal numbers — but not exactly equal. A small imbalance (say, 49.2/50.8) is normal randomization variance. An imbalance of 40/60 or worse is a red flag.
Use a chi-square test to check whether your observed split is within expected bounds. For a 50/50 split with 10,000 total visitors, the standard deviation of each variation's count is 50, so you'd expect each to land within about ±100 of 5,000. If you're seeing ±800, that's a 16-sigma event — investigate.
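This sample-ratio check takes a few lines with the normal approximation to the binomial (equivalent to the one-degree-of-freedom chi-square test):

```python
from math import sqrt
from statistics import NormalDist

def srm_p_value(n_a: int, n_b: int) -> float:
    """Two-sided p-value that the observed counts came from a true 50/50 split."""
    total = n_a + n_b
    z = (n_a - total / 2) / (sqrt(total) / 2)  # binomial: mean total/2, SD sqrt(total)/2
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(round(srm_p_value(5_100, 4_900), 4))  # 0.0455 — borderline, worth a look
print(srm_p_value(5_800, 4_200) < 1e-6)     # True — a 58/42 split is a definite bug
```

Interpret a tiny p-value as "this split did not come from your intended allocation" — an implementation bug, never a treatment effect.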
How to verify event firing: Open your browser console on your test page with the Optimizely snippet loaded. Trigger the conversion event (add to cart, form submit, purchase) and watch the network tab for the event call. Check:
- Is it firing once per conversion? (Double-fires inflate your conversion rate)
- Is it firing for both variations? (Selective firing biases results)
- Is it firing on page load instead of on the actual user action? (This inflates CVR to near 100%)
**Pro Tip:** Build a QA checklist for every experiment before launch: (1) Verify variation renders correctly in both Chrome and Safari, (2) Confirm traffic split after 100 visitors matches intended allocation, (3) Check conversion event fires exactly once per conversion using network tab, (4) Validate that excluded segments (employees, internal IPs) are actually excluded. Run this within 24 hours of launch — catching bugs early saves weeks of wasted data.
Common Optimizely implementation bugs:
- Variation CSS loading after page paint, causing flicker and user confusion that distorts behavior
- Redirect experiments where the redirect fires before Optimizely cookies are set, causing visitors to be assigned to both variations
- Events tied to form submissions that fire before the actual server confirmation, counting failed submissions as conversions
The Decision Framework: When to Call It
After diagnosing your cause, you need a framework for the actual decision. Here it is:
If confidence is below 70% after 2 full business cycles: The effect is likely too small to detect reliably, or there's no real effect. Unless you have a strong prior belief, cut the test.
If confidence is 70-85% after 2 full business cycles with positive lift: Consider the stakes. For low-risk changes (microcopy, button color), you can ship with 80% confidence. For high-risk changes (checkout flow, pricing), wait or increase traffic allocation.
If confidence is 85-95% after 2 full business cycles: You're in the gray zone. Look at the trend direction — is confidence increasing or stable? If it's been hovering around 88% for two weeks without moving, more data won't help much. Make the call.
If confidence is above 95%: Ship it. But double-check that you've run at least one full business cycle. A 99% confidence result from 3 days of data is not trustworthy.
**Pro Tip:** Time in experiment is as important as statistical confidence. A 95% confidence result from 4 days means your confidence intervals are narrow enough statistically, but you may be missing weekly behavior patterns. Optimizely's Stats Engine accounts for sequential testing, but it doesn't account for the behavioral differences between Monday and Friday visitors.
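The framework above condenses into a helper you could pin next to your dashboard. A sketch with the thresholds copied from the text (`business_cycles` is elapsed full cycles; `high_stakes` covers checkout- and pricing-style changes):

```python
def call_decision(confidence: float, business_cycles: float,
                  high_stakes: bool) -> str:
    """The decision framework above, with confidence as a 0-1 fraction."""
    if business_cycles < 1:
        return "keep running: less than one full business cycle"
    if confidence >= 0.95:
        return "ship it"
    if business_cycles < 2:
        return "keep running: give it two full business cycles"
    if confidence < 0.70:
        return "cut the test"
    if confidence < 0.85:
        return "wait or increase allocation" if high_stakes else "ship it"
    return "gray zone: check the trend; if confidence has plateaued, make the call"

print(call_decision(0.99, 0.4, False))  # keep running: less than one full business cycle
```

Note that time gates come before confidence gates: a 99% result at 0.4 cycles still returns "keep running", matching the warning above about 3-day wins.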
Common Mistakes
Peeking and stopping early. Optimizely's Stats Engine is designed to be peeked at (it uses sequential testing), but many teams still stop the moment they see "significance" without checking if they've run a full business cycle.
Running at 20% traffic allocation "to be safe." This extends your required runtime by 5x. If you're worried about downside risk, set up guardrail metrics instead. Running at partial traffic for months defeats the purpose of experimentation.
Treating every test as equal. A test on your highest-traffic page needs less runtime than a test on a product detail page. Prioritize your test roadmap by expected sample size, not just business impact.
Ignoring variance before launch. Teams calculate sample size using a CVR metric but then measure revenue as their primary metric. The variance mismatch means their sample size calculation was wrong from the start.
Not documenting the expected runtime. If you don't know when you planned to call the test, you'll be pressured to call it whenever stakeholders ask.
What to Do Next
- Audit your current running experiments: For each one, calculate expected runtime vs. actual runtime. Flag any that have exceeded their planned window.
- Check traffic splits: Pull the visitor counts per variation. Any split worse than 45/55 warrants investigation.
- Review your MDE settings: Are they set to the minimum business-relevant threshold, or are they aspirationally small?
- Set up CUPED if you're testing revenue metrics: This is the single highest-leverage change most CRO teams aren't making.
- Build the QA checklist: Don't launch another experiment without verifying event firing in the first 24 hours.
For more context on how Optimizely's statistical engine works under the hood, see Why Your Optimizely Results Keep Changing. For a deep dive on the false positive risk from running many tests simultaneously, see False Discovery Rate in Optimizely.