Segmenting A/B Test Results in Optimizely: Where the Real Insights Hide
You ran a 3-week test. Overall result: +2.1% CVR, not statistically significant. Your stakeholder says "it didn't work, move on."
But you segment by device type. On desktop, the variation is +8.4% CVR, statistically significant. On mobile, it's -4.2%, also significant. The overall result was the average of two opposite effects canceling each other out.
That's not a failed test. That's two new hypotheses, a segmented rollout strategy, and the most valuable 3 weeks you've spent on experimentation this quarter.
This is why segmentation matters. Here's how to do it without fooling yourself.
Why the Top-Line Result Is Often a Lie
The aggregate conversion rate is an average across all your visitors. If the variation helps one group and hurts another, those effects will partially or fully cancel out — giving you an inconclusive overall result that obscures real, actionable signals.
This is a version of Simpson's Paradox: a trend that appears in subgroups can disappear or even reverse when the groups are combined. In CRO, it shows up most often across device type, traffic source, and new vs. returning visitor segments.
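The canceling effect is easy to demonstrate with a weighted average. The sketch below uses illustrative traffic shares and per-segment conversion rates (not data from any real experiment) to show how a strong desktop win and a mobile loss can blend into a small, inconclusive aggregate lift:

```javascript
// Blend per-segment conversion rates into an aggregate lift,
// weighting each segment by its share of traffic.
function aggregateLift(segments) {
  const controlCvr = segments.reduce((sum, s) => sum + s.share * s.control, 0);
  const variationCvr = segments.reduce((sum, s) => sum + s.share * s.variation, 0);
  return variationCvr / controlCvr - 1;
}

// Illustrative numbers: desktop wins big, mobile loses.
const segments = [
  { name: "desktop", share: 0.4, control: 0.050, variation: 0.0542 }, // +8.4%
  { name: "mobile",  share: 0.6, control: 0.030, variation: 0.0287 }, // -4.3%
];

segments.forEach((s) => {
  const lift = (s.variation / s.control - 1) * 100;
  console.log(`${s.name}: ${lift.toFixed(1)}% lift`);
});
console.log(`aggregate: ${(aggregateLift(segments) * 100).toFixed(1)}% lift`);
```

With these inputs the blended result is a lift of about +2.4% — exactly the kind of weak aggregate signal that hides two strong, opposite segment effects.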
The business implication is significant. An experiment that "doesn't work" on aggregate might be a clear win for your highest-value segment. Killing it and moving on means leaving revenue on the table.
The 5 Most Valuable Segmentation Cuts
Not all segments are created equal. Across 100+ experiments, these five cuts have generated the most actionable insights, listed in order of value.
1. Device Type (Desktop vs. Mobile)
This is the single most important segment to check every time. Mobile and desktop users behave fundamentally differently: session length, form factor, task completion intent, and technical constraints all vary. A checkout flow improvement that works beautifully on desktop may be disruptive on mobile.
Optimizely captures this as a default attribute — no setup required.
2. New vs. Returning Visitors
New visitors are discovering your product. Returning visitors are evaluating or re-evaluating. A variation that increases trust signals (more social proof, more detailed specs) may convert returning visitors at a higher rate while overwhelming new visitors who aren't ready to buy.
This segment also tells you about your test's validity: if the variation shows a huge lift only for returning visitors, consider whether those visitors saw your test on a prior visit and their behavior is now skewed.
3. Traffic Source
Paid traffic converts differently than organic. Direct traffic converts differently than email. If your variation improves CVR for branded paid search visitors but tanks it for organic content visitors, you have a clear segmentation strategy for personalization.
**Pro Tip:** In Optimizely Web, you can segment results by UTM parameters if you've set them up as custom attributes. This is worth the upfront implementation cost — it turns every experiment into traffic source intelligence.
4. Geography
Region-level segmentation matters more than most teams realize. Price sensitivity, cultural norms around trust signals, and even language nuance affect how users respond to copy and design changes. A variation that performs well in North America may underperform in EMEA.
For SaaS products, geography also correlates with plan type and company size — which makes it a useful proxy for customer value.
5. Time of Day / Day of Week
This segment is often overlooked but is particularly valuable for B2B products. If your variation wins during business hours but loses on weekends, you're looking at a fundamental difference in visitor intent, not a design preference.
Setting Up Attributes for Segmentation in Optimizely
Optimizely Web Experimentation captures several attributes by default: browser, device type, source, campaign, ad group, referrer, and custom tags. These are available immediately on every results page.
For custom attributes, you need to set them up before the experiment starts. You cannot retroactively apply attributes to data that was already collected.
To add custom attributes in Optimizely Web:
- Go to Settings > Attributes
- Create the attribute with a key and display name
- In your experiment's JavaScript, call `window.optimizely.push({ type: "user", attributes: { yourattributekey: value } })`
- The attribute will then appear as a segmentation option in your results
For Optimizely Feature Experimentation, attributes must be defined in the datafile and passed in the user context object at decision time.
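Put together, the Web attribute push looks like the sketch below. To keep it runnable outside a browser, `window.optimizely` is stubbed as a plain array (on a real page the Optimizely snippet provides this global, and the push call is all you'd ship). The attribute keys `visitor_type` and `page_depth` are hypothetical examples — use whatever keys you created in Settings > Attributes:

```javascript
// Stand-in for the browser global so this sketch runs in Node;
// on a real page, the Optimizely snippet provides window.optimizely.
const window = { optimizely: [] };

// Push custom attributes before the experiment's decision point.
// Keys must match attributes already created in Settings > Attributes.
window.optimizely.push({
  type: "user",
  attributes: {
    visitor_type: "returning", // e.g. derived from a first-party cookie
    page_depth: 4,             // e.g. pages viewed this session
  },
});

console.log(window.optimizely[0].attributes.visitor_type); // "returning"
```

Note the ordering constraint from above: the push has to happen before the data you want to segment is collected, which is why this belongs in your pre-launch QA checklist.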
**Pro Tip:** Set up your segmentation attributes during experiment QA, not after launch. The most valuable segments — like "high-intent visitors" based on page depth or "logged-in users" based on session state — require custom attributes that you need to plan ahead.
What to Look For in Segment Analysis
When you open the segment breakdown, you're looking for one thing: interaction effects. That's when the direction or magnitude of the variation's effect changes meaningfully between segments.
Three patterns to watch for:
Reversal: Variation wins on desktop (+8%), loses on mobile (-5%). This is the most actionable pattern — it tells you to ship the variation for desktop traffic only, or run a mobile-specific follow-up test.
Amplification: Variation wins overall (+4%), but wins much more strongly for new visitors (+12%) than returning visitors (+1%). This tells you the variation is particularly effective for acquisition, and you may want to weight toward top-of-funnel tests.
Isolation: Variation shows no effect overall, but one segment shows a clear win. This is the "hidden winner" scenario from the introduction — the most common source of valuable hypotheses from inconclusive experiments.
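One minimal way to formalize the three patterns is to compare each segment's lift against the aggregate lift and against zero. The thresholds below are arbitrary illustrations, not statistical tests — a direction change still needs the sample-size checks covered later in this article before you act on it:

```javascript
// Classify a segment's interaction effect relative to the aggregate lift.
// Lifts are relative (0.05 = +5%). Thresholds are illustrative only.
function classifySegment(segmentLift, aggregateLift) {
  const oppositeDirection = segmentLift * aggregateLift < 0;
  if (oppositeDirection && Math.abs(segmentLift) > 0.02) return "reversal";
  if (Math.abs(aggregateLift) < 0.01 && Math.abs(segmentLift) > 0.05) return "isolation";
  if (aggregateLift > 0 && segmentLift > aggregateLift * 2) return "amplification";
  return "in line with aggregate";
}

console.log(classifySegment(-0.05, 0.04));  // "reversal"
console.log(classifySegment(0.12, 0.04));   // "amplification"
console.log(classifySegment(0.08, 0.005));  // "isolation"
```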
**Pro Tip:** When you find a reversal between device types, don't immediately kill the experiment. Check whether that device segment has enough traffic to run a standalone test with proper statistical power. If mobile gets 60% of your traffic and shows a -5% loss, fixing the mobile experience before shipping is the right call.
The Segment Fishing Trap
Here is the statistical trap that makes segment analysis dangerous: if you look at enough segments, one will show significance by chance.
At 95% confidence, you accept a 5% false positive rate for any single test. If you look at 20 segments, you expect one of them to show significance by pure random chance — even if the variation has zero real effect on any group.
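The arithmetic behind that claim: treating each segment look as an independent test at the 5% level, the chance of at least one false positive grows quickly with the number of segments. (Independence is a simplification — segment results are correlated with the aggregate — but it captures the scale of the problem.)

```javascript
// Probability of at least one false positive across n independent
// looks, each run at significance level alpha.
function familywiseErrorRate(n, alpha = 0.05) {
  return 1 - Math.pow(1 - alpha, n);
}

for (const n of [1, 5, 10, 20]) {
  const pct = (familywiseErrorRate(n) * 100).toFixed(0);
  console.log(`${n} segments: ${pct}% chance of at least one false positive`);
}
```

At 20 segments the familywise error rate is about 64% — more likely than not that something "significant" appears even when the variation does nothing.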
This is the multiple comparisons problem, and Optimizely's Stats Engine significance numbers in the segments view do not correct for it. The documentation is explicit: segmented results should be used for exploratory data analysis only, not for making significance-based decisions.
The practical rule: pre-register your segments before the experiment launches. When you design the experiment, write down the 3-5 segments you plan to check. Results in those pre-registered segments deserve more confidence. Any other segment you look at is exploratory — treat it as a hypothesis for the next test, not evidence from this test.
Minimum Segment Size to Trust Results
The same sample size principles that apply to your overall test apply to each segment.
If your experiment needs 10,000 visitors per variation to detect a 5% improvement at 80% power, your mobile segment needs roughly the same 10,000 per variation on its own (somewhat more or less depending on the segment's baseline CVR). With mobile at 40% of traffic, that means about 50,000 total visitors — 20,000 mobile visitors split across two arms — or 2.5× what the overall metric alone requires.
A segment showing "significant" with 300 visitors per arm is almost certainly a false positive. The confidence intervals will be wide, the effect size will be exaggerated, and the result will not replicate.
**Pro Tip:** Before the experiment launches, run a quick power calculation for your highest-priority segments. If your most important segment — say, mobile visitors from paid search — only gets 400 visitors per week, you'll need to run the experiment much longer than the primary metric alone would require.
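A back-of-the-envelope version of that pre-launch check: given the per-variation sample size your power calculation produced and a segment's weekly volume, how long must the test run before that segment is readable? The 10,000-per-variation figure below is an assumption carried over from the example above; in practice you'd recompute it from the segment's own baseline CVR.

```javascript
// Weeks until a segment accumulates enough visitors per variation.
// perArmRequired: sample size per variation from your power calculation
// segmentVisitorsPerWeek: weekly traffic landing in this segment
// numVariations: total arms in the test (control + variations)
function weeksToReadSegment(perArmRequired, segmentVisitorsPerWeek, numVariations = 2) {
  const totalNeeded = perArmRequired * numVariations;
  return Math.ceil(totalNeeded / segmentVisitorsPerWeek);
}

// Mobile paid-search visitors at 400/week, assuming 10,000 per arm:
console.log(weeksToReadSegment(10000, 400)); // 50 weeks — not testable as-is
```

A 50-week runtime is a useful answer to get before launch: it tells you to either drop that segment from the analysis plan or design a dedicated test with a larger minimum detectable effect.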
Worked Example: From Inconclusive to Actionable
Experiment: Tested a simplified checkout form (fewer required fields, inline validation) on a SaaS sign-up page.
Overall result: +2.3% CVR, not statistically significant after 3 weeks. 18,400 total visitors.
Segment analysis:
| Segment | Control CVR | Variation CVR | Improvement | Significant? |
|---------|-------------|---------------|-------------|--------------|
| Desktop | 4.8% | 5.7% | +18.8% | Yes |
| Mobile | 3.2% | 2.9% | -9.4% | Yes |
| New visitors | 2.1% | 2.6% | +23.8% | No (n=1,200) |
| Returning visitors | 6.4% | 6.7% | +4.7% | No |
Conclusion: The form simplification helps desktop users significantly and hurts mobile users significantly. The overall result was the two effects averaging out.
Action: Ship the variation for desktop only. Commission a separate mobile UX audit to understand why simplified forms hurt mobile conversion (hypothesis: the inline validation was not rendering correctly on iOS). Run a mobile-specific test in 6 weeks after the fix.
Revenue impact calculation: Desktop traffic is 55% of ~6,100 weekly visitors (18,400 over 3 weeks) ≈ 3,400 desktop visitors/week. At the 4.8% baseline CVR, that's 163 conversions/week; a +18.8% improvement brings it to ~194. At $1,200 ACV, the ~31 additional conversions/week add roughly $37,200 in new ARR every week — value that was invisible in the aggregate result.
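The same arithmetic as a reusable function, without intermediate rounding (so it lands slightly below the rounded $37,200 figure):

```javascript
// Weekly new-ARR impact of shipping a winning variation to one segment.
function weeklyArrImpact({ visitorsPerWeek, baselineCvr, lift, acv }) {
  const baselineConversions = visitorsPerWeek * baselineCvr; // ~163/week
  const extraConversions = baselineConversions * lift;       // ~31/week
  return extraConversions * acv;
}

const impact = weeklyArrImpact({
  visitorsPerWeek: 3400, // ~55% of 18,400 visitors over 3 weeks
  baselineCvr: 0.048,
  lift: 0.188,           // +18.8% relative improvement on desktop
  acv: 1200,
});
console.log(`~$${Math.round(impact)} in new ARR per week`);
```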
Turning No-Win Tests Into Hypotheses
Every inconclusive test that gets segmented properly should yield at least 2-3 new hypotheses. Document them systematically.
When you close an experiment in your test management system, require the analyst to write one sentence for each major segment that showed a different pattern than the aggregate. Those sentences are your next experiment backlog.
The teams that build compounding experimentation programs are the ones who treat every result — win, loss, or inconclusive — as information, not just a pass/fail grade.
Common Mistakes
Using segment significance to make rollout decisions. Segment-level significance numbers in Optimizely are not corrected for multiple comparisons. A segment that hits 95% confidence when you looked at 15 segments may be a false positive.
Retroactive segment setup. You cannot add segmentation attributes to data you've already collected. The segment has to be defined before the experiment runs.
Ignoring segment size. A 40% improvement in a segment of 120 visitors is noise. Set a minimum visitor threshold (typically the same power calculation you'd use for the overall test) before treating a segment result as meaningful.
Treating every segment as equally important. Pre-register the segments that matter to your business. If you have a hypothesis about device type or traffic source, that's a primary segment. Everything else is exploratory.
Not documenting segment learnings. The insights from segment analysis are often more valuable than the primary result. If you don't write them down and turn them into follow-up hypotheses, you're leaving the most valuable part of the experiment unused.
What to Do Next
- Open your last 3 inconclusive experiments and check the device type segment. I'd bet at least one of them shows a reversal or amplification effect you haven't acted on.
- Before your next experiment launches, write down the 3 segments you plan to analyze. This is your pre-registered analysis plan.
- Check your custom attribute setup — can you currently segment by new vs. returning visitor, and by traffic source? If not, spend an hour getting those attributes in place before your next test launches.
- Read the results page walkthrough if you haven't already — understanding the sample ratio mismatch check is critical before you trust any segment breakdown.