Most failed experiments I've reviewed didn't fail because of bad hypotheses. They failed because the primary metric was wrong: either too noisy to detect a real signal, too narrow to capture the actual business impact, or misaligned with what the test was actually trying to move.
Metric selection is the most under-discussed decision in experimentation. Here's how to get it right.
Why You Can Only Have ONE Primary Metric
The temptation is to declare 5 metrics primary and let the data tell you which one moved. This is how you generate false positives.
The multiple comparisons problem: if you test 5 metrics each at 95% confidence, your probability of getting at least one false positive is 1 - (0.95)^5 = 22.6%. Nearly 1 in 4 experiments will show a "significant" result on at least one metric purely by chance. The more metrics you check, the more likely you are to fool yourself.
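The family-wise error rate above is easy to reproduce. A minimal sketch in Python (the 5-metric, 95%-confidence figures are from the text; the function itself is generic and assumes independent metrics):

```python
def family_wise_error_rate(n_metrics: int, confidence: float = 0.95) -> float:
    """Probability of at least one false positive across n independent
    metrics, each tested at the given confidence level."""
    return 1 - confidence ** n_metrics

# One metric at 95% confidence: the familiar 5% false-positive rate.
print(f"{family_wise_error_rate(1):.1%}")   # 5.0%
# Five "primary" metrics: nearly 1 in 4 experiments shows a spurious winner.
print(f"{family_wise_error_rate(5):.1%}")   # 22.6%
```

Real metrics are usually correlated, which lowers the effective rate somewhat, but the direction of the problem is the same: every extra "primary" metric buys you more chances to fool yourself.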
Optimizely's Stats Engine (sequential testing / always-valid inference) handles some of this through Bayesian methods and multiple comparison corrections, but it doesn't eliminate the problem entirely. More importantly, it doesn't solve the organizational problem: when you have multiple primary metrics, stakeholders pick the one that supports the narrative they wanted, and the experiment "succeeds" regardless of what actually happened.
One primary metric. Non-negotiable. That metric is what you commit to before the experiment runs. Secondary metrics are for understanding, not deciding.
**Pro Tip:** Write your primary metric in your experiment brief BEFORE you write the hypothesis. If you can't name a single metric that would definitively tell you whether your change worked, you don't have a clear enough hypothesis yet. Go back and sharpen it.
Primary vs Secondary Metrics in Optimizely
In Optimizely's interface:
Primary metric (Decision Metric): The single metric that determines winner/loser. Statistical significance and sample size calculations are calibrated to this metric. You need to hit your MDE (minimum detectable effect) and confidence threshold on this metric to call a winner.
Secondary metrics: Monitored for context and understanding. You might declare victory on the primary metric while checking secondary metrics to make sure you haven't caused collateral damage (e.g., a change boosted add-to-cart but crushed actual purchases).
Guardrail metrics: A specific type of secondary metric that represents a floor — you won't ship the winner if this metric degrades beyond a threshold. Common guardrails: page load time, unsubscribe rate, support ticket volume.
The setup in Optimizely: when you build your experiment, designate one metric as the primary in the Metrics section. All others are secondary. The Results page shows the primary metric first and flags it as the decision criterion.
**Pro Tip:** Add at least one guardrail metric to every experiment, even if it's just a simple one. The number of times I've seen a "winning" variant that increased purchases by 3% while simultaneously increasing the 30-day refund rate by 8% should alarm you. Your primary metric rarely tells the full story.
Revenue Per Visitor vs Conversion Rate: Why RPV Usually Wins
Conversion rate is the default metric for most teams. It's intuitive, easy to explain, and easy to calculate. It's also often the wrong primary metric for revenue-generating tests.
Here's why: conversion rate treats a $10 purchase and a $500 purchase identically. Both count as "1 conversion." If your test changes the mix of what people buy, CVR might go up while revenue goes down.
Revenue Per Visitor (RPV) = total revenue / total visitors (including non-purchasers, who contribute $0)
RPV captures the full economic impact of a change in one metric. If your test increases CVR from 3% to 3.3% but decreases average order value from $85 to $72, CVR says you won. RPV says:
- Before: 3% CVR x $85 AOV = $2.55 RPV
- After: 3.3% CVR x $72 AOV = $2.38 RPV
You actually lost 7% of revenue despite "improving" conversion rate.
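The arithmetic above generalizes to any CVR/AOV pair. A quick sketch (the numbers are from the example; the function itself is generic):

```python
def rpv(cvr: float, aov: float) -> float:
    """Revenue per visitor: conversion rate times average order value."""
    return cvr * aov

before = rpv(0.030, 85)   # $2.55
after = rpv(0.033, 72)    # $2.376, which rounds to the $2.38 above
relative_change = (after - before) / before
print(f"RPV change: {relative_change:+.1%}")  # about -6.8%, a ~7% revenue loss
```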
When CVR is the right primary metric:
- Your product has a fixed price (no AOV variation)
- You're optimizing for lead generation (no transaction value variation)
- You're in an early funnel stage where the monetary conversion is too far downstream to measure in your test window
When RPV is the right primary metric:
- E-commerce with variable order values
- SaaS where different plans have different prices
- Any test where you suspect the change might affect product selection or upsell behavior
**Pro Tip:** RPV has higher variance than CVR (because it includes $0 values for non-converters plus widely varying transaction amounts), which means you need more traffic to reach significance. Run the sample size calculation for RPV, not CVR, when planning your test. The difference is often 30-50% more traffic required.
Click Events vs Page Reach vs Custom Events: Choosing the Right Event Type
Optimizely lets you define metrics based on several event types. Here's when to use each:
Click Events
Best for: testing UI changes where the direct action is a click (button clicks, link clicks, tab selections)
Setup: Element click tracking via CSS selector or element ID
Limitation: Clicks are not always meaningful conversions. A button click that leads to an error page is still counted. Use click events when the click itself IS the goal, not when it's just a signal.
Page Reach (Pageview)
Best for: funnel progression metrics — "% of visitors who reached the confirmation page"
Setup: Trigger on page URL match
This is often the right choice for conversion events when you have a dedicated confirmation or success page. It's clean, reliable, and not dependent on JavaScript event firing.
Custom Events
Best for: actions that don't have a corresponding URL (AJAX form submissions, SPA interactions, add-to-cart without page reload, video plays)
Setup: JavaScript call — `window.optimizely.push({type: 'event', eventName: 'youreventname'})`
The most flexible option. Also the most error-prone because you're relying on implementation correctness.
**Pro Tip:** For purchase conversions, prefer page reach to the order confirmation URL over a custom JavaScript event whenever possible. Page reach is evaluated server-side from the URL match — it doesn't depend on a JavaScript push() call completing successfully. Custom event calls can be blocked by ad blockers, fail due to JavaScript errors, or fire multiple times. Pageview events are cleaner and more reliable.
The Guardrail Metric Concept
A guardrail metric is a secondary metric with a hard threshold. If the test variant causes this metric to move in the wrong direction beyond the threshold, you don't ship — even if the primary metric wins.
Common guardrail examples:
- Checkout abandonment rate — guardrail on any checkout flow test
- Page load time — guardrail on any change that adds JavaScript or images
- 30-day retention rate — guardrail on any test that might attract low-quality conversions
- Error rate / 404 rate — guardrail on any structural UI change
Setting a guardrail in Optimizely: add it as a secondary metric. Guardrail logic is typically enforced manually (you check it before shipping) rather than through automated blocking, though some teams build custom alerts.
The threshold I typically use: if a guardrail metric moves negatively by more than 10% relative, I don't ship, even if the primary metric won.
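The manual check described above can be codified so the ship decision is mechanical rather than debated after the fact. A sketch, assuming the 10% relative-degradation threshold from the text; the metric names are illustrative, and each guardrail's change is expressed as relative movement in the "good" direction (negative = degradation):

```python
def should_ship(primary_won: bool,
                guardrail_changes: dict[str, float],
                max_degradation: float = -0.10) -> bool:
    """Ship only if the primary metric won AND no guardrail metric
    degraded by more than the threshold (relative change)."""
    if not primary_won:
        return False
    return all(change >= max_degradation for change in guardrail_changes.values())

# Primary won, but the refund-rate guardrail degraded 12% relative: don't ship.
print(should_ship(True, {"page_load_time": -0.03, "refund_rate": -0.12}))  # False
```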
**Pro Tip:** Discuss and document guardrail thresholds with your stakeholders BEFORE the experiment runs, not after you see the results. If you define "acceptable degradation" after seeing the data, you're rationalizing, not governing.
How Metric Selection Affects Test Duration
Your primary metric directly determines how long the test needs to run to reach sufficient statistical power. The two key factors:
Baseline rate: A metric with a 2% baseline needs more traffic to detect a 10% relative lift (0.2 percentage points) than a metric with a 20% baseline to detect the same relative lift (2 percentage points).
Variance: High-variance metrics (like revenue per visitor, which has lots of zeros and some large values) need substantially larger samples than low-variance metrics (like conversion rate, which is binary).
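You can estimate the standard deviation of a per-visitor revenue metric directly from CVR, AOV, and the spread of order values, using the mixture formula Var(X) = p(μ² + s²) − (pμ)², where p is CVR, μ is AOV, and s is the standard deviation of order values among purchasers. A sketch (the $50 order-value spread is an assumption for illustration):

```python
def rpv_std(cvr: float, aov: float, order_value_std: float) -> float:
    """Std dev of per-visitor revenue: non-buyers contribute $0 with
    probability (1 - cvr); buyers contribute an order with mean aov
    and std order_value_std."""
    second_moment = cvr * (aov ** 2 + order_value_std ** 2)
    mean = cvr * aov
    return (second_moment - mean ** 2) ** 0.5

# 3.2% CVR, $87 AOV, assumed $50 spread in order values:
sigma = rpv_std(0.032, 87, 50)
print(f"RPV std dev: ${sigma:.2f} on a ${0.032 * 87:.2f} mean")
```

Note the standard deviation comes out around six times the mean, which is exactly why RPV tests need so much more traffic than CVR tests.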
Worked Example
- Test: Homepage hero change on an e-commerce site
- Traffic: 10,000 visitors/day
- Baseline CVR: 3.2%
- Baseline RPV: $2.80 (AOV $87, CVR 3.2%)
- Target MDE: 10% relative improvement
With CVR as primary metric:
- Target CVR: 3.52%
- Required sample size per variation: approximately 28,000 visitors
- Test duration at 50% traffic allocation: 6 days
With RPV as primary metric:
- Target RPV: $3.08
- RPV variance is high (std dev approximately $18 on $2.80 mean — most people spend $0)
- Required sample size per variation: approximately 160,000 visitors
- Test duration at 50% traffic allocation: 32 days
Same experiment, different primary metric, 5x different test duration. This is why you make the metric decision upfront, not after you've already launched.
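For planning purposes, fixed-horizon sample size estimates for both metric types can be sketched as below, assuming a two-sided 95% confidence level and 80% power. These are my assumptions, not the article's: Optimizely's sequential Stats Engine uses different math and typically requires more traffic, so treat these as rough planning estimates rather than reproductions of the numbers above.

```python
import math
from statistics import NormalDist

Z = NormalDist().inv_cdf  # standard normal quantile function

def n_per_variation_binary(p_base: float, relative_mde: float,
                           alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors per variation to detect a relative lift in a conversion rate."""
    p_test = p_base * (1 + relative_mde)
    z = Z(1 - alpha / 2) + Z(power)
    variance = p_base * (1 - p_base) + p_test * (1 - p_test)
    return math.ceil(z ** 2 * variance / (p_test - p_base) ** 2)

def n_per_variation_continuous(mean: float, std: float, relative_mde: float,
                               alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors per variation to detect a relative lift in a mean (e.g. RPV)."""
    delta = mean * relative_mde
    z = Z(1 - alpha / 2) + Z(power)
    return math.ceil(z ** 2 * 2 * std ** 2 / delta ** 2)

n_cvr = n_per_variation_binary(0.032, 0.10)
n_rpv = n_per_variation_continuous(2.80, 18.0, 0.10)
print(n_cvr, n_rpv)  # RPV needs meaningfully more traffic than CVR
```

The exact figures depend heavily on the assumed confidence, power, and revenue variance, and on whether your platform uses fixed-horizon or sequential statistics, but the direction is always the same: the higher-variance metric drives the duration, so run the calculation for the metric you actually chose.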
Matching Metrics to Your Hypothesis
The metric must be the direct outcome of the hypothesis. This sounds obvious but breaks down constantly.
- Better product images → use RPV or CVR (not page views)
- Simpler checkout form reduces abandonment → use checkout completion rate (not homepage CVR)
- Urgency messaging increases conversion speed → use 2-day CVR or same-session CVR (not 7-day CVR)
- Trust badges increase checkout starts → use add-to-cart rate (not revenue)
- Price anchoring increases premium tier selection → use percent premium plan selections (not total CVR)
The mismatch pattern: testing a change in stage X but measuring an outcome from stage Y, with too many confounding steps in between. A homepage test where you measure 30-day LTV as the primary metric will almost never reach significance — there are too many other touchpoints between the homepage and LTV.
Match the metric to the stage. If you're testing a product page, your primary metric should be something that happens on or immediately after the product page.
**Pro Tip:** If your primary metric is more than two funnel steps downstream from the change you're testing, either move the metric closer to the change or accept that you'll need a very large sample size. Chasing distant downstream metrics with small traffic is the fastest way to never reach conclusions.
Common Mistakes
Picking the metric after seeing early results. This is p-hacking. Your brain will rationalize picking the metric that's moving in the direction you hoped. Commit to the metric before launch.
Using the metric that's easiest to measure instead of the metric that matters. Click rate is easy to measure. Revenue is harder. If revenue is what matters, measure revenue.
Ignoring sample size implications of your metric choice. RPV almost always requires more traffic than CVR. "Let's use RPV" without updating the sample size calculation is how you end up calling winners at 70% confidence.
Setting a metric on a downstream event that's too far from the treatment. If your test changes a product page element, measuring 90-day subscription retention as the primary metric means you need months of data and the signal will be swamped by noise.
Not including a business-value metric as at least a secondary metric. Pure CTR optimization that doesn't connect to revenue is how teams optimize themselves into irrelevance.
What to Do Next
- Review your last 5 experiments — for each one, ask: was the primary metric the direct outcome of the hypothesis? Was there a guardrail metric? Was there a revenue metric somewhere in the metric stack?
- Create a metric taxonomy for your site — list the 8-10 most important measurable actions, specify the event type and implementation method for each, and document which funnel stage they map to. Reuse this for every experiment.
- Run a sample size calculation for RPV on your next e-commerce test, not just CVR. If the duration is impractical, either reduce the scope of the test or accept CVR as a proxy with an explicit acknowledgment of the limitation.
- Define your guardrail thresholds in writing before your next experiment launches. What level of page speed degradation is unacceptable? What increase in error rate would cause you to not ship a winner? Document it before you see results.