The Test Nobody Wants to Run
Here's a conversation I've had at least a dozen times:
"We keep seeing our A/B test winners fail to hold up post-launch. Significant in Optimizely, but no real-world revenue impact after we ship."
"Did you run an A/A test before you started your program?"
"What's an A/A test?"
That's the gap. A/A tests are the unglamorous infrastructure work of experimentation. They don't generate wins. They don't impress stakeholders. They just make sure that when you do generate a win, you can trust it.
After running programs where A/A tests caught serious data pipeline issues that would have sent us chasing phantom wins for months, I now treat them as non-negotiable before launching any new testing program — and quarterly thereafter.
What an A/A Test Actually Is
An A/A test pits two identical versions of a page against each other. Users are split 50/50. Both groups see exactly the same experience. The test runs for the same duration you'd run a real A/B test.
The expected result: statistical insignificance. No winner. No significant lift in either direction.
If your A/A test shows a significant result — if Optimizely declares one "variant" a winner even though both variations are identical — something is wrong with your setup. That wrong thing would have affected every subsequent A/B test you ran.
**Pro Tip:** An A/A test is not something you run once and forget. Run it when you first set up your Optimizely program, again after any significant changes to your site architecture, tracking code, or analytics stack, and on a semi-regular basis (quarterly for high-volume programs) as an ongoing health check.
The 5 Things A/A Tests Catch That You'd Never Find Otherwise
1. Sample Ratio Mismatch (SRM). SRM occurs when the actual traffic split between variations doesn't match the configured split. If you set a 50/50 split and your A/A test shows 58% of visitors in variation A and 42% in variation B, something in your implementation is biasing the bucketing. This bias will also affect your real A/B tests, systematically over-representing one population in one variant.
Common SRM causes: bot traffic being bucketed unevenly, an existing redirect or caching layer interfering with the randomization, or a poorly implemented custom bucketing function overriding Optimizely's default.
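A practical way to check for SRM outside the platform UI is a chi-square goodness-of-fit test against the configured split. The sketch below uses only the Python standard library (the chi-square statistic with one degree of freedom is converted to a p-value via the normal distribution); the visitor counts and the 0.001 alpha, a commonly used SRM threshold, are illustrative assumptions, not Optimizely output.

```python
import math
from statistics import NormalDist

def check_srm(visitors_a, visitors_b, expected_split=0.5, alpha=0.001):
    """Flag sample ratio mismatch against the configured split."""
    total = visitors_a + visitors_b
    exp_a = total * expected_split
    exp_b = total * (1 - expected_split)
    stat = (visitors_a - exp_a) ** 2 / exp_a + (visitors_b - exp_b) ** 2 / exp_b
    # chi-square with 1 df: P(X > stat) = 2 * (1 - Phi(sqrt(stat)))
    p_value = 2 * (1 - NormalDist().cdf(math.sqrt(stat)))
    return p_value < alpha, p_value

# A 58/42 split on 10,000 visitors is far outside random variation:
srm, p = check_srm(5800, 4200)    # srm is True
# while a 50.2/49.8 split is expected bucketing noise:
ok, p2 = check_srm(5020, 4980)    # ok is False
```

A very low alpha is deliberate here: an SRM alert should fire only on deviations that random assignment essentially cannot produce.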
2. Flicker or FOOC (Flash of Original Content). If there's a visible flash of the original page before the variant renders, users who experience it are briefly exposed to both experiences. An A/A test won't catch flicker directly: if both variations flicker at the same rate, the A/A will look fine. What it will catch is asymmetric timing: if one variation's code takes longer to apply (because of JavaScript execution timing), you may see systematic conversion differences even between "identical" variants.
3. Tracking and instrumentation errors. If your conversion event fires twice on some percentage of page views (a common bug with single-page applications where the page-load event fires on navigation without a full page reload), your conversion rate will be inflated. An A/A test with both variations showing a 5.8% conversion rate on a page that you know converts at roughly 3% tells you your tracking is double-counting. Any A/B test you run in this environment will show inflated and unreliable conversion rates.
4. Interaction effects with other running experiments or personalizations. If you have Optimizely personalizations running alongside your experiments, or if you're running experiments on the same URL with different targeting conditions, they can interact. An A/A test that shows unexpected segmentation differences (mobile converts at 8% in both variants, but desktop shows a significant difference between the identical variants) is a signal that something is interfering with the desktop user population specifically.
5. Audience or targeting configuration errors. If your Optimizely targeting conditions are misconfigured — wrong URL match, wrong audience definition, or a JavaScript condition that evaluates differently than intended — your A/A test will show it through unexpected segmentation patterns or implausibly high/low visitor counts compared to your analytics baseline.
**Pro Tip:** Cross-reference your A/A test visitor counts against your Google Analytics or Segment page views for the same URL and time period. A 10-15% discrepancy is normal (Optimizely counts unique experiment participants; analytics tools count all sessions). A 40%+ discrepancy means your experiment is only capturing a fraction of your traffic, and your A/B test results won't generalize to your full audience.
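The cross-reference above can be reduced to a small helper. This is a sketch of the rule of thumb in this section, not an API integration: both counts are numbers you pull manually from Optimizely and your analytics tool, and the 15% / 40% cutoffs mirror the thresholds stated above.

```python
def traffic_discrepancy(optimizely_visitors, analytics_sessions):
    """Discrepancy as a fraction of the analytics baseline."""
    return abs(analytics_sessions - optimizely_visitors) / analytics_sessions

def classify_discrepancy(fraction):
    if fraction <= 0.15:
        return "normal"        # unique participants vs. all sessions
    if fraction < 0.40:
        return "investigate"   # larger than expected; check targeting
    return "critical"          # experiment sees a fraction of traffic

# 42,500 experiment participants vs. 50,000 analytics sessions is 15%:
level = classify_discrepancy(traffic_discrepancy(42_500, 50_000))
```

Logging this number from your first clean A/A test gives you the "expected baseline variance" referenced at the end of this article.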
What "Statistically Significant" in an A/A Test Means (It's a Red Flag)
At a 90% statistical significance threshold, you expect roughly 10% of tests to show a false positive purely by random chance. Run 20 A/A tests and about two of them will show "significant" results even though nothing is different.
So seeing significance in an A/A test is not automatically a crisis — but it is a signal that demands investigation. The question is: is this within the expected false positive rate, or is it symptomatic of a systematic problem?
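You can see the expected false-positive rate directly by simulating many A/A tests. The sketch below assumes a two-proportion z-test at a 90% threshold on identical 3% conversion rates; the sample size, run count, and seed are illustrative, and the point is only that roughly one run in ten comes out "significant" with nothing different.

```python
import math
import random
from statistics import NormalDist

def two_sided_p(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test p-value for conversions vs. visitors."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(7)
n, rate, runs = 5_000, 0.03, 200
false_positives = sum(
    two_sided_p(sum(random.random() < rate for _ in range(n)), n,
                sum(random.random() < rate for _ in range(n)), n) < 0.10
    for _ in range(runs)
)
# Expect roughly 10% of these identical-vs-identical runs to be "significant"
```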
The investigation protocol:
- Check the absolute magnitude of the apparent lift. A "significant" 0.3% absolute lift on a baseline of 2% is probably noise. A "significant" 2% absolute lift on a 2% baseline means one variation is showing double the conversion rate of an identical twin — that's a systematic issue.
- Check whether the significance persists or fluctuates. Random false positives tend to approach and recede from the significance threshold. A systematic bias tends to show consistent, directional significance that doesn't resolve as the test accumulates data.
- Segment by device, browser, traffic source, and new vs. returning visitors. If the "significant" result is concentrated in one segment, that segment has a data problem.
**Pro Tip:** You actually need *less* traffic for an A/A test than for an A/B test. Because you're expecting no effect (rather than trying to detect a specific effect size), you just need enough data to validate that the visitor counts, conversion rates, and confidence intervals look plausible. As a rule of thumb, 5,000 visitors per variation is a reasonable floor for a basic A/A test. For detecting subtle SRM issues, 20,000 per variation is more reliable.
Sample Size for A/A Tests
This is where most guidance gets it wrong. For an A/B test, your sample size is determined by the minimum effect you want to detect. For an A/A test, you're not trying to detect an effect — you're trying to confirm there isn't one.
The practical approach: set your A/A test sample size to match what you'd use for your smallest anticipated A/B test. If your most sensitive A/B tests require 10,000 visitors per variation, run your A/A test to 10,000 per variation. This validates that your platform performs correctly at the traffic volume you'll actually use.
Duration rule: run your A/A test for the same minimum duration you'd use for a real test — at least seven days, spanning at least one full business cycle. This validates that your bucketing and tracking remain stable across time, not just at a single point.
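To size the A/A run to match your smallest planned A/B test, you can use the standard two-proportion sample-size formula. This is a stdlib sketch, not Optimizely's Stats Engine calculation; the 3% baseline, 0.5 percentage point minimum detectable effect, 90% significance, and 80% power are illustrative inputs.

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline, mde_abs, alpha=0.10, power=0.80):
    """Visitors per variation to detect an absolute lift of mde_abs."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided threshold
    z_beta = NormalDist().inv_cdf(power)
    p1, p2 = baseline, baseline + mde_abs
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
         / mde_abs ** 2)
    return math.ceil(n)

# Example: 3% baseline, detect a 0.5 pp absolute lift
n = sample_size_per_variation(0.03, 0.005)
```

Whatever number this produces for your most sensitive planned test is the per-variation target to run the A/A test to.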
Step-by-Step: Setting Up an A/A Test in Optimizely Web Experimentation
Step 1: Create a new A/B test experiment. In Optimizely Web Experimentation, go to Experiments > Create New Experiment > A/B Test.
Step 2: Set your URL targeting. Target the URL where you do most of your conversion testing — typically your homepage, a key landing page, or a high-traffic product page.
Step 3: Create the variant — but don't change anything. Add a variation. In the visual editor, make absolutely zero changes. The variant should be a perfect copy of the control. If you accidentally apply a change, undo it before saving.
Step 4: Set traffic allocation to 100%, split 50/50. You want the full audience to participate so you can validate your bucketing against your analytics baseline.
Step 5: Add your standard conversion metrics. Use the exact same events you'd use in a real A/B test — your primary CTA click, form submission, or purchase event. This validates the tracking, not just the bucketing.
Step 6: Set distribution mode to Manual. Do not use Stats Accelerator or Multi-Armed Bandit modes for an A/A test. You want static 50/50 throughout.
Step 7: Run for your planned minimum duration. At least seven days. Don't stop early even if results look suspicious — let it run to your target sample size.
Step 8: Interpret results.
- Both variants show similar conversion rates, no significant winner, confidence intervals overlap substantially: your setup is clean. Document this as a passed A/A test and proceed with your program.
- One variant shows statistically significant lift: investigate using the protocol above. Don't launch real experiments until you've identified and resolved the cause.
- Visitor counts are very different from expected: check your analytics baseline, look for SRM causes (bot traffic, redirects, caching).
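The "clean result" case in step 8 can be checked on your own exported numbers. The sketch below computes a normal-approximation 90% confidence interval per variant and tests for overlap; the conversion counts are illustrative, and this is a simpler check than Optimizely's sequential Stats Engine, so treat it as a sanity check, not a replacement.

```python
import math
from statistics import NormalDist

def conversion_ci(conversions, visitors, confidence=0.90):
    """Normal-approximation confidence interval for a conversion rate."""
    rate = conversions / visitors
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(rate * (1 - rate) / visitors)
    return rate - margin, rate + margin

def intervals_overlap(ci_a, ci_b):
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Two "identical" variants at roughly 3% conversion:
ci_a = conversion_ci(295, 10_000)
ci_b = conversion_ci(310, 10_000)
clean = intervals_overlap(ci_a, ci_b)   # expect True for a healthy A/A
```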
**Pro Tip:** Save your A/A test results as a baseline benchmark. Your control conversion rate from a clean A/A test is the most reliable baseline you have — it was measured with the exact experimental infrastructure you'll use for real tests, under the same bucketing conditions. Use it as your baseline input for the sample size calculator on subsequent experiments.
When to Run an A/A Test
Mandatory situations:
- When you first set up Optimizely on a new site or domain
- After any major technical change: new tag manager implementation, migration to a new analytics platform, SPA framework upgrade, or CDN configuration change
- After a long testing hiatus (3+ months without any experiments)
- After any Optimizely snippet update or implementation change
Highly recommended:
- Quarterly health checks for active, high-volume testing programs
- Before launching a test series on a page you've never tested before
- Any time you see unexplained win-rate inflation (winning more than 40% of tests is a red flag — industry average is 10-30%)
Not necessary:
- Before every single experiment (once you have a clean A/A baseline established, you don't need to re-run before every test)
- As a substitute for QA on your individual test implementations — A/A tests validate the platform; QA validates the specific experiment
Common Mistakes
Skipping the A/A test because "we don't have time." The same team will spend two weeks debugging why their A/B winner didn't hold up post-launch. The A/A test takes less time than the post-launch investigation.
Making changes to the A/A variant "just to make sure it's set up right." If the variant has any changes, it's not an A/A test — it's an A/B test with a trivial change. The whole point is zero changes.
Stopping the A/A test early because it's "obviously inconclusive." Don't. You need the full sample to validate SRM and tracking consistency over time, not just at a single moment.
Interpreting a clean A/A as proof that everything is perfect. A clean A/A test means your platform and tracking infrastructure are functioning correctly for the audience and metrics you tested. It does not validate that every possible metric or every possible audience segment is clean. Investigate any metric or segment you care about before relying on it in a high-stakes experiment.
Using A/A tests to "warm up" the experiment for an upcoming A/B test. This doesn't work and will contaminate your A/B data. Users who were bucketed during the A/A period will have their cookies set. If you then launch an A/B test on the same URL with new code, bucketed users may see a mix of experiences. Always create a new experiment for your actual A/B test.
What to Do Next
- If you've never run an A/A test on your current Optimizely implementation, schedule one this week. Use the step-by-step setup above. It will take 30 minutes to set up and 7-14 days to run.
- Cross-reference the visitor count from your A/A test against your analytics platform. Document the discrepancy percentage as your expected baseline variance.
- Review our guide on How Long Should You Run an A/B Test? to apply the same duration discipline to your A/A tests.
- After your A/A test completes cleanly, note the measured baseline conversion rates in your experiment log. This is your anchor for calculating sample sizes on all future tests.