A bare reported lift is uninterpretable. The same "+25% lift" can mean a clean win with tight CI [22%, 28%], a false positive with CI [-2%, 52%], or a broken randomization producing artifactual results. Without the diagnostic context, you cannot tell which one you are looking at. This guide walks through the 5-element diagnostic checklist every new testing team should standardize, with definitions, formulas, and a copy-paste template.

What You'll Learn

  • The 5 elements that should appear on every test readout (and why each matters)
  • Plain-English definitions of power analysis, MDE, CI, SRM, and segment cuts
  • Quick-reference formulas you can compute in a spreadsheet
  • A copy-paste readout template you can adopt this week
  • The single hour of work that upgrades your team's reporting permanently

Quick Stats Reference

The 5 elements every test readout should include:

1. Pre-test power analysis — required sessions per arm to detect your MDE.
2. MDE (minimum detectable effect) — the smallest effect size you'd consider a meaningful win.
3. Confidence interval — the range of plausible true effects given the observation.
4. SRM check — chi-squared test that traffic was split as configured.
5. Segment cuts — new vs returning, mobile vs desktop, traffic source.

Why this matters: Decisions made on bare point estimates are decisions made on incomplete information. The cost of producing the diagnostic table is one afternoon of standardization. The benefit compounds over every test the team runs from that point forward.

Why This Matters For New Testing Teams

Most consumer A/B testing tools display a result as a single number with a directional indicator. New analysts learn that this is what a test result _is_. It is not. A test result is a structured statistical claim with multiple components, and the bare point estimate is only one of them.

When a team standardizes the diagnostic table, three things change:

Decisions get better. A "+25% lift" with a CI that includes zero is consistent with no effect. The same "+25% lift" with a tight CI well above zero is a real win. The team facing the first situation should be making a different decision than the team facing the second. Without the CI, both teams are looking at the same dashboard and reading the same number.

Stakeholder communication gets better. Confidence intervals translate directly to risk language that executives understand. "Our point estimate is 8%, with a 95% CI of [3%, 13%]" is a finding leadership can act on. "We saw 8%" is a finding leadership tends to over-interpret.

Catching broken tests gets faster. Most randomization failures — the kind that produce spurious results — are detectable by an SRM check that takes one cell in a spreadsheet. Most teams have shipped at least one "winner" that was actually a broken randomization. The SRM check would have caught it.

---

Element 1: Pre-Test Power Analysis

Quick definition: _Statistical power_ = the probability that a test will detect a real effect of a given size. _Power analysis_ = the pre-test calculation of how many sessions per arm are needed to achieve a target power level (typically 80%).

What it tells you. How many sessions per arm you need to reliably detect your chosen MDE. Without this calculation, your test is implicitly powered for whatever effect happens to materialize.

The four inputs:

| Input                    | Typical value            | Notes                                                |
| ------------------------ | ------------------------ | ---------------------------------------------------- |
| Baseline conversion rate | Your actual baseline     | Pull from analytics; use a representative window     |
| MDE                      | 5%, 10%, or 20% relative | Smaller MDE → more sessions required                 |
| Confidence               | 95% (alpha = 0.05)       | Standard for A/B testing                             |
| Power                    | 80% (beta = 0.20)        | Standard; means 20% chance of missing a real effect  |

Quick reference: rough sample sizes per arm at 2.5% baseline CVR:

| MDE (relative) | Sessions per arm | Days at 5,000 daily sessions split 50/50 |
| -------------- | ---------------- | ---------------------------------------- |
| 5%             | ~250,000         | ~100 days                                |
| 10%            | ~64,000          | ~26 days                                 |
| 15%            | ~29,000          | ~12 days                                 |
| 20%            | ~17,000          | ~7 days                                  |

The numbers above are illustrative — use a real sample-size calculator with your actual baseline CVR.

Where to compute it. Most A/B testing tools have a built-in calculator. Free options include Evan Miller's online calculator, the pwr package in R, or statsmodels.stats.power in Python. A spreadsheet implementation takes about ten minutes to build and is worth it once for your team's permanent template.
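
For teams working in Python, here is a minimal sketch using the statsmodels option mentioned above. The baseline and MDE are the illustrative values from the table, not prescriptions.

```
# Sessions per arm needed to detect a 10% relative MDE at a 2.5% baseline
# (alpha = 0.05 two-sided, power = 0.80). Inputs are illustrative.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline_cvr = 0.025
relative_mde = 0.10
variant_cvr = baseline_cvr * (1 + relative_mde)

# Cohen's h effect size for the two proportions
effect_size = proportion_effectsize(variant_cvr, baseline_cvr)

# Solve for the required sample size per arm
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, ratio=1.0,
    alternative="two-sided",
)
print(f"Sessions per arm: {n_per_arm:,.0f}")  # roughly 64,000 for these inputs
```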

Common mistake: Running a test, observing a flat result, and concluding "the change didn't work." Without a pre-test power analysis, you don't know if the test was capable of detecting the effect you cared about. A flat result on an underpowered test tells you nothing.

---

Element 2: Minimum Detectable Effect (MDE)

Quick definition: _MDE_ = the smallest effect size your test is designed to reliably detect. Pre-committing to an MDE before launch prevents post-hoc rationalization of marginal results as wins.

What it tells you. What threshold counts as a meaningful win for the business. Setting this _before_ the test launches is one of the highest-impact discipline moves available to a new team.

How to choose an MDE:

| Test type                            | Typical MDE (relative) | Reasoning                                                        |
| ------------------------------------ | ---------------------- | ---------------------------------------------------------------- |
| Conversion rate test on product page | 10%                    | Below this, the change isn't worth the engineering cost to ship  |
| Pricing or offer test                | 5-10%                  | High-traffic surfaces; smaller effects are still meaningful      |
| Post-purchase upsell                 | 15-20%                 | Lower-traffic surfaces; need larger effects to justify sample    |
| High-stakes redesign                 | 5%                     | Justifies a longer test for finer detection                      |

The right MDE for a given test is the smallest effect that would change your business decision. If a 3% lift would not change anything you do, your MDE is higher than 3%. Set it to the threshold that would.

Common mistake: Setting MDE = "anything positive." This is not a valid MDE — it is a path to chasing noise. Any test will eventually drift positive on a noisy day; declaring that as a win is the peeking problem under another name.

The pre-commitment habit. Document your MDE in the test plan before launch. Put it next to the hypothesis. When the test ends, evaluate against the pre-committed MDE — not against whatever lift happened to materialize.

---

Element 3: Confidence Interval

Quick definition: _Confidence interval (CI)_ = the range of plausible values for the true effect, given your observation. A 95% CI means: if you ran the test infinitely many times, 95% of the resulting intervals would contain the true effect.

What it tells you. How much uncertainty surrounds your point estimate. The CI is the single most important element of a test readout, and it is the most commonly omitted.

Why it matters more than the point estimate:

| Scenario A                       | Scenario B                               |
| -------------------------------- | ---------------------------------------- |
| Lift: +25%                       | Lift: +25%                               |
| 95% CI: [22%, 28%]               | 95% CI: [-2%, 52%]                       |
| → Real win, ship with confidence | → Consistent with no effect, do not ship |

Both scenarios show "+25%" on the dashboard. They are different findings. Without the interval, you cannot distinguish them.

Quick formula (for a difference in proportions):

```
CI = (p_B - p_A) ± 1.96 × √[(p_A × (1-p_A) / n_A) + (p_B × (1-p_B) / n_B)]
```

Where p_A, p_B are the observed conversion rates and n_A, n_B are the sample sizes. Implement once in a spreadsheet and your team has CIs forever.
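
Here is the same formula as a minimal Python sketch; the conversion counts are hypothetical placeholders.

```
import math

# Hypothetical counts: conversions and sessions for control (A) and variant (B)
conv_A, n_A = 1250, 50000
conv_B, n_B = 1375, 50000

p_A, p_B = conv_A / n_A, conv_B / n_B
diff = p_B - p_A

# Standard error of the difference in proportions; 95% CI via the normal approximation
se = math.sqrt(p_A * (1 - p_A) / n_A + p_B * (1 - p_B) / n_B)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"Absolute lift: {diff:+.2%}, 95% CI: [{ci_low:+.2%}, {ci_high:+.2%}]")
print(f"Relative lift: {diff / p_A:+.1%}")
```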

Tip for new analysts: If your A/B testing tool doesn't surface CIs, compute them manually. The formula above takes one cell. Add it to your team's reporting template and never run a test without it again.

How to read a CI:

  • CI excludes zero on the positive side → likely a real positive effect.
  • CI excludes zero on the negative side → likely a real negative effect.
  • CI includes zero → result is consistent with no effect; do not declare a winner.
  • CI is very wide → the test was underpowered; the result is uninformative regardless of point estimate.

---

Element 4: Sample Ratio Mismatch (SRM) Check

Quick definition: _Sample ratio mismatch (SRM)_ = a discrepancy between the configured traffic split (typically 50/50) and the actual observed split. SRM indicates that randomization is broken — and a broken randomization invalidates everything else in the test.

What it tells you. Whether your test is even valid before you read the lift. A test with an SRM is not a real A/B test — it is a comparison of two non-equivalent groups, and the observed lift cannot be causally attributed to the variant.

How to check it. A one-line chi-squared test:

```
expected_per_arm = total_users / 2
chi_squared = ((users_A - expected_per_arm)² / expected_per_arm) + ((users_B - expected_per_arm)² / expected_per_arm)
```

If the resulting p-value is below 0.001, you have an SRM and should not interpret the test results until the cause is identified.
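
A minimal sketch that also converts the statistic to a p-value, using scipy; the user counts below are hypothetical.

```
from scipy.stats import chisquare

# Hypothetical observed users per arm under a configured 50/50 split
users_A, users_B = 50412, 49100
total_users = users_A + users_B

# Chi-squared goodness-of-fit test against the expected 50/50 allocation
stat, p_value = chisquare(f_obs=[users_A, users_B],
                          f_exp=[total_users / 2, total_users / 2])

if p_value < 0.001:
    print(f"SRM detected (p = {p_value:.2e}); do not interpret the lift")
else:
    print(f"No SRM detected (p = {p_value:.3f})")
```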

Common causes of SRM:

| Cause                                 | What to look for                      |
| ------------------------------------- | ------------------------------------- |
| Bot traffic skewing one arm           | Check user agent / known bot patterns |
| Variant has higher load time / errors | Check page load metrics by arm        |
| Caching layer mis-routing             | Check CDN configuration               |
| Tracking pixel firing inconsistently  | Check event volume by arm             |
| Geographic targeting accident         | Check geo distribution by arm         |

Tip for new teams: Run the SRM check before reading the lift. If SRM is flagged, the lift is meaningless until you identify and fix the cause. Most teams ship at least one "winner" per year that was actually a broken randomization. The SRM check would have caught it.

---

Element 5: Segment Cuts

Quick definition: _Segment cuts_ = the test result broken down by user-level dimensions like new vs returning, mobile vs desktop, traffic source, geography, or device type. A headline lift can be uniform across segments or concentrated in one segment — these are different results that call for different rollout decisions.

What it tells you. Whether the variant works for the audience you actually care about, or whether the headline lift is being driven by a subset that may not generalize.

The minimum segment cuts for any DTC (direct-to-consumer) test:

| Segment                                  | Why to cut it                                                                      |
| ---------------------------------------- | ---------------------------------------------------------------------------------- |
| New vs returning users                   | Often respond differently to social proof, discount messaging, urgency             |
| Mobile vs desktop                        | Different funnels, different conversion drivers                                    |
| Traffic source (paid / organic / direct) | Intent differs systematically by source                                            |
| Day-of-week                              | Catch peeking-driven artifacts; verify the win wasn't driven by one anomalous day  |

How to read segment cuts:

| Pattern                                           | Interpretation                                                    |
| ------------------------------------------------- | ----------------------------------------------------------------- |
| Lift uniform across segments                      | Likely real, generalizable; ship with confidence                  |
| Lift concentrated in one segment, flat elsewhere  | Heterogeneous effect; segment-specific rollout may be appropriate |
| Lift positive in one segment, negative in another | Mixed effect; do not ship without further investigation           |
| Lift driven by one or two anomalous days          | Likely a peeking artifact; replicate before shipping              |

Common mistake: Reporting only the headline aggregate lift. The aggregate hides interaction effects, segment-specific responses, and time-driven artifacts. The 30-second cost of producing segment cuts pays for itself the first time it catches a non-generalizable "win."
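
For teams that keep event-level data in pandas, here is a minimal sketch of producing a segment-cut table. The column names and synthetic data are hypothetical, and the CI helper mirrors the formula from Element 3.

```
import math
import numpy as np
import pandas as pd

# Hypothetical event-level data: one row per session
rng = np.random.default_rng(0)
n = 20000
df = pd.DataFrame({
    "arm": rng.choice(["control", "variant"], size=n),
    "segment": rng.choice(["new", "returning"], size=n),
    "converted": rng.random(n) < 0.03,
})

def lift_with_ci(g):
    """Relative lift and 95% CI (absolute) for one segment."""
    a = g.loc[g["arm"] == "control", "converted"]
    b = g.loc[g["arm"] == "variant", "converted"]
    p_a, p_b, n_a, n_b = a.mean(), b.mean(), len(a), len(b)
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return pd.Series({"relative_lift": diff / p_a,
                      "ci_low_abs": diff - 1.96 * se,
                      "ci_high_abs": diff + 1.96 * se})

# One row per segment: lift and CI side by side
# (repeat the groupby for device, traffic source, day-of-week, etc.)
print(df.groupby("segment").apply(lift_with_ci))
```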

---

The Copy-Paste Test Readout Template

Standardize this template once for your team. Apply it to every test from that point forward.

```
TEST NAME: [name]
DATES: [start] to [end]
HYPOTHESIS: [pre-registered hypothesis]
PRIMARY METRIC: [metric]
PRE-COMMITTED MDE: [X% relative]
PRE-COMMITTED SAMPLE TARGET: [N sessions per arm]
ACTUAL SAMPLE: [N_A in control, N_B in variant]

DIAGNOSTIC CHECKS:
  • SRM check: [pass / fail with details]
  • Test ran for: [N days, M complete weekly cycles]

PRIMARY RESULT:
  • Control CVR: [X% with N sessions]
  • Variant CVR: [Y% with N sessions]
  • Observed lift: [Z%]
  • 95% CI: [lower%, upper%]
  • Decision: [ship / kill / replicate first]

SEGMENT CUTS:
  • New users: [lift, CI]
  • Returning users: [lift, CI]
  • Mobile: [lift, CI]
  • Desktop: [lift, CI]
  • Top traffic source: [lift, CI]

INTERPRETATION:
[1-2 paragraphs: what does this mean, what's the recommended next step]

REPLICATION STATUS:
[planned / completed / not required]
```

The template takes one afternoon to standardize. It changes the quality of every decision from that point forward.

---

Quick Tips for New Testing Teams

  • Standardize the template before your next test. One hour of work, permanent payoff.
  • Compute CIs manually if your tool doesn't surface them. The formula is one spreadsheet cell.
  • Run the SRM check first. If randomization is broken, the lift is meaningless.
  • Pre-commit your MDE before launch. Document it next to the hypothesis. Evaluate against it at the end.
  • Always cut segments. New vs returning, mobile vs desktop, traffic source. Catches non-generalizable wins.
  • Read segment cuts before shipping. A lift driven by one anomalous segment is not a generalizable win.

---

Tips for CRO Managers

  • Make the diagnostic table mandatory. No test result is presented to leadership without the full table. This is the single highest-leverage policy change a new manager can make.
  • Train new analysts on the template in week 1. It saves months of false-positive recovery later.
  • Audit your existing test history. Pull the last 10 "wins." How many would survive a full diagnostic table? The ones that wouldn't are candidates for replication.
  • Add SRM monitoring to your test platform. Most don't surface it by default. The one-time engineering cost is small and the failure mode it catches is severe.
  • Connect the diagnostic table to rollout sizing. A win with a tight CI gets a different rollout plan than a win with a wide CI. Document the linkage explicitly so that decisions are reproducible.

---

FAQ

My A/B testing tool doesn't surface CIs or SRM checks. What do I do?

Compute them in your reporting layer. A CI is one cell in a spreadsheet given the point estimates and sample sizes. An SRM check is a one-line chi-squared test. Most teams find that adding these manually takes one afternoon and changes the quality of every subsequent decision.

Is the diagnostic table overkill for small teams?

No — it is _more_ important for small teams. A small team has less margin for shipping false positives, and the diagnostic table is the cheapest possible way to catch them. The full table takes about 5 minutes per test once standardized.

How many segment cuts is too many?

The risk with too many segment cuts is multiple-comparisons inflation: if you cut 20 segments, one will look "significant" by chance even on tests with no underlying effect. Stick to the 4-5 segments most relevant to your business (new vs returning, mobile vs desktop, top 1-2 traffic sources). For deeper segmentation, apply Bonferroni correction or treat segment-level findings as exploratory.
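
A one-line illustration of the correction, with a hypothetical segment count:

```
# Bonferroni: divide the significance threshold by the number of segment comparisons
alpha = 0.05
k_segments = 12                      # hypothetical number of cuts
alpha_adjusted = alpha / k_segments  # ≈ 0.0042; flag a segment only below this
print(f"Adjusted alpha for {k_segments} segment cuts: {alpha_adjusted:.4f}")
```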

What if SRM is flagged on a winning test?

Do not ship. Investigate the cause. Until the SRM is resolved, the observed lift cannot be causally attributed to the variant. Common causes include bot traffic, variant-specific load issues, caching mis-routing, or tracking pixel inconsistencies.

Should I report all five elements even on losing tests?

Yes. The diagnostic table is the standard format for _every_ test readout. Losing tests with the full table are more informative than winning tests with the bare number — they tell you what didn't work and why, with the context to interpret the result.

What about Bayesian credible intervals?

Equivalent to frequentist CIs for practical purposes when the prior is uninformative. If your tool reports Bayesian credible intervals, treat them the same way — they are the range of plausible true effects given the observation. The interpretive habit is the same.

---

Standardize Your Reporting This Week

For new testing teams, standardizing the diagnostic table is one of the highest-leverage process changes available. The cost is one afternoon. The benefit compounds over every test the team runs. Most teams that adopt the discipline trace their measurable improvement in decision quality to this single change.

I built GrowthLayer to surface the full diagnostic table by default. Every test readout includes power analysis, pre-committed MDE, confidence intervals, automated SRM checks, and standard segment cuts — without the analyst having to compute them manually. The team's reporting template is enforced at the platform level.

To find experimentation roles where this discipline is the operating standard, explore open positions on Jobsolv.

Or book a consultation for help standardizing your team's diagnostic reporting template, or for a methodological audit of an existing program.

Atticus Li

Experimentation and growth leader. Builds AI-powered tools, runs conversion programs, and writes about economics, behavioral science, and shipping faster.