The Most Uncomfortable Truth in Experimentation

Most A/B test results are less reliable than the teams running them believe. Not because the math is wrong. Not because the tools are broken. Because the humans interpreting the results are subtly, unconsciously biased in ways that inflate positive findings and suppress negative ones.

This is not an accusation. It is human psychology. When you design an experiment, run it, and then analyze the results, you have enormous flexibility in how you interpret the data. Which metric do you focus on? Which time period do you use? Which user segments do you examine? Each of these choices affects the conclusion, and each is influenced by what you want to find.

Pre-registration is the antidote. It is the practice of documenting your experiment plan — hypothesis, metrics, analysis approach, decision criteria — before you see any results. It removes the flexibility that enables unconscious bias.

What Pre-Registration Actually Is

Pre-registration is a written document created before an experiment launches that specifies:

  1. The hypothesis: What you expect to happen and why
  2. The primary metric: The single metric that determines success
  3. Secondary and guardrail metrics: What else you will measure and the thresholds for concern
  4. Sample size and duration: How much data you need and how long you will run
  5. Analysis plan: The specific statistical test you will use
  6. Decision criteria: What result leads to ship, iterate, or kill
  7. Subgroup analyses: Any planned segment breakdowns (and the reason for each)

The document is timestamped and shared with the team before the experiment begins. It cannot be modified after launch. This is the critical requirement — the plan is locked before the data exists.

The Biases Pre-Registration Prevents

P-Hacking

P-hacking is the practice of running multiple statistical tests on the same data until one of them produces a significant result. It is rarely intentional. It usually looks like this:

"The overall result was not significant, so let us check if it was significant for mobile users. Not significant for mobile either. What about new users? Let us look at users who visited at least three times. Here — if we look at returning users on desktop who arrived from organic search, the result is significant."

Every subgroup analysis is an additional statistical test. Each one increases the probability of finding a false positive. Run enough tests and you will always find something that looks significant.
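To see why, treat each subgroup look as its own hypothesis test. Under the simplifying assumption that the tests are independent (in practice subgroups overlap, so the exact figure differs, but the direction holds), the chance of at least one false positive at a 0.05 significance level compounds quickly:

```python
# Probability of at least one false positive across k tests when
# there is no true effect anywhere, assuming independent tests.
def false_positive_rate(k: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha) ** k

for k in (1, 5, 10, 20):
    print(f"{k:2d} tests -> {false_positive_rate(k):.0%}")
# 1 test -> 5%, 5 tests -> 23%, 10 tests -> 40%, 20 tests -> 64%
```

At twenty subgroup cuts, you are more likely than not to "find" a significant result in pure noise.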

Pre-registration prevents p-hacking by specifying which analyses you will run before you see the data. Any analysis not in the pre-registration document is exploratory and cannot be used to declare success.

Metric Shopping

Metric shopping is the practice of choosing the metric that tells the best story after seeing the results.

"The primary metric did not move, but look — time on page increased significantly. Users are more engaged. Let us call this a win."

Pre-registration prevents metric shopping by locking in the primary metric before launch. If the primary metric does not move, the experiment did not succeed, regardless of what other metrics show.

Stopping Rule Violations

Many teams peek at results daily and stop the experiment as soon as the result looks significant. This dramatically inflates the false positive rate because statistical significance fluctuates — a result that looks significant on day three may not be significant on day seven.

Pre-registration specifies the stopping rule: "We will analyze results after reaching our target sample size of X, which we expect to occur around date Y." No peeking. No early stopping based on favorable-looking data.
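The inflation from peeking is easy to demonstrate with an A/A simulation: both arms share the same true conversion rate, so any "significant" result is a false positive by construction. The parameters below (ten daily looks, 200 users per arm per day, a 10% base rate) are illustrative, not a recommendation:

```python
import math
import random

def z_pvalue(x_a, n_a, x_b, n_b):
    """Two-tailed p-value for a two-proportion z-test (normal approximation)."""
    p_pool = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (x_a / n_a - x_b / n_b) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def simulated_fpr(peek, days=10, users_per_day=200, rate=0.10, trials=1000):
    """A/A experiment: both arms have the same true rate, so every
    'significant' result counted here is a false positive."""
    random.seed(42)
    hits = 0
    for _ in range(trials):
        x_a = x_b = n = 0
        significant = False
        for _ in range(days):
            x_a += sum(random.random() < rate for _ in range(users_per_day))
            x_b += sum(random.random() < rate for _ in range(users_per_day))
            n += users_per_day
            if peek and z_pvalue(x_a, n, x_b, n) < 0.05:
                significant = True  # stop early on an apparent "win"
                break
        if not peek:
            significant = z_pvalue(x_a, n, x_b, n) < 0.05
        hits += significant
    return hits / trials
```

Running `simulated_fpr(peek=False)` lands near the nominal 5%, while `simulated_fpr(peek=True)` comes in several times higher — the daily looks alone triple or quadruple the false positive rate.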

Post-Hoc Rationalization

Perhaps the most insidious bias. After seeing the results, teams construct narratives that make the outcome seem inevitable:

"Of course the variant won — users clearly prefer simpler layouts." But if the control had won, the narrative would be: "Of course the control won — users need the detailed information to make decisions."

Post-hoc rationalization is undetectable in the moment because it feels like genuine insight. Pre-registration exposes it by comparing your pre-experiment reasoning to your post-experiment reasoning. If the story changed, something is wrong.

How to Write a Pre-Registration Document

The Template

A practical pre-registration document has seven sections:

Section 1: Background and Motivation

What data or observations motivated this experiment? What is the business context? This section grounds the experiment in evidence, not opinion.

Section 2: Hypothesis

State the hypothesis in a falsifiable format: "We predict that [treatment] will [increase/decrease] [primary metric] by at least [minimum detectable effect] because [behavioral mechanism]."

Section 3: Experiment Design

Describe the treatment and control conditions, the randomization method, and the traffic allocation. Include any exclusion criteria (for example, excluding internal users or users on specific platforms).

Section 4: Metrics

List the primary metric, secondary metrics, and guardrail metrics. For each, specify:

  • Exact definition and calculation
  • Data source
  • Expected baseline value
  • For guardrails, the threshold for violation

Section 5: Sample Size and Duration

Provide the sample size calculation, including the inputs (baseline rate, minimum detectable effect, significance level, power). State the expected duration and the minimum duration (to account for weekly cycles).
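For a conversion-rate primary metric, the standard normal-approximation formula for a two-proportion test gives the per-arm sample size. A sketch — the 4% baseline and 10% relative lift are example inputs, not recommendations:

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided two-proportion z-test
    (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# Example: 4% baseline conversion, detect a 10% relative lift
n = sample_size_per_arm(0.04, 0.10)  # roughly 39,500 users per arm
```

Dividing that per-arm figure by your eligible daily traffic gives the expected duration to write into this section.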

Section 6: Analysis Plan

Specify the statistical test (two-sample t-test, chi-squared test, Mann-Whitney U, etc.), the significance level, and whether you will use one-tailed or two-tailed testing. If you plan any subgroup analyses, list them here with the rationale.

Section 7: Decision Criteria

Define the outcomes:

  • Ship: Primary metric improves by at least [X] with statistical significance and all guardrails pass
  • Iterate: Primary metric trends positive but does not reach significance, or guardrails show minor concerns
  • Kill: Primary metric degrades, or guardrails are violated
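Criteria this explicit can even be written as a function, which leaves no room for after-the-fact argument. The thresholds below are placeholders — use the values from your own document. One simplification to note: this sketch treats any negative lift as "degrades", whereas your plan may require a statistically significant decline before killing:

```python
def decide(lift, p_value, guardrails_ok, min_effect=0.01, alpha=0.05):
    """Map results to a pre-registered decision.
    min_effect and alpha are placeholder thresholds."""
    if not guardrails_ok or lift < 0:
        return "kill"      # guardrail violation or primary metric degrades
    if lift >= min_effect and p_value < alpha:
        return "ship"      # significant improvement above the minimum effect
    return "iterate"       # positive trend, but not conclusive
```

For example, `decide(lift=0.02, p_value=0.01, guardrails_ok=True)` returns `"ship"`, while the same lift with a failed guardrail returns `"kill"`.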

Common Mistakes in Pre-Registration

Too vague: "We expect the variant to perform better" is not a useful hypothesis. Specify the metric, the direction, and the minimum meaningful effect.

Too rigid: Pre-registration should not prevent you from exploring unexpected findings. It should prevent you from claiming unexpected findings as confirmatory evidence. Mark all analyses not in the pre-registration as exploratory.

Missing the mechanism: A hypothesis without a mechanism ("We predict conversion will increase because... we hope so") is not useful. The mechanism is what you learn from, regardless of the outcome.

No decision criteria: Without predefined decision criteria, the team will argue about what the results mean after the fact. This is exactly the kind of flexibility that creates bias.

Objections and Responses

"Pre-registration slows us down"

A thorough pre-registration takes one to two hours. An experiment typically runs for two to four weeks. The pre-registration time is less than one percent of the total experiment duration. The cost of a false positive — shipping a change that does not actually work or causes hidden harm — is far greater.

"We need flexibility to explore the data"

Pre-registration does not prevent exploration. It prevents claiming exploratory findings as confirmatory. You can and should explore the data after the experiment ends. Just label those findings as hypotheses for future experiments, not as conclusions from this one.

"Our team is too small for this kind of process"

Small teams benefit more from pre-registration, not less. With fewer people reviewing experiments, the risk of unchecked bias is higher. A simple pre-registration template takes five minutes to fill out and prevents the most common mistakes.

"Pre-registration is for academic research, not business"

The biases that pre-registration addresses — p-hacking, metric shopping, post-hoc rationalization — affect business experiments just as much as academic ones. The consequences are different (shipping a bad product change instead of publishing a false finding) but equally costly.

Implementing Pre-Registration in Your Team

Start Small

You do not need a perfect process from day one. Start with a minimal pre-registration that covers:

  • The hypothesis (one sentence)
  • The primary metric
  • The stopping rule (sample size or date)
  • The decision criteria (ship, iterate, or kill)

This takes five minutes and prevents the worst biases.
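One lightweight way to make the minimal template concrete is a frozen record; the field names and example values below are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen=True: attributes cannot be changed after creation
class MinimalPreRegistration:
    hypothesis: str        # one falsifiable sentence
    primary_metric: str
    stopping_rule: str     # target sample size or end date
    decision_criteria: str # ship / iterate / kill conditions

plan = MinimalPreRegistration(
    hypothesis="Shorter checkout form increases purchase rate by >= 1pp",
    primary_metric="purchase_conversion_rate",
    stopping_rule="analyze after 40,000 users per arm (~3 weeks)",
    decision_criteria="ship if lift >= 1pp and p < 0.05 with guardrails passing; "
                      "iterate if positive but not significant; kill otherwise",
)
```

The `frozen=True` flag mirrors the core rule: once the plan exists, it cannot be modified.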

Make It Part of the Workflow

Add pre-registration to your experiment template. No experiment launches without a completed pre-registration. Make it a checkbox in your experiment management tool.

Review Pre-Registrations in Team Meetings

Before an experiment launches, have the team review the pre-registration. This serves two purposes: it catches design flaws early and it creates social accountability for following the plan.

Compare Plans to Results

When analyzing results, pull up the pre-registration document. Did the result match the prediction? If not, why? Were the decision criteria followed? Were any analyses added that were not in the plan?

This comparison is where the learning happens. Over time, teams that compare plans to results develop better intuition for what works and more honest assessment of their own biases.

Build an Archive

Store all pre-registration documents in a shared, searchable location. This archive becomes a knowledge base — new team members can read past pre-registrations to understand how experiments were designed, what was expected, and what actually happened.

The Culture Shift

Pre-registration is ultimately a cultural practice, not a technical one. It embodies a specific belief: that honest assessment of experimental results is more valuable than flattering interpretation of ambiguous data.

Teams that adopt pre-registration tend to develop a healthier relationship with experiment results. Losses become less painful because they are acknowledged as learning. Wins become more credible because they passed a higher bar. And the overall quality of decision-making improves because decisions are based on reliable evidence rather than optimistic interpretation.

This is the real payoff. Not just better experiments, but better decisions.

FAQ

Can I update the pre-registration after the experiment starts?

Only if you document the change, the reason for it, and the date. Major changes (like switching the primary metric) effectively invalidate the pre-registration. If you need to make a major change, consider stopping the experiment and starting fresh.

What if the pre-registered analysis is not the best analysis for the data?

Report both. Run the pre-registered analysis first — this is your confirmatory result. Then run the better analysis and label it as exploratory. Use the exploratory result to inform future experiments.

How detailed should the pre-registration be?

Detailed enough that someone else could run the analysis without asking you questions. If the analysis plan is ambiguous, it is not detailed enough.

Does pre-registration guarantee that results are reliable?

No. Pre-registration prevents several common biases but does not address others (like poorly implemented randomization or tracking errors). It is one important component of rigorous experimentation, not a complete solution.

Written by Atticus Li

Revenue & experimentation leader — behavioral economics, CRO, and AI. CXL & Mindworx certified. $30M+ in verified impact.