The Most Common Experimentation Interview Question
Every experimentation interview includes some version of this question: "Design an A/B test for X." The X changes — a new checkout flow, a recommendation algorithm, a pricing page — but the underlying evaluation is the same. The interviewer wants to see structured thinking that connects business goals to experimental methodology.
Most candidates fail this question not because they lack knowledge but because they lack structure. They jump to variant design without defining what success looks like. They discuss statistical significance without connecting it to business impact. They describe the test mechanics without addressing the practical constraints.
Here is the framework that consistently impresses.
Step 1: Clarify the Business Context
Before you design anything, ask clarifying questions. This is not stalling. It is the most important part of the answer.
Questions to ask:
- What is the business goal? Not "improve the page" but specifically what outcome matters. Revenue? Engagement? Retention? The answer determines your primary metric.
- Who are the users? Understanding the user segment helps you anticipate behavioral patterns and potential confounds.
- What does the current experience look like? You need a baseline to design against.
- Are there any constraints? Traffic volume, technical limitations, legal requirements, or timeline pressure all shape the experimental design.
This step demonstrates product thinking. You are showing the interviewer that you understand experimentation serves business goals, not the other way around.
Step 2: Define the Hypothesis
State your hypothesis explicitly using this structure: "If we [specific change], then [specific metric] will [improve/decrease] because [behavioral mechanism]."
The behavioral mechanism is what separates good answers from great ones. Anyone can say "a simpler form will convert better." Strong candidates explain why: "Reducing form fields decreases the cognitive effort required to complete the action, which matters particularly for users in a browsing mindset who have not yet committed to a high-effort interaction."
Your hypothesis should be falsifiable. If the test cannot disprove your hypothesis, it is not a useful hypothesis.
Step 3: Choose Your Metrics
Define three categories of metrics:
Primary metric: The single metric that determines whether the test succeeds. Choose one. Only one. This is the metric your ship-or-revert decision depends on.
Secondary metrics: Additional metrics that provide context. These help you understand why the primary metric moved and whether the change has broader effects. Do not use secondary metrics to override a negative primary result.
Guardrail metrics: Metrics that should not degrade. These protect against unintended negative consequences. For example, if you are testing a change to increase sign-ups, a guardrail metric might be seven-day retention — you want more sign-ups, but not at the cost of attracting users who immediately churn.
The metric hierarchy shows the interviewer you think about experiments holistically, not just whether a number went up.
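If the interviewer asks how you would operationalize this, the hierarchy can be sketched as a pre-registered metric spec. The names and thresholds below are illustrative examples, not a standard schema:

```python
# Illustrative only: a pre-registered metric spec for a hypothetical
# sign-up flow test. One primary metric; secondaries give context;
# guardrails carry explicit degradation limits.
experiment_metrics = {
    "primary": {"name": "signup_conversion_rate", "direction": "increase"},
    "secondary": [
        {"name": "form_completion_time", "direction": "decrease"},
        {"name": "recommendation_ctr", "direction": "increase"},
    ],
    "guardrails": [
        # Must not drop more than 1% relative, per the retention example above.
        {"name": "seven_day_retention", "max_relative_drop": 0.01},
    ],
}
```

Writing the spec down before launch makes the ship-or-revert decision mechanical instead of negotiable.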
Step 4: Determine the Randomization Strategy
This is where statistical thinking meets practical constraints.
Randomization unit: What entity gets assigned to a variant? Common choices are user, session, device, or geographic region. The right choice depends on the feature being tested.
For most product changes, user-level randomization is correct because the experience should be consistent across sessions. Session-level randomization makes sense for tests where each visit is independent. Geographic randomization is appropriate for marketplace tests where users in the same region interact.
Assignment persistence: Users should see the same variant every time they visit. Inconsistent experiences confuse users and dilute your measurement.
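Persistent assignment is typically implemented with deterministic hashing rather than stored lookups, so the same user always resolves to the same variant. A minimal sketch (the function and experiment names are illustrative):

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to a variant.

    Hashing user_id together with the experiment name keeps the
    assignment stable across sessions and independent across experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]

# The same user lands in the same variant on every visit.
assert assign_variant("user-42", "checkout_v2") == assign_variant("user-42", "checkout_v2")
```

Salting the hash with the experiment name matters: without it, the same users would land in "treatment" for every experiment you run.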
Potential for contamination: Consider whether users in one variant can influence users in the other. In marketplace or social products, this is a real concern that requires special designs like cluster randomization or switchback experiments.
Most candidates skip this step entirely. Addressing it demonstrates sophistication.
Step 5: Calculate Sample Size and Duration
Walk through the inputs:
- Baseline rate: State the current conversion rate (or estimate it if the interviewer has not provided it).
- Minimum detectable effect: State the smallest improvement worth detecting. Justify this with business logic — if a two percent relative lift translates to meaningful revenue, that is your MDE.
- Power and significance: Standard values are eighty percent power and a five percent significance level (equivalently, ninety-five percent confidence). State them explicitly.
Calculate (or estimate) the total sample needed and divide by daily traffic to get the test duration.
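If asked to make the calculation concrete, the standard closed-form approximation for a two-proportion test can be sketched as follows (this uses the normal approximation; the function name and example numbers are illustrative):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(baseline: float, mde_rel: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-arm sample size for a two-proportion z-test.

    baseline: current conversion rate, e.g. 0.10
    mde_rel:  minimum detectable effect, relative, e.g. 0.02 for a 2% lift
    """
    p1 = baseline
    p2 = baseline * (1 + mde_rel)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return ceil(n)

# Duration: total sample (2 * n for two arms) divided by daily traffic.
n_small_mde = sample_size_per_arm(0.10, 0.02)  # 2% relative lift
n_large_mde = sample_size_per_arm(0.10, 0.10)  # 10% relative lift
```

The directional relationship is the part worth narrating aloud: halving the MDE roughly quadruples the required sample, which is why the impractical-duration discussion below so often comes up.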
Then address what happens if the duration is impractical. Options include: increasing the MDE, using variance reduction techniques like CUPED, testing on a higher-traffic surface, or running a more aggressive variant that produces a larger effect.
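CUPED in particular is worth being able to sketch: it uses a pre-experiment covariate (such as each user's pre-period spend) to subtract the predictable part of the metric, shrinking variance without biasing the mean. A minimal sketch with simulated data (names and parameters are illustrative):

```python
import random
from statistics import mean, variance

def cuped_adjust(metric, covariate):
    """CUPED adjustment: y - theta * (x - mean(x)),
    where theta = cov(metric, covariate) / var(covariate)."""
    mx, my = mean(covariate), mean(metric)
    cov = sum((y - my) * (x - mx)
              for y, x in zip(metric, covariate)) / (len(metric) - 1)
    theta = cov / variance(covariate)
    return [y - theta * (x - mx) for y, x in zip(metric, covariate)]

# Simulated users whose pre-period spend predicts in-experiment spend.
random.seed(7)
pre = [random.gauss(50, 10) for _ in range(2000)]
post = [0.8 * x + random.gauss(10, 5) for x in pre]
adjusted = cuped_adjust(post, pre)
# Same mean, substantially smaller variance -> smaller required sample.
```

Because the covariate is measured before the experiment, it is independent of treatment assignment, which is what keeps the adjustment unbiased.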
This step shows you understand the constraints that separate theoretical experiment design from practical experimentation.
Step 6: Address Threats to Validity
Every experiment has potential threats to its internal validity. Proactively identifying them demonstrates expertise.
Novelty and primacy effects: New features may perform differently in the short term than in the long term. Users might engage with a new feature out of curiosity (novelty effect) or reject it because they prefer the familiar (primacy effect). Address this by running the test long enough for these effects to wash out.
Selection bias: If the randomization is not truly random (for example, if only logged-in users see the variant), your results may not generalize to the full population.
Interference effects: In networked products, users in the treatment group may affect users in the control group. A recommendation algorithm change, for instance, could shift content supply in ways that affect everyone.
Instrumentation bias: If the variant loads slower or triggers different tracking events, measurement differences may masquerade as behavioral differences.
You do not need to solve every threat. Identifying them and explaining how you would mitigate the most serious ones is sufficient.
Step 7: Describe the Analysis Plan
Specify how you will analyze results before the test launches. This prevents post-hoc rationalization.
- When will you analyze? After reaching the pre-calculated sample size, not before.
- What statistical test will you use? A two-sample z-test for proportions (for conversion rates) or a t-test for continuous metrics.
- Will you do any segmentation? Pre-declare which segments you plan to analyze. Post-hoc segmentation inflates false positive rates.
- How will you handle the multiple comparisons problem? If you are testing multiple variants or multiple metrics, explain your correction strategy.
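The analysis itself is short enough to sketch. Below is a simplified illustration of the two-proportion z-test with a Bonferroni-style note for multiple comparisons (function name and example counts are illustrative):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    Returns (z, p_value) using the pooled-proportion standard error.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Example: 10.0% vs 11.0% conversion on 10,000 users per arm.
z, p = two_proportion_ztest(conv_a=1000, n_a=10000, conv_b=1100, n_b=10000)
significant = p < 0.05
# With two treatment variants against one control, Bonferroni would
# compare each p-value against alpha / 2 instead of alpha.
```

Stating the test and the correction rule before launch is what makes the pre-declared analysis plan credible.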
Step 8: Discuss the Decision Framework
Close by explaining what happens after the analysis.
If the variant wins significantly: Ship it. Describe the rollout plan — will you go to one hundred percent immediately or ramp gradually?
If the variant loses significantly: Keep the control. Document the learning. Explain what you would test next based on this result.
If the result is inconclusive: Keep the control. Discuss whether the hypothesis is worth retesting with a different approach.
If the variant wins on the primary metric but hurts a guardrail: This is the nuanced case that reveals depth of thinking. The answer depends on the severity of the guardrail degradation and whether it is likely to recover over time.
Putting It Together: A Practice Example
Suppose the interviewer asks: "Design an A/B test for a new product recommendation algorithm."
Here is how the framework applies:
- Clarify: What product? What is the current algorithm? What metric matters most — clicks, purchases, revenue per user, or long-term engagement? What is the traffic volume?
- Hypothesis: "If we replace the current algorithm with the new one, then revenue per user will increase because the new algorithm surfaces items with higher purchase probability based on behavioral signals rather than popularity alone."
- Metrics: Primary: revenue per user. Secondary: click-through rate on recommendations, items viewed per session. Guardrails: user satisfaction score, return rate.
- Randomization: User-level. Persistent across sessions. Watch for interference if recommendations affect inventory visibility.
- Sample size: Estimate based on revenue per user variance and target MDE.
- Threats: Novelty effects (new recommendations may get exploratory clicks), interference (algorithm changes affect what is shown to control users in marketplace settings).
- Analysis: Pre-declared primary metric, run to full sample size, segment by new versus returning users.
- Decision: Ship if primary metric improves without guardrail degradation.
This entire answer can be delivered in five to seven minutes with clear structure and confident delivery.
Common Mistakes to Avoid
Skipping the clarifying questions
Jumping straight to the design signals that you do not understand the ambiguity inherent in real experimentation. Real experiments require context. Show that you seek it.
Choosing the wrong primary metric
Vanity metrics (page views, time on site) are almost never the right primary metric. Choose metrics that connect to business outcomes. Revenue, conversion rate, retention, or activation are usually better choices.
Ignoring practical constraints
A beautiful experimental design that requires more traffic than you have or more engineering time than is available is useless. Acknowledge constraints and adapt your design accordingly.
Over-complicating the design
Do not propose a multi-armed bandit with Bayesian updating and heterogeneous treatment effect analysis for a simple landing page test. Match the complexity of your design to the complexity of the problem.
Forgetting to mention what happens after the test
The experiment is not the end goal. The decision is. Always close by explaining how you will use the result.
FAQ
How long should my answer take in an interview?
Aim for five to eight minutes for the full walkthrough. Spend the first minute on clarifying questions, two to three minutes on design and metrics, one to two minutes on statistical considerations, and one minute on the decision framework. Leave time for follow-up questions.
What if I do not know the exact formula for sample size calculation?
That is fine. Explain the inputs (baseline rate, MDE, power, significance) and the directional relationships (smaller MDE requires more traffic). The conceptual understanding matters more than the formula in most interviews.
Should I mention specific tools or platforms?
Only if directly relevant. The interviewer cares more about your thinking process than which button you click in a specific tool. Mentioning tools briefly shows practical experience, but do not let it dominate your answer.
How do I handle a question about a domain I know nothing about?
Use the framework anyway. The structure works regardless of domain. Ask clarifying questions to understand the context, make reasonable assumptions, and state them explicitly. Interviewers care about your reasoning process, not your domain expertise in their specific industry.
What is the biggest differentiator between a good and great answer?
The behavioral mechanism in the hypothesis and the proactive identification of threats to validity. These two elements separate candidates who understand statistics from candidates who understand experimentation.