Why A/B Testing Interviews Are Different

A/B testing interviews do not follow the standard technical interview playbook. They sit at the intersection of statistics, product thinking, and business strategy. The interviewer wants to know whether you can design experiments that produce trustworthy results and whether you understand the business implications of those results.

Most candidates over-prepare on statistics and under-prepare on everything else. The best experimentation professionals combine quantitative rigor with behavioral intuition and business judgment. Your interview answers need to reflect all three.

Core Statistics Questions You Will Face

These questions test whether you understand the statistical machinery behind experimentation. Do not just memorize formulas. Understand the concepts well enough to explain them to a product manager.

What is statistical significance and how do you determine it?

Statistical significance quantifies how surprising an observed difference between variants would be if there were no true effect. The p-value is the probability of seeing a difference at least as large as the one observed, assuming the null hypothesis of no difference is true. Most teams use a five percent significance level (equivalently, ninety-five percent confidence), meaning that when there is truly no effect, the test will produce a false positive at most five percent of the time.

The key insight interviewers look for: significance alone is not enough. A result can be statistically significant but practically meaningless if the effect size is trivially small. Always pair significance with effect size and confidence intervals.
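This pairing can be made concrete with a small two-proportion z-test. The helper below is a minimal sketch (using scipy for the normal distribution; the conversion counts are illustrative) that reports the p-value alongside the effect size and its confidence interval:

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided z-test for a difference in conversion rates,
    returning the p-value, the lift, and its confidence interval."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled standard error under the null hypothesis of no difference
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se_pool
    p_value = 2 * (1 - norm.cdf(abs(z)))
    # Unpooled standard error for the confidence interval on the lift
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = norm.ppf(1 - alpha / 2)
    ci = (p_b - p_a - z_crit * se, p_b - p_a + z_crit * se)
    return p_value, p_b - p_a, ci

p_value, lift, ci = two_proportion_test(500, 10_000, 560, 10_000)
print(f"p={p_value:.4f}, lift={lift:.4f}, 95% CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

Reporting all three numbers together is exactly the habit the answer above recommends: the p-value alone says nothing about whether the lift is worth shipping.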

Explain Type I and Type II errors in business terms

A Type I error means you ship a change that does not actually help. You declared a winner that was really just noise. The business cost is the development resources spent implementing a neutral or harmful change, plus the opportunity cost of not testing something else.

A Type II error means you discard a good idea. The test said inconclusive, so you moved on, but the change would have improved performance. The business cost is the revenue you never captured.

Strong candidates explain the asymmetry: in most organizations, Type I errors are more visible (you shipped something that failed) while Type II errors are invisible (you never know what you left on the table).

How do you calculate sample size for an experiment?

Sample size depends on four inputs: baseline conversion rate, minimum detectable effect, statistical power (typically eighty percent), and significance level (typically five percent). The smaller the effect you want to detect, the more traffic you need.

The follow-up question is always practical: what do you do when the required sample size exceeds what your traffic can deliver in a reasonable timeframe? Strong answers include increasing the minimum detectable effect, testing bolder changes, choosing higher-traffic surfaces, or using variance reduction techniques.
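The standard closed-form approximation for a two-proportion test fits in a few lines. This is a sketch (using scipy's normal quantiles; the baseline and lifts are illustrative), not a full power analysis:

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_variant(baseline, mde, alpha=0.05, power=0.80):
    """Approximate per-variant sample size to detect an absolute
    lift of `mde` over `baseline` conversion, two-sided test."""
    p1, p2 = baseline, baseline + mde
    z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96 for a 5% significance level
    z_power = norm.ppf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)

# Halving the minimum detectable effect multiplies the traffic requirement
n_small = sample_size_per_variant(0.05, 0.01)  # 1-point lift on a 5% baseline
n_big = sample_size_per_variant(0.05, 0.02)    # 2-point lift needs far less
print(n_small, n_big)
```

The inverse-square relationship between effect size and sample size is why "increase the minimum detectable effect" is the first lever when traffic is scarce.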

What is the multiple comparisons problem?

When you test multiple metrics or multiple variants simultaneously, the probability of at least one false positive increases. With twenty metrics at a five percent threshold, you expect one false positive by chance alone.

Solutions include Bonferroni correction, false discovery rate control, or pre-declaring a single primary metric. The best answer acknowledges the tradeoff: corrections reduce false positives but increase false negatives.
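Both corrections are short enough to sketch in plain Python (the p-values below are illustrative):

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha / m. Controls the chance of
    ANY false positive, at the cost of more false negatives."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: controls the expected
    share of false positives among rejections (FDR); less conservative."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k = rank  # largest rank whose p-value clears its threshold
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected

ps = [0.001, 0.012, 0.030, 0.045, 0.200]
print(bonferroni(ps))          # only the strongest result survives
print(benjamini_hochberg(ps))  # FDR control rejects more of them
```

Running both on the same p-values makes the tradeoff visible: Bonferroni keeps one rejection, Benjamini-Hochberg keeps three.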

Experiment Design Questions

These questions assess whether you can translate a business problem into a well-structured experiment.

How would you design an experiment to test a new onboarding flow?

Strong answers follow a structure: define the hypothesis, choose the primary metric (activation rate, not just completion), determine the randomization unit (user-level, not session-level for onboarding), calculate sample size, identify guardrail metrics (support ticket volume, churn rate), and specify the analysis plan before launch.

Weak answers jump straight to the variant design without establishing the measurement framework.

What is the difference between randomization units and analysis units?

The randomization unit is what gets assigned to a variant (user, session, device, geographic region). The analysis unit is the level at which you measure the outcome.

Mismatch between these creates statistical problems. If you randomize at the user level but analyze at the session level, you violate independence assumptions because sessions from the same user are correlated.
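A quick simulation shows the damage. This sketch (numpy and scipy; the per-user propensity distribution is an assumption for illustration) runs null experiments with user-level randomization but a naive session-level t-test, and the false positive rate comes out well above the nominal five percent:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def naive_session_test(n_users=200, sessions_per_user=10):
    """One null experiment: no true effect, user-level randomization,
    but analysis at the session level. Sessions inherit each user's
    conversion propensity, so they are correlated and the t-test's
    independence assumption is violated."""
    prop_a = rng.beta(2, 18, n_users)  # per-user propensity, mean ~0.1
    prop_b = rng.beta(2, 18, n_users)
    sess_a = rng.binomial(1, np.repeat(prop_a, sessions_per_user))
    sess_b = rng.binomial(1, np.repeat(prop_b, sessions_per_user))
    return ttest_ind(sess_a, sess_b).pvalue

pvals = [naive_session_test() for _ in range(500)]
fpr = float(np.mean(np.array(pvals) < 0.05))
print(f"false positive rate under the null: {fpr:.2f}")
```

Fixes include analyzing at the randomization unit (user-level averages) or using clustered standard errors / the delta method at the session level.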

How do you handle novelty effects?

Novelty effects occur when a new variant performs well initially because users are curious, then performance decays as the novelty wears off. Strong answers suggest running the test long enough for the effect to stabilize, segmenting results by new versus returning users, and monitoring performance trends over the test duration.

When would you use a pre-post design instead of a randomized experiment?

Pre-post designs compare metrics before and after a change without a simultaneous control group. They are appropriate when randomization is impossible (infrastructure changes, pricing changes that apply to everyone) but produce weaker causal evidence because external factors can confound the comparison.

The best answer explains why randomized experiments are preferred and articulates the specific scenarios where pre-post is the only option.

Business and Product Questions

These separate candidates who understand statistics from candidates who understand experimentation.

How do you prioritize which tests to run?

Frameworks like ICE (Impact, Confidence, Ease) or RICE (Reach, Impact, Confidence, Effort) provide structure. But the meta-answer is more important: prioritization depends on organizational context. A team trying to build a testing culture should prioritize high-confidence tests that are likely to win and demonstrate the value of experimentation. A mature team should prioritize high-impact tests that push boundaries even if the hypothesis is uncertain.
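The arithmetic behind RICE is simple; the sketch below uses invented backlog items and scores purely to show how the formula trades reach against effort:

```python
def rice_score(reach, impact, confidence, effort):
    """RICE: reach (users per quarter) x impact (relative multiplier)
    x confidence (0-1), divided by effort (person-weeks)."""
    return reach * impact * confidence / effort

# Hypothetical backlog items with illustrative estimates
backlog = {
    "new onboarding flow": rice_score(8_000, 2.0, 0.8, 4),
    "checkout copy tweak": rice_score(20_000, 0.5, 0.9, 1),
}
for name, score in sorted(backlog.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:,.0f}")
```

Note how the low-effort, high-reach tweak outranks the bolder change; this is exactly the bias the meta-answer warns about, since a mature team may deliberately override the score.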

A test shows a significant increase in sign-ups but a significant decrease in activation. What do you do?

This tests whether you think beyond the primary metric. The sign-up increase might be attracting lower-quality users who never activate, which means the change is actually hurting the business despite improving the top-of-funnel number.

Strong answers discuss metric hierarchies, the importance of guardrail metrics, and the need to evaluate changes on downstream business outcomes rather than intermediate proxies.

How do you communicate inconclusive results to stakeholders?

An inconclusive result is not a failure. It means the true effect is likely smaller than your minimum detectable effect. Frame it as: "We can confidently say this change does not produce a large improvement. If there is any effect, it is small enough that other priorities likely offer better return on investment."

This reframing prevents stakeholders from viewing inconclusive tests as wasted effort.

Behavioral Science Questions

These questions increasingly appear in experimentation interviews as organizations recognize the link between behavioral science and effective testing.

How does loss aversion affect experiment design?

Loss aversion means people weight losses roughly twice as heavily as equivalent gains. In experimentation, this matters for framing. A variant that removes something (even if it adds something better) may underperform because users perceive the removal as a loss.

It also affects organizational behavior. Teams are loss-averse about their existing designs and may resist shipping variants that change familiar elements, even when data supports the change.

What is the distinction between correlation and causation in testing?

Randomized experiments are one of the few methods that establish causation. Observational data shows correlation. The interview question usually takes the form of a scenario: "Users who complete onboarding step three have higher retention. Should we force everyone through step three?" The answer is no, because the correlation may reflect user motivation rather than the causal effect of the step.

How do defaults and choice architecture influence test outcomes?

Defaults are powerful because of status quo bias. When you test a new default setting, you are not just testing whether users prefer it. You are testing whether the behavioral inertia of accepting defaults outweighs any preference for the old setting. Understanding this distinction changes how you interpret results.

Technical Deep-Dive Questions

Senior roles often include questions about infrastructure and advanced methods.

How would you build an experimentation platform from scratch?

Key components: assignment service (handles randomization and persistence), event tracking pipeline, statistical analysis engine, feature flagging system, and a reporting dashboard. The assignment service is the most critical component because errors there invalidate all downstream analysis.
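The core of an assignment service is deterministic hashing. A minimal sketch (function name, salt format, and bucket count are illustrative choices):

```python
import hashlib

def assign(user_id: str, experiment: str, variants=("control", "treatment")):
    """Deterministic assignment: hashing user + experiment together means
    a user always sees the same variant within an experiment, while
    splits remain independent across experiments."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000  # 10,000 fine-grained buckets
    return variants[bucket * len(variants) // 10_000]

print(assign("user-42", "onboarding-v2"))
```

Salting the hash with the experiment name is what prevents carryover: without it, the same users would land in "treatment" for every test.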

What is CUPED and when would you use it?

CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance reduction technique that uses pre-experiment behavior to adjust post-experiment metrics. It can substantially reduce the variance of your estimates, allowing you to detect smaller effects with the same sample size. Use it when you have reliable pre-experiment data for your metric.
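In its simplest form the CUPED adjustment is a one-line regression correction. The sketch below uses simulated data and assumes a roughly linear relationship between the pre- and post-experiment metric:

```python
import numpy as np

def cuped_adjust(post, pre):
    """CUPED: subtract the part of the post-experiment metric that is
    predictable from pre-experiment behavior. theta is the OLS slope
    of post on pre, computed on all units pooled."""
    theta = np.cov(post, pre)[0, 1] / np.var(pre)
    return post - theta * (pre - pre.mean())

rng = np.random.default_rng(1)
pre = rng.normal(100, 20, 5000)              # pre-period metric per user
post = pre * 0.8 + rng.normal(0, 10, 5000)   # post metric correlated with pre
adj = cuped_adjust(post, pre)
print(f"variance reduction: {1 - adj.var() / post.var():.0%}")
```

Because the adjustment subtracts a mean-zero quantity, the average treatment effect estimate is unchanged; only its variance shrinks.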

Explain the difference between frequentist and Bayesian approaches to A/B testing

Frequentist testing uses p-values and fixed sample sizes. You declare a result significant or not based on a pre-set threshold. Bayesian testing calculates the probability that one variant is better than another, updated continuously as data arrives. Frequentist methods are simpler and more established. Bayesian methods allow continuous monitoring without peeking penalties but require prior assumptions.
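The Bayesian read-out can be sketched with a Beta-Binomial model and Monte Carlo sampling (a uniform Beta(1, 1) prior is assumed here, and the counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, samples=200_000):
    """P(variant B's true conversion rate > A's), estimated by drawing
    from each variant's Beta posterior under a uniform prior."""
    rate_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, samples)
    rate_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, samples)
    return float((rate_b > rate_a).mean())

p_b_better = prob_b_beats_a(500, 10_000, 560, 10_000)
print(f"P(B > A) = {p_b_better:.3f}")
```

A statement like "there is a 97 percent chance B is better" is the kind of output stakeholders find intuitive, which is one reason Bayesian dashboards are popular despite the prior assumptions they carry.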

How to Prepare Effectively

Studying statistics alone will not get you through these interviews. Here is what actually works:

  • Practice explaining concepts simply. If you cannot explain statistical power to a non-technical person, you do not understand it well enough.
  • Prepare case studies from your experience. Have three to five stories about experiments you designed, ran, and learned from. Include at least one failure.
  • Understand the business model of the company you are interviewing with. Their experimentation priorities directly follow their business model.
  • Read about behavioral science. Understanding cognitive biases, choice architecture, and decision-making frameworks differentiates you from pure statisticians.
  • Be honest about what you do not know. Experimentation spans many disciplines. No one is expert in all of them. Intellectual honesty is more impressive than confident hand-waving.

FAQ

What level of statistics knowledge do I need for an experimentation role?

For analyst roles, you need solid understanding of hypothesis testing, confidence intervals, sample size calculation, and common pitfalls. For senior roles, add Bayesian methods, sequential testing, variance reduction, and causal inference. You do not need to prove theorems, but you need to understand the intuition behind the methods you use.

How do experimentation interviews differ between tech companies and agencies?

Tech companies focus on platform-scale experimentation, long-term metric effects, and infrastructure design. Agencies focus on client-facing optimization, rapid iteration, and communicating results to non-technical stakeholders. Prepare for the context that matches your target role.

Should I bring a portfolio of past experiments to the interview?

Absolutely. A concise document showing three to five experiments with hypothesis, design, result, and learning demonstrates practical experience far better than theoretical knowledge alone. Anonymize company-specific details but keep the reasoning and methodology clear.

What programming languages should I know for experimentation roles?

SQL is non-negotiable. Python or R for statistical analysis is expected. Familiarity with a testing platform and basic understanding of front-end implementation (for client-side testing) is a plus. The exact stack matters less than your ability to work with data fluently.

How important is domain knowledge versus statistical knowledge?

Both matter, but domain knowledge is harder to teach. A statistician can learn your product in weeks. A product expert needs months to build statistical intuition. The best candidates bring both, but if you are stronger in one area, compensate by demonstrating genuine curiosity about the other.

Written by Atticus Li

Revenue & experimentation leader — behavioral economics, CRO, and AI. CXL & Mindworx certified. $30M+ in verified impact.