Short-term proxy metrics in A/B testing can move in the opposite direction from the long-term outcomes they were meant to predict. Hohnhold et al. (2015) at Google showed that short-term ad-revenue lifts can erode long-term ad clicks. Prentice (1989), writing about clinical trials, gave the operational criterion for a valid surrogate, and almost no product team applies it. If your program ships on surrogate wins without long-term holdouts, you may be degrading the business while reporting victories.
It is the Tuesday read-out after a winning test. The team has run a redesign of an upper-funnel surface and the primary metric (click-through rate to the next step) is up 15.1%, p = 0.003, sample size pre-registered and met, no peeking, sample ratio mismatch checks clean. The decision is unambiguous: ship. The variant goes to 100% of traffic the following week. Six months later, an unrelated quarterly review surfaces three numbers nobody connected at the time: 30-day retention is down 8% on cohorts acquired since the rollout, revenue per active user is down 3%, and unsubscribe rates from the lifecycle email program are up 12%. The growth team investigates each metric in isolation and attributes the moves to macro conditions, channel mix shifts, and a parallel pricing test. No one revisits the +15.1% click-through win. The dashboard still shows it as a successful launch.
There is something to investigate, and the investigation should start with the click-through win. The +15.1% lift was real. The retention drop, the revenue-per-user drop, and the unsubscribe spike were also real. They are connected. The redesign optimized a surrogate metric --- click-through rate --- without verifying that the surrogate is on the causal path to the outcomes the business actually monetizes. The redesigned surface produced more clicks but lower-intent clicks, the downstream funnel converted at a lower rate per click, the users who were nudged through with the redesign were less engaged once activated, and the lifecycle program that depended on a certain baseline level of intent began burning the list. The test was clean. The metric was the wrong metric.
This is the Surrogate Metric Trap, and it is the structural reason a portfolio of well-run A/B tests can ship a sequence of “winning” variants and end the year with the long-term business metrics flat or declining. The mechanism is documented in the experimentation literature from Google, Microsoft, Netflix, and Yandex, and the underlying statistical framework comes from forty years of clinical-trials methodology --- specifically Ross Prentice’s 1989 operational criterion for what makes a measurable proxy a valid surrogate for an unmeasurable true endpoint. The framework is unambiguous about what surrogate validation requires. Most product experimentation programs do not meet the requirement. Most do not know there is a requirement.
The implication for CRO leaders, growth executives, and product strategists is uncomfortable: shipping decisions made on short-term surrogate metrics without surrogate validation are not “data-driven” in any meaningful statistical sense. They are decisions made on a number that may or may not predict the outcome you care about, and the published industry evidence is that for many product changes, the short-term surrogate moves and the long-term outcome moves can be uncorrelated or even oppositely signed. The fix is surrogate validation --- long-term holdouts, multi-week measurement of the proxy-outcome link, retroactive validation of which surrogates in your program have actually predicted the outcomes they were assumed to predict. The investment is non-trivial. The cost of not making it is shipping a portfolio of changes that “win” on the dashboard and lose on the business.
What A Surrogate Metric Actually Is
A surrogate metric, in A/B testing terms, is a short-term measurable proxy for a long-term outcome that is too slow, too noisy, or too expensive to measure directly within the test window. The typical A/B test runs for one to four weeks. The outcomes the business cares about --- retention at six months, lifetime revenue, brand equity, churn, customer lifetime value --- accrue over months or years. The test window cannot directly observe them. So the test instead observes a proxy that is hypothesized to predict them, and the team ships variants based on the proxy.
Common examples in product experimentation: click-through rate as a proxy for downstream conversion; engagement-time as a proxy for retention; signup completion as a proxy for user activation; daily-active-users in week one as a proxy for monthly-active-users; ad-clicks as a proxy for ad revenue; email open rate as a proxy for the email program’s contribution to revenue; trial-to-paid conversion in the first 14 days as a proxy for annual contract value. Each proxy is measurable within the test window. Each outcome the proxy is meant to predict is measurable only outside it.
The construct has a much older history in clinical trials. The canonical example is blood pressure as a surrogate for cardiovascular mortality. A drug trial cannot, in practical terms, wait twenty years for a primary endpoint of “did the patient have a heart attack.” It instead measures the drug’s effect on blood pressure --- a fast, cheap, measurable biomarker that the medical literature has spent decades validating as a predictor of cardiovascular mortality. The drug is then approved on the basis of the blood-pressure surrogate, with the implicit claim that improving the surrogate will improve the true outcome.
The problem the clinical-trials literature has spent forty years documenting is that this implicit claim is wrong much more often than the field expected. A drug can improve the surrogate and worsen the outcome. The mechanism is straightforward: a biomarker can be correlated with the outcome in observational data without being causally on the path from the intervention to the outcome. If the drug’s mechanism of action affects the biomarker through a pathway that is not the pathway the biomarker shares with the outcome, the drug can move the biomarker without moving the outcome --- or in the worst cases, move the biomarker in the “good” direction while moving the outcome in the “bad” direction through a different pathway. This is the failure mode that motivated Ross Prentice’s 1989 paper to formalize the operational criterion that a surrogate must meet to be valid.
The A/B testing version of this problem is identical in structure. A product change is the intervention. The short-term proxy metric is the biomarker. The long-term business outcome is the true endpoint. If the product change affects the proxy through a pathway that is not the pathway the proxy shares with the outcome, the change can move the proxy without moving the outcome --- or move the proxy in the “winning” direction while moving the outcome in the “losing” direction through a different pathway. Most product changes affect users through multiple pathways. Most short-term proxies are not causally on the unique path to the outcome. The trap is built in.
Why Optimizing Surrogates Can Destroy Long-Term Value
The structural reason short-term surrogates can diverge from long-term outcomes is that the relationship between a product change and a user behavior typically operates through a learning process. The first time a user encounters a change, the response is dominated by short-term reactions: surprise, curiosity, exploration, primacy effects. Over weeks of repeated exposure, the response shifts toward whatever the steady-state behavior is going to be once the user has formed expectations about the surface and has updated their priors about whether the surface is delivering value.
The short-term reaction and the steady-state response can differ in magnitude and, critically, in sign. A more aggressive promotional banner can drive a large first-week click-through lift because the surprise of the more aggressive design captures attention, and then drive a long-term click-through decline as users learn that the aggressive design is associated with lower-quality content and habituate to ignoring it. A more frequent email program can drive a first-week clicks-per-email lift because some users who were already going to click anyway are reached more often, and then drive a long-term subscriber-base decline because the increased frequency causes the marginal user to unsubscribe. A more prominent autoplay video can drive a first-week engagement-time lift because users who would have left the page are held by the autoplay, and then drive a long-term session-frequency decline because users associate the experience with lower agency and visit less often.
In each of these cases, the short-term proxy and the long-term outcome are moving in opposite directions because they are downstream of different psychological mechanisms. The short-term proxy captures the immediate response, which is dominated by the novelty of the change. The long-term outcome captures the steady-state response, which is dominated by the user’s updated model of the experience. The two responses are not the same thing. Optimizing for the first does not, in general, optimize for the second.
The Prentice 1989 criterion makes this rigorous. For a measurable proxy to be a valid surrogate for an unmeasurable outcome, the proxy must “capture” the full effect of the intervention on the outcome --- meaning that, conditional on observing the proxy, the intervention provides no additional information about the outcome. Operationally, this requires the outcome rate at any follow-up time to be conditionally independent of the intervention given the history of the proxy. In product terms: if you tell me the change to the click-through rate, the additional information that “this was caused by treatment A rather than treatment B” should not help me predict the change to retention. The full effect of treatment A on retention must flow through the click-through rate.
Almost no product change satisfies this criterion. Almost every product change affects users through multiple pathways, only one of which is captured by the short-term proxy. A redesigned signup flow can increase signup completion (the proxy) through reducing form friction, while simultaneously reducing the quality of self-selection (a separate pathway) by allowing less-committed users through. The signup-completion proxy captures the friction-reduction pathway. It does not capture the quality-of-self-selection pathway. The activation-rate outcome is affected by both pathways. The proxy is not a valid Prentice surrogate for activation. Optimizing the proxy can move activation in either direction depending on the relative magnitudes of the two pathways.
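The two-pathway argument above can be made concrete with a small sketch. Every number below is hypothetical, and the split of visitors into "committed" and "casual" segments is an assumption introduced only for illustration. The treatment removes friction, which raises signup completion mostly among casual visitors; the activation rate per signup then depends on the arm even though the proxy has been observed, which is the Prentice violation in miniature.

```python
from dataclasses import dataclass


@dataclass
class SignupArm:
    """One arm of a hypothetical signup test; every number is illustrative."""
    visitors: int
    committed_share: float            # fraction of visitors with strong intent
    committed_signup_rate: float      # signup completion among committed visitors
    casual_signup_rate: float         # signup completion among casual visitors
    committed_activation_rate: float = 0.60
    casual_activation_rate: float = 0.10

    def signups_by_segment(self) -> tuple[float, float]:
        committed = self.visitors * self.committed_share * self.committed_signup_rate
        casual = self.visitors * (1 - self.committed_share) * self.casual_signup_rate
        return committed, casual

    def signup_rate(self) -> float:
        """The short-term proxy: completed signups per visitor."""
        return sum(self.signups_by_segment()) / self.visitors

    def activation_rate(self) -> float:
        """The outcome: activated users per completed signup."""
        committed, casual = self.signups_by_segment()
        activated = (committed * self.committed_activation_rate
                     + casual * self.casual_activation_rate)
        return activated / (committed + casual)


# Friction removal helps committed visitors a little and casual visitors a lot.
control = SignupArm(10_000, 0.20, committed_signup_rate=0.50, casual_signup_rate=0.05)
treatment = SignupArm(10_000, 0.20, committed_signup_rate=0.55, casual_signup_rate=0.15)

for name, arm in (("control", control), ("treatment", treatment)):
    print(f"{name}: signup rate {arm.signup_rate():.3f}, "
          f"activation per signup {arm.activation_rate():.3f}")
```

In this toy setup the proxy rises while activation per signup falls, because the extra signups come disproportionately from the low-intent segment. Different assumed rates would give a different net result, which is the point: the proxy alone does not tell you which.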
This is not a hypothetical edge case. It is the default condition for most A/B-tested product changes. The Prentice criterion is a strong condition that is hard to satisfy. The fact that the experimentation community largely does not invoke it, and largely ships on un-validated surrogates, is the source of the trap.
Hohnhold 2015 --- Google’s Long-Term Ad-Blindness Study
The cleanest documented industry case study of the Surrogate Metric Trap comes from a Google data-science team’s investigation of long-term effects of search ads on user behavior. The work was published as Hohnhold, O’Brien & Tang (2015), “Focusing on the Long-term: It’s Good for Users and Business,” at KDD 2015 (DOI: 10.1145/2783258.2788583), and the paper documents a measurement framework Google had been developing internally since approximately 2008.
The setup: Google’s ads system has a large set of short-term tunable parameters --- ad load, ad ranking thresholds, formatting choices --- that can be tested via standard A/B testing on short-term metrics like ad revenue per query and ad click-through rate. A short-term test that increases ad load by a small amount typically shows a short-term ad-revenue lift, because more ads create more opportunities for clicks. The question Hohnhold et al. investigated is whether the short-term ad-revenue lift translates into a long-term ad-revenue lift, or whether something more complicated happens once users have time to update their behavior.
The finding: users develop “ad blindness.” Repeated exposure to more or lower-quality ads causes users to learn, over a timescale of weeks to months, to ignore ads: the click-through rate per ad-impression declines as the user’s model of “Google ads are worth attending to” degrades. The long-term ad-click rate, once user behavior reaches a new equilibrium, is lower than the short-term ad-click rate immediately after the change. The short-term ad-revenue lift therefore overestimates the long-term ad-revenue lift, and for some changes the gap is large enough that the long-term effect is negative even though the short-term effect was positive.
The methodology Hohnhold et al. developed estimates the long-term effect by running cookie-cookie-day experiments that hold users in a treatment condition for an extended period and model the user-learning trajectory. The paper reports that the learning trajectory follows approximately exponential curves with characteristic learning rates that depend on the user’s ad-exposure frequency, and that the steady-state click-through rate is consistently lower than the short-term click-through rate for changes that increase ad load or reduce ad quality.
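A minimal sketch of what an approximately exponential learning trajectory implies for the net effect over time. This is a simplified illustrative form, not the paper's fitted model; the function name `net_effect` and all parameter values are assumptions made for the example.

```python
import math


def net_effect(weeks: float, short_term: float, learned_asymptote: float,
               learning_rate: float) -> float:
    """Net metric change (in percent) after `weeks` of exposure.

    short_term        : immediate effect seen in a standard test window
    learned_asymptote : additional effect once user learning has saturated
                        (negative when users learn to ignore the surface)
    learning_rate     : per-week rate at which learning approaches saturation
    """
    learned = learned_asymptote * (1.0 - math.exp(-learning_rate * weeks))
    return short_term + learned


# Hypothetical change: +2.0% immediately, -3.0% learned effect at saturation.
for week in (0, 2, 4, 8, 16, 52):
    print(week, round(net_effect(week, 2.0, -3.0, 0.15), 2))
```

With these assumed values the net effect starts positive and turns negative between weeks 7 and 8, well after a typical one-to-four-week test would have been read out and shipped.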
The business consequences Google extracted from this research, documented in subsequent commentary on the paper, included: a 2011 modification of the AdWords auction ranking function to incorporate long-term user-satisfaction signals rather than only short-term click-through, and an approximately 50% reduction in smartphone ad load over a two-year period based on the measured long-term negative effects of higher ad load. The 50% smartphone ad-load reduction is the operationally consequential number: a company that monetizes through ad clicks voluntarily cut ad inventory in half on its highest-growth surface because the long-term measurement framework showed the short-term ad-revenue gains were destroying long-term user engagement.
The Hohnhold paper is the clearest publicly documented case of a major experimentation program discovering, via long-term measurement, that a large portfolio of its “winning” short-term tests had been silently degrading the long-term business. The findings only became visible because Google built the measurement infrastructure to look. Most experimentation programs do not have that infrastructure. The trap is invisible without it.
Surrogate-Metric Failure Modes With Examples
The Hohnhold case is one instance of a general failure pattern. The mechanism --- a product change affects users through multiple pathways, only one of which is captured by the short-term surrogate, and the un-captured pathways drive the long-term outcome --- generates a recurring set of failure modes that show up across industries.
Engagement-time as a proxy for retention. A common product KPI is “time on page” or “time in app,” used as a proxy for the user finding the product valuable. The failure mode: changes that increase friction to exit (autoplay video, infinite scroll without natural stopping points, modal interruptions that delay leaving) can increase engagement-time without increasing the underlying value the user derives. The user spends more time per session because the product made it harder to leave, not because the product became more valuable. The retention outcome --- whether the user returns next week --- depends on whether the user judges the experience worth returning to, and that judgment can degrade even as the engagement-time proxy improves. The Drutsa, Gusev & Serdyukov (2017) paper at Yandex documents systematic divergence patterns between short-term engagement metrics and long-term return-visit patterns in search-engine usage.
Signup completion as a proxy for activated user. A growth team optimizes the signup funnel and observes a large lift in signup-completion rate. The failure mode: the changes that increased signup completion --- removing fields, simplifying authentication, deferring email verification, adding social-login one-click flows --- reduced the quality of self-selection among users who completed signup. The signups that the redesigned funnel added are disproportionately users with low intent, who completed signup because the friction was removed rather than because they had a strong reason to start. These users do not activate at the same rate as users from the higher-friction baseline funnel, and the activation rate per signup declines enough that the total number of activated users is flat or down despite the signup-completion lift.
Email click-through rate as a proxy for email program revenue contribution. A lifecycle team experiments with email frequency and finds that the higher-frequency variant has a higher clicks-per-subscriber-per-week. The failure mode: the higher-frequency variant also has a higher unsubscribe rate, and over a multi-month horizon the subscriber base in the high-frequency arm shrinks faster than in the baseline arm. The revenue-per-active-subscriber may be up, but the active-subscriber count is down, and the total revenue from the email program is lower in the high-frequency arm. The clicks-per-subscriber-per-week proxy does not capture the subscriber-base attrition pathway.
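The attrition arithmetic can be sketched in a few lines. The subscriber counts, per-subscriber revenue, and unsubscribe rates below are hypothetical, and constant geometric attrition is a simplifying assumption; the point is only that a per-subscriber lift and a shrinking list multiply together.

```python
def weekly_revenue(week: int, subscribers: float, revenue_per_subscriber: float,
                   weekly_unsubscribe_rate: float) -> float:
    """Program revenue in a given week after geometric list attrition."""
    remaining = subscribers * (1.0 - weekly_unsubscribe_rate) ** week
    return remaining * revenue_per_subscriber


def crossover_week(baseline: dict, variant: dict, horizon: int = 104):
    """First week in which the variant earns less than baseline, else None."""
    for week in range(horizon + 1):
        if weekly_revenue(week, **variant) < weekly_revenue(week, **baseline):
            return week
    return None


# Hypothetical arms: the variant earns 20% more per subscriber
# but loses subscribers faster.
baseline = dict(subscribers=100_000, revenue_per_subscriber=0.50,
                weekly_unsubscribe_rate=0.002)
high_frequency = dict(subscribers=100_000, revenue_per_subscriber=0.60,
                      weekly_unsubscribe_rate=0.015)

print("variant falls behind baseline in week", crossover_week(baseline, high_frequency))
```

Under these assumptions the high-frequency arm is ahead for the first quarter and behind afterwards, so a two-week test reads it as a clear win.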
Ad click-through rate as a proxy for ad revenue. The Hohnhold case, generalized. A more aggressive ad-placement variant can lift short-term ad CTR. The user’s long-term ad-blindness response then degrades the CTR over weeks to months, so long-term ad revenue comes in below the short-term projection. Google has documented this pattern publicly; other ads businesses presumably face the same dynamic without having published it.
Conversion rate on a landing page as a proxy for cohort revenue. A landing-page optimization test produces a +20% conversion-rate lift. The failure mode: the conversion-rate lift came from changes that broadened the converting audience: claim language that appealed to a wider audience, aggressive discount framing, social-proof elements that lowered the perceived risk. The users who converted as a result of these changes have lower expected lifetime value than the baseline cohort, because they self-selected on a more aggressive pitch and are more likely to churn when they encounter the actual product. The cohort revenue is up by less than 20%, possibly flat, possibly down.
The common structural feature: in each case, the short-term proxy captures a “quantity” effect (more clicks, more signups, more conversions) but not the “quality” effect (the per-event downstream behavior). The long-term outcome depends on both quantity and quality. The proxy is not a valid Prentice surrogate because the intervention affects the outcome through pathways the proxy does not observe.
The Statistical Framework --- Prentice Criteria and Athey 2019 Surrogate Index
The formal framework for evaluating surrogate metrics comes from the clinical-trials literature, primarily Ross Prentice’s 1989 paper “Surrogate endpoints in clinical trials: definition and operational criteria” (Statistics in Medicine, 8(4), 431-440, DOI: 10.1002/sim.4780080407). Prentice’s central condition is precise: a measurable response variable $S$ can stand in for a true endpoint $T$ that is too slow to observe only if the distribution of $T$ given $S$ does not depend on the treatment $Z$. The criteria also require that $S$ itself be predictive of $T$. Operationally, the central condition means the true-endpoint rate at any follow-up time, conditional on the history of the surrogate, is independent of treatment assignment.
The intuition: a valid surrogate must “capture” the full causal effect of the treatment on the true endpoint. If you observe the surrogate trajectory, knowing whether the patient received treatment or control should provide no additional information about the true-endpoint outcome. Any pathway by which the treatment affects the true endpoint must flow through the surrogate. If there is any treatment-outcome pathway that bypasses the surrogate, the criterion fails, and the surrogate is not valid.
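Written out with the symbols already in use ($S$ the surrogate, $T$ the true endpoint, $Z$ the treatment), the central condition and its time-to-event form can be stated as follows. The notation $S(t)$ for the surrogate history up to time $t$ and $\lambda_T$ for the endpoint hazard is introduced here for the sketch.

```latex
% Central condition: given the surrogate, treatment carries no further
% information about the true endpoint.
f(T \mid S, Z) = f(T \mid S)

% Hazard form for an endpoint observed over follow-up time t:
\lambda_T\bigl(t \mid S(t), Z\bigr) = \lambda_T\bigl(t \mid S(t)\bigr)

% Failure case: a treatment-outcome pathway that bypasses S shows up as
f(T \mid S, Z = 1) \neq f(T \mid S, Z = 0)
```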
Prentice’s criterion is strong. The clinical-trials methodology literature in the decades after 1989 documented that most candidate surrogates fail to meet it. Fleming & Powers (2012), “Biomarkers and surrogate endpoints in clinical trials” (Statistics in Medicine, 31(25), 2973-2984, DOI: 10.1002/sim.5403), is a widely cited survey of cases in which biomarkers that appeared to be valid surrogates in observational data led to drug approvals that subsequently failed on the true endpoint, sometimes because the drug improved the biomarker through a different pathway than the one connecting the biomarker to the outcome.
The product-experimentation literature has begun importing this framework. Athey, Chetty, Imbens & Kang (2019), “The Surrogate Index: Combining Short-Term Proxies to Estimate Long-Term Treatment Effects More Rapidly and Precisely” (NBER Working Paper 26463), formalizes a method for combining multiple short-term proxies into a single index. Under the Prentice surrogacy assumption, together with a comparability assumption that the proxy-outcome relationship is the same in the historical and experimental samples, the average treatment effect on the index equals the average treatment effect on the long-term outcome. The Athey et al. method addresses the practical problem that no single short-term metric satisfies the Prentice criterion on its own, but a combination of short-term metrics, weighted appropriately, may satisfy it. The combination requires historical data with both short-term and long-term outcomes observed in the same cohort, which permits estimating the surrogate index’s weights and then applying the index to future experiments where only the short-term metrics are observed.
The Athey et al. surrogate index has been picked up by industry experimentation teams. Netflix’s empirical evaluation of the method (Zhang et al., 2023, “Evaluating the Surrogate Index as a Decision-Making Tool Using 200 A/B Tests at Netflix,” arXiv:2311.11922) reports that across 1098 test arms from 200 Netflix A/B tests, surrogate index decisions using 14 days of data agreed with direct measurement at day 63 approximately 95% of the time for statistical inferences and achieved 79% precision and 65% recall for launch decisions. The Netflix paper is unusual in publishing the empirical validation rate; most industry surrogate-index implementations are undocumented externally. The Netflix launch-decision rates fall well short of perfect, which is consistent with the general finding that surrogate-based decisions are useful but require ongoing validation and should not be treated as equivalent to direct long-term measurement.
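For readers less used to reading precision and recall as launch-decision statistics, here is a minimal sketch assuming the usual definitions: precision is the share of surrogate-recommended launches that long-term measurement would also have launched, and recall is the share of long-term-justified launches that the surrogate caught. The function name and the eight example decisions are hypothetical and are not taken from the Netflix paper.

```python
def launch_precision_recall(surrogate_launch: list[bool],
                            long_term_launch: list[bool]) -> tuple[float, float]:
    """Precision and recall of surrogate-based launch calls against long-term calls."""
    pairs = list(zip(surrogate_launch, long_term_launch))
    true_pos = sum(1 for s, t in pairs if s and t)
    false_pos = sum(1 for s, t in pairs if s and not t)
    false_neg = sum(1 for s, t in pairs if not s and t)
    nan = float("nan")
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else nan
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else nan
    return precision, recall


# Hypothetical decisions for eight test arms.
by_surrogate = [True, True, True, True, False, False, False, False]
by_long_term = [True, True, True, False, True, True, False, False]

precision, recall = launch_precision_recall(by_surrogate, by_long_term)
print(f"precision {precision:.2f}, recall {recall:.2f}")
```

Lower recall than precision, as in the published figures, means the surrogate misses more good launches than it approves bad ones.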
The key statistical point: the Prentice criterion and the Athey surrogate index are tools for thinking precisely about what it would mean for a short-term proxy to be a valid stand-in for a long-term outcome. They do not say “all proxies are bad.” They say “the validity of a proxy is a testable empirical claim, and the test requires historical data linking the proxy to the outcome in the population of interest.” Programs that ship on proxies without running this test are not making data-driven decisions. They are making decisions on a metric whose relationship to the outcome they care about is unverified.
OEC (Overall Evaluation Criterion) Selection --- Kohavi 2013 Guidance
The Microsoft experimentation team has been the most prolific publisher of practical guidance for product experimentation programs, and their treatment of surrogate-metric selection appears under the heading of “OEC” (Overall Evaluation Criterion) selection. The foundational reference is Kohavi, Deng, Frasca, Walker, Xu & Pohlmann (2013), “Online Controlled Experiments at Large Scale” (KDD ’13, DOI: 10.1145/2487575.2488217), which presents the lessons from Microsoft Bing’s experimentation infrastructure circa 2013.
The OEC framework formalizes the operational decision a product team faces in an A/B test. The team must specify, in advance of running the test, a single metric (or weighted combination of metrics) that the test will be evaluated against. The OEC is the function the team is choosing to optimize. If the OEC is “click-through rate on the result page,” then the test will be evaluated on click-through rate, and a variant that improves click-through rate will be considered a winner. If the OEC is poorly chosen --- if it captures a short-term proxy that does not predict the long-term outcome the business actually cares about --- then the experimentation program will systematically ship variants that improve the proxy while not improving (or degrading) the long-term outcome.
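A minimal sketch of how the choice of OEC changes the verdict on the same test readout. The metric names, lifts, and weights are all hypothetical, and the plain weighted sum is an assumption made for illustration rather than a form prescribed by the Microsoft papers.

```python
def oec_score(relative_lifts: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of relative metric lifts (treatment versus control)."""
    return sum(weight * relative_lifts[name] for name, weight in weights.items())


# Hypothetical readout: relative lifts versus control.
lifts = {
    "click_through_rate": 0.15,
    "sessions_per_user": -0.02,
    "unsubscribe_rate": 0.12,
}

# Two candidate OECs, both specified before the test runs.
ctr_only = {"click_through_rate": 1.0}
composite = {
    "click_through_rate": 0.2,
    "sessions_per_user": 0.6,
    "unsubscribe_rate": -0.2,   # negative weight: more unsubscribes is worse
}

print("CTR-only OEC:", round(oec_score(lifts, ctr_only), 3))
print("composite OEC:", round(oec_score(lifts, composite), 3))
```

The same variant is a winner under the first OEC and a loser under the second. Which weights are right is the empirical question that validation has to answer.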
Kohavi et al.’s 2013 paper explicitly addresses this. The paper argues that OEC selection is one of the highest-leverage decisions an experimentation program makes, because the OEC determines what the program is optimizing across the portfolio of tests over time. A program that runs 1000 tests against a flawed OEC will systematically tilt the product in the direction the flawed OEC favors. The paper advocates for OECs that are “leading indicators” of long-term value rather than narrow short-term metrics, and explicitly calls out the failure mode where short-term metrics that appear to track long-term outcomes in observational data turn out to be poor predictors in experimental data.
The Kohavi guidance has been refined in subsequent Microsoft publications and in Kohavi, Tang & Xu’s 2020 textbook “Trustworthy Online Controlled Experiments.” The recurring point: the choice of OEC is not a technical decision to be delegated to the analytics team. It is a strategic decision about what the experimentation program is actually optimizing. If the OEC is wrong, the program is precise, statistically rigorous, and pointed at the wrong target. The precision and the rigor do not save the business outcome.
A specific practice the Microsoft literature recommends: the OEC should be validated by running long-term holdouts where a randomly-selected portion of users is held in the baseline condition for an extended period, allowing the team to directly measure how the long-term outcome differs from the OEC’s prediction across the portfolio of changes that have been shipped. Programs that do not maintain a long-term holdout are unable to validate their OEC and unable to detect surrogate-metric divergence in their portfolio. The Hohnhold case at Google is structurally an instance of long-term holdout analysis: the cookie-cookie-day measurement framework is the long-term holdout that revealed the ad-blindness pattern.
How To Validate A Surrogate Metric For Your Domain
Surrogate validation is the operational practice that converts the Prentice / Athey / Kohavi statistical framework into something a working experimentation program can run. The general procedure has four components.
Long-term holdouts. The program maintains a fraction of users (typically 1-5%) who are held in a baseline condition for an extended period --- months to a year. Variants that win short-term tests are rolled out to the remaining users but not to the holdout. The holdout permits direct measurement of the long-term outcome differential between users who have received the cumulative portfolio of shipped changes and users who have not. The holdout differential is the true long-term effect of the portfolio. The sum of short-term test wins is the surrogate-based prediction of that long-term effect. The ratio of true effect to predicted effect is the calibration of the surrogate metric for the program’s portfolio.
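The calibration ratio described above is simple arithmetic; a sketch follows. The lifts and the holdout readout are hypothetical, and summing short-term lifts assumes the shipped changes combine additively, which is a simplification.

```python
def surrogate_calibration(holdout_effect: float, short_term_lifts: list[float]) -> float:
    """Ratio of the measured long-term portfolio effect to the summed short-term wins."""
    predicted = sum(short_term_lifts)
    if predicted == 0:
        raise ValueError("summed short-term lift is zero; calibration is undefined")
    return holdout_effect / predicted


# Hypothetical year: five shipped wins (relative lifts on the primary metric)
# and the long-term difference measured against the holdout.
shipped_lifts = [0.021, 0.015, 0.034, 0.008, 0.012]
measured_vs_holdout = 0.027

calibration = surrogate_calibration(measured_vs_holdout, shipped_lifts)
print(f"calibration: {calibration:.2f}")
```

A ratio near 1 would mean the short-term wins translate roughly one-for-one; a ratio well below 1, as in this made-up example, means most of the reported lift did not survive, and a negative ratio means the portfolio moved the outcome the wrong way.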
Multi-week measurement of the proxy-outcome link. For specific tests that are candidates for shipping, the program runs the test for longer than the standard window --- four to eight weeks rather than one to two --- and measures the trajectory of both the short-term proxy and the long-term outcome (to the extent the longer window permits measuring the outcome) over the duration of the test. The trajectory analysis reveals whether the short-term proxy and the long-term outcome are moving together or diverging over the multi-week horizon. If the proxy continues to improve while the long-term outcome flattens or reverses, the surrogate is failing for this specific test and the shipping decision should be made on the long-term outcome rather than the proxy.
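One crude way to operationalize the trajectory check is to compare the week-over-week trend in the proxy lift with the trend in the outcome lift. The sketch below uses a least-squares slope on weekly point estimates; the function names and weekly values are hypothetical, and a real analysis would also need confidence intervals on both series.

```python
def ols_slope(values: list[float]) -> float:
    """Least-squares slope of values against week index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two weekly readings")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    return sxy / sxx


def is_diverging(proxy_lifts: list[float], outcome_lifts: list[float]) -> bool:
    """True when the proxy lift holds or grows while the outcome lift trends down."""
    return ols_slope(proxy_lifts) >= 0 and ols_slope(outcome_lifts) < 0


# Hypothetical six-week test: weekly relative lifts versus control.
proxy_by_week = [0.14, 0.15, 0.15, 0.16, 0.15, 0.16]
outcome_by_week = [0.03, 0.02, 0.01, 0.00, -0.01, -0.02]

print("diverging:", is_diverging(proxy_by_week, outcome_by_week))
```

A one- or two-week read of this hypothetical test would have seen both metrics positive; only the longer window shows the outcome lift heading through zero.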
Retroactive surrogate validation. The program periodically reviews the historical portfolio of shipped tests and analyzes, for the subset of tests where the long-term outcome can be measured retrospectively, the correlation between the short-term test win and the long-term outcome change attributable to the rollout. This analysis is the empirical answer to the question “does our surrogate metric predict the long-term outcome we care about?” The Hohnhold work at Google and the Netflix surrogate-index validation work are both versions of retroactive surrogate validation, conducted at scale and published. Smaller programs can do scaled-down versions on their own portfolios.
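A scaled-down version of this review can start with two summary numbers: the correlation between short-term lifts and later long-term changes, and the share of tests where the two agree in sign. The sketch below is a minimal illustration with hypothetical data for five shipped tests; it treats the long-term change as already attributed to each rollout, which is the hard part in practice.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between paired observations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def sign_agreement(xs: list[float], ys: list[float]) -> float:
    """Share of tests where the short-term and long-term effects have the same sign."""
    agree = sum(1 for x, y in zip(xs, ys) if (x > 0) == (y > 0))
    return agree / len(xs)


# Hypothetical portfolio: short-term lift at ship time, long-term change measured later.
short_term = [0.05, 0.10, 0.02, 0.08, 0.15]
long_term = [0.02, -0.03, 0.01, 0.03, -0.05]

print(f"correlation: {pearson(short_term, long_term):.2f}")
print(f"sign agreement: {sign_agreement(short_term, long_term):.0%}")
```

In this made-up portfolio the largest short-term wins are the ones that reversed, so the correlation is negative even though most tests agree in sign. Both numbers are worth reporting.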
Athey surrogate index construction. For programs with sufficient historical data, the Athey et al. (2019) surrogate index method permits combining multiple short-term proxies into a weighted index that is, under the Prentice assumption, an unbiased estimator of the long-term treatment effect. The index construction requires historical data where both the proxies and the long-term outcome have been observed, and the weights are estimated by regressing the long-term outcome on the proxies in this historical sample. The resulting index can then be applied to future experiments where only the short-term proxies are observed within the test window.
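The construction can be sketched end to end with a linear regression standing in for the index. This is a simplified illustration: the linear form, the function names, and the tiny noiseless dataset are all assumptions, and the two proxies can be read as, say, clicks per user and unsubscribes per user. Step one fits the long-term outcome on the proxies in historical data; step two scores experiment users with the fitted index and differences the arm means.

```python
def fit_ols(rows: list[list[float]], targets: list[float]) -> list[float]:
    """Ordinary least squares with an intercept, via the normal equations.

    Returns [intercept, coef_1, ..., coef_k]. Uses Gauss-Jordan elimination
    with partial pivoting so the sketch needs no third-party libraries.
    """
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    # Augmented system [X'X | X'y].
    system = [
        [sum(r[i] * r[j] for r in design) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(design, targets))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(system[r][col]))
        if abs(system[pivot][col]) < 1e-12:
            raise ValueError("proxies are collinear; cannot fit the index")
        system[col], system[pivot] = system[pivot], system[col]
        for row in range(k):
            if row != col:
                factor = system[row][col] / system[col][col]
                system[row] = [a - factor * b for a, b in zip(system[row], system[col])]
    return [system[i][k] / system[i][i] for i in range(k)]


def surrogate_index(coefs: list[float], proxies: list[float]) -> float:
    """Predicted long-term outcome for one user from short-term proxies."""
    return coefs[0] + sum(c * p for c, p in zip(coefs[1:], proxies))


def estimated_long_term_effect(coefs, treated, control) -> float:
    """Difference in mean surrogate index between experiment arms."""
    def mean_index(users):
        return sum(surrogate_index(coefs, u) for u in users) / len(users)
    return mean_index(treated) - mean_index(control)


# Step 1: historical users with proxies AND the long-term outcome observed.
# Hypothetical, noiseless data generated as outcome = 1 + 2*proxy1 - 3*proxy2.
historical_proxies = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 2]]
historical_outcome = [1, 3, -2, 0, 2, -3]
coefs = fit_ols(historical_proxies, historical_outcome)

# Step 2: a new experiment where only the proxies are observed.
treated_users = [[1.2, 0.4], [1.0, 0.5]]
control_users = [[1.0, 0.2], [0.8, 0.3]]
effect = estimated_long_term_effect(coefs, treated_users, control_users)
print("fitted weights:", [round(c, 3) for c in coefs])
print("estimated long-term effect:", round(effect, 3))
```

In this toy experiment the first proxy is up in treatment, yet the index estimates a negative long-term effect because the second proxy, which the historical data weights negatively, rose too. The estimate is only as good as the surrogacy and comparability assumptions behind it.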
The investment required for surrogate validation is non-trivial. Long-term holdouts cost the program some statistical power on every test (because some users are held out of every variant). Multi-week measurement costs throughput. Retroactive validation costs analyst time. Surrogate index construction requires meaningful historical data. Programs that have not made this investment are running a portfolio of tests against a metric whose relationship to the business outcome is unverified, and the published evidence is that the unverified relationship is wrong often enough to materially mislead the program.
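The power cost of a holdout can be put in rough numbers. The sketch assumes that a test's standard error scales with one over the square root of its sample size and that the holdout simply removes that fraction of users from every test; both are simplifications, and the function name is hypothetical.

```python
import math


def standard_error_inflation(holdout_fraction: float) -> float:
    """Factor by which a test's standard error grows when a share of users is withheld."""
    if not 0 <= holdout_fraction < 1:
        raise ValueError("holdout_fraction must be in [0, 1)")
    return 1.0 / math.sqrt(1.0 - holdout_fraction)


for fraction in (0.01, 0.05):
    inflation = standard_error_inflation(fraction)
    print(f"{fraction:.0%} holdout -> standard errors x{inflation:.4f}")
```

Under these assumptions a 5% holdout widens confidence intervals by under 3%, which is a real but modest price for portfolio-wide validation.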
What Industry Practice Looks Like
The major experimentation programs that have published their surrogate-metric practices fall into a small number of patterns.
Google’s pattern (Hohnhold 2015 and successors). Maintain a long-running measurement framework based on cookie-cookie-day designs that hold users in treatment conditions long enough to observe the user-learning trajectory. Use the trajectory to estimate the long-term effect that the short-term test underestimates or overestimates. Apply this framework to high-stakes parameters like ad load and ad ranking, where the long-term effect can diverge substantially from the short-term effect. Modify the OEC to incorporate the long-term measurement signals.
Microsoft’s pattern (Kohavi et al. and Trustworthy Online Controlled Experiments). Standardize on the OEC framework at the program level. Require OEC validation before adopting a new OEC. Run a portfolio of long-term holdouts to permit ongoing validation of the OEC against the true long-term outcomes. Treat the OEC choice as a strategic decision that requires executive sign-off, not an analytics-team default. The published OEC discipline at Bing is one of the most documented examples in industry of explicit surrogate-validation infrastructure.
Netflix’s pattern (Zhang et al. 2023 and predecessors). Use the Athey surrogate index methodology to combine multiple short-term proxies into a single decision metric, and validate the index empirically against direct measurement of long-term outcomes for the subset of tests where the longer window permits direct comparison. Publish the empirical validation rates so the rest of the industry can calibrate expectations about how well surrogate-index decisions match long-term-outcome decisions. The 95% inference-agreement / 79% launch-precision / 65% launch-recall numbers from the Netflix paper are unusual in being publicly documented; they are also useful as a realistic anchor for what a well-implemented surrogate-index program can achieve, which is “very good but not equivalent to direct long-term measurement.”
Facebook / Meta and other platform-scale companies. These companies have published less detail than Google, Microsoft, and Netflix on their surrogate-validation infrastructure, but the published material indicates that long-term experiment frameworks, which hold cohorts in treatment for months and measure downstream behavior, are used for high-stakes product changes. The specifics are largely internal.
The common thread across the industry-scale published practice: surrogate validation is treated as infrastructure, not as a per-test decision. The program builds the long-term measurement capability once and applies it across the portfolio. Programs that try to do surrogate validation ad-hoc on individual tests typically fail to do it consistently and end up shipping a portfolio of tests against unvalidated surrogates anyway.
What This Means For Your Experimentation Program
The operational implications for a working CRO, growth, or product experimentation program are as follows.
Stop treating short-term test wins as production forecasts unless the surrogate is validated. When a test reports “+15% on the primary metric, p < 0.01,” the appropriate reading is “+15% on the primary short-term metric. The relationship between this metric and the long-term business outcome is unverified unless we have run surrogate validation. We should expect the long-term effect to be smaller than the short-term effect, and we should not exclude the possibility that the long-term effect is zero or negative.” This calibration discount is the analog of the Winner’s Curse shrinkage, but it corrects for surrogate-metric divergence rather than selection bias. It is a separate adjustment that compounds with the Winner’s Curse adjustment.
Invest in long-term holdouts as infrastructure. A 1-5% holdout fraction held in the baseline condition for six to twelve months is the foundational tool for surrogate validation. The cost is non-trivial (lost statistical power on every test) but the cost is fixed --- it does not grow with the size of the portfolio --- and the value is portfolio-wide. A program that ships 100 tests against a validated OEC is doing materially different work from a program that ships 100 tests against an unvalidated proxy, even if the per-test analysis looks identical.
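One common way to implement a persistent holdout is deterministic hash-based assignment, so that a user lands in the same group on every request for the full six to twelve months. The sketch below is an assumption about how a program might do this, not a description of any cited company's system; the function name, the salt string, and the 2% fraction are hypothetical.

```python
import hashlib

def in_long_term_holdout(user_id: str,
                         salt: str = "long-term-holdout-v1",  # hypothetical salt
                         fraction: float = 0.02) -> bool:
    """Deterministically place roughly `fraction` of users in the holdout."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < fraction

# Usage: the assignment is stable across calls and close to the target share.
users = [f"user-{i}" for i in range(50000)]
share = sum(in_long_term_holdout(u) for u in users) / len(users)
print(round(share, 4))
```

Keeping the salt fixed for the life of the holdout is what keeps membership stable; rotating the salt starts a fresh holdout cohort.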
Run multi-week extensions on high-stakes tests. For changes that are expected to materially affect long-term behavior --- pricing, ad load, signup flow, lifecycle email frequency, anything that touches the user’s relationship to the product rather than a narrow surface improvement --- the standard one-to-two-week test window is too short. Extend the test to four to eight weeks. Measure the trajectory of both the short-term proxy and any partially-observable long-term outcomes. Make the shipping decision on the trajectory, not on the first-week point estimate.
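Reading the trajectory rather than the first-week number can be as simple as computing the relative lift per week and checking its trend. The sketch below uses hypothetical weekly click-through rates and deliberately omits confidence intervals; a real analysis would need uncertainty on each weekly lift before reading anything into the slope.

```python
def weekly_lifts(control, treatment):
    """Relative lift of treatment over control for each week."""
    return [(t - c) / c for c, t in zip(control, treatment)]

def lift_slope(lifts):
    """Least-squares slope of lift against week number (1, 2, 3, ...)."""
    n = len(lifts)
    weeks = list(range(1, n + 1))
    mean_w = sum(weeks) / n
    mean_l = sum(lifts) / n
    num = sum((w - mean_w) * (l - mean_l) for w, l in zip(weeks, lifts))
    den = sum((w - mean_w) ** 2 for w in weeks)
    return num / den

# Hypothetical six-week test: the first-week lift decays toward zero.
control_ctr = [0.100, 0.100, 0.100, 0.100, 0.100, 0.100]
treatment_ctr = [0.115, 0.110, 0.106, 0.103, 0.101, 0.100]

lifts = weekly_lifts(control_ctr, treatment_ctr)
slope = lift_slope(lifts)
print([round(l, 3) for l in lifts], round(slope, 4))
```

A clearly negative slope on data like this is the signal that the first-week point estimate overstates the steady-state effect.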
Build a retroactive validation cadence. Once or twice a year, review the portfolio of shipped tests against the long-term outcomes the business actually monetizes. For tests where the relationship can be estimated, calculate the calibration of the short-term proxy as a predictor of the long-term outcome. If the calibration is poor, change the OEC. If the calibration is good for some categories of tests and poor for others, change the OEC by category. The retroactive validation is what closes the feedback loop between the experimentation program and the business.
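A retroactive review of this kind reduces to a small calibration table. The sketch below assumes each shipped test has a short-term proxy lift and a long-term outcome effect estimated from a holdout, and it uses a simplified sign-based definition of agreement, precision, and recall. That definition is an assumption for illustration and is looser than the statistical-decision definitions used in the Netflix evaluation. All numbers are hypothetical.

```python
def calibration_report(tests):
    """tests: list of (proxy_lift, long_term_effect) pairs for shipped tests."""
    tp = sum(1 for p, l in tests if p > 0 and l > 0)    # proxy win, real win
    fp = sum(1 for p, l in tests if p > 0 and l <= 0)   # proxy win, no real win
    fn = sum(1 for p, l in tests if p <= 0 and l > 0)   # proxy missed a real win
    agree = sum(1 for p, l in tests if (p > 0) == (l > 0))
    return {
        "agreement": agree / len(tests) if tests else None,
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }

# Hypothetical portfolio: (short-term proxy lift, long-term outcome effect).
portfolio = [
    (0.15, -0.08), (0.04, 0.03), (0.02, 0.01), (-0.01, 0.02),
    (0.06, -0.01), (-0.03, -0.02), (0.05, 0.04), (0.01, 0.00),
]
report = calibration_report(portfolio)
print(report)
```

Running the same report separately per test category (pricing, onboarding, email, and so on) is one way to see whether the proxy is well calibrated for some categories and poorly calibrated for others.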
Treat OEC selection as a strategic decision. The choice of what metric the experimentation program is optimizing is not a technical default to be inherited from the analytics platform. It is the most consequential decision the program makes, because it determines the direction the product is being tilted by the cumulative portfolio of tests. The Kohavi 2013 guidance --- that the OEC should be a leading indicator of long-term value rather than a narrow short-term metric --- is the high-leverage operational advice.
The Surrogate Metric Trap is not solved by being a more careful experimenter on individual tests. It is solved by building the infrastructure to validate the relationship between the metrics the program is optimizing and the outcomes the business cares about. The programs that have built this infrastructure --- Google’s long-term ad measurement, Microsoft’s OEC discipline, Netflix’s surrogate index --- have published evidence that it materially changes which variants get shipped. The programs that have not built it are running portfolios of tests whose long-term impact is, in the rigorous statistical sense, unknown.
Sources
- Hohnhold, H., O’Brien, D., & Tang, D. (2015). Focusing on the long-term: It’s good for users and business. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’15), 1849-1858. DOI: 10.1145/2783258.2788583. PDF: research.google/pubs/pub43887.
- Prentice, R. L. (1989). Surrogate endpoints in clinical trials: Definition and operational criteria. Statistics in Medicine, 8(4), 431-440. DOI: 10.1002/sim.4780080407. PubMed: 2727467.
- Fleming, T. R., & Powers, J. H. (2012). Biomarkers and surrogate endpoints in clinical trials. Statistics in Medicine, 31(25), 2973-2984. DOI: 10.1002/sim.5403. PubMed: 22711298.
- Athey, S., Chetty, R., Imbens, G. W., & Kang, H. (2019). The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely. NBER Working Paper 26463. URL: nber.org/papers/w26463.
- Kohavi, R., Deng, A., Frasca, B., Walker, T., Xu, Y., & Pohlmann, N. (2013). Online controlled experiments at large scale. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’13), 1168-1176. DOI: 10.1145/2487575.2488217.
- Drutsa, A., Gusev, G., & Serdyukov, P. (2017). Periodicity in user engagement with a search engine and its application to online controlled experiments. ACM Transactions on the Web, 11(2), Article 9. DOI: 10.1145/2856822.
- Tran, A., Hua, Y., Lee, J., & McFarland, C. (2023). Evaluating the surrogate index as a decision-making tool using 200 A/B tests at Netflix. arXiv:2311.11922. URL: arxiv.org/abs/2311.11922.
- Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. ISBN: 9781108724265.
- Dmitriev, P., Frasca, B., Gupta, S., Kohavi, R., & Vaz, G. (2016). Pitfalls of long-running A/B tests. 2016 IEEE International Conference on Big Data. PDF: exp-platform.com/Documents/2016 IEEEBigDataLongRunningControlledExperiments.pdf.
Related
- The Replication Crisis Hub --- 75-article reference on evidence quality, replication failures, and methodological calibration for executives.
- The Peeking Problem in A/B Testing --- why stopping a test early inflates the false-positive rate, and the sequential-testing fix.
- Novelty and Primacy Effects in A/B Testing --- the closely related failure mode where the short-term test result is driven by the novelty of the change rather than its steady-state effect.
- The Winner’s Curse in A/B Testing --- the selection-bias mechanism that compounds with the Surrogate Metric Trap to overstate portfolio impact.
FAQ
How do I validate a surrogate metric for my domain? Run long-term holdouts for six to twelve months, then retrospectively analyze whether the short-term proxy you used as the OEC actually predicted the long-term outcome difference between the holdout and the rolled-out cohorts. The validation is empirical, not theoretical --- the Prentice criterion tells you what a valid surrogate looks like, but whether your specific metric satisfies it in your specific product is a question only the data can answer. Athey et al. (2019) provides a more formal methodology for combining multiple short-term proxies into a single validated index when you have historical data with both proxies and outcomes observed.
What if I cannot run long-term holdouts? The fallback is multi-week measurement on high-stakes tests. Extend the test window to four to eight weeks, measure the trajectory of the short-term proxy over time, and look for divergence from the first-week point estimate. If the lift over control shrinks toward zero across weeks two through four after a strong first week, you are seeing a novelty-driven short-term effect rather than a steady-state effect, and shipping on the first-week number will produce smaller-than-projected production impact. The Hohnhold (2015) cookie-cookie-day methodology is essentially this approach applied at industrial scale.
What about leading versus lagging indicators? The terminology is informal and not quite the same thing as surrogate validation. A “leading indicator” is any metric that moves before the outcome of interest --- it may or may not be causally on the path to the outcome. The Prentice criterion is the formal version of “this leading indicator is actually a valid stand-in for the lagging outcome.” A leading indicator that is not Prentice-valid is informative but not a substitute for the outcome. A leading indicator that is Prentice-valid behaves as a true surrogate. The empirical question is which of your leading indicators are Prentice-valid in your domain, and the answer requires retroactive validation against the long-term outcome.
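One rough diagnostic for the Prentice criterion, once long-term outcomes are available, is to check whether treatment still predicts the outcome after conditioning on the candidate surrogate. The sketch below is a linear-model simplification on synthetic data: the criterion itself concerns the full conditional distribution, a near-zero coefficient in one test does not establish validity across future tests, and the variable names and effect sizes are hypothetical.

```python
import random

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def residuals(x, y):
    """Residuals of y after removing its linear dependence on x."""
    intercept, slope = simple_ols(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

def treatment_effect_given_surrogate(treatment, surrogate, outcome):
    """Coefficient on treatment in outcome ~ treatment + surrogate,
    computed by partialling out the surrogate (Frisch-Waugh-Lovell)."""
    t_res = residuals(surrogate, treatment)
    y_res = residuals(surrogate, outcome)
    return simple_ols(t_res, y_res)[1]

random.seed(1)
n = 4000
treatment = [i % 2 for i in range(n)]
surrogate = [0.5 * t + random.gauss(0, 1) for t in treatment]

# Case A: treatment affects the outcome only through the surrogate.
valid_outcome = [2.0 * s + random.gauss(0, 0.5) for s in surrogate]
# Case B: treatment also harms the outcome through a path the surrogate misses.
leaky_outcome = [2.0 * s - 1.0 * t + random.gauss(0, 0.5)
                 for s, t in zip(surrogate, treatment)]

valid_coef = treatment_effect_given_surrogate(treatment, surrogate, valid_outcome)
leaky_coef = treatment_effect_given_surrogate(treatment, surrogate, leaky_outcome)
print(round(valid_coef, 3), round(leaky_coef, 3))
```

In Case B the surrogate rises under treatment while the residual treatment coefficient is strongly negative, which is the pattern of a leading indicator that is informative but not a valid stand-in for the outcome.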
Should I just measure revenue directly? For some businesses, yes --- if your revenue is measured per-transaction and the relevant transactions occur within the test window, then revenue is the outcome rather than a surrogate, and you can optimize it directly. For most product businesses, no --- the relevant revenue accrues over months to years (subscription LTV, ad-supported user lifetime, repeat-purchase patterns), the test window does not observe it, and you are forced to use surrogates. The question is which surrogates predict the long-term revenue, which is exactly the surrogate-validation question.
Is the Athey surrogate index a complete solution? No. The Athey index is a methodological improvement that, under the Prentice assumption and a comparability assumption between the experimental and historical samples, gives an unbiased estimator of the long-term treatment effect from a combination of short-term proxies. The Prentice assumption is itself a strong condition that may not hold in your domain. The Netflix evaluation at 79% precision / 65% recall on launch decisions is consistent with the index being useful but not perfect. Treat it as a major improvement over single short-term proxies, not as equivalent to direct long-term measurement.
How does the Surrogate Metric Trap interact with the Winner’s Curse? The two mechanisms compound. The Winner’s Curse inflates the magnitude of the short-term test result through selection on the maximum observed lift. The Surrogate Metric Trap then disconnects the inflated short-term result from the long-term outcome through pathways the proxy does not capture. The combined effect is that the headline test result --- the +15% lift the leadership team is briefed on --- can overstate the long-term business impact by a factor that includes both the Winner’s Curse shrinkage (typically 20-50%) and the surrogate-outcome divergence (which can be arbitrarily large, including sign flips). The two adjustments should be applied jointly when calibrating expectations for production impact.
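The joint adjustment can be made concrete with a small range calculation. The sketch below treats the long-term expectation as the observed lift, multiplied by one minus the Winner's Curse shrinkage, multiplied by a surrogate transfer factor. The multiplicative form and the transfer range of -0.5 to 1.0 are illustrative assumptions, not estimates from any cited study; only the 20-50% shrinkage range comes from the text above.

```python
from itertools import product

def long_term_range(observed_lift, shrinkage_range, transfer_range):
    """Range of plausible long-term effects after both adjustments.

    shrinkage_range: (low, high) fraction removed by Winner's Curse shrinkage.
    transfer_range:  (low, high) hypothetical fraction of the shrunk proxy
                     lift that carries over to the long-term outcome;
                     negative values represent a sign flip.
    """
    candidates = [observed_lift * (1 - s) * k
                  for s, k in product(shrinkage_range, transfer_range)]
    return min(candidates), max(candidates)

# Headline +15% lift, 20-50% shrinkage, hypothetical transfer of -0.5 to 1.0.
low, high = long_term_range(0.15, (0.2, 0.5), (-0.5, 1.0))
print(round(low, 4), round(high, 4))
```

Under these assumed ranges the headline +15% maps to a long-term expectation anywhere from roughly -6% to +12%, which is the kind of interval a read-out should carry when the surrogate has not been validated.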
Does this mean A/B testing is fundamentally broken? No. It means that A/B testing is precise about answering the question “does this variant move this short-term metric.” It is not, by itself, precise about answering the question “does this variant move the long-term business outcome.” The long-term answer requires surrogate validation infrastructure that the test alone does not provide. Programs that have invested in this infrastructure --- Google, Microsoft, Netflix, Yandex --- continue to find A/B testing extremely valuable. Programs that have not made the investment are running a precise methodology against a target whose relationship to the business outcome is unverified, which is a different and more limited thing than what they typically believe they are doing.
What is the first thing a CRO or growth lead should do this quarter? Audit the OEC of the program against the long-term outcomes the CEO and CFO are actually trying to grow. If the OEC is “click-through rate” or “signup completion” or “engagement-time” and the business outcome is “annual contract value” or “cohort retention at 12 months” or “lifetime revenue per customer,” the gap between OEC and outcome is the size of the surrogate-validation problem. The audit is cheap and exposes whether the program is currently calibrated. The follow-on investment in long-term holdouts and retroactive validation is the multi-quarter project, but the audit is what tells you whether the project is needed.