Even with perfect statistical hygiene --- no peeking, clean sample-ratio checks, proper multiple-comparison correction --- a five-day test on a UI change is fundamentally measuring “users react to change” rather than “users prefer this design.” Novelty fades within weeks; primacy resolves within months. Weekly-cadence test programs systematically overestimate UI wins and underestimate UI losses on established product surfaces.
It is the Monday standup after a successful A/B test. The growth team is presenting the results of a redesigned checkout button --- new color, new copy, slightly larger hit target. The test ran for seven days on roughly 40,000 visitors per arm. The primary metric, add-to-cart conversion rate, came in at +12.4% with p = 0.003. The statistician on the team confirms the math is clean: no peeking, sample sizes met the pre-registered minimums, no SRM detected, the segment analysis is unremarkable. The PM proposes shipping the new button to 100% of traffic this week. Engineering deploys on Wednesday. Leadership is briefed on Thursday with a +12% headline lift framed as a quarterly highlight.
Four weeks later, the production monitoring shows a +1.8% lift on the same primary metric, below the lower bound of the original test’s 95% confidence interval (roughly +4% to +21%, as implied by p = 0.003) and about seven times smaller than the headline number that leadership remembers. The PM brings the discrepancy to the experimentation review. The team debates the usual suspects: implementation drift, a confounding seasonality effect, an interaction with a parallel marketing push, perhaps some cookie-churn-driven attribution shift. Engineering files a ticket to re-validate the production instrumentation. The +12% lift quietly disappears from the quarterly review deck.
There is nothing wrong with the production environment. There was no implementation drift, no seasonality confound, no interaction effect. The button is doing what it was always going to do --- delivering a real but much smaller lift, because the +12.4% number was never a measurement of how much users prefer the new button. It was a measurement of how much users react to a new button. Those are different quantities, and the literature on novelty and primacy effects in online experiments has been documenting the divergence between them for nearly two decades.
This article exists because novelty and primacy are the two phenomena most likely to silently corrupt an experimentation program that has otherwise solved its statistical hygiene problems. The peeking problem and the Winner’s Curse are sources of false-positive inflation that the careful analyst can control with better procedures. Novelty and primacy are sources of systematic bias that survive even the most rigorous procedural discipline, because the bias is built into the data-generating process itself. The variant really did get +12% lift in week one. The variant really is delivering +2% in steady state. Both numbers are true. The first one is not a measurement of the second.
The structural insight for experimentation leaders is uncomfortable: the speed-to-ship pressure that defines modern growth culture --- weekly test cadences, two-week sprints, the demand for “fast iteration” on the product surface --- is in direct mathematical tension with the measurement requirement for habit-affecting changes. A test program that ships winners on the strength of week-one lifts on an established product surface is systematically biased toward novelty-driven false positives. The fix is not procedural cleanup. The fix is longer test durations, explicit pre-registration that the relevant effect is the post-novelty steady state, and a portfolio-level recalibration of what “the test won” actually means.
Novelty Effects --- Users React To Change Itself
A novelty effect is what happens when a user encounters something new in a familiar product and pays attention to it because it is new, independent of whether the new thing is better. The cognitive mechanism is the orienting response --- the involuntary attentional reaction that draws focus to anything unexpected in a familiar environment. The orienting response is not a preference signal. It is a salience signal. The brain has noticed that something is different from its prior expectation, and it is allocating attention to figure out what changed.
For an A/B test, the implication is that a new button color, a new layout, a new piece of copy, or a new feature placement is going to get more attention from established users in the first days after exposure than it will after those same users have seen it ten or twenty times. The increased attention shows up in the test metrics as increased clicks, increased engagement, increased conversion. A new button color is genuinely more clicked in week one. The clicks come from users registering “oh, that looks different” and investigating, not from users registering “oh, that is better and I want to use it more.”
The mechanism is documented across the cognitive psychology literature as habituation --- the well-established phenomenon that organisms reduce their response to repeated, biologically uninformative stimuli over time. The first time a user encounters the new variant, the orienting response fires at full strength. By the tenth encounter, it has decayed substantially. By the thirtieth, it has typically decayed to baseline. The decay curve is roughly exponential and the time constant depends on the visit frequency of the user and the salience of the change.
For an A/B test, the operational signature of novelty is a treatment effect that decays toward a smaller steady-state value across the test duration. A representative pattern documented in industry experimentation work looks roughly like this: a homepage redesign variant shows +18% click-through in week one, declining to +12% in week two, +7% in week three, +3% in week four, and statistical noise around the control by week six. If the test had been stopped at day seven on the strength of the +18% reading, the declared victory would have been almost entirely novelty-driven. The +3% in week four is much closer to the genuine effect, and the week-six reading suggests that even +3% may overstate it; the +18% in week one is the genuine effect plus a large orienting-response inflation that will not persist in production.
The Kohavi, Tang, and Xu (2020) textbook Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing (Cambridge University Press) frames the phenomenon with the precision that two decades of running Microsoft and Bing experimentation has produced. Their definition: “Novelty effect describes the desire to use new technology that tends to diminish over time. When you introduce a new feature, especially one that’s easily noticed, initially it attracts users to try it. If users don’t find the feature useful, repeat usage will be small. A treatment may appear to perform well at first, but the treatment effect will quickly decline over time.” The Microsoft experimentation team has run thousands of experiments per year for over a decade and has seen the novelty pattern enough times to treat it as a standing diagnostic to apply to any test on an established product surface.
The Larsen, Stallrich, Sengupta, Deng, Kohavi, and Stevens (2024) review “Statistical Challenges in Online Controlled Experiments: A Review of A/B Testing Methodology” (The American Statistician, 78(2), 135–149, DOI: 10.1080/00031305.2023.2257237) places novelty effects in the broader methodology landscape as a non-stationarity problem that the standard A/B testing framework is not equipped to detect without explicit time-series modeling. The classical A/B testing inferential machinery assumes the treatment effect is constant across the test window. Novelty effects violate that assumption in a specific direction --- the effect is large at the beginning and smaller at the end --- and the average effect across the window is a weighted average that puts substantial weight on the inflated initial period.
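The weighting problem can be made concrete with a small sketch. It assumes, purely for illustration, that the daily effect is a steady-state lift plus a novelty term that decays exponentially, and that traffic is equal on every day; the function names and the numbers (+2% steady state, +16% initial novelty bump, 7-day half-life) are hypothetical and not taken from any of the cited papers.

```python
def daily_effect(day, steady, novelty, half_life_days):
    """Assumed model: steady-state lift plus a novelty term that halves every half_life_days."""
    return steady + novelty * 0.5 ** (day / half_life_days)


def window_average(days, steady, novelty, half_life_days):
    """Equal-weight average of the daily effect over days 0..days-1 (equal daily traffic assumed)."""
    total = sum(daily_effect(d, steady, novelty, half_life_days) for d in range(days))
    return total / days


# Illustrative numbers only: +2% steady state, +16% initial novelty bump, 7-day half-life.
for days in (7, 14, 28, 56):
    avg = window_average(days, steady=2.0, novelty=16.0, half_life_days=7.0)
    print(f"{days:>2}-day window: average lift {avg:.1f}% vs steady state 2.0%")
```

Under these assumptions the window average approaches the steady-state value only slowly as the window lengthens, because the inflated early days never drop out of the average.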
The empirical magnitude is large enough to matter. The novelty inflation on highly noticeable UI changes can be a multiple of the steady-state effect. A 2-3x inflation is common; 5x or higher is not unusual on changes that are particularly salient. For a CRO running tests on the homepage, the checkout flow, or the navigation, this is the difference between a portfolio of test wins that holds up in production and a portfolio that systematically shrinks by half or more after rollout.
Primacy Effects --- Established Habits Resist New Variants
The primacy effect is the mirror image of novelty. Where novelty inflates the treatment effect in the early period because users react to change in a positive direction (increased attention, increased engagement), primacy depresses the treatment effect in the early period because users react to change in a negative direction (decreased efficiency, frustration, habitual return to the old behavior).
Primacy is what happens when an established user has built habits around a particular product configuration and a new variant disrupts those habits. The user knows where the “save” button used to be. The user has internalized the previous navigation hierarchy. The user has muscle memory for the checkout flow they have run a hundred times. The new variant requires the user to override those habits and learn the new pattern, and during the learning period the user is genuinely less efficient with the product. Clicks go down. Time-on-task goes up. Conversion drops. The metrics measure the cost of habit disruption, not the merits of the new design.
Kohavi, Tang, and Xu define primacy as: “Primacy effect is when you change something on the site and experienced users may be less efficient until they get used to it. When a change is introduced, users may need time to adopt, as they are primed in the old feature, that is, used to the way it works.” The temporal pattern is the opposite of novelty: the treatment effect is artificially depressed in the early test period and gradually recovers toward the true steady-state value as users adapt to the new variant.
For an A/B test, the primacy signature looks like: variant launches at -8% vs control in week one, recovers to -3% in week two, -1% in week three, and to a small positive lift by week four. If the test had been stopped at day seven on the strength of the -8% reading, the variant would have been declared a loser and rolled back. The decision would have been wrong: the variant is genuinely better, the -8% was the cost of habit disruption, and the true effect (perhaps +2-3%) would have emerged once users adapted.
Primacy is particularly insidious for navigation changes, layout changes, and any modification to a workflow that users run frequently. The more habituated the user base is to the existing flow, the larger the primacy depression will be, and the longer it will take to resolve. For consumer products where users run the same flow daily (email clients, social apps, productivity tools), primacy on a workflow change can last weeks to months. For a B2B SaaS where a power user might run a workflow hundreds of times a week, the primacy effect on a navigation change can be sharply negative for weeks and only resolve once the user’s muscle memory has been retrained.
The Dmitriev, Frasca, Gupta, Kohavi, and Vaz (2016) paper “Pitfalls of Long-Term Online Controlled Experiments” (2016 IEEE International Conference on Big Data, pp. 1367–1376, DOI: 10.1109/BigData.2016.7840744) documents the empirical patterns from Microsoft’s experimentation portfolio and flags primacy as one of the most common sources of false rollback decisions on experiments that, in retrospect, were testing genuinely better variants. The paper’s broader argument is that the literal solution --- “just run the test longer” --- introduces its own pitfalls (cookie churn invalidating up to 75% of long-term cookie-based identification; survivorship bias from differential user attrition between arms; interaction with other concurrent experiments). The corollary is that “just run the test longer” is necessary but not sufficient; the long-running test has to be designed with the long-running pitfalls in mind.
The asymmetry between novelty (early inflation) and primacy (early depression) means that the calibration error from a short test goes in different directions depending on the change. For novel features and changes that increase salience (new buttons, new badges, new notifications), the short-test bias is upward and the program ships variants that look better than they are. For changes that disrupt established habits (navigation reorganizations, flow simplifications, layout consolidations), the short-test bias is downward and the program rolls back variants that would have won at steady state. A program that uses a single test-duration heuristic for both types of change is making errors in both directions simultaneously.
Why Both Effects Resolve With Time
The reason novelty and primacy both fade is that both are anchored to the user’s prior expectations about the product, and prior expectations update with exposure. Novelty fades because the user’s expectation of “what the product looks like” updates to include the new variant; the orienting response stops firing because the new variant is no longer surprising. Primacy fades because the user’s habits update to incorporate the new flow; the muscle-memory cost of overriding the old habit goes away once the new habit is established.
The time constant for novelty decay depends on the visit frequency of the user and the salience of the change. For a daily-use product with a highly salient UI change, the orienting response typically habituates within 5-15 exposures, which translates to roughly 1-3 weeks for most user segments. For a weekly-use product with a moderately salient change, the same habituation might take 4-8 weeks. For an infrequent-use product where a user might encounter the change only every few months, novelty effectively never fades within any practical test duration.
The time constant for primacy resolution depends on the strength of the established habit being disrupted and the frequency with which the user practices the new flow. For a navigation change that a power user encounters dozens of times a day, the muscle-memory retraining typically takes 1-4 weeks. For a change that affects a less-frequent workflow (account settings, billing, profile editing), primacy resolution can extend to months because the user has fewer opportunities to practice the new pattern.
The Hohnhold, O’Brien, and Tang (2015) KDD paper “Focusing on the Long-Term: It’s Good for Users and Business” (KDD ’15, pp. 1849–1858, DOI: 10.1145/2783258.2788583) provides one of the most rigorous empirical analyses of these time constants in production data. Their work was on the long-term effects of ad-load and ad-quality experiments at Google, where the relevant user-learning effect is ads blindness (users habituating to ads and tuning them out over time) and ads sightedness (users learning to perceive and engage with ads after a change). The paper reports that for changes that impact every search results page, they estimate that half the learning has happened after 14-21 days. The implication for test duration is direct: a one-week test captures less than half of the user learning that will eventually occur; a two-week test captures roughly half; meaningful steady-state estimation requires substantially longer test windows. Based on their estimated learning rate (β ≈ 0.012), Google’s long-term desktop experiments typically run for 90 days specifically to allow the user-learning effect to substantially complete before final readout.
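A back-of-envelope sketch shows what a 14-21 day half-life implies for common test lengths. It assumes a simple exponential learning curve parameterized directly by the half-life quoted above; this is an illustrative simplification, not the paper’s fitted model, and it does not use the paper’s β parameter. The function name is hypothetical.

```python
def fraction_learned(days, half_life_days):
    """Share of eventual user learning completed after `days`, assuming exponential learning."""
    return 1.0 - 0.5 ** (days / half_life_days)


# Compare typical test lengths against the two ends of the quoted half-life range.
for half_life in (14, 21):
    row = ", ".join(f"{d}d: {fraction_learned(d, half_life):.0%}" for d in (7, 14, 28, 90))
    print(f"half-life {half_life} days -> {row}")
```

Under this assumption a one-week test sees well under half of the eventual learning at either end of the range, while a 90-day window sees nearly all of it.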
The deeper structural point in Hohnhold et al. is that the short-term effect and the long-term effect on the same metric can be different in both magnitude and direction. An experiment that increases the number of ads on a search results page might increase revenue in the first couple of weeks --- the typical duration of an experiment --- but the opposite effect can emerge months later as user attrition and ads blindness compound. The short-term and long-term measurements are answering different questions. The product decision needs the long-term answer; the typical experimentation infrastructure produces the short-term answer.
The Empirical Magnitude --- Short Versus Long Divergence
The published industry literature provides enough quantitative grounding to set practitioner expectations about how much the short-term and long-term effects typically diverge. The Microsoft experimentation team has documented specific cases where week-1 wins reversed or substantially shrank by week 4 across a range of product surfaces. Google’s ads team has documented short-versus-long divergences of similar or larger magnitude in the ad-load experiments specifically. The Netflix engineering team has published cases where features that looked like clear wins in short-duration tests delivered substantially smaller production lifts once the novelty had decayed.
A useful framing from the Kohavi et al. textbook is that for highly noticeable UI changes on established product surfaces, the rule of thumb is that the steady-state effect is typically 30-70% of the headline short-test effect, with the remainder being novelty inflation. The variance around that range is large --- some changes show near-zero novelty inflation, others show 5x or higher --- but the central tendency is enough that a CRO leader should reflexively discount headline lifts on UI changes by some shrinkage factor when reporting to leadership.
For primacy-affected changes, the framing flips. A change that shows -10% in week one might be a genuine -2% loser, a genuine flat-effect change being measured through primacy noise, or a genuine +3% winner whose value is masked by the habit-disruption cost. Without longer-duration data, the three scenarios are statistically indistinguishable. A program that rolls back any variant showing early negative lift makes the wrong call every time the third scenario occurs.
The empirical work most directly grounded in industry data is the Xu and Chen (2016) KDD paper “Evaluating Mobile Apps with A/B and Quasi A/B Tests” (KDD ’16, DOI: 10.1145/2939672.2939703), which documents the LinkedIn experimentation team’s experience with mobile-app testing where the testing constraints (app review delays, slower release cycles) force longer test durations and surface novelty and primacy effects more visibly than the equivalent desktop or web experimentation infrastructure. The paper’s broader empirical claim is that quasi-experimental approaches with explicit time-series modeling consistently outperform standard A/B test analysis on mobile-app changes specifically because the standard A/B test analysis cannot adequately handle the non-stationary treatment effects that novelty and primacy produce.
The Larsen et al. (2024) review aggregates the broader empirical evidence across the major industry experimentation programs and frames the methodological state of the art as one where the academic statistics literature has not yet fully caught up with the practical methods that the industry teams have developed to handle non-stationary treatment effects. The implication for experimentation leaders is that the standard statistics training is genuinely incomplete on this dimension; the people running the largest experimentation programs in the world have had to develop in-house methods because the textbook A/B testing framework does not address the problem.
Why This Hits Established-Product Changes Hardest
The asymmetry that matters most for experimentation portfolio strategy is that novelty and primacy effects are concentrated in tests on established product surfaces with habituated user bases. They are largely absent in tests on greenfield surfaces where users have no prior expectations to be disrupted.
A new user encountering the checkout flow for the first time has no opinion about whether the button should be on the left or the right. The variant assigned to them is just “what the checkout looks like.” There is no orienting response because there is no prior expectation to be violated; there is no primacy depression because there is no established habit to be disrupted. The treatment effect measured on the new-user segment is a substantially cleaner estimate of the variant’s true value to a user encountering it fresh.
An established user encountering the same A/B-tested variant has a prior expectation of where the button should be (because they have used the product before), and any deviation from that expectation triggers either an orienting response (novelty inflation) or a habit-disruption cost (primacy depression). The treatment effect measured on the established-user segment is the genuine variant value plus the orienting-response or habit-disruption term, and that distortion term is often comparable in size to the genuine effect.
This asymmetry has a practical diagnostic implication: segmenting the test analysis by user tenure (new users vs. returning users vs. long-tenure power users) often reveals novelty and primacy structure that the overall analysis hides. A test that shows +5% overall in week one, split as +2% on new users and +8% on returning users, with the returning-user lift declining to +2% by week four while the new-user lift holds at +2%, has a clear novelty signature concentrated on the established users. A test that shows -3% overall in week one, split as flat on new users and -6% on returning users, with the returning-user effect recovering to flat by week four, has a clear primacy signature concentrated on the established users. The new-user segment, in both cases, is closer to the true steady-state effect because new users do not carry the prior expectations that drive both effects.
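A minimal sketch of the tenure segmentation follows. The row layout, segment labels, function name, and counts are all hypothetical; the counts are chosen so that the output reproduces the novelty example above (+2% on new users throughout, +8% falling to +2% on returning users). It reports point estimates only; a real analysis would attach confidence intervals to each cell.

```python
from collections import defaultdict


def lift_by_segment(rows):
    """rows: (segment, week, arm, visitors, conversions); arm is 'control' or 'treatment'.
    Returns {(segment, week): relative lift of treatment over control, in percent}."""
    totals = defaultdict(lambda: [0, 0])  # (segment, week, arm) -> [visitors, conversions]
    for segment, week, arm, visitors, conversions in rows:
        totals[(segment, week, arm)][0] += visitors
        totals[(segment, week, arm)][1] += conversions
    lifts = {}
    for segment, week, arm in list(totals):
        if arm != "treatment":
            continue
        t_vis, t_conv = totals[(segment, week, "treatment")]
        c_vis, c_conv = totals.get((segment, week, "control"), (0, 0))
        if not (t_vis and c_vis and c_conv):
            continue  # skip cells with no usable control baseline
        lifts[(segment, week)] = 100.0 * ((t_conv / t_vis) / (c_conv / c_vis) - 1.0)
    return lifts


# Made-up counts for illustration.
rows = [
    ("new", 1, "control", 10000, 500), ("new", 1, "treatment", 10000, 510),
    ("returning", 1, "control", 10000, 500), ("returning", 1, "treatment", 10000, 540),
    ("new", 4, "control", 10000, 500), ("new", 4, "treatment", 10000, 510),
    ("returning", 4, "control", 10000, 500), ("returning", 4, "treatment", 10000, 510),
]
for (segment, week), lift in sorted(lift_by_segment(rows).items()):
    print(f"{segment:>9} week {week}: {lift:+.1f}%")
```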
This is why the Kohavi et al. textbook recommends, as a standard diagnostic, plotting the treatment effect over time separately for users at different tenure tiers. The decomposition often reveals that what looks like a single unified treatment effect is actually a mixture of a stable effect on new users and a time-varying effect on established users, with the latter being the source of the overall non-stationarity.
The practical implication for tests on greenfield product surfaces --- new features, new flows, new product lines where the user base is still establishing its habits --- is that novelty and primacy effects are substantially smaller. A test on a brand-new onboarding flow shown to new signups has less novelty distortion than a test on a redesigned dashboard shown to two-year tenured users. The corollary is that the “ship fast on weekly cadences” model is genuinely closer to defensible for greenfield testing than for established-product testing, and the test-duration recommendations should differentiate between the two cases.
Diagnostic Approaches
The standard diagnostic toolkit for novelty and primacy effects has three components, each of which is well-documented in the industry experimentation literature.
The first is the time-series plot of the treatment effect across the test window. The classical A/B test analysis produces a single point estimate (the time-averaged effect across the full test duration) and a confidence interval, which collapses all temporal structure into a single number. The diagnostic alternative is to compute the treatment effect for each successive period (day, week, or natural usage cycle) and plot the trajectory. A monotonically decaying effect is the signature of novelty. A monotonically rising effect (starting negative or low and increasing toward a positive asymptote) is the signature of primacy. A roughly flat trajectory is what an effect with no novelty or primacy distortion looks like, and is the only one of the three patterns that justifies treating the time-averaged estimate as a clean estimate of the steady-state effect.
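As a rough companion to the plot, the three trajectory shapes can be labeled with a simple heuristic. The function name and the `tolerance` noise band are assumptions made for this sketch; it is a screening aid rather than a statistical test, and a real analysis would look at per-period confidence intervals before drawing conclusions.

```python
def classify_trajectory(period_lifts, tolerance=1.0):
    """Label per-period lifts (percent) as 'novelty', 'primacy', 'flat', or 'mixed'.
    `tolerance` is an assumed noise band in percentage points, not a significance test."""
    if len(period_lifts) < 2:
        raise ValueError("need at least two periods")
    if max(period_lifts) - min(period_lifts) <= tolerance:
        return "flat"
    steps = [later - earlier for earlier, later in zip(period_lifts, period_lifts[1:])]
    change = period_lifts[-1] - period_lifts[0]
    if change < -tolerance and all(step <= tolerance for step in steps):
        return "novelty"  # effect shrinks over the window
    if change > tolerance and all(step >= -tolerance for step in steps):
        return "primacy"  # effect recovers over the window
    return "mixed"


print(classify_trajectory([18, 12, 7, 3]))         # decaying weekly lifts
print(classify_trajectory([-8, -3, -1, 2]))        # recovering weekly lifts
print(classify_trajectory([3.2, 2.8, 3.1, 2.9]))   # stable weekly lifts
```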
The second is the user-tenure segmentation. Computing the treatment effect separately for new users, returning users, and long-tenure power users typically reveals the structure described in the preceding section --- novelty and primacy are concentrated in the established users, the new users provide a closer-to-steady-state reading, and the divergence between segments is itself a diagnostic for the magnitude of the time-varying component.
The third is the exposure-day analysis recommended in the Microsoft methodology --- segmenting users by the number of days since their first exposure to the variant rather than by calendar date. The exposure-day analysis controls for the fact that different users entered the test at different times, and re-aligns the data so that “day 1 of exposure” is comparable across all users regardless of when their day 1 happened to fall on the calendar. The treatment effect plotted against exposure-day reveals the user-learning curve more cleanly than the treatment effect plotted against calendar-day, because the calendar-day analysis confounds user-learning with the changing composition of users who have been exposed for varying durations.
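The re-alignment step can be sketched as follows. The event layout and function name are hypothetical, and first exposure is approximated by each user’s earliest logged event; a production pipeline would more likely use the assignment or trigger timestamp. The sketch reports the absolute difference in conversion rate per exposure day.

```python
from collections import defaultdict
from datetime import date


def effect_by_exposure_day(events):
    """events: (user_id, arm, event_date, converted), one row per user-day visit.
    Returns {exposure_day: treatment conversion rate - control conversion rate}."""
    first_seen = {}
    for user_id, _arm, day, _converted in events:
        if user_id not in first_seen or day < first_seen[user_id]:
            first_seen[user_id] = day
    cells = defaultdict(lambda: [0, 0])  # (exposure_day, arm) -> [visits, conversions]
    for user_id, arm, day, converted in events:
        exposure_day = (day - first_seen[user_id]).days + 1  # day 1 = first exposure
        cells[(exposure_day, arm)][0] += 1
        cells[(exposure_day, arm)][1] += int(converted)
    result = {}
    for exposure_day in sorted({key[0] for key in cells}):
        treat = cells.get((exposure_day, "treatment"))
        ctrl = cells.get((exposure_day, "control"))
        if treat and ctrl:
            result[exposure_day] = treat[1] / treat[0] - ctrl[1] / ctrl[0]
    return result


# Made-up events: u3 and u4 enter the test four calendar days after u1 and u2,
# but their first visit still counts as exposure day 1.
events = [
    ("u1", "treatment", date(2024, 1, 1), True),
    ("u1", "treatment", date(2024, 1, 2), False),
    ("u2", "control", date(2024, 1, 1), False),
    ("u2", "control", date(2024, 1, 2), False),
    ("u3", "treatment", date(2024, 1, 5), True),
    ("u4", "control", date(2024, 1, 5), True),
]
print(effect_by_exposure_day(events))
```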
A more rigorous diagnostic, used in the Hohnhold et al. ads-experimentation methodology, is to formally model the treatment effect as a function of exposure days with an explicit user-learning component. The model partitions the observed effect into a stable steady-state component and a learning-decay component, and provides separate estimates and confidence intervals for each. This is more expensive to implement than the visual diagnostics but produces a quantitative estimate of how much of the headline effect is attributable to user learning versus the true steady-state effect, which is the number that should actually drive ship/no-ship decisions on the variant.
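One simple way to approximate this kind of decomposition is to fit a constant plus an exponentially decaying term to the per-exposure-day effects. The sketch below is an assumption-laden illustration, not the Hohnhold et al. methodology: the functional form, the grid of candidate time constants, and the function name are all choices made here, and it returns point estimates only, without the confidence intervals a real analysis would need.

```python
import math


def fit_learning_curve(days, effects, taus=None):
    """Least-squares fit of effect(d) = steady + transient * exp(-d / tau).
    Grid-searches tau; for each tau, steady and transient come from simple linear regression.
    Returns (steady, transient, tau). The exponential form is an assumption."""
    if len(days) != len(effects) or len(days) < 3:
        raise ValueError("need at least three (day, effect) pairs of equal length")
    taus = taus or [t / 2 for t in range(2, 121)]  # candidate time constants: 1.0 .. 60.0 days
    n = len(days)
    best = None
    for tau in taus:
        x = [math.exp(-d / tau) for d in days]
        mean_x, mean_y = sum(x) / n, sum(effects) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        if sxx == 0:
            continue
        transient = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, effects)) / sxx
        steady = mean_y - transient * mean_x
        sse = sum((steady + transient * xi - yi) ** 2 for xi, yi in zip(x, effects))
        if best is None or sse < best[0]:
            best = (sse, steady, transient, tau)
    if best is None:
        raise ValueError("could not fit: effects need to span more than one distinct day")
    return best[1:]


# Synthetic check: data generated from steady=2, transient=10, tau=7 with no noise.
days = list(range(42))
effects = [2.0 + 10.0 * math.exp(-d / 7.0) for d in days]
steady, transient, tau = fit_learning_curve(days, effects)
print(f"steady={steady:.2f}, transient={transient:.2f}, tau={tau:.1f} days")
```

A positive `transient` corresponds to a novelty-style decay and a negative one to a primacy-style recovery; `steady` is the quantity the ship decision cares about.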
The Dmitriev et al. (2016) pitfalls paper documents an important diagnostic failure mode: even with the time-series and exposure-day analyses, the long-running experiment is vulnerable to cookie churn that systematically biases the long-period estimates. Users whose cookies persist throughout the test are not a representative sample of the user base; they are biased toward heavier users, more loyal users, and users on more stable browsing setups. The long-run treatment effect estimated on the cookie-persistent population is not necessarily the same as the long-run treatment effect on the full population, and the gap can be substantial for any user-cohort-correlated outcome. The implication is that the diagnostic for novelty and primacy needs to be paired with a diagnostic for cookie-churn-driven survivorship bias, especially on tests that run for months.
Practical Test-Duration Recommendations
The literature converges on a set of practical guidelines for test duration on changes that may carry novelty or primacy effects.
For UI changes on established product surfaces with daily-use patterns, the minimum defensible test duration is two to four weeks, with the upper end of that range preferred whenever the change is on a particularly salient surface (homepage, primary CTA, navigation). The two-week minimum captures roughly half of the user learning per the Hohnhold et al. estimates; the four-week duration captures most of it for daily-use products with moderately salient changes.
For changes that disrupt established workflows (navigation reorganizations, flow simplifications, task-completion paths that users perform frequently), the recommended duration extends to 4-8 weeks, because primacy resolution on habit-disrupting changes is slower than novelty decay on attention-disrupting changes. Microsoft’s experimentation guidance specifically calls out navigation changes as a category requiring extended test durations.
For changes on infrequently-used surfaces (account settings, billing flows, profile editing, advanced features), the test duration may need to extend to multiple months, because the per-user exposure rate is too low for either novelty or primacy to resolve within a few weeks. The trade-off here is that long test durations increase cookie churn and other long-running-experiment pitfalls; the practical fix is often to run shorter tests on infrequently-used surfaces while explicitly accepting that the short-term reading is a biased estimate and applying a portfolio-level shrinkage to the reported effects.
For changes on greenfield surfaces (new features, new flows, new product lines), the standard test-duration heuristics (one to two weeks for adequate power on the primary metric) are more defensible, because the novelty and primacy distortions are smaller in the absence of established user expectations. The caveat is that “greenfield” should be assessed honestly --- a feature that is new in the test design but visually consistent with an existing product surface may still trigger novelty effects on the surrounding context.
The single most important pre-registration discipline that addresses novelty and primacy is to specify, at the time of test design, that the relevant effect for the ship/no-ship decision is the steady-state effect rather than the time-averaged effect across the test window. This re-frames the test as an estimator for a specific quantity (the post-novelty effect) rather than an estimator for an undefined mixture of novelty, primacy, and steady-state components. The practical implementation is to define the steady-state period in advance (e.g., “days 21 through 35 of the test”) and compute the headline effect on that period specifically, treating the early period as a diagnostic for novelty and primacy structure rather than as input to the headline estimate.
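A minimal sketch of the steady-state readout, assuming daily aggregate counts are available per arm. The data layout, function name, and numbers are hypothetical; the default window matches the “days 21 through 35” example above and would be fixed in the pre-registration rather than chosen after looking at the data.

```python
def steady_state_lift(daily, start_day=21, end_day=35):
    """daily: {test_day: (control_visitors, control_conversions,
                          treatment_visitors, treatment_conversions)}.
    Pools only the pre-registered window [start_day, end_day]; returns relative lift in percent."""
    c_vis = c_conv = t_vis = t_conv = 0
    for day, (cv, cc, tv, tc) in daily.items():
        if start_day <= day <= end_day:
            c_vis += cv
            c_conv += cc
            t_vis += tv
            t_conv += tc
    if not (c_vis and t_vis and c_conv):
        raise ValueError("no usable data inside the steady-state window")
    return 100.0 * ((t_conv / t_vis) / (c_conv / c_vis) - 1.0)


# Made-up data: treatment converts at 5.6% for the first 20 days, then settles at 5.1%;
# control converts at 5.0% throughout.
daily = {d: (1000, 50, 1000, 56 if d < 21 else 51) for d in range(1, 36)}
print(f"full-window lift:  {steady_state_lift(daily, 1, 35):.1f}%")
print(f"steady-state lift: {steady_state_lift(daily):.1f}%")
```

The early days stay in the dataset as diagnostic input; they are simply excluded from the headline number.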
A diagnostic that the Kohavi et al. textbook recommends as a portfolio-level calibration check is to compare the week-1 effect to the week-4 effect across a large sample of historical tests and compute the typical shrinkage ratio empirically. The ratio gives the program a data-driven shrinkage factor to apply to short-test headline numbers in cases where the longer test is not operationally feasible. A program that historically sees a 40% shrinkage from week-1 to week-4 on its test portfolio should be reporting week-1 lifts to leadership with an explicit 40% shrinkage applied, rather than the raw headline number.
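The calibration check reduces to a small computation over historical test results. In the sketch below, the function names and the history are hypothetical, and using the median (rather than the mean) and restricting to tests with a positive week-1 lift are choices made here for robustness, not part of any cited recommendation.

```python
from statistics import median


def retention_ratio(history):
    """history: list of (week1_lift, week4_lift) pairs from past tests, in percent.
    Returns the median share of the week-1 lift still present at week 4."""
    ratios = [week4 / week1 for week1, week4 in history if week1 > 0]
    if not ratios:
        raise ValueError("need at least one historical test with a positive week-1 lift")
    return median(ratios)


def shrunk_lift(week1_lift, ratio):
    """Discount a fresh week-1 headline lift by the historical retention ratio."""
    return week1_lift * ratio


# Made-up history in which week-4 lifts retain about 60% of the week-1 reading,
# i.e. roughly 40% shrinkage.
history = [(10.0, 6.0), (8.0, 5.6), (5.0, 2.5), (12.0, 7.2), (4.0, 2.4)]
ratio = retention_ratio(history)
print(f"retention {ratio:.0%}; a +12.4% week-1 lift reports as {shrunk_lift(12.4, ratio):+.1f}%")
```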
The Org Tension --- Speed To Ship Versus Steady-State Measurement
The structural tension that makes novelty and primacy effects an organizational problem rather than just a methodological one is the misalignment between the test-duration requirement for clean measurement and the cadence pressure that defines most modern growth orgs.
The growth-team operating model that has crystallized over the past decade --- weekly sprint cadences, daily standups, biweekly roadmap reviews, quarterly OKRs that reward shipped features --- is structurally biased toward short test durations. The PM under pressure to ship a feature this quarter has every incentive to call the test on the strength of a week-one reading rather than waiting four weeks for the steady-state effect. The growth lead under pressure to demonstrate experimentation velocity to leadership has every incentive to report the count of tests run per quarter, which directly trades off against test duration. The experimentation platform vendors selling into these orgs have every incentive to make tests feel “faster” by reporting headline numbers as soon as they cross significance thresholds, which directly conflicts with the discipline of waiting for novelty to decay.
The mathematical consequence of the cadence pressure is that a program optimized for “tests per quarter” is systematically biased toward false positives on novelty-affected changes and false negatives on primacy-affected changes. The effect compounds across the portfolio: every shipped novelty-driven false positive contributes a small amount of credit to the next quarter’s “tests we won” count and a small amount of debit to the production-vs-test calibration gap. Over years, the calibration gap becomes substantial; the portfolio of “wins” the program has accumulated does not reproduce in production aggregate metrics; leadership begins to question whether experimentation is contributing real value.
The organizational fix is not procedural; it is incentive-level. The metrics that the experimentation program is held accountable to need to include production-vs-test calibration as a first-class outcome, not just test counts and headline lifts. The cadence the org runs needs to accommodate the longer test durations that habit-affecting changes require, rather than forcing all tests into the same one-to-two-week mold. The PM career incentives need to reward “the variant I shipped held up in production” rather than just “the variant I shipped had a green test.” These are organizational changes more than methodological ones; the methodology fixes (longer tests, steady-state pre-registration, exposure-day diagnostics) are well-documented but go unimplemented because the organizational incentives push against them.
The title of the Hohnhold et al. paper, “Focusing on the Long-Term: It’s Good for Users and Business,” is itself a piece of organizational advocacy from the Google ads team, arguing that org culture should privilege long-term measurement over short-term test velocity. That this argument had to be made at a major industry conference, by senior researchers at a company that helped pioneer modern online controlled experimentation, shows how strong the speed-to-ship pressure is and how rare long-term measurement discipline remains even at organizations that should know better.
What This Means For Your Experimentation Program
For a CRO, growth lead, or experimentation team owner, the practical implications of the novelty and primacy literature compress into a small number of program-level changes.
The first is calibration on UI tests. If the program is currently running one-to-two-week tests on UI changes and reporting raw headline lifts to leadership, the headline numbers are systematically overstated by a factor that depends on the salience of the change and the established-ness of the user base, but typically ranges from 1.5x to 3x for highly noticeable changes on familiar surfaces. The fix is either to extend the test duration to four or more weeks (which leadership will resist on cadence grounds) or to apply an explicit shrinkage factor to short-test results (which leadership will resist on accuracy grounds). Either fix is preferable to the status quo of reporting numbers that systematically fail to replicate in production.
The second is differential test-duration policy by change type. The program should distinguish between tests on greenfield surfaces (one-to-two-week minimums are defensible), tests on established UI surfaces with daily-use patterns (two-to-four-week minimums), tests on established workflows with strong habit formation (four-to-eight-week minimums), and tests on infrequently used surfaces (months, or explicit acceptance of biased short-term readings). A single cadence policy applied uniformly across all test types will be systematically wrong on the changes where the cadence is mismatched to the user-learning time constant.
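One way to make such a policy enforceable is to encode it as a lookup that test plans are checked against before launch. The sketch below is a minimal version of that idea; the category names, the `check_duration` helper, and the three status strings are hypothetical, and the week ranges are simply the ones listed above.

```python
from enum import Enum


class SurfaceType(Enum):
    GREENFIELD = "greenfield"
    ESTABLISHED_DAILY_USE = "established_daily_use"
    HABITUAL_WORKFLOW = "habitual_workflow"
    INFREQUENT_USE = "infrequent_use"


# Minimum duration range in weeks (low, high) for each change type.
# None means there is no fixed weekly minimum: the test either runs for
# months or the team explicitly accepts a biased short-term reading.
MIN_WEEKS = {
    SurfaceType.GREENFIELD: (1, 2),
    SurfaceType.ESTABLISHED_DAILY_USE: (2, 4),
    SurfaceType.HABITUAL_WORKFLOW: (4, 8),
    SurfaceType.INFREQUENT_USE: None,
}


def check_duration(surface, planned_weeks):
    """Return 'ok', 'too_short', or 'needs_review' for a planned test."""
    policy = MIN_WEEKS[surface]
    if policy is None:
        return "needs_review"
    low, _high = policy
    return "ok" if planned_weeks >= low else "too_short"


print(check_duration(SurfaceType.GREENFIELD, 1))
print(check_duration(SurfaceType.HABITUAL_WORKFLOW, 2))
print(check_duration(SurfaceType.INFREQUENT_USE, 4))
```

The check only enforces the lower end of each range; whether to require the upper end for high-salience changes is a judgment call the lookup deliberately leaves to the review.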
The third is exposure-day and tenure-segmented analysis as a standard diagnostic. Adding the segmentation to the test analysis is cheap once the data infrastructure is in place and produces immediate visibility into which tests have time-varying effects worth investigating further. The diagnostic does not require a methodology change; it requires a reporting change. The headline test report should include not just the time-averaged effect but the effect-over-time trajectory and the effect-by-tenure segmentation.
The fourth is portfolio-level production-vs-test calibration as an ongoing measurement. The program should be tracking, across its full test portfolio, the relationship between headline test effects and post-launch production effects. The relationship is the program’s empirical shrinkage factor and is the single most important calibration number the program can produce. A program that does not measure this number does not actually know how trustworthy its tests are.
The fifth is the cultural recalibration of “the test won.” The phrase, as currently used in most growth orgs, conflates several distinct claims: the variant achieved statistical significance during the test window; the variant’s time-averaged effect was positive and larger than the minimum detectable effect; the variant will deliver comparable production impact at scale. These claims are different. The first does not imply the third. The novelty and primacy literature is the formal demonstration of why. A program that treats them as equivalent is making predictably wrong calls on a meaningful fraction of its portfolio.
The uncomfortable summary for experimentation leaders is that weekly-cadence test programs systematically overestimate UI wins on established product surfaces and systematically understate the steady-state value of habit-disrupting changes, whose early readings are depressed by primacy. The fix is not to find a better short-test methodology; the fix is to extend the test duration enough to allow user learning to substantially complete, or to explicitly accept that short tests produce biased estimates and apply portfolio-level shrinkage. The cost is reduced test velocity; the benefit is a portfolio of test wins that holds up in production and an experimentation program that retains organizational credibility over multi-year time horizons.
Novelty and primacy effects are not a bug in A/B testing methodology. They are a fundamental feature of how users respond to changes in familiar products. The methodology fix is to measure the right quantity (the steady-state effect) rather than the convenient quantity (the early time-averaged effect). The organizational fix is to align incentives with that measurement. The growth orgs that figure out both will accumulate, over years, a portfolio of genuinely replicated wins; the ones that do not will accumulate a portfolio of headline numbers that quietly fail to materialize in the production data, with all the downstream loss of organizational trust in experimentation that implies.
Sources
- Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. ISBN 978-1108724265.
- Kohavi, R., Henne, R. M., & Sommerfield, D. (2007). “Practical guide to controlled experiments on the web: Listen to your customers not to the HiPPO.” KDD ’07. DOI: 10.1145/1281192.1281295.
- Larsen, N., Stallrich, J., Sengupta, S., Deng, A., Kohavi, R., & Stevens, N. T. (2024). “Statistical Challenges in Online Controlled Experiments: A Review of A/B Testing Methodology.” The American Statistician, 78(2), 135–149. DOI: 10.1080/00031305.2023.2257237.
- Xu, Y., & Chen, N. (2016). “Evaluating Mobile Apps with A/B and Quasi A/B Tests.” KDD ’16. DOI: 10.1145/2939672.2939703.
- Hohnhold, H., O’Brien, D., & Tang, D. (2015). “Focusing on the Long-Term: It’s Good for Users and Business.” KDD ’15, pp. 1849–1858. DOI: 10.1145/2783258.2788583.
- Dmitriev, P., Frasca, B., Gupta, S., Kohavi, R., & Vaz, G. (2016). “Pitfalls of Long-Term Online Controlled Experiments.” 2016 IEEE International Conference on Big Data, pp. 1367–1376. DOI: 10.1109/BigData.2016.7840744.
- Sadeghi, S., Gupta, S., Gramatovici, S., Lu, J., Ai, H., & Zhang, R. (2022). “Novelty and Primacy: A Long-Term Estimator for Online Experiments.” Technometrics, 64(4). DOI: 10.1080/00401706.2022.2124309. arXiv:2102.12893.
- Tang, D., Agarwal, A., O’Brien, D., & Meyer, M. (2010). “Overlapping Experiment Infrastructure: More, Better, Faster Experimentation.” KDD ’10. DOI: 10.1145/1835804.1835810.
Related
- Replication Crisis Hub --- the full taxonomy of replication, reproducibility, and measurement failures across psychology, economics, and applied experimentation.
- The Peeking Problem in A/B Testing --- the statistical mistake that inflates the false-positive rate to 40%+ in any test where the practitioner watches the dashboard.
- The Winner’s Curse in A/B Testing --- the structural reason headline lifts on shipped variants are systematically larger than the production effect, even with perfect statistical hygiene.
- Simpson’s Paradox in A/B Testing --- how segment-level effects can run in the opposite direction of the aggregate effect, and what that means for ramp-up analyses.
- Multiple Comparisons in A/B Testing --- the false-discovery-rate inflation that happens when test programs analyze many secondary metrics without correction.
FAQ
What’s the right test duration to control for novelty and primacy effects?
The answer depends on the change type and user base. For UI changes on established product surfaces with daily-use patterns, two to four weeks is the minimum defensible range, with four weeks preferred. For habit-disrupting workflow changes, extend to four to eight weeks. For greenfield surfaces with no established user expectations, one to two weeks remains defensible. A useful anchor is the estimate from Hohnhold, O’Brien, and Tang (2015) that half of user learning happens within 14-21 days: a one-week test captures less than half the learning, and meaningful steady-state estimation typically requires three or more weeks for daily-use products.
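To get a feel for what that anchor implies, the sketch below assumes user learning follows a simple exponential curve with a half-life equal to the 14-21 day figure. The exponential form is an assumption made purely for illustration; it is not a model taken from Hohnhold et al., and real learning curves may have a different shape.

```python
def learning_fraction(days_exposed, half_life_days):
    """Fraction of user learning completed after `days_exposed`, assuming
    exponential learning with the given half-life (illustrative only)."""
    if days_exposed < 0 or half_life_days <= 0:
        raise ValueError("days_exposed must be >= 0 and half_life_days > 0")
    return 1.0 - 2.0 ** (-days_exposed / half_life_days)


# Compare common test durations against both ends of the 14-21 day range.
for half_life in (14, 21):
    for test_days in (7, 14, 28, 56):
        done = learning_fraction(test_days, half_life)
        print(f"half-life {half_life}d, test {test_days}d: {done:.0%} learned")
```

Under this assumption a seven-day test has seen roughly 20-30% of the eventual learning, and even a four-week test has seen only about 60-75%, which is why the early-period reading is better treated as a diagnostic than as the estimate.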
How do I separate the novelty effect from the real treatment effect?
The most direct method is to compute the treatment effect for each successive period (day or week) and plot the trajectory. A monotonically decaying effect is the signature of novelty; a roughly flat trajectory is consistent with no novelty distortion. The Microsoft methodology recommends pairing the time-series plot with a user-tenure segmentation: novelty and primacy concentrate in established users, while new users provide a closer-to-steady-state reading. For a quantitative estimate, the exposure-day analysis (segmenting by days-since-first-exposure rather than calendar date) and explicit user-learning models like those in Hohnhold et al. produce separate estimates for the steady-state component and the learning-decay component.
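A minimal sketch of the trajectory diagnostic follows. It assumes the per-period counts have already been aggregated upstream, one row per week since first exposure; the data, the function names, and the crude shape labels are hypothetical, and the sketch omits the confidence intervals a real report would need before reading anything into the shape.

```python
def relative_lift(c_conv, c_n, t_conv, t_n):
    """Relative lift of the treatment conversion rate over control."""
    c_rate = c_conv / c_n
    t_rate = t_conv / t_n
    return (t_rate - c_rate) / c_rate


def lift_trajectory(period_counts):
    """One relative lift per period, in the order the periods are given."""
    return [relative_lift(*row) for row in period_counts]


def classify_trajectory(lifts, tolerance=0.005):
    """Crude shape label: 'decaying', 'rising', or 'flat_or_mixed'."""
    steps = list(zip(lifts, lifts[1:]))
    change = lifts[-1] - lifts[0]
    if change < -tolerance and all(b <= a + tolerance for a, b in steps):
        return "decaying"  # signature of novelty
    if change > tolerance and all(b >= a - tolerance for a, b in steps):
        return "rising"  # consistent with primacy wearing off
    return "flat_or_mixed"


# Hypothetical counts per exposure week:
# (control conversions, control users, treatment conversions, treatment users)
WEEKLY = [
    (1000, 10000, 1120, 10000),
    (1000, 10000, 1070, 10000),
    (1000, 10000, 1040, 10000),
    (1000, 10000, 1020, 10000),
]

lifts = lift_trajectory(WEEKLY)
print([f"{x:+.1%}" for x in lifts], classify_trajectory(lifts))
```

Running the same functions separately on new-user and established-user rows gives the tenure segmentation: if the decay appears only in the established-user trajectory, that points toward novelty rather than a genuinely shrinking effect.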
What about brand-new products with no existing user base?
Novelty and primacy effects are substantially smaller on greenfield surfaces because they require established user expectations to operate. A new product with no prior user base has no orienting response to be triggered (everything is new) and no established habits to be disrupted. Standard one-to-two-week test durations are more defensible in this case. The caveat is honest assessment of what “greenfield” means: a feature that is new in the test design but visually consistent with an existing product surface may still trigger novelty effects on the surrounding context.
What about ad networks and recommendation systems where the algorithm itself learns?
The user-learning effects compound with algorithm-learning effects in this case. The treatment-arm algorithm is updating its model on the variant-exposed users; the model-learning curve and the user-learning curve interact. Hohnhold et al.’s ad-load experiments at Google had to model both effects simultaneously and the recommended test durations there are substantially longer (90 days for desktop) precisely because of the interaction. For programmatic ad systems and recommendation engines, the test duration should accommodate both user habituation and algorithm convergence, and the steady-state estimate should be computed after both have substantially stabilized.
Should I just always run longer tests?
Longer tests carry their own pitfalls, as documented in Dmitriev et al. (2016). Cookie churn invalidates user identification for up to 75% of users over multi-month durations. Survivorship bias from differential user attrition between arms can corrupt long-period estimates. Interactions with other concurrent experiments compound over time. The pragmatic recommendation is to match test duration to the user-learning time constant of the specific change being tested, rather than defaulting to either very short or very long durations. A change with a one-to-two-week learning constant should run for at least three to four weeks; a change with a multi-month learning constant should consider quasi-experimental approaches like those in Xu and Chen (2016) rather than purely classical A/B testing.
Are novelty effects always positive (inflating the early reading)?
No. Novelty in the strict sense of orienting-response inflation tends to be positive on attention-related metrics (clicks, engagement, time-on-page). But the broader category of early-period distortions includes primacy effects, which are negative on efficiency-related metrics (task completion time, navigation success, conversion). A short test can be biased upward (novelty-dominated), biased downward (primacy-dominated), or roughly clean (no significant early distortion), and the direction depends on the change type. The standard diagnostic --- plot the effect over time --- discriminates among the three cases without requiring strong priors about which is operating.
How does pre-registration help with novelty and primacy specifically?
Pre-registering the specific quantity being estimated --- the steady-state effect, defined on a specified post-novelty window (e.g., days 21-35) --- prevents the post-hoc rationalization where the analyst picks the time period that produces the most favorable headline number. Without pre-registration, the analyst who sees a +15% week-1 effect and a +3% week-4 effect has substantial discretion in deciding which to report; with pre-registration of the steady-state window, the headline number is fixed in advance and the early period becomes a diagnostic rather than a competing estimate.
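In code, the pre-registration amounts to fixing the window before the data arrives and computing the headline number only from it. The sketch below assumes per-exposure-day counts are available as a dictionary; the `SteadyStateWindow` type, the `windowed_lift` helper, and the data are hypothetical and constructed to mirror the +15% versus +3% example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyStateWindow:
    """Exposure-day window fixed before launch (inclusive on both ends)."""
    first_day: int = 21
    last_day: int = 35


def windowed_lift(daily, window):
    """Relative lift from counts pooled over the window.

    `daily` maps exposure day -> (control conversions, control users,
    treatment conversions, treatment users).
    """
    rows = [
        counts for day, counts in daily.items()
        if window.first_day <= day <= window.last_day
    ]
    if not rows:
        raise ValueError("no data inside the pre-registered window")
    c_conv, c_n, t_conv, t_n = (sum(col) for col in zip(*rows))
    c_rate = c_conv / c_n
    t_rate = t_conv / t_n
    return (t_rate - c_rate) / c_rate


def _treatment_conversions(day):
    # Hypothetical decay: strong in week 1, weaker later, small by day 21.
    if day <= 7:
        return 115
    return 108 if day <= 20 else 103


DAILY = {day: (100, 1000, _treatment_conversions(day), 1000) for day in range(1, 36)}

headline = windowed_lift(DAILY, SteadyStateWindow())       # pre-registered
diagnostic = windowed_lift(DAILY, SteadyStateWindow(1, 7))  # week 1, diagnostic only
print(f"headline {headline:+.1%}, week-1 diagnostic {diagnostic:+.1%}")
```

Making the window a frozen value that is committed alongside the test plan is the point of the design: the analyst can still look at week 1, but cannot quietly promote it to the headline.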
Can I just use new users only as the test population to avoid novelty/primacy effects?
This is a defensible methodology choice and is used in some experimentation programs specifically for the cleaner steady-state reading. The cost is that the test result then applies only to the new-user population and may not generalize to the established-user population, where the production impact will be measured at scale. For changes where the test goal is to estimate the full-population production impact, the new-user-only approach is a partial measurement; for changes where the test goal is to estimate the post-habituation effect that the established population will eventually settle into, the new-user-only reading is often a reasonable proxy for that quantity.