Standard A/B testing has a critical assumption that most practitioners never question: what happens to user A does not affect user B. On social platforms, that assumption is spectacularly wrong. When I first ran experiments on a product with social features, I learned this the hard way — my results looked clean, my analysis was rigorous, and my conclusions were completely wrong.
If you are building experiments for any product where users interact with each other — social networks, marketplaces, messaging apps, collaborative tools — you need to understand interference before you run a single test.
The Independence Assumption Nobody Talks About
Traditional A/B testing relies on what statisticians call the Stable Unit Treatment Value Assumption, or SUTVA. In plain language: one user's treatment assignment does not affect another user's outcome.
SUTVA is what makes the math work. When you calculate a confidence interval or a p-value, you are assuming that your treatment group and control group are independent samples from the same population. Each user's outcome depends only on their own treatment assignment.
For most product experiments — testing a button color, a new checkout flow, a pricing page layout — SUTVA holds reasonably well. User A seeing a green button does not change user B's experience in any meaningful way.
But on social platforms, SUTVA is violated constantly and in ways that are not always obvious. Understanding validity threats (/blog/posts/ab-testing-external-validity-threats) is critical when your experimental design has structural flaws this fundamental.
The Interference Problem
Interference comes in several flavors, each capable of ruining your experiment in different ways.
Direct interference happens when treated users interact directly with control users. If you test a new messaging feature, treated users send messages using the new interface to control users who receive them through the old interface. The control group's experience is now contaminated by the treatment.
Indirect interference operates through market-level effects. On a marketplace like Uber, if you test a pricing algorithm on half your riders, you change supply-demand dynamics that affect all riders — including the control group. A treated rider who sees lower prices takes a ride, which removes a driver from the supply pool, which raises wait times for control riders.
Viral interference is the subtlest. You test a new share button that makes sharing easier. Treated users share more content. That shared content appears in the feeds of control users. Now your control group is seeing more shared content — an indirect effect of a treatment they were never assigned to.
Here is a concrete example that illustrates how bad this gets. You test a new share button on your platform. Treated users share 20% more content. Control users see 15% more shared content in their feeds as a result. Your experiment measures the direct effect of the share button on treated users, but completely misses the contamination of the control group. The true effect is larger than what you measure, but you have no way to know how much larger from a standard A/B test.
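The arithmetic of that bias is easy to see in a toy simulation. The effect sizes below are assumptions chosen to mirror the example, not measurements from any real platform:

```python
# Toy simulation of control-group contamination. All effect sizes are
# illustrative assumptions, not real data.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

baseline = 10.0        # shares viewed per user per week, absent any experiment
direct_effect = 2.0    # true lift for treated users
spillover = 1.5        # lift leaking into the control group via shared content

treated = rng.normal(baseline + direct_effect, 3.0, n)
control = rng.normal(baseline + spillover, 3.0, n)    # contaminated control
clean_control = rng.normal(baseline, 3.0, n)          # true counterfactual

measured = treated.mean() - control.mean()            # what the A/B test reports
true_effect = treated.mean() - clean_control.mean()   # what you actually caused
print(f"measured lift: {measured:.2f}")   # ~0.5, badly understating
print(f"true lift:     {true_effect:.2f}")  # ~2.0
```

The naive treated-versus-control comparison recovers only the gap between two partially lifted groups, not the gap against a world where the feature never shipped.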
Real-World Examples of Interference
Every major platform has dealt with this problem. Understanding how they have approached it illuminates why standard user-level randomization breaks down.
LinkedIn faces interference in nearly every experiment. Testing a new feed algorithm changes which content gets visibility. Even if only treated users see the new feed, the content created by treated users (who are now incentivized differently by the new algorithm) reaches control users' feeds. The content ecosystem is a shared resource.
Uber and Lyft have perhaps the most acute interference problem. Testing pricing for riders affects driver supply, which affects wait times for all riders. Testing driver incentives affects supply distribution, which affects the marketplace for all riders. Every experiment on one side of the marketplace spills over to the other.
Facebook discovered that testing feed changes affected not just content consumption but content creation incentives. When treated users engaged differently, content creators changed their behavior, which changed what control users saw. The feedback loop made simple user-level experiments unreliable.
Airbnb deals with inventory interference. Testing search ranking for guests changes which listings get visibility, which affects booking patterns for all guests. A listing booked by a treated user is unavailable for control users.
How Big Platforms Solve This
The solutions to interference all involve changing the unit of randomization from individual users to larger groups where interference is contained.
Cluster-Randomized Experiments
The most theoretically sound approach is cluster randomization. Instead of randomizing individual users, you randomize groups of users who interact heavily with each other.
Users within a cluster all get the same treatment. Interference happens within clusters (which is fine — everyone in the cluster has the same treatment) but not between them (because cross-cluster interaction is minimal).
This requires mapping the social graph and identifying natural clusters. The quality of your clusters directly determines the quality of your experiment. Good clusters have high internal interaction and low external interaction.
The tradeoff: you have far fewer independent units (clusters instead of users), which means lower statistical power. You might have millions of users but only hundreds of clusters, and your effective sample size is much closer to the number of clusters than the number of users — exactly how close depends on the intra-cluster correlation of your metric.
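In practice, cluster-level assignment often reuses the same deterministic hashing that powers user-level bucketing, just keyed on the cluster ID instead of the user ID. A minimal sketch, assuming clusters have already been extracted from the social graph (the cluster_of mapping and experiment name are illustrative):

```python
# Sketch of deterministic cluster-level assignment. Cluster IDs and the
# experiment name are hypothetical; clusters are assumed to come from a
# prior community-detection step on the social graph.
import hashlib

def assign_cluster(cluster_id: str, experiment: str,
                   treat_fraction: float = 0.5) -> str:
    """Hash the (experiment, cluster) pair into [0, 1] and bucket it."""
    digest = hashlib.sha256(f"{experiment}:{cluster_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "treatment" if bucket < treat_fraction else "control"

# Every user inherits their cluster's arm, so interference between
# heavily-interacting users stays inside a single treatment arm.
cluster_of = {"alice": "c1", "bob": "c1", "carol": "c2"}
arms = {user: assign_cluster(c, "new_share_button")
        for user, c in cluster_of.items()}
```

The hash makes assignment stable across sessions and services without any shared state, which is why platforms favor it over storing random draws.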
Geo-Based Experiments
LinkedIn and Uber pioneered the geo-based approach: randomize by geographic region instead of by user. Different cities or metro areas get different treatments.
This works when cross-geography interaction is limited — a rider in San Francisco does not affect a rider in Chicago. It is simple to implement and avoids the complexity of social graph clustering.
The downsides: geographic regions differ in ways that create confounding variables (population density, demographics, market maturity). You also have very few randomization units — maybe 50 to 100 metro areas — which limits statistical power and the number of experiments you can run simultaneously.
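One common mitigation for geographic confounding is a matched-pair design: pair regions on a pre-period metric, then randomize within each pair so that each arm gets one of every kind of market. A sketch with illustrative metro data:

```python
# Sketch of a matched-pair geo design. Metro names and pre-period metrics
# are illustrative, not real market data.
import random

metros = {"SF": 9.2, "NYC": 9.0, "Austin": 4.1, "Denver": 4.0,
          "Tampa": 2.2, "Omaha": 2.1}  # e.g. pre-period rides/week (millions)

# Sort by the pre-period metric and pair adjacent metros.
ordered = sorted(metros, key=metros.get, reverse=True)
pairs = [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered), 2)]

rng = random.Random(3)
assignment = {}
for a, b in pairs:
    # Flip a coin within each pair: one metro treated, its twin as control.
    t, c = (a, b) if rng.random() < 0.5 else (b, a)
    assignment[t], assignment[c] = "treatment", "control"
```

Pairing does not manufacture more randomization units, but it removes the between-pair variance from the comparison, which is often the bulk of the noise.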
Time-Based Experiments
The simplest approach: different time periods get different treatments. Run the control for a week, then the treatment for a week, compare.
This works when the metric has limited carryover effects — what happened last week does not much affect this week's outcomes. It is easy to implement but has low statistical power and is vulnerable to time-based confounders (seasonality, news events, competitor actions).
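A common refinement is the switchback design: instead of one long control week and one long treatment week, treatment flips across many short windows, which spreads time-based confounders across both arms. A sketch, with the window length and date range as assumptions:

```python
# Sketch of a switchback schedule: randomize the arm per time window
# rather than per user. Window length and dates are illustrative.
import random
from datetime import datetime, timedelta

def switchback_schedule(start: datetime, end: datetime,
                        window_hours: int = 4, seed: int = 42) -> dict:
    """Map each window start time to an arm, drawn independently per window."""
    rng = random.Random(seed)
    schedule = {}
    t = start
    while t < end:
        schedule[t] = rng.choice(["treatment", "control"])
        t += timedelta(hours=window_hours)
    return schedule

sched = switchback_schedule(datetime(2024, 1, 1), datetime(2024, 1, 8))
# Analysis compares window-level metrics; a common practice is to drop the
# start of each window so carryover from the previous arm washes out.
```

The unit of analysis is the window, not the user, so power depends on how many windows you can afford given carryover.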
Ghost Ads Experiments
Researchers at Google developed the ghost ads methodology for measuring ad effectiveness. Instead of showing nothing to the control group, the system shows the ad that would have won the auction if the treatment ad did not exist — and, crucially, logs which control users would have been exposed to the treatment ad.
This measures the incremental impact of the actual ad compared to the counterfactual, which is far more meaningful than comparing an ad to no ad. The insight transfers beyond advertising: always ask what the alternative experience would have been, not just whether the treatment exists.
Spillover Effects: Measuring What Leaks
Sometimes you do not want to eliminate interference — you want to measure it. Spillover effects are often the most interesting finding in an experiment.
The elegant approach is two-stage randomization. First, randomize clusters into treatment and control. Then, within treatment clusters, randomize individuals into direct treatment and indirect exposure. This lets you measure three things: the direct effect of treatment, the spillover effect on untreated users in treated clusters, and the total effect.
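The two-stage assignment can be sketched directly, with cluster membership and split fractions as illustrative assumptions:

```python
# Sketch of two-stage randomization: clusters first, then individuals
# within treated clusters. Cluster data and fractions are illustrative.
import random

def two_stage_assign(clusters: dict, seed: int = 7,
                     cluster_frac: float = 0.5,
                     within_frac: float = 0.5) -> dict:
    rng = random.Random(seed)
    arms = {}
    for cluster_id, members in clusters.items():
        if rng.random() < cluster_frac:      # stage 1: cluster is treated
            for user in members:             # stage 2: split inside it
                arms[user] = ("direct" if rng.random() < within_frac
                              else "indirect")  # exposed only via neighbors
        else:
            for user in members:
                arms[user] = "pure_control"  # fully untreated cluster
    return arms

# direct   vs pure_control -> total effect on treated users
# indirect vs pure_control -> spillover effect on their untreated neighbors
arms = two_stage_assign({"c1": ["a", "b"], "c2": ["c", "d"]})
```

The "indirect" users are the key: they receive no treatment themselves, yet live in a treated cluster, so any difference from pure controls is spillover by construction.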
A simpler detection method: compare control users who are connected to many treated users versus control users who are connected to few treated users. If there is a difference, you have spillover, and the magnitude tells you how much your standard experiment is biased.
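The exposure-based check can be sketched on synthetic data, where the spillover per treated friend is an assumption baked into the data-generating process (a real analysis must also control for degree, since well-connected users differ systematically from isolated ones):

```python
# Sketch of the exposure-based spillover check on synthetic data. The
# 0.2-per-treated-friend spillover is an assumed data-generating process.
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
treated_friends = rng.poisson(3, n)  # treated neighbors per control user
# Assumption: each treated friend adds 0.2 to a control user's outcome.
outcome = 10 + 0.2 * treated_friends + rng.normal(0, 2, n)

high = outcome[treated_friends >= 5].mean()  # heavily exposed controls
low = outcome[treated_friends <= 1].mean()   # lightly exposed controls
print(f"spillover gap: {high - low:.2f}")    # nonzero gap -> contamination
```

A materially nonzero gap between the two groups is direct evidence that your "control" group is not a clean counterfactual.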
For a deeper look at decomposing these effects across user groups, see the guide to segmentation approaches (/blog/posts/ab-testing-segmentation-targeting-heterogeneous-effects).
When Network Effects Are the Feature, Not the Bug
Sometimes interference is exactly what you are trying to create and measure. Viral features, referral programs, and network-effect products are designed so that treated users affect others.
If you are testing a new referral mechanism, the whole point is that treated users recruit control users. Measuring only the direct effect on treated users misses the entire value proposition.
Measuring viral lift requires experimental designs that account for interference rather than trying to eliminate it. You need to estimate the viral coefficient — how many additional users each treated user affects — and that requires tracking the cascade of influence through the network.
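A back-of-envelope version of the viral coefficient, with invites_per_user and conversion_rate as illustrative inputs rather than benchmarks:

```python
# Back-of-envelope viral coefficient math. Input numbers are illustrative.

def viral_coefficient(invites_per_user: float, conversion_rate: float) -> float:
    """k = new users each existing user ultimately recruits per generation."""
    return invites_per_user * conversion_rate

def total_users_from_cohort(seed_users: int, k: float,
                            generations: int = 20) -> float:
    """Total users a seed cohort generates as the cascade decays (k < 1)."""
    total, current = 0.0, float(seed_users)
    for _ in range(generations):
        total += current
        current *= k   # each generation recruits k times the previous one
    return total

k = viral_coefficient(invites_per_user=2.0, conversion_rate=0.25)  # k = 0.5
cohort = total_users_from_cohort(1000, k)  # ~2000: each seed user counts 1/(1-k)
```

The geometric-series view is why even sub-viral features (k well below 1) can multiply the measured direct effect: at k = 0.5, every treated user is ultimately worth two.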
Political campaigns (/blog/posts/political-ab-testing-campaign-optimization-lessons) face a similar challenge: the messages voters receive get discussed with friends and family, creating interference that is both a measurement challenge and the primary mechanism of persuasion.
The Offline Measurement Challenge
A growing area of research asks whether online experiments affect real-world behavior beyond the platform. LinkedIn has studied whether online A/B tests that change connection recommendations actually affect offline networking behavior — do people meet in person more often when the platform connects them more effectively online?
This is extraordinarily difficult to measure. You need to link online treatment assignment to offline outcomes, which requires either surveys (unreliable) or real-world behavioral data (hard to obtain and privacy-sensitive).
But the question matters. If your A/B test only measures online clicks while the real business value is in offline behavior change, you are optimizing the wrong metric.
The Mistake New Analysts Make
The most dangerous mistake is running standard user-level A/B tests on features with obvious network effects and treating the results as reliable. If users in your experiment interact with each other — through messaging, sharing, competing for resources, or participating in a marketplace — your results are biased and you do not even know which direction.
I have seen teams launch features based on A/B test results that underestimated the true effect by 30% or more because they ignored spillover. I have also seen teams kill features that appeared to have no effect when the true effect was masked by control group contamination.
The standard A/B testing framework (/blog/posts/what-is-ab-testing-practitioners-guide) gives you wrong answers in networked environments. Knowing when it breaks is as important as knowing how to use it.
Pro Tip: Read Up on Cluster Randomization Before You Need It
If your product has social features — messaging, sharing, feeds, marketplaces, collaboration — read up on cluster randomization before you design your next experiment. Do not wait until you have run a contaminated experiment and made a bad decision based on the results.
The academic literature on network experiments is dense but there are excellent practitioner-focused summaries from LinkedIn's and Uber's experimentation teams. The core concepts are accessible even if the math is advanced.
The method comparison in A/B testing vs. multivariate vs. bandits (/blog/posts/ab-testing-vs-multivariate-vs-bandit-algorithms) becomes even more nuanced when network effects are in play. Bandits in particular can amplify interference because they shift traffic dynamically, changing the network composition of each treatment group over time.
Career Guidance
Network experiment design is one of the most in-demand skills in tech experimentation today. Very few analysts understand it. Most experimentation platforms do not support it natively, which means the analysts who can design and analyze network experiments are immediately valuable to any platform company.
If you are building an experimentation career, this is the specialization that will set you apart. Master the basics with standard A/B testing, then invest in understanding interference. The combination of statistical rigor and practical platform knowledge is rare and highly valued.