Most analysts think A/B testing started with Google. It didn’t. The practice of splitting a population into groups, changing one variable, and measuring the difference has been around for nearly two centuries. If you want to be genuinely good at experimentation, you need to understand where these ideas came from — because the history of A/B testing is littered with lessons that people keep relearning the hard way.

I’ve watched smart analysts spend weeks building “novel” testing frameworks that are essentially reinventions of techniques direct mail marketers perfected in the 1960s. Understanding history prevents that kind of embarrassment.

1835: The First Double-Blind Trial

The earliest documented double-blind trial, and one of the closest ancestors of modern A/B testing, happened in 1835, when a group of physicians in Nuremberg tested the effectiveness of homeopathic remedies. They split participants into two groups — one receiving the actual treatment, one receiving a placebo — and compared results.

This wasn’t marketing optimization. It was medical science. But the core principle is identical to what you do every time you split traffic between a control and a variant: isolate a single variable, randomize assignment, measure the outcome.

The pharmaceutical industry continued refining these methods for over a century before the business world caught on. By the time the FDA formalized randomized controlled trial requirements in the 1960s, the methodology was already battle-tested across thousands of medical studies.

There’s a lesson here that new analysts miss: A/B testing isn’t a tech industry invention. It’s the scientific method applied to business decisions. The rigor you need — proper statistical tests (/blog/posts/statistical-tests-ab-testing-t-test-chi-squared-mann-whitney), adequate sample sizes, pre-registration of hypotheses — all of that was established in clinical research decades before anyone tested a button color.
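
To make “adequate sample sizes” concrete, here is a minimal power calculation in Python using statsmodels. The baseline and target conversion rates are invented for illustration, but the mechanics are the same ones clinical statisticians settled on decades ago:

```python
# Rough sample-size estimate for a two-proportion test (illustrative numbers).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline, target = 0.10, 0.12                          # hypothetical conversion rates
effect = abs(proportion_effectsize(baseline, target))  # Cohen's h

n_per_arm = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,    # 5% false-positive rate
    power=0.80,    # 80% chance of detecting the lift if it exists
)
print(f"Need roughly {n_per_arm:,.0f} users per arm")
```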

1908: A Brewer Creates Modern Statistics

In 1908, William Sealy Gosset was working as a chemist at the Guinness brewery in Dublin. He had a problem: small sample sizes. When you’re testing barley batches, you can’t run experiments on millions of observations. You need statistical methods that work with limited data.

Gosset developed what we now call Student’s t-test — published under the pseudonym “Student” because Guinness had a strict policy against employees publishing research. The brewery didn’t want competitors knowing they were using statistical methods to optimize their product.

The irony is beautiful: one of the most important statistical tools in modern A/B testing exists because a beer company was paranoid about trade secrets. Every time you run a t-test on your experiment results, you’re using a technique born in a brewery over a century ago.

Gosset’s work laid the groundwork for the entire field of small-sample statistics, which remains critical today. Not every experiment you run will have millions of observations. When your B2B SaaS product gets 500 conversions a month, you’re dealing with the same fundamental challenge Gosset faced with barley samples.
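
The test itself has barely changed. Below is a minimal sketch using scipy on two deliberately tiny samples; the numbers are invented, but this is the same pooled-variance test Gosset published in 1908:

```python
# Student's t-test on deliberately small samples (all numbers invented).
from scipy import stats

control = [42.1, 39.8, 41.5, 40.2, 43.0, 38.9]   # e.g., baseline batch yields
variant = [44.3, 45.1, 42.8, 46.0, 43.7, 44.9]   # e.g., treated batch yields

# equal_var=True is the classic pooled-variance test Gosset derived;
# many teams now default to Welch's version (equal_var=False) instead.
t_stat, p_value = stats.ttest_ind(control, variant, equal_var=True)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```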

1923: Claude Hopkins and Scientific Advertising

Claude Hopkins published “Scientific Advertising” in 1923, and it remains one of the most practical books ever written about testing commercial ideas. Hopkins didn’t use the term “A/B testing,” but that’s exactly what he was doing.

His method was direct response marketing with built-in measurement. He used keyed coupons — unique codes assigned to different ad variations — to track exactly which headline, which offer, which layout drove the most responses. He split newspaper ads, tested different copy against each other, and made decisions based on data rather than opinion.

Hopkins understood something fundamental: almost any question can be answered cheaply and quickly by a test campaign, rather than by arguments around a table. That insight was written over 100 years ago, and I still hear marketing teams arguing around tables about which headline will perform better instead of just testing it.

The direct mail industry took Hopkins’ ideas and ran with them for decades. By the 1950s and 1960s, direct mail marketers were running sophisticated split tests at scale — testing envelopes, headlines, offers, letter lengths, postscripts, and response mechanisms. They built rigorous testing cultures long before the internet existed.

The Golden Age of Direct Mail Testing

Between the 1950s and 1990s, direct mail was the proving ground for split testing methodology. Companies like Reader’s Digest, American Express, and Columbia House ran thousands of tests per year. They tested everything: envelope teaser copy, letter length, font choices, offer structures, guarantee language, P.S. lines, and response device formats.

The sophistication was remarkable. Direct mail testers understood concepts like interaction effects, seasonal variation, and list segmentation decades before digital marketers discovered them. They knew that a winning headline on one mailing list could be a loser on another — what we now call heterogeneous treatment effects (/blog/posts/ab-testing-segmentation-targeting-heterogeneous-effects).

They also understood something that many digital testers still get wrong: a result only generalizes if the test sample represents the rollout population. A direct mail piece that tested well on 50,000 names might fail at 500,000 because the initial test list was unrepresentative. This is the same external validity problem (/blog/posts/ab-testing-external-validity-threats) that plagues digital experiments today.

2000: Google’s First A/B Test

In 2000, Google ran one of its first A/B tests on search results. The question was simple: how many results should they show per page? They tested 10, 20, and 30 results. The finding was counterintuitive — showing more results actually decreased user satisfaction because the page loaded slower.

This single test encapsulated a lesson that keeps repeating: more is not always better, and the reason something works or doesn’t often isn’t what you’d guess. Google expected users to want more results. Instead, users wanted faster results. The variable that mattered wasn’t the one they were intentionally testing.

Google’s engineering culture embraced experimentation early and aggressively. By 2011, they were running approximately 7,000 experiments per year. Today, they run well over 10,000 annually. That volume is only possible because they invested heavily in experimentation infrastructure — automated analysis, robust statistical methods (/blog/posts/ab-testing-statistics-p-values-confidence-intervals), and a culture where decisions require experimental evidence.

2012: The Bing Test Heard Round the World

In 2012, a Microsoft engineer at Bing made a small change to how ad headlines were displayed. It was a trivial modification — the kind of thing that would normally sit in a backlog for months. But someone tested it.

The result: a 12% increase in revenue. From one test. One small change. This single experiment generated over $100 million in additional annual revenue for Microsoft.

Ronny Kohavi, who led experimentation at Microsoft, used this example repeatedly to make a critical point: you cannot predict which changes will have massive impact. The only way to find these wins is to test everything, including — especially — the changes that seem too small to matter.

This is why I push teams to lower their testing thresholds (/blog/posts/ab-testing-tradeoffs-when-not-to-test). The expected value of testing is enormous when you account for the possibility of finding a massive win hidden in a seemingly minor change.

The Platform Era: 2015 to Present

Starting around 2015, a new wave of companies began building massive internal experimentation platforms. Airbnb built ERF (Experiment Reporting Framework). Netflix built ABlaze. Uber built XP. Booking.com famously runs thousands of concurrent experiments and has built an entire culture around experimentation.

These platforms didn’t just automate test deployment. They automated analysis, flagged statistical issues (/blog/posts/ab-testing-statistics-p-values-confidence-intervals) like sample ratio mismatch, implemented advanced techniques like CUPED for variance reduction (/blog/posts/cuped-variance-reduction-faster-ab-tests), and created guardrail metrics that automatically halt tests that are causing harm.
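
CUPED is less exotic than it sounds. Here is a minimal sketch on simulated data (every number below is invented): use a pre-experiment covariate to subtract the predictable part of the metric, which shrinks variance without biasing the estimated lift.

```python
# CUPED on simulated data: same estimated lift, visibly smaller variance.
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
pre = rng.gamma(2.0, 10.0, n)                    # pre-experiment spend (covariate)
arm = rng.integers(0, 2, n)                      # 50/50 random assignment
y = 0.8 * pre + rng.normal(0, 5, n) + 0.5 * arm  # outcome with a true lift of 0.5

theta = np.cov(pre, y, ddof=1)[0, 1] / np.var(pre, ddof=1)
y_cuped = y - theta * (pre - pre.mean())

for name, metric in [("raw", y), ("CUPED", y_cuped)]:
    lift = metric[arm == 1].mean() - metric[arm == 0].mean()
    print(f"{name:>5}: lift = {lift:.3f}, variance = {metric.var():.1f}")
```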

Booking.com’s approach is worth studying. They democratized experimentation so completely that product managers, designers, and engineers can all launch tests without needing a data scientist’s involvement. The result: they run more experiments per employee than almost any other company. Their philosophy is that running a test should be as easy as deploying code — and in many cases, feature flags and A/B tests are the same thing (/blog/posts/feature-flags-vs-ab-tests-canary-deployment).
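
The “feature flags and A/B tests are the same thing” point comes down to a shared primitive: deterministic bucketing. Here is a hypothetical sketch (the function name and split are mine, not Booking.com’s):

```python
# Deterministic bucketing: the primitive shared by flags and experiments.
import hashlib

def assign(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Hash user + experiment so the same user always lands in the same arm."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF    # roughly uniform on [0, 1]
    return "treatment" if bucket < treatment_share else "control"

print(assign("user-123", "new-checkout"))        # stable across calls
```

Set treatment_share to 1.0 and this is a plain feature flag rollout; anything in between makes it an experiment.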

What New Analysts Get Wrong

The biggest mistake I see from new analysts is treating A/B testing as a “tech industry thing.” They assume it’s a modern practice that only applies to websites and apps. This blind spot leads to several problems.

First, they ignore decades of accumulated knowledge from direct mail, pharmaceutical trials, and agricultural experiments. The statistics haven’t changed. The methodology hasn’t changed. Only the medium has changed.

Second, they reinvent old ideas with new names and think they’ve discovered something novel. “Growth hacking” is direct response marketing with a hoodie. “Experimentation platforms” are sophisticated split testing frameworks — the direct mail industry had its own version of these in the 1970s, using mainframe computers and response tracking databases.

Third, they underestimate the organizational challenge. Building an experimentation culture (/blog/posts/ab-testing-process-research-prioritize-test-analyze) is harder than building the technology. Google, Microsoft, and Booking.com didn’t get to 10,000 tests a year just by having good tools. They got there by making experimentation a core organizational value — a transition that takes years and executive commitment.

Pro Tip: Spot the Repackaged Ideas

Understanding history gives you a superpower: pattern recognition. When someone pitches a “revolutionary new testing methodology,” you can immediately assess whether it’s genuinely novel or a repackaging of established techniques. “Growth hacking” is direct response marketing. “Data-driven culture” is what Claude Hopkins called “scientific advertising” in 1923. “Personalization engines” are what direct mailers called “list segmentation.” Even the concept of running tests on social platforms with network effects (/blog/posts/ab-testing-social-platforms-network-effects-interference) has roots in agricultural field experiments that dealt with spatial interference between treatment plots.

None of this means modern implementations aren’t valuable. The technology is genuinely better. But the principles are old, and the pitfalls are well-documented. You just have to know where to look.

Career Guidance

Read “Scientific Advertising” by Claude Hopkins. It’s 100 years old and still more practical than most modern CRO blog posts. You can read it in an afternoon. Every principle Hopkins describes — test don’t argue, measure everything, scale what works — applies directly to digital experimentation.

Also read Ronny Kohavi’s work on trustworthy online controlled experiments. His papers and book bridge the gap between classical statistics and modern digital experimentation. Understanding both the historical foundations and the modern applications makes you a fundamentally better practitioner.

The analysts who understand history don’t just run better tests. They avoid the traps that everyone else falls into — like running tests too short (/blog/posts/how-long-to-run-ab-test-sample-size), ignoring validity threats (/blog/posts/ab-testing-external-validity-threats), or assuming that statistical significance means practical significance. These aren’t new problems. They’re old problems that keep catching people who haven’t done their homework.

The history of A/B testing teaches one overarching lesson: the scientific method works. It worked in 1835 for drug trials, in 1923 for advertising, and in 2012 for search engines. The companies that embrace rigorous experimentation outperform the ones that rely on intuition. That was true a century ago, and it’s true today.
