The Test That Started a Thousand Bad Habits
Sometime in the late 2000s, someone published a case study about changing a button from green to red and increasing conversions. The optimization industry has never recovered.
That single result -- likely a statistical artifact, almost certainly not generalizable -- became the origin story for an entire approach to experimentation that prioritizes trivial changes over meaningful ones. Fifteen years later, teams with sophisticated testing infrastructure are still spending significant portions of their testing capacity on cosmetic tweaks that, even when they "win," move the business by fractions of a fraction of a percent.
This has to stop. Not because testing small things is inherently wrong, but because testing small things at the expense of testing big things is a catastrophic misallocation of a finite resource.
The Opportunity Cost of Trivial Tests
Every testing program has a limited number of experiments it can run per year. The constraint is usually traffic (you need sufficient sample size) combined with engineering bandwidth (someone has to build the variants).
A typical mid-market company can run approximately twenty to forty meaningful tests per year. That is not a lot. Each test slot is valuable real estate.
When you spend one of those slots testing button colors, you are not spending it testing your pricing model, your value proposition, your onboarding flow, or your checkout architecture. The button test has a ceiling of maybe a few tenths of a percent improvement. The pricing test has a ceiling of potentially transforming the business.
The opportunity cost of trivial testing is not the test itself. It is the important test you did not run.
The Hierarchy of Testing Impact
Not all tests are created equal. There is a clear hierarchy of potential impact:
Tier 1: Business Model Tests
Changes to pricing structure, monetization strategy, target market, or core value proposition. These tests can move revenue by twenty percent or more. They are also the scariest to run, which is why most teams avoid them.
Tier 2: Structural Tests
Changes to information architecture, user flow, page structure, feature set, or content strategy. These tests typically move the primary metric by five to fifteen percent when they win.
Tier 3: Communication Tests
Changes to headlines, value propositions, messaging, positioning, and copy. These tests typically produce two to eight percent lifts.
Tier 4: Design Tests
Changes to layout, visual hierarchy, imagery, and interaction patterns. Typical impact: one to five percent.
Tier 5: Cosmetic Tests
Changes to colors, font sizes, button shapes, and minor visual elements. Typical impact: under one percent.
Most testing programs spend the majority of their capacity on Tiers 4 and 5, while Tiers 1 through 3 remain untested. This is the equivalent of rearranging deck chairs while ignoring a hole in the hull.
Why Teams Default to Trivial Tests
The bias toward trivial testing is not accidental. It is driven by several organizational dynamics:
- Low risk. Changing a button color cannot break anything important. Testing a new pricing model could alienate customers. Teams choose safe tests to avoid blame.
- Easy consensus. Everyone can have an opinion about button colors. Testing the value proposition requires strategic alignment that is harder to achieve.
- Fast results. Trivial changes are quick to build and launch, so they feel fast, even though small effects actually require larger samples to reach significance. Teams that are evaluated on test velocity gravitate toward them anyway.
- Visible activity. Running lots of tests looks productive, even if the tests are inconsequential. Spending three months designing one high-impact test looks like stalling.
The Framework for Meaningful Tests
Here is how to shift your testing program toward impact:
Start With the Revenue Model
Revenue equals traffic multiplied by conversion rate multiplied by average order value multiplied by retention rate. A test that improves any of these levers by even a few percent is worth more than a test that improves a micro-conversion by double digits.
Map every test hypothesis to a specific lever in the revenue model. If the connection is indirect ("this button color might improve click-through, which might improve engagement, which might improve conversion"), the test is probably too far removed from impact to be worth running.
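To make the arithmetic concrete, here is a minimal sketch comparing two hypothetical tests against this revenue model. Every figure in it (traffic, conversion rate, order value, retention, lift sizes, and the pass-through from clicks to purchases) is an illustrative assumption, not a benchmark.

```python
# Illustrative revenue-lever comparison (all figures are assumptions).
# Revenue = traffic * conversion rate * average order value * retention multiplier.

monthly_traffic = 200_000
conversion_rate = 0.02          # 2% of visitors purchase
average_order_value = 80.0      # dollars per order
retention_multiplier = 1.4      # average orders per customer, annualized repeat purchasing

baseline_revenue = monthly_traffic * conversion_rate * average_order_value * retention_multiplier

# Scenario A: a structural test lifts conversion rate by 5%.
revenue_a = monthly_traffic * (conversion_rate * 1.05) * average_order_value * retention_multiplier

# Scenario B: a cosmetic test lifts a micro-conversion (button clicks) by 20%,
# but only ~1% of that lift carries through to actual purchases.
revenue_b = monthly_traffic * (conversion_rate * (1 + 0.20 * 0.01)) * average_order_value * retention_multiplier

print(f"Baseline:   ${baseline_revenue:,.0f}")
print(f"Scenario A: ${revenue_a:,.0f} (+${revenue_a - baseline_revenue:,.0f})")
print(f"Scenario B: ${revenue_b:,.0f} (+${revenue_b - baseline_revenue:,.0f})")
```

The point is not the specific numbers but the shape of the result: a modest lift on a direct revenue lever dwarfs a large lift on a metric that barely feeds revenue.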
Test Hypotheses, Not Elements
A meaningful test starts with a hypothesis about user behavior, not a design variation. "Users abandon the checkout because they are unsure about the return policy" is a hypothesis. "A green button might outperform a blue button" is not.
Hypothesis-driven tests are naturally more impactful because they address real user problems. Element-driven tests are naturally trivial because they address design preferences.
Embrace Larger Variants
The most impactful tests involve substantial changes. Instead of testing one headline against another, test a completely different page concept. Instead of testing form field order, test a fundamentally different form structure.
Larger variants are harder to isolate causally, but they are more likely to produce meaningful effects. The academic ideal of testing one variable at a time is appropriate for research but inefficient for business optimization.
Build a Testing Roadmap
Treat your testing program like a product roadmap. Prioritize tests by expected business impact, not by ease of implementation. Staff the program with strategists who understand the business, not just designers who understand interfaces.
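One lightweight way to run that roadmap, sketched below with invented hypotheses and scores, is to rank the backlog by expected impact (estimated lift on a revenue lever weighted by confidence) rather than by how easy each test is to build.

```python
# Minimal sketch of impact-first test prioritization (hypotheses and numbers are illustrative).

backlog = [
    # (hypothesis, revenue lever, estimated lift, confidence 0-1, build weeks)
    ("Simplify pricing to three tiers",         "average order value", 0.10,  0.4, 6),
    ("Add return-policy messaging to checkout", "conversion rate",     0.05,  0.6, 2),
    ("Rewrite homepage value proposition",      "conversion rate",     0.04,  0.5, 3),
    ("Change CTA button color",                 "micro-conversion",    0.002, 0.7, 1),
]

def expected_impact(item):
    _, _, lift, confidence, _ = item
    return lift * confidence

# Sort by expected impact, not by ease of implementation.
for hypothesis, lever, lift, confidence, weeks in sorted(backlog, key=expected_impact, reverse=True):
    print(f"{hypothesis:42s} lever={lever:20s} expected lift={lift * confidence:.3f} effort={weeks}w")
```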
The Cultural Transformation
Moving from trivial to meaningful testing requires a cultural shift across the organization:
- Redefine success. A test that produces a large, validated learning but no "winner" is more valuable than a test that produces a 0.1 percent lift on a button click.
- Reward courage. Teams that test bold hypotheses and fail should be celebrated for expanding the organization's understanding.
- Invest in strategy. The bottleneck in most testing programs is not tooling or traffic. It is the quality of the hypotheses. Invest in research, customer interviews, and data analysis to generate better ideas.
- Extend time horizons. Meaningful tests take longer to design, implement, and evaluate. Accept this and adjust expectations accordingly.
The Manifesto
This is what meaningful experimentation looks like:
- Every test is connected to a specific business outcome.
- Every hypothesis is based on evidence about user behavior.
- The testing portfolio is balanced across impact tiers.
- The team is evaluated on business impact, not test velocity.
- Bold tests are encouraged, and failures are valued as learning.
- Results are translated into financial terms that leadership understands.
- The program generates strategic insights, not just conversion tweaks.
Button colors are not the enemy. Complacency is. If your testing program is not regularly producing results that change how you think about your business, you are testing the wrong things.
Frequently Asked Questions
Are small tests ever worth running?
Yes, in two contexts. First, as quick wins to build organizational confidence in the testing program. Second, when you have exhausted higher-impact hypotheses and have spare testing capacity. But they should never be the primary activity.
How do I convince leadership to support riskier tests?
Frame the risk in financial terms. The downside of a failed test is the implementation cost (usually small). The upside of a successful bold test could be significant revenue growth. The expected value math almost always favors bolder tests.
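A back-of-the-envelope version of that argument, with all probabilities and dollar figures as illustrative assumptions:

```python
# Back-of-the-envelope expected value of a bold test vs. a trivial test.
# Win probabilities, upside, and costs are illustrative assumptions.

def expected_value(p_win, annual_upside, implementation_cost):
    return p_win * annual_upside - implementation_cost

bold = expected_value(p_win=0.2, annual_upside=500_000, implementation_cost=30_000)
trivial = expected_value(p_win=0.5, annual_upside=10_000, implementation_cost=2_000)

print(f"Bold test expected value:    ${bold:,.0f}")    # 0.2 * 500k - 30k = $70,000
print(f"Trivial test expected value: ${trivial:,.0f}") # 0.5 * 10k  - 2k  = $3,000
```

Even with a much lower chance of winning, the bold test carries the higher expected value once the upside is priced in.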
What if we do not have enough traffic for big tests?
Low traffic is actually an argument for bigger tests, not against them. Small tests require large samples to detect small effects. Big tests, with larger expected effects, can reach significance with less traffic.
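A rough power calculation shows why. The sketch below uses the standard normal-approximation formula for comparing two proportions; the 3 percent baseline and the relative lifts are assumptions chosen to illustrate the scaling, which is roughly inverse to the square of the effect size.

```python
# Approximate sample size per variant for a two-proportion test
# (80% power, 5% significance, two-sided). Baseline and lifts are assumptions.
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    # Normal-approximation sample-size formula for comparing two proportions.
    n = ((z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar)) / (p2 - p1) ** 2
    return int(n) + 1

for lift in (0.01, 0.05, 0.20):  # 1% cosmetic tweak vs. 5% and 20% structural changes
    n = sample_size_per_variant(baseline=0.03, relative_lift=lift)
    print(f"Detecting a {lift:.0%} relative lift on a 3% baseline needs ~{n:,} users per variant")
```

On those assumptions, detecting the 1 percent lift takes roughly five million users per variant, while the 20 percent lift needs around fourteen thousand.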
How do I build a hypothesis-driven testing culture from scratch?
Start with the data. Mine your analytics for user behavior patterns. Conduct customer interviews. Review support tickets. Build a hypothesis backlog that is grounded in evidence. Then prioritize by expected impact and test methodically.