The A/B Test That Became a Regulatory Case Study
Meta description: Credit Karma's "pre-approved" claim wasn't a lie — it was an A/B test winner, shipped the way any growth team ships a winner. A breakdown of how a correctly-run experiment became a $3M FTC case, and the guardrail-metric gap that let it happen.
Most dark-pattern writeups implicitly assume sloppiness — a design team that didn't think it through, a growth hacker cutting corners. The FTC's 2022 complaint against Credit Karma is more useful precisely because it doesn't fit that story. The underlying practice was an A/B test, run the way a competent experimentation team runs A/B tests: a hypothesis, a controlled comparison, a statistically clear winner, a ship decision. If you've built or reviewed an experimentation program, this is a process you'd sign off on. That's exactly why it's worth understanding in detail — the failure wasn't in the testing. It was in what the test was allowed to measure.
The setup
Credit Karma's core product shows users credit card and loan offers, monetized when a user applies through the platform. The specific claim under scrutiny was language telling users they were "pre-approved" for a given credit card, with some versions citing "90% odds" of approval. According to the FTC's complaint, Credit Karma tested this framing against more hedged, accurate alternatives (language closer to "excellent odds," without the specific "pre-approved" designation) and found that "pre-approved" produced meaningfully higher click-through and application rates.
That's a normal test. Persuasive, specific, confidence-conveying language usually outperforms hedged language. This isn't a fintech-specific finding; the same pattern tends to show up in copy tests across verticals. If you've run copy tests on a CTA, a headline, or an offer page, you've probably seen it win before, likely more than once.
Where it diverges from a normal test
The complaint states that Credit Karma's own data showed close to a third of consumers who applied for these "pre-approved" offers were subsequently denied at underwriting. The claim wasn't randomly wrong — it was wrong for a known, sizable, quantifiable share of the people seeing it, and Credit Karma had the data to know that share before and during the period the language was live.
Two real costs land on a denied applicant that a click-through metric doesn't see: a hard credit inquiry, which can measurably ding a credit score on its own, and the time cost of applying for something they were told was close to certain. Neither of those costs appears anywhere in "click-through rate" or "application rate." The test had a clean, statistically valid winner on the metric it was built to measure — and the metric it was built to measure was structurally blind to the harm the winning variant caused.
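The blind spot can be made concrete with a minimal Python sketch. Every number in it is hypothetical and chosen for illustration; none comes from the complaint, and the `VariantResult` type and its field names are assumptions. It shows that picking a winner on application rate never reads the denial count, so the result is the same whether or not denials are recorded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantResult:
    """Hypothetical per-variant totals from a copy test."""
    name: str
    impressions: int
    applications: int
    denials: int  # known only after underwriting, downstream of the test

    @property
    def application_rate(self) -> float:
        return self.applications / self.impressions

    @property
    def denial_rate(self) -> float:
        # Share of applicants who saw the claim and were then denied.
        return self.denials / self.applications if self.applications else 0.0

    @property
    def denials_per_1k_impressions(self) -> float:
        return 1000 * self.denials / self.impressions


# Illustrative numbers only; not taken from the FTC complaint.
hedged = VariantResult("excellent odds", impressions=10_000, applications=400, denials=120)
confident = VariantResult("pre-approved", impressions=10_000, applications=600, denials=190)

# The primary metric alone picks the winner and never looks at `denials`.
winner = max([hedged, confident], key=lambda v: v.application_rate)

for v in (hedged, confident):
    print(
        f"{v.name}: application rate {v.application_rate:.1%}, "
        f"denial rate {v.denial_rate:.1%}, "
        f"denials per 1k impressions {v.denials_per_1k_impressions:.0f}"
    )
print("winner on primary metric:", winner.name)
```

In this made-up example the denial rate among applicants barely moves between variants, yet the number of denied applicants per thousand impressions rises with the application rate. A dashboard that reports only the primary metric shows none of that.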
This is the detail that makes the case instructive rather than just cautionary: the FTC's theory wasn't "Credit Karma lied." It was closer to "Credit Karma knew, from its own data, that this specific claim was inaccurate for a known proportion of the people seeing it, and optimized anyway for a metric that didn't capture that cost." Section 5 of the FTC Act, which prohibits unfair or deceptive acts or practices, doesn't require an outright falsehood: a claim that's defensible in aggregate but predictably misleading to a substantial share of the audience receiving it can qualify on its own. (Credit Karma resolved the matter via a January 2023 FTC consent order, settling without admitting or denying the allegations, which is standard practice in FTC administrative settlements. The facts described in this piece are what the complaint alleges rather than a court's adjudicated findings.)
The guardrail that was missing
Every mature experimentation program uses guardrail metrics — secondary metrics you watch alongside your primary conversion metric specifically so a "win" on the primary metric can't silently mask a loss somewhere the team cares about. The standard guardrails in most growth orgs are engagement-adjacent: unsubscribe rate, support-ticket volume, refund rate, churn in the following period.
Credit Karma's test, as described in the complaint, had a primary metric (click-through / application rate) but nothing that would have caught the rate at which the winning variant is factually wrong for the user seeing it. That calls for a different kind of guardrail than the engagement ones: an accuracy guardrail, which applies to any test where the copy makes a claim about the individual user's situation rather than a general statement about the product. "Pre-approved," "your odds are X," "you qualify for," and "recommended for you" are all claims of this type. They're personalized enough to be individually false even when they're defensible in aggregate, and a standard conversion or engagement guardrail will never surface that.
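One rough way to flag this type of claim automatically is a pattern check over variant copy. The sketch below is an assumption-heavy starting point, not a complete classifier: the pattern list is hypothetical, built only from the example phrases above, and a real list would need review by whoever owns compliance. It will miss paraphrases, so it is better used to trigger a human look than to clear copy as safe.

```python
import re

# Hypothetical patterns for copy that asserts something about the individual
# user. They are derived from the example phrases in the text and are not
# an exhaustive or authoritative list.
INDIVIDUAL_CLAIM_PATTERNS = [
    re.compile(r"\bpre-?approved\b", re.IGNORECASE),
    re.compile(r"\byour odds\b", re.IGNORECASE),
    re.compile(r"\byou(?:'ve| have)? qualif(?:y|ied)\b", re.IGNORECASE),
    re.compile(r"\brecommended for you\b", re.IGNORECASE),
    re.compile(r"\b\d{1,3}\s?% (?:odds|chance)\b", re.IGNORECASE),
]


def needs_accuracy_guardrail(copy_text: str) -> bool:
    """Return True if the copy matches any individual-claim pattern."""
    return any(p.search(copy_text) for p in INDIVIDUAL_CLAIM_PATTERNS)


if __name__ == "__main__":
    for text in (
        "You're pre-approved for this card",
        "90% odds of approval",
        "Cards with no annual fee",
    ):
        print(f"{text!r}: {needs_accuracy_guardrail(text)}")
```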
This is the guardrail-coverage test I'd apply to a case like this, stated as a question a test owner should be able to answer before shipping. For any claim that asserts something about the specific user's status, eligibility, or likelihood of an outcome: what's the known false-positive rate for the winning variant (the share of users shown the claim for whom it turns out to be wrong), and is that rate itself tracked as a metric? Tracked means on a dashboard, the same way you'd track unsubscribe rate, not inferred after the fact from a regulatory complaint.
If the answer is "we don't have that number," that's not a hypothetical compliance gap. In Credit Karma's case the number existed: it appears in the FTC's complaint, which means it was somewhere in the company's data. It just wasn't part of the decision to ship.
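As a sketch of what tracking that rate could look like, the Python below computes the false-positive rate for one variant from joined application and underwriting records, with a Wilson score interval so a small sample isn't over-trusted. The `Outcome` record and its fields are hypothetical stand-ins for whatever your pipeline produces. The rate here is conditional on the user having applied, which is the population the complaint's "close to a third" figure describes.

```python
import math
from typing import Iterable, NamedTuple, Tuple


class Outcome(NamedTuple):
    """Hypothetical joined record: one applicant who saw the claim."""
    variant: str
    approved: bool  # underwriting decision, joined from downstream data


def false_positive_rate(
    outcomes: Iterable[Outcome], variant: str, z: float = 1.96
) -> Tuple[float, float, float]:
    """Return (rate, lower, upper): share of the variant's applicants denied.

    The bounds are a Wilson score interval; z=1.96 is roughly 95% coverage.
    """
    rows = [o for o in outcomes if o.variant == variant]
    n = len(rows)
    if n == 0:
        raise ValueError(f"no applicants recorded for variant {variant!r}")
    p = sum(1 for o in rows if not o.approved) / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, centre - half, centre + half


# Illustrative data only: 100 applicants, 30 denied.
sample = [Outcome("pre-approved", True)] * 70 + [Outcome("pre-approved", False)] * 30
rate, lower, upper = false_positive_rate(sample, "pre-approved")
print(f"false-positive rate {rate:.1%} (interval {lower:.1%} to {upper:.1%})")
```

Comparing the upper bound, rather than the point estimate, against a threshold is the more conservative choice when the sample is small.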
What this means for a fintech or regulated-industry growth team
The instructive version of this case isn't "don't test persuasive copy" — persuasive copy is not illegal, and the FTC's order doesn't ban A/B testing or even ban the phrase "pre-approved" outright; it requires the claim to be accurate for the specific user it's shown to. The instructive version is narrower and more useful: any claim your test makes about an individual user's eligibility, status, or odds needs its own accuracy guardrail, tracked with the same rigor as your primary metric, before you ship the winner.
Concretely, that means three things a growth or CRO lead can implement without waiting on legal review of every test:
- Classify claims, not just copy. Any test variant asserting something specific about the user (pre-approved, qualified, recommended, likely to succeed) gets flagged differently in your test-tracking system than a variant testing tone, layout, or general product claims. The classification determines whether an accuracy guardrail is required.
- Pull the false-positive rate before you ship, not after a complaint. If the claim is "pre-approved," the underwriting data to check accuracy against usually already exists downstream in the funnel — it's a data-pipeline problem, not a new data-collection problem, in most fintech stacks.
- Set a threshold, not just a metric. A guardrail without a stop condition doesn't stop anything. If the false-positive rate on an eligibility-type claim exceeds a set threshold, the variant doesn't ship regardless of its click-through performance — the same way you'd hard-stop a variant that blew past your unsubscribe-rate guardrail, even if it converted better.
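The three steps above can be combined into a single ship gate. The sketch below is one possible shape, not a prescribed implementation: the `Readout` type, its field names, and the 5% default threshold are all hypothetical placeholders. The right threshold depends on the claim and the product, and choosing it is a judgment call this code does not make for you.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Readout:
    """Hypothetical summary handed to the ship decision."""
    variant: str
    beats_control: bool            # result on the primary metric
    makes_individual_claim: bool   # output of the claim classification step
    false_positive_upper: Optional[float]  # upper bound; None if never measured


def ship_decision(readout: Readout, max_false_positive: float = 0.05) -> str:
    """Apply the accuracy guardrail as a hard stop, not just a reported metric."""
    if not readout.beats_control:
        return "hold: no lift on primary metric"
    if not readout.makes_individual_claim:
        return "ship"
    # From here on the variant asserts something about the individual user.
    if readout.false_positive_upper is None:
        return "hold: accuracy guardrail not measured"
    if readout.false_positive_upper > max_false_positive:
        return "hold: false-positive rate above threshold"
    return "ship"


if __name__ == "__main__":
    winner = Readout("pre-approved", True, True, false_positive_upper=0.33)
    print(ship_decision(winner))
```

The design choice that matters is the `None` branch: a missing measurement blocks the ship instead of passing silently, so "we don't have that number" becomes a visible state rather than a default approval.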
None of that requires slowing down testing velocity in general. It requires one narrow category of test, claims about individual eligibility or likelihood, to carry one more mandatory metric before a winner ships. That's a testing-methodology change a growth team owns directly. It doesn't depend on a compliance sign-off process, which at many early-stage companies doesn't exist yet.
This is part of a series on the design and strategy trade-offs behind marketing-compliance enforcement actions, read through an experimentation lens.