How to Save Tokens With Claude Code Without Making It Dumber

Q: Should I turn off extended thinking to save tokens?

No — turn it down where it doesn't help, and keep it on where it does. Thinking is worth its tokens on hard reasoning (subtle bugs, architecture, migrations) and wasteful on trivial edits. It's the same match-effort-to-difficulty discipline as model selection.

Atticus Li

← Blog

How to Save Tokens With Claude Code Without Making It Dumber

Most token-saving advice for Claude Code quietly makes the agent worse — starve its context or drop to a cheap model and you pay for the retries.

By Atticus Li July 3, 2026 13 min read

There are two ways to cut your Claude Code token bill.

The first is to starve the agent — smaller context, cheaper model, terse prompts — and watch your spend-per-message drop. The second is to help the agent land the change in one clean pass instead of three broken ones. Only one of these actually saves you money, and it's not the one most "token-saving tips" articles are selling.

I wrote a whole piece on why the right unit of cost isn't the token — it's the cost per accepted production change: subscription + credits + retries + your steering time + bug cleanup + review. This article is the tactical follow-up. Every technique below is chosen because it lowers total tokens-to-a-working-change, not tokens-per-message. The distinction matters, because the most popular way to "save tokens" — making the model dumber — is a false economy that shows up as a bigger bill two turns later.

First, understand where the tokens actually go

You can't optimize what you don't understand, and Claude Code spends tokens in a way that surprises people coming from a chatbot mental model.

In Anthropic's own Claude Code walkthrough, one of the engineers describes the tool as "a very pure agent": some instructions, some powerful terminal tools, and a model running in a loop until it decides it's done. Two consequences of that design drive your entire bill:

There's no magic index. Claude Code doesn't embed your repo into a vector database and retrieve slices. It explores the way a new engineer would — glob, grep, reading files, following the trail. That's agentic search, and every file it reads to orient itself becomes tokens that sit in the context window.
The context window is resent every turn. The API is stateless. On each step of the loop, the entire accumulated conversation — every file read, every tool result, every reasoning step — is sent again. A 30-message session isn't 30 messages' worth of tokens; it's the running sum, paid over and over.

Current Claude models give you a 1M-token context window (the old 200K ceiling is gone), which sounds like it makes this a non-issue. It does the opposite. A bigger window means it's easier to let junk accumulate, and you pay to reprocess that junk on every single turn. The single highest-leverage lever you have is not which model you pick — it's what's sitting in the context window when the model does its work.

Everything below is a corollary of that.

Move 1: Manage context on purpose, not by accident

The default failure mode is the "zombie session" — you keep one Claude Code conversation alive across three unrelated tasks because starting fresh feels wasteful. It's the opposite of wasteful. That stale history rides along on every turn, getting reprocessed, quietly confusing the model, and inflating the bill.

Anthropic's docs recommend three commands, and the discipline is knowing which to reach for:

`/clear` when you switch to an unrelated task. This wipes the working context (keeping your CLAUDE.md) and is the biggest single lever — community measurements put the per-message savings from clearing accumulated history at roughly 30–50%. Clearing isn't "losing your work"; it's refusing to pay rent on context you no longer need.
`/compact` when one phase is done but the thread matters — after a refactor lands, before you start the next module. Compaction summarizes the conversation so far, keeping current file states, task objectives, and key decisions while discarding intermediate reasoning, rejected approaches, and repeated file reads. Claude Code will auto-compact for you at ~95% of the window, but manual compaction lets you pick a clean checkpoint instead of a random one mid-thought.
`/context` to see what's eating the window before you decide. Most people optimize blind. Look first.

The paradigm shift here: treating context as a renewable resource you top up and flush deliberately, rather than a log that only ever grows. The engineers who spend the least per accepted change are ruthless about /clear.

Move 2: A tight CLAUDE.md makes it smarter and cheaper

Here's where "save tokens" and "don't make it dumber" stop being in tension.

CLAUDE.md is the file Claude Code loads into context automatically at the start of every session — your standing instructions on how to run tests, where things live, what the conventions are. It's the highest-value tokens in your whole session, because good context up front prevents dozens of exploratory tool calls later. The model that already knows how to run your tests doesn't burn 2,000 tokens rediscovering it.

But there's a trap: because it helps, people hoard. They dump the entire architecture, every edge case, every past incident into a 600-line CLAUDE.md — and now they pay for all of it, on every turn, forever. Both Anthropic and OpenAI explicitly warn that oversized context files consume your usage faster.

The judgment call — and it is a judgment call, which is why it doesn't automate away — is curation. Keep CLAUDE.md under ~150 lines. Put in what changes the model's behavior on most tasks (how to run and test, the non-obvious conventions, the one architectural fact that prevents a class of mistakes). Leave out what the model can cheaply rediscover with a grep. A lean, high-signal CLAUDE.md is the rare optimization that makes the agent both cheaper and smarter. A bloated one makes it more expensive and, past a point, less focused.

Move 3: Plan before you execute — this is the big one

If you take one thing from this article, take this: the most expensive tokens are the ones spent implementing the wrong thing.

The pattern most people use is "here's the bug, fix it" — and then they pay for a long, multi-file edit loop that turns out to be aimed at the wrong module. The fix is to separate exploration from execution. Ask for a plan first: "Search around, figure out what's causing this, and tell me the plan. Don't write any code yet." Claude Code uses its agentic-search tools, comes back with an approach, and you get a cheap checkpoint to catch a wrong-headed plan before you've paid for the implementation.

This maps directly to the pillar's cost equation: a wasted implementation loop is the "retries" and "your steering time" line items, and it's usually the largest one. Planning is a few thousand tokens of reading and reasoning; a wrong implementation is tens of thousands of tokens plus a bug-cleanup pass plus your attention. Plan mode isn't a productivity nicety — it's the single biggest token-saver in the toolbox, precisely because it kills whole wasted loops rather than trimming the edges of good ones.

The corollary lever is the escape key. You don't have to let a run finish to know it's going sideways. When you see Claude Code heading for the wrong file or building the wrong abstraction, interrupt and redirect. Every token after the moment you knew it was wrong is pure waste. Knowing when to interject is one of the highest-skill, highest-savings habits in agentic coding.

Move 4: Right model for the sub-job — and know the subagent trap

Claude Code lets you switch models mid-session (/model) and pin models per subagent. The naive read of "save tokens" is "use the cheap model." The correct read is match the model to the difficulty of the specific step.

The current lineup, by cost: Opus 4.8 ($5/$25 per million in/out), Sonnet ($3/$15), Haiku 4.5 ($1/$5). The allocation that minimizes cost-per-accepted-change:

Frontier (Opus) for the hard, irreversible work — architecture, auth, payments, migrations, gnarly multi-file bugs, and final review. This is exactly where a cheap model's "savings" evaporate into retry loops. A wrong auth change caught in review is cheap; one that ships is not.
Cheap (Haiku) for the mechanical work — summarization, test scaffolding, file discovery, log grinding, boilerplate. Here the frontier model is overkill and the cheap one is genuinely a saving with no quality cost.

Now the counterintuitive part, because this is where "save tokens" advice goes most wrong. Subagents feel like they should be cheaper — isolate a task, keep the main context clean. Sometimes they are. But Anthropic notes that subagent-heavy workflows can consume around 7× the tokens of a single-thread session. Each subagent re-pays for its own system prompt, tool definitions, and extra round trips. For a quick shell command or a one-file read, that overhead dwarfs any benefit.

So the rule isn't "use subagents to save tokens." It's: use a subagent when the main-context clutter it prevents is worth more than the startup overhead it adds — parallel, independent workstreams; reading many files without polluting your main thread; a genuinely separable sub-task. For a sequential single-file edit, work directly.

And a governance point that saves real money on a team: commit your subagent configs to the repo and pin the models there — code review on Sonnet, linting on Haiku, architecture on Opus. Otherwise every developer defaults every agent to the most expensive model out of caution, and your bill reflects willpower instead of policy.

Move 5: Let caching and thinking do free work

Two levers most people never touch:

Prompt caching. Claude caches the stable prefix of your context, and a cache hit costs about 10% of the normal input price — a 90% discount on the tokens you reuse. Claude Code leans on this automatically, but you can defeat it without realizing: caching is a prefix match, so churning your CLAUDE.md, reordering tools, or swapping models mid-session invalidates the cache and forces a full-price reprocess. The practical habit: keep the stable stuff (your CLAUDE.md, your tool set) stable within a session, and let volatile content live at the end of the conversation where it can't poison the cached prefix.

Thinking, applied surgically. Extended thinking (and the underlying effort setting) genuinely improves hard reasoning — architecture, subtle bugs, migrations — and Claude 4-class models can now think between tool calls, which is exactly where it pays off. But thinking is tokens. The move is to spend it where it changes the outcome and not on trivial edits. "Think hard about this race condition" is worth it; thinking hard about renaming a variable is just spend. Matching reasoning depth to task difficulty is the same discipline as matching model to task — applied one level down.

Move 6: Prompt for an accepted change, not a pretty draft

The fastest way to fewer total tokens is to get closer to one-shot. Anthropic's prompting guidance is really a token-efficiency guide in disguise: the clearer your instruction, the fewer correction loops you pay for.

Concretely, in a coding session that means:

Front-load the spec. State the task, the intent, and the constraints in the first turn rather than dribbling them out over five. Underspecified prompts that get corrected turn-by-turn are the single most common source of avoidable retries.
Give it a target it can check itself against. Point it at the test, tell it to run the tests, and have it iterate to green. A model that can verify its own work stops guessing — which stops the guess-check-guess loop that burns tokens.
Small diffs, commit often. Have Claude make a small change, run the tests, commit, and move on. If a run goes off the rails, you reset to the last commit instead of paying to untangle a sprawling bad edit.
Use screenshots for UI work. The models are multimodal; pasting a mock or a broken-layout screenshot is often worth a thousand tokens of you describing it in prose — and it lands the change faster.

None of this is "be terse to save tokens." It's the opposite: spend a few more tokens on a precise, verifiable first instruction so you don't spend ten times as many on correction. That is the entire thesis of the cost-per-accepted-change lens, applied at the prompt level.

The anti-patterns, in one place

The through-line of every "saving tokens made my bill worse" story:

Starving the context so the model can't see what it needs, then paying for the exploration loops it runs to recover.
A cheap model on a hard task — the false economy that shows up as three retries and a bug-cleanup pass.
"Review the whole app" when you meant one flow. Scope the request to a file, a function, or a PR.
Zombie sessions kept alive across unrelated tasks, reprocessing stale history on every turn. /clear.
A 600-line CLAUDE.md you pay for on every single turn to save a few greps.
Reflexive subagents for trivial steps, paying 7× overhead for isolation you didn't need.

Every one of these lowers tokens-per-message and raises tokens-per-accepted-change. That's the trap. Optimize the second number.

FAQ

What's the fastest way to cut my Claude Code token bill today?

Use /clear between unrelated tasks and stop keeping one long-running session alive. Clearing accumulated history is the biggest single lever — it cuts per-message cost meaningfully because the whole conversation is resent on every turn. Pair it with a lean CLAUDE.md and plan-before-execute and you've got the top three levers.

Does using a cheaper model actually save money?

For mechanical work — summarization, test scaffolding, file discovery — yes. For architecture, auth, payments, migrations, and hard bugs, usually no: the cheap model's mistakes turn into retry loops and cleanup that cost more than the frontier model would have. Match the model to the difficulty of the specific step, not to your budget anxiety.

Are subagents a good way to save tokens?

Only sometimes. Subagent-heavy workflows can use around 7× the tokens of a single thread because each subagent re-pays for its prompt, tools, and round trips. Use them when they prevent genuine main-context clutter or enable parallel independent work — not for quick sequential steps.

Should I turn off extended thinking to save tokens?

No — turn it down where it doesn't help, and keep it on where it does. Thinking is worth its tokens on hard reasoning (subtle bugs, architecture, migrations) and wasteful on trivial edits. It's the same match-effort-to-difficulty discipline as model selection.

How does this connect to picking the right model in the first place?

This article is the tactical layer under the strategic one. The flagship piece on which model to use makes the case that the metric that matters is cost per accepted production change. Everything here is a way to lower that number inside Claude Code specifically.

Where to go from here

Saving tokens in Claude Code isn't about doing less — it's about not paying twice. Manage context deliberately, keep your standing instructions lean, plan before you execute, match model and thinking depth to the actual difficulty of each step, and prompt for a change you can verify. Do that and your bill drops because your output got better, not in spite of it.

If you're building an AI-native product and want the strategic version of this — which models, what stack, how to think about the whole tokens-to-revenue ladder — start with Which AI Model Should You Actually Use to Build Software?, then work with me or subscribe to the newsletter for more field notes from building in public.

Atticus Li

Experimentation and growth leader. CXL-certified CRO practitioner, Mindworx-certified in behavioral economics. Led 100+ in-house experiments at NRG in 2025, with project evidence and limits documented in the case studies.

About LinkedIn Newsletter