Claude API pricing is token-in, token-out. The meter runs on everything: your system prompt, the conversation history, the documents you paste in, the model's reply. If your bill is higher than you expected, the cause is almost always one of a handful of structural problems — and most of them have a fix.
This is a practical map of the main techniques. Some require engineering work. One doesn't.
Requires engineering
Anthropic offers a 90% discount on cached prompt tokens. If you have a long system prompt, a large document, or a tool definition block that appears in many requests unchanged, you can mark it as cacheable. The first call charges full price; subsequent calls within the cache TTL (5 minutes for default caching, up to 1 hour with extended) charge at the cache-read rate.
The win is significant if you have a fixed context block you're sending on every request. A 2,000-token system prompt sent on 1,000 requests a day is 2 million input tokens. With prompt caching, 999 of those calls read from cache at a fraction of the cost.
The catch: you have to instrument it. You add "cache_control": {"type": "ephemeral"} markers to your API request payload at the right breakpoints. It's a few lines of code, but it's code.
Requires engineering
Not every task needs Sonnet. Claude Haiku is approximately 15× cheaper per token than Sonnet 3.5 and handles classification, extraction, simple summarisation, and short-answer tasks well. Sonnet handles most reasoning work. Opus is for the genuinely hard stuff — complex strategy, nuanced long-form writing.
The pattern: run a cheap classifier call first (or use heuristics based on the prompt structure) to route to the right model. Tools like RouteLLM automate this. Done well, model routing can cut your bill significantly without any visible quality drop on the tasks users actually care about.
The catch: you need routing logic, you need to test which tasks can safely downgrade, and you're now maintaining a multi-model system.
Requires engineering
Most prompts have more words in them than they need. Filler phrases, redundant examples, verbose instructions that could be half as long and twice as clear. A well-compressed prompt costs fewer input tokens on every single call.
Practical moves:
JSON only, no commentary instead of two sentences asking for JSON outputSmall gains per call, but they compound across millions of calls.
Requires engineering
Anthropic's Message Batches API offers 50% off standard input and output token pricing. You submit a batch of up to 10,000 requests, they process asynchronously, and results are available within 24 hours (usually much faster).
This is an obvious fit for any workload that isn't time-sensitive: overnight document processing, bulk classification, report generation, content pipelines. If you're doing real-time work, batching doesn't apply. If you're running nightly jobs, it's free money.
The catch: you have to restructure your code to use the batches endpoint, handle async results, and deal with partial failures. Straightforward, but it's an engineering task.
Requires engineering
In a long multi-turn conversation, the full message history is sent to the model on every turn. Turn 20 of a conversation carries turns 1–19 as input context. If each turn is 500 tokens, you're sending 10,000 tokens of history before Claude even reads the current message.
The fix is to compress the history at checkpoints. At every N turns (or when the history crosses a token threshold), summarise the conversation so far into a compact digest and replace the full history with that digest. The model loses granular detail from early turns but retains the key thread.
The catch: writing a good summarisation checkpoint that doesn't lose important context is non-trivial. You need to test it against your actual use cases. And again — code required.
Requires engineering — except one option
This is the technique most people underuse. The problem it solves: you have data that Claude needs to know about — your project notes, your contact list, your product knowledge base, your personal context — and you're either pasting it all into every conversation (expensive), or keeping it in a large document you dump into context (expensive and inflexible).
The better pattern is to store that data externally and retrieve only what's relevant to the current task. Claude asks for it via a tool call; the store returns the matching records; Claude reads 200 tokens instead of 4,000.
How to implement it: build a retrieval layer (a vector store, a full-text search index, or a structured database) and expose it to Claude as an MCP server or function tool. When Claude needs to look something up, it calls the tool and gets a tight, relevant result back.
The catch: building a retrieval layer is real engineering work — unless you use a hosted MCP store, which is a pre-built version of the same thing. More on that below.
Requires engineering
You can't optimise what you can't see. Before making targeted changes to prompt structure, model selection, or caching strategy, you need to know where your tokens are actually going.
Tools like Helicone, LangSmith, and LiteLLM (self-hosted) give you per-call logging: which prompts, which models, how many tokens, what they cost. The diagnostic step usually surfaces obvious wins — a single prompt that's 3× longer than it needs to be, a workload that's using Sonnet where Haiku would do fine, a system prompt that's being sent unnecessarily on single-turn requests.
The catch: wiring observability into your system is a setup cost. Worth it for anything at scale, but it's not zero effort.
Requires engineering
Related to prompt compression and external memory, but applied to retrieved content specifically. When you pull in a document or search result to give Claude context, that raw content often carries a lot of tokens that aren't relevant to the question at hand.
Context distillation means compressing retrieved content before passing it to the main model call. A common pattern: use a small, cheap model (or Claude Haiku) to extract only the relevant sections from a long document, then pass that extract to the main model. The main model reads 400 tokens of distilled context instead of 4,000 tokens of raw document.
This adds a hop to your pipeline — the distillation call — but if your documents are long, the net token cost is often lower, and the main model gives better answers because it's reading signal rather than noise.
Every technique above requires engineering work: code changes, infrastructure, testing. That's fine if you're a developer. It's a real barrier if you're not.
There's exactly one technique in this list that a non-developer can apply today, in about 30 seconds, with no code: using a hosted MCP record store instead of dumping data into context.
Stash is that store.
Here's the concrete problem it solves. If you use Claude regularly for work — tracking projects, keeping notes, remembering contacts, loading your working context at the start of each day — you've probably developed a habit of pasting things in. Your role, your current projects, your preferences. Maybe a document. Maybe a list.
Each paste costs tokens. More importantly, it costs every time, on every conversation, whether or not that particular context is relevant to today's task.
Stash moves that data out of your context window and into a searchable store. Claude calls the store when it needs something specific, gets back a tight result, and doesn't carry the rest of it as dead weight.
You add the connector once: Settings → Connectors → Add custom → paste:
https://app.stashlite.com/mcp
Sign in with Google. You get a free account. No card required.
Then tell Claude to populate your context: "Add to my Stash context: I'm a product manager at a B2B SaaS company, currently working on a pricing review and a user interview synthesis project." Claude writes it to your store.
From that point, instead of pasting your context into each conversation, you tell Claude: "start my day." Claude calls context() from Stash. Gets back your standing context in one tool call. Uses it. Doesn't carry it across every turn for the rest of the conversation.
The same pattern applies to any data you regularly pull into Claude: a contact list, a project reference document, a list of decisions you've made, a knowledge base. Store it in Stash, search it when relevant, skip it when it's not.
When we compared Notion MCP (another common approach) against Stash on a 500-record dataset with the same query:
The gap is structural. Notion returns full page content — metadata, property columns, version markup. Stash returns only the fields you asked for, in a terse format Claude reads cleanly. (These are preliminary numbers from a single run. Data shape will affect your results. But the structural reason for the gap is stable.)
If you're pulling data into Claude from a Notion database on every query, external memory via Stash is likely your highest-leverage cost-reduction move — and unlike the others, it takes 30 seconds to set up.
A rough prioritisation:
These techniques stack. Prompt caching plus model routing plus a tighter prompt can get you to a fraction of your original cost on the same workload. Start with the most obvious win for your specific situation and measure before moving on.