Stash / Blog · June 2026 · 8 min read

8 Ways to Reduce Your Claude API Costs (And the One That Doesn't Need Engineering)

Claude API pricing is token-in, token-out. The meter runs on everything: your system prompt, the conversation history, the documents you paste in, the model's reply. If your bill is higher than you expected, the cause is almost always one of a handful of structural problems — and most of them have a fix.

This is a practical map of the main techniques. Some require engineering work. One doesn't.

The full map

1. Prompt caching

Requires engineering

Anthropic offers a 90% discount on cached prompt tokens. If you have a long system prompt, a large document, or a tool definition block that appears in many requests unchanged, you can mark it as cacheable. The first call charges full price; subsequent calls within the cache TTL (5 minutes for default caching, up to 1 hour with extended) charge at the cache-read rate.

The win is significant if you have a fixed context block you're sending on every request. A 2,000-token system prompt sent on 1,000 requests a day is 2 million input tokens. With prompt caching, 999 of those calls read from cache at a fraction of the cost.

The catch: you have to instrument it. You add "cache_control": {"type": "ephemeral"} markers to your API request payload at the right breakpoints. It's a few lines of code, but it's code.

2. Model routing

Requires engineering

Not every task needs Sonnet. Claude Haiku is approximately 15× cheaper per token than Sonnet 3.5 and handles classification, extraction, simple summarisation, and short-answer tasks well. Sonnet handles most reasoning work. Opus is for the genuinely hard stuff — complex strategy, nuanced long-form writing.

The pattern: run a cheap classifier call first (or use heuristics based on the prompt structure) to route to the right model. Tools like RouteLLM automate this. Done well, model routing can cut your bill significantly without any visible quality drop on the tasks users actually care about.

The catch: you need routing logic, you need to test which tasks can safely downgrade, and you're now maintaining a multi-model system.

3. Prompt compression

Requires engineering

Most prompts have more words in them than they need. Filler phrases, redundant examples, verbose instructions that could be half as long and twice as clear. A well-compressed prompt costs fewer input tokens on every single call.

Practical moves:

Switch from prose instructions to structured lists — models read them more reliably and they're shorter
Remove examples once the model is reliably producing correct output (one example is usually enough)
Use precise constraints rather than long explanations: JSON only, no commentary instead of two sentences asking for JSON output
Audit your system prompt for anything that's there "just in case" — if you haven't needed it recently, remove it and test

Small gains per call, but they compound across millions of calls.

4. Batch API

Requires engineering

Anthropic's Message Batches API offers 50% off standard input and output token pricing. You submit a batch of up to 10,000 requests, they process asynchronously, and results are available within 24 hours (usually much faster).

This is an obvious fit for any workload that isn't time-sensitive: overnight document processing, bulk classification, report generation, content pipelines. If you're doing real-time work, batching doesn't apply. If you're running nightly jobs, it's free money.

The catch: you have to restructure your code to use the batches endpoint, handle async results, and deal with partial failures. Straightforward, but it's an engineering task.

5. Conversation summarisation

Requires engineering

In a long multi-turn conversation, the full message history is sent to the model on every turn. Turn 20 of a conversation carries turns 1–19 as input context. If each turn is 500 tokens, you're sending 10,000 tokens of history before Claude even reads the current message.

The fix is to compress the history at checkpoints. At every N turns (or when the history crosses a token threshold), summarise the conversation so far into a compact digest and replace the full history with that digest. The model loses granular detail from early turns but retains the key thread.

The catch: writing a good summarisation checkpoint that doesn't lose important context is non-trivial. You need to test it against your actual use cases. And again — code required.

6. External memory / MCP record stores

Requires engineering — except one option

This is the technique most people underuse. The problem it solves: you have data that Claude needs to know about — your project notes, your contact list, your product knowledge base, your personal context — and you're either pasting it all into every conversation (expensive), or keeping it in a large document you dump into context (expensive and inflexible).

The better pattern is to store that data externally and retrieve only what's relevant to the current task. Claude asks for it via a tool call; the store returns the matching records; Claude reads 200 tokens instead of 4,000.

How to implement it: build a retrieval layer (a vector store, a full-text search index, or a structured database) and expose it to Claude as an MCP server or function tool. When Claude needs to look something up, it calls the tool and gets a tight, relevant result back.

The catch: building a retrieval layer is real engineering work — unless you use a hosted MCP store, which is a pre-built version of the same thing. More on that below.

7. Observability

Requires engineering

You can't optimise what you can't see. Before making targeted changes to prompt structure, model selection, or caching strategy, you need to know where your tokens are actually going.

Tools like Helicone, LangSmith, and LiteLLM (self-hosted) give you per-call logging: which prompts, which models, how many tokens, what they cost. The diagnostic step usually surfaces obvious wins — a single prompt that's 3× longer than it needs to be, a workload that's using Sonnet where Haiku would do fine, a system prompt that's being sent unnecessarily on single-turn requests.

The catch: wiring observability into your system is a setup cost. Worth it for anything at scale, but it's not zero effort.

8. Context distillation

Requires engineering

Related to prompt compression and external memory, but applied to retrieved content specifically. When you pull in a document or search result to give Claude context, that raw content often carries a lot of tokens that aren't relevant to the question at hand.

Context distillation means compressing retrieved content before passing it to the main model call. A common pattern: use a small, cheap model (or Claude Haiku) to extract only the relevant sections from a long document, then pass that extract to the main model. The main model reads 400 tokens of distilled context instead of 4,000 tokens of raw document.

This adds a hop to your pipeline — the distillation call — but if your documents are long, the net token cost is often lower, and the main model gives better answers because it's reading signal rather than noise.

The one that requires no engineering

Every technique above requires engineering work: code changes, infrastructure, testing. That's fine if you're a developer. It's a real barrier if you're not.

There's exactly one technique in this list that a non-developer can apply today, in about 30 seconds, with no code: using a hosted MCP record store instead of dumping data into context.

Stash is that store.

Why this matters for non-developers: Every other cost-reduction technique on this page requires writing and deploying code. Stash requires pasting a URL into Claude's connector settings. That's the full setup.

Here's the concrete problem it solves. If you use Claude regularly for work — tracking projects, keeping notes, remembering contacts, loading your working context at the start of each day — you've probably developed a habit of pasting things in. Your role, your current projects, your preferences. Maybe a document. Maybe a list.

Each paste costs tokens. More importantly, it costs every time, on every conversation, whether or not that particular context is relevant to today's task.

Stash moves that data out of your context window and into a searchable store. Claude calls the store when it needs something specific, gets back a tight result, and doesn't carry the rest of it as dead weight.

What it looks like in practice

You add the connector once: Settings → Connectors → Add custom → paste:

https://app.stashlite.com/mcp

Then tell Claude to populate your context: "Add to my Stash context: I'm a product manager at a B2B SaaS company, currently working on a pricing review and a user interview synthesis project." Claude writes it to your store.

From that point, instead of pasting your context into each conversation, you tell Claude: "start my day." Claude calls context() from Stash. Gets back your standing context in one tool call. Uses it. Doesn't carry it across every turn for the rest of the conversation.

The same pattern applies to any data you regularly pull into Claude: a contact list, a project reference document, a list of decisions you've made, a knowledge base. Store it in Stash, search it when relevant, skip it when it's not.

The token difference

When we compared Notion MCP (another common approach) against Stash on a 500-record dataset with the same query:

Notion MCP: ~4,100 tokens for a top-5 result set
Stash: ~175 tokens for the same top-5

The gap is structural. Notion returns full page content — metadata, property columns, version markup. Stash returns only the fields you asked for, in a terse format Claude reads cleanly. (These are preliminary numbers from a single run. Data shape will affect your results. But the structural reason for the gap is stable.)

If you're pulling data into Claude from a Notion database on every query, external memory via Stash is likely your highest-leverage cost-reduction move — and unlike the others, it takes 30 seconds to set up.

Stash is free to start.
10,000 records · 100 queries/month · no card required.
Sign up or add the connector at stashlite.com.
Add to Claude →

Which technique to try first

A rough prioritisation:

If you're not a developer: start with external memory via Stash. It's the only option here you can do today without code.
If you have a long, repeated system prompt: prompt caching is the highest-leverage engineering change. The 90% discount on cached tokens adds up fast.
If you're running async jobs: batch API is a straightforward 50% cut for jobs that don't need real-time responses.
If you don't know where your tokens are going: wire up observability first. The diagnosis usually reveals something more specific than any of the above.
If you have long conversations or drag large documents in: conversation summarisation and context distillation are likely your highest-leverage moves after caching.

These techniques stack. Prompt caching plus model routing plus a tighter prompt can get you to a fraction of your original cost on the same workload. Start with the most obvious win for your specific situation and measure before moving on.

8 Ways to Reduce Your Claude API Costs (And the One That Doesn't Need Engineering)

The full map

1. Prompt caching

2. Model routing

3. Prompt compression

4. Batch API

5. Conversation summarisation

6. External memory / MCP record stores

7. Observability

8. Context distillation

The one that requires no engineering

What it looks like in practice

The token difference

Which technique to try first

Further reading