Claude for DevOps and SRE: Persistent Runbooks and System Context

June 2026 · 6 min read · For DevOps engineers, SREs, and platform teams

Claude is genuinely useful for incident response: parsing stack traces, suggesting triage steps, drafting postmortems. The problem is that it knows nothing about your stack. Every time you open a session, you explain your architecture from scratch: what services exist, how they connect, what has gone wrong before, and what the standard fix is.

That context lives in your head (and sometimes in a Confluence page no one reads). Stash gives you a way to get it into Claude, persistently, so you stop pasting the same architecture diagram in every incident session.

What you're re-explaining in every session

Most of this never makes it into documentation because documentation is work. With Stash, you can add notes in natural language as you work. Claude builds up a picture of your system over time.

What to store and how to structure it

| Collection | What goes in it | When Claude uses it |
|---|---|---|
| services | Service name, function, owner team, key dependencies, SLA | Incident triage, architecture questions, dependency mapping |
| runbooks | Incident type + step-by-step procedure + last-used date | "What's the runbook for X?" during an incident |
| incidents | Date, service, symptoms, root cause, fix, duration | Postmortems, pattern detection, "have we seen this before?" |
| infra | Cloud config notes, regions, key env vars, cost notes | Capacity planning, cost analysis, new service provisioning |
| context | Stack summary, team structure, on-call rotation, key principles | Loaded at the start of every session via context() |
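
To make the table concrete, here is a minimal sketch of what entries in two of these collections might look like as structured records. Stash takes notes in natural language, so this schema is purely illustrative: the class names and field names are assumptions drawn from the table columns, not Stash's actual storage format.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Runbook:
    # Hypothetical shape of a "runbooks" entry; field names are illustrative.
    incident_type: str
    steps: List[str]
    last_used: Optional[date] = None  # unset until the runbook is first used


@dataclass
class Incident:
    # Hypothetical shape of an "incidents" entry, mirroring the table columns.
    day: date
    service: str
    symptoms: str
    root_cause: str
    fix: str
    duration_minutes: int


# The runbook captured later in this article, expressed as a record.
redis_runbook = Runbook(
    incident_type="Redis auth propagation failure",
    steps=[
        "Check Redis connection errors in service logs",
        "Verify secret rotation timestamp",
        "Bounce affected pods",
        "Confirm error rate drops",
    ],
)
```

Keeping the last-used date optional reflects that a freshly written runbook may not have been exercised yet.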

Incident triage with runbook memory

Here's what an incident session looks like once your runbooks are in Stash:

You: context(). Payment service is returning 503s. Error spike started 14:32. Database connections not exhausted. Queue depth normal.
Claude: Context loaded. Stack: Node.js payment service, a database primary + read replica, Redis queue, hosted on AWS eu-west-1.

Payment 503 with normal DB connections and queue: your runbook entry from March flags this pattern. Last time it was a Redis auth token rotation that didn't propagate to the payment service pods. Check: (1) Redis connection errors in payment service logs, (2) last secret rotation timestamp in Secrets Manager. If the token rotated in the last 2 hours and pods weren't bounced, that's your culprit. Roll out a restart of the payment service pods and watch the error rate.

If that's not it: second candidate from your history is the Stripe webhook endpoint timing out on high-volume days. Check Stripe dashboard for webhook delivery failures.

You didn't explain the architecture. You didn't look up the runbook. Claude already knows your March incident pattern and applied it to the current symptoms. That's the difference between Claude-as-generic-assistant and Claude-as-your-ops-partner.
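
The rotation check Claude suggests in the transcript comes down to comparing timestamps. Here is a minimal sketch, assuming you have already fetched the secret's last rotation time (for example from Secrets Manager) and each pod's start time (for example from your cluster). The function name, pod names, and timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def pods_missing_rotation(rotated_at, pod_start_times, now, window=timedelta(hours=2)):
    """Return pods that started before a recent secret rotation.

    A pod that started before the rotation may still hold the old token.
    Returns an empty list when the rotation is older than `window`.
    """
    if now - rotated_at > window:
        return []
    return sorted(
        name for name, started in pod_start_times.items() if started < rotated_at
    )


# Illustrative values only, not real incident data.
now = datetime(2026, 1, 1, 15, 0, tzinfo=timezone.utc)
rotated_at = now - timedelta(minutes=45)
pods = {
    "payment-a": now - timedelta(hours=6),     # started before the rotation
    "payment-b": now - timedelta(minutes=10),  # restarted after the rotation
}
stale = pods_missing_rotation(rotated_at, pods, now)
```

Any pod named in `stale` is a candidate for a restart. The two-hour window mirrors the threshold in the transcript and is easy to adjust.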

Postmortem drafting with incident history

You: Load my incidents collection. Write a postmortem for today's payment 503. Root cause was Redis auth token not propagated after rotation. Duration 38 minutes, customer impact was degraded checkout, no data loss.
Claude: [Postmortem draft: Timeline / Impact / Root Cause / Contributing Factors / Resolution / Action Items]

Notable from your incident history: this is the second Redis auth propagation issue this quarter (last time: March, same root cause). Action item suggestion: add automated verification step to the secret rotation runbook — confirm all pods have picked up the new token before the rotation is marked complete.

The "second time this quarter" catch comes from your incident history in Stash. A generic LLM can't tell you that. Your Stash-backed Claude can, because you logged the last incident.
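
If your incidents were kept as structured records, the same "repeat offender" pattern could also be checked mechanically. The following is a rough sketch, assuming each incident is a dict with hypothetical `date` and `root_cause` keys. The sample data is invented for illustration, and the code says nothing about how Claude or Stash reach the same conclusion.

```python
from collections import Counter
from datetime import date


def quarter_of(d):
    """Return (year, quarter) for a date, with quarters numbered 1 to 4."""
    return (d.year, (d.month - 1) // 3 + 1)


def repeat_root_causes(incidents, today):
    """Count root causes that occurred more than once in today's quarter."""
    current = quarter_of(today)
    counts = Counter(
        i["root_cause"] for i in incidents if quarter_of(i["date"]) == current
    )
    return {cause: n for cause, n in counts.items() if n > 1}


# Invented sample data, for illustration only.
incidents = [
    {"date": date(2026, 1, 20), "root_cause": "redis auth propagation"},
    {"date": date(2026, 3, 9), "root_cause": "redis auth propagation"},
    {"date": date(2026, 2, 2), "root_cause": "stripe webhook timeout"},
    {"date": date(2025, 11, 5), "root_cause": "redis auth propagation"},
]
repeats = repeat_root_causes(incidents, today=date(2026, 3, 9))
```

This only works if root causes are labeled consistently, which is one reason to log each incident in the same shape every time.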

Capturing runbooks as you work

The best time to write a runbook is right after you've worked through the incident it covers, while the steps are still fresh. Stash makes this low-friction:

You: Add to runbooks: "Redis auth propagation failure". Symptoms: 503s on services that use Redis, normal DB connections. Fix: (1) check Redis connection errors in service logs, (2) verify secret rotation timestamp, (3) bounce affected pods, (4) confirm error rate drops. Prevention: add pod-restart step to rotation runbook.
Claude: Stored. I'll have this available in any future incident session when symptoms match.

One minute to capture. Available in every future session. This is how institutional knowledge accumulates without a documentation sprint.

Architecture documentation without a wiki sprint

Most architectural documentation doesn't exist because "we'll write it later" never happens. Stash is the alternative: add a service entry when you build it, update it when something changes. Not a wiki sprint — a running log.

You: Add to services: "notification-service", which sends transactional emails + push notifications. Owner: platform team. Stack: Python FastAPI, Celery + Redis for async jobs, SES for email, FCM for push. Key dependency: user-service for contact preferences. SLA: best-effort (non-critical path).

Two minutes per service. Claude now knows your notification service exists, what it does, and who owns it. When something breaks and involves email delivery, Claude can suggest checking SES delivery logs without you explaining that you use SES.
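
One thing a services collection makes possible is reverse dependency lookup: given a failing component, which services rely on it? Below is a minimal sketch over hypothetical structured entries. The dict layout and the `dependents_of` helper are assumptions (Stash stores natural-language notes), and the two entries are paraphrased from the services mentioned in this article.

```python
def dependents_of(component, services):
    """Return names of services that list `component` as a dependency."""
    needle = component.lower()
    return sorted(
        name
        for name, entry in services.items()
        if needle in (dep.lower() for dep in entry["depends_on"])
    )


# Hypothetical structured form of two service notes from this article.
services = {
    "notification-service": {
        "owner": "platform team",
        "depends_on": ["Celery", "Redis", "SES", "FCM", "user-service"],
    },
    "payment-service": {
        "depends_on": ["database", "Redis"],
    },
}

affected_by_redis = dependents_of("redis", services)
```

During a Redis incident, a lookup like this lists every service worth checking, not just the one that paged you.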

What Stash doesn't replace

| Not Stash | Use instead |
|---|---|
| Live metrics, logs, or traces | Grafana, Datadog, CloudWatch (Stash doesn't have live data access) |
| Secrets management | AWS Secrets Manager, Vault (never store secrets in Stash) |
| Ticketing and incident management | PagerDuty, Jira (Stash doesn't integrate with these) |
| Config-as-code or infrastructure state | Terraform, Pulumi (Stash is notes, not a source of truth) |

Token efficiency matters for API users

If you're calling Claude via API (common in DevOps automation scripts), token costs add up. Stash uses FTS5 keyword search, so it returns only the matching records, not your entire knowledge base. A search across 500 runbook entries returns roughly 192 tokens of results on average, versus thousands of tokens if you pasted the full knowledge base into every prompt. For automation use cases, this matters.
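
To illustrate why keyword search keeps responses small, here is a minimal sketch using SQLite's FTS5 module through Python's standard library. It is not Stash's implementation: the table name, columns, and sample rows are assumptions, and it requires a Python build whose SQLite includes FTS5 (most do).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# An FTS5 virtual table indexes every column for full-text keyword search.
conn.execute("CREATE VIRTUAL TABLE runbooks USING fts5(title, body)")
conn.executemany(
    "INSERT INTO runbooks (title, body) VALUES (?, ?)",
    [
        # Sample rows; the third one is invented for illustration.
        ("Redis auth propagation failure", "503s on services that use Redis; bounce affected pods"),
        ("Stripe webhook timeouts", "Check Stripe dashboard for webhook delivery failures"),
        ("Disk pressure on workers", "Prune old images and restart the node agent"),
    ],
)


def search(query):
    """Return titles of matching rows only, best match first."""
    rows = conn.execute(
        "SELECT title FROM runbooks WHERE runbooks MATCH ? ORDER BY rank",
        (query,),
    )
    return [title for (title,) in rows]


hits = search("redis auth")
```

Only rows containing both query terms come back, so the amount of text handed to the model scales with the number of matches rather than the size of the collection.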

Setup takes two minutes

Stash is a remote MCP server — no install, no local setup. You add one URL to Claude Settings and you're done. Signups are free, self-serve, and use Google OAuth.

Get your connector URL

Sign in at stashlite.com and add your connector URL to Claude Settings under Integrations. Takes two minutes.

Free tier: 2,500 records · 50 queries/month. Pro: 100k records · 1k queries — £8/month.

No install. No app. Paste one URL and every conversation knows your stack.

Pricing may change as the service develops. Cancel anytime.