Claude for DevOps and SRE: Persistent Runbooks and System Context

June 2026 · 6 min read · For DevOps engineers, SREs, and platform teams

Claude is genuinely useful for incident response: parsing stack traces, suggesting triage steps, drafting postmortems. The problem is that it knows nothing about your stack. Every time you open a session, you're explaining your architecture from scratch: what services exist, how they connect, what's gone wrong before, and what the standard fix is.

That context lives in your head (and sometimes in a Confluence page no one reads). Stash gives you a way to get it into Claude, persistently, so you stop pasting the same architecture diagram in every incident session.

What you're re-explaining in every session

Service architecture (what services exist, what they do, how they communicate)
Infrastructure topology (which cloud, which regions, key dependencies)
Known failure modes ("database connection exhaustion is always the first thing to check")
Runbook procedures for common incidents
Past incidents and their root causes
On-call escalation paths and who owns what
Deployment conventions and rollback procedures

Most of this never makes it into documentation because documentation is work. With Stash, you can add notes in natural language as you work. Claude builds up a picture of your system over time.

What to store and how to structure it

Collection	What goes in it	When Claude uses it
services	Service name, function, owner team, key dependencies, SLA	Incident triage, architecture questions, dependency mapping
runbooks	Incident type + step-by-step procedure + last-used date	"What's the runbook for X?" during an incident
incidents	Date, service, symptoms, root cause, fix, duration	Postmortems, pattern detection, "have we seen this before?"
infra	Cloud config notes, regions, key env vars, cost notes	Capacity planning, cost analysis, new service provisioning
context	Stack summary, team structure, on-call rotation, key principles	Loaded at the start of every session via `context()`
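
As a rough illustration of the table above, two of the collections could be modeled like this. It is only a sketch: Stash takes natural-language notes, and the class names, field names, and the `to_note` helper are hypothetical, chosen to mirror the table rather than any real Stash schema.

```python
from dataclasses import dataclass, field, asdict
from datetime import date


@dataclass
class Service:
    """One entry in a hypothetical `services` collection."""
    name: str
    function: str
    owner_team: str
    dependencies: list[str] = field(default_factory=list)
    sla: str = "best-effort"


@dataclass
class Incident:
    """One entry in a hypothetical `incidents` collection."""
    day: date
    service: str
    symptoms: str
    root_cause: str
    fix: str
    duration_minutes: int


def to_note(record) -> str:
    """Flatten a record into a single 'key: value; key: value' note."""
    return "; ".join(f"{key}: {value}" for key, value in asdict(record).items())


notification = Service(
    name="notification-service",
    function="transactional emails and push notifications",
    owner_team="platform",
    dependencies=["user-service"],
)
print(to_note(notification))
```

Keeping the same fields in every entry is what makes later questions ("who owns this?", "what does it depend on?") answerable from a short note.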

Incident triage with runbook memory

Here's what an incident session looks like once your runbooks are in Stash:

You context(). Payment service is returning 503s. Error spike started 14:32. Database connections not exhausted. Queue depth normal.

Claude Context loaded. Stack: Node.js payment service, database primary + read replica, Redis queue, hosted on AWS eu-west-1.

Payment 503 with normal DB connections and queue: your runbook entry from March flags this pattern. Last time it was a Redis auth token rotation that didn't propagate to the payment service pods. Check: (1) Redis connection errors in payment service logs, (2) last secret rotation timestamp in Secrets Manager. If the token rotated in the last 2 hours and the pods weren't bounced, that's your culprit. Roll out a restart of the payment service pods and watch the error rate.

If that's not it: second candidate from your history is the Stripe webhook endpoint timing out on high-volume days. Check Stripe dashboard for webhook delivery failures.

You didn't explain the architecture. You didn't look up the runbook. Claude already knows your March incident pattern and applied it to the current symptoms. That's the difference between Claude-as-generic-assistant and Claude-as-your-ops-partner.
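
The check Claude suggests in that session (did the token rotate in the last 2 hours, and which pods predate the rotation?) can be sketched in a few lines. This is a minimal illustration, not a real integration: it assumes pods read the Redis token only at startup, and the pod names, timestamps, and function names are all hypothetical. In practice the inputs would come from your cluster and your secrets manager.

```python
from datetime import datetime, timedelta, timezone


def rotation_is_suspect(rotated_at: datetime, now: datetime,
                        window: timedelta = timedelta(hours=2)) -> bool:
    """True if the secret rotated within the look-back window before `now`."""
    return timedelta(0) <= now - rotated_at <= window


def stale_pods(pod_start_times: dict[str, datetime], rotated_at: datetime) -> list[str]:
    """Return pods started before the rotation, which may still hold the old token."""
    return sorted(name for name, started in pod_start_times.items()
                  if started < rotated_at)


# Hypothetical data for illustration only.
now = datetime(2026, 6, 1, 14, 40, tzinfo=timezone.utc)
rotated_at = datetime(2026, 6, 1, 13, 30, tzinfo=timezone.utc)
pods = {
    "payment-7d9f-a": datetime(2026, 6, 1, 9, 0, tzinfo=timezone.utc),
    "payment-7d9f-b": datetime(2026, 6, 1, 13, 45, tzinfo=timezone.utc),
}

if rotation_is_suspect(rotated_at, now):
    print("Pods to restart:", stale_pods(pods, rotated_at))
```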

Postmortem drafting with incident history

You Load my incidents collection. Write a postmortem for today's payment 503. Root cause was Redis auth token not propagated after rotation. Duration 38 minutes, customer impact was degraded checkout — no data loss.

Claude [Postmortem draft: Timeline / Impact / Root Cause / Contributing Factors / Resolution / Action Items]

Notable from your incident history: this is the second Redis auth propagation issue this quarter (last time: March, same root cause). Action item suggestion: add automated verification step to the secret rotation runbook — confirm all pods have picked up the new token before the rotation is marked complete.

The "second time this quarter" catch comes from your incident history in Stash. A generic LLM can't tell you that. Your Stash-backed Claude can, because you logged the last incident.
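
The repeat-incident catch is, at its core, a count over logged root causes. Here is a minimal sketch of that idea, assuming each incident is stored with a date and a normalized root-cause label. The records, the label strings, and the 90-day rolling window (standing in for "this quarter") are all assumptions made for the example, and real notes are free text, so exact string matching is a simplification.

```python
from collections import Counter
from datetime import date, timedelta


def repeated_root_causes(incidents: list[dict], today: date,
                         window_days: int = 90) -> dict[str, int]:
    """Count root causes that appear more than once in the rolling window."""
    cutoff = today - timedelta(days=window_days)
    counts = Counter(
        item["root_cause"]
        for item in incidents
        if cutoff <= item["date"] <= today
    )
    return {cause: n for cause, n in counts.items() if n > 1}


# Hypothetical incident log for illustration only.
incidents = [
    {"date": date(2026, 3, 18), "service": "payment",
     "root_cause": "redis auth token not propagated"},
    {"date": date(2026, 4, 2), "service": "payment",
     "root_cause": "stripe webhook timeout"},
    {"date": date(2026, 5, 27), "service": "payment",
     "root_cause": "redis auth token not propagated"},
]

print(repeated_root_causes(incidents, today=date(2026, 5, 27)))
```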

Capturing runbooks as you work

The best time to write a runbook is right after you've worked through the incident, while the steps are still fresh. Stash makes this low-friction:

You Add to runbooks: "Redis auth propagation failure" — symptoms: 503s on services that use Redis, normal DB connections. Fix: (1) check Redis connection errors in service logs, (2) verify secret rotation timestamp, (3) bounce affected pods, (4) confirm error rate drops. Prevention: add pod-restart step to rotation runbook.

Claude Stored. I'll have this available in any future incident session when symptoms match.

One minute to capture. Available in every future session. This is how institutional knowledge accumulates without a documentation sprint.

Architecture documentation without a wiki sprint

Most architectural documentation doesn't exist because "we'll write it later" never happens. Stash is the alternative: add a service entry when you build it, update it when something changes. Not a wiki sprint — a running log.

You Add to services: "notification-service" — sends transactional emails + push notifications. Owner: platform team. Stack: Python FastAPI, Celery + Redis for async jobs, SES for email, FCM for push. Key dependency: user-service for contact preferences. SLA: best-effort (non-critical path).

Two minutes per service. Claude now knows your notification service exists, what it does, and who owns it. When something breaks that involves email delivery, Claude can suggest checking SES delivery logs without you explaining that you use SES.
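
Recording a "key dependency" for each service also makes blast-radius questions answerable. The sketch below shows the idea with a plain dictionary and a graph walk. The `dependents_of` helper is hypothetical, and apart from notification-service depending on user-service, the services and dependency edges are invented for the example.

```python
def dependents_of(services: dict[str, list[str]], target: str) -> list[str]:
    """Return every service that depends on `target`, directly or transitively."""
    affected: set[str] = set()
    frontier = [target]
    while frontier:
        current = frontier.pop()
        for name, deps in services.items():
            if current in deps and name not in affected:
                affected.add(name)
                frontier.append(name)
    return sorted(affected)


# Service name -> list of services it depends on.
services = {
    "user-service": [],
    "notification-service": ["user-service"],
    "payment-service": ["user-service"],  # hypothetical dependency
    "checkout-web": ["payment-service"],  # hypothetical service
}

print(dependents_of(services, "user-service"))
```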

What Stash doesn't replace

Not Stash	Use instead
Live metrics, logs, or traces	Grafana, Datadog, CloudWatch — Stash doesn't have live data access
Secrets management	AWS Secrets Manager, Vault — never store secrets in Stash
Ticketing and incident management	PagerDuty, Jira — Stash doesn't integrate with these
Config-as-code or infrastructure state	Terraform, Pulumi — Stash is notes, not a source of truth

Token efficiency matters for API users

If you're calling Claude via the API (common in DevOps automation scripts), token costs add up. Stash uses FTS5 keyword search, so it returns only the matching records, not your entire knowledge base. A search across 500 runbook entries returns roughly 192 tokens of results on average, versus thousands of tokens if you sent the full knowledge base as context with every request. For automation use cases, this matters.
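
To see why keyword search keeps results small, here is a self-contained sketch using SQLite's FTS5 extension from Python's standard library. It illustrates the general technique only. The table layout, sample rows, and `search` helper are assumptions rather than Stash's actual schema, word counts stand in as a crude proxy for tokens, and the example needs a Python build whose SQLite includes FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table layout; requires SQLite compiled with FTS5.
conn.execute("CREATE VIRTUAL TABLE runbooks USING fts5(title, body)")
conn.executemany(
    "INSERT INTO runbooks (title, body) VALUES (?, ?)",
    [
        ("Redis auth propagation failure",
         "503s on services that use Redis, normal DB connections. Bounce affected pods."),
        ("Stripe webhook timeouts",
         "Webhook endpoint timing out on high-volume days. Check delivery failures."),
        ("Database connection exhaustion",
         "Connection pool saturated. Check pool size and long-running queries."),
    ],
)


def search(query: str, limit: int = 3) -> list[tuple[str, str]]:
    """Return only the rows matching the FTS5 query, best match first."""
    rows = conn.execute(
        "SELECT title, body FROM runbooks WHERE runbooks MATCH ? "
        "ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


def word_count(rows: list[tuple[str, str]]) -> int:
    """Crude size proxy: whitespace-separated words across title and body."""
    return sum(len(f"{title} {body}".split()) for title, body in rows)


hits = search("redis pods")  # space-separated terms are an implicit AND
all_rows = conn.execute("SELECT title, body FROM runbooks").fetchall()
print(f"matched {len(hits)} of {len(all_rows)} records")
print(f"~{word_count(hits)} words returned vs ~{word_count(all_rows)} for everything")
```

The gap widens as the collection grows: the matched set stays about the same size while the full knowledge base keeps getting larger.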

Setup takes two minutes

Stash is a remote MCP server — no install, no local setup. You add one URL to Claude Settings and you're done. Signups are free, self-serve, and use Google OAuth.

Get your connector URL

Free tier: 2,500 records · 50 queries/month. Pro: 100k records · 1k queries — £8/month.

No install. No app. Paste one URL and every conversation knows your stack.

Pricing may change as the service develops. Cancel anytime.