Troubleshooting Kubernetes with AI Agents
Using Claude Skills and an operator-in-the-loop pattern to triage production Kubernetes outages with general-purpose agents
- Randy Bias
- 11 min read

Introduction
Recently, my team has been engaged with Mirantis customers who are keen to use custom agents for troubleshooting production Kubernetes clusters. While a lot of significant work has gone into applying agents to software development, there is far less activity around using agents to help with operations. I think this area, “AIOps”, is underserved and not well understood. There are MCP servers that can be used, but applying AIOps to Kubernetes is more than just technology; it’s also about process, best practices, and deliberateness. Letting an agent loose on your production systems probably isn’t the best first step.
So what is the best first step? Right now, it seems evident that an operator-in-the-loop is a fundamental requirement of using agents with production systems. What would that even look like? Do we need a custom agent to troubleshoot production issues? What about general purpose agents such as Claude Code, Codex, or Goose?
In this article, I want to show you some early work my team has done with applying general purpose agents to the task of triaging Kubernetes clusters.
Some Context
Some of this work was inspired by a recent YouTube video, “Don’t Build Agents, Build Skills Instead,” by Barry Zhang and Mahesh Murag of Anthropic.
You can watch the video or use the outstanding YT summarization capability, but the tl;dr is simple:
Use a general-purpose agent such as Claude Code and provide it with a set of domain-specific “skills.”
These skills, in the form of Claude Skills, are essentially structured domain knowledge in Markdown format. Your general purpose agent can use these skills to attack specific problems. One part of what inspired me can be seen in the video right about the 8 minute mark where, in passing, they talk about supplementing skills with domain-specific tools in the form of MCP servers.
This feels like the 1-2 punch of turning general purpose agents into domain experts.

So what would this look like in practice?
Claude Code as Kubernetes Triage Expert
We used Claude Code[1] itself to help develop a proof-of-concept set of Claude Skills, called “k8s-troubleshooter,” that can assist with troubleshooting Kubernetes clusters. It can be found in the k8s4agents GitHub repo. This skill includes both domain-specific knowledge in Markdown format and some utility shell scripts. It also includes some initial support for MCP servers such as the kubernetes-mcp-server.
Importantly, these skills are designed to be primarily read-only: they provide troubleshooting and reporting to an operator rather than intervening autonomously in production systems.
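To make the read-only, operator-in-the-loop idea concrete, here is a minimal sketch of a gate that lets an agent run only read-style kubectl commands and routes everything else to a human. It is an illustration under assumptions, not code from the k8s-troubleshooter skill: the allowlist, the `is_read_only` and `gate` names, and the rule that the subcommand must directly follow `kubectl` are all hypothetical choices.

```python
import shlex

# Assumption: a hand-picked allowlist of kubectl subcommands that only read state.
READ_ONLY_SUBCOMMANDS = {
    "get", "describe", "logs", "top", "events",
    "explain", "api-resources", "api-versions", "version", "cluster-info",
}

# Characters that could chain or redirect commands if the string reached a shell.
SHELL_METACHARACTERS = set(";|&`$<>\n")


def is_read_only(command: str) -> bool:
    """Return True only for `kubectl <allowlisted-subcommand> ...` with no shell tricks."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return False
    # Conservative rule: the subcommand must come directly after `kubectl`,
    # so `kubectl -n prod get pods` is refused rather than parsed loosely.
    if len(tokens) < 2 or tokens[0] != "kubectl":
        return False
    return tokens[1] in READ_ONLY_SUBCOMMANDS


def gate(command: str) -> str:
    """Decide whether the agent may run a command or must ask the operator."""
    return "run" if is_read_only(command) else "ask-operator"
```

The gate errs on the side of refusing: anything it cannot classify confidently goes to the operator. In a real deployment, Kubernetes RBAC on the agent's credentials would be the stronger control, with a check like this as a second layer.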
So what can it do?
The k8s-troubleshooter’s core capabilities include:
- Perform an overall triage loop
- Assess k8s pods, looking at lifecycle and container issues
- Check basic service connectivity and DNS
- Evaluate storage (PVC/PV) and CSI drivers and their state
- Inspect network policies and CNI (Calico)
- Find potential Helm-related issues
- Diagnose node health and cluster-wide problems
- Wrap all of the above up in a comprehensive report when asked
This skill runs an initial triage script for production issues, which does the following:
| Action/Check | Description/Details |
|---|---|
| Captures evidence | Preserves cluster state before investigation (nodes, pods, events, optional cluster-info dump) |
| Checks control plane | Uses /readyz?verbose for component-level health status |
| Assesses blast radius | Classifies impact: single pod, namespace, multiple namespaces, or cluster-wide |
| Classifies symptoms | Detects crash loops, OOM, scheduling failures, DNS/network issues, storage problems |
| Recommends workflows | Provides specific diagnostic scripts and commands based on detected symptoms |
| Generates report | Creates markdown report with executive summary and text summary for quick reference |
| Output | Triage report with blast radius, symptoms, recommended next steps, and captured evidence |
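As a rough illustration of the blast-radius and symptom-classification rows above, the sketch below classifies a list of pod states in plain Python. It is not the skill's actual script: the `PodState` type, the reason-to-symptom mapping, and the 50% threshold for "cluster-wide" are assumptions made for the example, and it covers only symptoms visible in pod status (not DNS, network, or storage).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PodState:
    namespace: str
    name: str
    reason: str  # status as shown by `kubectl get pods`, e.g. "CrashLoopBackOff"


HEALTHY = {"Running", "Completed", "Succeeded"}

# Assumption: a rough mapping from pod status reasons to symptom classes.
SYMPTOMS = {
    "CrashLoopBackOff": "crash_loop",
    "OOMKilled": "oom",
    "Pending": "scheduling",
    "ImagePullBackOff": "image_pull",
    "ErrImagePull": "image_pull",
}


def blast_radius(pods: List[PodState], cluster_wide_ratio: float = 0.5) -> str:
    """Classify impact as none, single pod, namespace, multiple namespaces, or cluster-wide."""
    failing = [p for p in pods if p.reason not in HEALTHY]
    if not failing:
        return "none"
    if len(failing) == 1:
        return "single pod"
    failing_ns = {p.namespace for p in failing}
    if len(failing_ns) == 1:
        return "namespace"
    all_ns = {p.namespace for p in pods}
    # Assumption: call it cluster-wide once at least half of the namespaces are affected.
    if len(failing_ns) / len(all_ns) >= cluster_wide_ratio:
        return "cluster-wide"
    return "multiple namespaces"


def classify_symptoms(pods: List[PodState]) -> Dict[str, int]:
    """Count failing pods per symptom class; unmapped reasons are counted as 'unknown'."""
    failing = (p for p in pods if p.reason not in HEALTHY)
    return dict(Counter(SYMPTOMS.get(p.reason, "unknown") for p in failing))
```

Keeping the classification deterministic means the agent starts from the same summary every time and spends its reasoning on the deeper per-pod investigation.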
Claude uses this report as guidance for its next steps: it runs the other scripts, issues kubectl commands directly, and “deep dives” into each issue so that it can perform an initial root cause analysis.
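One way to picture how a triage result turns into concrete next steps is a simple lookup from symptom to read-only kubectl commands. This is a sketch under assumptions rather than the skill's real logic: the symptom names and the `next_steps` function are hypothetical, and only standard kubectl subcommands are used (`kubectl top` additionally requires metrics-server in the cluster).

```python
from typing import Dict, List

# Assumption: illustrative symptom names mapped to read-only kubectl command templates.
NEXT_STEPS: Dict[str, List[str]] = {
    "crash_loop": [
        "kubectl describe pod {pod} -n {ns}",
        "kubectl logs {pod} -n {ns} --previous",
    ],
    "oom": [
        "kubectl describe pod {pod} -n {ns}",
        "kubectl top pod {pod} -n {ns}",
    ],
    "scheduling": [
        "kubectl describe pod {pod} -n {ns}",
        "kubectl get events -n {ns} --sort-by=.lastTimestamp",
        "kubectl describe nodes",
    ],
}

# Fallback when the symptom is not recognized.
DEFAULT_STEPS = ["kubectl describe pod {pod} -n {ns}"]


def next_steps(symptom: str, ns: str, pod: str) -> List[str]:
    """Return the suggested diagnostic commands for one failing pod."""
    templates = NEXT_STEPS.get(symptom, DEFAULT_STEPS)
    return [t.format(ns=ns, pod=pod) for t in templates]


if __name__ == "__main__":
    for cmd in next_steps("crash_loop", "shop", "web-0"):
        print(cmd)
```

The table only seeds the investigation; the agent is still free to follow up with other commands based on what the output shows.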
The early returns are quite interesting. I have a local kind cluster and I’ve induced a complicated set of interrelated failures. Can Claude Code do a reasonable root cause analysis?
Let’s find out!
“Production” Setup & Failure Situation
In our production failure scenario[2], we have multiple pods in different states with different failure conditions, as follows:
| |
Granted, this is a contrived scenario, but let’s see how it goes. Please note that I’m using completely different Claude Code instances to set up the production failure scenario and to do the troubleshooting, so that there is no shared context.
Claude Code
We pass the following prompt to Claude Code:
| |
For the sake of brevity, I’ll omit parts of the process that Claude Code follows, but I’ll post a full dump here.
Step 1: Run the Triage Script
The first thing that Claude does is run the triage script that we have created:
| |
Step 2: Assess the Triage Script Results
The full script results can be found here, but the key is at the end where it gives Claude hints about what to do next:
| |
Step 3: Pod Diagnostics & Triage Process
At this point, Claude uses the recommended next steps to run pod-by-pod diagnostics. It also runs a number of kubectl commands to gather additional information about each pod failure.
| |
Step 4: Root Cause Analysis & Remediation Steps
For brevity, I will redact some of the output, but you can find the complete report here.
| |
This is interesting: Claude appears to have done a fairly good job of root cause analysis. I would very much like to see someone apply this to a real production incident and provide feedback on how it does.
A Brief Aside on MCP Servers
I want to touch briefly on MCP servers, which we do not use here. Given that Claude Skills can do quite a bit, when does it make sense to use MCP servers? This is a question we are trying to understand ourselves. Right now there seems to be a tendency to build MCP servers that are simply an alternative to existing APIs (REST, etc.), which isn’t terribly interesting, to be honest. It also creates a risk of context rot. Most agents are more than capable of calling an existing API without the need for MCP. I think we see this because MCP is so new and the patterns for using it aren’t clear yet. I talk about this a little bit in my recent article on The New Stack: What is ‘AI Native’ and Why is MCP Key?
From my perspective, here are some key things that an MCP server can do for us when talking about operations:
- Limited, read-only, but root-level access to a running system, letting a lower-privileged agent read data for triage that it normally could not
- Multi-step workflows; for example, an MCP server could replace the scripts in the k8s-troubleshooter skill, saving context and providing richer tools
- Event-based subscriptions to have agents get notified about problems in production
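The event-based subscription idea in the last item can be sketched as a small publish/subscribe loop. This is plain Python, not an MCP SDK or any real MCP server API: the `EventBus` class and the event dictionary shape are hypothetical, chosen only to show how a handler (standing in for "wake up an agent") could be triggered by Kubernetes Warning events with a given reason.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Assumption: a simplified event shape loosely modeled on Kubernetes events.
Event = Dict[str, str]
Handler = Callable[[Event], None]


class EventBus:
    """Route Warning events to handlers subscribed by event reason."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, reason: str, handler: Handler) -> None:
        """Register a handler for one event reason, e.g. "FailedScheduling"."""
        self._subscribers[reason].append(handler)

    def publish(self, event: Event) -> int:
        """Deliver a Warning event to its subscribers; return how many were notified."""
        if event.get("type") != "Warning":
            return 0  # Normal events are ignored in this sketch
        handlers = self._subscribers.get(event.get("reason", ""), [])
        for handler in handlers:
            handler(event)
        return len(handlers)


if __name__ == "__main__":
    triage_queue: List[Event] = []  # stand-in for handing the event to an agent
    bus = EventBus()
    bus.subscribe("FailedScheduling", triage_queue.append)
    bus.publish({"type": "Warning", "reason": "FailedScheduling",
                 "namespace": "shop", "object": "pod/web-0"})
    print(len(triage_queue))
```

The point of the pattern is that the agent does not poll the cluster; it is handed a specific event and can start triage with that context.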
We are going to take a deeper look at using Claude Skills with MCP servers in a follow-on article. I think there is a particular advantage in the third item above, event-based subscriptions, that we can examine through the lens of 24x7 production operations.
Conclusion
While this is just a proof of concept, I think it does reinforce the direction that Anthropic (and others) seem to be headed, which is to provide domain expertise to general purpose agents. Many of these general purpose agents have had some fairly significant work done on them to help them reason more effectively, to avoid loops, and to manage the LLM interaction more effectively. If you build a custom agent, you don’t have the benefits of the work that was done in this regard.
Importantly, there may be a tendency to encode your own bias and deterministic logic within a custom agent. Perhaps this is good, but only time will tell. If, instead, you are building domain skills and domain tools, you can iterate on the skills and tools for your specific business context and ride the innovation curve of those who are building and tuning general purpose agents[3].
It looks more and more to me as though the right ‘AI-native’ pattern will be the use of general-purpose agents turned into domain experts. I strongly recommend folks take a look at this direction rather than building their own custom agents. In a follow-on article, we will see what this might look like for triggering agents to respond to production issues.
[1] Note that just recently, OpenAI announced experimental support for skills.
[2] Interestingly, I performed a first test without specifying that it was “production,” and, per the Skill, Claude Code took a different approach: it did not run the triage script, instead quickly running some kubectl commands and reporting back to the operator almost immediately. This was interesting in that Claude seems to adhere well to the Skill instructions.
[3] Check out this interesting video from Nik Pash at Cline: Hard Won Lessons from Building Effective AI Coding Agents.