Claude Code for SREs

Kushal Thakkar (Anthropic) gives a live look at how Claude Code handles real-world incident response and debugging.

  • Kushal Thakkar
    Kushal Thakkar, Member of Technical Staff, Anthropic
The transcript below has been generated using AI and may not fully match the audio.
Hi, I'm Kushal. I work at Anthropic as a member of technical staff. Let me quickly introduce myself and explain why I'm the one giving this talk. I come from a traditional software engineering background, and throughout my career I've built large-scale systems. At Meta I was an incident commander, working with some of the largest clusters, including one of the largest Spark clusters in the world. Now I'm building Claude Code at Anthropic. That background gives me a unique perspective on using AI to automate some of the core SRE workflows.
The 76% number here is actually old, from last year. It says 76% of developers, or SWEs, are using AI to build software. But SRE staffing is staying at one-to-ten, which is the traditional ratio. That creates a bottleneck: we're building a lot more software and development velocity is increasing, but operational support doesn't increase with it. We can't hire our way out of this; we need to automate using AI, with Claude Code. So, as I said earlier, I'll provide a blueprint for automating some of the core SRE workflows.
Anthropic has a product called Claude Code, which is an agentic coding product, and I'll talk about a few of its features that will help us build some of those workflows. First, quickly, what is Claude Code? For people who don't know it, it's a bare-metal CLI interface for Claude models. It is intentionally bare metal; we don't want to be very opinionated here. Model capabilities are moving very rapidly, so the idea is that the product keeps improving as the model improves, rather than us having to change the product each time the model improves.
It's also terminal-native, so it's perfect for SREs, who live in terminals. It integrates with all the existing tools rather than replacing them, so it has access to all the tools that any engineer, SWE or SRE, has. Finally, it has a permission-based architecture, so humans maintain control of actions, at least for now.
Now the key features of Claude Code that will be helpful in building the workflows later in the talk. First is the SDK. The SDK is a non-interactive version of Claude Code that you can use for automation. You can use it as a Unix utility, or you can integrate it into existing Python or TypeScript workflows.
We also support subagents. Subagents provide specialized capabilities with granular tool control, so you can have, say, a monitoring subagent or a debugger subagent for complex tasks.
We also support MCP integration. Claude Code has access to Bash, and through Bash it has access to a lot of tools, but there are tools that are not accessible through Bash, so we augment that with MCP support. MCP is the Model Context Protocol; I think a lot of you know about it by now.
Hooks let you customize behavior at different points in the Claude Code lifecycle. You can add shell commands at different points, for example before tool use or after tool use, and that helps with several things I'll talk about later.
We also have GitHub Actions, which you can use to build GitHub workflows for different scenarios. Together, these features help you build a comprehensive SRE automation platform.
Now let's dive into specific workflows. First is incident response. Everyone knows the pain of incident response: an alert wakes you up at 3:00 AM and production is down.
The first thing you think is, oh man, I don't want to do this, but you've got to do it. Then you go and look at logs, grep logs, look at dashboards, run queries, and so on. You try to correlate things, and it takes some time to actually know what is going on. So this is a very critical use case.
The traditional flow is as I explained: the alert wakes you up, you as a human investigate, you find a fix, and you do something about it. With Claude Code, the idea is to have Claude Code as the first line of defense and have it do the initial analysis. Whatever you would be doing in that first hour, hopefully Claude Code can help automate that part and give you an initial analysis. From there, either Claude Code proposes an action and you act on it, or you decide you want to investigate more.
The Claude Code SDK is perfect for this. You can inject an SDK call into your alert handlers, and since this is a non-interactive workflow, it's easy to integrate AI into your workflows. The prompt here is very simple, but obviously the prompt is where the magic happens. The integration part becomes very simple.
How will it work? To give an example, Claude Code will do the same things a human would do. It will fetch metrics or logs from Datadog, or run GCP commands with gcloud, using the Bash tool. It can look at PRs or commits using git. So it takes the same actions and tries to correlate the data points the way a human would, and then it decides, okay, I need to take an action. At that point it can ask the user or the on-call for permission if the action is something like scaling the cluster up or down, or it can create a PR. It can also decide to robo-dial the on-call to say, here is an action, take it. The idea is to have faster MTTR, mean time to remediation, and hopefully less on-call fatigue.
Moving to the other workflow, monitoring. Say you're creating a new service. It takes a lot of configs to set up monitoring, and usually people just create the service and never create the alerts. That happens all the time. I've done it, so I don't blame other people. It needs a lot of things: Prometheus configs, Grafana alerts, documentation, things like that. I think this is ripe for automation, where you can use Claude Code to do all of that. It can analyze the service code to understand which key metrics could be useful, it can create Grafana configs and Prometheus configs, and it can also add on-call docs.
The idea is to use GitHub Actions for this, the feature I mentioned earlier. Whenever someone raises a PR for a new service, the action can see that this is a new service and decide to add configs for new alerts, or documentation for those alerts. And you can do this not just for a new service; you can do it whenever the service code changes as well.
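As a rough illustration of the PR-triggered step described above, the following Python sketch decides which services touched by a PR are new and builds a non-interactive Claude Code invocation for each. The `services/<name>/` repository layout, the function names, and the prompt wording are all assumptions made for this example and do not come from the talk; `claude -p` is the CLI's non-interactive print mode.

```python
from pathlib import PurePosixPath

# Hypothetical repository layout: every service lives in services/<name>/...
SERVICES_ROOT = "services"


def new_services(changed_paths, existing_services):
    """Return sorted names of services touched by the PR that are not
    already present on the base branch."""
    touched = set()
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        # Only count files inside services/<name>/, not files at the root
        # of services/ itself.
        if len(parts) >= 3 and parts[0] == SERVICES_ROOT:
            touched.add(parts[1])
    return sorted(touched - set(existing_services))


def monitoring_command(service):
    """Build the argv for a non-interactive Claude Code run.

    The prompt text is illustrative only; a real prompt would describe
    your own monitoring and alerting conventions.
    """
    prompt = (
        f"A new service was added under {SERVICES_ROOT}/{service}/. "
        "Analyze its code, identify the key metrics, and propose "
        "alert configs and on-call documentation in this PR."
    )
    return ["claude", "-p", prompt]
```

In a GitHub Actions job, `changed_paths` could come from `git diff --name-only` against the base branch and `existing_services` from listing `services/` on the base branch; each returned command could then be executed with `subprocess.run`.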
That partly automates a lot of the work you have to do, and it also pushes the author to think about alerts and monitoring from the beginning rather than after an incident happens.
Subagents are also a very good feature for this kind of use case. You can have specialized subagents with specialized prompts. Monitoring requires a very particular set of skills, and you can prompt the LLM with: you are a monitoring agent, and you should look for these kinds of things in the code to decide on the metrics, or to decide what kind of documentation to write. You can give this agent a lot more context about your own monitoring and alerting setup, and that can speed up or improve the process overall. So, as I said, this is how you can onboard new services faster and also reduce monitoring gaps.
Now let's talk about automation, which is one of the biggest core SRE expectations: automate stuff. One part here is security and compliance. People are using AI a lot these days, so how do you make sure the actions they take are compliant with what you want? We have a feature called hooks, which you can run pre-tool-use or post-tool-use. In pre-tool-use, you can check for dangerous commands: you can check whether a person or an automation is running dangerous commands, or commands that you don't want run automatically through AI, or even by humans. Hooks also support post-tool-use.
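A minimal sketch of the pre-tool-use check described above, written in Python. It assumes the hook receives a JSON event on standard input with `tool_name` and `tool_input.command` fields, and that returning exit code 2 blocks the tool call with the stderr text passed back as the reason; check both assumptions against the current hooks documentation. The deny-list and the function names are hypothetical.

```python
import json
import re

# Hypothetical deny-list; replace it with your organization's policy.
DENY_PATTERNS = [
    # rm with both a recursive and a force flag, in either order (-rf, -fr)
    (re.compile(r"\brm\s+-(?=\w*r)(?=\w*f)\w+", re.IGNORECASE),
     "recursive force delete"),
    (re.compile(r"\bkubectl\s+delete\b"), "kubectl delete"),
    (re.compile(r"\bterraform\s+destroy\b"), "terraform destroy"),
]


def find_violation(command):
    """Return a short reason if the command matches the deny-list, else None."""
    for pattern, reason in DENY_PATTERNS:
        if pattern.search(command):
            return reason
    return None


def main(stdin, stderr):
    """Read one hook event from stdin and return the exit code to use.

    0 lets the tool call proceed; 2 is assumed to block it.
    """
    event = json.load(stdin)
    if event.get("tool_name") != "Bash":
        return 0  # this sketch only inspects shell commands
    command = event.get("tool_input", {}).get("command", "")
    reason = find_violation(command)
    if reason is None:
        return 0
    print(f"Blocked by policy: {reason}", file=stderr)
    return 2
```

To run this as a hook script, it would end with `sys.exit(main(sys.stdin, sys.stderr))` (with `import sys` at the top) and be registered as a pre-tool-use hook for the Bash tool in the Claude Code settings. A regex deny-list is easy to bypass, so it is best treated as one layer alongside permissions rather than a complete control.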
With post-tool-use hooks, you can write logging: you can log all commands that were run from any machine or by any automation, and use that for audit logging in the future. That helps with compliance as well.
Automation includes audits as well. There are many kinds of security audits that we do, and a lot of those tasks can be automated by Claude Code. These are some example tasks. I know every organization has different kinds of automation tasks, and I don't want to be prescriptive here, but this is something to take away: all of these can be made faster or more thorough than having humans do the whole process.
Infrastructure management is a big time sink for a lot of SREs, and Claude Code can help with that as well. It can scale up and scale down as needed, manage configurations, do automated updates, runbook automation, things like that. I think the SDK is perfect for these automation use cases because you can run it as a Unix utility or in your Python automation. Subagents are also very useful in this context, because you want to do very specialized tasks and you can offload some of them to subagents for better control. The outcome here is reduced toil, which is what automation helps with.
So, the key takeaways. Let me summarize. MCP and the Bash tool provide universal connectivity to all your existing tools. The SDK enables you to automate your workflows programmatically. Subagents help with specialization and also tool control: you can decide to give certain tools to a subagent and restrict others, which makes sure that the subagent only uses the tools you want it to use.
Hooks help with compliance and auditing, as I explained earlier. Most importantly, you maintain control over permissions and final actions, at least for the most critical operations.
Now let me talk about what doesn't come out of the box with Claude Code, which is the kind of work we might have to do. This is the new kind of engineering work we're doing with the emergence of LLMs. First is permission engineering. You don't want to over-permission or under-permission; we want to balance security against functionality. If you send too many permission requests to the human, it's not very secure, because people start accepting all of them. At the same time, we don't want to over-permission and give the agent every permission, because we want to be careful about what we give it control over.
Another is prompt engineering. This is very important, and I think it's generally underappreciated. Good prompts make a huge difference in how well the AI performs, and having specific prompts about what it should and should not do is very important. The other important part is feedback loops. Once you put something out, you have to go and look at the feedback, understand what's working and what's not, and optimize the prompt based on that. That's super important and very underappreciated.
The other is regular engineering work, like integration with webhooks or APIs. We also need to codify a lot of organizational knowledge, which humans have but the AI doesn't, depending on what MCP tools you have. So a lot of the time you'll have to codify that in either docs or Claude Code files.
So, getting started is simple. This talk was short.
I didn't want to spend too much time. Getting started is simple: it's a single npm command, and documentation is available on the Anthropic documentation page. Thank you.

San Francisco 2025 Sessions