Keynote: Humans, machines, and the future of incident response

Stephen Whitworth, incident.io CEO and Co-founder, kicks off SEV0 London 2025 with an opening keynote focusing on the future of incident management in an AI-first world.

  • Stephen Whitworth
    Stephen Whitworth, Co-founder and CEO, incident.io
The transcript below has been generated using AI and may not fully match the audio.
Hey, morning gang. As Tom said, thank you for braving an AWS outage, the pouring rain and a broken Northern line. I can't believe it's been down since Sunday, but we are managing. So I wanted to take a few minutes to just set the scene. I'm gonna be talking a little bit about the company we started four years ago, but I'm also gonna be talking about the future of incident response, 'cause I think those things are intimately linked.

So we had our first SEV0 a year ago in San Francisco. San Francisco is where a lot of our customers are based. You can tell by the accent, London is home for me, and ultimately, as soon as we could, we wanted to do a London SEV0 as well. So very glad to be able to do that today.

So over the past year with the company, we've really picked up momentum, I think. We're now around 1,300, 1,400 incident customers, and this just continues to grow rapidly every month.

And we've helped those customers resolve nearly half a million real incidents on our platform, which is crazy. If I think about the stress of a single incident on Monday, us being able to do half a million of them, to, I guess, help ease some of that collective stress, makes me feel great.

And then we've helped serve over 750 million status page views to our customers' status pages. So we host the status pages of OpenAI, Square, Intercom, all of these amazing businesses. When things go wrong, a sort of thundering herd of people want to know why things have happened, and we've ended up powering a huge amount of that, which has been fantastic.

And as a result of all of that momentum, we've ended up with 82% of our customers not just using one product that we have, but using two or more of our products. So it's really great validation. We started with one when it was just me, Pete, and Chris around the kitchen table.

We then built our status pages product.
We then built On-call, and now that's resulted in the average customer using multiple products.

As a reflection of that momentum, you may have seen the news earlier this year that we raised a Series B. Ultimately, that meant that we raised $62 million. It brings our total amount of money raised to about a hundred million. And fun fact, we actually have the Chainsmokers as investors in our company. They have fingers in many pies, and one of those is a tech venture capital firm. And if you use incident.io On-call, you can decide to be woken up by the dulcet tones of the Chainsmokers, who have contributed some ringtones to that.

I'm an engineer. I know that many of you do not actually care about fundraising. Like, what is the point of us raising money and what does it all mean for us? It's just pure. It helps us build more and it helps us build faster. And if I think about some of the stuff we built over the last year, I just wanted to pick out a few favorite snippets of what we'd done.

So first, Scribe. I'd say this is one of my personal favorites. Hands up who has ever been in an incident where there is some kind of tired engineer that is listening to the call and somewhat haphazardly taking notes in a Google Doc. Great. You don't have to do that anymore. So Scribe will transcribe all of your calls in Google Meet or Zoom. It will take transcripts of that, which is great. I'm imagining a bunch of people here use AI notetakers, but the extra magic on top of that is that it will listen for what seem like key moments in the incident and automatically summarize and post them for you. And yeah, I thought it was a great example of: you could do the basic version of it, but the actual magic comes from the little bits that you sprinkle on top.

We then have @incident.
The way that you interacted with our product over the last few years was, you'd type a slash command, and then up would pop a form. That's great, but we are getting more and more used to just speaking to software now. So you can literally just talk to @incident. You can say, "@incident, draft me an update and let me know if there's any follow-ups that I've missed that I should add to Linear," and it will go do all of that stuff, which is amazing.

Pete's gonna demo it later. We have taken the brave slash stupid idea of doing a 45-minute live demo in production. Pray for Pete. But we'll be able to show you a lot of that later.

We then have teams as a first-class concept. So in incidents, a lot of the things that matter are very team scoped. Teams are on call for a set of services, or teams have a workload of reactive work applied to them, and you often wanna cut and slice by that. This didn't really exist in many incident platforms. We built it and it's now natively wedded through.

And we now integrate with ServiceNow. So for the folks that have the pleasure of using that directly, we can now give you a nice way to integrate, build all of that stuff, and then sync it back to everything in ServiceNow as well.

So we didn't stop there. There's tons more. We have a changelog, so if you go to incident.io/changelog, we have posted every week for the past four years about all the stuff we built in the previous week. So you can go peruse and have a look, but I can't fit it all on the slides.

And that means that we ended up doing a lot of deployments. So we have shipped about 16,500 changes this year. If you do the maths, that is many hundreds per day, and as a result it means that we are constantly improving and improving.

And like really when we started the company, our kind of.
The feeling of software that we used was that we couldn't influence the software that we ended up paying for, and it felt stodgy and it never really changed. We wanted to build a company that was around velocity of changes, so you should be able to see rapid improvement, but we also wanted you to be able to influence the products that you built. So that means engineers in Slack channels with our customers, talking to us directly, helping us build a better product. And I think that's great validation of that.

But why are we even building this company in the first place? Me, Pete, and Chris and others are probably gonna spend 10 or 15 years of our lives doing this. Like, why this? Our background is as engineers. We were all engineers at Monzo, and as a result, we were on call for some of the most critical financial infrastructure in the UK. And as a result, we knew how bad the status quo for incident response was. PagerDuty wakes you up, and then you get thrust to the front of the stage, and then you had to glue together 15 different tools manually.

And my co-founder got so frustrated about it that he snuck away from his kids' swimming lessons to hack around on tooling on the side. And he ended up building Monzo Response, which was the internal tool that people used to respond to incidents. That then got open sourced. It then got picked up by companies like Reddit.

And as a result it was a marriage of: I know this problem has always sucked. Chris has built a much better way. Some people seem interested in this. Let's go and put a company behind it and imagine what it could be if it was more than Chris at his kids' swimming lessons.

We built the entire company, and four years later, it's now the market leader for what we do. So we are one of the largest, we're the fastest growing. And if I think about why, like, why is that?
Ultimately, I'd say we're reasonably humble in that no one really wants to use our product every single day. It's not like a Figma, where you're a designer, you open your laptop, nine hours of Figma, close your laptop. This is a reactive tool that is helping you in times of stress, and ultimately the kind of stress that you go through is really just all about achieving reliability and resiliency. And really, running a great incident process is a means to an end to achieve that.

And a year ago I stood up and gave this keynote at our SF conference, and I looked through those slides as I was writing the slides for this one, and I noticed that I didn't really talk very much about AI. There were some sprinkles here and there, and it's, oh, we can summarize some channels for you. But it didn't really feel like the world had shifted, and I think honestly it has for us.

I'm gonna spend a bit of today talking about why I think that is. And if I think about it, one is it helps humans think ahead on things. It removes a huge amount of the rote work that people have to do. The tired engineer at 2:00 AM writing down notes in a Google Doc: if they didn't have to do that, what could they be focused on instead? I think with AI, it also means that you can do a much better job of diagnosing what's actually happened, because you're able to wire intelligence into these applications. It can do an amazing job of that. And I think, honestly, a lot of incident work is just drudge work that's not very fun, and the goal would be to take a huge amount of that off your plate so you just don't have to do it in the first place. And we'll walk through a bit of how I think that happens.

Ultimately, some of these processes can take hours. So if we take postmortems, incident debriefs, whatever you choose to call them, that is a pretty painful process.
If you're not using a product like ours, it can look like a CSI-style reconstitution of 15 different tools, and you build a timeline, and ultimately it just is really hard. And if it's hard and it takes a long time... Humans don't like doing hard things that take a long time, right? They just skip it and they go on to the next thing, and really that's your opportunity for: you burnt your fingers, how do we stop it happening again? And I think that is now massively easier with some of the advances in AI.

You can take the central context of an incident. You can then draft, according to the types of postmortems that you want to write, a lot of the drudge stuff that is just reporting on what actually happened. We're not trying to replace the human judgment that should go into what we think we should do. We're just trying to take away the timeline, all of the stuff that requires you to log into all of these different tools. So again, this is not an aspirational vision. This is not slideware. Pete will spend about 15, 20 minutes of that going through what we built in our postmortems product, and we can't wait to release it.

So AI is not a new word. It's been around for 40 years at this point. And with LLMs, definitely, ChatGPT is about three years old at this point. Like, why didn't I talk about this before? And I think this is not the first time that someone like me has stood up and talked to a bunch of people like you about automating large parts of incident response.

There was an acquisition of a tool called Rundeck by PagerDuty, and I think they were a hundred percent on the right track here. So what Rundeck did was about giving teams a self-service way to diagnose and remediate incidents. It was script based. So you'd say, hey, if I saw this error code, SSH into this Cassandra node, run this thing, wait for an exit code. And that's the right track.
But I think it failed really to get the traction that it should have done for the problem it was solving. And I think the reason for that is, determinism is not the answer to this stuff. You are all very smart people. We cannot replace you with a shell script or a small set of shell scripts. And as a result, I just don't think you can if-this-then-that your way to solving the lots of problems that you folks do on a daily basis.

And I think the key advance is that AI can help us wire in a lot of context that was missing in these tools before. It can also help wire in intelligence, so the ability to take nuanced decisions based off of information. And to me, adaptability is the key thing here. Can we do different things that haven't been predetermined, based off of the information that we have? And I think we can now, or at least we're starting to be able to, and that makes a really big impact in terms of what we can actually achieve with the products that we have.

Underneath that, we have seen this ever-improving performance in the capability of LLMs. You have, I think, Maths Olympiad medals being won. You have Claude Code taking over a huge amount of our engineering team. And ultimately Dario from Anthropic loves to give a spicy take, and I think this was one of the things that got people really angry a number of years ago.

So I wanted to do a quick pulse check of the audience. Who here agrees with Dario? Dozens of you. Who disagrees with Dario? Great. Who doesn't care?

I honestly think he is closer to being right than he is to being wrong, if I think about the trajectory of how things are going. And if I just stop extrapolating to the world and I just go look at what is happening at our company, at incident, how are people using these tools? This is our usage of Claude by month. So if I roll back an entire year, we weren't even really using it at all.
January, February, a hundred million tokens between friends, what's going on there? And then now we have ramped up massively to a huge amount of our engineering team, and nearly 15 billion tokens a month.

And if I think about the average workflow of an engineer a couple of years ago, it was like, oh, some people use Vim, some people use Emacs, some people use static analysis. But it wasn't that different. If I then think about the difference now, one person is using Vim, and then, I dunno if Rory Bain is in the room today, one of our engineers, he has four different parallel Claude Codes open, which he then speaks to through Wispr Flow and tells them what to do, and then goes to make a tea and comes back in a few minutes. It's a huge shift in how people are actually using these tools.

And as a result, I've instituted this kind of unlimited budget internally for people to go spend on AI things. And this is not wanton, wasteful use of cash. It is dramatically accelerating not all the work, but big chunks of work that we are doing in our engineering team. So I think Dario is not all the way right, but closer to being right than wrong.

And with technological change, there is no free lunch. Stuff gets harder, stuff gets easier. The key shift that I think we're seeing is software as artisanal craft, hand baked, is now moving to software as commodity. It's not all the way there, but it is now being commoditized by machines, and that introduces some new risks that I just didn't think you had to worry about if everything is being written by humans.

So first, if you don't make the argument that machine-written code is better or worse, you just say it's the same as humans.
What happens is you just have a lot more of it, because you can run these things in parallel, and that means on average you have systems that are more complicated, because more code generally equals more complexity. And then more code equals less context, if you divide code by the number of engineers you have in your organization. And that means that ultimately I think this is worse for software reliability than it is easier.

I'm obviously talking my own book a bit here, but if you go look at the DORA metrics report that was released by Google recently, they essentially have a set of things that are the impact of AI-assisted engineering. And one of the things is we are able to build a lot more, and then one of the negative impacts is software instability and resiliency. So I think empirically it's being borne out in practice.

And this leads me on to, I think, what is becoming the fastest growing new role in technology. Get your phones out, be ready to update your LinkedIns, because you are all now vibe coding cleanup specialists. And I think this is obviously not a real role, but I think it taps into a bit of the temperature in the industry of how people are feeling. And there are people on LinkedIn that will call themselves this. This is not me making it up.

And the serious application of this is, if I take the average AI and engineering tool, all of that is going towards building software faster. Very little of it is going towards helping operate that software in production more reliably and more resiliently. And I think the dam has been broken at this point. There's not a way to say, okay, roll it back, stop it happening. It's just going to happen. And as a result, the only way I think you can combat this is using AI to fight AI, I guess, in this analogy.
This has led to the formation of a new category called AI SREs. And I don't think that is attempting to replace all of the work that SREs do, but it is more a description of, I think, the reliability, resiliency, and diagnosis stuff. We think it's very interesting. We have reorganized the company around this. I feel like we're very well placed to go lead this, and I'll talk through it in a second.

We've reorged the company. So we have about a third of our product development teams that are working on AI SRE, which is one of our products that I'll talk about in a minute. And then about half the organization is building agents of some capacity, be that Scribe, be that AI SRE. And I think it's because of the advance that we can have in some of the intelligence that we're able to do.

So we unveiled AI SRE in July, and this is focused on diagnosing issues, providing a highly accurate root cause, and in some circumstances helping take action to help you remediate the issue. And we are not trying to be 5% better at running incidents here. Our goal is a step change in reliability and resiliency for your organization. And speaking of hitting the snooze button, I would love to show you what the life is of someone that is moving to AI SRE.

There we go. I think my instructions when we were making the ad were, the first bit should make you sweat. And I feel like, PTSD, maybe it's from Monday, maybe it's from other times. But yeah, this is a journey that we are on, and we're making really big advances, and I'm excited to show you in just a couple of hours. Pete is gonna be spending 45 minutes doing it.

We did this in SF a month ago. That was the week after, I don't know if anyone saw, the Meta demo with the Ray-Ban glasses. That was, I think, a couple of days afterwards. So Pete was like, okay, this really better work.

The reason that we wanna show you a live demo is it's
Easy to make a 90% accurate demo that falls over the moment that you blow on it. And as a result, we want to show real products, actually working, ask it questions, not vaporware. So the only way to do that is just to show you live, we think.

So I guess if you asked me a year ago, how do I see incident response, this would be my answer. We have people running through code and telemetry and incidents, because what else are we gonna do? It's the way we've always done it. Humans will run the incident and they will use incident to do it, and you still had to keep an eye on a bunch of different things. I guess we were a control panel, but you still had to drive it.

I think that we were better than the status quo, but not 10x better. We were definitely a better alternative to PagerDuty and manual process, and I think that's propelled us to where we are today. But right now, I feel like we're at the base camp of what's achievable, and we can go ship a ton more that will change things.

So with what we're building, we're integrating AI throughout our entire product and process. And we're not doing this because it makes our valuation better or it's a better story for investors. To us, there's just huge amounts of work that we had to go do that is rote, administrative stuff. It can be done better, faster, and humans can go focus on stuff that's higher leverage.

So if I think about diagnosing, telemetry, trying to figure out what's going wrong, digging through past incidents: now AI SRE will search all of your deployments, all of your telemetry, every incident you ran in the past, and essentially in two minutes, not necessarily solve the problem for you, but give you a steer in the right direction. To me, this is the first 15 minutes of an incident that we can try and condense down to just a couple of minutes.
Instead of you having to, bleary-eyed, fix this nil pointer exception and write tests for it, if we think we can do a good job, we'll just give you a pull request. If Stripe is down, we are not going to build you an entirely new billing provider or anything like that. But if we think we can fix it, we'll do it for you and save you time.

And then Scribe. So this is, again, the tired engineer at 2:00 AM haphazardly taking notes. This can just sit in the call for you, and that means that you can focus entirely on what you need to do to solve the incident.

Instead of you manually telling people to do the right thing and shoving them down the right routes, a classic one here is we have a part of our product called a custom field, where you can say, hey, these customers are affected, or this is how much revenue is affected. If we can work that out for you, so say someone on a Scribe call mentioned that Netflix was affected by this incident, we'll just pop up and give you a suggestion of, hey, do you want us to go update that for you? So just these little incremental things that add up over time.

And then postmortems. Again, I think this is one of the most magical things we've shipped since Scribe. And this is being able to draft a postmortem in seconds and ask questions to dive deeper into your infrastructure, your response process, and learn a lot more.

So this is what our platform looks like now. We now have four products. We have On-call, to wake you up; Response, your incident command center; AI SRE, your compatriot to help you go resolve incidents; and then Status Pages, a way to tell your customers about things.

With the platform that we have now, we are committing to this set of incredibly ambitious goals. So we are going for a step change in reliability here, not three to 5% better.
And what I really want to go do is, I think you can do a huge chunk of downtime reduction if you can nail this product. Nailing the product is the hard bit of it. But that's our job. We'll go do that. And I think you can do a huge amount of elimination of alert fatigue, because with us, we see all of the alerts, we see how you respond to them, we understand what you need to do to go fix them. And we think we can do an amazing job. And ultimately, a lot of people just wanna build stuff here. They don't wanna be woken up at 2:00 AM to have to deal with the impact of things.

So we are not alone in this vision. Lots of companies are going to build AI SREs. Like, why us? I think ultimately, foundation model providers like Anthropic and OpenAI are going for, sort of, models can do everything. You don't need applications to be able to go do this. And then you have APM platforms like Datadog, where they've got your logs and your traces. Like, why little old incident? Why should we be able to do it?

I think there's two key reasons. Coming from a bank, it was just so obvious to us that technology is only a part of the incident. There is so much stuff outside of that that you need to do to run an incident well. So whether this is communication with your customer support team as they're dealing with the issues, execs that are trying to make multimillion-dollar decisions in just a couple of minutes, or instituting improvements after the incident's been resolved, you really need a strong process to be able to do this.

And I think of doing an incident well as two things. One is run the process of an incident, and the second is fix the problem. And you can't just do one of these things. I think AI SREs are focused on helping you solve the problem, but they need a process to exist within to do that. And I think that splitting these apart is a kind of artificial distinction and is not super helpful.
And I think they're better together. And with us, you can have them better together.

And ultimately, the whole process of running an incident, to us, is just a big, giant feedback loop, and it's the same for humans as it is for machines. How you respond to incidents is unique to your organization. So if I take an engineer that's just joined your company, I think you'd say it's a pretty safe bet that engineer's gonna be better at responding to incidents in month six than they are in week two, right? They've just absorbed more context. They know what that piece of code does, how it interacts with this random system over here. And we see the same dynamic in practice with AI SRE, our product, which is: more incidents run and resolved equals better outcomes and faster resolution. So to me, it's not just about more logs and more traces. It's about integrating all of the stuff together to be able to understand, in your incident response platform, how you deal with things.

So I'm gonna go all McKinsey two-by-two on you for a second. I apologize for that. This is how I would split the market. So there's a load of people going after it. Who's doing what, and how should I think about them?

One is autonomous agents. So these are like Resolve AI, Traversal, Bits AI from Datadog, that are very focused on, of the two things you need to do in an incident, this one: detect the problem. And then you have incident management, which is the first part of it, which is run the process of the incident really well. So companies like Resolve, they're not interested in doing incident management. They want to be an agent that helps you understand production better. You have the Datadogs of the world that are doing a little bit of both, PagerDuty as well.

We then have FireHydrant. So one of our competitors, great people, they have taken their horse out of the race. They don't want to go build an AI SRE, they want to integrate with all of them.
And as a result, I think, four years later, we are the leading incident response platform that helps you run the process well. Our goal is to then go nail the autonomous agent aspect of it as well and help you resolve it. So if we do our jobs, I think we are remarkably well positioned to go solve both problems, and you don't have to split them into different products to do that.

Yeah, so I've touched on it a couple of times. I think you'll be being pitched, up to the eyeballs, by many different people selling AI agents, AI platforms, AI automation things. A lot of this stuff is, honestly, at the edge of what is possible right now. And that's a good thing. It means that as models get better, you'll be able to take advantage of it. But you have to be aware of vaporware, and ultimately that's why we're doing a 45-minute live demo. We're not in GA yet. We still have work to do, I think, to improve it, but I think we're seeing flashes of brilliance at this point.

And I wanted to share some of the feedback that we have so far on what we've built. Pete's gonna give you a walkthrough after lunch. You can get your hands on it a little bit later today. So we have demo stations outside. You can give it a kicking, ask it any random question, go look at the rest of our platform. So there'll be Oscar, Alex and Chris manning demo stations, so go chat to them, all certifiably nice and friendly people. And ultimately, we are gonna be bringing more customers on board soon as well. So please come talk to us, and bear with us whilst we do that.

So I think, in characteristic fashion, I've run over my time. So yes, personal weakness. But to the people that are new here today, welcome. We are very grateful to have you here. Tom said it, but you could be spending your life doing many other different things. We're very happy to have you with us.

To customers.
It's awesome to see. When I was in SF, some people had flown over from Europe. We're now in Europe, and some people have flown over from the US and Canada. So really excited to have you here. Thank you for betting on incident. There's been some people here that have been with us since 2021, when it was literally me, Pete, and Chris around the kitchen table. So thank you for doing that. Please come say hey. We would love to talk.

And ultimately it goes back to what I was saying at the start, which is, we are a means to an end, and we're humble enough to accept that reliability and resiliency, we know, is what you are actually shooting for here. And I think with plenty of capital, an amazing team behind us and amazing world-class customers, many of which are in the room, I feel like we have never been in a better position to go build stuff. Sorry. Have a great SEV0. Sorry about the rain. And with that, we have Mary. Thank you.

London 2025 Sessions