
The Pearl, San Francisco

Keynote: the evolution of incident management

Stephen's keynote discusses the evolution of incident management. He reviews industry challenges, critiques current practices, and presents a maturity model to improve incident response, emphasizing collaboration, data integration, and future AI opportunities.

  • Stephen Whitworth, Co-founder and CEO, incident.io
The transcript below has been generated using AI and may not fully match the audio.
Thank you very much. Thank you. Good morning everyone. I am excited to kick off what I think is going to be an awesome day. We don't take things too seriously here at incident.io, so my speaker notes say my name is Ron Burgundy. Chris has been messing with them.

Before I started this company I was an engineer for about 10 years. I did things like founding a company to do credit card fraud detection and running systems at banks to prevent financial crime. What that means is I was basically on call for 10 years, and I feel like I spent countless hours dealing with incidents, navigating chaos, and just trying to fix things as fast as possible. There was this time, a couple of years ago, where I got paged so repeatedly by PagerDuty that I took my Apple Watch off, threw it across the room, and smashed the screen. So I have some war stories from responding to incidents. And if you've been in this world long enough, and I think this is a room full of people who have, you know it's not the most glamorous job in the world.

I think responding to incidents is stressful, exhausting, and often unpredictable. The alternate view is that it's a space where you get to see how your organization really works under pressure. And, yeah, it's just a way to feel alive, isn't it? I love my job. I get to be deeply involved in incident response. I was trying to think: is there anyone in the world who spends more time talking about incidents, and talking to the people who run them? I'd wager we're in the top 5 percent. I'll leave it up to you to figure out whether that is a good or a bad thing.

Why listen to me? What gives us credibility? We work with some awesome companies. We started this company about three and a half years ago, at a kitchen table during COVID, and now we're here in San Francisco. Very proud. We've spoken to thousands of companies at this point, helped them run hundreds of thousands of incidents collectively, and really worked with them to improve their incident management programs. I can see lots of familiar faces in the crowd, so thank you also for traveling here. Some people have come from the UK, some from across the country, and I'm very thankful for that. As a result, I feel like we have a pretty good pulse on what is happening, and I want to share some of that with you today.

You should think of this session as a bit of a state of the union for incident response. Where do I think we are? How did we get here? It is just my opinion. I'm trying to ground it in what we've seen with all of our customers, but ultimately this is a judgment call from our perspective. I'm going to talk about the status quo, and why I think it sucks. We'll talk a little bit about the stages of maturity that companies go through as they build and scale out their incident management programs. And then we'll talk about the future: where we're investing, where we're seeing the cutting edge going, and what's next.

Ultimately, again, why should you listen to me? You just got here, so I think it would be a bit rude if you left already. But I wanted to write a speech that a cynical engineer would actually enjoy listening to, and although I run a sales team at this point, I still have my engineering roots deep within me. One, I'm not going to bullshit you. This will just be me dropping my thoughts, some hot takes, about where the industry is going. You don't have to agree with me.
Two, I'm trying to ground this in actual fact and data. We've talked to a bunch of companies to prepare all of this; it is not a timeshare pitch where you all have to come and listen to me in order to get fed. And lastly, I really wanted this conference to be pragmatic. How will you be better on Wednesday based on what you heard today? I want you to walk away with concrete ideas, stuff that you can go implement. Let's go.

Incidents are not optional. This should not be a surprising fact to anyone in the room. I prefer the alternate parlance, which is: shit happens. No matter how well engineered your system is, stuff is going to go wrong. Incidents are inevitable. And given that we have to deal with them, what does the status quo look like for most organizations? It looks like this, and I don't think it's great. I will walk through it.

Let's zoom back. It's 3 a.m. We're in the shoes of someone responding to an incident, and you wake up to a blaring alarm. If I played the PagerDuty ringtone, I'm sure there'd be a bunch of Pavlovian reactions in the room. You reach for your phone, you press 4 to acknowledge, and then you open your laptop, grab a coffee, and get to work.

At this point it feels like the software stops and you're launched into the world of manual process. You create a Slack channel. You try to find that dusty Confluence document that tells you whether this is a SEV2 or a SEV3. You then create a Jira ticket; there are 25 different fields that someone has made you fill out, so you just put N/A in all of them to get it going. And there, the incident is declared. At this point you want to go and update the status page, but it turns out you last logged into it three weeks ago, so your SSO session has expired and you have to go find your login. And now support tickets are flooding in from Zendesk, and the CEO has just joined the incident channel.

Are you stressed? This sounds very stressful to me, and I feel it very deeply in my soul. I was an incident responder in literally every job I've done, and I just felt like the human glue between all of these different tools. This really is the status quo for most organizations. They are trying to glue all of these tools together; it might be different tools, but they are the same problems under the hood. I think ultimately this sucks, and we can do a lot better.

When you're running incidents, there is this mass of tools that you need to deal with. Most of our customers are using between five and seven different tools in an incident. Sentry is where the issue is reported, Datadog has all of the metrics, Jira is the ticket you're tracking, Zoom is where people are fixing things, and Statuspage is where your customers are being told about it. Again, you have to glue all of this stuff together, and I think there are a few key problems with that.

The first is: what the hell is going on? You have all these tools that each have a little bit of the information, but there is no consistent whole. As a result, if you're coming into an incident, say I'm leading an engineering team or I'm the CTO, it's very hard to figure out who is doing what and what we have tried so far.
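To make the "human glue" concrete, here is a minimal sketch of the kind of script a responder ends up running by hand, or half-automating, under the status quo described above. It is illustrative only, not incident.io's product: the environment variables, Jira project key, and Statuspage page ID are hypothetical placeholders, and the Statuspage call is one plausible way to use Atlassian's public REST API.

```python
"""Illustrative only: the manual 'glue' steps of declaring an incident, as a script.

Assumes hypothetical credentials in environment variables. The Slack and Jira
calls use their public APIs; the Statuspage call is one plausible shape for
Atlassian's Statuspage REST API.
"""
import os
import requests
from slack_sdk import WebClient


def declare_incident(slug: str, summary: str) -> None:
    # 1. Create the incident Slack channel and post the summary.
    slack = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
    channel_id = slack.conversations_create(name=f"inc-{slug}")["channel"]["id"]
    slack.chat_postMessage(channel=channel_id, text=f"Incident declared: {summary}")

    # 2. Create the Jira ticket (most of the 25 mandatory fields left as N/A...).
    requests.post(
        "https://example.atlassian.net/rest/api/2/issue",  # hypothetical site
        auth=(os.environ["JIRA_USER"], os.environ["JIRA_TOKEN"]),
        json={"fields": {
            "project": {"key": "INC"},           # hypothetical project key
            "issuetype": {"name": "Incident"},
            "summary": summary,
        }},
        timeout=10,
    ).raise_for_status()

    # 3. Open a Statuspage incident so customers hear about it.
    requests.post(
        f"https://api.statuspage.io/v1/pages/{os.environ['STATUSPAGE_PAGE_ID']}/incidents",
        headers={"Authorization": f"OAuth {os.environ['STATUSPAGE_TOKEN']}"},
        json={"incident": {"name": summary, "status": "investigating"}},
        timeout=10,
    ).raise_for_status()


if __name__ == "__main__":
    declare_incident("payments-errors", "Elevated card decline rate")
```

The point of the sketch is that each step is another credential, another tool, and another place where the record of the incident fragments, which is exactly the problem the next section describes.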
Ultimately, as you scatter this data across multiple tools, it becomes really hard to get a central picture. The best example is when people try to build a post-mortem: they go to a variety of different tools and do this CSI-style reconstruction of what happened. You may as well have a degree in criminology to rebuild it all. And lastly, none of the tools really speak to each other, so humans have to glue all of that stuff together, and they have to do it when they're tired, stressed, and out of practice. It's not the best. After dry-running this slide yesterday, I realized it was maybe not the best layout: if you point just there, you get a big "intelligence not found" right underneath me. I'm sure people will make me rue this.

Anyway, you're all engineering leaders. You need data to do your jobs, and I think it is nearly impossible in the status quo to get high-quality data about who is in incidents, how long they took, and which teams are getting hammered the most. The status quo tends to be a crude list of SEVs, perhaps broken down by team if you're lucky. Some of you are spending nine figures a year on engineering teams, and if you have to pay a reactive tax on the work you want to put into roadmaps and product, I think most leaders couldn't say whether that's 7 percent of their time or 15 percent. All of this adds up to a colossal amount of investment. Then there's key-person risk: which three or four people, if they decided to leave, would leave you really screwed? Without this visibility, I don't think leaders can understand what's working and what's not, and therefore where they need to invest.

So I've called this, very dramatically, the doom loop. I think there's a very common pattern across organizations. One, incidents are really painful to run: 20 mandatory Jira fields, et cetera, so no one actually wants to run them unless it's super necessary, and teams avoid them wherever possible. Two, because it's hard, you get fewer incidents, which means less practice, and incident responders lose the muscle to respond. Three, as a result, incidents are run poorly; you end up with three or four people who know how to run them well while everyone else struggles, and you get a slow, disorganized, inefficient way of responding. And lastly, because all of this gives you a false picture of what's going on, and the data itself is crap, you don't really know what is happening in your organization when you're responding. I think of it as trying to solve a puzzle with half the pieces missing: you can get somewhere, but it will be a very unsatisfying conclusion.

The status quo sucks. Are we all totally screwed? No, I'm your savior. I'm joking. I want to talk a little bit about the journey that most organizations tend to go through. I don't think this is often well discussed: where am I? Is this good? Where should we go next? So we have created an incident maturity model that breaks it down into a bit of a framework. There are three stages: Centralized, Distributed, and Democratized.
You can think of these as different levels of maturity in how incidents are handled, how teams collaborate, and how tooling and data are used to support that. Let's jump in.

The first stage is centralized incident management. This is where the vast majority of companies have some element of things going on. In DevOps-y phrasing this would be "you build it, they run it"; my preferred phrase is the "it's not my fucking problem" model. A few things characterize it. One, you end up with a centralized team that is responsible, in part or wholly, for responding to incidents. This might be a NOC, a first-line on-call that isn't team specific, or a dedicated incident management team. Everything funnels into this team. Two, in this stage incidents are a scary thing, so service owners aren't really declaring them, because they have to go and tell the incident team and it feels like "oh, they're going to be annoyed at me." And the tooling here usually ends up being pretty manual: maybe some Jira stuff, maybe a little home-built Slack bot if you've invested there. The key point is that you're not really using data from previous incidents to inform what you do next.

There are some good things. The team is a team, which means you can train people, they can share knowledge with each other easily, and you get consistency at a small scale without having to do mass training of individuals.

There are some not-so-great things. I realize that "I have a crack team of specialists that solves all my incident management problems" might sound awesome, but in practice there are a couple of bad points. One is misaligned incentives: if I'm a service owner, I just want to ship stuff; if I'm on the incident management team, I want things to stop breaking. There might be some alignment at the start, but over time this tends to cause tension between the teams. Two, it doesn't give the operational teams the muscle to learn to respond to incidents really well, and I feel it's important that service-owning teams end up feeling the pain. And three, it doesn't scale well. If you imagine the funnel, hopefully your company gets bigger, and you'll have more incidents as you grow, which means you end up funneling ever more through what is proportionally quite a small team.

I want to take Netflix as an example here. When we first started partnering with Netflix, the vast majority of incidents were being run by a single team: CORE, which is Critical Operations and Reliability Engineering. I feel like that might be a backronym; it's a nice one. Teams outside of CORE didn't really have a paved road for what they were supposed to do in incidents. At Netflix scale and Netflix criticality, something like 90 percent of incidents being run by a single team is a lot of responsibility to bear on the shoulders of a responder.
I touched on it earlier, but the vast majority of organizations have some element of this happening. It might be scale-ups, it might be public technology companies. My hot take is that a lot of DevOps has meant we have central APM, central monitoring, central infrastructure, but I don't think that has been fully felt by teams in terms of actually being woken up for their own stuff. People are on a journey there, but it's much more likely that you'll have a central metrics library you can use before you end up with distributed on-call across your teams. So: works in a pinch, harder to scale, and most companies have some element of it.

Stage two, distributed. This is, I would say, a big leap forward from the centralized model and a more mature approach. One, teams take ownership of their services: it's not just the job of a central team to respond to incidents anymore. A lot of the responsibility is federated down to the edges, and teams are responsible for their software but also usually for getting woken up to deal with it. Two, tooling is a bit better; this could be more Slack bot stuff, or it could be something you've bought, but in general it's hard to federate out responsibility without tooling to support it. And one of the key things here is that incidents are still a nerdy, engineering-y thing. Some teams outside of engineering might know there's an incident, but it's very much engineers doing engineering-y stuff.

There are a few good things. One, engineers are more practiced at running incident response, so they panic less. Two, the tooling helps reduce the manual burden, and the story of tech is basically that when you reduce friction you get disproportionately more of the thing, so you'll probably end up with a lot more lower-severity incidents being declared. Teams have more context on the software they're running, which means they'll usually fix it faster. And then there's the loop of feeling the pain of your own incidents: I think that ownership ends up driving resilience.

No free lunch, though. Newsflash: training hundreds or thousands of people to run incidents well is actually quite hard. You don't have a centralized group of blessed incident experts anymore; everyone has to be good at it, and that's not even remotely free. And two, because thousands of people are doing it, consistency becomes a challenge. Pick your trade-offs.

How do you get from stage one to stage two? I feel like I'm maybe just explaining basic stuff to a bunch of really smart people, but if you want people to do something, make it easy. That looks like making running an incident a guided process. My benchmark is: can an engineer join us on Monday and run their first incident on Thursday? It might not be the best-run incident in the world, but they ran the process and they know what to do. This is a little bit self-serving from my perspective, but I do think you need tooling to roll this out. If you are trying to teach thousands of people to do something without an investment in tooling, it can be really challenging. But I would say that.

Last up, democratized. I think this is where the cutting-edge organizations are and where the future of incident management is. First up, incidents are for everyone, not just for engineers.
I think of it like this: I worked at a bank before this, and it turns out that if credit card payments go down, people are not very happy about it. They write in to their bank and complain, which means that just because you had this Kubernetes issue over here, you now have tens of thousands of customer support tickets to deal with. It feels patently obvious to me that incidents cannot just be an engineering thing; you have to link them together. As a result, incidents become a team sport, with customer support, legal, execs, risk, and compliance all collaborating. There are a lot of people doing payments in the room here, and payments incidents are always very fun, because there's a legal angle and a risk angle. It ends up being a team sport. And lastly, you end up with advanced tooling and centralized data to support everything, the idea being that the last incident we had makes us better at the next one, and that improvement is actually driven by the data.

A few key strengths. Faster detection and resolution: more eyes. This is the security angle, right? Like open source, more eyes will spot more bugs, and I think it's very similar in incidents: more people feeling empowered gets you detecting them faster. Different perspectives lead to a better response; I'm not saying this is always the case, but generally I think it's true, and I don't think there's any reason why you shouldn't be bringing customer support leaders into incidents, for example. And I think this is the most important thing: incidents are inevitable, so let's not be scared of them. Let's just run them, and use them to become better at our day jobs.

Now, the challenges. Again, basic stuff for all the smart people here. Cultural change is really hard; you can't just click your fingers and have it happen. Second, it's obviously a lot easier for an engineer to talk to an engineer than to a lawyer. You could say it's not easy for engineers to talk to anyone. Ultimately, you need a shared vocabulary: what might be a throwaway SEV3 to an engineer might be an extremely stressful thing to a lawyer, so you need to align.

I talked a little about incidents being for everyone, and the patently obvious thing when we started the company was that there is no reason people outside of engineering shouldn't be involved. I'm really happy that some of our customers are starting to see this as well. Skyscanner, the amazing travel company based out of the UK and China, has incidents being declared for power outages at offices and incidents being declared for laptops. There are lots of different applications of it, and it ends up being a really collaborative story: greater than 70 percent of their organization has been involved in an incident, which I think is a crazy stat. I'm very proud.

Stage two to stage three, Distributed to Democratized: bring people in. You want people to feel like they have some influence over the program you're trying to roll out, so you should get their input on how you design it.
The people who get punched in the face by incidents are generally customer support, so bring them in and help them design the program. This might mean working with a VP or a head of that function. My picture is that most of the org actually wants to be involved in incidents; they've just been locked out by tooling. So you'll tap into this well of "oh my god, yes, let's do this thing," and I think you'll see a lot of uptake when you do it.

Four minutes and 18 seconds left, I'll get going. A clear path for improvement: we'll be talking about this a lot more after SEV0, but I wanted to give you a sneaky preview of some steps you can take to move between these models. Centralization, I think, is a good place to start, but it ends up being a bit tough. Ultimately, resilience is what we're trying to build here: the idea that if this thing goes wrong now, we'll be better tomorrow as a result of it. A great incident management program can help you achieve that.

I'm going to thought-lead to the max at this point and talk a bit about what's next and where we're going: the role of AI. I got 23 minutes through this talk without mentioning AI, in San Francisco, so you should be impressed. Yeah, I appreciate that. I wanted to do a little audience participation thing. Everyone, if you have heard of the phrase AIOps, put your hand up. Great, keep your hands up. Keep your hand up if you're using it in production. Oh, that's quite sad. Keep your hand up if you love it. Woo! Shout out to the four people.

My hot take is that I really don't think this has worked so far. There are a few great businesses being built here, and the BigPandas of the world are doing awesome stuff, but I just don't think we've felt the full effect of what AI could do. I remember turning on intelligent alert grouping when I was at Monzo: it got things wrong within six hours, I turned it off, and I never turned it on again. If you're going to be in the critical path, you have to get this stuff right, and it hasn't been as effective as we need it to be.

My pitch is that things are different now, and I want to tell you a little bit about why. Our product right now is quite reactive: if this thing happens, do this other thing. It's a good product, but it can feel a little bit stupid at times; it isn't actually helping me solve things, and I think there's space for proactivity to be injected here through AI. Every AI feature we build, and I'm not joking, has seen wild adoption with very little encouragement from our side. For example, when you finish up an incident, we can scan through your entire Slack channel and highlight all the things you said you wanted to do to improve next time but haven't exported to Jira. That's a cute one. Highlighting similar incidents: you've probably solved this thing before, or it might be related to some other common issue; there's a lot we do there. And then suggested runbooks: given that you fixed this thing before, how do we take that knowledge and codify it for the future?
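To make the follow-up-scanning idea concrete, here is a minimal sketch of what extracting unexported action items from an incident channel could look like. This is illustrative only, not incident.io's implementation: it assumes the OpenAI Python SDK, a hypothetical model choice, and a made-up transcript passed in as a list of messages.

```python
"""Illustrative sketch only: pull candidate follow-up actions out of an
incident channel transcript with an LLM. Not incident.io's implementation;
the model name and prompt are assumptions."""
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def suggest_follow_ups(channel_messages: list[str]) -> str:
    """Return a bulleted list of follow-up actions mentioned in the channel."""
    transcript = "\n".join(channel_messages)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice of model
        messages=[
            {
                "role": "system",
                "content": (
                    "You read incident Slack transcripts. List every concrete "
                    "follow-up action the responders said they should do later "
                    "(e.g. add an alert, fix a runbook), one bullet per action."
                ),
            },
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    # Made-up example transcript.
    print(suggest_follow_ups([
        "alice: mitigated by rolling back deploy 4231",
        "bob: we should add an alert on queue depth so we catch this sooner",
        "alice: also the runbook for the payments service is out of date",
    ]))
```

The underlying pattern is the same across the features described above: take the unstructured record of an incident and turn it into structured follow-ups, similar-incident links, or runbook suggestions.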
And I think the crazy stat, and this is underplaying it to me, is that we have some organizations where 50 to 70 percent of incident summaries are being written by AI with very little editing from their side. Ultimately, I think there is a huge opportunity here. We are investing quite heavily, and there's a lot that we'll be sharing over the coming weeks and months. But if I were to boil everything down to "people want to fix things faster with fewer people," I would wager that AI ends up being the biggest shift in a long time that lets us do that.

Right, 35 seconds left. We have these three trends, which I think are all actually the same thing, and I want to describe the opposite of the doom loop; I didn't have a good word for it. Number one, make incidents really easy to run. That means teams will run them for SEV2s, SEV3s, and SEV4s. Derek's going to be giving an awesome talk about that later; shout out, Derek. We will then have more and better-run incidents, which means people are more practiced, and you get the incidents-are-for-everyone, team-sport side of the world. You then get more and higher-quality data out the back of that, which gives you better insights as an engineering leader, but also more context for runbooks and for training AI. And once you shove these things together in this copacetic way, it actually helps you fix things faster, and I think your organization can continually improve off the back of it.

Last thing from me. We have this cute little doom loop skull thing. I would love you to do a bit of reflection on where you are in the spectrum we've described. Are you in the world of fragmented tools and manual process? Or are you getting closer to incidents as a team sport, democratized across your organization? You can think of this incident maturity model as a little guide to help you along your journey.

I want to say thank you to the speakers. You haven't heard them yet, but I have, and they are awesome. They are going to be sharing some of their journeys with you, really focused on practical, actionable, pragmatic insights that you can go and implement. I'm very excited to hear what they have to share. And I wanted to say thank you for listening to me. It has been a blast. I will be around all day until 6, and then helping clear up afterwards, so please come say hey.

With that, I would love to hand over to Andrew from OpenAI. He is going to talk to us about how you maintain a blameless culture when everyone actually knows who did the bad thing. Over to you, Andrew. Thanks so much.
