The evolution of incident management

Martin Smith (NVIDIA) shows why the way you handle incidents can make or break trust — and how treating incident management as a product feature can actually become a competitive edge.

Martin Smith, Principal Architect, SRE, NVIDIA
The transcript below has been generated using AI and may not fully match the audio.
All right. As was said, I'm Martin Smith. I'm a principal architect and SRE at NVIDIA, and over my career I've worked on a lot of SaaS offerings, specifically with a reliability focus. But today I'm here to take you on more of a fever dream of what could be in the future: maybe some things that are more controversial, maybe some things that folks here already agree with. I want to talk about the future of incident management and specifically incident response. Not what you're thinking of as the backend process that happens behind the scenes, but the incident response that your customers see, because I think that is where the future is. I'm going to show you some things that I've experienced as well as some case studies, and maybe you will agree with me by the end of this.
Two things that I want to assert. One is that you will have incidents: as much prevention as you can possibly do, there will be incidents. The second is that your customers are going to see them. They're going to experience them. Some of them may even think it's part of your product. If you think about the incidents you have today, I don't know how it makes you feel that customers see them as part of your product. You can spend as much time as you want preventing incidents, but I think you're still going to have them. So you're going to have them, and customers are going to see them as part of your product. Those are my two big points here.
The user experience of incidents is not really something I think we've gone as far as we could on. I think there's a lot more the industry could do on how customers experience those incidents. So with that framing in mind, what does the future look like?
This reminds me a lot of the time from the early 2000s all the way up to 2018–2019, where everybody was saying reliability is a product feature. We need product doing reliability; we need them thinking about it. You probably heard this: get them in the incidents, get them in the outages; build that experience for customers, with graceful degradation, prioritizing critical workflows, all these things about the business thinking about reliability as part of the product. But I would say incident management, and specifically incident response, is a product feature. It's a key part of the user experience, and I think that's only going to increase, and we should treat it that way. I think it's becoming the rule, not the exception. I have a small story of how I've started to notice this, and then I also have a couple of case studies and other examples. I think we're running a little bit behind, so I might rush or skip one of them.
First, my story. Back in July (I live in Florida; we famously have a lot of hurricanes) I got a knock at my door from the power company, and they were the folks in hard hats. They weren't the marketing folks. They actually talked to me and were like, "Hey, we're planning some repairs in your neighborhood next week, and this is what we're doing. Do you have any questions?" I asked them questions like any engineer would when you get a knock at the door. I was like, "Is it going to go off and on a whole bunch of times? Is it going to just be one long one? Tell me more." They engaged in all of that with me, and that proactive communication felt like a total game changer compared to how I've experienced the power company in the past, which is that the power goes out and you don't know why. Participating in the solution meant I was going to have a whole experience around what would before have just been an outage. It was like, "Oh, I'm actually going to help. I'm going to make the power in my neighborhood better because I'm prepared for when this happens."
For me, that was the first moment where I thought, "Oh wow, the utility company is talking to me about the user experience of the power outage." And by the way, if you want to read a deep dive into this, South Africa famously has load shedding and doesn't have enough power capacity, and they've turned it into a whole user experience. There's an app, there are advertisements, and you shop for where you want to live based on what the relationship is. I think it's fully integrated into that electric-utility-grade experience. But this was just a small thing for me, and I felt like I was actually helping.
If I think about that, I ask, "What are users looking for these days?" Applying product thinking to incident response is one way to deliver for them. I have some examples I've cherry-picked of what they want, what regulators are saying, what the media has said in a couple of examples, and how it's already evolving in some places. For me, there are three big things called out here. First, users want a lot of communication, sometimes radical transparency, way more than we used to have. Second, I think they want more choices. They don't just want it to be down. They want to know, "What can I do? If I have to pick between these two things, can I get A or B?" And the last one is that they want not just a fallback experience. I think of it as a degraded experience, but for me, it's changing from degraded to just another experience. It's not just that it doesn't work the same way or it's read-only, but maybe it's a totally different experience when you're having an incident.
I'll just say one more thing on this slide, which is that if you do all these things, you actually don't have to resolve your incidents urgently. It's great that incidents are no longer binary, up or down. Users might even be having fun during some of these incidents. This is the first one I cherry-picked, which is about an AT&T outage, a 12-hour 911 outage early last year.
I think this one really highlights the kind of communication that users expect now. There was an FCC investigation, but over the course of that day, they didn't clarify that it wasn't a cyberattack until a few hours in. There were all kinds of delays, including three hours before anyone knew 911 wasn't working. Ultimately the FCC investigation was the classic "you need to ship code safer," but all of the media and consumer-facing reaction was more like, "No, how you ran the incident wasn't great. Customers weren't happy during the incident." And there are small examples of things that could have potentially changed that, right? Apparently you could text 911 during this. That's a degraded experience in one way, but in another way there are probably people who would prefer to text 911. I would think there are actually people who would prefer that experience. So my takeaway is that there just wasn't product thinking about the incident experience, and I think that's also true of a couple of other examples. But we're short on time, so I'm going to skip some of them.
Another one I picked out here is the December 2022 Southwest meltdown, where most airlines had a slight cancellation rate because of weather, and Southwest was more than 60%. It turned out that most of what happened afterward, as far as what they were cited on, was not that anyone was upset they had an outage, right? It was that customers called you and you didn't answer. I think about that as a good example of where this could go. I travel a ton and I would love more choice, right? Imagine if you said to the airline, "Hey, these are my top five flights. If you're going to cancel the first one, I'd prefer the second one," or "I have family in this state, so if you strand me, I'd love to be stranded here." That would be a really cool experience for the airline: giving customers more of a choice. Not just selecting your meal, but, "Hey, if something bad happens, here's how I want it to happen for me."
And that would be really cool even if you said, "I'd rather fly the next day. I don't want to be on a three-hour delay and miss a day of work; I'd rather fly after work tomorrow." That kind of option is creeping more into these kinds of products. I'd love to see all the big four U.S. airlines do something like that.
This last one I won't talk as long about, but the idea is to give users an alternative experience. This was a large Meta outage you probably heard of, where everything from emergency messaging to communities with low literacy using voice messages on WhatsApp was suddenly cut off. I think Meta famously couldn't get into their data centers during this outage. It was DNS, of course. But I think there were so many opportunities to have an alternate experience, and I think there are some really cool ones. I'm actually going to ask this at the School Night demo up there. It'd be really cool to put the AI-based SRE in front of customers. What if you did that? What if you just fed it their data, a read-only copy, and said, "Look, we know you normally run this report, but here's an LLM with your data, and this is what we can offer you during the outage." So really, alternatives to fallback: other experiences. With Meta, right? People are even selling things; there's all kinds of stuff. Give them an eBay-like experience, something that's different, and then there's not really this urgency.
Again, we're going to have outages, and customers are going to think it's part of your product. Would you rather have it be the Amazon dog error page, or would you rather it be, "Oh, I got my work done today; it was just a little inconvenient"? On things like the AI SRE, when I was showing some of this talk ahead of time, somebody even suggested, "Oh, they could actually help you troubleshoot your issue." Your customers could tell you, "I'm seeing it here and I'm seeing it there."
Maybe that would be a really awesome product experience: "I fixed the uptime for this big enterprise because I reported the issues that happened." And just one last note on this: I see crisis management come up a lot in these big incidents. I think that's another opportunity to think about. If you live in the U.S., there's the Emergency Alert System, not well known for its user experience, but maybe there's some of that we could think about in terms of what we're building, just given that we're going to have incidents and users are going to think it's part of your product.
I have a bunch of takeaways that take the classic SRE things like fallback and failover and ask what they would look like in a product. I think I've hinted at some of those. What's cool here is the goal is no longer to resolve the incident, right? It's to give customers another experience during an incident. That takes some broad alignment with your product folks: obviously this product thinking requires folks with a product mindset to get involved. I don't know that I would necessarily say they all have to be on call or something like that, but I would love to see product-focused action items from incidents even more than action items like "upgrade the database," right? I'd love to see, "Build this thing," or, "Wow, let's go talk to some customers about what they were trying to do," and then figure out, could we do some of those? Or something else where you could put that back into your product roadmap. It might even be more important than reliability items on your roadmap at that point, right? If it makes your customers really happy, then that might actually bubble up to the top.
So what else? Giving customers more choice was another example I used before. I think you could talk to customers about what kind of regional or AZ outage they would prefer. You don't know, right? I think that would actually be super interesting to do, and be radically transparent about it when it happens.
I think this is starting to happen. I think the focus, at least in some of these examples I showed before, was really about the response, not about whether you were up or down. Some other things I've been thinking about: as an SRE, if anyone in here has worked with me, I'm always the person running a retro and I'm like, "What could we have built to make it easier to resolve this incident?" I ask that all the time in retros. But now I want to ask, "What could the product be like that would let us have this incident for longer?" What would be the different product experience?
I don't think you have to build a whole other thick client or even the LLM example, but you could build data features that would really change how the product is experienced by users. I've seen these. I have a home alarm system, and when the power goes out in the neighborhood, it tells me what percentage of homes are out of power within a one- or five-mile radius. That was cool. I had a really good experience with the product while it was essentially down, and that actually made me feel better about the product. I'm like, "Oh, okay. It's not my house, it's not the product." That was adding value.
I also think we could rethink some status page features and data. Right now there are status pages that basically tell the operators, "Hey, a bunch of users just showed up." What if you let the users tell each other that? I think there's some of this on the internet, but it's not usually endorsed by whoever made the product. What if it was like, "Yeah, we're going to put you in a chat room with all the other users that are experiencing this outage and an AI SRE," right? That would be super interesting. It'd be a totally different experience than just an outage. So what else? More future things.
I think most organizations separate the technical firefighting from the business and user-facing work, but what if the UX is an incident response experience for your users? I have a whole bunch of other ideas here, like that radical transparency idea. I love that we always tell people when a cloud provider has a problem, "Oh, it's not us," but we never say, "Yeah, that's our database. Remember that bit you liked two months ago?" A couple of vendors are doing that, but we are not at the level of radical transparency that I think consumers want. You all are consumers too, right? I know when I see an outage like that, I'm like, "I wonder what that was. I wonder if that was Route 53, or Google Cloud Logging," that kind of thing.
I also think you could do more in some of these product experiences as far as telling users, "We have load shedding." That's an infrastructure concept, but what would it look like as a user experience? What if you were like, "Okay, I can prioritize bringing up certain capacity first. Let me ask the users which ones they care about." Then I've got a prioritized list of what to bring back, and maybe I even build that more into the product, with lots of user input into restoring service.
I also think there are new product roles that could be out there for this. There's a product manager for incidents, but also a head of incidents for product, right? There are all these fun product roles I'd love to learn about, like an incident developer: somebody who just develops the experience during that incident part of your product. And just to bring it back to another concrete example, I don't know about you all, but the little Chrome dinosaur game is a really fun small example of, "Oh, you're having a bad time. You clearly don't have internet. 
There's not that much we can do for you as a web browser, but we'll give you a little game to play," and that'll maybe make you forget that you don't have internet and you can feel better about that for five minutes.
I also think there are ideas we could take from other parts of product on this, like A/B testing the incident experience for customers. Again, more transparency: if you think about public key infrastructure, there's a transparency log. What about more transparency like that for incidents? I think there are some things we could take from other areas that would really build an interesting user experience. And I think users are already looking for them.
Also, just one last callout: we know with security that vendors' products are basically perceived as your product, with supply chain attacks and all those kinds of things. But I think we should also start looking for that in incident management. I would love to put some of these products directly in front of users, and status pages do that a little bit. But again, what about something like one of these new features, like the AI SRE, in front of users? Incident management companies could do more research into what users experience and build something that really does feel like part of your product at the end of the day. Now, you don't need a vendor for a lot of that stuff, but I just want to call it out because I think it's not just going to be security anymore where this happens.
So what else? Just reiterating, I think it's part of the user experience. Again, I'm from Florida where we have hurricanes, so you might not know this guy, but this is my callout: this is Jim Cantore from the Weather Channel. People are ambivalent when he shows up somewhere, because it probably means there's some bad weather thing. But I would ask you: who is that person when incidents happen at your company, right? Imagine if your customers were like, "Oh, there's Martin. Something good's going to happen." They know it's serious, right? 
If the Jeff Barr of AWS shows up, they're excited about it. They're having fun, even though it's a hurricane. So what would that experience be like, to delight your customers during an incident? Anyway, I hope some of these were thought-provoking, maybe even useful. I have about 10 seconds left, so I think we hit it right on the mark there. Thank you very much.

San Francisco 2025 Sessions