We should all be declaring more incidents

Martha Lambert, Product Engineer at incident.io, talks about how to reframe incidents and use them to your advantage to make customers happier.

The transcript below has been generated using AI and may not fully match the audio.
Hey everybody. I'm Martha, and I'm a product engineer at incident.io, and I'm hoping you all know what we do by now. During my time here, I've been involved in over a thousand production incidents. That number seemed quite big to me until I thought about the number of incidents we in this room were collectively involved in on Monday alone.

When we think about incidents, we tend to think about things being really stressful and scary. And when you look at reporting, you tend to want the numbers to go down, 'cause that tends to signify a better, more resilient product. Today we're gonna talk about why you should be declaring more incidents as a tool for happier customers.

So we have a lot of incidents: around eight per day, across 40 engineers. You might think that sounds like a shit show, and that's exactly what I thought when I first turned up. I thought things were going really badly wrong all of the time and that I'd made a terrible mistake. In reality, it's a tactical and purposeful change. The same number of things are going wrong; we're just dealing with them differently.

We're gonna start with an example. A few weeks ago, we had a production error during the working day. A customer hit this when trying to use the dashboard, and it was a minor issue isolated to a single customer. But one of our engineers, Leo, got paged and immediately declared an incident. Before the customer had reached out, Leo sent them a message, told them we were on it, and had shipped a fix, which landed in production 10 minutes later.

This might look surprising, 'cause I'm sure we're all used to bugs resulting in grumpy customers. But here we proactively reached out so the customer didn't have to. We took their issue really seriously and kept them up to date throughout, and our response was so good that rather than being left with an overall negative impression, they had a really good one. So good that they thought we were a bot. Obviously we want to avoid bugs, but we all know that's impossible in software.

So here's a situation where we turned an issue with our platform into an overall positive experience for a customer. And I'm not exaggerating when I say we get messages like this all the time, so that wasn't just a lucky one-off. These are consistent examples of us turning bugs into places where we can make our customers happier and more confident in our service. Customers understand that things go wrong. They just wanna know that you're dealing with them really well. One of my favorite examples is a customer sending us this.

It's amazing working in an environment where you get this kind of feedback, but it's particularly wild when you get it in response to doing something wrong. At the end of the day, we shipped a bug to production. So how do we get here? We've noticed an issue, we've reached out, and we've fixed it before the customer has had a chance to let us know.

And you might be thinking, like this guy, that what lets us do that is having shit-hot observability and real-time knowledge of the customer and platform experience at any one time. Don't get me wrong, I love observability, but that isn't what makes this great. You don't have to invest massively on the technical side to see the kind of response we saw from customers. It's much more about how you respond. So today we'll talk about why you should declare more incidents.
We'll talk about lowering your bar and having more of them, making incidents your priority and the processes you'll need, and then finally putting the customer first in your response.

So, lowering your bar. This is all about being much more open-minded about how you can use the idea of incidents when more minor things go wrong day to day. And why is that even useful? We get a load of things for free when we declare an incident. They're a space where you have clear, well-thought-through processes that work for your business, and that's something that people take really seriously. I'm sure we've all been in a situation where there are four different Slack threads about an issue that later emerges to be an incident. Context is missing, you don't know what's been done or who's involved, and it's really hard to piece that together. It's far better to carve out the incident space early, even if it ends up being small. You've got everything in one place and you can pull people in when it gets serious.

Here we've got Leo again, having a chat to herself in an incident Slack channel. This can feel really silly at first when you're on your own, but it's absolutely invaluable if an incident escalates in severity: you can see exactly what's been done so far. The same goes for historical incidents. It's so easy for future engineers to jump in and find out exactly how Leo debugged what caused a lock timeout, in her words "some kind of foreign key business".

Another reason we should lower our bar is to train the incident muscle. When you're using incidents for small issues day to day, you're gonna develop good incident habits in your sleep. Engineers will know how to communicate, when, and to whom. When you have a really big incident, you don't wanna mess up on those details. So if you're comfortable with response day to day, your processes will be far smoother when it counts, and you can focus on fixing things. You want anyone in your team to be comfortable leading an incident.

And then insights. Incidents naturally provide a really standardized way to respond, and you probably already have processes in place to track these. So if we're putting more things in that bucket of data, we're gonna get more interesting insights about how we're spending time on reactive work. That means we can identify places to invest, or teams that are having a bit of a rough time.

So how does this work in practice? How do we actually encourage more incidents? The easy answer is to not rely on people to do that, and that's why we auto-create incidents from errors. I'm sure every team has some sort of error graveyard. Maybe it's a Slack channel or somewhere in Sentry where your errors go to die. Maybe they were once monitored, but that always falls away, and I'm definitely really guilty of that. When you're ignoring these errors, you're missing out on rich information about the issues in your product. So using incidents as a robust triaging process for these means your product will be much better for it.

The important thing to work out is: what do you care about? And the answer to that really changes over time. Right now we page for any new error in an important service, but we also track error volume anywhere across our app. The important thing is empowering your teams to keep that error stream really high signal and low on false positives.
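As a rough illustration of that auto-creation pipeline, here's a minimal sketch of a handler that turns an error-tracker alert into an incident and decides whether anyone should be paged. The webhook payload, the API endpoint and fields, and the list of "important" services are all assumptions for illustration, not the exact setup described in the talk.

    # Minimal sketch: auto-create an incident from an error-tracker alert.
    # Endpoint, payload fields, and IMPORTANT_SERVICES are illustrative assumptions.
    import os

    import requests

    INCIDENT_API = "https://api.incident.io/v2/incidents"  # assumed endpoint; check your tooling's docs
    IMPORTANT_SERVICES = {"alerts", "escalations"}          # hypothetical paging-worthy services


    def handle_error_alert(alert: dict) -> bool:
        """Create an incident for an error alert; return True if a human should be paged."""
        service = alert.get("service", "unknown")

        response = requests.post(
            INCIDENT_API,
            headers={"Authorization": f"Bearer {os.environ['INCIDENT_API_KEY']}"},
            json={
                # Re-using the error's fingerprint means the same error never opens two incidents.
                "idempotency_key": alert["fingerprint"],
                "name": f"New error in {service}: {alert['title']}",
                "visibility": "private",
            },
            timeout=10,
        )
        response.raise_for_status()

        # Page a human only for brand-new errors in services flagged as important; everything
        # else waits in triage for whoever is on product responder to accept or decline.
        return service in IMPORTANT_SERVICES and bool(alert.get("is_new_issue"))

The design choice doing the work here is the idempotency key: piping every error through this path only stays high signal if duplicates collapse into a single incident.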
I always know that if I get paged at 3:00 AM for a false alarm, I can come in the next morning and spend a couple of hours making sure it's never gonna page me again. It's really important that your team can always pull that lever.

Another way we enable more incidents is reducing the overhead of declaring and running them. You wanna empower everyone in your team to really quickly and easily raise the flag that something's wrong, and that should extend beyond engineering. First is a simple declare flow. How long does it take someone to fill out your incident declaration form right now? And do you really need all that information? I'm not very good at clicking buttons.

Next is being able to decline incidents. If you're having more incidents, you're gonna have a few that aren't a problem at all, and it needs to be really easy to throw them away without having to follow a load of process. And then finally, a lightweight process for less important incidents. People are gonna hate the idea of having more incidents if they have to write up a really long postmortem and host a debrief. So make sure that your process flexes to fit the severity of your incident.

Cool. So we've changed our mindset and we love incidents. We're declaring loads of them. How do we make sure that we have the process in place to deal with them? Incidents have to be your number one priority. They are the most important thing until you know otherwise, and you need to drop everything for them. No excuses. And why does that matter? That's literally the point of incidents. They're gonna fall apart if they're not your most urgent thing, because they're your only way to deal with high-priority work, and they need to stay that way.

Critically, that doesn't mean that we fix everything straight away. It just means that we work until we know the priority of the issue relative to everything else. If AWS goes down, obviously that's quite different to a day-to-day incident, so you need your process to flex to fit that.

To keep us working on the most important thing, we have a really clear triage flow through an incident. When you have more incidents, naturally there are some that aren't issues at all. Maybe it's a deployment blip or a transient third-party problem. Generally, they're one-offs. You don't want these to count towards your metrics, so having a really easy process to decline an incident once you've done some initial investigation really helps. Once we accept an incident, again, that doesn't mean we're fixing it straight away. It means the engineer is interested in investigating more about what went wrong. Once they've determined the priority, there will be a big proportion of incidents that get ticketed up and won't be handled right then and there.

Cool. So we've got this lovely triage process, but that doesn't mean it's gonna work yet, because people are involved and we need to make sure they're making the right decisions. Understanding a framework for what makes an incident important will really help your team to do this quickly. This is always gonna require nuance and judgment, but you can probably define some pretty solid criteria for where these fit. For us, there are a few things that would make me wanna fix something immediately. Maybe it's a critical product flow, so for us that is declaring incidents or paging. Perhaps it's a less important product area, but it's actively hurting a load of customers.
Or maybe it's neither of these, but it really matters to a customer or a prospect that we care about, and speed is really important to them. Knowing this gives new people a really good framework to begin with, and the ability to much more easily categorize issues. Triaging effectively is gonna speed up your team so much, because you can be sure that you're always working on the things that are most important at any time.

One more principle to keep our incidents really useful and clear is never passively pausing. We're using incidents for a wider range of errors, and that means there are times when an incident is ongoing but you need to wait for something. Maybe you're waiting for a customer to respond, or you're putting it down overnight to get some sleep. We already said that incidents need to be top of our stack. Sometimes there are sensible times to pause one, but that always needs to be an active decision. That means an engineer sending a message saying: I'm putting this down, this is the reason, and this is when I'll pick it back up. That way, anyone joining the incident can get context on why the channel is quiet, and disagree with that decision if they think it's a different urgency.

So we're having these incidents all the time, and we're making sure that we're dropping everything to deal with them. How do we make sure that we're actually getting any engineering work done? For us, that's a process called Product Responder. This is a rotating group of engineers whose sole focus is reactive work. It's generally one to two engineers per team each week, and their focus is incidents, bugs, and shipping customer delight. It's about shaping your organization to support flexibility, and when you actively budget resource for interrupts, there's no drama about who will deal with an incident when it fires.

When I've been in teams without this, it tends to always be the same people picking things up when they come in. The trade-off is obviously delivery on our roadmap, but I don't think we're missing out on that much. The drama and interrupts still exist; they just cause a lot more chaos when there's not someone in place to deal with them. And product responder isn't expected to deal with everything alone. They're just the first line. So on days like Monday, they'll draft in the entire engineering team, but for day-to-day issues, product responder has it covered.

Cool. So we're having incidents and we're treating them with the right urgency, and we've got the staffing available to do that through product responder. That's great, but alone it won't get us the reaction we saw from our customers earlier. The final piece is to frame your incidents in the right way. Your number one priority needs to be customer communication. Obviously that's almost always correlated with fixing the issue as fast as possible, but the key point is prioritizing communication throughout. There's gonna be some incidents, like security ones, where careful, considered communication is important and this won't apply, but for the most part you can probably push yourself a bit harder.

We care about this because silence is almost always worse than longer downtime. The worst incidents for a company's reputation are almost always the most badly communicated ones. When companies get dragged online for a bad incident, it's normally because they handled comms really badly and kept their customers in the dark, rather than because something went wrong. People understand that shit happens.
So reframing your priorities here will lessen the impact of your worst incidents. We do that by making communication our first priority in any incident. When an engineer turns up, they know that they have to work out who's affected, how much they're affected, whether they can reach out, and what reaching out looks like. These are a few examples of communicating really early, before we know anything. All of these messages got a really positive customer response. The point is that messaging to say "hey, we see something's happening and we're on it" is all you need to do. You shouldn't have to wait for all the information, and it's far better to do it early. You can also let them know you're dealing with this as an incident internally, 'cause that reassures people that you're taking it really seriously.

But speaking directly to customers can be really scary for engineers, and your customer relationships are really important, so you need to know that your team is confident and equipped to handle this. One thing that helps here is having a really strong culture of drafting messages. This can feel weird at first, but I send messages like this in incident channels all the time to get a really quick thumbs up before I send a message to a customer. And that's great as new engineers onboard, to get them involved in how these kinds of messages look.

It's also really important to link your errors to who they affect. This centers the affected customer in your incident. Errors shouldn't feel like a purely technical thing; it should be incredibly easy to go from a problem to direct communication with your customer in just a few clicks. We do this by tagging all of our errors in Sentry with who they're affecting (there's a small sketch of what that can look like below), but you also need a great way for engineers to speak to your customers directly, whether that's through Slack or any kind of support tooling. It doesn't matter, as long as the access is there.

Another thing we need to put communication first is always having a comms lead in our bad incidents. The incidents that go most badly wrong are often the ones where people are doing too much at once, and I'm super guilty of this. I'll think I have everything handled, and I'll be debugging a technical problem and think I'm so close, so I'll just wait a little longer to message the customer. I'll keep pulling that thread, 45 minutes will go past, and I've kept them in the dark. It's really easy to neglect communication when you're on that technical issue, and that's okay for small bug fixes, but for bigger incidents you wanna actively bring someone in for communications. Our rule is that any major incident needs to have a comms lead assigned. It could be an engineer who's comfortable, someone in CS, or anyone; the point is that there is somebody responsible for proactive communications at any time.

The final thing for communication is status pages. These can feel really scary to post to, particularly for engineers, because they're so public, but they exist for a reason and they're your best method to proactively communicate to your entire customer base at once. So use them when it counts. A good sign is that if you're struggling to keep up with too many threads of reactive communication, it's probably a good time to put it on your public status page.
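One aside on that error-to-customer link before finishing with status pages: here's a minimal sketch of what the tagging can look like using the Python Sentry SDK. The tag names, the placeholder DSN, and the idea of calling this once per authenticated request are assumptions for illustration; the point is just that every error arrives already attached to a customer you can go and message.

    # Minimal sketch: attach the affected customer to every error, so an incident
    # channel can go from a stack trace to "who do we message?" in a couple of clicks.
    # Tag names and the per-request hook are illustrative assumptions.
    import sentry_sdk

    sentry_sdk.init(dsn="https://examplePublicKey@o0.ingest.sentry.io/0")  # placeholder DSN


    def tag_request_with_customer(org_id: str, org_name: str, user_email: str) -> None:
        """Call once per request, after authentication, so any later error carries the customer."""
        sentry_sdk.set_tag("organisation_id", org_id)
        sentry_sdk.set_tag("organisation_name", org_name)
        sentry_sdk.set_user({"email": user_email})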
I find having templates in the right tone of voice, and examples from past incidents, incredibly helpful, because for engineers it's really difficult to craft those in the moment. Also, having a clear framework for the types of incidents you should post to your status page means you're not gonna be making stressful decisions in the middle of an incident.

Cool. So these are the things that we've talked about: declaring more incidents by lowering our bar and considering more minor issues as incidents; making incidents our priority, so always dropping everything for them and having the processes in place to do that; and then finally, always putting the customer first and reframing our priorities, rather than just focusing on the technical issue.

We, as a team, are so much more confident and comfortable with reactive work now, and our customers are much happier for it. It's not just a flip of a switch. It's an organizational mindset shift, and having the tooling in place to help you enforce it is really important. I'm hoping most people in the room have some sort of incident management solution. So that's it. The message I'll leave you with is: declare more incidents. Thank you.
