SEV me the trouble: Pre-incidents at Plaid

Derek Brown (Plaid) shares how his team built a pre-incident framework using incident.io that helps engineers de-risk major changes before they escalate into SEVs.

  • Derek Brown, Head of Security Engineering, Plaid
The transcript below has been generated using AI and may not fully match the audio.
Thank you so much, Tom. I cannot overstate how amazing it is to be here. I think we can all relate to the fact that reliability, incident management, and observability are super under-loved in the world of software engineering. To be in a room full of people who have been working in silos, solving the same problems, banging their heads against the same walls, and then, as the cherry on top, to get a Steve Jobs-esque demo of how the future of AI SRE is going to change what we do: I think it's just really amazing. It seems like we're at the start of something really awesome here.

As Tom mentioned, I'm Derek, the head of security engineering at Plaid. It's kind of a weird role from which to think about incident management, not the typical reliability thing, but it's actually very similar. Security events happen, we want to track them, and incident management is the logical place to do that. Beyond that, a lot of the work we do is very similar to the role of an SRE or a software engineer. We go into a system, we find a vulnerability, we think we can improve the threat model or the security landscape for that system, we dive in, and we want to make a change safely. So my team makes a lot of changes. They also happen to create a lot of incidents. This talk is about the way we've learned to manage that much better, hence pre-incidents at Plaid.

I want to start off with a question: how many of you think that we can predict when incidents happen? I guess it's maybe hard to see on the live stream, but there's not a single hand raised in this room. So I put the best and brightest of my team together. We have all these awesome AI chatbots, we obviously have quantum computers, so we put those two things together and simulated all of the possible deployment scenarios at the electron level to figure out what was going to go wrong and when. Then our FinOps team got involved and figured out that this would affect our operating margin by a hundred percent, so we went back to the drawing board and decided to create some heuristic models. And if you were paying attention to this morning's portion of the presentation, it's actually very easy to predict when incidents are going to occur, because we're the ones that cause them the majority of the time.

I just wanted to back this up with some data, so I gave all of our incident postmortems to an AI chatbot and asked it to bucket those incidents into a series of categories. For those of you who don't know Plaid, we provide a layer on top of the banking infrastructure in the United States, and increasingly in other markets, that makes it a lot easier for you to interact with that banking infrastructure. So, logically, the biggest source of our incidents is bank outages. There's not much we can do about that: we're usually one of the first to know that a bank is not operating properly, and we just need to make our customers aware of it. So let's focus on the other three columns here. The biggest one is synchronous causes of failure, and the way I taught the chatbot to analyze this is "someone pushes a button on a thing, and boom." This is by far the majority of the incidents we care about. Then you see two other columns here, and weirdly, these are the things we spend most of our time thinking about: we try to chase down bugs that are hard to find, and we think about when infrastructure is just going to fail spontaneously.
But these are much less easy to predict, and it's hard to justify worrying about them when we don't know exactly when they're going to happen. So to go back to that earlier question: I think we absolutely can predict when incidents are going to happen. We might be a little over-inclusive, we might label every single PR as an incident waiting to happen, but we can do a pretty good job of predicting when things are going to blow up.

So why haven't we done anything with this insight? The reality is that we're all optimists. We write a PR, we reason through the risk model of what could possibly blow up, and we say to ourselves, "nothing is going to happen." That creates a wall we can't get past: we think there's nothing that can go wrong, so why would we bother planning for when it does? If we just step back and treat these as two separate problems (risk analysis: what could go wrong and how do I prevent it; and response: if that happens, what do we do), we get to a better place, and we get a lot of knobs we can control once we make that assumption.

So here are four really easy things we can tune once we admit to ourselves that we are the cause of incidents. First, we can change the release method. I'm sure many of you are familiar with A/B testing, blue/green deployments, canaries, slow rolls, dynamic flags; there are lots of ways we can change our release posture to reduce the chance of incidents. Second, timing. This one is super easy. I'm sure all of you have spent a Friday night you did not want to spend investigating incidents. If I had it my way, we would just stop all PRs at 2:00 PM and then everyone could go and have a drink. But we don't think about this. We treat these as separate problems, where someone is hitting the release button on their code and we have a 24/7 on-call rotation ready to respond if and when things happen. Why aren't we just making it so that our responders are in the right room at the right time? Third, mitigation levers. This is a little harder to think about as distinct from release method, but oftentimes we have ways to mitigate incidents that we don't think about until it's too late, whether that's database backups or the ability to degrade performance, as I think folks talked about earlier this morning. And lastly, response readiness. We hit that release button, we start our release process, we wait five minutes, ten minutes, something blows up, and then we activate that 24/7 on-call response process and try to get the right people in the room. This makes no sense. Why don't we just start with the right people in the room and then deal with the incident when it arises?

Now, some of you in the audience are thinking: this is obvious, we already have this, it's called a pre-mortem. So I want to do another round of questions. Number one, who has heard of a pre-mortem before? Raise your hand. Okay, that's about 80% of people. Alright, keep your hand up, or raise it again, if your company has a template or process for creating a pre-mortem. Okay, that's four people. Now keep your hands raised if you've actually filled one out in the last 30 days. One, two... okay, three. So as you can see, there's a quick drop-off from people who've heard of the pre-mortem process to people actually implementing it. That's because pre-mortems don't work, and there are a few key reasons for this. The first is the problem I was describing earlier.
For those who don't know, a pre-mortem asks you to step into the shoes of someone writing a post-mortem and then magically work backwards: what are all the possible causes of failure, and what could we have done to prevent them? In our heads, as good engineers, we have a feedback loop: when I find a risk, I go back and try to fix it. And so every pre-mortem I have ever read looks something like "here are six risks that were relatively obvious, and here are the six things we did to mitigate them." Because we've constructed this wall of "we've thought of all the risks and fixed all of them," we never get to the next step in the process, which is: once something bad happens, what are we actually going to do about it?

So at Plaid we've introduced a new term. It started off as "preemptive incidents," and we've shortened it to "pre-incidents." This is a process that is separate from the pre-mortem. You have the pre-mortem process, where we think about all the risks that could possibly happen and how to mitigate them, and then you have the pre-incident, which is how we think about what we're actually going to do once something bad happens.

There are a few key attributes of this pre-incident process. The first is that it needs to be low friction and the default behavior. What do I mean by this? A lot of the reason pre-mortems don't get traction is that they add friction to your development lifecycle. If you're having a conversation with a CTO and you say, "I think we should do more pre-mortems," they're going to say: that's engineering hours, it costs money and time, it's just not useful. The other thing is that pre-mortems are something we invoke when we think something is high risk, and we are incredibly bad at detecting when something is high risk. I think this morning we saw examples of PRs where the change was a one-liner and it took down production. We're not very good at estimating risk, and if we were, we wouldn't have any incidents. So we want to build a process that is so lightweight that people just do it by default. It happens every single time, no questions asked. Our target at Plaid is for this to take no more than three minutes. The conversation about whether you should file a pre-incident or a pre-mortem takes longer than three minutes, so there's no reason to argue: just go file the pre-incident.

The second thing we want to do is nudge process rigor. What do I mean by this? We don't want to be a babysitter. We don't want to tell people, "here's how you should release the change, here's some giant checklist we need you to fill out." All we want is to trigger that response in your mind: here are the things I've done to de-risk this change. And as we've found, this actually tends to encourage people to de-risk their changes before they go out.

And lastly, we want to make sure that we're preparing investigators and responders to be effective once they get into the room. We've learned that the typical response flow is to open ten different tabs: figure out who you need to page, which dashboards you care about, what the upstream dependencies are. If we preload all that work, so that we know exactly who we need and what levers we need to pull in order to mitigate an incident, things move along a lot quicker.

This is actually the pre-incident questionnaire that we use at Plaid.
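The slide itself isn't reproduced in this transcript, but as a rough sketch, the fields described next could be captured in a structure like the one below. The field names are illustrative, not Plaid's actual form or schema.

```python
# Rough sketch of a pre-incident record, based on the fields described in the
# talk. Names are illustrative, not Plaid's actual schema.
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PreIncident:
    change_summary: str          # What change are you making?
    design_doc_url: str          # Link to the design document
    driving_team: str            # Which team is driving the change
    environments: list[str]      # e.g. ["testing"] or ["production", "sandbox"]
    impacted_systems: list[str]  # Systems you expect the change to touch
    rollout_plan: str            # Test, monitoring, and rollout plan
    planned_start: datetime      # Published to the change log ~24 hours ahead
```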
The slides will be up afterward if you want to copy and paste it, and you can also just ask your favorite foundation model to generate this for you. We ask some really basic questions. You can see here: what change are you making? Give us a link to your design document. What team is driving this change, so there's no question about who we need to talk to about it. What environments are you impacting? If it's testing, we don't really care; if it's production or our sandbox environments for customers, maybe that's something we need to look into a little more carefully when something goes awry. What systems do you expect to be impacted by this change? This helps us cross-communicate when, say, a security team is making a change to a system it doesn't own. And lastly, we ask for a test, monitoring, and rollout plan, so that people have explained their process and reasoning for how they've de-risked the change.

Once someone has filled this out, it goes into a piece of automation, and there are three steps we follow through this process. The first is plan: you fill out the pre-incident, it tells us when that particular change is going to be launched, and it goes into a change log where anyone can see when it's actually going to happen. We try to do this about 24 hours in advance, so that other teams can read the change log before the change goes out and flag any conflicts. Or, if you have incidents already going on, this is a great way to avoid rolling out a change that's going to make an existing incident worse.

The second is communicate. You go into our tooling, you say this pre-incident is now being released, and we blast all the relevant on-call teams to tell them the change is ongoing. Then they're on alert: if I see a disruption in my metrics, jump in and look at this change.

And lastly, monitor. We want to set the expectation clearly that while you're releasing a dangerous change, you're sitting there monitoring, making sure things aren't going wrong. Now, I hear you saying, "don't we have alerts for that?" We've seen a lot of cases where that just isn't true. So you need to be looking not only for "it didn't break anything," but also at success metrics: this actually made it out to production, it's actually working as intended. Then you can close out the pre-incident.

Our goal is to make this pre-incident process a complete knee-jerk reaction. As you've seen here, it's basically three lines in Slack. It's a very seamless process, but we get a lot of benefit from it. So how do we get there? We have to do a lot of training. We have to teach people that this pre-incident process is lightweight, easy to think about, and easy to work through. Once everyone follows it as a cultural practice, you start to get a lot of benefits as an organization. So here's an example of the training that we prepared. Yay, Loom.

Alright, now we get into the dollars-and-cents section. How has this actually impacted our incident process at Plaid? The unfortunate answer is that it's really complicated. There are a lot of independent variables in this equation, as you can imagine, because people are selecting up front whether they're going to follow the pre-incident process, the pre-mortem process, or both. So we have to reason very carefully about selection bias. The following is a set of disclaimers, and then we'll reveal our nice qualitative results.
Because people are self-selecting (should I follow the pre-incident process? should I follow the pre-mortem process? what should I do?), we have this great latent variable, which is people's perceived risk. When someone self-identifies a risk, it means they're going to be more cautious, and therefore they're less likely to actually cause an incident. This creates a problem with our data, because these pre-incidents are actually less likely to cause issues than someone just introducing a rogue bug into production. The second issue is that if someone self-identifies a change as risky, that change is probably also harder to de-risk, so incidents caused by pre-incidents tend to be a lot worse, because someone is doing something they know to be risky. Look, mom, my psych degree paid off. And lastly, our sample size is very small. We make a lot of changes at Plaid, but we still have a very small number of changes that get flagged through this process.

Let's make some wild speculations anyway. What we've seen so far is that this process actually reduces the risk of real incidents. We identified 224 cases where we wanted to follow this pre-incident process, and to set the tone, these are the riskiest, most dangerous changes you could make at the company: Kubernetes cluster migrations, moving from one TiDB database to another, migrating an entire platform in a day. And despite all of that, we only had three real incidents. I think this is a testament to the nudging behavior. If you self-identify that something is risky, you take a pause to make sure your tests and monitoring are really in place, because you're going to have to communicate to someone else later that you took that step, and that means you're less likely to cause an actual incident.

But let's dive further. n equals three, that's totally valid, right? In two thirds of these cases, the operator caught the issue according to the plan they specified when they created the pre-incident, rolled the change back, and did so in less time than our average MTTR across the company. So that pre-warming capability, thinking through the monitoring plan and the rollback strategy up front, actually helped us lower our MTTR, which is awesome. And I bet you're wondering about that last case too. That one was actually a multi-system incident: someone filed a pre-incident, explained their change, and closed it out, and three days later we found an unrelated issue in a different system and were able to correlate the two, because that data had been put into our pre-incident process and a chatbot was able to associate A and B. So the theme is: let's make sure that AI has the context it needs. Pre-incidents are helping us create all that context and put it in one place, so we can learn from all the changes we're making.

Lastly, we had one unexpected benefit, and this has really helped my team. Every time your teams are thinking about releasing an unsafe change, or something that's a little abnormal, they're developing a bespoke process for communicating it. I'm sure you've experienced this: who are actually the right people to talk to about this? What Slack channel do I put this in? Is it an email or a Slack message? Do I need a document? Do I need to go to design review? All these questions just add to the slog of releasing changes.
And so this pre-incident process has become the standard way within Plaid of releasing changes that affect systems your team doesn't own, and that's made it a lot easier for platform engineers to go about their daily work. This is, I think, really exciting. Hopefully I've convinced you that the pre-incident is the way of the future. That would be really awesome, because this is also a completely unsolicited feature request for the incident.io folks to just bake this in as a first-class feature.

Now that the selling is done, how do you actually go about integrating this into your workflows? The first thing: copy and paste this pre-incident questionnaire that I've already generated for you, or build one of your own, and make sure it has the organizational context you need. To give an example of the way we think about this, we want the amount of process to reflect the scope of the change. If you have a big change that needs a design document, we add a few more questions; if you have a smaller change, we don't add that much rigor. The idea is to tailor the process to our specific business needs.

The second is building a culture of filing pre-incidents when you're making changes. We've done this on both the proactive front and the retroactive front. On the proactive front, we created a bunch of training, pushed it out to managers, and made sure everyone has thought about using pre-incidents as a mechanism for evaluating whether their change is safe to release. But we've also been integrating this retrospectively. If you come to an incident review, one of the first questions you're going to be asked is: did you file a pre-incident for the change you introduced? And there's going to be a long conversation if you didn't. That's created a culture where people are choosing to file pre-incidents, because it's almost like an insurance policy: it helps make sure you actually have the right incident response plan before you launch your change.

And lastly, build tooling and process for monitoring pre-incidents when you're actually investigating alerts and issues. One of the things we've done is take all this pre-incident data and load it into our observability interfaces, so you can see on your time series when these changes are launched. Then I can automate that correlation process: when I see the giant vertical lines indicating a change happened, and immediately afterward our availability drops off, I know who to blame for what just caused that availability drop.

So that's all you've got to do: follow those three steps and watch as your reliability improves. Thanks so much.
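As a rough illustration of that last step (correlating availability drops with recently launched changes), here is a minimal sketch under the assumption that each pre-incident records a rollout window. The data shapes and names are illustrative, not Plaid's actual tooling.

```python
# Minimal sketch: given a timestamped availability drop, find pre-incidents
# whose rollout window overlaps it. Shapes and names are illustrative only.
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PreIncidentWindow:
    incident_id: str
    driving_team: str
    started_at: datetime
    closed_at: datetime


def overlapping_changes(drop_time: datetime,
                        windows: list[PreIncidentWindow],
                        margin: timedelta = timedelta(minutes=30)) -> list[PreIncidentWindow]:
    """Return pre-incidents whose rollout window (plus a margin) covers the drop."""
    return [w for w in windows
            if w.started_at - margin <= drop_time <= w.closed_at + margin]


# Example: an availability drop at 14:20 lines up with a change rolled out at 14:00.
windows = [PreIncidentWindow("PRE-123", "platform-infra",
                             datetime(2025, 3, 4, 14, 0),
                             datetime(2025, 3, 4, 15, 30))]
for w in overlapping_changes(datetime(2025, 3, 4, 14, 20), windows):
    print(f"Availability drop overlaps {w.incident_id} (owned by {w.driving_team})")
```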
