From error to insight: Human factors in incidents

Molly Struve (Netflix) goes beyond the purely technical post-incident analysis, diving into the human element of system failure.

  • Molly Struve, Staff Site Reliability Engineer, Netflix
The transcript below has been generated using AI and may not fully match the audio.
I am really excited to talk to you all today about a topic that is near and dear to my heart, and that is human factors in incidents. During this talk, we're gonna go beyond the purely technical post-incident analysis and dive into the human element of system failure. However, before we dive in, let's review what human factors means to ensure we're all on the same page.

Human factors refers to the study and understanding of how people interact with the systems, tools, and environments around them. The study of human factors is extremely well developed in industries where life and death are the consequences of not getting things right. Some of these industries include aviation, medicine, construction, and automotive, to name a few.

Thanks to the focus on human factors in these industries, safety has been dramatically transformed and improved. One of these industries that I want to talk a bit more about is aviation. My background is in aerospace engineering, and that is where my passion for human factors started. In college, I took a human factors and aviation class, and during that class we studied numerous aviation accidents and focused on the human-machine interactions.

It was eye-opening seeing the improvements that came from deeply understanding the human factors in those accidents. Roughly 70 to 80% of aviation accidents are caused by pilot error, but we don't lean away. We lean in when pilot error is the cause, and because we lean in and deeply study that airplane-and-pilot interaction, cockpit design has come leaps and bounds in terms of safety, ensuring pilots can do their jobs effectively.

So now let's think about software. What is our engineering cockpit, if you will? It probably includes things like Slack, deployment tooling, and observability tooling. Similar to an airplane cockpit, these systems give us the signals and information that we need in order to do our jobs. How we deliver those signals to engineers can mean the difference between smooth day-to-day operation and a full-blown incident.

So keep that in the back of your mind, because while the stakes may be different, we can stand to take a page out of these industries' playbooks and apply human factors more rigorously when it comes to software. We have a lot to gain if we bring the same focus on human factors to software incidents that those safety-critical industries do.

However, focusing on human factors in software can be challenging, and I find that it takes a bit more effort and intention. So before we dive into all of the benefits, I want to touch on some of the challenges that I have seen make it harder to reach for human factors when reviewing software incidents.

First is that many of our systems are highly automated. For the most part, once we build a software system and deploy it, it's gonna run on its own, right? It's not like an airplane where you need a pilot to fly it. For this reason, it's easy to forget that humans are part of these complex systems that we build and run every single day. Additionally, as engineers, we enjoy focusing on automation, right? We want to remove humans from the system as much as possible. But we can never eliminate them completely, and it's because of this fact that we need to also focus on how we get our systems and humans to work together in harmony.

Another challenge I have seen is an overemphasis on blameless culture.
This focus can cause us to unintentionally avoid the human element when it comes to incidents. When a human makes a mistake during an incident, often people want to avoid that aspect because it could lead to blame. This is a mistake. By not directly looking at the human factors, we miss a wealth of opportunity to improve our systems and how humans interact with them.

And finally, the human error stigma. People shy away from that phrase, human error, to avoid blame. We all know making mistakes is part of being human, right? Operators have dual roles as producers and defenders against failure. Engineers are both building systems and protecting them. Human error is gonna happen. It's not a question of if, but when, and when it does happen, you have to seek to understand and learn from it.

Now, if you're still stuck on that phrase, human error, maybe, as the kids like to say, it gives you the ick. Swap it out for something else: human factors, human element, human angle. Choose a framing that is comfortable and is going to encourage people to lean in, because you need to be looking at the human factors of incidents.

I wanted to outline these challenges so that you might have a better understanding of what within your own company culture or team might be preventing you from taking that leap and looking more deeply at the human factors of your incidents. So with these laid out, I wanna shift gears and talk about the good stuff: the value focusing on human factors can provide.

What can we gain when we focus on human factors when reviewing our incidents? I don't know about you, but when I hear that humans contributed to an incident, I get excited, because I know if one person can make that mistake, then likely many others can make the same mistake as well. And that is one of the major benefits of focusing on human factors: it results in high-ROI fixes.

If you can learn from a single human mistake and prevent others from making that same mistake, you're gonna prevent many future incidents. A lot of times when you're looking at the technical aspects of an incident, you're gonna be focusing on a single service, and thus your fixes are only gonna benefit that single service or team. When you're looking at the human factors, you're often looking at processes or widely used tooling, and when you fix friction points or deficits in those things, you're gonna be driving impact across multiple engineering teams and even an entire engineering organization.

All of these higher-ROI fixes are going to lead to a safer working environment. I'm sure many of you have heard someone at some point say, "Oh no, I can't touch that code," or "Oh no, I don't want to touch that tool because that might break." When engineers are working around systems, they're slowed down and change becomes riskier. By improving that working environment, you can instill confidence in your engineers, and confident engineers are going to work more effectively.

A safe environment empowers engineers. Engineers that are more confident operating their systems are able to work more efficiently, move faster, and be more effective. So doubling down on understanding human factors doesn't just prevent incidents. It can also increase the productivity of your engineers.

All of these benefits (high-ROI fixes, safer working environments, and more effective engineers) make a pretty strong case for focusing on human factors. So with these, hopefully I've planted that seed that human factors are important when reviewing incidents. Okay, so what's next?
Next, we're gonna dive into how you can thoughtfully incorporate human factors into your post-incident analysis so that you can capitalize on these incredible opportunities to improve. So let's go over some strategies you can use to do this.

First up, add reviewing human factors to your post-incident review process. Add a note to review them to your write-up templates or your PIR meeting agenda templates. If you already have existing PIR templates, you might have sections in them such as "what went well" or "what was challenging." These sections are perfect for outlining the human factors of an incident. They also show us that human factors can go both ways: they can highlight an area for improvement, or they can highlight a best practice that we wanna further promote. No matter what approach you take, codify human factors into your PIR process.

Next, dig into the human why. I'm sure many of you are familiar with the Five Whys technique, which is a great way to get to the bottom of a technical failure. Consider doing something similar, but with the human aspect. One approach that you can use to do this is one-on-one conversations or interviews with incident responders. Have them tell you their story of how the incident unfolded. During these conversations, your best friend is gonna be the phrase "tell me more," to get the incident responder to go deeper on the experience that they had.

Now, when you're having these conversations, be wary of counterfactuals. A counterfactual is an oversimplified statement such as "if only X had been different, then Y." One of my favorite examples of this is "if my alarm clock had gone off, I would not have been late to work." We all know that a lot of other things happen in between your alarm clock going off and you getting to work, and therefore oversimplifying that statement is not productive or valuable. So you're gonna wanna watch out for those when you're digging into that human why.

And finally, start by modeling this yourself. When you make a mistake, don't rush by it. You pushed the button or pulled the lever that took down prod. We've all been there. I once did it twice in under an hour, in case we're keeping score. Acknowledging the mistake is great, but don't stop there. Pause and seize that opportunity to improve your systems so someone else doesn't make that same mistake. Leverage these strategies to help you look at your incidents with a whole new lens and reap even greater improvements.

Now, before we wrap up, I wanna go through a couple of examples that showcase human factors in incidents. These examples are gonna highlight the difference between looking at those technical aspects versus looking at the human factors, and what there is to gain from this whole new angle.

First scenario: a feature bug broke production. On the technical side, how do we prevent bugs from hitting production? We write more tests. Maybe we need more unit tests. Maybe we need some end-to-end tests. Maybe we want to start doing test coverage checks in our CI run. These are great technical insights to gain from an incident.
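As an aside, here's one way that last idea could look in practice: a minimal sketch of a CI coverage gate, assuming a Python project tested with pytest and measured with coverage.py. The script name and the 80% floor are illustrative choices, not anything prescribed by the talk.

```python
# ci_coverage_gate.py -- hypothetical CI step: fail the build when test
# coverage drops below a chosen floor. Assumes pytest and coverage.py are
# installed; the 80% floor is an arbitrary example value.
import subprocess
import sys

COVERAGE_FLOOR = 80  # minimum acceptable line-coverage percentage


def main() -> int:
    # Run the test suite under coverage measurement.
    tests = subprocess.run(["coverage", "run", "-m", "pytest"])
    if tests.returncode != 0:
        return tests.returncode  # failing tests should fail the build on their own

    # `coverage report --fail-under=N` exits non-zero when total coverage
    # is below N, which is what makes the CI job fail.
    report = subprocess.run(["coverage", "report", f"--fail-under={COVERAGE_FLOOR}"])
    return report.returncode


if __name__ == "__main__":
    sys.exit(main())
```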
Now, what can we gain when we look at the human factors angle? A great question to start with here is: why were the tests skipped, or why were there not enough tests? In my experience, as engineers we all wanna do a good job, right? We wanna ship code that works. So digging into this question could help you reveal more systemic issues. Was there excessive production pressure? Do we maybe have an education gap in terms of writing tests? Maybe we have a process gap somewhere. If one of these larger issues exists, tackling it is gonna go a long way to preventing more incidents, in addition to the technical fixes.

Okay, second scenario. This is one of my favorite ones: engineers tried to stop a bad deploy and made things worse. Anyone else been in this situation? For example, maybe you're rolling out a deploy regionally. When the first region starts breaking, you immediately try to stop it. Except instead of stopping it, you accidentally delete all of the production instances. This is hypothetical, of course, okay?

On the technical side, how do we prevent that bad deploy from going out in the first place? Maybe we do canaries. We want to add some of those to our deployment process. Again, going to our favorite word, automation: how could we automate this? Let's remove the human. Could we detect those error signals and then just have the system roll back automatically? Again, great insights to be gained from the technical angle.
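To make that automation idea a little more concrete, here is a minimal sketch, assuming a deploy tool with a rollback command and a metrics backend you can query for an error rate. The `deployctl` command, the thresholds, and the metrics hook are all hypothetical stand-ins for whatever your own tooling exposes.

```python
# auto_rollback.py -- hypothetical sketch of "detect the error signal and
# roll back automatically." Thresholds, commands, and the metrics hook are
# invented for illustration; swap in your real deploy and observability tools.
import subprocess
import time

ERROR_RATE_THRESHOLD = 0.05  # roll back if over 5% of requests error (example value)
CHECK_INTERVAL_SECONDS = 30  # how often to sample the signal
BAKE_TIME_SECONDS = 600      # how long to watch a new deploy before trusting it


def current_error_rate() -> float:
    """Placeholder: in practice, query your observability stack here."""
    raise NotImplementedError("wire this up to your metrics backend")


def roll_back(deploy_id: str) -> None:
    # Placeholder rollback action; `deployctl` is a made-up CLI.
    subprocess.run(["deployctl", "rollback", deploy_id], check=True)


def watch_deploy(deploy_id: str) -> None:
    """Poll the error signal during the bake period and roll back on a breach."""
    deadline = time.monotonic() + BAKE_TIME_SECONDS
    while time.monotonic() < deadline:
        if current_error_rate() > ERROR_RATE_THRESHOLD:
            roll_back(deploy_id)  # the system, not a human, pulls the lever
            return
        time.sleep(CHECK_INTERVAL_SECONDS)
    # Bake period passed without a breach, so the deploy stands.
```

Even this automation is part of the human system, though: someone picks the thresholds, and someone gets paged when it fires, which is exactly why the questions below still matter.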

Now, what can we gain from that human factors angle? In this scenario, we're gonna wanna dig into that tooling and find out what we can learn there. Some questions that you can use to help you do this: What workflow were engineers using? There are probably multiple options for pausing or rolling back. What did engineers expect the tooling to do versus what it actually did? What inputs factored into the decision to choose the course of action they did? Then, more importantly, going back to the idea of safety: how do engineers feel when they're using the tooling? Are they confident or are they hesitant?

These questions are gonna help you find opportunities to improve tooling, which is gonna benefit all of your engineers. Again, by digging into both the technical and the human aspects, you're gonna come away with a lot more improvements that will benefit a lot more engineers.

So hopefully these scenarios were helpful to showcase how you can approach incidents from multiple angles and what there is to gain. I firmly believe that human factors are your secret weapon. Every human decision reveals insight into your system. When humans struggle, your system is speaking, and if you're listening, there are massive gains to be had.

So to get the most out of your incidents, add reviewing human factors to your PIR process. Get curious beyond the technical and dig into that human why. And finally, set the example by starting with yourself. When you make a mistake, pause and seize that opportunity to improve your systems so someone else doesn't make that same mistake.

Let's go back to our earlier thought exercise of our software engineering cockpit. Our cockpit is made up of the tools and processes that we use every day to deploy and operate our systems. By focusing on human factors and understanding how engineers interact with these processes and tools, you will find improvements with massive ROI that will create a safer and more productive operating environment for all of your engineers. So take control of your software engineering cockpit and lean in to reviewing human factors for all of your incidents. Thank you, everyone, for your attention, and happy to talk more.