Leading up in a crisis

Adrian Carvalho, Senior Engineering Manager at Zuari Software, shares practical ways engineering managers can lead up during a crisis — aligning leadership with incident response, not working against it.

The transcript below has been generated using AI and may not fully match the audio.
Hey everyone. Everyone's got their own worst incident story. For me, it was a toss-up between the time a bank lost half of their payment messages due to a concurrency bug, and the time we silently stopped recording customer balance changes for almost a month. Both were resolved eventually with no real long-term impact, but at the time the stress was just suffocating. I had that tight feeling in the chest, tension in my shoulders, butterflies in my stomach, and it was impossible to release it. I had that feeling for weeks.

My name's Adrian, and around seven years ago I decided it wasn't stressful enough trying to keep a toddler and a pond full of fish alive, so I got involved in incident management for a company that was scaling to become an enterprise. Suddenly I was dealing less with engineering and more with the rest of the company, including executive leaders. That, plus two young kids, gave me the gray hair, bags under the eyes and receding hairline that you see today. I learned what long-term stress can do to the body. For me, it triggered the start of a genetic autoimmune condition that I continue to struggle with. But through that I built a vision of something I call stress-free incident management: creating a culture where teams stay calm under pressure and make better decisions, both during and after the incident.

Executive leaders can be one of those sources of stress. I've certainly had my share of incidents with an exec standing over my shoulder while I'm tapping away trying to resolve an incident. But when that energy is directed well, it can be a massive force for good. And while I'll focus mainly on executive leaders, everything I'm going to say applies to anyone outside of engineering, from sales and account management to product and support.

As I go through these behaviors, think about which ones you might have seen, or, if you're honest, which ones you might have done yourself. Steamrolling the incident commander, whether through micromanagement or just outright barking orders at people. Broadcasting partial or incorrect information: updating customers or other internal people before facts are verified, spreading confusion. Amplifying panic: "This is unacceptable. We're losing millions every minute." Interrupting engineers: "How do you know this is going to solve it? What's the risk? Could it make things worse?" They sound like useful questions, but in the middle of an incident they can take valuable time to answer. Jumping straight to the root cause, which is very common on channels. And silencing the room just by being present on a call or in a channel: when senior voices are present, our most shy, junior, introverted engineers might stop contributing, and will often create side channels that become difficult to manage in the middle of a messy incident.

Burnout can twist our perception. We stop seeing our colleagues and leaders as allies and start to assume the worst of their intentions, even when everyone's just trying to get the best outcome. These are all behaviors that are natural under pressure. As incident professionals, we can either react negatively or proactively set up structures to harness this energy into something useful. Where there's chaos, there's opportunity. People tell me I use that quote way too often; maybe I've just worked in a lot of chaotic places.
But in this case it's apt, and to seize the opportunity it helps to understand some of the reasoning, some of the psychology. Why do we see these behaviors? It all comes from good intentions.

A desire to be helpful. People want to chip in with the engineers so they don't feel alone, and they want to feel like they're contributing to solving the problem. A desire for clarity. They're under pressure to provide updates, and they need enough information to be able to provide them. Context switching, from the strategic direction and day-to-day decision making they normally operate in, to an operational crisis mode. We all know how big an impact context switching has on our abilities, and in this case it can be particularly painful.

Then we've got a mismatch between business and technical mental models. In a company, the boundaries between services, teams and the business units those teams work in might be fluid, and trying to map that onto a technical mental model while you're in the middle of an incident can be difficult. Then there's pressure from external stakeholders and acting as the voice of the customer. Execs are often the front line between the company and its biggest customers, and that can be an additional source of stress and pressure for them. There's accountability without control: execs are on the hook for outcomes, but they can rarely resolve the issue themselves, which can cause a lot of heartache and difficulty.

But the big one is different time horizons. During an incident, engineers tend to hyperfocus on solving the technical problem, as we should, and sometimes we lose sight of the wider business impact. Meanwhile, execs see that wider picture, but they don't know how to plug into the fix without making things worse.

We have some stats that try to capture what those wider impacts are, the true cost of downtime. Stock value often drops by around 2.5% after a single downtime event, and around 40% of disruptions lead to minor or major brand reputation damage. And then there's how other areas get involved in incidents: in an Atlassian report on the state of incident management, engineering was involved in 53% of incidents, C-suite execs in 43%, SRE in 37%, legal in 22%, and marketing in 20%. Incidents don't just cover engineering, they cover the entire company.

When executives are involved in incidents the right way, they can be a huge asset, not just a source of chaotic energy. Some of the ways they can be helpful: Sharing the load. As I mentioned, the biggest incidents spill beyond engineering, and engaged execs can help get the right engagement from other teams, whether that's legal, marketing, support or product. They can help things move faster where you've got slow processes that need to be bypassed: emergency change requests, vendor escalations, budget approvals. Sponsorship for resourcing or budget, so after the incident we can say: this incident cost us this much and wasted huge amounts of time, let's invest in these actions to fix it for the future. And then visibility of teams and services, building bridges between business and technical teams, building trust. These are all about raising the visibility of engineering in an organization and showing that we can be problem solvers and a positive source of energy for the company, rather than engineering just being seen as the cause of the incident, and turning that perception back to where we want it to be.
The other one is alignment on product priorities. Where we've got changes we want to make to our product that might impact different areas of the business, maybe we say to a business unit: we've got this buggy old report, we want to move you to this new one. Execs can help make that process a lot smoother.

I have a distaste for the term best practice, whether it's in tech or in process. It reeks of cargo cultism. Every company has a different structure, and every company has a different incident management process. So instead I'm going to give you a toolkit, and I'm going to ask you to consider how these tools apply to your company, with the lens of reducing stress and making your incident response better.

The starting point is to build a framework, one that lets everyone be a good citizen in your incident response process. The biggest improvements to most incident management processes come from improving communications. There's a famous TED Talk by Rory Sutherland: the most cost-effective improvement to passenger satisfaction on the London Underground didn't come from running more trains, it came from installing the dot matrix boards that tell passengers how long they have to wait for the next train. So you'll see that a lot of the points I'm going to talk about are about communicating.

To start off with, define what a good citizen is. What's helpful and what's not helpful in terms of their behavior? How do they get alerted about an incident? How do they get information during that incident in the best way? When and where is a postmortem held and stored? Create a templated comms playbook built around regular updates rather than play-by-play comms: "We're going to do a system restart. It will take 15 minutes until your next update." And automate where possible. A lot of what I'm going to be talking about is automatable, and certainly with what Steven was talking about around AI SRE, a lot of this can be dealt with better in the new world.

Consider setting up comms channels by intent for your longest-running, highest-severity incidents. It's a heavyweight thing to do for smaller incidents that just require engineering, but when you've got a sev one incident that's lasting a day or two, separate channels split by intent can be extremely useful in reducing the noise for engineers. And by reducing the noise, you're reducing the stress.

Blameless everywhere. We all know how important a blameless approach is, but outside of engineering it might not be a term that's as well known. Make sure that in all of your presentations and all of your documentation you highlight the blameless approach, and especially its benefits. Those stats can be drawn from far and wide outside of engineering; I've seen some good ones around surgeons showing how blameless approaches can improve patient outcomes. This one is really key for reducing stress: once everyone buys into it and truly lives it, that can be a major stress reduction when you're dealing with your incidents.

The next one is quite difficult to achieve well and requires a lot of investment: creating a single pane of glass for your current platform status, and getting that as accurate as possible. I personally like to use synthetic tests simulating customer traffic from outside of your data center, combined with internal metrics. The goal: reduce panic, give clarity, and make sure that people can contribute calmly.
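As a rough illustration of that idea, here is a minimal sketch of the kind of external synthetic check that could feed a status view like this. The endpoint, latency budget and status file are all invented for the example, not part of any particular tool; in a real setup the result would be pushed into whatever powers your single pane of glass, alongside internal metrics.

```python
"""Minimal synthetic check: probe a customer-facing endpoint from outside
the data center and record a simple up/degraded/down verdict.

CHECKOUT_URL and STATUS_FILE are placeholders: point them at whatever your
own platform and status store actually look like."""

import json
import time
import urllib.error
import urllib.request

CHECKOUT_URL = "https://example.com/checkout/health"  # hypothetical endpoint
STATUS_FILE = "platform_status.json"                  # stand-in for a real status store
LATENCY_BUDGET_SECONDS = 2.0


def run_check() -> dict:
    """Hit the endpoint once and classify the result."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(CHECKOUT_URL, timeout=10) as response:
            ok = response.status == 200
    except (urllib.error.URLError, TimeoutError):
        ok = False
    latency = time.monotonic() - started

    if not ok:
        verdict = "down"
    elif latency > LATENCY_BUDGET_SECONDS:
        verdict = "degraded"
    else:
        verdict = "up"

    return {
        "check": "checkout",
        "verdict": verdict,
        "latency_s": round(latency, 2),
        "checked_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }


if __name__ == "__main__":
    result = run_check()
    # In practice this would be published to the shared status view;
    # here it is just written to a local file for illustration.
    with open(STATUS_FILE, "w") as f:
        json.dump(result, f, indent=2)
    print(result)
```

The value of something like this is that execs and support can check the customer-facing picture themselves, instead of pinging the responders for it.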
So the next one is proving the process. This is where we're trying to embed the culture and build muscle memory. The starting point is onboarding: a very simple onboarding deck, two slides. One, what does the incident management process look like? Two, how do you be a good citizen during an incident, and how do you get the information you need to help with the incident process and to help customers? During that session, gather feedback to identify the gaps in your process. The one thing you will almost certainly be asked for is more information, better information, more accurate information, faster. Be very clear on the cost of that. It's all well and good that people want to know exactly what part of the platform has gone down, how many customers are affected and the list of customers affected, but there's a cost to getting that information. That cost is reducing with advances in incident tooling, but today it's still a cost.

Now, moving on to postmortems. Ensuring you have a strong moderator during postmortems is particularly key as you get more and more exec involvement in them. You need to be able to maintain the culture, to set it out at the beginning of those postmortems, and to be able to say to people: that's not how we do things, let's do things right. I also like to include a "what went well" at the start of any incident postmortem, putting a positive spin on it: what did we do well, what do we want to enhance, what do we want to keep doing in our incident response? The other part, which is particularly close to my heart, is including a stress impact section. Talk openly about how stressful the incident was. That is key to trying to reduce the stress: if we want to start improving on something, we need to start recording data on it and talking about it openly. And the last one is fairly obvious, and I'm sure you're all doing this already: have an exec summary at the top of your postmortems. Help your execs get information quickly, and help anyone dealing with customers who needs to give updates, with something they can look at straight away.

Then closing the loop. Through this process we collect lots of very useful data: on root causes, and on how we want to improve the incident process going forward. What we want to do is connect that data to outcomes. Having a dashboard for post-incident actions is a fairly simple one, but make sure it's connected to the customer impact. You can use the severity of the incident, but you may find that doesn't go deep enough; you may want to build some kind of index or metric that reflects what your exec team are actually interested in. That might not be customer revenue, it might be specific to a product, and you can combine lots of different metrics into a single score. One of the keys to this is that you calculate it after the incident is over. Trying to add more things to do during the incident, trying to stuff that process, ends up being quite painful and extends your time to resolve.
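To make that concrete, here is a minimal sketch of what such a post-incident impact score could look like. Every field name and weight here is invented for illustration; the only point being made is that the score blends whatever your exec team actually cares about, and that it is computed once the incident is resolved, not during it.

```python
"""Toy post-incident impact score: a weighted blend of metrics your exec
team cares about, computed after the incident is over. All names and
weights are illustrative, not a standard."""

from dataclasses import dataclass


@dataclass
class IncidentImpact:
    severity: int                     # 1 (worst) .. 4 (minor)
    customers_affected: int
    minutes_of_downtime: int
    affected_flagship_product: bool


# Hypothetical weights: tune these to reflect what your execs actually track.
WEIGHTS = {
    "severity": 25.0,
    "customers": 0.5,
    "downtime": 0.2,
    "flagship": 30.0,
}


def impact_score(incident: IncidentImpact) -> float:
    """Higher score = more impactful; used to rank post-incident actions."""
    score = 0.0
    score += WEIGHTS["severity"] * (5 - incident.severity)   # sev 1 counts most
    score += WEIGHTS["customers"] * incident.customers_affected
    score += WEIGHTS["downtime"] * incident.minutes_of_downtime
    if incident.affected_flagship_product:
        score += WEIGHTS["flagship"]
    return round(score, 1)


if __name__ == "__main__":
    example = IncidentImpact(
        severity=1,
        customers_affected=120,
        minutes_of_downtime=95,
        affected_flagship_product=True,
    )
    # This number would feed the post-incident actions dashboard.
    print(impact_score(example))
```

Ranking post-incident actions by a score like this is one way to keep the dashboard anchored to customer impact rather than raw incident counts.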
The cherry on top is having a regular cadence of meetings, maybe monthly or quarterly, to go through incidents, root causes and post-incident actions using the dashboards you've created. This is key to identifying trends, pushing for stability as part of the roadmap, and generally reducing impactful incidents overall. Look for patterns across teams and across services: maybe alerting isn't working as well as it should, maybe your rollback process needs some work. When you've got a larger engineering organization it becomes harder to connect those, but by looking at the overall incident actions you can start to work out where investment needs to go, and you can do that in line with your execs.

So, some pitfalls. I'm at an incident.io conference, so: tooling. Tooling should always be considered the first option, because a manual process is extremely costly. If you do require manual steps, consider the cost before you implement them. Some of the worst processes I've seen have involved Word documents of over 40 pages attached, and I've seen that at multiple companies. Then the exec liaison role. I've talked a lot about improving communications, and the natural instinct is to introduce an exec liaison role dedicated to liaising with your execs during an incident. This can be extremely costly, and there are options for using LLMs to automate some of it, so think hard before you go down that route. And the last pitfall, which is good advice for anything: plan for incidents that escalate from your lowest severity to a sev one. That's a scenario definitely worth including in your war games, which are another thing I highly recommend you implement.

So that's all I have for today. Incidents will never be perfect, and the panic is part of the fun. If we're lucky, we can turn them into something where we're working together, and maybe we're learning things.
