Maintaining reliability amid layoffs, AI acceleration and acquisitions

Liam Whelan, Director of Reliability Engineering at Zendesk, shares how to guide your team through constant change without burning out or breaking your systems.

  • Liam Whelan, Director of Reliability Engineering, Zendesk
The transcript below has been generated using AI and may not fully match the audio.
Thank you. I said to Claire down there, which button do you click on the clicker to move the slides, and she said the green one. I'm colorblind, so let's see how it goes.

So my name is Liam. I'm 45, and it's been three days since I last threw my laptop out the window. I've been in various operations and SRE roles for about 20 years. When I left school I actually trained as a vet, and funnily enough we're starting to build a bit of a homestead at home in Ireland. So I spend a lot of my personal time and professional time talking about AI these days. Anyway, I'll go into details afterwards. Move on, Liam.

Okay. Zendesk: a leading SaaS provider of AI-driven customer service and support software. I had to say that, so for whoever is watching from the company, I said it. We have about 1,500 people working in all aspects of product, infrastructure, security and engineering, and the total company size is about 6,000 people. So ChatGPT would say we're a large SaaS organization.

Apologies for the long-winded title. It was going to be "What a fucking year", but I'll talk you through a little bit of that, and I had to negotiate with legal about some of the things I'm going to say. This is a combination of a couple of presentations that I've given internally and externally, so I hope it flows.

Today I'm going to try and share some of my learnings. I haven't been a hands-on engineer for quite a while; I lead engineering teams and engineering organizations. So the things I'm going to talk about today are a little bit different from the really deep things that people have spoken about or will speak about, but you'll see some of the key themes from what Claire and Brian and others have covered, because we experience them the same way. In my world, the challenges are related to scale and to culture, a clash of cultures. Hands up, anybody here who has been through an acquisition or has worked in a company that has done acquisitions? Okay, so you're going to be familiar with some of the things I'm going to talk about here. It's a big challenge, and layering AI onto it gives us greater challenges again.

So let me talk a little bit about these to start us off. The top two, smaller teams and layoffs, are related, unfortunately, both in terms of my own teams and in terms of the engineering organization in which I work. How did these impact reliability and incident response? At a high level, first it was figuring out the impact on on-call rotas. At various points over the course of the last year or so, layoffs happened; the next day we came in, we had incidents, and people just weren't there. It was very tricky for my teams who lead on incident response and SRE, and we also had some near misses. Secondly, when teams reprioritized post-RIF (reduction in force, apologies for the acronyms), lots of teams got smaller, understandably, and reliability and observability items tended to be backlogged in favor of developing features.

Then there's AI and acquisitions. AI features really scaled fast this year. Really fast. You all work in the industry.
You know the pressures that come with this: feature teams trying to deliver features, sales execs pushing hard. This created lots of challenges and pressure on our incident management and SRE teams. We were having new types of incidents, and the challenge is that they were really accentuated by the market pressures coming to bear. Customers want new features. I'm a customer of incident.io, and I am and will be pushing them hard, in the same way my customers at Zendesk are pushing us hard for AI features. The pressure to compete loosened levers, for good or bad.

But we also had challenges with the integration of new acquisitions, or in some cases not-so-new acquisitions, and I'll talk about some of those challenges in a second. To put things into perspective, Zendesk has acquired 13 companies in total over the last 10 years or so, and four or five of those happened in the last two years. That's fairly aggressive. Not at the level of Salesforce or other larger organizations, but for a company our size it's pretty aggressive to be acquiring companies at that rate.

In addition to this, there's the velocity of feature releases. I wish I had a chart; there were some really good charts earlier on, and I really wish I had added some to show the velocity of feature releases we've seen over the last six months and how it has grown through the acquisitions and through AI features being released. That really caused challenges for us. A lot of this is crappy. I'm Irish, so I use a lot of coarse words, apologies. But it forced us, and is forcing me and the team, to transform our thinking and sharpen our pencils.

So let's dig into it a little bit. Hope that's the right button. It is. Okay, let's look a bit deeper from an incident management perspective: why incidents exposed the cracks. Institutional knowledge vanished almost overnight. I spoke about the fact that incident managers couldn't contact owners. We relied on being able to contact the right people to resolve an issue as quickly as possible, and delays to this impacted our customers' experience. I've spoken about this internally with some of the teams; it's a hard truth, but it did. In some cases there were key services with new owners who had never opened the repo in GitHub. It led to some interesting retros.

We didn't have guardrails in place to stop these things happening. This is similar to what Claire was talking about, and I think Mary spoke earlier about the importance of guardrails; I'll re-emphasize that as we go. In the past my team and I had talked a good game about implementing tooling to ensure that operational excellence, production reliability and production readiness checks and so on were in place. But for lots of really poor reasons, we never did, and it bit us in the backside.

Similarly, with the market pressures on AI, you'll be surprised to hear that corners were cut. Shock, horror. This led to customers reporting issues before we knew about them; sometimes we didn't even know what the feature was. Has anybody been in that situation, where you have a customer issue and you don't know what the feature is? Yeah. It's not pretty.
Also small things, like releasing to market without proper on-call procedures. It's frustrating to think about; even now I can feel my blood pressure starting to go. Historically we relied on people power to help products launch safely and acquisitions onboard correctly. I'm sure many of you have superheroes who are the first people on the call for an incident and who hold a huge amount of tribal knowledge. Some of that disappeared for us. Now we simply don't have enough SREs to ensure the correct onboarding of acquisitions and help with production readiness. This led to some tricky conversations, but it's forcing me and our executive leadership to think differently.

Okay, let's have a look at some data on acquisition incident response. I picked this out specifically because it does tell a story, and I'll fill in a few gaps in the story as we go. These couple of slides tell a historical story of our escaped-defect incidents. We term them escaped defects: essentially our bugs making their way into production. It shouldn't happen, but it does. Specifically in these slides we're going to look at acquisitions versus existing products. These are incident types that are directly within our control, and they are always going to happen. My job is to make sure we minimize the time it takes us to resolve those issues and catch as many of them as possible before they make it into production.

We typically measure reliability using a few metrics. This will probably annoy a few teams back in Zendesk, but I'm going to use mean and median time to detect (TTD) for this. People hate me talking about median and mean. For acquisitions, over Q2 and Q3 this year, we observed that two thirds of the total incident time was spent on detection. Two thirds of the total incident time. So on average, and at the median, acquisitions had poor detection; the vast majority of incident time is spent on detection. Holy crap. When we do get the engineering teams engaged, they can fix it fairly quickly, which is good. Fixable, in the main.

That's not to say life is rosy when you look at our existing products either. The challenge for us has been that as Zendesk has grown, we have monoliths. I was talking to Brian about monoliths yesterday. We've got monoliths, but with the complexity of adding acquisitions in, we now have lots of configuration problems, integration problems and async problems, and sometimes they lead to quite long incident times. I'm not happy about these means for existing products, or our medians. I want them down to 20 minutes, 30 minutes at the most. But our acquisitions, Jesus Christ.

This data resulted in some major deep dives for us, both with these teams and on a new acquisition onboarding strategy moving forward. These are the types of incidents that we now target hard, and my job is to target hard, focus, sharpen pencils, and make sure people are focusing on resolving this for our customers.
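To make the detection numbers above concrete, here is a minimal sketch of how mean and median TTD and the "share of incident time spent in detection" figure could be computed. The field names and durations are made up for illustration; this is not Zendesk's data or pipeline.

```python
# Hedged sketch: compute mean/median time to detect (TTD) and the share of
# total incident time spent in detection, over hypothetical incident records.
from statistics import mean, median

# Hypothetical incidents: minutes from impact start to detection, and from
# detection to resolution. Numbers are illustrative only.
incidents = [
    {"ttd_min": 95, "fix_after_detect_min": 20},
    {"ttd_min": 140, "fix_after_detect_min": 35},
    {"ttd_min": 60, "fix_after_detect_min": 25},
]

ttds = [i["ttd_min"] for i in incidents]
totals = [i["ttd_min"] + i["fix_after_detect_min"] for i in incidents]

print(f"mean TTD: {mean(ttds):.0f} min, median TTD: {median(ttds):.0f} min")
print(f"detection share of incident time: {sum(ttds) / sum(totals):.0%}")
```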
So what did our conversations turn up? These are generalizations. A clash of cultures: some acquisitions launched without proper on-call coverage. For some, there just wasn't a culture of on-call. In some cases, and I think John maybe referred to this earlier on, there were contractual issues that needed to be resolved: we acquired companies in countries where, contractually, people just couldn't be on call.

Frequently, acquisition engineering teams are incentivized through product launches and promises to customers, and good old-fashioned reliability work takes a back seat. Is that familiar to folks? They're not incentivized to focus on reliability. Other issues stemmed from market pressures to turn some AI products from EA into GA, so from early access or early adopters into general availability, without time to invest in operational excellence or scalability. In other words, products got launched as EA, everyone said "Jesus, this is going really well", we turned them to GA, but they hadn't done the basics from a monitoring or scalability perspective. Familiar to anybody here? Yeah.

One of the biggest challenges we faced was that the pathway for getting acquisition services integrated with our reliability metrics and data pipelines had gone stale with the RIFs. The people on my team responsible for making sure we could build and acquire those metrics, so that I could report the key reliability metrics to their engineering leadership and to our executive leadership, just weren't in place. It disappeared, and that was my fault. We didn't have good data with which to try and influence change. It's so important, whether you're a small company or a big company, that you have the right data with which to influence change. So important, and I missed it. We had to rebuild that and put new processes in place, and it's still a work in progress. Other issues were down to poor decision-making and the complexity of acquisition integration, of which, again, I've been somewhat guilty.

For this one I actually did have a picture of an emo. My daughter's a preteen and all I get these days is emo at home. But I do a lot of self-reflection these days, and one of the biggest learnings for me personally was, as Mary and Claire pointed out earlier about guardrails, that we had the opportunity in the past to commit to platforms: building platforms, building guardrails, building platform teams to develop the guardrails that could have prevented some of these issues. I didn't fully understand at the time how important that was going to be, and yeah, it frustrates me. What we're seeing, and what we are going to see in the future with AI, is increased change velocity. So there is no choice for us now but to build those platforms and build those guardrails.

Okay, let's talk a little bit about our experiences on the AI mega rave. Expectations from customers on AI feature velocity are just beyond anything I've ever seen before. It's ferocious. The appetite for new things, the appetite from our sales and marketing people for us to be seen to be delivering new things, puts pressure on everything. This presents so many new challenges for the guardrails that exist, and so many of my guardrails that existed before are based on people. It's really focusing our minds now on how we do this: how do we build these guardrails for these acquisitions, for these new types of features, and so on? It's tricky.
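As a rough illustration of what replacing people-based guardrails with an automated one could look like, here is a toy production-readiness gate. The check names and the ReadinessReport structure are assumptions made up for this sketch, not Zendesk's tooling or process.

```python
# Hedged sketch of a launch gate: block a service (or acquisition) from going
# GA until a few basic readiness checks pass. Check names are illustrative.
from dataclasses import dataclass

@dataclass
class ReadinessReport:
    service: str
    has_oncall_rota: bool      # someone is actually paged for this service
    has_runbook: bool          # responders can find remediation steps
    emits_core_metrics: bool   # wired into the reliability data pipeline
    has_alerting: bool         # detection exists before customers report issues

    def failures(self) -> list:
        checks = {
            "on-call rota": self.has_oncall_rota,
            "runbook": self.has_runbook,
            "reliability metrics pipeline": self.emits_core_metrics,
            "alerting": self.has_alerting,
        }
        return [name for name, ok in checks.items() if not ok]

report = ReadinessReport("acquired-ai-agent", True, False, False, True)
missing = report.failures()
if missing:
    raise SystemExit(f"{report.service}: launch blocked, missing: {', '.join(missing)}")
print(f"{report.service}: ready to launch")
```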
How we measure customer success has also changed with the rollout of outcome-based pricing. Are people here familiar with outcome-based pricing? Okay, let me spend ten seconds on it. Brian touched on it earlier, actually, with Intercom. Intercom are one of our big rivals, so I probably shouldn't be talking to Brian. A lot more of the market, particularly for AI agents in the customer service field, is now switching to outcome-based pricing. It means that if I'm a customer of Zendesk, I want Zendesk to resolve as many issues for me as possible using the AI agents rather than by somebody sitting in a contact center, and Zendesk should only charge me for the issues that have been resolved through their AI agents. If it's gone to a person sitting in a call center, then our AI agents haven't done their job.

So it places a huge emphasis on us understanding the quality and the effectiveness of our AI features. We don't have that, or we didn't have that, at the start of the year. The market is now pushing us towards improving QA and figuring out what metrics we can build and collate from our different AI features to support these new billing models. The volume of data has also substantially increased. One of my teams is the observability team, and we're now shipping petabytes of data, on a much more frequent basis, to Datadog, Grafana, et cetera. It's just bonkers the amount of data coming out of the systems. Okay, let's keep going.

Alright, opportunities. I'm a firm believer that AI can help me with alert deduplication. From my teams we've seen it proposing remediations from prior incidents, incident timelines, auto-summaries; everything looks fantastic. It could potentially help me and my teams improve our ability to identify and surface reliability trends. I don't want static operational excellence dashboards that we pull engineering leaders towards; I want to push to them, showing them trends, insights, et cetera, and I want AI to help me do that. But I need it to do more than that, and I'll be pushing incident.io and some of our other vendors to give me more. I want it to help me risk-profile, have an opinion on deploy gating, and improve anomaly detection at the data scale we are at now. I need it to help preserve customer experience as we rapidly grow. My job is to preserve the customer experience for my customers in Zendesk, and I need AI. I need AI SRE. I need so much more than what we currently have, and I'm really excited for it.

So, my goals for AI when I look ahead to the next 12 months: reduce TTR; help me reduce mean time between failures (you're familiar with MTBF, measuring the mean time between failures for different services); and reduce KTLO errors. I am not going to get more headcount for my teams, so AI SRE and other AI products are going to have to help me build the guardrails, guard those guardrails, and help me and my teams succeed at the scale we're rapidly heading towards. My stance on AI at the moment is that it's an assistant, not an incident commander nor a primary responder. I need humans to stay in the loop for now, but I will be pushing hard for iterative progress. I need iterative progress for the scale at which my company is growing, the volume of new customers we're adding, and the volume of features we're pushing out. We're not going to be able to guard the customer experience without it.
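Since MTBF comes up as a goal above, here is a minimal sketch of one common way it can be computed for a single service: the average gap between consecutive failure start times. The timestamps are hypothetical and this is not how Zendesk measures it.

```python
# Hedged sketch: mean time between failures (MTBF) for one service, taken as
# the average gap between consecutive incident start timestamps.
from datetime import datetime

# Hypothetical incident start times for a single service.
failures = [
    datetime(2025, 9, 1, 4, 30),
    datetime(2025, 9, 9, 13, 0),
    datetime(2025, 9, 20, 22, 15),
]

gaps_hours = [(b - a).total_seconds() / 3600 for a, b in zip(failures, failures[1:])]
mtbf_hours = sum(gaps_hours) / len(gaps_hours)
print(f"MTBF: {mtbf_hours:.1f} hours between failures")
```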
Okay, I've said a lot, so let me try and wrap up with some thoughts. I know people are getting thirsty.

Reliability improvements are nearly always forged under stress, and people have given various anecdotes about their careers to date. I think my most embarrassing mistake from an operations or SRE perspective was this: how many people here remember trying to send an SMS on New Year's Eve and failing miserably? I'm showing my age, I'm 45. I worked for a large telco in Ireland, and I left debug running on an SMS server during New Year's one year. It's not something I'll ever forget.

We've got to continually adapt our practices. At the start of this year I thought we were rosy in the garden: products, reliability, what customers were talking about. I couldn't see it coming. So we have to constantly adapt our practices, codify knowledge, and invest in the right tools at the right time. If you're not doing it, do it now, start next week or whenever, but you have to invest in the right tools and in protecting your people. Claire spoke a lot about protecting our people and making sure they feel safe. Unbelievably important these days.

Org debt is much harder to solve than tech debt. What that means for me is that with the changes I needed to put in place to invest in those platforms and provide the guardrails, I made such a meal of it. I should have made those hard decisions sooner, because the tech debt would have been so much easier to solve if I had planned the organizational changes faster and better. So for me, as a technology leader, org debt is much harder to solve than tech debt. We're getting there, but the pivots should have happened earlier.

We're investing in new ways to continually train and test our incident response. We haven't got a huge plethora of incident managers; I have a dedicated incident management team, and people are already worked to the bone. So I'm pushing incident.io hard for AI SRE. I want additional capabilities so that I can start backing some of those people off and invest in training some of the teams from acquisitions, et cetera. I'm placing a lot of eggs in that basket.

Long overdue, but we are looking at formalizing reliability platform teams to deepen investment in reliability tooling. I've danced around the edge of the dance floor on this one for too long, but now we're committed to the dance-off. We have to do it. We've got to make it easier for engineering teams to succeed when they are not incentivized to do reliability work. The platform teams developing those guardrails have to make them easy for engineering teams to use, because otherwise you're going to find it really hard for those teams to buy into the reliability story.

So I'm filled with an equal mix of terror and excitement (as a father of two children, that's a regular occurrence) about the challenges and opportunities that AI can provide in our space.
There are a lot of open questions still, but through Q4 and into 2026 I'm going to encourage my teams to test various use cases, learn and iterate from there, and push hard based on what we've seen to date with partners like incident.io, who are a great partner. And thank you for the swag, by the way, or as I'm calling it now, Santa presents for my kids.

There are opportunities and new terrors to uncover. The challenge of collecting good data only increases in my line of work. Signal to noise is unbelievably important; we are drinking, or trying to drink, from a fire hose at all times, and the hose is just getting bigger. My career has taught me that there are a lot of learnings and learning opportunities in the year to come.

One of my old managers once told me, and this is kind of corny but I'm going to say it, that a career in SRE is like constantly pushing a boulder uphill, and the size of that boulder depends on the quality of your tooling and the strength of your team. Yeah, it's a bit corny, but you know what? I spend a lot of time thinking about that these days.

Thank you for your time, and I hope you took something from it. Thank you.
