Navigating disruption: Zendesk's migration journey to incident.io

Anna Roussanova (Zendesk) discusses how her team pulled off a smooth migration from PagerDuty to incident.io — in just over two months — without derailing engineers or piling on chaos.

  • Anna Roussanova, Engineering Manager, Zendesk
The transcript below has been generated using AI and may not fully match the audio.
I am here to tell you a story. It's the story of Zendesk's migration of our on-call functionality, our platform, from PagerDuty to incident.io. Like all good stories, this one begins with a problem. Our problem was a little self-imposed: how do we migrate about 1,200 users, about 150 Scrum teams, over 5,000 monitors, and about a dozen alert sources in 10 weeks?

Now, you might be thinking, Anna, this really sounds like poor planning on your part. And to be fair to us, when we made our decision to migrate, we didn't have only 10 weeks; we had more time than that. But the window between starting to configure alerting in incident.io and our self-imposed go-live date of April 30th ended up being just a 10-week span. So why did we think we could do it? Spoiler alert: we did do it. We managed to make this migration happen.

In retrospect, there were really three things that contributed to our success. First up, we had our North Stars: the guiding principles that made every decision we had to make along the way a lot easier. Second, we had our team. It was a small, lean team, very few people, but very smart people who were able to execute this migration. And finally, and probably most importantly, we had our communication strategy. We came up with the strategy right at the beginning of the migration process. We sat down and crafted: this is what we're gonna communicate, this is how we're gonna do it, and these are the exact times in that 10-week schedule when we're gonna make each communication. Let me go into each of these in a little more detail.

So these guiding principles are our North Stars. I know you should only have one North Star. We had two, but it worked out. Both of them came out of our experiences being at Zendesk for many years. The first was that we were gonna have a single source of truth. Our PagerDuty setup had come out of 14 years of ad hoc build, without any guidelines, without any standards, largely done via click-ops. So by earlier this year, it was a mess. And it was a problem during incidents, because we'd be on an incident call and realize, oh, hey, we need to pull in the team that owns authentication. What's that team again? Oh, it's the authentication team. Okay, cool. Except that in PagerDuty they were under their old name, which was something like Swank, something totally different, and you had to have a certain amount of institutional knowledge just to run the incident and get the right people in the room. So this was causing slowness in our incident response.

So when we started our incident.io implementation, we said, okay, no. We need to make sure that the data in incident.io is synced to our internal database of teams and services and who owns what, and that it remains synced over time. And then every single decision after that became: how do we make sure that whatever we're building isn't manual and is always referencing that imported set of data, which we know will always be accurate? That actually made configuring our alert routing really easy, because we could just look up the service, look up the team that owns it, trust that it's always synced to our internal database, and build on that. We didn't have to overthink our configuration decisions. We could just ask, what's gonna make sure this stays up to date and current with that import? Okay, we're doing it that way.
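To make that single-source-of-truth idea concrete, here is a minimal sketch of alert routing that defers to a synced catalog instead of hand-maintained mappings. It is illustrative only: the names (load_synced_catalog, escalation_path, the catch-all fallback) are assumptions, not Zendesk's actual implementation or incident.io's configuration model.

```python
# Illustrative sketch: route an alert by looking up its service in a catalog
# that is kept in sync with an internal source of truth. All names here are
# hypothetical, not Zendesk's real setup.

from dataclasses import dataclass


@dataclass
class CatalogEntry:
    service: str
    owning_team: str
    escalation_path: str  # the owning team's escalation path in the alerting tool


def load_synced_catalog() -> dict[str, CatalogEntry]:
    """Stand-in for the regularly imported service/team data.

    In practice this would be populated by a sync job from the internal
    database of teams and services, never edited by hand.
    """
    return {
        "authentication-api": CatalogEntry(
            service="authentication-api",
            owning_team="Authentication",
            escalation_path="authentication-oncall",
        ),
    }


def route_alert(service: str, catalog: dict[str, CatalogEntry]) -> str:
    """Return the escalation path of the team that owns the alerting service."""
    entry = catalog.get(service)
    if entry is None:
        # Unknown services fall back to a catch-all path instead of being dropped.
        return "sre-catch-all"
    return entry.escalation_path


if __name__ == "__main__":
    catalog = load_synced_catalog()
    print(route_alert("authentication-api", catalog))  # -> authentication-oncall
```

The point of the pattern is that nothing in the routing logic is maintained by hand: if the internal database changes, the next sync changes the routing.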
The other guiding principle we had was that we wanted to minimize the work for other teams. We knew that if we came in and told our engineers, okay, we have 10 weeks, you've got to migrate all of your monitors from PagerDuty to incident.io, they were gonna be understandably very unhappy. So we said, okay, we need to do this for our teams in order to sell them on the change we were gonna make. And this actually helped, because it made sure that we were the small team making the decisions and doing the build, not going back and forth with teams asking, hey, did you move your thing yet? We're waiting for you to do that. No. We were the ones doing it.

So, speaking of us: it was a very small team. There were two of us, me and my colleague Tommy over there, who were making the architecture decisions and doing a lot of the build. We had one product manager who was in charge of our comms strategy. We had one program manager who made sure that the trains ran on time and all the tasks got done when they needed to. And then we had a rotating cast of helpers, about eight engineers across the SRE organization, who came in and did specific pieces of build and migrated specific sets of monitors. But at any given time, there were no more than 10 people working on this.

So what did that team do? First, one very important thing about that rotating cast of SRE helpers: we were able to get them because we had organizational backing. We had the rest of the SRE organization willing to give us that help when we needed it. I think the fact that we had a contract expiration deadline really helped with that.

Even before we started building, we were able to collect a lot of data about our monitoring setup. So we knew, okay, we have about 5,000 monitors. Ninety-five percent of them are coming in from Datadog. Of those coming in from Datadog, about 75% are configured in code. Okay, that means that with a really simple, small PR, we can all of a sudden start sending all of those monitors into incident.io. With that knowledge we could start building what I think of as 80/20 solutions, where we were able to do 80% of the work and move 80% of the monitors ourselves, and then have a smaller scope of work where we did need to reach out to engineering teams and ask them, hey, can you help move this set of your monitors that we've identified for you? Look, here are the links to those monitors. Go and move those.

The other thing that comes into this is that, as SREs, we had a lot of experience working with other teams across the organization. We'd forged relationships with those teams over time, and we were really able to leverage those relationships to make this work happen. At one point, we realized we had a whole bunch of CloudWatch monitors that were all owned by the DBA team and were all built via click-ops. But one of our team members had worked with the DBAs for many years, so they were able to reach out, pair with one of our DBAs, and get all of those monitors into code. So, first off, the DBAs were set up for success by having their monitors in code as opposed to click-ops, and we were also able to migrate them very easily to incident.io. This is also how we got our guinea pigs, by the way.
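As a rough illustration of that "one small PR" against monitors defined in code, the sketch below rewrites Datadog-style notification handles in monitor definition files so alerts go to an incident.io webhook instead of PagerDuty. The file layout, the `.tf` glob, and the `@webhook-incident-io` handle name are hypothetical, and the change Zendesk actually shipped may well have looked different.

```python
# Illustrative sketch of repointing monitors-as-code in one pass: swap the
# PagerDuty notification handle in each monitor message for an incident.io
# webhook handle. Paths and handle names are assumptions for illustration.

import re
from pathlib import Path

# Datadog-style notification handles embedded in monitor messages.
PAGERDUTY_HANDLE = re.compile(r"@pagerduty-[\w.-]+")
INCIDENT_IO_HANDLE = "@webhook-incident-io"  # hypothetical webhook name


def repoint_monitor_file(path: Path) -> bool:
    """Rewrite one monitor definition file; return True if it changed."""
    original = path.read_text()
    updated = PAGERDUTY_HANDLE.sub(INCIDENT_IO_HANDLE, original)
    if updated != original:
        path.write_text(updated)
        return True
    return False


def repoint_all(monitor_dir: str = "monitors/") -> None:
    changed = [
        p for p in sorted(Path(monitor_dir).rglob("*.tf")) if repoint_monitor_file(p)
    ]
    print(f"Updated {len(changed)} monitor definitions")


if __name__ == "__main__":
    repoint_all()
```

The appeal of doing it this way is that the whole change lands as one reviewable PR instead of asking every team to touch its own monitors.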
Speaking of guinea pigs, we were able to reach out to our compute team and say, hey, you guys are really good at alerting and monitoring and stuff. How would you like to be the first adopters of this cool new platform that we're rolling out? So two weeks into the build process, compute was live and taking alerts in incident.io.

So that was the team. I said the most important thing that made our migration a success was our communication strategy. It was really about making sure that we were getting the right information to the right people via the right channel. Like I said, we sat down right at the beginning of our migration and asked ourselves: how are we best going to communicate, one, that we're making this change, and two, what people need to do for this change?

First off, nobody reads email. So we ruled out email as a communication channel right away. We knew this was gonna have to be done via Slack, by and large. But even then, there are different ways to communicate via Slack. So first we made announcements. Once we'd made the decision, okay, we're gonna do this, we announced it in our big announcement channels. We said, hey, this is happening, and this is why we're doing it. We are excited to migrate to incident.io because we think we're gonna get these benefits from the migration. And then we gave reassurances: hey, don't worry, we're gonna do a lot of this migration for you, and we will reach out to you when we need something from you. We were really trying to head off that initial reaction of, oh my God, you're messing with my workflow, don't touch it, don't make me get involved in this whole thing. So we were selling the positives of what we were gonna get out of the migration, and also reassuring people along the way.

The other thing we knew we were going to need was really good documentation. So we sat down and I wrote so many articles, with step-by-step screenshots: this is how you set up your on-call. This is how you make sure that your schedule, your on-call rotation, is configured correctly and that your escalation path looks good. We even created FAQ and troubleshooting documents, and we came up with those before we sent anything out, guessing, okay, I think this is what people are gonna ask. But we also kept them up to date as communications went out. As people started asking questions, we made sure to keep our FAQ updated with what they were actually asking. So when questions came in, I was able to say, here's a link to the answer to the question that you literally just typed in Slack. Having that in place was really important before we actually started asking people to do things.

When we did ask people to do things, we realized Slack announcement channels are not great for that, because a lot of people see an announcement channel as broad and assume, I personally don't have to do anything. We figured DMs were gonna be the best way to reach people. So that's what we did. We had a lightweight Slack bot that generated DMs, and the majority of actual requests for people to do things were sent via this Slack bot DM. First, we sent a message to all on-call users asking them, hey, set up your on-call in incident.io. Here's a link to the documentation with step-by-step screenshots. This is how you do it. And then we had reports tracking who had done that, so a week later we could send a second DM to the people who hadn't done it yet.
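A lightweight bot like the one described could be little more than a script against the Slack Web API. The sketch below, using the slack_sdk Python client, looks up each on-call user by email and sends them the setup request as a DM; the email list, docs URL, and message wording are placeholders, not Zendesk's actual bot.

```python
# Minimal sketch of a Slack bot that DMs on-call users a setup request.
# Requires a bot token with im:write, chat:write, and users:read.email scopes.

import os

from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

DOCS_URL = "https://example.internal/docs/incident-io-oncall-setup"  # placeholder
ONCALL_EMAILS = ["engineer@example.com"]  # would come from the tracking report

MESSAGE = (
    "Hi! As part of the PagerDuty -> incident.io migration, please set up "
    f"your on-call profile in incident.io. Step-by-step instructions: {DOCS_URL}"
)


def dm_oncall_users(emails: list[str]) -> None:
    for email in emails:
        # Look up the Slack user ID for the email, then DM that user directly.
        user_id = client.users_lookupByEmail(email=email)["user"]["id"]
        client.chat_postMessage(channel=user_id, text=MESSAGE)


if __name__ == "__main__":
    dm_oncall_users(ONCALL_EMAILS)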
Re-running the same kind of script against the "not done yet" list from the tracking report gives you the follow-up DM a week later.

Similarly, we set up a massive spreadsheet with all of our Scrum teams and their managers, and we sent a DM to those managers and said, hey, can you go into incident.io and make sure that your team is set up for being on call, that your schedule is correct, that your escalation path is correct, and then check off here when you're done. And then again, a week later, we were able to see, okay, who hasn't checked off yet? Let's follow up with them. So I think this was really helpful in terms of engaging with people where they like to be engaged, which is Slack direct messages, step by step: these are the things that you have to do. Just do them. It's easy.

So what happened? I spoiled it at the beginning. A hundred percent of our engineers did go live with incident.io by our April 30th go-live date. Yay. Now, there are some caveats, and there are some things that we learned along the way. One important thing is that we should have built real-time reporting on this migration, because towards the end there, Tommy was manually pulling a report every day of all the monitors that were still yet to be migrated to incident.io and making sure that he followed up with folks. If we had had that in real time, it would have been easier.

More importantly, though, we should have reached out to our edge-case users a little bit sooner. The primary use of incident.io at Zendesk is for engineering alerting, but we also have a small section of the IT organization that receives alerts, and their workflows are a little bit different. Their alert sources include some things that aren't Datadog or CloudWatch. We should probably have coordinated with them a little bit sooner, because they're not part of that 100% from the previous slide. They actually took a little bit longer to go live on incident.io. But that's okay, because our April 30th go-live date was a self-imposed one: we had built in a four-week buffer between when we said we wanted to be live with incident.io and when our contract expired. So thank goodness for that buffer.

But even with these lessons learned, even with the little hiccups we ran into along the way, my biggest takeaway from the migration is the feedback that we got from our fellow engineers, which is that this was one of the smoothest migrations at Zendesk. Given that it was 10 weeks, I'll take that as high praise. So thank you.
