On-call - half-baked?

John Paris, Principal Systems Engineer at Skyscanner, shares how his team tackled rolling out On-call across a large engineering organization by partnering with incident.io.

  • John Paris, Principal Systems Engineer, Skyscanner
The transcript below has been generated using AI and may not fully match the audio.
Thanks very much, Tom, for those kind words. Wow, what an act to follow. Gosh, I wish I'd been on first thing in the morning. Sorry, I should just start. It's wonderful to see so many peers and actually be able to have conversations with some of you today. I know that some of you are on this journey, and some of you are thinking about this journey, so hopefully what I cover today will help you make a decision.

Just a bit about me: I've been at Skyscanner now for approaching 12 years; next week will be my 12th anniversary at Skyscanner. It's a wonderful place to work. But enough about me, and on to the story.

I think we've heard a lot today about how stale, how stagnant, the tools we've used over the years have become. The expectations of the people using these tools have moved on, and the suppliers just haven't kept up. So what were the problems? Engineers using the tool wanted a slick user interface; they wanted things to just work. Our managers wanted to understand the health of their services and their teams as well. Why should we have to go around stitching data together, running our own pipelines, pulling in information from different places, when we should just have it all in one place? And then our HR teams: at Skyscanner, we compensate our engineers for the time that they spend on call. We've got budgets to balance, but we also have to care about the health and welfare of our teams, making sure that individuals are not working too many out-of-hours hours on call.

Getting insight into any of these things was horrendously difficult. We ended up running our own data pipelines and our own services to fill the gaps in the products that we were trying to use. And ultimately, the controls: has everybody got their policies set up properly for being paged? Are they going to be able to respond? You didn't know the answers to any of these questions.

So you can imagine the excitement when the internet lit up back in March '24: LinkedIn going bing, bing, On-call had launched from incident.io. That was a real moment of excitement, because in my head I'd already left our previous provider; I was just looking for a new solution. But much like the hangover you have the day after, when we looked at On-call version one, we realized it wasn't going to work for us. To understand why, you probably need to understand a little bit about Skyscanner first.

Our mission is quite simple: we just want to be the number one travel ally. We're already inspiring 140 million users a month (it's 160 on the slide there, actually), so we're reasonably large. We're helping people find destinations to travel to, the ways to get there, and the places to stay once they get there. In terms of scale, we're available globally in 37 languages and 52 managed markets, and the number I keep having to double-check is that 100 billion price searches take place on our platform every day. Big numbers. And thanks to SLOs, which I think we might hear about later on, we know that we're getting 94% of these returned in under three seconds.

What sits behind that? We have thousands of components, data sets, and models, all distributed in the cloud. Over the years we've amassed 22 petabytes of business data, and every month our engineers delight us with an extra 800 terabits of observability data. And that's all supported by approaching 900 engineers.
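As a quick aside on that 94% figure: an SLO attainment like this is just the fraction of requests that beat the latency threshold. A minimal sketch, where the three-second threshold is the one quoted above and everything else (names, data shapes) is hypothetical:

```python
# Minimal sketch: SLO attainment as a ratio of "good" requests.
# The 3-second threshold is from the talk; names are hypothetical.
from typing import Iterable

def slo_attainment(latencies_s: Iterable[float], threshold_s: float = 3.0) -> float:
    """Fraction of searches returned within the latency threshold."""
    samples = list(latencies_s)
    if not samples:
        return 1.0  # no traffic, nothing breached
    return sum(1 for latency in samples if latency <= threshold_s) / len(samples)

# slo_attainment([1.2, 2.8, 3.5, 0.9]) -> 0.75; the figure quoted above is 0.94.
```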
Now, every day during the working day, there are 120 schedules running. Out of hours, in the evenings and at the weekends, there are another 60 teams operating an out-of-hours schedule. So every month that's approximately 350 engineers accumulating over 30,000 hours of out-of-hours on call that we compensate them for.

Just touching on that: we know there are different compensation models available. I'm not going to cover them today; it's enough to know that we compensate for the disruption to life of being on call. We don't pay extra for dealing with calls; it's a flat hourly rate regardless of what level you're on. And really, we want our engineers to run stable systems. We don't want them to be called out; we want stability in the platform. If you do want to know about the other models that are available, check out blogs from other people.

So back to On-call: what was missing from On-call for us? Ultimately, it comes down to the payment calculator. The technology, what you monitor and how you monitor it, is a well-trodden path; that was all sorted. And having a catalog in the product: we've heard enough today to know that the catalog is there and can be used in various ways. We also benefited here from having a relationship with our own IDP provider, so we could integrate directly into incident.io through that. The part that was really missing was the organization: how you define your organization. The who, the where, the when, the how much, and the controls around all of that.

So what did we do about it? Remember, we're still on the downward slope: we'd gone through the euphoria of the announcement and realized it wasn't going to work for us. We sat down with some friends at incident.io and we told them: it's not us, it's you. But as I mentioned earlier, in my head my relationship with my previous supplier was already over, and I wanted this new relationship to work. We documented the requirements, we sat down with the team, and we got to a stage where we felt comfortable that we could go with this. We had some tricky points in here, with some really difficult termination points in contracts to be dealt with, but we managed to get through that. Importantly, we also needed somebody to mastermind this, so Megan rushed back from maternity leave to help mastermind the build of what I'm going to talk about.

We collected these requirements and we all agreed on a joint path forward, bearing in mind we had the contract termination clauses in the back of our heads as well. In November and December we went through that feasibility discussion and sorted out the commercials. Then from January through into April, incident.io spent more time with us, fleshing out these requirements and understanding our business and our processes in more detail. Just because we had our requirements written down didn't mean we were right; we needed someone else to validate them and to think of other ways of solving the problems. So we agreed the requirements and we just gave incident.io space to build. We knew from the previous experience of going through Response and the other products that the team would build and would commit; our challenge was actually that they were building far too fast and we couldn't keep up with them. Then there were a few refinements and decision points to be made, and by the end of April, we'd decided we were going for this.
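Circling back to the compensation model for a moment, it really is as simple as a flat rate times hours, with a separate eye on individual load. A minimal sketch; the rate, the cap, and the names below are hypothetical, not Skyscanner's figures:

```python
# Sketch of a flat-rate on-call compensation model, as described above:
# pay for out-of-hours cover, nothing per call, no per-level multipliers.
# The welfare cap is a hypothetical illustration of the "not too many
# hours" check mentioned earlier.
def on_call_pay(hours: float, hourly_rate: float) -> float:
    """Compensation for the disruption of carrying the pager."""
    return hours * hourly_rate

def too_much_on_call(hours: float, monthly_cap: float = 160.0) -> bool:
    """True if an engineer is carrying an unhealthy out-of-hours load."""
    return hours > monthly_cap

# For scale: roughly 30,000 compensated hours across ~350 engineers is
# about 86 hours each per month on average; the cap is for the outliers.
```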
Now, I mentioned there about understanding our business, and it's probably no secret: we have quite a large team of engineers out in China, and their experience of these on-call products is quite different from our own. The Great Firewall does get in the way, and that causes some real challenges, so we spent time working out what to do there.

Ultimately, what we needed was our organization in the catalog. I've mentioned that we've got the IDP provider there, so that was a great help in linking up the data. Let's have a look at this in more detail.

The who: almost 900 engineers now, and that's pretty straightforward to get into your catalog, along with the definitions of your teams and your schedules. The best thing to do is to take that from Workday or whatever other tool you have for managing your organization. The important thing is that it's real data, not a synthetic copy that you're updating yourself in your own catalog. And then also tie in the location of each engineer, which will become important in a second.

The where and the when: we're working across three currencies and three different time zones, and within our different offices there are different work patterns that people follow. Our colleagues in China tend to start at 10 o'clock and work later into the day, whereas here in the UK we tend to be more like nine to five when it comes to daytime running. And if you've got an organization that's maybe operating on the West Coast of the States, your engineers might be starting really early in the morning so they can confer with people over here in Europe. So being able to set different office hours was something that we needed. Then national holidays change all over the world, so we needed to track national holidays into the payment calculator as well.

Eventually, it's a question of how much. We needed to be able to export this data, process it, and apply some rules and governance around that processing, so that we could then export it directly to our payroll systems. A couple of things in there: it's not uncommon for engineers to occasionally appear in two rotas, and we don't want them getting paid twice for that. We also have to be sure that we're exporting in local currencies, along with a whole pile of other things you need to think about depending on the countries you're operating in. Ultimately, we needed to be sure that we had good governance around this, and also to get the insights. I started at the beginning talking about the data that we just couldn't get out; we want that data coming straight from the source of truth.

So what did we get to? We got our payment calculator version two. Yes, this slide's probably a bit messy, now that I think about it, but what did it give us? Secure config around the approved schedules; secure config around our location data, so that nobody could go in and make changes that weren't approved; and our compensation config, all tied down using the access-control capabilities within the product. But although we had really solid controls around this, we didn't want to lock down the platform for the rest of engineering, because they needed to be able to change their rotas and their schedules depending on their own use. So we left them free.
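To ground that who/where/when/how-much, here's a minimal sketch of the kind of calculation a payment calculator like this has to do: merge overlapping rotas so nobody is paid twice, count only the hours outside each office's own working pattern and holidays, and emit a payroll row in the local currency. Every name, rate, and office config here is hypothetical; the real logic lives inside incident.io's payment configuration.

```python
# Sketch of the "how much" pipeline: dedupe shifts, classify out-of-hours
# time per office, and produce a local-currency payroll row.
# All names and figures are hypothetical.
from dataclasses import dataclass
from datetime import datetime, date, time, timedelta

@dataclass
class Shift:
    start: datetime
    end: datetime

@dataclass
class Office:
    workday_start: time   # e.g. 10:00 in China, 09:00 in the UK
    workday_end: time
    holidays: set[date]   # national holidays differ per country
    currency: str         # export in the engineer's local currency
    hourly_rate: float    # flat rate, regardless of level

def merge_shifts(shifts: list[Shift]) -> list[Shift]:
    """Collapse overlaps so an engineer on two rotas isn't paid twice."""
    merged: list[Shift] = []
    for s in sorted(shifts, key=lambda s: s.start):
        if merged and s.start <= merged[-1].end:
            merged[-1].end = max(merged[-1].end, s.end)
        else:
            merged.append(Shift(s.start, s.end))
    return merged

def is_out_of_hours(moment: datetime, office: Office) -> bool:
    """Weekends, national holidays, and anything outside office hours."""
    if moment.weekday() >= 5 or moment.date() in office.holidays:
        return True
    return not (office.workday_start <= moment.time() < office.workday_end)

def payable_hours(shifts: list[Shift], office: Office) -> float:
    """Walk merged shifts hour by hour (assumes hour-aligned shifts)."""
    total = 0.0
    for shift in merge_shifts(shifts):
        cursor = shift.start
        while cursor < shift.end:
            if is_out_of_hours(cursor, office):
                total += 1.0
            cursor += timedelta(hours=1)
    return total

def payroll_row(engineer: str, shifts: list[Shift], office: Office) -> dict:
    """One governed, exportable row for the payroll system."""
    hours = payable_hours(shifts, office)
    return {"engineer": engineer, "hours": hours,
            "amount": hours * office.hourly_rate, "currency": office.currency}
```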
Now we've got to the big question: TTM. With so many of you involved in incident management here, you're wondering: is this another acronym that we need to worry about? Only if you're doing a migration. We knew we had a good product, and we knew we had a good team that could fix any surprises that came up. But what about the migration? incident.io had already said to us: it's okay, we can cut this short, we can do a lift and shift of your existing config. But we didn't feel comfortable with that, for some of the reasons we have up here. Ten years of legacy: there was a lot in there that we just didn't want. So we decided to spend the time up front and start afresh, because we knew that if we didn't, we would spend the end of that journey fixing problems, and fixing problems. We went with a greenfield solution.

Down to the migration. In April, our own squad went live using the tool, and we also went to some representatives from our teams in China, just to make sure they were comfortable with the work that was taking place. We learned, we iterated, and we improved our migration plan. Then we made the rollout decision by the end of April, and in May we started onboarding a number of friendly squads. I say friendly: these were people who we knew would be opinionated, and we knew they were busy. That ran into June, and by the end of June we had seven teams all using incident.io On-call in anger. Then came the big push: using all the learnings from April to June, we pushed hard on the migration through June, July, and into August. We sat down initially with squads and with tribes, just talking them through what the experience was going to be like, and we'd done a whole pile of preparation behind the scenes as well, so you get to a stage where hopefully they just need to accept some PRs. And yes, by the end of August: all done, complete. I'm actually delighted to say that today is the termination day on the contract we had with our previous vendor. So we had some margin in here, and we managed it. But the story doesn't stop there.

Let's come back to why we wanted On-call. We've got multiple observability tools, probably like many of you; it'd be great if you could rely on one, but the reality is quite different. As soon as the alerts generated by our observability platforms arrive in incident.io, you're inside their ecosystem: the catalog, the tagging, and all the metadata that's there to help you. And thinking back to the presentation you've just seen: we generate a triage incident, and by the time the engineers come online, AI SRE has already done a first pass, so they're not landing cold into an incident. They're landing with some background on the incident itself. I'm delighted to say Skyscanner does have access to products ahead of the market, so we've been able to see the value in this already. From triage, if it's a bad enough issue, you generate an incident. Then you're into the post-incident flow, where you learn, you share, you grow, you build a better future. You want to make sure that your follow-up actions are actually being dealt with and not sitting going stale; again, in incident.io they can be followed up. And underneath all of that are the insights, all the way from alerts right through to your follow-up actions and everything else in between.
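The shape of that flow, from alert through catalog enrichment to a triage incident with an AI first pass already attached, might look roughly like the sketch below. This is an illustration of the idea, not incident.io's actual API; every name here is hypothetical.

```python
# Sketch of the alert-to-triage flow described above. Hypothetical names
# throughout; the real ecosystem does this inside the product.
from dataclasses import dataclass, field

@dataclass
class Alert:
    source: str    # which observability tool fired it
    service: str
    payload: dict

@dataclass
class TriageIncident:
    alert: Alert
    team: str
    notes: list[str] = field(default_factory=list)

# Hypothetical stand-in for the catalog's service-to-team mapping.
CATALOG = {"flight-search": {"team": "search-squad"}}

def ai_first_pass(incident: TriageIncident) -> str:
    # Stand-in for an AI SRE first pass: in reality it would summarize
    # recent changes, similar incidents, and likely causes before anyone
    # comes online.
    return f"Background prepared for {incident.alert.service}."

def on_alert(alert: Alert) -> TriageIncident:
    """Enrich from the catalog, open a triage incident, run the first pass."""
    team = CATALOG.get(alert.service, {}).get("team", "unowned")
    incident = TriageIncident(alert=alert, team=team)
    incident.notes.append(ai_first_pass(incident))
    return incident

# If triage shows it's bad enough, the team promotes it to a full incident,
# and the post-incident flow (follow-up actions, insights) takes over.
```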
Coming back to alerts turning into incidents: in the month of September, I had a look back at our data, and the number of incidents at Skyscanner had doubled. I thought, wow, that was a really bad month. But using the insights and digging into the data, it was quite apparent that we hadn't actually had any more incidents; we'd just recorded more incidents. These alerts going into triage weren't just being forgotten about: they were being turned into incidents when it mattered. So we've now got more information on the incidents and the toil and the drag than we had before. If you're going to do this kind of thing, expect to see an increase in your incidents, but don't be shocked, don't be surprised. Just be glad.

And as a little bonus, sneaking out from under the carpet was a new Teams feature, not to be confused with the other Teams. This is a simple dashboard view that gives engineers within a team or a squad access to the stuff that matters to them: their live incidents, their alerts, their escalations, and their schedules, if they need to make any quick changes. All the post-incident activities can be seen there as well, so that people don't forget about these things and they don't end up completely forgotten.

So what were the lessons that we took from this? Be as clear as you can with your requirements up front, obviously, but be willing to accept that your supplier may have other ideas that can help. Make sure you liaise early with your payroll people; in this case we were not changing our compensation model, we didn't want to do that, this was a technical exercise, but we still had to interface with our payroll and HR systems when it came to the actual implementation. Seek out champions in your teams or your squads: these people will deal with the issues locally without you having to be a point of escalation, and I would recommend these are not engineering managers or team leads, but champions who are closer to the day-to-day life of being an engineer. Pay special attention to other geographies and regulations; if you are working across these regions, just watch out for any rules that come into play or any situations like a firewall. And yes, speak to incident.io: they will build it. That's been my experience for years. We've heard already that every week there's something coming out in the changelog; it was really hard to keep up with, but yes, they will build it if you want it.

And finally: we were customer zero here for On-call version two, and that journey started in November and December and went right through to August. But that was us working with incident.io, helping build a product that you will now be able to benefit from, so you're not going to have to worry about a six-month migration. You will be able to do this faster.
