Our data disappeared and (almost) nobody noticed: Incident lessons learned

Michael Tweed (Skyscanner) shares the story of an incident that quietly broke data emission for days, and why none of their alerts (or AI) caught it.

Michael Tweed, Principal Software Engineer, Skyscanner
The transcript below has been generated using AI and may not fully match the audio.
So I want to say, it's kind of funny: I don't know whether the scheduling deliberately put me after the aviation talk. I'm here from a travel company, which also really got me thinking about that last talk. Am I about to talk about the same things? Interesting there. But just to start, Skyscanner is a global travel company. We have 160 million users across 52 countries, and a lot of what we do is around data, whether that's data to improve our product, find the best prices, or help our partners. As I go through this, that's the scale we're talking about: when we talk about data, there's a lot of it.

Now a little bit about myself. When I started at Skyscanner a few years ago, I started as an app engineer and led some teams there. I got involved in a data emission SDK, which is how we send our data out of the app to our different data platforms. That evolved into getting into ingestion and processing on the data side. For the purposes of this talk, I'm going to be focusing on this area. I operate in a few other areas, but this is really what I'm here to talk about.

The other thing is, in my decade-plus as a tech lead, I'd actually been lucky enough, as I put it, to never lead a P0 or P1 incident. But I'd sometimes be at conferences like this one, listening to talks, and it sometimes actually sounds good to lead one. You get a lot of learnings. A lot of the talks we've had here today, and that we're going to have, are about how you can take these things and make them better. I've always admired people who've been able to come and share: "Hey, here's how everything got messed up." I think we've all read some of the great postmortems, blog posts, and conference talks, so part of me felt it kind of sucked that I'd never led a huge incident; I'd only been tagged in them. I didn't really know how I should feel about that. Obviously you can never predict when these things will hit. And so I'm here to talk to you about what we call internally our iOS data incident.

This all started, as things normally do, on a Monday morning when I logged into Slack. We had this fairly innocuous message. At first, nobody was tagged. It was just a question from one of our data analysts saying, "Hey, it looks like there might be an issue. Something doesn't look right, it's down significantly, but we're not sure." There were some comments: maybe a marketing campaign was paused. Maybe something happened. But what did this lead to?

From this one Slack message, over the next week plus, we opened an incident. Late in the afternoon, I became the incident commander, figuring out: what's happening? What's going on here? We discovered that all of the data emission was broken in our latest iOS release. We had millions and millions of events that were literally just missing; they weren't being sent. We narrowed it down to a change. Okay: we found what happened. We shipped something bad, we released a fix, and then we found a second issue: all of the failed events had actually remained cached on the device and were then sent en masse at once. So this led to a huge dip, followed by a huge spike, in our datasets. Our consumers weren't happy. This isn't just internal; it's external. Our partners use this data, and they were also wondering what happened. Because we didn't predict this, we couldn't even tell them; it all happened at once. The good news at the end of this: we were actually able to use some metadata to backfill our events.
We eventually ended up with a smooth line, and it just about worked out, but it was a very stressful seven-plus days. I use this graph because it shows multiple things. First of all, this is literally what our data looks like: huge dip, huge spike. It could also represent the rollercoaster of my emotions as I was dealing with this. I was going to say it could also represent my heartbeat over time, but I'm not a doctor, so I don't know if that's regular or irregular; it really did show what we were dealing with. This also impacted external partners and all of our internal experimentation, which couldn't return results. So there was a lot going on.

Once we got through this, we started asking some questions. Even for myself personally, as I said, while we were in the incident we focused on getting through things and working out what needed to be done, with everybody pulling together. But then I was like, this kind of sucks for me. I know this is my internal dialogue, but this is what I showed: I'm responsible for this. I'd spent years working with teams of great engineers, and we thought we'd built up resiliency at all of these points. Where did this fail? We'd built all of these systems; how did none of this work?

Moving on to the questions, these were some of what we started coming up with, and I'm going to walk through them today, because it's really interesting to see how all of these things can come together. It started with: how did the original change even break the emission? What actually happened there? Then: why wasn't it caught by testing? Why didn't monitoring catch it when we were rolling out the release? Why didn't our data consumers detect this as well? Why did it still take six days to manually detect? And then: why did it take us so long to raise an incident once we found out? I'm going to go through some of the learnings, which hopefully will be interesting to you, showing how these complex systems can go wrong.

So we start off: how did an apparently innocuous change break our data emission? I don't know if we have any mobile engineers here; I'm going to run through this very quickly. It's a lot of text, so I'll skim. Basically, we have an internal SDK that I'd worked on, and we deliberately centralized this code. Whenever one of our engineers wants to emit an event, we make it very easy for them. One line of code: send an event, pass some parameters, and you're good to go. But we also have some transformations, for example applying a consistent timestamp and a consistent header to all events, and these can be handled independently. We had some dependency management code that brought these all together. You can write something up here, something over there, and it brings it together. So it looks roughly like this: on the top, this is how our events get processed; on the bottom, how the SDK handles these; and they get added to a queue that eventually gets flushed once all transformations have been applied.

What happened? We made a completely unrelated change in our dependency management code, one of those unrelated PRs with a very innocuous title like "cleanup", and nobody noticed that it impacted one of these transformations. Unfortunately, and this becomes relevant in a minute, it was also written in legacy Objective-C, and because this was a required transformation, all the events in the SDK stayed pending. Nothing was sent. Then, once we released the fix, unknown to us, the queue processed them en masse as users updated, leading to this huge spike.
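To make that flow concrete, here is a minimal Swift sketch of that kind of pipeline. The types and names are illustrative assumptions, not Skyscanner's actual SDK: events sit in a queue until every required transformation has been applied, so a required transformation that never gets wired up doesn't crash anything; it just means no event ever becomes flushable.

```swift
import Foundation

// Illustrative sketch only; the real SDK's types and wiring are not shown in the talk.
struct AnalyticsEvent {
    var name: String
    var parameters: [String: String]
    var appliedTransformations: Set<String> = []
}

protocol EventTransformation {
    var id: String { get }
    func apply(to event: inout AnalyticsEvent)
}

// Example of a centrally applied transformation: a consistent timestamp on every event.
struct TimestampTransformation: EventTransformation {
    let id = "timestamp"
    func apply(to event: inout AnalyticsEvent) {
        event.parameters["timestamp"] = ISO8601DateFormatter().string(from: Date())
    }
}

final class EventQueue {
    // Every event must receive these transformations before it may be flushed.
    private let requiredTransformationIDs: Set<String> = ["timestamp", "header"]

    // Populated by the dependency-management layer. If a "cleanup" change means one
    // required transformation is never registered, nothing throws or fails to compile:
    // events simply never satisfy the readiness check below.
    var transformations: [EventTransformation] = []

    private var pending: [AnalyticsEvent] = []

    // The one-liner engineers call to emit an event.
    func send(_ name: String, parameters: [String: String] = [:]) {
        var event = AnalyticsEvent(name: name, parameters: parameters)
        for transformation in transformations {
            transformation.apply(to: &event)
            event.appliedTransformations.insert(transformation.id)
        }
        pending.append(event)
        flushIfReady()
    }

    private func flushIfReady() {
        let isReady: (AnalyticsEvent) -> Bool = {
            self.requiredTransformationIDs.isSubset(of: $0.appliedTransformations)
        }
        // Events missing a required transformation stay cached on the device;
        // in our incident they were all replayed at once after the fix shipped.
        let ready = pending.filter(isReady)
        pending.removeAll(where: isReady)
        ready.forEach(upload)
    }

    private func upload(_ event: AnalyticsEvent) { /* network call elided */ }
}
```

Nothing in this path throws or logs; a missing registration just means the readiness check never passes, which is how the break stayed invisible.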
If we look here, basically this whole bottom row wasn't happening. That's how the events got stuck. And what's funny when we talk about the fix: again, I spoke about looking at these postmortems and hearing other great incident reports, and I always love it when you hear that the fix was some really low-level thing, maybe something on the iOS platform that caused a really tricky bug. Unfortunately, not at all. The fix was one line of code: literally setting a variable that had been missing. Unfortunately, Objective-C doesn't complain. It doesn't throw a null pointer exception. It doesn't fail at compile time. It just lets you run the code and silently fails, and I'm very glad that everybody's moving to Swift.

We then had a lack of knowledge around it as well. How did we not know what was happening? As I said, we had the dip, we had the spike, and we started to look in, and this was a little like the human errors we were just hearing about. A lot of the engineers involved had since left the team. One of the things we found, and it's the thing you don't think about until it happens, was that the engineers who wrote the documentation (and they're engineers, so there's a lot of "this code is really complex, so I've explained it in the diagram below") had written great docs, but we'd migrated to a new documentation platform. Certain diagrams had been lost, so you'd hit "diagram not found", which is not useful when you're trying to understand or debug what was happening.

So we go on to the next question: why didn't our tests catch this? We made a change, we pushed something, fine, we broke something. It turns out we didn't have effective automated testing. This one was actually pretty tough to go through, because we'd spent a lot of time trying to build up effective data emission tests, and we thought we were really strong on this. As I said, data is a key part of our business. We had written, alongside this SDK, an internal testing framework where our engineers could say, "Hey, I want to emit an event. Did it get processed correctly? Did it go through?" But what we had done, in trying to make this efficient, reduce test running times, and get things working, was only test up to the point of adding to the queue, so we could inject fake implementations and so on. That meant that technically all of the tests were green, because the events were being added to the queue. You could verify that everything was working as expected; it just failed after that point, which was not good (a rough sketch of what those tests were effectively asserting follows below).

And we then had a lack of automated checks, the things that, in the moment, you don't think about until they come back and bite you later on. We had a bunch of automated code quality tooling, lint checks, static analysis, all of this, but we had disabled it just a few months before this incident. We'd made the trade-off: hey, we're getting really long CI times; it takes a long time to run against this legacy Objective-C; is it worth it? We did all the maths: we get X many hundreds of builds per day; if each adds six minutes to the pipeline, that's X hundred thousand dollars of engineering time over the year. So we removed it. And of course we got this bug. So that was fun.

We did have some learnings here around the need to invest in the tests. Documentation needs to be directly reviewed, by new starters as well, because they're often the best people to come in, ask questions, and find these gaps.
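As a hedged sketch of that testing gap (hypothetical names, not the real internal framework): a test written this way stays green even when everything downstream of the queue is broken, because it injects a fake and asserts only that the event was enqueued, never that it was flushed and uploaded.

```swift
import XCTest

// Hypothetical test double: records events handed to it but never flushes anything.
final class FakeEventQueue {
    private(set) var enqueued: [String] = []
    func send(_ name: String) { enqueued.append(name) }
}

final class EventEmissionTests: XCTestCase {
    // Green even when the real flush/upload path is silently broken,
    // because the assertion stops at "was it added to the queue?".
    func testSearchEventIsEnqueued() {
        let queue = FakeEventQueue()
        queue.send("flight_search")
        XCTAssertEqual(queue.enqueued, ["flight_search"])
    }

    // The missing coverage was roughly this: drive the real queue end to end and
    // assert on what the transport layer actually received after a flush.
    // func testSearchEventIsUploaded() { ... }
}
```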
We also took action to move our diagrams and our docs closer to the code, to prevent platform migrations impacting these.

We then move into this. Again, we had all these questions. We were like, okay, we can understand: we didn't catch it with tests; something got shipped; it started rolling out. It happens. But why did none of our monitoring detect it? Why didn't we start getting alerts straight away? This was another human point: we had a dashboard which showed this kind of dip. You can see it above. But what was happening was that nobody took responsibility to check it as part of a release. First of all, this had never gone wrong in years and years. Over time, people start thinking, it's fine, I don't need to check it this release, because it's been fine for the past hundred-plus, two-hundred-plus releases. But then we also had this: "Oh, I thought so-and-so was checking. I thought so-and-so checked this as part of the rollout." So we took some learnings here: directly reviewing our dashboards and considering the leading indicators. We tried to put maturity and health onto the dashboard so that we could start tracking this.

But then again, we still go through, and these were all the questions we were asking. I was wondering to myself, and others were part of this: why didn't our data monitoring detect an issue? What we found was that we try to operate at Skyscanner very much with a "you emit data, you own it" model. In the first instance, this was something we thought we were doing pretty well on. We thought we had alerting tooling that would run across our datasets, get these things working, and flag issues. But the problem was how engineers had set it up. We'd done what we call turbo-lift PRs, mass PRs that say "just merge this PR and you get your alerting set up." They do say "please review the thresholds, but we've set some default values for you based on the tooling recommendations," so people just merged and moved on. It turns out that default configuration was wild. Maybe you'd expect a 5%–10% threshold where you'd start to get alerted. It turns out, when we investigated post-incident, that the default threshold was set to something like this: literally, "only alert if it falls outside this huge variance." Of course, it only triggered once the events had almost dropped to zero. Not useful at all. We thought, "Hey, we had this." It didn't work.

So again, we then looked at the data monitoring side, and we had a very similar issue. If we take this out of just iOS, you'd think that surely, when we look at some of these key datasets, we'd start to notice a drop. But again, the problem was that we never configured these alerts by platform. So with a drop in iOS, and data coming in across Android, iOS, web, and mobile web, during the first few days of the incident it just looked something like this. We could see a drop, but again, the thresholds weren't correctly configured and we didn't detect it. (A toy illustration of both problems is sketched below.) Key learnings: never rely on the default alert settings. Always test these things out and trigger the alerts to ensure correctness before they go live.

Again, these questions build upon each other. We start seeing these failure points, and okay, we can justify each level. But why did it still take six days? Okay, the alerting failed, but surely somebody should have been looking at this data. As I said, we normally don't manually monitor this data volume. We expect all of the automated checks we just spoke about to be in place.
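As a toy illustration of why those automated checks stayed quiet, here is a minimal Swift sketch. The numbers, names, and thresholds are made-up assumptions, not our real tooling or configuration: a very wide default tolerance on a pooled, all-platform series barely notices even a total loss of iOS events, while a per-platform check with a sane threshold fires on day one.

```swift
// Toy model of a volume-drop alert; illustrative only, not the real monitoring tooling.
struct VolumeAlert {
    /// Fires when observed volume has dropped by more than this fraction of expected.
    let allowedDropFraction: Double

    func shouldFire(expected: Double, observed: Double) -> Bool {
        guard expected > 0 else { return false }
        let drop = (expected - observed) / expected
        return drop > allowedDropFraction
    }
}

// Hypothetical daily event volumes in millions, by platform.
let expected = ["ios": 40.0, "android": 50.0, "web": 30.0, "mweb": 20.0]
let observed = ["ios": 0.0,  "android": 50.0, "web": 30.0, "mweb": 20.0]  // iOS emission broken

// Stand-in for the very wide default threshold: tolerate anything up to a 90% drop.
let defaultAlert = VolumeAlert(allowedDropFraction: 0.9)
let expectedTotal = expected.values.reduce(0, +)   // 140M
let observedTotal = observed.values.reduce(0, +)   // 100M, roughly a 29% drop overall
print(defaultAlert.shouldFire(expected: expectedTotal, observed: observedTotal))   // false: no alert

// Per-platform alerting with a tighter threshold catches the same failure immediately.
let perPlatformAlert = VolumeAlert(allowedDropFraction: 0.1)
print(perPlatformAlert.shouldFire(expected: expected["ios"]!, observed: observed["ios"]!))   // true
```

Even without the pathological default, pooling every platform into one series dilutes a platform-specific failure to the point where a reasonable-looking threshold can still miss it.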
But what was interesting, and outside of our control: we'd released the app on a Tuesday, and for the first few days, as those of you in mobile engineering will know, you have this slow adoption curve. Apple does staged rollouts. Just because you deploy to the App Store doesn't mean everybody gets it; everybody waits to download. We actually saw a huge spike over the weekend, when more users were opening the app, triggering background downloads and updating it. So the real impact of this incident only started around Saturday morning, and it wasn't picked up until Monday, because we didn't have, as we spoke about before, alerting or monitoring in place. Key learnings here: really pay attention to our app adoption curve, and understand these sorts of human factors; the impact might not occur when we expect it to.

The final question we had, given all of this journey, was: why did it still take so long? The Slack message I showed was sent Monday morning. The incident was raised Monday afternoon. But as anyone who has worked through incidents like these knows, that is a very long time, almost a whole working day, to go without actually picking up on this. And again, the dataset it was spotted in was the end product of multiple stages of processing, multiple datasets being combined together, taking these events and those events. At first folks were thinking, "Hey, maybe it's this dataset that's the issue, because this dataset feeds into this dataset." Or, "I know there've been issues before with this and this." Everybody was kind of talking to each other because I think everybody was a little scared: let's not call an incident just in case it's something really obvious, like a marketing campaign being stopped, so the drop was expected. So a key learning: if we'd opened a triage incident sooner, we could have got the right folks into the room who could have looked at this. We eventually got to the point where it was like, "No, this is affecting enough to justify raising the incident."

So let's summarize where we are. I've gone through a lot of questions. This was a really interesting combination: one of our engineers made a change to some legacy Objective-C code; fine. Automated checks had been disabled for that code, and the change was missed in review. The automated tests written for this emission continued to pass during the rollout. The dashboard checks were missed. Monitoring didn't fire when this data was then ingested. The DI alerts didn't trigger because we had wildly misconfigured thresholds. At every point, our tooling looked green; it all looked good. You can see how it was this perfect storm where we ended up with the Slack message that led to a lot of the learnings we'll talk about.

I'm at a conference that even has an "In the age of AI" tag behind me, so I couldn't stand here without briefly touching on this. One of the things I want to call out that actually really did help was a lot of the data analysis. As I said, we had hundreds of affected datasets, and technical and non-technical consumers. The ability for them to write natural language queries to understand impact was a huge help. We didn't have to be running SQL queries in our data platform; they could just ask, "My dataset is X. Was it affected by this incident?" and get a yes or no response. Really helpful for us. Where it didn't help: monitoring. I touched on these thresholds earlier. A lot of these were supposedly AI-enabled monitors that were supposed to intelligently detect these things, and they didn't.
So that wasn't helpful at all. Even the debugging: we went back after the fact, took this legacy Objective-C code, fed the changes into a bunch of models, and asked, "Can you find what's wrong?" Not a single one caught it. I understand everything's moving very quickly. I have to say, even for myself, I'm very excited to find out, since we do use Incident.io, whether their SRE would've caught it. I think I'm going to test it when I go back, but at the time of the incident, that didn't help us.

A few things did go well. I'm here representing Skyscanner, so I'd be remiss if I didn't talk about some of the communication we had. It was really great that we had a no-blame culture and everybody pulled together. We did have the incident Slack channel, and we managed to keep things centralized. Again, the no-blame culture and the collaboration on actions were great; everybody pulled together. But what were these actions? I always think the interesting part of these incidents isn't what happens; it's what happens next. How do we take this and learn from it?

So we had our corrective actions: of course, we fixed the issue, and we backfilled our datasets, which mitigated the impact. Preventative: we looked through a lot of our alerting and updated our documentation. Then we got into the strategic, and this was, I think, really interesting. We went through all of our alerting and basically said the thresholds suck; we can't rely on them; we have to do an audit of everything that we have. Also, and this is one line, but it's actually pretty huge for us internally: we committed to removing all of our legacy Objective-C code. Our engineers back in our different offices are working on this as we speak. This was something that had been spoken about for years and years. We had the long tail, just a tiny percentage left, but it was underpinning some of this, and the fact that this incident has led to this is, I personally think, a huge win.

To wrap this up, this was actually a really fun one. I say "fun"; now I can reflect back on it, but in the moment it wasn't. Think of that graph. Hopefully it shows how systems are complex. All of these things can conspire against you. We thought we were very mature in all of these areas, and everything just happened to hit at once. But we've learned a lot. We've taken a lot from it. I'd be very happy over lunch to speak if anybody has more questions or wants to find out a little more; there's a lot I couldn't fit into the 20 minutes here. And now I can say I've run my incident. I don't know if I want to run another one again very soon, but I'm happy to say that at least I've been up here to speak. Thank you very much for listening, and thank you for your time.
