The day the database disappeared

Dr Claire Knight talks about the day prod disappeared, and how her team focused on recovery, not blame.

  • Claire Knight, VP Engineering
The transcript below has been generated using AI and may not fully match the audio.
Thanks Tom, great to be here. I have to say, this talk was inspired by some things that were going around on the socials earlier this year, if anybody came across that. But also, as Tom said, it's that time of day. We're nearly at happy hour, so I'm gonna tell you some stories rather than try and bamboozle you with facts or anything, from anywhere that I've deployed incident.io. And I've now deployed it at three different companies, so you could call me a little bit of a fan.

I don't know if any of you recall, but there's a fairly well-known person in the VC slash sales space, one of those suggesting at one point that we don't need engineers and that he can now do it all himself. And he'd spent days, possibly weeks, building things with one of the AI tools. Not gonna give them free airtime, they don't need any more. And then suddenly the database disappeared on them. And then there was a lot of panic, and a lot of "this isn't fulfilling the promise". Let me tell you, as a leader, I have encountered this with humans. We don't need AI to delete databases, although it can do a very good job of it. And I have to say I was quite amused by seeing all of this, just a little bit. Now you know how hard it is. Now you know what we all have to deal with. This is why we exist.

The other thing I think we need to ask ourselves is: how do you respond? When this happened to me, I was a skip level away from the people that did the thing, and how I showed up mattered. Adrian mentioned this in his talk earlier, about execs being in there, showing up, and what that does to your engineers. In both cases that I'm referencing in this talk, the engineers knew what they'd done. They were scared and unhappy with themselves already. They didn't need somebody coming along and yelling at them, and they were doing their damnedest to fix the problem.
In both cases, they actually put their hand up before the alerts started arriving as well. So again, you gotta applaud that kind of thing, and hopefully you're building the kind of environment where people feel safe to do that, which is something I don't think we talk about enough either. We can put incident processes and all these other things in place, but if people feel they're gonna get penalized when they make a mistake, it's not gonna help us. So I want to cover how things can go wrong with humans, and then talk about AI a little bit.

So, the domino effect. I've worked across several large distributed systems now in my career, and I know not everybody has; it's a different set of problems compared to some others. What we're now all learning, and I think it was alluded to in the demo and the stuff that Lawrence was talking about with all of these agents and the coordination, is that all of these agentic systems are now a big distributed system as well. Which means you can do a thing over here that's relatively minor and, unlike the PR that Pete did, which fixed itself and everything, that small thing can cause a massive knock-on effect.

We even saw that on Monday. I dunno about you, but the cat's cat flap wouldn't operate. I was distraught. But it's an unintended consequence of one of these big things. And I'm very glad that the cats don't have access to incident.io, because they would've been very unhappy, raising status updates constantly for me.

The other thing I've been part of: for example, when I was at GitHub a good few years ago now, we had an incident. It wasn't a database incident in this case; it was a bad release. It started generating a lot of errors, and then we took Datadog down. They were not happy. Obviously we weren't happy either, because then we were shouting at them, because we couldn't tell if we'd brought our system back up. We had no visibility of this.
But again, a lot of unintended consequences. And all of this is pre-AI. Humans can cause these domino effects: one small human action, and a lot of consequences follow from it.

Chaos. As I was building these slides, and I really wish I'd taken a photo of it, I did actually have a cat try to walk across the laptop, which was not helpful, but it did give me the idea for this slide. You've got chaos, and I'm not talking about Chaos Monkey and resilience testing so much as things going wrong that are outside of your control. Or an engineer does something in good faith and they've lacked some context for whatever reason. Chaos is inevitable here. I probably shouldn't have left stuff on the keyboard that the cat wanted to get hold of. What I would encourage you all to do is, if you have to blame anybody or anything, blame the system, not the people. Again, it comes back to psychological safety: what enabled somebody to be able to do the thing? Now, sometimes it's completely unavoidable, but oftentimes you as a leader have failed to put some guardrails into place, which I'll come onto in a little bit.

Then of course we have our big bang failures. Some of the rockets that Elon's sent up didn't last very long. Sometimes it's ambitious engineering where you know that's a possible outcome, and we have to accept that. But at scale, mistakes can be far larger too, and they can propagate. What's your role as a senior SRE, a senior principal, a leader here? What are you doing in this situation? Because there are always gonna be these things, and again, we haven't even talked about AI yet. How are you responding? Are you shouting and screaming at people? Are you jumping into Slack channels going, what's up? What's up? What's up? Even though incident.io has already posted an update, and it's pretty good at kicking you when you forget to do updates.
Or are you observing things, putting thumbs-ups on things, or whatever your company culture is for how you respond in Slack? Are you letting the engineers know that you've got their backs? Because I think that's something we as people need to talk about perhaps a little bit more than just the operational aspects of: have we got a process for this, are we checking that, are we stopping that? So this is where I wanna give you my perspective, and things that I think are beneficial for you in terms of leadership and dealing with incidents.

The crash barriers I mentioned earlier: a crash barrier is not going to stop somebody hitting the accelerator instead of the brake, but it probably means they're not gonna go off the side of the mountain. So we are limiting the impact of what could go wrong. We're accepting that there will be issues with how somebody is driving, or somebody else is driving. But we're also looking to build better roads: the tarmac that works better at all temperatures, advocating for tyres that are better in all weathers. You need to design systems that acknowledge that mistakes will happen, and make things survivable rather than fireable.

We also need to figure out how we can safely land. We have to know what recovery looks like. Now, the naive response there is "things are working again", but it's not just the system, or systems, standing back up. It's: did we lose any data, much like the demo that we saw? Which customers were impacted? How do we communicate this? Brian talked about writing incident reviews and putting them out publicly; things that build community, at least in the dev tools space, are huge. Maybe not in some of the enterprise banking spaces. But these are all things that I think you need to think about as a leader, and encourage people to do or not do, depending on what it is. So, the other thing I've seen.
In smaller companies with less technical founders, they're like: we don't need to think about disaster recovery, we don't need to worry about that, it's never gonna happen, nobody's gonna hack us. Famous last words. So I think you need to build recovery into this, but also recovery around the people.

I was not an active engineer in that situation, but I was at GitHub when they had their split-brain incident back in 2018, for those of you that remember it. That incident went on for effectively 28 hours, I think it was, which is literally more than a day. And with the time zones involved, there were people working on that, and it was things like rotating the incident responder off after four hours. There were recovery and database plans that had to be put into action over the next week, once the basic systems were up and running again, as part of the recovery plans. But then there were the key people that were involved, either by dint of being experts or because they just happened to be on call at that point: giving them some time, giving them some recovery, acknowledging what they went through. Again, I don't think we think about the human element of incidents perhaps as much as we should sometimes.

When I was building the slides for this, I was trying to get some kind of image for a systemic view, and this is what AI gave me, so you can thank the Gemini Nano Banana integration for this one. But effectively, it's never a single mistake. Everything is interconnected. There's gaps in the system. There's gaps in your leadership. There's gaps in how your engineers respond. Incidents, for the most part, I think we should think of as system failures. We shouldn't think of them as individual ones. Yes, there can be a mistake that's made.
Yes, there can be one system that goes down, but let's look at the whole thing and try and make the whole thing robust, rather than chasing local maxima.

So let's talk about AI, because it's 2025 and we can't not. Now, we've talked about failure modes with humans, but we have a new teammate. So not only is the database going to die, it's even digging the grave for us. AI doesn't sleep, doesn't need coffee, doesn't need the happy hour that we're all going to soon. And it's probably gonna be pretty good at inventing very new ways of breaking things. One of the things that I'm quite looking forward to is when we have our AI agents doing our builds and our deploys, and then the AI SRE is coming in and fixing it, and then the AI agent is approving that PR, and nobody even contacts me at all. But we're not quite there yet.

I'm a believer, at the moment, that AI is great for augmentation. It can do some things very well, takes away toil, can speed things up, everything else. But it's also going to help us in generating bad code that we might miss. And if you give it the wrong credentials, it's gonna be able to run amok across your production systems. Is the "agent goes wild" scenario in your disaster recovery plan? I think what is happening is that it's expanding the surface area for failure, and that's something we need to think about. Maybe people are not thinking about that side of it from an incident management point of view.

So, a little safe, friendly car with a little cone there, rather than a big exploding rocket this time round. We need guardrails, and we need guardrails for AI. And this isn't just "use an MCP server" or that kind of thing. It's more that if a human can do it, AI can do it faster and probably worser. That's not a word, but: expect failure. Do we expect failure with human systems? I'm not sure we all do, even though it's inevitable. I think there's a lot of "but our system's great, it's not gone down in this long", or "nobody's ever tried to do this with it".
They will. Somebody will. If you're good enough or popular enough, somebody will do something silly at some point. So I think we need to plan for recovery, and we need to really be thinking about what these agents, both building and using and operating our systems, mean in terms of incidents and recovery and resilience. I don't think it changes the fundamentals, but I think it adds a dimension that I'm not seeing enough people talking about at the minute, which is why I wanted to talk about this here today. And also because, yeah, having had two engineers delete databases, I think we need to talk about some learnings.

I think the key learning from that, with respect to any of the other things that I've mentioned, is resilience. In the two situations that I talked about, engineers were connected to the production database from their own machines. Now, ignoring whether that's a good thing or not, or whether it should have been a bastion host or not, they were. And they thought they were connected to staging and not to production, and then things didn't go very well. Those of you who've been there are probably nodding in sympathy at the moment. I've never quite done that one. Back in my day, when internets were not as good, I was connected to a database running a migration once, during a release, when it decided to drop all my connections. That was scary enough.

But yeah, these engineers were doing things that were good for the business. They were doing the right things, in theory. These were engineers that I would hire again. These are not flaky engineers doing silliness. They're generally thoughtful, they're generally looking at what they're doing, and they made a mistake. They're humans. So I think we need to think about the when, not the if.

In both of these cases, what happened? In one of them, it was a slightly less senior engineer, and I congratulated him, told him he was a real engineer now. You're not a real engineer till you've broken production.
I think that surprised several people at the time. Not leadership, but some of the other engineers. There's no point in me jumping into a Slack channel shouting at people, or in this case the incident.io channel, as we'd already rolled it out there. And we talked afterwards, and he was very much of the "wow, I wasn't expecting that reaction". But we need to bring these people up. Again, to Mary's point from earlier: where are we getting our new seniors from? He wasn't a senior yet, then anyway. This was a learning experience. Ideally one that neither he nor I wanted to have, but a learning experience nonetheless. So yeah, resilience of the people involved, and resilience of your processes.

The second incident, where the database disappeared, was at a much smaller company. I was brought in to do some scaling there, and I couldn't justify bringing in incident.io, because we hadn't had an incident that was problematic enough. That was the trigger: we installed incident.io after that. We held a postmortem and discussed ways of avoiding things, and that was an outcome that I think the founders here are probably quite happy about. So yeah, my loss of a database was there again. But in that case we had point-in-time backups, and we were able to get it back up. We were able to restart the systems pretty quickly. We were lucky in that it was at a small stage of the scale-up, not the scaling-up of the scaling-up. But I think that's, again, how we can be resilient about it. And we were able, even without a particularly formal incident process at that point, to have a really good conversation about what went wrong and why. And the engineers took ownership and wanted to drive ways of making it much harder for themselves to make those mistakes. And that's what you wanna encourage in your teams.

A bit of a summary here.
For those of you that like words. And I've not used a lot of words; I'd rather chat with you. But: we have guardrails. I dunno if anybody's read the article, but aim for the pit of success. Make it easier to succeed than to fail. We need resilience, and you need to be resilient as leaders as well. And then also recovery. AI is gonna make all of this even more important, I think. Now, whether it can help us with some of this, that's something that we're gonna explore together, I'm sure, over the next few years. But everything that we should have been doing for our humans, we need to be doing with AI tooling as well.

And then, whoop, I will leave you with a slightly more sane desk, and say I think it's important, and I think it was the picnic crew that called this out earlier, that you need to make sure you have fun. We're all at work for a lot of hours. You wanna be able to laugh, you wanna be able to recover. Sort your desk out, stand the coffee cup up, tidy your papers. But yeah, do try and keep the cats off the keyboard. They cause so much trouble. Thank you very much.

London 2025 Sessions