SEV0

The Pearl, San Francisco

There is no such thing as a free lunch. How Slack runs their incident lunch exercise

Scott discusses how Slack runs its incident lunch exercise, a fun, time-pressured, role-playing activity designed to teach effective incident response skills by simulating a lunch order crisis.

  • Scott Nelson Windels, Engineering Manager, Slack
The transcript below has been generated using AI and may not fully match the audio.
Hi, I'm Scott. As you can tell, I'm from Slack. That's a great talk to follow, but this is a bunch more practical and pragmatic, and a little more lighthearted than a nuclear meltdown. Welcome. I'm gonna walk through what our incident lunch is and how we learned from it, but first, I thought I'd introduce myself.
I've been at Slack for seven years. I dropped out of college to start an internet service provider in 1996. So I started long ago in tech, and you'd be surprised, when you cut your own ISP off from the internet several times, how quickly you learn how to do incident response. I included this slide for fun: I've been doing whitewater rafting the last three or four years, which I like to reference because I go with a bunch of friends, since I don't like to do adventure sports on my own, and it's a lot like incident response. You need to be at peak performance yourself, but you have a team around you to help, and that, to me, is what incidents are. Everyone there should be acting at their peak performance, but you've got people to rely on, because in every incident we run into things that we don't know how to handle. For anyone else who's into rivers, this is me doing the Rainie Falls fish ladder on the Rogue River in Oregon, in prep for doing the Grand Canyon last summer.
I'll start a little bit with where we started at Slack. I started at Slack in June 2017 on the monitoring team. As you can imagine, back then Slack was taking off pretty quickly. Before a company hits its product-market fit or hypergrowth stage on the way to that success, formal incident response and management often doesn't get much attention.
So when this curve starts going up, incidents become more painful as their scope and scale are now on an upward trend as well, and customers start to care more often, and they often care very loudly. At some point in every successful company journey, you have to get better at incident response, and usually that happens fast. Most small companies fly by the seat of their pants until they've grown enough that it becomes too painful not to invest in improving their incident response muscle.
As a fun fact, Richard Crowley was at Slack back here. My first two weeks at Slack, I was standing at Richard's desk responding to incidents, and we've evolved quite a bit since then. We used to have very rudimentary incident procedures, and then we started on this journey to develop and mature our program.
About two thirds of the way through that slide, so 2018, we started having problems with incident communication, span-of-control issues, executive swoops, and the lack of a well-known or formalized process on the engineering side. As another fun fact, our customer experience team was actually much farther advanced than the engineering team at this time. But we started to have simultaneous incidents as a regular occurrence, and it wasn't so easy to run them all in our ops channel, which was where we would all dogpile in and swarm. All of a sudden it gets confusing when you have multiple incidents going on and you don't know which incident you're talking about.
So my manager at the time, Chris Merrill, brought in the folks from BlackRock3 Partners to run some incident training, to give us some grounding in how we might grow the program. And then a few months later, we started hiring an experienced team.
Folks like Brent Chapman, who's here, and Nora Jones joined our team, and so that was another part of that effort.
Along the way, just to give a side note and a shout-out for training: one of the foundations of our program grew to be a deep commitment to training our staff. We made a deliberate decision to have incident response be an organizational strength, not a NOC-style, centralized-team approach. Our onboarding track for new engineers includes an intro to incidents course that's self-service, and then training for incident responders and commanders. So every engineer who joins Slack has to take an incident responder course that's about two and a half hours long. We offer those bimonthly, and we still do it. Brent helped me start it, and we're still doing it years later. It's a firm commitment, because if you stop doing that training, people don't know what's going on. And then we do some deep-dive sessions. So by the time people join an on-call rotation for incident command or response at Slack, they've done six to eight hours of training, and that makes a world of difference. We have a shadow program as well, which you can ask me about later.
So self-service training and live courses are part of this puzzle of how you get people trained, but there's a missing section in most organizations: how do you give your staff practical experience outside of that shadowing? How do you make it feel real before you join a real incident? We were struggling with that. Enter the incident lunch.
So this came out of training with the BlackRock3 Partners folks in March 2018. They offer a two-day training based around IMS, the Incident Management System, and how it can be used to build an incident response program. They run an exercise called the Lunch Break Exercise during these sessions. They assigned some roles to the group and set a limited time box for the group to get lunch delivered to the training room. It's pretty simple, but it turns out it was really fun. Their focus was on having some constraints in place, teaching through that role modeling, and putting time pressure on the group.
Just to highlight again, because this is worth noting, the successful parts of this exercise.
Time pressure. There was limited time to get lunch, and people are getting hungry as you go through this exercise. Most spicy incidents have a lot of time pressure from the very get-go.
Role playing. This gave people a chance to try on different hats. While incident responders are the most common roles in our incidents, or most people's incidents probably, we needed to start expanding the other roles that people were comfortable stepping into. There are many hats, and many of you sometimes have to wear a hat you haven't worn before.
And then the constraints. They called them considerations on the last slide. They had simple rules like no fast food or pizza, basically making it harder than calling Domino's, to give it a little challenge. Most of our incidents also have an unknown set of constraints before we get into them, so getting used to dealing with constraints is really helpful.
And then, of course, it was fun once lunch actually got there and you got to eat. Part of the fun came from that inherent team building as well. So yeah, make it a team effort.
The group of about 15 of us that ran this exercise took it and turned it into a regular training opportunity for the Slack staff. It's important to let this be grassroots and engineering-led, I think, not just top-down. Here are some examples of our first posts about it; the team who took the training just wanted to bring this back to a wider group. And this was an easy way to start training engineers across the Slack engineering org.
We crafted it so it was easily accessible for anyone in the company. There's no setup or expertise required for this exercise. It turns out everyone is a subject matter expert in eating and ordering lunch. No skills required; anyone in your company can participate. We would invite them to this two-hour incident training, tell them that lunch would be provided, and then when they arrived, we would give them a 15 to 20 minute refresher on how incident response works at Slack.
And then we dropped the bomb on them: the lunch order fell through. Lunch is not showing up. Then we would tell them, now your objective, as this group of people assembled here, is to order lunch as quickly as possible, and give them the constraints that they had to work under. They need to do this using our incident response processes as they know them: a Slack incident channel, an assigned Incident Commander, and status reports posted in the incident channel. The facilitators acted as coach and referee for that group.
For us, this was great. Low effort, high reward. As a fledgling incident management program, we needed to pay attention to real incidents and didn't have a lot of time to spend on developing deep programs. All you need to run this incident lunch is an outline for the setup. You need to book a conference room or two. As I'll share later, we used a GitHub repository with a Markdown file, and we used GitHub Pages to turn it into a slide deck.
And I've created a public one in my GitHub account, so anyone here can copy and clone it if you want to run this lunch afterwards. You need a workflow to announce it: find the announcement channel that you want to post it in. You need a small staff; in our case we had a few engineers running it. And by the way, you can mention to them that they get free lunch every couple of weeks, so there's an incentive to facilitate and run this. And then you need a small budget to pay for it. It's three to five hundred bucks if you get ten people, right? It's not very expensive to run this in the scheme of things.
And then, bring the chaos. Tricia Bogan, who was a lead engineer on a team called AppOps back then, created a really key element for our exercise that sprinkled in extra fun and time pressure. We've come to call what Tricia created the Chaos Cards. They added new elements of variability and unpredictability to the exercise. I believe Tricia says that she modeled these after the Pandemic game. The set of cards has different actions or events that change the course of the exercise. So one card might be something like the Laptop Trouble card, where you pick an SME in the room and they have to turn off their laptop for the rest of the exercise. There's one that says Eerily Quiet: pick a new card in two minutes. There's a terrible, horrible, no good, very bad day card, where you have to pick two cards immediately and play them. It's pretty fun.
We play these cards throughout the exercise at a timed cadence, usually starting at every five minutes, to add that unpredictable element to the exercise. It makes participants more uncomfortable after they get settled into the exercise, which is great practice for incidents, right? Because incidents often have a facet of being uncomfortable and out of your normal element. And at the end of this, I have about 30 or 40 sets of these cards.
So if you want to meet up with me, I can give you some of these cards.
After a few runs, we were able to get things running smoothly. We ran sessions in our San Francisco office, but I was able to run them in Dublin, New York, Denver, and Vancouver as I traveled around. Again, it's super easy to set these up. I've also started running these again in the last year, now that we're back to being face to face more often.
Let me talk about some of the reasons why this was successful for us on a slightly deeper level. How many of you have spare engineering folks sitting around that you can assign the task of creating games and tabletops? Anybody have spare engineering resources? Or how many of you have leadership that's fine when you say, we're gonna go play some games instead of doing actual reliability work? Let us know if you have those extra things, because I'm sure some people here would love to apply for roles at your organization. But we didn't have a lot of time.
Even if you're lucky enough to have some engineers who do a skunkworks project, or do this behind the scenes, will they be able to keep it up? Is it just going to go for a year? I've got some friends from PagerDuty who talked about running some kind of Dungeons & Dragons style game. That is very engaging, building this kind of cool, choose-your-own-adventure setup. But then you've got to have someone run it and keep it up, it's got to change over time, and you've got to have a more experienced game master. In my references, there's also a blog post by Paul Kirk about using a game like Keep Talking and Nobody Explodes as a way to gamify incident response, but even that takes a certain amount of setup, people have to pay for and buy it, and it usually only works for a group of four to six people. So there are other options out there, but for us, this turned out to be super easy to just keep running at a really fast cadence without much investment. It's simple, and it can be crowdsourced.
Again, I mentioned we had several people do it. I bet Brent and I could train someone to run this exercise in 30 minutes just by giving them an example. We created a Slack channel, of course, for folks who were coordinating, to keep our notes and the schedule as we were documenting things and how we set it up.
Another key component here is that this works for any type of staff. Like I mentioned earlier, we regularly include our customer experience teams. I've run this for CSMs. We can have salespeople and TPMs come, and they can sit right down and participate and get the feel of an incident. There's a certain special connection you can see them start to build when they feel that pressure, so it also builds a lot of empathy for actual incident responders. There's nothing we needed to do to prepare them for the training. We just tell them to bring their laptop and that they need two hours. And as I mentioned, we tell them they're having an exercise, with that little bit of a fib.
The group pictured here is one of our intern cohorts, from a blog post that was live on the site. I wanted to include it because next to our former CTO is an intern who is now one of our major incident commanders, and who likely went through this exercise. If you have interns, one of my favorite lunches of the summer is to do this when the interns are here. They love free lunch. They're super engaged and energetic. And you will quickly find out who has the best predisposition to join your incident rotations, if you hopefully get to hire them full time.
So how does it actually work? How could you do this in your organization today? As we talked about, the setup is pretty easy. You invite folks; we have a channel called Announcements Learning at Slack.
Hopefully you have some announcement channels where you can share things with your engineering org, or whichever org is interested in incidents. Get a conference room. Find a volunteer who can play that incident commander role. We'll talk a little bit more about this. They don't have to have done it before, but we'll talk about why that's a good thing. Check that your slides are up to date with the most recent incident process updates. This usually takes me five or ten minutes, just to make sure our slides aren't too out of date. And then if you're in a new location, you may need to update your map of what we call the lunch exclusion zone. This is the lunch exclusion zone around Salesforce Tower. One of the constraints is that you have to go outside of this zone to buy lunch, so you can't just walk down to the food trucks outside Salesforce Tower. They can't cheat; they have to put in a little bit of effort.
Then you start running through that training intro, like we said. You give them some background on IMS, how you do incident response, and the basics of why we are here and why you're getting trained for this incident response process. Walk through some of the common roles in your incident response process. In our case, we cover Incident Commander, the Customer Experience Liaison, and Subject Matter Experts. You may have other roles. We have scribes sometimes, but we don't cover that in this training because it's not that important here; we're trying to train them for the basic incident.
Talk through how an incident starts at your company. It's really important for them to know how it gets kicked off. We have a process at Slack called /assemble, and they find out, if they didn't know already, that this is how you start an incident. This is how anyone in the company starts an incident. It's very valuable for everyone to walk away knowing how to start an incident if they want to.
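The lunch exclusion zone mentioned in the setup steps above lends itself to a quick mechanical check when a facilitator moves to a new office. The sketch below is only an illustration: the talk shows the zone as a map and does not say how it is defined, so the circular shape, the example coordinates, the 0.5 km radius, and every name in the code are assumptions.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0


@dataclass(frozen=True)
class ExclusionZone:
    """Hypothetical circular zone: a center point plus a radius."""
    lat: float
    lon: float
    radius_km: float


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_allowed(zone: ExclusionZone, lat: float, lon: float) -> bool:
    """A restaurant is allowed only if it lies strictly outside the zone."""
    return haversine_km(zone.lat, zone.lon, lat, lon) > zone.radius_km


if __name__ == "__main__":
    # Placeholder center and radius, not the real map from the slides.
    zone = ExclusionZone(lat=37.79, lon=-122.40, radius_km=0.5)
    print(is_allowed(zone, 37.791, -122.40))  # about 0.1 km away: inside the zone
    print(is_allowed(zone, 37.80, -122.40))   # about 1.1 km away: outside the zone
```

Updating the zone for a new location then only means changing the center and radius, much like redrawing the map on the slide.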
What is the main goal of your incident response process? It's really important for people to hear this and say it out loud. In our case, it's to restore service as quickly as possible. I'm guessing most of you have something similar. This is where you start to open up their minds to thinking about things like, do I fix forward or roll back? If my goal is to restore service as quickly as possible, I should probably almost always roll back and not fix forward. You plant that seed: what's my north star here? You'll see a little more about why it's important to plant this seed in the lead-up to the exercise.
Then give them your favorite tips and tricks for incident response. Some of these we get from the PagerDuty response training, which is a great slide deck to build some training off of. Be clear and concise. Develop multiple plans. This one, by the way, is a good one to mention because some of our chaos cards will do things like say plan A failed, so you have to drop plan A and move to plan B, and if they don't have a plan B, they're starting over from scratch. Use timeboxing, and explain to them why timeboxing is important.
This is another quick example from our real training. In that intro lead-up, we talk to them about what we call the Slack incident lifecycle, and about our severity levels. Again, this is very nuts and bolts; it's a super fast overview. When they take the real responder training, it's much longer, but this is a great way to expose them to concepts they'll hear when they get into that incident channel. I think Stephen was talking earlier about language: when you democratize it, that shared language is really important across your teams and your org.
After you've finished the intro and built it up, then you drop the surprise on them, right?
We typically say the front desk notified us that the bike messenger delivering lunch is on the other side of town. Typically most people don't know this is coming. Occasionally some people are return participants. I don't think I included this, but at the end, always tell the participants: please don't tell other people at the company about the surprise, because it's more fun if it's surprising.
Then tell them that they are now our ad hoc incident response team, and share the rules. Our rules are basically: orders must be made outside of the lunch exclusion zone. They can pick up or order in; no problem with us, you choose which one you want to do. Set a per-person budget limit. I updated this; it used to say 20, but probably 25 is better now, maybe 30, I don't know, with inflation. Mention that lunch is expensed and they should keep receipts. If you're at a bigger company like Slack or Salesforce, I often give them my card to put it on, but do what works for your company. If it's a 50-person company, you might need to have the manager pick it up or have them expense it. They can use whatever resources they have at hand: laptops, phones, Slack, Zoom, whatever they want. And we also say anyone with real dietary restrictions must be accommodated. There are also some chaos cards that simulate dietary restrictions, which are fun. Again, this is one of our constraints. You can see this is super easy. This is a Markdown file, so it's super fun, and you can edit these to match your own rules.
Then you hand it over to the participants and have the group pick an IC. As I mentioned, it's probably good if you have someone planted. Hand it over and start your timer for the chaos cards. We have either the facilitator or someone from the group pick a card every five minutes. If they're doing a really good job, we speed them up.
You can go every three or four minutes. And if they're struggling, you can slow them down. Because, again, we don't want it to be a not-fun exercise, right? You want them to come away feeling confident and having had fun.
Then, when lunch is delivered, we run a quick retrospective while we're eating. We'll plant some things we saw as facilitators, but it's really amazing to see the kinds of insights these people have, especially if they're CE agents or CSMs. They come in and they'll notice things you didn't think of. And then you clean up the conference room and you're done. From start to finish, we could run one of these with a half hour of setup and the two-hour lunch. And again, the people who run it get free lunch; that's why the title's fun.
What have we learned in running this exercise at Slack for a few years? The time pressure and the unpredictability really mimic what people feel in a real incident. It feels more realistic than most training would. Those chaos cards give you a lever to make things easier or harder. If you have return participants, you can make sure some of the harder cards show up at the top of the deck, like playing the network outage card at the beginning of the incident, so they have to figure out how to tether their laptops to their cell phones. These are things that might actually happen in a real-world office. Rarely, but again, can they handle that bigger wrench that gets thrown in? The Slack outage card is a must for any really experienced group. As soon as the Slack outage hits in this exercise, they're like, oh, man. You test their knowledge of what's been going on so far.
And maybe pick an IC up front. We discovered that it can be really good to find someone to do that incident commander role. The worst experiences we had with this lunch were when no one was really ready to play the role of the IC and it turned into a struggle.
We often don't let the person who agreed to start as the IC know what's going on, but it's still nicer if they know they're gonna play that IC role. And we have this hangry thing, because the longest lunch, I think, and Brent probably remembers this one, was when someone literally got up and left the training. They were playing the IC and they were like, I have to go. I can't. I'm hangry and I'm out. That was one of the deciders that it was nicer to prep someone for this role.
That leads into the other learning: we discovered a lot by watching people in this exercise, and we use it as a recruiting tool. You see people, you talk to them after the exercise, and you can invite them back. You see someone that's got a lot of promise, and it's, hey, have you thought about taking the Incident Commander training? Because you would be a really great Incident Commander. Or you find great responders too, and you just know those are people who can be peers and mentors to others. You can also use this as a tool to build up confidence. If you've got someone who started as an incident commander but is maybe a little shaky, maybe not that confident, invite them in and say, hey, come have free lunch and build your skill set. You can coach them and ask them questions to prompt them.
Focus and choices. People will quickly forget what their primary goal is if they aren't experienced incident responders. Mitigating the issue and restoring service as quickly as possible, as we mentioned, is the goal in real incident response, but in this case, the objective is to order lunch as quickly as possible. We found that teams that get going quickly can sometimes get this done in 15 to 20 minutes. It's pretty impressive when they can.
But they're most successful if they don't lose sight of that goal of as quickly as possible. So when a team decides they want to take a poll of everyone in the room, all the assembled responders, about where to eat, it's not too different from when someone stops an incident and says, I want to get everyone's opinion on the best way forward, right? As the facilitator, you'd better hope you're actually going to have lunch within that two-hour window. Because this is about incident response, right? It's not an exercise in democracy. Making decisions quickly and efficiently, and often making those trade-offs that happen in incidents that you wouldn't make without the time pressure on you, is what we're trying to simulate here. So choosing your favorite food of the day instead of thinking about how to get food quickly is the wrong decision. Sometimes people learn that, and that's a great learning. And again, in this simple exercise, it takes no technical acumen to understand. Then you can talk about that at the end of the incident training.
One of our favorite things to teach people is the are-there-any-strong-objections tactic. How do you find group consensus? This is a great opportunity to drop that at the end of the training and say, instead of a poll, you should have just said, we're getting Thai food, are there any strong objections? And you could have been done in one minute instead of the ten minutes it took you to do a poll.
One other quick note: choosing delivery instead of picking up in person almost always takes longer. It might be surprising, but in one of those incidents, the online order never got transmitted to the restaurant, so you're adding in more dependencies. That's another great training lesson: if you're adding in dependencies instead of removing them, you might be lengthening your incident response.
And then, where do we go from here?
Unfortunately, this exercise is very in-real-life focused. We have never figured out a great virtual substitute for how to run this in the hybrid or fully remote world. So if you find something, please let me know, and let the other people who've run this know. We experimented with running these with one or two remote employees. We told them up front, secretly, that they wouldn't get lunch delivered to them; we didn't make that a requirement. And that was nice, so you can do that and simulate a little bit of working across Slack with people who aren't in the room. But it doesn't work at full scale; you can't run it with, say, ten people all in different locations. If you think of simple things that meet all these requirements, share them, and we'll all improve our future incidents.
Tooling is hard to use effectively here. Our own internal Slack incident tooling doesn't have a demo mode or dry-run mode, so I would love it if we could run it with our tooling. If you're able to run this with your tooling in a staging environment, that's great. We just don't do it because the overhead of setting it up for all new people is hard, but it's something to think about.
One other comment: keep a log of all your retrospective insights and notes about what happened in each lunch. These can be really helpful as you look to evolve not only this training program but your incident response program. What kinds of insights are people in these lunches having that were surprising to you? As I was preparing for this talk, going back through seven years of Slack notes, I found some great documentation we had written years ago.
And that's the bulk of my talk. I hope you'll try this exercise at your company and let me know how it goes.
I did want to give some kudos and thanks to BlackRock3 Partners for exposing us to it, and to Brent, Tricia Bogan, Joe Smith, and Colm Doyle, people who helped at Slack, and any other people I didn't mention. I also want to thank the folks at incident.io for hosting us today. There are some references here too that are great incident references from folks at Slack. I don't know if Chris noticed, but I included his blog post on the Slack developer blog. And this last page has a QR code to the GitHub repo I created, which is pretty rudimentary, but let me know what else I can include in it. It's got the sample of the cards. So yeah, thank you.
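For anyone who wants to prototype the Chaos Cards before printing a deck, the mechanics described in the talk (a timed draw cadence that the facilitator can speed up or slow down, a card that shortens the next wait to two minutes, a card that forces two immediate draws, and harder cards stacked on top for returning participants) could be sketched roughly as follows. This is an illustrative Python sketch, not Slack's actual deck: the class and field names, the shortened card names, and the effect wording are assumptions paraphrased from the talk.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class ChaosCard:
    name: str
    effect: str
    extra_draws: int = 0                       # cards to play immediately after this one
    next_draw_minutes: Optional[float] = None  # one-off override of the normal wait


# Hypothetical deck; names and wording are paraphrased, not the real card text.
CARDS = [
    ChaosCard("Laptop Trouble", "Pick an SME; their laptop stays off for the rest of the exercise."),
    ChaosCard("Eerily Quiet", "Pick a new card in two minutes.", next_draw_minutes=2.0),
    ChaosCard("Terrible Day", "Pick two cards immediately and play them.", extra_draws=2),
    ChaosCard("Plan A Failed", "Drop plan A and move to plan B."),
    ChaosCard("Network Outage", "The office network is down; tether to phones."),
    ChaosCard("Slack Outage", "Slack is unavailable (assumed wording)."),
]


class ChaosDeck:
    def __init__(self, cards: Sequence[ChaosCard], interval_minutes: float = 5.0,
                 top: Sequence[str] = (), seed: Optional[int] = None) -> None:
        by_name = {card.name: card for card in cards}
        stacked = [by_name[name] for name in top]      # cards forced to the top, in order
        rest = [card for card in cards if card.name not in top]
        random.Random(seed).shuffle(rest)              # everything else is shuffled
        self._pile: List[ChaosCard] = stacked + rest
        self.interval_minutes = interval_minutes       # facilitator can change this mid-exercise

    def draw(self) -> Tuple[List[ChaosCard], float]:
        """Play the next card(s); return them plus the minutes until the next draw."""
        played: List[ChaosCard] = []
        pending = 1
        wait = self.interval_minutes
        while pending > 0 and self._pile:
            card = self._pile.pop(0)
            played.append(card)
            pending += card.extra_draws - 1
            if card.next_draw_minutes is not None:
                wait = card.next_draw_minutes
        return played, wait


if __name__ == "__main__":
    # Returning participants: put a harder card on top of the deck.
    deck = ChaosDeck(CARDS, top=["Network Outage"], seed=7)
    played, wait = deck.draw()
    print([card.name for card in played], "next draw in", wait, "minutes")
    deck.interval_minutes = 3.0  # the group is doing well, so speed up
```

Keeping the cadence as a plain attribute mirrors how the facilitator adjusts the pace by feel during the exercise, rather than following a fixed schedule.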
