Writing up your most embarrassing failures for the world to see is good, actually

Brian Scanlan, Sr. Principal Engineer at Intercom, shares why Intercom publishes public RCAs: to build trust, improve reliability, and, ultimately, make their engineering teams better.

Brian Scanlan, Senior Principal Engineer, Intercom

The transcript below has been generated using AI and may not fully match the audio.

Hey, it's great to be here. I've built a lot of credibility in my career just working on incidents, fixing things and so on, so a conference that's all about availability and incident response is a dream topic. Also, one weird trick: if you want to get accepted at a vendor conference, write up a snappy title for a talk about a feature they shipped that they really like. That kind of thing.

So I'm Brian. I'm an engineer at Intercom, and I've been at Intercom for 11 years. Intercom is a customer support AI agent business with a legacy, mature B2B SaaS business on the side. We have tens of thousands of customers who use us to support their customers, but increasingly it's our AI customer service agent that does this work.

This move to AI has really increased the pressure, or the need, for us to do a great job at improving uptime, incident response, all that kind of good stuff. We've shifted models. We still charge for seat-based access to our help desk products and those kinds of things, but the way we charge these days for our AI agent is that we bill per conversation resolved by the bot. So if the bot does a bad job, or has to get a human involved, or the customer isn't very happy with the response, we don't get paid. If they are happy and the conversation gets resolved, we do get paid.

This means we've shifted to something more like a retail or finance style of setup, where our uptime is directly related to our revenue. The pressure and the stakes have definitely risen recently. As well, there are tens of thousands of people whose job it is to be in Intercom all day, talking to their customers. So this stuff really does matter.

In my talk I'm going to show a bunch of internal stuff and tell a bit of a story about how we've
thought about writing up public incident reports, and how that has changed over time. But here's just some cool stuff. Here are some incident numbers: this is our uptime over the last few months. As you can see, April was a pretty bad month. Things have improved. We might have been in negative territory this month if it wasn't for AWS.

I also maybe should have completely rewritten my talk after the events of this week. The first thing I did this morning was read the AWS incident write-up of Monday's outage. If you haven't read it, take an hour or so if you're a dev. It's very long. Maybe they're just trying to bore people to death with it or something. But there have also been some really good write-ups by different businesses about what effect it had or didn't have on them. There was a particularly good one from Oso where they showed how their architecture helped them not get taken down by it. The likes of Honeycomb and others have written up some really good reports as well. So I could have just thrown out all this content and talked about all of the cool incident write-ups that have been published on the internet over the last few days.

But yeah, so what's going on here? We had lots of database issues earlier in the year. At Intercom we do follow a lot of best practices, like a mature company, and we know what we're doing by now. Availability and reliability are aspects of product quality when it comes down to it. You can decide how much to invest in these things, but your culture will lead any kind of strategy or process or technical tool. Being really dedicated and committed to this stuff means doing it for a long time and taking it seriously, with things like work in areas such as availability or improvements in process.
And that needs to show up in things like promotion packets and be praised by leaders and all of these kinds of things. It's one part of quality. There's also design: system design, architecture design. All of this stuff works together and comes out in your availability numbers, but mostly in your customer satisfaction and confidence in you.

So we're far from perfect. I think that's okay. We make some reasonable trade-offs. Though, don't worry, we don't want these database issues. And the hard part of being good at something is knowing all the stuff, being constantly aware of all the problems that you have.

So I'm going to go through three areas and look into why we've published, why I've spent my time writing all these incident reports publicly, and we'll use these areas as ways to look at it. It's like, why bother? We have limited time. No one pays you for your incident reports. So why, as a company, would you not just give generic responses? "Yeah, we're sorry for going down, blah, blah, blah." But I do think they're important. I've benefited a lot from reading other people's. But first I'm going to start off with a side story. I'll come back to this stuff.

As part of Intercom's journey: when I joined, Intercom was a lot smaller company. We were selling to other startups, and we went on this classic journey of moving upmarket. We kept on selling to bigger and bigger companies. We had real product founders, with a product background, but we started getting addicted to big sales, having lots of salespeople, selling to bigger companies and this kind of stuff. But when we started off many years ago, one thing that worked well for Intercom, when we were a lot smaller, was blogging, just dumping on our blog.
One of our founders in particular, Des, is very good at writing up the stuff that was happening inside of Intercom and just putting it on the blog. And that's content. It gets people onto your site, it gets you leads, but it also builds confidence. People know who you are, they know how you think about things. Often when you're buying a tool, choosing between vendors or whatever, it's not about who's got the most features. That's one way of buying things, but a lot of the time it's: who's got the best vision? Are they thinking about this the right way? Are they building for me? Getting this kind of content out there can build that kind of customer trust.

So we started doing this as well, not long after I joined Intercom. We had availability problems. We had a lot of product-market fit around the time I joined, and this is a curse. We had this huge growth, and we were building features not so that they would be reliable; we were building features because we wanted to build the best features and get them into our customers' hands. We were going through a lot of extreme growth at the time, and that put things like our databases under a lot of pressure, and we had to do a lot of work.

One of the strategies to keep our customers on board and let them know what was going on was just writing blog posts. We started doing what GitHub is pretty good at, which is writing periodic blog posts about all of the systems or infrastructure improvements you've made, as well as writing up individual outage reports for some of the incidents. This seemed to go down well and I got good feedback. People seemed to like knowing what's going on behind the scenes.
Especially when you're selling to smaller companies, they're all kind of looking at Intercom and happy to read the content. I remember one time I wrote up an incident report, and it was a classic MySQL problem where we dumped a load more data into a particular table, the query plan completely changed, and latency went through the roof. We took measures to fix this. And then one of our customers reached out to me going, "Hey, have you tried this setting? This took us down recently enough." So that was like, oh, people are actually reading this. It was useful and it felt valuable.

But over time, as we kept on moving upmarket and things were becoming more sales-driven, we started thinking about how we could package up things like incident reports and RCAs as part of premium support packages, and putting in place high-touch customer support and customer service with customer success. All of these things are well suited to selling to larger enterprises. And we ended up accidentally monetizing incident reports. It was part of a package: as a customer, you could request via your salesperson, if there was some outage or some bug or whatever, "Hey, could you write up what happened?" Then the salesperson would get in contact with us, probably through some Slack process, and we'd write it up. We even mimicked what, say, Amazon do when you ask this of Amazon: encrypt a PDF and send it to the customer. It was just a lot of process and a lot of stuff.

It was the right thing to do at the time. These were all completely rational decisions, and we were in an environment where this made sense with the way we were framing these premium support packages.
But it didn't feel great. We certainly weren't getting them out fast because of all of this process and bureaucracy that we had inserted.

Then, when Intercom went through a big transition a few years ago, where we shifted back to being more of a product company, with more product thinking as the core of the company, and being more customer focused, to be honest, we threw it out. We didn't stop publishing RCAs. We just stopped following the process of having customers ask for them and then sending them a week later or whatever. We can actually do way better at this. We can just, on the day of the outage or ASAP, publish it online like we used to, on our incident.io status page. And when a customer comes asking, we just point them there and let everyone know, "Hey, we've just published the RCA." So you get it out fast, you get it out without all of the friction and overhead, and you make it available to all of your customers. You get way more benefit for the time and effort you spend actually writing this stuff.

I call this an experiment: let's just try this other process that's a lot easier. But speed, I think, was an important part of this. Being able to get something up fast shows you're on top of your game. And I think other vendors are pretty good at this stuff as well. So, top tip: just do not monetize incident reports.

But coming back to why these are important, or what I enjoy, or where I spend my time on these things: community. Here's a good community of people who are into availability. I'm on home territory here. But why would you bother with this stuff? I've learned a lot from reading write-ups from other companies. And there are good people who reflect on these things on the internet.
Back when Twitter was good, there were good discussions on Twitter, but people write blogs and stuff like that about it as well.

When it comes down to it, in tech we do learn from each other. We go to conferences, we write books, we write papers and all this kind of stuff. I think this is the way: you need to share your learnings, help the tech community as a whole, and show your work or get things out there, even when you don't know what happened. This is a way of working that raises all boats, rather than only sharing these things internally. It also gives you more eyes on things. By looking at and paying attention to other companies' outages, you can see some stuff that might be coming down the line for you, or just get different perspectives and understand what sort of sociotechnical problems are impacting different environments and will cause issues and downtime. I've just habitually read these things, tried to find them on the internet, and engaged in different communities that take an interest in this stuff.

So personally, I think it's benefited me a lot in my career and in my understanding of what to do. And giving back feels good as well.

Ironically, I'm going to reference an AWS write-up here. I've always loved this one. It's from 2008, when S3 completely melted down. It was due to a single bit flipping in their gossip protocol, which was their backend service discovery thing that was in place at the time. But I really love the sentence at the end: "We know that any downtime is unacceptable and we won't be satisfied until performance is statistically indistinguishable from perfect." That's an SLO. The criticism I'd give here is that it's signed by the Amazon S3 team.
I think these things are actually better published by people, almost like a paper: just say who wrote it and take personal responsibility for it. I don't do that enough myself. I should probably add my name to more of these things that I publish. But I think they're better coming from people. I think this was a good write-up. I recommend looking at it.

There's a really good resource from Dan Luu. He's given some of the best talks I've ever seen. He's got this GitHub repository, which is a list of categorized outages. You can spend a month just reading all of these and looking into them. Also, it's just a GitHub repo, so feed into it, do contribute. So this is a good resource if you're looking for classes of write-ups or for interesting parallels. A really good place to start.

And then there are some really good blogs. Lorin has a really good blog, which goes through themes and things he picks out of different incident reports. He goes into way more depth on this kind of stuff than I do, with a proper mix of academic and technical approaches, but he also pulls together the themes very well and shows how different companies are doing. So there's good stuff out there. It's harder to find these days, but it's good. And Lorin posted on Bluesky this morning: just read the AWS thing, it's great. And copyconstruct on Twitter, who I didn't mention in my slides: her Twitter account is a spectacular source of in-depth analysis of outage reports. So lots of good stuff out there. You all probably know this anyway.

Anything else on this? Like I said, we're practitioners. We're all working together to make the world, and technology, a better place. So that's the community aspect of this.
But also, I think, having high standards: if you actually write something up publicly, that forces you to do a better job. And it's almost like a flex. Being happy to spill the beans on what's going on inside the sausage factory means you can actually publish something credible. You might even put your name to it. You understand things well enough to be able to say publicly what happened, and you're willing to share this not just with customers but also publicly.

So I think it forces higher standards. It makes you reflect on things. It certainly does focus the mind. I regularly show up to chats with customers about things that broke, and it really does focus the mind when you're talking to a customer and they're telling you the impact. You're like, I'm going to take this stuff more seriously in future. So it's a very motivating thing if you're going to be publishing this publicly or talking to customers.

I also just hold companies in high regard who do this kind of thing: Cloudflare, Honeycomb and others. It gives me confidence in them. I think it's important as part of a buying decision. These days there's a lot of building in public. Especially with AI features, there's a lot of uncertainty, and people are trying to make distinctions between their products and other products. I think failing in public, or showing off your failures in public, is another way of building that kind of confidence for your customers or prospective customers, and maybe even for hires, people like you. It helps attract people in recruiting if they can see how you actually think about things and that you take things seriously. It's not too bad for your brand. But I think it is a forcing function that causes you to maybe do a bit better internally when you know the consequence is that you're going to write stuff up publicly.
This is in mostly a positive way, not as punishment.

Being transparent as well: customers do appreciate it. Sometimes I don't really know what they do with these things. Sometimes it's box ticking. They just want to make sure they've followed up with the vendor. Other times they are interested: they want to see the evolution, or that you take their concerns seriously, or they're interested in what's going on. We don't really have data here, but it certainly gives them a better story, especially at contract renewal, those kinds of times: yeah, we had these outages, but here's how we followed up, and we've been public with this, and we had meetings with you. All of that kind of stuff helps. So it's this little contribution back to Intercom sales that I like to take credit for.

Okay, where's my embarrassing outage? Hilariously, it's a DNS outage. Last year I wrote up this report. I broke our internal DNS on most of our hosts, so for our asynchronous workers. So it was a good, fun write-up.

When I joined Intercom, one of the first things I did, and something I've made sure has always been the case, is to keep things super simple and flat. I used to work in networking at Amazon and other places, and so obviously I hate networks. I've just kept things super simple, and it's maybe the most impactful thing I've done at Intercom.

What we unfortunately did was add a private subdomain, which is already going off the simple track, for internal ProxySQL endpoints. We use ProxySQL to scale MySQL. And we had these low-level failures of DNS resolution on our EC2 hosts. EC2 hosts can only do 1,024 DNS lookups per second, which sounds like a lot, but on gigantic hosts it's not many. So I rolled out a caching DNS resolver. This is all pretty uncontroversial.
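
Editor's illustration: the reasoning above about the per-host lookup limit and the caching resolver can be sketched as simple arithmetic. This is a rough sketch, not Intercom's tooling. The 1,024 lookups per second figure is the one quoted in the talk; the process count, per-process lookup rate, and cache hit ratio are hypothetical numbers chosen only to show the shape of the calculation.

```python
# Limit as quoted in the talk; treated here as an assumption of the sketch.
LOOKUPS_PER_SECOND_LIMIT = 1024


def upstream_lookup_rate(processes: int,
                         lookups_per_process: float,
                         cache_hit_ratio: float = 0.0) -> float:
    """Lookups per second that still reach the upstream resolver.

    cache_hit_ratio is the fraction of lookups answered by a local
    caching resolver (0.0 means no cache at all).
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return processes * lookups_per_process * (1.0 - cache_hit_ratio)


def over_limit(rate: float, limit: float = LOOKUPS_PER_SECOND_LIMIT) -> bool:
    """True when the host would exceed the per-host lookup limit."""
    return rate > limit


# Hypothetical large host: 96 worker processes, 20 lookups/s each.
uncached = upstream_lookup_rate(96, 20)
cached = upstream_lookup_rate(96, 20, cache_hit_ratio=0.95)

print(f"no cache:   {uncached:.0f}/s, over limit: {over_limit(uncached)}")
print(f"with cache: {cached:.0f}/s, over limit: {over_limit(cached)}")
```

With these made-up numbers the host is over the limit without a cache and comfortably under it with one, which is the motivation the talk gives for rolling out a caching resolver.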
It turns out we had slightly different security groups for web servers versus async workers, and I did all my checks on web servers. And then there was a weird behavior in Unbound. You tell Unbound to forward all your queries to this endpoint, but the way that internet resolution worked meant that at startup it just ignored that and dropped our internal records on the ground. So hosts couldn't resolve the private domain, and workers couldn't talk to our MySQL servers when our new AMI went out. It took out all of our async workers in the middle of the night when we were doing our AMI rollout.

I got paged into this. I got the page: all workers down in us-east-1. This was like three or four in the morning, and when I was walking down the stairs, I was like, I wonder, is this anything to do with the DNS thing I did? So I didn't really diagnose it. I got there and said, yeah, just roll back.

So yeah, it was DNS. It's always DNS. Hilariously topical: this would also be very topical this week. I didn't stick to keeping things simple. I let this kind of complexity get in. And we knew about this exact type of outage, and we had tools and health checks and all this kind of stuff. None of them worked. A lot of stuff that we designed specifically to catch these things just failed.

So that's my embarrassing story. Publishing publicly, though, forces you to really own your availability. I think customers do appreciate it, the community appreciates it, I appreciate it. So I recommend doing it if you can, and take advantage of those beautiful features in incident.io. Thank you.
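
Editor's illustration: the DNS story turns on checks that were run on web servers but not on async workers. One way to guard against that gap is a pre-rollout gate that refuses to proceed unless every host class has reported a clean resolution check. The following is a minimal sketch under stated assumptions, not a description of Intercom's health checks: the host class names, the hostnames, and the `rollout_blockers` helper are all hypothetical, and in practice `unresolved_names` would have to run on a host of each class.

```python
import socket
from typing import Callable, Dict, Iterable, List


def unresolved_names(names: Iterable[str],
                     resolve: Callable = socket.getaddrinfo) -> List[str]:
    """Return the names that fail to resolve through the local resolver."""
    failures = []
    for name in names:
        try:
            resolve(name, None)
        except OSError:  # socket.gaierror is a subclass of OSError
            failures.append(name)
    return failures


def rollout_blockers(results: Dict[str, List[str]],
                     required_classes: Iterable[str]) -> Dict[str, str]:
    """Map each host class that should block the rollout to a reason.

    `results` maps a host class to the names that failed to resolve on a
    host of that class. A class with no entry was never checked, and that
    blocks the rollout just like a failed lookup does.
    """
    blockers = {}
    for host_class in required_classes:
        if host_class not in results:
            blockers[host_class] = "not checked"
        elif results[host_class]:
            blockers[host_class] = "unresolved: " + ", ".join(results[host_class])
    return blockers


# Hypothetical run: only web servers were checked before the rollout.
checked = {"web": []}
print(rollout_blockers(checked, ["web", "async-worker"]))
```

The design choice here is that an unchecked host class counts as a failure, so "all checks passed on web servers" cannot be mistaken for "all checks passed".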

London 2025 Sessions