Boundary Cases: Technical and social challenges in cross-system debugging

Sara Hartse (Render) takes the stage to share how Render tackled the toughest kinds of incidents: the ones that cross teams, tools, and company boundaries.

Sara Hartse, Software Engineer, Render

The transcript below has been generated using AI and may not fully match the audio.
How many of you have been in a situation where you're working on an incident and you start to think the issue isn't actually in your system at all? Yeah, it happens to me a lot. It reminds me of a situation a couple of months ago when I came home from a camping trip, really excited to return to civilization. I turn on my hot water and there's no hot water. So the question is: is the problem somewhere in my apartment, in the faucets? Is the problem in the water heater? Is the problem that there's no gas in my building? From my position in my apartment, it's really hard to figure that out, and that kind of multi-stage system with upstream providers is something that happens all the time in software, and specifically at Render.

At Render we've found that these cross-provider issues can be uniquely difficult. You don't have all the debugging tools you're used to, and there are also social factors that come into play when working with support teams and organizations you don't have an existing working relationship with. So today I'm gonna give you a bit of detail on what Render is and why we care about this. Then I'm gonna talk about the techniques and strategies we've learned in the course of working on these kinds of incidents, and I'm gonna go through three different incidents to illustrate them.

So what is Render? Render is a platform as a service, which means we take our users' code, databases, and applications, and we make sure they're deployed, up and running, and serving traffic on the internet. This means that at Render we are both a platform provider and a platform customer. We use our own upstream providers for compute, storage, and networking primitives. A lot of the time when we're debugging an issue, we're trying to figure out which part of that system is having a problem.

That takes me to the core of the talk: three different incidents that all have roughly the same symptom but very different outcomes. The core symptom I'm gonna be talking about is network problems. Our users make a web request to their websites, then we have upstream providers doing our CDN and load balancing that send requests into our Render cluster, where we're running HAProxy, which is in charge of routing those requests to the exact instances of user applications. In my analogy, the upstream provider is like the city gas, the proxy is my water heater, and the user application is like my faucet. The core symptom here is that the end user is getting an error or a very slow request. The question in each of these incidents is: where is it coming from?
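To make that routing hop concrete, here is a toy sketch of host-based routing. This is purely illustrative: the real path runs through a CDN, a load balancer, and HAProxy backed by routing metadata, and every name and address below is made up.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// backends maps an incoming hostname to the internal address of a user's
// application instance. In the real system this lookup is what HAProxy
// plus a metadata store does; here it is just a hard-coded map.
var backends = map[string]string{
	"myapp.example.com": "http://10.0.1.17:8080",
}

func route(w http.ResponseWriter, r *http.Request) {
	target, ok := backends[r.Host]
	if !ok {
		http.Error(w, "unknown host", http.StatusNotFound)
		return
	}
	u, err := url.Parse(target)
	if err != nil {
		http.Error(w, "bad backend", http.StatusInternalServerError)
		return
	}
	// Forward the request to the user's instance. Any error or slowness the
	// end user sees could originate here, in the instance itself, or in the
	// CDN/load-balancer hop in front of this process.
	httputil.NewSingleHostReverseProxy(u).ServeHTTP(w, r)
}

func main() {
	// Simplified: the real edge terminates TLS and listens on 443.
	log.Fatal(http.ListenAndServe(":8080", http.HandlerFunc(route)))
}
```

The point of the sketch is the boundary: everything before route() is the upstream provider, and everything after the proxy call is the user's application.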
We'll get into that now. These are gonna go from relatively chill to increasingly complicated.

Incident number one, starting off fairly simple: we got a report from one of our highest-traffic users that they were seeing an increase in errors. Normally these kinds of customer reports are filtered through our support team, and they do some initial triage. But in this case the issue was escalated in an unusual manner, and it turned into an incident pretty fast. We started digging into this error. It wasn't an error that was being explicitly returned by either our system or our user's application, so our thought was: maybe this is something to do with our upstream provider. We opened a support ticket with them, asked if they were seeing anything, and unfortunately, no, they didn't know what was going on. Everything looked normal on their side.

So we went back to the drawing board, dug in more with our customer, looked more at the characteristics of these requests, and eventually determined that their code wasn't explicitly producing these errors. It was unexpectedly terminating WebSocket connections, and that ungraceful connection closure manifested as these unexplained failures at the end. So in this example the issue was in the user application: in my sink. Yeah, this is pretty common.

We have a couple of lessons we've learned working on these kinds of incidents. The first is that these incidents actually tell us a lot about our product. Because we're a platform as a service, we want to build a tool where users can answer all the questions they need about their own systems, troubleshoot their own issues, view logs and observability, and solve their own problems. So when we have situations like this, it's a signal to us that maybe there are gaps in observability, or tools we need to improve, to allow our users to figure out what's going on for themselves.

This incident also taught us a lot about our processes and the importance of sequence. Our support team is really good at helping users troubleshoot common problems or configuration and application issues in their code. The fact that this leapfrogged our support team and went straight to engineering incident response was actually counterproductive to driving a quick resolution. We on the engineering team are used to working on systemic, platform-wide issues, so we're biased to assume that's what we're working on when we see an issue like this. Working on this as an engineer really emphasized that the support and triage phases are critical to driving a good resolution to these incidents.

All right, incident number two. In this incident, a different user reached out to us because they were starting to see some very slow requests. This is a kind of heat map showing the impact and then the eventual resolution of this issue, where a small percentage of requests were taking multiple seconds to resolve. Because this was just a single user reaching out, we initially assumed that since only one user was seeing the problem, maybe it was something on their end.

The good news was that at this time we had also started investing in better tracing in our HTTP stack. We added traces and captured some of these slow requests to see exactly where they were spending their time. In this case, you can see the teal bar there: it's well within our own code, and the request doesn't actually get sent to the user service until five or six steps further down. This tells us exactly where all the time is being spent. Before this point we were like, it goes into our thing and it eventually comes out, but we didn't really know where it was spending its time.

Once we had that, we were able to zoom in even further. We added a couple more spans and saw that the slow part was related to accessing the data used to route these requests to the right service, a metadata store, effectively. To access the metadata store, you had to take a lock, so our initial suspicion was: okay, some kind of lock contention is going on here. The next question was: who is it contending with? Where's the competition? This was a read lock. We zoomed in even further, looking at specific slow requests on a specific pod at a specific time, and we used CPU profiling to see what code was running at that exact moment. That was what allowed us to solve the issue.
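As a sketch of that profiling step, assuming the routing component is a Go service (the talk doesn't say what it's written in), the standard library's net/http/pprof gives exactly this kind of "what is on the CPU right now" view from a single pod:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose profiling endpoints on a private, localhost-only port inside the pod.
	// During a slow period, a 30-second CPU profile can be pulled with:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	// The /debug/pprof/mutex and /debug/pprof/block endpoints can also show lock
	// contention, but only if runtime.SetMutexProfileFraction and
	// runtime.SetBlockProfileRate are enabled in the process.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

A profile captured on a specific pod during a specific slow window is the kind of evidence described here: it names the exact functions holding or waiting on the lock at that moment.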
We found that it was fighting with a process that's involved in syncing this metadata in the background, keeping it fresh and reconciling updates. We had to make some changes to those locks to ensure requests would never get blocked by this process. It was a really interesting bug because it wasn't as if we had made a code change that caused a regression. It was a creeping, very slowly building thing: the bigger this map got, the longer it took to update. A frog-boiling-in-water kind of feeling.

All right. So this is the example where the issue is inside the Render-controlled universe, which meant we could fix it ourselves: the water heater in my analogy.

Takeaways: an initial bias that slowed us down at the beginning of this incident was the assumption that because only one user was reporting an issue, the issue must be related to their configuration. It turns out that some users just have different needs or expectations. This particular user was a lot more latency-sensitive than other customers we had worked with, so they were the canary in the coal mine for this slowly worsening performance. They helped us catch it early, basically. That was a really useful realization in this incident.

The other big takeaways here were about the kinds of observability that were most helpful for us: really, it was about specificity and boundaries. We weren't sure whether the slowness was in our stuff or their stuff. The critical thing was to observe the boundary between our stuff and their stuff, see exactly what was happening at that boundary, and then get really specific. Looking at trends of slowness across our whole system was pretty unhelpful, but once we looked at a really specific pod at a specific time, we could see, "Oh yeah, this is what the CPU is doing right now," and it gave us a real smoking gun.
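The talk doesn't spell out the exact locking change, so the sketch below is an assumption about the shape of the fix rather than Render's actual code: let the request path read an immutable snapshot that the background reconciler publishes with a single atomic swap, so a long rebuild can never block a lookup.

```go
package routing

import "sync/atomic"

// Backend describes where to send traffic for one hostname.
type Backend struct{ Addr string }

// routeTable holds an immutable snapshot of hostname -> backend.
// The request path only ever reads the current snapshot, so it never
// waits on the background reconciler, no matter how large the map gets.
type routeTable struct {
	snapshot atomic.Pointer[map[string]Backend]
}

// Lookup runs on the request path. It is lock-free: worst case it reads
// a snapshot that is slightly stale.
func (t *routeTable) Lookup(host string) (Backend, bool) {
	m := t.snapshot.Load()
	if m == nil {
		return Backend{}, false
	}
	b, ok := (*m)[host]
	return b, ok
}

// Reconcile runs in the background. It builds a complete new map, taking
// as long as it needs, then publishes it with one atomic swap instead of
// holding a write lock that request-path readers would queue up behind.
func (t *routeTable) Reconcile(fresh map[string]Backend) {
	t.snapshot.Store(&fresh)
}
```

The trade-off is that a lookup might see data that is a few seconds old, which is usually acceptable for routing metadata that is reconciled in the background anyway.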
Alright, it's all been building to this: incident number three. We started seeing these really weird latency spikes. It was a combination of errors, actually the errors from the first incident, and very slow requests, similar to the second incident. These spikes were all under 10 minutes. They were all in Ohio, in us-east-1, and they were all during business hours. Never on the weekends; just during normal business hours things would suddenly get really slow.

We saw familiar themes of connections failing and slow requests, and fortunately we had our tracing nicely in place. This showed us that the problem wasn't in our system. This blue bar shows the time from when this probe check started to when it actually reached our system; it spent all that time somewhere out there in the ether, in the internet. With that evidence, we decided to reach out to our upstream provider and say, "Hey, what's going on? We're seeing these slow requests. We're seeing these errors. Can you help us?" And they didn't see anything. They saw background-radiation levels of errors, because the internet is a tough place and stuff goes wrong, but they didn't see these spikes and couldn't really help us. They gave us some general debugging guidance around connection handling, but not much more. We were left in a challenging situation where we couldn't see what was happening because it wasn't in our system, but our provider wasn't seeing anything either.

So we needed to rule some things out and do some more tests. We basically tried to form different hypotheses and then reason about what would have to be true about the world if a given hypothesis were true.

Our first idea was: maybe these proxies are somehow failing to initiate or accept incoming connections. They just can't actually start the connection in the first place. This would explain the behavior, but we couldn't see any reason why it would be happening in Ohio specifically. We weren't getting more traffic in that region, and these periods of slowness weren't correlated with traffic spikes on our platform or anything else we could see. So that didn't really explain the situation.

Next, we wondered about our cluster networking. We use Kubernetes, and we have intra-cluster networking that allows nodes to talk to other nodes. Maybe something was going wrong with connections in the cluster itself. So we poked around at observability for cluster networking as a whole and tried to find any evidence that these connections were failing anywhere outside the context of HTTP requests. We couldn't find any evidence of that either.

Our third theory was network throttling. Our underlying compute provider's instances all have a certain bandwidth quota, and if you exceed that quota, the provider reserves the right to throttle you. They're not always gonna throttle you, but if they're particularly busy, they can cut you off, basically. This felt good, right? It explained why we saw these connections not completing, and it also had a noisy-neighbor explanation, where maybe the cloud provider saw a bunch of traffic at this time and needed to throttle us. So this was a good explanation, and we decided to perform an experiment. The experiment we crafted was to really crank up our instance sizes and make the bandwidth allowance so large that there was no plausible way we could get throttled. We made this change, really big instances, and the same thing happened the next day.

At this point we were getting pretty frustrated. That feeling of: I can't see it. What's happening? I've looked everywhere. We decided to go down a layer of abstraction and start looking at packets. We used tcpdump to capture packets on one of these instances during one of these episodes, and this is what we saw. The upstream reached out, saying "hi"; our proxy said "hi" back, and then nothing. Then we said, "Hey, you still there?" Nothing. Then we said, "All right, I give up. Goodbye." Several seconds later, the upstream said, "Oh, hi. Here I am." That packet, significantly, was a retransmit, indicating it was the second or third time it had been sent.

We were able to plot all the retransmitted packets in this period, and it perfectly highlighted this 10-minute window of network issues. We went back to our upstream provider and said, "Hey, what's going on with these packets getting dropped?" With that information, our high-level trends about when this was happening, the symptoms, what we had ruled out, and this particular TCP handshake failure, they were able to say, "Yeah, that looks like network congestion, looks like packet loss." They zoomed into their own observability and systems and found that these were indeed periods of very brief but very high congestion. Ultimately it was a noisy-neighbor problem, similar to what we had hypothesized earlier but at a different layer. They ended up provisioning more underlying compute and more network capacity, and that resolved the issue.
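The talk only mentions tcpdump, so treat the sketch below as an illustration of the analysis rather than the actual tooling: it reads a capture file (the file name and the use of gopacket are assumptions) and timestamps likely retransmissions so they can be bucketed and plotted against the slow window.

```go
package main

import (
	"fmt"
	"log"

	"github.com/google/gopacket"
	"github.com/google/gopacket/layers"
	"github.com/google/gopacket/pcap"
)

// Reads a capture taken with something like:
//   tcpdump -i eth0 -w slow-period.pcap 'tcp port 443'
// and flags packets whose (flow, sequence number) has been seen before.
// This is a rough heuristic for retransmissions; a real analysis (or
// Wireshark's tcp.analysis.retransmission) is more careful.
func main() {
	handle, err := pcap.OpenOffline("slow-period.pcap")
	if err != nil {
		log.Fatal(err)
	}
	defer handle.Close()

	type key struct {
		net, transport gopacket.Flow
		seq            uint32
	}
	seen := map[key]int{}
	retransmits := 0

	src := gopacket.NewPacketSource(handle, handle.LinkType())
	for pkt := range src.Packets() {
		tcpLayer := pkt.Layer(layers.LayerTypeTCP)
		if tcpLayer == nil || pkt.NetworkLayer() == nil {
			continue
		}
		tcp := tcpLayer.(*layers.TCP)
		// Ignore bare ACKs: their sequence number repeats legitimately.
		if len(tcp.Payload) == 0 && !tcp.SYN && !tcp.FIN {
			continue
		}
		k := key{pkt.NetworkLayer().NetworkFlow(), tcp.TransportFlow(), tcp.Seq}
		if seen[k] > 0 {
			retransmits++
			// The timestamps are what let you bucket these by minute and
			// overlay them on the latency spike.
			fmt.Println(pkt.Metadata().Timestamp, k.net, k.transport)
		}
		seen[k]++
	}
	fmt.Println("total retransmitted packets:", retransmits)
}
```

Plotting those timestamps against the error and latency graphs is what turned "it feels like the network" into the concrete evidence the provider could act on.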
Alright, takeaways from this incident. There were some interesting social and emotional biases going on here. One was the feeling that surely we can't be the only ones noticing this issue, right? It seemed really severe; if this was really a problem with the upstream provider, surely someone else would have reported it. If you notice, that's the inverse of the bias I had in the last incident. It turned out we were the ones who were really sensitive and picky about our network performance, and we drove this to resolution and were able to solve the issue because we had that sensitivity.

The other big takeaway is how we felt around in the dark in a system we couldn't actually see inside of. We used process of elimination to rule things out and reduce the search space of where the issue might be. We tried to focus on both trends and specifics when sharing back with the upstream provider's support team; showing high-level trends of when this was happening was not quite enough. We really needed to provide concrete evidence, "this is the TCP handshake; where's the packet?", to really break through.

Wrapping up: what makes these kinds of incidents hard? I think there are two big things. First, you have tactical limitations. You're debugging a black box; you're debugging someone else's code. You can't look at it, you can't attach a debugger, you can't add log lines. There's just a lot less you can do; you don't have your normal tool belt. Then you have the social challenges. You don't have a strong working relationship with the people you're collaborating with, and they're looking at a very different side of the system, right? They might not know about your application; they don't know what's going on in your systems. They can only see their side of the situation.

But these are solvable problems. You can do it. You can figure out how to infer things about these systems with limited tools. You can use tracing and profiling, focus on boundaries, and focus on the specific behavior at the seam between the two systems. And then you can invest in and prioritize building these social connections. It can be really frustrating to be on the other side of a customer support case, and at Render we are on both sides. Our users reach out to us when they're having trouble and we want them to have a good experience, and then we reach out to our upstream providers and have similar experiences. In both cases, building context and empathy is what's critical. You can have a really strong theory, a really strong technical opinion about what's happening, but unless you can figure out how to communicate and build that shared understanding with your counterpart at your upstream provider, or wherever they are, it's hard to make progress.

These problems are solvable, and you can do it.
