Conf42 Site Reliability Engineering 2022 - Online

Hodor: Detecting and addressing overload in LinkedIn microservices


Abstract

When pushed hard enough, any system will eventually suffer and ultimately fail unless relief is provided in some form. At LinkedIn, we have developed a framework for our microservices to help with these issues: Hodor (Holistic Overload Detection & Overload Remediation).

As the name suggests, it is designed to detect service overloads from multiple potential root causes, and to automatically improve the situation by dropping just enough traffic to allow the service to recover.

Hodor then maintains an optimal traffic level to prevent the service from reentering overload. All of this is done without manual tuning or specifying thresholds. In this talk, we will introduce Hodor, provide an overview of the framework, describe how it detects overloads, and how requests are dropped to provide relief.

Summary

  • LinkedIn counts on Hodor to protect our microservices by detecting when our systems have become overloaded and providing relief. Hodor stands for holistic overload detection and overload remediation. What we'll be presenting here is all within the context of code running on the Java virtual machine.
  • When a service has saturated its usage of a given resource, we need to prevent oversaturation and degradation of that resource. These limits can be reached in different ways. The amount of traffic that needs to be dropped and the overall capacity can easily vary depending on the nature of these overloads.
  • The objective of the CPU detector is to quickly and accurately detect CPU overloads. A background thread, known as the heartbeat thread, is scheduled to wake up every ten milliseconds. The algorithm flags an overload only when we have high confidence that a service is overloaded.
  • The GC overload detector quickly and accurately detects increased garbage collection activity for applications with automatic memory management, like Java applications, and focuses on memory bottlenecks.
  • The most effective load shedding approach is to limit the number of concurrent requests handled by a service by rejecting some requests. When the load shedder drops requests, they're returned as 503s, and these can be retried on another healthy host if one is available. LinkedIn rolled out load shedding to hundreds of microservices.
  • Adding overload detectors to our service has surfaced unexpected behavior that owners were generally not familiar with. We currently have over 100 success stories where Hodor engaged to protect a service and mitigate an issue. To end with, I'm going to talk about some related work that integrates well with Hodor and some things that we have planned for the future.
  • First up is a project that has been integrated with Hodor and live for some amount of time. It's called traffic tiering, but it's also known as traffic prioritization or traffic shaping. We're working on correlating CPU or GC overload signals with latency metrics and only dropping traffic when there's a clear degradation in performance.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, my name is Bryan Barkley. I'm an engineer at LinkedIn, and I'm going to be joined by my colleague Vivek Deshpande to discuss a framework that we've helped to develop called Hodor. LinkedIn counts on Hodor to protect our microservices by detecting when our systems have become overloaded and providing relief to bring them back to a stable state. So this is our goal: we want to increase our overall service resiliency, uptime, and availability, while minimizing impact to the members who are using the site. We also want to do this with an out-of-the-box solution that doesn't require service owners to customize things or maintain configuration that's specific to their service. I'll also note that what we'll be presenting here is all within the context of code running on the Java virtual machine, though some of the concepts would translate well to other runtime environments.

So here's our agenda for the talk. I'll provide an overview of Hodor, then Vivek will discuss the various types of overload detectors we've developed. I'll talk about how we remediate overload situations and how we rolled this out to hundreds of existing services safely. We'll take a look at a success story from our production environment and discuss some related work and what else we have planned for the future. For anyone wondering about the name, as you can see, Hodor stands for Holistic Overload Detection and Overload Remediation. It protects our services, and so also has a similarity in that respect to a certain fictional character.

So let's get into some of the details of Hodor, the situations it's meant to address, and how the pieces fit together. To start, let's talk about what it means for a service to be overloaded. We define it as the point at which it's unable to serve traffic with reasonable latency. We want to maximize the goodput that a service is able to provide, and so when a service has saturated its usage of a given resource, we need to prevent oversaturation and degradation of that resource. What sort of resources are we talking about here? They can be physical or virtual resources. Some obvious examples of physical resources are CPU usage, memory usage, network bandwidth, or disk I/O, and all of these have hard physical limits that are just not possible to push past or increase via some configuration tweaks. Once these limits are hit, performance starts to degrade pretty quickly. Virtual resources are different altogether, but can have the same impact when they are fully utilized. Some examples include threads available for execution, pooled DB connections, or semaphore permits.

These limits can be reached in different ways. A clear case is increased traffic to a service: if the number of requests per second goes up 5x or 10x over normal, things aren't going to go well on that machine if it's not provisioned for that amount of load. Another more subtle case is if some service downstream starts to slow down. That effectively applies backpressure up the call chain, and services higher up are affected by the added latency in those calls. It'll slow down the upstream service and cause it to have more requests in flight, which could lead to memory issues or thread exhaustion. One more example is if your service is running alongside others on the same host without proper isolation. If a neighboring process is using a lot of CPU, disk, or network bandwidth, your service will be negatively impacted without any change to traffic or downstream latencies. So this is where the holistic part comes in.
We want to be able to catch as many of these types of issues as possible, though the services using Hodor can have wildly different traffic patterns, execution and threading models, and workloads, and as I mentioned before, it shouldn't take any configuration. We have hundreds and hundreds of services running on tens of thousands of machines, and tweaking things for each of those just isn't feasible.

To address the problem once it's been detected, we begin dropping requests and returning 503s, which are service unavailable responses. But we want to minimize these and drop just enough traffic to mitigate the problem. The tricky part here is that the amount of traffic that needs to be dropped and the overall capacity of the service can easily vary depending on the nature of the overload itself. For example, the amount of traffic that a service can handle may be much different if the CPU is saturated compared to if there's backpressure from a downstream and memory usage is becoming a problem. So we have to be flexible and dynamic, both in detecting overloads and also in knowing how much traffic to drop.

So what's Hodor made of? There are basically three main components. First, there are what we call overload detectors. There could be multiple of these registered, including any that might be application specific. Vivek will be talking about the standard ones we provide in a bit. The detectors are queried for each inbound request with some metadata about the request. This allows the detectors to operate on a context-specific level if needed, and potentially only detect overload and shed traffic for a targeted subset of requests. Most of the detectors, though, operate on a global level and don't do any request-specific processing. Instead, they're fetching an asynchronously calculated state indicating whether the detector considers things to be overloaded. Second, we have the load shedder, which decides which traffic should be dropped. Once a detector has signaled that there's an overload, the shedder needs to be intelligent about which requests should be dropped, to make sure that too much traffic isn't rejected, but that enough is dropped to exit the overloaded state. The load shedder takes the signal from the detectors as input, as well as some contextual information about the request, to make these decisions. Finally, there's a platform-specific component that combines the detectors and load shedder and adapts request metadata into Hodor's APIs. The detectors and shedder are platform agnostic. At LinkedIn, we primarily use an open source project we developed called Rest.li for communication between services, so that's the first platform we had Hodor running on. We've since adapted it to work with gRPC as well as HTTP servers such as the Play Framework.

I'm going to hand things off to Vivek now, and he will get into more details of how some of our detectors work to determine when the service is overloaded. Thanks, Bryan. Hello everyone. Now let's talk about how overload detection is done using different detectors. The first detector is the CPU detector. The objective of the CPU detector is to quickly and accurately detect CPU overloads. The idea is to have a lightweight background thread running at the same priority as the application threads which execute business logic. This background thread, known as the heartbeat thread, is scheduled to wake up every ten milliseconds. The overall amount of work here is trivial and adds no measurable impact to application performance.
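
To make the heartbeat idea more concrete, here is a minimal, illustrative sketch in Java of a thread that sleeps for ten milliseconds and records how late it actually woke up; the class name, fields, and the windowing comment are assumptions for illustration, not Hodor's actual implementation.

    import java.util.concurrent.atomic.AtomicLong;

    // Illustrative heartbeat probe: a thread that sleeps for a fixed interval and
    // records how much later than expected it actually woke up. When application
    // threads are starved for CPU, this thread is scheduled late as well.
    public class HeartbeatProbe implements Runnable {
        private static final long INTERVAL_MS = 10; // expected wake-up interval
        private final AtomicLong lastDelayMs = new AtomicLong();

        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                long start = System.nanoTime();
                try {
                    Thread.sleep(INTERVAL_MS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                // Anything beyond the requested sleep is scheduling delay; a real
                // detector would aggregate these samples into per-second windows
                // and compare a high percentile against a threshold.
                lastDelayMs.set(Math.max(0, elapsedMs - INTERVAL_MS));
            }
        }

        public long lastDelayMillis() {
            return lastDelayMs.get();
        }
    }

Running this probe on a thread at the same priority as the application threads is what makes the measurement reflect the CPU time actually available to business logic, which is the property the talk emphasizes.
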
The heartbeat overload detection algorithm monitors whether the heartbeat thread is getting CPU every ten milliseconds. Once the heartbeat algorithm realizes that the heartbeat thread is consistently not getting CPU time at the expected time intervals, we flag an overload. It is important to note here that a few violations are not enough to flag an overload, and that the algorithm flags an overload only when we have high confidence that the CPU is overloaded. In concept, this idea is straightforward, but determining the appropriate variables and parameters for this algorithm to maximize precision while maintaining high recall was challenging. To provide a concrete example, we may have the thread sleep for ten milliseconds each time, and if the 99th percentile in a second's worth of data is over 55 milliseconds, that window is in violation. If eight consecutive windows are in violation, the service is considered overloaded. The values for these thresholds were determined by synthetic testing as well as by sourcing data from production and comparing it with performance metrics when the services were considered to be overloaded. The rationale behind using a heartbeat thread is, one, it directly measures useful CPU time available to the application in real time. What we mean by this is that just because you see 30% free CPU in something like the top command does not mean that it is useful CPU. And number two, the concept of the heartbeat thread is applicable everywhere, irrespective of the environment or the application type. So this is an example of the CPU detector in action. In the top left graph you can observe that the heartbeat detector is able to capture a CPU overload. Notice that the performance indicators such as average and p90 latencies and p95 CPU all spike when the heartbeat detector flags an overload.

Now let's move on to the next detector, which focuses on memory bottlenecks. The objective of the GC overload detector is to quickly and accurately detect increased garbage collection activity for applications with automatic memory management, like Java applications. The idea is to observe the overhead of GC activity in a lightweight manner to detect overload in real time. On each GC event, we calculate the amortized percentage of time spent in GC over a given time window. We call this the GC overhead. A tiered scheme is built on top of the GC overhead: the GC overhead percentage range is divided into tiers, called GC overhead tiers. If the duration spent in a GC overhead tier exceeds the volatility period for that tier, then a GC overload is signaled. The volatility period is smaller for a higher GC overhead tier, as a higher GC overhead tier indicates more severe GC activity. For example, a GC overhead of 10% or more across consecutive GCs for, say, 30 seconds is considered an overload, while a lower tier, such as a GC overhead of 8% or more for, say, 60 seconds, is also considered an overload, and so on. The rationale behind using percentage of time in GC is that it captures both GC duration and GC frequency, which can catch different GC issues. It also makes it possible to set a common threshold that works across all applications, with their different allocation rates, old generation occupancy levels, and so on. Similar to the CPU detector, this is an example of the GC overload detector in action, when GC activity increases because of an increase in traffic. In the top left graph you can observe that the GC detector is able to capture a GC overload. Notice that the performance indicators such as p90 and p99 latencies both spike when the GC detector flags an overload.
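
As a rough illustration of the GC overhead calculation, the sketch below polls the JVM's standard GarbageCollectorMXBean counters and reports the percentage of wall-clock time spent in GC since the previous sample; the class and the tier numbers in the closing comment are assumptions based on the examples above, not the actual Hodor detector.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Illustrative GC overhead sampler: percentage of wall-clock time spent in GC
    // since the previous sample, based on the JVM's cumulative collection times.
    public class GcOverheadSampler {
        private long lastGcTimeMs = totalGcTimeMs();
        private long lastSampleAtMs = System.currentTimeMillis();

        private static long totalGcTimeMs() {
            long total = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                total += Math.max(0, gc.getCollectionTime()); // cumulative ms; -1 if unsupported
            }
            return total;
        }

        /** GC overhead (0-100) over the interval since the previous call. */
        public synchronized double sampleOverheadPercent() {
            long nowMs = System.currentTimeMillis();
            long gcMs = totalGcTimeMs();
            double overhead = 100.0 * (gcMs - lastGcTimeMs) / Math.max(1, nowMs - lastSampleAtMs);
            lastGcTimeMs = gcMs;
            lastSampleAtMs = nowMs;
            return overhead;
        }

        // A tiered check along the lines described above would then flag an overload
        // if the overhead stays at, say, 10%+ for ~30 seconds or 8%+ for ~60 seconds.
    }
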
Now we will look at a virtual resource overload detector. A study of queue wait time and its data suggests that there is a good correlation between degraded KPIs, such as latencies, and increased thread queue wait times for synchronous services. Consider a synchronous service: requests will start spending more time waiting in a queue if request processing time increases, either due to an issue in the service or in one of its downstreams. The capacity of a service can also be reached when latencies of downstream traffic increase, which can cause the number of concurrent requests being handled in the local service to increase with no change to the incoming request rate. But without knowing anything about the downstream service, we can assert at the upstream, by monitoring thread pool queue wait time, that there is thread pool starvation, and by dropping traffic we can help alleviate pressure on the downstream. At LinkedIn we use the Jetty server-side framework extensively, and hence we targeted that as a first step, but the logic of observing thread pool queue wait time is widely applicable. Similar to the previous detectors, this is an example of the thread pool overload detector in action, where there is an issue in downstream processing that causes an increase in the thread pool wait time. In the top left graph you can observe that the thread pool detector is able to capture an overload. Notice that the performance indicators, such as average and p99 latency, spike when the detector flags an overload.

Now back to Bryan, who is going to talk about remediation techniques. Thank you. Thanks for that. So the question now is, once we've identified there's a problem, how can we address it with minimal impact? Well, we need to reduce the amount of work that a service is doing, and we do that by rejecting some requests. The trick here is to identify the proper amount of requests to reject, since dropping too many would have a negative impact on our users. We've tried and tested a few different load shedding approaches and found that the most effective is to limit the number of concurrent requests handled by a service. The load shedder adaptively determines the number of requests that need to be dropped by initially being somewhat aggressive while shedding the traffic, and then more conservative about allowing more traffic back in. When the load shedder drops requests, they're returned as 503s, and these can be retried on another healthy host if one is available. We experimented with other forms of adaptive load shedding, including using a percentage-based threshold to adaptively control the amount of traffic handled or rejected, but during our tests we found that a percentage-based shedder didn't really do that good of a job, especially when traffic patterns changed, as it was continually needing to adapt to the new traffic levels, whether they were increasing or decreasing over previous thresholds. The graphs shown here are from one of the experiments we ran, where the red host was unprotected and the blue host had load shedding enabled. They started off by receiving identical traffic levels until becoming overloaded, where the behavior diverged, as you can see in the middle graph. You can see that as the overall queries per second increases, the protected host is forced to increase the number of requests that are dropped. You can also see that the overall high percentile latency is lower on the protected host, but there are a few spikes where the load shedder is probing to see if the concurrency limit can be increased by slowly letting in more traffic.
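
Here is a minimal sketch of a concurrency-limit load shedder along the lines described above: it caps in-flight requests, cuts the cap aggressively when a detector signals overload, and probes it back up slowly afterwards. The class, the starting limit, and the halving/increment policy are illustrative assumptions, not the actual Hodor shedder.

    import java.util.concurrent.atomic.AtomicInteger;

    // Illustrative concurrency-limit shedder: requests above the current limit are
    // rejected (the caller returns a 503 so another healthy host can be retried).
    public class ConcurrencyLimitShedder {
        private final AtomicInteger limit = new AtomicInteger(256); // illustrative starting cap
        private final AtomicInteger inFlight = new AtomicInteger();

        /** Returns true if the request may proceed, false if it should be shed. */
        public boolean tryAcquire() {
            if (inFlight.incrementAndGet() > limit.get()) {
                inFlight.decrementAndGet();
                return false;
            }
            return true;
        }

        /** Must be called when an admitted request completes. */
        public void release() {
            inFlight.decrementAndGet();
        }

        /** Fed periodically with the combined signal from the overload detectors. */
        public void onOverloadSignal(boolean overloaded) {
            if (overloaded) {
                // Aggressive while shedding: cut the allowed concurrency sharply.
                limit.updateAndGet(l -> Math.max(1, l / 2));
            } else {
                // Conservative while recovering: let traffic back in slowly.
                limit.incrementAndGet();
            }
        }
    }
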
So I'd mentioned that Hodor rejects requests with 503s. This is done early on in the request pipeline, before any business logic is executed, so they're safe to retry on another healthy host. This reduces overall member impact because the 503 response is returned quickly to the client, giving it time to retry the request someplace else. But we don't want to blindly retry all requests that are dropped by Hodor, because if all hosts in the cluster are overloaded, sending additional retry traffic can actually make the problem worse. To prevent this, we've added functionality on the client and server side to be able to detect issues that are cluster-wide and prevent retry storms. This is done by defining client and server side budgets and not retrying when any of those budgets have been exceeded.

I'm going to talk briefly now about how we went about rolling this out to the hundreds of separate microservices that we operate at LinkedIn. We needed to be cautious when rolling this out to make sure that we weren't causing impact to our members from any potential false positives from the detectors. We did this by enabling the detectors in monitoring mode, where the signal from the detectors is ignored by the load shedder, but all relevant metrics are still active and collected. This allowed us to set up a pipeline for rollout where we could monitor when detectors were firing and correlate those events with other service metrics, such as latency, CPU usage, garbage collection activity, et cetera, over the same time periods. Before enabling load shedding, we monitored a service for at least a week, which would include load tests that were running in production. During that time, we found that some services were not good candidates for onboarding using our default settings. These were almost always due to issues with garbage collection and could usually be solved by tuning the GC. In some cases, this actually led to significant discoveries around inefficient memory allocation and usage patterns, which needed to be addressed in the service but hadn't been surfaced before. Making changes to address these ended up being significant wins for these services, as they led to reduced CPU usage and better overall performance and scalability, and they were able to onboard to Hodor's protection as a side benefit.

At the bottom here is a quote from one of the teams after adoption: "Adding overload detectors to our service has surfaced unexpected behavior that owners were generally not familiar with." And we've truly found some odd behavior surfaced by our detectors. For example, in one service we found that thread dumps were being automatically triggered and written to disk periodically based on a setting that the owners had enabled and forgotten about. The manifestation of this was periodic freezes of the JVM while the thread dumps were happening, which lasted over a few seconds in some cases, but this didn't register in our higher percentile metrics, so the service owners were never aware of the problem. Once onboarded to Hodor, though, it became very clear when the detectors fired and the load shedder engaged. There are other examples similar to this where fairly impactful, and usually fairly interesting, behavior went unnoticed until uncovered by our system.
So next I'm going to go through a quick example from one of our production services. This is from our flagship application, which powers the main LinkedIn.com site as well as the mobile clients for iOS and Android. We periodically do traffic shifts between our data centers for various reasons. In one of these cases, there was a new bug that was introduced in the service that only appeared when the service was extremely stressed. This traffic shift event triggered the bug, and Hodor intervened aggressively to handle the situation. You can see in the top graph that Hodor engaged for a good amount of time, with a few spikes which lined up directly with when latencies were spiking. Overall, about 20% of requests were dropped during this overload period, which sounds bad, but when SREs investigated further, they found that if the load hadn't been shed, this would likely have become a major site issue instead of a minor one, with the service down instead of still serving partial amounts of traffic. We currently have over 100 success stories similar to this where Hodor engaged to protect a service and mitigate an issue.

To end with, I'm going to talk about some related work that integrates well with Hodor and some things that we have planned for the future. First up is a project that has been integrated with Hodor and live for some amount of time. Our term for it is traffic tiering, but it's also known as traffic prioritization or traffic shaping. It's a pretty simple concept: some requests can be considered more important than others. For example, our web and mobile applications will often prefetch some data that they expect might be used soon, but if it's not available, there's no actual impact to the user; it just gets fetched later on demand. Requests like this can be considered to be lower priority than directly fetching data that the user has requested. Similarly, we have offline jobs that utilize the same services that take user traffic, but nobody's sitting behind a screen waiting for that data. It's safe to retry those offline requests at a later time when the service isn't overloaded. So with traffic tiering, we're able to categorize different types of requests into different categories and start dropping traffic with the lowest importance first, only moving to affect higher priority traffic if necessary.

Secondly, we're working on developing new types of detectors to cover blind spots in the ones we have. One of these is actually a method of confirming that the detected overload is impacting core metrics. We've had cases of false positives where there is an underlying issue with the service, usually GC related, which isn't affecting the user-perceived performance, but is impacting the ability of the app to scale further and maximize its capacity. In these cases, we don't want to drop traffic, but we do want to signal that there is an issue. So we're working on correlating CPU or GC overload signals with latency metrics and only dropping traffic when there's a clear degradation in performance. We're also starting to adapt Hodor to frameworks other than Rest.li, such as the Play Framework, as well as Netty. These have different threading models and in some cases don't work as well with the heartbeat-based CPU detection, so for example, we're working on a detector specifically for Netty's event loop threading model.
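
Circling back to the traffic tiering idea described above, the sketch below shows one way request tiers could be used to drop the least important traffic first and only escalate to higher-priority tiers if the overload persists; the tier names and the escalation policy are purely illustrative assumptions, not LinkedIn's actual tier definitions.

    // Illustrative tier-based shedding: drop the least important traffic first and
    // only move up to higher-priority tiers if the overload persists.
    public class TieredShedder {
        // Illustrative tiers, ordered from least to most important.
        public enum Tier { OFFLINE_JOB, PREFETCH, INTERACTIVE }

        // Requests at or below this tier are shed; null means nothing is shed.
        private volatile Tier shedCutoff = null;

        public boolean shouldDrop(Tier requestTier) {
            Tier cutoff = shedCutoff;
            return cutoff != null && requestTier.ordinal() <= cutoff.ordinal();
        }

        /** Called while the overload signal stays active; widens shedding one tier at a time. */
        public void escalate() {
            Tier cutoff = shedCutoff;
            if (cutoff == null) {
                shedCutoff = Tier.OFFLINE_JOB;
            } else if (cutoff.ordinal() < Tier.values().length - 1) {
                shedCutoff = Tier.values()[cutoff.ordinal() + 1];
            }
        }

        /** Called once the overload clears. */
        public void reset() {
            shedCutoff = null;
        }
    }
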
Finally, we're looking into leveraging the overload signals to feed them into our elastic scaling system. This seems like an obvious match: if a service cluster has become overloaded, we can just spin up some more instances to alleviate the problem, right? Well, it turns out it's not that simple, especially when there's a mix of stateful and stateless systems within an overloaded call tree. In many cases, just scaling up one cluster that is overloaded would just propagate the issue further downstream and cause even more issues. This is an area where we're still exploring and hope to address in the future.

So I hope that this presentation was enlightening for you and that you learned something new. Thank you for your time. I'd also like to thank the different teams at LinkedIn that came together to make this project possible and successful. That's it from us. Enjoy the rest of the conference.
...

Bryan Barkley

Senior Staff Engineer @ LinkedIn

Bryan Barkley's LinkedIn account

Vivek Deshpande

Senior Software Engineer @ LinkedIn

Vivek Deshpande's LinkedIn account


