Conf42 Golang 2024 - Online

How to make your service more resilient in case of traffic spikes

Abstract

I want to talk about server overload, why it happens, and what happens to a service in such a situation. I will discuss two main techniques that prevent server overload and make the service more resilient: rate limiting and load shedding.

Summary

  • Ivan Lemeshev will talk about rate limiting and load shedding. He will discuss how to reduce the load on the server and keep your service running smoothly. He is interested in the development of large scale and distributed systems.
  • When we develop a backend application, we deploy it to a server to make it available to users. Each server always has limited computational or system resources. When a system is given more work than its resources can support, it becomes slow. Of course, we can use auto scaling and deploy additional service instances.
  • Traffic bursts can occur due to various reasons, both predictable and unpredictable. For example, social media platforms see higher traffic during evenings or weekends. By understanding these potential causes, you can better prepare for traffic bursts.
  • Rate limiting is a technique used to control the flow of requests to a network resource, server, or API. It essentially sets a limit on how often a user or application can perform a specific action. There are many algorithms, and variations of them, for implementing rate limiting.
  • The last algorithm for today is the leaky bucket algorithm. It also uses the concept of a bucket, but in a different way. The bucket can hold a specific number of requests, representing its maximum capacity. Each algorithm has pros and cons, so you should consider the trade-offs.
  • Load shedding is another technique that is used to prevent server overload. It is like a controlled shutdown to prevent a total crash during a traffic overload. Users might experience delays or errors for some requests during shedding. Understanding and implementing these techniques ensures your services stay healthy and responsive even under heavy load.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everybody. Today I want to talk about techniques that allow you to reduce the load on the server and keep your service running smoothly, preventing server overload in case of traffic bursts. In particular, I will be talking about rate limiting and load shedding. A little disclaimer: by service, I mean any backend application. It can be a monolithic application or an individual microservice in a large distributed system. But first, let me tell you a little bit about me. My name is Ivan Lemeshev. I live in Finland. I moved here over three years ago, and since then I have been working at Unity as a senior software engineer. I have been using Golang as my primary programming language for many years, and I am interested in the development of large-scale and distributed systems. You can find me on LinkedIn or GitHub by following the links.
To begin with, let's look at the agenda for this talk. First, I will discuss server overload, why it occurs, and how it affects the performance of backend services. I'll give you examples of common causes of server overload, and then I'll show you what it looks like using the example of a simple HTTP service. After that, I'll consider techniques that help prevent server overload. I'll start with rate limiting and review common rate limiting algorithms. Then I'll show you a simple implementation of one of the algorithms, and you will see how it works. Next, I will discuss another technique called load shedding and show you a simple example as well. Well, let's get started.
When we develop a backend application, we deploy it to a server to make it available to users. Both physical and virtual servers can be used; it doesn't matter. Each server always has limited computational or system resources, such as CPU or memory. The service utilizes some of these resources to process each incoming user request, including managing multiple tasks, switching between them, cleaning up unused memory, and waiting for data to come in or go out. The server can only process a particular number of concurrent user requests simultaneously. When a system is given more work than its resources can support, it starts experiencing a lack of resources and becomes slow, which leads to an increase in processing time, or latency, for each request. The service reaches some threshold where its performance degrades rapidly. It can keep working even when it is overloaded, but it spends large amounts of time context switching and becomes too slow to be useful, because most likely the client has some timeouts and can't wait too long for a response from the service. Therefore, the performance and availability of your service will decline because almost all requests will fail due to high latency and timeouts. In the worst case, the server may completely crash and stop handling requests due to running out of memory or other resources. Of course, we can use auto scaling and deploy additional service instances, but it doesn't happen instantly. If the load grows gradually, auto scaling works well. However, if we have a traffic burst and the load is exceptionally high, deploying an appropriate number of additional instances requires time. Also, there may be a situation when an individual user or a group of users produces so many requests that they consume all the system resources, and other users will be unable to use the service. In this case, auto scaling will not help. Now let's explore situations where a surge of traffic can occur.
Traffic bursts can occur due to various reasons, both predictable and unpredictable. Among the predictable reasons, there are scheduled or planned events like product launches, sales promotions, marketing campaigns, or even regular peak hours, which can lead to a surge in traffic as users try to access the service or website simultaneously. Another reason is seasonal traffic. Businesses in specific industries might experience seasonal spikes in traffic. For example, e-commerce sites might see a surge during holidays or back-to-school seasons. The next one is the time of day or week. Traffic patterns can change regularly depending on the target audience and service type. For example, social media platforms see higher traffic during evenings or weekends. As for unpredictable reasons, there are viral events: social media trends, news articles, or influencer mentions can drive sudden traffic bursts to a website or service. Also, there could be various technical issues. For instance, system outages on competitor platforms can lead to users flocking to a service, causing a temporary spike. Another very common reason is denial-of-service attacks, when malicious actors attempt to overwhelm a system with traffic, causing a spike and potentially disrupting functionality. The next one is bot activity. Automated bots or scripts can cause unexpected bursts, especially if they are scraping data or attempting to exploit vulnerabilities. There could also be issues with external dependencies. For instance, if your service relies on external APIs or services, outages or slowdowns on their end can lead to a cascading effect and cause traffic spikes from users trying to access features that depend on those external services. By understanding these potential causes, you can better prepare for traffic bursts.
Well, now let's look at an example of server overload. To demonstrate this, I implemented a simple HTTP service in Go. The main function sets up an HTTP server that listens on port 8000. It registers a single route with a handler function. The handler function attempts to extract a path parameter, length, from the request, convert it to an integer value, and use it to generate a password. If the length parameter cannot be converted to an integer, for example if it's not a number, the function responds with a 400 status. If the length parameter is successfully converted to an integer, the function generates a password of that length, and the generated password is sent back to the client with a 200 status. I use a Docker container to run this service because it allows us to simulate limited resources; you will see it on the next slide. I build and run the service using this simple Dockerfile. It builds the application using the Golang image in the build stage, and then it uses the distroless base image from Google's container registry to run the service. This image contains only the application and its runtime dependencies, and it is designed to be as small as possible. Then I build a Docker image from the Dockerfile and run the container. I set the cpus option to one to limit the number of CPU cores available to the container. I also set two options, memory and memory-swap, to 300 megabytes to limit the container's memory. I have to set both options to the same value to prevent the container from using swap.
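For reference, here is a minimal sketch of a service along the lines described above. It is not the exact code from the talk: the route path and helper names are illustrative, and it assumes Go 1.22+ for the {length} path parameter.
```go
package main

import (
	"crypto/rand"
	"log"
	"math/big"
	"net/http"
	"strconv"
)

const charset = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

// generatePassword returns a random password of the given length.
func generatePassword(length int) (string, error) {
	b := make([]byte, length)
	for i := range b {
		n, err := rand.Int(rand.Reader, big.NewInt(int64(len(charset))))
		if err != nil {
			return "", err
		}
		b[i] = charset[n.Int64()]
	}
	return string(b), nil
}

func main() {
	mux := http.NewServeMux()
	// The handler reads the length path parameter, validates it,
	// and responds with a generated password or a 400 error.
	mux.HandleFunc("GET /password/{length}", func(w http.ResponseWriter, r *http.Request) {
		length, err := strconv.Atoi(r.PathValue("length"))
		if err != nil || length <= 0 {
			http.Error(w, "length must be a positive integer", http.StatusBadRequest)
			return
		}
		password, err := generatePassword(length)
		if err != nil {
			http.Error(w, "internal error", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
		w.Write([]byte(password))
	})
	log.Fatal(http.ListenAndServe(":8000", mux))
}
```
The container can then be started with resource limits such as --cpus=1 --memory=300m --memory-swap=300m to reproduce the constrained environment mentioned above.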
When we have a running service, we need to test it. For that, I use a load testing tool called Grafana k6. Here is the script used by this tool; it is just a simple JavaScript file with some configuration for the test. The test consists of four stages. In stage one, we gradually increase the number of virtual users from zero to 1000 over two minutes, and then in stage two we keep the number of users at 1000 for another two minutes. In stage three, we ramp up to 2000 virtual users over two minutes. And finally, in stage four, we keep the number of users at 2000 for another two minutes. In the default function, we define the user logic. Each virtual user makes a GET request to the service to generate a password. Then the user sleeps for 100 milliseconds, which simulates approximately ten requests per second from a single user. After that, I just run the test using this command and wait for it to end. I also use a web dashboard as an output to visualize the test results. It produces graphs showing the number of requests, latency, and other metrics. In the graph, we can see how latency changes depending on the number of requests per second. Initially, we gradually increased the number of users, and the service could handle that load; the latency was very low. Then, after four minutes, we increased the number of users up to 2000, and the service started to experience a lack of resources. That led to a significant increase in latency, and the server became overloaded. In the following graph, we can see the latency distribution by percentile. From these results, we can see that when system resources are used at their maximum, the latency increases many times over. But this is just an artificial example. In a real application, the latency might be even higher, and there will likely be some timeout on the client side that drops requests and makes the service unavailable to users.
We have discussed server overload and its causes, and we saw what it looks like in the example. Now let's look at the first technique that can be used in this situation. First, I want to discuss rate limiting. Sometimes it is also called throttling, so these terms may be used interchangeably. Rate limiting is a technique used to control the flow of requests to a network resource, server, or API. It essentially sets a limit on how often a user or application can perform a specific action within a given time frame. It protects against malicious activities like denial-of-service attacks, where attackers try to overwhelm a system with excessive requests, making it unavailable to legitimate users. Also, it helps ensure fair access to resources by preventing single users or applications from monopolizing them. It is especially important for services with limited resources. By controlling the request rate, rate limiting prevents overloading the server and helps maintain optimal performance for all users. We can limit the request rate per IP address, user identifier, API key, or other criteria depending on a particular use case. For example, we can allow ten requests per minute from a single IP address; if the number of requests per minute is less than or equal to that value, we process the request. Otherwise, if the limit is exceeded, we drop the request. There are many algorithms and variations of them for implementing rate limiting. I'll briefly review just the most common of them.
The first one is the fixed window counter algorithm. This algorithm divides time into equal-sized time intervals, or fixed windows. The window size can be defined in milliseconds, seconds, minutes, or any other relevant unit depending on the use case. Each window has a request counter that allows the algorithm to keep track of how many requests occur within each window. When a request arrives, the algorithm checks the current timestamp and decides which window the timestamp falls into based on the window size and the starting point of the current window. If the request falls within the current window, the counter for that window is incremented by one. Otherwise, a new window starts, the counter is reset to zero and incremented by one, and the start time of the window is updated. Then, if the counter value is less than or equal to the limit, the rate limiter allows the request to be processed; otherwise, the request is dropped. The example on the slide uses a one-minute window size and a rate limit of five requests per minute. In each window, the first five requests were processed and all other requests were dropped. As you can see, this algorithm is quite simple. However, it has a big drawback, which you have probably already noticed. Since the counter is reset when a new window is started, the rate limiter doesn't know how many requests were made in the previous window. This leads to some issues. For example, in this picture, the total number of requests per minute was more than five because of a burst of requests at the end of the second window and the beginning of the third window. The fixed window counter algorithm is easy to understand and implement because it only requires keeping track of a counter for each window. Also, it is efficient because it requires only basic counter operations, and it has low memory usage because it only needs to store the counter value for a window, which is typically a small amount of data. However, request bursts are possible at the edges when switching between windows, and this algorithm is not suitable for identifying frequent requests, since the counter is reset at each window.
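As an illustration, here is a minimal sketch of a fixed window counter in Go. It is not code from the talk; the type and method names are my own. It keeps one counter per key and resets it whenever a new window starts.
```go
package ratelimit

import (
	"sync"
	"time"
)

// window holds the request counter for the current fixed window of one key.
type window struct {
	start time.Time
	count int
}

// FixedWindowCounter allows up to limit requests per key in each window.
type FixedWindowCounter struct {
	mu         sync.Mutex
	windowSize time.Duration
	limit      int
	windows    map[string]*window
}

func NewFixedWindowCounter(windowSize time.Duration, limit int) *FixedWindowCounter {
	return &FixedWindowCounter{
		windowSize: windowSize,
		limit:      limit,
		windows:    make(map[string]*window),
	}
}

// IsAllowed reports whether a request for the given key fits into the
// current window, incrementing the counter when it does.
func (f *FixedWindowCounter) IsAllowed(key string) bool {
	f.mu.Lock()
	defer f.mu.Unlock()

	now := time.Now()
	w, ok := f.windows[key]
	// Start a new window if there is none yet or the current one has expired.
	if !ok || now.Sub(w.start) >= f.windowSize {
		w = &window{start: now}
		f.windows[key] = w
	}
	if w.count >= f.limit {
		return false // limit exceeded, drop the request
	}
	w.count++
	return true
}
```
With a one-minute window and a limit of five, this reproduces the behavior from the slide: the first five requests in each window pass and the rest are dropped, but a burst spanning a window boundary can still exceed five requests per minute.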
The following algorithm is the sliding window log. It also uses a window, but this window slides along time, representing the relevant time frame for rate limiting. The window size can also be defined in milliseconds, seconds, minutes, or any other relevant unit, depending on the use case. Instead of a request counter, this algorithm keeps a timestamp for each incoming request and stores it in a log. The log can be stored in a data structure like a hash table or a sorted set for efficient retrieval. As the window slides forward, timestamps outside the current window become irrelevant for rate limiting purposes and are removed from the log. When a new request arrives, its timestamp is added to the log. The algorithm then calculates the total number of requests within the current window by iterating through the remaining timestamps in the log to decide if the request should be allowed. The request is processed if the total count within the window is less than or equal to the allowed limit; otherwise, it is considered a rate limit violation and is rejected. Let's look at the example. In this example, we allow two requests per minute. When the first request arrives, its timestamp is saved to the log; there is only one request in the log at that point in time, so the request is allowed and processed. Then the second request comes, and its timestamp is saved to the log as well. Now there are two requests in the log, all of them fall within the current window, so the number of requests is equal to the limit and the request can be processed. Then the third request comes, and its timestamp is saved to the log. Now there are three requests that fall into the current window, which is more than the limit, so the request is rejected. Finally, the fourth request comes, and its timestamp is also saved to the log. There are only two requests that fall within the current window at that point in time, and there are two old timestamps that do not fall into the current window. The old timestamps are removed from the log, and the new request is allowed and processed because the number of requests in the current window equals the limit. The algorithm captures recent surges in requests more accurately than the fixed window counter because it considers only the relevant requests within the sliding window, so it can effectively handle bursts of requests. However, storing timestamps for all requests can be memory intensive, especially for high request volumes. Also, it's more complex because it requires maintaining and iterating through the request log.
The following algorithm is called the sliding window counter. It offers a balance between simplicity and accuracy compared to the fixed window counter and sliding window log algorithms. The algorithm maintains a counter variable and a fixed-size window duration, similar to the fixed window counter, but in addition, it uses a sliding window to represent the relevant time frame for rate limiting. Unlike the sliding window log algorithm, it doesn't store timestamps for all requests. Instead, it keeps track of the counters for fixed windows and calculates a weighted counter for the sliding window. When a new request arrives, the algorithm calculates a weighted counter using the sliding window time frame and the request counters for the previous and current windows. For that, it uses the percentage of time the sliding window overlaps with the previous fixed window, the number of requests from the previous fixed window, and the number of requests from the current fixed window. You can see the formula on the slide. The resulting value of these calculations is the weighted counter. Then, the algorithm compares the weighted counter with the limit value. The request is allowed and processed if the counter is less than or equal to the limit; otherwise, the request is dropped. The old counters from previous fixed windows are removed from tracking when they are no longer relevant for the sliding window. The algorithm captures recent request rate trends more accurately than the fixed window counter, as it considers requests that fall within the sliding window. Also, it avoids storing timestamps for all requests, making it more memory efficient and easier to implement than the sliding window log algorithm. However, it is less precise than the sliding window log algorithm.
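Here is a minimal sketch of the weighted-counter idea. It is not code from the talk; it tracks a single key for brevity, and the formula used (previous window count × overlap fraction + current window count) is the usual way this weighted counter is computed.
```go
package ratelimit

import (
	"sync"
	"time"
)

// SlidingWindowCounter approximates a sliding window using the counters
// of the current and previous fixed windows.
type SlidingWindowCounter struct {
	mu          sync.Mutex
	windowSize  time.Duration
	limit       float64
	windowStart time.Time
	prevCount   float64
	currCount   float64
}

func NewSlidingWindowCounter(windowSize time.Duration, limit int) *SlidingWindowCounter {
	return &SlidingWindowCounter{
		windowSize:  windowSize,
		limit:       float64(limit),
		windowStart: time.Now(),
	}
}

// IsAllowed computes the weighted counter for the sliding window and
// admits the request if it does not exceed the limit.
func (s *SlidingWindowCounter) IsAllowed() bool {
	s.mu.Lock()
	defer s.mu.Unlock()

	now := time.Now()
	elapsed := now.Sub(s.windowStart)

	// Roll fixed windows forward if the current one has expired.
	for elapsed >= s.windowSize {
		s.prevCount = s.currCount
		s.currCount = 0
		s.windowStart = s.windowStart.Add(s.windowSize)
		elapsed -= s.windowSize
		// If more than one full window has passed, the previous count is stale.
		if elapsed >= s.windowSize {
			s.prevCount = 0
		}
	}

	// Fraction of the sliding window that overlaps the previous fixed window.
	overlap := 1 - float64(elapsed)/float64(s.windowSize)
	weighted := s.prevCount*overlap + s.currCount
	if weighted >= s.limit {
		return false // limit exceeded, drop the request
	}
	s.currCount++
	return true
}
```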
The next algorithm is the token bucket. As you can see from the name, it uses the concept of a bucket that contains tokens. The bucket has a fixed size representing the maximum number of tokens it can hold. Each token represents the capacity to process a single request. In other words, the bucket size defines the maximum number of requests the system can handle at a time, so the bucket size represents the rate limit. The bucket is constantly refilled with new tokens at a constant rate called the refill rate, which defines how quickly the bucket refills with tokens over time, for example, ten tokens per second. Therefore, the refill rate controls the average rate at which requests can be processed. When a request arrives, the algorithm checks if there are enough tokens in the bucket. If there are tokens in the bucket, it consumes a token, removing it from the bucket, and allows the request to be processed. Otherwise, if the bucket is empty, the request is dropped. On the one hand, this algorithm is relatively easy to implement; on the other hand, it is slightly more complex due to the token management logic. It is efficient and uses little memory because it keeps track of the number of tokens in the bucket, which is typically just a number that is incremented or decremented over time. It allows for a smooth distribution of requests, but bursts of requests up to the bucket's capacity are possible in cases when the bucket is full and a large number of requests arrive simultaneously.
And finally, the last algorithm for today is the leaky bucket algorithm. It also uses the concept of a bucket, but in a different way. It is like the analogy of a bucket with a hole at the bottom that constantly leaks at a fixed rate. The bucket can hold a specific number of requests, representing its maximum capacity. The requests leak out from the bucket at a fixed rate. This leak rate specifies the sustained rate at which the system can process requests. When a request arrives and the bucket is not full, the new request is added to the bucket and processed. If the bucket is full, the request is dropped. The leaky bucket algorithm is good at smoothing out requests because the requests leak out and are processed at a controlled rate, preventing surges or bursts of traffic. However, it doesn't account for the time between requests, which means that if a user sends a burst of requests quickly, they may all be dropped, even if they are under the overall rate limit. We have reviewed rate limiting algorithms. Each algorithm has pros and cons, so you should consider the trade-offs based on your particular use case when choosing an algorithm for rate limiting. Also, each algorithm may be implemented differently depending on the requirements.
Now let's look at a simple implementation of a rate limiter based on the token bucket algorithm. For this example, I implemented the rate limiter logic in a separate package. Here I use the bucket struct as a token bucket. It has two fields: current tokens, which is the number of tokens currently in the bucket, and last refill time, which is the last time the bucket was refilled. Then I define the token bucket struct, which represents the rate limiter. It has a mutex for thread safety and a buckets map that stores a bucket for each key. As keys, I will use the clients' IP addresses. Also, it has two fields: bucket size, which represents the maximum number of tokens a single bucket can hold, and refill rate, the rate at which tokens are added to the bucket. Then I define the new token bucket function, which just creates a new rate limiter with the specified bucket size and refill rate and an empty buckets map. This rate limiter has one method called IsAllowed. This method checks if a request with the specified key is allowed. If there is no bucket with the given key, it creates a new bucket for the key and allows the request. If the key exists, it refills the bucket and checks whether there is at least one token in the bucket. If there is a token, it decrements the number of tokens and allows the request; otherwise, it denies the request. The refill method refills the bucket for the specified key. It calculates the number of tokens to add based on the time elapsed since the last refill and the refill rate. After that, it adds them to the bucket, up to the bucket size, and updates the last refill time.
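Here is a minimal sketch along the lines of the limiter just described. The structure mirrors the description above, but the exact names and details are approximations rather than the talk's code.
```go
package ratelimit

import (
	"math"
	"sync"
	"time"
)

// bucket holds the state of one token bucket.
type bucket struct {
	currentTokens  float64
	lastRefillTime time.Time
}

// TokenBucket is a rate limiter that keeps one bucket per key (e.g. client IP).
type TokenBucket struct {
	mu         sync.Mutex
	buckets    map[string]*bucket
	bucketSize float64 // maximum number of tokens a bucket can hold
	refillRate float64 // tokens added per second
}

func NewTokenBucket(bucketSize, refillRate float64) *TokenBucket {
	return &TokenBucket{
		buckets:    make(map[string]*bucket),
		bucketSize: bucketSize,
		refillRate: refillRate,
	}
}

// IsAllowed reports whether a request for the given key may be processed,
// consuming one token when it is.
func (t *TokenBucket) IsAllowed(key string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()

	b, ok := t.buckets[key]
	if !ok {
		// First request from this key: create a full bucket and consume a token.
		t.buckets[key] = &bucket{
			currentTokens:  t.bucketSize - 1,
			lastRefillTime: time.Now(),
		}
		return true
	}
	t.refill(b)
	if b.currentTokens < 1 {
		return false // bucket is empty, drop the request
	}
	b.currentTokens--
	return true
}

// refill adds tokens according to the time elapsed since the last refill.
func (t *TokenBucket) refill(b *bucket) {
	now := time.Now()
	elapsed := now.Sub(b.lastRefillTime).Seconds()
	b.currentTokens = math.Min(b.currentTokens+elapsed*t.refillRate, t.bucketSize)
	b.lastRefillTime = now
}
```
Using float64 tokens lets the bucket accumulate fractional tokens between requests, so the refill rate is respected even for sub-second intervals.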
For this example I also use middleware: I define a middleware for rate limiting. I have a simple rate limiter interface that contains only one method, IsAllowed. This interface allows us to use different rate limiter implementations depending on our needs and replace them more easily. Then I define the rate limiting middleware function. It checks if a request is allowed by calling the IsAllowed method of the rate limiter with the remote address of the request. If the request is not allowed, it responds with a 429 status and the corresponding status text. If the request is allowed, it calls the original handler function. I use the same service as in the previous example, but with a few changes. First, I initialize the rate limiter with a bucket size of one token and a refill rate of one token per second. That means we allow approximately one request per second from a single user. Then I wrap the handler function with the rate limiting middleware and pass the rate limiter there.
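A sketch of what such a middleware might look like, assuming the limiter from the previous sketch; again, these names are illustrative rather than the talk's exact code.
```go
package middleware

import "net/http"

// RateLimiter is the minimal interface the middleware depends on, so
// different rate limiter implementations can be swapped in easily.
type RateLimiter interface {
	IsAllowed(key string) bool
}

// RateLimiting rejects requests that exceed the limit for the client's
// remote address, responding with 429 Too Many Requests.
func RateLimiting(limiter RateLimiter, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if !limiter.IsAllowed(r.RemoteAddr) {
			http.Error(w,
				http.StatusText(http.StatusTooManyRequests),
				http.StatusTooManyRequests)
			return
		}
		next(w, r)
	}
}
```
Note that r.RemoteAddr includes the client's port; a production version would typically extract just the IP, for example with net.SplitHostPort, so that all requests from one client share a bucket.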
I use the same test configuration as in the previous example, and the test results show that the rate limiter allowed us to decrease the average load on the server, and the latency is not as high as before. However, it can't improve the performance significantly in this case. Because the rate limiter is integrated into the service, it uses the same system resources. You should always remember that: using a rate limiter has its overhead and costs as well. You have to consider different trade-offs and find a balance between using a rate limiter and the performance of the service. Moreover, you can't just use the rate limiter from this example in real production applications, because you need some shared state with counters that can be used by all instances of the application. For example, you can use Redis to store the counters and fetch them from Redis during rate limiter checks. Also, you will likely need a distributed rate limiter that can be implemented in a separate service, for example in an API gateway or other solutions. In addition, on this slide I provide a list of different implementations of rate limiters in Golang. You can check these packages, see how different algorithms are implemented, and use one of them in your application if it suits your particular use case. We have considered server overload, the rate limiting technique, and common rate limiting algorithms, and we saw them in examples. Now let's look at another technique called load shedding.
Load shedding is another technique that is used to prevent server overload. It is like a controlled shutdown to prevent a total crash during a traffic overload. It can be implemented in different ways. For example, the system can constantly check its resources, such as CPU and memory. If things get overloaded, load shedding kicks in. It might reject random or non-critical incoming requests, prioritizing critical ones. Also, the system might slow down processing for non-critical tasks. The system might also redirect some requests to other services if possible. By sacrificing some requests, the system stays operational for the most important ones, and critical requests still get processed within a reasonable time frame. A controlled slowdown is better than a complete system breakdown in this situation. However, as always, there are some trade-offs. Users might experience delays or errors for some requests during shedding. Depending on the system, some data processing might be delayed or even lost. So now let's look at an example.
To demonstrate this technique, I implemented a simple algorithm to decide if the system is overloaded and reject random requests. I will use a ticker from the time package, and I assume that if the system is overloaded, we will have some delays and the ticker will not fire on time. I strongly do not recommend this algorithm for production applications; it is just for demonstration purposes. I implemented the logic in a separate package called overload detector. Here I define the overload detector struct that will be used to detect if the system is overloaded based on a specified overload factor. The struct has three fields: check interval, which is the duration between each check for overload; overload factor, which is a duration that is considered the threshold for overload; and an is overloaded flag indicating whether the system is overloaded. I use Bool from the sync/atomic package for thread safety. Then I define the new function, which just creates a new overload detector from the arguments and starts the overload detector check in the background using the run method. Also, I have the IsOverloaded method, which just returns the value of the flag. In the run method, we do the check to see if the system is overloaded. First, we initialize a ticker that ticks every check interval. On each tick, it checks the time since the last check: if it is greater than the overload factor, it considers the system overloaded and sets the flag to true; if not, it sets the flag to false. I also define a middleware for load shedding, as I did for rate limiting. I have a simple overload detector interface that contains only one method, IsOverloaded. Then I define the overload detecting middleware function. It checks if the request is allowed by calling the IsOverloaded method; if the server is overloaded, it responds with a 500 status, and if the system is not overloaded, it calls the original handler function. Again, I use the same service as in the previous example, but with a few changes. I initialize the overload detector with a check interval of ten milliseconds and an overload factor of eleven milliseconds. Then I wrap the handler function with the overload detecting middleware and pass the overload detector there. I use the same test configuration as in the previous examples. The test results show that using load shedding allowed us to decrease the average load on the server. Depending on the particular case and requirements, load shedding can be implemented differently in real production systems. To understand load shedding better and the approaches to its implementation, I recommend reading this article by a senior principal engineer at Amazon, which explains load shedding in more detail.
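A minimal sketch of the ticker-based detector and its middleware as described above; it is for demonstration only, and the names are approximations of what the talk describes rather than the exact code.
```go
package overloaddetector

import (
	"net/http"
	"sync/atomic"
	"time"
)

// Detector flags the system as overloaded when ticker ticks arrive late,
// which is a rough sign that the process is starved of CPU time.
type Detector struct {
	checkInterval  time.Duration
	overloadFactor time.Duration
	isOverloaded   atomic.Bool
}

// New creates a detector and starts the background check.
func New(checkInterval, overloadFactor time.Duration) *Detector {
	d := &Detector{
		checkInterval:  checkInterval,
		overloadFactor: overloadFactor,
	}
	go d.run()
	return d
}

// IsOverloaded returns the current value of the flag.
func (d *Detector) IsOverloaded() bool {
	return d.isOverloaded.Load()
}

func (d *Detector) run() {
	ticker := time.NewTicker(d.checkInterval)
	defer ticker.Stop()
	lastCheck := time.Now()
	for range ticker.C {
		// If the tick arrived later than the threshold, consider the system overloaded.
		d.isOverloaded.Store(time.Since(lastCheck) > d.overloadFactor)
		lastCheck = time.Now()
	}
}

// OverloadDetector is the minimal interface the middleware depends on.
type OverloadDetector interface {
	IsOverloaded() bool
}

// Middleware rejects requests while the detector reports overload.
func Middleware(d OverloadDetector, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if d.IsOverloaded() {
			// 503 Service Unavailable would also be a reasonable choice here.
			http.Error(w,
				http.StatusText(http.StatusInternalServerError),
				http.StatusInternalServerError)
			return
		}
		next(w, r)
	}
}
```
Wired up with a check interval of 10 ms and an overload factor of 11 ms, as in the talk, the detector flags the service as overloaded whenever scheduling delays push the ticks more than a millisecond late.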
Well, we have explored server overload and how it can affect your services. We looked at common reasons for overload and how to identify them. We then reviewed two key techniques to make your systems more resilient: rate limiting, which helps control incoming traffic and prevent overload before it happens, and load shedding, which acts as a safety valve, gracefully degrading service during extreme traffic surges to maintain overall system stability. Understanding and implementing these techniques ensures your services stay healthy and responsive even under heavy load. Remember, a well-designed system anticipates traffic spikes and has mechanisms to handle them effectively. So I hope you found this talk helpful. Thanks for your time.
...

Ivan Lemeshev

Senior Software Engineer, International Team @ Unity

Ivan Lemeshev's LinkedIn account


