Conf42 Cloud Native 2022 - Online

Cheap or Fast? How we got both by leveraging ML to automatically tune K8s apps


Abstract

After all these years, tuning Kubernetes microservice applications remains a daunting task even for experienced Performance Engineers and SREs, often resulting in companies facing reliability and performance issues, as well as unexpected costs.

In this session, we plan to first illustrate some less-known facts about key Kubernetes resource management and autoscaling mechanisms, and show how properly setting pod resources and autoscaling policies is critical to avoid over-provisioning while ensuring services deliver the expected performance and resilience.

We then demonstrate how a new approach leveraging ML techniques makes it possible to automatically tune both pod and runtime configurations to meet any specified optimization goal, such as minimizing Kubernetes cost or maximizing application throughput, while respecting any SLOs, such as maximum response time and error rates. Results from real-world cases will be used to document how effective this new approach can be at delivering tangible operational efficiency benefits.

Summary

  • In the next 20 minutes or so I'll share with you some of our experiences in tuning applications running on Kubernetes. We'll start by identifying some challenges of modern applications for ensuring performance and reliability. Finally, we will conclude by sharing some takeaways.
  • Giovanni Paolo Gibilisco is head of engineering at Akamas. Many teams struggle with Kubernetes application performance and stability. Why is it so difficult to manage application performance, stability and efficiency on Kubernetes?
  • Reviewing how Kubernetes resource management works helps to better understand the main parameters that impact Kubernetes application performance, stability and cost efficiency. Let's go through five main key aspects and their implications.
  • CPU limits exist to bound the amount of resources a container can consume, but aggressive CPU throttling has a huge impact on service performance. Properly setting your CPU requests and limits is critical to ensuring your Kubernetes cluster remains stable and efficient over time. A new approach is required to successfully solve this problem using machine learning.
  • The ML-powered optimization process is fully automated and works in five steps. The first step is to apply the new configurations suggested by the ML algorithms to our target system. The main result is the best configuration of the software stack parameters that maximizes or minimizes the goal we have defined.
  • ML found another interesting configuration at experiment number 14. This time ML picked settings that significantly change the shape of the container. The peak on the response time upon scaling out is significantly lower. This clearly improves the service resilience. As the optimization goal changes, requests and limits may need to be increased or decreased.
  • When tuning modern applications, the interplay between different application layers and technologies requires tuning the full-stack configuration. Developers cannot simply rely on manual tuning or utilization-based autoscaling mechanisms. ML-based methods can automatically converge to optimal configurations within hours.

Transcript

This transcript was autogenerated.
This is Giovanni Gibilisco, and in the next 20 minutes or so I'll share with you some of our experiences in tuning applications running on Kubernetes. These are the contents that we will cover. We'll start by identifying some challenges of modern applications for ensuring performance and reliability. We'll then review how Kubernetes manages container resources and the factors we need to be aware of if we want to ensure high performance and cost efficiency. We will introduce a new approach we implemented at Akamas, which leverages machine learning to automate the optimization process, and we will do that with a real-world example. Finally, we will conclude by sharing some takeaways. Before proceeding, let me introduce myself. My name is Giovanni Paolo Gibilisco and I serve as head of engineering at Akamas. Okay, let's start with a quick overview of some of the main challenges that come with the development of modern applications. The advent of agile practices allowed developers to speed up the development cycle with the goal of getting rapid feedback and iteratively improving applications, thus increasing the release frequency: it's now common to see applications, or parts of them, released to production weekly or even daily. At the same time, the underlying frameworks and runtimes, such as the JVM, that are used to build those applications have grown in complexity. The emergence of architectural patterns such as microservices has also brought an increase in the number of frameworks and technologies used within a single application. It's now common to see applications composed of tens or even hundreds of services, written in different languages and interacting with multiple runtimes and databases. Kubernetes provides a great platform to run such applications, but it has its own complexities. The Kubernetes failure stories website was specifically created to share incident reports in order to allow the community to learn from failures and prevent them from happening again. Many of these stories describe teams struggling with Kubernetes application performance and stability issues, such as unexpected CPU throttling and slowdowns, and even sudden container terminations. Engineers at Airbnb even got to the point of suggesting that Kubernetes may actually hurt the performance of latency-sensitive applications. But why is it so difficult to manage application performance, stability and efficiency on Kubernetes? The simple answer is that Kubernetes is a great platform to run containerized applications, but it requires applications to be carefully configured to ensure high performance and stability, as we're going to see. To answer this question, let's now get back to the fundamentals and see how Kubernetes resource management works, to better understand the main parameters that impact Kubernetes application performance, stability and cost efficiency. Let's go through five main key aspects and their implications. The first important concept is resource requests. When a developer defines a pod, she has the possibility to specify resource requests. These are the amount of CPU and memory the pod, or better, a container within the pod, is guaranteed to get. Kubernetes will schedule the pod on a node where the requested resources are actually available. In this example, pod A requests two CPUs and is scheduled on a four-CPU node. When a new pod B of the same size is created, it can also be scheduled on the same node. This node now has all of its four CPUs requested.
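For instance, pod A's resource request could be expressed with a minimal pod spec along these lines (a sketch only; the pod name, image and memory figure are illustrative):

```yaml
# Illustrative pod spec: the scheduler will only place this pod on a node
# that still has at least 2 CPUs and 2Gi of memory unrequested.
apiVersion: v1
kind: Pod
metadata:
  name: pod-a                                  # hypothetical name, matching "pod A" above
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest   # placeholder image
      resources:
        requests:
          cpu: "2"        # guaranteed CPUs, counted against the node's allocatable capacity
          memory: "2Gi"   # guaranteed memory (illustrative value)
```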
If a pod C is created, Kubernetes won't schedule it on the same node, as its capacity is full. This means that those numbers the developers specify in the deployment YAML directly affect the cluster capacity. A strong difference with respect to virtualization and hypervisors is that with Kubernetes there is no overcommitment on the requests: you cannot request more CPUs than those available in the cluster. Another important aspect is that resource requests are not equal to utilization. If pod requests are much higher than the actual resource usage, you might end up with a cluster that is at full capacity even though its CPU utilization is only 10%. So the takeaway here is that setting proper pod requests is paramount to ensure Kubernetes cost efficiency. The second important concept is resource limits. Resource requests are guaranteed resources that a container will get, but usage can be higher. Resource limits are the mechanism that allows you to define the maximum amount of resources that a container can use, like two CPUs or 1 GB of memory. All this is great, but what happens when resource usage hits the limit? Kubernetes treats CPU and memory differently here. When CPU usage approaches the limit, the container gets throttled: the CPU is artificially restricted, and this usually results in application performance issues. Instead, when memory usage hits the limit, the container gets terminated, so there is no application slowdown due to paging or swapping as we had in traditional operating systems. With Kubernetes your pod will simply disappear, and you may face serious application stability issues. The third fact is about an important and less-known effect that CPU limits have on application performance. We have seen that CPU limits cause throttling, and you may think that this happens only when CPU usage hits the limit. Surprisingly, the reality is that CPU throttling starts even when CPU usage is well below the limit. We did quite a bit of research on this aspect in our labs and found that CPU throttling starts when CPU usage is as low as 30% of the limit. This is due to the particular way CPU limits are implemented at the Linux kernel level. This aggressive CPU throttling has a huge impact on service performance: you can get sudden latency spikes that may breach your SLOs without any apparent reason, even at low CPU usage. Now, some people, including engineers at Buffer, tried to remove CPU limits. What they got was an impressive reduction of service latency. So is it a good idea to get rid of CPU limits? Apparently not. CPU limits exist to bound the amount of resources a container can consume. This allows many containers to coexist without competing for the same resources. So if CPU limits are removed, a single runaway container can disrupt the performance and availability of your most critical services. It might also make the kubelet service unresponsive and effectively remove the entire node from the cluster. Using CPU limits is a best practice also recommended by Google. Properly setting your CPU requests and limits is critical to ensuring your Kubernetes cluster remains stable and efficient over time. To ease the management of limits and requests for many services, Kubernetes comes with autoscaling. Let's discuss the built-in autoscaling capabilities that are often considered a way to automate this process. In particular, the Vertical Pod Autoscaler, or VPA, provides recommended CPU and memory requests based on the observed pod resource usage.
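For reference, a minimal VerticalPodAutoscaler object, of the kind used in the experiment described next, might look like this (the object and target deployment names are hypothetical):

```yaml
# Minimal VPA sketch: it recommends (and, in "Auto" mode, applies) new
# CPU/memory requests based purely on observed resource usage;
# application-level metrics such as latency are not taken into account.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-service-vpa          # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service            # hypothetical target deployment
  updatePolicy:
    updateMode: "Auto"          # apply recommendations by evicting and recreating pods
```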
However, our experience with the VPA is mixed. In this example, a Kubernetes microservice is serving a typical diurnal traffic pattern. The top-left chart shows the latency of this service and its service level objective, while below you can see the resource requests, CPU and memory, and the corresponding resource utilization. We let this service run for a couple of days with some initial resource sizing, then activated the VPA and let it apply the new recommended settings to the pod. It's interesting to see that the VPA immediately decided to reduce the assigned resources. In particular, it cut in half the CPU requests. This is likely due to some apparent overprovisioning of the service, as the CPU utilization was below 50%. However, with the new settings suggested by the VPA, the latency of the microservice skyrocketed, breaching our SLOs. What is the lesson learned here? Kubernetes autoscaling, and the VPA in particular, is based on resource usage and does not consider application-level metrics like response time. We need to evaluate the effect of the recommended settings, as they might be somewhat aggressive and cause severe service performance or reliability degradations. As we've seen so far, optimizing microservice applications on Kubernetes is quite a challenging tuning task for developers, SREs and performance engineers. Given the complexity of tuning Kubernetes resources and the many moving parts we have in modern applications, a new approach is required to successfully solve this problem, and this is where machine learning can help. AI and machine learning have revolutionized entire industries, and the good news is that ML can also be used in the performance tuning process. ML can automate the tuning of the many parameters we have in the software stack, with the goal of optimizing application performance, resiliency and cost. In this section I would like to introduce you to this new methodology with a real-world case about a European leader in accounting, payroll and business management software. Their Java-based microservice applications run either on Azure or AWS Kubernetes services. The target system of the optimization is the B2B authorization service running on Azure. It's a business-critical service that interacts with all the applications powering the digital services provided by the company. The challenge of the customer was to avoid overspending and achieve the best cost efficiency possible by enabling development teams to optimize their applications while keeping on releasing the application updates required to introduce new business functionalities and align with new regulations. So what is the goal of this optimization? In this scenario, the goal was to reduce the cloud costs required to run the authorization service on Kubernetes. At the same time, we also wanted to ensure that the service would always meet its reliability targets, which are expressed as latency, throughput and error rate SLOs. So how can we leverage ML to achieve this high-level business goal? In our optimization methodology, the ML changes the parameters of the system to improve the metric that we have defined. In this case, the goal is simply to optimize the application cost. This is a metric that represents the cost we pay to run the application on the cloud, which depends on the amount of CPU and memory resources allocated to the containers. The ML-powered optimization methodology also allows us to set constraints to define which configurations are acceptable.
In this case, we state that the system throughput, response times and error rate should not degrade more than 10% with respect to the baseline. Once we have defined the optimization goal, the next step is to define the parameters of the system that machine learning can optimize to improve our goal. In this scenario, nine tunable parameters were considered in total: four parameters are related to Kubernetes container sizing, the CPU and memory requests and limits, which play a big role in the overall service performance, cost and reliability, and five parameters are related to the JVM, which is the runtime that runs within the container. Here we included parameters like the heap size, the garbage collector and the size of the regions of the heap, which are important options to improve the performance of Java apps. It's worth noticing that the ML optimizes the full stack by operating on all these nine parameters at the same time, thereby ensuring that the JVM is optimally configured to run within the chosen container resource sizing. Let's now see how the ML-powered optimization methodology works in practice. The process is fully automated and works in five steps. The first step is to apply the new configuration suggested by the ML algorithms to our target system. This is typically done leveraging the Kubernetes APIs to set the new values of the parameters, for example the CPU request. The second step is to apply a workload to the target system in order to assess the performance of the new configuration. This is usually done by leveraging performance testing tools. In this case, we used a JMeter test that was already available to stress the application with a realistic workload. The third step is to collect KPIs related to the target system. The typical approach here is to leverage observability tools. In this case, we integrated Elastic APM, which is the monitoring solution used by the customer. The fourth step is to analyze the result of the performance test and assign a score based on the specific goal that we have defined. In this case, the score is simply the cost of running the application containers, considering the prices of the Azure cloud. The last step is where the machine learning kicks in, by taking the scores of the tested configurations as input and producing as output the most promising configuration to be tested in the next iteration. In a relatively short amount of time, the ML algorithm learns the dependencies between the configuration parameters and the system behavior, thus identifying better and better configurations. It's worth noticing that the whole optimization process becomes completely automated. So what are we getting as an output of the ML-based optimization? The main result is the best configuration of the software stack parameters that maximizes or minimizes the goal we have defined. These parameters can then be applied in production environments, but the value this methodology can bring is actually much higher. The ML will evaluate many different configurations of the system, which can reveal important insights about the overall system behavior in terms of other KPIs like cost, performance or resiliency. This supports performance engineers and developers in their decisions on how to best configure the application to maximize their specific goals.
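Before moving on to the results, here is a purely illustrative sketch, not any specific tool's syntax, of how the optimization goal, constraints and the nine tunable parameters described above could be summarized as configuration:

```yaml
# Purely illustrative optimization study (pseudo-configuration, not a real tool's schema).
goal:
  minimize: application_cost            # cloud cost of the CPU and memory allocated to the containers
constraints:                            # configurations breaching these are discarded
  - "throughput    >= 0.9 * baseline"   # no more than 10% degradation vs. the baseline
  - "response_time <= 1.1 * baseline"
  - "error_rate    <= 1.1 * baseline"
parameters:
  kubernetes:                           # four container-sizing parameters
    - cpu_request
    - cpu_limit
    - memory_request
    - memory_limit
  jvm:                                  # five runtime parameters; the talk explicitly names
    - min_heap_size                     # heap size, garbage collector and heap region size
    - max_heap_size
    - garbage_collector
    - heap_region_size
    - gc_threads                        # assumed fifth parameter; not named in the talk
```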
So, to assess the performance and cost efficiency of each new configuration suggested by the ML optimizer, we stressed the system with a load test. Here you can see the load test scenario that we used, designed according to performance engineering best practices. The traffic pattern mimicked the behavior seen in production, including the API call distribution and think times. Before looking at the results, it's worth commenting on how the application was initially configured by the customer. We call this the baseline configuration. Let's look at the Kubernetes settings first. The container powering this application was configured with resource requests of 1.5 CPUs and 3.42 GB of memory. The team also specified resource limits of 2 CPUs and 4.39 GB of memory. Remember, the requests are the guaranteed resources that Kubernetes will use for scheduling and capacity management of the cluster. In this case, requests are lower than the limits. This is a common approach to guarantee the resources the application needs to run properly, while at the same time allowing some room for unexpected growth. Besides looking at the container settings, it's important to also see how the application runtime is configured. The runtime is what ultimately powers our application, and for Java apps we know that JVM settings play a big role in app performance, but the same happens for Golang applications. For example, the JVM was configured with a minimum heap of half a gigabyte and a max heap of 4 GB. Notice that the max heap is higher than the memory request, which means that the JVM can use more memory than the amount requested. As we're going to see, this configuration will have an impact on how the application behaves under load and on the associated resiliency and costs. It's worth noting that the customer also defined autoscaling policies for this application, leveraging the KEDA autoscaling project for Kubernetes in their environment: both CPU and memory were defined as scalers, with triggering thresholds of 70% and 90% utilization, respectively. What is important to keep in mind is that such utilization percentages are relative to the resource requests, not the limits. So, as you can see in the diagram on the right, an action to scale out the application will happen, for example, when the CPU usage goes above one core.
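To recap the baseline setup in one place, here is an illustrative sketch of the container sizing, JVM options and KEDA policy just described (deployment, container and object names are hypothetical, and JAVA_OPTS is an assumed way of passing the JVM flags; the resource values, heap settings and thresholds are the ones quoted above):

```yaml
# Baseline container sizing and JVM options (illustrative sketch).
# Note that the -Xmx4g max heap is larger than the 3.42Gi memory request:
# one of the misalignments discussed below.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-service                   # hypothetical deployment name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: auth-service
  template:
    metadata:
      labels:
        app: auth-service
    spec:
      containers:
        - name: auth-service
          image: registry.example.com/auth-service:latest   # placeholder image
          resources:
            requests:
              cpu: "1.5"
              memory: "3.42Gi"
            limits:
              cpu: "2"
              memory: "4.39Gi"
          env:
            - name: JAVA_OPTS          # assumed mechanism for passing JVM flags to the app
              value: "-Xms512m -Xmx4g" # min heap 0.5 GB, max heap 4 GB (above the memory request)
---
# KEDA ScaledObject: utilization thresholds are evaluated against the
# resource requests, not the limits (70% of 1.5 CPUs is roughly one core).
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: auth-service-scaler            # hypothetical name
spec:
  scaleTargetRef:
    name: auth-service                 # hypothetical target deployment
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "70"                    # scale out above 70% of the CPU request
    - type: memory
      metricType: Utilization
      metadata:
        value: "90"                    # scale out above 90% of the memory request
```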
Okay, we've covered how the application is configured. Let's now look at the behavior of the application, with the baseline configuration, when subjected to the load test we've shown before. In this chart you can see the application throughput, the response time and the number of replicas created by the autoscaler. Two facts are important to notice. When the load increases, the autoscaler triggers a scale-out event which creates a new replica. This event causes a big spike in response time, which impacts service reliability and performance. This is due to the high CPU usage and throttling during the JVM startup. When the load drops, the number of replicas does not scale down, despite the container CPU usage being idle. It's interesting to understand why this is happening. This is caused by the configuration of the container resources, the JVM tuning inside, and the autoscaler policies, in particular for the memory resources. The autoscaler in this case is not scaling down because the memory usage of the container is higher than the configured threshold of 70% usage with respect to the memory request. This might be due to the JVM max heap being higher than the memory request we've seen before, but it may also be due to a change in the application memory footprint, for example due to a new application release. This effect clearly impacts the cloud bill, as more instances are up and running than required. This shows that configuring Kubernetes apps for reliability and cost efficiency is actually a tricky process. Let's now have a look at the best configuration identified by ML with respect to the defined cost efficiency goal. This was found at experiment number 34, after about 19 hours, and it almost halved the cost of running the application with respect to the baseline. First of all, it's interesting to notice how our ML-based optimization increased both the memory and CPU requests and limits, which is not at all obvious and may seem at first counterintuitive, especially as Kubernetes is often considered well suited for small and highly scalable applications. The other notable changes are related to the JVM options. The max heap size was increased by 20% and is now well within the container memory request, which was increased to 5 GB. The min heap size has also been adjusted to be almost equal to the max heap, which is a configuration that can avoid garbage collection cycles, especially in the startup phase of the JVM. So let's now see how the application performs with the new configuration identified by ML and how it compares with respect to the baseline. There are two important differences here. Response time always remains within the SLO and there are no more peaks, so this configuration not only improves on cost, but is also beneficial in terms of performance and resilience. Autoscaling is not triggered in this configuration, as the full load is sustained by just one pod. This is clearly beneficial in terms of costs. Let's also compare in detail the best configuration with respect to the baseline. Here we can notice that the pod is significantly larger in terms of both CPU and memory, especially for the requests. This configuration has the effect of triggering the autoscaler less often, as we have seen, but interestingly and somewhat counterintuitively, while this implies a kind of fixed cost, considering the prices of the container resources, it turns out to be much cheaper than a configuration where autoscaling is triggered, and it also avoids performance issues. The container and runtime configurations are now better aligned: the JVM max heap is now below the memory request, which has a beneficial effect as it also enables the scale-down of the application, should scaling be triggered by higher loads.
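As a rough sketch of the direction of these changes: the talk quotes the 5 GB memory request and the heap sizing, while the exact CPU and memory limit figures are not given, so those values are placeholders.

```yaml
# Best (lowest-cost) configuration found by the ML optimizer (container spec fragment, sketch only).
# Larger pod, with the JVM heap now aligned below the container memory request.
resources:
  requests:
    cpu: "2"            # placeholder: increased vs. the 1.5-CPU baseline (exact value not quoted)
    memory: "5Gi"       # increased from 3.42Gi
  limits:
    cpu: "2.5"          # placeholder: increased vs. the 2-CPU baseline (exact value not quoted)
    memory: "5.5Gi"     # placeholder: exact value not quoted
env:
  - name: JAVA_OPTS                 # assumed mechanism for passing JVM flags
    value: "-Xms4800m -Xmx4800m"    # max heap +20% (about 4.8 GB), min heap almost equal to max
```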
Let's now have a look at another configuration found by ML at experiment number 14, after about 8 hours of automated optimization. We labeled this configuration "high reliability" for a reason that will be clear in a minute. The score for this configuration, while not as good as the best configuration's, also provided about a 60% cost reduction, so this can also be considered an interesting configuration with respect to the cost efficiency goal. As regards the parameters, what is worth noticing is that this time ML picked settings that significantly change the shape of the container: it now has a much smaller CPU request with respect to the baseline, but the memory is still pretty large, which is pretty interesting. The JVM options were also changed. In particular, the garbage collector was switched to Parallel, which is a collector that can be much more efficient in its use of CPU and memory. Let's compare the behavior of this configuration with respect to the baseline. There are two important differences here. The peak on the response time upon scaling out is significantly lower. It's still higher than the response time SLO; however, the peak is less than half the value of the baseline configuration. This clearly improves the service resilience. Autoscaling works properly: after the high-load phase, replicas are scaled back to one. This behavior is what we expect from an autoscaling system that works properly. Notice that the response time peaks could also be further reduced: it would simply be a matter of creating a new optimization with the goal of minimizing the response time metric instead of the application cost. Let's now also compare in detail the high-resilience configuration with respect to the baseline. Quite interestingly, this configuration has a higher memory request and a lower CPU request, but higher limits, than the baseline. As you may remember, the lowest-cost configuration instead had a higher CPU request than the baseline. Without getting into much detail in the analysis of this specific configuration, what these facts show is that, as the optimization goal changes, CPU and memory requests and limits may need to be increased or decreased, and multiple parameters at the Kubernetes and JVM levels also need to be tuned accordingly. This is a clear confirmation of the perceived complexity of tuning Kubernetes microservice applications, as here we are just discussing one microservice out of the hundreds or more of today's applications. There are many other interesting configurations found by ML that we would like to discuss, but I think it's time to conclude with our takeaways. Our first takeaway is that, when tuning modern applications, the interplay between different application layers and technologies requires tuning the full-stack configuration to make sure that both the optimization goal and the SLOs are met, as we've seen in our real-world example. A second takeaway is that the complexity of these applications, under varied workloads and in a context of frequent releases with agile practices, requires a continuous performance tuning process. Developers cannot simply rely on manual tuning or utilization-based autoscaling mechanisms. Finally, in order to explore the vastness of the space of possible configurations in a cost- and time-efficient way, it's mandatory to leverage ML-based methods that can automatically converge to an optimal configuration within hours, without requiring deep knowledge of all the underlying technologies. Many thanks for your time. I hope you enjoyed the talk. Please reach out to me if you have found this talk interesting. I would love to share more details and hear about your Kubernetes challenges.

Giovanni Paolo Gibilisco

Head of Engineering @ Akamas
