Conf42 Site Reliability Engineering 2021 - Online

Let the machines optimize the machines: ML-driven automated performance tuning

Abstract

SREs’ main goal is to achieve optimal application performance, efficiency and availability. A crucial role is played by configurations (e.g. JVM and DBMS settings, container CPU and memory, etc.): wrong settings can cause poor performance and incidents. But tuning configurations is a manual and lengthy task, as there are hundreds of settings in the stack, all interacting in counterintuitive ways.

In this talk, we present a new approach that leverages machine learning to find optimal configurations of the tech stack. The optimization process is automated and driven by performance goals and constraints that SREs can define (e.g. minimize resource footprint while matching latency and throughput SLOs). We show examples of optimizing Kubernetes microservices for cost efficiency and latency by tuning container sizing and JVM options.

With the help of ML, SREs can achieve higher application performance, in days instead of months, and have a lot of fun in the process!

Summary

  • This session presents a new approach SREs can use to automate the performance tuning of system configurations by leveraging machine learning techniques.
  • System configurations are key for SREs, as they can significantly impact service performance, efficiency and reliability. Performance optimization is getting harder and harder these days, so a new approach should allow full-stack optimization, says the CTO of Akamas.
  • SREs keep a key role in the new automated optimization process: they define the optimization goals and constraints, then let the machine-learning-based optimization automatically identify the optimal configurations. In one case, a configuration improving cost efficiency by 77% was identified automatically in about 24 hours.
  • The goal of the second use case is to increase the number of successful transactions processed by the ad microservice. The SLO for this service is an average response time no higher than 100 milliseconds. The best configuration identified by the machine-learning-powered optimization provides a 28% increase in transactions per second.
  • Tuning modern applications is a complex problem that is hard to solve, and traditional tuning approaches cause significant toil for SRE teams. The new approach, based on fully automated machine-learning-based optimization, is a huge improvement for SREs.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE? A developer? A quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative Litmus Cloud.
Hi, and welcome to my talk. The topic of this session is a new approach SREs can use to automate the performance tuning of system configurations by leveraging machine learning techniques. We will start by discussing the SRE challenges in ensuring application performance and reliability. We will then introduce a new approach based on machine learning techniques, and we will see how it can be used to automatically tune today's complex applications, focusing on optimizing a microservice application running on Kubernetes and the Java virtual machine. Finally, we will conclude by sharing some takeaways on how this approach benefits SRE teams.
Before proceeding, allow me to introduce myself. My name is Stefano Doni and my role is CTO at Akamas. Performance engineering has always been my topic of interest and passion.
So let's start by discussing why system configurations are so important to the job and mission of SREs. I would like to start by reviewing, at a high level, what the job of SREs really is, citing the seminal SRE book by Google. It's no surprise that any SRE deeply cares about service performance, efficiency and reliability. To understand what that means in practice, let's have a look at some of the core tenets.
Pursuing maximum change velocity without violating SLOs: this well-known tenet refers to the goal for SREs of accelerating product innovation and releases while matching agreed service level objectives, for example ensuring request latency stays below a target value.
Demand forecasting and capacity planning: SREs should have a clear understanding of the capacity of the service, in terms of the maximum load that can be supported. This is typically assessed via load testing, to determine the right resources to provision for the service, for example before a product launch.
Efficiency and performance: efficiency means being able to achieve the target load with the required response time using the least amount of resources, which means lower cost. SREs and developers can change the service software to improve its efficiency, and here is a major area where configurations can provide big gains, as we are going to see in a moment.
So what is the role of system configurations, and why are they so important for SREs? System configurations are key for SREs as they can really impact service performance, efficiency and reliability. Let's consider a real-world example related to a Java service. After a JVM reconfiguration on day three, the CPU usage of the application service changed dramatically, a 75% reduction, while the system was still able to support the same traffic load. So this configuration change translated into significant gains in service efficiency. On the right, you have another example where an optimal configuration significantly improved service resiliency: while the baseline configuration crashed under load, the best configuration supported the target load with much greater stability.
So tuning system configurations really matters. And this does not apply only to Java: it also applies to all the infrastructure levels, from databases to containers, middleware and application servers, to any other technology. But how easy is it to do that for modern applications? Well, not so much. Actually, quite the contrary.
The problem is that performance optimization is getting harder and harder these days. There are three key factors that I would like to highlight here. First of all, the explosion of configuration parameters: today a JVM has more than 800 parameters, a MySQL database more than 500, and on the cloud side AWS added more than 100 EC2 instance types just last year. Second, configurations can have unpredictable effects: in one example, allocating less memory to a MongoDB database made it run much faster. Third, release velocity is increasing. This is well known to SREs, but it's worth mentioning that it also has negative impacts on performance tuning, which is largely a manual task done by experts and, as such, takes a lot of time.
So system configurations can play a big role in service performance and reliability, but optimizing them is a daunting task for SREs, given the increasing complexity of technology and modern delivery processes. This is why we think we need a new approach. At Akamas, we have identified four key requirements that we believe are needed to effectively solve this class of problems.
First, the new approach should allow a full-stack optimization, meaning that it has to be able to optimize several different technologies at the same time. For example, the optimization might target container parameters and runtime settings, like the JVM options, at the same time.
Second, the new approach should allow a smart exploration of the space of potential configurations. If we consider all the possible values of all the possible parameters, we really have billions of configurations, so a brute-force approach is simply not feasible and would not work for this kind of problem. To be effective, a solution needs to be identified in a reasonable time frame and at an acceptable cost.
Third, the new approach should allow us to align to each specific need, priority and use case of interest by defining custom optimization goals. For example, in some cases SREs may want to reduce the service latency and provide the best possible performance at peak load; in other cases, reducing cost will be the key driver of the optimization, without violating service level objectives.
And last but not least, the new approach should be fully automated, to reduce toil as much as possible and ensure a reliable process with the fastest convergence to the optimal solution.
So this figure shows the five phases of the automated optimization process. The first step is to apply a configuration to the target service of the optimization; this can be a new JVM option or a new Kubernetes container resource setting, for example. The second step is to apply a workload to the target system, so as to assess the impact of the applied configuration, typically by integrating with load testing tools. The third step is to collect key performance indicators related to the target system, that is, data and metrics about the behavior of the system under the workload; this step typically leverages existing monitoring tools. The fourth step is to score the configuration against the specified goal by leveraging the collected KPIs; this is where the system performance metrics are fed back to the machine learning optimizer, scoring the tested configuration against the goal. The last step is where machine learning kicks in, taking this score as input and producing as output the most promising configuration to be tested in the next iteration of the same process.
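To make the loop concrete, here is a minimal sketch of these five phases in Python, using scikit-optimize's Bayesian optimizer as a stand-in for the machine learning engine. The three helper functions are placeholders for the integrations mentioned in the talk (configuration management, load testing, monitoring); they and the parameter ranges are illustrative assumptions, not the actual Akamas implementation.

```python
# Minimal sketch of the five-phase optimization loop; scikit-optimize
# stands in for the ML optimizer, and the helpers are placeholders.
from skopt import Optimizer
from skopt.space import Integer, Categorical

# The configuration space: container sizing plus one JVM option.
space = [
    Integer(250, 2000, name="cpu_millicores"),   # container CPU limit
    Integer(256, 4096, name="memory_mb"),        # container memory limit
    Categorical(["G1", "Parallel"], name="gc"),  # JVM garbage collector
]

def apply_configuration(cfg):   # 1. push the config, e.g. via Kubernetes APIs
    ...

def run_workload():             # 2. drive realistic load, e.g. via Locust
    ...

def collect_kpis():             # 3. read metrics, e.g. from Prometheus
    return {"throughput": 0.0, "p90_latency_ms": 0.0, "cost": 1.0}

def score(kpis):                # 4. score the config against the goal
    return kpis["throughput"] / kpis["cost"]

opt = Optimizer(dimensions=space, base_estimator="GP")
for _ in range(50):
    cfg = opt.ask()             # 5. ML proposes the next configuration
    apply_configuration(cfg)
    run_workload()
    opt.tell(cfg, -score(collect_kpis()))  # skopt minimizes, so negate
```

Each tell updates the optimizer's internal surrogate model, so later ask calls concentrate on promising regions of the configuration space; constraint handling is elided here for brevity.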
In a relatively short amount of time, the machine learning algorithm learns the dependencies between the configurations and the system behavior, thus identifying better and better configurations. An important aspect is the role of SREs in this new AI-driven optimization process. SREs keep a key role: they define the optimization goals and constraints, and then let the machine-learning-based optimization automatically identify the optimal configurations.
Let's see how this approach applies to a real-world example, where we will automate the optimization of a service by tuning Kubernetes and the JVM options. In our example, the target system is Google Online Boutique, a cloud-native application running on Kubernetes and made of ten microservices. This application features a modern software stack, with services written in Golang, Node.js, Java and Python, and it also includes a load generator based on Locust, which generates realistic traffic to test the application. We will leverage this application to illustrate two different but related use cases.
The first use case is tuning the Kubernetes CPU and memory limits, which define how many resources are assigned to a Kubernetes container, to ensure application performance, cluster efficiency and stability. For SREs, the challenge is the need to ensure that the overall service will sustain the target load while also matching the defined SLOs on response time and error rate. Service efficiency is also very important: in this case we want to minimize the overall cost, so we do not want to assign more resources than are actually needed. Also, as SREs, we want to keep operational toil minimal and stay aligned to the delivery milestones.
So this is the overall architecture. The parameters that we are optimizing in this example are the CPU and memory limits of each single microservice. These parameters are applied via the Kubernetes APIs. In order to assess the performance of the overall application with respect to the different container sizes, we leverage the built-in Locust load injector. We then leverage Prometheus and the Istio service mesh, respectively, to gather pod resource consumption metrics and service-level metrics. The whole optimization process is completely automated.
So what is the goal of this optimization? In our example, we aim at increasing the efficiency of the Online Boutique application, that is, to both increase the service throughput and decrease the cloud cost at the same time. We also want our service to always meet its reliability targets, which are expressed as latency and error rate SLOs. Therefore, our optimization goal is to maximize the ratio between the service throughput and the overall service cost. Service throughput is measured by Istio at the front-end layer, where all user traffic is processed. The service cost is the cloud cost we would pay to run the application on the cloud, considering the CPU and memory resources allocated to each microservice. In this example, we use the pricing of AWS Fargate, a serverless Kubernetes offering by AWS, which charges $29 per month for each CPU requested and about $3 per month for each gigabyte of memory. The machine learning algorithm that we have implemented at Akamas also allows setting constraints on which configurations are acceptable. In this case, we state that configurations should have a maximum 90th-percentile latency of 500 milliseconds and an error rate lower than 2%. That is it.
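As an illustration, the goal and constraints just described could be encoded as a scoring function like the sketch below. The prices are the Fargate figures quoted above; the function names and the example sizing numbers are hypothetical.

```python
# Sketch of the cost-efficiency goal and SLO constraints described above.
CPU_PRICE_MONTH = 29.0  # dollars per month per requested vCPU (Fargate)
MEM_PRICE_MONTH = 3.0   # dollars per month per GB of memory (Fargate)

def monthly_cost(limits):
    """Cloud cost of the whole application, given per-microservice limits."""
    return sum(CPU_PRICE_MONTH * s["cpu"] + MEM_PRICE_MONTH * s["mem_gb"]
               for s in limits.values())

def cost_efficiency(throughput_tps, limits):
    """Goal to maximize: transactions per second per dollar per month."""
    return throughput_tps / monthly_cost(limits)

def meets_constraints(kpis):
    """Constraints: p90 latency <= 500 ms and error rate < 2%."""
    return kpis["p90_latency_ms"] <= 500 and kpis["error_rate"] < 0.02

# Hypothetical sizing for two of the ten microservices:
limits = {"frontend":  {"cpu": 0.50, "mem_gb": 1.0},
          "adservice": {"cpu": 0.25, "mem_gb": 0.5}}
print(cost_efficiency(30.0, limits))  # ~1.14 tps per dollar per month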
At this point, the machine-learning-based optimization can run automatically while we are relaxing or doing other important stuff. Let's see what results we got.
In this chart, each dot represents the cost efficiency of the application, which is the service throughput divided by the cloud cost, resulting from the different configurations of microservice CPU and memory limits chosen by the machine learning optimizer. The result of the machine-learning-based optimization is quite astonishing: a configuration improving the cost efficiency by 77% was automatically identified in about 24 hours. The baseline configuration, which corresponds to the initial sizing of the microservices, has a cost efficiency of 0.29 transactions per second per dollar per month; the best configuration machine learning found by tuning the microservice resources achieved 0.52 transactions per second per dollar per month. This chart also helps in understanding how our machine-learning-based optimization works: by learning from each tested configuration, it quickly converges toward the optimal configurations.
At this point, I guess you may want to know what this best configuration looks like. Let's inspect it by focusing on the ten most important parameters. It's interesting to notice how the optimization reduced the CPU limits of many microservices, which is clearly a winning move from a cost perspective. However, it also automatically learned that two particular microservices, the product catalog and the recommendation service, were underprovisioned, and thus increased both their assigned CPU and memory. All these changes at the microservice level were critical to achieving the optimization goal we set on the overall, higher-level service: maximizing throughput and lowering cost while still matching the SLOs.
Let's now see how the overall service performance changes when the best configuration is applied. The chart on the left shows how the base and best configurations compare in terms of service throughput: besides being much more cost efficient, the best configuration also improved throughput by 19%. On the right, we are comparing the service 90th-percentile response times: the best configuration cut the latency peaks by 60% and made the service latency much more stable.
Let's now consider another use case on the same target system. Here the SRE challenge is how to tune the JVM options of a critical microservice, so as to ensure the service can support the higher target load that is expected while still matching the defined SLOs. The container size is not changed here, as we don't want to increase cost; again, we want to keep operational toil to a minimum and stay aligned to product launch milestones. The target architecture is the same as in the previous picture; here we are emphasizing the specific JVM options, about 32 of them, that are taken into account, and the specific microservice, the ad service, that is being targeted.
Let's see how the optimization goal and constraints are defined in this case. The goal for this use case is to increase the number of successful transactions processed by the ad microservice. The SLO for this service is the average response time, which should be kept no higher than 100 milliseconds. Let's see the results machine learning achieved for this optimization.
This chart shows the service throughput and response time under increasing levels of load during a load test. First of all, consider the blue and green lines, representing the microservice throughput and response time for the baseline configuration: as you can see, the baseline configuration reaches 74 transactions per second before violating the 100-millisecond SLO on response time. Let's now look at the best configuration: its throughput, the black line, reaches 95 transactions per second before violating the response time SLO. Therefore, the best configuration identified by the machine-learning-powered optimization provides a 28% increase in transactions per second while also meeting the defined service level objectives.
So what makes this configuration so good? The maximum heap memory was increased, by 250% actually. It's also interesting to notice that the garbage collector type was changed from G1 to Parallel in this case, and machine learning also decreased the number of garbage collection threads: to three for the parallel GC threads and to one for the concurrent ones. Machine learning also adjusted the heap regions and the object aging threshold to maximize performance. As in the previous case, we also got eight JVM options automatically selected as the most impactful among the dozens, actually hundreds, of potential JVM options to be considered.
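For reference, an option set in the spirit of that winning configuration might look like the sketch below. The flags are real JVM options matching the changes just described, but the concrete values are illustrative stand-ins, since the talk reports only the relative changes.

```python
# Hypothetical JVM option set mirroring the changes described above;
# the actual tuned values were found by the optimizer, not hand-picked.
best_jvm_options = [
    "-Xmx700m",                    # max heap raised by ~250% vs. the baseline
    "-XX:+UseParallelGC",          # collector switched from G1 to Parallel
    "-XX:ParallelGCThreads=3",     # parallel GC threads reduced to 3
    "-XX:ConcGCThreads=1",         # concurrent GC threads reduced to 1
    "-XX:MaxTenuringThreshold=4",  # object aging threshold adjusted
]

# One way to hand these to the ad service container (illustrative):
# the JVM picks up the JAVA_TOOL_OPTIONS environment variable at startup.
java_tool_options = " ".join(best_jvm_options)
```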
All right, it is time to conclude with our takeaways. Our first takeaway is that tuning modern applications is a complex problem that is hard to solve, and any traditional tuning approach causes significant toil for SRE teams. A second takeaway is that with our new approach, based on fully automated machine-learning-based optimization, it becomes possible for SREs to really ensure that applications have higher performance and reliability. And third, this new approach also makes it possible to reduce operational toil and stay aligned to release milestones: a huge improvement for SREs.
Many thanks for your time. I hope you enjoyed the talk. Please send me any comments and questions by leveraging the conference Discord channels or these contacts.
...

Stefano Doni

CTO @ Akamas



