Conf42 DevSecOps 2022 - Online

Let your Kubernetes environment flourish with just the right amount of resources

Keeping K8s environments optimized is a complex and time-consuming task that only a few master. There are hundreds of parameters R&D must define to reach optimal capacity, one that perfectly balances performance, resilience, and cost. Join us to learn some best practices to achieve that.


  • Today we will be talking about the efficiency and resiliency of large-scale Kubernetes environments. Second-day operations basically start when your environment goes live. Environments scale up and down horizontally according to demand. Unsatisfying resilience and constantly growing cost are a good sign that the system is not properly right-sized.
  • We need to provision as few resources as possible, but without compromising performance. The request should guarantee enough resources for proper operation, and the limit should protect our nodes from overutilization. Switching from data to actionable intelligence will streamline the decision-making process.
  • The PerfectScale approach can help with right-sizing Kubernetes. In the demo cluster we detected 131 different resilience issues related to missing or misconfigured resources. The recommendations are combined into a convenient YAML file that you can use to fix the problem.


This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome to our session. Today we will be talking about the efficiency and resiliency of large-scale Kubernetes environments. My name is Eli Birger, and I'm a co-founder and chief technology officer of PerfectScale. Prior to establishing PerfectScale, I managed DevOps and infrastructure teams for many years. I have built multiple large-scale SaaS systems, in recent years mainly based on Kubernetes. My talk today will focus on second-day operation challenges, and specifically on the right-sizing of Kubernetes environments. Second-day operations basically start when your environment goes live and you start serving real customers. The second day is not a single milestone; it is the beginning of a long journey of day-to-day development and operations across the environment. All of these day-to-day operations have a single purpose: to provide customers with the best possible experience using the system and, from the executive perspective, the best possible experience at the lowest possible cost. To achieve this, the Kubernetes ecosystem provides us with two types of tools: the Horizontal Pod Autoscaler (I personally prefer KEDA here) and the Cluster Autoscaler (some may prefer Karpenter). The combination of the Horizontal Pod Autoscaler and the Cluster Autoscaler allows us to dynamically change the entire environment. Environments scale up and down horizontally according to demand. So it seems we just need to set up an HPA and the Cluster Autoscaler and start enjoying the best possible experience at the lowest possible cost. Now that both the Horizontal Pod Autoscaler and the Cluster Autoscaler are installed and configured, we expect our environments to have a high resilience level combined with a steady cost pattern that follows the demand fluctuations.
But when we look at the real data, we find something like this: a resilience level that is not always satisfying and a constantly growing cost. This is a good sign that our system is not properly right-sized. Despite the presence of the HPA and the Cluster Autoscaler, there is no magic here. Kubernetes horizontal scalability relies heavily on proper vertical sizing definitions of pods and nodes. Let's see how it works in detail. Here is a pod with a request of four CPU cores and eight gigabytes of memory. Those request values define how many resources the node should allocate for this specific pod when the pod is assigned to the node. Here we are looking at an example node with eight CPU cores and 16 gigabytes of memory. The relevant fraction of the node's resources is reserved for our pod. Now, when Kubernetes needs to schedule additional pods, it will place them on the same node only if the remaining allocatable capacity of the node fits the pod's request. For example, this red pod with a twelve-gigabyte memory request cannot be assigned to the node. Instead, this pod goes to the unschedulable queue. The Cluster Autoscaler constantly monitors this queue, and once there is a pod in it, it simply adds a node to our cluster. So both the Cluster Autoscaler and the HPA are tightly coupled to the pod requests. Let's see how. As we saw in the previous slide, the Cluster Autoscaler scales up the number of nodes only when a pod can't be scheduled on the existing nodes, and it scales down a particular node only if the sum of the requests on the node is less than a threshold. By the way, the default threshold is 50% of the node's allocatable resources. So if the total allocations on your node are more than 50%, this node will not be removed from the cluster, even if there is enough space for the pods running on it to be hosted on other nodes. The same goes for the HPA, or specifically for the resource-based HPA.
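The scheduling and scale-down rules described above can be sketched in a few lines of Python. This is a simplified illustration, not the actual scheduler or Cluster Autoscaler code: the names `Resources`, `fits`, and `is_scale_down_candidate` are hypothetical, the node's full size is treated as allocatable, the red pod's CPU request is an assumed value (the talk only gives its memory request), and taking the larger of the CPU and memory shares is an assumption about how the threshold is applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu_cores: float
    memory_gb: float


def fits(allocatable: Resources, reserved: Resources, request: Resources) -> bool:
    """Return True if the pod's request fits into the node's unreserved capacity."""
    return (
        reserved.cpu_cores + request.cpu_cores <= allocatable.cpu_cores
        and reserved.memory_gb + request.memory_gb <= allocatable.memory_gb
    )


def is_scale_down_candidate(allocatable: Resources, reserved: Resources,
                            threshold: float = 0.5) -> bool:
    """Return True if the summed requests are below the threshold share of the node."""
    cpu_share = reserved.cpu_cores / allocatable.cpu_cores
    memory_share = reserved.memory_gb / allocatable.memory_gb
    # Assumption: the larger of the two shares decides whether the node is "underused".
    return max(cpu_share, memory_share) < threshold


# Numbers from the talk: an 8-core / 16 GB node and a 4-core / 8 GB pod.
node = Resources(cpu_cores=8, memory_gb=16)
first_pod = Resources(cpu_cores=4, memory_gb=8)
# The talk only states the 12 GB memory request; the CPU value is hypothetical.
red_pod = Resources(cpu_cores=1, memory_gb=12)

print(fits(node, Resources(0, 0), first_pod))    # True: the empty node can host the pod
print(fits(node, first_pod, red_pod))            # False: 8 GB + 12 GB exceeds 16 GB
print(is_scale_down_candidate(node, first_pod))  # False: exactly 50% is not below 50%
```

Note that nothing in this sketch looks at actual usage: both decisions depend only on the requested values, which is why inaccurate requests distort autoscaling.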
New replicas start when the utilization of the current pods exceeds some percentage of the pod requests, and I would like to stress this again: utilization is measured against the requested amount. So now that we understand the importance of pod requests, how do we actually right-size our pods, and what are the correct values for the request and the limit? Here is a simple answer. We need to provision as few resources as possible, but without compromising performance. The request should guarantee enough resources for proper operation, and the limit should protect our nodes from overutilization. So let's see what happens in the misprovisioning scenarios. If pod requests are too big, we cause waste and excessive CO2 emissions. If the requests are under-provisioned, Kubernetes will not guarantee that the pod has enough resources to run. If we forget to set requests at all, Kubernetes will not allocate enough resources for the pod on the node during assignment. This, the same as under-provisioning, may cause unexpected pod eviction under memory pressure or CPU pressure. As for the limits, under-provisioned limits cause CPU throttling or out-of-memory failures: the service fails for lack of resources during load bursts, even if there are plenty of free resources available in the cluster or even on that particular node. Over-provisioned limits set a wrong cutoff threshold, which can end with the failure of the entire node. Failure of a node under a load spike can easily turn into a domino effect and cause a complete outage of our system. Specifically for the CPU limit: in some situations it is okay to remove the CPU limit completely, and only the CPU limit. We are not talking here about the memory limit at all. This is because of the compressible nature of CPU. The Completely Fair Scheduler of the operating system will figure out how to distribute CPU time between the different containers. So finally our mission of right-sizing is clear.
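To make the dependency between the resource-based HPA and the pod requests concrete, here is a minimal Python sketch of the replica calculation documented for the Kubernetes HPA: desired replicas = ceil(current replicas × current utilization / target utilization), with a default tolerance of 10% around the target. The function names and the sample numbers are hypothetical, and the sketch ignores details such as stabilization windows, missing metrics, and min/max replica bounds.

```python
import math


def utilization_percent(usage_millicores: list[float], request_millicores: float) -> float:
    """Average CPU utilization of the pods, expressed as a percentage of the request."""
    total_requested = request_millicores * len(usage_millicores)
    return 100 * sum(usage_millicores) / total_requested


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, tolerance: float = 0.1) -> int:
    """Replica count per the documented HPA formula (simplified)."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: no scaling action.
        return current_replicas
    return math.ceil(current_replicas * ratio)


usage = [900, 950, 850]  # hypothetical per-pod CPU usage in millicores

# Same usage, two different request values, same 60% utilization target.
tight = utilization_percent(usage, request_millicores=1000)   # 90.0
loose = utilization_percent(usage, request_millicores=2000)   # 45.0

print(desired_replicas(3, tight, target_utilization=60))  # 5
print(desired_replicas(3, loose, target_utilization=60))  # 3
```

The load is identical in both cases; only the request differs. With a 1000m request the HPA scales out to five replicas, while with a 2000m request it stays at three, which illustrates why horizontal scaling only behaves well when the vertical sizing is right.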
Let's roll up our sleeves and set each and every pod with as few resources as possible without compromising performance. But how do we actually decide what the right values are? Is it half a core or four cores? Is it 100 megabytes or 1 GB? Intuitively, we can try to calculate it based on the metrics. Or maybe we will just have the VPA recommend some values for us. It seems like an easy task. We just need all the service owners to go workload by workload, look at all the metrics, and adjust them accordingly, for hundreds of workloads in multiple clusters. And we will also ask those service owners to keep doing it every time there is a code change or a change in architecture or traffic patterns. Unfortunately, this does not sound like a realistic plan, and this level of complexity definitely requires a solution. From my personal experience, good DevOps solutions consist of 70% philosophy and 30% technology. The philosophy part of the solution to our problem is to establish an effective feedback loop to pinpoint, quantify, and address the relevant problems. The technology part is the shift from data to intelligence. What is the difference between data and intelligence? Data is not considered intelligence until it is something that can be applied or acted upon. In other words, humans are not good at analyzing massive amounts of data. It is boring and time-consuming. Switching from data to actionable intelligence will streamline the decision-making process. This approach allows us to shift from continuous firefighting to proactively pinpointing, predicting, and fixing problems, and to switch from guesstimation mode to data-driven decision-making. The end result of such an approach will be improved resilience, fewer SLA and SLO breaches, reduced waste and carbon footprint, and effective governance of the platform. Now let's see it in action. So let's see how the PerfectScale approach can help with right-sizing Kubernetes. Here we see a cluster.
This cluster contains 240 different workloads. Here they are: Deployments, StatefulSet applications, DaemonSets, and Jobs. The total cluster cost for the last month is $3,687. Let's see the big picture of our cluster. Over the last month, our cluster as a whole utilized 61 CPU cores or less, and 261 GB of memory or less, 99% of the time. The combined requests set for all the workloads were 156 CPU cores or less 99% of the time. The same goes for memory: 407 GB or less. Let's see the total allocated. This is the size of our cluster, and we can easily see that our cluster is nearly four times bigger than we would actually need 99% of the time. However, this picture shows us that we have enough resources to run any workload in this cluster. Still, we detected 131 different resilience issues related to missing or misconfigured resources such as requests or limits. Let's see an example. This is Couchbase. It is a StatefulSet running in the prod namespace. It ran for 924 hours within the last month. This number represents the total uptime of all the replicas that this workload has. For example, if we observe a one-hour time frame and have one replica, the number will be one, and if we have three replicas during the same hour, the number for this hour will be three. Then we understand which nodes this workload is running on. We also understand what fraction of the node is actually allocated to this workload, so we eventually know how much the workload costs. We are indicating a high resilience risk for this workload. Let's see what the risk is and what we know about this workload. This workload has between two and four replicas, with an average of three, during the last month. And we see heavy throttling happening on the CPU. Why is this throttling happening?
This throttling happens because this particular workload is defined with 1000 millicores as the request and 3000 millicores as the limit. 95% of the time our utilization was two CPU cores, and the highest spike that we observed is very, very close to the limit that we set. This is why the throttling happens. Those values might have been correct at the moment they were set, but since then many things have changed. Maybe you have more customers, maybe you have more data in the database, maybe you have a less efficient query, or more microservices pulling data from the same database. So PerfectScale analyzes the behavior of all the replicas of this workload and comes up with recommendations for how many resources you need to set in order to run this workload smoothly. Those recommendations are also combined into a convenient YAML file that you can simply copy and paste into your infrastructure as code and run through CI/CD in order to actually fix the problem. But in some situations you are not the person to make the actual fix. There is a service owner, and they need to address the issue. So we can simply create a task. This task goes directly to Jira, can later be assigned to the relevant stakeholder, and fits into the normal workflow of the development lifecycle. An additional perk: we can set different resilience levels for our workloads. For example, if we are running a production database, we would like to set much wider boundaries for the workload. If we set it to the highest resilience level, our recommendations will be much bigger, and we will also calculate the impact of the change. So this particular database at the highest level of resilience would increase the monthly cost by about 70 to 80%. In the same way, we detect over-provisioned workloads.
For example, this collector catcher is a Deployment running in the prod namespace, and we spent $94 on this workload during the last month, out of which $76 was completely wasted. Let's see how. This workload contains two different containers: the Jaeger agent that collects traces and the actual business logic container. The business logic container is provisioned with 10 GB of memory for each replica. It runs between one and six replicas, with an average of three, and the utilization is somewhere around 2 GB of memory. So we basically throw away 8 GB of memory for each replica that we are running. Again, we have a handy YAML to fix the problem, and we can create a task in a similar way. We pinpoint all the different problems that you have in your cluster and categorize them by risk. So you can either focus on the highest risks in a particular namespace, or you can dive into a particular type of problem, for example, under-provisioned memory limits. Let's see it in action again. So we have the workload here. This workload suffers from a very low request: 95% of the time we need three times more resources, and the limit is very, very close to the actual utilization. We also observed memory utilization trending up. So we are basically predicting that at some point an out-of-memory failure will occur, and we suggest fixing the problem by increasing the amount of allocated resources and increasing the limit. This is going to be the impact of the change, but we will have this workload running smoothly. Now let's see the multi-cluster, multi-cloud view. In this view, we see each and every cluster running in the different clouds. We see all the problems that each particular cluster has, all the waste, the total cost, and even the carbon footprint that the cluster generates. We see how those numbers sum up in the organization-level view.
We see how much the cost is, how much the waste is, how much savings we generated, and how many risks still exist. So I hope you enjoyed our session today and learned something new about the right-sizing and right-scaling of Kubernetes. Feel free to ping me on LinkedIn or contact me through our website. Thank you very much for your time.
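Two pieces of bookkeeping from the demo, the replica-hours figure and the per-replica memory waste, can be sketched as follows. This is only an illustration of the arithmetic described in the talk, not PerfectScale's implementation; the function names are hypothetical, and the dollar figures from the demo are not reproduced because they depend on pricing and CPU allocation details the talk does not give.

```python
def replica_hours(replicas_per_hour: list[int]) -> int:
    """Total uptime of a workload: the replica counts summed over each observed hour."""
    return sum(replicas_per_hour)


def memory_waste(request_gb: float, typical_usage_gb: float) -> tuple[float, float]:
    """Return (wasted GB per replica, wasted share of the request)."""
    wasted_gb = max(request_gb - typical_usage_gb, 0.0)
    return wasted_gb, wasted_gb / request_gb


# One hour with one replica followed by one hour with three replicas.
print(replica_hours([1, 3]))    # 4

# The business logic container from the demo: 10 GB requested, about 2 GB used.
print(memory_waste(10.0, 2.0))  # (8.0, 0.8)
```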

Eli Birger

Co-Founder & CTO @ PerfectScale

Eli Birger's LinkedIn account

Amir Banet

Co-Founder & CEO @ PerfectScale

Amir Banet's LinkedIn account
