Conf42 DevOps 2024 - Online

The 2024 Kubernetes Benchmark Report

Abstract

The 2024 Kubernetes Benchmark Report is out! Join us as we review the results of over 150,000 scanned workloads to learn what’s working and what needs to be improved. Join to see how you compare and get advice on what you need to do, and what you don’t need to do. Key Takeaways: Benchmark yourself against others, see what you are doing well and what you need to improve, and get past issues that may be limiting your ability to gain the full value of Kubernetes.

Summary

  • Fairwinds' Kubernetes benchmark report for 2024. Includes more than a dozen different types of policies covering reliability, security, and cost efficiency. The report shows how well organizations are aligning to best practices.
  • Over one third of organizations need to right size their containers to improve efficiency. 78% of organizations have at least 10% of their workloads missing CPU requests. This could be an easily solved problem with policy enforcement and guardrail tools.
  • About a quarter of organizations today are relying on a cached version for 90% of their images. Another pattern that we're seeing is that container health checks seem to be missing or ignored in some deployments. Development teams need to consider what changes they need to make to their application.
  • 30% of organizations have less than 10% of their deployments missing replicas. Sometimes this is because of what we call the copy and paste problem. We're hoping to see more organizations with fewer deployments missing replicas.
  • About 28% of organizations are running about 90% of their workloads with insecure capabilities. One positive trend coming out of this year's report, though, is that we're seeing fewer containers set to run as root.
  • Almost 84% of organizations are getting almost complete scan coverage of containers in their runtime. About a third of organizations have less than 10% of their workloads without a network policy. Fairwinds can provide guardrails to help you solve your business problems.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
All right, I'm really excited today to share a little bit about Fairwinds' Kubernetes benchmark report for 2024. We're really proud of the work that our team has put together over the past few weeks getting this report ready for the new year, and excited to share some of the findings that we observed by studying over 330,000 workloads. So just as a quick introduction, my name is Joe Pelletier. I'm a product manager here at Fairwinds. I've been in the Kubernetes and container space for a little over five years and have been working on the Fairwinds Insights product.
Like I mentioned, this is the third year we've done this report, and we've analyzed over 330,000 workloads to inform the data behind it. It's the largest number of workloads we've ever analyzed for the report, and it includes more than a dozen different types of policies covering reliability, security, and cost efficiency. When it comes to Kubernetes, we find that organizations really have to consider all three types of checks in order to make sure that they are running workloads that align with best practices.
This gives you a little bit of an example of the types of policies we've evaluated. I won't go through every single one, but we'll be covering some of these in today's presentation. The final report covers a number of different categories of policies and provides a lot more depth than what we'll be able to cover in today's session, so I do recommend that you download the report. What you'll see in today's presentation is a summary and an abridged version, and hopefully you'll see a little bit of where the industry is going in terms of both Kubernetes best practices and how well organizations are aligning to those best practices.
Okay, so a common graph that you're going to see in this report will look like this.
And what I'll do is spend a little bit of time explaining how to read the report and the different charts and graphs that you're seeing. So on the y axis here, you'll see the percentage of workloads impacted within an organization. And on the x axis, you'll see the percentage of organizations that were evaluated, in terms of how many of their workloads were actually impacted as a percentage. A great way to read this example is that the number of organizations with less than 10% of workloads impacted has fallen from 46% in 2022 to 21% in 2023. When you see something like that, where the number of organizations that have such a small percentage of workloads impacted decreases, it demonstrates that the problem might be getting harder to control. So that's one of the ways that we've been able to highlight information in this report, and you'll see that in various parts of today's presentation and when you read the final piece.
So let's kick it off with the highlight from this year's analysis, which is this really interesting fact that over one third of organizations today, specifically 37% of organizations, need to right size their containers to improve efficiency. When we dig into the details, we notice that this 37% of organizations have 50% or more of their containers over provisioned. It's an interesting finding, because what we notice is that a lot of times developers have to guess their resource requests and limits when they go to deploy, because they don't really have the tooling or the feedback loops to tell them what those resource requests and limits should be. And a lot of times developers will guess high. They will over provision by giving their application too much memory or too much compute.
And when left unchecked or unmonitored, organizations end up incurring lots of additional compute spend as a result. This is the first year that we actually started looking at this data. And again, you'll see that the 37% of organizations that need to right size 50% or more of their containers really represents the bottom part of this chart. But it's also interesting to see that there's a large cohort of organizations, in this case 57% of organizations, that have less than or equal to 10% of their workloads impacted. So some organizations do seem to get this right. What we're really excited to do is monitor this progress over the years. This being our first year measuring this, we're baselining it, and next year should help us understand where that trend is going.
Another aspect of Kubernetes efficiency that's important to solve is making sure that both memory and CPU requests are set on deployments before they're set to run in Kubernetes. Now, Kubernetes technically makes these settings optional, but when you don't set memory or CPU requests, it can make it difficult for Kubernetes to properly schedule that workload. What we're seeing is that this is becoming more and more of a systemic issue. This year, we're noticing that 78% of organizations have at least 10% of their workloads missing CPU requests, and this is up from about 50% last year. So again, I think the numbers are all over the map here. You'll see that pretty much every organization has some amount of this problem.
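As an illustration of the kind of check being described, here is a minimal Python sketch that looks for containers without CPU or memory requests. It assumes the Deployment manifest has already been parsed into a plain dictionary; the function name `containers_missing_requests` and the sample manifest are hypothetical, and this is not how Fairwinds Insights implements its policy.

```python
def containers_missing_requests(deployment):
    """Map container name -> list of resource requests ("cpu", "memory") it lacks."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    missing = {}
    for container in containers:
        # "resources" may be absent or null, so fall back to empty dicts.
        requests = (container.get("resources") or {}).get("requests") or {}
        absent = [name for name in ("cpu", "memory") if name not in requests]
        if absent:
            missing[container["name"]] = absent
    return missing


# Hypothetical manifest: "api" sets both requests, "sidecar" sets only memory.
example_deployment = {
    "kind": "Deployment",
    "metadata": {"name": "checkout"},
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "api",
                        "resources": {"requests": {"cpu": "250m", "memory": "256Mi"}},
                    },
                    {
                        "name": "sidecar",
                        "resources": {"requests": {"memory": "64Mi"}},
                    },
                ]
            }
        }
    },
}

print(containers_missing_requests(example_deployment))
```

A check like this could run in CI against rendered manifests, so that a missing request surfaces as review feedback rather than as a scheduling surprise.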
But interestingly enough, this is actually a fairly easy thing to solve with guardrails and policy enforcement mechanisms for Kubernetes, where you can give developers feedback at the time of their pull request or at the time of their deployment when they are missing these settings, and use that as an opportunity to educate developers as well. So while a lot of organizations may have missing CPU requests, we also think this could be an easily solved problem with policy enforcement and guardrail tools.
Let's shift a little bit to reliability and look at a few different trends that we've observed in this year's report. At a very high level, about a quarter of organizations today are relying on a cached version for 90% of their images. And what does that mean? Well, it really means that the pull policy is not set to Always, which is a general best practice. You want to make sure that your containers are pulling the latest image so that you don't have inconsistency. It can also help you from a security perspective to make sure that, when you do push an image with updates, it's actually pulling in that latest image and not just the cached version. So this is a general best practice, and what we're seeing right now is that about 24% of organizations are relying on this pull policy not being set to Always for 90% of their images.
Another pattern that we're seeing is that container health checks seem to be missing or ignored in some deployments as well. Right now about 66% of liveness and 69% of readiness probes are missing in Kubernetes deployments. It's important to set these because it helps Kubernetes automatically restart containers and ensure that the applications are available to receive traffic and then ultimately serve users.
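The two reliability checks just mentioned, pull policy and health probes, can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `reliability_findings` and the sample container are hypothetical, and the input is assumed to be a container spec already parsed into a dictionary.

```python
def reliability_findings(container):
    """Return human-readable findings for one container spec (a plain dict)."""
    findings = []
    # Only the explicit field is inspected. Kubernetes itself defaults an
    # omitted policy to Always for ":latest" or untagged images, which this
    # simplified sketch does not model.
    if container.get("imagePullPolicy") != "Always":
        findings.append("imagePullPolicy is not Always")
    for probe in ("livenessProbe", "readinessProbe"):
        if not container.get(probe):
            findings.append(f"{probe} is missing")
    return findings


# Hypothetical container: pinned tag, cached pulls, liveness probe only.
example_container = {
    "name": "api",
    "image": "registry.example.com/api:1.4.2",
    "imagePullPolicy": "IfNotPresent",
    "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
}

for finding in reliability_findings(example_container):
    print(finding)
```

For the sample container this reports the pull policy and the missing readiness probe, while the configured liveness probe passes.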
So this is actually considered one of the more basic ways to ensure application reliability in Kubernetes, and we're still seeing that organizations struggle to various degrees here. I think part of it is because the configuration does require a little bit of application specific input. Development teams need to consider what changes they need to make to their application in order to make sure that the health checks work for Kubernetes. So we're hoping that this trend improves over the years as well.
Another trend that we identified is that deployments are missing replicas. Another general best practice is to make sure that there are a couple of replicas available for pods, and right now we're noticing that 30% of organizations have less than 10% of their deployments missing replicas. So this is an improvement over 2023, but it still highlights that, if you look at the graphic here, some organizations have much more than just 10% impacted. There might be lots of applications missing replicas. I think sometimes this is because of what we call the copy and paste problem. Sometimes a deployment from one team that's missing this best practice gets copied by another team who's looking to get their application deployed. And so they may be propagating these misconfigurations, where replicas weren't set by the previous team and now the new team is using that same configuration without replicas. You can see how this becomes a wider problem. Again, this is usually a very quick fix, like a one line change to your infrastructure as code. And even though we're on an improved path here, we're hoping to see more and more organizations have fewer and fewer deployments missing these replicas.
Shifting gears a little bit, we'll also take a look at security. Now, security in Kubernetes can mean a lot of different things. We look at security from two lenses in this report.
One is image vulnerabilities, and the other is the configuration itself, so the YAML or the Helm chart that's being deployed. At a very high level, we're noticing that about 28% of organizations are running about 90% of their workloads with insecure capabilities. That means that they're adding some sort of insecure capability like NET_ADMIN. A lot of times it might be necessary for some applications or workloads to have these additional capabilities, but sometimes it may not be, and it could be accidentally added to apps, going back to that original copy and paste problem where one team copies the configuration from another team as a starting point and inadvertently propagates some of these misconfigurations going forward. So we always look to make sure that applications start without these dangerous or insecure capabilities added, and that helps ensure a good baseline from a security perspective.
One positive trend coming out of this year's report, though, is that we're seeing fewer containers set to run as root. 30% of organizations today are running 70% or more of their containers as root, which is a drop from 44% in last year's report. Part of me thinks that this is another example of a low hanging opportunity to make a quick win, a quick fix to containers by essentially turning off the ability to run as root, which again is a one line change. This type of misconfiguration is also a very popular example when talking about the issues of misconfigurations in Kubernetes. A lot of organizations cite running as root as a common example of that. So it's great to see that this trend is going in the right direction, in that fewer and fewer organizations have a vast majority of their containers running as root, and that seems to be in decline, which is awesome.
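To make the two configuration checks above concrete, here is a small Python sketch that flags added capabilities and containers not marked to run as non-root. It is a simplified illustration under stated assumptions: the `RISKY_CAPABILITIES` set is an example list rather than the report's definition of insecure capabilities, the function name `security_findings` is hypothetical, and the pod spec is assumed to be parsed into a dictionary.

```python
# Illustrative only: an assumed short list, not the report's policy definition.
RISKY_CAPABILITIES = {"NET_ADMIN", "SYS_ADMIN"}


def security_findings(pod_spec):
    """Return (container_name, finding) tuples for a pod spec given as a dict."""
    pod_ctx = pod_spec.get("securityContext") or {}
    findings = []
    for container in pod_spec["containers"]:
        ctx = container.get("securityContext") or {}
        added = set((ctx.get("capabilities") or {}).get("add") or [])
        risky = sorted(added & RISKY_CAPABILITIES)
        if risky:
            findings.append((container["name"], "adds " + ", ".join(risky)))
        # A container-level runAsNonRoot overrides the pod-level setting.
        run_as_non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot", False))
        if not run_as_non_root:
            findings.append((container["name"], "runAsNonRoot is not true"))
    return findings


# Hypothetical pod: "web" inherits the pod default, "proxy" overrides it.
example_pod_spec = {
    "securityContext": {"runAsNonRoot": True},
    "containers": [
        {"name": "web", "image": "registry.example.com/web:2.0"},
        {
            "name": "proxy",
            "image": "registry.example.com/proxy:1.1",
            "securityContext": {
                "runAsNonRoot": False,
                "capabilities": {"add": ["NET_ADMIN"]},
            },
        },
    ],
}

for name, finding in security_findings(example_pod_spec):
    print(f"{name}: {finding}")
```

In the sample, only `proxy` is reported, once for the added capability and once for overriding the pod-level non-root default.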
And I think it's important to note that running a container as root, just overall, increases the risk of a malicious user taking advantage of that root privilege as part of a larger attack. So from a defense in depth perspective, you want your container to not run as root by default, unless it absolutely needs to because of some special need or use case for that app. So again, this is going in the right direction, and we hope it stays that way going forward.
Switching gears a little bit away from misconfigurations, we'll talk about image vulnerabilities. These are the image vulnerabilities that may exist in running containers, or that are found when scanning container images as part of your CI/CD process or your shift left process. I think this is an ongoing challenge for many organizations, but we do see some signs of progress in this year's report. If we dig into the first section, where we show the percentage of workloads impacted, 26% of organizations have less than 10% of their workloads affected, which is an improvement from 12% in 2023. So we're seeing a greater percentage of organizations with fewer workloads impacted by image vulnerabilities. I think that is a signal both of organizations upgrading their third party containers to newer, less vulnerable versions, and of organizations integrating and scanning more of their containers so that they have a process in place for this.
In the report, you're also going to see a section where we talk about scanned images. Fairwinds is able to help companies identify if there are images running in their cluster that they have not scanned. And this has greatly improved over the year. We're seeing almost 84% of organizations getting almost complete scan coverage of containers in their runtime. That's up from 64% last year.
So I think that's a great sign that organizations are doing the first step, which is scanning as many of their images as possible so that they understand their risk, and then taking remediation after. We hope that next year we see an even higher percentage of organizations with fewer workloads affected.
One of the enhancements that we made to Fairwinds Insights last year was adding some specific checks related to the NSA hardening guide. The NSA released Kubernetes hardening guidance, I think, back in 2021, and there were a number of great recommendations there, and we expanded the number of checks that Fairwinds Insights offers to match the recommendations in the NSA hardening guide. So a lot of new security checks made their way into the Fairwinds Insights platform this year. One of those checks is verifying whether there's a network policy configured for workloads. Network policies are increasingly important because they help you segment workload traffic and ensure that you've got controls around which pods can speak to which pods.
So we wanted to get a sense of how the industry is doing on this particular policy, and I think we see two types of organizations. 37%, or about a third of organizations today, have less than 10% of their workloads without a network policy. That's a great sign that there's a lot of network policy adoption happening in some organizations, where they're making sure that their workloads have a network policy set. But on the other hand, there's still a majority of organizations that have way more than 50% of their workloads without a network policy. It means that they're deploying to Kubernetes and the workload is running fine, but that workload can speak to any other workload in the cluster.
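One rough way to picture the "workloads without a network policy" measurement is the Python sketch below. It is a simplified illustration, not the report's methodology: the workload dictionaries are a hypothetical shape, the function names are made up for this example, and only `matchLabels` selectors are handled (a real NetworkPolicy can also use `matchExpressions`).

```python
def policy_selects(policy, namespace, labels):
    """True if the policy is in the namespace and its matchLabels all match."""
    if policy["metadata"].get("namespace", "default") != namespace:
        return False
    selector = policy["spec"].get("podSelector") or {}
    match_labels = selector.get("matchLabels") or {}
    # An empty podSelector selects every pod in the policy's namespace.
    return all(labels.get(key) == value for key, value in match_labels.items())


def without_network_policy(workloads, policies):
    """Names of workloads that no policy selects."""
    return [
        w["name"]
        for w in workloads
        if not any(policy_selects(p, w["namespace"], w["labels"]) for p in policies)
    ]


# Hypothetical inventory: three workloads, one policy covering only "web".
example_workloads = [
    {"name": "web", "namespace": "shop", "labels": {"app": "web"}},
    {"name": "db", "namespace": "shop", "labels": {"app": "db"}},
    {"name": "batch", "namespace": "jobs", "labels": {"app": "batch"}},
]
example_policies = [
    {
        "kind": "NetworkPolicy",
        "metadata": {"name": "web-ingress", "namespace": "shop"},
        "spec": {"podSelector": {"matchLabels": {"app": "web"}}},
    }
]

uncovered = without_network_policy(example_workloads, example_policies)
share = 100 * len(uncovered) / len(example_workloads)
print(uncovered, f"{share:.0f}% of workloads have no network policy")
```

Here two of the three sample workloads are not selected by any policy, which is the situation described for organizations with most of their workloads uncovered.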
And so I think it shows that the industry still has a ways to go to make sure that network policy adoption is even more widespread. Just to give a little bit of an example of why we think this is important, network policies help you limit egress and ingress traffic. When you have that ability to control the traffic, it allows you, again from a defense in depth perspective, to prevent any undesirable access to those pods.
So those are some of the summaries and the highlights from the report. Again, I think we're only covering about a quarter of the information that the report has this year, but I also wanted to help organizations understand what a path forward is. If you're running lots of Kubernetes today, how do you ensure that your teams are following reliability, security, and cost efficiency best practices? I think that's really where Fairwinds Insights can provide a lot of value. It can provide guardrails to help you solve your business problems. Whether it's ensuring that your images are free of vulnerabilities, or that your workloads are aligned to standards like the NSA hardening guide, SOC 2, or ISO 27001, there's a big security reason to provide developers with guardrails and feedback around their configuration hygiene.
Increasingly in 2023, we did notice that a lot of organizations were very cost conscious. They wanted to make sure that they had a way to measure their container usage, but also to right size containers so that they're using the correct memory and CPU and not overspending in ways that incur additional cost or just waste compute resources. Fairwinds provides both Kubernetes cost allocation and container right sizing recommendations, and that's helped organizations in some cases save over 25% on their container costs.
And then finally, this notion of guardrails is core to everything that we do. In order to make sure that engineers have the tools to take action on this feedback, you want to be able to provide guardrails at different steps in the process, whether it's at the time of pull request, when they're making their infrastructure as code changes, or at the time of deployment, also known as the time of admission, when applications are being deployed into the Kubernetes environment. You want to give that feedback to developers and give them a way to remediate things easily, but also ensure consistency so that you're not introducing risk or over provisioned applications along the way. These are the core capabilities that Fairwinds Insights provides and how our customers are getting value.
So I do encourage you to take a look at the Kubernetes configuration benchmark report for this year. Like I said, we only really covered about a quarter of what's in that report, and there's a lot more broken out by security, cost, and reliability, so you can see the different patterns. If you're interested, I'd recommend reaching out to me on LinkedIn. I'm happy to point you in the right direction, and I think that's really what we were hoping to cover today. So thanks again for the time, and I'm looking forward to hearing your thoughts out there in the community.

Joe Pelletier

Vice President, Product Strategy @ Fairwinds

Joe Pelletier's LinkedIn account


