Conf42 Observability 2025 - Online

- premiere 5PM GMT

Shrinking the Observability Bill: Smart Strategies for Cost-Effective Kubernetes Monitoring


Abstract

In today’s cloud-native landscape, comprehensive observability is critical—but it often comes with a high price tag. This talk reveals practical techniques to significantly reduce Kubernetes monitoring costs without sacrificing visibility. You’ll learn approaches that can slash your spend by 30–60%.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. I'm Aritra, currently a senior product manager at Microsoft, where I work on cloud-native monitoring and troubleshooting. I'm excited to present today's talk on shrinking the observability bill. We're going to discuss several strategies for optimizing your cloud-native monitoring costs while maintaining the insights you need. So let's jump right in.

Today we're diving into the critical topic of observability costs, with a focus on cloud-native and Kubernetes environments. As we explore strategies for cost-effective monitoring, it's essential to recognize the balance between maintaining visibility and managing expenses. I'll guide you through some of the approaches that can help organizations optimize their observability practices without sacrificing the insights they need to operate efficiently.

Here's our agenda for today. We'll start with why costs keep rising in this space. We'll discuss a framework for cutting the waste. We'll do a deep dive into specific aspects of cloud-native monitoring: the control plane, the nodes, the pods, and the network. I'll cover a few tools and dashboards that can help, and we'll end with key takeaways and an action plan so you can start acting on this right away.

First off, why am I here? Why are we talking about this? Observability can account for up to 40% of the bill, sometimes even more. Given how dynamic cloud-native environments are, with microservices constantly spinning up and down, a huge amount of data gets generated, and sometimes what gets collected is not even fully under your control. That is an important difference between monitoring a regular VM or other host and cloud-native monitoring. The second important point is that while there is a lot of data, not all of it is useful. There may not be valuable insights in everything you collect, so it's important to build smart, efficient strategies that collect only what you need and keep the highest-value data. At the bottom is a sample of what the cost breakdown can look like across observability, networking, compute, and storage. This is just a representative example, but you can see that observability can cost a lot, especially if you're not mindful of what you're collecting.

Where are we spending all this money? There are three or four categories where I have seen enterprises spend most of their observability and monitoring dollars. The first is high-cardinality custom metrics. Most tools charge separate pricing for custom metrics, and these can be extremely high cardinality, meaning they carry labels like user IDs. The second is long log retention: whether it's 30 days or 90 days, it can spike the bill pretty badly, and so can verbose logging. If you have extremely verbose logs configured, such as debug-level logs, you're spending a lot of money just collecting them. The third is the same data being replicated multiple times to different destinations. Sending data across the network to different regions or tools drives costs up too, accruing both egress and storage fees.
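To make the high-cardinality point above concrete, here is a rough back-of-the-envelope sketch (not from the talk; the label counts and the per-series price are illustrative assumptions) of how a single user-ID label can explode the number of active series and, with it, the metric bill.

```python
# Rough, illustrative estimate of how label cardinality drives metric cost.
# All numbers below are assumptions for the sake of the example, not real pricing.

def series_count(label_cardinalities: dict) -> int:
    """Worst-case active series ~= product of distinct values per label."""
    total = 1
    for distinct_values in label_cardinalities.values():
        total *= distinct_values
    return total

# One HTTP request metric, labeled two different ways.
lean_labels = {"tier": 3, "region": 5, "status_code": 8}      # 120 series
noisy_labels = {**lean_labels, "user_id": 50_000}              # 6,000,000 series

ASSUMED_PRICE_PER_SERIES_PER_MONTH = 0.0005  # hypothetical vendor pricing

for name, labels in (("lean", lean_labels), ("noisy", noisy_labels)):
    n = series_count(labels)
    cost = n * ASSUMED_PRICE_PER_SERIES_PER_MONTH
    print(f"{name:>5}: {n:>9,} series  ~${cost:,.2f}/month")
```

Under these assumed prices, dropping that one label is the difference between cents and thousands of dollars a month, which is why the talk keeps coming back to cardinality.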
So now let's get into the framework I promised. The first thing to realize is that we have to collect what matters, and here service level objectives come before all the individual logs. Start with the SLOs: they are the metrics that affect users and reflect what users actually experience. Then choose the right signals. In terms of value, metrics are the most important signal; they drive alerting and make sure you know when services are not up and running. Next are traces, which show how a request flows from one service to another and help you debug. Last are logs, which let you dig into a single service and see what's going wrong. The third part of the framework is tuning retention and aggregation. Retention is how long data is stored; aggregation is how you summarize it to make sense of it. Wherever possible, consolidate and offload: move data to cold storage if it's not actively useful. To summarize, metrics are the most valuable signal, followed by traces, followed by logs.

Now we'll dive into the important parts of Kubernetes monitoring specifically. As I mentioned, there are four components: the control plane, the nodes, the pods, and network monitoring.

When monitoring the control plane, the key word is precision. Instead of ingesting full logs, focus on managed-service metrics such as API latency and health checks. Sampling or summarizing logs can reduce data volume while still providing valuable insights. Alerts should be based on deviations from expected behavior rather than static thresholds, which allows for more responsive monitoring. It's also important to understand why control-plane insights matter: the control plane is the brain of Kubernetes, and if it is down, nothing else functions and you cannot make any changes to the system. Example signals include API server latency, etcd lag, and throttling rates.

Now let's get to the nodes. For smarter node monitoring, it's essential to track only the key metrics that truly reflect system health, such as CPU usage, memory usage, and disk I/O. Avoid getting bogged down in per-process metrics unless you're in a debugging or troubleshooting scenario. Monitoring node degradation through heartbeats and NodeReady transitions can provide early warnings of potential issues without overwhelming your data collection. And again, you do not need to collect every metric all the time; you can collect only when something has changed.

Here I want to touch on something important: the adoption and use of OpenTelemetry. Optimizing metrics with OpenTelemetry can significantly enhance your observability strategy. Using explicit aggregation temporality helps you manage data volume, and you can filter labels to reduce cardinality and limit custom metrics to only business-critical paths, which, as I mentioned, can otherwise drive up costs. This approach not only saves money, it also improves the clarity of the data you collect.
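As a sketch of the OpenTelemetry point above, the snippet below uses the Python SDK's metric views to allow-list low-cardinality attributes and drop a debug-only instrument before export. The service, instrument, and attribute names are hypothetical, and the console exporter stands in for whatever backend you actually use.

```python
# Sketch: limiting metric cardinality and volume with OpenTelemetry SDK views.
# Instrument and attribute names are made up for illustration.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.metrics.view import DropAggregation, View

views = [
    # Keep only low-cardinality attributes on request counts; user_id etc. are stripped.
    View(
        instrument_name="http.server.requests",
        attribute_keys={"tier", "region", "status_code"},
    ),
    # Drop a debug-only instrument entirely instead of paying to export it.
    View(instrument_name="debug.cache.lookups", aggregation=DropAggregation()),
]

reader = PeriodicExportingMetricReader(
    ConsoleMetricExporter(), export_interval_millis=60_000
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader], views=views))

meter = metrics.get_meter("checkout-service")
requests = meter.create_counter("http.server.requests")

# The code records user_id, but the view removes it before the data is exported.
requests.add(
    1, {"tier": "web", "region": "eu-west", "status_code": "200", "user_id": "u-12345"}
)
```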
Now let's go to pod-level visibility. In the cloud-native world, many microservices spin up and spin down, so it's important to optimize pod monitoring costs. Achieving pod-level visibility without the noise is a challenge we can tackle. By filtering logs at the source, we can exclude unnecessary data such as health checks and static endpoints. Using labels to correlate telemetry with services improves our ability to analyze performance, and, as I mentioned before, adopting OpenTelemetry for auto-tagging and structured tracing can streamline this process.

The last area is network costs. When it comes to network observability, it's important to focus on the big issues. Tracking connection failures and DNS resolution problems, which are pretty common, provides critical insight into network health. Using sampled flow records such as NetFlow or IPFIX, you can manage data volume while still capturing essential information. Wherever possible, also adopt eBPF-based tooling, which can offer more efficient monitoring.

The last point I want to touch on is not one of these four components but the third-party tooling you use. Wherever your data lives, make sure you're not duplicating telemetry across different tools, and wherever possible filter at the source: ship already-filtered logs with whatever collector you use, for example Fluent Bit. When you compare tools, base the analysis on the dollar cost per GB for logs and traces, and in some cases, for metrics, on the number of metrics shipped. Below is a table showing different tools and their cost per GB. This analysis helps you understand which tool you should be using and where you should be sending what.

I want to present a quick case study. This is a fictional customer, with no specific customer in mind. By implementing some of these strategies, they were able to reduce their costs by 40% without losing the telemetry they need for troubleshooting. By dropping unnecessary health-check logs and reducing verbosity levels, they streamlined their data collection. By transitioning from raw latency logs to Prometheus histograms, they kept the same essential insights, and moving cold observability data to low-cost object storage reduced their expenses further. You can see the reduction in logs, metrics, and traces, and how things changed before and after. The monthly dollar cost is just an example, not an exact figure.

It's crucial to understand what you're giving up when making changes to your observability strategy. Sampling does not necessarily mean losing visibility; it's a strategic reduction. Shortening retention periods can be effective for most active monitoring scenarios. The key is to balance data fidelity against the trade-offs in both engineering effort and cost. There is a curve, and at some point you stop getting value from adding more metrics or more telemetry. It's important to understand which trade-offs you're signing up for and decide accordingly how you deal with them.
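The case study above mentions replacing raw per-request latency logs with Prometheus histograms. Here is a minimal sketch of what that can look like with the prometheus_client Python library; the metric name, labels, buckets, and port are assumptions.

```python
# Sketch: record latency as a Prometheus histogram instead of logging every request.
# Metric name, labels, buckets, and port are illustrative assumptions.
import random
import time

from prometheus_client import Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["tier", "status_code"],                          # low-cardinality labels only
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

def handle_request() -> None:
    start = time.perf_counter()
    time.sleep(random.uniform(0.01, 0.2))             # stand-in for real work
    REQUEST_LATENCY.labels(tier="web", status_code="200").observe(
        time.perf_counter() - start
    )

if __name__ == "__main__":
    start_http_server(8000)                           # exposes /metrics for scraping
    while True:
        handle_request()
```

A handful of buckets per label combination replaces millions of individual log lines while still supporting percentile queries, which is essentially the trade the case study describes.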
As I discussed, let's now talk about how you implement this in your organization and on your team. First, build a cost-aware observability dashboard. Visualize the cost per telemetry type, as I showed earlier: how much do the logs, metrics, and traces cost? Second, attribute that cost to the teams that are using the telemetry. Add tags for the specific teams and make sure those chargebacks are at least visible. In many cases the tools also integrate with billing export APIs, which help in building this dashboard.

The next step is to set an observability budget, which is a proactive way to manage costs. Setting a cap on telemetry spend per cluster per month is almost essential. Establishing quotas for log volume and custom metrics can help enforce those limits, and tools like Fluent Bit and Loki can help maintain those constraints and keep you within budget.

Here is a framework for your cost-optimization checklist and how you can go about implementing it so that you stay within that budget. First, drop any unused metrics and logs; that's the drop phase. Then shorten retention periods wherever possible: if you do not need logs beyond 30 days and your organization has no policy requiring it, you don't need to keep them. Some metrics you may store for longer, while logs you can usually store for shorter. Filter telemetry at the source, as I mentioned, whether that's at the pods or anywhere else, which saves both compute time and storage cost. Consolidate tools wherever possible. And last but not least, tune your alert rules to minimize noise; we'll talk about that a little more now.

First, a note on high-cardinality metrics. Wherever possible, avoid storing labels like user ID or UUID in your metrics, and prefer labels like tier, region, and status code, which are both more useful and cheaper. This reduces your series count, which in turn reduces your Prometheus storage cost.

Now let's get to the alerts. Detoxifying your alert noise is essential for effective monitoring. A funnel approach, moving from raw data to filtered, to routed, to actionable alerts, can streamline the process. Focus on alerting only for SLO violations or customer impact. Starting your alerts at the level of customer impact both reduces fatigue and improves your mean time to recovery. This strategic approach enhances the overall effectiveness of your observability efforts.

It's important to understand that observability doesn't mean logging everything. Instead, focus on capturing valuable signals that provide insight into system performance. By prioritizing quality over quantity, we can keep costs manageable while still gaining the insights necessary to maintain operational excellence. Once again: focus on valuable signals, not the raw volume of data collected. Just because I collect a lot of data doesn't mean I have all the insights I will need. And when you have that much data, it's not easy to sift through it or understand it; it's like finding a needle in a haystack. Just having the data doesn't always help.
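To illustrate the "alert on SLO violations, not static thresholds" idea above, here is a small sketch of an error-budget burn-rate check. The SLO target, window, and thresholds are assumptions; in practice you would express this as alerting rules in your monitoring system rather than in application code.

```python
# Sketch: page on error-budget burn rate instead of a static "errors > N" threshold.
# SLO target and thresholds are illustrative assumptions.

SLO_TARGET = 0.999                      # 99.9% of requests should succeed
ERROR_BUDGET = 1.0 - SLO_TARGET         # 0.1% of requests may fail

def burn_rate(errors: int, total: int) -> float:
    """How fast the error budget is being consumed over the observed window."""
    if total == 0:
        return 0.0
    return (errors / total) / ERROR_BUDGET

def should_page(errors: int, total: int, threshold: float = 14.4) -> bool:
    """Page only on a fast burn. 14.4 is the commonly cited fast-burn multiplier
    for a 30-day window; treat it as a starting point, not gospel."""
    return burn_rate(errors, total) >= threshold

# 120 errors out of 60,000 requests in the last hour -> burn rate 2.0, no page.
print(burn_rate(120, 60_000), should_page(120, 60_000))
# 1,500 errors out of 60,000 requests -> burn rate 25.0, page.
print(burn_rate(1_500, 60_000), should_page(1_500, 60_000))
```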
Like we discussed, the end-to-end observability flow looks something like this: you ingest the data, then you process it, aggregating or filtering it, then you store it, alert on it, and optimize. It's important to tag a dollar value at each stage to measure the ROI of each step. This is essentially a loop you iterate on to make sure you stay within budget while getting the most value from your data.

I want to end with some recommended tools I suggest you check out. For metrics, Prometheus with Thanos as the storage layer is super helpful, or Cortex as well. You can explore Fluent Bit for log filtering at the source, and the OpenTelemetry SDKs and Collector for traces, metrics, and logs. Grafana dashboards are great for visualization; they're easy to build, and there are plenty of open source dashboards you can reuse. And eBPF tools like Cilium help with network observability.

Once again, I'd like to re-emphasize what we discussed. These are the common pitfalls to avoid: logging everything instead of sampling or filtering; using the wrong retention settings for your telemetry; duplicating telemetry to multiple destinations; high-cardinality metrics, which inflate your bill; and alerting on static thresholds, which leads to alert fatigue.

In closing, remember this mantra: better signals mean lower costs and happier engineers. By focusing on the quality of our observability data, we can improve operational efficiency while managing expenses effectively. Let's strive for a balance that supports both our technical needs and our budgetary constraints. Thanks everyone, I hope you found this useful.
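Finally, as a rough sketch of the "tag a dollar value at each stage" loop from the end-to-end flow above: the volumes and unit costs below are purely illustrative assumptions, but attaching even rough numbers like these to each stage gives the optimization loop something concrete to iterate on.

```python
# Sketch: attach a rough dollar figure to each stage of the observability pipeline.
# Volumes and unit prices are assumptions for illustration only.

MONTHLY_VOLUME_GB = {"ingest": 2_000, "process": 2_000, "store": 1_200, "alerting": 50}
ASSUMED_COST_PER_GB = {"ingest": 0.10, "process": 0.05, "store": 0.02, "alerting": 0.03}

def stage_costs(volumes: dict, unit_costs: dict) -> dict:
    """Dollar cost per pipeline stage, given volume and an assumed unit price."""
    return {stage: volumes[stage] * unit_costs[stage] for stage in volumes}

costs = stage_costs(MONTHLY_VOLUME_GB, ASSUMED_COST_PER_GB)
for stage, dollars in costs.items():
    print(f"{stage:>8}: ${dollars:,.2f}/month")
print(f"{'total':>8}: ${sum(costs.values()):,.2f}/month")
```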

Aritra Ghosh

Senior Product Manager @ Microsoft



