Conf42 Observability 2025 - Online

- premiere 5PM GMT

Logging in the Age of Cost-Cutting: Smart Strategies to Reduce Bills


Abstract

As organizations adopt cloud-native architectures, logging costs have surged due to escalating data volumes, vendor pricing models, and inefficient practices. This paper examines the root causes of rising logging expenses (e.g., unstructured data, over-retention) and presents a framework for cost optimization without compromising observability. We evaluate three key strategies: (1) sampling (head- and tail-based) to reduce volume, (2) filtering to eliminate low-value logs (e.g., health checks), and (3) tiered storage (hot/cold) to align retention with access needs. Furthermore, we compare open-source alternatives (Grafana Loki, SigNoz) to commercial solutions (Datadog, Elastic Cloud), highlighting trade-offs in cost, scalability, and functionality. A case study demonstrates how structured logging and pipeline optimizations reduced costs by 60% for a mid-sized enterprise. The paper concludes with best practices for implementing these strategies, emphasizing the importance of context-aware logging and vendor-agnostic tooling. This work provides actionable insights for engineering teams seeking to balance cost efficiency with diagnostic capability in distributed systems.

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, welcome to Conf42 Observability 2025. I am Venkata Madhu Prateek Reddy Kambala, a senior DevOps engineer with 15-plus years of experience in IT. I started as a network engineer in both the enterprise and data center domains, then moved to the cloud as a DevOps engineer, and now work as a senior DevOps engineer implementing DevOps and DevSecOps methodologies for deploying infrastructure into AWS using services like compute, storage, databases, serverless infrastructure, and container-based application infrastructure. I also have experience across domains like finance, healthcare, insurance, and e-commerce.

The topic I chose today is logging in the age of cost-cutting: smart strategies to reduce bills. Over the last decade, companies migrated a lot of their workloads to the cloud because of its ease of use, high availability, fault tolerance, and scalability, and because the cloud offers an exponentially faster path from ideation to implementation than traditional data centers. But all this movement to the cloud incurs a lot of cost, so the latest buzzword is FinOps, where cost optimization is a top strategy, and logging is one of the areas where cost optimization gets examined. Today we will dive into how industry leaders like Uber, Airbnb, and Slack implemented their cost optimization strategies, reducing up to 70% of their bills while retaining the same level of debugging and monitoring capability.

Let's look at what causes this cost crisis for logging. According to openly available research, Airbnb spent around $1.2 million a year on stale, unused debug data, and research states that of all the logging generated, only about 30% is ever used for queries, monitoring, or any other meaningful operation an enterprise performs; the remaining 70% sits in storage, costing enterprises a premium without any use. The problem compounds with microservice architectures, because each microservice generates its own logs. We will look at how Airbnb mitigated this, and at the different cost optimization strategies other companies used for logging.

Moving to the next slide: why are logging costs spiraling out of control? The first reason is the microservices explosion. We moved from monolithic architectures to microservices, which created many services, each independently generating its own log streams; creating and maintaining all those streams means exponential log generation and log storage, which adds up to exponential cost. Second, cloud vendors price at both the ingestion and the storage level, so you are effectively charged twice: per GB while ingesting the data, and again while storing it for later retrieval. Third, there are development practices where some enterprises or developers still leave debug-level logging enabled in production, which is not recommended because debug generates a huge volume of log data that is rarely used.
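On the debug-levels point, a common guard is to drive the log level from configuration so that production defaults to INFO and DEBUG is only switched on deliberately, for example during an incident. The snippet below is a minimal sketch of that idea in Python; the LOG_LEVEL variable name and the service name are illustrative assumptions, not something from the talk.

```python
import logging
import os

# Hypothetical environment variable; the talk only says debug logging
# should not be left enabled in production by default.
level_name = os.environ.get("LOG_LEVEL", "INFO").upper()

logging.basicConfig(
    level=getattr(logging, level_name, logging.INFO),
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

logger = logging.getLogger("payments-service")
logger.debug("cart contents: %s", {"sku": 42})  # suppressed at the INFO default
logger.info("order accepted")                   # emitted at INFO and below
```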
Unless there is an alert or something that needs to be troubleshot, enabling the debug log level is not advised, because it increases the sheer volume of logs generated, and 90% or even 99% of the time that data is not needed. Finally, there is a lack of governance: no clear policies on how log collection should work or what the optimization strategies are, and that inconsistency leads to excessive logging, which leads to excessive storage, so cost is incurred again at both levels, ingestion and storage.

Let's look at the first strategy, which is implementing sampling. Uber had a big breakthrough here. We mentioned that about 70% of logs generated go unused; for debug logs it is almost 95%. Uber implemented a sampling mechanism that retained only 5% of the debug logs generated, discarding the other 95%, which helped them save around $8 million annually; this was published at SREcon 2022. While doing that, they still retained the logs that mattered: they were only sampling the debug logs, reducing the noise while keeping the logs needed to maintain the infrastructure.

There are three implementation approaches for sampling. With head-based sampling, the decision is made at log generation itself, at the source: wherever the microservices or infrastructure components run, a sampling mechanism drops, say, 95% of the debug data right there. With tail-based sampling, the decision is made after the trace is completed, at the ingestion stage; Uber implemented this and saved that $8 million in logging cost. And with level-based sampling, different rates are applied at different log levels, which is also one of the widely used approaches. Best practice is to always preserve error and fatal log levels, because those are what enable effective troubleshooting, keeping track of which services went down, alerting, and so on; never sample error and fatal logs. Debug- and info-level logs can be sampled at somewhere between 1% and 10%; start at 10% and work your way toward 1%. Also use consistent sampling methodologies and respect service boundaries for effective sampling. A minimal level-based sampling sketch follows below.
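Here is a minimal sketch of level-based, head-based sampling using Python's standard logging module. It always keeps warning, error, and fatal records and samples debug/info probabilistically; the 10% default mirrors the conservative starting rate mentioned in the talk, and the logger and key names are illustrative. A real deployment would more likely do this in the log shipper or client SDK.

```python
import logging
import random

class LevelSamplingFilter(logging.Filter):
    """Level-based sampling sketch: keep all WARNING-and-above records,
    sample DEBUG/INFO at a configurable rate."""

    def __init__(self, sample_rate: float = 0.10) -> None:
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        # Never drop warnings, errors, or fatals: they drive troubleshooting
        # and alerting.
        if record.levelno >= logging.WARNING:
            return True
        # Probabilistic sampling for low-value levels.
        return random.random() < self.sample_rate

handler = logging.StreamHandler()
handler.addFilter(LevelSamplingFilter(sample_rate=0.10))
logging.basicConfig(level=logging.DEBUG, handlers=[handler], force=True)

log = logging.getLogger("checkout-service")
log.debug("cache miss for key %s", "user:42")  # kept roughly 10% of the time
log.error("payment provider timeout")          # always kept
```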
Now the second strategy, which is noise filtering. As we said, out of all the logs generated, about 70% is just noise that is never used. Slack encountered a similar situation: health-check logs that were not meaningful to them kept accumulating through their logging pipeline, incurring both ingestion and storage cost. Slack implemented a Fluent Bit filtering mechanism that dropped all the unnecessary health-check data, which reduced their cost by about 15% with zero impact on day-to-day operations. The common noise sources are generally health-check endpoints (ping, health, or status endpoints), repeated connection timeouts that are not critical, successful authentication logs that you do not need to keep in full, and some service-to-service communication that you can drop rather than retain. Filtering is one of the most effective mechanisms because you simply drop the logs you do not need, and it works best when you can identify systematic noise sources and filter them out at the ingestion layer itself, so you are neither ingesting nor storing them, saving cost on both sides.

There is another strategy we can use. The previous two worked at the log generation and ingestion levels; this one works at the storage level. It is a three-tier storage strategy, classified into hot storage, warm storage, and cold storage. Hot storage is for roughly a week of logs: anything from today back seven days can sit in hot storage, for example in Elasticsearch, which gives better query and retrieval rates. Anything between seven and 30 days is queried less often, so we store it in warm storage, using standard S3 buckets, which balance cost and performance. After 30 days, most applications and enterprises do not need logging data to be readily available for day-to-day use, so we can move it to Glacier storage, where retention cost is very low. But be careful doing that, because Glacier retrieval speeds are very slow, so pulling that data back out is painful. Shopify implemented this approach and saved about 60% of their retention costs; they generally only needed logs beyond 30 days for compliance, which is why they retained them at all, and this dramatically reduced their logging retention cost. A minimal sketch of the warm-to-cold transition using an S3 lifecycle rule follows below.
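As a sketch of the warm-to-cold part of the three-tier idea, here is one way to express it as an S3 lifecycle rule via boto3: objects transition to Glacier after 30 days and expire after a year. The bucket name, prefix, and retention periods are illustrative assumptions, not values from the talk, and the hot tier (Elasticsearch) is outside the scope of this snippet.

```python
import boto3

# Warm-to-cold lifecycle sketch: logs in S3 move to Glacier after 30 days
# and are deleted after a compliance window of one year (both assumed).
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-log-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "warm-to-cold-log-retention",
                "Status": "Enabled",
                "Filter": {"Prefix": "app-logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```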
Then there is a fourth strategy, which shifts your logging cost into personnel cost. Razorpay, one of India's major payment gateways, opted for an open-source solution, Grafana Loki, instead of Splunk, and thereby saved around $200K: Splunk was costing them about $250,000, whereas the Grafana Loki implementation cost them about $50,000, because Loki's architecture is cost-effective, using S3 for storage. This comes at a cost from an operations perspective, though: you need subject-matter experts to implement the solution and maintain the open-source infrastructure, and you will not have enterprise application support or vendor support. That is the cost you incur, but it definitely reduces your logging bill.

Now let's look at a real-world case study from Uber. Uber had a challenge where around 250,000 Spark jobs run daily, generating around 200 TB of log data, and storing that data is very costly. So they came up with CLP, the Compressed Log Processor, which achieved a compression ratio of roughly 169 to 1, reducing their costs from about $1.8 million to around $10K annually, a roughly 99% reduction. By implementing this log compression mechanism they achieved a remarkable cost reduction, and CLP also lets them query the logs without decompressing them, so there was no deterioration in their querying and monitoring capability.

Another best practice for logging is structured logging. In the example on this slide, the left side shows an unstructured log of a user trying to sign up for a new account with an email address that is already registered with that service. The right side shows the same event as a structured JSON log containing all the metadata and timing information. The structured version is much easier to query, and the more structured our logs are, the better from both an operations and a storage perspective, because structure also helps compression. At the bottom of the slide you can see what structured logging provides: it enables precise filtering, because you can target specific metadata fields; it improves compression because of how the data is organized; and it enhances query performance, since beyond your regular queries you can use jq and other tools over the richer metadata and retrieve data more efficiently. It also paves the way for better automation. A minimal JSON-logging sketch follows below.

To measure optimization success, you need financial metrics, operational metrics, and volume metrics. Financial metrics cover your monthly ingestion cost, storage cost per tier, and total GB ingested, because, as we discussed earlier, clouds charge at the per-GB ingestion level as well as at the storage level. On the operations side, while you implement these strategies you always have to watch how fast your query responses are, because that is what matters when troubleshooting or investigating a misbehaving service, and you have to make sure the logs you need are still available and that, with all the sampling you are doing, you can still detect issues effectively. Volume metrics include daily volume trends, how much is filtered versus retained, and the storage cost you incur as a result. The key is to establish baselines before implementing these changes and then track improvements over time.
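Here is a minimal sketch of emitting structured JSON logs with Python's standard logging module, in the spirit of the slide's right-hand example. The field names and the signup event are illustrative assumptions, not the slide's exact schema.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (structured-logging sketch)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge structured context supplied via extra={"context": {...}}.
        context = getattr(record, "context", None)
        if isinstance(context, dict):
            payload.update(context)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("signup-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.warning(
    "signup rejected: email already registered",
    extra={"context": {"event": "signup_rejected", "reason": "duplicate_email"}},
)
```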
With all these strategies, filtering, sampling and so on, there are some common pitfalls, so let's look at how to avoid them. Anything that is overly aggressive will cause issues. Over-aggressive filtering has repercussions: you might lose critical debug information. While sampling and filtering, always start conservatively, then increase gradually based on what you see, and always preserve error and fatal level logs: do not filter them out or sample them, retain them completely. Then there is sampling bias: because of how the sampling mechanism is implemented, you might miss some of the infrequent events that show up in your logs. The solution is to use trace-aware sampling mechanisms, and you can also implement a safeguard that stops sampling when an alert fires; if you see an alert, stop sampling. You can implement triggers to do that. Next, cold-storage query performance: retrieving data from cold storage is really painful, and it is not easy to write queries that pull data back from cold storage. So design your queries so that they do not touch cold storage and mostly use your hot and warm tiers. You can also implement rehydration workflows that move cold-storage data into warm or hot storage when you need better query speeds.

Now let's look at a cost optimization roadmap: given all these strategies, how can they be implemented in your organization? First, assess and analyze. Immediately, you can audit your log volume to see how much you are generating, isolate the five to ten log sources that produce the highest volume, calculate how much you are spending on those logs, and set up dashboards (a sketch of such a volume audit follows at the end of this transcript). Then look for quick wins and implement some of the strategies we discussed, such as filtering or sampling, and monitor the results. After about 30 days you can start implementing health-check filtering, log sampling, and, for storage needs, three-tier storage; we also looked at open-source logging stacks. In the long run you can pursue strategic initiatives like the open-source alternatives and advanced compression techniques we discussed (as Uber did with CLP), add log-based metrics and alerting, and bring in training for cost optimization to build a culture around it in your enterprise.

That concludes how we can save on logging. There are other useful tools and resources you can look at: for example Datadog's cost management features, as well as open-source applications like Grafana, the ELK stack, and so on, and there is the open-source CLP if you want to implement it alongside the vendor-specific options you can see in Datadog. Uber also maintains a good trove of CLP technical papers and blog posts where you can read how they implemented it successfully. Cloud service providers like AWS and GCP publish best-practice white papers as well. There are also conferences like SREcon, LISA, and even Conf42, as we are doing today, that talk about these situations and raise awareness. And there are cloud-native communities, like the CNCF, or your internal communities, where you can discuss and find better approaches. That's it for today, and thank you.
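As an appendix to the roadmap's assess-and-analyze step, here is a minimal sketch of a log-volume audit. It assumes the logs are ingested into AWS CloudWatch Logs, which is an assumption on my part (the talk only mentions AWS in general), and ranks log groups by bytes ingested over the last 30 days.

```python
from datetime import datetime, timedelta, timezone

import boto3

# Rank CloudWatch log groups by ingested bytes over the last 30 days
# using the AWS/Logs IncomingBytes metric.
logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

end = datetime.now(timezone.utc)
start = end - timedelta(days=30)

totals = {}
# Note: describe_log_groups is paginated; a full audit would follow nextToken.
for group in logs.describe_log_groups()["logGroups"]:
    name = group["logGroupName"]
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Logs",
        MetricName="IncomingBytes",
        Dimensions=[{"Name": "LogGroupName", "Value": name}],
        StartTime=start,
        EndTime=end,
        Period=86400,          # one datapoint per day
        Statistics=["Sum"],
    )
    totals[name] = sum(point["Sum"] for point in stats["Datapoints"])

# Print the top 10 log groups by ingested volume (GB).
for name, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{name}: {total / 1e9:.1f} GB ingested in 30 days")
```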

Venkata Madhu Prateek Reddy Kambala

Senior DevOps Engineer @ Synapsis

Venkata Madhu Prateek Reddy Kambala's LinkedIn account


