Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
I'm GaN.
I have 15-plus years of experience in designing and implementing
scalable distributed systems in multi-cloud architectures.
Today we are diving into observability at scale: optimizing multi-cloud
data platforms for modern enterprises.
Modern enterprises rely on resilient, observable data platforms across multi-cloud environments.
I am excited to share how AIOps and FinOps are transforming operations,
boosting resilience, and optimizing cost.
Let's get started.
We are in a world where 94 percent of enterprises leverage
cloud services, and 70 percent use multiple clouds to
avoid vendor lock-in.
But here's the catch: achieving 99.999% uptime across these
environments is tough. Fragmented tools,
inconsistent metrics, and siloed teams create blind spots.
Today we will explore how observability addresses these challenges to deliver
reliable, cost-effective platforms.
Multicloud offers incredible opportunities.
You can cherry-pick the best features from AWS, GCP, or Azure, tap into specialized
capabilities, and avoid vendor lock-in.
But the challenges are real.
65 percent of enterprises struggle with fragmented visibility, inconsistent
metrics, and disparate monitoring tools.
This leads to reactive operations and costly inefficiencies.
Observability changes this narrative by providing unified insights.
So what exactly is multi-cloud observability?
It's about unified visibility: monitoring all your cloud
environments simultaneously. It enables proactive detection, catching
issues early for faster resolution.
It provides data correlation, linking metrics, logs, and traces across providers,
platforms, and customer usage, and standardization, normalizing
diverse data into consistent formats.
This is the foundation for operational excellence.
Do we have a single tool we can just buy to have this problem fixed?
No. Observability isn't just a tool; it's a critical engine of value.
It accelerates product delivery and enhances service reliability.
It optimizes resource allocation and utilization across all your platforms,
makes scaling efficient, and provides 360-degree insights from
unified metrics, logs, and traces.
We have seen observability empower teams to diagnose issues quickly,
predict failures, and make data-driven decisions that impact the bottom line.
Okay, so how do you build the foundation?
Step one is to start by creating a unified platform for consistent data collection.
You should define and document all your cloud resources and then
deploy a centralized system to aggregate metrics, logs, and traces.
This strategy ensures you have a single source of truth, making it
easier to monitor and optimize across AWS, GCP, Azure, or any other cloud.
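As a minimal sketch of that centralized aggregation step, the Python below normalizes metric records from different clouds into one consistent schema. The provider field names here are invented for illustration, not any provider's actual export format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Metric:
    """Unified schema every provider's telemetry is normalized into."""
    cloud: str
    service: str
    name: str
    value: float
    timestamp: datetime

def normalize(record: dict) -> Metric:
    """Map provider-specific field names (assumed here) onto the unified schema."""
    if record["source"] == "aws":
        return Metric("aws", record["Namespace"], record["MetricName"],
                      record["Value"], record["Timestamp"])
    if record["source"] == "gcp":
        return Metric("gcp", record["resource_type"], record["metric_type"],
                      record["value"], record["time"])
    if record["source"] == "azure":
        return Metric("azure", record["resourceId"], record["metric"],
                      record["average"], record["timeStamp"])
    raise ValueError(f"unknown source: {record['source']}")

# Example: records from different clouds land in the same single source of truth.
records = [
    {"source": "aws", "Namespace": "rds", "MetricName": "cpu_utilization",
     "Value": 71.0, "Timestamp": datetime.now(timezone.utc)},
    {"source": "gcp", "resource_type": "bigquery", "metric_type": "slot_usage",
     "value": 340.0, "time": datetime.now(timezone.utc)},
]
for m in (normalize(r) for r in records):
    print(m.cloud, m.service, m.name, m.value)
```

Once every record shares one shape, the same dashboards, alerts, and cost joins work regardless of which cloud produced the data.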
AI-driven operations is a game changer.
Our platforms are able to detect incidents 67% faster, reducing mean time to detection.
Advanced ML can pinpoint root causes instantly, which cuts
resolution times by 51 percent.
Anomaly detection prevents 78 percent of
incidents before they happen.
Through efficient scaling mechanisms, this shifts teams from
firefighting to strategic optimization, boosting reliability, and cutting costs.
Let's talk features.
Smart alerting reduces alert fatigue with dynamic thresholds and context-aware
notifications. How do you do it?
For example, say you have a latency-based alert that fires when latency increases by 20%.
There can be a lot of false positives: the system may be scaling
at that moment, the load may be an unusual but expected customer
pattern, or an underlying issue in some subsystem may already be
getting fixed automatically.
When you add all of this context,
and you make sure the alerts only fire when the context is very clear
that something is going beyond what the system can auto-resolve,
the alerts become more meaningful and they drive
the right actions as needed.
That is how we implemented smart alerting, to make sure the
alerts don't hurt our team's efficiency.
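Here is a rough Python sketch of that idea: the latency threshold only raises an alert when none of the known auto-resolving contexts explain the increase. The 20% threshold mirrors the example above; the context flags and numbers are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class AlertContext:
    """Signals that often explain a latency spike without human action."""
    autoscaling_in_progress: bool
    known_traffic_spike: bool      # e.g. a tenant's scheduled batch load
    subsystem_self_healing: bool   # a dependent subsystem already recovering

def should_alert(baseline_ms: float, current_ms: float,
                 ctx: AlertContext, threshold_pct: float = 20.0) -> bool:
    """Alert only when latency exceeds the threshold AND no context explains it."""
    increase_pct = (current_ms - baseline_ms) / baseline_ms * 100
    if increase_pct < threshold_pct:
        return False
    # Suppress while the platform is expected to auto-resolve the condition.
    if (ctx.autoscaling_in_progress or ctx.known_traffic_spike
            or ctx.subsystem_self_healing):
        return False
    return True

# During a scale-out the same 35% jump stays silent; afterwards it pages someone.
scaling = AlertContext(autoscaling_in_progress=True, known_traffic_spike=False,
                       subsystem_self_healing=False)
steady = AlertContext(False, False, False)
print(should_alert(200.0, 270.0, scaling))  # False - context explains the spike
print(should_alert(200.0, 270.0, steady))   # True  - needs attention
```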
The next important thing is anomaly detection.
As time goes by and different types of customers and usage patterns come in,
anomaly detection becomes more important because it identifies the unusual
patterns and how they affect the system.
When this anomaly detection runs periodically, you are able to
handle these different types of patterns in a more efficient way,
and it also helps on the cost side.
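As one simple way to run that periodic check, the sketch below flags a usage sample as anomalous when it falls far outside the rolling statistics of recent samples. Real platforms would use richer models; this is only to make the idea concrete, and the window sizes are assumptions.

```python
import statistics
from collections import deque

class RollingAnomalyDetector:
    """Flags values more than `z_limit` standard deviations from the recent mean."""
    def __init__(self, window: int = 48, z_limit: float = 3.0):
        self.history = deque(maxlen=window)   # e.g. 48 five-minute samples = 4 hours
        self.z_limit = z_limit

    def observe(self, value: float) -> bool:
        anomalous = False
        if len(self.history) >= 12:           # need some history before judging
            mean = statistics.mean(self.history)
            stdev = statistics.pstdev(self.history) or 1e-9
            anomalous = abs(value - mean) / stdev > self.z_limit
        self.history.append(value)
        return anomalous

detector = RollingAnomalyDetector()
for v in (100 + (i % 5) for i in range(40)):  # a regular tenant usage pattern
    detector.observe(v)
print(detector.observe(103))   # False - within the usual pattern
print(detector.observe(480))   # True  - unusual spike worth investigating
```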
The third one is predictive analytics.
With all this unified data and alerting
in a single place, you gain the
ability to forecast resource needs and any
potential failures that could happen.
How do you do it?
When we monitor and aggregate the data every five minutes and generate
alerts based on all this context,
and you collect this data over a period of time, you can
feed it to a predictive model.
It can then predict the next five minutes based on the historical trend.
This way you are able to really understand your various usage patterns,
how the system behaves under them, and what should be done or what will
happen next, so that you can scale the resources or generate an alert.
That way people understand there can be a potential problem
whenever the system cannot fix it automatically.
That is how predictive analytics helps you
catch these types of scenarios.
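As a toy illustration of feeding those five-minute aggregates to a predictive model, the sketch below fits a simple linear trend to recent history and forecasts the next interval, raising a pre-emptive warning when the forecast crosses a capacity limit. Production systems would use proper time-series models; the numbers and the limit here are made up.

```python
def forecast_next(samples: list[float]) -> float:
    """Least-squares linear trend over recent samples, extrapolated one step ahead."""
    n = len(samples)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(samples) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, samples)) / \
            sum((x - x_mean) ** 2 for x in xs)
    intercept = y_mean - slope * x_mean
    return intercept + slope * n        # value predicted for the next 5-minute window

# Last hour of five-minute CPU aggregates, trending upward.
cpu_history = [52, 55, 57, 61, 63, 66, 70, 72, 75, 79, 82, 85]
predicted = forecast_next(cpu_history)
CAPACITY_LIMIT = 85.0
print(f"predicted next interval: {predicted:.1f}%")
if predicted > CAPACITY_LIMIT:
    print("pre-emptive action: scale out or alert before the limit is hit")
```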
And finally, it becomes your correlation engine.
In your unified system, you should be able to understand all the different
signals coming in: in a multi-tenant case, the tenant-level
signals, the platform-level signals, each subsystem's
signals, the cloud-level signals, and, going one step further,
the cost-level signals.
Everything together in one engine lets you troubleshoot
efficiently and optimize the subsystems that have the biggest
impact on the final outcomes.
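A minimal Python sketch of such a correlation step: events from different signal layers are grouped by tenant and time window, so one incident can be viewed across platform, subsystem, and cost signals together. The layer names and events are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def window_key(ts: datetime, minutes: int = 5) -> datetime:
    """Bucket a timestamp into its five-minute window."""
    return ts - timedelta(minutes=ts.minute % minutes,
                          seconds=ts.second, microseconds=ts.microsecond)

def correlate(events: list[dict]) -> dict:
    """Group events from every signal layer by (tenant, five-minute window)."""
    grouped = defaultdict(list)
    for e in events:
        grouped[(e["tenant"], window_key(e["ts"]))].append(e)
    return grouped

t = datetime(2024, 1, 10, 14, 2)
events = [
    {"tenant": "acme", "layer": "platform",  "signal": "query_latency_high",  "ts": t},
    {"tenant": "acme", "layer": "subsystem", "signal": "cache_hit_rate_drop", "ts": t + timedelta(minutes=1)},
    {"tenant": "acme", "layer": "cost",      "signal": "compute_spend_spike", "ts": t + timedelta(minutes=2)},
]
for (tenant, window), related in correlate(events).items():
    print(tenant, window.isoformat(),
          [e["layer"] + ":" + e["signal"] for e in related])
```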
One of the biggest wins: with AIOps, our teams are able to
manage 3.4 times more infrastructure without additional staff.
Automation handles the routine tasks, freeing engineers to
focus on the next innovation.
It provides compounding results whether you are a small,
medium, or large organization.
AIOps enables efficient operations at scale.
Now let's integrate FinOps with observability and AIOps.
FinOps gives you comprehensive cost visibility in real time, alongside
the reasons why the cost occurred.
It enables data-driven forecasting for better budgeting.
You are now able to correlate the cost with real usage and see what
type of usage drives the cost.
So now we have a very clear formula to create a budget from the
usage trend and the cost trend,
based on the type of usage you expect from your various customers.
Intelligent rightsizing saves significant costs, and precise cost
allocation drives business accountability,
so you can focus and get the maximum value.
That's why having a clear, unified observability system
incorporating all these data points into one drives multiple benefits,
including on the cost and budgeting side.
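To make the cost-to-usage correlation concrete, here is a small sketch that joins daily spend with a usage driver per tenant and derives a unit cost plus a naive budget forecast. The tenants, figures, and growth factors are invented.

```python
# Daily spend and a usage driver per tenant (illustrative numbers only).
spend   = {"acme": 420.0, "globex": 910.0}        # USD per day
queries = {"acme": 1_400_000, "globex": 2_100_000}

# Unit cost: which type of usage drives how much of the bill.
unit_cost = {t: spend[t] / queries[t] for t in spend}
for tenant, cpq in unit_cost.items():
    print(f"{tenant}: ${cpq * 1000:.3f} per 1k queries")

# Naive next-month budget: current spend scaled by forecast usage growth.
expected_growth = {"acme": 1.15, "globex": 0.95}  # from the usage trend
budget = {t: spend[t] * 30 * expected_growth[t] for t in spend}
print("next-month budget by tenant:", {t: round(v, 2) for t, v in budget.items()})
```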
The results speak for themselves.
We were able to achieve a 28 to 37% reduction in cloud expenditure through strategic
resource placement and rightsizing.
Automated remediation eliminates idle resources,
boosting computing efficiency by 61%.
Underutilized instances drop by 43%, and precise cost attribution
enables more accurate budget forecasting, aligning spending with business goals.
These efforts also translate to tangible business impact.
We accelerated time to market by three times through operational
efficiency, improved service reliability by
41%, and enhanced availability, which has an impact on customer
satisfaction, and we were able to free up more resources to do innovative
things with AI and other technologies.
So that really had a great impact.
We now understand what AIOps and FinOps are, how to put everything together in observability
for a multi-cloud architecture, and how much benefit you can get out of it.
Let's see how to make this work.
Step one is to think observability-first architecture.
When designing any system,
you should first think: how many data points am I going to get? Are they
aligned to my standards? How will this be visualized in my unified system?
How will this tie to the cost, and how does this impact customer
usage and behavior?
So step one is a clear strategy.
You define your goals for observability
and, on top of them, the KPIs for all your systems, and implement comprehensive
instrumentation for telemetry collection.
On top of this, you should have a strategy for applying AI, ML, and automation;
not everything can be done manually.
Intelligent systems automatically extract insights and automate
your alerts and other activities.
Next, add the FinOps cost data; that will let you justify which
type of optimization drives how much cost savings.
It will also help you understand how user behavior
drives the cost across different usage.
This framework embeds observability from the start, creating
sustainable and efficient platforms.
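One way to make "define your goals and KPIs" tangible is to keep them as data the platform evaluates continuously, as in the sketch below. The KPI names and targets are examples, not prescriptions from the talk.

```python
# Observability goals expressed as machine-checkable KPIs (example targets).
KPIS = {
    "mean_time_to_detect_minutes": {"target": 5.0,  "lower_is_better": True},
    "alert_false_positive_rate":   {"target": 0.10, "lower_is_better": True},
    "telemetry_coverage_pct":      {"target": 95.0, "lower_is_better": False},
}

def evaluate(measured: dict) -> list[str]:
    """Return the KPIs that currently miss their targets."""
    misses = []
    for name, spec in KPIS.items():
        value = measured[name]
        ok = (value <= spec["target"] if spec["lower_is_better"]
              else value >= spec["target"])
        if not ok:
            misses.append(f"{name}: {value} vs target {spec['target']}")
    return misses

print(evaluate({"mean_time_to_detect_minutes": 7.2,
                "alert_false_positive_rate": 0.06,
                "telemetry_coverage_pct": 88.0}))
```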
Now let's look at a few best practices which are highly
important and which you cannot skip
when you start on this journey.
The first one is centralized collection of this data,
so that you are able to understand your systems, customers, and
cost in a single source of truth, and implement standardized
tagging across all these clouds.
The important thing is not the number of tags, but the type of value you
add and how you manage those values, so that everybody is able to understand
the different types of applications and the purpose all these systems serve.
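A small sketch of enforcing such a tagging standard: every resource, regardless of cloud, must carry the same few meaningful tags so cost and usage can be attributed consistently. The required tag names below are an example convention, not an established standard.

```python
# Example tag convention shared across clouds (an assumption, not a standard).
REQUIRED_TAGS = {"application", "environment", "cost_center", "owner"}

def missing_tags(resource: dict) -> set[str]:
    """Return which required tags a resource is missing or left empty."""
    tags = resource.get("tags", {})
    return {t for t in REQUIRED_TAGS if not tags.get(t)}

resources = [
    {"cloud": "aws", "id": "i-0abc",
     "tags": {"application": "ingest", "environment": "prod",
              "cost_center": "data-platform", "owner": "team-a"}},
    {"cloud": "azure", "id": "vm-17",
     "tags": {"application": "reporting"}},
]
for r in resources:
    gaps = missing_tags(r)
    if gaps:
        print(f"{r['cloud']}:{r['id']} missing tags: {sorted(gaps)}")
```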
Next, you should adopt observability as code to deploy monitoring alongside infrastructure.
It shouldn't be an independent effort.
You should make sure that when the system is designed, when we implement
the code, and when we deploy the systems, observability is part of that code.
You should create a framework that spans from the time the design starts
to the time it gets deployed, and align with the business on how much value
each piece of work drives.
In concrete financial terms, you should be able to prove it.
These practices will make observability scalable and very
impactful in the organization.
Are you ready to start?
Let me give you a high-level action plan for
how you can approach this.
Of course, every system has some level of observability.
So the first thing is to understand the current state.
Most of your systems are already built and you are adding things
on top, so in step one, make sure you are not
thinking of the ideal state immediately.
Just assess the current state.
Understand how mature this observability is, and
find the critical gaps.
Then document all the types of tools and systems currently
being used and what baseline performance you expect.
Then you find the opportunities for enhancement, once you have identified
the current state and the requirements you expect to achieve.
Now think of the target architecture.
Each application and system is different,
so I'm not going to give you a blueprint to implement.
But the thinking should be based on the different types of cloud services
and applications you deploy and the different workloads you run.
You should have a target architecture defined properly, and then
you select the suitable tools.
The architecture should define how the data is collected
and what aggregation or alerting interval is sufficient for your business.
For example, we have five-minute aggregation of these alerts, and
of course we have real-time alerts for platform failures, but
on top of that, the more important thing in a
multi-cloud, multi-tenant architecture
is that we should be able to aggregate every five minutes to understand
how the systems are working.
All the health check alerts can happen within milliseconds or seconds,
but performance and similar alerts can work at the five-minute interval.
So you can define your goals and target architecture based on that.
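Here is a brief sketch of that interval split: health-check events are evaluated immediately, while performance samples are rolled up into five-minute aggregates before alert evaluation. The event shapes and the 250 ms threshold are illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def bucket(ts: datetime) -> datetime:
    """Floor a timestamp to its five-minute aggregation window."""
    return ts - timedelta(minutes=ts.minute % 5, seconds=ts.second,
                          microseconds=ts.microsecond)

def process(events: list[dict]) -> None:
    perf_windows = defaultdict(list)
    for e in events:
        if e["type"] == "health_check" and not e["healthy"]:
            # Immediate path: evaluated as soon as the event arrives.
            print("IMMEDIATE alert:", e["service"], e["ts"].isoformat())
        elif e["type"] == "latency_ms":
            perf_windows[(e["service"], bucket(e["ts"]))].append(e["value"])
    # Five-minute path: performance alerts fire on the aggregated window.
    for (service, window), values in perf_windows.items():
        avg = sum(values) / len(values)
        if avg > 250:
            print(f"5-min alert: {service} avg latency {avg:.0f}ms at {window.isoformat()}")

t = datetime(2024, 1, 10, 9, 1)
process([
    {"type": "health_check", "service": "api", "healthy": False, "ts": t},
    {"type": "latency_ms", "service": "query-engine", "value": 240, "ts": t},
    {"type": "latency_ms", "service": "query-engine", "value": 310,
     "ts": t + timedelta(minutes=2)},
])
```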
Then you create comprehensive instrumentation for the
metrics coming into your system.
Set proper governance and compliance standards so this
isn't implemented differently in each place.
Next, focus on the high-impact workloads first.
Because this is a fundamental change to the platform, you should not
try to solve all the problems at once.
Take the highest-impact one, implement it, and show the value:
how much money you're able to save, how efficient the scaling is.
That will build success for the next phase and earn support
in the larger context over the long term.
So take the high-impact workload first, implement it efficiently,
and show the business value out of it, so that you can iterate over it for other services.
That's how it should be approached whenever you go for this.
Yeah, we have come to the end of it.
In closing, I want to say that observability at scale is about transforming
multi-cloud operations from reactive to proactive, from costly to efficient.
By leveraging AIOps and FinOps, you can build resilient, cost-effective data
platforms that drive business success.
So don't treat cost and customer usage as separate concerns;
they shouldn't be considered in isolation.
We should integrate AIOps and FinOps into the observability platform and we should
be able to drive outcomes based on that.
With the cost and FinOps data, the operational side of the platform
benefits, and with the platform data, the cost side and forecasting improve.
We should think about how this can provide maximum business
value, not operate as siloed functions.
Thank you for your time.
I'm looking forward to hearing how you take this journey
forward, and to you sharing your experience with everybody else.
Thank you.