Conf42 Observability 2025 - Online

- premiere 5PM GMT

Next-Gen Observability: Leveraging AI and Data Pipelines to Reduce Cost and MTTR

Abstract

Next-generation observability pipelines, powered by AI, are transforming how organizations manage complex systems by dramatically reducing both operational costs and MTTR. They filter out noise and self-heal systems, with less human involvement in incident management.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
All right, hi. Welcome, everyone. Today we are going to talk about next-generation observability: leveraging AI and data pipelines. I'm Sarav, Director of Cloud Infrastructure and Observability at Informatica, and I have Kirti with me, who is an infrastructure architect at Informatica. The agenda for today is to establish the problem statement: the problems we had at Informatica with our existing telemetry pipeline, our next-generation telemetry pipeline, and how we embedded AI into each part of that pipeline to reduce cost and bring better MTTR.

So, Informatica's infrastructure. We operate on four different cloud providers, the hyperscalers Amazon, Azure, GCP, and Oracle, spread across 20 different regions, with more than 600 Kubernetes clusters and several thousand worker nodes and virtual machines, ingesting about 60 terabytes of logs every day and 3.8 trillion documents every month. If you look at our storage, it is about 15 petabytes on S3, and on the EC2 virtual machines about 2.5 petabytes. Given the scale we operate at, and given that we operate across different clouds and regions, that brings its own challenges in cost, compliance, and operations. Logs come from different regions and availability zones, get consolidated into one centralized message queue (currently Kafka), are ingested into the data store, and are visualized using multiple tools such as Kibana, Grafana, and a homegrown analytics tool. With the amount of data we operate on, this increased cost and operational expense and slowed down dashboards.

That problem made us rethink our existing observability pipeline and come up with a new one, bringing a GenAI and ML flavor into each stage. We started at the source: we consolidated collection using OpenTelemetry and brought filters and ML models to the source, so noisy, unwanted data is decluttered at the source itself. That gave us a 20% reduction right at the source, and moving the data to the centralized message queue through a better optimized pipeline produced another 50% saving before ingestion into the data store. By reducing noise at the source, more on the left, and applying AI and ML models, we get better control over the data and cleaner compliance results. And because ingestion is reduced at the source, storage and backup costs drop as well, and dashboards are available to end users almost instantaneously.
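To make that source-side filtering concrete, here is a minimal, illustrative sketch of the kind of drop rule that can run next to the collector before anything crosses the network. The patterns and function names are assumptions made up for this example, not Informatica's actual configuration; in a real deployment rules like these would typically live in the OpenTelemetry agent's processing configuration rather than in application code.

```python
import re

# Hypothetical noise patterns -- illustrative only, not the real rule set.
NOISE_PATTERNS = [
    re.compile(r"GET /healthz"),            # readiness/liveness probe chatter
    re.compile(r'"level"\s*:\s*"DEBUG"'),   # debug-level logs
    re.compile(r"connection pool stats"),   # periodic housekeeping output
]

def should_forward(log_line: str) -> bool:
    """Return False for lines that add no troubleshooting value."""
    return not any(p.search(log_line) for p in NOISE_PATTERNS)

def filter_batch(raw_lines: list[str]) -> list[str]:
    """Drop noise at the source so it never incurs network or storage cost."""
    return [line for line in raw_lines if should_forward(line)]

if __name__ == "__main__":
    batch = [
        '{"level":"INFO","msg":"order 123 created"}',
        '{"level":"DEBUG","msg":"cache miss for key abc"}',
        '{"level":"INFO","msg":"GET /healthz 200"}',
    ]
    kept = filter_batch(batch)
    print(f"forwarding {len(kept)} of {len(batch)} lines")  # forwarding 1 of 3 lines
```

Filtering of this kind, pushed into the OpenTelemetry agents, is what makes the roughly 20% source-side reduction described above possible before the data ever reaches Kafka.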
With that, I would like to have Kirti join me to go a little deeper into the architectural pieces of this telemetry pipeline. Kirti?

Hey, thank you, Sarav, for the overview. As Sarav mentioned, we have been dealing with a huge infrastructure that ingests a lot of telemetry data; it is massive, and we have to operate at massive scale. So to get control over this telemetry data, we implemented something called a telemetry pipeline. We first started with a centralized telemetry pipeline, because observability is very important to every piece of the application and infrastructure; it is used by application and security teams for every purpose. And in today's world, where there is a lot of vibe coding and AI is generating your code, it is very important to make sure your observability is in the right shape, so that when there is an incident you have the right set of data to troubleshoot.

Since our system is getting more and more telemetry data, we put this telemetry pipeline in place to make sure we ingest only the actionable data we need and want to store for a longer time. By adding a centralized telemetry pipeline we were able to reduce our logs, but what we realized is that the majority of the cost was coming from the collectors that gather the logs and send them to the centralized system. So we started using an OpenTelemetry-based agent where we can add filters to filter out the noise at the source, so we don't have to send it through the network and incur network transfer or NAT gateway costs.

Once the data comes into our centralized telemetry pipeline, we massage it: how we want to send it and how we want to store it. For example, people often send a lot of logs that they are really using as metrics. Logs are quite expensive, so with the telemetry pipeline we can convert them to metrics and store them in the metrics system instead of the logging system. Similarly, there are a lot of health checks and similar signals where we may not need all the raw data coming into our system, only a summarized version of the information that is useful for us. In that case we use the telemetry pipeline to downsample the amount of telemetry coming in and write less data to the backend observability system, which is more useful.

With this, we reduced the telemetry noise and got better control over the data. For example, if a lot of logs carry some kind of PII, we have a central place to manage that and add masking on top of it before it is written to long-term storage. And with less data coming into the system, our dashboards run faster, the cardinality of the telemetry goes down, and search gets quicker, which gives a good user experience.
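Two of the transforms described above, turning chatty status logs into metrics and masking PII before long-term storage, are easy to picture with a short sketch. The field names, masking rule, and aggregation key below are illustrative assumptions, not the team's actual pipeline configuration.

```python
import re
from collections import Counter

# Hypothetical masking rule -- real pipelines keep rules like this in one central place.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def mask_pii(record: dict) -> dict:
    """Redact obvious PII before the record is written to long-term storage."""
    masked = dict(record)
    masked["message"] = EMAIL_RE.sub("<redacted>", masked.get("message", ""))
    return masked

def logs_to_metrics(records: list[dict]) -> Counter:
    """Collapse verbose per-request logs into counters keyed by (service, status).

    Storing one counter per interval is far cheaper than keeping every raw line.
    """
    counts: Counter = Counter()
    for r in records:
        counts[(r["service"], r["status"])] += 1
    return counts

if __name__ == "__main__":
    raw = [
        {"service": "checkout", "status": 200, "message": "order placed by a@b.com"},
        {"service": "checkout", "status": 200, "message": "order placed by c@d.com"},
        {"service": "checkout", "status": 500, "message": "payment provider timeout"},
    ]
    cleaned = [mask_pii(r) for r in raw]
    print(logs_to_metrics(cleaned))
    # Counter({('checkout', 200): 2, ('checkout', 500): 1})
```

The same trade-off drives the downsampling of health checks: keep the summary that answers questions, drop the raw lines that only add volume and cardinality.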
So this gave us more control over the data, but it also introduced more human effort: creating the telemetry pipelines, maintaining them, and keeping track of all the different patterns coming into the system. So we leveraged AI to solve that problem. We added an AI agent to our telemetry pipeline that takes a natural-language request and knows the available transforms and the pipeline query languages. It can create new pipelines, manage existing ones, and, based on the structure or pattern of the data, recommend optimizations you can act on. At the same time, we are not handing pipeline creation over to AI agents completely; we keep a human in the loop to review and approve the suggestions the AI agent makes, and only once a human approves does the agent go and deploy them. That way, fewer people are required to build and maintain these pipelines, and with less data we are able to build a better observability system on top of it.

This is how we operate. We have a single unified OpenTelemetry-based agent that collects all the different telemetry signals, and these agents are managed through fleet management, so we don't have to go to each location for configuration changes; that keeps configuration changes easy as well. All our applications are instrumented with APM, and that instrumentation propagates context across all the API calls traveling from service to service. With that we can build service maps and correlations between services, which removes a lot of toil when investigating an issue: you no longer need an expert on every investigation call to walk through how a service is behaving and how the services are correlated, when service A is calling service B or service C. We have also implemented ML models to detect anomalies in the system, so we can alert when, say, API response times are high or errors have increased in a particular service.

Even so, all of this required massive human effort to maintain, and that is where we implemented our AI assistant. This is another AI agent: you don't have to understand the different query languages required to build your dashboards or pull the data you need when investigating an incident. You just go to the assistant and ask for what you need in simple natural language. It converts that into the query language the observability system understands, runs it against the observability stack, correlates your traces, metrics, and logs, and gives you basic visualizations that help you understand what is going on in the system. It can also automate some of the runbooks, running regular maintenance tasks that would otherwise require human effort.

So this is how we have leveraged AI in our observability: to reduce the human effort required to build and maintain these systems, while making sure the cost of the observability stack stays within range and the user experience is good. And during incidents, whenever people need specific data, they can get it easily instead of depending on others or searching for which dashboard to open or which query to run, which is also reducing the MTTR of our incidents. With that, I will conclude here. Thank you, everyone, for tuning in. Thanks, everyone. Thank you, Kirti.
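As a rough illustration of the anomaly detection mentioned in the talk, here is a minimal sketch using a rolling z-score over API latency samples. It stands in for whatever models the team actually runs; the window size and threshold are arbitrary choices for the example.

```python
from collections import deque
from statistics import mean, stdev

class LatencyAnomalyDetector:
    """Rolling z-score detector: a simple stand-in for ML models that
    flag unusual API response times or error-rate spikes."""

    def __init__(self, window: int = 60, threshold: float = 3.0):
        self.samples = deque(maxlen=window)  # recent latency samples (ms)
        self.threshold = threshold           # z-score that counts as anomalous

    def observe(self, latency_ms: float) -> bool:
        """Record a sample; return True if it looks anomalous versus recent history."""
        anomalous = False
        if len(self.samples) >= 30:          # wait for enough history first
            mu, sigma = mean(self.samples), stdev(self.samples)
            if sigma > 0 and (latency_ms - mu) / sigma > self.threshold:
                anomalous = True
        self.samples.append(latency_ms)
        return anomalous

if __name__ == "__main__":
    detector = LatencyAnomalyDetector()
    traffic = [100, 102, 98, 101] * 10 + [450]   # steady latency, then a spike
    for i, sample in enumerate(traffic):
        if detector.observe(sample):
            print(f"sample {i}: {sample} ms flagged as anomalous")
    # sample 40: 450 ms flagged as anomalous
```

An alert from a check like this is the kind of signal the AI assistant can then enrich by pulling in the correlated traces, metrics, and logs.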
...

Kirti Ranjan Parida

DevOps Architect @ Informatica

Kirti Ranjan Parida's LinkedIn account

Sarav Jagadeesan

Director of Platform Engineering & Infrastructure @ Informatica

Sarav Jagadeesan's LinkedIn account


