Conf42 Observability 2025 - Online

- Premiere: 5 PM GMT

Beyond Metrics: AI-Powered Observability in Microservices Architectures


Abstract

- The transformation from reactive to predictive observability
- Concrete metrics (73% MTTR reduction) that demonstrate business value
- Practical applications that attendees will learn
- A competitive advantage angle that will appeal to decision-makers


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, my greetings to you all. Since this conference can be viewed worldwide, across every time zone, let me say good morning, good afternoon, and good evening. I am Srinivas Mantrala, currently working at Fidelity Investments as a senior manager. I want to walk you through observability in microservices: how we evolved from the basic, traditional approach to the current situation, and, if we empower observability with AI-enabled tooling, how we can leverage it and what future mode of operation we can envision. That is the intent of my whole presentation. It is called AI-powered observability in microservices architecture: beyond metrics, from where we started to where we are landing now.

So let us get into the presentation. I will go very quickly through the introduction. Traditionally we had monolithic applications, and now we are moving toward microservices architectures. With monoliths we used traditional monitoring approaches, basically sets of threshold alerts reporting infrastructure health; that was the start of basic health monitoring. Now that we are moving into microservices and distributed systems, those traditional mechanisms are nowhere near able to meet today's needs, and so, just as we moved from one technology to another, operations management has evolved from one decade to the next. My point is this: we are now in the AI era, and if we enable observability with AI features, we can make it far better. Honestly, we are already leveraging some of this, but I want to call out, from my own experience, why observability is so important and how AI can help it even more in the future. I also want to lay out a framework we can build upon and keep practicing as new things come up, which will help us enormously on an ongoing basis. That is the intent of this whole exercise.

I will start with the evolution of observability. We initially began with traditional monitoring of monolithic applications: basic alerts such as "the database is down" or "it is back up". Administrators received those alerts and acted upon them; that is traditional monitoring. From there we evolved to the next stage, which is very important, so I will take a little more time on it: basic observability. Observability has three main pillars: metrics, logs, and traces. Metrics give a very high-level overview of the system: CPU usage, response times, error rates, memory usage, everything about the application and the infrastructure. Logs give a granular view of the system: they describe events, warnings, informational messages, and so on. Traces show how a request traverses the application and whether anything blocks it; they follow the requests coming into the system.
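To make the three pillars concrete, here is a minimal sketch of one request handler emitting all three, using the OpenTelemetry Python API. The service name, metric names, and handler are illustrative assumptions, not from the talk, and without a configured SDK and exporter these calls are no-ops.

```python
# Minimal sketch: one handler emitting metrics, logs, and traces.
# Names ("checkout-service", "/orders") are illustrative assumptions;
# a configured OpenTelemetry SDK + exporter is needed for real output.
import logging
import time

from opentelemetry import metrics, trace

tracer = trace.get_tracer("checkout-service")   # traces: follow each request
meter = metrics.get_meter("checkout-service")   # metrics: high-level numbers
logger = logging.getLogger("checkout-service")  # logs: granular events

request_counter = meter.create_counter(
    "http.requests", description="Total HTTP requests handled")
latency_hist = meter.create_histogram(
    "http.request.duration", unit="s", description="Request latency")

def handle_request(order_id: str) -> None:
    start = time.monotonic()
    with tracer.start_as_current_span("handle_request") as span:  # trace
        span.set_attribute("order.id", order_id)
        logger.info("processing order %s", order_id)              # log
        # ... business logic would run here ...
        request_counter.add(1, {"route": "/orders"})              # metric
        latency_hist.record(time.monotonic() - start, {"route": "/orders"})
```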
These are the three categories of data points we collect to understand the system, and they formed our initial, basic observability. We categorized the data into those three pillars and used them as our base for a very long time. With this initial observability the pattern was: you get an alert, you collect the metrics and logs, and then you take a reactive approach, deciding what needs to be done right now. That is where the industry used to be.

Then, over roughly the last seven to ten years, as cloud infrastructure came up, we got a much more enhanced version of observability. The cloud platforms put a strong focus on it, because they know it is a key factor for any cloud provider: if they cannot show consumers the health of the infrastructure, they will not sustain themselves in the market. So they enhanced observability to take the next step, not just collecting the data but applying pattern analysis to it: what is the peak usage time of a system, what are the low periods, what are the non-working hours, when can maintenance happen? That analysis helped us choose maintenance windows, understand peak hours, and manage resources accordingly. The reason I call this AI-enhanced observability is that it takes the collected data, analyzes it, and applies a predictive approach: this time of day is a very high peak, so how can we optimize resources for it, and which maintenance tasks should run outside peak hours? That kind of predictive analysis tells us where to schedule maintenance and where to increase resources, and that is the advantage of enhanced observability.

The next step I want to call out goes beyond manually increasing resources. Think about autoscaling, horizontal or vertical. In the cloud today, based on peak hours or on a season like Thanksgiving, when Amazon has enormous consumer traffic hitting its systems, they know they need to scale out at that moment and shrink back once it is done. That is a self-healing system: it takes the data, applies predictive analysis, and then applies the healing automatically. So where we started was simply collecting data, the traditional approach; where we are now is taking the data, applying AI enablement, and taking measures without any manual intervention. That is the real evolution of observability: from where we started, to where we are now, to where we can go even further.
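As a hedged illustration of that predictive, self-healing idea, here is a small sketch that forecasts the next window's load from recent samples and derives a replica count. The per-replica capacity, the bounds, and the naive trend forecast are all made-up assumptions; a real system would feed a proper model into the platform's autoscaler (for example a Kubernetes HPA) rather than hand-roll this.

```python
# Sketch of predictive scaling: forecast load, derive replica count.
# Capacity and bounds below are assumptions, not real tuning values.
import math
from statistics import mean

REQUESTS_PER_REPLICA = 500          # assumed per-instance capacity
MIN_REPLICAS, MAX_REPLICAS = 2, 20  # assumed scaling bounds

def forecast_load(history: list[float]) -> float:
    """Naive forecast: moving average of recent samples plus their trend."""
    recent = history[-6:]
    trend = (recent[-1] - recent[0]) / len(recent)
    return mean(recent) + trend * len(recent)  # project one window ahead

def desired_replicas(history: list[float]) -> int:
    predicted = max(forecast_load(history), 0.0)
    replicas = math.ceil(predicted / REQUESTS_PER_REPLICA)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, replicas))

# Rising traffic ahead of a seasonal peak -> scale out before it arrives.
print(desired_replicas([900, 1100, 1400, 1800, 2300, 2900]))  # 8
```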
Let me go to the next slide: the three pillars. As I called out, these are metrics, logs, and traces. Metrics are the high-level overview, largely CPU usage and other data about the infrastructure. Logs are the granular level: detailed usage information, including errors. Traces cover the request flow. Those are the basics. But if we had left these three pillars as they were, we could never have achieved where we are; we have really transformed them, applying frameworks and technical innovation on top of all three. For example, on metrics such as CPU usage we apply pattern recognition over historical data, so we know when to scale out and when not to. From logs we learn which types of errors keep appearing, which tells us the most frequent issues and lets us take proactive measures; for example, if many clients hit the database at once and it becomes unreachable, we can act on that before it recurs. From traces we see which features are used most frequently, so we can scale those microservices out further than the rarely used ones. This is intelligence applied on top of the three pillars, making them more mature, and it makes operations simpler and easier to manage: if we fight issues day in and day out we burn ourselves out, but applying intelligence makes life easier.

One of the most important things about observability is SLOs, and I want to share a recent experience here. We have technical metrics on one side and the business on the other, and unless we map the two, there will never be consensus. If I want to enhance observability with my knowledge and my implementation effort but cannot show business value from it, I cannot really justify the investment. So what I really want to sell is this: unless you correlate the technical metrics you capture with business KPIs, there will never be buy-in to invest seriously in observability. Observability comes with a cost, that cost needs to be approved by the business, and the business needs to see the value in what we are doing. That value can only be demonstrated by mapping business KPIs to what we measure. Take customer satisfaction: it comes down to latency, and unless you have data capturing the latency issues in your systems, you cannot gauge customer satisfaction. The same applies to capacity, and to revenue per transaction.
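As one way to make that mapping tangible, here is a small sketch of turning raw request counts, a purely technical metric, into an SLO error budget the business can reason about. The 99.9% target and the traffic figures are illustrative assumptions.

```python
# Sketch: translate technical metrics into a business-facing SLO signal.
# The 99.9% availability target and traffic figures are assumptions.

SLO_TARGET = 0.999  # 99.9% of requests over the window must succeed

def error_budget_remaining(total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative => SLO breached)."""
    allowed_failures = total_requests * (1 - SLO_TARGET)
    return 1 - failed_requests / allowed_failures

# Example: 10M requests with 7,000 failures leaves 30% of the budget,
# a number a product owner can weigh against feature-release risk.
print(error_budget_remaining(10_000_000, 7_000))  # 0.3
```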
Likewise, unless you know the revenue and cost per transaction, per customer, or per business order, you cannot really assess how your systems are being used; if an order takes a long time, it means we cannot keep up and cannot scale ourselves. So this combination is very important. To improve and sustain our observability practice over the long term we definitely need business approval, and the business will only approve if we show the value of what we are doing: how it helps the application and how it helps them from a sales point of view. Cloud providers do exactly this; in their pitch they will tell you how their observability practices help you, because otherwise you would never become their customer. In the same way, for any application, unless the metrics, logs, and traces are properly enhanced and visibly leveraged, customers will not come to our systems. That is the intent of SLOs: a precise mapping from business KPIs to the technical metrics, logs, and traces, so that the value benefit is clear. Only then can we sustain the practice of observability in our systems, because it comes with a cost, and cost cannot be approved without a business case.

Now, coming to the implementation framework: what I really want to call out here is building a mature observability framework. It cannot be done in a day; it needs to be built up in layers, like building blocks. You start by collecting the data. Then you apply a set of validation checks. Today you might apply observability checks such as collecting the logs, the metrics data, and the traces; but think about what happens as you go on and build new microservices. We need controls ensuring that observability checks remain part of ongoing development. For example, if a new microservice is deployed to production without a sanity check on its observability, we may end up with an issue. So the implementation required for observability must be applied as part of the build: in the CI/CD pipelines we can add checks such as Sonar code-coverage gates and integration tests; there are many kinds of testing we can run as a checklist to confirm these things were done, which helps us sustain the observability practice long term. Then comes intelligent analysis: applying machine-learning models, or AI algorithms, to leverage the collected data and perform predictive analysis. And finally AIOps: based on those predictions, making the system self-healing. That is the basic implementation of the framework, progressing step by step: each layer builds on the capabilities below it, creating an observability practice that evolves with your organization's needs. Based on those needs, we must ensure everything is in place and is maintained.
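To illustrate the intelligent-analysis layer in the simplest possible terms, here is a sketch that flags CPU samples deviating sharply from a rolling baseline. Real AIOps pipelines use far richer models (seasonality, multivariate signals); the window size and threshold here are arbitrary assumptions.

```python
# Sketch of the "intelligent analysis" layer: rolling z-score anomaly
# detection over a CPU series. Window and threshold are assumptions.
from statistics import mean, stdev

def detect_anomalies(samples: list[float], window: int = 12,
                     threshold: float = 3.0) -> list[int]:
    """Return indices of samples more than `threshold` std-devs off baseline."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma and abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

cpu = [41, 43, 40, 42, 44, 41, 43, 42, 40, 44, 42, 41, 95]  # sudden spike
print(detect_anomalies(cpu))  # [12]
```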
Even when a new application or new services are being built within the organization, we need to make it a process that every developer and every business unit applies these observability checks, so that the team maintaining operations at the organization level really has those controls in place. To give an analogous example from penetration testing: if we have no proper metrics confirming that pen-test mechanisms are in place, we may be prone to a vulnerability. These are the important things we need to build into our framework and then continue with, and that sustains the practice.

On tooling: there are many open-source observability systems, Grafana, OpenTelemetry, and plenty of others I could list. One of the tools I like most, and have been using personally, is Grafana. It gives excellent visualization of the underlying metrics, logs, and traces, along with predictive analysis, and it even provides suggestions about what you can do. Another favorable factor for me is the custom, per-user dashboards: I may be interested in metrics analysis, another person in log analysis, and each of us can build a dashboard focused on what we care about, which gives great visibility. It also has its own supporting query language, which is really valuable for querying the data you want to assess or for fetching historical data. So what I really want to say from this slide is: we started with the observability concept, and we are now at a stage where a vast set of tools has been built on top of it, each with its own capabilities. We just need to plug the right tool into our system based on our needs, customize it, and make sure the complete system is in place. Take the framework I talked about previously, decide what you really want to capture from your observability, and map that onto a tool; Grafana is open source, so you can adopt it, customize it, make it your system, and then practice and sustain that observability.
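To show what querying the historical data can look like programmatically, here is a hedged sketch against a Prometheus data source of the kind Grafana commonly sits on. The endpoint URL and the PromQL expression are assumptions, not details from the talk.

```python
# Sketch: fetch historical CPU series over Prometheus's HTTP API.
# The URL and query below are illustrative assumptions.
import time
import requests

PROM_URL = "http://prometheus.example.internal:9090"  # assumed endpoint

def fetch_cpu_history(hours: int = 24) -> list[tuple[float, str]]:
    """Return [timestamp, value] pairs for average container CPU usage."""
    end = time.time()
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={
            "query": "avg(rate(container_cpu_usage_seconds_total[5m]))",
            "start": end - hours * 3600,
            "end": end,
            "step": "300",  # one data point every 5 minutes
        },
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return result[0]["values"] if result else []
```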
The other important thing about observability is contextual alerting. Initially we got plain alerts, "the production database is going down", delivered by email. In the traditional setup people sat in front of the system in shifts, watching; that is no longer required, because with today's technology you get the alert on your mobile. But the advancement is not just delivery, and I can say this because I am experiencing it. The alert can now suggest what similar incident happened previously and what action was taken then. If the person on call gets an alert saying something has happened, the alert itself gives them a way to refer to the previous incident and see what was done. That is contextual alerting: you get an alert, and at the same time you get the history of previous, similar alerts to refer to, and you can even delegate the incident to the right person, someone who worked on a similar incident before, if they are available, for a faster resolution. Before, we did not know who had worked on a similar issue; now we have the history, and the system suggests: here is the previous occurrence, these steps were executed, this is what we can do, and this person is available, so let us reach out to them. All of this drives faster incident resolution.

Let me also touch on intelligent routing, which is just as important. Many people in the system monitor operations, but not every alert is for everyone; certain people have privileged experience, so the relevant alerts should go to them. We can control which alert goes where, so incident management happens with the right people at the right time, and the MTTR, the mean time to resolution, drops drastically. Where it used to take a day or two, just as an example, we can now resolve in hours. This all comes from applying intelligence to our lessons learned: where are our bottlenecks, where are our improvement areas, and how can we apply AI models to them? It keeps improving as we go; tomorrow, with robots and whatever comes next, the mechanisms may change again, but contextual alerting fundamentally gives you the history and the right people to delegate to.

So what are the outcomes? A reduction in incident response time and in MTTR; better predictive accuracy, because from the alert history you can predict what the issue is likely to be; and improved collaboration, because you know whom to reach and whom to delegate to. Instead of piling everything onto one person you delegate to the right people, so concurrent issues can be resolved as well.
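A toy sketch of that contextual-alerting and routing idea: match a new alert against past incidents by text similarity and surface the closest match's fix and resolver. The incident records, names, and the use of simple string similarity are illustrative assumptions; a production system would use embeddings or a trained classifier rather than difflib.

```python
# Sketch: suggest context and a resolver for a new alert from history.
# Incident data and names are made up for illustration.
from difflib import SequenceMatcher

past_incidents = [
    {"summary": "orders DB connection pool exhausted", "resolved_by": "priya",
     "fix": "raised pool size, added connection retry"},
    {"summary": "payment service latency spike at peak", "resolved_by": "wei",
     "fix": "scaled out payment pods, tuned autoscaler target"},
]

def route_alert(alert_text: str) -> dict:
    """Return the most similar past incident for context and routing."""
    def score(incident: dict) -> float:
        return SequenceMatcher(None, alert_text.lower(),
                               incident["summary"]).ratio()
    return max(past_incidents, key=score)

match = route_alert("Orders database connections exhausted, timeouts rising")
print(match["resolved_by"], "-", match["fix"])
```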
Now for the very important next phase. We started with traditional monitoring, moved to basic observability, then to the enhanced version, then to self-healing. Now think about generative AI. Imagine we built a chatbot into an observability tool like Grafana. You could ask: I got this incident, what can I do here? Based on the historical data it gives you the steps to take, and it even shows you a set of previous incidents so you can understand and apply them; it is leveraging the historical data for you.

Another example is incident summaries. Someone logs an incident; if similar incidents have happened before, generative AI can build up a cumulative summary that helps the user understand: these are the possible reasons, this is where the major problem usually lies, and this is the area where the fix is most frequently applied, so check that first. It helps identify the root cause much more quickly: it brings the related incidents together and summarizes which fixes were applied across them and which fix was most common, so you can attack that one first. And then there are suggestions based on history, as I said. If we bring generative AI into the system on top of the historical data, not just the AI models we already use, we make it even better. Even a person who is brand new to the system no longer needs knowledge-transfer sessions; we can just say, use the chatbot, get the information, and you are on your own. It makes everyone independent, because nobody needs to ask another person for information; everything is in the history. You just need to grab it, understand it, and apply your own thinking.

The last point: observability is, you could say, a discipline. When we build a new application within our organization, we need to ensure the discipline is followed across all applications so that the proper operational KPI data is collected. If this practice is applied consistently across the organization, you are preparing yourself for far better application and operations management. Your operational cost comes down drastically, because incident volume and turnaround come down and your predictive accuracy is good; the cost of quality comes down too, and that gives you the breathing space to enhance your application further. That is how observability makes our life smoother: if we do not have proper operations management, the cost of quality increases, and from a budget point of view we may not have the funds to improve application features; we end up putting all our effort into retaining and sustaining what exists, unable to scale the application for future needs. That is the reason this matters so much.
...

Srinivas Sriram Mantrala


