Conf42 Cloud Native 2024 - Online

Observability Maturity Model for AWS – From Reactive to Autonomous


Abstract

Embark on a transformative journey from Reactive to Autonomous Observability in AWS. Navigate maturity stages, gain strategic insights, implement practical guidelines, and achieve future-ready Cloud Native excellence in observability. Drive unparalleled experiences to meet business objectives.

Summary

  • I'll be walking you through how you can leverage AWS to build a comprehensive observability maturity model that takes your observability from reactive to autonomous. When implementing a maturity model, it's very important to ensure you are measuring business outcomes.
  • Cloud native is no longer just a buzzword. It cuts out a lot of overhead, but it has its own complexities as well. Traditional monitoring falls apart when you are cloud native; without observability you will not achieve your end objectives.
  • It's very important to have an observability maturity model so that you are not stuck. You can take this one, customize it a little to suit your needs, and start your observability journey from there.
  • When building AWS-based observability, what are the key pillars? Logs are the most ancient type of observability element. Then come metrics. Tracing is probably the newest kid on the block. Be mindful of cost as well.
  • The first level is reactive, where you just do the basics to ensure you get alerts when systems go down. The next level is proactive. Then predictive is the way to go, as it allows you to anticipate problems. But do all systems need to be at the autonomous level?
  • Four levels of maturity: reactive, proactive, predictive, and autonomous. Tracing is usually absent at the reactive, keep-the-lights-on level. Predictive and autonomous are about bringing in AI and ML. Balancing observability objectives against cost is a very important factor.
  • At a high level, what we need first is real user monitoring. Next we implement APM (application performance monitoring) with distributed tracing. Ensure that you enable log anomaly detection. And finally, you will have to do your infrastructure monitoring.
  • It's very important to clearly define your goals and how to measure your customer experience. The further you go from reactive to autonomous, the better you should be able to achieve your service level objectives. Ensure that all your observability telemetry data is centralized. And finally: where is cloud native observability heading?
  • And finally, my prediction for this year: Dynatrace, Datadog, and New Relic are the top three, and Amazon Web Services is a leading contender in the challengers category. There's a great lineup of speakers at Cloud Native 2024.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, welcome to Conf42 Cloud Native 2024. I'm very happy to be part of this year's Cloud Native conference, and I'll be walking you through how you can leverage AWS to build a comprehensive observability maturity model that takes your observability from reactive to autonomous. You have to ask yourself this question: are you working for machines, or are machines working for you? It's the middle of the night, you get a call-out, and you have to open your laptop and start working. I'm afraid that means you are working for the machines. What you have to understand is how you can move towards a more autonomous way of operating, so that machines start working for you. During my presentation I'll walk you through why observability is important, especially in the cloud native arena, and why you need to focus on observability maturity. The topic of my presentation is mainly the maturity model I have come up with and the pillars around it. I'll go into the details of the maturity model, which has four stages, and how you can take it from reactive to autonomous, and then we will talk about some implementation guidelines you can leverage when you start your own AWS observability journey. It's very important when you are implementing a maturity model to ensure that you are measuring business outcomes. Every step of the way, try to see what value you are generating for your business. Unless you do that, it will be just another approach, and your business partners will not see the expected benefits. So it's very important that we have the ability to measure everything and then see how it's impacting the overall business goals. Then we'll wrap up by going through some of the best practices and pitfalls you have to avoid, and I'll also briefly talk about my predictions for the future of cloud native observability. So, moving on. As you might already be aware, cloud native is no longer just a buzzword; almost everyone is in the cloud, or has at least partially moved into the cloud. Even though cloud native simplifies a lot when you are moving from your on-premise data centers, and cuts out a lot of overhead, it has its own complexities as well. One of the key complexities it brings is the distributed nature of the architecture: applications nowadays depend heavily on microservices, with a lot of upstream and downstream dependencies, and this naturally makes our systems distributed, complex, and hard to track. Also, most of these systems are dynamic in nature: there is auto scaling happening, and other forms of elasticity. That requires a new way of doing observability, because the traditional way of monitoring, managing, and operating will not work. And of course we have containers and continuous integration and deployment, which have increased production velocity. The result is a lot of complexity in our cloud native solutions, and that is a recipe for disaster unless you plan properly. So what I am suggesting is that traditional monitoring will fall apart when you are in cloud native. You have to look at observability, and you have to look at the ways you can get the benefit of the cloud as well. 
So in a nutshell, observability is a key part of your cloud native journey. Without observability you will definitely fall apart and you will not achieve your end objectives. Moving on: why do we need a maturity model? There are a lot of reasons. One of the main ones is not even technical: you need a north star. When you start your observability journey you'll probably start in one place, but you want to know where you are heading, and you want decent objectives on a particular timeline so that you can work with your resources and move in that direction. If you don't have a maturity model, you don't know how to measure quality, and you don't know where you stand compared with the rest of the industry. So it's very important that you have an observability maturity model. That doesn't mean you have to stick to exactly what I am presenting today; you can take it, do a little customization to suit your needs, turn it into a blueprint, and start your observability journey from there. As I said earlier, the maturity model matters because it keeps us from getting stuck. So when we are building AWS-based observability, what are the key pillars? There are many, and I'm going to touch on a few. One of the main ones is logs. As you might know, logs are the most ancient type of observability element; they have been around since distributed systems, or computer systems in general, began. Syslog is probably the oldest log format we know, and logs have long been used for auditing and for troubleshooting. Then come the metrics. A metric is usually a number that indicates how something is working. Metrics are generally used to trigger alerts, because with a metric it's easy to set a threshold, or do profile-based alerting, and get alerted. Metrics are a very important aspect of observability because they allow us to understand some of the internal state of our systems. Then tracing. Tracing is probably the newest kid on the block, and it's about understanding what your code is doing. We are good at looking at logs and working out what is happening, but logs are a little limited and sometimes don't provide the exact details you are looking for. Traces give you that exact unit of work: what your code is doing, traceable down to the method level and even to the database query level. So traces are a very powerful tool, especially for troubleshooting issues. Then alarms. I don't have to spend much time on alarms: you have to have the right alarms in place so that you get automated call-outs and can respond. But what you have to understand is that an automated call-out is not the end goal; you have to ask, is there a way I can automate this, get the system to resolve itself, with self-healing capabilities, the autonomous work I'm going to talk about? So alarms are an early, primary mechanism, but you still need some alarms in case your autonomous mechanisms are not working; a minimal sketch of a threshold-based alarm follows this paragraph. And then of course you'll have to have dashboards. 
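For illustration, here is a minimal boto3 sketch of the kind of threshold-based alarm described above. The alarm name, load balancer dimension, and SNS topic are hypothetical placeholders, not values from the talk:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alert when p99 latency of a (hypothetical) ALB stays above 2 seconds
# for 3 consecutive 1-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="checkout-api-p99-latency-high",  # hypothetical name
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[
        {"Name": "LoadBalancer", "Value": "app/checkout-alb/0123456789abcdef"}  # hypothetical
    ],
    ExtendedStatistic="p99",       # percentile statistics use ExtendedStatistic
    Period=60,
    EvaluationPeriods=3,
    Threshold=2.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:oncall-alerts"],  # hypothetical topic
)
```

A static threshold like this is the proactive baseline; the anomaly-detection variant sketched later in this talk removes the need to guess a fixed number.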
Canaries are about doing synthetic testing of your application. It's good that we are looking at our end users' behavior, what they are doing, all the service calls, and that we have full-stack observability of our application. But what if a customer has a network issue, or some other problem outside our control? Having a synthetic monitor that mimics actual end-user behavior helps here, because with it we can mimic end users' actions and behaviors and get an alert when the synthetic monitor hits issues. So this is a good way of keeping track of things and staying on top of our systems. And real user monitoring is very important: it's front-end monitoring, about understanding the exact experience your end users are getting. And of course you'll have to do your infrastructure monitoring, network monitoring, and security monitoring, and finally be mindful of cost as well. Especially in AWS, you have to decide when to enable detailed logging, even though it might be costly, and when to enable anomaly detection and the other great features AWS provides, because these will have costs associated with them. So you will have to balance your needs against the cost. At a high level, those are the key pillars of observability. Just to remind you: observability is an approach that uses the telemetry data an application emits to understand the system's internal state. The more we understand the system's internal state, the more control we have over ensuring the system works fine, and if we identify that the internal state is deteriorating, we can take action in advance so the end-user experience is not impacted. And what is our mission here? Our mission is to reduce or completely eliminate anything which can impact our end users' experience, because what we are trying to do is ensure that our systems are reliable and available. So the key pillars of observability are generally logs, metrics, and traces, but you have to add the other elements as well to complement your observability journey. Moving on: what are the levels I am referring to when I talk about an observability maturity model? These are the four levels. The first is reactive; you can call it keeping the lights on, where you are just doing the basics to ensure you get alerts when systems go down. The next level is proactive: doing a bit more and being a bit more advanced in detecting and fixing the issues that can impact the end-user experience. Then predictive, the third level, is the direction to head in, because it allows you to predict problems. Being proactive is good, but if an issue still impacts the end-user experience, that is not so great; what we want is to predict early, identify symptoms early, and fix them before they materialize and impact the end-user experience. And the final, nirvana state is the autonomous level, where our systems are able to look at all the telemetry data and, with that, make a judgment about their internal state. 
And if the system sees that it's not going in the right direction, it can do some self-healing remediation on its own to keep its internal state healthy. The key thing to note is that observability is about understanding the internal state, and the more data we have, the more we can let our systems understand their own internal state and take precautionary measures, even without us involved. That is what we call autonomous (a minimal self-healing sketch appears just before the pillar walkthrough below). If I come back again to reactive: you will have some logs and probably some metrics as well, but the metrics are probably limited and you might not have traces at all. You will know if your application goes down; you might get some process alerts and infra-heavy alerts, and that gets some work done. But that is just keeping the lights on; it's not necessarily a great customer experience. Being proactive means you have logs, metrics, and traces, and with them you proactively identify issues. You might identify an issue that still impacts customers, but you can expedite the resolution. You can often get to know about an issue before your end users do, which feels pretty good: you don't have to wait until your customers report that something is down, you get to know first, you can send out communications, and you can stay on top of the entire incident window. That is still not a great place to be, but it's better than keeping the lights on, or being reactive. Predictive means using those metrics, logs, and traces to stay on top of the game: looking at anomalies, forecasting, watching what is happening outside our BAU operations, coming up with intelligent predictions, and taking action based on them so that we can actually eliminate the issues that would have a bad impact on the end-user experience. And then finally autonomous, which I have touched on while going through these four levels: it's about understanding when and where the system's internal state is changing, and what actions the system itself can take to bring its internal state back to normal, where it can self-heal, remediate, and do all those things, taking it even beyond the predictive level. One question you might ask is: do all your systems, or the clients you work with, need to be at the autonomous level? The answer is no. It depends on complexity, mission criticality, and a lot of other factors. We are not advising everyone to be at the autonomous level, but obviously no one should remain at the reactive level either. For most systems, somewhere between proactive and predictive will do the job; it balances your operations with the cost, and still provides real benefits to the end users and the business. But if you run a mission-critical system, where any outage or bad customer experience costs you money, and it's important to keep track of what end users are feeling, then you will have to go to the predictive and autonomous levels, and leverage them to stay on top of your operations. So now let's go through the key pillars we discussed earlier against the four levels of maturity, and try to understand what each maturity level means for each pillar. 
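Before the pillar-by-pillar walkthrough, here is a minimal sketch of the self-healing hook mentioned above: a Lambda handler that could be wired to a health alarm (via SNS or EventBridge) and forces a fresh deployment of a hypothetical ECS service. It illustrates the pattern, not a prescribed implementation; the cluster and service names are placeholders:

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical names; in practice these would come from the alarm event
# or resource tags rather than being hard-coded.
CLUSTER = "checkout-cluster"
SERVICE = "checkout-service"


def handler(event, context):
    """Invoked when a health alarm fires (e.g. via SNS or EventBridge).

    Forces a fresh ECS deployment, which replaces unhealthy tasks: a
    simple, reversible remediation that suits many 'stuck service' cases.
    """
    ecs.update_service(
        cluster=CLUSTER,
        service=SERVICE,
        forceNewDeployment=True,  # roll the tasks without changing the task definition
    )
    return {"remediation": "force-new-deployment", "service": SERVICE}
```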
So when it comes to logs, the reactive approach is simply using logs for troubleshooting: customers report an issue, you acknowledge it, you refer to the logs, and you start troubleshooting and trying to find answers. Being proactive means the exceptions become visible in the logs, and you get alerts from them, so you can be a little proactive in identifying issues. Obviously this improves your mean time to detection, and with it you can stay a bit more on top of things. Predictive means you look at all the logs and do advanced anomaly detection, so you see anomalies in advance: the moment something moves outside your BAU, or the normal internal state, you identify it, and you can build a lot of predictions on top of that. Autonomous means taking those signals, correlating them, and triggering workflows that perform autonomous operations or self-healing. For metrics, again: at the reactive level you will have basic metrics, and at the proactive level some threshold-based alerting, but at the predictive level you will use a lot of anomaly detection capabilities, and these will help you predict issues in advance and then build your autonomous capabilities. For tracing: at the reactive, keep-the-lights-on level you will usually see no tracing; at the proactive level you will have some basic tracing; but at the predictive level you will have tracing that is both time-driven and topology-based. That is distributed tracing, where you propagate the trace context, correlate across different systems, and identify a lot of issues; this is, in a nutshell, the full-stack observability level (a small context-propagation sketch follows this paragraph). It gives you great benefits for being predictive, and you can definitely use it for autonomous as well, because with traces you can identify actual root causes and then trigger your autonomous workflows. Canaries are the synthetic monitoring, and along the journey you will go from having no canaries to having all your key journeys monitored by synthetic monitors. Real user monitoring is a very important one: you will start it at the proactive level, improve it at the predictive level, and at the autonomous level you will use AI and ML both to improve the capabilities and to drive the autonomous behavior. For infrastructure monitoring, again, the predictive and autonomous levels are about bringing in AI and ML and staying on top of your operations, and it's the same for network and security; those will let you keep walking this journey. Achieving your observability objectives against cost is a very important factor. From reactive to autonomous the cost does increase, but with the predictive and autonomous capabilities you bring down a lot of human involvement and human effort, and that results in more gains for you as well. So you might start out a little expensive at the beginning of your observability journey, but you can definitely reduce cost as the journey matures. 
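Picking up the trace-context propagation point from the tracing pillar above, here is a minimal OpenTelemetry sketch in Python showing the two halves of propagation: the caller injects the W3C traceparent header, and the callee extracts it and parents its span to the incoming context. The service names and downstream URL are hypothetical, and a real AWS setup would export to an OTel collector or X-Ray rather than the console:

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Minimal SDK setup; swap ConsoleSpanExporter for an OTLP exporter in practice.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-frontend")  # hypothetical service name


def call_downstream():
    # Caller side: inject the current trace context into the HTTP headers
    # (W3C traceparent) so the downstream service can continue the trace.
    with tracer.start_as_current_span("call-payment-service"):
        headers = {}
        inject(headers)
        requests.get("https://payments.internal/charge", headers=headers)  # hypothetical URL


def handle_request(incoming_headers):
    # Callee side: extract the propagated context and parent the local span
    # to it, linking front end, microservice, and database spans together.
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("charge-card", context=ctx):
        pass  # ... run the SQL query inside this span ...
```

This inject/extract pair is what lets a slow SQL query be tied back to the exact end-user request that triggered it, as described in the next section.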
So now let's look at how we can implement a comprehensive observability setup using AWS. In this example, the application is hosted in AWS: you have a database, microservices, front-end code, upstreams, downstreams, and end users. At a high level, the first thing we need is RUM, real user monitoring. That is usually where I start, because I want to know how my end users are feeling. RUM is all about understanding front-end performance, and front-end performance is the most important thing because that's what our end users see. Next we implement APM, application performance monitoring, with distributed tracing, so that we know end to end how things are happening and have full visibility into our code. What's important is to understand how our code is behaving; with that we can find the bottlenecks and other issues and rectify them. So enabling application performance monitoring and distributed tracing, full-stack observability, is very important. With distributed tracing, when a user request comes in, we can correlate what's happening at the front end, at the microservice level, and in the database. Traditionally, if you see a database query that is taking a long time, what happens is that your DBAs or SMEs identify it and reach out to the DevOps and SRE teams, but those teams will sometimes struggle. Even if they have access to the code, they might have trouble identifying which module, or which journey, triggers the query. And even if they know the code well and identify the module, it is next to impossible for them to isolate which user, or which user profile, invoked it, because there is no connection. But with distributed tracing we propagate the trace context from the front end through the microservices to the database layer, so with trace propagation we can understand which queries were triggered by which end-user request. This is very powerful: it lets us go through and identify the bottlenecks, the issues in our code, and any other errors, everything related to customer experience, and directly correlate them with our code. Then obviously we'll have to look at the logs and events, and at the metrics. You'll also have to ensure that, as part of site reliability engineering, you define your SLIs, SLOs, and error budget, which complement all your observability goals. And then finally you will have to do your infrastructure monitoring as well, which ensures you stay on top of your estate. With this, let's look at some of the key implementation areas so you can get an idea of the implementation. If you are using AWS, you can go into real user monitoring and configure your application there. It will give you a code snippet which you have to embed in all your front-end pages; with that you enable real user monitoring, and it will let you see page response times, page errors, the Apdex score, and everything else related to front-end performance. 
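As a sketch of that first step, here is how creating a CloudWatch RUM app monitor might look with boto3; the monitor name and domain are hypothetical, and the JavaScript snippet to embed is then available from the RUM console after creation:

```python
import boto3

rum = boto3.client("rum")

# Hypothetical app monitor; after creation, the RUM console provides the
# JavaScript snippet to embed in your front-end pages.
rum.create_app_monitor(
    Name="checkout-web",              # hypothetical monitor name
    Domain="shop.example.com",        # hypothetical top-level domain the app serves
    CwLogEnabled=True,                # also copy RUM events to CloudWatch Logs
    AppMonitorConfiguration={
        "Telemetries": ["errors", "performance", "http"],
        "SessionSampleRate": 1.0,     # sample 100% of sessions to start; tune down for cost
        "AllowCookies": True,         # needed to correlate pages within a user journey
    },
)
```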
And then obviously you'll have to configure the CloudWatch agent, and set the relevant properties to add all your log files, so that your logs are getting fed into CloudWatch. Once that's happening, ensure that you enable log anomaly detection, because it's very important. As I said earlier, what usually happens at the reactive level is that when your end users complain something is not working, you tail your logs, you identify the exceptions, you understand the issue, and you come up with fixes. But once you've identified the issue, someone will probably ask: can you go back and check when this started? And you'll see it started a couple of hours earlier, sometimes even a couple of days earlier. How great would it be if we could identify these things the moment they appear? The challenge is that it can be something unknown, something even the development team isn't aware of, or something difficult to capture. With log anomaly detection, the AI works out the baseline: the existing errors, what currently happens normally, and it baselines your state. After that, when new errors, new issues, or new behavior changes happen on top of that baseline, it alerts you. So log anomaly detection is a very powerful capability which you should definitely enable; it will provide value throughout your observability journey (a minimal sketch follows this paragraph). And once you enable your traces, you will start seeing the service map. You can do this with OpenTelemetry, and one great feature is that this map lets you see how requests are being served, and shows any bottlenecks. As I said, traces are the great way, because they let you track a request from the browser, through the API gateways, down to the SQL queries. You can see how much time the front end takes, how much time the microservices take, and even some of the SQL being run. Enabling full-stack observability is very important because it gives you full control of your estate: you can see your system's internal state, and especially the code, what your code is doing. Usually what happens at the reactive level is that you are very infra-heavy: you see all the infrastructure and not much else. But note that it's the code that serves your customer requests, the code that does the processing. So you have to enable traces and open the doors to full visibility into your system. Alongside that, you will have your metrics enabled as well: infra-level metrics, application metrics, performance metrics, and custom metrics too. If you are using Lambda you can have serverless metrics, and database metrics as well. Metrics are generally the numbers on which you can base a lot of decisions: you can see performance, and especially you can configure a lot of alerting. Metrics give you those triggers, and with CloudWatch you can also enable metric anomaly detection. 
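Here is a minimal sketch of wiring that up with boto3: set a sane retention on a hypothetical log group (retention comes up again in the best practices below), then enable log anomaly detection on it. The anomaly detector call assumes a recent boto3 release that includes the CloudWatch Logs anomaly detection API introduced around re:Invent 2023; the names and ARN are placeholders:

```python
import boto3

logs = boto3.client("logs")

LOG_GROUP = "/app/checkout-service"  # hypothetical log group

# Keep logs long enough to troubleshoot issues found days later.
logs.put_retention_policy(logGroupName=LOG_GROUP, retentionInDays=30)

# Enable log anomaly detection on the group so new error patterns are
# flagged against the learned baseline. Assumes a boto3 version that
# ships the CreateLogAnomalyDetector operation.
logs.create_log_anomaly_detector(
    detectorName="checkout-log-anomalies",  # hypothetical
    logGroupArnList=[
        "arn:aws:logs:us-east-1:111122223333:log-group:/app/checkout-service"  # hypothetical ARN
    ],
    evaluationFrequency="FIFTEEN_MIN",
    anomalyVisibilityTime=14,  # days an anomaly stays visible in the console
)
```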
So with that you have the power of not just having metrics, or going with threshold-based alerting, which is a fairly legacy way of doing things, but of enabling anomaly detection. What CloudWatch does is profile the metric, how it moves and how it changes, and from that it creates an upper bound and a lower bound as a guideline. Based on that, it starts sending you alerts when it sees anomalies happening (a sketch of an anomaly-band alarm follows this paragraph). Also, in AWS we have CodeGuru. I recommend enabling CodeGuru for application profiling, because it helps you understand code performance and correlate it with multiple other factors. Also enable AWS DevOps Guru, which applies a lot of AI and ML across your entire account; it's a very powerful tool that gives you a holistic view of your whole estate and the ability to identify anomalies across the board. So that's roughly what is required to implement a comprehensive observability solution in AWS. Now, let's discuss one of the key things: why are we doing this? We are doing it to ensure our customers get the world-class customer experience our application is designed for. So while you are building out your observability journey, it's very important that you clearly define your goals, and how to measure your customer experience, and then check whether your observability framework lets you achieve your customer targets, and whether it can correlate and identify when things are going wrong, so you can quickly identify and fix them. I am not going to go into much detail here, but one thing I keep reiterating is to ensure you understand your business objectives and have a way of measuring them while you travel from level one to level four of observability. At each level you should see the benefits, and it's good, even before you start, to identify what the benefits are and set some targets for your journey. If you want a few KPIs: work on your mean time to detection, mean time to resolution, and mean time between failures, and track your service level objective achievement, because the further you go from reactive to autonomous, the better you should be able to achieve your service level objectives. That's a must; unless you do that, the purpose of doing observability is lost. And, as with anything, enabling observability in AWS is pretty easy; that's what the cloud gives you. But you have to follow some best practices. Observability, as I said, is about looking at the internal state of your system, so you must enable your logs, traces, and metrics, and in AWS you should enable detailed monitoring wherever needed, because that will really help you. And don't forget your traces, because traces show what's really happening at your code level, which is the most important thing: when you are cloud native, it's a safe guess that your infrastructure is pretty stable. Then ensure you send almost everything into CloudWatch, and have proper dashboards as well, so that you can take a look and get a big, holistic picture of your entire estate. And then there is what you should avoid. 
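A minimal boto3 sketch of such an anomaly-band alarm follows; the namespace and metric are hypothetical stand-ins for one of your own business or application metrics:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on the anomaly detection band rather than a fixed threshold:
# CloudWatch learns the metric's normal profile and flags deviations
# above or below the learned band.
cloudwatch.put_metric_alarm(
    AlarmName="orders-per-minute-anomaly",  # hypothetical
    EvaluationPeriods=3,
    ComparisonOperator="LessThanLowerOrGreaterThanUpperThreshold",
    ThresholdMetricId="band",
    TreatMissingData="breaching",  # a silent metric is itself suspicious here
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "Checkout",          # hypothetical custom namespace
                    "MetricName": "OrdersPerMinute",  # hypothetical business metric
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": True,
        },
        {
            "Id": "band",
            # 2 = width of the band in standard deviations
            "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",
            "ReturnData": True,
        },
    ],
)
```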
Definitely ensure that when you are shipping your logs to CloudWatch you are mindful of retention as well. You don't want to lose your logs and then, when you want to troubleshoot an issue a week later, find they're gone; that's a problem. Also ensure that you have granular metrics, and traces as well. That's very important, because you shouldn't try to stay at a very high level; sometimes what you need is the ground-level detail. And when you are working in vast, complex systems, it's very easy to overlook some critical systems simply because they don't feel critical. So ensure that you have a proper way of identifying your critical systems and making sure they are monitored and observed, with everything enabled. And finally, it's very easy to end up with technology or data silos in observability; ensure that all your observability telemetry data is centralized and gives you the big picture. So, finally, where is cloud native observability heading? That's a very good question to ask when you are coming up with your own observability maturity framework. In the immediate term, what I am seeing is that a lot of clients are adopting OpenTelemetry, because they are aware of the need for traces, distributed traces, enabling full-stack observability; that is becoming the first requirement for our customers. In the midterm, I think a lot of people will really start moving into AI and ML, because observability tools now provide anomaly detection, forecasting, and prediction capabilities built in, and people will start using those capabilities very quickly. And the long-term vision is where I started: do you want to work for a machine, or do you want the machine to work for you? The ultimate objective of observability is to identify your system's internal state, and whenever it shifts, even slightly, to fix it without humans involved: the system tries to self-heal. That is the autonomous nature I was discussing. And finally, my prediction for this year. If you look at Gartner's Magic Quadrant, you'll see the leaders: Dynatrace, Datadog, and New Relic are the top three, and Amazon Web Services is also a leading contender, in the challengers category. I feel that with all the new advancements, announcements, and capabilities unleashed at last year's AWS re:Invent, application signals, log anomaly detection, and further improvements to anomaly detection and other AI-based changes in CloudWatch, Amazon Web Services will be in the leaders category the next time Gartner releases the Magic Quadrant. Keep your fingers crossed; I'm pretty sure this will happen this year, and if not, next year for sure. With that, I hope you enjoyed my presentation. I wanted to make sure you have an understanding of observability and know how to use it for your AWS estate. Observability is a journey: it starts with keeping the lights on, moves to the proactive level, then becomes predictive in nature, and finally ends up with autonomous operations. 
So thank you very much for listening. If you have any questions you can find me on LinkedIn, and you can also leave comments on this video, which I would very much appreciate. There's a great lineup of speakers at Cloud Native 2024; please join in. I'm really happy about, and appreciate, the time you have spent. Take care. Bye.
...

Indika Wimalasuriya

Senior Systems Engineering Manager @ Virtusa

Indika Wimalasuriya's LinkedIn account


