Conf42 Large Language Models (LLMs) 2024 - Online

LLMs in AWS: Observability Maturity from Foundation to AIOps

Abstract

Unleash Large Language Models in AWS! Join “AWS LLMs Observability: Foundation to AIOps” to boost LLM performance, troubleshoot precisely, and stay ahead. Dive into a comprehensive maturity model for immediate impact, creating value for GenAI applications. Innovate now! Elevate your insights.

Summary

  • Welcome to Conf42 Large Language Models 2024. Indika Wimalasuriya discusses how you can leverage an observability maturity model to improve the end-user experience of the apps you build with LLMs, and wraps up with some of the best practices and pitfalls you should avoid.
  • Indika Wimalasuriya is a solution architect specializing in SRE, observability, AIOps and generative AI, working at Virtusa. He focuses mainly on LLMs deployed in and accessed through AWS.
  • We combine the user input with the retrievals we receive from the vector database. This is the typical workflow of a generative AI application and the point where we want to enable observability. Finally, after customization, the response is sent to the end user.
  • What is observability? Observability is the ability to infer or understand a system's internal state by looking at its external outputs. In indirect LLM observability, we mainly look at the applications or systems we have developed that connect to an LLM.
  • Generative AI apps built on LLMs require observability just like any other application. Models can sometimes exhibit bias, which leads to bad customer experiences, so implementing observability for LLMs is very important.
  • One of the key things is to split observability into a few parts. The first is LLM-specific metrics, where we track LLM inference latency. Another important one is LLM prompt effectiveness. We also have to ensure that our generative AI apps are completely safe.
  • At level three we are looking at advanced LLM observability with AIOps. The next step is making the system more proactive. These AI capabilities give you full control over the predictability of your generative AI application.
  • Here I have taken AWS as the example, especially Amazon Bedrock. We looked at the general architecture and workflow of a generative AI application, then at some of the best practices and pitfalls and, more importantly, at how to look at this from an ROI perspective. With this, thank you very much.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Welcome to Conf42 Large Language Models 2024. My name is Indika Wimalasuriya, and I'll walk you through how you can leverage an observability maturity model to improve the end-user experience of the apps you are going to develop using LLMs. We will touch on how to start, which is the foundation, and then take it all the way up to using AI to support your operations. As you might be aware, around 2022 the hype started with ChatGPT. ChatGPT was a hit, it went mainstream, and it got a lot of people who were not into AI to start creating generative AI apps. It has already taken over the world: everyone is looking at which use cases they can leverage, it's mainstream, and there are lots of developers building apps that connect to LLMs. So there is a need: the apps we develop must be capable of providing a full end-user experience, because we all know how it can end otherwise. While generative AI is opening up a lot of new opportunities, we also want to ensure that the apps being developed are deployed properly in production environments and served to end users as expected, and we don't want to turn that into an ops problem. So we want to build solid observability into our LLM applications as well. As part of today's presentation, I'll give you a quick intro to what observability is and what observability means for LLMs. There are two kinds of observability we are going to discuss: direct observability and indirect observability. I'll focus more on indirect observability when discussing the maturity model I'm going to walk you through. Then I'll cover some of the pillars, give a quick intro to what the LLM workflow looks like, and then jump into my main focus, a maturity model for LLM observability. We'll look at implementation guidelines and the services we can leverage, and then, like every other maturity model, this should not be something people just follow blindly: we want to tie it to business outcomes so that we have the ability to measure progress. We'll wrap up with some of the best practices and some pitfalls I think you should avoid.
Before we start, a quick intro about myself. My name is Indika Wimalasuriya and I'm based out of Colombo. I'm a site reliability engineering advocate and practitioner, and a solution architect specializing in SRE, observability, AIOps and generative AI, working at Virtusa as a senior systems engineering manager. I'm a passionate technical trainer who has trained hundreds of people on SRE, observability and AIOps, an energetic technical blogger, a proud AWS Community Builder under cloud operations, and a proud ambassador at the DevOps Institute, which is now part of PeopleCert following the acquisition. So that's about me. I am very passionate about this topic, observability. When it comes to distributed systems and LLMs, at the end of the day I look at things from a customer experience angle: how we can provide a better experience to our end users and how we can drive better business outcomes. In this presentation I'm mainly focused on AWS, so I'm looking at LLMs deployed in and accessed through AWS.
One of the fantastic services AWS offers is Amazon Bedrock, a managed service where you can use APIs to access foundation models. It's really fast and really quick; you just have to make sure you have connectivity. The key features: it gives access to foundation models for use cases such as text generation and image generation, it provides private customization with your own data using techniques like retrieval-augmented generation (RAG), and it lets you build agents that execute tasks against enterprise systems and other data sources. One good thing is that there is no infrastructure to manage: AWS takes care of the infrastructure, which is why we call it fully managed. It's very secure, and it's a go-to tool if you want to develop generative AI apps. It already includes some of the most widely used foundation models from AI21 Labs, Anthropic, Cohere, Meta, Stability AI and Amazon, and they keep adding more. With that, our observability maturity model and approach is mainly focused on applications developed using Amazon Bedrock.
Moving on, I just want to give a quick idea of what a generative AI app's workflow looks like. A user enters a query, which comes into our query interface via an API or user interface. We start processing the user query and connect it with vector encoding, trying to find similar queries and patterns in our vector database. We retrieve the top-k most relevant context from the vector database and combine it with the user input when we pass it to the LLM. The key thing to notice is that we generally combine the user input with the retrievals we receive from the vector database. With that we start inferencing with the LLM: we send the request input to the LLM, process the output, which we can combine with our RAG integration, and finally, after customization, send it to the end user. This is the typical workflow of a generative AI application, and this is where we want to enable observability.
What is observability? I'm sure most of you are aware, but just to make sure we are on the same page, I'll spend a short amount of time giving my perspective. Observability is nothing but the ability to infer or understand a system's internal state by looking at its external outputs. What are the external outputs? We typically look at logs, metrics and traces. I like to think of observability as looking at the whole iceberg, not only the part above the water. We are trying to ask questions like: what is happening in my system right now, how are the systems performing, what anomalies are there in my system, how are the different components interacting with each other, and what caused a particular issue or failure?
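Before going further, here is a minimal sketch of the retrieve-then-generate workflow described above: take the user query, pull the top-k most relevant context from a vector store, combine it with the input, and call the model through Bedrock. The retrieval helper, index contents and model ID are hypothetical placeholders, and the request body assumes an Anthropic-style completion schema; the exact schema depends on the model you choose.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")  # Bedrock runtime API for model inference

MODEL_ID = "anthropic.claude-v2"  # hypothetical choice; any Bedrock text model works


def retrieve_top_k(query: str, k: int = 3) -> list[str]:
    """Placeholder for the vector-database lookup described in the talk.
    In practice this would embed the query and search an index
    (e.g. OpenSearch or pgvector); here we return canned context."""
    return ["<context snippet 1>", "<context snippet 2>", "<context snippet 3>"][:k]


def answer(query: str) -> str:
    # Combine the user input with the top-k retrievals (the RAG step).
    context = "\n".join(retrieve_top_k(query))
    prompt = f"\n\nHuman: Use this context:\n{context}\n\nQuestion: {query}\n\nAssistant:"

    # Invoke the foundation model through Bedrock.
    response = bedrock.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps({"prompt": prompt, "max_tokens_to_sample": 300}),
    )
    payload = json.loads(response["body"].read())
    return payload.get("completion", "")


if __name__ == "__main__":
    print(answer("What does our refund policy say?"))
```

The point of the sketch is simply where the user input and the retrieved context meet before inference, because that seam is exactly where we want to hang observability.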
When it comes to monitoring versus observability, there are a lot of good things about observability: it is a proactive approach rather than a passive one, it looks at the big picture, and it draws on both qualitative and quantitative data. Before going further, I want us to quickly agree on what we mean when we say observability and LLMs. Observability for LLMs, or for apps built using LLMs, can be divided into two parts. The first is what we call direct LLM observability, or observability of the LLM itself. In this scenario we monitor, evaluate and look at the large language model directly; it is all about observability into the large language model. The second is indirect LLM observability, or observability of applications and systems using an LLM. Here we are not looking at the LLM directly; we are looking at the applications or systems that connect to and utilize the LLM. Both are about delivering real benefits to end users, and the techniques we use are pretty much the same standard ones: we leverage logs, metrics, traces and so on.
Now let's quickly look at what we mean by direct LLM observability. Here we integrate observability capabilities during training and deployment of the LLM and while it is being used, so it sits at the LLM itself. The main objective is to gain insight into how the LLM is functioning, identify anomalies and other issues directly related to the LLM, and understand the LLM's decision-making process. In terms of approach, we activate logging and look at things like attention weights and other internal states of the LLM during inference, we implement probes or instrumentation within the model architecture, we track performance metrics such as latency and memory usage, and we use techniques like attention visualization. As I said, this is at the LLM level: a fully fledged look at how the LLM itself is performing.
Indirect LLM observability, on the other hand, is mainly about the applications or systems we have developed that connect to the LLM. Here we are not looking at the LLM in isolation; we are fully focused on the application side. The goal is to understand how our application is behaving, what observability we can enable, and how we can interpret its internal state. This makes sense because, just like any other application, for GenAI we also want to understand how our application performs, since any number of issues can come up, and at the end of the day it is the end users' experience with our solution that matters. What we look at here is, again, logging of the inputs and outputs related to the LLM, monitoring metrics, enabling anomaly detection on some of the LLM outputs, and of course human feedback loops as well.
We also look at a lot of metrics such as error rate and latency. The key objective, as you will have already guessed, is to understand how our application is behaving, how it is leveraging the LLM, and how good the output we provide to our end users is. In this presentation, when I say LLM observability, I mean indirect LLM observability: I am proposing a maturity model that caters to applications developed against Amazon Bedrock, because AWS, and Bedrock in particular, is what I am focusing on. We are trying to see how we can integrate observability practices into generative AI applications: how we can understand these applications' internal states, while also focusing on some aspects of the LLM and prompt engineering. So we look at indirect oversight of LLM functionality and try to make sure the generative AI applications are reliable, that they do what they were designed to do, and that end users are happy with the performance.
We also want to answer this question: why observability for LLMs? Just like any other application, generative apps developed using LLMs require observability, because we need to ensure that our generative applications are correct; it's about correctness, accuracy and performance, and about providing a great customer experience. But LLMs have their own challenges. They can be complex; we might have to look out for anomalies, model bias or model drift. Model drift means the model can work fine during testing for a considerable period of time but then start failing, which can have an adverse impact on the end-user experience. Sometimes the models can introduce bias, which again creates bad customer experiences. Then there are the standard concerns: debugging, troubleshooting, how well we are using our resources, and ethics, data privacy and security. So implementing observability for LLMs is very important, because it allows us to keep delivering great end-user experiences.
Now let's look at the pillars shaping LLM observability. One of the key things is that I'd like to split it into a few parts. The first is LLM-specific metrics. One of them is LLM inference latency: here we track the latency of the requests coming into the Bedrock-backed application. We monitor latency at different stages of the request, such as at the API Gateway, the Lambda functions and the LLM itself, however the pipeline is defined, and we look for potential bottlenecks and ways to improve or optimize performance. Then we look at the LLM inference success rate: we monitor the success rate of requests going to and coming back from the LLM, look at the errors, whether errors are increasing and why, and cover all the troubleshooting aspects as well.
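As a hedged illustration of the inference latency and success-rate tracking just described, the sketch below wraps an LLM call with a timer and publishes custom CloudWatch metrics. The namespace, metric names and dimensions are illustrative choices, not a prescribed schema.

```python
import time
import boto3

cloudwatch = boto3.client("cloudwatch")

# Illustrative namespace and dimension; pick whatever fits your application.
NAMESPACE = "GenAIApp/LLM"
DIMENSIONS = [{"Name": "ModelId", "Value": "anthropic.claude-v2"}]


def record_inference(invoke_fn, *args, **kwargs):
    """Time an LLM call and emit latency and success/failure metrics."""
    start = time.time()
    success = True
    try:
        return invoke_fn(*args, **kwargs)
    except Exception:
        success = False
        raise
    finally:
        latency_ms = (time.time() - start) * 1000
        cloudwatch.put_metric_data(
            Namespace=NAMESPACE,
            MetricData=[
                {
                    "MetricName": "InferenceLatency",
                    "Dimensions": DIMENSIONS,
                    "Value": latency_ms,
                    "Unit": "Milliseconds",
                },
                {
                    "MetricName": "InferenceSuccess",
                    "Dimensions": DIMENSIONS,
                    "Value": 1.0 if success else 0.0,
                    "Unit": "Count",
                },
            ],
        )
```

Averaging InferenceSuccess over a period gives a success rate, and the same wrapper can sit around the invoke_model call from the earlier sketch.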
We also have LLM output quality, where we try to understand the quality of the LLM outputs; this is again important because it gives us the ability to keep improving in those areas. Another important one is LLM prompt effectiveness, which tracks the effectiveness of the prompts we send to the LLM: we monitor the quality of the LLM outputs across different kinds of prompts, see how they deviate, and continuously refine them. Some of the other things are LLM model drift, where we monitor the distribution of the LLM outputs within the application to understand whether there is any significant shift in the output distribution over time, and we keep tracking performance. Of course we also have to look at cost, at integration issues when we are integrating with AWS Bedrock, and at ethical considerations as well: we monitor the LLM outputs from Bedrock for potential ethical violations, so that we can be sure the generative AI apps we have developed are completely safe, with no harmful, illegal or discriminatory content. Those are the key things when it comes to LLM-specific metrics.
When it comes to prompt engineering properties, we look at the temperature, which controls randomness in the model: the higher the temperature, the more diverse the outputs; the lower the temperature, the more focused the outputs. We look at top-p sampling, which lets us control output diversity, at top-k sampling, and at things like max tokens and stop tokens, which signal the model to stop generating text when they are encountered. We also look at repetition penalties, presence penalties and batch sizes. All of these we can extract via logs and send to CloudWatch Logs, then create custom metrics and start visualizing them. Two other things: for inference latency we can check the time taken for the model to generate output for the given inputs, and we can look at model accuracy metrics as well. For these we would typically use AWS X-Ray and publish the results into CloudWatch, where we can create alarms and wrappers around them.
A few other specific things concern RAG. When it comes to RAG, we again have metrics like query latency: we want to understand the time it takes for the RAG pipeline to process a query and generate the response. Then we look at the success rate, how successful these queries are and how often they fail, at resource utilization, and, if you are using caching, at the cache hit rate as well.
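The paragraphs above suggest extracting prompt properties via logs and turning them into CloudWatch custom metrics. A minimal sketch of that idea, assuming the application's stdout is shipped to CloudWatch Logs (for example from Lambda or a container), is to emit one structured JSON line per inference; the field names are illustrative, not a required schema.

```python
import json
import logging
import time

logger = logging.getLogger("llm_observability")
logger.setLevel(logging.INFO)
logging.basicConfig(format="%(message)s")


def log_inference(prompt: str, params: dict, latency_ms: float, output_tokens: int) -> None:
    """Emit one structured JSON record per LLM call."""
    logger.info(json.dumps({
        "event": "llm_inference",
        "timestamp": time.time(),
        "prompt_length": len(prompt),
        "temperature": params.get("temperature"),
        "top_p": params.get("top_p"),
        "top_k": params.get("top_k"),
        "max_tokens": params.get("max_tokens"),
        "stop_sequences": params.get("stop_sequences"),
        "latency_ms": latency_ms,
        "output_tokens": output_tokens,
    }))


# Example usage with illustrative parameter values.
log_inference(
    prompt="Summarize our refund policy.",
    params={"temperature": 0.2, "top_p": 0.9, "top_k": 250, "max_tokens": 300},
    latency_ms=842.0,
    output_tokens=187,
)
```

Once these lines land in CloudWatch Logs, metric filters or Logs Insights queries can aggregate fields such as temperature, latency_ms or output_tokens into the custom metrics and dashboards the talk describes.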
When it comes to logs, we look at query logs, error logs and audit logs, which give us a comprehensive way of auditing and troubleshooting. Then we enable traces with AWS X-Ray, which gives us end-to-end tracing so that we have complete observability into the data store, the data retriever and the other components. Tracing is another pillar: we use X-Ray to integrate the traces, and we look at the other AWS service integrations as well. For visualization we use CloudWatch, and we can also use Amazon Managed Grafana or other tools. One other key thing is to be mindful about alerting and incident management: we can use CloudWatch alarms and leverage AWS Systems Manager as well. Security is important too. We leverage AWS CloudTrail to audit and monitor the API calls and to ensure that compliance with security and regulatory requirements is being tracked; we can integrate CloudTrail logs with CloudWatch Logs for centralization, and we use AWS Config to continuously monitor and assess the configuration of our AWS resources and ensure compliance with best practices and compliance standards. Cost is another key aspect: the more we use our LLMs, the bigger the cost factor becomes, so we can leverage AWS Cost Explorer and AWS Budgets. Finally, one other important thing is building up the AIOps capability. For all the metrics, whether LLM-specific, application-specific or RAG-specific, we enable anomaly detection, and for all the logs we put into CloudWatch we enable log anomaly detection as well. We can also use Amazon DevOps Guru, a machine learning service from AWS that helps us detect and resolve issues in our systems, especially anomalies and other issues we might not uncover manually. We can leverage Amazon CodeGuru as well, because it integrates with the application so that we can do profiling and understand resource utilization. Another very important thing is to use AWS forecasting capabilities: for all the metrics we bring in, forecasting lets us understand things in advance so that we can make better decisions and plan ahead.
With that, you might ask why we need a maturity model at all. I am a big fan of maturity models because I think they act as a north star. We all want to start someplace and then take our systems on an observability journey. If you do that without a maturity model or framework, you may end up anywhere; by using a maturity model you can make sure you start with the basic steps, finish with some of the advanced capabilities, and have better control over how you get there. My indirect LLM observability maturity model has three levels. Level one is foundational observability, and level two is proactive observability.
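As a small, hedged example of the alerting step mentioned above, here is how a CloudWatch alarm could be attached to the illustrative InferenceLatency metric from the earlier sketch; the threshold, evaluation periods and SNS topic ARN are placeholders, not recommendations.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on the average of the custom latency metric published earlier.
cloudwatch.put_metric_alarm(
    AlarmName="llm-inference-latency-high",
    Namespace="GenAIApp/LLM",
    MetricName="InferenceLatency",
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-v2"}],
    Statistic="Average",
    Period=300,                      # evaluate over 5-minute windows
    EvaluationPeriods=3,             # must breach 3 consecutive periods
    Threshold=2000.0,                # milliseconds; placeholder value
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:llm-observability-alerts"],
)
```

The alarm action would typically notify an SNS topic wired into the incident management workflow the talk mentions.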
At level three we are looking at advanced LLM observability with AIOps. At level one we start capturing some of the basic LLM metrics, collecting the logs and monitoring the basic prompt properties, and we implement basic logging and distributed tracing. Then we put up visualizations and basic alerts as well. This gives you foundational observability into your generative AI application. The next step is making the system more proactive. Here we start capturing and analyzing the advanced LLM metrics, leveraging the logs and the more advanced prompt properties, and we enhance the alerts and incident management workflows so that we can identify and resolve issues much faster. We bring in the security and compliance aspects, and we start leveraging AWS forecasting so that we can forecast some of the LLM-specific metrics and prompt properties. For the logs we can also set up log anomaly detection. Level three is the advanced level, the place you all want to be, but be mindful that it is a journey: you have to start at level one, go to level two, and then you can be at level three. At level three we start by integrating DevOps Guru and CodeGuru: DevOps Guru provides the AI and ML capabilities, and CodeGuru looks after the quality of the code. Then we implement AIOps capabilities such as noise reduction, intelligent root cause analysis and business impact assessments. The forecasting features allow us to understand, if a model can drift, when that might happen; if a model can start showing bias, when that might happen; and to predict response times and so on. These AI capabilities give you full control over the predictability of your generative AI application.
Now let me focus on the implementation angle. At the foundational level we can use CloudWatch metrics to capture the basic LLM metrics, such as inference time, model size and prompt length. For the prompt properties, we can send the logs to CloudWatch Logs so that we can start monitoring basic properties like prompt content and prompt sources, and we ship any other logs into CloudWatch as well so that we get the basic details. Then we integrate AWS X-Ray, depending on the technology we are using to build our generative AI app, so that we can look at the traces. For visualization and dashboards we can use CloudWatch dashboards and, if required, Amazon Managed Grafana dashboards as well. For alerting and incident management we leverage CloudWatch, which covers the basic to medium-complexity monitors, so that we have good control over how the LLMs are behaving, how successful our prompts are, how our generative application is behaving overall, and, most importantly, how our end users feel about it.
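Levels two and three call for metric anomaly detection. One hedged way to wire that up, sketched below, is a CloudWatch anomaly detection alarm on the same illustrative custom metric, using the ANOMALY_DETECTION_BAND metric math function; the metric names and band width are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Anomaly detection alarm: alert when latency leaves the learned band.
cloudwatch.put_metric_alarm(
    AlarmName="llm-inference-latency-anomaly",
    ComparisonOperator="GreaterThanUpperThreshold",
    EvaluationPeriods=3,
    ThresholdMetricId="band",
    TreatMissingData="notBreaching",
    Metrics=[
        {
            "Id": "latency",
            "MetricStat": {
                "Metric": {
                    "Namespace": "GenAIApp/LLM",
                    "MetricName": "InferenceLatency",
                    "Dimensions": [{"Name": "ModelId", "Value": "anthropic.claude-v2"}],
                },
                "Period": 300,
                "Stat": "Average",
            },
        },
        {
            # 2 standard deviations is an illustrative band width.
            "Id": "band",
            "Expression": "ANOMALY_DETECTION_BAND(latency, 2)",
        },
    ],
)
```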
We wrap that level up with cost, using AWS Cost Explorer, because LLMs can be costly, so we have to track usage and monitor it as well. At level two we go a little more advanced on the metrics: we start looking at advanced metrics like model performance and output quality, and for prompt properties we look at advanced ones like prompt performance and prompt versioning. We keep improving the incident workflows, we look at security and compliance, and we keep improving on the cost side as well. One of the key things we bring in here is AWS forecasting: we want the ability to forecast every key metric related to LLM performance, inference, accuracy and the prompt properties. We also enable metric anomaly detection and log anomaly detection so that we start using those capabilities. Finally, we bring in Amazon DevOps Guru and CodeGuru, which add AI and ML capabilities so that we can look at things holistically; DevOps Guru is a perfect tool for this. Then we bring in AIOps practices, making our incident workflows more self-healing, along with the many other improvements AI can bring.
While we do all of this, we want to make sure we measure the progress. Once we enable observability, we want to see how the LLM output quality is improving, how we are optimizing our prompt engineering, and that we are able to detect model drift in advance and take the necessary actions. We look at the ethical side, at how our models are behaving, and at interpretability and explainability, and keep a close eye on those. More generally, we look at the end-user experience: we clearly define end-user-specific service level objectives, track the metrics and the improvements, and make sure that whatever we do aligns and correlates with customer experience, so that we see customer experience increasing and we develop and provide world-class services for our end users. Some of the best practices: use structured logging, and if you are heavily using Lambda, go for Lambda Powertools; instrument the code so that you get all the critical LLM-specific metrics, and obviously use X-Ray to enable the traces as well (a sketch follows below). The metrics we extract have to be meaningful and add value, and they should be aligned with our business objectives.
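For the Lambda Powertools recommendation above, here is a minimal sketch, under stated assumptions, of what structured logging, embedded metrics and X-Ray tracing can look like in a handler that calls the LLM; the service name, namespace and the call_llm helper are hypothetical.

```python
import time

from aws_lambda_powertools import Logger, Metrics, Tracer
from aws_lambda_powertools.metrics import MetricUnit

logger = Logger(service="genai-app")          # structured JSON logs to CloudWatch Logs
tracer = Tracer(service="genai-app")          # X-Ray tracing of the handler and helpers
metrics = Metrics(namespace="GenAIApp/LLM")   # CloudWatch embedded metric format


@tracer.capture_method
def call_llm(prompt: str) -> str:
    """Hypothetical helper wrapping the Bedrock invocation from the earlier sketches."""
    return "model response"


@logger.inject_lambda_context
@tracer.capture_lambda_handler
@metrics.log_metrics
def handler(event, context):
    prompt = event.get("prompt", "")
    start = time.time()
    answer = call_llm(prompt)
    latency_ms = (time.time() - start) * 1000

    # One metric and one structured log line per inference.
    metrics.add_metric(name="InferenceLatency", unit=MetricUnit.Milliseconds, value=latency_ms)
    logger.info("llm_inference", extra={"prompt_length": len(prompt), "latency_ms": latency_ms})
    return {"answer": answer}
```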
To wrap up, a few pitfalls to avoid: make sure you plan security and compliance in advance, because that is a key concern these days when we build generative applications, and clearly define the goals, whatever objectives you intend to achieve with this. Ideally you have some numbers, some measurable targets, so that you can track progress and realize the benefit. With this we are at the close, so thank you very much. I have taken AWS as the example here, especially Amazon Bedrock. We have looked at the general architecture and workflow of a generative AI application, at the key LLM observability pillars we have to enable, and at the three levels: foundational observability, proactive observability, and advanced observability with AIOps. We have also looked at some of the best practices and pitfalls and, more importantly, at how we can look at this from an ROI perspective. I hope you enjoyed this and that you have taken away a few things you can apply to your own generative AI application to make it observable and to deliver great customer experiences. With this, thank you very much for taking the time.
...

Indika Wimalasuriya

Senior Systems Engineering Manager @ Virtusa

Indika Wimalasuriya's LinkedIn account


