Conf42 Large Language Models (LLMs) 2024 - Online

Operational excellence for your LLMs using Amazon Bedrock

Abstract

Customers are looking for a turnkey solution to integrate LLMs with their existing applications. This session will provide an overview of the operational considerations, architectural patterns, and governance controls needed to operate LLMs at scale.

Summary

  • Today I'll be talking about Amazon Bedrock, a managed service through which you can access different foundation models using a single API. We cover the operational excellence best practices you should consider when using Amazon Bedrock.
  • Operational excellence is the ability to support the development and running of your workloads. Key design principles include performing operations as code, refining your operational procedures, anticipating failures, and learning from your operational failures. Finally, observability can help you get actionable insights.
  • MLOps is using the same set of people, process, and technology best practices within the scheme of machine learning solutions — effectively the productionization of your ML solutions. At its core, every solution you create comes down to people, process, and technology.
  • With LLMOps, there can be different types of users you're interacting with, so we talk about the generative AI user types and the skills they need. The three key aspects to consider when selecting an LLM are speed, precision, and cost.
  • There are four ways customers customize LLMs: prompt engineering, RAG, fine-tuning, and continued pre-training. After prompt engineering and RAG comes fine-tuning, which is more time-consuming. Have clear benchmarking to see how your model performs with prompt engineering versus RAG.
  • Next comes customizing the business responses. This is where fine-tuning and continued pre-training come into the picture. Be very clear on your use case as to when you would use RAG versus fine-tuning or prompt engineering.
  • Amazon Bedrock is a way of simplifying access to foundation models. You get access to the different models available within Amazon Bedrock, and once you have access to the Bedrock API, invoking one of these models is extremely straightforward. It's still very early days and we are all just getting started.
  • Amazon Bedrock allows you to customize foundation model responses with contextual and relevant company data. Another architectural pattern is fine-tuning. You can use any integration layer that supports AWS Lambda to invoke the Bedrock APIs.
  • You can enable invocation logging at the Bedrock level, and these logs can go into Amazon S3, CloudWatch, or both. The logging is available directly within CloudWatch, and from there you can send it to S3 or use it for any future processing.
  • Talking about metrics, these are available out of the box in CloudWatch with Amazon Bedrock. Models can be evaluated for robustness, toxicity, and accuracy. You can also bring your own prompt dataset or use built-in, curated prompt datasets for this purpose.
  • And finally, we talk about guardrails, because with generative AI there are different challenges around undesirable and irrelevant topic responses. One open source solution is NVIDIA NeMo Guardrails. We also cover how you invoke Bedrock.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, welcome to the session. Large language models have captured the imagination of software developers and customers who are now interested in integrating those models into their day-to-day workflows. Today I'll be talking about Amazon Bedrock, which is a managed service through which you can access different foundation models using a single API, and about the operational excellence best practices you should consider when using Amazon Bedrock. Customers are often looking for a turnkey solution that can help integrate these LLMs into their existing applications. As part of the session, we will start with an introduction to the term operational excellence from a Well-Architected review perspective, talk about LLMs, and then go in depth into Bedrock. So let's start with operational excellence. If you look at the Well-Architected Framework that AWS recommends, it has a pillar called operational excellence. Operational excellence is basically the ability to support the development of your workloads, run your workloads, gain insight into them, and essentially improve your processes and procedures to deliver business value. Operational excellence is a practice which you develop over the course of time. It's not something you will get done overnight or just by adopting a particular solution; it is about how your team is structured and how your people, processes, and technology work together. Now, within operational excellence, there are different design principles that you should be considering. One of the key principles is performing operations as code. We are all aware of infrastructure as code and of the different tools and technologies in the market; try to adopt as much as possible from an operations perspective so that you can execute these procedures as code. Making frequent, small, reversible changes is another key design principle for operational excellence, along with refining your operational procedures, anticipating failures, learning from your operational failures, and finally observability, which can help you get actionable insights. With respect to LLMs, let's say you're using Amazon Bedrock, which is API access to different foundation models. You still have to follow all of these design principles: how do you deploy the API, how do you version the API, how do the different operational procedures work together, and what kind of observability can be put in? These are some of the factors we will be talking about as we go further into the session. Now let's talk about some of the key terms that we keep hearing day in, day out: DevOps, MLOps, and LLMOps. DevOps as a term has been around for a pretty long time. It basically encourages you to break down silos, remove the organizational and functional separation between teams, and take end-to-end ownership: whatever you are building, you are also running and supporting that code. MLOps is basically using the same set of people, process, and technology best practices within the scheme of machine learning solutions. So consider DevOps as something you would often use for microservices written in Java, Python, or Golang.
When you are trying to use the same technology stack but now solving a machine learning problem, you suddenly have a model: you need to train the model, you need to run inference on the model, you may need multi-model endpoints. You want to incorporate these practices into how the model is trained, how it is deployed, how the approval process works, and ultimately how the inference happens, be it real-time inference or batch inference. So that's the MLOps part. What about LLMOps? So far, MLOps has mostly been used for specific machine learning models which you have created to solve a single task. With large language models, you have the capability of using a single foundation model to solve different types of tasks. For example, a foundation model like the ones you have access to via Bedrock can be used for text summarization, as a chatbot, or for question answering. There are different business scenarios where you can use these models. Hence LLMOps as a term is about using that single foundation model for different aspects of your business. All the operational excellence best practices we spoke about in the previous slide remain quite consistent; just the nature of the specific problems you are solving differs depending on whether you have machine learning solutions or LLM solutions. At its core, every solution that you are creating is going to be about people, process, and technology. Now, MLOps is basically the productionization of your ML solutions. So let's say I deploy a solution into production: the model goes through its own training, someone provides an approval, and now it serves inference, be it batch inference or real-time inference. There is a lot of overlap with foundation model operations, such as generative AI solutions using text, image, video, and audio. And finally, when you talk about LLMOps, these are large language models which you are, again, productionizing. Some of the attributes change in terms of the metrics that you're looking at, but the process more or less remains the same, and then there are some additional customizations that you would incorporate into LLMs with, say, RAG or fine-tuning, etcetera. We'll talk about that in a few slides. At its core, it's still going to be people, process, and technology, and that overlap is going to be consistent irrespective of what kind of operational excellence you're going for. Even if you are using the best technology available in the market, you still need to train the people who can effectively use that technology to derive business value. You need a close correlation between the team which is training your model, or maybe fine-tuning that model, and ultimately the consumers who are going to be using that model. That is the people, process, and technology aspect. And obviously, Conway's law doesn't change much when it comes to deploying software, whether you're deploying it using an LLM or in the traditional sense with microservices or even a monolith application. Now, let's talk about foundation models. And the first thing that we want to talk about is the model lifecycle.
So in a typical machine learning use case, you have a model lifecycle where you have a lot of data, and using that data you go into a processing stage where you process all of your information. That data has been labeled, maybe for supervised or unsupervised learning, whatever your choice of algorithm. Once the training has been done, you do hyperparameter tuning to ultimately create a model. That model goes through model validation and testing, and once the model is ready, you use it for that specific task. That's the important part here: you are going to use it for this specific task, because the model has been tuned and trained for that particular task. Tomorrow you have a new set of data, you do an iteration, you retrain the model, and ultimately redeploy the model once the requisite approvals are available. So this is just one project. When it comes to foundation models or large language models, your dataset is no longer just one dataset; the model is trained on every possible dataset. For example, the Llama models from Meta have been pre-trained on very large datasets, with 70 billion parameters, etcetera. Once that model has been made available, from that foundation model you can either do fine-tuning if you're interested in that — so that's project B that you're looking at — or you can use the same foundation model directly for some task-specific deployments. And once you do fine-tuning or RAG or something else, you can use that same model for a different use case. So that's the key difference here: you're using a single model for different projects and different scenarios with some alterations, whereas in the previous case you had one model for each of them. Now, with LLMOps, there can be different types of users that you're interacting with, and I want to talk about the generative AI user types and the skills which are needed. Let's first talk about the providers. You have a provider, let's say someone who is building an LLM from scratch. In this case we take a Llama model which has been built from scratch, and that model can be used for different use cases: NLP, data science, model deployment, inference, etcetera. So that's a provider. Once you have got a model from there, internally your team can decide to do fine-tuning on those models. So those are the tuners, the people who fine-tune the model to fit custom requirements. Maybe you have business-specific data which you want the model to be a little bit more aware of, so you're going to fine-tune that model using that particular business data, the domain-specific knowledge that you have. And then the third group is the consumers. They don't care about what the model has been trained on or how it has been fine-tuned; they are simply going to use that model. Consider someone who is using your chatbot and has asked a question. They would like to get a response, and they want to ensure that the response doesn't contain any kind of bias, toxicity, or unwanted content. They don't really have much ML expertise, but they are basically using prompt engineering to get a response from the model. Be mindful: these roles are transferable.
So you can always have a provider who also becomes a tuner, and you can always have a consumer who also becomes a tuner. Essentially, this is the entire spectrum that you have: on one end, more on the MLOps side, the model is getting created, and on the other end of the spectrum people are directly incorporating this model into their day-to-day workflows. When it comes to LLM selection, there are different aspects that you would want to consider. The three key ones that we have seen from the field are speed, precision, and cost. Now, let's say you have three different LLMs and each one of them is good at one particular thing. So let's say we have LLM one, two, and three. LLM one is the best when it comes to precision, two is the best when it comes to cost, and three is the best when it comes to speed. Depending on the business scenario and the priority for a particular customer, they can choose one of those LLMs. Some customers are ready to sacrifice a bit of precision in order to pick a low-cost LLM, because of the number of tokens they will be sending across and the volume of usage they will have; you always want a cost-effective solution for any software that you are deploying. Second is the response time. There are different ways in which you can improve response times — maybe you're using text embeddings with a vector database so that you can cache results, or something else — but essentially these are the three key factors I have seen with different customers when they are evaluating LLMs. And obviously this is a summarization of what I just spoke about: how LLM one, two, and three compare, and then it's up to the customer how they want to pick a particular LLM and what they want to use it for. Let's talk about customization. When it comes to customization of LLMs, there are four different ways in which I have seen customers using them. One of the most common is prompt engineering. That's when you are sending a request to the LLM. For example, you're using an Anthropic Claude model on Amazon Bedrock, you go into one of the playgrounds and just send a request: give me details of when the last major incident happened in software engineering around best practices, or something like that. That's prompt engineering: just asking a question and expecting a response from the LLM. A more nuanced one is retrieval augmented generation, which is RAG, where you use RAG as a better solution with a better cost benefit to customize your LLM's responses. After RAG comes fine-tuning, which is more time-consuming and more complex; there is a lot of data and other things that would be needed. Compared to RAG, fine-tuning is a special case: I would say if you really want that level of control over the responses, then maybe you can think of fine-tuning. And the last would be continued pre-training, where you are essentially taking the model and customizing it far more. And obviously the complexity increases as you go from prompt engineering to RAG to fine-tuning to, ultimately, continued pre-training.
One of the most common things seen in the rush to LLMs is that everyone tries to start with fine-tuning, thinking that the LLM can be made aware of specific knowledge and facts about the organization's code base or domain knowledge, etcetera. What has been observed is that in the majority of cases, RAG is good enough. It offers a better solution and is more cost-effective in terms of the cost-benefit ratio between RAG and fine-tuning. Fine-tuning requires considerably more computational resources and expertise, it introduces even more challenges around sensitive and proprietary data than RAG, and there's obviously the risk of underfitting or overfitting if you don't have enough data available for fine-tuning. So do have very clear benchmarking to see how your model performs with prompt engineering versus RAG, and then think about whether fine-tuning is the right solution to go for. Without much evaluation, you may be jumping into a technology solution which may be much more difficult to manage in the long term. Now, customizing: here we talk about customizing the business responses. What's really going to help drive your business in generative AI is what's important for your customers, what's important for the products you're creating, and how you go about that. You can leverage different mechanisms here, and this is where fine-tuning and continued pre-training come into the picture. In terms of purpose, fine-tuning is about maximizing the accuracy of specific tasks, using a comparatively smaller amount of labeled data. Continued pre-training is where you want to maintain the model over a longer duration on your specific domain; that is hyper-customization, using a large amount of unlabeled data. Now, as I mentioned before, Amazon Bedrock can help remove the heavy lifting for this kind of model customization process, but be very clear on your use case as to when you would be using RAG versus fine-tuning or prompt engineering, and why you would want a more complex customization than the one you're getting. Without that clarity at a business level, it will be quite difficult to adopt the LLM and make sure that it is viable in the long term. Now let's talk about Amazon Bedrock. Amazon Bedrock is basically a way of simplifying access to foundation models and providing an integration layer for you via a single API, which is the InvokeModel API. You get access to different models which are available within Amazon Bedrock. Some of the models you have here are the Stability AI models, Amazon Titan, Claude, the Llama models, etcetera. Customers have often told us that one of the most important features of Bedrock is how easy it makes it to experiment with, select, and combine a range of foundation models. It's still very early days and we are all just getting started, and customers are moving extremely fast. The key aspect is that customers want to experiment, they want to deploy, and they want to iterate on whatever they have done. Today Bedrock provides access to a wide range of foundation models from different organizations, as well as the Amazon Titan models. And once you have access to the Bedrock API itself, invoking one of these models is extremely straightforward. I'll talk about it in a bit.
Now let's talk about the architectural patterns you have when using Amazon Bedrock. With Amazon Bedrock you have Knowledge Bases for Amazon Bedrock which you can use. To equip a foundation model with up-to-date proprietary information, organizations often turn to retrieval augmented generation; we spoke about it a little earlier on the customization slide. It's basically a technique where you fetch data from the company's data sources, enrich the prompt with that particular data, and deliver more relevant and accurate responses. Knowledge Bases within Amazon Bedrock gives you a fully managed RAG capability and allows you to customize the foundation model responses with contextual and relevant company data. Essentially it helps you securely connect your data to your foundation models, it's fully managed RAG, and it has built-in session context management for multi-turn conversations. You also get automated citations with retrievals to improve transparency. So how does it work? Let's say you have a user query: someone has asked how they can get the latest details about their statement, or something similar. That query goes into Amazon Bedrock, which has the knowledge bases associated with it. It's an iterative process: Bedrock looks at the knowledge bases, augments the prompt you have received based on what it retrieves, and ultimately uses one of the foundation models, be it Claude, Llama, Titan, or the Jurassic models, to provide a response to your customer. All the information you retrieve as part of this process comes with source citations pointing back to the knowledge base, which improves transparency. You also have Amazon Q, which takes a similar approach when integrating with Amazon Connect; it is not something we are covering in this particular session, but it has the same idea of using your knowledge bases to give you customized responses. Another architectural pattern is fine-tuning, which we spoke about earlier. Let's say you want to fine-tune for a very specific task. You simply point to examples of that particular data, which sit labeled in Amazon S3, and Amazon Bedrock makes a copy of the base model, trains it, and creates a private fine-tuned model so you can get tailored responses. So how does that work? Essentially you're making use of one of the foundation models, be it a Llama 2 model or a Titan model. For these specific tasks you keep all of your labeled datasets in Amazon S3, and then you use that data to make your model better and get tailored responses. Today fine-tuning is available with the Llama models, Cohere Command, Titan Text Express, Titan Multimodal, and Titan Image Generator. Fine-tuning will very soon be coming to the Anthropic Claude models, but today it is not available. So Bedrock creates a copy of the base model, you have the labeled dataset in Amazon S3, and from there you are able to fine-tune the model and get the tailored responses. Now let's talk about how you invoke these models.
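To make the Knowledge Bases flow described above concrete, here is a minimal sketch of querying a knowledge base with the boto3 bedrock-agent-runtime client and its retrieve_and_generate call, assuming that operation is available in your SDK version; the region, knowledge base ID, and model ARN are placeholders you would replace with your own values.

    import boto3

    # Runtime client for Knowledge Bases for Amazon Bedrock (managed RAG).
    client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

    # Hypothetical knowledge base ID and model ARN - replace with your own.
    response = client.retrieve_and_generate(
        input={"text": "How can I get the latest details about my statement?"},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": "KB1234567890",
                "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-v2",
            },
        },
    )

    # Generated answer plus citations back to the knowledge base sources.
    print(response["output"]["text"])
    for citation in response.get("citations", []):
        print(citation)

The response also carries the citations mentioned above, so the application can show where each piece of the answer came from.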
One of the most common patterns for invoking these models is using Amazon API Gateway. It's a very well tested serverless pattern which existed even before Amazon Bedrock; instead of Bedrock you would have had, say, ECS or EKS or just something running on compute somewhere, and you would use AWS Lambda to do the invocation. With Bedrock you can use the same pattern, and it leverages the event-driven architecture that you have been using, or may be using, with Amazon API Gateway. And it doesn't always have to be Amazon API Gateway; you can use any integration layer which can support AWS Lambda to invoke the Bedrock APIs. Instead of AWS Lambda, you can also get the same behavior from a long-running compute, such as EC2, ECS, or EKS, and invoke the Bedrock API in the exact same way. For this particular example, let's consider that you have two models which you have created for your request and response payloads: the request says you have a prompt going in, and the response says you have a response coming back along with a status code. When you want to invoke the Amazon Bedrock endpoint, you write a very simple Lambda function which uses the boto3 API. So let's walk through it. You are basically creating a Bedrock client using the bedrock-runtime service with boto3, you are creating the body, which has the prompt, the maximum tokens that you want back in the response, the temperature, etcetera, and then you're selecting the model ID. Here I have selected the Anthropic Claude model, but you can also select any of the other models, like a Titan model or a Llama model. Once you select the model ID and the payload structure that you are sending — and be mindful that this particular payload structure can change depending on the model that you are invoking — you just call invoke_model, get a response, and return that response using the same model structure that you defined earlier. This request and response payload structure will differ based on the model that you are using, and the model ID will also change based on the model you intend to use. So that's one way of invoking it, if you're using Lambda and API Gateway — and even if you're not using API Gateway, anything else which can integrate with Lambda will work. Now let's say you're not using Lambda at all and you just want to invoke it from a generic application. You can essentially use boto3 with temporary credentials in order to gain access and ultimately invoke the Bedrock API. And if for any reason the AWS SDK is not available to you, you can also leverage AWS SigV4 to construct a valid request and invoke the Bedrock API. The second example is quite similar to the one shown earlier; the only difference is that we don't have the Lambda handler with the event and the context. Here we are directly using boto3 and getting a response. So you can embed it in any application which has access to temporary credentials, and you should be able to access the Bedrock API.
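The slide code itself is not reproduced in the transcript, so here is a minimal sketch of the Lambda handler described above, assuming the boto3 bedrock-runtime client and the Claude text-completion body format (prompt, max_tokens_to_sample, temperature); the region, the event field names, and the status handling are illustrative assumptions, and other models expect a different body structure.

    import json
    import boto3

    # Bedrock runtime client; the region is an assumption for this sketch.
    bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

    def lambda_handler(event, context):
        # Request model: the caller sends a prompt in the event payload.
        prompt = event["prompt"]

        # Claude text-completion style body; other models expect different fields.
        body = json.dumps({
            "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
            "max_tokens_to_sample": 300,
            "temperature": 0.5,
        })

        response = bedrock_runtime.invoke_model(
            modelId="anthropic.claude-v2",
            contentType="application/json",
            accept="application/json",
            body=body,
        )

        # Response model: the completion text plus a status code.
        completion = json.loads(response["body"].read())["completion"]
        return {"statusCode": 200, "response": completion}

Outside Lambda, the same invoke_model call works from any application that can obtain temporary AWS credentials, and if no SDK is available you can sign the HTTPS request yourself with SigV4, as described above.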
Talking about operational excellence, one of the things I spoke about earlier is having good insight into your application. We covered how you invoke the application and how an API-driven approach gives you versioning, visibility of what is invoking what, and temporary-credential best practices, etcetera. Now let's talk about the observability you get with Amazon Bedrock, starting with invocation logging. Customers want to know what the invocation was, what prompt was sent, and what kind of response came back. You can enable this at the Bedrock level, and all of these logs can go into Amazon S3 or CloudWatch, or both. Here is a sample of a log structure where you have the input body sent by the requester, either via Lambda or any other way by which the API has been invoked. You can see that the input is someone asking "explain the three-body problem", and the response comes back with the number of tokens that were generated. You will notice that because we had set a maximum of 300 tokens, the response token count is 296. For the purpose of the presentation I've truncated what is in the completion response, but here you will have the response coming from the model; in this case a Claude model was used. This logging is available for you directly within CloudWatch, and from CloudWatch onwards you can send it to, let's say, S3 or use it for any kind of future processing. Talking about metrics, you have these metrics available out of the box in CloudWatch with Amazon Bedrock: the number of invocations, the latency, any client-side and server-side errors, any throttling, and obviously the input and output token counts — you saw a sample of those in the previous log structure. Now, talking about model evaluation, Bedrock currently has, in preview I believe, a way for you to evaluate models. Models can be evaluated for robustness, toxicity, and accuracy. On the AWS console you can evaluate the model using recommended metrics; there is an automated evaluation, but you can also choose what kind of task you're evaluating it for. For example, this particular screenshot is from the AWS console, which lets you evaluate a question-and-answer scenario for Amazon Bedrock. We are using the Anthropic Claude model, and these were the responses we received for the accuracy and toxicity it was evaluated against. You can also bring your own prompt dataset or use built-in, curated prompt datasets for this purpose. So these are some of the observability and insight related details you can use when thinking about Bedrock as your single API for different foundation models. And finally, we want to talk about guardrails, because as we talk about generative AI, there are different challenges around undesirable or irrelevant topic responses, controversial queries or responses, toxicity of responses, privacy protection, bias and stereotyping propagation, and so on. As we talk about these new challenges, you also want to talk about what kind of guardrails you will apply to your models. One open source solution that you have is NVIDIA NeMo Guardrails.
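Invocation logging is an account-level setting on the Bedrock control plane rather than on the runtime client; the following is a minimal sketch, assuming the boto3 put_model_invocation_logging_configuration operation, with a hypothetical log group, IAM role, and bucket that you would replace with your own resources.

    import boto3

    # Control-plane client ("bedrock", not "bedrock-runtime") manages settings.
    bedrock = boto3.client("bedrock", region_name="us-east-1")

    # Hypothetical log group, role, and bucket - replace with your own resources.
    bedrock.put_model_invocation_logging_configuration(
        loggingConfig={
            "cloudWatchConfig": {
                "logGroupName": "/bedrock/model-invocations",
                "roleArn": "arn:aws:iam::123456789012:role/BedrockLoggingRole",
            },
            "s3Config": {
                "bucketName": "my-bedrock-invocation-logs",
                "keyPrefix": "invocations/",
            },
            "textDataDeliveryEnabled": True,
        }
    )

With this in place, each invocation record (prompt, response, token counts) lands in CloudWatch and/or S3 as shown in the sample log above.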
NeMo Guardrails is basically for building trustworthy, safe, and secure LLM applications. You can define guardrails, or rails, to guide and safeguard the conversation, and you can choose to define the behavior of your LLM-based application on specific topics and prevent it from engaging in discussions about unwanted topics. You can also connect different models using LangChain and other services. So it's kind of like a shim layer which sits between your application and the LLM it is going to invoke. Here you can define all your programmable guardrails and then steer your LLMs to follow a predefined conversation path and enforce standard operating procedures. These kinds of standard operating procedures are part of the core context when it comes to building an operational excellence practice, especially when you are building out LLM applications. These are some of the same points I have mentioned, and you can have a look at GitHub — I'll give you a link towards the end of the session as well — where samples are available within the Amazon Bedrock workshop, and you can see how the guardrails have been incorporated. It is basically a config YAML file. You define input rails, which are applied to the inputs from the user and can reject the input or stop any additional processing. Then you have the dialog rails, which influence how the LLM is prompted and operate on canonical form messages. You have the retrieval rails, which are applied to the retrieved chunks in, say, a RAG scenario; a retrieval rail can reject a chunk or prevent it from being used to prompt the LLM. You have the execution rails, and finally you have the output rails. So these are five different levers which you can control, and you write your configuration in a config YAML. If you go into the GitHub repository for NeMo Guardrails, you will find more details there, but this is just an introduction to the kind of guardrails you can add to your LLM invocation, so that you are ensuring safety and following responsible AI best practices when using LLMs. And finally, this is how it would look when you are using it with Amazon Bedrock: you have a central layer with all of your guardrails, the invoker on one side and the Bedrock model on the other, and the NeMo guardrails apply at that central layer. That is the shim layer sitting between your LLM, which is exposed via Bedrock, and the invoker making the request. And finally, here is the GitHub handle where you can find the details of the Amazon Bedrock workshop, along with a screenshot of the UI. That closes all the topics I wanted to cover for this session: what Bedrock is and what it offers, how you invoke Bedrock, what kind of observability you get out of the box, and finally the guardrails which you can apply for Bedrock. Hope this helps. Thank you so much for your time. It's been a pleasure.
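As a rough illustration of the shim-layer idea from the guardrails discussion above, here is a minimal sketch using the open source nemoguardrails Python package, assuming a local ./config directory that holds the config YAML and the rail definitions; the directory layout and the sample question are placeholders, and the package API may differ between versions.

    from nemoguardrails import LLMRails, RailsConfig

    # Load the rails configuration (config.yml plus Colang rail files) from a
    # local directory - a hypothetical layout for this sketch.
    config = RailsConfig.from_path("./config")
    rails = LLMRails(config)

    # The rails sit between the caller and the underlying LLM: input rails
    # screen the user message, dialog and output rails shape the final answer.
    response = rails.generate(messages=[
        {"role": "user", "content": "Explain the three-body problem."}
    ])
    print(response["content"])

The configured model behind the rails can be the Bedrock-exposed LLM, which keeps the guardrail layer in front of every invocation as described in the architecture above.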
...

Suraj Muraleedharan

Principal Architect - Platform Engineering, Global Financial Services @ AWS



