Conf42 Chaos Engineering 2025 - Online

- premiere 5PM GMT

AI Model Monitoring: Building Reliable Alert Systems

Abstract

By simulating failures and stress-testing alert systems, let’s find out how Chaos Engineering can enhance the reliability of AI model monitoring.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey everyone, I'm Muddassar Sharif, an AI engineer who has been building and deploying AI systems in production for a while now. Right now it feels like everyone is moving towards building some kind of AI system in production, and that is where I've seen different stakeholders ask the same question: how do we make sure our AI models work the way we want in production? How do we make sure we get the maximum ROI? Because that only happens if your models and systems work as you want, not the other way around, and that is why I believe this is a really important talk. It's relevant to different stakeholders: executives who want to oversee all the AI work happening in their company, and at the same time engineering managers overseeing different LLM projects who need a way to think about measuring models and systems in production. In this short talk I will cover the important concepts the industry has been using to monitor systems in production, and from these concepts we can think about how to measure LLMs in production as well. After we cover the different tools and concepts, we'll sum things up towards the end.

So, what is model monitoring at a really high level? There are two main goals. When you start building a use case, you define how you want the entire AI system to behave, and you define the key KPIs you want to track. Then the goal is simply to see: is the model able to meet or exceed those KPIs in production, or not? One example is a bank. Every time you swipe your credit card, there is a model the bank uses under the hood to classify whether the card was used by someone who is not you. The bank is monitoring how well the system tags fraudulent transactions versus non-fraudulent ones, so their KPI is the percentage of fraudulent transactions they were unable to tag in production. For LLMs, say you have an LLM system you are using to automate question answering, for example for customer support, or a vendor is providing you with an AI tool that automates customer support tickets. How do we make sure the model, the LLM, is giving accurate information? That is a really important KPI you want to track. So, at a really high level: number one, you define what you want to track, and then based on that you define all the other sub-KPIs you want to track.

Before I move forward, it's really important to understand this about model tracking: if you look at the last ten years of what's been happening in the industry, there are two main things everyone tracks, and both are worth tracking. Number one is what goes into the model, which is the data, and number two is what comes out of the model, which is a prediction, or a token if you think about the LLM domain.
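To make the KPI idea concrete, here is a minimal sketch of a monitoring check for the fraud example. This is my illustration, not something from the talk: the function name `fraud_miss_rate`, the 1% target, and the sample data are all hypothetical.

```python
# Minimal sketch: compare a production KPI against its design-time target.
# Names, the 1% target, and the sample data are hypothetical illustrations.

def fraud_miss_rate(labels: list[int], predictions: list[int]) -> float:
    """Fraction of truly fraudulent transactions (label 1) the model missed."""
    fraud = [(y, p) for y, p in zip(labels, predictions) if y == 1]
    if not fraud:
        return 0.0
    missed = sum(1 for _, p in fraud if p == 0)
    return missed / len(fraud)

KPI_TARGET = 0.01  # e.g. "miss at most 1% of fraudulent transactions"

def check_kpi(labels: list[int], predictions: list[int]) -> float:
    rate = fraud_miss_rate(labels, predictions)
    if rate > KPI_TARGET:
        # A real system would page someone or trigger a remediation pipeline.
        print(f"ALERT: fraud miss rate {rate:.2%} exceeds target {KPI_TARGET:.2%}")
    return rate

labels = [1, 1, 1, 1, 0, 0]
predictions = [1, 0, 1, 0, 0, 0]
check_kpi(labels, predictions)  # 2 of 4 frauds missed -> ALERT
```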
Before LLMs, when we thought about monitoring in machine learning use cases, we tracked all the inputs, which is called the data, and we tracked the output, which is called concept or prediction drift. The same thing is happening right now in the LLM space as well. So what do we do? Once we define the KPIs (and we'll talk in more detail about how to define them), what's the end goal? The end goal is that some system has to be there to track the model and let us know when something abnormal happens, so we can go back and fix it. Ideally, you want a system in place that can, number one, track, and at the same time trigger a pipeline that goes and fixes the model, retrains it, or, if you are in the LLM domain, simply improves the prompt. So the goal is to track, and in the end improve systems in production and make sure the output is what we want it to be. And there are different tools out there to do that as well. If you think about the LLM domain right now, there is Athena, which is doing a really good job there, and at the same time there is LangSmith, which also does a really good job of providing the entire monitoring picture of how things are going in production. Let me quickly show you Athena really quick, folks. All right.

Moving back towards what we want to track: once again, there are two things we always want to track. First, the data that goes into the model, and how that data is changing over time. Suppose you have an LLM system in place for customer support, and you have built that system to handle tickets about refunds and changes to the delivery address. The system is deployed in the e-commerce domain, and you only track those two kinds of input: refund requests and address-change requests. Now say that in production, the tickets coming in are not about refunds or changes of address; they are complaints about the performance or the quality of a product. That means there's a drift: the input the system was built to handle has changed. You built a system to handle refunds and address changes, and now you have a different kind of input coming in. We call this drift, and in this case it is drift at the data level. Since you never built the system to handle tickets about faulty products, your system is not going to perform as expected. That is a drift at the data level. The other example I can think of right now is a bank. Say a bank has a model they have trained on customer data to tag transactions as fraudulent or not, and that bank is expanding into a new country. Of course, when they expand into a new country, open up shop there, and start offering credit cards to clients in that country, the data coming into the AI model will be different: different in terms of location and other attributes, for sure.
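As a concrete illustration of tracking input drift on a numeric feature (my own sketch, not a tool shown in the talk), a two-sample Kolmogorov-Smirnov test can flag when a feature's production distribution has moved away from the training distribution. The 0.05 significance level and the simulated transaction amounts are illustrative assumptions.

```python
# Minimal sketch: flag data drift on a numeric input feature by comparing
# its production distribution against a reference (training) sample.
import numpy as np
from scipy.stats import ks_2samp

def numeric_feature_drifted(reference: np.ndarray,
                            production: np.ndarray,
                            alpha: float = 0.05) -> bool:
    """Two-sample KS test: True if the two distributions differ significantly."""
    statistic, p_value = ks_2samp(reference, production)
    return p_value < alpha

# Example: transaction amounts shift upward after expanding into a new market.
rng = np.random.default_rng(0)
train_amounts = rng.lognormal(mean=3.0, sigma=0.5, size=5_000)
prod_amounts = rng.lognormal(mean=3.4, sigma=0.6, size=5_000)
print(numeric_feature_drifted(train_amounts, prod_amounts))  # True
```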
So that is what we call drift: the inputs we are sending to the model have changed over time, which means the system you had built before is no longer going to work. You have to go back and fix it: either find a way to handle these new requests, or retrain the entire AI system to make sure it handles all the scenarios. Whenever there's a drift at the data level, how do we detect it, normally? We track distributions, and I have a few really good examples of how to measure drift easily. Say the distribution at the location level was only the US, but now the bank has opened up shop in Europe; you will see a different distribution coming up, simple as that. Your input used to be mostly red and sometimes blue, but over time it's mainly blue and less red. Or, before, there were mainly online transactions and now there are more offline transactions, maybe because COVID is over, for example. That is an example of a drift: the data coming in has changed over time, because the world you operate in has also changed over time. And if I think about the LLM domain right now, we talked about how the tickets coming in are different tickets now; something has changed upstream of the system.

Moving on, the other really important concept we have to track is the drift in the output, and that is mainly called model drift. Why does it happen? Output drift happens for two reasons. Number one: over time the input is different, so it's quite possible the output is going to be different too; that's a difference in the output driven by the input. Number two is concept drift, which happens because the model was built to learn patterns in the data, and that data represents the world right now, in 2025. It's quite possible that in two or three years, how people buy stuff will change; over time, the relationships between the different variables in play will be different, and that is where the model is not going to give you the right output or prediction. To think about a few examples of concept drift, let me open up this link really quick. Yep, there's a really important example here as well: sales were mainly online sales, and now there are offline sales. Another example: say you have a new store and you see that your sales are seasonal. But as the store and the business grow over time, you have a strong brand, you are collaborating with different influencers, and at that point your sales are not entirely driven by seasonality; they are also driven by different events happening or different campaigns you run. So your sales, your prediction target, no longer depends on seasonality alone; it's a different business, with different concepts in play. That's another example of concept drift.
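To make "we track distributions" concrete for a categorical input like location, here is a small sketch (my own; the category names and sample counts are invented) of the Population Stability Index, a common score for how far a production distribution has shifted from its reference:

```python
# Minimal sketch: Population Stability Index (PSI) for a categorical feature.
# Category names, counts, and the alert threshold are illustrative assumptions.
import math
from collections import Counter

def psi(reference: list[str], production: list[str], eps: float = 1e-6) -> float:
    """PSI = sum((p_prod - p_ref) * ln(p_prod / p_ref)) over all categories."""
    categories = set(reference) | set(production)
    ref_counts, prod_counts = Counter(reference), Counter(production)
    total = 0.0
    for c in categories:
        p_ref = max(ref_counts[c] / len(reference), eps)
        p_prod = max(prod_counts[c] / len(production), eps)
        total += (p_prod - p_ref) * math.log(p_prod / p_ref)
    return total

# Training data was US-only; production now includes European locations.
reference = ["US"] * 1000
production = ["US"] * 700 + ["DE"] * 200 + ["FR"] * 100
print(f"PSI = {psi(reference, production):.3f}")
# A frequently cited rule of thumb treats PSI > 0.25 as a significant shift.
```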
Once again, in short: the model was trained to learn a phenomenon, and that phenomenon has changed, because the world has changed. That is where you can expect to see a model going wrong, or not giving you the right output. So, if you are tracking drift on two levels, input and output, or data and model, then the alerts are also going to be on these two levels. Sometimes it's fine if there's a drift in the input, as long as there is no drift in the output, and that depends on the use case. It's quite possible, for example, that an LLM was never given instructions on how to handle requests about faulty products, but the customer-support LLM is smart enough to handle these new requests anyway, because these models are really good at understanding language. In that case you might say the input difference is not really important; it's the output difference that matters. If you see the model giving outputs that are not backed by citations, outputs that contradict what you have in your data, or hallucinations (there are different metrics we track on the output), that is the most important thing to track. In other scenarios, say finance or healthcare, teams might want to track both the inputs and the outputs, meaning they track the drift in every input variable and the drift in all the outputs as well.

How much drift you can accept also depends on the use case. If there's a model I've trained to predict sales and I see a drift of around 10 percent, it's quite possible my demand-forecasting or sales-forecasting model already has that threshold built in: I've already stocked my inventory 10 percent above the prediction, so that small amount of drift won't be super important. But on the other hand, for a bank or for a healthcare use case, even a one percent drift would be super important. So that threshold is something that has to be defined by the company. Now, if you think about LLM use cases, because that is where the majority of the work is happening right now: on the input side, the most important thing is the number of input tokens, and on the input level I've also seen companies track sentiment as well.
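As a sketch of how those use-case-specific thresholds might be wired up (the use cases and numbers below are invented for illustration; this is not a prescription), the same drift score can feed different alert rules per deployment:

```python
# Minimal sketch: per-use-case drift thresholds feeding one alert check.
# The use cases and numbers are hypothetical illustrations of the idea that
# acceptable drift is a business decision, not a universal constant.
from dataclasses import dataclass

@dataclass
class DriftPolicy:
    name: str
    max_output_drift: float  # tolerated relative shift in the output KPI

POLICIES = {
    "retail_demand_forecast": DriftPolicy("retail_demand_forecast", 0.10),
    "bank_fraud_detection": DriftPolicy("bank_fraud_detection", 0.01),
}

def should_alert(use_case: str, observed_drift: float) -> bool:
    """True if the observed drift exceeds what this use case can absorb."""
    return observed_drift > POLICIES[use_case].max_output_drift

print(should_alert("retail_demand_forecast", 0.08))  # False: within the buffer
print(should_alert("bank_fraud_detection", 0.08))    # True: page someone
```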
For example, suppose the sentiment of the tickets coming into your customer-support LLM agent has changed a lot over time. It's quite possible something wrong is happening, or something has changed in the overall system, or in the world, and then it's worth going back to think about whether you have to change your prompt or something else. On the output level, there is hallucination, which teams track as well, and there is the number of output tokens. Say that before, the model was always giving you around 500 tokens on average in every output, but now all of a sudden it's giving you 1,000 tokens. That's something you want to go and understand: why is the number of output tokens different? It's quite possible there's a drift at the concept level; the phenomenon you want your model to handle is very different now. Maybe the store has grown, and the tickets coming in are not only about refunds; there are more tickets about other queries coming in.

The two really important tools I've seen in action right now: I use Athena at my companies, because they make it really easy to build assessments, but at the same time I've also seen LangSmith being used a lot. Both of these tools have one really important thing in common: they track every inference, every prediction. So you can see what's happening, you can track the change in tokens, and you can also build your own evaluations or KPIs to track, for example hallucinations, and track various aspects on both the input and the output level. If you have a traditional machine learning system in place instead, I've seen Evidently AI being used a lot. It's an open-source tool used by a lot of people in the industry; it comes with different KPIs built for the input and output level, and you have the ability to choose which KPIs or metrics you want to monitor. I think we covered different examples before.

So, towards the end, at a really high level: we never went into detail on the math behind the different concepts, for example how exactly to measure drift. If drift is at the distribution level, what are the formulas and concepts for measuring the difference between distributions? That's a different topic. In this short presentation, my goal was simply to explain that monitoring has been happening on two different levels: there's the input that goes into the model, and there is the output. The change in the input is called data drift, and the change in the output is called model drift, and within these two buckets there are different KPIs you can think about. If you have a traditional machine learning system in place, then it's simply different distributions.
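To illustrate the output-token signal in code (a sketch of the idea only, not how Athena or LangSmith actually implement it; the window size and the 1.5x ratio are made-up assumptions), you could log every inference and alert when the recent average output length departs from the baseline:

```python
# Minimal sketch: log every inference and alert when the rolling average
# output-token count departs from the historical baseline. The window size
# and the 1.5x ratio are illustrative assumptions, not recommendations.
from collections import deque

class OutputTokenMonitor:
    def __init__(self, baseline_tokens: float, window: int = 100, ratio: float = 1.5):
        self.baseline = baseline_tokens      # e.g. ~500 tokens per answer
        self.recent = deque(maxlen=window)   # rolling window of recent outputs
        self.ratio = ratio

    def record(self, output_tokens: int) -> bool:
        """Record one inference; return True if the rolling mean looks abnormal."""
        self.recent.append(output_tokens)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data yet
        mean = sum(self.recent) / len(self.recent)
        return mean > self.baseline * self.ratio

monitor = OutputTokenMonitor(baseline_tokens=500)
for tokens in [980, 1020, 1010] * 40:        # outputs suddenly ~1000 tokens
    alert = monitor.record(tokens)
print(alert)  # True: average output length has roughly doubled
```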
For example, it's quite possible you have one very important input variable, location. Before, location was only different cities in the US, but now you have different cities in the US plus Europe plus Asia, which means the input has changed, and if the input going into the model is different, the output will be different. Number two, the output: before, the model was always giving a prediction above 50, but now you see the model giving predictions in the range of 30 to 60. There is a difference in the prediction, in the output of the model, and that means there's a drift in the output. These are the two buckets you always have to track.

And how much drift can you afford? It all depends on the use case. In healthcare or in finance, where you have regulations in place, how much drift you can afford depends on the regulations, and it depends on the loss the company can incur because of that drift. In some use cases, you always have an error margin built into the prediction. Retail, for example: if they are doing demand forecasting, they always overstock, because it's better to have extra stock than too little; the company loses more if a customer walks into the store and walks out without buying. So they always overstock, and for them a bigger drift is affordable: if there's a big surge or a big drop in demand, they're always prepared. But on the other hand, if their prediction is that they will sell 1 million SKUs, they won't be stocking 5 million; they will stock something like 1.2 million. So they still need an accurate prediction, but in the end they are fine if their models have a fairly big drift in the output, for different reasons.

So this is, at a really high level, how we monitor our AI systems in production and how we make sure our models behave as we want, so that, as a result, we get the maximum ROI. Signing off now.
...

Muddassar Sharif

Co-Founder & CTO @ Virtuans.ai

Muddassar Sharif's LinkedIn account


