Transcript
Hey everyone, I'm Mudassir Sharif, and I am an AI engineer who has been building and deploying AI systems in production for a while now. It feels like everyone is working toward building some kind of AI system in production right now, and that is where I've seen different stakeholders ask a question: how do we make sure that our AI models are working as we want in production? How do we make sure that we get the maximum ROI? Because that only happens if your models or your systems work as you want, and not the other way around, and that is why this is a really important talk.
I believe this is relevant to everyone. For example, the different stakeholders can be executives who want to oversee all the AI work happening in their company, and at the same time engineering managers who are overseeing different LLM projects: how can they think about measuring the models and systems in production?
And that is why, in this short talk, I will be covering all the important concepts that the entire industry has been using to monitor systems in production. From these concepts, we can then think about how we measure LLMs in production as well. And after we cover the different tools and concepts, toward the end we'll sum things up as well.
So, the question is: what is model monitoring, on a really high level? There are two main things, two main goals here. Toward the start, when you are building a use case, you define how you want the entire AI system to behave, and you define some key KPIs that you want to track. And then the goal is simply to see: is the model able to meet or exceed those KPIs in production or not?
One example can be a bank. Every time you go and swipe your credit card, there's a model that the bank uses under the hood to simply classify whether the card was used by someone who is not you. So the bank is monitoring how well the system is able to tag fraudulent transactions versus non-fraudulent ones. That's one example. Their KPI is the percentage of fraudulent transactions they were unable to tag in production. For LLMs, let's say you have an LLM system in place that you are using to automate question answering, say a customer support use case, or you have a vendor who's providing you with an AI tool that automates the entire customer support ticket flow. How do we make sure that, in that system, the model is able to provide accurate information, that the LLM is giving accurate information?
That is a really important KPI that you want to track. So, at a really high level, in summary: you define what you want to track, number one, and then based on that you define all the other sub-KPIs you want to track. Before I move forward, it's really important to understand that in model tracking, if you look at the last 10 years of what's been happening in the industry, there are two main things that everyone is tracking, and that are worth tracking. Number one is what goes inside the model, and that is called data. And then there's what comes out of the model, which is some prediction, or some tokens if you think about the LLM domain. Before LLMs, when we thought about monitoring in machine learning use cases, we tracked all the inputs, and that's called data drift; and we tracked the output, and that's called concept or prediction drift. The same thing is happening right now in the LLM space as well. So what do we do?
Once we define the KPIs, and we'll talk in more detail about how to define different KPIs, what's the end goal here? The end goal is that some system has to be there to track the model and then let us know when there is something abnormal, so we can go back and fix it.
Ideally, you want to have a system in place that can, number one, track, and at the same time trigger some pipeline that can go and fix the entire model, retrain it, or, if you are in the LLM domain, simply improve the prompt. So the goal is to track, and then, in the end, improve systems in production and make sure the output is how we want it to be. And there are different tools out there to do that as well. If you think about the LLM domain right now, there is Athina, which is doing a really good job there, and at the same time there is LangSmith, which is also doing a really good job right now of providing you with the entire monitoring picture of how things are going in production. Let me quickly show you Athina, folks. All right, so, moving back toward what we want to track.
Once again, there are two things we always want to track. You want to track the data that goes inside the model, and how that data is changing over time. Let's suppose you have an LLM system in place for customer support, and you have built that system to handle tickets regarding refunds and changes in delivery address, for example.
That system is deployed in the e-commerce domain, and you only track these two things as the input: refund requests and change-of-address requests. Now let's say that in production, the tickets that are coming in are not about refunds or about changes of address; the tickets coming in are complaints about the performance or the quality of a product. What does that mean? There's a drift. The input the system is receiving is not what it was built to handle: you have a system in place to handle refunds and changes of address, and now you have a different kind of input coming in. In that case, we call this a drift, a drift on the data level. Since you never built the system to handle those tickets regarding faulty products, your system is not going to perform as expected.
That is called a drift on the data level. The other example I can think of right now is a bank. Let's say a bank has a system in place: they have a model they have trained on customer data to tag transactions as fraudulent or not. And let's say that bank is expanding into a new country. Of course, when they expand into a new country, open up their branches there, and start offering credit cards to different clients in that country, the data that will be coming into the AI model will be different, different in the sense of location and other attributes, for sure. So that is what we call drift: the inputs we are sending into the model have changed over time. That means the system you had built before is no longer going to work, and you have to go back and fix it. Either you find a way to handle these new requests, or you rebuild or retrain the entire AI system to make sure it handles all the scenarios, whenever there's a drift on the data level.
Normally, how do we do that? We track distributions. I have a few really good examples to show on how we can easily measure drift; it's normally done with distributions. Let's say the distribution on the location level was only the US, but now that the bank has opened up branches in Europe, you will normally see a different distribution coming up. Simple as that.
So your input, again: before, the input was red and sometimes blue, but over time it's mainly blue and less red. Or, for example, before there were mainly online transactions, and now there are more offline transactions; maybe COVID is gone now, for example. That is an example of a drift: the data that's coming in has changed over time because the world you operate in has also changed over time. And if I think about the LLM domain right now, we talked about the tickets that are coming in being different tickets now. Something has changed upstream of the system.
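As a concrete sketch of what "tracking distributions" can look like in code, here is a minimal example, not from the talk, that compares a training-time feature distribution against a production window with a two-sample Kolmogorov-Smirnov test. The synthetic data and the 0.05 alert threshold are illustrative assumptions.

```python
# Minimal drift check: compare a reference (training-time) distribution
# against a production window with a two-sample KS test.
# The synthetic data and the 0.05 threshold are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(loc=100, scale=15, size=5_000)   # e.g. transaction amounts at training time
production = rng.normal(loc=110, scale=20, size=5_000)  # e.g. this week's transactions

stat, p_value = ks_2samp(reference, production)
if p_value < 0.05:  # the alert threshold is a per-use-case judgment call
    print(f"Data drift detected (KS statistic={stat:.3f}, p={p_value:.4f})")
else:
    print("No significant drift in this feature")
```

For a categorical feature like location, the same idea applies to the category frequencies, for example with a chi-squared test or a population stability index.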
So, moving on: the other really important thing we have to track is the drift in the output, and that is mainly called model drift. Why does that happen? Output drift happens for two reasons. Number one: over time, the input is different, so it's quite possible the output is going to be different as well; that's a difference in the output driven by the input. Number two is concept drift, which happens because the model was built to learn patterns in the data, and that data represents the world right now, in 2025. It's quite possible that in two or three years, how people buy stuff can change. Over time, it's quite possible that the relationships between the different variables in play are different now, and that is where the model is not going to give you the right output or prediction.
To think about a few examples of concept drift, let me open up this link really quick. Yep, I think there's a really important example over here as well. Sales were mainly online sales, and now there are offline sales. And one example can be, say, a new store. With a new store, you see that your sales are mostly seasonal. But as the store and your business grow over time, you have a strong brand, and it's quite possible that you are collaborating with different influencers. At that point, it's quite possible that your sales are not entirely driven by seasonality anymore; they're also driven by different events happening, or different campaigns you run. So your sales prediction is no longer dependent on seasonality alone; there's a different business, different concepts in play. That's the other example of concept drift.
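One cheap way to catch concept drift like this is to watch the model's live error rather than its inputs, since a changed relationship shows up as rising error even when the input distributions look stable. The sketch below illustrates the idea; the baseline MAE, the alert factor, and the synthetic data are my assumptions, not anything from the talk.

```python
# Sketch: concept drift often shows up as rising prediction error even when
# the input distributions look stable, so one cheap monitor is a rolling
# error check against the error measured at validation time.
# The baseline MAE, alert factor, and synthetic data are illustrative.
import numpy as np

VALIDATION_MAE = 8.0   # error the model had when it shipped
ALERT_FACTOR = 1.5     # alert if live error grows 50% beyond that baseline

def rolling_mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(y_true - y_pred)))

rng = np.random.default_rng(0)
actual = rng.normal(100, 10, size=30)            # last 30 days of real sales
predicted = actual + rng.normal(0, 20, size=30)  # pretend live errors have grown

live_mae = rolling_mae(actual, predicted)
if live_mae > ALERT_FACTOR * VALIDATION_MAE:
    print(f"Possible concept drift: live MAE {live_mae:.1f} vs baseline {VALIDATION_MAE}")
```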
Once again, the short version: the model was trained to learn a phenomenon, and that phenomenon has changed because the world has changed, and that is where I expect to see the model going wrong or not giving you the right output. So, simply put: if you are tracking drift on two levels, the input and the output, or the data and the model, then the alerts are also going to be on these two levels.
For example, sometimes we might say: hey, it's okay if there's a drift in the input, as long as there is no drift in the output, we are fine. And that depends on the use case. It's quite possible, for example, that the LLM was not given instructions on how to handle the requests coming in about some faulty product, but that customer support LLM is smart enough to handle these new requests, because these models are really good at understanding language. In that case, you might say: hey, the input difference is not really important; the output difference is what's really important. For example, if you see the model giving output that is not backed by citations, or output that contradicts what we have in the data, or there is some hallucination, those are different metrics we track on the output, and that is the most important thing to track.
And then, in the other scenario, for example in finance or in healthcare, they might want to track both the inputs and the outputs, meaning they want to track the drift in every input variable of the model, and they also want to track the drift in all the outputs as well.
And then, how much drift is acceptable? That also depends on the use case. For example, say there's a model I've trained to predict sales, and I see a drift of around 10 percent. It's quite possible that my demand or sales forecasting setup already has a threshold in place, or that I've already stocked my inventory 10 percent above the prediction, so that small amount of drift won't be super important. But on the other hand, for a bank or for a healthcare use case, even a one percent drift would be super important. So that threshold is something that has to be defined by the company.

Now, if you think about the LLM cases, because that is where the majority of the work is happening right now: on the input side, the number of input tokens is fairly important. On the input level, I've also seen companies track sentiment as well. Suppose the sentiment of the tickets you are receiving in your customer support LLM agent has changed a lot over time. Then it's quite possible that something is going wrong, or that there's some change in the overall system, or in the overall world, and it's worth going back to think about whether we have to change our prompt or change something else. And then on the output level, there is hallucination, which you want to track as well, and there is the number of output tokens. Let's say before, the model was always giving you around 500 tokens on average in every output, but now, all of a sudden, it's giving you 1,000 tokens. That's something you want to go and understand: why is the number of output tokens different? It's quite possible there's a drift on the concept level: the phenomenon you built the model to handle is very different now. Maybe the store has grown, and the tickets that are coming in are not only about refunds; there are more tickets about other queries coming in.
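To make the 500-versus-1,000-token example concrete, here is a minimal sketch of a per-inference monitor; the class name, window size, baseline, and 50 percent threshold are all illustrative assumptions.

```python
# Sketch: log output-token counts per inference and alert when the rolling
# average drifts far from the launch-time baseline (e.g. ~500 tokens).
# The window size, baseline, and 50% threshold are illustrative assumptions.
from collections import deque

class OutputTokenMonitor:
    def __init__(self, baseline_avg: float, window: int = 200, rel_threshold: float = 0.5):
        self.baseline_avg = baseline_avg    # average output tokens at launch
        self.recent = deque(maxlen=window)  # sliding window of recent inferences
        self.rel_threshold = rel_threshold  # alert if the average moves >50%

    def record(self, output_tokens: int) -> bool:
        """Log one inference; return True if the window average has drifted."""
        self.recent.append(output_tokens)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data yet
        avg = sum(self.recent) / len(self.recent)
        return abs(avg - self.baseline_avg) / self.baseline_avg > self.rel_threshold

monitor = OutputTokenMonitor(baseline_avg=500)
# call monitor.record(tokens) after every LLM response; a True return is an alert
```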
And then, the two really important tools that I've seen in action right now: I use Athina in my companies, just because they make it really easy to build assessments. But at the same time, I've also seen LangSmith being used a lot. Both of these tools have one really important thing in common: they track every inference, every prediction. So you see what's happening, and you also have the ability to track the change in tokens, and you can build your own tracking evaluations or KPIs, to track hallucinations for example; you have the ability to track various aspects on both the input and the output level.
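The shared pattern underneath these tools is logging every inference as a structured record and running evaluators over it. The sketch below shows that pattern with hypothetical names; the toy citation check is my own illustration, not the Athina or LangSmith API.

```python
# Sketch of the pattern these tools share: every inference becomes a structured
# record, and custom evaluators attach scores to it. All names are hypothetical;
# this is not the Athina or LangSmith API.
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class InferenceRecord:
    prompt: str
    response: str
    input_tokens: int
    output_tokens: int
    eval_results: Dict[str, float] = field(default_factory=dict)

def run_evals(record: InferenceRecord,
              evaluators: Dict[str, Callable[[InferenceRecord], float]]) -> None:
    for name, fn in evaluators.items():
        record.eval_results[name] = fn(record)

# Toy evaluator: fraction of sentences carrying a "[...]" citation marker,
# a stand-in for a real groundedness/hallucination check.
def citation_coverage(rec: InferenceRecord) -> float:
    sentences = [s for s in rec.response.split(".") if s.strip()]
    cited = sum("[" in s for s in sentences)
    return cited / max(len(sentences), 1)

record = InferenceRecord("Where is my refund?", "It was issued on May 2 [ticket-123].", 12, 11)
run_evals(record, {"citation_coverage": citation_coverage})
```

The real products add persistence, dashboards, and ready-made evaluators on top, but the record-plus-evaluator shape is the part worth internalizing.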
Yeah, I think we covered that as well. And then, on different tools: if you have a more traditional machine learning system in place, I've seen Evidently AI being used a lot. It's an open-source tool used by a lot of people in the industry, and it has different KPIs built in on both the input and the output level, and you have the ability to simply choose which KPIs or metrics you want to monitor as well.
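For reference, a minimal input-drift report with Evidently could look like the sketch below. The imports match the library's 0.4.x Report API; since the interface has changed across versions, treat the exact calls (and the file names) as assumptions to verify against your installed version.

```python
# Sketch: an input-drift report with Evidently. Imports follow the 0.4.x
# Report API; verify against your installed version, as the interface has
# changed over time. File names are hypothetical.
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference = pd.read_csv("training_window.csv")    # data the model was built on
current = pd.read_csv("production_window.csv")    # a recent production window

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("data_drift_report.html")        # per-feature drift summary
```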
I think we covered different examples before. So, toward the end, on a really high level: we never went into detail on the math behind the different concepts, for example, how exactly we measure drift. If drift is on the distribution level, what are the different formulas or concepts in place to measure the difference in distributions? That's a different topic.
But in this short presentation, my goal was simply to explain that monitoring is happening, and has been happening, on two different levels. There's the input that goes into the model, and then there is the output. The change in the input is called data drift, and the change in the output is called model drift. Within these two buckets, there are different KPIs you can think about. If you have a traditional machine learning system in place, then it's simply different distributions. For example, it's quite possible that you have one very important input variable, location. Before, location was only different cities in the US, but now you have location as different cities in the US plus Europe plus Asia. That means the input has changed.
If the input that goes into the model is different, the output will be different. Number two, the output: for example, before, the model was always giving a prediction above 50, but now you see the model giving predictions in the range of 30 to 60. There is a difference in the prediction, the output of the model, and that means there's a drift in the output. So these are the two different buckets that you always have to track. And then, how much drift can you afford? It all depends on the use case. In healthcare or finance, where you have regulations in place, how much drift you can afford depends on the regulations, and it depends on the loss the company can incur because of that drift.
In some use cases, you always have an error built into the prediction. For example, in retail, if they are doing demand forecasting, they always overstock, because it's fine to have extra stock versus having less stock; there's more loss to the company if a customer walks into the store and goes out without buying anything. So they always overstock, and for them, they can afford a big drift; if there's a big surge or a big drop in demand, they're always prepared. But on the other hand, if their prediction is that they will sell 1 million SKUs, they won't be stocking 5 million; they will be stocking something like 1.2 million. So they need an accurate prediction, but in the end, they are fine if their models have a really big drift in the output, for different reasons.

So this is, on a really high level, how we monitor our AI systems in production and how we make sure that our models behave as we want, so we get the maximum ROI as a result. Signing off now.