Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, and welcome.
I'm LIC Pass from Marto ai.
ATO is a platform for managing your gen AI assets.
This includes your prompts, model configurations, datasets, and evals, from the minute you write your initial prompt, through testing and experimenting with it
to make sure it performs as you expect, to the time it's running in production
with detailed observability and metrics.
In this session, we will take a look at the key role evals play in
bringing your Gen AI application to production.
We'll see the process of designing, evaluating, and refining prompts
for an LLM based application.
So let's start.
So what are evals?
Evals are a systematic method we use to measure the performance, the
capabilities, and, basically, the replies that we get from AI models.
They are key in determining how well a prompt or a model output meets the
specific objectives that we have.
The key features of evals are being measurable, meaning we can quantify
aspects like accuracy, relevance, and structure; being repeatable, meaning
consistent assessment across different iterations; and being actionable, meaning
they provide the insights we need to refine the prompt and improve the outcome.
So what has changed?
Software testing has been around for decades.
Why are we talking about a new method of testing?
The first thing is the boundary.
We move from the well-defined, structured output and error
formats of traditional APIs to complex and unpredictable
formats, errors, and mistakes.
We are moving toward a non-deterministic world where similar, and
sometimes identical, inputs might receive different outputs.
So we are moving from very clear pass-or-fail criteria to a world where we
measure different aspects of the replies that we get and compute
some kind of score that tells us whether we are hitting the target or not.
Evals transform prompt engineering from guesswork into a scientific
process by making interactions measurable and repeatable.
They ensure LLM output aligns with our business objectives, enable
data-driven decision making and prompt refinement, and reduce the risk
of unforeseen issues in production.
So let's take a look at an example use case.
This is our demo application.
Imagine that you were given the task of developing a technical support
bot for a company called AppMaster.
The AppMaster Support bot is designed to help technical
users find answers efficiently.
It pulls information from various sources and a knowledge base, and it aims to
provide an accurate, concise response.
Our challenge is to craft a prompt that will ensure the bot
delivers high quality answers.
So let's take a look at our initial prompt and the AppMaster Support bot.
Here's the first version of our AppMaster Support bot.
Let's try to ask something.
We got a very simplistic reply and not something that would really help our
users, so let's see how we can improve.
So we got our bot up and running, but how do we measure success?
We'll need to define criteria for each business requirement.
We have, for example, response relevancy, its ability to build
on the knowledge we pass in the context, et cetera.
We'll use different eval methods for each criterion.
We'll see some examples in a minute, and we will build test data, which we can use
to iterate and have a complete test case.
We've seen that good evals are aligned with the business goals that we have.
In our specific example, the AppMaster Support bot evals could measure
accuracy, politeness, response times, and the output
format that our code expects to get.
So let's first analyze the different types of evals we can use.
The first category is content-based evals.
In our example, we have a data set with samples for expected
input and expected output.
We can compare the actual responses that we will get in our
experiment with the expected one.
Such a comparison will use vector similarity, which is a mechanism
for determining the contextual distance between two text blocks.
Other content-based evals might check for the existence of certain
words, or maybe for the absence of sensitive data, et cetera.
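To make this concrete, here is a minimal sketch of such a content-based eval in Python. It assumes the sentence-transformers library, and the embedding model, threshold, and function names are illustrative choices, not the specific tools used in the demo.

```python
# A minimal sketch of content-based evals: an embedding-based similarity
# score plus a simple rule-based check for required words.
# The model name and threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def similarity_eval(expected: str, actual: str, threshold: float = 0.8) -> dict:
    # Embed both texts and compare them with cosine similarity.
    emb_expected, emb_actual = model.encode([expected, actual])
    score = float(util.cos_sim(emb_expected, emb_actual))
    return {"score": score, "passed": score >= threshold}

def contains_required_terms(actual: str, required: list[str]) -> bool:
    # Rule-based content check: every required term must appear in the reply.
    text = actual.lower()
    return all(term.lower() in text for term in required)
```

The similarity score gives us a graded signal rather than a hard pass or fail, which is exactly the kind of scoring we described a moment ago.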
The second category is the format.
It's very often a requirement to return a response in a very specific format.
It can be that our code depends on such a structure, or that it fits the user's expectations.
Either way, we will need to look at the format and validate
it against our requirements.
In our example, the code expects a certain JSON schema.
We will see in a minute how to verify it.
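As a rough sketch, a format eval can be as small as parsing the reply and validating it against a schema. The schema fields below are an illustrative stand-in for the real one, and the demo itself uses a regular expression for the same purpose.

```python
# A minimal sketch of a format eval: parse the reply as JSON and validate
# it against the structure our code expects. The schema is illustrative.
import json
from jsonschema import ValidationError, validate

RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "sources": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["answer", "sources"],
}

def format_eval(raw_reply: str) -> bool:
    # Fails on invalid JSON as well as on valid JSON with the wrong shape.
    try:
        validate(instance=json.loads(raw_reply), schema=RESPONSE_SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False
```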
The third category is various qualities, like making sure that the bot is using
the documents that we are passing as part of the context and is grounding
the answers in these documents.
And of course, other guardrails or non-functional aspects like politeness,
which in our use case is one of the requirements we got for building the bot.
We can use different mechanisms for running our evals.
These can be deterministic or rule-based evals, checking text matching
or regular expressions, for example, or model-based evals that use an LLM
as a judge or other models for capabilities like vector similarity.
In our example, we will combine both mechanisms: where possible we will
use a rule-based one, and where required we will use a model-based one.
The pros and cons are quite obvious.
The deterministic ones will be very fast, very cheap to use,
but sometimes limited and rigid.
While the LLM based ones will be very flexible, we can basically
ask any question we would like.
They will be relatively easy to use, but they can be expensive, and they
are themselves non-deterministic, so we need to be very careful about
how we measure, or how we look at, the answers that they return.
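One way to put that trade-off into practice is sketched below, under the assumption that each eval is just a Python callable: run the cheap deterministic checks first, and only spend money on the model-based eval when they pass.

```python
# A minimal sketch of combining both mechanisms: cheap, deterministic
# checks gate the expensive, non-deterministic model-based eval.
# rule_checks and model_check stand in for evals defined elsewhere.
from typing import Callable

def run_evals(reply: str,
              rule_checks: list[Callable[[str], bool]],
              model_check: Callable[[str], float],
              threshold: float = 0.8) -> dict:
    if not all(check(reply) for check in rule_checks):
        # A failed rule-based check is a cheap, early fail.
        return {"passed": False, "score": 0.0, "ran_model_eval": False}
    score = model_check(reply)  # e.g. an LLM-as-a-judge score in [0, 1]
    return {"passed": score >= threshold, "score": score, "ran_model_eval": True}
```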
So let's see a few examples we will use to optimize our app support bot.
The first one, vector similarity.
As part of the requirements for our support bot, we got an example of
what an expected output should look like.
This includes both the structure and the content we expect to get for a
specific set of inputs or user question.
When we run the actual experiment, we get a reply back from the LLM.
We could read all this text and try to evaluate for ourselves
whether they are similar or not.
But in our case, we will use vector similarity.
That will save us the trouble of going through all those details.
The response that we got back is very similar from a contextual point of view.
The second eval will be a response format eval.
Here we can use a very simple, deterministic eval that checks,
for instance, a certain regular expression that matches the expected
JSON schema that our code requires.
And lastly, we will use some LLM based evals.
In this example, we want to verify that the bot answers in a polite
manner and mentions that it is the friendly AppMaster support bot.
So we will use another model to verify this.
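A minimal sketch of such an LLM-as-a-judge eval might look like the following, assuming an OpenAI-style client; the judge model, the prompt wording, and the YES/NO grading format are illustrative choices, not the demo's exact setup.

```python
# A minimal LLM-as-a-judge sketch: another model grades the reply for
# politeness and the required self-introduction. The model name and
# judge prompt are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are grading a support-bot reply. Answer only YES or NO.\n"
    "Is the reply polite AND does it introduce itself as the friendly "
    "AppMaster support bot?\n\nReply to grade:\n{reply}"
)

def politeness_eval(reply: str) -> bool:
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(reply=reply)}],
        temperature=0,
    )
    return result.choices[0].message.content.strip().upper().startswith("YES")
```

Constraining the judge to a YES or NO answer keeps its own non-determinism easy to score, which is exactly the caution we raised a moment ago.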
So how do we put those evals into action?
We've defined our eval criteria.
What's next?
We will now experiment with our prompt.
We will first make a template out of our prompt, putting variables
or placeholders in the places where we want to inject dynamic data.
We'll prepare test data, which will be our dataset, and we'll iterate over it with
the eval metrics that we've just defined.
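As a rough illustration of that templating step, the sketch below shows what such a template could look like; the placeholder names and wording are made up for the example, not the demo's actual prompt.

```python
# A minimal sketch of turning the prompt into a template with placeholders
# for the dynamic data. The wording and variable names are illustrative.
PROMPT_TEMPLATE = """You are the friendly AppMaster support bot.
Answer the user's question using only the documents provided below.
Respond as JSON with "answer" and "sources" fields.

Documents:
{context}

Question:
{question}
"""

def render_prompt(context: str, question: str) -> str:
    return PROMPT_TEMPLATE.format(context=context, question=question)
```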
We mentioned the need to prepare test data.
What makes a good data set for such an experiment?
The data should represent real-world scenarios, meaning the inputs that we
expect users will enter into our AppMaster support bot app, but we should also
make sure to include data that represents edge cases and unexpected inputs
that users might enter into our app.
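Here is a minimal sketch of what such a dataset and the experiment loop could look like; the rows, field names, and the generate_reply callable are illustrative placeholders, not the demo's actual data or API.

```python
# A minimal sketch of a test dataset (a realistic row plus an edge case)
# and the loop that runs every eval over every row.
from typing import Callable

DATASET = [
    {   # A realistic question we expect users to ask.
        "question": "How do I reset my API key?",
        "context": "API keys can be regenerated under Settings > Security.",
        "expected": '{"answer": "Regenerate the key under Settings > Security.",'
                    ' "sources": ["settings-guide"]}',
    },
    {   # An edge case: a question outside the knowledge base.
        "question": "What's the weather like today?",
        "context": "API keys can be regenerated under Settings > Security.",
        "expected": '{"answer": "I can only help with AppMaster questions.",'
                    ' "sources": []}',
    },
]

def run_experiment(dataset: list[dict],
                   generate_reply: Callable[[dict], str],
                   evals: dict[str, Callable[[dict, str], object]]) -> list[dict]:
    # generate_reply renders the prompt for a row and calls the model under test;
    # each eval receives the row and the actual reply and returns its score.
    results = []
    for row in dataset:
        reply = generate_reply(row)
        results.append({"question": row["question"],
                        **{name: fn(row, reply) for name, fn in evals.items()}})
    return results
```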
So let's see a live experiment of everything that we've just defined.
So this is our app support bot notebook.
It includes the prompt and the instructions that we give
the LLM, and the dataset that we will use for our experiment.
And we already have a couple of evals that we've defined.
The first one is used for validating that the bot answers in a polite manner.
The second one is a simple, deterministic one, making sure that we greet the user.
Let's add a couple of others: we will pick the relevancy eval to
make sure that the results are grounded in the documents that
we've passed, and we can even validate the JSON schema that we require
in a deterministic manner, using a regular expression.
We'll run the experiment and see the results that we get.
So as we can see, we get relatively poor results on several of our
evals, including the JSON schema one.
And we can see the individual failures and the specific inputs.
So let's go ahead and improve the prompt and the instructions we gave to our bot.
We will include a more detailed example of the response format,
as well as strengthen other instructions, and we'll run it again.
We can also pick another model and see if that will have any
effect on the results that we get.
So we'll pick Claude and run it again, and let's take a look at the responses
that we got and the eval results.
So as you can see, we got much better results this time, and all of our evals
have passed with a 100% success rate.
This is true for both models that we've tried.
So the improvement we did to our instructions made the difference.
Let's compare all three experiments that we've run.
As you can see, after we improved the instructions we gave,
we still get a very high similarity score compared to the expected results
that we provided as part of the dataset, but this time we have a much higher
success rate, actually a hundred percent success rate, in our evals.
So let's push our new prompt into production and see
how the new bot behaves.
We'll ask the same question,
and this time we get a much more detailed answer, with various options that will
assist the user in solving their problem.
So to sum up, we've seen how we can use evals to iterate and improve
until we reach a prompt that satisfies our business requirements.
Thank you very much for your time today.
I hope this session has been insightful.
If you have any questions, comments, or would just like to
keep in touch, please follow us on LinkedIn or visit our website.
And you are welcome to try us out at arra ai.