Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, and welcome.
I'm LIC Pass from Marto ai.
ATO is a platform for managing your gen AI assets.
This includes your prompts, model configurations, datasets, and evals, from the minute you write your initial prompt, through testing and experimenting with it
to make sure it performs as you expect, to the time it's running in production
with detailed observability and metrics.
In this session, we will take a look at the key role evals play in
bringing your Gen AI application to production.
We'll see the process of designing, evaluating, and refining prompts
for an LLM based application.
So let's start.
So what are evals?
Evals are a systematic method we use to measure the performance, the
capabilities, and, basically, the replies that we get from AI models.
They are key in determining how well a prompt or a model output meets the
specific objectives that we have.
The key features of evals are being measurable, meaning we can quantify
aspects like accuracy, relevance, and structure; being repeatable, meaning
consistent assessment across different iterations; and being actionable, meaning
they provide the insights we need to refine the prompt and improve the outcome.
So what has changed?
Software testing has been around for decades.
Why are we talking about a new method of testing?
The first thing is the boundary.
We move from the well-defined, structured output and error
formats of traditional APIs to complex and unpredictable
formats, errors, and mistakes.
We are moving toward a non-deterministic world where similar, and
sometimes identical, inputs might receive different outputs.
So we are moving from very clear pass-or-fail criteria to a world where we
measure different aspects of the replies that we get and compute
some kind of score that tells us whether we are hitting the target or not.
Evals transform prompt engineering from guesswork into a scientific
process by making interactions measurable and repeatable.
They ensure LLM output aligns with our business objectives, enable
data-driven decision making and prompt refinement, and reduce the risk
of unforeseen issues in production.
So let's take a look at an example use case.
This is our demo application.
Imagine that you were given the task of developing a technical support
bot for a company called AppMaster.
The AppMaster Support bot is designed to help technical
users find answers efficiently.
It pulls information from various sources and a knowledge base, and it aims to
provide an accurate, concise response.
Our challenge is to craft a prompt that will ensure the bot
delivers high quality answers.
So let's take a look at our initial prompt and the AppMaster Support bot.
Here's the first version of our AppMaster Support bot.
Let's try to ask something.
We got a very simplistic reply and not something that would really help our
users, so let's see how we can improve.
So we got our bot up and running, but how do we measure success?
We'll need to define criteria for each business requirement.
We have, for example, response relevancy, its ability to build
on the knowledge we pass in the context, et cetera.
We'll use different eval methods for each criterion.
We'll see some examples in a minute, and we will build test data, which we can use
to iterate and have a complete test case.
We've seen that good evals are aligned with the business goals that we have.
In our specific example, the AppMaster Support bot evals could measure
accuracy, politeness, response times, and the output
format that our code expects to get.
So let's first analyze the different types of evals we can use.
The first category is content-based evals.
In our example, we have a data set with samples for expected
input and expected output.
We can compare the actual responses that we will get in our
experiment with the expected one.
Such a comparison will use vector similarity, which is a mechanism
for determining the contextual distance between two text blocks.
Other content-based evals might check for the existence of certain
words, or maybe for the absence of sensitive data, et cetera.
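To make this concrete, here is a minimal sketch of such a content-based eval in Python. It assumes the sentence-transformers library, and the embedding model, threshold, and function names are illustrative choices, not the specific tools used in the demo.

```python
# A minimal sketch of content-based evals: an embedding-based similarity
# score plus a simple rule-based check for required words.
# The model name and threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def similarity_eval(expected: str, actual: str, threshold: float = 0.8) -> dict:
    # Embed both texts and compare them with cosine similarity.
    emb_expected, emb_actual = model.encode([expected, actual])
    score = float(util.cos_sim(emb_expected, emb_actual))
    return {"score": score, "passed": score >= threshold}

def contains_required_terms(actual: str, required: list[str]) -> bool:
    # Rule-based content check: every required term must appear in the reply.
    text = actual.lower()
    return all(term.lower() in text for term in required)
```

The similarity score gives us a graded signal rather than a hard pass or fail, which is exactly the kind of scoring we described a moment ago.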
The second category is the format.
It's very often a requirement to return a response in a very specific format.
It can be that our code depends on such a structure, or that it fits the user's expectations.
Either way, we will need to look at the format and validate
it against our requirements.
In our example, the code expects a certain JSON schema.
We will see in a minute how to verify it.
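As a rough sketch, a format eval can be as small as parsing the reply and validating it against a schema. The schema fields below are an illustrative stand-in for the real one, and the demo itself uses a regular expression for the same purpose.

```python
# A minimal sketch of a format eval: parse the reply as JSON and validate
# it against the structure our code expects. The schema is illustrative.
import json
from jsonschema import ValidationError, validate

RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "sources": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["answer", "sources"],
}

def format_eval(raw_reply: str) -> bool:
    # Fails on invalid JSON as well as on valid JSON with the wrong shape.
    try:
        validate(instance=json.loads(raw_reply), schema=RESPONSE_SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False
```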
The third category is various qualities, like making sure that the bot is using
the documents that we are passing as part of the context and is grounding
the answers in these documents.
And of course, other guardrails or non-functional aspects like politeness,
which in our use case is one of the requirements we got for building the bot.
We can use different mechanisms for running our evals.
These can be deterministic or rule-based evals, checking text matching
or regular expressions, for example, or model-based evals that use an LLM
as a judge or other models for capabilities like vector similarity.
In our example, we will combine both mechanisms: where possible we will
use a rule-based one, and where required we will use a model-based one.
The pros and cons are quite obvious.
The deterministic ones will be very fast, very cheap to use,
but sometimes limited and rigid.
While the LLM based ones will be very flexible, we can basically
ask any question we would like.
They will be relatively easy to use, but they can be expensive, and they
are themselves non-deterministic, so we need to be very careful about
how we measure, or how we look at, the answers that they return.
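One way to put that trade-off into practice is sketched below, under the assumption that each eval is just a Python callable: run the cheap deterministic checks first, and only spend money on the model-based eval when they pass.

```python
# A minimal sketch of combining both mechanisms: cheap, deterministic
# checks gate the expensive, non-deterministic model-based eval.
# rule_checks and model_check stand in for evals defined elsewhere.
from typing import Callable

def run_evals(reply: str,
              rule_checks: list[Callable[[str], bool]],
              model_check: Callable[[str], float],
              threshold: float = 0.8) -> dict:
    if not all(check(reply) for check in rule_checks):
        # A failed rule-based check is a cheap, early fail.
        return {"passed": False, "score": 0.0, "ran_model_eval": False}
    score = model_check(reply)  # e.g. an LLM-as-a-judge score in [0, 1]
    return {"passed": score >= threshold, "score": score, "ran_model_eval": True}
```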
So let's see a few examples we will use to optimize our app support bot.
The first one, vector similarity.
As part of the requirements for our support bot, we got an example of
what an expected output should look like.
This includes both the structure and the content we expect to get for a
specific set of inputs or user question.
When we run the actual experiment, we get a reply back from the LLM.
We could read all this text and try to evaluate for ourselves
whether they are similar or not.
But in our case, we will use vector similarity.
That will save us the trouble of going through all those details.
The response that we got back is very similar from a contextual point of view.
The second eval will be a response format eval.
Here we can use a very simple, deterministic eval that checks,
for instance, a certain regular expression that matches the expected
JSON schema that our code requires.
And lastly, we will use some LLM based evals.
In this example, we want to verify that the bot answers in a polite
manner and mentions that it is the friendly AppMaster support bot.
So we will use another model to verify this.
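A minimal sketch of such an LLM-as-a-judge eval might look like the following, assuming an OpenAI-style client; the judge model, the prompt wording, and the YES/NO grading format are illustrative choices, not the demo's exact setup.

```python
# A minimal LLM-as-a-judge sketch: another model grades the reply for
# politeness and the required self-introduction. The model name and
# judge prompt are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are grading a support-bot reply. Answer only YES or NO.\n"
    "Is the reply polite AND does it introduce itself as the friendly "
    "AppMaster support bot?\n\nReply to grade:\n{reply}"
)

def politeness_eval(reply: str) -> bool:
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(reply=reply)}],
        temperature=0,
    )
    return result.choices[0].message.content.strip().upper().startswith("YES")
```

Constraining the judge to a YES or NO answer keeps its own non-determinism easy to score, which is exactly the caution we raised a moment ago.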
So how do we put those evals into action?
We've defined our eval criteria.
What's next?
We will now experiment with our prompt.
We will first make a template out of our prompt, putting variables
or placeholders in the places where we want to inject dynamic data.
We'll prepare test data, which will be our dataset, and we'll iterate over it with
the eval metrics that we've just defined.
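As a rough illustration of that templating step, the sketch below shows what such a template could look like; the placeholder names and wording are made up for the example, not the demo's actual prompt.

```python
# A minimal sketch of turning the prompt into a template with placeholders
# for the dynamic data. The wording and variable names are illustrative.
PROMPT_TEMPLATE = """You are the friendly AppMaster support bot.
Answer the user's question using only the documents provided below.
Respond as JSON with "answer" and "sources" fields.

Documents:
{context}

Question:
{question}
"""

def render_prompt(context: str, question: str) -> str:
    return PROMPT_TEMPLATE.format(context=context, question=question)
```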
We mentioned the need to prepare test data.
What makes a good data set for such an experiment?
The data should represent real-world scenarios, meaning the inputs that we
expect users will enter into our AppMaster support bot app, but we should also
make sure to include data that represents edge cases and unexpected inputs
that users might enter into our app.
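Here is a minimal sketch of what such a dataset and the experiment loop could look like; the rows, field names, and the generate_reply callable are illustrative placeholders, not the demo's actual data or API.

```python
# A minimal sketch of a test dataset (a realistic row plus an edge case)
# and the loop that runs every eval over every row.
from typing import Callable

DATASET = [
    {   # A realistic question we expect users to ask.
        "question": "How do I reset my API key?",
        "context": "API keys can be regenerated under Settings > Security.",
        "expected": '{"answer": "Regenerate the key under Settings > Security.",'
                    ' "sources": ["settings-guide"]}',
    },
    {   # An edge case: a question outside the knowledge base.
        "question": "What's the weather like today?",
        "context": "API keys can be regenerated under Settings > Security.",
        "expected": '{"answer": "I can only help with AppMaster questions.",'
                    ' "sources": []}',
    },
]

def run_experiment(dataset: list[dict],
                   generate_reply: Callable[[dict], str],
                   evals: dict[str, Callable[[dict, str], object]]) -> list[dict]:
    # generate_reply renders the prompt for a row and calls the model under test;
    # each eval receives the row and the actual reply and returns its score.
    results = []
    for row in dataset:
        reply = generate_reply(row)
        results.append({"question": row["question"],
                        **{name: fn(row, reply) for name, fn in evals.items()}})
    return results
```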
So let's see a live experiment of everything that we've just defined.
So this is our app support bot notebook.
It includes the prompt and the instructions that we give
the LLM, and the dataset that we will use for our experiment.
And we already have a couple of evals that we've defined.
The first one is used for validating that the bot answers in a polite manner.
The second one is a simple, deterministic one, making sure that we greet the user.
Let's add a couple of others: we will pick the relevancy eval to
make sure that the results are grounded in the documents that
we've passed, and we can even validate the JSON schema that we require
in a deterministic manner, using a regular expression.
We'll run the experiment and see the results that we get.
So as we can see, we get relatively poor results on several of our
evals, including the JSON schema one.
And we can see the individual failures and the specific inputs.
So let's go ahead and improve the prompt and the instructions we gave to our bot.
We will include a more detailed example of the response format,
as well as strengthen other instructions, and we'll run it again.
We can also pick another model and see if that will have any
effect on the results that we get.
So we'll pick Claude and run it again, and let's take a look at the responses
that we got and the eval results.
So as you can see, we got much better results this time, and all of our evals
have passed with a 100% success rate.
This is true for both models that we've tried.
So the improvement we did to our instructions made the difference.
Let's compare all three experiments that we've run.
As you can see, after we improved the instructions we gave,
we still get a very high similarity score compared to the expected results
that we provided as part of the dataset, but this time we have a much higher
success rate, actually a hundred percent success rate, in our evals.
So let's push our new prompt into production and see
how the new bot behaves.
We'll ask the same question,
and this time we get a much more detailed answer, with various options that will
assist the user in solving their problem.
So to sum up, we've seen how we can use evals to iterate and improve
until we reach a prompt that satisfies our business requirements.
Thank you very much for your time today.
I hope this session has been insightful.
If you have any questions, comments, or would just like to
keep in touch, please follow us on LinkedIn or visit our website.
And you are welcome to try us out at arra ai.