Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, my name is Robin, and today we'll look at the topic
of how to test AI using AI.
First, a little bit of introduction.
I'm Robin Gupta, co-founder and CEO at TestZeus.
I have three-plus years of leading product, engineering, and AI teams of up to 45 members.
I'm a published author on the topic of software testing and automation, and
I'm also an open source contributor.
Here are a few logos of companies I have worked with.
And on the large QR code, you can find more details about us.
First and foremost, let's look at the rise of IA, or intelligent applications.
So, for example, large language models themselves: examples like OpenAI's GPT-4.5, DeepSeek, and other LLMs tend to behave in an intelligent fashion.
The second bucket is RAG applications, retrieval-augmented generation, or even chatbots, which are wrappers on top of these models.
And the third is agents: agents which have agency inside them, or which can perform tasks autonomously using planning, memory, and tools.
So these are just a few examples of intelligent applications which need to be tested in novel ways.
That being said, let's look at the changing needs for software testing.
As you can see in the table over here, I have divided this into three columns: the first is the criterion, the second is traditional software, and the third one is LLM-based apps.
The first criterion is behavior.
Traditional software behaves in a very predefined way based on the rules that it has; think of it as a program made of if-else branches.
On the other hand, LLM-based apps behave on the basis of probability and prediction.
The second criterion is output.
For example, if you log into an application like Salesforce, it will deterministically log you in, and there will always be that one output for a given input.
But LLM-based apps can produce non-deterministic outputs: you can get different outcomes for the same question.
Last but not the least is the testing strategy.
For traditional software it was only about evaluating whether the output is right or wrong.
LLM-based apps, on the other hand, need to be evaluated on accuracy, quality, bias, consistency, and, last but not the least, toxicity.
That brings us to automated evals.
Now that we know what evals are, let's understand the what and when of automated evals.
So, what should you evaluate?
You must focus on context adherence, context relevance, correctness, bias, and toxicity in all intelligent applications. And when should you evaluate?
The cues for that can be taken from traditional software.
For example, you can check after every change, bug fix, feature update, or data change.
You can have pipeline steps and evaluations as part of pre-deployment in your SIT or UAT environment, wherein you check after merges to the prod branch, at the end of a sprint, or prior to shipping hotfixes.
And last but not the least, it wouldn't hurt if we do a little bit
of smoke testing post deployment, for example, based on the demand
from business or internal team needs.
That brings us to today's demo application under test.
What we'll do is build a small RAG-based application which will act as a quiz generator.
We'll provide it with a sample data set around topics like physics and
science, and then we'll ask the LLM to generate quiz questions for it,
which can be provided to the user.
So here is an example.
Let's say we ask it to write a quiz about science.
It'll respond back with some science questions for us
with right and wrong answers.
And for this application we'll use AI to perform the automated evals.
So let's look at the pipeline.
So, from the starting point, we have the quiz generator app.
We'll supplement it with a small quiz bank, and we'll also give it the user input and the prompt template.
The prompt template defines what kind of outputs it should produce.
The output of the quiz generator will go through the first ring of fire, rule-based evals, wherein we'll check whether it generates answers which contain specific words or not.
If yes, good; if no, fail the test case.
The second gate is around formatting.
Does it generate the quiz in the right format as specified or not?
And then, last but not the least, we'll look for things like hallucination, wherein we'll check it against negative prompts and see the kind of output that it generates.
We'll also ask an AI based automated eval system to reason about the
answer and tell us whether the quiz generator app is working fine or not.
That being said, it's time to roll up your sleeves and let's go onto
some hands-on coding examples.
So, first and foremost, as you can see on my screen, we are using Google Colab.
These notebooks are IPython notebooks wherein you can write snippets of Python code and run them online.
This is free, and you can use it for these experimental purposes too.
First and foremost, as we can see, we can run each of these cells individually.
In the second cell, we are highlighting my OpenAI API key being imported.
Over here you can use OpenAI or any other LLMs that you're comfortable with.
The same IPython notebooks and these resources will be provided right after the session for your reference, so don't worry about that.
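As a rough sketch (the notebook may load the key differently), the setup cell could look something like this:

```python
import os
from getpass import getpass

# Prompt for the OpenAI API key instead of hard-coding it in the notebook.
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
```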
First and foremost, we are highlighting the human template as the question, and, as mentioned, the app that we're testing today is a quiz generator, which uses the quiz bank, or the dataset, and large language models to generate quiz questions for, let's say, students.
So here the quiz bank is being defined, and it has subjects around DaVinci, Paris, telescopes, starry nights, and physics.
Each of them has a category and a few facts.
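To make the walkthrough concrete, here is a minimal sketch of what such a quiz bank string might look like; the subjects follow the talk, but the individual facts are illustrative placeholders rather than the notebook's actual data:

```python
# Illustrative quiz bank; each subject has a category and a few facts.
quiz_bank = """
1. Subject: Leonardo DaVinci
   Category: Art
   Facts:
   - Painted the Mona Lisa
   - Sketched early flying machines

2. Subject: Paris
   Category: Geography
   Facts:
   - Home to the Louvre and the Eiffel Tower
   - Capital of France

3. Subject: Telescopes
   Category: Science
   Facts:
   - Used to observe distant objects
   - The Hubble Space Telescope orbits Earth

4. Subject: Physics
   Category: Science
   Facts:
   - Studies matter, energy, and motion
   - Newton formulated the laws of motion
"""
```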
So when we ask this generative app, it should use an LLM to look at the dataset in this question bank and generate a quiz for us, and that is how it performs retrieval-augmented generation based on the data that we have provided it.
Now what we'll do is build a small prompt template as part of this demonstration.
Today we'll be using LangChain as the core framework, and in the constructed chain we have to highlight the prompt template.
So the prompt template that we're following for the quiz generator app is very simple, nothing too complicated: we are asking the LLM to follow the steps below and generate a customized quiz for the user.
It must identify the category the user is asking about, determine the subject, and then generate a quiz for the user based on the format, with a few questions.
What we'll use for chaining is LangChain, an open-source framework which acts as a wrapper on top of large language models.
So we have provided the input as the human question and the prompt template, and from there we'll move forward.
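A minimal sketch of how that prompt might be assembled with LangChain; the step wording is paraphrased from the talk, and `quiz_bank` is the string from the earlier cell:

```python
from langchain_core.prompts import ChatPromptTemplate

# Paraphrased system message: find the category, pick subjects from the
# quiz bank, and generate questions in a fixed format.
system_message = f"""
Follow these steps to generate a customized quiz for the user.

Step 1: Identify the category the user is asking about.
Step 2: Determine the subjects to generate questions about by looking up
the category in the quiz bank below.

{quiz_bank}

Step 3: Generate a quiz for the user with three questions, in the format:
Question 1: ...
Question 2: ...
Question 3: ...
"""

human_template = "{question}"

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", system_message),
    ("human", human_template),
])
print(chat_prompt)  # inspect what will be sent through LangChain
```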
So, for example, if we just print the chat prompt, this is what goes through LangChain.
As we can see, this is the same quiz bank and the prompt template that we had highlighted earlier.
Now we'll use GPT-4o with the temperature as zero and choose that as the large language model for us today.
As highlighted, you can use other large language models as well for the demonstration.
Now, from LangChain, we use the schema output parser so that we get appropriately parsed outputs from the language model.
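A small sketch of that step, assuming the langchain-openai integration package:

```python
from langchain_openai import ChatOpenAI              # pip install langchain-openai
from langchain_core.output_parsers import StrOutputParser

# Temperature 0 keeps the quiz output as repeatable as possible.
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Parse the chat model's message object into a plain string.
output_parser = StrOutputParser()
```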
Now we'll take all these components and make them reusable as one piece.
Okay, so nothing too complicated.
Over here we're creating an assistant chain from the system message, the human template, the LLM with the prompt template that we have provided, and the name of the model.
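A sketch of how those pieces might be bundled into one reusable chain; the function name and defaults are illustrative rather than the notebook's exact code:

```python
def assistant_chain(system_message,
                    human_template="{question}",
                    llm=ChatOpenAI(model="gpt-4o", temperature=0),
                    output_parser=StrOutputParser()):
    """Bundle the prompt, the model, and the parser into one runnable chain."""
    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", system_message),
        ("human", human_template),
    ])
    return chat_prompt | llm | output_parser
```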
So let's experiment with that quiz generator application.
We are asking it for a quiz about science, and then we'll print the answer that we get back.
What it has done is understood that the user is asking about the category science, and then, based on the tags and labels that we have provided in the dataset, it has generated a quiz about telescopes and physics.
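Invoking it could look roughly like this:

```python
assistant = assistant_chain(system_message)
answer = assistant.invoke({"question": "Generate a quiz about science."})
print(answer)  # expect questions drawn from the telescope and physics facts
```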
All good so far?
So now we create an eval for expected words as a first example.
So what we are doing over here is creating a small quiz from the quiz generator application that we have created.
And the output must be evaluated to contain certain expected words.
So it is a very rule-based eval system.
There is nothing magical, there is nothing AI inside it.
We'll come to the AI piece in the second example, so just keep up.
Now, what we are doing over here is defining a few expected words, and it's as simple as generating a quiz and then checking that the output has those expected words.
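A sketch of such a rule-based eval; the helper name and word list are illustrative:

```python
def eval_expected_words(system_message, question, expected_words):
    """Rule-based eval: generate a quiz and assert that expected words appear."""
    assistant = assistant_chain(system_message)
    answer = assistant.invoke({"question": question})
    print(answer)
    assert any(word.lower() in answer.lower() for word in expected_words), \
        f"Expected the answer to mention one of {expected_words}"

eval_expected_words(
    system_message,
    question="Generate a quiz about science.",
    expected_words=["telescope", "physics"],
)
```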
Okay, so when we run this with the eval, as expected we should see that it creates a quiz about science, which picked up the telescope and physics facts, generated a quiz, and then had the right words inside it.
However, let's now create another test case, or an eval, for refusal.
In that, what we are doing is just asking our system to create a quiz and then invoking it on the question that we have asked for.
And if we ask it to generate a quiz about bread making, it should decline and respond with "I'm sorry."
So for example, if we run this, we'll start seeing the
failures in that eval test.
So we expected the assistant questions to include this, but it did not.
Is the idea very simple?
You ask the quiz app to generic a quiz based on a topic.
Look for specific words inside the answer.
Nothing fancy, all simple python overhead.
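The refusal eval can be sketched the same way, again with illustrative names:

```python
def evaluate_refusal(system_message, question, decline_phrase="I'm sorry"):
    """Rule-based eval: the assistant should refuse topics outside the quiz bank."""
    assistant = assistant_chain(system_message)
    answer = assistant.invoke({"question": question})
    print(answer)
    assert decline_phrase.lower() in answer.lower(), \
        f"Expected a refusal containing '{decline_phrase}', got: {answer}"

# Bread making is not in the quiz bank, so this is expected to fail
# until the system prompt is hardened.
evaluate_refusal(system_message, "Generate a quiz about bread making.")
```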
But a key problem with LLM-based applications, as we saw earlier, is hallucinations.
So now what we are doing is asking the quiz generator app to write a poem about conferences.
Let's see what it does.
So when we invoke the assistant application, it actually wrote a poem about conferences.
Now, a key failure point here is that we had created a quiz app, not a poetry app.
But when we asked it to create a poem about conferences, it broke all the rules about its format and its system prompt and went ahead and wrote a small poem about conferences. Not cool.
That is where we come to our second example wherein we use
an AI system to test that first AI system that we have built.
So the first AI system which we had built was this quiz app, and we'll use another AI system with a system prompt to check the responses from the first one.
Very simple.
So let's start running the same example from the top.
As earlier, we are setting up the OpenAI API key.
In the second block, we are also setting up the quiz bank, so nothing has changed over here.
It's the same quiz bank as before, which has topics covering telescopes, starry nights, physics, science, and so on and so forth.
Now what we are doing is creating the assistant chain as our sample application under test.
Basically, we are building the quiz generator app over here.
So we are asking it to follow the steps and generate a customized quiz for the user; just as we saw earlier, it should follow a certain format.
Now what we've also done is we've tried to harden the system prompt a little bit.
We are asking it to only include questions from information in the quiz bank, to only use explicit string matches for the category name, and, if the user asks a question about a subject it does not have information about in the quiz bank, to just say that it does not know the answer.
We can also sort of see that on the right side, and it says it should respond, "I'm sorry, I do not have information about that."
If you ask it to write a poem, it should not do that; that is the idea.
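A paraphrased sketch of that hardened system prompt (the notebook's exact wording may differ):

```python
hardened_system_message = f"""
Follow these steps to generate a customized quiz for the user.

Step 1: Identify the category the user is asking about.
Step 2: Determine the subjects by looking them up in the quiz bank below.

{quiz_bank}

Step 3: Generate a quiz with three questions for the user.

Additional rules:
- Only include questions built from information in the quiz bank.
- Only use explicit string matches for the category name.
- If the user asks about a subject you do not have information about in the
  quiz bank, answer exactly: "I'm sorry, I do not have information about that."
"""
```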
So we've run that small cell, it has done the setup for us, and then we'll build a prompt that asks the LLM to evaluate the output of the quiz assistant.
Now, this is the second, evaluator AI system.
It has a very simple system prompt which says: you are an assistant that evaluates whether or not another assistant is producing valid quizzes, and then it should give the answer.
So the idea, as highlighted earlier, was that you have the quiz app, which is an LLM- or AI-based app; it acts as a RAG application, retrieval-augmented generation, takes the quiz bank, and, based on the system prompt provided, it should ideally create the quizzes as the output.
The second, evaluator AI, as we can see over here, has a system prompt acting as an assistant that evaluates the responses from the first one.
So that is how we are using AI to test another AI system.
Very simple.
So, moving forward, we similarly simulate a response to make a first small test, and we set the prompt for the testing agent as: you are evaluating a generated quiz based on the context the assistant uses, and an LLM response.
Read the response carefully and determine if it looks like a quiz or not.
Output Y if the response is a quiz; output N if the response does not look like a quiz.
So that is where we are doing that assertion of the response from the first AI, using the second AI.
For the evaluator, we use LangChain to build the prompt template for evaluation, we'll choose the LLM, and we'll import the parser to get a readable response.
We connect all these pieces together in the eval chain, and we'll test the positive response by invoking the eval chain.
So basically, as we had highlighted, if the answer is correct, it should respond with a Y; if we don't get a proper quiz back, it'll respond with an N.
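Putting the evaluator together might look roughly like this; the prompt wording is paraphrased and the function name is illustrative, building on the earlier sketches:

```python
def create_eval_chain(llm=ChatOpenAI(model="gpt-4o", temperature=0)):
    """Second AI system: judges whether the first AI's output looks like a quiz."""
    eval_system_prompt = ("You are an assistant that evaluates whether or not "
                          "another assistant is producing valid quizzes.")

    eval_user_template = """You are evaluating a generated quiz based on the
context the assistant uses to create the quiz, and an LLM response.
[BEGIN DATA]
[Question]: {context}
[Response]: {agent_response}
[END DATA]
Read the response carefully and determine if it looks like a quiz or not.
Output Y if the response is a quiz, output N if it does not look like a quiz."""

    eval_prompt = ChatPromptTemplate.from_messages([
        ("system", eval_system_prompt),
        ("human", eval_user_template),
    ])
    return eval_prompt | llm | StrOutputParser()

# Positive case: a real quiz should be judged as Y.
eval_chain = create_eval_chain()
assistant = assistant_chain(hardened_system_message)
good_response = assistant.invoke({"question": "Generate a quiz about science."})
print(eval_chain.invoke({"context": "Generate a quiz about science.",
                         "agent_response": good_response}))  # expect "Y"
```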
So, combining all these pieces, we create an eval chain using LangChain, and then what we also do is check for a negative response.
Now, the known bad result highlighted over here, or the negative test case, is: "There are a lot of interesting facts. Tell me more about what you'd like to know."
Now, this will not generate a quiz, and that is what the second AI should catch, failing the test case.
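The negative check could then be sketched like this, reusing the eval chain from above:

```python
# A known bad result: chatty text that is clearly not a quiz.
known_bad_result = ("There are a lot of interesting facts. "
                    "Tell me more about what you'd like to know.")

decision = eval_chain.invoke({
    "context": "Generate a quiz about science.",
    "agent_response": known_bad_result,
})
print(decision)  # expect "N", so a test asserting "Y" would correctly fail
```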
So let's run that and see what's happening.
So it also generated a small N.
What we'll do is compress the pieces of these systems into a test, so that we can run pytest or test fixtures around it.
So we are running the code for the test assistant, setting
up for all these scenarios.
And then there is a specific method which will look for a substring in the message, and it will also expect the decline response if the request is not one it can answer.
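Wrapped up as tests, this could look roughly like the following pytest sketch; all names are illustrative and rely on the helpers sketched earlier:

```python
import pytest

@pytest.fixture
def assistant():
    return assistant_chain(hardened_system_message)

def test_science_quiz_mentions_expected_words(assistant):
    answer = assistant.invoke({"question": "Generate a quiz about science."})
    assert any(word in answer.lower() for word in ["telescope", "physics"])

def test_refuses_unknown_subject(assistant):
    answer = assistant.invoke({"question": "Generate a quiz about books."})
    assert "i'm sorry" in answer.lower()

def test_known_bad_result_is_flagged():
    decision = create_eval_chain().invoke({
        "context": "Generate a quiz about science.",
        "agent_response": "There are a lot of interesting facts. "
                          "Tell me more about what you'd like to know.",
    })
    # The evaluator should return N here, so asserting Y demonstrates the failure.
    assert decision.strip().upper().startswith("Y")
```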
So when we run that, we'll ask it about some facts which we had provided.
And then, if the question is outside the knowledge base, as we can see, the assistant is also responding, "I'm sorry, I do not have information about that."
We'll create the test with these evals, and when we run that, as we can see, for the assertion we've got an N where it should have gotten a Y, because we have input the known bad result as we saw earlier.
So that is how you can use the second AI system to check the first AI system.
It is as simple as this: the second AI system has a prompt which checks whether the quiz generator is generating the right outputs or not.
Now, the same could be applicable for your customer support scenarios
or outbound sales scenarios.
For example, in the case of a customer support scenario, let's say your AI system is supposed to act as a customer service bot.
It should ideally not reveal any secret information or should
not go into hallucinations.
So when you're building that bot or that AI application, you can have another AI evaluator with known bad results and then check the behavior of that first customer support AI bot.
That is how you can test AI systems using secondary AI systems.
But let's not stop there.
Let's make it more fun.
What if we could find not only deviations in the format, but also deviations from context adherence?
And what if we could reason about the bad responses?
So last but not the least, let's perform some automated
evals with advanced examples.
First and foremost, we have the OpenAI key set up.
The quiz bank is the same as earlier, and we'll create the assistant chain wherein it can act as a quiz generator for us using GPT-4o.
What we'll do is create the second AI system with a system prompt and give it a very specific direction: asking it to act as an assistant that evaluates how well the quiz assistant creates quizzes for a user by looking at the set of facts available to the assistant.
So we are asking the second AI system to look at the same question bank given to the first one and see whether the quiz generated by the first one adheres to that context or not.
So we are telling it that its primary concern is making sure that only facts available are used; if you focus here, the word "only" is actually in bold.
What we found is that text written in bold in a system prompt is acted upon more strictly by large language models.
So we'll highlight that, we'll run that small cell, and we'll also ask it to compare the content with the question bank.
And if a fact is in the quiz, but not in the question bank, then the quiz is bad.
So we are specifically asking it to check for context relevance as part of the system prompt.
So we are asking it to output Y if the quiz only contains facts from the question bank, and to output N if it contains facts that are not in the question bank.
So we'll create a function which will check for the
evaluation and hallucinations.
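A paraphrased sketch of that context-adherence evaluator; the prompt text and helper name are illustrative:

```python
def create_hallucination_eval_chain(llm=ChatOpenAI(model="gpt-4o", temperature=0)):
    """Judges whether a quiz only uses facts that exist in the quiz bank."""
    eval_system_prompt = ("You are an assistant that evaluates how well the quiz "
                          "assistant creates quizzes for a user by looking at the "
                          "set of facts available to the assistant. Your primary "
                          "concern is making sure that **only** facts available "
                          "are used.")

    eval_user_template = """Compare the content of the submission with the quiz bank.
[BEGIN DATA]
[Quiz Bank]: {context}
[Submission]: {agent_response}
[END DATA]
If a fact is in the quiz but not in the quiz bank, the quiz is bad.
Output Y if the quiz only contains facts from the quiz bank.
Output N if it contains facts that are not in the quiz bank."""

    prompt = ChatPromptTemplate.from_messages([
        ("system", eval_system_prompt),
        ("human", eval_user_template),
    ])
    return prompt | llm | StrOutputParser()
```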
So, in this case, when we run it, we will find that, apart from a warning that the ChatOpenAI class is deprecated, when we ask it for a bad topic, it says, "I'm sorry, but books is not a part of the categories I can generate a quiz for."
As you can see, in the question bank we actually don't have anything around books; it has topics about telescopes, science, Starry Night, physics, and DaVinci.
So that is how, in very simple words, it can check the context relevance.
But let's improve the system prompt a little bit.
So we are asking it to act as an assistant that evaluates how well the quiz assistant creates quizzes; its primary concern is making sure that only facts available are used, and helpful quizzes only contain facts in the test set.
We'll rebuild the evaluator to not only provide the decision of the
result, but also the explanation.
So that is the fun part over here.
So, for example, we are not only asking it to compare the information in the quiz with the question bank, we are also giving it a few additional rules.
Over here, in line numbers 28 to 32, we're asking it to output an explanation of whether the quiz only references information in the context, to make the explanation brief and only include a summary of the reasoning for the decision, to include a clear yes or no, and then to reference the facts bank if the answer is yes, along with an explanation.
So, for example, there will be a decision, and then there could be a small explanation.
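An illustrative version of that extended evaluator, which asks for a decision plus a brief explanation; again, the wording is paraphrased rather than the notebook's exact prompt:

```python
eval_with_explanation_template = """Compare the content of the submission with the quiz bank.
[BEGIN DATA]
[Quiz Bank]: {context}
[Submission]: {agent_response}
[END DATA]
Output an explanation of whether the quiz only references information in the context.
Make the explanation brief; only include a summary of your reasoning for the decision.
Include a clear "Yes" or "No" as the decision, and reference facts from the quiz
bank if the answer is Yes.

Output format:
Decision: <Yes or No>
Explanation: <one or two sentences>"""

explanation_eval_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an assistant that evaluates how well the quiz assistant "
               "creates quizzes by looking at the set of facts available to it."),
    ("human", eval_with_explanation_template),
])
explanation_eval_chain = (explanation_eval_prompt
                          | ChatOpenAI(model="gpt-4o", temperature=0)
                          | StrOutputParser())
```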
So we'll rebuild that whole check prompt with the new prompt.
And what we'll do is also run it with a small test dataset.
So for example, there are topics about science, about geography
and bread making or cooking.
So ideally it should pass the topics that we have in the quiz bank, and it should fail the ones around bread making because we did not cover that.
So what we'll do is create this function over the dataset and invoke the quiz generation.
We'll use a Python library called pandas to help us display the results in a more readable format.
Once we run this end-to-end piece with the dataset and the evaluator AI, what we'll see is that for the positive test cases it will pass the test case or evaluation, and for the negative ones it will not just fail them; as highlighted in the system prompt for the evaluator AI, it will also try to look at the dataset and then reason about or explain its results to us.
So, for example, in the first row we are asking it to generate a quiz about science: "Can you give me a quiz to test my knowledge?" So it generated that.
The quiz generator app created a small quiz for the user, and then the evaluator AI checked the quiz against the dataset and highlighted that yes, the decision is right, the quiz is good, and it gave an explanation of why it believes the generated quiz is accurate.
Similarly, we ask it about geography.
And, last but not the least, when we ask it about bread making, not only did the quiz generator app reply, "I'm sorry, I cannot give you a quiz about bread making," the evaluator AI also gave a decision of No.
It then explained that the quiz does not reference any facts from the question bank and has nothing to do with DaVinci, Paris, telescopes, and so on.
So, in very simple terms, what we have done is create a small quiz generator app which takes a question bank, talks to the LLM, and generates quizzes.
What we then did was, part one, run certain rule-based evals to check whether the generated quiz from this first AI-based application contains a few expected words or not.
Second, we checked for the format, and, last but not the least, we created this evaluator AI, which not only checked for context adherence and relevance based on the same dataset provided for the quiz, it also gave us an explanation of why it passed or failed the test case.
So that is how you can build AI systems to check other AI systems.
Let's get back to our presentation.
I'll put this in slide mode, and it's time to wrap up.
Okay.
Jokes aside, moving forward, let's zoom out a little bit and compress these concepts into a CI/CT pipeline: a continuous integration, continuous testing pipeline for AI applications.
Let's say you are making some changes: you must do per-commit evals on the dev feature branch, and from there merge to main.
After that, you can do pre-release evals, like the second test around formatting that we did, and move it to deploy.
And then, last but not the least, after that you can also check it for a few smoke scenarios with the evaluator AI that we had created, and look at not just the pass/fail responses but also the explanations.
It is very important to highlight that not just the pre-release evals but also the post-release evals must be performed.
When I highlighted that you can use AI systems to check other AI systems, someone very smart in the audience asked me: then who checks whether both of these AI systems are working fine or not?
The answer, as you can imagine, is, today and forever: humans.
That being said, let's look at a few references.
Infosys recently released a Responsible AI toolkit; this can help you do red teaming for AI applications.
GitHub publishes their own resources as well, and, last but not the least, there is a very nice research paper on the evaluation of large language models.
That brings us to the closure of our session today.
Feel free to send us questions, and you can also check out the world's first open-source testing agent at the GitHub repo highlighted below.
Thank you so much.
Have a nice day.
Or a nice evening.
Stay curious.
Stay smart.
Take care, and bye-bye.