Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi guys.
My name is Gar Benzel.
I graduated in 2009 in Information Technology from India.
I have around 15 years of experience working in information technology,
building services, web applications, and test infrastructure.
I have been working with Amazon since around 2015, so almost 10 years now.
So yeah, that's a bit about me.
Let's talk about why we are here today.
So today we are here to talk about LLM applications.
The industry is moving fast.
Everyone is learning about LLMs.
Everyone is talking about large language models and how to use them in
their day-to-day testing, how to use them in their day-to-day life,
like finding flights, finding a code solution, or
finding just about anything at this moment.
But how do we test an LLM application?
That's the question today, right?
That's what we are talking about: testing LLM applications.
Before we can talk about that, let's talk about how we test a normal service,
or how we test a simple method.
So for example, let's take a method that tests a sum: two plus two
equals four, with two and two as the input.
Four is the output.
So you assert equal: the actual result from the input, which is two plus
two, against the expected output, four.
They are equal.
I'm done.
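To make that concrete, here is a minimal sketch in Python of the kind of deterministic test being described; the `add` function and test names are just illustrative placeholders.

```python
import unittest


def add(a, b):
    """A plain deterministic method: same input, same output, every time."""
    return a + b


class TestAdd(unittest.TestCase):
    def test_two_plus_two(self):
        # Input is 2 and 2, expected output is 4 -- one assert and we're done.
        self.assertEqual(add(2, 2), 4)


if __name__ == "__main__":
    unittest.main()
```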
Now, how do we test an LLM application?
LLMs are like kids: they answer differently every time.
So for example, I ask an LLM a question: how is the weather today?
Its answer is: the weather is awesome.
It's 62 degrees.
Okay, I have an expected answer that is the same: the weather is awesome, it's 62 degrees.
I assert equal, it works fine.
My test passed.
Same thing, I ask the question again.
How's the weather today?
The weather is wonderful.
It's 74 degrees today.
My God, the answer I got the first time is different from the answer I got the second time,
so I'm not going to be able to assert on it.
What do we do then?
How do we test it?
This is a horrible situation if you are in the testing industry or a software developer.
How do you test your LLM application?
So how do we test it?
So the thing is, we don't test it.
We evaluate it.
For example, you go to the doctor, right?
The doctor asks you:
how are you feeling on a scale from one to 10?
And you say, I'm feeling nine.
That means you're good, great, wonderful.
And at two, I'm bad.
I'm not doing good.
I just want to lie down.
Same thing happens with LLM.
You evaluate them, you don't test them.
So there are many metrics available in the market that
tell you how to evaluate it.
Like: what was the input text?
Did you get the correct response?
Is the similarity good?
Is the evaluation score good?
Is the context relevant, and other things.
So you evaluate them, you don't make an assert call on them, you just
score them on a scale from zero to 10.
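As a rough illustration of scoring instead of asserting, here is a small Python sketch; it uses `difflib` from the standard library as a crude stand-in for a real semantic-similarity metric (in practice you would use an embedding model or an LLM judge), and the threshold is just an assumption.

```python
from difflib import SequenceMatcher


def similarity_score(expected: str, actual: str) -> float:
    """Return a rough 0-10 similarity score instead of a pass/fail assert."""
    # SequenceMatcher gives 0.0-1.0; scale it to the 0-10 range from the talk.
    return SequenceMatcher(None, expected.lower(), actual.lower()).ratio() * 10


expected = "The weather is awesome. It's 62 degrees."
actual = "The weather is wonderful. It's 74 degrees today."

score = similarity_score(expected, actual)
print(f"similarity score: {score:.1f} / 10")

# Instead of assertEqual, check that the score clears a bar we choose.
PASS_THRESHOLD = 7.0
print("evaluation passed" if score >= PASS_THRESHOLD else "evaluation failed")
```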
And then let's talk about the first slide that we have today,
the enterprise scale challenge.
What is the challenge we are facing?
The challenge is that we are building LLM applications everywhere,
and LLM applications are everywhere.
They are in pharmacy, in education, in sports, everywhere, and
they are growing every single day.
They're learning so many things.
They're getting more and more complex.
Everyone wants to use an LLM for their work today.
No one wants to write the code for services by hand.
They're getting complex.
Operational performance variability: these systems are huge, right?
You have huge data centers out there, so you need to work
with them to understand them.
You need to understand: how do I manage these huge data
structures and this huge data storage? And then there are escalating stakeholder expectations.
So LLMs are going big, right?
Everyone wants to get their things done within seconds, within minutes,
or within microseconds actually.
LLMs today are slow.
Slow means they process a ton of data.
You put in an input, there is a large data set,
they have to go through it,
they think about it, and then they respond.
How can you make it quicker?
Those are the challenges we are currently facing.
Let's talk about traditional metrics and how we handle those:
accuracy, safety, coherence, adherence.
So accuracy is one thing that we always talk about, right?
I want my LLM to give me the correct answer, not the wrong answer.
How many times is it going to give me the correct answer, and what is
the criteria for a correct answer?
If the answer matches the expected result 80% of the time, is it correct?
Is it accurate?
So we need to always check on those metrics.
Safety: I want my LLM not to provide harmful answers, right?
Anybody can use an LLM to do something bad.
So when someone asks my LLM a question about how to hurt something or destroy
something, my LLM should respond:
I don't know the answer.
Coherence, logical coherence.
Every time I ask a question, the LLM has to give me the correct answer.
It should be logical.
It should not be like: how's the weather today?
And the LLM answers: today I'm going to buy a car.
That's not logical.
So you need to check for that as well.
Instruction adherence: how precise is it?
How easy is it to interpret?
Can it support multi-step instructions or not?
The LLM has to handle those things; those are the metrics
beyond the traditional ones.
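As a rough sketch of how per-metric scores like these could be collected and combined, here is an illustrative Python snippet; the metric names mirror the slide, but the weights and the hard-coded scores are placeholders for whatever evaluators or human reviewers would actually produce.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    accuracy: float               # does the answer match the expected result?
    safety: float                 # does it refuse harmful requests?
    coherence: float              # is the answer logical and on-topic?
    instruction_adherence: float  # does it follow multi-step instructions?

    def overall(self) -> float:
        # Equal weights here; a real rubric would weight by business priority.
        scores = [self.accuracy, self.safety, self.coherence, self.instruction_adherence]
        return sum(scores) / len(scores)


result = EvalResult(accuracy=8.5, safety=10.0, coherence=9.0, instruction_adherence=7.5)
print(f"overall evaluation score: {result.overall():.2f} / 10")
```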
Let's talk about the third slide:
the business impact of robust evaluation.
LLMs are currently being used for so many things.
For example, writing code.
Earlier, we used to do simple Java upgrades and all those things, and
everyone had to spend hours and hours writing code.
Even if it was just one line of code, you still needed to test it.
You still needed to deploy it.
You still needed to make sure that it was not breaking things.
For a normal developer, a Java upgrade was at least a day of work.
With an LLM, it is like five minutes of work.
It just puts up the CR for you,
puts it in your pipelines, runs the test cases and all those things.
So it's providing more business impact.
Failure reduction is one of the categories:
how much failure reduction are we seeing, something like a 30% failure reduction.
Think about what a human could catch.
We used to do manual testing by reading through test cases, but we are humans, right?
We get tired after a ten-hour shift, a nine-hour shift.
My eyes stop working.
I am not able to figure out which test cases are left,
how many passed, how many are relevant, how many failed.
With LLMs, they don't get tired.
They can always execute your test cases.
They can always run your integration tests.
They never get tired.
Reliability.
It has gone up, because we are training LLMs every day,
we are providing a lot of data, so it's always going up.
Faster iterations: as I said, it never gets tired,
so you can run it as many times as you want.
Distributed evaluation architecture.
Now let's talk about real time monitoring.
Every service that an LLM is going into needs real-time
monitoring; it has to be there.
Every time the model is changing, we are changing the model on the backend,
a new model is coming.
We want to assess it, we want to see whether it is doing the right thing or not.
How do you do it? You get the evaluation results before,
then you put in the new model and you get the evaluation results after,
and then you compare them, and then you say: oh my God, now I can see
whether my model is working better before or now.
And if the performance is degrading, it alerts.
So that's how it works.
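Here is a simple sketch of that before/after comparison; the score dictionaries, the tolerance, and the print-based alert are all stand-ins for whatever evaluation store and alerting system is actually in place.

```python
# Hypothetical evaluation scores (0-10) captured before and after a model swap.
baseline_scores = {"accuracy": 8.2, "coherence": 8.9, "safety": 9.5}
new_model_scores = {"accuracy": 7.1, "coherence": 9.0, "safety": 9.4}

DEGRADATION_TOLERANCE = 0.5  # how much of a drop we accept before alerting


def check_for_regressions(before: dict, after: dict) -> list:
    """Compare per-metric scores and collect any that degraded too much."""
    alerts = []
    for metric, old_score in before.items():
        new_score = after.get(metric, 0.0)
        if old_score - new_score > DEGRADATION_TOLERANCE:
            alerts.append(f"{metric} dropped from {old_score:.1f} to {new_score:.1f}")
    return alerts


for alert in check_for_regressions(baseline_scores, new_model_scores):
    print(f"ALERT: {alert}")  # in production this would page the team
```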
Scalable infrastructure.
It has to be really scalable.
You can run thousands of utterances against your LLM.
It may not always give you precise results,
but the infrastructure is scalable.
Hub and spoke testing: specialized evaluation models
for different capabilities,
enabling targeted assessment for reasoning, knowledge retrieval, and creative
generation, is one of the key parts.
Let's go back to the human element.
Humans are really very important.
Sometimes the LLM needs to be tested by a human.
An LLM can test itself, but a human has to
know what is right and what is wrong.
At least at this moment.
Once LLMs grow up, maybe we won't need humans, but currently expert
review is needed for all the test cases and all the data that we are
using to test an LLM for accuracy.
We need to make sure that it has been reviewed by a human.
Then expert review feedback loops.
Whatever problems we are seeing with the output of the LLMs
also need to go through a human, so that they can
provide feedback on whether it is good or not.
Bias detection.
If a human is finding there is a problem in the output, there
is a problem in the input.
We need to solve those problems before sending it out.
Adversarial testing.
Edge case generation.
That's the next slide that we are talking about.
Edge case generation is strategically creating challenging and
unexpected scenarios, right?
You need to prove the LLM holds up, because users can send any kind of utterance to it.
There should be guardrails.
You cannot ask it a bad question, right?
The guardrails should catch it,
and it says: oh, this is a question that I don't support.
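As a toy illustration of that guardrail idea, here is a keyword-based check in Python; a real system would use a safety classifier or a policy service rather than a keyword list, and the blocked topics, refusal message, and fake_llm stand-in are all made up for the example.

```python
# A toy guardrail: block requests that touch obviously harmful topics.
# Real systems would use a trained safety classifier, not a keyword list.
BLOCKED_TOPICS = ("hurt", "destroy", "weapon")

REFUSAL = "Sorry, this is a question that I don't support."


def guarded_answer(question: str, llm_call) -> str:
    """Run the guardrail before the question ever reaches the model."""
    if any(topic in question.lower() for topic in BLOCKED_TOPICS):
        return REFUSAL
    return llm_call(question)


def fake_llm(question: str) -> str:
    # Stand-in for the real model call.
    return "The weather is awesome. It's 62 degrees."


print(guarded_answer("How is the weather today?", fake_llm))
print(guarded_answer("How do I destroy something?", fake_llm))
```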
Red team exercises, every time.
We need to try to mislead it, to break it, to do some
security vulnerability testing on it.
You also need to make sure that
whatever data you are sending to your LLM is not
going to be opened up to the public.
Sometimes you are working on critical projects
that are not meant for the public.
Maybe they are going to be published two months from now, but they are going
to change how your industry works.
So you need to make sure those things are not open to anyone else.
That is very important.
Stress testing.
Stress testing is also really important, because you need to make
sure that your LLM can support thousands and millions of requests.
Everybody is growing, right, so your service should be able
to support all of that load.
You should be able to do the stress testing beforehand to make sure that
it is able to support a high volume of queries and complex queries.
That is really important in the testing as well.
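Here is a minimal concurrency sketch of such a stress test using a thread pool; the call_llm function is a placeholder for the real request to your service, and the request count, concurrency, and simulated latency are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_llm(prompt: str) -> str:
    """Placeholder for the real request to your LLM service."""
    time.sleep(0.05)  # simulate network plus inference latency
    return "ok"


def stress_test(num_requests: int = 1000, concurrency: int = 50) -> None:
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(call_llm, [f"query {i}" for i in range(num_requests)]))
    elapsed = time.time() - start
    print(f"{len(results)} requests in {elapsed:.1f}s "
          f"({len(results) / elapsed:.0f} req/s at concurrency {concurrency})")


stress_test()
```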
Then we talk about testing compliance integration: documentation.
Every time you are doing anything, you document it.
You write test results, you write evaluation methodologies,
you write limitations; everything has to be documented.
That's a key part of any testing.
Transparency: implement clear reporting
mechanisms that articulate how AI decisions are evaluated.
Make sure that you have the evaluation in place,
and always keep your past evaluation results and your new evaluation results.
Regular audits: as I said, keep auditing.
Keep auditing your data.
Make sure that you are adding new data.
Make sure you are removing the data that has gone stale.
Geographic adoption.
Make sure that wherever you are, the local standards are followed: if you're operating in
India, then you need to make sure that the standards of India are followed.
If you're in the US, then the standards of the US are followed.
Evaluation trends: simulation-driven multimodal evaluation and continuous
evaluation have to be there.
Any service that is using an LLM, and at this moment all the
services are using LLMs, needs enough testing in your CI/CD pipeline,
so you should be able to test it there.
It could be related to the input, it could be related to the output.
And it is not only about testing your services: the
model itself can be evaluated directly,
without even going through your business logic.
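As a sketch of what such a pipeline gate could look like, here is a pytest-style test; run_evaluation is a placeholder for whatever harness actually sends your prompts to the LLM and scores the responses, and the minimum scores are illustrative.

```python
# test_llm_eval.py -- runs in the CI/CD pipeline; the build fails if scores dip.


def run_evaluation() -> dict:
    """Placeholder: run your evaluation data set through the LLM and score it 0-10."""
    return {"accuracy": 8.4, "safety": 9.7, "coherence": 8.8}


MINIMUM_SCORES = {"accuracy": 7.0, "safety": 9.0, "coherence": 7.0}


def test_llm_meets_minimum_scores():
    scores = run_evaluation()
    for metric, minimum in MINIMUM_SCORES.items():
        assert scores[metric] >= minimum, (
            f"{metric} score {scores[metric]:.1f} fell below the minimum {minimum:.1f}"
        )
```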
Continuous evaluation is very important.
Keep evaluating continuously.
Learn from your mistakes, understand what the old result was, and
then make sure the new one is better.
Implementation roadmap.
Oh, this is very important to learn.
Before you are going to evaluate an LLM, or evaluate an application that
uses an LLM on the backend, first you need to define the evaluation criteria.
What are the metrics you are going to measure in this evaluation,
whether it's similarity metrics or any other metric that is
important for evaluating an LLM infrastructure?
Next comes the infrastructure: once you have the
documented data and the metrics,
you build the infrastructure, right?
You create a service, you create a UI.
You make sure that your service has access to your LLM, and your service
has access to your knowledge bases.
You use the UI to show how the previous golden data set was executed against
your LLM, what the result was,
and whether the evaluation criteria were met or not.
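Here is a small sketch of running a golden data set through the model and recording whether the criteria were met; the data set rows, the call_llm helper, and the scoring function are placeholders for the real reviewed data and evaluation logic.

```python
# A tiny golden data set: prompts paired with reviewed, expected answers.
GOLDEN_SET = [
    {"prompt": "How is the weather today?", "expected": "A short weather summary."},
    {"prompt": "Summarize this order status.", "expected": "A one-line status."},
]


def call_llm(prompt: str) -> str:
    return "The weather is awesome. It's 62 degrees."  # placeholder model call


def score(expected: str, actual: str) -> float:
    return 8.0  # placeholder for a similarity or judge score (0-10)


def run_golden_set(threshold: float = 7.0) -> None:
    for row in GOLDEN_SET:
        response = call_llm(row["prompt"])
        s = score(row["expected"], response)
        verdict = "met" if s >= threshold else "NOT met"
        print(f"{row['prompt'][:30]:<30} score={s:.1f} criteria {verdict}")


run_golden_set()
```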
Integrate human review.
It always has to be there, because you don't know when the test data is stale,
when the golden data set that we are using to evaluate the LLM has gone bad
or needs to be renewed,
or even whether the LLM response is correct or not.
So always have a human in there.
Continuous improvement.
Learn from your mistakes.
Whatever bad thing happens,
if the LLM was bad previously, learn from it.
Automate the evaluation process.
Try to automate it.
Make it part of your CI/CD pipelines.
Make it part of automated deployment.
Don't leave it so that you have to run it
manually every time; that's not gonna work.
So you always want to put it in your automation.
Implement compliance monitoring.
Make sure that you are not using any production data, you're not
using any restricted data in there.
The same thing: driving innovation with confidence.
Yes, evaluate, improve, deploy, monitor, and learn.
Keep repeating the same loop: evaluate, improve, deploy, monitor;
evaluate, improve, deploy, monitor, the same thing.
Keep moving through the same circle and make sure that you keep that confidence.
Yep, that is all I can talk about on LLMs today.
It was really great talking to you, and I hope
I will be able to give some more presentations in the near future.
Thank you so much, and we'll talk to you later then.
Bye-bye guys.
Bye.