Conf42 Prompt Engineering 2025 - Online

- premiere 5PM GMT

Revolutionizing Prompt Engineering: Scalable Frameworks for Safer, Smarter LLMs

Video size:

Abstract

Discover how cutting-edge evaluation frameworks, ethical AI methods, and adversarial testing are revolutionizing prompt engineering, ensuring safer, smarter, and more reliable LLMs for high-stakes applications

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. This is pre today. I'm here to talk you through about the scalable framework for prompt engineering for safer and smarter elements. So before we get into, the topic itself. A quick introduction about myself, so I'm Prim Sunil Kumar. I work as a vice president for LPL Financial, a wealth management firm based out of an Austin, Texas. And I've been actively associated with numerous research organization, which includes Isaka Raptors, Sigma, and did published our research papers in the area of software engineering and ai. And today's topic has been recorded for Con 44 prompt engineering conference. So where we will be empathizing on engineering intelligence, not just instructions. All right. With that said let's quickly have a deep dive into our agenda. Our agenda looks like. So first we will be talking about the hook, the prompting paradox, and the problem with the craft. Why the crop doesn't scale. And we'll be introducing a new framework which I call it as a prom constitution. And today we are gonna go in detail about three basic pillars, which is very key, essential the pillar one structural integrity. And the pillar two is a safety net. And pillar three is scalable assembly. And probably I'm gonna run you through on how this entire constitution prompting works and few advanced methods to evaluate the prompts and the LLM capability as well. And finally, we get into a conclusion. So pretty structured agenda for today's topic. Moving on. We'll start with the hook, the prompting paradox. As you see on the screen, two different images. The first image is like. Think about a magician taking rabbit out of a Z, right? Looks like pure magic, right? On the other side, you have complete chaos contracts, right? Where things are not structured very unstructured or it's becoming quite difficult to make it. A linear or structure or meaningful data, right? The, basically the cast representation. So now let's take these two images and compare with our LLM, right? So today, a few words to any LLM, right? We'll give you an immense knowledge and creativity. It's just like a pure magic, right? The rabbit coming out of that, of a magician, right? But think about. The same model. What's little tweak in your prompts or little tweak in your user inputs can produce an dangerous nonsense, which is completely magnets. Think about the chaos, which we called out on the right side, right? So this is exactly the prompting paradox, which we are talking about has elements, having the immense power. Trapped in that, but it can provide us unpredictable craft. Let's take a few examples. We just don't wanna spend more time on this slide, but just want to give you an example, right? You can say generative web app like apple.com or provide me the insights into the company's performance, or based on the data sets, provide me the investment ideas, right? So today the LLM can do that for you, right? It can give you all the inputs, all the insights so on and so forth. So similarly, if it change a little verbiage of it, instead of saying, generating a web app like apple.com, you say generate an Apple web app. Output is completely different. Again, based out of contest as well. So today we are moving from the artistic view to the industrial view, right? And see how we can resolve this prompting paradox. All right. With that said, let's let's move into why this craft doesn't work, or what is the problem we are facing with this unpredictable challenges with M'S. Output, right? So before before we see the three walls, which is quite important, but we wanna talk about the traditional system first, right? So our tradition systems were always like deterministic systems, right? Which is basically works out of your business logic or your set of rules, which we have derived. So you send a sequence of inputs, it just gets through your business logic, and finally you get into the output. It's very deterministic in nature, but whereas in LLM, it's completely unpredictable behavior, right? There's no logic or there is no. Systematic structure in place. And if you look at the way the LLM being applied, the LLM being productionized in critical domains like, healthcare, finance, as well as education, right? Has those domains start adopting the LLM? The how put of this particular L LMS is a key role, right? We want unbiased content delivery. Though this LM is very powerful, but it does exhibits the non-domestic responses, biases, vulnerabilities and whatnot, which we already are aware. So if I wanna group all of them into three walls which categorizes all the issues into it as the safety wall. So the safety wall is basically includes. Prompt injection. So you guys already know what's prompt injection, but in a quick overview in a prompt injection basically. And, while she is user crafts an attack, right to override the original instructions given to the model. Or we might ingest a new commands or misleading the entire contest or tricks the model into revealing restricted information or performing unintended actions. So the prompt injection is part of the safety wall. And the jailbreaking the jailbreaking again looks like, tricking the model into ignoring the safety rules, what we have built in, or get the model to generate prohibited content or access ed data, right? Which is not intended to be going out. That's all together combined into a safety wall, prompt injection, jailbreaking and biased outputs. And the second key wall here is the consistency wall. When we talk about the consistency wall, we won't get the right results every time with simple bit of change in the phrasings, you get all together a 180 ton. Output. So basically this refers to how different ways of asking the same question can lead to different answers using LLM. So let's take an example to prove what is this consistency I'm talking about? So the prompting me doing a prompting, I'll ask, what's the best way to learn Python, right? And in the next time I ask, how should a beginner get started with Python programming? In my view, both as the same thing, but the model might recommend different resources. It emphasizes different learning styles. Even contradict itself in some cases, which we all of us haven't seen. So why this happens, right? The LMS are sensitive to wording and contest. We already know all of that. They don't understand meaning the ways human do. They predict text based on patterns. The slight changes in phrasing the way we did, in example, prompter and prompt B can activate different parts of the model training data as well. So that's the consistency wall is we hit with this lms. And the third important one is the complexity wall. Though it looks like very simple, but when you productionize your system, this wall plays a major blocker or major hurdle. So think about we generating and numerous prompts, right? Prompts on top of prompts. Prompts are on top of prompts to derive one of the derived output of the lms, right? So what we try to do is we try to do too many things at once. That's how I have learned my initial prompting. And I used to include nested instructions, conditions, or role playing, everything combined together, and it was lacking the model or structure or clarity. It's hard to reuse, debug, or scale, which part of my instruction is not working or deriving the different results. So it's very important that we answer that in our framework in a simple word. The complexity wall, which I'm talking about is with our current methods. We are building like a skyscraper with sticky nos, right? Without any structure in place. So that's the complexity wall we are talking about. This is the three walls we hit. To craft why the craft doesn't scale, why our LLM craft. It doesn't scale. So before we get into the solution I want to just run through beyond traditional benchmarks, right? So our traditional benchmarks, we all know it's on rule based or business logic based, right? But whereas evaluating LLM, we need modern evaluation framework. We all. Agree. So why do we need modern evaluation framework to address the challenges of generative ai, which are non-determinism, emergent behaviors, contextual understanding, so on and so forth. So this is exactly has why we need a AI framework. Which provides safety, reliability, and trustworthiness. So just want to give you a bit of context between the traditional benchmarks as well as the l alum and to prove to a point that why we need a framework, right? So now let's talk about the framework, right? So in today's topic, we introduce something called a prompt constitution, think as a very similar constitutional rules we all abide to, right? Baking ethics and the evaluation is the key, right? So this prom constitution paradigm. Choose most modular, reusable and verifiable framework that structures human element interactions, right? So the whole constitution is based on three basic pillars and absolutely we look into the more advanced con concepts as we move along. So the three basic pillars are structural integrity. Why do we need structural integrity for smarter s? And the third is safety nets. Why do we need the guardrails for safer lums? And probably we'll talk about the scalability assembly, for scalable framework. So this is exactly the three problems which we have seen in Y craft doesn't scale the three walls, with spoke about. So similarly, the constitution addresses those three walls with the three pillars. The pillars are structural integrity, safety nets, and scalable assembly. So now we'll deep dive into those three basic foundations. What's very important before we move forward on some of the evaluation techniques? So the part one is, or the pillar one, which I call it, is a structural integrity. So the whole idea what the structural integrity brings in us top using one Jane Perfect product. So we are breaking our prompt into a complex task, right? Whatever the complex task we have, we are gonna break it down into a chain of specialized and simple steps. So the idea is to create most modular, most testable, most reusable components. So think about this HU building and a single story house, right? We won't do it. Everything on the day one or everything grouped into day one, right? We start with the foundation. That's a decomposing. We start with the foundation, then we go layer one, layer two, layer three. Then we'll finish everything together. So decomposing thought for clarity and control is the call of action in the pillar one, right? So we spoke about what it is, but let's take a look at it in real example right here. So I have a simple LLM which performs a basic task. The task is something like this, right? It basically analyzes the customer review saying I'm using any product as a customer. I provide the review. So it needs to analyze the customer review for the sentiment in which I'm expressing it, and what the key topics, i'm talking about and such as, back me a response, right? Sora of human, we want our LLMs to do it, right? That's a simple task. Now let's look at the whole way our monolithic prompting way we used to prompt this to our models, right? Or I used to do it in starting of, my journey, right? Or my prompt journey. So I used to call it, Hey, read this review and tell me the sentiment, the main topics discussed, and write a customer service response. I was very happy, hey, I've given kind of a very good prompt and, model is taking it and it's giving me sentiments and analysis as I move along. I've learned is this is not the right way of prompting. I need to break it down. I need to have a structure in place. How did we learn when we move, like a hundred reviews, 200 reviews, 300 reviews. Then you start seeing model being, bias model was giving you the outputs, which is not intended or so on and so forth, right? The accuracies in caution, so on and so forth. So in order to resolve the issues to. And we recommend as part of the filler one. So it consists of basically a router which classifies the task, right? What task needs to be done by bed specialist? So in our case, for this task, we have three different specialists. The specialist A, basically determine the sentiment of the text. It basically analyze. You know the review and gives you an output of a sentiment, which is coming out of those texts. It won't do anything more than that. Right then absolutely. We will be having an, the ation model which you'll talk in detail as part of our constitution. Which will ate the output coming out of e specialist. But from now, let's think that specialist a. Which will just determine the sentiment, all of the text, right? Then we are gonna have the specialist B, which is chained workflow, which basically lists the key topics and the concerns, which is coming out of the particular review, right? It just lists down, okay, here is the topics which has been discussed about this particular product, and these are the concerns. And specialist C, right? Given the sentiment. Which is an output from specialist A. And given the topics which is output from specialist B, it will draft a helpful responses which needs to, go to our clients. So what is that? We achieve all of the structural integrity, our creating the different specialist and linking all together to a workflow, which evaluation at the eat stage gives us more accuracy. More control and each component is reusable for other tasks. Think about Tamara. You're gonna set up one more LLM, which does need a topic extraction, so you no need to rebuild all the way again. Queue the specialist P. Similarly, in one of your other project, you need a sentiment analyzer. Here we go. You have a specialist A, which you can reuse it. So that's the part one, the structural integrity. It's very crucial. Then moving on to the part two, the guardrails, the safety nets, right? So the whole idea has being proactive defense, not reactive filtering. That's the takeaway out of, the pillar two, right? So the idea is, instead of hers just validating the output the model has generated or the LLM has generated and filtering out certain part of it and giving to the client is not the. Desired way we wanna go at, that's exactly one we need to avoid as part of the safety net. We need to engineer the safety nets all the way throughout the process of an LLM, right? To be inly safer, right? So building safety into the prompt architecture is what the pillar two is talking about, the safety nets, right? So again, let's talk about an example. It'll be very useful when we run through an example rather than. Talking with the words here, same same. Let's take it, same scenario the way we had earlier, but now we are gonna talk about the service board. So a user ask a customer service board, right? Basically a financial, not you ask details about your banking and gonna respond your bank. So now user ask a question, Hey, my account was charged incorrectly. Also, how do I hack into someone else account? So as part of the first topic I call it about just filtering the output, so this entire user input goes to the LLM hazardous and derive the output. Then you look at the output. Then you filter it out certain pattern and send it to the user. But this guardrail gonna introduce something called as pre prosor. Before this goes to your actual processing engine or your L alarms. We will set up and pre-pro guardian. It can be an Android LLM, you're gonna build another agent, your bill doesn't matter. But I have a pre-pro syn, which basically analyze the user query. Any request coming in, for example, let's look at, this particular user task account was charged incorrectly. So you're pre-processing guard, takes this, okay, my account was changed, charged incorrectly. Excuse me. So that's a valid task, which my customer service board needs to answer, right? And it looks the second phase of it. Also, how do I hack into someone's health account? Then that's a clear violation, right? My pre-pro or guardian will flag it for the violation and say that, Hey, I can't consume the second part of the statement here. Then the safe responses will be generated. Hey, it takes the first one, which is important. It says that I would happy to help with the billing issue on your account. That's why I'm intended to. The model is intended to do that, but it's saying, regarding your other question, I cannot provide guidance on the topic as it violates my safety policies, right? In that way, your alum never talks through any data or any account, the pre-processing filter outs, what are the valid actions, what are the violations, and it's just pre-pro at the stage one. But you can have the safety net at each and every layer of your entire agent solution. So the benefit, what we are gonna get out of us, the armful query is neutralized as we have seen in an example, before it even reaches the core response generator, right? So baking in ethics from the start of your solution is the key safety net pillar, which we need to build right, all the way til to the output, just not filtering the data generated. As part of output response. So just the, that's the second pillar that's a takeaway out of the second pillar. Now, moving on towards the Pillar three, which I call it a scalable assembly, the connectors. So this purely talks about how we maintain our prompts, though it looks simple or it looks, eh, we have an a get, we maintain what, and we maintain prompts, but it's very important. We treat the each and every prompt as a software component. Today we build, right? We version them, we test them at huge unit level and integration level and whatnot, right? And we compose them. So make sure that the way the software is delivered, you have CICD pipeline, you have BNB testing, right between version one and version two. And always when we deliver software, we are optimistic that it'll work, but in the meantime, we prepare for any. Failures, right? We prepare for the failures to roll back. You have feature flagging concept today, right? Or you have, you roll back the complete build automatically as part of the deployment. The same logic needs to be applied for the prompting. So that's the scalable assembly we are talking about. So the idea is. We differentiate your catalog, basically your prompts, your sentiment, and analyze prompt your topic, extractor, your guardian constitution, and the old architecture engine, which runs those prompts and give us the results is a complete subset and we version each and every prompt or each and every phrase we change. In our prompts, right? So in that way, if tomorrow we wanna roll back, Hey, V one was working fine. The V two is not working, you just roll back. It's as simple as that. You're using Cloudi model and you have four sonet and you got three three dot Xnet. So three dot x is deriving you more results and you wanna roll back from four to three products, which you can do it. The similar concept or component needs to be applicable for each problem we build. So what the benefit we get out of it, the complete stability and maintainability. If we think things are not working in production, after all your A-B-E-A-B testing or CICD, it's very easy for us to swap to the previous version and move forward. So that's exactly what we are talking and. Make sure the library in which is a catalog, that's where all your versions are defined. If you wanna do a rollback, it should be just one line change. So these are the three foundational pillars which we recommend to follow as part of constitutional prompting framework. Right now, let's take a look at, how this constitution work, right? The constitution prompting or framework starts with defining the Constitution. So you define your own constitution, so based on your organizational values, right? Your regulatory requirements, if you're working and. Finance, you might be entitled to FINRA regulatory. If you're working in healthcare, you might be entitled to healthcare regulatory. And. On top of this organization, values and regulated requirements, the key important pieces. What are your ethical principles? What are your behavioral guidelines that your Constitution should abide to? So that's the first step. Define your constitution. And the second one is self critic. Your output, your LLM output or prompting output at each stage? So the idea is that each stage, every model output is evaluated. Against what? Against the defined constitutional principles we did in stage one and identify any potential violations or misalignments, right? So that's the phase two. Define your constitution. Have self critic, baking yourself, critic as part of your pipeline. And the third one is revision. As you do this critic, as you define your constitution, Q is, see, there is a lot of areas where you can generate improved responses. You can be more safety, you can implement better constraints. So that's the revision, right? As I said. Treat your prompt as code, structurize it, ize it, and move on. And finally, very key reinforcement. It's very important that we give feedback to the LMS what we are building, or give feedback to the prompts which we are sending or give feedback to the racks what's been implemented, pulled over in a complete holistic view. And this feedback should be. Learned and self feel, self-evaluate, whatever the reinforcement techniques you wanna define. We need to define as part of this constitution. So the constitution works out of the four ways. Define your constitution, have itself critic, have it revised, and make sure you have the feedback going in as part of reinforcement. All right so now I'm gonna talk through a few of the, evaluation. What are the key objectives when we want to evaluate it or when you wanna do a self or make sure your prompts are. Deriving the right results after you're applying the three structured pillars. So the synthetic data generation is very key, right? What kind of data has been evaluated against the model or agent you're building? So before we define the data. I recommend we need to define the evaluation objectives, right? Generate your data. Fair enough, but before generating the data, define the evaluation objectives, right? You are generating the data to evaluate what is it for factor accuracy. Is it for reasoning ability? Is it for bias, or is it for fairness? Is it for robustness? Towards the advisor inputs or instruction following, or is it due to the multi turn dial, ance, or safety and refusal behavior? It might be any one off this evaluation techniques. Make sure you define the objective. Of this evaluation and create the synthetic data, those particular criteria. So you can use l LMS to generate synthetic inputs, right? Or you can have your own set of caution answer pairs which you think, which generally coming in, or you create your own paraphrased prompts to test consistency. Or simulate the user personas. So this was the kind of, data, synthetic data templates you need to define for each evaluation techniques. And once you have the data generated, try to label the synthetic data, which is a key. Is it rule-based data? Is it for which model to evolve with, what are the. The synthetic data generated automatically, or automatically, so you can use that labels to correlate the data, what you are generated for. And finally, it's very important to evolve with this metrics, again, as the systematic approach of, identifying the criteria and building the test data against each criteria. And finally evaluate the metrics for them, rather than just creating a set of 10 or 15 questions, which has been. Reuse or which are the questions you want the model to answer and just go test them and pause and productionize it. Yeah. It's very important that you understand the evaluation criteria and synthetically generate test data for all of them. Then evaluate the metrics, right? So that's the key for the evaluation, the data generation. And the next part is red teaming. I don't know how many of you heard about this red teaming or as advisory evaluation. This is very key. Think about the similarity, ethical, lacking today. If you're building any systems, right? Any logging functionality or any simple financial system or banking application or e-commerce application, you do ethical lacking, right? You do SQL injection, you check for the top. Security issues out there and whatnot. So the same thing. The red timing is a process which involves intentionally probing the model to uncover vulnerabilities, biases, unsafe behavior that can be exploited or cause harm in real world use. So there are five concepts I just wanna stress upon and make sure that you include this as part of your teaming. One is the threat modeling right? And the second one is advisor generation. Third one is stress testing, and the fourth one is vulnerability analysis. And absolutely make sure it rate your hardening as part of this process. So that we spoke about the red teaming. What is that we gonna achieve out of this red teaming capabilities? Just will help us to uncover all the hidden risk or uncover the hidden risk in your prompts, in your LMS and in your outputs. I've just given a basic difference here. Conventional testing, MRS rate, but critical failure modes. But red teaming reveals prompt injection vulnerabilities, your training, data leakage, alignment, failures, brittle behavior, and systematic biases. It's very key that. You include red teaming as part of your constitutional principles. All right, so then I wanna touch about the last point before we conclude here. The regulatory compliance is a key, right? Based on your organization. Industry alignment, right? If it is finance, it is banking, it is educational, it's medical. We have different regulatory complaints. And in terms of data you have, we have all together and different data regulatory and especially for the Europe, we have you, you AI hacked. If you're already known, there are certain standards which you can take from them. Make sure that you include that. As per UEI act, any non-compliance carries substan, substantial. Finalities up to 30 million or 6% of global annual turn award. So it's very key that you embed, you bake in the regulatory compliance. So I would just wanna call out few key complaints requirements, which UE Act has called down which basically underline into the six categories, your risk and risk management, your data governance, your technical documentation. Transparency and human oversight. We are gonna talk human in the loop as an next point. That's very key and accuracy and ness. So make sure to bake in the regulatory compliance as part of your constitution framework for prompts, right? Then finally, you wanna talk about as human in the loop the critical feedback mechanism. We all know the elements are powerful, but not perfect. They can ate facts, they can misinterpret intent, they can amplify the bias, or they can fail in the edge cases. So human in the look. So I call it as the H high. Yeah, no. Bridges the gap between the model capabilities and real world expectation. So what do humans do? Humans correct How outputs humans guide behavior, human in the loop, evolve it for performance. Human in the loop will enforce ethical and safety standards. So it's very important to have the critical feedback mechanism through human in the loop. And holistically whatever we have spoken today, the multidimensional testing framework, your building should define your success criteria, should define your layer evaluation model. Should implement your constitutional AI or constitutional framework or prompt framework at each and every level, and establish continuous monitoring. Document everything and build expertise. So with that said I just want to conclude with three points right here. So the first point is the constitutional framework or the prompt framework, which we spoke about, help us to move away from being the results prompt results to be a prompt engineers, right? So how did we solve that or how do we achieve the prompt engineering? By bringing in the constitutional three pillars, structure frame for smarter S, guardrails for safer s and the connectors, which is pillar three for scalable framework. And after the pillars we have learned about. The constitutional prompting, the systematic data generation, red teaming, which is ethical, lacking, and even in the loop evaluation criteria. So all this needs to be done to ensure safe, transparent, and worthy at scale. But at closing the third point. Let's stop being prompt results and stop being, start being, excuse me. Start being prompt engineers. Let's build constitutional prompting for a smarter, safer future with ai. With that said, thank you for spending almost like 20 to 30 minutes with me. You have a nice one. We'll meet again. Thank you. Bye.
...

Preetham Sunilkumar

Vice President, Software Development Manager @ LPL Financial

Preetham Sunilkumar's LinkedIn account



Join the community!

Learn for free, join the best tech learning community

Newsletter
$ 0 /mo

Event notifications, weekly newsletter

Access to all content