Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone.
This is pre today.
I'm here to talk you through about the scalable framework for prompt engineering
for safer and smarter elements.
So before we get into, the topic itself.
A quick introduction about myself, so I'm Prim Sunil Kumar.
I work as a vice president for LPL Financial, a wealth management
firm based out of an Austin, Texas.
And I've been actively associated with numerous research organization, which
includes Isaka Raptors, Sigma, and did published our research papers in the
area of software engineering and ai.
And today's topic has been recorded for Con 44 prompt engineering conference.
So where we will be empathizing on engineering intelligence,
not just instructions.
All right.
With that said let's quickly have a deep dive into our agenda.
Our agenda looks like.
So first we will be talking about the hook, the prompting paradox,
and the problem with the craft.
Why the crop doesn't scale.
And we'll be introducing a new framework which I call it as a prom constitution.
And today we are gonna go in detail about three basic pillars,
which is very key, essential the pillar one structural integrity.
And the pillar two is a safety net.
And pillar three is scalable assembly.
And probably I'm gonna run you through on how this entire constitution
prompting works and few advanced methods to evaluate the prompts
and the LLM capability as well.
And finally, we get into a conclusion.
So pretty structured agenda for today's topic.
Moving on.
We'll start with the hook, the prompting paradox.
As you see on the screen, two different images.
The first image is like.
Think about a magician taking rabbit out of a Z, right?
Looks like pure magic, right?
On the other side, you have complete chaos contracts, right?
Where things are not structured very unstructured or it's becoming
quite difficult to make it.
A linear or structure or meaningful data, right?
The, basically the cast representation.
So now let's take these two images and compare with our LLM, right?
So today, a few words to any LLM, right?
We'll give you an immense knowledge and creativity.
It's just like a pure magic, right?
The rabbit coming out of that, of a magician, right?
But think about.
The same model.
What's little tweak in your prompts or little tweak in your user inputs
can produce an dangerous nonsense, which is completely magnets.
Think about the chaos, which we called out on the right side, right?
So this is exactly the prompting paradox, which we are talking about
has elements, having the immense power.
Trapped in that, but it can provide us unpredictable craft.
Let's take a few examples.
We just don't wanna spend more time on this slide, but just want
to give you an example, right?
You can say generative web app like apple.com or provide me the insights
into the company's performance, or based on the data sets, provide
me the investment ideas, right?
So today the LLM can do that for you, right?
It can give you all the inputs, all the insights so on and so forth.
So similarly, if it change a little verbiage of it, instead of saying,
generating a web app like apple.com, you say generate an Apple web app.
Output is completely different.
Again, based out of contest as well.
So today we are moving from the artistic view to the industrial view, right?
And see how we can resolve this prompting paradox.
All right.
With that said, let's let's move into why this craft doesn't work, or what
is the problem we are facing with this unpredictable challenges with M'S.
Output, right?
So before before we see the three walls, which is quite important,
but we wanna talk about the traditional system first, right?
So our tradition systems were always like deterministic systems, right?
Which is basically works out of your business logic or your set
of rules, which we have derived.
So you send a sequence of inputs, it just gets through your business logic,
and finally you get into the output.
It's very deterministic in nature, but whereas in LLM, it's completely
unpredictable behavior, right?
There's no logic or there is no.
Systematic structure in place.
And if you look at the way the LLM being applied, the LLM being productionized
in critical domains like, healthcare, finance, as well as education, right?
Has those domains start adopting the LLM?
The how put of this particular L LMS is a key role, right?
We want unbiased content delivery.
Though this LM is very powerful, but it does exhibits the non-domestic
responses, biases, vulnerabilities and whatnot, which we already are aware.
So if I wanna group all of them into three walls which categorizes all the
issues into it as the safety wall.
So the safety wall is basically includes.
Prompt injection.
So you guys already know what's prompt injection, but in a quick overview
in a prompt injection basically.
And, while she is user crafts an attack, right to override the original
instructions given to the model.
Or we might ingest a new commands or misleading the entire contest
or tricks the model into revealing restricted information or
performing unintended actions.
So the prompt injection is part of the safety wall.
And the jailbreaking the jailbreaking again looks like, tricking the
model into ignoring the safety rules, what we have built in, or
get the model to generate prohibited content or access ed data, right?
Which is not intended to be going out.
That's all together combined into a safety wall, prompt injection,
jailbreaking and biased outputs.
And the second key wall here is the consistency wall.
When we talk about the consistency wall, we won't get the right results every
time with simple bit of change in the phrasings, you get all together a 180 ton.
Output.
So basically this refers to how different ways of asking the same question can
lead to different answers using LLM.
So let's take an example to prove what is this consistency I'm talking about?
So the prompting me doing a prompting, I'll ask, what's the
best way to learn Python, right?
And in the next time I ask, how should a beginner get
started with Python programming?
In my view, both as the same thing, but the model might
recommend different resources.
It emphasizes different learning styles.
Even contradict itself in some cases, which we all of us haven't seen.
So why this happens, right?
The LMS are sensitive to wording and contest.
We already know all of that.
They don't understand meaning the ways human do.
They predict text based on patterns.
The slight changes in phrasing the way we did, in example, prompter and
prompt B can activate different parts of the model training data as well.
So that's the consistency wall is we hit with this lms.
And the third important one is the complexity wall.
Though it looks like very simple, but when you productionize your system, this wall
plays a major blocker or major hurdle.
So think about we generating and numerous prompts, right?
Prompts on top of prompts.
Prompts are on top of prompts to derive one of the derived
output of the lms, right?
So what we try to do is we try to do too many things at once.
That's how I have learned my initial prompting.
And I used to include nested instructions, conditions, or role playing, everything
combined together, and it was lacking the model or structure or clarity.
It's hard to reuse, debug, or scale, which part of my instruction is not
working or deriving the different results.
So it's very important that we answer that in our framework in a simple word.
The complexity wall, which I'm talking about is with our current methods.
We are building like a skyscraper with sticky nos, right?
Without any structure in place.
So that's the complexity wall we are talking about.
This is the three walls we hit.
To craft why the craft doesn't scale, why our LLM craft.
It doesn't scale.
So before we get into the solution I want to just run through beyond
traditional benchmarks, right?
So our traditional benchmarks, we all know it's on rule based
or business logic based, right?
But whereas evaluating LLM, we need modern evaluation framework.
We all.
Agree.
So why do we need modern evaluation framework to address the challenges of
generative ai, which are non-determinism, emergent behaviors, contextual
understanding, so on and so forth.
So this is exactly has why we need a AI framework.
Which provides safety, reliability, and trustworthiness.
So just want to give you a bit of context between the traditional benchmarks as
well as the l alum and to prove to a point that why we need a framework, right?
So now let's talk about the framework, right?
So in today's topic, we introduce something called a prompt constitution,
think as a very similar constitutional rules we all abide to, right?
Baking ethics and the evaluation is the key, right?
So this prom constitution paradigm.
Choose most modular, reusable and verifiable framework that structures
human element interactions, right?
So the whole constitution is based on three basic pillars and absolutely
we look into the more advanced con concepts as we move along.
So the three basic pillars are structural integrity.
Why do we need structural integrity for smarter s?
And the third is safety nets.
Why do we need the guardrails for safer lums?
And probably we'll talk about the scalability assembly,
for scalable framework.
So this is exactly the three problems which we have seen in Y craft doesn't
scale the three walls, with spoke about.
So similarly, the constitution addresses those three walls with the three pillars.
The pillars are structural integrity, safety nets, and scalable assembly.
So now we'll deep dive into those three basic foundations.
What's very important before we move forward on some of
the evaluation techniques?
So the part one is, or the pillar one, which I call it,
is a structural integrity.
So the whole idea what the structural integrity brings in us
top using one Jane Perfect product.
So we are breaking our prompt into a complex task, right?
Whatever the complex task we have, we are gonna break it down into a
chain of specialized and simple steps.
So the idea is to create most modular, most testable, most reusable components.
So think about this HU building and a single story house, right?
We won't do it.
Everything on the day one or everything grouped into day one, right?
We start with the foundation.
That's a decomposing.
We start with the foundation, then we go layer one, layer two, layer three.
Then we'll finish everything together.
So decomposing thought for clarity and control is the call of
action in the pillar one, right?
So we spoke about what it is, but let's take a look at it
in real example right here.
So I have a simple LLM which performs a basic task.
The task is something like this, right?
It basically analyzes the customer review saying I'm
using any product as a customer.
I provide the review.
So it needs to analyze the customer review for the sentiment in which
I'm expressing it, and what the key topics, i'm talking about and
such as, back me a response, right?
Sora of human, we want our LLMs to do it, right?
That's a simple task.
Now let's look at the whole way our monolithic prompting way we used to
prompt this to our models, right?
Or I used to do it in starting of, my journey, right?
Or my prompt journey.
So I used to call it, Hey, read this review and tell me the sentiment,
the main topics discussed, and write a customer service response.
I was very happy, hey, I've given kind of a very good prompt and,
model is taking it and it's giving me sentiments and analysis as I move along.
I've learned is this is not the right way of prompting.
I need to break it down.
I need to have a structure in place.
How did we learn when we move, like a hundred reviews, 200 reviews, 300 reviews.
Then you start seeing model being, bias model was giving you the outputs, which is
not intended or so on and so forth, right?
The accuracies in caution, so on and so forth.
So in order to resolve the issues to.
And we recommend as part of the filler one.
So it consists of basically a router which classifies the task, right?
What task needs to be done by bed specialist?
So in our case, for this task, we have three different specialists.
The specialist A, basically determine the sentiment of the text.
It basically analyze.
You know the review and gives you an output of a sentiment, which
is coming out of those texts.
It won't do anything more than that.
Right then absolutely.
We will be having an, the ation model which you'll talk in detail
as part of our constitution.
Which will ate the output coming out of e specialist.
But from now, let's think that specialist a. Which will just determine the
sentiment, all of the text, right?
Then we are gonna have the specialist B, which is chained workflow, which
basically lists the key topics and the concerns, which is coming out
of the particular review, right?
It just lists down, okay, here is the topics which has been
discussed about this particular product, and these are the concerns.
And specialist C, right?
Given the sentiment.
Which is an output from specialist A. And given the topics which
is output from specialist B, it will draft a helpful responses
which needs to, go to our clients.
So what is that?
We achieve all of the structural integrity, our creating the different
specialist and linking all together to a workflow, which evaluation at
the eat stage gives us more accuracy.
More control and each component is reusable for other tasks.
Think about Tamara.
You're gonna set up one more LLM, which does need a topic extraction, so you
no need to rebuild all the way again.
Queue the specialist P. Similarly, in one of your other project,
you need a sentiment analyzer.
Here we go.
You have a specialist A, which you can reuse it.
So that's the part one, the structural integrity.
It's very crucial.
Then moving on to the part two, the guardrails, the safety nets, right?
So the whole idea has being proactive defense, not reactive filtering.
That's the takeaway out of, the pillar two, right?
So the idea is, instead of hers just validating the output the model has
generated or the LLM has generated and filtering out certain part of it
and giving to the client is not the.
Desired way we wanna go at, that's exactly one we need to
avoid as part of the safety net.
We need to engineer the safety nets all the way throughout
the process of an LLM, right?
To be inly safer, right?
So building safety into the prompt architecture is what the pillar two is
talking about, the safety nets, right?
So again, let's talk about an example.
It'll be very useful when we run through an example rather than.
Talking with the words here, same same.
Let's take it, same scenario the way we had earlier, but now we are
gonna talk about the service board.
So a user ask a customer service board, right?
Basically a financial, not you ask details about your banking
and gonna respond your bank.
So now user ask a question, Hey, my account was charged incorrectly.
Also, how do I hack into someone else account?
So as part of the first topic I call it about just filtering the output,
so this entire user input goes to the LLM hazardous and derive the output.
Then you look at the output.
Then you filter it out certain pattern and send it to the user.
But this guardrail gonna introduce something called as pre prosor.
Before this goes to your actual processing engine or your L alarms.
We will set up and pre-pro guardian.
It can be an Android LLM, you're gonna build another
agent, your bill doesn't matter.
But I have a pre-pro syn, which basically analyze the user query.
Any request coming in, for example, let's look at, this particular user
task account was charged incorrectly.
So you're pre-processing guard, takes this, okay, my account was
changed, charged incorrectly.
Excuse me.
So that's a valid task, which my customer service board needs to answer, right?
And it looks the second phase of it.
Also, how do I hack into someone's health account?
Then that's a clear violation, right?
My pre-pro or guardian will flag it for the violation and say
that, Hey, I can't consume the second part of the statement here.
Then the safe responses will be generated.
Hey, it takes the first one, which is important.
It says that I would happy to help with the billing issue on your account.
That's why I'm intended to.
The model is intended to do that, but it's saying, regarding your other question, I
cannot provide guidance on the topic as it violates my safety policies, right?
In that way, your alum never talks through any data or any account, the
pre-processing filter outs, what are the valid actions, what are the violations,
and it's just pre-pro at the stage one.
But you can have the safety net at each and every layer
of your entire agent solution.
So the benefit, what we are gonna get out of us, the armful query
is neutralized as we have seen in an example, before it even reaches
the core response generator, right?
So baking in ethics from the start of your solution is the key safety net
pillar, which we need to build right, all the way til to the output, just
not filtering the data generated.
As part of output response.
So just the, that's the second pillar that's a takeaway
out of the second pillar.
Now, moving on towards the Pillar three, which I call it a
scalable assembly, the connectors.
So this purely talks about how we maintain our prompts, though it looks
simple or it looks, eh, we have an a get, we maintain what, and we maintain
prompts, but it's very important.
We treat the each and every prompt as a software component.
Today we build, right?
We version them, we test them at huge unit level and integration
level and whatnot, right?
And we compose them.
So make sure that the way the software is delivered, you have CICD
pipeline, you have BNB testing, right between version one and version two.
And always when we deliver software, we are optimistic that it'll work, but
in the meantime, we prepare for any.
Failures, right?
We prepare for the failures to roll back.
You have feature flagging concept today, right?
Or you have, you roll back the complete build automatically
as part of the deployment.
The same logic needs to be applied for the prompting.
So that's the scalable assembly we are talking about.
So the idea is.
We differentiate your catalog, basically your prompts, your sentiment, and
analyze prompt your topic, extractor, your guardian constitution, and
the old architecture engine, which runs those prompts and give us the
results is a complete subset and we version each and every prompt
or each and every phrase we change.
In our prompts, right?
So in that way, if tomorrow we wanna roll back, Hey, V one was working fine.
The V two is not working, you just roll back.
It's as simple as that.
You're using Cloudi model and you have four sonet and
you got three three dot Xnet.
So three dot x is deriving you more results and you wanna roll back from four
to three products, which you can do it.
The similar concept or component needs to be applicable for each problem we build.
So what the benefit we get out of it, the complete stability and maintainability.
If we think things are not working in production, after all your
A-B-E-A-B testing or CICD, it's very easy for us to swap to the
previous version and move forward.
So that's exactly what we are talking and.
Make sure the library in which is a catalog, that's where
all your versions are defined.
If you wanna do a rollback, it should be just one line change.
So these are the three foundational pillars which we
recommend to follow as part of constitutional prompting framework.
Right now, let's take a look at, how this constitution work, right?
The constitution prompting or framework starts with defining the Constitution.
So you define your own constitution, so based on your
organizational values, right?
Your regulatory requirements, if you're working and.
Finance, you might be entitled to FINRA regulatory.
If you're working in healthcare, you might be entitled to healthcare regulatory.
And.
On top of this organization, values and regulated requirements,
the key important pieces.
What are your ethical principles?
What are your behavioral guidelines that your Constitution should abide to?
So that's the first step.
Define your constitution.
And the second one is self critic.
Your output, your LLM output or prompting output at each stage?
So the idea is that each stage, every model output is evaluated.
Against what?
Against the defined constitutional principles we did in stage
one and identify any potential violations or misalignments, right?
So that's the phase two.
Define your constitution.
Have self critic, baking yourself, critic as part of your pipeline.
And the third one is revision.
As you do this critic, as you define your constitution, Q is, see,
there is a lot of areas where you can generate improved responses.
You can be more safety, you can implement better constraints.
So that's the revision, right?
As I said.
Treat your prompt as code, structurize it, ize it, and move on.
And finally, very key reinforcement.
It's very important that we give feedback to the LMS what we are building, or
give feedback to the prompts which we are sending or give feedback to the
racks what's been implemented, pulled over in a complete holistic view.
And this feedback should be.
Learned and self feel, self-evaluate, whatever the reinforcement
techniques you wanna define.
We need to define as part of this constitution.
So the constitution works out of the four ways.
Define your constitution, have itself critic, have it revised,
and make sure you have the feedback going in as part of reinforcement.
All right so now I'm gonna talk through a few of the, evaluation.
What are the key objectives when we want to evaluate it or when you wanna
do a self or make sure your prompts are.
Deriving the right results after you're applying the three structured pillars.
So the synthetic data generation is very key, right?
What kind of data has been evaluated against the model
or agent you're building?
So before we define the data.
I recommend we need to define the evaluation objectives, right?
Generate your data.
Fair enough, but before generating the data, define the
evaluation objectives, right?
You are generating the data to evaluate what is it for factor accuracy.
Is it for reasoning ability?
Is it for bias, or is it for fairness?
Is it for robustness?
Towards the advisor inputs or instruction following, or is it
due to the multi turn dial, ance, or safety and refusal behavior?
It might be any one off this evaluation techniques.
Make sure you define the objective.
Of this evaluation and create the synthetic data, those particular criteria.
So you can use l LMS to generate synthetic inputs, right?
Or you can have your own set of caution answer pairs which you
think, which generally coming in, or you create your own paraphrased
prompts to test consistency.
Or simulate the user personas.
So this was the kind of, data, synthetic data templates you need to
define for each evaluation techniques.
And once you have the data generated, try to label the
synthetic data, which is a key.
Is it rule-based data?
Is it for which model to evolve with, what are the.
The synthetic data generated automatically, or automatically, so
you can use that labels to correlate the data, what you are generated for.
And finally, it's very important to evolve with this metrics, again,
as the systematic approach of, identifying the criteria and building
the test data against each criteria.
And finally evaluate the metrics for them, rather than just creating a set
of 10 or 15 questions, which has been.
Reuse or which are the questions you want the model to answer and just go
test them and pause and productionize it.
Yeah.
It's very important that you understand the evaluation criteria and synthetically
generate test data for all of them.
Then evaluate the metrics, right?
So that's the key for the evaluation, the data generation.
And the next part is red teaming.
I don't know how many of you heard about this red teaming
or as advisory evaluation.
This is very key.
Think about the similarity, ethical, lacking today.
If you're building any systems, right?
Any logging functionality or any simple financial system or banking
application or e-commerce application, you do ethical lacking, right?
You do SQL injection, you check for the top.
Security issues out there and whatnot.
So the same thing.
The red timing is a process which involves intentionally probing the model
to uncover vulnerabilities, biases, unsafe behavior that can be exploited
or cause harm in real world use.
So there are five concepts I just wanna stress upon and make sure that you
include this as part of your teaming.
One is the threat modeling right?
And the second one is advisor generation.
Third one is stress testing, and the fourth one is vulnerability analysis.
And absolutely make sure it rate your hardening as part of this process.
So that we spoke about the red teaming.
What is that we gonna achieve out of this red teaming capabilities?
Just will help us to uncover all the hidden risk or uncover the
hidden risk in your prompts, in your LMS and in your outputs.
I've just given a basic difference here.
Conventional testing, MRS rate, but critical failure modes.
But red teaming reveals prompt injection vulnerabilities, your training, data
leakage, alignment, failures, brittle behavior, and systematic biases.
It's very key that.
You include red teaming as part of your constitutional principles.
All right, so then I wanna touch about the last point before we conclude here.
The regulatory compliance is a key, right?
Based on your organization.
Industry alignment, right?
If it is finance, it is banking, it is educational, it's medical.
We have different regulatory complaints.
And in terms of data you have, we have all together and different data
regulatory and especially for the Europe, we have you, you AI hacked.
If you're already known, there are certain standards which you can take from them.
Make sure that you include that.
As per UEI act, any non-compliance carries substan, substantial.
Finalities up to 30 million or 6% of global annual turn award.
So it's very key that you embed, you bake in the regulatory compliance.
So I would just wanna call out few key complaints requirements, which UE
Act has called down which basically underline into the six categories, your
risk and risk management, your data governance, your technical documentation.
Transparency and human oversight.
We are gonna talk human in the loop as an next point.
That's very key and accuracy and ness.
So make sure to bake in the regulatory compliance as part of your constitution
framework for prompts, right?
Then finally, you wanna talk about as human in the loop the
critical feedback mechanism.
We all know the elements are powerful, but not perfect.
They can ate facts, they can misinterpret intent, they can amplify the bias,
or they can fail in the edge cases.
So human in the look.
So I call it as the H high.
Yeah, no.
Bridges the gap between the model capabilities and real world expectation.
So what do humans do?
Humans correct How outputs humans guide behavior, human in the
loop, evolve it for performance.
Human in the loop will enforce ethical and safety standards.
So it's very important to have the critical feedback mechanism
through human in the loop.
And holistically whatever we have spoken today, the multidimensional
testing framework, your building should define your success criteria, should
define your layer evaluation model.
Should implement your constitutional AI or constitutional framework or prompt
framework at each and every level, and establish continuous monitoring.
Document everything and build expertise.
So with that said I just want to conclude with three points right here.
So the first point is the constitutional framework or the prompt framework,
which we spoke about, help us to move away from being the results prompt
results to be a prompt engineers, right?
So how did we solve that or how do we achieve the prompt engineering?
By bringing in the constitutional three pillars, structure frame for
smarter S, guardrails for safer s and the connectors, which is pillar
three for scalable framework.
And after the pillars we have learned about.
The constitutional prompting, the systematic data generation, red teaming,
which is ethical, lacking, and even in the loop evaluation criteria.
So all this needs to be done to ensure safe, transparent, and worthy at scale.
But at closing the third point.
Let's stop being prompt results and stop being, start being, excuse me.
Start being prompt engineers.
Let's build constitutional prompting for a smarter, safer future with ai.
With that said, thank you for spending almost like 20 to 30 minutes with me.
You have a nice one.
We'll meet again.
Thank you.
Bye.