Transcript
Hello everyone.
Welcome to another session at Conf42 Prompt Engineering.
My name is Jamis Akia.
I'm a solution architect at Snowflake.
Today I'm going to cover prompt engineering for data engineering: how we can unlock natural language access to the cloud data platform.
I have been working in the data and analytics field for almost 12 years now. I have lots of experience working with many different clients on different data engineering projects, lots of implementations on on-prem and cloud data platforms, and lots of migrations from on-prem to cloud data platforms, across many different domains. And I can definitely see the impact that prompt engineering can have in data engineering.
So to start with: how data access used to work, and how it is evolving over time.
Data engineering has always been, and still is, a very technical task. We assume we need people, data engineers or data developers, to do these things.
They have to write very complex SQL views or create lots of different stored procedures. They also have to do lots of scripting, manually manage lots of pipelines coming in from source systems, transform the data, and then handle lots of monitoring, lots of security concerns, and lots of data governance. All of that goes into the traditional approach.
What we are seeing now is that there will be a revolution in data engineering driven by large language models, which will change the paradigm. Lots of different things can be enabled: reducing time and complexity, making data engineering and the data ecosystem easier for everyone to use and implement with less technical training required, and democratizing data across the enterprise very quickly.
So LLMs are definitely beyond chatbots. When we initially heard about ChatGPT, everyone thought, okay, this is what an LLM looks like: you prompt for something and you get an output. It's conversational AI, where you can talk with a chatbot and get your questions answered very quickly. Content generation is also a very big application right now that everyone is using. There are so many different things you can do with ChatGPT and all the other tools coming to the market.
But there is also potential for LLMs in data engineering, to transform how we consume data from the cloud data platform. LLMs can bridge the gap between business users and the technical data infrastructure, a challenge we always face: how fast we can turn things around, and how fast the data can be accessible for everyone.
So here are some of the applications we can think of with LLMs in data engineering. Number one is query generation. We can convert a natural language request into optimized SQL. This can be done on a cloud data platform like Snowflake using technologies they already offer, like Cortex Analyst and Snowflake Intelligence, and we can see other companies building LLM data engineering tools as well.
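To make this concrete, here is a minimal sketch of what that generation step can look like, assuming a placeholder llm_complete() for whatever model endpoint you use (Snowflake's Cortex COMPLETE function is one option) and an illustrative two-table schema:

```python
# Minimal sketch of natural-language-to-SQL generation. llm_complete() is a
# placeholder for whatever model endpoint you use; the schema is illustrative.

SCHEMA_CONTEXT = """
Tables:
  orders(order_id, customer_id, region, order_date, amount)
  customers(customer_id, name, signup_date)
"""

def generate_sql(question: str, llm_complete) -> str:
    """Pair the user's question with schema context and ask for one SELECT."""
    prompt = (
        "You are a SQL assistant for Snowflake.\n"
        f"Schema:\n{SCHEMA_CONTEXT}\n"
        "Return a single SELECT statement and nothing else.\n"
        f"Question: {question}"
    )
    return llm_complete(prompt).strip()
```

The key point is that the schema travels with every question, so the model is never guessing at table names.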
Number two is pipeline orchestration: triggering workflows through conversation, which ETL/ELT tools like dbt are already adopting. If I want to rerun something from last night because something was updated on the source side, I can just say, "rerun last night's run," and it will run for me. That's one of the advantages you can take from LLM prompt engineering.
And definitely real-time analytics. I want to find something out about my data: where sales are high, where orders are getting low, which region is impacted if I have a company-wide issue. Those are things I can ask the data on the fly, with real-time analytics using natural language.
To implement this LLM prompt engineering, there are core components we have to focus on.
When it comes to data engineering, number one is context optimization. We have to give LLMs enough information, business information and the relationships within the data, to ensure the queries being generated are accurate and we get quality output. There are lots of different ways to do that. You can make sure all your data quality is good. You can create a business layer using semantic views or semantic models that carry more information: the terms business users use, the relationships between different terms, the different logic, how you calculate things. That type of information can be stored in a layer, and it helps with context optimization. The more context you provide to the LLM, the more accurate and the faster the output you get.
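As a rough illustration, a semantic layer can be as simple as a dictionary of business definitions that gets rendered into every prompt; the terms and logic below are made-up examples, not a fixed standard:

```python
# Hypothetical semantic layer: business definitions captured once, then fed
# into every prompt. The terms and logic here are examples, not a standard.

SEMANTIC_MODEL = {
    "customer retention rate": "distinct returning customers divided by "
                               "distinct customers active in the prior quarter",
    "region": "the REGION column on the ORDERS table",
    "quarter four": "October through December of the fiscal year",
}

def semantic_context() -> str:
    """Render the business layer as plain text for the model's context."""
    lines = [f"- {term}: {meaning}" for term, meaning in SEMANTIC_MODEL.items()]
    return "Business definitions:\n" + "\n".join(lines)
```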
Next is prompt design patterns. You can talk to LLMs and get answers, but if you phrase your questions in a particular manner, you get particular answers. So if we know which templates to follow, we can create reusable templates that guide business users to interact successfully and get consistent output where that is required.
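Here is one way a reusable template might look, reusing the illustrative SCHEMA_CONTEXT and semantic_context() from the earlier sketches; the field names are assumptions for illustration:

```python
# A reusable prompt template so every question is framed the same way.
# SCHEMA_CONTEXT and semantic_context() refer to the earlier sketches.

from string import Template

QUERY_TEMPLATE = Template(
    "Role: Snowflake SQL assistant.\n"
    "Schema:\n$schema\n"
    "$definitions\n"
    "Rules: return one SELECT statement; never modify data.\n"
    "Question: $question"
)

prompt = QUERY_TEMPLATE.substitute(
    schema=SCHEMA_CONTEXT,
    definitions=semantic_context(),
    question="Show me the customer retention rates by region for quarter four",
)
```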
And then there is ad hoc access, which is always there, but we want to make sure there is standardization across the board where the business requires it.
We also have to make sure validation is done for data integrity and data quality. We have to implement safeguards to verify that the output we are generating from the model is as expected. There needs to be rigorous testing and lots of different validation before we allow it to be used by business users in production.
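A minimal safeguard, as a sketch: require the generated SQL to be read-only and to parse before it ever runs. sqlglot is one parsing option here, not the only one:

```python
# Minimal safeguard: the generated SQL must be read-only and must parse.
# sqlglot is one option for the parse check; swap in your own validator.

import sqlglot

FORBIDDEN = ("insert", "update", "delete", "drop", "alter", "truncate", "merge")

def validate_sql(sql: str) -> bool:
    """Reject anything that is not a parseable, read-only statement."""
    if any(word in sql.lower() for word in FORBIDDEN):
        return False
    try:
        sqlglot.parse_one(sql, read="snowflake")  # raises on invalid syntax
    except sqlglot.errors.ParseError:
        return False
    return True
```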
And then, last but not least, we have to have feedback loops, which is the human interaction: we continuously refine the prompts based on user interaction and model performance to improve accuracy over time. A model can learn from itself once it starts generating quality outputs, so it's very important to have feedback loops.
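A feedback loop does not have to be elaborate to be useful. This sketch just logs each question, the generated SQL, and a thumbs-up or thumbs-down to a CSV file (the file name is illustrative) so the prompts can be refined later:

```python
# Sketch of a feedback loop: log the question, the generated SQL, and a
# thumbs-up/down, then mine the log to refine prompts and templates over time.

import csv
import datetime

def record_feedback(question: str, sql: str, helpful: bool,
                    path: str = "prompt_feedback.csv") -> None:
    """Append one interaction to a simple CSV log (file name is illustrative)."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.datetime.now().isoformat(), question, sql, helpful]
        )
```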
Now, what is the biggest advantage we will get with prompt engineering in data engineering? The most powerful outcome we can see here is the democratization of data access. Business users can ask questions in plain language. They don't have to go to the IT teams or technical experts every time they want answers to something.
We can also build something that allows them to run technical operations, if any, through the LLM interface. This can be a little complex depending on the business requirements, like what type of interface you want to build, but you normally want something that allows technical users to monitor while business users run things on the fly. If they need to do a data refresh or bring in an additional data source, those things can still be handled through the LLM interface.
And then, last but not least, we can have a cloud data platform like Snowflake, which allows business users to execute queries and get results back very quickly. There are so many different things you can do in the cloud data platform with AI enabled, with lots of features coming, that can make business users feel empowered. It's less work for the analytics engineers to enable users, things get created very fast, value comes from the data very fast, and it can be built for scalability.
What are some real-world examples? We already talked about ad hoc querying. Users can ask questions about whatever they want to see in the data, if the data is built with a high level of semantics. In the example here: "Show me the customer retention rates by region for quarter four." The prompt goes back to the model, and if we have the schema and the data relationships defined well, we get the answer. We can either have it produce a SQL query that you copy, paste, and run, or verify the SQL query before it runs. Once it runs, it gives you the output, and the output can be in tabular format or chart format, whatever the user's preference is. So you can add a validation point where you want one, and also create the charts.
Or sometimes, for people who are not technical, we can directly give them the output.
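Putting that flow together, here is a sketch of the verify-before-run step, assuming the generate_sql() helper from earlier and any DB-API style connection such as snowflake.connector:

```python
# Sketch of the verify-before-run flow, assuming generate_sql() from the
# earlier sketch and a DB-API connection (snowflake.connector, for example).

def ask(question: str, conn, llm_complete, require_review: bool = True):
    """Generate SQL, optionally show it for human review, then execute it."""
    sql = generate_sql(question, llm_complete)
    if require_review:
        print(f"Generated SQL:\n{sql}")
        if input("Run this query? [y/N] ").strip().lower() != "y":
            return None
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()  # render downstream as a table or a chart
```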
Pipeline management is more for the developer team, the monitoring team, or the admin team, who can trigger things like ETL processes, scheduled jobs, and data flow monitoring through conversational commands. For example, if I want to run a particular nightly customer aggregation pipeline, I don't have to go into the scheduling tool and search for the job that does that. I can just prompt, "run the nightly customer aggregation pipeline," and it should run the right job for me.
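One conservative way to sketch this is to match the request against a catalog of known pipelines rather than letting the model free-run jobs; the pipeline names and the trigger_pipeline() hook below are stand-ins for your orchestrator's own API (Airflow, dbt Cloud, a Snowflake task, and so on):

```python
# Hypothetical sketch of routing a conversational request to a known job.
# The names and the trigger_pipeline() hook stand in for your orchestrator.

PIPELINES = {
    "nightly customer aggregation": "nightly_customer_agg",
    "last night's source load": "source_load_daily",
}

def route_request(utterance: str, trigger_pipeline) -> str:
    """Match the request against known pipelines instead of free-running jobs."""
    for phrase, job_id in PIPELINES.items():
        if phrase in utterance.lower():
            trigger_pipeline(job_id)
            return f"Triggered {job_id}"
    return "No matching pipeline found; please confirm the job name."
```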
And then data exploration, which is, I would say, a very broad application right now. It allows you to understand what's going on. You can test a hypothesis, analyze marketing campaigns, or analyze what type of products you could launch. It can give you lots of different things; it can be predictive or prescriptive, whatever output you want to get from your data. You can see the example on the right-hand side: you have a natural language search at the top, and the things you have already built change according to the search you're doing. So there are very different ways of doing prompt engineering embedded within BI or within your data exploration tools.
There are obviously challenges to address, and these are topics lots of people are talking about. One of them is accuracy and hallucination. How can we make sure accuracy is at its best? One thing we can do is rigorous validation and testing frameworks, to prevent the system from giving out anything that is not right. We have to do the data quality work, but even once the data quality checks are done and the model is using the data, we still need to do testing. And we have to define guardrails to avoid hallucinations, because sometimes LLMs can think they are right and go into a loop of doing things which, if incorrect, will keep giving you incorrect output or incorrect answers. So make sure the guardrails are there, so someone can say, "it looks like something is going wrong with the model," and make sure the model is always running the way it should, with high accuracy and no hallucinations.
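One practical guardrail, sketched under the same illustrative schema as before: reject any generated query that references a table the model invented. In practice the known-table list would be loaded from the information schema rather than hard-coded:

```python
# Guardrail sketch: reject queries that reference tables the model invented.
# KNOWN_TABLES would come from your information schema, not be hard-coded;
# sqlglot is again just one parsing option.

import sqlglot
from sqlglot import exp

KNOWN_TABLES = {"orders", "customers"}

def references_only_known_tables(sql: str) -> bool:
    """True only if every table in the query exists in our schema."""
    parsed = sqlglot.parse_one(sql, read="snowflake")
    tables = {t.name.lower() for t in parsed.find_all(exp.Table)}
    return tables <= KNOWN_TABLES
```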
The second big challenge is data governance and security. This is independent of LLMs; we all know data governance and security are always a critical part of the data ecosystem. We have to make sure we meet all the applicable data privacy regulations and compliance requirements, and that the permissions given to the models are reviewed by the security team. We also want to make sure the data we pass to these models is guarded, by not sharing anything that doesn't need to be shared with the model: any PII, any personal information, any company data that does not qualify for AI use per the policies defined at the organization level. You have to make sure you have the right data governance and security built on top of the infrastructure you're building.
For LLMs, model fine-tuning on domain-specific schemas and business terminologies improves performance significantly. There are so many different things you can do. You can define a layer, as I mentioned before, a very good quality business layer that captures the semantics: what is what, how things happen, what the business logic is, what happens in which scenario. All of that information, beyond the actual structured and unstructured data, helps give the model the context to tune much better. And if you want the model to give more accurate answers, you can keep giving it more useful context, which allows the model to tune or fine-tune itself for better output performance at scale.
One thing we always hear is that once we start using these models with complex queries and large databases, performance can be impacted. How do you handle that? You want to make sure you have enough compute resources, without always using all of them, and you apply optimization strategies so that the system stays responsive and the user experience stays good, even when performing at scale.
As I said, governance is the foundation of trust. To deploy LLMs successfully, we have to have very robust governance frameworks and access controls. For security, we want column masking enforced automatically: you can define data masking policies and role-based access control. For audit trails, you want complete logging of all the LLM-generated queries and operations. And you also want human review before execution, wherever it is required.
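As a sketch of those pieces, here is an illustrative Snowflake masking policy plus a helper that logs every LLM-generated query before executing it; the table, role, and column names are assumptions:

```python
# Governance sketch: a Snowflake masking policy plus an audit log of every
# LLM-generated query. Table, role, and column names are illustrative.

MASKING_POLICY_DDL = """
CREATE MASKING POLICY IF NOT EXISTS email_mask AS (val STRING)
RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('ANALYST_FULL') THEN val
       ELSE '***MASKED***' END
"""

def run_audited(conn, user: str, sql: str):
    """Record who ran what before executing, so the audit trail is complete."""
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO llm_audit_log (run_at, run_by, query_text) "
        "VALUES (CURRENT_TIMESTAMP, %s, %s)",
        (user, sql),
    )
    cur.execute(sql)
    return cur.fetchall()
```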
From my experience working with so many different clients, and from my understanding of what makes these efforts successful, you need a roadmap. You need a really good understanding of how things will be created and used over time, for one particular use case or for all the prompt engineering use cases in your organization, especially for data engineering. You should start with a limited scope. Try one particular business domain or business unit, or one particular data engineering process, work on it, see if it works fine, and then go for the next one. Rather than trying to do everything at once, minimize the scope. That will always help you maintain the quality and consistency of what you're doing.
You should also collaborate across teams. We have to make sure everyone from the business side, the data engineers, the security team, and the compliance team is involved, so that whatever we are doing is covered from all aspects. Is there anything we are missing? We have to make sure we are doing everything we can so that all the data governance, security, and business logic goes into implementing the LLM prompt engineering project. Test extensively.
Again, we come back to validation, because we want to make sure the data quality, the quality of the results, and the accuracy of the model's output are at their best, and that we are getting the right and correct information, which is really important. And then scale incrementally, because that lets you get feedback every time something small is implemented while you gradually expand your capabilities. That also connects back to performance at scale: you are able to scale while keeping performance consistent, and the feedback you get is used to improve the work already done before moving to the next project.
So what does the future of data interaction look like? We all know LLMs are going to evolve. We are going to see lots of new technology; every day there is something new we hear about. The models are growing, companies are building new things, and new companies are arriving with new things. So a lot will happen, especially from the data engineering perspective. The boundary between business users and the data platform will blur further. We will see lots of power users, and lots of business users doing things that are much more technical, just by using natural language commands. Prompt engineering will represent a fundamental shift in how organizations think about data accessibility and technical expertise. It will allow faster decision-making, because the hand-off to the data team or the support team is basically eliminated: business users talk directly to the data and accelerate the decision cycles, getting the input and the output and using that information to make decisions very fast.
And there will be smarter data practices across the board in how data is mined for more insights. I think there is a culture shift, like the one we saw with the BI wave, when self-service analytics came. Now it's the LLM wave with prompt engineering that will make the data-driven culture much more mature: a tech-savvy environment where people learn about this new stuff, interact with LLMs, and use prompt engineering to do most of their work.
So what are some key takeaways from today's session? Prompt engineering extends far beyond chatbots. We have all heard about chatbots and content generation, but this is a powerful tool that will democratize access to complex cloud data ecosystems. Success requires careful attention to accuracy, governance, and fine-tuning. We already saw and discussed what it takes: a proper framework and collaboration across the organization to make a success of these prompt engineering applications. Start small, test rigorously, and scale incrementally. That's the mantra that will allow any organization to build this infrastructure and use LLMs and prompt engineering in a way that always scales and performs at its best.
The future of data interaction is conversational. It's going to be like you're just talking with the data. You're not really doing data analytics; you're just trying to understand, and prompt engineering will give you the kind of interaction that empowers and accelerates innovation across organizations.
Thank you very much for listening. I hope this session was useful and that you learned something new about prompt engineering and its application in data engineering, which is going to grow very fast. I'm looking forward to much more exciting times for data engineering and prompt engineering. Thank you.