Conf42 Prompt Engineering 2025 - Online

- premiere 5PM GMT

Prompt Engineering for Data Engineering: Unlocking Natural Language Access to Cloud Data


Abstract

Discover how prompt engineering and LLMs are reshaping data engineering. I’ll show how natural language interfaces let anyone build queries, trigger pipelines, and unlock real-time insights without deep technical skills. Smarter data, simpler access, faster decisions.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. Welcome to another session of Conf42 Prompt Engineering. My name is Jimish Kadakia. I'm a Solutions Architect at Snowflake. Today I'm going to cover prompt engineering for data engineering: how we can unlock natural language access to cloud data platforms. I have been working in the data and analytics field for almost 12 years now. I have experience working with many different clients on data engineering projects, with implementations on on-prem and cloud data platforms, migrations from on-prem to cloud, and work across many different domains. And I can definitely see the impact prompt engineering can have on data engineering.

To start with, let's look at how data access has evolved over time. Data engineering has always been, and still is, a highly technical discipline. We assume we need data engineers or data developers to do these things: write very complex SQL views, create lots of stored procedures, do a lot of scripting, and manually manage the many pipelines coming from source systems. Then we transform the data, and on top of that we handle lots of monitoring, security concerns, and data governance. All of that goes into the traditional approach.

What we are seeing now is that large language models will revolutionize data engineering. They change the paradigm by enabling many new capabilities, reducing time and complexity, and making the data ecosystem easier for everyone to use and implement, with less technical training required. We can also democratize data across the enterprise very quickly.

LLMs are definitely beyond chatbots. When we first heard about ChatGPT, everyone thought that is what an LLM looks like: you prompt for something, you get an output. It's conversational AI, where you can talk with a chatbot and get your questions answered very quickly. Content generation is also a very big application right now, and everyone is using it. There are so many things you can do with ChatGPT and the other tools coming to market. But there is further potential for LLMs in data engineering: transforming how we consume data from cloud data platforms. LLMs can bridge the gap between business users and the technical data infrastructure, where we always face the challenge of how fast we can turn things around and how fast the data can be made accessible to everyone.

Some of the applications we can definitely think of with LLMs in data engineering: number one is query generation. We can convert a natural language request into an optimized SQL query. This can be done on a cloud data platform like Snowflake using technologies they already offer, like Cortex Analyst and Snowflake Intelligence, and we can see other companies adopting LLMs in data engineering tools as well; ETL/ELT tools like dbt are already doing this. Number two is pipeline orchestration: triggering workflows through conversation.
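To make the query-generation idea concrete, here is a minimal, hypothetical sketch of how a natural language question can be turned into SQL by packing schema context into the prompt. The table names, business rule, and `call_llm` function are illustrative placeholders, not a specific Snowflake or Cortex Analyst API; Cortex Analyst handles this kind of wiring for you behind a semantic model.

```python
# Hypothetical text-to-SQL sketch. SCHEMA_CONTEXT and call_llm() are
# illustrative placeholders, not a real Snowflake/Cortex API.

SCHEMA_CONTEXT = """
Table SALES(order_id INT, region TEXT, amount NUMBER, order_date DATE)
Table CUSTOMERS(customer_id INT, region TEXT, signup_date DATE)
Business rule: 'retention rate' = returning customers / total customers.
"""

PROMPT_TEMPLATE = """You are a SQL generator for a cloud data warehouse.
Use only the tables below. Return a single SELECT statement, no prose.

{schema}

Question: {question}
SQL:"""

def question_to_sql(question: str, call_llm) -> str:
    """Build the prompt with schema context and ask the model for SQL."""
    prompt = PROMPT_TEMPLATE.format(schema=SCHEMA_CONTEXT, question=question)
    return call_llm(prompt).strip()

# Example usage with whatever chat-completion function you have available:
# sql = question_to_sql("Show customer retention rates by region for Q4",
#                       call_llm)
```

The key design point is that the schema and business rules travel with every request; the model never has to guess what "retention rate" means.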
Say I want to re-run something that ran last night, just to fix something that was updated on the source side. I can just say, "run last night's runs," and it runs for me. That's one of the advantages you get from LLM prompt engineering. And number three is definitely real-time analytics. I want to find something out about my data: where sales are high, where orders are dropping, which region is impacted if there is a company-wide issue. Those are things I can ask the data on the fly, with real-time analytics using natural language.

To implement LLM prompt engineering for data engineering, there are core components we have to focus on. Number one is context optimization. We have to provide the LLM with enough information, business information and the relationships within the data, to ensure the queries being generated are accurate and we get quality output. There are many ways to do that. You make sure all your data quality is good, and you create a business layer using semantic views or semantic models that captures what business users know: the relationships between different terms, the business logic, how things are calculated. That type of information can be stored in this layer, and it helps with context optimization. The more context you provide to the LLM, the more accurate and faster the output.

Number two is prompt design patterns. You can talk to LLMs and get answers, but if you phrase questions in a particular manner, you get particular answers. If we know which templates to follow, we can create reusable templates that guide business users to interact successfully and get consistent output where that is required. Then there is data access, which is always a concern; we want to make sure there is standardization across the board where the business requires it.

We also have to make sure validation is done on data integrity and data quality. We have to implement safeguards to verify that the output generated by the model is as expected. There needs to be rigorous testing and lots of validation before we allow it to be used by business users in production; a minimal sketch of one such safeguard follows below.

And last but not least, we have to have feedback loops with human interaction, so that we continuously refine the prompts based on user interactions and model performance to improve accuracy over time. A model can learn from itself once it starts generating quality outputs, so it's very important to have feedback loops.

Now, what is the biggest advantage we will get from prompt engineering in data engineering? The most powerful outcome is the democratization of data access. Business users can ask questions in plain language. They don't have to go to the IT teams or technical experts every time they want answers. We can also build interfaces that let them handle certain technical operations themselves through the LLM interface.
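Here is the safeguard sketch mentioned above: before any LLM-generated SQL reaches the warehouse, gate it through a conservative read-only check. This is a minimal illustration, not Snowflake's validation; a production setup would also enforce role-based permissions and dry-run the query.

```python
import re

# Statements we refuse to auto-execute; anything not clearly read-only
# is blocked. A conservative allowlist beats a clever blocklist here.
FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|MERGE|DROP|ALTER|TRUNCATE|CREATE|GRANT|REVOKE)\b",
    re.IGNORECASE,
)

def is_safe_select(sql: str) -> bool:
    """Conservatively allow only single read-only statements."""
    # Strip line and block comments so keywords can't hide inside them.
    stripped = re.sub(r"--.*?$|/\*.*?\*/", "", sql, flags=re.S | re.M).strip()
    if ";" in stripped.rstrip(";"):          # multiple statements
        return False
    if FORBIDDEN.search(stripped):           # write/DDL keywords anywhere
        return False
    return stripped.upper().startswith(("SELECT", "WITH"))

assert is_safe_select("SELECT region, COUNT(*) FROM sales GROUP BY region")
assert not is_safe_select("DROP TABLE sales")
```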
Now, this can be a little complex depending on the business requirements and what type of interface you want to build, but you normally want something that lets technical users monitor while business users run things on the fly. If they need to do a data refresh, or bring in an additional data source, those things can still be handled through the LLM interface. And last but not least, a cloud data platform like Snowflake allows business users to execute queries and get results back very quickly. There are so many AI-enabled features coming to cloud data platforms that can make business users feel empowered. It means less work for analytics engineers to enable users, things get created much faster, value from the data is delivered faster, and it can be built for scalability.

What are some real-world examples? We already talked about ad-hoc querying: users can ask questions about what they want to see in the data, provided the data is described with a good semantic layer. Take the example here, "Show me the customer retention rates by region for Q4." The prompt goes to the model along with the schema and data relationships, and the answer comes back. You can either have it produce a SQL query that you copy, verify, and run yourself, or have it run and return the output directly, in tabular format, chart format, whatever the user prefers. So you can add a validation checkpoint and create charts, or, for users who are not technical, give them the output directly.

Next is pipeline management. This is more for the developer, monitoring, or admin teams, who can trigger ETL processes, schedule jobs, and monitor data flows through conversational commands. If I want to run a particular nightly customer aggregation pipeline, I don't have to go into the scheduling tool and search for the job that does that. I can just prompt, "run the nightly customer aggregation pipeline," and it runs the right job for me.

And then data exploration, which I would say is a very broad application right now, allows you to understand what's going on. You can test hypotheses, analyze marketing campaigns, analyze what type of products you could launch. It can be predictive or prescriptive, whatever output you want to get from your data. You can see the example on the right-hand side: a natural language search on top, and the visuals you have already built change according to the search you're doing. So prompt engineering can be embedded in very different ways, within BI or within your data exploration tools.

There are obviously challenges to address, and these are topics people are talking about a lot. One of them is accuracy and hallucination.
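A hedged sketch of the pipeline-management idea: map a conversational command onto a registered job. The job registry, names, and downstream orchestrator here are hypothetical stand-ins for whatever you actually run (Airflow DAGs, Snowflake tasks, and so on); a real version would let the LLM pick the job and then confirm with the user before triggering anything.

```python
# Hypothetical intent router: job names and ids are placeholders for
# your real orchestrator (Airflow DAGs, Snowflake tasks, ...).

JOB_REGISTRY = {
    "nightly customer aggregation pipeline": "dag_customer_agg_nightly",
    "daily sales load": "dag_sales_load_daily",
}

def route_command(command: str) -> str | None:
    """Match a natural language command to a registered job id."""
    text = command.lower()
    for phrase, job_id in JOB_REGISTRY.items():
        if phrase in text:
            return job_id
    return None  # fall back to the LLM, or ask the user to clarify

job = route_command("Run the nightly customer aggregation pipeline")
print(job)  # dag_customer_agg_nightly
```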
How do we make sure accuracy is at its best? One way is rigorous validation and testing frameworks that prevent the system from returning anything that isn't right. We have to handle data quality, but even once the data quality checks are done and the model is using the data, we still need to do testing. And we have to define guardrails to avoid hallucinations, because sometimes LLMs are confident they are right and loop on an incorrect path; once that starts, they will keep giving incorrect output or incorrect answers. So guardrails need to be in place to flag when something looks wrong with the model, and someone needs to make sure the model keeps running the way it should, with high accuracy and no hallucinations.

The second big challenge is data governance and security. This is independent of LLMs; we all know governance and security are always a critical part of the data ecosystem. We have to cover all applicable data privacy regulations and compliance requirements, make sure the permissions granted to models are reviewed by the security team, and make sure the data we pass to these models is guarded: nothing is shared that doesn't need to be, like PII, personal information, or any company data that doesn't qualify for AI use under the policies defined at the organization level. You have to build the right data governance and security on top of the infrastructure you're building.

Next is model fine-tuning. Fine-tuning on domain-specific schemas and business terminology improves performance significantly. As I mentioned before, you can define a high-quality business layer that captures semantics: what things mean, what the business logic is, what happens in which scenario. All that information, on top of the actual structured and unstructured data, gives the model much better context. The more useful context you provide, the better the model can fine-tune itself and the better its output.

Last is performance at scale. One thing we always hear is that once we start using these models with complex queries and large databases, performance can suffer. You want enough compute resources without always consuming all of them, and you want the system to stay responsive so the user experience stays good. Optimization strategies can keep performance consistent when you're operating at scale.

As I said, governance is the foundation of trust. To deploy LLMs successfully, we have to have robust governance frameworks and access controls. For security, column masking can be enforced automatically: you can define data masking policies and role-based access control.
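One concrete way to apply those masking ideas in an LLM context is to strip PII-tagged columns out of the schema context before it ever reaches the model. This is a simplified sketch with an assumed tagging convention; in Snowflake itself you would typically lean on masking policies and object tags rather than application code.

```python
# Simplified sketch: filter PII-tagged columns out of the schema
# context sent to the LLM. The "pii" flag below is an assumed
# convention, not a Snowflake feature.

COLUMNS = [
    {"table": "customers", "name": "customer_id", "pii": False},
    {"table": "customers", "name": "email",       "pii": True},
    {"table": "customers", "name": "region",      "pii": False},
]

def build_safe_context(columns: list[dict]) -> str:
    """Describe only non-PII columns to the model."""
    safe = [c for c in columns if not c["pii"]]
    return "\n".join(f"{c['table']}.{c['name']}" for c in safe)

print(build_safe_context(COLUMNS))
# customers.customer_id
# customers.region
```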
Beyond masking, you want audit trails: complete logging of all LLM-generated queries and operations. And you want a human in the loop before execution wherever it is required; a minimal sketch of both follows below.

From my experience working with so many different clients, what makes these projects succeed is having a roadmap: a really good understanding of how things will be built and used over time, for one particular use case or for all the prompt engineering use cases in your organization, especially in data engineering. Start with a limited scope. Try one business domain or business unit, or one particular data engineering process, see if it works well, and then go for the next one. Rather than trying everything at once, minimize the scope; that always helps you establish the quality and consistency of what you're doing.

Collaborate across teams. Everyone from the business side, data engineers, the security team, the compliance team, all of them should be involved so that whatever we're doing is covered from all aspects: is there anything we're missing? We have to make sure data governance, security, and business logic are all brought into the LLM prompt engineering project.

Test extensively. Again, this comes back to validation: we want the data quality, result quality, and accuracy of the model's output to be at their best, so that we get correct and genuinely valuable information.

And scale incrementally, because that lets you get feedback each time something small is implemented, and then gradually expand your capabilities. This also connects back to performance and scalability: you can scale while keeping performance consistent, and the feedback you collect feeds into the changes you make before moving to the next project.

So what does the future of data interaction look like? We all know LLMs will keep evolving. Every day we hear about something new: models growing, companies building new things, new companies arriving. A lot will happen, especially from the data engineering perspective. The boundary between business users and the data platform will blur further. We will see lots of power users and business users doing much more technical work just through natural language commands. Prompt engineering represents a fundamental shift in how organizations think about data accessibility and technical expertise. It enables faster decision making, because the wait on the data team or support team is essentially eliminated: business users talk to the data directly, accelerating decision cycles by getting the output and using that information to decide fast. And there will be smarter data practices across the board, in how we use data to get more insights.
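Here is the audit-trail and human-in-the-loop sketch mentioned above: log every generated query with who asked it and whether it was approved before execution. The JSON-lines log file and the `approve`/`run` callbacks are illustrative assumptions; in practice you might write to a warehouse audit table instead.

```python
import json
import time

AUDIT_LOG = "llm_query_audit.jsonl"  # illustrative; could be an audit table

def execute_with_audit(user: str, question: str, sql: str,
                       approve, run) -> None:
    """Log the generated query, require approval, then run and record it."""
    record = {"ts": time.time(), "user": user,
              "question": question, "sql": sql}
    record["approved"] = bool(approve(sql))   # human (or policy) gate
    if record["approved"]:
        run(sql)                              # your warehouse execution hook
    with open(AUDIT_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example: auto-approve read-only queries, escalate everything else.
# execute_with_audit("jane", "Q4 retention by region", sql,
#                    approve=is_safe_select, run=execute_on_warehouse)
```

Notice that the record is written whether or not the query was approved; rejected attempts are often the most interesting part of the audit trail.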
I think there is a culture shift, like the one we saw with the BI wave when self-service analytics arrived. Now it's the LLM wave with prompt engineering, which will make data-driven culture much more mature: a tech-savvy environment where people learn this new stuff, interact with LLMs, and use prompt engineering for most of their work. So what are the key takeaways from today's session? Prompt engineering extends far beyond chatbots. We have all heard about chatbots and content generation, but it is a powerful tool that will democratize access to complex cloud data ecosystems. Success requires careful attention to accuracy, governance, and fine-tuning; we already discussed everything needed to have a proper framework and collaboration across the organization to make these prompt engineering applications succeed. Start small, test rigorously, and scale incrementally: that's the mantra that allows any organization to build this infrastructure and use LLMs and prompt engineering in a way that keeps scaling and performing at its best. And the future of data interaction is conversational. It's going to feel like you're just talking with your data: you're not really doing data analytics, you're just trying to understand, and prompt engineering will enable that kind of interaction, empowering and accelerating innovation across organizations. Thank you very much for listening. I hope this session was useful and that you learned something new about prompt engineering and its applications in data engineering, which are going to grow very fast. Looking forward to much more exciting times for data engineering and prompt engineering. Thank you.
...

Jimish Kadakia

Solutions Architect @ Snowflake

Jimish Kadakia's LinkedIn account


