Conf42 Prompt Engineering 2025 - Online

- Premiere: 5 PM GMT

From Reactive Prompts to Autonomous Agents: LLMs for Multi-Cloud Infrastructure


Abstract

Discover how LLMs can move beyond generating snippets of code to become autonomous agents for multi-cloud infrastructure, balancing opportunity with the risks of hallucinations, misconfigurations, and cost. Learn practical architectures for safe, scalable orchestration.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. I'm Praneeth Patil from Equinix. I build multi-cloud infrastructure and interconnection for high-performance AI needs. Thank you for joining me to explore my talk, From Reactive Prompts to Autonomous Agents: LLMs for Multi-Cloud Infrastructure. Over this past year I have spent a lot of time exploring the next stage of cloud deployment in the world of AI. We have seen Terraform and Crossplane facilitate cloud deployment and maintain the desired state using reconciliation loops. That worked well for many years, but now is the time to advance it even further. In this talk we will see how large language models (LLMs) are evolving from merely answering our questions to actively managing complex cloud infrastructure. This shift from manual, reactive prompts to autonomous agent behavior is poised to revolutionize cloud infrastructure. Gartner has called agentic AI a top trend of 2025, but we don't need Gartner to tell us that; we all know it by now. In this talk we'll cover how multi-cloud can leverage agentic AI across AWS, Azure, GCP, and even on-prem, a full hybrid mix if you will. We will discuss how LLMs can move beyond merely spitting out config snippets to actually orchestrating tasks across multiple clouds, all while balancing the exciting opportunities against very real risks: hallucinations, misconfigurations, and cost.

Most enterprises run at least two public clouds plus colocation or on-prem, facilitated by providers like Equinix. They all have different IAMs, network models, APIs, and billing knobs. The result is config sprawl, policy divergence, and reconciliation drift. The old fix was to write a glue script or some Terraform and rely on PR reviews. At scale, the glue around the Terraform and Crossplane became a second platform to maintain, and where teams didn't maintain a GitOps model, it ended up as tribal knowledge. What we want instead is continuous awareness of state across providers, policy-driven orchestration, and autonomic reconciliation that respects change windows, approvals, and budgets. Think of treating cloud boundaries like availability zones, with explicit controls around cost and other risks.

What most teams do today is use LLMs reactively: ask ChatGPT for a Terraform script, or "tell me how to connect a load balancer to my target group and the VPCs." It does generate good scripts, no doubt about that, but you end up with a lot of one-off scripts without full validation. With full confidence we don't read the docs anymore, and ChatGPT, or any other LLM, will just give us the information it best knows. Here is one good example: we asked a chat assistant to write a clean VPC module, and it did so in minutes. But apart from building the VPC module, it also opened up all the traffic, because we forgot to mention the bastion host restriction. Nice code, wrong outcome. The lesson: without context and checks, you are accelerating both right and wrong changes.

So what should the vision for an autonomous agent of this sort be? It should have continuous awareness: take snapshots of multi-cloud state, drift, and quotas. It should have proactive detection of misconfigurations, security vulnerabilities, and cost anomalies. And it should have autonomous resolution: propose and execute fixes with approvals and rollbacks. Think of this agent as a junior SRE with excellent documentation recall, wrapped in strict guardrails. It suggests, is even allowed to execute, and then explains what it did.
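To make that loop concrete, here is a minimal Python sketch of one reconciliation pass along the lines described: observe actual state, detect drift and policy violations against the desired state, then either apply a pre-approved fix or open a PR for review. Every type and helper here is a hypothetical illustration, not a real SDK or the speaker's implementation.

```python
# Hypothetical sketch of the agent's observe -> detect -> propose -> gate -> act pass.
from dataclasses import dataclass

@dataclass
class Finding:
    resource: str       # e.g. "aws:security-group/sg-0abc"
    issue: str          # e.g. "ingress 0.0.0.0/0 on port 22"
    proposed_diff: str  # the IaC change that would remediate it
    rationale: str      # the agent's explanation, logged for audit

def detect_drift(desired: dict, actual: dict, forbidden: list[str]) -> list[Finding]:
    """Flag drift between desired and actual state, plus guardrail violations."""
    findings = []
    for resource, want in desired.items():
        have = actual.get(resource)
        if have != want:
            findings.append(Finding(resource, f"drift: {have!r} != {want!r}",
                                    f"set {resource} -> {want!r}",
                                    "actual state diverged from IaC intent"))
        for rule in forbidden:
            if rule in str(have):
                findings.append(Finding(resource, f"policy violation: {rule}",
                                        f"remove {rule} from {resource}",
                                        "violates a pre-set guardrail"))
    return findings

def reconcile_once(desired: dict, actual: dict, forbidden: list[str],
                   preapproved: set, apply_fn, open_pr_fn) -> None:
    """One pass: pre-approved fixes execute; everything else goes to human review."""
    for f in detect_drift(desired, actual, forbidden):
        if f.resource in preapproved:
            apply_fn(f.proposed_diff)                 # autonomous, but logged and reversible
        else:
            open_pr_fn(f.proposed_diff, f.rationale)  # human in the loop
```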
The core of this architecture is Retrieval-Augmented Generation, also known as RAG. With RAG, the difference is between confident fiction and grounded action. The retrievals cover the current desired state (think Terraform state, Terraform documentation, scripts, CloudFormation); the actual state, through cloud APIs; change histories (logs, PRs, runbooks); policies such as security and FinOps policies; and provider docs, not only the API aspects but also best practices. Amazon, for instance, publishes Well-Architected write-ups, and they are very resourceful. The model plans with all of this in context, not from a blank slate. For example, before touching a security group, the agent reads the intent from your infrastructure as code, pulls all the current rules from AWS, and takes in the policies you set up ahead of time (forbidding opening all ports on admin VPCs, for example) along with any change tickets over time. The output is a proposed diff plus the rationale for why each change is being made.

The next piece of the architecture, where you get your confidence, is the verification pipeline. It starts with syntax validation of YAML, JSON, and Terraform: do the linting, do the formatting. Then comes semantic analysis, reasoning about whether what the plan does is right or wrong; think of reasoning models here. Then the guardrails for all your policies and compliance: budgets, tags, encryption, geo-residency, GDPR-style requirements, and so on. And finally, dry-run testing: terraform plan, kubectl diff, smoke tests in your staging environment, et cetera. I also recommend dual-model verification, where a second model or a rules engine critiques the plan: why is it done this way, is it required, what can go wrong? Think of it as red-teaming.

The real value of these autonomous agents is not just during deployments but between deployments. The agents continuously monitor across the clouds for changes such as IAM diffs, route table changes, the health of your load balancers, quotas, and pricing. The agent detects drift automatically and tries to correct it, alerting with context, which is more significant than we all think. Once it identifies any of these changes, it runs its reconciliation, either by opening a PR and having somebody approve it or, if the change is pre-approved, proceeding directly with deployment.

A practical example, and this is anecdotal, it happened to me: one of the members who had access to an AWS account made modifications to a security group. The whole team had to scramble to figure out what had happened while things were not working, without clarity on whom to reach or what to do to fix it. If an autonomous agent had been there, it would already have had complete context of the desired state; it could run its reconciliation, propose the solution, and also explain what went wrong in that scenario.

While all of this sounds great, there are problems, starting with hallucinated configurations. We have seen that LLMs, as we know them today, can hallucinate and go down the rabbit hole of trying to solve a problem that doesn't even exist. Then security exposure: we might end up over-privileging the agent, and that can cause a huge problem. And cost escalation: without proper guardrails, we might end up overspending on certain resources, or even on additional ones we never intended.
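Before moving to mitigations, here is a small sketch of the multi-stage verification pipeline described above: syntax, semantics, policy guardrails, then dry run, with any failure stopping the rollout. The stage functions are stand-ins; in practice they would shell out to tools like terraform validate, terraform plan, or kubectl diff, or call a second model for the critique step.

```python
# Hypothetical multi-stage verification gate: each stage must pass in order.
from typing import Callable

def verify(plan: str, stages: list[tuple[str, Callable[[str], bool]]]) -> bool:
    """Run the proposed plan through every gate; any failure blocks the change."""
    for name, check in stages:
        if not check(plan):
            print(f"verification failed at stage: {name}")
            return False
    return True

pipeline = [
    ("syntax",     lambda p: p.strip() != ""),        # stand-in for lint/format checks
    ("semantics",  lambda p: "0.0.0.0/0" not in p),   # stand-in for a reasoning-model review
    ("guardrails", lambda p: "Environment" in p),     # e.g. a required-tags policy
    ("dry-run",    lambda p: True),                   # stand-in for `terraform plan`
]

if __name__ == "__main__":
    plan = 'resource "aws_instance" "web" { tags = { Environment = "dev" } }'
    print("safe to propose diff" if verify(plan, pipeline) else "blocked")
```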
Mitigation is straightforward. Always validate your resources and deployments, with unit tests and automated tests that verify the agents behave as expected. Most importantly, use least and on-demand privilege: grant the agent access only to the resources it needs, at the time it needs them. Then cost controls, not just for the cloud resources but also for the agent's own inference.

But trust doesn't come automatically. Trust comes with transparency, reversibility, and learning. We need to log all of the LLM's decisions and reasoning. Bring a human into the loop. Have tiered approvals and time-boxed change windows so that you stay in control of the change, not in the middle of the night at 2:00 AM. Have rollback mechanisms: every agent action must map to a versioned artifact for instant rollback. And then continuous learning: fine-tune your agents to your needs based on everything you and the agents have learned from all those logs.

While this framework is very encouraging and exciting, it can also be overwhelming to implement in your environment, so I recommend starting simple. Start with reactive enhancement: keep your current flow as it is and just add retrieval and validation around LLM-generated artifacts. The next stage is to integrate RAG; it can start with your documentation, your infrastructure as code, some runbooks, and the Well-Architected or best-practice guides from the providers. Then move to continuous monitoring: have your agents start detecting drift and reporting it. And in the final stage, once all the confidence is in place, begin deployments and autonomous actions, with a human in the loop and with change windows in mind.

So the key takeaways: use AI as it was always meant to be used, not just for copy-pasting code but to leverage its reasoning and its knowledge. Start simple, take small steps, and adopt incrementally. Build a robust architecture around RAG and multi-stage verification. Know your risks and manage them. With that, I want to thank Conf42.
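For reference, here is one way the trust controls discussed above could be wired together: every action is logged with its reasoning, gated by a time-boxed change window and a tiered approval, and mapped to a versioned artifact for instant rollback. The window, tiers, and field names are invented for illustration, not a prescribed implementation.

```python
# Hypothetical trust-control wrapper: transparency, reversibility, tiered approval.
import datetime
import json

AUDIT_LOG: list[str] = []

def in_change_window(now: datetime.datetime) -> bool:
    """Only allow changes 09:00-17:00 UTC on weekdays -- never at 2 AM."""
    return now.weekday() < 5 and 9 <= now.hour < 17

def execute(action: str, reasoning: str, tier: str, approved_by: str | None,
            artifact_version: str) -> bool:
    """Gate an agent action, then log the full decision either way."""
    now = datetime.datetime.now(datetime.timezone.utc)
    allowed = in_change_window(now) and (tier == "low-risk" or approved_by is not None)
    AUDIT_LOG.append(json.dumps({             # transparency: decision plus reasoning
        "time": now.isoformat(),
        "action": action,
        "reasoning": reasoning,
        "tier": tier,
        "approved_by": approved_by,
        "rollback_to": artifact_version,      # reversibility: versioned artifact
        "executed": allowed,
    }))
    return allowed

# Usage: a high-risk change with no approver is refused, but still logged.
execute("widen sg-0abc ingress", "requested port 8443 for new LB",
        tier="high-risk", approved_by=None, artifact_version="v42")
```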
...

Praneeth Patil

Senior Staff Engineer @ Equinix Inc



