Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
I'm Pre Patil from Equinix.
I build multi-cloud infrastructure and interconnection for high-performance AI needs.
Thank you for joining me for my talk, From Reactive Prompts to Autonomous Agents: LLMs for Multi-Cloud Infrastructure.
Over this past year, I have spent a lot of time exploring the next stage of cloud deployment in the world of AI.
We have seen Terraform and Crossplane facilitate cloud deployment and maintain the desired state using reconciliation loops.
That worked well for many years, but now is the time to advance it even further.
In this talk, we will see how large language models (LLMs) are evolving from just answering our questions to actively managing complex cloud infrastructure.
This shift from manual, reactive prompts to autonomous agent behavior is poised to revolutionize cloud infrastructure.
Gartner has said that agentic AI is a top trend of 2025.
We don't need Gartner to say that.
We all know that by now.
We'll look at how multi-cloud can leverage agentic AI: AWS, Azure, GCP, even on-prem.
A full hybrid mix, if you will.
We will discuss how LLMs can move beyond merely spitting out config snippets to actually orchestrating tasks across multiple clouds, all while balancing the exciting opportunities against very real risks: hallucinations, misconfigurations, and cost.
Most enterprises run at least two public clouds plus colocation on-prem, facilitated by providers like Equinix.
They all have different IAMs, network models, APIs, and billing knobs.
The result is config sprawl, policy divergence, and reconciliation drift.
The old fix was to write a glue script or some Terraform and have PR reviews.
At scale, the glue around Terraform and Crossplane became a second platform to maintain.
And if people don't maintain the GitOps model, it all ends up as tribal knowledge.
What we want instead is continuous awareness of state across the providers: policy-driven orchestration and autonomic reconciliation that respects change windows, approvals, and budgets.
Think of treating cloud boundaries like availability zones, with explicit control over cost and other risks.
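To make that concrete, here is a minimal sketch of what a cross-provider state snapshot could look like; the per-provider fetchers are hypothetical stand-ins for real AWS, Azure, or GCP API calls, not any particular SDK.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProviderSnapshot:
    provider: str                                    # "aws", "azure", "gcp", "onprem"
    resources: dict = field(default_factory=dict)    # resource id -> observed config
    quotas: dict = field(default_factory=dict)       # quota name -> (used, limit)
    taken_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def snapshot_all(fetchers: dict) -> dict:
    """Poll every provider once and return one timestamped snapshot per cloud."""
    return {name: ProviderSnapshot(provider=name, **fetch())
            for name, fetch in fetchers.items()}

# Usage with stub fetchers standing in for real cloud API calls:
state = snapshot_all({
    "aws":   lambda: {"resources": {"sg-123": {"ingress": []}}, "quotas": {"vpcs": (3, 5)}},
    "azure": lambda: {"resources": {}, "quotas": {}},
})
```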
And what do most teams use today? LLMs, reactively.
Ask ChatGPT for a Terraform script: tell me how to connect an LB to my target group and the VPCs.
It does generate good scripts, no doubt about that.
But you end up with a lot of one-off scripts, without full validation but delivered with full confidence.
We don't read the docs anymore, and neither does ChatGPT; any LLM will just end up giving us the information it knows best.
For one good anecdote: we asked a chat assistant to write a clean VPC module.
It did, in minutes.
But apart from building a VPC module, it also opened up all the traffic, because we forgot to mention a bastion host restriction.
Nice code, wrong outcome.
The lesson: without context and checks, you are accelerating both right and wrong changes.
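A check like the following, a minimal Python sketch with a made-up rule format, is the kind of guardrail that would have caught that wide-open module before it shipped:

```python
# Scan proposed security group rules for unrestricted ingress.
def find_open_ingress(rules: list) -> list:
    """Return ingress rules that allow traffic from anywhere."""
    return [r for r in rules
            if r.get("direction") == "ingress"
            and r.get("cidr") in ("0.0.0.0/0", "::/0")]

proposed = [
    {"direction": "ingress", "port": 22, "cidr": "0.0.0.0/0"},    # what the LLM wrote
    {"direction": "ingress", "port": 443, "cidr": "10.0.0.0/16"}, # scoped, acceptable
]
for violation in find_open_ingress(proposed):
    print(f"blocked: port {violation['port']} open to the world; require a bastion CIDR")
```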
So what should the vision of an autonomous agent of this sort be?
It should have continuous awareness: take snapshots of multi-cloud states, drifts, and quotas.
It should have proactive detection of misconfigurations, security vulnerabilities, and cost anomalies.
And it should have autonomous resolution: propose and execute fixes with approvals and rollbacks.
Think of this agent as a junior SRE with excellent documentation recall, wrapped in strict guardrails.
It suggests, is even allowed to execute, and then explains what it did.
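Sketched at a very high level, and assuming hypothetical observe, detect, propose, approve, apply, and explain hooks into your own tooling, that loop could look like this:

```python
# The junior-SRE-with-guardrails loop: all six callables are stand-ins.
def agent_cycle(observe, detect, propose, approve, apply, explain):
    state = observe()                       # continuous awareness: snapshot all clouds
    findings = detect(state)                # proactive detection: drift, misconfig, cost
    for finding in findings:
        plan = propose(finding)             # autonomous resolution: concrete fix + rationale
        if approve(plan):                   # human gate or pre-approved policy
            result = apply(plan)            # execute, retaining rollback artifacts
            explain(finding, plan, result)  # always say what was done and why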
The core of this architecture is Retrieval-Augmented Generation, also known as RAG.
RAG is the difference between confident fiction and grounded action.
The retrievals cover the current desired state: think Terraform state, Terraform documentation, scripts, CloudFormation.
It will also look into actual state through cloud APIs, and into change histories: think logs, PRs, runbooks.
And policies: security policies, FinOps policies, provider docs.
Not only API aspects, but also best practices.
Amazon always provides Well-Architected writeups, and they are very resourceful.
The model plans with all of this in context, not from a blank slate.
For example, before touching a security group, the agent reads the intent from IaC, our infrastructure as code; pulls all the current rules from AWS that are already in place; checks the infra policies you set up ahead of time, forbidding opening all the ports on admin VPCs, for example; and reviews any change tickets over time.
The output of this is a proposed diff, plus the rationale for why we are making certain changes.
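As a rough illustration, with hypothetical load_* retrievers standing in for your real systems, the grounding step might be assembled like this:

```python
# Assemble grounded context before the model proposes any change.
def build_change_context(intent, load_desired, load_actual, load_history, load_policies):
    return {
        "intent": intent,                 # e.g. "open port 8443 on app-sg"
        "desired_state": load_desired(),  # Terraform state / CloudFormation
        "actual_state": load_actual(),    # live rules via cloud APIs
        "history": load_history(),        # logs, PRs, runbooks, change tickets
        "policies": load_policies(),      # security/FinOps policies, provider docs
    }

# The model is then asked for a diff plus a rationale, never a blank-slate answer:
PROMPT = """Given this context, propose a minimal diff to satisfy the intent.
Cite which policy or document justifies each changed line."""
```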
The next piece of the architecture is where you get your confidence: the verification pipeline.
It starts off with syntax validation of YAML, JSON, Terraform.
Do the linting, do the formatting.
Then comes semantic analysis: logically thinking through whether what it is doing is right or wrong.
Think of reasoning models here.
Then the guardrails for all your policy compliance: budgets, tags, encryption, geo-residency, GDPR-style requirements, and so on.
And then finally, your dry-run testing: terraform plan, kubectl diff, your smoke tests on your staging environment, et cetera.
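A minimal sketch of those four stages as a short-circuiting pipeline; the individual checks here are placeholder lambdas, not real linters or planners:

```python
# Each stage returns (ok, message); any failure stops the plan from advancing.
def verify(plan, stages):
    for name, check in stages:
        ok, msg = check(plan)
        if not ok:
            return False, f"{name} failed: {msg}"
    return True, "all stages passed"

stages = [
    ("syntax",     lambda p: (True, "lint/format ok")),  # YAML/JSON/Terraform linting
    ("semantics",  lambda p: (True, "reasoning ok")),    # does the change make sense?
    ("guardrails", lambda p: (True, "policy ok")),       # budgets, tags, encryption, geo
    ("dry-run",    lambda p: (True, "plan/diff ok")),    # terraform plan, kubectl diff
]
print(verify({"diff": "..."}, stages))
```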
I also recommend doing dual-model verification, where a second model or a rules engine critiques the plan: why is it done a certain way?
Is it required?
What can go wrong?
Think of red teaming.
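One possible shape for that critic, with ask as a hypothetical call into whichever second model or rules engine you deploy:

```python
# Red-team the first model's plan before anything runs.
CRITIC_QUESTIONS = [
    "Why is the change done this way?",
    "Is every modified resource actually required by the intent?",
    "What is the blast radius if this is wrong?",
]

def critique(plan: str, ask) -> list:
    """Return critic objections; an empty list means the plan survives review."""
    answers = [ask(f"{q}\n\nPlan:\n{plan}") for q in CRITIC_QUESTIONS]
    # Toy convention: the critic prefixes blocking answers with "objection".
    return [a for a in answers if a.strip().lower().startswith("objection")]
```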
The real value of these autonomous agents is not just during the deployment, but between the deployments.
The agents will continuously monitor across the clouds for any changes: IAM diffs, route table changes, the health of your load balancers, quotas, pricing.
They detect drift automatically and try to correct it.
They alert with context, which is more significant than we all think.
And once an agent identifies one of these changes, it runs its reconciliation: either by opening a PR and having somebody approve it, or, if the change is pre-approved, by proceeding straight to deployment.
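Sketched as a single decision function, with propose_fix, is_preapproved, open_pr, and deploy as hypothetical hooks into your own tooling:

```python
# Detected drift either opens a PR for review or, if the fix matches a
# pre-approved class of change, proceeds directly to deployment.
def reconcile(drift, propose_fix, is_preapproved, open_pr, deploy):
    fix = propose_fix(drift)     # a diff plus a context-rich alert, not a bare page
    if is_preapproved(fix):
        return deploy(fix)       # e.g. re-adding a tag, other known-safe changes
    return open_pr(fix)          # a human approves anything novel or risky
```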
A practical example, right?
This is anecdotal; it happened to me.
One of the members who had access to an AWS account made modifications to a security group.
When things stopped working, the whole team had to scramble all over the place trying to see what happened, with no clarity on where to look or what to do to fix it.
If an autonomous agent had been there, it would already have complete context of the desired state; it could run its reconciliation rules, propose the solution, and also explain what went wrong in that scenario.
While all of this sounds great, there are problems, starting with hallucinated configurations.
We have seen that LLMs, as we know them today, can end up hallucinating and going down a rabbit hole trying to solve a problem that doesn't even exist.
Then security exposure: we might end up giving the agent overprivileged access, and that can cause a huge problem.
And cost escalation: if there are no proper guardrails, we might end up overspending on certain resources, and even on additional costs we never planned for.
Mitigation is very straightforward.
Always validate your resources and deployments, with unit tests and automated tests to verify that agents are behaving as expected.
Most importantly, least and on-demand privileges: always grant access only to the resources the agent needs, at the time it needs them.
Then cost control, not just for the cloud resources, but also for the agent's inference.
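As an illustration of both ideas, here is a rough Python sketch; issue_scoped_token and revoke are hypothetical stand-ins for an STS-style credential broker, not a real API:

```python
from contextlib import contextmanager

@contextmanager
def scoped_access(issue_scoped_token, revoke, actions, ttl_seconds=300):
    # Grant only the named actions, only for the duration of this one task.
    token = issue_scoped_token(actions=actions, ttl=ttl_seconds)
    try:
        yield token      # the agent acts with exactly these permissions
    finally:
        revoke(token)    # nothing lingers after the task completes

class InferenceBudget:
    """Cap the agent's own LLM spend, not just the cloud resource spend."""
    def __init__(self, max_tokens: int):
        self.remaining = max_tokens
    def charge(self, tokens: int):
        self.remaining -= tokens
        if self.remaining < 0:
            raise RuntimeError("inference budget exhausted; halting agent")
```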
But trust doesn't come automatically.
Trust comes with transparency, reversibility, and learning.
We need to log all of the LLM's decisions and reasoning.
Bring a human into the loop.
Have tiered approvals and time-boxed change windows so that you are in control of the change, not woken up in the middle of the night at 2:00 AM.
Have a rollback mechanism: every agent action must map to a versioned artifact for instant rollback.
And then continuous learning, to fine-tune your agents to your needs based on all the learnings you and the agents have had, drawn from all the logs.
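One way to make that concrete, as a sketch: a decision record that ties each action to its reasoning, its approver, its change window, and the versioned artifact to roll back to.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class DecisionRecord:
    action: str            # what the agent did
    reasoning: str         # the model's stated rationale
    approver: str          # a human, or the pre-approval policy that matched
    change_window: str     # e.g. "tue 10:00-12:00 UTC", never a 2:00 AM surprise
    artifact_version: str  # git SHA / state version to roll back to

def log_decision(record: DecisionRecord, sink=print):
    sink(json.dumps(asdict(record)))   # the append-only log also feeds fine-tuning later
```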
While this framework is very encouraging and exciting, it can also be very overwhelming to implement in your environment.
So I recommend starting simple: start with reactive enhancement.
When I say that, think about keeping your current flow exactly as it is; just add retrieval and validation around LLM-generated artifacts.
The next stage is to integrate RAG.
RAG can start off with your documentation, your infrastructure as code, some runbooks, some Well-Architected or best-practice material from the providers.
Then move to continuous monitoring: have your agents start detecting drifts and reporting them.
And in the final stage, when all the confidence is in place, start with deployment and autonomous actions, with a human in the loop and with the change window in mind.
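One lightweight way to keep those stages explicit, sketched as a hypothetical configuration enum so the agent's permitted behavior is a setting rather than folklore:

```python
from enum import Enum

class AutonomyStage(Enum):
    REACTIVE   = 1  # current flow, plus retrieval/validation around LLM output
    RAG        = 2  # grounded in docs, IaC, runbooks, provider best practices
    MONITORING = 3  # detects and reports drift, takes no action
    AUTONOMOUS = 4  # deploys fixes, human in the loop, change windows enforced

STAGE = AutonomyStage.REACTIVE   # advance only as confidence is earned
```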
So the key takeaway here is to use AI the way it was always meant to be used: not just for copy-pasting code, but to leverage its reasoning and its knowledge.
Start off very simple: take small steps and adopt incrementally.
Build a robust architecture around RAG and multi-stage verification.
Know your risks and manage your risks.
With that, I want to thank Conf42.