Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone.
This is Nickel cai.
I'm a software engineer specialized in cloud AI and FinTech.
I'm super excited to be here at Conf42 SRE 2025.
I'll walk you through how we can supercharge distributed systems using
Kubernetes, not just for orchestration, but for real-time optimization
driven by AI. We'll explore how to stay resilient,
scale smart, and respond in milliseconds, all while reducing
costs and improving uptime.
Whether you're in e-commerce, finance, IoT, or SRE at scale, you'll
walk away with practical ideas
you can apply today.
Let's dive in and get to the next slide.
Now let's step back and look at what we are really solving here.
Today's distributed systems are powerful, but they come with complexity.
First, we are scaling like crazy: more data, more nodes, and more regions.
But are we scaling smart?
Often not really, not without a clear strategy.
This leads to chaos: components bottleneck, services drift out of
sync, and things break under load.
Latency can also become a killer.
A few milliseconds of delay can mean the difference between smooth
UX and customer drop-off.
And then there is downtime.
That's not just annoying.
It also costs revenue and damages trust.
Even a short outage can ripple across user bases. And with complexity comes risk.
The more moving parts you have, the harder it is to maintain security
and availability, especially when the attack surface grows.
Most systems are still reactive.
We fix problems after they occur, when alerts fire, logs explode,
and users complain.
But what if we could get ahead of all this chaos?
What if we could predict failure, detect anomalies instantly and scale
intelligently before things break?
That's where this talk leads today.
I'll walk you through how Kubernetes plus AI lets us shift from
reacting to predicting, preventing, and optimizing.
Let's get to the next slide.
So here we'll discuss some of the basic
foundations of modern resilience.
Kubernetes forms the core of our architecture, and for good
reason. First, it offers auto scaling, which means the system can
instantly respond to traffic spikes or dips, with no manual tuning needed.
Then there's self-healing.
If a container crashes, Kubernetes just spins up another without
anyone lifting a finger.
Rolling updates let us push changes without downtime,
keeping systems available while deploying new features. And with
service discovery and built-in load balancing, routing traffic is seamless,
even as services scale or shift.
The next thing to discuss is affinity rules.
These are very important: affinity rules help place pods
on the right nodes, optimizing for performance or isolation.
And the last thing is the security policies and resource
limits, which give us the guardrails to keep everything secure, stable,
and fair across different workloads.
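As a point of reference for the auto scaling mentioned above, the Kubernetes documentation describes the Horizontal Pod Autoscaler as scaling on the ratio between the current and the target metric value. The snippet below is a minimal sketch of that ratio rule in Python. The function name, the replica bounds, and the sample numbers are assumptions made for illustration, and details such as stabilization windows are left out.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Replica count from the current/target metric ratio (simplified)."""
    ratio = current_metric / target_metric
    # Close enough to the target: leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (illustrative values).
    return max(min_replicas, min(max_replicas, wanted))

# 4 pods at 90% average CPU against a 60% target.
print(desired_replicas(4, current_metric=90.0, target_metric=60.0))  # 6
```

Note that this rule only reacts to load that has already arrived, which is the gap the AI layer in the rest of the talk is meant to close.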
So Kubernetes gives us the operational muscle, but what
if it could also think ahead?
That's where AI comes in.
So let's get to the next slide and see what we can do with AI.
Let's think about how the system architecture can
be enhanced using artificial intelligence, at a high level.
Here's how we made our system not just reactive, but also intelligent.
At the base, we are collecting live metrics and logs from
all across the cluster.
Things like CPU, memory, pod status, and traffic patterns
are monitored and collected.
These streams feed into our AI/ML layer, which does the heavy lifting: forecasting
resource needs, spotting anomalies, and optimizing placement.
Then we have the decision engine, the brain of the system.
It combines model outputs with logic to make decisions: triggering
scaling, moving workloads, or even
alerting SREs. Over time it learns, becoming better
with every decision it makes.
So we are not just automating here; we are anticipating at a higher level.
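One way to picture the decision engine described above is as a function that takes the outputs of the AI/ML layer and applies plain rules to them. The sketch below is hypothetical: the `Signals` fields, the thresholds, and the action names are all assumptions for illustration, not the system from the talk, and the learning loop is omitted.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Signals:
    """Hypothetical outputs of the AI/ML layer for one workload."""
    forecast_cpu: float   # predicted CPU utilisation, 0.0 to 1.0
    anomaly_score: float  # 0.0 (normal) to 1.0 (highly unusual)
    failure_risk: float   # predicted probability of node failure

def decide(s: Signals) -> List[str]:
    """Combine model outputs with simple rules into a list of actions."""
    actions = []
    if s.anomaly_score >= 0.8:      # illustrative threshold
        actions.append("alert")
    if s.failure_risk >= 0.5:       # illustrative threshold
        actions.append("move_workload")
    if s.forecast_cpu >= 0.75:
        actions.append("scale_up")
    elif s.forecast_cpu <= 0.25:
        actions.append("scale_down")
    return actions or ["no_op"]

print(decide(Signals(forecast_cpu=0.9, anomaly_score=0.1, failure_risk=0.1)))
```

Keeping the rules separate from the models means a threshold can be changed, or a model swapped, without touching the other side.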
The next thing I'd like to discuss is
the AI models that power the system.
Let's talk about the brains behind this operation: the AI models.
We use LSTM models for forecasting.
They give us a heads-up on resource spikes so we can scale before
things even get tight. For anomaly detection, Isolation Forest helps us spot weird
behavior like cryptojacking or container escapes, so we can react fast.
We use PPO, a reinforcement learning model, to figure out the most efficient
placement of workloads across the cluster.
This alone improved our resource usage by 18%.
18% may seem like a small number, but in terms of
resources, 18% is really huge.
And finally, random forests
help us predict if something is likely to fail, so we can fix it
even before it breaks. These models
turn our infrastructure into a smart, predictive system, not just
a reactive one.
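To make the Isolation Forest idea more concrete: points that can be separated from the rest with only a few random splits are treated as anomalies. The following is a simplified, pure-Python sketch of that idea, written only for illustration. It is not the implementation used in the talk, the data is synthetic, and in practice one would more likely reach for a library implementation.

```python
import math
import random

def avg_path_length(n):
    """Expected depth of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively isolate points with random axis-aligned splits."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(tree, point, depth=0):
    """Depth at which `point` lands in a leaf, adjusted for the leaf size."""
    if "size" in tree:
        return depth + avg_path_length(tree["size"])
    branch = "left" if point[tree["dim"]] < tree["split"] else "right"
    return path_length(tree[branch], point, depth + 1)

def anomaly_scores(points, n_trees=100, sample_size=256, seed=0):
    """Score each point in (0, 1); higher means easier to isolate.

    Assumes `points` is a list with at least two entries.
    """
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = avg_path_length(sample_size)
    return [2.0 ** (-sum(path_length(t, p) for t in trees) / (n_trees * norm))
            for p in points]

# Synthetic (cpu, network) readings, invented for illustration only.
data_rng = random.Random(1)
readings = [(data_rng.gauss(0.30, 0.05), data_rng.gauss(0.20, 0.05))
            for _ in range(200)]
readings.append((0.98, 0.95))  # e.g. a pod that suddenly starts mining
scores = anomaly_scores(readings)
most_suspicious = max(range(len(scores)), key=lambda i: scores[i])
print(most_suspicious, round(scores[most_suspicious], 2))
```

The appealing property for cluster monitoring is that no labeled attack data is needed: the model only has to learn what normal readings look like.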
Going to the next slide, we can discuss real-time scaling
and resource allocation.
This is more of a game changer, I would say.
AI helps us scale before we even suffer.
In traditional systems, we wait for CPU to spike or latency to
creep in, and then scaling kicks in.
But by that point, users may already feel the lag.
With AI driven forecasting, we know what's coming.
We can scale up or down proactively, not reactively.
The result: a smoother user experience and lower
infrastructure costs altogether.
The risk of over-provisioning can also be kept minimal here,
because we respond in sub-second time.
The system stays ahead, always ready and always optimized.
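The proactive scaling described above can be sketched as two steps: forecast the load a few intervals ahead, then size the deployment for the forecast rather than for the current reading. The example below substitutes Holt's linear exponential smoothing for the LSTM mentioned in the talk, purely to stay dependency-free. The function names, smoothing factors, headroom, and per-pod capacity are all assumptions for illustration.

```python
import math
from typing import List

def holt_forecast(series: List[float], steps: int,
                  alpha: float = 0.5, beta: float = 0.3) -> float:
    """Forecast `steps` intervals ahead with Holt's linear smoothing.

    A simple stand-in for a learned forecaster such as an LSTM.
    Assumes `series` has at least two values.
    """
    level, trend = series[0], series[1] - series[0]
    for value in series[1:]:
        prev_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + steps * trend

def replicas_for(predicted_rps: float, rps_per_pod: float,
                 headroom: float = 0.2, min_replicas: int = 2) -> int:
    """Pods needed to serve the predicted load with some spare capacity."""
    needed = math.ceil(predicted_rps * (1 + headroom) / rps_per_pod)
    return max(min_replicas, needed)

# Requests per second over the last five intervals (made-up numbers).
recent_rps = [100.0, 120.0, 140.0, 160.0, 180.0]
predicted = holt_forecast(recent_rps, steps=3)
print(round(predicted), replicas_for(predicted, rps_per_pod=50.0))
```

Scaling on the forecast means the extra pods are already warm when the traffic arrives, instead of starting up while users wait.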
The next thing I would like to discuss is security,
which is really important in any use case, whether that is
Kubernetes with AI or without AI.
It's really important.
When we think about availability, most people focus on
infrastructure and traffic, but what really brings systems down
fast are security threats.
That's why we build AI directly into our security stack.
First, we use behavioral threat detection to identify suspicious container or
pod activity like crypto mining, privilege escalation, or unexpected file
access before it can impact services.
Our smart API monitoring tracks how internal and
external APIs are accessed.
If something deviates from the norm, it's flagged immediately and action
is taken before any performance dip.
On the network side, our system spots anomalies like data exfiltration
or DDoS attempts with an extremely low false positive rate.
That was a pretty good improvement for us,
because a low false positive rate is really important in general.
So what is the biggest win here?
Threats are mitigated in under three seconds, and that
keeps latency low and availability high, even in the face of
active attacks. And with self-learning models, the system keeps evolving,
improving its detection accuracy as it
encounters new patterns and risks.
Security here isn't just about protection.
It's a critical piece of ensuring uptime.
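The API monitoring idea above, flagging traffic that deviates from the norm, can be illustrated with a small streaming baseline. This is a hypothetical sketch, not the monitoring stack from the talk: the class name, the smoothing factor, the threshold, and the call counts are all assumptions.

```python
import math

class BaselineMonitor:
    """Flags values that stray far from an exponentially weighted baseline."""

    def __init__(self, alpha: float = 0.1, threshold: float = 4.0,
                 warmup: int = 10, min_std: float = 1.0):
        self.alpha = alpha          # how quickly the baseline adapts
        self.threshold = threshold  # allowed deviation, in standard deviations
        self.warmup = warmup        # observations to learn from before flagging
        self.min_std = min_std      # floor so flat traffic is not over-flagged
        self.mean, self.var, self.seen = 0.0, 0.0, 0

    def observe(self, value: float) -> bool:
        """Return True if `value` deviates from the learned norm."""
        self.seen += 1
        if self.seen == 1:
            self.mean = value
            return False
        deviation = value - self.mean
        std = max(math.sqrt(self.var), self.min_std)
        anomalous = self.seen > self.warmup and abs(deviation) > self.threshold * std
        if not anomalous:
            # Learn only from traffic judged normal, so a burst does not
            # drag the baseline along with it.
            self.mean += self.alpha * deviation
            self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return anomalous

# Calls per minute to one internal API (made-up numbers), ending in a burst.
calls = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 99, 450]
monitor = BaselineMonitor()
flags = [monitor.observe(c) for c in calls]
print(flags[-1])  # True
```

A wider threshold trades detection speed for fewer false positives, which is the balance the talk highlights as important.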
The next thing I would like to discuss is disaster
recovery and fault tolerance.
Let's face it: however optimized or secure any system is,
failures will still happen, right?
So what matters most is how quickly and
intelligently we can recover.
Our system uses AI to continuously scan for early warning signs,
whether it's disk issues, memory leaks,
or degrading node health, and it predicts failures even before they occur.
Once a risk is detected, the risk assessment engine evaluates how
critical it is, and in seconds it generates a disaster recovery plan
tailored to our live cluster state.
We are not relying on predefined static playbooks here.
These plans are adaptive and dynamic, built for the current
moment.
In fact, if action is needed, the platform can orchestrate failover
on demand or automatically, keeping our RTO under 15 minutes.
With real-time replication in place, we keep RPO under five minutes,
so data loss is minimal to none.
All of this is AI-driven, automated, and fast, helping us
stay resilient, available, and always ready, even during the unexpected.
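For readers less familiar with the two recovery targets above: RPO is the window of data that could be lost (time since the last successful replication), and RTO is how long the service stays down. A minimal sketch of checking one incident against the targets quoted in the talk follows; the function name and the timestamps are invented for illustration.

```python
from datetime import datetime, timedelta

RTO_TARGET = timedelta(minutes=15)  # targets as quoted in the talk
RPO_TARGET = timedelta(minutes=5)

def recovery_report(last_replicated_at: datetime, failure_at: datetime,
                    restored_at: datetime) -> dict:
    """Compare one incident's measured recovery times with the targets."""
    rpo = failure_at - last_replicated_at  # data written in this window may be lost
    rto = restored_at - failure_at         # time the service was unavailable
    return {"rpo": rpo, "rto": rto,
            "rpo_met": rpo < RPO_TARGET, "rto_met": rto < RTO_TARGET}

# Hypothetical incident timeline.
report = recovery_report(
    last_replicated_at=datetime(2025, 1, 1, 11, 59, 30),
    failure_at=datetime(2025, 1, 1, 12, 0, 0),
    restored_at=datetime(2025, 1, 1, 12, 9, 0),
)
print(report["rpo_met"], report["rto_met"])  # True True
```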
The next thing I would like to discuss is the real-world results:
how AI plus Kubernetes plays out when put into action.
Here's what all of this looks like when it's real, in production,
at scale, across industries.
In e-commerce, AI-powered auto scaling helped one platform cut infrastructure
costs by 31% while maintaining 99.99% uptime across thousands of
nodes, even during flash sales and seasonal spikes.
In financial services, the combination of predictive security and automated failover
reduced incidents by 45% and achieved five-nines availability for their most
critical systems, including trading, which is the most important thing here.
For IoT platforms, which run at the edge,
AI helped optimize
resource usage, cutting data transfer costs by 40% and
improving edge performance by 35%.
That's a massive win when every MB and millisecond matters.
And in pharma,
AI-driven forecasting and disaster recovery kept supply chains stable
during global disruption, preventing stockouts when demand was unpredictable
and logistics were chaotic.
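To put the availability figures quoted above into perspective, an availability percentage can be converted into an allowed downtime budget. The short calculation below assumes a 365-day year; the function name is made up for this example.

```python
def downtime_budget_minutes(availability_percent: float, days: float = 365.0) -> float:
    """Minutes of downtime allowed over `days` at the given availability."""
    return days * 24 * 60 * (1 - availability_percent / 100)

for label, pct in [("99.99%", 99.99), ("five nines", 99.999)]:
    print(f"{label}: {downtime_budget_minutes(pct):.2f} minutes per year")
```

Under that assumption, 99.99% leaves roughly 52.6 minutes of downtime per year, and five nines leaves only about 5.3 minutes, which is why automated failover matters so much for the trading example.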
These are real results, made possible by combining AI's intelligence
with Kubernetes' orchestration power. And the next thing is
the future: what can be done?
Whenever we develop something, there's always something new we
can expect in the future
that can help us keep getting better and better.
So let's look at how moving from automation toward autonomous
infrastructure can help here.
Let's end by looking at where all this is going.
What we have built so far with AI-enhanced Kubernetes is
powerful, but it's still automation.
The future is autonomy.
Imagine infrastructure that doesn't just scale or heal. It
thinks,
it forecasts, it makes decisions.
AI agents will soon make real-time infrastructure choices: rerouting
traffic, spinning up clusters, and preemptively blocking threats, often
before we are even aware of the issue.
And to build
trust in these autonomous systems, we'll need
explainable AI, where every scaling event, security block,
or recovery step is traceable and justified.
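One simple way to make automated decisions traceable is to emit a structured record for each one, capturing what was done, which model asked for it, and on what evidence. The sketch below is hypothetical: the record fields, the model name, and the values are assumptions for illustration, not a format described in the talk.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict

@dataclass
class DecisionRecord:
    """Audit entry for one automated action (illustrative fields)."""
    action: str                # what the system did
    target: str                # which workload or node it acted on
    model: str                 # which model produced the signal
    inputs: Dict[str, float]   # the evidence the decision was based on
    reason: str                # human-readable justification

    def to_json(self) -> str:
        """Serialize with stable key order so records are easy to diff."""
        return json.dumps(asdict(self), sort_keys=True)

record = DecisionRecord(
    action="scale_up",
    target="deployment/checkout",  # hypothetical workload name
    model="cpu-forecaster-v1",     # hypothetical model name
    inputs={"forecast_cpu": 0.91, "current_cpu": 0.62},
    reason="forecast CPU above the 0.75 scale-up threshold",
)
print(record.to_json())
```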
We are also heading toward federated intelligence.
That is one possible direction we can expect in the future, where
clusters learn from each other across geographies and environments.
Imagine global resilience tuned by shared insights.
This all leads to the autonomous SRE, where machines don't just assist
humans, they take charge of operations.
As engineers, our role shifts from operators to strategists,
from writing rules to training systems.
So this is what I wanted to talk about, in brief: how
Kubernetes plus AI can really help these systems scale,
stay highly available, and stay secure at different levels.
Thank you all for listening to my talk today.
I'm ready to answer any questions, concerns, or other
queries that anyone has.