Ensuring High Availability and Low Latency in Distributed Systems Using Kubernetes

Abstract

Supercharge your distributed systems with Kubernetes for lightning-fast, low-latency data processing! Learn how Kubernetes’ auto-scaling, fault tolerance, and seamless resource management boost performance and reliability, powering real-time applications in finance, IoT, and other critical sectors.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hi everyone. This is Nikhil Kassetty. I'm a software engineer specialized in cloud, AI, and FinTech. I'm super excited to be here at Conf42 SRE 2025. I'll walk you through how we can supercharge distributed systems using Kubernetes, not just for orchestration, but for real-time optimization driven by AI. We'll explore how to stay resilient, scale smart, and respond in milliseconds, all while reducing costs and improving uptime. Whether you're in e-commerce, finance, IoT, or SRE at scale, you'll walk away with practical ideas you can apply today. Let's dive in and get to the next slide.
Now let's step back and look at what we are really solving here. Today's distributed systems are powerful, but they come with complexity. First, we are scaling like crazy: more data, more nodes, and more regions. But are we scaling smart? Often, not really. Without a clear strategy, this leads to chaos: components bottleneck, services drift out of sync, and things break under load. Latency can also become a killer. A few milliseconds of delay can mean the difference between a smooth UX and customer drop-off. And then there is downtime. That's not just annoying; it also costs revenue and damages trust. Even a short outage can ripple across user bases. And with complexity comes risk. The more moving parts you have, the harder it is to maintain security and availability, especially as the attack surface grows. Most systems are still reactive. We fix problems after they occur, when alerts fire, logs explode, and users complain. But what if we could get ahead of all this chaos? What if we could predict failures, detect anomalies instantly, and scale intelligently before things break? That's where this talk leads today. I'll walk you through how to use Kubernetes plus AI to shift from reacting to predicting, preventing, and optimizing. Let's get to the next slide.
Here we discuss the basic foundations of modern resilience. Kubernetes forms the core of our architecture, and for good reason. First, it offers auto-scaling, which means the system can instantly respond to traffic spikes or dips, with no manual tuning needed. Then there's self-healing. If a container crashes, Kubernetes just spins up another without anyone lifting a finger. Rolling updates let us push changes without downtime, keeping systems available while deploying new features. And with service discovery and built-in load balancing, routing traffic is seamless, even as services scale or shift. The next thing to discuss is affinity rules. These are very important: affinity rules help place pods on the right nodes, optimizing for performance or isolation. And the last thing is security policies and resource limits, which give us the guardrails to keep everything secure, stable, and fair across different workloads. So Kubernetes gives us the operational muscle, but what if it could also think ahead? That's where AI comes in. Let's get to the next slide and see what we can do with AI.
Let's look, at a high level, at how the system architecture can be enhanced using artificial intelligence. Here's how we made our system not just reactive, but also intelligent. At the base, we collect live metrics and logs from all across the cluster. Things like CPU, memory, pod status, and traffic patterns are monitored and collected. These streams feed into our AI/ML layer, which does the heavy lifting: forecasting resource needs, spotting anomalies, and optimizing placement. Then we have the decision engine, the brain of the system. It combines model outputs with logic to make decisions: triggering scaling, moving workloads, or even
alerting SREs. Over time it learns, becoming better with every decision it makes. So we are not just automating here; we are anticipating at a higher level.
The next thing to discuss is the AI models that power the system. Let's talk about the brains behind this operation. We use LSTM models for forecasting. They give us a heads-up on resource spikes so we can scale before things get tight. For anomaly detection, Isolation Forest helps us spot unusual behavior like cryptojacking or container escapes, so we can react fast. We use PPO, a reinforcement learning model, to figure out the most efficient placement of workloads across the cluster. This alone improved our resource usage by 18%. 18% may seem like a small number, but in terms of resources it is really huge. And finally, Random Forests help us predict if something is likely to fail, so we can fix it before it breaks. These models turn our infrastructure into a smart, predictive system, not just a reactive one.
Going to the next slide, we can discuss real-time scaling and resource allocation. This is more of a game changer, I would say. AI helps us scale before we suffer. In traditional systems, we wait for CPU to spike or latency to creep in, and then scaling kicks in. But by that point, users may already feel the lag. With AI-driven forecasting, we know what's coming. We can scale up or down proactively, not reactively. The result: a smoother user experience, lower infrastructure costs altogether, and minimal risk of over-provisioning. Because we respond in sub-second time, the system stays ahead, always ready and always optimized.
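A minimal sketch of the proactive-scaling idea described above, assuming request rate is the driving metric. The talk's forecasting uses LSTM models; a least-squares linear trend stands in here only to keep the example dependency-free. The function names, `rps_per_pod`, the replica bounds, and the sample numbers are illustrative assumptions, not details from the talk.

```python
import math


def forecast_next(samples, horizon=1):
    """Extrapolate a least-squares linear trend `horizon` steps past the last sample.

    Stand-in for the LSTM forecaster mentioned in the talk (assumption).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    # Evaluate the fitted line at index (n - 1 + horizon).
    return mean_y + slope * ((n - 1 + horizon) - mean_x)


def replicas_for(load_rps, rps_per_pod, min_replicas=1, max_replicas=50):
    """Replicas needed to serve `load_rps`, clamped to the allowed range."""
    needed = math.ceil(load_rps / rps_per_pod)
    return max(min_replicas, min(max_replicas, needed))


# Hypothetical requests-per-second samples, one per scrape interval.
recent_rps = [400, 450, 500, 550, 600]

# Reactive: size for the load we see right now.
reactive = replicas_for(recent_rps[-1], rps_per_pod=100)

# Proactive: size for the load expected two intervals from now.
predicted = forecast_next(recent_rps, horizon=2)
proactive = replicas_for(predicted, rps_per_pod=100)

print(f"reactive={reactive} proactive={proactive} predicted_rps={predicted:.0f}")
```

With these sample numbers the reactive sizing asks for 6 replicas while the forecast-based sizing asks for 7, so the extra pod would be requested before the load arrives rather than after latency has already crept in.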
The next thing I would like to discuss is security, which is really important in any use case, with Kubernetes or without, with AI or without. When we think about availability, most people focus on infrastructure and traffic, but what really brings systems down fast are security threats. That's why we built AI directly into our security stack. First, we use behavioral threat detection to identify suspicious container or pod activity, like crypto mining, privilege escalation, or unexpected file accesses, before it can impact services. Our smart API monitoring tracks how internal and external APIs are accessed. If something deviates from the norm, it's flagged immediately and action is taken before any performance dip. On the network side, our system spots anomalies like data exfiltration or DDoS attempts with an extremely low false-positive rate. That was a pretty good improvement for us, because a low false-positive rate is really important in general. So what is the biggest win here? Threats are mitigated in under three seconds, and that keeps latency low and availability high, even in the face of active attacks. And with self-learning models, the system keeps evolving, improving its detection accuracy as it encounters new patterns and risks. Security here isn't just about protection. It's a critical piece of ensuring uptime.
The next thing I would like to discuss is disaster recovery and fault tolerance. Let's face it: no matter how optimized or secure a system is, failures will still happen, right? So what matters most is how quickly and intelligently we can recover. Our system uses AI to continuously scan for early warning signs, whether it's disk issues, memory leaks, or degrading node health, and it predicts failures before they occur.
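One simple way to picture an early-warning signal for a memory leak is to fit a trend to recent usage and estimate when it would reach the limit. This is only a sketch under stated assumptions: the talk does not describe how its predictions are computed (it mentions Random Forests for failure prediction), and the function name, the 60-second sampling interval, and the sample values are hypothetical.

```python
def time_to_exhaustion(usage, limit, interval_s=60.0):
    """Estimate seconds until a linear trend through `usage` reaches `limit`.

    `usage` holds evenly spaced samples, `interval_s` seconds apart.
    Returns None when usage is flat or shrinking, 0.0 when the fitted
    trend is already at or above the limit.
    """
    n = len(usage)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(usage) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(usage))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx  # growth per sample
    if slope <= 0:
        return None
    # Fitted value at the most recent sample, less noisy than the raw reading.
    fitted_last = mean_y + slope * ((n - 1) - mean_x)
    if fitted_last >= limit:
        return 0.0
    return (limit - fitted_last) / slope * interval_s


# Hypothetical memory readings in MiB for one container, limit 1000 MiB.
leaking = [500, 520, 540, 560, 580]
steady = [500, 500, 500, 500, 500]

eta_leaking = time_to_exhaustion(leaking, limit=1000)
eta_steady = time_to_exhaustion(steady, limit=1000)

print(eta_leaking, eta_steady)
```

For the leaking series the estimate is 1260 seconds (21 minutes) until the limit, while the steady series yields no estimate. A decision engine could compare such an estimate against the time a recovery action needs and act while there is still headroom.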
Once a risk is detected, the risk assessment engine evaluates how critical it is, and in seconds it generates a disaster recovery plan tailored to our live cluster state. We are not relying on predefined static playbooks here. These plans are adaptive and dynamic, built for the current moment. In fact, if action is needed, the platform can orchestrate failover on demand or automatically, keeping our RTO under 15 minutes. With real-time replication in place, we keep RPO under five minutes, so data loss is minimal to none. All of this is AI-driven, automated, and fast, helping us stay resilient, available, and always ready, even during the unexpected.
Next, let's look at real-world results and how AI plus Kubernetes works in action. Here's what it looks like when it's real: in production, at scale, across industries. In e-commerce, AI-powered auto-scaling helped one platform cut infrastructure costs by 31% while maintaining 99.99% uptime across thousands of nodes, even during flash sales or seasonal spikes. In financial services, the combination of predictive security and automated failover reduced incidents by 45% and achieved five-nines availability for their most critical systems, including trading, which is the most important one here. For IoT platforms, which run at the edge, AI helped optimize resource usage, cutting data transfer costs by 40% and improving edge performance by 35%. That's a massive win when every MB and millisecond matters. And in pharma, AI-driven forecasting and disaster recovery kept supply chains stable during global disruption, preventing stockouts when demand was unpredictable and logistics were chaotic. These are real results, made possible by combining AI's intelligence with Kubernetes' orchestration power.
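For reference, the uptime figures quoted above translate into downtime budgets with simple arithmetic. The sketch below assumes a 365-day year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Maximum downtime allowed over the period at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (100.0 - availability_pct) / 100.0 * period_minutes


four_nines = downtime_budget_minutes(99.99)   # the e-commerce figure
five_nines = downtime_budget_minutes(99.999)  # the financial-services figure

print(f"99.99%  -> {four_nines:.2f} minutes per year")
print(f"99.999% -> {five_nines:.2f} minutes per year")
```

Under that assumption, 99.99% uptime leaves about 52.56 minutes of downtime per year, and five nines leaves about 5.26 minutes.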
Whenever we develop something, there's always something we can expect in the future: new things that can help us keep getting better and better. That brings us to the move from automation toward autonomous infrastructure. Let's end by looking at where all this is going. What we have built so far with AI-enhanced Kubernetes is powerful, but it's still automation. The future is autonomy. Imagine infrastructure that doesn't just scale or heal. It thinks. It forecasts. It makes decisions. AI agents will soon make real-time infrastructure choices: rerouting traffic, spinning up clusters, and preemptively blocking threats, often before we are even aware of the issue. And to build trust in these autonomous systems, we'll need explainable AI, where every scaling event, security block, or recovery step is traceable and justified. We are also heading toward federated intelligence. That is one possible direction we can expect in the future, where clusters learn from each other across geographies and environments. Imagine global resilience tuned by shared insights. This all leads to the autonomous SRE, where machines don't just assist humans; they take charge of operations. As engineers, our role shifts from operators to strategists, from writing rules to training systems.
So that is what I wanted to talk about, in short: how Kubernetes plus AI can really help these systems scale, with high availability and with security at different levels. Thank you all for listening to my talk today. I'm happy to answer any questions or concerns anyone has.

Conf42 Site Reliability Engineering (SRE) 2025 - Online

April 17 2025 - premiere 5PM GMT

Nikhil Kassetty

Software Engineer @ Intuit
