Conf42 Kube Native 2025 - Online

- premiere 5PM GMT

AI-Powered Observability with LLMs: Accelerating Resilience in Cloud-Native PaaS Architectures

Abstract

Discover how Large Language Models are transforming observability in cloud-native apps. Learn how to cut MTTR, reduce noise, and future-proof your platform with AI-powered diagnostics, self-healing, and intelligent code insights. Real results. Real impact.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, and welcome to Conf42 Kube Native 2025. My name is Srinivas Pagadala Sekar, and I am delighted to share my experiences on a topic that sits at the heart of modern cloud operations: how large language models are transforming observability and accelerating resilience in platform-as-a-service architectures. I've spent more than 14 years working across cloud-native technologies, site reliability engineering, and software delivery, and over this time I have watched the scale and complexity of cloud systems outpace traditional monitoring approaches. LLMs are not just a trend; they are rapidly reshaping how we detect, understand, and recover from issues in dynamic environments. Today we will talk through the key challenges of observability in modern platform as a service, how LLM-driven innovations are changing the game, and how providers can deliver observability as a native service rather than just an add-on. So let's dive in without any further delay.

Talking about the challenge of modern cloud observability: observability in today's cloud world is anything but simple. Even though the ask might sound simple, with great power comes great complexity, and I hope everyone agrees with me on this, especially if your mission is to build a resilient platform-as-a-service architecture. Organizations are pouring investment into public cloud platforms to gain agility and scalability. Teams are embracing microservices and moving workloads onto cloud-native platform-as-a-service offerings to ship features faster, but this evolution also comes with a steeper price: complexity skyrockets. Legacy monitoring tools often just dump vast amounts of raw data, leaving engineers with dashboards full of noise but few actionable insights when systems fail. The gap between an alert and a fix keeps stretching, slowing down recovery.

In platform-as-a-service environments, this complexity is further magnified. Services are deeply distributed and depend on each other in unpredictable ways. A single issue can ripple through an entire ecosystem, and alerts frequently arrive without enough context for engineers to respond quickly. And let's talk about dynamic workloads: think about a platform-as-a-service offering built on Kubernetes, with auto-scaling pods spinning up and shutting down within seconds. They make the failure landscape even harder to track and the exact root cause even harder to identify.

Now let's take the example of a FinTech platform as a service handling a very high volume of trades. Imagine a sudden spike in traffic hits, or a single service falters, and transaction delays begin to cascade. Your SRE team is flooded with alarms but lacks correlated insights. Recovery slows down, downtime grows, and resilience suffers. I've seen these scenarios play out repeatedly, and they have convinced me that embedding smarter observability directly into the platform can fundamentally change how fast we bounce back from disruption.

Let's dig deeper into distributed platform-as-a-service environments. Imagine a healthcare platform as a service managing electronic medical records, where patient scheduling and clinical notifications are core services. These services must integrate flawlessly because they directly affect the patient experience, and sometimes even patient safety and other protocols as well. A seemingly small update to one service can suddenly overload the event streams, choke dependencies, and trigger cascading failures. Meanwhile, the underlying infrastructure is ephemeral: containers scale in and out, IPs change, and old troubleshooting assumptions no longer hold good. This reality underscores the urgent need for observability to evolve. Platform-as-a-service offerings can no longer treat resilience as a bolt-on feature or something that customers figure out themselves. Instead, observability must be integrated, always on, context aware, and capable of accelerating detection and recovery the moment a disruption begins.

Now, with that, let's move on to the evolution of observability. Observability has come a long way, but it's been a slow climb. We started with basic infrastructure monitoring, where CPU graphs, memory usage charts, and simple alerts were in place. These worked well when systems were monolithic and failures were easier to isolate. Then came the three-pillars approach of the DevOps era, where metrics, logs, and traces gave us more depth but still required human correlation and interpretation. Tools like distributed tracing and chaos engineering improved proactive testing, but they still demanded manual effort. Now, with AI and LLMs, we are entering a new era. These models can ingest and interpret massive telemetry streams, correlate events in real time, and explain what, why, and how an incident took place, rather than humans tackling all of this alone. For platform-as-a-service providers, this unlocks a new value proposition: observability as a built-in intelligence service. The platform itself can proactively reduce downtime and guide recovery, with no separate stack for customers to manage.

Now let's take a look at large language models, the new paradigm. LLMs break the traditional rules-based approach to monitoring. Instead of static thresholds or hardcoded alerts, they understand context. They excel at correlating diverse data sources, like metrics, traces, logs, and even deployment histories, to explain why something is happening and how to resolve it. Imagine an e-commerce platform as a service during a Black Friday rush: latency spikes in the checkout service. A conventional alert tells you only that latency is high, but an LLM can dig deeper, linking a recent deployment, a sudden memory spike in one service, and an abnormal error pattern in another, and then surface a clear, actionable remediation to roll back or scale up. In this case, recovery time shrinks dramatically. This agility also enables platform-as-a-service vendors to offer tenant-specific, LLM-powered insights right out of the box. Customers don't need to build or fine-tune their own models; resilience becomes part of the platform itself.

Now let's take a deeper look at the key capabilities an LLM-enhanced observability model can offer. It can provide automated root cause analysis, correlating telemetry across layers to rapidly pinpoint why a failure occurred; remediation guidance, suggesting or even executing recovery steps based on learned patterns; adaptive learning, continuously improving as it observes your workloads over time; and it can decide when to escalate to humans versus self-heal, cutting manual delays with intelligent automation.
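As a rough illustration of this automated root cause analysis idea, here is a minimal sketch that assembles correlated telemetry into a single prompt and asks a model for cause and remediation. The `complete()` helper, the service names, and the alert fields are assumptions for illustration, not a specific product API.

```python
# Minimal sketch of LLM-assisted root cause analysis for a PaaS alert.
# Assumptions: `complete(prompt)` wraps whatever chat-completion endpoint your
# platform uses; the alert/telemetry fields and service names are illustrative.
import json
import textwrap

def complete(prompt: str) -> str:
    """Placeholder for a call to your LLM provider or self-hosted model."""
    raise NotImplementedError("wire this to your model endpoint")

def build_rca_prompt(alert: dict, telemetry: dict, recent_deploys: list[dict]) -> str:
    """Assemble correlated context (metrics, logs, traces, deploy history)
    into one prompt so the model can explain *why*, not just *what*."""
    return textwrap.dedent(f"""
        You are an SRE assistant for a multi-tenant PaaS.
        Alert: {json.dumps(alert)}
        Correlated telemetry (last 15 min): {json.dumps(telemetry)}
        Recent deployments: {json.dumps(recent_deploys)}
        1. Identify the most likely root cause.
        2. Explain the causal chain across services.
        3. Propose a remediation (rollback, scale-up, config change) and a risk level.
        Answer as JSON with keys: root_cause, explanation, remediation, risk.
    """).strip()

if __name__ == "__main__":
    alert = {"service": "checkout", "symptom": "p99 latency > 2s"}
    telemetry = {
        "checkout": {"p99_ms": 2300, "error_rate": 0.04},
        "payments": {"memory_mb": 1900, "oom_kills": 3},
    }
    deploys = [{"service": "payments", "version": "v142", "age_min": 22}]
    prompt = build_rca_prompt(alert, telemetry, deploys)
    print(prompt)  # inspect the context the model would see
    # analysis = json.loads(complete(prompt))  # enable once an endpoint is wired in
```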
Picture a logistics company hit by a routing outage: an LLM not only flags the anomaly, it also helps analyze the patterns and propose, or even auto-execute, rerouting strategies. A recovery that could have taken hours might now take just minutes.

So let's look at the implementation approach. This can be implemented in the following ways. You can start by integrating your alerts: feed your LLMs with existing monitoring signals and enrich them with context, so alerts become meaningful narratives rather than just noise. The models can also be trained for specific domains, fine-tuned on your platform's telemetry, deployment histories, and failure patterns to improve accuracy. They can deliver predictive insights, moving away from reactive alerts to forecasts and catching failing nodes or bottlenecks before they impact tenants. You can give engineers a chat-like conversational interface to ask "why is this latency spiking?" and get a clear, context-rich answer. And with deep correlation you can link seemingly unrelated signals, like API errors, pod churn, or config changes, to stop cascading failures early. I have read about a media-streaming platform-as-a-service environment where an LLM connected playback issues to recent API throttling, helping engineers fix the problem in minutes instead of hours.

Now, after implementation, the important part is the actual outcome. What are the benefits? Before LLM integration, mean time to resolution was long, alert fatigue was real, and recovery relied heavily on expert engineers piecing the clues together. After LLM adoption, resolution times drop dramatically, in some cases to under ten minutes. There is also a positive impact from fewer false positives, so productivity increases and overall reliability improves. By baking this into a platform as a service, managed service providers can standardize resilience gains across all tenants, not just those with elite SRE teams.

Now let's look at a technical implementation framework that outlines the steps. Start with data ingestion and normalization: pull telemetry from metrics, logs, traces, and even CI/CD events, then clean and sanitize the data and enrich it with metadata. Choose models and fine-tune them for platform-specific workloads, for example scaling patterns or Kubernetes churn, and ensure low-latency inference by optimizing the LLM. Then comes insights delivery, with tailored outputs for each persona: instant remediation hints for SREs, summarized health and trends for developers or even managers. And the most important part is toolchain integration, where you expose the results via APIs, dashboards, or even the platform console so users can consume insights natively.
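For the first step of that framework, here is a minimal sketch of ingestion and normalization: mapping a raw log record into a common event schema and redacting obvious secrets before anything reaches the model. The payload shape and field names are assumptions for illustration.

```python
# Minimal sketch of the "data ingestion and normalization" step.
# Assumptions: the source payload shapes and field names are illustrative; in a
# real PaaS these would come from your metrics, logging, tracing, and CI/CD systems.
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone

SECRET_PATTERN = re.compile(r"(password|token|apikey)=\S+", re.IGNORECASE)

@dataclass
class TelemetryEvent:
    """Common schema the LLM pipeline consumes, regardless of source."""
    source: str                      # "metrics" | "logs" | "traces" | "cicd"
    service: str
    timestamp: str
    body: str
    metadata: dict = field(default_factory=dict)

def sanitize(text: str) -> str:
    """Strip obvious credentials before anything reaches the model."""
    return SECRET_PATTERN.sub(r"\1=<redacted>", text)

def normalize_log(raw: dict) -> TelemetryEvent:
    """Map one raw log record into the common event schema with rich metadata."""
    return TelemetryEvent(
        source="logs",
        service=raw.get("kubernetes", {}).get("labels", {}).get("app", "unknown"),
        timestamp=raw.get("ts", datetime.now(timezone.utc).isoformat()),
        body=sanitize(raw.get("message", "")),
        metadata={"namespace": raw.get("kubernetes", {}).get("namespace", "")},
    )

if __name__ == "__main__":
    raw = {"ts": "2025-01-10T12:00:00Z",
           "message": "payment failed token=abc123 for order 42",
           "kubernetes": {"labels": {"app": "payments"}, "namespace": "prod"}}
    print(normalize_log(raw))
```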
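And for the toolchain-integration end of the framework, a small sketch of exposing persona-tailored insights through a platform API. Flask is used purely for illustration, and the personas and hardcoded payloads are stand-ins for whatever the LLM pipeline would actually produce per tenant.

```python
# Minimal sketch of exposing persona-tailored insights via the platform API.
# Assumptions: Flask is illustrative; the personas and stubbed insight payloads
# are placeholders for real LLM-generated, per-tenant summaries.
from flask import Flask, jsonify, abort

app = Flask(__name__)

# In a real system these summaries would be generated by the LLM pipeline
# and cached per tenant; here they are hardcoded stand-ins.
INSIGHTS = {
    "sre": {"view": "remediation", "next_action": "roll back payments v142"},
    "developer": {"view": "service health", "error_budget_left": "62%"},
    "manager": {"view": "trend", "mttr_this_week_min": 9, "mttr_last_week_min": 47},
}

@app.route("/v1/insights/<persona>")
def insights(persona: str):
    """Return the latest summary shaped for the requesting persona."""
    payload = INSIGHTS.get(persona.lower())
    if payload is None:
        abort(404, description=f"unknown persona '{persona}'")
    return jsonify(payload)

if __name__ == "__main__":
    app.run(port=8080)   # e.g. GET http://localhost:8080/v1/insights/sre
```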
Now let's talk about a more important topic: the evaluation methodology. When introducing LLM-driven observability into a platform-as-a-service architecture, it is essential to have a structured evaluation approach to prove its value before wide adoption. A practical evaluation can include four complementary methods.

One is controlled experimentation, where you stand up a test or staging environment that mimics production, for example a multi-tenant Kubernetes cluster or a service mesh with real telemetry pipelines, and inject failures such as pod crashes, cascading service degradation, or traffic surges. You then compare the performance of your existing observability stack with the LLM-enhanced system under the same controlled events to get clear, reproducible results.

Once that is in place, the next step is quantitative analysis, where you collect hard metrics like mean time to detection, mean time to resolution, and false positive rates to understand the accuracy of root cause detection. You should also monitor how quickly the system detects the trigger, correlates the signals, and guides or executes remediation. A well-instrumented test often shows a significant reduction in MTTR, the mean time to resolution, when LLM insights are added.

Next comes qualitative analysis, where you engage the SRE and DevOps teams who work directly with the new system. Use structured feedback sessions or surveys to capture changes in alert clarity, cognitive load, and trust in the recommendations. This human-centered lens highlights whether the AI explanations are actionable and understandable, which is critical for adoption.

Then comes statistical validation, where you apply statistical testing, for example t-tests, or run A/B tests with the traditional model on one side and the LLM-enhanced model on the other, and calculate effect sizes across multiple incidents to ensure the observed improvements aren't just random. This step transforms raw results into credible, data-backed evidence that the new approach truly accelerates resilience. By combining controlled experimentation, quantitative KPIs, and human feedback, platform-as-a-service providers can confidently validate that LLM-powered observability delivers measurable resilience gains before scaling it as a core service.
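As a concrete sketch of that statistical-validation step, the snippet below compares MTTR samples from two groups of injected incidents with a Welch t-test and a Cohen's d effect size. The MTTR values are illustrative, not measured results.

```python
# Minimal sketch of the statistical validation step: compare incident MTTR with
# and without the LLM-enhanced stack using a t-test and an effect size.
# Assumptions: scipy/numpy are available; the sample MTTR values are illustrative.
import numpy as np
from scipy import stats

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Effect size: difference of means over the pooled standard deviation."""
    pooled = np.sqrt((a.std(ddof=1) ** 2 + b.std(ddof=1) ** 2) / 2)
    return (a.mean() - b.mean()) / pooled

if __name__ == "__main__":
    # MTTR in minutes across comparable injected incidents (A/B groups).
    baseline = np.array([47, 52, 38, 61, 44, 55, 49, 58])   # traditional stack
    llm_enhanced = np.array([9, 12, 8, 15, 11, 10, 13, 9])  # LLM-enhanced stack

    t_stat, p_value = stats.ttest_ind(baseline, llm_enhanced, equal_var=False)
    print(f"t={t_stat:.2f}, p={p_value:.4f}, d={cohens_d(baseline, llm_enhanced):.2f}")
    # A small p-value plus a large effect size is the "not just random" evidence
    # the talk refers to; real runs need enough incidents per group to be credible.
```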
Another important topic we need to talk about is privacy, security, and ethical considerations. Once evaluation is complete, more focus should be given to privacy, security, and ethical implementation; trust is essential, right? When we talk about privacy: collect only what is actually necessary, anonymize it wherever possible, and design the data pipelines to comply with regulations like HIPAA or GDPR. In terms of security: guard the model against misuse, hallucinations, or data leakage, and keep critical remediation steps human-verified wherever appropriate. And the most important part is ethics. AI and LLMs come with their own governance, and many systems, specifically in healthcare, government, or finance, have their own regulations on how AI and LLMs can be used. So mitigate bias in model training and ensure that the outputs are explainable so engineers can trust the recommendations. By addressing these from the start, a platform-as-a-service provider can confidently offer observability as a service without compromising compliance or user trust.

So let's talk about future directions and where we are headed. Looking forward, we are heading towards fully self-healing platform-as-a-service capabilities that accept multimodal inputs, like logs, metrics, traces, video, or voice, to enrich diagnostics. Industry-specific models will give domain-tuned resilience, for example healthcare versus FinTech versus gaming, et cetera. Federated learning will allow sharing intelligence across tenants without leaking sensitive data. And with edge-aware observability, where compute and reachability are limited, the platform can adopt a hybrid model with lightweight LLMs that serve observability at the edge as well. In short, resilience will become proactive and predictive by default, no longer a reaction, but an intrinsic platform capability.

Now let's talk about practical implementation guidelines. For those planning to implement this, the first step is to prepare your infrastructure: always ensure that you have enough compute and memory for LLM inference workloads. Then strategize on your data: prioritize clean, labeled telemetry, and sanitize the data so it can be fed into the model as a baseline in vector databases, so that the model can respond seamlessly by looking at the semantics of your data. Next is training your teams, upskilling the ops team on AI-driven workflows. Once you have a baseline ready, the next step is to pilot: start small with one service or one failure mode, and then expand gradually. Minimize disruption with well-planned change management and smooth integration, and keep measuring gains and refining automation with continuous monitoring. These steps help platform-as-a-service vendors embed LLM observability without derailing existing operations.
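To make the "baseline in a vector database" part of those guidelines concrete, here is a minimal sketch that embeds sanitized telemetry snippets and retrieves the most semantically similar baselines to ground the model's answer. The toy hashing embedder and in-memory store are placeholders for a real embedding model and vector store; everything in it is illustrative.

```python
# Minimal sketch of the "baseline in a vector database" idea from the guidelines:
# store sanitized telemetry snippets as vectors and retrieve the most similar
# baselines to ground the LLM's answer. A toy hashing embedder and an in-memory
# list stand in for a real embedding model and vector store (illustrative only).
import hashlib
import numpy as np

DIM = 64

def embed(text: str) -> np.ndarray:
    """Toy embedding: hash each token into a fixed-size vector
    (placeholder for a real embedding model)."""
    vec = np.zeros(DIM)
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class BaselineStore:
    """In-memory stand-in for a vector database of healthy-state baselines."""
    def __init__(self):
        self.items: list[tuple[str, np.ndarray]] = []

    def add(self, snippet: str) -> None:
        self.items.append((snippet, embed(snippet)))

    def nearest(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        scored = sorted(self.items, key=lambda it: float(np.dot(q, it[1])), reverse=True)
        return [snippet for snippet, _ in scored[:k]]

if __name__ == "__main__":
    store = BaselineStore()
    store.add("checkout service p99 latency 180ms error rate 0.1% during weekday peak")
    store.add("payments service memory 700MB steady no restarts")
    store.add("scheduler queue depth below 50 messages")
    context = store.nearest("checkout latency spiking to 2s errors rising")
    print(context)   # these baselines would be prepended to the LLM prompt
```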
To wrap up: LLMs are not just improving observability, they are redefining resilience at a larger scale. By embedding these capabilities directly into platform-as-a-service offerings, we can dramatically cut downtime, reduce operational burden, and deliver reliable, self-healing platforms to every tenant. Thank you for joining me today. I would love to continue the conversation, so feel free to connect with me on LinkedIn if you are exploring or implementing LLM-driven observability. Thank you.

Srinivas Pagadala Sekar

Senior Site Reliability Manager @ Karsun Solutions

Srinivas Pagadala Sekar's LinkedIn account


