Transcript
Hi everyone.
Welcome to Conf42 Observability 2025.
My name is VMs Kris and I work as a senior software engineer at
Walmart, where I primarily work in infrastructure engineering.
And today I'll be presenting on cloud native observability,
building visible, resilient, and scalable enterprise systems.
So this talk dives into how advanced monitoring and analytics transform enterprise systems in modern cloud native architectures.
So this is a topic which is close to my work on AI-driven API resilience.
So over the next 30 minutes, we'll explore the shift from traditional monitoring to observability, real-world metrics, the pillars of observability, and practical implementation strategies.
So let's get started.
So what does it take to go from monitoring to observability? Let's begin with the fundamental comparison.
Traditional monitoring focuses on predefined metrics.
So think about CPU usage or request counts, which have limited context. It collects and displays data from specific sources, and it often creates silos where teams see disconnected components, leading to reactive fixes after users are impacted. It reacts to predefined alerts or thresholds, notifying teams when something is amiss.
In contrast to this, we have cloud native observability.
It gives us a holistic view through metrics, logs, traces, unifying
visibility across microservices.
This enables proactive detection, so just imagine catching a latency spike just before it affects users, and faster root cause analysis, which is critical in distributed systems across the industry with this kind of high-volume traffic.
So what is the relation between observability and monitoring?
So observability can be seen as an extension of monitoring, where monitoring provides the data and observability provides deeper insights and understanding. Monitoring is a necessary component of observability, but observability goes beyond the scope of traditional monitoring practices and enables proactive investigation and resolution of issues by analyzing system behavior and identifying root causes.
In essence, monitoring tells you when something is wrong, while observability tells you why and how to fix it.
So now let's see some real world implementation metrics.
So these are some tangible results.
Practical implementation has delivered 40% faster resolution time for incidents, down from hours to minutes, thanks to correlated data. Studies show that 25% resource efficiency is achieved by optimizing infrastructure utilization, directly contributing to a 35% database load reduction.
And what about the detection speed?
It's around 200 milliseconds, which is near real-time identification, and it supports the 58% faster incident detection goal.
These metrics come from production systems handling complex scenarios
during Thanksgiving, Black Friday, and holiday peak traffic times. These stats rely on tools such as Dynatrace and GCP Stackdriver.
So moving on.
Next we have a few pillars of observability. So let's observe this triangle. Every small triangle is important to get the details from an atomic level to a holistic point of view of an application.
So observability rests on these three pillars: metrics, logs, and traces. What are metrics? So metrics are quantifiable measurements, think about latency or error rates that are tracked in real time. These are your system's vital signs. Logs are timestamped records of events and errors, offering a detailed audit trail. For instance, in a microservice failure, logs pinpoint the exact state change and also the exact line of code which went wrong.
And moving on to traces, these map the end-to-end request path, revealing dependencies and bottlenecks. So in a distributed setup, a single request might span around 10 services, and these traces help us see the full picture. In API resiliency work, Prometheus is used for metrics, ELK for logs, and Jaeger for traces. This triad enables 85% faster issue resolution, which is a key metric for enterprise scalability.
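As a minimal sketch, assuming the Prometheus Python client and illustrative metric names, exposing these vital signs from a service could look something like this:

import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Counter and histogram act as the service's vital signs: traffic and latency.
REQUESTS = Counter("api_requests_total", "Total API requests", ["method", "status"])
LATENCY = Histogram("api_request_latency_seconds", "Request latency in seconds")

def handle_request():
    with LATENCY.time():                          # record duration into the histogram
        time.sleep(random.uniform(0.01, 0.05))    # stand-in for real work
    REQUESTS.labels(method="GET", status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)   # Prometheus scrapes http://localhost:8000/metrics
    while True:
        handle_request()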
And now let's see how observability behaves in stateless versus stateful services. We need to differentiate between observability in these two kinds of services.
So what are stateless services? Think about load-balanced APIs, where we focus on throughput, latency, and resource use. So we monitor these with horizontal scaling in mind. The benefits here are simplified scaling and easy troubleshooting of individual issues.
Now, coming to stateful services like databases, these require tracking persistence metrics like replication health, consistency, and storage performance. For example, in Google Cloud Spanner, we ensure that data is available across multiple regions. The benefits of this are that users experience continuity in their usage, rich personalization, and potentially more complex features.
Coming to observability strategies, they may vary notably between stateless and stateful services due to their differing approaches to state management. Stateless services, which are easier to scale, demand a greater focus on tracing individual requests and analyzing errors, in contrast to stateful services, which require a broader view that captures persistent state and ensures session continuity.
And now, the shared patterns include health probes and anomaly detection. Implementing these to balance stateless API performance with stateful database reliability supports a 25% latency reduction.
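As a minimal sketch, assuming a Flask service and a hypothetical database check, the shared health probe pattern might look like this:

from flask import Flask, jsonify

app = Flask(__name__)

def check_database_connection():
    # Placeholder: replace with a real ping; a stateful service would also
    # verify replication lag, consistency, and storage health here.
    return True

@app.route("/healthz")   # liveness probe: is the process up at all?
def liveness():
    return jsonify(status="ok"), 200

@app.route("/readyz")    # readiness probe: can we actually serve traffic?
def readiness():
    if check_database_connection():
        return jsonify(status="ready"), 200
    return jsonify(status="degraded"), 503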
So this dual approach is critical for hybrid cloud native systems,
and then comes the distributed tracing implementation.
So what is this one?
So distributed tracing is a cornerstone of observability.
First, instrument your code: add trace context propagation using libraries like OpenTelemetry across service calls.
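As a minimal sketch using the OpenTelemetry Python SDK (the service name, URL, and console exporter are illustrative; in practice you would export to a collector or Jaeger), instrumentation and context propagation look roughly like this:

import requests
from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Register a tracer provider that batches spans out to an exporter.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

def call_inventory(url="http://inventory:8080/stock"):
    with tracer.start_as_current_span("call-inventory"):
        headers = {}
        inject(headers)   # adds the W3C traceparent header so the next service continues the trace
        return requests.get(url, headers=headers)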
Next, configure intelligent sampling. Capture a hundred percent of critical paths, but just as a side note, do not sample all of the traffic; sample only the data you need, say one in a thousand requests.
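As a small sketch of that sampling decision (the 1-in-1,000 ratio simply mirrors the example above), the OpenTelemetry SDK lets you combine a ratio sampler with parent-based sampling so already-sampled critical paths stay complete:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly one in a thousand new traces, but always honor the parent's
# decision so spans belonging to an already-sampled request are not dropped.
sampler = ParentBased(root=TraceIdRatioBased(1 / 1000))
provider = TracerProvider(sampler=sampler)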
We need to include both normal and exceptional patterns, and then centralize the pipeline with scalable infrastructure. Think about something like Kafka or GCP Pub/Sub to ingest all of this data. This processes millions of spans daily.
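As a rough sketch, assuming the kafka-python client, a hypothetical broker address, and a made-up topic name, feeding spans into a central pipeline could look like this:

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",                      # hypothetical broker address
    value_serializer=lambda v: json.dumps(v).encode(),   # spans serialized as JSON
)

def publish_span(span_dict):
    # Fan spans into one topic; a downstream consumer or collector writes
    # them to the tracing backend (for example, Jaeger).
    producer.send("observability.spans", span_dict)

publish_span({"trace_id": "abc123", "name": "checkout", "duration_ms": 42})
producer.flush()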
And finally, perform the root cause analysis. Visualize traces in Jaeger to spot bottlenecks, because this ties directly to the 85% faster resolution metric from industry-level studies.
So another key topic is API Gateway observability.
So what is that?
So API Gateway observability refers to the ability to monitor, measure,
and gain actionable insights into the behavior, performance, and health of
API traffic as it flows through an API gateway. It helps ensure that the APIs are reliable, secure, and performant.
But now, why exactly does API gateway observability matter? The gateway acts as the entry point for microservices in distributed systems; it is where requests enter from the external world into the internal world. So API gateways are observability hubs where we can track traffic insights, request volumes, service dependencies, and rate limiting effectiveness.
These traffic volumes could vary from a few hundred requests per second to tens of thousands of requests per second during peak times. And the gateway provides a centralized control plane for routing, authentication, rate limiting, and caching.
So what does observability enable in API gateways? It gives us real-time visibility into API traffic, issue detection, whether that is latency, errors, or anomalies, auditing and compliance, and also security monitoring. When I say security monitoring, that is key: it monitors authentication events and attack patterns, which we can hand over to the InfoSec team, saying, hey, these are the attacks coming in.
There are a few other performance metrics like latency and throughput, which are very critical. A 25% latency reduction could be achieved by optimizing gateway routing, which is a direct application of this particular slide. Then, API gateways with native observability: what is that, and how can we achieve it?
So we can do that with Azure API Management plus Application Insights, Apigee plus Stackdriver, or Kong, NGINX, or Envoy with Prometheus and Grafana. These are the tools, and there are other tools like GCP API Gateway with integrated monitoring dashboards that make this actionable too.
So are there any gateway specific challenges you are facing?
Let's see.
Yeah, so best practices include enabling structured access logs with correlation IDs; these IDs help us monitor end to end what happened for that particular request. Also use rate limiting and circuit breakers with metrics exposure, instrument your API gateway with tracing headers, which include a request ID and also the traceparent, monitor latency percentiles such as P50, P95, and P99 for SLO enforcement, and track end-to-end transaction paths using distributed tracing.
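As a rough illustration of a couple of these practices (assuming a Flask-fronted gateway and hypothetical metric names), correlation IDs and latency histograms for P50/P95/P99 can be wired in like this:

import time
import uuid

from flask import Flask, g, request
from prometheus_client import Histogram

app = Flask(__name__)

# Histogram buckets let the metrics backend derive P50/P95/P99 for SLO checks.
REQUEST_LATENCY = Histogram(
    "gateway_request_latency_seconds",
    "Latency of requests passing through the gateway",
    ["route"],
)

@app.before_request
def start_timer():
    g.start = time.perf_counter()
    # Reuse an incoming correlation ID if present, otherwise mint a new one.
    g.correlation_id = request.headers.get("X-Correlation-ID", str(uuid.uuid4()))

@app.after_request
def record_request(response):
    REQUEST_LATENCY.labels(route=request.path).observe(time.perf_counter() - g.start)
    response.headers["X-Correlation-ID"] = g.correlation_id   # propagate for end-to-end log correlation
    return response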
So moving on.
Next we have observability pipeline design. A robust observability pipeline has four stages. Collection gathers telemetry from services and infrastructure; think of agents on each and every node in this case. Processing does normalization and filtering of the data: stream processors are used to reduce noise and keep only critical signals, for a detection speed of a few hundred milliseconds, as an example. So this is a kind of filtering out additional noise to pull out the exact data we are looking for.
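As a small sketch (the event shape and the 500 ms threshold are illustrative), a stream-processing filter that keeps only the critical signals could be as simple as:

def filter_telemetry(events):
    """Yield only the signals worth storing: errors, slow requests, and sampled traces."""
    for event in events:
        if event.get("level") == "error":
            yield event                              # always keep errors
        elif event.get("latency_ms", 0) > 500:
            yield event                              # keep slow requests (illustrative threshold)
        elif event.get("sampled") is True:
            yield event                              # keep whatever the trace sampler selected
        # everything else is treated as noise and dropped

# Usage: sits between the collectors and the storage tier.
critical = list(filter_telemetry([
    {"level": "info", "latency_ms": 20},
    {"level": "error", "latency_ms": 35},
    {"level": "info", "latency_ms": 900},
]))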
And then storage uses tiered strategies, so hot storage for recent data and cold storage for archives, which helps us in balancing cost and access, aligning with the 25% resource efficiency. Now, visualization turns data into dashboards; there are many tools like Grafana to spot anomalies in real time, supporting 40% faster resolution. These pipelines become the backbone of our system. And now we have high availability strategies.
So what are high availability strategies?
High availability ensures minimal service disruption and maximum uptime.
From an observability perspective, the ability to detect, understand, and respond to failures is critical to achieving true resilience and scalability. High availability is non-negotiable in a distributed system, so for that we deploy collectors and storage across multiple availability zones to avoid single points of failure.
So what about resiliency and failover visibility? We need to monitor at the node level, zone level, and even regional level for resiliency, track failover events, measure mean time to recovery, which is termed MTTR in the industry, and then alert on degraded clusters or failed availability zones.
And now, how do we monitor cross-zone or cross-region observability for multi-cloud high availability? The goal is to correlate metrics, logs, and traces across the clouds, detect latency between regions, and audit cross-cloud failover success and data sync integrity. Design for graceful degradation, which helps by reducing telemetry if any backend fails while still ensuring core functionality, and test this during your peak loads to get all the corner cases. But that might be, again, our own individual case: if your system behaves well during lower loads and you get all the common cases, then yes, of course we can go ahead with that.
And then we need local buffering, which temporarily stores metrics when any connection drops, and multi-channel alerting, ensuring teams get notifications via Slack, email, and also PagerDuty alerts.
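As a minimal sketch (the export call is a placeholder for whatever push mechanism you use), local buffering can be a bounded in-memory queue that drains once the connection comes back:

from collections import deque

class MetricBuffer:
    """Hold metric samples in memory while the export path is down, then drain."""

    def __init__(self, max_size=10_000):
        self._buffer = deque(maxlen=max_size)   # oldest samples are dropped on overflow

    def record(self, sample, export):
        self._buffer.append(sample)
        self.flush(export)

    def flush(self, export):
        while self._buffer:
            try:
                export(self._buffer[0])         # e.g. push to the collector endpoint
            except ConnectionError:
                return                          # connection still down; keep buffering
            self._buffer.popleft()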
So these strategies underpin a 50% downtime reduction, which is critical for enterprise resilience.
And now, how do we do capacity planning, and what are the optimization steps we need to perform? Capacity planning starts with historical patterns: aggregate all the metrics over months to understand utilization. Ideal data in this case spans over six months of API traffic. Then forecast the demand with machine learning and predict seasonal spikes like Thanksgiving, Black Friday, or holiday traffic, which helps us optimize resource allocation. Then optimize dynamically using auto-scaling, balancing cost and performance, which helps in continuous improvement of refining these models, aligning with approximately 25% resource efficiency.
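As a toy illustration (a seasonal-naive baseline standing in for the machine learning forecast; the numbers are synthetic), projecting next-day demand from historical hourly request rates might look like this:

import numpy as np

def seasonal_naive_forecast(hourly_rps, season=24 * 7, horizon=24):
    """Forecast the next `horizon` hours by averaging the same hour across past weeks."""
    history = np.asarray(hourly_rps, dtype=float)
    assert len(history) >= season, "need at least one full season of history"
    weeks = history[-(len(history) // season) * season:].reshape(-1, season)
    profile = weeks.mean(axis=0)                 # average weekly traffic shape
    start = len(history) % season
    return [profile[(start + h) % season] for h in range(horizon)]

# Usage: feed roughly six months of hourly request rates, then size capacity
# with extra headroom for known spikes like Thanksgiving and Black Friday.
forecast = seasonal_naive_forecast(np.random.poisson(800, size=24 * 7 * 26))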
And now, for continuous improvement, we need to establish feedback loops to regularly compare forecasted projections against actual consumption patterns, refining prediction models and allocation strategies.
And now, what about the implementation roadmap? So let's outline this roadmap. Start with the foundational build: set up metrics with Prometheus and centralized logging with ELK.
Define your SLIs and SLOs for core services.
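As a hypothetical example (the service name and targets are made up), an SLI/SLO definition plus a simple error-budget check might look like:

# Hypothetical SLO: 99.9% of checkout-api requests succeed over a 30-day window.
SLO = {"service": "checkout-api", "sli": "request_success_ratio",
       "target": 0.999, "window_days": 30}

def error_budget_remaining(total_requests, failed_requests, target=SLO["target"]):
    """Return the fraction of the error budget still unspent (negative means the SLO is blown)."""
    allowed_failures = total_requests * (1 - target)
    return 1 - (failed_requests / allowed_failures) if allowed_failures else 0.0

# Example: 10,000,000 requests with 4,000 failures leaves 60% of the budget.
print(error_budget_remaining(10_000_000, 4_000))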
Then move on to the advanced capabilities: deploy Jaeger for tracing across microservices and build custom dashboards for teams. In the optimization phase, add automated detection with AI and automated remediation for common issues, which builds toward the 58% faster incident detection. As per industry research, this phased approach has guided a 40% resolution improvement. Plan to start with one service and scale. Always remember: start small and scale. This will reduce the impact that any new changes bring in.
And that wraps up our journey into cloud native observability. We have covered the shift from monitoring, real-world metrics, the pillars, and implementation strategies. Thank you all for your attention, and I hope everybody enjoyed the session today.
Thanks a lot.