Transcript
Hi everyone.
Welcome to Conf42 Observability 2025.
My name is VMs Kris and I work as a senior software engineer at
Walmart, where I primarily work in infrastructure engineering.
And today I'll be presenting on cloud native observability,
building visible, resilient, and scalable enterprise systems.
So this talk dives into how advanced monitoring and analytics transform enterprise systems in modern cloud native architectures.
So this is a topic which is close to my work on AI-driven API resilience.
So over the next 30 minutes, we'll explore the shift from traditional monitoring to observability, real-world metrics, the pillars of observability, and practical implementation strategies.
So let's get started.
So what does it take to go from monitoring to observability? Let's begin with the fundamental comparison.
Traditional monitoring focuses on predefined metrics.
So think about CPU usage or request counts, which have limited context. It collects and displays data from specific sources, and it often creates silos where teams see disconnected components, leading to reactive fixes after users are impacted. It reacts to predefined alerts or thresholds, notifying teams when something is amiss.
In contrast to this, we have cloud native observability.
It gives us a holistic view through metrics, logs, traces, unifying
visibility across microservices.
This enables proactive detection, so just imagine catching a latency spike just before it affects users, and faster root cause analysis, which is critical in distributed systems across the industry with this kind of high-volume traffic.
So what is the relation between observability and monitoring?
So observability can be seen as an extension of monitoring, where monitoring provides the data and observability provides deeper insights and understanding. Monitoring is a necessary component of observability, but observability goes beyond the scope of traditional monitoring practices and enables proactive investigation and resolution of issues by analyzing system behavior and identifying root causes.
In essence, monitoring tells you when something is wrong, while observability tells you why and how to fix it.
So now let's see some real world implementation metrics.
So these are some tangible results.
Practical implementation has delivered 40% faster resolution time for incidents, down from hours to minutes, thanks to correlated data. Studies show that 25% resource efficiency is achieved by optimizing infrastructure utilization, directly contributing to a 35% database load reduction.
And what about the detection speed?
It's around 200 milliseconds, which is near real-time identification, and it supports the 58% faster incident detection goal.
These metrics come from production systems handling complex scenarios
during Thanksgiving, Black Friday, and holiday peak traffic times. These stats rely on tools such as Dynatrace and GCP Stackdriver.
So moving on.
Next we have a few pillars of observability. So let's observe this triangle. Every small triangle is important to get the details from an atomic level to a holistic point of view of an application.
So observability rests on these three pillars: metrics, logs, and traces. What are metrics? So metrics are quantifiable measurements, think about latency or error rates that are tracked in real time. These are your system's vital signs. Logs are timestamped records of events and errors, offering a detailed audit trail. For instance, in a microservice failure, logs pinpoint the exact state change and also the exact line of code which went wrong.
And moving on to traces, these map the end-to-end request path, revealing dependencies and bottlenecks. So in a distributed setup, a single request might span around 10 services, and these traces help us see the full picture. In API resiliency work, Prometheus is used for metrics, ELK for logs, and Jaeger for traces. This triad enables 85% faster issue resolution, which is a key metric for enterprise scalability.
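As a minimal sketch, assuming the Prometheus Python client and illustrative metric names, exposing these vital signs from a service could look something like this:

import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Counter and histogram act as the service's vital signs: traffic and latency.
REQUESTS = Counter("api_requests_total", "Total API requests", ["method", "status"])
LATENCY = Histogram("api_request_latency_seconds", "Request latency in seconds")

def handle_request():
    with LATENCY.time():                          # record duration into the histogram
        time.sleep(random.uniform(0.01, 0.05))    # stand-in for real work
    REQUESTS.labels(method="GET", status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)   # Prometheus scrapes http://localhost:8000/metrics
    while True:
        handle_request()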
And now let's see how observability behaves in stateless versus stateful services. We need to differentiate between observability in these two kinds of services.
So what are stateless services? Think about load-balanced APIs, where we focus on throughput, latency, and resource use. So we monitor these with horizontal scaling in mind. The benefits here are simplified scaling and easy troubleshooting of individual issues.
Now, coming to stateful services like databases, these require tracking persistence metrics like replication health, consistency, and storage performance. For example, in Google Cloud Spanner, we ensure that data is available across multiple regions. The benefits of this are that users experience continuity in their usage, rich personalization, and potentially more complex features.
Coming to observability strategies, they may vary notably between stateless and stateful services due to their differing approaches to state management. Stateless services, which are easier to scale, demand a greater focus on tracing individual requests and analyzing errors, in contrast to stateful services, which require a broader view that captures persistent state and ensures session continuity.
And now, the shared patterns include health probes and anomaly detection. Implementing these to balance stateless API performance with stateful database reliability supports a 25% latency reduction.
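As a minimal sketch, assuming a Flask service and a hypothetical database check, the shared health probe pattern might look like this:

from flask import Flask, jsonify

app = Flask(__name__)

def check_database_connection():
    # Placeholder: replace with a real ping; a stateful service would also
    # verify replication lag, consistency, and storage health here.
    return True

@app.route("/healthz")   # liveness probe: is the process up at all?
def liveness():
    return jsonify(status="ok"), 200

@app.route("/readyz")    # readiness probe: can we actually serve traffic?
def readiness():
    if check_database_connection():
        return jsonify(status="ready"), 200
    return jsonify(status="degraded"), 503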
So this dual approach is critical for hybrid cloud native systems,
and then comes the distributed tracing implementation.
So what is this one?
So distributed tracing is a cornerstone of observability.
First, instrument your code: add trace context propagation using libraries like OpenTelemetry across service calls.
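As a minimal sketch using the OpenTelemetry Python SDK (the service name, URL, and console exporter are illustrative; in practice you would export to a collector or Jaeger), instrumentation and context propagation look roughly like this:

import requests
from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Register a tracer provider that batches spans out to an exporter.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

def call_inventory(url="http://inventory:8080/stock"):
    with tracer.start_as_current_span("call-inventory"):
        headers = {}
        inject(headers)   # adds the W3C traceparent header so the next service continues the trace
        return requests.get(url, headers=headers)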
Next, configure intelligent sampling. Capture a hundred percent of critical paths, but just as a side note, do not sample all of the traffic; sample only the data you need, say one in a thousand requests.
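As a small sketch of that sampling decision (the 1-in-1,000 ratio simply mirrors the example above), the OpenTelemetry SDK lets you combine a ratio sampler with parent-based sampling so already-sampled critical paths stay complete:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly one in a thousand new traces, but always honor the parent's
# decision so spans belonging to an already-sampled request are not dropped.
sampler = ParentBased(root=TraceIdRatioBased(1 / 1000))
provider = TracerProvider(sampler=sampler)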
We need to include both normal and exceptional patterns, and then centralize the pipeline with scalable infrastructure. Think about something like Kafka or GCP Pub/Sub to ingest all of this data. This processes millions of spans daily.
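As a rough sketch, assuming the kafka-python client, a hypothetical broker address, and a made-up topic name, feeding spans into a central pipeline could look like this:

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",                      # hypothetical broker address
    value_serializer=lambda v: json.dumps(v).encode(),   # spans serialized as JSON
)

def publish_span(span_dict):
    # Fan spans into one topic; a downstream consumer or collector writes
    # them to the tracing backend (for example, Jaeger).
    producer.send("observability.spans", span_dict)

publish_span({"trace_id": "abc123", "name": "checkout", "duration_ms": 42})
producer.flush()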
And finally, perform the root cause analysis. Visualize traces in Jaeger to spot bottlenecks, because this ties directly to the 85% faster resolution metric from industry-level studies.
So another key topic is API Gateway observability.
So what is that?
So API Gateway observability refers to the ability to monitor, measure,
and gain actionable insights into the behavior, performance, and health of
API traffic as it flows through an API gateway. It helps ensure that the APIs are reliable, secure, and performant.
But now, why exactly does API gateway observability matter? The gateway acts as the entry point for microservices in distributed systems; it is where requests enter from the external world into the internal world. So API gateways are observability hubs where we can track traffic insights, request volumes, service dependencies, and rate limiting effectiveness.
These traffic volumes could vary from a few hundred requests per second to tens of thousands of requests per second during peak times. And the gateway provides a centralized control plane for routing, authentication, rate limiting, and caching.
So what does observability enable in API gateways? It gives us real-time visibility into API traffic, issue detection, whether that is latency, errors, or anomalies, auditing and compliance, and also security monitoring. When I say security monitoring, that is key: it monitors authentication events and attack patterns, which we can hand over to the InfoSec team, saying, hey, these are the attacks coming in.
There are a few other performance metrics like latency and throughput, which are very critical. A 25% latency reduction could be achieved by optimizing gateway routing, which is a direct application of this particular slide. Then, API gateways with native observability: what is that, and how can we achieve it?
So we can do that with Azure API Management plus Application Insights, Apigee plus Stackdriver, or Kong, NGINX, or Envoy with Prometheus and Grafana. These are the tools, and there are other tools like GCP API Gateway with integrated monitoring dashboards that make this actionable too.
So are there any gateway specific challenges you are facing?
Let's see.
Yeah, so best practices include enabling structured access logs with correlation IDs; these IDs help us monitor end to end what happened for that particular request. Also use rate limiting and circuit breakers with metrics exposure, instrument your API gateway with tracing headers, which include a request ID and also the traceparent, monitor latency percentiles such as P50, P95, and P99 for SLO enforcement, and track end-to-end transaction paths using distributed tracing.
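As a rough illustration of a couple of these practices (assuming a Flask-fronted gateway and hypothetical metric names), correlation IDs and latency histograms for P50/P95/P99 can be wired in like this:

import time
import uuid

from flask import Flask, g, request
from prometheus_client import Histogram

app = Flask(__name__)

# Histogram buckets let the metrics backend derive P50/P95/P99 for SLO checks.
REQUEST_LATENCY = Histogram(
    "gateway_request_latency_seconds",
    "Latency of requests passing through the gateway",
    ["route"],
)

@app.before_request
def start_timer():
    g.start = time.perf_counter()
    # Reuse an incoming correlation ID if present, otherwise mint a new one.
    g.correlation_id = request.headers.get("X-Correlation-ID", str(uuid.uuid4()))

@app.after_request
def record_request(response):
    REQUEST_LATENCY.labels(route=request.path).observe(time.perf_counter() - g.start)
    response.headers["X-Correlation-ID"] = g.correlation_id   # propagate for end-to-end log correlation
    return response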
So moving on.
Next we have observability pipeline design. A robust observability pipeline has four stages. Collection gathers telemetry from services and infrastructure; think of agents on each and every node in this case. Processing does normalization and filtering of the data: stream processors are used to reduce noise and keep only critical signals, for a detection speed of a few hundred milliseconds, as an example. So this is a kind of filtering out additional noise to pull out the exact data we are looking for.
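As a small sketch (the event shape and the 500 ms threshold are illustrative), a stream-processing filter that keeps only the critical signals could be as simple as:

def filter_telemetry(events):
    """Yield only the signals worth storing: errors, slow requests, and sampled traces."""
    for event in events:
        if event.get("level") == "error":
            yield event                              # always keep errors
        elif event.get("latency_ms", 0) > 500:
            yield event                              # keep slow requests (illustrative threshold)
        elif event.get("sampled") is True:
            yield event                              # keep whatever the trace sampler selected
        # everything else is treated as noise and dropped

# Usage: sits between the collectors and the storage tier.
critical = list(filter_telemetry([
    {"level": "info", "latency_ms": 20},
    {"level": "error", "latency_ms": 35},
    {"level": "info", "latency_ms": 900},
]))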
And then storage uses tiered strategies, so hot storage for recent data and cold storage for archives, which helps us in balancing cost and access, aligning with the 25% resource efficiency. Now, visualization turns data into dashboards; there are many tools like Grafana to spot anomalies in real time, supporting 40% faster resolution. These pipelines become the backbone of our system. And now we have high availability strategies.
So what are high availability strategies?
High availability ensures minimal service disruption and maximum uptime.
From an observability perspective, the ability to detect, understand, and respond to failures is critical to achieving true resilience and scalability. High availability is non-negotiable in a distributed system, so for that we deploy collectors and storage across multiple availability zones to avoid single points of failure.
So what about resiliency and failover visibility? We need to monitor at the node level, zone level, and even regional level for resiliency, track failover events, measure mean time to recovery, which is termed MTTR in the industry, and then alert on degraded clusters or failed availability zones.
And now, how do we monitor cross-zone or cross-region observability for multi-cloud high availability? The goal is to correlate metrics, logs, and traces across the clouds, detect latency between regions, and audit cross-cloud failover success and data sync integrity. Design for graceful degradation, which helps by reducing telemetry if any backend fails while still ensuring core functionality, and test this during your peak loads to get all the corner cases. But that might be, again, our own individual case: if your system behaves well during lower loads and you get all the common cases, then yes, of course we can go ahead with that.
And then we need local buffering, which temporarily stores metrics when any connection drops, and multi-channel alerting, ensuring teams get notifications via Slack, email, and also PagerDuty alerts.
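As a minimal sketch (the export call is a placeholder for whatever push mechanism you use), local buffering can be a bounded in-memory queue that drains once the connection comes back:

from collections import deque

class MetricBuffer:
    """Hold metric samples in memory while the export path is down, then drain."""

    def __init__(self, max_size=10_000):
        self._buffer = deque(maxlen=max_size)   # oldest samples are dropped on overflow

    def record(self, sample, export):
        self._buffer.append(sample)
        self.flush(export)

    def flush(self, export):
        while self._buffer:
            try:
                export(self._buffer[0])         # e.g. push to the collector endpoint
            except ConnectionError:
                return                          # connection still down; keep buffering
            self._buffer.popleft()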
So these strategies underpin a 50% downtime reduction, which is critical for enterprise resilience.
And now, how do we do capacity planning, and what are the optimization steps we need to perform? Capacity planning starts with historical patterns: aggregate all the metrics over months to understand utilization. Ideal data in this case spans over six months of API traffic. Then forecast the demand with machine learning and predict seasonal spikes like Thanksgiving, Black Friday, or holiday traffic, which helps us optimize resource allocation. Then optimize dynamically using auto-scaling, balancing cost and performance, which helps in continuous improvement of refining these models, aligning with approximately 25% resource efficiency.
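As a toy illustration (a seasonal-naive baseline standing in for the machine learning forecast; the numbers are synthetic), projecting next-day demand from historical hourly request rates might look like this:

import numpy as np

def seasonal_naive_forecast(hourly_rps, season=24 * 7, horizon=24):
    """Forecast the next `horizon` hours by averaging the same hour across past weeks."""
    history = np.asarray(hourly_rps, dtype=float)
    assert len(history) >= season, "need at least one full season of history"
    weeks = history[-(len(history) // season) * season:].reshape(-1, season)
    profile = weeks.mean(axis=0)                 # average weekly traffic shape
    start = len(history) % season
    return [profile[(start + h) % season] for h in range(horizon)]

# Usage: feed roughly six months of hourly request rates, then size capacity
# with extra headroom for known spikes like Thanksgiving and Black Friday.
forecast = seasonal_naive_forecast(np.random.poisson(800, size=24 * 7 * 26))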
And now, for continuous improvement, we need to establish feedback loops to regularly compare forecasted projections against actual consumption patterns, refining prediction models and allocation strategies.
And now, what about the implementation roadmap? So let's outline this roadmap. Start with the foundational build: set up metrics with Prometheus and centralized logging with ELK.
Define your SLIs and SLOs for core services.
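As a hypothetical example (the service name and targets are made up), an SLI/SLO definition plus a simple error-budget check might look like:

# Hypothetical SLO: 99.9% of checkout-api requests succeed over a 30-day window.
SLO = {"service": "checkout-api", "sli": "request_success_ratio",
       "target": 0.999, "window_days": 30}

def error_budget_remaining(total_requests, failed_requests, target=SLO["target"]):
    """Return the fraction of the error budget still unspent (negative means the SLO is blown)."""
    allowed_failures = total_requests * (1 - target)
    return 1 - (failed_requests / allowed_failures) if allowed_failures else 0.0

# Example: 10,000,000 requests with 4,000 failures leaves 60% of the budget.
print(error_budget_remaining(10_000_000, 4_000))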
Then move on to the advanced capabilities: deploy Jaeger for tracing across microservices and build custom dashboards for teams. In the optimization phase, add automated detection with AI and automated remediation for common issues, which builds toward the 58% faster incident detection. As per industry research, this phased approach has guided a 40% resolution improvement. Plan to start with one service and scale. Always remember: start small and scale. This will reduce the impact that any new changes bring in.
And that wraps up our journey into cloud native observability. We have covered the shift from monitoring, real-world metrics, the pillars, and implementation strategies. Thank you all for your attention, and I hope everybody enjoyed the session today.
Thanks a lot.