Transcript
This transcript was autogenerated. To make changes, submit a PR.
Good morning.
Good afternoon, everyone.
This is Man Kashkar.
I'm a principal architect at ThoughtWorks India.
I'm really excited to be here to walk you through how organizations can navigate today's complex digital landscape using intelligent full-stack observability.
As systems scale and expectations rise, observability is no longer just a DevOps concern; it is a business imperative.
Let's understand the cost of poor observability.
We often hear that visibility is everything, but in reality, most
enterprises only see a fraction of their landscape despite having multiple tools.
As per the State of Observability report, 89% of organizations admit they still lack complete visibility into their systems.
Tool sprawl, long MTTR, and exploding telemetry volumes are all driving operational inefficiencies and rising costs.
The question is no longer whether to invest in observability, but how to do it smartly.
Hence, I ask everyone here to understand why it is important to have your systems observable, and to treat observability as a first-class citizen every single time.
This isn't just about collecting logs or metrics.
It is about understanding the why behind your system's behavior in
real time, and more importantly, predicting what it could evolve into.
This is a shift from monitoring to insight driven decision making.
Think of observability as a layered strategy.
It spans applications, infrastructure, business, and even user-experience domains.
The MELT stack, metrics, events, logs, and traces, acts as the foundation for building a strong and robust observability framework, but the real impact happens when you correlate these pillars across domains in real time.
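To make that correlation concrete, here is a minimal, hypothetical sketch in Python: every log line, metric sample, and event is stamped with the same correlation id so the pillars can be joined downstream. The field names and the record function are illustrative assumptions, not a specific product's schema.

```python
# A minimal, hypothetical sketch of correlating the MELT pillars: every
# record carries the same correlation id so logs, metrics, and events for
# one request can be joined later. Field names here are illustrative only.
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("telemetry")

def record(kind: str, name: str, correlation_id: str, **fields) -> None:
    """Emit one structured telemetry record as a JSON log line."""
    logger.info(json.dumps({
        "ts": time.time(),
        "kind": kind,                 # "metric", "event", "log", or "trace"
        "name": name,
        "correlation_id": correlation_id,
        **fields,
    }))

cid = str(uuid.uuid4())               # one id per user request
record("event", "checkout.started", cid)
record("metric", "checkout.latency_ms", cid, value=412)
record("log", "checkout.failed", cid, error="db lock timeout")
```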
Let's also understand observability maturity, as you see here.
This slide reflects a critical shift in how organizations evolve
their observability approach.
Early maturity stages often start with siloed monitoring.
Teams track metrics and logs, but the insights are isolated, lacking context as well as domain information.
As maturity progresses, we see a convergence of telemetry with real business KPIs, tracking and monitoring uptime, revenue impact, customer churn rate, et cetera.
Full-stack observability represents the pinnacle: a model where infrastructure, applications, digital experience, and business context are all stitched together.
This maturity enables not just technical diagnostics, but also insights into the business impact of technical degradation.
For example, a 40-millisecond increase in response time could cause a 2% drop in conversion.
That is a concerning point for your business.
Let's look at this checkout journey as a case in point.
This is a real-world example of a customer journey on an e-commerce platform, a customer checking out products, and it emphasizes the importance of having dual lenses: the outside-in perspective and the inside-out perspective.
The outside-in view captures how the user experiences the application, whether you have slow pages, failed transactions, or high latency.
The inside-out perspective surfaces system-side telemetry: backend and service response times, database query lags and performance, memory spikes, or API gateway bottlenecks.
True observability connects these two views in real time.
For instance, a spike in the checkout flow may be directly tied to a database lock timeout or a failing third-party payment API.
Without that correlation, you are left guessing, and it will take a lot of time to identify the root cause.
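As a hedged illustration of the inside-out lens, here is a minimal sketch using the AWS X-Ray SDK for Python. The service name, the checkout function, and the two helper stubs are assumptions for this example; the idea is simply that each backend dependency becomes a subsegment on the same trace as the user-facing request.

```python
# A minimal sketch (hypothetical names) of tying backend telemetry to the
# user-facing checkout call with the AWS X-Ray SDK for Python.
from aws_xray_sdk.core import xray_recorder

xray_recorder.configure(service="checkout-service")  # placeholder service name

def reserve_inventory(cart_id: str) -> None:
    pass  # stand-in for a real database call

def charge_customer(cart_id: str) -> None:
    pass  # stand-in for a real third-party payment API call

def checkout(cart_id: str) -> None:
    # Outside of Lambda or web-framework middleware, open the segment here.
    with xray_recorder.in_segment("checkout"):
        # Each dependency becomes a subsegment, so a slow database query or
        # a failing payment call shows up on the same trace as the request.
        with xray_recorder.in_subsegment("db-reserve-inventory"):
            reserve_inventory(cart_id)
        with xray_recorder.in_subsegment("payment-provider"):
            charge_customer(cart_id)
```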
With that correlation in place, you enable precision recovery and business-aligned decision making, and AWS provides multiple services that can help you achieve full-stack observability.
AWS provides a robust observability toolkit; the value lies in how you stitch these services together.
CloudWatch helps you capture native metrics and logs across various compute systems.
X-Ray adds distributed tracing, essential for following a request across microservices.
CloudWatch Synthetics mimics user journeys to detect issues before customers do.
CloudWatch RUM, real user monitoring, provides frontend visibility: user latency and Core Web Vitals.
The opportunity here is integration, building composite dashboards that unify telemetry across the various layers.
For example, correlating CloudWatch alarms with X-Ray traces will help you isolate bottlenecks automatically.
This transforms raw signals into actionable insights and eventually reduces MTTR significantly.
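As a hedged sketch of that integration idea, the boto3 calls below create a backend latency alarm, a Synthetics canary alarm on the same journey, and a composite alarm that fires only when both layers degrade together. The load balancer name, canary name, and SNS topic ARN are placeholders, not values from this talk.

```python
# A hedged sketch of stitching two layers together with boto3: a backend
# latency alarm plus a Synthetics canary alarm, combined into one composite
# alarm so you are paged only when the user journey and the backend degrade
# at the same time. All names and ARNs are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Backend layer: ALB target response time (placeholder load balancer name).
cloudwatch.put_metric_alarm(
    AlarmName="checkout-backend-latency",
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/checkout-alb/0123456789abcdef"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=0.5,                      # seconds
    ComparisonOperator="GreaterThanThreshold",
)

# Experience layer: Synthetics canary success rate for the checkout journey.
cloudwatch.put_metric_alarm(
    AlarmName="checkout-canary-failures",
    Namespace="CloudWatchSynthetics",
    MetricName="SuccessPercent",
    Dimensions=[{"Name": "CanaryName", "Value": "checkout-journey"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=90,
    ComparisonOperator="LessThanThreshold",
)

# Composite alarm: page only when both layers are in ALARM together.
cloudwatch.put_composite_alarm(
    AlarmName="checkout-cross-layer",
    AlarmRule='ALARM("checkout-backend-latency") AND ALARM("checkout-canary-failures")',
    ActionsEnabled=True,
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```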
Let's also quickly understand how you can target AIOps and what that journey will look like.
With observability at scale comes the data deluge: terabytes of logs, thousands of metrics and alerts.
This is where AIOps steps in.
AIOps is a natural evolution, applying machine learning and automation to operational data to detect patterns, surface anomalies, and even auto-remediate issues.
Think of it as observability with intelligence.
We are seeing a shift from reactive alerts to predictive signals, spotting anomalies before users are impacted.
The journey to AIOps doesn't happen overnight.
It starts with a robust observability data layer, adds smart correlation, and gradually introduces automated decision making.
Now, let's understand the components an organization should focus on to achieve this.
AIOps operates across three strategic pillars.
The first is observe: ingest clean, structured telemetry across the MELT pillars.
The second is learn: apply ML services and AI models to detect outliers, understand seasonality, and predict failures.
The third is act: automate downstream workflows, alert suppression, ticketing, dynamic scaling, and even rollbacks.
This is where the value multiplies.
For example, if CPU usage spikes are consistently followed by database timeouts, AIOps will learn that relationship and suppress the noise while highlighting the root cause.
This is the transition from alerts to intelligent signal management.
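That CPU-spike-then-database-timeout example can be made concrete with a small, self-contained sketch (synthetic alert history, plain Python, not an AWS service): it learns which alert types routinely follow each other within a short window and proposes grouping them into one incident.

```python
# A minimal sketch of learning that one alert type routinely follows another,
# so the pair can be collapsed into a single incident instead of two pages.
from collections import defaultdict
from itertools import combinations

# (timestamp_seconds, alert_type) pairs, e.g. from a historical alert export.
history = [
    (100, "cpu_spike"), (130, "db_timeout"),
    (400, "cpu_spike"), (425, "db_timeout"),
    (900, "disk_full"),
    (1200, "cpu_spike"), (1235, "db_timeout"),
]

WINDOW = 60  # seconds within which two alerts are treated as related

follows = defaultdict(int)   # (earlier, later) -> co-occurrence count
totals = defaultdict(int)    # alert type -> total occurrences

for (t1, a1), (t2, a2) in combinations(sorted(history), 2):
    if a1 != a2 and 0 < t2 - t1 <= WINDOW:
        follows[(a1, a2)] += 1

for _, alert_type in history:
    totals[alert_type] += 1

# If alert B follows alert A most of the time A fires, group them and keep
# whichever one you treat as the root-cause signal.
for (a, b), count in follows.items():
    if count / totals[a] >= 0.8:
        print(f"'{b}' follows '{a}' in {count}/{totals[a]} cases: group into one incident")
```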
Now let's also understand AIOps in terms of the various tools and capabilities that AWS provides.
The key services include DevOps Guru, a fully managed, ML-powered insight engine that detects operational issues well before they impact users.
Next is CloudWatch anomaly detection, a beautiful service which auto-detects anomalies in time-series metrics.
The third, and a very important one, is X-Ray Insights, which correlates traces with error rates and highlights slow segments.
Each of these tools is very valuable individually, but when combined, they form a very powerful AIOps pipeline.
For instance, DevOps Guru can detect a performance anomaly, link it to the failing X-Ray segment, and trigger an SNS alert that updates your incident management system automatically.
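Here is a hedged boto3 sketch of the CloudWatch anomaly detection piece of that pipeline: it trains an anomaly detector on a latency metric and creates an alarm that fires when the metric leaves the expected band, notifying an SNS topic. The metric dimensions and ARNs are placeholders.

```python
# A hedged sketch of a CloudWatch anomaly detection alarm via boto3.
# Dimension values and the SNS topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

metric = {
    "Namespace": "AWS/ApplicationELB",
    "MetricName": "TargetResponseTime",
    "Dimensions": [{"Name": "LoadBalancer", "Value": "app/checkout-alb/0123456789abcdef"}],
}

# Train (or reuse) an anomaly detection model on the metric.
cloudwatch.put_anomaly_detector(Stat="Average", **metric)

# Alarm when the metric breaks out above the model's expected band.
cloudwatch.put_metric_alarm(
    AlarmName="checkout-latency-anomaly",
    ComparisonOperator="GreaterThanUpperThreshold",
    EvaluationPeriods=3,
    ThresholdMetricId="band",
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {"Metric": metric, "Period": 60, "Stat": "Average"},
            "ReturnData": True,
        },
        {
            "Id": "band",
            "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",  # 2 standard deviations wide
            "Label": "expected latency range",
            "ReturnData": True,
        },
    ],
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```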
Now let's understand how you can optimize your IT operations using automation.
With AIOps, we go beyond dashboards and alerts.
We enter a world of intelligent alert routing, noise reduction, contextual correlation, and automated remediation.
This turns operations from a reactive firefighting mode into proactive governance.
Now, let's see how we can streamline operations.
This journey can be achieved in a phased manner in order to optimize operations.
You should definitely leverage the AWS AI stack or the similar offerings that other cloud vendors provide.
Now let's also look at another case: how to reduce incident volumes as part of your day-to-day BAU operations and maintenance support activities.
Here is a step-by-step playbook for how your organization can achieve this: instrument everything, ingest with structure, identify the patterns, drive predictive alerts, and connect them to your ITSM systems.
The goal is fewer tickets, faster resolution, and a measurable uplift in customer and business experience.
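For the last step of that playbook, connecting predictive alerts to your ITSM system, here is a minimal sketch of an AWS Lambda handler subscribed to an SNS topic carrying CloudWatch alarm notifications. The ITSM webhook URL and the ticket payload shape are purely illustrative assumptions.

```python
# A minimal sketch (hypothetical endpoint and payload) of turning an
# SNS-delivered CloudWatch alarm notification into an ITSM ticket.
import json
import urllib.request

ITSM_WEBHOOK = "https://itsm.example.com/api/incidents"  # placeholder URL

def handler(event, context):
    for sns_record in event.get("Records", []):
        # CloudWatch alarm notifications arrive as a JSON string in the SNS message.
        alarm = json.loads(sns_record["Sns"]["Message"])
        ticket = {
            "summary": alarm.get("AlarmName", "observability alert"),
            "severity": "P3" if alarm.get("NewStateValue") == "ALARM" else "P5",
            "details": alarm.get("NewStateReason", ""),
        }
        request = urllib.request.Request(
            ITSM_WEBHOOK,
            data=json.dumps(ticket).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            print(f"Created ticket, ITSM responded with HTTP {response.status}")
```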
Here is a case in point about an anomaly detection framework.
This isn't just about catching failures; it is more about preventing them.
For that, you need to train your ML models on relevant logs, metrics, and traces.
Your system should be able to correlate and draw meaningful, contextual inferences from the data provided to it.
This will eventually help you reduce a lot of the surface-level issues you typically deal with in operations.
Once you apply this, you reach the value realization phase: by instrumenting telemetry across systems, detecting patterns via ML, and integrating them into the workflow, incident volumes will drop significantly.
You are no longer drowning in alerts.
Instead, you are seeing fewer tickets, faster root cause identification, and measurable business impact.
Hence, observability and AIOps together enable a continuous improvement loop for your organization, where every incident becomes training data for smarter operations.
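As the simplest possible illustration of the anomaly detection idea behind that loop, here is a self-contained sketch (synthetic data, plain Python rather than a managed service) that flags a metric sample sitting far outside its recent rolling baseline.

```python
# A self-contained sketch of the simplest anomaly detection idea: flag a
# sample that is far outside the rolling baseline of the preceding window.
from statistics import mean, stdev

def detect_anomalies(samples, window=20, threshold=3.0):
    """Yield (index, value) for samples more than `threshold` standard
    deviations away from the mean of the preceding `window` samples."""
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(samples[i] - mu) / sigma > threshold:
            yield i, samples[i]

# Synthetic latency series (ms): steady around 120 with one obvious spike.
latency = [120 + (i % 5) for i in range(40)] + [480] + [121, 119, 122]
for index, value in detect_anomalies(latency):
    print(f"sample {index}: {value} ms looks anomalous")
```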
While we have discussed a lot of good things about AIOps and the various ways to achieve it, AIOps also comes with certain challenges.
Some of the common challenges are data quality, integration complexity, and resistance to change.
However, consider the compelling ROI you could achieve by implementing AIOps: improved efficiency, faster MTTR, and better customer satisfaction.
When AIOps is done right, it will really help you transform not just IT, but your entire business.
Some of my closing thoughts: intelligent observability is not a luxury.
It's a foundation for a high-performing digital enterprise, from real-time user experience to deep backend visibility, from anomaly detection to AIOps-driven automation.
The journey starts with visibility, but it ends with intelligence.
Let's build systems that not only recover fast, but adapt, predict, and self-heal.
Thank you for joining me today for this session.
I hope you found the session insightful.
Happy learning to everyone in the audience.