Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everybody.
My name's Sanjeev.
Today my topic is ensuring IT infrastructure reliability through robust monitoring and alerting.
This slide is the introduction: what is monitoring, and what is alerting?
So let's begin by understanding why monitoring and alerting are essential to system reliability in the cloud. In today's fast-moving digital infrastructure, cloud systems handle huge amounts of dynamic workload. Any small issue, like high latency or a memory spike, can quickly snowball into a service disruption. That's where monitoring and alerting step in. Monitoring tools give us real-time visibility into system health; metrics like CPU usage, latency, and throughput allow us to detect problems early.
Actionable alerting is just as important. It is not enough to simply get notified; alerts must be meaningful, targeted, and free from noise. This helps avoid alert fatigue, which happens when teams get overwhelmed with non-critical alerts and start ignoring them.
As the research showed, integrating tools like Datadog and New Relic reduced downtime by approximately 50% while improving resource utilization. Ultimately, this translates into greater system stability, faster incident response, and a better experience for users.
Why monitoring matters. Now let's talk about why monitoring is critical in today's cloud environments. Cloud platforms like AWS, Azure, and Google Cloud have made it incredibly easy to scale applications, but that flexibility comes with complexity. Services are spread across regions, APIs, microservices, containers, and sometimes even across hybrid or multi-cloud setups. In such dynamic systems, things can go wrong fast. You might see unexpected latency, intermittent failures, or even a complete outage. And often, without a monitoring system in place, you wouldn't know until users start complaining.
Monitoring tools act as your eyes and ears. They continuously track performance indicators and system health in real time. Without this layer of observability, you are essentially flying blind: you are reacting after the fact rather than proactively solving issues. In essence, monitoring shifts your approach from reactive to preventative. It becomes your early warning system, helping to maintain system reliability, performance, and trust.
So what are the tools in action? Let's look at the tools that bring monitoring to life in real-world environments. I have evaluated three popular monitoring tools. Number one is Prometheus, an open-source, metrics-based monitoring tool, great for containerized environments like Kubernetes. Number two is Datadog, a commercial SaaS solution that integrates well across cloud services and provides powerful dashboards and alerts. And number three is New Relic, known for application performance monitoring, real-time observability, and user experience tracking.
These tools were tested across three major cloud platforms: AWS, Azure, and Google Cloud. In addition to simulated environments, they were deployed at a real-world health insurance company using Microsoft Azure. This added practical depth, testing how these tools perform under actual workloads, such as high traffic during claims processing and compliance-driven applications. The goal was to understand their impact on latency, resource usage, throughput, and cost, and most importantly, how they help detect and resolve issues quickly. This hands-on deployment gave us valuable insight into the strengths and trade-offs of each tool in both test and production environments.
Now let's talk about the essential factors to monitor: latency, throughput, CPU and memory usage, and operational cost. Let's go one by one. When it comes to monitoring, knowing what to measure is just as important as having the right tools. Let's walk through the four essential factors that need to be tracked.

Number one, latency. This measures the time it takes for a system to respond to a request. High latency can frustrate users and signal system slowdowns or backend issues.

Number two, throughput. This is how many requests or transactions your system can handle per second. Higher throughput means better stability and more efficiency.

Number three, resource utilization. Monitoring CPU and memory usage tells you how your infrastructure is being used. Spikes may indicate performance bottlenecks or inefficient applications.

Number four, operational cost. This one is often overlooked, but it is critical. Monitoring itself uses resources, and some tools incur additional costs. You need to track this to ensure that the benefits of monitoring outweigh the expense.

Together, these metrics give you a comprehensive view of system health. They help identify patterns, bottlenecks, and opportunities to improve performance before users are affected.
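To make this concrete, here is a minimal sketch in Python of how these factors could be sampled. It uses the psutil library for CPU and memory and a simple timer for request latency; the record_metric sink and the dummy request are placeholders for illustration, not part of any specific tool discussed here.

```python
import time
import psutil  # third-party library for CPU and memory statistics

def record_metric(name: str, value: float) -> None:
    # Placeholder sink: a real setup would push to Prometheus, Datadog, etc.
    print(f"{name}={value:.2f}")

def sample_system_metrics() -> None:
    # CPU and memory usage: sustained high values may indicate bottlenecks.
    record_metric("cpu_percent", psutil.cpu_percent(interval=1))
    record_metric("memory_percent", psutil.virtual_memory().percent)

def timed_request(handler) -> None:
    # Latency: time a single request; throughput is requests completed per second.
    start = time.perf_counter()
    handler()
    record_metric("latency_ms", (time.perf_counter() - start) * 1000)

if __name__ == "__main__":
    sample_system_metrics()
    timed_request(lambda: time.sleep(0.05))  # stand-in for a real request
```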
Now let's see what the outcomes were from using those monitoring tools. That is the real-world impact. Let's look at the real-world impact these monitoring tools had across different cloud platforms. After integrating tools like Datadog and New Relic, measurable improvements were observed. On AWS, latency dropped from 100 milliseconds to 80 milliseconds. On Azure, throughput increased from 1,900 to 2,400 requests per second. Across all platforms, overall downtime was reduced by around 15%.

These numbers might seem small at first, but in cloud environments, especially with high-traffic or mission-critical applications, even a few milliseconds of latency or a few percentage points of uptime can make a huge difference in user experience and business outcomes. This shows that effective monitoring tools don't just provide visibility; they actively help you optimize performance, scale smarter, and deliver a more reliable service to users. These kinds of performance gains also build trust with stakeholders, especially in regulated environments like healthcare and finance, where downtime and delays can have serious consequences.
Now let's talk about trade-offs and overheads. While monitoring clearly improves system performance and reliability, it is important to understand that it comes with trade-offs.

First, there is a CPU usage overhead. In the study, CPU usage increased by five to eight percent across platforms after integrating monitoring tools. That's because these tools collect, analyze, and sometimes visualize vast amounts of data in near real time. Second, there is a rise in operational cost of around 12% on average. This includes the cost of the tools themselves and the extra cloud resources needed to run them.

These overheads are not necessarily bad, but they need to be managed smartly. For instance, avoid full-stack monitoring if only application-level insights are needed. Tune data collection intervals: do you need metrics every second, or is every five minutes enough? We need to think about that as well. Monitor the most critical KPIs instead of tracking everything.

The key takeaway here is that monitoring should be strategic. Done right, the benefits far outweigh the costs, but blindly enabling all features can lead to resource waste and unexpected bills.
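As a rough sketch of how that tuning might look, the loop below samples only a small set of critical KPIs at a configurable interval. The collect function, the metric names, and the five-minute interval are assumptions for illustration; the interval is the main knob traded off against monitoring overhead.

```python
import time
import random  # stands in for real metric sources in this sketch

# Only the KPIs that matter most; everything else is left out to cut overhead.
CRITICAL_KPIS = ("latency_ms", "error_rate", "cpu_percent")
COLLECTION_INTERVAL_SECONDS = 300  # every five minutes instead of every second

def collect(kpi: str) -> float:
    # Placeholder: a real collector would query the application or the OS.
    return random.random()

def collection_loop(iterations: int = 1) -> None:
    for i in range(iterations):
        sample = {kpi: collect(kpi) for kpi in CRITICAL_KPIS}
        print(sample)  # a real setup would ship this to the monitoring backend
        if i + 1 < iterations:
            time.sleep(COLLECTION_INTERVAL_SECONDS)

if __name__ == "__main__":
    collection_loop(iterations=1)
```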
Now let's talk about preventing alert fatigue. One of the most common challenges in monitoring is alert fatigue: when teams are bombarded with so many alerts that they start ignoring them, even the critical ones. To prevent this, we need actionable and intelligent alerting, not just noise.

Start by defining smart thresholds. For example, don't alert when CPU usage goes over 50%; alert when it stays above 90% for a sustained period. Second, use dynamic baselines. Tools like Datadog and New Relic can learn what normal looks like and alert only when behavior deviates significantly. Group related alerts together using alert correlation so teams see the bigger picture rather than isolated symptoms. Most importantly, make sure alerts are routed to the right team with proper context, so responders know exactly what action to take.

By doing this, you keep your team focused on meaningful incidents, reduce burnout, and respond to real problems faster and more effectively. Alerting should guide, not overwhelm.
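Here is a minimal sketch of the sustained-threshold idea in Python: instead of firing the moment CPU crosses a line, the alert fires only when every sample in a recent window stays above 90%. The window size and the notify function are illustrative assumptions, not any particular tool's API.

```python
from collections import deque

CPU_THRESHOLD = 90.0   # alert only on sustained high usage, not brief spikes
SUSTAINED_SAMPLES = 5  # e.g. five consecutive one-minute samples

recent = deque(maxlen=SUSTAINED_SAMPLES)

def notify(message: str) -> None:
    # Placeholder: route to the on-call team with context in a real setup.
    print(f"ALERT: {message}")

def observe_cpu(cpu_percent: float) -> None:
    recent.append(cpu_percent)
    sustained = len(recent) == SUSTAINED_SAMPLES and all(v > CPU_THRESHOLD for v in recent)
    if sustained:
        notify(f"CPU above {CPU_THRESHOLD}% for {SUSTAINED_SAMPLES} consecutive samples")

# A brief spike to 95% does not alert; five high samples in a row do.
for value in (40, 95, 60, 92, 93, 94, 95, 96):
    observe_cpu(value)
```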
What are the best practices? Now let's bring it all together with some best practices for effective monitoring and alerting.

Choose the right tool for your environment. Not every tool fits every use case. For example, Prometheus is ideal for Kubernetes environments, while Datadog shines in hybrid cloud setups with strong integration needs.

Continuously evaluate the cost-benefit trade-off. Monitoring should enhance reliability, not drain resources. Monitor what matters most and fine-tune data collection frequency to avoid over-collection.

Focus on actionable KPIs. Track metrics that directly impact user experience and system health: latency, error rates, throughput, etc. Avoid getting lost in vanity metrics.

Integrate alerting into the incident response workflow. Use tools like PagerDuty, Slack, or ServiceNow to automatically notify the right team, enrich alerts with context, and speed up resolution.

Review and refine alerts regularly. Systems evolve, and what was critical six months ago might not be today. Keep alerts up to date and aligned with current performance goals.

Following these practices ensures your monitoring setup is efficient, scalable, and responsive, not just technically sound, but also aligned with business needs.
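As one illustration of routing alerts with context, the sketch below posts an enriched alert to a Slack incoming webhook using the requests library. The webhook URLs, team mapping, and payload fields are placeholders for this example; real PagerDuty or ServiceNow integrations have their own APIs.

```python
import requests  # third-party HTTP client

# Placeholder routing table; replace with your real incoming webhook URLs.
TEAM_WEBHOOKS = {
    "payments": "https://hooks.slack.com/services/EXAMPLE/PAYMENTS",
    "platform": "https://hooks.slack.com/services/EXAMPLE/PLATFORM",
}

def send_alert(service: str, team: str, metric: str, value: float, runbook: str) -> None:
    # Enrich the alert with context so responders know what to do next.
    text = (
        f"[{service}] {metric} is at {value} - owned by {team}.\n"
        f"Runbook: {runbook}"
    )
    response = requests.post(TEAM_WEBHOOKS[team], json={"text": text}, timeout=5)
    response.raise_for_status()

if __name__ == "__main__":
    send_alert(
        service="claims-api",
        team="platform",
        metric="p95_latency_ms",
        value=480.0,
        runbook="https://wiki.example.com/runbooks/claims-latency",
    )
```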
Another avenue, or future scope, may be AI-driven monitoring, which can predict issues before they occur. First, use AI to analyze patterns and detect anomalies early, preventing outages and performance drops. Second, optimize performance automatically: enable self-healing systems and dynamic thresholds that adjust in real time, reducing the need for manual oversight. Third, reduce cost and resource usage: implement smart auto-scaling, intelligent workload placement, and noise-reducing alert systems to lower operational costs.
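A very simplified sketch of the anomaly-detection idea is shown below: a rolling mean and standard deviation define a dynamic baseline, and a point far from that baseline is flagged. Real AI-driven tools use far richer models; the window size and threshold here are arbitrary illustrative choices.

```python
import statistics
from collections import deque

WINDOW = 30        # number of recent samples that define "normal"
Z_THRESHOLD = 3.0  # how many standard deviations count as anomalous

history = deque(maxlen=WINDOW)

def is_anomaly(value: float) -> bool:
    # Need a full window before judging; the baseline adapts as data arrives.
    if len(history) < WINDOW:
        history.append(value)
        return False
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history) or 1e-9  # avoid division by zero
    anomalous = abs(value - mean) / stdev > Z_THRESHOLD
    history.append(value)
    return anomalous

# Example: steady latency around 100 ms, then a sudden jump to 400 ms.
samples = [100.0 + (i % 5) for i in range(30)] + [400.0]
flags = [is_anomaly(s) for s in samples]
print(flags[-1])  # True: the jump deviates strongly from the learned baseline
```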
Now let's go to the conclusion. Looking ahead, the future of monitoring is increasingly being shaped by AI and machine learning. Traditional monitoring relies heavily on predefined thresholds and static rules, but with AI-driven monitoring, systems can learn from historical data, detect complex patterns, and even predict issues before they occur. Imagine being alerted not because CPU just spiked, but because the system learned that a specific pattern of memory and disk behavior typically precedes a failure. That is the power of predictive monitoring.

This approach helps reduce the need for continuous manual monitoring, minimizes alert noise and false positives, and optimizes resource allocation dynamically based on expected demand. AI can also help with auto-remediation, where minor issues are fixed without human intervention, saving time and improving uptime. The next generation of tools will be smarter, more adaptive, and better aligned with the evolving complexity of cloud-native environments like containers and serverless. Organizations that embrace these technologies early will gain a competitive edge through higher system resilience and lower operational costs.

Thank you for listening, and let me know if you have any questions. Thank you.