Abstract
Unlock the power of observability to boost your distributed systems! Learn how metrics, logs, and traces improve performance, reduce downtime, and enhance reliability. Discover strategies, best practices, and tools that drive efficiency and transform user experiences.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Let us dive into observability today and how it helps us
boost operational efficiency.
It is important for us to be operationally efficient when it comes to maintaining
and scaling distributed systems.
Hi, I'm Harri Patel. I have around 20 years of experience in the tech industry, working for small startups and large enterprises. I hold a master's degree in software engineering from BITS Pilani. I have held various tech leadership roles at places like Yahoo, Symantec, Groupon, and PayPal, to name a few. I have had the opportunity to lead teams ranging from five people to 30-plus-person organizations, including engineers, leads, and managers. In my most recent 10 years of experience, I have had the opportunity to design and develop large-scale distributed systems and big data architectures.
Let us start by going through various challenges that we face while developing
and maintaining distributed systems.
Distributed systems are inherently complex.
This complexity stems from the fact that they present a single system view over tens, or sometimes hundreds, of underlying interconnected services. Hence, our traditional monitoring is not adequate.
Distributed systems usually undergo rapid changes. With the advent of continuous development and deployment, systems are constantly evolving, so it becomes harder to pinpoint issues. There are also scale issues: some problems only surface at scale, and our test systems usually are not equipped to mimic production scale, which further toughens the problem of identifying the issues.
Given these challenges, it is important for us to step back and
understand different ways and means to achieve observability for such
large scale distributed systems.
So let's take a look at three pillars of observability that exist.
The first pillar is metrics.
So metrics are essentially numerical data points collected at regular intervals, revealing system performance patterns and health. Metrics enable identification of trends, anomalies, and potential bottlenecks before they become critical. Hence, they are key ingredients in helping us achieve proactivity instead of reactivity.
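To make that concrete, here is a minimal sketch of how a service might emit such metrics, assuming the prometheus_client library; the metric names and the handler itself are illustrative assumptions, not something from this talk.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names, chosen only for illustration.
REQUESTS_TOTAL = Counter("app_requests_total", "Total requests handled")
REQUEST_LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

def handle_request() -> None:
    """Simulated request handler that records a count and a latency observation."""
    REQUESTS_TOTAL.inc()
    with REQUEST_LATENCY.time():               # records elapsed time into the histogram
        time.sleep(random.uniform(0.01, 0.2))  # placeholder for real work

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics so a scraper can collect at regular intervals
    while True:
        handle_request()
```

A metrics backend scraping that endpoint at a fixed interval is what turns these raw observations into the trends and anomaly signals discussed above.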
The second pillar of observability is logs. Logs are timestamped records of discrete events and actions within the system. They help us by providing rich contextual information for debugging, forensic analysis, and understanding the sequence of events.
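As a small illustration, timestamped and searchable log records can be produced with Python's standard logging module; the service name, events, and fields here are assumptions for the example.

```python
import logging

# Timestamped log lines with enough context to reconstruct the sequence of events later.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("checkout-service")  # hypothetical service name

def charge_card(order_id: str, amount_cents: int) -> None:
    logger.info("charge_started order_id=%s amount_cents=%d", order_id, amount_cents)
    try:
        # ... call the payment provider here ...
        logger.info("charge_succeeded order_id=%s", order_id)
    except Exception:
        # logger.exception preserves the stack trace, which is what forensic analysis needs
        logger.exception("charge_failed order_id=%s", order_id)
        raise

charge_card("order-123", 4999)
```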
The third pillar of observability is traces. Traces are a means to depict the end-to-end journey of a request as it flows through the interconnected mesh of services in a distributed system. They help in highlighting not only complex interactions, but also latency issues and dependencies that impact overall system performance.
Now that we have gone through the key ingredients that form the basis of observability, let us explore the vital metrics that really matter in such distributed systems.
The first popular metric is called system latency, and it is a measurement of response time across service boundaries. In other words, it is the overall turnaround time for a request, from hitting a service to receiving a final response from it. We usually track the P95 and P99 percentiles to quantify the latency of systems and services.
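For illustration, given a list of observed response times, P95 and P99 can be computed with a sketch like the one below; in practice the metrics backend usually does this for you, and the sample latencies are made up.

```python
import statistics

# Hypothetical response times, in milliseconds, collected over one interval.
latencies_ms = [12, 15, 18, 22, 25, 31, 40, 55, 80, 120, 250, 900]

# statistics.quantiles with n=100 returns the 99 percentile cut points.
cuts = statistics.quantiles(latencies_ms, n=100)
p95, p99 = cuts[94], cuts[98]

print(f"P95 = {p95:.1f} ms, P99 = {p99:.1f} ms")
```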
The second metric, which is equally popular, is called error rate: the rate at which errors are encountered relative to successful executions of service requests. Depending on the business impact and use cases, we can configure and watch for different thresholds for the rate at which these errors occur. For instance, mission-critical services will have lower thresholds, whereas relatively non-critical services will probably have higher error rate thresholds.
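A tiny sketch of that idea, with made-up thresholds per service tier:

```python
# Illustrative error-rate thresholds; real values depend on business impact.
ERROR_RATE_THRESHOLDS = {
    "mission_critical": 0.001,  # alert if more than 0.1% of requests fail
    "non_critical": 0.05,       # tolerate up to 5% before alerting
}

def error_rate(errors: int, total_requests: int) -> float:
    return errors / total_requests if total_requests else 0.0

def should_alert(tier: str, errors: int, total_requests: int) -> bool:
    return error_rate(errors, total_requests) > ERROR_RATE_THRESHOLDS[tier]

print(should_alert("mission_critical", errors=3, total_requests=1000))  # True: 0.3% > 0.1%
print(should_alert("non_critical", errors=3, total_requests=1000))      # False: 0.3% <= 5%
```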
The third metric that matters is resource utilization. Collecting resource utilization metrics is a way of keeping track of various resources, including compute, memory, network I/O, storage, and so on. In this case, the idea is to stay on top of these utilizations and prevent resource exhaustion before it occurs. This one, if configured diligently, can help us stay proactive.
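As a minimal sketch, assuming the third-party psutil package is installed, a periodic collector for these utilization figures could look roughly like this:

```python
import time

import psutil  # third-party package, assumed to be installed

def collect_utilization() -> dict:
    """Snapshot of compute, memory, and storage utilization percentages."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
    }

while True:
    sample = collect_utilization()
    print(sample)   # in practice, ship this to the metrics backend instead of printing
    time.sleep(30)  # collect at a regular interval
```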
User experience is another set of metrics. It implies measuring interactions associated with actual user experiences, and it helps us connect backend performance to front-end experiences.
As you can see, these are some of the key metrics that we can possibly
integrate and automate for staying on top of various issues that could occur
in large scale distributed systems.
Now let us take a look at a typical example of distributed tracing in action. Distributed tracing is a sort of chain or lifecycle that starts from the time a request enters the system and follows it through the various hops it takes across the interconnected subsystems and services that cater to the end user's request. As you can see, it starts from the left-hand side, all the way from the front-end request: a user interaction initiates a traced HTTP request with a unique correlation ID. The trace context is then propagated while routing to downstream microservices. At the database query level, spans indicate performance bottlenecks, with a 200-millisecond latency exceeding the threshold. Finally, the completed trace shows the full request path with timing metrics across services; this is called the response chain.
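A rough sketch of that flow, using a manually propagated correlation ID and the requests library; the header name, downstream URL, and the 200 ms check are assumptions for illustration, and in practice a tracing SDK handles this propagation for you.

```python
import time
import uuid

import requests  # third-party HTTP client, assumed available

CORRELATION_HEADER = "X-Correlation-ID"  # hypothetical header name

def handle_front_end_request(incoming_headers: dict) -> dict:
    # Reuse the caller's correlation ID, or start a new trace if there is none.
    correlation_id = incoming_headers.get(CORRELATION_HEADER, str(uuid.uuid4()))

    start = time.monotonic()
    # Propagate the trace context while routing to a downstream microservice.
    response = requests.get(
        "http://inventory-service.local/stock",        # assumed downstream URL
        headers={CORRELATION_HEADER: correlation_id},
        timeout=2,
    )
    elapsed_ms = (time.monotonic() - start) * 1000

    if elapsed_ms > 200:
        # This hop exceeded the 200 ms threshold; flag it so it stands out in the trace view.
        print(f"[{correlation_id}] downstream call slow: {elapsed_ms:.0f} ms")
    return {"correlation_id": correlation_id, "status": response.status_code}
```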
So now that we have discussed the pillars of observability and the key metrics that matter, let us understand the typical business impact of observability and what gains we can possibly realize by employing such measures in large-scale distributed systems.
With appropriate observability in place, we can drastically reduce the time taken to identify and locate the underlying root cause, by at least 60%. This immediately translates into faster troubleshooting times. Mean time to recovery (MTTR) is a popular KPI in the industry that measures the mean time taken to resolve an issue; through means of observability, it is seen to improve by at least 25%, and improved MTTR is used to measure a system's overall maturity. Uptime is another KPI that is widely used to measure the reliability of highly available, user-facing systems, and it has been assessed that with due observability in place we can realize an increase in uptime by at least 30%.
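As a small illustration of how these KPIs are typically computed, here is a sketch over made-up incident records:

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (detected_at, resolved_at) pairs.
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 45)),
    (datetime(2024, 1, 17, 22, 10), datetime(2024, 1, 17, 23, 40)),
]

downtime = sum((resolved - detected for detected, resolved in incidents), timedelta())
mttr = downtime / len(incidents)            # mean time to recovery

period = timedelta(days=31)                 # measurement window
uptime_pct = 100 * (1 - downtime / period)  # assumes each incident was a full outage

print(f"MTTR: {mttr}, uptime: {uptime_pct:.3f}%")
```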
So in summary, with proper attention to designing and configuring observability, we can realize business impact through increased uptime, reduced MTTR, and faster root cause analysis.
On one hand, it is about building observability into the system by designing and configuring various metrics and tracking KPIs. On the other hand, it is about alerting from the system when some or all of these metrics hit their thresholds at runtime. Alerting helps us notify teams, systems, and stakeholders of anomalies detected during operation. It is the system's way of saying something needs immediate attention.
As you can see here, it starts from the bottom of the pyramid. We usually start by watching capacity trends; this capacity could be compute, storage, network I/O, and so on. Through thresholds, we keep track of possible early warning signs, things that could potentially become catastrophic. Through monitoring critical infrastructure components, we keep track of overall system health. All of these result in configuring alerts for issues that impact business metrics and affect the end customers of the service.
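A bare-bones sketch of threshold-based alerting along those lines; the rules and the notify function are illustrative placeholders.

```python
# Illustrative alert rules, from capacity trends up to business-impacting symptoms.
ALERT_RULES = [
    {"metric": "disk_used_percent", "threshold": 85, "severity": "warning"},
    {"metric": "api_error_rate_percent", "threshold": 1, "severity": "critical"},
]

def notify(severity: str, message: str) -> None:
    # Placeholder: in practice this would page a team or post to a channel.
    print(f"[{severity.upper()}] {message}")

def evaluate(current_values: dict) -> None:
    for rule in ALERT_RULES:
        value = current_values.get(rule["metric"])
        if value is not None and value > rule["threshold"]:
            notify(rule["severity"],
                   f'{rule["metric"]}={value} exceeded threshold {rule["threshold"]}')

evaluate({"disk_used_percent": 91, "api_error_rate_percent": 0.4})
```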
After going through all these concepts, you might wonder how we go about it and how we set ourselves up for success when actually implementing these observability measures throughout the system.
An observability implementation roadmap typically looks something like this. It starts with defining key metrics that identify the critical signals indicating system health, while keeping focus on business outcomes.
The second aspect is instrumenting code. Instrumenting code is about adding metrics, logging, and tracing to the application code using standard conventions and coding practices.
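As a minimal sketch of what such instrumentation can look like, assuming the opentelemetry-api and opentelemetry-sdk packages, a traced operation might be written roughly like this; the service and span names are illustrative.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer that prints spans to the console; real deployments export to a collector.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # illustrative tracer name

def place_order(order_id: str) -> None:
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic goes here ...

place_order("order-123")
```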
Once the metrics are tracked, they are collected in a centralized place. Designing solutions to harvest these metrics into common, configurable, and scalable storage enables correlation across interconnected services, because everything is stored in one common place.
The last aspect is visualizing and alerting. We create dashboards and alerting rules and carefully ensure that we continuously eliminate noise and false positives, so that it remains a high-quality alerting and display mechanism.
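One simple way to keep such noise down, sketched here with an assumed 15-minute cooldown, is to suppress repeats of the same alert for a while:

```python
import time

COOLDOWN_SECONDS = 15 * 60          # illustrative: ignore repeats for 15 minutes
_last_fired: dict[str, float] = {}  # alert name -> last time we notified

def fire_alert(name: str, message: str) -> bool:
    """Send the alert unless the same one fired recently; returns True if it was sent."""
    now = time.time()
    if now - _last_fired.get(name, 0.0) < COOLDOWN_SECONDS:
        return False  # suppressed as noise
    _last_fired[name] = now
    print(f"ALERT {name}: {message}")  # placeholder for a real notification channel
    return True
```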
Now let's see what common pitfalls to avoid while implementing these observability measures in the system.
Alert fatigue. It is very common to fall into the trap of too many alerts, resulting in your team ignoring the critical ones amidst an unmanageable stream of alerts. Instead, we should focus on high-quality, actionable alerts that help everyone with a meaningful message and the corrective actions to take to resolve the issue at hand.
Cost overruns. Collecting everything and storing it can turn out to be expensive, both in storage and in the code, which slows down further as it runs metrics computation logic besides the actual business logic. The idea here is to sample high-volume telemetry and stay at just enough of a threshold of collecting metrics and computing KPIs.
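A tiny sketch of head-based sampling with an assumed 10% sample rate:

```python
import random

SAMPLE_RATE = 0.10  # assumption: keep roughly 10% of high-volume telemetry

def maybe_record(event: dict) -> bool:
    """Record only a sampled fraction of events to keep storage and compute costs bounded."""
    if random.random() >= SAMPLE_RATE:
        return False  # dropped
    # ... ship the event to the telemetry backend here ...
    return True

kept = sum(maybe_record({"event": "cache_lookup", "i": i}) for i in range(10_000))
print(f"kept {kept} of 10000 events")
```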
Tool overload. Using too many disparate systems can be challenging for teams to adapt to, so try to consolidate wherever possible. Streamlining and standardizing tools is an ongoing process; as the system becomes more mature, teams tend to gravitate to a smaller number of tools, leading to system-wide standardization. Vanity metrics. It is easy to fall into the false notion that everything is important. Instead of tracking everything, focus on metrics that tie to business outcomes, metrics that matter from a business impact perspective.
Now, as we come close to closing this session, let us go through the key takeaways that we have learned and understood as part of this session.
Observability is essential, not optional, for any large-scale distributed system. When it comes to wholesome observability, integrate all three pillars, that is, metrics, logs, and tracing. Measure business impact: connect technical metrics to business outcomes, the ones that drive meaningful business impact. Build an observability culture; this one is the most important. We have to train our teams to develop an appreciation and an eye for leveraging these observability metrics, which will help them connect the dots.
I hope this session was helpful to all of you in understanding and acknowledging the importance of designing and developing various observability measures to further enhance operational efficiency in large-scale distributed systems.
Thank you for your time today.