Abstract
Unlock the power of observability to boost your distributed systems! Learn how metrics, logs, and traces improve performance, reduce downtime, and enhance reliability. Discover strategies, best practices, and tools that drive efficiency and transform user experiences.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Let us dive into observability today and how it helps us
boost operational efficiency.
It is important for us to be operationally efficient when it comes to maintaining
and scaling distributed systems.
Hi, I'm Harri Patel. I have around 20 years of experience in the tech industry, working for small startups and large enterprises. I hold a master's degree in software engineering from BITS Pilani. I have held various tech leadership roles at places like Yahoo, Symantec, Groupon, and PayPal, to name a few. I have had the opportunity to lead teams ranging from five people to 30-plus-person organizations, including engineers, leads, and managers. In my most recent 10 years of experience, I have had the opportunity to design and develop large-scale distributed systems and big data architectures.
Let us start by going through various challenges that we face while developing
and maintaining distributed systems.
Distributed systems are inherently complex.
This complexity stems from the fact that they present a single system view over tens, or sometimes hundreds, of underlying interconnected services. Hence, our traditional monitoring is not adequate.
Distributed systems usually undergo rapid changes. With the advent of continuous development and deployment, systems are constantly evolving, so it becomes harder to pinpoint issues. There are also scale issues: some problems only surface at scale, and our test systems usually are not equipped to mimic production scale, which further toughens the problem of identifying the issues.
Given these challenges, it is important for us to step back and
understand different ways and means to achieve observability for such
large scale distributed systems.
So let's take a look at three pillars of observability that exist.
The first pillar is metrics.
So metrics are essentially numerical data points collected at regular intervals, revealing system performance patterns and health. Metrics enable identification of trends, anomalies, and potential bottlenecks before they become critical. Hence, they are key ingredients in helping us achieve proactivity instead of reactivity.
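To make that concrete, here is a minimal sketch of how a service might emit such metrics, assuming the prometheus_client library; the metric names and the handler itself are illustrative assumptions, not something from this talk.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names, chosen only for illustration.
REQUESTS_TOTAL = Counter("app_requests_total", "Total requests handled")
REQUEST_LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

def handle_request() -> None:
    """Simulated request handler that records a count and a latency observation."""
    REQUESTS_TOTAL.inc()
    with REQUEST_LATENCY.time():               # records elapsed time into the histogram
        time.sleep(random.uniform(0.01, 0.2))  # placeholder for real work

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics so a scraper can collect at regular intervals
    while True:
        handle_request()
```

A metrics backend scraping that endpoint at a fixed interval is what turns these raw observations into the trends and anomaly signals discussed above.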
The second pillar of observability is logs. Logs are timestamped records of discrete events and actions within the system. They help us by providing rich contextual information for debugging, forensic analysis, and understanding the sequence of events.
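As a small illustration, timestamped and searchable log records can be produced with Python's standard logging module; the service name, events, and fields here are assumptions for the example.

```python
import logging

# Timestamped log lines with enough context to reconstruct the sequence of events later.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("checkout-service")  # hypothetical service name

def charge_card(order_id: str, amount_cents: int) -> None:
    logger.info("charge_started order_id=%s amount_cents=%d", order_id, amount_cents)
    try:
        # ... call the payment provider here ...
        logger.info("charge_succeeded order_id=%s", order_id)
    except Exception:
        # logger.exception preserves the stack trace, which is what forensic analysis needs
        logger.exception("charge_failed order_id=%s", order_id)
        raise

charge_card("order-123", 4999)
```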
The third pillar of observability is traces. Traces are a means to depict the end-to-end journey of a request as it flows through the interconnected mesh of services in a distributed system. They help in highlighting not only complex interactions, but also latency issues and dependencies that impact overall system performance.
Now that we have gone through the key ingredients that form the basis of observability, let us explore the vital metrics that really matter in such distributed systems.
The first popular metric is called system latency, and it is a measurement of response time across service boundaries. In other words, it is the overall turnaround time for a request, from hitting a service to receiving a final response from it. We usually track the P95 and P99 percentiles to quantify the latency of systems and services.
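For illustration, given a list of observed response times, P95 and P99 can be computed with a sketch like the one below; in practice the metrics backend usually does this for you, and the sample latencies are made up.

```python
import statistics

# Hypothetical response times, in milliseconds, collected over one interval.
latencies_ms = [12, 15, 18, 22, 25, 31, 40, 55, 80, 120, 250, 900]

# statistics.quantiles with n=100 returns the 99 percentile cut points.
cuts = statistics.quantiles(latencies_ms, n=100)
p95, p99 = cuts[94], cuts[98]

print(f"P95 = {p95:.1f} ms, P99 = {p99:.1f} ms")
```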
The second metric, which is equally popular, is called error rate: the rate at which errors are encountered relative to successful executions of service requests. Depending on the business impact and use cases, we can configure and watch for different thresholds for the rate at which these errors occur. For instance, mission-critical services will have lower thresholds, whereas relatively non-critical services will probably have higher error rate thresholds.
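A tiny sketch of that idea, with made-up thresholds per service tier:

```python
# Illustrative error-rate thresholds; real values depend on business impact.
ERROR_RATE_THRESHOLDS = {
    "mission_critical": 0.001,  # alert if more than 0.1% of requests fail
    "non_critical": 0.05,       # tolerate up to 5% before alerting
}

def error_rate(errors: int, total_requests: int) -> float:
    return errors / total_requests if total_requests else 0.0

def should_alert(tier: str, errors: int, total_requests: int) -> bool:
    return error_rate(errors, total_requests) > ERROR_RATE_THRESHOLDS[tier]

print(should_alert("mission_critical", errors=3, total_requests=1000))  # True: 0.3% > 0.1%
print(should_alert("non_critical", errors=3, total_requests=1000))      # False: 0.3% <= 5%
```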
The third metric that matters is resource utilization. Collecting resource utilization metrics is a way of keeping track of various resources, including compute, memory, network I/O, storage, and so on. In this case, the idea is to stay on top of these utilizations and prevent resource exhaustion before it occurs. This one, if configured diligently, can help us stay proactive.
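As a minimal sketch, assuming the third-party psutil package is installed, a periodic collector for these utilization figures could look roughly like this:

```python
import time

import psutil  # third-party package, assumed to be installed

def collect_utilization() -> dict:
    """Snapshot of compute, memory, and storage utilization percentages."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
    }

while True:
    sample = collect_utilization()
    print(sample)   # in practice, ship this to the metrics backend instead of printing
    time.sleep(30)  # collect at a regular interval
```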
User experience is another set of metrics. It implies measuring interactions associated with actual user experiences, and it helps us connect backend performance to front-end experiences.
As you can see, these are some of the key metrics that we can possibly
integrate and automate for staying on top of various issues that could occur
in large scale distributed systems.
Now let us take a look at a typical example of distributed tracing in action. Distributed tracing is a sort of chain or lifecycle that starts from the time a request enters the system and follows it through the various hops it takes across the interconnected subsystems and services that cater to the end user's request. As you can see, it starts from the left-hand side, all the way from the front-end request: a user interaction initiates a traced HTTP request with a unique correlation ID. The trace context is then propagated while routing to downstream microservices. At the database query level, spans indicate performance bottlenecks, with a 200-millisecond latency exceeding the threshold. Finally, the completed trace shows the full request path with timing metrics across services; this is called the response chain.
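A rough sketch of that flow, using a manually propagated correlation ID and the requests library; the header name, downstream URL, and the 200 ms check are assumptions for illustration, and in practice a tracing SDK handles this propagation for you.

```python
import time
import uuid

import requests  # third-party HTTP client, assumed available

CORRELATION_HEADER = "X-Correlation-ID"  # hypothetical header name

def handle_front_end_request(incoming_headers: dict) -> dict:
    # Reuse the caller's correlation ID, or start a new trace if there is none.
    correlation_id = incoming_headers.get(CORRELATION_HEADER, str(uuid.uuid4()))

    start = time.monotonic()
    # Propagate the trace context while routing to a downstream microservice.
    response = requests.get(
        "http://inventory-service.local/stock",        # assumed downstream URL
        headers={CORRELATION_HEADER: correlation_id},
        timeout=2,
    )
    elapsed_ms = (time.monotonic() - start) * 1000

    if elapsed_ms > 200:
        # This hop exceeded the 200 ms threshold; flag it so it stands out in the trace view.
        print(f"[{correlation_id}] downstream call slow: {elapsed_ms:.0f} ms")
    return {"correlation_id": correlation_id, "status": response.status_code}
```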
So now that we have discussed the pillars of observability and the key metrics that matter, let us understand the typical business impact of observability and what gains we can possibly realize by employing such measures in large-scale distributed systems.
With appropriate observability in place, we can drastically reduce the time taken to identify and locate the underlying root cause, by at least 60%. This immediately translates into faster troubleshooting times. Mean time to recovery (MTTR) is a popular KPI in the industry that measures the mean time taken to resolve an issue; through means of observability, it is seen to improve by at least 25%, and improved MTTR is used to measure a system's overall maturity. Uptime is another KPI that is widely used to measure the reliability of highly available, user-facing systems, and it has been assessed that with due observability in place we can realize an increase in uptime by at least 30%.
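As a small illustration of how these KPIs are typically computed, here is a sketch over made-up incident records:

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (detected_at, resolved_at) pairs.
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 45)),
    (datetime(2024, 1, 17, 22, 10), datetime(2024, 1, 17, 23, 40)),
]

downtime = sum((resolved - detected for detected, resolved in incidents), timedelta())
mttr = downtime / len(incidents)            # mean time to recovery

period = timedelta(days=31)                 # measurement window
uptime_pct = 100 * (1 - downtime / period)  # assumes each incident was a full outage

print(f"MTTR: {mttr}, uptime: {uptime_pct:.3f}%")
```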
So in summary, with proper attention to designing and configuring observability, we can realize business impact through increased uptime, reduced MTTR, and faster root cause analysis.
On one hand, it is about building observability into the system by designing and configuring various metrics and tracking KPIs. On the other hand, it is about alerting from the system when some or all of these metrics hit their thresholds at runtime. Alerting helps us notify teams, systems, and stakeholders of anomalies detected during operation. It is the system's way of saying something needs immediate attention.
As you can see here, it starts from the bottom of the pyramid. We usually start by watching capacity trends; this capacity could be compute, storage, network I/O, and so on. Through thresholds, we keep track of possible early warning signs, things that could potentially become catastrophic. Through monitoring critical infrastructure components, we keep track of overall system health. All of these result in configuring alerts for issues that impact business metrics and affect the end customers of the service.
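A bare-bones sketch of threshold-based alerting along those lines; the rules and the notify function are illustrative placeholders.

```python
# Illustrative alert rules, from capacity trends up to business-impacting symptoms.
ALERT_RULES = [
    {"metric": "disk_used_percent", "threshold": 85, "severity": "warning"},
    {"metric": "api_error_rate_percent", "threshold": 1, "severity": "critical"},
]

def notify(severity: str, message: str) -> None:
    # Placeholder: in practice this would page a team or post to a channel.
    print(f"[{severity.upper()}] {message}")

def evaluate(current_values: dict) -> None:
    for rule in ALERT_RULES:
        value = current_values.get(rule["metric"])
        if value is not None and value > rule["threshold"]:
            notify(rule["severity"],
                   f'{rule["metric"]}={value} exceeded threshold {rule["threshold"]}')

evaluate({"disk_used_percent": 91, "api_error_rate_percent": 0.4})
```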
After going through all these concepts, you might wonder how we go about it and how we set ourselves up for success when actually implementing these observability measures throughout the system.
An observability implementation roadmap typically looks something like this. It starts with defining key metrics that identify the critical signals indicating system health, while keeping focus on business outcomes.
The second aspect is instrumenting code. Instrumenting code is about adding metrics, logging, and tracing to the application code using standard conventions and coding practices.
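As a minimal sketch of what such instrumentation can look like, assuming the opentelemetry-api and opentelemetry-sdk packages, a traced operation might be written roughly like this; the service and span names are illustrative.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer that prints spans to the console; real deployments export to a collector.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # illustrative tracer name

def place_order(order_id: str) -> None:
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic goes here ...

place_order("order-123")
```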
Once the metrics are tracked, they are collected in a centralized place. Designing solutions to harvest these metrics into common, configurable, and scalable storage enables correlation across interconnected services, because everything is stored in one common place.
The last aspect is visualizing and alerting. We create dashboards and alerting rules and carefully ensure that we continuously eliminate noise and false positives, so that it remains a high-quality alerting and display mechanism.
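One simple way to keep such noise down, sketched here with an assumed 15-minute cooldown, is to suppress repeats of the same alert for a while:

```python
import time

COOLDOWN_SECONDS = 15 * 60          # illustrative: ignore repeats for 15 minutes
_last_fired: dict[str, float] = {}  # alert name -> last time we notified

def fire_alert(name: str, message: str) -> bool:
    """Send the alert unless the same one fired recently; returns True if it was sent."""
    now = time.time()
    if now - _last_fired.get(name, 0.0) < COOLDOWN_SECONDS:
        return False  # suppressed as noise
    _last_fired[name] = now
    print(f"ALERT {name}: {message}")  # placeholder for a real notification channel
    return True
```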
Now let's see what common pitfalls to avoid while implementing these observability measures in the system.
Alert fatigue. It is very common to fall into the trap of too many alerts, resulting in your team ignoring the critical ones amidst an unmanageable stream of alerts. Instead, we should focus on high-quality, actionable alerts that help everyone with a meaningful message and the corrective actions to take to resolve the issue at hand.
Cost overruns. Collecting everything and storing it can turn out to be expensive, both in storage and in the code, which slows down further as it runs metrics computation logic besides the actual business logic. The idea here is to sample high-volume telemetry and stay at just enough of a threshold of collecting metrics and computing KPIs.
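A tiny sketch of head-based sampling with an assumed 10% sample rate:

```python
import random

SAMPLE_RATE = 0.10  # assumption: keep roughly 10% of high-volume telemetry

def maybe_record(event: dict) -> bool:
    """Record only a sampled fraction of events to keep storage and compute costs bounded."""
    if random.random() >= SAMPLE_RATE:
        return False  # dropped
    # ... ship the event to the telemetry backend here ...
    return True

kept = sum(maybe_record({"event": "cache_lookup", "i": i}) for i in range(10_000))
print(f"kept {kept} of 10000 events")
```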
Tool overload. Using too many disparate systems can be challenging for teams to adapt to, so try to consolidate wherever possible. Streamlining and standardizing tools is an ongoing process; as the system becomes more mature, teams tend to gravitate to a smaller number of tools, leading to system-wide standardization. Vanity metrics. It is easy to fall into the false notion that everything is important. Instead of tracking everything, focus on metrics that tie to business outcomes, metrics that matter from a business impact perspective.
Now, as we come close to closing this session, let us go through the key takeaways that we have learned and understood as part of this session.
Observability is essential, not optional, for any large-scale distributed system. When it comes to wholesome observability, integrate all three pillars, that is, metrics, logs, and tracing. Measure business impact: connect technical metrics to business outcomes, the ones that drive meaningful business impact. Build an observability culture; this one is the most important. We have to train our teams to develop an appreciation and an eye for leveraging these observability metrics, which will help them connect the dots.
I hope this session was helpful to all of you in understanding and acknowledging the importance of designing and developing various observability measures to further enhance operational efficiency in large-scale distributed systems.
Thank you for your time today.