Building Observability into Cloud-Native Applications

Abstract

This session explores how to build observability into cloud-native applications, providing real-time insights that improve reliability, performance, and incident response in modern systems.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hi everyone. My name is Muhammad Ahmad Saeed, and today I'm going to talk about building observability into cloud-native applications.

Let's first look at what observability is. Observability is the ability to understand the internal state of a system based on its external outputs. In simpler terms, observability is about answering the question: what's happening inside my application or system, and why? Observability is often confused with monitoring, but they are different. Observability goes beyond traditional monitoring. Monitoring simply tells us whether something is wrong. For instance, monitoring will tell us if CPU usage is high or if response time is slow, but observability helps us understand why. It empowers us to ask arbitrary questions about our systems and get answers without needing to predict every possible failure scenario beforehand.

Let's look at the pillars of observability. Observability has three pillars. The first is logs. Logs are timestamped records of events that occurred in the system: for instance, a booking was made, a login happened, a payment was made. They are structured or unstructured records of events, they provide detailed context about specific operations, and they can be extremely valuable for debugging. For instance, if an exception or some other failure happened, the log will tell me the exact message: why it happened and what exactly the problem was.

The next pillar is metrics. Metrics are quantitative data about system performance: for instance, CPU usage, request latency, memory usage, error rates, and throughput. There can be a lot of metrics, but they are about quantitative data rather than data about individual events.
And traces are the end-to-end tracking of requests as they propagate through the distributed system. Traces help pinpoint performance bottlenecks and understand dependencies. Using traces, we can identify exactly how a request went from one service to another, and we can track the whole route it took.

Now, let's see why observability is crucial for cloud native. Cloud-native applications are inherently complex. They are distributed: they have a lot of microservices, and those microservices communicate over the network, which introduces latency and adds a lot of failure points. The next is dynamic. Containers and orchestration platforms like Kubernetes constantly spin up and tear down resources. For instance, whenever demand grows they can create new instances, and when demand drops they can kill the specific instances that were created. So resources are being created and destroyed all the time. And then ephemeral. Instances are short-lived, and because of that it is really hard to track what happened, because by the time we start investigating something, the instance or the platform may no longer be there. Without observability, we are essentially flying blind. When something goes wrong, we need to quickly identify the root cause, whether it's a misconfigured service, a network bottleneck, or a cascading failure.

Let's look at the consequences of a lack of observability. The very first is increased mean time to resolution. Without a proper observability setup, diagnosing and resolving issues is hard.
It takes significantly longer, which can impact the user experience and operational efficiency. For instance, if it takes a few hours to recover from a failure, that can have severe consequences for the business. In an ideal situation, with observability set up, we should be able to quickly identify the root cause, which reduces the mean time to resolution.

The next is limited performance optimization. If we don't have a proper observability setup, performance bottlenecks in one microservice can cascade and affect the overall application, and we won't be able to pinpoint exactly what is causing the problem. Without metrics and traces, optimizing performance becomes a guessing game: we can try to optimize something, but we don't know the root cause. For instance, if we see an increased response time for a given endpoint and we don't have observability set up, we might believe that this endpoint has a slow implementation or needs some kind of optimization. In reality, it is possible that the response time from the database increased at that specific time. Because we didn't have observability set up, we believe our endpoint is slow, and we cannot identify that the database was the actual root cause.

The next is the difficulty in ensuring reliability. If there are any gaps in observability, we cannot ensure proper reliability, because as the number of microservices grows, ensuring the reliability of the entire system becomes challenging without a holistic view of the system's health.

Let's look at the steps for implementing observability in a cloud-native application. At a high level, there are five steps. The first is instrumentation. Instrumentation is basically the foundation of observability.
Here we need to instrument our code to collect metrics, logs, and traces. We can use libraries and frameworks that are designed for our chosen language and framework.

The next is centralized collection and storage: collecting data from all the services and storing it in a centralized location. We can use tools like Prometheus for metrics, Elasticsearch or Loki for logs, and Jaeger or Zipkin for traces.

Once we have instrumentation set up and the data centralized in one place, the next part is visualization and analysis. Here we can use dashboards and visualization tools like Grafana to explore the data and gain insights. We can also set up alerts to be notified when there is a critical issue in the system.

With visualization and analysis done, we can move on to contextualization. Here we connect our metrics, logs, and traces together to provide a holistic view of the system. This allows us to correlate events and understand the relationships between different parts of the application.

The final step is automation. We can automate the whole process of collecting, analyzing, and visualizing the observability data. This allows us to focus on understanding the system rather than managing tools. Using these five steps, we can properly implement observability in cloud-native applications.

Let's look at the well-known tools and technologies that are available. For metrics, we can use Prometheus, Telegraf, or StatsD. For logs, it's mostly Elasticsearch, Loki, Filebeat, and Fluentd. For traces, the most famous is OpenTelemetry; there are also Zipkin and Jaeger. For visualization, the most famous are Kibana and Grafana. Some of these tools are open source; some might not be.

Let's look at the platforms and why we need them.
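
To make the instrumentation step described earlier more concrete, here is a minimal sketch in Python that emits all three signals (a log record, a few metrics, and a span) from one operation. It uses only the standard library. The names `METRICS`, `LOGS`, `SPANS`, `log_event`, `span`, and `create_booking` are hypothetical and are not taken from the talk or from any of the tools named above; a real setup would export to a metrics store, a log store, and a tracing backend instead of in-memory containers.

```python
import json
import time
import uuid
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory sinks standing in for real backends (assumption).
METRICS = defaultdict(float)   # metric name -> running total
LOGS = []                      # structured log records as JSON strings
SPANS = []                     # finished spans


def log_event(message, **fields):
    """Append one structured log record with a timestamp."""
    record = {"timestamp": time.time(), "message": message, **fields}
    LOGS.append(json.dumps(record))


@contextmanager
def span(name, trace_id):
    """Time a unit of work and record it as a span belonging to trace_id."""
    start = time.perf_counter()
    try:
        yield
    except Exception as exc:
        # Failures are counted as a metric and described in a log record.
        METRICS[f"{name}.errors"] += 1
        log_event("operation failed", operation=name,
                  trace_id=trace_id, error=str(exc))
        raise
    finally:
        duration = time.perf_counter() - start
        METRICS[f"{name}.calls"] += 1
        METRICS[f"{name}.seconds_total"] += duration
        SPANS.append({"trace_id": trace_id, "name": name,
                      "duration_s": duration})


def create_booking(user_id):
    """Hypothetical business operation instrumented with all three signals."""
    trace_id = uuid.uuid4().hex
    with span("create_booking", trace_id):
        log_event("booking created", user_id=user_id, trace_id=trace_id)
    return trace_id
```

Calling `create_booking("user-1")` leaves one log record, one span, and updated counters behind, all sharing the same `trace_id`, which is what later makes the contextualization step possible.
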
We need those platforms because they provide all the features in one place. We don't have to set up individual tools and technologies and try to connect everything. So using a platform can speed up the integration of observability into our cloud-native applications.

The first is Datadog. Datadog is a market leader. It offers comprehensive monitoring and analytics across cloud-native environments, including logs, metrics, traces, and events. The next is New Relic. New Relic is a comprehensive observability platform with strong application performance capabilities. It is also a very well-known platform, used by millions of users across the world. The third in this list is Honeycomb. Honeycomb stands out because it focuses on high-cardinality data, meaning data with many unique values. Honeycomb excels at event-driven instrumentation: instead of just monitoring a metric like CPU usage or request count, Honeycomb lets us analyze individual events, for example a user login, an API call, or a database query.

There are more platforms than these, but these are the top three in my list. If we look at the key features available in all of these platforms, the first is real-time insights: collecting and analyzing metrics, logs, traces, and events from various sources to provide real-time visibility into application performance and infrastructure health. All of these platforms support microservices. Many microservices have a really complex distributed architecture, but using these platforms we can easily monitor all of them in one place. And all of them allow us to create customizable dashboards.
For instance, if I want to see some trend over time, such as the P50 or P90 for any kind of metric or event, we can easily set up customized dashboards using any of these platforms.

Now let's look at the best practices. The very first thing is to instrument early and often. We shouldn't leave it until the last moment. The earlier we start implementing observability in a cloud-native application, the less painful it will be when things start scaling.

For metrics, we have to use meaningful metrics. We can focus on the metrics that are relevant to the business, and we have to separate signal from noise. There can be a lot of metrics, but we have to narrow them down to the ones most relevant for the business.

For logs, the best practice is to structure the logs. We have to use a consistent format that makes it easier to search and analyze the logs. For example, there can be standard fields such as a timestamp, a system name, and an application ID. There can be a lot more, but whatever it is, it has to be a consistent format. It's better to use the same logging format across all the services, because all of these logs might end up in the same place, and they will be easier to analyze if the format is the same.

Next is data correlation. Here we connect metrics, logs, and traces together to provide a holistic view of the system.

Then we have to automate the observability pipeline. The whole process of collecting, analyzing, and visualizing the data has to be automated. There shouldn't be any manual intervention required whenever we want to visualize something.
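
As an illustration of the structured logging and correlation practices above, the following sketch uses Python's standard `logging` module to emit every record as JSON with the same fields (timestamp, system name, application ID). The class `JsonFormatter`, the helper `build_logger`, the field names, and the example values `"checkout"`, `"payments-api"`, and `"abc123"` are assumptions made for this example, not a format prescribed in the talk.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render every record with the same set of top-level fields."""

    def __init__(self, system, application_id):
        super().__init__()
        self.system = system
        self.application_id = application_id

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "system": self.system,
            "application_id": self.application_id,
            "message": record.getMessage(),
            # Optional field (assumption): lets a log line be joined to a trace.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(system, application_id):
    """Create a logger whose output follows the shared JSON format."""
    logger = logging.getLogger(f"{system}.{application_id}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(system, application_id))
        logger.addHandler(handler)
    return logger


logger = build_logger("checkout", "payments-api")
logger.info("payment made", extra={"trace_id": "abc123"})
```

If every service builds its logger this way, the records can be searched by field once they land in the same store, and the `trace_id` field gives one possible key for correlating a log line with the matching trace.
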
With automation in place, I should just need to open the dashboard or the platform, and the data should already be there. Everything should be set up in an automated way.

The final thing is cost management. Observability can be expensive, especially at scale, if we are logging millions or billions of events per hour or per day. So we have to be careful about the costs associated with storage, processing, and so on. To reduce the amount of data we collect and keep, we can set up data retention policies and use tiered solutions to manage costs.

Let's look at a case study. Assume there is a cloud-native e-commerce application which is experiencing periodic slowdowns. We can use the three pillars of observability to pinpoint the root cause. First, the logs will tell us if there is an increase in error rates in a microservice. Metrics will tell us if there is high CPU utilization on a node. Traces will pinpoint a slow database query affecting response times. Using logs, metrics, and traces, we have identified that the problem goes back to a slow database query. The resolution is straightforward: we can optimize the query and auto-scale the service, which will reduce the latency. So we can use these three pillars of observability to find the root cause of periodic slowdowns in a system.

Thank you. I hope this talk was useful.
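
The trace step of the case study can be sketched roughly as follows. Given the spans of one slow request, the code computes how much time each operation spent in its own work (its duration minus the duration of its direct children) and reports the largest. The span records, service names, field names, and durations are invented for illustration, and the sketch assumes that operation names are unique within the trace and that child spans run one after another rather than in parallel.

```python
from collections import defaultdict

# Hypothetical spans for one slow request; all values are made up.
spans = [
    {"service": "gateway", "operation": "GET /orders",
     "duration_ms": 950, "parent": None},
    {"service": "orders", "operation": "list_orders",
     "duration_ms": 900, "parent": "GET /orders"},
    {"service": "orders-db", "operation": "SELECT orders",
     "duration_ms": 820, "parent": "list_orders"},
    {"service": "pricing", "operation": "get_prices",
     "duration_ms": 40, "parent": "list_orders"},
]


def self_times(spans):
    """Return time spent in each span excluding its direct children."""
    child_total = defaultdict(int)
    for s in spans:
        if s["parent"] is not None:
            child_total[s["parent"]] += s["duration_ms"]
    return {s["operation"]: s["duration_ms"] - child_total[s["operation"]]
            for s in spans}


def slowest_operation(spans):
    """Return the operation with the largest self time."""
    times = self_times(spans)
    return max(times, key=times.get)


print(slowest_operation(spans))  # prints "SELECT orders"
```

In this made-up trace the gateway and the orders service each spend only tens of milliseconds of their own time, while the database query accounts for 820 ms, which mirrors the conclusion of the case study.
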

Conf42 Cloud Native 2025 - Online

March 06 2025 - premiere 5PM GMT

Muhammad Ahmad Saeed

Software Engineer
