Transcript
Hello everyone.
I am N Krisna.
I'm a technical architect with more than 11 years of experience.
I have worked on AWS services as a developer and administrator, and I'm currently
working as an architect for AWS services, DevOps, and DevSecOps.
I also work on automation of all these services
for performance improvement and reducing costs for clients.
Today I'm going to discuss the topic of observability with OpenTelemetry.
This technical discussion will focus on observability principles
and implementing OpenTelemetry in modern distributed systems.
Let's get started.
This is the agenda for today.
First, we'll focus on understanding observability by learning the
foundations and key concepts of observability.
Then we'll deep dive into the different telemetry data types
and collection methods.
Later we'll focus on the architecture of OpenTelemetry components and the
implementation part, which covers OpenTelemetry essentials as the third topic.
And the fourth topic would be the data sources of OpenTelemetry and how we can
integrate OpenTelemetry with other systems and tools in our application.
Let's get started with the deep dive.
Before we go into details, let's focus on the foundations of observability.
We'll talk a lot about telemetry data throughout this discussion,
so let's talk about what telemetry data is.
Basically, it's the collection and transmission of logs, events, and
measurements from various systems,
which could be remote or distributed systems, to a central place for
monitoring and analysis.
Once we collect it, this data is used to locate
potential problems in the systems.
So basically, DevOps focuses on two metrics.
One is called MTTD, and the other is MTTR.
MTTD, mean time to detection,
is the time taken to identify an issue.
And MTTR, mean time to resolve,
is the time taken to resolve the issue after it is detected.
So identifying and resolving: these are the two metrics that are
measured by organizations.
How quickly we are able to identify the issue and how quickly we
are able to resolve it.
Monitoring methods.
So we'll be collecting different types of telemetry data, as we discussed
in the previous slide.
So what type of data are we going to collect?
The first one is the RED method:
R represents rate, E represents errors, D represents duration.
Rate is the number of requests coming into the system per second,
like requests per second or transactions per second.
Errors are the number of failed transactions or requests in our application
or system.
And duration is the response time, maybe per request or per transaction, from our system.
And the second method is the USE method.
It focuses on utilization of resources,
like how much CPU is used, how much disk is used.
Saturation, like queue length, for example,
where network packets or network requests are queuing up.
And errors, which could be hardware or software related errors,
for example disk read errors or disk I/O errors.
Four golden signals.
This method is recommended by Google in their site reliability engineering documentation.
The four golden signals cover the first three signals,
like RED plus saturation.
The first one is latency,
the processing time or delay happening in the system.
Traffic is the request volume coming into the system.
Errors are the failure rates in the application or the system.
And saturation is the system load that is happening, whether in the
operating system, the overall system, or the application.
And Core Web Vitals.
This is mainly focused on the web UI, user interfaces, webpages, or websites.
The previous three methods we focused on are about the
infrastructure side, the system side, and the services side.
But Core Web Vitals focus on the web user interface and user experience.
LCP, Largest Contentful Paint, measures loading performance.
FID, First Input Delay, measures responsiveness to user interactions.
CLS, Cumulative Layout Shift, measures visual stability during
the load of a particular webpage.
Now let's talk about monitoring versus observability.
Monitoring tracks known issues, it is best for small systems and monolithic systems,
and it shows when an issue occurs.
When it comes to observability, it's a more proactive approach.
It uncovers unknown unknowns, it's ideal for complex systems, and it shows where,
when, and why that issue happened.
So observability gives us a much quicker way of identifying the
issue and resolving it.
And telemetry data types.
So we talked about what telemetry data is, what we collect, and what type of data it is.
There are different types of telemetry data.
One is metrics.
They are quantitative measurements of what is happening;
they come as numbers, basically.
Logs are timestamped events that are recorded in the system,
so what happened and when it happened.
Traces are very important when we talk about observability.
They give us the end-to-end request journey within our application.
Suppose I have five APIs.
A trace would help us identify which API is performing slowly,
or which API is actually broken.
So it explains where the bottleneck is.
Profiles identify resource-level diagnostics, why my system is slow,
whether it is because of CPU utilization or maybe garbage
collection happening during that time.
So profiling helps us identify resource-level diagnostics.
This slide explains, in a pictorial way, observability in modern systems.
Here you see the applications could be in a public cloud or on premises,
and we receive various telemetry data.
We call this MELT: M for metrics, E for events, L for logs, and T for traces.
This collected telemetry would be pushed into an observability
solution, which we are going to look at in future slides.
And different teams, like developers, monitoring ops, and security ops, would
depend on that solution to identify any issues and to resolve the problems.
This is just an overall picture of how observability
works in our modern systems.
Now, how can we collect these metrics?
There are two collection methods, actually: the push method and the scrape method.
The push method is where applications actively send
the metrics to a collection endpoint over TCP or UDP protocols.
In this picture you see the backend service is sending metrics
to StatsD, and StatsD is in turn sending that particular metric to
Graphite, which is a time series database.
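As a rough illustration of the push method, here is a minimal sketch, assuming the Python statsd package and a StatsD daemon on the default local UDP port; the metric names are made up for this example.

```python
# Push method sketch: the application actively sends metrics out over UDP.
# Assumes a StatsD daemon listening on localhost:8125 (pip install statsd).
import statsd

client = statsd.StatsClient("localhost", 8125)   # UDP by default
client.incr("checkout.requests")                 # rate: count one request
client.timing("checkout.duration_ms", 42)        # duration for this request
```

StatsD would then forward these metrics to Graphite, as in the picture.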
The scrape method is where applications expose metrics for collectors.
So basically, applications won't push the data;
instead, the collection services, the data collectors, pull
the data from the applications.
In this example, Prometheus is actually scraping the
metrics from the application.
So basically we can say it's pulling the data
from the backend service.
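By contrast, here is a minimal scrape-style sketch, assuming the Python prometheus-client package; the port and metric name are just examples. The app only exposes an endpoint and never pushes anything.

```python
# Scrape method sketch: the app exposes /metrics over HTTP and waits to be scraped.
# Assumes the prometheus-client package (pip install prometheus-client).
import time
from prometheus_client import Counter, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")

start_http_server(8000)   # serves http://localhost:8000/metrics
while True:
    REQUESTS.inc()        # update the counter; Prometheus pulls on its own schedule
    time.sleep(1)
```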
So when can we choose scrape, and when can we choose push?
Choose scrape when you're using Kubernetes or dynamic environments,
when you prefer centralized control of your collection configuration,
and when apps can expose HTTP endpoints.
So basically, when your apps have the option to expose their endpoints,
choose scraping.
Choose push when apps are short-lived, when real-time metrics are
required, and when apps run in restricted networks.
So with the scrape method, apps that can expose an HTTP endpoint
can be scraped, but when apps are running within restricted
networks and have no option to expose their endpoints,
use the push method.
Now we come to the introduction to OpenTelemetry.
OpenTelemetry is an open source framework which standardizes the generation,
collection, and management of telemetry data.
It's a CNCF project.
Basically, it's incubated under the Cloud Native Computing Foundation,
the same foundation that manages Kubernetes as well.
The key benefits of OpenTelemetry are that it's vendor neutral, it supports
multiple languages, it provides standardization of data collection,
and it brings logs, metrics, and traces into one framework.
Now let's look at the OpenTelemetry architecture.
The architecture of OpenTelemetry has these main components:
instrumentation libraries, receivers, collectors, processors, the OTLP protocol,
and exporters.
Instrumentation libraries enhance applications to generate telemetry data,
so that the applications can push various metrics, logs, and traces to OpenTelemetry.
Then the receivers are the ones which collect the telemetry data from various sources;
those could be applications as well.
Collectors receive, process, and export the telemetry data.
Processors manipulate the data which was received and
transform it before actually exporting it to backend services.
The OTLP protocol is a standardized protocol for
transmitting the telemetry data.
And exporters: whatever data has been manipulated and transformed by the
processors would be exported to external observability backends.
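To make this concrete, here is a hedged sketch of an application using the OpenTelemetry Python SDK to export a metric over OTLP to a collector; the endpoint, service name, and counter name are assumptions chosen for illustration.

```python
# Sketch: an instrumented app pushing a metric over OTLP (gRPC) to a collector.
# Assumes opentelemetry-sdk and opentelemetry-exporter-otlp are installed,
# and a collector listening on localhost:4317.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="localhost:4317", insecure=True)
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("payment-service")
requests_counter = meter.create_counter("requests_total")
requests_counter.add(1, {"route": "/pay"})  # recorded and exported periodically
```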
The OpenTelemetry Collector.
This is a very important component of OpenTelemetry.
It is a centralized executable that receives, processes, and exports
telemetry data to multiple targets.
Protocol support: it works with all the popular open source telemetry protocols.
It acts as an intermediary between the applications and
the backend analysis tools.
And it reduces resource consumption, centralizes the
configuration, and improves performance.
Now we'll talk about the collector components.
The components of the collector are the receivers, processors, and exporters.
Receivers are the entry points for telemetry data, and they accept various formats.
Processors manipulate the received data by filtering, enriching, or sampling it and
send it to the exporters, and the exporters actually send the processed data
to the backend systems for analysis.
And there are two more optional components of the collector.
Connectors facilitate the connection or communication between
pipelines; they transform different signal types and enable
multi-stage processing flows.
Extensions basically enrich the collector's capabilities,
for example for performance analysis.
This is what a simple, typical collector pipeline could look like.
The receivers receive the data from various sources and
send it to the processors, which in turn transform
the data and send it to the exporters, and a pipeline can have multiple
receivers, multiple processors, and multiple exporters in a single pipeline.
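Just to illustrate the shape of that flow, here is a toy sketch in Python; it is not an actual collector configuration (real pipelines are defined in the collector's configuration file), only an illustration of the receive, process, and export stages.

```python
# Toy illustration of the receiver -> processor -> exporter flow, not real collector code.
def receive():
    # entry point: accept telemetry records in some format
    return [{"metric": "requests_total", "value": 5, "env": None},
            {"metric": "requests_total", "value": None, "env": None}]

def process(records):
    # manipulate the received data: drop bad records and enrich the rest
    return [dict(r, env="prod") for r in records if r["value"] is not None]

def export(records):
    # hand the processed data to a backend for analysis
    for record in records:
        print("exporting", record)

export(process(receive()))
```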
So we are saying that OpenTelemetry receives telemetry data from various sources.
What are those data sources?
Typical data sources of OpenTelemetry are application instrumentation:
basically apps using the SDK, which send the telemetry data, logs, traces,
and metrics, to OpenTelemetry.
So applications are one of the data sources.
The next one could be a service mesh,
so the traffic metrics and traces from service mesh tools
like Istio are one of the data sources.
The next one could be node-level metrics.
This is from Kubernetes: the data coming from the kubelet about the nodes
and running pods are also data sources.
Kubernetes events, the cluster events providing insights into system activities,
are also one of the data sources.
Other sources for OpenTelemetry are logging daemons like
Fluentd and Fluent Bit, which collect the logs.
So basically they collect the logs from the containers within the cluster and
forward them to OpenTelemetry.
And various metrics coming from cloud providers like AWS and GCP, liveness
and readiness data, which we call probes and health checks, coming
from Kubernetes clusters, and container runtime data, like
metrics about container states and resource usage of those containers, can
also be sources for OpenTelemetry.
Coming to the implementation approach for OpenTelemetry,
there are two ways: auto instrumentation and
manual instrumentation.
With automatic instrumentation, also called auto instrumentation,
we can get started immediately and get visibility of the entire system.
It needs less development effort, so it's a great starting point for
teams to get started and learn about telemetry and its implementation.
When it comes to manual instrumentation,
it gives us more flexibility to customize the metrics, it gives
us precise control, and we can focus on business-specific insights.
So it gives us what we need, why we need it, and when we need it.
Manual instrumentation takes effort,
but it gives us control over what we want to see.
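For example, a minimal manual-instrumentation sketch with the OpenTelemetry Python SDK might look like this; the collector endpoint, service name, span name, and attribute are assumptions chosen to show the business-specific control you get.

```python
# Manual instrumentation sketch: we decide exactly which spans and attributes to record.
# Assumes opentelemetry-sdk and opentelemetry-exporter-otlp, and a collector on localhost:4317.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.items", 3)   # a business-specific insight we chose to capture
```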
And this is the entire, full observability stack.
Start from the bottom with instrumentation, like adding
the SDK code to your application code, and then the collection starts.
The data collection is done by OpenTelemetry, and then the data has to be stored
somewhere, in Prometheus, Loki, or Tempo.
These are different tools based on the type of telemetry
data, which we'll see in the next slides.
And Grafana is a dashboard for visualization.
So this is a simple pyramid which explains the entire observability
stack for any application system.
These are the typical observability stack components.
The collector, like our OpenTelemetry Collector.
Tempo, the tracing component: it is actually a distributed tracing system for requests,
so it shows us the journey a particular request took through our system.
Loki is a log aggregation platform for centralized
log management.
Prometheus is basically storage:
it's a time series database for metrics collection and storage.
And at the end, Grafana
gives us a web UI for visualization, dashboards, and alerting.
And the next slide is going to give you an overall picture.
On the right side, it shows we have VM
one and VM two, and the data is actually coming from two VMs.
Fluent Bit, if you see here, is running on both VMs.
So basically it's a log collector.
It sends the logs to the OpenTelemetry collector.
The OpenTelemetry collector collects the logs
coming from Fluent Bit and the other metrics coming from VM two,
for example node metrics.
So the node exporter sends different metrics, and those
metrics are sent to Mimir.
And in the backend, the entire data is shown
in the Grafana user interface.
So here, the applications only interact with the OpenTelemetry collector.
They don't know that in the backend there is Loki,
there is Mimir, there is Grafana.
Applications simply interact with OpenTelemetry;
that's all they see.
The OpenTelemetry collector knows that the logs should be sent to Loki and the
metrics should be sent to Mimir.
In this way, it reduces a lot of the complexity for the
applications to integrate.
So the integration effort is very small here,
and it becomes much faster and easier to implement
OpenTelemetry in the applications.
That's all from my side.
Thank you for your time.