Conf42 Site Reliability Engineering (SRE) 2025 - Online

- premiere 5PM GMT

Unified Observability: A Single Pane of Glass and Single Source of Truth

Abstract

Managing multiple monitoring tools in an organization can lead to fragmented insights and delayed decision-making. Unified observability provides a solution by integrating diverse data sources into one, enabling teams to quickly identify issues, perform RCA, and take proactive actions.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. Good morning, good afternoon, good evening, wherever you are joining from. In this session we are going to talk about how we can get a single pane of view over all the telemetry data collected across an enterprise digital landscape: a unified view of the entire IT estate, so that we can detect problems easily and quickly without switching between tools to see what is affecting the performance and stability of our IT systems, and spend more time innovating rather than finding where the problem is. We are going to look at open source tooling like OpenTelemetry, and at the Grafana stack, and at how we can achieve this unified observability.

A quick introduction about me: I am a co-founder of GuhaTek, where we provide SRE as a service. I have over 20 years of IT experience, across roles in architecture consulting and performance engineering, and I have set up the SRE landscape and SRE teams for big enterprises. We also hold a patent in risk management (risk assessment with respect to reliability and change management), with one more patent filed. I am an active speaker at Chaos Con and contribute to the open source community as well.

Let's get straight into the evolution of observability. In the earlier days we talked about monitoring, but now we talk about observability. The first version, observability 1.0, is the foundation of modern application development, and it really helps us understand system behavior: when we write code, there is infrastructure underneath and middleware components in between, and observability gives us visibility into that infrastructure as well as the application itself.

So what are its characteristics and components? Metrics are one of the key aspects of observability 1.0. A metric is mostly time-series data: for example latency, throughput, or errors, or, on the backend and infrastructure side, CPU utilization, memory, disk, and so on.

Traces help us troubleshoot latency and error problems with a call-tree kind of view: at run time, when a transaction, API call, or task is executed, which methods and services are actually being called, how much time each took, and how many times each was called. Traces capture exactly that information.

Logs are the traditional signal: system-level or application-level logs carrying whatever information we want to surface to whoever is analyzing the system, and whatever the system is facing at that point. A log entry can be informational, for troubleshooting purposes, error, or fatal.

Profiling is also being added as one of the observability signals. In the earlier days it was only applicable to lower, non-production environments, because the sampling overhead is very high and would affect the performance of a production-scale system. But after Google's Dapper paper came out, teams found ways to do it in a lightweight fashion at production scale. Profiling is like an advanced level of tracing: it provides minute-level information about CPU usage and how memory blocks are being allocated, down to which line of code, which really helps solve a lot of memory leak problems.
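To make the metrics signal concrete, here is a minimal sketch using the Python prometheus_client library. The endpoint label, port, and simulated workload are made up for illustration; they are not from the talk.

```python
# Minimal sketch: exposing latency and error counts as Prometheus
# time-series data (endpoint name and port are illustrative).
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency", ["endpoint"]
)
REQUEST_ERRORS = Counter(
    "http_request_errors_total", "Failed requests", ["endpoint"]
)

start_http_server(8000)  # serves /metrics for a Prometheus scraper

while True:
    with REQUEST_LATENCY.labels(endpoint="/checkout").time():
        time.sleep(random.uniform(0.01, 0.2))  # simulated request work
    if random.random() < 0.05:  # simulate a 5% error rate
        REQUEST_ERRORS.labels(endpoint="/checkout").inc()
```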
In recent observability stacks we also see continuous profiling with eBPF coming in. It captures kernel-level statistics about what is being executed and really helps troubleshoot code-level problems.

Real user monitoring: instead of synthetic checks, this captures how actual users are experiencing the system while using it, which really helps us understand user behavior and experience.

Application performance management (APM) has been around for a while. It provides the capabilities of everything else we just saw, but it also plays a vital role in runtime visibility: for a Java application, how the heap is being utilized, how many threads there are, how connection pooling behaves; for a Go application, how many goroutines are spawned and how memory is being used. So it gives more visibility beyond the basic signals.

Now think about it: we have all the relevant signals and capabilities in observability 1.0, but are we on the right path? Let's look at its characteristics. The challenge is that these are all siloed tools; we need a separate tool to measure each individual characteristic. For metrics, a separate tool, for example a Prometheus collector or a vendor-specific agent that gathers metrics. For traces, OpenTracing or OpenTelemetry, or Jaeger or Zipkin, just to collect the traces and ship them to a backend. For logs, whatever framework gathers the logs and events. For profiling, again separate tools and separate instrumentation. For real user monitoring, again separate instrumentation or an agent-based approach for a JS application, and the APM agents on top.

These are mostly silos. Some of them are clubbed together to get a unified view, but mostly it is a reactive approach: we usually capture this data after the fact, once an incident has already started. And there is limited or no correlation. That is the gap of observability 1.0, and it leads to a lot of manual analysis because of the siloed tooling: we have to look across multiple tools to correlate signals and find out where the problem is.

Now let's go deeper. We saw the data silos; they really hinder a holistic view. We cannot get an end-to-end picture (even in the traces, if we are not collecting data from across the whole system), so manual correlation still has to be done. Insights are limited, too: if even one component in a layer is left uninstrumented, we lose complete end-to-end visibility, and when a problem happens it is hard to troubleshoot. It's a black box.

That is when observability 2.0 was born. It elevates the insights in the digital ecosystem: when there is a system with thousands of microservices, it really helps us understand where exactly the problem is and how to troubleshoot and fix it. So what are the characteristics of observability 2.0? First, it brings more context to the traces.
A trace now records where a request originated, so we have a parent-child relationship, and it adds more attributes or dimensions to those executions. That really helps when you want the trace for this particular user placing this particular order; the context lets you filter down to that level.

Events. Logs are mostly unstructured unless we follow some template: access logs have the Apache format, but at the backend, most container logs will be unformatted unless we enforce it. That is where events help, giving a key-value, structured form to the data. When a critical event or a change happens in the system, that structure really helps correlate issues.

Next, a level beyond real user monitoring: user experience monitoring. This helps us understand, once the UI receives the data, how much time it takes to render the frame and how much time each activity the UI performs takes. What is the real user experience when they take an action? App start time, crashes, and unhandled exceptions.

Data streams are another important one in the recent AI/ML world, where we deal with terabytes and petabytes of data, with Kafka streams and a lot of event processing flowing through, especially in IoT-style systems. How can we get deep insight and visibility into those? This signal gives a separate perspective on that.

Business metrics are another important one. Beyond user experience, we also need to measure business insights so we can make data-driven decisions: how many orders are being placed; a user comes in and views a product, but what is the conversion rate, whether they actually check out and place an order, and what percentage of users convert in how much time. Those are critical metrics: if there are drop-offs, what is the reason, technical or otherwise? This really helps gain those insights.

And alerts: traditionally we mostly set static thresholds, which is not scalable in a dynamic environment, so we tend to miss gaps, especially around business metrics. That is where composite alarms really help avoid alert fatigue and reduce the noise. When an incident happens and multiple layers are failing, we do not get multiple alerts; they combine into a single alarm, which really helps focus on the problem.

So with all this, what are we gaining? The characteristics of observability 2.0 give us a single source of truth. With context and formatted events, we can quickly troubleshoot and filter down to only what we need. With OpenTelemetry, open standards, and the agents, we really have low-code or no-code instrumentation to get the data (we will see what those instrumentation types are and how to do them). Open standards and interoperability help us avoid any vendor lock-in situation. And it is cloud agnostic: it supports on-premise, hybrid, and cloud native environments alike.
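As a concrete illustration of that trace context, here is a minimal sketch using the OpenTelemetry Python SDK. The service name, span names, and attribute keys are hypothetical, chosen only to mirror the customer-and-order example above.

```python
# Minimal sketch: attaching business context to spans so traces can be
# filtered by customer or order (names and values are hypothetical).
from opentelemetry import trace

tracer = trace.get_tracer("order-service")

def place_order(customer_id: str, order_id: str, total_usd: float) -> None:
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("customer.id", customer_id)
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.total_usd", total_usd)
        # Child span inherits the trace context, giving the
        # parent-child relationship described above.
        with tracer.start_as_current_span("charge_payment"):
            pass  # payment call would go here
```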
Now, the observability funnel I touched on a little earlier. When any digital application runs on some underlying infrastructure platform, observability at that layer helps capture capacity, utilization, and saturation. That is the very foundation, where all the code executes as binaries. On top of it we have the downstream systems where the data resides, where we need to capture durability, consistency, integrity, and also the recovery aspects if we want a highly available, scalable solution. On top of that sits a connector or middleware layer, which needs to propagate context and also provide the resilience view: how healthy are the downstream systems, and do we have the right circuit breaker, fallback, and retry patterns? This layer gives us that perspective on the health of the backend and downstream systems. On top of the connectors is the actual origin of the business logic, the API calls, where we mostly measure the golden signals: latency, throughput, error rate, and saturation. Above that are the actual API consumers, on the website or the app side, where we get the customer journey map: which activities customers are performing, so we see the real actions happening in the app in terms of business usage and what users are seeing. And at the top of the funnel, beyond all the technical layers, we have the business metrics, giving clear visibility into the business functions and their outcomes.

The idea behind this funnel is that when there is a drop in the business, or users are complaining that something is not working or not working as expected, we need to correlate that down to one of these technical layers. That is the intention of the funnel: to build mostly composite metrics and alarms, and to gain visibility into business interruptions caused by technical failures or technical challenges. This definitely helps us run a reliable, resilient, scalable, and secure platform.

Now let's get into how we can achieve this single source of truth, through what I would call a reference observability architecture. We have the target IT system, instrumented using no-code or low-code agents or the SDKs, which send the telemetry signals to a data collector at a periodic interval. The data collector receives them, converts them, and sends them on to the data pipeline, which is where the actual data massaging happens. The pipeline not only combines data but adds relevant additional context to it, such as which cluster it is running on, and applies masking if any sensitive PII is present that we want to mask. It also converts the data into whatever format the data lake accepts: if the lake does not accept key-value data because it is a column store, or it is a time-series store, any splitting or conversion also happens in the pipeline.

Once the data is converted and ready to store, it is pumped into the enterprise telemetry data lake. This is where all the telemetry signals land and are stored as a single source of truth, so we no longer have different perspectives, different time zones, and all the problems we used to face. We have all the signals there: metrics, traces, logs, events, even profiling data, plus all the front-end business insight and user experience data. This really lets us visualize everything, so we can get a single pane of view.
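Going back to the pipeline stage for a moment: in practice this is usually done by processors inside the collector, but here is a minimal Python sketch of the kind of masking and enrichment described above. The field names, regex scope, and cluster label are all made up for illustration.

```python
# Minimal sketch of a pipeline stage that masks PII and enriches
# records with cluster context (field names are illustrative; real
# deployments would do this in collector processors).
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def mask_and_enrich(record: dict) -> dict:
    masked = {
        key: EMAIL_RE.sub("<masked>", value) if isinstance(value, str) else value
        for key, value in record.items()
    }
    masked["k8s.cluster.name"] = "prod-eu-1"  # hypothetical enrichment
    return masked

print(mask_and_enrich({"msg": "payment failed for jane@example.com"}))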
Not only that: with everything in the data lake, it also helps us create composite alarms. If a standard deviation shift or an anomaly is detected, it helps us correlate where the issue is happening. It also helps us run synthetics, so we try to catch an issue before any user sees it, if there is a chance of catching it earlier. It helps us define SLO thresholds and track those SLOs. And because the data is in a single place, it helps us make data-driven recommendations, and run ML and foundation models over it to understand the data and help solve problems.

Now let's talk about OpenTelemetry. It is an open standard and, as everyone knows, the second most active project in the CNCF. How does it help us get a unified view? It is cloud native and really helps avoid any vendor lock-in situation: once we are instrumented, it supports storing the data wherever we want. It supports around 20 different languages, and for all four signals the project documents the latest state of each language's support.

Instrumentation is really the key, and this is where low-code and no-code instrumentation comes in. OpenTelemetry supports most commonly used languages, like Golang, Java, and Python, with auto-instrumentation. Auto-instrumentation gives us coverage for the HTTP modules and any gRPC-based calls made between services, so it automatically captures metrics, traces, and logs. But if we want business insights, business metrics, or business events, that is where we have to touch the code a little and use the SDK to add custom metrics, or traces and spans, which means a few additional lines of code. Otherwise, the low-code path is simply a couple of lines of configuration to attach the agent binary, and it captures all of this automatically. Typically that is agent-based instrumentation, the way traditional APM tools work; going into the code gives more granular detail and additional information.

Once we instrument, the next step is collecting and correlating the data. The OpenTelemetry agents collect the telemetry and send it to the OpenTelemetry Collector, which processes the data and stores it in the data lake. The correlation then happens by leveraging all the signals together: when a high-latency span is observed, we can correlate it with the actual service resource utilization at that time. What were the service latencies during that window? Is it only this particular request, or was the service already performing poorly? What was the error rate? This gives us a contextualized view in a single pane and a truly unified view of system behavior. And with that context and the custom attributes, we can even filter the traces and spans for one particular business attribute: a customer email, a customer ID, an order ID, whatever it is.

Once all this data is coming in and correlated, it really helps to visualize the traces, metrics, and logs.
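As a sketch of that split between no-code and SDK instrumentation: the opentelemetry-instrument launcher from the Python distro handles the automatic part, and a few SDK lines add the business signals on top. The service, metric, and attribute names below are hypothetical.

```python
# Auto-instrumentation requires no code changes; it is typically launched as:
#   opentelemetry-instrument --service_name order-service python app.py
# Business-level signals still need a few SDK calls (names are hypothetical).
from opentelemetry import metrics, trace

tracer = trace.get_tracer("order-service")
meter = metrics.get_meter("order-service")

checkout_conversions = meter.create_counter(
    "checkout.conversions",
    description="Completed checkouts, for conversion-rate dashboards",
)

def complete_checkout(customer_id: str, cart_total_usd: float) -> None:
    # This custom span rides on the trace that auto-instrumentation started.
    with tracer.start_as_current_span("complete_checkout") as span:
        span.set_attribute("customer.id", customer_id)
        span.set_attribute("cart.total_usd", cart_total_usd)
        checkout_conversions.add(1, {"channel": "web"})
```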
Grafana is a wonderful visualization tool, with a lot of prebuilt dashboards. It gives us a single pane of view over all traces, metrics, logs, and profiles as well. It also offers interactive exploration: if you see a specific anomaly and want to drill down further, it really helps, with hyperlinks to other views and dashboards, and it lets you build custom dashboards too. There are plenty of dashboards available in JSON format, so we don't have to build new ones, especially for common technologies like Kafka, Kubernetes, or specific cloud services. A lot is out there; for Linux in particular I have seen a very big dashboard that covers the entire Linux stack, which is really useful. It keeps growing, and it definitely provides a contextual, unified, single pane of view.

Grafana itself is also growing; they have added a lot of other pieces to the stack. The complete Grafana observability stack collects data through agents, such as a Prometheus exporter for metrics, OpenTelemetry, or Fluent Bit, into their own OpenTelemetry Collector distribution, and stores it in the LGTM stack: Loki for logs, Grafana for visualization, Tempo for traces, and Mimir for metrics. It also provides continuous profiling through Pyroscope. Once all this data is stored in these backends, we definitely have complete visibility. We get application observability for the front ends plus service map views computed from the traces; an eBPF perspective on how the profiling happens and where the hotspots are; and infrastructure observability covering everything from Kubernetes to server VMs, databases, and cloud native integrations. It also supports incident management, with alerting capability, on-call support, incident tracking, and even service level objectives. And to shift left, there is k6, which helps us simulate load against the backend as well as run browser testing. So it is definitely a unified, single place to go for an observability stack. It also integrates with a lot of other big data sources: once the collection layer is in place, if you don't want to manage the data yourself, you can ship it to any of the commercial products out there.

Now, empowering alerting and incident response. Alerting is key to a timely incident response. Composite alerts let us combine rules across multiple data sources into a single alerting system, which helps keep the alert noise down. They also help automate root cause analysis: integrations through webhooks let us trigger a workflow that looks at the knowledge graphs and the service maps to find out, within the ecosystem or the chain of events, what exactly went wrong and what the anomalies and correlations are, and to provide an RCA as soon as possible.

So with observability 2.0, what is the future of observability? Definitely to accelerate innovation: time to detect, time to isolate, and time to recover keep getting shorter with these tools. And unified observability, as the single source of truth, is going to empower data-driven decisions.
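As a sketch of that webhook-driven RCA trigger: below is a minimal receiver for a Grafana-style alert webhook, assuming Flask. The /alerts route and the trigger_rca_workflow function are hypothetical; the payload shape follows the alerts/labels/annotations structure Grafana alerting sends.

```python
# Minimal sketch: receiving a Grafana-style alert webhook and kicking
# off an automated RCA workflow (route and workflow are hypothetical).
from flask import Flask, request

app = Flask(__name__)

def trigger_rca_workflow(labels: dict, annotations: dict) -> None:
    # Placeholder: walk the service map / knowledge graph from the
    # alerting service and correlate recent anomalies.
    print(f"RCA started for {labels.get('alertname')}: {annotations}")

@app.post("/alerts")
def handle_alert():
    payload = request.get_json(force=True)
    for alert in payload.get("alerts", []):
        trigger_rca_workflow(alert.get("labels", {}), alert.get("annotations", {}))
    return "", 204
```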
Intelligent automation: with the evolution of agentic frameworks and AI and machine learning, it really helps to detect an anomaly and trigger a workflow for a quick automated RCA, and even to drive runbook automation or build a self-healing system that recovers itself. And let's not forget the security aspects: analyzing the audit trails and events through behavior and pattern analysis really helps us find any security threats and vulnerabilities out there.

I hope you have learned something today. Thanks so much for joining, and if you have any follow-up questions, feel free to connect with me on LinkedIn. Thank you so much, and have a good rest of your day.
...

Sivasubramanian Bagavathiappan

SRE Leader @ GuhaTek

Sivasubramanian Bagavathiappan's LinkedIn account


