To observe or Not, is not the question

Abstract

Observability is top of mind for everyone from developers to executives, in contexts ranging from application metrics to business metrics. So whether to observe or not is not the question. The question is whether I am solving the right use cases with observability.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hello everyone, and welcome. My name is Swapnil and I'm a founding member at Kloudfuse. First of all, I would like to thank Mark and the team for giving me the opportunity to talk about observability at SRE 2025. As the name of the topic suggests, whether to observe or not is not the question. Everybody is already deep into observability. The real question is whether it is giving me the output that I'm expecting, whether I'm running my observability systems correctly and looking at the data correctly.
So when we deep dive into the word observability, we need to look not just at what it is, but at how it plays into different contexts across my organization. Whether I work in security, in compliance, or somewhere else, observability has something for me. But what is it? Before we deep dive into observability and its different contexts, let's start with the basics. Observability is the ability to understand what's happening inside your system just by looking at its outputs. The outputs in this context are called signals, like logs. It's about answering the question of why something is happening, not just what went wrong in my application, and what I do about it. Observability covers a very broad spectrum of contexts within an organization, so let's deep dive into each of them.
First and foremost is the technical context, which revolves around the core data and the tools that send the data. This is the heart of observability. This is where you collect all the telemetry data, like metrics, logs, and traces, and you define how to use that data and how to track the systems sending it. For example, when you look at the technical data for a recently deployed application, you look at the spikes. You might want to check request duration metrics. You might want to check logs, and you might want to look at specific traces for the services that are failing. So this is the core data, and it is critical for observability.
This defines the basics on which the systems are built. But beyond this technical context, there are specific use cases. Not all users want the full depth of the data that you're sending in. They want to look at dashboards and alerts. One such team is your operations team, and they have a very defined operational context for your system. They live and breathe the data that is sent by these observability tools, and they look at system reliability, incident response, and objectives. They want to define SLAs, SLOs, SLIs, and error budgets for the application, and track those for the reliability of the system. For instance, let's say you have an operational dashboard that shows spikes or different errors at different service levels. The ops teams want to understand how the system is degrading based on these errors. They not only want to detect these errors, but also want to understand their impact on the uptime of the system.
This goes hand in hand with developer experience. The major outcome of an observability system is the feedback loop that operations teams can provide to development. So the development context is where developers understand the whole path from writing and shipping code to running it in production: how different tests perform for my application, how my CI pipeline is working, and how a feature flag performs in different stages and across different services. This helps users understand the production performance of an application and its features, and tune them. It also helps to understand edge cases for the different feature flags that have been created. So the development context and DevOps go hand in hand.
But it doesn't stop here. A user-facing application with a hundred percent working backend system is of no use if the user experience doesn't work for the customers. So from a business perspective, a hundred percent working backend alone has no use.
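As a rough illustration of the SLOs and error budgets mentioned in the operational context, here is a minimal sketch of how the remaining error budget could be computed from request counts. The `SloWindow` type, the function name, and the 99.9% target are assumptions made for this example. They do not come from the talk or from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO evaluation window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def availability_sli(window: SloWindow) -> float:
    """SLI: fraction of requests in the window that succeeded."""
    if window.total_requests == 0:
        return 1.0
    return 1.0 - window.failed_requests / window.total_requests


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1, e.g. 0.999")
    if window.total_requests == 0:
        return 1.0
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * window.total_requests
    return 1.0 - window.failed_requests / allowed_failures


if __name__ == "__main__":
    # Assumed numbers: one million requests, 250 failures, 99.9% target.
    window = SloWindow(total_requests=1_000_000, failed_requests=250)
    print(f"SLI: {availability_sli(window):.5f}")
    print(f"Budget remaining: {error_budget_remaining(window, 0.999):.2%}")
```

A dashboard built on a number like this tells the ops team not only that errors occurred, but how much room is left before the objective is breached.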
So they define different observability requirements for user-facing applications, and that goes into the UX context: how the user is experiencing the product, how you monitor it, how your frontend behaves, and how your resource flows are working. They use tools like real user monitoring and synthetic checks to help answer different questions about user experience. Are my customers abandoning carts because of frontend timeouts? Does my site load in five seconds instead of two? How is my bounce rate, and what are the red flags that affect different user loads? Frontend monitoring helps improve the user's interactions with the system.
Business users are concerned about observability and its outcomes. If the technical details that you're getting are not helping the business context, then this is a huge drawback of your system. The system needs to bridge the gap between technical health and business impact. If you see revenue dropping, you need to be able to correlate that with 5xx errors in the checkout service, or in some backend service that is not able to work with the database. So if you can determine what is causing the revenue loss because of application issues, then the business gets very good insight.
Security and compliance piggyback on this view of your application health. When you talk about security, the security users of the system need to ingest logs and traces to detect anomalies and investigate any breaches in the application in use. For example, a spike in login attempts from a single IP can be spotted in the observability data. It can also help you look at the different traces for applications, or for different services within an application, and find out about a user compromise. A security incident is not just application downtime. It costs the business and carries different penalties in different regions.
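The login-spike example can be sketched in a few lines. This is only an illustration under assumptions: each login event is taken to be a dict with an `ip` field, the events are assumed to be already limited to one time window, and the threshold is arbitrary. A real security tool would use its own log schema and detection logic.

```python
from collections import Counter
from typing import Iterable


def suspicious_ips(login_events: Iterable[dict], threshold: int) -> dict[str, int]:
    """Return the IPs with more than `threshold` login attempts in the events."""
    attempts = Counter(event["ip"] for event in login_events)
    return {ip: count for ip, count in attempts.items() if count > threshold}


if __name__ == "__main__":
    # Hypothetical events for one window, using documentation-range addresses.
    events = [{"ip": "203.0.113.7"}] * 6 + [{"ip": "198.51.100.2"}] * 2
    for ip, count in suspicious_ips(events, threshold=5).items():
        print(f"{ip}: {count} login attempts in window")
```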
So to avoid this, different teams use different compliance methods. Think of GDPR, HIPAA, and SOC 2, the compliance standards you eventually hear about. The observability system also helps to maintain audit data and access logs, and to track configuration changes at all times, which are essential for these compliance and audit requirements. It also helps to define policies like longer retention and authenticated access to sensitive data.
Enhancing observability in different contexts helps users look at the data in different contexts. But what is the use of this context when you want to utilize the data? Correlation is the key. Application teams might be generating and using infrastructure metrics to describe how the software is performing. That's helpful for monitoring the performance and resource usage of the applications. But for the business, the data needs to be domain specific: what is my user experience, and does it help me achieve the business goals? So correlating these metrics helps business leaders and other users make use of the data. If you track latency against the number of active users, and track it through to signups or checkouts, then this is more helpful to the users than just looking at separate dashboards.
The key difference between application metrics and business metrics is that business metrics need to go beyond system performance to define business performance. They need to go beyond latency, CPU, and error rates to define conversion, churn, and revenue. They need to go from what is affecting system uptime to what is affecting revenue and product decisions. So correlating application and business metrics is the key. Apart from isolating the technical issues, you need to be able to relate them to user satisfaction. If I'm getting a drop in checkouts, I need to be able to find out which specific error is causing it, and that needs to be possible with tools that the business can use.
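To make the idea of correlating application and business metrics concrete, here is a minimal sketch that computes a technical metric (5xx error rate) and a business metric (checkout conversion) side by side for each region. The event fields `region`, `status`, and `checkout_completed` are assumptions for this example, not a schema from the talk.

```python
from collections import defaultdict
from typing import Iterable


def checkout_health_by_region(events: Iterable[dict]) -> dict[str, dict[str, float]]:
    """Aggregate 5xx error rate and checkout conversion rate per region."""
    totals = defaultdict(lambda: {"requests": 0, "errors_5xx": 0, "completed": 0})
    for event in events:
        bucket = totals[event["region"]]
        bucket["requests"] += 1
        if 500 <= event["status"] <= 599:
            bucket["errors_5xx"] += 1
        if event["checkout_completed"]:
            bucket["completed"] += 1
    return {
        region: {
            "error_rate": bucket["errors_5xx"] / bucket["requests"],
            "conversion_rate": bucket["completed"] / bucket["requests"],
        }
        for region, bucket in totals.items()
    }


if __name__ == "__main__":
    # Hypothetical checkout requests from two regions.
    events = (
        [{"region": "eu", "status": 503, "checkout_completed": False}] * 2
        + [{"region": "eu", "status": 200, "checkout_completed": True}] * 2
        + [{"region": "us", "status": 200, "checkout_completed": True}] * 3
        + [{"region": "us", "status": 200, "checkout_completed": False}]
    )
    for region, stats in checkout_health_by_region(events).items():
        print(region, stats)
```

Putting both numbers in one view lets a business user see a drop in checkouts next to the error rate that may explain it, instead of comparing two separate dashboards.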
So if you want to look at different business outcomes, we need a visual correlation between the metrics that we're ingesting, one that is consistent and gives fine-grained access to the issues you're getting in the application. For example, I'm able to track which users are getting bad checkout experiences, which region is having the most latency, which region is having different plans for mobile and, let's say, browser users, which users are getting a feature, which product IDs are getting lower checkouts, and whatnot. So this is a very granular filter that the business can look at.
And essentially this goes into providing alerts for specific things. Instead of monitoring only the infrastructure and the application, we combine business and application metrics and give composite alerts. For example, if an error rate spike is causing conversions to stop, or is causing issues in purchase checkouts or user signups, that is the alert that the business wants to look at, rather than just "I got a 500" or "I got a 50% error rate in the application for the last five minutes." The SLO for that should be a mix of system health and user experience.
So going beyond that, when we are using an observability system, it needs to provide unified system insight. That's where correlation helps. That's where you can correlate business metrics to application metrics. That's where, when you look at certain logs for a 500, you can look at the time series at the same time and see the events that are causing it. If a system does not have a unified view of the different data it is ingesting, and does not give you the capability to use that view, you cannot make the best use of the data.
Application metrics are primarily technical. They cover things like latency, error rate, usage, database query time, and whatnot. But business metrics are more product driven. Am I able to retain my daily active users? What is the adoption of my new features?
What is my revenue per transaction? What is my TCO for the total infrastructure that I'm investing in? These are different things that different roles in the system are looking at, and they're critical for everyone. So business metrics tell you where the business is, based on the details that you provide with the application metrics. Correlating the two is essential, and it turns observability from a mere backend system into a full-stack superpower.
Let me give you a very real example: standard payment processing. The payment processing works across multiple regions, but I'm getting 5xx errors, which is affecting my overall SLO. When I look at this, I see that the problem can be in different places. But if I have custom labels that give metrics about successful signups or purchases, with detailed granularity by user type, region, plan, or feature flag, I can aggregate based on those and get the details. So if a user can see that users are affected for this region and this device type, they can easily change that. So correlating that is key. But at the same time, the observability system needs to have the capability to correlate. An observability system with correlation can give you more output than one without it.
So at the end of it, the system has a defined cycle. It gathers application and business metrics based on the tools that you're using. They can be at different granularities, but the key is how the system correlates the metrics when you ingest them, creates relationships, and visualizes the data in a way that any user of the system, based on their role, can define different data sets, define different dashboards and alerts, and generate insights from the ingested data.
So to sum it up, observability has many contexts: technical, operational, user experience, business, security, and compliance. They may or may not all be necessary for a given enterprise. Correlating application and business metrics brings observability to life. It turns raw data into real insight. Thank you for listening in. I'm happy to answer any questions offline or deep dive into specific things. Thank you very much.

Conf42 Site Reliability Engineering (SRE) 2025 - Online

April 17 2025 - premiere 5PM GMT

Swapnil Kulkarni

Founding Member - Customer Success and Solutions Architect @ Kloudfuse
