Conf42 DevOps 2023 - Online

Observability vs. Performance Monitoring: What's the difference and why should I care?


Abstract

What do you do with all of the data coming out of your systems? Are you spending time diagnosing issues in your distributed systems without a good understanding of how your microservices are built? As system complexity increases, the ability to understand the sum of the outputs becomes more and more difficult and the volume of data about your services becomes overwhelming. Luckily, having an observability mindset when instrumenting your outputs and leveraging the right tool for your team can help you cut through the noise, identify key offenders in your system, and resolve them quickly and efficiently. In this talk, we will discuss the differences between traditional performance monitoring and observability, and how they can independently and together ensure the health of your team and keep your end users (and your developers!) happy and focused on the right things.

Summary

  • Today we're talking about observability versus performance monitoring: the difference between these two ideas and why you should care about them. We'll look a little bit into the history of monitoring and how you can benefit from observability.
  • Observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. Just because you have purchased an observability platform or tool doesn't mean that your system is observable. You really want to be sure that you're picking the tools that are right for your environment.
  • Distributed traces are the shiny new thing that we've gotten with application observability. Logs are another really important part of observability, and they're more powerful when you can correlate them to other signals. The value lies in the ability to answer questions, not just in the outputs themselves.
  • OpenTelemetry is an open source project. It's the second most active CNCF project after Kubernetes. All of the big players have bought in and started to provide support for it. It gives you a really vendor-agnostic approach to generating and sending your telemetry data.
  • The bigger the restaurant, the more complex it is and the more things that can go wrong along the way. In a restaurant scenario, you may need a lot more monitoring to be able to really understand what's happening. And that can be applied to your system.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Welcome. Today I'm going to be talking about observability versus performance monitoring, the difference between these two ideas and why you should care about them. So before we get started, just a quick overview of what we'll be covering today. First off, why do we care? We're going to look a little bit into the history of monitoring and think about why monitoring has had to evolve over time. We'll look at a high level overview of observability, where that term came from and what it means, and how you can benefit from observability. We'll talk a little bit about the three pillars, do a recap, talk a little bit about OpenTelemetry as a path to observability, and then we'll wrap it up and you can be on your way. So first off, why do we care? We really care about this because we want to work on the good stuff. We don't want to spend our time debugging, troubleshooting, doing support. Our companies also don't want us to spend our time doing this. It can be very expensive for an organization to have an entire engineering team doing troubleshooting and support and not working on forward-looking features. Additionally, just for our quality of life, we don't want to spend our free time doing this. You rarely take a job with the hope that it will wake you up in the middle of the night to try to keep the lights on for an organization. So the better and more stable your environments can be, the better off everybody in the organization really is. So starting at the beginning, in the good old days, back when I was actually doing development, we were typically working with a single code base. It was something that could be run locally. So if I needed to step through something, I could bring the code down to my computer, run it in a debugger with some breakpoints, and really be able to understand what was happening from start to finish. So this is monolithic architecture, and it's great at what it's good for. I'm not here to talk about the differences between microservices and monolithic architecture or why one is better than the other; I would say there are appropriate uses for each. But monolithic architecture is kind of where monitoring evolved. So performance monitoring really came out of the data center. Your applications were running on hardware that you had some insight into. You knew what you needed to keep an eye on to make sure that everything was up and healthy. We'd be able to take a look and see: what are the trends, what's happening here? Are we headed in a bad direction? Are things pretty stable? And then this is where alerting really came into play: when do I need to stop what I'm doing and pay attention to something that's happening within my infrastructure or application? So this pager is a picture of the exact same kind of pager that I carried back in the day. A lot of people who have been on call are familiar with this sort of thing, and the idea is that you want this to beep at you as little as possible. So it sounds like monitoring pretty much has it covered. Alerting can tell us when we need to drop everything and fix something. We know the health of our servers and our applications, so why would we need to know anything else? Distributed systems are why we need to know more than just the high-level aggregate. This architecture brings with it a lot of benefits like improved scalability; it can be more efficient to work in, and it's easier to do rapid deployments. 
It's also easier, as people join, to understand a smaller piece of the overall system, ramp up quickly, and be able to contribute quickly. But what it does introduce is more complexity, which makes it more difficult to monitor. While you can monitor a lot of different parts of the system effectively, it's hard to get a really good understanding of what's happening from start to finish. So when there are issues, it's a lot more challenging to know where those issues are. Another aspect of some of the new technology that's emerged is ephemeral resources. It's a lot harder to monitor something if you don't know when it's going to be there and when it's going to disappear; you can't set a monitor for something that you can't see. So you really need your monitoring to pick these things up automatically, and when a resource disappears, you no longer want to be alerted about it, because it's supposed to work like that. It's not a negative that something has shut down; it's doing that to save your resources, but you don't want to get pinged about it every single time it happens. So this is another challenge of monitoring today. So it makes sense: things are more complex, we need to step up our game. How are we going to do that? That's where observability comes into play. So the answer to the ultimate question of life, the universe and everything is observability. It will give you all the answers you need to all of the questions you could possibly ask, in theory. Observability is really the ability to understand the state of internal systems by observing their outputs. The idea is that you can collect information that will tell you what's happening within your system. And that sounds just like monitoring, and that's because it is. Monitoring is an aspect of observability, and if it's everything that you need to be able to answer the questions about your system, then you have an observable system. So the terms get a little bit muddled, and we'll dig into some of that. The thing that I want everybody to remember is that the outcomes are more important than the labels. Whether you call it observability or monitoring really doesn't matter; you just need to know that you can support your applications the best way possible. It's also something that's never done. Just like software evolves and advances, you need to update your monitoring or observability. It's not a checkbox where you can just say, all right, now we're observable, and set it and forget it. It doesn't have to be all or nothing. I think a lot of people shy away from exploring new methods of introducing observability into their environments because it sounds overwhelming, but I would say you can just start small, get familiar with what you need and what you might like to use, and take it from there. Additionally, like I said, it's not set it and forget it; it's a spectrum. Your systems can be very observable or very opaque, and what you want to do is get to the level that you need to be successful. So again, it really doesn't matter what you call it. What matters is that you can answer any question that you might need to ask of your system. But I'm going to keep talking about observability because that's the name of this talk. So, the origin of the term observability: it's a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. This came from the general theory of control systems in the 1960s. 
So even though we've just started to hear about this in the past couple of years, it's not a new concept and it's not a new buzzy term. This is an idea that's been around for a long time and can be applied to basically any kind of system, but obviously now we're talking about distributed systems and software architecture. So it sounds like all I need to be able to answer any question I could ask of my system is all of the data. If I have all of the information, I'll be able to answer all the questions, right? One approach to that would just be logging everything. If we have every little aspect of everything that's happened, then we'll be in good shape, right? Not really, because you're building this dumpster of data that is hard to navigate. You don't want to spend all of your time digging through logs; that can be just as ineffective as throwing hypotheses at a wall and trying to fix something in production. So you really want a better, more thoughtful approach to how you achieve observability in your system. The big question here is going to be: do you know the unknowns? Monitoring lets you answer a specific known question, like what's my average response time? Observability takes it a step beyond that and lets you say something like, one of my customers is having a problem, but only one of them, so what's unique to them that's causing issues? That's where observability really comes into play. It lets you work with higher-cardinality data to really be able to slice and dice the information that you're gathering from your systems and get to root cause analysis very effectively. So you might ask yourself, can I just buy something to do this? You can. There are a lot of tools out there that are labeled as observability tools, but it's going to take some sweat equity to make them valuable. Just because you have purchased an observability platform or tool doesn't mean that your system is observable. So you really want to be sure that you're picking the tools that are right for your environment. The thing that is right for a six-person startup is not necessarily going to be the right tool for an enterprise company with thousands of people. You really just want to be thoughtful about your approach to this and not just jump on the buzziest new thing that lands in your inbox from a vendor. So how does any of this help me? An observable system will really help you fix problems that you didn't anticipate and follow requests across your system in a way that you weren't able to do before. What I always like to highlight here is that at a company I worked at, we had an engineer, we'll call him Bob, and he knew everything about one legacy part of our platform. He was the only one that knew it. He was the only one that supported it, unfortunately for him. And I think that's a common scenario. The risk that you run when you silo knowledge that way, and you don't make an effort to make that part of your architecture more observable, is that if Bob quits or something tragic happens, you no longer have any insight into it, and it leaves you in a really bad place. Bob gets hit by a bus and suddenly you're trying to reverse engineer something that nobody has any familiarity with. So it'd be a lot better if you were getting some useful outputs out of it. 
So it's kind of the idea that you need to slow down to speed up: it's better to put a little bit of effort into this before you need it, so that you're not caught on your heels when you do. And just to paint that picture again: you've been paged, so what's going to happen now? A lot of us would try switching it off and then switching it on again. When it's the middle of the night and you're getting paged and you don't have all the information you need to really know what's going on, but somewhere in a runbook it says to just restart the service if this particular state gets reported, then that's probably what you're going to do. You're not going to spend a lot of time trying to figure out the root cause, and you may be off call tomorrow and this will be somebody else's problem. But really, what you want to be able to do is make your systems better. To do that, you need to be able to answer some questions about this sort of incident. You want to know who's being impacted, and if you're just working with aggregate data, you may not be able to understand that. If I know who's being impacted, I have a better sense of urgency. If it's something like a canary in production that triggered an alert, I might be able to ignore it until tomorrow. But if it's potentially our biggest customer and they just onboarded a bunch of users, it really may be an all-hands-on-deck scenario. Having that information lets me really evaluate and assess the priority of the incident. Do I have what I need to resolve it? Do I have enough information available to me to either resolve the issue or hand it off to somebody who has the information they need, without just flying blindly into an issue? I want to know where the problem is. I don't want to just start at what I consider the beginning or the end; I'd like to have some information about where this is happening. If it's something that is caused upstream and I'm just feeling the pain of it here at the end where the customer sits, I want to know that. I don't want to have to guess. I also want to know when the issue started. We want to be aware of whether this is something that we've been trending towards over time, or whether it's very sudden. And we also want to know how we ended up in this state so that we can prevent it from happening again, obviously. So that brings us to the three pillars of observability. We've talked a little bit about some of the more conceptual ideas around observability, and now we're going to get into the nuts and bolts of what people consider traditional observability today. Metrics are a good starting point. Metrics are intended to provide statistical information in aggregate. This is what we're all familiar with. They can give you a really good indication of the current state of things, and they're a great place to set your alerting. This is more like that traditional monitoring, where it's at a high level. Metrics are a really good vehicle for storing information about your systems, but they're not great for doing diagnostics, because you've lost all of that good connective-tissue data that the metric is made up of. Once you have an incident, you can't drill in any further to understand what happened. So if you see a spike, you kind of have to do your own correlation when you start to dig into your logs, based on timestamps or other information that you have; it's not done for you. 
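To make that first pillar concrete, here is a minimal sketch of recording an aggregate counter with the OpenTelemetry Python SDK. The meter name, counter name, and attributes are illustrative assumptions, not something from the talk or its slides.

```python
# Minimal metrics sketch: an aggregate counter exported periodically to the console.
# Assumes the opentelemetry-sdk package is installed; names are illustrative.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Wire up the SDK once at startup: export aggregated metrics every 10 seconds.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=10_000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("checkout-service")
checkout_counter = meter.create_counter(
    "checkouts", unit="1", description="Number of checkout attempts"
)

# In the request path: record the event. Only the aggregate (a count per
# attribute set) survives, which is why metrics alone can't tell you which
# specific request failed.
checkout_counter.add(1, {"status": "ok"})
checkout_counter.add(1, {"status": "error"})
```

Note that metric attributes are typically kept low-cardinality (a status, not a user ID); the high-cardinality detail lives in traces and logs, which is exactly what the next two pillars address.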
Distributed traces are the shiny new thing that we've gotten with application observability, and that's really exciting. Traditionally, a trace traces something within a particular location. I always think of a traditional database: you can run a query, trace that query, and understand everything that's happening along the way. A distributed trace lets you do a similar kind of following of a request, but it can hop across different resources, which is what makes it really kind of magical when you're trying to understand, say, a customer experience. So you know that they started by trying to make a particular request, say they're trying to check out: you're selling something and they're trying to check out. It starts there and it may hit a whole bunch of different back-end services. You may be looking at customer IDs, you may be looking up SKUs, you might be checking inventory, and that sort of thing. Those may all be different systems, and with distributed tracing, you'll be able to trace that request the whole way. So this is just a look at what a distributed trace might look like. What you can see is that the trace itself is comprised of spans, which are little units of work. This is the OpenTelemetry demo data, but it shows you that we're crossing different resources and languages. It really gives you a good visualization of that start-to-finish understanding you can get of something that is requested of your application. And this is just a look at one of the spans expanded. You can see that we've got some custom resource information here, and we have detail down to the level of the actual product name. That National Park Foundation Explorascope is an actual product that we've looked up. So this is just highlighting how granular you can get with your span and trace data. And then logs are another really important part of observability. Earlier I said you don't want to just dump everything into your logs and assume that that's the best path to resolving issues. But logs really do hold a whole lot of great information that can help you troubleshoot things, and they're more powerful when you can correlate them to other signals, like a distributed trace or a span. The great thing about logs is that they can have really high cardinality, which means that you've got more independent pieces of data that you can pivot on. Something like a user ID, an organization ID, or some of the custom resources from your services can really help you understand things at a very precise level, as opposed to a more aggregate level where you're looking at things rolled up by time or by a service name or something like that. So to quickly review so far, I just want to reiterate that collecting data does not make a system observable. You do have to collect data to achieve observability, but collecting data alone will not accomplish that for you. The value really lies in the ability to answer questions. So again, when we talk about outcomes instead of outputs, this is the outcome that we want: we want to be able to answer the questions that we need to ask of our systems, and to have healthy systems that we really understand thoroughly. 
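Going back to the spans and logs just described, here is a minimal hand-instrumented sketch with the OpenTelemetry Python SDK: a span carrying high-cardinality attributes, and a log line correlated to it via the trace ID. The span name, attribute keys, and values are illustrative assumptions (the product name simply echoes the demo mentioned above).

```python
# Minimal tracing-plus-logs sketch: a span with high-cardinality attributes,
# and a log line correlated to it via the trace ID. Assumes opentelemetry-sdk
# is installed; names and attribute keys are illustrative.
import logging

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up the SDK once at startup.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout-service")

with tracer.start_as_current_span("checkout") as span:
    # High-cardinality detail goes on the span, down to the product level.
    span.set_attribute("app.user.id", "user-12345")
    span.set_attribute("app.product.name", "National Park Foundation Explorascope")

    # Correlate the log line with the active trace so you can pivot between
    # signals instead of matching timestamps by hand.
    ctx = span.get_span_context()
    log.info("inventory lookup was slow trace_id=%032x span_id=%016x",
             ctx.trace_id, ctx.span_id)
```

In practice the instrumentation libraries can inject that trace context into logs for you; the manual version is just meant to show what the correlation actually is.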
So one of the downsides of just amassing a lot of data and storing it for later in case you ever need it is that it's very expensive, it's hard to navigate, and it's just wasting space and resources for you. So I'm just hammering home the point that collecting data alone is not the answer we're looking for here. And when you start on your journey to observability, there are a lot of solutions out there. You may find yourself experiencing fatigue with the different tools that you're attempting to implement, the number of sales calls you're getting about different sorts of observability tools, and really just the concept of observability as this huge army of different tools and services you need to implement within your environment. It doesn't have to be that; it can be simpler, but it can be overwhelming when you're trying to figure out the best approach for your needs. And again, your needs and your team will really dictate the solution needed. You can start very simply and small if that's what your team needs. You don't need to go all in and buy the most expensive, shiniest thing; it's not necessarily better for what you're trying to accomplish. You really need to evaluate and choose the right solution for you, be that something a vendor provides or something that you can build and maintain in house. It really depends on your specific situation; there's not a one-size-fits-all approach for this. And that brings us to OpenTelemetry. Before we jump into this, I do want to say that OpenTelemetry is not the only way to achieve observability. It's something that we really like at TelemetryHub because it introduces a standard for observability, and with that standard, the correlation of the different signals is managed very effectively by the OpenTelemetry instrumentation so that you don't have to do it yourself. It takes a lot of the effort out of achieving a really observable system and does it for you through a really amazing project. So OpenTelemetry is an open source project. It's the second most active CNCF project after Kubernetes. All of the big players have bought in and started to provide support for it. It's integrated directly into a lot of cloud native stacks, which is great, and it's fairly simple to use. There's a lot of customization, so you can instrument something that's specific to the details of your application and really understand what you need to know. But it's a great project to start simply with. There's some great documentation on the website about how you can get up and running, and some really great tools that you can use. And again, when we talk about it not being all or nothing: you can start playing around with this, get an idea of what it can do for you, and see if it's something you want to explore without having to go all in and spend tons and tons of cycles on it. This really introduces a shared standard: it provides a shared concept of those metrics, traces and logs that we were talking about, and a shared protocol for sending and receiving those signals. It comes with SDKs in a lot of popular languages, in varying degrees of maturity, and all of that is available on the website, so you can understand where each of those stands. And the great thing about it being open source is that if your preferred language isn't as mature as you would like it to be, you can contribute to it. 
So the components of the project are really the cross-language specification, the tools to collect, transform and export the data, the SDKs, and the auto-instrumentation and contrib packages. You might be saying, I thought that open source meant you have to do it yourself. Sometimes that's the case, and you can make this a very complex implementation if you want to and that's where your journey takes you. But it doesn't have to be. There's some really good auto-instrumentation that you get with a very simple implementation of OpenTelemetry, so it doesn't have to be hard. This is just a quick screen grab from the TelemetryHub documentation, but it's basically what you would need to get started instrumenting a Python application. Pretty straightforward, and again, a great place to start; you can add complexity as you go and as you learn what you really want to get out of your system. Another really cool thing you get from OpenTelemetry is the OpenTelemetry Collector. It can receive, process and export your signal data, but it's a lot more powerful than just that. Also, just to clarify, you don't have to run the OpenTelemetry Collector to get your signal data out of your application. Once you've instrumented, you can actually send that data directly to a backend, but the OTel Collector gives you some really good control with that processing step, so that you can be very particular about what you're sending to your back end. So, you know, sounds good: I can instrument OpenTelemetry in my application and in my infrastructure, and that'll give me all the information I need to achieve observability. What do I do with it? This is the great thing about OpenTelemetry: it gives you a really vendor-agnostic approach to generating and sending your telemetry data. You can send it to us at TelemetryHub. You can send it to one of the big monitoring vendors like Datadog. You can keep it in house and build your own tools around it. You can use other open source solutions. It really leaves you in a good position to try things out and see what works for you, and also to let your observability implementation evolve over time. If you outgrow a solution, you don't have to rip out proprietary agents and install something new; you can just point your signal data somewhere else that gives you a better visualization of what you want to see. One of the other things about the OpenTelemetry Collector that helps support this is that you can send data to multiple places. So if you want to keep your log files in house as well as sending them to a log exploration tool, you can do that using the Collector. 
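As a rough sketch of that vendor-agnostic export path, here is what pointing an instrumented Python application at an OTLP endpoint can look like. This is not the exact snippet from the slide; the service name is made up, and the endpoint shown is just the conventional local Collector address, which you could equally swap for TelemetryHub, another vendor, or an in-house pipeline.

```python
# Minimal sketch of sending trace data to an OTLP endpoint. Assumes the
# opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed;
# the endpoint here is the default local Collector address and is illustrative.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Describe the service so whatever backend receives this can group its telemetry.
resource = Resource.create({"service.name": "checkout-service"})

# Send spans to anything that speaks OTLP: a local Collector, TelemetryHub,
# another vendor, or your own tooling. Swapping backends means changing this
# endpoint, not re-instrumenting the application.
exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)

provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")
with tracer.start_as_current_span("healthcheck"):
    pass  # application work happens here
```

The Collector itself is configured separately, and that configuration is where you would fan the same data out to multiple destinations, for example keeping logs in house while also forwarding them to a log exploration tool.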
All right, so one quick analogy to bring this into the physical world conceptually, and then we're all set. I stole this from our engineering lead, Lance, here at TelemetryHub, and I really like this illustration of observability. Think of home cooking: it's you by yourself in your kitchen, you're the one that's touching everything, so you know exactly what's happening. When you make your scrambled eggs for breakfast, you know when you took the eggs out of the fridge, you could theoretically know how cold the fridge was, you know whether you put in milk or butter, and you have all the information you need to understand why that meal turned out the way it did. But once you move into a restaurant, everything turns on its head. If you've ever worked in food service, you know this: there are many stations, and the bigger the restaurant, the more complex it is and the more things that can go wrong along the way, because an order can pass through many different stations. It starts with a server at a table taking an order; she may pin that somewhere for the person who's executing the order to start on it, and it can already have fallen apart right there, and you're not going to know as easily as you would if it was just you by yourself. So she takes an order, the chefs work on it, it goes down the line through all the sous chefs who are adding salt and adding sides and all of this, and it ends up back on the table. And the soup is too salty. But we don't know who did that, or where it happened, or how to prevent it from happening again, because we don't have all the information we need to really understand our entire restaurant system, the system being all the different things we're using and the people involved. So it's just a good way to think about observability and when you need more or less of it. If it's just you in the kitchen by yourself, then maybe the thermometer on your oven is all you need to be fully observable, and you don't need to invest in anything more complex or more expensive than that. But in a restaurant scenario, you may need a lot more monitoring to really understand what's happening. And that can be applied to your system. And so that is it for me today. I love to talk about this stuff, so feel free to email me at Sarah@telemetryhub.com, and thanks for listening.

Sarah Morgan

Senior Product Manager @ TelemetryHub by Scout APM



