Conf42 DevOps 2023 - Online

The Hidden Cost of Instrumentation

Abstract

The path to observability starts with instrumentation. But no lunch is free! There are many different ways of instrumenting applications and components that affect observability, such as log aggregation, distributed tracing, application performance management, and synthetic monitoring, and each method has its own pros and cons. Let's tour getting started with instrumenting cloud-native applications and getting value for money and time! To determine the best way to instrument a software system for observability, it is important to consider the specific requirements of the system and the cost/benefit of each monitoring method. Different solutions may be more appropriate for different systems depending on the size, complexity, and budget of the organization.

Summary

  • Prathamesh Sonpatki talks about the hidden cost of instrumentation at Conf 42 DevOps 2023. Modern software systems definitely need some sort of instrumentation to know that things are working fine. The pillars of observability are logs, metrics and traces.
  • The most common problem with instrumentation is the explosion of cardinality and the churn of metrics and log data. This results in constant tuning of monitoring and instrumentation data and in a lot of engineering toil. What is the most hidden cost? It is actually the distraction.
  • Prathamesh Sonpatki is a software engineer at Last9 (last9.io). We also have a Discord where we hang out with other SRE and DevOps folks. I would highly encourage you to check it out and join if you're interested.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. I am very excited to talk about the hidden cost of instrumentation at Conf 42 DevOps 2023. My name is Prathamesh Sonpatki. I work at Last9 as a developer evangelist and software engineer. We build SRE tools to provide visibility into the Rube Goldberg machine of microservices. Let's get started. My first question is, why do we even have to care about instrumentation? Can we not just ship our software to the cloud and relax and chill? That's not always the case, right? How do we even know that our application is running as expected? We may also have customer-level service level agreements that we are committed to, and we may have to give them proof that our system is working as expected. Even for giving that proof, we ourselves need some information about how the system is running. Additionally, as SREs and DevOps people, we also need a good night's sleep. We cannot always be staring at our screens and digging through data to see if something is working as expected. All of these factors contribute to the fact that modern software systems definitely need some sort of instrumentation just to know that things are working fine. Hope cannot be the strategy. As the Google SRE book says, we cannot just hope that everything is working as expected. We need to make a conscious effort to measure and then make sure that things are working as expected. So the reliability mandate basically starts with instrumentation as its first step, because we can only improve what we measure. We cannot even understand how the system is behaving if we don't measure what we want to measure in our software systems. Let's go over the landscape of instrumentation, because modern software systems are very complicated. They are not just a standalone application running on a server or a VM, with nothing else around it. What we have is actually a burger: our application is like a patty running inside a bun, and the bun we can consider to be the cloud or a virtual machine. There are variants of buns like AWS, GCP, and Azure, and the patty is where the real magic resides, which is our application. We throw in some mayo sauce, some external services, data stores such as RDS instances and databases, along with some ketchup, fries and everything, and then we get a burger. Sometimes it is not just a single burger that we have. We may have multiple burgers at the same time, because our system can have different microservices talking to each other as well as to other services. This is how the landscape of modern software systems generally looks. So we basically deal with burgers every day. We may not eat them, but we have to run them all the time. This is where we head into the rabbit hole of full-stack observability, because monitoring just the application will give insights very specific to the application. We may not know that some requests are getting dropped at our load balancer layer, or that our database read IOPS and write IOPS are constantly below the required threshold. To know that, we need to take a cut across the burger and monitor all the components together so that we get better insights. So modern software applications, I like to call them, are living organisms that grow and shrink in all possible directions.
They grow and shrink specifically because of auto scaling and scaling constraints. We have ephemeral infrastructure that comes into existence and then goes away when it is not needed. Another interesting point is that these applications also communicate with similar applications at the same time. So it's not just one application that we have to deal with; we have several of them talking and chatting with each other all the time. So how do we monitor them? The only option we have is the temple of observability, and we have to bow in that temple to make sure that everything is getting instrumented. Generally the standard pillars of observability are logs, metrics and traces. Logs help us debug a root cause very quickly. They can be structured or unstructured, depending on whether they are debug logs or request-scoped logs, but they are very easy to adopt. We can throw in standard libraries for logging, and components such as Nginx and load balancers have standard formats for logging, so adoption is extremely easy. Consistency is slightly tricky, because every microservice and every system can have its own format of logging, so you will not necessarily have the same consistency across all services. Metrics, on the other hand, give you aggregate information about how the system is behaving; you can get a better overview of the overall system using metrics. For metrics there are also standard tools and libraries one can use, so adoption is easy because of the proliferation of tools, and they provide a certain consistency because of standards like OpenTelemetry and OpenMetrics. So both adoption and consistency are good in the case of metrics. Traces are helpful when we want to monitor different workflows. For example, I may want to trace my payment transaction starting from the microservice where the user authentication happens, all the way to the background queue where the job gets processed for sending the notification that the payment was successful or unsuccessful. With traces, I'm mostly concerned about monitoring workflows, and to do that I insert a span or a trace ID in all the pieces I want to monitor. Traces are extremely sharp and useful in such scenarios, but they can also emit a lot of information; if not handled correctly, it can turn out as if your debug logs are running in production. These are the original three pillars of observability, but additionally we also have profiling, external events and exceptions. Yuri Shkuro has written an excellent post on these six pillars of observability; it's a great post. Profiling is basically the continuous profiling of our application to capture runtime information about how the application is behaving, and that can also help us in debugging certain situations when needed. Even if profiling happens continuously, we may not use it all the time; we may use it only when it is needed. So while enabling it, we also have to consider the overhead it puts on our production systems, because we may only be using it two or three times a year or something like that.
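The payment-workflow tracing described above can be sketched in a few lines of code. Below is a minimal illustration, not from the talk, using the OpenTelemetry Python SDK; the "payments" tracer name, the process_payment/send_notification functions and the order.id attribute are hypothetical, and a real setup would export spans to a tracing backend rather than the console.

    # Minimal sketch: one trace id shared by parent and child spans across a workflow.
    # Assumes the opentelemetry-api and opentelemetry-sdk packages are installed.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    # Print spans to stdout for the demo; production would send them to a collector.
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("payments")  # hypothetical service name

    def send_notification(order_id: str) -> None:
        # Child span: it inherits the trace id of the currently active span,
        # so the notification step is stitched into the same workflow.
        with tracer.start_as_current_span("send_notification", attributes={"order.id": order_id}):
            pass  # enqueue the "payment succeeded/failed" notification here

    def process_payment(order_id: str) -> None:
        # Root span for the workflow; every span started inside shares its trace id.
        with tracer.start_as_current_span("process_payment", attributes={"order.id": order_id}):
            send_notification(order_id)

    if __name__ == "__main__":
        process_payment("order-42")

Running this prints both spans with the same trace id, which is the property that lets a tracing backend reconstruct the end-to-end workflow.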
External events are extremely important because they can affect the state of the application. While logs, metrics and traces are internal information about how the application is behaving, external events such as deployments, configuration changes, and third-party changes, like your AWS instances getting restarted or reprovisioned, can also affect your running application. So tracking those and making sure they are visible as part of the overall picture is extremely important. Another important point about external events is that they are extremely critical in certain cases but do not happen all the time. Logs and metrics are constantly being generated, but external events do not occur in anywhere near the same volume, so they need precision in capturing as well as in storage. Additionally, we have exceptions, which can go to tools like Sentry and Rollbar. This can be considered an advanced version of structured logging, where tools such as Sentry and Rollbar give us the specific log lines and traces where we can go and debug the issues. I have a curious question: how many of us have used more than three of these at the same time? I have talked to a lot of people, and what I realized is that depending on the use case, we tend to pick at least three or four of these at any point in time, but not necessarily all of them at the same time. So it is a very interesting conversation whether we have folks who have used multiple types of data at the same time. But we do capture this kind of information in our instrumentation processes. The most important point to consider is that none of this is free. And when I talk about cost, it is not just the monetary cost; it also adds overhead to our runtimes and to our processes. There is no such thing as a free lunch even in the case of instrumentation, and we have to pay the cost along different dimensions that we will see a bit later. The most important thing that generally happens with instrumentation is the explosion of cardinality, or the churn of the metrics and logs information. They keep changing all the time, and that prevents us from just shipping it and sitting back. We always have to watch that the data is not getting out of control because of the cardinality explosion. To give a simple example, a three-node Kubernetes cluster with Prometheus will ship roughly 40k active series by default, and that is just the default metrics. If you want to emit some custom metrics then obviously it will explode even further, and with ephemeral infrastructure this can go out of control very quickly. We also have the operations of running this instrumentation for the entire stack. So this is one more thing to operate besides the application: we have to run our application, but we also have to run our entire observability and instrumentation stack. And we have to make sure that not just the app scales, but that the instrumentation scales along with it. Because we cannot be blind on New Year's Day for four hours, and we cannot be blind before the streaming of a final match. I'm from India, so I'll give an example from cricket: we cannot be blind before the final of the cricket World Cup between India and Pakistan.
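To make the cardinality point concrete, below is a minimal sketch, not from the talk, using the Python prometheus_client library; the metric and label names are hypothetical. Every distinct combination of label values becomes a separate active series that the monitoring backend has to ingest and store, which is why an unbounded label such as a user id blows up so quickly.

    # Minimal sketch of how label choice drives cardinality in Prometheus-style metrics.
    # Assumes the prometheus_client package; all metric names here are hypothetical.
    from prometheus_client import Counter

    # Risky: user_id is unbounded, so a million users means a million active series
    # for this single metric -- the cardinality explosion described above.
    logins_by_user = Counter("logins_total_by_user", "Logins per user", ["user_id"])

    # Safer: label only by small, bounded value sets; keep per-user detail in logs
    # or traces instead of metrics.
    logins = Counter("logins_total", "Login attempts", ["method", "status"])

    def record_login(user_id: str, method: str, ok: bool) -> None:
        # logins_by_user.labels(user_id=user_id).inc()  # every new user adds a series
        logins.labels(method=method, status="success" if ok else "error").inc()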
Just because our instrumentation is not able to scale. That can be a very bad thing, not just for engineering but also for the business. All of this results in constant tuning of monitoring and instrumentation data, and in a lot of engineering toil that the teams have to go through. So I give it an acronym: COST. The cost that we have to pay is basically for cardinality, churn, operations, scale, tuning and toil, and all of this together becomes the cost of instrumentation that we have to pay. But what is the hidden cost? The costs we just talked about are fairly apparent on their face; we are aware of them. But what is the most hidden cost of such instrumentation? It is actually the distraction. We always get distracted from doing the things that we actually wanted to do, which is our product engineering, or scaling our business, or making sure that our customer experience is not impacted. How many times have you heard things like: can you just reduce the Datadog monitoring cost before next month? It is going out of hand; please stop your feature development and focus on getting this under control. Or: our logs have been piling up for the last two days, can you treat this as a P0 item and fix them? Otherwise our vendor will charge us double and we'll be spending too much money unnecessarily. Or: today is New Year's Day, our Prometheus is not getting the required metrics; can you just set aside the important features and bug fixes that you are pushing and focus on this, because otherwise we are completely blind before the party starts. We always hear these kinds of things, and they distract us from the actual tasks that we want to do in our day-to-day work. We always get distracted by the instrumentation and the data that we are emitting and probably not even using. We may be emitting so much data but only using 10% to 20% of it. So we pay not just for the data that we use, but also for the data that we don't use, which is not a good position to be in. So the modern software engineer has to maintain not just their software, but also the instrumentation of that software, with the same rigor and the same requirements of scaling and so on. There is also fatigue. With so much data, so many dashboards, so many panels everywhere, so many logs in front of our eyes, we get desensitized to the information. There can be duplicate alarms. How many times have I seen that while debugging a critical issue in production, we get confused because the logs show two or three different pictures at the same time, and some of the information that we see is not even used in the code. So there can be situations where too much information causes delays in debugging. Because we focus on getting the data out, because that is the easy part, we don't even consider why we need it in the first place. These are some of the points that can cause fatigue with too much information. While we talk about all of this, and we are sort of used to these things, what's the way out? Let's discuss that.
So if we focus on the data that gives us only the early warnings with the least amount of data, and this least amount of data is important, then we can just watch the warnings and, based on them, dig deeper to isolate the root causes as and when needed. I would like to give an analogy with the Apple Watch on my wrist. What the Apple Watch does is give me only the vitals, such as heart rate, how I'm doing with my sleep, or whether I'm walking enough every day. It just gives me the vitals that are needed, and based on those I can decide to go to the doctor for detailed X-ray scans and ECG reports, and then decide whether to go further with my debugging or deep exploration. While I get the vitals, if they are off, I can go for the detailed information about why they are off. I don't start with X-ray scans immediately, and I don't start with ECG reports as the first step without even checking whether my vitals are off. An early warning of something breaking is always better, because it gives me ample time to either fix things or ignore it if it is actually not off track. So what is the plan of action to fix this? We can measure what we actually want in our instrumentation. We can plan what we really need. We can emit only the data that we need and skip the things that we don't. We can observe and track. We can prune aggressively: a lot of metrics and instrumented data is not used at all. There are a lot of default metrics that we keep pushing, and they can slow things down later, so we can prune them aggressively. We can store less data for less time, because the more we store and the longer we store it, the more problems and distractions it can cause. And we can focus on what gives us the best value for the money, which helps us reduce the scope of our instrumentation. But there can be an even better plan of action than this. For example, what if we define access policies for our data: you can access a certain amount of data only for a certain amount of time, and if you want to access beyond that, you have to be okay with reduced or aggregated data. We can also have data storage policies across the organization: your logs are stored only for one day, and beyond that we won't keep them, because otherwise they will explode in terms of storage costs. All of these policies can help us define standards for our instrumentation across the organization, so there is consistency and we get the same results across our software systems. Less is always better, even in the case of instrumentation, because instrumentation is not just instrumenting; it is actually a liability that we have to worry about as builders of software. Thanks, that's all I have today. My name is Prathamesh Sonpatki. I work as a software engineer at Last9 (last9.io). I have a blog, and I have posted my Twitter and other contact details. We also have a Discord where we hang out with other SRE and DevOps folks to discuss reliability, observability and a lot of other things related to SRE and DevOps. I would highly encourage you to check it out and join if you're interested. Thanks again. Thank you.
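As one concrete illustration of the "emit only the data that we need" and "prune aggressively" advice from the talk, here is a minimal sketch, not from the talk itself, that unregisters the default collectors the Python prometheus_client library exports automatically; the exact collector names depend on the client library version, and retention ("store less data for less time") is configured separately on the Prometheus server, for example via the --storage.tsdb.retention.time flag.

    # Minimal sketch: stop exporting default runtime metrics nobody looks at,
    # so only deliberately chosen series leave the process.
    # Assumes a recent prometheus_client; collector names may differ by version.
    from prometheus_client import (
        GC_COLLECTOR,
        PLATFORM_COLLECTOR,
        PROCESS_COLLECTOR,
        REGISTRY,
        Counter,
    )

    # Drop the Python GC, platform and process collectors registered by default.
    for collector in (GC_COLLECTOR, PLATFORM_COLLECTOR, PROCESS_COLLECTOR):
        REGISTRY.unregister(collector)

    # What remains is only what the team explicitly decided to measure.
    checkouts = Counter("checkouts_total", "Completed checkouts", ["status"])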
...

Prathamesh Sonpatki

Community & Dev Evangelist @ last9.io



