Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, I'm Shakar from Groundcover, and I'll be talking about cloud native
observability at scale today.
A little about me: I am the co-founder and CEO of Groundcover. I have many years
of experience as an engineer and an R&D leader, and in the different positions
I've been in, I've also been responsible many times for creating and maintaining
the observability stack at the companies I've worked with. That is part of the
motivation for setting up and creating Groundcover, which is a company that
reinvents Kubernetes application monitoring.
Before talking about anything, I just wanted to set the ground about the three
pillars of observability. They start out with the bottom layer of logs, or log
management, which most teams know and use today in production.
The layer on top is infrastructure monitoring: basically figuring out whether
your infrastructure is healthy, and healthy enough to host your applications in
production. And the upper layer, perhaps the most intricate one and the one most
related to the business you're trying to create, is application monitoring, or
APM (application performance monitoring) solutions. APM is super critical to
monitoring what your business is trying to achieve, since it correlates really
deeply with the targets you set for yourself and for your customers.
Just to describe two of the critical missions an APM usually performs. One of
them is trace capture: actually capturing the behavior of the API-driven system
you're running in production. Most companies today are based on a microservices
architecture, which is heavily API driven, and by capturing the different traces
between these microservices, teams can troubleshoot and figure out where things
start to break or behave differently than expected. The other one is highly
related to the SLOs the business is trying to maintain, and that's application
metrics, or golden signals.
In these application metrics you will find things like throughput, or requests
per second, error rates, or latencies of the different APIs, which measure how
fast or slow your application is actually doing its task facing your users. I'm
not going to dive in too deeply into why APM is so important, but this is a quote
from the Google SRE book just saying that if you could measure only four metrics
about what your user-facing system is actually doing, you should focus on those
four golden signals, since they can highly predict whether something is starting
to drift or break in your production. And that only goes to say how important an
APM is in the stack that you're maintaining inside the company with regards to
observability.
So if APM is so important, it must be adopted really well in the industry, right?
But reality actually says something completely different. Looking at the
statistics, log management solutions are highly adopted in the industry, with
almost 70% of teams actually using some kind of log management solution.
Infrastructure monitoring is not following too far behind, with almost 60% of
teams using some way to monitor their infrastructure. Yet application monitoring,
or APM, is clearly under-adopted, with only 22% of the population actually using
an APM solution of some kind. That raises the question of why this is happening.
If APM is so important, why is it so under-adopted in the industry as we know it?
To understand the answer to this specific question, we have to dig much deeper
into what an APM is and how it is built. There are many reasons why APMs today
are under-adopted.
But in this talk we're going to focus on one specific and painful answer to this
question, and it is the eventual understanding that APMs don't scale well with
modern architectures. The reason they don't scale well, just to jump for a second
all the way to the final conclusion of what we're trying to say, is that they
store tons of irrelevant data just to get you the little insight you need about
what's actually going on in production. Monitoring so much data basically makes
the cost unbearable, so teams can't really scale with their APM solutions as
their business and their customer base grow, and can't maintain them and use them
in production.
But another question is starting to emerge here. How did we get here? How did we
get into a state where an APM solution is storing tons of irrelevant data just to
give me the little information I need to monitor my production? To understand
that, we have to go back in time, ten or fifteen years, back to when APMs started
out as they are today. The solutions dominating the APM industry today were
formed over a decade ago, and they are based on a centralized architecture, which
we're going to talk about in a second. The tale of these centralized
architectures, and the architectural decisions the APM vendors made over a decade
ago, has led us to the point where we are today, where APMs just don't fit the
current scale of modern microservices architectures.
When things started out in observability, you had an infrastructure agent, which
is usually an external process of some sort, monitoring the infrastructure as the
bedding of your running applications in production, figuring out what the
infrastructure is doing and providing metrics about the infrastructure. But when
APM providers started trying to monitor applications and what those applications
are doing, they had to turn to a different solution, which was much more heavily
based on instrumentation inside the code. Instrumentation means that to monitor
the application, I have to monitor it from within; I have to give the R&D teams
or the developers some kind of SDK that they can instrument into their code,
basically integrate into their code. This SDK would allow me to monitor the
application from within, running as part of the application, giving me the
opportunity to suddenly measure things like the latency of a specific API or the
error rate on a specific API, things that I couldn't have done with an external
agent up until that point.
But what exactly is instrumentation? What does it actually mean? This is a sketch
of a typical microservices kind of behavior. A production system based on this
architecture would be very API driven. You would get some kind of request from
the outside world, from the user; some microservice would handle it and pass it
on down into the different microservices inside your production, taking care of
the different business logic you're trying to create, eventually returning some
kind of response to the actual user that triggered this entire flow.
Instrumentation means that you're actually injecting or inserting a small piece
of monitoring code that wraps this entire API-handling behavior inside each of
the microservices. For example, to monitor the latency of a specific API, I would
have to wrap it with external monitoring code given to me by the observability
vendor
to actually time the start and the end of the handling of, say, an HTTP request
flowing into my web server.
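As a rough illustration of what that wrapping looks like, here is a minimal
sketch of SDK-style instrumentation around a plain Go net/http handler. The
instrument wrapper and the recordSpan call are hypothetical stand-ins for a
vendor SDK, not any real product's API.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// instrument wraps an HTTP handler with vendor-style monitoring code: it times
// the start and end of each request and records the status code.
func instrument(name string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next(rec, r)
		recordSpan(name, r.Method, rec.status, time.Since(start))
	}
}

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.status = code
	s.ResponseWriter.WriteHeader(code)
}

// recordSpan is a hypothetical stand-in for the SDK call that samples the raw
// span and ships it to the APM vendor's backend.
func recordSpan(name, method string, status int, latency time.Duration) {
	log.Printf("span api=%s method=%s status=%d latency=%s", name, method, status, latency)
}

func main() {
	http.HandleFunc("/checkout", instrument("checkout", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Note how the handler body itself doesn't change; the monitoring code simply
brackets it, which is also why that code has to stay very cheap.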
But this specific choice of actually using instrumentation is part of the reason
we got to the point we are at today.
I mean, when APM vendors chose to invest in an SDK and instrumentation of code by
the developer teams, they basically did it because this was the best technology
to achieve their task over a decade ago. This also has a lot of implications for
what the architecture actually looks like. If I'm using instrumentation and I'm
inserting my own code into my user's application code, it has to be very simple
and it has to be very fast. The reason is that, for example, you wouldn't expect
your web server efficiency to decrease by 50% just because you're starting to
monitor it with an APM, right? It means that the code I'm wrapping my actual
application with, this instrumentation code, has to be very fast, so it doesn't
create bottlenecks or slow down my applications.
This means it has to be very simple, make very simple decisions, and run very
simple algorithms that keep it efficient, fast, and not error prone, so it
doesn't create any bugs or crashes that could also endanger my application. And
this means that eventually, instead of putting a lot of weight into the
sophistication of this instrumentation code, I'm basically moving all the
responsibility back to
the APM vendor. This is the basic architecture behind the centralized approach
that APM vendors chose over a decade ago. It says: let's make the SDK as simple
as possible. It will be fast, it will be free of errors, it will be very simple
to understand, and we'll let it sample raw data and just send it back to the APM
vendor. All the crunching, all the complex algorithms, and all the intricate
decisions we have to make as an APM provider will happen on the backend side of
the APM vendor. Things like capturing specific traces, things like creating
span-based metrics, these golden signals we're talking about, we're going to
create them from the actual raw data sampled on the customer side, on the APM
vendor's backend side. But this decision, even though very logical at the time,
created a major debt towards scale, and this is part of what is starting to shift
in the industry right now.
Basically, what it says is that you have to store raw data to get insights. So if
we look at all the green rectangles here as healthy requests and the red one as a
faulty request, you would actually store the spans of the different requests,
when what you wanted, most of the time, was just the digested insights around
them, like the golden signals, the metrics that depict their behavior in
production. You can imagine that if you have a million requests per second on a
Kubernetes cluster, for example, if you're a company working at high scale, you
wouldn't want to store that many spans and pay for their storage just to get
things like the latency of your different microservices, the things that would be
the entry point to figuring out if your production is doing okay or not.
The other implication is that by making simple decisions in the instrumentation
code, at the SDK side, you also store data at equal depth. Meaning that, for
example, if I want to decide what to store in different cases, that wouldn't
happen with a common APM today. So for example, imagine I want to get the payload
of the faulty red request here, since I want to troubleshoot with this payload.
It would usually mean that I would have to store payloads for all requests or for
none of the requests. But clearly that's not reality, right? Not everything is
equally important. If that faulty request comes one in 10,000 requests, you would
have wanted to get as much information as you could on this specific request,
like logs from the different pods communicating around this request, or other
requests flowing through the system at the time, to figure out if there's some
kind of interdependency. But for the green healthy requests, you probably want to
capture much less and pay much less for all this data being stored on the APM
vendor side.
And this trade off is exactly what we define as the visibility depth versus cost
trade off. It means that if you want to go deeper in the cases you care about,
like for example getting as much information as you can about that faulty
request, you have to pay an increasingly high price by storing an increasing
amount of data across your system, since you can't really decide what to store
and what not to store given the different cases you care about in production. And
we all know that intuitively. Every developer knows it from the different
experiences he or she has had handling some kind of observability stack in their
production environment.
We know it very well from logs, right? We store, say, warning log level and up,
since we can't store every trace, debug, or info message flowing through
production; it doesn't make sense, right? But imagine that you could get the info
or trace logs from the web server from the 10 seconds around any faulty request.
Wouldn't you want to do that? Clearly there are different depths of information
you would want to get in different cases. We also know it from traces really
well.
I mean, observability vendors or APM vendors allow us to randomly sample our
production to reduce the volume of data we capture. That would mean we capture,
say, one in 100 traces, but that's not exactly what we want. We want to capture
the specific traces we care about, like the ones that take ten times the time of
other traces, and not just randomly sample the production.
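To make that difference concrete, here is a small sketch, in Go, of random
sampling versus a latency-aware keep decision. The moving-average baseline and
the ten-times threshold are illustrative choices, not a description of any
particular vendor's sampler.

```go
package sampling

import (
	"math/rand"
	"time"
)

// Baseline keeps a rough exponential moving average of latency for one API,
// a toy stand-in for whatever statistic a real agent would maintain.
type Baseline struct {
	avg time.Duration
}

func (b *Baseline) update(d time.Duration) {
	if b.avg == 0 {
		b.avg = d
		return
	}
	b.avg += (d - b.avg) / 16 // EMA with weight 1/16
}

// KeepRandom is classic random sampling: keep roughly one in 100 traces,
// regardless of whether they are interesting.
func KeepRandom() bool {
	return rand.Intn(100) == 0
}

// KeepTargeted keeps the traces we actually care about: anything taking more
// than ten times the usual latency is always kept, the rest is sampled randomly.
func KeepTargeted(b *Baseline, d time.Duration) bool {
	slow := b.avg > 0 && d > 10*b.avg
	b.update(d)
	return slow || KeepRandom()
}
```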
So centralized architectures are being replaced today by something we call edge
observability. It's a terminology that's starting to be very common in modern
solutions today. Basically it says that the centralized architecture has reached
a point where it doesn't make sense anymore; it can't scale well, and it doesn't
make sense to maintain and pay for the data being stored by this approach.
Basically, the weight of the decision making and the sophistication is being
moved from the observability vendor's side to the actual agent or SDK running at
the edge, close to where the data is. And that's the future as we see it. Before
trying to describe what Groundcover does with this specific approach of how to
actually monitor with edge observability, I want to talk a bit about eBPF as an
enabler of all that.
eBPF is basically a revolutionary technology that has become more and more
prominent in the last couple of years. It's a very interesting technology for
many different use cases, from networking to security, to monitoring and
observability. Basically, what it says, without going too deep, is that you can
run complex business logic inside the Linux kernel, and for monitoring that
allows you, for example, to monitor all of user space, all the applications
running in user space, without actually being part of the code. You can, for
example, get things like golden signals, things that you would expect from an APM
that had to be very tightly integrated into the code itself, by just running an
external agent at the kernel level and capturing the same data.
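As a very rough sketch of what running such an out-of-band agent can look like,
here is a Go snippet using the cilium/ebpf library to attach a kprobe and read
counters that a kernel-side program maintains. The object file probe.o, the
program name count_tcp_sendmsg, and the map bytes_by_pid are all hypothetical,
and the kernel-side eBPF C code itself is not shown; this is nowhere near a full
agent.

```go
package main

import (
	"log"
	"time"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

func main() {
	// Load a precompiled eBPF object (hypothetical "probe.o") containing a
	// kprobe program named "count_tcp_sendmsg" and a map named "bytes_by_pid".
	coll, err := ebpf.LoadCollection("probe.o")
	if err != nil {
		log.Fatalf("loading eBPF collection: %v", err)
	}
	defer coll.Close()

	// Attach the program to the tcp_sendmsg kernel function. From here on,
	// every TCP send on the host is observed out of band, without touching
	// the application code at all.
	kp, err := link.Kprobe("tcp_sendmsg", coll.Programs["count_tcp_sendmsg"], nil)
	if err != nil {
		log.Fatalf("attaching kprobe: %v", err)
	}
	defer kp.Close()

	// Periodically read the per-process byte counters the kernel program
	// maintains, the raw material for throughput-style golden signals.
	for range time.Tick(10 * time.Second) {
		var pid uint32
		var bytes uint64
		it := coll.Maps["bytes_by_pid"].Iterate()
		for it.Next(&pid, &bytes) {
			log.Printf("pid=%d sent=%d bytes", pid, bytes)
		}
	}
}
```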
So eBPF is a really interesting technology, but the reason it's so interesting
for edge observability is that it suddenly allows you to achieve what an APM
would have achieved only with instrumentation deep inside the code, with an
out-of-band agent that just runs alongside the application without being part of
it. That suddenly opens up two interesting use cases, or ways to think about what
we can do. It means that everything we do inside this agent doesn't create an
overhead on the application. So for example, if we make some kind of complex
decision or run some kind of complex algorithm inside this agent, it doesn't
directly mean that we may slow down the performance of the application we're
trying to monitor. And that means we can do distributed decision making in this
agent which is much more complicated than before. If before we had to keep that
SDK as simple as possible, suddenly we can put much more sophistication into this
agent, since it's running completely out of band and not impacting the
application it's trying to monitor.
And this is exactly where things start to get really interesting in terms of what
you can do. So if the first thing we talked about was that I don't want to store
raw data and pay for so much raw data to get a simple digested metric like
latency on my specific APIs, we can suddenly move all that sophistication into
the eBPF agent. Imagine an API flowing into your application at a very high rate.
Instead of storing or sampling these spans and sending them back to the APM
vendor, we can now create the metrics inside the eBPF agent on the fly, without
storing the spans or shipping them out of the node or the actual host at any
time. And that allows us to store so much less data to create the actual insights
teams need to monitor production, like golden signals and application metrics.
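Here is a minimal sketch, in Go, of what creating metrics on the fly inside the
agent might look like: a per-API aggregator that turns every observed request
into running counters and periodically flushes only the digested golden signals,
never the spans themselves. The type and field names are illustrative, not
Groundcover's actual implementation.

```go
package agent

import (
	"fmt"
	"sync"
	"time"
)

// apiStats aggregates golden signals for one API endpoint without keeping any
// of the underlying spans around.
type apiStats struct {
	count    uint64
	errors   uint64
	totalLat time.Duration
	maxLat   time.Duration
}

// Aggregator turns a stream of observed requests into digested metrics on the
// fly, so only the aggregates ever leave the node.
type Aggregator struct {
	mu    sync.Mutex
	stats map[string]*apiStats
}

func NewAggregator() *Aggregator {
	return &Aggregator{stats: make(map[string]*apiStats)}
}

// Observe is called once per request seen by the agent; nothing is stored
// beyond the running counters.
func (a *Aggregator) Observe(api string, latency time.Duration, isError bool) {
	a.mu.Lock()
	defer a.mu.Unlock()
	s, ok := a.stats[api]
	if !ok {
		s = &apiStats{}
		a.stats[api] = s
	}
	s.count++
	if isError {
		s.errors++
	}
	s.totalLat += latency
	if latency > s.maxLat {
		s.maxLat = latency
	}
}

// Flush emits the golden signals accumulated over the last window (say, every
// 10 seconds) and resets the counters.
func (a *Aggregator) Flush(window time.Duration) {
	a.mu.Lock()
	defer a.mu.Unlock()
	for api, s := range a.stats {
		rps := float64(s.count) / window.Seconds()
		errRate := float64(s.errors) / float64(s.count)
		avgLat := s.totalLat / time.Duration(s.count)
		fmt.Printf("%s rps=%.1f err=%.2f avg=%s max=%s\n", api, rps, errRate, avgLat, s.maxLat)
	}
	a.stats = make(map[string]*apiStats)
}
```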
Another interesting point is that this also allows us to create variant-depth
capturing. So imagine I have, again, this API flowing in at a high throughput
into the application. I can decide to do very different things in different
scenarios. For example, let's imagine it's an HTTP API for a second. If I get a
bad status code returning from the server, like a 500, I may want to decide to
capture the full payload of this specific request so I can troubleshoot, and
store the logs from a few seconds around that request from the different
participants in this API call. But perhaps if it's a high latency event, like an
HTTP span suddenly taking ten times the usual time, I may decide to do different
things. For example, I may want to store other requests flowing to the server, to
figure out if there's specific API load from other APIs on the server at the same
time. I may also want to store the CPU usage of this specific HTTP server at a
very high resolution at the time of the request, to figure out if there was some
kind of CPU spike that eventually caused the slow response. And it also opens up
different behavior for the normal flow, right? For the usual APIs in a common
production environment, the normal flow is clearly very prominent, and most spans
are healthy and describe the normal case. So I may want to do something
completely different: I may not want to store payloads or logs or anything in
this situation, and just sample, say, 1% of the normal flow to allow the users to
see actual healthy spans and how they look in a normal flow. This variation opens
up a lot of different ways to store data based on the different scenarios the
user or the development team cares about.
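A minimal sketch of such a variant-depth capture policy, again in Go, might look
like the following. The scenarios and thresholds mirror the examples above (a 500
status, a span ten times slower than usual, the healthy default), but the
CaptureAction structure itself is purely illustrative.

```go
package capture

import (
	"math/rand"
	"time"
)

// CaptureAction describes how much data to keep for a single request. The
// field names are illustrative; a real agent would have far richer options.
type CaptureAction struct {
	KeepSpan       bool
	KeepPayload    bool
	KeepLogsAround time.Duration // capture surrounding pod logs for this window
	KeepConcurrent bool          // capture other requests in flight on the server
	KeepCPUProfile bool          // capture high-resolution CPU usage around the request
}

// Decide implements the variant-depth idea: go deep on the scenarios we care
// about (errors, unusually slow requests) and keep almost nothing for the
// healthy, common case.
func Decide(statusCode int, latency, typicalLatency time.Duration) CaptureAction {
	switch {
	case statusCode >= 500:
		// Faulty request: capture the full payload plus logs from a few
		// seconds around it so it can be debugged later.
		return CaptureAction{KeepSpan: true, KeepPayload: true, KeepLogsAround: 10 * time.Second}
	case typicalLatency > 0 && latency > 10*typicalLatency:
		// High-latency event: keep the other requests hitting the server at
		// the same time and a CPU snapshot, to explain the slowdown.
		return CaptureAction{KeepSpan: true, KeepConcurrent: true, KeepCPUProfile: true}
	default:
		// Normal flow: just sample ~1% of healthy spans so users can see
		// what a healthy request looks like.
		return CaptureAction{KeepSpan: rand.Intn(100) == 0}
	}
}
```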
And that breaks the trade off of storing too much information to create the
little insight you need. You can figure out what scenarios you care about and
store much more information about them, while storing much less information about
the other cases where you're not that interested in the details.
This may sound like theory, taking the centralized approach into the edge
observability approach and doing things differently to reduce costs and make
things much more scalable. But this is current reality, and it is the current
reality of Groundcover and how we're built. As an APM vendor, we're based on
exactly these two assumptions, or approaches, that kind of decentralize the way
an APM is built. One is that we use eBPF instrumentation instead of SDK
instrumentation inside the code. What that means is, first, that we allow an
immediate time to value, because we don't require any R&D effort in the process
of integrating our solution into the code. But we also get this out-of-band
deployment, which we just talked about, which allows us to take complex decisions
and create a sophisticated eBPF agent that can carry most of the load of the
observability at the edge side, where the data actually is. The other one is that
we're creating these edge-based algorithms to create metrics and to sample the
data smartly, so they can be built for scale. So we can scale very well with
production environments as they grow into hundreds and hundreds of microservices,
breaking the trade-off between how deep we want to get the data in specific cases
and the overall volume of raw data that would be captured to describe the
production and let users figure out if their production is doing well.
This is a really interesting discussion, and we think the industry is definitely
shifting from one approach to a completely new, modern approach. We would love to
answer any questions or keep talking about it in other discussions; we find this
topic really interesting and important for the entire observability community.
Feel free to join our Slack and ping us at any time with questions or ideas you
have, or experiences you have from monitoring your production at the scale you're
used to today. Also, feel free to contact me directly. Thank you, it was great to
talk about this topic and share our thoughts with you. Thanks, guys.