Conf42 Platform Engineering 2023 - Online

Observability strategies to not overload engineering teams


Abstract

Revolutionize your engineering team’s approach to observability with game-changing strategies that won’t overload them. Learn how to streamline observability with Proxy, OpenTelemetry, and eBPF approaches. Build better products by empowering your engineers to focus on what really matters!

Summary

  • Today I'm going to show you some strategies to implement observability in your company without requiring engineering effort. The idea is to let developers focus on what really matters to them. So I hope you enjoy it, and let's get started.
  • Nicolas Takashi is a software engineer living in Portugal. He is an open source contributor, especially in the observability ecosystem. He currently works at Coralogix, an analytics platform for logs, metrics and traces. He usually talks about Kubernetes, observability, GitOps and distributed systems.
  • In the context of software engineering, observability is crucial for understanding how a system is behaving. When we talk about observability, we also need to talk about application observability. The idea of this talk is to show some strategies that let engineers focus on the things that really matter to them.
  • The proxy strategy, the OpenTelemetry strategy, and the eBPF strategy. The idea of these strategies is to give your engineering team a solid foundation of observability without any code change. After the live demo of each strategy, we will look at a comparison table.
  • We are going to leverage an existing piece of infrastructure that you probably already have in your company: your web proxies. The idea is that you can produce standard telemetry data independent of the technology you are using to build your services.
  • We are using a very specific image for this container, which is nginx-opentracing. We can start generating some load on those services and see things happening. If you are adopting this strategy now, I really recommend using the OpenTelemetry version.
  • The OpenTelemetry strategy also relies on an infrastructure piece that you deploy in your company, and it starts collecting metrics and traces out of the box for you. Today we are going to use the OpenTelemetry auto-instrumentation and the OpenTelemetry Collector to auto-generate and collect the traces.
  • In the Jaeger view we can see the traces from the services. We have checkout and payments. Since we are using the OpenTelemetry span metrics and creating metrics from traces, we can leverage another Jaeger feature, the Monitor feature.
  • We ensure that those metrics are produced using an open standard. Most vendors support OpenTelemetry data. Open source solutions support OpenTelemetry data as well. This is very useful, especially if you want to build dynamic dashboards.
  • eBPF is an emerging technology, especially in the cloud native space. It extends your Linux kernel to trace, monitor and analyze system performance and behavior. A few products leverage eBPF, but it is not widely adopted yet.
  • Cilium is a container network interface for Kubernetes. Using Cilium and its eBPF agent, we can collect metrics like TCP and HTTP networking metrics. All the source code will be available on GitHub.
  • We are going to start a Kubernetes cluster and then install everything we need, like Cilium, Grafana, Prometheus and so on. Hubble is an observability solution that takes network flows from your pod communications, extracts metrics and provides network visibility for your cluster.
  • Hubble L7 HTTP metrics by workload, where we can see the source workload. We can build alerts for error rates, latency and so on. You can use all those metrics with the same standard we have elsewhere. This is at the application level, not the load balancer level.
  • The OpenTelemetry eBPF agent provides a few metrics, not as detailed as Cilium's. But I do believe this project is going to become very mature in a few weeks or months. eBPF is a technology you must watch for your observability systems.
  • The third strategy, too, does not change any line of code to add instrumentation. All three options are environment agnostic. You need to understand your use case to choose the best fit for you.
  • So folks, that's it. I really hope you enjoyed it. If you have any questions, feel free to ping me. Talk to you soon, and let's have a nice conversation about cloud native, observability and many other subjects.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
All right, welcome everybody. It's a pleasure to be here. Today I'm going to show you some strategies to implement observability in your company without requiring engineering effort. If you are doing observability using open source solutions, as I am, you have probably thought about doing the same thing most observability vendors do with auto-instrumentation: they just deploy an agent on your host and it starts collecting your metrics, logs and traces. So the idea of this talk is to show you some strategies, using things you probably already have in your infrastructure, to start collecting all those signals and improve your developer experience without requiring engineering effort. The idea is to let developers focus on what really matters to them. I hope you enjoy it, so let's get started. Before we start talking about observability strategies, let me introduce myself. My name is Nicolas Takashi. I'm a Brazilian software engineer living in Portugal for the last seven years. I'm an open source contributor, especially in the observability ecosystem, for projects such as Prometheus Operator, OpenTelemetry and many other projects in the observability ecosystem. I'm currently working at Coralogix, an analytics platform for logs, metrics, traces and also security data. You can find me on social networks such as Twitter, LinkedIn and GitHub by my name. I usually talk about Kubernetes, observability, GitOps, distributed systems and, of course, personal life and experiences. I hope to see you there, and let's move forward. Okay, now that everybody knows who I am, let's start talking about observability strategies, folks. Before we see the strategies in action, let's make sure everybody is on the same page and has the same understanding about observability and its use cases. I know this may be very basic, but it is important for understanding the strategies we are going to see here. So folks, in the context of software engineering, observability is crucial for understanding how a system is behaving given external inputs. By external inputs I mean users using your system: if you are running an ecommerce site, users buying, adding things to their checkout bag, making payments and all those things. And you start collecting telemetry data: traces, logs, metrics, profiling and many other things. From all that information, and it is a huge amount of information, because with observability it is very easy to end up handling a huge amount of data, you can identify issues, you can identify bottlenecks where your system can be improved, and you can troubleshoot problems very quickly. But when we talk about observability, it is very common for people to start talking about infrastructure observability, which is true and very important actually, because if you are running a healthy infrastructure, your resilience is better, your reliability is better, and your customers are happy.
Usually when we talk about infrastructure observability, we are talking about monitoring Kubernetes, for example: whether your system is scaling or not, whether you have new pods, or, if you are running a Kafka broker, you may pay attention to disk size, disk throughput and many other things, which is important, as I said. But when we talk about observability, we also need to talk about application observability, which is a little bit more complex, because most things on the infrastructure side are already done: we have metrics, and we have logs by default, because logs are the most common observability data type. Application observability is a little bit more complex because some systems are not yet prepared to export everything we need. Sometimes we need technical information like logs, traces and metrics, but from the application perspective you want to know, for example, the p99 latency for a specific application, or how many messages a specific application is producing to your message broker; whether it is Kafka or RabbitMQ, the concept is more or less the same. On the application side you may also want to understand some business metrics. Going back to the ecommerce example from the beginning, you may want to know how many orders your customers are placing per second or per minute, what the click path is on your system, how users navigate your platform. Collecting all that information, you can start identifying the places in your system where you want to focus to improve resilience, improve performance, reduce error rates, and so on and so forth. But collecting all that information is sometimes not easy, especially if it is not built into the framework you are using, and it leads us to a kind of work most engineers don't like to do, and product managers like even less: instrumenting code to get the required information. So the idea of this talk is to show some strategies that let engineers put their focus on the things that really matter to them, like delivering features, measuring user experience and getting business information. Let's avoid engineers spending time adding telemetry to collect technical things like HTTP requests, Kafka throughput, DB latency and so on. This is what we are going to show you today: how to collect standard metrics without requiring engineering effort. This is also useful to define a standard across your company and ensure that, no matter which language your system is written in, you are collecting the same kind of information with the same structure. Now, when we talk about instrumenting code, the temptation is to collect as much as we can. As an engineer you want to know every single piece of information about your system, because everything is kind of available, but doing that kind of job can quickly become overwhelming for you and your team, because you need to spend engineering time adding metrics that might be provided by the platform. If you have a platform team, for example a DevOps team, that can do all the automation and the strategies you are going to see today and provide such information, then you can use your engineers' time to instrument the things that really matter for your system, your product engineering, and so on.
For example, we have a meme here, and of course this is just a joke, but a funny one: when you tell product managers that we need to instrument our code, that we need to spend engineering time instrumenting code instead of delivering features, it automatically gets a low priority. And as I said, this is just a joke, folks, because we are talking about two different professionals looking at the same problem from two different perspectives. When you are an engineer, you are trying to push to production the best system you can: the more scalable, the more performant system. When you are on the product side, you want to push to production the best product you can, with the best features, the best user experience and so on. So this is a trade-off, and we need to talk to each other. And of course, if your platform team provides you some basic information on the platform side, you only need to instrument your system for the information that really matters for your product and your teams, and your product people can use the same information, because it is both technical and business information as well as observability information. So let's move forward and see the strategies we are going to cover today. There are three. Cool, folks: these are the strategies to not overload your engineering team, the proxy strategy, the OpenTelemetry strategy, and the eBPF strategy. I have a blog post for each one of these strategies on my Medium account; you can check that information there as well, and feel free to reach out and give any feedback you may have. And people, the idea of these strategies is to design and give your engineering team a solid foundation of observability without any code change. What I mean by this is simple: as an engineer, when you want to deploy a service on your company's infrastructure, you don't want to change your code to start collecting common metrics such as HTTP calls, gRPC streams, Kafka consumers and Kafka producers. You don't want to instrument your code to collect latency metrics, or to start collecting basic tracing information, and so on, because you want to leverage your platform. You want to consume observability as a service. I like to say that, because we offer many things as a service, like CI as a service, Kubernetes as a service, deployments as a service, but you also want observability as a service: you want to deploy your system without any code change and start collecting telemetry data. So in the end, the idea of this talk is to provide some useful insights that you can use separately, each strategy on its own, or by combining these strategies together to get the information you want. After the live demo of each strategy, we are going to see a comparison table so we can compare the benefits, the pros and cons of each one. This may help you understand when to choose one over another. So folks, I hope you enjoy it; this is going to be very fun right now because it is a live demo, you are going to see things in action using open source solutions. Let's move forward and see the proxy strategy in action.
So let's go to the first strategy, the proxy strategy. We are going to leverage an existing piece of infrastructure that you probably already have in your company: your web proxies. If you are running HTTP applications, you probably have something like Nginx or HAProxy, which are very common solutions for this kind of strategy. But this is not coupled to any technology; I'm going to use Nginx as an example just because I'm familiar with it, but you can do the same with HAProxy or any other web server you know better. The concept is the same. As we can see on this slide, we have a diagram showing the flow. We have an ingress proxy, which is responsible for handling the HTTP requests coming from outside your platform into your platform and then redirecting them to the proper service, service A or service B. On the left side of the diagram we have the three telemetry backends: Prometheus for metrics, Jaeger for traces, and an open source solution for logs. Those backends are going to store the telemetry data produced by the ingress proxy. This is a very simple one, and the idea of this strategy, which is very web specific, is that if you are running web applications you can ensure you produce standard telemetry data, traces, metrics and logs, independent of the technology you are using to build your services. Let's imagine that service A is using Java and service B is using Golang: you can ensure that the telemetry data you are collecting uses the same standard and does not care about the technology the service is using. So let's move to VS Code and see this very simple strategy. Quick spoiler: the first thing you see here is a Makefile; I'm just using it to abstract a few commands and type a little bit less. In the app folder we have a very simple Golang application where we mimic an ecommerce checkout process: when a user does a checkout, the checkout service calls the payment service to do the user payment. Very simple. Here you can see a few lines of Go code, not important for us. Then we have a Dockerfile, simple as well, nothing special. And then we have a docker-compose file with a few containers running, and I'm going to tell you a bit about them. First we have two proxy containers: the first one is the ingress proxy, the one I told you about, handling the requests coming from outside your platform; then we have the egress container, which acts as an ambassador container, handling the requests going out from one service to another service. Having those two proxies, we can collect and connect the points between service A and service B with distributed traces. It is very similar to what we have when using a service mesh in Kubernetes: containers sitting in front of every application to do this magic. Besides that we have the checkout and the payment services, and then we have exporters for both proxies. For the ingress and the egress I'm using a Prometheus exporter that creates metrics from the Nginx HTTP logs, which are very useful and already contain a lot of information. So we take the logs and create metrics from them to understand latency, request rates and so on.
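To make the setup more concrete, here is a trimmed docker-compose sketch of the pieces described above. This is only a hedged sketch: the image tags, service names and volume paths are assumptions for illustration, not the exact files from the demo repository.

```yaml
# Hypothetical, trimmed sketch of the proxy-strategy stack (images and names are assumptions)
version: "3.8"
services:
  ingress-proxy:                       # handles requests coming from outside the platform
    image: opentracing/nginx-opentracing:latest
    ports: ["8080:80"]
    volumes:
      - ./nginx/ingress.conf:/etc/nginx/nginx.conf:ro
      - ./nginx/jaeger-config.json:/etc/jaeger-config.json:ro
  egress-proxy:                        # "ambassador" proxy for service-to-service calls
    image: opentracing/nginx-opentracing:latest
    volumes:
      - ./nginx/egress.conf:/etc/nginx/nginx.conf:ro
      - ./nginx/jaeger-config.json:/etc/jaeger-config.json:ro
  checkout:                            # plain Go service, no instrumentation in the code
    build: ./app/checkout
  payments:                            # plain Go service, no instrumentation in the code
    build: ./app/payments
  ingress-exporter:                    # turns Nginx access logs into Prometheus metrics
    image: quay.io/martinhelmich/prometheus-nginxlog-exporter:latest
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports: ["16686:16686"]
```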
And then we have Prometheus and also Jaeger. Okay, so before we move to the next step, folks, let me come back to the proxy configuration. I would like to highlight that we are using a very specific image for this container, which is nginx-opentracing, and this Docker image already has all the required modules to start spans when a request is received and then export the spans and traces to the tracing backend, in our case Jaeger. One important thing here: I know an OpenTelemetry version was released just this week, and I haven't updated this demo yet, but if you are starting with this strategy right now, I really recommend using the OpenTelemetry version and not the OpenTracing one. Cool. For the ingress configuration we have something very simple; it acts like a forward proxy: we just take the request and forward it to the services, and we leverage the proxy to collect the telemetry data without needing to change anything in the code. As we saw, the Go application is very simple and has no instrumentation to collect HTTP metrics. Cool. Now that we know the basics, we can start generating some load on those services and see things happening. I'll open the terminal and look at the Makefile just to understand what is happening behind the scenes. The first thing I'm going to do is run make setup, which starts all the containers with docker compose up. Now I have all the containers up and running, as we can see here, and all the logs are here. We can go to the web browser; let me switch to the web browser, and, oh, actually, let me just fix one small thing first. In the browser we can access localhost:9090, which is the Prometheus web interface, and we can see the Prometheus targets: the two Nginx proxies, the ingress and the egress, and Prometheus itself. Okay, back in VS Code, let me check which port Jaeger is running on, because I don't know it by heart: it is 16686. Then we go back to the browser and access localhost on the Jaeger port. We don't have anything here yet; the only service is Jaeger itself, because Jaeger collects its own traces. Now that everything is running as expected, we can start producing some load on this infrastructure. So let's go back to VS Code, open a new terminal and use a make command we have here, make test. What make test does is create some load on the checkout service using Vegeta; Vegeta is a very simple load-testing CLI, and I think it is just amazing for this kind of workload. I'm going to run the load test against the checkout API for six seconds. So let's go: it is making a lot of HTTP requests, and if we go to the logs we may see a few log lines happening here, so I think we may already have some data available. So let's go back to the browser, and the first thing we are going to see are the metrics we are collecting. We have a few metrics named nginx-something; let me hit refresh because the metrics might not be available yet. Yes, we already have them, and there is the Nginx HTTP request count total, among a few other things we can look at.
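Before looking at the queries, here is roughly what the tracing-enabled ingress configuration described above can look like. This is a hedged sketch based on the nginx-opentracing module's directives; the tracer plugin path, upstream names and Jaeger config file are assumptions, not the exact demo configuration.

```nginx
# Hypothetical ingress.conf sketch for nginx-opentracing (paths and names are assumptions)
load_module modules/ngx_http_opentracing_module.so;

events {}

http {
  # Load the Jaeger tracer plugin with its agent/sampling configuration
  opentracing_load_tracer /usr/local/lib/libjaegertracing_plugin.so /etc/jaeger-config.json;
  opentracing on;

  upstream checkout { server checkout:8080; }

  server {
    listen 80;

    location /checkout {
      # Name the span and propagate tracing headers downstream to the service
      opentracing_operation_name "$request_method $uri";
      opentracing_propagate_context;
      proxy_pass http://checkout;
    }
  }
}
```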
And then, if we run an expression like the rate of HTTP requests and sum it by, say, service, URI and status code, we can see all of that, and we can see it increasing over time, which is very cool. Since we are generating steady requests, we are not going to see it going up and down much, but we already have the information we need: we can measure the amount of HTTP requests for a given service, and that is pretty cool. Using that information we can build SLOs, for example error rates; we can use it to measure HTTP responses, because we should also have some metrics about latency (I'm not finding them right now), and we do know, for example, the response size for each call, which is also very useful information. Using the Nginx exporter you can build any metric you want from the Nginx logs: you take the logs and build all the metrics you may need. Those are just a few examples, and as I said, those are standard metrics; it does not matter which technology you are using behind the scenes. The next telemetry data we can see are the distributed traces. We can see that we have two services here; the first one is the checkout, and if we find traces from the checkout service, we can see that the checkout goes to payments and so on. We have two hops for every service; this is Nginx internal. In that way we can also look at the system architecture: we can see, let me zoom in again, that the checkout is using the payments API. If you have many services, you will be able to see this architecture diagram, which is nice, because if you have a microservices solution where many microservices talk to each other, it is very hard to know the service communication flows just from memory. One thing that is very important: this distributed information is useful, but it does not give much detail about the service internals, so you cannot identify performance issues inside your services using it. It is useful to understand the network hops on your platform, but not to identify internal problems. Still, given that telemetry data, your teams can start looking and see, okay, we are talking to this service and that service, and then decide, okay, on this payment flow I want more detail, and the team can go and add telemetry for the flow and the path they really care about, reducing the amount of work they need to do. Cool. So this is the proxy strategy, folks. It is very simple; as I said, it is not rocket science, just using a piece of infrastructure you already have in your company to start collecting some telemetry data. Before we finish, a word about logs: the logs are available on your host, so you can see all the logs produced here; we are producing a lot because we are generating a bunch of requests. We do not have any errors in this case, but we could mimic some errors, for example, and we also have these access logs. You can use a log shipper like the OpenTelemetry Collector or Fluent Bit to collect those logs and ship them to an open source log solution.
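To make the queries from the demo concrete, here is a hedged PromQL sketch. The metric and label names depend on how the Nginx log exporter is configured, so treat them as assumptions rather than the exact names used in the demo.

```promql
# Request rate per service, URI and status code (metric/label names are assumptions)
sum by (service, uri, status) (
  rate(nginx_http_request_count_total[5m])
)

# A simple error-rate SLO indicator: share of 5xx responses over all responses
sum(rate(nginx_http_request_count_total{status=~"5.."}[5m]))
/
sum(rate(nginx_http_request_count_total[5m]))
```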
And my advice is: since most of the information you have in the logs is now available as metrics, like status code and path, you can choose to drop some of those logs, especially the ones with 200 status codes. This is just another piece of advice that might save you some money in terms of storage and also networking. So this was the first strategy; we are now moving to the next one, the OpenTelemetry strategy. So let's go. Okay, cool folks, this is the second strategy, the OpenTelemetry strategy, and in my opinion this is the coolest one, because it uses OpenTelemetry, which is an amazing project maintained by amazing people, offering a lot of integrations. Wow, this is very interesting. This strategy, folks, also relies on an infrastructure piece that you deploy in your company and that starts collecting metrics and traces out of the box for you. The OpenTelemetry project is huge and composed of many different parts, such as the OpenTelemetry specification, the OpenTelemetry Collector, the OpenTelemetry eBPF agent, the auto-instrumentation, the Operator and so on. Today we are going to use the OpenTelemetry auto-instrumentation and the OpenTelemetry Collector to auto-generate and collect the traces and generate the metrics. Looking at the diagram, we have an OpenTelemetry agent running on your host and receiving the traces produced by your application. The OpenTelemetry agent processes the traces, creates metrics, and then exposes metrics and traces to the backends: metrics to Prometheus, traces to Jaeger. It is more or less language agnostic, so it does not matter which language you are using, Python, Java and so on. So let's move to VS Code. In VS Code we have almost the same thing. Before I show you the solution, I'm just going to run make setup to ensure I have everything running, and I will start producing some load on the services using the same Vegeta command I showed you in the previous demo. Now let's look at the configs we have here. First there is the app folder, with a very simple Python application mimicking the same behavior: checkout and payments doing the same flow. And folks, as you can see here, we do not have any OpenTelemetry, Prometheus or Jaeger code: we are not creating traces, we are not creating metrics. This is a plain, standard Flask Python server. Keep that in mind: everything you are going to see is auto-generated. Looking at the Dockerfile, this is where the magic starts to happen, because we have to install a few Python packages, dependencies like the opentelemetry-distro and the OTLP exporter, and we also need to run a bootstrap command to configure everything we need. Then we have opentelemetry-instrument wrapping the Python command, so it starts the Python process through the OpenTelemetry instrumentation. And this is where the magic starts to happen, because the OpenTelemetry auto-instrumentation project does the same thing a vendor agent does: it changes your Python code at runtime to add OpenTelemetry code.
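As a reference, the auto-instrumentation part of such a Dockerfile can look roughly like this. It is a minimal sketch assuming a Flask app entry point called app.py; the package names follow the upstream OpenTelemetry Python distribution, everything else is an assumption.

```dockerfile
# Minimal sketch of a Python image with OpenTelemetry auto-instrumentation (app.py is an assumption)
FROM python:3.11-slim
WORKDIR /srv
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt \
    # The distro pulls in the auto-instrumentation machinery; the exporter speaks OTLP
    && pip install --no-cache-dir opentelemetry-distro opentelemetry-exporter-otlp \
    # Detects installed libraries (Flask, requests, ...) and installs their instrumentations
    && opentelemetry-bootstrap -a install
COPY . .
# opentelemetry-instrument patches supported libraries at startup; no code changes needed
CMD ["opentelemetry-instrument", "python", "app.py"]
```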
So the OpenTelemetry auto-instrumentation is changing the Python code to initialize the tracing context and ensure we propagate the tracing headers when we make an HTTP request or produce a Kafka message, and read the trace context when we receive a request or consume a Kafka message. Those are just examples; it does a lot of things behind the scenes. Okay, cool. The traces produced by this application will be sent to the OpenTelemetry Collector. Before I show you the OpenTelemetry Collector configuration, let me show you the containers we have running in the docker-compose file: we have Prometheus, then we have Jaeger, then we have the OpenTelemetry Collector, and then we have the service containers, the checkout API and also the payments. As you can see, I have a few environment variables defined for both containers: the OpenTelemetry traces exporter, which is going to be OTLP, the format I'm going to export traces in; the service name, in this case checkout; and the OpenTelemetry Collector endpoint where this application should push traces, which is otel on port 4317. otel is the container running the OpenTelemetry Collector. Cool, pretty simple. Moving on to the OpenTelemetry Collector: the Collector is a piece of software responsible for receiving telemetry data, processing that telemetry data and then exporting it. It is literally a software pipeline: you can receive, process and export. We have a pipelines section in the Collector config, and we have two pipelines running here. The first one has the OTLP receiver, so the application produces the traces to this receiver; then we have a batch processor, which batches the spans and exports them in batches to the backend, over OTLP to the Jaeger endpoint. In the exporters we are also sending to something named spanmetrics. And what is spanmetrics? It is a connector, and a connector is a part of the OpenTelemetry Collector that acts as a receiver and also as an exporter, so it is literally a connector: it can receive and export at the same time. So every span we export is also sent to the spanmetrics connector; spanmetrics receives the spans and basically creates metrics from them. Later we use another pipeline, the metrics pipeline, which receives data from spanmetrics, since a connector can both receive and export, and exports the metrics via remote write to our Prometheus server. That is basically it, folks. The part responsible for creating metrics from traces is the spanmetrics connector; the OpenTelemetry auto-instrumentation in the Docker image changes the Python code to initialize the tracing context and propagate and read headers; and the Collector receives, processes and exports the metrics. Just one note: the way I did it in the Dockerfile is the simplest way to get auto-instrumentation running. There are other approaches: if you are running Kubernetes you can use the OpenTelemetry Operator and inject sidecars into your pods based on your technology.
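Before moving on to the Operator, here is roughly what the collector pipeline described above can look like. It is a hedged sketch: the spanmetrics connector and exporter names follow the opentelemetry-collector-contrib distribution, while the endpoints and exact fields are assumptions.

```yaml
# Hypothetical otel-collector config sketch (endpoints and exact fields are assumptions)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

connectors:
  spanmetrics: {}            # turns spans into request-count and duration metrics

exporters:
  otlp/jaeger:               # recent Jaeger versions accept OTLP natively
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheusremotewrite:     # Prometheus must run with the remote-write receiver enabled
    endpoint: http://prometheus:9090/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheusremotewrite]
```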
So if you are using Java, for example, you just annotate your pods with the OpenTelemetry Java annotation and the OpenTelemetry Operator does all the magic for you. But I'm not showing that today because it is a little bit more complex. So, we already have this running for a while; let me just check one thing. I guess I forgot to run make test. Yes, let's run it and switch to the browser. Now, in the browser, we can go first to the Jaeger view and see the traces from the services; we have checkout and payments. If we look at one trace in particular, we can see that we have three spans: the first span is the checkout, when the checkout service receives the request; then we have a checkout GET action, where the checkout service makes an HTTP request to the payment service; and then we have the payment service receiving the request. Since the payment service does nothing further, we do not have any continuation span here, but if we were consuming a Kafka message or doing another job, we would see that span as well. We can look at those spans and see some useful information, like the user agent: in this case Vegeta is written in Go, so the user agent is the Go HTTP client. We have the host port, the peer IP, the OpenTelemetry library name, and also a few process tags such as the SDK version and the auto-instrumentation version. The same is true for the payment service; the difference there is that the user agent is Python, because the caller, the checkout service, is a Python server and not Golang. Okay, cool. As we can see, with the OpenTelemetry strategy we have more details about the service internals, something we did not have with the proxy strategy, because there we were collecting telemetry data from the layer above. We also have the system architecture view, as we can see here, very useful, as in the other demo. And since we are using the OpenTelemetry span metrics and creating metrics from traces, we can leverage another Jaeger feature, the Monitor feature: an APM-like view by service and operation. For each service we can see the request rate, the p99 latency, the actions, and also the impact of each action on the service, that is, whether an action is used more or less. This is very nice and very useful for quick troubleshooting. If you want to see this on your Jaeger page, note that Jaeger reads data from Prometheus to build this screen, which is kind of nice. Okay, so moving to Prometheus on port 9090, we have two metrics here. The first one is the calls metric, which is basically a rate of actions; in this case, since we are making HTTP requests, it is a rate of HTTP requests. We can run a rate over five minutes by HTTP method, status code and service name, and we can see it going up. Let me just reduce this a bit. Well, this is it. If we were running Kafka producers or consumers, we would see the same thing under calls, so it is important to understand this kind of operation. We can also, of course, include the span name, for example, to help understand which action is being executed in that part of the code.
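For reference, the two queries from this part of the demo look roughly like this. The metric and label names produced by the spanmetrics connector vary between versions and configured dimensions, so the names below are assumptions rather than the demo's exact ones.

```promql
# Request rate per method, status code and service (label names depend on configured dimensions)
sum by (http_method, http_status_code, service_name) (
  rate(calls_total[5m])
)

# p99 latency per service from the duration histogram (metric name depends on the spanmetrics setup)
histogram_quantile(
  0.99,
  sum by (le, service_name) (rate(duration_milliseconds_bucket[5m]))
)
```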
Another useful metric we have is the duration bucket, where we can measure p99 and p95, as we saw on the Jaeger screen. So let's sum by le and service name and use histogram_quantile for the p99. We can see that over the last five minutes the p99 for the checkout service is around eight milliseconds, and for the payment service around one and a half milliseconds. Those are very useful metrics and, again, we are doing all of this without any code change; we are just deploying an agent on the host or changing the Docker image, so this is very simple for someone on your platform team to execute, which is nice. We ensure that those metrics are produced using an open standard, the OpenTelemetry specification. Most vendors support OpenTelemetry data, and open source solutions support it as well. If you are using a vendor that does not support the OpenTelemetry format, I do recommend you move away from that vendor, because keeping proprietary telemetry data is not good for you. This is also nice because it does not matter which technology you use to build your service, Python, Golang or Java: all those metrics will have the same standard, the same labels and so on. That is very useful, especially if you want to build dynamic dashboards, dynamic SLOs and so on. Okay folks, I hope you enjoyed this demo; for me it is one of the coolest ones. Now we are switching to the third one, the eBPF strategy. All right folks, so this is the third strategy to not overload engineering teams, and it is very interesting because it uses eBPF. eBPF is an emerging technology, especially in the cloud native space; you may see a lot of products using eBPF for observability, security, networking and so on. For those who don't know, eBPF stands for extended Berkeley Packet Filter. BPF is very common in the Linux kernel, and eBPF is like BPF with some extra tuning and really cool extra features. The idea of eBPF is to extend your Linux kernel to trace, monitor and analyze system performance and behavior, so you can collect things happening at the kernel level and provide that insight and information to user space. eBPF is not such a new technology, and you can already see a few products leveraging it, a few more nowadays, but it is not widely adopted yet, for many reasons; people are still discovering it. The idea of this demo, folks, is basically the same: we are going to have an agent that is able to collect the signals we need, like metrics, traces and logs, at the application level, not only from the infrastructure; as in the other demos, we are using the same concept of collecting application-level observability. So let's move to VS Code. This demo is going to be a little bit different, because I'm focusing on a Kubernetes setup right now, since I'm going to use a solution that is Kubernetes based, but there are already many other options for those not using Kubernetes. What we have here is pretty simple: we are going to start a minikube cluster and then install Cilium on the cluster. What is Cilium?
Cilium is a container network interface (CNI) for Kubernetes. We have many CNIs, like Calico, and some cloud-specific ones like the AWS CNI, and Cilium is another CNI. Cilium is fully built on top of eBPF and uses eBPF for networking, load balancing and many other things, and now it is also using eBPF to provide observability inside your cluster. Using Cilium and its eBPF agent, we can collect metrics like TCP and HTTP networking metrics and so on, and we can also understand the network flow inside our cluster, very similar to the information we have in the Jaeger architecture diagram, but built from network flows instead of traces. The concept is the same; it is just another signal, another kind of information we can use to get the same insight. I have all the commands I need to install Cilium and the monitoring stack on the Kubernetes cluster, and we also deploy some Star Wars demo applications to generate traffic. I will not cover each command, as I didn't in the previous demos, but you are free to check this out later; all the source code will be available on GitHub and you have access to it. Okay folks, meanwhile, let's run make setup as we did for the other demos: we start a Kubernetes cluster and then install everything we need, like Cilium, Grafana, Prometheus and so on. As soon as it finishes, we will be back here. Okay, cool. Now we have all the components running on the cluster. If we run kubectl get pods -A to get the pods from all namespaces, we see all the pods we need in our cluster: we have the Cilium operator, which ensures that each node has a Cilium agent running, that Cilium is healthy, that it is collecting all the metrics and so on; we have a few Kubernetes pods like CoreDNS and etcd, which are pretty straightforward; and we have Hubble Relay and the Hubble UI, plus a few other Kubernetes pods as well. The thing here is that Cilium by default does not provide any kind of observability; the Cilium project works in a very specific way, as a Kubernetes CNI doing networking and load balancing, making sure that when you create a new pod, the pod gets an IP, the node gets an IP, talking to your cloud provider to get IPs from your network and so on. But the Cilium project has a sub-project named Hubble. Hubble is an observability solution, if I can call it that, that leverages Cilium to get the network flows from your pod communications and then extracts metrics and provides network visibility for your cluster and the applications you are running. That is what Cilium and Hubble are, and we are going to see them right now. For this we need to port-forward a few components to our machine. Let me show you the pods again and explain one more thing: each Kubernetes node has a Cilium agent running there, watching every network communication inside those nodes, and then Hubble takes the network flows and gives us the observability we need, creating metrics and so on.
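For reference, an installation along these lines enables Cilium with Hubble and its metrics. This is a hedged sketch using the Cilium Helm chart; the flag names follow recent chart versions, but the metric list, flags and ordering are assumptions, not the exact commands from the demo Makefile.

```sh
# Hypothetical install sketch: Cilium CNI with Hubble metrics enabled (values are assumptions)
minikube start --cni=false

helm repo add cilium https://helm.cilium.io
helm install cilium cilium/cilium \
  --namespace kube-system \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true \
  --set prometheus.enabled=true \
  --set operator.prometheus.enabled=true \
  --set hubble.metrics.enabled="{dns,drop,tcp,flow,icmp,http}"

# Wait for the agents, then verify that everything is running
kubectl -n kube-system rollout status ds/cilium
kubectl get pods -A
```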
And we have a few other things here: we have Grafana, and we also have Prometheus, because we need to store the time series that Hubble is creating. And we have the Death Star, the TIE fighter and the X-wing to generate some load inside the cluster, so we can mimic services communicating with each other: the TIE fighter and the X-wing make HTTP requests to the Death Star, and we are going to see that in action right now. Cool. So let's port-forward a few components, starting with the relay, using the make port-forward command, because we are running all of this from our machine; this is a Cilium requirement, because the Hubble UI needs to talk to Hubble Relay to get the information it needs. Now I'm going to run the make port-forward command for the UI. Meanwhile we can switch to the browser and open localhost on port 12000, I guess. Okay, this is the Hubble home page. You can see all the namespaces we have inside the cluster, and if we click on a namespace we can see all the applications running inside it and the traffic flowing between those applications. Let me show you another one, like kube-system, same idea; we have the Hubble UI and Hubble running there. What else? In the cilium-monitoring namespace we have Grafana; let me see if it loads: no flows found for now, because we don't have any traffic happening inside this namespace. But we can go back to default and see that the X-wing and the TIE fighter are talking to the Death Star, and we can see all the actions happening right now, like the POST to /v1/request-landing, and we can see them being forwarded. If we click on one of those, maybe down below, we can see a few details: when this communication happened, whether the action was tracked, whether it is ingress or egress, the source pod, so if you have many pods you can see where this network action is coming from, a few labels, the IP, the destination pod, the destination labels, a lot of useful information. Then we can apply a few filters here, for example by namespace: clicking here you see it adds label namespace=default, and that is how we are filtering. It is the same thing if we click on the service pod: we can see a few labels from the pod and from the destination. And that is basically it for the UI: we get some network information, but nothing very special, because from that application you cannot create any alerts or dashboards; it is more information about Cilium and Hubble than a proper observability solution. But the good part is that Cilium and Hubble export all that information, especially the metrics being created, to a time series database such as Prometheus. So let's move back to VS Code, open a new tab and port-forward Grafana. Now that Grafana is running, let's go back to the browser and open localhost:3000. We have a few dashboards in Grafana. The first dashboard is about the Cilium operator; this is not what we need, it is more related to the Cilium operator's health. What else?
We have the Cilium metrics dashboard, which is useful as well, but not what we need. We may see how much memory eBPF has been using, whether we have any eBPF errors, system calls, maps and so on, but this is not what we want, right? Let's see what else we have. We have two other dashboards, which are more related to Hubble and the metrics Hubble produces based on the network flows. We have the Hubble dashboard itself, where we can see the amount of flows; and the flows, folks, are all those things happening in the tab down below where I'm hovering the mouse: each communication between a source and a destination is considered a flow. So we can see the amount of flows we have and the type of flows we have, for example L7 network flows; in this case we have a few L7 flows because we are making HTTP requests. We can also see whether we are losing any packets. And that is it for this dashboard, which is nice because we already start seeing a few HTTP metrics created on top of the network flows Cilium is collecting, DNS and so on. But what is really nice is the Hubble L7 HTTP metrics by workload dashboard, where we can see the source workload, for example x-wing or tiefighter, and then the destination, in this case only the Death Star, since we don't have any other destination, and we can see metrics by status code, by source and destination and so on. Let's explore. We may also see latency, as we can see here, the p99 and the p95; we can build SLOs using these metrics, and we can build alerts for error rates, latency and so on. Just to show you, we can explore the labels on these metrics over the last five minutes: we have the destination namespace and workload, the destination IP as well, the protocol we are using, because it could be the Kafka protocol, HTTP, HTTP/2 or other kinds, the source workload and the source namespace, the method, and I guess we also have the status code somewhere here; I'm not 100% sure. Yes, we have the status and also the method. So we may create queries like: over the last five minutes, sum by source workload, code and method, something like that. And we may build an SLO like the failed requests divided by all the requests, filtering by status. I just have a typo here; oh, it is the other way around. And then we have something like that. Yes, I broke the query, and it shows zero because we don't have enough requests with errors, but you get the idea. So you can use all those metrics with the same standard we had in the other demos: it does not matter which technology you are using, Java, Golang or Ruby, we don't care; the metrics we are collecting are the same, the network flows we are collecting are the same, and we don't need to instrument our code to get all those metrics, which is nice. Again, this is at the application level, not the load balancer level; it is pretty close to your service. And that is the idea of Cilium, eBPF and so on. Okay, so let's head back to the slides.
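(For the record, the kind of Hubble query sketched in the demo looks roughly like this. Hubble's HTTP metric and label names depend on which metrics and context options are enabled, so the names below are assumptions.)

```promql
# HTTP request rate per source workload, status and method (names depend on Hubble's metric options)
sum by (source_workload, status, method) (
  rate(hubble_http_requests_total[5m])
)

# Error ratio usable as an SLO indicator: 5xx responses over all responses
sum(rate(hubble_http_requests_total{status=~"5.."}[5m]))
/
sum(rate(hubble_http_requests_total[5m]))
```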
I would like to add a few things here. I used Cilium as the solution for this demo, and the truth is that when I was building the demo, Cilium was the most mature technology for observability using eBPF. But nowadays we already have a few more technologies that I haven't tested yet, like the OpenTelemetry eBPF agent, which provides a few metrics, not as detailed as Cilium's, but I do believe this project is going to become very mature in a few weeks or months. The community is still working on Helm charts to make the agent installation easier, and the OpenTelemetry team is working to improve it. I know the community is building a few other eBPF-based solutions as well, but I haven't tested them yet. For sure, Cilium does not provide all the signals we need: we are not talking about traces and logs here, only about metrics. But this is a start. eBPF, as I said, is a continuously growing technology, so day after day we see the community evolving observability and security solutions using eBPF. I do believe this is a technology you must watch for your observability systems, not only for metrics and traces but also for profiling; we see profiling solutions like Parca doing CPU and memory profiling using eBPF. Okay folks, so this is the third strategy, the eBPF strategy. Again, the idea is to not change any line of code to add instrumentation, and to collect as much as we can using the platform. That was the last one, so let's now look at a table comparing all of them. Now you might be asking: what is the best solution, what is the best strategy to not overload my team? And this is the answer engineers usually hate: it depends. Of course, it depends on your infrastructure, on your team's knowledge, on how many people you have working on the platform, abstracting features and providing things as a service inside your company. So of course it always depends, but I listed a few things I think are important: technology agnostic, context propagation, environment agnostic, and which telemetry data each one provides. By technology agnostic I mean whether I have different implementations depending on the technology I'm using. The proxy and eBPF solutions are completely agnostic: it does not matter which technology you are using, the implementation is the same; with the proxy we collect at the proxy level, and eBPF does all the magic at the kernel level for you. The OpenTelemetry auto-instrumentation depends on the technology you are using: the example I showed you is a Python solution, so if you are using, I don't know, C#, it is another way to implement the same concept; if you are using Rust, another way; if you are using Node.js, another way as well. So the OpenTelemetry instrumentation depends on your technology, and you may have different implementations depending on your stack. But in the end it is worth it: you saw the power of the auto-instrumentation and the things we can do with the OpenTelemetry Collector, and I do believe you should give it a try. Next, which ones ensure context propagation: the proxy kind of does; it only ensures context propagation in our demo because we have an ingress proxy and an egress proxy, so I'm listening to all the traffic coming into my platform and all the traffic going out from my services.
The other thing is that we only see context between proxies; we don't see what happens inside the application, so we cannot use that information to troubleshoot problems inside the application. That is what I mean regarding traces. Regarding metrics, well, the proxy gives us the highest level possible, and the closer your proxies are to your customers, the more realistic the latencies and error rates you collect from your system will be. OpenTelemetry and eBPF, yes, they ensure context propagation: OpenTelemetry sees the calls going from one service to another and what happens inside the service, and eBPF does as well, although it depends on the solution you are using; Cilium does not do traces, but if you try another eBPF tracing solution, it is going to work. All three options are environment agnostic, so it does not matter if you are running Kubernetes, virtual machines or bare metal, it is going to work. Cilium itself only works on Kubernetes, because it is a Kubernetes CNI, but eBPF is not a Kubernetes-only technology; you can use the OpenTelemetry eBPF collector on your Linux machines. And they provide different kinds of telemetry data: we have logs, we have traces, and we have metrics across these three options. And folks, as I said, it depends: you need to understand your use case, your requirements and your capabilities to choose the best fit for you; one of them is going to provide what you need and the telemetry data you are looking for. Okay, cool. But there is another option: if I am able to implement every solution, why not do that and collect metrics at different levels? Use the proxy strategy to collect metrics at the proxy level, on the proxies closest to my customers, so I can measure latency very close to the customer; use auto-instrumentation to start collecting traces from my applications without any code change, providing all the information I need, and then use an OpenTelemetry Collector to process and enrich that telemetry data; and why not use eBPF as well, so I can collect network information and also application metrics at the kernel level, because the kernel is one of the best places to collect observability and security data? Then you have different points of view, and you can decide which metric is better for each level you are looking at. If you are looking at the network level, eBPF and the Cilium solution are probably better for you; if you are only looking at application-level metrics, OpenTelemetry is going to be better, and so on and so forth. You can use all those strategies to provide as many insights as you can for your engineering teams: they can build alerts, they can do whatever they want, or, using all those standard metrics, you can automatically provide them dashboards and alerting out of the box. And this is the great thing about these strategies: the teams get default observability for their services without any code changes, so they can focus on delivering features and making the product owner happy, and the customer happy as well, increasing revenue.
And then, when they really need to put effort into observability, they will add observability for their own context, for their specific use case. Cool. So folks, that's it; we can also try to combine all three. That's it from my side. I really hope you enjoyed it. If you have any questions, feel free to ping me; it will be my pleasure to talk to you and have a nice conversation about cloud native, observability and many other subjects. Thank you for being here and thank you for listening. See you, folks.
...

Nicolas Takashi

Senior Platform Engineer @ Coralogix



