Conf42 Kube Native 2023 - Online

Observability Ecosystem in Kubernetes: Metrics, Logs, and Traces with open source tools


Abstract

Dive into Kubernetes’ Observability Ecosystem: Explore metrics, logs, and traces for holistic insights into your clusters. Enhance performance and troubleshooting with real-time data.

Summary

  • Jhonnatan Gil: This talk is about observability in Kubernetes; we also go in depth on metrics, logs, and traces with open source tools. Life is really simple, but we insist on making it complicated. Join this awesome journey for today.
  • The talk starts with the Kubernetes architecture: we take a look at the control plane and at the worker nodes. The pod is the smallest unit we can inspect inside Kubernetes, but it is the most important one. What gets deployed depends on the kind of objects you implement in your clusters.
  • We then go deeper into the observability golden triangle. The second pillar is metrics, the quantitative information about the components that support your application. If we don't watch these components inside our application, that is a huge problem.
  • From the Spidey guy we move to instrumentation: how we can use physical devices to obtain information about the environment. The second topic is OpenTelemetry, a pretty huge project with a big community behind it.
  • With OpenTelemetry you have two possibilities for instrumenting your application; the second one is manual instrumentation. eBPF is more focused on implementation and efficiency. If you want to go deeper, you can check the referenced article and other resources.
  • Currently I deploy all my infrastructure on AWS. On that cluster we install Helm, a package manager for Kubernetes, and use it to deploy the OpenTelemetry demo as a release called my-otel-demo. We talk about the architecture behind its services.
  • In this demo we look at the configuration side on Kubernetes: what the behavior is for every request that comes from our application, and for the flows ingested by Prometheus and by Jaeger. It's pretty awesome because it also enables observability for the components deployed for that.
  • That is a brief summary of what we covered today: metrics, logs, and traces with open source tools; the what, why, where, and how of observability; and the instrumentation side, which is part of the how. So thank you, I appreciate your time.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Okay guys, thank you for joining this awesome talk. We will take a quick look, or maybe not so quick, at observability, focused on Kubernetes, which is the scope of this talk, and then go deeper into how we can instrument and observe our applications. So please join this awesome journey for today. That is me, Jhonnatan Gil, so let me change the slide here. In this little presentation we talk about observability, and this is me, Jhonnatan Gil Chaves. So who is Jhonnatan, the person behind those usernames? Just a human that loves sharing knowledge about instrumentation, about Linux, about the ecosystem, about cloud, about DevOps, about SRE, about anything I have learned and love to share. Those are my social networks on GitHub, on YouTube, and on Twitter/X. And I love this little quote: life is really simple, but we insist on making it complicated. That is really true, because when we start this IT journey we face a lot of challenges every day. So that is our proposition for this talk: observability in Kubernetes, going in depth on metrics, logs, and traces with open source tools. The open source tools are the best part for our companies and for every part of the implementation. Here is the content for today: the talk's scope, an introduction to observability, what the CNCF is, which open source tools we can look at, the challenges of doing observability on Kubernetes, and a little demo at the end.
We start with the introduction. We need a quick look at the Kubernetes architecture, at how Kubernetes works internally, because we need to understand the control plane and the worker nodes. If we deploy the cluster on a cloud provider, the cluster also communicates with that provider, and we get a few extra components. Looking more closely at the worker node internals: inside the node we have the container runtime, Docker, containerd, or something like that. That belongs to the node, not to the control plane. The control plane is the brain behind everything, especially etcd and the scheduler: through the reconciliation process they decide how our workloads get deployed inside the Kubernetes cluster, which is a pretty awesome job. Then each node runs the kubelet and kube-proxy. The kubelet is how the node responds to requests from the control plane, and kube-proxy is how one process talks to another, I mean how we route one port to another port, or one service to another service; that is the proxy deployed inside Kubernetes. The other scope here is the pod. The pod is the smallest unit we can inspect inside Kubernetes, but I think it is the most important one, because it is our application: it is what we create, whether through a DaemonSet or through custom CRDs, depending on our needs. At the end of the day, it is the workload being executed, and it contains the applications that serve all our customers. A pod can hold one container, or more than one container, inside.
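As a quick illustration of the pieces just described, here is a minimal sketch, assuming you already have kubectl pointing at a cluster; the pod name and image below are placeholders chosen only for the example:

    # Worker nodes and their container runtime (Docker, containerd, ...)
    kubectl get nodes -o wide
    # Control-plane companions and add-ons visible through the API
    kubectl get pods -n kube-system
    # The pod really is the minimal deployable unit: a tiny example manifest
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: hello-pod          # hypothetical name, for illustration only
    spec:
      containers:
      - name: app
        image: nginx:1.25      # any container image would do here
        ports:
        - containerPort: 80
    EOF
    # Shows the containers, network, and volumes attached to the pod
    kubectl describe pod hello-pod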
Inside the pod we have one part for the network and another part for the volumes you attach to it, depending on what you need to deploy inside Kubernetes. We also have Kubernetes add-ons, like DNS and the metrics server, that you add to the cluster so components can be exposed, communicate, and relate to other tools. And we have DaemonSets, Jobs, and other Kubernetes objects that end up deployed on the nodes; it depends on exactly which objects you implement in your clusters, right? So that is the big picture of the Kubernetes architecture, and there are plenty of challenges if you try to observe what is happening in your cluster and in your application. The focus for today and for this talk is the pod, because that is where our application lives.
Now a quick look at what observability is. Observability is a property of the system: it lets us take actionable insight about our application, so we can understand the system, its current state, how it depends on external inputs, and how those inputs modify the application's internal behavior. I took this definition from the CNCF glossary, which we look at in more depth in this talk. Here we also define o11y: the "o", eleven characters, and the "y". Very fancy, and very similar to the Kubernetes shorthand, k8s. If you read articles about observability you will often find o11y used in the same place; it is very common, depending on your teams and on the background you have internally.
Moving on, we go deeper into the observability golden triangle. This is very important, because here we define what kind of information we can get from every application. On one side we have logs: unstructured data the application emits as it executes. When we run an application it generates a lot of data, shaped by the framework and language it is implemented in, so you get the application name and other details, and you can store and retrieve that information. That is one pillar. The second pillar is metrics, the quantitative information about the components that support your application. Which components? The CPU, the memory, and the disk you attach, in this case, to the pod. And then we have traces, which could be the biggest challenge: a request comes into our application, and we need to measure how that request jumps to another application, see what happened with it and with the responses from the downstream systems, and also look at the communication between them. We need to figure out how to adopt that challenge and bring it to our teams so we get deeper information from our systems. But we need to look at these three kinds of information for our applications.
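To make the three pillars concrete, here is a minimal sketch of where each signal shows up on a Kubernetes cluster out of the box, assuming the metrics-server add-on mentioned earlier is installed; the workload name is a placeholder:

    # Logs: the unstructured output the application writes as it runs
    kubectl logs deploy/frontend --tail=20    # 'frontend' is a hypothetical workload name
    # Metrics: quantitative CPU and memory data per pod (requires metrics-server)
    kubectl top pod
    # Traces: not built into Kubernetes; they come from instrumentation,
    # for example the OpenTelemetry SDKs and Collector discussed later in this talk.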
So why do we need observability? That is an awesome question, so let's take a quick look at an application. There are our customers, and over here we have our cloud provider, or it could be on premises. A request comes to our firewall, jumps to the load balancer, jumps to the application, frontend or backend, and that application queries some data from the database and also gets information from our archive. That is a normal architecture you could deploy for your system. And here we have a challenge, because we have this Spidey guy, right? We build the application and split it into microservices, or services, or nanoservices, depending on the architectural definitions you have internally. So you have this challenge: we are the Spidey guy, but what happens on the day something fails? The database fails, the firewall fails, the application fails, the archive fails, our cloud provider fails, or the load balancer fails. How can we check whether these components are working well or not? We need eyes inside our infrastructure and inside our application, because if we don't watch these components, identifying the root cause becomes a very big problem. That is why we need observability: we don't naturally have those eyes inside the application, and usually not on the infrastructure side either, so we have to build those eyes for our little Spidey guy. We should pray for this little guy, because at the end of the day he is the one we rely on to know what is happening inside the application, right?
So we move from the Spidey guy to instrumentation. Why? Because we need a couple of definitions behind that. We talked about the pod and about this little architecture; when we move to instrumentation, Wikipedia tells us that instrumentation refers to physical devices used for measurement, for example on a farm or in a granary: how do we use these physical devices to obtain information about the environment? Is the environment fine or not, and what is happening on the farm, based on all the data we collect? We need to look at observability in the same way, because we need to instrument our application. We need to build that measurement automation for our application and define the physical, or in our case software, devices we use for this instrumentation. We need to define how we instrument the application, how we collect the logs, traces, and metrics for every part of it, how we send that information somewhere it can be correlated, and how we store it.
At this point we also need to look at the CNCF, the Cloud Native Computing Foundation, which is behind some of the biggest projects here. One of them is Kubernetes; the second is OpenTelemetry. These are pretty huge projects with the community behind them, supporting the community and supported by the community, and it's a very nice ecosystem, because when we talk about the community we are talking about some of the smartest people out there; that is the biggest challenge and also the biggest impact for, and from, the community. The CNCF sits under the Linux Foundation, which you may know from Linux itself, the open source operating system probably running behind your internal applications, your Docker hosts, your AI workloads, or something like that. The CNCF publishes the landscape at the address shown here, which you can explore, and it has a chapter for observability and analysis. When we explore it, we find a lot of tools, both paid and open source: at this moment 98 tools focused on monitoring, 21 tools focused on logging, and 18 tools for tracing. There are two more categories that I don't have in this presentation, but if you explore the landscape you will find one for chaos engineering and another for optimization. So when we start with observability we can enable a lot of capabilities behind the application, behind the infrastructure, and for our companies, and we can move quickly into other internal practices, because once you start seeing what happens inside your company, you can start to focus on areas you might otherwise never examine in depth.
With that, let's take a quick look at the open source tools. For instrumentation I recommend OpenTelemetry. You can use another SDK depending on your cloud provider: you add the SDK libraries, deploy them, and define with the development team how to start the instrumentation journey. But I recommend OpenTelemetry because it is quick, and it gives you two possibilities for instrumenting your application. The first is auto instrumentation, which is just adding one line, or in Kubernetes adding the sidecar pattern; it looks at your deployed application and starts collecting traces, metrics, and logs from it. The second is manual instrumentation, where you decide exactly how to collect the logs, traces, and metrics yourself; that can be more difficult, but it depends on your team's maturity with observability. I recommend OpenTelemetry, it's a nice framework. Also, for collecting logs I recommend the filelog approach supported by OpenTelemetry, for traces Jaeger, and for metrics Prometheus. So that is the path into observability.
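As a sketch of the auto-instrumentation path, here is roughly what "adding just one line" looks like for a Python service; app.py, the service name, and the Collector endpoint are assumptions for the example, not part of the demo:

    # Auto instrumentation: no code changes, just wrap the start command
    pip install opentelemetry-distro opentelemetry-exporter-otlp
    opentelemetry-bootstrap -a install          # detects installed libraries and adds matching instrumentations
    export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317   # assumed Collector address
    export OTEL_SERVICE_NAME=checkout           # placeholder service name
    opentelemetry-instrument python app.py      # app.py is a hypothetical service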
We instrument the application, the instrumentation applies to the application, and then the application generates logs, traces, and metrics. We export those signals through the OpenTelemetry framework, collect them, and view them in Grafana. On the other side, we have to decide how we store the logs and the traces, and what retention lifecycle we keep for that volume of data, one month, six months, one year, or something like that, depending on the company's objectives; and on top of the collected data you can build views in Grafana, right? So what about observability on Kubernetes? We have two approaches. One is OpenTelemetry, which I recommend, and the other is eBPF. eBPF is more focused on implementation and efficiency, and if you want to go deeper you can check the article I reference and read other resources. I read that article, and it explains the differences between the two sides pretty well. What I appreciate about OpenTelemetry is that it is very easy to use and compatible with all the languages your microservices might be implemented in. If you want to look at OpenTelemetry or eBPF, please feel free to do so and start this observability journey.
Now the best part: the demo, right? Let me leave presentation mode and take a quick look at the OpenTelemetry demo. The OpenTelemetry site describes how you can implement OpenTelemetry, and the architecture page shows the frameworks involved. This is the architecture deployed for the demo. On one side we have these microservices, written in .NET, C++, Erlang, Go, Java, JavaScript, Kotlin, PHP, Python, Ruby, Rust, and TypeScript, which are the common languages in the current landscape. With these microservices, the demo defines exactly how the data flows through the application. When we export and collect those signals, we have Prometheus on one side, which covers the resource metrics, CPU, memory, and disk, and Jaeger on the other side for traces. If you want to add logs, go ahead and look at that part of the instrumentation: how you store the logs and how you define the flow for them. So Prometheus collects the metrics from the microservices, and this part is the OpenTelemetry Collector. That is the configuration inside your deployment, and in it you define how you receive the data, how you process it, and how you export it from the OpenTelemetry configuration.
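To make the receive, process, export idea concrete, here is a minimal Collector configuration sketch in that shape; the ports and the Jaeger service name are assumptions for illustration, not necessarily the demo's exact values:

    cat <<'EOF' > otel-collector-config.yaml
    receivers:
      otlp:                       # applications send logs, metrics, and traces here over OTLP
        protocols:
          grpc:
          http:
    processors:
      batch:                      # group telemetry into batches before exporting
    exporters:
      prometheus:
        endpoint: "0.0.0.0:9464"  # Prometheus scrapes metrics from this port
      otlp/jaeger:
        endpoint: jaeger-collector:4317   # assumed Jaeger OTLP/gRPC endpoint
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp/jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
    EOF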
That is pretty awesome, because if you want to do a lift and shift, say I don't like Prometheus and want another tool, or I don't like Jaeger and want another tool, you can remove it from here, add your own flow, and move very quickly to the new tool. You don't need to change much inside your teams or your development team; you just change the configuration, which is great, because at the end of the day you remove that responsibility from the development team and hand it to the ops side. But you do need to understand what is happening with your traces and how you collect them for every signal you have in your system, right? On the Prometheus side, Prometheus receives the data at this URL, the data is stored in Prometheus's time-series database, and once it is stored, Prometheus exposes those metrics so you can see them in Grafana. It is the same for Jaeger: the traces arrive over the gRPC protocol, Jaeger stores them, and Jaeger exposes them to be consumed from Grafana. So you end up with something like a pipeline inside the OpenTelemetry configuration: you drop in one tool, Prometheus, drop in another tool, Jaeger, and use Grafana on top to explore your metrics and generate awesome dashboards you can share with your IT people, or even with your CTO or CEO. When you build a dashboard you need to think about which end user needs to see it, but that is another journey. So that is the architecture side, and that is the ingest flow, the telemetry data flow, the flow of signals into storage.
So let's take a quick look at the demo. The demo is currently on my GitHub, so you can explore it and reproduce it on your side as well. Currently I deploy all my infrastructure on AWS, so let me move quickly to the demo. I didn't run the cluster creation beforehand, so let's start the cluster: eksctl create cluster with the options here, and the cluster creation begins. Let's wait for the cluster to be created. Okay, here is the cluster that was just created using the eksctl CLI from AWS. We passed the cluster information: the version, 1.27, and the cluster name, conf42-kube-native. If we take a quick look in Elastic Kubernetes Service, the managed Kubernetes service on AWS, in a US region, we also have the possibility to check what kind of components were deployed by this command.
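The cluster creation shown in the recording boils down to a single eksctl command; a rough sketch follows, using the name and version mentioned in the talk, while the region and node count are assumptions for the example:

    # Roughly the command run in the demo (region and node sizing are assumed)
    eksctl create cluster \
      --name conf42-kube-native \
      --version 1.27 \
      --region us-east-2 \
      --nodes 2
    # eksctl provisions the control plane, a managed node group, and the VPC wiring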
This command deployed the whole cluster we talked about in the previous slides, the control plane and the worker nodes, plus the other capabilities AWS needs to manage the cluster, because when we deploy with eksctl, or through the EKS provisioning path in general, we hand some responsibilities over to AWS. That is pretty awesome, because when you define a cluster yourself you have to decide exactly which parts of the cluster to build. Let me move back, because I selected the wrong screen; going quickly to EKS again in the console, we can look at the cluster we created. At this point you can check what is happening with your cluster; it is still basically an empty cluster because we haven't dropped any components in yet. Meanwhile, let's check with kubectl get pods. When we send this, we connect to the API server and request the pods, and with the -A option, uppercase, we ask for all namespaces in Kubernetes, right? So here is the Kubernetes side, deployed on AWS, and you can see the command returning all the pods deployed in the cluster. There is a timeout from the API at first, which is uncommon, but keep in mind we created this cluster about twenty minutes ago. If we look at the compute tab we see the worker nodes, which are the part of the cluster where all the pods we deploy will run. So we continue with our cluster. You can use this alias, which is very helpful if you want to type commands more quickly: instead of the full kubectl get pods -A you just send k get pods -A, and the request reaches the API much faster.
Okay, we need another tool, called Helm. Helm is a package manager for Kubernetes, and we use it to deploy the components. We run this command: it adds a repo called open-telemetry, which pulls the Helm charts stored on the OpenTelemetry GitHub site. The repo already exists on my machine, so then helm repo update, because in some cases you need to refresh the references in your local environment; this does not touch the cluster yet, it just updates the repository metadata locally. Then comes the important command, the install. If you run the plain form you get the default Helm values, and it also depends on the Kubernetes version; we deployed Kubernetes 1.27. So on this cluster we install a Helm release that we will call my-otel-demo, using the open-telemetry/opentelemetry-demo chart at a specific version; those are the components that will be deployed inside Kubernetes. We need to wait a little here, because Helm has to download the chart at that version and start deploying all the packages inside Kubernetes.
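The commands from this part of the demo look roughly like this; the repository URL is the official OpenTelemetry Helm charts location, and the release name matches the one used in the talk:

    alias k='kubectl'                 # the alias mentioned above, so 'k get pods -A' works
    kubectl get pods -A               # list pods across all namespaces
    helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
    helm repo update                  # refresh chart metadata locally; the cluster is untouched
    helm install my-otel-demo open-telemetry/opentelemetry-demo   # optionally pin with --version <chart version>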
What kind of packages? We are talking about the pods and the services; a lot of components have to be deployed as part of the architecture defined by the OpenTelemetry demo. It is also very quick if you want to do this for your own applications: it is a one-shot install, which is awesome, and you can start your observability journey with OpenTelemetry right away. It is a very fast way to start a PoC or a spike, depending on the maturity of your teams. So we wait a little while for this to apply inside Kubernetes. When the components are deployed, we prepare the next command and the references inside Kubernetes. Once the installation finishes, we open access using port forwarding: we port-forward to the service from the my-otel-demo release, mapping local port 8080 to the service's internal port, and that grants us access to the endpoints returned by the Helm installation output. We open the first one, the web store, which is the application we deployed using the architecture shown earlier, and we also open Grafana, which is where we start checking what is happening. While the port forwards come up, we wait.
So that is our application, and every request it makes comes back to our web browser here; it is the web store, and as the products load, requests start being generated. When we move to Grafana, we look at the configuration first, just the data sources: we have the Jaeger data source we mentioned, which is for traces, and also Prometheus, which is for the CPU and memory side of our application, right? Then we take a quick look at the default dashboards created by this deployment; we open the four dashboards and browse through them. As you saw, the installation is very quick, just a couple of minutes to start your observability journey on Kubernetes. You can also start the journey with other tools, yes, but then you may need to dig deeper into these concepts and build the implementation into your application yourself, and that is something to review with your development team. So here we see information for a given service; we have the feature flag service, and we can jump to every service here, the graphs update for that service, and we can see what kind of service it is and how it was deployed inside Kubernetes. You can jump through and explore every service.
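To reproduce the access path from this step, here is a rough sketch; the service name follows the my-otel-demo release above and the URL paths follow the demo's documentation, but both can differ between chart versions:

    kubectl port-forward svc/my-otel-demo-frontendproxy 8080:8080
    # Then, in a browser:
    #   http://localhost:8080/          -> the web store generating the requests
    #   http://localhost:8080/grafana   -> Grafana with the Prometheus and Jaeger data sources
    #   http://localhost:8080/jaeger/ui -> the Jaeger UI used later for trace search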
For each service you can see what is happening with the CPU and the memory, the recommendations, the error rate, the service latency, the time from request to response, and the error rate that reflects the errors inside your application. It is pretty awesome, because with the traces coming from OpenTelemetry you get this dashboard, and if you add your own microservice you can drop it in, identify your service, and start the journey with the OpenTelemetry instrumentation you enabled. On this dashboard we also have information about the receivers, exactly the component we configured earlier. Let me move back; I closed that page, so let's load the OpenTelemetry page again, and here it is, the Collector configuration. We have one extra component deployed on Kubernetes, the OpenTelemetry Collector, so we need to see what happens with it: what the behavior is for every request that comes from our application, for the flow ingested by Prometheus and by Jaeger, what was requested, what was the response, and what was exported. It's pretty awesome, because it also enables observability for the components you deployed for observability itself. The next part is the traces pipeline, the complete OpenTelemetry Collector data flow: how the data jumps from one side to the other, the receiver here, the processor here, the batch here, and the exporter here, and the same for logging in OpenTelemetry. That is the traces pipeline, and here is the metrics pipeline, right? We get that information partly from traces and partly from metrics. And what about Prometheus? If we open this view, we see Prometheus collecting information from these services, which are all the pods deployed on the current cluster. So here we can see exactly what happened with these components for the configuration in that diagram: whether data was accepted or refused, the totals, the batch size, the totals per batch, and what happened with the logs in the OpenTelemetry configuration. Next is the span metrics dashboard, built from all the traces we currently get from OpenTelemetry, and there we see what is happening with each component: the hops, the request times, the endpoint latency for every component you currently have. It's pretty awesome, because when you start this journey you need to identify exactly which components sit behind your application and which of them you need to examine more deeply in your current implementation. So that is the configuration for that.
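The accepted, refused, and batch numbers on that dashboard come from the Collector's own self-telemetry, which Prometheus scrapes; here is a sketch of querying a couple of them directly, where the Prometheus service name and the otelcol_* metric names are assumptions that can vary by chart and Collector version:

    # Reach Prometheus locally (service name assumed from the my-otel-demo release)
    kubectl port-forward svc/my-otel-demo-prometheus-server 9090:9090 &
    # Spans accepted by the Collector's receiver over the last five minutes
    curl -s 'http://localhost:9090/api/v1/query' \
      --data-urlencode 'query=rate(otelcol_receiver_accepted_spans[5m])'
    # Spans handed off by the exporter (e.g. towards Jaeger)
    curl -s 'http://localhost:9090/api/v1/query' \
      --data-urlencode 'query=rate(otelcol_exporter_sent_spans[5m])'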
If you want to go deeper into these metrics, you can move over here, run a query, and generate the metric; in this case we use the request metric, run the query, and look at the data it returns. And if you want to see what happened with the hops, you can use Jaeger: load the Jaeger side, go to the search view, select a service, execute the query, and you start seeing what happened with each trace. When you select a trace, you can see which hops the request generated: let's open it here, this request generated another request to the checkout service, to the cart service, to the product catalog service, to the currency service, to the product catalog service again, and so on; those are all the hops you get with this instrumentation of your services. That is pretty awesome, I think, because at the end of the day you can use this demo to start your own journey into observability. So that is the demo part, and you can explore it more deeply on your side. And if you want to deploy a cluster, or if you already have one, you can deploy this on your current cluster; you don't need to create a new one.
So that is a brief summary of what we talked about today: metrics, logs, and traces with open source tools. We covered the Kubernetes architecture at a high level; the what, why, where, and how of observability, with the instrumentation side as part of the how; the golden triangle that we need to review with our development team; the CNCF, its scope, and the community for the community; and the open source tools we can use to start this awesome journey into observability on Kubernetes. So thank you, I appreciate your time, and I hope you learned something new. If you want to contact me, you can send me a message on GitHub, YouTube, or X/Twitter, as I mentioned; and here are the references you can read to go deeper into this research, the process, and the terminology behind it. I appreciate your time. See you soon. Thank you.

Jhonnatan Gil Chaves

DevOps Engineer @ Globant

Jhonnatan Gil Chaves on LinkedIn and Twitter


