Conf42 Cloud Native 2024 - Online

Journeying through the Tales of Telemetry - A Dive into the OpenTelemetry Framework

Abstract

In the vast landscape of modern computing, understanding the inner workings of our systems is paramount. Observability and telemetry play pivotal roles in this quest for insight, enabling us to unravel the mysteries within software and infrastructure.

Summary

  • Folks at Conf42, thanks so much for joining my talk today. What is the importance of having observability in your system, and what are some of its key pillars? I'm also going to talk about OpenTelemetry and how it works.
  • Siddhant is a developer advocate at SigLens and a co-organizer of Cloud Native Community Groups Nashik. He says that if a server goes down, it can lead to a cascade of other errors. This causes a lot of unhappy users and costs business value and revenue.
  • Observability allows you to get deep insights into your system. It's also useful for debugging issues and predicting future ones. There are three main pillars of observability: logs, metrics and traces.
  • OpenTelemetry is simply a framework which you can use for implementing observability within your systems. A lot of the existing standards have pretty much all been merged into OpenTelemetry. There's also a really useful feature, compatible with just a few languages for now, called auto-instrumentation: it tries to automatically instrument all of your code.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Folks at Conf42, thanks so much for joining my talk today. Today I want to talk with you about observability: what is the importance of having observability in your system, and what are some of its key pillars. I'm also going to talk to you about OpenTelemetry, how it works, and where it fits within this entire observability system. So let's get started with that.

A little bit about me: I'm Siddhant. I'm a developer advocate at SigLens and I'm also a co-organizer at Cloud Native Community Groups Nashik. Along with that, I'm also a community manager at a couple of tech communities. Now if that didn't make it obvious already, I'm a huge geek when it comes to tech. I love talking about Linux as well as Kubernetes, and I've also started to geek out about various books and about health. If you want to connect with me after this talk, feel free to find me on my socials.

Now before I talk about anything technical, let's imagine a scenario. You have a server. Let's say it's running on-prem in your own data center, and on top of the server you have your applications running. You've got your healthy applications, you've got your healthy databases, and all of your users or your customers are able to properly access your applications. And everybody is happy. The developers are happy, your operations teams are happy, and most of all your customers are happy.

Now all of a sudden something happens and your server goes down for whatever reason, maybe because of a power failure, maybe because there's too much load on the system, or it could be a plethora of different reasons. Your server has gone down. That means it's going to lead to a cascading number of other errors. You're soon going to start seeing some faults in your application and your database as well. In the extreme worst case, your data might be completely lost. You might start losing old data as well as new incoming data.
And this is going to cause a lot of unhappy users, which at the end of the day is going to cost a lot in terms of business value and revenue as well. And none of us want that, right?

And this entire thing has happened at 3:00 a.m., and your engineers, your operations folks, are getting a ton of calls as to what's wrong with the server. They are working tirelessly to try and bring the server back up, but they have absolutely no idea what's wrong with the servers. This is because you haven't put in any sort of way to get visibility into your servers and what's happening within them in the first place, right? So that's what we're going to talk about today. That's where observability comes into play and can really help you out.

So observability, in a nutshell, allows you to get deep insights into your system, and it allows you to use all of this data to evaluate your performance and improve it as well. It's also useful for debugging issues and predicting any sort of future issues. Let me give you an example about predicting issues. Let's say you're an e-commerce website and you have spikes during certain periods of the day, right? So in this case, whenever you have a spike you would require more resources. So you need to allocate more resources or provision more resources from your cloud provider in order to maintain a healthy uptime.

There are also three main pillars of observability. But before getting into that, let me try and give you an analogy to understand observability. Let's say you drive a car and you do your own repairs and maintenance for your car. Now if you did not have the correct tools, how are you going to know what's going on within your car? If you try to understand what's happening in your entire engine without ever opening your bonnet, you have no idea what's going to happen. For example, let's say your tire has low air in it. You wouldn't know that unless you have the right tools, correct?
That's exactly what we want to achieve with observability, but for our software and for our servers.

Now talking about the key pillars of observability, we have got three key pillars. The first one of them is logs. Logs are simply timestamped data with some information about an event that has happened. For example, my application could have thrown me a warning at 2:15 a.m. Now obviously nobody is going to sit at 2:15 a.m. and continuously monitor the logs, right? So that's where we generate logs using some sort of observability tool and we store them in some sort of observability backend. More on this later.

The second pillar of observability is metrics. Now metrics involve things like your CPU utilization over a long period of time, your memory utilization over a period of time, and other similar things. It can also include things like your HTTP requests as well: how many requests were dropped, how many were accepted, and similar types of details.

Next, we have traces. Now traces are useful for figuring out the performance of your application. In this diagram, you can see that for going from A to B, it takes 50 milliseconds. Now, A and B are simply some function calls. So function A makes a call to function B, and it takes 50 milliseconds for that entire process to wrap up. Then function B tries to go to function C, function C tries to go to function D, and so on. Each of these timed steps, the communication between one function and the next, is what we call a span, and the entire process from start to completion is what we call a trace.

So now let's talk a little bit about what OpenTelemetry is. Now, OpenTelemetry is simply a framework which you can use for implementing observability within your systems. To give you a little bit of a backstory, before the introduction of OpenTelemetry, there were around 14 or 15 different standards for observability.
If you come from the web development world, you know how much of a pain this can be, having multiple standards for the exact same thing. When OpenTelemetry was created, the project had an aim of unifying all these standards, and it has so far achieved this goal. A lot of the existing standards have pretty much all been merged into OpenTelemetry, and OpenTelemetry is becoming the de facto standard for observability. Moving on, I might refer to OpenTelemetry as OTel, which is just an abbreviation, a short form for saying OpenTelemetry.

Now, OpenTelemetry works in a couple of different ways. If you look on the left, in the entire microservices column, that is where you actually try and instrument your code. So OpenTelemetry has a couple of developer kits, SDKs. Using those SDKs, you instrument your code to say: hey, this is my function, and I want OpenTelemetry to tell me how much time it takes to go from this one function to the second function to this third function or whatever. So for that you have the OTel SDKs, and you have the OTel APIs as well. There's also a really useful feature, compatible with just a few languages for now, called OpenTelemetry auto-instrumentation. It's just as it sounds: it tries to automatically instrument all of your code. For now, I have seen it work with Node.js, but there are a couple of other languages which it supports as well.

Then you can also use OpenTelemetry for your infrastructure. For example, if you are running OpenTelemetry on a VM, you can collect the system logs, the memory usage, the RAM usage, the system calls, et cetera. You can do the same thing with Kubernetes as well. There is the OpenTelemetry Operator for Kubernetes, and you can install it with a simple Helm chart.

Now, once you have instrumented your code or your infrastructure, you have something called the OpenTelemetry Collector. The collector simply acts as a way to collect all of your telemetry data. Telemetry data and observability data are the same thing.
So telemetry data includes all three things: logs, traces and metrics. Once the OpenTelemetry Collector collects all of that telemetry data and processes it properly, it sends it to an observability front end, or rather an observability backend. Now, this backend can be something like Grafana or Loki or SigLens, for example, which helps you store all of this data, filter through all of this data, create graphs, create some sort of analytics and so on.

Now, how does the OpenTelemetry Collector actually work? So here I'm taking an example with Argo CD. If you don't know what Argo CD is, it's basically a tool which allows you to implement GitOps within your entire software workflow. Now, to talk about how the entire OpenTelemetry Collector works: Argo, within its code base, has some sort of built-in mechanisms for collecting this telemetry data. You simply need to configure within it that the endpoint where you want this telemetry data to go is OpenTelemetry, and that it exists over here.

Once that is done, Argo will send the information to an OpenTelemetry receiver. The receiver's job is simply to receive whatever data is being sent by this external source. It doesn't necessarily have to be Argo. It could be a number of other applications; it could even be a custom application.

Once the receiver has the data, it'll send it over to the processor. Now, the processor is used for adding some additional data onto the existing telemetry data. For example, I've gotten a warning log from Argo that, hey, your deployment has failed for whatever reason. Now, I can configure the processor in such a way that it will attach some details about CPU usage, memory usage, maybe some batch process that's happening in the background, to this particular log which I've received from Argo.

Once those entire processes are done, the next step is the exporter. The exporter's job is simply to send the data to some sort of observability backend.
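
The receiver, processor, and exporter stages described above can be modelled with three small functions. This is only a conceptual sketch in plain Python, not the collector's actual implementation or API: the function names, the record layout, the `backend_storage` list, and the CPU and memory numbers are all assumptions invented for the example.

```python
# Hypothetical stand-in for an observability backend's storage.
backend_storage = []


def receive(raw_record):
    """Receiver: accept a record from any source and normalise its shape."""
    return {
        "source": raw_record["source"],
        "severity": raw_record["severity"],
        "body": raw_record["body"],
        "attributes": {},
    }


def enrich_with_host_stats(record, host_stats):
    """Processor: attach extra context to an existing record."""
    enriched = dict(record)
    enriched["attributes"] = {**record["attributes"], **host_stats}
    return enriched


def export(record, storage):
    """Exporter: hand the record to the backend; the pipeline keeps no copy."""
    storage.append(record)


# A warning log as it might arrive from a tool such as Argo CD.
incoming = {"source": "argocd", "severity": "WARN", "body": "deployment failed"}
host_stats = {"cpu_percent": 91.5, "memory_percent": 73.0}  # made-up values

record = receive(incoming)
record = enrich_with_host_stats(record, host_stats)
export(record, backend_storage)

print(backend_storage[0])
```

Nothing in this sketch holds on to the record once `export` has run, which mirrors the point that the data has to land in a backend to be kept.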
OpenTelemetry does not store any of the data which it collects. It'll collect it, it'll process it, and its job is done. If you don't send it to any sort of external observability storage area or an observability backend, this data is going to be lost. So that's where the exporter's job comes into play. The exporter will send the data to an observability backend, such as SigLens for example. There are other options as well, but I'll take SigLens as the example here.

Now, if you want to get started with OpenTelemetry, there are two ways depending on who you are. If you are a developer, then you can use the API of OpenTelemetry. It also has a number of different SDKs which you can use for instrumenting your code, and you can find all of that on OpenTelemetry's website. And if you're an operations person, a system administrator, there's a completely different roadmap for you. I'm going to talk about it from an administrator's perspective, since that's the field that I have some experience with.

So as an administrator, or rather an operations person, you have two ways to install it. You can either install it using Docker Compose, which you can find in the documentation again; I'm taking the Docker way and assuming that you want to run it on a simple VM. The second way is using Helm. Helm is a package manager for Kubernetes, so if you want to install OpenTelemetry onto Kubernetes, you would do that using Helm. Now, there are a lot more in-depth processes, such as the OpenTelemetry Operator and all that comes into play for Kubernetes, but that goes outside the scope of this talk, so we're going to skip that for now.

Once you have installed OpenTelemetry using Docker or using Helm, this configuration that you can see on the screen is what an OpenTelemetry configuration might look like. So let's just quickly take a look through all of these configuration options.
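
The slide itself is not reproduced in this transcript. As a rough stand-in, the sketch below writes out the general shape of such a collector configuration as a Python dict that mirrors the layout of the YAML file. Only the HTTP port 4318 comes from the talk; the gRPC port 4317 is the conventional OTLP default, the `health_check` extension is included as one example of an extension, and the exporter endpoint `backend.example.com:4317` is a placeholder, not a real service. The `undefined_components` helper is a hypothetical check written for this example and is not part of OpenTelemetry.

```python
# Mirrors the layout of a collector YAML config as a Python dict.
collector_config = {
    "receivers": {
        "otlp": {
            "protocols": {
                "grpc": {"endpoint": "localhost:4317"},
                "http": {"endpoint": "localhost:4318"},
            }
        }
    },
    "processors": {
        "batch": {},
    },
    "exporters": {
        # Placeholder endpoint; point this at your own backend.
        "otlp": {"endpoint": "backend.example.com:4317"},
    },
    "extensions": {
        "health_check": {},
    },
    "service": {
        "extensions": ["health_check"],
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["batch"],  # processors run in list order
                "exporters": ["otlp"],
            }
        },
    },
}


def undefined_components(config):
    """Return (pipeline, section, name) for each pipeline entry that is never defined."""
    missing = []
    for pipeline_name, pipeline in config["service"]["pipelines"].items():
        for section in ("receivers", "processors", "exporters"):
            for component in pipeline.get(section, []):
                if component not in config.get(section, {}):
                    missing.append((pipeline_name, section, component))
    return missing


# Prints an empty list: every component the pipeline names is defined.
print(undefined_components(collector_config))
```

Keeping the definitions (`receivers`, `processors`, `exporters`) separate from the `service` section is what lets one pipeline pick which components to use and in which order.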
First, you can see that we have the receivers, and over there you're simply mentioning the type of protocol that you're receiving the data with, whether it's via gRPC or HTTP, and you are mentioning an endpoint as well. So in here, for the HTTP protocol specifically, OpenTelemetry is expecting some data to come in from localhost on port 4318.

Now after that, the telemetry data which the receiver has gotten is going to head over to the processor. The processor will have its own thing. Over here you have a batch processor. You can attach a number of other processors as well, and all of that you can find either in the documentation or on OpenTelemetry's GitHub page.

The next thing we have is the exporters, which again I mentioned earlier. The exporter is simply going to send the data to some external storage. Now over here you have an exporter type of OTLP. This means that the data which is getting exported is going to be in the format that OpenTelemetry supports, and it goes to this endpoint. Now over here, this is just an example endpoint, but this can be absolutely anything as long as the observability backend supports the OpenTelemetry endpoint, and at this point pretty much every single observability backend supports the OpenTelemetry data format. Then you also have some extensions and service checks which you can put into place.

Talking about the pipeline, this is something important. The pipeline is simply how you want all of the data to be formatted. So for example, for traces, I first want my OpenTelemetry traces to be collected. Then I want my processors to run in this order. Now since over here we just have one single processor, that's the only thing which is mentioned. But if we had a couple of other processors, for example CPU utilization and memory utilization, we would put them here in whichever order we wanted. So if I wanted CPU utilization first and then information about batch, that's how my order would be.
If I wanted the details of batch first, my batch would come first, and then I would mention my CPU usage categories.

And yeah, that's the end of my talk. Thank you so much for being patient and listening to this talk. I hope you found it useful and informative. If you want to go ahead and try out OpenTelemetry, feel free to check out its website, and you can even use SigLens as one of its backends, the observability backend where you store all of the data. This is the website where you can find out everything about SigLens. If you found this informative, please do let me know and share about this on socials as well. And yeah, looking forward to connecting with all of you. Lovely audience, thanks for listening to my talk.

Siddhant Khisty

Developer Advocate @ SigLens
