Conf42 DevOps 2024 - Online

The Complete Handbook to OpenTelemetry Metrics

Abstract

You have heard of OpenTelemetry in the context of traces. But did you know OpenTelemetry also supports metrics with a forward-looking data model? This talk will demystify everything about OTel metrics: the data model, and the differences between OTel metrics and Prometheus.

Summary

  • Today I'm going to talk about the complete handbook to OpenTelemetry metrics. We'll discuss Prometheus and OpenTelemetry metrics and the key differences between them. What is the roadmap for supporting OTel metrics natively in Prometheus?
  • OpenTelemetry is a collection of APIs, SDKs and tools. You can use OTel to instrument, generate, collect and export your telemetry data to different storage backends. One key difference is that OTel supports all three signals; another is that it has no storage backend.
  • A goal of the OpenTelemetry metrics project was to connect metrics to other signals. The OTel Collector can receive data in different formats, including Prometheus and OTLP. The OTel Collector can also scrape metrics from /metrics endpoints, so it is a drop-in replacement for Prometheus when scraping your services.
  • Two processors apply to metrics and are the default processors. The first is batch, which ships metrics in batches to a storage backend. The second is the memory limiter processor, which periodically checks the collector's memory usage. Things are not that straightforward when it comes to shipping metrics to Prometheus.
  • OTel and Prometheus have different metric types. Some of them are compatible. Metrics that can't be converted are not supported when you ship them to Prometheus.
  • There are pros and cons of cumulative versus delta temporality. Some backends, like Datadog, recommend delta because it can improve performance on their side. The naming conventions are also different. Consider these trade-offs when choosing OTel metrics versus Prometheus metrics.
  • Prometheus is also working on adding a lot of other capabilities for OTel support, making OTel metrics first class. Some of these issues are slated to be completed by Prometheus 3.0.
  • To recap, we touched upon how OTel metrics differ from Prometheus metrics and what the Prometheus project plans to do to make using OpenTelemetry metrics more seamless and native. Check out Levitate, our OTel- and Prometheus-compatible TSDB for managing high-cardinality workloads.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello. Today I'm going to talk about the complete handbook to OpenTelemetry metrics. My name is Prathamesh. I work as a developer evangelist at Last9. At Last9, we are building tools for observability and monitoring, and we have a time series data warehouse called Levitate, which you can use for all your high-cardinality metric needs. You can store application, infra, business and product metrics. It's an end-to-end monitoring solution.
So, today's agenda. We'll talk about why you should care about this talk: if you're already doing something else, why should you think about OpenTelemetry metrics in the first place? We'll discuss Prometheus and OpenTelemetry metrics and the key differences between them. We'll talk about the OpenTelemetry Collector in detail, how it can help us and do a lot of heavy lifting before shipping our metrics to the destination. We'll of course touch upon some of the formatting changes between OTel metrics and the conversion to Prometheus metrics. We'll touch upon some advanced concepts like cumulative versus delta temporality, and then we'll see what is coming in the future: what is the roadmap for supporting OTel metrics natively in Prometheus?
First of all, why should you care about this talk? OpenTelemetry is getting a lot of attention and a lot of adoption as well. It's the second most popular project in the CNCF landscape after Kubernetes. The number of vendors supporting OpenTelemetry is increasing, and the number of adopters trying to use it is also increasing every day. It helps in several ways. It helps us bring standardization to our telemetry. It helps us with vendor neutrality, because we are not tied to one single vendor. Because of the API as well as spec standardization, we can try out different vendors for our data.
It also has the promise of signal correlation: it emits standard metadata for all signals, and naming conventions are defined, so you can correlate your logs, metrics and traces. It has SDKs and client libraries for a lot of languages, so you can start using them for OTel metrics. And other projects are also considering adding native support for OpenTelemetry metrics. There is experimental support in Prometheus for consuming OTel metrics, and there are some enhancements planned for the future as well, which we will discuss. So that's why I think you should listen to this talk and know more about OpenTelemetry metrics.
But before going to OpenTelemetry, let's talk about the existing standard for metrics, which is Prometheus. It allows us to scrape metrics from applications and other targets, then store them, run alerting on them, create graphs and so on. You can also optionally remote write these metrics to a long-term storage like Levitate for better reliability and performance. So that is the typical Prometheus architecture: you scrape metrics, based on a pull model, and optionally you can write them to a remote storage as well.
Most importantly, data is collected as cumulative data. Say at t1 the value was 1, and then at t2 the value became 4, because it increased by 3 between t1 and t2. The value that gets reported to Prometheus is 4, the cumulative value. Similarly, at t3, 7 will get reported, and at t4, 9 will get reported. This will play an important role later in the discussion, when I talk about how OpenTelemetry treats temporality slightly differently.
In terms of formats, Prometheus supports two. There is the text exposition format for metrics, and of course it also supports the OpenMetrics standard.
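
As a small illustration of the cumulative example above (1, 4, 7 and 9 reported at t1 to t4), the sketch below converts between cumulative values and the per-interval differences. The function names are hypothetical helpers written for this illustration; they are not part of Prometheus or OpenTelemetry.

```python
from itertools import accumulate
from typing import List


def cumulative_to_delta(values: List[float]) -> List[float]:
    """Return the increase between each pair of consecutive cumulative values."""
    return [curr - prev for prev, curr in zip(values, values[1:])]


def delta_to_cumulative(start: float, deltas: List[float]) -> List[float]:
    """Rebuild the cumulative series from a starting value and the increases."""
    return list(accumulate(deltas, initial=start))


# Values from the talk: what a cumulative system reports at t1..t4.
cumulative = [1, 4, 7, 9]
deltas = cumulative_to_delta(cumulative)
rebuilt = delta_to_cumulative(cumulative[0], deltas)
```

Here `deltas` holds the increases between successive reports, and `rebuilt` shows that the cumulative series can be recovered only if the starting value and every increase are available.
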
Values are floats: all of them are stored as floats by default. It is based on a multidimensional data model using labels, so you can have any number of labels on each time series, and their values can be anything. And in terms of getting the data, it is based on the pull-based scrape model. So those are the key characteristics of Prometheus.
What is OpenTelemetry? It is basically a collection of APIs, SDKs and tools. This is the standard definition that I have taken from the OpenTelemetry website. You can use OTel to instrument, generate and collect all your telemetry data, and even export it to different storage backends depending on your needs. One key difference here is that OTel supports all three signals, and another is that there is no storage backend. With OTel you have to use a vendor or an open source tool like Prometheus for storing these metrics and other telemetry signals. For metrics, Prometheus is the natural backend for your OTel metrics. There are other OTel-compatible metric backends as well: Levitate is one of them, and there are others such as Datadog and New Relic. But Prometheus is a natural choice because it has been the standard and the pioneer for metric-based platforms for a long time.
If you think about the OpenTelemetry architecture, there are standards and specifications, then there are SDKs and client libraries, and then there are tools which can help you collect this data, transform it, and export it to storage backends. We'll touch upon the standards as well as the SDKs, but let's look at the middleware tools first, because they play an extremely important role in the lifecycle of how metrics are collected and finally shipped to storage. There is a component called the OTel Collector, which connects the source of data to the destination backend.
It does the heavy lifting of transforming and processing the data. It can convert some parts of the data for interoperability, and finally it exports to compatible storage backends. There is an OTel Operator component as well, which allows you to run OTel Collectors at scale in Kubernetes. You can manage service discovery, and you can do auto-instrumentation of your services in a Kubernetes environment using the OTel Operator. In this talk we will not touch much on the OTel Operator, but we will focus a lot on the OTel Collector, because it plays a very important role in how OTel metrics can be used by the end user.
OpenTelemetry, as we said, does not have any storage backend. Prometheus is a good fit for storing your OTel metrics. It supports storing them, with certain caveats that we will discuss. There are other commercial backends as well, like Levitate, New Relic and Datadog, where you can send your OTel metrics.
The OpenTelemetry promise is this. Being vendor neutral means you can try out different platforms. It helps you standardize your metrics by providing semantic conventions, so all HTTP services follow the same naming conventions for their metrics. Default metadata is emitted for all metrics, so you can correlate your metrics with other signals such as traces and logs. It also makes certain design decisions which can help you improve performance in your pipelines, because the collector can do some of the heavy lifting of transforming metrics, or even dropping them to tame high-cardinality workloads.
It's important to know the goals of the metrics project, because they will also help you understand why we need to talk about OpenTelemetry metrics and Prometheus metrics in this way.
The idea behind the OpenTelemetry metrics project was to be able to connect metrics to other signals, because OTel's aim was to support all three signals of observability: metrics, logs and traces. At that time, the goal was also to give customers using OpenCensus metrics a way to migrate their projects to OpenTelemetry, and you will see a lot of OpenCensus influence in the OpenTelemetry design spec. And then, of course, working with existing metric standards like Prometheus and StatsD, so that interoperability exists: you can move data from one system to another, you can push OTel metrics to Prometheus, and so on. That was also one of the design goals.
Now, the OTel Collector. This is the diagram from the official website, and you have three layers. You have receivers, where you can receive data in different formats, including Prometheus and OTLP. Then you have a pipeline of processors, so you can run these data streams through different processors: a batch processor, an aggregation processor, transformation processors. At the end, after all of that is done, you can export to a storage backend of your choice.
If you compare it with the earlier diagram of the Prometheus architecture, this is how it looks with the OTel Collector. Some applications can push metrics in OTLP format to the collector. The OTel Collector can also scrape metrics from /metrics. So if you have standard applications that emit metrics on /metrics, the OTel Collector can scrape those metrics as they are, without changing anything. Then, of course, it can remote write them to a storage backend. It can also expose a /metrics endpoint, so you can have your own Prometheus scrape metrics from the OTel Collector.
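
The three layers described above (receive, process, export) can be pictured with a toy model. This is only a conceptual sketch in plain Python, not the collector's actual implementation or API; every name in it (`drop_metric`, `in_batches`, `run_pipeline`, the metric names) is made up for illustration.

```python
from typing import Callable, Dict, Iterator, List

Metric = Dict[str, object]
Processor = Callable[[List[Metric]], List[Metric]]
Exporter = Callable[[List[Metric]], None]


def drop_metric(name: str) -> Processor:
    """Build a processor that removes every metric with the given name."""
    def process(metrics: List[Metric]) -> List[Metric]:
        return [m for m in metrics if m["name"] != name]
    return process


def in_batches(metrics: List[Metric], size: int) -> Iterator[List[Metric]]:
    """Yield the metrics in consecutive groups of at most `size` items."""
    for start in range(0, len(metrics), size):
        yield metrics[start:start + size]


def run_pipeline(received: List[Metric], processors: List[Processor],
                 export: Exporter, batch_size: int = 2) -> None:
    """Run received metrics through each processor in order, then export in batches."""
    for process in processors:
        received = process(received)
    for batch in in_batches(received, batch_size):
        export(batch)


# Stand-in for what a receiver would hand to the pipeline.
received = [
    {"name": "cpu.usage", "value": 0.4},
    {"name": "memory.usage", "value": 512},
    {"name": "debug.noise", "value": 1},
    {"name": "http.requests", "value": 9},
]
exported: List[List[Metric]] = []  # stand-in for a storage backend
run_pipeline(received, [drop_metric("debug.noise")], exported.append)
exported_names = [[m["name"] for m in batch] for batch in exported]
```

In this toy run the unwanted metric is dropped before export, and the remaining three metrics reach the stand-in backend in two batches.
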
The OTel Collector can also push in OTLP format to other commercial storage backends. Recently, Prometheus has added the capability of accepting metrics pushed in OTLP format as well, and there are a lot of plans to support and improve this experience in the future. So that option also exists: your OTel Collector can now push to Prometheus in OTLP format.
Let's talk about receivers, because that is the entry point for getting metrics into the OTel Collector. There are two receivers that the collector supports. The Prometheus receiver is basically a drop-in replacement for your Prometheus when it comes to scraping your services: if you're using the OTel Collector with the Prometheus receiver, you don't need a Prometheus to scrape the data. You can just get rid of it, because the receiver will do the same job. It also supports service discovery the same way you have it in the scrape config of your Prometheus configuration. You just copy the same YAML configuration, put it under the OTel Collector stanza, and it will start working. There is also a simple receiver, which I didn't mention, but it is meant only for scraping a limited set of services, so it is not to be used for production workloads.
Talking about processors now. This is the most interesting part, because here we can literally rename metrics, drop metrics, perform memory analysis, do redaction, change attributes and all of that. There are two processors that are applicable to metrics, and they are the default processors. The first one is batch, which ships metrics in batches to a storage backend. Then there is the memory limiter processor, which performs periodic checks of the collector's memory usage and will begin refusing data and forcing GCs to reduce memory consumption when the limits have been exceeded.
All of this is just to make sure that your collector doesn't go down if you send too many metrics to it. And of course you can change these memory limits. Some of the limits are soft limits, and some are hard limits.
But the cool part about processors is that you can do a lot of other things as well, not just memory checks or sending data in batches. You can generate new metrics. Let's say you have two metrics: one is the CPU usage of a pod and the other is the CPU limit of a node. You can literally create a new metric for pod CPU utilization by dividing these two values. There are other mathematical operations that this processor supports as well: you can use addition, subtraction and multiplication to create new metrics.
There are other processors for transformation as well, and this is where things get really interesting. Your recording-rule use cases will probably not be needed anymore, because you can create or transform the metrics in the collector itself. You can rename them, drop them, aggregate them. So even if you have some high-cardinality metrics, you can create new aggregated metrics out of them using the processors for transformation and send the end result to the storage backend. That way a lot of heavy lifting is done at the collector layer itself, instead of on the storage layer or after it, as with recording rules. That is one of the key advantages of the OpenTelemetry Collector.
Exporters. After the processors are done and you have completed that pipeline, the next task is of course to ship these metrics to the storage backend. The collector provides three mechanisms. You can scrape metrics from the /metrics endpoint that the collector exposes. That way it works very similarly to how your Prometheus scrapes metrics from application targets.
Instead of application targets, it will scrape directly from the collector. You can also remote write from the collector to long-term storage like Levitate. And then there is the recently added experimental support for OTLP push, so you can push to Prometheus as well. As of today, Prometheus can accept data in OTLP push format, not just the pull format that we discussed earlier.
But things are not that straightforward when it comes to shipping metrics to Prometheus, and the reasons are many. The metric types are different. The specifications are different: the OTel metrics spec is not the same as the OpenMetrics format. There is the question of cumulative versus delta temporality, and both have their own pros and cons. Naming conventions are different: OTel has semantic conventions, and it has to make sure that those conventions are the same not just for metrics but for logs and traces as well, so that you can do signal correlation. Then there are out-of-order metrics and metadata. A lot of things are different between OTel metrics and Prometheus metrics, and all of them play an important role when you are thinking about shipping metrics to Prometheus.
So let's start with metric types. There are of course different metric types in OTel and Prometheus. Some of them are compatible, so there can be a one-to-one translation for certain data types. But then there are other types as well: in the case of OTel there are gauge, histogram and asynchronous counter, for example. Prometheus has its standard types, counter, gauge, summary and histogram, and the recently added sparse histogram. The Prometheus receiver and the exporters can convert the metrics that map one to one, but metrics that can't be converted are not supported when you ship them to Prometheus.
Let's talk about cumulative versus delta temporality. Take cumulative temporality first.
As we discussed earlier, in this case the overall value gets submitted to Prometheus. At t1 the value was 1, and at t2 the value became 4 because it increased by 3. With cumulative temporality, 4 is reported to the storage system. With delta temporality, 3 is reported, because 3 was the delta from the last value.
Now, there are pros and cons to both of these approaches. Cumulative temporality means that you have to maintain state, because you always increase the value from last time. So you have to keep state on the client side before pushing these values to the storage backend. Delta can be stateless, because you're just capturing the difference. You don't have to worry about keeping state at all. Certain backends support only cumulative: Prometheus supports cumulative, and as of now there is no support for delta. Other backends, like Datadog, recommend using delta because it can improve performance on their side. So it's a choice, and OTel leaves it up to you. Cumulative is the default, I think, but you can configure it in the applications themselves. If you change the configuration, of course, you have to restart the application so that it takes effect.
With delta temporality, if you somehow drop samples, that means data loss, because you cannot recover the values of the samples that were dropped. With cumulative that's not a problem, because over a longer window rate and other functions can absorb the missing values, and you don't have to worry about losing data.
The naming conventions are also different between OTel and Prometheus. OTel uses dot notation, so http.request.duration is the kind of name it uses.
Additionally, it enforces that every metric should have a unit. Prometheus does not use dots. It uses underscores in metric and label names, and sometimes the unit is part of the metric name itself. That is one of the key differences. The receivers and exporters can normalize OTel metric names into Prometheus names based on the naming conventions. They can also do it the other way round: when scraping data from Prometheus, they can convert Prometheus metric names to OTel naming conventions. Of course, they cannot scale the values or convert between units. They cannot convert milliseconds to seconds or kilograms to grams, for example. That is still a processor task. They only change the names. By the way, the naming changes also mean that if you have standard dashboards which rely on a certain style of naming, they may not work once you do these transformations or start using the OTel Collector to ship those metrics. That is one of the trade-offs you have to consider when choosing OTel metrics versus Prometheus metrics.
So, what have we discussed so far? As of today, there is a way to consume OpenTelemetry metrics and ship them to systems like Prometheus. There are some gotchas and convention rules that you have to follow. We discussed the collector in detail, covering all three of its aspects: receivers, processors and exporters. We also discussed the support recently added in Prometheus, which allows you to push OTel metrics as well, in an experimental way, apart from the pull mechanism that Prometheus is popularly known for.
But Prometheus is also working on adding a lot of other capabilities for OTel support and making OTel metrics first class, so that you can use Prometheus as the natural choice of backend for your OTel metrics.
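
To make the name normalization described above more concrete, here is a simplified sketch of turning a dotted OTel-style name plus a unit into an underscore-style Prometheus name. It is an approximation written for illustration, not the collector's actual translation code: the function name, the small unit table and the example metric names are assumptions, and the real rules cover more cases.

```python
import re

# Illustrative subset of unit abbreviations; the real translation table is larger.
UNIT_WORDS = {"s": "seconds", "ms": "milliseconds", "By": "bytes"}


def to_prometheus_name(otel_name: str, unit: str = "",
                       is_monotonic_counter: bool = False) -> str:
    """Approximate an OTel metric name as a Prometheus-style metric name."""
    # Dots (and any other unsupported characters) become underscores.
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    # The unit is appended to the name; the value itself is never rescaled.
    unit_word = UNIT_WORDS.get(unit, "")
    if unit_word and not name.endswith("_" + unit_word):
        name = f"{name}_{unit_word}"
    # Counters conventionally end in _total on the Prometheus side.
    if is_monotonic_counter and not name.endswith("_total"):
        name = f"{name}_total"
    return name


duration_name = to_prometheus_name("http.request.duration", unit="s")
counter_name = to_prometheus_name("http.requests", is_monotonic_counter=True)
```

Note that only the name changes. A duration recorded in milliseconds would keep its millisecond values and simply get a `_milliseconds` suffix, which is why existing dashboards that expect other names or units can break.
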
There are issues created for this OTel support in the Prometheus issue tracker. Some of the interesting ones are around out-of-order support. Prometheus earlier did not support ingesting out-of-order samples, but OTel pushes batches of metrics, so there can be cases where metrics are reported out of order, and Prometheus is now working on supporting that. Then there is UTF-8 support for label and metric names. This is also something that might be added in Prometheus, and it would solve the normalization problem we discussed earlier, where dots are converted into underscores and vice versa. Delta temporality support is also being discussed. While delta-to-cumulative conversion is not very easy, there are some discussions around supporting it natively in Prometheus. And then, of course, there is support for resource attribute metadata, so that it can be stored in Prometheus correctly and utilized later. There can be some performance improvements in all of this as well. All of these issues are tracked on the Prometheus issue tracker, and you can check their progress there. Some of them are also slated to be completed as part of the Prometheus 3.0 roadmap, as per the last dev summit discussion that I saw on YouTube. That is a very interesting update, and I think it will be a great addition for the community if all of these things are supported natively in Prometheus.
So, to recap: we started with why you should care about OpenTelemetry metrics. We touched upon how they are different from Prometheus, and what the key differences between Prometheus and OpenTelemetry are, not just in terms of the specification but also in terms of architecture. We also touched upon some of the things that OpenTelemetry can offload from Prometheus if we can ship metrics from OTel to Prometheus directly. We discussed the current state of the art of OpenTelemetry and Prometheus interoperability, and also what is upcoming: the plans of the Prometheus project to make the experience of using OpenTelemetry metrics more seamless and native.
So that's mostly what I have for today on OpenTelemetry metrics and how you can get started with them. You can find me on Twitter, where you can follow me or Last9, and of course check out Levitate, which is our OTel- and Prometheus-compatible TSDB for managing high-cardinality workloads. Thank you.

Prathamesh Sonpatki

Developer Evangelist @ Last9



