Conf42 Site Reliability Engineering 2022 - Online

Building Openmetrics Exporter

Abstract

Openmetrics-exporter, or OME, is an Observability-as-Code framework that reduces the toil of finding and combining useful metrics across the layers and hundreds of components involved in modern cloud-native systems. Every source, component, or metric is just a simple configuration file, because the only “code” you should focus on is the code for your customers.

It leverages a plugin architecture to support data sources. It relies heavily on data-frame processing to combine metrics from various sources before they are all converted into the OpenMetrics format, ready to be piped out by a Prometheus agent. Traditionally, such correlation and post-processing have been the responsibility of additional data pipelines, but with OME it’s as simple as writing a configuration file. At its core, OME uses the HashiCorp Configuration Language (HCL) to build a DSL that allows declarative input for building metric pipelines.

The talk is mainly about what you can solve using OME. But it also takes a concise behind-the-scenes journey: the need to build Openmetrics-exporter, picking a configuration language that is easily editable by humans, creating a DSL around it, and, more importantly, leveraging Golang for data-science needs.

Summary

  • All modern cloud components are built on complex layers before you even get to instrumenting your application code. To stay on top of it, you need instrumentation. One of the industrial ways of instrumenting and observing is metrics. But 40% of your metrics might never even be accessed.
  • OpenMetrics Exporter can connect to a cloud source and use it as an engine. It pulls data from CloudWatch and from PromQL and emits them together. We want to write code once and reuse it as many times as possible; the fundamental principle is reusability.
  • OpenMetrics Exporter was born at Last9, which helps people ship reliable software. The first step is building a consolidated metric layer: you can club these metrics together and emit them as a unified, homogeneous set under the same labels.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
What do you do with your metrics? You either save them or, mostly, visualize them. This talk is about a tool called OpenMetrics Exporter. I'm Piyush Verma, CTO and co-founder at last9.io. Is this talk about yet another exporter? Because OpenMetrics has probably fallen out of favor, in favor of OTel and a lot of other formats. Let me walk you through the journey of how and why instrumentation and metrics are important.
All modern cloud components are built on complex layers before you even get to instrumenting your application code. The application code that you deploy sits across layers of infrastructure: it is probably deployed on Kubernetes, that Kubernetes is running on a set of virtual machines supplied by the cloud, and maybe there is a disk attached underneath as well. All of these components, which are complex layers within themselves, are speaking with each other. Before a request reaches your application, it probably passes through a load balancer, then an ingress controller, then your pod, where there may be another Nginx running before it finally hits the application. Each of these complex layers and handshakes is breaking almost all the time. To stay on top of it, you need instrumentation. One of the industrial ways of instrumenting and observing is metrics, and one of the pioneering formats here is the OpenMetrics format.
But how do you observe such dynamic components and infrastructures these days? A typical observability landscape in the modern world looks like this; it is a typical example of any web application. You probably have a CDN somewhere. There are instances running some stateful workload, and there are pods, because almost everybody runs Kubernetes these days. There is some reverse proxy in the form of Ambassador or Traefik, and maybe there is an Nginx as an ingress controller as well. In the name of instrumentation, each of these layers is emitting data either as metrics or as logs. If you have metrics, well and good; if you have logs, you have to transform them into OpenMetrics. As for exporting these metrics: if the provider is already servicing these components, like CDNs or load balancers, you might get the data natively into the metric storage layer provided by the cloud. But if not, and it is infrastructure that you are managing yourself, you are probably running some kind of metric exporter somewhere, and all of it is unified into a metric sink.
Now, these metric sinks are also diverse. The cloud provider's metric sink is different from your metric sink, and you are consistently trying to build a correlation between them, because they are effectively talking about the same component or the layer underneath. All of this consolidation and aggregation is mostly done at runtime using a visual tool, which very famously is Grafana, almost as a de facto standard, with a human trying to interpret a dashboard that speaks to these diverse layers, which are effectively talking about the same thing.
Now, what do we observe here? These are certain challenges I want to raise, among the many other challenges of these metric sinks and data lakes. The first metric here is just an example: go_gc_duration_seconds. How often would you have seen this metric? Almost all the time; it shows up on nearly every metrics page emitted by a Prometheus Go client.
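For reference, this is roughly what that default Go runtime metric looks like in the Prometheus/OpenMetrics text exposition format on every scrape; the sample values below are illustrative, not taken from the talk:

    # HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
    # TYPE go_gc_duration_seconds summary
    go_gc_duration_seconds{quantile="0"} 2.6e-05
    go_gc_duration_seconds{quantile="0.5"} 4.2e-05
    go_gc_duration_seconds{quantile="1"} 0.000137
    go_gc_duration_seconds_sum 0.0042
    go_gc_duration_seconds_count 96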
But have you ever really alerted on this metric? If not, what is it doing there? If your scrape interval is 15 seconds, which is the Prometheus default, you are emitting this metric four times a minute, and every Golang binary out there is bringing such metrics in everywhere. You would be surprised to know, and it is something that we observed across our customers, that 40% of your metrics might never even be accessed. But they are all sitting there in those metric storage layers. You would have heard it at your org as well; I have heard it at many orgs, where the very common complaint is: hey, our dashboards are getting slower. Our dashboards are getting slower because there is too much data in them, or too many requests against them. Now, what is that data? Data that you don't even know about, data that you don't even use, because it comes for free from the exporters we actually use.
If we take a step back: how did we even get here? How do you observe, for example, an EKS cluster? There are some metrics about this EKS cluster that CloudWatch provides you, but there are other metrics which you emit yourself in the form of kube-state-metrics or cAdvisor. Very likely there is also a Prometheus sitting in there, which is in turn being observed via somebody's Grafana. You are emitting metrics in two formats about the exact same thing. Isn't that a problem? At the same time, do you even know if the exporter is lagging somewhere, or whether it has crashed? And because it is scraping data from CloudWatch in a pursuit to unify the metrics, it is just racking up the CloudWatch bill: every single run, every single time it pulls data, the bill keeps increasing. It is also probably dumping a plethora of metrics that we don't have any control over. If I were to ask a very standard question here: if, for example, I want to observe a SQL database, what metrics are even emitted by the Prometheus exporter? You are going to have a tough time even figuring out what is in there.
So if I were to summarize the key challenges: sources are consistently changing. We are in an infrastructure world where everything is ephemeral; servers go away and pods die by the hour, and infrastructure changes in the form of functions and Lambdas, which we literally don't have any control over. Fundamentally, there is no correlation between the entities we speak about. At the same time, there is a metric explosion happening, because there is a lot of unused information and literally no way of prioritizing urgent versus important versus unused metrics. And for every single new type of component that you have to observe, there is a new binary, a new exporter, that needs to run.
If only we could actually make this declarative. The cloud infrastructure world went through a transformation like this in the form of tools like Terraform, which made Infrastructure as Code. OpenMetrics Exporter proposes to make Observability as Code. It is a three-step process. Step one, you declare your observability: there are certain metrics that matter and others that do not, at all points of time. This is very subjective and domain-specific as well; the same Nginx could mean different things in different domains. You declare these, and you declare what types of components you want to observe. Step two, you run a plan locally on your computer and validate that the data is coming in.
Step three: if everything looks well, you dispatch it. Once you dispatch it, OpenMetrics Exporter will export all the metrics on a /openmetrics page, along with their timestamps. Once it is there, any standard agent, be it a Prometheus agent, a Grafana agent, or a vmagent, can carry these highly precise, selected metrics into the metric lake. What is sitting in your metric lake is now something that you duly alert on, that is important to you, and, more importantly, that you have selected.
The anatomy of an OpenMetrics Exporter file looks like this. You declare a scraper. A scraper is a first-class entity which can connect to a cloud source and has certain attributes; these attributes are what form the metrics. There are different metric kinds, gauges, vectors and histograms (a vector is pretty much a summary as well), and they carry dynamic labels. Every definition of a metric type carries a query inside it which is specific to a source. Once it speaks to a source, it can pull that data, and, more importantly, it can use this as an engine. If you pay attention, there is a "resources: each load balancer" line. What it effectively does is that once I have written this file, it is enough for it to run across any selection of load balancers that I may want. So I write this code once, but I run it multiple times, once for each load balancer.
We have spoken multiple times about correlation, but what exactly do we mean by it? Let's take the same example. There is load balancer data that I get from CloudWatch, say its throughput, but at the same time the latency is being tracked elsewhere. Assume that this latency is being emitted into Prometheus. I effectively want to emit both with the same label sets but different metric names. How do I do that as part of the same metric file? This is what OpenMetrics Exporter literally unlocks. I just add a new stanza, a gauge called latency, but this time around the source is PromQL: a standard PromQL query, with certain functions applied to it that manipulate the data. This entire scraper runs together. When it runs, it pulls data from CloudWatch and from PromQL and emits them together, which means that if it fails, both fail equally. There is unification, a correlation of data, happening here, and when it emits, it emits with the same label set. So if there is any change happening, the change is uniformly applied in both places.
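To make the shape of this concrete, here is a minimal sketch of what such a scraper file could look like in HCL. The scraper and gauge blocks, the PromQL-backed second gauge, and the per-load-balancer "resources" idea come from the talk; every attribute name, reference, and query below is an illustrative assumption, not the exact OME schema.

    # Illustrative sketch only; attribute names and references are assumptions.
    scraper "alb_golden_signals" {
      # Run this single definition once per load balancer in the selection.
      resources = "each.aws_alb"

      # Throughput comes from a CloudWatch query (the query is source-specific).
      gauge "throughput" {
        source = "cloudwatch"
        query  = "SUM(RequestCount) over 1m"   # placeholder CloudWatch query
        labels = { load_balancer = "each.name" }
      }

      # Latency is correlated in from Prometheus, emitted under the same label set.
      gauge "latency" {
        source = "promql"
        query  = "histogram_quantile(0.99, sum by (le, load_balancer) (rate(alb_request_duration_seconds_bucket[5m])))"
        labels = { load_balancer = "each.name" }
      }
    }

Because both gauges live in one scraper, a single run either emits both series with the same labels or fails as a unit, which is the correlation guarantee described above.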
Many times, and this is one of the fundamentally powerful features of OpenMetrics Exporter, the data at the source can constantly change. This applies to, say, a stateful system where you store your metrics, like SQL, where the data might be altered later. That is why lookback and lag are fundamental, first-class properties of any scraper: they let you stay on top of this changing data. Very rarely would you see them used together, but they can powerfully be used in conjunction with each other. Lookback means: how far back into the time frame should I look, so that anything that changes in the past can be emitted with its original timestamp. Most native exporters assume the time of scrape to be the time of insertion, but OpenMetrics Exporter retains the source timestamp across all sources. What that effectively means is that if, at any point in the future, a historical value has changed, the corrected value can actually make it into your metric sink. In certain situations you may not want to process a pipeline at all until the value has converged, and in such cases lag is your friend: you can apply a lag and the whole pipeline will not emit fresh data until that point in time.
One of the lessons we learned from cloud systems, and from trying to automate a lot of our cloud operations via Terraform, is that the fundamental patience test is the feedback loop, which is extremely painful and slow. We want to change that with OpenMetrics Exporter, so we added native Grafana support. Every OpenMetrics Exporter file, on every run, can result in a new dashboard which you can use to debug. It is the same example here, where we are observing an RDS instance. If we run it with two parameters, the first being the endpoint of the Grafana (a local Grafana in my case, though it could be a hosted one as well) and the second an API token, it actually emits a dashboard. Once I visit that dashboard, I can visually check whether everything I emitted looks correct: whether all the statistics are right, whether I accidentally used an average instead of a max or a sum, whether a query is misbehaving. Once I have confirmed it, I am happy to dispatch this to production.
Software engineering's fundamental principle is reusability: we want to write code once and reuse it as many times as possible. We brought reusability and modules in as first-class citizens in OpenMetrics Exporter. If we look at the same scraper files we have been writing for a while, all we need to change is one single keyword: instead of identifying the scraper uniquely, we replace the word with module. This very small change makes the entire selection of metrics that I have carefully crafted available as a blueprint to be run every time. It means that once I have identified the key metrics that are extremely important to me, I can save them as a blueprint, and the next time I have to run this elsewhere, I can refer to it as a remote URL. Every single scraper I write once can be promoted to be used across the community or across my subsequent deployments, and beyond that it acts as a binding blueprint, a guardrail, which ensures that everything is observed homogeneously. On the left is an example where, instead of scraper, I call it extends. And that's it. Boom. Job done. What we also did is create a catalog of all the important metrics that seem to be relevant across all of these components available across clouds. This is a registry that you can actually visit; we maintain a default catalog that is consistently being updated, but you can always contribute as well. It's open source.
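As a rough illustration of that reuse workflow, a blueprint and a consumer that extends it might look something like the sketch below. The module and extends keywords come from the talk; the file layout, the registry URL, and all attribute names are hypothetical, not the real OME syntax.

    # Blueprint: same gauges as before, published once for reuse (names illustrative).
    module "alb_golden_signals" {
      gauge "throughput" {
        source = "cloudwatch"
        query  = "SUM(RequestCount) over 1m"
      }
      gauge "latency" {
        source = "promql"
        query  = "histogram_quantile(0.99, sum by (le) (rate(alb_request_duration_seconds_bucket[5m])))"
      }
    }

    # Consumer: reuse the blueprint instead of redeclaring the scraper.
    extends "alb_golden_signals" {
      source    = "https://example.com/registry/alb_golden_signals"  # hypothetical remote URL
      resources = "each.aws_alb"
    }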
All of this is only meaningful if it can speak with multiple sources of data, and sources, just like Terraform's providers, are a first-class entity here. OpenMetrics Exporter can help you pull data from Prometheus, from Google Stackdriver, from Amazon CloudWatch, from Redshift or any other SQL-powered database, and it can also speak with other OpenMetrics backends along the same lines. We actually saw these syntaxes earlier, and they may have looked familiar: the syntax is written in HCL. A lot of people ask me, why is it not YAML? Well, here's a joke: YAML provides you job security; HCL lets you concentrate.
But again, why not YAML? If we look at a smallish example, what I am trying to do is iterate over a resource, all clusters, and then join, split, and format the values. Any such manipulation of data is almost impossible to do in YAML as a first-class language construct. And as every DSL evolves, it wants to add some programming logic: iterators, conditionals, ternary operators, boolean expressions. Because all of these are available as first-class constructs in HCL, you can put some logic, not obnoxiously complex logic, but some logic, into the configuration templates themselves, which means less code actually has to be written. Imagine you had an existing off-the-shelf exporter, and all you needed was to add a certain string format, or a split, or a join, or a regular expression: you would have to go edit the code and maintain a fork. But if the language spec is powerful enough that you can freely run these expressions, it only provides for more reusability, and that is the fundamental principle behind HCL. It is human readable, so humans can interact with it better, which means less of a learning curve. It fits all existing editors, as you have seen with Terraform, Consul, or Nomad; all the editors continue to work, and your existing GitOps pipeline continues to work.
Now let's take a recap and try to understand how this is different from the existing Prometheus exporters we would otherwise have used, scenario by scenario. If there is a new source of data, in the existing exporter approach we would have to dispatch a new binary. Every single new binary dispatched into production is another point of failure which needs to be tracked further. Who keeps track of that binary? Probably another binary, and that is a recursion that keeps going on. With OpenMetrics Exporter, you just have to write a scraper file. OpenMetrics Exporter has been written for massive concurrency and is backed by Golang: on a single machine with roughly six cores and some 6 GB of RAM, we have been able to scale it to 500 scrapers and up to three to four million data points being emitted. We also touched on which metrics you have control over: in existing exporters, let alone controlling them, at times even finding them becomes really hard, whereas with OpenMetrics Exporter, because it is Observability as Code, we get to pick and choose the metrics that are important to us and decide what should reside in our metric lake. Building correlation is normally a job for post-processing: you carry the data into another data lake and then write complex pieces of code in some other language to extract knowledge out of it, so that you can see the data from two sources together, or it is purely visual post-processing in the form of a Grafana dashboard. With OpenMetrics Exporter it becomes native support, because you can club these metrics together and emit them as a unified, homogeneous set under the same labels. Finally, logic manipulation is almost impossible in existing exporters; we are at the mercy of editing the code that is part of the exporter itself, but because expressions are first class in HCL, they allow very easy manipulation inside OpenMetrics Exporter.
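As a small, hedged illustration of that last point, here is the kind of first-class expression logic HCL allows inside a configuration, using the join, split, and format functions mentioned in the talk (the surrounding attribute and the resource string are made up for the example):

    # Illustrative only: derive a clean label from a raw resource identifier
    # such as "prod/us-east-1/payments-cluster" without editing exporter code.
    label = format("%s (%s)",
      split("/", "prod/us-east-1/payments-cluster")[2],   # -> "payments-cluster"
      join("-", ["prod", "us-east-1"]))                   # -> "prod-us-east-1"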
Now, how does this all fit together? Let's take an example. Step one is the write-and-ship cycle: you download the binary from an artifact repository, and as a practitioner you write certain HCL files or reuse the existing modules that are available. You run them, you plan them, you bring up a Grafana dashboard on them. Once everything is sorted and looks good, you commit it to a repository, and from the repository you would probably create an artifact of the HCL files. On the deploy side, which is step two, the same OpenMetrics Exporter binary, in a flavor specific to the deployment server, downloads these HCL files and then emits the metrics. These metrics are exposed on a /metrics page, which any Prometheus running in agent mode, a vmagent, or a Grafana agent can then ship to the OpenMetrics receiver.
OpenMetrics Exporter was actually born at Last9, the company where I work. We help people ship reliable software, and the very first step of building this reliability stack is building a consolidated metric layer. You can visit OpenMetrics Exporter on our Last9 page, and we're happy for feedback.
...

Piyush Verma

Founder & CTO @ last9.io



