Conf42 DevOps 2024 - Online

Optimizing Observability with OpenTelemetry Collector for Budget-Friendly Insights

Explore a practical implementation of the OpenTelemetry Collector for enhanced observability. Learn about TheyDo’s journey in managing budget constraints by tail-sampling data effectively, and gain insights into choosing the Collector over the alternatives for cost-effective observability.


  • Brunfre: In this talk I will tell you about optimizing observability using the OpenTelemetry Collector for budget-friendly insights. The first tip is to be careful with auto-instrumentation. Do we really need all this data? The answer is probably not.
  • You can have multiple tail samplers. You want first the traces with high latency and errors, and then you sample traces with specific attributes. The other 99% of the traces won't be interesting for you, and you can either discard them or apply some random sampling.


This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, my name is Brunfre, and in this talk I will tell you about optimizing observability using the OpenTelemetry Collector for budget-friendly insights. I'm currently a DevOps engineer at TheyDo, a platform for customer-centric collaboration. The good news is that we are hiring for different roles, so feel free to check out our product and careers pages. I think most of us agree that 2023 was not economically great: we had rising inflation, lots of tech companies doing massive layoffs, and hiring freezes throughout the year. Taking that context into account, the last thing you want is a $65 million bill for your observability system, as we read in the news last year about a famous cryptocurrency company. In this talk I will share our experience at TheyDo adopting OpenTelemetry, and hopefully you can pick up some useful tips on how to avoid that kind of spending when setting up your observability system.

Last year we made a major effort at TheyDo to adopt OpenTelemetry as our observability framework. Depending on the platform you use to store and process your OpenTelemetry signals, whether metrics, traces, or logs, you might be charged by the amount of ingested data, or with some vendors you will eventually be throttled if you ingest more data than is included in your plan. And that was our main issue during the first few weeks after adopting OpenTelemetry. As you can see in the image on the right, our usage was low during weekends, so we were below the threshold, but during weekdays we were above the daily target. As we had more load on our product and kept ingesting more data than we should, eventually we would be throttled and events would be rejected. So we had to do something about this. The first question we asked ourselves was: do we really need all this data? And the answer was, well, probably not.
So our first action was: okay, then let's pick the data that is actually useful for us. But how? The first thing was to really think about how we were using auto-instrumentation, and the first tip I can give you is to be careful with it. If you're using one of the popular libraries that provide auto-instrumentation for Node.js or Python, it's really common that the default configuration sends many spans that you won't need. Auto-instrumentation is really useful to get your initial signals flowing to your observability backend, but at some point you will need to optimize it. Here are two examples that caused us some trouble.

The first is the auto-instrumentation for the AWS SDK. As you can see on lines five and six, by default it does not suppress any internal instrumentation. This means that for any call you make to S3, for example, you will have at least four extra spans: besides the parent span for the S3 action (in this case a PutObject), you will have one span for the HTTP action, another for the DNS lookup, another for the TLS connection, and another for the TCP connection. Of course this might be useful if there's a DNS issue or something similar, since you will see it right away on the trace, but it can also be enabled later when needed, for instance if you detect some weird behavior on S3 operations that you suspect might be related to a DNS problem. Most of the time you won't need all these internal spans on every trace. A similar situation happened with the auto-instrumentation for Koa. By default it creates a new span for every middleware you have on your API, and most of the time you won't need all of them, so you can probably ignore them and enable them later.
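As an illustration of this kind of tuning, a Node.js SDK setup might look roughly like the sketch below. It assumes the `@opentelemetry/sdk-node`, `@opentelemetry/instrumentation-aws-sdk`, and `@opentelemetry/instrumentation-koa` packages; the option names shown are documented for those packages, but verify them against the versions you actually use:

```javascript
// Sketch of a NodeSDK setup that tones down noisy auto-instrumentation.
// Option names may differ between package versions; treat this as a starting point.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { AwsInstrumentation } = require('@opentelemetry/instrumentation-aws-sdk');
const { KoaInstrumentation, KoaLayerType } = require('@opentelemetry/instrumentation-koa');

const sdk = new NodeSDK({
  instrumentations: [
    // Suppress the nested HTTP/DNS/TLS/TCP spans under each AWS SDK call.
    new AwsInstrumentation({ suppressInternalInstrumentation: true }),
    // Skip spans for plain middleware layers; router layers still produce spans.
    new KoaInstrumentation({ ignoreLayersType: [KoaLayerType.MIDDLEWARE] }),
  ],
});

sdk.start();
```

With a setup like this, the traces keep the spans that describe the actual operations while dropping the per-connection and per-middleware noise.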
Alternatively, if you suspect that one of the middlewares could be the root cause of a latency issue, for example, you can go for manual instrumentation and apply the needed instrumentation inside the middleware logic itself. In our case, as you can see, we had lots of spans on every trace that were automatically sent by the Koa auto-instrumentation, and we solved this by ignoring the middleware layer type and thus dropping these middleware spans.

But this was not enough, so another essential technique we used to filter out the data we needed was tail-based sampling. Tail-based sampling is where the decision to sample a trace happens after all the spans in a request have been completed, and the most popular tool to execute tail-based sampling is an OpenTelemetry Collector. For this we have multiple options; here are some of them. The first one we tried was the AWS Distro for OpenTelemetry (ADOT), an AWS-supported version of the upstream OpenTelemetry Collector distributed by Amazon. It supports selected components from the OpenTelemetry community and is fully compatible with AWS computing platforms such as ECS or EKS. It has some niceties, like being able to load its configuration from a file in S3, so you don't need to bake the Collector configuration into the Docker image. But we hit some issues that kept us from going with this solution. The first was that it didn't support all the processors we needed, especially the transform processor; I actually opened an issue for this, and it's currently not included in the distribution. The other problem was that it didn't support logs at the time. It does now, as was announced a few weeks later, but back then it didn't, and we needed log support because we also wanted to enrich our logs with some extra attributes, which meant we couldn't use the OpenTelemetry Collector for that.
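For context, the transform processor we were missing works on OTTL statements. A minimal sketch of the kind of log enrichment described above could look like this (the attribute name and value are purely illustrative):

```yaml
processors:
  transform:
    log_statements:
      - context: log
        statements:
          # Illustrative: attach an extra attribute to every log record.
          - set(attributes["deployment.environment"], "production")
```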
So yeah, be aware of these things; the documentation in the repository is pretty good, and you will find a list there of all the processors available in the distro, so you can see beforehand whether it's the right tool for you or not. Taking these limitations into account, we had to go for the official upstream OpenTelemetry distribution, and here you have two options: OpenTelemetry Collector core and OpenTelemetry Collector contrib, which is the one we used. Core is a limited Docker image that includes components maintained by the core OpenTelemetry Collector team, including the most commonly used ones, like the filter and attribute processors, and some common exporters, such as those for Jaeger and Zipkin. The contrib version includes almost every component available, with some exceptions where components are still in development. If you find that contrib includes too many things you don't need, there's a recent blog post from Martin on how to build a secure OpenTelemetry Collector, so you can create a slimmer image with just the components you need. That's also a good idea in terms of security, of course, because you won't include any processors you don't use, so you reduce the attack surface, and as the blog post explains, it's not that hard to build an OpenTelemetry Collector from scratch. Finally, some vendors, like Honeycomb, provide their own solution for tail-based sampling. In their case it's Refinery, a completely separate project from the OpenTelemetry Collector: a sampling proxy that examines whole traces and then intelligently applies sampling decisions to each trace. These decisions determine whether to keep or drop the trace data in the sampled data forwarded to Honeycomb.
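For reference, building such a slim Collector is done with the OpenTelemetry Collector Builder (ocb) and a small manifest. A hedged sketch, with illustrative component choices and module versions, might look like:

```yaml
# manifest.yaml for the OpenTelemetry Collector Builder (ocb).
# Module versions below are illustrative; pin them to matching releases.
dist:
  name: otelcol-custom
  output_path: ./otelcol-custom
receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.96.0
processors:
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.96.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor v0.96.0
exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.96.0
```

Running the builder against a manifest like this produces a binary containing only the listed components.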
So our current architecture for the OpenTelemetry Collector looks like this. We run the Collector as a sidecar, and our app container forwards its spans to it. The Collector also calls the metrics endpoint on the app to fetch metrics from the Prometheus client running in the application. Logs are tailed by a Fluent Bit sidecar, another sidecar we have, which then forwards them to the Collector container. The Collector filters the spans and enriches the metrics, spans, and logs with new attributes, like the identifier of the running task and other attributes that are useful for us. It is then responsible for sending the metrics, traces, and logs to one of the backends: Honeycomb, Grafana, Datadog, or any vendor that supports the OTLP protocol, and you can visualize your data there.

Regarding the Collector configuration, you define it in a YAML file, and here we can see a visual representation of that configuration for logs, metrics, and traces. The image was generated with otelbin.io, which is a really great tool to visualize your Collector configuration. On the left you can see the different receivers for logs, metrics, and traces, so OTLP or Prometheus. After the data is received, the different processors filter it and enrich it with more attributes, and at the end you see the destination of the OpenTelemetry signals, which in this case is also OTLP, so the data is sent to a backend where the signals can be visualized. The type of signal generating the most data for us was traces, and that's where we needed to act, so let's focus on the pipeline for the traces.
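Reduced to a skeleton, a sidecar configuration with this shape could look roughly like the following (endpoints and ports are illustrative, not our actual values):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
  prometheus:
    config:
      scrape_configs:
        - job_name: app
          static_configs:
            - targets: ["localhost:9464"]   # app's Prometheus client (illustrative port)

processors:
  batch:

exporters:
  otlp:
    endpoint: backend.example.com:4317      # any OTLP-capable vendor

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp, prometheus]
      processors: [batch]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```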
The first processor configured on the pipeline is the batch processor, and it's really simple: it accepts spans and places them into batches. Batching helps to better compress the data and reduces the number of outgoing connections required to transmit it, so it's a recommended processor to use. After that, the data is handled by the next processor, a tail sampler that we call "default". Here the trace is analyzed and, if it is not dropped, it moves on to the next processor, which in our case is another tail sampler, where again all the data in the trace can be dropped depending on the configuration. So let's see how these two are configured in our specific case.

The first tail sampler, named "default", has three different policies. The first is the errors policy, which samples any trace that contains a span with an error status; we assume that if it has an error, it's an important signal that we can then analyze to get to the root cause. The next is the latency policy, where we check whether the trace, in this case the request, took more than 100 milliseconds to be processed; if so, we sample the complete trace, with the same idea as before: we sample slow operations so we can analyze them and get to the root cause. These two policies will already filter out most of the simple operations you might have, like status or health-check calls on your API. You could also filter those kinds of calls explicitly by using the HTTP path attribute, for example; that would be another way to do it. Finally, the last policy samples any trace that contains a span with a specific attribute: in our case, every trace containing a span with the GraphQL operation name "resync project".
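Expressed in the tail_sampling processor's configuration syntax, the three policies just described might look roughly like this (the GraphQL attribute key and operation name are illustrative placeholders):

```yaml
processors:
  tail_sampling/default:
    decision_wait: 10s            # wait for a trace's spans before deciding
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]   # keep any trace containing an error span
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 100       # keep traces slower than 100 ms
      - name: important-operation
        type: string_attribute
        string_attribute:
          key: graphql.operation.name   # illustrative attribute key
          values: ["resyncProject"]     # illustrative operation name
```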
This might be an important operation that we always want to sample. For example, it could be a new feature whose usage you want to check, so you always want all the traces related to that operation. An important thing to note, which you may already have noticed, is that these policies have an OR relationship between them, so the trace will be sampled if any of these conditions is true.

You can have multiple tail samplers; in this case we have two of them. We called the next one "synthetics", and it exists because we have synthetic monitors checking our API every minute from different regions, and each of these calls generates multiple spans that are not interesting at all when they run successfully. For this processor we configured an error policy and a latency policy in the same way as before, except that the latency threshold is 1 second. So if one of these synthetic monitors throws an error, or takes more than 1 second to complete, we sample the data, because that's an interesting event, right? Then we have two extra policies. The first samples only 1% of the synthetic requests that are successful, which is useful, for example, to get an idea of the average latency of the synthetic requests; as you can see, we can create a policy with two sub-policies, and they will be evaluated using an AND instead of the default OR. The last policy serves as a failover, sampling all the other traces, the ones that do not have the CloudWatch user agent. It's the opposite of the previous one: we set invert_match to true, so we sample 100% of the non-synthetic requests. It's basically a failover that samples every trace without the CloudWatch user agent in its attributes.
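A sketch of that second sampler, with an illustrative user-agent value standing in for the real CloudWatch one, could look like:

```yaml
processors:
  tail_sampling/synthetics:
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: very-slow
        type: latency
        latency:
          threshold_ms: 1000            # synthetic checks slower than 1 s
      - name: synthetic-1-percent
        type: and                       # both sub-policies must match
        and:
          and_sub_policy:
            - name: is-synthetic
              type: string_attribute
              string_attribute:
                key: http.user_agent
                values: ["CloudWatchSynthetics"]  # illustrative user agent
            - name: one-percent
              type: probabilistic
              probabilistic:
                sampling_percentage: 1
      - name: non-synthetic-failover
        type: string_attribute
        string_attribute:
          key: http.user_agent
          values: ["CloudWatchSynthetics"]
          invert_match: true            # keep 100% of non-synthetic traces
```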
And this is important because without this failover the processor would discard all the other traces, since they wouldn't be evaluated as true by any of the other policies. The last processor we have that also filters some data is a filter that excludes spans based on an attribute. The main difference from tail sampling is that the tail sampler filters complete traces, while this one filters specific spans, and because of that we need to be really careful with the data being dropped: dropping a span may lead to orphaned spans if the dropped span is a parent. Ideally you would create rules here to guarantee that you drop only childless spans, which is true in our case: as you can see, we are dropping only trivial GraphQL spans, looking at the field type and dropping only the ones we know for sure are trivial and childless. They have no children, so it was safe to drop them, and most of the time they were not interesting anyway, because they would complete in just a couple of milliseconds; it was just information we didn't need to keep.

In summary, the logic behind the configurations I previously mentioned is represented pretty well in this image by Reese Lee, posted on the official OpenTelemetry blog. In the configuration of the tail sampler running on your Collector, you first want the traces with high latency and errors, and then you sample traces with specific attributes, as we did, for example, for the resync operation. The other traces, 99% of the time, won't be interesting for you, and you can either discard them or apply some random sampling to them, if you prefer and have the budget for it. This way you can have an efficient and economical observability framework. And that's it. Thank you.
Hopefully you got some useful information on how to use the OpenTelemetry Collector to keep your wallet safe. If you have any feedback or questions about what I said here, you can find me on LinkedIn or on Twitter. So, yeah, that's it. Thank you, thanks for watching, and bye.

Bruno Ferreira

Senior DevOps Engineer @ TheyDo

