The Art of Event Driven Observability with OpenTelemetry

Abstract

The open-source community has been working on a standard called OpenTelemetry. The project provides libraries that produce metrics and traces. But in an event-driven architecture, tracing can be done in various ways. Observability is not a science but an art: we need to understand our system to observe it.

Summary

This session is all about OpenTelemetry, but in a specific context: event-driven architecture. Henrik Rexed is a cloud native advocate at Dynatrace. He also produces content for PerfBytes and for his YouTube channel, Is It Observable. This Conf42 conference is dedicated to observability.
We look at the various components involved in OpenTelemetry and how to produce traces. Then we jump into the topic of the day, event-driven architecture. We explain what span links are and how you can utilize them.
Observability is not a science. You see things and then you start visualizing them. So you need a couple of tools: pencils, pastels, various colors. What are the pillars we need to build the art of observability?
A trace is a transaction. To make a trace, we need a context. The context has the trace id and a span id. The trace id is how we attach all the spans together to make a trace. Once we have defined this, we can create our spans and our child spans.
In a traditional microservice architecture, distributed tracing is just fantastic. How can I make sure that the context is sent over to service B and the trace continues? This is called trace propagation. What's the relationship with event-driven architecture?
The better example is my second example, using span links. What is a span link? In the case of a long transaction (in our case not two minutes, but four minutes), the consumer starts a new trace and links it to the publisher. At the end I have three traces, much easier to consume.
Using span links is great because at least it keeps track properly of the consumers. But the major disadvantage is that you lose the connection between the publisher and the consumers. Depending on your implementation, you will have to decide if you need an end-to-end trace or the use of span links.
All right, so just a small teaser for the YouTube channel Is It Observable. There's plenty of content covering OpenTelemetry and other observability frameworks or agents. By looking at it, you will help me be more efficient in the way I produce new content. Thanks for watching, see you soon.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Welcome to this session. I hope you're enjoying this Conf42 conference dedicated to observability, and thanks for joining my session. So what are we going to talk about? This session is all about OpenTelemetry, but in a specific context: event-driven architecture. My name is Henrik Rexed. I'm a cloud native advocate at Dynatrace. Prior to Dynatrace, I worked for 15 plus years as a performance engineer, and performance is still pretty much in my heart. That's why I'm producing content for a YouTube channel and podcast called PerfBytes. The icon with the red is PerfBytes. And then one year ago, less than two years I would say, I started a YouTube channel called Is It Observable, where I try to provide content for anyone who wants to get started in the observability world.
All right, so in the next 20 or 30 minutes or so we're going to talk about a couple of things. First, you saw OpenTelemetry in the title, so obviously we're going to talk about OpenTelemetry. We'll look at the various components involved in OpenTelemetry and how to produce traces, because today we're going to focus mainly on tracing. Then we're going to jump into our topic of today, which is event-driven architecture, and see the various ways of instrumenting your event-driven application. You will see that they have some disadvantages, so that's why we're going to jump into span links. We'll explain what span links are and how you can utilize them.
So before we start, let me tell you a story. When I was a kid, I really loved to draw, to paint, to create content, because my grandma and my mother taught me how to draw things. And it was pretty funny, because when I started my first job in the industry, I was assigned to an engagement as a consultant. At that time, in that technical environment, we were managing different servers and different applications.
And my manager said, hey, we need to draw, we need to visualize the current behavior of our application. At that time we had a couple of tools like top, nmon and so on. Those tools are pretty precise, but they give you points. So with the help of those points, the only thing I was able to draw was this: basically a list of points. You can think of it as very zoomed in, like a pixel. So it's very hard to understand the entire context of our application, but with that tooling, that's what I was able to do.
Then 20 years ago we improved our solution. We were able to store data, so we had history, and we were attaching the data we collected to metadata coming from a CMDB, for example. So we had more context. And because I had stored points, I was able to draw lines. That's how I started to have a shape of the health of my web server.
Then 13 years ago, the industry provided a couple of great products called APM, application performance monitoring, that provided distributed traces out of the box. Yes, distributed traces were already out there at that time, along with metrics. APM also had some synthetic monitoring, some real user monitoring and so on. So with this we at least had a better understanding, and I was able to start drawing the eyes, the nose, the mouth of the actual server I was trying to represent. So much better. At least we understand what we have here.
Then ten years ago, because we were producing so many logs in our environment, we thought, why don't we start utilizing them? So we were parsing the logs, indexing them, and trying to get even more value out of all the effort we put into producing logs. So at the end we were better. That's why I now have the arms on the server. But it's still not perfect. Let me show you what I wanted to draw and show to the project. This is the actual drawing.
So you can see that at least I had the kid, and the shape of the kid was right, like a draft. But the kid was in a forest, or in a park, or in a garden, so there was something surrounding the kid. Unfortunately, with the tooling I had, I was not able to draw the actual situation. Why am I telling you this? Because observability is not a science. Observability is like an art. You see things and then you start visualizing them. So you need a couple of tools: pencils, pastels, various colors. So let's jump into the various tools we have to build the art of observability.
So what are the artistic tools? Well, as you know, observability is about understanding our current environment. For that we need logs. We talked about them a couple of minutes ago. We need events. A system like Kubernetes generates tons of events, but you also have events from your builds, your pipelines, all the tooling that takes your solution to production. So you have a lot of rich information that gives even more context to the current situation. Then we have metrics, which we're used to, and traces, which we talked about, and you can see that traces are becoming very popular. And then we have profiling. Profiling is one of the signals, and a very powerful one, because when you look at traces, you obviously want to go down to the code, and profiling will help you do so.
Those pillars, those tools are great, but you need to add context, because without context they don't make sense at all. So what is the context? We need the technology, where it's been deployed, which server, the service and, if it's a Kubernetes environment, the deployment file, the namespace, the pod. The version number, very important. The geo, where that server is located, and if you have multicloud deployments, maybe also the geo where this server is coming from.
So at the end, the context is rich, and it will help you correlate the data that you have been ingesting in your backend. But look at the natural reaction from the market. If you look around you, a lot of organizations and a lot of engineers have started to implement observability, but if you pay attention, they are using dedicated tools that are very specific to each signal. For example, metrics I'm going to store in Prometheus, for traces probably Jaeger, for logs I want to use Elasticsearch. So at the end we have observability, but everything is disconnected, pretty much separated. So we are not efficient. We need to stop doing this and keep everything together, so that we are more efficient, especially when we need to troubleshoot or understand a given situation.
So that's the purpose of the OpenTelemetry project. For those who have never heard about it, OpenTelemetry is an open standard. It doesn't provide any storage, it doesn't provide you any software. It provides libraries that help you produce observability data in a standard format: a standard format for distributed traces, a standard format for metrics, for logs and so on. So at the end we will put something in our code, like an SDK. There's another component we'll talk about in a few seconds. With those SDKs, we'll be able to produce observability data, and the framework will add on top of that the metadata we're looking for, following the semantic conventions: the server, the HTTP request, anything related to our data.
So if we want to summarize the OpenTelemetry project in a very simple way, you have two things. You have the instrument, the SDK. I've got a guitar, and I'm going to produce logs, metrics and traces. I can play my guitar just like this. But if I want to propagate my sound or change the sound on the fly, I'm probably going to use an amplifier. That's the collector.
I will send my sound to the collector, I will be able to add some effects like chorus, and then with the collector I'll be able to amplify the data and send it to any observability backend of our choice. And as you know, an amplifier can have a second output, so you can send the data to several observability backends.
At the moment, OpenTelemetry supports several types of signals. The first one that was initially supported was traces, so it is now stable in most of the languages we use today. Metrics are also stable for most of the languages, except two of them: PHP, and I don't remember the other one, to be honest. And last, we have logs, which are under construction, so we should expect them probably in Q3 or Q4 of this year. Then we have continuous profiling. The specification is on the way, so we should not expect it until next year.
So how do we produce traces? Because today we're going to focus mainly on traces. You can produce traces in various ways, but we will still need to add an SDK to our code. And there are two ways. The first approach is manual. You may say, oh, manual sounds quite expensive, but if it's a fresh new application, it's similar to logging. If you're used to producing logs out of your code, why don't you start producing traces out of your code? It's a similar journey. But we never start from there. Usually we start with the automatic and semi-automatic instrumentation that will instrument the well-known frameworks of the market. For example, Spring in Java will be fully instrumented. If you use Python, there are a lot of popular frameworks like Django that will also be instrumented. There are plenty of examples for every language. So at the end, if you rely on a framework, there's a good chance that the traces produced will be quite accurate for your use case.
And at the end, the instrumentation will only produce data. You will still need to send it to an observability backend to store it. So what is a trace? Good question. A trace is a transaction. I'm performing an action, for example saving an order. The save order will go through different components in my architecture, and that will be the trace. To save this order, it will go through different subtasks within my architecture, within my microservices, and those subtasks will be the spans. So at the end a trace is very simple: it's a big JSON array of spans, a list of spans. To make the trace, we will need a context. The context has the trace id and a span id, and with the help of the trace id we attach all the spans together to make a trace. A span has different information: the name, the attributes and so on. But we're not going to go too much into details.
So if I want to build my traces in Go or Java or Node or whatever language, I will have a couple of objects that are common to all of the OpenTelemetry SDKs. First, if I want to build traces, I will add the OpenTelemetry API, no problem. From there I will be able to create a tracer provider. The tracer provider is the component that will help me build my traces, and you can see that there are various objects involved: the span processor, the sampler, the exporter and the resource. We will see those in detail, but the span processor determines how you're going to send the data, sampling is how much detail we want to send, the exporter is where you're going to send the data, and the resource is the identity of the service. And then we have a propagator. We'll talk about it in a few seconds. Once we have defined this, we will be able to create our spans and our child spans. So what is the resource I mentioned? It's the identity of the service.
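To make the idea of a trace as a list of spans concrete, here is a minimal sketch in plain Python. It is not the OpenTelemetry SDK: the `Span` class, `start_span`, `build_trace` and the span names are hypothetical, chosen only to show how a shared trace id ties spans together and how a child span points at its parent.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Simplified span: a named subtask carrying its context and attributes."""
    name: str
    trace_id: str                         # shared by every span of one trace
    span_id: str                          # unique per span
    parent_span_id: Optional[str] = None  # None for the root span
    attributes: dict = field(default_factory=dict)


def start_span(name: str, parent: Optional[Span] = None, **attributes) -> Span:
    """Create a root span (new trace id) or a child span (inherited trace id)."""
    return Span(
        name=name,
        trace_id=parent.trace_id if parent else secrets.token_hex(16),
        span_id=secrets.token_hex(8),
        parent_span_id=parent.span_id if parent else None,
        attributes=attributes,
    )


def build_trace(spans: list, trace_id: str) -> list:
    """A trace is the list of spans that share one trace id."""
    return [span for span in spans if span.trace_id == trace_id]


# Hypothetical "save order" transaction with two subtasks, plus an unrelated span.
root = start_span("save_order", order_id="hypothetical-42")
check = start_span("check_stock", parent=root)
store = start_span("write_to_database", parent=root)
other = start_span("unrelated_request")

trace = build_trace([root, check, store, other], root.trace_id)
print([span.name for span in trace])
```

The unrelated span gets its own trace id, so it is left out of the "save order" trace even though it was collected alongside the others.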
So the resource is very critical, because if you have a cart service and an order service, each of those services needs to have an identity, so that they are displayed properly in the observability solution. So you need a service name, you need a service version. A couple of things are required, but the minimum is a service name.
Trace sampling, what is it? Well, we want to determine how much data we want to send to the observability backend. Of course we can send 100%, but that will be quite expensive at the end. So we need to sample and decide how we want to sample. At the moment, with the OpenTelemetry project, you have to configure the sampling decision on your own. You have several sampling decisions available. You have always on: 100% of the data produced will be sent to the backend. You have always off, which is the opposite: nothing. Then you have parent based. Parent based is perfect when you are a dependency, like service B in this slide, where I define that only the information from service B involved in a global transaction will be sent to the backend. Then you have trace id ratio based, where you define a ratio, a percentage: say 20% or 10% of my data will be sent to the backend. And then you have combinations of both: parent based always on, parent based trace id ratio, parent based always off and so on.
So here in this example I have services A, B, C, D and E, and we're looking for an end-to-end trace. Service A has been configured with trace id ratio because it's the first endpoint, and the other ones have been configured with parent based trace id ratio, so they will send the data from a global transaction where they are involved as a dependency. And we see that I've defined 20% for service B, 50% for service C and so on. But look at the numbers here: how you configure the sampling decision is very important, because you can end up with more or less detail.
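The two building blocks behind these sampling decisions can be sketched in a few lines of plain Python. This is a simplified model, not SDK code: `trace_id_ratio` and `parent_based` are hypothetical helpers, and comparing the low 64 bits of the trace id against the ratio is an assumption about one common way to make the decision deterministic for a given trace id.

```python
import secrets
from typing import Optional

TRACE_ID_SPACE = 1 << 64  # the decision looks at the low 64 bits of the trace id


def trace_id_ratio(trace_id: int, ratio: float) -> bool:
    """Keep the trace when its low 64 bits fall below ratio * 2**64."""
    return (trace_id & (TRACE_ID_SPACE - 1)) < round(ratio * TRACE_ID_SPACE)


def parent_based(parent_sampled: Optional[bool], trace_id: int, ratio: float) -> bool:
    """With a parent, follow its decision; without one, fall back to the ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return trace_id_ratio(trace_id, ratio)


# Hypothetical traffic: 10,000 random 128-bit trace ids sampled at a 20% ratio.
trace_ids = [secrets.randbits(128) for _ in range(10_000)]
kept = sum(trace_id_ratio(trace_id, 0.2) for trace_id in trace_ids)
print(f"kept {kept} of {len(trace_ids)} root traces")

# In this model a dependency follows the upstream decision when a parent exists.
print(parent_based(True, trace_ids[0], 0.0))   # parent was sampled
print(parent_based(False, trace_ids[0], 1.0))  # parent was dropped
```

Because the ratio decision depends only on the trace id, every service that evaluates the same trace id with the same ratio reaches the same answer, which is what keeps a sampled trace complete across services.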
So in this example, if I take 1000 requests, out of these 1000 requests I will have only one end-to-end trace from A to E, which is quite low. So you have to tweak and configure it properly to get the level of detail that you need.
What is trace propagation? Very simple. It's how the context is sent through our architecture. If I'm the first service, A, I'm starting the trace, so I have the context. Very easy: I can pass it around within my code, because it's the same code base, so I can keep the right context. But how can I make sure that the context will be sent over to service B and the trace will continue? This is called propagation. We inject the trace context into our HTTP request, for example, in the case of an HTTP communication, and then we extract it in service B. Once we have the trace context, we can continue our trace.
So now you know everything. Let's have an example. In a traditional microservice architecture, distributed tracing is just fantastic. Here you can see that I have an ingress controller and several services. With this I will be able to keep track of all the tasks of a given transaction, and you have a very easy way to visualize this data. For example, here I have an HTTP request going through various services in the architecture. I see that this transaction takes 25 milliseconds. If I want to optimize this transaction, I can clearly see that list recommendations takes 20 milliseconds, so I probably want to optimize that specific function. And what I also discover here is that get product is called one, two, three, four times. So maybe there is a more efficient way of doing it, to reduce again the footprint of this transaction. So fantastic. But you say, okay, great, but you're talking about microservices. What's the relationship with event-driven architecture?
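The inject and extract steps of propagation over HTTP can be sketched as follows. The talk does not name a header format, so this sketch assumes the W3C Trace Context `traceparent` header (`version-traceid-spanid-flags`). The `inject` and `extract` helpers are hypothetical stand-ins for what a propagator does, and the parser is deliberately minimal (it does not reject all-zero ids, for instance).

```python
import re
import secrets
from typing import Optional, Tuple

# version "00", 32 hex chars of trace id, 16 hex chars of span id, 2 hex chars of flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Caller side: write the current trace context into the outgoing headers."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def extract(headers: dict) -> Optional[Tuple[str, str, bool]]:
    """Callee side: read (trace_id, parent_span_id, sampled), or None if absent."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    return trace_id, parent_span_id, bool(int(flags, 16) & 0x01)


# Service A starts the trace and injects its context into the HTTP request.
trace_id, span_id_a = secrets.token_hex(16), secrets.token_hex(8)
request_headers = {}
inject(request_headers, trace_id, span_id_a)

# Service B extracts the context and continues the same trace with a new span.
remote = extract(request_headers)
remote_trace_id, parent_span_id, sampled = remote
span_b = {
    "trace_id": remote_trace_id,
    "span_id": secrets.token_hex(8),
    "parent_span_id": parent_span_id,
}
print(request_headers["traceparent"])
print(span_b["trace_id"] == trace_id, span_b["parent_span_id"] == span_id_a)
```

A request without the header makes `extract` return `None`, in which case service B would simply start a new trace of its own.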
Okay, be patient, I'm coming to that. Well, for distributed tracing it's a bit different. For example, you have a service, then I send my data to a broker or to a pub/sub, whatever, and then based on that event, a couple of different services will start. So there are two different ways of doing it. The first way is, oh, let's do a big trace: from the service close to the ingress to all the services that are triggered through that event. So I will have a big trace, and you will see that, depending on your architecture, sometimes it makes sense and sometimes it doesn't.
So let's have a look at the first example, the end-to-end trace I just referred to. Very simple. I prepared a GitHub repo for this event. Here's the link, so you can play around. Here I'm using a pub/sub architecture hosted on Solace, and I have the OTel demo, just to produce traces on the side. Then I have a demo with a publisher and two consumers: one for REST and the other one for the database. So let's have a look at how we can do that. What does it mean in terms of coding?
So first let me bring up the code of the publisher. The publisher, nothing special. Here you can see that we're using the OpenTelemetry SDK, as expected, with every single thing that I explained. First I'm defining an exporter. It's going to use the standard exporter of OpenTelemetry. I'm using a batch span processor to determine how I'm going to send the data, and the sampling is defined. With that I have a tracer provider, and with that tracer provider I'm able to start creating spans, which I'm doing here: create span, start span. And as you can see here, I'm adding some attributes to give some details about this specific function, and then I will be able to attach the context.
In the case of messaging, a couple of technologies are supported and the trace context will be passed properly. In this particular example I'm using pub/sub, so I don't necessarily have an SDK that does it automatically. So what I'm doing is this: once I have the trace, I'm getting the trace id and the span id, and I'm sending them as properties of the message, which means the consumer will have to extract them as well. Now if we look at one of the consumers, like the REST consumer, same thing: we define the tracer provider, nothing special. And here you can see that from the message I'm extracting the trace id and the span id. Once I have those, I define a new span context, I link it, and I start the child spans from there.
So that's from the code perspective, but what does it represent in an actual trace? If I open my browser, I have it already displayed here. First let me go to the services. Here in Dynatrace, under services, I have different applications running. You can see the consumer database, the consumer REST and the publisher. Those are the three services I'm running in my example. If I click on the publisher, I will have details about the actual service. But what I want to show you is the actual end-to-end trace. Remember, I started the trace, I sent the context in the message, and then the consumers extract the context and continue their tasks as well. So here I now have a big trace with the publisher sending the data, and then I get every single detail of the consumer REST and the consumer database. But the problem is, as you can see here, we have a four-minute transaction, and four minutes in terms of an OpenTelemetry distributed trace is very difficult.
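The manual propagation walked through above (publisher copies the trace id and span id into message properties, consumers read them back and continue the trace) can be modeled in plain Python. This is only an illustration of the data flow, not the demo repository's code or the OpenTelemetry API: the dictionary layout, the property names `trace_id` and `span_id`, and the `publish` and `consume` functions are all hypothetical.

```python
import secrets


def publish(payload: str) -> tuple:
    """Publisher: start a span and copy its ids into the message properties."""
    span = {
        "name": "publish",
        "trace_id": secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "parent_span_id": None,
    }
    message = {
        "payload": payload,
        "properties": {"trace_id": span["trace_id"], "span_id": span["span_id"]},
    }
    return span, message


def consume(name: str, message: dict) -> dict:
    """Consumer: rebuild the publisher's context and continue the same trace."""
    props = message["properties"]
    return {
        "name": name,
        "trace_id": props["trace_id"],       # same trace as the publisher
        "span_id": secrets.token_hex(8),
        "parent_span_id": props["span_id"],  # child of the publisher span
    }


publisher_span, message = publish("hypothetical event")
rest_span = consume("consumer-rest", message)
database_span = consume("consumer-database", message)

spans = [publisher_span, rest_span, database_span]
print(len({span["trace_id"] for span in spans}))  # one end-to-end trace
```

All three spans share one trace id, so the backend shows a single end-to-end trace whose duration is driven by the slowest consumer, which is exactly why the short publisher span becomes hard to see.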
So if I want to optimize the publisher, I have no idea, because here it takes less than a couple of hundred milliseconds. If I want to reduce the publisher's time, I'm not able to get the details, because it's practically not visible. So it's not perfect. That's why I think we should improve it, because this approach is not well suited to our example. You can see this is what I was showing you. Is it a useful trace? The answer is, of course, no. I need to change the way we're doing it. And the better example is my second example, using span links.
So before we start with this second example, what is a span link? Well, in the case where you have a long transaction, in our case not two minutes but four minutes, you can see that the black span, which is the first one, is hardly visible. So instead, I can start a trace and then use a span link: the consumer will start a new trace and link that trace to the publisher. So at the end I will have three traces with this implementation: one for the publisher, one for one of the consumers, and a last one for the other consumer. Three traces, much easier to consume, much easier to understand what's going on.
So again, let's jump into the details. I did a second version of this implementation where I changed the code slightly. Let me show you what I mean. If we look at the code here, I go to the other example, with the same REST consumer that I showed you. The publisher I didn't change. It's still sending the trace context in the message. I still need to extract my trace context, so trace id and span id. And here you can see that I'm creating a link to the actual context of the publisher, and then I'm starting a new span and attaching the link to it. So the code is slightly different. But what does it mean in terms of UI, in the observability backend of your choice?
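The span link variant described above changes only the consumer side: instead of reusing the publisher's trace id, the consumer starts a new trace and records the publisher's context as a link. A minimal plain-Python model of that difference follows. Again, the dictionary layout, the property names and the `publish` and `consume_with_link` functions are hypothetical and do not reproduce the OpenTelemetry API.

```python
import secrets


def publish(payload: str) -> tuple:
    """Publisher (unchanged): send its trace id and span id in the message."""
    span = {
        "name": "publish",
        "trace_id": secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "links": [],
    }
    message = {
        "payload": payload,
        "properties": {"trace_id": span["trace_id"], "span_id": span["span_id"]},
    }
    return span, message


def consume_with_link(name: str, message: dict) -> dict:
    """Consumer: start a NEW trace and keep a link back to the publisher span."""
    props = message["properties"]
    return {
        "name": name,
        "trace_id": secrets.token_hex(16),  # fresh trace, not the publisher's
        "span_id": secrets.token_hex(8),
        "links": [{"trace_id": props["trace_id"], "span_id": props["span_id"]}],
    }


publisher_span, message = publish("hypothetical event")
rest_span = consume_with_link("consumer-rest", message)
database_span = consume_with_link("consumer-database", message)

spans = [publisher_span, rest_span, database_span]
print(len({span["trace_id"] for span in spans}))  # three separate traces
print(rest_span["links"][0]["trace_id"] == publisher_span["trace_id"])
```

The link lives on the consumer span only. Nothing on the publisher span points forward to its consumers, which matches the navigation limitation the speaker describes for this approach.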
Well, let's jump into Dynatrace. Now I have another example. This is one of the traces from the REST consumer that I was showing you. You can see that I have three steps, much simpler to use. But what is interesting here, if I pay attention to the first one, is that I have links, and I can click on that link and it will bring me back to the publisher trace. And you can see that I have a 100 millisecond trace, much easier. And here I can see that if I want to optimize it, obviously I need to optimize the connect pub/sub step, which could be very difficult, but at least I know where I'm spending most of my time.
Okay, so now that we've seen that, let's jump into the conclusion. First, pros and cons. I would say using span links is great because at least it keeps track properly of the consumers, and it's easier to visualize things. But the major disadvantage is that you basically lose the connection between the publisher and the consumers. From the consumer to the publisher, it's very easy: I know that this consumer comes from that trace, no problem. But from the publisher, I have no idea how many consumers have been triggered by that message. So it's more difficult to get a broader vision of things. The other disadvantage, I would say, is sampling decisions. We mentioned sampling: here I've got a publisher trace and then I have a consumer trace. If those are two different sampling decisions, I may lose details. Maybe I won't have the publisher trace anymore, or maybe I won't have the consumer trace anymore. So again, it's not perfect, but it's easier to consume. Depending on your implementation, you will have to decide if you need an end-to-end trace or the use of span links. That's why observability is an art.
So first, as a takeaway, make sure your code is agnostic, because again, the idea is that we don't want any vendor lock-in.
That's the concept and the culture of OpenTelemetry. Second thing, if you start doing observability with OpenTelemetry, make sure you add the right context to your metrics, your logs and your traces. Otherwise you won't be efficient. And again, be creative: understand your system and design the right observability technique based on your application.
All right, so just a small teaser for the YouTube channel Is It Observable. There's plenty of content covering OpenTelemetry and other observability frameworks or agents. So check it out. It's quite a young channel, so by looking at it, you will help me be more efficient in the way I produce new content. All right, I hope you enjoyed this session and that you learned some stuff. If you have any questions, I will be delighted and honored to answer them. Thanks for watching, see you soon.

Conf42 Observability 2023 - Online

June 08 2023

Henrik Rexed

Cloud Native Advocate @ Dynatrace
