Conf42 JavaScript 2021 - Online

Distributed tracing for Node.js using OpenTelemetry

Abstract

Tracing and observability are becoming very popular as microservices grow more and more complex. Because microservices are distributed, we need to track how requests propagate throughout the system in order to better understand our architecture and troubleshoot production issues faster. By monitoring the interactions between the different services, we are able to overcome some of the inherent complexity of microservices.

In this talk, we will review the concept of tracing by examining the open-source project OpenTelemetry, and specifically its Node.js version. We will also cover how to use open-source solutions, along with commercial products, to get the most out of the tracing data OpenTelemetry collects.

Summary

  • Michael Haberman is the co-founder and CTO at Aspecto. He talks about distributed tracing for Node.js using OpenTelemetry. He explains how traces help you understand issues in a distributed environment.
  • When it comes to OpenTelemetry, you just need to have a single tracer file. Jaeger is an open-source tool that can visualize traces. Tracing is used mostly for production issues, but it can be used in development and staging as well.
  • You can tie the logs together with the traces so you can jump from one to the other. The code examples are available on GitHub in the aspecto-io opentelemetry-bootcamp repository, where you can grab the first episode of the bootcamp.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everybody, my name is Michael Haberman. I am the co-founder and CTO at Aspecto, and I am here to talk with you about distributed tracing for Node.js using OpenTelemetry. Even if you don't know distributed tracing or OpenTelemetry, this is exactly what I'm going to focus on: what they are and how they work. And to be honest, you don't have to be super experienced with Node.js specifically; we'll focus mostly on the distributed tracing and OpenTelemetry part.
So why did I choose to speak about this topic? I've been working with distributed applications for the past five years, and as you can assume, distributed tracing is related to distributed systems. Distributed systems and microservices are something I've done a lot, and for the past two years I've been doing almost only OpenTelemetry. So I kind of know this space and I wanted to share my experience with you. So let's get started.
We will start by understanding why we need traces. I'll give you a good, real use case, a real example of when you need traces. We have all sorts of solutions that we are working with today, things like logs and metrics. Why do we need another one? Once we understand why we need it, we will learn what traces are, how they work, and how to actually implement them.
So let me give you an example. You're working in a distributed environment and you have a scenario where service A cannot write to a database. And let's assume that you know that from your logs sending you an exception alert saying, hey, service A can't write to DB one. So let's think together. What can you do at that point? You have a service that is not able to write to a database. This is probably a high-priority thing. We may have data loss, and most likely the users are noticing it. How are you going to understand what's happening there? If you look at logs, they are most likely going to point you to a specific location in your code.
So you'll have a line of code throwing the exception, and then you'll go to that code base and try to find out which other lines led to this specific line. So you kind of play this game where you go through the different files, the different components of your code, and try to figure out what could have led to this specific exception.
Or you are more of a metrics guy and you would say, okay, let's see what's currently happening in DB one. Maybe there is high CPU, maybe there is, I don't know, an IOPS issue. Maybe you have some metric telling you about it, and maybe it's just an increase in traffic, right? Maybe I just have way more requests to service A. And then I need to ask myself, well, which endpoints in service A are actually causing a query to DB one? And then I may ask myself, maybe it's not an HTTP call, maybe it's, I don't know, a Kafka message that is being sent. So this is kind of the thought process you go through when you have issues in microservices, in a distributed environment.
Let me try to illustrate that a bit. We have service A and we have DB one. We know that. But what we don't know is that maybe we have two services producing messages to service A. And then the question is, is only the communication between service B and service A causing this issue? Or maybe it's service C, or maybe it's both. When we looked at logs, went to our code base, and started to go through the path that the code execution took, that was very efficient within one process, within one service. With multiple services, it's hard to do. It's hard to jump between services and understand how they interact with one another.
This is basically what traces are for. The log told us, hey, this is the situation of a specific process, a specific service, and this service is unable to do a specific action at a specific line of code. The metric told us kind of the overall situation of the system. It told us that maybe DB one had high CPU.
The trace is telling us the story between the services. It's telling us the path that this specific API call took. Maybe it was service B, service A, DB one. It gives us the context between the services within the entire system. So we are probably going to say that we need all three. We need logs, metrics, and traces in order to understand how an incident occurs.
So let me give you a quick look at how a trace could look. Here you can see a system that presents traces, and you can see right here that this whole trace starts from an API call to purchase order in the order service. The next thing that happens is that we call user verify in the user service. Then we have this API call to an external API. Then we have some save interaction, followed by another service that writes to the database, and eventually a Kafka message is produced. So I have this view kind of telling me the map, and this view kind of telling me the timeline: what happened in parallel, what happened in sequence. And as expected, you can click something and get the overall data: what was the request, what was the response? And if we're talking particularly about Kafka messages, as I mentioned before, it may not be HTTP, it may be some messaging protocol such as Kafka, then you want to be able to correlate between the producer and the consumer.
So basically, for me, a trace is mostly this view. It's this tree view, a child-parent relation that tells you your request started at this point, then it got to service A, or the order service, and then the user service. And basically this is going to tell me what the interactions between the different services were. So that's tracing for me: the ability to see a particular API call being propagated throughout the different services. It's kind of a magic thing, and the way that it works is, I think, very interesting from a development point of view.
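
The tree view described above can be sketched in a few lines of plain Node.js. This is only an illustration, not how any particular tracing UI is implemented: the span IDs, the exact nesting, and the operation names are assumptions made up for the example, loosely following the order service and user service walk-through.

```javascript
// Hypothetical spans belonging to one trace. IDs and nesting are invented
// for illustration; a real backend receives these from the services.
const spans = [
  { id: 'a', parentId: null, service: 'order-service', name: 'POST /purchase-order' },
  { id: 'b', parentId: 'a', service: 'user-service', name: 'GET /user/verify' },
  { id: 'c', parentId: 'b', service: 'user-service', name: 'call external API' },
  { id: 'd', parentId: 'a', service: 'order-service', name: 'produce Kafka message' },
];

// Group spans by parent, then walk from the root, indenting one level per hop.
function renderTrace(allSpans) {
  const childrenOf = new Map();
  for (const span of allSpans) {
    if (!childrenOf.has(span.parentId)) childrenOf.set(span.parentId, []);
    childrenOf.get(span.parentId).push(span);
  }
  const lines = [];
  const walk = (parentId, depth) => {
    for (const span of childrenOf.get(parentId) || []) {
      lines.push(`${'  '.repeat(depth)}${span.service}: ${span.name}`);
      walk(span.id, depth + 1);
    }
  };
  walk(null, 0); // the root span is the one without a parent
  return lines;
}

console.log(renderTrace(spans).join('\n'));
```

Each level of indentation is one parent-child hop, which is the same information the tree view in a tracing UI conveys.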
So, OpenTelemetry. This is the standard way to collect traces. OpenTelemetry can collect other things like metrics and logs, but it is most mature in traces. And maybe before I start to explain what it is: OpenTelemetry is an open source project, of course, under the CNCF, the Cloud Native Computing Foundation. This is the foundation that is also responsible for Kubernetes, for instance. So it's in good hands.
The process goes like this: you implement an SDK within your code, within your process, within your microservices. Then OpenTelemetry is going to collect the data, collect the traces, and ship them somewhere so that you'll be able to have this tree view. And this is going to be a parent-child relation between all the different hops.
As you can see here, we have service A, service B, and a database. You can see that both services have OpenTelemetry installed in them. By this I mean we took the OpenTelemetry SDK and actually installed it within the service. Now, when service A and service B are communicating, it's very easy, like with logs, for service A to just report what happened and for service B to just report what happened. But we don't want just a report of the event that, hey, I got an API call. I want something a bit more sophisticated. I want to know that when service B is invoked, it was invoked by service A. For that, when you send an API call from service A to service B, OpenTelemetry is going to inject the OpenTelemetry context. What does that mean? It means that when service A is sending an API call to service B, it's going to leave something like a breadcrumb that says, hey, I was the one that sent you this message. So when you're reporting whatever happened within service B, please report it as a child of what happened in service A.
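
To make the breadcrumb idea concrete, here is a minimal sketch of context propagation written with the Node.js standard library only. It does not use the OpenTelemetry SDK; the helpers `startSpan`, `inject`, and `extract` are hypothetical names for this example. The header layout follows the W3C Trace Context `traceparent` format (version, trace ID, parent span ID, flags), which is an assumption about what the talk's setup sends on the wire.

```javascript
const crypto = require('crypto');

const randomHex = (bytes) => crypto.randomBytes(bytes).toString('hex');

// Start a span. A span with a parent stays in the parent's trace and
// remembers which span invoked it; a span without one starts a new trace.
function startSpan(name, parent) {
  return {
    name,
    traceId: parent ? parent.traceId : randomHex(16),
    spanId: randomHex(8),
    parentSpanId: parent ? parent.spanId : null,
  };
}

// Service A side: leave the breadcrumb in the outgoing request headers.
function inject(span, headers = {}) {
  headers.traceparent = `00-${span.traceId}-${span.spanId}-01`;
  return headers;
}

// Service B side: read the breadcrumb back from the incoming headers.
function extract(headers) {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/
    .exec(headers.traceparent || '');
  return match ? { traceId: match[1], spanId: match[2] } : null;
}

const spanA = startSpan('service-a: call service-b');
const headers = inject(spanA);
const spanB = startSpan('service-b: handle request', extract(headers));

// Same trace, and service B's span reports service A's span as its parent.
console.log(spanB.traceId === spanA.traceId, spanB.parentSpanId === spanA.spanId);
```

With an SDK installed, this injection and extraction is done by the HTTP instrumentation, so application code normally never touches the header.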
So all of those are going to be shipped into a backend, let's call it a tracing backend for simplicity, and let's see what is being reported. Service A is going to say, hey, I sent an API call to service B, and it's going to say this is trace ID number one. Every trace has its own ID, and every interaction, every hop between services, any action taken within the trace, we refer to as a span. So here we are just reporting that this is span ID 55, and we don't have any parent because this is the root. Then span ID 55 is going to be injected into the headers sent to service B, and service B is going to say, hey, I got an API call from service A. It is still trace ID one. I am span ID 66, but I have a parent. Unlike what is written in the presentation, which is a mistake, I do have a parent, and the parent is 55. And then the span reported by service B says that it's writing to the DB or querying the DB. Again we have the same idea: we're reporting the trace ID and who our parent is. And by reporting this parent-child structure, eventually we're able to render in the UI how this trace looks, in the nice tree view that we saw earlier. Okay, so this is how it works.
So when do you use it? We use it mostly when we have production issues. We can use it in other places as well: while developing, in our staging environment, but mostly in production. I actually wrote a cool open source tool for using traces in your testing. So you can do a lot of stuff. But the common use case would be: how do I understand what failed? How do I improve something that works slowly? How do I understand whether the system is working as expected or not? If I try to give it a bigger name, I would say that we are trying to improve our MTTR.
MTTR being mean time to resolve, or recover, or something starting with R that indicates that the problem no longer exists. So that's what we're trying to do. We're trying to solve things faster. And having this cool image telling us, oh, service A called B, B called C, with a specific indication of where the error happened and what led to it, that's what's going to help us solve things very fast.
So I've spoken quite a lot and I really want to show you how it looks in real code. What do I need to do in order to have OpenTelemetry implemented in my code tomorrow? So let me give you a quick look. Here I have two services. I have my user service, and the user service is doing something really simple. But let's start with the item service. The item service has a data endpoint, and what it basically does is call the user service that we saw a second ago and respond with that data. So the item service is calling user and then responds with the data. If something doesn't work in slash data, for instance if I put fail in my query string, I will respond with an error. And you can see here that I did two specific things around OpenTelemetry, which I'll explain in a second.
The user service is also doing a very simple thing. It gets an API call. It communicates with some mocking solution, then randomizes a number according to the length of the array that we got. We report this number to OpenTelemetry, and then we just respond, and that's all good.
Both services import a file called tracer and just provide a name: the user service provides user service, and the item service does the same. So basically that's all you need to do. When it comes to OpenTelemetry, you just need to have a single file. You will see that the installation is quite simple: the code within the tracer, and that is it.
All the other interactions that I showed you are not mandatory, but you can definitely go ahead and add them if you would like to.
So, the tracer. The tracer is actually very simple. Basically, OpenTelemetry collects what's happening in your service and then sends it somewhere. So the Jaeger exporter is going to export data to Jaeger. We will see Jaeger in a second. It's an open source tool that can visualize traces. It can be something you spin up locally, or some production endpoint that you're using, or you can choose to use a vendor, and then you'll get a few more features than Jaeger and you don't need to operate Jaeger yourself. So we're telling OpenTelemetry where to send the traces. Then we tell OpenTelemetry: when you send those traces, or those spans, to be more accurate, please indicate the service name, so we will be able to distinguish between services.
Then there is some generic setup. Here you specify which libraries you want to instrument, to collect data from. I went with a simple list of HTTP and Express, but there are a lot of other instrumentations, like Kafka, Mongo, Redis, the AWS SDK, you name it. Most likely there is an instrumentation for your need. Basically, instrumentation means: please collect data from this library. So here specifically we're talking about the HTTP library, the native Node HTTP module, and the Express one. This is it. This is all there is to it. Everything you are about to see is going to work based on that.
I am running two services. I ran yarn users to start the user service and yarn items to start the item service. So let's have a quick look at what happens when I send an API call to data. I'll go to Jaeger. This is Jaeger, and let me fetch the latest trace. So this happened right now. And when I click on it, you can see that we actually called data, and you can see this is under the item service.
And then we communicated with the user service. And you can see here that we sent an API call to our mocking service. You can see everything that you would like to see, and it's going to tell you what really happened. Now, this trace is not that interesting because everything is local and the communication is quite simple, but this is how you will be able to debug whatever is working or not working.
You do remember, let me even show you that again, that in user we're getting an array and randomizing a number. So I got someone from Madrid. Let me refresh that. Now I got someone from London, and I want to understand why this happened, why this data was randomized. So what I did here: I got the current span, the active span actually handling this code, and I just wrote a note, hey, a number was randomized, and this was the number. And you can actually see it right here. This is kind of a log, right? It allows me to send logs within my trace, so it's kind of putting them together. I can see what happened between services, but I can also attach to those spans what happened within the service. So this is a cool trick that you can use, add event. It's very useful.
Now let's do something else. Let's make it fail. When I run with the query string fail, let's look at the code. When I have the query string fail, I'm throwing a really bad error, and I am doing a very interesting thing. What I'm doing here is fetching the current span, and in my logger, in my console, I'm actually writing the current trace ID. So assume that in your production environment you probably have some log solution, something like Kibana or so, and you have an exception. Now that's cool. But now I want to see this specific exception not only in logs, but also in traces. So you can see here I'm printing critical error, and here I have my critical error and I have my trace ID.
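
The two tricks above, attaching an event to the current span and writing the current trace ID into a log line, can be sketched without any tracing library. This is a stand-in, not the OpenTelemetry API: `withSpan`, `getCurrentSpan`, `addEvent`, and `logError` are hypothetical helpers, the span is a plain object, and the IDs are example values. Node's `AsyncLocalStorage` plays the role of the active context here.

```javascript
const { AsyncLocalStorage } = require('async_hooks');

// Holds the span that is "active" for the code currently running.
const activeSpan = new AsyncLocalStorage();

const withSpan = (span, fn) => activeSpan.run(span, fn);
const getCurrentSpan = () => activeSpan.getStore();

// Attach a log-like note to whatever span is handling this code.
function addEvent(name, attributes = {}) {
  const span = getCurrentSpan();
  if (span) span.events.push({ name, attributes, time: Date.now() });
}

// Write the trace ID next to the message so the log line can be
// looked up in the tracing backend later.
function logError(message) {
  const span = getCurrentSpan();
  const line = `${message} traceId=${span ? span.traceId : 'none'}`;
  console.error(line);
  return line;
}

// Example values only; a real SDK generates these IDs.
const span = {
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  spanId: '00f067aa0ba902b7',
  events: [],
};

const logged = withSpan(span, () => {
  addEvent('A number was randomized', { randomIndex: 2 });
  return logError('Critical error');
});
```

Searching the tracing backend for the trace ID found in the log line then leads to the span that carries the event, which is the jump between logs and traces that the demo shows.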
So I'm going to copy that, go back to Jaeger, and just paste it right here. And now I can see the specifics of this error, and I can see all the different things that are related to this specific action. So we are kind of tying the logs together with the traces, so we'll be able to jump from one to the other. You can add the current trace ID or span ID to your logs, and you can also add something similar to logs to your spans.
So this is all you need to do. This is everything there is to know. And I would urge you to go and try, because it takes very little code to get started and see what you're getting from it. If you are interested, the code examples are available right here. On GitHub, in aspecto-io opentelemetry-bootcamp, you can grab the first episode of the bootcamp. That's almost exactly what I showed you, and it will get you started with OpenTelemetry quite fast. So I really hope you enjoyed this talk, and if you have any questions, feel free to reach out. And best of luck with having traces.
...

Michael Haberman

CTO @ Aspecto
