Conf42 Cloud Native 2021 - Online

Using OpenTelemetry in a polyglot environment

Michael Sickles will walk through instrumenting the Google Microservices demo app with OpenTelemetry.

This demo app uses Go, Java, Node.js, .NET, and Python, all communicating in a distributed microservice architecture.

Michael will talk about the challenges he faced getting everything to connect and show up in a single OpenTelemetry trace.


  • Michael Sickles talks about how to get OpenTelemetry working in a polyglot environment. OpenTelemetry gives you tools, APIs, and SDKs to ask questions about your system. It's a standard, so you can spend the work to instrument and get application insights once.
  • The Google microservices demo uses Skaffold to automatically deploy into a Kubernetes environment. It originally uses OpenCensus manual instrumentation; OpenTelemetry wrappers automatically add trace context and span durations for the gRPC calls. You can add absolutely anything that will be useful later when you want to ask questions about that data.
  • Getting the automatic instrumentation working was very similar across languages. The documentation was a little lacking on the Python side, and there were a lot of learnings where tracing wasn't propagating. Eventually, the grand trace of the whole system came together.
  • Next steps: add more information. Michael recently figured out how to use baggage. Eventually this will also serve as a demo environment with some arbitrary slowness added, and more manual instrumentation is needed.
  • Resources: at the bottom you can see the microservices fork Michael created on GitHub. You can also look at the OpenTelemetry docs and GitHub, and there is a Slack channel.


This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome. Today I'm going to be talking about OpenTelemetry, specifically how I got OpenTelemetry working in a polyglot environment. Before we begin, my name is Michael Sickles and I'm a solutions architect at Honeycomb.io. I've been working with our customers recently, setting up instrumentation for their environments and using OpenTelemetry to get application insights. And that's what it's really for: OpenTelemetry gives you tools, APIs, and SDKs to ask questions about your system, to see things like traces, metrics, and logs. Why it's important and useful is that it's a standard, so you can spend the work to instrument and get these application insights once, and not have to redo it later. Typically, in the old style, a lot of the vendors out there have their own proprietary format for how to add instrumentation and get insights. And it's frustrating when you want to either try multiple tools or switch to a new tool, because it takes work to lift and shift and rewrite code in order to try things out. With OpenTelemetry, you do that once, and as you move along you can keep adding OpenTelemetry insights as you develop new code. You can then point it to one or multiple vendors, to be honest, and from there see which tool is going to be best for your situation and where you're going to get the most value. You're no longer locked in just because of all the work you put in upfront. And so I wanted to find a good demo environment to help our customers see how to use OpenTelemetry. There are a lot of examples for individual apps: here's a Node app, how do you set it up for a Node app? Here's a Java app, how do you set it up for a Java app? But there aren't a lot of good tutorials or examples of something more complex, something that does distributed tracing across multiple different kinds of code systems. And so I found the CNCF Google Microservices demo.
This is a microservices app on Kubernetes, and it's polyglot: it's got Node, Java, Go, .NET, and Python, and it uses OpenCensus. So OpenCensus was the way it got its telemetry insights. OpenCensus is a standard that predates OpenTelemetry, and there was another standard called OpenTracing; OpenCensus and OpenTracing decided to come together, unify the standards, and have one. That's where OpenTelemetry came about: they merged together. So I could take that OpenCensus instrumentation and know I can switch it over to the OpenTelemetry style. There might be some semantic differences, but I could say: look, here it's using OpenCensus to get insight; let's get insight in OpenTelemetry as well. The application itself is on the right, and it's an ecommerce website. It allows you to find items to add to your cart and check out; you can convert different currencies; you can get ads. And these are the different services that make up the application. What's going to happen is we're going to start at our front end, which is written in Go, and we're going to take some kind of action, maybe add to cart, or check out, or see an ad. These are the traces, and I'll get more into that later, that we're going to follow through as the front end makes a call to a back-end service, which might make calls to other back-end services. These are on different servers, and those are the different coding languages. We want to be able to watch one call, one action, one transaction, and see all the different pieces that action talks to and connects to. What that allows us to do is see where slowness might be in the system, or where there are errors in my system. I can target and get to root cause faster with a tool like OpenTelemetry and some vendor out there. When I was considering how to instrument and what I wanted to do, I right away thought: I'm going to reuse that OpenCensus code. Like I mentioned before, it's already in place.
I can just convert it to OpenTelemetry semantics and OpenTelemetry libraries, and it will give us good insights. I'm just going to follow the front end to the back end. So I'll start with that front end first, then move my way to the other services it talks to. And I'm going to use automatic instrumentation when possible. When you're going through and adding telemetry to your system, all the different code languages have various ways to get automatic insights. That can vary from automatically hooking into, say, the JVM for Java, to pulling in specific wrappers that wrap the libraries themselves, say in Node.js, to automatically do tracing for us and automatically start and stop units of work for those libraries. So if it can do the work for us, that's great, right? That's another benefit of OpenTelemetry: we have all these different organizations working on it, so if a library is being used at one organization and they add the instrumentation pieces, that might make its way upstream into OpenTelemetry, and any other organization that uses those libraries can get insights. Finally, though, I'm going to want to add manual instrumentation as well. You can get a lot from automatic instrumentation, but it can only take you so far. You know your system better than anyone else, and no automatic instrumentation is going to get you to where you really understand the state of your system. You've got this auto instrumentation, which is good, but to take it to the next level you can add things like user details or server details or product details that are really going to be those nuanced differences in why your system might be performing or breaking in different ways. By adding in those details, you can ask more questions about your system.
And that's what we ultimately want to do: understand how it's performing, and maybe for whom it sucks in my system. Before I go on, a little bookkeeping here, terminology. A span is a unit of work. It's an action in code, maybe a function or a method, and it took some amount of time: three milliseconds, 20 milliseconds. It's something we measured. Then we have span attributes, where we add contextual details: that user id, that session id. We want to add the variables in our code to the span so I can understand what was going on in my system at that specific point in time. Then I can add span events; a span event is essentially a log attached to a span, more or less. It's something that doesn't necessarily have a duration, but is interesting. For example, an exception: if there is an exception, we want to take that information and attach it to this specific point in time, that span, that unit of work, where it happened. But an exception doesn't take a certain amount of time; it's just something that happened at a point in time. A trace, then, is a collection of spans for a certain action: that add to cart, that check out. It's something your users are doing, and it's going to touch different spans as it goes through different pieces of your code, connecting them together using a unique id. OpenTelemetry automatically generates that id and connects it across your code, and your vendor tool of choice is then going to render it in some kind of view in the UI to make sense of it. If you ever see me mention OTLP in this presentation, OTLP is just the specific protocol and wire format for OpenTelemetry itself. And then exporters: an exporter is where we send our data. We can export to the console window, or we can export to a vendor, somewhere to send those application insights.
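To make the terminology concrete, here is a minimal Python sketch of those pieces: a span with attributes and events, linked into a trace by a shared trace id. This is illustrative only; the names and shapes here are mine, not the official OpenTelemetry API, which is far richer.

```python
# Illustrative data model for span / span attribute / span event / trace.
# Not the real SDK: just the concepts in stdlib Python.
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str                                   # shared by every span in one trace
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_id: Optional[str] = None
    attributes: dict = field(default_factory=dict)  # contextual details: session id, user id
    events: list = field(default_factory=list)      # point-in-time logs, no duration

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def add_event(self, message):
        # A span event is a timestamped log attached to this unit of work.
        self.events.append((time.time(), message))

    def start_child(self, name):
        # A child inherits the trace id; that shared id is what lets a
        # vendor tool stitch spans across services into one waterfall.
        return Span(name=name, trace_id=self.trace_id, parent_id=self.span_id)


def start_trace(name):
    return Span(name=name, trace_id=secrets.token_hex(16))
```

For example, `start_trace("checkout")` followed by `start_child("charge-card")` yields two spans with the same `trace_id`, which is exactly the linkage a trace waterfall view renders.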
So I started with the front end, and you can see that I ripped out the OpenCensus libraries. By the way, the GitHub repo where I made these changes will be at the end of the presentation as a follow-up; if you want to see the specific things I changed from OpenCensus to OpenTelemetry, you can see all the changes using the git history and compare. I'm not going to show every single little code change, just because it took me a couple of hours to do and I can't fit it all in one presentation. But I loaded the OpenTelemetry SDKs, and you can see some of the libraries have similar namings. For example, there's an OpenCensus trace package, and there's also an OpenTelemetry trace package, so the idea of traces and spans is similar across the two. Then there's auto instrumentation: in Go, we have wrappers around the various libraries. In this case I have a gorilla/mux router, and I want to automatically get insights on my different HTTP requests. There is a library out there that does that for me: I import it and just wrap my router. Going through OpenTelemetry, this is going to be pretty common across all the different coding languages: we're going to create some kind of exporter, because we need to send our data to a location. I've removed some vendor-specific information here, but the gist is you're going to send it to some API endpoint. If that API endpoint is secure, you're going to have to be careful, because you're going to want some SSL credentials. This is a nuance I found going through Go: it doesn't automatically infer whether it's HTTP or HTTPS.
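The wrapping idea behind auto instrumentation, where a library intercepts each call and times it so your handler code stays untouched, can be sketched in a few lines. The demo does this in Go with the OpenTelemetry gorilla/mux wrapper; this is a hypothetical Python equivalent to show the mechanism, with made-up names (`traced`, `recorded_spans`).

```python
# Sketch of what an auto-instrumentation wrapper does: start a span,
# time the call, record errors, end the span -- all without touching
# the handler's own code.
import functools
import time

recorded_spans = []  # stand-in for an exporter


def traced(handler):
    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        error = None
        try:
            return handler(*args, **kwargs)
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            # The span is recorded whether the handler succeeded or raised.
            recorded_spans.append({
                "name": handler.__name__,
                "duration_ms": (time.perf_counter() - start) * 1000,
                "error": error,
            })
    return wrapper


@traced
def add_to_cart(item_id):
    return f"added {item_id}"
```

Wrapping a router works the same way, just applied to every registered handler at once, which is why one `import` plus one wrap call buys tracing for all HTTP requests.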
You have to add in these blank credentials to say, hey, this is going to a secure endpoint; if yours goes to an insecure endpoint, you wouldn't need that piece. And we're exporting in that OTLP format over gRPC. The next thing we're going to do is create the tracer. What this tracer does is automatically propagate the trace context, the tracing information, and connect and create that unique id. Then we have this span processor that's processing our spans. We have a batch span processor: rather than hitting the endpoint for every single event, it batches them together so you can save some network bandwidth. And we just have to add a little more contextual information: I added a service name for my front end here. Beyond that, there are a couple of other pieces in the code that I wanted to add. You can see at the top that r.Use middleware call; when you look at the code, that is me taking the OpenTelemetry middleware, wrapping my gorilla/mux router, and getting that auto instrumentation piece. That's what I wanted, right? Beyond that, the microservices demo uses gRPC to communicate with all the different back ends. There is actually an OpenTelemetry gRPC wrapper as well, and I was able to utilize that to automatically add trace context and span durations for my gRPC calls. So great, less work for me. But I do need some manual instrumentation. Ultimately, if I'm going to ask questions about my system, I want to understand what is going on in my code. So for example, maybe that session id: I get a support email, and I can look at that session id to see what happened for that user. Not only am I understanding high-level details of how my system is performing and how my calls are doing, I'm now empowered with extra details from the variables in code. I have things like an email, and in this case a zip code, state, country, and session.
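The batch span processor described above is simple enough to sketch: buffer finished spans and hand them to the exporter in groups, saving one network round trip per span. This is a toy under assumed interfaces (an exporter is just a callable taking a list); real SDK processors also flush on a timer and at shutdown.

```python
class BatchSpanProcessor:
    """Toy batch processor: buffers finished spans and exports them in
    groups instead of making one network call per span."""

    def __init__(self, exporter, max_batch_size=512):
        self.exporter = exporter            # any callable taking a list of spans
        self.max_batch_size = max_batch_size
        self.buffer = []

    def on_end(self, span):
        # Called once per finished span; ship a batch when the buffer fills.
        self.buffer.append(span)
        if len(self.buffer) >= self.max_batch_size:
            self.flush()

    def flush(self):
        # Also needed at shutdown so trailing spans aren't dropped.
        if self.buffer:
            self.exporter(self.buffer)
            self.buffer = []
```

With `max_batch_size=3`, seven spans produce two full batches plus one partial batch on the final flush, rather than seven separate sends.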
But I can add absolutely anything that I think will be useful for me later when I want to ask questions about that data. With that, I was able to deploy it. The Google microservices demo uses Skaffold to automatically deploy into this Kubernetes environment; this is in our AWS cluster. And honestly, I just took that front-end instrumentation and copied it. There are three other Go services, and once you've done it in one coding language, you reuse it: copy and paste, just change the service name. I used the same gRPC wrappers because they also use gRPC to communicate. It was pretty easy once I got the front end working. From there I needed to do my next service, and I was looking downstream, like I said, front end to back end. Well, this front end touches the ad service. Java is really nice, because Java has an agent. This is different from the other libraries and coding technologies; those use SDKs. Java has an agent that hooks into the JVM itself. As it hooks into the JVM, you just set a couple of environment variables, which you can see down here on the bottom, and it comes to the rescue. It has a long list of automatic instrumentations, tons of different libraries, like Spring and databases and HTTP calls and Kafka. It wraps those automatically for you from the JVM context, and you're going to get a lot of good insights in Java. Java is GA now. When I originally did this, and you can look at the code, I still need to update it; it was version like 16 or 17, but now that it's GA I do need to eventually update to the newest Java agent. So starting with this Java agent, that's great. But there was existing OpenCensus manual instrumentation, and one of the things I wanted to do, as I said, was reuse it, right? And you can see how the terminology is very similar from OpenCensus to OpenTelemetry. This is common across a lot of the other languages as well.
But here you can see I wanted to add attributes: in OpenCensus it was putAttribute, and now I just switch it over to setAttribute, an easy change; addAnnotation became addEvent. So now I have a span event, and I have that logging information with context about my Java application, and that's awesome. Java also has a really neat thing in that you can take the manual instrumentation SDK and hook into the automatic instrumentation. It has this @WithSpan annotation that will automatically wrap your function call and do the tracing and the timing for that span, that getAds span, which is great. And then I can get the current span, the specific span from the auto instrumentation at the point in time I'm at, and add in different attributes: that setAttribute, the addEvent, et cetera. This allows me not to have to sit there and manually start and stop my spans. That's something I was trying to avoid if at all possible; it's just a little more work. With this I got going pretty quickly, and I started seeing ad service information from my code in my vendor tool. So I'm going to continue down that path of tracing across multiple parts of my system. Moving down from the checkout service, I see that it touches two Node services, so I decided to go there next. So inside my payment service, which is one of them: Node is a little different, and this is a common theme, there are going to be little nuances. The Node code uses this tracing.js file, and that gets started up with the node command. You'll see in the Docker image for this in the source code, I just added: hey, start up with this tracing.js. And it's going to go through and do similar things. I'm going to have an auto instrumentation piece. Node is nice in that you can wrap or rewrite the JavaScript code. In this case I have plugins, and I'm loading in a gRPC plugin.
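The "grab the current span" trick, fetching whatever span the auto instrumentation already started so you can attach attributes without managing span lifetimes yourself, rests on context propagation. Here is a hypothetical stdlib Python sketch of the mechanism; the function names (`start_span`, `current_span`, `handle_get_ads`) are mine, not an official API.

```python
# Sketch of "current span" tracking via context variables: auto
# instrumentation sets the active span, manual code enriches it.
import contextvars
from contextlib import contextmanager

_current_span = contextvars.ContextVar("current_span", default=None)


@contextmanager
def start_span(name):
    # The auto instrumentation would do this part for you: start a span
    # and make it "current" for the duration of the call.
    span = {"name": name, "attributes": {}}
    token = _current_span.set(span)
    try:
        yield span
    finally:
        _current_span.reset(token)


def current_span():
    # Manual code reaches for the active span instead of starting
    # and stopping its own.
    return _current_span.get()


def handle_get_ads(user_id):
    span = current_span()
    if span is not None:
        span["attributes"]["app.user_id"] = user_id  # enrich, don't manage
    return ["ad-1", "ad-2"]
```

The payoff is the same as in the Java agent case: business code only ever annotates the span it finds, so span start/stop bookkeeping never leaks into handlers.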
Once again, gRPC calls are what it's making, plus an HTTP plugin, but there are multiple plugins out there. There's an express plugin. If you go on npm you can find them, and they're also on the GitHub; they might not be in the docs yet, but I'm sure they'll be there soon. You just npm install these plugins. In reality, I shouldn't have had to manually put in this plugin configuration, the names and enabled flags and locations; according to the docs it should have been done for me automatically. It wasn't working, so I manually said: hey, these are my plugin names, this is where they're located, please enable them. From there, we're setting up a collector again. Java didn't have it where you needed to create SSL credentials, but Node does: I have to create SSL credentials if I'm sending to a secure endpoint, and in this case, for this specific implementation, I am. With that, I'm also exporting in my OTLP format over gRPC. I have to be mindful of using the right Node.js libraries: gRPC is usable in the back end, so I could use that and my OTLP format to send it directly to this vendor. With that, creating my tracer, I'm going to, in this case, just kick it off, register it, add the auto instrumentation pieces, and go see what's in my UI eventually. But there are some things to understand; there's more nuance, and it's not always great. For example, when I wrote this, Node.js wasn't GA, and it still might not be GA; I'm pretty sure it's still not GA, but it's working towards it. Node not being GA means there are sometimes going to be changes to the spec, the APIs, and how you do things. And this is a case where, compared to when I originally did this, there's now a little difference in how you might do things. I will eventually update the repository to use the new method and update to the newest version of the Node.js OpenTelemetry libraries. But essentially it's similar: instead of a plugin, it's an instrumentation. It's still loading that auto piece and wrapping around it.
All right, so now I have my two Node services; once again I'm just copying and pasting my tracing.js, and I'm going to move on to the next piece. For that next piece, I decided to go downstream to this cart service, coming from that checkout service. It's .NET Core, and with .NET there are, once again, more nuances. These are things I had to work out going through the documentation and the GitHub repos. .NET uses the built-in Microsoft profiling libraries, so there are some differences in naming: if you use the manual instrumentation, you'll see they have slightly different terminology for adding attributes, putting attributes, et cetera. But at a high level, getting the automatic instrumentation was very similar. It's pretty straightforward in that I have a startup file where I'm configuring my services and going to initialize my telemetry. So I add my tracer, my OpenTelemetry tracer, to the services themselves. From there I have my instrumentations; instrumentations are my wrappers, my automatic instrumentations: automatically take that trace id, automatically take the durations of how long pieces took, less work for me. With that, I then also add an exporter, once again OTLP format, going to a specific endpoint. I removed the vendor-specific code here, because you might have to add things like API keys and the vendor URL where you want to send the data, but you should be able to take this and apply it to different vendors, at least from a reference standpoint. And then, of course, we see that same issue: we need to set up SSL credentials if we're sending to a secure endpoint, and in this case it's a gRPC secure endpoint. I left just the automatic instrumentation for .NET; I didn't add in the manual instrumentation yet. That's still on my to-do list. So I decided to move downstream once again; all that's left is two Python services.
I'm getting close to the end, to being able to see this grand trace of my system, to see communication between services. So I have this email service in Python, and I personally have not used a lot of Python in a production environment. I've used Python for scripting, but not really for a web application. So I had to do a little more reading up and figure out how a requirements.txt works. Using that requirements.txt, I was able to once again remove the OpenCensus code, libraries, and references, and instead add the OpenTelemetry ones. The documentation was a little lacking on the Python side. I think it's just the nature of things still growing and still being in flux for some of the languages, but it will get there eventually. We're following the same thing we did before: we have an exporter, and with our exporter we're doing OTLP format to our endpoint, like before, with empty SSL credentials. Great. We have a trace provider, a tracer, whatever you want to call it in the different languages, and in this case we give it our service name and add our span processor. We're going to simply export to this location. Then I wanted to add some manual instrumentation as well. There is this server interceptor, an OTel-specific wrapper for my gRPC server, and it allows it to get that trace id from the upstream calls and automatically add it to the Python calls. And I like that automatic instrumentation; as I've mentioned before, I want to get going quickly, see what I get out of it, and then add my manual instrumentation later. So we got that. What does it all look like? Did we actually do it? The short answer is yes. The long answer is it took me a couple of tries. There were a lot of learnings where I didn't get tracing propagating right. For example, in that Go piece, I had to add that automatic gRPC instrumentation, because without it, it wasn't propagating the trace ids.
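That cross-service propagation rides on a single header, `traceparent`, defined by the W3C Trace Context specification as `version-traceid-spanid-flags` in lowercase hex. A stdlib sketch of building and parsing it shows why a missing wrapper breaks the chain: if no one sends or reads this header, each service starts a fresh trace. The helper names here are mine.

```python
# Sketch of the W3C Trace Context `traceparent` header:
#   00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>
import re
import secrets

TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_traceparent(trace_id=None, span_id=None, sampled=True):
    # Each service sends this header downstream; the receiver keeps the
    # trace id and records the sender's span id as its parent.
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    m = TRACEPARENT_RE.match(header)
    if not m:
        return None  # malformed header: the receiver starts a fresh trace
    _version, trace_id, span_id, flags = m.groups()
    return {"trace_id": trace_id, "parent_span_id": span_id,
            "sampled": flags == "01"}
```

The gRPC interceptors and HTTP wrappers in each language are doing exactly this extract/inject dance on every call, which is why adding the wrapper on one side but not the other silently splits the trace.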
And that's really the problem, and why I wanted to do this: there are all those good individual examples, but we really need more examples of a complex environment, a microservices architecture talking across different types of code environments and tracing through it. So here's the grand trace. This is in Honeycomb, because I work for Honeycomb, but it might look similar in the tool that you choose. And we can see that we are tracing across services. It's taking that trace id from my front end and sending it to my checkout back end. It's telling me how long I spent. That's great. This is the power of a trace and a span: in a trace waterfall view, I can see where most of the time is being spent. I can follow through, and I see from the Java piece or the Go pieces, or whatever have you, they all have these span events, which I can see in this case as those dots; those are maybe the exceptions or log messages that are important, and those span attributes. So now in my system I can ask questions. I can see: what was a user's experience when they did a checkout? That's ultimately the thing we want to solve for, right? We want to understand where the bottlenecks are; we want to get to root cause. Because if your system is down, your users are not happy, and we want our users to be happy so they continue using our products. A lot of lessons learned. Honestly, the biggest problem I had going through all this is that the documentation can be lacking for some of the languages. I did have to sift through the GitHub code itself. That is continually changing; already, since I did this a few months ago, there's better documentation, and more of the languages are GA. OpenTelemetry is moving very fast, and it is becoming more robust daily. And this is why it's important to have this nice, open format: you have the mind share of everyone out there who is interested in it working on it. That's great.
But with that, we saw that some of the languages are pre-GA. That is a risk you might have to take, but that is, once again, continually changing. More and more languages are becoming GA, and I expect that, at least for tracing, it will be pretty robust and GA across all the different languages we've seen today. Metrics and logs are in the pipeline for OpenTelemetry, and eventually we'll get there too. There are different nuances I saw. I was having trouble with that SSL piece, and that is something I had to dig into the code to figure out: why I needed to do it for some of the languages but not for Java. I figured out that it's just one of the nuances. We definitely also need more examples. Like I mentioned, the individual examples are good: you can go into the GitHub repos, and they all have examples on how to use OpenTelemetry; tons of vendors have examples on how to use OpenTelemetry for the individual languages. We need more examples of complex, real environments, because that's where you're going to run into the nuances, the edge cases: how to set something up, how to get tracing across the different languages, for example. Hopefully this Google microservices demo is going to help some more people out there, and I'm going to keep updating it. With auto instrumentation, your mileage may vary; that is something to always keep in mind. Auto instrumentation is good, definitely, to get up and going. But it's not going to be the be-all and end-all; it's not going to solve all your problems for you. So be prepared to go in and do some manual instrumentation. And yeah, I need to add back in some of the health checks. I had to remove them, because for some reason, when I added my OpenTelemetry code, the Kubernetes pods would crash because they weren't starting fast enough. But when I removed the health check that tested whether the pod was ready, it started up fine.
That's just something I need to do to get this demo back to the good standing the original OpenCensus version was in. So, my next steps: add more information. Just recently I wanted to figure out how to use baggage, so I dug through the docs, dug through code, figured out how to add baggage, and I got baggage working. Baggage is taking something like that session id you saw earlier and propagating it as well to all the downstream calls, not just a trace id, so that I can set something like that session id on every single one of my spans. I was able to add that into the code; you'll see it, at least for the front end and a couple of pieces it touches. I definitely need to add more manual instrumentation. That's another piece that might be lacking in some of the documentation: how to get specific situations set up. I want to make sure I cover those situations for any of my customers, and for anyone out there who is interested in setting up specific attributes and such for the different coding languages. Eventually we also want to use this as a demo environment, and with that we want to be able to add some arbitrary slowness. Your tracing tool should be able to identify bottlenecks, and we want to showcase that even in a complicated environment we can add slowness and quickly identify it. You should make sure the tool you're using can do that as well. Obviously, as I just mentioned, the health checks need to go back in. And then finally, I need to update to the latest versions. Feel free to make some PRs to the code. It's moving fast: the OpenTelemetry libraries have updated multiple versions in just a few months, as everyone works towards that GA for all the different languages. So eventually all of them will get to stable, and it'll be perfect and great. Until then, I'm just going to have to keep monitoring and keep it updated. Thank you.
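Baggage, like trace context, travels as a header: the W3C Baggage format is a comma-separated list of `key=value` pairs, with values percent-encoded. A small stdlib sketch of encoding and decoding it (helper names are mine; real OTel SDKs handle this via their propagator APIs):

```python
# Sketch of the W3C Baggage header: "key1=val1,key2=val2",
# values percent-encoded so spaces and commas survive transport.
from urllib.parse import quote, unquote


def encode_baggage(items):
    # One header carries all the pairs downstream with every call.
    return ",".join(f"{k}={quote(str(v))}" for k, v in items.items())


def decode_baggage(header):
    items = {}
    for pair in header.split(","):
        if "=" in pair:
            k, v = pair.split("=", 1)
            items[k.strip()] = unquote(v.strip())
    return items
```

This is how a session id set once at the front end can appear as an attribute on spans in every downstream service: each hop decodes the header, stamps the values onto its spans, and re-encodes it for the next call.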
I really appreciate you taking the time to watch my presentation. Here are some of the resources. At the bottom you can see the microservices fork that I created on our GitHub repository page. You can also look at the OpenTelemetry docs and GitHub; that's where you're going to find a lot of your information to really understand how to use it. And then there is also a Slack channel: the CNCF Slack has OpenTelemetry channels where you can ask questions as well. Thank you.

Michael Sickles

Solution Architect @

