Conf42 DevOps 2023 - Online

Open Source Observability with OpenTelemetry

Abstract

The massive observability industry is shifting to OpenTelemetry. We'll talk about why that's happening, and how you can test OpenTelemetry in your stack. We'll cover:

  • The OTel Collector
    • Do you need to run a collector?
    • Processors and other Collector Magic
  • Language Support
    • Support levels vary, but the results will surprise you
  • Logging
    • Don't get discouraged! There are many solutions for logging even if your language support is limited
  • Distributed Tracing
    • The magic bullet

We'll also cover justifying the deploy time, what's supported out of the box, and how OpenTelemetry helps with root cause analysis.

Summary

  • Open source observability with OpenTelemetry, presented by Nocnica Mellifera. It's possible to have a fix without understanding the problem, but it's much better to have some understanding of what's going on.
  • In the era of the monolith, only a few people understood the whole system. With microservices, someone understands each of the interconnected dots completely, but nobody understands the map that covers all of them. For observability and any kind of understanding of an outage, in almost all cases, microservices are a dead-weight loss.
  • Distributed tracing is a hybrid between metrics and logging. The vast majority of trace data is never viewed. The goal of tracing is to show us the components that are hit by a request. But storage and management are their own challenge.
  • OpenTelemetry is a standard for the communication of metrics, trace, and logging data. The collector is where a lot of the magic happens: it can do things like filtering, batching, and adding attributes. Support for OpenTelemetry in Ruby is a lot better than you think.
  • Go check out telemetryhub.com for a really nice, cheap, efficient way to report up OpenTelemetry data. Thank you so much for joining me, and have a great conference.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, everybody. Thank you so much for joining me. I am Nocnica Mellifera. Let me put on the full face to say hi. Hi, everybody. Thank you so much for coming out. This is Open Source Observability with OpenTelemetry, with Nocnica Mellifera. You can find me most places at serverless mom. You can also just google the name Nocnica, turns out, and I come up. So that's fun. This is given in association with telemetryhub.com. Go check it out while we're talking this through.
Okay, so what is observability? Frankly, observability as a term is much more familiar on the west coast of the United States than it is across the entire tech sector. So I think it's fair to say, hey, hopefully you're here because you understood something about it or you've heard the term before. It is not a single tool or a special case or a standard. It is a design criterion. And I think of observability as being about the time to understanding, not just the time to knowing something is wrong. People like Charity Majors and the OpenTelemetry project have talked about defining it this way. We think of the time to understanding of a particular problem, issue, or service interruption as the first half of your time to resolution. And so a lot of the time throughout this talk, I'm really going to be referring to these situations where a service is completely down or otherwise really not performing as it should. But observability can also cover cases like, hey, why is this so slow for users in this region? Or, some people have reports of a bug that we haven't been able to replicate. These are also problems that observability can address.
Right, it's possible to have a fix without understanding the problem. This is an example where, hey, you know that eventually the service runs out of memory, so you go ahead and restart it. And we've all seen those setups where it's like, yeah, we just need to restart this thing every 24 hours because we know it's running out of memory, we don't know why. And so that's an example where we have no observability, really no understanding of the system, but we do have a fix. But without understanding, the stress of a particular problem is pretty high. Much better to have some understanding of what's going on.
Okay, so why are microservices a little bit harder for this? Why do they make the challenge larger? Let's talk about a historical time where we were really thinking about monoliths as the way of creating production software. In the era of the monolith, only a few people understood the whole system. Most people were working in little areas of it, and they often felt like they needed the expertise of a small group of people who really understood the whole system. But those who did understand the whole system had a very full explanation of problems that were happening on the stack. And the biggest problem with a monolithic architecture is actually not at all about how it performs. Some people will say, hey, we don't do monoliths anymore because they don't scale correctly. That may or may not be true; it's not always the case. But the problem that monoliths really created was that it often took months for someone to become an effective team member once they joined your community. And with a lot of people averaging just two years in a particular position, monoliths just don't work anymore. So you have to have these microservices so that people can get up to speed on a single microservice and be contributing within weeks instead of months.
And so that's the reason for the migration. It's not really because a monolith performs so poorly. And one of the things the monolith did a lot better was that all the information is available on the stack at any time you choose to stop and see what's going on. So a person who understands the monolith well can very quickly get to the bottom of a particular problem, because all the information is available. With microservices, someone understands each of the interconnected dots completely. They completely understand how that dot works. But nobody understands the map that covers all of these. Microservices obviously have a ton of performance advantages, scaling advantages, and again, that advantage of how quickly people can start contributing to the team. But for observability and any kind of understanding of an outage, in almost all cases, microservices are going to be a dead-weight loss. They're going to make the situation worse. For example, if it's a 5:00 a.m. or 3:00 a.m. outage with a monolith, once everyone on the team is awake and the people who understand the system best are awake and have gotten connected, somebody's going to understand what's going on. But very often with microservices, one of the common questions I have gotten when working with observability tools, from people who have these very deep microservice architectures, is: hey, on a normal request, no problem with the request, no failure, how do I find out which services are being hit by that request? So a simple question like, hey, when they come and check out from our ecommerce store, what services are involved in that checkout? Okay, so that shows you, and this is an oversimplified version of microservices, right, they really are multifaceted, very, very complex, and quickly build to a complexity where it's very hard to even understand where a successful request is going. And so we can move very quickly to that chaos where it's very hard for us to understand what's going on inside a microservice architecture.
Okay, let's talk about how we solve this with observability. There are three major components to observability that we need to ensure. I'm going to be a little bit quick with this because we're going a little bit deeper into concepts after this, so we're going to zip through this just a little bit. But there are really good write-ups on opentelemetry.io about the concepts of logs, traces, and metrics, which are the three pillars of observability.
So let's start with metrics, right? When you don't know what's happening, count something. I actually have lost where I got that quotation from; it's a quotation from a statistician. One way to think about what a metric is: the speedometer on your car is a metric, a numerical measurement of a complex system. So instead of saying, hey, you've just passed Slough and you're going into this next place, and then you're going to get there in this much time, or these other things, a metric is a very simple measurement: hey, you're currently going this fast. They're very easy, or they should be an easy way, to get a high-level view. This is a nuclear control station, so it really gives you a sense of how you can get so many metrics very quickly that you don't have a very quick and easy view, but you do have a high-level view of what's going on. And metrics are also very easy to store in high volume. So metrics don't usually present a challenge of, hey, where are we going to keep all of these?
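As a concrete aside, here is a minimal sketch of what "count something" looks like with the OpenTelemetry Python SDK. This is my own illustrative example, not from the talk; the meter, counter, and attribute names are placeholders, and the console exporter just prints readings so the snippet is self-contained.

```python
# Sketch: recording a simple counter metric with the OpenTelemetry Python SDK.
# Names ("checkout-demo", "checkouts", "region") are illustrative placeholders.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Export collected metric readings to the console every five seconds.
reader = PeriodicExportingMetricReader(
    ConsoleMetricExporter(), export_interval_millis=5000
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("checkout-demo")
checkout_counter = meter.create_counter(
    "checkouts", description="Number of checkout requests handled"
)

# Somewhere in request handling: when you don't know what's happening, count something.
checkout_counter.add(1, {"region": "eu-west"})
```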
If you're getting to a point where your database is struggling to contain the metrics that your production service is generating, either you're Netflix, and I'm sorry, or you have an issue with configuration, things like metric explosion, which we're not going to get into here. But yeah, normally metrics are very easy to store.
Then you have logs, right? Logs, as I say, always have a complete and thorough explanation of the problem somewhere. But storage and management are their own challenge. Logs can be so complex and can contain so much data that very often the real challenge is just sorting through them during a crisis. And so there are people who are of the opinion that if there is an outage, and if something is not working, they really don't want to be starting with logs. They know that they're not in a good place if logs are where they're starting.
And then finally, our new entrant in the last five to ten years into the story of observing our systems, which is tracing. Traces are essentially a hybrid between metrics and logging, and they're trying to generalize observed time spans, which is a little bit obtuse. But essentially tracing is supposed to show us the components that are hit by a request. And because we use a modern architecture, those are not going to be sequential; there are going to be multiple time spans happening all at once, at the same time. And we have a few more figures here to kind of help us see that. One little side note about tracing: tracing should be roughly as dense as logging, possibly more so. And one of the secrets about tracing is that most trace data is never viewed. And by most we mean like three nines of the data. The vast majority of trace data is never viewed. I see my little face is covering my joke there, right? Really, most of it is never viewed. That's kind of worth noting when we think about our data retention problems and other problems like that.
Okay, that is not what I wanted to do. Let's come over here. There we go. Okay, so from tracing, we came to the concept of distributed tracing. Distributed tracing really is just the implementation of tracing that is able to track an event between multiple microservices. So here you see this request being passed around, which is creating multiple events which are sent to other APIs and getting back responses. And each time this is happening, there's some kind of persistence going on that is saying, hey, here's the stuff that we want to log about what's happening, and we want to be able to connect all of those together. We don't want to just be filtering logs to see that connection; we want to be able to see easily that this request is connected. So at a very high level, how does distributed tracing happen? You add a trace header somewhere close to the start, you pass it around with the request, and then you have some collector-side logic, or some data-gathering-side logic, to stitch those pieces together. So the goal of tracing is to get something like this waterfall chart, which is showing us here are the components that were hit by this request, and ideally seeing them in some kind of hierarchy to say, hey, in general, we had a request to the API, it had these components that were hit, these were the ones that were running simultaneously, and here's how long they took.
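To make that span hierarchy concrete, here is a minimal sketch using the OpenTelemetry Python SDK. It is my own example rather than something from the talk, and the service and span names are purely illustrative. Nested spans are exactly the parent/child timing data a waterfall view is drawn from; in a real distributed setup the same trace context would also be injected into outgoing request headers (the W3C traceparent header) so downstream services can attach their spans to the same trace.

```python
# A minimal, self-contained tracing sketch with the OpenTelemetry Python SDK.
# Span names ("checkout", "auth", "payment-gateway") are illustrative only.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider that prints finished spans to the console.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-demo")

# Nested spans: "auth" and "payment-gateway" become children of "checkout",
# which is the parent/child timing data a waterfall chart is built from.
with tracer.start_as_current_span("checkout"):
    with tracer.start_as_current_span("auth"):
        pass  # call the auth service here
    with tracer.start_as_current_span("payment-gateway"):
        pass  # call the payment gateway here
```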
So it goes beyond just, hey, this went to here, which again, as I mentioned earlier, is often where people come in, because that's what they really need from the system: they just say, hey, I want to know what the heck is being touched by this request. These x-axis widths here have a meaning: they show how much time something took. So when we talk about tracing, you see a lot of discussion of spans, which are the measurements of the amount of time each of these components took. And then you get some kind of visual indicator of what was blocking what. Like in this case, auth needed to be completed before we could get to the payment gateway and dispatch.
Okay, so once we start thinking about distributed tracing, one of the problems that we run into is how do we get these individual pieces to communicate. In the sort of closed-source SaaS world, there were these efforts to say, okay, well, we'll create a library for maybe front-end measurement, for measurement of your back-end system, for measurement of your database, and then we can tie those together: if you use our closed-source tools, our SaaS tools, we'll be able to tie those together into a single trace. But as the microservices world started to explode, it got really difficult to negotiate that trace header value being passed successfully between all these things, and a single company, a single effort, no matter how big, just could not maintain a system that could be installed everywhere, would successfully pick up this trace, report it up to their system, and give you this nice unified trace. There were always going to be these large black boxes within your trace where either the trace data was totally lost or it's just, yeah, we were waiting for something here and we don't have observation of what happened.
So that is how we get to OpenTelemetry. The history of OpenTelemetry and distributed tracing are intimately linked, as this is a project to define an open standard for the communication between components so that distributed tracing can work successfully. OpenTelemetry covers the other components of observability too, as we'll get into, but this is kind of where we start.
So a big key idea with OpenTelemetry is this thing called the collector. OpenTelemetry is, in part, just a standard for the communication of metrics, trace, and logging data, to say, hey, here's how that data should be transmitted. And that's supremely useful for distributed tracing, because it means if you work on your little project for instrumenting Laravel, Symfony, or a particular build of Rails, or what have you, you can follow these open standards and be able to get traces that you can tie together. But there's this kind of superpower involved there, because we mentioned that there are these steps to creating traces, and one of the key steps is that we have some way to tie those traces together, right? And that is one of the problems that is solved by the OpenTelemetry Collector. So the collector is where a lot of this magic happens. Let me zoom in a little bit on this chart. You have these OpenTelemetry standards and they can communicate out to a third-party service, as you can see up here, and I'll mention a little later that one of the ways to get started is to try just directly reporting from your service up to a Prometheus endpoint or up to another OpenTelemetry endpoint.
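As a rough sketch of that direct-reporting option (again my own example, not from the talk; the endpoint URL is just the default local OTLP/HTTP address and stands in for whatever backend you use), here is what pointing the Python SDK straight at an OTLP endpoint can look like. Swapping in a collector later is essentially just a change of endpoint, so the application code stays the same.

```python
# Sketch: export spans straight to an OTLP endpoint, with no collector in between.
# The endpoint URL is a placeholder/default; point it at your backend of choice,
# or later at a collector -- the application code does not change.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")

# BatchSpanProcessor queues spans and sends them in batches, which keeps the
# number of network requests down even before a collector does any batching.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("direct-export-demo")
with tracer.start_as_current_span("hello-otlp"):
    pass  # your application work here
```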
But one of the other ways to do it is to be running a service that is an OpenTelemetry Collector, where you have your multiple components that are reporting over into the collector, and then the collector is saying, okay, let me go ahead and write out really nice, clear observability data. And the collector is not just a data exporter or a sort of data middleman. The collector has all of these components that can do things like filtering, batching, attributing, and so on. Attributing, adding attributes, I don't know, that doesn't feel like the right word, it feels like it should be a separate word, but whatever. So these processors are a key part of the story with the OpenTelemetry Collector, for these questions that previously, maybe from a SaaS service, were pretty hard to cover. Like, hey, I have this very particular kind of PII data, like a specific format of health data, and I need to filter that out and make sure it's never sent, even if it got observed accidentally. Instead of waiting on a SaaS company to say, oh well, don't worry, we'll implement a filter for that, with the collector you can just go ahead and grab a processor component and do that filtering. And since a collector can be run within your own cloud, you can say, hey, I want to do this filtering before the data is ever sent along the network.
Along with these three pillars, there is this concept in OpenTelemetry of baggage, where you're able to add a little bit of information that gets passed along. An example might be a client ID. It's kind of the classic one: all of these microservices are maybe seeing this request, but only right at the start did we see what the client ID was, and we say, yeah, that's useful to us. To tie this together, and to add filtering data later, we're going to add this baggage that is the client ID. Now, baggage is not reported automatically. It's not like an attribute on a trace, but it can be useful. You can explicitly say, hey, I want to go ahead and check this baggage here, and if we've got a client ID, I want to write that to this trace. So yeah, the idea of baggage is just sort of something that contains a little something else that comes along with you. It's very nonspecific about what it may contain, but it can be a useful concept as you're getting a little bit more advanced.
And support for OpenTelemetry is a lot better than you think. I say that because I was actually writing one of those write-ups of, hey, here's the state of OpenTelemetry support, and I commented, oh, hey, maybe Ruby is kind of not ready for use. And that was because I was looking at the opentelemetry.io page and just seeing that, you know, a couple of these things are listed as not yet implemented. But shops like Shopify use the Ruby OpenTelemetry project, so it's pretty advanced, actually. Even though metrics, right on this table at the top level, are listed as not implemented, if you actually click in, you see, oh, they're experimental, but a lot of people are using them in production now. So it is great that there is this sort of top-level list of, here's the level of support. And for obvious reasons, traces are kind of the first thing that gets implemented. But I really think it's worth a look, especially because for so many of these languages, it's only logs that are missing. And the fact is you've had a way to report up logs and filter logs for a long time, almost certainly.
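To ground the baggage idea from above, here is a small sketch of my own (assuming the Python SDK, with client_id as a purely illustrative key and a tracer provider configured as in the earlier sketches): baggage is set once near the edge of the system, rides along with context propagation, and a downstream service has to explicitly read it and copy it onto a span attribute if it wants it reported.

```python
# Sketch of OpenTelemetry baggage: set a value near the edge of the system,
# then read it downstream and copy it onto a span so it actually gets reported.
# "client_id" is just an illustrative key; a TracerProvider is assumed to be
# configured elsewhere (as in the earlier sketches).
from opentelemetry import baggage, context, trace

tracer = trace.get_tracer("baggage-demo")

# At the edge (for example, the API gateway): stash the client ID in baggage.
ctx = baggage.set_baggage("client_id", "customer-1234")
token = context.attach(ctx)

# In a downstream service: baggage is not exported automatically, so we
# explicitly read it and record it as an attribute on our own span.
with tracer.start_as_current_span("payment-gateway") as span:
    client_id = baggage.get_baggage("client_id")
    if client_id is not None:
        span.set_attribute("client.id", str(client_id))

context.detach(token)
```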
So logs are not really going to be the missing piece for you. So what are we talking about when we say, hey, how's the language support? This means: what is the state of the OpenTelemetry SDK for this language, including automated instrumentation. So in languages like Java and .NET, you should be able to get a ton of metrics out of this project, with it automatically doing instrumentation for you and automatically writing it to whatever endpoint you want to send it to.
So getting back into that just for a moment: ways to get started. This diagram is from the AWS blog, but one of the things to remember is that you do have this option about whether or not something is going to go to a collector or go to some other data endpoint. And what's so cool about the collector is that it lets you decide how the data is going to be batched and how it's going to be filtered, again removing PII and doing other kinds of clever stuff with your data. But if you want to have things work from day one, if you want to just try things out, having stuff report directly to Prometheus is totally an option that you have. And if you're doing stuff like reporting metrics every few seconds, or reporting individual spans for a trace, yeah, that's going to result in a lot of network requests if you're just reporting directly and you don't have batching and such with the collector, but that's fine for a beta project or a proof of concept. And then obviously, once you do implement a collector, it's very easy to change over. Also, if your data is quite predictable, if you know what you're going to be doing, if you're using handwritten calls to report up data, so maybe you're managing batching pretty well without having to define that on the collector's side, these are all really good reasons to say, hey, I'm not going to implement the OpenTelemetry Collector quite yet.
Okay folks, that's been my time. I want to thank you so much for joining me again. Go check out telemetryhub.com for a really nice, cheap, efficient way to go ahead and report up OpenTelemetry data. So that's an OpenTelemetry endpoint, and the collector and the endpoint can get a little bit mixed up there, right? Your endpoint is where the collector is going to report its data, for users to be able to go and see it. I'm Nocnica Mellifera. You can find me almost every place at serverless mom, and I want to thank you so much for joining me. Okay. Have a great conference.
...

Nocnica Mellifera

Head of Developer Relations @ TelemetryHub
