Conf42 Site Reliability Engineering 2022 - Online

Adding OpenTelemetry to Production Apps: Lessons Learned


Abstract

Observability is increasingly important in our modern apps/cloud-native world. However, when adding observability to existing production apps, there are a number of tradeoffs in approaches and in tools. Often, these tradeoffs are an exercise in confusion, leading to decision paralysis. We took on the challenge of adding observability to NGINX MARA, investigating choices, discovering and addressing challenges while keeping to open source solutions whenever possible. You’ll come away with an understanding of how the three classes of data (Metrics, Traces, Logs) work together, why we chose the solutions we used and how we extended past the normal space into health checks, introspection and core dumps. Come learn from our experience in dealing with OpenTelemetry and related tools, from traces, metrics and logs, in working with production class apps and discover what approach finally worked for us.

Summary

  • OpenTelemetry is near and dear to my heart. We're going to talk about how you can add OpenTelemetry to a modern application, and about some of the things you may run into when working in an OpenTelemetry environment.
  • A modern application is not defined by its implementation, but by its capabilities. These include portability, scalability, observability, reproducibility, and debuggability. It also provides context on errors and crashes.
  • MARA, the Modern Application Reference Architecture, is a microservices architecture. It uses Kubernetes, the leader in orchestration, of course, and is designed to be production ready. The larger part is the stuff that the application is dependent on, hidden below the surface.
  • OpenTelemetry is a standards-based set of agents and cloud integrations offering the capabilities of observability. It has automated code instrumentation and supports many languages and frameworks, pretty much any code at any time. Learn more and migrate to OpenTelemetry today.
  • How do we get the data from where it is? To be really useful, those log files need to be easily searchable. OpenTelemetry, even though it has logs as one of its classes of data, is at an early beta stage for logging.
  • Distributed tracing is where OpenTelemetry started, out of OpenTracing and OpenCensus. We needed to support all the languages of interest; in our case that was two, Java and Python. We also had two frameworks, Spring Boot and Flask. It is really all about the language.
  • The OpenTelemetry Collector aggregates data from various sources; it can take metrics from Micrometer via Prometheus, from statsd, and from other tools. The metrics work was done in Java and not in Python. OpenTelemetry can become the underlying structure for all of these pieces.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Good day, and thanks for joining me today for this talk about OpenTelemetry and some of the things you may run into when working in an OpenTelemetry environment. First, my thanks to Conf42 for letting me talk about this today. OpenTelemetry is near and dear to my heart. Having spent the last four years working in the observability space, OpenTelemetry is definitely taking the world by storm. We're going to talk about how you can add OpenTelemetry to a modern application. But to do that, we need to start by talking about what we mean by a modern application. Every company is on a cloud journey, moving from their old-style, monolithic, tightly coupled environments all the way out to a cloud-native or rearchitected environment. We're seeing this driven by a combination of cloud economics as well as the ability and ease of scaling, management, and support. But a modern application is actually a little bit more than that. It's not defined by its implementation, it's defined by its capabilities, and we think of these when we talk about modern apps in terms of portability, scalability, observability, reproducibility, and debuggability. That last word, by the way, is quite a mouthful to say. Modern applications really have certain feature sets, but these are not implementation details. I don't care whether you write this in COBOL, heaven forbid, or in Fortran; it doesn't matter. What really matters is that it works. You can look at this list and see that certain characteristics define the modern application space. In this observability and OpenTelemetry space, it really comes down to three things. First, quickly find performance bottlenecks: where are things not working the way we expect them to? After all, user happiness is heavily dependent on performance as well as results. Second, provide answers to platform engineering questions: how are the people working on the platform doing, how is the application itself behaving, and what do I need to find out about that? And finally, provide context on errors and crashes. This doesn't mean simply reporting an error; it means finding the context, what's going on around the error, what was leading up to it. These things lead us into the larger category of what's important for modern apps in this observability and OpenTelemetry space. So what I've done, with my engineering team, is take this thing called MARA, the Modern Application Reference Architecture. It's a microservices architecture, it uses Kubernetes, the leader in orchestration, of course, and it is designed to be production ready. As you can see, the application actually only makes up part of this. The part that's hidden below the surface is monitoring and observability, management aspects, infrastructure aspects, and Kubernetes itself. The application is really only one part of the question; the larger amount is the stuff the application is dependent on, hidden below the surface. Here's a quick look at the application itself. It's a banking application, written in Python on the front end, and it uses Python for certain pieces of the back end as well as Java for other pieces. It has two PostgreSQL databases that it makes use of. Each of these pieces is an independent microservice in its own right.
But this does lead us to some interesting requirements. We need a solution that can support more than one language, and we need a solution that understands the context of the to and fro: where things are coming from, where things are going to, and what the results in each piece are. When we started looking at this, we built it to be portable, remember that portability aspect, so it can run on lots of different infrastructures, with tracing, metrics, logging, visualization, and management built in. We also built a testing structure for load generation as well as a continuous integration model. This mimics what goes on in production applications every single day, maybe not as complete or as complex as yours, but this is what a reference architecture is designed to do. In our case, we wanted the reference architecture to be more than just a reference drawing; we wanted it to actually work and to be able to test what that meant. Drawing it out quickly, you can see the various pieces inside of here: Kubernetes driving our node and pod arrangements, and application networking pieces, in this case the NGINX open source project. Then we have other pieces around it for automation: the pipeline, code repos on GitHub or GitLab, and automation code to drive all of this. So there are lots of moving parts across this entire application space. All those moving parts mean that things have become complicated. Microservices mean things have become complicated. Failures don't always repeat, and debugging can be really painful because those failures don't repeat. And the scale of data that we're dealing with is massive. So when we started looking at this, we used the Cynefin framework to ask what we need to do to be able to probe, sense, and respond to the emergent technologies that are being driven by cloud-native or modern application spaces. Observability is a data problem: the more observable the system is, the faster we can understand why it's acting up and fix it. In general, we look at this as having visibility across the entire stack, but based on three classes of data, metrics, traces, and logs, or: do I have a problem, where is the problem, and what's causing the problem? Each of these pieces plays a role in helping us look at that application stack. But when we started looking at the needs for real observability in a production application, we found that logs, metrics, and traces were a great place to start, but we also needed things like error aggregation, runtime state introspection, health checks, and core dumps. All these pieces play a role in actually pulling the data out. Today I'm going to spend most of my time talking about logs, metrics, and traces, but this is an open source effort; you can go through, play with it, and take a look at what the underlying structures are for all of these different pieces and how we built them together. So that was our starting wish list, ranging from logging all the way over to heap dumps, and then we listed some tools that came off the top of our heads that might make sense: Elastic, for one, whether you call it the Elastic APM model, Elastic Cloud, or Elasticsearch.
We also looked at Grafana, Graylog, Jaeger, OpenCensus, OpenTelemetry, Prometheus, statsd, and Zipkin. A few others crept in and out during the discussion, but this is where we started. Then we mapped across this and asked, does this tool offer that capability? What we quickly found is that this chart is very pretty and, honestly, kind of meaningless. You can't easily compare something like Zipkin to Elastic APM; even though they do have some crossover, they are very different structures solving very different problems. So we decided to stop doing this and start looking at qualitative views of the different projects, to see what they could meet, not against checklist items, but against our underlying needs for this functionality. We actually started by looking at OpenCensus. Interestingly enough, at the top of the OpenCensus page it says that OpenCensus and OpenTracing have merged into OpenTelemetry, and, hmm, the OpenTracing project is archived: learn more, migrate to OpenTelemetry today. This gave us the feeling that we should probably be looking at OpenTelemetry, and so that is where we went. So what is OpenTelemetry? It's a standards-based set of agents and cloud integrations offering the capabilities of observability. It has automated code instrumentation, and it supports many frameworks and pretty much any code at any time. Remember that we are using multiple languages and multiple frameworks here. OpenTracing and OpenCensus merged to form OpenTelemetry, so we did have some backward compatibility for older implementations. And as we'll find out, OpenTelemetry has also considered how to work with existing applications, feeding any of the classes of data into a back end where we can do aggregation and analysis. The nice thing, when we really looked at this, is that OpenTelemetry provided the classes of data we wanted and provided the conduit we could use, and we could start trying to solve the specific problems we were looking at: how to reach an observable state in a modern application made up of microservices under Kubernetes control. When we looked at this, we found a couple of interesting things. First of all, tracing, metrics, and logs are all important for this observability space. But we also had to worry about instrumentation: how do we get the data out, how do we see the SDKs and the canonical representations coming across, how standardized is the data structure, and can we use it any place we want to? Because we didn't want to solve a single problem in a way that locked us into a solution we couldn't change to meet our ever-changing needs. OpenTelemetry actually did provide all of those aspects, and this made it a lot easier for us to start looking at the OpenTelemetry space.
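To make that "conduit" idea concrete, here is a minimal sketch, not MARA's actual code, of wiring the OpenTelemetry Python SDK to ship spans to a collector over OTLP. The service name and collector endpoint are assumptions for illustration.

```python
# A minimal sketch (not MARA's code) of exporting spans to an OTel Collector over OTLP.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Identify the service so back ends can group its spans.
resource = Resource.create({"service.name": "example-frontend"})

provider = TracerProvider(resource=resource)
# Batch spans and ship them to a collector (assumed to listen on localhost:4317).
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("checkout"):
    pass  # application work happens here
```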
As we started working with this, some interesting things showed up. First, let me note that this activity was a point-in-time snapshot, based on work we did roughly five or six months ago, so the rules may have changed; we have not gone back and redone the exercise to constantly keep it up to date. We will be changing and looking at future implementations as OpenTelemetry develops, both the standardization of the specs for the data classes and its evolving ability to transmit that data. But let's start with logs. Everybody kind of knows what a log is: something the system decides to write off someplace to let us know what it thinks is going on. But that very simplicity leads to some really complex decisions. The simple side: grab the log output. Sounds simple, but what the log output looks like, what format it is in, what's included in there, and how I can scrape the data to make sense of it becomes incredibly important, especially when we reach production-level scale. The complicated side starts with: how do we get the data from where it is? Remember, in a Kubernetes space, we don't necessarily know where it is. It could be in lots of places, it could be ephemeral, it could be elastic, shrinking and expanding. Where do we store this data? We could have a lot of data coming in. Do we need to index it so we can actually search it easily for things we expect to happen? And how long do we retain it? Just those four questions alone become really challenging when we start looking at logs in production-class applications. To be really useful, though, those log files need to be easily searchable, and on varying criteria. We shouldn't be limited to just the indexing structures; we should be able to search easily and efficiently on anything we may want to look at. Developers, SREs, platform operations: we all need to be able to find the data we need, when we need it, without a huge amount of overhead. So we started looking at the various players and their indexing. We also, honestly, like to take advantage of things that people have built that are efficient and work correctly, and we tend to favor open source. Those things led us to our first choice, and while we did look at a number of options, we started with the Elastic Stack. Filebeat, run as a Kubernetes DaemonSet, became our data transport. It's easy to put into Kubernetes and it puts the right pieces where we need them; from Beats we transmitted into Elasticsearch and visualized in Kibana. We used the Bitnami chart to split that deployment, so we had an ingest engine, a coordinating engine, and master and data nodes. The nice thing about Kibana was that it let us do the search, along with a bunch of preloaded decisions that were made for us. The nice thing: it works. But it turned out to be extremely resource hungry, and the way we had to deal with queries varied. If you were doing something within that preloaded indexing strategy, it was not bad. If you were trying to do something a little bit outside their index limitations, life became a little bit more challenging. So, as you can see, this is one of the areas where we can tell you it works, but it needs improvement. One of the driving pieces behind this is that OpenTelemetry, even though it has logs as one of its classes of data, is at an early beta stage there, and we didn't quite feel comfortable at this point in time depending on that emerging functionality.
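Returning to the Filebeat-to-Elasticsearch pipeline described above, here is a minimal, standard-library-only sketch of emitting JSON-structured log lines that a shipper such as Filebeat can collect and Elasticsearch can index; the field names are illustrative, not MARA's schema.

```python
# A minimal sketch of JSON-structured logging to stdout for log shippers to pick up.
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line (illustrative fields only)."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("payments").info("transfer completed")
```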
Very honestly, we wouldn't recommend putting OpenTelemetry logging into production yet, simply because it's still in a state of change, a state of flux. It's a great specification, probably going to be finalized very shortly, but then we have implementation details, SDK details, and the languages themselves; all those pieces have to come into consideration. Now, distributed tracing. Distributed tracing is where OpenTelemetry started, out of OpenTracing and OpenCensus. It is complex, it can be semi-chaotic, and we had some very definitive requirements. First, it must not impact the quality of service for the application. We've all seen it where log files get written off in a batch format (every hour I take my accumulated log files and write them off to disk) and watched as we get a sawtooth effect on performance, because the system is now busy trying to dump its buffers. We also needed to support all the languages of interest; in our case that was two, Java and Python. We also had two frameworks, Spring Boot and Flask, so we wanted to make sure we covered all of those. And what we found was something really nice. OpenTelemetry has this collector, this agent if you will, that could pull the data from any of the independent services, but it could also in turn act as an aggregator. So we could feed the OpenTelemetry Collector into an aggregation model, which we could then hand off, in this case, to Jaeger. We had the ability to look at all the different things we needed to look at. It also gave us some unique capabilities: it was very simple to set up, we could roll data through very quickly, look at what it produced while we were in the development process, and test what was going on. Here's the output of a trace. Starting from NGINX, we can see what the front-end access looks like and what the various spans look like, and we can also look at what was happening once we moved through the request. Each of these pieces became very easy to look at. This is a Jaeger chart, if you're not familiar with it, but this acyclic graph model is something you'll find very common in distributed tracing. Honestly, though, when we get into this space, it is really all about the language, and remember we have two here. We started with Python, and Python was actually pretty straightforward. We added a couple of files and updated our requirements file. We made use of Bunyan-style JSON logs so that our log formats were tied in as well. And as you'll see, we also took the logging pieces and attached the tracing IDs to the logs where it was useful, which gave us the ability to do easy correlation across them and to backtrack the context from a problem.
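Here is a hedged sketch of that trace-IDs-into-logs idea for the Python services, assuming a tracer provider has already been configured (see the earlier sketch); the log format and logger names are illustrative only.

```python
# A minimal sketch of attaching the current trace/span IDs to every log record.
import logging
from opentelemetry import trace


class TraceContextFilter(logging.Filter):
    """Copy the active span's IDs onto each log record before it is formatted."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        # Zero IDs mean there is no active/recorded span.
        record.trace_id = format(ctx.trace_id, "032x") if ctx.trace_id else "-"
        record.span_id = format(ctx.span_id, "016x") if ctx.span_id else "-"
        return True


handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace=%(trace_id)s span=%(span_id)s %(message)s"))
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("ledger").info("posting transaction")
```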
Java, on the other hand, was a little bit more challenging. First, let me point out that simple, greenfield Java wasn't too bad: pull the libraries in, use the APIs, and you're pretty much done. But we were using an existing application, and that existing application brought some challenges to life. With the Spring framework, it looked pretty easy. We could use Spring Cloud Sleuth: it adds the trace and span IDs, instruments common ingress and egress points, adds traces to scheduled tasks, and can directly generate Zipkin traces (Zipkin being another open source tracer, like Jaeger). But at that point in time, the autoconfiguration was a milestone release, and it supported some really old, out-of-date OpenTelemetry versions. We also had to pull from the Spring snapshot repository due to hard-coded dependency references. This made things a little problematic: we were looking at an old product set and losing functionality. Remember, OpenTelemetry is under constant development, and we were not able to control the environments and the code repositories quite as easily as we wanted. When I checked again right before this talk, they still appear to be listing older versions of OpenTelemetry tracing and instrumentation, at 1.11 and 1.12 I believe, whereas I think the current versions are about 1.14 or 1.15. So you need to be careful: how we could make use of existing structures, and what they're tied to, became a dependency we needed to check constantly, and we needed to work around some of these limitations. What we decided to do was build a common OpenTelemetry telemetry layer for passing data across these services, and this gave us the ability to extend our tracing functionality. We could build autoconfiguration classes and add additional trace attributes. There were certain pieces of information we really wanted carried forward, as well as certain things we wanted constantly up to date and shared. We also wanted to be able to clearly see what the impact was going to be, so we built a no-op implementation alongside production tracing that we can simply flip on and off, letting us see what tracing versus no tracing looks like. We decided to standardize our trace names, since they cross different languages and different structures, so we could look and see how we did this. We added an error handler so that we could output errors to both logs and traces, which gives us the ability to coordinate back across those pieces. And then we added some additional trace attributes: a service name, the instance ID, the machine ID, and a few others. We also made sure, since we live in a database world, that we put the trace ID into the comments preceding our SQL statements. It's really important to know where your SQL statements are coming from, what request generated them, all the way back to what the user started. All of those things were built into an open source OpenTelemetry module, which you can find on our NGINX, Inc. GitHub site.
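The trace-ID-in-SQL-comments idea just described can be sketched in Python roughly as follows; the helper name and the query are hypothetical, not the project's actual code.

```python
# A hypothetical sketch of prefixing SQL statements with the current trace ID,
# so a slow or failing query can be traced back to the request that issued it.
from opentelemetry import trace


def with_trace_comment(sql: str) -> str:
    ctx = trace.get_current_span().get_span_context()
    if not ctx.trace_id:  # no active trace; leave the statement untouched
        return sql
    return f"/* trace_id={format(ctx.trace_id, '032x')} */ {sql}"


# Example usage with any DB-API cursor:
# cursor.execute(with_trace_comment("SELECT balance FROM accounts WHERE id = %s"), (acct,))
```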
The last part was the metrics piece. Honestly, we kind of skipped Python. It's not difficult to put metrics into Python, but the type and class of metrics we were getting from those front-end pieces was not really meaningful for what we were trying to accomplish. So we broke it down and asked, okay, what are we doing in Java? The original code, Bank of Anthos, used something called Micrometer, connected to GCP Stackdriver. When we looked at that, we found that Micrometer is actually a very mature layer for Java virtual machines and the default metrics API in Spring, so Micrometer became near and dear to our hearts very quickly. We also looked at the current OpenTelemetry work for metrics. At that point in time, and I will admit still today, there are discussions going on around what limits there are to OTel metrics, in particular what kinds of metrics are not covered and what kinds are missing from the picture. The OpenTelemetry metrics specification is stable, the language implementations vary, but is it complete enough for what you want? Well, the nice thing was that OpenTelemetry and Micrometer together were not a blocking factor. If you remember back to that OpenTelemetry Collector: the collector can take metrics from anything. It can pull them from Prometheus or statsd, it can receive them over the OpenTelemetry protocol (OTLP), it can take pretty much anything. That's the strength of the collector: the ability to receive and export the various data and process it in the middle. Micrometer supported lots of options; it already supported Prometheus and statsd, so this was obviously a natural fit. We could have Micrometer produce Prometheus metrics and then use the collector to coordinate and aggregate our data together. The OpenTelemetry metrics SDK allowed us to then send those metrics via OTLP, so we could use the Micrometer model but use OTLP as the collection path of choice. From there we could send them anywhere: Lightstep, Grafana, anywhere we want, again because we have that receiver concept and that exporter concept. The one thing to note is that this is not a streaming model; it is a pull model for the collector, and the collector would pull the various metrics to pass them forward.
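For comparison with the Micrometer-to-Prometheus path just described, here is a hedged sketch of what the OTLP metrics path looks like with the OpenTelemetry Python metrics SDK; the collector endpoint and instrument names are assumptions, not MARA's configuration.

```python
# A minimal sketch of pushing metrics to an (assumed) OTel Collector over OTLP.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Periodically push accumulated metrics to the collector endpoint.
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4317", insecure=True)
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("example-backend")
requests_counter = meter.create_counter(
    "http.requests", description="Count of handled HTTP requests"
)
requests_counter.add(1, {"route": "/transactions"})
```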
So, without too much more, a quick summary of what's going on here, and this is the more complete list. When we did this, OpenTelemetry was clearly the right choice for production-grade observability. For distributed tracing we had a number of interesting pieces: Java used Spring Cloud Sleuth with the OTel exporter, leading to the OTel Collector, which then led to a pluggable store; Python used the Python libraries and the OpenTelemetry Collector to the pluggable store; and NGINX was the driving heart of this application space. The ingress controller is not currently OTel-enabled, not currently traceable, but we had an OTel module that lets us pull data from NGINX itself into the collector. The metrics piece was one we thought was going to be easier, but it actually turned out to be easy with some caveats. For this we used Micrometer via Spring to a Prometheus exporter to the collector. Likewise, with Python we used Gunicorn with statsd to Prometheus via the service monitor, out to the OTel Collector; Python application metrics we did not implement, and in NGINX we have a Prometheus endpoint that we also passed to Prometheus. And then logs: we collected all the container log files through Filebeat, which went into Elasticsearch and was exposed via Kibana. Is this perfect? No, we would like to have a more centralized source. We would like to be able to pull all these things together and have them easily correlated without necessarily having to use lots of different tools with lots of different query structures. But this was a demonstration that OTel can become the underlying structure for any of these pieces. Our summary continues with things I didn't cover in depth here. For error aggregation, we use OTel distributed traces. For health checks, we have the Spring Boot Actuator, through Kubernetes, and Python Flask management endpoints fed into Kubernetes. For introspection, again, Spring and Python give us the ability to introspect. And for heap and core dumps, the Java Spring Boot Actuator lets us take thread dumps; right now we do not have support for this in Python. So the TL;DR: three classes of data, metrics, traces, and logs. All of these took different approaches, but we managed to get what we wanted out of all of them, and the OpenTelemetry Collector is the thing that made this actually work. The OpenTelemetry Collector is our friend, and it made it possible to pull off what six months ago looked like a very challenging project. By the way, it was a very challenging project; the people who were helping code this used to scream bloody murder at various times. It's probably easier now, and we will be revisiting this in the next few months just to see what's happening. Metrics and traces had some interesting gotchas based on the languages we were looking at. Also, metrics and traces, in conjunction with logs, needed to be correlated, so we needed to make sure those trace IDs showed up in our log files so that we could track them. As I said before, this was a snapshot in time; this was the state of the industry about six months ago, and things have changed. We will flat out admit that things have changed, and we will be updating this to see how the changes impact our decisions. We're also going to be looking at other solutions. Should we go to Loki for log files, or should we look at Graylog? Or can we start making use of the nascent OTel logging support? And by the way, there are auto-instrumentation agents, and they are amazingly cool: drop one into your Java application and you get all these great traces out. But they don't necessarily give you what you want; when they do, it's great. You can also take this for yourself: we built this as an open source project. Take a look at our repositories, play with the application itself, and give us some feedback on what you're interested in hearing about or where you'd like to see us go next. Feel free to join our community, look for MARA, the Modern Application Reference Architecture, and give us feedback. Tell us what you're thinking about this and tell us what your world looks like. With that, let me thank you for listening to me today, and I hope you enjoy the rest of the conference. Thanks, and have a good day.

Dave McAllister

Senior OSS Technical Evangelist @ NGINX



