Conf42 Site Reliability Engineering 2021 - Online

Investigating Performance Issues In Microservices Arch. With Distributed Tracing

Video size:


Running a multi-tenant SaaS at scale is no easy task. With the massive scale-out we’ve started encountering performance issues. Investigating those issues turned out tricky with our microservices architecture running on Kubernetes and Docker containers across multiple regions and multiple cloud providers. We run an Observability SaaS platform, both using it internally and offering it to others. Our system is instrumented inside and out around logging and metrics, but that proved not to be the right tool for the job. We needed another weapon for our performance blitz. This is how we got to the world of distributed tracing, first as practitioners and then also started offering it as part of our Observability platform.

In this talk I’ll share our journey to distributed tracing and Jaeger open source project, how it helped us overcome our performance issues in our application across the stack from the Node.js down to the Java and database backends, and how it has become an integral part of our daily routine. If you’re battling performance issues, if you’re considering making your first steps into distributed tracing - this talk is for you. I’ll show useful examples, best practices and tips to make your life easier in battling performance issues and gaining better observability into your system, as well as how to make this a gradual and smooth journey, even into high-scale production systems.


  • Dotan Horovits is a developer advocate at Logs IO. He talks about the journey that led them to start using distributed tracing in handling performance issues in their system. He gives practical tips on how you can start with distributed tracing and use it in your performance investigations.
  • Recent surveys show that 33% of those using tracing use Jaeger. Going with an open standard is a recommendation. Open telemetry is the future path for instrumented. Check out the September 2021 episode for the latest on the Jaeger project.
  • Use open telemetry and onto instrumentation. Don't try to instrument every part of your system from day one. Prioritize the critical services where you need observability the most. Use auto instrumentation to ease the coding.
  • Use minimum duration filtering to hone in on problematic traces. Next, combine and correlate traces with logs. The last advice around investigation is to look for common patterns in your timeline.
  • So this is really the last part of this talk. I would like to look into a few common patterns to look for when inspecting your traces in the timeline view. If you're looking into how to go about it, then Yeager is a great open source option for the tracing analytics backend and for instrumented.


This transcript was autogenerated. To make changes, submit a PR.
Are youre an SRE, a developer, a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with chaos native. Create your free account at Chaos native Litmus Cloud hello everyone, glad to be here at SRE 2021. And today I'd like to share with you the journey that led us to start using distributed tracing in handling performance issues in our system. And I'd like to give you some practical tips on how you can start with distributed tracing and how to effectively use it in your performance investigations. Word about myself my name is Dotan Horovits. I'm a developer advocate at Logs IO. I've been around as a developer solutions architect product guy. I'm an advocate of open youre and communities in general and the CNCF, the Cloud Native Computing foundation in particular. I organize a local CNCF chapter in Tel Aviv, so if youre around, do come and visit one of our monthly meetups. I run the open observability Talks podcast, check it out and in general you can find me everywhere at Horowitz. So if you're tweeting something interesting out of this talk, feel free to tag me and about us we pride ourselves in being the masters of observability logs, metrics traces all around. But we weren't always observability masters. So let's rewind a couple of years back before we had the tracing. Back then we were the best. We were the masters of logs and metrics. Our system is instrumented inside and out around logging and metrics. We're talking about the system in microservices arch multi cloud, on AWS and on youre and multi region multi tenant running at high scale. And we felt we had it all figured out until the performance alert. So it was a UI performance issues coming from the metrics. Obviously. Looking into the metrics we started seeing the application started loading a little slower with many requests taking longer to respond. And now, as anyone here in the audience, SRE, DevOps, dev, whoever know youre need to figure out where it comes from. And our service handles thousands of requests per minute with a serious amount of internal server communications. So this is the conditions that we needed to investigate. And the first attempt, as always, is to turn to the logs for answers. But this time it wasn't that easy. First of all, we had tons of identical data, but worse, it seemed to be like a random decreasing performance issues in different flows all around our application. It was really hard to point to one thing in particular, and even harder to connect the problem to a specific service. What made it more difficult to investigate through the logs is that the request sequences were pretty long. It was really hard to get the details from the logs and figuring out what called which and how the flow goes. Bottom line, youre logs weren't enough for this investigation. So the second attempt was to enhance the logs. The idea was to add a request id to each and every one of the logs, so that we can look at all the logs and visualize the logs of all of the same request in Kibana. So if before, we couldn't really understand which logs belong to this specific problematic request, and some requests are good, some requests behaved oddly. This is a way to bucket it, and it sounds pretty simple, right? Add the request id, then visualize based on request id in cabal. So it wasn't that simple. First of all, it was very difficult to handle the propagation of the request id. We needed to go into each and every piece of code service microservices that we suspected has to do with this request, and add an incoming parameter to propagate, to bring in the request id, then obviously put the request id as part of the logs, but also then send it downstream, in any downstream call from that service to any others service or any other operations of the same service. So we needed to handle all this request id propagation. A big headache. So that's in the coding. And then on the visualization side, it was pretty painful to visualize traces using kibana. Essentially what we got is a bucket of all the logs that share the request id, which is nice, but you can't really understand the relationship and which invoked which and the relationship between them. And also it was pretty limited in context, so we had the context of all of them belonging to request id to the same request id, but not much more than that. What you really needed to see was the communication between the microservices. So we needed better visualization options to tricky a request and analyze it throughout the request flow. On the coding side, we also needed easier instrumentation because it was a real, real pain for us, and there was no chance that we'd be able to convince other teams in charge of other microservices to go along this path and start adding these parameters and handling this request id propagation. So definitely easier instrumentation is another challenge. And then we reach the third attempt. In fact, strike those is a charm. We reach the realization that we need proper distributed tracing. Distributed tracing is in fact those tool for pinpointing where failures occur and what causes poor performance. And then you can ask questions, where did the error occur? Where did the performance degradation occur? Where's the critical path? Those sorts of questions are exactly what distributed tracing is meant to answer and how does it work? For those in the audience that are not fluent in distributed tracing, the way it works is that each call in those call chain creates and emits a span for that specific service and operation. You can think of it as a structured log for that matter. And this span includes a context, the trace id, the start time and the duration of that specific span, the parent span that called it and so on. And this context propagates between the calls in the trace through those system. And that's thanks to client libraries, sdks, traces has they're known that take care of that. So if you look at on the screen on the left hand side in this tree graph, ABCDe are all spans or service and operation. And then we can see a is the root span, the one that got those client request, and then a calls b, then b calls c and then d, and then getting back to a, a calls e. So that's in the context propagates from a to b to c and so on. So this is the context propagation. And each such span, ABC and so on emits at the end, when it ends, it then emits the span outbound, and then there's the back end that collects all these spans that my application emits and reconstructs the trace according to causality and then visualizes it. And the typical visualization is the famous timeline view or gun chart. You can see an example here on the right hand side of the screen, the same ABCDe on a timeline you can see. So a, the root span obviously takes the those time from beginning to end because this is the full request. But then a called b, B called C and then called D, and then when D is finished, when B finished, then battling e and then going back to a and finishing the entire request. So it's very easy and very clear to see that from a timeline view. And this is why it's a preferred view for investigating performance issues in the system. So we realize that we need distributed tracing, which is great. Now the question is which distributed tracing tool are we going to use? And we are open source advocates. We use the elk stack for logging and Prometheus for metrics and so on. And obviously we turned to open source for tracing as well. We looked into Yeager and Zipkin, two open source candidates. In fact we started using both of them, but ultimately we chose Yeager. We still have Zipkin somewhere in our system. But the main path is Yeager, which needed to be the leading choice. And also today, recent surveys still show that Jaeger is by far the leading choice. Here youre can see on the screen from DevOps path 2020. It's a yearly survey we run around the DevOps and you can see that 33% or more from those using tracing use Jaeger. So Yeager is those choice? And what's Yeager? Just in a nutshell, Yeager was built by Uber and then they open sourced it, sorry, and contributed to the CNCF in 2017. And then it reached production of the CNCF by 2019. So it's a pretty mature project used in production by quite a few companies and organizations. I won't go into lots of details about Yeager here, but if you are interested, I had the privilege of hosting Yurish Kuro, the creator of Jaeger, on this month's episode of Open Observability Talk. So it was a fascinating talk. So do check it out, the September 2021 episode for the latest on the Jaeger project on distributed tracing in general, including some very advanced use cases for tracing that people usually don't consider. So Jaeger tracing is the choice. And now what we need to do is before we can start investigating with tracing, we need our application to actually generate trace data. So let's instrument our application. And we realized that we can't instrument everything from day one, and especially since we have a burning issue to handle the performance investigation. So we took baby steps and started with two front end services, node JS services. We used Jaeger's tracer for Node JS, if you remember again, Yeager has a tracer SDK per language. So we took the one for node JS, we used open tracing API for the code, and we instrumented all our node JS HTTP layer with tracing as well has the node JS frameworks that we use. Happy and exprs a word about open tracing API. It's another open source project under the CNCF that is essentially an open and vendor neutral API. It's not part of Yeager, it's another project. And the advantage is that since it's standard API, it means that if later down the road we decide to switch the backends from Jaeger to another type of instrumented, sorry, another type of investigation and analytics backends, it should be pretty simple to replace just replacing the tracer library or the auto instrumentation agent, and that's about it. So we shouldn't be having too much coding changes to do. So going with an open standard is a recommendation. It is important to note that today or these days, open tracing is actually deprecated. Open tracing was merged with open sensors to create open telemetry. So open telemetry is the future path for instrumented. If you start today, I'll talk about it later. But back when we were there, the mature API was open tracing and we went with that. And once we have the application emitting spans generating trace data, we can go ahead and start investigating our traces. And we opened the Yeager UI and at first glance it immediately gave away two major issues that have long been in our code, but we didn't see them until now. It was astonishing to actually see them, and I want to go over that with you. The first thing we saw was what youre can see here on the screen. This is the Jaeger Ui's timeline view and you can see here this sort of staircase type of pattern. This is an obvious pattern that indicates a serial call sequence. Call a then when it ends call b, then it ends, call C happening sequentially. It but in our case, in our system, there was no real need for it to be sequential. We could easily turn it into running concurrently and reduce the overall latency. So we went ahead and made this change. It was very very easy to do. And the result has a significant application performance improvement. Right, but the bat so really amazing. But this, despite the improvement, wasn't the root cause of the, if you remember the spike that we talked about before. So we needed to carry on with our investigation to look for the root cause of what triggered the original performance issues in the UI. So we went ahead and filtered the results according to the minimum execution duration, which brought up another performance issue. We saw one request that fetched user data, but its duration varied significantly. Some cases it was has short as 100 milliseconds and sometimes it was almost 10 seconds long. So two orders of magnitude. We couldn't really figure out why it changes so much between invocations, but one thing for sure is that this request executes from the browser and had to finish before we could render the screen to the user. So that led to loading time that took even several seconds. Definitely something unacceptable for Ux, for user experience. So this is something that we needed to investigate as part of the root cause analysis. And looking at this request, we saw that this request triggers concurrent requests to fetch all the relevant user data from various endpoints, where each of those requests goes through several other microservices. So it was pretty elaborate. And some of these downstream microservices were not even instrumented yet, reaching the back end code and Java and the database and so on. And what we found was one key that is read for some user requests which was mistakenly not cached. And this triggered a call to the database to retrieve it, causing the latency only on the calls requiring this particular key. This is why the behavior was so erratic and inconsistent and tricky to put the finger on without tracing. So that was a real aha moment. One side benefit that we got, but of it that we didn't even plan on, is that we realized that Yeager actually provides us with live system diagram. We don't need to maintain a sequence diagram anymore and update it every couple of weeks. Jaeger Auto detects the calls between the microservices and generates a live system diagram. It's always up to date, so that's a definite side benefit. That convinced a lot of our engineers and the architects and anyone else in the company to adopt tracing finally to understand how our microservice architecture works and the interconnect and everything around it. It was love at first sight. Since then, tracing has become a standard practice in those company, not just on the front end and node js, but also on back end and Java code and database and frameworks that we use. We realize that it's crucial tracing. This is where the tracing is crucial for observability. And considering ourselves or wanting to be the masters of observability, we started offering it to our own users. So I'd like to move from our own experience to some useful tips on how youre can get started with distributed tracing, both on how to instrument and how to investigate and find common patterns to watch out for. Let's start with the instrumentation. Don't try to instrument every part of your system from day one. If you have even a modest microservices arch, it will be too complex. Prioritize the critical services where you need observability the most and grow the coverage over time. It is very common to start with request endpoints in the front end and then work the way to the back end according to the sequence. But it really depends on your architecture and what's critical in your system. So prioritize your instrumentation. And the next tip on instrumentation is use open telemetry and onto instrumentation. As I said before, open tracing that we used back then was merged into open telemetry. So now open telemetry, or Otel as it's severely called, is the standard. And open telemetry is an observability framework. It providers, libraries, agents, collectors, other components that you need to capture telemetry from your services. And it's not just for traces, it's also for metrics and logs. So all the different telemetry signals are under open telemetry framework. So it generates, it collects, and then you can send it to any backend of your choice you want to monitor with Prometheus, you can send it to Prometheus. You want to use Jaeger or Zipkin or other, it's fine. So it's really agnostic to what backend analytics tool you use. My company, for example, logs IO offers a service that can get open telemetry data from the collector and use it. It really doesn't matter. The most important thing is that you have a single API and a single SDK per language for traces, logs and metrics, and also a unified protocol OTLP that ultimately will become the de facto standard also for transmitting those telemetry. So use open telemetry. And the second tip is on this is use auto instrumentation to ease the coding. You want to do as little coding as possible to get the instrumentation done and you have instrumentation existing for many languages. Java Ruby in fact, I just wrote a blog post on auto instrumented with Ruby. With open telemetry. You're more than welcome to check it out in node js. We also use the auto instrumentation that plugs into the frameworks that we work with like happy and Express. But you have for other languages like for Java, spring and log four j and others. So it's very common and it reduces significantly the amount of coding you need to do in order to generate the span data. Sorry, if you're new to open telemetry or instrumentation, I wrote a quick guide on open telemetry. Just Google Open Telemetry guide or click the URL here on the screen and I think it'd be a very good entry point to understanding how Otel works and getting the links to all the necessary libraries, instrumentation and sdks that you need for your system. So use open telemetry, use open instrumentation. Youre covered on the instrumented side. Now let's move on to tips for investigating performance issues. And the first, seemingly very basic but many people don't use it well enough, is to use the minimum duration filtering to hone in on problematic traces that you need to look into. Youre have a simple filter, you just ask. Please show me only the traces that take longer than 300 milliseconds to execute and you see only the problematic ones. You can take it further and implement smart tagging on your spans and then you can also filter by tags or some auto instrumentation agents also add and reach the spans with useful tags that you can filter by. So working with the filters will help you flush out the problematic span traces very easily. Next, combine and correlate traces with logs. Logs together with traces are extremely powerful and logs contain a lot of more context that can augment what you see from those traces. And the best practice is to add the trace id to the logs. Very similar to what we did with the request id. But you don't need to take care of generating and propagating it, you just need to take it and use it in your logs because you have those tracer the instrumentation library that taking care of it the trace id. So just embed the trace id as part of your logs and make it very easy to correlate your logs the traces with the logs and the last advice around investigation is to look for common patterns sorry in your timeline. So this is really the last part of this talk. I would like to look into a few common patterns to look for when inspecting your traces in the timeline view, whether in yeager or any other distributed tracing backend tool. The first one we saw that also from our own case is the staircase pattern which often indicates a sequence of serial calls. When you see this pattern, you are more than likely want to check if it can be paralyzed, something that can run in parallel. Or maybe you can use some bulk queries or joins on a database or something to reduce this sequence invocation and reduce the overall latency. It could be an issue in those implementation in the code. It could be just a matter of misconfiguration like a thread pool configuration. So staircase pattern is very common and easy to resolve, but definitely can take a lot of those latency. Second one is span spike pattern and this is the pattern of long spans, especially ones that vary significantly between executions like the one we saw on our case. And this may indicate either slow queries or caching issues as weve in our case or similar issues. It also helps to focus where you want to put your investigation and efforts into. So when you go about investigating, you want to inspect and optimize the longest span on the critical path and reducing this is likely to produce the largest benefit on the overall request latency. So this pattern also helped focusing the efforts on where you'll see the most benefits in optimizing latency. The next pattern is gap between spans and such a gap can indicate, like you can see here on those screen, slow interservice communication that contributes to the latency. Or it can also be a service missing instrumentation. For example, it could be that you send a query to the database, but the database driver is not instrumented. So in such a case you may want to go ahead and instrument the database driver to get more instrumentation. By the way, it's very interesting actually just this month, a couple of weeks ago, only Google contributed SQL commenter open youre to open telemetry. So now within open telemetry there's now an open youre piece that can automatically generate application driven metadata to enrich the queries so that you can correlate database queries to application logic much easier. So check it out. The next pattern is spans finishing at the same time like you can see here on the screen. And this may indicate connectivity issue that is causing timeouts or errors or some other artificial constraint like locking issues. Several requests waiting on a specific lock, and once the lock is released, then all of them finish quickly. So in this case we may want to tune the timeout parameter or investigate why things are taking longer than the timeout that you expected them to take or things like that. Another important note here about instrumentation is that when you see such cases, it may also call for enhancing those instrumentation around the shared resources. So if it is indeed a lock or, I don't know, call to a database or something like that, you may want to increase the instrumentation there to get better observability into what's happening there and also what prevents those execution. So that was the last pattern. And just to summarize what weve seen, distributed tracing for us was the missing piece in our puzzle for battling performance issues and for observability in general. And if you operate microservices and cloud native system, you are more than likely can benefit from it too. If you're looking into how to go about it, then Yeager is a great open source option for the tracing analytics backend and for instrumented. Highly recommended to go with those open telemetry. That's those future proofed path standard way for instrumentation, by the way, not just for tracing as I said, also for logs and metrics. So definitely look into that and explore as much as possible auto instrumentation to make your instrumentation as smooth and easy as possible. So try and enjoy. If you have any questions or comments, do feel free to reach out to me here on the chat Q and A. Whichever, you can, just reach out to me on Twitter Dotan horovits H-O-R-O-V-I-T-S whichever way I'd be youre than happy to answer questions, hear comments, and get your feedback and your experience about this. I'm Dotan Horovits and thank you very much for listening.

Dotan Horovits

Product Evangelist @

Dotan Horovits's LinkedIn account Dotan Horovits's twitter account

Awesome tech events for

Priority access to all content

Video hallway track

Community chat

Exclusive promotions and giveaways