Conf42 Observability 2025 - Online

- premiere 5PM GMT

Turning Synthetic Traces into Gold: Scalable Monitoring for Critical User Journeys


Abstract

Imagine instantly seeing performance risks across critical user journeys without sifting through thousands of traces. In this talk, we’ll explore how scalable trace aggregation groups complex flows into clear, actionable insights - enabling faster detection, better prioritization, and quicker issue resolution.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. The topic of this session is turning synthetic traces into gold by scaling monitoring for critical user journeys. I am Sudeep Kumar. I'm a Principal Engineer at Salesforce, and I belong to the Monitoring Cloud organization. Our organization provides a single-stop solution catering to all the observability needs of different Salesforce applications, so we have multiple stacks for metrics, events, logs, and traces. To give a little context about the scale at which we operate: we are responsible for providing observability for over a million hosts and containers. More than 2,000 teams and 13,000 developers are actively engaged on our platform. To give an idea of the telemetry volume we handle: around 140 billion time series every hour, more than 300 billion spans a day, and more than one petabyte of log volume every day.

Now, critical user journeys: what are they, really? I like to define them as hunting down hero flows. Basically, they are a way to identify an end user's journey and emulate it. Often these end-user journeys are high-value request flows; they are customer-facing and business-critical experiences, so potentially revenue impacting as well. Typically, such external user action flows traverse multiple services in your infrastructure. This is especially true in today's world of microservices, where you have a plethora of microservices, each enabling a specific function or feature, and therefore it becomes so much harder to pinpoint where an issue lies when an incident occurs. It also emphasizes the need for robust monitoring to ensure the availability and performance of such key critical transaction flows, and this is where distributed tracing can come in and really shine.

For folks who are not aware of it, distributed tracing is a way to track individual request flows as they travel through different services. Typically, each of these requests is tracked by a unique GUID, or trace ID. We will delve a little deeper into the semantics of distributed tracing to understand this topic better.

Now, how can distributed tracing help with CUJs? It can help you answer key questions, for example: which services are involved in my critical path, and what is the health of these critical CUJ transactions? These user journey flows often evolve because new features get added, so it becomes important to keep track of them, and tracing can help identify unwanted or unsafe access patterns as well. And since you are tracking individual requests, it can help you identify exactly where a performance bottleneck is, and by doing so it can reduce the time to detect and remediate issues. I cannot emphasize this enough: instrumenting applications carefully on your CUJ path becomes so much more important, because those spans give you golden signals on where exactly an issue might lie. Often this means you go in and manually instrument one span at a time, adding value to the traces themselves on that particular request flow.
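To make the "manually instrument one span at a time" idea concrete, here is a minimal sketch using the OpenTelemetry Python SDK. The service name, span name, and attributes are illustrative assumptions, not Salesforce's actual instrumentation.

```python
# Minimal sketch: hand-placing one span on a CUJ path with the
# OpenTelemetry Python SDK (requires opentelemetry-api and opentelemetry-sdk).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("create-account-cuj")  # hypothetical instrumentation name

def create_account(request):
    # One manually instrumented span on the critical path, enriched with
    # attributes that make later filtering and aggregation possible.
    with tracer.start_as_current_span("create_account") as span:
        span.set_attribute("cuj.name", "create-account")
        span.set_attribute("tenant.id", request["tenant_id"])
        # Downstream calls (auth, storage, tenant DB) would happen here,
        # each ideally creating its own child span.
        return {"status": "ok"}

create_account({"tenant_id": "t-001"})
```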
A little bit on our tracing platform. Our platform provides distributed tracing for all Salesforce services. We are a centralized collection point for traces, and we understand that these user journeys can originate from different sources: different regions, different application stacks, and so forth. So we have the ability to collect spans from different APM agents, from managed frameworks, and from infrastructure layers such as the service mesh, which is especially true for our Kubernetes workloads. We also have tracing enabled on our integration tests. A little bit on the scale: on a per-minute basis we handle around 300 million spans, and around 10 million unique traces get reported every minute.

Let's briefly try to understand the semantics of distributed tracing, because that will help us understand this topic better. Consider an individual request flow that goes across five different applications, S1 to S5. Each of these services emits a certain record, or event, which is often referred to as a span. A span record has a few characteristics. Take S1, for example, which emits three spans: each span has a start time, a duration, and the operation within the service that it represents, and it can also carry various other metadata. Each of these services independently sends its span records. All of these span records, remember, are tied to a single GUID, or trace ID; they are all collected through some kind of collection framework and end up in the tracing backend. Once they arrive at the tracing backend, you can show a nice waterfall representation, a hierarchical view of the request flow: you first place the root span and then display the child spans under it. The placement of a child span is based on its start time, and the width of a span is generally indicative of its duration. With this, you can track each and every request happening on your infrastructure.

Now, how does context propagation happen? Each of the spans that the services emit needs to be tied to the same trace ID, and that happens through context propagation via transport headers. Consider this example where application one is talking to application two, say over an HTTP POST call; as part of the POST call it will send some headers. At Salesforce today we use B3 headers for legacy reasons, but you can use OpenTelemetry headers as well. As the trace context is propagated to the next hop, in this case application two, when it generates a new span it uses the propagated context to start the span with the same trace ID as the caller, which is application one. That's how every record in the request path gets stitched to the same trace ID. The sampling flag is quite important because it basically determines whether you are going to collect that particular trace or not. If it is set to zero, then even though the context propagation still happens, application two would generally not emit its span to the collection endpoint. This is often done to reduce the overall volume of trace data that you collect.
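Here is a minimal, library-free sketch of B3-style context propagation and the sampling decision between two hypothetical applications. The dictionary "transport", the ID generation, and the export function are simplified assumptions for illustration only; a real service would use its tracing agent or SDK for this.

```python
# Sketch of B3 multi-header propagation and the sampling flag.
import secrets

def outgoing_headers(sampled=True):
    # Headers application one attaches to its HTTP POST to application two.
    return {
        "X-B3-TraceId": secrets.token_hex(16),    # 128-bit trace ID
        "X-B3-SpanId": secrets.token_hex(8),      # caller's span ID
        "X-B3-Sampled": "1" if sampled else "0",  # synthetic CUJ tests force "1"
    }

def handle_request(headers):
    # Application two continues the same trace using the propagated context.
    span = {
        "trace_id": headers["X-B3-TraceId"],      # same trace ID as the caller
        "parent_span_id": headers["X-B3-SpanId"],
        "span_id": secrets.token_hex(8),
    }
    if headers.get("X-B3-Sampled") == "1":
        export(span)  # report the span to the collection endpoint
    # With Sampled=0 the span is typically not exported, reducing trace volume.

def export(span):
    print("exporting", span)

handle_request(outgoing_headers(sampled=True))
```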
How can you enable CUJs with distributed tracing? You can do that with synthetic tests. When you enable synthetic tests, you always enable them with one hundred percent sampling: you first enable tracing on these synthetic tests and ensure that all of them are sampled at 100%, so you don't drop any traces and each synthetic test emulating the request flow is completely captured. With that, you can now do things like real-time browser monitoring, APM monitoring, DNS monitoring, and many other things; overall, it gives you performance and availability from an outside-in perspective, which is what you really want.

How would you do that? First, you need a framework that enables you to do this, so you need to build one. Consider the case where you have a synthetic test. Each synthetic test has a set of steps, and each step emits its own trace; all of these traces get tied to the same test ID. These steps are generally templates onto which you can apply different variables. For example, you may want a test to run from Europe, from Japan, or from the US: you can keep the same test definition but supply different variables to the step, and the emulation of that particular user journey happens from that particular region. That's how you can enable it. Once your synthetics are enabled, you can provide a view that shows individual execution runs. When you drill down on a particular execution run, it shows the trace IDs associated with it; you can click on a trace ID and it takes you to the same waterfall view. In the graph here, a value of one indicates a successful run and minus one indicates a failure, and you can drill down into that particular trace to see where the failure happened.

So at a very raw level, you can track execution at the trace level. Is that useful? Yes, it is, but you can probably do more: you can aggregate these traces to get more meaningful insights. For example, consider a create-account transaction. With a metric you can see that this operation generally takes around 85 milliseconds to complete, but you don't know exactly where it spends most of its time. Through tracing you know that the flow involves different services, like the app, storage, or the tenant DB, but you still don't know exactly where the time is going. So you take similar-looking traces, aggregate them, and produce a time-series view like this, which tells you that the authentication operation is the one taking the most time. This is a very powerful view that lets you quickly drill down to where the problem is actually occurring. You can have the same view not just for latency degradation but also for errors, and you can slice and dice this data as you choose: for example, you can look at the contributors for a particular step across different runs, or plot the same graph for different variables, say only for the Japan location or a US location, to see what the contribution looks like and where exactly the issue is happening. You can overlay this with some kind of baseline performance to see changes or trends week over week or month over month. Overall, you can quickly find out where the opportunity for optimization lies by identifying the faulty service causing the performance degradation.
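As a rough illustration of the aggregation step, here is a minimal sketch that groups spans from similar-looking traces by operation and averages their durations. The span tuples, service names, and numbers are invented for illustration; a real pipeline would read spans from the tracing backend and usually use each span's self (exclusive) time so parents and children are not double counted.

```python
# Sketch: aggregating spans across runs of one synthetic test to find the
# operation that dominates the CUJ's latency.
from collections import defaultdict
from statistics import mean

# (trace_id, service, operation, self_time_ms) for two runs of "create account".
spans = [
    ("t1", "auth", "authenticate", 52), ("t1", "storage", "write_record", 18),
    ("t1", "tenant-db", "commit", 9),   ("t1", "app", "create_account", 6),
    ("t2", "auth", "authenticate", 61), ("t2", "storage", "write_record", 17),
    ("t2", "tenant-db", "commit", 8),   ("t2", "app", "create_account", 7),
]

by_operation = defaultdict(list)
for _, service, operation, self_ms in spans:
    by_operation[(service, operation)].append(self_ms)

# Average contribution per operation across similar-looking traces.
for (service, operation), values in sorted(by_operation.items(),
                                           key=lambda kv: -mean(kv[1])):
    print(f"{service}/{operation}: {mean(values):.1f} ms")

# Bucketing the same aggregation by minute yields the time-series
# "contribution" view described in the talk, where authentication
# stands out as the dominant cost.
```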
You can also do more with trace aggregation: you can create a directed graph representation of the flow of the request, like this. You are basically giving a directed graph view of how the request flowed and what the health of those individual requests looked like at an aggregated level. For example, you had 72 runs of a particular test, and as they went across different services some started failing: when the core was calling the Salesforce app, 12 of them failed. Remember, all these traces are already sampled, so you're already collecting them and not dropping anything. So you can enable a view like this to get a nice, quick picture of how the request is flowing and where exactly an issue might lie.

I would also like to emphasize the need for on-demand tracing. Sometimes in user flows there may be one user having an issue, and you may want to enable tracing for that specific user, so have mechanisms to enable that; we have created our own framework for it. Sometimes issues are sparse and sporadic and happen only at certain times of day that you're not aware of, so long-running tracing is also important, and you should have the ability to do that. We also have the ability to do instance-wise tracing: you deploy multiple instances, each instance may be tied to a tenant, and one instance might be the one having the issue. So you can enable tracing on a particular instance, and the critical user journey flowing through that instance can capture what exactly might be happening.

And with that, I would like to conclude my talk here. Thank you, everyone.
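To make the aggregated service-graph example from the talk concrete (72 runs, with 12 failures on the core-to-Salesforce-app edge), here is a minimal sketch that rolls caller/callee pairs up into per-edge health. The service names and the way calls are derived from parent/child spans are illustrative assumptions, not the platform's actual implementation.

```python
# Sketch: building an aggregated, directed service graph with per-edge health
# from sampled CUJ traces.
from collections import Counter

# (caller, callee, ok) extracted from parent/child span pairs across all runs.
calls = (
    [("edge", "core", True)] * 72
    + [("core", "salesforce-app", True)] * 60
    + [("core", "salesforce-app", False)] * 12
)

totals, failures = Counter(), Counter()
for caller, callee, ok in calls:
    totals[(caller, callee)] += 1
    if not ok:
        failures[(caller, callee)] += 1

# Each edge of the directed graph with its aggregated health.
for edge, total in totals.items():
    failed = failures[edge]
    print(f"{edge[0]} -> {edge[1]}: {total} calls, {failed} failed "
          f"({100 * failed / total:.0f}% error rate)")
```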
...

Sudeep Kumar

Principal Engineer @ Salesforce

Sudeep Kumar's LinkedIn account


