Transcript
All right.
Welcome to Instrumenting at 10 Years Per Second.
Today I'll talk about a new approach that we've developed to
accelerate manual instrumentation.
Now, auto-instrumentation will get you 80% of the way there, but that
last 20%, the manual instrumentation,
takes a lot longer: roughly 10 years in our case. But manual
instrumentation really increases your ability to deeply understand
what your system is doing and why.
Now we piloted this approach successfully with one of our new
microservices, and we're looking forward to spreading this approach
across the rest of the microservices in our fleet, and I'm really excited
to share this approach with you today.
A little about me.
My name is Jean Mark Wright.
I'm a staff engineer at Wave, and
I lead Observability at Wave.
We're building an operating system for small business owners.
We really want to take the tedium and the complexity out of running
a business so that small business owners can focus on what they do best.
So we provide a suite of products to ensure that small business owners can
really thrive at what they are doing.
Now, today I'll talk about the problem of accelerating manual instrumentation.
What we're trying to do is we're really trying to dramatically increase the pace
at which you can manually instrument your service, which in turn is gonna
really improve the data that's available for doing investigations and really
understanding what your system is doing.
So I'll talk about our solution,
I'll talk about how it works, and then I'll close off with some next
steps in terms of what we've thought of.
But first, lemme tell you a story.
I tell you this story for two reasons.
First, it demonstrates the power and value of high-quality
manual instrumentation.
And then secondly, it really motivates why we want to spend and invest a
lot of engineering time and effort to ensure that all engineers have access
to this high quality instrumentation.
Once upon a time our on-call engineer was minding his own business,
but then he realized something that was particularly disturbing.
He realized that our system was giving access to a premium feature for free.
This feature allowed users to connect their bank accounts and import their
transactions into our accounting product.
A costly feature for our business, but very convenient
and critical for business owners.
This feature is called transaction import. Now, transaction import is one of several
features that users have access to.
The problem is that transaction import is a premium feature.
Typically folks would purchase our pro plan, which gives access
to transaction import, along with some other premium features.
And we also had this special legacy plan that would also give access to transaction
import for some of our existing users.
But it wasn't live as yet.
And then finally, we also had this starter plan, which all customers were initialized
on once they had completed onboarding.
This starter plan gave access to free features.
It didn't give access to transaction import, which is a premium feature.
But whatever the case, we needed to dig in and understand what was going on quickly so we could
stop the leak of this premium feature.
Now, fortunately for our on-call engineer, this application was really special.
It had auto instrumentation, so all our services, our databases, our
caches, they were all instrumented.
We got that for free using Datadog, right?
And so we also had metrics, right?
We had APM metrics; we're getting latency, throughput, and error metrics.
We also had distributed tracing, so you could actually trace a request
all the way from our API down to the service that fulfilled that request.
And we also had logs, meh.
One thing that really set this system apart was the fact that it had really
high quality manual instrumentation.
So auto instrumentation is the stuff that you get out of the box, right?
So from your vendor, like Datadog or whatever your
vendor is, or OpenTelemetry.
You'll get instrumentation for your databases, for your caches, for the
different transports that you use.
But manual instrumentation,
manual instrumentation is what you are adding from your application.
And it's this instrumentation
that made this app particularly special.
What it meant is that we could actually peel back the layers.
We could really inspect the system state.
We could ask thoughtful questions, and we could really interrogate the system's
behavior to really understand what the system was doing, why it was doing it,
why it made the choices that it made.
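To make that concrete, here's a minimal sketch of what that kind of manual instrumentation might look like with Datadog's ddtrace library; the function, the helper, and the tag names are illustrative stand-ins, not the actual code from our service.

```python
from ddtrace import tracer


def initialize_business(business_id: str, user_id: str, country: str):
    # Auto-instrumentation already creates a span for the request; manual
    # instrumentation enriches it with application-level state.
    span = tracer.current_span()
    if span is not None:
        span.set_tag("business_id", business_id)
        span.set_tag("user_id", user_id)
        span.set_tag("country", country)

    plan = assign_default_plan(business_id)  # hypothetical business logic

    if span is not None:
        span.set_tag("plan", plan.name)
        span.set_tag("features", ",".join(plan.features))
    return plan
```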
But let's get back to the story.
So our on-call engineer, he noticed that we're giving a premium feature
away for free, and he's trying to figure out, we're all trying to figure out,
why is the system doing this, right?
So our investigation took us to an endpoint called initialized business.
Now, initialized business is responsible for putting customers on a plan after
they've successfully onboarded with us.
Now the plan is what gives you access to the features, right?
And after onboarding, the initialized business endpoint would put
the customer on a plan.
So by default, customers are put on our starter plan.
So, I talked about this legacy plan that we had as well.
Eventually, we'd support putting users on a legacy plan, which
would give access to transaction import, but we weren't there yet.
But with the manual instrumentation that we added, it meant that we could
really dive deep to understand exactly
what this endpoint was doing. We could really examine the data that was processed
to better understand the system state.
So not only could we look at common attributes like the user, we could ask
bigger questions. So we could ask, hey,
what plan was initialized for this user?
Or we could look at, okay, once the plan has
been initialized for the user, what features did we give to that user?
And we could look at the country that the user was being initialized from.
There was a bevy of different attributes that we could look at, and all of
this data we had access to in our traces, in our errors, and in our logs.
So a key question that came up was, did we initialize the right plan?
We know that when users onboard, we're supposed to be putting
them on the starter plan, right?
So one key question that we had to ask, and probably the one we started with first, was,
did we put the users on the starter plan?
The good thing is that because we had manually instrumented our system
and we have all of this high-quality data available to us, we could
actually interrogate our traces.
We could ask our system this question.
So we could say, looking at recent traces, which
plan have we been assigning to our users?
And see here, we've actually graphed that with our traces.
We've broken down our traces and essentially asked
our system that question.
And when you look at the result here, you can see in the bottom left corner
of the graph that we're actually correctly assigning the starter plan.
So that's actually expected.
We know customers are supposed to be put on a starter plan by default, so it looks
like the system is doing that correctly.
But somehow users are still getting access to this premium feature.
Maybe let's open a specific trace.
Let's look at the span tags and then let's look at each of these attributes in turn.
So we know the user ID is fine.
That's not the most important thing for us.
Now, the country's okay.
The plan, we saw that, yes, we're actually putting the customer on the starter plan.
We're a little concerned about features, because we know transaction import is in
that list of features that the user got.
And that's not expected. But wait, what's this feature group
thing, and why does it say legacy?
So it looks like we're initializing the starter plan, but we're
using a legacy feature group.
That's weird.
What's a feature group?
So it turns out that we actually don't assign features directly to plans.
What we do is that we actually put features into a feature group, and then
we point the plan to the feature group.
So what our instrumentation was telling us is that we were initializing the right
plan, but the incorrect feature group.
So the feature group correctly defined the set of features. The legacy feature
group, because it represents the legacy plan, has transaction
import in its list of features.
But what we were doing wrong is that the starter plan was somehow
incorrectly pointing to the legacy feature group.
What should have been happening was the starter plan should be pointing to
the starter feature group, which would've correctly pointed to the list of features.
That's really interesting.
There's so much information we're able to glean just looking at our traces
and looking at the data that's there.
So the theory we have now is that the starter plan is incorrectly pointing
to the legacy feature group.
But can we ask our system, what feature group have you been assigning?
Turns out we actually can; we can confirm that with our traces.
We can ask our system, what feature group have you been assigning?
And if we graph that, you can actually see from the graph that around the 23rd
we started incorrectly assigning the legacy feature group to users.
All along you can see that we're assigning the starter feature group, but
somewhere around the 23rd you can see that we've
started initializing the legacy feature group. And just like that,
a little quality manual instrumentation saves the day.
So our on-call engineer discovered we were giving features away for free.
We jumped into Datadog, put our manual instrumentation to work.
We interrogated the system using traces, and within 12
minutes we found the problem.
It was a data corruption issue.
There was a problem with the data in our database.
We weren't pointing to the right pieces of data properly, right?
Our starter plan was pointing to this incorrect legacy feature group.
Now, it only took us a few minutes to fix the data in the database,
so within 15 minutes, we had debugged and fixed the premium leak.
That is the power of manual instrumentation.
So you might say that's a great outcome.
Lots of great manual instrumentation, debugging and
fixing a problem in 15 minutes.
So what's the problem then?
The good news is that we had one really well-instrumented
microservice. The bad news?
We had one really well-instrumented microservice.
There are about 40 others. So during the year
to year and a half that the service was being developed, the
engineers manually instrumented it.
So if we say that it took three months to completely add that manual
instrumentation, a conservative estimate, and if we assume it takes three months
to instrument each existing service, then three months times 40 services gives
us 120 months, which is 10 years.
And yes, we could parallelize the work, but it would still take us years.
So what we decided to do was to do a pilot with one of our new microservices.
We thought, let's try and accelerate manual instrumentation for them.
Now in order to do that, we looked at two things.
The first thing we looked at was the component interaction.
So when a request comes into the application, several components
collaborate to fulfill that request.
So in our application, we typically have API components.
We have services that capture our business logic, and we also have repositories
that handle and abstract persistence.
So for the initialized business endpoint, for instance, there was an API component,
which would call a service method on our subscription service.
And then that subscription service would call one or more repositories.
So in this example, you can see the subscription service initialized
business method is calling the plan repository dot get method, and
it's also calling the subscription repository dot create method.
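As a rough sketch of that layering (the class and method names here are simplified stand-ins for illustration, not our actual code), the call flow looks something like this:

```python
class PlanRepository:
    def get(self, plan_name: str) -> dict:
        # Persistence layer: fetch the plan record (stubbed out here).
        return {"name": plan_name, "feature_group": "starter"}


class SubscriptionRepository:
    def create(self, business_id: str, plan: dict) -> dict:
        # Persistence layer: store the subscription (stubbed out here).
        return {"business_id": business_id, "plan": plan["name"]}


class SubscriptionService:
    """Business logic layer: decides which plan a new business gets."""

    def __init__(self):
        self.plan_repository = PlanRepository()
        self.subscription_repository = SubscriptionRepository()

    def initialize_business(self, business_id: str, user_id: str, country: str) -> dict:
        plan = self.plan_repository.get("starter")
        return self.subscription_repository.create(business_id, plan)


class SubscriptionAPI:
    """API layer: receives the request and delegates to the service."""

    def __init__(self, service: SubscriptionService):
        self.service = service

    def initialize_business(self, business_id: str, user_id: str,
                            country: str, business_name: str) -> dict:
        return self.service.initialize_business(business_id, user_id, country)
```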
So that's the first thing that we looked at.
We thought about the component interaction, and
it seemed like this was a piece of the puzzle, because what we're
trying to do is we're trying to figure out, how can we accelerate
the pace of manual instrumentation?
How can we really help engineers go from a point where there's little or no
manual instrumentation to a point where there's a lot more?
How can we automate this process?
So we looked at the component interaction.
That was one piece of the puzzle.
The other piece of the puzzle that we thought about was the data
that's exchanged.
So there's a lot of data exchanged in the process.
So for example, when the subscription API initialized business method calls the
subscription service initialized business method, there's a lot of data that's
passed between those two components.
So the subscription API passes a business ID, a user ID, a
country, and a business name.
And similarly, when the service calls our repositories, data is also
passed between those components.
Now, it would be great if we could let engineers specify the data that
they're interested in capturing.
So if they could say, I wanna capture a business ID, a user ID, the business name.
So if they could tell us the data that they wanted to capture, then what we could
do is we could look at these components, because now what these components
represent is a great opportunity to insert some capturing logic.
So if we're able to actually wrap each of these components, then we could
capture the data that the teams need.
So what we're saying here is that we're looking at the application stack.
We're looking at the different components that are used to
fulfill our request, right?
And we know that data gets passed in between those components.
So what we're thinking is, if we let engineers specify and say, hey,
this is the data that I want to capture,
then we could create hooks, or we could wrap those components,
and just capture the data that engineers have asked us to collect.
So what we did was we started with this configuration:
tell us what data you want to capture.
So maybe they wanna capture the business ID, the user ID, and the country.
And then what we did is that we took this notion of the configuration and
then we converted that into code.
So here's a sample configuration.
We call this a telemetry capture configuration.
Now, with this configuration, they can specify allowed or disallowed field names
and field types, and since it's written in Python, we can also add support
for data classes and for dictionaries.
And for each of these properties that you're seeing here, there's
an equivalent disallowed one.
So for example, looking at allowed field names, there's a disallowed
field names, and likewise there are disallowed field types.
So again, using this, engineers could now very succinctly specify, this is the
data that we want you to capture, right?
And they don't have to worry about where and which component is handling that data.
They could focus on the data that's exchanged, and then we would provide
the hooks and wrap each of those components that they have, and then
capture all the data that they've asked us to capture in this configuration.
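As a rough illustration, a telemetry capture configuration along those lines could be expressed as a dataclass like this; the exact field names are assumptions on my part, since the talk only describes the shape of the configuration:

```python
from dataclasses import dataclass, field


@dataclass
class TelemetryCaptureConfiguration:
    # Inputs whose names appear here get captured as telemetry.
    allowed_field_names: set = field(default_factory=set)
    # Inputs of these types (e.g. domain dataclasses, dicts) get captured.
    allowed_field_types: set = field(default_factory=set)
    # Each "allowed" property has a "disallowed" counterpart for exclusions.
    disallowed_field_names: set = field(default_factory=set)
    disallowed_field_types: set = field(default_factory=set)


config = TelemetryCaptureConfiguration(
    allowed_field_names={"business_id", "user_id", "country"},
)
```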
So in terms of wrapping the component methods, we knew that we needed
to inject some logic before their function was called to capture inputs.
So in this example, when the subscription API is calling the subscription service's
initialized business method, we know, as you've seen in that comment there,
we know we need to insert some logic
before that call to capture that business ID and capture that user ID.
But then the thing is, we can't go modifying all of their code
because that's not sustainable.
So we needed to turn to an advanced language feature of Python to help
us do this capturing logic behind the scenes, without having to go and modify everybody's code.
So what we wanted to do is we really wanted to look at the data that was being
passed between the components, right?
So here, the subscription API is passing a business ID and a user
ID to the subscription service.
And what we wanted to do is look at the data, and once
that data is specified in the configuration, then we will capture it.
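Conceptually, that capture step could look something like the wrapper sketched below: it maps the call's arguments back to their names, checks each one against the configuration, tags the active span, and then calls through to the original function. This is a hedged sketch, not our production code; it assumes Datadog's ddtrace tracer and the hypothetical configuration object from above.

```python
import functools
import inspect

from ddtrace import tracer


def capture_telemetry(func, config):
    """Wrap func so allowed inputs are recorded before the original logic runs."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        span = tracer.current_span()
        if span is not None:
            # Map positional and keyword arguments back to their parameter names.
            bound = signature.bind(*args, **kwargs)
            for name, value in bound.arguments.items():
                if (name in config.allowed_field_names
                        and name not in config.disallowed_field_names):
                    span.set_tag(name, value)
        # Then call their original function so the application behaves as before.
        return func(*args, **kwargs)

    return wrapper
```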
So in terms of the actual component wrapping,
Python has a wonderful notion for doing this.
It's called metaclasses.
So a metaclass is a class that is responsible for creating a class,
which is a mouthful of a definition.
But essentially what that means is that we could create a class that
could be used to make
any class or component that an engineer wanted to make.
So for the subscription API, we could use our metaclass, and our metaclass could
modify the creation of the subscription API or the subscription repository
or the subscription service, and then we could insert our logic, right?
So the metaclass gives us this perfect opportunity to wrap their
functions, and that allows us to do our capturing logic, right?
And then
we can call their original function.
So if we look at this simple example here, what we're doing is that
we're looping over each function.
And then we're defining a new function that captures telemetry.
And then what we do is that we call their original function.
And then finally, we replace the original function with our modified function.
So now what is gonna happen is that every time someone calls their
function, it will call our logic first.
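Here's a minimal sketch of what a metaclass like that can look like. It reuses the hypothetical capture_telemetry wrapper and configuration from the earlier sketches, and it's deliberately simplified; the real implementation would need to handle more cases (static methods, properties, and so on).

```python
class ServiceMeta(type):
    """Metaclass that wraps each public method of a class with capture logic."""

    def __new__(mcls, name, bases, namespace):
        # The component declares its configuration as a class attribute
        # (a hypothetical convention for this sketch).
        config = namespace.get("telemetry_config")
        if config is not None:
            # Loop over each function defined on the class...
            for attr_name, attr_value in list(namespace.items()):
                if callable(attr_value) and not attr_name.startswith("_"):
                    # ...define a new function that captures telemetry and then
                    # calls the original, and replace the original with it.
                    namespace[attr_name] = capture_telemetry(attr_value, config)
        return super().__new__(mcls, name, bases, namespace)
```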
So in that example, we're looking at
the initialized business method on the subscription service.
So what we would've done here is that we would've wrapped that initialized
business method on the subscription service, so that first
we look at the inputs that are coming into that initialized business method.
We check if those inputs are in the configuration.
And if they're in the configuration, then we'll just capture those
inputs for the engineer.
And then after we've captured those inputs, we can go ahead and let the
application resume as it would normally.
So then we will call their logic so that the rest of the
application functions normally.
Here we have a quick example of how the metaclass is used.
So here you can see the subscription service specifies
that it wants to use the service metaclass, right?
And that's how it gets used.
And once an engineer has added this metaclass to their component,
we are now automatically wrapping their component, right?
And once they have defined the telemetry capture configuration, which specifies
the data that they want us to capture, then our metaclass will get to work.
Once this initialized business method gets called, we consult our configuration that
they've provided and look at the inputs.
Okay, you've got a business ID coming in, let's check if it's in the configuration.
And if it's in the configuration, then we'll just go ahead and capture it.
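In code, that usage could look roughly like this, again assuming the hypothetical ServiceMeta and TelemetryCaptureConfiguration sketched above rather than our exact implementation:

```python
class SubscriptionService(metaclass=ServiceMeta):
    # The engineer declares the data they care about; the metaclass does the rest.
    telemetry_config = TelemetryCaptureConfiguration(
        allowed_field_names={"business_id", "user_id", "country"},
    )

    def initialize_business(self, business_id: str, user_id: str, country: str):
        # No capture code needed here: the metaclass wrapped this method at
        # class-creation time, so business_id, user_id, and country are tagged
        # on the active span before this body runs.
        ...
```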
And then, because this manual instrumentation process is now becoming
a little bit more magical, one other thing that we did was also to add
a coverage report that provides insights on what gets
captured and what doesn't get captured.
So what you're seeing here is it will go through each component,
it will go through each method,
it will list each input for each method, and it will tell you with a check
box: yes, this is getting covered,
or no, this is not getting covered.
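A coverage report like that can be generated mechanically. Here's a small sketch (again assuming the hypothetical configuration convention from above) that walks a component's methods, lists each input, and marks whether it would be captured:

```python
import inspect


def coverage_report(component_cls) -> str:
    """List every method input on a component and whether it gets captured."""
    config = component_cls.telemetry_config  # hypothetical class attribute
    lines = [f"Coverage for {component_cls.__name__}"]
    for name, method in inspect.getmembers(component_cls, inspect.isfunction):
        if name.startswith("_"):
            continue
        lines.append(f"  {name}:")
        for param in inspect.signature(method).parameters.values():
            if param.name == "self":
                continue
            covered = param.name in config.allowed_field_names
            lines.append(f"    [{'x' if covered else ' '}] {param.name}")
    return "\n".join(lines)


print(coverage_report(SubscriptionService))
```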
And with this simple approach, regardless of which component
handles the data, we can capture it.
So now engineers can focus on the data that they want to capture.
They focus on the data, not where,
not which component; they just focus on the data that needs to be captured.
And once that's clearly specified, we will do the heavy lifting.
We will automate the capturing of the manual instrumentation.
We call it config-accelerated custom instrumentation, or CACI for short.
Now in our pilot, we were able to move the instrumentation of the new microservice
from 1% all the way up to about 60%.
Now, this vastly increased the data and the telemetry that was available
to them for investigating, debugging and demystifying their system behavior.
So after we actually developed CACI, it took us less than a week
to get the configuration in place.
All of this really accelerated the manual instrumentation.
They didn't have to spend the time
going through and manually instrumenting their code.
It didn't take them three months, it didn't take them six months, it
didn't take them a year, because we've developed this config-accelerated
custom instrumentation, and they can easily specify with configuration
the data that they want to capture.
That's something that we were able to work with them and get done in record
time. So what does this actually mean in terms of debugging their systems?
So for instance, if they're looking at errors, they can
actually break that apart.
It's not just errors.
They can break that apart by various dimensions based on the
data that the application handles.
So in this example,
they're breaking those errors down by the event type, which
reveals that the majority of errors are for updated event types, right?
And there are many other dimensions that they could break the different
errors apart by, and they could combine all of these things.
And the reason that this is important is that, just as you remember from that
story that we told about our on-call engineer trying to figure out
why we had this premium leak,
it's really important to have a lot of high-quality manual instrumentation
because that's what's gonna help you when you're trying to debug your system.
You're looking at some kind of degradation.
In this case we're looking at errors, but you're able to break that apart so you
can start to draw different boundaries to understand exactly which segment
of your customers is being affected.
And similarly, if they were looking at latency, another quick example,
it's not just latency;
they can actually break that down.
So in this example, we're showing a graph where they're looking at
the latency, but they're breaking that down by the Kafka topic.
And here they can see that, okay, apparently the company Kafka
topic is taking longer to process.
It seems to have a lot more peaks than some of the other topics that they have.
And all of this data that they have available to them, all of this quality manual
instrumentation that we're talking about isn't just available in their traces.
It's also there in their logs and it's also in their errors, right?
They get this high fidelity wherever they're using and accessing the telemetry.
That's the power of manual instrumentation, and we've
accelerated it for them using CACI.
Now, in terms of what's next for us, we're thinking about scaling up for the other
40 plus services that we need to capture.
So one of the things we're thinking about there is, in
this talk I talked about a service metaclass, and we only looked at
one of the components in our pilot.
We looked at one of the component types that they had, which is services,
which hold our business logic.
But we're looking at making that component agnostic, so that we have
a metaclass where
it doesn't matter what component type you have,
we'll be able to hook into that component and
capture the telemetry for you.
Another thing we're working on is, you'll see in this talk that we focused on inputs.
So you know when the API called the service, we talked about the inputs, but
we haven't actually started to capture outputs, and that's something that
we're interested in looking into as well.
Another thing that we're really excited to look at is thinking about
how we can actually use LLMs to help accelerate the pace of doing and
installing this instrumentation, right?
Because what we really want to do is we really want to tighten the feedback loop
between adding the instrumentation and seeing the value of that instrumentation
and seeing what you can do,
for example, in Datadog. We really want to tighten that loop so engineers can
really get a clearer picture of that.
And we really want to use LLMs to help us do that.
And because we're a small team as well, it really helps us scale
our impact and scale our voices.
And then finally, we later found out that Datadog actually has some
capabilities in their ddtrace library to propagate telemetry between spans.
And so that's something that we're very interested in looking into as well.
All right.
And that has brought us to the end of the talk here.
Instrumenting at 10 Years Per Second.
I want to thank you for tuning in.
The best way to find me is probably on LinkedIn, Jean Mark Wright.
I also blog at jaywhy13 on Hashnode.
But I'd really like to thank you for coming and listening to our talk.
I hope you really enjoy the rest of the talks at Conf42, and I'm looking forward
to hearing from all of you.
If you have ideas, if you've done something similar, or if you just want to
chat about observability, I love to geek out about observability.
So looking forward to hearing from you.
Thanks.
Bye-bye.