Conf42 Python 2022 - Online

The Polyglot Cloud Native Debugger - Going Beyond APM

Abstract

Production bugs are the WORST bugs. They got through unit tests, integration tests, QA and staging… They are the spores of software engineering. Yet the only tools most of us use to attack that vermin are quaint little log files and APMs. We cross our fingers and put on the Sherlock Holmes hat hoping that maybe that bug has somehow made it into the log… When it isn’t there, our only remedy is guesswork and more logging (which bogs down performance for everyone and makes the logs damn near unreadable). But we have no choice other than crossing our fingers and going through CI/CD again.

This is 2022. There are better ways. With modern debugging tools we can follow a specific process as it goes through several different microservices and “step into it” as if we were using a local debugger, without interrupting the server flow. Magic is possible.

Summary

  • With Kubernetes, deployments scaled to such a level that we need tools like this to get some insight into production. Without an APM, we're, well, not as blind as a bat, but it's pretty close. There has to be a better solution than this hard separation between developers and ops. Logs are the way we solve bugs in this day and age.
  • An APM needs to observe everything. The more you instrument, the more runtime overhead you have. By observing, we effectively change some of the outcome. We need a way to connect with the server and debug it.
  • Continuous observability is complementary to the APM. With continuous observability, we don't ship new code either, but we can ask questions about the code. There's clear separation between the developer and production. This talk will go into the code portions soon.
  • A log action lets me inject a new log into the code at runtime without restarting the server. A snapshot is kind of like a breakpoint you have in a regular debugger. Airflow lets you write workflows with Python and execute them at scale. This is a perfect use case for continuous observability.
  • A simple Kotlin prime number calculator simply loops over numbers and checks if they are prime. The simplest tool we have is the ability to inject a log into an application. We can also inject a snapshot or add metrics. I'll discuss all of those soon.
  • The next thing I want to talk about is metrics. Here we can count the number of times a line of code was reached using a counter. We also have a method duration, which tells us how long a method took to execute. All of these measurements can be piped to StatsD and Prometheus.
  • Lightrun supports JVM languages like Java, Kotlin, Scala, etc. Conditions run within a sandbox so they don't take up CPU or crash the system. Blacklisting lets us block specific classes, methods or files. Lightrun can be used from the cloud or via an on-premise install.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. Today we're going to talk about the polyglot cloud native debugger, going beyond APMs. We don't have much time, so I'll get right to it. But first, a few things about me. I was a consultant for over a decade. I worked at Sun, founded a couple of companies, wrote a couple of books. I wrote a lot of open source code, and currently work as a developer advocate for Lightrun. My email and Twitter account are listed here, so feel free to write to me. I have a blog that talks about debugging and production issues at talktotheduck.dev. It would be great if you check it out and let me know what you think. I love APMs. They are absolutely wonderful. I'm old enough to remember a time where they weren't around, and I'm so happy we moved past that. This is absolutely amazing. The dashboards and the details. You get this great dashboard with just everything you need. Amazing. We're truly at a golden age of monitoring. Hell, when I started, we used to monitor the server by kicking it and listening to see if the hard drive was spinning properly. Today, with Kubernetes, deployments scaled to such a level that we need tools like this to get some insight into production. Without an APM, we're, well, not as blind as a bat, but it's pretty close. A lot of the issues we run into start when we notice an anomaly in the dashboard. We see a spike in failures or something that performs a bit too slow. The APM is amazing at showing those hiccups, but this is where it stops. It can tell us that a web service performed badly or failed. It can't tell us why. It can't point us at a line of code. So let's stop for a second and talk about a different line. This line. On the one side we have developers, on the other side we have the ops or DevOps. This is a line we've had for a long time. It's something we drew out of necessity, because when developers were given access to production... well, I don't want to be too dramatic, but when developers got access to production, it didn't end well. This was literally the situation not too long ago. Yes, we had sysadmins, but the whole process used to be a mess. That was no good. We need a better solution than this hard separation, because the ops guys don't necessarily know how to solve problems made by the developers. They know how to solve ops problems. So when a container has a problem and the DevOps don't know how to fix it, well, it starts a problematic feedback loop of test, redeploy, rinse, repeat. That isn't ideal. Monitoring tools are like the bat signal. They come up and we, the developers, we're Batman or Batwoman or bat person. All of us heroes, we step up to deal with the bugs. We're the last line of defense against their, well, villainy. We're coder bat-people. It's kind of the same thing without the six-pack abs; too many baked goods, you know, in the company kitchen. Here, coder Batman needs to know where the crime, or bugs, are happening in the code. So these dashboards, they point us toward the crime we have to fight in our system. But here's where things get hard. We start digging into the logs, trying to find the problem. The dashboard sent us in a general direction, like a performance problem or high error rates. But now we need to jump into logs and hope that we can find something there that will somehow explain the problem we're seeing. That's like going from a jet engine back to stone age tools. There are many log processing platforms that do an amazing job at processing these logs and finding the gold within them.
But even then, it's a needle in a haystack. That's the good outcome, where a log is already there waiting for us. But obviously we can't have logging all over the place. Our billing will go through the roof and our performance, well, it will suffer. We're stuck in the loop of: add a new log, go through CI/CD, which includes the QA cycle and everything. This can take hours. Then reproduce the issue in the production server with your fingers crossed and try to analyze what went on. Hopefully you found the issue, because if not, it's effectively rinse and repeat for the whole process. In the meantime, you still have a bug in production and developers are wasting their time. There just has to be a better way. It's 2022, and logs are the way we solve bugs in this day and age. Don't get me wrong, I love logs, and today's logs are totally different from what we had even ten years ago. But you need to know about your problems in advance for a log to work. The problem is, I'm not clairvoyant. When I write code, I can't tell what bugs or problems the code will have before the code is written. I'm in the same boat as you are. The bugs don't exist yet. So I'm faced with a dilemma of whether to log something. This is a bit like the dilemma of writing comments: does it make the code look noisy and stupid? Or will I find this useful at 2 a.m. when everything isn't working and I want to rip out the few strands of hair I still have left because of this damn production issue? Debuggers are amazing. They let us set breakpoints, see call stacks, inspect variables, and more. If only we could do the same for production problems. But debuggers weren't designed for this. They're very insecure when debugging remotely. They can block your server while sending debugger commands remotely. A small mistake, such as an expensive condition, can literally destroy your server. I might be repeating an urban legend here, but 20 or so years ago I heard a story about a guy who was debugging a rail system located on a cliff. He stopped at a breakpoint during debugging, and the multimillion dollar hardware fell into the sea because it didn't receive the stop command. Again, I don't know if it's a true story, but it's plausible. Debuggers weren't really designed for these situations. Debuggers are limited to one server. If you have a cluster with multiple machines, the problem can manifest on one machine always, or might manifest on a random machine. We can't rely on pure luck. If I have multiple servers with multiple languages and platforms, crossing from one to another with a debugger, well, it's possible in theory, but I can't even imagine it in reality. I also want to revisit this slide, because I do love having APMs, and looking at their dashboard gives me that type of joy we get from seeing the results of my work plotted out as a graph. I feel there should be a German word to describe that sort of enjoyment. But here's the thing. APMs aren't free. The more you instrument, the more runtime overhead you have. The more runtime overhead you have, the more hosts you need to handle the same amount of tasks. The more hosts you have, the more problems you have, and they become more complex. I feel Schrödinger should deliver this next line: by observing, we effectively change some of the outcome. An APM needs to observe everything. An APM doesn't know what it's looking for. Like I said before, it's a bat signal or a check engine light. It's got sensors all over the place, and it needs to receive information from these sensors.
Some sensors have almost no overhead, but some can impact the observed application noticeably. Some people use that as an excuse to avoid APMs, which I feel is like throwing out the baby with the bathwater. We need APMs. We can't manage at scale without them, but we need to tune them. And observing everything isn't an option. Thankfully, pretty much every APM vendor knows that, and they all let us tune the ratio of observability to performance so we can get a good result. Unfortunately, that means we get less data. Couple that with the reduced logging that we need to do for the same reason, and the bad problems we have in production just got a whole lot worse. So let's take the Batman metaphor all the way. We need a team-up. We need some help on the servers, especially in a clustered polyglot environment, where the issue can appear on one container and move to the next, et cetera. So you remember this slide. We need some way to get through that line, not to remove it. We like that line. We need a way to connect with the server and debug it. Now, I'm a developer, so I try to stay away from management buzzwords, but the word for this is shift left. It essentially means we're letting the developer and the QA get back some of the access we used to have into the ops, without demolishing the gains we've had in security and stability. We love the ops people, and we need them. So this is about helping them keep everything running smoothly in production without stepping on their toes or blowing up their deployment. This leads us here. What if you could connect your server to a debugger agent that would make sure you don't overload the server and don't make a mistake, like setting a breakpoint or something like that? That's what continuous observability does. Continuous observability is complementary to the APM. It works very differently. Before we go on, I'd like to explain what continuous observability is. Observability is defined as the ability to understand how your system works on the inside without shipping new code. The "without shipping new code" portion is key. But what's continuous observability? With continuous observability, we don't ship new code either, but we can ask questions about the code. Normal observability works by instrumenting everything and receiving the information. With continuous observability, we flip that: we ask questions and then instrument based on the questions. So how does that work in practice? Each tool in this field is different. I'll explain the Lightrun architecture, since that's what I'm familiar with, and I'll try to qualify where it differs from other tools. In Lightrun, we use a native IDE plugin for VS Code or JetBrains IDEs, such as IntelliJ IDEA, PyCharm, WebStorm, etc. You can also use a command line tool; other tools sometimes have a web interface or are CLI-only. This client lets us interact with the Lightrun server. This is an important piece of the architecture that hides the actual production environment. Developers don't get access to the production area, which is still the purview of DevOps. We can insert an action, which can be a log, a snapshot, or a metric. I'll show all of these soon enough. This talk will go into the code portions soon. Notice that the Lightrun server can be installed in the cloud as a SaaS or on-premise and managed by ops. The management server sends everything to the agent, which is installed on your production or staging servers. This is pretty standard for all continuous observability solutions.
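To make that architecture concrete, here's a minimal sketch of what wiring the agent into a Python service can look like. This assumes Lightrun's pip package; the enable() call and the company_key parameter are from memory of the docs, so treat the exact names as assumptions and check the setup guide for your version.

```python
# Minimal sketch, assuming the Lightrun pip package (pip install lightrun).
import lightrun

def main():
    # The agent registers with the management server, which then forwards
    # any actions (logs, snapshots, metrics) added from the IDE plugin.
    # Developers never connect to production directly.
    lightrun.enable(company_key='<YOUR-COMPANY-KEY>')  # key provided by ops

    # ... regular application startup goes here ...

if __name__ == '__main__':
    main()
```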
I don't know how other solutions work, but I assume they are pretty similar. This means there's clear separation between the developer and production. As you can see, the DevOps still have that guarding line we're talking about. They need to connect the agent to the management server, and that's where their job ends. Developers don't have direct access to production, only through the management server. That means no danger to the running production servers from a careless developer. Well, like myself. The agent is just a small runtime you add to your production or staging server. It's very low overhead, and it implements the debugging logic. Finally, everything is piped through the server back to your IDE directly. So as a developer, you can keep working in the IDE without leaving your comfort zone. Okay, that should raise the vendor alert right here. I've heard that bullshit line before, right? APMs have been around forever and have been optimized. How can a new tool claim to have lower overhead than an established and proven solution? As I said before, APMs look at everything. A continuous observability tool is surgical. That means that when an APM raises an interesting issue, we can then look at a very specific thing, like a line of code. When a continuous observability solution isn't in use, its overhead is almost nothing. It literally does nothing other than check whether we need it. It doesn't report anything and is practically idle. When we do need it, we need more data than the APM does, but we get it from one specific area of the code. So there is an overhead, but because it only impacts one area of the code, it's tiny. This is the obvious question: what if I look at code that gets invoked a lot? As I said, continuous observability gets even more data than an APM does. This can bring down the system, and, well, we could end up here. So this is where continuous observability tools differ. Some tools provide the ability to throttle expensive actions and only show you some of the information. This is a big deal in high volume requests. I'm going to show you two demos that highlight what we can do, and the first is a simple hello world Flask server. So this is a simple hello world Flask app which is running in PyCharm. I'll demonstrate VS Code soon. First, I right-click and select the log option in the menu. A log lets me inject a new log into the code at runtime without restarting the server. But there is more. See here, I can log any expression or variable from the currently running app. In this case, I am logging the value of name. Logs can appear in the console below, or they can appear with the rest of the logs from the code. Let's press the OK button, which inserts the new log. We can now see the dynamic log appearing just above the line, as if it were a line we added into the code. Now let's go to the browser window and hit refresh. Then we go back to the IDE, and within a matter of seconds we can see the log. Notice you can send the log to the IDE or integrate it with the other logs from your app. Let's delete the log and select a snapshot instead. A snapshot is kind of like a breakpoint you have in a regular debugger, but it has one major difference: it doesn't break, it doesn't stop the threads. When it's hit, it grabs the stack information, variable values, et cetera, but doesn't stop the thread. So a user in production isn't stuck because you're debugging. Let's go back to the web UI and hit the refresh button to see the snapshot in action.
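Before we look at the snapshot result, here's roughly the kind of app being debugged, for readers following along without the video. It's a minimal sketch; the route and the name variable are assumptions that just mirror what's shown on screen.

```python
# Minimal hello-world Flask app, along the lines of the demo.
from flask import Flask, request

app = Flask(__name__)

@app.route('/')
def hello():
    # `name` is the kind of variable an injected log would capture,
    # e.g. with a dynamic log expression like "Hello {name}".
    name = request.args.get('name', 'World')
    return f'Hello, {name}!'

if __name__ == '__main__':
    app.run()
```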
Then we can go back to the IDE and wait for the snapshot result to arrive. Below, you can see the result of the snapshot, as is the convention in JetBrains IDEs. You can walk the stack like you can with any breakpoint, inspect variable values like you could with any debugger, and all of that doesn't bother any live user in the system. I skipped a lot of interesting features here, such as the ability to define conditional logs or snapshots, which let you do things like define a snapshot that's only hit when a specific user is in the system. That means that if a user somewhere has a bug, you can literally get information that's specific only to that user and no one else. That's pretty cool. Airflow lets you write workflows with Python and execute them at scale. There are many frameworks with similar concepts, such as Spark, etc. Lots of them have different use cases and target demographics, but they have one core concept in common: they launch workers that are short-lived. They perform a task and return a response. In the case of Airflow, this is commonly used for processing data at scale. A good example of this is tagging or classifying images. Here we have multiple independent processes that can take pieces of data, process them, and return a result. One of the nice things about these tools is that we can create chains of dependencies where results get passed from one process to another, to use computing resources in the most effective way. But here's the problem. This thing is nearly impossible to debug. This is so bad. Companies just let major bugs live in production and accept a pretty terrible error rate because they just can't debug this thing. They have logs, but with the ephemeral processes, they lose the context very quickly. This is a perfect use case for continuous observability tools, which can deliver more. Airflow lets you break down huge tasks, like classifying a large set of images, into distributed workers that can run on multiple machines and use available resources intelligently. This makes it nearly impossible to debug. Your worker might run somewhere, and all you have is a log after the fact, which you would need to dig through to check for a bug or a fix. This time I use VS Code to demonstrate this functionality. This is a simple Airflow demo that classifies images. The problem with Airflow is that we don't have an agent or a server on which our code is running. An agent can come up, process, and then go away. By the time we set the snapshot into place, it will be gone. This is where tags come in. Tags let us apply an action, such as a log or a snapshot, to a group of agents. That means that every agent launched with the given tag will implicitly get the actions of that tag. By the way, notice that in VS Code we need to add actions from the left pane. The UI is a bit different here. Adding an action to a tag is pretty similar to adding it to an agent. We just add it, and it looks the same so far. Now that it's added, let's move to the agents view and wait for the agent to come online and trigger the snapshot. By the way, notice that the UI for all of this is similar in spirit to the one in PyCharm. Now we have an agent that's running, and we got a notification that the snapshot was hit. Let's go into the snapshots tab and click the snapshot. Unlike PyCharm, we need to open the snapshot manually, and it looks like a VS Code breakpoint, which is good, as it's native to the IDE. But the core idea, the UI of the snapshot with the stack, variables, etc., that's the same as it was in PyCharm.
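For reference, here's a sketch of the kind of Airflow DAG described above, with short-lived workers classifying images. The DAG id, task names, and the classify_image body are hypothetical; only the Airflow API itself is real. Each task runs briefly in a worker and exits, which is exactly why a plain breakpoint can never catch it and tags are needed.

```python
# Sketch of a fan-out image classification DAG (Airflow 2.x API).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def classify_image(path):
    # Stand-in for real model inference; a worker runs this and exits.
    print(f'classifying {path}')

with DAG(dag_id='classify_images',
         start_date=datetime(2022, 1, 1),
         schedule_interval=None) as dag:
    # One short-lived task per image; these can land on any worker machine.
    tasks = [
        PythonOperator(task_id=f'classify_{i}',
                       python_callable=classify_image,
                       op_args=[f'/data/img_{i}.jpg'])
        for i in range(3)
    ]
```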
The title of this talk refers to polyglot debugging. Because of time constraints, I can't show the full polyglot demo, but let's look at this simple Kotlin prime number calculator. It simply loops over numbers and checks if they are prime. It sleeps for ten milliseconds, so it won't completely demolish the CPU, but other than that, it's a pretty simple application. It just counts the number of primes it finds along the way and prints the result at the end. We use this code a lot when debugging, since it's CPU intensive and yet very simple. In this case, we would like to observe the variable i, which is the value we're evaluating here, and print out cnt, which represents the number of primes we found so far. The simplest tool we have is the ability to inject a log into an application. We can also inject a snapshot or add metrics. I'll discuss all of those soon. Selecting log opens the UI to enter a new log. I can write more than just text: in the curly braces, I can include any expression I want, such as the value of the variables that I included in this expression. I can also invoke methods and do all sorts of things. But here's the thing. If I invoke a method that's too computationally intensive, or if I invoke a method that changes the application state, the log won't be added. I'll get an error. After clicking OK, we see the log appearing above the line in the IDE. Notice that this behavior is specific to IntelliJ and other JetBrains IDEs. In Visual Studio Code, it will show a marker on the side. Once the log is hit, we'll see logs appear in batches like before. You'll notice that the experience is pretty much identical to the one we had with Python. The next thing I want to talk about is metrics, and this is a different demo that I usually use to show metrics. It's based on Java, actually, fitting the polyglot stuff. APMs give us large scale performance information, but they don't tell us fine grained details. Here we can count the number of times a line of code was reached using a counter. We can even use a condition to qualify that, so we can do something like count the number of times a specific user reached that line of code. We also have a method duration, which tells us how long a method took to execute. We can even measure the time it takes to perform a code block using a tictoc. This lets us narrow down the performance impact of a larger method to a specific problematic segment. In this case, I'll just use the method duration. Measurements typically have a name under which you can pipe or log them, so I'll just give this method duration a clear name. In this case, I'm just printing it out to the console, but all of these measurements can be piped to StatsD and Prometheus. I'm pretty awful at DevOps, so I really don't want to demo that in this case, but it does work if you know how to use those tools. As you can see, the duration information is now piped into the logs and provides us some information on the current performance of this method.
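For readers following along, here's a rough Python equivalent of that prime number demo (the original is Kotlin, and the loop bound here is an assumption). It loops over numbers, counts primes in cnt, and sleeps ten milliseconds per iteration, so i and cnt are exactly the values the injected logs and snapshots described above would capture, and the loop body is where a counter or tictoc metric would go.

```python
# Rough Python equivalent of the Kotlin prime counter used in the demo.
import time

def is_prime(n):
    if n < 2:
        return False
    # Trial division up to sqrt(n) is enough to decide primality.
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

cnt = 0  # number of primes found so far; a natural target for a log action
for i in range(2, 10_000):  # i is the value currently being evaluated
    if is_prime(i):
        cnt += 1
    time.sleep(0.01)  # keep CPU usage low, as in the demo

print(f'Found {cnt} primes')
```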
In closing, I'd like to go over what we discussed here, and a few things we didn't have time for. Lightrun supports JVM languages like Java, Kotlin, Scala, etc. Every JVM language is supported. It supports Node for both JavaScript and TypeScript code, and of course Python, even complex stuff like Airflow. We're working hard on adding new platforms, and we're doing that really fast. So if you want a new platform I didn't mention here, just write to me and I'll connect you with the product team. You can become a beta tester for the new platform and have an impact on the direction we take. When we add actions, conditions run within a sandbox, so they don't take up CPU or crash the system. This all happens without networking, so something like a networking hiccup won't crash the server. Security is especially crucial with solutions like this. One of the core concepts is that the server queries information, not the other way around, as you would see with solutions such as JDWP. This means operations are atomic, and the server can be hidden behind a firewall, even from the rest of the organization. PII redaction lets us define conditions that would obscure patterns in the logs. So if a user could print out a credit card number by mistake, you can define a rule that would block that. This way the bad data won't make it into your logs and won't expose you to liability. Blacklisting lets us block specific classes, methods or files. This means you can block developers in your organization from debugging specific files, so a developer won't be able to put a snapshot or a log in a place where a password might be available, to steal user credentials or stuff like that. This is hugely important in large organizations. Besides the sandbox, I'd like to also mention that Lightrun is very efficient and in our benchmarks has almost no runtime impact when it isn't used. It has a very small impact even with multiple actions in place. Finally, Lightrun can be used from the cloud or using an on-premise install. It works with any deployment you might have, whether cloud-based, container-based, on-premise, microservice, serverless, etc. Thanks for bearing with me. I hope you enjoyed the presentation. Please feel free to ask any questions and feel free to write to me. Also, please check out talktotheduck.dev, where I talk about debugging in depth, and check out
...

Shai Almog

Developer Advocate @ Lightrun



