Conf42 Cloud Native 2022 - Online

Debugging at Scale in Production - Deep into your Containers with kubectl debug, KoolKits and Continuous Observability


Abstract

Brian Kernighan said: “Debugging is twice as hard as writing the code in the first place.”

In fact, debugging in a modern production environment is even harder - orchestrators spinning containers up and down, and weird networking wizardry keeping everything glued together, make understanding systems that much more difficult than it used to be.

And, while k8s is well understood by DevOps people by now, it remains a nut that developers are still trying to crack. Where do you start when there’s a production problem? How do you get the tools you’re used to in the remote container? How do you understand what is running where and what is its current state?

In this talk, we will review debugging a production application deployed to a Kubernetes cluster, and review kubectl debug - a new feature from the Kubernetes sig-cli team. In addition, we’ll review the open source KoolKits project that offers a set of (opinionated) tools for kubectl debug.

KoolKits builds on top of kubectl debug by adding everything you need right into the image. When logging into a container, we’re often hit with the scarcity of tools at our disposal. No vim (for better or worse), no DB clients, no htop, no debuggers, etc… KoolKits adds all the tools you need right out of the box and lets you inspect a production container easily without resorting to endless installation and configuration cycles for each needed package.

We’ll finish the talk by delving into how to get better at debugging on a real-world scale. Specifically, we’ll talk about how to be disciplined in our continuous observability efforts by using tools that are built for k8s scale and can run well in those environments, while remaining ergonomic for day to day use.

This session will go back and forth between explanation slides and demonstration of the topic at hand.

Summary

  • Today we'll talk about Kubernetes debugging at scale in production. The kubectl debug command adds ephemeral containers to a running pod. Each failure requires a different set of skills and solutions. We need new tools to deal with that scale.
  • KoolKits includes the common things most of us need based on the platform or language. It has sensible defaults and comes with Ubuntu as the distribution. This lets us debug the source code directly from the IDE. To start, we need to sign up for a free account at lightrun.com.
  • The snapshot gives us a stack trace and variables, just like a regular breakpoint. You can define a condition for a log and for metrics as well. This is one of the key features of continuous observability. Let's up the ante a bit and talk about user-specific problems.
  • The next thing I want to talk about is metrics. Here we can count the number of times a line of code was reached using a counter. We also have a method duration, which tells us how long a method took to execute. All these measurements can be piped to StatsD and Prometheus.
  • KoolKits made kubectl debug easier to use with preinstalled tools. Lightrun makes secure, read-only, real-time debugging at scale easy. If you have any questions, my email is listed here and I'll be happy to help.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, I'm Shai Almog. Today we'll talk about debugging at scale in production, specifically about Kubernetes debugging. I'm assuming everyone here knows the basics of Kubernetes, so I will dive right into the basic problem description and then into the three tools I will show today. But first, a few things about me: I was a consultant for over a decade, worked at Sun, founded a couple of companies, wrote a couple of books, wrote a lot of open source code, and currently work as a developer advocate for Lightrun. My email and Twitter accounts are listed here, so feel free to write to me. I have a blog that talks about debugging and production issues at talktotheduck.dev. It would be great if you check it out and let me know what you think. I also have a series of videos on my Twitter account called 140 Second Ducklings, where I teach complex things in 140 seconds. The current series is about debugging, and it covers a lot of things most developers don't know and would find helpful.
Containers and orchestrators revolutionized development and production, no doubt. But in a way, Kubernetes made debugging production issues harder than it previously was. In the past we had physical servers we could just work with, or even VPSes. Now we face much greater difficulty due to three big challenges. The massive scale enabled by Kubernetes is a huge boon, but it also makes debugging remarkably difficult; we need new tools to deal with that scale. We now have multiple layers of abstraction in the deployment, where failures can happen in the orchestrator, container, or code layers. Each failure requires a different set of skills and solutions, and tracking the cause to the right layer isn't necessarily trivial. Finally, there's the bare-bones, or lean, deployment problem.
When debugging, this is the first problem I want to focus on; we'll get to the other two soon enough. It's the problem of the bare, naked container. We can connect to a bare-bones container, but we have nothing to do inside it. Nothing is installed. We can inspect logs, but that relies on luck. Furthermore, if your logs are already ingested by a solution like Elastic, you probably don't have anything valuable to do within a bare-bones container. kubectl debug solves these problems and can work even with a crashed container or a bare-bones image. The kubectl debug command adds ephemeral containers to a running pod. An ephemeral container is a temporary element that vanishes once the pod is destroyed. With it we can inspect everything that we need in the pod. Most changes we make in it don't matter; they won't impact the pod after we're gone. It works with bare-bones containers. The way it does this is with a separate image, so we can have an image that includes everything in it. The container spun up from that image is ephemeral and can include a proper distro and the set of tools we need. kubectl debug was introduced in version 1.23, so if you're still on an older version you will need to wait for that. If you use hosted Kubernetes, you need to check the version they use.
Let's start with a simple demo. As you can see, we have a few pods here. We're experiencing an issue and would like to increase the logging level so we can better see what's going on. I can use exec and log in directly to the live pod. I log in with a proper bash shell. I'm sure most of you did that in the past, as it's pretty easy. Here I can just use standard commands like cat and grep to check the logging level. This is all good. We can see the current level is at info.
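For reference, that first step boils down to something like the following. This is a minimal sketch; the pod name and config file path are hypothetical placeholders, not the ones used in the actual demo:

    # open a shell inside the running container (pod name is hypothetical)
    kubectl exec -it billing-service-7d9f6c5b8-x2k4q -- /bin/bash
    # inside the container: check the current logging level (path is hypothetical)
    grep level /app/config/logging.properties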
Unfortunately, I don't even have vim if I want to edit this file. Now, when the container has apt or apk and when the pod isn't crashed, I could in theory just apt-get install vim, but that's got its own problems and is a painful process. We don't want people in production just installing packages left and right, even if it's cleaned up afterwards. The risks are sometimes too high. Pods shouldn't be touched in production. Once deployed, all the information and state of a pod should be described in its manifest, unless it's strictly a stateful pod, like a database, et cetera. So installing like this is problematic.
So let's exit for a moment and try connecting again with kubectl debug. I'll use the busybox image in this case and we'll see how that works. Notice that this is the image referenced in the kubectl debug docs. So I'm connected to the pod again, or so it seems. Technically I'm connected to a new ephemeral container and not to the pod directly. This is an important distinction, but as you can see, again I don't have vim or really any of the tools I would expect, like VisualVM, traceroute, et cetera. I can fix that. I can create an image that packages everything I need, then pass that image to kubectl debug and just use those tools. But here's the thing: I'm not unique. We're all pretty much the same. We all expect the same things in our debugging sessions, and the image I would use is probably the same image you would use, so why not have one generic image?
This is where KoolKits comes in. KoolKits is an open source project that includes a set of opinionated, curated, platform-specific tools for kubectl debug, so you can have everything you might need at your fingertips while debugging. So what does this mean? When you use kubectl debug to spin up an ephemeral container, it's built using a KoolKits image. Currently there are four standard images: a Go image that includes tools such as Delve, pprof, go-callvis, and many others; the JVM toolkit, which includes tools such as SDKMAN, jmxterm, honest-profiler, VisualVM, and much more; the Node version, which includes nvm, ndb, 0x, vtop, and much more; and finally the Python version, which includes pyenv, ipdb, IPython, and much more. But this is just the tip of the iceberg, as all the versions include the many tools you would expect in any proper debugging session, such as vim and htop. We also have lots of networking tools like traceroute and nmap, database clients for Postgres, MySQL, and Redis, and again, so much more.
So let's continue from where we left off in the demo. We can disconnect from the current session and then spin up a new session with the KoolKits image. Notice we can also use the shorthand kk command for many operations, which I don't use here, but you can see the syntax in the KoolKits docs. In this case, notice I use the JVM version of KoolKits, which I chose because I'm a Java guy. But if you're using a different environment, you can use what fits there. In KoolKits, pretty much every tool I want is already preinstalled as part of the image by default. This means we can just connect and everything is already there. Since we're all very similar in our needs, KoolKits includes the common things most of us need, based on the platform or language. It has sensible defaults and comes with Ubuntu as the distribution. This is important: you have a full distribution, like you would have on a desktop or on a regular server.
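As a rough sketch of the two invocations, with hypothetical pod and container names, and image names quoted from memory (double-check the kubectl docs and the KoolKits README for the current ones):

    # ephemeral debug container from the plain busybox image (very few tools inside)
    kubectl debug -it billing-service-7d9f6c5b8-x2k4q --image=busybox --target=billing-service

    # ephemeral debug container from the JVM KoolKit, with a full Ubuntu userland and JVM tooling
    kubectl debug -it billing-service-7d9f6c5b8-x2k4q --image=lightruncom/koolkits:jvm --target=billing-service

The --target flag places the ephemeral container in the target container's process namespace, which is what makes the shared debugging described next possible.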
This is very helpful for debugging, so you get everything you need even when debugging a bare-bones container. Notice that thanks to kubectl debug, we have full access to the main application container's file system and the pod's process namespace, so we can do everything there while residing in a more convenient environment, having our cake and eating it too. So to finish the story from before, I can just use vim to edit the file and change the logging level to error, which I can then confirm using cat and grep. I can also do a lot of other things, such as profile using a profiler, debug with GDB or JDB, or use jmxterm to perform JMX operations, which lets you configure the way the JVM behaves at runtime - pretty much anything I could do on a local machine. To give you a sense of what KoolKits installs, this is the list of packages for the JVM kit, and this is bound to grow, as you can all submit pull requests with your favorite packages. This is just the JVM-specific image; the other images contain similar tools at a similar scale, and you can get all that thanks to kubectl debug and KoolKits.
I didn't forget about this slide. How do we debug issues of massive scale and massive depth? Instead of talking theory, let's talk about a real-world example. What if what we're tracking seems to be an application bug? This is a common occurrence for sure. We might not know it at this stage, but that might be the place we want to investigate. We can use one of the debuggers in KoolKits to track it, but that would only work if we know the server where the issue manifests. It's also remarkably risky. Connecting a debugger to a production environment can lead to multiple problems: stopping on a breakpoint by accident, using a conditional statement that grinds the system to a halt, or exposing a security vulnerability. JDWP itself is literally an open invitation for hacking. We can try using logs, and probably should start there, but more often than not the information we need isn't logged. We can try using various observability tools. They're great, but not for application-level issues. They rule for big-picture analysis and container-level problems, not for application-level problems.
We used to call this continuous observability, but developer observability makes more sense. It's a newer set of tools designed to solve this exact problem. Observability is defined as the ability to understand how your system works on the inside without shipping new code. The "without shipping new code" portion is key. But what's developer observability? With developer observability, we don't ship new code either; we can ask questions about the code. Normal observability works by instrumenting everything and receiving the information. With developer observability, we flip that: we ask questions and then instrument based on those questions. So how does that work in practice? In practice, we add an agent to every pod. This lets us debug the source code directly from the IDE, almost like debugging a local project. To start, we need to sign up for a free Lightrun account at lightrun.com/free. Notice that Lightrun has a very generous free tier you can use to experiment with the product; pretty much everything I show here can be accomplished with a free account. I'll skip the actual setup since it's covered there and we don't have much time. You can check out the Lightrun docs for more detailed instructions on setting up Lightrun in Docker, Minikube, etc.
Unlike kubectl debug, we need to install the agent before the problem occurs, so if we do run into a problem, we'll be able to jump right in. Let's skip ahead to a simplified demo in the IDE, once the agent is set up. This is the prime main app in Kotlin. It simply loops over numbers and checks if they are prime. It sleeps for ten milliseconds per iteration, so it won't completely demolish the CPU. But other than that, it's a pretty simple application. It just counts the number of primes it finds along the way and prints the result at the end. We use this code a lot when debugging, since it's CPU-intensive and very simple. In this case, we would like to observe the variable i, which is the value we're evaluating, and print out cnt, which represents the number of primes we found so far.
The simplest tool we have is the ability to inject a log into the application. We can also inject a snapshot or add metrics; I'll discuss all of those soon enough. Selecting log opens the UI to enter a new log. I can write more than just text: in the curly braces I can include any expression I want, such as the value of the variables I included in the expression. I can also invoke methods and do that sort of thing. But here's the thing: if I invoke a method that's too computationally intensive, or a method that changes the application state, the log won't be added; I'll get an error. After clicking OK, we see the log appearing above the line in the IDE. Notice that this behavior is specific to IntelliJ and other JetBrains IDEs; in Visual Studio Code it will show a marker on the side. Once the log is hit, we'll see logs appear in batches. Notice I chose to pipe logs to the IDE for convenience, but there's more I can do with them. For now, the thing I want to focus on is the last line. Notice that the log point is paused due to a high call rate. This means the additional logs won't show for a short period of time, since logging exceeded a threshold of CPU usage. This can happen quickly or slowly depending on what you're observing.
Let's move on to a different demo. This is a Node.js project that implements the initial backend of a microservice architecture. This is the method that gets invoked when we click a movie and want to see the details. This time I'll add a snapshot. Some other developer observability tools call this a capture or a non-breaking breakpoint, which to me sounds weird, but the idea is usually the same. Once I press OK, the camera button appears on the left, indicating the location of the snapshot, like you would see with a regular IDE breakpoint. Now I'll just access the production front end to trigger this code. And now we wait a second and the snapshot is hit. So what is a snapshot? It gives us a stack trace and variables, just like the regular breakpoints we all know and love. But it doesn't stop at that point, so your server won't be stuck waiting for a step-over. Now, obviously you can't step over the code, so you need to place individual snapshots instead. But this is a huge benefit, especially in production scenarios. And it gets much better. This was relatively simple in terms of observability; let's up the ante a bit and talk about user-specific problems.
So here I have a problem with this request: one specific user is complaining that the list on his machine doesn't match the list his peers see. The problem is that if I put a snapshot here, I'll get a lot of noise, because there are many users reloading all at the same time.
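To make the prime-counting demo from a moment ago concrete, here is a minimal sketch along the lines described above; the names, bounds, and log expression are assumptions for illustration, not the actual demo source:

    // PrimeMain.kt - a hedged reconstruction of the demo app, not the original source
    fun isPrime(n: Long): Boolean {
        if (n < 2) return false
        var d = 2L
        while (d * d <= n) {
            if (n % d == 0L) return false
            d++
        }
        return true
    }

    fun main() {
        var cnt = 0
        for (i in 2L..1_000_000L) {
            if (isPrime(i)) cnt++   // a Lightrun log here could use an expression like "i={i}, primes so far={cnt}"
            Thread.sleep(10)        // sleep 10ms per iteration so the loop doesn't demolish the CPU
        }
        println("Found $cnt primes")
    }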
A solution to this noise problem is to use conditional snapshots, just like you can with a regular debugger. Notice that you can define a condition for a log and for metrics as well; this is one of the key features of continuous observability. I add a new snapshot, and in it I have the option to define quite a lot of things. I won't even discuss the advanced version of this dialog in this session. This is a really trivial condition: we already have a simple security utility class that I can use to query the current user ID, so I just make use of that and compare the response to the ID of the user that's experiencing the problem. Notice I use the fully qualified name of the class. I could have just written Security, and it's very possible that it would have worked, but it isn't guaranteed: names can clash on the agent, and the agent side isn't aware of things we have in the IDE. As such, it's often good practice to be more specific. After pressing OK, we see a special version of the snapshot icon with a question mark on it. This indicates that this action has a condition on it. Now it's just a waiting game for the user to hit the snapshot. This is the point where normally you go make yourself a cup of coffee, or just go home and check this out the next day. That's the beauty of this sort of instrumentation. In this case, I won't make you wait long. The snapshot gets hit by the right user despite other users coming in; this specific request is from the right user ID. We can now review the stack information and fix the user-specific bug.
The next thing I want to talk about is metrics. APMs give us large-scale performance information, but they don't tell us fine-grained details. Here we can count the number of times a line of code was reached using a counter. We can even use a condition to qualify that, so we can do something like count the number of times a specific user reached that line of code. We also have a method duration, which tells us how long a method took to execute. We can even measure the time it takes to perform a code block using a TicToc; that lets us narrow down the performance impact of a larger method to a specific problematic segment. In this case, I'll just use the method duration. Measurements typically have a name under which we can pipe them or log them, so I'll just give this method duration a clear name. In this case I'm just printing it out to the console, but all these measurements can be piped to StatsD and Prometheus. I'm pretty awful at DevOps, so I really don't want to demo that here, but it does work if you know how to use these tools. As you can see, the duration information is now piped into the logs and provides us with some information on the current performance of the method.
The last thing I want to talk about brings this all together, and that's tags. We can define tags to group agents together, such as production, green, blue, Ubuntu, et cetera. Every pod can be a part of multiple tags. Every action we discussed today can be applied to a tag, and as such can run on multiple machines simultaneously and asynchronously. This solves the scale problem when debugging.
So in closing, I'd like to review some of the things we discussed today. kubectl debug made debugging crashed pods possible; it also made it possible to debug a pod based on a bare-bones image. KoolKits made kubectl debug easier to use with preinstalled tools. Lightrun makes secure, read-only, real-time debugging at scale easy. Thanks for bearing with me. I hope you enjoyed the presentation.
Please feel free to ask any questions, and also feel free to write to me. Also, please check out talktotheduck.dev, where I talk about debugging in depth, and check out lightrun.com, which I think you will like a lot. If you have any questions, my email is listed here and I'll be happy to help. Thanks for watching.
...

Shai Almog

Developer Advocate @ Lightrun



