Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
              Hi everyone. Today we're going to talk about debugging node
            
            
            
              js applications in production with Lightrun.
            
            
            
              But first, a few things about me I was a consultant for over
            
            
            
              a decade. I worked at sun, founded a couple of companies,
            
            
            
              wrote a couple of books. I wrote a lot of open source
            
            
            
              code, and currently work as developer advocate for Lightrun.
            
            
            
              My email and Twitter accounts are listed here, so feel free to
            
            
            
              write to me. I have a new blog that talks about
            
            
            
              debugging and production issues at talktotheduck dev.
            
            
            
              It would be great if you check it out and let me know what you
            
            
            
              think. I want to put my heart out there.
            
            
            
              I love apms. They are absolutely wonderful.
            
            
            
              I'm old enough to remember a time where
            
            
            
              they weren't and weren't around and I'm just
            
            
            
              so happy that we moved past that.
            
            
            
              These is absolutely amazing. The dashboards
            
            
            
              and the details. You get these
            
            
            
              great dashboard with just everything you need.
            
            
            
              Amazing. We're truly at the golden age of monitoring
            
            
            
              hell. When I started, we used
            
            
            
              to literally monitor the server by kicking it and listening to see if the hard
            
            
            
              drive was spinning properly. These vendors
            
            
            
              provide amazing dashboards that let you see every stat
            
            
            
              that matters in your app. Monitoring is absolutely
            
            
            
              amazing. And if you don't use these tools on your production servers,
            
            
            
              well, you're missing out and you might have issues you're
            
            
            
              unaware of. In fact, a lot of these issues we run
            
            
            
              into start when we notice an anomaly in the dashboard.
            
            
            
              We see a spike in failures or something
            
            
            
              that performs a bit too slow. The APM
            
            
            
              is amazing in showing those hiccups, but this is where
            
            
            
              it stops. It can tell us that a web service performed
            
            
            
              badly or failed. It can't tell us why.
            
            
            
              It can't point at a line of code that
            
            
            
              we need. So let's stop for a second and talk about a
            
            
            
              different line altogether. This line on
            
            
            
              one side we have developers,
            
            
            
              on the other we have ops or DevOps.
            
            
            
              This is a line we've had for a long time.
            
            
            
              It's something we drew out of necessity because when developers
            
            
            
              were given access to production, well,
            
            
            
              I don't want to be too dramatic, but when developers
            
            
            
              got access to production, it didn't end well.
            
            
            
              This was literally the situation not too long ago.
            
            
            
              Yes, we had sysadmins, but the whole
            
            
            
              process used to be a mess. That was no good.
            
            
            
              But we need a better solution than this hard separation
            
            
            
              because the ops guys don't necessarily know how
            
            
            
              to solve the problems made by the developers.
            
            
            
              These know how to solve ops problems because
            
            
            
              this is really what those monitoring tools are all about.
            
            
            
              They're the bat signal. They come up and
            
            
            
              we're Batman or Batwoman or Bat person.
            
            
            
              All of us damn heroes in these cases,
            
            
            
              we step up to deal with the bugs and we're the last
            
            
            
              line of defense against their villainy.
            
            
            
              DevOps have enough of villains of their own
            
            
            
              that they need to deal with for this case.
            
            
            
              They need us, but, well, we're coder.
            
            
            
              But people, it's kind of the same thing, just without the six pack
            
            
            
              abs, too much baked goods. And in the company kitchen over
            
            
            
              here, coder Batman needs to know these the
            
            
            
              crime or bugs are happening in the code.
            
            
            
              So those dashboards, these point us in the general direction
            
            
            
              these the crime is, but not at
            
            
            
              the exact specific location where we need to fight crime.
            
            
            
              And here's where things get hard. We start digging into
            
            
            
              logs, trying to find the problem. The dashboard sent us into a
            
            
            
              general direction like performance problem or high error rates.
            
            
            
              But now we need to jump into logs and hope that we can
            
            
            
              find something there that will somehow explain the problem we're
            
            
            
              seeing. That's like going from a jet
            
            
            
              engine back to the Stone age tools. Now there are
            
            
            
              many log processing platforms that do amazing job at
            
            
            
              processing these logs and finding these gold within these.
            
            
            
              But even then it's a needle in a haystack. And that's
            
            
            
              the good outcome where a log is already
            
            
            
              there waiting for us. But obviously we can't have logging
            
            
            
              all over the place. Our billing would go through the roof and our performance will.
            
            
            
              Well, you know, we're stuck
            
            
            
              in these loop. Add a new log, go through CI
            
            
            
              CD which includes the QA cycle and everything. This can
            
            
            
              take hours. Then reproduce the issue in production
            
            
            
              server with your fingers crossed and try to analyze
            
            
            
              what went on. Hopefully you found the issue because
            
            
            
              if not, it's effectively rinse, repeat for the whole
            
            
            
              process. In the meantime, you still have a bug in production and
            
            
            
              developers are crashing their time. This is so
            
            
            
              bad. Some developers even advocate extreme solutions.
            
            
            
              Just build everything in these cloud as servers functions to
            
            
            
              avoid that inconsistency between local and remote environments.
            
            
            
              Gareth, who quoted this
            
            
            
              but a lot of faith in isolated unit tests for his functions.
            
            
            
              Having written a lot of code, I have very limited faith
            
            
            
              in my tests. Then there's the other
            
            
            
              extreme that sees even microservices as
            
            
            
              a way to get more money. I dont even need to ask Vlad what
            
            
            
              he thinks about lambda. My personal opinion on this is
            
            
            
              somewhere in the middle, but closer to Vlad, probably because I'm old.
            
            
            
              I do think we need microservices and serverless in the right circumstances,
            
            
            
              though. But regardless of your opinions here,
            
            
            
              debugging in production is a hard and risky
            
            
            
              process, even with fast deployments,
            
            
            
              especially if you have a team larger than one person.
            
            
            
              And Garth is correct when he says that production and local
            
            
            
              development can never be quite the same.
            
            
            
              There are just too many variables and too much noise. We can't
            
            
            
              possibly avoid production only bugs.
            
            
            
              There just has to be a better way. It's 2020
            
            
            
              ones, and logs are the way we solve bugs in these day
            
            
            
              and age. Don't get me wrong, I love logs, and today's
            
            
            
              logs are totally different from what we had even ten years ago.
            
            
            
              But you need to know about your problems in advance for a log to
            
            
            
              work. The problem is, I'm not clairvoyant.
            
            
            
              When I write code, I can't tell what bugs or
            
            
            
              problems the code will have before the code is written.
            
            
            
              I'm in the same boat as you are. The bug doesn't
            
            
            
              exist yet, so I'm faced with the dilemma of whether
            
            
            
              to log something. This is a bit like the
            
            
            
              dilemma of writing comments. Does it make the code look
            
            
            
              noisy and stupid, or will I find this
            
            
            
              useful at 02:00 a.m. When everything isn't working and I want to rip out
            
            
            
              a few strands of hair I still have left because of this damn
            
            
            
              production problem? I'm sure you've all been in that
            
            
            
              situation. I've been through it quite a lot.
            
            
            
              As you can see on my head.
            
            
            
              Debuggers are amazing. They let
            
            
            
              us set breakpoints, see callbacks, inspect variables,
            
            
            
              and more. If only we could do the same for production
            
            
            
              systems. But debuggers weren't designed for this.
            
            
            
              They're very insecure when debugging remotely. They can
            
            
            
              block your server while sending debug commands remotely.
            
            
            
              A small mistake, such as an expensive condition can literally
            
            
            
              destroy your server. I might be repeating an urban legend
            
            
            
              here, but 20 or so years ago I heard a story about a guy
            
            
            
              who was debugging a railed system located on a cliff.
            
            
            
              He stopped at a breakpoint during debugging, and the multimillion
            
            
            
              dollar hardware fell into the sea because it didn't
            
            
            
              receive the stop command. Again, I don't know if it's a true
            
            
            
              story, but it's plausible. Debuggers weren't
            
            
            
              really designed for those situations.
            
            
            
              Worse, debuggers are limited to one server. If you
            
            
            
              have a cluster with multiple machines, the problem can manifest on one
            
            
            
              machine always or might manifest on a random machine.
            
            
            
              We can't rely on pure luck.
            
            
            
              If I have multiple servers with multiple languages,
            
            
            
              platforms crossing from one to another with these debugger?
            
            
            
              Well it's possible in theory,
            
            
            
              but I can't even imagine it in reality.
            
            
            
              Let's take the Batman metaphor all the way. We need
            
            
            
              a team up. We need some help on the servers,
            
            
            
              especially in a clustered polyglot environment where the issues
            
            
            
              can appear on one server, move to the next, etc.
            
            
            
              So you remember these slide. We need some way to get
            
            
            
              through that line. We need a way to connect the server
            
            
            
              with these server and debug it. What if you could
            
            
            
              connect your servers to a debugger agent?
            
            
            
              That would make sure you dont overload the server and don't make
            
            
            
              a mistake, like setting a breakpoint or something like that.
            
            
            
              That's what Lightrun does. It lets you connect securely
            
            
            
              through your own server so you physically don't
            
            
            
              have access to production. That means that lovely
            
            
            
              line we have with DevOps is maintained.
            
            
            
              DevOps have their control and know you can't
            
            
            
              break production. Developers still get insight and can investigate
            
            
            
              an issue to avoid the redeploy cycle.
            
            
            
              There's a safety net in place protecting production
            
            
            
              from harm, intentional or otherwise.
            
            
            
              So how does this thing work? Well, we install the plugin
            
            
            
              in our IDE and it lets us interact
            
            
            
              with the lightrun server. We can insert an action
            
            
            
              which can be a log or a snapshot, which I'll show
            
            
            
              soon enough. Right now Lightrun supports Webstorm,
            
            
            
              intellij idea, etc. But there's a visual studio
            
            
            
              code plugin that will look very similar to what you're seeing here.
            
            
            
              It will launch soon. Notice there's also a
            
            
            
              command line option which I won't discuss here. The management
            
            
            
              server sends everything to the agent which is
            
            
            
              installed on your server. That means there's clean separation
            
            
            
              between developer and production.
            
            
            
              DevOps still has that guarding line we're talking about,
            
            
            
              and we don't have direct access to production. That means
            
            
            
              no danger to the running production servers from careless developers.
            
            
            
              Well, like myself. The agent
            
            
            
              is just a small runtime you add
            
            
            
              to your servers. It's very low overhead and it implements
            
            
            
              the debugging logic. Finally, everything is piped through
            
            
            
              the server back to your ide directly. So as a
            
            
            
              developer you can keep working in the IDE without
            
            
            
              leaving your comfort zone. So let's go over a simple
            
            
            
              hello world demo so we're all on the same page.
            
            
            
              This is a simple express typescript hello world app.
            
            
            
              It should be very familiar to anyone using nodejs.
            
            
            
              I opened it here in webstorm and I want to debug
            
            
            
              that app in the same way I would debug a real world app
            
            
            
              in production. To do that, we need to sign
            
            
            
              up on Lightrun.com and set up an agent and
            
            
            
              IDE plugin which will let us debug this application.
            
            
            
              Integrating lightrun requires two steps. The first
            
            
            
              is very easy. Just run NPM install
            
            
            
              Lightrun to add lightrun from NPM.
            
            
            
              The second stage is adding this code
            
            
            
              to the top of your main file. Notice that you
            
            
            
              need these custom code from Lightrun, which includes the correct lightrun
            
            
            
              secret value. Now we just log in from the
            
            
            
              Lightrun id plugin and we're in business.
            
            
            
              Back in Express TypeScript hello World app,
            
            
            
              we can add the code snippet to the top of the file and
            
            
            
              run the server. Here's the hello world
            
            
            
              demo running. Let's go to the code and add a
            
            
            
              snapshot for the main page request. These snapshot
            
            
            
              will be hit when we load the main page.
            
            
            
              Now let's add a log to the method covering
            
            
            
              the hello URL. This is injected without
            
            
            
              restarting the server or anything like that.
            
            
            
              A snapshot is like a breakpoint. It gives us the
            
            
            
              stacks, traces and variable values, but it doesn't stop the
            
            
            
              debugger in place. So a production server
            
            
            
              dont block on snapshot. The camera icon we
            
            
            
              see on the left is the line on which
            
            
            
              we set up the snapshot.
            
            
            
              Let's reload now back in the ide I
            
            
            
              can see the stacks trace popped up.
            
            
            
              We only have one method on these course stack, which isn't
            
            
            
              as valuable as usual, but we can see the variables and we
            
            
            
              can see that the stacks was hit.
            
            
            
              Let's go back to the browser and check the hello URL.
            
            
            
              Notice it printed world.
            
            
            
              Now in the ide we can see these new log we injected
            
            
            
              is instantly these. Notice the response
            
            
            
              in the code for world which is what we indeed saw
            
            
            
              in the browser. I want a better demo,
            
            
            
              but trying to run and explain a truly complex environment
            
            
            
              in a conference session will probably confuse all
            
            
            
              of us more than it will teach, myself included.
            
            
            
              So I'm doing a relatively simple demo
            
            
            
              in this app. All the front end logic is handled from the web
            
            
            
              to these start of the back end with node js
            
            
            
              and JavaScript. In the previous sample I showed off support
            
            
            
              for typescript, so using JavaScript here makes sense.
            
            
            
              Java is pretty common in the enterprise backend, so in this scenario that
            
            
            
              makes sense. We're storing the data using spring
            
            
            
              boot and the front end and the portion
            
            
            
              of the backend is done in node JS.
            
            
            
              So now that we got this out of the way, let's go to
            
            
            
              the simple to do app used by so many demos in the
            
            
            
              past. Now the funny thing is that I was going to
            
            
            
              demonstrate something completely different and ran into this
            
            
            
              bug by accident as I was working on the code.
            
            
            
              Well, opportunity knocks. I can't clear completed
            
            
            
              to do items in the app. Well,
            
            
            
              it's time to pull out the debugger and figure out what's going
            
            
            
              on. The only problem is that these
            
            
            
              are two servers running on remote containers in production.
            
            
            
              I can't just debug them. So let's look at
            
            
            
              the code and see if I can spot some problem.
            
            
            
              This is the method that implements clearcompleted in the node
            
            
            
              servers. I assume you all know how to debug the client,
            
            
            
              so I didn't really show that part. But even if we don't,
            
            
            
              the technique would instantly show if the problem
            
            
            
              is in these client or not. You'll notice
            
            
            
              that we don't have any logging here. That would have been really nice,
            
            
            
              but obviously we can put logs everywhere.
            
            
            
              Besides the huge cost of
            
            
            
              log at scale, these also costs and performance in the individual
            
            
            
              machine. So I'm not a huge fan
            
            
            
              of overlogging. So I already installed Lightrun
            
            
            
              on both servers here. I also have Lightrun
            
            
            
              installed in my IDE, so I can just add a snapshot
            
            
            
              to the method to see if it's even reached.
            
            
            
              As you may recall, a snapshot is like a breakpoint.
            
            
            
              It gives us stack track and these variable values, but it
            
            
            
              doesn't stop the debugger in place.
            
            
            
              But we also want to see the end
            
            
            
              of the method was if the end of the method was reached, did we
            
            
            
              return from this method? If only there was a log.
            
            
            
              Well, we can inject new logs in runtime.
            
            
            
              We can just add a new log and type in the details we want in
            
            
            
              that log and bam, we can see that log
            
            
            
              right here. The cool thing is that if you
            
            
            
              have built in logs, these log will be integrated
            
            
            
              with the rest of those logs so you can see it in context.
            
            
            
              In this case, I selected that the log will pipe
            
            
            
              into the IDE console too, which is more convenient for
            
            
            
              demos and quick tests. But when you do this in real life,
            
            
            
              viewing the log as if it was written in
            
            
            
              the code is a game changer
            
            
            
              that can literally save you millions. I know
            
            
            
              we say that about a lot of features, but the ability to
            
            
            
              see these things in context can seriously shorten the amount of
            
            
            
              time spent debugging. It also means you can eliminate
            
            
            
              a lot of your redundant debugging overhead and
            
            
            
              reduce your logs cloud storage cases significantly.
            
            
            
              The savings in a large company can be very
            
            
            
              significant. Now let's go back into the browser
            
            
            
              and press that dysfunctional button. Then go back into
            
            
            
              the ide and wait a couple of seconds for the
            
            
            
              whole loopback of events to reach us.
            
            
            
              Now we can see the snapshot was hit.
            
            
            
              That means the calls reached that method,
            
            
            
              which already eliminates a lot of potential problem.
            
            
            
              Now we can look in the stack trace on the
            
            
            
              left to see who invoked that method, or look within
            
            
            
              variables on the right to see if there is an interesting variable
            
            
            
              we missed. We should also check out the log console
            
            
            
              where we can see that the method reached the log line,
            
            
            
              so we didn't get an exception or some
            
            
            
              other problem. This is odd. Let's check the springboot
            
            
            
              servers. This is the Springboot
            
            
            
              server code. Notice this is Java code representing a
            
            
            
              rest request for the clear completed API.
            
            
            
              It deletes all the elements that are marked completed, then returns
            
            
            
              the to do items that remain. I want to add a log
            
            
            
              here just like I did in the node JS code, but in this case
            
            
            
              I'm going to do something a bit different. I'm going to add
            
            
            
              an expression to the log. I'm going to print out the number of items
            
            
            
              in the response variable, which is of type list.
            
            
            
              That means I'll literally invoke the size method on
            
            
            
              that object within the curly braces here.
            
            
            
              Notice you could do the same exact thing in these node JS project
            
            
            
              using JavaScript code for the expression.
            
            
            
              Now you might be thinking why not print the whole
            
            
            
              thing? Why not print the response object?
            
            
            
              Well, it won't work. Well,
            
            
            
              there are two reasons. First, it might output tools much text
            
            
            
              and get cropped. But the more important aspect is evaluation
            
            
            
              speed. When we're running in production environments,
            
            
            
              changes can be very dangerous. If we
            
            
            
              evaluate these expressions dynamically, it might take too much
            
            
            
              cpu and crashing our server. So lightrun limits the
            
            
            
              amount of operations you can do in an expression.
            
            
            
              If I'll print that list, Lightrun might just eliminate my log because
            
            
            
              of cpu usage. So the code here needs to
            
            
            
              be very efficient.
            
            
            
              So let's press ok to add the new
            
            
            
              log. We'll then go to the browser to trigger
            
            
            
              the event. I'll press clear completed again and
            
            
            
              then we'll go back to the server. Back in
            
            
            
              the servers we can see the log showing and that's weird.
            
            
            
              The API is showing that we're returning two elements.
            
            
            
              That doesn't make sense, but this is the core idea of
            
            
            
              debugging. Validate your assumptions.
            
            
            
              I don't get that. I want to check the values,
            
            
            
              so for that I need to edit the log entry and instead of
            
            
            
              listing the size of the list. I want to print out these content of the
            
            
            
              first element in the list using the got zero method call
            
            
            
              pressing ok deletes the current log and creates a new log
            
            
            
              object in its place. We now need
            
            
            
              to go back to the web and clear the completed elements again,
            
            
            
              and after a short wait the log prints out the result.
            
            
            
              It seems that this is the bug.
            
            
            
              Looking at the values for the object, it's obvious that the completed
            
            
            
              field is false on the servers. That's why
            
            
            
              it wasn't cleared. With that in hand,
            
            
            
              I looked at the code that toggles the completed field on the nodejs
            
            
            
              server and oops. It seems that
            
            
            
              code was never written. These completed field
            
            
            
              was only set on initialization. So that's
            
            
            
              the bug. That's very cool, but that's
            
            
            
              just one type of production problem. For a different production
            
            
            
              problem. Let's go back to the slides and look at ones
            
            
            
              of my first slides. This is
            
            
            
              the dashboard I found off of Google. It isn't mine,
            
            
            
              but you probably see a couple of interesting things here.
            
            
            
              I mean, what's this? What the
            
            
            
              hell happened here? If you use a dashboard, you probably see
            
            
            
              stuff like that all the time and ask yourself the exact
            
            
            
              same question. Now apms are pretty sweet.
            
            
            
              These will usually tell you which entry point caused a performance issue,
            
            
            
              which cpu worked too hard, where memory was used up,
            
            
            
              and all of those things. But what they won't tell
            
            
            
              you is which block of code took too long.
            
            
            
              Let's say we have an entry point that takes too long to
            
            
            
              perform. Now what?
            
            
            
              We need to review deep calls all over the place to
            
            
            
              try and pinpoint a specific operation.
            
            
            
              This might be something we can't reproduce locally.
            
            
            
              Let's go back into the IDE briefly and go over
            
            
            
              the options here in the metrics menu.
            
            
            
              We can add measurements that give us fine grained benchmarks
            
            
            
              on the code level. We can count the amount of time line of
            
            
            
              code times a line of code was hit. We can measure the duration of
            
            
            
              a block of code or a method. We can even add more complex
            
            
            
              metrics based on customer evaluation. The result
            
            
            
              of metrics can be periodically logged or piped
            
            
            
              into Prometheus statsD, etc. This provides
            
            
            
              that missing piece that we need to see beyond what the APM
            
            
            
              provides us. That way we won't spend ages optimizing the
            
            
            
              wrong area because the production information didn't include
            
            
            
              all of the lower level information. Please notice
            
            
            
              that at this moment measurements are only available for JVM languages,
            
            
            
              but they will be coming to nodejs and Python in the update.
            
            
            
              That's coming soon. In closing,
            
            
            
              I'd like to go over what we discussed here and a few things we
            
            
            
              didn't have time for. Lightrun supports JVM
            
            
            
              languages like Java, Kotlin, Scala, etc. It supports
            
            
            
              nodejs for both Javascript and typescript and
            
            
            
              Python, even complex stuff like airflow.
            
            
            
              When we add actions, conditions run within a sandbox
            
            
            
              so they don't take up cpu or crash the system.
            
            
            
              This all happens without networking, so something like a networking
            
            
            
              hiccup won't crash the server. Security is
            
            
            
              especially crucial with solutions like this.
            
            
            
              One of the core concepts is that the server
            
            
            
              queries information, not the other way around as
            
            
            
              you would see with solutions such as JDWP etc.
            
            
            
              This means operations are atomic and these server can
            
            
            
              be hidden behind firewalls even from the rest of the organization.
            
            
            
              PIi production lets us define conditions
            
            
            
              that would obscure patterns in the logs. So if a user could
            
            
            
              print out a credit card number by mistake, you can define a
            
            
            
              rule that would block that. This way the bad
            
            
            
              data won't make it into your logs and won't expose you
            
            
            
              to liability. Blacklisting bets us block
            
            
            
              specific classes, methods or files. This means
            
            
            
              you can block developers in your organization from debugging
            
            
            
              specific files. This means a developer won't be able
            
            
            
              to put a snapshot or a log
            
            
            
              in a place where a password, for instance, might be
            
            
            
              available to steal user credentials or stuff like that.
            
            
            
              Besides these sandbox, I'd also like to mention that Lightrun
            
            
            
              is very efficient and in our benchmarks has almost no runtime
            
            
            
              impact. When it isn't used, it has very small
            
            
            
              impact even with the multiple actions in place.
            
            
            
              Finally, Lightrun can be used from the cloud or using
            
            
            
              an on premise install. It works with any deployment
            
            
            
              you might have, whether cloud based, container based on premise,
            
            
            
              microservices, serverless, etc.
            
            
            
              Thanks for bearing with me. I hope you enjoyed the presentation.
            
            
            
              Please feel free to ask questions and also feel free to write to
            
            
            
              me. Also, please check out talktothoduct dev
            
            
            
              where I talk about debugging in depth. And check out
            
            
            
              lightrun.com which I think you guys will like a lot.
            
            
            
              If you have any questions, my email is listed here and I'll be happy
            
            
            
              to help. Thank you.