Conf42 Observability 2023 - Online

Developer Observability - the Fourth Pillar

Abstract

In the old days, production was the machine we could kick. We could all check what was going on, debug, fix, and observe. This was problematic, and DevOps, SRE, etc. took center stage. That's good. But it also has negative implications, and developer observability is here to fix that.

Summary

  • Shai Almog: Today we'll talk about developer observability. I've worked in this industry for decades, in many companies and as a consultant. I have another book about Java coming out in a couple of months; it will cover Java 8 to 21.
  • Most of us know the three pillars of observability: logs, which are mostly written by developers but ingested and managed by Ops; traces; and metrics. But they all have a few drawbacks. Logging seriously impacts application performance. Another limitation is the heavy focus of these tools on DevOps.
  • As developers, we don't access production. The developer observability backend is accessible to developers like any other observability server. The most important security feature is block lists. With these, we can block a developer from adding logs or metrics to sensitive files.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, I'm Shai Almog. Today we'll talk about developer observability, which I think should be the fourth pillar of observability. But first, let me say a couple of things about me. I wrote a few books, including one about debugging, which covers all of the topics we'll discuss here. I've worked in this industry for decades, in many companies and as a consultant. I also worked at Sun, which became Oracle, etc. You can contact me over my socials listed here and follow me on Twitter, LinkedIn, etc. Check out my YouTube channel and my blog for videos and posts in this style. This is my April book, titled Practical Debugging at Scale. Everything in today's talk is in there, and a lot more. I have another book about Java coming out in a couple of months; it will cover Java 8 to 21.

Let's jump right into the talk. Most of us know the three pillars of observability. They are logs, which are mostly written by the developer but ingested and managed by Ops; traces, which are usually pretty seamless for developers, who quite often aren't even aware they exist; and metrics, which help us measure and quantify pretty much everything but our application. They are all great and essential parts of a healthy production environment. But they all have a few drawbacks. The first is the fact that they are static. I can't add a log to a running production system. A developer needs to make that change, then go through a process to deploy the log, and that takes a while. This is true for metrics and traces as well. While traces are typically seamless, that is not universal, and it's sometimes hard to understand a trace without custom spans. This is best summed up in a visual that shows a familiar pattern: we have a problem in production, so we need to add more information, a log, a metric, or sometimes more. Then we need a pull request or a similar review process. This can take a while. In some companies we have double reviews, which can really stretch the time.
Then we go through CI, CD, and possibly a QA process beyond testing, to finally get that code into production. This whole process can take days or sometimes more if we don't have a fast CD cycle. Then, in production, we need the user to reproduce the problem. This might be a flaky problem that's hard to reproduce, so this might take a while. And when we review the result, it is often the case that we didn't log enough. We don't have the exact information that we need. We then need to go all the way back to square one and start that cycle all over again. That's the CI/CD cycle of death. I use this term a lot, and every time I describe it I get a lot of nods from the crowd. We all know the story. It's universal. We all suffer through that cycle when tracking an elusive bug. It's a deep pain in our industry, but things can be worse. Yes, worse than this painful cycle. The solution is often to log more, just in case. This seems like a sensible solution: we solve the problem of lacking data by adding a lot of data. To be fair, that does work on some occasions, but it is one of those cases where the cure is worse than the disease. Log ingestion alone can account for a third of total cloud costs. That is often much more than other costs combined. Logging seriously impacts application performance, which has a cascading effect of requiring additional resources, slowing the application, and so on. Other observability tools have a similar impact on performance and on storage. This was discussed a while back on Reddit, and I love this quote from one of the posters: a team just set a log level too high and burned through $100,000 in days. This is a very common scenario, although this case is indeed extreme. Overlogging can kill projects, companies, jobs, and the rainforest. The amount of pollution produced by overlogging and wasted resources is absolutely frightening.
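The runaway-cost scenario above can be kept in check with simple guards. As a minimal, purely illustrative sketch (this class and its behavior are my own, not any vendor's API), a logger wrapper can cap the number of messages emitted per time window and silently drop the overflow instead of flooding ingestion:

```java
// Toy rate-limited logger: emits at most maxPerWindow messages per window,
// dropping the overflow rather than blowing up ingestion costs.
public class RateLimitedLogger {
    private final long maxPerWindow;
    private final long windowMillis;
    private long windowStart;
    private long countInWindow;
    private long dropped;

    public RateLimitedLogger(long maxPerWindow, long windowMillis) {
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
        this.windowStart = System.currentTimeMillis();
    }

    // Returns true if the message was emitted, false if it was dropped.
    public synchronized boolean log(String message) {
        long now = System.currentTimeMillis();
        if (now - windowStart >= windowMillis) {  // new window: reset the budget
            windowStart = now;
            countInWindow = 0;
        }
        if (countInWindow < maxPerWindow) {
            countInWindow++;
            System.out.println(message);
            return true;
        }
        dropped++;                                 // over budget: drop silently
        return false;
    }

    public synchronized long droppedCount() { return dropped; }

    public static void main(String[] args) {
        RateLimitedLogger log = new RateLimitedLogger(3, 60_000);
        for (int i = 0; i < 10; i++) {
            log.log("request " + i);               // only the first 3 are printed
        }
        System.out.println("dropped: " + log.droppedCount());
    }
}
```

A guard like this trades completeness for cost control; it is a blunt instrument compared to the dynamic approach discussed next, but it illustrates why "log everything" is not a free choice.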
Logging more might have been worth it if it actually solved the problem, but more often than not, it doesn't really help. We can't log everything, due to privacy and security concerns. And looking over a huge mess of logs and metrics turns tracking an issue into finding a needle in a haystack. At best, it slows us down. At worst, we burn a lot of memory and a lot of storage but are still missing the valuable information we need, because, as I said, we can't truly log everything. We can't log the truly valuable data. Another limitation is the heavy focus of these tools on DevOps. Developers write logs, but they're ingested and handled by DevOps. Production issues are often handled by SRE. This disconnect means that as a developer, you need to log something that someone whose job you don't fully understand would find useful. That's problematic. Furthermore, the tooling is very much focused on the DevOps point of view. Instead of dealing with source code from the IDE, the tooling talks about agents, entry points, and other ideas that are less familiar to developers. With the shift to microservices and serverless, systems became resistant to debugging. In fact, the only way some developers can check their code is through tests. That isn't ideal. It means that when they have an unforeseen problem, they need to use tools designed for DevOps to understand the problem, and then try to create a test case for it. This is a major step backwards. Developers need observability just as much as DevOps. While the vast majority of production problems can be handled by Ops, some of the hardest problems to fix are bugs in the code. I'm not talking about crashes; that's where most of us go automatically when thinking about production problems. Many production problems are things like a cache miss in the code, or a stale cache, which means an item is missing from a listing.
It only happens in production, and we have no clue how to fix it. Existing observability is usually very opaque in such situations. Ops don't even know much about such issues; they might flush the cache, but that's a blunt instrument and a poor workaround. Developers need their own observability, but it needs to be different from today's observability. The first principle of developer observability is to meet developers where they are. Working in the IDE isn't a requirement; some of these tools work in the browser, which is also fine as long as they use terms and environments that are familiar to developers. In such a tool, we would inject a log at a line of code. We discuss metrics in terms of specific lines of code, not in terms of spans, entry points, and so on. Ideally this happens directly in the IDE, since that's where developers spend their time. But the bigger thing is the ability to inject observability right into production code. That means I can add a new log, metric, or snapshot without going through the whole cycle like before. Remember this diagram? It is pretty complex. With developer observability, we can simplify it considerably: we can remove two stages from the process. Developers can instantly inject a log or a metric into production without any changes to the code itself. We can then reproduce the problem while coordinating with the end user experiencing it. I can do that with a customer while they reproduce the issue. The great thing is, if I don't have all the information while they are still on the line, I can immediately add another log or metric and ask them to try again. This completely changes the way we look at production. I used a very loaded word there: injecting. In fact, when I was working for a developer observability company, I was prohibited from uttering the I word. It's a scary word. It means we change code in production, and the typical association we have with that word is very negative.
Injecting bugs, injecting changes, or even injecting a security vulnerability. I get exactly why my employer didn't want to be associated with that word, but this isn't the only tool that uses injection to implement functionality, so it's not the end of the world. To be fair, though, security is a major concern. Most of the tools in the field have similar approaches to it. A key aspect of the security is the management server. As developers, we don't access production; that is the job of DevOps. It's segregated. The developer observability backend is accessible to the developers like any other observability server, and the actual production system communicates only with that server. This is pretty familiar if you've worked with other observability tools, but it is very different from other developer approaches, such as remote debugging. It means that even if there is a weakness in the injection code, it would be very hard to exploit, as even the developers don't have direct access to the backend servers. Furthermore, some developer observability solutions enclose everything in a sandbox, which executes it all in a controlled environment. Let's say I add a log and it takes up too much CPU, or I add a conditional metric that tries to modify the application state in the conditional statement. Some developer observability tools will detect both of these scenarios and limit the amount of resources or block execution entirely. Since all access to the system goes through a backend server, it's a trivial matter to keep an administrative log. That means we can track every operation performed by any user; there is always a record. If a user tries to steal private information, it will be logged and can be used as evidence. Some information is problematic, such as credit card numbers. This is called personally identifiable information, or PII for short. We must remove such information from logs, sometimes by law and sometimes by regulation.
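To make the safety controls just described concrete, here is a toy sketch of agent-side checks of this general kind: a block list of classes that must never be instrumented, redaction of credit-card-like numbers, and a per-logpoint time budget that pauses the logpoint once exceeded. Every name here is hypothetical; real tools implement these checks internally and differently.

```java
import java.util.Set;
import java.util.regex.Pattern;

// Toy sketch of agent-side safety checks for injected logpoints:
// block list, PII redaction, and a time budget that pauses a logpoint.
public class LogpointGuard {
    // Hypothetical sensitive class a developer is never allowed to instrument.
    private static final Set<String> BLOCKED = Set.of("com.example.auth.LoginService");

    // Crude credit-card pattern: 13-16 digits, optionally space/dash separated.
    private static final Pattern CARD = Pattern.compile("\\b(?:\\d[ -]?){13,16}\\b");

    private final long budgetNanos;  // total evaluation time allowed for this logpoint
    private long spentNanos;
    private boolean paused;

    public LogpointGuard(long budgetNanos) { this.budgetNanos = budgetNanos; }

    // Block-list check performed before a logpoint is ever installed.
    public static boolean allowed(String className) {
        return !BLOCKED.contains(className);
    }

    // PII redaction applied to every message before it leaves the process.
    public static String redact(String message) {
        return CARD.matcher(message).replaceAll("****");
    }

    // Record how long one evaluation took; pause the logpoint once over budget.
    public void charge(long nanos) {
        spentNanos += nanos;
        if (spentNanos > budgetNanos) paused = true;
    }

    public boolean isPaused() { return paused; }

    public static void main(String[] args) {
        System.out.println(allowed("com.example.auth.LoginService")); // blocked
        System.out.println(redact("card 4111 1111 1111 1111"));       // redacted
    }
}
```

The important design point is that these checks run on the agent/server side, where the developer cannot bypass them, which is exactly why routing everything through a management server matters.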
Ideally we would catch PII like that in code review, but if a log is injected, it might accidentally print something that shouldn't be printed. We can recognize those patterns and implicitly block them from being logged. This is done with the PII redaction functionality supported by some tools. But the most important feature for security is block lists. Imagine a disgruntled developer within our organization. That developer can add a log to the user login code and print all the usernames and passwords. By the time we notice it in the administration log, he might be in a different country with all of the ill-gotten gains. We can stop that with a block list. With it, we can block a developer from adding logs or metrics to a specific set of sensitive files, classes, or packages.

I think we've had enough theory. Let's do a short demo of one such product, Lightrun. As a disclaimer: I used to work there, but that was last year; I no longer do. On the left side you can see IntelliJ IDEA, my IDE of choice. On the right side I have an application that counts prime numbers, running on a remote server. We can see the console of that demo. The application doesn't print any logs as it does the counting, which makes it hard to debug if something doesn't work there. In the middle we can see the currently running agents, which are the server instances. We also see tags above them. Tags let us apply an action to a group of server processes. If we have 1,000 servers, we can assign the tag "production" to 500 of them and then perform an operation on all 500 by performing it on the tag. A server can have multiple tag designations, such as East Coast, Ubuntu 20, green, et cetera. This effectively solves the scale problem typical debuggers have: we can apply observability operations to multiple servers. Here we have only one tag and one server process, because this is a demo and I didn't want to crowd it. I can add a new log by right-clicking a line and adding it.
I ask it to log the value of the variable i, and it will just print it to the application log. This will fit in order with the other logs, so if I have a log in the code, my added log will appear as if it was written in the code next to it. They will all get ingested into services like Elastic seamlessly, or you can pipe them locally to the IDE. So this plays very nicely with existing observability while solving the fact that traditional observability isn't dynamic enough. The tools complement each other; they don't replace one another. Notice that I can include complex expressions like method invocations, et cetera, but Lightrun enforces that they are all read-only. Some developer observability tools do that, while others don't. But the thing I want to focus on is this: notice the log took too much CPU, and Lightrun paused logging for a bit so it won't destroy the server's performance. Logging is restored automatically a bit later, once the CPU is no longer depleted. This is the sandbox I was talking about earlier. With developer observability, we can add debug information in areas where that normally doesn't make sense. Since the information will be removed once we're done, it isn't a big deal. A log that might be too expensive, because it would blow up ingestion costs on a line that is invoked very frequently, can be added for a few minutes and then removed. That isn't a problem. But the most important aspect of developer observability is insight at a developer level. DevOps know the features that are used frequently, but they can't tell if a specific method or block of code is reached. With developer observability, we can detect if a block of code is used and get applicable statistics. If we're considering a code change, we can evaluate the risk and reward beforehand by adding a metric to that block. Developer observability is a new tool for a new audience, but it's still an observability solution.
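The "is this block of code ever reached?" question boils down to a counter attached to a line of code. As a minimal illustrative sketch (my own names, not how any specific tool implements it), an injected reach counter could look like this:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Toy "was this block reached?" counter of the kind a developer could
// inject at a line of code and later read back as a metric.
public class ReachCounter {
    private static final Map<String, LongAdder> COUNTS = new ConcurrentHashMap<>();

    // Call at the top of the block under investigation.
    public static void hit(String label) {
        COUNTS.computeIfAbsent(label, k -> new LongAdder()).increment();
    }

    // Read the metric: 0 means the block was never reached.
    public static long count(String label) {
        LongAdder adder = COUNTS.get(label);
        return adder == null ? 0 : adder.sum();
    }

    public static void main(String[] args) {
        hit("legacyPath");
        hit("legacyPath");
        System.out.println("legacyPath reached " + count("legacyPath") + " times");
    }
}
```

Before refactoring a suspect code path, a developer could inject a counter like this for a few minutes of production traffic: a count of zero is strong evidence the block is dead, while a large count quantifies the risk of changing it.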
First and foremost, when you inject a metric, it integrates with your existing dashboards. When you inject a log, it integrates with your ingested logs. Developer observability is about making the crucial benefits of observability accessible to a new crowd, a crowd of developers, which is the most important goal. When I give talks to developers, I often ask them about observability, and a surprisingly small number of them are actually using observability tools on a day-to-day basis. They hear about observability solutions, they know about them, but they don't truly use them. Developer observability is a way to open the world of observability to the developer community at large. And this is the time at which developers truly need these sorts of solutions: with the migration to microservices and serverless, they are effectively blinded by these new architectures, unlike before. Thanks for bearing with me. I hope you enjoyed the presentation. Also, check out debugagent.com, my book, and my YouTube channel, where I have many tutorials on these sorts of subjects. Thank you.

Shai Almog

Founder @ debugagent.com



