Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello. My name is Roni Dover, and I'm extremely excited to
be talking to you today about OpenTelemetry and coding with the
lights on. The reason I'm so excited about
this technology and the possibilities it presents is
because I think it represents a very big
change in how we write code.
Just in regard to my personal
background, I've been a developer for over
20 years. I've been a product manager,
and I kind of oscillated between the two roles. It was
very difficult for me to stop thinking about design when
I was developing, or to leave the code behind when I was doing
feature design. However, throughout that time,
I was really fascinated by technologies that really change the way we
code. And I think to some extent, we saw that when testing
started becoming more widespread, asynchronous programming,
event sourcing, there are a lot of technologies that kind of changed how
code got written, how we thought about code.
And I think observability today, and in particular OpenTelemetry
and continuous feedback, represent just such a
change. And I hope that by the end of this presentation, I will at least
kind of have convinced you that these things are really worth looking
at right now. So, to illustrate that, I want to
start with the story of one of my developers. This is
Bill. And Bill has been tasked with a very
common task for developers, which is take a feature,
develop it all the way through, and then deploy it into production,
something that is fairly routine for developers today.
Now, Bill's job used to be extremely
simple, and this was kind of the situation when I
got started around 20 years ago, which was, you build a
feature, you design the feature, you develop the feature, you wrap it up
real nice, and then you take it to the guys across the hall
in the QA department, and they start looking at it. There would be
some perhaps philosophical arguments about what's a feature,
what's a bug? And then eventually it gets rolled on into production,
and you would probably never hear about that feature again unless there is
some bug, in which case you may be called to correct
it. But of course, that's no longer the case.
So as teams become more cross-functional, as developers
started taking on more responsibility,
Bill, as well as the rest of the team, started assuming
more ownership. So now a part of Bill's job is actually
to also write the tests, to validate that there's a test
plan, integration tests, load tests, other types of tests.
Often, Bill needs to worry about how to deploy his
service. So do I deploy it using Helm or
Terraform? What's the RAM requirement? These things that used to be
kind of the sole responsibility
of the DevOps or IT people are now
everyone's to care about and to
know, because eventually there's going to be an issue, and the person that might
need to investigate that issue could be Bill.
So a lot has changed. But the
question is, let's say that, you know,
he's a top-notch developer, he's using the best tools
available, he has the best CI/CD pipeline
in the world, and he's just released the feature into production.
So the question is, what happens then?
What happens, or what should Bill do, the moment that
he has finished rolling his feature into production?
And the answer to that question is also interesting,
because my expectation from Bill was to
ask a lot of questions. I'm very kind of evidence based in how
I like to think about things. So my first
instinct would be, well, check whether your code actually worked.
Did it work well, is anyone using it?
I've witnessed enough horror stories where meticulously
written code was perhaps just a
few bad if statements away from actually getting
executed in production. So did it actually get run?
Did it change anything for the good or for the bad?
You just changed the data access layer and added something.
Did it actually make things better for everyone?
So that is my expectation.
However, 99% of the time, what would
happen in this situation is that Bill would
move on to the next feature.
And this is something that I tried changing.
So to figure out why this was happening,
I went back to basics, and here
is a diagram that I pulled from an online source showing
the DevOps loop that by now is completely overused,
but it's still a good model to kind of think about the
process that releases go through.
And I would challenge you to look at this particular diagram,
and you may notice that something
is a bit off about it. So this
is a pretty accurate representation of the different stages of development,
from building to continuous integration, deployment,
operations and so on. But there is one segment here
that actually appears in the diagram that only has one
tool associated with it, which is Salesforce for some reason,
and that is continuous feedback.
So although we have plenty of tools to
take our code across the chasm and into production,
to operate it in production and so on, we have very few
tools, to none in this diagram, that
can actually take the information back from production
and make it into something useful that we can use in development.
So to think about Bill in this sense,
Bill has a lot of feedback when coding
in his local environment. He has at least some limited feedback
from testing. Limited, I say, because tests are usually
more kind of a red, green, black, white kind of a thing, rather than a
very qualitative way to measure improvements, let's say.
But it's still some feedback. But there is almost
no feedback that he can use in his day-to-day
work from the production environment.
So I thought to myself, well, if only we had
access to instant objective data about
the code. Like, if only it was kind of a non-issue,
that whenever Bill would want,
he would just glance over the edge
and kind of see exactly how this code is working.
And this is kind of the perfect segue to talk about OpenTelemetry.
So OpenTelemetry is a spec, it's a standard.
It defines how to do observability. And there are lots of implementations
of that spec for different languages, platforms and so on.
And OpenTelemetry, in my humble opinion, is not
important because it is something amazing or revolutionary
in terms of the technology. Although it's a great technology,
it is important first and foremost
because everyone agrees on it. So the
fact that there is a consensus around OpenTelemetry,
that we're no longer talking about kind of this fragmented landscape of
different proprietary agents, protocols,
instrumentations and so on by different vendors,
makes it very easy for two things to happen. First, for an ecosystem
to emerge, as often happens with open source tools.
So suddenly there are a lot of tools that are kind of coming together
and providing the value-add of how we can
make this data actually useful, how we can analyze
it, and how we can take that data and
make sure that Bill can use it.
The other aspect is in terms of coverage. So if I'm
a platform tool designer, or if
I'm a maintainer of, I don't know, some major library,
it doesn't matter if it's a backend server platform
or a web server or anything else,
the choice for me is very easy now. I don't need to worry about,
well, should I allow instrumentation or enable instrumentation for
Datadog or Splunk or whoever it is;
I just support OpenTelemetry.
And as a result of that, what we're seeing is that first
of all, many programming languages actually
integrate with it in a very, very easy way. I think .NET
even made it a part of the standard library just to make it
much easier for people to use. But also, it doesn't
matter kind of which tool you're using or what
type of project, there is a very good
chance that you'll find that there is already automatic instrumentation
available, which basically means that we can get data at
practically no cost about our project.
So for this particular example, I created a sample app
that we're going to use to kind of explore these
information pieces that we can now get about our application at
runtime. So because I'm a bit allergic to very
simple CRUD apps that, you know, basically just do
basic database operations, I created
a more involved application. As I was watching the Harry
Potter movies at the time with my kids, I created an API for
the Gringotts vault, and I tried to
use a variety of technologies, in this case a queuing system, RabbitMQ,
a FastAPI server, some Postgres,
an external API with mock data. But all of that is
just details. The same would very easily translate
to any platform and any programming language.
And what surprised me right from the start was just the
amount of out-of-the-box instrumentation for OpenTelemetry
that exists for all of these libraries and frameworks. And from
my experience, this repeats itself no matter what you're using.
So in this case, I was using FastAPI, which is a very popular
Python server. It has out-of-the-box instrumentation.
I was using RabbitMQ with a package called
Pika, which also has an instrumentation.
I was using SQLAlchemy, psycopg, a lot of
different libraries, and each of them already had
a very easy way to
instrument it and get data. So the
ramification of that is that I was able to get from an
application that has zero data (okay,
as Bill, I would look at this application and I would start searching
in logs, trying to find clues, which would take me a lot of time)
to a situation where the application was basically
spewing out tons of data about how it was
behaving. Now let's understand
what is the type of data that I'm collecting. So one of the interesting things
that OpenTelemetry provides is called tracing. If you're
not familiar with tracing, here's a very quick
101 on that. So a trace essentially describes a
flow within the system. So in this case,
in my application, a user goes to the FastAPI service,
let's say he calls the evaluate-vault
operation, which gets translated to a message on a
queue that gets picked up by a worker that actually does
the work. That entire distributed operation
is a trace, and we can keep track of it and understand what
are the different sections or sub-activities
there, how long did each of them take, how does it work
over time? So all of that information is very easily available.
And the other term that we use is a span,
and a span is just a subset of a trace.
So within a trace, let's say within the segment where we're
making the API call and before it gets to RabbitMQ
for the next phase, we have various activities:
actually handling the request, then checking permissions,
maybe validating with some other authentication sources,
then enqueuing the job. So each of these is an activity that we
can also track and keep tabs on.
Who called it, how long did it take, what errors did we have there,
what logs and so on. And that is what we call a
span. And in a sec we can actually see how we can use open
source tooling. In this case, we'll use Jaeger in
order to visualize that entire trace
so that my experience as
Bill will be upgraded: all of a sudden I'll be
able to completely understand how my code is working
in the real world and maybe assess my changes,
which is what we were going for when we got started.
But let's look at some sample code because I
think that would illustrate it the best.
So this is the source code.
All of the links will be provided at the end of this presentation.
I'm looking at some basic operation like authentication.
So a lot of the data I don't need to change the code to get,
as I mentioned. So for example, the FastAPI service
already tells me about things that happened,
events, and keeps track of
the traces as they happen. The same goes with
the database and the RabbitMQ instrumentation
and all of these other pieces. Now, just to illustrate,
getting all of that to work was extremely simple. Here you can
see kind of the entirety of that code.
Let me make this a bit bigger. As you can see, it's basically
turning on the instrumentation: calling a specific instrumentor,
let's say for requests or FastAPI or Postgres,
and then just calling the instrument method. And there are ways
to make that automatic as well, so even that code
is not strictly necessary today. Just by including the
right packages, you'll have all of the data that you
need.
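To give a sense of what that setup looks like, here is a minimal sketch in Python. It assumes the standard opentelemetry-instrumentation-* packages for FastAPI, requests, psycopg2 and Pika; the exact package names for your own stack may differ:

```python
# Minimal sketch: enabling out-of-the-box instrumentation for the libraries
# used in the sample app. Assumes the opentelemetry-instrumentation-* packages
# (fastapi, requests, psycopg2, pika) are installed.
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor
from opentelemetry.instrumentation.pika import PikaInstrumentor

app = FastAPI()

# Each instrumentor hooks into its library, so every incoming request,
# outgoing HTTP call, SQL query and published message becomes a span.
FastAPIInstrumentor.instrument_app(app)
RequestsInstrumentor().instrument()
Psycopg2Instrumentor().instrument()
PikaInstrumentor().instrument()
```

And if you prefer the fully automatic route, the opentelemetry-distro package provides an opentelemetry-instrument command that applies the available instrumentors at startup without touching the code at all.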
In addition to that, you can include within the
code specific manual instrumentations, which basically says,
I'm defining a scope and I want to track that scope in
the code manually. You can think about it like logging on steroids.
So it's not just a message here
or this code was called, but it automatically tracks who called
it, the duration, start and end. So for example,
here we see authenticating the vault owner with the key, and we create a scope.
This is the Python way of doing things; we call this scope
"authenticate vault owner and key". But of course there are equivalent ways to do it
in every programming language, and there is ample documentation
about how to use it with OpenTelemetry.
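Just as a rough sketch of what that looks like in Python, using the standard OpenTelemetry tracing API; the function, attribute and helper names here are illustrative rather than the actual Gringotts code:

```python
from opentelemetry import trace

# A tracer identifies the source of the spans; the name is just a label.
tracer = trace.get_tracer(__name__)

def authenticate_vault_owner(owner_id: str, vault_key: str) -> bool:
    # Everything inside the 'with' block is recorded as a span: its caller,
    # duration, start/end timestamps, and any exception that escapes it.
    with tracer.start_as_current_span("Authenticate vault owner and key") as span:
        span.set_attribute("vault.owner_id", owner_id)  # illustrative attribute
        return check_owner_key(owner_id, vault_key)     # hypothetical helper
```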
Now once we have this up and running,
let's see how we can actually get data out of it.
So let's take a look at a quick example.
So here we have our application.
It is the API for the
Gringotts vault that we're using. And let's trigger
a specific operation. Let's say we want
to trigger a vault appraisal, as we
mentioned previously. In this case we just provide the
vault ID, we run the operation, we get back a result.
Something happened. What? Who knows, right? And at
this point we can go back to the IDE and maybe look at the code,
imagine what would happen. Or we can, as we wanted to
accomplish in the beginning, kind of get that immediate feedback
about, okay, what happens when this operation is called.
And here is Jaeger. It's a very popular
open source tool that I like a lot that just allows you to visualize
the traces. It's very easy to set up. You just export data
to it from OpenTelemetry as part of
the boilerplate setup, which is very easy. I won't go into it in depth now,
but it's very well documented.
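For reference, that boilerplate is roughly the following; it's a sketch that assumes a local Jaeger instance accepting OTLP on the default gRPC port 4317, and the service name is just illustrative:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Name the service so Jaeger can group its traces ("vault-service" is illustrative).
provider = TracerProvider(resource=Resource.create({"service.name": "vault-service"}))

# Batch spans and ship them to the local Jaeger collector over OTLP/gRPC.
exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```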
So let me look at the latest data that we have about our
vault service. And immediately you can see this is the operation I just triggered.
And we can see that there are two services actually in this distributed
trace. One is the goblin worker, the other is the vault service.
And we can actually go and kind of explore the entire request.
And bear in mind, all of this data I got for free, just
by enabling the instrumentations. So we have
here the HTTP call to the appraise endpoint. We have
here some database statements that are happening.
We have here some logical spans
that we declared in the code, like in this case.
And we can go all the way to see the individual
DB statements and we
can see that the goblin worker picked up the request here and what
happened there. So it's very easy to track and visualize
exactly what happens with such a request.
So this kind of impediment or information
gap between the developer and production no
longer exists once we have the code
being monitored in this way and all of that information is now
readily available for the developer.
And if you think about it,
we can actually make this information much more useful than just validating the
code once we've deployed it. So as you
might think, this is a loop. It's not like every
piece of code that we write is completely new. We continually update code,
and there's already very useful information about the pieces
of code that we're writing that could help us design it.
So if you think about it, even before Bill got started on his
upgrade feature,
there are a lot of questions that, if
he had access to the right data, he could actually ask:
who is using this code? Is it even used? Who will break
if I change it incorrectly? What are some issues I
should know about? What's the baseline I should compare myself to?
What should I optimize for? Where does concurrency happen?
And then later, when reviewing the changes, we can get
data from the test environments and start asking more
questions like, what should I watch for? What are some historical issues
associated with this code?
What can we learn from that same observability data
just by looking at the tests? So there is a lot of data
here, and that data has the potential to completely revolutionize
how we write code, because it can be available at every
turn, not just when we validate our code changes, but also when we
design them. Because whether we look at it or not, the data
is already there. So now the question is,
can we open our eyes and actually use it?
But the answer to this question is that
99% of the time, Bill would still not use that data.
And I spent a lot of time trying to figure
out why that is the case and
why, despite my best efforts to convince Bill,
hey, look, there is this really awesome pile of data over there.
Why don't you look and see what you find?
Often Bill, or whichever developer it
was, would prefer to move on to the next feature.
And here are some reasons that I found. There are a few small
reasons, and I think one very big one. So the first has
to do with expertise. And it's not by chance that I put here a picture
of house repairs, because that's my personal blind spot and something that
I would procrastinate on as much as possible rather than do.
And it's the same for many developers with domains that they're less
familiar with. For example, not all of us have brushed up on our statistics
101. And to make the data that I just showed
you useful, I actually need to know how to remove outliers,
to calculate the median or the p99,
sometimes to do more complex statistics, just to get to
meaningful conclusions about what this means
about my code (see the small sketch after this paragraph). In addition, I need
to actually stop what I'm doing and start learning a new tool,
move between my IDE and whatever
dashboard it is continuously, and kind of look for
trouble, in simple terms.
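Here is roughly what I mean; a tiny sketch using NumPy on a hypothetical list of span durations:

```python
import numpy as np

# Hypothetical span durations in milliseconds, pulled from a tracing backend.
durations_ms = [12.1, 13.4, 11.9, 250.0, 12.7, 14.2, 12.3, 13.1]

median = np.percentile(durations_ms, 50)  # the typical request
p99 = np.percentile(durations_ms, 99)     # tail latency, dominated by the outlier

print(f"median = {median:.1f} ms, p99 = {p99:.1f} ms")
```

The median tells a very different story from the p99 here, which is exactly the kind of nuance that raw log lines don't surface by themselves.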
But I think the more profound reason is
that it's not continuous. So consider
the equivalent and symmetrical process of continuous integration,
which takes code into production and
happens continuously. You don't think about, hey, I'm going to run
some tests. These tests run automatically just by checking
in code. If you're using continuous deployment, you don't
think about, oh, I'm going to deploy to production. No, you just designed a
very good pipeline, and as a result of your
commits and merges and pull requests, everything will get deployed into production.
In the same way, if we want to make this information useful, it can't
be something that I need to think, oh, let me ask Bill to go search
for stuff after he did his check in. That needs to be continuous,
it needs to happen automatically.
And this is my own kind of personal journey with observability
and continuous feedback, because once I noticed that, I became
really obsessed with the idea of how we can actually create
a continuous feedback platform, something that can
continuously look at the data that the application is
already collecting with technologies such as OpenTelemetry and try
to make that extremely useful for the developer.
Now I want to show you an example of this, and by the way,
I'm very happy to see that there are other tools, platforms
and ecosystem libraries, besides Digma,
which is the one that I'm working on, that are providing the same value.
I want to show you an example, not particularly to talk about Digma,
but to just show you my vision of
where I think development is changing towards, and
what a modern developer might do in his code that's very,
very different from how we code today. So to
do that, let me pull up that same code that we were
looking at earlier. Let's look at this vault service.
And what I'm going to do now is simply enable
continuous feedback. In this case, one of the outlets of that feedback is
an IDE plugin that I'm going to enable.
Now, bear in mind, I'm looking at the code. I have no idea
whether it's good or bad or what's going on with it, but now I've turned
on these new spectacles, which are basically the information
that I get back from, in this case, Digma.
So immediately I notice things about this code
and I can drill in to know more. I can see that this
is actually an area of the code that sees pretty low traffic. I can find
an issue: in this case, there's an N+1 query
that can be very easily identified by
looking at the traces; I just need to do it (a minimal sketch of that query
pattern follows below). I can look
at the bottlenecks, understand who's using this code and so on. And let
me transition over to where this issue is happening. And I can
see the culprit, the query that in this case
is repeating over 101 times in each trace,
and I can understand who's being impacted by it. And again,
this time I can look at the trace visualization.
Sorry, in this case it doesn't exist on this machine, but I
can see that trace visualization from
the point of view of the issue.
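To make the N+1 pattern concrete, here is a minimal sketch in SQLAlchemy terms; the Vault and Item models and the session are hypothetical stand-ins for the real code:

```python
from sqlalchemy.orm import joinedload

# N+1 pattern: one query for the vaults, then one additional query per vault.
vaults = session.query(Vault).all()
for vault in vaults:
    items = session.query(Item).filter(Item.vault_id == vault.id).all()  # repeated per vault

# The same data fetched in a single round trip using an eager join,
# which is what the repeating query in the trace suggests we should do.
vaults = session.query(Vault).options(joinedload(Vault.items)).all()
```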
So instead of looking around fishing for
trouble, I'm kind of starting from the code, starting from
a concrete example of something that I found,
and then I can continue to explore and kind of
look for trouble in dashboards. But it's now contextual
to my work. So the vision is taking all of
that amazing information. And I know that
all of that information exists, specifically
because whenever something is wrong, we go
and we dig deeper into logs and traces
and we find troves of interesting things that, if only
we had known them earlier, we would have fixed, and making
that just a part of my coding, making it
so much closer to production. This code I'm looking at right now is already
running in production. And here is what those production
insights are telling me. And by the way, I'm learning a lot of things.
For example, I'm seeing that this code only
gets called in production and not by my local tests. I can
also see that there is code here that's never reached, which is
also interesting. So there's a lot of things that we can do just
by putting on these new spectacles that allow us
to understand how this code is actually running in production
and not just theoretically.
So how does one get started with continuous feedback? These are
completely new, uncharted waters, but there are a
lot of people who are also making really great forays
into this new and great methodology.
So first of all, I've created a web page which has
a lot of quick-start links that you can get started with.
So if you go to continuousfeedback.org,
let me go there right now. Just one second.
I've included some really interesting links,
including getting started with OpenTelemetry and Jaeger.
I talk a little bit about Digma here, some example projects,
including the project that you just saw now with the Gringotts
vaults, which you can easily get started running just using
Docker Compose. Everything here is containerized and so on.
So that's extremely easy and something that I would recommend
everyone doing.
I think one of the more fundamental things that need to change is more
around culture. So in a similar way to what we had
when we got started with testing, for example, it was very
hard to convince developers that testing is a part
of their job. I remember having conversations with developers
telling me, you know, this is QA's job, why am I doing testing?
And in a similar manner, I think that today we're kind of taking the
next step and saying, well, we need to own our code all the
way to production, and that's a cultural change that's already happening.
But I think embracing it and understanding what it means in terms of the ramifications
for me as a developer is something that we all need to
kind of learn more about.
If we don't use and harvest the observability data
we already have, then why are we collecting it?
I think that's the second really important point. I've seen
organizations that had amazing dashboards
for observability, and they might as well have been screenshots or pictures
on the wall. If we don't actually use them, make sure that
we're using them in practice,
then there's no point in collecting them, right? And if we're not
using them, we're also kind of creating a very crippled
process because we don't have any feedback loop between
what is happening and what we're doing.
Feedback is something that we need to implement in the process.
And in the same way that we have scrum rituals like dailies
or scrum of scrums, we also need to have feedback
meetings. So it needs to be on the agenda. And I'm putting
on my product manager hat here. If it's not on the agenda,
we're just going to be completely biased towards the next feature and
the next feature, and we're not going to care about the feedback that
we're receiving and whether it's actually doing what we think it's doing.
So we need to have at least a biweekly feedback
meeting where we're discussing the features that got into production. What do we know about
them? What more do we need to know about them? This is the only way
that we can make it a part of the agenda.
So I'll be very happy to hear your
thoughts about this really interesting topic and also
to have you join in the thinking. We have a Slack group that's also
in the links that I presented here. You're welcome
to join it and share your thoughts. My contact details are
also there; I'm always happy to talk about this topic.
This is it. It was a really great and amazing opportunity to
talk here at Conf42. And please do
reach out. Thank you.