Conf42 Cloud Native 2022 - Online

What happens when DynamoDB explodes? A practical guide for developers


Abstract

2:32 AM. PagerDuty wakes you up. DynamoDB is throttling. Should you wake up the team and fiercely charge to resolve the issue, or can it wait for tomorrow?

Understanding the business impact and identifying the affected users are the keys to making this decision. Those data points are usually not easy to obtain, especially in highly distributed, asynchronous architectures like serverless.

In this session, we will share guidelines on what needs to be part of your serverless application in order to be able to answer those questions in a matter of minutes.

The main operational questions, when things go bad:

  • Which user functionality is being affected?
  • Which users were affected and how?
  • What is the root cause of these issues?

Getting a good night’s sleep is within arm’s reach…

Summary

  • Erez Berkner, CEO and co-founder of Lumigo, talks about observability, monitoring, and what to do when things go wrong. He takes the concept of distributed tracing to the next level and creates a virtual stack trace.
  • Lumigo is a tool that focuses on cloud-native applications. It gives you the ability to drill down and actually troubleshoot to find the root cause. The main point is to see and understand what you should be aiming for.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, this is Erez Berkner, CEO and co-founder of Lumigo, and today we're going to talk about observability, monitoring, and what we should do when things go wrong. So think about this: it's 2:32 in the morning, PagerDuty wakes you up, and DynamoDB is failing again. What do you do? I like to call this talk "What happens when DynamoDB explodes", and I use DynamoDB, but I want to make it a bit broader. It's not just about DynamoDB. DynamoDB is different in that it is a managed service: you don't control the server, you don't control the operating system, and you're very limited in your visibility. So when we talk about DynamoDB, when we talk about managed services, I want to broaden this to everything that is "as a service". We all know function as a service, but think about queue as a service, or data as a service like Snowflake or DynamoDB, storage as a service, even Stripe and Twilio, all the SaaS services. Everything that you don't control: you don't deploy an agent, you don't maintain the server, you don't write the code over there, you don't define the API. All of this really creates a challenge when it comes to monitoring, debugging, and troubleshooting, and that's what we're going to focus on today.

I like to call this broader sense of managed services "serverless". It's a broad definition, but it really helps me define the core: there are no servers over here, and those applications are usually very distributed, with dozens or hundreds of services that keep changing, no longer the three-tier monoliths we used to have. So across all these services, when DynamoDB actually explodes, it's very hard to zoom out and understand the context and the overall application health. How can you assess the impact without having the actual connection and context between the different services?

So that would actually be the first thing we need to have in place in order to understand what's going on when something goes wrong: we have to implement, one way or another, tracing. And because it's distributed, it's distributed tracing. The point is that this tracing will allow us to go back, go upstream, and understand not just that this failed, but what happened before that, where it originated from, and what it impacted. In this case I can find out whether this is a customer-facing API, whether it is business critical, and take a decision and assign a priority based on that.

Now I want to take the concept of distributed tracing to the next level and be able to look at what I like to call a virtual stack trace. In the past I could see exactly which functions were called in a monolith environment; I want to be able to do the same in a distributed environment. So: seeing the inputs, the outputs, the environment variables, the stack trace, the logs, everything that I can on each and every service along the path of that failure.

And there are different ways to do that. One, you can use CloudWatch and other cloud-native friends. The main point here is that you can use code, in the containers, in the Lambda, before and after the managed service, to emit different outputs and different logs that will allow you to understand what's going on. So you can log the request, you can log the payload, the output, and the environment variables, you can catch and log the exception of course, and any additional details. And that would actually work. The main problem is that it takes a lot of time, it requires a lot of maintenance, and you still need to connect the dots.
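To make that concrete: the talk doesn't show any code, but a minimal sketch of this do-it-yourself approach, assuming a Python Lambda handler writing structured JSON logs to CloudWatch, could look like the following. The correlation_id field and the do_business_logic stub are illustrative placeholders, not something from the talk.

```python
import json
import logging
import os
import uuid

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def do_business_logic(event):
    # Placeholder for the real work; returns something JSON-serializable.
    return {"ok": True}


def handler(event, context):
    # Reuse an upstream correlation id if the caller passed one, otherwise mint a new one.
    correlation_id = event.get("correlation_id") or str(uuid.uuid4())

    # Log the request: payload, environment details, and the correlation id.
    logger.info(json.dumps({
        "correlation_id": correlation_id,
        "function": context.function_name,
        "aws_request_id": context.aws_request_id,
        "stage": os.environ.get("STAGE", "unknown"),
        "event": event,
    }, default=str))

    try:
        result = do_business_logic(event)
        # Log the output with the same correlation id so the lines can be joined later.
        logger.info(json.dumps({"correlation_id": correlation_id, "output": result}, default=str))
        return result
    except Exception:
        # Catch and log the exception (with stack trace) before re-raising.
        logger.exception(json.dumps({"correlation_id": correlation_id, "status": "failed"}))
        raise
```

Every log line carries the same correlation_id, and joining those lines across services is exactly the "connect the dots" work that stops scaling, as described next.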
So it can work when you have, let's say, a few thousand invocations a month. When you have millions or billions, you can no longer connect the dots via timestamps. You can no longer see that one request went through, because in that second you have thousands going through, and how can you differentiate the logs of those different executions? So that becomes a problem: drilling down into the specific log groups and log streams to understand how it all connects. You have these data islands, but you're not able to connect them. Still, it's good for many of the cases, especially for dev and early production, before you scale. It's out of the box, it's relatively easy to get started, it's supported by AWS, and it's in the same cloud, with the same cloud vendor. The challenge is that it's complicated: it's time consuming to implement, it's time consuming to make sense of, and you need to really know how to configure it to get proper visibility. And the biggest challenge is that there's no good way to do event correlation, to trace, to understand the bigger picture. There are tools like X-Ray, for example, within AWS that allow you to do that, but they're also very limited: they don't go across DynamoDB and S3 and EventBridge and the other things that we started talking about. And therefore it's hard to understand the business impact, whether this is a customer-facing API, and how critical it is.

The second option is to implement a homebrewed solution. I can actually build a distributed tracing system or, better, use something that is open source, and that's great because there are different frameworks for that. You might want to read, if you don't know them, about Zipkin, about Jaeger, and in general about the OpenTelemetry framework. This provides a very nice common ground for implementing distributed tracing and getting the information back to you in a visual way: seeing latency, a breakdown of the environment, getting traces. So if you're planning to implement distributed tracing, I suggest starting with OpenTelemetry as a first point. And if you do that, I really suggest reading "A consistent approach to track correlation IDs through microservices" by Yan Cui. If you're not following Yan Cui, that's a great time to start doing that. If you are interested in distributed tracing, in managed services, in serverless, he is for me the number one guy out there, and he blogs a lot. He also has great workshops and books to consider. But to our point, he blogs a lot about correlation IDs and distributed tracing, so this would be a very strong read on the topic.

I think the pros of a homebrew solution are that it gets as tailored as you can possibly get, because you build it, so you will have all the different perks that you want to have. It's your solution, and it will be the best fit for your needs. OpenTelemetry is supported by many vendors, so that's great as a standard to use, and you can base future engagements on it. And it's not cloud specific, as opposed to CloudWatch for example, so you can actually use it across clouds and move between clouds. The main challenge is that because you build it, it's tailor fit but very high touch. And it does not solve the problem of managed services: if you need to trace across DynamoDB, you still need to figure out a way to get a correlation ID, a request ID, across to the other side, across S3, across EventBridge, across Kinesis.
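As a rough illustration of that last point (this is not code from the talk): with OpenTelemetry in Python, one way to carry trace context across a managed service such as EventBridge is to inject the W3C trace headers into the event payload on the producer side and extract them in the consuming Lambda. The `_trace_context` field name, the `orders.service` source, and the console exporter below are assumptions made for the sketch; it presumes the `opentelemetry-api`, `opentelemetry-sdk`, and `boto3` packages are installed.

```python
import json

import boto3
from opentelemetry import trace
from opentelemetry.propagate import extract, inject
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Minimal tracer setup; in practice you would export to Jaeger, Zipkin, or an OTLP endpoint.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("orders")

events = boto3.client("events")


def publish_order_created(detail: dict) -> None:
    """Producer side: start a span and smuggle its context through the EventBridge payload."""
    with tracer.start_as_current_span("publish-order-created"):
        carrier: dict = {}
        inject(carrier)  # writes the W3C traceparent header into the carrier dict
        detail["_trace_context"] = carrier  # hypothetical field piggybacking on the event
        events.put_events(Entries=[{
            "Source": "orders.service",
            "DetailType": "OrderCreated",
            "Detail": json.dumps(detail),
            "EventBusName": "default",
        }])


def handler(event, context):
    """Consumer side (Lambda triggered by EventBridge): continue the same trace."""
    detail = event.get("detail", {})
    parent = extract(detail.get("_trace_context", {}))
    with tracer.start_as_current_span("handle-order-created", context=parent):
        # ...business logic; any spans created here share the upstream trace id
        return {"ok": True}
```

The design choice here is simply to treat the event body as the carrier, since services like EventBridge, S3, and Kinesis won't forward tracing headers for you; that is the manual work the talk is referring to.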
So out of the box it doesn't tackle managed services and components that are not supported, like API Gateway, DynamoDB, and the others that we mentioned. Luckily, there are several cloud-native monitoring solutions out there that were built for the modern environment to solve exactly that problem: the distributed tracing, observability, monitoring, and troubleshooting of modern cloud-native architectures, which are distributed and granular, with multiple services. What you can expect is that usually those are SaaS platforms, so they get the traces to their back end and process them to generate the view for you. That's great in terms of maintenance. At the same time, because it's SaaS, you need to be able to send information to those services; that's also something to know, especially around privacy. Most of the vendors have very good policies in place for GDPR, ISO 27001, et cetera. Most of the vendors are also solving a larger problem, not just the distributed tracing and showing you what happened, but, as I mentioned, building the virtual stack trace: getting the inputs, the outputs, and everything you need to know. It's much more: cost analysis, latency breakdown, et cetera, much more than just a map of services. And usually they're using code libraries integrated in one way or another, plus an API using an IAM role.

I think the pros are that those tools are usually very opinionated. They come with a set of preconfigured alerts that you should know about; that's what it means to be niche focused. They provide more than just tracing, so many times you'll be amazed at what you can get: you'll say, wow, I can get this, and I was just looking for tracing. And they're very low touch, very easy to get started with, usually a few minutes, ten or fifteen minutes, to get started and actually see what's going on within your environment. On the other side, this is yet another third-party platform among the many others that you probably have. And because they provide more than just tracing, you do get additional data, more layers to the tool, which might be great but might also be beyond what you're looking for if you're at an early stage. So just remember that.

And this is where I want to take one tool, our tool, Lumigo, and share with you how we actually do that in Lumigo. The main point is to see and understand what you should be aiming for. It doesn't matter if you implement this with CloudWatch, or you're using OpenTelemetry or Jaeger, or using Lumigo or another tool; I want to share what the capabilities are if you do it right, what you should expect from a well-observed system. That's the main reason why I want to show how this can look. And the main thing is that, one, you're getting the monitoring and the alerts, the things that tell you whether everything is okay or not and what exactly is not okay, and then it gives you the ability to drill down and actually debug and troubleshoot to find the root cause.

So let me very quickly share with you how that looks in Lumigo. This is our dashboard. It takes literally five minutes and five clicks, no code changes required, to get started, and you get value: you get alerts about errors that you probably never knew you had. As I mentioned, the dashboard is focused on the alerts, on things that are relevant for cloud native, so it's no longer about CPU or I/O, et cetera. It's about the number of failed invocations. It's about cold starts.
It's about showing me the biggest latency offenders across the dozens of services that I have, because that becomes a really big problem. Runaway cost, function duration, timeouts: all of these are things that you get a very easy view of, and alerts on, with cloud-native tools.

But at the same time, let's take a scenario when something actually goes wrong. So let's look at our issues and we'll find an issue that is occurring. Let's suppose we got an alert on this in Slack or PagerDuty, and we want to dive into it. We click on this specific error that is happening; it last happened three minutes ago. This is actually a live environment that I'm showing, based on a cloud-native architecture in AWS. When we drill down, we can see a lot of information about that error. It's in a specific Lambda, I can see that there was a deployment over here, I can see the number of failures, and I can see, one by one, the actual failures that happened. And this is where we actually move to troubleshooting.

If we click on any of the invocations, we actually start what I like to call a debugging session, which is what I mentioned about the virtual stack trace. This is where you can actually start looking at that virtual stack trace. So what do we have over here? You can see this Lambda; this is why we got here: this "post to social" failed. And you can see this EventBridge; this is the service that triggered that Lambda. As I mentioned, we want to see the full transaction, the end-to-end transaction. So I'll ask Lumigo to calculate, go back upstream, and build the entire request from the very beginning, all the way through the different nodes. And this is what Lumigo built over here.

This is the core of what you should be targeting: having a direct view going from a failure, an internal failure, a DynamoDB that failed, or a Lambda, whatever, and then immediately being able to zoom out and understand, okay, this is a customer-facing API, I can tell you it's a business-critical API, so I need to fix it now; and at the same time being able to drill down. Okay, let's click on this. And this is the added layer that I mentioned: I don't just get a map, I can actually click on any service and see a lot of information, like the issue, the actual stack trace and the exception, the variables, the event that triggered the Lambda, the environment variables, the logs, everything that has to do with this invocation. These are things that are generated by the vendor, in this case Lumigo, and most of them do not exist in AWS or in other regular tools.

So in this case, "details write id cannot be an empty string", that's the failure. And if I look at the event and click to understand what message the Lambda actually got, I can see that the write id was empty to begin with. So the Lambda got an empty write id. Just by having that visibility, which you get only in tools that are focused on cloud-native applications, just by seeing that, I now know that this Lambda is not the problem. I need to understand why this is empty and where it is coming from. So let's go upstream to the EventBridge and click on it. Now we can dive into the properties it got in the message, and we can see that the id is naturally empty in EventBridge as well. So we can go upstream and look at the Lambda that triggered that, and we can go one by one and check all the different services, including things like DynamoDB: what did you try to write?
What was the outcome? In this case there's a failure: the provided key element does not match the schema. There was probably a retry, so I can see that the second call was successful. So it really tells you the story of what happened in each and every service along the way, all the way to things like Twilio, for example, where you can actually see that an SMS was sent to this number, and the response, and so on. At the same time, you can also check this out in a timeline view, to see if there are any latency issues. You can see this is taking a second, so maybe I need to dive deeper and understand what's taking the time, and so on and so forth, in a latency view. I want to stop over here. Again, this is just to give you the context of what you should expect from modern distributed tracing that is focused on cloud native.

To summarize, serverless and managed services in general really change the way we develop, really change the way we're doing things. There are many strong benefits we didn't touch on, like a lot of accelerated development, but there are some new challenges, especially around visibility and troubleshooting of an application. We talked about three approaches to monitoring and troubleshooting distributed services: cloud-native tools, homebrewed and open source solutions, and third-party SaaS vendors.

There are two or three main things that I wanted you to leave this session with. One is, I think you saw what you should expect in this modern environment. Don't settle for what you used to have, with logs sitting out there in a log aggregator and that's it. You can expect more; there are better tools and better technology to serve you. Number two, think about this up front. It's much better to bake it in during dev, during preprod, rather than after the fact. And third, consult. There are a lot of companies that are going through what you are going through, or that already have solutions, so consult with the community. There are a lot of resources, and from my experience everybody is really happy to help.

And on that note, I also want to offer my help. I really enjoy talking with folks in our community and hearing about new projects and new applications being developed. So feel free, even if you're not using Lumigo, even if it's just about managed services, distributed tracing, or observability, to reach out. I'm available at this email, or direct message me on Twitter, and I would love to try and do our best to help. Thank you very much, and enjoy the rest of the conference.
...

Erez Berkner

CEO & Co-Founder @ Lumigo



