Conf42 Chaos Engineering 2022 - Online

Chaos engineering for serverless with AWS Fault Injection Simulator


Abstract

Companies of all sizes and industries perform chaos experiments on instance- and container-based workloads. However, serverless functions and managed services present different failure modes and levels of abstraction.

This session looks at forming hypotheses to fit serverless, what the experiments can achieve, and how to perform them safely using AWS Fault Injection Simulator and open-source libraries.

Summary

  • Gunnar Grosch talks about chaos engineering for serverless using AWS Fault Injection Simulator. Observing errors in real time allows you to not only experiment with confidence, but respond instantly to get things working again.
  • When we say serverless, we mean removing the undifferentiated heavy lifting that is server operations. With serverless we pay for value, and serverless is built with availability and fault tolerance in mind. The serverless landscape is growing all the time.
  • Serverless chaos experiments look at common serverless weaknesses. Are we handling errors correctly within our applications? How we handle events within our solutions is really important. We want to find and fix weaknesses before they break and create a big outage in our serverless applications.
  • We can modify different types of service configurations and change IAM policies. To do that, we can use the AWS console straight away: make changes there, observe what happens, and change it back. The safeguards that come with a managed chaos engineering service act as an automated stop button.
  • There are two main options for fault injection in AWS Lambda: chaos-lambda for Python and failure-lambda for Node.js. You can add latency, change status codes, and create disk space faults. An example shows how to use this in practice.
  • AWS FIS can also automate these experiments in a safer way. The demo uses an experimental action type that puts a parameter into Parameter Store and rolls it back when the experiment stops, either because the duration is over or because of a stop condition.
  • Stop conditions act as safeguards. If a CloudWatch alarm goes off, for example when set via the CLI, AWS FIS halts the experiment and rolls back the parameter, returning the demo site to normal.
  • Gunnar Grosch: To find hypotheses for your serverless application, use those what-ifs. Make use of configuration, network, and code manipulation. For more chaos engineering for serverless, scan the QR code shown on screen.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Get real-time feedback into the behavior of your distributed systems. Observing changes, exceptions, and errors in real time allows you to not only experiment with confidence, but respond instantly to get things working again. Hi everyone, my name is Gunnar Grosch. I am a developer advocate at AWS, and I'm excited to be here at Conf42 today to talk about my two favorite topics combined: chaos engineering and serverless. In this session I'm going to talk about chaos engineering for serverless using AWS Fault Injection Simulator. So let's get to it. I think you all know the phases of chaos engineering by now, where we walk through from a steady state, creating our hypothesis, and then creating and running our experiments, and so on. For this specific session I'm going to focus on the hypothesis part: how to create hypotheses for serverless and how to run experiments for serverless applications. To begin with, just to set the stage, I want to talk briefly about serverless, in case you don't know what serverless is. Many of you might know that these are the tenets that define serverless as an operational model. First off, we don't have any infrastructure to provision or manage, so there are no servers for us to provision, operate, patch, and so on. Serverless automatically scales by the unit of consumption, the unit of work or consumption rather than by the server unit. And with serverless we pay for value; we have a pay-for-value billing model. So for instance, you might value consistent throughput or execution duration; you only pay for that rather than by the server unit. And serverless is built with availability and fault tolerance in mind, so you don't have to specifically think about architecting for availability, because it's built into the service. But when we say serverless, we mean that it's about removing the undifferentiated heavy lifting that is server operations. 
So you don't have access to the underlying services, the underlying infrastructure, which can be a difficulty when it comes to chaos engineering, but more on that later on. At AWS, when we're talking about serverless services, we are often referring to these: AWS Lambda, Amazon DynamoDB, Amazon API Gateway, our object storage service Amazon S3, and a lot of other services as well, like SNS and SQS. And the serverless landscape is growing all the time. Now with that said, let's get to the topic of this session: serverless chaos experiments. When we're creating our experiments, we can start by looking at some of the common serverless weaknesses that we see in architectures at times. For instance, we can look at errors. Are we handling errors correctly within our applications? No matter if the error handling is inside our code or a feature of the service, we'd better make sure that we're handling errors. Certain services have features like dead-letter queues, which are great, but we want to make sure we're able to test them, and chaos engineering can help us do that. With AWS Lambda functions and different dependencies, like other AWS services or third parties, we need and want to get our timeout values right. In most cases they probably are right, but that's often in a steady state. So what happens when there are issues, say latency, for instance? And with event-driven architectures becoming more and more common, how we handle events within our solutions or applications is really important. Are we queuing events and messages correctly? What happens to events in case of any issues within our applications? We're using many different services, and we also often have third-party dependencies, and we trust them to be there. So what happens if they aren't there? Do we have fallbacks or graceful degradation? How do we handle it if a third party is unavailable? 
These are just some of the potential weaknesses; there are of course a lot of others as well. We want to find these weaknesses, these unknowns, and fix them before they break and create a big outage in our serverless applications. So what are some techniques for doing fault injection on serverless applications? We can start off with configuration manipulation. Common faults might be throttling, for instance, or setting concurrency limits; we can deny access to services or other parts of our application. Basically, any type of service configuration that's available to us is something we can use for fault injection, and tools to do this might be resource policies, IAM policies to restrict access, VPC attachments, and so on. Another technique is network manipulation. Common faults there might be, say, TCP packet loss, bandwidth limitation, network latency, or restricted connectivity. Tools we can use to do that might be security groups, network access control lists, network firewalls, HTTP proxies, NAT instances, and so on. And then we have code manipulation. With code manipulation we can create different types of faults: we can, for instance, create different API responses, we can do disk exhaustion, we can corrupt messages in the code, and we can create network latency. One thing we are missing, though, is environment manipulation, and that is basically because we don't have access to the environment, which is perhaps where we would have started if this wasn't a serverless application. With that, let's look at a concrete example. On screen right now is a very simple serverless application. It is a web service: we have an API Gateway fronting a Lambda function that retrieves data from Amazon DynamoDB. And we also have a queue. 
Items that are posted to our API are stored in a queue and then retrieved by an AWS Lambda function before being stored in a DynamoDB table. A simple serverless application, but it contains several different services. So what can we do with this? We can inject errors into our code, for instance by creating exceptions or by using other types of errors. We can remove downstream services, so we don't have access to a downstream service or a third-party API, for instance. We can alter the concurrency of our AWS Lambda functions, and we can restrict the capacity of tables. Other examples might be creating security policy errors, where we restrict access to services, or creating CORS configuration errors, something we perhaps struggle to get right. That is a good example of something we might try: what happens in our application if we have CORS configuration errors? Once again, we can create basically any type of service configuration error. We can also manipulate the disk space available in our AWS Lambda functions, if that is important to us. And the perhaps most common example of doing chaos engineering experiments for serverless is to add latency to our functions. By adding latency, we can simulate a bunch of different failure scenarios: it could be runtime or code issues, it could be integration issues with downstream or upstream services, it could be to test the timeouts of our AWS Lambda functions, and it can also be to test how our application behaves in case of cold starts. So let's start off by looking at configuration manipulation. When doing that, modifying different types of service configurations and changing IAM policies are two good examples. To do that, we can use the AWS console straight away: just make changes there, observe what's happening, and then change it back. 
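As a concrete illustration of the security-policy technique just mentioned, one way to make a Lambda function lose access to a service is to temporarily attach an explicit Deny to the function's execution role, since an explicit Deny overrides any Allow. This is only a sketch; the statement ID, actions, and table ARN are placeholders, not taken from the talk:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ChaosDenyDynamoDb",
      "Effect": "Deny",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:Query"
      ],
      "Resource": "arn:aws:dynamodb:eu-west-1:123456789012:table/demo-table"
    }
  ]
}
```

Detaching the policy rolls the experiment back, so keep the change window short and the blast radius limited to one function's role.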
We can use the AWS CLI: have our commands ready to make a change to a service or a policy, do that, observe what's happening, and then roll back with another command in the CLI. We can use APIs; we can use the different SDKs to make these changes. Or, of course, we can use AWS Fault Injection Simulator. The big reason I see to use AWS Fault Injection Simulator is the safeguards that we get with a managed chaos engineering service. Safeguards act as an automated stop button: the service monitors the blast radius of the experiments we're running and makes sure that it is contained, and that failures created by the experiment are rolled back if alarms go off. So instead of me manually having to observe the experiment and stop it if alarms go off, or, let's say, once we've run our experiment for five minutes and that's the end, we can use AWS FIS to automatically stop and roll back to the previous state. So let's look at an example of configuration manipulation. This is the same application that we've been using. What if SQS invocation of the Lambda function is throttled? We are pushing a lot of messages to our SQS queue, but if the Lambda function is throttled, so we're not able to pick up those messages, what happens in our application then? Or what if SQS invocation of the Lambda function is disrupted entirely, so we're not picking up any messages from the queue? Or, another example, what if the Lambda function loses permission to the DynamoDB table and isn't able to store the messages it has picked up from the queue and processed? So let me briefly show you how we can do this type of configuration manipulation with a quick demo. Switching over to the AWS console, what we're seeing here is an AWS Lambda function. 
We have a lot of different options, configurations, and so on. I've switched to the configuration tab and the concurrency setting, and as you can see, it's set to the default value of 999. In the console I can easily set it to zero, meaning that the function will be throttled and won't be able to run. But I didn't save it; it's still at 999. Instead, I want to show you how we can do this using AWS FIS. I have an experiment template created in FIS already to update Lambda concurrency, and I'm making use of an action called SSM start automation execution. With that we can run SSM documents, which in turn contain the different types of automation that we want to perform. So I've defined a document that is built to change the concurrency of our AWS Lambda functions. It has a first step where it will update the concurrency to whatever we set it to, zero for instance; then it will sleep; and in the end it will do a rollback. We can then add these parameters to our FIS action to update that Lambda function and have that automated rollback when the experiment is done or in case of an alarm. So let's start this experiment just to see what it looks like. The experiment is initiating. It is now running. There we go. It will now run that SSM automation. Looking in the Lambda console, it's still at 999. Let's do a refresh, and we can see that the reserved concurrency is now set to zero. We can also see at the top that this function is throttled, so it will not be invoked. As soon as this experiment is done, or if an alarm goes off, it will do that rollback. So now we can see it was a quick one-minute experiment, and it has rolled back to the initial state, which was 999. 
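The automation document just described (set the concurrency, sleep for the experiment duration, roll back) might look roughly like this. The document from the talk isn't reproduced here, so the step names, parameter names, and defaults are illustrative:

```yaml
# Sketch of an SSM Automation document for a Lambda concurrency experiment.
schemaVersion: '0.3'
assumeRole: '{{ AutomationAssumeRole }}'
parameters:
  AutomationAssumeRole:
    type: String
  FunctionName:
    type: String
  NewConcurrency:
    type: Integer
    default: 0
  RollbackConcurrency:
    type: Integer
    default: 999
mainSteps:
  # Step 1: throttle the function by reserving zero concurrency.
  - name: SetConcurrency
    action: aws:executeAwsApi
    inputs:
      Service: lambda
      Api: PutFunctionConcurrency
      FunctionName: '{{ FunctionName }}'
      ReservedConcurrentExecutions: '{{ NewConcurrency }}'
  # Step 2: hold the fault for the experiment duration (one minute here).
  - name: Wait
    action: aws:sleep
    inputs:
      Duration: PT1M
  # Step 3: roll back to the original reserved concurrency.
  - name: Rollback
    action: aws:executeAwsApi
    inputs:
      Service: lambda
      Api: PutFunctionConcurrency
      FunctionName: '{{ FunctionName }}'
      ReservedConcurrentExecutions: '{{ RollbackConcurrency }}'
```

FIS invokes this through the SSM start automation execution action, passing the document ARN and these parameters.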
That was a very quick example of how we can use AWS FIS and this very adaptable way of creating automation to change configuration, change policies, and so on by using the SSM automation action. A very cool feature that allows us to do a lot of these experiments. So let's look at another of the three techniques: code manipulation. This is a favorite of mine. There are today two main options for fault injection in AWS Lambda: there is chaos-lambda for Python, and then failure-lambda for Node.js. Let's have a look at the Node.js one, fault injection with failure-lambda. It is an npm package that you can use for Node.js-based Lambda functions. You configure it using Parameter Store or AWS AppConfig, and it has several different fault modes that you can use. You can add latency. You can change the status code for an API; for instance, instead of returning a 200, you can return a 404, a 502, or whatever you wish. You can create exceptions within the invocation of the Lambda function. You can write things to disk to create disk space faults. You can use a denylist to deny calls to specific URLs. What you do is basically install the npm package, import it in the Lambda function, and wrap the Lambda function handler. So, like this: we import failure-lambda and then we wrap our handler with it. And then we're good to go to add fault injection to our Lambda function. As I mentioned, we control it with a parameter in basic JSON. We set whether it's enabled or not; we set the failure mode, which type of fault injection we want to do; we set a rate, whether it should be on all invocations or, as in this case, on 50% of invocations; we can set the latency; and so on, configuring each of these different fault modes. Then let's look at an example of this as well. What if my function takes an extra 300 milliseconds for each invocation? What happens to my application in those cases? 
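To make the wrapping pattern concrete, here is a minimal local sketch of what failure-lambda does for its latency mode. The real package is installed with `npm i failure-lambda` and reads its configuration from Parameter Store or AppConfig; here the config is hard-coded so the sketch runs anywhere, only latency is implemented, and the wrapper name is illustrative:

```javascript
// Configuration in the shape described in the talk: enabled flag, failure
// mode, injection rate, and latency bounds in milliseconds.
const config = {
  isEnabled: true,
  failureMode: 'latency',
  rate: 1, // inject on every invocation for this demo; 0.5 would be 50%
  minLatency: 100,
  maxLatency: 400,
};

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Like failure-lambda, the wrapper decides per invocation whether to inject
// the configured fault before delegating to the real handler.
function failureWrapper(handler) {
  return async (event, context) => {
    if (
      config.isEnabled &&
      config.failureMode === 'latency' &&
      Math.random() < config.rate
    ) {
      const latency =
        config.minLatency +
        Math.random() * (config.maxLatency - config.minLatency);
      console.log(`Injecting ${Math.round(latency)} ms of latency`);
      await sleep(latency);
    }
    return handler(event, context);
  };
}

// With the real package this line reads: exports.handler = failureLambda(...)
const handler = failureWrapper(async () => ({ statusCode: 200, body: 'ok' }));
```

Invoking `handler({}, {})` now takes at least 100 ms extra while still returning the normal 200 response, which is the behavior the latency demo shows for function one.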
Or what if my function returns an error code? So instead of returning a 200 response to the API or to the client, what if it's a 404 or a 502 or a 301 or whatever error code we want to use? Or what if I can't get data from DynamoDB? Let's look at an example of how we can use this in practice. This is a very basic site used for this example, a serverless chaos demo site. I'm using three functions that are just copies of each other, to be able to easily show the difference between them. It's now running; it is fetching data from DynamoDB to load a new image. This is constantly updating, and we can see it's 150 to 200 milliseconds per invocation at the moment. This is our AWS Lambda function for function one, just to show you that we are importing failure-lambda and wrapping the Lambda handler with failure-lambda as well. We then have a parameter stored in Parameter Store, which contains the configuration for that specific AWS Lambda function. It is now set to false, so it isn't enabled. We can specify the failure mode, in this case latency, and we're using a minimum latency of 100 milliseconds and a maximum latency of 400 milliseconds. To enable this, we simply update the parameter, set it to true, still with latency, and save it. Now switching back to the site and observing function number one, we can see that the invocation time for function one is longer than for functions two and three. We have added latency to that AWS Lambda function for each invocation, and this lets us test how our application behaves in case of latency. Looking in the logs, we can also see that it shows we are adding latency to the invocations. So that's latency fault injection. And as I mentioned before, there are a bunch of different things we can test by using latency. 
Disabling this again, updating the parameter and saving, it should now go back to around 200 milliseconds once again, and that seems to be the case. Cool. Now let's check parameter number two. In this case we're going to use a different failure mode: we're going to use status code instead, to manipulate what status code is returned from our API, and in this case it's set to 404. So instead of returning a 200, we're returning a 404, and I'm setting a rate of 0.5, meaning that it will return a 404 on about half of the invocations. Saving that and switching back to the site, we can watch function two: we're getting 200 right now, we are getting 200, and still 200. Come on. There we go, we got an error on one invocation, and we get an error again, meaning that the site is unable to get the response, basically a URL for a new image from DynamoDB, so it can't load a new image. By using this fault injection method, we're able to simulate what happens if we have responses that aren't 200 or OK from our APIs. Changing it back, it should now load a new image on each invocation, which seems to be the case. Right, let's check failure-lambda parameter number three. Updating this one, I am going to use a different failure mode: in this case we're going to use denylist, and with the denylist we're able to specify which calls to deny. In this case we're denying calls to S3 and to DynamoDB, but this could be any third-party dependency as well. Setting it to true, and if you remember the architecture we looked at, we have DynamoDB as a downstream dependency, meaning that our Lambda function should now not be able to fetch data from DynamoDB. For function number three, we can see that functions one and two continue to update, but function three is throwing an error and isn't loading new images. So once again we can see what happens when we inject that type of fault into our application. But that was very manual. 
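For reference, the Parameter Store values toggled across these three demos follow roughly this shape. The field names are as I understand failure-lambda's documented configuration, so double-check them against the package README; only the fields relevant to the active failure mode are used on a given invocation:

```json
{
  "isEnabled": true,
  "failureMode": "denylist",
  "rate": 0.5,
  "minLatency": 100,
  "maxLatency": 400,
  "exceptionMsg": "Injected exception",
  "statusCode": 404,
  "diskSpace": 100,
  "denylist": ["s3.*.amazonaws.com", "dynamodb.*.amazonaws.com"]
}
```

Switching `failureMode` between the latency, status code, and denylist modes, and flipping `isEnabled`, reproduces the three demos above without touching any code.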
So let's have a look at how we can do this using AWS Fault Injection Simulator instead, and make use of those safeguards and the automatic rollback that I talked about. Creating a new experiment template and choosing a role to be able to run this experiment, I'm now adding an action. Let's give it a name, lambda fault injection, and select the action type. We used the SSM automation execution action in the previous example, and we can use it here as well, with a document that is meant to update parameters in Parameter Store. We have one document created for this that you can simply use; you just define what's needed, that is, the new parameter value and the rollback value. And that's fine and dandy, that works, and you can use it straight away to do different types of experiments against Parameter Store. Now switching back to FIS, I want to show you something that we're, let's say, playing with right now, because we're seeing customers using parameters a bit: we are experimenting ourselves with an action type that is basically put parameter, to see if that might be something customers want to use. So I'm selecting this action type, put parameter. I will add a duration for the action of two minutes. Then I need to give the name of the parameter to update, so let's copy the name of failure-lambda parameter two. Now you can see that I am supposed to add a value and a rollback value; the value in this case is what will be put into Parameter Store. I'm copying it from our existing parameter and switching it to true, meaning that when the experiment starts, it's going to enable the fault injection. We're going to use status code as the failure mode, keep the rate at 0.5, so 50% of invocations, and keep using 404 as the status code. Then we have the rollback value, and that's the value of the parameter. 
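Pulled together, an experiment template for this kind of parameter flip might look roughly like the following, using the generally available SSM automation action rather than the experimental put-parameter action. Every ARN, account ID, document name, and parameter name here is a placeholder, and the document's own parameter contract is assumed:

```json
{
  "description": "Enable failure-lambda status code fault via SSM Automation",
  "roleArn": "arn:aws:iam::123456789012:role/FisExperimentRole",
  "targets": {},
  "actions": {
    "updateFailureLambdaParameter": {
      "actionId": "aws:ssm:start-automation-execution",
      "parameters": {
        "documentArn": "arn:aws:ssm:eu-west-1:123456789012:document/PutParameterWithRollback",
        "documentParameters": "{\"ParameterName\":\"failureLambdaConfig2\",\"Value\":\"{\\\"isEnabled\\\": true, \\\"failureMode\\\": \\\"statuscode\\\", \\\"rate\\\": 0.5, \\\"statusCode\\\": 404}\",\"RollbackValue\":\"{\\\"isEnabled\\\": false, \\\"failureMode\\\": \\\"statuscode\\\", \\\"rate\\\": 0.5, \\\"statusCode\\\": 404}\",\"AutomationAssumeRole\":\"arn:aws:iam::123456789012:role/SsmAutomationRole\"}",
        "maxDuration": "PT2M"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:eu-west-1:123456789012:alarm:fis-demo-alarm"
    }
  ]
}
```

The stop condition at the bottom is the automated stop button: if that CloudWatch alarm fires, FIS halts the experiment and the document's rollback path restores the parameter.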
The rollback value is what the parameter is set to when the experiment stops, either because the duration is over or because of a stop condition. Saving that, we don't define a target, because the target in this case is defined through the parameter instead. Creating my experiment template, I can start the experiment, and we can see that it is running. Meaning that if we now switch to our parameter and refresh, we can see that it is set to enabled. So now the experiment is running. Switching over to the demo site, we can see that function number two is giving us an error because it is getting a 404 response in return, then a 200 response, then a 404, so on about 50% of invocations. Let's check the experiment in FIS. It is now completed, and with it being completed we should have a rollback of our parameter. Let's take a look at the parameter in Parameter Store, and we can see that it is set to isEnabled false, so disabled, and function number two is returning 200 responses. Very cool. So that's an example of how we can use AWS FIS to automate these experiments in a safer way. I want to show you one last thing: I've talked about stop conditions but haven't really used any, so let's add a stop condition to this experiment. I'm going to use the demo alarm that I have, my FIS demo alarm. Saving that, we can switch over to CloudWatch, and this is the alarm in question. As you can see, it is right now in the OK state, meaning that no alarm is going off. Starting the same experiment once again: with the experiment started, the parameter is being updated. Double-checking: yes, it's updated. And with that updated, we will now have 404 responses every now and then on function number two. About 50% of invocations, which seems right. So let's now try to trigger the stop condition. What if an alarm goes off? Instead of actually making something go off, we can use the CLI to set the alarm state for my specific alarm to the ALARM state. 
Doing that, using that command, and switching back to CloudWatch, we can see that it is now in the ALARM state. As soon as the alarm moves into the ALARM state, we can see that AWS FIS stops the experiment, since that was our stop condition, a safeguard; it is halted by a stop condition, and that also means it will use the rollback behavior to update our parameter. Refreshing to make sure: yes, it is now disabled. The experiment and our demo site should now be back to normal and returning 200 responses once more. So that was an example of how we can use AWS FIS to do these experiments by updating a parameter. I showed you something that we are experimenting with, a new action type where we put a parameter straight into Parameter Store, but you can do it all right now using SSM automation; the document is available for use. All right, I want to do a bit of a summary recap of what we've looked at. First off, the chaos engineering part is the same no matter if it's serverless or if it's, say, serverful. To find the hypotheses for your serverless application, use those what-ifs that we asked earlier on: what if a downstream service is unavailable, what if latency is added, and then create a hypothesis around that. When you're doing the experiments, make use of configuration, network, and code manipulation; we looked at examples of configuration and code manipulation in this session. And try to use safeguards and automatic rollback, so you don't have to be responsible for actually running a command or changing configuration in a console to roll back. If you want more chaos engineering for serverless, just scan the QR code shown on screen right now, or go to grosch serverless chaos for more links, examples, and demos, all gathered in one place. And with that, I want to thank you for joining this session. Happy to be here at Conf42 Chaos Engineering. 
My name is Gunnar Grosch, developer advocate at AWS. If you want to contact me, I'm available on Twitter as shown on screen, and LinkedIn, of course. Happy to connect. Thank you all for watching.

Gunnar Grosch

Senior Developer Advocate @ AWS



