Conf42 Chaos Engineering 2025 - Online

- premiere 5PM GMT

Testing Network Resilience in Amazon ECS: Fault Injection Experiments on Fargate


Abstract

Amazon ECS now supports network fault injection experiments for Fargate workloads through AWS FIS. Teams can simulate network latency, packet loss, and blackhole conditions. In this session, learn how to implement network fault testing patterns for resilient containerized applications.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
"Everything fails, all the time" is a famous quote from Amazon's CTO, Werner Vogels. It means that software and distributed systems will eventually fail, because something can always go wrong. So what happens when your containerized applications face network disruptions? In this session, you will learn how to proactively validate your containerized applications' resilience using AWS Fault Injection Service (FIS), turning potential network challenges into opportunities for strengthening your architecture. Hello, everyone. My name is Sunil Govindankutty. I am an Enterprise Support Lead at Amazon Web Services. I'm also a member of our resilience technical community, where I focus on the chaos engineering area. I'm excited to be talking to you today about this topic. Now let's see the agenda. We'll start with an overview of Amazon ECS and AWS Fargate, the options you have to run container workloads in AWS. We'll then see some of the network faults you might have seen in the real world. We'll look at an overview of FIS and the various features and components the service supports. We'll also see a reference flow of a chaos engineering experiment triggered through FIS. We'll then conclude with a demo of a sample application running containers on Amazon ECS using Fargate, and see what the application behavior looks like in case of network faults. ECS, or Elastic Container Service, provides a managed platform that provisions the compute, network, and storage you need for your containers. It manages the scheduling of containers, the self-healing of containers, and the auto scaling of containers. It integrates well with our container registry, Secrets Manager, the load balancing service, and so on, which allows you to focus on just your application and your container requirements while leaving the management of the container orchestration platform to ECS. Now, where do you run these containers? You have a few options. One of them is AWS Fargate, which is a serverless compute platform for containers. It runs each container in its own kernel and manages the lifecycle of the underlying host, so it eliminates the need for customers to upgrade, patch, and scale the hosts needed by the containers. Now we'll see some of the core constructs of the ECS platform. One of them is a task. In a task definition, you specify the compute requirements, networking requirements, permissions, and configurations you need for one or more of your containers. Tasks are then created from this definition and match all of the attributes you have specified in the definition file. As I mentioned, the orchestration activities of healing failed containers, replacing them, managing the scaling of containers, and performing deployments of the latest code to existing containers are all managed by what's called a service. It also registers these containers with load balancers to accept end-user traffic. Then you also have what's called a cluster, which is a logical grouping of services and tasks in an AWS region. You can then use IAM, the Identity and Access Management service, to control the permissions users have on these clusters and their underlying components. Now let's take a quick look at networking for Fargate tasks. Each task gets its own network interface, its own ENI, using an IP from a subnet that you have provided to the ECS platform.
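As a minimal sketch of the task-definition construct described above, here is how a Fargate task definition with awsvpc networking (which is what gives each task its own ENI) might be registered with boto3. The family name, image URI, role ARN, and sizing values are hypothetical placeholders, not the demo application's actual configuration.

```python
import boto3

ecs = boto3.client("ecs")

# Register a minimal Fargate task definition (all names below are placeholders).
response = ecs.register_task_definition(
    family="sample-crud-api",              # hypothetical task family name
    requiresCompatibilities=["FARGATE"],   # run on Fargate rather than EC2
    networkMode="awsvpc",                  # each task gets its own ENI
    cpu="256",
    memory="512",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    containerDefinitions=[
        {
            "name": "app",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/sample-crud-api:latest",  # placeholder
            "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
            "essential": True,
        }
    ],
)
print(response["taskDefinition"]["taskDefinitionArn"])
```

A service would then reference this task definition to keep the desired number of tasks running behind the load balancer.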
You can then apply security groups to a task to control the ingress and egress traffic of the underlying containers. If you need to run a large number of tasks, for example because of a spike in traffic, you can attach secondary CIDR blocks to your VPC to provide the additional IP addresses. Now that we've looked at networking, what are some of the fault events that could happen in the real world? One of them is latency, a delay in communication between services and their dependencies. You could see behaviors like network congestion or a route change resulting in latency, or a DNS resolution delay, again resulting in latency. Then you have a network black hole, where packets are silently dropped. Here you might see TCP connections hanging, you might see loss of connectivity between dependencies, and you might not get an error response from the underlying dependencies during an event like this. Another flavor of a network fault is packet loss, where packets don't reach their intended destination. With this, you might see retransmission attempts at the TCP layer, degraded performance across your workload, and bursts or random patterns of packet loss in your monitoring tools. Now, given that some of these network faults could happen in your container environment, how do you prepare for that eventuality? That's where FIS, the Fault Injection Service from AWS, comes into the picture. It's a fully managed service for running fault injection experiments. It comes pre-packaged with scenarios of various fault actions that you can run against your workload. You can run it through the AWS Management Console, the CLI, or by integrating with any tools of your choice, so you can get started in a matter of minutes. It allows you to test real-world conditions, from something as simple as stopping or terminating an EC2 instance to something as complex as a power interruption in an Availability Zone. FIS also fully embraces the concept of safeguards, which allow you to limit the blast radius of your chaos engineering experiments. You can specify an alarm, and if it goes off, FIS will stop the experiment and the fault actions it is performing, and then try to roll back those actions. Now let's look at a reference flow for the service. At a very high level, we start with an experiment template that packages the different fault actions, the targets that will be affected by those actions, and the safeguards you want to enforce while running the experiment. FIS then performs these actions on the AWS resources specified as targets when you start the experiment. You can monitor the experiment using CloudWatch. FIS also integrates with Amazon EventBridge, allowing you to use a monitoring tool of your choice to observe the behavior of the application in question. Experiments, once started, will automatically stop when all of the specified actions are complete, or, as I mentioned earlier, you can optionally configure an alarm, and if it goes off, the actions are stopped. Once the experiment is complete, you can view the results to identify any performance concerns, observability issues, or resilience challenges. You could also use the chaos engineering report as evidence of testing for your compliance group or security department.
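As a rough sketch of the safeguard idea described above, the snippet below creates a hypothetical CloudWatch alarm on API Gateway 5XX errors; its ARN could then be referenced as an FIS stop condition. The alarm name, dimension value, and threshold are illustrative assumptions, not the demo stack's actual values.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical safeguard: alarm when the API starts returning 5XX errors.
cloudwatch.put_metric_alarm(
    AlarmName="sample-api-5xx-errors",          # hypothetical alarm name
    Namespace="AWS/ApiGateway",
    MetricName="5XXError",
    Dimensions=[{"Name": "ApiName", "Value": "sample-crud-api"}],  # placeholder API name
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=5,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)

# The alarm's ARN can then be wired into an FIS experiment template as a
# stop condition, along the lines of:
#   stopConditions=[{"source": "aws:cloudwatch:alarm", "value": alarm_arn}]
```

If the alarm fires mid-experiment, FIS stops the remaining fault actions, which is how the blast radius stays bounded.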
Now that we have seen the high-level capabilities of FIS, let's take a sample application running on an ECS cluster and try to inject network faults. In this case, the sample application is a CRUD API running on an ECS Fargate cluster, reading data from and writing data to an RDS database. We will run network faults against this dependency and see what the behavior of the application looks like. Okay, now you have the AWS console, where we have our sample application up and running. Let's take a look at the CloudFormation stacks for our application and its components. We have an app stack that contains all the infrastructure components needed for our application, including the ECS Fargate cluster, the RDS database, and so on. Then we have a monitoring stack. One of the prerequisites for any type of experimentation is that you have an observability and monitoring mechanism in place for your applications; otherwise, it's the old "if a tree falls in the forest" saying. For our sample application, we will be doing the monitoring using CloudWatch. As I mentioned earlier, FIS integrates with third-party monitoring tools through EventBridge as well. And then we have our experiment stack, which contains the various network fault actions that we will be performing against our sample application. Now let's take a closer look at the application stack and its outputs. One of the outputs is the API URL. This is the endpoint of the sample API that we will be testing with. Now, let's see how this API behaves from an IDE. Let's do a GET on this endpoint. As you can see, the API returns a few items. Basically, the functionality behind the API is to do create, read, update, and delete operations for items, and I was able to get a list of the items currently in the database through this API call. Good. The other prerequisite for experimentation is load generation. As you start on the chaos engineering journey, we recommend that you start with a lower environment, whether that's a testing or a staging environment. The challenge there is that you might not have real user traffic in those environments, so you need to generate some type of synthetic traffic that mimics your real production traffic, so that when you run these experiments, you get a sense of what could happen under real-world conditions. In our case, for the sample application, I'm going to be using the open source tool called Artillery to generate some synthetic load against our API. Basically, it's just going to do a POST against our API endpoint for 20 minutes. Let's go to our Cloud9 IDE console and kick off that load generation script. You can see here that the load generation tool is hitting our API endpoint and getting a 201 Created HTTP status code. That's good. Let's keep the load generation script running here while we go take a look at our ECS cluster. Our compute layer for the API is ECS. As I mentioned in the slides, there are a few constructs in ECS; one of them is a cluster. We have one cluster running in this account, with one service, and this service has a couple of tasks in the running state that are supporting our application workload. Each of these tasks has two containers: one is the app container with the application code, and the other is an SSM agent container, which is one of the prerequisites for running network faults against an ECS Fargate cluster.
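The demo drives load with Artillery (configured separately from this page); as a stand-in, here is a minimal Python sketch of the same idea: posting synthetic items against the API endpoint for a fixed duration. The URL and payload shape are hypothetical placeholders.

```python
import time
import requests  # third-party HTTP client

API_URL = "https://example.execute-api.us-east-1.amazonaws.com/prod/items"  # placeholder endpoint
DURATION_SECONDS = 20 * 60  # run for 20 minutes, like the Artillery script in the demo

deadline = time.time() + DURATION_SECONDS
created = 0

while time.time() < deadline:
    try:
        resp = requests.post(
            API_URL,
            json={"name": f"item-{created}", "price": 9.99},  # hypothetical payload shape
            timeout=5,
        )
        if resp.status_code == 201:  # item created
            created += 1
        else:
            print(f"unexpected status: {resp.status_code}")
    except requests.RequestException as exc:
        # Timeouts and connection errors surface here during fault experiments.
        print(f"request failed: {exc}")
    time.sleep(0.5)  # roughly two requests per second of synthetic load

print(f"created {created} items")
```

Whatever tool you use, the point is the same: keep steady, representative traffic flowing so the fault's effect is visible in the metrics.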
You can refer to the AWS public documentation for the full list of prerequisites for network faults. So we have the cluster, the service, and the tasks; all of that is good. Now let's take a quick peek at our database layer, our RDS endpoint. You can see that we have a MySQL database supporting our application; this is the endpoint of that cluster, and this is the port the database is listening on. We'll be using this information when we run the experiments later. Now that we have looked at the application architecture, let's take a quick peek at our observability tooling as well. We have a dashboard here that we've created through our experiment stack. Let's look at, say, the last 15 minutes or so, where we've been running the load generation script against our application. It's still catching up with the observability metrics for our application, so we'll keep an eye on this and let it catch up with the load generation script while we take a look at the AWS FIS console. Here you can see that, as part of our experiment stack, we have three experiment templates created: one is network latency, another is black hole, and the last one is packet loss. These are the various network faults we looked at during the review of the slides earlier. Now let's dive deep into the network latency experiment. Let me do an update of the template so you can clearly see its various components. The two key concepts we covered for an experiment template are the action and the target. The action specifies the fault you want to test against your workload. As I mentioned earlier, you can sequence a bunch of fault actions, run them in parallel, or mix them up. In our case, we're going to run this network latency fault action against our ECS cluster, and as you can see here, there is a whole set of services that FIS integrates with. For now, let's focus on the network latency action. Based on the action you select, you can specify a certain number of parameters. For this latency action, one item is the delay: how much latency do we want to add to the network traffic between the compute layer and the database layer? In this case, I'm selecting the default, which is 200 milliseconds, and we're going to run this for five minutes. And you can see in the sources field here that I've specified the endpoint of our RDS database; that is the source we want to introduce latency for. Now that we have specified all of that, let's look at the second major component of the template, the target. In our case, the target is a set of ECS tasks. We specify the ECS cluster name and the ECS service name to identify the tasks that FIS should target for this particular experiment. Another key thing here is permissions. As I mentioned earlier, you control the permissions an experiment has through an IAM role. And controlling the blast radius of an experiment is important, so you can specify CloudWatch alarms as stop conditions for an experiment. For this sample application, we're going to leave that blank. Let's go back to the experiment template itself. Before we start running any experiments, we want to make sure that we understand the steady state of our application.
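As a sketch of what the latency experiment template walked through above might look like when created with boto3, the snippet below uses the FIS action aws:ecs:task-network-latency. The role ARN, cluster and service names, source value, and the exact parameter keys (duration, delayMilliseconds, sources) are assumptions based on the console fields mentioned in the talk; verify them against the current FIS action reference.

```python
import boto3

fis = boto3.client("fis")

# Sketch of a latency experiment template (names and parameter keys are assumptions).
template = fis.create_experiment_template(
    clientToken="latency-template-demo-1",
    description="Add 200ms latency between ECS tasks and the RDS endpoint",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",  # placeholder IAM role
    targets={
        "app-tasks": {
            "resourceType": "aws:ecs:task",
            "parameters": {
                "cluster": "sample-app-cluster",   # hypothetical cluster name
                "service": "sample-app-service",   # hypothetical service name
            },
            "selectionMode": "ALL",
        }
    },
    actions={
        "inject-latency": {
            "actionId": "aws:ecs:task-network-latency",
            "parameters": {
                "duration": "PT5M",                 # run the fault for five minutes
                "delayMilliseconds": "200",         # the default delay shown in the demo
                "sources": "db.example.internal",   # placeholder for the RDS endpoint
            },
            "targets": {"Tasks": "app-tasks"},
        }
    },
    stopConditions=[{"source": "none"}],  # the demo leaves stop conditions blank
    tags={"Name": "network-latency-demo"},
)
print(template["experimentTemplate"]["id"])
```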
We have the prerequisites in place in terms of observability, monitoring, and load generation. The next thing is to understand the steady state of the application: what are the values of the key metrics we are interested in? Let's go back and take a look at our CloudWatch dashboards, which should have caught up with the load generation script. Say, for the last five minutes, what do the metrics look like? As we expect, there are no errors from our API. That's the other thing: if there are errors in an application you are trying to run experiments on, you might want to pause the experimentation, go back, and fix those errors. If you have known errors, you would want to address them before starting any experimentation. You can see here that we have two key widgets that we have identified for observability of our application workload. One of them is the API Gateway latency. From a system perspective, say a system operator's perspective, what does the application health look like, and what are the measured values of those metrics? At the same time, you also want to understand the behavior of the application from an end-user perspective. Sometimes there can be a disconnect between what an operator thinks the health of an application is and what an end user experiences, so you want a pane of glass into each of those perspectives to understand the system behavior. That's what we are trying to demonstrate here. From a system perspective, our latency is around 22 to 25 milliseconds, and the P90 value from an end-user perspective is around 75 to 150 milliseconds. That's the steady state of our application, and hopefully these metrics are well within the thresholds we have agreed upon with our end users and application owners as the latency the application will run with. We now have the prerequisites and a steady-state understanding. The next thing before we run any type of experimentation is to come up with a hypothesis, especially in this case with network latency. Our hypothesis is that in the event of latency in the network path between our compute layer and the database layer, the overall latency of the application will go up, but it will still be well within the threshold of what we consider a healthy state for our application. So the latency will bump up, but it will stay within the thresholds we have defined. Let's see if that hypothesis holds true by running this experiment. I'm going to start the experiment, add a tag, and then provide a value for the tag. This will help us differentiate between the various runs of an experiment in case we attempt the same experiment multiple times as we observe different things. Once I start the experiment, the console is going to ask me to confirm that. And once I do, you can see the experiment going through a couple of phases. It starts with an initiating phase, where it collects all the resources it needs, then moves into a running state, and then into either a failed state or a completed state in a minute. And if we look at the actions for this experiment, we only have one action, which is the network latency action.
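The demo starts the experiment and watches its phases in the console; as a sketch of doing the same programmatically, the snippet below starts an experiment from a template ID and polls its state until it finishes. The template ID and tag value are placeholders.

```python
import time
import boto3

fis = boto3.client("fis")

# Start an experiment from an existing template (placeholder ID and tag).
experiment = fis.start_experiment(
    clientToken="latency-run-1",
    experimentTemplateId="EXT1234567890abcdef",      # placeholder template ID
    tags={"run": "latency-baseline-comparison"},     # tag to tell runs apart
)
experiment_id = experiment["experiment"]["id"]

# Poll until the experiment leaves the initiating/running phases.
while True:
    state = fis.get_experiment(id=experiment_id)["experiment"]["state"]
    print(f"status={state['status']}")
    if state["status"] in ("completed", "stopped", "failed"):
        break
    time.sleep(30)
```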
That action is also running; we are running it for five minutes in this particular case. Now, if I look at the resources: if you remember, in the template we only gave the cluster name and the service name, but FIS has identified these two tasks as the ones it should run the latency experiment against. We started this a couple of minutes ago, so we'll give the observability tooling a minute or so to catch up as well. Let's take a look at a couple of other features and functions. Here, under the timeline, you can see the various actions FIS is performing and where they are. With multiple actions, you would see them queued up or running at any time while the execution is ongoing. And if there are any log events associated with this particular experiment run, you can see them here as well. Now that it's been a couple of minutes since we started the experiment, let's take a look at our application monitoring tool. In the steady state, our P90 values were 75 to 150 milliseconds; let's see how they have changed with this experimentation. Yeah, you can see that, as expected, the average latency has gone up from both a system perspective and a real-user monitoring perspective. From 75 to 150 milliseconds, it has now gone up to 262 milliseconds, almost double the measure we saw initially, and it continues to trend up. But we can observe that there are no errors; the application is still functioning well, with no errors being returned from either a user perspective or a system perspective. Now, the big question is: is this increase in latency within the threshold we have agreed upon with our system users? That's a decision you will have to make with the stakeholders. We saw that the latency has gone up; is that latency still acceptable in an event like this? If it is not, you will have to look at some remedial actions. For example, if the latency were to go up beyond a certain threshold, maybe fail over your database from the primary to the standby, or something like that. Those are the potential actions you can take based on these observations. Now let's go back and take a look at that experiment. It should be finishing up here any minute, so we'll let it complete while we look at the next experiment template. Let's try the black hole one. In this case, what we are doing is trying to create a network black hole between our compute layer and the database layer. If I look at the action for this particular experiment template, you can see it is network black hole, along with the parameters this action supports. Again, we're going to run it for five minutes, and this is the port number that our RDS cluster was listening on; I've specified that as one of the parameters for this action, and any TCP traffic coming into that port will be black-holed. In terms of targets, it's still the set of ECS tasks that will be targeted by our experiment. Now let's go back and make sure our previous experiment has completed. Yes, it has. Coming back to the black hole experiment, you already have the prerequisites, observability and load generation, and we've looked at the steady-state metrics of our application as well.
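To compare the steady-state P90 against the elevated values observed during the experiment without clicking through the dashboard, something like the CloudWatch query below could be used. The namespace, metric name, and dimension are assumptions standing in for whatever the demo dashboard actually plots.

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

end = datetime.now(timezone.utc)
start = end - timedelta(minutes=15)

# Pull the p90 latency for the last 15 minutes (namespace/metric/dimension are assumptions).
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApiGateway",
    MetricName="Latency",
    Dimensions=[{"Name": "ApiName", "Value": "sample-crud-api"}],  # placeholder API name
    StartTime=start,
    EndTime=end,
    Period=60,
    ExtendedStatistics=["p90"],
    Unit="Milliseconds",
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["ExtendedStatistics"]["p90"])
```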
The next thing is the hypothesis. So what is our hypothesis for this particular experiment? Here, what we are saying is that if there is a network black hole between our compute layer and the database layer, we believe our application is going to recognize that and return a user-friendly error code or error message to the application users. Now let's see whether that hypothesis holds. Before we do that, let's look at our observability tooling real quick and make sure it is getting back to normal. Yeah, now that our latency experiment is complete, you can see that the metrics are getting back to their original state. This also gives us confidence that our application is healing from that network latency event. Okay, now let's go back to the black hole experiment template, start it, give it a name, and then start the experiment. As we saw earlier, it's going to go through a couple of phases and perform the actions we have specified, and it's going to target the same couple of ECS tasks that are running our application code. We'll give this experiment a minute or so to impact these resources, and then we will see what our application behavior looks like. While that's happening, let's look at the last template here, which follows a similar construct; in this case, it is network packet loss. Let's take a look at the different parameters for that particular action. For the packet loss action, one of the parameters you can specify is the loss percentage. By default it is 7 percent, but you can configure that. And similar to latency, you can specify a source where you want to inject the packet loss. Another important thing I want to call out in terms of sources is that if, say, your application connects to DynamoDB or S3 and uses them as the storage layer, then you can simply specify that literal text in the sources, and FIS will do the IP resolution for you and make sure that any traffic going out to DynamoDB and S3 is impacted by this network fault. You don't have to provide the IP ranges for these services or anything like that; the resolution of these literal values to the appropriate IP addresses is handled behind the scenes by FIS. And, like we did here, you can specify the domain name of, say, a third party. Suppose you have an application dependency that is on premises or running somewhere other than AWS, maybe a SaaS product. Then you can specify the domain names of those dependencies' endpoints that your application communicates with through APIs and add them here. What that does is inject latency or packet loss into that particular network path, which helps you test and validate any potential network faults that can happen with those third-party dependencies as well.
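As a sketch of how the packet-loss action and its sources might be expressed in an experiment template's actions map, assuming the aws:ecs:task-network-packet-loss action and parameter keys suggested by the console fields described above (duration, lossPercent, sources). Treat the keys and the DYNAMODB/S3/domain source values as assumptions to verify against the FIS action reference.

```python
# Sketch of a packet-loss action entry for an FIS experiment template
# (parameter keys and source literals are assumptions, not verified values).
packet_loss_action = {
    "inject-packet-loss": {
        "actionId": "aws:ecs:task-network-packet-loss",
        "parameters": {
            "duration": "PT5M",       # run the fault for five minutes
            "lossPercent": "7",       # the default shown in the demo
            # Sources can reportedly be service literals or domain names,
            # which FIS resolves to IP ranges behind the scenes:
            "sources": "DYNAMODB,S3,partner-api.example.com",
        },
        "targets": {"Tasks": "app-tasks"},  # same ECS task target as before
    }
}
```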
So let me cancel out of the packet loss experiment template, and let's see where our black hole experiment is. Okay, let's check the experiment status. I'm going to refresh the dashboard here. Now you can see that during this particular experiment, we are seeing errors both from a user perspective and from a system perspective. API Gateway is throwing 500 errors, and the end user, in our case Artillery, is actually seeing timeouts. This was not the behavior we expected. We were expecting that our application would handle that black hole between the compute layer and the database layer and return an appropriate error code or error message. So this is an important observation about the application behavior that we want to take back to the drawing board and correct. Those are the two network faults I wanted to show you today. We had a sample API application performing CRUD actions, and we saw the behavior of that application when we injected network latency as well as a network black hole into the workload. With respect to latency, we saw a spike in the average latency and the P90 value for the API endpoint; you would have to check whether those metrics are within your acceptable thresholds. And in the case of the black hole, we do need to do some work on our application to detect that event and return appropriate error messages to end users. Again, thank you for your time. I really had fun talking about this topic with you. Have a good rest of the day.
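A follow-up note on the black hole finding above: one common way to make an application recognize the event and return a friendly error, as the hypothesis expected, is to set explicit connection and read timeouts on the database client and translate them into a clear response instead of letting the request hang. A minimal sketch, assuming a Python service using PyMySQL (not confirmed to be the demo application's actual stack):

```python
import pymysql

def get_items(db_host: str, db_user: str, db_password: str):
    """Fetch items, failing fast instead of hanging when the database is unreachable."""
    try:
        conn = pymysql.connect(
            host=db_host,
            user=db_user,
            password=db_password,
            database="items",        # hypothetical schema name
            connect_timeout=2,       # fail fast if packets are black-holed
            read_timeout=2,
            write_timeout=2,
        )
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT id, name FROM items")
                rows = cur.fetchall()
        finally:
            conn.close()
        return 200, {"items": list(rows)}
    except pymysql.MySQLError:
        # Return a friendly, explicit error instead of letting the caller time out.
        return 503, {"message": "The item store is temporarily unavailable. Please retry shortly."}
```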

Sunil Govindankutty

Enterprise Support Lead @ AWS



