Conf42 Cloud Native 2021 - Online

Improving reliability using health checks and dependency management


We can improve the reliability of services by decoupling dependencies, using health checks, and understanding when to use fail-open and fail-closed behaviours. In this session we’ll talk about and demonstrate how to implement graceful degradation, monitor all the layers of your workload to help detect failures, route traffic only to healthy nodes, use fail-open and fail-closed as appropriate in response to faults, and reduce mean time to recovery.

We’ll take some lessons learnt from the AWS Well-Architected Framework and from the Amazon Builders’ Library, showing some of how Amazon builds and operates its software.


  • Andrew Robinson: We're going to be talking about how we can improve reliability using health checks and dependency management within our applications and our workload. The workload that we're looking at for the purpose of this is a web application. Each of its three EC2 instances is running in a separate availability zone.
  • In the event of a network failure in one of our availability zones, we can still route traffic through another NAT gateway in another availability zone. Deploying to multiple locations helps give us a smaller fault domain or a smaller blast radius for errors that may occur. Finally, we make sure that we're using loosely coupled dependencies.
  • Next we'll have a look at dependencies. When component dependencies are unhealthy, the component should still function. If you can't provide a dynamic response, use a predetermined static response. We should continuously monitor all components that make up our workload to detect failures.
  • If you've heard anything in today's session that interests you, I'd recommend going and having a look at the Amazon Builders' Library. The actual architecture is all available as part of a series of AWS Well-Architected Labs. The labs are all open sourced on GitHub.


This transcript was autogenerated. To make changes, submit a PR.
Hi folks, and welcome to this session as part of the Conf42 Cloud Native conference. In this session today we're going to be talking about how we can improve reliability using health checks and dependency management within our applications and our workload. So let's get stuck into it then. Before we go any further, a quick background on myself. My name is Andrew Robinson. I'm a principal solutions architect at Amazon Web Services and part of the AWS Well-Architected team. In the team that I'm in, we work with our customers and our partners to help them build secure, high-performing, resilient and efficient infrastructure that they can run their applications and workloads on. I've worked in the technology industry for the last 14 years and my background is mainly in infrastructure and reliability. I've also spent some time in data management.
The workload that we're going to be looking at for the purpose of this is a web application. This web application works as a recommendation engine. Let's just have a quick look then at the data flow of how users would walk through this application. So first of all, we start with our users up here. They connect in through an internet gateway, which is our public-facing endpoint. This then sends them to an Elastic Load Balancer. This load balancer will take those incoming connections and distribute them out across a pool of servers, or in this case Amazon EC2 instances. You'll note that we have three separate EC2 instances, and each one of these is running in a separate availability zone. We've got multiple instances so that in the event of a failure of one of those instances, we can still service user requests, and they're in different availability zones so that we've got some level of separation.
If you're not familiar with an availability zone in AWS, think of it as akin to a data center or a collection of data centers that are joined together using high throughput, low latency links to provide you with different geographical areas that you could deploy that workload into within a single AWS region. You'll note that the instances are in an Auto Scaling group. This Auto Scaling group allows us to scale up and scale down to meet demand, improving reliability because we're able to better handle user requests by making sure we've got the appropriate number of resources available. But it also provides us with the ability to replace a failed instance. So if one of these instances has an issue, maybe there's a failure of underlying hardware, maybe there's a configuration issue, and that instance goes offline or becomes unavailable or, as we may see later, fails a health check, we then have the ability to replace that instance automatically to continue serving user requests.
I mentioned that this is a recommendation engine, and the actual recommendation engine that we use is an external API call. So we have a recommendation service that sits external to our application and this is where our recommendations come from. In our case we're using Amazon DynamoDB as a NoSQL key-value database, and this stores our recommendations. But this could be any service external to the workload or application. It could be an external company, maybe a payment provider, or it could be an external service within your own organization that you're calling. Just as a final point, you will also notice that we have these NAT gateways. These provide our EC2 instances or servers with external internet connectivity. This is needed in our case to be able to call our recommendation service.
Let's dive a little bit deeper then into some best practices for how the infrastructure of this has been built, and then we'll dive into the code that's running on those instances and show you how we can implement some of those health checks and dependency monitoring that we mentioned. So, some best practices to get us started. First, high availability on network connectivity. At AWS we take care of some of this for you. At the physical level, we make sure that there are multiple connections going into our different data centers that make up our availability zones. However, at the logical level you will still need to do some implementation. As you could see from earlier, we've got multiple NAT gateways that we use here. So in the event of a network failure in one of our availability zones, we can still route traffic through another NAT gateway in another availability zone, so that instance can still communicate with our external API service.
We also want to deploy to multiple locations. This helps give us a smaller fault domain or a smaller blast radius for errors that may occur, and also means that in the event of a failure in one of those areas, we can still continue to service our application. We do have this done at the availability zone level in this case, and as I mentioned earlier, an availability zone is a collection of data centers or a single data center that has low latency, high throughput connectivity to other data centers in the same availability zone. So you could think of it as akin to a single fault domain if you need to. For workload purposes, you may want to look at deploying this to multiple geographic regions across the globe, maybe to service users in different areas, but that can also help to improve reliability. It does come with some additional management overhead, though, because you then have multiple AWS regions that you're managing your workload within. Finally, making sure that we're using loosely coupled dependencies.
We've done this in this scenario by placing our API call for our recommendation engine external to the actual application that's running on our instances. We could go further and use systems like queuing or data flows to be able to stream data in some way, and that would then help create the additional loose coupling that we need within there. But for this purpose, we're just making an external API call.
So let's jump into the code that's running on these instances. And if you're not a developer, I'm not either. Please don't worry, we're not going to be going through all the code on here. We're just going to be looking at some extracts and I'll be explaining what all of this code does. The code in question here is all written in Python. This is the language that I'm the most familiar with, which is the reason that I chose it. You can achieve the same thing in multiple other different languages, but for the purpose of this, I've just chosen Python. I find it the easiest one to explain, and hopefully that will make it easier for everybody to understand what we're trying to achieve here.
So, our first basic health check that we're doing with our load balancer. We can have a path that we specify in our load balancer that the load balancer uses to check the health of those servers or instances it's connecting to. In this case, this is on the /healthcheck path. So we're looking for any connections coming in on that /healthcheck path, and if they do come in, we're sending a 200 response, an HTTP 200 response, which is a success. That means that when our load balancer does its health check routinely, and we can specify what period that happens on, every 30 seconds, every 60 seconds, if it then successfully connects to this URL with the health check path for that instance, we'll get a 200 response back. So that gives us some idea that the instance and the application are running.
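To make the basic health check described above concrete, here is a minimal sketch using only the Python standard library. It is not the code shown in the talk: the path value `/healthcheck`, the names `shallow_health_check`, `HealthCheckHandler` and `serve`, and the port are all assumptions chosen for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed path; it must match whatever health check path the load balancer is configured with.
HEALTHCHECK_PATH = "/healthcheck"


def shallow_health_check(path):
    """Return (status_code, html_body) for a request path."""
    if path == HEALTHCHECK_PATH:
        # Reaching this line only proves the process is up and answering HTTP.
        return 200, "<h1>Success</h1>"
    return 404, "<h1>Not found</h1>"


class HealthCheckHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers the load balancer's periodic probe."""

    def do_GET(self):
        status, body = shallow_health_check(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Start the server; this call blocks until the process is stopped."""
    HTTPServer(("", port), HealthCheckHandler).serve_forever()
```

Calling `serve()` and requesting `/healthcheck` should return a 200. Note that this check says nothing about whether the application's dependencies are reachable.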
But this is a fairly simple health check, and it only tells us that. What could we do to make this a little bit more meaningful? So we could look at doing deeper health checks, and that's what we'll have a look at here. With this, we're still looking on our health check path for any health check connections coming in from our load balancer. What we're doing is we're setting this variable called is_healthy immediately to false, and we do that, as you'll see later on, so that if anything goes wrong with our health check process, the load balancer will get an error in return. It will get an error code from the instance, and that means it knows the instance isn't healthy. What we actually then do is we use a try statement to make our call. So we're making our get recommendation call as part of our health check. So our health check is now going to be checking on the health of our dependency as well as the health of the actual application itself. So we're just looking then for a response. And as I mentioned, we're looking for a TV show and a username that our recommendation engine provides. If we get this back, we set our is_healthy based on those test values so that we've now no longer got that false value that's set there. And then we just have an exception clause that just catches any errors that we've got and provides us with a traceback error code that we can use.
Carrying on, still wrapped here in the same health check statement, we just have an if and an else statement. So our if looks at is_healthy, and if there's a value that that's been set to, we send a 200 response, the content type is HTML, and we set a message as success and send some metadata. So that means that we're not only checking that our application and therefore our instance is healthy, we're now also checking that we can successfully call that external API, meaning that that external API is healthy because it's providing us with a valid response. We then just have an else statement.
So if anything else happens, we send a 503 error and we then include that exception error message from previously so that we know what this error is. In this case, we'll be sending that 503 error back to our load balancer and our load balancer will then mark this instance or server as being unhealthy and won't send any traffic to it.
I mentioned there that it won't send traffic to this instance. This is a behavior that we call fail closed. This means that in the event of that instance being unhealthy, the load balancer will no longer send any traffic to it. So you can see here that this instance is marked now as being unhealthy, with health checks failed with a code 503. The two other instances are still showing as healthy, so any users connecting in will be sent to those two instances and they'll still be able to use the application as before. But the load balancer will not send any traffic to that instance. We then have a choice. We can either choose to replace that instance, or we could have a countdown timer for the number of failed health checks before we decide to replace it. If we replace it straight away, that of course means that the instance is taken out of service and then a new one will be built to replace it. However, if we wait until the instance is back healthy again, we could then resume sending traffic to it.
What happens, though, if all the instances within our load balancer fail? So in this case, we revert to a behavior called fail open. And this means that requests will be sent to all targets. All targets are unhealthy, meaning all of them have failed their health check, but because our load balancer is configured to fail open, we will route those requests to all targets. Now sometimes that's helpful. For example, if you have an external dependency that you're making a call on, which may be slow to respond, you may have instances that flap in and out of being healthy or unhealthy.
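Returning to the deeper health check walked through above (start as unhealthy, try the dependency call, answer 200 on success and 503 with the traceback otherwise), a rough sketch might look like the following. The function `get_recommendation`, the response keys `TVShow` and `UserName`, and the test user id `"0"` are hypothetical stand-ins for the real recommendation service call, not the code from the talk.

```python
import html
import traceback


def get_recommendation(user_id):
    """Hypothetical stand-in for the external recommendation service call."""
    return {"TVShow": "test", "UserName": "test"}


def deep_health_check(fetch=get_recommendation):
    """Return (status_code, html_body); fetch is the dependency call to probe."""
    is_healthy = False  # start unhealthy so any surprise below ends in a 503
    error_msg = ""
    try:
        response = fetch("0")  # assumed id of a predefined test user
        tv_show = response["TVShow"]
        user_name = response["UserName"]
        # Healthy only if the dependency returned both expected fields.
        is_healthy = bool(tv_show) and bool(user_name)
    except Exception:
        # Keep the traceback so whoever reads the failed check can diagnose it.
        error_msg = traceback.format_exc()

    if is_healthy:
        return 200, "<h1>Success</h1>"
    return 503, "<h1>Unhealthy</h1><pre>" + html.escape(error_msg) + "</pre>"
```

Passing the dependency call in as `fetch` is a design choice made here so the failure path can be exercised without a real service, for example by passing a function that raises.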
Now, as those instances are flapping, that might not trigger the threshold to be able to take that single instance out, and you may end up with all of the instances flapping at the same time, all going unhealthy at the same time, and then your application isn't available. We have this standard behavior as fail open. We need to make sure that we're testing our dependencies. We need to make sure that we're doing partial testing as well, so that we're testing both a standard health check and also the deeper health check, so that we've got a real true idea of what those instances are doing and what the application state is.
Next we'll have a look at dependencies. As we mentioned earlier, we've got a dependency within our application, which is our external API that we're making a call to to get our recommendations from. In our code here, when we get a request that comes in from a user, we're making a call to this get recommendation engine, and then we're parsing the value of those responses from a TV show and a username that we get back from our recommendation engine. Now this is called a hard dependency, because if this dependency call fails, users will get an HTTP 502 or 503 error. That means that they can't actually get anywhere with our application. It doesn't work, it won't do what they need it to. What we can do is change this into a soft dependency. So we would have a try statement, and in here we'd have that same code that we just had on our previous slide that would try to make that recommendation call, but this time we add an exception clause. What this means is if that call fails for any reason, we would provide the customer with a static response, and we'd recommend the TV show I Love Lucy to them. Now, we would then also provide them with some diagnostic information on their browser, which would just say we can't provide you with a personalized recommendation at the moment.
If this problem persists, please contact us, and then we'd provide them with details of the error code. Now, yes, this does mean they're not going to get a personalized recommendation, but it does mean that they'll still be able to access that application. And if this application forms part of a larger app that you're building, the whole application will still continue to function. The recommendation engine just might not recommend them exactly what they want, but they'd know that, and we'd be giving them a predefined static response, meaning that they can still continue to do what they need to do.
So having a look then at some of these best practices from our previous slides that we went through here. When component dependencies are unhealthy, the component should still function, but in a degraded manner, as we saw in our previous example. If you can't provide a dynamic response, use a predetermined static response so that you can still provide your customers and users with something. We should continuously monitor all components that make up our workload to detect failures, and then use some type of automated system to be able to become aware of that degradation and take appropriate action. In our case, that's our load balancer. Our load balancer detects that an instance has become unhealthy and then removes that instance from being able to have traffic sent to it. We also use our load balancer to make sure that if we have an unhealthy or a failed resource, we have healthy resources available so that we can continue to serve requests, and our Auto Scaling group helps to provide those resources that our load balancer can then send the traffic to. You'll note the health check that we had earlier operates on the application layer or the data plane. This indicates the capability of the application rather than the underlying infrastructure. We want to make sure that our application is running rather than focusing our health checks on the underlying infrastructure.
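The soft dependency described earlier, where a failed recommendation call falls back to a static response plus some diagnostic detail, could be sketched roughly as below. The static show is the one named in the talk; the function name `recommend`, the response keys, the placeholder user name and the shape of the returned dictionary are assumptions made for illustration.

```python
import traceback

STATIC_TV_SHOW = "I Love Lucy"        # static fallback named in the talk
STATIC_USER_NAME = "Valued Customer"  # assumed placeholder name
NOTICE = (
    "We can't provide you with a personalized recommendation at the moment. "
    "If this problem persists, please contact us."
)


def recommend(user_id, fetch):
    """Return a recommendation, degrading to a static one if fetch fails.

    fetch is a hypothetical callable wrapping the external service call.
    """
    try:
        response = fetch(user_id)
        return {
            "tv_show": response["TVShow"],
            "user_name": response["UserName"],
            "personalized": True,
            "diagnostic": "",
        }
    except Exception:
        # Soft dependency: the page still renders, just without personalization.
        return {
            "tv_show": STATIC_TV_SHOW,
            "user_name": STATIC_USER_NAME,
            "personalized": False,
            "diagnostic": NOTICE + "\n" + traceback.format_exc(),
        }
```

With a hard dependency the same exception would propagate and the user would see an error page; here the caller always gets something it can render.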
A health check URL, as you saw in our examples earlier, should be configured to be used by the load balancer so that we can check the health status of the application. We should also look at having processes that can be automatically and rapidly brought in to mitigate the availability impact on our workload. It should remove the cognitive burden that we place on our engineers so that when we're looking at these errors, we have enough information to go on to make an informed decision about what the problem is. An example of how we can do this is by providing that static response and including the exception traceback errors in any error messages that we provide so that we have more detail on what the application issue is. We should also look at fail open and fail closed behaviors. When an individual server fails a health check, the load balancer should stop sending traffic immediately to that server or instance. But when all servers fail, we should revert back to fail open and allow traffic to all servers. As I mentioned, there are some caveats around this, and we need to make sure that we're testing the different failure modes on the health check of the dependencies so that we can see exactly what's going on within our application.
To wrap up, some conclusions. You will find servers and software fail for very different and very weird reasons. Physical servers will have hardware failures. Software will have bugs. Multiple layers of checks, from lightweight passive monitoring to deeper health checks, are needed to catch the different types of unexpected errors that we can see. When a failure happens, we should detect it, take the affected server out of service as quickly as we can, and make sure that we're sending traffic just to those healthy instances. Doing this level of automation against an entire fleet of servers does come with some caveats.
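As a toy illustration of the fail closed and fail open rules recapped above, the sketch below models how a pool might decide which targets receive traffic. It is a deliberate simplification and not how Elastic Load Balancing is implemented; the class `TargetPool`, its method names and the `unhealthy_threshold` default are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class TargetPool:
    """Hypothetical model of a load balancer's view of its targets."""

    unhealthy_threshold: int = 2  # consecutive failed checks before a target is pulled
    failures: dict = field(default_factory=dict)  # target -> consecutive failures

    def record(self, target, status_code):
        """Record one health check result for a target."""
        if status_code == 200:
            self.failures[target] = 0
        else:
            self.failures[target] = self.failures.get(target, 0) + 1

    def is_healthy(self, target):
        return self.failures.get(target, 0) < self.unhealthy_threshold

    def routable(self):
        """Targets that should receive traffic right now."""
        healthy = [t for t in self.failures if self.is_healthy(t)]
        if healthy:
            return healthy          # fail closed: unhealthy targets get nothing
        return list(self.failures)  # fail open: everything failed, so route to all
```

A real load balancer typically also requires several consecutive successes before it marks a target healthy again, which this sketch leaves out for brevity.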
We should use rate-limited thresholds or circuit breakers to turn off automation and bring humans into the decision-making process. As in the example where we use fail open, we're providing that safety net so that we could then bring a human in to help us with the diagnosis of this. So using these fail open behaviors when we have all servers in a pool unhealthy can really help to provide that additional safety net that we could need here to make sure that we're able to correctly diagnose this issue and therefore return our application to a healthy state.
So finally, a couple of calls to action, folks. If you've heard anything in today's session that interests you, I'd recommend going and having a look at the Amazon Builders' Library and this specific article on implementing health checks. It goes into much more detail than what I've covered in this session, and talks through a little bit more about how Amazon uses some of these technologies to improve the reliability of the workloads that our customers use. There's also a collection of multiple other articles in the Builders' Library that will help you with understanding how Amazon and AWS implement some of these best practices. The actual architecture that I went through today and the code are all available as part of a series of AWS Well-Architected Labs that we've published. The specific lab for this session is available following this link, but there's a collection of over 100 labs covering all of the different areas of reliability, security, cost optimization, performance efficiency, and operational excellence that you can go and access. The labs are all open sourced on GitHub, so you can take the code and you can use it to help yourself learn.
With that, I'd like to thank you all for your time attending this session today. I really hope you do enjoy the rest of the Conf42 Cloud Native conference and look forward to seeing you at the next one. Thanks again everybody, and bye.

Andrew Robinson

Principal Solutions Architect @ AWS
