Conf42 Cloud Native 2021 - Online

Static stability using availability zones



In this session, we’ll define a pattern that we use called static stability to achieve extremely high HA targets. We’ll show you how we apply this concept to Availability Zones, a key infrastructure building block in AWS and therefore a bedrock dependency on which all of our services are built.


  • Eduardo Janicas talks about how AWS achieves static stability using Availability Zones. He describes how Amazon Elastic Compute Cloud (EC2) is built to be statically stable, and goes a bit deeper into some of the design philosophy behind Amazon EC2.
  • Each AWS region has multiple Availability Zones. The Amazon EC2 service consists of a control plane and a data plane. When building systems on top of Availability Zones, one lesson is to be ready for impairments before they happen.
  • EC2 follows the Availability Zone independence principle in its deployment practices. All packet flows are designed to stay within an Availability Zone rather than crossing boundaries. Understanding these concepts helps when building a service that not only needs to be highly available itself, but also provides infrastructure on which others can be highly available.


This transcript was autogenerated. To make changes, submit a PR.
Hi, I'm Eduardo Janicas, and I'm a solutions architect at AWS. I've been here for over two and a half years, and I have a background in networking and operations. I'm going to talk about how AWS achieves static stability using Availability Zones. The services we build at Amazon must meet extremely high availability targets, which means we think carefully about the dependencies our systems take. We design our systems to stay resilient even when those dependencies are impaired. In this talk, we're going to define a pattern that we use, called static stability, to achieve this level of resilience. We'll show you how we apply this concept to Availability Zones, which are a key infrastructure building block in AWS and therefore a bedrock dependency on which all of our services are built. We will describe how we built Amazon Elastic Compute Cloud, or EC2, to be statically stable. Then we're going to provide two statically stable example architectures that we have found useful for building highly available regional systems on top of Availability Zones. And finally, we're going to go a bit deeper into some of the design philosophy behind Amazon EC2, including how it is architected to provide Availability Zone independence at the software level. In addition, we're going to discuss some of the tradeoffs that come with this choice of architecture.

First, let's explore and understand the AWS cloud infrastructure. A region is a physical location in the world where we have multiple Availability Zones. Availability Zones, or AZs, consist of one or more discrete data centers, each with redundant power, networking, and connectivity, housed in separate facilities. Availability Zones sit on isolated fault lines, floodplains, networks, and electrical grids to substantially reduce the chance of simultaneous failure.
This provides the resilience of real-time data replication and the reliability of multiple physical locations. AWS has the largest global infrastructure footprint, with 25 regions, 80 Availability Zones each containing one or more data centers, and over 230 points of presence including 218 edge locations, and this footprint is constantly increasing at a significant rate.

Each AWS region has multiple Availability Zones, and each Availability Zone has multiple physically separated data centers. Each region also has two independent, fully redundant transit centers that allow traffic to cross the AWS network, enabling regions to connect to the global network. Further, we don't use other backbone providers for AWS traffic once it hits our backbone. Each Availability Zone is a fully isolated partition of the AWS global infrastructure. This means that it is physically separated from any other Availability Zone by a meaningful distance, such as many kilometers, and each Availability Zone has its own power infrastructure. Availability Zones thus give customers the ability to operate production applications and databases that are more highly available, fault tolerant, and scalable than would be possible from a single data center. All Availability Zones are interconnected with high-bandwidth, low-latency networking over fully redundant, dedicated metro fiber.

When interacting with an AWS service that provisions cloud infrastructure inside an Amazon Virtual Private Cloud, or VPC, many of these services require the caller to specify not only a region but also an Availability Zone. The Availability Zone is often specified implicitly in a required subnet argument, for example when launching an EC2 instance, provisioning an Amazon Relational Database Service (RDS) database, or creating an Amazon ElastiCache cluster.
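Since a single subnet lives entirely within one Availability Zone, passing a subnet argument pins the Availability Zone implicitly. A minimal sketch of that relationship (the subnet and AZ identifiers here are hypothetical, chosen for illustration):

```python
# A subnet lives entirely within one Availability Zone, so a subnet
# argument to a launch or provisioning call implicitly selects the AZ.
# Subnet and AZ IDs below are made-up examples.
SUBNET_TO_AZ = {
    "subnet-aaaa1111": "eu-west-1a",
    "subnet-bbbb2222": "eu-west-1b",
    "subnet-cccc3333": "eu-west-1c",
}

def az_for_launch(subnet_id: str) -> str:
    """Resolve the Availability Zone implied by a subnet argument."""
    return SUBNET_TO_AZ[subnet_id]

print(az_for_launch("subnet-bbbb2222"))  # eu-west-1b
```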
Although it's common to have multiple subnets in an Availability Zone, a single subnet lives entirely within one Availability Zone, so by providing a subnet argument, the caller is also implicitly providing an Availability Zone.

To better illustrate the property of static stability, let's look at Amazon EC2, which is itself designed according to these principles. When building systems on top of Availability Zones, one lesson we have learned is to be ready for impairments before they happen. A less effective approach might be to deploy to multiple Availability Zones with the expectation that, should there be an impairment within one Availability Zone, the service will scale up (perhaps using AWS Auto Scaling) in the other Availability Zones and be restored to full health. This approach is less effective because it relies on reacting to impairments as they happen, rather than being prepared for those impairments before they happen. In other words, it lacks static stability. In contrast, a more effective, statically stable service would over-provision its infrastructure to the point where it would continue operating correctly without having to launch any new EC2 instances, even if an Availability Zone were to become impaired.

The Amazon EC2 service consists of a control plane and a data plane. Control plane and data plane are terms of art from networking, but we use them all over the place in AWS. A control plane is the machinery involved in making changes to a system (adding resources, deleting resources, modifying resources) and getting those changes propagated to wherever they need to go to take effect. A data plane, in contrast, is the daily business of those resources: what it takes for them to function. In Amazon EC2, the control plane is everything that happens when EC2 launches a new instance. The logic of the control plane pulls together everything needed for a new EC2 instance by performing numerous tasks.
The following are a few examples: it finds a physical server for the compute while respecting placement groups and VPC tenancy requirements; it allocates a network interface out of the VPC subnet; it prepares an EBS volume; it generates IAM role credentials; it installs security groups; it stores the result in the data stores of the various downstream services; and it propagates the needed configuration to the server in the VPC and to the network edge. In contrast, the Amazon EC2 data plane keeps existing EC2 instances humming along as expected, performing tasks such as routing packets according to VPC route tables, reading from and writing to EBS volumes, and so on. As is usually the case with data planes and control planes, the Amazon EC2 data plane is far simpler than the control plane. As a result of this relative simplicity, the EC2 data plane design targets a higher availability than that of the EC2 control plane.

The concepts of control planes, data planes, and static stability are broadly applicable even beyond Amazon EC2. Being able to decompose a system into its control plane and data plane can be a helpful conceptual tool for designing highly available services, for a number of reasons. It's typical for the availability of the data plane to be even more critical to the success of customers than that of the control plane. For instance, the continued availability and correct functioning of an EC2 instance after it is running is even more important to most of you than the ability to launch a new EC2 instance. It's typical for the data plane to operate at a higher volume, often by orders of magnitude, than its control plane, so it's better to keep them separate so that each can be scaled according to its own relevant scaling dimensions. And we have found over the years that a system's control plane tends to have more moving parts than its data plane, so it's statistically more likely to become impaired for that reason alone.
So, putting those considerations together, our best practice is to separate systems along the control plane and data plane boundary. To achieve this separation in practice, we apply principles of static stability. A data plane typically depends on data that arrives from the control plane. However, to achieve a higher availability target, the data plane maintains its existing state and continues working even in the face of a control plane impairment. The data plane might not get updates during the period of impairment, but everything that had been working before continues to work.

Earlier, we noted that a scheme requiring the replacement of an EC2 instance in response to an Availability Zone impairment is a less effective approach. It's not because we won't be able to launch the new EC2 instance; it's because, in response to an impairment, the system has to take an immediate dependency in the recovery path on the Amazon EC2 control plane, plus all of the application-specific systems that are necessary for a new instance to start performing useful work. Depending on the application, these dependencies could include steps such as downloading runtime configuration, registering the instance with discovery services, acquiring credentials, et cetera. The control plane systems are necessarily more complex than those in the data plane, and they have a greater chance of not behaving correctly when the overall system is impaired.

Several AWS services are internally composed of a horizontally scalable, stateless fleet of EC2 instances or Amazon Elastic Container Service (ECS) containers. We run these services in an Auto Scaling group across three or more Availability Zones. Additionally, these services over-provision capacity so that even if an entire Availability Zone were impaired, the servers in the remaining Availability Zones could carry the load. For example, when we use three Availability Zones, we over-provision by 50%.
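The static stability idea in this section, that the data plane keeps serving from its last-known state when the control plane is impaired, can be sketched as follows. The class and names here are invented for illustration, not an actual AWS implementation:

```python
class StaticallyStableDataPlane:
    """Keeps working from the last-known configuration even when the
    control plane (the source of configuration updates) is impaired."""

    def __init__(self, control_plane_fetch, initial_config):
        self._fetch = control_plane_fetch  # callable that may raise
        self._config = initial_config      # last-known-good state

    def refresh(self):
        """Try to pick up new config; on failure, keep the old one."""
        try:
            self._config = self._fetch()
        except Exception:
            pass  # control plane impaired: no updates, but no outage

    def route(self, packet):
        """Data-plane work continues with whatever config we hold."""
        return self._config["next_hop"], packet


def broken_control_plane():
    """Simulates an impaired control plane."""
    raise ConnectionError("control plane impaired")

dp = StaticallyStableDataPlane(broken_control_plane, {"next_hop": "az-local"})
dp.refresh()               # fetch fails silently; existing state retained
print(dp.route(b"hello"))  # ('az-local', b'hello')
```

The key design point is that the failure of `refresh` affects only updates; the `route` path (the data plane's daily business) takes no dependency on the control plane being up.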
Put another way, we over-provision such that each Availability Zone is operating at only 66% of the level for which we have load-tested it. The most common example is a load-balanced HTTPS service. The following diagram shows a public-facing Application Load Balancer providing an HTTPS service across three Availability Zones. The target of the load balancer is an Auto Scaling group that spans the three Availability Zones in the eu-west-1 region. This is an example of active-active high availability using Availability Zones.

In the event of an Availability Zone impairment, the architecture shown in the preceding diagram requires no action. The EC2 instances in the impaired Availability Zone will start failing health checks, and the Application Load Balancer will shift traffic away from them. In fact, the Elastic Load Balancing service is designed according to this principle: it has provisioned enough load-balancing capacity to withstand an Availability Zone impairment without needing to scale up. We also use this pattern even when there is no load balancer or HTTPS service. For instance, a fleet of EC2 instances that processes messages from an Amazon Simple Queue Service (SQS) queue can follow this pattern too. The instances are deployed in an Auto Scaling group across multiple Availability Zones, appropriately over-provisioned. In the event of an impaired Availability Zone, the service does nothing: the impaired instances stop doing their work, and others pick up the slack.

Some services we build are stateful and require a single primary or leader node to coordinate the work. An example of such a service is a relational database such as Amazon RDS with a MySQL or PostgreSQL database engine. A typical high-availability setup for this kind of relational database has a primary instance, to which all writes must go, and a standby candidate. We might also have additional read replicas, which are not shown in this diagram.
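The 50% and 66% figures above follow from simple arithmetic: with N Availability Zones, surviving the loss of one zone means each zone runs at only (N-1)/N of its load-tested capacity, which requires N/(N-1) of nominal capacity overall. A quick sketch of those numbers:

```python
def overprovision_factor(n_azs: int) -> float:
    """Capacity multiplier so the remaining zones can absorb a lost one:
    n/(n-1) of nominal load, i.e. 1.5x (50% extra) for three AZs."""
    return n_azs / (n_azs - 1)

def steady_state_utilization(n_azs: int) -> float:
    """Fraction of load-tested capacity each AZ uses in normal operation."""
    return (n_azs - 1) / n_azs

assert abs(overprovision_factor(3) - 1.5) < 1e-9        # 50% over-provisioned
assert abs(steady_state_utilization(3) - 2 / 3) < 1e-9  # ~66% utilized
```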
When we work with stateful infrastructure like this, there will be a warm standby node in a different Availability Zone from that of the primary node. The following diagram shows an Amazon RDS database. When we provision a database with Amazon RDS, it requires a subnet group: a set of subnets spanning multiple Availability Zones into which the database instances will be provisioned. Amazon RDS puts the standby candidate in a different Availability Zone from the primary node. This is an example of active-standby high availability using Availability Zones.

As was the case with the stateless active-active example, when the Availability Zone with the primary node becomes impaired, the stateful service does nothing with the infrastructure. For services that use Amazon RDS, RDS will manage the failover and repoint the DNS name to the new primary in the working Availability Zone. This pattern also applies to other active-standby setups, even if they do not use a relational database. In particular, we apply it to systems with a cluster architecture that has a leader node. We deploy these clusters across Availability Zones and elect a new leader node from a standby candidate, instead of launching a replacement just in time.

What these two patterns have in common is that both of them have already provisioned the capacity they need well in advance of any Availability Zone impairment. In neither of these cases does the service take any deliberate control plane dependencies, such as provisioning new infrastructure or making modifications, in response to an Availability Zone issue.

This final section of the talk will go one level deeper into resilient Availability Zone architectures, covering some of the ways in which we follow the Availability Zone independence principle in Amazon EC2.
Understanding these concepts is helpful when we build a service that not only needs to be highly available itself, but also needs to provide infrastructure on which others can be highly available. EC2, as a provider of low-level AWS infrastructure, is infrastructure that applications can use to be highly available, and there are times when other systems might wish to adopt that strategy as well.

We follow the Availability Zone independence principle in EC2 in our deployment practices. In EC2, software is deployed to the physical servers hosting EC2 instances, edge devices, DNS resolvers, control plane components in the EC2 instance launch path, and many other components upon which EC2 instances depend. These deployments follow a zonal deployment calendar. This means that two Availability Zones in the same region will receive a given deployment on different days. Across AWS, we use phased rollouts of deployments. For instance, we follow the best practice, regardless of the type of service we deploy, of first deploying to one box, then to 1/N of servers, et cetera. However, in the specific case of services like those in Amazon EC2, our deployments go one step further and are deliberately aligned to Availability Zone boundaries. That way, a problem with a deployment affects one Availability Zone and is rolled back and fixed there; it doesn't affect any other Availability Zones, which continue functioning as normal.

Another way we use the principle of independent Availability Zones when we build Amazon EC2 is to design all packet flows to stay within an Availability Zone rather than crossing boundaries. This second point, that network traffic is kept local to the Availability Zone, is worth exploring in more detail. It is an interesting illustration of how we think differently when building a regional, highly available system that is a consumer of independent Availability Zones.
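The zonal deployment calendar described above can be sketched as assigning each Availability Zone in a region its own deployment day, so no two zones receive a given deployment at the same time. This is a simplified illustration, not the actual deployment tooling:

```python
def zonal_calendar(azs, start_day=1):
    """Assign each AZ its own deployment day: the i-th AZ deploys on day
    start_day + i, so no two AZs in a region deploy on the same day,
    limiting a bad deployment's blast radius to one zone at a time."""
    return {az: start_day + i for i, az in enumerate(azs)}

calendar = zonal_calendar(["eu-west-1a", "eu-west-1b", "eu-west-1c"])
# Every AZ gets a distinct day.
assert len(set(calendar.values())) == len(calendar)
```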
That is, it uses guarantees of Availability Zone independence as a foundation for building a highly available service, as opposed to when we provide Availability Zone independent infrastructure that allows others to build for high availability. The following diagram illustrates a highly available external service, shown in orange, that depends on another, internal service, shown in green. A straightforward design treats both of these services as consumers of independent EC2 Availability Zones. Each of the orange and green services is fronted by an Application Load Balancer, and each service has a well-provisioned fleet of backend hosts spread across three Availability Zones. One highly available regional service calls another highly available regional service. This is a simple design, and for many of the services we've built, it is a good design.

Suppose, however, that the green service is a foundational service. That is, suppose it is intended not only to be highly available, but also, itself, to serve as a building block for providing Availability Zone independence. In that case, we might instead design it as three instances of a zone-local service, on which we follow Availability Zone aware deployment practices. The following diagram illustrates the design in which a highly available regional service calls a highly available zonal service.

The reasons why we design our building-block services to be Availability Zone independent come down to simple arithmetic. Let's say an Availability Zone is impaired. For black-and-white failures, the Application Load Balancer will automatically fail away from the affected nodes. However, not all failures are so obvious. There can be gray failures, such as bugs in the software, which the load balancer won't be able to see in its health checks and cleanly handle.
In the example where one highly available regional service calls another highly available regional service, if a request is sent through the system, then with some simplifying assumptions, the chance of the request avoiding the impaired Availability Zone is 2/3 times 2/3, which is 4/9. That is, the request has worse than even odds of steering clear of the event. In contrast, if we build the green service to be a zonal service, then the hosts in the orange service can call the green endpoint in the same Availability Zone. With this architecture, the chances of avoiding the impaired Availability Zone are 2/3. If N services are part of this call path, these numbers generalize to (2/3)^N for N regional services, versus a constant 2/3 for N zonal services.

It is for this reason that we built the Amazon EC2 NAT gateway as a zonal service. The NAT gateway is an Amazon EC2 feature that allows for outbound Internet traffic from a private subnet; it appears not as a regional, VPC-wide gateway, but as a zonal resource that customers instantiate separately per Availability Zone. As shown in this diagram, the NAT gateway sits in the path of the Internet connectivity for the VPC and is therefore part of the data plane of any EC2 instance within that VPC. If there is a connectivity impairment in one Availability Zone, we want to keep that impairment inside that Availability Zone, rather than spreading it to other zones. In the end, we want a customer who builds an architecture similar to the one mentioned earlier (that is, a fleet across three Availability Zones with enough capacity in any two to carry the full load) to know that the other Availability Zones will be completely unaffected by anything going on in the impaired Availability Zone.
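The arithmetic in this example can be checked directly: each regional hop independently avoids the impaired zone with probability 2/3, so N chained regional services give (2/3)^N, while zone-local calls keep the odds at 2/3 regardless of call depth:

```python
def p_avoid_regional(n_services: int, n_azs: int = 3) -> float:
    """Each regional hop independently lands in a healthy AZ with
    probability (n-1)/n, so the odds compound across N services."""
    return ((n_azs - 1) / n_azs) ** n_services

def p_avoid_zonal(n_azs: int = 3) -> float:
    """Zone-local calls never cross AZ boundaries, so the odds stay
    at (n-1)/n no matter how many services are in the call path."""
    return (n_azs - 1) / n_azs

assert abs(p_avoid_regional(2) - 4 / 9) < 1e-9  # 2/3 * 2/3 = 4/9
assert p_avoid_regional(2) < p_avoid_zonal()    # 4/9 < 2/3
```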
The only way for us to do this is to ensure that all foundational components like the NAT gateway really do stay within one Availability Zone.

So, some lessons learned: when designing a service-oriented architecture that will run on AWS, we have learned to use one of these patterns, or a combination of both. The simpler pattern, regional-calls-regional, is often the best choice for external-facing services and appropriate for most internal services as well. For instance, when building higher-level application services in AWS, such as Amazon API Gateway and AWS serverless technologies, we use this pattern to provide high availability even in the face of an Availability Zone impairment. The more complex patterns, regional-calls-zonal and zonal-calls-zonal, are used when designing internal, and in some cases external, data plane components within Amazon EC2, for instance network appliances or other infrastructure that sits directly in the critical data path. There we follow the pattern of Availability Zone independence and use instances that are siloed in Availability Zones, so that network traffic remains in its same Availability Zone. This pattern not only helps keep impairments isolated to an Availability Zone, but also has favorable network traffic cost characteristics in AWS.

Thank you for listening to my talk, and I hope you found it useful. If this topic interests you, you can find more deep-dive articles in the Amazon Builders' Library on the public AWS website. Have a great day.

Eduardo Janicas

Solutions Architect @ AWS

