Conf42 Site Reliability Engineering 2022 - Online

Optimize end user connectivity for multi-region architectures

Abstract

Enhancing global user experience, meeting data residency requirements, and ensuring business continuity are a few reasons for building multi-region applications on AWS. Consistency in application performance and availability for end users is a key consideration in multi-region architecture design.

In this session, we cover how to optimize the availability and performance of end user connectivity to multi-region applications on AWS. We provide best practice guidance based on the type of application and go over practical use cases leveraging AWS Global Accelerator and Amazon Route 53. We dive deep into their key features benefiting multi-region architectures, along with a demo.

The session is targeted at cloud teams looking to build performant and resilient multi-region architectures on AWS for their end users, including those in regulated industries where security and reliability are critical, such as financial services and healthcare.

Summary

  • Christian Elsen is a specialist solutions architect for networking at AWS. He previously worked in other networking roles for about 15 years, spanning data center switching, network virtualization, and global content distribution networks.
  • Today we are going to talk about optimizing end user connectivity for multi-region architectures. The first focus is connecting end users to application endpoints in multiple regions in the most performant and reliable way. The second is maximizing availability in case of disaster recovery.
  • AWS Global Accelerator improves the availability and performance of your applications through AWS edge locations and the AWS backbone. In Global Accelerator we have the ability to set traffic dials for fine-grained traffic control. How do these features help us with blue/green deployments?
  • Next, Chris shows a demo of blue/green deployments for multi-region applications, looking at AWS Global Accelerator in a multi-region blue/green software deployment scenario. Then we take a look at disaster recovery in multi-region architectures.
  • Route 53 Application Recovery Controller routing controls are like a circuit breaker. With them, we can open the circuit breaker for the us-east-1 region and thereby initiate a failover to the us-west-2 region. Consider manual failover mechanisms, using Route 53 Application Recovery Controller as a big red emergency button.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, my name is Christian Elsen and I'm a specialist solutions architect for networking. I've been with AWS for about six years and have previously worked in other networking roles for about 15 years, spanning areas of data center switching, network virtualization, global content distribution networks and DNS providers, as well as BGP routing for service providers. My name is Lerna Ekmekcioglu. I'm a senior solutions architect at AWS. Prior to this, for about 17 years I worked on web infrastructure, centralized authentication systems, distributed caching, and multi-region cloud-native deployments using infrastructure as code and pipelines, to name a few.

Today we are going to talk about optimizing end user connectivity for multi-region architectures. So first, why are we talking about this? There are two aspects we are focusing on in today's talk. The first is connecting end users to application endpoints in multiple regions in the most performant and reliable way possible. The second is maximizing availability in case of disaster recovery: how can we ensure that we can perform instantaneous failovers, even in the face of gray failures, if we're running our application in multiple regions? Let's first take a look at how to achieve performance for end user connectivity. For this, we are going to take a deeper dive into a networking service called AWS Global Accelerator.

Here we have an application that is deployed in multiple regions, in this case Northern Virginia (us-east-1) and Ireland (eu-west-1), and the users of our application are accessing one of these stacks from around the globe. However, as they're accessing the application over the public internet, each hop from the end user to the application endpoint can incur additional latency, and this results in a non-optimal experience for the end user. Is there a better way to provide them with more reliable, performant connectivity to the application?

Here we are looking at a map of the global network of 96 points of presence in 84 cities across 46 countries that AWS Global Accelerator uses. For example, in Asia there are edge locations in Bangalore, Bangkok, Chennai, Hyderabad and Jakarta, to name a few. Global Accelerator provides you with two static IP addresses that serve as a fixed entry point to the application hosted in one or more AWS regions. Underneath the covers, these IP addresses are anycast from those edge locations: they are announced from multiple AWS edge locations at the same time. This enables traffic to enter the AWS global network as close to the user as possible. So if you have a user in Jakarta and you're using Global Accelerator for your application deployment, that user will enter the AWS global network through the Jakarta edge location; similarly for the Chennai user, for example. This way the end users of your apps benefit from the reliable, consistent performance of the AWS global network. You can associate these IP addresses with regional AWS resources or endpoints such as Application Load Balancers, Network Load Balancers, EC2 instances and Elastic IP addresses, and Global Accelerator's IP addresses serve as the front-end interface to the app. You can think of it as a door close to your end users, wherever they may be located across the globe, and that door is at the edge locations we looked at on the global network map.
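
To make these moving pieces concrete, here is a minimal boto3 sketch of fronting two regional ALBs with one accelerator. This is an illustration, not the talk's setup: the accelerator name and ALB ARNs are placeholders.

    # Hypothetical sketch: front two regional ALBs with a single accelerator.
    # The name and ALB ARNs below are placeholders.
    import boto3

    # The Global Accelerator API is served out of us-west-2.
    ga = boto3.client("globalaccelerator", region_name="us-west-2")

    accelerator = ga.create_accelerator(Name="multi-region-app", Enabled=True)
    acc_arn = accelerator["Accelerator"]["AcceleratorArn"]
    # The two static anycast IPs that end users connect to:
    print(accelerator["Accelerator"]["IpSets"][0]["IpAddresses"])

    listener = ga.create_listener(
        AcceleratorArn=acc_arn,
        Protocol="TCP",
        PortRanges=[{"FromPort": 443, "ToPort": 443}],
    )

    # One endpoint group per region, each pointing at that region's ALB.
    for region, alb_arn in [("us-east-1", "arn:aws:elasticloadbalancing:...:alb-virginia"),
                            ("eu-west-1", "arn:aws:elasticloadbalancing:...:alb-ireland")]:
        ga.create_endpoint_group(
            ListenerArn=listener["Listener"]["ListenerArn"],
            EndpointGroupRegion=region,
            EndpointConfigurations=[{"EndpointId": alb_arn, "Weight": 128}],
        )
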
Next, Chris is going to show us a demo. AWS Global Accelerator improves the availability and performance of your applications through AWS edge locations and the AWS backbone. This test tool compares the performance of Global Accelerator with the public internet from your location. It does so by comparing the time it takes to download a file of a certain size via the public internet as well as via the optimized path through Global Accelerator. In my case, the end user location is San Francisco, California, and I selected 100 kilobytes as the file size to speed up this particular test run. Let's have a look at the results. We can see the performance gain via Global Accelerator from San Francisco to these five selected AWS regions.

Next, let's take a look at traffic dials. In Global Accelerator we have the ability to set traffic dials for fine-grained traffic control. We can dial up or dial down traffic directed to a specific endpoint group. We do this by setting a traffic dial to control the percentage of traffic that is already directed to that endpoint group, that is, to that region. Here I have two endpoint groups, one in us-east-1 and one in us-west-1, and in each endpoint group I have two endpoints, two Elastic Load Balancers. The percentage is applied only to traffic that is already directed to the endpoint group based on proximity and health of the endpoints. So if we have 100 requests directed to Northern Virginia, then 100% of those requests will be directed to the us-east-1 (Northern Virginia) endpoint group if we set its traffic dial at 100%; similarly for requests that are directed to the us-west-1 endpoint group. You can think of traffic dials as giant valves controlling the traffic to the endpoint groups. Later on we may decide to switch all traffic flow to only go to the us-west-1 endpoint group, and for this we can set the traffic dial for us-east-1 to 0%: we shut off the giant valve that controls the traffic sent to us-east-1.

We also have a finer-grained control, a smaller knob if you will, that we can use to set weights on each endpoint inside an endpoint group, such that we can adjust the amount of traffic each endpoint gets. Endpoints can be Network Load Balancers, Application Load Balancers, EC2 instances or Elastic IP addresses; here I'm showing Elastic Load Balancers as endpoints. Global Accelerator calculates the sum of the weights for the endpoints in an endpoint group and then directs traffic to the endpoints based on the ratio of each endpoint's weight to the total. So you can go as fine-grained as 1/256th for the percentage of traffic directed to an endpoint inside an endpoint group.
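
Both knobs live on the endpoint group and can be set in one API call. Here is a minimal sketch, assuming placeholder ARNs; the weights of 200 and 56 sum to 256, so the comments give exact ratios.

    # Hypothetical sketch: a traffic dial (per endpoint group) and endpoint
    # weights (per endpoint) set via a single call. ARNs are placeholders.
    import boto3

    ga = boto3.client("globalaccelerator", region_name="us-west-2")

    ga.update_endpoint_group(
        EndpointGroupArn="arn:aws:globalaccelerator::123456789012:accelerator/.../endpoint-group/...",
        TrafficDialPercentage=100.0,  # the "giant valve" for this region
        EndpointConfigurations=[
            {"EndpointId": "arn:aws:elasticloadbalancing:...:alb-blue",  "Weight": 200},  # 200/256 of the group's traffic
            {"EndpointId": "arn:aws:elasticloadbalancing:...:alb-green", "Weight": 56},   # 56/256
        ],
    )
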
Now, how do these features help us with blue/green deployments? First, a quick recap of blue/green deployments. The goal of blue/green deployments is to deploy and roll back new versions of an application with minimal to no downtime for our app consumers. We achieve this by having two identical environments in production. At any given point in time, only one of these environments is live in terms of taking production traffic, and the other one is idle. The one taking production traffic we call the blue environment. If we want to perform a new release, we deploy the new version of the application in the green environment, which is not taking any production traffic. We test it, we verify it in the green environment, and then we cut the production traffic over to the green environment. Now the blue environment becomes the new idle environment, and in case of issues in the newly deployed version of the application, we can always cut the traffic back over to the blue environment.

One way to achieve blue/green deployments in a single region is by using the little knobs we just talked about: endpoint weights in Global Accelerator. We stand up two identical stacks of our application, for example behind an ALB endpoint for the blue environment and another ALB endpoint for the green environment. These two endpoints are inside one endpoint group (think of that as a region), and we use the endpoint weights to adjust the production traffic flow as part of the deployment we just discussed.

Next, we are going to look at a slight modification of blue/green deployments, but for multi-region applications. The goal is always the same: minimal to no downtime as we deploy and roll back a new version of an app for our app consumers. Here we have version one of our application deployed in two regions, us-west-2 and us-east-1, and we are using Global Accelerator for the application. Our clients in Japan access the application via the Global Accelerator point of presence closest to them, either Tokyo or Osaka, and Global Accelerator then intelligently routes their requests inside the AWS global network to the nearest app stack, which is in us-west-2. Clients in Europe also access this application through the Global Accelerator point of presence closest to them, in Europe, and Global Accelerator again intelligently routes their requests, after taking them in through that door in Europe, inside the AWS global network to the nearest app stack, which is us-east-1. One thing to note is that both of these app stacks are serving live production traffic.

Now we decide to upgrade our application from version one to version two without incurring any downtime for our app consumers. Remember that we have traffic dials to control traffic for endpoint groups in Global Accelerator: those are the giant valves we can use to dial traffic for endpoint groups up and down. So we are going to use these. We first set the traffic dial in us-west-2 to 0%, and all production traffic now flows to us-east-1. Our clients in Japan still enter through the same point of presence closest to them, Tokyo or Osaka, whichever is closer, but Global Accelerator now intelligently routes their requests to the us-east-1 application stack. And now that us-west-2 has no production traffic flowing into it, we can upgrade our application in us-west-2 to version two without incurring any downtime for our app consumers.

Next, we repeat this process for us-east-1. We first turn its traffic dial down to 0%; all traffic now goes to us-west-2, including for our clients in Europe, who still enter the AWS global network through the point of presence closest to them in Europe, and Global Accelerator intelligently routes their requests to the app stack in us-west-2. This way we can upgrade the app in us-east-1 to version two without incurring any downtime for our app consumers, this time the clients in Europe. Finally, we now have both regions running version two of the app, and we turn the traffic dial in us-east-1 back up to 100%. Now the clients in Japan go to us-west-2 and the clients in Europe to us-east-1 through the Global Accelerator points of presence, the doors closest to them that take their requests into the AWS global network.
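
The dial sequence above could be scripted along these lines. This is a rough sketch under stated assumptions: the endpoint group ARNs are placeholders, and deploy_v2 stands in for whatever upgrade-and-verify step you run.

    # Hypothetical sketch of the multi-region blue/green cutover using
    # traffic dials; endpoint group ARNs and the deploy step are placeholders.
    import boto3

    ga = boto3.client("globalaccelerator", region_name="us-west-2")

    ENDPOINT_GROUPS = {  # placeholder ARNs for the two regional endpoint groups
        "us-west-2": "arn:aws:globalaccelerator::123456789012:.../endpoint-group/usw2",
        "us-east-1": "arn:aws:globalaccelerator::123456789012:.../endpoint-group/use1",
    }

    def set_traffic_dial(region, percent):
        ga.update_endpoint_group(EndpointGroupArn=ENDPOINT_GROUPS[region],
                                 TrafficDialPercentage=percent)

    def deploy_v2(region):
        ...  # placeholder: upgrade the drained stack and verify it

    for region in ("us-west-2", "us-east-1"):
        set_traffic_dial(region, 0.0)    # drain: the other region takes all traffic
        deploy_v2(region)                # upgrade while no production traffic flows here
        set_traffic_dial(region, 100.0)  # restore before draining the next region
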
Next, Chris is going to show a demo of blue/green deployments for multi-region applications. In this demo, we will look at AWS Global Accelerator in a multi-region blue/green software deployment scenario. As depicted in the presentation, this setup uses a single accelerator with endpoints in two AWS regions. The us-west-1 region represents our blue deployment and the us-east-2 region represents our green deployment. Right now, traffic dials for both regions are at 50%. As the percentage is applied only to the traffic already directed to the endpoint group, not all listener traffic, only by explicitly specifying 50% as the traffic dial for both will we see each region receiving about the same amount of incoming end user traffic. This CloudWatch dashboard shows the incoming traffic ratios across the two Application Load Balancers that front each of the two regions. On the left we have a traffic gauge that shows the most recent distribution, with a historical distribution over the last half hour on the right. As expected, the traffic ratio across the two regions is about 50/50.

Now let's drain traffic from our blue region, us-west-1, so we can perform a software update there. For this, we will set the traffic dial for us-west-1 to zero while leaving the traffic dial for us-east-2 where it is. As us-east-2 will be the only remaining region, it will receive all traffic. Let's have a look at the CloudWatch dashboard and see how traffic shifts. We will speed up the recording a bit so we don't have to wait the two to three minutes it takes. Great. Now we can see that 100% of our incoming traffic is headed to the green deployment, allowing us to upgrade the application in the blue deployment.

After we finish this upgrade, let's switch all traffic to the blue deployment so we can upgrade the green deployment. This time, we will set the traffic dial for us-west-1 to 50% and the one for us-east-2 to 0%. Let's have a look at the CloudWatch dashboard and see how traffic shifts again. We will speed up the recording a bit so we don't have to wait the two to three minutes it takes. Now 100% of our incoming traffic is headed to the blue deployment, allowing us to upgrade the application in the green deployment. Once we finish this upgrade, let's switch all the traffic back to the original 50/50 split. This time we will set the traffic dial for us-east-2 back to 50%. We'll return to the CloudWatch dashboard one last time and see how traffic shifts again. We will speed up the recording a bit so we don't have to wait the two to three minutes it takes, and we're back to a 50/50 traffic split.
So let's take a look at disaster recovery in multi-region architectures. The concepts of data plane and control plane date all the way back to networking terminology, so these are not new concepts. For a given AWS service, there is typically a control plane that allows us to create, modify and destroy resources. For example, if you think of EC2, control plane operations include launching an EC2 instance, changing a security group on an existing EC2 instance, or terminating an EC2 instance, among others. At the same time, there are data plane operations that allow resources that are already up and running to continue to operate. In the case of EC2, you may have already instantiated EC2 servers that are up and running and serving requests. All operations performed while these instances are running are part of the data plane, for example reading from and writing to existing Elastic Block Store volumes, or routing packets according to the existing VPC route tables. In case of impairments to the control plane, the EC2 instances have all of the information they need available to them locally in order to continue to run.

Here I have an analogy for you: the lifecycle of a flight. For any given flight there is a takeoff, there is a landing, and there is the part in between where the plane is up in the sky. The parts that have to do with creating a flight, the takeoff, and terminating a flight, the landing, require a certain set of steps: getting clearance from the control tower, running through a runbook, ensuring that the plane is ready, et cetera. Those are strict procedures around what we need to create and terminate a flight, and they are part of the control plane operations. But once the plane is up and flying in the sky, it no longer needs the control tower; it has zero dependencies on the control tower to continue to fly. If the control tower, for whatever reason, goes away, the plane has everything it needs locally, all its instruments, and it has the fuel it needs to continue to fly. So data plane operations keep what's already in flight operating.

Both data plane and control plane operations are important, but data plane operations favor availability. We want to make sure that if we have EC2 instances already serving requests, they continue to serve requests even if our control plane is impaired. This also has its roots in the CAP theorem, if you think about it. The idea is to rely during recovery on the data plane, which is designed with a higher availability target, and not on the control plane, which favors consistency.
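
To illustrate where the line falls for EC2, the three control plane operations named above map directly to API calls; a small hedged sketch with a placeholder AMI and security group:

    # Illustrative only: these EC2 *control plane* calls create, modify and
    # terminate resources. The *data plane* is the running instance itself
    # serving traffic, which keeps working even if these API calls are impaired.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    resp = ec2.run_instances(ImageId="ami-00000000000example",  # placeholder AMI
                             InstanceType="t3.micro", MinCount=1, MaxCount=1)
    instance_id = resp["Instances"][0]["InstanceId"]

    ec2.modify_instance_attribute(InstanceId=instance_id,
                                  Groups=["sg-00000000000example"])  # change security group

    ec2.terminate_instances(InstanceIds=[instance_id])
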
Let's have a look at Amazon Route 53 Application Recovery Controller, which provides a mechanism to simplify and automate recovery for highly available applications. Some industries and workloads have very high requirements in terms of desired availability and recovery time objectives. As an example, think about how real-time payment processing or trading engines can affect entire economies if disrupted. To address these requirements, you typically deploy multiple replicas, called cells, across a variety of AWS Availability Zones, AWS regions, and on-premises environments. Route 53 Application Recovery Controller provides a highly reliable mechanism to aid Route 53 in reliably routing end users to the appropriate cell in an active-active setup.

Or, in a nutshell, Amazon Route 53 Application Recovery Controller gives you a big red emergency stop button, which acts like a circuit breaker to take a problematic cell out of service. What are the key capabilities of Application Recovery Controller? First, readiness checks continually monitor AWS resources across your application replicas. These checks can monitor a number of areas that affect recovery readiness, such as updates to configurations (also called configuration drift), capacity, or network routing policies. Second, routing controls give you a way to manually and reliably fail over the entire application stack. Such a failover decision is often a conscious manual choice based on application metrics or partial failures. You can also use them to shift traffic for maintenance purposes or to recover from failures when your monitors themselves fail. Third, safety rules act as safeguards for Application Recovery Controller itself, governing which combinations of routing control states are allowed, to avoid unintended consequences. For example, you might want to prevent inadvertently turning off all routing controls, which would stop all traffic flow entirely.

Let's look closer at the architecture of the Application Recovery Controller, starting with routing control. Routing controls allow us to create control panels and model the desired cell structure of our applications and what our big red emergency buttons or circuit breakers should look like. In this example we have two cells with one circuit breaker switch each. The cell in the active region is currently in the on position, while the cell in the standby region is in the off position. To actually influence traffic between the active and standby regions, Amazon Route 53 is needed in addition to the Application Recovery Controller. Our circuit breaker buttons are mapped to Route 53 health checks that can be used with various record types, such as the failover record type. At this point, it is very important to point out the key capability of this integration: Route 53 health checks are part of the data plane, which has a 100% availability SLA. Therefore, even if the Route 53 control plane is affected during a large-scale event, forcing a Route 53 health check to unhealthy via the Application Recovery Controller still allows us to perform a failover between the active and standby regions.

But why is the Application Recovery Controller's control plane superior in this scenario? As depicted in the diagram, each controller consists of a cluster across five different AWS regions, with API endpoints in each of them. As long as one of these five endpoints is still available, changes to the routing control state can be made, and these changes still factor in the safety rules that you saw previously.
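
In practice, flipping a routing control means calling the cluster's highly available API, retrying across its regional endpoints until one answers. A minimal sketch, assuming a placeholder routing control ARN and placeholder cluster endpoints (the real values come from describing your cluster):

    # Hypothetical sketch: flip a routing control to "Off" via the cluster's
    # regional endpoints, trying each until one responds. The ARN and the
    # endpoint/region pairs below are placeholders.
    import boto3

    CONTROL_ARN = "arn:aws:route53-recovery-control::123456789012:controlpanel/.../routingcontrol/..."
    CLUSTER_ENDPOINTS = [  # placeholder pairs; a real cluster spans five regions
        ("https://host-aaaaaa.us-east-1.example.com", "us-east-1"),
        ("https://host-bbbbbb.us-west-2.example.com", "us-west-2"),
        # ... three more regional endpoints
    ]

    for endpoint, region in CLUSTER_ENDPOINTS:
        try:
            client = boto3.client("route53-recovery-cluster",
                                  endpoint_url=endpoint, region_name=region)
            client.update_routing_control_state(
                RoutingControlArn=CONTROL_ARN,
                RoutingControlState="Off",  # open the circuit breaker
            )
            break  # success against any one endpoint is enough
        except Exception:
            continue  # this endpoint is impaired; try the next region
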
Now let's look in detail at how the Application Recovery Controller interfaces with Route 53. A DNS request for our application, MyApp, would reach Route 53's distributed data plane. At the same time, Route 53's global health checkers integrate with Application Recovery Controller's routing control. The Application Recovery Controller provides a virtual health check that is mapped to a manual on/off switch which can be controlled via a highly available API. If we flip this switch via the routing control, the information is passed by the Route 53 global health checkers to the distributed data plane. This updated health check information within the distributed data plane now allows a changed DNS response, and this changed DNS response reroutes incoming traffic to our secondary or standby region.

Now let's look at a brief demo of how all this can be used with an example application. In this demo, we will look at the Amazon Route 53 Application Recovery Controller. For this we have deployed a simple demo architecture with a tic-tac-toe game deployed across two regions. us-east-1 acts as the active region while us-west-2 acts as a hot standby region. Initially, both AWS regions are healthy and therefore the game should be served out of the active us-east-1 region. This is accomplished by steering inbound network traffic via a Route 53 failover record. Using Amazon Route 53 Application Recovery Controller, we also have a circuit breaker in the form of routing controls in place. Let's have a look.

Here we can see the Route 53 failover record with a primary entry for us-east-1 and a secondary entry for us-west-2. Both records have a distinct health check associated with them. Looking at these health checks, we can see that each of them is currently healthy and therefore the primary failover record entry for us-east-1 is being used. Each of the two health checks corresponds to a Route 53 Application Recovery Controller routing control. We can imagine each of these routing controls to be like a circuit breaker. Here we can see that at this point each routing control state is on. Let's play a game of tic-tac-toe. We can see that the games are currently served out of the us-east-1 region. Creating a new game and choosing a worthy opponent, we can validate that the application is working. So far so good.

But what if we are faced with a disastrous event in our active region and need to fail over to the standby region? As part of this disastrous event, we are also no longer able to make changes to the DNS public hosted zone. But thanks to the Route 53 Application Recovery Controller, we have our circuit breakers in the form of routing controls in place. With these, we can open the circuit breaker for the us-east-1 region and thereby initiate a failover to the us-west-2 region. Let's have a look. First, we will open the circuit breaker for the us-east-1 region by setting its routing control state to off. Looking at the associated health checks, we can see that the health check for us-east-1 changes to unhealthy, triggered by the Route 53 Application Recovery Controller routing control state change. If we reload our tic-tac-toe game, we can see that it is now served out of the us-west-2 region. Time to play another round of tic-tac-toe.

Let's look at the key takeaways from this talk. Improve application performance and resiliency by minimizing the number of network hops through the AWS backbone. Eliminate control plane dependencies of your application to improve disaster recovery. And consider manual failover mechanisms, using Route 53 Application Recovery Controller as a big red emergency button. Thank you.
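
For reference, a failover record pair like the one in the demo could be created along these lines. A minimal sketch with placeholder hosted zone ID, domain, targets and health check IDs; in the demo setup those health checks would be the ones backed by the ARC routing controls.

    # Hypothetical sketch: a Route 53 failover record pair whose health checks
    # are backed by ARC routing controls. All IDs and names are placeholders.
    import boto3

    r53 = boto3.client("route53")

    def failover_record(role, target, health_check_id):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "game.example.com",
                "Type": "CNAME",
                "TTL": 30,
                "SetIdentifier": role,
                "Failover": role,                  # "PRIMARY" or "SECONDARY"
                "HealthCheckId": health_check_id,  # backed by an ARC routing control
                "ResourceRecords": [{"Value": target}],
            },
        }

    r53.change_resource_record_sets(
        HostedZoneId="Z0000000EXAMPLE",
        ChangeBatch={"Changes": [
            failover_record("PRIMARY",   "alb-use1.example.com", "hc-use1-placeholder"),
            failover_record("SECONDARY", "alb-usw2.example.com", "hc-usw2-placeholder"),
        ]},
    )
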
...

Christian Elsen

Principal Solutions Architect @ AWS


Lerna Ekmekcioglu

Senior Solutions Architect @ AWS



