Conf42 Cloud Native 2022 - Online

Preparing for disaster: going Multi-Region

Abstract

Even with high availability, things can fail. Legacy services running in only one region with no fault tolerance have a big chance of downtime, even in the best-case scenario.

What steps must we take to get to a real 99.99% uptime?

Summary

  • Even with high availability, there is a chance that a whole region fails. But with a fault tolerance setup and a multi-Region approach, disaster can be avoided. Here are some pointers for going multi-Region with a complex service.
  • S3 has multi-Region replication. Non-relational databases like DynamoDB have global tables. Third, but not least, we have application management, the bigger part of the service that has to be taken into account.
  • Each cloud has similar resources and alternatives to build a fault tolerant multi-Region service. It's hard, but it's possible. The peace of mind it brings when disaster happens is something to take into consideration when designing or updating a service or an application.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, today I will be talking about fault tolerance and how to migrate a service to multi-Region in the cloud. Services and resources are located in an availability zone inside a region, like us-east or us-west, and we can have multiple availability zones in a region. Sometimes a whole region can fail, causing a whole service to die. But if we have a fault tolerance setup with a multi-Region approach, disaster can be avoided.
I am currently working at Globant as a DevOps engineer. We are a digitally native company that helps organizations reinvent themselves and unleash their potential, the place where innovation, design, and engineering meet scale. Since I started working in the cloud, I've noticed that even with high availability there is a chance that a whole region fails, and critical services that need 100% uptime become unavailable, generating massive losses for the clients. The process of not only modernizing these services but also migrating them to be multi-Region can be quite taxing, but if done correctly it will bring peace of mind to stakeholders, clients, and end users. I will explain the difference between high availability and fault tolerance, and then provide some pointers for going multi-Region with a complex service.
So first, what is high availability? High availability is the ability of a service to remain operational, with minimal downtime, in the event of a disruption. Disruptions include hardware failures, networking problems, or security events like a DDoS attack. In a highly available system, the infrastructure services are spread across a cluster of instances; if one of these instances fails, the workloads running on it are automatically moved to other servers or instances. These clusters are normally set up across different availability zones, but all in the same region. The main advantages are easier maintenance, even if the design is more complex, the scalability it provides, and almost no service disruption thanks to a load balancing solution that automatically diverts traffic to a functioning cluster. This means we will need to set up double the components to provide this balancing, which can raise the cost, even if we are spending to prevent disruption. There is also a chance of data loss if data is being transferred while the change of authority happens. And finally, there is still a chance of disruption, a small percentage that can cause severe losses in some services; we have all seen a full service fail during a regional disaster.
Then we have fault tolerance, where there is no interruption of service. It is a design concept where a service continues working normally after experiencing an unexpected failure or malfunction, with zero service interruption. This seamless transition applies not only when a failure is the problem, but also when there is a need to upgrade, change, or perform maintenance on the service or hardware. Regarding the data loss risk that applied to high availability: here, if the design is well implemented, there should be none, because we have full redundancy and there is no crossover component between active and passive systems; every region writes and receives all of the requests. Obviously, a design that assures the service will continue to work if a whole region fails has a more costly and complex setup, but sometimes you have to weigh those things and decide what is better for the situation.
So both designs reduce the risk of service disruption and downtime, but they do so in different ways, and the two models tend to differ in terms of cost. When choosing which one to adopt, you have to take into account the acceptable level of disruption, the infrastructure requirements, and the management effort in the design, setup, and operational maintenance. As I've been working on some sensitive projects that needed zero downtime and no data loss, I had to implement multi-Region, and I'd like to provide some insight on how to implement it in AWS.
First of all, we have to make sure we have good cross-account and cross-Region replication of the security resources. Authorization, encryption, auditing, and observability need to be replicated, and the audit logs should be stored in an S3 bucket with multi-Region replication. Luckily, AWS IAM, the users, roles, and groups provider for AWS, is multi-Region automatically, with no configuration required on your part. Then we have AWS Secrets Manager, which can store secrets with KMS encryption and replicate them to secondary regions to ensure they can be retrieved in the closest available region. Some services, like S3 buckets and Aurora databases, have cross-Region replication, which makes the encryption and decryption steps more agile. But for applications that run multi-Region, there is the option to set up KMS multi-Region keys, which will make your life easier for encryption operations (a small code sketch of this follows below). As stated before, you can save all the CloudTrail logs in replicated S3 storage, but keep in mind that you can also enable Security Hub to send all the findings from both regions to a single pane.
Second, but not less important, is networking, because we have to analyze and be aware of the networking infrastructure. We are going to need to set up a global network to communicate between the regions. We can use VPC peering: the peered resources can communicate using private IP addresses and do not require an Internet gateway, a VPN, or separate network appliances, and, by the way, it's cheaper than other options. For virtual private cloud and on-premises communication we have Transit Gateway, a network transit hub that connects your VPCs and on-premises networks. This can be extended to additional regions with Transit Gateway inter-Region peering to create a globally distributed private network for your resources. Then we have Route 53, the DNS solution to route users to those distributed Internet applications, which offers a highly available solution with minimal dependencies. We also have CloudFront, a content delivery network for websites that lets us serve content closer to the end users through edge locations; it can also be set up with an origin failover, so if the primary origin is unavailable or returns a specific HTTP response status code that indicates a failure, CloudFront will automatically switch to the secondary origin. And for Internet-facing apps you can set up Global Accelerator, which gives you a single entry point with two static anycast IPs, lets you seamlessly add or remove origins, and redirects traffic within seconds. It also allows us to use traffic weights to test deployments. I've used Global Accelerator and it's really useful for devs who are not cloud engineers, because they only have to switch the weights to move all the traffic; it's a really good solution, and it also helps with live fixes to production when some scenarios require it.
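To make the security replication described above a bit more concrete, here is a minimal boto3 sketch. The region names, key description, and secret name are my own placeholder assumptions, not values from the talk: it creates a KMS multi-Region key, replicates it to a secondary region, and stores a secret that is replicated there too.

```python
import boto3

PRIMARY = "us-east-1"    # assumed primary region
SECONDARY = "us-west-2"  # assumed secondary region

# 1. Create a KMS multi-Region key in the primary region.
kms = boto3.client("kms", region_name=PRIMARY)
key = kms.create_key(Description="example multi-Region key", MultiRegion=True)
key_id = key["KeyMetadata"]["KeyId"]

# 2. Replicate the key to the secondary region, so encrypt/decrypt
#    calls work locally there with the same key material.
kms.replicate_key(KeyId=key_id, ReplicaRegion=SECONDARY)

# 3. Store a secret encrypted with that key and replicate it, so it
#    can be retrieved from the closest available region.
secrets = boto3.client("secretsmanager", region_name=PRIMARY)
secrets.create_secret(
    Name="example/app/db-password",   # hypothetical secret name
    SecretString="not-a-real-password",
    KmsKeyId=key_id,
    AddReplicaRegions=[{"Region": SECONDARY, "KmsKeyId": key_id}],
)
```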
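And as one example of routing users away from a failed region, here is a rough sketch of Route 53 failover records, assuming a hypothetical hosted zone ID, health-check path, and endpoint names; the talk does not prescribe this exact setup.

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000EXAMPLE"  # placeholder hosted zone
DOMAIN = "app.example.com"          # placeholder record name

# Health check that watches the primary region's endpoint.
check = route53.create_health_check(
    CallerReference="primary-check-1",  # must be unique per request
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "primary.example.com",
        "Port": 443,
        "ResourcePath": "/health",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

# PRIMARY/SECONDARY failover records: Route 53 serves the secondary
# region's answer only while the primary's health check is failing.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": DOMAIN, "Type": "CNAME", "TTL": 60,
            "SetIdentifier": "primary", "Failover": "PRIMARY",
            "HealthCheckId": check["HealthCheck"]["Id"],
            "ResourceRecords": [{"Value": "primary.example.com"}],
        }},
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": DOMAIN, "Type": "CNAME", "TTL": 60,
            "SetIdentifier": "secondary", "Failover": "SECONDARY",
            "ResourceRecords": [{"Value": "secondary.example.com"}],
        }},
    ]},
)
```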
Next we have compute. Depending on your infrastructure, there are different things to consider. For example, if you use EC2 instances, they have their corresponding EBS volumes, which live in one availability zone, but we have Data Lifecycle Manager to automate replicating those volumes to another region. And if we use AMIs, we can replicate them across regions with EC2 Image Builder, so we don't need to do this manually. If our service or application is based on microservices, which is a really good idea, we can use Amazon Elastic Container Registry, which has private image replication, either cross-Region or cross-account. I use third-party container registry tools that build the images from my pipelines and deploy them to my AWS infrastructure in both regions at the same time, so that's another option.
Then there is data replication. This is a complicated topic because of the CAP theorem, which states that we cannot have all three of consistency, availability, and partition tolerance at the same time; we need to choose only two, and which two we select depends on our needs. When we go multi-Region, the one that's harder to achieve is consistency, due to the long distances between the services. I have already mentioned that S3, the Simple Storage Service, or buckets as we usually call them, has multi-Region replication; this replication can be one-way or two-way, is continuous, and can also be applied to a subset of the objects inside a bucket. For non-relational databases like DynamoDB, there are global tables, which have multi-writer capabilities and will detect changes and replicate them to the other regions. For caching we have ElastiCache for Redis, which offers a Global Datastore to create fully managed, fast, reliable, and secure cross-Region replicas for Redis caches and databases. And for relational databases such as Aurora, the cluster in one region is designated as the writer, and then we have secondary regions designated as read replicas. While only one instance can process the writes, Aurora MySQL supports write forwarding, a feature that forwards write queries from a secondary region endpoint to the primary region, to simplify the logic in application code. Logical replication, which uses the database engine's built-in replication technology, can be set up for Amazon RDS for MariaDB, MySQL, Oracle, PostgreSQL, and Aurora databases. This means that a cross-Region read replica will receive and process changes from the writer in the primary region. This makes local reads faster and, in the case of a disaster, can reduce data loss and recovery times by promoting the replica to a standalone instance. The same technology can also be used to replicate data to a resource outside Amazon RDS, like an EC2 instance, an on-premises server, or even a data lake.
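For the container registry part, here is a one-call boto3 sketch of ECR private image replication; the regions and account ID are placeholders of mine, not from the talk.

```python
import boto3

ecr = boto3.client("ecr", region_name="us-east-1")  # assumed primary region

# Replicate every private image pushed to this registry into a second
# region. Using your own account ID makes this cross-Region only; a
# different account ID would make it cross-account as well.
ecr.put_replication_configuration(
    replicationConfiguration={
        "rules": [
            {"destinations": [{"region": "us-west-2",
                               "registryId": "123456789012"}]}
        ]
    }
)
```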
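For S3, the cross-Region replication of a subset of objects described above can be configured roughly like this. It assumes versioning is already enabled on both buckets, and the bucket names, prefix, and IAM role are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Replicate only the "logs/" subset of objects from the source bucket
# to a bucket in another region. S3 replication requires versioning on
# both buckets and an IAM role that S3 can assume.
s3.put_bucket_replication(
    Bucket="my-source-bucket",  # hypothetical source bucket
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication",  # placeholder
        "Rules": [{
            "ID": "replicate-logs",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {"Prefix": "logs/"},
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": "arn:aws:s3:::my-replica-bucket"},
        }],
    },
)
```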
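And for DynamoDB, turning an existing table into a global table is a single update. The table name and regions here are assumptions, and the table needs DynamoDB Streams enabled with new and old images.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Adding a replica converts the table into a global table: DynamoDB
# detects changes and replicates them to the new region, with
# multi-writer support. The table must already have Streams enabled
# (NEW_AND_OLD_IMAGES).
dynamodb.update_table(
    TableName="orders",  # hypothetical table name
    ReplicaUpdates=[{"Create": {"RegionName": "us-west-2"}}],
)
```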
Third, but not least, we have application management, the bigger part of the service that has to be taken into account, for example DevOps. I think it's important to plan which CI/CD tools we are going to use to deploy all of this infrastructure. Whether it's AWS CodePipeline, GitLab CI, or Jenkins, the pipelines will need to be configured to assure this double deployment. At the same time, deploying first to one region and then to the other while the primary keeps working isn't as complicated as it seems; working with variable files for each region and environment is a really good idea (see the sketch below). I mostly use Terraform, which is really solid and allows us to review changes before we apply them to the infrastructure, but AWS has CloudFormation, which is also really solid and allows us to create, update, and delete stacks across multiple regions and multiple accounts with one single template, obviously providing the corresponding variables.
Now, depending on the architecture of our service, if we use decoupled applications we will need an event manager, and for that we have EventBridge. This will provide a notification service across regions: EventBridge is serverless, and we can use cross-Region routing to interconnect messages between the resources. If you rely on pub/sub messaging, it can work with multiple destinations, so you can, for example, send messages to a central SQS queue that processes the orders of a multi-Region application.
Finally, to maintain visibility and observability over an application deployed across multiple regions and accounts, which can generate a lot of resources, you can create a Trusted Advisor dashboard or an operations dashboard with Systems Manager Explorer. This operations dashboard offers a unified view of resources like EC2, CloudWatch, and AWS Config data, and you can combine the metadata with Amazon Athena to create a multi-Region, multi-account inventory with a good view of all of the resources.
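As a sketch of the "variable files per region" deployment idea: the speaker mostly uses Terraform, but the same pattern can be shown with CloudFormation via boto3. The template file, stack name, and parameters are hypothetical.

```python
import boto3

# Hypothetical per-region parameters; the template itself is shared.
REGIONS = {
    "us-west-2": {"Environment": "prod", "InstanceType": "m5.large"},
    "us-east-1": {"Environment": "prod", "InstanceType": "m5.large"},
}

with open("service.yaml") as f:  # placeholder template file
    template = f.read()

# Deploy one region at a time, so the other keeps serving traffic
# while its twin is being created or updated.
for region, params in REGIONS.items():
    cfn = boto3.client("cloudformation", region_name=region)
    cfn.create_stack(
        StackName="my-service",  # hypothetical stack name
        TemplateBody=template,
        Parameters=[
            {"ParameterKey": k, "ParameterValue": v}
            for k, v in params.items()
        ],
    )
    # Wait until this region is healthy before touching the next one.
    cfn.get_waiter("stack_create_complete").wait(StackName="my-service")
```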
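And for the cross-Region event routing mentioned above, a minimal sketch: a rule in the primary region forwards matching events to the default event bus in the secondary region. The event source pattern, account ID, and role are assumptions of mine.

```python
import boto3

events = boto3.client("events", region_name="us-east-1")  # assumed primary

# Rule matching hypothetical order events on the default bus.
events.put_rule(
    Name="forward-orders",
    EventPattern='{"source": ["example.orders"]}',
    State="ENABLED",
)

# Cross-Region routing: the target is an event bus in the secondary
# region, and EventBridge assumes a role allowed to PutEvents on it.
events.put_targets(
    Rule="forward-orders",
    Targets=[{
        "Id": "replica-bus",
        "Arn": "arn:aws:events:us-west-2:123456789012:event-bus/default",
        "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-cross-region",
    }],
)
```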
So, you have heard me talk about all of these resources, and while I don't want to sound like an AWS evangelist, I think it's important to know that these options exist. Each cloud has similar resources and alternatives to build a fault tolerant multi-Region service. It's hard, but it's possible, and the peace of mind it brings when disaster happens is something to take into consideration when designing or updating a service or an application. It has brought me a lot of solutions and helped me provide better architectures for my projects. Thank you for listening, and I hope this information allows you to reinvent and think of better architectures and better solutions. Thank you.

Ana Van Straaten

DevOps Engineer @ Globant

Ana Van Straaten's LinkedIn account


