Conf42 Chaos Engineering 2022 - Online

Multi Region Terraform Deployments with Terraform Built CI/CD on AWS

Abstract

Infrastructure as Code tools like Terraform enable organizations to achieve repeatability of their deployments at scale in the cloud. Organizations deploy into multiple AWS regions whether it's to enhance user experience, satisfy data residency requirements, or ensure business continuity.

In this session, we do a deep dive into how to deploy infrastructure with Terraform into multiple AWS regions, using CI/CD pipelines that are themselves built with Terraform. We cover the overall architecture for the CI/CD pipeline and target workload accounts, walk through how to structure Terraform code for multi-region deployments, and go over best practice design considerations for the CI/CD pipeline.

The session is targeted at cloud teams who provision resources using Terraform in AWS, including those in regulated industries where security and reliability are critical, such as Financial Services and Healthcare.

Summary

  • Lerna Ekmekcioglu and Jack Yu are solutions architects with AWS with backgrounds in financial services infrastructure. The session covers multi-region Terraform deployments with Terraform-built CI/CD on AWS.
  • The AWS cloud spans 84 Availability Zones within 26 geographic regions worldwide. Customers want to achieve extreme resiliency for their workloads. Multi-region deployments come with a cost, especially when it comes to infrastructure deployment. In this talk, Lerna and Jack give prescriptive guidance on managing multi-region Terraform code.
  • So let's talk about the Terraform deployment workflow. We have four simple steps in our deployment. First, we'll talk about infrastructure as code and git tagging. At the end, the pipeline deploys those resources into the target account.
  • So let's zoom into the different pipeline stages. In the source stage we have two pieces of information. One is the Terraform code that describes the target state infrastructure. Next we look at the infrastructure as code linting stage. It's crucial to have the security scan at this early stage.
  • Lerna: We are following multi-account best practices and we have a central tooling account. We have a target workload account that contains the sample workload. We are using the cell architecture principle, with pipelines per environment. We integrate security scanning in the very early stages of development.
  • We are storing our Terraform state files in Amazon S3 buckets. And we also use DynamoDB tables for the Terraform state locks, so that concurrent write attempts against the state files are prevented. We are following the cell architecture principle, ensuring that we isolate and minimize the impact scope per environment.
  • Let's take a look at the terraform code structure for our sample infrastructure code. The pipeline parses the git tag and is able to get the value of the account number to target. Our provider is parameterized by region. We try to reuse as much of the existing well tested code as possible.
  • All right, so let's take a look at the resources in our accounts. Here you're seeing the code repository. Inside the build stage we have a terraform plan action that generates the plan. The next stage is a TFLint of the Terraform code. And that report will be presented in a manual approval action.
  • The most important takeaway about multi region deployments is that we should be consistent with our deployments. It helps to structure our infrastructure as code and architect our pipelines such that we maximize code reuse. Thanks for joining us.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Get real time feedback into the behavior of your distributed systems. Observing changes, exceptions and errors in real time allows you to not only experiment with confidence, but respond instantly to get things working again. Hi, I'm Lerna Ekmekcioglu. I'm a senior solution architect with AWS. For the past 15 years I've been in financial services infrastructure, specifically web infrastructure, authentication systems, distributed caching, to name a few. And for the last three years I've been working with cloud, infrastructure as code and CI/CD, and I fell in love with these concepts. So that's me. Hi everyone, my name is Jack Yu. I'm a solutions architect at AWS. I've been at AWS about three years, helping financial services customers adopt and optimize on AWS. I have a mix of software engineering and cloud infrastructure background, and I've helped build complex distributed systems. I help highly regulated financial services customers architect multi-region transactional workloads, and also help them build their DevOps pipelines and deploy those applications as well. So we'll talk about multi-region Terraform deployments with Terraform-built CI/CD on AWS. Everything you see today is built using Terraform, and we'll talk about multi-region deployments. So Jack, why multi-region deployments? First we want to understand what AWS regions are. AWS has a concept of regions, which are physical locations around the world where we cluster data centers, and we call each group of logical data centers an Availability Zone. The AWS cloud spans 84 Availability Zones within 26 geographic regions worldwide. Each Availability Zone has one or more discrete data centers, with redundant power, networking and connectivity, and they are housed in separate facilities. For the vast majority of customer workloads, a highly resilient multi-AZ deployment is actually the best way for customers to host their applications. This is how Amazon.com has run and still runs today. With that said, let's talk about why customers are looking into multi-region deployments. We have actually seen multiple use cases from customers. First, we see customers who want to get their compute resources closer to their customers, so requests don't need to traverse all the way across the world to reach the compute resources they need. These customers typically have a global expansion plan, and they want to leverage multi-region deployments for that purpose. Second, it's really to support mission critical applications. These customers want to achieve extreme resiliency for their workloads. And third, it's related to business continuity and regulatory requirements, that is, building a multi-region architecture to satisfy disaster recovery strategies and business continuity requirements. Multi-region comes with a cost, though, especially when it comes to infrastructure deployment. First, it's actually very difficult to make sure that you deploy the same infrastructure across multiple regions. It's especially hard for customers who use the AWS console to deploy their infrastructure, and customers have figured this out already: infrastructure as code is the way to manage consistency between different stacks and different environments. And that's not really a challenge with a single region deployment. With IaC, it's actually very simple: you just put multiple modules together and click a button to deploy it.
And management is extremely simple. But this actually gets very complicated when you have multiple business units, multiple SDLC environments and multi-region deployments. And finally, organizations need to think about their deployment strategy and how they leverage tooling to enable their continuous delivery strategy. So in this talk, Lerna and I are going to give you some prescriptive guidance to manage your multi-region Terraform code and deploy it at scale. So, Lerna, why don't you take us through how to manage this complexity? Absolutely. So let's talk about the Terraform deployment workflow. We have four simple steps in our deployment workflow. First, we'll talk about infrastructure as code and git tagging. Here you see two pink boxes. These are the accounts that we're using in this solution. The larger box is the central tooling account. It contains all of the CI/CD resources, as well as a git compatible repository in which we store our infrastructure as code. That's our Terraform code, the infrastructure sample workload that gets deployed into the target workload account. That's the smaller pink box. You can imagine the target workload account as belonging to a business unit, a line of business, and it is also per environment. So we're imagining a research business that has a dev account, QA account, staging account. Similarly, another business with different requirements for security and access, like risk, has its own accounts. So DevOps engineers are working against the Terraform repository, and they follow the branching strategy of their choice. In the solution we're imagining a trunk-based branching strategy. So let me walk you through the tagging and what it looks like. First of all, DevOps engineers will be working in short lived branches. They'll be making their changes, and then when they're ready with the changes, they'll submit them through a merge request for a teammate to review and approve. Then the code gets merged into the main branch in the repo, and then they can tag to release from the main branch. The tags will follow a convention. We're imagining that first of all there's an environment, like dev, QA, staging, prod; here I'm showing you the dev tags. Next there will be a deployment scope, like a region name, or global for a global resources deployment, and we'll talk about that later. And then next will be the team name, so this is the business unit name, as well as a version number. The version number is important because you want to know at any given point, for the resources that are deployed in your account, what version they are. So here, as soon as the DevOps engineer git tags, using for example this tag, dev, eu-central-1, research and then a version number, our pipeline parses the tag and knows from the tag which target workload account to deploy the infrastructure resources into, as well as what type of resources it's deploying and what the scope is. So here we're telling the pipeline that we're intending to deploy into the research dev account and in the eu-central-1 region. Next, let's take a look at triggering of the pipeline. The pipeline will get triggered as soon as the DevOps engineer git tags against the repository. There is one more step in between; Jack is going to go into that. And of course this pipeline will deploy into the target account. So as the pipeline runs, it goes through a number of stages.
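As an illustration of the tag convention described above, a pipeline step could parse a tag such as dev-eu-central-1-research-v1.0.0 into its parts. The sketch below does this in Terraform; the hyphen-separated format, variable names and regular expression are assumptions for illustration only, not the exact convention or code used in the talk.

```hcl
# Illustrative only: parse a release tag like "dev-eu-central-1-research-v1.0.0"
# into the pieces the pipeline needs (environment, deployment scope, team, version).
variable "git_tag" {
  description = "Release tag created by the DevOps engineer (format assumed for this sketch)"
  type        = string
  default     = "dev-eu-central-1-research-v1.0.0"
}

locals {
  # <env>-<region|global>-<team>-<version>
  tag_parts = regex("^(?P<env>[a-z]+)-(?P<scope>global|[a-z]{2}-[a-z]+-\\d)-(?P<team>[a-z]+)-(?P<version>v[0-9.]+)$", var.git_tag)

  environment      = local.tag_parts.env      # e.g. "dev"
  deployment_scope = local.tag_parts.scope    # a region name, or "global"
  team             = local.tag_parts.team     # e.g. "research"
  release_version  = local.tag_parts.version  # e.g. "v1.0.0"
}

output "parsed_tag" {
  value = {
    environment = local.environment
    scope       = local.deployment_scope
    team        = local.team
    version     = local.release_version
  }
}
```

With a tag parsed this way, the pipeline can decide which account, region and state file to target, as the speakers describe next.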
Jack is going to go into the details of those stages, and then at the end it's going to deploy those Terraform infrastructure resources into the target account. So Jack, what about the core infrastructure pipeline? So let's zoom into the different pipeline stages. First we want to look at the source stage. In the source stage we have a versioned S3 bucket that is the source of the pipeline. It contains two pieces of information. One is the Terraform code that describes the target state infrastructure; on the right it contains the VPCs and the application load balancers you see. The second piece of information is what Lerna talked about, the git tag data, so that it can be passed on to later stages of the pipeline and influence the pipeline deployments across different regions, different accounts, different environments and so forth. Let's take a look at how we capture that information. The magic here is the tag trigger: it fires when the DevOps engineers tag the git repo. That action triggers a CodeBuild project, which in turn grabs information from the code repo, that's the Terraform code, and it also gets the git tag information, bundles them together and puts them into the Amazon S3 bucket. That in turn triggers the pipeline to run. Next we want to look at the infrastructure as code linting stage. As a best practice, we want to lint the Terraform code very early in the pipeline. In fact it's the first step in the pipeline, and it ensures that the code we want to deploy adheres to best practices. In this case we are using TFLint. It's an open source tool, and it performs a couple of things. It does static analysis of the code to find possible errors, for example invalid instance types. It will warn about any deprecated syntax and unused declarations, and it enforces best practices. If any violation is detected, the pipeline stops there and notifies the DevOps engineer to fix the Terraform code. So let's take a look at the next stage. The next stage is a very important one: the infrastructure as code security scan. It's crucial to have this security scan at this early stage. For illustration purposes, we use an open source tool called Checkov. Checkov contains thousands of policies, ready to be used, and of course you can use any tool of your choice for the security scan stage. Essentially what it does is scan your Terraform code and generate a JUnit XML report, so that you can approve or reject the pipeline run. It has a very robust set of security checks, so it will catch, for example, an S3 bucket that's open to the world, or any security group with a quad zero (0.0.0.0/0) rule open to the world. Checkov will find that vulnerability and stop the pipeline from proceeding any further. Next we want to take a look at how the Terraform actually gets deployed to a target environment. You can probably guess it: it's terraform plan and terraform apply. The terraform plan generates a plan file that describes the changes that are going to happen in the target workload environment. The terraform plan generates that, and it can be reviewed by a DevOps lead.
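Because the pipeline itself is built with Terraform, the linting and security scanning stages described above could be backed by a CodeBuild project along these lines. This is a minimal sketch under assumptions: the project name, role variable, build image and install commands are illustrative, not the talk's actual resources.

```hcl
# Hypothetical IAM role for CodeBuild, assumed to exist already.
variable "codebuild_role_arn" {
  description = "IAM role assumed by the CodeBuild project"
  type        = string
}

# Illustrative CodeBuild project that runs TFLint and Checkov against the
# Terraform source before any plan/apply stage runs. It is wired as a
# CodePipeline action, fed by the pipeline's versioned S3 source stage.
resource "aws_codebuild_project" "iac_checks" {
  name         = "terraform-lint-and-scan" # hypothetical name
  service_role = var.codebuild_role_arn

  artifacts {
    type = "CODEPIPELINE"
  }

  environment {
    compute_type = "BUILD_GENERAL1_SMALL"
    image        = "aws/codebuild/standard:5.0"
    type         = "LINUX_CONTAINER"
  }

  source {
    type      = "CODEPIPELINE"
    buildspec = <<-YAML
      version: 0.2
      phases:
        install:
          commands:
            # Install the scanners; verify install method/URL for your own setup.
            - curl -s https://raw.githubusercontent.com/terraform-linters/tflint/master/install_linux.sh | bash
            - pip3 install checkov
        build:
          commands:
            - tflint --init && tflint                        # static analysis of the Terraform code
            - checkov -d . -o junitxml > checkov_report.xml  # policy scan, JUnit XML output
      reports:
        checkov:
          files:
            - checkov_report.xml
    YAML
  }
}
```

If the lint or scan fails, the build fails and the pipeline stops before any plan or apply is attempted, which is the shift-left behavior the speakers describe.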
Once the DevOps lead is okay with the change, they can put in their manual approval in the pipeline so that it can go on to the terraform apply stage, where the terraform apply actually makes the changes to your target workload. So Lerna, you put a lot of thought into this architecture. Can you highlight some of the key considerations here? So we are following multi-account best practices and we have a central tooling account. It contains the CI/CD resources. We have a target workload account that contains the sample workload. You can imagine that the central tooling account has the git repository and the pipelines, and it needs to be accessed by the DevOps engineers to trigger these deployments into the business unit accounts, which are the target workload accounts. So it's pretty sensitive in terms of the resources it contains. Similarly, the target workload accounts are for the business units. For example, a business unit like research that is publishing research documents to its external clients, or a risk business unit that has internal-use-only data for its internal clients, internal company workers. Each of these has a different security profile and a different access profile, so accordingly they need an account of their own. Here we are following a cell architecture principle, and the idea is that we are categorizing these different use cases based on security profile and access profile, and also minimizing the blast radius by containing the resources that belong together. The idea is similar to bulkheads, those vertical partitions used inside shipbuilding all the way back to the twelfth century. And this idea of bulkheads is also used in space, for example in the ISS, the International Space Station. So we are applying that principle of containing these resources and minimizing the blast radius in cloud deployments. What about multi-region pipelines? Currently our pipeline is single region, and from that single region pipeline we are deploying against workload accounts in multiple regions. What if we extend our pipeline and deploy our pipeline resources into a second region? We can do this because we are using Terraform for our pipeline's deployment as well. The idea here is to use the region boundary as the cell, making sure that we contain any issues within that boundary, within the region. What we will do in this case is have each region's pipeline be in charge of its own region's deployments. Meaning, if there is an issue in region one, then we want the region two through n pipelines to continue to be able to deploy to their respective regions. This way we are containing the issue in a given region and ensuring that there is no impact across all of our regions. Again, we're using the cell architecture principle. As for pipelines per environment, we actually have a pipeline per environment. Here we are showing the dev pipeline, which is triggered by git tags prefixed with dev in the name, and these dev pipelines are using S3 buckets dedicated to the dev resources and targeting the workloads in the dev environment. So again, the cell architecture principle applied to environments. In this case, if there's an issue in the dev environment with the dev pipelines, for example, we're not going to impact the higher, more sensitive environments like production.
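One way to express the pipeline-per-environment cells Lerna describes is a single Terraform module instantiated once per environment. The sketch below assumes a hypothetical local module and input names; the point is simply that each environment gets its own pipeline, artifact bucket and state/lock resources, so an issue in one cell cannot spill into another.

```hcl
# Illustrative only: stamp out one pipeline "cell" per SDLC environment so an
# issue in dev tooling cannot affect staging or prod deployments.
variable "environments" {
  type    = set(string)
  default = ["dev", "qa", "staging", "prod"]
}

module "deployment_pipeline" {
  source   = "./modules/deployment-pipeline" # hypothetical local module
  for_each = var.environments

  environment          = each.key
  artifact_bucket_name = "tf-artifacts-${each.key}" # dedicated, versioned S3 source bucket per environment
  state_bucket_name    = "tf-state-${each.key}"     # separate Terraform state bucket per environment
  lock_table_name      = "tf-locks-${each.key}"     # separate DynamoDB lock table per environment
}
```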
So I find it fascinating that we are taking this idea of bulkheads, from twelfth century shipbuilding and the International Space Station, and using it in cloud deployments for multi-region. So it's pretty fascinating, but I'm really curious about the security considerations for our solution. Jack, can you walk us through those? Yeah, sure. It's very interesting that you make that analogy between the International Space Station, the bulkheads of twelfth century shipmaking, and multi-region deployment. Very fascinating. So let's take a look at the security aspect. Now, we want to integrate the security scanning at a very early stage in the pipeline. So imagine a developer goes in and tries to make an S3 bucket public. If that happened, it would get caught at a very early stage. Right in the third stage here, in the security scanning, it would get caught, and the pipeline would not proceed and deploy that open S3 bucket change into the target workload accounts. If you look at how a traditional enterprise software development process goes, the security scanning stage sits at a very late stage of the development lifecycle, typically right before production deployment. So imagine that happened; then they pretty much have to go back to the development cycle again to fix that issue and then redeploy. And that's highly inefficient. With this shifting left approach, the iteration of the development lifecycle can happen much faster: you provide a tight feedback loop back to the developers in case they make any mistakes from a security vulnerability perspective. And the last security principle that we're looking into is to make the pipeline the authoritative source for deployments. What that means is that all the changes that happen to the target workload should be made from the pipelines. That concept ensures that the environments stay consistent. So how do we do that? One way is to prohibit developers from going into the target environment to make changes, and to enable only the pipelines to make changes there. From an implementation perspective, it's a combination of IAM roles and IAM policies to make sure that the pipeline has the permissions to deploy to the target environment, while prohibiting any individuals, personnel or developers from going into the target environment and making changes. So Lerna, let's dive into the code deployment part. So we are storing our Terraform state files in Amazon S3 buckets, as per HashiCorp best practices for AWS Terraform remote state management. And we also use DynamoDB tables for the Terraform state locks. This is so that concurrent write attempts against the state files are prevented, using the locks, as per best practices. So that's pretty standard. But if you look at our buckets and DynamoDB tables, we actually have one set of these per environment. Again, we are following the cell architecture principle, ensuring that we're isolating and minimizing the impact scope per environment. Now if you look at the S3 keys for the Terraform state files themselves, you'll see some familiar things in there. They are actually coming from the git tag. So I want to dive into this git tag example on the left.
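As a rough sketch of the remote state setup just described, a backend block might look like the following. The bucket, key, table and region values are placeholders; since backend blocks cannot interpolate variables, the pipeline would typically supply these via -backend-config, with the state key built from the git tag as discussed next.

```hcl
# Illustrative backend configuration: state in a versioned S3 bucket with a
# DynamoDB table providing state locking to prevent concurrent writes.
terraform {
  backend "s3" {
    bucket         = "tf-state-dev"                            # one state bucket per environment (name assumed)
    key            = "research/eu-central-1/terraform.tfstate" # in practice derived from the git tag: team + deployment scope
    region         = "us-east-1"                               # region hosting the state bucket (assumed)
    dynamodb_table = "tf-locks-dev"                            # lock table per environment (name assumed)
    encrypt        = true
  }
}
```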
So take a git tag that looks like dev, eu-central-1, research and a version number. When the DevOps engineer tags the repo with this, our pipeline parses the git tag. As we mentioned, it takes the environment and the team name, so that's dev and then research, and the deployment scope, eu-central-1 in the first example there, and it uses them to construct the S3 key for the Terraform state file inside the Amazon S3 bucket. And if you look at the different Terraform state files, there are multiple reasons why we are using this structure. First of all, say the research dev account has an issue in one of its existing regions. Let's say research dev is currently deployed in eu-central-1 and us-east-1 and is experiencing an issue in eu-central-1, and let's say that's an issue with the Terraform state. Maybe we need to do some Terraform state surgery in eu-central-1 for research dev. We know that we will not have any impact on the existing deployments of research dev in any other region: research dev in us-east-1 is not going to be impacted, because that's in a completely different Terraform state file. Similarly, if we want to expand the research dev account to additional regions, we know that our provisioned deployments in existing regions like eu-central-1 and us-east-1 won't be impacted, because the Terraform state file for the additional region will be in a completely different place, for example the research dev ap-southeast-1 terraform.tfstate. Next, let's take a look at a pipeline demo and a Terraform code structure walkthrough. Let's take a look at the Terraform code structure for our sample infrastructure code. First we have our account considerations, available under the environments folder. We have the different environments that we configured, and under them we have the team names; these map to the business units. Inside of that we have the variables file; this is where we are passing the account number to the pipeline. Remember that the git tag has information about the environment name, like dev, and the team name, research. So the pipeline parses the git tag and is able to get the value of the account number to target. Next, let's take a look at the provider configuration. Our provider is parameterized by region. As you remember, the git tag also carries the deployment scope, and that is used as the region for the provider to target. The provider will also assume an IAM role inside the target workload account. So here you're seeing that when the target workload account is created, this role should also be provisioned in the account initially. And here you also see the account number; we just talked about how we feed that into the pipelines. Next, in main.tf, we have created two Terraform namespaces. This is helping us ensure consistency of what's deployed inside each region. Remember that regional resources are scoped at the account and region level, and global resources are scoped at the account level. So global resources are managed for the whole account, and regional resources are managed per account and region. Therefore we have created different Terraform namespaces, and in the pipeline we will target these namespaces depending on the tag. So in the tag we have the deployment scope: if it's a global deployment, then for example dev_global is the prefix; if it's a regional deployment, then it's dev_ and then a region name.
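A minimal sketch of how the provider parameterization and the two namespaces described above could be wired together. The deployment role name, variable names, fallback region and folder layout are assumptions based on the walkthrough, not the verbatim demo code.

```hcl
# Deployment scope from the git tag: a region name such as eu-central-1, or "global".
variable "deployment_scope" {
  type = string
}

# Target workload account number, resolved from the environments/<env>/<team> variables.
variable "account_id" {
  type = string
}

# Provider parameterized by the deployment scope, assuming a deployment role that
# was provisioned in the target workload account when the account was created.
provider "aws" {
  # A global deployment still needs some provider region; fall back to a home region (assumption).
  region = var.deployment_scope == "global" ? "us-east-1" : var.deployment_scope

  assume_role {
    role_arn = "arn:aws:iam::${var.account_id}:role/deployment-role" # role name assumed
  }
}

# Two top-level modules acting as the "namespaces": the pipeline targets one or the
# other depending on whether the tag's scope is "global" or a region name.
module "global_resources" {
  source = "./global" # account-scoped resources, e.g. IAM roles
  count  = var.deployment_scope == "global" ? 1 : 0
}

module "regional_resources" {
  source = "./regional" # account-and-region-scoped resources, e.g. VPC, ALB
  count  = var.deployment_scope == "global" ? 0 : 1
}
```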
So depending on the tag, we will deploy either the global resources module or the regional resources module, and these map to folders inside our repository. Inside these folders, for example, under global for our sample infrastructure workload we have an IAM role, and under regional we have a VPC, where we're using versioned Terraform modules. So we try to reuse as much of the existing, well tested code as possible, all the way up to an application load balancer that's external facing. So that's our sample workload, and this concludes the code structure walkthrough for our sample infrastructure workload in Terraform. All right, so let's take a look at the resources in our accounts. On the left hand side you are seeing our central tooling account, and I am logged in as cloud apps, so I have least privilege permissions for cloud apps to be able to access the CI/CD resources. On the right hand side you're seeing the workload account, the target workload account, and here on the right hand side, in the purple, in the other browser, I have read only access. Okay, so let's first start with the central tooling account. Here you're seeing the code repository. This is the Terraform infrastructure sample workload repository, the one that we just did the code walkthrough for. You can see the project here with the environments configuration and all of the folders that we discussed earlier. And if you look at the git tags, you will see all the previous git tags that triggered pipelines for deployments. I have triggered this dev global research 2.6 and also dev eu-central-1 research 2.6, for an eu-central-1 deployment of regional resources, and the previous one is a global resources deployment. And if we look inside the build projects, what we will see is that we have a number of build projects defined, and some of these are the ones that we will use inside the pipeline stage actions. So the ones at the bottom here, for example the TFLint for the linting of the Terraform code, Checkov for security scanning, and the terraform plan and apply, are the ones that we will use inside the pipeline. And the ones on top are the ones that we use per environment. When you git tag as a DevOps engineer on the repo, what these projects do is take the git tag and a full git clone of the repository and push that Terraform artifact into an S3 bucket that is versioned, and it's a separate S3 bucket per environment. So then that S3 bucket becomes the source stage of the pipeline. So let's take a look at the pipeline itself and how it looks. I've been working on the dev pipeline, that's why you're seeing this one used and the other ones untouched. So we'll look inside the dev pipeline, and if you look there, you will see that there is a source stage. This is the stage that we just mentioned, with the S3 versioned bucket. Every time there is a data change in the S3 versioned bucket that contains the Terraform infrastructure code for our sample workload, as well as the git tag that the DevOps engineer just created, every time there's a data event, this pipeline gets triggered. In the next stage there will be a TFLint of the Terraform code. So let's take a look at the output of that. In our case, the TFLint ran and it was successful.
In the next stage we have the Checkov security scanning of our Terraform code. This stage runs and generates a JUnit XML report of Checkov's successful security rules and also the failures in our Terraform code. That report will be presented in a manual approval action to the person reviewing the report and deciding whether or not it's okay to move forward with the Checkov output, or whether we should reject because we saw some security scan failures that we decided we need to fix. So here you're seeing the Checkov output. Here on top is the actual output in the logs, but it also generates a JUnit XML file that we can view in the dashboard after this step. The next step is the Terraform build stage. Inside the build stage we have a terraform plan action that generates the plan. So here we see that plan generated in the output of that job. And then if we look at the stage itself, we'll see a manual approval action that will present the terraform plan to the reviewer, and the reviewer can check the plan and ensure that it performs all the changes as intended, as per our Terraform code. And if they're satisfied with the terraform plan, then they'll approve, or they can reject it. If they see something funny in there, they'll reject it. Now, yesterday I approved this terraform plan and it went into terraform apply. If you look at the apply output, it's just a typical terraform apply. As you can see, we are applying what was in our terraform plan. In this case, I was deploying these resources into the eu-central-1 region, and this terraform plan completed successfully; the apply completed successfully. As you can see here, it's all green, and on the right hand side, as you see, in this region, eu-central-1, I have the resource deployments and the load balancer, and my load balancer there is working as well. So here you can see that I am able to see the output from my load balancer, and that concludes the pipeline walkthrough. Thank you very much, Lerna, for taking us through the demo and code structure. That's really helpful. So let's summarize what we learned in this talk. If you have one slide to take away from this talk, it's this particular slide. It summarizes what we learned so far. You have a DevOps engineer committing the Terraform code and tagging the code, which in turn triggers the pipeline through CodeBuild. This pipeline goes through the different stages that we spoke about, and the pipeline deploys the Terraform artifacts into the target workload accounts. It also illustrates the concept of the multi-account structure that we learned about, having a different pipeline per environment and having different accounts for different workload deployments as well. And I think the best thing is that we built every single component here using Terraform, which is really cool. So Lerna, what are some of the key takeaways? Absolutely. The most important takeaway about multi-region deployments is that we should make sure we are consistent with our deployments. It helps if we're leveraging the pipeline as our single source of truth for the deployments. It helps to structure our infrastructure as code and architect our pipelines such that we maximize code reuse, so that we can be consistent in the resources that we deploy into those accounts. And it's also important to shift security left in our pipeline so that we can catch those security issues early on in the SDLC, as Jack mentioned. Thanks for joining us.
Thank you for joining us.
...

Lerna Ekmekcioglu

Senior Solutions Architect @ AWS


Jack Iu

Solutions Architect @ AWS



