Conf42 Site Reliability Engineering 2021 - Online

Pitfalls of Infrastructure as Code (And how to avoid them!)


Are you looking to start your journey into Infrastructure as Code? Or have you already jumped in head-first? Either way, this session is for you! We’ll talk about many of the common pitfalls of IaC, and how you can avoid them.

We’ll go over:

* What IaC is
* Types of pitfalls you may have
* Infrastructure pitfalls
* Coding pitfalls
* Basic mitigation strategy for each pitfall

We’ll go over all kinds of things that you may or may not have even thought of yet. Get your questions ready, because I’m here to help you be successful in your IaC journey!


  • Tim Davis talks about the pitfalls of infrastructure as code and how to avoid them.
  • I'm currently the DevOps advocate with env0, an infrastructure as code automation startup. Before that I was with VMware, where I was part of a small group of people that created the cloud and developer advocacy team. My background is in infrastructure, but I've always been focused on the application specifically.
  • What is infrastructure as code? It is delivering and deploying infrastructure using code. Most of the pitfalls that come with infrastructure, and with writing, designing, and deploying code, can be mitigated by bringing together the development team and the infrastructure team.
  • CloudFormation is a great infrastructure as code framework if you are all in on AWS. Pulumi will allow you to use the languages that you're already writing in. Terraform, Pulumi, and Terragrunt are not cloud-specific frameworks but cloud-agnostic, or multi-cloud, frameworks. What is your cloud going to look like in the future?
  • When you shift security left, every single time you build and test your code, you're building and testing your security policy as well. Open Policy Agent is great for security, but also for compliance. Don't repeat yourself when you are writing infrastructure as code.


This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative Litmus Cloud.

Hey there, I'm Tim Davis, DevOps advocate with env0, and today we are going to be talking about the pitfalls of infrastructure as code and how to avoid them. But first I want to thank the sponsors, the conference folks, and everybody for putting on the event and for accepting me to talk. Hopefully this is helpful for you. All of the other speakers today have been great, so it's a pleasure just to be involved.

A little bit about me. As I mentioned, I'm currently the DevOps advocate with env0. We're an infrastructure as code automation startup. Before that I was with VMware, where I was part of a small group of people that created the cloud and developer advocacy team. Before that I was in the networking and security business unit. And before that I had a string of roles: enterprise architecture, virtualization engineer, systems administrator, infrastructure operations. Long story short, my background is in infrastructure, but I've always known that there's no point in delivering infrastructure just for the sake of delivering it. You're delivering it for the sake of running the application that runs the business. So I have always been focused on the application specifically, and not just on throwing a bunch of servers out there for no reason.

So what is infrastructure as code? It sounds pretty self-explanatory: it is delivering and deploying infrastructure using code. But how did we get there? Folks like myself have for years just been building physical servers, installing operating systems, and configuring applications.
With the advent of public cloud and automation tools, it changed from physically building stuff to going into a UI, clicking around, and building stuff. With DevOps and everything else going on these days in application development and releases, at small startups, smaller companies, and even some bigger companies, developers are deploying more and more infrastructure themselves. Some of that is done through portals that the infrastructure teams have set up. Some of it is through systems like OpenStack back in the day, where you could package an API call and throw it out there to get what you wanted. Developers, though, don't want to jump into a UI and click around to get what they want. They want to write code and do exactly what they do for their normal job. That's really what brought about the infrastructure as code model, and it has morphed into what we know today with infrastructure as code, GitOps workflows, and all of these complex tasks. It came about from developers wanting to do what developers do, and that's write code.

So what types of pitfalls can come up with infrastructure as code? I've got good news and I've got bad news for you. The bad news is it's a lot. It's both. It's most of the pitfalls that you can think of that come with infrastructure, and it's most of the pitfalls that come along with writing, designing, and deploying code. There's a lot going on here. But really, a lot of the issues that come along with infrastructure as code can be mitigated simply by bringing together the development team and the infrastructure, or operations, team. It's development and it's ops. It's DevOps. A lot of people think that DevOps is just throwing tools like CI/CD pipelines at a problem, and that's really not it.
The tooling is only a small part of DevOps and of being successful. A lot of it is people and process. So if you have the communication and break down those old silos, you will have a much better time. One of the reasons for this is that infrastructure folks have the background and the knowledge. They know exactly what kinds of issues may come up, how to solve them, and what things you may need to think of. Development folks have been writing code, storing code, deploying code, and editing it. They know exactly what goes into that process and all the different methodologies. So why not use everybody's experience across the board in this new venture that brings everybody together to be successful?

So let's break this down a little bit. We'll start with some infrastructure pitfalls, then we'll move into some code pitfalls, and we'll give you a lot of questions to ask and hopefully set you up for success along the way.

One of the biggest things when it comes to infrastructure as code is what kind of framework you are going to use. This could technically also be a code pitfall, because there are languages that you may need to learn. Some frameworks don't use general-purpose languages. Some use their own; for instance, Terraform uses HCL. Pulumi, though, will allow you to use the languages that you're already writing in, so if you're using Go or Node.js, you can write your infrastructure as code files with that.

But from an infrastructure perspective, you need to make some choices. What is your cloud going to look like in the future? You may be in AWS right now, you may be in Azure, you may be in VMware, you may be in whatever, and as far as you're concerned, you're never going to move. You're all in on that. You're always going to be in AWS, so CloudFormation is definitely the right choice for you.
Well, that's great, but what happens if your company is acquired twelve months from now and the new owners are a 100% Azure shop, or they use IBM Cloud? You're not going to be able to take those AWS CloudFormation scripts with you. You're going to need to rewrite them or convert them somehow. There are frameworks, though, that allow you to work a little more easily between clouds. Your Terraforms, your Pulumis, your Terragrunts: these are not cloud-specific frameworks but cloud-agnostic, or multi-cloud, frameworks, if you will. They give you the ability to move between clouds. Of course, it's not as easy as taking your AWS Terraform and deploying it in Azure. You're going to have to change the provider, and you're going to have to change a lot of other things, but it gives you a better starting point: you already know the language, you already know the structure, you already know your deployment methodologies, and you don't really have to change those.

Also, with Terraform, it's not just the big three public clouds. There are Terraform providers for every major public cloud, a lot of minor public clouds, some private cloud software, and some SaaS tooling. And at this point, I believe F5 has a Terraform provider that will allow you to interact with their hardware. So there are a lot of different things you can do with Terraform or Pulumi or Terragrunt that allow you to stay a little bit agnostic and give you a little more freedom down the road, just in case something happens.

Also, if you choose a bigger, more popular framework (at this point in time Terraform is the industry standard for infrastructure as code, though you're seeing a lot more Pulumi and things like that these days), you're going to get a lot more providers and a lot more community support. It's open source.
So if you feel that you need to change something, or you want to suggest an update, or you want to open an issue, you can always just jump into GitHub and do that. Of course, CloudFormation has AWS support and so on, but it closes you in a little bit. And if you are going to be 100% in AWS, and you know that, and you want to take that risk and go for it, great. CloudFormation is a great infrastructure as code framework, and there are a ton of different tools built into AWS to help you with it. The same goes for ARM templates in Azure. So there are lots of different things you can do and lots of different variables that go into it. But the very first question you need to ask yourself when you're looking at infrastructure as code is: what framework are we going to use, and what direction are we going to go in?

Security is something really big. It shouldn't be an afterthought, and in the past it always has been. In my experience, I worked in so many shops where developers just had their sandbox and did their own thing throughout the day. All of their development work got pushed through, and security wasn't really brought in until it was time to push the new thing out to production. Then it was: here's a bunch of firewall requests, here's all the stuff we need to push things into production. When it comes to infrastructure as code and speeding up the developer lifecycle, you're going to have a much better time if you bring security closer in and make it more a part of that lifecycle. It's a little like what we talked about with bringing developers and infrastructure together to be successful. If you bring all the stakeholders, your security folks, your networking folks, and everybody else, into this design and deploy process, you're going to have a much better time.
Now there's a term you may or may not have heard before called shift left. If you're not familiar with it, think of the development and deployment lifecycle as a timeline: you're building code, you're testing code, you're deploying code, all along one long line. If you only deal with security, doing the firewall requests, making your IAM policy, and so on, right when you're pushing things out at the end, and you have a problem with that, you may have to go back five steps in order to fix it. But if you shift security left, then every single time you build and test your code, you're building and testing your security policy as well. If you ever have a problem with that, you just take one little step back.

You can even take it a step further. With tools like Terrascan by Accurics or Checkov by Bridgecrew, you can set up your automation so that you can't deploy if you have something insecure. Say you're deploying an S3 bucket out to AWS and you may have some sensitive data in there. You don't want to deploy it with a default open policy where anybody can read it. If you set up security as code during that build, test, or deployment process and you see an issue like this, you can just cancel that deployment, make the quick fix, and then go through and finalize the deployment. So there are lots of different tools out there that will help you design this security policy and push it out.

So we've talked a little bit about choosing the right framework, bringing infrastructure and developers together, having those conversations, and making sure everybody is involved so you can be successful. A lot of that focuses on the infrastructure side. But what about the code side?
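
As a concrete picture of the security-as-code gate described above, here is a minimal sketch in plain Python. It is not how Terrascan or Checkov work internally; the `Bucket` model, the `sensitive` flag, and the `find_violations` and `gate` functions are all hypothetical names invented for illustration. The only real-world detail assumed is that `public-read` and `public-read-write` are S3 canned ACLs that expose a bucket publicly.

```python
from dataclasses import dataclass

# Hypothetical, simplified resource model. Real scanners parse actual
# IaC files and ship with many more rules than this single check.
@dataclass
class Bucket:
    name: str
    acl: str = "private"
    sensitive: bool = False

# S3 canned ACLs that make a bucket publicly accessible.
PUBLIC_ACLS = {"public-read", "public-read-write"}

def find_violations(buckets):
    """Return one message per sensitive bucket that uses a public ACL."""
    return [
        f"{b.name}: sensitive bucket uses public ACL '{b.acl}'"
        for b in buckets
        if b.sensitive and b.acl in PUBLIC_ACLS
    ]

def gate(buckets):
    """Return True if deployment may proceed, False if it must be cancelled."""
    violations = find_violations(buckets)
    for message in violations:
        print("BLOCKED", message)
    return not violations
```

Run on every build and test rather than only before production, a check like this is what keeps the fix to "one little step back": the pipeline stops as soon as `gate(...)` returns `False`, the ACL is corrected, and the deployment is retried.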
What are some issues that you may run into, and what are some of the mitigations for them?

Default values are a big one. Any infrastructure person out there, and even some developers, will know this. If you've used AWS and you've jumped into the UI and said you need to deploy an EC2 instance, it opens up a little wizard with a bunch of little boxes. Every time you click next, there are a bunch more boxes to fill out or questions to answer so that you can configure that instance for deployment. When you're using infrastructure as code to deploy an EC2 instance, you don't necessarily have to account for every single one of those little boxes. For some of them, if you leave them blank, the deployment may fail. For others there may be a default, so if you don't specify something, the framework may just pick for you. Well, what happens if it's the IAM policy or something like that for security, where it chooses a default policy that may not be as secure as you need for that resource? Or what if you design your code, try to deploy it, and it fails because it needs something? You need to know exactly which resources need which kinds of specifications. Do you have that set up as a variable? Are you hard-coding that value? You need to make sure that you are answering those questions so that you don't accidentally deploy something that's insecure or incorrect, and so that you don't end up with a bunch of failures and have to start over.

So how do we mitigate that? Open Policy Agent is great. This can help you out.
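
To make the default-value pitfall concrete before turning to policy tooling, here is a minimal sketch in plain Python (not Open Policy Agent or its Rego language). The `REQUIRED` and `DEFAULTS` tables and the `resolve` function are hypothetical and describe an imaginary instance resource; real providers document their own required arguments and defaults.

```python
# Hypothetical argument tables for an imaginary "instance" resource.
REQUIRED = {"image_id", "instance_type"}
DEFAULTS = {"security_group": "default", "monitoring": False}

def resolve(spec):
    """Fill in defaults and report which ones were applied.

    Raises ValueError when a required argument is missing, mirroring a
    deployment that fails instead of guessing.
    """
    missing = sorted(REQUIRED - spec.keys())
    if missing:
        raise ValueError(f"missing required arguments: {', '.join(missing)}")
    # Arguments the author never set: these are the silent choices to review.
    defaulted = sorted(key for key in DEFAULTS if key not in spec)
    resolved = {**DEFAULTS, **spec}
    return resolved, defaulted

resolved, defaulted = resolve({"image_id": "img-123", "instance_type": "small"})
print(defaulted)  # ['monitoring', 'security_group']
```

The point of returning `defaulted` is that values chosen on your behalf become visible, so someone can decide whether a default such as the `default` security group is actually acceptable for that resource.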
You saw Open Policy Agent on the security slide a little while back, and you absolutely can use it for security, but you can also use it for compliance: making sure that the deployments you have going out stay within guardrails. Hey, this can only be in a certain region, or I can only use a certain type of instance, or I can only have a certain number of instances. Or, hey, I cannot deploy an S3 bucket that has a policy open to the Internet. There are so many different things, it's totally customizable, and it's another one of those blank-as-code things. We have infrastructure as code, we've talked about security as code, and now we have policy as code with Open Policy Agent. So you can work with your security team, your compliance team, and your infrastructure folks and developers to build a policy of what your deployments should look like. Or you can go the other way: are there just specific things that you will not allow, with everything else being okay? Open Policy Agent allows you to write that policy and then enforce it at deployment time.

Don't repeat yourself. Don't repeat yourself. This is a developer methodology that aims at simplifying the code you write. For instance, say you are using infrastructure as code, you pick Terraform, and you want to deploy out to AWS, but you also have a few different environments: a testing environment, a development sandbox, and production. All of these use similar versions of the same app. They may just have different values for things like IP addresses, which VPCs they're connected to, et cetera. Does it make more sense to keep all of these different versions of the same code, or to write your code in such a way that you only have one piece of code whose values you change whenever you deploy it for a different purpose? How do we mitigate this? There are lots of different ways we can do this.
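
The single-definition option raised in that question might look roughly like the following. This is a plain Python sketch rather than Terraform or Terragrunt; the environment names, CIDR ranges, instance counts, and the `app_stack` function are all invented for illustration.

```python
# Hypothetical per-environment values; only these differ between deployments.
ENVIRONMENTS = {
    "dev": {"vpc_cidr": "10.0.0.0/16", "instance_count": 1},
    "test": {"vpc_cidr": "10.1.0.0/16", "instance_count": 2},
    "prod": {"vpc_cidr": "10.2.0.0/16", "instance_count": 4},
}

def app_stack(env, vpc_cidr, instance_count):
    """Single definition of the app's infrastructure, parameterized per environment."""
    return {
        "vpc": {"name": f"{env}-vpc", "cidr": vpc_cidr},
        "instances": [f"{env}-app-{i}" for i in range(1, instance_count + 1)],
    }

# One definition, called once per environment with that environment's values.
stacks = {env: app_stack(env, **values) for env, values in ENVIRONMENTS.items()}
```

A change to the shape of the stack is then made once in `app_stack` and reaches every environment, while environment-specific differences stay confined to the small table of values.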
You could just take regular code and make everything a variable, and then during the deployment process, with whatever automation you choose, inject the variables that you need. But there are also things that have been purpose-built to help you make DRYer, more manageable code. Things like Terraform modules. If you have at least a VPC, an RDS instance, or some EC2 in all of your different environments, why not write that code one time, make it a reusable module, and then just call and customize that module every time you want to deploy it? Or say you have the same set of code that you use for production, staging, development, and so on, and you just want to make small configuration changes each time. That may be something where Terragrunt works for you. That way you only have to update one set of code, and it can then go through and redeploy to all of your different environments. It allows you to manage less duplicated code and to write a little less code that's a little more manageable and a little more reusable.

Designing for state size is a good one. When you deploy things with infrastructure as code, you end up with a state file, and that is simply a text snapshot of the current state of the deployment. Say you have Terraform that's deployed out to the cloud, with three EC2 instances and an elastic load balancer, connected to a VPC that has an RDS instance tied to it as well. You need to know what's out there at any given time.
One of the reasons the state file exists is that a lot of infrastructure as code frameworks have the smarts to say: I've already deployed three instances, and the code was changed to say that I need four, so I'm not going to deploy four more on top of those three, and I'm not going to delete those three and deploy four. It just says: I already have three, I'm going to add one, and then I've got four. That's one of the cool things about declarative infrastructure: you tell it what you want things to look like at the end, and it decides for you how to get there.

Now, if you have all of your different resources, like your VPCs, your VNets, your instances, your databases, and so on, mashed into one infrastructure as code file, you're going to end up with one giant state file. That's all fine and good if it's a very static environment. But if it's a very dynamic environment that's always changing and always updating, then every single time you need to make an update to that infrastructure, the framework has to go through and check that entire state file. It has to sanity check what the state file says you have against what's actually out there, so that it deploys what it needs to and everything looks the way it should in the end. That can sometimes take a really long time. I've had customers come and say that their deployment times are upwards of one to two hours because they have giant Terraform workspaces with one giant state file that takes forever to run against every single time.

So how do we mitigate state size issues? The good news is we just talked about it: don't repeat yourself. This methodology of designing smaller, more repeatable code can help you end up with smaller state files, thus improving your deployment performance down the road.
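
The declarative behavior described above (three instances exist, the code now asks for four, so one is added) can be sketched as a comparison between recorded state and desired resources. This is a simplified illustration in plain Python, not Terraform's actual planning logic; the `plan` function and the count-per-resource-type representation are assumptions made for the example.

```python
def plan(state, desired):
    """Compare recorded state with the desired resources and list the changes.

    Both arguments map a resource type to a count, e.g. {"instance": 3}.
    """
    changes = {}
    # Every resource type known to either side is visited on every run.
    for kind in sorted(state.keys() | desired.keys()):
        delta = desired.get(kind, 0) - state.get(kind, 0)
        if delta:
            changes[kind] = delta  # positive: create, negative: destroy
    return changes

current = {"instance": 3, "load_balancer": 1, "database": 1}
wanted = {"instance": 4, "load_balancer": 1, "database": 1}
print(plan(current, wanted))  # {'instance': 1}
```

Note that the loop still visits the load balancer and the database even though only the instance count changed. In this toy model that is the cost of one large state: the work grows with everything recorded in it, which is why splitting state into smaller pieces reduces how much has to be checked for any single change.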
If you design your code using things like Terraform modules or Terragrunt, it's going to be broken up into much more manageable pieces. You have your VPC module, your EC2 module, and your RDS module, and every time you need to make an update to one of them, your deployment process runs against that module and says: oh great, here's my VPC, I know what I have, I'm going to deploy this out and I'm going to be done. It doesn't have to go through and check all of the EC2 instances, all of the load balancers, and all of the databases. It's only checking against that specific module. Now, writing your code as modules doesn't fix that problem by itself. You have to make sure that your automation allows you to do this as well. But starting with the DRY methodology when you're designing this code, and making sure that it's in smaller, more repeatable chunks, will absolutely help you mitigate state file performance issues down the line.

If you want to learn more, there are so many awesome resources out there on infrastructure as code. Today at this conference I'm surrounded by so many other awesome speakers that hopefully you've learned a bunch from as well. If you have questions for me, please feel free to reach out on Twitter. That is by far the best place to get me. I put out videos on infrastructure as code tools once a week, on Mondays at 8:00 Central Time, so feel free to check those out as well. Thank you so much for checking out my session. Hopefully this was helpful for you. Have a great rest of your day, and hopefully next time we'll see each other in person.

Tim Davis

DevOps Advocate @ env0
