Conf42 DevSecOps 2021 - Online

Compliance As Code with Cloud Custodian


Compliance is about risk management, and the cloud is no exception. Data leaks, privilege escalation and similar incidents happen all the time. Cloud Custodian is a rule engine that offers a comprehensive and scalable way to bake compliance into your cloud platform. This session will show you how.

In line with the Everything as Code approach, Policy as Code consists of describing a number of rules that our cloud platform should abide by. However, unlike Infrastructure as Code, which is now widely adopted, this approach remains largely unheard of.

We usually observe hand-crafted solutions built to complement the limited services that cloud providers already offer. Cloud Custodian is an open source solution that enables Policy as Code with AWS, Azure and GCP.

Through the example of a common FinOps problem, this session will demonstrate the benefits of such an approach and its straightforwardness compared to an empirical, manual approach filled with copy-pasted boilerplate.


  • Compliance is all about managing risk. By breaking down the notion of compliance into policies as code, we can apply already proven solutions. The question really becomes how to implement this compliance into the platform.
  • Cloud Custodian is an open source initiative launched by Capital One. It mainly consists of a Python library based on YAML files. It can be interfaced with the three main public cloud providers: AWS, Azure and Google Cloud Platform. Since it's written in Python, it can run everywhere.
  • There are about three filter types you can use, including specific filters for more complex filtering. The talk walks through an example: all our RDS DB clusters must have the continuous backup retention period set to its maximum, which is 35 days. Any wrongly configured resource will be remediated.
  • Cloud Custodian is open source by nature. If you identify a specific need, it's up to you to develop a new feature and give it back to the community. The main goal here for compliance is to help you leverage Cloud Custodian to bake your own compliance rules into your cloud platform.
  • Cloud Custodian allows you to forbid the creation of compute instances with a public IP. The policy file is made of four different policies, each with a different trigger. Ismael shows you how to use it with a concrete case.
  • When we want to batch operations on the cloud platform, we usually go through scripts. But Cloud Custodian, through filtering and actions, is really able to batch your actions. Here I have a policy that filters on GCP instances labeled devfest in order to delete them.
  • Wescale is a company that has built a community of 50 experts who help you become cloud native. We are currently a CNCF service provider and also HashiCorp, AWS and GCP partners. We are actively hiring in France and remotely. If you have any questions, feel free to contact us.


This transcript was autogenerated. To make changes, submit a PR.
Hi there, my name is Ismael. I am a cloud native developer at Wescale. In today's talk, we are going to present the notion of compliance in the cloud and how to think about it efficiently. So first of all, what is compliance? When we think about the cloud, we think about a cool place where we can have an idea and implement it into a business in a matter of hours or even minutes, because the cloud will provide us with hardware and managed services so that we can focus on the business side of our code. But it could also be a very dangerous place. Where is my data located? Is it encrypted? Does the workload actually do what it says? All those questions must be answered from a client point of view, because we need some guarantees that we are not in a malicious situation, that we are in a safe place, and this guarantee is represented by compliance from the business provider. Compliance is all about managing the risk. It's a set of rules to abide by in order to prove that we took the necessary steps to protect our clients in the consumption of our service. It is usually legally driven. We all know GDPR, but also PCI DSS, for instance. It could also be internally driven, with human resources or environmental policies. As a set of rules, compliance can be seen as an obstacle to your business or innovation process. But you should really create a win-win situation where you embrace compliance as being part of your business, where you think about it continuously and in an automated manner, so that any audit is a non-event and you can get rid of countless hours of meetings and in the end have a kind of compliance governance. Of course, compliance is nothing new for cloud providers.
They already provide you with a shared responsibility model that tells you that the hardware and managed services provided are compliant with GDPR, for instance, but you are still in charge of implementing your part of compliance on the cloud platform, for each piece of data or workload that you deploy on this platform. So the question really becomes how to implement this compliance into the platform. And in this schema, what is interesting is that we really want to bake the compliance into the cloud platform so that each time we deploy data or a workload, it is compliant by design. And by breaking down the notion of compliance into policies as code, we begin to see that we can apply already proven solutions such as DevOps, which will enable us to apply the CI/CD process to our policies and in the end our compliance. So one question that we can ask is: how does the code that we store in these repositories translate in terms of cloud vocabulary? With Tanguy, we see a three-layer model. The first one is identity and access management. That is all about trust. You give trust to identities so that they can act on the cloud platform. It is provided by the cloud platform, and we obviously need to control that trust. And this is where the policies come in. We distinguish two kinds of them. The first one, passive platform policies, is also provided by the cloud platform and is about configuration that you can set on the different services provided by your platform. For instance, you can forbid the creation of an object bucket inside a given region. The problem is that it's really closely coupled to the platform, and we may lack expressivity when we think in business terms. That is to say, the cloud platform won't be able to follow you in all of your needs.
And this is where reactive platform policies shine. Because the platform produces events when something happens, we are in a position to consume those events in order to trigger actions that will implement one of our policies and in the end enable compliance. We can take a simple example on GCP: for instance, we may want to stop our SQL instances each evening and start them up again each morning. Usually we will see the following implementation. We call that choreography: we choreograph different services with one another in order to get a certain policy in place. In the case of GCP, we could very well implement two entries inside a scheduler. One of them would be in charge of calling a stop SQL instance cloud function, and the second one would be in charge of calling another cloud function that would start the SQL instance each morning. Of course we would use infrastructure as code through Terraform in order to easily reproduce this architecture inside another project. So far so good, but we see three main drawbacks. The first one is that we have to think about this architecture, and it's very subjective. In fact, some of you may have another approach with the same result. The second drawback is about the code that we put inside those functions. It's up to you to code these functions, and you could very well introduce one or more bugs. And last but not least, we have a risk of duplication between different teams because of a lack of communication. With Tanguy, we discovered a nice tool that could provide you with answers to the different problems we previously mentioned. First of all, it enables you to switch from an imperative paradigm to a declarative one. Cloud Custodian will provide you with a domain-specific language that enables you to describe policies through a YAML file and with your own business terms.
This is really interesting because from there you don't have to do anything but write a YAML file, and the deployed infrastructure will be taken care of by Cloud Custodian, as well as the code that will be deployed inside those functions. So we tackle the problem of the architecture: we don't have to think about it. We tackle the problem of the code inside the cloud functions: we don't have to think about it either. And we tackle the problem of duplication, because now we are able to express our policies in a few files with business terms, and those files are easily shareable among all your teams so that they can benefit from the same treatment and the same approach to your policies. So Tanguy, can you introduce yourself and tell us more about what Cloud Custodian is?

So Cloud Custodian is an open source initiative launched by Capital One. It mainly consists of a Python library based on YAML files. As input, each YAML file describes a list of policies that will help you to set up and ensure the two types of compliance mentioned by Ismael just before: reactive compliance and passive scanning compliance. In a few words, Cloud Custodian can be interfaced with the three main public cloud providers mentioned here: AWS, Azure and Google Cloud Platform. Note that Google Cloud Platform support is in alpha stage for now, but soon to be fully released. Since it's written in Python, it can run everywhere. The project is currently in the CNCF sandbox and has a monthly release frequency. So, Cloud Custodian under the hood. A Cloud Custodian policy can be described as shown here. We can see the type of resource that will be targeted by the policy; here we're talking about Microsoft Azure disks. We can also filter to select only the resources that matter to us; here we want to select all the Microsoft Azure disks that are not attached to any other resource. And there are the actions to perform on the resources found; here we want to delete them.
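The slide with the policy is not part of the transcript, so here is a minimal sketch of the resource / filters / actions shape just described. It is a toy model in plain Python, not Cloud Custodian's own code: the policy name, the `managedBy` key used to detect an unattached disk, and the sample inventory are all assumptions made for illustration.

```python
# Toy model of a Cloud Custodian-style policy; NOT the library's own code.

# The policy as plain data, mirroring the resource / filters / actions layout
# of a YAML policy file. The name and the `managedBy` key are assumptions.
POLICY = {
    "name": "delete-unattached-disks",
    "resource": "azure.disk",
    "filters": [{"type": "value", "key": "managedBy", "value": "absent"}],
    "actions": [{"type": "delete"}],
}

# Hypothetical inventory standing in for what the cloud API would return.
DISKS = [
    {"name": "disk-a", "managedBy": "vm-1"},  # attached to a VM
    {"name": "disk-b"},                       # attached to nothing
]


def matches(resource, flt):
    """Evaluate one simplified 'value' filter against one resource."""
    actual = resource.get(flt["key"])
    if flt["value"] == "absent":
        return actual is None
    return actual == flt["value"]


def plan(policy, inventory):
    """Return the (action, resource name) pairs the policy would produce."""
    selected = [r for r in inventory
                if all(matches(r, f) for f in policy["filters"])]
    return [(a["type"], r["name"]) for r in selected for a in policy["actions"]]


print(plan(POLICY, DISKS))
```

The point of the sketch is only that a policy is data: the engine picks a resource type, keeps the resources that pass every filter, and applies each action to what is left.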
We can draw a quick parallel with the SQL language and its SELECT ... FROM ... WHERE syntax: the SELECT is the action we want to perform, FROM is the resource we want to target, and WHERE holds all our filtering conditions. Now let's talk a bit about numbers. Here is a summary of all the resources, actions and filters available for each cloud provider supported by Cloud Custodian. The project was launched for AWS in the first place, which is why there are a lot more settings for this cloud provider, but it roughly reflects the market shares between cloud providers. The main point here is to show you that the coverage is already substantial, so most of your compliance cases can be fulfilled with Cloud Custodian. Let's talk a bit about execution modes. Cloud Custodian is designed to be completely agnostic of where it runs, as long as you have a Python virtual environment. The most interesting thing here is that the project can also rely on several cloud provider services to set up more complex workflows. It's really easy to put it all together with Cloud Custodian, just by specifying a mode with several arguments in your policy. Cloud Custodian will automatically deploy multiple resources on the cloud provider to cover the need described in the policy. Depending on what is specified, the deploy action will provision triggers and serverless applications directly in the cloud. For AWS triggers, you can see we can use CloudWatch events coming from CloudTrail, or scheduled events, to trigger AWS Lambda functions. For Google Cloud Platform, Cloud Custodian will use Security Command Center, logging or Cloud Scheduler to trigger Google Cloud Functions. Finally, on Azure, the same workflow is applied using Event Grid, a scheduler and Azure Functions. So let's move on to filters. Let's talk a bit about filter types.
There are about three filter types you can use. Value filters are filters that will make an API call to the cloud platform to check a specific setting for an identified resource. You can also specify an event filter, which will only check incoming events and compare them with the value provided in the policy. And there are also specific filters to do more complex filtering. Let's see a little example that uses those three types of filters. So here we have a little policy. This policy will check and enforce that log file validation is enabled on a CloudTrail trail. At the least, what we want is to have a notification when the CloudTrail trail is updated to disable log file validation. So you can see here that the resource targeted is AWS CloudTrail. The mode is a CloudTrail mode, so we will react to an event, and the event we are watching is UpdateTrail. Then we can see a list of filters. The first filter here is a filter on a tag, so it is a value filter. Here we want to know if the tag trail to watch is set to true. After that there is a little specific filter. This filter is a specific filter on the trail, just to see if the trail is logging or not. Then there are two value filters. With the first one we want to know if the trail is multi-regional, and the second one is to ensure that this CloudTrail trail logs all actions to a specific S3 bucket name. The last filter is an event filter. As you can see, we use JMESPath syntax to reach a specific setting in the CloudTrail API event and to check its value. Okay, so now let's do a more complex example. Let's extend this a bit. In some cases you can't apply remediation on a resource at creation. That's because the resource takes some time to become available and cannot be modified until it is available. What we can do here is create a mark-for-op workflow for delayed actions.
First we need a policy to detect the creation of a resource and check whether the resource is wrongly configured. The policy will apply a mark-for-op tag on the resource with an operation to perform, and also a minimum time period before execution. Then we wait for the resource to be available, using a periodic policy which contains a filter on the resource state. Once the resource is available, the periodic policy will apply a remediation. To finish the workflow we have to remove the tag from the remediated resource. And then finally you can go and have a coffee, because any resource that is wrongly configured will be remediated. Let's see a little example of this workflow. So here I made a little use case for you. The need we are trying to answer here is that all our RDS DB clusters must have the continuous backup retention period set to its maximum, which is 35 days. So here you have the first policy. The first policy is a CloudTrail policy reacting to two separate events, CreateDBCluster and ModifyDBCluster, because we also want to apply the remediation when someone updates this setting. It has only one filter, which is a value filter on the backup retention period with an operator and a value. It says that we select only DB clusters that have a backup retention period of less than 35 days. Here we have two actions. The first action is a mark-for-op action. So we put a tag named backup retention compliance on the RDS cluster, along with the operation to apply, which is the retention change. And as I said, we have to set up a time limit until the remediation is applicable. We want to do it as soon as it can be applied. There is also a second action, which is notify. We also want to notify that Cloud Custodian has found a non-compliant RDS cluster. This will help you to deal afterwards with things like CI/CD deployments that may have wrongly configured resources. Here is the second policy. This policy is not triggered by CloudTrail.
It's a periodic policy triggered every two minutes. The main goal of this policy is to find DB clusters that are marked for op with the tag we saw before and that are available. So here we need three different filters: a specific filter, which is a marked-for-op filter, a value filter on the backup retention period, and also a value filter on the status of the DB cluster. Here the only action is to enforce a backup retention period of 35 days. And to finish this workflow, as I said before, we have another periodic policy that will just filter on resources that are now compliant but weren't before. Since they weren't compliant before, they have a tag, and the only action of this policy is to remove the tag to prevent unwanted actions on compliant resources. So another big objective of cloud compliance and auditability is to bring GitOps into the game. With this library you can also have compliance driven by your favorite VCS. The main goal here is to reassure auditors that conformity is applied, since your code repository is the source of truth and also the true state of your platform. There are also different business and operational needs that you can answer. With Cloud Custodian you can achieve FinOps goals by stopping development instances at night and starting them again in the morning. You can also use this library to detect malicious actions made from within the cloud platform and send an alert. As you can see, there are multiple integrations available, using Slack, Splunk or Datadog. The main goal here for compliance is to help you leverage Cloud Custodian to bake your own compliance rules into your cloud platform. Cloud Custodian is open source by nature. If you identify a specific need, it's up to you to develop a new feature and give it back to the community. The first thing you have to do is fork the GitHub repository, then develop your feature, write and pass the tests, and open a pull request.
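To make the three-policy mark-for-op workflow for RDS clusters described earlier more concrete, here is a small simulation in plain Python. It is only a sketch of the logic, not Cloud Custodian code: the tag name, the tag value and the dictionary layout of a cluster are assumptions, and the real library encodes the operation and its date in the tag in its own format.

```python
# Simulation of the mark-for-op workflow; names and data shapes are assumed.
MAX_RETENTION_DAYS = 35
TAG = "backup-retention-compliance"  # hypothetical tag key


def on_create_or_modify(cluster):
    """Policy 1 (event-driven): mark a non-compliant cluster for a later fix."""
    if cluster["BackupRetentionPeriod"] < MAX_RETENTION_DAYS:
        cluster["Tags"][TAG] = "modify-retention"
        return True
    return False


def periodic_remediate(cluster):
    """Policy 2 (periodic): fix marked clusters, but only once available."""
    marked = TAG in cluster["Tags"]
    available = cluster["Status"] == "available"
    too_short = cluster["BackupRetentionPeriod"] < MAX_RETENTION_DAYS
    if marked and available and too_short:
        cluster["BackupRetentionPeriod"] = MAX_RETENTION_DAYS
        return True
    return False


def periodic_cleanup(cluster):
    """Policy 3 (periodic): remove the mark from clusters now compliant."""
    compliant = cluster["BackupRetentionPeriod"] >= MAX_RETENTION_DAYS
    if TAG in cluster["Tags"] and compliant:
        del cluster["Tags"][TAG]
        return True
    return False


# A cluster is created with a 7-day retention and is not yet modifiable.
cluster = {"Status": "creating", "BackupRetentionPeriod": 7, "Tags": {}}
marked = on_create_or_modify(cluster)        # the creation event fires
too_early = periodic_remediate(cluster)      # still creating: nothing happens
cluster["Status"] = "available"              # the platform finishes creation
remediated = periodic_remediate(cluster)     # next periodic run applies the fix
cleaned = periodic_cleanup(cluster)          # the mark is removed afterwards
print(cluster)
```

Splitting the work this way means the event-driven policy never has to wait: it only records the intent, and the periodic policies pick it up when the resource can actually be modified.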
A little story here: with Ismael, we figured out that there was no start action on Google Cloud Platform SQL instances. Let me show you how we managed to develop this feature and add it to Cloud Custodian. The library is written in Python, so if you develop a bit in this language, it's really easy to understand how to use this library and how to add features. First we made good use of Cloud Custodian's prepared classes, functions and registries to add the new action here. Then we also developed the related test. We can find it here. The test is really easy to understand. We have to create a policy. This policy is the JSON translation of what we can write using YAML. Then we run the policy and we assert that the number of resources identified is one. All the information needed for development is really well explained in the developer manual, even the test stubs that you can see here. Here are the records of the stubs. They are the responses of API calls made to the Google Cloud Platform. So the first thing you have to do is use a function named record_flight_data that will output the API call results. And then in your test you use replay_flight_data to only replay the responses of the API calls, to mock the actions made on the platform by your policy. These stubs are really useful because you have to record them once and then the recorded API call results are replayed on each run of the test suite. Now I leave you in good hands with Ismael, who will show you a little live demo of what we can do with Cloud Custodian on GCP.

Yes. Thank you, Tanguy, for your Cloud Custodian presentation. So right now we are going to show you how to use it with a concrete case on the Google Cloud Platform, which is to forbid the creation of compute instances with a public IP. To illustrate the case, we will use Cloud Shell, which is a dev environment as a service provisioned with a certain number of tools.
And we have also installed Cloud Custodian in it to deploy our policy. Before diving deep into this policy, I want to show you that we don't have any cloud functions deployed. So in this case, yes, zero items, nor any Cloud Scheduler jobs. So why do I want to do that? To show you that it is indeed Cloud Custodian that is in charge of designing the architecture and deploying the code on the cloud platform. Before doing the Cloud Custodian part, we have to mention that we need to use a certain number of APIs on the cloud platform. So we used Terraform in order to activate some APIs and also to create an identity for Cloud Custodian to use, so that on our side we can apply the least privilege principle to the workloads that Cloud Custodian will act with. So our policy file, called forbid public IP on compute instances, is made of four different policies. Before going into the explanation, I want to deploy it, because it takes a certain amount of time to provision the resources. So let's do that. We have already installed a virtual environment with Cloud Custodian. So, forbid public IP. Okay, it's running. What happens under the hood is that Cloud Custodian will provision different resources: cloud functions, but also scheduler jobs, and we'll see why. So what the chaining of our policies does is the following. The first policy listens to the audit log for the event of insertion of a compute instance. Basically it says: for every creation of a compute instance, do this action. We don't have any filtering because we want to apply the action of setting labels on all compute instances. Those labels are state, and next policy set to check public IP. The second policy this time also acts on GCP instances, but is of type GCP periodic, meaning that every minute it will apply this specific filter, looking for instances with the label next policy and the value check public IP, but also exposing a public IP.
This time the action is a mark-for-op, which is syntactic sugar for applying an already formatted label that we can use in the next policy. We also notify a dedicated email address through Pub/Sub, but it could really be whatever you want. The third policy still acts on GCP instances and is still periodic, so we have a scheduler that triggers a workload every minute, and we filter on instances marked for op with stop. You will notice that we also have stop here. What is happening is that we are chaining those two policies with one another, and this time we effectively apply the stop action. A final policy is here to handle a specific case where my GCP instance is called unstoppable followed by five digits. In this case we consider that we want to start the instance again and remove the labels that were chaining the policies, in order to avoid falling into an infinite loop. This time we are not periodic, we are of type GCP audit, meaning that we are listening for dedicated stop events in order to apply the filters and finally the actions. So let's see it in action. In the window above we have the compute instances list command, which displays the different instances and their state as they are being created. This time we want to show you that the functions have been provisioned. So, functions list. Okay, and we have four different functions, each with a different trigger. We have HTTP triggers, but also event triggers. HTTP triggers are for the periodic policies, whereas the event triggers are for the policies that listen to the audit log, so for the GCP audit type. We also have two different jobs, jobs list, corresponding to the HTTP triggers that we just mentioned. So we have the custodian auto check public IP job, which corresponds to the policy that checks whether each instance has a public IP, and stop instances with public IP, which corresponds to the third policy, the one with marked-for-op that is triggered every minute.
So let's create an instance called toto, okay? It will provision a compute instance on my cloud platform, exposing a public IP by default. So if we follow the workflow that we previously described, we indeed have this new label applied on this instance: next policy, check public IP, so that it can be filtered by the next policy. Here we go. That one applies the mark-for-op. So we have this specific label related to Cloud Custodian: resource policy, stop. And it serves the filtering of the third policy, to apply the stop operation. So if we run it manually, we can see, okay, this one, it triggers a new function that looks for instances with the marked-for-op. And we see that toto is indeed stopping now, because it was marked for op with the operation stop. And now if we create an unstoppable instance, followed by the digits, this time it will follow the same workflow: it will start, it will be tagged with the labels, and then it will be stopped because it exposes a public IP, but because it is called unstoppable 12345, it will be restarted. We will also see the different labels that are necessary to apply the stopping policy disappear. So to avoid falling into an infinite loop, we remove the labels. So we are going to manually run our jobs. It seems that the check, okay, was already triggered, and we have the mark-for-op for the stop operation appearing. And this time, when we trigger the stop, this instance will be filtered and stopped. Okay, we see that. But because we have this name, the final policy will be called and it will start the instance again. So it takes a certain amount of time, but not so much. Let's see it in action. So right now it's stopping, it's terminated. And because the fourth policy is listening for the stop event and filtering on this specific kind of name, it is now restarting our instance, which is now running.
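The chaining of the four demo policies can be summarized with a small simulation. This is a sketch in plain Python under stated assumptions, not the demo's policy file: the label keys, the instance dictionary and the exact naming rule (here "unstoppable", an optional hyphen, then five digits) are hypothetical, and the stop event is modeled as a direct function call instead of an audit log entry.

```python
import re

# Assumed naming rule for the special case handled by the fourth policy.
UNSTOPPABLE = re.compile(r"unstoppable-?\d{5}")
CHAIN_LABELS = ("next-policy", "marked-for-op")  # hypothetical label keys


def on_insert(inst):
    """Policy 1 (audit log, insert event): label every new instance."""
    inst["labels"]["state"] = "devfest"
    inst["labels"]["next-policy"] = "check-public-ip"


def check_public_ip(inst):
    """Policy 2 (periodic): mark labeled instances that expose a public IP."""
    labeled = inst["labels"].get("next-policy") == "check-public-ip"
    if labeled and inst["public_ip"]:
        inst["labels"]["marked-for-op"] = "stop"


def stop_marked(inst):
    """Policy 3 (periodic): stop running instances marked for stop."""
    marked = inst["labels"].get("marked-for-op") == "stop"
    if marked and inst["status"] == "RUNNING":
        inst["status"] = "TERMINATED"
        on_stop(inst)  # stands in for the stop event in the audit log


def on_stop(inst):
    """Policy 4 (audit log, stop event): restart 'unstoppable' instances."""
    if UNSTOPPABLE.fullmatch(inst["name"]):
        inst["status"] = "RUNNING"
        for key in CHAIN_LABELS:  # dropping these labels breaks the loop
            inst["labels"].pop(key, None)


def create(name):
    inst = {"name": name, "status": "RUNNING", "public_ip": True, "labels": {}}
    on_insert(inst)
    return inst


toto = create("toto")
unstoppable = create("unstoppable-12345")
for _ in range(3):  # three scheduler ticks of the periodic policies
    for inst in (toto, unstoppable):
        check_public_ip(inst)
        stop_marked(inst)

print(toto["status"], unstoppable["status"], unstoppable["labels"])
```

Running more scheduler ticks changes nothing for the restarted instance, because the labels that policies 2 and 3 filter on are gone; that is the infinite-loop guard mentioned in the demo.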
And because we don't want to fall into an infinite loop, we should also see that the labels are removed in the end. It may take a certain amount of time, but in the end it will finish the job. In the meantime, I want to show you a dedicated policy for a specific use case of Cloud Custodian. When we want to batch operations on the cloud platform, we usually go through scripts, but Cloud Custodian, through filtering and actions, is really able to batch your actions. For instance, here I have a policy that filters on GCP instances labeled devfest in order to delete them. Right now it will serve us to clean those specific instances from our cloud platform. Before applying this policy, we can see that the extra labels that we were mentioning were indeed removed from the unstoppable instance. So we avoid the infinite loop. And now let's apply this dedicated policy to remove those instances labeled with state equal to devfest. Both instances are in this case. So if I run this policy, we can see that the filter is indeed counting two instances, and it is stopping the instances and then removing them. That concludes our demo, and I hand over to Tanguy now. Tanguy, over to you.

Thank you very much, Ismael. I hope this demo has shown you all that is possible using Cloud Custodian, with a unified language across cloud providers. I will now present a bit where we are from. Wescale, who are we? Born in 2015, Wescale is a company that has built a community of 50 experts who help you become cloud native. We advise and help our clients to think, build and master their own cloud native architecture, always in line with their level of maturity in the cloud. We are currently a CNCF service provider and also HashiCorp, AWS and GCP partners. We are actively hiring in France and remotely. Wescale also has a training program for cloud enthusiasts.
All training journeys about GCP, AWS, Kubernetes and HashiCorp technologies like Vault and Terraform will help you to master cloud technologies and DevOps methodology. Thanks a lot for your attention. If you have any questions, feel free to contact us. We will be more than happy to answer.

Ismael Hommani

Cloud Native Developer @ Wescale

Tanguy Combe

Cloud Builder @ Wescale
