Conf42 Cloud Native 2022 - Online

Deep-dive into Open Policy Agent + Conftest + GateKeeper: Kubernetes Policy in Action


Abstract

This talk will walk you through applying centralized policy for Kubernetes deployments leveraging Open Policy Agent, Conftest and Gatekeeper - both from the developer’s and DevOps / operations perspectives.

Open Policy Agent has been an excellent and complementary project for ensuring centralized policy management for your Kubernetes deployments.

In this session, we will do a deep dive into Open Policy Agent, Conftest, and Gatekeeper, three projects that really enable you to apply granular policies and controls for highly distributed microservices deployments.

This talk will show real-life use cases of how to use those technologies in production in order to configure and enforce a centralized policy for Kubernetes, both from the developer and operations (DevOps) perspectives.

Summary

  • Noaa Barki: Today we are going to talk about how to build centralized policy management at scale. What are policies? Why do we even care about policy enforcement? We will see some very cool tools that I personally like.
  • We always want to make sure that we set the concurrency policy to either forbid or replace, never to allow. With allow, when a cron job fails, the failed run is never replaced by the next one, so you end up with a lot of pods flooding your cluster.
  • Nobody is immune to Kubernetes misconfiguration. How can you make sure it won't happen to you? The answer is policy enforcement: automatically validating your resources in the CI, using tools that can also be used as a local testing library.
  • OPA is a general-purpose policy engine. You can write all your policies in it and execute them against a specific input. Conftest allows us to write tests for our Kubernetes resources, with OPA as the policy engine under the hood.
  • As your organization grows, you will probably want to modify or change some of your policies. This can be a very challenging task when your policies are written both as Rego files and as constraint templates for Gatekeeper. Argo CD makes life so much easier, and it's specifically designed to make continuous deployment in Kubernetes more efficient.
  • Argo CD does this by simply reversing the flow: instead of the CI pushing changes to the cluster, Argo pulls those changes from a Git repository. It's recommended to keep the application source code in one repository and the application configuration in another.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everybody, thank you so much for coming today. I'm super, super excited to be here. Today we are going to talk about one of my favorite topics, which is how to build centralized policy management at scale. But I believe that, at least for some of you, this is the first time that we meet. So hello, my name is Noaa Barki. I've been a developer advocate at Datree and a full stack developer for about six, seven years. I'm also a tech writer and one of the leaders of the GitHub Israel community, which is the largest GitHub community in the whole universe. And I work at an amazing company called Datree, where we help developers and DevOps engineers prevent Kubernetes misconfigurations from reaching production. But enough about me, because today we are going to talk about how to build centralized policy management at scale. We are going to specifically talk about policies: what are they? Why do we even care about policy enforcement? We are going to see some very cool tools that I personally like, like OPA, Gatekeeper, Conftest, Argo CD and my very own Datree open source project. So without further ado, let's get started.
Ladies and gentlemen, close your eyes. Picture this. You had a long week, it's Friday now, and you're in your bed, dreaming peacefully about a warm, wonderful, comfortable weekend. Unfortunately, your weekend came a little sooner than expected, because you woke up to the sound of your phone and 15 missed calls from work. Oh no. You wake up, you rub your eyes, you go to the Slack channel and, oh, apparently somebody forgot to add a memory limit to one of the deployments, which caused a memory leak in one of the containers, which caused the whole Kubernetes node to run out of memory. Wait a second. Oh, did I? Oh no, I think, I think that somebody is you. Now of course I'm kidding, of course I'm kidding. This is not you. I'm sure it will never happen to you. Because first of all, you're very responsible developers. I always forget that I added this slide anyway. I'm sure it will never happen to you. You're very responsible developers, you're here with me, that's for sure. And I know that you know how to use Kubernetes, and you definitely know to never forget those memory limits. So let me ask you a different question. Who here is ready to play a game? The game goes like this: I'm going to show you two Kubernetes manifests, and each time I'm going to point to a specific key which is configured differently in each manifest. You will have to look very carefully and tell me which one you would deploy, left or right. Let's get started. Okay, so this is a cron job configuration. Pay attention to the concurrency policy. Which one would you deploy, left or right? I'll give you ten seconds. Ten, nine. Okay, I'll stop. And the right answer is: right. You see, we always want to make sure that we set the concurrency policy to either Forbid or Replace, never to Allow. The reason why is that whenever a cron job fails and the concurrency policy is set to Allow, the failed run is never replaced by the next one, so you'll end up with a lot of pods that just swamp your cluster. And this is actually what happened to Target. They had one failing cron job that created thousands of pods that were constantly restarting. Not only did it take their cluster down, it also cost them a lot of money, because their cluster accumulated a few hundred CPUs during that time. A very sad story.
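Just to make that concrete, here is a minimal sketch of what the good configuration looks like. The job name and image are hypothetical, just for illustration; the important part is that concurrencyPolicy sits at the top level of the CronJob spec and is set to Forbid or Replace:

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: nightly-report            # hypothetical job name
    spec:
      schedule: "0 2 * * *"
      concurrencyPolicy: Forbid       # or Replace -- never Allow, so failed runs can't pile up
      jobTemplate:
        spec:
          template:
            spec:
              restartPolicy: OnFailure
              containers:
                - name: report
                  image: example/report:1.0   # hypothetical image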
Let's move forward to the next question. This is another cron job configuration, and once again, pay attention to the concurrency policy. Which one would you deploy, left or right? Dun dun dun. And the right answer is: right again. You see, here on the left side the concurrency policy isn't part of the cron job spec, so we end up with a cron job without any limits. And this is actually what happened to Zalando, which is a leading online fashion company with over 6,000 employees. It's a big company, guys. They actually used the correct configuration, however they placed it incorrectly in their YAML, which is very sad. And this immediately took their API server down. So let's move forward to the next question. This is a pretty simple pod. Which one would you deploy? Left? All right, I'm sure that you know this one, so this is going to be short. The right answer is, of course, right again: we want to make sure that we never forget the memory limit. And this is actually what happened to Blue Matador. Back then they were a small startup company with a monitoring product. Their pods hosted a Sumo Logic third-party application whose containers were memory hogs. And because they forgot to set the memory limit, basically nothing stopped those pods from taking up all the memory in the node, which eventually caused out-of-memory issues. Very sad. But you see, Target, Zalando, Blue Matador, they aren't the only ones who suffered from these pretty innocent mistakes. I'm talking about big companies. I'm talking about Google, Spotify, Airbnb, Datadog, Toyota, Tesla, who's not here. I'm talking about a lot of other companies who shared their own Kubernetes failure stories. Trust me, nobody is immune to Kubernetes misconfiguration. So first of all, I highly recommend everybody to read about other companies' failure stories. Not only will you learn so much about how Kubernetes works, what to do, what not to do and what the best practices are, it will also make you ask the ultimate question, which is: how can I make sure it won't happen to me? How can I make sure that I won't become one of those failure stories? And this question is very important, because it forces you to think about what the workload requirements in your organization are, and what stability and security you want to achieve for your cluster. How can you make sure that it won't happen to you? And the answer is policy enforcement. I know this because before we launched Datree for the first time, we wanted to learn as much as possible about the common misconfigurations and the most common pitfalls in the Kubernetes ecosystem. And what we did was read more than 100 Kubernetes failure stories. Not only did we learn that policy enforcement is the solution to prevent misconfigurations from reaching production, it also turned out to be a key solution to improve the DevOps culture in your organization. So great, how do we start? First of all, we want to define the policies and the rules that we want to enforce in our organization. Maybe we want to make sure that every container has a memory limit. Maybe we want to make sure that all the containers have a readiness or liveness probe configured. Maybe it's about cron jobs, that every cron job has a deadline or that the schedule is valid. It doesn't really matter; the policies that you define really depend on your workload requirements.
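For example, policies like "every container has a memory limit" and "every container has probes configured" boil down to a few fields in the manifest. Here is a minimal sketch, with a hypothetical service name, image and probe endpoints:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-service                # hypothetical name
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: my-service
      template:
        metadata:
          labels:
            app: my-service
        spec:
          containers:
            - name: my-service
              image: example/my-service:1.0   # hypothetical image
              resources:
                requests:
                  memory: "128Mi"
                limits:
                  memory: "256Mi"     # never forget the memory limit
              livenessProbe:
                httpGet:
                  path: /healthz
                  port: 8080
              readinessProbe:
                httpGet:
                  path: /ready
                  port: 8080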
And once you have a set of policies that you want to enforce, the real question is: how will you integrate and distribute those policies in your organization, in your pipeline? How will you make sure that you and your teammates follow these policies? So the way I see it... you know what, let me say it in a different way. I believe in two things. I believe in shift left and I believe in GitOps. I believe that the sooner you find a mistake, the less likely it is to take your production down. And I believe that every Kubernetes resource, every config file, should be handled exactly the same as your source code in the CI, exactly like your source code. So with this mindset, the way I see it, we should automatically validate our resources on every code change in the CI. Furthermore, integrating and validating your resources in the CI using tools that can also be used as a local testing library, I'll pause, as a local testing library, can extremely help you nurture the DevOps culture in your organization. The reason why is that a local testing library is already part of the developers' own practices. Developers know how to use a local testing library. They are used to it. They are used to writing code, testing it locally on their machine and then submitting a pull request. And they expect those tests, at least those tests, to be run again in the CI. This is simply part of how developers work, and allowing the developers to do the same with infrastructure as code, with Kubernetes resources, will allow the DevOps team to delegate more responsibilities to the developers, and therefore to liberate DevOps from the constant need to fence every Kubernetes resource from every possible misconfiguration. But I hear you. I hear you back there. I can totally hear you. Here in Israel we have a thing, it's called "let's talk dugri", which means: show me the real business. Come on, Noaa. So let me show you the real business. Let's talk dugri. Let's talk about how you can start using policy enforcement in your organization today. And the first tool that I want to talk about is OPA, or O-P-A; I'm going to call it OPA. So OPA is a general-purpose policy engine. You can write all your policies in it and execute it against a specific input, to check whether that input violates any one of those policies. You can practically think about OPA as a super policy engine; I like to think about it this way. Now, the main idea behind OPA is actually to decouple all the policy decision-making logic from the policy enforcement usage. You see, suppose you have a microservices architecture and one of the microservices receives an API request. You probably need to make some decisions in order to allow or disallow this request, right? Now, these decisions are based on rules, on criteria that you want these requests to meet. These rules are called policies. And what OPA gives you is the ability to decouple and offload all those policies into one dedicated, agnostic service, and therefore to give your administrators and your ops team better control over policy enforcement and over the service at runtime. So let's talk about how you can use OPA. There are actually two ways. If your services are written in Go, you can use OPA as an internal package and embed it within your code. The second option is to run OPA as a host-level daemon and query it with an HTTP request: you send the input as JSON, and OPA evaluates it and sends you the response.
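As a minimal sketch of that second option (the policy package name "myapp.authz" and the input fields here are hypothetical, just for illustration):

    # start OPA as a host-level daemon (it listens on :8181 by default) with your Rego policies
    opa run --server ./policy/

    # query it over HTTP, sending the input as JSON
    curl -s -X POST http://localhost:8181/v1/data/myapp/authz/allow \
      -H 'Content-Type: application/json' \
      -d '{"input": {"user": "alice", "method": "GET", "path": "/payments"}}'
    # OPA evaluates the input against the policy and answers with something like {"result": true}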
Now, this is important: by definition, all the policies should be written in a special language called Rego, which is the official policy language of OPA. It's very easy to learn and, fun fact, it's actually inspired by Datalog. And basically that's it. From this moment on, OPA is your centralized policy service. You write all the policies in OPA, and all your microservices can query OPA to ask what the policy decision is. OPA evaluates the policies against the input and returns a response: valid or not. Now, OPA is great. OPA is wonderful. I really love OPA. But when it comes to Kubernetes, not so much, because it still requires a lot of heavy lifting work. And I do enjoy CrossFit training and I do enjoy heavy lifting, but not when it comes to my Kubernetes cluster. This is where Conftest comes into the picture. Conftest is an open source utility that allows us to write tests against any structured file. And when I say any, I mean any: I'm talking about JSON, XML, Dockerfiles and, of course, YAML. I'm talking about our Kubernetes resources. Conftest allows us to write tests for our Kubernetes resources, with OPA as the policy engine under the hood. Conftest is specifically designed to be run in the CI or, just the way I like it, as a local testing library. Now let's talk about how you can use Conftest. First of all, you need to install Conftest on your local machine. Then you need to write all your policies in Rego. Here, for example, you can see two policies: one that verifies that I don't run with root privileges, and a second that verifies that I always use the app label for my deployments. After I wrote all the policies, I put them in a folder; by default the folder name should be "policy". Then I simply execute conftest test with the path of all the files that I want Conftest to test. Conftest will take all the policies and all the files that exist in that path, send everything to OPA, and output the result. Now, I said that you can use Conftest in the CI. Here, for example, I used a GitHub Action, and what I want you to pay attention to is the green part over here. As you can see, I pull Conftest, and this is actually an awesome feature of Conftest: it allows us to push and pull our policies to and from a Docker registry, which is kind of nice. So I pull all my policies and I simply run conftest test with the path of all my resources. Very easy, very fun. I love Conftest, it's super friendly and makes life so much easier. I highly recommend everybody to at least try it.
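To give you a feel for it, here is a minimal sketch of the two policies I just described, written in Conftest's default package ("main") using its deny rule convention; the exact field paths are just one way to express these checks:

    # policy/deployment.rego
    package main

    # verify we don't run with root privileges
    deny[msg] {
      input.kind == "Deployment"
      not input.spec.template.spec.securityContext.runAsNonRoot
      msg := "containers must not run as root: set securityContext.runAsNonRoot to true"
    }

    # verify we always use the app label for deployments
    deny[msg] {
      input.kind == "Deployment"
      not input.spec.template.metadata.labels.app
      msg := "deployments must carry an app label"
    }

With that file under policy/, running conftest test on your manifests (for example, conftest test manifests/) prints every violation it finds.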
So with Conftest configured in the CI, we can safely go back to sleep now, because we validate our resources on every code change. Let me just put myself here... oh, this is more comfortable. Here, we can safely go back to sleep, because we validate our resources on every code change, and this means that our cluster is truly protected, right? Wrong. This is not true, because what about those criminal users who can, any time they want during the day, simply type kubectl apply and do whatever they want with our clusters? What about those criminal users? In the year 2020, Google, Styra, Microsoft and Red Hat asked themselves that exact question, and their answer was: this is not enough. So this is the story of how they created Gatekeeper. Gatekeeper is actually the bridge between your Kubernetes API server and OPA, and it allows us to validate our Kubernetes resources natively in our cluster. But before we talk about what exactly Gatekeeper is and what it does under the hood, I want us to talk about Kubernetes admission controls, and specifically Kubernetes admission webhooks. You see, when an API request comes into the Kubernetes API server, it passes through a series of steps. First of all, it's authenticated and authorized. Then it's passed to the admission controllers, which basically trigger a list of webhooks that can mutate and validate your request; that's it, mutate and validate. And then, and only then, when your request is valid, does it move forward to be persisted to etcd and executed. Now, Gatekeeper is a customizable admission webhook. So whenever a resource in the cluster is created, updated or deleted, the API server calls the admission controllers, which trigger the Gatekeeper webhook. Gatekeeper takes the request, along with the resource and the predefined policies, and sends everything to OPA. OPA evaluates everything, and if OPA finds any misconfiguration, any violation, then Gatekeeper will reject the request and the user will receive an error. Now let's talk about how you can use Gatekeeper. First of all, you need to install Gatekeeper on your cluster. Then you need to write all your policies. But this time, since Gatekeeper is installed on the cluster, you don't write your policies in Rego files, you write them in a constraint template CRD. As you can see here, the constraint template is basically the policy that you want to enforce; you can see the actual Rego in the green part over here. This is an example of a policy that receives a required label and checks whether a resource actually includes that label. Then, after you write all your policies, you still need to tell Gatekeeper how to use each policy and which kinds you want it to be applied to, and you do that with a constraint. As you can see here, I created a constraint that takes the required-labels constraint template, which is the policy that we just wrote, and applies it to every namespace, with owner as the required label. So in conclusion, this policy ensures that every namespace in the cluster has the owner label.
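Here is a minimal sketch of that pair, closely following Gatekeeper's canonical required-labels example; the exact API versions and names may differ slightly between Gatekeeper releases:

    apiVersion: templates.gatekeeper.sh/v1beta1
    kind: ConstraintTemplate
    metadata:
      name: k8srequiredlabels
    spec:
      crd:
        spec:
          names:
            kind: K8sRequiredLabels
          validation:
            openAPIV3Schema:
              type: object
              properties:
                labels:
                  type: array
                  items:
                    type: string
      targets:
        - target: admission.k8s.gatekeeper.sh
          rego: |
            package k8srequiredlabels

            violation[{"msg": msg}] {
              provided := {label | input.review.object.metadata.labels[label]}
              required := {label | label := input.parameters.labels[_]}
              missing := required - provided
              count(missing) > 0
              msg := sprintf("you must provide labels: %v", [missing])
            }
    ---
    apiVersion: constraints.gatekeeper.sh/v1beta1
    kind: K8sRequiredLabels
    metadata:
      name: ns-must-have-owner      # hypothetical constraint name
    spec:
      match:
        kinds:
          - apiGroups: [""]
            kinds: ["Namespace"]
      parameters:
        labels: ["owner"]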
So, with Conftest in the CI and Gatekeeper in the cluster, now we can go back to sleep safely, because we've gained some very powerful policy enforcement for our Kubernetes resources. But it does come with a price; there are a couple of challenges that I want us to talk about. First of all, as your organization grows, you will probably want to modify or change some of your policies. You will probably want to add a new policy, delete some of the policies or change one of them. And this can be a very challenging task when your policies are written both as Rego files and as constraint templates for Gatekeeper, especially if you have many Git repositories. Now, there are a couple of ways to face this challenge. You can use a dynamic YAML file with all your configuration that you download from S3. You can use etcd as your centralized state management and pull your configuration from it. There are many ways to face this challenge, but it is a challenge that you should take into consideration. Another challenge that I want to talk about is the Rego language itself. And I know this because I tried to implement some of Datree's policies in Rego. If your policies don't require any level of complexity and they are pretty straightforward, you shouldn't worry at all about Rego. But if your policies require some complexity, such as, I don't know, making sure that one of the keys is within a range of numbers, making sure that one of the keys uses a specific kind of unit, or comparing one of the keys to another key, maybe in a different file, then Rego might be a little difficult to use and a little difficult to implement. So my fair advice to you is to look for tools that already come with built-in policies, or at least look for policies that are already written in Rego that you can use and take inspiration from. But another way to face this challenge, if you think about it, would be to not use Gatekeeper at all, and instead guarantee that your Git repository is your single source of truth, and by that eliminate the need to fence or guard your cluster, because users won't be able to kubectl apply whatever they want and destroy our production. Some people might call it GitOps. Now, I know that there are many ways to practice GitOps, and I know that there are many tools out there, but the tool that I want to talk about today is Argo CD, because I really love Argo CD. It makes life so much easier and it's specifically designed to make continuous deployment in Kubernetes more efficient. But before we talk about Argo CD and why it's so magical, I want us to talk about how a CD workflow without Argo CD actually looks. So let's say that I have a Kubernetes cluster in production and I use Jenkins for CI/CD. You know the drill. Usually we have a developer that submits code to a repository upstream, maybe a new feature, maybe a hotfix, and it triggers the CI pipeline, which builds the code, tests the code, builds a new image of our application and pushes that image to a Docker registry. Then, in CD, we update the Kubernetes deployment resource, maybe some more resources, to use that new image of our application, and we apply those resources to our cluster. Now, this is all nice and easy, but there are a couple of challenges. First of all, in order for kubectl to fully function, we need to configure some access to our Kubernetes cluster. Additionally, we usually use AWS, Microsoft or Google; for example, I use EKS, so I also need to configure my credentials for the AWS account. So not only is this a configuration challenge, it's also a security challenge. Another challenge is that once Jenkins deploys that deployment, we don't have any visibility over the deployment status. We don't know if the deployment actually failed or not, and we need to manually check the logs in our cluster, which can sometimes be very difficult or very inconvenient. And Argo CD is made just for that, to make continuous deployment in Kubernetes more efficient. It does it by simply reversing the flow: instead of pushing the changes to the cluster, Argo pulls those changes from a Git repository. And the real magic about Argo CD is that it is actually part of the Kubernetes cluster itself; it's an extension of the Kubernetes cluster. So you don't need to provide any secrets or configure any credentials for Argo CD, which is really awesome. But let's talk about it in more detail.
So first of all, to use Argo CD you need to install it on your cluster, and then you need to configure it with a repository, so that Argo will monitor that repository, by default every three minutes. If Argo detects that new changes were submitted, i.e. the last commit hash is different now, then Argo will clone the repository and pull those changes into the cluster. Now, you're probably wondering what will happen if we manually kubectl apply something to our cluster. Since Argo CD is installed on the cluster, in this case the Argo controller will detect that the cluster is out of sync, and Argo will override those changes with the state that exists in the repository, which makes the repository our single source of truth.
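Configuring Argo CD with a repository essentially means creating an Application resource. A minimal sketch, with a hypothetical repository URL, path and app name; the automated selfHeal option is what makes Argo override manual changes:

    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: my-app                    # hypothetical application name
      namespace: argocd
    spec:
      project: default
      source:
        repoURL: https://github.com/example/gitops-config.git   # hypothetical GitOps repository
        targetRevision: main
        path: k8s
      destination:
        server: https://kubernetes.default.svc
        namespace: my-app
      syncPolicy:
        automated:
          prune: true
          selfHeal: true              # revert manual kubectl changes so the repo stays the single source of truth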
Now, a quick note about repositories, because it's very important and I mention it a lot: it's been established as a best practice in GitOps, and specifically with Argo CD, to separate the application source code and the application configuration into different repositories, and to have one repository for the application source code and a different repository for the application configuration. The reason why is that you usually have more than one deployment file. You usually have services and ingress files and deployments and secrets and config maps, and a lot of other config file types. You usually manage them across different environments with tools like Helm, Kustomize, Bazel, whatever. So you have a lot of config files to manage, and once you change one of these files, you don't want to trigger the application CI pipeline, because nothing really changed in the application source code. This is why it's highly recommended to separate the two and have one repository for the application source code and another repository for the application configuration, which is usually called the GitOps repository. So now let's talk about how a CD workflow looks with Argo CD and a separate GitOps repository. Once again, we have a developer that submits new code to a repository upstream, which triggers the CI pipeline, which builds the code, tests the code, creates a new image of the application and pushes that image to a Docker registry. Then the CI updates a deployment to use that new image of the application, but in a separate repository, a GitOps repository that Argo monitors. And when Argo detects that new changes were made, Argo will pull those changes and apply them to the cluster. So if you think about it, using Conftest, for instance, in the CI of your GitOps repository will give you very powerful policy enforcement, because now you can guarantee that your GitOps repository is your single source of truth.
Now, the good news is that you have very powerful policy enforcement. The bad news is that this is only the beginning, because once you have policy enforcement, your organization is a continuously living, breathing cell, and policy management and policy enforcement should continuously evolve and be continuously updated according to your organization's needs. And this is where centralized policy management comes into the picture. To build centralized policy management, first of all you need to make sure that you have the right environment to dynamically adjust your policies. It's not only about implementing the policies; you also need to make sure that you can update your policies, reconfigure them and control them. And Git isn't the best solution here, because Git won't provide any of that. Let's take permissions, for instance: how will you use Git to control who can delete or create a policy? Another thing that I wanted to talk about is that you also need to make sure that your policies are actually effective. The thing about policies in Kubernetes is that they are sort of like a contract between application owners, cluster admins and security stakeholders. So in order for the policies to be truly effective, they not only need to work and really enforce what you want them to enforce, they also need to be communicated properly to everybody in the organization. To make sure that your policies are actually being communicated properly, you need to ensure that people actually know what to do when one of the policies fails. You need to know which policy fails the most. You need to ask yourself: how will I delegate the knowledge? How will I provide guidelines to my people so they can actually fix the policies when they fail? You can use email for that, and we saw a lot of companies do it, but obviously this doesn't scale, because I'm a developer, I code for a living, I'm a feature machine, I don't know infrastructure, I don't know Kubernetes, and honestly I can't remember to put the memory limit or to use that version instead of that version. Misconfiguration can easily happen here. So you need to be able to constantly and continuously review, monitor and control your policies. Which policies fail the most? Which policies are actually being used in practice? How can you make sure that you make progress and get the improvement that you want with your policies?
And the last tool that I wanted to talk about is actually my very own Datree CLI open source project, which actually combines everything that we just talked about. Datree, much like Conftest, allows us to validate and scan all our Kubernetes resources. But unlike Conftest, Datree already comes with built-in rules and policies for Kubernetes and Argo CD. Datree is specifically designed to be run in the CI, or as a local testing library, or as a pre-commit hook. The way the CLI works is that for every resource and for every file that exists on a given path, Datree runs automatic checks to see if it violates any one of the policies and rules, and for every violation that it finds, for every potential misconfiguration, Datree displays a full, detailed output of the violation and guidelines on how to actually fix the failure. Under the hood, every one of these automatic checks includes three steps. The first is to ensure that the file is actually a valid YAML file. The second is to check that the file is a valid Kubernetes or Argo CD resource. And the last step is the policy check, to verify that the resource follows the best practices: the rules that you customized for your organization and the ones Datree provides out of the box. To use Datree, first of all you need to install it on your local machine, and then simply run datree test with the path of all the files that you want to test. And that's it. It's free, it's open source, and I highly recommend everybody to review the code and submit a pull request. I will be there; can't wait to meet you.
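As a minimal sketch of that flow (the install one-liner follows Datree's docs at the time, and the manifest path is hypothetical):

    # install the Datree CLI on your local machine
    curl https://get.datree.io | /bin/bash

    # run the three automatic checks (YAML validation, Kubernetes schema validation, policy check)
    datree test ./kubernetes/*.yaml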
And last but not least, I really hope that this session inspired you to start thinking about what the policies in your organization are, what the best way for you to enforce them would be, and how you will start to build your own centralized policy management solution. Thank you very much.
...

Noaa Barki

Developer Advocate @ Datree



