Conf42 DevSecOps 2022 - Online

Integrating Cloud Native Security into the SRE culture

Abstract

When we integrate DevSecOps into our SRE culture, we often perceive security as an add-on, which decouples it from our existing processes. This talk will showcase how we can integrate open source security tools into our existing workflows, building on my experience of managing hundreds of tenant clusters.

Summary

  • Anais Urlichs: How can we integrate cloud native security tools and best practices into our site reliability engineering? Urlichs is the open source developer advocate at Aqua Security. The talk covers the overlap between site reliability engineering and cloud native security.
  • The SRE culture focuses on continuous improvement and embracing risk. The other thing is to analyze your failures and learn from them. And the last thing is autonomy: in a first SRE role, with cloud native experience but no production experience, the speaker was trusted with a lot of it.
  • What is DevSecOps? It's about integrating security into all of our business functions by empowering people and creating accountability. We want to incorporate security into every business function, whether that's administration or engineering. It's not about finger pointing, it's about having more productive outcomes.
  • SRE practices and security practices have a really tight overlap. When we define healthy services, we should also define secure services. The next thing is visibility. Automation is great for different aspects, but we have to be careful about how we automate SRE work and cloud native security.
  • The needs for the different tools, and the way that you integrate security tooling and practices, will differ by team. Different tools are installed differently. The number and type of integrations available can also be important. Finally, a global view of the security state of services is very important.
  • Trivy is an all-in-one security scanner. It does vulnerability scans of any container image it finds inside of your cluster. It also does exposed secret scans. If everything is a Kubernetes resource, you can then use the same processes across your stack.
  • You have to define deployment resources to ship an application, but the same doesn't hold true for security: nothing forces you to define how you scan your resources. So set up alerts and make your vulnerabilities scream at you. Optimize based on what works for your team. Every application will be deployed differently depending on your environment.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello there, and thank you so much for joining me here at Conf42 for Integrating Cloud Native Security into the SRE Culture. It's really great to be here, and I hope you're enjoying the conference so far. My name is Anais Urlichs. I'm the open source developer advocate at Aqua Security, and in this talk I want to speak about the overlap between site reliability engineering and cloud native security. How can both benefit from each other? How can we integrate cloud native security tools and best practices into our site reliability engineering?
Last year I was actually working as an SRE, in between positions as a developer advocate. I've also been a CNCF ambassador since 2021, and I have a YouTube channel, which you can see here, where I talk about cloud native tools and trends, with tutorials on how to set them up and how to use them most effectively. I also have a weekly DevOps newsletter where I share amazing content from across the space with the community. So if you are curious, do check those resources out. You can find all of the links on my Twitter. I also have a puppy. She's just five months old and she might make a little bit of noise in the background, so I apologize for that. She's making up for it by being super adorable, however.
Now, last year when I was working as an SRE, the SRE team at the startup was just about setting up all of the practices and the SRE culture within the company. This is a slide taken from the KubeCon talk that a manager of mine and I gave in 2021 at KubeCon North America. We basically talked about the different Kubernetes operators that we were using across our infrastructure. The infrastructure was very dynamic and very complex, let's say, because we had several superclusters across different regions of the world: Frankfurt, New York, London, you name it. And what we had there were compute racks, and those compute racks were our supercluster.
In those compute racks we have compute nodes, and you can see them depicted here in these drawings. These are all compute nodes, and the details will matter here. Basically, tenant clusters, customer clusters, would then be scheduled on those compute nodes. So we would have clusters within a large supercluster. As you can imagine, you need very advanced observability tools to get the necessary insights to understand what is going on where. For example, if a tenant cluster is stuck in some state and you need to repair it manually, you kind of need to kick it, you need to know how to identify that tenant cluster in the easiest and fastest way possible. The thing is, at the time we were talking a lot about our operator design and about our observability tools, but we weren't really talking about cloud native security.
So before we go into cloud native security, I want to talk a little bit more about the SRE culture. I mentioned that we were really focusing on establishing an SRE culture, and that focused on these different areas. The first one is continuous improvement. You don't want to keep the state of your services the same, even though things might be working. You want to continuously improve your setup and the tooling that you have in place to gain insights into your services. That's also related to embracing risk. Of course you want to keep the risk profile low and not deploy something that might accidentally bring down all your infrastructure. But ultimately it's a balance between both, because without embracing risk and taking new steps in advancing your tooling, you can't really improve the tooling itself, and it will slowly deteriorate.
Then the other thing is to analyze your failures and learn from them. We had lots of incidents in the early days, varying in all kinds of severity and degree. We were using tools at a scale at which they hadn't been used before.
So we encountered some real edge cases. A lot of times we had to sit down with the other companies, with the other projects, and really analyze what had happened, so that both our company and the projects could learn from it. And the last thing is autonomy. It was my first SRE role. I had experience in the cloud native space, but I didn't have experience working with production environments, and yet I received a lot of autonomy. I think it was really beneficial to have that trust and focus within the team. So that's ultimately what the SRE culture is about to me. I know it's different across different teams, and you will have different implementations and similar, but this is kind of what you can think of when you think about the SRE culture.
What is DevSecOps? Usually I ask people what DevSecOps is, but then people are really shy, and really, this is a conference about DevSecOps, so I think everybody has kind of an idea of what it is. So just think to yourself: okay, this is what I think about when I hear DevSecOps. Some people might think about buzzwords, some people might have specific terminology in mind. I think about integrating security into all of our business functions by empowering people and creating accountability. I carefully picked every word here. We want to incorporate security into every business function, whether that's administration or engineering, because ultimately, if everybody is empowered to take ownership of the part they are working with, then you can cover all areas within the business. So it's really about empowering people to take that ownership, to know what they are supposed to do, how they can do it, how they can ask for help, and similar.
And then, both when things go well and when things go badly, you can create accountability and have more productive conversations. It's not about finger pointing, it's about having more productive outcomes in the end. So that's what DevSecOps is all about to me: really making things happen across the business through shared ownership.
Next thing: if you're taking anything away from this talk, it should be that SRE practices and security practices have a really tight overlap. Ultimately, when we define what healthy services look like, we should also define what secure services look like, because only secure services are healthy services. That's basically what this talk boils down to. When I moved from my work as an SRE back into developer advocacy for open source tools at Aqua Security, I realized that there's such a strong overlap between both that it just doesn't make sense to completely decouple them. I know in many businesses you will have a separate security team. That's a great thing. But at the same time, we should also ask how different areas can benefit from each other, and how we can make something like integrating security as easy as possible. So the idea is: if you have an SRE team, if you have people focused on observability, start with those.
Here are some additional goals that you might have within your SRE team that are also security goals, or that are tightly coupled, let's say. When we talk about how we can scale our services, we also have to talk about how we can keep those services secure over time as they become more complex. As we scale our services and our replicas up and down based on demand, we also have to talk about how we keep them secure. The next thing is visibility.
With your observability tools, you obviously want to gain visibility and insights into what's deployed, where and how it is deployed, who deployed it, when it was deployed, how it is interacting with other services, whether it is maybe causing failures in other services, and similar. Those are all questions and topics related to visibility. Now, when you are getting started with cloud native security, you want to focus on security scanning, on getting more insights into the security posture of your services. All of that also contributes to gaining more visibility into those different areas.
The next thing is to reduce noise and toil. There's something I'm going to talk about a little bit more later, which is called vulnerability fatigue. It basically means that you're bombarded with security issues and you can't keep up with fixing them or taking care of all of them. So within your cloud native security, you want to focus on the most productive and most efficient information, the information that you can take actionable steps from. Similarly, within your SRE team, you might have thousands and thousands of logs that you obviously can't filter through manually. So you want to have processes, workflows, and also tools in place that help you reduce all that noise.
The next thing is automation. Automation is great for different aspects, and it obviously makes our lives easier. But I'm going to talk a bit about the downsides of automation and what we have to be careful about when we automate SRE work, observability tools, and also cloud native security. The last thing is what I already mentioned: ownership. Communication is key for both areas.
So here are some of the more practical items that a lot of SRE teams already do and that we can also adapt for getting started with cloud native security. The first one is investing in runbooks and documentation.
Just as we define how to respond to different types of incidents, when to escalate an incident, and what steps to take during an incident, we can do the same for any security issues that we might have within our tooling. So we could, for example, define: if there's a critical vulnerability, what steps have to happen, who has to take those steps, and similar. The other items are really also something that can be adapted for both teams, whether you have separate teams, with people focused on security versus people focused on site reliability engineering, or you integrate one into the other.
Here are some of the tools that we used at the startup that I mentioned, where I was working as an SRE. The observability tools are really your standard stack, I would say, with Grafana, Prometheus, and Jaeger. We tried to install Tempo. We used Grafana Loki for logs. For management, we mainly used Helm and Terraform; it was very much Helm and Terraform focused. And then we used GitLab CI/CD pipelines. We talked a lot about these different tools and about the different integrations and installations of those tools. However, we didn't talk about security tools. That's something we didn't really talk about. I mean, we were following security best practices, don't think we were not. But at some point we had an intern, a university student, who was helping us implement tools such as kube-bench from Aqua Security as well.
Just a quick mention: every tool from Aqua that I showcase here is one of Aqua's open source tools. I am not promoting any enterprise tools in this talk. You don't have to sign up, it's all free to use on GitHub, you're not sending us any data, and similar. Since there is so little conversation about how we can actually get started with cloud native security, for example in your SRE team, I've thought about different steps that you can take.
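The runbook idea from the start of this passage (what has to happen for a critical vulnerability, and who does it) can be written down as data so that it is easy to review and reuse. The following is only a minimal sketch: the role names, deadlines, and steps are all made-up placeholders, not recommendations from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunbookEntry:
    """One row of a hypothetical security runbook."""
    owner: str                 # who takes the first step (illustrative role name)
    respond_within_hours: int  # illustrative deadline, not a recommendation
    steps: tuple               # ordered steps to follow


# Every owner, deadline, and step below is an invented example.
RUNBOOK = {
    "CRITICAL": RunbookEntry(
        owner="service-owner",
        respond_within_hours=24,
        steps=(
            "confirm the finding",
            "check whether a fix exists",
            "patch or mitigate",
            "escalate if no fix is available",
        ),
    ),
    "HIGH": RunbookEntry(
        owner="service-owner",
        respond_within_hours=72,
        steps=("confirm the finding", "schedule the fix"),
    ),
    "DEFAULT": RunbookEntry(
        owner="security-champion",
        respond_within_hours=336,
        steps=("review in the next triage meeting",),
    ),
}


def plan_for(severity: str) -> RunbookEntry:
    """Return the runbook entry for a severity, falling back to DEFAULT."""
    return RUNBOOK.get(severity.upper(), RUNBOOK["DEFAULT"])


if __name__ == "__main__":
    entry = plan_for("critical")
    print(entry.owner, entry.respond_within_hours, list(entry.steps))
```

Keeping the runbook next to the incident runbooks, in the same format, is one way to reuse the process an SRE team already has instead of adding a separate one.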
It's one approach; there are different approaches. As our main security tool, our security scanner, we're going to focus on Trivy. Trivy is an all-in-one security scanner, all-in-one because it can scan all of those different scan targets. It also has SBOM functionality and cloud provider account scanning, starting with AWS. It can also do in-cluster scanning of running workloads. So it's a very, very versatile tool that's focused on different users and different workflows.
Step one in our ten-step journey is understanding your need. That's really important, because if you have no idea what you're actually aiming for, then you don't know what to look out for. Before we can define our need, we have to be aware of the factors that influence our goals, what we actually need to accomplish. The first one is the size of our team. If you are working as an individual contributor, the needs for the different tools and the way that you integrate security tooling and practices will be different than if you're working within a large-scale team. The next thing is the industry you're working in. Is it a highly regulated industry that requires you to choose specific tools and work with a specific stack? Or are you working for a startup where you just make things work in the best way possible with the tools available? Then there's the type of technologies you're working with, which is also related to the integrations that are available. If you need a custom setup for your custom on-premise infrastructure, then your needs will be quite different from those of somebody who's managing an open source project, for example, or managing, I don't know, a small retail website. Then there are the company goals and leadership. A lot of times security, whether that means acquiring the skills or the tools, is related to having budget and expertise.
It's usually something that people keep as the last thing to take care of, which is obviously an issue. But yeah, it's one of the factors that you want to take into account. When you want to get started with integrating cloud native security, it doesn't mean you need to have budget and expertise already available within your team. It just means that this is one of the factors that can influence which tools you're using in the end.
Now, tools differ in different ways. That's also something you want to keep in mind. The first one is the installation. Different tools are installed differently. A lot of the cloud native security scanners are used as CLI tools, so you use them either in your local terminal or in your CI/CD pipeline. Other tools come as Kubernetes operators and other Kubernetes resources and can be installed within your cluster. You want to be wary of tools that do something within your cluster, because security scanners need lots and lots of privileges within your cluster to perform proper security scanning. So whenever you sign up for a tool and give it access to your cluster, you want to be mindful of what it is actually doing within your cluster and who's getting the data from those scans. Compare that with installing, for example, an open source Kubernetes operator within your cluster that performs the scans inside the cluster, with the reports and resources of the scans only available within the cluster. Then you know it's really contained within your existing environment.
The next thing is scan coverage. We get lots of questions about Trivy, in the project issues and so on, where people ask why a scan from Trivy differs from the scan from another tool. Basically, Trivy has a Trivy database, which is a separate project under the Aqua open source umbrella, and it pulls from different data sources, for example lists of vulnerabilities, so scanners that rely on different sources can report different results.
The next way in which tools differ quite significantly is the number and type of integrations available. Especially if you're going with an open source security scanner, you want to be mindful of the integrations that are available. More mature scanners will usually have more integrations. The last thing is the focus. Different tools are focused on different people, different types of audiences. Some might be focused on security professionals, others on engineers.
Here is an example of need-driven development from the Wise engineering blog. They detailed how they changed their security scanning to gain better insight into the security posture of their services. Here are the four goals that they wanted to accomplish with that change. The first one is to assign ownership of vulnerabilities. They wanted different people within the team to take ownership of different vulnerabilities, so it's actually going to be somebody's job to take care of and fix that vulnerability. Next, they wanted to have a global view of the security state of services. That's very important. A global view is not what helps you analyze and fix a specific service, but only with a global view can you see how wider changes, for example in your workflows or in adopting external tools, have an impact on your overall security posture. Then they wanted to develop dashboards for different users and requirements, which is more related to breaking down the security issues of specific services. And they wanted to overcome difficult-to-use and differing UIs.
A lot of times in the cloud native ecosystem, whenever you're using a new tool, you're adopting a new workflow and a new UI, interface, and framework. It takes time to get used to them and to learn your way around, and you will always have to do something separate from what you have already been doing. So they wanted to integrate their security tools into their existing workflows, to have just this one place to go to.
Then step two. Once we know what we actually want to achieve, how different tools differ, and what factors we have to keep in mind, we want to choose a cloud native security scanner. Here is a list of different cloud native open source security scanners in the space. They are focused on different types of scanning. For example, some are just focused on vulnerability scanning, others are focused on infrastructure-as-code misconfiguration scanning, and others do compliance scans. Compliance scans, for example, would more likely be used by security professionals, whereas in-cluster scans might also be used by cluster admins. As you can see, Trivy sits across those different areas since it's an all-in-one security scanner. It does lots of different things, but if you just need vulnerability scanning, you might want to consider another tool that focuses on vulnerability scanning. And here's the list.
Once we have looked at the different scanners, and in our case we're going with Trivy because I'm familiar with it, we want to set it up and make sure everything is running properly. Sometimes you might go with one scanner, set it up, play around with it, and realize it's not the right tool, either because the workflow is not intuitive for you or because something is just not working. It's completely fine to go back to step two and say, okay, we actually want to use a different scanner now.
In our case we're using Trivy, and now we want to make sure it's working properly. The first thing is to identify the best installation option. Trivy also comes with different installation options. I usually go with the Helm installation inside of my cluster, in addition to having automated CI/CD pipeline scanning. Then you want to decide on the configuration. For example, if you're using Trivy in combination with observability tools such as Prometheus, you have to configure some parts slightly differently. You then want to test those custom configurations and ensure that everything works properly with all the tools it's supposed to work with. For example, if you have some niche case where Trivy is supposed to perform, I don't know, a thousand vulnerability scans of different containers on a regular basis, some real edge case, you want to test it out in a small environment first. That's true for every tool: you want to test your specific edge case in a small environment before you implement it in a large-scale environment.
Here is a very simplified overview of how a Kubernetes cluster might look once you have installed Trivy. First, you have maybe an application namespace with all your application-related resources. Then you have a monitoring namespace with Prometheus, Grafana, and other observability tools. And then you have your trivy-system namespace with the Trivy operator. The Trivy operator is the part of Trivy that does continuous in-cluster scanning of your running workloads. In addition to that, you could also use Trivy, the CLI tool, in your CI/CD pipeline or on your developer machines. The beautiful thing is that if everything is a Kubernetes resource, you can use the same processes across your stack.
For example, if everything is a Helm chart, with Prometheus and Grafana as Helm charts and the Trivy operator as a Helm chart, you can deploy and manage those applications through the same processes, which is really nice, really handy.
Here's what you will then see inside of your trivy-system namespace. Alongside the Trivy operator you will also have several Kubernetes custom resource definitions deployed, CRDs, and they basically extend the Kubernetes API to allow for custom security scans. Here we have the metrics of our different security scans. Trivy does vulnerability scans of any container image it finds inside of your cluster. It does exposed secret scans: are there any exposed secrets within your cluster? Then, is there any misconfigured RBAC, any role-based access control that should maybe be changed? And then it also does config audit scans. The thing is, things might change dynamically inside of your cluster. People might change things around manually, they might try things out, they might deploy containers to debug things. I don't know what your company or team does. But Trivy will identify any misconfigurations that are present in those newly set up resources and can alert you to them.
Now, these are just the metrics from the security scans, from the security reports. The security reports themselves are just other Kubernetes resources. They are YAML manifests, and you can read them like YAML manifests. And because the security reports are Kubernetes resources, you can export them. For example, you can get the metrics out and integrate them into your observability stack.
That's the next step: setting up a dashboard. We have Grafana and Prometheus installed, we have our security tools installed, and it's time to set up a nice dashboard.
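Because the reports are ordinary Kubernetes resources, their summaries can be rolled up into the kind of totals a dashboard shows. The sketch below assumes each report has already been fetched and parsed into a dictionary. The field names (`metadata.namespace`, `report.summary`, keys ending in `Count`) are modeled loosely on an in-cluster vulnerability report and should be treated as assumptions, and the numbers are invented.

```python
from collections import Counter

# Hypothetical, already-parsed reports; names and counts are made up.
SAMPLE_REPORTS = [
    {"metadata": {"namespace": "app", "name": "replicaset-web"},
     "report": {"summary": {"criticalCount": 2, "highCount": 5,
                            "mediumCount": 7, "lowCount": 1}}},
    {"metadata": {"namespace": "app", "name": "replicaset-api"},
     "report": {"summary": {"criticalCount": 1, "highCount": 0,
                            "mediumCount": 3, "lowCount": 4}}},
    {"metadata": {"namespace": "monitoring", "name": "deployment-grafana"},
     "report": {"summary": {"criticalCount": 0, "highCount": 2,
                            "mediumCount": 1, "lowCount": 0}}},
]


def summarize(reports):
    """Return (cluster-wide totals, per-namespace totals) keyed by severity."""
    totals = Counter()
    per_namespace = {}
    for item in reports:
        namespace = item["metadata"]["namespace"]
        counts = per_namespace.setdefault(namespace, Counter())
        for key, value in item["report"]["summary"].items():
            # "criticalCount" -> "critical"; keys without the suffix are kept as-is.
            severity = key[:-len("Count")] if key.endswith("Count") else key
            totals[severity] += value
            counts[severity] += value
    return totals, per_namespace


if __name__ == "__main__":
    totals, per_namespace = summarize(SAMPLE_REPORTS)
    print(dict(totals))
    for namespace, counts in per_namespace.items():
        print(namespace, dict(counts))
```

The cluster-wide totals correspond to the global view, and the per-namespace breakdown is a first step toward giving each number an owner.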
This is a dashboard created by the community, where we have a summary of our different vulnerabilities, broken down by severity. In total we have 175 vulnerabilities in our cluster. You can also see all of the other metrics directly through a Grafana dashboard. Breaking those vulnerabilities out into different categories makes it easier to identify the different types of vulnerabilities that you have.
The next thing you might already be thinking about is: how do you avoid vulnerability hell? If you have 175 different vulnerabilities, how do you go about addressing them, how do you go about managing them? I'm not swearing, those are a lot of vulnerabilities, and we can't manage them all at once. Here's a screenshot from Alex Jones on Twitter saying "I just give up and die." Difficult sentence. Anyway, he scanned a resource, I don't know what type of resource he actually scanned, with Snyk and found over 550 different vulnerabilities. They are broken down into critical, high, medium, and low as well. But still, that's lots of vulnerabilities. You can't look at 550 vulnerabilities. It doesn't work.
So here are some practical steps that you can take. The first one is to ignore all but critical vulnerabilities. He only has three critical vulnerabilities, and that's easy to address. Just take care of the critical vulnerabilities first, and then look at the rest in a more productive way. Don't scan everything at once. I don't know if he scanned just one resource or multiple resources, but there's really no need to scan everything at once. Just scan the most critical workloads first. Filter by vulnerabilities with known fixes. Trivy allows you, easily, with an additional flag, to specify that you only want to see vulnerabilities that already have a fix available.
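The first triage steps (critical only, and only findings that already have a fix) can be sketched as a small filter over a scanner's JSON output. This is a sketch under assumptions: the `Results`, `Vulnerabilities`, `VulnerabilityID`, `PkgName`, `Severity`, and `FixedVersion` keys mirror the general shape of a Trivy JSON report as I understand it, and the sample data and CVE identifiers are invented placeholders. On the Trivy CLI itself, I believe the corresponding options are `--severity` and `--ignore-unfixed`, but check the Trivy documentation for your version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    vuln_id: str
    package: str
    severity: str
    fixed_version: str  # empty string when no fix has been published


# Invented sample data; the CVE IDs are placeholders, not real advisories.
SAMPLE_SCAN = {
    "Results": [
        {"Target": "example-image (hypothetical)",
         "Vulnerabilities": [
             {"VulnerabilityID": "CVE-0000-0001", "PkgName": "libexample",
              "Severity": "CRITICAL", "FixedVersion": "1.2.4"},
             {"VulnerabilityID": "CVE-0000-0002", "PkgName": "libexample",
              "Severity": "CRITICAL", "FixedVersion": ""},
             {"VulnerabilityID": "CVE-0000-0003", "PkgName": "otherlib",
              "Severity": "HIGH", "FixedVersion": "2.0.0"},
             {"VulnerabilityID": "CVE-0000-0004", "PkgName": "otherlib",
              "Severity": "LOW"},
         ]},
        # A target without findings may carry no list at all.
        {"Target": "no-findings", "Vulnerabilities": None},
    ]
}


def load_findings(scan: dict) -> list:
    """Flatten a scan report into a list of Finding objects."""
    findings = []
    for result in scan.get("Results", []):
        for vuln in result.get("Vulnerabilities") or []:
            findings.append(Finding(
                vuln_id=vuln["VulnerabilityID"],
                package=vuln["PkgName"],
                severity=vuln.get("Severity", "UNKNOWN"),
                fixed_version=vuln.get("FixedVersion", ""),
            ))
    return findings


def triage(findings, severities=("CRITICAL",), only_fixable=True) -> list:
    """Keep findings with a wanted severity and, optionally, a known fix."""
    return [
        f for f in findings
        if f.severity in severities and (f.fixed_version or not only_fixable)
    ]


if __name__ == "__main__":
    for finding in triage(load_findings(SAMPLE_SCAN)):
        print(finding.vuln_id, finding.package, "fixed in", finding.fixed_version)
```

Widening `severities` or turning off `only_fixable` is how the list would grow back once the first batch has been dealt with.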
So you could go ahead and do that, and just look at those vulnerabilities first. Then filter vulnerabilities by team and by application. Really make them team and application specific. Give them context. Give them meaning, so that they are not just a line of text about something that's wrong with some resource. That's ultimately what you don't want to have. And that's also related to the needs from the Wise engineering blog post.
Next thing, step six: what are metrics without alerts? This is related to what I said earlier about automation, which I want to talk about a little bit more after I take a sip of coffee. Sorry, my throat is still a bit messed up from a cold. So basically, you need to define your deployment resources to deploy your application. That's a necessity. Otherwise your application is not deployed, it's not working, customers can't access it, and customers are unhappy. You don't want that. So you obviously need to define those deployment resources. But the same doesn't hold true for security. You don't need to define how you scan your resources, you don't need to define scan coverage. All those things related to security, you don't need to do them to deploy your application, to have it working, to make customers happy. Customers are only unhappy when things go wrong in the security world, like when their data is ultimately exposed. So it's not a necessity for engineers, for anyone operating an application or a business, to actually take care of its security. I mean, for most of the services that you use online, you probably don't know what kind of critical vulnerabilities are in them, and you shouldn't have to care about that. That's something for the business to care about. But that's exactly why you want to set up alerts and make your vulnerabilities scream at you.
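Grouping findings by team and application, and then alerting the owning team, might look like the following minimal sketch. It assumes vulnerability counts have already been collected per workload and that each workload carries an owning team. The team names, application names, and threshold are hypothetical, and delivering the messages (chat, pager, ticket) is left out.

```python
from collections import defaultdict

# Hypothetical per-workload counts with an owning team attached.
WORKLOADS = [
    {"team": "payments", "app": "checkout", "critical": 2, "fixable": 1},
    {"team": "payments", "app": "ledger", "critical": 0, "fixable": 0},
    {"team": "platform", "app": "ingress", "critical": 1, "fixable": 1},
]


def build_alerts(workloads, max_critical=0):
    """Return {team: [message, ...]} for workloads above the threshold.

    max_critical is the number of critical vulnerabilities a workload may
    have before its owning team is alerted (0 means alert on any).
    """
    alerts = defaultdict(list)
    for workload in workloads:
        if workload["critical"] > max_critical:
            alerts[workload["team"]].append(
                f"{workload['app']}: {workload['critical']} critical "
                f"vulnerabilities ({workload['fixable']} with a fix available)"
            )
    return dict(alerts)


if __name__ == "__main__":
    for team, messages in build_alerts(WORKLOADS).items():
        for message in messages:
            print(f"[{team}] {message}")
```

Routing each message to the team that already owns the workload gives the alert context, rather than leaving it as an anonymous line in a long list.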
Give them a voice. Set them up in such a way that you cannot ignore them. Once you do that, you can correlate metrics. For example, if new vulnerabilities are introduced, we can correlate the dashboard of our vulnerabilities and misconfiguration issues, which went up, with our deployment dashboards and see what happened in our cluster: okay, there's a new deployment, there's a new replica set, and that caused the number of vulnerabilities inside of the cluster to go up.
Step eight is some additional tips, and some of them iterate on ones that I already mentioned. The first one is to assign ownership. Really make it somebody's responsibility. Ideally, the person who's already managing a resource should look at its vulnerabilities. Don't introduce many new tools at once. What lots of people want to do when they get started with, for example, cloud native security is to implement everything everywhere at once. That's complete overload, and people are likely not going to be able to adapt to those new processes. The next thing is to utilize existing workflows, platforms, and processes. Utilize them as much as possible, because that makes it easier for people to actually look at the security reports.
Step nine is to optimize based on what works for your team. A lot of times we can follow the initial setup, follow whatever a company said, but ultimately every application will be deployed differently depending on your environment. A lot of times when I get questions about the Trivy operator specifically and its deployment, I cannot answer them before I get more information on your setup, on your environment, on your needs, on all those different factors that play a role, because ultimately my answer will differ based on what your setup looks like, what applications you're already using, and similar.
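The correlation described at the start of this passage, matching a rise in vulnerability counts with a recent deployment, can be sketched over two simple series. This assumes the vulnerability count is sampled as `(timestamp, count)` pairs and deployments are recorded as `(timestamp, name)` pairs. The data, the names, and the five-minute default window are all invented for illustration.

```python
def correlate(samples, deployments, window=300):
    """Pair each rise in the vulnerability count with recent deployments.

    samples:     list of (timestamp_seconds, vulnerability_count), sorted by time
    deployments: list of (timestamp_seconds, deployment_name)
    window:      how many seconds before the rising sample a deployment
                 still counts as a candidate cause
    Returns a list of (deployment_name, increase) tuples.
    """
    suspects = []
    for (_, previous), (moment, current) in zip(samples, samples[1:]):
        if current <= previous:
            continue  # the count did not go up between these two samples
        for deployed_at, name in deployments:
            if moment - window <= deployed_at <= moment:
                suspects.append((name, current - previous))
    return suspects


if __name__ == "__main__":
    # Hypothetical data: the count rises from 10 to 14 at t=1200.
    samples = [(0, 10), (600, 10), (1200, 14)]
    deployments = [(100, "api"), (1000, "web")]
    print(correlate(samples, deployments))
```

A match only marks a deployment as a candidate; it still takes a person who owns the workload to confirm whether the new image or configuration introduced the findings.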
So there's really no one thing that works for everybody.
And step ten: don't stop at security scanning. There are lots of different types of security tools in the cloud native space. For example, Tracee is a runtime security and forensics tool that analyzes events at the node level. While Trivy can scan for misconfigurations once they have happened inside of your cluster, Tracee can detect if somebody uses a misconfiguration to do something they shouldn't do. Those are the main differences. Here you can see a dashboard of its different logs. You would obviously want to filter them further in different ways to get actionable steps from those logs, because over 2000 logs is nothing you can really follow up on.
And here are some of the resources used: the blog post from Wise Engineering on their application security journey, and the Aqua Open Source YouTube channel, where we have lots of different tutorials to get started with. Here are the Trivy GitHub repository and the Trivy operator repository. If you Google "Trivy" or "Trivy operator", you should find them as well. Here's a demo project on GitHub that I've been using, and you can find us on Slack if you have any questions about this presentation, about anything I said, or about Trivy and other projects within the Aqua ecosystem.
Now, I hope we have some time for questions. Otherwise, thank you so much for attending my talk. I hope you have an amazing rest of your day, and see you soon.

Anais Urlichs

Developer Advocate @ Aqua Security
