Conf42 Site Reliability Engineering (SRE) 2025 - Online

- premiere 5PM GMT

A Comprehensive Approach to Cloud Security Posture Management: Integrating Infrastructure as Code, AI-Driven Monitoring, and Reactive Security


Abstract

Unlock the secrets to a resilient cloud security posture! Learn how to leverage Infrastructure as Code, AI-driven monitoring, and reactive security to prevent vulnerabilities and automate compliance. Gain practical insights and actionable strategies to protect your cloud environment.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. Good morning, good afternoon, or good evening, wherever you are. Thank you for tuning in to my talk today. A quick disclaimer before we get started: this presentation includes some research points and some demos, which are purely based on my own views. As an attendee, you are encouraged to do your own research and consult a professional before you start implementing any of this within your organization.
Speaker introduction: my name is Santosh Bompally, and I'm a cybersecurity professional with a background in computer science and a master's in cybersecurity. I hold several industry certifications, such as CISSP, CSSLP, and CASP. Some of these are industry-specific security certifications. I enjoy challenging myself, and I acquired these skills while working in different roles across different organizations. I'm also multi-cloud certified across all three major clouds: Azure, GCP, and AWS. I am a lead cloud security engineer at Humana, where I lead some of the cloud security initiatives across the enterprise. I'm passionate about threat hunting and capture-the-flag challenges, because they give me an opportunity to think like a bad actor so that we can build controls more effectively. I also contribute to home automation projects, and I like connecting with people to talk about home automation, cybersecurity insights, and automation in general. If you would like to connect, please reach out to me on LinkedIn, or use the QR code on the slide.
Moving on to our topic today, which is cloud security posture management and cloud-native integration. We are going to do a deep dive into what cloud security posture management is, focusing specifically on cloud-native integration.
What we are going to talk about is how to secure the cloud with cloud-native provider capabilities, so that whenever a deployment goes into the cloud, you can secure it whether it comes from your IaC or directly from the platform. Let's look at today's agenda. We will start with the threat landscape, because it gives you an understanding of the growing complexity inside the cloud and some smart things you can do to secure it. Then we will cover the building blocks of cloud security posture management: what really makes it up. Next is infrastructure as code security: what it is and why you need it. Then we will do a deep dive into cloud-native security policies, covering how to implement them and why you need those policies at the CSP level. After that, we will look at event-driven security architecture, security posture management, and exception management, and close with some key takeaways.
Moving on to the next slide, we are going to talk about current cloud security challenges. As you have probably all seen, the utilization of cloud workloads is going up, and the cloud in general comes with a shared responsibility model: the cloud provider is responsible for a few things, and as a customer you are responsible for others, especially securing the configurations the cloud provider gives you access to. You need to ensure those are properly secured. Based on published research, the attack surface is growing, and because it is growing, monitoring it and ensuring that proper configurations get deployed is a complex task.
We need to drive automation to manage the configuration complexity. Without it, it becomes very hard to configure all of these things in multiple places, and we lose track of what we are monitoring. That can lead to gaps, and that's where attackers can start attacking the workloads running in the cloud. When you also look at compliance requirements, the fines are rising. When you run workloads in the cloud or on-prem in a specific sector, you have to meet certain compliance requirements. For example, you need to understand how data should be segregated, what PCI DSS requires, and how many days you need to retain certain records. If you are not handling compliance properly, whether that is discovered in an audit or after a breach, it can lead to huge fines. As you can see, the fines keep going up, partly because of the complexity of the requirements and how they are interpreted.
When we look at the foundational pillars of cloud security posture management, there are four. The first pillar is infrastructure as code (IaC) scanning, where we proactively scan for security misconfigurations within the pipeline. The second pillar is cloud-native validation: a security validation layer that is closely tied to the cloud service provider. This is the area of focus for today's deep dive, where we will look at how to configure it and what its advantages are.
The third pillar is runtime remediation, which means watching events in real time and triggering functions to correct any misconfiguration that pillars one and two failed to detect. Stitching pillars one, two, and three together is the fourth pillar, compliance reporting. Showing and monitoring compliance is key to understanding both your attack surface and your regulatory compliance, because knowing what is out there and how your resources are running is key to coming up with your strategy.
Moving on to the next slide: infrastructure as code security scanning. At this layer, you enable pre-deployment scanning, so that whenever your IaC code goes through the pipeline, developers get early guidance from policies integrated into the pipeline. Misconfigurations are caught there, so nothing goes out into the cloud in a misconfigured state. These tools often integrate with version control systems, so whenever you submit a pull request, it is automatically scanned against a specific set of policies to ensure it meets compliance before it moves on to the next stage and is merged into main. We did a deep dive into this topic previously, so you can use the link on the slide or scan the QR code. In that session, we showed how to use Open Policy Agent (OPA) to author IaC policies so that you can stop misconfigurations before they enter the cloud. Please watch that video; it should really help you understand how to prevent misconfigurations before they reach the cloud.
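The pre-deployment check described above is usually written in a policy language such as OPA's Rego, as in the earlier session. As a language-neutral sketch of the same idea, the Python below scans the JSON form of a Terraform plan (as produced by `terraform show -json plan.out`) and rejects any Azure storage account whose `min_tls_version` is not `TLS1_2`. It assumes resources are declared with the `azurerm` provider's `azurerm_storage_account` type; the rule and function names are illustrative, not the speaker's actual policy.

```python
import json
import sys

# Illustrative rule mirroring the talk's demo: storage accounts must use TLS 1.2.
# Assumes resources come from the azurerm Terraform provider.
RESOURCE_TYPE = "azurerm_storage_account"
REQUIRED_TLS = "TLS1_2"


def find_violations(plan: dict) -> list:
    """Return one message per planned storage account that would deploy with weak TLS."""
    violations = []
    for change in plan.get("resource_changes", []):
        if change.get("type") != RESOURCE_TYPE:
            continue
        details = change.get("change", {})
        if details.get("actions") == ["delete"]:
            continue  # resource is being removed, nothing left to check
        after = details.get("after") or {}
        tls = after.get("min_tls_version")
        # A missing value is treated as a violation so the check fails closed.
        if tls != REQUIRED_TLS:
            violations.append(
                f"{change.get('address')}: min_tls_version={tls!r}, expected {REQUIRED_TLS!r}"
            )
    return violations


def main(argv: list) -> int:
    """CLI entry point: a non-zero return code lets a CI stage fail the build."""
    if len(argv) != 2:
        print(f"usage: {argv[0]} plan.json", file=sys.stderr)
        return 2
    with open(argv[1]) as handle:
        problems = find_violations(json.load(handle))
    for problem in problems:
        print("DENY", problem)
    return 1 if problems else 0
```

In a pipeline step you would run `terraform plan -out plan.out`, then `terraform show -json plan.out > plan.json`, and call `sys.exit(main(sys.argv))`. Failing the step at pull-request time is what gives developers the early feedback.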
Integration is very important in the DevSecOps model, because it creates a feedback loop: whenever a misconfiguration is found, developers get a clear understanding of the expected value, so you don't have to chase these issues at runtime. Remediating at runtime has its own problems. Configurations drift, and sometimes, when vulnerabilities are high and need to be remediated, we go and remediate them and the service breaks in unexpected ways. That's why it's really important to fix issues with a shift-left approach: fixing them at runtime, on the shift-right side, is expensive, and the cost of remediation is very high. What really helps here is fostering collaboration between your developer community and security teams to understand these patterns; enabling the DevSecOps model within your organization helps you do that.
Today's topic of focus is cloud-native security policies. We are going to look at GCP organization policies, Azure Policy, and AWS SCPs (service control policies) to understand why we need them and what centralized policy management is. Many of these policies should be managed centrally so that you get maximum value, and automated so that whenever a new resource goes into the cloud, it is evaluated and the policy is enforced. In today's demo, we are going to look at two specific cloud providers, Azure and GCP. We will create a misconfigured resource in Azure, author a cloud-native security policy in Azure, and then redeploy the resource to see the difference and how exactly the misconfiguration is stopped. We will do the same thing in GCP too.
In GCP, we are going to review some of the policies and how to enforce them, and then try to create some virtual machines that are not appropriate to deploy, so you can see how GCP blocks them. If you can see my screen, we are currently inside the Azure cloud. Within a subscription, we have a resource group, and we are going to create an Azure Storage account. This resource is used to store data in the cloud, such as blob data or files. This is the configuration I was referring to when you create these resources in the cloud. For this one, we can give it a demo name; the name has to be unique. As you can see, these are all the configurations, and we are going to intentionally misconfigure this storage account. I'm selecting a minimum TLS version of 1.0, which is an insecure version. On the networking side, access from the internet is enabled: the firewall is open, so anyone on the internet can access the account if they have the proper keys. This is what data protection looks like, with soft delete, and this is the encryption tab; I'm not going to change anything there. I'm just going to create it. We have submitted it and the deployment is going through, so let's wait a couple of minutes and see how it goes. What we are really doing here is intentionally misconfiguring this resource. As you can see now, Microsoft accepted it: as far as Azure is concerned, the configuration looks fine, and it goes ahead and creates the resource.
If you go to the resource, you can see some of these settings. It is recommending TLS 1.2, but it is not enforcing it. Under networking, access is enabled from all networks. If shared access keys are enabled and access is allowed from all networks, then anyone with access to a key can reach the account from anywhere, rather than only from a specific subnet or location. Data protection, as you can see, comes with the defaults. And if you look at the configuration settings, the minimum TLS version is 1.0. We know this is a misconfiguration, or at least not the best version from a security perspective, so we want to make sure any storage accounts from now on are created with TLS 1.2 so that we maintain that baseline. In Azure, we have something called Azure Policy, which helps you write and author the policies required for ensuring compliance. Whenever you submit anything for deployment, these policies define what that deployment should look like. So now we are going to do a deep dive into policy authoring and how the policy is applied. Let's search for TLS 1.2. Azure has a lot of built-in policies, and when you look into them, you can understand how they are authored and how you can write your own. You can also create a custom policy definition if you have a specific requirement, but for this demo we are going to use the storage account policy for the minimum TLS version.
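The demo assigns a built-in policy, but a custom definition with the same intent is small. The sketch below builds one as a Python dictionary in the Azure Policy JSON shape: an `effect` parameter limited to `Audit`, `Deny`, and `Disabled`, and a rule that matches storage accounts whose `Microsoft.Storage/storageAccounts/minimumTlsVersion` alias is not `TLS1_2`. It is modeled on the demo, not copied from Microsoft's built-in definition; the display name and the `build_assignment` helper are hypothetical. The assignment sets the effect and a custom non-compliance message, like the message added in the demo.

```python
import json

ALLOWED_EFFECTS = ["Audit", "Deny", "Disabled"]

# Custom policy definition (hypothetical display name) in Azure Policy JSON shape.
policy_definition = {
    "properties": {
        "displayName": "Storage accounts must use TLS 1.2 (demo)",
        "mode": "Indexed",
        "parameters": {
            "effect": {
                "type": "String",
                "allowedValues": ALLOWED_EFFECTS,
                "defaultValue": "Audit",
            }
        },
        "policyRule": {
            "if": {
                "allOf": [
                    {"field": "type", "equals": "Microsoft.Storage/storageAccounts"},
                    {
                        "field": "Microsoft.Storage/storageAccounts/minimumTlsVersion",
                        "notEquals": "TLS1_2",
                    },
                ]
            },
            # The effect is resolved from the assignment's parameter value.
            "then": {"effect": "[parameters('effect')]"},
        },
    }
}


def build_assignment(definition_id: str, effect: str, message: str) -> dict:
    """Build an assignment body that picks the effect and a developer-facing message."""
    if effect not in ALLOWED_EFFECTS:
        raise ValueError(f"effect must be one of {ALLOWED_EFFECTS}, got {effect!r}")
    return {
        "properties": {
            "policyDefinitionId": definition_id,
            "parameters": {"effect": {"value": effect}},
            "nonComplianceMessages": [{"message": message}],
        }
    }


if __name__ == "__main__":
    print(json.dumps(policy_definition["properties"]["policyRule"], indent=2))
```

Setting `effect` to `Audit` only reports non-compliant resources, while `Deny` rejects the request. The `policyRule` and `parameters` objects are what you would pass to `az policy definition create` through its `--rules` and `--params` options.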
If you look into it, the policy's effect parameter has the allowed values Audit, Deny, and Disabled, which define how the policy behaves. An Audit policy only audits; it doesn't block anything. A Deny policy denies the deployment if it does not meet the parameters. I'm going to apply this as a Deny policy and set the scope to the resource group we are currently working in. Then I'm going to set the parameter so that if the version is not TLS 1.2, the deployment is denied. I'm setting it to Deny and adding a short non-compliance message saying this is part of testing. You can customize this message however you want so that developers understand what needs to be selected. While applying it, there was a scope issue, which I have fixed. Now the policy is created, and if you check the assignments, this is the policy we applied at that specific scope. Since we have applied this policy, we are going to go back to this resource group and create a storage account. This time, if I select the misconfiguration, it should be denied. Under Advanced, I'm intentionally setting the minimum TLS version to 1.0 to see what happens when we move forward with the deployment. The deployment is in progress, and Azure is evaluating whether it meets the standards; within a couple of minutes we should see the results. If you go to the resource, you can see that this storage account was actually created, so let's quickly check the policy, right?
Let's check that the policy is active. The name of the policy was TLS. Okay, this is currently an Audit policy, and if you view compliance, you can see there is one non-compliant resource. This is how the audit functionality works: it shows you when a storage account is not configured the way it is supposed to be, but it does not block it. So now we are going to go back and change the effect from Audit to Deny, so that the deployment is denied right away when the misconfiguration is submitted to the cloud provider while creating the storage account. Let's create another storage account. I'll just pick a random name for testing, select TLS 1.0, and make sure it is in the proper resource group. I'll leave everything else as is. The validation is in progress, and as you can see, this was denied by the policy, with the custom message we added saying it is part of the SRE demo. TLS 1.2 is not selected, and that is why this storage account is blocked from being deployed. If I go back and change the configuration to 1.2, which is the compliant configuration, it performs the validation again, and now it gives me the option to create it. This gives you some understanding of how to use these policies and how to author them. Now look at the compliance packs: if you come from a specific industry, you might want to enable a pack such as FedRAMP, ISO, or NIST.
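The first attempt in the demo succeeded because the assignment was still in `Audit` mode; only after switching to `Deny` was the request rejected. The toy model below replays that sequence locally so the difference between effects is easy to see. It is a deliberate simplification, not how the Azure Policy engine is implemented: one rule, one setting, and hypothetical class and function names.

```python
from dataclasses import dataclass


@dataclass
class StorageRequest:
    name: str
    minimum_tls_version: str  # e.g. "TLS1_0" or "TLS1_2"


def evaluate(request: StorageRequest, effect: str, required: str = "TLS1_2") -> str:
    """Return 'created', 'created-noncompliant', or 'denied' for one deployment request."""
    if effect == "Disabled" or request.minimum_tls_version == required:
        return "created"
    if effect == "Audit":
        return "created-noncompliant"  # deployed, but flagged in the compliance view
    if effect == "Deny":
        return "denied"  # request rejected before the resource exists
    raise ValueError(f"unknown effect {effect!r}")


def compliance_summary(outcomes: list) -> dict:
    """Count outcomes the way a compliance dashboard might group them."""
    return {
        "compliant": outcomes.count("created"),
        "noncompliant": outcomes.count("created-noncompliant"),
        "blocked": outcomes.count("denied"),
    }
```

Replaying the demo, TLS 1.0 under `Audit` gives `created-noncompliant` (the one non-compliant resource shown in the portal), TLS 1.0 under `Deny` gives `denied`, and TLS 1.2 under `Deny` gives `created`.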
You can search for the compliance pack and see a list of all the policies that apply to that compliance standard. You can then assign it as an initiative, so that all of those policies are assigned at that scope and you can start monitoring against that compliance standard based on your requirements. That is what we did inside Azure. Going back to the notes: we checked that a storage account meets a specific requirement and ensured that a storage account without TLS 1.2 would not be deployed. Now we are going to do a similar activity in another cloud, GCP, to understand how and where exactly you do this inside IAM. If you see my screen, let's log in to GCP. Here we have a project; anything you deploy in GCP is part of a project. Go to IAM and click Organization Policies. This is exactly where you author these policies. These are constraints: you create and enforce them to ensure that misconfigurations do not go through. For example, I'm going to look at storage. These are some of the constraints you can enforce so that storage that does not meet these parameters is blocked, for example public access prevention. You manage the policy and edit it so that it is enforced; from then on, storage buckets cannot be exposed to the public. Let's look at some of the policies that are already enforced here.
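Enforcing the public access prevention constraint can also be done as code rather than in the console. The sketch below builds a policy in the Organization Policy v2 shape for the `storage.publicAccessPrevention` boolean constraint and writes it to a file intended for `gcloud org-policies set-policy`. The project ID is a placeholder. Writing JSON assumes gcloud accepts it as YAML, since JSON is valid YAML.

```python
import json
import re

# Boolean constraint that keeps Cloud Storage buckets from being exposed publicly.
CONSTRAINT = "storage.publicAccessPrevention"
# GCP project IDs: 6-30 chars of lowercase letters, digits, hyphens; start with a letter,
# and do not end with a hyphen.
PROJECT_ID_RE = re.compile(r"^[a-z][a-z0-9-]{4,28}[a-z0-9]$")


def build_policy(project_id: str, enforce: bool = True) -> dict:
    """Return an Organization Policy v2 document enforcing the constraint on one project."""
    if not PROJECT_ID_RE.match(project_id):
        raise ValueError(f"not a valid project ID: {project_id!r}")
    return {
        "name": f"projects/{project_id}/policies/{CONSTRAINT}",
        "spec": {"rules": [{"enforce": enforce}]},
    }


def write_policy(policy: dict, path: str) -> None:
    """Write the policy to disk, e.g. for `gcloud org-policies set-policy <path>`."""
    with open(path, "w") as handle:
        json.dump(policy, handle, indent=2)
```

Keeping the file in version control means the same guardrail can be reviewed and re-applied across projects, which is the centralized management the talk argues for.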
For example, there is a policy for custom machine types: if you try to deploy a virtual machine whose machine type is not at an allowed SKU level, it is denied. Let's test that scenario. I'm going to create a virtual machine and pick a very big machine, and let's see whether it gets created. As you can see, there is a compliance violation: this machine cannot be provisioned because it does not meet the compliance requirement that the SKU size or machine type must be one of the allowed types. So let's go back and try to create a new instance. This time I'll leave everything as the default; this is the minimum SKU, which is allowed. As you can see, the deployment now goes through without any issues. This is how you enforce cloud-native policies within GCP, through organization policies. Coming back to the presentation: we have reviewed two use cases, one in Azure and one in GCP, to understand how these things get deployed and how you can use GCP organization policies and Azure Policy to block deployments that do not meet your compliance or security requirements. So now you understand what cloud-native security is and how to enforce it. Within AWS, the equivalent concept is service control policies (SCPs), which you configure as part of AWS Organizations. Moving on to the next one: event-driven security automation. Think of this as the third pillar we talked about, for when a misconfiguration is detected.
It is analyzed immediately based on the activity logs, and then an automated response, such as a function, is triggered so that the correction happens automatically. Then a validation runs to ensure the misconfiguration was corrected and no longer exists. What we need to understand here is the MTTR (mean time to remediate): if you detect a misconfiguration, how soon do you want to correct it? Can you wait a day, or do you want it fixed within 10 to 15 minutes? This is where we need to balance cost and security, because some of these tools come with costs, especially when you read events in real time and deploy functions to act on them. As the company scales, the events grow, the automation grows with them, and so does the cost. So balancing MTTR is key when enabling an event-driven automation solution.
From a compliance monitoring perspective, real-time monitoring is really important. At the same time, mapping all those policies to a specific compliance framework is important, because it gives you an understanding of the risk of staying noncompliant: what the threat is if you don't apply configuration X, Y, or Z, and whether that threat could be exploited. That's how you prioritize based on risk. Anomaly detection is also a key part of this, because even if you are doing everything right, there might be an insider or a similar pattern: activity that deviates from normal behavior. A baseline of normal activity helps you categorize these deviations and understand your risk.
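The detect, remediate, validate loop and the MTTR trade-off can be sketched in a few lines. Everything cloud-specific here is an assumption: the event shape, the in-memory `inventory` standing in for the provider's API, and the `BASELINE` settings are hypothetical. In a real deployment the handler would be a cloud function triggered from the activity log, and remediation would call the provider's SDK.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Optional

# Hypothetical baseline: the settings every storage account must have.
BASELINE = {"minimumTlsVersion": "TLS1_2", "allowBlobPublicAccess": False}


@dataclass
class Finding:
    resource_id: str
    setting: str
    observed: object
    detected_at: datetime
    remediated_at: Optional[datetime] = None


def handle_event(
    event: dict, inventory: Dict[str, dict], clock: Callable[[], datetime]
) -> List[Finding]:
    """Detect drift in one change event, remediate it, then re-read state to validate."""
    resource_id = event["resource_id"]
    state = inventory.setdefault(resource_id, {})
    findings = []
    for setting, expected in BASELINE.items():
        observed = event["properties"].get(setting)
        if observed == expected:
            continue
        finding = Finding(resource_id, setting, observed, detected_at=clock())
        state[setting] = expected  # remediation: stands in for an SDK update call
        if inventory[resource_id].get(setting) == expected:  # validation: re-read state
            finding.remediated_at = clock()
        findings.append(finding)
    return findings


def mean_time_to_remediate(findings: List[Finding]) -> timedelta:
    """Average detection-to-remediation time over the findings that were fixed."""
    durations = [f.remediated_at - f.detected_at for f in findings if f.remediated_at]
    if not durations:
        raise ValueError("no remediated findings")
    return sum(durations, timedelta()) / len(durations)
```

Because every finding records when it was detected and when it was fixed, the same data yields the MTTR you can weigh against the cost of running the automation.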
Some of these solutions now come with AI-driven security insights, a kind of predictive threat intelligence: based on your workloads, if something deviates, it notifies you. It can also flag suspicious behavior, whether by a person or by a machine identity. For example, it can identify permission creep by comparing the permissions an identity has with what it is actually doing. All of this is key to creating a baseline, so that when you monitor, you can detect misconfigurations and better understand behavior. Some of these tools come with AI insights that look at these patterns, relying on behavioral analytics and automated vulnerability prioritization. For example, if you want to prioritize which vulnerabilities get patched first, these insights help by telling you whether something is a public-facing or an internal application, and whether it is being exploited. Some of the scoring models out there also give you insight into what these scores mean and which issues to fix first. Understanding the roadmap is really important for implementation: when you pick a CSPM tool of your choice, you need to understand where exactly you are deploying it. Is it a single cloud, multicloud, or hybrid cloud? In the initial phase, you need to ensure the tool properly supports your environment and is properly configured, and then you need properly defined policies.
Then you will have to start implementing the CSPM tools. Some of these tools come with integrations from the IaC side through to the CSPM side. On the IaC side, you need to ensure the tool is properly integrated into your CI/CD pipelines and supports all the orchestration tools currently running there, so that it detects these patterns early and blocks them. On the runtime side of security, some of these tools can integrate as well. Pillar two is what we have seen in the demos, and that's where some of these tools don't have direct integration coverage, so that's something you'll have to implement yourself. Continuous optimization is also crucial. For example, policy naming and policy enforcement should be consistent across all these different tools, from IaC to runtime to monitoring, to ensure there is proper integration in place.
Exception management is also key. We have security requirements, but sometimes, for business reasons, we are unable to meet them. In that case, a business owner accepts the risk, and a temporary exception is put in place while a permanent solution is identified, or because the security benefit of the control is small even when it is enabled. That's how security exceptions work. All of this should be driven by a risk-based approach: follow risk assessment strategies, understand the scope, and watch for scope creep. What we have seen is that some exceptions start with scope A and then creep to A, B, and C, so we need to make sure all exceptions are monitored the way they are supposed to be.
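A risk-accepted exception is easier to audit when it is recorded as data with an owner, an explicit scope, and an expiry date. The sketch below models that and flags the two problems described above: exceptions that outlive their temporary window, and scope creep beyond the resources that were actually approved. The field names, policy names, and resource IDs are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class PolicyException:
    policy: str                     # policy the exception waives
    approved_scope: FrozenSet[str]  # resource IDs the risk owner accepted
    risk_owner: str                 # business owner who accepted the risk
    justification: str
    expires: date                   # temporary by design: renew or retire it


def exception_applies(exc: PolicyException, policy: str, resource_id: str, today: date) -> bool:
    """True only for the waived policy, an approved resource, and an unexpired exception."""
    return exc.policy == policy and resource_id in exc.approved_scope and today <= exc.expires


def scope_creep(exc: PolicyException, resources_relying_on_it: Iterable[str]) -> Set[str]:
    """Resources that rely on the exception but were never part of the approved scope."""
    return set(resources_relying_on_it) - exc.approved_scope
```

Scope A creeping to A, B, and C shows up as a non-empty `scope_creep` result, which is the signal to either expand the approval formally or remediate the extra resources.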
All approvals should also be documented for auditing and compliance requirements. Key takeaways from this presentation: we have seen the complete journey, starting with shift left, which we demoed previously; the video for that is in the link on the slide, so please take a look. Today we have seen how to leverage cloud-native security controls, how those controls are built, how to embrace automation to deploy them, and why you need AI-driven insights to prioritize. At some point in the future, we will do a deep dive into exception management, because it is a growing area of interest in cybersecurity where we see scope creep, and some emerging platforms can help automate these tasks so that you can manage exceptions at scale. That concludes my talk. Thank you very much for your time. If you would like to connect or chat, please scan the QR code; I look forward to connecting with you. Thank you all for your time today.

Santosh Bompally

Cloud Security Engineering Team Lead @ Humana



