Conf42 DevSecOps 2022 - Online

Staring into the data abyss: Achieving a higher level of cloud security so you can sleep better

Public buckets, missing encryption, databases open to the internet, and an endless stream of misconfiguration tickets all keep us awake at night. How can we put an end to this? The answer lies in knowing where your data is located, what types you have, and how that data is protected. These insights make it easier for us to know what’s a priority and what can be put on the backlog.

Chris Webber, Engineering Director, IT and Operations at Open Raven, will cover the following:

  • Cloud security posture is just the beginning
  • The data lifecycle - leveraging the cloud to automate policy
  • Essential data context attributes for security
  • Three things you can do right now to improve your data security


  • Chris Webber: There is just so much data in the cloud. He says it's important to think through all of the breaches in these environments. Webber: We have to think about how we protect these environments a bit better.
  • The first thing we're going to do is add ourselves some security tooling. CSPM is cloud security posture management. We'll apply the AWS CIS benchmark policy. We all have this data from customers that we need to protect.
  • Security of S3 buckets in the AWS CIS benchmark policy: ensure all S3 buckets employ encryption at rest; ensure the S3 bucket policy is set to deny HTTP requests. What can we focus on to really drill into?
  • You can use tools like Open Raven or Amazon Macie to go and classify the data that's inside the buckets. It becomes a lot more critical to understand who can write to the bucket. When you understand how data can get into the bucket, you can start from there.
  • Don't protect data that doesn't need protecting. If it isn't there, you don't have to do anything with it. Use intelligent tiering to save money. The next thing is applying lifecycle rules. This ties really closely with using data retention rules.
  • Manage your riskiest buckets first. Look for broad write permissions. Focus on large numbers of small files. Classify your data; go figure out what you've got. Ultimately, that's really going to be what is the game changer for you going forward.


This transcript was autogenerated. To make changes, submit a PR.
Hey, all. My name is Chris Webber, and I am the director of product and IT operations at Open Raven. We're going to talk a little bit today about staring into the data abyss and how we can achieve a higher level of cloud security, mostly so you can sleep better at night, because I think that's really the critical piece: being able to sleep at night without having to think about all these crazy things. Because the reality that a lot of us are dealing with at this point is that there is just so much data, and I mean so much data. It's really easy to see this when we look at something like S3, and using that as our starting point, it's really interesting to reason about just how much is actually out there. So, first off, think about the number of AWS accounts you have. Even in my small organization, we've got over 30 accounts, and each one of those ends up with a bucket per region just for AWS's Config service. Add to that things like CloudTrail, CloudFormation, all that sort of stuff, and we haven't even started talking about your actual data and the applications that write it and how all of that plays together. So it's really incredible just how much data ends up in the cloud, if you will. And I think it's worthwhile taking a step back. In the old days, for those of us that have been around a little while, the data kind of protected itself in some way, right? You had to get into the system before you could get access to the NetApp filer or the big EMC boxes, because you had to actually have access to those systems, whether they were served via NFS or via some Fibre Channel loop. And not just that, but the data could only grow so big, because in those environments it really was a lot more about what you could afford: you could only afford to buy so many shelves, you could only afford to add so many controllers, because it was so expensive.
And not to mention the upper limits of those systems, right? You could only get so much space on a given filer. And I think this is where it becomes really interesting and important to think about how much the world has actually changed, because all of this data and the unlimited ability to write wouldn't be so bad, except for all those darn breaches. And I think when I look at S3 in particular, and we can talk a little bit about RDS as well, or your Elasticsearch servers, or pick a different cloud: if we're talking about Google Cloud Storage or BigQuery or any of those sorts of things, you have similar sets of problems. But at the end of the day, it really boils down to this: we have to think about how we protect these environments a bit better. And I don't want to belabor the point, but I think it's important to really think through all of the breaches in these environments. Corey Quinn of the Duckbill Group, as part of the Last Week in AWS newsletter, regularly calls out a "bucket negligence award." And it's really interesting to me to think through just how much data gets exposed in some of these larger breaches. The crazy part here is that the three we're showing here are simply the first, second, and third that I found in my inbox. There's nothing particularly interesting about any one of these three breaches, except that it's personal data, it's customer data. And even more so than that, when you look at things like the breast cancer organization (say that ten times fast), it was personal images, things that really matter, because we need to protect folks. I don't want to go in and shame any particular organization, because we all have this as a potential, right? We all have this data from customers that we need, and have a responsibility, to protect. So let's do that, right?
The reality is that we're going to take that seriously. So the first thing we're going to do is add ourselves some security tooling. I think the starting point here is a CSPM tool. If this was a live studio audience, I'd ask you all to raise your hands as to who knows what a CSPM tool is. But since I can't do that, I'm going to go ahead and define it, so that we all make sure we're using the same meaning for the same acronyms. CSPM is cloud security posture management. In a nutshell, you apply policies and you get alerts when things have incorrect configuration, or configuration that's not secure by some definition. So we install this tool, and you know what? We're going to lean on people that should know these things better than us: we're going to apply the AWS CIS benchmark policy. For those that aren't aware, CIS is the Center for Internet Security, and they do a fantastic job putting together a set of benchmarks. We all feel good, right? We're going to know all about our environment. And it plays out really well, because we come back into our CSPM tomorrow, once all the policies have run, and then we find ourselves in the abyss. So let's talk about this a little bit, because anybody that's done this before knows where I'm headed. There are five controls in section 2.1, which deals with the security of S3 buckets in the AWS CIS benchmark policy. And I'm only going to deal with the automated ones, because these are the ones that any CSPM tool is going to actually evaluate against. So let's look at these. First off: ensure all S3 buckets employ encryption at rest. This makes sense, right? Until you realize that there are lots of places where you wouldn't necessarily want to use encryption at rest. For example, things that are intentionally made public, or, my favorite, things that have heavy read loads.
Let's just say I got to know the CFO really well after some mistakes made with Athena and KMS and the cost of reading from Athena. There's a great story there, so catch up with me afterwards to dig into that. But I digress. 2.1.2: ensure the S3 bucket policy is set to deny HTTP requests. This is really a way of prohibiting what could effectively be anonymous calls, right? If you're coming in via plain HTTP, that likely means you're not making an authenticated request, and that's what this control is trying to prevent. But there are lots of reasons you might want HTTP turned on: we may want to serve up images, we might want to serve up things that come in directly over the various protocols, like CloudFormation, that sort of thing. So there are lots of legitimate reasons why that may be a thing. Next: ensure MFA Delete is enabled. So, pro tip: if you use MFA Delete, you are going to need to use the root account to delete anything that has MFA Delete turned on. This seems really good in theory, but in practice it is absolutely terrible. I don't have to explain to this group, I don't think, why you shouldn't be logging in as the root user, and any security policy that requires accessing the account as the root user likely has some concerns. And then finally, block public access. Well, first off, AWS, by default, when it creates buckets for you, doesn't tick this box, and it gets really interesting when that plays out. So think a little bit about that. Here's the reality: based on what we just talked about, 95% to 100% of your buckets are going to flag. They are absolutely going to show up as being problems, as being in violation of that security policy. And when you get to a point where 95% to 100% of a given asset fails by default, those checks are kind of useless. It really is hard to imagine a world in which it makes sense that everything is in violation of that policy.
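The deny-HTTP control is typically implemented with a bucket policy conditioned on `aws:SecureTransport`. A minimal sketch of what that policy looks like, built as a plain Python dict (the bucket name is a placeholder):

```python
def deny_insecure_transport_policy(bucket: str) -> dict:
    """Build a CIS 2.1.2-style bucket policy that denies any request
    arriving over plain HTTP (aws:SecureTransport == "false")."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Apply to both the bucket itself and every object in it.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }

policy = deny_insecure_transport_policy("example-bucket")
print(policy["Statement"][0]["Effect"])  # Deny
```

A policy like this would then be attached with `put_bucket_policy`; the point of the sketch is just that the deny is scoped by transport, not by principal, so it blocks anonymous HTTP reads and unencrypted authenticated calls alike.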
And I think, for me, what's really critical here is that I now have no ability to prioritize what's bad, because it's all bad, right? The sky is falling. Well, which part of the sky do I even care about at this point? So the real question becomes: what can we focus on to really drill into, and how can we think a little bit differently about what data we need to know and what information we need to be aware of to succeed in this arena? We'll start with: where did the data come from? There's a bit of history around this first point, and I want to call it out because a lot of folks aren't aware of it. Since I was talking about EMC and NetApp, you can probably tell that I'm a little on the older side and have been around the block a couple of times. Back in the day, AWS limited how many S3 buckets you could have in a given account, and one of the workarounds was to store things that were loosely affiliated, but not necessarily the same data, in a single bucket. So you might have your images, or static assets if you will, in one prefix, maybe some customer data in another prefix, and then maybe some separate application data in another prefix, because you only got so many buckets; the limit was in the 100-bucket range. That limit has since been lifted. It was at one time a hard limit, one you couldn't actually get raised unless you were super special, and that's not the case anymore, which is fantastic. But those buckets still exist, those applications still write to those places, and it's still a thing. Next: what region is it in? I think it's really important to reason about the regionality of the data, because a lot of times it doesn't necessarily matter whether it's protected.
You can have stuff that's completely, properly protected and still be in violation of compliance requirements, because you've got data that shouldn't be in that region. Not to mention, from my perspective, it's really interesting: we've got a map at Open Raven where you can look at your infrastructure, and one of the first things that catches a lot of customers' eyes, and why I'm a super big fan of it, is you look at it and go, wait, why do I have stuff in ap-southeast-1? I shouldn't have anything there. And then sometimes it's, oh, we turned on AWS Config and it put a bucket there. Fantastic. Or you hover over and look at the buckets and go, yeah, that shouldn't be there at all; we need to go take care of that. So I think that's a really valuable tool. Next: what apps actually write into this bucket? I'll talk about the write piece a little bit later, but it's about understanding what apps send data to that bucket and keeping that in mind. The other thing is: is everything coming from automated processes, or is data being manually uploaded too? That becomes a really interesting question, because when you look at some of the breaches, it's not uncommon that a backup got uploaded to the wrong spot, or to a place someone thought was safe but wasn't, because they were uploading it manually and the controls that exist on the application side weren't there. And I think it's really critical to look at that and reason through: is it normal for this bucket to receive manual uploads, and could someone accidentally upload the wrong thing? From there, we really want to talk about what kind of data is in the bucket. This seems really straightforward, and you can take a bunch of different approaches to figure out what's there. If there's protected health information, if there's personally identifiable information, you should know. Hopefully you're going to want to know if it's there.
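The "why do I have stuff in ap-southeast-1?" check reduces to comparing a bucket-to-region inventory against the regions you expect to use. A toy sketch, with made-up bucket names and an assumed allow-list:

```python
def buckets_outside_allowed_regions(bucket_regions: dict, allowed: set) -> list:
    """Return (sorted) bucket names whose region is not in the allow-list."""
    return sorted(b for b, region in bucket_regions.items() if region not in allowed)

# Hypothetical inventory, e.g. gathered from get_bucket_location calls.
inventory = {
    "app-assets": "us-east-1",
    "config-archive": "ap-southeast-1",  # surprise bucket from enabling AWS Config
    "customer-exports": "us-west-2",
}
print(buckets_outside_allowed_regions(inventory, {"us-east-1", "us-west-2"}))
# ['config-archive']
```

Even a simple report like this surfaces the buckets that services created on your behalf, which are exactly the ones nobody is watching.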
And on one hand, we can absolutely go talk to each individual person, but if you are in a large organization, that probably won't work super well. So you can use tools like Open Raven or Amazon Macie to go and classify the data that's inside the buckets. On the Open Raven side, you can do this with your RDS instances as well, and we're looking to expand beyond just S3; we've got a bunch of stuff coming down the pipe, and it's going to be exciting. But you need to know what kind of data is there. The next one always makes me laugh a little bit, because the first place we always jump to is: who owns it? And this would be amazing to know. I would love to know who owns the data. The problem is, and I want to call it out here, it's a great thing to know, but the reality is that you're probably not going to know. It's going to be hard to track down who owns it, and just because someone owns it doesn't necessarily mean they have control, or any semblance of understanding, of what's actually going into the bucket. I think it becomes a lot more critical to understand who can write to the bucket. When you understand how data can get into the bucket, you can start from there. So even if one team owns the data in that bucket, can applications owned by other teams write into it and have it get accidentally used? Are there other opportunities for people to, once again, manually upload into it? You can use tools like Open Raven; we've got a feature coming out, API-only now but available in our UI soon, where you can actually go in and ask: which security principals have the ability to write into this bucket? You can use tools like Ermetic as well, which does a bunch of things around IAM to better understand who can read and write to a bucket. But I think it's so common for us to focus on who can read from it.
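The "who can write" question can be approximated, for the bucket-policy portion at least, by scanning policy statements for broad write grants. This is an illustrative sketch of the idea, not Open Raven's actual feature; real IAM analysis also has to consider identity policies, ACLs, and conditions:

```python
# Actions that effectively grant object writes.
WRITE_ACTIONS = {"s3:putobject", "s3:*", "*"}

def broad_write_statements(policy: dict) -> list:
    """Return Sids of Allow statements that grant write access to everyone."""
    flagged = []
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):  # Action may be a string or a list
            actions = [actions]
        grants_write = any(a.lower() in WRITE_ACTIONS for a in actions)
        principal = stmt.get("Principal")
        open_to_all = principal == "*" or principal == {"AWS": "*"}
        if grants_write and open_to_all:
            flagged.append(stmt.get("Sid", "<no sid>"))
    return flagged

# Hypothetical risky policy: anyone on the internet can upload objects.
risky = {
    "Statement": [
        {"Sid": "PublicUpload", "Effect": "Allow", "Principal": "*",
         "Action": "s3:PutObject", "Resource": "arn:aws:s3:::example/*"}
    ]
}
print(broad_write_statements(risky))  # ['PublicUpload']
```

Starting from write paths rather than read paths is the point: a bucket everyone can write to is a bucket whose contents you cannot reason about.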
I think the starting point should be who can write to it, because that's where you can actually start to identify where your actual risk is. So I've talked about a lot of the "what." We want to know all of those things, and I think it's really critical to think about where we can start and how we really enable teams to start taking next steps. So the first thing is: don't protect data that doesn't need protecting. If it isn't there, you don't have to do anything with it. So I really want to call out a couple of things. First off, use intelligent tiering. This is going to sound silly, but it gives you the ability to get an alert about the state of the world that isn't directly tied to all the security tooling. If you're using intelligent tiering and all of a sudden a bunch of stuff starts being accessed and changing tiers so that your costs go up, you're going to see that. And the reality is that we're all watching cost a heck of a lot more than a lot of the security tooling, because only the security team is looking at the security tools; cost is being looked at by everyone. And so, as a result, we can use things like intelligent tiering to save money, because things shouldn't be being accessed all the time, and it gives us the ability to see those anomalies in the system. The next thing is applying lifecycle rules, and this ties really closely with using data retention rules. Lifecycle rules are the technical implementation: I go into the S3 bucket and say, hey, after some period of time, delete this thing. Data retention rules are the business side of that: hey, we're dealing with healthcare data, so it must be kept for 24 months, five years, whatever it happens to be. But at five years and one day, we can get rid of it, and we should get rid of it.
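That retention-rule-to-lifecycle-rule mapping can be sketched as an S3 lifecycle configuration. The 30-day transition and the five-year retention period here are hypothetical numbers standing in for whatever the business policy says:

```python
def retention_lifecycle_rule(retention_days: int) -> dict:
    """Build a lifecycle configuration that moves objects to
    Intelligent-Tiering after 30 days and deletes them once the
    business retention period has elapsed."""
    return {
        "Rules": [
            {
                "ID": f"expire-after-{retention_days}-days",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"}
                ],
                "Expiration": {"Days": retention_days},
            }
        ]
    }

# Five-year retention: 5 * 365 = 1825 days.
rule = retention_lifecycle_rule(5 * 365)
print(rule["Rules"][0]["Expiration"]["Days"])  # 1825
```

A configuration shaped like this would be applied with `put_bucket_lifecycle_configuration`; the design point is that deletion is automated by the platform, so the data you no longer need stops being data you have to protect.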
And so the real key becomes: can you use something like lifecycle rules on those S3 buckets to remove that data so that you don't end up having to protect it going forward? There are also some great conversations about having data that you don't need and how it plays into legal things like discovery and whatnot. That's a little broader than this talk goes into, but more than anything, there's no reason to protect things that don't need to exist. So get rid of it, so that you're not protecting things unnecessarily. Next: manage your riskiest buckets first. I think it goes without saying that public buckets are, by definition, going to be the riskiest. The problem is that we normally stop there in our conversations. It's a good starting point, but also look for a couple of other things. Look for broad write permissions. If you can find and track down places where everybody and their brother is able to write into an S3 bucket, you've probably got a problem, because it's much easier for something to be exposed there than it would be if only two or three applications, and no human users, are able to write into that bucket. So that becomes a really important thing. And then, one of the things that we found in our environment is that, backups aside, one of the real indicators that you've got actual, legit data somewhere is lots and lots of small files, whether it's lots of images being uploaded from customers, whether it's JSON, that sort of thing. A large number of small files tends to indicate that there's some automated process putting data in there, and that's a really good place to start, because it's actual data coming from customers and not just a dump of some source code archive out of npm or something like that.
We see all sorts of fun things, but I think really the biggest thing is: focus on those large numbers of small files as a good place to start and hone in on managing for risk. Ultimately, I think for me the biggest thing is, and yes, I get it, I work for Open Raven, and there's a reason why I do: I believe that data classification, being able to understand what is actually out in the world, matters. It's so critical to be able to go out and say, this is what's in that bucket. And you can start really simply. Whether it's using Open Raven or using Macie, go do some scans, understand what you've got out there, and from there, run those scans regularly, making sure that you are actually checking for things. One of the cool things we do at Open Raven is cache the results: if the file hasn't changed, because the ETag hasn't changed, we're not going to rescan that object in S3, because we know it hasn't changed. And then, more than anything, you need to have rules in place to alert on the things that are actually critical. You want to know if you find European data in a US region. You want to know if you find PII in a bucket that's open. And that's the real critical differentiator, right? It's not that you found PII, it's not that you have an open bucket, it's that you have PII in an open bucket. It's those sorts of things that really provide the value. So, to summarize, I think the real key is these three things. Turn on intelligent tiering; it will get more eyes on the problem, because if costs bump heavily, you'll know that data that shouldn't be accessed is being accessed. Classify your data; go figure out what you've got. And then use those retention policies and lifecycle policies to delete the stuff you don't need. Ultimately, that's really going to be the game changer for you going forward.
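The "PII in an open bucket" differentiator is a rule over two signals: what the classification scan found, and how the bucket is exposed. A toy version of that alerting logic, with invented finding records:

```python
def critical_findings(findings: list) -> list:
    """Alert only when data class and exposure combine into real risk:
    PII in a public bucket, or EU personal data stored in a US region."""
    alerts = []
    for f in findings:
        if f["data_class"] == "PII" and f["bucket_public"]:
            alerts.append((f["bucket"], "PII in an open bucket"))
        if f["data_class"] == "EU personal data" and f["region"].startswith("us-"):
            alerts.append((f["bucket"], "European data in a US region"))
    return alerts

# Hypothetical scan output: one bucket is public AND holds PII; the other
# is public but holds nothing sensitive, so it stays off the alert list.
findings = [
    {"bucket": "exports", "data_class": "PII", "bucket_public": True, "region": "us-east-1"},
    {"bucket": "logs", "data_class": "none", "bucket_public": True, "region": "us-east-1"},
]
print(critical_findings(findings))  # [('exports', 'PII in an open bucket')]
```

The open-but-empty "logs" bucket producing no alert is the whole point: neither signal alone is a priority, the combination is.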
So with all of these things said, I want to thank you for joining me for my talk. I can be found on the interwebs: you can find me on Twitter, hit me up via email, and I'm trying the new Mastodon thing; we'll see how that plays out. But I hope you've enjoyed this talk, and I'm looking forward to catching up with you in Discord.

Chris Webber

Engineering Director, IT and Operations @ Open Raven

