Conf42 Chaos Engineering 2022 - Online

Defining Steady States and Developing Hypotheses for Security Chaos Engineering

Abstract

This talk explores how teams can use existing security frameworks and benchmarks to define a Steady State and develop hypotheses for Security Chaos Engineering.

Summary

  • The goal of this talk is to explore a concept that would help bring security chaos engineering into broader cybersecurity practices.
  • Sakshyam Shah is a developer relations engineer at Teleport. He has eight years of experience exclusively in cybersecurity. He loves talking about new technologies and startups in general. If you have any questions or want to connect with him, feel free to ping him on Twitter or LinkedIn.
  • Security chaos engineering is chaos engineering applied to validate security implementations and test for resiliency. Despite all the effort, data breaches are not going to end anytime soon. Security chaos engineering involves experimentation to validate assumptions and find the unknowns.
  • The problem with testing these types of assumptions in security chaos engineering is: how can you define and measure a steady state appropriately? There are many ways to bypass cross-site scripting detection. Can we adapt security chaos testing to speed up the compliance process?
  • Basing security chaos engineering on security benchmarks, frameworks, and best practices allows us to align it with the existing security processes in your organization. Changes happen as the team and the requirements grow.
  • Security chaos experiments are a novel way to find unknowns in security, but the experiments should be aligned with existing security processes to gain adoption. Security chaos testing is more effective as organizations move closer to the highest level of a security maturity model.
  • I would like to thank again the organizers for giving me an opportunity to speak at this conference. If you have any questions feel free to ping me. Have a great day.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Welcome to my talk. I would like to thank the organizers for giving me an opportunity to speak at this conference. I will be speaking about defining steady states and developing hypotheses for security chaos engineering. The primary focus will be to discuss the challenges in moving chaos experiments beyond regular testing of downtime and resiliency. I'll also be exploring how these challenges can be addressed by utilizing existing security benchmarks and frameworks. What I hope to do by the end of this talk is explore a concept that would help the adoption of security chaos engineering in broader cybersecurity practices.
A little introduction about myself: I am Sakshyam Shah, and I currently work as a developer relations engineer at Teleport. For those of you who don't know, Teleport provides passwordless access to infrastructure. It's one of the easiest yet most secure ways to access SSH servers, Windows servers, Kubernetes clusters, databases, and web applications across all environments. Before Teleport, I have eight years of experience exclusively in cybersecurity, doing both offensive and defensive work. Besides cybersecurity, I love talking about new technologies and startups in general. So if you have any questions or you want to connect with me, feel free to ping me on either Twitter or LinkedIn. Happy to chat with you.
So let's begin by explaining what security chaos engineering is. Please bear with me on this. This is a chaos engineering conference, and I can bet that there are speakers who can explain security chaos engineering better than me. But let me set the stage for the topic.
Security chaos engineering is chaos engineering applied to validate security implementations and test for resiliency. To understand security chaos engineering, you first have to understand what led to chaos engineering itself. In a typical software development and delivery process, developers write unit tests, integration tests, and end-to-end tests, and the applications are deployed in production. But despite all those tests, despite 100% test coverage, applications are bound to fail, crash, or face downtime. The reason for downtime can be related directly to the application itself or to the many other dependencies that go into production. That's where chaos engineering comes in and says: despite the fact that you tested all of this, the application services are still crashing in production, so let's try to find those unknowns well before they become an incident in production. Chaos engineering uses experimentation to validate the assumption that the tests that were written are actually working as expected.
Security chaos engineering works in the same way. Organizations have been practicing security for a very long time. They are buying security products, next-gen firewalls and whatnot. They have been implementing both reactive and proactive security practices. They have been doing vulnerability assessments, pen testing, and red teaming. They are raising awareness among their end users, internal employees, and even developers about secure software development practices. Yet despite all these efforts, data breaches are not going to end anytime soon. Security chaos engineering comes in and says: despite all the effort that we put into security, the rate at which organizations are compromised is still growing, and it's not stopping anytime soon.
So let's try to find those unknowns that can escalate to a data breach. Let's try to find them as early as possible so that they are prevented in the future. Security chaos engineering involves experimentation to validate assumptions and find the unknowns; basically, it tests the effectiveness of all the controls that have already been implemented. The way I see it, in short, security chaos engineering is a test to validate all the tests.
So how is a chaos experiment conducted? A typical chaos experiment starts by defining a steady state, which means measuring how the application behaves in the normal case. Then a hypothesis is developed with a what-if scenario to create an experimental process, and some chaotic variables are introduced that try to affect the steady state of the application. At the end we measure the results and validate whether the assumptions are still correct.
For example, let's say that we have an application service responsible for handling 1 million requests per second. That's the steady state of that application. A chaos engineer comes in and asks: what if the caching proxy in front of this application service is down for ten minutes? In that case, would our application service still be able to handle 1 million requests per second? So the chaos engineer deliberately takes down the proxy, collects metrics, and checks whether the application service can really handle 1 million requests per second. If it can't, then the chaos engineer has found an unknown that was thought to be already addressed by developers or reliability engineers. Despite all those efforts, we can verify that the application service will in fact not be able to perform as it is supposed to. So that's a typical example of a chaos experiment.
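The experiment loop described here (steady state, hypothesis, chaotic variable, measurement) can be sketched in a few lines of Python. This is only an illustrative sketch and not tooling from the talk: `ToyService`, `run_experiment`, and the 400,000 requests-per-second figure under failure are hypothetical stand-ins for a real service and real metrics.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    baseline: float
    under_chaos: float
    hypothesis_holds: bool


def run_experiment(
    measure: Callable[[], float],
    inject: Callable[[], None],
    rollback: Callable[[], None],
    threshold: float,
) -> ExperimentResult:
    """Measure the steady state, inject a fault, measure again, always roll back."""
    baseline = measure()
    inject()
    try:
        under_chaos = measure()
    finally:
        rollback()
    return ExperimentResult(baseline, under_chaos, under_chaos >= threshold)


class ToyService:
    """Hypothetical service whose throughput depends on a caching proxy."""

    def __init__(self) -> None:
        self.proxy_up = True

    def requests_per_second(self) -> float:
        # Illustrative numbers only: the cache is assumed to absorb most load.
        return 1_000_000.0 if self.proxy_up else 400_000.0

    def take_proxy_down(self) -> None:
        self.proxy_up = False

    def restore_proxy(self) -> None:
        self.proxy_up = True


service = ToyService()
result = run_experiment(
    measure=service.requests_per_second,
    inject=service.take_proxy_down,
    rollback=service.restore_proxy,
    threshold=1_000_000.0,  # hypothesis: still 1 million requests per second
)
print(result)
```

In this toy run the hypothesis fails, which is the useful outcome: the experiment surfaces an unknown. The `finally` block reflects a design choice worth keeping in any real harness, namely that the fault is rolled back even if the measurement itself raises.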
So how can this be related to security? The first thing, of course, is the obvious one: testing for resiliency. Testing for resiliency is the root, and it relates not only to security chaos engineering but to regular chaos engineering in general. We have lots of expectations for resiliency. As a developer or a site reliability engineer, you always think that the application or service will work as expected in production. But if you take resiliency and relate it to modern infrastructure and the modern application development and deployment process, which is cloud native with many microservices, the resiliency dependencies of a typical application have grown a lot bigger.
For example, I have a picture in this slide which shows the resiliency dependencies of an infrastructure access solution. We take these things for granted and think: what could there be? Just a bastion host, a VPN server, or a modern access proxy that allows the access. But you can see that a typical access control solution in modern infrastructure has many resiliency dependencies. It depends on identity providers, access providers, certificate authorities, multifactor authentication providers, hardware security modules, approval systems, and routers, firewalls, and switches. And these are just the main dependencies that I've listed here. So there are many ways an application can face disruption due to the failure of a dependency it needs to operate its whole feature set.
Assumptions can be like: the services can withstand downtime of this dependency. Or a developer or DevOps engineer comes in and says that a whole new infrastructure service can be running in under five minutes, and we have already tested for it.
Security engineers can come and tell you that in case we find a vulnerability, the security patches can be applied as soon as a patch is available. These are the general assumptions that exist in the infrastructure operations of every organization. With security chaos engineering, we try to validate these assumptions.
For example, in testing for resiliency, one case is surviving downtime due to the outage of a service provider. When a DevOps engineer tells you that the service can withstand downtime of a certain dependency, chaos engineers can come in and say: let's check whether that assumption is correct. They can introduce chaotic variables, which include shutting down containers, virtual machines, and servers, or disconnecting network interfaces. They can also mock a dependency downtime to validate the assumption.
Or let's say there is an assumption that security patches can be applied as soon as a patch is available. Chaos engineering can come in and check whether that assumption is true in any regard. How much time does it actually take for upgrades and patches? Do we even have infrastructure ready for rolling out certain kinds of patches to our fleet of servers? What happens when new pipelines and workflows are introduced? We talk about security, but are we in a position to smoothly roll out credential rotation operations, such as the rotation of passwords, API keys, and certificate authorities, without downtime? Obviously these things are thought out when they are designed. But given the operations inside any organization, changes are bound to happen. There will be drift in configurations. New workflows will have side effects on existing procedures, and they can affect the resiliency assumptions we have already thought out or tested.
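The patching assumption above ("patches can be applied as soon as they are available") becomes testable once it is stated as a time limit. The sketch below is one hypothetical way to do that: the host names, timestamps, and the 24-hour limit are invented for illustration, and `hosts_violating_sla` is not from the talk or from any real tool.

```python
from datetime import datetime, timedelta


def hosts_violating_sla(records, sla, now):
    """Return hosts whose patch took longer than `sla`, or is still pending past it.

    `records` maps host -> (time patch was released, time applied or None).
    """
    violations = []
    for host, (released, applied) in records.items():
        # An unpatched host is measured against the current time.
        effective = applied if applied is not None else now
        if effective - released > sla:
            violations.append(host)
    return sorted(violations)


# Hypothetical rollout data for a single patch.
released = datetime(2022, 3, 1, 9, 0)
records = {
    "web-1": (released, released + timedelta(hours=4)),
    "web-2": (released, released + timedelta(hours=30)),
    "db-1": (released, None),  # never patched
}
now = released + timedelta(hours=48)

violations = hosts_violating_sla(records, timedelta(hours=24), now)
print(violations)  # ['db-1', 'web-2']
```

An empty list would support the assumption for this rollout; any host in the list is a concrete place where the assumed patching speed did not hold.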
So security chaos engineering can come in and test those assumptions.
Now we are venturing more into core cybersecurity topics: testing the effectiveness of security controls. For example, a security administrator comes and says that the web application firewall (WAF) policy will detect cross-site scripting attacks. A SOC engineer says that our SIEM policies are configured to detect adversarial tactics such as credential compromise and lateral movement. And the security engineer of a web application says that in case our application X is compromised, it will still only affect database Y, and the attacker will not be able to pivot beyond that database. These are the generic types of assumptions we have inside our security operations.
The problem with testing these types of assumptions in security chaos engineering is: how can you define and measure a steady state appropriately? Take the administrator who says the WAF policy will detect cross-site scripting attacks. It's really hard to take metrics and measurements in terms of security. In regular chaos testing, you can measure the example I gave earlier, the handling of 1 million requests per second. You can collect bandwidth metrics, CPU and memory metrics, and metrics related to requests handled per second, and send that data to monitoring dashboards such as Prometheus and Grafana. But it's a lot harder to apply the same concept to security, because you can't quantify security by saying that the security we have applied is 100% secure or 90% secure, and now we want to test a hypothesis and check whether it will reduce the security to 80% or 90%. That's not justifiable when we speak about security. So what does it mean when an administrator says that the policy will detect cross-site scripting attacks?
There are many, many ways to bypass cross-site scripting detection. If you check the existing cross-site scripting bypass cheat sheets, there are many. And given the pace at which the front-end development ecosystem is moving forward, chances are that more bypass techniques will be found in the future. This is never going to stop. So it's hard to claim that the steady state of the application we are testing against is already the best possible one. In a similar way, what should you test? You can test a hypothesis, but again, it's hard to show that the hypothesis tests whether the steady state will decrease to some number that we can measure in a security chaos experiment. And exactly what variables should we introduce? Will they be enough? These are the generic problems you will face when trying to test the effectiveness of security controls.
Continuing on that: when chaos experiments try to directly tackle the security processes already in place in your organization, the biggest question your team will have is how you can align these experiments with existing security processes. Where does security chaos testing fit? Can we adapt security chaos testing to speed up the compliance process? When do you even start security chaos testing? These are the questions that an aspiring security chaos engineer or a practitioner will face when they venture further into testing existing security processes. This is where the core topic of my talk comes in: how basing security chaos engineering on security benchmarks, frameworks, and best practices allows us to align it with the existing security processes in your organization.
It's a way to help the broader adoption of chaos engineering, because chaos engineering is a good concept and it must be practiced, but unless it is aligned with the existing security processes in your organization, it won't go much further.
So what do I mean by security benchmarks and best practices? There are security baselines such as the CIS Benchmarks, or vendor-specific best practices such as the AWS, Azure, and Google Cloud best practices for identity and access management. You have compliance-specific controls such as PCI DSS, HIPAA, or SOC, whatever compliance your organization is following. Then you have security frameworks such as MITRE ATT&CK, the Cyber Kill Chain, et cetera. These are examples of security baselines, benchmarks, and frameworks that, one way or another, your organization is already practicing or has partly implemented. So how can you introduce chaos testing in those parts?
For example, here is testing the effective implementation of a CIS benchmark. I have taken the benchmark related to access control from CIS v8. The benchmark gives you a list of things to do, and if you pass a security audit, chances are that you will have a green light across all of these items. But the purpose of security chaos testing is to come and check: although we have implemented these benchmarks, have we implemented them correctly? A security auditor will not go that far to validate all these assumptions. The compliance auditor will most probably check only whether you have implemented them or not. This is where security chaos engineering can come in and say that we are going to validate these assumptions based on experiments.
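One way to make a benchmark item measurable is to treat "the control check reports no violations" as the steady state. As an illustration, take a control from the access control benchmark: multifactor authentication (MFA) must be required for administrative access. The sketch below is an assumption-laden toy: the `App` inventory, its fields, and the application names are hypothetical, and a real check would query an identity provider or policy engine instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class App:
    name: str
    admin_access: bool
    mfa_required: bool
    mfa_exempt_by_policy: bool = False  # hypothetical "special policy" override


def mfa_violations(apps):
    """Return names of apps that allow admin access without effective MFA."""
    return [
        app.name
        for app in apps
        if app.admin_access and (not app.mfa_required or app.mfa_exempt_by_policy)
    ]


# Steady state: the control check reports no violations.
inventory = [App("billing", True, True), App("wiki", False, False)]
print(mfa_violations(inventory))  # []

# Chaotic variables: deploy an app without MFA, and one exempted by an override.
inventory_after = inventory + [
    App("new-admin-tool", True, False),
    App("legacy-console", True, True, mfa_exempt_by_policy=True),
]
detected = mfa_violations(inventory_after)
print(detected)  # ['new-admin-tool', 'legacy-console']
```

The experiment's hypothesis is that the organization's real detection reports the same violations after the drift is introduced. If it reports fewer, the control was implemented on paper but not effectively.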
For example, you might have already established an access granting process and an access revoking process, required multifactor authentication for externally exposed applications, for remote network access, and for administrative access, and defined and maintained role-based access control. Security chaos engineering comes in and says: let's validate our assumption that we have implemented the requirement for multifactor authentication for administrative access. The test cases can be: can some other special policy override this requirement? Or, if we deploy a new application that doesn't require multifactor authentication for administrative access, will this be detected? These are the types of test cases that can be introduced.
These types of changes will happen sooner or later, because no organization stays in a static state. Changes happen as the team grows and as the requirements grow. There will be many teams responsible for managing infrastructure and many new teams responsible for developing new applications, whether customer facing or internal. This will affect the current state of the infrastructure, and again, there will be drift in configurations. So these experiments come in and validate the same assumptions repeatedly. It might be that at the current point in time, the security policies will detect a new application that has been deployed without requiring multifactor authentication. But at a later stage of your organization, as the team and the requirements grow, some policies or new workflows might have already disturbed that security policy. Chaos engineering can help you validate those assumptions.
The second thing is that we can also use chaos testing to test for adversarial tactics. That helps to test for breach readiness, withstanding adversarial tactics, threat containment, and validating the blast radius.
These tactics are well cataloged in frameworks such as MITRE ATT&CK, the Cyber Kill Chain, Gartner's cyber attack model, and the NIST Cybersecurity Framework. These are the primary ones that are popular in the security industry. So how can we introduce chaos experiments within these frameworks?
For example, I have taken a sample of adversarial tactics from the Enterprise matrix of MITRE ATT&CK. For this talk I've taken four primary tactics: initial access, credential access, privilege escalation, and lateral movement. These are common tactics presented in the Enterprise matrix. We can then take the techniques under each tactic. For example, under credential access we have "modify authentication process". In a typical security-mature organization, detection for these things would already be in place. As security chaos engineers, our purpose is to validate the assumption that our system is ready to detect them, so we come up with what-if scenarios in which we might miss the detection.
So we take "modify authentication process" as the topic of the security chaos experiment and build our hypothesis. The assumption of the security team is that the SOC team will be alerted if there is a modification in the authentication process. Now, as security chaos engineers, we deliberately change the authentication process from SAML to OAuth, disable the log server for 20 minutes, and then bring it back. The hypothesis here is: the policy is defined to detect a change in the authentication process, but what if the log server is down, or cannot handle requests, for 20 minutes? In the fourth step, we check with the SOC team whether the change was detected and alerted on.
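The log server experiment can be simulated in miniature to show why the outcome depends on how logs are shipped. Everything in this sketch is hypothetical: `LogServer`, `ship`, the event names, and the minute-based clock are simplifications, and a real experiment would run against the actual log pipeline and SIEM.

```python
from dataclasses import dataclass, field


@dataclass
class LogServer:
    """Toy log server that rejects events during an outage window (in minutes)."""
    down_from: int
    down_until: int
    received: list = field(default_factory=list)

    def accept(self, minute: int, event: str) -> bool:
        if self.down_from <= minute < self.down_until:
            return False
        self.received.append(event)
        return True


def ship(events, server, retry_minutes):
    """Deliver each (minute, event), retrying once a minute up to retry_minutes."""
    for minute, event in events:
        for attempt in range(retry_minutes + 1):
            if server.accept(minute + attempt, event):
                break


AUTH_CHANGE = "auth_method_changed:saml->oauth"


def soc_alerted(server):
    # Assumption: the SIEM rule fires whenever the event reaches the log server.
    return AUTH_CHANGE in server.received


events = [(5, "user_login"), (12, AUTH_CHANGE), (40, "user_login")]

no_retry = LogServer(down_from=10, down_until=30)  # 20-minute outage
ship(events, no_retry, retry_minutes=0)

with_retry = LogServer(down_from=10, down_until=30)
ship(events, with_retry, retry_minutes=30)

print(soc_alerted(no_retry), soc_alerted(with_retry))  # False True
```

Without retries the authentication change made during the outage never reaches the server, so the detection rule has nothing to fire on even though the rule itself is correct. That is the kind of unknown the experiment is meant to surface.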
If the SOC team was not alerted, it means that the process responsible for shipping logs is missing a retry mechanism. Maybe it was implemented long ago, maybe it never was, but either way that was an unknown. We found that when the log server is unable to handle requests for 20 minutes, we miss the alerts related to the authentication change, and we might miss an adversarial tactic that is already under way in the network. So these are examples of how chaos experiments can be performed with respect to existing security frameworks.
But again, why should we do that? To summarize: first, it helps you catch misconfigurations or logical flaws that are introduced over time. As I said, drift is bound to happen, and changes in workflows will introduce new side effects in your existing security policies. Second, it helps to automate and close the gap between vulnerability assessment, penetration testing, and incident response drills. Third, it helps to validate the effectiveness of existing security policies and security controls. Even if you don't find any flaws in the assumption, it's a scientific way to say that rather than just ticking a box, we have actually carried out an exercise that lets us justify that the control is indeed implemented. And finally, aligning security chaos experiments with existing security benchmarks and frameworks also aligns security chaos testing with the existing security initiatives in your organization.
That means it will be helpful for getting buy-in from executives or from the security team: we should do chaos testing because it will help us increase our effectiveness, rather than it being just another security concept.
I mentioned earlier that aligning chaos experiments with existing security benchmarks and frameworks would also allow us to close gaps and align with existing security practices. Any typical organization will already be practicing proactive security such as vulnerability assessments, pen testing, and incident response drills. So how do chaos experiments compare with these existing proactive security practices?
For example, in this scenario I have taken the case of a just-in-time (JIT) access request system. In vulnerability scanning, you look for known vulnerabilities in the JIT access granting system itself. In vulnerability research, you look for previously unknown vulnerabilities that might affect the JIT system. In a penetration test, you try to find a way to bypass the JIT access granting process by exploiting a known vulnerability, by developing a novel logical flaw exploit, or via social engineering. And in security chaos testing, you test: what if the JIT system has crashed? Will a downgraded and insecure access request system be activated, and can it be misused to bypass the JIT policies? These are the typical differences between the proactive security practices.
So security chaos engineering brings a unique point of view to validating the assumption: even though policies have been implemented and deployed, will a change, downtime, or any other effect in a dependency lead to a downgrade of security that would let the security control be bypassed?
So when should you introduce security chaos testing?
Here I've taken as a sample the Security and Privacy Capability Maturity Model, also known as SP-CMM for short. It shows how you can measure the maturity of the security and privacy controls implemented in your organization. It's just one such framework; depending on the industry your organization is in, you might be following another security maturity model. Again, this is just an example. So when should you introduce security chaos testing in an organization? Personally, I believe that security chaos experiments work best when they are introduced later in your security process. If you haven't implemented any cybersecurity baselines or benchmarks, and there is no team practicing security, there's no point in conducting experiments and trying to find the unknowns. So first you have to implement the basic controls. Ensure that basic hardening has been done and that the basic benchmarks or frameworks have been followed to tighten security. A security chaos experiment is a way to validate the assumptions that you hold once you have implemented all those security practices. Without that, I think security chaos experiments can still be helpful just to identify gaps that show where you should focus or prioritize your security work. But the experiments will be most effective when you are later in the maturity model of your security process.
So that's my presentation for today. To conclude: security chaos experiments are a novel way to find unknowns in security. But the experiments should be aligned with existing security processes to gain adoption. Otherwise, it's a good concept, but it will have challenges growing beyond just a concept. Defining steady states and developing hypotheses from security baselines, benchmarks, and frameworks helps align security chaos engineering with existing security processes. So the challenge is to align it with existing security processes. To address that challenge, we can bring in security chaos experiments to validate all the security controls that we have already placed in our organization.
Security chaos testing should also close the gaps not addressed by penetration tests and incident response drills. Any mature security organization will already be practicing proactive security, including penetration testing, incident response drills, and vulnerability scanning. Security chaos experiments are not about replacing them but about closing the gaps that are left by these tests. If you think of it that way, there's a spot for security chaos engineering. If your team wants to replace the existing processes, it will be a challenge to change what is already in place and followed by many security teams all over the world.
Finally, security chaos testing is more effective as organizations move closer to the highest level of a security maturity model. Security chaos experiments work best at finding the unknowns when the knowns have been implemented correctly. Without the knowns in place, you'll just be firing experiments all over the place without any good results and without validating any previously implemented security assumptions.
That's it for my talk today. I hope it was helpful for those of you who have started, or are planning to start, venturing into security chaos experiments. I would like to thank the organizers again for giving me an opportunity to speak at this conference. If you have any questions, feel free to ping me. I have provided my social media handles in the earlier slides of this presentation. Thank you so much. Have a great day.
...

Sakshyam Shah

Developer Relations Engineer @ Teleport



