Conf42 Chaos Engineering 2021 - Online

Risk-Driven Fault Injection: Security Chaos Engineering for the Fast & Furious

“The dynamic nature of cloud-native infrastructure requires continuous security mechanisms to effectively detect security threats, especially those with unknown patterns and behavior. This talk proposes Risk-driven Fault Injection (RDFI) techniques to address these challenges. Essentially, RDFI applies the principles of chaos engineering to cloud security and leverages feedback loops to execute, monitor, analyze and plan security fault injection campaigns, based on a knowledge-base.

The knowledge-base consists of fault models designed from secure baselines, cloud security best practices, and observations derived during iterative fault injection campaigns. These observations are helpful for identifying vulnerabilities while verifying the correctness of security attributes (integrity, confidentiality, and availability). Furthermore, RDFI proactively supports risk analysis and security hardening efforts by sharing security information with security mechanisms. We have designed and implemented the RDFI strategies, including various chaos engineering algorithms, as a software tool: CloudStrike.

Several evaluations have been conducted with CloudStrike against infrastructure deployed on two major public cloud platforms: Amazon Web Services and Google Cloud Platform. Time performance increases linearly, proportional to increasing attack rates. Also, the analysis of vulnerabilities detected via security fault injection has been used to harden the security of cloud resources, demonstrating the effectiveness of the security information provided by CloudStrike. Therefore, we opine that our approaches are suitable for overcoming contemporary cloud security issues.”


  • Kennedy Torkura is a cloud security engineer at Mattermost and also a PhD student; his PhD is on cloud security. Today he shares part of his research and some of the approaches he has proposed.
  • Security chaos engineering aims to identify security control failures and defend against malicious conditions. Unlike classic chaos engineering, it addresses availability problems caused by malicious actions, as well as integrity and confidentiality. Today we live in a cloud-native world, and our systems are getting more and more complex.
  • Cloud-native security is about securing cloud-native infrastructure. 99% of attacks against cloud infrastructure are going to be caused by user faults such as misconfiguration. Traditional security mechanisms are struggling to catch up with these recent trends. Security in a cloud-native world has to be set up at every layer.
  • Risk-driven fault injection is about employing security chaos engineering from a risk perspective. Risk helps us measure security, so that we can communicate whatever strategy we are proposing in clearer and more sensible ways.
  • CloudStrike lets you chain various attacks to form an attack scenario. The third stage of the feedback loop is analysis: even if you had to stop an experiment, it's critically important to understand why it failed. The knowledge gained from these experiments can be used for training security teams.


Hello. Good day, good morning, good afternoon, good evening, wherever you are today. Thank you so much for staying up or for coming to my talk today. I'm really excited to share some knowledge about security chaos engineering. Today I'm going to be talking on the subject of risk-driven fault injection: security chaos engineering for the fast and furious. My name is Kennedy Torkura and I am a cloud security engineer at Mattermost. I'm also a PhD student, completing my PhD at the Hasso Plattner Institute, and my PhD is on cloud security. So I'm going to be sharing part of the things that I researched and some of the approaches that I proposed and evaluated as part of my doctoral thesis. Let us start off with a definition of security chaos engineering. We're going to borrow the definition proposed by Aaron Rinehart, the creator of security chaos engineering. He defines it as the identification of security control failures through proactive experimentation to build confidence in the system's ability to defend against malicious conditions in production. There are similarities between the definition of chaos engineering and that of security chaos engineering, but the key differences here are that we are trying to identify security control failures and to defend against malicious conditions. Basically, chaos engineering tries to address availability problems, and that is done by employing resiliency patterns: strategies like timeouts, bulkheads, and circuit breakers are used to inject failures and to identify problems that might affect the availability of services. Security chaos engineering, on the other hand, also addresses availability, but a slightly different kind of availability this time around.
It addresses availability problems that might be caused by malicious actions, for example denial-of-service attacks. Security chaos engineering also looks at integrity and confidentiality: whatever might impact integrity, confidentiality, or availability, which are the three principles of security we usually call the CIA triad. That is done by exercising the existing controls we have been using in cybersecurity: preventive controls, for example mechanisms like firewalls; detective controls, like intrusion detection systems; and corrective controls, for example incident response systems. Security chaos engineering tries to verify that these controls are working the way they are supposed to work in an environment, and if they're not working that way, that is going to be identified. The big picture here is being able to detect security blind spots: spots these systems are not able to see or identify. So what's the importance of applying security chaos engineering in the current dispensation? Today we live in a cloud-native world, and our systems are getting more and more complex. According to Bruce Schneier, in an essay he wrote, "A Plea for Simplicity", the core message was that complexity is the worst enemy of security. And essentially, that is what we are seeing today. Our systems are becoming more and more complex, and this makes it harder for security to even be effective, because security professionals can only defend systems they are able to understand. The better they understand the system, the better chance they have of defending it, or of identifying malicious or insecure events in that system. Another problem we see is the increase of attacks against cloud infrastructure.
A few years back there were already attacks, when we saw hackers penetrating an Amazon Web Services account to spawn virtual machines to mine bitcoins. In doing this, they were able to hide their tracks in a way that their infiltration was not even noticed. Later on, we also saw other kinds of attacks, for example the exploitation of S3 buckets, where attackers successfully gained unauthorized access to S3 buckets and exfiltrated very sensitive information. However, things are actually getting worse. Attackers are becoming much more organized. In a recent report, the Cloud Native Threat Report released by the Aqua Security team, they showed that attacks against cloud-native infrastructure are getting more and more sophisticated. They deployed a set of honeypots in the wild, and based on this they were able to gather attacks as they happened and analyze them. So there are more and more attacks coming up against cloud-native infrastructure. Another problem we see is new kinds of attacks, or let's say new security problems. One of the most common is misconfiguration. According to Gartner, from now up to four years ahead, there is going to be a lot of problems caused by misconfiguration. As you know, I just mentioned S3 buckets, and the major reason why these buckets are being attacked is misconfiguration. And 99% of attacks against cloud infrastructure are going to be caused by user faults: mainly the inability to configure or to deploy cloud assets in a way that is secure. I put the causes into two main reasons.
There is a knowledge gap regarding what is expected from people who are using the cloud, and there is insufficient tooling support to help them deploy their infrastructure properly. We can easily see this. On the screen here are two access control policies for Amazon Web Services IAM. Right here is a policy that is quite large, and most of the time people are expected to pick up these policies and manage and edit them by themselves, by hand, doing it manually. This is a very tangible example of insufficient tooling support for security. We also observe a lot of new trends coming up that are aligned with the digital transformation agenda. We have DevOps, we have CI/CD pipelines, and a lot of people are shifting their workloads to the left. They want to be fast, they want to be agile, they want to make use of new trends and new technologies to their advantage. Unfortunately, this is not easy for security to handle, because the traditional model of security is designed to take care of infrastructure that is more or less static, that doesn't really change. This is how security has been in the last two or even three decades: most of our security systems are designed to protect those kinds of static systems, and the new trends have become a problem for security. Security is quite confused these days; you see that traditional security mechanisms are basically struggling to catch up with these recent trends. So there is a new kid on the block, a new concept coming up called cloud-native security. Essentially, cloud-native security is about securing cloud-native infrastructure, which kind of summarizes the new trends we're seeing these days.
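To make the misconfiguration problem concrete, here is a minimal sketch of the kind of check a tool could run over an AWS IAM-style policy document to flag overly permissive statements. The function name and the set of risky actions are illustrative assumptions, not part of any tool mentioned in this talk:

```python
# Hypothetical sketch: flag overly permissive statements in an
# AWS IAM-style policy document. RISKY_ACTIONS and the function
# name are illustrative, not a real tool's implementation.
RISKY_ACTIONS = {"*", "s3:*", "iam:*"}

def find_risky_statements(policy: dict) -> list:
    """Return Allow statements with wildcard actions or principals."""
    findings = []
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if any(a in RISKY_ACTIONS for a in actions) or stmt.get("Principal") == "*":
            findings.append(stmt)
    return findings

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::example-bucket/*"},
        {"Effect": "Allow", "Action": "s3:*", "Principal": "*",
         "Resource": "*"},  # public wildcard: classic misconfiguration
    ],
}

print(len(find_risky_statements(policy)))  # prints 1: the wildcard statement
```

A check like this is trivial to write, yet many users edit large policy documents entirely by hand, which is exactly the tooling gap described above.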
As I mentioned in the last slide, in order to define what cloud-native security is, which essentially boils down to defense in depth, the Kubernetes security team issued an article (the link is down below). Essentially, security in a cloud-native world has to be set up at every layer of the cloud-native infrastructure. Starting from the inner layer, we have the code layer, which is the most familiar one because we've been writing software for many years, many decades. Security has to be embedded in our code using things like static code analysis or dynamic code analysis. The next layer is the container: we have to scan our containers to detect dependencies that have malicious components and things like that. The next outer layer is the cluster. Now we're talking about orchestrators like Kubernetes, and there's a whole new class of problems emanating from Kubernetes. We have to take care of things using network policies, or be able to detect when processes within the containers are malicious or suspicious. The final layer is the cloud infrastructure, the very platform upon which all the other layers rely. To take care of this cloud infrastructure, we should look at things like the shared responsibility model, have a good understanding of how it works and of our responsibilities, and understand the kind of security efforts that are expected from us. That is a summary of what cloud-native security is, but how does the attacker look at it? We've just talked about four layers of infrastructure. Unfortunately, attackers still look at this as one single target.
So inasmuch as they might need the skills necessary to conduct attacks, they probably need just one toolkit to successfully attack a cloud-native infrastructure. And the attack surface is so wide that the possibilities are endless. As you see here, the attacker can start from virtually any layer, either from the code, from the Docker layer, from the Kubernetes cluster layer, or even from the cloud layer, and literally move across the other layers. We have seen these kinds of attacks. Yet the way our cloud-native security platforms are being designed is to take care of these layers one after the other. We have tooling support today for the code layer, tooling support for the container layer (a lot of security systems are designed to do that these days), we have cluster security platforms, and of course the cloud security platforms. The challenge here is that most of these platforms, these security systems, do not talk to each other; they work independently, and there is really no cross-coordination or shared understanding. So eventually, human operators are still expected to come into this loop and try to make sense of the outputs, the results, the analyses that these individual components produce. Essentially, what is missing is a unifying layer, a unifying strategy that stitches these various components together and makes sense of them. That is where security chaos engineering comes in, and in the next slides I will try to explain how that works. Basically, security chaos engineering, as far as I see it, is going to be a new way for us to put together these various cloud-native security platforms. In this diagram I have put the major categories of cloud-native security.
First, we have cloud security posture management, which looks at the control plane of cloud infrastructure to detect malicious actions, misconfigurations, and the like. We have cloud workload protection platforms, essentially looking at workloads such as Kubernetes, doing vulnerability scanning and things like that. And we also have cloud access security brokers, another kind of security system that tries to understand the interactions between on-premises infrastructure owned by organizations and the cloud platform, and tries to make sure that sensitive data is not handled in ways that expose it, among quite a number of things. So essentially, security chaos engineering, as far as I see it, is going to be that unifying mechanism that brings together these various security components to make sense out of them. So let us talk a little bit about risk-driven fault injection. Essentially, risk-driven fault injection is about employing security chaos engineering from a risk perspective. Why is that important? Firstly, we know that 100% security is a dream. There is no security system that is 100% secure. Problems emanate from various directions: from our employees who may make mistakes, to attackers who evolve new methods, to things like zero-day vulnerabilities that might be exploited by some attackers. I've also spoken with a couple of people who are trying to convey chaos engineering to security teams, and what I sense is that it's a bit difficult, because security is a hard language to explain: it is hard to measure, it is largely abstract. So we can use risk as a method to communicate, to drive chaos engineering to our security engineers and into our culture. We have various kinds of methods for looking at risk, and quantitative risk assessments are really the more attractive ones.
We're going to look at data-driven strategies. Risk helps us measure security, so that we can communicate whatever strategy we are trying to propose in clearer and more sensible ways to management, as well as to other teams in a company. I'm going to walk you through what we refer to as the security chaos engineering feedback loop, which is a method that we think will drive the implementation of security chaos engineering much better and more constructively in an organization. It consists of five parts, and essentially it is a feedback loop, an adaptation of the MAPE-K feedback loop that has been used in autonomic computing systems. The idea here is how we can push security chaos engineering towards becoming an automated system that works behind the scenes and works together with other security systems in a way that hardens security and makes it much, much better. The first part of this loop is execute. Here we have to talk about the aim of the experiment. If you want to conduct a security chaos engineering experiment, you want to clearly define the aim: what do you want to achieve? Based on that, you craft a suitable hypothesis which you are going to test, and then you define the scope and the intensity of the experiments you want to carry out. It's really important to carry out some sort of sanity checks. You're going to coordinate with the responsible teams; you want to convey to them clearly what you aim to achieve. These are administrative aspects, and of course social aspects, which are very important: there is a human side that is largely overlooked. You want to communicate with people, let them understand your mindset and aim, and get their buy-in. Also very important is recoverability. What I mean by recoverability is that if things go wrong, you want to be able to roll back to a state that is good enough. There are various ways of putting this in place. There are infrastructure-as-code strategies where the infrastructure is already in Git, using things like Terraform or AWS CloudFormation, and there is also state management. So if things go wrong or you break things, you can recover without bringing too many problems to the system. I talked about having a scope of what you want to do. We created a tool called CloudStrike, and CloudStrike has different modes of operation. If you're going to inject security faults, they can have different magnitudes of intensity: 30%, 60%, 90%. You have to figure out what the impact will be at different degrees and decide which degree to use based on the maturity of the team or of the infrastructure. You can also have an attack scenario, and I'm going to give an example in the next slide, where we could chain various attacks to form a scenario. This is to simulate how attackers move in real life, because attackers launch a series of attacks to achieve their objective. Here we have a table of the different attacks we use in CloudStrike: the cloud resource we want to attack, the action we want to take, and a brief description. In the first line we have the user: we create a new random user. That is one action, and like I said, you can link three, four, or more of these individual actions to form an attack scenario. And here is an example of an experiment we carried out. We start by creating a user called Bob, we get the buckets and select a random bucket from Amazon Web Services, we create a malicious policy, and we assign Bob access to the bucket using that policy.
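The chained experiment just described (create user Bob, pick a bucket, create a malicious policy, attach it) can be sketched as code. CloudStrike drives real cloud APIs; in this toy version an in-memory fake cloud stands in, so the chaining logic is visible and runnable without credentials. All class and function names here are illustrative assumptions:

```python
# Toy sketch of the chained attack scenario above. The FakeCloud
# class is a stand-in for real cloud APIs; CloudStrike itself
# operates against AWS/GCP, not an in-memory model.
import random

class FakeCloud:
    def __init__(self, buckets):
        self.buckets = buckets
        self.users = {}
        self.policies = {}

    def create_user(self, name):
        self.users[name] = {"policies": []}

    def list_buckets(self):
        return list(self.buckets)

    def create_policy(self, name, document):
        self.policies[name] = document

    def attach_policy(self, user, policy_name):
        self.users[user]["policies"].append(policy_name)

def run_scenario(cloud, user="bob"):
    """Execute the four chained fault-injection actions."""
    cloud.create_user(user)                       # 1. create user Bob
    bucket = random.choice(cloud.list_buckets())  # 2. pick a random bucket
    document = {"Effect": "Allow", "Action": "s3:*",
                "Resource": f"arn:aws:s3:::{bucket}/*"}
    cloud.create_policy("malicious-policy", document)   # 3. malicious policy
    cloud.attach_policy(user, "malicious-policy")       # 4. grant Bob access
    return bucket

cloud = FakeCloud(buckets=["logs", "backups", "assets"])
target = run_scenario(cloud)
print(target, cloud.users["bob"]["policies"])
```

Each numbered step maps to one row of the attack table, and chaining them yields the scenario whose detection is checked in the next stage.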
In this case, we want to see whether our security system, whatever security system we are using in the cloud, maybe a cloud security posture management system or something as simple as CloudTrail, is able to detect these activities. When you created a new user, was it flagged? Did the cloud security mechanisms detect it? How long did it take for a notification to reach you? Similarly, when you create a malicious policy, are you able to detect that a malicious policy was created? The second stage of the feedback loop is monitor. Once you start injecting failures, you want to monitor the progress. It is pretty important that you have either a logging system where you can see the logs in real time, an observability system (and there are a lot of them coming up these days), or even tracing. You want whatever gives you clear visibility into the progress of the attack, because essentially you want to be able to stop the experiment if things begin to go too badly, and, as I said, you want recoverability, which makes it possible to roll back to a good state. The third stage is about analyzing. Even if you had a failure, if you had to stop the experiment, it's critically important to understand why it failed. You get some lessons from what happened: what went wrong, why did the experiment fail? So you can have another trial. And if you succeed, you want to derive answers to the questions you posed at the beginning, in the planning stage. Essentially, what we are talking about from a security perspective is something like the OWASP risk rating methodology.
Since we are proposing a risk-based methodology, it is pretty important for a good analysis to understand exactly the results you got from the experiment. You want to understand the kind of threat agents that might exploit this attack. You want to look at the attack vectors, the vehicles they would use to conduct such an attack. You want to understand exactly the problem, the vulnerability that was detected, because eventually you will have to fix it. You want to understand the security controls that were compromised, and other important things: the technical impact of the attack and, of course, the business impact. We think that if you have this clear understanding, this clear analysis of experimental results, it even becomes much easier to get buy-in from management, for example. The fourth stage of the security chaos engineering feedback loop is planning. You want to plan for the next iteration of your experiments, because the idea is to have a continuous system. In this case, you are going to create things like backlogs for vulnerability management, for whatever teams are responsible for fixing the security problems that were detected. This might mean reaching out to the security operations teams and development teams, and also threat modeling. I think the knowledge gained from security chaos engineering can be used for threat modeling, or for things like security awareness training for teams. You must understand that at the end of a security chaos engineering experiment, you have understood the problems in the system, meaning that you have knowledge about what might happen in the future. That is different from what you get from traditional systems, which try to explain what has happened; here you are trying to explain what might happen in the future. So it's really, really proactive.
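The OWASP risk rating idea mentioned above can be reduced to a small calculation: overall risk is likelihood combined with impact, each averaged from 0-9 factor scores. The factor names below follow the OWASP methodology, but the scores are made-up values for illustration:

```python
# Minimal sketch of the OWASP Risk Rating calculation. Factor names
# follow the OWASP methodology; the scores here are invented examples.

def level(score: float) -> str:
    """Map a 0-9 average to the OWASP LOW/MEDIUM/HIGH bands."""
    if score < 3:
        return "LOW"
    return "MEDIUM" if score < 6 else "HIGH"

likelihood_factors = {      # threat agent + vulnerability factors
    "skill_required": 5, "opportunity": 7,
    "ease_of_discovery": 7, "ease_of_exploit": 5,
}
impact_factors = {          # technical + business impact factors
    "loss_of_confidentiality": 7, "loss_of_integrity": 5,
    "financial_damage": 6, "reputation_damage": 6,
}

likelihood = sum(likelihood_factors.values()) / len(likelihood_factors)
impact = sum(impact_factors.values()) / len(impact_factors)
print(level(likelihood), level(impact))  # prints: HIGH HIGH
```

Rating each experiment's findings this way gives the shared, numeric vocabulary for talking to management and other teams that the risk-driven approach relies on.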
So you want to fix, as I said, the issues you saw, and also construct hypotheses for the next iteration of experiments. The last part, which is a very critical part when talking about automation, is the knowledge base. Every security chaos engineering experiment produces a result. For us, in our tool CloudStrike, every experiment had a report, and that report was put into a sort of knowledge base, which might just be some database where you put in your reports. This gives you access to greater possibilities. For example, you can create CloudWatch rules to trigger alarms for specific events, you can create rules for your cloud security posture management system, you can create rules for the identity and access management analyzer, and you can do a lot more. What we see nowadays is that the concept of the SIEM is getting obsolete, because SIEM systems are beginning to struggle to manage data and to analyze security information properly. We see that it's possible for security chaos engineering results to be put into the so-called security data lake, which is more and more becoming the preferred way of putting together security information so that you can get some sort of intelligence from it. So the reports you get from security chaos engineering can be put into a security data lake, alongside other sources of information: threat intelligence feeds that tell you about things like malicious IP addresses, and the ETL and log analytics systems that push the output of their analysis to this central data lake.
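The full loop described over the last few slides (execute, monitor, analyze, plan, with reports accumulating in a knowledge base, MAPE-K style) can be sketched as a skeleton. Every stage body below is a stub with invented data; only the shape of the loop reflects the talk:

```python
# Sketch of the MAPE-K-style security chaos engineering feedback loop:
# execute -> monitor -> analyze -> plan, with each iteration's report
# stored in a shared knowledge base. All stage bodies are stubs.

knowledge_base = []  # stand-in for the report database / security data lake

def execute(hypothesis):
    # inject the security faults that test the hypothesis
    return {"hypothesis": hypothesis,
            "events": ["CreateUser", "PutBucketPolicy"]}

def monitor(run):
    # watch logs/observability for detections of the injected events
    run["detected"] = ["CreateUser"]  # stub: only one event was flagged
    return run

def analyze(run):
    # any injected event that went undetected is a security blind spot
    run["blind_spots"] = [e for e in run["events"] if e not in run["detected"]]
    return run

def plan(run):
    # feed findings into the backlog and the next iteration's hypothesis
    run["backlog"] = [f"add detection rule for {e}" for e in run["blind_spots"]]
    return run

report = plan(analyze(monitor(execute("malicious policies are detected"))))
knowledge_base.append(report)
print(report["blind_spots"], report["backlog"])
```

Each appended report is what downstream consumers (CloudWatch rules, CSPM rules, the security data lake) would draw on in the automated version of the loop.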
Security chaos engineering results can eventually live in such a security data lake and give users much better and much more contextual information to harden their security systems. I would also like to point you to some of the papers we wrote. In the first set of papers, we used security chaos engineering methods firstly to evaluate a cloud security posture management system, to see if it functions as expected. The other paper focused on incident response, where we tried to see how an incident response system works: whether it works as fast as it should, or whether it is slow. We think these are also very good use cases. There are also two papers we wrote that focus squarely on security chaos engineering. We took a deep dive into this subject and tried to understand, from an academic perspective as well as a practical perspective, the connections with existing literature related to this field of fault injection. We found there is quite a body of existing work on security fault injection, mostly under the canopy of dependability, and we think it's exciting to explore these related works to gain a better understanding of security chaos engineering. Lastly, I want to point out the security chaos engineering book that was released last year. We had a very good opportunity to contribute to this book, and if you are really interested in understanding security chaos engineering, I highly recommend it to you. You can also have a look at our publications for a much better understanding of this field. So this brings me to the end of my talk. Thank you so much for staying along, and feel free to shoot me a mail or reach out in case you want to learn more about what I'm doing. Thank you very much.

Kennedy Torkura

Cyber Security Engineer @ Mattermost

