Security Chaos Engineering: Considerations for Gamedays when the Experiments are Cyberattacks

Abstract

An SRE's main task is to keep operations up and running. An SRE dealing with Kubernetes faces many challenges in keeping resilience at the desired level and improving it over time.

In this talk we will go through techniques to measure and improve resilience of Kubernetes platforms in a Cloud-Native way.

Summary

Yury Nino talks about security chaos engineering, with some considerations for game days when the experiments are cyberattacks. Japanese culture, inspired by the example of the samurai, has always recovered from adversity. If we want to face cyberattacks, we should develop skills such as discipline, precision, attention to detail, and resilience.
Cyberattacks can be defined as unanticipated and catastrophic incidents. They happen in production and can take the system down. Chaos engineering is focused on making systems more reliable and secure. Security incident management can be useful here.
We should build a culture based on security and reliability. Chaos engineering is the discipline of experimenting with failures in production. Security chaos engineering addresses a number of gaps in contemporary security methodologies. We need a new approach to keep pace with the evolving world of attacks.
Chaos game days are based on game days, which are an interactive, team-based learning exercise. The goal is to practice how your team, or your supporting system team, deals with real-world turbulent conditions. Human factors in cybersecurity are perhaps the biggest challenge when building an effective threat prevention strategy.
Verica is a company specialized in security chaos engineering and continuous verification. Continuous verification is a game changer for complex software system management and complex attacks. The adoption of security chaos engineering principles across organizations is an open challenge.
It's really important to consider the human factor when you experiment with security in your systems. Here are some books that can be useful if you want to start with security chaos engineering. Don't fear failure; in great attempts, it is glorious even to fail. One single vulnerability is all an attacker needs.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hi everybody. Thank you very much for being here. The title of my talk is "Security Chaos Engineering", with some considerations for game days when the experiments are cyberattacks. Nice to meet you. I am Yury Nino. I work as a site reliability engineer at ADL Digital Labs, a company in Colombia that provides technology and innovation services for a banking group. I am also a chaos engineering advocate in the Spanish-speaking community. Lastly, I am from Garagoa, a beautiful town located in Colombia.
Although I've never been to Japan, when I think of resilience, the Japanese immediately come to my mind. Regularly pummeled by natural and man-made disasters, the Japanese have always recovered from adversity. In the last 100 years, Japan has faced tragedies such as the Great Kanto earthquake in 1923, two nuclear bombs over Hiroshima and Nagasaki in 1945, and a tsunami in Tohoku in 2011. These adversities have produced a culture based on resilience. People say that it is a consequence of a disciplined, positive, polite, and resilient culture, a culture inspired by the example of the samurai. The samurai cultivated bushido, a code of martial virtues, indifference to pain, and loyalty, engaging in many battles during the twelfth century.
Here are the most famous samurai in history. My apologies for my pronunciation here: Minamoto, Miyamoto, Toyotomi, Honda, and Takeda. If I have to choose one, I would say my favorite is Miyamoto Musashi. His thinking allowed him to improvise with absolute efficiency in any situation of peril, but not without first reflecting on all the variables he faced. For Musashi, no weapon is better than any other. It is important for the warrior to evaluate which is the most appropriate according to the circumstances. Towards the end of his life, he dedicated his time to study and teaching. He wrote The Book of Five Rings, an essential resource for martial arts. You are probably asking why I am speaking about the Japanese and the samurai at an SRE conference.
The reason is that if we want to face cyberattacks, we should develop skills such as discipline, precision, attention to detail, and resilience, like the Japanese. Inspired by Miyamoto's story, I have designed this agenda to address security chaos engineering. We are going to talk about some famous cyberattacks and how to use incident management theory to mitigate them. I am going to introduce a novel discipline known as chaos engineering, and security chaos engineering. With this context, I am going to talk about security chaos game days, which is the main topic. I am going to present a framework that we developed to design and practice security chaos game days. And finally, I am going to share some learnings and challenges on this path.
Recently, I read on Twitter: "Cyber war is everywhere: in the media, in the military, among politicians, and in academia." In my personal opinion, though, it is not a cyber war. Cyberattacks could be used in a war; that is a fact. But I am going to show how you can use chaos engineering to mitigate the risk and train engineering teams to be prepared for a disaster like this.
Cyberattacks can be defined as unanticipated and catastrophic incidents. They happen in production and can take the system down. Cyberattacks are hard, or most likely impossible, to predict, have a severe impact on the availability of a software system, and may generate multiple hours of downtime. Some famous cyberattacks include the NotPetya cyberattack. In 2017, this cyberattack affected all business units at Maersk Line. Maersk is a famous shipping company. Another is the Equifax data breach, which happened between May and July 2017 at the American credit bureau Equifax. In this cyberattack, the credit data of millions of American, Canadian, and British citizens was compromised. And, recently, there was the cyberattack on Twitter.
At this moment, Twitter believes that attackers used a social engineering scheme to manipulate a number of employees and used their credentials to access Twitter's internal systems. As you can see, cyberattacks are a reality and they are hard to predict. However, we can respond to these issues before they impact our systems and reduce their severity, and for this, security incident management can be useful.
An incident, in the context of information technology, is an event that is not part of normal operations. Incident management is the practice of recording, triaging, tracking, and assigning business value to problems that impact our critical systems. SEV is a term used to refer to an incident, and it is derived from "severity". Some examples of this kind of incident include availability drops, product issues, broken features, data losses, and security risks. Here are some resources from Gremlin, a company specializing in SRE and chaos engineering, and here are more resources about SEVs. Gremlin has provided a complete guide to analyzing the impact of an incident on our SLAs and SLOs. Here, it's very important to consider the level associated with the incident. Finally, they explain a formula to calculate this impact.
Since security teams focus on confidentiality and reliability, I asked why the word "security" is missing in this literature. According to Google, both SRE and security have strong dependencies on classic software engineering teams, so it is really strange that the word "security" is missing here. For example, this service level agreement document is from a company for which I worked in the past. I remember that we included a disclaimer for when the incident was related to security: no matter whether we were talking about incident levels or environments, we couldn't commit to times or solutions when we were speaking about cyberattacks. The reason has been described by Laura Nolan in her USENIX presentations.
Cyberattacks are black swans, so they are hard, or most likely impossible, to predict, and they have a severe impact on the availability of a software system. As I mentioned, according to the third book from Google, about SRE and security, security and reliability go unnoticed in the customers' operations. The reason is that if the system is working well, the customers don't notice them. However, I believe that security and reliability should be the top priorities for any organization, because they have a lot in common, such as invisibility, assessment, simplicity, evolution, resilience, investigations, and recovery. These are explained in chapter one of this book.
In this sense, I asked Kolton Andrus about this in a public question-and-answer session. My first question was: is there a list of common attacks to consider when experimenting with security on a system? My second question was: should we have special considerations when the attacks involve security instead of infrastructure? He answered that reliability is a core pillar of security and that, although the penetration testing of an offensive security tester and chaos experiments share some parallels, they have different goals. However, chaos engineering is focused on making systems more reliable and secure in any situation.
As you can see, reliability, security, and our next topic, chaos engineering, are related topics. So for me it is about culture: we should build a culture based on security and reliability, and we should train our security teams based on this culture. According to the latest book on chaos engineering, published in April, this is a culture based on reliability and, for me, on security.
Our incident response teams should have these roles: the designer or facilitator, who leads the discussion; the commander, who executes the commands; the scribe, who takes notes in a communication tool such as Slack on what is occurring in the room; the observer, who looks at and shares relevant graphs with the rest of the group; and finally the correspondent, who keeps an eye on Slack, for example.
Related to security, we have some exercises that can be useful for training our security teams although, considering the answer from Kolton Andrus, they are different from chaos experiments. Red team exercises originated with the US armed forces. They take an adversarial approach that imitates the behaviors and techniques of attackers in the most realistic way possible. Two common examples are ethical hacking and penetration testing. On the other side, blue teams are the defensive counterparts to the red teams in these exercises. Purple team exercises are an evolution of red team exercises, since they deliver a more cohesive experience between the offensive and defensive teams. The goal of these exercises is the collaboration of offensive and defensive tactics to improve the effectiveness of both groups in the event of an attempted compromise. The intention is to increase transparency and enable learning.
However, our point here is that penetration testing, red team exercises, and blue team exercises are not enough for mitigating attacks such as NotPetya or the recent attack on Twitter. We need a new approach, one that keeps pace with the evolving world of attacks and, of course, of software engineering. So it is the moment to introduce a novel technique that can be useful for solving these challenges: chaos engineering. Chaos engineering is the discipline of experimenting with failures in production in order to reveal weaknesses and to build confidence in the system's resilience capability.
This definition was taken from the Principles of Chaos site, which contains a manifesto for chaos engineering. I have highlighted "experiments", "production", "reveal", "build", "confidence", and "resilience" because these words are really important in this definition. Here is a list of the common attacks practiced by engineering teams using chaos engineering. We have a list of technical issues; some examples include dependency failures, region and zone failures, provider failures (for example, cloud provider failures), network upgrades, and rack failures. Most important are the cultural issues that are related to these technical issues or allow them to happen. Some examples include a lack of chaos engineering, a lack of knowledge sharing, or a lack of on-call training, which is a practice described in the Google books, for example.
As you may notice, in the previous slides the word "security" is missing again. Why? Here is the friendly reminder: even though security teams focus on confidentiality and reliability, when the issue is a cyberattack we don't commit to times, solutions, SLAs, SLOs, or SLIs.
So what is security chaos engineering? Security chaos engineering is the identification of security control failures through proactive experimentation to build confidence in the system's ability to defend against malicious conditions in production. I highlighted six words that are valuable in this definition, provided by Aaron Rinehart in the latest book about chaos engineering, published in April of this year: security, failures, experimentation, confidence, defend, and production. They are super important because this discipline is based on the scientific method, and confidence and defense are words related to resilience and reliability. And, of course, the experiments should be run in production.
Security chaos engineering addresses a number of gaps in contemporary security methodologies such as red and purple team exercises. It is really important to be clear that the intention is not to overlook the value of red and purple team exercises or other security testing methods. These techniques remain valuable but differ in terms of goals and techniques, as Kolton Andrus mentioned in his answer. Combined with security chaos engineering, they provide more objective and proactive feedback mechanisms to prepare a system for adverse events than when implemented alone. Before moving on to chaos game days, remember this phrase from Werner Vogels: "Everything fails all the time."
So this is the moment to introduce chaos game days. Chaos game days are based on game days. According to a definition from AWS, game days are an interactive, team-based learning exercise designed to give players a chance to put their skills to the test in a real-world, gamified, risk-free environment. As a form of game day, we have chaos game days. A chaos game day is a practice event that can take a whole day, although it usually requires only a few hours. The goal of a game day or a chaos game day is to practice how your team, or your supporting system team, deals with real-world turbulent conditions. The difference between them is the technology and the objective: in the case of AWS, we are experimenting with AWS technologies, whereas in chaos game days we are experimenting with our own system and our own technologies, trying to mitigate real turbulent conditions, as Russ Miles mentions in his book.
The framework provided by Russ Miles in that book, Learning Chaos Engineering, has three phases: before, during, and after. During the before phase, we should focus on the following: pick a hypothesis, pick a style, decide who, decide where, decide when, document, and get approval. That is really important.
This phase is very expensive because we have to plan a lot of things. The second phase is during. In this phase, the idea is to run the exercises and validate or refute the hypothesis using observability tools and our skills for solving issues. In this phase, the engineers, for example, detect the situation, communicate, visit the dashboard, analyze data, and propose solutions. In this case, we are trying to apply this framework with these activities. Finally, the last phase is for writing the postmortem, with these parts: what happened, impact, duration, resolution time, resolution, timeline, and action items. There are several templates for this in the literature.
Before moving on to our framework and our focus on security, I would like to share this phrase from Vincom: "Human factors in cybersecurity are perhaps the biggest challenge when building an effective threat prevention strategy." Considering the previous framework provided by Russ Miles and the context (if you remember, the word "security" is missing in the literature, in the list of attacks, and in the exercises), I would like to share a framework that we are using to practice security chaos engineering in our chaos engineering study group at ADL.
In our framework we consider the three phases as Russ Miles describes them in his book, but we are including a new stage for evolution. In this stage we dedicate time after each game day to improve the vulnerability database, refine the process, adjust metrics, validate the chaos maturity position after the exercise, and adapt the next game day, because it is really important to use the feedback generated in a previous chaos game day (in this case, a security chaos game day) to improve the next one. Also, in addition to including a new stage, we have to consider some things in the phases provided by Russ Miles, so we are providing some considerations for the three classical stages.
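
The four stages described above (before, during, after, and the added evolve stage) can be sketched as a small checklist tracker. This is only an illustrative sketch: the activity names are taken from the talk, but the `Phase`, `CHECKLIST`, and `GameDay` names and the idea of tracking completion in code are assumptions, not part of the speaker's framework.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    # Declared in the order the game day moves through them.
    BEFORE = "before"
    DURING = "during"
    AFTER = "after"
    EVOLVE = "evolve"  # the extra stage added for security chaos game days


# Activities named in the talk, grouped by phase (hypothetical structure).
CHECKLIST = {
    Phase.BEFORE: [
        "pick a hypothesis", "pick a style", "decide who", "decide where",
        "decide when", "document", "get approval",
    ],
    Phase.DURING: [
        "detect the situation", "communicate", "visit the dashboard",
        "analyze data", "propose solutions",
    ],
    Phase.AFTER: ["write the postmortem"],
    Phase.EVOLVE: [
        "improve the vulnerability database", "refine the process",
        "adjust metrics", "validate the chaos maturity position",
        "adapt the next game day",
    ],
}


@dataclass
class GameDay:
    hypothesis: str
    done: set = field(default_factory=set)

    def complete(self, phase, activity):
        """Mark one activity as finished; reject activities not in the phase."""
        if activity not in CHECKLIST[phase]:
            raise ValueError(f"unknown activity for {phase.value}: {activity}")
        self.done.add((phase, activity))

    def pending(self, phase):
        """Return the activities of a phase that are still open, in order."""
        return [a for a in CHECKLIST[phase] if (phase, a) not in self.done]

    def current_phase(self):
        """Return the first phase with open activities, or None when finished."""
        for phase in Phase:
            if self.pending(phase):
                return phase
        return None
```

A team could create one `GameDay` per exercise and call `current_phase()` to see where the exercise stands; keeping the evolve activities in the same checklist makes it harder to skip the feedback step once the postmortem is written.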
Our recommendation, or better, our consideration, when you are picking the hypothesis is that it's really important to understand the adversary, because motivations, profiles, and methods are super important here. Remember, human factors are a main consideration when you are designing experiments related to security. Second, pick the style. If you are going to choose a style, choose one with adversaries. It is preferred over the classical mode in which a master of disaster attacks the system. The idea is to take inspiration from red and blue team exercises, but with different teams attacking each other. Finally, reconsider the roles. If you remember, we reviewed the roles described in the latest chaos engineering book, published in April. Consider, for example, whether you need a consultant or an expert with knowledge of the latest attacks used by attackers. Those are our considerations for this phase.
If you want to practice an exercise, the book Building Secure and Reliable Systems, which we saw in a previous slide, provides some examples of attacks for practicing security in a chaos experiment. For example, they describe an attack in which the attacker uses a search engine to find the email addresses of employees at a target organization. The attacker sends phishing emails to the employees and then gains remote access to the system using their credentials.
During the exercise, some examples that we are trying to practice in our study group are: introducing latency on security controls, which is a classic example for these experiments; dropping a folder like a script would in production; software secret clear-text disclosure; permission collision in a shared IAM policy; disabling service event logging, which is really critical; API gateway shutdown; and issues related to S3 buckets.
That applies, for example, if you are using a technology such as AWS. Another example is disabling multi-factor authentication.
Here is an example of a thought experiment using security chaos engineering. Our hypothesis was: after the owner of the root account in AWS leaves the company, we can keep using our cloud in a normal way. We thought through this experiment, and the result was that the hypothesis was disproved, because access to AWS was connected to Active Directory. When an employee leaves the company, their account is dropped, and we lose access to AWS. That is really critical. In this case we were practicing a thought exercise, but as you can see, we had the opportunity to identify vulnerabilities and controls that we should review in order to guarantee that our system is secure in production. As a side effect of this experiment, thinking through this scenario lets us consider other applications connected to Active Directory.
Finally, in the third phase, it is really important to consider that security postmortems should be different from classic postmortems, because they should cover the technology issues that the attacker exploited and also recognize opportunities to improve incident handling. That is the reason for including a fourth stage: for example, to document the time frames and effort associated with these action items and decide which action items to pursue.
Finally, for this last phase of our framework, I would like to focus on continuous verification because, remember, it is a continuous process in which we should use the previous feedback to improve the next game day. According to Verica, a company specialized in security chaos engineering, continuous verification addresses both of these requirements in a way that proactively educates engineers about the systems they operate, and it is emerging as a crucial practice for navigating complex software systems. Here is one more definition provided by them.
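
The root-account thought experiment described above can be replayed as a tiny simulation before anyone touches a real cloud account. This is a minimal sketch under stated assumptions: `Directory`, `CloudAccount`, and `run_experiment` are hypothetical toy models, not AWS or Active Directory APIs, and the `break_glass_users` field is an illustrative mitigation that the talk does not mention.

```python
from dataclasses import dataclass, field


@dataclass
class Directory:
    """Toy stand-in for a corporate directory (hypothetical model)."""
    active: set = field(default_factory=set)

    def offboard(self, user):
        # Dropping the account mirrors what happens when an employee leaves.
        self.active.discard(user)


@dataclass
class CloudAccount:
    """Toy cloud account whose administrators authenticate via the directory."""
    root_owner: str
    directory: Directory
    # Assumption for illustration: extra administrators kept as a fallback.
    break_glass_users: set = field(default_factory=set)

    def can_administer(self):
        admins = {self.root_owner} | self.break_glass_users
        return any(user in self.directory.active for user in admins)


def run_experiment(account, leaver):
    """Hypothesis: after `leaver` is offboarded, the account is still usable."""
    account.directory.offboard(leaver)
    return {"offboarded": leaver, "hypothesis_holds": account.can_administer()}


if __name__ == "__main__":
    directory = Directory(active={"alice", "bob"})
    account = CloudAccount(root_owner="alice", directory=directory)
    # With a single directory-backed owner, the hypothesis is disproved.
    print(run_experiment(account, "alice"))
```

Running the same experiment with `break_glass_users={"bob"}` makes the hypothesis hold, which is one way to check a proposed control in the next game day. The same model can be pointed at other applications that depend on the directory, as the talk suggests.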
Continuous verification is a game changer for complex software system management and complex attacks such as security attacks. In the future, it will fundamentally change the scale and types of systems that we even consider building.
Lastly, I would like to share some learnings and challenges after trying to practice security chaos game days because, remember, the future can only be improved if something is learned from the past. This statement was provided by David Woods in the resilience engineering book. Our learnings include the following. The adoption of security chaos engineering faces challenges. Human factors and the view of an attacker are very important for designing the proper experiments and the proper hypotheses. Reducing potential damage and blast radius is super important in security; it is important in a common chaos experiment, and it is important in security too. Communication and observability skills can guarantee the success of an experiment, so it is really important to have the proper skills and to use tools such as Datadog or New Relic. Requirements can collide with experimentation in security, so it is really important to make sure that our requirements are aligned with the experimentation. Finally, you don't need to be a security expert in order to start with security chaos engineering. My favorite advice here is: just start with the experiments, and along the path you will probably have the opportunity to learn the other things that you need for practicing this discipline.
These are the challenges that we identified after trying to apply this framework in our study group. The adoption of security chaos engineering principles across organizations remains an open challenge.
It is our mission to provide more research, more techniques, and other frameworks that can be useful for mitigating cyberattacks. As I mentioned, security is missing in several resources for chaos engineering, and the chaos maturity model is not an exception. Security could be included in the chaos maturity model, since combining the chaos maturity model and security chaos game days helps newcomers start their chaos engineering efforts and build resilience and security. Finally, it is a challenge for us. It is an exciting time to be working in this space, and we are at a moment to try out experiments.
Now I would like to share a phrase from Aaron Rinehart, an expert in security chaos engineering. He says that humans operate differently when they expect things to fail. So it's really important to consider the human factor when you experiment with security in your systems.
Finally, here are some books that can be useful if you want to start with security chaos engineering. They provide the fundamentals of chaos engineering, chaos experiments, and the scientific method. For example, this book by Russ Miles is a really good source if you want to study the scientific method. And the last is the book by Nora Jones and Casey Rosenthal, which has a chapter dedicated to security chaos engineering. The idea is that security should be present in our definitions and in our experiments related to chaos engineering.
To close, here are two phrases that I consider fitting: "Don't fear failure. In great attempts, it is glorious even to fail." And: "One single vulnerability is all an attacker needs." Thank you very much for attending this talk. I would like to share my contact data: my username is Yury Nino on LinkedIn, Medium, Twitter, and others. Thank you very much for attending.

Slides

Download slides (PDF)

Conf42 Site Reliability Engineering 2020 - Online

August 27 2020 - premiere 5PM GMT

Yury Nino

SRE @ Aval Digital Labs
