Conf42 Cloud Native 2023 - Online

Move fast, build things… Safely!

Abstract

How about learning to ship artifacts faster and more securely from the fastest and safest motorsport in the world? In this talk, we’re going to discuss different principles followed in F1 and the lessons we can learn from them and adopt in the way we design, architect, ship and secure our systems.

Summary

  • Mohammed is a backend developer at Spotify and a Google Developer Expert in cloud technologies. DevOps enables us to release and ship faster while keeping developers and ops aligned. However, security is also important, because once you break your customers' confidence, it's hard to get back.
  • The FIA, which oversees Formula One, and the Formula One teams place a high value on the safety of the drivers. What would happen if we designed our systems, our architecture and our applications in the same way? Here are ten of the safety measures that the FIA applies.
  • Automation enables teams to deliver greater value by eliminating manual effort. Tasks like checking for software vulnerabilities can be handled automatically by a bot. It lets us release software faster while making sure that the software is actually secure.
  • The second measure is stringent dynamic, static and load tests to ensure the safety of the drivers, which is similar to a trusted, repeatable CI/CD pipeline. The crash structures around the cockpit are similar to designing for failure: developers need to think about failure tolerance in basically all layers of an application.
  • Software engineers use chaos engineering to identify failures before they become outages. This is not only designing for failures, but actually testing for failures. What would happen if we adopted the same attitude as the pit crews?
  • There are other metrics that we can adopt to ensure the security of our systems. From a security point of view, even with a zero-day vulnerability, the maximum amount of time to roll out a fix everywhere is basically the reverse uptime. There are also a lot of benefits to having a loosely coupled system.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, super happy to be here with you today to discuss how we can build things (software, services, applications) faster while keeping security and safety top of mind. First of all, thanks to the Conf42 Cloud Native conference for allowing me to be here with you today. A quick intro: I am Mohammed, I'm a backend developer at Spotify, and I'm also a Google Developer Expert in cloud technologies. And let the fun begin.
So we all know that the demand for speed and innovation is basically everywhere. Users nowadays no longer want to wait for weeks or even days to get their features shipped, their bugs fixed, and their new improvements released. However, that speed, that innovation, that enthusiasm to release faster can sometimes create conflict when it butts up against security, especially since security sometimes tends to value the status quo. And in some cases the added demand to increase velocity can introduce new challenges and risks, such as skipped processes, insufficient testing, insufficient security testing, and so on and so forth. And when you skip testing, you are soon going to hit a roadblock. We all know about a company that was bragging about how fast they were releasing and how fast they were innovating. They even made it one of their mottos, right? However, soon enough they hit a roadblock.
And that's to say that security is also important. Definitely, speed and velocity are required and desired. DevOps enables us to release faster and ship faster, making sure that both developers and ops are aligned so that once the software is ready, we ship it to our users. However, security is also important, because once you break the confidence, the relationship that you have with your customers, clients or users, it's hard to get back. So that's why it's also important to keep the data of your users safe and secure. And that's what DevSecOps tries to bring to the table, basically.
So DevSecOps tries to make sure that developers are doing what they do best: releasing software, introducing new features, fixing bugs, improving reliability and resiliency, and improving the user experience as a whole. It enables operators, once an improvement or a feature is ready to be deployed, to deploy it to users as fast as possible. And what makes DevSecOps really particular and important is that it tries to save a seat at the table for security folks, to ensure that the things we ship and the software we release are also secure, and to ensure the safety of our users and their data.
So you've probably figured out what the remainder of the talk is going to be. Obviously, we're going to talk about Formula One. Quick context: during the pandemic, I was sitting at home with nothing much to do, trying to kill some hours. Usual pandemic thing. And then I started watching Drive to Survive. It's somewhere between a documentary and a reality TV show. It's not really a documentary and not really a reality TV show, but it tells the behind-the-scenes story of a Formula One race weekend in an entertaining manner, the Netflix way.
And in one of the episodes, this actually happened. During the Bahrain Grand Prix, Romain Grosjean, a Haas driver (Haas is one of the Formula One teams), was driving super fast, 160 mph, that's more than 250 km/hour. Bad maneuver, probably. He hit another driver and then hit the barrier, and the car turned instantly into a fireball. Now, Romain remained there in the extreme heat, in that fireball, for a couple of minutes. And during that time, it was interesting to hear the thoughts of the Formula One people. Because by any standard, by any human standard, this is actually a death sentence and a severe accident. And the Formula One folks were also worried.
And a lot of them thought that this was the end of Romain's racing career, or at least that he was going to have severe injuries. However, the miracle actually happened. Romain managed to get out of the car, out of the fireball and the extreme heat, on his own feet, safe and sound. This is a miracle. But it got me thinking about how much the FIA, which is the entity that oversees the Formula One sport, and the Formula One teams value the safety of the drivers. And it got me thinking: what would happen if we designed our systems, our architecture and our applications in the same way, with the same level of devotion with which the Formula One folks design the cars and write the regulations, to make sure that the safety of the driver is ensured at any moment, even in an extreme accident?
Now, throughout the documentary, they were talking about safety, with little to no mention of security. I am not a native English speaker, so I use the two interchangeably. For me, they are more or less the same. So I opened the dictionary out of curiosity, to find that security actually means the state of being free from danger or threat. And that, from a software engineering point of view, is not what we're trying to achieve, right? We know that once we put a system facing the Internet, there are a handful of things that are dangerous and that try to break our application and bring it to its knees. However, safety means the condition of being protected from danger, risk or injury. And this is what we are trying to achieve, right? We know that production systems and Internet-facing applications are not in a safe environment. However, we try to design our systems in a safe manner, in a secure manner: we build the whole design and architecture, and buy other pieces of software like firewalls and everything, to make sure that this risky and unsafe environment is actually safe, and to ensure the safety of our users' data, our services and our servers.
And that got me thinking that good architecture should actually allow people to move as fast as they can in a safe manner. So I started a journey to understand the safety regulations, the safety rules that the Formula One folks follow to ensure the safety of the drivers, and to check whether there are any similarities, any lessons that we can take and apply in software engineering. And this is what I ended up with. There are more safety measures that the Formula One people apply, but for the sake of time, I will limit myself to ten. I hope that I can make it. It's going to be challenging, but nevertheless: there are five pre-crash measures and five post-crash measures.
And the data never lies, as the saying goes, right? You can see here that as the FIA introduced more and more regulations to ensure the safety of the drivers throughout the years, the number of deaths has drastically decreased. And if we compare the number of days between fatal incidents, you can see that as more and more safety measures were introduced, racing in Formula One has become safer and safer, even if it's still a dangerous sport. The regulations actually worked and helped make Formula One a safer sport, even though they are driving at super fast speeds.
Anyway, let the fun begin for the second time. In the first part, we're going to discuss the five pre-crash measures, which are basically the things that they do and ensure before the crash, before the driver even gets into the car. The first one is the seatbelt that Formula One drivers wear, consisting of a six-point harness, which can be released by the driver with a single hand movement. This is what the seatbelt of a Formula One car looks like. It has six harness points, and the driver cannot put it on by himself; he needs help.
However, it can be released with a single push, a single click. And this is, from a software engineering perspective, similar to the push-to-deploy analogy, which makes us think of automation. Automation is really important in ensuring speed and safety as well. Automation can enable teams to deliver greater value by eliminating manual effort. Tasks like checking for software vulnerabilities can be handled automatically by a bot, by a piece of software, by a job, by a pipeline. And that frees up teams to pursue more productive and innovative tasks. Obviously, automation can also decrease the human errors that occur with manual approaches. As we all know, humans do not excel at manual work; bots and machines excel at that. So that can help us reduce bugs and errors. Automation is actually a fundamental part of the DevSecOps movement and the DevSecOps mindset. It enables us to release software faster, but also to make sure that the software is actually secure.
The second thing I want to talk about is the stringent dynamic, static and load tests to ensure the safety of the drivers. That means that they are testing the car in different conditions: when the weather is hot, when it is raining, when there is wind, when the track is hot, when the track is a little bit cold. They are testing different things, the engine, the aerodynamics, a lot of things, to make sure that they have data about the car and how it actually reacts in different conditions. That way, on race day, they know a lot of things about the car that enable them to react if a crash happens, or even before the crash happens.
And for software engineering, that is the equivalent of having a trusted, repeatable and, most importantly, adversarial CI/CD pipeline that enables us to effectively and repeatedly test any change to our application at any moment in time, making sure that our application goes through a stringent process so we understand how it will react when facing different conditions. And those conditions need to be the same conditions that our application will face in production, because there is no place like production. It's the perfect place in the world, and it's where the fun happens. And that will not only engage security throughout the development and operations processes, but more specifically will ensure its involvement as we design this process. And this is actually difficult at scale and difficult at speed.
Another thing we can do to ensure that we are testing our application in the same conditions as production is canary deployments. Canary deployment is a pattern for rolling out releases to a subset of users. The idea here is basically to deploy the change to a small subset of servers, test it, and have thresholds on errors, latency, and business errors if you want. If we hit those thresholds and pass over them, we basically roll back, because this is not a safe release; otherwise, we release it to all users. The canary deployment basically serves as an early warning indicator with less impact or downtime, which enables us, as I mentioned, to test our application with actual production traffic, where the fun is actually happening.
The third thing they have is a cockpit surrounded by deformable crash-protection structures, and this is what the cockpit looks like. The thing that strikes me is that they designed the car around this structure, this piece of structure whose sole purpose in existence is to ensure the safety of the driver at all times. And from a software engineering perspective, this is similar to designing for failure. We as developers or IT folks need to think about failure tolerance when designing our applications, in basically all layers.
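
The canary rollout described earlier in this part of the talk (deploy to a small subset, watch error and latency thresholds, then roll back or promote) can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: the `CanaryMetrics` type, the `canary_decision` function, and all threshold values are hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class CanaryMetrics:
    """Metrics observed on the canary subset (hypothetical shape)."""
    requests: int
    errors: int
    p99_latency_ms: float


def canary_decision(metrics: CanaryMetrics,
                    max_error_rate: float = 0.01,
                    max_p99_latency_ms: float = 500.0,
                    min_requests: int = 100) -> str:
    """Return 'wait', 'rollback' or 'promote' for a canary release.

    The thresholds are illustrative assumptions; real values depend on
    the service and its business metrics.
    """
    # Not enough production traffic yet to judge the release.
    if metrics.requests < min_requests:
        return "wait"
    error_rate = metrics.errors / metrics.requests
    # Crossing any threshold means the release is not safe: roll back.
    if error_rate > max_error_rate or metrics.p99_latency_ms > max_p99_latency_ms:
        return "rollback"
    # Otherwise the change can be released to all users.
    return "promote"
```

In practice the metrics would come from a monitoring system and the decision would be evaluated repeatedly while the canary receives production traffic.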
No longer can application developers confine themselves to thinking about functionality only. We must also consider how to provide functionality in the face of database outages, slow networks, or even outages as a whole. Something that we can do to ensure that is to enable mTLS, for example. mTLS is mutual TLS, which basically ensures that the client and the server authenticate each other and that the traffic between them is encrypted. This is important in a microservices architecture, where services interchangeably act as server and client, making sure that the traffic circulating inside your cluster is encrypted. So even if an attacker manages to gain access to your cluster, he has no clue what is happening, and he would need to do additional work to gain access.
Another thing we can do is design our application in a micro-segmented way. That means that we are putting the data as far away from the Internet as possible, and we're ensuring that we never expose sensitive system data. So even if an attacker compromises the Internet-facing system, this structure deforms and contains the damage: the attacker will not have the final data, and will only have access to the Internet-facing services, which are not actually sensitive. And we make it hard for them to gain access to the sensitive data.
Moving on. Before they race, drivers must demonstrate that they can get out of the car within 5 seconds. Any driver who cannot get out of the car in under 5 seconds before the race, goodbye: they cannot race that day. So if you want to race, show me that you can get out of the car in under 5 seconds, just like what would need to happen in a crash. I want you to get out of that car very fast. And from a software engineering perspective, this is not only designing for failures, this is actually testing for failures, right?
And a similar discipline in software engineering is chaos engineering. Chaos engineering, very briefly, is an approach to identify failures before they become outages. We do that by proactively testing how our systems respond under stress, under errors and under failures. And by doing that, we can identify failures before they end up in the news, basically. And nothing stops you. I mean, you can start small, right? You can design the smallest experiment possible. You can just add some slowness to your server, add a few seconds of latency to check how your system reacts, and then take notes about what you knew and what you didn't know, and then make fixes. And then you can grow gradually, starting to add errors, such as taking down a database, a whole instance, a whole region. And then you gain confidence in the releases of your application.
And then the last of the pre-crash measures is that they have constant monitoring and replacement of the tires. Look at that. This is the Red Bull team. They held the world record for the fastest tire replacement. And look how proud they are, how happy they are. Not by fixing the tires, not by trying to patch the tires, but actually by changing the tires. Why the hell do we as software engineers brag about how long our servers, containers and instances have been running? What would happen if we started to challenge that mindset and adopted the same attitude as the pit crews, changing the tires as fast as possible?
We all know the pets versus cattle analogy, right? It basically means that we should not treat our servers as pets, but more as cattle. Resources should be started and removed, killed and switched off, whenever necessary, of course. But what would happen if we pushed this zoological analogy a little bit further and introduced a chicken analogy?
So the amount of food and resources that cattle need to reach adulthood is much greater compared to a chicken, which obviously needs fewer resources. And the time cattle need to reach adulthood is two years, almost 24 months, compared to the chicken, which needs only a few weeks, around six. So we are optimizing if we adopt the chicken analogy. And the chicken analogy is similar to the container analogy, to be honest, or to be transparent.
But there are other metrics that we can adopt to ensure the security of our system. And for that, I want to introduce an example. Imagine that you have configured and secured a production cluster. And on that cluster, you are running a few mission-critical applications of your company. Now, a hacker has managed to get access to your system and has gained root access on one of the nodes. That alone is already bad. But the attacker has started to use your node as a base to attack other nodes in your system. He managed to hide himself, and without your knowledge, he started to attack other nodes and servers in your cluster. Now imagine what would happen if this node were replaced. The attacker would basically need to go through the same process again to keep his base. Now imagine that we are repeatedly killing this node after a certain period of time, let's say one day. The attacker will need to repeat the same process each and every day. Once the node is removed, the hacker has lost their base and needs to do the same process over again. That's to say that it is hard to backdoor a system that is constantly being reimaged. So from a security point of view, this is actually good. We define a certain period of time, and after that we basically destroy the node and replace it with a freshly provisioned node. And that's what we call the reverse uptime.
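
As a rough sketch of the reverse uptime idea above, the following picks the nodes that have lived longer than an allowed maximum age and are therefore due to be destroyed and re-provisioned. The function name `nodes_to_replace`, the node names, and the one-day limit are assumptions made for illustration; actually draining and recreating a node depends on the platform and is not shown.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def nodes_to_replace(provisioned_at: Dict[str, datetime],
                     max_age: timedelta,
                     now: Optional[datetime] = None) -> List[str]:
    """Return the names of nodes whose age has reached the reverse uptime.

    provisioned_at maps a node name to the (timezone-aware) time it was
    freshly provisioned. max_age is the reverse uptime, e.g. one day.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return sorted(
        name
        for name, created in provisioned_at.items()
        if now - created >= max_age
    )
```

Run on a schedule, a check like this bounds how long any foothold on a node can survive: at most `max_age`, provided the replacement is built from a fresh image.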
Now combine that with the freshness of your base image, that is, the OS or container image that you use as the base for deploying all your other images. From a security point of view, even with a zero-day vulnerability, you know that the maximum amount of time to roll out the fix for your vulnerabilities is basically the reverse uptime. So an attacker who managed to gain access no longer has that access after the reverse uptime period, since we fixed the base image and the vulnerability that he was using. So those are other metrics that we can keep in mind to ensure the safety and security of our system.
I'm running a bit behind time, so I'll be a bit faster for the five post-crash measures, the things that they do and ensure after the crash has happened. The first one is that the driver can be extricated from the car by lifting out the entire seat. So if a crash happens and the driver cannot get out by himself, they lift out the entire seat and provide him with any measures necessary to save his life. Take a look here, and notice that designing the car in a modular way is what enables them, in case of crisis, to lift the entire seat out of the car. Similarly, designing our systems in a modular, less coupled way enables us to react in case of crisis and in case of incidents. There are a lot of benefits to having a loosely coupled system. There are fewer dependencies that you need to manage, whose versions you need to track, and that you need to keep up to date and free of vulnerabilities. There is failure isolation: if one node or one service fails, you have the resiliency to not fail your whole system, and you keep the other parts of the system up and running. And an evolvable architecture is a great win as well.
You can evolve your system and architecture independently, reusing other services with less coupling in your system.
The second thing they have among the post-crash measures is the HANS system. HANS stands for head and neck support, a system that absorbs and redistributes forces that would otherwise hit the driver's skull and neck muscles. And this is what a HANS looks like. When there is an accident, the forces generated are gigantic, since the cars are moving super fast. So the HANS redistributes those forces, which would otherwise basically break the driver's neck, throughout the driver's body. In software engineering, this is similar to designing our systems in an elastic way, making sure that we have a set of patterns to build our architecture in an elastic manner. For example, load balancers when the traffic increases. If traffic increases and our nodes are not keeping up, we can have autoscaling to introduce more nodes. We can have request timeout thresholds if some of the services are starting to get slow. In some cases, having degraded performance is better than having an outage or nothing at all. And adopting some anti-overload patterns, such as circuit breaking and exponential backoff, helps alleviate some of the load issues.
Another thing they have is the driver suits, which are fire resistant and can keep the driver's body under 41 degrees for at least 11 seconds. And this is how our dear friend Romain managed to stay safe and sound: his suit could keep his body under 41 degrees even in a fireball, in the extreme heat. From a security point of view, this is similar to containing the damage and containing your attacker. And there are a couple of security principles that we can follow to achieve that.
The first is adopting the least-privilege principle: making sure that our applications and services are running with the minimum set of access rights and resources that enables them to perform their function. Then there is defense in depth, having layers, basically. So even if an attacker manages to breach one layer, we have a second layer, and many layers afterward, that can make his life harder and slow him down until we figure out what's happening and fix the vulnerability. Then there is having a zero-trust policy. Zero trust means eliminating implicit trust and continuously validating every stage of digital interaction and communication between your services. It's really important to have no implicit trust, and instead to require authentication and authorization from the calling services or servers every time.
Another thing we can do is use hardware security modules. An example would help here to understand what I'm trying to say. Imagine an attacker managed to get access to your database or your server and extracted the data from your database that contains, let's say, the passwords of your users, without your knowledge. He can take that data, brute-force it, manage to break it, and gain access to your users' passwords in plain text. And you will not know until you find that data on the market, on the dark web. You were using a function to hash your users' passwords, and he managed to break it somehow. Now imagine having a dedicated entity or a dedicated server that holds a key with which you hash or encrypt the passwords, so that you cannot decrypt or compare them unless you have that key. Now even if an attacker gets this data out of your cluster, he would be severely limited in the impact he can make, since he would need to stay within your cluster to get that key in order to be able to break the data and recover the passwords.
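
A minimal sketch of that idea, assuming a keyed hash on top of a regular salted password hash. The `InMemoryKeyService` class is a hypothetical stand-in for a hardware security module: in a real deployment the key would live inside the HSM or a key management service, and only the MAC operation would be exposed. The iteration count is an illustrative value, not a recommendation.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple


class InMemoryKeyService:
    """Stand-in for an HSM: the key stays inside, callers only get MACs."""

    def __init__(self, key: bytes) -> None:
        self._key = key

    def mac(self, data: bytes) -> bytes:
        return hmac.new(self._key, data, hashlib.sha256).digest()


def hash_password(password: str, key_service: InMemoryKeyService,
                  salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest); the digest depends on the service-held key."""
    if salt is None:
        salt = os.urandom(16)
    # Slow, salted hash first (iteration count chosen for illustration).
    slow = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)
    # Then bind the result to the key that never leaves the key service.
    return salt, key_service.mac(slow)


def verify_password(password: str, salt: bytes, stored: bytes,
                    key_service: InMemoryKeyService) -> bool:
    _, candidate = hash_password(password, key_service, salt)
    # Constant-time comparison to avoid leaking information through timing.
    return hmac.compare_digest(candidate, stored)
```

With this layout, a dump of the stored salts and digests cannot be brute-forced offline without the key, which is the containment effect described above.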
So we are keeping him in and containing him, and hopefully that can help you gain some precious minutes until you figure out what's happening.
Another thing they have is a fire suppression system that can be activated by the driver, or externally by the race marshals. So in case of a fire, the driver can trigger the system himself; otherwise, a race marshal can trigger it. And this is the equivalent of having access policies for your system. There is no right or wrong way to define the access policies that you want. For communication between your services: if a service doesn't need access to a server, don't give it that access, block it. Basically, each service needs to be allow-listed before calling another service. For your users, have RBAC policies and ACL policies. Don't be overly restrictive, but don't be overly optimistic either, without any protection in place.
And finally, they have a data recorder that keeps data about everything. I've read that in each race, each Formula One car can generate more than 1 [unit unclear] of data, which is a lot of data for each race. And that goes to say that no system should go to production without having monitoring and logging tools in place, especially security ones that help you detect unusual behavior, trigger alarms, and allow you to react in case of issues.
I want to end this presentation by highlighting how comfortable driving a Formula One car is. Look how the driver sits in a Formula One car. It's actually a trade-off between speed, safety, and extracting the maximum from the car. And the same thing applies to software engineering and the way we design our systems. It's basically a trade-off that we need to make between security, speed and velocity. And I want to end this presentation with a scary number.
The IBM Security folks, in their Cost of a Data Breach report, ran an analysis and found that on average, the cost of a cyber attack is more than 4 million dollars. On top of that, it took more than 280 days to detect the breach, which is a long period of time. So I hope that throughout this presentation, you discovered some patterns that will help you ensure the safety of your systems, applications, servers and services while increasing your velocity and moving fast. Thank you very much, and thank you to Conf42 Cloud Native. I'm a little bit over time, so sorry for that. Happy to take your questions if you have any, and I'm pretty available on Slack. Those are the resources that helped me build this presentation. And yeah, that's pretty much it. Thank you very much. I hope you're enjoying the conference so far, and I wish you a great rest of the conference.

Mohammed Aboullaite

Backend engineer @ Spotify
