Conf42 Chaos Engineering 2022 - Online

Continuous Reliability. How?


Abstract

As engineers we expect our systems and applications to be reliable. And we often test to ensure that at a small scale or in development. But when you scale up and your infrastructure footprint increases, the assumption that conditions will remain stable is wrong. Reliability at scale does not mean eliminating failure; failure is inevitable. How can we get ahead of these failures and ensure we do it in a continuous way?

One of the ways we can go about this is by implementing solutions like CNCF’s sandbox project Keptn. Keptn allows us to leverage the tooling we already use and implement pipelines where we execute chaos engineering experiments and performance testing while implementing SLOs. Ana will share how you can start simplifying cloud-native application delivery and operations with Keptn to ensure you deploy reliable applications to production.

Summary

  • Real-time feedback into the behavior of your distributed systems, and observing changes, exceptions, and errors in real time, allows you to not only experiment with confidence, but respond instantly to get things working again. As our systems get more complex, that means that our failures are also more complex.
  • Ana Margarita Medina is a senior chaos engineer at Gremlin. She focuses on empowering others to learn more about chaos engineering. Medina also sits on the advisory board for the Keptn project. Representation is something that really matters to her.
  • Keptn is a control plane for DevOps automation of cloud native applications. It uses a declarative approach to build scalable automation for the delivery and operations of these services. Keptn is also part of the CNCF, where it sits as a sandbox project.
  • Service level objectives and service level indicators are key to building reliability. Keptn allows you to take these concepts, known as SLOs and SLIs, and standardize them. A lot of that SRE operations work now gets to be defined and done in a way where we actually have passes and failures based on these metrics.
  • The demo project in Keptn has three stages: development, staging, and production. You can also do chaos engineering alongside performance testing. Keptn allows you to define SLOs and SLIs as their own YAML files. This allows for continuous reliability.
  • We have to remember that we can't just build reliability overnight. The way to get there is by establishing processes, automating them, and practicing continuous reliability. If you're interested in taking the next step in your learning journey, check out the Gremlin certification.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Real-time feedback into the behavior of your distributed systems, and observing changes, exceptions, and errors in real time, allows you to not only experiment with confidence, but respond instantly to get things working again. Thank you for tuning in to my talk. I am going to be talking about continuous reliability and how it is that we can get there. Let's go ahead and just jump right in. As we know, software is going to break. The world that we're building continues relying on the stability of this naturally brittle technology. The challenge that we continue facing is very much about making sure that our customers stay first. How do we continue to innovate and deliver products and services so that our customers are happy, while minimizing that risk of failure as much as possible? But we actually have come a long way. Maybe we used to think that our technology stacks were very complex, but boy were we wrong. Our legacy systems were way simpler than the systems that we have now. This complexity continues increasing. We started out with just a few services on premises. Maybe we had one service that was being hosted, and maybe we had one or two annual releases. We've gone ahead and shifted left and rearchitected our monoliths to be microservices. Now we have things hosted in the cloud, where we don't even know the location of the data centers, and we have hundreds if not thousands of microservices that we have to look after. And thanks to DevOps, we've collaborated a lot more, we have more frequent releases, and sometimes we even deploy on Fridays. We have been thinking about this complexity. We have been preparing for the unexpected that can happen to our systems. We are at a chaos engineering conference, so I am not going to go ahead and cover the history of chaos engineering. It's been great to see this space continue evolving, the community getting larger and stronger, and more tools coming out to make this possible. It's great to find tools that allow us to run simple and safe experiments without needing to do one-of-a-kind configurations. But what is it that we're missing? We have to continue reminding ourselves that as our systems get more complex, that means that our failures are also more complex. Operating them at scale is even more of a headache. We still find ourselves doing a lot of this manual work. That is a lot of toil for those of us who are site reliability engineers. We end up having to do a lot of remediation, a lot of looking for the proper dashboards and observability links, and we also spend a lot of time executing those runbooks or shell scripts to make sure that our systems come back up. We constantly find ourselves feeling like that little stick figure: this is fine. My systems are broken, but they'll come back up. I'll have more coffee. I won't sleep. It's going to be okay. Maybe we'll actually escalate and get more help. But we are the ones that are on call. We're the ones getting paged and woken up in the middle of the night. Until when is this too much? Until when do we question and ask ourselves what we can do to make things better? How do we make things more automated? How do we make things more reliable? I say that I am going to be talking about continuous reliability, but how is it that we're going to get there? Well, I believe that with these three words, we can get there. 
When I look back at my time as an SRE and working within SRE communities, three things always come to mind: automation, standardization, and experimentation. We learned that automation and standardization are core principles of site reliability engineering. And of course we cannot forget experimentation. From chaos engineering to feature flags to canary deployments, they've all helped us move the needle through all these years. We know that automation helps our organization and our teams not burn out, and our systems to be more reliable. And of course we know that defining these reliability goals helps keep us online. Well, Keptn allows for these things to come together under one roof. So let's go ahead and dive right in. And first I'm going to introduce myself. My name is Ana Margarita Medina. I am a senior chaos engineer at Gremlin. I've been working here for almost four years with the focus of empowering others to learn more about chaos engineering and move their journey forward. Prior to that, I used to be at Uber working as an SRE, where I focused on chaos engineering and cloud infrastructure. Prior to that I was also a front end developer, a back end developer, and I even did some mobile applications. I've gotten a chance to take all my knowledge from those industries and try to talk about making things more reliable. I also sit on the advisory board for the Keptn project, and it has been really cool to see this space continue growing. Representation is something that really matters to me. So shout out to all of you that are joining in from one of those groups. I was born and raised in Costa Rica, my parents are from Nicaragua, and I now reside in the San Francisco Bay Area. So shout out to all of you. Let's go ahead and jump right back into this Keptn project. Keptn is a control plane for DevOps automation of cloud native applications. It uses a declarative approach to build scalable automation for the delivery and operations of these services. It can also be scaled to a large number of services. The cool thing about Keptn is that it works for cloud native applications and is not just exclusive to Kubernetes. Keptn is also part of the CNCF, where it sits as a sandbox project. It's been great to watch this project grow, improve, and gather more adoption. One of the awesome things within Keptn is that it gives you a lot out of the box. You're going to be getting observability, some dashboards, and alerting that allow you to follow best practices. You also get to configure that monitoring and observability, whether you want default settings or customizable dashboards, along with getting some extra alerting that is going to be set up based on service level objectives for each managed service that you have within Keptn. And it allows you to bring under the same platform some of that delivery, along with operations and remediation for your services. So I talked about service level objectives and service level indicators; I want to make sure to cover that terminology before we talk a little bit more about them. The service level agreement is going to be that contract with your users, which includes the consequences if that contract is not met. And that comes down to the service level objective, that is the target value or range of values for that service level, and that is going to be measured by the service level indicator. 
That is a carefully defined quantitative measure of some aspect of the level of service that you are providing. So a perfect example is a service level indicator for web request latency: the indicator is just the latency of every single request for that service, and we want it to be less than 500 milliseconds. On top of that, we have the service level objective: 95% of those web requests have a latency of less than 500 milliseconds over a rolling month. And the service level agreement says that web requests must have a latency of less than 500 milliseconds for the month; if not, the customer gets their money back. So there's actually a consequence for things not being done with reliability in mind. And of course we have that big idea that we care so much about reliability. So we actually just don't want nines of reliability; we want to go ahead and think that we're trying to reach 100% of web requests. But that perfect ideal world doesn't really work when we actually put it out into our technologies. The amount of dependencies that we have makes it really hard to have five nines, four nines, three nines of reliability. You have to do the work. Keptn allows you to take these concepts, known as SLOs and SLIs, and it allows you to have standardization. We have many tools that allow us to declare service level objectives, but we don't have them under one platform. We need to find a way to standardize them across the tools and across the different stages that you have within your pipeline, and sometimes even just across the organization on its own. Keptn allows you to do just that. If you're interested in learning more specifically about the ways that SLOs can get created and be done within Keptn, one of the contributors, Andreas Grabner, a great friend, has a lot of talks around this. I personally love the one he gave last year at SLOconf. So as we keep in mind that we have a declarative environment that allows us to set up service level objectives, that is a way that we can think about building reliability. By bringing SLOs into all of this, we now have service level objectives that are going to work within the pipelines. That means that developers are going to see how their code, their improvements, the features that they're working on are actually impacting this reliability metric. And this will allow a service level objective to actually work as a gatekeeper. They're getting a chance to see things gradually roll out to your dev environment, to your staging environment. They get to say, oh, actually, this is making the requests even slower; we don't allow that to hit our customers, based on the service level agreement and that SLO that we just recently covered. We now have SLOs as part of a platform, part of the CI/CD. A great way that I love describing it is as test-driven operations. A lot of that SRE operations work now gets to be defined and done in a way that we actually have passes and failures based on these metrics that we define. We have these SLOs built into pipelines. This is a way for you to then think about what you can do to make things better afterwards. And Keptn allows you to define these remediation actions, what to execute, to reevaluate that service level objective. Since those objectives must be met in every single stage for every deployment within Keptn, Keptn is going to be running tests to make sure the service level objective is not breached before that promotion. 
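To make that latency objective concrete, here is a minimal sketch of what a Keptn quality-gate definition (an slo.yaml) could look like. The indicator name response_time_p95, the thresholds, and the scoring are illustrative assumptions, and the exact fields can vary between Keptn versions, so treat this as a sketch rather than the definitive format:

```yaml
# slo.yaml - hypothetical quality gate for the latency objective discussed above
---
spec_version: "1.0"
comparison:
  compare_with: "single_result"
  include_result_with_score: "pass"
  aggregate_function: "avg"
objectives:
  - sli: "response_time_p95"   # indicator name, defined separately in sli.yaml
    pass:                      # gate passes while p95 latency stays under 500 ms
      - criteria:
          - "<500"
    warning:                   # degrade to a warning between 500 ms and 800 ms
      - criteria:
          - "<800"
    weight: 1
total_score:
  pass: "90%"                  # overall score required to promote the service
  warning: "75%"
```

Keptn evaluates each objective against the SLI values it pulls from your monitoring tool, computes an overall score, and only promotes the service when that score clears the pass threshold.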
You're getting a chance to automate that delivery. You get a chance to automate that extra step that SREs also get to do. When Dynatrace looked around this space using the 2020 DevOps report, they saw that 63% of folks are building internal delivery platforms, and they wanted to find a way that they could give back to the community. They then ran their own surveys and got a chance to find out that a lot of time was being wasted maintaining pipelines, doing a lot of manual tasks, and doing a lot of manual remediation. This can totally happen. I've seen this in multiple orgs, where a lot of these things are completely just shell scripts that folks have to run, or you have to send a message on Slack to one of your friends across the org and ask, how do you bring a database back? We first start with the pipelines. We then bring in service level objectives to be set up as quality gates that Keptn allows you to define. So as these service level objectives are defined within the development stage, as they are met and there's no breach of them, no breach in reliability, that then gets promoted over to your pre-production environment, your staging environment. And as we see that those things are reliable and there's no harm to our SLO, we then get to promote that to our production environment. The cool thing too is that we also get to automate the operation of bringing our systems back when there is an issue that breaches that service level objective. So it gets to execute one of the remediation actions that you've set up, such as toggling a feature flag, and then that quality gate reevaluates that service level objective. If it passes, you have now remediated what was going on and it closes your reported issue. This takes in alerts and problems within your observability, within your dashboards. This is something that you can set up with tools such as Dynatrace and Prometheus, of course. And I wanted to show you a little bit of what that looks like in action. One of my favorite things about Keptn is that it has multiple learning resources. tutorials.keptn.sh has really cool tutorials that you can run through. I'm going to go ahead and follow the Keptn Full Tour on Dynatrace, but if you don't want to use Dynatrace, you can use Prometheus, and you'll get a chance to see how you bring your own Kubernetes cluster, install Keptn, and so on. So let's see what this does. We get a chance to just install Keptn by running one command. We get to give it the use case; today we're going to focus on continuous delivery, since that comes with those quality gates, and it allows you to see how this gradually rolls out. We're going to onboard our own project, which we're going to call sockshop, and we're going to pass a YAML file that helps define it, as shown in the sketch below. As we start our project, we now see that we have defined our three stages. Our stages are going to be very simple: development, staging, and production. We have a project; now we actually need to onboard some services. So we're going to start out by onboarding a carts service. We're going to pass the Helm chart for it, and we're going to onboard our carts database as well. And we're going to trigger that first delivery of our application. We're going to start out by triggering the database and then triggering the delivery of the carts application, giving the tag for the image that you are actually deploying. And things are going well. We see that we did the delivery over to development, to staging, and to production. And things are all green. 
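For reference, here is a minimal sketch of what that project-defining YAML file (the shipyard) could look like for the three stages in the tutorial. It follows the Keptn 0.8-era shipyard format; the exact apiVersion, task names, and deployment strategies are assumptions that may differ in your Keptn version:

```yaml
# shipyard.yaml - hypothetical three-stage shipyard for the sockshop project
apiVersion: "spec.keptn.sh/0.2.0"
kind: "Shipyard"
metadata:
  name: "shipyard-sockshop"
spec:
  stages:
    - name: "dev"
      sequences:
        - name: "delivery"
          tasks:
            - name: "deployment"
              properties:
                deploymentstrategy: "direct"
            - name: "test"
              properties:
                teststrategy: "functional"
            - name: "evaluation"        # SLO-based quality gate
            - name: "release"
    - name: "staging"
      sequences:
        - name: "delivery"
          triggeredOn:
            - event: "dev.delivery.finished"      # promote only after dev passes
          tasks:
            - name: "deployment"
              properties:
                deploymentstrategy: "blue_green_service"
            - name: "test"
              properties:
                teststrategy: "performance"
            - name: "evaluation"
            - name: "release"
    - name: "production"
      sequences:
        - name: "delivery"
          triggeredOn:
            - event: "staging.delivery.finished"
          tasks:
            - name: "deployment"
              properties:
                deploymentstrategy: "blue_green_service"
            - name: "release"
```

With a shipyard like this in place, the tutorial's `keptn create project` and `keptn onboard service` commands (names may differ slightly between CLI versions) wire the stages together so that each delivery has to clear the evaluation task, the quality gate, before it moves on.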
Things are going well. The project is succeeding. We then go ahead and do another release. We're now releasing version two of our carts. We see that production is still running on version one of our application. But as we're actually rolling out version two, we see that it's starting to fail in our pre-production environment and in our development environment. When we look over at Keptn, we can see that the delivery got started, but when it came down to that evaluation stage, it's actually having issues. Dynatrace is reporting a breach within the SLOs and it's not allowing this to get promoted. This allows us to take a deep dive into that evaluation. When we go see what it looks like in staging, we see that the response time at the 95th percentile is actually 1052 milliseconds. This does not meet the criteria that we have for things to pass, so the result is a failure, and this is not getting promoted from staging to production. We go ahead and release version three of this application, where we're thinking more about that response time. As we release that, we see that our dev environment has moved over to version three. We also see that our staging environment eventually got rolled out to version three, and our production has also adopted version three. So this is the ideal delivery of our application: it has passed the service level objectives and it's able to get promoted to the next stage. If you're wondering how some of these service level objectives are defined and measured within Keptn, and how all the magic happens, you get to see some examples of it. We have some service level indicators that we create, such as response time at the 95th percentile and at the 50th percentile, and of course we have our error rate and our throughput. We get a chance to see how those now have service level objectives. You get to define what the criteria need to be for them to pass, and the criteria it takes for them to be in warning. And then you get an overall score for that service based on the different quality gates that the application has. You can have some that are warning, and the quality gate is going to give you that warning; some of them are just going to fail or continue passing. The awesome part is that Keptn allows you to define these as their own YAML files. You have your SLO YAML and you have your SLI YAML, where you just get to say what you want that indicator and objective to be, and Keptn uses those YAML files to define the platform and the way that it's going to work (see the sketch after this paragraph). So when we talk about continuous reliability, to me that gets created when we have service level objectives, when we have things like Keptn that come with quality gates, and we bring in that experimentation piece, we bring in that chaos engineering. Those SLOs are going to require us to do the work of setting up indicators. And now we get to inject chaos within the pipelines and see what that experiment does. There are multiple ways for one to do chaos engineering. You can go ahead and run this at every single stage, and see what a chaos engineering experiment does in dev and staging prior to reaching production. Or you can also just have a chaos engineering stage. So you can have your development, your staging, then have a chaos engineering stage, and that is the last quality gate before you promote to production. 
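As an illustration, here is a minimal sketch of what an sli.yaml could look like when the Prometheus provider is used. The metric names (http_response_time_milliseconds_bucket, http_requests_total) are assumptions about what the demo application exposes, and the $SERVICE, $PROJECT, $STAGE, and $DURATION_SECONDS placeholders follow the substitution style documented for the Keptn Prometheus integration:

```yaml
# sli.yaml - hypothetical SLI queries for the Keptn Prometheus provider
---
spec_version: "1.0"
indicators:
  # 95th percentile response time, assuming the app exposes a latency histogram
  response_time_p95: histogram_quantile(0.95, sum(rate(http_response_time_milliseconds_bucket{job="$SERVICE-$PROJECT-$STAGE"}[$DURATION_SECONDS])) by (le))
  # share of non-2xx responses
  error_rate: sum(rate(http_requests_total{job="$SERVICE-$PROJECT-$STAGE",status!~"2.."}[$DURATION_SECONDS])) / sum(rate(http_requests_total{job="$SERVICE-$PROJECT-$STAGE"}[$DURATION_SECONDS]))
  # requests per second
  throughput: sum(rate(http_requests_total{job="$SERVICE-$PROJECT-$STAGE"}[$DURATION_SECONDS]))
```

The SLO file shown earlier then references these indicator names (response_time_p95, error_rate, throughput) in its objectives, which is how the two YAML files work together to score a quality gate.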
You can also think about doing chaos engineering alongside performance testing. This is one of the great ways to get a lot of the learning that comes with chaos engineering. This is one of the things that we did at Uber: we did load testing and we went ahead and also did chaos engineering. That's how we were preparing for our Black Fridays, which for us were Halloween and New Year's. How do we make sure that we have enough bare metal racks to handle the large load of capacity that we have on our peak traffic days, and how do we make sure that all the 50 microservices that it takes to run a trip are actually reliable on the day that really matters? That practice wasn't built overnight; there was a lot of testing that got done. And when we do this type of chaos engineering within the pipelines, we're asking ourselves: is this service level objective met? Yes? Cool, we're promoting that over to production. That service level objective is not met? Did we actually identify a weakness? We get a chance to do multiple things with something like Keptn: you get to have auto-remediation, in case you want to set that up (see the sketch below), or you just do a new release, and that fix actually goes through. How do we think about these experiments? We're going to always keep in mind quality gates. We don't even have to do the math or ask our team whether we think the chaos engineering experiment results are okay for the customer or not, because we went ahead and defined SLOs and SLIs. When the SLO is met, we're good to go. When that SLO is not met, we're not okay to ship to our customers. The way that this all comes together, when we do that example of having a chaos engineering stage, is that the application rolls out to that chaos engineering stage. That stage then triggers a chaos engineering experiment. It takes all that data from the application and the tools that you have connected to it. It then says: is this SLO a pass or a fail? Do we promote to production or not? And we get to see how a lot of that continuous learning comes about. We're going to learn by doing, we're going to learn by injecting failure into our systems. And of course, there's that continuous aspect of it, where we're going to be improving and repeating as we do more releases of our application. The Keptn ecosystem continues growing. There are a lot of tools that you can use, starting on the delivery side of applications; there's a lot on the testing side; when it comes to observability, you can tie in multiple tools; and of course, for collaboration, you can send messages over to Slack or Microsoft Teams. Keptn also recently launched an integration that gives you a little bit more freedom if you want to do things in a no-code way: you can use webhooks that allow things to just be plug and play with your own internal systems or any other service that you don't see in the integrations. It's a great time to be playing around with Keptn, and make sure to join the Keptn community. There are a lot of learning resources within chaos engineering, SRE, and DevOps. You can head on over to keptn.sh to learn more about this project. You can follow them on Twitter, YouTube, and LinkedIn. Go ahead and give them a star over on GitHub, and make sure to get your hands dirty. Head on over to tutorials.keptn.sh and take a look. 
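To illustrate what that auto-remediation setup could look like, here is a minimal sketch of a Keptn remediation file. The problem type, the feature-toggle action, and the EnablePromotion flag are illustrative assumptions borrowed from the style of the Keptn tutorials (which use an Unleash-based action provider); your own remediation actions would depend on the action providers you have installed:

```yaml
# remediation.yaml - hypothetical auto-remediation config in the Keptn remediation spec style
apiVersion: spec.keptn.sh/0.1.4
kind: Remediation
metadata:
  name: carts-remediation
spec:
  remediations:
    - problemType: Response time degradation
      actionsOnOpen:
        - action: featuretoggle          # assumes a feature-toggle action provider, e.g. the Unleash integration
          name: Toggle promotional feature flag
          description: Disable the promotional campaign to shed load
          value:
            EnablePromotion: "off"
    - problemType: default
      actionsOnOpen:
        - action: escalate               # fallback when no automated action is known
          name: Escalate the problem
          description: Notify the on-call engineer
```

When the monitoring tool opens a problem, Keptn looks up the matching problem type, runs the configured action, and then re-evaluates the SLO to decide whether the issue can be closed or needs to be escalated.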
They even have one that you can do locally using K3s: in just a few minutes you set up a local Kubernetes cluster and get a chance to play with these delivery pipelines. As we're closing out, I want to make sure that I leave you with some final thoughts about the stuff that we're building within our systems. We have to remember that we can't just build reliability overnight. That can't just be an OKR. We also have to remember that we can't buy reliability overnight. You can't bring in just a new tool that's going to promise you more nines of reliability. Those don't exist. You actually have to do the work. You have to learn. You have to inject failure and learn. You have to make sure that you think of ways to make your team and your systems more robust, more reliable. And the way to do that is by establishing processes, automating them, and continuously verifying that those processes are being run and that the proper results are being achieved. That includes things like experiments, those service level objectives, making sure that you're doing game days, that you're doing failovers, that you're executing those runbooks so that they don't become stale for those days that really matter. Reliability is not an accident at all. You have to do the work. You have to make sure that you're thinking ahead, thinking about unexpected things that can happen to your system, and doing chaos engineering around it. You also have to continue learning. If you're interested in taking the next step in your learning journey, feel free to check out the Gremlin certification. You can head on over to gremlin.com/certification to learn all about it. There are currently two certification modules that Gremlin is providing. We have the practitioner level, which ends up being chaos engineering fundamentals, and you have the next level, the professional level, where you get to test your skills on advanced chaos engineering along with some of that Gremlin terminology. I hope you all get a chance to check it out. And with that, I would love to say thank you for tuning in to my talk. If you have any questions about Keptn, chaos engineering, Gremlin, SRE, or DevOps, don't be afraid to reach out. You can reach me by email at ana@gremlin.com, or feel free to say hi on Twitter at @Ana_M_Medina. Gracias.
...

Ana Margarita Medina

Senior Chaos Engineer @ Gremlin



