Conf42 Cloud Native 2024 - Online

Architecting Resilient Cloud Native Applications: A Practical Guide to Deployment and Runtime Patterns

Abstract

In cloud-native applications, resilience is paramount. This talk delves into a comprehensive toolkit of deployment and runtime patterns, equipping you with the knowledge to design and implement resilient systems that can withstand and gracefully recover from disruptions.

Summary

  • Today we're going to focus on practical tips for deployment and runtime patterns. We'll explore some of the patterns to build resilient cloud native applications. To truly harness the power of cloud native development, we must make resilience a foundational principle.
  • How do we deploy new code or features without taking our applications offline? Deployment patterns are strategic approaches to rolling out changes in a way that minimizes or even eliminates disruption to our users. We'll explore four key patterns: blue-green deployments, rolling updates, canary deployments, and dark launches.
  • You deploy your new feature or code changes completely behind the scenes, hidden from your users. This allows you to conduct live testing, collect real-world performance data, and even gather feedback from targeted, selected groups. Once you're confident in the new feature, you simply flip the switch.
  • Runtime resiliency patterns provide mechanisms to cope with partial failures, network issues, and surging incoming traffic. Timeouts and retries, rate limiting, bulkheads, and circuit breakers help prevent single failures from cascading through your system.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome to Conf42 Cloud Native 2024. My name is Ricardo Castro, and today we're going to talk about architecting resilient cloud native applications. Specifically, we're going to focus on practical tips for deployment and runtime patterns. What do we have on the menu for today? We'll explore some of the patterns used to build resilient cloud native applications. Here's what we'll cover. First, deployment patterns. We'll see blue-green deployments, exploring seamless transitions that minimize downtime during updates. We'll see rolling updates, minimizing user disruption through gradual code changes. We'll explore canary deployments, examining strategies for controlled rollouts and early feedback. And to end with deployment patterns, we'll see dark launches, delving into pre-release testing and feature gating. After that, we'll explore runtime resilience patterns. We'll look at popular patterns like timeouts and retries, and see how to prevent applications from stalling due to slow dependencies. We'll also see rate limiting, a technique to ensure fair resource allocation and prevent overload. We'll cover bulkheads and understand how to isolate failures and improve overall stability. And finally, we'll see circuit breakers and learn how to protect critical services from cascading failures.

The world of cloud native applications promises remarkable benefits: incredible scalability, rapid development cycles, and the agility to meet ever-changing business demands. But there's an inherent tradeoff. This distributed architecture, with its reliance on microservices, external APIs, and managed infrastructure, introduces a unique set of fragility concerns. To truly harness the power of cloud native development, we must make resilience a foundational principle. This means ensuring our applications can withstand component failures, network glitches, and unexpected traffic surges, all while maintaining a seamless user experience.

Let's start by addressing the core challenge of updates: how do we deploy new code or features without taking our applications offline? This is where deployment patterns come into play. These are strategic approaches to rolling out changes in a way that minimizes or even eliminates disruption to our users. We'll explore four key patterns: blue-green deployments, rolling updates, canary deployments, and dark launches.

The concept behind blue-green deployments is elegantly simple. You maintain two identical production environments: blue is your current live environment, the one serving users, while green stands by, fully updated with the latest code. Once you're ready to deploy, you seamlessly redirect traffic from blue to green. The beauty lies in the rollback capability: if any issues arise, you can instantly switch back to blue. This offers a safety net for large updates, minimizing the potential for downtime. Let's see an example. Imagine we have a set of users accessing v1 of our application through a load balancer. We deploy v2 of our service and test that everything is okay. Once we're confident, we redirect traffic from the load balancer to v2. If any error arises, we simply switch back from v2 to v1.
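To make that switch tangible, here's a minimal sketch in Go of the traffic-flipping idea: a reverse proxy holds a pointer to the active environment, and swapping that pointer cuts traffic over to green, or instantly rolls it back to blue. The backend URLs and the admin endpoint are placeholders invented for this sketch; in a real setup the switch usually lives in your load balancer or service mesh rather than in application code.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

// Two identical production environments; these URLs are placeholders.
var (
	blue, _  = url.Parse("http://blue.internal:8080")
	green, _ = url.Parse("http://green.internal:8080")
	active   atomic.Pointer[url.URL] // which environment receives live traffic
)

func main() {
	active.Store(blue) // start with blue serving users

	// The proxy forwards every request to whichever environment is active.
	proxy := &httputil.ReverseProxy{
		Director: func(r *http.Request) {
			target := active.Load()
			r.URL.Scheme = target.Scheme
			r.URL.Host = target.Host
		},
	}

	// Flipping the pointer is the blue/green switch: cut over to green,
	// or instantly roll back to blue if errors show up.
	http.HandleFunc("/admin/switch", func(w http.ResponseWriter, r *http.Request) {
		if active.Load() == blue {
			active.Store(green)
		} else {
			active.Store(blue)
		}
		fmt.Fprintf(w, "now serving: %s\n", active.Load().Host)
	})

	http.Handle("/", proxy)
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```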
Rolling updates offer a controlled approach to introducing new code changes. Instead of updating the entire environment in one go, the new version is gradually rolled out across your instances. This is like changing the tires of a moving car one at a time, allowing you to minimize any potential disruption. With each updated instance, you carefully monitor for errors or unexpected behavior. If any issue arises, the rollout can be paused or reversed, limiting the impact. Rolling updates are particularly well suited to containerized environments, where tools like Kubernetes can seamlessly manage the process. In this example, we have v1 of our application deployed. One node at a time, we start replacing v1 with v2 of our application. If at any point we see a problem, we can simply stop the rollout and even reverse it to the previous version. This doesn't mean we have to update exactly one node at a time; we can update a percentage of nodes in each step. The idea is that the change is rolling, with a subset of replicas updated at each step.

Canary deployments take their name from the historical practice of miners bringing canaries underground. These birds were sensitive to dangerous gases, alerting the miners to potential hazards. Similarly, a canary deployment exposes your new code to a small subset of users. You closely monitor key metrics, watching for any performance degradation or errors. If all goes well, you gradually roll out the update to a larger and larger segment of your audience. This cautious approach helps catch issues early, before they impact your entire user base. So in this example, we start by rolling out the new version, v2, of our service. Then we start gradually shifting a percentage of our users to v2. If everything goes well, we increase the share of traffic switched from v1 to v2, eventually arriving at 100% of users on v2. If any problem arises along the way, we can simply switch back and continue using v1.

Our last deployment pattern, dark launches, introduces a fascinating twist. You deploy your new feature or code changes completely behind the scenes, hidden from your users. This allows you to conduct live testing, collect real-world performance data, and even gather feedback from targeted, selected groups. Once you're confident in the new feature, you simply flip the switch, making it instantly available to everyone. Dark launches are powerful when you need extensive pre-release validation or you want to gradually ramp up a feature's usage. The basic concept behind dark launches is this: you have a new feature, and you are able to select who can access it. You can use feature flags, for example, where you can turn a feature on and off and specify which users have access to it. You can also do things like requiring a specific header that only certain users provide in order to access the feature.
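Here's a minimal sketch of that gating idea in Go. The allowlist, the rollout percentage, and the header names are hypothetical, invented for illustration; a real system would read them from a feature-flag service or configuration store. Hashing the user ID keeps each user consistently inside or outside the rollout, which is also how a canary-style percentage ramp-up can be driven at the application level.

```go
package main

import (
	"hash/fnv"
	"log"
	"net/http"
)

// Hypothetical flag configuration; real systems would fetch this from a
// feature-flag service or config store, not hard-code it.
var (
	allowedUsers   = map[string]bool{"qa-team": true, "beta-tester-42": true}
	rolloutPercent = uint32(5) // canary-style: 5% of users get the new path
)

// enabled decides whether this request sees the dark-launched feature.
func enabled(r *http.Request) bool {
	user := r.Header.Get("X-User-ID") // assumed header carrying user identity
	if allowedUsers[user] {
		return true // explicitly targeted test group
	}
	if r.Header.Get("X-Feature-Preview") == "new-checkout" {
		return true // opt-in via a special header
	}
	// Hash the user ID so each user consistently falls in or out of the
	// rollout percentage as it ramps up.
	h := fnv.New32a()
	h.Write([]byte(user))
	return h.Sum32()%100 < rolloutPercent
}

func checkout(w http.ResponseWriter, r *http.Request) {
	if enabled(r) {
		w.Write([]byte("new checkout flow (dark-launched)\n"))
		return
	}
	w.Write([]byte("current checkout flow\n"))
}

func main() {
	http.HandleFunc("/checkout", checkout)
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```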
We've addressed how to safely deploy changes, but what about the unpredictable events that can happen while your application is live? Runtime resiliency patterns provide mechanisms to cope with partial failures, network issues, and surging incoming traffic. Let's dive into some essential patterns: timeouts and retries, rate limiting, bulkheads, and circuit breakers.

Even in the best designed systems, components can become slow or unresponsive. Maybe a database is struggling, or an external service is down. In these scenarios, a timeout acts as a deadline: if a dependent service doesn't respond within a set amount of time, we stop waiting and signal failure. But that's just half of the equation. Retries give your application a second, or third, or fourth chance to succeed. A retry automatically repeats a failed request, often with increasing intervals, to avoid overwhelming the struggling service. This combination helps prevent single failures from cascading through your system, keeping things running as smoothly as possible. So in this example, service A makes a request to service B. Because we have no timeout, service B can take as long as it needs to give a response back. Here it takes 5 seconds, but maybe 5 seconds is unacceptable for us. So we set a timeout of, say, 3 seconds: after 3 seconds, we mark the request as a failure. In the case of retries, our service makes a request to a downstream service. If that comes back as an error, we can automatically retry until it eventually succeeds, or until we hit a limit on the number of retries we allow. It's important to note that the intervals between these retries usually increase, so that we don't overwhelm downstream services.

Rate limiting acts as a traffic cop for your applications. It controls the incoming flow of requests, preventing sudden spikes from overwhelming a service. Think of it like a line outside a popular club: only a certain number of people get in at a time. Rate limiting is also crucial for fairness: it ensures that a single user or a burst of requests cannot monopolize resources and cause slowdowns for everyone else. It's also a protective measure against malicious attacks designed to flood your system. So in this example, we have a client making requests to an API. If the client makes too many requests, the rate limiter sends a "too many requests" response back to the client, preventing it from affecting other users or flooding the system on purpose.

The bulkhead pattern draws inspiration from ship design. Ships are compartmentalized, so if one area floods, the whole ship doesn't sink. We can apply this to cloud native applications as well. By isolating different services or functionalities, we might limit the number of concurrent connections to a backend component or allocate fixed memory resources. The key idea is this: if one part of your system fails, the failure doesn't spread uncontrollably, potentially taking down your entire application. Bulkheads help maintain partial functionality, improving the overall experience. In the example we have here, clients access a service. If we only have one service instance, then if that instance is overwhelmed, all clients are affected. It's common practice to split the service into several instances, so that if one replica is overwhelmed, only the clients using that replica are affected. This can be extrapolated to entire features: if one feature of your system is overwhelmed or has a problem, the other features don't stop working as well.
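As a rough illustration of how these runtime guards fit together on the calling side, here's a minimal Go sketch combining a per-request timeout, retries with increasing intervals, and a channel-based bulkhead that caps concurrent calls to one dependency. The specific limits (a 3-second timeout, 4 attempts, 10 concurrent calls) and the service URL are arbitrary choices for the sketch.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"time"
)

// Bulkhead: a buffered channel caps concurrent calls to this dependency,
// so one slow backend can't tie up every goroutine in the process.
var slots = make(chan struct{}, 10)

// Timeout: give up on any single request after 3 seconds.
var client = &http.Client{Timeout: 3 * time.Second}

func callWithResilience(url string) ([]byte, error) {
	// Acquire a bulkhead slot, or fail fast instead of queueing.
	select {
	case slots <- struct{}{}:
		defer func() { <-slots }()
	default:
		return nil, errors.New("bulkhead full: too many concurrent calls")
	}

	backoff := 200 * time.Millisecond
	var lastErr error
	for attempt := 1; attempt <= 4; attempt++ {
		resp, err := client.Get(url)
		switch {
		case err != nil:
			lastErr = err // timeout or network failure: retry
		case resp.StatusCode >= 500:
			resp.Body.Close()
			lastErr = fmt.Errorf("server error: %s", resp.Status)
		default:
			body, readErr := io.ReadAll(resp.Body)
			resp.Body.Close()
			if readErr == nil {
				return body, nil // success: stop retrying
			}
			lastErr = readErr
		}
		if attempt < 4 {
			time.Sleep(backoff)
			backoff *= 2 // increasing intervals between retries
		}
	}
	return nil, fmt.Errorf("giving up after 4 attempts: %w", lastErr)
}

func main() {
	if body, err := callWithResilience("http://service-b.internal/data"); err != nil {
		fmt.Println("request failed:", err)
	} else {
		fmt.Printf("got %d bytes\n", len(body))
	}
}
```

Server-side rate limiting, by contrast, is usually enforced at an API gateway or with token-bucket middleware in front of the service, rather than in client code like this.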
Circuit breakers: think of the circuit breakers in your home. They prevent electrical overload by cutting off the flow of power if there's a surge. In software, the principle is similar. When a service repeatedly fails, the circuit breaker pattern trips, temporarily blocking all calls to that service. This prevents fruitless retries from clogging up the network and lets the failing service potentially recover. After a set period, the circuit breaker tries to let some requests through. If they succeed, the service is considered healthy again and normal operations resume. In this example, we see an HTTP request arriving at a circuit breaker command. The circuit breaker checks: is the circuit open? If it is open, it automatically returns a failure; you are not allowed to make this request at this point in time. If the circuit is closed, we execute the wrapped command and, if everything went okay, return a good result to the user. This diagram also shows that we can use these patterns in combination: I can use a circuit breaker, but also timeouts and retries whose thrown exceptions determine whether a request counts as okay or not.
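To make the states concrete, here's a minimal sketch of a three-state circuit breaker in Go: closed (calls flow normally), open (calls fail fast), and half-open (after a cool-down, a probe call is allowed through). The threshold and cool-down values are illustrative, and a production breaker, typically taken from a library rather than hand-rolled, would also limit how many half-open probes run concurrently.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

var ErrCircuitOpen = errors.New("circuit open: failing fast")

// CircuitBreaker trips after `threshold` consecutive failures and stays
// open for `cooldown`; after that, a probe call is let through (half-open).
type CircuitBreaker struct {
	mu        sync.Mutex
	threshold int
	cooldown  time.Duration
	failures  int
	open      bool
	openedAt  time.Time
}

func (cb *CircuitBreaker) Call(fn func() error) error {
	cb.mu.Lock()
	if cb.open {
		if time.Since(cb.openedAt) < cb.cooldown {
			cb.mu.Unlock()
			return ErrCircuitOpen // open: block the call outright
		}
		// Cool-down elapsed: half-open, allow this probe through.
	}
	cb.mu.Unlock()

	err := fn() // execute the wrapped command

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		if cb.failures >= cb.threshold {
			cb.open = true // trip the breaker
			cb.openedAt = time.Now()
		}
		return err
	}
	// Success: the service is considered healthy again; resume normal operation.
	cb.failures = 0
	cb.open = false
	return nil
}

func main() {
	cb := &CircuitBreaker{threshold: 5, cooldown: 30 * time.Second}
	_ = cb.Call(func() error {
		// The real call to the protected service goes here, typically
		// itself wrapped with a timeout and retries.
		return nil
	})
}
```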
In today's digital landscape, resiliency isn't a luxury; it's a fundamental requirement for any application that demands continuous uptime and a positive user experience. By thoughtfully applying the deployment and runtime patterns we've discussed, you lay the groundwork for systems that are not just fast and scalable, but truly robust. The result is peace of mind, knowing that your applications can weather the inevitable storms of the cloud native world. And this was all from my part. I hope this talk was informative for you, and don't hesitate to contact me through social media. Thank you very much, and have a good conference.

Ricardo Castro

Principal Engineer, SRE @ FanDuel / Blip



