Conf42 DevSecOps 2025 - Online

- premiere 5PM GMT

From Blue to Green Without Seeing Red: Mastering Deployments on AWS

Abstract

How can you implement a Blue/Green deployment strategy on AWS? What are the benefits — and the trade-offs? These are some of the questions this talk will answer through a real-world case study.

We’ll explore how Lambda@Edge can be used to meet this need for a web application delivered via CloudFront, and how to adapt this implementation when caching comes into play. Expect discussions around architecture, optimization, and performance.

This is a hands-on field experience, illustrated by the technical decisions made and the challenges encountered along the way.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, welcome to my talk. My name is Rayenn and I'm a DevOps engineer at Lenstra. Today we're going to look at how to implement a blue/green deployment strategy on AWS: what benefits it brings, what trade-offs it comes with, and in particular how Lambda@Edge can power blue/green deployments, and how this approach must adapt when caching becomes part of the equation.

Before we begin, let me first introduce Lenstra, the company I work for. Lenstra is a consulting firm specialized in managing the entire cloud stack. We build and secure cloud environments, we organize and operate data platforms, and we develop API-driven or AI-powered solutions to help our clients address their challenges efficiently. We primarily work in the luxury, banking, and insurance sectors, and we're a team of 80 people across Europe, with our headquarters in Paris. We're growing quickly and always looking for new clients, so if you're interested, feel free to reach out.

This talk is inspired by the work done for one of our clients, but it will go a bit beyond this single project, because my goal is for most of you to walk away with something applicable to your own context. So let's begin.

It's always better to understand why we're doing things, so let's start with this simple question: why? Here is the starting point. This is a pretty classic AWS architecture for delivering a web application, nothing fancy. Of course, this diagram is a very simplified view of reality, but the idea is that CloudFront serves all static files (HTML, CSS, JavaScript) from an S3 bucket, and for the backend, CloudFront forwards API calls to an Application Load Balancer, which itself distributes traffic across an ECS cluster running on Fargate. So the compute is serverless, scalable, and fully managed.
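To make this starting point concrete, here is a minimal sketch of the relevant fragments of such a CloudFront distribution configuration. All names and domains are illustrative placeholders, not the actual project's values.

```python
# Illustrative fragment of a CloudFront distribution configuration:
# static assets come from S3, and /api/* requests go to the ALB.
origins = [
    {
        "Id": "static-assets",  # S3 origin for HTML/CSS/JS
        "DomainName": "my-app-assets.s3.amazonaws.com",
    },
    {
        "Id": "api-backend",    # ALB in front of the ECS/Fargate cluster
        "DomainName": "my-app-alb-123456789.eu-west-1.elb.amazonaws.com",
    },
]

cache_behaviors = [
    # API calls are proxied to the ALB and should not be cached.
    {"PathPattern": "/api/*", "TargetOriginId": "api-backend"},
]

# Everything else (the static frontend) is served and cached from S3.
default_cache_behavior = {"TargetOriginId": "static-assets"}
```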
Even though this looks good, we were experiencing a high number of incidents during production releases. Rollbacks were slow and complex, really painful, and this caused a lot of operational stress for the team, and sometimes even downtime during deployments. We definitely needed a safer, more controlled way to deploy our application. And this brings us to the next part: blue/green deployments.

Blue/green deployment is a release strategy where we maintain two identical, production-ready environments. One is live, that's the blue environment, and the other, the green environment, hosts the new version being deployed. The idea is simple: we prepare the new release in green, we run all of our tests, we validate that everything behaves correctly, and when we're confident, we switch traffic from blue to green instantly, without impacting the users at all. And if anything goes wrong, we can just flip traffic back to blue without any issue.

How does that help us? In our case, first, we could test the new version in real-world conditions before switching traffic. Rollbacks also became fast and low risk, essentially a single action. In the end, releases became safer and more predictable, and importantly, we achieved zero downtime during deployments.

Of course it's not perfect; there are a few pitfalls to be aware of. The first one is that you're temporarily running two environments, which of course means higher costs. You also need to think very carefully about state and database synchronization, especially for applications that often have schema changes. In our case, we didn't manage any database, we only had a few Elasticsearch instances, so we didn't have to worry about that, but this can become very tricky in some situations. And finally, it adds some configuration complexity, depending on how routing is set up in your architecture.

Okay, now we're going to look at several possible configurations for implementing blue/green deployments based on the architecture we just saw. Not all of these approaches made sense in our specific situation, but I'll still walk through them, because depending on your own environment, some of these options might be the right fit for you.

The first option to implement blue/green is simply to use Route 53. This is usually the most straightforward approach: you point your domain to the blue environment, and when you're ready, you update the DNS record to point to the green environment. You can also use weighted routing, a Route 53 feature where you gradually shift a defined percentage of traffic from blue to green, but in practice, DNS caching makes this difficult to control with precision. The main benefits of this method are that it's very simple and that it's universal: it works no matter which architecture you have, whether you're serving static files or APIs, or have a microservice architecture. But it has a few drawbacks. First, it's dependent on DNS caching, so the traffic switch is never instant, and you can't fully predict how long clients will hold on to old records. You also get very poor control over individual requests, because you can't route using, for example, cookie or header logic. And finally, this setup requires duplicating the entire infrastructure, so that's also something to keep in mind.
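For illustration, here is a minimal sketch of what shifting weighted traffic could look like with boto3. The hosted zone ID, record name, and target domains are placeholders, not values from the talk.

```python
import boto3

route53 = boto3.client("route53")

def set_weights(blue_weight: int, green_weight: int) -> None:
    """Upsert two weighted records sharing the same name; Route 53 answers
    DNS queries proportionally to each record's weight."""
    changes = []
    for set_id, target, weight in [
        ("blue", "blue.internal.example.com", blue_weight),
        ("green", "green.internal.example.com", green_weight),
    ]:
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "CNAME",
                "SetIdentifier": set_id,  # distinguishes the weighted records
                "Weight": weight,
                "TTL": 60,  # keep low so clients re-resolve quickly
                "ResourceRecords": [{"Value": target}],
            },
        })
    route53.change_resource_record_sets(
        HostedZoneId="Z0123456789EXAMPLE",
        ChangeBatch={"Changes": changes},
    )

# Gradually shift traffic, e.g. 90/10, then 50/50, then 0/100.
set_weights(90, 10)
```

Even with a low TTL, resolvers and operating systems may cache answers longer, which is exactly the lack of precision mentioned above.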
The next option is to handle blue/green routing directly at the edge, using Lambda@Edge. Just so everyone is on the same page: Lambda@Edge is a CloudFront feature that lets you run lightweight functions inside AWS edge locations, before the request even reaches your backend, your origin. This gives you the ability to inspect the incoming request (cookies, headers, the path, basically anything the request contains) and then to modify the request fields dynamically. The routing in that case works by modifying the request before it hits the origin. For example, the Lambda function can change the origin domain to point to the blue or the green backend, or it can modify the S3 path prefix to serve a different version of the frontend. From the user's perspective, everything is transparent, but we decide behind the scenes which version they actually get. This gives two big benefits. First, it's very dynamic and highly flexible routing, because we can choose the version per user, per cookie, or per request. Second, it has very low latency, since all of the logic runs directly at the edge, close to the user. Of course, there are some drawbacks. This approach introduces more complexity, because you now need to manage routing logic inside CloudFront's request lifecycle and handle cache interactions carefully; we'll come back to that later. And updating Lambda@Edge functions has a propagation delay, since the new code needs to be deployed to every AWS edge location worldwide; we'll also come back to that later.

Okay, before we look at the solution we actually implemented, I want to quickly mention three other AWS options for blue/green deployments. They are all valid approaches; they didn't really make sense in our specific context, but they might be useful for you. The first one is API Gateway. It offers very flexible routing, and you can switch versions using headers, cookies, or stage variables, so it's a great solution for APIs and microservices. But it doesn't work for S3-hosted static frontends, which is why it didn't fit our architecture, and we didn't have this service already configured in our case. The next option is ALB (Application Load Balancer) weighted routing: you can set up two target groups, blue and green, and assign weights to shift the traffic between them. It's very fast, and the rollback is also very fast. But even if it works well for APIs and microservices, it doesn't support static S3 content, and compared to API Gateway, it doesn't offer all the advanced routing logic, for example cookie-based rules. And finally, we have CodeDeploy. This one gives you a fully automated blue/green workflow with health checks, lifecycle hooks, and automatic rollback if something goes wrong. It's a really great tool, but it also introduces significant setup overhead, especially if you don't have it already configured.

Okay, so now we're going to see how our implementation of blue/green deployment works with Lambda@Edge. The idea is that we implemented the blue/green routing directly at the edge, using a cookie-based approach. Let me explain how it works. First, as you can see, each environment, blue and green, has its own Application Load Balancer with its own ECS cluster, and for the frontend part, we have different prefixes in the same S3 bucket. The idea is that we use Lambda@Edge to modify the request fields so that the request points to the correct origin or the correct S3 prefix, depending on the cookie value. Based on this cookie, the function decides whether the user should see the blue version or the green version. The cookie can store something meaningful but difficult to guess, for example the commit hash of the application release. Cookies were a great tool in our case because they let us take decisions per user to preserve their experience: once a user is assigned a version, we want them to stay on that version, so that they don't bounce between releases while navigating the app. And that's especially true if there are breaking changes between two versions. We also added a new component, a developer override cookie. This cookie has priority over everything, and it allows developers or QA to force a specific version directly in their browser. This prevents them from being switched automatically, even if they refresh the page or anything else they can do. This is crucial for testing and debugging use cases.
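To make this concrete, here is a minimal sketch of what such an origin-request handler could look like in Python; the routing rules it encodes are detailed just below. The cookie names, release identifiers, and ALB domains are illustrative assumptions, not the project's actual code.

```python
# Sketch of a Lambda@Edge origin-request handler for cookie-based
# blue/green routing. Release IDs could be commit hashes, as in the talk.
RELEASE_COOKIE = "release"
DEV_COOKIE = "release-override"     # developer override, always wins
BLUE, GREEN = "abc1234", "def5678"  # placeholder release identifiers
ALB = {BLUE: "blue-alb.example.com", GREEN: "green-alb.example.com"}
ACTIVE_RELEASE = BLUE               # in practice, fetched from configuration

def get_cookie(request, name):
    """Extract a cookie value from the CloudFront request headers."""
    for header in request["headers"].get("cookie", []):
        for part in header["value"].split(";"):
            key, _, value = part.strip().partition("=")
            if key == name:
                return value
    return None

def handler(event, context):
    request = event["Records"][0]["cf"]["request"]
    # The developer override cookie takes priority over the release cookie.
    release = get_cookie(request, DEV_COOKIE) or get_cookie(request, RELEASE_COOKIE)
    if release not in (BLUE, GREEN):
        release = ACTIVE_RELEASE  # no (valid) cookie: use the active release

    if request["uri"].startswith("/api/"):
        # Backend call: point the custom origin at the matching ALB.
        domain = ALB[release]
        request["origin"]["custom"]["domainName"] = domain
        request["headers"]["host"] = [{"key": "Host", "value": domain}]
    else:
        # Static asset: rewrite the S3 key to the matching version prefix.
        request["uri"] = "/" + release + request["uri"]
    return request
```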
So here's the routing logic. If a version cookie is already present, we keep the user on that version; this preserves the user's context and ensures a smooth experience. If no cookie is present, or if the app is being reloaded (for example on a full refresh that requests new HTML content), we assign the user to the currently active release, and we set the cookie at that moment for future requests. This lets us move users only when it's needed. And the last rule: if the developer cookie is present, it always wins, so the user is explicitly pinned to that version, regardless of reloads or global switches. This combination gives us a controlled, flexible, and user-friendly blue/green system, while still letting developers target specific versions instantly.

But there's still an issue: when we want to update the active version in the Lambda@Edge code, there's a propagation delay, which means that updating the Lambda code is slow and lacks predictability. So we found a way to improve this implementation, and that's what I'm going to explain next. For the second iteration of the project, the idea was to avoid redeploying the Lambda@Edge function every time a value changes. To achieve this, we externalized all dynamic configuration into the SSM Parameter Store, which allows us to store the active version outside of the code. The Lambda then only has to retrieve this configuration when needed, and to prevent a network request on every invocation, we cache the release information in memory for a short duration. This preserves the performance benefits of running at the edge, while still allowing configuration updates without redeployment.
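Here is a minimal sketch of that lookup, with a short in-memory cache that survives across invocations of a warm Lambda container. The parameter name, region, and TTL are illustrative assumptions.

```python
import time
import boto3

# Lambda@Edge functions are created in us-east-1; calling SSM in a fixed
# region from other edge locations adds latency, which is why the result
# is cached in memory for a short time.
ssm = boto3.client("ssm", region_name="us-east-1")
PARAM_NAME = "/my-app/active-release"  # placeholder parameter name
CACHE_TTL_SECONDS = 60

_cached_release = None
_cached_at = 0.0

def get_active_release() -> str:
    """Return the active release ID, refreshing from SSM at most once per TTL."""
    global _cached_release, _cached_at
    now = time.monotonic()
    if _cached_release is None or now - _cached_at > CACHE_TTL_SECONDS:
        response = ssm.get_parameter(Name=PARAM_NAME)
        _cached_release = response["Parameter"]["Value"]
        _cached_at = now
    return _cached_release
```

Module-level variables persist between invocations on the same container, so most requests are served without any SSM call.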
Okay, there's something we haven't talked about yet, and it's one of CloudFront's key capabilities: the ability to cache content at edge locations. We'll now see how it requires some adjustments to the implementation we just saw. Before diving into the adjustments, there is one essential concept we need to clarify, and that's how CloudFront processes requests and, more importantly, how caching happens. CloudFront has four possible interception points for Lambda@Edge. The first one is the viewer request, which happens just after the user makes the request. Then there's the origin request, when CloudFront decides it needs to contact the origin. Then the origin response, when the origin sends data back to CloudFront. And finally the viewer response, just before the response goes back to the user. The key here is to understand that the cache sits between the viewer side and the origin side. This means that CloudFront will only reach the origin request phase if the requested object is not already cached at the edge location; if the content is cached, the origin request and origin response phases never happen. That's why we need to decide really carefully which logic runs in the viewer request and which logic runs in the origin request, to ensure that blue/green works correctly with static assets, because this only concerns the static assets, so the S3 origin in our case.

And here's the final workflow we implemented. On the viewer request side, if a developer override cookie is present, we inject a unique identifier into the request so that it bypasses cached objects. This guarantees that developers, as I mentioned previously, always see the most up-to-date content for their testing purposes. Then, if the cache isn't hit, the origin request phase starts and the Lambda@Edge function is triggered with the following logic: if a release cookie is already set and the app is not loading, we route the request to the correct blue or green prefix, depending on the cookie; otherwise, we retrieve the currently active release from SSM and route accordingly. Then, on the origin response, so before the response is cached and sent back to the viewer, if the app is loading, we choose that moment to assign the release cookie, and this ensures consistent routing for the subsequent requests.

This is also why we perform a cache invalidation whenever the active release changes: the next request without a release cookie, or the next request that reloads the application, will not be a cache hit. Instead, CloudFront will trigger the origin request Lambda, where we decide the version to serve, and the origin response Lambda, where we assign the version cookie. And once this cookie is set, the following requests will consistently hit the cache entries that correspond to the assigned version. So the invalidation ensures that version assignment always happens correctly at the moment a user first loads or reloads the application, and this combination ensures that caching works with our blue/green strategy rather than against it.
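For illustration, here are minimal sketches of the last two pieces: an origin-response handler that assigns the release cookie when the app loads, and the invalidation issued when the active release changes. The cookie attributes, the index.html convention, and the distribution ID are assumptions, not the project's actual code.

```python
import time
import boto3

def origin_response_handler(event, context):
    """Attach the release cookie when the HTML shell is served, so
    subsequent requests from this user stick to the same version."""
    cf = event["Records"][0]["cf"]
    request, response = cf["request"], cf["response"]
    if request["uri"].endswith("/index.html"):  # the "app is loading" case
        release = request["uri"].split("/")[1]  # prefix set at origin request
        response["headers"].setdefault("set-cookie", []).append({
            "key": "Set-Cookie",
            "value": "release=%s; Path=/; Secure; HttpOnly" % release,
        })
    return response

def invalidate_entry_point(distribution_id: str) -> None:
    """Invalidate the cached HTML shell so the next app load goes through
    the origin-request phase and picks up the new active release."""
    cloudfront = boto3.client("cloudfront")
    cloudfront.create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch={
            "Paths": {"Quantity": 1, "Items": ["/index.html"]},
            "CallerReference": str(time.time()),  # must be unique per call
        },
    )
```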
So, to wrap things up, here's what we were able to achieve. First, we now have low-latency, cookie-based routing that lets us switch traffic between blue and green instantly and safely. Promoting a new version or rolling back takes less than a minute, with no propagation delay and fully predictable behavior. We also implemented cache-aware static frontend management, which means that version changes are immediate and consistent. And most importantly, we have had zero production incidents during releases since adopting this strategy.

Now, there are of course a few additional considerations that we didn't cover today, but that are still very relevant, especially when scaling this approach. For example, we didn't talk about coordinating blue/green state across multiple components, across multiple backend services, databases, and microservices. This can become very complex, especially when shared state or schema changes are involved. There's also the idea of progressive validation, where only a subset of users is assigned to the new version before switching everyone over; this gives you an additional layer of safety for large or risky deployments. And finally, integration with CI/CD pipelines: that's how we would fully automate the promotion, rollback, and validation steps, making the entire release process more reliable and repeatable.

So, thank you for your time. I hope this talk gave you some useful insights and ideas you can apply in your own context. If you'd like to continue the conversation, collaborate with us, or join the team, please feel free to reach out using the links here, or contact me directly by email or LinkedIn. Thanks again, and have a great day.

Rayenn Hamrouni

DevOps Engineer @ Lenstra

Rayenn Hamrouni's LinkedIn account


