Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone.
Welcome to my talk.
My name is Ryan and I'm a DevOps engineer at Lens Straw.
Today we're going to look at how to implement a blue-green
deployment strategy on AWS: what benefits it brings, what trade-offs it
comes with, and then we'll focus on how Lambda@Edge can power blue-green
deployments and how this approach must adapt when caching becomes part
of the equation.
So before we begin, let me first introduce Lens Straw,
which is the company I work for.
So Lens Straw is a consulting firm specialized in managing
the entire cloud stack.
We build and secure cloud environments, we organize and operate data
platforms, and we develop API-driven or AI-powered solutions to help our
clients address their challenges efficiently.
We primarily work in the luxury, banking, and insurance sectors.
We're a team of 80 people across Europe, with our headquarters in Paris.
We're growing pretty quickly and always looking for new clients,
so if you're interested, feel free to reach out.
This talk is actually inspired by the work done for one of our clients,
but it'll go a bit beyond this single project, because my goal is
that most of you walk away with something applicable to your own context.
So let's begin.
And it's always better to understand why we are doing things,
so let's start with this simple question: why?
Here is the starting point.
So this is a pretty classic AWS architecture to deliver a web application.
Nothing fancy.
Of course, this diagram is a very simplified view of reality, but the idea
is that CloudFront serves all static files, so HTML, CSS, and JavaScript,
from an S3 bucket, and for the backend, CloudFront forwards API calls to an
application load balancer, which itself distributes traffic across an ECS
cluster running on Fargate.
So the compute is serverless, scalable, and fully managed.
And even if this looks good, we were experiencing a high number
of incidents during production releases.
Rollbacks were slow and complex, so really painful, and this caused a lot
of operational stress for the team and sometimes even downtime during
deployments.
So we definitely needed a safer, more controlled way to deploy our
application.
And this brings us to the next part: blue-green deployments.
Blue-green deployment is a release strategy where we maintain two
identical, production-ready environments.
One is live, here that's the blue environment, and the other, the green
environment, hosts the new version that is deployed.
The idea is simple.
We prepare the new release in green, we run all of our tests, and we
validate that everything behaves correctly.
When we're confident, we switch traffic from blue to green instantly and
without impacting the users at all.
And if anything goes wrong, we can just flip traffic back to blue
without any issue.
So how does that help us?
In our case, first, we could test the new version in real-world
conditions before switching traffic.
Also, rollbacks became fast and low risk, essentially a single action.
So in the end, releases become safer and more predictable.
And importantly, we achieve zero downtime during deployments.
Well, of course, it's not perfect.
There are a few pitfalls to be aware of.
The first one being that you're temporarily running two environments,
which, of course, means higher costs.
You also need to think very carefully about state and database
synchronization, especially for applications that often have schema changes.
In our case, we didn't manage any database; we only had a few Elasticsearch
instances, so we didn't have to worry about that.
But this can become very tricky in some cases.
And finally, it adds some configuration complexity depending on how your
routing is set up in your architecture.
Okay, now we're going to look at several possible configurations for
implementing blue-green deployments based on the architecture we just saw.
Not all of these approaches made total sense in our specific situation, but
I'll still walk through them because, depending on your own environment,
some of these options might be the right fit for you.
The first option to implement blue-green is simply to use Route 53.
This is usually the most straightforward approach: you point your domain to
the blue environment, and when you're ready, you update the DNS record to
point to the green environment.
You can also use weighted routing, a Route 53 feature where you gradually
shift a defined percentage of traffic from blue to green.
But in practice, DNS caching makes this difficult to control with precision.
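Just to illustrate the weighted approach, here is a minimal boto3 sketch of
what shifting weights could look like; the hosted zone ID, record name, and
ALB hostnames are placeholders I made up, not values from the project.

```python
# Minimal sketch: shift a percentage of traffic from blue to green with
# Route 53 weighted records. Zone ID, record name, and targets are
# hypothetical placeholders.
import boto3

route53 = boto3.client("route53")

def set_weights(blue_weight: int, green_weight: int) -> None:
    """Upsert the weighted CNAME records for the blue and green environments."""
    changes = []
    for env, weight, target in [
        ("blue", blue_weight, "blue-alb.example.com"),
        ("green", green_weight, "green-alb.example.com"),
    ]:
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "api.example.com",
                "Type": "CNAME",
                "SetIdentifier": env,   # distinguishes the two weighted records
                "Weight": weight,       # relative share of DNS responses
                "TTL": 60,              # keep low so clients re-resolve quickly
                "ResourceRecords": [{"Value": target}],
            },
        })
    route53.change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",
        ChangeBatch={"Changes": changes},
    )

# Example: send 10% of traffic to green.
# set_weights(blue_weight=90, green_weight=10)
```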
So the main benefits of this method are that it's very simple
and that it's universal.
It works no matter which architecture you have: whether you're serving
static files, APIs, or a microservice architecture, everything works with
that setup.
But it has a few drawbacks.
First, it depends on DNS caching, so the traffic switch is never instant
and you can't fully predict how long clients will hold on to old records.
You also get very poor control over individual requests, because you can't
route using, for example, cookie- or header-based logic.
And finally, this setup requires duplicating the entire infrastructure, so
that's also something to keep in mind.
Okay, the next option is to handle blue-green routing directly
at the edge using Lambda@Edge.
So, just so everyone is on the same page: Lambda@Edge is a CloudFront
feature that lets you run lightweight functions inside AWS edge locations,
before the request even reaches your backend, your origin.
This gives you the ability to inspect the incoming request, so for example
cookies, headers, the path, basically anything the request contains, and
then to modify the request fields dynamically.
The routing in that case works by modifying the request
before it hits the origin.
For example, the Lambda function can change the origin domain to point to
the blue or the green backend, or it can modify the S3 path prefix to serve
a different version of the frontend.
From the user's perspective, everything is transparent, but we decide
behind the scenes which version they actually get.
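To give you an idea, a simplified Python sketch of such an origin-request
handler could look like this; the cookie name, domain names, and S3 prefixes
are illustrative placeholders, not our actual values.

```python
# Minimal sketch of a Lambda@Edge origin-request handler that routes a
# request to the blue or green backend (or S3 prefix) based on a cookie.
# Cookie name, domains, and prefixes are hypothetical.

ORIGINS = {
    "blue":  {"alb": "blue-alb.example.com",  "s3_prefix": "/blue"},
    "green": {"alb": "green-alb.example.com", "s3_prefix": "/green"},
}

def get_cookie(headers, name):
    """Return the value of a cookie from CloudFront request headers, or None."""
    for header in headers.get("cookie", []):
        for part in header["value"].split(";"):
            key, _, value = part.strip().partition("=")
            if key == name:
                return value
    return None

def handler(event, context):
    request = event["Records"][0]["cf"]["request"]
    version = get_cookie(request["headers"], "release-env") or "blue"
    target = ORIGINS.get(version, ORIGINS["blue"])

    if request["uri"].startswith("/api/"):
        # API calls: point the custom origin at the matching ALB.
        request["origin"]["custom"]["domainName"] = target["alb"]
        request["headers"]["host"] = [{"key": "Host", "value": target["alb"]}]
    else:
        # Static assets: serve from the matching prefix in the same S3 bucket.
        request["uri"] = target["s3_prefix"] + request["uri"]

    return request
```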
This gives two big benefits.
First, it's very dynamic and highly flexible routing, because we can choose
the version per user, per cookie, or per request.
And it has very low latency, since all of the logic runs directly at the
edge, close to the user.
Of course, there are some drawbacks.
This approach introduces more complexity, because you now need to manage
routing logic inside CloudFront's request lifecycle and handle cache
interactions carefully; we'll come back to that later.
And updating Lambda@Edge functions has a propagation delay, since
the new code needs to be deployed to every AWS edge location worldwide.
We'll also come back to that later.
Okay, before we look at the solution we actually implemented,
I want to quickly mention three other AWS options for blue-green deployments.
They are all valid approaches; they didn't really make sense in our specific
context, but they might be useful for you.
So the first one is API Gateway.
It offers very flexible routing and you can switch versions using
headers, cookies, or stage variables, so it's a great solution for APIs
and microservices.
But it doesn't work for S3-hosted static frontends, which is why it didn't
fit our architecture, and we didn't have this service already configured
in our case.
The next option is ALB weighted routing, with the application load balancer.
You can set up two target groups, blue and green, and assign weights
to shift the traffic between them.
It's very fast, and the rollback is also very fast.
But even if it works well for APIs and microservices, it doesn't support
static S3 content, and compared to API Gateway it doesn't offer all the
advanced routing logic, for example cookie-based rules.
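As a quick illustration, shifting weights between the two target groups can
be a single API call; the listener and target group ARNs here are
placeholders.

```python
# Minimal sketch: weighted forwarding between blue and green target groups
# on an ALB listener. ARNs are hypothetical placeholders.
import boto3

elbv2 = boto3.client("elbv2")

def shift_traffic(listener_arn, blue_tg_arn, green_tg_arn, green_percent):
    """Send `green_percent` of requests to the green target group."""
    elbv2.modify_listener(
        ListenerArn=listener_arn,
        DefaultActions=[{
            "Type": "forward",
            "ForwardConfig": {
                "TargetGroups": [
                    {"TargetGroupArn": blue_tg_arn,  "Weight": 100 - green_percent},
                    {"TargetGroupArn": green_tg_arn, "Weight": green_percent},
                ],
            },
        }],
    )

# Example: canary 5% of traffic to green, then later promote to 100%.
# shift_traffic(listener_arn, blue_tg, green_tg, green_percent=5)
```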
And finally we have CodeDeploy.
This one gives you a fully automated blue-green workflow with health checks,
lifecycle hooks, and automatic rollback if something goes wrong.
So it's a really great tool, but it also introduces significant setup
overhead, especially if you don't have it already configured.
Okay, so now we're going to see how our implementation of blue-green
deployment works with Lambda@Edge.
The idea is that we implemented the blue-green routing directly at the edge
using a cookie-based approach.
So let me explain how it works.
First, as you can see, each environment, blue and green, has its own
application load balancer with its own ECS cluster.
And for the frontend part, we have different prefixes in the same S3 bucket.
The idea is that we use Lambda@Edge to modify the request fields so that
the request points to the correct origin or the correct S3 prefix,
depending on the cookie value.
Based on this cookie, the function decides whether the user should see
the blue version or the green version.
The cookie can store something meaningful but difficult to guess,
for example the commit hash of the application release.
Okay.
So cookies were a great tool in our case, because they let us take decisions
per user and preserve their experience.
Once a user is assigned a version, we want them to stay on that version so
that they don't bounce between releases while navigating the app.
And that's especially true if there are breaking changes between two
versions.
We also added a new component, which is a developer override cookie.
This cookie has priority over everything, and it allows developers or QA to
force a specific version directly in their browser.
This prevents them from being switched automatically, even if they refresh
the page or anything else they might do.
This is crucial for testing and debugging use cases.
Okay, so here's the routing logic.
If a version cookie is already present, we keep the user on that version.
This preserves the user's context and ensures a smooth experience.
If no cookie is present, or if the app is being reloaded, for example on a
full refresh that requests new HTML content, we assign the user to the
currently active release, and we set the cookie at that moment for future
requests.
This lets us move users only when it's needed.
Okay.
And the last rule: if the developer cookie is present, it always wins,
so the user is explicitly pinned to that version regardless of reloads or
global switches.
This combination gives us a controlled, flexible, and user-friendly
blue-green system, while still letting developers target specific versions
instantly.
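To summarize the priority order, here is a small sketch of the decision
logic; the cookie names and the app-load flag are illustrative placeholders,
not the exact code we run.

```python
# Minimal sketch of the routing decision order described above.
# Cookie names ("dev-override", "release-env") and the active-release value
# are hypothetical placeholders.

def choose_version(cookies: dict, active_release: str, is_app_load: bool) -> str:
    # 1. Developer override always wins, so QA can pin a version in the browser.
    if "dev-override" in cookies:
        return cookies["dev-override"]

    # 2. A user who already has a version cookie stays on it while navigating,
    #    so they don't bounce between releases mid-session.
    if "release-env" in cookies and not is_app_load:
        return cookies["release-env"]

    # 3. Otherwise (no cookie, or a full reload that fetches new HTML),
    #    assign the currently active release; the cookie is set on the response.
    return active_release
```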
But there's still an issue: when we want to update the active version
in the Lambda@Edge code, there's a propagation delay, which means
that updating the Lambda code is slow and lacks predictability.
So we found a way to improve this implementation,
and that's what I'm going to explain in the next slide.
So for the second iteration of the project, the idea was to avoid
redeploying the Lambda@Edge function every time a value changes.
To achieve this, we externalized all dynamic configuration into
the SSM Parameter Store, which allows us to store the active
version outside of the code.
The Lambda then only has to retrieve this configuration when needed, and to
prevent a network request on every invocation, we cache the release
information in memory for a short duration.
This preserves the performance benefits of running at the edge while
still allowing configuration updates without redeployment.
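As an illustration, the fetch-and-cache pattern could look roughly like
this; the parameter name and the TTL value are assumptions for the sketch.

```python
# Minimal sketch: read the active release from SSM Parameter Store and cache
# it in the Lambda execution environment for a short TTL, so most
# invocations avoid a network call. Parameter name and TTL are hypothetical.
import time
import boto3

# Lambda@Edge functions are created in us-east-1; we assume the parameter
# lives there too.
ssm = boto3.client("ssm", region_name="us-east-1")

_cache = {"value": None, "expires_at": 0.0}
CACHE_TTL_SECONDS = 60

def get_active_release() -> str:
    now = time.time()
    if _cache["value"] is None or now >= _cache["expires_at"]:
        response = ssm.get_parameter(Name="/myapp/active-release")
        _cache["value"] = response["Parameter"]["Value"]  # e.g. "blue" or "green"
        _cache["expires_at"] = now + CACHE_TTL_SECONDS
    return _cache["value"]
```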
Okay.
There's something we haven't talked about yet, and that's one of
CloudFront's key capabilities: the ability to cache content at edge
locations.
We will see how it implies some adjustments to the implementation we just
saw.
Okay, so before diving into the adjustments, there is one essential
concept we need to clarify, and that's how CloudFront processes requests
and, more importantly, how caching happens.
So CloudFront has four possible interception points for Lambda@Edge.
The first one is the viewer request, which happens just after the user
makes the request.
Then the origin request, which is when CloudFront decides it needs to
contact the origin.
Then there's the origin response, when the origin sends data back to
CloudFront, and finally the viewer response, just before the response goes
back to the user.
The key here is to understand that the cache sits between the viewer side
and the origin side.
This means that CloudFront will only reach the origin request phase if the
requested object is not already cached at the edge location.
If the content is cached, the origin request and origin response phases
never happen, and that's why we need to really carefully decide which logic
runs in the viewer request and which logic runs in the origin request, to
ensure that blue-green works correctly with static assets, because this
only concerns the static assets, so the S3 origin in our case.
And that's the final workflow we implemented.
So on the viewer request side, if a developer preview cookie is present,
we inject a unique identifier into the request so that it bypasses cached
objects.
This guarantees that developers, as I mentioned previously, always see the
most up-to-date content, for testing purposes.
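A simplified sketch of that viewer-request step, assuming the query string
is part of the cache key and using an illustrative cookie name:

```python
# Minimal sketch of the viewer-request handler: when a developer preview
# cookie is present, append a unique value to the query string so the
# request misses the cache (assumes query strings are in the cache key).
# Cookie name and query parameter are hypothetical.
import uuid

def viewer_request_handler(event, context):
    request = event["Records"][0]["cf"]["request"]
    cookies = " ".join(h["value"] for h in request["headers"].get("cookie", []))

    if "dev-override=" in cookies:
        # Unique per request, so developers never get a stale cached object.
        bypass = f"dev-bypass={uuid.uuid4()}"
        request["querystring"] = (
            request["querystring"] + "&" + bypass if request["querystring"] else bypass
        )

    return request
```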
Then if the cache isn't hit, the origin request phase starts and the
Lambda@Edge function is triggered with the following logic: if a release
cookie is already set and the app is not loading, we route the request to
the correct blue or green prefix depending on the cookie value.
Otherwise, we retrieve the currently active release from SSM and we route
accordingly.
Then on the origin response, so before the response is sent back to
CloudFront, if the app is loading, we choose that moment to assign the
release cookie, and this ensures consistent routing for subsequent requests.
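As an illustration, a sketch of that origin-response step could look like
this; the cookie name and the way an "app load" is detected here are
simplifications, not the exact project code.

```python
# Minimal sketch of the origin-response handler: on an app load (a request
# for the HTML entry point), set the release cookie so subsequent requests
# stick to the version that was just served. Cookie name and attributes are
# hypothetical.

def origin_response_handler(event, context):
    cf = event["Records"][0]["cf"]
    request, response = cf["request"], cf["response"]

    # Treat requests for HTML documents as "app loads" in this sketch.
    if request["uri"].endswith((".html", "/")):
        # The origin-request handler already rewrote the URI to /blue or /green.
        version = "green" if request["uri"].startswith("/green") else "blue"
        response["headers"].setdefault("set-cookie", []).append({
            "key": "Set-Cookie",
            "value": f"release-env={version}; Path=/; Secure; SameSite=Lax",
        })

    return response
```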
And this is also why we perform a cache invalidation whenever the active
release changes: the next request without a release cookie, or the next
request that reloads the application, will not be a cache hit.
Instead, CloudFront will trigger the origin-request Lambda, where we decide
which version to serve, and the origin-response Lambda, where we assign the
version cookie.
And once this cookie is set, the following requests will consistently hit
the cache entry that corresponds to the assigned version.
So the invalidation ensures that version assignment always happens
correctly at the moment a user first loads or reloads the application, and
this combination ensures that caching works with our blue-green strategy
rather than against it.
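To make the invalidation step concrete, here is a minimal boto3 sketch; the
distribution ID and the invalidation paths are placeholders.

```python
# Minimal sketch: invalidate the cached frontend whenever the active release
# changes, so the next app load goes through the origin-request/response
# Lambdas and gets the new version cookie. Distribution ID and paths are
# hypothetical placeholders.
import time
import boto3

cloudfront = boto3.client("cloudfront")

def invalidate_frontend_cache(distribution_id: str) -> None:
    cloudfront.create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch={
            "Paths": {"Quantity": 1, "Items": ["/*"]},  # or only the HTML entry points
            "CallerReference": str(time.time()),         # must be unique per request
        },
    )
```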
So, to wrap things up, here's what we were able to achieve.
First, we now have low-latency, cookie-based routing that lets us switch
traffic between blue and green instantly and safely.
Promoting a new version or rolling back takes less than a minute, with no
propagation delay and with fully predictable behavior.
We also implemented cache-aware static frontend management, which means
that version changes are immediate and consistent.
And most importantly, we have had zero production incidents during releases
since adopting this strategy.
Okay, now there are of course a few additional considerations that we
didn't cover today but are still very relevant, especially when scaling
this approach.
For example, we didn't talk about coordinating blue-green state across
multiple components: multiple backend services, databases, and
microservices.
This can become very complex, especially when shared state or schema
changes are involved.
There's also the idea of progressive validation, where only a subset of
users are assigned to the new version before switching everyone over; this
gives you an additional layer of safety for large or risky deployments.
And finally, integration with CI/CD pipelines is how we would fully
automate the promotion, rollback, and validation steps, making the entire
release process more reliable and repeatable.
So, thank you for your time.
I hope this talk gave you some useful insights and ideas that you can apply
in your own context.
If you'd like to continue the conversation, collaborate with us, or join
the team, please feel free to reach out using the links here, or you can
also contact me directly by email or LinkedIn.
Thanks again and have a great day.