Transcript
This transcript was autogenerated. To make changes, submit a PR.
Good morning.
My name is, I work at Salesforce as a lead engineer.
Today I want to talk about sharding across multiple zones
to achieve high availability.
I have been working in the software industry for the past 13 years, mostly
as a backend engineer.
Recently I have been working on infrastructure services, and
that's where I got the opportunity to work on a problem where we
wanted to make our infrastructure service highly available,
even if there is a complete outage of one of the availability zones.
So with that context, I will start with the slides.
The idea here is that in modern applications we have to have very high
availability of the services, right?
And then, how can we achieve it?
For the agenda: first I will talk about the importance of highly
available, fault-tolerant systems, then the foundational principles,
the different components in the architecture, and how we can use
Kubernetes to implement them.
Then I will also touch upon how we can have consistent
routing, and at the end the operational side.
Before we actually talk about highly available systems, the first question
is: why do we need them, right?
Nowadays there are services which, if they are not up,
lose money.
And if a service is not up, you are also losing customer trust.
So, in another way, you are again losing money, right?
So it is very important to make sure that your service is highly
available, and the question is how we can achieve that with microservices.
The problem with a monolithic service is that it's
a single point of failure, right?
So that's where the microservices architecture comes into the picture.
With a microservice architecture,
we can shard our services across multiple zones.
Let me explain this.
When I say multiple zones: most of the cloud providers
have multiple availability zones.
You can think of an availability zone as a self-contained set of infrastructure.
And when I say self-contained, it means that if something
goes wrong in one, it will not affect the other availability zones, right?
So with that logic, if we have a service deployed in multiple
availability zones, what happens if one availability zone is down?
Your service is still functional; your service is still available, right?
So, the foundational principles. As I explained, there are
multiple availability zones.
We can treat each one separately.
One going down should not affect the others, right?
With stateful services, we have to be really
careful when we shard across multiple availability zones,
because a stateful service generally means
that there is state, some data, kept with the pod, right?
So we have to think about how we maintain the state of the data
when the service scales up or down, right?
Then, data locality.
We should try to keep data from flowing across multiple
zones, right, to minimize cross-zone traffic.
Partial failure design: when there is an issue with one zone,
the service should still be functional.
And we need operational transparency.
We need enough monitoring and alerting to track the
status of the different services in the different availability zones.
So, the architectural components, right?
Now, since our service is deployed in
multiple zones, we need to know
which customer request will be served from which zone, right?
That means we need some kind of metadata service
to provide that information.
Think of this metadata service as the thing that keeps track of
which zone serves which customer, right?
It's mainly responsible for shard assignment.
So whenever a new customer comes, it will be able to assign that
customer to one of the existing shards.
It will also be helpful for monitoring the health of the services, and for
routing policies: whenever a request comes in for an existing customer,
it knows which shard should serve the request.
Sharding across multiple zones also helps reduce the blast radius.
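The metadata service described above can be sketched as a small in-memory class. This is a minimal illustration, not the speaker's implementation: the class name `MetadataService`, the least-loaded placement rule, and the fallback to another healthy zone (which assumes the data is replicated) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataService:
    """Toy metadata service: tracks which zone (shard) serves each customer."""

    zones: list[str]
    healthy: dict[str, bool] = field(default_factory=dict)
    assignments: dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Assumption: every zone starts out healthy.
        for zone in self.zones:
            self.healthy.setdefault(zone, True)

    def _load(self, zone: str) -> int:
        # Number of customers currently assigned to this zone.
        return sum(1 for z in self.assignments.values() if z == zone)

    def mark_health(self, zone: str, is_healthy: bool) -> None:
        self.healthy[zone] = is_healthy

    def assign(self, customer_id: str) -> str:
        """Shard assignment: new customers go to the least-loaded healthy zone."""
        if customer_id in self.assignments:
            return self.assignments[customer_id]
        candidates = [z for z in self.zones if self.healthy[z]]
        if not candidates:
            raise RuntimeError("no healthy zone available")
        zone = min(candidates, key=self._load)
        self.assignments[customer_id] = zone
        return zone

    def route(self, customer_id: str) -> str:
        """Routing policy: use the home zone, or a healthy fallback if it is down."""
        home = self.assign(customer_id)
        if self.healthy[home]:
            return home
        # Assumption: data is replicated, so another healthy zone can serve.
        fallback = [z for z in self.zones if self.healthy[z]]
        if not fallback:
            raise RuntimeError("no healthy zone available")
        return min(fallback, key=self._load)
```

A real service would persist the assignments and feed `mark_health` from health checks; the sketch only shows the three responsibilities named in the talk: shard assignment, health tracking, and routing.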
Okay, so service design for multi-zone resilience.
Service registration that includes zone information, capacity
indicators, and dependency relationships enables intelligent client-side
load balancing and availability.
Stateful component isolation: a clear separation of stateful components
from stateless processing logic.
And the configuration setup helps us maintain a global
configuration, keeping zone-specific configuration as an option to override.
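The global-plus-override configuration idea can be sketched as a simple dictionary merge. The setting names, values, and zone names below are hypothetical and only illustrate the layering.

```python
# Hypothetical global defaults shared by every zone.
GLOBAL_CONFIG = {
    "request_timeout_s": 5,
    "max_connections": 100,
    "log_level": "INFO",
}

# Hypothetical per-zone overrides; a zone lists only what differs.
ZONE_OVERRIDES = {
    "zone-b": {"max_connections": 50},
}


def effective_config(zone: str) -> dict:
    """Return the global configuration with the zone's overrides applied."""
    config = dict(GLOBAL_CONFIG)  # copy so the global defaults stay untouched
    config.update(ZONE_OVERRIDES.get(zone, {}))
    return config
```

With this layout, a zone with no overrides gets exactly the global configuration, and a change to a global default reaches every zone that has not overridden it.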
If you have microservices and you are implementing them
using Kubernetes,
these are the best practices to follow.
You need to have a pod distribution strategy.
You also need persistent volume management.
Whenever we create a stateful pod, we can
set pod affinity rules on the topology
to tell the different pods to come up in
different topology domains, right?
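One Kubernetes mechanism for spreading pods over zones is a topology spread constraint on the well-known node label `topology.kubernetes.io/zone`; pod anti-affinity is another. The talk does not say which one was used, so the sketch below is only an illustration. It builds the pod spec fragment as a Python dictionary and prints it as JSON; the helper name and the `app` label value are hypothetical.

```python
import json


def spread_across_zones(app_label: str, max_skew: int = 1) -> dict:
    """Build a pod spec fragment that spreads matching pods evenly over zones."""
    return {
        "topologySpreadConstraints": [
            {
                # Largest allowed difference in matching pod count between zones.
                "maxSkew": max_skew,
                # Well-known node label that carries the availability zone.
                "topologyKey": "topology.kubernetes.io/zone",
                # Refuse to schedule rather than violate the spread.
                "whenUnsatisfiable": "DoNotSchedule",
                "labelSelector": {"matchLabels": {"app": app_label}},
            }
        ]
    }


# Hypothetical application label, for illustration only.
print(json.dumps(spread_across_zones("metadata-service"), indent=2))
```

The fragment would be merged into the pod template of a Deployment or StatefulSet. Using `ScheduleAnyway` instead of `DoNotSchedule` makes the spread a preference rather than a hard rule.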
Then, the service considerations.
We can also have zone-aware scaling of deployments, right?
This will be helpful when you have one zone down and you want to
redirect the traffic to the other zones.
If you have zone-aware scaling, you don't have to worry about whether the other
zones are able to handle all the traffic.
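As a rough sizing sketch for the failover case above: if traffic is spread evenly and the surviving zones must absorb everything, each zone has to be able to scale to the peak load divided by the number of surviving zones. The function name and the numbers are illustrative assumptions, not figures from the talk.

```python
import math


def capacity_per_surviving_zone(peak_rps: float, zones: int, zones_lost: int = 1) -> int:
    """Requests per second each remaining zone must handle after an outage."""
    surviving = zones - zones_lost
    if surviving < 1:
        raise ValueError("at least one zone must survive")
    return math.ceil(peak_rps / surviving)


# Three zones sharing 9000 requests/s normally carry 3000 each.
normal = capacity_per_surviving_zone(9000, zones=3, zones_lost=0)
# With one zone down, the two remaining zones must carry 4500 each.
degraded = capacity_per_surviving_zone(9000, zones=3, zones_lost=1)
print(normal, degraded)
```

The gap between the two numbers is what zone-aware scaling has to cover, either by scaling out on failure or by keeping that headroom provisioned.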
Okay.
I do want to talk a little bit more about request routing
and load balancing. The metadata service
knows the state of all the zones.
It also knows which customer is being served from which zone, so
it will be able to act as an intelligent router and also as a load balancer.
We also need to make sure that we are using consistent hashing, because
the number of zones is not fixed.
We might have
a requirement to add more zones or to remove a zone.
In that case we need to use consistent hashing, and we
need to make sure that the data is moved to the correct zone so
that the customer can be served correctly.
Okay, so data consistency and state management.
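The consistent hashing idea from the routing discussion can be sketched as a hash ring with virtual nodes. This is a minimal illustration under assumptions (the class name, SHA-256 as the hash, 100 virtual nodes per zone), not the speaker's implementation. The property it shows is that removing a zone only remaps the customers that lived on that zone.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Hash ring with virtual nodes that maps customer IDs to zones."""

    def __init__(self, zones, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, zone) pairs
        for zone in zones:
            self.add_zone(zone)

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add_zone(self, zone):
        # Each zone owns several points on the ring to even out the load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{zone}#{i}"), zone))

    def remove_zone(self, zone):
        self._ring = [(h, z) for h, z in self._ring if z != zone]

    def zone_for(self, customer_id):
        if not self._ring:
            raise RuntimeError("no zones on the ring")
        # First virtual node clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_left(self._ring, (self._hash(customer_id), ""))
        return self._ring[idx % len(self._ring)][1]


ring = ConsistentHashRing(["zone-a", "zone-b", "zone-c"])
customers = [f"customer-{i}" for i in range(1000)]
before = {c: ring.zone_for(c) for c in customers}
ring.remove_zone("zone-b")
moved = [c for c in customers if ring.zone_for(c) != before[c]]
# Only customers that were on the removed zone had to move.
assert all(before[c] == "zone-b" for c in moved)
```

With a plain `hash(customer) % number_of_zones` scheme, removing a zone would remap most customers, which is why the ring is preferred when the number of zones can change.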
If you have high consistency requirements, then what you
do is replicate the data across the zones before you confirm:
the request will only be successful if the data is replicated across all the
zones, which is highly consistent.
From the CAP theorem, you know that consistency and
availability trade off against each other.
So if you try to make the system highly consistent, your
availability will go for a toss.
And the other way around: if you want to keep the system highly available,
the consistency will go for a toss, right?
So if you go for high consistency, you are saying that you want to write to all the zones
before you confirm the completion of the request, which means that if any
zone is not available, you will not be able to write.
So you're saying, I'm going for high consistency, but then
you are reducing the availability.
My suggestion is to go with eventual consistency, because our
focus is high availability.
So it should be okay if the other zones eventually get the data.
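The trade-off above can be illustrated with a toy model of two write paths: one that confirms only when every zone has the data, and one that confirms after the local write and replicates later. All names here (`Zone`, `write_strong`, `write_eventual`, `drain`) are hypothetical, and real systems would use a replication log or queue rather than an in-memory list.

```python
class Zone:
    """Toy zone with an availability flag and a key-value store."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


def write_strong(zones, key, value):
    """Confirm only if every zone is up and receives the write."""
    if not all(z.up for z in zones):
        return False  # one unavailable zone blocks the whole write
    for z in zones:
        z.data[key] = value
    return True


def write_eventual(local, others, key, value, backlog):
    """Confirm after the local write; queue replication to the other zones."""
    if not local.up:
        return False
    local.data[key] = value
    for z in others:
        backlog.append((z, key, value))
    return True


def drain(backlog):
    """Apply queued writes to zones that are up; return what is still pending."""
    pending = []
    for zone, key, value in backlog:
        if zone.up:
            zone.data[key] = value
        else:
            pending.append((zone, key, value))
    return pending
```

With one zone down, `write_strong` rejects the request while `write_eventual` accepts it, and the down zone catches up once it recovers and `drain` runs again. The price is that, until then, reads from that zone can return stale data.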
And then zone-aware database sharding.
We can place complete data partitions within a zone while
maintaining global consistency.
For conflict resolution,
we can have a simple last-write-wins, or maybe a more
complicated algorithm. That depends on the requirement.
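A simple last-write-wins rule can be sketched as follows. The record type and the tie-break on zone name are assumptions made for the example; the rule also assumes that timestamps from different zones are comparable, so clock skew between zones can cause a newer write to lose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    """A value tagged with the time and zone of the write (hypothetical type)."""

    value: str
    timestamp: float
    zone: str


def last_write_wins(a: Versioned, b: Versioned) -> Versioned:
    """Keep the write with the newer timestamp; break ties by zone name."""
    return max(a, b, key=lambda v: (v.timestamp, v.zone))


older = Versioned("plan=basic", timestamp=100.0, zone="zone-a")
newer = Versioned("plan=pro", timestamp=105.0, zone="zone-b")
print(last_write_wins(older, newer).value)
```

The deterministic tie-break matters: every zone must pick the same winner for the same pair of writes, otherwise the replicas never converge.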
It's also very important to have monitoring and observability.
We need to know the capacity of the different zones,
and based on that, we have to pre-provision another zone or
increase the capacity of a zone.
So it's very important to monitor the health of the services in each
zone, since there are now multiple zones.
Debugging any issue is harder.
That's why distributed tracing will help you trace the issue.
And also, having the logs will help you find the issue.
Cost optimization is also very important.
When we go for high availability, what we are actually doing is
duplicating everything, right?
We are duplicating infrastructure, which means more cost. There
are a few things we can do to make sure that we keep the cost in check.
First, based on the performance tests, we can
right-size the instances, right?
Across the different zones, there is also the concept of reserved
instances, which means that we book the instances in advance, which
comes at a very discounted rate.
So that can save us a lot as well.
We have to be careful about how we communicate between the zones.
There are different options based on the cost,
and we can choose one. Then, zone-aware autoscaling policy:
this will help us
not over-provision a zone and only autoscale when it is needed.
Yeah, so operational best practices.
We can have a CI/CD pipeline that also has a canary stage.
So before we deploy in a zone, we can test our solution
to reduce the blast radius.
We also need to make sure that the changes we are
making are not zone dependent,
such that your change works on one zone and not on another.
Your service needs to be agnostic of the zone, and it should be
able to work in any zone.
Okay, so in conclusion.
Sharding your service across multiple zones will help you achieve higher availability.
But then there is additional complexity to maintain the metadata service,
which is required for the routing.
There is additional cost, and there is additional complexity when
you want to debug an issue.
So depending on the application's use case, we need to see if sharding
across multiple zones is required.
One use case I can think of is a database,
a database service where you actually store the data.
If you shard your service across multiple zones and then
you replicate, right,
what happens is that even if one of the zones is down, you
will still be able to serve the data from the other zones. That matters if it's serving
mission-critical services, right?
Or the tier-zero services.
So for those use cases, it makes sense.
Yeah, so I think I will stop here.
Thank you very much.