Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello all.
Good morning, good afternoon, good evening.
I'm Shar and first of all, thank you for joining this session.
Today I will talk about how we can achieve reliability excellence
with SLO management products.
Let me introduce myself.
I have 12-plus years of IT experience.
Currently I'm working with Naka software as a DevOps and platform lead, and
I love to talk about DevSecOps, cloud engineering, SRE, and platform engineering.
I'm happy to connect with you via LinkedIn and the email address
that I have put here.
So let's start the session.
The agenda of today's session starts with a quick understanding of
SRE principles and practices.
Then we will see what challenges we face with SLOs in
complex environments and what our approach should be to solve them.
Then we look at the solution by understanding SLO management.
Then we have one use case.
There we see how we can define the SLOs, and we also
check how we can measure the reliability of the system.
Then we'll look at the SLO management products
available in the market.
And then we see some comparisons between those products.
So let's start with the session.
In the SDLC, we know that developers want to go fast.
They want to deliver features as fast as possible,
whereas, in parallel, the operations teams
want to slow things down.
They want to make the system reliable.
That's why this kind of misalignment happens between these two teams,
and it often creates tension.
That's where SRE comes into the picture.
SRE was introduced by Google in the early 2000s, and it is used
to measure and solve operational challenges and issues.
SRE is a job role.
It's a mindset, and it's a set of metrics that we use to measure and manage
the overall system reliability.
So there are some key principles of SRE.
First is the service level indicator.
Service level indicators are the metrics used to measure
the quality of the service.
There are a couple of different metrics.
The ones we mostly use are availability, throughput, and latency.
Then we have the service level objective.
Service level objectives are the target levels for those particular metrics,
for example, that our system should be available 99.99% of the time.
This is a target level that we set.
The SLO is for internal benchmarking,
so an SLO is a goal rather than a promise.
But it also helps the customers.
We define the specific metrics, and we have some kind of assurance
that our system will be available up to this target, and that is
what we can then promise to the customers.
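To make the SLI and SLO idea concrete, here is a minimal sketch in Python. The names `Slo`, `availability_sli`, and `meets_slo`, and the request counts, are assumptions made up for illustration; they do not come from any specific tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A target level for one SLI (hypothetical structure for illustration)."""
    name: str
    target: float  # 0.9999 means 99.99%


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI: the fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def meets_slo(sli: float, slo: Slo) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo.target


# Made-up sample numbers: 50 failed requests out of one million.
slo = Slo(name="availability", target=0.9999)
sli = availability_sli(good_requests=999_950, total_requests=1_000_000)
print(f"SLI={sli:.4%}, meets SLO: {meets_slo(sli, slo)}")
```

The SLI is what you measure, and the SLO is the target you compare it against.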
Then we have the service level agreement.
It's a contract between the end users and the service provider.
If the promises are broken, the service provider needs to pay, or
there are other consequences, such as payments or
extensions of subscriptions. Then we have the error budget.
But before moving to the error budget, we need to understand that there is a
misconception about SRE overall.
The misconception is that we should always aim for a hundred
percent system reliability.
But this is not true.
If you aim for a hundred percent system reliability, it means that
your system is available a hundred percent of the time and should always be up.
But that can't be the case. In practice, of course, your system
will sometimes be down, and some kind of disruption can happen.
It also affects the overall product lifecycle, because if we
go for a hundred percent availability,
then of course developers need to focus on the reliability of the systems.
They can't focus on feature development.
They can't focus on the innovation side.
That's why the development teams, operations teams, and product teams
look to have a common
target level that is set for the overall system.
That's where the error budget comes into the picture.
The error budget is the amount of unavailability of a particular
system that the teams can tolerate.
Basically, it is how much time our system can be unavailable
while that is still fine for the overall lifecycle.
When you have an adequate error budget, it means you can
focus on the product development part, but when your error budget is threatened,
it means that you need to focus
on the SRE part, the reliability of that system.
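The error budget arithmetic can be sketched in a few lines of Python. The 30-day window, the function names, and the 25% threshold in `team_focus` are all assumptions chosen for illustration, not a standard policy.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO target."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining_fraction(slo_target: float, downtime_minutes: float,
                              window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    if budget <= 0:
        # A 100% target leaves no error budget at all.
        raise ValueError("SLO target must be below 100% to have an error budget")
    return (budget - downtime_minutes) / budget


def team_focus(remaining_fraction: float, threshold: float = 0.25) -> str:
    """Hypothetical policy: below the threshold, reliability work comes first."""
    return "feature development" if remaining_fraction >= threshold else "reliability"


# A 99.99% target over 30 days allows roughly 4.32 minutes of downtime.
budget = error_budget_minutes(0.9999)
remaining = budget_remaining_fraction(0.9999, downtime_minutes=4.0)
print(f"budget={budget:.2f} min, remaining={remaining:.0%}, focus={team_focus(remaining)}")
```

With 4 of the roughly 4.32 allowed minutes already spent, the budget is threatened, so this sketch points the team at reliability work.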
Then the question is how we can practice these principles
and implement them.
We have different kinds of practices.
First, metrics, monitoring, and alerts. Of course, if we care about system
reliability, we need properly defined monitoring systems, and we need to
have the correct metrics for the services.
We need to set alerts on the error budget, so that if the error
budget is threatened, we get those alerts and can focus
on the system reliability part.
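One common way to alert on a threatened error budget is a burn rate check: how fast the budget is being spent compared with the rate that would use it up exactly at the end of the window. The sketch below is an assumption about how such an alert could look; the function names and the threshold of 10 are illustrative only.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A burn rate of 1.0 spends the budget exactly over the SLO window;
    higher values spend it faster.
    """
    allowed_error_rate = 1.0 - slo_target
    if allowed_error_rate <= 0:
        raise ValueError("SLO target must be below 100%")
    return error_rate / allowed_error_rate


def should_alert(error_rate: float, slo_target: float,
                 threshold: float = 10.0) -> bool:
    """Fire an alert when the budget burns faster than the chosen threshold."""
    return burn_rate(error_rate, slo_target) >= threshold


# Made-up sample: 2% of requests failing against a 99.9% availability SLO.
print(should_alert(error_rate=0.02, slo_target=0.999))
```

Alerting on burn rate rather than on every single error keeps the alert tied to the error budget itself.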
Then there is demand forecasting and capacity planning.
This is one
key practice, where we make sure enough resources are available for the system
reliability part if something happens, so that the resources can
focus on improvements and can also focus on reliability.
This is one of the key practices.
Then we have change management and emergency response, where each team,
like the development teams and operations teams, should know the overall
process defined for change management.
Even if a change is small, they
should focus on how to
carry out the change so that it improves the overall
reliability of the applications.
Then there is emergency response. Of course, in practice,
your system will sometimes be down.
So you need to be ready for whatever kind of emergency comes.
You need to be focused and work on
how quickly you can get your system back up and running.
And then, of course, culture and toil management is
also one of the key practices. It is about how you manage operational workloads.
In SRE, of course, you can't
automate each and everything.
There are manual practices that you need to
follow, and accordingly you need to manage
your overall operational workload.
This is a basic understanding of SRE.
Now, let's move to the challenges that you face
with SLOs for complex systems.
Here, a complex system could be microservices-based,
or it could be a distributed system.
So let's understand what the problem is.
Sometimes, when we define SLOs
for a complex environment,
the SLOs are not aligned with the business goals.
It means that the priorities are misaligned.
We focus the SLOs on one set of metrics, but our business
goals need a different kind of metrics.
That is one thing, and the target levels also change.
This is one challenge we face with the system. We also often
don't take a data-driven approach. When we define the
SLOs, we also need to look at the historical
data, so that accordingly we can measure and define the SLOs for the systems.
And sometimes we just set unrealistic SLOs,
and the other KPIs are completely different.
So we need to understand the system first, and accordingly we can set
realistic targets for that system.
Then, of course, there are distributed systems and interdependencies.
If we have a distributed system with multiple microservices and APIs,
where one microservice depends on another service, and they
have other interdependencies,
it is really difficult to pinpoint
where the SLO is violated.
This is also one of the challenges we face in this kind of environment.
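A small Python sketch shows why dependencies make this hard. It assumes the services sit in a simple serial chain and fail independently, which real systems often do not; the service names and numbers are hypothetical.

```python
import math


def serial_availability(availabilities: list[float]) -> float:
    """Availability of a request path that needs every service in the chain.

    Assumes independent failures, which is a simplification.
    """
    return math.prod(availabilities)


def violated_services(measured: dict[str, float],
                      targets: dict[str, float]) -> list[str]:
    """Names of the services whose measured SLI is below their own target."""
    return sorted(name for name, sli in measured.items() if sli < targets[name])


# Three services at 99.9% each give an end-to-end path of about 99.7%.
print(f"{serial_availability([0.999, 0.999, 0.999]):.4%}")

# Hypothetical per-service measurements against per-service targets.
measured = {"api-gateway": 0.9995, "orders": 0.9980, "inventory": 0.9992}
targets = {"api-gateway": 0.999, "orders": 0.999, "inventory": 0.999}
print(violated_services(measured, targets))
```

Each service can look healthy on its own while the end-to-end path misses its target, which is why per-service SLOs are needed to find where the violation comes from.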
And we don't look at the priority components.
It means that, according to the system, we need to
define what exactly the user is doing.
We need to look at the user journey,
and accordingly we need to set the priorities of the system.
Then we can define the SLOs as well.
And of course there are tool limitations. Whatever tools we have,
we have limited capabilities with respect to those tools.
That is also one challenge.
And at the end, of course, there are dynamic and evolving systems.
We work in an agile environment, and our system also scales up.
We also upgrade, and we work with continuous deployment.
So we always have new features coming in.
Accordingly, we need to set the SLOs again as per the changes.
So these are some of the challenges.
So what should be our approach to solving this kind of issue?
First, we need to understand the system correctly:
what the particular system is, what the different services in the system are,
and what the
user is doing in their day-to-day activity.
So the critical user journeys also need to be understood.
Then we can go to the design part: how we can define and design SLOs
for the critical services.
Then we can set metrics
for them.
And then, at the end, of course, how we can use them effectively.
That is the approach we can follow to try
to solve these complex issues.
But
the question is what our actual solution to this issue would be.
The solution is service level objective management.
SLO management is a lifecycle that we need to
define within our organization.
It involves defining the SLOs by aligning the
user journey with the business objectives, whatever business goals we have.
So accordingly, we need to define the user journey first.
Then we set realistic targets for KPIs like uptime, latency, and error rate.
That is one of the key practices that we need to follow.
Then we have different kinds of SLOs, like composite SLOs.
A composite SLO is basically a combination of the individual
SLOs that we define.
If we want to measure the reliability of a complete system, we need a
holistic view of that system.
The composite SLO comes into the picture in that case, where
we have individual SLOs for the different services,
and the aggregation of those services' SLOs
creates a composite SLO.
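One possible way to aggregate individual SLOs into a composite is a weighted average, sketched below in Python. The weighting scheme, the function name `composite_sli`, and the sample values are assumptions for illustration; SLO management products may aggregate differently.

```python
def composite_sli(slis: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-service SLIs.

    A higher weight means the service matters more to the composite.
    """
    total_weight = sum(weights[name] for name in slis)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(slis[name] * weights[name] for name in slis) / total_weight


# Hypothetical services: "checkout" is weighted three times as much as "search".
slis = {"checkout": 0.9990, "search": 0.9950}
weights = {"checkout": 3.0, "search": 1.0}
composite = composite_sli(slis, weights)
print(f"composite SLI = {composite:.4%}")
```

The weights are where the business priorities enter: the services that matter most to the user journey pull the composite the hardest.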
And at the end, correct measuring and monitoring of the system is
needed, where we define the correct alerts for the error budgets
and the monitoring of the complete system.
So what benefits can we get? Of course, the overall system reliability
and performance would improve.
We have less downtime.
We have fewer incidents.
We have fewer high-priority issues.
Then there is the data-driven approach
that we already discussed.
Data-driven decision making is one of the key factors for prioritizing
improvements, because you look at the
historical data to understand the
system better. And then,
of course, there is proactive monitoring of the systems,
which also reduces the mean time to recovery of the system.
It helps communication and collaboration within a team as well,
and it helps build a shared understanding of the overall reliability process.
Then, of course, we have customer satisfaction
overall, by defining these practices and processes, and we can focus
on the innovation part.
Now, let's understand a use case.
This is a simple product lifecycle of a user journey.
The user goes into the shopping app,
puts some products into the cart,
and makes the payment.
After that, the backend process starts: the warehouse
receives and dispatches the order, then invoicing happens, and the user
gets the confirmation by email.
Let's break this overall journey into two parts.
The first is pre-purchase, and the second is post-purchase.
In pre-purchase,
the user
puts products in the cart, so the overall pre-purchase system should be available
while the user purchases the product.
So availability of the overall pre-purchase system is needed, and
of course of the payment system.
Then the second part is post-purchase.
Availability is more important than latency.
That is one thing.
And the services in the pre-purchase part are more
important than those in the post-purchase part.
In pre-purchase, website availability matters slightly more than
payment availability, of course,
because if the website is not available, the user can't go and
purchase the product, and only then is payment availability needed.
So from this user journey, we need to define
which are the critical components and which are the critical services.
And accordingly, we need to define the metrics for those services.
And then, at the end, in post-purchase, of course,
the warehouse and invoicing are
slightly more important than emailing.
If you get the email after some time, that would be fine.
But your order should be dispatched and your invoice
should also be created. If you get the email some time after that,
that is fine.
So the question is how we can measure the reliability
of this complete user journey.
First, identify the services
and their dependent components.
According to this user journey,
we have a store website, payment, invoicing,
emailing, and warehousing.
These are the services that we have identified.
Then we set up individual SLOs for each service
with respect to different metrics like availability and latency.
Based on that understanding, we define the composite SLOs,
like the pre-purchase user experience and the post-purchase user experience.
And at the end, we look at the composite SLO.
As I said, composite SLOs are aggregations.
We have the pre-purchase and post-purchase user journeys, and
aggregating them helps us understand the overall system reliability.
This is a simple use case of the overall reliability process.
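The shopping use case can be written down as data and evaluated per journey. This is only a sketch: every weight, target, and measured value below is a made-up number chosen to reflect the priorities described in the talk (website slightly above payment, warehouse and invoicing above emailing), and the names `ServiceSlo` and `journey_composite` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSlo:
    service: str
    journey: str     # "pre-purchase" or "post-purchase"
    weight: float    # relative importance inside the journey (assumed)
    target: float    # availability target (assumed)
    measured: float  # measured availability (made-up sample)


SERVICES = [
    ServiceSlo("store-website", "pre-purchase", 0.6, 0.999, 0.9995),
    ServiceSlo("payment", "pre-purchase", 0.4, 0.999, 0.9980),
    ServiceSlo("warehouse", "post-purchase", 0.4, 0.995, 0.9970),
    ServiceSlo("invoicing", "post-purchase", 0.4, 0.995, 0.9960),
    ServiceSlo("emailing", "post-purchase", 0.2, 0.990, 0.9900),
]


def journey_composite(services: list[ServiceSlo], journey: str) -> tuple[float, float]:
    """Weighted (measured, target) availability for one user journey."""
    members = [s for s in services if s.journey == journey]
    total_weight = sum(s.weight for s in members)
    if total_weight <= 0:
        raise ValueError(f"no weighted services found for journey {journey!r}")
    measured = sum(s.weight * s.measured for s in members) / total_weight
    target = sum(s.weight * s.target for s in members) / total_weight
    return measured, target


for journey in ("pre-purchase", "post-purchase"):
    measured, target = journey_composite(SERVICES, journey)
    status = "OK" if measured >= target else "VIOLATED"
    print(f"{journey}: measured={measured:.4%} target={target:.4%} {status}")
```

With these sample numbers the pre-purchase composite misses its target because of payment, while the post-purchase composite is met.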
Then we have the SLO management products, which can help you define
the SLO management lifecycle.
They also give you good features.
They give you features like composite SLOs, and
they also give you user journeys.
Among these products,
Harness and Nobl9 are, I will say, the leaders, because they have
different features like composite SLOs and
user journeys. Harness
is a complete 360-degree platform.
It also gives you a specific feature for service reliability
management, in which we can analyze
and easily get to know why our system is failing, and it also
helps you set up governance with Open Policy Agent. Nobl9
also lets you keep three years of old data, so you can use
historical data to understand your system and accordingly make decisions.
It has a good composite SLO feature, and it also helps you define
the critical user journey.
Then we have some other tools like FireHydrant, Squadcast, ServiceNow, and
New Relic, which fit under the monitoring and incident
management categories, but they are not specific SLO management products.
These are some of the products in the market.
I have also done a comparison on the basis of
different capabilities like real-time error budget tracking,
mapping SLOs to business capabilities,
and RCA.
And of course customized dashboards and reports,
the automation part, the auditing part, and customized
views with respect to the business.
You can see that the tools each have some capabilities.
Some are good at specific
features, and some fit the complete category.
Harness and Nobl9 mostly cover it, so accordingly you can
choose whichever fits your requirements
and use the overall features of those products.
That's it for this session.
So thank you very much for joining this session.
Thank you everyone.