Transcript
This transcript was autogenerated. To make changes, submit a PR.
Building large-scale, fault-tolerant, and reliable systems is a long
process and a hectic process.
It goes through multiple iterations, but everything that's large and
long starts with some basic blocks.
And one of the basic blocks here is having the right mindset.
This talk will explore framing your thoughts, having the right mindset, and
cultivating it over a period of time so that you get your infrastructure right.
Hello, everyone.
I work for Google on Google Search infrastructure, and in this talk
I will be walking you through the lessons that I have learned over the
course of these years, basically building highly reliable,
fault-tolerant, and scalable
web-scale applications.
My experience ranges from small-scale startups to this large behemoth,
Google, and I'll be laying down some basic principles that are really required
for you to define and decide what your thought process should be while going
about building any large-scale system.
Before we actually start digging deeper into how we cultivate the mindset of
being an SRE, let's try to understand what the definition of an SRE is and how
that has really evolved over the course of time.
Over the course of the last 30, 35 years, I feel there has been a lot
going on in the computing world.
We started somewhere around mainframe computers, where people used to set up
switches, all by hand.
It evolved to a state where, in 2000, there was the dot-com bubble, a
dot-com boom, where thousands of dot-com sites came online.
People started thinking of the internet as a new way of
serving services and giving out information, and tons of companies
came online during the dot-com boom.
The internet was an era.
And in the last 10 years, what I see is that something like a cloud burst
has really happened, where multi-tenant architectures and SaaS applications
have really come a long way.
You see tons of SaaS applications that are now thriving
in the industry, and they all really started around two.
Pick any of the big cloud providers; they are thriving as of today, and that
has really made a drastic shift from what managing a
mainframe computer was, or what managing a physical colo was, to what
you really need to do when you're dealing with virtualized environments.
You don't have to deal with all the physical environments, and to
some degree the principles have still stayed the same, but there
has been a big shift that has happened.
If you move ahead to the present era,
the present era is kind of an AI era, where you see a big boost in agentic applications.
And what I see from here on is that in the next 10 years, these agentic applications
will be the next big thing.
And probably you will have something else come up after the next 10 years.
But if you look at the relationship between a server and a client, it has evolved,
and so have the technologies that are basically pairing
these two things together.
I mean, look at the very basic technologies, like the evolution from 2G
internet to 3G, 4G, 5G.
You are essentially talking about the number of bits that are sent over the
network, and those are going up exponentially. As the number of bits
sent over the network has evolved,
so have the threats, and so have the technologies powering them.
So have the customer expectations of the experiences that you should be delivering
to your customers, and so have the operations that are
necessary to support this kind of heavy internet traffic.
So that's how the evolution of computing has basically
changed the role of the engineer.
I should certainly mention that security and compliance is also a
very important thing that has evolved with this ever-evolving landscape.
And as site reliability engineers, I think our perspective on
security and compliance has changed a lot with things
like GDPR, PCI, HIPAA, and FedRAMP.
Those kinds of compliance requirements are making themselves felt. Site
reliability engineers nowadays have started thinking about how
to make a system more reliable
while, at the same time, not compromising on the security
and compliance aspects of it.
So I think as the reliance on the internet has increased, so have the
feature offerings and the applications that are served over the internet.
We started with streaming very basic informational content.
Now we are in an age where you are delivering very critical,
very sensitive information over the internet, and a lot of lives and a lot of
people are dependent on the internet.
There is a big spectrum, where you can deliver everything from
gamified, TikTok kind of content to critical human care.
So I think it becomes more and more necessary to ensure that your
systems are reliable, because real people are depending on
the system for their real needs.
I mean, looking at this, I still feel that what has really
made SREs think now is that as systems evolve, as these new things
are built, we really need to evolve too.
For the SRE mindset, the first and foremost question that I would really
ask myself is: what is the size of the problem I am really trying to solve here?
And when I say the size of the problem, it essentially boils down
to what the size of the company is, what number of users we are talking
about, and what the company's priorities are,
and then aligning the goals of your role with
the priorities of the company.
Take companies of different sizes. Let's say it's a small-scale,
seed-stage startup where the resources are pretty constrained.
You really have a limited number of users.
The priority of the company is basically delivering X number of features in
Y number of days, and X and Y are unimaginable when you
come from a large organization to a small organization, where the
delivery speed is very different.
For me, I think the most important thing for a small-scale organization
would be optimizing for delivery speed.
And sometimes it's fine to compromise on some of the reliability
aspects there, and instead invest so that you can deliver those features
pretty quickly to the end users: investing in automation for delivery and
in the build process so that those features ship quickly and reliably.
That would probably be the most important thing for that smaller-scale organization.
But as the company scales, as you go from a very early stage to something like
a growth-stage company, Series B, C, D, where you have a larger number of users, I think
reliability aspects start taking more precedence. You really need to be aware that
people are depending on the user experience along with the features.
So you need to basically wear different hats to ensure that the reliability aspect
of your product is reasonably well covered.
And at the same time, you are also investing in the right set of tools to serve
that set of users, and you are investing in the right SLOs:
defining the right SLOs, finding the right reliability metrics, monitoring
them over a period of time, and ensuring that you have the right set of technologies
to support those metrics.
You are basically
displaying those metrics on your site so that people are aware that
they can trust your application, they can trust your site. This is
pretty much available on all of the important SaaS applications you see
today.
They display their reliability metrics, and that is more like a
selling point for these companies, where they demonstrate that, hey, we have been
99.99% available for the past 60 months,
or whatever, the past 60 days, and that's the reason you should be buying our product.
Most of the security companies are using this metric as a unique
selling point, and that is something that you should be considering
for that stage of the organization.
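
To make a status-page figure like "99.99% available over the past 60 days" concrete, here is a minimal Python sketch of how such a number could be computed from recorded downtime. The function names and the sample numbers are illustrative assumptions, not anything described in the talk.

```python
from datetime import timedelta


def availability_percent(window: timedelta, downtime: timedelta) -> float:
    """Percentage of the reporting window during which the service was up."""
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    if downtime < timedelta(0) or downtime > window:
        raise ValueError("downtime must be between zero and the window")
    # Dividing two timedeltas yields a plain float ratio.
    return 100.0 * (1 - downtime / window)


def meets_slo(window: timedelta, downtime: timedelta, target_percent: float) -> bool:
    """True if measured availability is at or above the SLO target."""
    return availability_percent(window, downtime) >= target_percent


# Hypothetical numbers: a 60-day window with 5 minutes of total downtime.
window = timedelta(days=60)
downtime = timedelta(minutes=5)
print(f"availability: {availability_percent(window, downtime):.4f}%")
print("meets 99.99% target:", meets_slo(window, downtime, 99.99))
```

At a 99.99% target, a 60-day window leaves room for only about 8.6 minutes of downtime, which is why this number works as a selling point.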
I think, accordingly, the security and compliance requirements also evolve
as companies mature.
Maybe at first you can get by with some basic
PCI DSS kind of compliance, but as those companies mature, maybe FedRAMP could
be the next priority, based on the set of customers you're addressing.
Different security challenges might need investment in different
tools. For delivering secure experiences, you need to invest
in that next level of tools and technologies at that stage of the organization.
And yeah, that's what essentially happens as
you grow from a very small organization to a very large organization. You need
to really align your reliability story with the size of the organization and
with the goals of the organization, so that there is some coherence between what we
are building, how it is being consumed,
and what the end goal is. This is a very core aspect of design thinking as
far as reliability is concerned, and it is important for a site reliability engineer.
You need to really evolve with the company.
As you are defining what an SRE mindset looks like and how
you basically cultivate this mindset,
some of the core principles that are clearly required for defining
your reliability story are the things that I have here.
The first thing that I would really ask myself is what the customer
experience is gonna look like.
Who is my end user?
What are they really expecting from my application, and how can I really
build my reliability story around it?
So
building an SRE mindset really starts with customer experience
first: who is your customer,
and what are they expecting from that application?
To put things in perspective, let me highlight an example.
Presently, the company I work for delivers
some N number of results, ten blue links, in X number of seconds.
For us, the most critical and important aspect is basically
delivering those N results within a certain timeframe.
We need to be as accurate as possible, and at the same time, we need
to deliver those experiences to the end user within a one-second timeframe.
And that's where most of the optimization and most of our energy goes.
Whereas at one of the previous organizations I've worked for,
latency was not much of a concern.
Back then, we really optimized for reliability.
We had a status page, which basically said that we had been up for
99.99% of the time over the last 30 days.
And that was one of our metrics: you can reliably trust us, and use
and build your application on top of us.
So I think each reliability story essentially
starts with customer experience.
And once you get a grip on the customer experience, you can invest
in the right set of tools and the right set of technologies, and basically
define your metrics of success for reliability based on what the customer
experience is gonna look like.
I always think the most important thing that you and I should do, not
just from the SRE perspective but even from a career perspective, is
defining before you actually dive.
Planning is the most important aspect of any career, any phase, and even
for site reliability engineers
I think that's the most important thing that we should be doing before
we actually take on any project:
defining what the metrics of success are,
and writing it down. Hey, these are the five things that I will be
addressing with this X, Y, Z project.
These are my end goals.
These are the metrics
and the SLO definitions. This is how I'm basically gonna
chalk out the entire landscape.
I think defining well in advance doesn't really solve the problem, but
it at least keeps you on track.
It at least keeps you from getting distracted by something that you
shouldn't really be addressing.
It helps people review your plans.
It helps people guide you.
So I think defining before you dive is one of the most important things
that an SRE should consider from the mindset-building perspective.
The third thing that I think is important is basically balancing velocity,
or innovation, with reliability.
This, to some degree, I explained with my previous slide: as the
maturity of the organization and the size of the organization grow
from stage X to Y to Z, I think your reliability story should really evolve.
But there should be a threshold where you are really balancing the velocity.
You are really balancing your innovation with reliability, because real people
are dependent on your application for real things.
I think balancing velocity with the right reliability knobs
and bolts is very crucial.
And the fourth thing that I think is crucial
as a site reliability engineer is keeping at the top of my mind that
things are always gonna fail.
You cannot build for success.
You have to design for failure.
Things are always gonna fail.
There are always gonna be edge scenarios which you never considered,
which always stay off your plate,
and which are something you should plan for in advance.
And this is where
this aspect of design thinking comes into play: you
have to really design for failure.
You are not designing for success, and that way you really bring
those critical aspects and critical thoughts to mind,
like, what are the different ways this application can
fail, and how could I avoid that?
One of the things we largely do over at Google is running a
lot of tests that are more like simulating failures in an application
before you have actually productionized it, or before you have
onboarded a new feature: running through different scenarios and
trying to simulate the different failure scenarios you can come across.
First, is it degrading the customer experience? Second,
do we have the right alerting and monitoring set up so that we get
alerted well in advance, before the customer notices it?
And third, do we have the right mitigation strategies in place, so that in case
there is a fallout, we have a way to mitigate it before customers
actually feel the heat and you have a bad reputation going out there?
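
The three checks in a failure simulation (customer impact, alerting, mitigation) can be sketched as a tiny drill in Python. This is only an illustration under assumptions: the fake backend, the in-memory cache used as the mitigation, and the list used as an alert sink are all hypothetical stand-ins, not the tooling used at Google.

```python
class BackendDown(Exception):
    """Raised by the simulated dependency when a fault is injected."""


def make_backend(fail):
    """Build a fake backend; with fail=True every call raises BackendDown."""
    def backend(query):
        if fail:
            raise BackendDown("injected failure")
        return [f"fresh result for {query}"]
    return backend


def serve(query, backend, cache, alerts):
    """Serve a query, falling back to cached results if the backend fails."""
    try:
        results = backend(query)
        cache[query] = results  # remember the last good answer
        return results
    except BackendDown as exc:
        alerts.append(f"backend failure for {query!r}: {exc}")  # alerting hook
        return cache.get(query, [])  # mitigation: serve stale results


def run_drill():
    """Warm the cache, inject a failure, and report the three checks."""
    cache, alerts = {}, []
    serve("sre", make_backend(fail=False), cache, alerts)
    degraded = serve("sre", make_backend(fail=True), cache, alerts)
    return {
        "customer_still_served": bool(degraded),
        "alert_fired": len(alerts) == 1,
        "mitigation_used": degraded == cache["sre"],
    }


print(run_drill())
```

Running a drill like this before a launch answers the three questions in order: did the user still get a response, did an alert fire, and did the fallback actually kick in.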
So I think these are different strategies:
having the right guardrails for an application
in the build phase and in the
delivery phase, and defining the
different places where things can fail. Your network can fail, your
disks can fail, your servers can fail, things can go out of order.
As for data centers, at my previous company we ran a multi-AZ and
multi-data-center kind of infrastructure,
just to ensure that if a specific data center or a
specific region is down, we are always taken care of by another region.
There are failover scenarios, there is replication, there are
backups, there is a standby backup,
there is a third backup, so that you never lose your data.
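
The failover idea, where another region takes over when one is down, can be sketched as a simple priority-ordered selection. This is a minimal sketch under assumptions: the region names and the health map are hypothetical, and a real setup would get health from probes or a load balancer rather than a hand-written dictionary.

```python
from typing import Dict, List, Optional


def pick_region(priority: List[str], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the first healthy region in priority order, or None if all are down."""
    for region in priority:
        # Regions with no health report are treated as unhealthy.
        if healthy.get(region, False):
            return region
    return None


# Hypothetical regions: the primary is down, so traffic fails over to the next one.
priority = ["region-a", "region-b", "region-c"]
health = {"region-a": False, "region-b": True, "region-c": True}
print("serving from:", pick_region(priority, health))
```

Returning None when every region is down keeps the "nothing left to fail over to" case explicit, so the caller has to decide what the last-resort behavior is.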
So all these things essentially stem
from a very core aspect: you really
need to think about failure while you
are designing the application, and designing it for failure is one of the
very crucial aspects of design thinking as far as site reliability is concerned.
Another way of thinking that has really helped me, and that I think is
very important as far as the design thinking of this reliability aspect
is concerned, is keeping in mind that reliability is a continuous process.
It's not something that you're gonna define at the start and then
achieve at a certain point. It's like an infinite curve which
never really reaches a goal, but
each progression takes you closer to your destination.
And you should think of it as a continuously evolving process where
each incremental improvement basically takes you closer and closer
to your end goal of defining and delivering this super awesome,
reliable experience for your customers.
And, I mean, where do you start?
It's very difficult at the start.
It looks like, oh, you're gonna deliver this pleasant experience
for these millions of users.
But where do you start?
I think each great automation always started with a very hacky script.
Something was done manually once, then a second time, and the third time someone
really thought of writing a bash script to get things done.
I think that's where
automation really starts, and it's okay to be hacky.
It's okay to write your bash script.
It's okay to have a very buggy script to start with, but that's the
first step you need to take before you actually start investing in way better
tools and way better processes out there.
Keep learning, keep evolving, but at the same time, you
need to start somewhere. That's one.
Second, keep your continuous involvement and continuous improvement going,
because this is a continuous cycle.
It is never gonna end.
Each incremental step is gonna take you closer to your end goal.
And you need to keep yourself engaged and involved in
improving the reliability aspect.
And for that, I think you really need to invest in the right set of metrics.
As I said in the previous slide, define the right SLO
metrics, define the right reliability metrics, maybe start with 99%
reliability for an application, and eventually challenge yourself:
oh, next year we're gonna target 99.9% reliability, then 99.99% reliability
or availability of our application.
And take it from me, going from 99 to 99.9 is a huge curve, and
99.9 to 99.99 is a second exponential investment
that you really need to make.
There are some crazy numbers out there at Google that
we consider, where each incremental nine takes tons and tons
of effort to get going.
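
The jump between nines is easier to see as a downtime budget. The sketch below, with a hypothetical function name and a 365-day year as an assumed window, converts an availability target into the downtime it permits.

```python
from datetime import timedelta


def downtime_budget(target_percent: float, window: timedelta) -> timedelta:
    """Maximum downtime allowed in the window for a given availability target."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")
    # timedelta supports multiplication by a float.
    return window * (1 - target_percent / 100)


year = timedelta(days=365)  # assumed reporting window
for target in (99.0, 99.9, 99.99):
    budget = downtime_budget(target, year)
    print(f"{target}% allows about {budget.total_seconds() / 3600:.2f} hours of downtime per year")
```

Each extra nine cuts the budget by a factor of ten: roughly 3.65 days per year at 99%, about 8.76 hours at 99.9%, and about 52.6 minutes at 99.99%.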
But I think the point I want to make is that this is a continuously evolving process.
It's a process that is gonna
keep on going, and you really need to keep yourself focused,
and you need to start somewhere
in building this reliability process. Oh yes,
and don't expect perfect results at the start.
It is gonna evolve.
It is gonna reach a state where you're gonna be happy
for that stage of your customers and for that stage of your company.
But eventually there will be a next evolution story that
you will go through.
You might want to scrap a lot of things and build all over again.
So it's a constant process.
You need to really keep yourself evolving as the reliability
story of your organization evolves.
As we are learning to understand
how you start to think,
I think one of the most important questions to
answer here is: how do you learn?
Because there's this vast amount of information out there.
You can go all the way down the stack, to the kernel level, and understand
how interrupts work, how operating systems are
designed, how paging and low-level memory mapping really work,
to things like how networks are designed and how the TCP/IP model
works. I mean, you can look at the entire
spectrum and go all the way to how these modern agentic AI applications
are designed, what an ML algorithm is really doing, and how
ML models are deployed.
I think
my point here is that this is a vast domain.
You can always keep learning.
There's also a security aspect to it.
There are vulnerabilities, there are different attacks.
And as a site reliability engineer, you really need to keep your
head above the ground and understand what's going on in the market,
to be aware that these are the vulnerabilities out there.
You need to take those into your design considerations while you're
designing those applications.
Compliance requirements are
another aspect that you really need to keep up with.
And while you are keeping up with all these traditional aspects
of learning, there's a lot evolving in the computer industry. Nvidia keeps
rolling out new hardware.
Google is coming up with new technologies, and there are new
tools coming up in the market.
Something that you have
invested in for the last five years suddenly gets changed.
You were running your monolithic applications, or maybe designing
your VPCs in the cloud and designing some EC2 applications, and suddenly
there's a thing called Kubernetes that probably helps you do all those
things in the span of a second.
Maybe that is how you're gonna solve your problem.
So, I mean, the point here is that you are continuously learning.
You're continuously evolving.
How you keep yourself grounded is the most important question.
And that's where I believe in this concept of the T model of learning,
where you keep your breadth going: you keep learning, you keep
attending sessions, you keep attending conferences, you keep understanding
what is going on in the market.
But at the same time, there is one single domain,
one single area of your choice, where you keep going deep
and become an expert. That field is actually gonna
keep you going in the long run.
And there are a number of things happening on the
breadth side of the world which will actually help you navigate
your career in the right direction.
Otherwise it gets very crazy.
You cannot really keep up with what's happening out there.
You cannot really understand each and every thing and how it really operates.
And it's fine.
Understanding that you don't know everything
is way better than saying that I know everything and I can fix everything.
I mean, as a site reliability engineer, one of the important learnings
that I have had over the
years is that you cannot know everything.
There will only be certain aspects of the system that you can really know
very well, but that also doesn't mean that you should just stay invested
in that aspect of the problem.
You need to increase your broader scope of learning.
You need to
at least have the right balance of depth and
breadth across the different tools, technologies,
and new evolutions that are happening in the market.
And this goes a long way. Another thing that is very
core to a site reliability engineer is dealing with incidents.
I always feel a site reliability engineer is more like a soldier standing at the
forefront and ensuring that the experiences of the end users are always
protected, and that issues are mitigated in the right way.
And while you are going through
this entire incident handling process,
I think one great learning that I have had over the course of these
years is blameless culture.
It is a very important part of a site reliability engineer's learning to understand what
this blameless culture essentially means.
It is very helpful when you are defining postmortems,
when you're writing things down
in a postmortem doc: why something happened, how something happened.
It really addresses the team aspect.
It is a collective problem-solving approach where you define what really
happened, how it could have been avoided, and what could be done
so that it doesn't happen again, and you are not pointing fingers at other people.
You are basically taking it as your own responsibility,
owning things yourself.
And I think this really goes a long way.
The second thing while dealing with fires in production is that you
really need to start thinking about going from a firefighting mode
to a fire prevention mode.
If there is a fire for the first time, maybe
it was not avoided.
It couldn't be avoided.
But if that happens a second time, then having a mindset that we need
to really address this for the long term is something that should come to your mind.
And I think that is very important, because that's how you keep yourself
invested in new things rather than going back to the same old aspects and
fixing things over and over again. Also, from a prevention perspective,
I always feel it is more important to invest in long-term projects.
If something has happened once or twice, you should really invest your
energy in ensuring that it doesn't happen again, and in finding the long-term solution
that addresses it, rather than shipping short-term mitigations.
Short-term things are mitigations, which are good to stop the bleeding,
but long-term solutions ensure that the problems never occur or recur,
so that your energy is not wasted in figuring out the problem again,
and it's not always you who is dealing with the problem.
So you are also saving your teammates from the fire.
So that's an important thing.
The third and the most important thing here is that you really need to move away
from tools and move towards principles.
What are the principles on which
this observability and this reliability story is built? Because most
of the time, the fundamental aspects of reliability, scalability, and observability
have always stayed the same,
irrespective of what tools you have used and what technologies you are
using to power your applications. You move from one company to another.
You move from one application to another.
The most important thing is that
the principles on which these things operate always stay the same.
So I would generally invest my energy in understanding and
learning more about these principles rather than the actual tools.
You can pretty much learn a tool in a few days, find your
way around it, and move on to a new tool.
But I think investing your energy in understanding these
principles is very important, and that's how you pass on the knowledge.
It is very crucial as a site reliability engineer: if you have defined a certain
principle, if you have learned something, you really need to write it down, pass
it on to the next SRE, or make it more like a defining
principle for your team, so that everyone acknowledges it and abides by it,
so that it's more like a standard.
So setting standards is very important for a site reliability engineer.
I think the last and most important thing in the life of an SRE, or the mindset
of an SRE, is that you should be involved in and thinking about building a community
and staying with the community, because I feel community learning is very important.
That keeps you upgraded and abreast of what's happening in the industry.
So going out, connecting with people, knowing what's happening
out there, sharing your thoughts, sharing your insights with people,
and evangelizing what you have done within an organization is very important, because
it not only helps you grow but also helps other people grow along with you.
And it's the collective mindset that solves a larger problem.
I feel we are here because we are standing on the
shoulders of the thousands of great people who built the underlying technologies,
and we are building the next layer of technology solutions on top of that.
So I think it is very important for a site reliability engineer to develop a
mindset where they have to learn but, at the same time, share
this knowledge and information for the outer world to evolve:
learn, keep sharing, keep growing.
And yeah,
that, I feel, is something of very crucial importance
for a site reliability engineer.
Yes, that's all I had for this session.
I hope you enjoyed my thoughts and my ideas about site
reliability and design thinking.
If you have any questions, please feel free to reach out to me;
I'll be happy to guide you along.
And thank you to the organizers for organizing this awesome event.
Thank you.