Transcript
This transcript was autogenerated. To make changes, submit a PR.
Getting real-time feedback into the behavior of your distributed systems, and observing changes, exceptions and errors in real time, allows you to not only experiment with confidence, but respond instantly to get things working again.
Hi everyone, my name is Vishnu Vardhan Chikoti, and in this session I'm going to talk about my learnings from chaos experiments.
About me: I have about 17 years of experience across IT and reliability engineering, which includes chaos engineering, product development and business analysis, so it's a fairly diverse experience. Currently I'm a Senior Manager, SRE at Fanatics, and prior to Fanatics I have worked at Broadridge, Goldman Sachs, Bank of America, Tektora Consulting and DBS Bank. Most of this experience is in investment banking, product development or the SRE areas. Other than this talk, I have
also done a few other things. One is that I'm a co-author of a book, Hands-On Site Reliability Engineering, which was published recently, in July 2021. I have done a couple of tech talks: one at Conf42 SRE 2021, about a new enterprise SRE adoption framework called Arctic, and I have also spoken about chaos engineering and how it relates to error budgets at Chaos Carnival 2022. I also have a blog, exceptgeek.com, which has content across capital markets, technology and some other things; if you want, you can check it out. And from a location perspective, I have been in Hyderabad, India for the last 20 years.
Before I get into the main topic, let's do a quick
recap on what chaos engineering is. So this
is like the definition from the Principles of Chaos Engineering. I'm not going
to read the full thing, but if you see there are like three main parts
to it. So one is that it is about experiments on
a system and then the second part is building confidence.
And then the third part is about the turbulent conditions in
production. Now when an experiment is done,
it can result in a failure, where the system is not able to withstand that condition, or it can result in a situation where the system has actually withstood that condition. If the system is actually able to handle it, then that builds confidence that, okay, this can be handled. And if it doesn't, then we need to go back and actually look at how we fix it. The next thing
is about the turbulent conditions themselves. So what turbulent
conditions can actually occur in production? The first one is the famous application or service unavailability. We all know that there can be a Chaos Monkey deployed which randomly shuts down services or applications; that is a famous one. Then, coming to the modern way in which we develop and deploy applications and expose them to users, there are various components. We might deploy on an on-prem system, on cloud, multi-cloud, hybrid cloud, one region or multi-region, so there are various different ways we actually deploy, and we replicate across database instances. The users themselves are now at a very high scale, spread across the globe and accessing things over their WiFi, broadband or some wireless connectivity. In such a scenario, there are so many network hops between the users requesting something and getting back what they want. So there can be delays in the network, and there can be failures in the network, like packet loss or packet corruption kinds of scenarios.
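To make the network side concrete, here is a minimal sketch, assuming a Linux host with tc/netem available, root privileges, and eth0 as a placeholder interface name, of injecting delay and packet loss and then rolling it back:

    import subprocess

    IFACE = "eth0"  # placeholder: adjust to the interface your service traffic uses

    def inject_network_chaos(delay_ms=200, loss_pct=1):
        """Add delay and packet loss to outgoing traffic using tc/netem (requires root)."""
        subprocess.run(
            ["tc", "qdisc", "add", "dev", IFACE, "root", "netem",
             "delay", f"{delay_ms}ms", "loss", f"{loss_pct}%"],
            check=True,
        )

    def clear_network_chaos():
        """Remove the netem qdisc so the network returns to normal."""
        subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=True)

    if __name__ == "__main__":
        inject_network_chaos()
        try:
            input("Fault active; observe the system, then press Enter to roll back...")
        finally:
            clear_network_chaos()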
Or we can have a resource utilization problem with respect to CPU, memory or I/O related issues. Now, when we consider an application, let's say running on a VM, it's not just that particular process that is running. There can be other things running, such as an agent for infrastructure automation, an agent for observability purposes, or an agent for security purposes. These various other agents might actually cause a problem which creates that CPU situation. However much you test your application, when you deploy it in production it runs in an ecosystem with other processes, and those other things can also cause a problem.
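And to make the resource side concrete, here is a minimal sketch of injecting a high-CPU condition by spinning up busy processes for a fixed duration; the core count and duration are just placeholders to adapt:

    import multiprocessing
    import time

    def burn_cpu(seconds):
        """Busy-loop for the given number of seconds to load one core."""
        end = time.time() + seconds
        while time.time() < end:
            pass  # pure spin to keep the core busy

    def inject_cpu_stress(cores=2, seconds=60):
        """Spawn one busy process per core to simulate a high-CPU condition."""
        workers = [multiprocessing.Process(target=burn_cpu, args=(seconds,))
                   for _ in range(cores)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()

    if __name__ == "__main__":
        inject_cpu_stress(cores=2, seconds=60)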
And then configuration errors. We have all seen that when there is some configuration error, there can be a massive incident as well. So it's better we try that on a small blast radius, by injecting it and then verifying how the system behaves. And then database failures.
With a database we can have locks, we can have transaction log fill-ups, there can be contention, and there can be other things going wrong in the database. So how is your overall system actually working or behaving in that particular scenario?
So that's a quick recap on what chaos is. Now coming to the actual learnings, what I've actually learned from my experience with chaos engineering over the last four or five years: one is about the tools themselves.
Being technical people, we are all very interested in what tool to use the moment we know that, okay, there's some chaos engineering that needs to be done. Now, when we look at a tool, there are various tools available, open source or from a vendor. Some of the things we need to think about are whether we are going to use this particular tool just for a team or a department, or at an entire organization level. Now, as you broaden your scope, what you observe is that there is a variation in runtime,
so how the applications are actually running.
Are they running on VMs, are they running on-prem, on cloud, or in a Kubernetes kind of environment? So, where are they actually running and how are they connecting with each other? That would be one variation. And the next variation that you would see is in the architectures themselves: how they are deployed, how you are trying to maintain HA for them, how you are trying to replicate. So there can be some differences in that.
And then the application maturity itself: given some standard architecture patterns and design patterns, how each application or service has implemented them can also vary. So the overall maturity of what they have actually considered, and at what level they have implemented it, also differs.
Now, if you run chaos experiments on a particular blast radius and then you think that, okay, the same thing will also work for another set of applications, that might not actually be an exact match. There will be differences in what has been implemented. And next is about the policies.
Take an organization: there will be various things, like what your infrastructure team will actually allow you to do on your applications as you keep expanding the scope of where you want to run the chaos experiments. Then there will be certain policies from a security perspective on what experiments you can run and in what way you have to run them; even though the tool has a capability, it might not be straightforward to use it given those policies. And then there will be change-related policies and incident-related policies, so how are you making sure that what you are actually going to run is in line with all of these policies? That's one thing about
the tools that I have learned. And then the second
thing is about the actual tools themselves. There are, as I said, many tools available. The first one is obviously the very famous Chaos Monkey, we all know that, and that's where it started; and then there is the Simian Army, which I didn't mention here. Then there is Vaurien, which basically injects failures based on the protocol, like HTTP or MySQL, and the error condition can be that you return an error or you inject a delay. Then there is Chaos Monkey for Spring Boot: basically you add certain configuration to your POM file and build your Spring Boot application, and then it will randomly start injecting faults. And then ChaosBlade is from Alibaba,
which has various chaos experiment scenarios, including fault injections at the JVM level. Then there is Pumba, which is basically for Docker containers; it can run as a standalone binary, or you can inject some libraries into your container and use Pumba that way as well. There is Chaos Mesh for Kubernetes environments; similarly we have Litmus for Kubernetes and cloud native environments, and then there is chaoskube for Kubernetes environments. Then there is Cthulhu, and we have Monarch from T-Mobile, which is used for Pivotal Cloud Foundry based applications, so if you have deployments there, then Monarch can be used. Then there is Mangle from VMware, which has various fault injections available for VMware-related environments or at a VM level. And then there is AT&T Resiliency Studio.
Then there is Muxy, which is also a proxy like Vaurien: you can put Muxy in as a proxy and then inject response code failures, delays or other network-related failures through it. And then there is Chaos Toolkit, with various integrations available. Again, which tool you want depends on where your application ecosystem is and what your blast radius is, and based on that it can be chosen. And then there are
also some operating system native features that can actually be used. If you look at how these tools are implemented internally, there are some common things that are used: some common Linux features or some common Windows features.
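Whichever tool you pick, the shape of an experiment is the same, so here is a minimal sketch of that structure, with hypothetical steady_state_ok, inject_fault and rollback_fault helpers standing in for your own checks and for one of the tools above: verify the steady state, inject, observe, and always roll back.

    import time

    # The three helpers below are hypothetical placeholders; in practice they would
    # call your service's health checks and one of the fault-injection tools above.
    def steady_state_ok():
        """Return True if the system currently meets its steady-state criteria."""
        raise NotImplementedError

    def inject_fault():
        """Start the fault (network delay, CPU stress, instance shutdown, ...)."""
        raise NotImplementedError

    def rollback_fault():
        """Stop the fault and restore normal conditions."""
        raise NotImplementedError

    def run_experiment(observation_seconds=120):
        if not steady_state_ok():
            print("Steady state not met; aborting experiment.")
            return
        inject_fault()
        try:
            time.sleep(observation_seconds)  # observe the system under turbulence
            print("Hypothesis held" if steady_state_ok() else "Weakness found")
        finally:
            rollback_fault()  # always clean up, even if the check fails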
Now coming to the actual scenarios, like what I have seen while injecting chaos experiments.
One is reconnect failures. Most of us who have worked for a long time can relate to this reconnect failure: let's say we have an application running and then suddenly it can't connect to the database. How is it actually behaving? Is it not able to reconnect anymore, is it hung, or what kinds of things are happening?
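A minimal sketch of the kind of fix this usually points to, assuming a hypothetical connect_to_database helper for your driver: retry the connection with exponential backoff instead of failing once and hanging.

    import time

    def connect_to_database():
        """Hypothetical placeholder for your database driver's connect call."""
        raise NotImplementedError

    def connect_with_retry(max_attempts=5, base_delay=1.0):
        """Retry the connection with exponential backoff instead of giving up or hanging."""
        for attempt in range(1, max_attempts + 1):
            try:
                return connect_to_database()
            except Exception as exc:  # in real code, catch the driver's specific errors
                if attempt == max_attempts:
                    raise
                delay = base_delay * (2 ** (attempt - 1))
                print(f"Connect attempt {attempt} failed ({exc}); retrying in {delay}s")
                time.sleep(delay)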
That's one thing that I have seen, and we need to look at how to fix it. And then the timeout problems. Now, if you inject a network delay, how is the timeout being handled in your service or UI or whatever it is that you are trying to verify?
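As a small sketch of what to look for, assuming the Python requests library and a hypothetical endpoint: set an explicit timeout and handle it with a clear failure path instead of letting the call hang.

    import requests

    SERVICE_URL = "https://example.internal/api/orders"  # hypothetical endpoint

    def fetch_orders():
        """Call a downstream service with an explicit timeout and a graceful failure path."""
        try:
            resp = requests.get(SERVICE_URL, timeout=(3, 5))  # 3s to connect, 5s to read
            resp.raise_for_status()
            return resp.json()
        except requests.Timeout:
            # An injected network delay lands here; surface a clear message instead of hanging.
            return {"error": "The service is taking too long to respond. Please retry."}
        except requests.RequestException as exc:
            return {"error": f"Request failed: {exc}"}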
And then let's say it's a UI: is it actually catching the timeout and showing a proper message, or is it just surfacing a raw error? And then there can be crashes: whether it is a service or the UI or a mobile app or whatever, it can simply crash. So based on the injected failure condition, is that something which is happening?
That's something that needs to be checked. And then the master-slave setups. Let's say you bring down the master instance and you have multiple slaves also available: is the election actually working fine, and is one of the slaves actually getting promoted to master? That's something that can be checked.
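A minimal sketch of how that check could be scripted, with hypothetical stop_master and get_role helpers standing in for your datastore's admin API: stop the master, then poll the replicas until exactly one reports itself as the new master.

    import time

    REPLICAS = ["db-replica-1", "db-replica-2"]  # hypothetical node names

    def stop_master():
        """Hypothetical placeholder: stop or isolate the current master instance."""
        raise NotImplementedError

    def get_role(node):
        """Hypothetical placeholder: ask the node whether it is 'master' or 'slave'."""
        raise NotImplementedError

    def verify_failover(timeout_seconds=120, poll_interval=5):
        stop_master()
        deadline = time.time() + timeout_seconds
        while time.time() < deadline:
            promoted = [n for n in REPLICAS if get_role(n) == "master"]
            if len(promoted) == 1:
                print(f"Failover OK: {promoted[0]} was promoted.")
                return True
            if len(promoted) > 1:
                print(f"Split brain: multiple masters {promoted}")
                return False
            time.sleep(poll_interval)
        print("Failover did not complete within the deadline.")
        return False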
Now, there are other things that can be done, like disconnecting the master from the slave through a network failure condition. Is the slave then thinking that the master no longer exists and promoting itself to master, so that we have a split-brain scenario? That is another kind of scenario that can happen. And then, depending on what actually happened, you will have a fix; for example, you may need three nodes, one master and two slaves, where even if the master disconnects from one slave through some network problem, that slave can connect to the other slave and then it knows that, okay, the master is still alive. So there can be various fixes, various things to look at based on what problem has been identified. And then the
auto-scaling thing. Now, if you inject a high CPU condition or you bring down a few instances, is your auto scaling actually working properly? Is the scaling actually happening to bring back the required instances, or to bring up a new instance based on the CPU or memory injection that was done? And then the HA setup again: how is your high availability working when one of the instances is actually not available? Is it kicking in correctly and working?
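One way to script that check, as a sketch assuming the service runs in an AWS Auto Scaling group (the group name here is a placeholder and boto3 credentials are assumed to be configured): poll the group after the fault until the healthy instance count is back at the desired capacity.

    import time
    import boto3

    ASG_NAME = "orders-service-asg"  # hypothetical Auto Scaling group name

    def healthy_instance_count(client):
        groups = client.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])
        group = groups["AutoScalingGroups"][0]
        healthy = [i for i in group["Instances"] if i["HealthStatus"] == "Healthy"]
        return len(healthy), group["DesiredCapacity"]

    def verify_scaling_recovers(timeout_seconds=600, poll_interval=30):
        """After injecting the fault, wait for the group to return to its desired capacity."""
        client = boto3.client("autoscaling")
        deadline = time.time() + timeout_seconds
        while time.time() < deadline:
            healthy, desired = healthy_instance_count(client)
            print(f"{healthy}/{desired} healthy instances")
            if healthy >= desired:
                return True
            time.sleep(poll_interval)
        return False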
Then exception handling. Let's say there is a user who is trying to do some transactions and then there is a failure at the network level, or these usual timeout things. How is that actually being handled on the UI side? Again, let's say the user has done a buy transaction, it has actually hit the server and the buy actually got processed, but the response was not properly handled in the UI. Then that will be a problem: the user will not know what exactly has happened.
So, let's say there is a timeout or there is some other error coming from the back end: are you actually taking the user to some kind of screen where they know what exactly has happened overall, and that kind of user experience? And then observability. This is not directly related to how the application has withstood the turbulent condition, but other things that can be considered include observability: when these kinds of errors are actually happening, are you actually logging them, are you catching them within your observability data?
Observability is the data; once you have that data, do you have the right monitors in place? When these kinds of things have happened, did the monitoring actually pick them up or not? In certain cases, yes, we have set up the observability, but we haven't set up the monitors. And once we have set up the monitors, the next thing is: is the alerting being done through the right channels? Are you messaging someone, emailing someone, actually paging someone? So this kind of alerting also comes into the picture as something you can verify during the chaos experiment, whether it actually worked fine or not.
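A minimal sketch of that verification, with a hypothetical get_active_alerts helper standing in for your monitoring system's API: after injecting the fault, poll for the expected alert within a deadline.

    import time

    def get_active_alerts():
        """Hypothetical placeholder: return the names of currently firing alerts
        from your monitoring or alerting system's API."""
        raise NotImplementedError

    def verify_alert_fires(expected_alert, timeout_seconds=300, poll_interval=15):
        """After injecting a fault, confirm the expected monitor actually fired."""
        deadline = time.time() + timeout_seconds
        while time.time() < deadline:
            if expected_alert in get_active_alerts():
                print(f"Alert '{expected_alert}' fired as expected.")
                return True
            time.sleep(poll_interval)
        print(f"Alert '{expected_alert}' did not fire; monitoring gap found.")
        return False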
Right. And then error budgets. Now, when these errors are happening, whatever failures or errors are actually being sent back, are they being recorded and adjusted against the error budget? You won't create a major impact on the error budget through chaos experiments because it's a small blast radius, but you can still verify whether they are reflected there or not. And then auto recovery: if you
have auto recovery set up, you can check whether the auto recovery
has actually triggered correctly and then recovered the instance.
For example, if you bring down an instance, did the monitoring actually trigger the auto recovery, and did it actually bring that instance back?
So that's another thing. And then the other learning
I have is about injecting multiple failures.
We don't actually inject multiple failures at the same time across the ecosystem; we inject one failure at a time. If you inject too many failures in parallel at the same time, you don't know what went wrong and what would have actually caused it.
That is another thing. And then, how do you actually handle the identified weaknesses? When you have actually identified some problem through chaos experiments, it comes back to the product development backlog; basically it goes into the technical backlog, because this is some kind of technical problem that you need to fix. Now, how do you actually prioritize this and get it fixed? It depends on the scale of the problem. You would have injected the chaos experiment on a small blast radius; now, what happens if such a problem happens at a widespread scale? Basically, what would be the impact and what level of problem would it create? You need to prioritize it accordingly and either utilize the 20% of the tech backlog that can be allocated in a sprint, or, if it is a bigger problem, then you have to allocate more time than the 20%
to fix this problem. And then knowledge sharing. Once you have done the chaos experiments and learned something from them, it is important that you actually go back and share within the organization: these are the experiments that were run, and these are the results that we saw. And if there are any observations, like the system did not withstand something, then what did you actually do to fix it? Because in certain cases, similar patterns or architecture patterns will be used across the enterprise, and then it's important that you go back and fix the different places where the same problem might occur.
That's my talk. Any questions you have, please send them on Discord, and thank you.