Transcript
Welcome everyone.
I'm excited to be here today to talk about how to build reliable systems
under unpredictable conditions, a challenge that every engineering team faces.
We will explore smart approaches in chaos engineering, observability and
incident management and how they can work together to create resilient systems
that can withstand real world failures.
Throughout this session, we will discuss practical strategies and tools,
including SteadyBit, that help teams proactively uncover weaknesses,
strengthen reliability, and ensure both systems and organizations
are prepared for unexpected disruptions.
Let's dive in.
My name is Benjamin.
I'm one of the founders of SteadyBit, a chaos engineering platform.
I started with chaos engineering around nine years ago,
and with SteadyBit, we have been in that space for five years.
Setting the scene, let's dive in.
Let's start with an important question.
What do we strive for?
What goal does each of us pursue every day in our work?
In other words, what is the mission of software development?
When we write code, deploy applications, or optimize systems, we are all
working towards something bigger.
But what is that overarching purpose?
Is it about delivering features, or is there something deeper?
The mission of software development, at its core, is to continuously improve
and deliver a software solution that provides real value to its users.
Software isn't just about writing code or deploying applications.
It's a living, evolving entity.
The best software is never done.
It adapts, improves, and meets users' needs over time.
But achieving this mission isn't simple.
There are challenges: complexity, changing requirements, reliability concerns.
That's why we need to focus not just on building software,
but on building resilient, maintainable, and user-centric software.
Now, why does this continuous improvement matter?
Because customers trust a system when it's consistently good
in both quality and performance.
It's not enough for software to work well from time to time.
Users expect reliability and stability every single time they interact with it.
If a system is fast today but slow tomorrow, or if it works well in some
cases but fails in others, trust erodes.
And once that trust is lost, it's incredibly hard to regain.
So as engineers, our job isn't just to build features;
it's to ensure that those features work predictably
and reliably under real-world conditions.
Why is it so hard?
Now, let's talk about one of the biggest challenges we
face in achieving reliability.
Complexity.
The complexity of today's systems is massive.
And it's only growing.
Think about how modern applications are built.
We are no longer dealing with monolithic, self-contained systems.
Instead, we have distributed architectures, microservices, cloud
platforms, third party dependencies, and constantly evolving infrastructures.
This complexity makes our job even harder.
It's no longer just about writing code.
It's about understanding how all the moving parts interact, fail, and,
fingers crossed, hopefully recover.
But wait, when we talk about a system, are we all talking about
the same definition of a system?
Let's have a look at the definition of a system from my point of view.
Of course, there is software, something we have created,
but it needs to be deployed on some infrastructure and hardware.
People are needed to build, maintain, and operate such software and infrastructure.
There is a pipeline, a build process, a way to actually
create and deliver the applications and software.
All of that happens inside an organization,
which means there are also processes and many more components.
So it's quite complex and a lot of stuff is going on in such a system.
And now we also must continuously improve our systems, not just to add new features,
but to make them more fail safe and capable of handling failures gracefully.
And to be honest, that's freaking hard.
And one thing must be clear.
We will never have a 100 percent error free system.
No matter how much testing we do, no matter how many best practices
we follow, failures will happen.
Servers will crash.
Networks will lag.
Dependencies will break.
So instead of chasing this impossible dream of a failure-proof system,
we should focus on building resilience
by using the power of chaos engineering.
You all know the definition of chaos engineering, so nothing new in here.
Let's continue.
But as mentioned earlier, the definition of a system is much
more than just technology.
The term socio technical system expresses it very well.
It's not just technology.
It's a combination of people, processes, and tools working together
to produce a specific outcome.
This means failures aren't just software bugs.
They can come from misconfigurations, human errors, unexpected
interactions between services or even misaligned team processes.
Chaos engineering helps us proactively test, learn, and improve.
We are not just reacting to failures when they happen,
but anticipating them and strengthening our systems in advance.
We are getting into something more proactive.
Let's now take a look at such a socio technical system and
how it evolves over time.
This simple website shows the result of a collaboration between many
different people and technologies.
I would like to address the question of what and who is needed to build
and operate this online shop.
Starting on the right-hand side, we now see the individual services
that are needed to operate the shop shown on the left-hand side.
Let's group the elements by, let's say, subject area to get a better overview.
It's not getting much better; it's still quite complex, with many moving parts.
These components are provided by more than one team or one person,
and they require coordinated interaction between the individual teams
for the overarching common goal: to provide the customer with a functioning
system at all times, consistently good in quality and performance.
But it gets even more complex.
Within the teams, people work together to build and operate this online shop.
There's a lot of interaction between the teams and each individual
person is interacting as well.
So it is difficult to keep everyone in the loop and to keep this
constantly changing system stable.
Now that we are all clear on the preconditions and the whole system,
let's get into the subject in detail.
I want to increase systemic resilience, and there are three phases
we have to go through to ensure that our system constantly delivers
good quality and performance.
Let's go through the phases one by one and discuss how we proceed in them.
The first one: uncover risk.
Let's talk about how we identify and address system vulnerabilities.
One of the key challenges in modern distributed systems, especially
in Kubernetes environments, is understanding where the risk is.
Dependencies, misconfigurations, or weak points in redundancy
can all lead to failures.
This is where SteadyBit's approach comes in.
By analyzing system configuration, we can uncover hidden risks and
provide precise recommendations to enhance system robustness.
What you see here is a risk analysis, highlighting potential weak spots.
Red areas indicate critical issues that could impact availability, while
green areas show resilient components.
With this kind of analysis and insight, we are not just reacting to failures;
we are getting ahead of them, turning it into a proactive approach.
Phase number two, understand impact.
Once we have identified hidden risks, the next step is to verify
our resilience strategy to ensure our system can truly withstand disruptions.
This is where chaos engineering experiments come in. We don't just assume
that our redundancy measures work;
we test them under real-world conditions before failures happen.
Here in the picture, you can see an experiment focused
on validating Kubernetes pod redundancy across AWS zones.
The system, or SteadyBit, flags a potential risk:
pods are distributed across zones, but we still need to verify whether
an outage in one zone impacts our performance and quality.
The best part: these experiments don't disrupt production.
They are controlled, observable, and measurable early in the development cycle,
helping teams build confidence in their infrastructure before
they deploy into production.
Want to see a real demo, a real example?
Let's jump into SteadyBit.
On the right side, you can again see this bubble with all
the moving parts of the system.
I would like to do a grouping again: I'll group the elements by the
Kubernetes deployment name, just to get a better picture of the system.
And now I want to identify a potential risk in my system.
I can do that with some colors: I can ask SteadyBit to color all the
elements on the right side by the zone they are running in.
SteadyBit now tells me, okay, you're running in two zones.
And with just one look, I can identify that the toys bestseller service is
distributed across two zones, and the checkout service as well.
But here in the center, our hot-deals service isn't distributed
across all the zones we are running in.
This is already a risk, and I was able to identify it without running
any experiments, just based on the data.
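To make that idea concrete, here is a minimal, hypothetical sketch (not how SteadyBit does it internally) of checking zone distribution yourself with the Kubernetes Python client. It assumes the standard topology.kubernetes.io/zone node label; the "default" namespace and the "app" pod label are placeholders for your own setup.

```python
# Hypothetical sketch: map each workload to the zones its pods run in,
# and flag anything that is pinned to a single zone.
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Node name -> zone, taken from the standard topology label.
node_zone = {
    n.metadata.name: (n.metadata.labels or {}).get("topology.kubernetes.io/zone", "unknown")
    for n in core.list_node().items
}

zones_per_app = defaultdict(set)
for pod in core.list_namespaced_pod("default").items:  # namespace is a placeholder
    app = (pod.metadata.labels or {}).get("app", pod.metadata.name)
    if pod.spec.node_name:
        zones_per_app[app].add(node_zone.get(pod.spec.node_name, "unknown"))

for app, zones in sorted(zones_per_app.items()):
    status = "ok" if len(zones) > 1 else "RISK: single zone"
    print(f"{app}: {sorted(zones)} -> {status}")
```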
Taking one step back, grouped by all the deployments, I'm now using SteadyBit
and the Advice feature to help me filter my moving targets on the right side.
SteadyBit is now able to tell me where a potential risk already exists.
So, getting into the checkout service again, SteadyBit tells me,
okay, here are some orange ones.
So what's going on?
Let's get into this example, the pod deployment.
The advice from SteadyBit was: hey, you should run this with more instances,
and here is how to fix it: take this snippet and reconfigure your elements.
This is already something where SteadyBit can help
me identify an issue and improve my system.
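The exact snippet from the demo isn't reproduced here, but such a fix typically boils down to raising the replica count of the deployment. A hedged sketch using the Kubernetes Python client, where the deployment name "checkout" and the namespace "shop" are illustrative placeholders:

```python
# Hypothetical sketch: scale a single-instance deployment up so it can
# tolerate losing one pod. Deployment name and namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

apps.patch_namespaced_deployment(
    name="checkout",
    namespace="shop",
    body={"spec": {"replicas": 3}},  # strategic merge patch: run three replicas
)
print("checkout scaled to 3 replicas")
```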
And the last element is to verify whether this actually fixed the situation.
So there is an experiment created based on the data SteadyBit has discovered.
And that's just one example.
So let's zoom in.
What SteadyBit is doing here, first of all, is a pre-check:
SteadyBit will check if all the pods are ready.
If not, don't execute this experiment,
because the system isn't in a healthy state.
Next, we will simulate a pod failure of the checkout service
by deleting a specific pod.
Our expectation is, first of all, that we want to see this pod failing,
but also that it returns within a couple of seconds, a minute at most.
So our expectation is: even with one pod failing, the system under test is
still working and recovers.
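For illustration, here is a rough, standalone sketch of the same experiment flow, pre-check, delete a pod, verify recovery within 60 seconds, scripted against the Kubernetes Python client. SteadyBit runs this through its platform; the namespace, deployment name, and "app" label selector below are placeholders.

```python
# Hypothetical sketch of the experiment flow; names and labels are placeholders.
import time
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
apps = client.AppsV1Api()

NAMESPACE, DEPLOYMENT, TIMEOUT_S = "shop", "checkout", 60

def readiness():
    dep = apps.read_namespaced_deployment(DEPLOYMENT, NAMESPACE)
    return (dep.status.ready_replicas or 0), dep.spec.replicas

# 1. Pre-check: only attack a healthy system.
ready, desired = readiness()
assert ready == desired, "system not healthy, aborting experiment"

# 2. Attack: delete one pod of the deployment (assumes an app=<deployment> label).
victim = core.list_namespaced_pod(NAMESPACE, label_selector=f"app={DEPLOYMENT}").items[0]
core.delete_namespaced_pod(victim.metadata.name, NAMESPACE)
time.sleep(5)  # give the controller a moment to notice the missing pod

# 3. Hypothesis: all replicas are ready again within 60 seconds.
deadline = time.time() + TIMEOUT_S
while time.time() < deadline:
    ready, desired = readiness()
    if ready == desired:
        print("experiment passed: pod came back in time")
        break
    time.sleep(2)
else:
    raise SystemExit("experiment failed: pod did not recover within 60 seconds")
```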
Getting back to the slides, because now we can execute this experiment.
It's the same experiment, but this time with the fashion bestseller service.
The pre-check was done, everything is fine.
Now we are injecting the delete-pod attack, and then we want to see
the pod coming back within 60 seconds.
If not, the experiment will fail, and we have already identified a risk in the system.
It's now being executed in fast forward to save some time.
SteadyBit is also collecting some internals about Kubernetes,
so Kubernetes events are coming in, I'm getting some information about the
deployment readiness, and everything ends up in one central report.
Four, three seconds left.
Here we go.
The experiment was successful in the end.
So the first job is done, and we are quite okay.
Now let's get into the second part.
We have established the preconditions: understanding the system's complexity,
uncovering vulnerabilities and risks, and verifying resilience more proactively.
It's time to dive deeper into this subject.
So let's shift our focus from why we should do it to how we can do it:
how we can apply these principles in practice, run meaningful
experiments, and truly enhance system resilience in our organization.
And that's on another level; it's not only about the technical side of things.
Now we want to increase the systemic resilience of our
organization, which is the last element, the last building block.
We have already done the hard work of improving our system's resilience,
identifying weaknesses, validating redundancies, and running experiments
to strengthen our technical setup.
But there is still one crucial question left.
How will our teams respond when things go wrong?
Even the most robust systems can't prevent all failures.
What really determines success is how well the organization, our people and
processes can react under pressure.
This is where socio technical resilience comes in.
It's not just about the technology, it's about how different teams
interact, coordinate and make decisions when things break.
The unpleasant feeling of uncertainty, of not knowing how well we
will handle an incident, is real.
And the best way to eliminate that uncertainty?
We test it.
We simulate incidents, we practice responses, and we ensure that
our organization is just as resilient as our technology.
Now you can see how both worlds are coming together.
What you see here are two complex systems, one on the left-hand side
and one on the right-hand side.
But what's important to understand is that both are essential to running and
operating a truly resilient system.
On the left, we have the technical system: the architecture, infrastructure,
and software that power our application.
On the right, we have the organizational system:
the teams, communication channels, and processes that ensure smooth operations
when incidents occur and that help us keep our systems up and running.
So now, let's jump into another demo and see how a tool like SteadyBit can help you
harness the power of chaos engineering to measure and improve the reliability
of both systems and organizations.
Let's assume there was an incident a while ago, and with all
the data from the incident, we were able to identify a specific scenario.
The scenario I'm talking about is a latency spike in one
of the backend services, called hot-deals, followed by a total outage
of the central gateway component.
So this was really a cascading error: something went wrong in the
backend, but it also went wrong in earlier elements
of the system, in the gateway component.
My expectation, or in other words my hypothesis, is that our
monitoring recognizes the scenario within 90 seconds and reports it to the relevant
on-call team, we are using PagerDuty, and that the incident is opened
and acknowledged within three minutes.
That's the first element.
That is something I need to see from my system, but also from my organization.
But then, getting more to the technical side of things:
the technical system normalizes within three minutes, our on-call team
determines that the resilient system has recovered and is working normally,
and so the incident is resolved within four minutes.
So you can now really see both worlds: the technical system, where we have
implemented resilience patterns so that our system can recover from failures,
but also, on the other side, our organization, our teams, our people,
who are able to understand what's going on and how they should
react under those conditions.
And to truly strengthen our organization's resilience, we need to focus on the
interaction, not just within teams, but also between the tools and platforms
that keep everything running smoothly.
Here, we see three key players, PagerDuty, Datadog, and SteadyBit.
PagerDuty helps us orchestrate incident response, ensuring the right
people are alerted at the right time.
Datadog provides observability, giving us real-time insights into
system performance and failure patterns.
And finally, SteadyBit brings in chaos engineering, allowing us to
proactively test, validate, and strengthen both our technical and
organizational response strategies.
And by integrating these tools, we create a holistic approach to resilience
where we don't just detect failures, but also understand them, react to them,
and ultimately prevent them from impacting users.
And yeah, resilience isn't about a single tool or strategy.
It's about making sure everything and everyone works together seamlessly
to handle whatever comes our way.
All right.
Getting into the details and the timeline: everything will start with a latency
spike in our hot-deals service, followed by an outage of our gateway component.
We want to see this event, or rather these failures, in our monitoring
tool, Datadog, within 90 seconds.
Datadog will tell PagerDuty that something is wrong and will trigger an incident;
PagerDuty knows who's on call and will notify those people.
We want to see those people on call jumping in, getting first insights,
and acknowledging the incident.
Then, getting back to the technical side of things and our resilience
and reliability patterns: the system is able to recover.
It's turning back into an okay status, which is recognized by Datadog.
This also means that our team is able to see this improvement.
So the system is back to normal, and the incident needs to be resolved as well.
A lot of stuff is going on.
Before we start the experiment, let me get into something very important:
I don't like finger pointing.
It's not about testing how fast an individual person fixes an outage.
It's really about how good the organization is at detecting and fixing
faults and how well processes are coordinated.
So now it's time to zoom in and take a look at the experiment we will run.
First, SteadyBit is integrated with Datadog, and as a pre-check
we will verify that the system is in a healthy state.
If not, don't execute this experiment.
Next, SteadyBit will inject latency into all hot-deals services, followed by a
total outage of the gateway component.
Then our expectation in the hypothesis is: after 90 seconds,
there needs to be an alert in our monitoring tool, Datadog.
So we want to see this alert coming in,
and this alert needs to trigger an incident.
We are integrated with PagerDuty from SteadyBit's point of view,
so we can check whether an incident was triggered.
If so, everything is okay, everything as designed.
Now, our system is able to recover; it's getting back into an okay status.
So far, so good.
But we are also checking again in PagerDuty whether someone on call
is jumping into that incident and doing some research
to get insights into what's going on.
If the system is stable, back in an okay status,
the incident should be resolved.
And that is again something we will check automatically in PagerDuty.
And that's the experiment we will execute now.
You can really see how many interactions are needed to verify that this process
is designed right and working as intended.
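As an illustration of what these checks boil down to, here is a hedged sketch that polls the public Datadog and PagerDuty REST APIs directly with the timing budgets from the hypothesis. SteadyBit provides these as built-in integrations; the monitor ID, service ID, and API keys below are placeholders.

```python
# Hypothetical sketch: poll Datadog and PagerDuty with the timing budgets
# from the hypothesis. Monitor ID, service ID, and keys are placeholders.
import time
import requests

DD_HEADERS = {"DD-API-KEY": "<datadog-api-key>", "DD-APPLICATION-KEY": "<datadog-app-key>"}
PD_HEADERS = {
    "Authorization": "Token token=<pagerduty-api-token>",
    "Accept": "application/vnd.pagerduty+json;version=2",
}
MONITOR_ID, SERVICE_ID = 123456, "PXXXXXX"  # placeholders

def datadog_monitor_state():
    r = requests.get(f"https://api.datadoghq.com/api/v1/monitor/{MONITOR_ID}", headers=DD_HEADERS)
    return r.json()["overall_state"]  # e.g. "OK", "Warn", "Alert"

def pagerduty_statuses():
    r = requests.get("https://api.pagerduty.com/incidents",
                     headers=PD_HEADERS, params={"service_ids[]": SERVICE_ID})
    return [i["status"] for i in r.json()["incidents"]]  # triggered / acknowledged / resolved

def wait_for(condition, timeout_s, step):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if condition():
            print(f"{step}: ok")
            return
        time.sleep(5)
    raise SystemExit(f"{step}: hypothesis not met within {timeout_s}s")

# The four checks from the hypothesis, in order:
wait_for(lambda: datadog_monitor_state() == "Alert", 90, "Datadog raises an alert")
wait_for(lambda: "acknowledged" in pagerduty_statuses(), 180, "on-call acknowledges the incident")
wait_for(lambda: datadog_monitor_state() == "OK", 180, "system recovers")
wait_for(lambda: "resolved" in pagerduty_statuses(), 240, "incident is resolved")
```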
So here we go.
Let's take a look at a recording of this experiment, because it normally
takes about five minutes.
The pre-checks are executed.
Datadog has been integrated into SteadyBit, so we can ask the system
if everything is fine.
At the bottom of the screen, you can see a little green bar coming up;
this tells us that the status is okay.
And we are checking it for a couple of seconds, so it's not just a one-shot check;
we want to see that everything is okay over a specific period of time.
Now SteadyBit is injecting the latency, followed by a total
outage of the gateway component.
The expectation now is that after 90 seconds, and we are in fast-forward mode here,
we want to see an alert inside of Datadog.
So we are calling Datadog again: hey, the status needs to be alert.
And you can see a red bar coming up at the bottom.
We are also now checking PagerDuty for a triggered incident,
and we found an incident related to that outage.
Now the outage is gone and the system should recover.
We want to see, after some seconds, that first of all someone
is jumping into the incident, so the incident gets acknowledged.
And we also want to see that Datadog recognizes that the system has
recovered and that the status of the system is turning back into an okay status.
Monitoring is okay.
Scrolling down, you can see the bar.
And now the last element: the system is back to normal.
The incident needs to be resolved, maybe by Datadog automatically inside
of PagerDuty, or by someone on call.
And now our scenario is successful, because our system was able to handle it and
the people on call were able to react as needed, so they are well trained.
Let's quickly recap what we have covered today.
We started by defining the mission of software development:
delivering continuous value while ensuring reliability.
It's hard, and it's not just about features.
We saw that complexity is massive, failures are inevitable,
and trust is built on consistently good quality and performance.
Recovering from a bad outage is quite hard,
and getting customers' trust back is very hard.
To tackle this, we explored chaos engineering as a way to uncover
weaknesses in both technical and organizational systems, called socio-technical
systems, helping us verify resilience before failures happen.
And we did all of this not in production, but in an early-stage environment.
And finally, we saw how a tool like SteadyBit empowers teams to test,
validate, and strengthen both their infrastructure and their response
strategy at the organizational level.
Because in the end, resilience is not just about preventing failures;
it's also about preparing for them, because they will happen.
So thank you again for listening.
I hope you enjoyed the session.
If you have any questions, please reach out to me.
You can find me online on LinkedIn and at SteadyBit.
And again, thank you very much for your time.
Bye.