Transcript
It's absolutely fantastic to be here at Conf42 2025.
Thanks for taking the time to join the session.
Today we're diving into a topic that challenges traditional thinking about reliability: the chaos first mindset, instead of sitting back and not being proactive about it. So buckle up, because it's going to be a fun and exciting ride into proactive resilience.
So a little bit about me.
My name is Shahid.
I am a senior software engineer at Harness and also the maintainer
and community manager at Litmus.
From time to time I also work as an LFX mentor for the Linux Foundation programs for LitmusChaos.
So that's about me.
So what's the game plan for today?
We're going to start with a discussion on internal developer platforms, or IDPs: why they're gaining so much traction and why they matter. Then we'll move into the biggest challenge in cloud native development, reliability.
The very thing we built our systems for, yet struggle with so much.
From there, we'll introduce chaos engineering, talk about the chaos first principle, and explore how it plays a crucial role in platform engineering. We'll also see some of the tools, and a hands-on demo as well, in which we intentionally break things on purpose. And finally, we'll discuss the future, where this space is heading, and how you can integrate chaos into your own platform journey. So let's go.
Before we jump into that, let's take a step back and talk about what's really happening in cloud native development.
Now, today, it's fast, it's dynamic, and it's constantly evolving.
We've got CICD pipelines, DevOps, SecOps, Configuration Management, Observability,
Analytics, and the list keeps growing.
And yet, despite all these advancements, one problem still remains.
That failure is inevitable.
The more we distribute our systems, the more dependencies we introduce, and the harder it becomes to predict what will break and when. There are a couple of things, which I'll talk about later as well, like cascading failures or failures due to complicated architectures. Now imagine an authentication service going down. Suddenly the payment systems, user dashboards, and even notifications stop working, right? Because they're all dependent on it. This is called a cascading failure, and it is a huge problem in cloud native systems. In fact, 80 percent of cloud native applications experience downtime due to such issues. And also, instead of one big application, we now have hundreds of tiny services running across multiple clusters, which talk to each other over networks, depend on APIs, and rely on different data stores. And if something fails, figuring out what went wrong and where can be incredibly difficult.
There's unpredictability, and unpredictability is the default today, unfortunately. Containers can crash, nodes can go offline, there could be network issues, and everything is always changing.
So how do we build resilience in such an unpredictable world?
The reality is failure is no longer an exception.
It is a given.
And the question isn't how do we prevent failures, but rather, how do we prepare for them?
So that's where we start shifting our mindsets.
Now let's talk about internal developer platforms or IDPs.
These platforms are designed to streamline developer experience,
enabling teams to deploy and manage applications with minimal friction.
Think of them as self service portals where developers can request
infrastructure, manage deployments, and ensure governance without needing
to interact with multiple teams.
But here's the challenge.
IDPs work great when everything is running smoothly.
The moment something breaks, things can spiral out of control.
That's why integrating chaos testing into IDPs is critical.
It ensures that failures do not just get handled after they happen,
but are proactively addressed.
Let's zoom in on the real issue, which is cloud native complexity,
which I talked about before.
We have moved from monolith to microservices, and while that's great
for flexibility, it also introduces a massive reliability challenge.
Your code doesn't exist in isolation anymore.
It's running on a web of interconnected services, APIs, databases, third
party dependencies, and so on.
The problem is, when one thing fails, it can trigger a domino effect.
And in the cloud native world, these failures are happening all the time.
So instead of just hoping for the best, we need a new approach, one that assumes failure from the start and prepares for it proactively.
Let's compare old school DevOps with today's cloud native reliability.
Back then, we built a single application, deployed it maybe once a quarter, and had time to test things out.
Fast forward to today, we are deploying 10 times as many microservices at lightning speed across hundreds of environments.
And with all this complexity, we have to ensure reliability.
So how do we even do that?
Traditional monitoring and incident response aren't
really enough for us anymore.
We need to inject chaos deliberately, test failure scenarios in real time, and make resilience an active part of our development process. Now, outages are expensive. They lead to financial loss, reputational damage, frustrated users, and whatnot. Some of the biggest tech giants have experienced massive failures due to these different types of issues. And sometimes the problem isn't even the application itself. It could be bad code, unhandled edge cases, or even unexpected system load, or there could be a series of cascading failures that can take down systems.
For example, a similar event happened with Slack, where it impacted
thousands of businesses worldwide.
And whenever incidents happen, services often log too much data or retry too aggressively, which sort of overloads the system even more. This results in users being unable to access the services, leading to a drop in trust and a lot of frustration.
Even with cloud providers and Kubernetes, infrastructure is
never 100 percent reliable.
You could have device failures like hard drives crashing, power supplies failing, or memory leaks building up. And a financial company recently lost over 55 million because of a failure in one part of their infrastructure that prevented transactions from processing.
Sometimes it's not code or hardware, but how we handle these incidents that makes things much worse. If auto scaling is not configured properly, let's say, then a crashing service might not even be detected. If teams do not have the right alerts, they sometimes don't even know something has gone wrong until users start complaining. Not having an active channel of alerts, or a good way of handling incidents, is also something that can trigger these problems.
So what exactly is chaos engineering?
At its core, it's about running controlled experiments to simulate real world failures. And you already know this, because you're at Conf42 Chaos Engineering.
So we're not going to spend too much time on what is chaos engineering.
Rather, we'll talk about the principles later.
So instead of waiting for a real outage to happen, we just introduce a failure ourselves, measure the impact, and then learn how to recover quickly, right? That is our goal: to build confidence in our systems and have them withstand unexpected disruptions.
Now, here's the big idea: the chaos first principle. Instead of treating failures as rare anomalies, we assume they're inevitable.
And instead of reacting to failures, we prepare for them up front.
By injecting chaos early and often, we make resilience a core part of
our platform engineering process.
And this means fewer surprises, faster recovery, and a much more reliable system overall. Platform engineers are the backbone of reliability in modern systems.
But let's be honest, no matter how well you design a platform,
unexpected failures will still happen.
That's why chaos engineering is a must-have in platform engineering, as it allows us to test system behavior under failure conditions, uncover hidden weaknesses, and continuously improve our reliability posture through all these different components that you see right now.
Chaos engineering is not just a testing practice, it's a mindset shift.
And by running chaos experiments, platform engineers gain real insights into how their systems behave under stress. Now imagine deploying a new feature. Wouldn't it be great to know how it would react to database failures, network latency, or sudden spikes in traffic? Chaos engineering helps uncover these weaknesses and spot weak points early.
With proactive resilience, we can identify system bottlenecks
before they impact users.
We can improve incident response time with well tested failure scenarios.
We can also ensure that our infrastructure self heals and recovers efficiently.
This actually leads to a more self-sustaining platform that can handle real-world uncertainty. And whenever we are using IDPs or platform engineering, we are already thinking about this. It is just one more step down the stack, one more addition to what you're already thinking of.
Now, there are some fantastic tools out there that can make
chaos engineering accessible.
LitmusChaos, for example, is an open source framework designed for cloud native chaos engineering. Backstage, on the other hand, helps organize developer workflows really well.
Combined, these tools make it easier than ever to adopt a chaos first mindset.
The future is fully automated, AI driven chaos experiments that integrate
seamlessly into the development lifecycle and so on and so forth.
But we're heading towards a world where chaos testing isn't really an afterthought; rather, it's a built-in part of every platform. These are some of the tools we have narrowed down to, but it's not really an exhaustive list, and you can pick any tools which adhere to this goal. For this presentation, I'm going to show LitmusChaos and Backstage together. Now, before we jump into how you can use these tools, I want to talk about the vision, what we plan to do.
What is the ultimate goal of it?
So to truly integrate chaos into platform engineering, we envision a structured journey built around four key pillars.
Define and execute chaos experiments.
The foundational step is to define chaos experiments that fit
your specific application needs.
This means identifying appropriate failure scenarios based on your infrastructure, whether that's cloud native, legacy, Linux based, Windows, or even mainframe applications.
Initially, these experiments may be executed manually to gain insights
into potential failure points, but later they can be automated.
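To make that first pillar a bit more concrete, here's a rough sketch of what a single pod-delete fault definition can look like with LitmusChaos. The field names follow the ChaosEngine CRD as I remember it, and the namespace, labels, and service account are placeholders, so treat this as illustrative rather than exact.

```yaml
# Illustrative only: a minimal LitmusChaos pod-delete fault definition.
# Namespace, labels, and service account are placeholders for your setup.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: demo-app-pod-delete
  namespace: demo
spec:
  engineState: active
  chaosServiceAccount: litmus-admin        # service account with chaos permissions
  appinfo:
    appns: demo                            # namespace of the target application
    applabel: app=demo-app                 # label selector for the target pods
    appkind: deployment
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # how long the fault runs, in seconds
              value: "30"
            - name: CHAOS_INTERVAL         # gap between successive pod deletions
              value: "10"
```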
Chaos as a service.
The next step is to make chaos engineering self serviceable.
By enabling chaos as a service, teams can easily apply and disable
chaos experiments on demand.
This reduces friction and makes it easy for application teams to integrate chaos into their workflows. And this helps greatly when people are using platform tools.
Automate chaos in CI CD pipeline.
Once chaos experiments are well defined and serviceable, the
focus shifts to automation.
The goal is to integrate chaos engineering into CI/CD pipelines as well, which allows failure scenarios to be tested with every release that we do. With the push of a button, teams can trigger chaos experiments and validate system resilience before a production deployment.
Lastly, we enable observability and automated chaos evaluation. We already know observability is key when we want to track certain metrics to understand the impact of chaos. And before we jump in, a quick mention of Namtu, a maintainer who helped with the Litmus plugin for Backstage, which was just introduced and is working great for us. Thanks.
Now it's demo time.
So let me explain the three different aspects of this demo. We have the app configuration YAML, we have the entities YAML, and then we have a small GIF which shows you how it works.
The app configuration YAML is where you configure the target, which would be your Litmus URL, wherever Litmus is deployed for you. It also requires a Litmus authentication token, which you have to export locally or in your cloud provider, so that the plugin can authenticate itself with Litmus and help you with the chaos engineering flows. I'll talk about how you can get this auth token later.
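Roughly, that piece of app-config.yaml looks something like the sketch below. I'm going from memory of the plugin setup, so treat the exact key names as assumptions and check the plugin README; the shape is a Litmus endpoint plus a token pulled from the environment.

```yaml
# app-config.yaml (illustrative sketch; exact keys per the plugin README)
litmus:
  # URL where your Litmus ChaosCenter instance is reachable
  baseUrl: http://<your-litmus-host>:<port>
  # token exported as LITMUS_AUTH_TOKEN in the environment running Backstage
  apiToken: ${LITMUS_AUTH_TOKEN}
```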
And once you've done that, you go to the entities YAML in Backstage, where you have to paste in an annotation with the project ID from LitmusChaos. So you can copy the project ID of the project you're currently in, in Litmus, and paste it in, so that Backstage understands which project you're using to run your chaos experiments.
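In the entities file, that looks roughly like this. The annotation key below is my assumption of the convention the plugin uses, so copy the exact key from the plugin documentation; the value is whatever project ID you copied from the Litmus portal.

```yaml
# catalog-info.yaml / entities.yaml (illustrative sketch)
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: litmus-demo
  annotations:
    # assumed annotation key; paste the project ID copied from Litmus
    litmuschaos.io/project-id: <your-litmus-project-id>
spec:
  type: service
  lifecycle: experimental
  owner: platform-team
```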
So I'll show all of this in the demo next. These are the two configuration YAMLs that we need to modify, and you see a small GIF which shows how things look in Backstage and how you can go to the actual workflows and check them out.
All right.
So this is our Backstage plugin. The URL is just the litmuschaos backstage-plugin repo. You can either clone this repo, or if you already have your own repository, you can add this package in your app folder, in your application, and then just modify the entities and the app-config file.
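In a standard Backstage app, that step is roughly a single package install into the frontend workspace. The package name below is a placeholder; replace it with the exact name given in the backstage-plugin repo's README.

```sh
# From the root of your Backstage app: add the Litmus plugin package to the
# frontend workspace (use the exact package name from the plugin README).
yarn --cwd packages/app add <litmus-backstage-plugin-package>
```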
So with that, I think that should be good enough. But let me also go back to VS Code and show you the same. In your packages/app package.json, you would see this Backstage Litmus plugin, so you can install and use this package, and with that you would be able to modify your entities YAML, and you would also be able to modify your app-config file. Once you do that, we will quickly see how we can create this Litmus auth token, then use it, export it, and put in the external IP, or the IP of wherever Litmus is hosted. And then in the entities, you can just put in your project ID from your project as well.
So we have litmus deployed already.
So this is our litmus instance.
So we're in the LitmusChaos portal right now.
So once you're in the portal, you would see the settings for admin, or whichever user you're logged in as. Go to the account settings. Once you do that, you will see there's an API token section. From here, you can create a new token. When you create a token, you can choose the TTL, or you can also give it no expiration. And once you confirm, you will see the token created along with the specific value of the token.
So feel free to copy that value, and whenever you are exporting, using, or hosting the Backstage app, wherever you want to use it, just put this Litmus auth token as the exact key and export it with the value that you just copied. Once you do that, you should be able to authorize your Litmus instance with Backstage.
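Concretely, that step is just exporting the copied value as an environment variable before starting Backstage. The variable name here is the one mentioned in the talk, so double-check it against the plugin docs.

```sh
# Export the token copied from Litmus account settings, then start Backstage.
export LITMUS_AUTH_TOKEN=<token value copied from the Litmus portal>

# Standard Backstage dev start from the repo root.
yarn dev
```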
We also have Backstage running on a local port, localhost 3000. And since I have installed the package, it's also showing me the Backstage Litmus demo, which is configured. Now, if you are new and you want to get started with it, you can follow our follow-along tutorial as well from docs.litmuschaos.io. It explains in detail how you can set up podtato-head, this demo application, and answers any other queries you might have about Litmus.
So you can feel free to go to the docs and check it out.
But for now, we'll just focus on this backstage plugin.
So we already have this project.
Let me go into this component.
This is a basic configuration right now, because we haven't done much further setup.
So we'll see the owner, the system, the type, the tags and everything.
You can see how many ChaosHubs there are, how many infrastructures there are, and how many experiments you have run so far. But let's go to the main one, which is the Litmus tab here. The other tabs you can put together and build your own IDP solution out of, but for now, what we want to focus on is the Litmus tab. So let's click on that.
You'll see there's a couple of options.
There's experiment docs.
There's the API docs.
We have the hub.
You have the community.
in the hub itself, you would see there's a litmus chaos sub listed down, which
has different experiments, different faults and the environment as well.
So if I click on this, it'll take me straight to the hub.
Which is the repository of all the different faults available for us to
experiment and play around with or combine into different scenarios with.
So this is the list of the different templates, and these are the different faults that you can use to create your own hypothesis. If I go to the generic one, let's say Kubernetes, you'll see there are different types of faults, like power off, node faults, Docker faults, pod memory hog, or other pod specific faults, so you can combine them and create anything you want. So this is an instance of Litmus which is pulled into Backstage, and you can manage and control everything from Backstage itself. For environments, we have one called backstage, which is again the chaos infrastructure that is enabled for us. If I have to just show that to you, I can do a kubectl get pods in backstage, and backstage is the namespace where I have deployed this chaos infrastructure.
So you would see that we have our execution plane components running here.
So this will be able to help you execute the chaos experiments and also detect whether your application is present in this cluster or not. And since it is deployed with cluster-wide access, any demo application present within this cluster would be discoverable.
So if I just do kubectl get namespace, you'll see I have a namespace called demo, and in the demo namespace itself you will see that podtato-head, our demo application, is residing. If I do this, you will see that there's a podtato hat, podtato head, podtato main, the right and left arms, and all that. So this is a demo application, and it is also in the same cluster scope.
So I'll be able to target this demo application from my backstage namespace as well.
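For reference, the commands used in that part of the demo are just standard kubectl; the namespace names here are the ones from the demo setup.

```sh
# Execution-plane components of the chaos infrastructure (namespace: backstage)
kubectl get pods -n backstage

# List namespaces; the demo app lives in the "demo" namespace
kubectl get namespaces

# podtato-head components: main, hat, arms, legs, and so on
kubectl get pods -n demo
```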
So since everything is connected, I'll just close them both and go back.
And I had already run one experiment of deleting a hat and you can see the
experiments are also showing up here.
So from here itself, you can see the status.
You can read on this again, or you can visually like manually go to the.
So if I click on this, I should also be taken to the same execution details.
And you would see that I have the podtato pod delete experiment I ran. And I had a probe which is just doing a health check, a health check against the specific FQDN to see if it's healthy or not. And it did pass successfully, because I was only deleting the hat.
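A probe like that is defined alongside the fault. Here's a rough sketch of what an HTTP health-check probe can look like in Litmus, with field names written from memory and the URL and timings as placeholders, so verify the exact schema against the Litmus probe docs.

```yaml
# Illustrative sketch of a Litmus HTTP probe used as a health check
# while the pod-delete fault runs. URL and timings are placeholders.
probe:
  - name: check-podtato-health
    type: httpProbe
    mode: Continuous                  # keep checking while chaos is injected
    httpProbe/inputs:
      url: http://<podtato-head-fqdn>:<port>
      method:
        get:
          criteria: ==                # expect the service to keep returning 200
          responseCode: "200"
    runProperties:
      probeTimeout: 5s
      interval: 2s
      attempt: 1
```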
But if I rerun this for another instance and come back to Backstage and reload, you'll see that another instance of this has already started. So this helps greatly when you're managing everything from a single portal, and for us Backstage would be that portal. Whether you're modifying things in Litmus, enabling GitOps, managing multiple faults, or adding your own personal private hubs, everything can be managed right from within this one single destination.
So this will be a single source from where you can manage your own chaos experimentation flow. That was what I wanted to showcase and show to you. And since this is anyway just a simple pod delete, it will function fine.
If you had Grafana or any other observability tool added with Prometheus, you would also be able to visualize the same.
So with that, I would just like to finish showcasing this, and I just wanted to put an emphasis on this Backstage Litmus plugin. Everything is also available as part of the documentation, so feel free to explore that. But this just shows what you can do when you have a platform like Backstage, or when you have an open source tool like this which can help integrate into your IDP setup.
So let's jump back to the future slides. What is the future for us now, as chaos engineering continues to evolve? We must establish best practices and industry standards. So here's what we envision.
A maturity model for chaos engineering.
This is a structured framework to help organizations assess their current level of chaos engineering adoption. This model would also ensure a clear path from beginner to advanced chaos engineering practices, and it helps orgs get an overview of where they are and where they have to go. We also want chaos budgeting and guardrails. This would be another framework, which defines the acceptable levels of disruption or downtime for different components. It helps teams allocate resources effectively while also ensuring experiments are conducted safely. And lastly, automation and observability improvements.
We want to enhance the automated evaluation of chaos experiments and improve observability frameworks to integrate chaos engineering insights into teams' own monitoring systems. These steps will drive the future of chaos engineering adoption in platform engineering as well, making it an essential practice for platform engineers.
With that, I'd like to finish my presentation.
Thanks for attending my talk today.
Here are a few links you can use to connect with me or explore the project.
And thank you once more for listening to me.