Transcript
It's absolutely fantastic to be here at Conf42 2025.
Thanks for taking the time to join the session.
Today we're diving into a topic that challenges traditional thinking about reliability: the chaos first mindset, instead of sitting back and not being proactive about it. So buckle up, because it's going to be a fun and exciting ride into proactive resilience.
So a little bit about me.
My name is Shahid.
I am a senior software engineer at Harness and also the maintainer
and community manager at Litmus.
From time to time I also work as an LFX mentor for the Linux Foundation programs for LitmusChaos.
So that's about me.
So what's the game plan for today?
We're going to start with a discussion on internal developer platforms, or IDPs: why they're gaining so much traction and why they matter. Then we'll move into the biggest challenge in cloud native development, reliability.
The very thing we built our systems for, yet struggle with so much.
From there, we'll introduce chaos engineering, talk about the chaos first principle, and explore how it plays a crucial role in platform engineering. We'll also see some of the tools, and a hands-on demo as well, in which we intentionally break things on purpose. And finally, we'll discuss the future, where this space is heading, and how you can integrate chaos into your own platform journey. So let's go.
Before we jump into that, let's take a step back and talk about what's really happening in cloud native development.
Now, today, it's fast, it's dynamic, and it's constantly evolving.
We've got CICD pipelines, DevOps, SecOps, Configuration Management, Observability,
Analytics, and the list keeps growing.
And yet, despite all these advancements, one problem still remains.
That failure is inevitable.
The more we distribute our systems, the more dependencies we introduce, and the harder it becomes to predict what will break and when. There are a couple of things, which I'll talk about later as well, like cascading failures or failures due to complicated architectures. Now imagine an authentication service going down. Suddenly the payment systems, user dashboards, and even notifications stop working, right? Because they're all dependent on it. This is called a cascading failure, and it is a huge problem in cloud native systems. In fact, 80 percent of cloud native applications experience downtime due to such issues. And also, instead of one big application, we now have hundreds of tiny services running across multiple clusters, which talk to each other over networks, depend on APIs, and rely on different data stores. And if something fails, figuring out what went wrong and where can be incredibly difficult.
There's unpredictability, and unpredictability is the default today, unfortunately. Containers can crash, nodes can go offline, there could be network issues, and everything is always changing.
So how do we build resilience in such an unpredictable world?
The reality is failure is no longer an exception.
It is a given.
And the question isn't how do we prevent failures, but rather, how do we prepare for them?
So that's where we start shifting our mindsets.
Now let's talk about internal developer platforms or IDPs.
These platforms are designed to streamline developer experience,
enabling teams to deploy and manage applications with minimal friction.
Think of them as self service portals where developers can request
infrastructure, manage deployments, and ensure governance without needing
to interact with multiple teams.
But here's the challenge.
IDPs work great when everything is running smoothly.
The moment something breaks, things can spiral out of control.
That's why integrating chaos testing into IDPs is critical.
It ensures that failures do not just get handled after they happen,
but are proactively addressed.
Let's zoom in on the real issue, which is cloud native complexity,
which I talked about before.
We have moved from monolith to microservices, and while that's great
for flexibility, it also introduces a massive reliability challenge.
Your code doesn't exist in isolation anymore.
It's running on a web of interconnected services, APIs, databases, third
party dependencies, and so on.
The problem is, when one thing fails, it can trigger a domino effect.
And in the cloud native world, these failures are happening all the time.
So instead of just hoping for the best, we need a new approach, one that assumes failure from the start and prepares for it proactively.
Let's compare old school DevOps with today's cloud native reliability.
Back then, we built a single application, deployed it maybe once a quarter, and had time to test things out.
Fast forward to today, we are deploying 10 times as many microservices at lightning speed across hundreds of environments.
And with all this complexity, we have to ensure reliability.
So how do we even do that?
Traditional monitoring and incident response aren't
really enough for us anymore.
We need to inject chaos deliberately, test failure scenarios in real time, and make resilience an active part of our development process. Now, outages are expensive. They lead to financial loss, reputational damage, frustrated users, and whatnot. Some of the biggest tech giants have experienced massive failures due to these different types of issues. And sometimes the problem isn't even the application itself. It could be bad code, unhandled edge cases, or even unexpected system load, or there could be a series of cascading failures that can take down systems.
For example, a similar event happened with Slack, where it impacted
thousands of businesses worldwide.
And whenever incidents happen, services often log too much data or retry too aggressively, which sort of overloads the system even more. This results in users being unable to access the services, leading to a drop in trust and a lot of frustration.
Even with cloud providers and Kubernetes, infrastructure is
never 100 percent reliable.
You could have device failures like hard drives crashing, power supplies failing, or memory leaks building up. And a financial company recently lost over 55 million because of a failure in one part of their infrastructure that prevented transactions from processing.
Sometimes it's not code or hardware, but how we handle these incidents that makes things much worse. If auto scaling is not configured properly, let's say, then a crashing service might not even be detected. If teams do not have the right alerts, they sometimes don't even know something has gone wrong until users start complaining. Not having an active channel of alerts, or a good way of handling incidents, is also something that can trigger these problems.
So what exactly is chaos engineering?
At its core, it's about running controlled experiments to simulate real world failures. And you already know this, because you're at Conf42 Chaos Engineering.
So we're not going to spend too much time on what is chaos engineering.
Rather, we'll talk about the principles later.
So instead of waiting for a real outage to happen, we just introduce a failure ourselves, measure the impact, and then learn how to recover quickly, right? That is our goal: to build confidence in our systems and have them withstand unexpected disruptions.
Now, here's the big idea: the chaos first principle. Instead of treating failures as rare anomalies, we assume they're inevitable.
And instead of reacting to failures, we prepare for them up front.
By injecting chaos early and often, we make resilience a core part of
our platform engineering process.
And this means fewer surprises, faster recovery, and a much more reliable system overall. Platform engineers are the backbone of reliability in modern systems.
But let's be honest, no matter how well you design a platform,
unexpected failures will still happen.
That's why chaos engineering is a must-have in platform engineering, as it allows us to test system behavior under failure conditions, uncover hidden weaknesses, and continuously improve our reliability posture through all these different components that you see right now.
Chaos engineering is not just a testing practice, it's a mindset shift.
And by running chaos experiments, platform engineers gain real insights into how their systems behave under stress. Now imagine deploying a new feature. Wouldn't it be great to know how it would react to database failures, network latency, or sudden spikes in traffic? Chaos engineering helps uncover these weaknesses and spot weak points early.
With proactive resilience, we can identify system bottlenecks
before they impact users.
We can improve incident response time with well tested failure scenarios.
We can also ensure that our infrastructure self heals and recovers efficiently.
This actually leads to a more self-sustaining platform that can handle real-world uncertainty. And whenever we are using IDPs or platform engineering, we are already thinking about this. It is just one more step down the stack, one more addition to what you're already thinking of.
Now, there are some fantastic tools out there that can make
chaos engineering accessible.
LitmusChaos, for example, is an open source framework designed for cloud native chaos engineering. Backstage, on the other hand, helps organize developer workflows really well.
Combined, these tools make it easier than ever to adopt a chaos first mindset.
The future is fully automated, AI driven chaos experiments that integrate
seamlessly into the development lifecycle and so on and so forth.
But we're heading towards a world where chaos testing isn't really an afterthought; rather, it's a built-in part of every platform. These are some of the tools we have narrowed down to, but it's not really an exhaustive list, and you can pick any tools which adhere to this goal. For this presentation, I'm going to show LitmusChaos and Backstage together. Now, before we jump into how you can use these tools, I want to talk about the vision, what we plan to do.
What is the ultimate goal of it?
So to truly integrate chaos into platform engineering, we envision a structured journey built around four key pillars.
Define and execute chaos experiments.
The foundational step is to define chaos experiments that fit
your specific application needs.
This means identifying appropriate failure scenarios based on your infrastructure, whether that's cloud native, legacy, Linux based, Windows, or even mainframe applications.
Initially, these experiments may be executed manually to gain insights
into potential failure points, but later they can be automated.
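To make that first pillar a bit more concrete, here's a rough sketch of what a single pod-delete fault definition can look like with LitmusChaos. The field names follow the ChaosEngine CRD as I remember it, and the namespace, labels, and service account are placeholders, so treat this as illustrative rather than exact.

```yaml
# Illustrative only: a minimal LitmusChaos pod-delete fault definition.
# Namespace, labels, and service account are placeholders for your setup.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: demo-app-pod-delete
  namespace: demo
spec:
  engineState: active
  chaosServiceAccount: litmus-admin        # service account with chaos permissions
  appinfo:
    appns: demo                            # namespace of the target application
    applabel: app=demo-app                 # label selector for the target pods
    appkind: deployment
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # how long the fault runs, in seconds
              value: "30"
            - name: CHAOS_INTERVAL         # gap between successive pod deletions
              value: "10"
```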
Chaos as a service.
The next step is to make chaos engineering self serviceable.
By enabling chaos as a service, teams can easily apply and disable
chaos experiments on demand.
This reduces friction and makes it easy for application teams to integrate chaos into their workflows. And this helps greatly when people are using platform tools.
Automate chaos in CI CD pipeline.
Once chaos experiments are well defined and serviceable, the
focus shifts to automation.
The goal is to integrate chaos engineering into CI/CD pipelines as well, which allows failure scenarios to be tested with every release that we do. With the push of a button, teams can trigger chaos experiments and validate system resilience before a production deployment.
Lastly, we enable observability and automated chaos evaluation. We already know observability is key when we want to track certain metrics to understand the impact of chaos. And before we jump in, a quick mention of Namtu, a maintainer who helped with the Litmus plugin for Backstage, which was just introduced and is working great for us. Thanks.
Now it's demo time.
So let me explain the three different aspects of this demo. We have the app configuration YAML, we have the entities YAML, and then we have a small GIF which shows you how it works.
The app configuration YAML is where you configure the target, which would be your Litmus URL, wherever Litmus is deployed for you. It also requires a Litmus authentication token, which you have to export locally or in your cloud provider, so that the plugin can authenticate itself with Litmus and help you with the chaos engineering flows. I'll talk about how you can get this auth token later.
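Roughly, that piece of app-config.yaml looks something like the sketch below. I'm going from memory of the plugin setup, so treat the exact key names as assumptions and check the plugin README; the shape is a Litmus endpoint plus a token pulled from the environment.

```yaml
# app-config.yaml (illustrative sketch; exact keys per the plugin README)
litmus:
  # URL where your Litmus ChaosCenter instance is reachable
  baseUrl: http://<your-litmus-host>:<port>
  # token exported as LITMUS_AUTH_TOKEN in the environment running Backstage
  apiToken: ${LITMUS_AUTH_TOKEN}
```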
And once you've done that, you go to the entities YAML in Backstage, where you have to paste in an annotation with the project ID from LitmusChaos. So you can copy the project ID of the project you're currently in, in Litmus, and paste it in, so that Backstage understands which project you're using to run your chaos experiments.
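In the entities file, that looks roughly like this. The annotation key below is my assumption of the convention the plugin uses, so copy the exact key from the plugin documentation; the value is whatever project ID you copied from the Litmus portal.

```yaml
# catalog-info.yaml / entities.yaml (illustrative sketch)
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: litmus-demo
  annotations:
    # assumed annotation key; paste the project ID copied from Litmus
    litmuschaos.io/project-id: <your-litmus-project-id>
spec:
  type: service
  lifecycle: experimental
  owner: platform-team
```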
So I'll show all of this in the demo next. These are the two configuration YAMLs that we need to modify, and you see a small GIF which shows how things look in Backstage and how you can go to the actual workflows and check them out.
All right.
So this is our Backstage plugin. The URL is just the litmuschaos backstage-plugin repo. You can either clone this repo, or if you already have your own repository, you can add this package in your app folder, in your application, and then just modify the entities and the app-config file.
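In a standard Backstage app, that step is roughly a single package install into the frontend workspace. The package name below is a placeholder; replace it with the exact name given in the backstage-plugin repo's README.

```sh
# From the root of your Backstage app: add the Litmus plugin package to the
# frontend workspace (use the exact package name from the plugin README).
yarn --cwd packages/app add <litmus-backstage-plugin-package>
```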
So with that, I think that should be good enough. But let me also go back to VS Code and show you the same. In your packages/app package.json, you would see this Backstage Litmus plugin, so you can install and use this package, and with that you would be able to modify your entities YAML, and you would also be able to modify your app-config file. Once you do that, we will quickly see how we can create this Litmus auth token, then use it, export it, and put in the external IP, or the IP of wherever Litmus is hosted. And then in the entities, you can just put in your project ID from your project as well.
So we have litmus deployed already.
So this is our litmus instance.
So we're in the LitmusChaos portal right now.
So once you're in the portal, you would see the settings for admin, or whichever user you're logged in as. Go to the account settings. Once you do that, you will see there's an API token section. From here, you can create a new token. When you create a token, you can choose the TTL, or you can also give it no expiration. And once you confirm, you will see the token created along with the specific value of the token.
So feel free to copy that value, and whenever you are exporting, using, or hosting the Backstage app, wherever you want to use it, just put this Litmus auth token as the exact key and export it with the value that you just copied. Once you do that, you should be able to authorize your Litmus instance with Backstage.
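Concretely, that step is just exporting the copied value as an environment variable before starting Backstage. The variable name here is the one mentioned in the talk, so double-check it against the plugin docs.

```sh
# Export the token copied from Litmus account settings, then start Backstage.
export LITMUS_AUTH_TOKEN=<token value copied from the Litmus portal>

# Standard Backstage dev start from the repo root.
yarn dev
```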
We also have Backstage running on a local port, localhost 3000. And since I have installed the package, it's also showing me the Backstage Litmus demo, which is configured. Now, if you are new and you want to get started with it, you can follow our follow-along tutorial as well from docs.litmuschaos.io. It explains in detail how you can set up podtato-head, this demo application, and answers any other queries you might have about Litmus.
So you can feel free to go to the docs and check it out.
But for now, we'll just focus on this backstage plugin.
So we already have this project.
Let me go into this component.
This is a basic configuration right now, because we haven't done much further setup.
So we'll see the owner, the system, the type, the tags and everything.
You can see how many ChaosHubs there are, how many infrastructures there are, and how many experiments you have run so far. But let's go to the main one, which is the Litmus tab here. The other tabs you can put together and build your own IDP solution out of, but for now, what we want to focus on is the Litmus tab. So let's click on that.
You'll see there's a couple of options.
There's experiment docs.
There's the API docs.
We have the hub.
You have the community.
in the hub itself, you would see there's a litmus chaos sub listed down, which
has different experiments, different faults and the environment as well.
So if I click on this, it'll take me straight to the hub.
Which is the repository of all the different faults available for us to
experiment and play around with or combine into different scenarios with.
So this is the list of the different templates, and these are the different faults that you can use to create your own hypothesis. If I go to the generic one, let's say Kubernetes, you'll see there are different types of faults, like power off, node faults, Docker faults, pod memory hog, or other pod specific faults, so you can combine them and create anything you want. So this is an instance of Litmus which is pulled into Backstage, and you can manage and control everything from Backstage itself. For environments, we have one called backstage, which is again the chaos infrastructure that is enabled for us. If I have to just show that to you, I can do a kubectl get pods in backstage, and backstage is the namespace where I have deployed this chaos infrastructure.
So you would see that we have our execution plane components running here.
So this will be able to help you execute the chaos experiments and also detect whether your application is present in this cluster or not. And since it is deployed with cluster-wide access, any demo application present within this cluster would be discoverable.
So if I just do kubectl get namespace, you'll see I have a namespace called demo, and in the demo namespace itself you will see that podtato-head, our demo application, is residing. If I do this, you will see that there's a podtato hat, podtato head, podtato main, the right and left arms, and all that. So this is a demo application, and it is also in the same cluster scope.
So I'll be able to target this demo application from my backstage namespace as well.
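For reference, the commands used in that part of the demo are just standard kubectl; the namespace names here are the ones from the demo setup.

```sh
# Execution-plane components of the chaos infrastructure (namespace: backstage)
kubectl get pods -n backstage

# List namespaces; the demo app lives in the "demo" namespace
kubectl get namespaces

# podtato-head components: main, hat, arms, legs, and so on
kubectl get pods -n demo
```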
So since everything is connected, I'll just close them both and go back.
And I had already run one experiment of deleting a hat and you can see the
experiments are also showing up here.
So from here itself, you can see the status.
You can read on this again, or you can visually like manually go to the.
So if I click on this, I should also be taken to the same execution details.
And you would see that I have the podtato pod delete experiment I ran. And I had a probe which is just doing a health check, a health check against the specific FQDN to see if it's healthy or not. And it did pass successfully, because I was only deleting the hat.
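A probe like that is defined alongside the fault. Here's a rough sketch of what an HTTP health-check probe can look like in Litmus, with field names written from memory and the URL and timings as placeholders, so verify the exact schema against the Litmus probe docs.

```yaml
# Illustrative sketch of a Litmus HTTP probe used as a health check
# while the pod-delete fault runs. URL and timings are placeholders.
probe:
  - name: check-podtato-health
    type: httpProbe
    mode: Continuous                  # keep checking while chaos is injected
    httpProbe/inputs:
      url: http://<podtato-head-fqdn>:<port>
      method:
        get:
          criteria: ==                # expect the service to keep returning 200
          responseCode: "200"
    runProperties:
      probeTimeout: 5s
      interval: 2s
      attempt: 1
```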
But if I rerun this for another instance and come back to Backstage and reload, you'll see that another instance of this has already started. So this helps greatly when you're managing everything from a single portal, and for us Backstage would be that portal. Whether you're modifying things in Litmus, enabling GitOps, managing multiple faults, or adding your own personal private hubs, everything can be managed right from within this one single destination.
So this will be a single source from where you can manage your own chaos experimentation flow. That was what I wanted to showcase and show to you. And since this is anyway just a simple pod delete, it will function fine.
If you had Grafana or any other observability tool added with Prometheus, you would also be able to visualize the same.
So with that, I would just like to finish showcasing this, and I just wanted to put an emphasis on this Backstage Litmus plugin. Everything is also available as part of the documentation, so feel free to explore that. But this just shows what you can do when you have a platform like Backstage, or when you have an open source tool like this which can help integrate into your IDP setup.
So let's jump back to the future slides. What is the future for us now, as chaos engineering continues to evolve? We must establish best practices and industry standards. So here's what we envision.
A maturity model for chaos engineering.
This is a structured framework to help organizations assess their current level of chaos engineering adoption. This model would also ensure a clear path from beginner to advanced chaos engineering practices, and it helps orgs get an overview of where they are and where they have to go. We also want chaos budgeting and guardrails. This would be another framework, which defines the acceptable levels of disruption or downtime for different components. It helps teams allocate resources effectively while also ensuring experiments are conducted safely. And lastly, automation and observability improvements.
We want to enhance the automated evaluation of chaos experiments and improve observability frameworks to integrate chaos engineering insights into teams' own monitoring systems. These steps will drive the future of chaos engineering adoption in platform engineering as well, making it an essential practice for platform engineers.
With that, I'd like to finish my presentation.
Thanks for attending my talk today.
Here are a few links you can use to connect with me or explore the project.
And thank you once more for listening to me.