Revolutionizing Software Development: Integrating Chaos Engineering and Feature Flags for Enhanced Reliability and Agile Response

Abstract

Explore how the integration of chaos engineering and feature flags is transforming software development. Learn to proactively manage incidents, enhance system resilience, and enable agile feature deployment, while aligning closely with user needs and adapting to evolving industry trends.

Summary

Matt Schillerstrom: Integrate chaos engineering and feature flags for enhanced reliability and agile response. Imagine a development cycle in which your team confidently pushes code to production and customers respond with feedback. In practice, there are many issues today with releasing code to production, and even before that with testing.
The feature release that was supposed to make your business more productive and your customer happy now becomes an incident. Support teams, ops teams, multiple development teams, security, and database teams all get involved to help resolve that incident. You stop thinking about all of this because you get used to it.
In 2023, we saw cloud infrastructure spending increase by 23%, which highlights the need for effective cost management strategies within DevOps and site reliability engineering practices. Speed and velocity also impact the bottom line for businesses.
CI/CD gets you so far, but it ends at the production deployment. One bad feature equals a full rollback of a system. Coupling deploy and release adds stress and toil for the developer. The result is reduced velocity, increased risk, and ultimately poor developer experience.
But now let's talk about reliability and resilience with the business impact. What I'm seeing in the industry is a lack of testing, and that's where integrating chaos engineering and feature flags helps you.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hey everyone, this is Matt Schillerstrom. Today I'm going to be talking to you about revolutionizing software development by integrating chaos engineering and feature flags for enhanced reliability and agile response. I'm currently a product manager for feature flags at Harness, the modern software delivery platform. Prior to this, I was a product manager at Gremlin for chaos engineering. I also worked at Target Corporation in Minneapolis, Minnesota, building out their chaos engineering program, and at a nuclear power plant, ensuring that we were safe and reliable.
The opportunity I'd like to talk about is with software delivery. Imagine if your development cycle looked like this. Your development team confidently pushes code to production to solve business outcomes and customer needs. Your customers use the software. The customers give feedback on the product they love, and the business responds with new ideas to test. The development team solves problems, tests new feature flags, and gets feedback. Everyone is happy, and the business continues to grow and respond to changes in customer needs and business outcomes.
Now, I probably wouldn't be talking to you today if this is what your software delivery lifecycle looked like: all green and happy path, right? There are many issues that exist today just with releasing code to production, and even before that with testing. But I want you to know there are open source solutions that exist today, provided by the Linux Foundation through the CNCF, the Cloud Native Computing Foundation, such as LitmusChaos, which was donated by Harness, and OpenFeature, which is an open source feature flagging tool as well.
Now, taking a step back, my experience has always been around understanding how systems work. Andy Stanley, who's a pastor and a podcaster, says it best: if you don't know why it's working when it's working, you won't know how to fix it when it breaks.
And if you think about 2:00 in the morning when an incident happens, or when you're releasing software, you have to think about how to fix it, right? You don't always know how to respond. That's where practicing chaos engineering and using feature flags helps you learn and understand the behavior of your system, so that you can be confident when you release software and know how to respond proactively when something breaks.
A recent exercise I did with my team was around this opportunity. I said, what if an incident happened? Let's just talk through it. Let's not even run a chaos experiment. Let's just get a Miro board or a piece of paper and write it out. So in this diagram, take a 30,000-foot view. You have that green happy path, and you have a red column with a variety of different types of incidents, whether it's a network outage, a disk failure, or something else, you name it. If you look at this, all the yellow boxes are things that a development team has to respond to or do before they get back onto those green squares of the software development happy path. So if an outage happens, your pager goes off or your customer notifies you, and then you have to look at a dashboard. Then you're responding, fixing, testing, and troubleshooting; you're doing all these things when in reality you could have proactively tested this in the past.
Leaning into that a little bit more, I asked my team: what are some of the things that happen when that incident occurs? Often we related this to what happens when you release code to production. Support teams get involved, ops teams, multiple development teams, security, database teams. They're all trying to help resolve that incident.
So basically, that feature release that was supposed to make your business more productive and your customer happy now becomes an incident rather than a simple deploy and release. All these things are occurring, which is interesting, because you don't think about all these things. You get used to it. You get trained, you build muscle memory for how to respond to incidents, and you normalize it.
But let's lean into the business impact of all of this. Let's talk about cloud costs. What impacts the bottom line? Cloud costs. But why do cloud costs increase? Incidents happen, so folks have to provision more servers and more workloads to respond and catch up during the recovery of the system. Speed and velocity also impact the bottom line for businesses, in terms of how fast you're delivering the right solution to your customer. I like to talk about churn, and the training and onboarding of new developers or existing teammates who are just learning a new system; the inefficiency around that affects the bottom line. Then there's risk, obviously: how risk averse your company or your customer is. And then there's being too reliable. You don't necessarily need to be 100% reliable, or even five nines or four nines. You just need to be reliable enough for that customer experience, for the business outcome you're trying to solve.
Some interesting facts: in 2023, we saw cloud infrastructure spending increase by 23%, which highlights the need for effective cost management strategies within DevOps and site reliability engineering practices. We also see an increase in cloud costs in 2024 simply from gen AI services. More and more new companies are starting and more companies are investing in this technology, but they're quick to use it, so they're overprovisioning workloads and servers just to support the business cases that they need.
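For a sense of scale on the "four nines" and "five nines" targets mentioned above, the following is a small illustrative calculation that is not part of the talk. It converts an availability percentage into a yearly downtime budget, assuming a 365-day year. The function name `allowed_downtime_minutes` is made up for this sketch.

```python
# Illustrative only: yearly downtime budget implied by an availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted by the target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


targets = [
    ("three nines", 99.9),
    ("four nines", 99.99),
    ("five nines", 99.999),
]

for label, percent in targets:
    budget = allowed_downtime_minutes(percent)
    print(f"{label} ({percent}%): {budget:.2f} minutes of downtime per year")
```

Under these assumptions, four nines leaves roughly 52.6 minutes of downtime per year and five nines roughly 5.3 minutes, which is why the speaker suggests choosing the target that the customer experience actually needs.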
Another business impact here is velocity, or lack thereof. Continuous delivery isn't good enough. CI/CD gets you so far, but it ends at the production deployment. What I'd like to get at here is that there's risk in large deployments that feature flags can solve. Currently, with continuous delivery, one bad feature equals a full rollback of a system, and there might be 15 features within that deployment. Deploy and release are the same thing. Developers don't have the control in production that they want. They're babysitting that deployment and babysitting the testing, and they're nervous to deploy to prod because it's also releasing to customers. Then production issues affect all the users, and you can't resolve the issue well in prod, so you have to roll back everything. And ultimately, once you get going with CI/CD, you hit diminishing returns: slow, cumbersome deployments, a lack of production governance, and then more tech debt and rework.
So where we see that as an industry with software delivery is reduced velocity, increased risk, and ultimately poor developer experience, which I care deeply about. Again, just to repeat: big deployments and rollbacks equal more rework and fewer features. No control in production means the deployment must be perfect. CI/CD helps us get there, but now you're worried about that release being perfect for your customer. And the poor developer experience comes from that tight coupling: deploy and release being coupled adds stress and toil for the developer.
But now let's talk about reliability and resilience with the business impact. Common Kubernetes failure modes that exist today include system instability, resource exhaustion, resource contention, configuration errors, and scaling issues. These all exist in Kubernetes, which is supposed to be a self-healing system, right?
But the applications you deploy on Kubernetes might not survive these instabilities. That's where resiliency patterns are being used in code and infrastructure today to handle those failures gracefully, either by degrading the experience or by giving a warning message to users. But what I'm seeing in the industry is a lack of testing. When I bring up chaos engineering and proactively testing your system, people kind of laugh at it. They don't see it as a priority. But what I like to point out is that technology and standards change. Child car seats, for example, have evolved over the last 56 years, and what seemed okay then looks humorous now. So looking back five years from now, we might say, of course chaos engineering is required for all this testing. It's silly to think you don't have to do anything proactively.
When I try to box these failure modes in, I always like looking at this type of table, where we talk about known failure modes and unknown failure modes. These are all questions that I like to ask my team, and that chaos engineering can help answer. What are my single points of failure? Where does my system tip over? What happens when a Kubernetes pod restarts? Does my system scale appropriately during peak traffic? Did I configure my dashboard correctly? Does my paging system work? These are all questions that you should have, and if you're not doing chaos engineering, you're not going to know the answers.
That's where integrating chaos engineering and feature flags helps you. I have a software release workflow here, and this is where you can really integrate chaos engineering and feature flags. For step one, think about this: devs write their code, but any changes are put behind a feature flag. The value there is that you can deliver code on time and test features for impact.
Step two, DevOps releases the code to production, and they can test in production, and nothing has changed in their process. The value there is that you can deploy on time every time, with no change to your process, and test failure safely. Step three, product managers can decide who gets the new feature and who doesn't. The value there is that you control release variations, and you don't roll back; you just turn the flag off. Step four, product and development decide what to iterate on. This is faster and more collaborative with feature flags, and the value there is that you can iterate on features faster and have higher feature quality. So it's this continued process of integrating feature flags and chaos engineering into your development process. And again, tools that you can use for this are OpenFeature and LitmusChaos.
So today, I hope you understand that integrating chaos engineering and feature flags enhances your software reliability and response. If you need to contact me, please reach out to me on LinkedIn, and I'm happy to engage in a conversation with you to help make your journey in software reliability safer and ultimately more fun. All right, thanks.
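The four-step workflow described in the talk can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the speaker's implementation and not the OpenFeature or LitmusChaos API: `FlagStore`, the flag name `new-recommendations`, the two recommendation functions, the injected `ConnectionError`, and the 5% error threshold are all hypothetical stand-ins. A real setup would evaluate flags through a flag provider and inject faults with a chaos tool rather than in application code.

```python
import random
from dataclasses import dataclass, field


@dataclass
class FlagStore:
    """Hypothetical in-memory stand-in for a feature flag service."""
    flags: dict = field(default_factory=dict)

    def is_enabled(self, name: str, default: bool = False) -> bool:
        return self.flags.get(name, default)

    def set_flag(self, name: str, enabled: bool) -> None:
        self.flags[name] = enabled


def old_recommendations() -> list:
    """Existing, known-good code path."""
    return ["bestseller-a", "bestseller-b"]


def new_recommendations(fail_rate: float, rng: random.Random) -> list:
    """New code path; fail_rate simulates an injected dependency fault."""
    if rng.random() < fail_rate:
        raise ConnectionError("injected fault: dependency unavailable")
    return ["personalized-a", "personalized-b"]


def get_recommendations(flags: FlagStore, fail_rate: float, rng: random.Random) -> list:
    # Step 1: the new code ships behind a flag, so deploy and release are separate.
    if flags.is_enabled("new-recommendations"):
        return new_recommendations(fail_rate, rng)
    return old_recommendations()


def run_chaos_experiment(flags: FlagStore, fail_rate: float,
                         requests: int = 100, seed: int = 42) -> float:
    """Send simulated requests under an injected fault; return the error rate."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(requests):
        try:
            get_recommendations(flags, fail_rate, rng)
        except ConnectionError:
            errors += 1
    return errors / requests


flags = FlagStore()
flags.set_flag("new-recommendations", True)   # Step 3: release the feature

# Step 2: test failure in production; here every call to the new path fails.
error_rate = run_chaos_experiment(flags, fail_rate=1.0)

if error_rate > 0.05:                         # hypothetical error budget
    flags.set_flag("new-recommendations", False)  # turn the flag off, no rollback

recovered_rate = run_chaos_experiment(flags, fail_rate=1.0)
print(f"error rate with flag on: {error_rate}, after flag off: {recovered_rate}")
```

The point of the sketch is the last few lines: when the experiment shows the new path failing, the response is a flag change rather than a redeploy, and the old path keeps serving requests even though the fault is still being injected.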

Conf42 Chaos Engineering 2024 - Online

February 15 2024

Matthew Schillerstrom

Senior Product Marketing Manager @ Harness / LitmusChaos / OpenFeature
