Conf42 Cloud Native 2025 - Online

- premiere 5PM GMT

Chaos Engineering Community Tales and Future

Abstract

The journey of building a thriving open-source community is filled with challenges, lessons, and moments of triumph. In this talk, the former community manager of LitmusChaos shares an insider’s perspective on the evolution of the Chaos Engineering community. From nurturing the LitmusChaos project as it grew into a CNCF incubating project to scaling a global community of contributors, users, and advocates, this session highlights the key strategies, milestones, and stories that shaped its success. We will explore how chaos engineering principles resonated with the broader cloud-native ecosystem, fostering adoption and collaboration across diverse industries. The talk also delves into the challenges of scaling a community from scratch, balancing technical innovation with community needs, and fostering inclusivity and sustainability in open source.

Looking ahead, we’ll discuss the future of chaos engineering and its community, including emerging trends, potential integrations with other cloud-native technologies, and opportunities for contributors to shape the ecosystem’s direction. Whether you’re a chaos engineering enthusiast, a community leader, or simply curious about building vibrant open-source communities, this talk offers a blend of practical insights and inspiration for navigating the past, present, and future of chaos engineering.

Speaker Bio: Prithvi Raj is a Community Manager & Developer Advocate at Mirantis, where he leads the community efforts for the Open Source Program Office, including the k0rdent project, the k0s project, k0smotron, and the other OSS projects Mirantis contributes to. He previously led the community of LitmusChaos, the CNCF incubating project for cloud-native chaos engineering, and helped scale a community of 3000+ folks from scratch. He is also a CNCF Ambassador and has worked closely with the CNCF community to run LFX Mentorship programs and helped build a chaos engineering community in the CNCF ecosystem.

He has worked on global events, conferences, and meetups such as Chaos Carnival, Kubernetes Community Days Bengaluru & Chennai, CNCF Kubernetes Chaos Engineering Meetups, and more to help grow various communities in the DevOps ecosystem.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey folks, a very warm welcome to my talk, Chaos Engineering Community: Tales and Future. My name is Prithvi Raj. A lot of you who have followed me over the years, or have seen the progress of the chaos engineering community, know me as a community manager for the LitmusChaos project who paved his way through previous companies: MayaData, ChaosNative, Harness. And now finally Mirantis, where I'm also a community manager and a dev advocate, although right now I'm focused on k0s and the newly announced k0rdent project. But yeah, my relationship with the chaos engineering community goes way back to 2020. Alongside that, I've helped organize Chaos Carnival, and I'm currently helping organize Kubernetes Community Days Bangalore and also the CNCF Reliability Engineering meetups that we have done in Bangalore and online.
The agenda for today: obviously, this talk is more from a cultural aspect. We'll talk about chaos engineering in practice, chaos engineering through time, a better solution for chaos engineering as seen today, the shift from chaos engineering to resilience engineering, chaos engineering resources for you all, identifying the competition, and my journey as a community manager for LitmusChaos. That is the main part, where we'll talk about how the LitmusChaos community grew and how chaos engineering came into the frame. Then strategies and aspects of the community, the current state of the chaos engineering community, and the path ahead.
So, chaos engineering. I hope I don't need to introduce that. There's a lot that has already been introduced at Conf42 Chaos Engineering, or even before that at various conferences, and there are so many articles out there. Chaos engineering as a practice is not being used by just one segment of folks; multiple enterprises, post the Netflix days, have started adopting this practice.
And chaos engineering is being used today at so many enterprises and companies, including retail and e-commerce, and for so many financial transactions that require robust testing frameworks and identification of failures. So banks, stock brokers, food delivery applications, gaming, airlines: it's everywhere. I can name so many companies, or there are companies that you might identify as users of popular chaos projects, but chaos engineering has crossed that part of the chasm where it was seen as an innovative tool. Today it's become a tool that has been adopted by a majority of organizations.
Through time, as I spoke about, there were the Netflix days and Chaos Monkey, which was used to just terminate instances. Then it became part of the larger project that was the Simian Army, and we had Pumba for running chaos on Docker containers. The innovation era was more or less post the game day that was run in 2003 by Amazon, which made chaos engineering a practice that could be dreamt of by organizations, or could be adopted as they matured with the practice of terminating instances or running some production-level failures. But the early adoption era, which I was also part of as a community manager, saw growth. There was a side of things with folks from Netflix.
They parted ways and started their own thing, and I think a pioneer of that was Gremlin, who I will talk about later on. But there were so many open source projects that came up during that time as well, like Chaos Mesh, Litmus, Chaos Toolkit, and ChaosBlade, and these projects are still thriving. Then the enterprise side of things started building: there was Verica, who was focused on security chaos engineering, and then the big cloud players as we know them, AWS and Azure, also came up with their AWS FIS and Azure Chaos Studio offerings, which helped simulate failures that were perhaps derived from those open source tools, or written by the developer or SRE persona itself.
And now I would dare say that it's not the open source era anymore. I think the late adoption era is the enterprise era, where enterprise players like Steadybit, Reliably, Harness, and Amazon with the AWS Resilience Hub have, I think, consolidated chaos engineering as a practice that is a must. With so many features and abilities, it has become a practice that is widely adopted: large enterprises are coming up, adopting this practice, and talking about it. At each and every conference you have some talk or chatter about reliability, driven by enterprise end-user stories.
As for a better solution for chaos engineering, I believe that from the initial practice of running some production-level failure, just terminating instances, it has become a larger practice where collaboration is required. Multiple teams are running chaos: there's an SRE team, there's a platform engineering team, there's a team of developers, there's a QA team. There is a real need for collaboration. There has to be a particular team running a particular set of experiments, which is completely different from, say, team B.
And that is why a better solution for chaos engineering, driven by the enterprise era, is collaboration, where there's chaos for teams, and there are features such as feature flags being used to perhaps stop a team from using particular features or chaos experiments. Then there's the availability of experiments. Obviously, in the communities, chaos engineering tests were written as YAML files, or as experiments written in Go or Python. The idea of having them readily available has become more important: have an interactive UI and have those experiments that you can pull up and run. Maybe you have a different dashboard, say a dashboard for LitmusChaos, but you want to run an experiment that is part of the Chaos Toolkit program. So bringing your own chaos, having that chaos experiment readily available, has optimized the initial investment, rather than having developers write the experiment for you.
And then again, the idea of chaos engineering was to run it in production, but running it in CI/CD pipelines, in dev environments, and in staging environments, automating the experiments, and controlling the blast radius have become more important. And then obviously there are metrics: assessing what's going wrong when your system is in steady state, when your system is in production, or when your system is going through some latency chaos engineering tests. You need your metrics. You need to observe what's going wrong. You measure the impact, and then according to that, go about an incident management solution. That is where chaos engineering has gone from just running chaos engineering tests to something more: it has become more than just running some chaotic scenarios.
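The flow described here (check steady state, limit the blast radius, inject a fault, measure the impact) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Probe`, `pick_targets`, and `run_experiment` are hypothetical names, and the `measure`, `inject`, and `rollback` callables stand in for whatever a real tool would provide. It is not the API of Litmus, Chaos Toolkit, or any other project.

```python
import random
from dataclasses import dataclass


@dataclass
class Probe:
    """Hypothetical steady-state check: p95 latency must stay under a threshold."""
    name: str
    threshold_ms: float

    def holds(self, latencies_ms: list) -> bool:
        ordered = sorted(latencies_ms)
        p95 = ordered[int(0.95 * (len(ordered) - 1))]
        return p95 <= self.threshold_ms


def pick_targets(instances: list, blast_radius: float, seed: int = 0) -> list:
    """Control the blast radius by selecting only a fraction of the instances."""
    count = max(1, int(len(instances) * blast_radius))
    return random.Random(seed).sample(instances, count)


def run_experiment(instances, blast_radius, probe, measure, inject, rollback):
    """Verify steady state, inject on a subset, measure again, always roll back."""
    if not probe.holds(measure()):
        # The system is already unhealthy, so do not add chaos on top.
        return {"status": "aborted", "targets": []}
    targets = pick_targets(instances, blast_radius)
    inject(targets)
    try:
        survived = probe.holds(measure())
    finally:
        rollback(targets)
    return {"status": "passed" if survived else "failed", "targets": targets}
```

A "failed" result here is the useful outcome: it means the steady-state hypothesis did not survive the fault, which is the signal that would feed the monitoring and incident management steps.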
It's also about monitoring those scenarios. It's also about managing those incidents, and that has to happen in a very short span of time. I think Benjamin Wilms, a good friend from Steadybit, spoke about how resilience engineering has developed. You identify, you inject a fault, you get your readings, and you monitor them in your dashboards; there are open source solutions like SigNoz, or there's a Datadog or Dynatrace dashboard. And then there's a call for an incident. You need to manage that and mitigate that, and there are so many incident management solutions out there. One of the open source solutions I saw recently is RespondNow, and that helps you continuously build resiliency. So I see a lot of folks coining this term as not just chaos engineering but resilience engineering, which I might agree with, and I believe that is a term that is developing more and more, shifting away from just the old-school chaos engineering practice.
But to get started again: chaos engineering is a term very popularly coined initially by Netflix, and then there was the Principles of Chaos Engineering, developed around 2015, which laid out the process: you need to hypothesize, then develop a set of experiments, run them, control your blast radius, and repeat that process. There are amazing resources, again. Shout out to Pavlos Ratis, who has developed this awesome-chaos-engineering repo on GitHub. Go check it out; if you are getting started with your chaos engineering journey, it will surely be helpful in identifying the resources and getting started with the right tool and the right blogs, articles, or cultural aspects of it.
The first thing, I believe, even before the historical part and everything, is identifying the competition and the right tool for you. There are cloud providers: let's say you are already using AWS or Azure or GCP. There are in-house, built-in chaos experiments.
It's a service in itself, with Azure Chaos Studio or FIS, where you can just pull up your experiments, or you can use an open source solution alongside, which I think is easy for large users of these cloud providers. But then there are smaller teams, and there are teams which want to go the open source way. Maybe they want to run a POC. They want to just get a grasp of it. And that is where, as a community person and an admirer of chaos engineering, I would suggest the open source way of going about it. As for the right tools, there's the newly launched Kraken, or you have the old-school tools like Chaos Monkey, or the tools that belong to the adoption era, which are Litmus, Chaos Mesh, and Chaos Toolkit. And then there's the commercial side of it, when you are ready to pay for it and you believe that you need a standalone commercial solution which is completely focused on chaos engineering and which also goes according to your systems. Maybe you are able to achieve more by suggesting some feature requests. I feel that's a mature way of going about it: running chaos in a very controlled way, maybe having controlled game days and experiments according to your requirements, with chaos induced at particular times of your choosing. And I think for an enterprise solution, an SRE-persona-driven solution, that is more important. In that sense, you get a Steadybit or a Harness Chaos or a Gremlin. And that is what you need to identify before even getting started with the tool.
Honestly, without Gremlin, you cannot talk about chaos engineering. Some people might not like me saying that, but I believe the role of Gremlin has been immense: from starting off, talking more about the practice, building the Chaos Engineering community on Slack, which has about 9,000-plus members, running webinars, running game days.
I think Gremlin has played a crucial role in starting off and growing that mindset for chaos engineering, and growing that overall idea that chaos engineering is essential, including by running conferences like Chaos Conf. I think the growth and role of Gremlin has been vital. The founders were again from a Netflix background, and they played a vital role in becoming an enterprise solution early on for chaos engineering. I believe at that time maybe a lot of folks did not understand whether it was essential or necessary, but there was the development in the number of experiments, the interactive UI, the great community activities, and the great evangelism that folks like Ana or Tammy or Kolton did, and Jason, I think, and Matt, again a friend from Harness. I think the growth and role of Gremlin is essential when you talk about the history and the growth of chaos engineering itself. And that is how I think I got attracted towards this project, although I was pretty unaware when I joined MayaData, who was building Litmus as a side project. But my journey as a community manager for LitmusChaos, I think, grew more and more by learning from folks at Gremlin, seeing how they were growing the project and doing better as a community overall.
So, Litmus. It's a CNCF incubating project, a very popular chaos engineering project. Today when folks talk about chaos, it's Litmus that they speak about, more or less. I hope everyone agrees. It's an open source project. The idea was obviously to run chaos engineering experiments on a Kubernetes environment, and then it kept growing, with cloud, non-cloud, and non-Kubernetes environments, and it's widely adopted. When I checked the Scarf analytics, 500-plus enterprises had run Litmus in some form or the other: a POC, or maybe they have adopted it and are running it at scale.
So Litmus, I feel, as a popular project in itself, went through a journey over the last four or five years and has become a consolidated solution for folks who want to get started with chaos, or want to mature their chaos journey by building their own solution, or just running Litmus according to their needs, according to multiple teams, and with all the features that it provides. As for the growth journey, all the figures are exponential, because chaos engineering has grown exponentially. As you can see with the stars, it had exponential growth over the years. It has since reached a somewhat linear sort of graph, where it's still, I think, growing as time passes. It's at almost 5,000 GitHub stars, which is, I think, a great milestone for a chaos engineering project which has been growing organically over the years.
And I think the Slack community has played a crucial role. It's the first line of communication for folks who have joined. That is where they have got their queries answered. That is where all the discussions have happened: identifying the group of contributors and the group of users, and them contributing back content. And that is what we'll talk about in a bit: how community growth has come about, what the community looks like, and how the overall chaos engineering community has grown. Obviously people were skeptical. People were not really aware of how chaos engineering works, how they should contribute, and whether it is even worth it. So I think the conversations that have happened over the last five years on Slack and in our community meetings have played a huge role in helping chaos engineering as a technology grow. And this is the Slack growth. There has been a shift: it peaked, and then, as I said, enterprise adoption has led to a little bit of a slowdown for the open source side of things, but I believe that chaos engineering has its own challenges.
We'll talk about them, but overall, the Slack itself, I think, is, after the Gremlin Slack, the biggest open source chaos engineering Slack community. There are a lot of conversations that have been driven there, and beyond Slack as well. There have been conversations on Discord, YouTube, and other platforms, Reddit perhaps. The community has seen maturity over the last few years. So we'll talk about some community aspects and building strategies that have helped the chaos engineering community thrive. I won't be explaining them in detail, because this is not a community building talk. But we'll talk about how chaos engineering as a community has grown, the history of it, which we spoke about a little, how things look moving ahead, and what the community has gone through to achieve this shape.
Obviously, community meetings. We held community meetings every third Wednesday of the month. We planned something for the APAC region, because the meetings were predominantly focused on the US and Europe, North America overall, and Latin America. Beyond contributor growth, perhaps in the South Asia region, what we experienced was that folks at large and medium-sized enterprises based out of the European Union, the UK, and North America started adopting Litmus at large scale. That is why our focus from a community aspect was on that part of the community, and then we also built one for the APAC region, where we saw a lot of folks from China and a lot of folks based out of India, Singapore, and Australia starting to adopt chaos engineering as a practice.
The idea was to talk about everything in the community meetings. But when we started seeing contributor interest growing, we spoke more from a contributor aspect: we divided the meetings into three separate meetings. There were community meetings to talk about user stories, releases, experiences, and maybe planning the roadmap. The contributor meetings were specifically to discuss contributions: the recent PRs that were merged, issues that were raised, and issues that could be taken up. And the maintainer calls were more like internal discussions to see how the maintainers were functioning, maybe the roadblocks, the roadmap, and how the maintainers could help the community thrive. Community notes were maintained on HackMD.
And I think, as a community aspect, content is king: putting out more and more content on dev.to, YouTube, and Medium, and writing more blogs. As you see, there were more tutorials created, which gave an idea or perspective of what cloud native chaos engineering is. It's not the old way of doing chaos; it's more like an open source, Kubernetes-centric, community way of doing chaos engineering. A lot of things were discussed: the architecture, running chaos with a sample application, the components of it, and the workflows developed with an integration with Argo. So a lot of things were discussed in the tutorials and the content, and that helped.
And again, a shout out to the Conf42 folks, who have done amazing events on chaos engineering over the years. One of the core contributors to Conf42 and its founder is Mark, and his brother Miko has been, I think, a pioneer in chaos engineering and has helped chaos engineering grow beyond Bloomberg to all these companies, writing amazing articles and books. I think chaos engineering has seen its shift with the growth of so many events. I think Chaos Conf and Chaos Carnival have done an immensely
great job in helping the chaos engineering community grow and the stories come out. So I think a lot of things have played a role in helping the community itself grow. And you can see, yeah, there was a need to record responses and describe them, and that's a community building activity again. This is how we recorded, with a sample set, how the Slack threads came up and the time taken to answer and to close these threads. As a nascent project, a project that was still building, it was essential for issues to be resolved. It was essential to help people get started with the practice itself, because it's not an everyday practice where you just run a Kubernetes cluster and get started. You need to have a lot of understanding to identify how chaos works in an infrastructure or an environment. That is where it became popular, and these kinds of stories came up in meetups.
A chaos engineering meetup group was important. Hosting them at conferences or getting them to talk in meetings was important. That's where the chaos engineering meetup group also became popular, and that is why I don't just count a community of, say, 2,300 folks; I believe that the chaos engineering community today is 30,000 strong. I mean from the aspect of interest in the community, not users; I think users are beyond that. Just interest towards the community, I feel, grew and grew. And this chaos engineering meetup group, where we hosted a lot of in-person meetups in Bangalore, was a testimony to it. Again, we kept participating: it grew from meetups to joining KubeCons and talking about Litmus, joining multiple podcasts, identifying the right conferences, and ensuring that there's a CNCF way to it. Chaos engineering kept growing, and the participation has seen growth again: students are participating and contributing.
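The Slack thread tracking mentioned above (time taken to answer, time taken to close) comes down to simple timestamp arithmetic. Here is a minimal sketch of how such a sample set could be summarized; the three threads and the field names `asked`, `first_reply`, and `closed` are made up for illustration and are not real community data.

```python
from datetime import datetime
from statistics import median

# Hypothetical sample set of support threads, not real Litmus Slack data.
THREADS = [
    {"asked": "2021-03-01T10:00:00", "first_reply": "2021-03-01T10:30:00", "closed": "2021-03-01T14:00:00"},
    {"asked": "2021-03-02T09:00:00", "first_reply": "2021-03-02T11:00:00", "closed": "2021-03-03T11:00:00"},
    {"asked": "2021-03-05T16:00:00", "first_reply": "2021-03-05T17:00:00", "closed": "2021-03-05T19:00:00"},
]


def hours_between(start: str, end: str) -> float:
    """Elapsed hours between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 3600


def summarize(threads: list) -> dict:
    """Median time to first answer and median time to close, in hours."""
    to_answer = [hours_between(t["asked"], t["first_reply"]) for t in threads]
    to_close = [hours_between(t["asked"], t["closed"]) for t in threads]
    return {
        "median_hours_to_answer": median(to_answer),
        "median_hours_to_close": median(to_close),
    }


print(summarize(THREADS))
```

The median is used rather than the mean so that one long-running thread (like the second one in the sample) does not dominate the picture.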
I feel that even after joining Mirantis, I have still kept in touch with the community and tried to contribute here and there; this talk is a testimony to it. But yeah, chaos engineering itself saw that exponential growth with the amount of content that was produced and the number of talks that were delivered, not just by me or the co-contributors to Litmus, but by the overall ecosystem in each and every project. And these are the examples of how people have contributed back. They have participated in Hacktoberfest. They have participated in community champion programs, GSoC, and the LFX Mentorship programs, which I believe have built a strong core community who will perhaps always look back at chaos engineering, even if they move away from the practice or the community itself, and say: yeah, I have contributed to this project, Litmus, Chaos Mesh, ChaosBlade, and it was a popular project. I think what I learned from there can be implemented maybe as an integration testing framework or something, which can help the overall practice of resilience engineering reach more and more folks.
And a shout out to one of my community friends, Akram, who has led the community well and, I think, is still actively involved with chaos engineering, helping enterprises implement it. And again, Namkyu Park, another popular LFX mentee who became a mentor later on, has been taking Litmus forward with more contributions from his peers, his mentees, and his friends, helping the project grow and develop over the last two years from a stage where it had already reached maturity. And again, that has come back in terms of appreciation and in terms of organic talk delivery and content. Social media again has played a crucial role in helping the project grow, with people talking about it more, featuring it, and understanding more about it.
And that's been the overall experience of growing the community, growing the essence of the community, and trying to achieve a goal: how we built a robust community around not just chaos engineering, but the very popular project that LitmusChaos is today. I hope this is not my last talk on Litmus, and that there are more talks that I am able to deliver on the community and its growth. But yeah, whoever is hoping to build a community today, maybe a chaos engineering community or any other open source community, I hope these steps help you. You need to involve the right kind of folks, share your responsibilities, and create the right kind of content, which we did for chaos engineering; today you see there's so much that has been said about it. You announce it, you create the right meetup groups, you keep building your Slack or Discord communities, you incentivize through an ambassadorship program, you chart your metrics to see how your community is growing, and then you repeat it in the form of a community calendar until you see your goals expanding in terms of not just monthly, but quarterly and annual goals.
The current state of the chaos engineering community: I think it's a funny question. A lot of folks ask me that, and sometimes I am left speechless. Sometimes I have, I think, strong opinions. But I think there are four major challenges that the community is facing today. And I hope those who are listening to this talk, if they are chaos engineering enthusiasts like me, are able to contribute back to the project and do more, not just for Litmus, but for the ecosystem itself. I'll speak about the challenges one by one. This is an example from the Chaos Mesh project, which is again a very popular chaos project used by many. But the current state of the project is that it is in dire need of contributors and contributions: it has had hardly 100 commits in the last 12 months,
more or less by one contributor, or at most two. Commits have decreased drastically because there is a lack of open source focus, and a lack of focus on chaos engineering in general and on Chaos Mesh as a tool, maybe because the maintainers have become inactive or the sponsor company hasn't given that much attention to it. So it's in dire need of contributors. Why has this sort of state come about for Chaos Mesh? I believe it's not just Chaos Mesh; it's the same for Litmus or ChaosBlade. Every project that started off with promise and has helped the chaos engineering community grow is in dire need of contributors, and of new contributions and ideas coming in to help build a roadmap to reach a goal, not just CNCF graduation, but also mass adoption, if chaos engineering as a technology is to thrive.
The challenges have always been there: the lack of budget, the diversion of spending to other important projects, and skepticism towards running something where you maybe have to break things in production. People still haven't identified the need for it. I still believe that the evangelism around chaos engineering needs to continue, even if it has to continue in the form of resilience engineering or reliability engineering. The business challenges need to be overcome: there's an uncertain market right now, and it has been the same for the last three or four years. And knowledge is limited. I think only a few folks, largely the SRE persona, a few developers, and a few users of Litmus, Chaos Mesh, Chaos Toolkit, and other popular projects, have an idea, but they haven't really been able to go out in the community and talk about it. There is a lack of chaos features at conferences as well, where a lot of conferences used to feature stories on chaos engineering pretty actively.
I think that has changed because the Gen AI story behind chaos engineering hasn't been that robust. There has been a lot of chatter. Gen AI, I think, is the future for chaos engineering: to help identify the right systems, to give recommendations on which chaos to run, what patterns of experiments should be done, what the timings are, and what integrations are required; giving out suggestions, and maybe helping the amalgamation of chaos engineering, monitoring, metrics, logging, and then incident management. So that has to develop a lot. And then, obviously, contributors need to rise as well. I believe that a project or an idea cannot grow in today's world if it has been part of the open source ecosystem and that ecosystem is dying; contributors and the ecosystem need investment.
Personally, speaking from the standpoint of having been a community manager for Litmus and having been actively involved in this community for the last four and a half to five years, I believe that there needs to be more commitment from at least the end users of the project, to help these projects grow and to contribute back, even if it's a non-technical contribution, maybe talking more about it, or maybe dedicating a couple of engineers to contribute back to the project. I think it's high time that we look at the current state of it and stop the degradation, and help the health of the project. But yeah, I think with more contributions, more enterprise-level features gaining importance, end users maybe coming back to the open source side of things, and more research and more white papers, the challenges can be mitigated and we can see a path ahead.
And talking about the path ahead: I think this technology is here to thrive. The people who are invested, the enterprises who are here, and the community, which I think is really strong, are here to invest in it.
In some form, we'll see a lot of maintainer tracks being delivered, where we will see projects that are already part of the CNCF ecosystem being featured. We'll see a lot of DevOps engineers, performance engineers, and platform engineers talking about chaos engineering as they keep adding chaos as a testing suite or maybe an essential part of application development. So the path ahead, I think, is tough. It's really hard. As I said, it needs a lot of investment, both financially and through effort, and it also needs, I think, a lot of evangelism even now, maybe from a different aspect, a different angle. You can still correlate it with resilience engineering, call it an integration testing framework, and sell the idea of chaos in usage, which should be done by more of the end users coming up with case studies and stories.
But yeah, the path ahead looks a little shaky, but I believe in the amazing folks that are out there: folks from Steadybit, folks from Gremlin, folks from Harness, who I see as the top three investors, alongside the larger folks that are maybe the users, and then the cloud providers, AWS and Azure and GCP, of course, who are invested in chaos engineering. I have seen a lot of stories coming up in recent times. So yeah, I feel that the investment needs to be there, and they need to give back more to the community. But yeah, these are just thoughts, and I hope they come to fruition and bring some reward. I'd love to keep supporting chaos engineering from the outside, obviously as an admirer and someone who has grown through it. So thank you, folks. That was it from my side. I hope this talk was fruitful for you, and yeah, enjoy the rest of the conference.

Prithvi Raj

Community Manager and Developer Advocate @ Mirantis
