Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey folks, a very warm welcome to my talk, Chaos Engineering
Community: Tales and Future.
My name is Prithvi Raj.
A lot of you folks who have followed me over the years or have seen the
progress of the chaos engineering community know me as a community
manager to the Litmus Chaos project
who paved his way through previous companies: MayaData, ChaosNative, Harness.
And now finally Mirantis, where I'm also a community manager and a dev
advocate, although right now I'm constantly focused on k0s and
the newly announced k0rdent project.
But yeah, my relationship with the chaos engineering
community goes way back to 2020.
Alongside that, I've helped organize Chaos Carnival, and I'm
currently helping organize
Kubernetes Community Days Bangalore and also the CNCF Reliability
Engineering meetups that we have done in Bangalore and online.
The agenda for today: obviously, this talk is more from a cultural aspect.
We'll talk about chaos engineering in practice, chaos engineering through
time, a better solution for chaos engineering as it is seen today,
the shift from chaos engineering
to resilience engineering, chaos engineering resources for you all,
and identifying the competition.
Then my journey as a community manager to Litmus Chaos, which is the main part,
where we'll be talking about how the Litmus Chaos community grew, how chaos
engineering came into the frame, strategies and aspects of the community, the
current state of the chaos engineering community, and the path ahead.
So, chaos engineering.
I hope I don't need to introduce it. There's a lot that has already
been introduced at Conf42 Chaos Engineering, or even before that at various
conferences, and there are so many articles out there.
Chaos engineering as a practice is not just being utilized by
one segment of folks; multiple enterprises, post the Netflix days,
have started adopting this practice.
And chaos engineering is being used today at so many enterprises and companies,
including retail and e-commerce,
where so many financial transactions require robust testing frameworks and
identification of failures.
So banks,
stock brokers, food delivery applications, gaming,
airlines: it's everywhere.
And I can name so many companies or there are companies that you might
identify as users to popular chaos projects, but chaos engineering has
crossed that part of the chasm where it was seen as an innovative tool.
And today it's become a tool that has been adopted by a
majority of the organizations.
And through time, as I spoke about the Netflix
days, we saw Chaos Monkey, which was used to just terminate instances.
And then it became part of the larger project that was the
Simian Army, or we had Pumba for running chaos in Docker instances.
The innovation era was more or less post the game day that was
run in 2003 by Amazon, which made chaos engineering a practice that
could be dreamt of by organizations, or could be adopted if they matured with
the practice of terminating instances or running some production-level failures.
But the early adoption era, where I was also a part of it as a community
manager, saw a side of things where folks from Netflix
parted ways and started their own thing, and I think a pioneer in that
was Gremlin, who I will talk about later on. But then there were so many open
source projects that came up during that time as well, like Chaos Mesh, Litmus,
Chaos Toolkit, Chaos Blade, and these projects are still
thriving. The enterprise side of things started building too: there was Verica, who was
focused on security chaos engineering, and then the big players, the cloud players
as we know them, AWS and Azure, also came up with their AWS FIS and Azure Chaos Studio
offerings, which helped simulate failures that were perhaps derived from those
open source tools, or written by
the developer or SRE persona itself. And now I would dare say that it's not
the open source era anymore. I think the late adoption era is also
the enterprise era, where enterprise players like Steadybit, Reliably, Harness, and
Amazon with the AWS Resilience Hub have, I think, consolidated chaos
engineering as a practice that is a must.
And I think with so many features and abilities, it has become
a practice that has been adopted at large: large enterprises are coming up,
adopting this practice, talking about it.
At each and every conference you have some talk or chatter
about reliability driven by
enterprise end user stories. As for a better solution for chaos engineering,
I believe that from the initial practice of running some production-level failure,
just terminating instances, it has become a larger practice
where collaboration is required.
Multiple teams are running chaos: there's an SRE team, there's
a platform engineering team, there's a team of developers, there's a QA team.
There is
a real need for collaboration.
There has to be a particular team running a particular set of experiments, which is
completely different from, say, team B.
And that is why a better solution for chaos engineering, driven by the
enterprise era, is collaboration, where there's chaos for teams, and
feature flags are being used to
perhaps stop a team from using particular
features or chaos experiments. Then there is the availability of experiments.
Obviously, in the communities, chaos engineering tests were written as
YAML files, or just as experiments written in Go or Python.
And then the idea of having them readily available has become more important:
to have an interactive UI and have those experiments that you can pull
up and run. Maybe you have a different dashboard, say a dashboard which
is for Litmus Chaos, but you want to run an experiment
that is part of the Chaos Toolkit program.
So bringing your own chaos, having that chaos experiment readily available,
has optimized the initial investment, rather than having developers
write the experiment for you.
And then again, the idea of chaos engineering was to run it in production,
but running it in CI/CD pipelines, in dev environments, in staging environments,
and automating it while controlling the blast radius is rather more important.
And then obviously you need metrics, assessing what's going wrong when
your system is in steady state, when your system is in production,
or when your system is up against some latency chaos engineering tests.
You obviously need your metrics.
You need to observe what's going wrong.
You measure the impact.
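As a rough illustration of measuring impact against steady state, here is a minimal Python sketch. It assumes you have already collected latency samples (in milliseconds) before and during a latency experiment; the function names, the p95 choice, and the 50% tolerance are illustrative assumptions, not something taken from the talk or from any particular chaos tool.

```python
import statistics


def p95(samples):
    """Return the 95th percentile of a list of latency samples (ms)."""
    # n=20 yields cut points at 5%, 10%, ..., 95%; the last one is p95.
    return statistics.quantiles(samples, n=20)[-1]


def measure_impact(steady_state_ms, under_chaos_ms, max_increase=0.5):
    """Compare p95 latency under chaos with the steady-state baseline.

    max_increase is a hypothetical tolerance: 0.5 means the hypothesis
    holds as long as p95 grows by no more than 50%.
    """
    baseline = p95(steady_state_ms)
    observed = p95(under_chaos_ms)
    increase = (observed - baseline) / baseline
    return {
        "baseline_p95_ms": baseline,
        "chaos_p95_ms": observed,
        "relative_increase": increase,
        "within_tolerance": increase <= max_increase,
    }
```

In practice the samples would come from whatever monitoring stack you already run; the point is only that the comparison needs a recorded baseline.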
And then, according to that, you go
about an incident management solution. That is where chaos engineering
has gone beyond just running chaos engineering tests,
beyond just running some chaotic scenarios.
It's also about monitoring them. It's also about managing those incidents, and that
has to happen in a very short span of time. I think Benjamin Wilms, a good friend
from Steadybit, spoke about how resilience engineering has developed.
You identify, you inject a fault, you get your readings, you monitor them
in your dashboards (there are open source solutions like SigNoz, or there's
a Datadog or Dynatrace dashboard), and then there's a call for an incident.
You need to manage that and mitigate that. There are so many incident
management solutions out there.
One of the open source solutions I saw recently is Respond Now, and that helps
you continuously build resiliency.
So I see a lot of folks coining this term as not just chaos
engineering but resilience
engineering, which I might agree with, and I believe
that is a term that is developing more and more, shifting away from the old
school chaos engineering practice.
But to get started, again, it's a term very popularly coined initially by
Netflix, and then there were the Principles of Chaos Engineering, which were
developed around 2015 and which laid out the process: you need to hypothesize, then you
need to develop a set of experiments, you need to run them, control your
blast radius, and repeat that process.
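The hypothesize, run, control the blast radius, and repeat cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `inject_fault` is a hypothetical callback standing in for a real fault injector, the 5% error-rate threshold is made up, and `fake_fault` exists purely so the sketch can run on its own.

```python
def steady_state_ok(error_rate, threshold=0.05):
    """Hypothesis: the error rate stays at or below the threshold."""
    return error_rate <= threshold


def run_experiment(inject_fault, blast_radii=(0.05, 0.25, 0.5)):
    """Repeat one fault at growing blast radii; stop when the hypothesis fails.

    inject_fault(fraction) is a hypothetical callback that applies a fault
    to the given fraction of targets and returns the observed error rate.
    """
    results = []
    for fraction in blast_radii:
        error_rate = inject_fault(fraction)
        ok = steady_state_ok(error_rate)
        results.append((fraction, error_rate, ok))
        if not ok:
            break  # do not widen the blast radius once the hypothesis fails
    return results


def fake_fault(fraction):
    """Stand-in injector: the error rate grows with the affected fraction."""
    return fraction * 0.15


for fraction, error_rate, ok in run_experiment(fake_fault):
    print(f"blast radius {fraction:.0%}: error rate {error_rate:.2%}, hypothesis holds: {ok}")
```

Starting small and only widening the radius while the hypothesis holds is one common way to keep an experiment controlled; real tools expose this differently.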
There are amazing resources, again.
Shout out to Pavlos Ratis, who has developed the awesome chaos engineering
repo on GitHub. Go check it out; if you are getting started
with your chaos engineering journey, it will surely be helpful in identifying
the resources and getting started with the right tool and the right blogs,
articles, or cultural aspects of it.
The first thing, I believe, even before the historical part and
everything, is identifying the competition and the right tool for you.
There are cloud providers: let's say you are already using AWS or Azure or GCP.
There are in-house, inbuilt chaos experiments,
a service in itself, with Azure Chaos Studio or FIS,
where you can just pull up your experiments, or you can use an open source
solution alongside, which I think is easy for large users of these cloud providers.
But then there are smaller teams, there are teams which want to go
about it the open source way.
Maybe they want to run a POC.
They want to just get a grasp of it.
And that is where I believe, as a community person, as an
admirer of chaos engineering,
I would suggest the open source way of going about it.
As for the right tools, there's the newly launched Kraken, or you have the old
school tools in Chaos Monkey, or the tools that belong to the adoption era, which
are Litmus, Chaos Mesh, and Chaos Toolkit.
And then there's the commercial side of it, when you are ready to pay for
it and you believe that you need a standalone commercial solution which is
completely focused on chaos engineering and which also
goes according to your systems.
Maybe you are able to achieve more by suggesting some feature requests.
I feel that's a mature way of going about it: running chaos in a
very controlled way, maybe having controlled game days and then experiments
according to your requirement,
with chaos induced at particular timelines according to your way.
And I think as an enterprise solution, an SRE persona driven
solution, that is more important.
And in that sense, you get a Steadybit or a Harness Chaos or a Gremlin.
And that is what you need to identify before even
getting started with the tool. And honestly, without Gremlin, you
cannot talk about chaos engineering.
Some people might not like me saying that, but I believe the
role of Gremlin has been immense.
From starting off, talking more about the practice, building the
Chaos Engineering community on Slack, which has about 9,000 plus members,
running webinars, running game days,
I think Gremlin has played a crucial role in starting off or growing
that mindset for chaos engineering and growing that overall idea that
chaos engineering is essential, with running conferences like Chaos Conf.
I think the growth and role of Gremlin has been essential, vital.
The founders
were again from a Netflix background, and they have played a vital role in
becoming an enterprise solution early on for chaos engineering. I believe at
that time maybe a lot of folks did not understand if it was essential or necessary,
but with the amount of development in the number of experiments, in an interactive
UI, in the great community activities and the great evangelism that folks
like Ana or Tammy or Kolton did,
and Jason, and I think Matt, again a friend from Harness,
I think the growth and role of Gremlin is essential when you talk
about the history and the growth of chaos engineering in itself.
And that is how I think I got attracted towards this project, although I
was pretty unaware when I joined MayaData, which was building Litmus as a side
project. But my journey as a community manager to Litmus Chaos, I think, grew
more and more by learning from folks at Gremlin, seeing how they
were growing the project and doing better as a community overall.
So Litmus is a CNCF incubating project, a very
popular chaos engineering project.
Today when folks talk about chaos, it's Litmus that they speak about, more or less.
I hope everyone agrees.
It's an open source project.
The idea was to obviously run chaos engineering experiments on a Kubernetes
environment and then it kept growing.
As we grew, with non-cloud and cloud environments and non-Kubernetes environments,
it has been adopted largely. When I checked out the Scarf analytics, it
was like 500 plus enterprises that have run Litmus in some form or the other,
as a POC, or maybe they have adopted it and are running it largely.
So Litmus, I feel, as a popular project in itself, went through a journey over the
last four or five years and has become a consolidated solution for folks who
want to get started with chaos or want to mature their chaos journey by building
their own solution, or just running
Litmus according to their needs, according
to multiple teams and all the features that it provides. On to the growth journey.
I'll have figures that are exponential because chaos engineering
has grown exponentially, and as you can see from the stars, it had
exponential growth over the years.
It has now reached a somewhat linear sort of graph, where it's still,
I think, growing steadily as time passes.
It's at almost, I think, 5,000 GitHub stars, which is, I think, a
great milestone for a chaos engineering project which has been organically
growing over the years.
And I think the Slack community has played a crucial role.
It's the first line of communication, folks who have joined in.
That is where they have got their queries answered.
That is where all the discussions,
identifying the group of
contributors and the group of users, and them contributing back content have happened.
And that is what we'll talk about in a bit: how
community growth has come up,
what the community looks like, and how the overall chaos
engineering community has grown.
Obviously, people were skeptical.
People were not really aware of how chaos engineering works.
How should they contribute?
Is it even worth it?
So I think the conversations that have happened over the last five years on
Slack and our community meetings, they have played a huge role in helping
chaos engineering as a technology grow.
And this is the Slack growth. I think there has been a shift: it
peaked, and then, as I said, enterprise adoption has led to a little
bit of a slowness for the open source side of things, and I believe that
chaos engineering has its own challenges.
We'll talk about them, but overall, the Slack in itself, I think,
is, after the Gremlin Slack, the biggest chaos engineering
open source Slack community.
There are a lot of conversations that have been driven beyond Slack as well.
There have been conversations around Discord, YouTube, other
platforms, Reddit perhaps.
The community has seen maturity over the last few years.
So we'll talk about some community aspects and building strategies that have helped
the chaos engineering community thrive.
I won't be explaining them in detail because it's
not a community building talk.
But we'll talk about how chaos engineering as a community has grown,
the history of it, we spoke about it a little, but how things look like
moving ahead and what the community has gone through to achieve this shape.
First, obviously, community meetings.
We held community meetings every third Wednesday of the month.
We planned something for the APAC region because the meeting was predominantly
focused on the US and Europe, North America overall, and Latin America.
And that is what we have seen:
beyond contributor growth in perhaps the South Asia region,
in our experience folks at large enterprises and medium sized enterprises
based out of the European region, the European Union, the UK, and North America
started adopting Litmus at large scale.
And that is why our focus from a community aspect was on that
part of the community. Then we also built one for the APAC region, where
we saw a lot of folks from China and a lot of folks based out of India,
Singapore, and Australia starting to adopt chaos engineering as a practice.
The idea was to talk about everything in the community meetings. But then,
when we started seeing contributor interest growing, we spoke more from a
contributor aspect: we divided the meetings into three separate meetings.
There were community meetings to talk about user stories, releases,
experiences, maybe planning the roadmap, and then the contributor meetings were
specifically to discuss contributions:
the recent PRs that were merged, issues that were raised,
issues that could be taken up.
And then the maintainer calls were more like internal discussions to see how the
maintainers were functioning, maybe the roadblocks, the roadmap, and how the
maintainers could help the community thrive.
And then community notes were maintained on HackMD.
And as a community aspect, I think content is king: putting
out more and more content on dev.to,
YouTube, and Medium, and writing more blogs. As you see, there were more tutorials created,
which gave an idea or perspective of what cloud native chaos engineering is.
It's not the old way of doing chaos; it's more like an open source, Kubernetes
centric, community way of doing chaos engineering, where a lot of things
were discussed: architecture, running chaos with a sample application,
the components of it, the workflows developed with an integration with Argo.
So a lot of things were discussed in the tutorials and the content, and that helped.
And again, a shout out to the Conf42 folks, who have done amazing events
on chaos engineering over the years.
Mark is one of the,
I think, core contributors to Conf42 and its founder, and his brother
Miko has been, I think, a pioneer in chaos engineering and has
helped chaos engineering grow beyond Bloomberg to all these companies,
writing amazing articles and books.
I think chaos engineering has
seen its shift with the growth of so many events.
I think Chaos Conf and Chaos Carnival have done an immensely
great job in helping the chaos engineering community
grow and the stories come out.
So I think a lot of things have played a role in helping
the community in itself grow.
And you can see, yeah, there was a need to record responses
and describe them, and that's a community building activity again. And
this is how we recorded, with a sample set, how the Slack threads came
up and what the time taken to answer and the time taken to close these threads
was. As a nascent project, as a project that was building, it was very
essential for issues to be resolved.
It was essential for helping people get started with the practice itself, because
it's not like an everyday practice where you just maybe run a Kubernetes cluster
and get started with the practice.
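The thread tracking mentioned above, time taken to answer and time taken to close, could be computed with something like the following Python sketch. The thread records, field names, and timestamps here are entirely hypothetical sample data, not figures from the Litmus Slack, and the use of the median is an assumption about what a useful summary looks like.

```python
from datetime import datetime
from statistics import median

# Hypothetical sample of support threads; timestamps are ISO 8601 strings.
threads = [
    {"opened": "2021-03-01T10:00", "first_reply": "2021-03-01T10:30", "closed": "2021-03-01T12:00"},
    {"opened": "2021-03-02T09:00", "first_reply": "2021-03-02T10:00", "closed": "2021-03-02T13:00"},
    {"opened": "2021-03-03T15:00", "first_reply": "2021-03-03T15:15", "closed": "2021-03-03T15:45"},
]


def minutes_between(start, end):
    """Minutes elapsed between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def thread_metrics(records):
    """Median time to first answer and median time to close, in minutes."""
    to_answer = [minutes_between(t["opened"], t["first_reply"]) for t in records]
    to_close = [minutes_between(t["opened"], t["closed"]) for t in records]
    return {
        "median_minutes_to_answer": median(to_answer),
        "median_minutes_to_close": median(to_close),
    }


print(thread_metrics(threads))
```

Tracking the same two numbers month over month is one simple way to see whether newcomers are getting unblocked faster.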
It's more like you need to have so much in you to just identify how chaos works
in an infrastructure, in an environment, and that is where it became popular and
these kinds of stories came up in meetups.
A chaos engineering meetup group was important.
Hosting them at conferences or getting them to talk in meetings was important.
And that's where the chaos engineering meetup group also became popular, and
that is where I just don't count a community of say 2, 300 folks; I
would believe that the chaos engineering community today is like 30,000 strong,
I mean from an aspect of interest in the community. Not users, I
think they are beyond that, but just interest towards the community,
I feel, grew and grew.
And this chaos engineering meetup group where we hosted a lot of in person meetups
in Bangalore was a testimony to it.
Again, we kept participating: it grew from meetups to joining KubeCons and talking
about Litmus, joining multiple podcasts, identifying the right conferences,
and ensuring that there's a CNCF way to it.
Chaos engineering kept growing and the participation has seen
growth: students are participating, contributing.
Even after joining Mirantis, I have still kept in
touch with the community, and I've tried to contribute here and there; this
talk is a testimony to it. But yeah, chaos engineering in itself saw that
exponential growth with the amount of content that was driven and the amount of talks
that were delivered, not just by me or the co-contributors to Litmus, but by the overall
ecosystem in each and every project.
And these are the examples of how people have contributed back.
They have participated at Hacktoberfest.
They have participated in community champion programs,
GSoC, and the LFX mentorship programs,
which I believe have
built a strong core community who will perhaps always look back at chaos
engineering, even if they move away from the practice or the community itself,
and say, yeah, I have contributed to this project, Litmus, Chaos Mesh, Chaos
Blade, and it was a popular project.
I think what I learned from there can be implemented maybe as an integration
testing framework or something,
which can help the overall practice of resilience engineering reach more and more
folks. And then a shout out to one of my community friends, Akram, who has
led the community well, and I think is still actively involved with chaos
engineering, helping enterprises implement the same. And again Namkyu Park, another
popular LFX mentee who became a mentor later on, taking Litmus forward
with more contributions from his peers, his mentees, his friends, and helping
the project grow and develop over the last two years, having
already reached a stage of maturity.
And again, that has come back, in terms of appreciation, in terms of organic
talk delivery or content and social media again has played a crucial role
in helping the project grow and talking about it more, just featuring it,
understanding more information on it.
And that's been the overall experience: growing the community, growing the essence
of the community, trying to achieve a goal, how we built a robust community
around not just chaos engineering
but a very popular project that Litmus Chaos is today. I hope this is not
my last talk on Litmus, and that there are more
talks that I am able to deliver on the community and the growth of it. But yeah,
for whoever is hoping to build a community today, maybe a chaos engineering
community or any other open source community, I hope these steps help you.
You need to involve the right kind of folks, share your responsibilities,
create the right kind of content, which we did for chaos engineering,
and today you see there's so much that has been talked about it.
You announce it, you create the right meetup groups, keep building
your Slack communities or your Discord communities, incentivize
them through an ambassadorship program, chart your metrics, see how your community
is growing, and then repeat it in the form of a community calendar till you see your
goals expanding in terms of not just monthly but quarterly and annual goals.
The current state of the chaos engineering community: I think it's a funny question.
A lot of folks ask me that, and
sometimes I am left speechless.
Sometimes I have, I think, strong opinions, but I think there
are four major challenges that the community is facing today.
And I feel those who are listening to this talk, if they are chaos
engineering enthusiasts like me, can contribute back to the project and do
more for not just Litmus, but beyond that for the ecosystem in itself.
And I'll speak about it one by one.
This is an example of the Chaos Mesh project, which is again a very popular
chaos project used by many, but the current state of the project somehow is that
it is in dire need of contributors and contributions. It has had
hardly 100 commits in the last 12 months,
more or less by one contributor, or at max two. Commits have decreased
drastically because there is a lack of open source focus, a lack of focus
towards chaos engineering
in general, and towards Chaos Mesh as a tool, because maybe the maintainers have
become inactive or the sponsor company hasn't given that amount of traction
to it, so it's in dire need of
contributors. Why has this sort of
state happened for Chaos Mesh? And I believe it's not just Chaos
Mesh; it's the same for Litmus or Chaos Blade. Every project that started
off with promise and has helped the chaos engineering community grow is
in dire need of contributors and in dire need of new contributions and ideas
coming in, helping build a roadmap to
reach a goal beyond just, say, CNCF
graduation, and towards mass adoption, if chaos engineering as a technology has
to thrive. And the challenges have always been there: the
lack of budget, the diversion of cost to other important projects, the
skepticism towards running something where you have to maybe
break things in production. People still haven't identified the need for it.
And I still believe that the
evangelism around chaos engineering needs
to continue, even if it has to continue in the form of resilience engineering or
reliability engineering, and the overcoming of business challenges needs to be there.
There's an uncertain market right now, or it has been the same for the last
three or four years, and knowledge is limited.
I think there are only a few folks, maybe largely the SRE persona, a few
developers, a few users of Litmus or Chaos Mesh or Chaos Toolkit or other
popular projects, who have had an idea,
but they haven't really been able to go out in the community and talk about it.
There is a lack of featuring of chaos in conferences as well, where a
lot of conferences used to pretty actively feature stories on chaos engineering.
I think that has changed because the Gen AI story behind chaos
engineering hasn't been that robust.
There hasn't been a lot of chatter.
Gen AI, I think, is the future for chaos engineering:
to help identify the right systems, to give recommendations
on which chaos to run.
What are the patterns of experiments that should be done?
What are the timings?
What are the integrations required?
Giving out suggestions, maybe helping the amalgamation of chaos
engineering, monitoring metrics, logging, and then incident management.
So that has to develop a lot, and then,
obviously, contributors need to rise as well. I believe that
a project or an idea cannot grow in today's world if it has been part
of the open source ecosystem and that ecosystem is dying. Contributors
and the ecosystem need investment.
Personally, speaking from a standpoint where I was a community manager to
Litmus and I have been actively involved in this community for the last four
and a half, five years, I believe that there needs to be more commitment by
at least end users to the project,
to help these projects grow and to contribute back to the project,
even if it's a non-technical contribution, maybe talking more
about it, or maybe dedicating a couple of resources, engineers,
to contribute back to the project.
I think it's high time that we look at the current state of it and stop the
degradation, or in other terms, help the health of the project. But
yeah, I think with more contributions,
more enterprise-level features
gaining importance, end users maybe coming back to the
open source side of things, and more research and more white papers,
I feel the challenges can be mitigated and we can see a path ahead.
And talking about the path ahead,
I think this technology is here to thrive.
The people who are invested, the enterprises who are here, and
the community, which is, I think, really strong, are here to invest in it.
In some form, we'll see a lot of maintainer tracks being taught,
where we will see the featuring of projects that are already part of the CNCF
ecosystem. We'll see a lot of DevOps
engineers, performance engineers, and platform engineers talking about chaos
engineering as they keep adding chaos as a testing suite or maybe an essential
part of application development.
So the path ahead, I think it's tough.
It's really hard.
As I said, it needs a lot of investment, both financially and through effort,
and it also needs, I think, a lot of evangelism even now, maybe from a
different aspect, a different angle. I think you can still
correlate it with resilience engineering,
call it an integration testing framework,
and sell the idea of chaos in usage, which should be done by more of the end users
coming up with case studies and stories. But yeah, the path ahead looks a little
shaky, though I believe in the amazing folks that are out there: folks from
Steadybit, folks from Gremlin, folks from Harness, who I see
as the top three investors, alongside the larger folks that are maybe the
users, and then the cloud providers in AWS and Azure and GCP, of course, who
are invested in chaos engineering.
And I have seen a lot of stories coming up
in recent times.
So yeah, I feel that the investment needs to be there and they need to
give back more to the community.
But yeah, these are just thoughts, and I hope they come to fruition and give some
reward in themselves. I will keep supporting chaos engineering.
I'd love to keep supporting chaos engineering from the outside,
obviously as an admirer and someone who has grown through it.
So thank you, folks.
That was it from my side. I hope this talk was fruitful to you, and yeah,
enjoy the rest of the conference.