Conf42 Chaos Engineering 2020 - Online

- premiere 5PM GMT

Who is responsible for Chaos?

Abstract

If you’re thinking of starting a chaos program, you might be wondering which job functions are typically responsible for managing chaos within their organizations.

This talk will look across a number of companies to determine who historically initiates chaos programs, as well as reveal new trends in this space.

Summary

  • Joyce: This is my dog, Lucy. She's a year old, and I am in from San Francisco. So I'm going to head back tomorrow to see Lucy.
  • Joyce: Who owns chaos? Who is responsible for chaos? About a third of the community comes from functions like security, ops, or R&D. She asks: why aren't testers also identifying vulnerabilities and automating these experiments? If you think about the software lifecycle, you should introduce chaos earlier.
  • Why aren't more testers doing chaos today? Have you even heard of such a thing? We're starting now to see chaos experiments created by SREs and then run and automated by testers. But very few people here identify solely as testers.
  • Who can start a chaos program? Who gets the ball rolling on chaos? Who knows about potential vulnerabilities and how to properly structure a chaos experiment? A valuable chaos test will not only teach you about your systems, but also your team. Be prepared for failure.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey, guys. My name is Joyce. This is my dog, Lucy. She's a year old, and I am in from San Francisco. She's alone. So I'm going to head back tomorrow to see Lucy. So my talk is, who owns chaos? Who is responsible for chaos? Again, my name is Joyce. I'm a developer advocate at Postman. So thank you, John and Manuel, for demoing Postman earlier this morning. I'm not going to be talking about Postman, but it is an API development platform used by more than 8 million developers. So one of the best parts of my job is that there's a ton of people who use Postman, and I get to talk with all of them and find out what they are working on and how they are doing it.
So this is kind of a side project of mine, chaos. I've been interested in it. So I've been going to community days and conferences, and one of the questions that I had when I was in the audience listening to people talk about chaos engineering was: who, who, who? Who is responsible for chaos? The engineering Slack group has posted a diagram of the people and the tools in this chaos community. These are the famous people. Let's take a quick look at this data and break it down.
So what job titles are doing chaos? Who here identifies as being an engineer in this room? Okay, almost everyone here. So if you look at that diagram of the tools and the famous people in chaos, most of them include the word "engineer" in their job title. There are specialized roles or vanity roles, as Villas will call them, like chaos engineer. You have site reliability engineers that also might handle chaos. And about a third of the community comes from functions like security, ops, or R&D. So typically, the folks that are most motivated to start a chaos program are the ones who feel the pain of a failure in production. So if you're on call. Kolton Andrus, CEO at Gremlin, says it boils down to who gets paged. If that's an SRE or ops team, they have the most incentive to start doing this work and making their lives better.
And this is how Kolton personally started off doing chaos engineering when he was over at Netflix.
So when you're thinking about roles and responsibilities, which typical responsibilities do the folks who are interested in chaos have? So we have chaos specialists, those vanity roles. You have a dedicated team for chaos engineering, and it might actually be a core competency for your business. Other companies have SREs or production engineers. They handle continuous deployment and production support. So at Postman engineering, we have a microservice architecture, and the ones that are responsible for deployment and uptime are the developers that are building the services themselves. If your team has a traditional DevOps department, they might be doing the deployment and uptime.
Other people who care about chaos might be responsible for incident management. So Russ called it earlier this morning, he called it a postmortem analysis. Right. But the difference here between incident management and chaos is that you're shifting the focus from the postmortem to the pre-mortem, and you're proactively trying to prevent errors from happening. So at companies that have chaos engineers, there are some folks that have domain knowledge, right? So you're talking about the data, the storage, or the networking teams.
And lastly, we think about folks that are responsible for, and this one might be a little bit controversial here, testing in production. Who here tests in production? Okay, I see, like, seven hands. So production is the best environment for this. It has the most information to accurately recreate situations and demonstrate the true consequences of your attacks. But some companies can't test in production. So if you're in finance or healthcare, you might have compliance issues. You can't take down customer data, or it's very, very costly.
So these are the general responsibilities of folks who do chaos. Which roles tend to have these responsibilities? When you translate these responsibilities to roles, a lot of the people doing chaos tend to be quality-driven or production-focused operations engineers. So here's my question. This is all good. Chaos engineers are running chaos tests. They're identifying vulnerabilities, and they're automating these experiments. My question: why aren't testers doing chaos? Does anyone also identify as a tester? Okay. About the same amount of hands. Okay.
So before there was chaos engineering, there was chaos testing. So if you go back and look at the earliest blog posts, when Netflix first introduced Chaos Monkey, they actually called it chaos testing, and they introduced it to the test community. So it makes sense: if you think about the traditional software development lifecycle, you should be introducing the responsibility of resilience or quality earlier in that cycle, when the cost of bugs is the lowest. That's a noble goal. But as we talked about earlier, the testers aren't the ones on call. They're not rolling out hot fixes in production. So this is the reason why we see a bunch of SREs and ops people pioneering the work in chaos engineering.
But because you want to build resilience a little bit earlier in the software development lifecycle, we actually see a very new trend of testers, people who solely identify as testers, who are now focused on production testing in addition to what we imagine they focus on, pre-release testing. So one such test engineer said the biggest limitation in the fear of delivering software faster is the focus on adding more pre-release testing. Abby Bangser says chaos engineering is all about building trust that your systems are resilient and that the mean time to recovery is acceptable. She goes on further to say chaos engineering is all about building confidence that we aren't fragile.
We have less fear that any one change will bring down our system. And when issues do occur, we know how to triage and deploy fixes faster. This is Abby Bangser, based in London here, one of the very few testers I found that was trying to get a chaos program launched at her company.
So why aren't more testers doing chaos today? Have you even heard of such a thing? So I've talked to testers at events like these that are curious about chaos, but they're still spending most of their time on pre-release testing. And it's very rare, but especially at events like these, you find the people that have a side project or a passion project. We're starting now to see chaos experiments created by SREs and then run and automated by testers. So you can see that this is an anonymous attribution here, but I've talked to a few very large companies that do have this kind of workflow. And so we see these emerging programs that are being pushed by the testing function. But very few people here identify solely as being a tester.
Job titles aside, who can start a chaos program? Who gets the ball rolling on chaos? Okay, so who has the insights? Who knows about potential vulnerabilities and how to properly structure a chaos experiment? Who has the access to pull the plug, in case you need to roll it back? And lastly, who has the organizational pull to convince management and adjacent teams to support this chaos program? So this one's probably going to be the hardest lever to pull. And it doesn't matter where you are, what function you're in, what industry you're in, what company you're at, it actually becomes a lot easier if you have a catastrophic failure. So, at the last chaos event that I was at, I met somebody who told me that there's now a directive from their CTO to start a chaos program. Note, this is a director of test, after they lost $600 million in 22 minutes.
Not going to talk about this, but come grab me later to hear the gory details. So this is the easiest way to start a chaos program: when you have a catastrophic failure. But for the rest of us, do I need to wait for this? No. Clearly no. Start thinking about chaos in order to prevent it.
And for you, if you're thinking about starting a chaos program, Casey Rosenthal has some advice. Casey says perhaps aggregate bits and pieces from different frameworks that appeal to you and then create a practice around it. You'll likely be the first person to create a similar practice in your particular context. And he goes on further to say, I wish you the best of luck in that undertaking, but I wouldn't wager that you get it right on your first try or your second. So be prepared for failure.
Who owns chaos? Final thoughts here. More teams and new functions are thinking about chaos engineering because, to use Russ's buzz phrase from earlier, they're going cloud native, stuff is getting complicated, and they're thinking about how chaos testing can complement or augment traditional testing. So as your teams begin to cover the bases when it comes to pre-release testing, we see them spending more time testing in production. And now we start to see chaos experiments created by SREs and then run and automated by testers. And lastly, Villas was talking about this a little bit earlier, alluding to it: a valuable chaos test will not only teach you about your systems, but also your team. So as you're thinking about building more resilient software, also think about building resilience into your organization, with your people, with your culture, celebrating the failures instead of hiding them and sweeping them under the rug. That will impact the overall resilience of your systems. Thank you.

Joyce Lin

Developer Advocate Lead @ Postman
