Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
Thanks for joining my talk, Implementing Chaos Engineering in



a Risk-Averse Culture. My name is Kyle Shelton, and
            
            
            
              I'm a senior DevOps consultant for Toyota Racing development.
            
            
            
              A little bit about myself. I have a beautiful wife and three girls.
            
            
            
              I'm an avid outdoorsman. I love to be outside, love to
            
            
            
              fish, love to hunt, love to hike. I also train for
            
            
            
              triathlons and do triathlons recreationally.
            
            
            
I'm an avid outdoorsman, but I'm also
            
            
            
an avid iRacing and simulation fan. I love racing
            
            
            
simulators. I just built a new PC recently and have been doing
            
            
            
a lot of iRacing. I also enjoy flight simulators and farming
            
            
            
              simulators, and one of my other hobbies is writing and I
            
            
            
              have a blog, chaoskyle.com, to where I talk about platform
            
            
            
              engineering, DevOps, SRE, mental health,
            
            
            
              and things of that nature. So if you're interested in that, check it out.
            
            
            
              In today's talk, I'm going to talk about the evolution of distributed
            
            
            
              systems and what apps look like today versus
            
            
            
what they looked like 10-15 years ago. I'm going to kind of go
            
            
            
              over the principles of chaos engineering and how to
            
            
            
              deal with that conservative mindset. A lot of places
            
            
            
              you might work at maybe will laugh at
            
            
            
              you if you ask to break things. So I'm
            
            
            
              going to kind of go over what that mindset looks like
            
            
            
              and how to sell newer technologies
            
            
            
              and practices like chaos engineering. Also talk
            
            
            
about the art of persuasion and gaining buy-in. And then
            
            
            
              I have a case study from my days as an SRE at Splunk that
            
            
            
              you can use as a reference point on how to ask for
            
            
            
              money to do a lot of tests. And then we'll close out with tools
            
            
            
and resources, and then I'll get into Q&A.
            
            
            
So let's start by talking about the
            
            
            
              evolution of systems architecture. When I first got started in
            
            
            
              my career, I was working for Verizon and we were deploying
            
            
            
VoLTE, which is Voice over LTE. It was the



world's second, and the nation's first,



IP-based voice network, and our deployments looked
            
            
            
like this. We would schedule each of what we called



network equipment centers, the data centers,
            
            
            
              and we had our core data centers all throughout the country, New Jersey,
            
            
            
California, Dallas, Colorado. And we would schedule six-week deployments
            
            
            
              to where we would spend four weeks on site,
            
            
            
              racking and stacking the servers, getting everything put together,
            
            
            
installing the operating system, installing all of the
            
            
            
              application software, or as we call them, guest systems,
            
            
            
              and then bringing these platforms onto the network and
            
            
            
eventually getting subscribers making phone calls
            
            
            
              through them. And the process was long and we
            
            
            
              had to build out our capacity kind of
            
            
            
in that 40/40 model, where we would only build out to 40%
            
            
            
of our capacity. Everything was HA, obviously, too,



so we always had two of everything. And so if we ever



had to fail over from one side to the other, it would



never be at 100%; it would only be
            
            
            
              at 80%. So there's a lot of overhead, a lot of costs.
            
            
            
              We were really good about knowing what type of traffic
            
            
            
              we would have. But this was early iPhone days,
            
            
            
              so iPhone releases could really mean a huge spike
            
            
            
              in traffic, and there's a lot that
            
            
            
went into it. And now if you're deploying
            
            
            
              an app with the cloud native revolution in
            
            
            
order to get an MVP into your customers' hands,
            
            
            
              you can do that within 45 minutes,
            
            
            
basically, for simple use cases. Obviously,
            
            
            
you're not going to deploy a complicated Rails app
            
            
            
              platform in 45 minutes, but you
            
            
            
              have the tools available, using public clouds
            
            
            
              or private clouds or whatever, to deploy virtualized systems with
            
            
            
              all their dependencies quickly. And so with
            
            
            
              the new way of doing things and the new way
            
            
            
of being able to deploy things fast, being able



to scale and have elasticity, you only pay for what you
            
            
            
              use, right? If you've ever heard one of the AWS pitches,
            
            
            
              the big thing is they had all that extra capacity and
            
            
            
they started selling it. And so you pay for what you use in
            
            
            
these scenarios with these new technologies. And this new way
            
            
            
              of doing things brings a lot more aspects
            
            
            
              of failure and a lot more things that can go wrong
            
            
            
              based on abstractions and dependencies into
            
            
            
the system. And so understanding your failure modes,
            
            
            
              understanding what happens when
            
            
            
              things break is crucial for modern day applications.
            
            
            
As Werner Vogels said, everything fails all the
            
            
            
              time, and you have to be prepared for that. And another thing,
            
            
            
              too, is you are abstracting some of the physical
            
            
            
              responsibilities through the shared responsibility model. So something
            
            
            
              can happen on a data center that you don't own that could affect your resilience.
            
            
            
              And so that's another thing that you have to be prepared for. And so
            
            
            
              those are some of the challenges of our modern day systems. And that's
            
            
            
              why chaos engineering kind of came about.
            
            
            
              When I think about what is chaos engineering in my mind,
            
            
            
              you define your steady state first. If you are at a
            
            
            
              point to where you don't really know what a good day looks like, I would
            
            
            
              first figure that out. Get to your steady state. Get to
            
            
            
where you know that your failure rate, or your



4xx/5xx error count or whatever, is at X. Whatever a good day



looks like for your distributed system, that's what's defined as
            
            
            
              steady state. So once you have that, then you can start to build hypotheses
            
            
            
              around disrupting that steady state.
            
            
            
              So you create a control group and an experiment group, and you say, okay,
            
            
            
              if I unplug x wire, y will turn
            
            
            
off, right? That's just a basic example of a hypothesis,
            
            
            
              but that's what it is. If I do something,
            
            
            
              x will happen. And then once you have that hypothesis written
            
            
            
down, or put down in notes or whatever



you use to document it, that's when you come in and you introduce chaos in
            
            
            
              a controlled manner. That's when you come in and you go unplug the cable or
            
            
            
              you go pour water on your keyboard to see what happens. And after
            
            
            
              you do that, this is a key point, then you're going to observe what happens,
            
            
            
              and you're going to look at both groups and then evaluate your hypothesis.
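
To make that loop concrete, here's a minimal sketch in Python. The error-rate metric, the 1% tolerance, and the unplug_cable fault are hypothetical stand-ins for illustration, not anything from a specific tool:

```python
import statistics

def error_rate(requests):
    # Steady-state metric: fraction of failed requests in a group.
    return statistics.mean(1.0 if r["failed"] else 0.0 for r in requests)

def run_experiment(control, experiment, inject_fault, tolerance=0.01):
    # Hypothesis: after the fault, the experiment group still matches
    # the control group's steady state, within the tolerance.
    baseline = error_rate(control)
    inject_fault(experiment)            # introduce chaos in a controlled manner
    observed = error_rate(experiment)   # observe what happens
    return baseline, observed, abs(observed - baseline) <= tolerance

# Toy run: fake traffic, plus a hypothetical "unplug the cable" fault
# that makes 2% of the experiment group's requests fail.
control = [{"failed": False} for _ in range(100)]
experiment = [{"failed": False} for _ in range(100)]

def unplug_cable(requests):
    for r in requests[:2]:
        r["failed"] = True

print(run_experiment(control, experiment, unplug_cable))
# (0.0, 0.02, False) -> hypothesis rejected, so go dig into why
```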
            
            
            
And another mistake that I've seen people make is that they want to get into
            
            
            
              chaos engineering without having a good observability
            
            
            
              foundation. And if you can't put metrics
            
            
            
              on your system, and you don't know how to
            
            
            
              put a value or put a score on your system, then you're probably not
            
            
            
              going to be very successful with chaos engineering. So I would also focus on
            
            
            
              that first, defining your steady state and being able
            
            
            
to observe that steady state is something that you have to have. That's one
            
            
            
              of the prerequisites for chaos engineering.
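
A minimal sketch of that prerequisite as a pre-flight guard, assuming you already have some hook into your metrics (get_error_rate and the threshold here are invented for illustration):

```python
# Don't inject chaos unless the system currently looks like a good day.
def steady_state_ok(get_error_rate, good_day_threshold=0.001):
    return get_error_rate() <= good_day_threshold

def maybe_run(experiment, get_error_rate):
    if not steady_state_ok(get_error_rate):
        print("Baseline unhealthy: fix observability and steady state first.")
        return False
    experiment()  # only introduce chaos from a known-good baseline
    return True

# Toy usage with a fake metric reading of 0.04% errors.
maybe_run(lambda: print("injecting fault..."), lambda: 0.0004)
```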
            
            
            
              There's a website, really cool website, that helpful folks
            
            
            
              at Netflix put out called principlesofchaos.org.
            
            
            
And this kind of goes into detail on the principles of
            
            
            
              chaos engineering. What was their north star when
            
            
            
they created this type



of engineering? The big thing is,



you know, you build your hypothesis around the steady state, like I talked



about, and you're going to want to vary real-world events. So if
            
            
            
              you search AWS service events or Google service events, they'll give
            
            
            
              you a long list of the things that have happened from
            
            
            
the big S3 outage way back in the day to the December



7 one. You can see what
            
            
            
              has actually happened in the cloud. So, you know, like,
            
            
            
              okay, at minimum, this could happen because it's happened
            
            
            
              before. You're never going to ever be able to plan for the
            
            
            
unknown, obviously, but those are the



biggest events your provider has had. Start there and
            
            
            
              work backwards. Use those published
            
            
            
              events as a starting point for what to
            
            
            
              test on. Run your experiments in prod. You want to get
            
            
            
              to where you're running continuous experiments in prod randomly,
            
            
            
              automatically, without it having an impact
            
            
            
on your customers.
            
            
            
              We have automation pipelines, and we have jobs
            
            
            
that can run and inject failure within unit



tests. There's also ways to just let
            
            
            
              it go in the wild and just when it runs, it runs. There's a
            
            
            
              lot of ways you can do this, but keep it automated. Don't have someone sitting
            
            
            
there that has to go and pull the plug. Make sure that
            
            
            
              it's automated and that you can run it continuously. And the last
            
            
            
              thing is you're going to want to minimize blast radius.
            
            
            
You don't want your experiments to bring down production.
            
            
            
              You don't want your monitoring and your testing tools
            
            
            
              to cause large scale events. So isolate your
            
            
            
tests. I'll talk about this when we're talking about buy-in, but start
            
            
            
small. Don't go in trying to just say,
            
            
            
              all right, we want to fail over one region.
            
            
            
              And don't go in with the mindset that you're going to
            
            
            
              break everything all at once, all the time. Minimize your
            
            
            
              blast radius, keep it small and work from there.
            
            
            
              Iterate off the small wins so that
            
            
            
you can get confidence from your leadership to run bigger
            
            
            
failures and have bigger events. Do red team events, things like that.
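
Since these principles lean so hard on automation and blast radius, here's a small sketch of target selection for an automated run; the 5% radius, host names, and protected set are all made up for illustration:

```python
import random

# Keep the blast radius small: hit a few instances at a time, and never
# anything in the protected set.
def pick_targets(hosts, blast_radius=0.05, protected=frozenset()):
    candidates = [h for h in hosts if h not in protected]
    k = max(1, int(len(candidates) * blast_radius))
    return random.sample(candidates, k)

fleet = [f"web-{i:02d}" for i in range(100)]
print(pick_targets(fleet, blast_radius=0.05, protected={"web-00", "web-01"}))
# e.g. ['web-42', 'web-07', 'web-63', 'web-88'] -- 4 of 98 candidates
```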
            
            
            
              Let's get into the conservative mindset.
            
            
            
              And we've all worked with those leaders or those
            
            
            
              managers or principal engineers who have
            
            
            
              that very conservative mindset of, oh, we don't want to do
            
            
            
              the new stuff yet, or, oh, we don't want to
            
            
            
              bring failure in. I thought our goal was to keep the app up
            
            
            
              as much as possible. Why would we want to come and start breaking things?
            
            
            
              And one of the key things that I've seen when
            
            
            
              working with these types of leaders and individuals is it's
            
            
            
a risk-averse, bottom-line-first type of mentality.
            
            
            
And I'm not trying to downplay
            
            
            
              the bottom line. Right. I understand business. I understand that sales
            
            
            
              generate the revenue which keep everybody employed, and that the
            
            
            
              bottom line is very important. I want to stress
            
            
            
              that right now. But I also know
            
            
            
              that sometimes you have to make investments and you have to make investments in
            
            
            
time, engineering time, tools, systems,



platforms, et cetera, that initially you might



not see a return on in the short term, but their long-term
            
            
            
              returns are substantial. So be
            
            
            
              on the lookout for that type of behavior and understand,
            
            
            
hey, everything basically



revolves around the bottom line.
            
            
            
              But if you notice that leadership has
            
            
            
              that bottom line first, well, we can't fail because
            
            
            
              it's going to cost us money. Well, if you don't understand your
            
            
            
failure modes, it's going to cost you a lot more money. And so recognize
            
            
            
that. Recognize that mindset and be ready with counterarguments



when you're going to sell your case. Another



environment like this, I think, was my early days at Splunk,



where we had a NOC transition. And so
            
            
            
we were transitioning from an overseas NOC to building out an
            
            
            
              internal team. And during that transition,
            
            
            
we got paged frequently. There were a lot of outages, there were



a lot of things that were going on, and it
            
            
            
              sucked. I'm not going to lie, it sucked. Getting paged at
            
            
            
2:00 a.m. every other night or constantly fighting fires.
            
            
            
There was a lot of fatigue, and you could see it in some of the older



engineers. And that was the whole basis of us bringing
            
            
            
that NOC back in house because we weren't getting the support from our
            
            
            
              overseas team and we needed to get
            
            
            
              better. So notice that PTSD,
            
            
            
              if you're coming in and everybody's just kind of worn out from getting
            
            
            
              paged all the time, that's something to look out for.
            
            
            
              And then the third thing that I've noticed in these
            
            
            
              conservative mindsets is slow. Everything moves slow. It takes
            
            
            
              two weeks to get your change in front of a change board. Then it takes
            
            
            
              another two weeks to get a maintenance window approved, and then you got
            
            
            
              six weeks in. You're going to run your three lines of code, and if it
            
            
            
              breaks, then you have to wait another six weeks to get things in.
            
            
            
              And not being able to move releases
            
            
            
fast and not being able to push your code fast, that's



normally due to that conservative mindset. There's a lot of reasons



it could be there, but it's slow, and
            
            
            
              nobody likes going slow. I personally hate going slow,
            
            
            
              but slow is steady sometimes. And that might
            
            
            
              be what works best for that team at that
            
            
            
              moment. And your job is to speed things up.
            
            
            
              That's what to look out for when you're
            
            
            
              finding that you think you might be in that conservative mindset
            
            
            
or conservative environment. Now let's talk about selling



chaos engineering. And if there's a skill set that I'm
            
            
            
              constantly trying to grow, it's my sales skill set.
            
            
            
I don't work in sales, right? But there
            
            
            
              are situations to where you have to sell something, whether it's
            
            
            
              yourself for a job interview, whether it's a tool that you want,
            
            
            
              because it's the next best thing, big monitoring
            
            
            
              tool or something, or whether it's just you want money
            
            
            
              for a soccer team. I had to ask an
            
            
            
executive for money to
            
            
            
              start a soccer team for our office. And one
            
            
            
of the things I mentioned was that support, sales, and SRE
            
            
            
              just don't get along. We all operate in our own little pods.
            
            
            
              There's not a lot of communication and camaraderie.
            
            
            
              And so I was like, hey, if we had a soccer team and
            
            
            
              we could get groups of all of these
            
            
            
              environments together on one mission to
            
            
            
              go have some fun, it'll help with the bonding
            
            
            
              and the office morale, and it'll help us work better together. And it
            
            
            
worked, and I got the money. And sure as shit,
            
            
            
              it really helped the teams get along better, and the teams work
            
            
            
together better, because there's just that



team bond. When you're on a soccer team, you're all trying to



win. We were trying to beat the other tech companies in Dallas,
            
            
            
              and there's a little bit of pride, and then we don't want to lose.
            
            
            
              Yeah, it was fun. I got to be friends with folks I would
            
            
            
have never been friends with because I didn't work with them, and I actually met my
            
            
            
              wife through that, too. And so being able
            
            
            
              to sell is very important. Now, how do we sell
            
            
            
              chaos engineering? I think one of the biggest
            
            
            
              things that whenever I've asked for
            
            
            
              a new technology or something new from a leader,
            
            
            
one of the things I've been asked is total cost of ownership
            
            
            
and how to turn short-term investments



into long-term returns.
            
            
            
              Right. Let's say I
            
            
            
              want this tool that's going to cost us $100,000
            
            
            
              a year. Okay, well, what value
            
            
            
              are you going to get from that $100,000?
            
            
            
              How do I explain? Okay, well, this tool helps
            
            
            
              me see things before the customers do.
            
            
            
              It helps me fix things before they break and page out
            
            
            
              our on call overnight. So our engineers get more sleep,
            
            
            
              so they come in and they're not having to flex their schedule because they were
            
            
            
paged the night before. So there's not going to be that falloff in
            
            
            
              productivity from that standpoint. And we
            
            
            
              can start to create automation so that we get
            
            
            
              to the point to where the system is just healing itself. And so how
            
            
            
              do you put a number on that? Well, you define total
            
            
            
cost of ownership and define that, hey, we might be saving
            
            
            
              money by going open source and hosting things ourselves
            
            
            
              on the initial software spend, but if we look at the spend
            
            
            
              overall from our sres and architects and developers and everybody
            
            
            
              that has to put more of their time into
            
            
            
              that platform, okay, that $100,000
            
            
            
investment in the managed
            
            
            
              tool is going to be less than the $400,000
            
            
            
              you're spending on engineering time. So, understanding total
            
            
            
cost of ownership, understanding how to explain total cost of ownership
            
            
            
              has been the biggest way I've been able to sell things
            
            
            
to ask for more money, to sell a tool or
            
            
            
              sell a system or something like that.
            
            
            
So, understanding your total cost of ownership and



understanding how to explain total cost of ownership is crucial.
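
Here's the shape of that math as a tiny sketch, using the talk's illustrative $100k tool; the headcount, loaded cost, and time share are assumptions made up to land near that $400,000 figure:

```python
# Back-of-the-envelope TCO comparison: a managed tool's license versus the
# engineering time spent self-hosting an open source alternative.
managed_tool = 100_000                                # annual license
engineers, loaded_cost, time_share = 4, 200_000, 0.5  # assumed numbers
self_hosted = engineers * loaded_cost * time_share    # ~$400k of people time

print(f"managed:     ${managed_tool:>10,.0f}/yr")
print(f"self-hosted: ${self_hosted:>10,.0f}/yr")
print(f"TCO savings: ${self_hosted - managed_tool:>10,.0f}/yr")
```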
            
            
            
              Being able to put a number on reliability.
            
            
            
              Happy customers equals happy engineers, which equals happy
            
            
            
              bosses, which makes everybody happy.
            
            
            
              The cost of fragility,
            
            
            
              it can break your company, especially if you're
            
            
            
              a web SaaS or you provide a service over the web.
            
            
            
Paying customers expect you to deliver on your part. And if
            
            
            
              you are not able to, because your systems are not up and
            
            
            
              you're not able to fulfill your service
            
            
            
              level agreements, it costs you money, it costs engineering
            
            
            
              talent, and it could cost your business.
            
            
            
              So having those points and
            
            
            
              understanding how to talk about those points is key
            
            
            
in selling chaos engineering. So now
            
            
            
              we understand what we have to do. How do we do it? Well,
            
            
            
I'll start at the top. The benefit and cost analysis



of total cost of ownership is where you're going to start.
            
            
            
              Define that. Get all your data, all your charts and graphs.
            
            
            
I had an old VP who used to say, charts and graphs or it didn't
            
            
            
              happen. I even made stickers. And working at
            
            
            
a software observability company, that kind of makes sense, right?
            
            
            
              But it's true, right? Be able to prove your points,
            
            
            
              do your homework, bring your case files,
            
            
            
              act like a lawyer does when they're trying to persuade a jury their
            
            
            
              case, and have all of your detailed documents, notes.
            
            
            
Do surveys with your developers if you need to;



do the things that you need to do to make the sale.
            
            
            
Another way to gain buy-in is to just ask for small pilot programs.



Just ask for something small: hey, I want to try this
            
            
            
              new way of doing things in our test
            
            
            
              sandbox environment. Is that okay? Start with a win.
            
            
            
              Iterate on that win,
            
            
            
              demonstrate that success early, and demonstrate
            
            
            
              that you can control the blast radius
            
            
            
              and can control the failure. And demonstrate
            
            
            
              the ability to say, oh, okay, well, I'm doing this in a
            
            
            
              scientific manner to gather more data
            
            
            
              to make our system better.
            
            
            
              Be able to clearly articulate those
            
            
            
              things and speak their language. Right. If you're having to
            
            
            
              sell this to a higher exec that's a business leader,
            
            
            
              maybe lean less on the technical jargon and more on
            
            
            
              the business value jargon. Or if you're having to do this to your CTO and
            
            
            
              he's going to want to know, okay, what types of failures are you going to
            
            
            
              bring? Are you going to do latency? How is this going to affect us long
            
            
            
              term? Are you going to have the ability to completely break
            
            
            
              everything and not have it be recovered?
            
            
            
What's your backout plan? Things of that nature. Speak the language of
            
            
            
              the leader that you are pitching to,
            
            
            
and then the last thing is just iterate and improve.
            
            
            
              Iteration is the key to this.
            
            
            
              Being a DevOps engineer, we build pipelines, and I get
            
            
            
              really excited when I get new errors because new errors mean
            
            
            
              new improvement. So iterate,
            
            
            
              improve, take baby steps, don't do
            
            
            
              big things all at once. Start small and work up to
            
            
            
              the larger events and the larger scenarios. Don't just go off
            
            
            
blasting right away.
            
            
            
              Like I said, this is one of the hardest things to ask for, because you
            
            
            
              are literally asking to break things, but you're
            
            
            
              asking to break things to make them better. So be
            
            
            
              sure to include that last part.
            
            
            
              Now, how do you
            
            
            
              go about doing this? So I want to talk about a time where I was
            
            
            
at Splunk and we were going through
            
            
            
a very large migration. We were basically going through the



transition from Ansible and Chef to



Puppet and Terraform. And so we were rebuilding
            
            
            
              our whole provisioning system. And in that migration, we were also
            
            
            
moving to the Graviton instances and moving from our D series to our
            
            
            
              I series. And so we had to
            
            
            
              basically move our whole fleet, 15,000 instances.
            
            
            
              I think we did it in like eight and a half months. It was a
            
            
            
              crazy big project, and I was tasked with the largest customer,
            
            
            
              the biggest, most snowflaked customer we had, with all
            
            
            
              the custom configurations. They had
            
            
            
              a lot of things that most customers did not because they were the
            
            
            
              first customer to give us a million dollars. So they
            
            
            
got what they wanted, right? And it was imperative
            
            
            
              that that migration went absolutely perfect.
            
            
            
Half of their stack, their security side, was



already moving off of our managed team. So if
            
            
            
              we failed on this, we could have lost a lot
            
            
            
              of revenue. And so our chief cloud officer was
            
            
            
              very interested in making this a success. And so I
            
            
            
              had to basically ask for $5 million to make
            
            
            
              sure it was a success. I asked to duplicate
            
            
            
              their environment in another environment so that I could fully test
            
            
            
              and fully break things and see how I would recover if I did break things
            
            
            
              during the migration and have my backout plan ready
            
            
            
              to a T. So I asked, I put all the necessary
            
            
            
              data in front of him, told him it's going to take me three months.
            
            
            
              This is our plan. We're going to replicate four petabytes of data.
            
            
            
              We're going to execute the migration probably two or
            
            
            
              three times. And then once we get that migration executed,
            
            
            
              we're going to document everything that we think
            
            
            
              will happen, and then everything that happened during that migration,
            
            
            
              because when you're dealing with big data systems,
            
            
            
              it's very difficult to predict things that are going to
            
            
            
happen, especially in the cloud. So I did that. I did all the selling,
            
            
            
              ended up executing the migration flawlessly
            
            
            
              right through chaos engineering because I was
            
            
            
              asking to test and asking to break things on purpose in
            
            
            
              a controlled environment that would not impact the customer.
            
            
            
              And that investment of $5 million ended up turning
            
            
            
a ten x margin, because now we're using Graviton.



Instead of 400 d2.4xlarge instances,



we're now only using 80 I series instances.
            
            
            
              So we got better performance, better cost.
            
            
            
We're saving way more money using the newer instances,
            
            
            
              and it's a better customer experience. And I don't think that
            
            
            
              migration would have happened had we not done those tests.
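
Just to show the shape of that consolidation math: the 400-to-80 instance counts are from the migration, but the hourly rates below are invented placeholders, not real AWS pricing:

```python
# Rough fleet-consolidation math: 400 older D-series instances replaced
# by 80 newer instances. Rates are placeholders for illustration only.
old_count, old_hourly = 400, 3.00
new_count, new_hourly = 80, 4.50
hours = 24 * 365

old_cost = old_count * old_hourly * hours
new_cost = new_count * new_hourly * hours
print(f"old fleet: ${old_cost:,.0f}/yr   new fleet: ${new_cost:,.0f}/yr")
print(f"annual savings: ${old_cost - new_cost:,.0f}")
```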
            
            
            
              So that's a $20 million swing, right? Like,
            
            
            
              had we failed and had we not spent that $5 million up
            
            
            
              front, we would have not been successful. And so
            
            
            
              I did experience some catastrophic failures
            
            
            
              through my first test when we were doing that. And so we figured that
            
            
            
              out on the practice environment and we learned
            
            
            
              from our mistakes and we grew. And so it's
            
            
            
hard to ask for money and to ask for failure, but if you do
            
            
            
it in the right way and you do your homework and you present
            
            
            
              the data at hand and show the total cost and what
            
            
            
              happens in the long run, with resilience, it goes a
            
            
            
              long way. So I highly suggest asking
            
            
            
              for this and asking to be able to
            
            
            
              do things like this and to be able to test and to run your red
            
            
            
              team exercises and to practice your incidents.
            
            
            
Practice, practice, practice. Firefighters, military,
            
            
            
Navy SEALs spend more time training for their missions
            
            
            
              than they do actually executing the missions. Right, because they want it to
            
            
            
              be perfect. When you're doing your technical operations,
            
            
            
              you want them to be perfect. And the only way to do that is practice.
            
            
            
Some of the tools that I've used in the past:
            
            
            
Chaos Monkey; Harness and their chaos engineering platform,



they have the open source Litmus, and then they also have an enterprise level.
            
            
            
There's Gremlin, which is another great tool. And then Chaos
            
            
            
              Mesh is a cloud native tool that I really like as well.
            
            
            
              Check these out. And that's
            
            
            
              the show. Thanks for taking the time to jump
            
            
            
              into this talk. I really enjoy talking about chaos
            
            
            
              engineering and having influence.
            
            
            
And yeah, check out my blog, chaoskyle.com. I talk about mental
            
            
            
              health, I talk about platform engineering, resilience,
            
            
            
              chaos engineering, all that. So thanks
            
            
            
              for having me.