Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
Well, hello and welcome to my talk, LMAO Helps During Outages. I'm Richard Lewis, a senior DevOps consultant with 3Cloud Solutions, and I've had the pleasure of working with both software development teams and operations teams. I currently have over 20 years of experience working in this industry. I'm also the co-organizer of a Chicago enthusiast user group, and as you can tell from everything around me, I'm a diehard White Sox fan.
            
            
            
So, a little bit about my company: 3Cloud is the largest pure-play Azure services partner in the world. We have about 600 people who are dedicated Azure professionals, focused on data and analytics, app innovation, and helping clients provide a modern cloud platform.
            
            
            
So, LMAO, as you can probably guess by now, is not just laughing your way through the outage; it's actually an acronym for something else. It's an acronym for Logs, Metrics, Alerts, and an Observability tool, and these are the things you need to have to make up an LMAO strategy. So what am I actually talking about today? I'm talking about providing platform support strategies for your team members, creating a standard for knowledge sharing, helping you reduce the mean time to resolution, and building the psychological safety of your team members so that they're able to LMAO their way through outages.
            
            
            
So, if you're not familiar with logs and metrics, those are the things that are going to give you insight into what is happening and when. They're going to help you figure out how many errors occurred, how many requests you got, and the duration of those errors and requests.
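To make that concrete, here is a minimal sketch, in Python and entirely an illustration rather than anything shown in the talk, of emitting those signals: a structured log line per request plus simple counters for requests, errors, and duration. The service name and fields are hypothetical.

```python
# Minimal sketch (not from the talk): emitting the kinds of signals described
# above -- counts of requests and errors, plus durations -- with the standard library.
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout-service")  # hypothetical service name

metrics = {"requests": 0, "errors": 0}

def handle_request(payload):
    """Pretend request handler that records what happened, when, and how long it took."""
    start = time.monotonic()
    metrics["requests"] += 1
    status = "ok"
    try:
        if not payload.get("sku"):
            raise ValueError("missing sku")
    except ValueError:
        metrics["errors"] += 1
        status = "error"
    finally:
        duration_ms = (time.monotonic() - start) * 1000
        # Structured log line: machine-parseable, so an observability tool can
        # later aggregate error counts, request counts, and durations.
        log.info(json.dumps({
            "event": "request_handled",
            "status": status,
            "duration_ms": round(duration_ms, 2),
            "totals": metrics,
        }))

handle_request({"sku": "ABC-123"})
handle_request({})  # triggers the error path
```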
            
            
            
Now, as for alerts, they come in two main forms. Pages are critical things like "hey, our whole network is down" or "hey, our website's offline." Tickets are things like "hey, the hard drive that runs our website is at 80% full, or 85% or 90% full," or "we're seeing slowness in our ecommerce website." Not an outage, just some general slowness, just some pain.
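As a rough illustration of that split (a sketch of the idea, not something from the talk), the routing logic can be as simple as mapping severity to a delivery channel. The severity labels and alert names below are assumptions.

```python
# Illustrative sketch: route alerts as either a "page" (wake someone up)
# or a "ticket" (deal with it during business hours).
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    severity: str      # "critical" or "warning" -- assumed labels
    value: float = 0.0

def route(alert: Alert) -> str:
    # Whole-service failures page a human; capacity and latency warnings
    # become tickets so nobody is woken up for an 80%-full disk.
    if alert.severity == "critical":
        return "page"
    return "ticket"

print(route(Alert("website_down", "critical")))           # -> page
print(route(Alert("disk_usage_percent", "warning", 85)))  # -> ticket
```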
            
            
            
And like most of you, I've experienced alert trauma. This is me at my first IT job, circa 2011, when I was supporting an Access 97 application. I was the only one doing this. I was on an on-call rotation 365 days a year, for 16 hours a day, because it was also a call center and they worked two shifts. I had no way to connect to the office remotely. Luckily I lived within 10 miles of the office, so whenever there was an outage I would get paged, using actually this pager right here that you see on the left side of your screen, and I would respond to it. My company had no logging framework, no concept of a logging framework. And within six months, you guessed it, I burned out. But I didn't quit. I let my boss know what was going on, my thoughts and opinions. My boss was very receptive to those thoughts and opinions, and we worked on putting together an LMAO strategy. We tackled the issue of the logging framework first; we used log4net. We tackled the on-call rotation issue by hiring additional people and spreading that load out, and we created a schedule for when we were going to do these things.
            
            
            
So managing your alerts effectively requires certain things. As you guessed, and as I just said, scheduling your team members appropriately is an important part of it. You can't have a person on call 24 hours a day, 365 days a year; it's just not effective, and they're going to leave. Then they take their tribal knowledge with them, and now you're stuck in that position all over again. Avoid alert fatigue wherever possible: if people don't need to be woken up or paged about something, let's not page them. Collect data on the alerts that are actually going out and look into ways to reduce those alerts; those are continual improvement opportunities. If the alerts going out are because the servers have hit high CPU, look at things like autoscaling. Or if it's an ecommerce website and all of a sudden the traffic has gone through the roof, say every day between certain hours the website's traffic increases, then maybe look at autoscaling there as well. Or if it's a hard drive, as I said earlier, that has hit a certain capacity threshold, then look at some kind of runbooks or hooks that can automatically heal those kinds of things. And be cautious when you're introducing new alerts. Every alert has a purpose, and that's, that's right, to alert you. So be very cautious about the alerts you're introducing and the impact they're going to have on your team.
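For the disk-capacity example, a "self-healing" runbook can be as small as the sketch below. This is an illustration, not the speaker's implementation; the threshold and the cache directory it purges are assumptions.

```python
# Minimal sketch of a self-healing runbook: when disk usage crosses a
# threshold, reclaim space automatically instead of paging a human.
import shutil
from pathlib import Path

THRESHOLD_PERCENT = 80              # the capacity level mentioned in the talk
TEMP_DIR = Path("/tmp/app-cache")   # hypothetical directory that is safe to purge

def disk_usage_percent(path: str = "/") -> float:
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

def run_cleanup_runbook() -> None:
    percent = disk_usage_percent()
    if percent < THRESHOLD_PERCENT:
        print(f"Disk at {percent:.1f}% -- nothing to do")
        return
    # Automated remediation step: delete old cache files.
    if TEMP_DIR.exists():
        for file in TEMP_DIR.glob("*.tmp"):
            file.unlink(missing_ok=True)
    print(f"Disk was at {percent:.1f}% -- cleanup runbook executed")

if __name__ == "__main__":
    run_cleanup_runbook()
```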
            
            
            
Observability, on the other hand, is a little bit different. Observability is more the voyeurism of it: you get to see what's actually going on, on a nice pretty screen or dashboard or something like that. It's going to be helpful for you to understand your KPIs or SLAs. You can monitor your usage of different systems the same way and spot trends. Dashboards definitely help spot trends; that's where you see products like Power BI and Tableau and things like that, so you can spot trends. Observability dashboards do the same thing.
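As a quick, purely illustrative sketch of the kind of KPI or SLA figure a dashboard would surface, here is how availability and slow-request counts might be derived from raw request data. The numbers and thresholds are made up.

```python
# Turn raw request records into dashboard-style KPI numbers.
requests = [
    {"status": 200, "duration_ms": 120},
    {"status": 200, "duration_ms": 340},
    {"status": 500, "duration_ms": 45},
    {"status": 200, "duration_ms": 90},
]

good = sum(1 for r in requests if r["status"] < 500)
availability = good / len(requests) * 100
slow = sum(1 for r in requests if r["duration_ms"] > 300)

print(f"availability: {availability:.1f}% (an SLA target might be 99.9%)")
print(f"requests slower than 300 ms: {slow} of {len(requests)}")
```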
            
            
            
And when it comes to observability tools to help you with your observability dashboards, there are hundreds of products out there. I actually took this screenshot here from the Cloud Native Computing Foundation's website. Under their monitoring section, they highlight a ton of different products. This is not exclusively all the products out there, but it is a wide range of the different types of products that you can use for observability. I'm sure there are at least one or two products on this screen, if not more, that you're probably using in your current place. Looking at this screen, I counted at least five or six, okay, more like nine, products that I regularly recommend to clients and that are helpful for their different situations. I don't think there's one size that fits all. I think you want to use the appropriate tool, at the appropriate cost, at the appropriate time; they all vary in costs and trade-offs.
            
            
            
This is actually a dashboard here from New Relic, and this dashboard highlights, in the left corner, synthetic failures and violations of policies that they have in place. I believe it's just a coincidence that it works out to 13 policy violations and 13 synthetic failures; that correlation could simply be because every time that synthetic failure policy is violated, it registers as a new violation, so the numbers line up at that point in time. But the top one is the violations of policies, and the one below it is the synthetic failures. Synthetic failures are usually generated by some third-party tool monitoring or testing your system. That would be something like an alive-or-dead call to see if your service is up or down, or checking how long it takes for your website to load using some kind of framework like Selenium. The next thing we see is the errors that are occurring per minute. So now we know how many failures have happened and the rate of failures we're seeing per minute. And this makes me wonder: what did we change recently in our system that caused this problem? That's where you see this here; these are actually the deployment notes, and luckily they're writing good release notes, I'm assuming. That's how you know what's in the most recent release or what has a possible impact on what's going on here. I doubt that the readme with endpoints is causing the problem, but it could be.
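A synthetic check like the ones feeding that dashboard can be sketched in a few lines. The example below is an illustration using only Python's standard library rather than Selenium or New Relic's tooling, and the health-check URL is hypothetical.

```python
# Sketch of a synthetic "is it alive, and how fast" probe against an endpoint.
import time
import urllib.error
import urllib.request

URL = "https://example.com/health"  # hypothetical health endpoint

def synthetic_check(url: str = URL, timeout: float = 5.0) -> dict:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code          # server answered, but with an error code
    except (urllib.error.URLError, OSError):
        status = None              # no answer at all (down, DNS failure, timeout)
    return {
        "url": url,
        "up": status is not None and 200 <= status < 400,
        "status": status,
        "duration_ms": round((time.monotonic() - start) * 1000, 1),
    }

print(synthetic_check())
```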
            
            
            
This is another dashboard here, and this one's actually made by a company called Grafana. I really like this dashboard; it's a good example of being able to embed a wiki and documentation directly into the dashboard, so that whoever is on call can just pull up the dashboard related to that application and click a link to get from here to wherever they need to go: to see a true source of truth, to access details about who to contact, or to find details related to a third-party service, or something like that. There's also a diagram in here, showing how you can load images into it of what that service actually looks like, like that service's path and architectural diagram. And we're highlighting here the ups and downs of the service, so now we know the frequency of the service going up or down. That's followed by a view over time of the status codes being sent out from this service, so we see a combination of 500 and 200 status codes.
            
            
            
So the next thing that's really important to think about is preparing your team for those outages. Having a playbook is one of the critical and key things, and so is practicing for those outages. Having a playbook comes down to what you actually put in it. As I've noted here, it's important to have it in a location where it can be quickly accessed. Sometimes you may want to put your playbooks internally, and I'm kind of 50/50 on those kinds of situations. If your internal system has to be accessed through a third-party system that may itself be having an outage, then you're going to delay yourself getting into your playbook. So if you're using something like Azure AD to authenticate into your company network, then if there's a problem with Azure AD, it's going to delay you getting to your playbook. I like other systems as well, still tied together using SSO, single sign-on, but tied to other systems like Atlassian's Confluence or a third-party wiki system, or SharePoint (so back to Azure AD again), or Microsoft Teams, which has a wiki system built in if you're a user of that as well: somewhere you can keep it outside of your network but still accessible, which lessens the likelihood of having an issue there.
            
            
            
But inside those playbooks, you want to put things like links to your application that may be related to the observability tools you're using, as well as details about the golden signals of that application. That way, when a person is looking at what's going on and hearing from the users about what's going on, they're able to say, "this is within the normal range" or "this is not within the normal range." Any relevant notes or information from previous outages also helps you tie things back together; so if you're doing a post-mortem after the outage, putting in a link to the post-mortem notes is quite helpful. Include contacts for the application owner, as well as for any third-party services and their owners. Say your applications run on something like Azure or AWS: then include links to the premier support contact information, so that whoever is on call knows how to get a hold of them to escalate and get the right people on the call properly, rather than having to call a manager and ask, "who do I call about this?" Or links to things like Stripe's website, if you have a payment service that may be causing a failure or something like that as well. And anything else you may think of; there's a ton of things you may want to put in your playbook around that application. I do suggest, though, dedicating a playbook per application as opposed to doing just a single playbook of hundreds or thousands of things. You can put all of your playbooks together in the same system, but you want to have them broken out by section at least.
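As one way to picture a per-application playbook entry, here is a small sketch based on the items listed above: dashboard links, golden-signal ranges, previous outage notes, and contacts. The structure, field names, URLs, and values are all hypothetical.

```python
# Hypothetical shape of a per-application playbook entry.
from dataclasses import dataclass, field

@dataclass
class Playbook:
    application: str
    dashboard_links: list = field(default_factory=list)
    golden_signals: dict = field(default_factory=dict)        # normal ranges
    previous_outage_notes: list = field(default_factory=list) # post-mortem links
    contacts: dict = field(default_factory=dict)               # owners, vendors

checkout = Playbook(
    application="checkout-service",  # illustrative application name
    dashboard_links=["https://grafana.example.com/d/checkout"],
    golden_signals={
        "latency": "p95 under 300 ms",
        "traffic": "200-400 requests/min during business hours",
        "errors": "under 0.5% of requests",
        "saturation": "CPU under 70%, disk under 80%",
    },
    previous_outage_notes=["https://wiki.example.com/postmortems/2023-06-14"],
    contacts={"owner": "payments team on-call", "cloud": "Azure premier support line"},
)
print(checkout.application, "->", checkout.golden_signals["latency"])
```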
            
            
            
And as I was talking about preparation and training, I know I mentioned the importance of this before, but I come from the Midwest of the United States, where we have a lot of tornadoes. So as kids we were trained, like you see on your screen here, when we hear that siren, to go into the hallway, put our hands over our heads, and curl up into a little ball, in preparation for a tornado coming through the area if one were to happen. But we trained for it, and we knew what to do as muscle memory.
            
            
            
So the practice of chaos engineering is something still being worked out, though it's been around for a while; the concept was created by Netflix. The goal is to help you increase resiliency, and you're able to identify and address single points of failure early. What you're doing is running controlled experiments against your system and predicting the possible outcome. That outcome could actually happen, or it could not, and that's where the chaos engineering comes into play: you don't really know if it's going to happen or not, but the goal, in the end, is to identify your failure points and address them, so that if something were to happen at those failure points, you would be able to sustain it. There's a great article about how Netflix practices their chaos engineering; I've put a link below if you want to go take a look at it.
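The controlled-experiment loop can be sketched very simply: establish a steady-state hypothesis, inject a failure, then check whether the hypothesis still holds. The toy example below is an illustration of that loop, not Netflix's or Gremlin's tooling; the error budget and the simulated measurement are assumptions.

```python
# Toy sketch of a chaos-engineering experiment loop.
import random

ERROR_BUDGET = 0.01  # hypothesis: error rate stays under 1% even during failure

def measure_error_rate() -> float:
    # Stand-in for a real measurement pulled from your observability tool.
    return random.uniform(0.0, 0.02)

def inject_failure() -> None:
    print("experiment: terminating one instance of the service (simulated)")

def run_experiment() -> None:
    print("steady state assumed healthy before the experiment")
    inject_failure()
    observed = measure_error_rate()
    if observed < ERROR_BUDGET:
        print(f"hypothesis held: error rate {observed:.2%} stayed within budget")
    else:
        print(f"failure point found: error rate {observed:.2%}; fix it before it happens for real")

run_experiment()
```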
            
            
            
And after those outages, you want to take the time to do a post-mortem, usually within a day or so of the outage, while it's still fresh in everyone's mind. You want to get everyone together around the conference table and just talk through what went right, what went wrong, and where you got lucky, and figure out what needs to be documented in preparation for a possible future outage. I really like this quote here from Devon with Google, and it is: the cost of failure is education.
            
            
            
So, thinking back to my first on-call rotation job: one of the things I was required to do regularly was ride-alongs with appliance technicians. When we would go out to customers' houses, the customer would have something broken that they may have tried to fix themselves, but they may have done it wrong, not fully followed the instructions they got from the manufacturer, or not fully listened to the YouTube video they were following and missed some key details. So we would be able to quickly resolve those issues within a matter of minutes, and the cost of education in that case was our service call fee. The customer would learn something new, they would learn how to fix that problem in the future, and at the same time it gave them the ability to get their system back up and running really quickly. So the cost of failure is education. It's a good quote.
            
            
            
So my takeaways from my talk today are pretty simple. Have an LMAO strategy in place; have it documented and ready to go, so everyone knows where it's at. Update those documents regularly: after your outages, go back and update them, and put a revision date on those documents so you know the last time they were updated. If they haven't been updated in six months, either you're not having outages around that system or you're not documenting what's happening with that system, so good or bad there. Avoid alert fatigue: the less alert fatigue you have, the more psychological safety you're building into your people and the more comfortable they are. They know where their documents are and they're able to go forward from there, and they're less likely to want to leave your organization and take that tribal knowledge with them; it'll cut down on turnover and everything. And run readiness preparation drills regularly. Chaos engineering, again, is a newer thing, but there are a lot of tools out there that can help you. Gremlin makes some great products, with great documentation available from them. Microsoft has a great chaos engineering product as well, with great documentation to help you think about ways to do these things.
            
            
            
And thank you so much for listening to me today. Thank you for your time, and enjoy the rest of the conference.