Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
              Hey, everyone.
            
            
            
              My name is Matthias Palmersheim, and today I want to talk to you about how to monitor
            
            
            
              your monitoring and why it's important.
            
            
            
              A little bit about me.
            
            
            
              I'm a solutions engineer at VictoriaMetrics, and VictoriaMetrics



              makes an amazing time series database called VictoriaMetrics.



              We also make some utilities for getting your data not only into VictoriaMetrics,



              but other time series databases as well.



              We are also starting to make a tool called VictoriaLogs for aggregating logs.
            
            
            
              And when I'm not working on that at my day job, I like to take a lot of the
            
            
            
              utilities we make, as well as some others, and aggregate them together into an easy
            
            
            
              to deploy monitoring and logging system.
            
            
            
              So that way you don't have to have a giant Kubernetes cluster or an expensive cloud
            
            
            
              service to get started with observability.
            
            
            
              So if you couldn't tell by that intro, I'm a huge observability nerd.
            
            
            
              And actually, on my first date with my wife, I wouldn't
            
            
            
              shut up about observability.
            
            
            
              And there was a betting pool in the office to see if I would talk
            
            
            
              about observability the whole time.
            
            
            
              And the reason why I became obsessed with observability is one of my first IT jobs.
            
            
            
              I was an IT manager at a factory.
            
            
            
              And when I got there, I didn't really have any monitoring system, so to speak.
            
            
            
              And what would happen is an application would go down because I didn't have
            
            
            
              a monitoring system to help me



              prevent that from happening in the first place and to get alerts for things like,



              "Hey, a hard drive is filling up." The hard drive would just fill up, and an



              application would fail. My alerting system at the time was users coming into



              my office and informing me that, "Hey, this change that was made recently took



              down the application, and now the whole office can't work, and that's costing the



              company tens of thousands of dollars an hour." That caused a lot of stress.
            
            
            
              So along with not having context as to why things were broken, I had to deal with the
            
            
            
              stress of having somebody there reminding me how expensive this problem was.
            
            
            
              So that I could get nice automated notifications instead of



              angry users in my office,



              I implemented a monitoring system that collected telemetry, and that



              telemetry could tell me that, "Hey, this resource is overused," or "This



              thing is being slow," before it



              failed. And although I was angry that now I had alerts, things
            
            
            
              weren't going down as often.
            
            
            
              And when people would come into my office after things broke, I would usually



              already be working on the problem, and that helped lower stress a ton.



              But I had a problem, though.
            
            
            
              A lot of the notifications I was getting from my monitoring system
            
            
            
              were coming through the same channels as things like status updates or



              general announcements like, "Oh, hey, you need to



              park in the other parking lot today."
            
            
            
              And so that led to a lot of alert fatigue because every time I got
            
            
            
              an email, I got the stress of: is this just a simple update that



              I need to park somewhere else?



              Or is this a tens-of-thousands-of-dollars-an-hour problem?
            
            
            
              So I looked into noisy notification systems that were
            
            
            
              dedicated for critical alerts.
            
            
            
              So it was super easy to set up a



              Do Not Disturb (DND) override on my phone.



              And the other nice thing about this is that it would make
            
            
            
              an individual responsible.
            
            
            
              So instead of just shouting out, "Hey, the system's broken, somebody should fix this,"



              it would assign somebody to the incident.



              And the way I decide if something is an important



              notification that should be able to override the notification



              preferences on my phone is whether it costs life, liberty, or property.
            
            
            
              So if it could cause physical harm when the system is down, if it causes compliance



              issues and the government's going to get involved, or if it can cause loss



              of property. And property could just mean that it reaches a certain cost threshold
            
            
            
              and it's losing the company more money than you would like for an outage.
            
            
            
              So I was starting to feel pretty happy about my existing observability system.
            
            
            
              I was able to get noisy notifications when things were breaking instead of email.
            
            
            
              But then I had a philosophical question:



              What happens if the monitoring system fails?
            
            
            
              So if my monitoring system fell over at 2 a.m.,



              does anyone get paged?
            
            
            
              So this is the problem that I was running into where the
            
            
            
              monitoring system would fail.
            
            
            
              And I would think that everything's fine, whether or not the applications
            
            
            
              that the monitoring system was responsible for were up.
            
            
            
              And to solve this problem, you just deploy a second monitoring system.
            
            
            
              So you set up a monitoring of monitoring system that only monitors
            
            
            
              the primary monitoring system.
            
            
            
              And then from the perspective of the primary monitoring system,
            
            
            
              it's just another application.
            
            
            
              So it's just like the ERP system.
            
            
            
              It sends metrics and it alerts you when things aren't going as expected.
            
            
            
              There's a problem with this.
            
            
            
              The monitoring of monitoring system is also just another application that can



              fail, and multiple applications can fail at the same time for whatever reason.



              So this problem could



              just be solved by adding more and more monitoring systems, but you never quite



              get to 100 percent availability. You get more nines, but with more nines of



              availability you also get more cost, and it's harder to maintain the knowledge



              if you have all those monitoring systems. So usually the sweet spot is two. But



              you don't just deploy two applications to the same region and in the same



              infrastructure-as-code inventory. You need to make sure that the applications are
            
            
            
              deployed in a way that they're isolated from failures as much as possible.
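The "more nines" trade-off described above can be illustrated with a little arithmetic. The sketch below is not from the talk: the function names and the 99 percent figure are hypothetical, and the calculation assumes the systems fail independently of each other, which is exactly why the isolation advice matters. If both systems share a region, an inventory, or an upgrade window, the real number will be worse.

```python
def combined_availability(availabilities):
    """Probability that at least one system is up.

    Assumes failures are fully independent, which only holds if the
    systems share no infrastructure, humans, or change windows.
    """
    p_all_down = 1.0
    for availability in availabilities:
        p_all_down *= 1.0 - availability
    return 1.0 - p_all_down


def blind_minutes_per_year(availability):
    """Expected minutes per year with no working monitoring at all."""
    return (1.0 - availability) * 365 * 24 * 60


if __name__ == "__main__":
    # Hypothetical example: each monitoring system is up 99% of the time.
    for count in (1, 2, 3):
        total = combined_availability([0.99] * count)
        print(count, "system(s):", f"{total:.6f}",
              f"{blind_minutes_per_year(total):.1f} min/year blind")
```

Each extra system adds nines but never reaches 100 percent, and each one also adds cost and maintenance, which is the reasoning behind stopping at two.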
            
            
            
              And before I get into all the ways that this can go wrong, I'm not implying that
            
            
            
              you and your teammates aren't smart.
            
            
            
              I'm saying that humans aren't perfect.
            
            
            
              We make mistakes, but usually different groups of humans don't
            
            
            
              make mistakes at the same time.
            
            
            
              So, as much as possible, you want a different group of humans to be responsible for your monitoring



              of monitoring.
            
            
            
              And another thing you need to do is to make sure that your change management
            
            
            
              processes are aware that these two systems shouldn't be updated at the same
            
            
            
              time, because upgrades or configuration changes are a frequent source of outages.
            
            
            
              So if you touch both systems, if you're allowed to touch both systems
            
            
            
              at the same time, then there's a high likelihood both fail at the same time.
            
            
            
              So the different technical ways that



              we can prevent both systems from failing at the same time



              are to use different notification services, because again, notification services



              are apps, and apps fail sometimes.



              You also want to make sure that they're in different infrastructure providers,



              because usually different infrastructure providers don't have simultaneous outages.
            
            
            
              If you can't get it inside of a different infrastructure provider or a different
            
            
            
              cloud service, at least try to get it inside of different regions and separate
            
            
            
              deployments within the same cloud service.
            
            
            
              So in summary, your monitoring of monitoring is another monitoring
            
            
            
              system, but it's only responsible for the primary monitoring system because
            
            
            
              it gets really confusing if you have different monitoring systems available
            
            
            
              for certain business applications.
            
            
            
              It can be harder to test the system too, because not only do you have to
            
            
            
              get the okay from your boss to test and intentionally fail a system, you have to



              get approval for a test that could have impacts on other teams as well.



              And the monitoring of monitoring system is just treated by the monitoring
            
            
            
              system as another application.
            
            
            
              So the minimum requirement is to do the most important thing in
            
            
            
              observability, which is make sure that your applications are available.
            
            
            
              This is going to be cheaper to store and easier to configure, because in
            
            
            
              most cases you're just configuring a connection between two things or
            
            
            
              feeding a list of URLs to some service and it's performing health checks.
            
            
            
              But the downside is that you don't get a



              lot of preventative alerts, other than maybe that the application is being slow.



              And so what happens if you don't get those contextual things is that when
            
            
            
              the application goes down, you have a much higher mean time to resolution
            
            
            
              because you have to figure out what went wrong as well as fix what went wrong.
            
            
            
              And again, this is responsible for keeping the most important
            
            
            
              applications in your business online.
            
            
            
              So you really should



              treat it like another application and give it those preventative measures, give it
            
            
            
              really nice dashboards, have run books and all those other things that every
            
            
            
              other application in your business gets.
            
            
            
              And even if you are treating it like another application, your
            
            
            
              applications should have health checks that alert you if they fail.
            
            
            
              So what are the approaches we can take to do this?



              I'm gonna get into some quick and dirty approaches that, if you



              can't get a full-blown dedicated monitoring of monitoring system,



              are better than nothing. So the first instance of that is gonna be a heartbeat.



              A heartbeat is just one system communicating with another, and if that



              communication doesn't happen for a certain period of time, an alert will fire.



              This is, again, the simplest up/down check that you could get: has communication



              between these two systems happened?
            
            
            
              And the downside is that the heartbeat is usually like a dedicated health check.
            
            
            
              It's not as accurate of a representation as some of the other



              health checks we're going to go through a little later in the talk.



              So a good example would be the aNag app with Nagios.



              And that's just, if the aNag application hasn't contacted the server in a
            
            
            
              certain amount of time, it will fire.
            
            
            
              There's obviously false positive risks with that because your phone
            
            
            
              could be having issues connecting to LTE or something like that.



              And the other problem is that's just shouting into an area saying, "Hey,



              this is broken," rather than doing what they do in CPR training, which is



              you point to somebody and say, "You, call an



              ambulance," or "You, get a defibrillator."
            
            
            
              Another example of this is Grafana OnCall.



              If you self-host Grafana OnCall, you can sign up for a cloud account.
            
            
            
              And then if the self hosted version doesn't talk to the cloud
            
            
            
              account for a certain amount of time, then you get a notification
            
            
            
              that there's a missing heartbeat.
            
            
            
              But this only covers the notification system and not everything as a whole.
            
            
            
              So this kind of works, but you should probably look into something better.
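To make the heartbeat idea concrete, here is a minimal sketch of the receiving side, sometimes called a dead man's switch. It is not taken from aNag or Grafana OnCall; the class name, the timeout value, and the injectable clock are all hypothetical choices made for illustration.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat and reports when it has gone missing."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock            # injectable so it can be tested
        self._last_beat = clock()      # treat startup as the first beat

    def beat(self):
        """Call this whenever the monitored system checks in."""
        self._last_beat = self._clock()

    def is_missing(self):
        """True when no heartbeat arrived within the timeout."""
        return self._clock() - self._last_beat > self.timeout_seconds


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=300)
    monitor.beat()
    if monitor.is_missing():
        print("No heartbeat from the primary monitoring system")
    else:
        print("Heartbeat is current")
```

Note what this does not tell you: it only knows that a message arrived, not whether the sender is actually collecting metrics or evaluating alerts, which is the limited context the talk warns about.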
            
            
            
              If you're using a cloud vendor for your monitoring, then usually they have a



              status page, and hopefully they're not self-hosting it, because if you are



              self-hosting your own status page, then there's a higher likelihood that both



              the application and the status page fail at the same time, because there could be



              overlap in the infrastructure or overlap in the humans that causes both systems to



              fail. But status pages are really easy to set up.
            
            
            
              If you just search online for your cloud vendor's status page, they'll



              give you this information and give you point-and-click instructions



              for getting this into email or Slack. But the downside is that those



              aren't noisy notification systems.



              People commonly will mute things, and it's really tricky to get the



              settings just right so that status updates, or people asking for status



              updates, don't bother you after hours, but your cloud vendor going down would.



              And this usually doesn't work for self-hosted solutions.
            
            
            
              A bit of a better version of this is going to be health check services.
            
            
            
              So if you're self-hosting, you could self-host something like Uptime Kuma, but



              getting another team to host this inside of your organization is going to be tricky



              because you have to convince people



              that, "Hey, I know I'm on the observability team, but this team that isn't the



              observability team should manage the service that does observability things."
            
            
            
              There are also cloud options for this, but again, if you're self-hosting,



              this can be really tricky because you either have to allow access to your



              monitoring application, which can be a security risk and getting the security



              team on board with this could be a problem, or you have to manage an



              agent inside of your infrastructure



              to beacon out information to that service as well.



              But these do hook into noisy notifications,



              and they do require a bit of extra configuration because it's
            
            
            
              not just subscribing to a feed or setting up a simple heartbeat.
            
            
            
              You do have to give a list of URLs and, if possible, you should give the correct



              responses, because sometimes you get a 200 HTTP code and a valid SSL certificate,



              but on that web page or in the response, it says, "Hey, even though I'm



              serving HTTP traffic, I'm not healthy."



              So if you can configure it to look for a string in a response, definitely set that up.



              And again, all of these have minimal context too. It's just a simple "is the thing



              working or not," and maybe a response time.
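As a rough sketch of the "look for a string in the response" advice, the check below treats an endpoint as healthy only when it returns HTTP 200 and the body contains an expected marker. This is not how Uptime Kuma or any particular service implements it; the function names, the example URL, and the expected text are hypothetical.

```python
import urllib.error
import urllib.request


def is_healthy(status_code, body, expected_text):
    """A 200 alone is not enough: the body must also contain the marker."""
    return status_code == 200 and expected_text in body


def check_url(url, expected_text, timeout=5.0):
    """Fetch a URL and apply is_healthy; any network error counts as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return is_healthy(response.status, body, expected_text)
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, DNS failures, refused connections,
        # TLS problems, and timeouts.
        return False


if __name__ == "__main__":
    # Hypothetical health endpoint of the primary monitoring system.
    ok = check_url("https://monitoring.example.com/health", "OK")
    print("healthy" if ok else "unhealthy")
```

Running a check like this from outside the primary system's infrastructure is what gives it value; run from the same cluster, it shares the same failure modes.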
            
            
            
              So the first system that I would recommend,



              that I would say is an adequate monitoring of monitoring system, is to deploy



              two independent monitoring systems.



              This has the widest range of quality. You could do it the lazy way and deploy



              a smaller version of the exact same tooling with the exact same version in



              the same infrastructure-as-code inventory with no change management controls.



              And this obviously isn't the best solution, because



              something inside of that region or inside of that Kubernetes cluster could fail.



              You could find out that there was a regression in an update and take



              out both systems at the same time.



              If you use the same notification system and that fails,



              it's really hard to tell.
            
            
            
              So doing all those things right is also tricky from a bureaucratic



              perspective, because you have to convince your boss that, "Hey, we need to



              deploy an application in a new region."



              Another thing is you have to justify setting up change management



              processes, which can be difficult, and it can be hard to get people
            
            
            
              to follow the instructions on that.
            
            
            
              If you do this approach, you should definitely test it.
            
            
            
              So a way you could test it is to break your monitoring of
            
            
            
              monitoring system and then make sure that your primary monitoring
            
            
            
              system is sending notifications.
            
            
            
              And when you do this, make sure that somebody's periodically



              refreshing dashboards or running some tests to make sure that the primary
            
            
            
              monitoring system is working as well.
            
            
            
              So that way you can figure out if there's any interdependencies
            
            
            
              between the two systems.
            
            
            
              Another thing you can do if you're self-hosting is purchase a cloud service.



              And if you're already purchasing a cloud service, you can either purchase a



              different cloud service that monitors the first cloud service, or you



              could set up an on-prem system that monitors the cloud service.
            
            
            
              The upside of this is now you're definitely having different humans



              manage the systems, and those humans are probably using different upstream vendors.



              A lot of the time you can control where your cloud service is deployed,



              so that way they're geographically isolated.



              And I know cloud monitoring bills have a bad rap for being super expensive,



              but because it's one application, it's not that big of a deal.



              And that application is managed usually by the observability



              experts in your organization.



              So they're aware of things that can lower costs, like



              relabeling rules or streaming aggregation.



              And with this one, you can still misconfigure it,



              but the risk is lower because it's a shared responsibility model rather



              than a "you're responsible" model.
            
            
            
              And the best version of this is to purchase a dedicated
            
            
            
              monitoring of monitoring solution.
            
            
            
              So VictoriaMetrics offers this as part of some of our enterprise plans.



              But the downside of it is it's the most expensive, because along with paying



              for a separate monitoring system,



              so a separate instance of VictoriaMetrics to monitor your on-prem VictoriaMetrics,



              you're paying for the really smart humans to enrich the already rich notifications.



              But again, that's going to be the best experience by far,



              because along with a really well-supported observability system,



              you get the smart people behind it that can help you work
            
            
            
              through the issues as well.
            
            
            
              So in summary, this doesn't have to be super expensive.
            
            
            
              It doesn't have to be super difficult.
            
            
            
              And if you can't get permission to deploy a separate monitoring of



              monitoring system, there are options that are better than nothing.
            
            
            
              And you can mix and match the approaches to fit your availability
            
            
            
              requirements or fit your use case.
            
            
            
              And the most important thing is we get to answer the age old philosophical question.
            
            
            
              If your monitoring system falls over in the middle of the night,
            
            
            
              does somebody get paged at 2 AM?
            
            
            
              After this talk, the answer should be yes.