Conf42: Site Reliability Engineering 2021


Go global or go home! Monitoring your platform from multiple locations

Andrei Danilov
Platform Engineer @ Typeform


Setting up a system to monitor production availability in multiple regions might seem straightforward if we look at it strictly from a tooling perspective: Postman, DataDog, New Relic, AWS, Uptrends, Cloudflare (and the list goes on) all offer solutions that are easy to set up, manage, and maintain. However, the real challenges lie behind the use of the tool:

* How do we select the regions and locations to monitor?
* Do we want to monitor exact locations (e.g. a city) or broad areas (e.g. countries)?
* What do we consider a failure (typical HTTP error codes, certain time thresholds)?
* Do all the selected areas have the same importance to the business?
* Do all the monitored areas trigger alerts with the same priority?
* How often do we want our platform to be monitored?
* Who owns the actions to fix a failure when one occurs?
* How do we flag these monitors so they don't interfere with company data?
* How do we design the monitors so that a failure doesn't produce a cascade of alerts?
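To make a few of these questions concrete (what counts as a failure, whether regions carry different priorities, and how to avoid a cascade of alerts), here is a minimal sketch in Python. It is not tied to any of the tools named above; `RegionPolicy`, `ProbeResult`, `is_failure`, and `summarize` are hypothetical names, and the region labels, thresholds, and the "page"/"ticket" priority levels are assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegionPolicy:
    """Hypothetical per-region settings: alert priority and latency budget."""
    region: str
    priority: str        # assumed levels: "page" (urgent) or "ticket" (low)
    max_latency_ms: int  # responses slower than this count as a failure


@dataclass(frozen=True)
class ProbeResult:
    """One synthetic check as reported from a monitoring location."""
    region: str
    status_code: int
    latency_ms: int


def is_failure(result: ProbeResult, policy: RegionPolicy) -> bool:
    # Assumption: any HTTP error status (>= 400) or a response slower than
    # the region's latency budget is treated as a failure.
    return result.status_code >= 400 or result.latency_ms > policy.max_latency_ms


def summarize(results: list[ProbeResult],
              policies: dict[str, RegionPolicy]) -> Optional[dict]:
    """Collapse all failing regions into a single alert instead of one each."""
    failed = [r.region for r in results if is_failure(r, policies[r.region])]
    if not failed:
        return None
    # The most urgent priority among the failing regions decides the alert.
    urgent = any(policies[name].priority == "page" for name in failed)
    return {"priority": "page" if urgent else "ticket", "regions": sorted(failed)}


policies = {
    "eu-west": RegionPolicy("eu-west", "page", 800),
    "ap-south": RegionPolicy("ap-south", "ticket", 1500),
}
results = [
    ProbeResult("eu-west", 503, 120),    # HTTP error in a high-priority region
    ProbeResult("ap-south", 200, 2000),  # too slow in a low-priority region
]
alert = summarize(results, policies)
```

With these assumed inputs, both regions fail but only one alert is produced, and it takes the priority of the most important failing region. Grouping failures this way is one possible design; a real setup would also need to decide ownership and check frequency.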

In this session I plan to answer all of the above questions, drawing on examples from my own work experience. The agenda looks something like this:

* About the Speaker
* Why is the topic important?
* Building your own strategy: things to consider
* Choosing the tool
* Showcase of an already working solution
* Conclusions
* Q&A on Discord

After the session you’ll be able to:

* Understand why this topic is important and in which context
* Create a strategy for monitoring different world areas that fits the realities of your own company
* Know which tools you can use to achieve that

