Go global or go home! Monitoring your platform from multiple locations

Abstract

Setting up a system for monitoring production availability in multiple regions might seem pretty straightforward if we look at it strictly from a tooling perspective: Postman, DataDog, New Relic, AWS, Uptrends, Cloudflare (and the list goes on) all offer solutions that are super easy to set up, manage and maintain. However, the real challenges lie behind the use of the tool:
* How do we select the regions and locations to monitor?
* Do we want to monitor exact locations (e.g. a city) or broad areas (e.g. countries)?
* What do we consider a failure (typical HTTP error codes, certain time thresholds)?
* Do all the selected areas have the same importance to the business?
* Do all the monitored areas trigger alerts with the same priority?
* How often do we want our platform to be monitored?
* Who owns the actions to fix this in case of failure?
* How do I flag these monitors so I don’t mess up company data?
* How do we design the monitors so that, in case of failure, we don’t get a cascade of alerts?

In this session I plan to answer all of the above questions, combined with examples from my personal work experience on the topic. The agenda looks something like this:
* About the Speaker
* Why is the topic important?
* Building your own strategy: things to consider
* Choosing the tool
* Showcase of an already working solution
* Conclusions
* Q&A on Discord

After the session you’ll be able to:
* Get a good grip on why this topic is important and in which context
* Understand how to create a strategy for monitoring different world areas that applies to the realities of your own company
* Understand what tools you can use to achieve that

Summary

Andrei shares his story about scaling up in 2021. One of the challenges is having your product available throughout different regions of the world. As a company that wants to grow and scale, we need to prepare our platform.
It is not really feasible to monitor every tiny location on the globe. Some locations end up being a bit more important than others, and your company's priority system can help you decide where to start. Should the monitoring system create an alert each time a failure is generated? Not really: a good practice is to alert only after consecutive failures in a location.
If there is an availability problem in a specific world area, who should fix it? In an ideal DevOps environment it would definitely be the team owning that particular part of your system. Starting small is better. A good practice here is to keep it simple.
As a scaling organization, you need to look at things differently. Check availability in different world areas today, but also from the perspective of the near future. Choose a tool that can help you with a bit more than just today's needs. I'd like to hear what you loved and hated about my session today.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Are you an SRE? A developer? A quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative Litmus Cloud.
Hi everyone, welcome to my session today. I'm Andrei, and I'm super grateful to be part of this event. I'm currently saying hi from Barcelona, Spain. About myself as a professional: I've always enjoyed a creative and data-driven way of looking at technology through my blogs and talks, and I hope to inspire many of you IT professionals watching this to bring value to your organizations in new, diverse and ingenious ways. Previously I was part of a platform team at Typeform called Software Engineering Tools and Infrastructure. Our main aim was to support the organization with processes, self-built or existing market tools, standards, infrastructure and metrics throughout the whole software delivery lifecycle, which would finally lead to a more stable and reliable platform.
Today, I'm going to share with you a story about scaling up in 2021. Things have changed considerably since Typeform was first created in 2013. Typeform is growing as a brand right now, and you can imagine that the platform behind it has also grown to support what it is today. As our product started to mature and our user base started to grow, not in just one specific part of the world but in many, we realized that our platform was facing new challenges that we needed to tackle if we wanted to continue to grow as a global brand. One of these challenges is pretty straightforward to guess: having your product available throughout different regions of the world. This is a problem you don't really notice until you start to have a huge number of customers or users in parts of the world you were not really expecting.
You think your platform is up because your production monitoring system hits your data center, the same data center where your application is deployed. But that's not the case, and these customers are opening support tickets, which take a bit more time for engineers to understand and resolve. As a company that wants to grow and scale, we need to prepare our platform not just to react better in this situation, but to proactively tackle these problems if they occur. In this context, there are many improvements that you could bring to your platform, and it was pretty clear to us that one of the things we needed to look at, apart from our current focus, was availability in different parts of the world, and of course to have it coupled with our alerting system, so that in case we do have downtime, we can react and repair much faster.
There are many tools out there that offer an effortless, or almost effortless, solution to this problem, and most of them are right about that. Obviously each one of them comes with its own set of features that make it unique, like integrations, the set of locations offered, pricing rules, et cetera. Speaking strictly from a technical perspective, there is indeed very low complexity in spinning any of these solutions up, and I really love it when technology lets us focus on what is most meaningful: solving problems.
So if the challenge is not setting up the tool, where is the challenge? Well, the challenge is actually about everything else but the tool. As with everything in engineering, it is about understanding what needs to be built and how it's going to be used before laying a finger on the implementation. Up next I will share ten things you need to ask yourself so you can build your own strategy for monitoring your platform from different locations, together with some tips and tricks from the lessons that I've learned.
The first one is the most difficult: of all the locations in this world, which ones should you monitor? Obviously, if you are not a telco company or something similar, it is not really feasible to monitor every tiny location on the globe. Look at what matters to your business. Things like top X locations by user base, top X customer locations by visits, or top X locations of customers by premium accounts should help you understand where you need to put your focus. Look at your SLAs and legal agreements: do you need to provide something in a specific location? Our approach at Typeform was to have a baseline that mixes our top locations by visits with our top locations by number of customers.
So what does a location actually mean? Is it a city, a country, a region? It can be any of these options or any combination of them, and this depends mostly on how your business operates. Imagine that you are a food delivery company with a strong presence in a metropolis or a big city. Monitoring those particular cities might make more sense than monitoring whole regions or countries, where if something fails in that big city, you might not get to notice it. We've settled for a mix of countries and regions, but we are aware that this might change in the future.
This question might appear before or after selecting the locations: should the chosen locations all be treated in the same way? Meaning, should I trigger alerts, for example, in the same way for all of my locations? Ideally, yes: treat all locations the same way and alert the same way whether your platform is down in Barcelona or in San Francisco. Realistically speaking, this goes a bit like any other production incident or bug. We want to fix everything, but where do we start? Here, using your company's priority system can help you. At Typeform, we are currently treating all of the locations in the same way because we have very few of them, but due to the mix of regions and countries, some locations end up being a bit more important than others.
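The baseline described above (top locations by visits mixed with top locations by number of customers) can be sketched in a few lines of Python. This is only an illustration, not Typeform's actual process: the function name `pick_locations`, the `top_n` parameter and all the numbers are made up for the example.

```python
from collections import Counter

def pick_locations(visits, customers, top_n=3):
    """Return the union of the top-N locations by visits and by customers."""
    top_by_visits = [loc for loc, _ in Counter(visits).most_common(top_n)]
    top_by_customers = [loc for loc, _ in Counter(customers).most_common(top_n)]
    # Keep the visits ranking first, then append customer-only locations.
    selected = list(top_by_visits)
    for loc in top_by_customers:
        if loc not in selected:
            selected.append(loc)
    return selected

# Hypothetical figures, for illustration only.
visits = {"US": 900, "ES": 400, "BR": 350, "DE": 120, "IN": 300}
customers = {"US": 80, "DE": 60, "ES": 20, "BR": 10, "IN": 5}

baseline = pick_locations(visits, customers, top_n=2)
print(baseline)  # ['US', 'ES', 'DE']
```

Here "DE" makes it into the baseline only because of its customer count, which is the kind of location a visits-only ranking would miss.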
How frequently should these monitors execute against the production instance of my platform? Looking at traffic is the best way to tackle this. A couple of users per minute might not give you enough reason to monitor often, but since we're talking about scaling platforms here, and the chosen locations are high-traffic ones, once per minute should generally be enough to start with. A granularity coarser than a minute might leave you unaware for much longer that you are down in a particular area, and this is not convenient. At Typeform we are even trying to go more granular than a minute.
But if we execute each minute, we'll end up with around 45,000 entries per month, and around half a million per year, in the databases of the different systems we know of. Even the security rules might block so much activity. Yes, that's a lot. The only way forward is to start flagging this traffic so it doesn't interfere with other systems in place at your company. This can be done via allow-listing the IP, adding certain bypass headers, user agents, parameterized requests and so on. At Typeform, the different systems this might interfere with all have different requirements, so the solution is a mix of several of the aforementioned ways.
Should this monitoring system create an alert each time a failure is generated? I'd say no, not really, and this is for a couple of reasons. One of the most important is that the servers of the tool you are using in that location might not be available at the moment they try to reach your platform, but this doesn't mean that your users are affected or have problems accessing your website. This shouldn't create an alert, as it would spam the engineers, and they will challenge the robustness and usefulness of the tool. A good practice would be to trigger an alert only if a number of consecutive failures is identified for that location.
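The consecutive-failure rule can be sketched as a small per-location counter. This is a minimal illustration under assumptions: the class name `ConsecutiveFailureAlert`, the threshold of 3 and the location label are hypothetical, and most monitoring tools expose an equivalent setting rather than requiring custom code.

```python
class ConsecutiveFailureAlert:
    """Fire an alert only after `threshold` failed checks in a row per location."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.streaks = {}  # location -> current run of failed checks

    def record(self, location, ok):
        """Record one check result; return True when an alert should fire."""
        if ok:
            self.streaks[location] = 0  # any success resets the streak
            return False
        self.streaks[location] = self.streaks.get(location, 0) + 1
        # Fire exactly once, when the streak first reaches the threshold.
        return self.streaks[location] == self.threshold

alerts = ConsecutiveFailureAlert(threshold=3)
results = [False, True, False, False, False, False]  # one check per minute
fired = [alerts.record("eu-west", ok) for ok in results]
print(fired)  # [False, False, False, False, True, False]
```

The isolated failure at the start never alerts, and the sustained outage alerts once instead of on every failed check.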
You could set up a retry in case of failure, or wait for the next execution, depending on the frequency of the monitor that you choose for your platform.
Okay, we've talked a lot about failures, but what are they? What should we consider a failure in this case? All these tools come with a default understanding of what a failure is, which, as far as I've seen, is pretty standardized between them: basically errors in HTTP requests, timeouts, and everything around this area. Of course, you can add your own failure criteria. For example, maybe you have some custom error codes that you want to remove or add on top of the default options. You may want to lower the timeouts, figure out some latency thresholds, or assert something in the response. All of this also depends on your SLAs and general policy regarding performance at your company. If you're starting out and just want to explore how you could tweak this, I'd really recommend going with the defaults in the beginning and learning from how your system behaves before you go custom.
We're talking a lot about failures, but who should react first? If there is an availability problem in a specific world area, who should fix it? Well, in an ideal DevOps environment it would definitely be the team owning that particular part of the system. The realities of different engineering organizations make this question depend on who has the access and knowledge to make those changes. This can be an SRE team, ops, incident management or a platform team. Nevertheless, it's critical to have somebody react if this happens. At Typeform, at first our team was the one to be alerted, so we could understand who was best to contact when a failure was identified. We have tried having the teams that own that part of the platform be the ones to react first, but we have adapted to improve the reaction to these alerts.
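The failure criteria mentioned above (HTTP errors, timeouts, latency thresholds, custom error codes, response assertions) could be combined roughly as follows. This is a sketch, not any tool's actual defaults: `CheckResult`, `is_failure` and the 2000 ms threshold are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CheckResult:
    status: Optional[int]  # None when the request timed out or never connected
    elapsed_ms: float
    body: str = ""

def is_failure(result, max_latency_ms=2000.0, ignored_statuses=frozenset(),
               must_contain=None):
    """Decide whether a single check counts as a failure."""
    if result.status is None:
        return True  # timeout or connection error
    if result.status >= 400 and result.status not in ignored_statuses:
        return True  # HTTP error that has not been explicitly excluded
    if result.elapsed_ms > max_latency_ms:
        return True  # answered, but slower than the latency threshold
    if must_contain is not None and must_contain not in result.body:
        return True  # response assertion did not hold
    return False

print(is_failure(CheckResult(status=200, elapsed_ms=150.0)))   # False
print(is_failure(CheckResult(status=503, elapsed_ms=150.0)))   # True
print(is_failure(CheckResult(status=200, elapsed_ms=2500.0)))  # True
```

Starting with only the first two checks mirrors the "go with the defaults first" advice; the latency and body checks are the custom additions.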
So what if there is downtime in all of the areas because, let's say, a defect is introduced in production? Isn't this going to trigger a huge quantity of redundant alerts? If you go for a default setup, these monitors are going to fail, and each will trigger yet another alert in the clutter of alerts that your on-call engineer needs to go through. What a nightmare. This is not ideal. This is a problem that generally appears in scaling engineering organizations, and luckily there is software out there that can help you group alerts based on the root cause or affected area, or even using AI. For example, I was reading about Moogsoft the other day, and it looks like they are able to group all of these alerts. As a side note, though, starting small is better. Having the alerts for availability in different world areas set up is a big step, and grouping them with other potentially failing alerts is definitely another. Keep those separate.
And what exactly should we monitor? Is it an endpoint? A complete user journey? All the endpoints you have exposed? For example, should I get a certain response or action after hitting an endpoint or completing a journey? What is it that we should be looking for in these monitors? First of all, think about the purpose of this: checking the availability of the system from different parts of the globe. Modern tech organizations already have some test automation going on, which already tests some complete or partial user journeys in the UI, checks some API responses, et cetera. A good practice here is to keep it simple. The purpose is to check whether a particular part of the platform is accessible from different regions. Think about the minimum that you can do to achieve that. From my experience, most of the time a simple API call like a GET, or just a ping to a URL, should do the trick.
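A "simple GET" probe of this kind, flagged so that other systems can recognise and filter monitoring traffic, might look like the sketch below. It uses only the Python standard library. The header names and values in `MONITOR_HEADERS` are hypothetical; the real flag has to be whatever your own analytics and security systems agree to recognise.

```python
import time
import urllib.error
import urllib.request

# Hypothetical flagging headers, for illustration only.
MONITOR_HEADERS = {
    "User-Agent": "synthetic-monitor/1.0",
    "X-Synthetic-Monitor": "true",
}

def build_request(url):
    """Build a GET request that carries the monitoring flags."""
    return urllib.request.Request(url, headers=MONITOR_HEADERS, method="GET")

def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return (status, elapsed_ms); status is None if the request did not complete."""
    start = time.monotonic()
    try:
        with opener(build_request(url), timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with a 4xx/5xx code
    except OSError:
        status = None      # URLError and timeouts are OSError subclasses
    elapsed_ms = (time.monotonic() - start) * 1000
    return status, elapsed_ms
```

Calling `probe` with one of your own health or landing URLs returns the status code and the elapsed time in milliseconds. The `opener` parameter exists only so the function can be exercised without network access; a hosted monitoring tool would run the equivalent request from each of its locations.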
Secondly, look at the architecture of your platform so you can choose the exact parts of it that should be hit by the monitors. In a monolithic architecture, one monitor doing a GET on your main URLs should really do the trick. In a microservice architecture, things get a bit more complicated, and it might not be feasible to create monitors for each of the endpoints that you are exposing, simply because of the quantity of microservices and how rapidly they change. Understanding how those things are deployed and where, and what the connections between them are, will allow you to understand the risk and reduce the quantity of monitors to a very few, but I would say powerful, ones that will allow you to avoid downtime in different parts of the globe.
To conclude my talk today, I'd like to stress the following aspects. As a scaling organization, you need to look at things differently than before. So before jumping into setting up everything in any of the tools out there, look at the current needs of your company: checking availability in different world areas today, but also from the perspective of the near future. Although there are some best practices, the most important aspect is looking at the specific context of your organization while designing the strategy for this. Choose a tool that can help you with a bit more than just today's needs. Last but not least, in a scaling organization there are many things that can be done much better. So if you choose to monitor the availability of your platform from different world areas, start small and then figure out some of the details along the way. This will be much more impactful and will keep you focused.
If you want to reach out to me for help setting this up, or just to drop me a line and chat, feel free to add me on LinkedIn. Also do check out my blog for my experiences and opinions regarding work in the software development industry.
Last but not least, I'd like to hear what you loved and hated about my session today, and I invite you to share it via the feedback form in the QR code on the screen, powered by our new product at Typeform called VideoAsk. I hope you enjoyed it. Thanks for joining me today, and enjoy the rest of the conference. See you around.

Conf42 Site Reliability Engineering 2021 - Online

September 30 2021

Andrei Danilov

Platform Engineer @ Typeform
