Conf42 Site Reliability Engineering 2021 - Online

Pragmatic Site Reliability Engineering for Kubernetes in the Cloud

Abstract

Managing infrastructure resources in the cloud becomes more challenging once we start to deal with significantly more resources and complex integration requirements.

In this session, we will discuss the different solutions when dealing with production SRE requirements for Kubernetes in the cloud.

Summary

  • Joshua Arvin Lat will talk about pragmatic site reliability engineering for Kubernetes in the cloud. He is also the author of the book Machine Learning with Amazon SageMaker Cookbook. You can enable your DevOps for reliability with Chaos Native.
  • When dealing with pragmatic site reliability engineering for Kubernetes, how fast can we detect and manage an issue? The stability of the entire system depends on both the software engineers and the site reliability engineers. The number one thing that comes to mind would be time.
  • Joshua: Let's play a quick game. Let's let everyone count the number of apples in this slide. For the winners, we'll give a 25% discount and we'll provide a promo code. Later, Joshua will share nine tips that would help us do SRE properly.
  • Not everything needs to be stored inside the Kubernetes cluster. There may be open source tools that can easily be installed. Instead of storing data inside Kubernetes, maybe we can use an RDS instance outside of the cluster.
  • The goal of SRE is to make sure that our customers are happy and the system is up almost 100% of the time. It's critical that we know who the target users are and we need to understand their behavior. There are specific days in the year where you really have to plan ahead and prepare the infrastructure.
  • The next one in our list involves managing the overall cost of storing, managing and searching the logs. Having good collaboration between the application development team and the SRE team is crucial to get that to work. Next in the list would be understanding the weakest links in your system.
  • Understanding the common failure modes when using Kubernetes is critical. The best assumption would be that things will fail and the goal is to prepare a layered plan. It's about managing the balance between availability, resilience and cost.
  • Kubernetes is like any other tool. If it's not configured properly, then it can be attacked and compromised. Knowing the weakest links of your system when it comes to security is critical. It's important that we plan when there is minimal pressure.
  • Joshua Arvin Lat talks about pragmatic site reliability engineering for Kubernetes in the cloud. If you have any questions, feel free to reach out to me. I hope you learned something today.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with Chaos Native. Create your free account at Chaos Native Litmus Cloud.

Hi there, I am Joshua Arvin Lat and I'm going to talk about pragmatic site reliability engineering for Kubernetes in the cloud. Before we start, I'm going to introduce myself. I am the chief technology officer of NuWorks Interactive Labs and I'm also an AWS Machine Learning Hero. I'm also one of the subject matter experts who help update the certification exams globally. And finally, I am the author of the book Machine Learning with Amazon SageMaker Cookbook. In this book we have 80 proven recipes for data scientists and developers to perform machine learning experiments and deployments. You'll see a lot of customizations there using Docker and different techniques to port your machine learning model and get it to work in the cloud with SageMaker.

So let's start with the first practical tip. When dealing with pragmatic site reliability engineering for Kubernetes, the first thing that comes to mind is: if there's going to be an issue, how fast would we be able to detect and manage it? We need to find ways to visualize and know what's happening inside. If you have a lot of nodes in your Kubernetes cluster and there are different types of logs (logs from your application, maybe logs from your network, and so on), things get trickier when you have to deal with all of them and find ways to visualize what's really happening: knowing the metrics, knowing what's wrong, knowing what part of the system is not working and what parts are working. It's not as easy as it sounds while we're talking about it right now. What we can do about it is start using additional tools on top of the Kubernetes setup and allow different end users to access those monitoring tools.

Of course, we may be tempted to just let the site reliability engineers access those tools. However, the stability of the entire system depends on both the software engineers and the site reliability engineers, because infrastructure is just one part, and the other part involves the application side, which the software engineers are continuously building on top of. If there's an issue, let's say you have an ecommerce website with three or four different parts, and there's something wrong with the checkout flow so users can't check out, or for some reason the third page before the checkout page isn't working, then being able to detect what's wrong is a responsibility shared by both groups, both the software engineers and the site reliability engineers. And finally, it's about making sure that people have the right access levels and giving them the appropriate charts and tools to debug the problems.

The next thing that we need to think about is: what if there are so many servers and nodes and containers and services we need to worry about here? It's critical that we're able to manage the log data and connect it properly with the monitoring tools, because if we're not able to clean things up ahead of time, and do it in an automated fashion, then the software engineers and site reliability engineers will have a hard time solving the problem.
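As a rough sketch of that kind of visibility, a short script using the official Kubernetes Python client can surface pods that are not currently running; this assumes a kubeconfig is already set up for the cluster and is only an illustration of the idea:

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (assumes cluster access is set up).
config.load_kube_config()
v1 = client.CoreV1Api()

# List pods across all namespaces and report any that are not in the Running phase.
for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    phase = pod.status.phase
    if phase != "Running":
        print(f"{pod.metadata.namespace}/{pod.metadata.name}: {phase}")
```

In practice this visibility usually comes from dashboards rather than ad hoc scripts, but the point stands: both the software engineers and the SREs need an easy way to see which parts of the system are healthy.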
So when dealing with production-level stuff, the number one thing that comes to mind would be time. We can talk about concepts all day long, but when we're dealing with reality, we have to worry about time. When you have an issue, if the issue gets resolved right away, let's say automatically, then that saves us time, because an automated process is definitely better than a human who would still have to process and look for the issues. If there's a self-healing setup involved, then that's much better, at least as the first layer. Next, if some analysis is needed and the problem is custom, then it would involve both automated and manual solutions. Let's say that you deployed something and that caused the system to become unstable. Then you would need a human to solve that problem, let's say by rolling back or by doing some tweaks in the current setup or configuration.

Speaking of time, let's play a quick game. What's the game all about? In this game, I'm going to let everyone count the number of apples in this slide, and I'm going to share a reward. For those who are going to win this game, we're going to give, I think, a 25% discount for those who will try to purchase the book, the book that I've written these past couple of months. So again, if you win the game and you get the number of apples correctly in the next slide, then we'll share that promo code with you so that you can use it to buy the ebook. All right, so are you guys ready? I'm going to give everyone 20 seconds to count the number of apples here. And as mentioned, for the winners, we'll give a 25% discount and we'll provide a promo code separately. So: 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, and 1. Stop counting.

If you have guessed, let's say, 20 apples, that's incorrect. All right, so let's try something lower, let's say 17. If you've guessed 17, unfortunately, that's also incorrect. How about 24? For those who have tried guessing that it's 24, you are also incorrect. So what's the correct answer? There's no correct answer for this, because it's impossible to count the number of apples in this slide; there may be apples underneath the initial layer of apples. And the lesson that I want to share with you here is that when things are not organized and it's hard to look for the needle in the haystack, then it will be hard for us to identify what the problem is. Let's say that we were given a different problem similar to this one, where the apples are organized into a ten by ten block with ten layers. Then we're sure that ten by ten times ten would be 1,000 apples, much easier to count, because the data is organized. In the same way, it would take us a lot of time to count the number of apples if it's disorganized like this. And if we were to clean things up and organize things ahead of time, then it would be easy to find things, do aggregates and counts, and even look for the specific apple that we're looking for.

That said, for those who are a bit disappointed that they were not able to win that prize, I'll just say that I'll be nice today and give out the 25% discount by using the code 25 Joshua. And yeah, I think it's valid between October 22 and December 31 of this year. So just take a screenshot and use that when the book is launched, I think next month, a couple of weeks from now.
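The apple lesson applies directly to logs and metrics: when the records are structured ahead of time, counts and aggregates become trivial. A minimal sketch with made-up, hypothetical log records:

```python
from collections import Counter

# Hypothetical, already-structured log records; in a real system these would
# come from a centralized log store rather than a hard-coded list.
records = [
    {"service": "checkout", "level": "ERROR", "message": "payment timeout"},
    {"service": "checkout", "level": "INFO", "message": "order placed"},
    {"service": "catalog", "level": "ERROR", "message": "db connection reset"},
    {"service": "checkout", "level": "ERROR", "message": "payment timeout"},
]

# Because every record has the same shape, aggregates are one line each.
errors_per_service = Counter(r["service"] for r in records if r["level"] == "ERROR")
print(errors_per_service)  # Counter({'checkout': 2, 'catalog': 1})
```

Disorganized logs are the pile of apples: you cannot even count them reliably, let alone find the specific one you are looking for.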
So, that said, why are these things important? These things are important because the faster we're able to solve problems and the better the system is configured, the better our capability to manage the SLI, SLO and SLA. We won't talk about these things in detail here, but if you're already aware of the fundamentals of SRE, it's critical that we're able to manage this and have a plan for it, because we cannot just assume that we'll be able to reach these targets. We need to have a plan, and later, the other nine tips that I'm going to share will help us do SRE properly.

The next tip would be knowing the options and alternatives. On the screen, we see some of the possible tools that we can use with Kubernetes: we have Rancher, we have Prometheus, we have Kiali, we have Istio, we have App Mesh, and we have Helm. You may or may not use some of these tools, but what they help us with is that instead of trying to build our own solution, we can learn what the majority of Kubernetes users are using, so that if we have a requirement, we're pretty sure some other professional out there has encountered the same problem. Meaning these tools may already have the features; all we need to do is learn how to install them and get them to work.

However, and this is the next tip, using tools does not automatically solve the problem. The tools are enablers. Like I said, the tools are enablers. What do I mean by that? At the end of the day, the tools will be used by people. Combining the concepts we learned from the previous slides, let's say that the software engineering team and the site reliability engineering team are going to use these tools. What if the tools are properly set up, the tools are in place, but 80% of the team have no idea how to use these tools and have very little idea about the concepts behind them? And then finally, what if they say that they're familiar with these tools, but in reality they're not able to use them properly to troubleshoot the problems? What will happen here is that at the start, during the planning stage, we ask the people who are familiar with this tool, and everyone will say, yeah, I have used that before. And then something goes wrong one month from now, and only one or two people are able to use the tools properly. That's something that you need to solve, because a big part of this is a human problem. One of the possible solutions would be, let's say, giving them a training program to help them understand the tools better and how to do troubleshooting better.

Another thing that you have to take note of is that sometimes the most important things we need to hear about are the words never said. People will not really tell you what's really happening. So it's important for the management team to listen to the words never said by trying to understand what's really happening, by, let's say, auditing and checking things even without people telling them or giving them feedback. Sometimes management teams rely too much on feedback and assume that these feedback notes are the sources of truth, when in fact running a simulation may be a much better way to know what's going to happen or not.
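Going back to the SLI, SLO and SLA point for a moment, here is a minimal sketch of what managing those targets can look like in numbers; the figures below are made up for illustration:

```python
# Hypothetical monthly request counts; real numbers would come from monitoring.
total_requests = 10_000_000
failed_requests = 4_200

slo_target = 0.999  # 99.9% availability objective agreed with the business

sli = 1 - failed_requests / total_requests            # measured availability
error_budget = 1 - slo_target                         # allowed failure ratio
budget_used = (failed_requests / total_requests) / error_budget

print(f"SLI: {sli:.5f}")                        # 0.99958
print(f"Error budget used: {budget_used:.0%}")  # 42%
```

Tracking numbers like these over time is what turns the SLO into a plan rather than an assumption that the targets will somehow be reached.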
The next thing in our list would be this one: a bunch of logos, a bunch of tools. We have SageMaker, we have App Mesh, we have EKS, we have X-Ray, and we have CloudWatch. The lesson I want to share here is that not everything needs to be stored inside the Kubernetes cluster. There may be open source tools that can easily be installed inside the Kubernetes cluster; you may be using some machine learning tools and workflow tools to run machine learning workloads inside the cluster. But if there are other options available, let's say SageMaker, and it has proper integration with Kubernetes (in this example, the SageMaker Operators for Kubernetes), then you'll be able to delegate and transfer some of the workload to SageMaker so that you're pretty sure it's managed there, and there's less management burden inside your Kubernetes cluster, especially if you're dealing with a lot of services and workloads. You also have App Mesh, you have X-Ray, you have CloudWatch. If you have logs, you can use CloudWatch to store those logs and help analyze them there. In the same way, instead of storing the data inside Kubernetes, inside databases there, maybe we can use an RDS instance outside of the cluster so that you get the managed advantages, the pros of using RDS, outside of your Kubernetes cluster. If you need to vertically scale your database, then instead of trying to use Kubernetes to manage the database resources, you can do that using RDS separately.

The next one involves the users, the target users. When creating infrastructure, we're not trying to build and manage the infrastructure just for the sake of it. The goal of SRE, the goal of what we're doing right now, is to make sure that our customers are happy and the system is up almost 100% of the time. Almost 100%. So it's critical that we know who the target users are, and we need to understand their behavior, because when dealing with systems it's always going to be a combination of planned and unplanned work, and at the same time a combination of automated tasks and manual tasks. There will be times when we will have to take more risk than usual, and if we know the usage rate of the system and understand the behavior of the users, then maybe we can schedule some of the more critical work, maybe critical migration work, during off-peak hours. In this example here, maybe your customers for some reason use the site between twelve midnight and 3:00 a.m., and then maybe they come back to your site again from 3:00 p.m. to 7:00 p.m. It really depends on your application, on the users and the target market, and the behavior changes day to day: maybe during weekdays they behave one way, and during weekends they behave in some other fashion. And then there are specific days in the year where you really have to plan ahead and prepare the infrastructure and configuration to carry the workload, especially in cases where there's a spike. Let's say there's a holiday or an event: you need to prepare ahead of time, and you cannot just rely on auto scaling to solve the requirements of the business. So again, this is very critical, and we need to know the behavior of the target users.
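As one sketch of preparing ahead of time instead of relying purely on auto scaling, the replica count of a deployment can be raised before a known event. This assumes kubeconfig access to the cluster, and the deployment name, namespace and replica count here are hypothetical:

```python
from kubernetes import client, config

# Assumes a kubeconfig is already configured for the cluster.
config.load_kube_config()
apps = client.AppsV1Api()

# Scale the (hypothetical) checkout deployment up ahead of a known sales event,
# instead of waiting for the autoscaler to react after traffic has already spiked.
apps.patch_namespaced_deployment_scale(
    name="checkout-web",
    namespace="production",
    body={"spec": {"replicas": 12}},
)
```

In practice this would usually be wrapped in a scheduled job or a pipeline step, and scaled back down once the event is over.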
The next one in our list involves managing the overall cost of storing, managing and searching the logs. Whenever we prepare infrastructure, one of the primary requirements we have to deal with is cost estimation. It's one of the trickier things we need to worry about, because in addition to trying to predict the cost of auto scaling infrastructure, we also have to worry about the things which may not be noticeable or visible at that point in time. Let's say that we want to prepare some cost estimates. The first thing that comes to mind would be how much the database instance costs and how much the EKS cluster costs. But after a couple of months, when you check the bill, you'll see: hey, where did this additional cost come from? Oh, the logs start to build up, because we're trying to collect a lot of logs, because the more logs we have, the better, since it allows us to debug the site better. So you have to worry about the additional cost of storing logs, and also making sure that there's a backup, because you cannot just remove the logs or reduce the number of logs being collected by the system just to reduce the cost. Knowing the different options on where to store the logs is the second step. And the third step would be understanding the total cost of managing the centralized log storage, because what you don't want is for some of your people to be busy building a custom solution: instead of doing other things, they will be forced to maintain a sophisticated centralized log storage.

And then finally, it's critical that we manage the time spent searching the logs. Managing the logs is one of the things we need to worry about, and searching the logs is critical as well. Given that your system will be collecting a lot of logs from different sources, it's critical that searching the logs, especially when done manually by a user, is super fast, and that we're able to connect the logs collected from the distributed components and connect the dots, especially if, let's say, component A calls component B and then component B calls component C, and they are in different pods and different services. Being able to identify the problem and connect the logs from each of the components is something that you need to set up ahead of time. There are a lot of tools and options out there, but having good collaboration between the application development team and the SRE team is crucial to get that to work.

Next in our list would be understanding the weakest links in your system. Sometimes, when we do not understand the system enough, there's always that tendency when there's a slowdown to say: okay, if the system is slowing down, we just need to scale it, we just need to do auto scaling. However, it doesn't really work like that in real life. Not all issues are due to a lack of compute power for processing the application logic. In some cases, maybe the bottleneck is the database. So instead of always blaming the Kubernetes cluster and the resources managed by it, the first step is to actually know where the problem is. Maybe the database queries are not optimized, so one solution would be to optimize those queries. Another solution might be that we need read replicas to solve the problem, and there's really no need to do auto scaling right away, because even if you auto scale and the database is the bottleneck, auto scaling will not help solve the latency issues.
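As a rough sketch of the read replica option, assuming the database already lives in RDS outside the cluster (the instance identifiers and instance class below are hypothetical), boto3 can create one like this:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Create a read replica of an existing (hypothetical) RDS instance so that
# read-heavy queries stop competing with writes on the primary.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="shop-db-replica-1",
    SourceDBInstanceIdentifier="shop-db",
    DBInstanceClass="db.r5.large",
)
```

The application then has to be pointed at the replica endpoint for its read traffic; scaling the Kubernetes nodes would not have touched this bottleneck at all.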
So that's one of the things that you need to be aware of, because there's always the tendency to reach for the mapped solution, the canned solution, and to feel that, okay, if the site is slow, just auto scale. No, that's not the case. We need to spend some time analyzing what's wrong, and we should try to remove any assumptions and biases so that we solve the problem from its real root cause.

The next one in our list involves failure. The lesson here would be to prepare for failure so nothing fails. Understanding the common failure modes when using Kubernetes is critical. Some examples are shared here: high CPU, packet loss, maybe a sudden increase in latency. Sometimes there will be node failure; in rare cases, maybe unavailable regions, maybe some service failure from time to time, and maybe some CPU throttling as well. So what do we do with this knowledge? What's important here is that we try to simulate these common failure modes before launching things into production. Again, the best assumption would be that things will fail, and the goal is to prepare a layered plan and the automation work required so that the system is self-healing. You have to plan things ahead and include this in your timeline: building the site is the first step, and the second part is to harden the system so that it's prepared for anything that may come up.

The previous lesson, understanding the users and their behavior, is critical here too, because if, let's say, the majority of your users are from one country, let's say the Philippines, then maybe you don't need a system which is available globally, and that makes your system much cheaper compared to requirements that involve the entire world. If you have a website that users across the globe will use, then that may definitely need a more expensive setup, maybe a high availability setup across different regions. Being prepared for those things and planning ahead is critical, and so is being practical about it, because we also want to save on costs. We cannot just build and automate everything and make it available everywhere. It's about managing that balance between availability, resilience and cost.

Next would be managing the security risk. Kubernetes is like any other tool: if it's not configured properly, then it can be attacked and compromised. There are tools out there which help us identify the risks and vulnerabilities, especially when it comes to misconfigured Kubernetes clusters. In some cases, some teams will just say, oh, it's secure. Of course, if you misconfigure something, use the wrong sequence of things, and also have an application which is easily compromised, then someone might be able to attack the entire system. So knowing the weakest links of your system when it comes to security is critical. In this example, let's say that we have a Cloud9 control instance with kubectl installed, which helps manage the Kubernetes cluster. What if that's the highest-risk entity in our environment? Even if we were able to secure the Kubernetes cluster and the Kubernetes configuration, if that instance gets attacked and the Kubernetes cluster is deleted by the malicious actor, then boom, your entire system is down.
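As a small sketch of checking for one of these weak links, assuming the cluster runs on EKS (the cluster name and region below are hypothetical), one quick thing to look at is whether the API endpoint is exposed publicly:

```python
import boto3

eks = boto3.client("eks", region_name="us-east-1")

# Inspect a (hypothetical) cluster and flag a publicly reachable API endpoint,
# which widens the attack surface if anything else is misconfigured.
cluster = eks.describe_cluster(name="shop-cluster")["cluster"]
vpc_config = cluster["resourcesVpcConfig"]

if vpc_config.get("endpointPublicAccess"):
    print("WARNING: the Kubernetes API endpoint is publicly accessible.")
else:
    print("The Kubernetes API endpoint is private only.")
```

Dedicated scanners go much further than this, but even simple checks like this catch the kind of misconfiguration that turns a single compromised instance into a full outage.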
So feel free to check the possible risks and issues when dealing with these types of setups, let's say IAM PassRole. Think about how an attacker would be able to attack your system using different techniques and exploits; that's the first step. The second step would be to look at all the other resources in that AWS account. There are different techniques to protect this system. For example, you use multiple accounts to make sure that even if one account gets compromised, the other systems are not affected. Because for one thing, let's say that one of the IAM users, or maybe one of the EC2 instances launched by an IAM user, has super admin powers, and for some reason that EC2 instance gets compromised. If that system gets compromised and it has the ability to delete EC2 resources, to delete all the resources in your account, then boom, even if this entire setup is secure, your entire AWS account would be finished, would be done. You might lose the data, you might lose the resources, so make sure that you are prepared for those edge cases.

This one may not be so obvious, but it's critical to have a plan, and the timing is critical: it's important that we plan when there is minimal pressure. Some of us will tell everyone, and just assume this is the truth, that sometimes we do a better job when there's pressure. In reality, when there's an emergency, there's a time constraint, and when we are planning, the more time we have, the better. When there's minimal pressure, or no pressure at all, it's better to plan, because we have the opportunity to prepare an exhaustive plan, an exhaustive list of what we need to do to accomplish everything that we need. But when there's an emergency, a lot of us will panic, a lot of us will be under time constraints, and we will not be able to prepare the best solution. So planning things way ahead of time is critical, and it's better to have a solid plan than to rush it when there's already a problem.

So that's pretty much it. We learned a lot of things in this talk, and if you have any questions, feel free to reach out to me. Thank you for listening, and I hope you learned something today from my talk on pragmatic site reliability engineering for Kubernetes in the cloud. Again, I am Joshua Arvin Lat, and I'm the chief technology officer of NuWorks Interactive Labs. Thank you, and bye.

Joshua Arvin Lat

CTO @ NuWorks Interactive Labs



