Conf42 Site Reliability Engineering (SRE) 2024 - Online

Debugging cluster issues as an on-call SRE

Abstract

I have been on-call numerous times in my current SRE role over the last 5 years, and I understand the true potential of guided automation. In this talk, I'll discuss a few of the solutions that have helped me administer and remediate complex issues, sometimes with the help of a single command.

Summary

  • Pravar Agrawal is a senior engineer at IBM working on the IBM Cloud Kubernetes Service. The talk covers common cluster-level issues, different approaches to debugging them with the help of automation, and what you should really be looking for in the role of an on-call engineer.
  • Site reliability engineering is an approach or methodology for IT operations to use different tools in solving different sets of problems. Being on call means that for a set period of time you are available to respond to production-level issues or incidents with urgency. The main aim is to ensure the availability of production.
  • Common cluster issues include pods getting stuck in the Pending or Terminating state. The talk walks through how to approach debugging these sorts of issues and the different types of automation worth implementing.
  • It's always essential to have a broad understanding of the entire architecture and infrastructure. Keep your runbooks handy to deal with different issues, errors, or warnings. If an alert auto-resolves, it doesn't always mean that nothing is wrong.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, I am Pravar Agrawal, and I welcome you all to my session on debugging cluster-level issues as an on-call SRE. Let us quickly take a look at today's agenda. We start off by getting to know site reliability engineering in a brief way, and also the role of an on-call engineer and what it is really about. We will also look at some commonly occurring cluster-level issues inside the Kubernetes ecosystem, and we will try to understand different approaches to debugging such issues, also with the help of different types of automation. And if you are a beginner, what should you really be looking for in the role of an on-call engineer? A little bit about me: I work as a senior engineer at IBM for the IBM Cloud Kubernetes Service, and these are my handles to stay in touch with me on Twitter, LinkedIn, and Slack.
Okay, so with an introduction to site reliability engineering: I assume most of us have been hearing this term a lot and already have a fair amount of awareness around it. But to still define it, it is more of an approach or a methodology for IT operations to use different tools in solving different sets of problems, and even to automate operational tasks. It's a practice to ensure that our systems are reliable and scalable, and SRE folks also try to maintain a balance between releasing new features and, at the same time, ensuring reliability for their customers. There is also a word we often hear along with SRE, which is on call. It basically means that for a set period of time you are available to respond to production-level issues or incidents with urgency. Different companies have their own implementation of this on-call process, but everyone's main aim is to ensure the availability of production and 24/7 support for production by managing incidents. There are a few rules around it: start by acknowledging and verifying the alert, try to analyze the impact caused by the issue, communicate with the different teams involved through the proper channels used in your organization, and eventually apply a corrective action or a solid fix to the issue. Many tools have been popular for incident management in the last couple of years; PagerDuty, Jira, Opsgenie, and ServiceNow are the more popular ones, though there are others. This diagram is a representation of the SRE and on-call ecosystem. It involves different phases, starting from deployment, troubleshooting, receiving the alerts, then the escalation matrix involved, and tracking everything with the help of a proper ticketing system.
Moving on to different cluster issues which are very common and which we often see. We are taking the example of an environment which comprises one or more Kubernetes clusters running in production. There could be issues related to the services running on a node, and it's not always possible to SSH in manually and try to debug those; we will take a look at how we can automate this process and how we can debug these sorts of issues in the next slide. Then there could be issues related to multiple pods getting stuck in the Pending or Terminating state, which is a very commonly occurring issue.
The Pending state mainly happens when the pod cannot be scheduled onto any of the nodes during its lifecycle, and Terminating mainly when the application is not responding after you issue a delete command on the application resource inside the cluster. Then there are also errors related to API endpoints going down or responding slowly; these endpoints could be at the application level or related to Kubernetes core components as well. Then there is the unavailability of etcd pods. Also, when you want to issue reloads of various worker nodes, doing it manually is not always an option. There could also be issues where your disks are getting full, or maybe reaching 90% capacity, and you keep getting alerted for it, or your health checks are failing, or maybe readiness or liveness probes are failing.
So how can we approach debugging these sorts of issues? There is no golden rule, but there are right ways to do it. We start off by analyzing the error message we have received and trying to see whether the blast radius can be lowered, that is, how many services or components it is impacting at the backend. The lower the blast radius, the better it is for us to debug and issue a corrective action. We also want to use our monitoring tools, like Prometheus, LogDNA, and Sysdig, properly: take a look at the last recorded state of the application with the help of the metrics that were pushed while it was still running fine, and also at the logs that were pushed, both application-level and cluster-level, to understand the current and previous state. If it is Kubernetes related, we should try to get access to the cluster as soon as possible and find out whether the master components or other important components are functioning properly. If they are deployed as pods, we can get the logs of those pods; if they are running as systemd services, we should check the service-level logs. And if it is a widely impacting issue, we should try to isolate that service by restricting its usage.
Of course, we are going to need automation to help us. Why do we need it? So that we can reduce the time we need to respond and get our infrastructure statistics at the earliest. We implement automation to get cluster-level statistics, or to run some real-time commands with maybe a single line, which is not always possible if you first have to get access to the cluster and then run those commands; to handle node-level reboots or restarts of different core services; to query historical data or analyze patterns you might be seeing across one issue or several issues; and to prepare cleanup jobs that clean up resources so that you don't face a resource crunch whenever capacity is needed.
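As a rough illustration of that kind of single-command triage, here is a minimal Python sketch using the official kubernetes client; the ten-minute threshold and the output format are illustrative assumptions, not something from the talk. It lists pods that have been stuck in Pending or stuck terminating for a while, together with their related events:

    # pod_triage.py - a minimal sketch, not a production tool. Assumes the
    # official "kubernetes" Python client and a working kubeconfig.
    from datetime import datetime, timedelta, timezone
    from kubernetes import client, config

    STUCK_AFTER = timedelta(minutes=10)  # illustrative threshold for "stuck"

    def main():
        config.load_kube_config()  # use config.load_incluster_config() inside a pod
        v1 = client.CoreV1Api()
        now = datetime.now(timezone.utc)

        for pod in v1.list_pod_for_all_namespaces().items:
            terminating = pod.metadata.deletion_timestamp is not None
            pending = pod.status.phase == "Pending"
            if not (pending or terminating):
                continue

            # How long has the pod been in this state?
            since = pod.metadata.deletion_timestamp if terminating else pod.metadata.creation_timestamp
            if now - since < STUCK_AFTER:
                continue

            state = "Terminating" if terminating else "Pending"
            print(f"{pod.metadata.namespace}/{pod.metadata.name} stuck in {state}")

            # Events usually explain why: unschedulable, image pull errors,
            # volume attach failures, finalizers blocking deletion, and so on.
            events = v1.list_namespaced_event(
                pod.metadata.namespace,
                field_selector=f"involvedObject.name={pod.metadata.name}",
            )
            for ev in events.items:
                print(f"    {ev.reason}: {ev.message}")

    if __name__ == "__main__":
        main()

Run with a valid kubeconfig (or in-cluster), this gives the on-call engineer a first hint, such as an unschedulable pod or a finalizer blocking deletion, before any deeper debugging.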
As for the different types of automation we might want to implement, we start off by writing meaningful and well-documented runbooks for the on-call folks, which can direct them to the different resources available for the particular issue that has been reported. These could be pointers to the different automations linked to that issue which the on-call SRE can utilize, such as a Jenkins job, or maybe a GitHub issue that was reported for something similar. So runbooks are really necessary here. Then there is the implementation of bots which can take care of ChatOps. These bots run on our communication platforms like Slack, Microsoft Teams, or Mattermost, and they can help in gathering real-time cluster statistics, issuing commands or even kubectl commands, and also issuing reboots and reloads of our worker nodes if authorized to do so. One example of such a bot is Botkube, a very popular open-source tool with extensive usage and connectivity to the different platforms available. We can also deploy custom scripts running as Kubernetes resources inside our cluster, in the form of DaemonSets, sidecar containers, or maybe custom resources. These get invoked whenever they detect a certain situation; one example could be that if my disk is getting full, I can invoke a script which cleans up the disk and maybe moves some data off it. Then, of course, we cannot escape taking the help of AI-based capabilities nowadays; these could be analyzers that gather more detailed information about the cluster. We should also look at integrating our existing monitoring solutions and tools to extend their capabilities, like curating custom dashboards which can give us a holistic as well as a detailed view of what is happening inside the cluster, irrespective of which tool you are using; nowadays these tools support numerous types of dashboards which we can utilize.
Moving on to the last part, for someone who is starting afresh in this field: the more exposure you have, the more seasoned you will get in handling different situations. It's always essential to get a broader understanding of the entire architecture and the infrastructure, so that you are able to connect the different components whenever there is a situation. Also keep your runbooks handy to deal with different issues, errors, or warnings, and try to analyze the historical or recent occurrences of similar issues which might have caused outages in the recent past; these can prove to be very good learning points. Lastly, I would like to say that if an alert auto-resolves, it doesn't always mean that there is nothing wrong, so it's better to take a look anyway. That was all from my side. Thank you.
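For illustration, here is a minimal sketch of the kind of disk-cleanup script mentioned in the talk, the sort of thing a DaemonSet container could run against a hostPath mount. The mount point, usage threshold, and retention window are hypothetical values, not something prescribed by the speaker:

    # disk_cleanup.py - illustrative sketch of a cleanup script a DaemonSet
    # container could run against a hostPath mount. The mount point, usage
    # threshold, and retention window below are hypothetical values.
    import os
    import shutil
    import time

    MOUNT_POINT = "/host/var/log/app"  # hypothetical hostPath mount
    USAGE_THRESHOLD = 0.90             # act once the filesystem is 90% full
    MAX_AGE_SECONDS = 3 * 24 * 3600    # delete files older than three days

    def usage_ratio(path: str) -> float:
        """Return the fraction of the filesystem at `path` that is in use."""
        total, used, _free = shutil.disk_usage(path)
        return used / total

    def cleanup(path: str, max_age: int) -> None:
        """Remove regular files under `path` older than `max_age` seconds."""
        cutoff = time.time() - max_age
        for root, _dirs, files in os.walk(path):
            for name in files:
                full = os.path.join(root, name)
                try:
                    if os.path.getmtime(full) < cutoff:
                        os.remove(full)
                        print(f"removed {full}")
                except OSError as err:  # files can disappear while we scan
                    print(f"skipping {full}: {err}")

    if __name__ == "__main__":
        if usage_ratio(MOUNT_POINT) >= USAGE_THRESHOLD:
            cleanup(MOUNT_POINT, MAX_AGE_SECONDS)
        else:
            print("disk usage below threshold, nothing to do")

In practice such a script would be triggered either on a schedule or by the disk-usage alert itself, and would need guard rails (a dry-run mode, an allowlist of directories) before deleting anything automatically.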
...

Pravar Agrawal

Senior DevOps Engineer @ IBM



