Conf42 Site Reliability Engineering (SRE) 2025 - Online

- premiere 5PM GMT

Deployments, Downtime, and Unexpected Fires: An SRE Survival Story

Abstract

Unlock real-world SRE lessons from enterprise banking to startups! Learn to tackle deployment challenges, avoid cloud misconfigurations, and cut costs. This talk blends engaging stories with technical insights, offering practical takeaways for engineers managing infrastructure efficiently.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Deploying infrastructure is always a mix of planning and firefighting. Sometimes everything works perfectly; other times you are troubleshooting at 2:00 AM wondering what went wrong. Hey everyone, my name is Pradeep Gaddamidi and I'm a senior DevOps engineer. My agenda today covers a little bit about my journey and my experiences with incidents, tools, and the SRE world. Coming from the banking and financial services sector, I started out working with legacy systems such as WebSphere, Tomcat, and manual deployments. Custom scripts helped, but they introduced a margin for error that made me appreciate the value of standardized tools like Terraform, Helm, and CI/CD pipelines. This talk is a collection of real-world SRE failures, lessons learned, and the unexpected challenges that come with managing infrastructure. So let me go over my experiences in the real world.

Let me start with a data center incident, and the lesson up front: make sure your cloud provider's service terms align with your requirements, especially around RPOs. When we signed up for GCP, we used Google Cloud Storage with dual-region buckets. We assumed this meant our data would be replicated instantly between the two regions. Then a data center fire hit, and objects in our GCS bucket were unavailable for up to two hours. We later discovered that GCS's default replication only guarantees that 99.9% of newly written objects are replicated within one hour, and 100% within 12 hours. Meaning, if a data center disaster happens, some of your objects can take up to 12 hours to replicate to the other region. Google later recommended enabling turbo replication, which guarantees 100% replication to both regions in under 15 minutes, regardless of object size. Had we known this beforehand, we would have enabled that feature and never faced the situation. So what's the lesson here? Always research whether a product meets your SLA needs before you go to production, and read the terms and conditions seriously; that will save you from incidents of this kind. Research the product before you sign up. I'll show in a second what checking and flipping that setting can look like.

Let me go over my experiences and incidents with certificates. SSL certificate issues are more common than you would expect. A few years ago, public certificate validity dropped from three years to one year, which means you have to renew your certificates every year. Public wildcard certificates, especially, are often reused in multiple places, which makes tracking them harder. So how do you track them? If you use commercial products like Datadog, look for features like synthetic monitoring, which you can use to probe your endpoints, read the certificate expiry date, and renew certificates on time. If you're using open source tools such as Grafana, look at the Infinity plugin; with it you can query an endpoint, pull the certificate information, and keep track of expirations. Proper documentation of your certificates and the locations where they're installed is also very important. It sounds basic, but certificate-related outages can be seriously embarrassing. Let me give you an example: at one company, a financial client, we had an outage because an intermediate certificate was missing.
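Before I dig into the certificate chain story, here is a quick follow-up on that replication incident. This is a minimal sketch, not the exact code from the incident, of how you could check a dual-region bucket's recovery point objective and switch it to turbo replication with the google-cloud-storage Python client. It assumes a recent client version that exposes the bucket `rpo` property, and the bucket name is a placeholder.

```python
# Minimal sketch: check a dual-region bucket's RPO and enable turbo replication.
# Assumes a recent google-cloud-storage client that exposes `bucket.rpo`.
from google.cloud import storage

RPO_ASYNC_TURBO = "ASYNC_TURBO"  # turbo replication setting


def enable_turbo_replication(bucket_name: str) -> None:
    """Report the bucket's current RPO and switch it to turbo replication."""
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    print(f"{bucket_name}: current RPO is {bucket.rpo}")
    if bucket.rpo != RPO_ASYNC_TURBO:
        bucket.rpo = RPO_ASYNC_TURBO
        bucket.patch()  # persist the change on the bucket
        print(f"{bucket_name}: turbo replication enabled")


if __name__ == "__main__":
    enable_turbo_replication("my-dual-region-bucket")  # placeholder bucket name
```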
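And on the certificate-tracking side, this is the kind of expiry check I mean. Whether you automate it with Datadog synthetics, the Grafana Infinity plugin, or a small script of your own, the idea is the same: hit the endpoint, read the certificate, and alert well before it expires. A minimal sketch using only the Python standard library; the endpoint names are placeholders.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(host: str, port: int = 443) -> int:
    """Connect to an HTTPS endpoint and return days until its certificate expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # 'notAfter' is a string like 'Jun  1 12:00:00 2026 GMT'
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc)
    return (expires - datetime.now(timezone.utc)).days


if __name__ == "__main__":
    # Replace with the endpoints you actually care about.
    for endpoint in ("example.com", "www.example.org"):
        print(f"{endpoint}: certificate expires in {days_until_expiry(endpoint)} days")
```

Run something like this from cron or a scheduled CI job and alert at, say, 30 days out; the point is that the expiry date comes from the live endpoint, not from a spreadsheet.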
Let me explain the certificate terminology in simple terms. The server certificate is signed by an intermediate certificate, and the intermediate is signed by a trusted root. If the chain isn't complete, visitors may see an incomplete-chain error. So how do you avoid this? The moment you renew any certificate, go to a free online tool like SSL Labs to verify the complete certificate chain and make sure the root, intermediate, and server certificates are all in place. If you have the openssl utility on your command line, you can also hit the endpoint with the -showcerts option, which prints the complete certificate chain; look for "Verify return code: 0" at the bottom, which tells you the certificate is good and the chain is complete.

Let me talk about my experience with logging and monitoring. As an SRE, logs are your lifeline during an incident. If an incident happens, you always reach for logs and metrics first. So ensure that all systems have proper logging enabled before you even promote them to production, and make sure those logs are actually forwarded to a centralized logging platform. This should be part of your pre-deployment checklist. Let me describe a recent incident. Some records were missing from a database, and that database wasn't configured to emit logs properly, so there weren't many logs on the database side. The virtual machines connecting to that database had logs, but they weren't forwarded to a central logging system. We had to look for logs in multiple places, which delayed our investigation. So it's always a good idea, first, to have proper logging in place and, second, to forward those logs to a centralized system like Datadog or Elasticsearch. If you don't want to use commercial products due to budget constraints, you can always set up something open source like Elasticsearch.

Likewise, monitoring plays an important role. Let me give you a simple incident that happened at my previous company. I set up an Elasticsearch logging system, but I couldn't set up monitoring for the disks due to various reasons and time constraints. It sounds simple, right? Being an SRE, how could I miss that? But sometimes it's all about time and priorities. So I missed setting up the disk alert, and Elasticsearch stopped ingesting logs once the disk reached 80% because of its watermark settings. It's always better to set up these alerts the moment you install the software or provision the infrastructure; I'll show what that kind of check looks like in a moment. The best approach is to use configuration management tools like Ansible or Puppet to ensure the monitoring and logging agents are always installed as part of your infrastructure provisioning. We all learn from our mistakes, don't we? The takeaway here is: always have a centralized logging and monitoring system, so it's easier to track issues and fix them quickly.
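Here is roughly the disk check I wish I had wired up on day one. It's a minimal sketch using only the Python standard library; the data path and the thresholds are assumptions you would adjust for your own cluster, and `notify` is a placeholder for whatever alerting channel you already use.

```python
import shutil

# Thresholds are illustrative; Elasticsearch's disk watermarks kick in well
# before the disk is actually full, so alert early.
WARN_PERCENT = 70
CRIT_PERCENT = 80


def notify(message: str) -> None:
    # Placeholder: wire this into your existing alerting (Slack, PagerDuty, email, ...).
    print(message)


def check_disk(path: str = "/var/lib/elasticsearch") -> None:
    """Warn when the data path crosses a usage threshold."""
    usage = shutil.disk_usage(path)
    used_percent = usage.used / usage.total * 100
    if used_percent >= CRIT_PERCENT:
        notify(f"CRITICAL: {path} is {used_percent:.0f}% full")
    elif used_percent >= WARN_PERCENT:
        notify(f"WARNING: {path} is {used_percent:.0f}% full")


if __name__ == "__main__":
    check_disk()
```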
Let me briefly cover my experience with Terraform. I didn't have any real incidents with Terraform, but I observed that things could be improved at a workplace by following a few recommendations. If possible, use the public Terraform modules provided by cloud vendors unless your use case is truly unique. I spent a lot of time writing custom modules that I later replaced with standardized, well-maintained modules from the Terraform Registry; it saved time and reduced maintenance. For example, say you want to spin up a GCP VM. You have a couple of options: you can write all the modules and resources yourself, or, if it's a simple use case, just use the pre-existing public modules from the Terraform Registry. My second recommendation is, if possible, to integrate Terraform with CI/CD. At my previous workplace we were all running Terraform from our workstations, and there is always a chance that you provision infrastructure without any approval from your teammates. It's always better to have that approval before you provision infrastructure or run any code, and having Terraform plan and apply steps in your CI/CD pipeline lets your teammates review changes before the infrastructure is provisioned. You can also run security scanning tools against your Terraform code; they catch misconfigurations, exposed secrets, and compliance issues before they hit production. The takeaway here: reuse what's reliable, and shift Terraform into your CI/CD workflow.

Expect the unexpected. SRE isn't just about tools; it's about resilience, adaptability, and curiosity. Sometimes you are solving infra issues; other times you are wearing four hats at once. But every incident, every mistake, and every fix is a chance to grow. At my last company, I set up the whole monitoring system and thought I was done. But no: the data being ingested into Prometheus was coming in different formats. Some of it arrived in protobuf format, so I had to write Python code and deploy it in a cloud function to convert that protobuf data into a format Prometheus understands. It was an unexpected crash course across languages, formats, and platforms. That's what SRE is all about. Another time, we had chosen Elastic for monitoring and logging, assuming the free tier would be enough. It turned out Slack alerts were locked behind a paywall; we would have had to upgrade to the premium version to send any alerts to Slack, so we ended up writing our own Python scripts to send those alerts ourselves. My point here is: always pilot tools in a dev environment before moving to production. If you understand your requirements before you go to production, you avoid a lot of manual work, and you find out whether the product actually works for you. So the key takeaway is: always pilot things, and be ready to learn new tech, switch hats, and fix things under pressure as an SRE. Not everything goes according to plan. Sometimes you are learning, piloting, implementing, and fixing all in one day. It's intense, but that's where the real growth happens. Thank you very much for listening to my talk, and thank you for this opportunity. Please feel
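One last sketch before the wrap-up: the kind of small Slack alert script I mentioned. It assumes you have created a Slack incoming webhook; the webhook URL and the example message are placeholders.

```python
import json
import urllib.request

# Placeholder URL; Slack gives you one per channel when you create an
# incoming webhook integration.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"


def send_slack_alert(text: str) -> None:
    """Post a plain-text alert to a Slack channel via an incoming webhook."""
    payload = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()  # Slack responds with "ok" on success


if __name__ == "__main__":
    send_slack_alert("Example alert: Elasticsearch disk usage crossed 80% on a data node")
```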
...

Pradeep Gaddamidi

DevOps Engineer @ RADAR

Pradeep Gaddamidi's LinkedIn account


