Conf42 Incident Management 2023 - Online

There's No Place Like Production

Abstract

It was 3 years on the pager before I “caused” my first production incident and it stuck with me. How did this happen on my watch? Over time I realized that no matter how much testing and validation you do in other environments…there truly is no place like production.

Summary

  • Paige Cruz: There's no place like production. She tells the story of how, as a senior site reliability engineer, she made a change to a cannabis platform. Cruz wanted beautiful, structured, indexed logs that would let her filter by request. She started with extensive local testing before making the change.
  • Cruz says she didn't blame herself for the incident. She saw it as an opportunity to learn more about how the system works. She credits the learning from incidents community and seasoned SREs who've seen it all.
  • The incident retro document details each step of translating from Helm template to Kubernetes manifest to Argo application. The most popular part of the presentations, by far, was the life of a request diagram. A metaphor that answered the immediate question of what happened allowed attendees to focus on the deeper details of how and why.
  • Paige Cruz: Catch up with me at Pagerduty on Mastodon and LinkedIn, or email me. I'd love to chat about the time you took production down. Thanks.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Welcome to There's No Place Like Production. I'm Paige Cruz from Chronosphere. This talk was inspired by all of the discourse that swirls around a little phrase: I test in production. It's one of those phrases that people seem to want to take the wrong way and get into Internet fights about. So I want to tell you the story of how I learned that there's no place like production and that, yes, I do test in prod. But there are a few things you should know first. I started my tech career at a monitoring company. And when you work for a monitoring company, no matter where else you go after that, you are considered the monitoring expert, and that is a big mantle to carry. The second thing you should know is that I was hired into this organization as a senior site reliability engineer, and something that had been low-key stressing me out over five or six years of being an SRE and holding the pager was that I hadn't yet caused an incident. All of my friends who were seasoned had tales of taking down production, accidentally deleting clusters or databases. And I was just sitting there waiting for my moment. And the last thing is that during this time, I had a lot of stressors going on. I was stressed about the pandemic. I was stressed about a Kubernetes migration that I was having to close out, which would be fast-followed by a CI/CD migration, two really big projects back to back. I was stressed about losing about 80% of my department to attrition, and finally just the normal personal life stressors. So all of this is going on as the backdrop to this story. Where did this all happen? A sprawling cannabis platform that was 13 years old by the time I joined. We connected consumers who were looking to buy cannabis with dispensaries who were looking to sell cannabis. And as a part of this platform, we offered many different features, such as an online marketplace with live inventory updates, a mobile app for delivery and messaging, lots of ad purchasing and promotion, as well as point-of-sale register systems, really bread-and-butter technologies that these dispensaries and consumers depended on. Our story starts with a little component called Traefik. One day, I was investigating some issue in production, and I noticed something shocking about our Traefik logs. All of the logs were unstructured, therefore unindexed, and not queryable or visualizable. Basically, to me, useless. I couldn't do anything with these blobs of text. What I wanted were beautiful, structured, indexed logs that would let me group and filter by request path, remote IP, referer, router name, you name it. I wanted all of the flexibility that came with structured logging. So what all was involved in making this change? To sum up, Traefik is essentially an air traffic controller, routing requests from the outside to the inside, or from one part of your system to another. And it was one of two proxies doing this type of work. So here is an approximation of my pre-incident mental model of how our system worked. A client would make a request like, hey, is there any Turpee Slurpee flower near me? That would hit our CDN, which would bop over to our load balancer, pass that information to Traefik, which would then hand it to a Kubernetes service and ultimately land in a container in a pod. All along the way, at every hop, we were emitting telemetry like metrics, logs and traces that were shipped to one observability vendor and one on-prem monitoring stack.
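To make the change itself concrete: flipping Traefik from its plain text log format to JSON looks roughly like the following in Traefik's own static configuration. Our internal Helm chart exposed different keys, so treat this as a sketch of the intent rather than the actual diff.

    # traefik.yml (static configuration) -- illustrative sketch, not the real setup
    log:
      format: json        # structured daemon logs
    accessLog:
      format: json        # one JSON object per request: path, status code,
                          # router name, remote address, referer, durations
      fields:
        defaultMode: keep # keep every field so it can be indexed and queried

With that in place, each request produces a single queryable JSON document instead of an opaque line of text.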
So these are the steps I followed to make my change happen. I needed to update configuration that was stored in a Helm chart. We did not use Helm, but we did use Helm's templating. We would take those Helm charts, render them into raw Kubernetes YAML manifests, and pass those to an Argo CD app of apps, so a first layer of Argo CD applications, which would itself point to an individual Argo application, which would then get synced by Argo and rolled out to our clusters. But hey, this was a one-line configuration change. I'd been making these types of changes for months. What could go wrong? Let's find out and embark on deploying this change, shall we? I started with a foolproof plan. Whenever I'm a little bit nervous about a change, I really like to break down what my plan is for getting from where I am to production. And in this case, I started with extensive local testing. I wanted to make sure that our deployment process was up to date and that these changes would get picked up. I then wanted to announce these changes to the developers and my other teammates as they hit each environment, acceptance and staging. I planned, of course, to just let it bake, give it time, and finally, I would schedule and find a quiet time to try it out tomorrow in production. I made sure to request a PR review from the most tenured person on my team, the one with the most exposure to the system and how it had gotten built up over time. This helped me feel a lot more confident that I wasn't going to make some radical change by accident. So after it passed review and all of the PR checks, it was off to our first environment. This change landed first in acceptance, which was sort of a free-for-all. It was the first place all these changes would land to see if they would play nicely together. Unfortunately, this environment was a little bit undermonitored and relied on what we call the scream test, where unless somebody complains or actively screams that your change has caused an issue, you consider things good to go and operationally fine. So after letting it bake in acceptance for a little bit, I decided I was brave enough to push this to staging. So I deployed to staging and decided to let it bake for a little bit longer, overnight. One day later, it was time to take this change to production. And at this point, I had a little bit of what we now know is false confidence: it had passed through two environments, it had passed a human PR review and automated PR checks, and again, it was a one-line config change. So at this point, I was feeling so confident that as soon as the deploy status turned green and said it was successful, I went back to the circus act that was juggling all of those migrations without a manager and just trying to get my day job done.
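As an aside, the app of apps in that deploy pipeline is itself just an Argo CD Application pointing at a directory of other Application manifests. The sketch below uses placeholder names, repo URL, and paths rather than our real layout; it shows the pattern, not our exact bootstrap app.

    # Hypothetical parent "app of apps" Application
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: platform-apps              # bootstraps everything else
      namespace: argocd
    spec:
      project: default
      source:
        repoURL: https://example.com/infra/deploy-configs.git
        targetRevision: main
        path: apps/production          # directory of child Application manifests
      destination:
        server: https://kubernetes.default.svc
        namespace: argocd
      syncPolicy:
        automated:
          prune: true                  # the reconciliation loop applies whatever
          selfHeal: true               # the repository says, no questions asked

Each child Application pointed at a service's rendered manifests, and Argo's reconciliation loop kept the clusters in sync with whatever the repository said.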
Which brings us to the incident, because several minutes later, I noticed the incident channel in Slack lighting up. And from a quick skim, it didn't seem like this was just a small, isolated issue. In fact, we had teammates from across the organization, from different components and layers of the stack, reporting in that they'd been alerted and things were wrong. So impact was broad and swift. I kind of sat in the back, panicked, thinking to myself, was it my change? No. How? There's no way it was my change. We had all those environments; it would have come up before then. We've tested this. It's not my change. And having convinced myself of that, I muted the incident channel because I was not primary on call. It was not my responsibility to go in and investigate this. And I had a bit of a habit of trying to get involved in all the incidents that I could, because I just find the way that systems break fascinating, and with my knowledge of monitoring and observability, I can sometimes help surface telemetry or queries that speed along the investigation. But I was on my best behavior, and I muted the channel and went back to work until I got a Slack DM from my primary on-call, who, incidentally, had reviewed the PR, that said, hey, I think it was your Traefik change, and I'm going to need you to roll that back. And again, my brain just exploded with questions. How? Why? What is going wrong? What is different about production from every other environment where this change went out fine? But this was an incident. I didn't need to necessarily know or believe 100% that my change was causing the issues. What needed to happen was very quick mitigation of the impact that was causing our customers and their customers to not be able to use our products. And so even though I was unsure about the fact that it was my change, I immediately went to that revert button and rolled back my changes. After that, because I was now a part of the incident response, I hopped on the video call and just said, hey, I think it's possible that it was my Traefik change. I have no idea why, but I've gone ahead and rolled back. Let's continue to monitor and see if there's a change. And very quickly, all of the graphs were trending in the right direction. Requests were flowing through our system just like normal, and peace had been restored. Interestingly, during and after this incident, I received multiple DMs from engineers commending me on being brave enough to own being a part of the problem, to broadcast that I was rolling things back and that it was probably my change, and to really just own my part of the problem. And that got me thinking that we perhaps had some cultural issues with on-call and production operations. So I filed that one away for the future. And even later that day, when I was telling a friend about how I finally had the day that I took prod down, they replied, hashtag hugops, oh my God, that must have been so stressful that you were the reason that things broke. But I didn't really see it the same way. I actually didn't blame myself at all. I think I took all of the precautions that I could have. I was very intentional and did everything in my power to make sure that this change would be safe before rolling it out to production. And I didn't see what the difference would be between me making this change and someone else trying to make that change. I think we would have ended up with the same result. So I didn't blame myself at all. And I credit this to the learning from incidents community and just generally seasoned SRE folks who've seen it all. So, thank you. I went to bed, and when I woke up the next day, I could sense that all eyes were on me and this incident. Engineers from all up and down the stack were asking what happened, how, why? And I realized this was no incident. No. This was actually a gift that I was giving to the organization. This was an opportunity for the whole organization to learn more about how our system works, and we were going to cherish this gift. I became the Regina George of incidents and said, get in, betch, we are learning. And I wanted to bring everyone along. I was really determined to capitalize on this organization-wide attention on how our infrastructure works.
I had a little bit of work to do, because I myself was still mystified as to how and why this could have happened, and it was time to start gathering information for the incident review. It boiled down essentially to a hard-mode version of spot the differences between two pictures. After bopping around a few of our observability tools, I realized the quickest way to figure out what went wrong was to render the Helm charts into the raw Kubernetes manifest YAMLs and diff those. Facepalmingly simple, but in the heat of the moment it was not something that immediately sprang to mind, not when everyone's saying that everything is down. And let's remember my mental model of the system: CDN, load balancer, Traefik, service, pod. I really didn't understand why changing from unstructured logs to structured logs would affect the path that a request takes. And it turns out, when I played spot the difference, there was a key component missing from only the production environment. How was it that I missed an entire component getting deleted in my PR change? Well, we were all in on GitOps, using repositories as a source of truth and leveraging Argo CD's reconciliation loop to apply changes from the repository into our clusters. We used an Argo CD app of apps to bootstrap clusters. How did Argo even get Kubernetes manifests to deploy? Helm. In our case, we would take the service, secret, and deployment templates, with whatever values YAML was the base plus whatever values YAML was for that environment, interpolate and bungee those all together, and spit out the raw Kubernetes YAML, which we passed over to Argo. So what was missing? What I discovered was that there was a second container in the pod alongside the Traefik container. It was a security sidecar that acted essentially as a gatekeeper, vetting and letting requests into our systems. So diffing the Kubernetes manifests got me to what happened, but not really to how. And lo and behold, I noticed that where I entered my one line to get JSON-formatted logs, in a specific values YAML in the layers of the hierarchy, I had unknowingly overwritten the block for that second container definition in the pod, aka that security sidecar. So I accidentally deleted it and it caused a lot of havoc, but I had no warnings along the way that this very critical component had disappeared. Let's talk about learnings, because during the course of this investigation, I personally learned a lot about how my mental model of the system didn't reflect reality, what was actually happening when we merged PRs, and what to look out for the next time I made a config change. But I also had a lot of curious engineers, managers and leaders wondering what happened, because anytime there's a sitewide outage, you get a lot of attention. So here's how I shared my learnings. The first place for sharing learnings, of course, is the incident retro, and I made sure that my document had a clear timeline of events and talked about in detail each step of translating from Helm template to Kubernetes manifest to Argo application to app of apps, the whole process from start to end. Because I wanted anybody, not just the people on the SRE or infrastructure team, I wanted anybody in the organization to be able to understand how this happened.
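One plausible shape of that values overwrite, sketched with made-up keys rather than our actual chart: Helm merges values maps key by key, but a list defined in a higher-precedence values file replaces the earlier list wholesale, so an edit that only mentions the Traefik container can silently drop its sibling.

    # values.yaml (base) -- hypothetical keys
    traefik:
      containers:
        - name: traefik
          logFormat: common
        - name: security-sidecar      # the gatekeeper vetting incoming requests

    # values-production.yaml -- the intended change: JSON logs for Traefik
    traefik:
      containers:
        - name: traefik
          logFormat: json             # lists are replaced, not merged, so after
                                      # rendering, the security sidecar is gone

The template side compounds it: a chart that wraps optional containers in a Go template block such as a with statement simply renders nothing when that value comes out empty, with no error from rendering, which is why nothing flagged the missing sidecar before Argo synced it away.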
After that, I took this little incident retro document, along with a life of a request diagram, first to Chapter Backend, which was a community group for sharing learnings and announcements across all the backend engineers. And then I also took it to Chapter Frontend to share all of this knowledge with our frontend engineers. There was really no shame in my learning game. I was taking this presentation anywhere people would have me. And finally, I took it to SRE study time, which was the dedicated space for learning, for my team to really dig into the details. The thing that came in the most handy, though, was a metaphor, because I was talking to leaders, and I was talking to engineers who maybe hadn't been exposed to Kubernetes. So I needed a really handy metaphor to explain the impact. I said my PR essentially resulted in me taking out the front door of our house and replacing it with a brick wall. Nobody could go inside, but the people who were already inside could talk to each other. This was a really simple analogy that answered the immediate question of what happened and what the impact was, which allowed the attendees to focus on the deeper details of how and why. Something I kept in mind when explaining this to teams outside of my own was our information and knowledge silos. Because we were a Ruby, Node, and Elixir shop, not a Go shop, we didn't have everybody trained up on Kubernetes. As you saw from the first few slides, we had actually just migrated to Kubernetes, so a lot of the infrastructure was mysterious to our developers. So I made sure, specifically for Chapter Frontend and Backend, to call out the idiosyncrasies of Go templating and its errors, to explain the order of precedence for values YAML files in Helm, to review the app of apps pattern with Argo, and to explain what actually happened when one of them merged a PR. This went a really long way toward building a shared foundation of understanding. The most popular part of my presentations, by far, was the life of a request diagram. It broke down the end-to-end path that a request would take from the client all the way down to the application running in a pod, and this was the first time that some of these engineers had even seen this fuller picture of what was going on in our system. So it felt really good to be able to share this knowledge. Ultimately, I reflected on the central question I had been asking myself: what was different between the production environment, where my change blew everything up, and staging or acceptance or local, where my changes seemed to test just fine? It felt like I was playing spot the difference in extreme mode, which I've recreated for you here. Tell me which one of these is a crow and which one is a raven. Unless you're a birder, it's pretty hard to say. Sometimes working in these complex systems can feel like playing a very risky game of Jenga. I kept thinking that on their own, YAML, Go templates, Kubernetes application definitions, and even the Argo app of apps pattern aren't terribly complex or confusing concepts to understand, but it's the way that they're all puzzle-pieced together into a system, stacked and pointed at each other and layered, that made this change in this complex system really difficult to diagnose. To sum it up, I learned the hard way that there literally is no place like production. And it's not that I test in prod, it's that we all test in prod. Thanks so much for listening. I'm Paige Cruz with Chronosphere. Catch up with me at Pagerduty on Mastodon and LinkedIn, or email me. I'd love to chat about the time you took production down.
...

Paige Cruz

Senior Developer Advocate @ Chronosphere



