Conf42 Cloud Native 2022 - Online

Pushing Code: Don't Forget to Flag Your Canaries


Abstract

Reliability is multifaceted. Your approach to releases plays a part in that. It’s smart to think forward. What’s impacted? What will change? Canary releases, or phased rollouts, allow you to better manage the release lifecycle and understand any impact to reliability. Iterating and canarying are more involved than one big traditional release. You need to look at the code being deployed and flag everything that comprises each new feature. You’ll also need to tag groups of users. And instead of one big release, you do several smaller releases where more groups of users receive more features each time.

In this talk you’ll learn:

  • Why you should consider an iterative canarying approach to releases
  • How to know when it’s safe to expand and iterate
  • How users rely on your services, and how to find the ideal groups to canary

Your reliability plan should comprise these parts. Learn the agile way.

Summary

  • Today's talk is titled Pushing Code: Don't Forget to Flag Your Canaries. In this talk we'll cover a variety of topics, including reliability and how it's most importantly about the user. At the end we'll have a quick question and answer session.
  • Reliability is feature number one. It doesn't matter what shiny new feature you put into your application. As you grow into a bigger company, reliability becomes more and more important. The practice of SRE is about bridging the gap between the dev side of things and the operational side.
  • How likely the login is to succeed is extremely important to the success of your application. It annoys the user when something is slow. What really matters to the overall experience of your customer is how slow it is over a period of time.
  • In software development, canary means releasing to a set of users over time until you've rolled out a feature to all customers. Why do we do this? One, there's less impact to reliability. Two, you get continuous feedback from the various users you're rolling out to.
  • The difference with canary releases is that the deploy step is broken up into multiple steps. You take every feature and repeat the process multiple times for every set of customers. And releasing to a small set of users means that your monitoring and fixes are less urgent.
  • While feature A rolls out, you offset feature B's release and run the same process for feature B independently. How does that work in a timeline or chronological fashion?
  • Feature flags allow us to inform timing, stability and the go-forward strategy. You can proactively monitor what got pushed, because you can see when the feature flag was enabled and who it got pushed to. Once everything is stable and you've fixed any issues you identified, you roll out to other users.
  • The best group to target is the one that is already using your product area or has specifically asked for a feature set. A canary deployment is the lowest risk-prone compared to all other deployment strategies because of the level of control.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome. Today's talk is titled Pushing Code: Don't Forget to Flag Your Canaries. Thanks for joining in. In this talk we'll cover a variety of topics. We'll go into reliability and how it's most importantly about the user. Why do we do canary releases in the first place? When is it safe to expand those canary releases? We'll also dive into feature flagging various features, understand how our users rely on our services, and identify the ideal groups in your user base to target. And at the end we'll have a quick question and answer session.

Let's get started with reliability. Reliability is feature number one. Without reliability, nothing else matters. It doesn't matter what shiny new feature you put into your application. If your application is not reliable, the customer won't care. Now, reliability might mean a few different things to your team, based on where you are in the maturity of your application or service, and also the scale and growth of user adoption. Let's say you're a startup. You're just worried about getting the features out. You're not focused too much on reliability or the other bells and whistles, because you want to get the feature out; if you don't get it out, you'll get beaten by competitors. So you're really focused on delivering features, and value to your customers, as fast as possible. Then at some point, as you get more customers, reliability starts becoming more and more important, because now you've got customers paying for your product, and maybe they're complaining that something's not working, and that is the reliability of your application. Your reliability is suffering because you moved fast, which you had to do, but now you're at a point where you need to address those concerns so your customers don't leave. And as you grow into a bigger company, that reliability becomes more and more important. You see a lot of companies become very stagnant sometimes as far as releasing products, because they've built such a massive platform, and their customers are relying on it so heavily, that it's really hard for the company to release new versions or new features without going through a whole suite of testing, and that takes a long time. That's the other extreme, where reliability takes over and innovation drops off.

The State of DevOps report sponsored by Google states that reliability is a combination of measures, of which availability is one of the most important. Availability can be tracked with the four golden signals: latency, traffic, saturation and errors. If your users can't easily access the service, or if it's unbearably slow, or if there are errors popping up everywhere, then it's not reliable. We've all been there. We've all used a SaaS product, or any other product, where you're sitting there waiting for the loading spinner to end and it just keeps going in an infinite loop. Or you click a tab and you see a loading skeleton, which at initial glance looks a little better. That's why loading skeletons were invented, because they give the user a perception of speed. But in fact you're still waiting. So we've all been there, we've all seen it, we know exactly how annoyed we get. And at that point all the bells and whistles, all the nice shiny new features, don't really matter, because we're so annoyed at the one feature we wanted to use at that time, and it's not working.
Our internal SRE expert, Kurt Anderson, who spent years as part of LinkedIn's SRE team, says reliability is a team sport. It's a constant balancing act between pushing new code to production and monitoring your services to make sure they perform as expected, within the thresholds you've set for resources, and are just overall healthy. The practice of SRE is about bridging the gap between the dev side of things and the operational side of things, so that innovation can keep going at a fast pace while also maintaining the reliability of your application. Oftentimes it feels like innovation goes super fast and then reliability falls off. And at other times reliability takes over as the number one priority, and then innovation drops off. That's what I talked about: companies at different stages tend to go to different extremes sometimes, but really the perfect experience is having a balance between them, where innovation is fast but reliability is also high. That's the perfect balance. That's the perfect experience for your customer, your development teams and your operational teams. SRE lives between dev and ops, and our job is to smooth things out as much as possible. We want to connect dev and ops so that they can keep going at equal pace without causing too much friction. And as an engineering team, we have to think about what's planned work and what's unplanned work. When something goes wrong in production, yes, you released something really fast, but something went wrong in production, and that causes unplanned work for your teams as well. So it's not just impacting the customers, it's also impacting internal engineering teams. In our topic today, we'll go over how you can continue to release and innovate while meeting your customers' expectations of reliability and their needs for new features at the same time. To do that, you have to adopt an iterative release process using canaries and feature flags.

Before we do that, though, let's take a simple example. I've got an ecommerce application where a customer can log in, search and buy things, and I've got a simple user journey here. The user is logging in. How likely the login is to succeed is extremely important to the success of your application, and it does pertain to reliability. If the user can't even get through the front door, what use is your product? What use are the shiny new features? They're of no use at all. Now, let's say the customer was able to log in and now they're searching for a product. How fast those search results come back is extremely important. Obviously not as important as getting in the front door in the first place, but fairly important, because it annoys the user when something is slow. We talked about this just a couple of slides ago. We've been there. You're trying to use a product, even if you're not buying something, and it's slow. It annoys you, right? It gives you a bad experience with that product, and you'll hold onto that negative experience until it gets resolved somehow. Now the customer has found the product they're looking for and they want to add it to their cart. If it gets successfully added to the cart, awesome. If it doesn't, or it takes a long time to do so, that's a problem, because the customer wasn't able to do what they came to your application to do. And if they're paying you money, maybe a membership fee or a monthly fee, they're paying you to do what they came here to do. And if they can't do it, they're not going to be happy.

All three of these things put together (obviously this is a simplified example) encompass how happy a customer is going to be with your product. Now, in this example we talked about latency as one of the big metrics, and that would be the SLI, the service level indicator, we'd want to use if we were measuring this. The service level objective could be: I want this latency to be below a certain threshold over a 30-second window. The SLI would be how you measure, and the SLO would be what you want to measure and optimize, because you're probably always going to get one or two responses, occasionally, that are slow. But what really matters to the overall experience of your customer is how slow it is over a period of time. If it's just slow for one call and then it gets super fast again, the customer probably won't be affected too much and won't mind too much. But if it's consistently slow, then we have a problem. And you can measure that using SLOs and SLIs. Blameless plug.
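To make that concrete, here's a minimal sketch of evaluating a latency SLO over a window of requests rather than per call. The threshold and target are made-up numbers, and the helper is illustrative, not any particular monitoring tool:

```python
from dataclasses import dataclass

@dataclass
class LatencySLO:
    """Evaluate an objective like: 99% of requests in the window complete under 300 ms."""
    threshold_ms: float = 300.0   # SLI threshold for a single request (illustrative)
    target_ratio: float = 0.99    # fraction of requests that must meet it (illustrative)

    def is_met(self, window_latencies_ms: list[float]) -> bool:
        if not window_latencies_ms:
            return True  # no traffic in the window; nothing violated
        good = sum(1 for ms in window_latencies_ms if ms <= self.threshold_ms)
        return good / len(window_latencies_ms) >= self.target_ratio

slo = LatencySLO()
# One slow request in the window doesn't breach the objective...
print(slo.is_met([120, 95, 2500] + [100] * 300))   # True: ~99.7% under threshold
# ...but consistent slowness over the window does.
print(slo.is_met([450] * 50 + [100] * 50))         # False: only 50% under threshold
```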
Now, let's talk about canarying. What is canarying? Well, the term canary comes from an old mining tactic where miners would bring canaries into tunnels to detect toxic gases. If the canary survived, that meant the tunnel was safe. If it didn't, well, not so safe. How do we relate that to software development? In software development, canarying means releasing to a subset of users at a time until you've rolled out a feature to all customers. Why do we do this? Well, one, there's less impact to reliability. As I said, you're releasing to a subset of users iteratively, so you can roll out to one set of customers, monitor, and make sure things are looking okay in your metrics, tools, et cetera. And if they're not, you can stop the rollout, or even roll it back, without impacting all of your customers, right? So you're more reliable for all the other customers. Two, you get continuous feedback from the various users you're rolling out to. If you roll out to the first set of customers and they tell you, this sucks, we can't even use it, it's not even the workflow that we want, maybe you pause, go back, revisit the design, and don't roll out the feature to everybody else, because you never know. And then three, smoother operations, right? You're releasing to fewer users, so the surface area for incidents has decreased dramatically. And if you don't encounter incidents with the set of users you released to, then you can roll out to another set of users, keep doing the same thing, monitor, and then roll out to the next set of customers, hopefully reducing incidents as much as possible and causing less strain on internal engineering teams.
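A common way to carve out those iterative cohorts (a generic sketch, not how any particular team does it) is to hash a stable user ID into buckets, so that raising the rollout percentage each sprint only ever adds users:

```python
import hashlib

def rollout_bucket(user_id: str, feature: str, buckets: int = 100) -> int:
    """Deterministically map a user to a bucket 0-99 for a given feature.

    Hashing user_id together with the feature name means each feature
    gets an independent, stable slice of the user base."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets

def in_canary(user_id: str, feature: str, rollout_percent: int) -> bool:
    """A user is in the canary if their bucket falls below the rollout percentage.

    Raising rollout_percent (say 25 -> 50 -> 75 -> 100 across sprints)
    only ever adds users; nobody who already has the feature loses it."""
    return rollout_bucket(user_id, feature) < rollout_percent

print(in_canary("user-42", "feature_a", 25))  # stable answer for this user/feature pair
```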
Let's talk about the software development lifecycle for a bit. You might think, okay, well, you've got these iterative releases, canary releases. How does that affect my lifecycle? The short answer is it doesn't. You're still going to be building your application, testing it, deploying it, monitoring, and then, if you find any issues, fixing them, reviewing the fix, learning from it, and repeating the process over and over again. The difference (and I've got two different use cases being demonstrated here) with canary releases is that the deploy step is really broken up into multiple steps, right? You release to the first set of customers, do the monitor, fix, review, learn, build cycle again, then roll the same feature out to the next set of customers, and repeat the process. So really you take every feature and repeat this process multiple times for every set of customers. It's the same process, you're just doing it in a more iterative way, over and over again. And the fact that you're breaking it up, and that you're releasing to a small set of users, means that your monitoring and fixes are less urgent, because you can turn the feature off too. You've got feature flags, so you can turn off the feature if it introduced a major issue. The other part I want to showcase here is feature A and feature B. Not only are you releasing to subsets of users over time for feature A, you're also going to offset feature B, release it, and run the same process for feature B independently.

How does that work in a timeline or chronological fashion? Well, here's a good example. This is essentially the same chart as the last one, but more chronological and, I guess, visually easier to understand. So here we've got feature A and feature B, just like the last diagram, and then the numbers one through six. Those numbers represent sprints. The y axis here represents customers, and the x axis is chronological, by sprint. Let's say feature A is done and we're ready to roll it out to the first set of customers; we roll it out to 25% of customers in sprint one. We do the whole process: monitor, fix any issues, get feedback. Once we're ready and we feel comfortable, in sprint two maybe we roll it out to 25% more customers. Then in sprint three, we roll it out to 25% more customers. And maybe sprint three, as we're about to reach general availability, is the time to button up our documentation and make sure it's all up to date. Then in sprint four we roll it out to all the customers. Now, going back to the start of sprint three, you can see feature B was done and also ready to release. So with feature B we repeated the same process, but feature B in sprint three is only released to 25% of customers, whereas feature A is released to 75% of customers. And hopefully feature A and feature B are decoupled enough that if something went wrong with only feature A in sprint one, two or three, you could pause feature A and continue releasing feature B, because they're feature toggled, right, and continue the canary approach with feature B while feature A is paused and you're working on fixing the issues. This allows you to unblock your pipeline. You don't have to do major git wrangling, which, let's be honest, in some cases is unavoidable. But hopefully you've decoupled the features enough that you don't have to do that git wrangling, and you can make the process easier for your engineers. Your developers don't have to rush to make sure they're reverting, make sure they're not losing code, make sure they're not having merge conflicts: all that good stuff that comes with messing around with git histories and making git merge commits and reverts.

Now, feature flags. We talked about them a little bit with feature A and feature B in the last couple of slides, but what do they do for us? They allow us to inform timing, stability and the go-forward strategy. You can proactively monitor what got pushed, because you can see when the feature flag was enabled and who it got pushed to, because you can obviously specify which customers you enabled the feature flag for. And that's how we would do a canary: you would say feature flag A is enabled for customer A and customer B, and then feature flag A, in two weeks, gets enabled for customer C, customer D, and so on and so forth. You know exactly when you made those changes, and then you can watch for incidents before you continue to push.
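As a rough illustration of that kind of per-customer targeting and audit trail (the flag store and names here are hypothetical, not a real feature flag service):

```python
from datetime import datetime

# A hypothetical flag store: which customers a flag is enabled for, and when.
# A real system would use a feature flag service; this is illustrative only.
FLAG_AUDIT: dict[str, list[tuple[datetime, set[str]]]] = {
    "feature_a": [
        (datetime(2022, 3, 1), {"customer_a", "customer_b"}),           # first canary
        (datetime(2022, 3, 15), {"customer_a", "customer_b",
                                 "customer_c", "customer_d"}),           # expanded two weeks later
    ],
}

def enabled_for(flag: str, customer: str) -> bool:
    """Check the most recent allow-list for this flag."""
    history = FLAG_AUDIT.get(flag, [])
    return bool(history) and customer in history[-1][1]

def enabled_since(flag: str, customer: str) -> datetime | None:
    """When did this customer first get the flag? Useful for correlating
    a metrics spike with a specific rollout step."""
    for when, allow_list in FLAG_AUDIT.get(flag, []):
        if customer in allow_list:
            return when
    return None

print(enabled_for("feature_a", "customer_c"))    # True
print(enabled_since("feature_a", "customer_c"))  # first enabled 2022-03-15
```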
For example, in the diagram I've got here on the left, from Sumo Logic, you can see the icon for the feature flag. You can see that around April 1 the feature flag was turned on, and you can see there's a spike in Prometheus memory usage. The pink line and the blue line both spiked, and the orange line sort of spiked there. What happened? Well, if we didn't have feature flags, we might not necessarily know when something was rolled out. You could obviously go to your releases and correlate that, but it's so much easier when you have something that you can explicitly turn on and, more importantly, explicitly turn off if you see issues. So if we saw this spike, we could potentially turn it off. Maybe we turned off a feature flag, and then it goes back down. And once everything is stable and you've fixed any issues you identified, you roll out to other users. And in all of this, also test your rollback strategy, because in the worst-case scenario you have to roll back the code, so make sure you have a strategy in place if you do need to do that.

Let's take a simple case study regarding a project that we internally canaried, called CommsFlow. CommsFlow is a communications flow tool that allows users to customize message templates and messages, and who they want to send those messages to, based on certain events. In the Blameless platform, for example, incident status changes, incident severity changes or retrospective state changes can trigger a Blameless event, which would trigger a message, with variables in it, to SMS, email, Slack, Microsoft Teams, or status page recipients. And those recipients could be internal or external. You could directly post to a customer status page; you could directly post to a public page if you wanted to. Now, in this product you can see there are different areas that could be feature flagged individually. For example, we could take incident status, incident severity and postmortem state; those could all be separate features. Maybe we only release incident status, because that was the only thing that was complete at the time. Maybe we release that first behind a feature flag, and then maybe Blameless events and reminders are separated as well. Maybe we only want to do events at the moment; we don't want to enable reminders, or they're not done yet, the development isn't complete, so we just keep that feature flag off. And then in the message template we've got those variables that I mentioned. We could turn off variables, because, like I said, they might not be done, or you want to minimize the release scope. And even the recipients could be controlled with feature flags. So pretty much anything could be feature flagged here, but it really depends on your organization, your team, and how granular you want to be with feature flagging.
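Purely as an illustration (these flag names are invented, not Blameless's actual configuration), splitting a product area like CommsFlow into independently toggleable flags might look like:

```python
# Hypothetical flag layout for a product area like CommsFlow.
# Each trigger, capability, and recipient type can ship independently.
COMMSFLOW_FLAGS = {
    "commsflow.trigger.incident_status": True,     # the only piece released first
    "commsflow.trigger.incident_severity": False,  # still in development
    "commsflow.trigger.postmortem_state": False,
    "commsflow.reminders": False,                  # events only for now
    "commsflow.template_variables": False,         # minimize release scope
    "commsflow.recipient.slack": True,
    "commsflow.recipient.status_page": False,      # external surface, keep gated
}

def flag_on(name: str) -> bool:
    """Default to off: an unknown flag should never expose unfinished work."""
    return COMMSFLOW_FLAGS.get(name, False)
```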
Here's a rough overview of the plan we had for how we rolled out CommsFlow. We had the first iteration in early December, with a certain set of features built for CommsFlow, and at the bottom we had some customer accounts that we had determined were asking for the feature and wanted to use it, et cetera. We determined who those were, rolled out iteratively using a canary approach to those customers, and then let them play with it for all of December. Then in January we came out with the next set of features, which were individually feature toggled as well. And then we decided, okay, all the customers can have this feature, because we felt comfortable and confident based on our previous releases.

What did we learn in this process? Well, a couple of things. Production readiness: as we rolled out to customers iteratively, in subsets, using a canary approach, we were able to get feedback from those customers. We were able to monitor our metrics and make sure nothing was breaking. That gives us more confidence in how ready the application is for production. Obviously we do production readiness testing before we release to any customer as well, but getting that feedback, getting that visibility into customers using it while our logs look okay and our services are healthy, gives us even more confidence that yes, when we roll this out, it will work, and it won't break and cause headaches. The other thing is a much more agile roadmap. We were able to break everything up into different features, so that people could keep developing without worrying about overriding somebody's change or merging something that couldn't go out to production, because we had feature flags in place.

And a couple of other things: ownership and code cleanup. Ownership, what does that mean? Well, we had a lot of features and created a bunch of feature flags, and in some cases we dropped the ball as far as who the owner was. Who was responsible for enabling the feature flag? Is it the product manager, the engineering manager, or the engineer? That wasn't super clear for some of the features we rolled out. And so we had features that we rolled out at one point that weren't enabled for customers; we thought they were released, and then we had to go back and enable them. So, lesson learned. We obviously learned from that, are working towards making it better, and have a process for it now. But keep that in mind as you're adding feature flags throughout your code base. And since you're adding feature flags throughout your code base, code cleanup becomes an issue as well. Let's say you've gone through the canary process and enabled the features for x number of customers over time. Now you've got to go back and clean up that code, because otherwise you're going to have really bad code with a bunch of if statements checking for certain features, et cetera, and those aren't really useful anymore because you've already enabled it for everybody. So you want to go back and take those out. That's an important aspect from a development experience perspective.
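For instance, once a flag has been on for everybody, you're left with dead branches like the "before" version below (an illustrative sketch with stubbed helpers), and the cleanup is simply deleting the check and the legacy path:

```python
def flag_on(name: str) -> bool:            # stub: imagine the flag service from earlier
    return True

def dispatch_templated_message(event, recipients):  # stub for illustration
    return f"templated:{event}"

def legacy_dispatch(event):                          # stub for illustration
    return f"legacy:{event}"

# Before cleanup: the flag has been at 100% for months, but the conditional
# (and the dead legacy path) still lives in the code.
def send_notification(event, recipients):
    if flag_on("commsflow.trigger.incident_status"):
        return dispatch_templated_message(event, recipients)
    return legacy_dispatch(event)  # dead branch once rollout hits 100%

# After cleanup: the flag check and the legacy branch are deleted.
def send_notification(event, recipients):
    return dispatch_templated_message(event, recipients)
```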
Now let's talk about how we went about building the right canary user groups. The first user group, and the most important in my opinion as far as releasing a feature to, is the one that is already using the specific product area where you're adding a feature. You don't want to impact their workflow. If you're releasing a feature and you're impacting their workflow, you want them to test it out. First of all, let them know: hey, we're releasing this new feature, we'd like you to take a look at it and let us know what you think. And maybe they asked for this specific feature in the first place, so you obviously want their feedback right off the bat, without releasing to everybody else. And the great thing is you can tell customers, look, we're releasing this feature, and they'll trust you, because they'll see that you have a plan, that you're not just rolling things out willy-nilly, and that you're actually innovating.

Then there's another set of users that are less active, right? They're not currently using the product area, and they didn't really ask for it. Do they even care about the feature? Are they the right group for canary testing? Probably not. And there's a vocal set of users who will give you specific feedback even if you don't ask for it. Sometimes it's valuable to get that opinion, but you also don't want a bunch of noise as far as feedback is concerned. You want to make sure you target the users that are the super users of your product, or of that feature specifically, because you want to know, when customers are actually using it, how they use it, how it impacts their workflow, and whether it improves something for them or makes things worse. You want to know from the power users. You don't necessarily want to know from someone who likes giving feedback but doesn't really use the feature heavily, though we may even be some of those vocal users ourselves. But as far as canary testing, the best group to target is the one that is already using your product area or has specifically asked for a feature set.

The Harness blog had a quote that says a canary deployment is the lowest risk-prone compared to all other deployment strategies because of the level of control. And I would agree. You're releasing to a small set of users, not everybody, so you control who's getting the experience you've decided on, and you also control when you roll it back, right? If something goes wrong, you can roll it back. If things look good, then you roll out to the next set of customers. So you have that control within your hands, and feature flags make that even easier, by letting you turn a feature off really easily. And that's all for today. Please let me know if you have any questions. Feel free to send feedback to my email, hamad@blameless.com, and I'll be available for any questions. Thank you.

Hammad Mushtaq

Engineering Manager @ Blameless

Hammad Mushtaq's LinkedIn account


