Conf42 Site Reliability Engineering 2022 - Online

Unleashing Deploy Velocity with Feature Flags

Abstract

A lot of development teams have built out fully automated CI/CD pipelines to deliver code to production fast! Then you quickly discover that the new bottleneck in delivering features is their existence in long-lived feature branches, and no true CI is actually happening. This problem compounds as you start spinning up microservices, building features across your multi-repo architecture, and coordinating some ultra-fancy release schedule so it all deploys together. Feature flags provide you the mechanism to reclaim control of the release of your features and get back to short-lived branches with true CI. However, what you're not told about feature flags in those simple “if/else” getting-started demos is that there is an upfront cost to your development time, along with additional complexities and some pitfalls to be careful of as you begin expanding feature flag usage across the organization. If you know how to navigate these complexities, you will start to unleash true velocity across your teams.

In this talk, we’ll get started with some of the feature flagging basics before quickly moving into some practical feature flagging examples that demonstrate its usage beyond the basic scenarios as we talk about UI, API, operations, migrations, and experimentation. We will explore some of the hard questions around “architecting feature flags” for your organization.

Summary

  • Today we're talking about feature flags. Feature flags enable faster deployments to production. You might call them toggle switches or flippers. We'll explore them in a couple different ways today.
  • Travis Goslin is a principal software engineer for SPS Commerce. SPS Commerce is the world's largest retail network, with over 4,200 retailers. The company has been on a DevOps and agile path over the last decade and beyond. Goslin shares some of the takeaways from that journey.
  • SPS Commerce tries to follow more pure continuous integration practices. When a team is ready to go to production and replace the existing release there, it gets blocked in a lot of cases. This is a big problem for many teams.
  • A feature flag allows teams to modify system behavior without changing code. Martin Fowler also defines four types of feature flags to be aware of. Feature flags give us a ton of flexibility. At SPS Commerce, we had five environments at one point in time. Now we have that down to two.
  • Progressive delivery is defined by LaunchDarkly as a modern software development lifecycle. It builds upon the core tenets of continuous integration and continuous delivery. Feature flags can help us achieve some of the key metrics from the State of DevOps Report.
  • A simple example is code that sets up a new user, maybe in an identity system or platform of your choice. The feature flag then checks whether SendGrid email is enabled. With feature flags, it's often not simply a new piece of behavior being added. It's actually an augmentation.
  • A lot of engineering goes into building a fully user-context-aware feature management provider. From a cost perspective, there might be other good options that you want to consider. Should you purchase one, or should you build one?
  • You can use feature flags in many different ways inside APIs. On the UI you have to evaluate and change the flags very dynamically. Having a live websocket or long polling connection that can update that in real time could be very essential.
  • How you want to target a flag is critical. You have to think about delegation of that flag: will whoever it is delegated to downstream want to use first name and last name, or email, in order to target users? You can also sunset features with it in an interesting way.
  • The value that feature flags provide in terms of velocity, true CI, decoupling, and testing in production is very high. It's not free, right? There's a price to pay for some of these things. But I do believe that it is absolutely worth it for the value you're going to get.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome to this session, Unleashing Deploy Velocity with Feature Flags. I'm excited to be able to join you today as part of the Conf42 Site Reliability Engineering series. Today we're talking about feature flags, and we're going to see how we can use them not just to enable faster deployments to production, but also to make those deployments more valuable in general. You might call them feature flags, toggles, switches, or flippers. We'll explore them in a couple of different ways today. Just before we get started, a little about myself. My name is Travis Goslin and I work as a principal software engineer for a company called SPS Commerce. My focus is really on developer experience, so I'm very much interested in anything in the software development lifecycle, and in exactly how we can make micro feedback loops a little bit better for engineers so you can get faster, more continuous feedback on whatever you're doing. Feature flags definitely bring us down this route. SPS Commerce isn't a household name. It's a B2B organization, so you probably haven't heard of it, but we are the world's largest retail network, with over 4,200 retailers, supplying data interchange, invoices, and purchase orders between suppliers and retailers. Costco, Walmart, Target, Academy Sports: you'll find us working with the biggest retailers in the market. Like many, our organization has been on this DevOps and agile path over the last decade and beyond. We think about DevOps a lot as a culture and a state of mind, a shift, really, in how we approach and focus on engineering. And for us, probably much like it is for you, continuous and automation are really key principles of how we approach it.
The idea is continuously getting that feedback a little bit faster through automation. Whether it's the local development and debugging feedback loop or the feedback loop all the way to production on a finished feature, we're really focusing on making those tighter and faster as we go, and of course sharing that. That's why I'm excited to be with you today to share our progress, our journey, and some of the takeaways we've had, in the hope that it has an impact on you and is information you can take away. However, like many organizations heading down this road, we encountered a pretty major roadblock, a problem that I wanted to share with you today. I think it's a problem that you'll be able to relate to, so let's dive in and look at it. First, let's talk a little bit about our structure for development. It's pretty standard. We typically have a source control system, specifically GitHub. In the repository there we use a main branch as our default branch (you might call it master), and it is typically always deployable. We try to follow more pure CI practices around continuous integration. We do carve off feature branches, though, and use a lot of the functionality available in GitHub: pull requests, status checks, and all that capability, so that we can automate PR checks on the way back in. So we develop features in short-lived feature branches and use the validation in the pull request sequence to come back into our main branch when we're ready. Then, like any good CI system, we automatically kick off a build.
We use Git semantic versioning and releasing to look at the Git commit messages and derive a semantic version number, in this case 1.2.x, whatever number you want, and we go ahead and deploy that automatically through continuous deployment to our dev environment, which is pretty straightforward and obvious these days. But when we're ready to go to production and replace our existing release 1.1.0 there, we get blocked in a lot of cases. This is a big problem for many of our teams, and we're blocked by this thing that we'll call a gate. You might be asking, well, what is this gate? The gate is many different things for different organizations and different teams, even within SPS Commerce. It could be a product owner who doesn't want that particular feature released until next Tuesday. It could be the fact that you're producing a coordinated release and it's just not finished yet: there are other applications that need to be released before yours, or perhaps a UI that isn't quite ready. It could also be that you've discovered a bug. You've been examining the change in the dev environment, or your test or integration environment, and it's just not working as you expected, or the acceptance criteria (AC) aren't quite right. And so you get held up. You need to deploy this, but you can't finish deploying it to production. At the same time, you have additional features in the backlog. They're coming in and you're starting to develop them, but you're a little bit nervous about merging them back into the main branch now, because you know you have an unresolved dependency that hasn't gone all the way to production. So you're kind of stuck. Your feature two is a little more long-lived than you wanted it to be, and you really just want to get it merged in and deployed, but you're waiting. And at the same time you discover that you need a critical bug fix in production.
At that point you make the bug fix, push it all the way through, and get a version built. But that's when you run into the problem, because your pipeline is actually blocked. If an engineer doing a critical bug fix didn't realize that there was a holdup in the pipeline, he or she might have accidentally gone ahead and deployed that to production. In this case we can't, because we're blocked. Our green feature can't go to production until next week, maybe. Of course there are things you could have done, right? We could have cherry-picked off the main branch. We could have branched the bug fix from the release 1.1 tag and then released that directly to production. You don't necessarily want to release it to dev, because while we practice and really believe in backwards compatibility, forwards compatibility, the reverse of that, is a whole other scenario. Especially if you're using other dependencies and database migrations, it might not be feasible to say, oh, I'm going to deploy version 1.1 in my dev environment, when it could be a week after that migration has happened. Maybe it is, maybe it's not. And of course releasing that branch directly to production with your immutable versioned artifacts is just odd and could cause some problems. Not a great idea. At the same time, you have other services that are waiting, right? Your service one and your service two are waiting for those features or those critical bug fixes, and they're saying, I can't continue my parallel development without some of these contracts fulfilled and some of these updates; I want to use it now for my own development. That's likely an internal scenario. So we've created a lot of confusion and complexity in our deployment pipelines because we've coupled together the ideas of deploy and release.
That's where we look at feature flags as an opportunity to decouple those. So let's examine a solution together. In this scenario, we'll have our main branch again and our feature branch. But when we're writing that feature code, we'll go ahead and feature flag it. For the sake of this discussion, the mental model we need is just an if statement in code that ensures our new code execution path doesn't actually execute. Here we'll automatically disable the feature if there's no app configuration to enable it. That already provides some initial value as I merge that feature back to the main or default branch: others doing development won't accidentally get an incomplete feature when they rebase their branches. So a feature flag lets me get some of these PRs and merges back to main and keep branches shorter-lived. Or I could even commit directly to main if I wanted to and follow some of those more pure CI/CD practices. Of course we go ahead and do the normal versioning, build it, and then deploy it to our dev environment. And where we would typically have been blocked before, we're not now, because we're no longer releasing the feature: it is behind a feature flag that is inherently turned off. So we can go ahead and deploy that directly to production without any blockage or dependencies in our pipeline. In reality, in a more advanced structure and architecture, we take that feature flag and move it to a feature decision provider. Think of a feature decision provider as a microservice, a tiny service that exists abstracted from these environments, that you can ask in a simple API request: is feature one enabled, yes or no? You might extend that to ask: is feature one enabled in dev? Is feature one enabled in prod?
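As a rough illustration of the if-statement mental model described above, here is a minimal Python sketch. The `FEATURE_ONE` setting name and both function names are hypothetical, chosen only for this example; the point is that the new code path stays off unless configuration explicitly enables it.

```python
import os
from typing import Mapping, Optional

TRUTHY = {"1", "true", "on", "yes"}


def is_feature_enabled(name: str, config: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when configuration explicitly turns the flag on."""
    source = os.environ if config is None else config
    # A missing setting means "disabled", so merging unfinished work is safe.
    value = source.get("FEATURE_" + name.upper(), "false")
    return value.strip().lower() in TRUTHY


def handle_request(config: Optional[Mapping[str, str]] = None) -> str:
    if is_feature_enabled("one", config):
        return "new code path"      # feature one, still in development
    return "existing behavior"      # what production runs until the flag is on
```

With no configuration present, `handle_request` keeps returning the existing behavior, which is what lets the unfinished feature be merged and deployed without being released.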
Then each individual environment can easily determine whether that feature is turned on or off. So our gate no longer exists between environments; it is abstracted to sit on an outside plane, where it controls whether the feature decision provider should enable the feature or not. This is fantastic, because now we can ensure that we haven't released that feature that can't go out until next Tuesday until the product owner goes ahead and clicks a button, perhaps, or updates a value in the feature decision provider to enable it. That critical bug fix we had? No problem. We can release it all the way through to production, because we're now keeping our pipeline flowing, without this facade of true CI where we're actually stopping our pipeline every now and again. And service one and service two can now integrate with early features if they need to. Using context, our feature decision provider can offer contextual decisions. What do I mean by that? Perhaps service one is authorized as an internal service. We can detect that and ask the feature decision provider: is feature one enabled in production for service one specifically? We can say yes and turn it on just for them. So we have taken away this coupling of deploy and release that we talked about. They're no longer the same; they are separated and exist in different places, no longer as part of the infrastructure or the pipelines. This gives us some pretty powerful capabilities that we're going to talk about today. First, what is a feature flag by definition? My favorite definition is from Martin Fowler: a powerful technique allowing teams to modify system behavior without changing code. So we're modifying behavior, and the key part is that we do it without changing code. We added an app configuration file before, or maybe a microservice. The key is that we don't want to change code in order to modify behavior.
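The environment and caller questions above ("is feature one enabled in production for service one specifically?") can be sketched as a tiny in-memory decision provider. This is only an illustration under stated assumptions: the class names, the rule shape, and the `feature-one` and `service-one` keys are hypothetical, and a real provider would sit behind an API rather than in process.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class FlagRule:
    # Environments where the feature is on for every caller.
    enabled_environments: Set[str] = field(default_factory=set)
    # (environment, caller) pairs that get early access.
    early_access: Set[Tuple[str, str]] = field(default_factory=set)


class FeatureDecisionProvider:
    """In-memory stand-in for a service that answers 'is this feature on?'."""

    def __init__(self, rules: Dict[str, FlagRule]) -> None:
        self._rules = rules

    def is_enabled(self, flag: str, environment: str,
                   caller: Optional[str] = None) -> bool:
        rule = self._rules.get(flag)
        if rule is None:
            return False  # unknown flags default to off
        if environment in rule.enabled_environments:
            return True
        return caller is not None and (environment, caller) in rule.early_access


provider = FeatureDecisionProvider({
    "feature-one": FlagRule(
        enabled_environments={"dev"},
        early_access={("prod", "service-one")},
    )
})
```

Here the feature is on for everyone in dev, off in prod, and on in prod only for the caller identified as `service-one`, so the gate lives in the provider's data rather than in the pipeline.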
Martin Fowler also defines four types of feature flags to be aware of. The first is the release flag, and that is really the kind we've been talking about: the idea that a feature is still in development, or shouldn't be released until Tuesday, or that we're coordinating it across a couple of different projects and deployable units. An operations flag, by contrast, is something more technical for us as engineers. We want to modify the system behavior, but it's not an actual feature. It might be performance related, or it might be for temporary migrations (we'll see an example of that in a bit). It might be for traffic shaping or switches or degradation, those types of things. The third type is experimental, which I'm sure you've heard of: I want to test variations of a feature on different users and see what works for them and what doesn't. Maybe I just want to test it on a portion of users and see how it performs. And the fourth type is the permission flag, which designates a certain feature for alpha testing, or for specific customers who are likely to be okay with the risk of preview features. So feature flags give us a ton of flexibility. We talked about the branching strategy. I can now have short-lived branches. I'm no longer constrained in when I can merge into my main deployable branch; I can do that at any time and turn the feature off with a feature flag. I can ensure that I don't have multiple active release versions where I'm keeping a different branch per version. I no longer need that, because I have a central deployable main branch, and this is really enabling.
True continuous integration, at least the way that I think it should be done, means we're not just integrating in isolation in our feature branches and validating the build; we're actually integrating as multiple features are developed in a single branch, and validating our code earlier. The fact that I can see the refactoring you're doing while I build my changes enables us to be slightly faster and have faster feedback loops. And of course I can ship faster, right? We can be a lot more confident in our deploys, because while we're shipping all the time now, we're deploying as opposed to releasing. And our rollback, when we do have to roll back a feature, is not a change to the immutable artifacts; in a lot of cases it's just a change in the feature flag provider to say, turn this off. One of my favorite capabilities that feature flags give us is that once we are in production, we can test there. We don't need other environments. We can easily do A/B testing, we can easily use canary releases, and we can release to whatever set of users we want, maybe even just ourselves for testing. This enables fewer environments. At SPS Commerce, we had five environments at one point in time. Now we have that down to two, using feature flags to contextually give access to certain features in those environments. There's a huge, ridiculous overhead to maintaining five environments, not just infrastructure costs but promotion costs and overhead that just isn't needed when we have some of these capabilities shifted into the code base. Of course, when we think about feature flags, culture is a large aspect as well: product owners are now more involved and integrated as part of the release process. They can make some of those decisions independently of the deploy process. And when we think about culture, we think about this new term, progressive delivery, that you may have heard of.
Progressive delivery is defined by LaunchDarkly as a modern software development lifecycle that builds upon the core tenets of continuous integration and continuous delivery. The term was coined by the folks over at RedMonk, working with the Azure DevOps team and exploring how Microsoft deploys Azure DevOps using what they call progressive experimentation, rolling through rings of release one at a time, and then putting these concepts together. LaunchDarkly, a feature flagging service and the most predominant one, provides a lot of information about how to use progressive delivery and what exactly it means by definition. The three of them together define progressive delivery not as something above continuous delivery. It's not that I do continuous integration, then continuous delivery, and then progressive delivery. It's different than that. Progressive delivery is a named pattern that we can use to achieve continuous delivery, which is a breath of fresh air if you've been working in that space for long and you know that there are so many different ways to approach it. How do you actually achieve continuous delivery and deployment? Well, progressive delivery is one important way. It has two main tenets worth mentioning. The first is the idea of release progressions. You're probably familiar with that: I want to progressively roll out to more and more users over time. The second is progressive delegation. Going back to the culture we were just talking about, progressive delegation is the idea that I want to give control of turning on the feature flag to the person who owns it at that given time. The product owner is a key example. It might be someone else in your pipeline as the feature flows through it. The engineer no longer has to think, okay, I'll turn this on when I'm asked to by so-and-so.
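One common way to implement a release progression is to hash each user into a stable bucket and compare it with the current rollout percentage. The Python sketch below is only an illustration of that idea; the function names are hypothetical, and it is not a description of how LaunchDarkly or any other provider actually assigns users.

```python
import hashlib


def rollout_bucket(flag: str, user_key: str) -> int:
    """Map a (flag, user) pair to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(f"{flag}:{user_key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def in_rollout(flag: str, user_key: str, percentage: int) -> bool:
    """True when the user's bucket falls inside the current rollout percentage."""
    return rollout_bucket(flag, user_key) < percentage
```

Because the bucket depends only on the flag and the user key, raising the percentage from 10 to 50 only adds users; nobody who already had the feature loses it between requests.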
Well, why would you turn it on? Just take the person who owns that particular feature and delegate it to them, or to whatever group they're in, so they can enable and roll that feature out. When we think about software delivery and performance, in my mind the ultimate goal is really to deliver high-quality software. It has to solve a business problem, and we need to do it as quickly as possible. Speed to market is important. I love the State of DevOps Report from Puppet. The most recent one, 2021, and the others before it define some key metrics that help us understand what high and elite performers look like in shipping high-quality software. They talk about four key metrics that are worth noting, and the reason they're important here is that feature flags can help us achieve a good portion of these metrics pretty easily. So let's look at them. Deployment frequency: how often do you deploy? Is that whenever you want? Well, it could be, if we knew our code base was truly always deployable, rather than the facade of: well, it's always deployable, but I merged something and now you have to wait. No, it's always deployable, because I wouldn't have something in main that wasn't behind a controllable feature flag. So that's a pretty big enabler. Lead time for changes: can I get something from inception out to production in less than an hour? Well, I could if I had the ability to use some of these advanced release and contextual techniques. Same with mean time to restore: my service goes down; can I fix it in less than an hour? Typically with feature flags, yes, especially if I have operational flags that support that and give me some knobs and levers to restore it quite a bit faster. And my favorite is change failure rate.
Change failure rate is interesting because, in my mind, the survey results for this metric don't make sense at first. The idea is: when you make a change to production, basically when you deploy, how often do you fail? Is it less than 5% of the time? We know from this report and from others that the more you deploy, the less you should fail. Which is weird, because you'd think: I'm deploying more often, so I should fail the same percentage of the time, or maybe even more, given the number of complexities that are there. But in reality, the more often you deploy, especially using techniques like progressive delivery, the less you fail. I'm excited that this also proves true with feature flags, because if you're using these techniques, you can do those deploys much more confidently, degrade, and avoid turning on a particular change for the entire user base at once. At SPS Commerce, we have a fantastic continuous improvement team that has been tracking our change rate and our failure rate over many years, going back as far as 2014, and as part of our journey you can see that this is proven true by our stats as well. We do about 1,000 changes to production a month now on our platform, and you can see that our change failure rate is just 2 to 2.1%, which is fantastic, using DevOps best practices and patterns like feature flagging. Enough of that. Let's dive in and actually explore what a feature flag is. I want to examine a really simple feature flag with you today. The code here is just setting up a new user, maybe in an identity system or platform of your choice. It comes in with a user context object, which might have username, first name, last name, and email on it, as an example, and then it goes ahead and creates that user. Our feature flag is then going to check whether SendGrid email is enabled.
So in this case we're going to send a new welcome email to our users if we're using SendGrid, and that's our feature flag right there. SendGrid is a popular software-as-a-service provider for sending emails by API, and we want to use it as a simple integration here. The new code will simply say, hey, use the API to send the email. But what does that feature flag if statement actually look like? In this case we've hard-coded return true, but going back to our definition of a feature flag, we know that we can't hard-code this. While the check might be centralized and used in a couple of spots in our application code base, we can't hard-code it to true, because then we can't modify it from outside the code. We have to be able to change this behavior easily. So in its simplest form you might change it to check an app configuration setting for whether to use SendGrid. That app configuration might be a local file or a centralized database key-value store, but in reality, using a microservice for it makes a lot of sense, doesn't it? You could just pass the key along to ask, hey, should I be using this SendGrid feature? And that would work. That would tell us whether it's on or off, and that could be on a per-environment basis. But we also find that with feature flags it's often not simply a new piece of behavior being added; it's actually an augmentation. So you often end up with an if/else statement like this, where sending local SMTP email is the old code, and we've placed it inside the else branch to separate it out as part of the feature flag. And while this looks good, we've actually missed out on a ton of value so far, right? We are not able to truly test this in production the way we want. If I deployed this to production with the flag disabled and wanted to enable it to test (well, did I configure the API key right for SendGrid?), I would actually have to enable it for everyone.
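The slides are not part of this transcript, so the following Python sketch is only a guess at the shape of the if/else being described. All names are hypothetical, the `use_sendgrid` key is an assumed configuration setting, and the two senders are stand-ins that record what they would have sent rather than calling any real email API.

```python
from typing import Dict, List, Mapping, Tuple

Outbox = List[Tuple[str, str]]


def send_with_sendgrid(email: str, outbox: Outbox) -> None:
    # Stand-in for the new code path (a real version would call the email API).
    outbox.append(("sendgrid", email))


def send_with_local_smtp(email: str, outbox: Outbox) -> None:
    # Stand-in for the old code path.
    outbox.append(("smtp", email))


def is_sendgrid_email_enabled(app_config: Mapping[str, bool]) -> bool:
    # Read from configuration instead of hard-coding `return True`.
    return bool(app_config.get("use_sendgrid", False))


def create_user(user: Dict[str, str], app_config: Mapping[str, bool],
                outbox: Outbox) -> None:
    # ... create the user in the identity system here ...
    if is_sendgrid_email_enabled(app_config):
        send_with_sendgrid(user["email"], outbox)
    else:
        send_with_local_smtp(user["email"], outbox)
```

Note that the decision here is global: flipping `use_sendgrid` changes the behavior for every user at once, which is exactly the limitation the talk goes on to address.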
So I'd turn it on for everyone, test it really quickly in prod, and then turn it off if it was failing. That's not the kind of experience we're going for. There's still the advantage that I can keep my pipeline moving, but I'm not getting the value after that. So the key part here is that we need to go back to our if statement and modify it, so that the is-SendGrid-enabled check takes a user, and we pass in the same user context we had. Then we modify our check, the method at the bottom, to pass that user context along. Now in production I can ask: is SendGrid email enabled for Travis specifically? And I can say, yes, it is. Then we can test the welcome email for Travis in production without affecting anyone else. Very, very cool. However, if you've been following along with this conversation so far, you might have had some important realizations or some questions. This is where our feature flagging honeymoon maybe comes to an end, because there are some important realizations here. Let's dive into those and explore them a little bit. First is the idea that we are shifting left, right? We're taking the complexity that we previously had in our infrastructure and our deployment pipelines and moving it into the code base, where we no longer have separate branches for different code paths but different code paths in the same branch. But this is good. I'm actually a fan of this, because shifting left means we can handle the complexities of some of these releases in our code base. We can do interesting things at runtime as a part of that; that's where we get our user context. So that's okay. But it does mean that our maintainability is affected, and you need to be aware of that. It adds complexity, right?
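To make the per-user check from the start of this passage concrete, here is a minimal Python sketch that passes the user context into the flag check. The `UserContext` fields follow the ones named in the talk (username, first name, last name, email); the `FlagState` type and its fields are hypothetical and stand in for whatever the feature decision provider stores.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class UserContext:
    username: str
    first_name: str
    last_name: str
    email: str


@dataclass(frozen=True)
class FlagState:
    enabled_for_everyone: bool = False
    targeted_usernames: FrozenSet[str] = frozenset()


def is_sendgrid_email_enabled(user: UserContext, state: FlagState) -> bool:
    """Decide per user instead of per environment."""
    return state.enabled_for_everyone or user.username in state.targeted_usernames
```

With `targeted_usernames` containing only `travis`, the new email path runs for that one user in production while everyone else stays on the old path.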
It's much more difficult to reason about the state of a system at any given time, because the question is no longer, is this feature out, yes or no, a binary decision? It's actually: was this feature enabled at that time, for that user, in that environment? That's a much more complex question, and you need the observability to answer it, in your log statements and elsewhere. When you go to debug a problem, you can't assume the same things about the system. It's not a binary question anymore, not at all. Additionally, you have management of the flags. You need to create these flags and manage their lifecycle and their removal. Lots of interesting aspects. And one of the key things I struggled with when I was coming into feature flags is that I didn't understand how I could have zero risk with a feature flag going to production when I have a code change. In my mind, any code change can result in risk, right? Even behind a feature flag there's still the if statement, the binary decision, that we have to consider a potential risk to production. And I have broken production with a feature flag. Absolutely. But in reality, yes, you're making changes, but we can unit test those changes pretty easily. And like I said, from a code perspective they're typically binary decisions. They are things that you will add to a service, that you'll have abstractions for, and that you'll get good at. You will practice it, it will become second nature, and it's not going to be as big a problem as you might think. But when getting started with it, for sure, there's risk there to understand. The other point here is that a lot of engineering goes into building a fully user-context-aware feature management provider, so you want to be aware of that. So from your perspective: should you purchase one, or should you build one?
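One way to get the observability mentioned above is to route every flag check through a small wrapper that logs the decision together with its context. The Python sketch below assumes a hypothetical `decide` callable standing in for the feature decision provider; the logger name and message format are illustrative only.

```python
import logging
from typing import Callable

logger = logging.getLogger("feature_flags")


def evaluate_flag(flag: str, user_key: str, environment: str,
                  decide: Callable[[str, str, str], bool]) -> bool:
    """Evaluate a flag and log the decision with its full context."""
    enabled = bool(decide(flag, user_key, environment))
    # Record who asked, where, and what the answer was, so that later you can
    # answer "was this feature enabled at that time for that user?"
    logger.info("flag_evaluated flag=%s user=%s env=%s enabled=%s",
                flag, user_key, environment, enabled)
    return enabled
```

Because the decision is injected as a plain function, both outcomes are easy to unit test with a stub, which reflects the point above that these binary decisions can be tested without much effort.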
I like to stay away from undifferentiated engineering, and this is a great space in which to use one of the existing providers. Let's take a look at some of those providers so you get an idea of what we're talking about. It might be as simple as a key-value pair inside a database or a config file. AWS Parameter Store, Microsoft Azure App Configuration, or even Consul, ZooKeeper, or etcd have key-value pairs that you can target and use as a simple method of feature management where you're not going to pay extra. If you want an open source provider that does provide contextual management, you might use Unleash, Flagr, or Flipt. But of course, if you do have the option to use a full service, any of these are fantastic. LaunchDarkly is of course the predominant leader and is what we use at SPS Commerce. When you compare feature management providers on this G2 grid, focusing on the feature management space, you can see that LaunchDarkly is your clear leader, in the top right there. From a cost perspective, there might be other good options that you want to consider, including Optimizely, which is near the top right as well, which is great. And you can see where some of the other providers we talked about fall into place. It's worth noting this extract from a recent Thoughtworks Technology Radar. If you're not familiar with the Tech Radar, it's, I believe, a quarterly release where they talk about different technologies that Thoughtworks consultants have seen over the past few months, and whether each is something you should adopt, consider bringing in, or just spike and take a look at. This entry talks about the use of the simplest possible feature toggle. The idea is that you don't necessarily need a full service provider to get started; it might just be a barrier to entry, especially from a cost perspective.
And while I'm a big fan of some of these full-service providers, depending on what you need for your application, who your audience is, and whether it's an internal-only service, one is not always necessary. So you might consider what the features are, what release capability you need, and what the longevity of your project is before deciding exactly what level of provider you need.
Okay, let's move on and take a look at a UI routing example. This takes us to a little different position than the simple example we looked at previously, which was in a backend API. In this example we basically trim and change out the navigational structure in a UI web application. This is using LaunchDarkly's feature flag router, a React component, which gives you the ability to specify a fallback as well as the new feature. If you don't have access to the feature, you get the fallback and the existing feature shows up. So really we're just security trimming, but instead of basing it on security, we're basing it on your feature flag status and the contextual usage for that particular user.
You can see here how that materializes. The top screen is demonstrating the React application. It took a long time to style, so I hope you appreciate that. You can see its navigation, including the existing feature. I can click on it and it shows me the existing feature in the URL. If I go to LaunchDarkly and turn the feature flag on, you can see the new feature immediately materializes at the top without me refreshing the page.
So another consideration that's different on the UI is that you have to evaluate and change flags very dynamically. You might have a web application that is a SPA, and that SPA could be living on somebody's desktop for many, many weeks, potentially without ever being refreshed.
So having a live WebSocket or long-polling connection that can update flags in real time could be essential for you. Also, using flags on the UI, you now have connections from browsers potentially across the world, as opposed to a backend API, which only needs feature flags internally and is potentially much lower volume, with far fewer access points from a geography perspective. So you'll need to consider: am I building a UI application? Are live changes to the flag important? Those questions matter when you're deciding on your feature flag provider, and whether you're going to build something simple of your own or use a full-service provider.
You can use feature flags in many different ways inside APIs. Take this users API as an example. You might have an existing v1 users endpoint. You might use a feature flag to enable early or preview access to a v2 profiles URL that isn't normally accessible. You might also use a feature flag to start automatically shifting engineers who are using v1 users over to v2 users, if there's a major shift happening. Or perhaps you haven't versioned in your URL; you're versioning in a header and you want to transition some of those default users over. You might also want to test interesting use cases, so you might want to validate what happens in v2 users in a particular test scenario. You might bake that test case in as a feature flag you can turn on to create some failure in the system, something you don't typically think about doing in production but can easily do with a feature flag.
And while feature flags don't really let us avoid versioning our APIs, we can get away with some interesting small changes that you might want to experiment with a bit. Perhaps, for a particular request on your v1 users endpoint, you've accidentally made an operation return a 200 instead of a 204 when it has no content.
You might swap that out and fix that contract without reversioning the whole thing, even though it is contract breaking, then monitor the failure rates associated with it and confirm that, okay, that did not break any of our downstream internal clients. You can go ahead and make that change without busting out a whole new major version of your API. So you have some of these options available in the different ways you can slice the usage of feature flags.
One of my favorite ways that I've seen feature flags used is in a monolith where we want to strangle out some microservices. In this case, there was a monolithic gateway API with a database behind it, and we wanted to pull out a scheduling API. So we built the scheduling API and the scheduling database, and it was all new and shiny and had the new technologies we wanted to use. What we did then was go back and connect the two using a feature flag. The feature flag gives us the ability to start redirecting read traffic over to the scheduling API and play around with it a bit, even just for internal users as a starting point. Of course, they have different databases, so we did have to run a bidirectional, real-time synchronization between the two models, with the different transformations involved.
You might do it a different way, though. If you don't want to synchronize the data continuously, you might use two flags here. Instead of turning on the read flag first and testing the load, you might first turn on a write flag, one that doesn't say "write to either the left or the right" but rather "should I write just to the old one, or should I write to both?" You start writing to both, do a one-time synchronization of the data, and from then on you control where reads go with your read flag.
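The two-flag variant can be sketched roughly as follows. This is an illustration under stated assumptions, not how SPS Commerce implemented it: the flag names, the `ScheduleStore` class, and the in-memory dictionaries standing in for the two databases are all hypothetical.

```python
from typing import Callable, Dict, Optional

# Toy flag state; a real system would ask a flag provider instead.
flag_state = {"write_both": False, "read_new_users": set()}


def flag_on(name: str, user_id: str) -> bool:
    """Hypothetical flag lookup: one global write flag, one per-user read flag."""
    if name == "schedule-write-both":
        return flag_state["write_both"]
    if name == "schedule-read-new":
        return user_id in flag_state["read_new_users"]
    return False


class ScheduleStore:
    def __init__(self, flag_lookup: Callable[[str, str], bool]) -> None:
        self._flag_on = flag_lookup
        self.old_db: Dict[str, str] = {}  # stand-in for the monolith database
        self.new_db: Dict[str, str] = {}  # stand-in for the scheduling database

    def backfill(self) -> None:
        # One-time synchronization of existing data into the new store.
        self.new_db.update(self.old_db)

    def save(self, user_id: str, key: str, value: str) -> None:
        # The old store is always written; the write flag only decides
        # whether the new store is written as well.
        self.old_db[key] = value
        if self._flag_on("schedule-write-both", user_id):
            self.new_db[key] = value

    def load(self, user_id: str, key: str) -> Optional[str]:
        # The read flag shifts read traffic for targeted users only.
        if self._flag_on("schedule-read-new", user_id):
            return self.new_db.get(key)
        return self.old_db.get(key)
```

Because the old store never stops receiving writes, turning the read flag off again is a safe rollback at any point.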
You turn the read flag on when you're ready to shift the read traffic for certain users to one side or the other, with both continuing to remain in sync as you write to both destinations. It's an interesting way to perform some traffic shaping and some migrations and be very successful. In our case of doing this at SPS Commerce, we turned that flag on and off several times and discovered a lot of production load issues that you could only really discover in production, right? And we affected very few users as part of that, if any, just by doing it internally.
You might also use feature flags for coordination. I might use the same feature flag in the UI, in a public API, and maybe in an internal API, and then turn a feature on everywhere together. Now, there are implications to that. Obviously, I want to be able to turn on a feature just in the UI, without turning it on everywhere, for personal testing in production. So maybe I actually want these to be independent feature flags. Depending on how you slice it, you'll have to consider whether you have a single flag or different flags, and this starts getting into architecting how you're going to use flags across your organization.
And unfortunately, there are other considerations here. You might have to consider: what is my user context if I want to enable this? A user in the UI and a user in an internal API might have two completely different contexts, and that makes it difficult to manage a single flag across them. We haven't talked a lot about user context, so this is a good place to define a little of what it could be. We talked earlier about how the user context could be first name, last name, and email; it could be some type of user identifier in your system; it could be additional user details that you want to include for easy reference.
You have to think about delegation of that flag: will whoever it's delegated to downstream want to use first name and last name, or email, to target users? How you want to target is critical. Targeting individual users is nice, especially for testing your own feature in production as an engineer. But in reality you'll be turning it on for a role, a particular group, or an organization, or for internal employees versus external customers, or even for a set of beta customers. You have to define that and do it in a consistent way across your organization, and that might be something to think about as you're architecting it.
In this example here, you can see I'm using user id 123456. My name is Travis Gosselin. I'm initializing one provider there, but in another application I might initialize it totally separately. The user id is the same, so if I target the user id, I'll get the same feature flag consistently turned on. But if I were to target this flag by first name, am I going to get consistent enablement across applications when coordinating? No. In fact, these contexts are not equal to each other at all. They could even be counted as two different users, and there may be a monetary problem associated with that configuration: if these come out as different users, you might actually be paying for twice as many users across your system. So you want to think ahead, and you want to provide the capability and an organized strategy for how your organization will use these.
As an example, in our organization we only use the org-level component, so we don't allow targeting individual users unless they're internal. External users are only ever enabled at the org level, where an org is basically a company, a connection within our network.
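The context mismatch and the org-level policy can be illustrated with a small sketch. Everything here is an assumption for illustration: the `UserContext` shape, the `org-a` and `org-b` identifiers, and the targeting helpers are hypothetical and do not reflect any provider's SDK.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class UserContext:
    key: str                      # stable user identifier
    org: str                      # company the user belongs to
    internal: bool = False        # internal employee vs. external customer
    attributes: Dict[str, str] = field(default_factory=dict)


def matches_attribute(ctx: UserContext, name: str, value: str) -> bool:
    """Attribute targeting only works if every application sends the attribute."""
    return ctx.attributes.get(name) == value


def enabled_for(ctx: UserContext, user_keys: Set[str], orgs: Set[str]) -> bool:
    """Org-level targeting for everyone; individual targeting for internal users only."""
    if ctx.org in orgs:
        return True
    return ctx.internal and ctx.key in user_keys


# The same person, initialized differently by two applications.
ui_ctx = UserContext("123456", "org-a", internal=True,
                     attributes={"firstName": "Travis", "lastName": "Gosselin"})
api_ctx = UserContext("123456", "org-a", internal=True)
```

Targeting by key behaves the same in both applications, while targeting by `firstName` silently misses the API context, which is the inconsistency described above.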
That made a lot of sense for how we strategically position feature flags in our applications.
As you may be realizing, there are so many other scenarios you can use feature flags for; the sky is really the limit. Think about log-level verbosity: the idea that I could have something in production and change it to a lower, more verbose level of logging. You could even do that automatically, changing a feature flag based on an incoming error rate. That'd be pretty cool. You might want to use dynamic configuration, so it's not just a Boolean value for a feature flag; it can be JSON blobs or multivariate configurations.
You might want to use kill switches: the idea that I want to disable a third party that's acting up. Perhaps we're having an issue with a particular service that's really degrading our performance, so we turn off that service, or turn off a feature on the front end. Having the ability to kill something is important, especially when you're putting it out there for the first time. Think about our migration scenario, where we killed the new service, the scheduling API, and shifted back and forth as needed until we got it right.
We talk a lot about feature flags for the creation of new features, but you can also sunset features with them in an interesting way. As an example, you might say: I'll place a stake in the sand now, so no net-new customers get this feature, and then I take it away. You can pass it off to your marketing team or your customer success team to work with your existing customers to downgrade that particular feature and remove it from them as they're able to, which goes back to progressive delegation as a part of progressive delivery.
And of course, timed features are always cool and interesting.
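Two of these scenarios, the kill switch and flag-driven log verbosity, can be sketched briefly. The flag names and the in-memory `FLAGS` dictionary are hypothetical stand-ins for whatever provider is in use; only the standard `logging` module is real here.

```python
import logging
from typing import Callable, List

# Hypothetical flag values: one Boolean kill switch, one multivariate string.
FLAGS = {"recommendations-enabled": True, "log-level": "WARNING"}

LOG_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def recommendations(call_third_party: Callable[[], List[str]]) -> List[str]:
    """Skip the third-party call entirely when the kill switch is off."""
    if not FLAGS.get("recommendations-enabled", False):
        return []  # degrade gracefully instead of calling a misbehaving service
    return call_third_party()


def apply_log_level(logger: logging.Logger) -> None:
    """Set verbosity from a flag value, falling back to WARNING if unrecognized."""
    name = str(FLAGS.get("log-level", "WARNING")).upper()
    logger.setLevel(LOG_LEVELS.get(name, logging.WARNING))
```

Defaulting the kill switch to off when the flag is missing is a design choice in this sketch; depending on the feature, failing open may be the safer default.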
You can think about a timed feature in the sense of a holiday release, something you want to appear on a certain schedule.
And while that's all the time we have for today on feature flags, there are obviously many other topics to dive into and explore as you architect feature flags across your organization. But I think you'll find that if you separate deploy and release and decouple them from each other, the value that provides in terms of velocity, true CI, decoupling, and testing in production is very high. It has been very valuable to SPS Commerce, and I encourage you: if you're not using feature flags, this is a place you want to explore to be a high or elite performer in DevOps. However, as we did talk about a little bit, it's not free, right? There's a price to pay for some of these things. I am changing code. There is still risk. There are additional complexities to worry about, especially observability. But I do believe it is absolutely worth it for the value you're going to get, and I think a lot of those risks are mitigated as you practice it and it becomes just another one of your patterns.
So thank you. Happy to have you reach out and chat some more about feature flags. Take care.
...

Travis Gosselin

Principal Software Engineer @ SPS Commerce

Travis Gosselin's LinkedIn account Travis Gosselin's twitter account


