Conf42 Golang 2025 - Online

- premiere 5PM GMT

Surviving the Real-World: Building Resilient Cloud-Native Platforms for Intermittent Networks


Abstract

Dive into the architectural strategies that help a distributed job runner thrive on an unreliable network. Whether you're dealing with distributed systems, multi-cloud environments, or hybrid infrastructure, learn practical ways to design and test for resilience and high availability in cloud-native services.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, my name's Brian, and I'll be speaking with you today about four quality attributes involved in building enterprise-level distributed solutions. Those are availability, resilience, reliability, and, to a lesser degree, scalability. Do you have a distributed system? Well, you do if you have multiple applications in your solution which operate across multiple networked environments at once. For example, if you have a browser on a user's laptop that opens up a user experience on a remote server, and that connects to a separate SaaS API in the cloud, which may call other systems that live in other clouds, or route back to a specific data center operated by another organization. I'll show you a few examples of how we're going to deal with building resilient cloud-native platforms, including our latest DevOps offering from Chef. Each of these quality attributes can be designed into your system to meet your specific technical goals, possibly building up to offering a service level agreement to your customers. First, we'll dive into the theory, talk a little bit about tools we have in the open source Golang community, and show a couple of worked examples, which should give you a practical way to plan and test your own distributed applications. This talk stays at a high level and does not contain any proprietary information from any of the examples; rest assured, they're real-world examples of systems that are out in production today. I'll also show a few diagrams as we go through this. Some are my own, but where a picture is borrowed from someone else, I've cited the publicly available source in the slide notes. Let's take a look at why this is important. We start with this: a pretty well-known set of points about distributed computing that has made the rounds, a set of assumptions we might start a project with. We generally know from experience that as we get out to a production-grade system, a lot of these things tend not to hold true. These reflect the lessons of systems engineering, the engineering of multiple systems as opposed to single self-contained systems or maybe even MVPs, over the last few decades. In software, we often start with that core system, an MVP, a proof of concept, and then if we're successful, we grow: number of users, increased functionality, possibly broadening out to additional tools and different integrations. Then we start seeing failure modes as that system grows, which are both analyzable and preventable. The key is to get design in there at the right point in this path. Hopefully the quality attribute aspects are considered upfront, but the techniques we see today can certainly be added later as needed, just at a little bit higher cost to implement and test. The other lesson we've learned is that there are many levels of quality, from good enough up to supporting higher performance levels. Maybe we phrase this as a service level agreement, or SLA, which is effectively a more specific quality contract with the customer: both what the customer may see as value and expect in the solution, and what an engineering team has confidence in, read: we've tested it and can provide a guarantee during normal operating conditions. To speak to some of these points specifically: distributed systems face challenges in the fact that the network connection between different subsystems is fundamentally unreliable.
There are challenges in timing, with some operations taking longer than others to complete or to identify an error condition. Other actors may interfere with our systems; maybe we call this secure operations. There may be changes to the network topology itself, which our applications will need to know about. And looking at number eight, there's the assumption that all networks are exactly the same. Each of these aspects focuses on the networking and integration side between subsystems, and this is one of three major classes of failures which may affect our system. The other two, apart from networking, are hardware failure, probably CPU or memory or something like that, and storage and persistence failures, when we talk about writing things to disk and persisting state. We'll see all of these as we analyze our system for likely types of failures. In short, there are lots of things that can go wrong as we think about how our subsystems work together, and we'll rely on systems thinking to look at which are most likely and how we mitigate the risks. Typically, mitigating a risk means reducing its impact if it does occur, or reducing the likelihood that it occurs at all. And we're going to do this through both design and implementation: we're going to start from the architecture side and add elements to our design that help make it more reliable and robust over time. At the end, we'll make sure that we continually test for changes in operating the system as well, so that the improvements we've made show up in the system, and certain types of failures are either masked, dealt with, or don't occur at all. If you're thinking this sounds a little bit like failure modes and effects analysis, FMEA, you'd be correct in how we identify possible failures. We'll show some examples of how this process is not so daunting, and you can apply it on your system as well. Let's take a look at what's coming up. We'll talk about topics in this order, which serves as a rough agenda. We'll talk about failures and some definitions of our quality attributes and SLAs. We'll dive a little bit into patterns, both on the infrastructure side and the application development side. It's important to note this is a shared responsibility, and both pieces of this puzzle, of our design and architecture, make a successful system. We're going to make changes on the infrastructure side and inside the software itself, in the code, and that's going to give us a well-designed platform and software controls that go on top of it. I'll show a simple example and then we'll build up to a couple more complex ones as we show applying these techniques. Then we're going to go into Golang specifically and take a look at some of the tools that are available so we don't have to write our own. We'll follow up with a little bit on testing and then a final summary slide. Each slide has references on the specific topics described, and there are some overview references on the very last slide; in the slide notes for each slide, you'll see more specifics on what that slide covers. Now I'll talk a little bit about why we're actually here and why I'm talking about this today.
It really starts with our journey here at Chef. We came up with the idea of a brand new DevOps solution which gives fundamental capabilities for monitoring a customer's infrastructure assets, does some of the things our previous systems had done in terms of infrastructure configuration management and compliance, typical DevOps work processes, and then extends this into a general-purpose job system which could work across customer data centers on-prem, multiple clouds that they may have assets in, and desktops, and which we could also bring to market as a software-as-a-service offering for a large number of customers, so a multi-tenant type offering. This system started as an MVP over a year ago, and we launched our first version on AWS last month as a SaaS solution. As we talked to more customers, probably similar to your journey, we had lots of questions about these quality attributes. What could the individual customer expect in terms of system performance? How often could they run jobs? How many assets could they include in these jobs? What if the system went down? What if the network went down between us and them? What would happen to jobs in flight, or ones which had not been delivered when an outage happened? Could Chef offer a service level agreement? If these are questions your system is trying to answer as well, this is probably the talk for you. We've learned a lot about how our system performs over the last year, and we realize this is really a continuous improvement problem. So while a lot of the things we're going to talk about here are analyses we've done on our own system, plus lessons we've learned from other systems that other architects have brought to us, they show a way of thinking about these problems and how to inject design into the application as it grows. It's easier to state goals than it is to prove a system will not fail. A lot of times, a more approachable way for an engineer to deal with a system and improve quality attributes is to plan the organic development of features, in other words, making the process your own. You have to have a checklist or a set of tasks that you're going to execute, probably over and over again, that fits within your development process. These steps are just one way that we've listed out to build availability and reliability into your distributed application. This design process works for both large systems and smaller systems with less stringent requirements. Fundamentally, just go through these: start with your system diagram and your business process description. Figure out what your goals are; we'll show how that can be expressed as a service level agreement or maybe quantitative measures that you can take in your system. Then go through a brainstorming exercise that we'll call FMEA to understand what failure modes might be present. Then you're going to go looking for design changes. We're going to back this with particular patterns that we know from previous experience have worked, and we'll list a few of those as we go along. And finally, after you've implemented them and deployed your solution, test and make sure that you're meeting your goals. From an operations perspective, you're going to continue monitoring your solution, and you're going to automate your playbooks or your standard operating procedures to recover. That's great.
But the important thing for the development team, and certainly an infrastructure team supporting a solution, is that you're going to learn from operations as well. When you have a root cause analysis after an outage, turn that around and make sure it feeds back into the design cycle at the very beginning. The image on the right is of an uptime page for a set of services. We've probably all seen a lot of these, and really this is the end goal of our efforts: to let our customers know if the service is up or down, hopefully minimizing the down, and here we have all green. And if it does go down, what can we communicate to them about when it'll be back? The key with large systems is really that it's an automation game. We can't rely on manual processes as much when we get to larger and larger installs, more customers, more feature sets, more distributed types of operations, and so we're going to have to bake this all into the system design itself. We can't rely on the old days of, oh, we'll just go reboot the server and you'll be back, we'll send an email out to customers or something like that. When you get to large-scale solutions, that's not as helpful or reassuring to customers, and not as easy to do, because you may have large data backups that you have to restore, which can potentially take a long time, so you might have a sustained outage just due to the scale of the system. This observability and self-service is really what we're striving for, so you're going to see a lot of our techniques here rely on how you bake it into automation: how do you take metrics actively on the system, and how do you feed those back into behaviors that let the system adapt around failures? Next we're going to jump into the common definitions of these quality attributes and how we might describe them to others in the business or to customers. We've talked about SLAs; let's jump into a couple of these and talk about availability as a first definition. Availability is the ability of a service to take requests for a percentage of a defined operating period. This is the front door: it's open to submitting requests, and the system seems up and running to the user for a percentage of time, even though results may be delayed by internal errors or failure conditions. We have the table on the left that shows how this percentage is determined, defined in terms of outages as seen by the customer. Different types of systems and usage by different customers may influence what the system availability is built for. Not all systems need five nines, and many systems are not even tested for three or four nines. For example, voice software like Zoom may want five or six nines, because it really wants very accurate packet transmission so that you don't have a break in the signal quality, whereas email or a non-mission-critical application, maybe your time card tool, may only need three nines, and maybe some days you wish it were less. On the right side panel is a description of some hypothetical impacts of this availability. This is from Mark Richards' blog, in an article which is in the slide notes.
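To make that percentage concrete, here's a minimal sketch, not anything from the talk's slides, of how you might measure availability from periodic health checks in Go; the endpoint URL, the probe count, and the cadence are placeholder assumptions.

```go
// Minimal sketch (placeholders, not production code): estimate availability
// as the fraction of periodic health checks that succeed against the
// front-door API.
package main

import (
	"fmt"
	"net/http"
	"time"
)

type availabilityTracker struct {
	total, healthy int
}

func (a *availabilityTracker) record(ok bool) {
	a.total++
	if ok {
		a.healthy++
	}
}

func (a *availabilityTracker) percent() float64 {
	if a.total == 0 {
		return 100.0
	}
	return 100.0 * float64(a.healthy) / float64(a.total)
}

func main() {
	tracker := &availabilityTracker{}
	client := &http.Client{Timeout: 2 * time.Second}

	// Probe a hypothetical health endpoint a few times; in a real system this
	// would run on a ticker for the whole measurement window and feed metrics.
	for i := 0; i < 5; i++ {
		resp, err := client.Get("https://example.com/healthz") // placeholder URL
		ok := err == nil && resp.StatusCode == http.StatusOK
		if resp != nil {
			resp.Body.Close()
		}
		tracker.record(ok)
		time.Sleep(1 * time.Second)
	}
	fmt.Printf("measured availability: %.3f%%\n", tracker.percent())
}
```

The same counters can feed the telemetry and uptime-page reporting discussed later.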
That Mark Richards article, which I mentioned before, gives you an idea of what the different nines actually mean in real life. Now, you could get lucky if your system is simple enough, or if you do not actually measure the availability; your system may have a particular high availability built in already. In a lot of cases, though, that isn't enough to give a customer a formal service level agreement, a contract which says we're going to hit this number or there are some repercussions afterwards. There are, however, design changes we can make to get to a desired availability, both in terms of making sure that the theoretical redundancy in the system supports such a statement, and by masking failures, maybe caching data to make sure that a client application doesn't notice that a backend service is temporarily or intermittently down. Now, there is obviously a balancing act that goes on here: software and infrastructure improvements may drive higher availability, but they have an increased cost, and not all customers may want, or want to pay for, something like that in an SLA. So if an SLA, or a particular availability number, is appropriate for you, question what the value statement is. Are we building this just because we think it needs to be this, or are we building it because our customers really have a need for it? Note that for availability, not all components have to be running, and data does not have to be consistent up to the minute; only the ability to start operations matters. This also says something about how little time you actually have to return to operations after a failure. Manual processes can get you to two nines, but you're going to need automated recovery and monitoring at multiple points in the system to get to four nines. The calculation is shown at the bottom, and it can be easily implemented in a running system by tracking health checks and telemetry metrics. Two additional definitions we see a lot are a service level agreement and a service level objective. Typically an SLA is an overall SaaS service availability target, and it probably has a disaster recovery plan based on reliability and recovery; Office 365 is listed in the notes as a good reference for how to build one of these. It's typically a contractual thing between the service provider and the customer, so it says what happens if a service level is not met. A service level objective, on the other hand, builds up to an SLA and is often internal, where we have multiple dependent services with different operators. Each operator may have their own objective which supports the overall SLA. For example, an IT department may maintain data center hardware that the SaaS product is deployed upon, so they would have an SLO supporting that SaaS product's SLA; or a security team may operate Active Directory, the identity system, which an application may need at a certain uptime to meet its overall SLA. Let's take a look at another example here as well, this one from a sample retail store, which we'll see again later. Imagine that you can place e-commerce orders into a store, and you can also walk into the store, order things, and take delivery of them; this would be similar for a branch office or any other distributed environment. The first statement in red text is really a highly caveated objective, and it's probably unreachable.
It's an availability statement, but it states a hundred percent uptime, meaning no outage can be tolerated, hence the caveats; any bad incident, a flood or anything like that, would invalidate the statement. The second statement is really a description of five nines for a specific subsystem. This might be more appropriate as an SLO, and it's for the payment system, which lives on top of a platform and connects to the external banks that we're communicating with as one end of an integration. We see this a lot in microservices architectures where one API may call another; in this case it's in a different environment altogether. We then have to assume that both sides of the integration, since one has a dependency on the other, are at that five-nines level, or are marked explicitly out of scope for the availability statement. We've looked a lot at the customer experience through availability here, how the system can take inputs to initiate functional flows. Let's take a look at resilience and reliability next. That's the ability of the system to correctly process until either successful completion or a known error condition that can be reflected back to the user, to go all the way through and mask any failures that might occur. When we talk about resilience, we're really talking about the ability of the system to respond to error conditions appropriately, reflecting back to the user what happened, and to continue processing subsequent requests without a sustained outage or a failure across the entire business process. To do this, we need to make sure that the system can detect failures, maybe with error handling or infrastructure failover, optionally report that failure to a person to take corrective action, and then determine the type of failure and therefore what recovery plan, ideally automated, is going to be put in place. Since failures can happen at lots of different levels of the system, the responses, which ideally are automated, will span both hardware and platform reliability fixes, and then you're also going to have software components as well. You'll see these in the examples. The last one we're going to talk about, after resilience, is reliability, which is much more quantitative. Reliability is the percentage of time during which a system or subsystem performs correctly, and it is normally applied to a complete business process, such as purchasing a product as an end-to-end workflow, as opposed to availability, which is the amount of uptime for submitting a request. Reliability is the amount of time the system can perform entire operations, the ability to complete the operations. It's usually measured by two metrics, mean time to failure and mean time between failures. For software systems, we usually combine these into a single measurement, because MTTF is fundamentally a hardware-type measurement, and mean time between failures is something we can actually measure on the software side, where we have outages and things like that. If the system becomes unable to process, that is, it suffers an outage, we're going to define two particular terms related to the recovery: a recovery point objective and a recovery time objective. The RPO is effectively how much data is lost between the time when the system goes down and when it's recovered or declared normally functioning.
The recovery time objective, on the other hand, is how long that recovery takes: not how much data is lost, that's RPO, but how long it's going to take to get the system back up and running again. For example, a system which is backed up once a day and has a failure in the middle of a Wednesday has an RPO of up to one day, back to the last backup, and an RTO of, we'll say, maybe five minutes, the time it takes to reboot the system and apply the latest database backup. So these two can differ. We haven't talked about scalability yet, which is really a whole other topic, but it's the third leg of our stool: reliability, availability, and scalability together form that three-legged stool. Scalability is usually tested as the number of concurrent operations, or load, that the system can take on without showing failure signs. We define this as the normal operating range, and what we want to do is build our SLA around what we've tested to. Now, when we have complex environments, we can look at reliability and availability and figure out what the theoretical number might be. For either availability or reliability, we have to multiply the subsystem components which have dependencies on each other. In series this gets worse: five nines in series with five nines is less than five nines. If components are in parallel, then sometimes it can get better. If we have redundant components and only one of them goes down, or if we assume that the failure mode is that at most one of them will go down at a time, then we may have better availability or better reliability, because we have two components that could take up the load. Usually we assume that software at the top of the stack is a hundred percent reliable and the things below it are less reliable, so we use a stack diagram like this to show the chain of dependencies from the software layer at the very top on down through things in the physical world. In the diagram shown, we have two physical nodes side by side on the platform; I wish I could point these out, but you see node A and node B, which have a big 99.9 on them. That's two elements in parallel, so you have some redundancy in there, and that's how we show it on this kind of diagram. The availability of a component may allow for some degraded mode of operations as well, which may count as still available; for instance, order taking may not be affected but backend billing is delayed, and you may still be a hundred percent available. If there are redundant components with failover required, say an active-passive database, then remember that we're going to have an RPO and an RTO involved, which is going to impact our uptime: if we can't recover the system within five minutes, we can't get to five nines, because we'll have an outage longer than that period would allow. If the redundant components are operating active-active, say in a load-balanced service, then a partial outage typically does not decrease the availability unless the load is entirely out of our normal operating range. So if you have a side A and a side B, for instance, and side A goes down, everything transitions over to side B; if that load is higher than what side B is able to take, then we may have a cascading outage. In this diagram, the color blue represents the application software, green is the infrastructure platform, and yellow is external integrations.
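Before we plug in the retail numbers, here's a minimal sketch of the series and parallel rules just described, using the hypothetical percentages from the example rather than anything measured.

```go
// Minimal sketch of the series/parallel availability math; the numbers are
// the hypothetical ones from the retail example, not real measurements.
package main

import "fmt"

// series: components with a direct one-to-one dependency multiply together,
// so the chain is never better than the product of its parts.
func series(components ...float64) float64 {
	a := 1.0
	for _, c := range components {
		a *= c
	}
	return a
}

// parallel: redundant replicas only fail together if every replica fails,
// so combined availability is 1 minus the product of the failure rates.
func parallel(replicas ...float64) float64 {
	f := 1.0
	for _, r := range replicas {
		f *= 1.0 - r
	}
	return 1.0 - f
}

func main() {
	// Two dependent elements at 80% and 85%: 0.80 * 0.85 = 0.68, i.e. 68%.
	fmt.Printf("series:   %.3f\n", series(0.80, 0.85))

	// Five redundant elements at 85% each: 1 - 0.15^5 ≈ 0.99992, i.e. 99.992%.
	fmt.Printf("parallel: %.5f\n", parallel(0.85, 0.85, 0.85, 0.85, 0.85))
}
```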
This stack diagram is, again, for the retail scenario that we'll see in a bit. Our sample calculation is pretty simple: reliability of the entire system is the platform pieces times the software pieces, times maybe some external integrations, times maybe some physical pieces, the payment devices on the yellow side there, the database reliability, and also your container hardware. Dependencies in series are multiplied together, so two elements, one at 80% and one at 85%, will give you 68% reliability; we multiply them together if they're in series, meaning there's a direct one-to-one dependency. Elements in parallel, where you have replication or duplicate, redundant hardware or containers, use the normal parallel equation, which is in the notes; five elements at 85% availability might give a total availability of 99.992%. So you can see it gets much higher if you have redundancy in there. Let's go on and talk a little bit about how we're going to brainstorm about failures. Here we're basically going to use a technique called failure modes and effects analysis. We're going to go through each layer of the stack and identify sources of failure. We can just scattershot them and ask, what could possibly happen at this layer? Look at our stack diagram or a system diagram from before, and then just list them out. Now, one recommendation I would make here: before you get to solving these, before you get to understanding how you might reduce the impact or change the design, list them out, and then prioritize them by likelihood; do that second. And as you're going through FMEA, you might also look at it by the OSI seven-layer model: what happens at the application layer, what happens at the container or virtualization layer, what happens at the network layer, and so on. On the next two slides, we're going to see a complete result of this, again for the retail example. I know this is a lot of stuff; you may want to come back to these to look at what they are. Basically we're going through a tabletop failure mode analysis, coming up with a bunch of failures that might happen, and you can see we added a couple of columns to the right where, after we did the initial listing on the left side, we're adding things that we think might be able to avoid this type of failure, or things that we might be able to add to the design to remediate it. There are a couple of obvious ones that we learned from on this one; there's actually a good case study of Square going down in 2017 and Chick-fil-A in 2021 that I also put in the notes. This is the second slide of our failure analysis, just showing you the complete analysis of what we thought could go wrong. There are a number of solutions for dealing with risks. For example, when we have a dependency on a separate cloud service, say offloaded login to a third-party provider, we may exclude it from our calculation; or we may say, hey, that provider does or does not give us a service level agreement, and include it as a dependency in our own calculation; or we may just monitor it and maybe add redundant paths to it so we can path around it a little bit. In the case of a database or network link, we might add redundancy again,
or we may have a contract with a different service provider to give us insurance or spare capacity to jump in in case of failure. For long-running services, we may implement checkpointing or store and forward to save the request to disk before starting a long-running workflow. If you're the primary service provider, the one with the SLA to the customer, you usually can't do risk transference to a third party. So in our case of a login service that someone else is maintaining, we may have to accept that risk and bear it ourselves, as maybe a service level objective that we'd like to monitor with that provider; we may not be able to pass it along to the customer. Now we're going to take a look at how we're going to improve our designs, what patterns we're going to use when we get to looking at our top failure modes, what we're going to apply to change them. Sometimes these are prior examples; they might be codified in a common practice or a new module that we can adapt into our own solution. We might use them to improve either availability or reliability. Fundamentally, the system needs to automatically identify a failure in another component and adapt around it to stay in a healthy state. That healthy state is either being able, in the case of availability, to accept new requests, or, in the case of reliability, to continue processing a workflow when it encounters something that wasn't expected. We may use different approaches depending on the layer of the application we need to improve, and so on the next couple of slides we'll see a set of infrastructure patterns and a set of software patterns. For the infrastructure folks out there, this is really talking about the infrastructure patterns that we rely on. Also remember that it's easier to use the provider's redundancy mechanisms, because they'll work well within the platform, and the same goes for backup and restore as for redundancy. So if you have an RDS instance within AWS, use that rather than rolling your own infrastructure-as-a-service implementation. On the lower left panel, I've listed many of the options which an infrastructure architect or SRE could help implement. The only one of these that's probably lesser known is the geo pattern, described in the speaker's notes, which Microsoft came up with: services are located in multiple regions globally, any one of which can serve any given customer, possibly routing to a region-specific backend. While it's a good solution for something as big as Office 365, it, like all infrastructure solutions, has a cost associated with it. Some of the infrastructure solutions provide security benefits as well, such as a web application firewall, which can throttle or restrict traffic from bad actors as well as rates of traffic from actual customers, so some of these can have two different usages. Most of these are designed for service interactions, not for inside the same cluster environment; since the likelihood of network failure is low inside something like Kubernetes, within the same cluster, it's more likely that you're going to see a failure between environments, where you're reaching out from one cloud environment to another. Let's take a look at the software side next. One step toward reliability, resiliency, availability, and scalability is picking a single pattern for scalability,
consistent across all the services; it's rarely a good idea to mix and match methods, with some services doing one thing and others on a different pattern. When we come to the software patterns, we might have a set of services that do CRUD operations, and then imagine some doing CQRS as well; typically, mixing and matching these makes it very difficult to put in the type of scalability and availability changes that we want, because we have to deal with two different patterns. If you're doing database transactions, also remember that your backend may be constraining your software a little bit as well, so try to put true database transactions in a single centralized service, and try to avoid using a transaction broker if you're having to span a transaction across multiple services. Finally, think about what your recovery should be if something fails and a microservice gets restarted: what actually happens then? We can call this idempotency. This is the ability of your service to perform an operation twice in a row with the same effect. Doing a read operation is inherently idempotent, assuming the backing data doesn't change, but retrying a payment would not be idempotent, because if you ran it twice, the person might be charged twice. So you'll have to add something like a transaction reference ID, so that the second time the operation gets requested for the same purchase, the processor knows it was submitted previously and just says, hey, I already did this, I'm not going to do it a second time. That's also why there's a spinner on a lot of checkout buttons in shopping carts and the like: it grays out the screen and takes away the ability for you, as a user, to click on it a couple more times. For our DevOps example as well, you're going to see that we want to fail safely. Since we're a job-based system running against infrastructure, we want to make sure that for jobs in flight, if they get submitted during a system outage, we have deterministic behavior for what happens and how to restart them, and we're probably not going to reintroduce those jobs on recovery, so that we have basically idempotent behavior, which is failing safe. The lower left panel lists, in order of common consideration from top to bottom, software techniques that we typically add to an implementation to increase its reliability. I'm not going to go through all of these; obviously the first one goes without saying, but yes, let the user retry. That's always an option unless you have a business requirement that says it's not available or not an okay fallback: just cancel out and start over again. Now, when you get into the software techniques, we'll start with error handling first because it's the easiest to standardize, and often one of the reasons Go programs fail is really just having an uncaught panic. This list is based on a lot of the ones that are cataloged at Awesome Go; if you're familiar with that site, the link is again in the slide notes. We then go down the list and ask, okay, if we can't do number two, can we do number three? Can we do number four? We get to more and more complicated ones as we go down the list.
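Coming back to the idempotency point for a moment, here's a minimal sketch of the transaction-reference idea; the type names and the in-memory store are illustrative assumptions, and a real system would keep these keys in a database or cache shared by all replicas.

```go
// Minimal sketch of idempotency via a transaction reference ID: the handler
// remembers requests it has already processed, so a retried payment with the
// same reference is acknowledged but never charged twice.
package main

import (
	"fmt"
	"sync"
)

type paymentProcessor struct {
	mu   sync.Mutex
	seen map[string]string // idempotency key -> result of the first attempt
}

func newPaymentProcessor() *paymentProcessor {
	return &paymentProcessor{seen: make(map[string]string)}
}

func (p *paymentProcessor) charge(idempotencyKey string, amountCents int) string {
	p.mu.Lock()
	defer p.mu.Unlock()

	// If we've already handled this reference, return the original result
	// instead of charging the customer a second time.
	if result, ok := p.seen[idempotencyKey]; ok {
		return result + " (replayed, not charged again)"
	}

	result := fmt.Sprintf("charged %d cents", amountCents)
	p.seen[idempotencyKey] = result
	return result
}

func main() {
	p := newPaymentProcessor()
	fmt.Println(p.charge("order-1234", 1705)) // first click on the checkout button
	fmt.Println(p.charge("order-1234", 1705)) // impatient second click: safe no-op
}
```

The second call with the same reference is acknowledged but never charges the customer again, which is the behavior the checkout spinner is trying to approximate on the client side.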
One of the techniques we call out here goes by a term we use, store and forward. We don't really call this out too often as a pattern, but basically it's the idea that an API is going to write a request immediately to disk before performing a long-running or error-prone operation, so that it can restart quickly if a failure is detected later on: it can revert back to the original request and say, okay, if I can retry it, I at least have all that information, and I've kept my availability for taking that request even though my reliability means I'm going to have to take a different path. For stateless services, remember to invalidate the cache if something happens to the backend; a lot of times we think of this when we're building microservices, but we include it here to be complete. Backoff and retry is a strategy for intermittent network and loading issues. Circuit breakers are described in a lot of places; Microsoft and Martin Fowler are common references there. A circuit breaker basically helps manage or throttle down the requests given to a down service until the detected issue is cleared. If we detect a service is down on request number one, we don't want to throw a hundred more requests at that same service, because we'll probably get the same answer, and a lot of times we'll have an external signal that tells us when that down service comes back online. Sometimes we talk about doing asynchronous web services or event-driven services, eventing. Those tend to be a little more complicated for actually telling when a failure has occurred, because we have a long poll or some sort of socket that might be open for a long time, and you don't know if the other end went down or if it's just taking a long time processing something. And then the last couple on here are bulkheads and fallbacks, which really limit the impact of a failure by isolating one workflow from another, or by having specific undo events, what we might call a compensating transaction, that lets us roll back at least partially through a workflow if we can't roll back the entire thing. So let's take a look at a module that brings a lot of this together. I'm going to start with probably the one most people would recognize, the Netflix one called Hystrix; the Go version was a port from Java originally. It's a good thing there aren't too many Netflix-sized deployments, because this is a very complex technical implementation to add into any system, so I'd say you can look at it as a model: you may take parts of it, or you may find some of the parts other people have taken on and made standalone that you could pick up one at a time. This one does provide a lot of availability, reliability, and recovery techniques and allows you to add them into your code base, with the added benefit of providing monitoring: when it detects a failure has occurred in a dependent service, it alerts, and it can send that to Prometheus or out to telemetry. Now, the techniques included here show a wide variety of solutions that can be used singly or in coordination with others, so you don't have to use all the things on that previous software list to make sure that a single failure does not cause other dependent services to stop working. We may be able to get by in certain cases with certain of those solutions; we don't actually have to use all of them. So we might use a circuit breaker, but only with retry.
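To give a feel for what a circuit breaker looks like in Go, here's a minimal sketch using the community sony/gobreaker library, one of the standalone options you'll find via Awesome Go; the breaker name, the thresholds, and the downstream URL are illustrative assumptions, not values from our production system.

```go
// Minimal sketch of a circuit breaker around a flaky downstream call using
// the community sony/gobreaker library; all settings here are illustrative.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"

	"github.com/sony/gobreaker"
)

func main() {
	cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
		Name:    "api-two",
		Timeout: 30 * time.Second, // how long to stay open before probing again
		ReadyToTrip: func(c gobreaker.Counts) bool {
			// Trip after a handful of consecutive failures instead of
			// hammering a service we already believe is down.
			return c.ConsecutiveFailures >= 5
		},
		OnStateChange: func(name string, from, to gobreaker.State) {
			fmt.Printf("breaker %s: %s -> %s\n", name, from, to) // feed this to metrics
		},
	})

	result, err := cb.Execute(func() (interface{}, error) {
		resp, err := http.Get("https://example.com/orders") // hypothetical downstream API
		if err != nil {
			return nil, err
		}
		defer resp.Body.Close()
		if resp.StatusCode >= 500 {
			return nil, fmt.Errorf("downstream returned %d", resp.StatusCode)
		}
		b, readErr := io.ReadAll(resp.Body)
		if readErr != nil {
			return nil, readErr
		}
		return b, nil
	})
	if err != nil {
		// Once the breaker is open, calls fail fast (gobreaker.ErrOpenState)
		// and we can fall back, queue the request, or surface a clean error.
		fmt.Println("request failed or breaker open:", err)
		return
	}
	fmt.Printf("got %d bytes\n", len(result.([]byte)))
}
```

Once the breaker opens, calls fail fast instead of piling onto a service we already believe is down, and the state-change hook is a natural place to emit the monitoring signal discussed above.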
We don't have to use fallback, and we don't have to use bulkhead; we don't have to use everything everywhere. Just think from the architectural perspective about where you want to apply which technique. To wrap up the patterns description, it's important to remember that we have to monitor for failure. This is something Hystrix gives us, and it happens at multiple levels: we're going to track hardware performance, we're going to track traffic analysis for latency, and we're going to see how the network is performing for those different lengths of service calls. Is it normal for a call to come back after a second, or should it be in the hundred-millisecond range? We'll also have business telemetry and metrics, with OpenTelemetry or something like that. This goes a lot beyond logging. A lot of times our first pass at this is just to throw out a log of all the different things that are going on and sequence them by time of occurrence. Realize that when you get to multiple customers and multiple flows through the application, we're going to need things like correlation IDs and timestamps to start looking through those logs, actually get the events out of them, see causality, and see where failures are actually occurring. A failure may happen on one particular customer's job because of the way they specified that job, or it may be because the load that customer is putting on the system is affecting another customer, so we'll have to sift through it. Logging is often not the best way to find failures here; it's like looking for a needle in a haystack. When something goes wrong, sometimes it goes very wrong, and we want metrics to abstract that a little bit. Now we're going to take a look at three systems and see if we can put this into practice. Our first example is just a simple two-step microservice, one calling the other. The two APIs may not be in a single environment, and we have a command line activating the whole thing, maybe a user on their laptop, which calls the first API. The second API is tied to a database, so we'll call it a CRUD layer over there at number three. Maybe this is a simple to-do application or something like that, something fairly self-contained. AWS in this case, if that's where we're hosting, may give us part of an SLA as a platform, so we can build on top of that in each of the two environments where the APIs live. We may also have RDS or something like that for the database, which gives us redundancy and maybe recoverability. And so this is pretty helpful: we can compose the simple case out of pretty handy infrastructure parts. Now, we probably measure availability at the first REST API; we put that little yellow star on there. That's the front door; remember, that is our uptime. And then we measure the overall reliability by the ability to persist all the way through to the database; that's the orange star in this case. There are not a lot of techniques we're probably going to need on the software side, due to the simplicity of the system. We probably consider the command line out of scope, and if the customer environment can't reach AWS, for instance, we're not going to include that in our uptime; uptime means you have to be able to get to our front door. Now, to increase uptime, we might have load balancing on that first API, and we might cache some results
from that second API over at API number one. Our availability would then be based on the customer being able to create requests and maybe view the status of their requests, so everything in that REST API number one. Reliability might add a couple more techniques: we might add a redundant database, provided by our hoster or otherwise; we might load balance API number two; we might provide a circuit breaker in API number one to store requests locally and back off on sending requests to API number two if it detects that it's going down; and then we might have some way to tell API number one, okay, it's okay to turn traffic back on to API number two. We're certainly going to put monitoring in place as well, and maybe continuous testing with something like synthetic transactions: we send through no-operation type transactions, pretending we're the user, maybe a read operation that goes all the way through to the database and returns a status without affecting anything. That's what we'll call a synthetic transaction, so we can actually test what's going on while the system is running. Now we're going to take a look at a more complex scenario in retail that deals with e-commerce and an in-store set of operations. This is effectively multi-channel sales, if you've ever heard of that: the customer can order either in a brick-and-mortar shop or online with maybe pickup in store; of course nowadays we expect delivery to home as well, but we'll leave that aside. The order starts two business processes. When we do our online order at number one, that starts a financial transaction, maybe a pre-authorization against a credit card, which goes to the bank and asks, hey, does the customer have the ability to pay for the goods? Then we send that order down into the store; that's step number three there. In store, the customer is going to do what we call completing the transaction: they may have to sign that they picked up the goods, and then that notice goes back to the bank to say, go ahead and charge them $17.05, and the customer walks out with the goods in step number five. Now, they may shortcut this: instead of step number one, they may start at step number five and walk into the store and order right then and there, in which case those two transactions back to the bank probably get collapsed. But the point we're showing here is that the bank is in a different environment, the e-commerce is in an environment, and the in-store system is in an environment. We're probably going to measure availability at both the online ordering site and in store; if the customer can do either one of them, that's good. For this more complex set of environments, we actually want to measure reliability end to end: that is, the system supports customer delivery of goods within a certain amount of time after ordering, and the financial transaction is correct a hundred percent of the time. Obviously we used a lot of different solutions from both the software side and the infrastructure side in this particular system; many of the techniques from the previous two slides on patterns were used, but we primarily focused on reliability over availability, so failover of clusters in e-commerce, and failover, data redundancy, and workflow restarts in store, were the priority.
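Since backoff and retry keeps coming up, here's a minimal sketch of the backing-off behavior described for API number one, with exponential delays and jitter; the attempt count and delays are illustrative, and the flaky operation is simulated.

```go
// Minimal sketch of retry with exponential backoff and jitter for an
// intermittently failing call; all numbers here are illustrative.
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retry runs op up to maxAttempts times, doubling the wait after each failure
// and adding jitter so a crowd of clients doesn't retry in lockstep.
func retry(ctx context.Context, maxAttempts int, baseDelay time.Duration, op func() error) error {
	var err error
	delay := baseDelay
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break
		}
		jitter := time.Duration(rand.Int63n(int64(delay) / 2))
		select {
		case <-time.After(delay + jitter):
			delay *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	err := retry(context.Background(), 4, 100*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("intermittent network error") // simulated flaky call
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

The jitter matters on intermittent networks: without it, a crowd of clients that failed together will all retry at exactly the same moment.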
Next we're going to take a look at an example from DevOps. This is a simplified diagram of a DevOps system like Chef's Cloud 360 offering that I mentioned we launched about a month ago, where one of the services provides a generic job runner capable of running scripts or specific skills for multiple customers, from the cloud hoster out to the customers' own data center servers and other cloud providers. The customer starts a job at number one; that goes into our service environment at number two and then comes back out to their managed assets over at number four and number five. An end user or operator can automate or interact with the initial REST API, number two, to enroll their servers or nodes in the system, with a command line, that's mark number one, or a browser-based experience. As they authenticate, the API forwards the request to an OpenID service that we don't control, the customer's chosen identity provider, to log them in. These API services have internal databases, queues, and file storage inside the platform to be able to perform operations; I've shown those as basically a Kubernetes logo and a light blue Azure queue icon. In terms of our flow, once the operator enrolls the node under management, they call a second API in AWS to submit a job. We'll call this number two, and also the yellow star: we're measuring availability here, because job submission is one of the critical things we want to have up and running all the time. These API services run on Kubernetes nodes, which is a different usage of the word than what the customer sees; we're using the word node in a couple of different senses here. In our case, as a service provider, we're using multiple nodes inside of AWS across a couple of availability zones; we'll see other infrastructure design on the next slide. Let's continue with our flow. Our primary API services may call an outpost, a local set of similar API services that lives across the bridge, mark number four, in the customer's own data center, in order to forward a job to maybe a local queue. We actually call this a connector in our architecture, but let's consider it a bridge for now, so we forward the job submission from number two over to number four, into the customer's data center, to execute. Realize the customer may have multiple data centers on their side. Now, the agent on server number five, our real target where we want to run the job, may actually pick it up directly from number two or from number four; it doesn't really matter. The Chef agent typically lives on this server, or it lives where the job will run; this is mark number five on the diagram, and typically it's one of the customer's servers where they want something to happen, maybe a patch to be applied or maybe a restart or something like that. It may also run remotely through to a second server; we call this target mode. The agent may live on something like number five but reach out to another server that for some reason we can't put an agent on, maybe a router, maybe an appliance. The job then goes from the Chef API service to that outpost API service in the customer environment, to an agent or maybe a proxy for remote targeting, and effectively to the destination where the application or service job will run.
The final part of this workflow, which we want to make as reliable as possible, is the agent reporting back what happened, the completed state going back through the same chain, through the outpost, back up to AWS and the APIs, and maybe populating reports in a web application. We show this as number six on the diagram, and also where the yellow star is, where our reliability metric can be measured: successfully completed jobs without failures, or, if there was a failure, the ability to identify what it was and have the user resubmit. We've designed our overall reliability based on a lot of AWS primitives for infrastructure and also on application techniques. We do exclude from our service level, at this point, customer-responsible items such as how the outposts are running, how their SSO is running, and their target server reliability; the actual server that's running the job may have its own failure modes, and all we can do is monitor for those. So when we say a service level, we're really giving the customer a service level objective. Because our system is so tightly dependent on their reliability for most of the path, we can certify our reliability in the sense that it will identify that an error occurred, but that error may have occurred in an asset that the customer controls, or, in the case of single sign-on, in a service they're operating on our behalf. So we don't have an actual SLA at this point, and that makes the reliability calculations a little more manageable, because we can exclude some of those things and say they aren't included in our reliability calculations. We typically call these customer responsible, or CR. The other deployment option is if a customer deploys this system entirely on premises. We're not really going to cover that here, but many of the infrastructure parallels exist: where we have AWS RDS services, we can have a multi-node, redundant database on-prem as well. A purely on-prem deployment that doesn't use the SaaS part of the solution will have equivalent infrastructure components to what we would have in AWS. On the next slide we'll show some of the software techniques that we've prioritized and how we've built up the production system. From a basic MVP of sending a job through the system, we asked: there are a lot of places failures could occur, so what have we done to change the system and improve it? I'll go through these in the order we present them here, but they really run from the front door on through to how we monitor things in production. We're still adding changes, improving the availability and reliability of the applications that make up this latest DevOps offering, but this is a summary of what they are. One thing you'll notice is that even though our diagram shows only a few forwards between different services, because it spans different environments we're bringing a lot of different patterns to bear on it, both software and infrastructure. The specifics don't really matter too much unless you're building exactly the same system, but I'll talk you through the process of how we brainstormed, identified failure points, and put these techniques in place to improve availability and reliability. Number one is all about availability: making sure the customer DNS and the job submission API are highly available, especially when combined with number five there.
We take in the request, immediately write it to disk, and then start the process of running the job, so we can always fall back to the actual request if something happens down the line that we need to recover from. Number two is about redundancy for the API calls through the gateway, scaling horizontally within the cluster, and the ability to spread load across multiple clusters, potentially to geo-diversify, if you will. Number three is about using the cloud provider's redundant services, making sure we're staying on the pattern AWS gives us; we do have some interaction with Azure as well, so we use the primitives that are there and then monitor for external service failures. Number four is important and deals with the agent that's sitting next to, or actually on, a customer server. It has to be zero trust, but it also needs to let the APIs know what its status is; we have to be able to monitor it and know that it's still up and responsive. We use multiple techniques here to make sure the agent can reliably execute jobs. One of the interesting ones is that we have a heartbeat on a regular basis that tells our APIs on the server side, hey, I'm still here; hey, I just did that particular job you asked me to do. So we have a heartbeat status that always comes back, in addition to job-specific artifacts and status. Number five is really preventing failures by immediately writing requests to disk before executing them, which kind of makes sense. Numbers six and seven are really process items: again, work with your operations team to make sure you're monitoring the right things and that you're able to do root cause analysis of any failures that happen. Let's talk next about components in Go, the language we've picked for building all of our services. In Go, as most of us probably know, we don't have the easy solution of something the language has already picked for us, nor a single, simple, well-known, battle-tested module that other languages might have; in .NET languages there's a library called Polly, and we don't really have that here, except maybe the implementation that Netflix came up with. What we have in Go is really the community. You can go out to the Awesome Go website and get twenty different recommendations, so you'll see lots of different options out there, some of which I've tried to list from the most functional to the least supported, meaning we know some of these are maintained by a single developer who may not have the time to keep them up, but who did a good code prototype that we may want to take on ourselves and maintain going forward. At the top of the list, Hystrix has lots of functionality, adds observability, and is probably the most complex of any of the solutions out there; maybe too big a hammer, but it's our gold standard, so if we were looking for a reference architecture, that would probably be it. The next few are pretty well supported and broadly functional; multiple patterns described on the software slide are implemented. We've got circuit breakers, we've got retries, we've got bulkheads, so you could pick and choose some of these components if you want something lighter weight.
I tried to go with ones that had a lot of community forks, had up-to-date maintenance, or obviously had a lot of people following and using them in their own implementations, going with the idea that popularity indicates usability. With each of these, check and see how well it integrates into your use of Go; a lot of times there's a Medium article or an example on how to use it. Some of them work better with the Go language, some use goroutines more than others, and some use Go syntax and semantics a little more natively than others. Pick the one that works best for your team, because we're probably going to be putting these in a few different places in our code. Also, as you look through these particular techniques, ask if you need all this complexity in your application; go back to that value statement. Why am I building high availability in here? Why am I building higher reliability into this? Not all applications need all the techniques applied. All these techniques, when applied, could probably make a really good system, but it takes time, effort, and cost to get this out the door and into your system. If your customer is demanding an SLA, figure out what parts of that SLA are most important and which components support that the best. The code sample off to the side is a sample circuit breaker, so you can see what that looks like. One last note here: for distributed systems at scale, plan on having a regular test program. You can't do all of this automated; you have to have people running tests, new tests, different types of tests. Pick test tools that the test teams are familiar with, and of course follow the standard guidance of not having developers test only their own stuff; they're only going to look for things they've already conceptualized, while a tester will find things a developer probably never thought about, which is what we see when our system faces first contact with the customer. Run tests in pre-production with sample data and configuration: run with multiple tenants, run with different roles, run with malformed inputs, on each release of the application. That's a good thing for the development team to take on as integration tests, and then run them in production as well. If you have an operations team or a dedicated test team, you can use canary tenants. I like this because with microservices we could have a hundred commercial customers in there with different load profiles, but as long as I have a tenant that is mine that I can see all the way through the system, kind of like those synthetic tests we talked about earlier, I can test a transaction going all the way through to make sure all the components are up and running, and then report on availability that way. We use a tool called k6 to do a lot of our testing, but I've listed a few others here as well. We do not use failpoints, but that looks really promising; I would say look at that one. And then one of the really interesting tools out here is Chaos Monkey, which injects failures into your actually running system.
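To make the canary-tenant idea concrete, here's a minimal sketch of a synthetic transaction probe: a scheduled, read-only request that pretends to be a user and goes all the way through the front door; the endpoint, the canary query parameter, and the interval are placeholder assumptions.

```go
// Minimal sketch of a synthetic transaction probe: run a harmless read
// through the front-door API on a schedule and record pass/fail and latency.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func probe(client *http.Client, url string) (bool, time.Duration) {
	start := time.Now()
	resp, err := client.Get(url)
	elapsed := time.Since(start)
	if err != nil {
		return false, elapsed
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, elapsed
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	ticker := time.NewTicker(1 * time.Minute)
	defer ticker.Stop()

	for range ticker.C {
		ok, elapsed := probe(client, "https://example.com/api/v1/jobs?canary=true")
		// In a real deployment these results would go to Prometheus or another
		// metrics backend; printing keeps the sketch self-contained.
		fmt.Printf("synthetic check ok=%v latency=%v\n", ok, elapsed)
	}
}
```

In a real deployment the pass/fail and latency results would be exported to metrics rather than printed, so they can drive the availability reporting that comes next.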
Many teams don't have a big enough staff to have dedicated SREs or operations teams, but if you design in your metrics from the very beginning, anyone can dive in, see what's going on, and identify issues quickly. Even if you don't have an SLA, tracking these metrics from the start lets you build fixes into your application, which eventually gives you better customer satisfaction. Make these measurements visible: we use common tools like Grafana and Prometheus to show the business events that are going on, the load on the system, the performance of the different components, latencies, and things like that, and we show that to the operations team. The development team then helps both in an outage situation and afterwards to see how the system is performing, and you can brainstorm collectively on improvements. A minimal sketch of exposing such metrics from a Go service follows below. There are really four things I'd like you to take away from this talk. First, availability and reliability require both software and infrastructure patterns. You're going to borrow from both camps, and the combination of the two is what's really going to get you to the numbers you want as your goal. Second, look to examples from other systems. There's best practice out there. There are larger systems than the one you're building, which shows the path has been paved ahead of time. This common practice gets built into application templates over time and eventually becomes kind of the easy way to do this, right? You don't have to build your own model from scratch every time, and this is certainly something I think AI is going to help with over the years: it will be able to look at all these different patterns and say, this is what you should actually use. Until that point in time, use your experience and your knowledge from other systems as a guide. Third, do as much as you need, no more and no less. This is a balancing act of cost and schedule against technical over-engineering, what I call chrome plating. Use your architect's experience here to know how far you should go with each of these techniques. You don't need all of them; you may need only certain ones right now, and you may need others over time. Finally, number four, don't overlook testing and quantitative analysis. It's the only way to be sure before you advertise an SLA to customers. Now we've got a wrap-up slide here that shows some of the high-level resources for this talk. We mentioned Polly for .NET briefly. Hystrix, from Netflix, is a good model to look at (it has a Go port), and Chaos Monkey for testing. We mentioned a lot of other things; those will be in the slide notes as you pull them down and look through them on particular topics: FMEA, other testing tools, different pattern references, and things like that. Of course, you can look through cloud providers' best practices. Both AWS and Azure have good Well-Architected programs that give a prescriptive list, as if they were running you through a workshop: this is what you should look at for a large SaaS application that's going global or living in lots of different networking environments. Uber has a really good compilation here of notes on the SRE role. If you ever wonder what DevOps does, or where DevOps came from, it came from SRE, Site Reliability Engineering, and moved a little closer to development.
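As a concrete illustration of designing metrics in from the beginning, here is a minimal sketch of a Go service exposing a request counter and a latency histogram for Prometheus to scrape (and Grafana to graph). The metric names, labels, and endpoint paths are illustrative assumptions, not the product's actual instrumentation.

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Illustrative metric names; choose names that match your own business events.
	jobRequests = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "job_requests_total", Help: "Job requests received, by outcome."},
		[]string{"outcome"},
	)
	jobLatency = prometheus.NewHistogram(
		prometheus.HistogramOpts{Name: "job_request_duration_seconds", Help: "Job request handling latency."},
	)
)

func handleJob(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	defer func() { jobLatency.Observe(time.Since(start).Seconds()) }()

	// ... accept the job, persist the request, hand it to an agent ...
	jobRequests.WithLabelValues("accepted").Inc()
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	prometheus.MustRegister(jobRequests, jobLatency)

	http.HandleFunc("/v1/jobs", handleJob)      // hypothetical business endpoint
	http.Handle("/metrics", promhttp.Handler()) // scraped by Prometheus, graphed in Grafana
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

With counters and latencies like these in place from day one, availability and error-rate reporting is a dashboard query rather than a retrofit, even before any SLA is advertised.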
But they talk a lot about how to operationalize systems and how to put in place good discipline around these practices. And finally, if you're interested in the latest from Chef, I've put our product documentation at the bottom here as a little bit of a pitch. As a thank you, I'd like to acknowledge the engineers who published on this topic before us; the history goes way, way back over multiple decades of building available and reliable systems. Obviously the teams behind Polly and at Netflix, who've bottled a lot of this up into tools and modules we can use. Thank you, of course, to the open source engineers who maintain all the tools we talked about here, and to the many teams at AWS, Microsoft, and other cloud providers that brought us lessons learned, especially on the infrastructure side of best practices. They were really the first ones to see global scale in a lot of these things as organizations moved out of their own data centers into shared hosting facilities. We learn from our mentors, and that enables each of us to take on increasingly scalable solutions. Thank you also to you, the audience, for listening to this talk. Without you, there wouldn't be any next application to build. I hope this has caused you to look at your applications and inspires you to talk to your teams about quality regularly. Please reach out to me on LinkedIn if you have any questions or feedback. We wish you a very helpful and valuable Conf42 from Brian at Chef. Goodbye.

Brian Loomis

Director Of Architecture @ Progress

Brian Loomis's LinkedIn account


