Conf42 Site Reliability Engineering 2023 - Online

Policies and Contracts in Distributed Systems

As companies adopt microservices and distributed systems architectures, keeping service performance at a manageable level becomes necessary.

But manageable for whom? The consumer, the service owner, or the business? The answer lies in SLOs as runtime promises!


  • Prathamesh: There are three axes to every software system, or any other system. On one side you have the boundary of economic failure. At the same time, there is a boundary of unacceptable workload. With all of these human contracts involved, we always have these North Star objectives. But is it always possible?
  • A boundary is anything that limits or sets an extent for a criterion. Boundaries include perfection, cost, pricing of the software, time to market, and so on. Negotiations over them effectively lead to contracts: written or spoken enforceable agreements.
  • The runtime interfaces we deal with in the real world are effectively the objectives we have been talking about. Using service level objectives, we define the criteria for a particular health indicator of a service, an API, or a function over a period of time. These help in setting the right expectations about what's possible.


This transcript was autogenerated. To make changes, submit a PR.
I'm excited today to talk about policies and contracts in distributed systems. My name is Prathamesh and I work as a developer evangelist at Last9. This is my Twitter handle. You can find me posting interesting things about distributed systems, time series databases, and so on, so if you want to follow, go ahead. As software developers or engineers, we want to write code. We want to fix all the bugs. We want to ideally write code which is bug-free, right? We don't want to be the author who writes bugs every time. We want to use all the latest and greatest tools; that is the North Star. If there is anything new coming out in the market, I will definitely want to try it, be it Copilot or even ChatGPT for that matter. I want to integrate best-in-class tools into my development workflow so that I get the best that is out there. That is my aim. But is it always possible? As a DevOps engineer, I want to make sure that my infrastructure scales. I want to make sure that the application utilizes resources efficiently, with no extra resources just hanging around and costing me money. I want to make sure that my cost is always optimized and under my control, not exploding massively and unnecessarily. All of this is something I will always strive for. But as software engineers, or even DevOps engineers, whether it is possible or not is the question. As a team lead, engineering manager, or DevOps lead, I have similar objectives. I want to make sure that deadlines are met, and that features work as expected within performance criteria that cater to a customer experience the customers are satisfied with. I want to make sure that my code, and the infrastructure running the application, is performant enough.
The tech debt should not go out of control, so that I don't have to cater to it when I really want to ship features. At the same time, I want to make sure that my team is motivated enough, because otherwise there is no use for everything else if my team is not willing to work happily with the rest of the team members. All of these are my objectives, but they are also not always possible. As a product leader, I want to make sure that team velocity is not slowing down, and that external customer commitments, the service level agreements, are getting met. The product getting adopted is also one of the constraints I would like to put on my team. At the same time, I want the feedback from customers and their expectations to be considered by my product and engineering teams and incorporated, so that my customers are happy. This is all I want, but all of these are constraints, and sometimes they run into each other. For example, if you want to ship a feature but the engineering team is struggling with tech debt from the last sprint, their deliverables in this sprint will get affected. At the same time, if there is not enough marketing effort from the product marketing or sales side, then even if we have the best product out there, we won't have any customers willing to use it. With all of these human contracts involved, we always have these North Star objectives, but they are not always possible, because they always run into each other as contracts. Rasmussen, a long time ago, developed a model of the theory of constraints that describes very nicely how accidents happen. Basically, there are three axes to every software system, or any other system as well, where on one side you have the boundary of economic failure, beyond which, if you go, there are chances that your company will shut down, right?
If we keep increasing the resources in our organization, then our cloud bill can go out of control and there is a chance that we'll have to shut down the company. A very trivial example, but you get an idea of how the boundary of economic failure works. At the same time, there is a boundary of unacceptable workload. I have a few team members in my team, but if I continue to ask them to work every day, 24x7, then at some point they will say, okay, I want to leave now. I don't want to work here anymore, because the workload is completely unacceptable. So there is a threshold up to which the workload can be acceptable, beyond which it is not sustainable. At the same time, there is also an experience or performance expectation boundary. My application should load within a second. If it is a payment transaction, then it should always succeed, or if it does not succeed, then at least my money should not get stuck in between; it should get credited back to my account. So there is a performance or safety regulations boundary as well. And the point of equilibrium should be in the middle of this circle, right? If you try to push it from one side, let's say I keep increasing the workload from the bottom, then the gradient, my point of equilibrium, will keep moving toward an edge, and there is always a boundary of failure beyond which, if I try to push harder, accidents can happen. And this is applicable to all three axes. So if I try to push any one axis forward, there is a chance that an accident can happen. As software leaders, SREs, DevOps engineers, and people who manage these software systems, we want to make sure that this boundary of acceptable behavior is not broken. We stay within the constraints, so that the gradient is not completely outside the red boundary of accidents. Rasmussen developed this for control theory, but it is equally applicable to today's modern software systems.
If we don't respect this boundary and keep pushing harder, then boom, we run into accidents. That's where we see missed sprints. We see effort that is not aligned with the rest of the organization. We see tech debt increasing and effectively causing more failures. In the long term, failures will happen anyway, but they will essentially happen because each and every stakeholder in an organization has a limit. They have a boundary beyond which things cannot be pushed. It can be an economic boundary, a workload-related boundary, or a performance-related boundary, as we saw in Rasmussen's model. If one of the stakeholders pushes the boundaries too much, then accidents can happen. The business pushing for feature rollouts instead of worrying about tech debt is one example of this scenario. Engineering can also keep chasing perfection; they can keep looking for the best solution, the best product out there, instead of shipping what they have, which eventually causes problems with the velocity of the software as well as delivery consistency. So these failures will happen if we don't mind these constraints and boundaries. A boundary is anything that limits or sets an extent for a criterion. It fixes a threshold for the objective that we are trying to achieve. There can be team constraints: I only have five people, and one of my team members is on leave for a personal reason, and that person has all the access to my AWS cloud account, right? And I'm a startup, so I don't have a lot of people to manage all of these things. Quality of work can also be another boundary; we cannot have a very unoptimized code base which keeps failing time and again. Time to delivery, or time to market, is one more constraint that everybody is always concerned about. There are other constraints and boundaries as well, such as perfection, cost, pricing of the software, time to market, and so on.
These are all examples of boundaries in modern software systems. Now, the way we deal with boundaries is via negotiations. People say: we'll be able to release this, but there can be a few bugs; are you okay with this? We will be able to ship 80% of the functionality, but certain bugs may be present. It's like the way I am doing this talk. Mark had told us that the video should be in by the 28th, and I'm making sure that I submit the recording of the talk before the 20th of April, because otherwise I'll break my promise to him. We can do certain deployments, but our AWS bill will increase for two or three weeks; when we get time for optimization, we'll do the optimization, but until then we'll have to bear the brunt of the increased bill. Or things like: we'll be able to roll out to certain customers, but teams will have to work overnight and on the weekend, and they have already worked the previous weekends as well, so do you really want to do it? And so on. We are used to doing these kinds of negotiations in our day-to-day job. These negotiations effectively lead to contracts: a contract is a written or spoken enforceable agreement. When we negotiate something with our own colleagues, within our own organizations, or even with outside customers, we arrive at a contract that all of us depend on and that all of us follow. The delivery will not happen today, but it will happen on Monday at 11:00 a.m. Once we clarify this contract, it becomes an agreement between the two parties, and then we follow that agreement as we go forward. If we think about how this relates to a programming concept, I would like to correlate it with an API interface. When we talk about APIs and their documentation, the documentation is nothing but a written agreement about the promises the endpoint makes: what it will return. Let's take an example of the users endpoint.
So if the users endpoint is returning HTTP status 201 in the success scenario, it returns a bad request when the request does not have the correct input, and then it can have different statuses for an unauthorized request, or when the resource is not found. This is the contract, the documentation of our users endpoint, which can be publicly documented and given to the rest of the team members to follow, so that they can work according to this agreement when they develop the rest of the code that consumes this particular endpoint. Now, when we talk about these programmable interfaces, what is the equivalent in our day-to-day life? That is the runtime interfaces which we deal with every day when we deal with other people. It is a written agreement about how the endpoint will behave at runtime. It can be similar to the uptime of the endpoint: this particular API is available 90% of the time. If the consumer of this service expects that the POST users endpoint should always succeed, no, that is not possible, because the agreement is that it is only available 90% of the time. 10% of requests are allowed to fail, say over the weekend, because we don't work on weekends, right? This can be an enforceable agreement which both parties have to follow. The latency can vary between certain limits; this is also an example of a promise that the API author is making to their consumers. So all of these promises will effectively define how the consumer expects this particular service or API to behave, and there are no chances of confusion or accidents, because both parties are aware of what those constraints are when deciding about the consumption of this particular API endpoint. When we talk about these runtime interfaces in the real world, they are effectively the objectives that we were talking about so far.
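The users endpoint contract described above can be sketched in code. This is a hedged illustration: the dictionary shape and helper name are my assumptions for the example, not an official schema format.

```python
# Illustrative codification of the users endpoint contract from the talk:
# 201 on success, 400 bad request, 401 unauthorized, 404 not found.
USERS_CONTRACT = {
    "POST /users": {201, 400, 401, 404},
}

def is_documented_status(endpoint: str, status: int) -> bool:
    """True if the status code is part of the published contract."""
    return status in USERS_CONTRACT.get(endpoint, set())
```

A consumer, or a contract test, can then flag any response whose status falls outside the documented set, since that response breaks the agreement both parties signed up for.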
And there is a beautiful concept in the site reliability engineering and observability world for this: service level objectives. Using service level objectives, we define the criteria for a particular health indicator of a service, an API, or a function over a period of time. An example can be that the availability of my service will be greater than 99.99% over a period of one day. An example of a service level objective in the case of this talk is that I promised Mark that I will submit this talk. He selected my talk; I promised him that, okay, I will upload it today. And then he promised me that once I upload it, he will publish it on the 4th of May. That is a layman's example of how service level objectives can be defined. Similarly, we can define objectives for key indicators such as latency and uptime over a period of time, and these give enough visibility to all team members, all functions, and all stakeholders about how a particular system, service, or important infrastructure component is going to behave, so that they can build redundancy and build parameters to consume this information or this service in a way that is consistent across the organization. There are other examples of service level objectives as well. What is my error rate on this particular payment checkout flow? Can we promise 99.9% availability to this enterprise customer? And if we can't, then what are the areas we want to improve? Is it the tech debt that is stopping us, or is it some hardware that we need to invest in to get to this particular availability? Because we have to remember that not every nine is free. As we go from three nines to four nines to five nines, we'll have to invest more in terms of time, money, resources, and people. Objectifying it and making it a contract helps us identify where we have to spend more, or whether we even have to spend more, to reach that particular level.
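A minimal sketch of evaluating such an objective, assuming per-window request counts are already available (the function names here are hypothetical, not from any particular tool). It also makes "not every nine is free" concrete by computing how much downtime each target leaves you over 30 days:

```python
def availability_sli(successful: int, total: int) -> float:
    """Availability as the fraction of requests that succeeded in the window."""
    return 1.0 if total == 0 else successful / total

def meets_slo(successful: int, total: int, target: float = 0.9999) -> bool:
    """True if the measured availability is at or above the SLO target."""
    return availability_sli(successful, total) >= target

def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Error budget, in minutes of downtime, that a target leaves per window."""
    return (1.0 - target) * window_days * 24 * 60

# 99.9% over 30 days leaves ~43.2 minutes of downtime;
# 99.99% leaves only ~4.3 minutes -- each extra nine costs real investment.
```

Each additional nine shrinks the budget tenfold, which is exactly why moving from three nines to four nines demands more time, money, and people.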
It also helps answer questions such as: should we prioritize tech debt over new features? Because if we know that the availability of a service itself is 80%, then shipping new features may not even be productive for our team members, because these new features will run into the same challenges. Instead, we can first prioritize improving reliability from 80% to a level that is acceptable across the organization, and then work on new features. Making those decisions then becomes extremely easy, because everybody is aligned on the same objective and the same goal. These runtime promises are nothing but service level objectives. As we saw, these runtime promises can be codified as documents, can be run as service level objectives if you're using an observability tool, or they can be recorded as decisions in your decision log, where everybody can see them over time. And they are essentially runtime, not static, because if you start with a particular objective, you can always increase or decrease it, and adapt to the next nine based on the performance you are seeing right now. So instead of forcing these promises top-down, where the engineering leaders say, okay, we want to start with three nines or four nines, the teams can start with what they have right now and use adaptive service level objectives to improve their reliability goals over time, based on their current benchmark or baseline. These promises can then effectively be codified into policies, where my P0 service or P0 API will have 99.99% availability versus my P3, which will have 90% availability, and this can be enforced across the organization. These help in setting the right expectations about what's possible. It also helps explain these contracts to multiple stakeholders at the same time.
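The tiered promises just mentioned (P0 at four nines, P3 at 90%) could be codified as a simple org-wide policy table. The tier names and targets come from the talk; the structure is an assumption for illustration:

```python
# Availability targets per service priority tier, as in the talk's example.
SLO_POLICY = {
    "P0": 0.9999,  # critical services: four nines
    "P3": 0.90,    # low-priority services
}

def violates_policy(tier: str, measured_availability: float) -> bool:
    """True if a service's measured availability breaks its tier's promise."""
    return measured_availability < SLO_POLICY[tier]
```

Once the table is shared, "is this service meeting its promise?" becomes a lookup rather than a debate, which is what makes the policy enforceable across the organization.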
This becomes a framework of communication between customers, internal stakeholders, other team members, and so on. It also helps in making decisions such as build versus buy. For example, if I don't have enough team members or resources to improve the reliability of my infrastructure component, it helps me decide: okay, I need this particular level of reliability, and I don't have enough resources as of now, so I'll go for a build decision or a buy decision in such cases. It can also help us with tiered services, like I discussed earlier, where I can categorize my services into critical, normal, and can-be-ignored in certain cases, and so on. This helps us document that not everything is a priority. I can take decisions based on whether a customer is a paid customer or just a pilot customer, whether a customer is an enterprise customer, or whether a bug is happening only in an alpha release versus a release that is generally available, and so on. The most important thing to understand here is that in today's world, time is the biggest constraint all of us have. If we can focus our energies on specific things, based on the objectives we have decided upon as an organizational policy, it helps us make these decisions faster and prioritize the right things, instead of just trying to fix everything. It helps us climb the ladder of reliability. You cannot improve what you can't measure; we already know that. So the way to go about this is to always baseline first, and then go one rung of the ladder at a time, in an adaptive way, as we discussed earlier, instead of going big bang from 90% to five nines, which will lead us to failure. So, recapping Rasmussen's model of how accidents happen: there are essentially three boundaries. A boundary of economic failure, a boundary of workload, and a boundary of expected performance or safety regulations.
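The "baseline first, one rung at a time" approach described above can be sketched as a tiny helper. The ladder values are the conventional nines; the function name is hypothetical:

```python
# Climb the reliability ladder one rung at a time, not big bang to five nines.
NINES_LADDER = (0.90, 0.99, 0.999, 0.9999, 0.99999)

def next_reliability_target(baseline: float) -> float:
    """Return the next rung strictly above the measured baseline (capped)."""
    for rung in NINES_LADDER:
        if rung > baseline:
            return rung
    return NINES_LADDER[-1]
```

So a service measured at 90% availability would aim for 99% next, rather than jumping straight to five nines and guaranteeing a broken promise.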
The point of equilibrium, the gradient, if it stays within this circle, within these three boundaries, then the system is performing at its optimal level. But that is not at all the reality. At some point you will have one axis pushing against the other two axes, and then there is a chance of the gradient moving beyond the boundary of acceptable failure, where accidents will start happening. So we have to push back from the other sides to keep the gradient inside and make sure that accidents don't happen. The boundaries still exist even if you use service level objectives or policies, but there is a tension that keeps them in balance. The tension comes via these service level objectives and policies, where every organizational function is aware of them and works in tandem with the others according to those objectives, instead of working against each other. That results in fun and profit. That's all I have. Thank you.

Prathamesh Sonpatki

Developer Evangelist @ Last9

