Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
              I'm excited today to talk about policies and contracts in distributed
            
            
            
              systems. My name is Prathamesh and I work as a developer evangelist
            
            
            
              at Last9. This is my Twitter handle. You can find me
            
            
            
              posting interesting things about distributed systems, time series
            
            
            
              databases and so on. So if you want to follow,
            
            
            
              go ahead. As a software developer or engineer,
            
            
            
              we want to write code. We want to fix all the bugs. We want
            
            
            
              to ideally write code which is bug-free, right? We don't want
            
            
            
              to be the author who basically writes bugs every
            
            
            
              time. We want to use all the latest and greatest tools. That is the
            
            
            
              North Star. If there is anything new that is coming out
            
            
            
              in the market, I will definitely want to try that,
            
            
            
              be it Copilot or even ChatGPT, for that matter.
            
            
            
              I want to integrate with best-in-class tools in my development
            
            
            
              workflow so that I get the best that is out there. That is
            
            
            
              my aim. But is it always possible? As a DevOps
            
            
            
              engineer, I want to make sure that my infrastructure scales.
            
            
            
              I want to make sure that the application utilizes resources
            
            
            
              efficiently. There are no extra resources that are just hanging
            
            
            
              and costing me money. I want to make sure that my cost is
            
            
            
              always optimized and under my control. It is not
            
            
            
              exploding massively unnecessarily. All of these
            
            
            
              are things that I will always strive for. But as
            
            
            
              software engineers, or even DevOps engineers,
            
            
            
              whether it is possible or not is the question. As a team lead, engineering
            
            
            
              manager, or DevOps lead, I have similar objectives.
            
            
            
              I want to make sure that the deadlines are met. The features work
            
            
            
              as expected, within performance criteria that



              basically cater for a customer experience with
            
            
            
              which they are satisfied. I want to make sure that my code
            
            
            
              and the infrastructure that is running the application is
            
            
            
              performant enough. The tech debt is not going out of control so
            
            
            
              that I don't have to cater for that when
            
            
            
              I really want to ship some features. At the same time, I
            
            
            
              want to make sure that my team is motivated enough because otherwise there
            
            
            
              is no use of everything else if my team is not willing to
            
            
            
              work happily with the rest of the team
            
            
            
              members. All of these are my objectives, but they are also
            
            
            
              not always possible. As a product leader, I want
            
            
            
              to make sure that the team velocity is not slowing down.
            
            
            
              The external customer commitments, service level agreements
            
            
            
              are getting met. The product getting adopted



              is also one of the constraints that I would like to put on my team.
            
            
            
              At the same time, the feedback from customers and their expectations
            
            
            
              are considered by my products and engineering teams and are
            
            
            
              getting incorporated so that my customers are happy. This is
            
            
            
              all I want, but all of these are constraints
            
            
            
              and they often run into each other. For example,
            
            
            
              if you want to ship a feature but the engineering team is



              struggling with tech debt from the last sprint, their deliverables



              in this sprint will be affected. At the same time, if there is
            
            
            
              not enough marketing effort from the product marketing
            
            
            
              side or the sales side, then even if we have the best product
            
            
            
              that is out there, we won't have any customers who
            
            
            
              are willing to use it. With all of these and human
            
            
            
              constraints involved, we always have these North



              Star objectives, but they're not always possible because
            
            
            
              they always run into each other as constraints.
            
            
            
              Rasmussen, a long time ago, developed a model
            
            
            
              of theory of constraints and that describes very
            
            
            
              nicely how accidents happen. Basically there are three
            
            
            
              axes to every software system, or any other system,
            
            
            
              where on one side you have the boundary of economic failure,
            
            
            
              beyond which, if you go, there are chances that your company will shut
            
            
            
              down, right? If we keep increasing the resources in our
            
            
            
              organization, then our cloud costs can go out
            
            
            
              of control and there is a chance that we'll have to shut down the
            
            
            
              company. A very trivial example, but you get an idea about
            
            
            
              how the boundary of economic failure works. At the same
            
            
            
              time, there is a boundary of unacceptable workload.
            
            
            
              I have a few team members on my team,
            
            
            
              but if I continue to ask them to work every
            
            
            
              day, 24x7, then at some point they will say,
            
            
            
              okay, I want to just leave now. I don't want to work here anymore because
            
            
            
              the workload is completely unacceptable. So there is a
            
            
            
              threshold up to which the workload can be acceptable and beyond
            
            
            
              which it is not sustainable. At the same time,
            
            
            
              there is also an experience or performance
            
            
            
              expectation boundary. My application
            
            
            
              should load within a second. If it is a payment transaction,
            
            
            
              then it should always succeed. Or if it does not succeed,
            
            
            
              then at least my money should not be left hanging in between.
            
            
            
              It should get credited to my account again. So there is a performance
            
            
            
              or safety regulations boundary as well. And the point of
            
            
            
              equilibrium should be in the middle of this
            
            
            
              circle, right? If you try to push it from one side, let's say I
            
            
            
              keep increasing the workload from the
            
            
            
              bottom side, then the gradient or my point of equilibrium will
            
            
            
              keep moving to an edge and there is always a boundary of
            
            
            
              failure, beyond which if I try to push harder, then accidents
            
            
            
              can happen. And this is applicable to all three axes. So if I try
            
            
            
              to push any one axis forward,
            
            
            
              there is a chance that an accident can happen if I keep pushing it.



              As software leaders, SREs,
            
            
            
              DevOps and people who manage these software
            
            
            
              systems, we want to make sure that this boundary of
            
            
            
              acceptable behavior is not broken. We are always
            
            
            
              within the constraint that the gradient is
            
            
            
              not completely outside the red boundary of accidents.



              Rasmussen basically developed this for control theory,
            
            
            
              but it is equally applicable to today's modern
            
            
            
              software systems. If we don't follow this boundary and
            
            
            
              we keep pushing harder, then boom,
            
            
            
              we run into accidents. That's where we see missed sprints.
            
            
            
              We see effort that is not aligned with the rest of the organization.
            
            
            
              We see tech debt increasing and effectively causing
            
            
            
              more failures in the long term. Failures will happen



              anyway, but they will essentially happen because each
            
            
            
              and every stakeholder in an organization has a limit.
            
            
            
              They have a boundary beyond which things cannot be pushed.
            
            
            
              It can be either an economic boundary, or a workload
            
            
            
              related boundary, or a performance related boundary. As we saw in
            
            
            
              Rasmussen's model, if one of the stakeholders pushes
            
            
            
              the boundaries too much, then accidents can happen. Business pushing
            
            
            
              for feature rollouts instead of worrying about tech debt
            
            
            
              is one example of this scenario.
            
            
            
              Whereas engineering can also keep chasing perfection,
            
            
            
              they can keep looking for the best solution,
            
            
            
              best product out there instead of shipping what they have,
            
            
            
              which eventually causes problems with the velocity of the
            
            
            
              software as well as the delivery consistency.
            
            
            
              So these failures will happen if we don't mind



              these constraints and boundaries. Boundaries are nothing
            
            
            
              but something which limits or puts
            
            
            
              an extent to a criterion. It fixes a threshold
            
            
            
              for the objective that we are trying to achieve. There can
            
            
            
              be team constraints: I only have five people.



              One of my team members is on leave for a personal reason. That person
            
            
            
              has all the access to my AWS cloud account,
            
            
            
              right? And I'm a startup, so I don't have a lot of people
            
            
            
              to manage all of these things. Quality of work can also be another
            
            
            
              boundary. We cannot have a very unoptimized
            
            
            
              code base which keeps failing all the time.



              Time to delivery, or time to market, is one more



              constraint that everybody is always concerned about. There are
            
            
            
              other constraints and boundaries as well, such as perfection,
            
            
            
              cost, pricing of the software, time to market,
            
            
            
              and so on. These are all examples of boundaries in
            
            
            
              modern software systems. Now how we deal with boundaries
            
            
            
              is via negotiations. People say that we'll be able to release
            
            
            
              this, but there can be a few bugs. Are you okay with this?
            
            
            
              We will be able to ship 80% of the functionality,
            
            
            
              but certain bugs can be present. Like the
            
            
            
              way I am doing this talk. Mark had told
            
            
            
              us that the video should be in by the 28th, and



              I'm making sure that I'm submitting the video recording of the
            
            
            
              talk before 20th April because otherwise I'll break my promise
            
            
            
              to him. We can do certain deployments, but our AWS
            
            
            
              bill will increase for two or three weeks. When we get time



              for optimization, we'll be able to do the optimization, but until then we'll



              have to bear the brunt of the increased bill. Or things like: we'll be able to
            
            
            
              roll out to certain customers, but teams will have to work overnight and on the weekend.
            
            
            
              They have already worked previous weekends as well. So do you want
            
            
            
              to really do it? And so on. These kinds of negotiations we are used
            
            
            
              to doing in our day-to-day jobs. These negotiations



              effectively lead to contracts. A contract is a written or
            
            
            
              spoken enforceable agreement. When we negotiate
            
            
            
              something with our own colleagues or within our own
            
            
            
              organizations, and even with outside customers,
            
            
            
              we arrive at a contract that
            
            
            
              all of us depend on and that all of us follow. The delivery will
            
            
            
              not happen today, but it will happen on Monday at 11:00
            
            
            
              a.m. Once we clarify this contract, it becomes an
            
            
            
              agreement between the two parties and then we follow that agreement
            
            
            
              as and when we go forward. If we think about how
            
            
            
              it relates to a programming concept,
            
            
            
              I would like to correlate it with an API
            
            
            
              interface. When we talk about APIs and their documentation,
            
            
            
              it is nothing but a written agreement about what the



              endpoint promises to return. Let's take an example of
            
            
            
              the users endpoint. The users endpoint



              returns HTTP status 201 in the success



              scenario, it returns Bad Request when the request



              does not have the correct input, and then it can have
            
            
            
              different statuses for an unauthorized request, or even when
            
            
            
              the resource is not found. This is the contract or this
            
            
            
              is the documentation of our users endpoint, which can
            
            
            
              be publicly documented and given to the rest of the team
            
            
            
              members to follow so that they can work according to
            
            
            
              this agreement when they develop the rest of the code that is
            
            
            
              consuming this particular endpoint. Now, when we talk about these programmable
            
            
            
              interfaces, what is the equivalent of that
            
            
            
              in our day-to-day life? Those are the runtime interfaces which
            
            
            
              we deal with every day when we deal with other people.
            
            
            
              It is a written agreement about how the endpoint will behave
            
            
            
              at runtime. It can be something like: the uptime



              of the endpoint, or this particular API, is 90%.



              If you, as the consumer of this particular



              service, expect that the POST users endpoint should always succeed,



              no, that is not possible, because the agreement is
            
            
            
              that it is only available 90% of the time.
            
            
            
              10% of requests are allowed to fail every weekend because
            
            
            
              we don't work on weekends. Right? This can be an enforceable agreement
            
            
            
              which both parties have to follow during
            
            
            
              PCR. The latency can vary between certain limits. This is also an
            
            
            
              example of one of the promises that the API author
            
            
            
              is making to their consumers. So all of these promises
            
            
            
              will effectively define how the consumer
            
            
            
              will expect this particular service or API to behave.
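              As a rough illustration of such a contract, here is a minimal Python sketch. The EndpointContract class, its field names, and the /users path are hypothetical names chosen for this example. Only the 201 and Bad Request statuses and the 90% availability figure come from the talk; 401 and 404 are the standard HTTP codes for unauthorized and not found.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointContract:
    """Static and runtime promises for one endpoint (illustrative only)."""

    method: str
    path: str
    documented_statuses: frozenset  # status codes the documentation promises
    availability_target: float      # fraction of requests promised to succeed

    def allows_status(self, status: int) -> bool:
        """True if the response status is part of the documented contract."""
        return status in self.documented_statuses

    def availability_met(self, succeeded: int, total: int) -> bool:
        """True if the observed success ratio honours the runtime promise."""
        if total == 0:
            return True  # no traffic, so no promise was broken
        return succeeded / total >= self.availability_target


# Hypothetical contract for the users endpoint discussed above.
users_contract = EndpointContract(
    method="POST",
    path="/users",
    documented_statuses=frozenset({201, 400, 401, 404}),
    availability_target=0.90,  # "available 90% of the time"
)

print(users_contract.allows_status(201))
print(users_contract.availability_met(succeeded=93, total=100))
```

              A consumer who reads this contract knows that an undocumented status, or a success ratio below 90%, is a broken promise, while anything inside those limits is expected behavior.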
            
            
            
              And there are no chances of confusion or there
            
            
            
              are no chances of accidents because both parties are
            
            
            
              aware of what those constraints are while
            
            
            
              deciding about the consumption of this particular API
            
            
            
              endpoint. When we talk about these runtime interfaces in
            
            
            
              the real world, they are effectively the objectives
            
            
            
              that we were talking about so far. And there is a beautiful concept in
            
            
            
              the site reliability engineering or observability world for this, which is
            
            
            
              service level objectives. Using service level objectives,
            
            
            
              we basically define the criteria for
            
            
            
              a particular indicator, health indicator of a service
            
            
            
              or an API or a function over a period of time.
            
            
            
              An example of this can be that availability of
            
            
            
              my service will be greater than 99.99% over a period
            
            
            
              of one day. An example of the service level objective
            
            
            
              in case of this talk is that I promised Mark that I will
            
            
            
              submit this talk. He selected my talk. I promised him that okay, I will upload
            
            
            
              this talk today. And then he promised me that
            
            
            
              once I upload it, he will publish it on 4 May.
            
            
            
              That is a layman's example of how service level objectives
            
            
            
              can be defined. Similarly, we can define objectives for
            
            
            
              key indicators such as latency and uptime over a
            
            
            
              period of time, and these give enough visibility
            
            
            
              to all team members, all functions, all stakeholders about
            
            
            
              how a particular system, a service, or important
            
            
            
              infrastructure component is going to behave so that they can build
            
            
            
              redundancy, they can build parameters to consume this
            
            
            
              information or this service in a way that is consistent across
            
            
            
              the organization. There are other examples
            
            
            
              of service level objectives as well. What is my error rate on this particular
            
            
            
              payment checkout flow? Can we promise 99.9%
            
            
            
              availability to this enterprise customer? And if we can't,
            
            
            
              then what are the areas that we want to improve upon?
            
            
            
              Is it the tech debt that is stopping us, or is it some
            
            
            
              hardware that we need to invest to get to this particular availability?
            
            
            
              Because we have to remember that not every nine is free.
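              To put rough numbers on that, the sketch below computes how much downtime each additional nine leaves you. The function name and the 30-day window are assumptions made for this example, not something from the talk.

```python
def allowed_downtime_minutes(nines: int, window_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target of N nines.

    Two nines means 99%, three nines means 99.9%, and so on, so the
    permitted unavailability is 1 / 10**nines of the window.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes / 10 ** nines


for nines in (2, 3, 4, 5):
    minutes = allowed_downtime_minutes(nines)
    print(f"{nines} nines: {minutes:.2f} minutes of downtime per 30 days")
```

              Each extra nine divides the allowed downtime by ten: about 43 minutes a month at three nines, about 4 minutes at four nines, and under half a minute at five nines.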
            
            
            
              As we go from three nines to four nines,



              to five nines, we'll have to invest more in terms of time, money,
            
            
            
              resources, people. Objectifying it and making
            
            
            
              it a contract helps us identify where we



              have to spend more, or whether we even have to spend more to



              reach that particular level. It also helps answer questions
            
            
            
              such as should we prioritize tech debt over new features?
            
            
            
              Because if we know that the availability of
            
            
            
              a service itself is 80%, then shipping new
            
            
            
              features may not even be productive for our team
            
            
            
              members because these new features will also run into the same challenges.
            
            
            
              Instead, we can first prioritize improving
            
            
            
              the reliability from 80% to a level



              that is acceptable across the organization, and then work on new
            
            
            
              features. Making those decisions then becomes extremely
            
            
            
              easy because everybody is aligned on the same objective and
            
            
            
              the same goal. These runtime promises
            
            
            
              are nothing but service level objectives. As we saw,
            
            
            
              these runtime promises can be codified as documents,
            
            
            
              can be run as service level objectives
            
            
            
              if you're using an observability tool, or they can always
            
            
            
              be recorded as decisions in your decision
            
            
            
              log, where everybody can see them over time, and
            
            
            
              they are essentially runtime. They are not static because
            
            
            
              if you start with a particular objective, you can always
            
            
            
              increase or decrease it, adapting to the next
            
            
            
              nine based on the performance that you are seeing right now.
            
            
            
              So instead of forcing these promises top down,
            
            
            
              where the engineering leaders can say that okay, we want to start
            
            
            
              with three nines, four nines. Instead of that, the teams can start
            
            
            
              with what they have right now and use adaptive service
            
            
            
              level objectives to improve their reliability
            
            
            
              goals over time based on their current benchmark
            
            
            
              or the current baseline of service level objectives.
            
            
            
              These promises can effectively be codified then into
            
            
            
              policies where my P zero service or P
            
            
            
              zero API will only have 99.99%
            
            
            
              of availability versus my P three will have
            
            
            
              90% of availability, and this can be enforced across the organization.
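              Such a policy could be written down as data and checked automatically. The following is a minimal Python sketch under that assumption. The names TIER_AVAILABILITY_TARGETS and find_violations, and the sample services, are hypothetical; only the P0 and P3 targets come from the talk.

```python
# Availability targets per service tier, in percent.
# Only P0 and P3 are mentioned in the talk; other tiers are left out
# rather than guessed.
TIER_AVAILABILITY_TARGETS = {
    "P0": 99.99,
    "P3": 90.0,
}


def meets_policy(tier: str, measured_percent: float) -> bool:
    """True if the measured availability satisfies the tier's target."""
    return measured_percent >= TIER_AVAILABILITY_TARGETS[tier]


def find_violations(services):
    """Return the names of services whose availability is below policy.

    `services` is an iterable of (name, tier, measured_percent) tuples.
    """
    return [
        name
        for name, tier, measured in services
        if not meets_policy(tier, measured)
    ]


# Hypothetical measurements for two services.
measurements = [
    ("payments", "P0", 99.95),  # below the 99.99 target for P0
    ("reports", "P3", 93.0),    # above the 90 target for P3
]
print(find_violations(measurements))
```

              The same 99.95% would be a violation for a P0 service and comfortably within policy for a P3 one, which is the point of tiering.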
            
            
            
              These help in setting the right expectations on what's possible.
            
            
            
              It also helps communicate these contracts to



              multiple stakeholders at the same time, and effectively



              this becomes a framework of communication between customers,
            
            
            
              between internal stakeholders, between other team
            
            
            
              members, and so on. It also helps in making decisions such
            
            
            
              as build versus buy. For example, if I don't have enough



              team members or resources to improve the reliability of
            
            
            
              my infrastructure component, it will help me to take a decision that
            
            
            
              okay, I need this particular level of reliability.
            
            
            
              I don't have enough resources as of now, so I'll go for a build decision



              or a buy decision in such cases. It can also
            
            
            
              help us in tiered services like I discussed earlier, where I
            
            
            
              can categorize my services into critical, normal, and
            
            
            
              can be ignored in certain cases, and so on. Because this
            
            
            
              can help us in documenting that not everything is a priority.
            
            
            
              I can decide and take decisions based on whether a customer
            
            
            
              is a paid customer, or just a pilot customer,
            
            
            
              or whether a customer is an enterprise customer, or whether
            
            
            
              a bug is happening only in an alpha release versus
            
            
            
              a release that is generally available, and so on.
            
            
            
              Because the most important thing to understand
            
            
            
              here is that in today's world, time is the biggest
            
            
            
              constraint that all of us have. If we can focus our energies
            
            
            
              on specific things based on the objectives that we have
            
            
            
              decided upon as an organizational policy, it just



              helps us make these decisions faster,



              and prioritize the right things instead of just going
            
            
            
              to fix everything.
            
            
            
              It helps us climb the ladder of reliability.
            
            
            
              You cannot improve what you can't measure. We already know that.
            
            
            
              So the way to go about this is always to baseline first and
            
            
            
              then go one rung at a time in an adaptive way, as we discussed



              earlier, instead of going big bang from 90%



              to five nines. That will lead us to a failure.
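              The baseline-then-climb idea can be sketched in a few lines of Python. The rungs in RELIABILITY_LADDER and the function name next_target are assumptions made for illustration; the talk only says to move one step at a time from the current baseline.

```python
# Assumed rungs of the reliability ladder, in percent availability.
RELIABILITY_LADDER = [90.0, 99.0, 99.9, 99.99, 99.999]


def next_target(baseline_percent: float) -> float:
    """Pick the next rung strictly above the measured baseline.

    The objective moves up one step from where the service is today,
    rather than jumping straight to the top of the ladder.
    """
    for rung in RELIABILITY_LADDER:
        if rung > baseline_percent:
            return rung
    # Already at or above the top rung: hold the highest target.
    return RELIABILITY_LADDER[-1]


# A service measured at 80% aims for 90% first, not five nines.
print(next_target(80.0))
print(next_target(99.95))
```

              Once the service has held the new target for a while, the measured value becomes the new baseline and the same function yields the next rung.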
            
            
            
              So, recapping Rasmussen's model of how accidents happen,
            
            
            
              there are essentially three boundaries. A boundary of economic failure,
            
            
            
              a boundary of workload, and a boundary of expected
            
            
            
              performance or safety regulations. The point of equilibrium or
            
            
            
              the gradient, if it is within this circle, within these three boundaries,
            
            
            
              then the system is performing at its optimal level. But that
            
            
            
              is not at all the reality. At one point of time
            
            
            
              you will have one axis pushing against the other two axes,



              and then there is a chance of the gradient
            
            
            
              moving beyond the boundary of acceptable failure where accidents
            
            
            
              will start happening. So we again have to push back from the other



              sides to keep the gradient inside and make sure that the accidents
            
            
            
              don't happen. The boundaries still exist
            
            
            
              even if you use service level objectives or policies.
            
            
            
              But there is a tension that keeps them in balance. The tension is
            
            
            
              via these service level objectives and policies where every
            
            
            
              organizational function is aware of that and works in tandem
            
            
            
              with each other according to those objectives, instead of working against
            
            
            
              each other. And that results in fun and profit.
            
            
            
              That's all I have. Thank you.