Conf42 Site Reliability Engineering (SRE) 2024 - Online

- premiere 5PM GMT

Mastering the Maze: Practical Strategies for Navigating Complexity in Distributed Systems

Abstract

An overview of strategies for tackling complexity while building and evolving architecture in distributed systems by Aleksei Popov from Stenn International Ltd. We will examine the core principles and practical aspects of establishing reliable and scalable systems, based on practices driven by system engineering, cybernetics, and SRE.

Summary

  • Today I'm going to talk about practical strategies for navigating complexity in distributed systems. What is a distributed system? What are the main complexities of building such systems? Core principles of system engineering and cybernetics, and the three practices for reducing complexity overhead.
  • The main disadvantage of monolithic architecture is the inability to scale modules independently. Moving further, we can take a look at the opposite: a microservice architecture, where the system is a collection of loosely coupled services. What do distributed systems give us?
  • In order to evolve, a distributed system should be reliable, scalable and maintainable. There are many reasons why networks are not reliable. Good and stable abstractions really help to reduce complexity overhead. This topic is also very broad, and we'll take a look at several very useful aspects.
  • Concurrency is one of the most intricate challenges in distributed systems. If no defensive mechanism is applied, race conditions are very likely to happen. To avoid this problem, several techniques can be used.
  • The dual-write problem is a challenge that arises in distributed systems. Since the database write and the message publish are not atomic, there is a chance of failure during the publishing of new messages. One possible solution is to implement a transactional outbox. Another is to utilize the database transaction log and custom connectors.
  • Unreliable clocks are one of the trickiest problems in distributed systems. Time accuracy heavily depends on the machine. In highly available systems, a new leader has to be assigned. To solve this problem, we need to use some kind of distributed consensus algorithm.
  • When the leader has crashed, a new leader needs to be selected, or elected. Distributed consensus is an algorithm for getting nodes to agree on something. Without it, it's impossible to solve the problem of highly available software. There are many conditions and trade-offs that must be taken into account.
  • Distributed tracing is an approach for tracking a distributed operation across multiple microservices, nodes and components. The key idea is to attach a unique operation id to the initial request, propagate it down the call path, and finally collect and aggregate everything carrying this id to build an analytical visualization of the whole operation.
  • Another approach that may help to improve the system's observability is orchestration rather than choreography. Cybernetics applied to software engineering emphasizes creating systems that can adjust to change, learn from interactions, and improve over time.
  • Domain-driven design focuses on building a domain model with a deep, comprehensive understanding of domain processes and rules. Systems are organized in hierarchies of subsystems, and software systems often have a hierarchical structure. This hierarchical decomposition helps to manage complexity by breaking the system down into more manageable parts.
  • Talking about SRE: SRE is site reliability engineering. The core principles of SRE are, first of all, to embrace risk. Another is to automate manual work. And the last is to simplify as much as you can.
  • Finally, I'd like to talk about simplicity and measuring complexity. A complex system that works is invariably found to have evolved from a simple system that worked. Another important thing to track is time to train.
  • In conclusion, I just want to say that I covered several techniques related to distributed systems. We want to achieve highly available and scalable software, but it brings a lot of complexity. We need to understand the drawbacks and how to apply best practices to solve these problems.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
My name is Alex. Today I'm going to talk about practical strategies for navigating complexity in distributed systems. Here's a quick agenda of what I'm planning to cover today. First of all we'll take a look at what a distributed system is, then at the main complexities of building such systems, the core principles of systems engineering and cybernetics, and the three practices for reducing complexity overhead.

First of all, let me define what a distributed system is. A distributed system is simply a network of computers which communicate with each other to achieve a common goal. Each computer, or node, has its own local memory and CPU and runs its own processes. All of them communicate with each other over the network to coordinate their actions. The main difference between distributed and centralized systems is that in a centralized computing system, all computing is performed by a single computer in one location. Also, nodes of a centralized system all access the central node, which can lead to network congestion and slowness. A centralized system has, or is represented as, a single point of failure, while a distributed system by design has no single point of failure.

But before we start looking at the real complexities of building distributed systems, I want to talk a bit about what complexity is. Complexity can be defined from different points of view and aspects, and in terms of software engineering there are two main definitions which are important for us. In systems theory, complexity describes how various independent parts of the system interact and communicate with each other. From the software and technology perspective, complexity refers to the details of the software architecture, such as the number of components, how their interactions are defined, how many dependencies they have, and how they interact within the whole system.

I think monolithic architecture is a great example of a centralized system. It's represented as a single deployable and executable component. Such a component may contain the user interface and many different modules in one place, and over time more and more modules. Although this has been the traditional architecture for building software, and still is today, it has several important drawbacks we need to talk about. First of all, the main disadvantage of monolithic architecture is the inability to scale modules independently. Another problem is that over time it's much harder to control the growing complexity of this big system. The lack of independent module deployment also brings a lot of complexity, because you need to redeploy everything; it's a single executable component. Maintaining a single huge database without well-established rules brings additional challenges for developers, and such rules are rarely well established, which makes it really complicated. And another quite obvious problem is coupling to technologies and vendors: you cannot mix several languages in one monolithic application, because it has a single runtime. If we move further, we can take a look at the opposite: microservice architecture, an architectural style and one of the variants of service-oriented architecture.
Here we structure the system as a collection of loosely coupled services. For instance, in the same example, companies, accounts, customers and the user interface are represented as separate processes deployed on several nodes, and each of these services has its own database. From time to time a shared database is used, but that is generally a bad practice, or an anti-pattern, for microservice architecture. So in this case I demonstrated how a separate database per service is used.

But what do distributed systems give us? The most obvious thing is that we can get horizontal scalability out of such a system. You can scale the database horizontally, you can scale your services horizontally, and technically any infrastructure component can be scaled horizontally, just by cloning it, putting some kind of load balancer in front of all of that, and living without a lot of pain. Another important thing is high availability and fault tolerance, because whenever you have several replicas you can organize failover techniques that help you avoid downtime in case of crashes, memory leaks or service outages. Another important aspect is geographic distribution. We all have customers all over the world, in the USA, in Europe, in Asia, and we want to bring the best experience to our customers. That's why we need to distribute these services across the whole world anyway and organize more sophisticated techniques for data replication and for connecting all of that together. Another quite obvious benefit is freedom of technology choice: you can deliver one service with one tool and another service with another tool, you can experiment more and find the best solution for the problem at hand, and scale it based on specific techniques and solutions. Another thing which is important to note is easier architectural control. It really helps to control the system, because when there is only a single component, rules get violated from time to time and everything gets broken; you need to dig into this component, find some balance and restore order. When services are somewhat isolated from each other, as several small services, they are much less likely to be turned into a mess by changes from a big group of developers working on the same component. Still, any distributed system comes with its own challenges, which we'll take a look at right now.

Before we look into the challenges, I just want to talk a little bit about the quality attributes we'd like to have in our system. There are three main quality attributes which any system has at some level or another. First of all, reliability. Reliability is the ability to continue working correctly even when things go wrong, meaning to be fault tolerant or resilient. Even when a system is working reliably today, that doesn't mean it will necessarily work reliably in the future. One common reason for degradation is increased load: perhaps the system has grown from 10,000 concurrent requests or users to 100,000 concurrent users, or from 1 million to 10 million. Scalability is the term we use to describe a system's ability to handle that increased load. And the final piece, maintainability, is about making life better
for the engineering and operations teams who need to work with the system. Good and stable abstractions really help to reduce complexity overhead and make the system much easier to modify and adapt for new features.

Here is an example of a system with various components such as API gateways, in-memory caches, message queues, different databases, different microservices, full-text search and so on. And all of that, in order to evolve, should be reliable, scalable and maintainable. So far it sounds easy, but there is a famous and, to be honest, very realistic law, the best friend of distributed systems. In some formulations it's extended to "anything that can go wrong will go wrong, and at the worst possible time". It's not because we're unlucky or the infrastructure is bad; reality is simply like that.

Now let's take a look at the real complexities of distributed systems. What are the main troubles? First of all, networks are unreliable. We'll look at this in more depth a little later, but from a high-level perspective you cannot completely rely on networks, because so many things are out of your control; you cannot say for sure that it will work in 100% of cases. Another important issue is unreliable clocks and problems with time synchronization. Another thing which is also quite painful for us is process pauses, and we'll take a deeper look at those; I think many of you who have worked with garbage-collected runtimes have thought about this problem as well. Another quite obvious one is eventual consistency. Whenever we scale our database horizontally we have a leader and followers, and whenever data is replicated from the leader to the followers there is some lag, which may affect parts of the system, its logic, the user experience and so on. It's not a problem from one perspective, because it gives you a lot of availability benefits, but it can be a problem from a different perspective, where you need to achieve really strong consistency. Another important one is observability. Whenever we build a system at this scale, with a lot of services, databases and queues, we need to understand how all of it is working, have insight into what is happening right now, how performance is tracked, and whether we have any bottlenecks or errors. This is a big and broad topic, and we will look at some aspects which can be helpful in building such systems. And the last one is evolvability. It's an extension of maintainability, but it describes how to evolve the system, or distributed systems, without a lot of problems and pain, how to make it scalable for people, for engineers, for the business, and how not to waste resources or get stuck because something was designed badly and cannot be scaled due to some issue in the system. This topic is also very broad, and we'll take a look at several useful aspects.

First of all, as I said, let's look at unreliable networks. There are many reasons why networks are not reliable; here are a few of them.
First of all, your request may have been lost, and you can do nothing about that, because it's a problem of the network. Second, your request may be waiting in a queue and will just be delivered later; you also cannot completely rely on it being processed immediately. The remote node may have failed: perhaps it crashed or powered down, or some kind of memory leak happened. The remote node may have temporarily stopped responding, which is similar to the queueing case I mentioned, except that it happens on the remote node. The remote node may have processed your request, but the response has been lost on the network, and you simply don't know whether it was processed or not, or what you need to do next. And finally, the remote node may even have processed your request and only the response has been delayed, so you still have to wait for a while to get the response and then act on it on your side.

The obvious way to handle a lost request is to apply timeout logic on the caller side. For example, if the caller doesn't receive a response after some timeout, we can just throw an exception and show an error to the user. In many cases that's probably acceptable, mainly for back-office scenarios, but for customers and highly available systems, where we don't want a poor user experience, we want to do something better. Since a network issue is very likely temporary, the obvious solution is the retry pattern: if the response indicates that something went wrong, or the timeout fired, we treat it as a temporary problem and simply retry the request.

But what if the request was actually processed by the server and only the response was lost? In this scenario, to be honest, retries may lead to severe consequences, like duplicate orders, payments or transactions, and that experience will not be very nice for the users. The technique to avoid this problem is called idempotency: a concept where doing the same thing multiple times has the same effect as doing it just once. To achieve exactly-once semantics, we can leverage a solution that attaches a special idempotency key to the request. When retrying the same request with an identical idempotency key, the server will verify that a request with such a key has already been processed and will simply return the previous response. This way, any number of retries with the same key won't affect system behavior negatively. This technique helps in most cases, and it is one of the main techniques you have to bear in mind while building distributed systems; a minimal sketch of the idea follows below.
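To make the timeout, retry and idempotency-key discussion a bit more concrete, here is a minimal sketch of my own, not something shown in the talk. The endpoint, header name, helper names and the in-memory store are illustrative assumptions; a real implementation would persist processed keys in a durable store and tune the timeouts and backoff.

```python
import time
import uuid
import requests  # assumed available; any HTTP client with timeouts works

def create_payment_with_retries(url, payload, max_attempts=3, timeout_s=2.0):
    """Call a (hypothetical) payment endpoint with a timeout, retries and an idempotency key."""
    idempotency_key = str(uuid.uuid4())  # the same key is reused for every retry
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.post(
                url,
                json=payload,
                headers={"Idempotency-Key": idempotency_key},
                timeout=timeout_s,
            )
            if response.status_code < 500:
                return response          # success, or a non-retryable client error
        except requests.RequestException:
            pass                         # lost request, lost response, or timeout
        time.sleep(0.1 * 2 ** attempt)   # simple exponential backoff before retrying
    raise RuntimeError("payment request failed after all retries")

# Server side of the same idea: before executing the operation, look up the
# idempotency key; if it was already processed, return the stored response
# instead of performing the payment a second time.
processed = {}  # in-memory stand-in for a durable table of processed keys

def handle_payment(idempotency_key, payload):
    if idempotency_key in processed:
        return processed[idempotency_key]
    result = {"status": "created", "amount": payload["amount"]}  # do the real work here
    processed[idempotency_key] = result
    return result
```

The key point is that the client reuses the same key across all retries, so the server can deduplicate no matter how many times the request arrives.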
Another pattern which might be quite useful in preventing the overload and complete crash of a server during a failover, an outage or a temporary issue is the circuit breaker. A circuit breaker acts as a proxy in front of a system which is under maintenance, likely to fail, or heavily failing right now. There are many reasons why things can go wrong: a memory leak, a bug in the code, or a faulty external dependency. In such scenarios it's better to fail fast rather than risk cascading failures across the system. Technically, the circuit breaker is an intermediate component which, based on specific conditions, switches calls to the downstream system off and back on when they are allowed again.

Concurrency is one of the most intricate challenges in distributed systems. Concurrency means that multiple computations are happening at the same time. So what happens if we try to update an account balance from different operations at the same time? If no defensive mechanism is applied, race conditions will very likely happen, which leads to lost writes and data inconsistencies. In this example, two operations are trying to update the account balance, and since they are running concurrently, the last one wins, which leads to serious issues. To avoid this problem, several techniques can be used.

Before diving into the solutions, let's take a look at what ACID means. The ACID acronym stands for atomicity, consistency, isolation and durability. All of the popular SQL databases implement these properties, but what do they mean? Atomicity specifies that an operation will either be completely executed or entirely fail, no matter at what stage the failure happens; it lets us be sure that another thread cannot see a half-finished result of the operation. In very simple terms, consistency means that all defined invariants will be satisfied before a transaction successfully commits and changes the state of the database, of a specific table, or of a particular part of the data store. Isolation in the sense of ACID means that concurrently executing transactions are isolated from each other. The serializable isolation level is the strictest one, processing all transactions sequentially. But another level, snapshot isolation, available in popular databases like SQL Server, MySQL, PostgreSQL and, I believe, many others, is the one most widely used, and we will talk about it very soon. Durability promises that once a transaction is committed, the data is stored safely; that is a very important point for us, to be sure we store everything without any losses.

Returning to the concurrency problem, the most practical solution is to utilize the snapshot isolation level, which most popular ACID-compliant databases provide. The key idea of this level is that the database tracks record versions and fails to commit transactions that touch records which were already modified outside of the current transaction. All read operations in a transaction see a consistent version of the data as it was at the start of the transaction, and changes made by a transaction are not visible to other transactions until the initial transaction commits. That helps a lot; it simplifies many things during development and guarantees that, if you manage it properly, you won't lose anything and your data will stay consistent.

Another important thing to say is that most NoSQL databases do not fully provide ACID properties, choosing BASE instead, where BASE means basically available, soft state, eventually consistent. In such cases a compare-and-set strategy may be utilized. The purpose of compare-and-set is to avoid lost updates by allowing an update to happen only if the value has not changed since you last read it: if the current value does not match what you previously read, the update has no effect and the read-modify-write cycle must be retried. A small sketch of this idea is shown below.
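Here is a minimal, illustrative sketch of the compare-and-set idea, my own example rather than something from the talk; the table and column names are assumptions. The conditional UPDATE plays the same role as Cassandra's IF clauses, and under snapshot isolation the database performs a comparable version check for you.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100, 0)")
conn.commit()

def add_to_balance(conn, account_id, delta, max_attempts=5):
    """Read-modify-write protected by a compare-and-set style conditional UPDATE."""
    for _ in range(max_attempts):
        balance, version = conn.execute(
            "SELECT balance, version FROM account WHERE id = ?", (account_id,)
        ).fetchone()
        cursor = conn.execute(
            # The update only takes effect if nobody changed the row since we read it.
            "UPDATE account SET balance = ?, version = ? WHERE id = ? AND version = ?",
            (balance + delta, version + 1, account_id, version),
        )
        conn.commit()
        if cursor.rowcount == 1:      # our expected version still matched: success
            return balance + delta
        # somebody else won the race: retry the whole read-modify-write cycle
    raise RuntimeError("could not update balance after several attempts")

print(add_to_balance(conn, 1, -30))   # 70, with no lost update even under contention
```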
For instance, Cassandra provides a lightweight transactions mechanism which allows you to use IF EXISTS / IF NOT EXISTS style conditions to prevent concurrency issues. Please bear this strategy in mind: it will be very helpful if you're using a NoSQL database and need to know how to avoid this concurrency problem, which may lead to really problematic situations in your software and for the business as well.

Another technique that might be useful is called a lease. Let's imagine we have some kind of resource we want to update exclusively. The lease pattern means that we should first obtain a lease with an expiration period for this specific resource, then update the resource, and finally return the lease. In case of failures or anything else, the lease will automatically expire, allowing another thread to access the resource. This technique is very useful and will help you in most cases, but there is a risk of process pauses and clock desynchronization, which may lead to issues with parallel resource access; we'll take a look at that a little later.

Now, the dual-write problem. If we try to send a message during a transaction, we may find ourselves in a bad situation when, for some reason, the transaction fails to commit. Technically, the dual-write problem is a challenge that arises in distributed systems, mainly when dealing with multiple data sources or systems we need to keep in sync. In other words, let's imagine we need to store new data in the database and then send some messages to Kafka. Since these two operations are not atomic, there is a chance of failure during the publishing of the new messages. Based on this example, if the transaction fails to commit while an external system like Kafka never got the messages, another system may be left out of sync. Conversely, if we try to send the messages during the transaction, directly to Kafka, and only then commit the transaction, we may find ourselves in the worst situation, where external systems are already notified but we didn't commit the transaction on our side. Depending on the business and the criticality of the operation, we may end up in really interesting situations, very likely leading to some lovely discussions with legal while trying to find a solution. Technically this is a really bad practice, and you have to avoid it at all times if you don't want to be involved in endless war rooms solving such problems.

One possible solution to this problem is to implement a transactional outbox. This involves storing events in an outbox table within the same transaction as the operation itself. Because of atomicity, nothing will be stored in case of a transaction failure. One more component needed here is a relay, which polls the outbox table at regular intervals and sends the messages to their destinations. Such an approach allows us to achieve an at-least-once delivery guarantee. However, that's not a big problem, since all the consumers must be idempotent anyway due to possible network failures and general network unreliability. A minimal sketch of the outbox idea is shown below.
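A minimal sketch of the transactional outbox pattern just described, again with assumed table names and an in-process relay function standing in for a real poller; the talk does not prescribe a concrete implementation.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         topic TEXT, payload TEXT, published INTEGER DEFAULT 0);
""")

def create_order(conn, order_id, payload):
    """Write the business data and the outgoing event in one atomic transaction."""
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, json.dumps(payload)))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders.created", json.dumps(payload)),
        )

def relay_once(conn, publish):
    """One polling pass of the relay: publish pending events, then mark them as sent."""
    rows = conn.execute("SELECT id, topic, payload FROM outbox WHERE published = 0").fetchall()
    for event_id, topic, payload in rows:
        publish(topic, payload)  # e.g. a Kafka producer; may deliver more than once
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (event_id,))

create_order(conn, 1, {"amount": 42})
relay_once(conn, publish=lambda topic, payload: print("published", topic, payload))
```

Because the event row is written in the same transaction as the order, either both exist or neither does; the relay then delivers the event at least once.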
An alternative solution, instead of building a custom transactional outbox, is to utilize the database transaction log and custom connectors that read changes directly from this log and send them to the destinations. This approach has its own advantages and disadvantages: for instance, it couples you to a specific database solution, but it allows you to write less code in the application.

Another problem we have to look at is unreliable clocks. Time tracking is one of the most important parts of any software or infrastructure, since you always need to track durations, enforce timeouts and expirations, or gather metrics to understand how your system operates. Unreliable clocks are one of the trickiest problems in distributed systems, since time accuracy heavily depends on the machine: each machine has its own clock, which could be faster or slower than the others. There are two types of clocks used by computers: time-of-day clocks and monotonic clocks. A time-of-day clock returns the date and time according to some calendar, which can lead to clock desynchronization if the NTP server is out of sync with our machine. Monotonic clocks always move forward, which helps in many cases for calculating durations; they are really useful when we need to calculate how much time an operation took on our machine. But this monotonically increasing value is unique per computer, so it cannot be used to compare dates and times across multiple servers. Technically, there are not many easy solutions for achieving very accurate clock synchronization. In most cases you don't need it, to be honest, but in situations where it's required by specific regulations, or where we are building a very accurate system, for instance for tracking transactions or showing charts with real-time data, the Precision Time Protocol can be leveraged, which admittedly requires a huge investment. The sketch below illustrates the difference between the two clock types.
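As a small illustration of the two clock types (my own example, not from the talk): the wall clock can jump when NTP adjusts it, so durations are best measured with the monotonic clock, while cross-machine timestamps have to come from the time-of-day clock and are only as good as its synchronization.

```python
import time

start_wall = time.time()        # time-of-day clock: calendar time, can jump backwards or forwards
start_mono = time.monotonic()   # monotonic clock: only moves forward, meaningless across machines

def do_some_work():
    return sum(i * i for i in range(1_000_000))

do_some_work()

# Durations: prefer the monotonic clock, it is unaffected by NTP adjustments.
print("wall-clock duration:", time.time() - start_wall)
print("monotonic duration: ", time.monotonic() - start_mono)

# Timestamps meant to be compared across machines come from the time-of-day clock,
# and are only as accurate as the clock synchronization (NTP, or PTP when required).
print("event timestamp:", time.time())
```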
Now let's take a look at the CAP theorem. The CAP theorem states that any distributed data store can satisfy only two of three guarantees. However, since network unreliability is not something you or anyone else can significantly influence, in the case of a network partition we effectively have to choose between availability and consistency. In this simple diagram, two clients read from different nodes, one from the primary node and another from a follower, and replication is configured to update the followers after the leader has changed. But what happens if, for some reason, the leader stops responding? It could be a crash, a network partition or any other issue. In highly available systems, a new leader has to be assigned. But how do we choose between the existing followers? To solve this problem we need some kind of distributed consensus algorithm. Before we look at it, let's review the major consistency types. There are two main classes of consistency used to describe guarantees: weak, or eventual, consistency, and strong consistency.

Eventual consistency means that the data will be synchronized on all the followers after some time, if you stop making changes on the leader. As for strong consistency, I think it's best explained this way: in a linearizable system, as soon as one client successfully completes a write, all clients reading from the database must be able to see the value just written. This is a quote from Martin Kleppmann's book "Designing Data-Intensive Applications", and it describes perfectly what strong consistency is; it's a synonym for linearizability.

So how can we decide, and how can we solve the problem of selecting a leader from the existing followers? Returning to the problem I mentioned, when the leader has crashed, we need to select, or elect, a new leader. At first glance this probably looks easy, but in reality there are many conditions and trade-offs that have to be taken into account when selecting the appropriate approach. In the Raft protocol, if followers do not receive data or a heartbeat from the leader for a specified period of time, a new leader election process begins. Distributed consensus is an algorithm for getting nodes to agree on something, in our case, to agree on a new leader. When this leader election process starts, Raft specifies a lot of instructions and techniques for achieving consensus, including in the presence of network partitions. Technically, each replication unit, which can be a single write node or multiple shards, depending on the infrastructure, is associated with a set of Raft logs and OS processes that maintain the logs and replicate changes from the leader to the followers. The Raft protocol guarantees that followers receive log records in the same order they are generated by the leader. A user transaction is committed on the leader as soon as a majority of the replicas acknowledge receipt of the commit record and write it to the Raft log. So Raft specifies how we can really implement distributed consensus and solve the problem of selecting a new leader. There are many other algorithms, like Paxos and others; Raft here is just an example. But a distributed consensus algorithm is very, very important, and without it, it's impossible to solve the problem of highly available software.

And it brings many more complexities. Let's imagine we have several data centers and need to organize some kind of replication: one is located in the USA, another in Asia, and we don't want to ask users to make write requests from Asia to the USA. We'd like to have several data centers which can accept writes. To achieve that, we need some kind of multi-leader replication and a conflict-resolution strategy which will help us keep the data consistent. Ordering is also quite problematic: it requires generating monotonically increasing numbers which are assigned to every message in this replication mechanism; one classic way to do this is sketched below.
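The talk doesn't name a specific technique for those monotonically increasing numbers; one well-known option is a Lamport-style logical clock, sketched here purely as an illustration of my own.

```python
class LamportClock:
    """Logical clock producing monotonically increasing numbers for ordering messages."""

    def __init__(self):
        self.counter = 0

    def tick(self):
        """Called for every local event or outgoing message; returns the number to attach."""
        self.counter += 1
        return self.counter

    def observe(self, received_counter):
        """Called when a message arrives; merges the sender's counter with our own."""
        self.counter = max(self.counter, received_counter) + 1
        return self.counter


# Two replicas exchanging a message: the receiver's later events sort after the send.
a, b = LamportClock(), LamportClock()
send_stamp = a.tick()            # replica A stamps an outgoing change with 1
print(b.observe(send_stamp))     # replica B jumps to 2, so its subsequent changes order after A's
```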
Now let's look at eventual consistency. This is a classic example which demonstrates the situation when a user does not see a transaction they have just created. First we insert some data on the leader node, and then, of course, it's replicated to the follower. But if we make a second request to the follower before replication has caught up, we may find that the transaction seems not to have been created at all, and be really confused. In this case it's probably not very critical, because eventually it will be synchronized. But in more complicated scenarios, where we need to apply some logic, for instance, we cannot finance a company if it was added to some sanctions list, we cannot rely on eventual consistency, because it may lead to problems for the business. One possible solution here, and it's quite an effective and simple strategy, is to read from the leader for the user who has just saved new data, avoiding the replication lag. That helps in most cases; for more complicated scenarios it's better to look at the specific case and define the best strategy to solve it.

Another thing is process pauses; let's take a look at them. This is quite a dangerous problem: a garbage-collection stop-the-world pause may happen, or a virtual machine suspension, or a context switch takes more time than we expected, or some delay in the network leads to unexpected pauses which affect the logic. If such a thing happens while holding the lease we discussed a couple of slides ago, there's a chance of accessing the same resource twice, which may lead to severe consequences. To overcome this, there's a safety technique called fencing that can be leveraged. Technically, before updating the resource we secure the lease as we did previously, but beyond just acquiring the lease we also get some kind of token, and this token is used on the data store to prevent updating the resource if a newer token has already been processed or the lease has expired. It means we need to generate a monotonically increasing token, and whenever we update the resource we check that the current token is the latest one from the sequence; if a newer token has already been seen, or this token has expired, we need to throw an exception and reject the write on the storage side.

Let's look at the problem this solves. One client secures the lease, and then for some reason a garbage-collection stop-the-world pause happens and the client is delayed until the lease expires. Another client acquires the same lease under the same key, starts doing its operation, and maybe even finishes it. But when client one wakes up, it will also try to update the storage, because it believes it still holds the lease, and in this case we may get failures and corrupted data. The fencing technique, having a monotonically increasing token and checking on the storage side that no newer token has already been processed, solves exactly this. It may sound like a theoretical problem, but it does happen, and it's better to bear it in mind for really critical operations such as transaction processing. A small sketch of the fencing check follows below.
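A minimal sketch of a fencing check, with invented names and an in-memory store; in a real system the same rule would be enforced inside the database or storage service, and the lock service would also handle lease expiration.

```python
class FencedStore:
    """Storage that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token_seen = 0
        self.value = None

    def write(self, fencing_token, value):
        # Reject anything not newer than what we've already accepted: a paused client
        # waking up with an expired lease falls into this branch.
        if fencing_token <= self.highest_token_seen:
            raise PermissionError(f"stale fencing token {fencing_token}")
        self.highest_token_seen = fencing_token
        self.value = value


class LockService:
    """Hands out leases together with a monotonically increasing fencing token."""

    def __init__(self):
        self.token = 0

    def acquire_lease(self, resource):
        self.token += 1
        return self.token          # expiration handling omitted for brevity


store, locks = FencedStore(), LockService()
t1 = locks.acquire_lease("balance")      # client 1 acquires the lease, token 1
t2 = locks.acquire_lease("balance")      # client 1 pauses, lease expires, client 2 gets token 2
store.write(t2, "client-2 update")       # accepted
try:
    store.write(t1, "client-1 late update")   # client 1 wakes up after its pause
except PermissionError as err:
    print("rejected:", err)                   # token 1 is stale, the write is fenced off
```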
As I said at the beginning, observability is one of the most important things in distributed systems. In this example you can see a distributed operation that spans transactions across multiple components, and we need a straightforward mechanism for getting this information, to effectively determine bottlenecks and have full visibility over the executed distributed operation. One of the best things to do here is distributed tracing, an approach for tracking a distributed operation across multiple microservices, nodes and components. The key idea is to attach a unique distributed operation id to the initial request, propagate it down the call path as spans, and finally collect and aggregate everything carrying this unique id to build an analytical visualization of the whole operation (a minimal sketch of the propagation part follows at the end of this passage). For instance, Jaeger keeps track of all connections and provides charts and graphs to visualize the request path in the application. Technically it has multiple components which collect these operation ids and build visualizations, so you can effectively look into an operation and understand the bottlenecks and problems. A visualization like this helps to analyze bottlenecks and to understand where a request is slow and at which point in the system optimization techniques need to be applied. It's very useful whenever you need to debug complicated scenarios and want to be sure you are not missing something important.

Another approach that may help to improve the system's observability is to use orchestration techniques rather than choreography. Orchestration in software engineering involves a central controller, or orchestration engine, that manages the interaction and integration between services. Choreography refers to a decentralized approach where services interact based on events: in this model each service knows when to execute its operation and with whom to interact, without needing a central point of control. Orchestration is usually the preferable approach for controlling growing complexity, due to better visibility and having a single place for control. Of course, it must be highly available, because it can become a bottleneck or a single point of failure, so high-availability techniques must be applied to avoid problems.
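As promised above, here is a minimal sketch of propagating an operation (correlation) id across service calls. The header name, URL and helper functions are my own assumptions; in practice a tracing SDK such as an OpenTelemetry client would manage trace and span ids for you and export them to a backend like Jaeger.

```python
import uuid
import requests  # assumed available; any HTTP client works

TRACE_HEADER = "X-Correlation-Id"  # illustrative header name

def handle_incoming_request(headers):
    """Entry point of a service: reuse the caller's id, or start a new trace."""
    trace_id = headers.get(TRACE_HEADER) or str(uuid.uuid4())
    log(trace_id, "request received")
    call_downstream("https://inventory.internal/reserve", trace_id)  # hypothetical URL
    return trace_id

def call_downstream(url, trace_id):
    """Every outgoing call carries the same id, so a collector can stitch spans together."""
    log(trace_id, f"calling {url}")
    try:
        requests.post(url, headers={TRACE_HEADER: trace_id}, timeout=2.0)
    except requests.RequestException:
        log(trace_id, "downstream call failed")

def log(trace_id, message):
    # Including the id in every log line lets you aggregate the whole operation later.
    print(f"[trace={trace_id}] {message}")

handle_incoming_request({})  # no incoming id, so a new trace is started
```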
Now let's talk about evolvability and cybernetics principles. In the context of software engineering, cybernetics can be defined as the study and application of feedback loops, control systems and communication processes within software development and operational environments. It focuses on how systems, software, hardware, processes and even human interactions can be designed and managed to achieve desired goals through self-regulation, adaptation and learning. Cybernetics applied to software engineering emphasizes creating systems that can adjust to change, learn from interactions, and improve over time, ensuring reliability, efficiency and resilience. To be honest, that describes what we are doing as software engineers and architects to manage the really complicated evolution of a system.

Speaking about its principles, the first one is systems thinking. This concept focuses on the system as a whole rather than on individual parts. In software engineering it means that how all the parts interact with each other really matters. We need to have a holistic view of how everything works together, how it's evolving, how many dependencies we have, with some kind of dependency review over time, and design decisions are made with this understanding applied to the system itself: we consider the system as a whole, not just one component. For example, when service B handles an event published by service A, the outcome does not directly affect service A; however, the overall result of the operation is significant for the system as a whole.

Another principle is feedback loops. This is a core concept of cybernetics, and it involves implementing, or just having, feedback loops to understand how to control and stabilize the system. In software architecture and systems, feedback loops can take various forms: retrospectives, monitoring of system performance, user feedback through questionnaires, and mechanisms like continuous integration, delivery and deployment to get feedback earlier and decide how to make things better. This is one of the most important principles software engineering takes from cybernetics.

Another one is adaptability and learning. Cybernetics promotes the idea that systems should be capable of adapting to change. It also relates to maintainability and evolvability, and it means we need to apply techniques to evolve our system without many obstacles, incorporate specific mechanisms inside the system to get feedback, and learn from our mistakes and from our existing metrics; technically, this is an evolution of feedback loops.

Related to this is goal-oriented design, one more principle of cybernetics. Cybernetic systems are often defined by their goals, and in the context of software architecture this means the system should be designed with clear objectives in mind. Domain-driven design is a software development methodology that focuses on building a domain model with a deep, comprehensive understanding of domain processes and rules. Technically, it helps build software which is aligned with the real business domains, and business requirements drive architecture decisions. In this case we can see a lot of different contexts or domains: one related to accounting, another for underwriting, fraud profile analysis, onboarding. We are designing a system which solves these real problems from the domain.

I believe everyone is familiar with the opposite pattern, or rather anti-pattern, and has been involved in building such software; it's an experience everyone has had, and I believe it's helpful to have it, because it teaches you how you should not build a system. This anti-pattern drives building a system without any architecture: there are no rules, boundaries or strategy for controlling the inevitably growing complexity. After some time it becomes a really painful thing, the system becomes unmaintainable, and any change leads to many painful problems, which of course reduces revenue and increases the number of bugs and failures in the system. We can apply some techniques to building software, with different service types, that can really help.
Systems are organized in hierarchies of subsystems, and software systems often have a hierarchical structure, with high-level modules depending on lower-level modules for functionality. This hierarchical decomposition really helps to manage complexity by breaking the system down into more manageable parts. But to understand it better, we need to look at something quite important. It's a quite popular opinion that all microservices are the same. From the deployment perspective they are: everything is either deployed to a virtual machine or built as a Docker image and deployed as a container. But from the perspective of the system, and of decision-making, we need some service types to make decisions much simpler and to simplify the thinking process. We can have multiple types of services, which will very likely have quite different rules and constraints. For instance, we have several services which need to communicate with the external world: I need to communicate with some front-end application which runs in the user's browser, accept a request and move it into the core of the system. In this case I define about six types of services which may be very useful.

The first type is the web application, or mobile application, the public API; this is the service used by customers, running in the browser, or as a desktop or mobile application. The next layer is the backend-for-frontend: this is the place where all incoming requests arrive, and from there we route each request down into the system. The third layer is product workflows: whenever we build a system with products, we need to structure business logic which is really specific and detailed to that product; we cannot generalize it, because it's product-specific, it's what makes the product unique. Then we also have reusable components we need in our domain, and it's likely we'll have such components, because we keep solving domain problems whose solutions we need to reuse in several places: for instance KYC, KYB, anti-fraud, or maybe an internal payment service which helps us send money and get money back. Another layer is gateways and adapters: we constantly need to get data from external systems and integrate with something from the external world, and to achieve that we call such a service, and the service is responsible for fetching data from, or sending data to, the outside. And at the bottom there are always platform components: identity and access management, core libraries, configuration, monitoring, even a workflow mechanism for managing really complicated distributed workflows.

Another very important thing is that in distributed systems at scale we start solving more and more problems, and we start having macro domain areas. In these areas we can collect services, or even concepts and problems, together and hide them behind a domain macro API, which is implemented, or represented, as the central point for communicating with that subdomain. Technically this helps, at scale, to control the growth of many microservices and manage them really effectively; a toy sketch of encoding such layering rules is shown below.
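As a small, purely illustrative encoding of the layering idea (the service names and the enforcement mechanism are my own assumptions, not part of the talk): once each service is mapped to a type, the rule "requests flow down, lower layers never call back up" can be checked automatically.

```python
# Order matters: a service may only call services in layers below its own.
LAYERS = [
    "public_api",           # web / mobile apps, the public API used by customers
    "backend_for_frontend",
    "product_workflow",
    "domain_component",     # reusable pieces: KYC/KYB, anti-fraud, payments, ...
    "gateway_adapter",      # integrations with the external world
    "platform",             # IAM, configuration, monitoring, workflow engine
]

SERVICE_LAYER = {           # hypothetical services mapped to their layer
    "mobile-app": "public_api",
    "bff": "backend_for_frontend",
    "orders-workflow": "product_workflow",
    "kyc": "domain_component",
    "bank-gateway": "gateway_adapter",
    "iam": "platform",
}

def call_allowed(caller, callee):
    """A call is allowed only if it goes from a higher layer to a lower one."""
    return LAYERS.index(SERVICE_LAYER[caller]) < LAYERS.index(SERVICE_LAYER[callee])

print(call_allowed("bff", "kyc"))          # True: requests flow down into the system
print(call_allowed("kyc", "mobile-app"))   # False: lower layers never call back up
```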
Talking about SRE: SRE is site reliability engineering, a foundational framework built by Google many years ago. The core principles of SRE are, first of all, to embrace risk. We need to really understand the risk: whether an operation is critical or not, whether it is truly mission-critical for the business and we need to apply all the reliability techniques, or whether the functionality is not very critical and we probably don't need to invest in it at all, creating more and more tasks, requirements, metrics and monitoring. Technically it means we need to understand and manage risk, and budget for it accordingly, because the ideal state where everything is covered by tests and everything is monitored is impossible to achieve; we need to prioritize where we put our resources based on what we have right now.

Another thing is SLAs, SLOs and SLIs for defining system reliability. SLA stands for service level agreement, SLO for service level objective and SLI for service level indicator. In simple terms, an SLA is our agreement with a customer about what we promise from our system, an SLO is an objective inside the team, what we promise to achieve as a team, and an SLI is the measure we use to calculate our SLOs and SLAs. These concepts are very helpful for defining the reliability of the system and for communicating with the business about how reliable we are; a small numeric example appears after this passage.

Another principle is to automate manual work. Ideally we should avoid manual work as much as possible. Sometimes, of course, it's not feasible, because the automation would cost more, so we need to be wise when making such decisions, but ideally we automate repetitive manual work and try to avoid complicated work having to be done by a person or an engineer. Another principle is to monitor everything: we need visibility over everything, we can implement some kind of distributed tracing, and we need metrics in place to calculate and measure the performance of our microservices, systems and infrastructure components. And the last one is to simplify as much as you can, because any complex system tends to fail, simply because it has a lot of components which can fail, and you don't always have control over everything. If you can simplify, choose the simpler approach; it will pay off over time.

Let's look at some related practices, starting with infrastructure as code. This principle states that we should define our infrastructure as code: just as a developer writes a new microservice, the same approach is used for our infrastructure. One possible solution is to have tools which allow us to define templates and definitions of resources, and scripts which update the infrastructure, following the automate-manual-work guideline. For example, engineers can write Terraform that is validated and then applied to our infrastructure, and of course it is versioned in some version control system like Git or Mercurial.
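To make the SLO discussion a bit more concrete, here is a small, self-contained calculation of my own (not from the talk): a 99.9% availability SLO over a 30-day window leaves roughly 43 minutes of error budget, and a simple request-based SLI tells you how much of it you are consuming.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of allowed downtime (error budget) for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)

def sli_availability(successful_requests, total_requests):
    """A simple request-based SLI: the percentage of requests that were good."""
    return 100.0 * successful_requests / total_requests

print(round(error_budget_minutes(99.9), 1))            # ~43.2 minutes per 30 days
print(round(error_budget_minutes(99.99), 1))           # ~4.3 minutes per 30 days
print(round(sli_availability(999_450, 1_000_000), 3))  # 99.945, within a 99.9% SLO
```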
Chaos engineering is also quite important, for testing as well. One tool we can use to check the safety of our system is Jepsen, a special tool and framework developed by Kyle Kingsbury to analyze the safety and consistency of distributed data stores and systems under various conditions, particularly focusing on how systems behave under network partitions and other types of failures. Jepsen provides some very useful things, like fault injection: it introduces specific faults into the distributed system and observes how the system behaves under these conditions and what happens when the system breaks. That includes network partitions, where communication between nodes is disrupted, and we need to understand how our system copes with it. It also covers operations testing, with various scenarios of reads and writes against leaders and followers, to understand how fast it works and how consistency is maintained. And of course concurrency: this is a very important problem we need to cover, how the system behaves under contention, to simulate real-world usage and understand the impact.

Finally, I'd like to talk about simplicity and measuring complexity. It's really hard to measure complexity, but it can be approximated. A complex system that works is invariably found to have evolved from a simple system that worked; that happens all the time, and this is Gall's law. We should measure complexity whenever we have the possibility to do so, and there are several techniques we can apply. We can use cyclomatic complexity, which helps to track how complicated a function is right now and how you can make it simpler and easier for other developers to understand. Another important thing to track is time to train: whenever a new engineer is onboarding, you need to understand how much time you have to spend training them to work with a specific solution. That's probably not a number you want to be high, but it's a very important one to track. The same goes for explanation time: if you have a very complicated domain and it takes a lot of time to explain the details, you probably need to apply some techniques to simplify it and make it easy for people to get on board and start understanding the details and techniques of the domain.

In conclusion, I just want to say that I covered several techniques related to distributed systems. Distributed systems are a really nice thing to have if we want to achieve highly available and scalable software, but they bring a lot of complexity we need to bear in mind, and we cannot just ignore it. We really need to work on it, understand the drawbacks, and apply best practices to solve these problems. Thank you for attending this talk. Looking forward to.
...

Aleksei Popov

Software Engineering Manager @ Stenn Technologies

Aleksei Popov's LinkedIn account


