Conf42 Golang 2021 - Online

DDD and FSM: tackling complexity with state machines

Abstract

The talk will describe how you can simplify the implementation of complex domain models, using FSM as a basis for building logic and interactions between elements. At the same time, the topic contains a minimum of theory and a maximum of practical advice and examples from the author’s experience.

The participant will learn:

  • how FSM is applied to a domain model;
  • how to manage the state of complex data models with many independently changing entities;
  • how to implement interaction of different domains depending on their states;
  • how to make error handling in complex processes fault-tolerant;
  • how to use this technique in distributed systems with synchronous, asynchronous, and event-driven communication.

Summary

  • Ilya Kaznacheev is the technical lead of the container managed Kubernetes service at MTS Cloud. He talks about how the team implemented complex and long-running processes using the domain model and finite state machines. If you want to move to the cloud or to design a suitable architecture and set up processes, feel free to contact him.
  • A state machine is a behavior model. It consists of a finite number of states and transition rules between them. As a cloud provider, we need much more control over the process to ensure speed and resilience, so we decided to dive deeper into domain-driven design.
  • For each domain, we add its own state machine, which describes the logic for that particular domain. The logic that affects multiple domains is propagated within the domain aggregate through events. As soon as parts of processes become parallel, there is a danger of race conditions. Another problem is inconsistency in asynchronous messaging.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello. Today I'm going to talk about how we implemented complex and long-running processes using the domain model and finite state machines to organize the logic within them. You will learn what state machines are and why they are useful for backend development, how to use domains to organize complex business logic, how to implement parallel processes based on state machines, and how to deal with common problems. My name is Ilya Kaznacheev and I am the technical lead of the container managed Kubernetes service in MTS Cloud. I'm also a consulting cloud architect, so if you want to move to the cloud or to design a suitable architecture and set up processes, feel free to contact me. I'm a founder of several local communities, host of a podcast, and a conference and meetup organizer.
So what is a finite state machine? If you haven't worked with one, you probably met it at university or in computer science literature. A state machine is a behavior model. It consists of a finite number of states and transition rules between them. I often see mentions of state machines in the context of frontend or embedded development, but rarely see examples of backend application logic implementation. Today I will tell you how we did it.
So what is a managed Kubernetes service? In simple words, the user wants to get a Kubernetes cluster in the cloud. Then magic happens, and the client gets access to the cluster. In our case, we had a service at the MVP stage whose architecture did not meet our production requirements. A major refactoring process began, during which we did what I am going to talk about today. As a cloud provider, we can't just create a cluster via Terraform. We need much more control over the process to ensure speed and resilience. We control the process at a lower level, which means a lot of granular operations.
So what we had in the beginning: the complete cluster creation was done in just three steps, separated by messages in Kafka. The code structure didn't allow us to implement good practices, and the application was not fault-tolerant. A crash or restart interrupted the long process of cluster creation without any possibility to recover.
First, we decided to divide the cluster creation process into a set of steps combined into a pipeline. Each step is performed in a separate session, and the sequence of steps is controlled over Kafka. If an error occurs at any step, the process can be repeated. The same applies to fault tolerance: if the service crashes, another instance reads the message from Kafka and handles it. It looked very good in theory, but in practice there were many difficulties. Some steps consisted of a set of parallel operations which also had to be managed. For many steps, the logic consisted of a complex chain of operations that depended on the various cluster components. The more complex the logic became, the worse it fit into the existing model. So after a while we decided to go further and dive deeper into domain-driven design.
This is a simplified domain model of the cluster. It is actually more complicated, but I removed some of the elements for the presentation. So the Kubernetes cluster has a cluster domain itself, a set of load balancers, a set of nodes, and a set of node groups. In terms of domain-driven design, domains that change together are combined into a domain aggregate. The overall context of changes is described by the aggregate boundary. A cluster is an aggregate root. This means that there will always be one cluster for one aggregate, and it can be referred to by the cluster ID.
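To make that structure more concrete, here is a minimal sketch of what such an aggregate could look like in Go. All type and field names are illustrative assumptions for this example, not the actual service code; the point is only that the cluster is the root that owns the other domains, and each domain carries its own state.

```go
package cluster

// State marks a position on the state diagram of a particular domain.
type State string

// Cluster is the aggregate root: one cluster per aggregate, addressed by its ID.
type Cluster struct {
	ID            string
	State         State
	LoadBalancers []LoadBalancer
	Nodes         []Node      // masters, which exist independently
	NodeGroups    []NodeGroup // groups of worker nodes
}

// NodeGroup combines worker nodes that are managed together.
type NodeGroup struct {
	ID    string
	State State
	Nodes []Node
}

// Node is a single master or worker machine with its own state machine.
type Node struct {
	ID    string
	State State
}

// LoadBalancer is another element of the same aggregate.
type LoadBalancer struct {
	ID    string
	State State
}
```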
A node can be a master or a worker. A group of workers is combined into a node group, while the masters exist independently. This is an example of a domain aggregate instance, a real cluster: a cluster with three masters and five worker nodes, which are combined into two groups with different configurations. This model covered the need for complex logic, as the logic of each entity is encapsulated in a separate domain. It also solved the problem of parallel operations, because now each domain was responsible for its own processes and it was easier to implement parallel steps. But some of the steps actually consisted of their own sequence of steps.
Let's take a closer look at the worker node creation process. Each node is created in four steps: creating a virtual machine, launching the operating system, configuring the environment and applications, and finally running. Moreover, an error can occur at any step, and in each case different error handling may be required. In the case of an error while starting the operating system, you must either perform a restart or delete the VM for a full rollback. In the case of a configuration error, you may need to repeat the setup. In the case of an error when starting the applications, yet another kind of processing may be required. And these errors can happen in parallel, in different states, for different nodes, which makes it practically impossible to manage this process from one single place, as we assumed in the first implementation with the pipeline.
So at this point we realized that it was impossible to cover the logic of the entire domain aggregate with a single state machine. Each domain needs its own state machine, describing states and transitions between them just for that one domain.
Let's check this out with a state machine example for a node. A node has an initial state, with which it is created in the database. The state machine accepts a success event, which will be handled depending on the current state of the domain. In reality these can be different events for different situations, but for simplicity I have combined all success situations into one event. The transition to the next state will send a request to create a virtual machine. When the virtual machine is created, the transition to the next state happens with a request to start the operating system. When that's okay, the next step starts with a transition to the relevant state. When the configuration is done, the node goes into the running state with no additional actions performed. Similarly, we can describe the process of node removal.
Right now it looks like just two pipelines joined together, for creation and deletion. Or, if you like, it looks like a saga. However, unlike pipelines and sagas, the state machine allows you to solve one very important task: error handling. Remember that an error can happen at any step. For simplicity, I have also combined the different types of errors into a single error event. I don't include retriable errors, since they can be handled automatically; here are the errors that cannot be fixed by repetition. Note that the error handling actions are different depending on the current state of the domain. This eliminates the case where the application would handle an error that does not match the current state, for example if it was received late from a message queue.
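Later in the talk it is mentioned that each state machine is written in Go as a long switch over the current state. Under that assumption, a simplified sketch of the node creation path could look like the following; the state names, the single success/error events, and the request* helpers are invented for illustration, not the real service API.

```go
package node

type State string

const (
	StateInitial        State = "initial"
	StateVMPending      State = "vm_pending"
	StateOSStartPending State = "os_start_pending"
	StateSetupPending   State = "setup_pending"
	StateRunning        State = "running"
	StateDeleted        State = "deleted"
)

type Event string

const (
	EventSuccess Event = "success"
	EventError   Event = "error"
)

// Node keeps its current state between event handlings; it is persisted in the database.
type Node struct {
	ID    string
	State State
}

// HandleEvent applies an event to the node. The same event is handled differently
// depending on the current state, and an event that does not match the current
// state is simply ignored (it can still be logged or traced).
func (n *Node) HandleEvent(ev Event) error {
	switch n.State {
	case StateInitial:
		if ev == EventSuccess {
			n.State = StateVMPending
			return requestCreateVM(n.ID) // ask the infrastructure layer for a virtual machine
		}
	case StateVMPending:
		switch ev {
		case EventSuccess:
			n.State = StateOSStartPending
			return requestStartOS(n.ID) // the VM exists, start the operating system
		case EventError:
			n.State = StateDeleted
			return requestDeleteVM(n.ID) // full rollback of the half-created machine
		}
	case StateOSStartPending:
		switch ev {
		case EventSuccess:
			n.State = StateSetupPending
			return requestConfigure(n.ID) // configure environment and applications
		case EventError:
			return requestRestartOS(n.ID) // error while starting the OS: retry the start
		}
	case StateSetupPending:
		switch ev {
		case EventSuccess:
			n.State = StateRunning // the node is ready, no additional actions
		case EventError:
			return requestConfigure(n.ID) // configuration error: repeat the setup
		}
	}
	return nil
}

// The request* helpers stand in for sending commands to other systems,
// for example publishing messages to Kafka.
func requestCreateVM(id string) error  { return nil }
func requestStartOS(id string) error   { return nil }
func requestConfigure(id string) error { return nil }
func requestRestartOS(id string) error { return nil }
func requestDeleteVM(id string) error  { return nil }
```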
Back to the domain model. For each domain, we add its own state machine, which describes the logic for that particular domain. In this way, we can independently implement the logic of parts of the system, as complex as we want. In the case of our cluster, a state field is added to each element, which defines its position on the state diagram of the corresponding domain. Thus, the state of each item of the cluster is stored in the database between event handlings. Any entity in the domain aggregate has a clearly defined state at any point in time.
Let's now see how it all works. Suppose we have a cluster, again a simplified one, in the creation process, where one node group is ready and the other is still being created. Worker node one receives a success message and its state changes to running. The node then escalates an event about its state change to its parent domain, the node group. The node group then performs state transition validation. The validation condition is that all nodes in the group have the state running, but one node is still in the setup pending state, so the transition will not take place. Next, worker node two receives a success message and its state changes to running. The node then escalates an event about its state change to the node group. The node group performs the state transition validation. Now all child nodes are in the running state, so the node group changes its state to running too, and then it escalates an event about its state change to its parent, the cluster domain. The cluster checks if all node groups are in the running state. As soon as the condition is met, it also changes its state to running. So the logic of each domain remains within that domain itself. The logic that affects multiple domains is propagated within the domain aggregate through events.
Let's take a look at what happens in case of an error. For example, we have the same cluster in the same state, but then one of the nodes receives an error message, which is handled as an event. The node starts the error process matching its current state, as we saw in the state diagram before. At the same time, the node cannot decide what to do with other nodes. A node domain can only control its own state, so the node escalates the error event to its parent, the node group domain. The node group then has enough information about the nodes in the group to make an appropriate decision. For example, it might try to create a new node to replace the bad one, or delete the other nodes in the group for a full rollback. However, this may not be enough, in which case the node group escalates the error event even higher, to the cluster level. In this case, the decision on error handling will be made at the level of the entire cluster, where it is possible to analyze the situation at all levels. The logic for each domain is still encapsulated within the domain. After the decision is made, the parent domain will tell the child domains what to do by triggering the relevant event.
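A rough sketch of how this escalation and validation could be wired up in Go, assuming in-memory references between the domains and the same illustrative state names as before: a node reports its state change to its group, and each parent checks its own transition condition before changing state and escalating further.

```go
package cluster

type State string

const (
	StateSetupPending State = "setup_pending"
	StateRunning      State = "running"
)

type Node struct {
	ID    string
	State State
}

type NodeGroup struct {
	ID      string
	State   State
	Nodes   []*Node
	Cluster *Cluster // parent domain
}

type Cluster struct {
	ID         string
	State      State
	NodeGroups []*NodeGroup
}

// OnNodeStateChanged is the event a node escalates to its parent group.
// The group validates its transition condition: every node must be running.
func (g *NodeGroup) OnNodeStateChanged() {
	for _, n := range g.Nodes {
		if n.State != StateRunning {
			return // condition not met yet, keep the current group state
		}
	}
	g.State = StateRunning
	if g.Cluster != nil {
		g.Cluster.OnNodeGroupStateChanged() // escalate further up the aggregate
	}
}

// OnNodeGroupStateChanged is the event a node group escalates to the cluster.
// The cluster becomes running only when all of its node groups are running.
func (c *Cluster) OnNodeGroupStateChanged() {
	for _, g := range c.NodeGroups {
		if g.State != StateRunning {
			return
		}
	}
	c.State = StateRunning
}
```

Error escalation would follow the same shape in the opposite direction: a node that cannot resolve an error on its own raises the error event to its group, and the group either handles it or raises it to the cluster.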
So during the refactoring process we faced technical and architectural problems. Next, I will mention some interesting problems and their solutions.
As soon as parts of processes become parallel, there is a danger of race conditions. In concurrent event processing, different entities in the same domain aggregate may conflict over resources or over process control. To avoid this, we use a database-level row lock. Any change to domain aggregate data happens in a single ACID transaction, so any conflicts are eliminated. No matter which domain handles the event, the lock is always set on the root element, in our case the cluster domain. However, this leads to lower performance in parallel processes. This is not a problem in our case, but could be a problem in yours. In that case, you can set the lock not on the root domain, but on the domain being processed or its parent. Here the bottom-up rule applies: you can put the lock on the parent domain but not on the child, otherwise you might get a deadlock.
Another problem is inconsistency in asynchronous messaging. We use CQRS, and all commands are executed asynchronously. When processing an event, the state transition logic can send messages to Kafka. Then the domain state changes, which is persisted in Postgres, and the entire event processing takes place in a single ACID transaction. So what happens in case of an error? The domain changes will not be saved in the database because of the transaction rollback, but the messages would already have been sent to Kafka, which would lead to inconsistency in the data model. The service database expects the domain to be in the state before the command was sent, that is, the state before the transaction began, but in fact the command has already been sent, which corresponds to a different domain state. To avoid this, we adopted the outbox pattern. The messages are first stored in the database, within the same transaction as the rest of the changes. A separate job then reads the data from the outbox table and sends it to Kafka. In case of an error, no messages from this transaction will be saved in the database, so the job will not read them from the outbox table or send them to Kafka.
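Here is a rough sketch of how the row lock on the aggregate root and the outbox write could be combined in a single transaction, using database/sql with Postgres. The table names, columns, and the handleNodeEvent function are assumptions made up for the example, not the real schema.

```go
package persistence

import (
	"context"
	"database/sql"
	"encoding/json"
)

// handleNodeEvent persists a node state transition and the outgoing message
// in the same transaction. The cluster row (the aggregate root) is locked
// first, so concurrent event handlers for the same cluster are serialized.
func handleNodeEvent(ctx context.Context, db *sql.DB, clusterID, nodeID, newState string, msg any) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once the transaction has been committed

	// Lock the aggregate root row for the duration of the transaction.
	var id string
	if err := tx.QueryRowContext(ctx,
		`SELECT id FROM clusters WHERE id = $1 FOR UPDATE`, clusterID).Scan(&id); err != nil {
		return err
	}

	// Apply the domain change: persist the node's new state.
	if _, err := tx.ExecContext(ctx,
		`UPDATE nodes SET state = $1 WHERE id = $2`, newState, nodeID); err != nil {
		return err
	}

	// Outbox pattern: store the message instead of publishing it directly.
	// A separate job later reads this table and sends the rows to Kafka, so a
	// rollback here also discards the message and the model stays consistent.
	payload, err := json.Marshal(msg)
	if err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO outbox (topic, payload) VALUES ($1, $2)`,
		"node-commands", payload); err != nil {
		return err
	}

	return tx.Commit()
}
```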
Next, some operations, such as virtual machine creation, are not always practical to do one at a time. There may be dozens of virtual machines in the same cluster, and creating them one at a time reduces the availability of infrastructure for other services and makes the processes longer. This can be fixed by introducing batch operations. The rest of the logic remains the same, but the moment the process reaches a batch operation, instead of an event trigger for each domain in the group, there is a special event trigger for batch processing. The same changes are made, but messages to other systems are not sent directly; they are sent in a different way, as a batch. In this implementation, the domain logic encapsulation begins to leak a little bit. The business logic remains within the domain, but the logic for sending messages is moved out of the single domain to the level of the domain list. But this is a tradeoff for batch operations.
And the last problem that state machines solve especially well is complex error handling. I call it state-matched error processing. Assume an error event comes into the system. This can be some general event or a more specific one. In any case, we need to handle the error. Normally, we would have identified the type of error and handled it accordingly. However, it could happen that the error messages get mixed up, or the message arrives late because of a slow queue. There may be a bug in the application sending the error that causes the wrong type of message to be sent. This would normally result in error handling that does not correspond to the actual state. But now, with state-matched error processing, when we use the state machine, error handling is always matched to the state of the domain that is handling the error at that moment. Different errors can be handled differently within the same state, but the processing will always be relevant to the current state of the domain and not to any other. If the current state does not provide for any error handling, then the event will simply be ignored. So the error processing will always be matched with the current domain state, and if it was some kind of false error, or an error that doesn't mean anything for the current state, it will just be ignored. You can also log it, add it to tracing, et cetera, but it will not disturb the data with false error processing.
So today, the domain model with state machine based logic works very well within our hexagonal services. The approach performed very well and fit into our CQRS communication model with synchronous queries and asynchronous commands. It's quite simple to implement in Golang: we just use long switch cases to describe each state machine. It's easy to read and easy to troubleshoot, but there is still room for improvement. In the future, we want to further refine our approach to implement distributed transactions based on domains with state machines. This already works in some parts of our infrastructure, but the formal approach has not yet been described. We also want to generate state diagrams based on code. Right now we use mermaid js to describe state diagrams in the service documentation, but we want to automate the process.
So that's all for today. I look forward to talking with you. Send me your questions and ideas. Also, don't forget that if you want to run your services in the cloud, I can help with that. Feel free to contact me with any questions related to cloud architecture and process organization related to development in the cloud environment. Thank you for your attention, and goodbye.

Ilya Kaznacheev

Technical Lead @ MTS Cloud

Ilya Kaznacheev's LinkedIn account Ilya Kaznacheev's twitter account


