Conf42 Large Language Models (LLMs) 2024 - Online

Isolation levels and partial failures in distributed systems with LLMs

Abstract

Addressing isolation levels and partial failures in distributed systems with LLMs requires careful consideration of system architecture, fault tolerance mechanisms, and concurrency control strategies to ensure consistent and reliable operation despite the complexities of distributed environments.

Summary

  • Most of my talk in this session covers the non-functional aspects of distributed systems, like consistency, isolation, concurrency, performance, availability and reliability: why they're important and what problems we are trying to solve.
  • Database isolation defines the degree to which a transaction must be isolated from the data modifications made by any other transaction. The degree of isolation needed for either reading or writing is called the isolation level. The goal is to prevent reads and writes of temporary, aborted, or otherwise incorrect data written by concurrent transactions.
  • The repeatable read isolation level ensures that any transaction reading data from a row blocks other writing transactions from accessing the same row. It typically results in lower concurrency than read committed because it holds locks for the duration of the transaction. This prevents non-repeatable reads, but it may still allow phantom reads.
  • Snapshot isolation provides the strongest consistency guarantees of the three isolation levels discussed: it ensures that transactions see a consistent snapshot of the database. It typically allows higher concurrency because it doesn't hold locks on read operations, but it requires additional storage space to maintain multiple row versions.
  • Partial failures are situations where only a subset of components or nodes within the system experience failures. This can result in inconsistencies, degraded performance, or temporary unavailability of services. Various replication strategies are employed in distributed systems to handle these partial failures.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, thanks for joining my talk on isolation levels and partial failures in distributed systems. Most of this session will cover the non-functional aspects of distributed systems, like consistency, isolation, concurrency, performance, availability and reliability: why they're important and what problems we are trying to solve. To begin with, I'll start with the concept of isolation and why it is important in transactions. And even before that, what is a transaction? Any operation we do in our day-to-day life, say purchasing something on an e-commerce platform, editing or reading a Google document, or browsing something on the Internet and looking at some data, involves one of two things: it is either a read or a write operation on a distributed system. When you purchase something on an e-commerce platform, you are actually making a payment, and for that a database record needs to be updated, so you are writing something to the database in the backend. When you are viewing something on the platform in order to purchase it, you are reading the data. So it's just reads and writes. What happens when it's not just one person but millions of people across the globe trying to read and write data on these distributed systems concurrently, at the same time? A lot of things need to be handled in the backend. What things? Most importantly, concurrency. The concept of concurrency comes into the picture as we all interact with these distributed systems. What matters here are the ACID properties of a database: atomicity, consistency, isolation and durability. Among these four, I will focus on the isolation aspect of database systems in a distributed environment. These properties are fundamental principles of database management systems that ensure the reliability, integrity and correctness of transactions. Now let's talk about isolation: what it is, what isolation levels are, and why they are needed in a distributed systems environment. Isolation means that a transaction should take place in the system as if it were the only transaction accessing the resources in the distributed system. Like I mentioned earlier, many of us are trying to access these systems at once. In the e-commerce example here, some users are trying to purchase items: buyer one, buyer two and buyer three are all looking at products, and as you can see, buyer one and buyer two are trying to view and purchase, that is, read and write, the same product shown here. Buyer three is trying to read and purchase a couple of products, the motorcycle helmet and the office chair, at the same time as the seller is trying to update the price of a product. So a lot of concurrent things are happening here. Imagine you are implementing a distributed e-commerce system like this; all these operations have to take place at the same time, right?
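To make the read/write framing concrete, here is a minimal sketch, not from the talk, of such a purchase running as a single transaction. It uses Python's built-in sqlite3 module, and the products and orders tables and their columns are invented purely for illustration.

    import sqlite3

    # Toy schema, invented for illustration: a product catalogue and an orders table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL, stock INTEGER)")
    conn.execute("CREATE TABLE orders (product_id INTEGER, amount REAL)")
    conn.execute("INSERT INTO products VALUES (42, 19.99, 3)")
    conn.commit()

    try:
        cur = conn.cursor()
        # Read: the buyer views the product (a read against the backend database).
        cur.execute("SELECT price, stock FROM products WHERE id = ?", (42,))
        price, stock = cur.fetchone()
        if stock < 1:
            raise RuntimeError("out of stock")
        # Write: the purchase updates records (writes against the backend database).
        cur.execute("UPDATE products SET stock = stock - 1 WHERE id = ?", (42,))
        cur.execute("INSERT INTO orders VALUES (?, ?)", (42, price))
        conn.commit()    # atomicity: both writes become visible together, or not at all
    except Exception:
        conn.rollback()  # any failure discards the partial work
        raise

The point is simply that the read (viewing the product) and the writes (decrementing stock, recording the order) succeed or fail together; everything that follows is about what happens when many such transactions run at once.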
Multiple customers simultaneously want to purchase the same product, prices of products may change, and new products are still being delivered, and so on. As you know, a single action done by a user runs as a transaction in the database, which I just explained. So we need some logic to maintain consistency, and that's the role of isolation, because it controls whether locks are taken when the data is read, what type of locks are requested, and how long the read locks are held, so that the viewer sees consistent data before new data gets updated. Should a read operation that references rows modified by another transaction, say here the seller updating the price of a product that buyer three is reading, block until the exclusive lock on the row is freed, or should it retrieve the committed version of the row that existed at the time the transaction started? That depends on the isolation level; these are the things controlled by the isolation level. Let me go into what an isolation level is and what the different types of isolation levels are. Putting it in simple terms, database isolation defines the degree to which a transaction must be isolated from the data modifications made by any other transaction. Say multiple transactions, or multiple people, are trying to access the same record. The degree of isolation needed for either reading or writing is the isolation level, and there can be a large number of concurrently running transactions. That's why the goal is to prevent reads and writes of temporary, aborted, or otherwise incorrect data written by concurrent transactions. If someone is writing data and hasn't committed it, and someone else reads that uncommitted data, that's not supposed to happen, right? You shouldn't let a read happen until the writing transaction is committed. While explaining that, I already gave an example of a dirty read. Here you can see a problem, or phenomenon, of concurrent transactions called a dirty read, where a transaction reads data written by a concurrent uncommitted transaction; the uncommitted data is called dirty. For example, take the diagram on the right-hand side of the screen: transaction one updates a row in the database and leaves it uncommitted. Meanwhile, transaction two reads the updated row. If transaction one rolls back that change, say it's aborted, then transaction two will have read data that is considered to never have existed. So we shouldn't let anyone read until the data is actually committed, and that's what the read committed isolation level means. As you can see on the left side of the picture, transaction two is not allowed to read the data until T1 has finished writing and committed it. That's what read committed means. This isolation level does not allow any other transaction to write or read a row to which another transaction, here T1, has written but not yet committed. Thus it does not allow dirty reads: the row is locked, and the read is not allowed to happen. The transaction holds a read or a write lock on the current row, and thus prevents other transactions from reading, updating, or deleting it.
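As a toy illustration of that behaviour, and a deliberate oversimplification of a real lock manager, the sketch below models read committed with a single per-row write lock: the reader waits until the writer has either committed or rolled back, so it can never observe the dirty, uncommitted value.

    import threading
    import time

    class Row:
        """Toy model of one database row (illustration only, not a real lock manager)."""
        def __init__(self, value):
            self.committed = value                # last committed value
            self.write_lock = threading.Lock()    # held by a writer until commit/rollback

    row = Row(value=100)

    def transaction_t1(row):
        with row.write_lock:      # T1 updates the row but has not committed yet
            dirty_value = 250     # visible only inside T1; never published
            time.sleep(0.5)       # ... T1 keeps working, then decides to abort ...
            del dirty_value       # rollback: the uncommitted change is discarded
        # the lock is released only here, at rollback (or commit) time

    def transaction_t2(row):
        with row.write_lock:      # read committed: wait while a writer holds the row
            return row.committed  # only ever observes committed data, never dirty data

    t1 = threading.Thread(target=transaction_t1, args=(row,))
    t1.start()
    time.sleep(0.1)               # ensure T1 already holds the write lock
    print(transaction_t2(row))    # blocks until T1 finishes, then prints 100
    t1.join()

Here T2 prints the old committed value 100 and never the dirty 250 that T1 discarded.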
Now, what does read committed guarantee? Let's talk about the three aspects I mentioned, the non-functional aspects. First is consistency. Read committed provides a good balance between consistency and concurrency. It ensures that transactions only see committed data, preventing dirty reads, so consistency is good: across different systems or nodes there won't be wrong or inconsistent data. Second, concurrency: read committed allows for higher concurrency compared to stronger isolation levels like repeatable read, which we'll cover later, because it releases locks as soon as the data is read. Read committed doesn't hold the lock for a long time, which is good for concurrency; a lot of concurrent operations can happen since the lock is held for less time. However, it still suffers from non-repeatable reads and phantom reads, which we'll cover in the next sections. Now let's move on to performance. Read committed tends to have better performance than the stronger isolation levels due to its lower level of locking and reduced contention. As I mentioned a few seconds ago, it allows for more concurrent transactions, but may still incur some overhead due to lock acquisition and release. In this direction, let's move on to the next problem, the non-repeatable read, which I just introduced. What is a non-repeatable read, the problem that the read committed isolation level doesn't solve? Suppose transaction T1 reads some data. Now, due to concurrency, another transaction T2 updates the same data and commits. If T1 rereads the same data, it will retrieve a different value: you're not rereading the same value but a different one, after T2 has written and committed the data. So read committed doesn't guarantee that you read the same data twice within the same transaction, read after read. How is this solved? By the repeatable read isolation level. This isolation level makes sure that any transaction reading data from a row blocks any other writing transactions from accessing the same row. It is a more restrictive isolation level that holds read locks on all rows it references and write locks on all rows it inserts, updates or deletes. Since other transactions cannot read, update or delete these rows, it avoids the non-repeatable read. I demonstrated all of this using the picture on the left: T1 is doing a select and then another select, and no other transaction can do any operation while that one transaction is reading. Two reads are happening, and all the rows it references are locked completely. That way the non-repeatable read is avoided. Now let's talk about consistency, concurrency and performance. As you can see, it is holding a lot of locks, locks on all the rows. Let's start with performance: repeatable read has slightly worse performance, as I just described, compared to the read committed isolation level, due to increased locking and reduced concurrency. Its impact on performance also depends on the workload and the level of contention in the system.
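If you are on PostgreSQL, for example, this behaviour can be requested per session; a rough sketch with the psycopg2 driver follows, where the connection string and the products table are placeholders. Note that PostgreSQL implements repeatable read with snapshots rather than long-held read locks, but the observable guarantee, no non-repeatable reads, is the same.

    import psycopg2
    from psycopg2 import extensions

    # Placeholder connection string and table; adjust to your own setup.
    conn = psycopg2.connect("dbname=shop user=app")
    conn.set_isolation_level(extensions.ISOLATION_LEVEL_REPEATABLE_READ)

    with conn, conn.cursor() as cur:
        cur.execute("SELECT price FROM products WHERE id = %s", (42,))
        first = cur.fetchone()
        # ... another session may update the price and commit right here ...
        cur.execute("SELECT price FROM products WHERE id = %s", (42,))
        second = cur.fetchone()
        assert first == second    # repeatable read: both reads return the same value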
So why reduced concurrency? Let's talk about concurrency. Repeatable read typically results in lower concurrency compared to read committed because it holds locks for the duration of the transaction. Because you're doing two reads, T1 holds the lock for the entire duration of the transaction to prevent other transactions, say T2 in this example, from modifying the data. This can lead to increased contention and potential deadlock situations: as you hold the lock for a long time, there is a lot of contention and waiting, and concurrency suffers because things are not happening in parallel. Holding a lock for longer and not letting other writes happen in parallel is not great. But consistency, yes: repeatable read provides stronger consistency than read committed, because you're solving that other problem as well, by ensuring that once data is read by a transaction, it remains unchanged for the duration of the transaction. With read committed that change is allowed: once the other transaction commits, you can read again, and the two reads will return different data, so it's not totally consistent under the read committed isolation level. With repeatable read you're actually providing stronger consistency. It prevents non-repeatable reads, but it may still allow something called phantom reads. Now, what is a phantom read, and what solves it? Let's talk about it. The next topic is snapshot isolation. This is another, stronger isolation level, and what it solves is the phantom read, along with the dirty read and the non-repeatable read. In a phantom read, a transaction re-executes a query returning a set of rows, not just one row but a range query; say, for example, list all the players who earn more than a certain number of dollars. It reads the set of rows satisfying a search condition, and then finds that the set of rows satisfying the condition has changed due to another recently committed transaction, say a new entry got added. This is similar to a non-repeatable read, except it involves a changing collection matching a predicate rather than a single item. As in the example given here: transaction one reads something, then transaction two writes, appending a new row that matches the range query, and commits, and when the first transaction reads again after some time, the second read returns a different set. If you look back, it is exactly the same shape as before: a read, then another transaction writing and committing, and then a second read. So it's similar to a non-repeatable read, except it involves a changing collection matching a predicate rather than single items; it's a range.
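To keep the distinction clear, here is a tiny, purely illustrative sketch, not a real database, of the phantom anomaly: the set of rows matching a range predicate changes between two reads because a concurrent transaction inserted a new matching row.

    # Toy data: player -> salary, made up for illustration.
    salaries = {"alice": 120_000, "bob": 90_000}

    def high_earners(rows):
        # The "range query": everyone earning more than 100k.
        return {name for name, pay in rows.items() if pay > 100_000}

    first_read = high_earners(salaries)     # {"alice"}

    # A concurrent transaction inserts a new matching row and commits.
    salaries["carol"] = 150_000

    second_read = high_earners(salaries)    # {"alice", "carol"}: a phantom row appeared

    assert first_read != second_read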
So how is this solved? By the snapshot isolation level. This isolation level can greatly increase concurrency at a lower cost than the lock-based isolation levels. When data is modified, the committed versions of the affected rows are copied to a temporary data structure and given version numbers. This operation is called copy-on-write and is used for all inserts and updates; I'm referring to the diagram on the left. When another session reads the same data, the committed version of the data as of the time the reading transaction began is returned. So when someone initiates a read, it is given a snapshot of the current data, and if someone is doing a write at the same time, the write gets its own snapshot. As you can see with T1 and T2, two transactions are trying to write, one changing odd to even and the other even to odd. T1 changes all the green entries from odd to even in its own snapshot, and T2 changes even to odd in its own snapshot. Ultimately, once the transactions are complete, the results are combined and written to the original database. In the same way, when a read is happening, say T3 is doing some read, it is given its own snapshot, so no one interferes with anyone else's transaction, read or write. Maintaining per-transaction snapshots like this gives a high level of concurrency while also maintaining strong consistency. So let's talk about consistency. Snapshot isolation, as I said, provides the strongest consistency guarantees among the three isolation levels I've covered. It ensures that transactions see a consistent snapshot of the database as of the transaction's start: by the time the read is finished, even across two or three reads in the same transaction, the data is still the same because it is served from the snapshot, preventing both non-repeatable reads and phantom reads. There is no change; if there were a change, that's when non-repeatable reads and phantom reads would come into the picture. Concurrency: snapshot isolation typically allows for higher concurrency because it doesn't hold locks on read operations, unlike the repeatable read isolation level we were just talking about. Instead, it maintains multiple versions of data items, allowing concurrent transactions to operate on their own consistent snapshots, so concurrency is guaranteed. Now performance: snapshot isolation can have good performance too, in read-heavy workloads with low update contention, because it allows for high concurrency. While a read is happening everyone works on their own snapshot, writes can happen in parallel, and it avoids the overhead of locking. So the time cost is taken care of: it's fast, there's no locking, and things happen concurrently. However, it requires additional storage space, because the snapshots have to be kept somewhere; the overhead of maintaining multiple versions of the data comes into the picture, and that needs to be considered. It's the space cost I'm talking about, especially for writes, with everyone maintaining a copy of their own. So you need to account for that extra memory in the database when you're considering snapshot isolation.
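Before moving on, here is a rough toy model of my own, not the talk's implementation, of the copy-on-write, multi-version idea just described: every write appends a new version of the row, and a reader only ever sees versions committed before its snapshot was taken.

    import itertools

    class MVCCStore:
        """Toy multi-version store: a sketch of snapshot reads, not a real engine."""
        def __init__(self):
            self.versions = {}                  # key -> list of (commit_ts, value)
            self.clock = itertools.count(1)     # global logical timestamp

        def begin(self):
            return next(self.clock)             # snapshot timestamp for a new reader

        def write(self, key, value):
            # Copy-on-write: append a new committed version instead of overwriting.
            self.versions.setdefault(key, []).append((next(self.clock), value))

        def read(self, key, snapshot_ts):
            # Newest version committed at or before the reader's snapshot.
            visible = [v for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
            return visible[-1] if visible else None

    store = MVCCStore()
    store.write("parity", "odd")        # committed before the reader starts
    snap = store.begin()                # the reader takes its snapshot here
    store.write("parity", "even")       # a concurrent writer commits a newer version
    print(store.read("parity", snap))   # still "odd": the reader's snapshot is stable

Notice how the versions list keeps growing; that is exactly the extra storage overhead mentioned above.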
So those are the three isolation levels, read committed, repeatable read and snapshot isolation, each guaranteeing a certain degree of consistency, concurrency and performance. Now, switching gears: apart from consistency, concurrency and performance, the other non-functional aspects of distributed systems that are very important to consider, especially for systems serving large language models, are partial failures. So what are partial failures? A distributed system, like I said, is a lot of connected computers working in a network: systems connected globally, interacting with each other, locating data, processing and computing, merging the data, all of that happening in a distributed environment, with a lot of communication going on. What if only a subset of components or nodes within the system experience failures while other parts continue to function normally? That's a partial failure, as the name itself suggests. By the way, you can again think of this distributed environment as an e-commerce platform where a lot of people are trying to purchase items on the website, all the transaction requests are coming to the backend, and that backend is node one, node two, node three, node four, up to node ten, connected over a network. Partial failures can be of two types. Node failures: individual computers may fail due to hardware or software problems, which is very much possible. Or network partitions: the communication between these systems may be disrupted due to a bad network, leading to network partitions, isolated segments unable to reach each other, connectivity issues, and things like that. These partial failures are a very distinctive feature of distributed systems. If you think about a single program on a single computer, a standalone application, there is no in-between: it either runs 100%, or it doesn't run at all because of a hardware failure, a crash, or a flaw in the software itself, the program you're running. It's zero or one hundred; there's nothing in between on a single computer. But in a distributed environment, the issue is not just the logic of your program or the code, not just connectivity, and not a complete failure either: the failure can be partial, somewhere in between. That is the distributed environment problem. Now, despite partial failures, some parts of the distributed system may remain operational while others experience disruptions. This can result in inconsistencies, degraded performance, or temporary unavailability of services. How can this be handled? You need replication. I picked replication here; there are other approaches, like building fault-tolerant and reliable systems, which I covered in some of my previous sessions at Conf42 and which you can refer to, but here I'll focus on giving an intro to the replication aspect of handling these partial failures. Various replication strategies are employed in distributed systems to handle partial failures and ensure data availability and reliability. First is full replication.
In full replication, all data is replicated across multiple nodes, or replicas, in the distributed system; each replica contains a complete copy of the dataset. This approach ensures high availability and fault tolerance, since any single node failure can be mitigated by accessing data from other replicas. However, it can be costly in terms of storage and bandwidth requirements, especially for large datasets. Next is partial replication. Partial replication involves replicating only a subset of the data across multiple nodes. Different subsets of data can be replicated based on access patterns, data importance or other criteria. This strategy can help reduce storage and bandwidth costs compared to full replication while still providing fault tolerance for critical data. However, it may require careful data partitioning and management: you need to understand which data is important, heavily used, and actually needs to be replicated. Next is sharding, also known as horizontal partitioning; it's a very common and very popular one. It involves partitioning the dataset into multiple subsets called shards and distributing these shards across multiple nodes. Each node is responsible for storing and managing a subset of the data. Sharding can improve scalability, which is the bigger problem sharding solves, and of course performance, by distributing the workload across different nodes so that concurrent requests are handled in parallel. In the event of a node failure, only the data stored on that node is affected, minimizing the impact on the overall system, so it is still manageable. Next, replication chains: these involve replicating data from one node to another in a sequential, chain-like fashion. Each node in the chain replicates data to its successor node. This approach can provide fault tolerance by ensuring that data is replicated across multiple nodes in the chain. However, it may introduce latency and complexity, especially in large distributed systems, since, as the name suggests, you have to perform a chain of replications. Next is primary-backup replication. One node serves as the primary replica, responsible for processing client requests and updating data; changes made to the primary replica are asynchronously replicated to one or more backup replicas. Asynchronous replication is important here: you don't want to block other transactions while replication is happening. If the primary replica fails, one of the backup replicas can be promoted to the primary role to continue serving client requests. It's like a master/slave setup, where new-leader or new-master election comes into the picture. This approach provides fault tolerance and higher availability while minimizing overhead compared to full replication. And the last one is quorum-based replication. It involves replicating data to a subset of nodes known as a quorum. Read and write operations require coordination among a quorum of nodes to ensure consistency and fault tolerance. Quorum-based replication can provide strong consistency guarantees while tolerating failures of a subset of nodes within the quorum.
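As a final sketch, again a toy model with invented names rather than any particular system, here is the quorum idea in miniature: with N replicas, a write must reach W of them and a read must consult R of them, and choosing R + W > N guarantees that the read quorum overlaps the most recent write quorum.

    import random

    N, W, R = 5, 3, 3        # replicas, write quorum, read quorum; 3 + 3 > 5
    replicas = [{"value": None, "version": 0} for _ in range(N)]

    def quorum_write(value, version):
        # Pretend only W randomly chosen replicas acknowledge the write.
        for i in random.sample(range(N), W):
            replicas[i] = {"value": value, "version": version}

    def quorum_read():
        # Consult any R replicas and trust the one with the highest version.
        polled = random.sample(range(N), R)
        newest = max((replicas[i] for i in polled), key=lambda r: r["version"])
        return newest["value"]

    quorum_write("price=99", version=1)
    print(quorum_read())     # always "price=99", even though some replicas are stale

Because 3 + 3 > 5, at least one of the replicas consulted by the read always holds the latest version, even though two replicas may still be stale or unreachable.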
So that is all I wanted to cover today. Obviously there is more to talk about, say fault tolerance, which, like I said, I covered in one of my previous talks. For even more detailed discussions of further problems in distributed systems and how to handle them, I would like to do another session. That's it for now, and thank you very much for watching all the way through. Thank you.

Santosh Nikhil Kumar

Senior Software Engineer @ ByteDance

Santosh Nikhil Kumar's LinkedIn account


