Conf42 Site Reliability Engineering (SRE) 2025 - Online

- premiere 5PM GMT

Zero Data Loss at Scale: Building Resilient Asynchronous Messaging Systems for Modern Distributed Architectures

Abstract

Discover how to achieve the holy grail of messaging systems: zero data loss with high performance at scale. Learn battle-tested patterns that prevent catastrophic failures, slash recovery times, and maintain data integrity even when everything goes wrong.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Greetings to everyone present here today. In today's software landscape, we are in the era of microservice architecture, where distributed systems play a crucial role in handling both synchronous and asynchronous messages and responses. In a synchronous system, we can immediately notify the user of success or failure in real time. However, what about asynchronous systems? Let me illustrate with an example. Imagine I have exposed a web service that uses a messaging queue to report events. When a user reports an event using my web service, the system acknowledges the request by placing the message in the queue. As a result, the user sees a success response and assumes the event has been successfully reported. The problem arises when the messaging system goes down and fails to process the event, even though the user was told it was successfully received. This situation happens because of the asynchronous design, where the response is sent to the user without any guarantee that the event has been fully processed.

So in today's presentation, I am going to discuss zero data loss at scale: building resilient asynchronous messaging for modern distributed architectures, with battle-tested strategies for SREs and architects to implement resilient messaging that maintains data integrity during failures. By the way, I am Vignesh Kuppa Amarnath. I work as a software engineer serving multiple clients across the United States, with specialization in various domains such as tax, health insurance, and payroll. Now let me step into my presentation.

First, the current distributed system challenges. A common problem is message tracking blind spots, where tracking or monitoring the flow of messages becomes difficult, incomplete, or even impossible. The next one is inadequate recovery mechanisms, which refers to the lack of effective strategies or processes to restore a system to a functional state after a failure occurs. The third is inconsistent state across nodes.
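The fire-and-forget risk described above can be shown as a minimal sketch (all names here are hypothetical, and a simple in-memory queue stands in for a real broker): the API reports success as soon as the message is queued, before anything has durably processed it.

```python
import queue

# Hypothetical in-memory queue standing in for a real message broker.
event_queue = queue.Queue()

def report_event(event: dict) -> str:
    # Enqueue only -- no durability guarantee yet.
    event_queue.put(event)
    # The caller is told "success" immediately, even though no
    # consumer has processed (or persisted) the event.
    return "success"

ack = report_event({"type": "payment", "amount": 100})
# If the messaging system crashes at this point, the queued event is
# lost, yet the user was already shown a success response.
```

This is exactly the gap the rest of the talk addresses: the acknowledgement must be tied to durable processing, not to enqueueing.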
It refers to a situation in a distributed system where different nodes that are supposed to hold the same or synchronized data end up with conflicting or divergent states. These are very common problems in distributed systems.

Now, how do these challenges impact SRE? Unpredictable recovery time refers to the varying or inconsistent amount of time it takes for a system to recover from a failure or disruption. Data integrity issues refer to problems where the accuracy, consistency, and reliability of data are compromised. In distributed systems, asynchronous architectures, and large-scale applications, data integrity is crucial to ensure that the data remains accurate, complete, and consistent across all systems and services. Data integrity issues can lead to inconsistency, corruption, or loss of data, resulting in operational disruption and trust issues for both users and system administrators. Complex post-failure reconciliation refers to the challenges and processes involved in restoring data integrity and consistency in a system after a failure occurs.

Now let's see how to prevent data loss with a message replication architecture. The first step is to distribute: replicate the message across multiple nodes. Then verify: confirm receive acknowledgements from all destinations the message was transferred to. Next is synchronize: maintain consistency between replicas. And the final step is protect: ensure durability during failures.

Here are some advanced recovery techniques. Snapshot-based recovery restores the system from consistent point-in-time backups, enabling rapid recovery without complex message reconstruction. Replay-based recovery systematically reconstructs the system state by replaying transaction logs, ensuring no messages are lost during the recovery phase. And the next one is peer-assisted recovery, which leverages healthy nodes in the network to collaboratively rebuild failed peers, distributing the recovery workload and minimizing downtime.
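The distribute-and-verify steps can be sketched as follows. This is a simplified illustration with hypothetical names, where an in-memory log stands in for each replica's durable storage: the write is only confirmed once every replica has stored the message and acknowledged it.

```python
# Hypothetical sketch of replicate-then-acknowledge: a message is
# confirmed only after every replica has stored it.

class Replica:
    def __init__(self, name: str):
        self.name = name
        self.log: list = []          # stands in for durable storage

    def store(self, message: dict) -> bool:
        self.log.append(message)     # persist first...
        return True                  # ...then acknowledge

def replicate(message: dict, replicas: list) -> bool:
    # Distribute: send the message to every replica.
    acks = [r.store(message) for r in replicas]
    # Verify: only report success if all destinations acknowledged.
    return all(acks)

replicas = [Replica("node-a"), Replica("node-b"), Replica("node-c")]
ok = replicate({"id": 1, "event": "order_created"}, replicas)
```

In a real system the replica logs would also serve as the transaction logs that replay-based recovery reads back after a failure.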
This slide is on distributed acknowledgement protocol design. Here are the steps involved in this design. First, send the messages: pass the message to all the nodes that are involved in the transmission. The next step is to acknowledge: each node involved in the transmission needs to provide a confirmation receipt, that is, it has to acknowledge the message. The next step is to persist the data: we have to store the data durably on multiple nodes to prevent data loss. And the final step is confirmation: verify that the complete distribution is done.

Some industry implementation examples: in financial services, mission-critical payment processing systems with guaranteed transaction integrity across distributed banking networks; in healthcare, real-time patient record synchronization, ensuring critical medical data consistency across multiple treatment facilities; in e-commerce, fault-tolerant order management systems maintaining a seamless customer experience during high-traffic events and flash sales.

Now let us see how to ensure smooth, predictable communication in distributed systems by standardizing the message format. Standardizing the message format refers to the practice of using a predefined and consistent structure for the messages exchanged between components in a system, especially in distributed or microservice architectures. Some commonly used standardized message formats are JSON, XML, CSV, YAML, et cetera. Why do we need a standardized messaging format? For these three reasons: interoperability, so systems communicate seamlessly; transformation, for simplified data mapping; and validation, which becomes very easy with automated schema enforcement.

The next topic is performance optimization. It refers to the process of improving the efficiency, speed, and scalability of a system or application. There are various ways to optimize performance. The first is performance optimization through eventual consistency.
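The automated schema enforcement mentioned for standardized message formats can be sketched like this. A minimal hand-rolled check is shown (the field names and schema are hypothetical); in practice a schema library or format such as JSON Schema, Avro, or Protobuf would do this work.

```python
import json

# Hypothetical message schema: each field name maps to its required type.
SCHEMA = {"id": int, "type": str, "payload": dict}

def validate(raw: str) -> dict:
    """Parse a JSON message and reject it if it violates the schema."""
    msg = json.loads(raw)
    for field, expected_type in SCHEMA.items():
        if not isinstance(msg.get(field), expected_type):
            raise ValueError(f"invalid or missing field: {field}")
    return msg

# A well-formed message passes validation...
msg = validate('{"id": 7, "type": "order", "payload": {"sku": "A1"}}')
```

Rejecting malformed messages at the boundary keeps every downstream consumer simpler, since they can assume a consistent structure.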
Eventual consistency is an approach that allows temporary inconsistency of data across different parts of the system, with a guarantee that eventually all nodes or replicas will converge to the same consistent state. Some of its benefits are higher throughput, reduced synchronization overhead, and stale data risk mitigation.

The next one is strategic caching. It refers to the deliberate and efficient use of caching mechanisms to improve the performance and scalability of systems, especially in distributed environments. Caching is the process of storing copies of frequently accessed data in faster, more readily available storage locations to reduce the time it takes to fetch the data and to decrease the load on backend systems such as databases or APIs. Some of the benefits of strategic caching are lower average response times, reduced backend load, and intelligent invalidation.

Partitioning strategies are crucial in distributed systems and large-scale applications where data needs to be distributed across multiple servers or databases to ensure scalability, performance, and high availability. Partitioning allows large data sets to be divided into smaller, manageable pieces that can be distributed across multiple nodes, enabling parallel processing and efficient data retrieval. Some of the benefits of partitioning strategies are balanced workload distribution, isolated failure domains, and horizontal scalability.

Next, this slide is a graphical representation of the performance versus consistency trade-off. The x-axis refers to consistency and the y-axis refers to performance. The dark yellow line refers to latency, meaning the time taken by a message to travel from sender to receiver, and the light yellow line refers to throughput, the rate of successful message transmission between sender and receiver.
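The partitioning idea can be sketched with simple hash partitioning (the key names and partition count are hypothetical): hashing each key deterministically routes it to one of N partitions, which spreads load and keeps failures isolated to a single partition.

```python
import hashlib

# Hypothetical cluster with four partitions.
NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    """Deterministically map a key to a partition number."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# The same key always lands on the same partition, so all messages
# for one entity can be processed (and ordered) on one node.
p1 = partition_for("user-42")
p2 = partition_for("user-42")
```

One design note: plain modulo hashing reshuffles most keys when `NUM_PARTITIONS` changes, which is why production systems often use consistent hashing instead.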
So the first one is strong consistency, which guarantees that the system always returns the most up-to-date version of the data, no matter which replica the node accesses. Quorum consistency is a model used in distributed systems to provide a balance between consistency and availability; it is about managing how many replicas must participate in a read or a write operation for it to be considered successful. Read-your-writes consistency is sometimes described as strong consistency for the writing user: users see their own writes immediately. And the last one is eventual consistency. This is weak consistency: data may be inconsistent temporarily, but will eventually converge. In the graphical notation, you can see that as latency decreases, throughput increases.

Here are the steps for the case study implementation. The first step in the case study was problem assessment: data was lost during a regional outage. Next was architecture redesign: multi-region replication with quorum writes. The third step was monitoring enhancement: end-to-end message tracking. And the fourth step was the result, where we achieved zero data loss.

Here are some key takeaways to take into consideration. The first one is safety first: prioritize data integrity over raw performance. The next one is balance requirements: match consistency levels to business needs. The third one is implement incrementally: start with critical data paths. And the final one is measure everything: deploy comprehensive monitoring. Thank you. I hope you all have enjoyed my presentation and gained insight into designing resilient asynchronous messaging.
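The quorum idea behind the quorum writes in the case study can be captured in one line of arithmetic: with N replicas, a read quorum R and a write quorum W guarantee that every read overlaps the latest write whenever R + W > N. A minimal sketch:

```python
# Quorum rule: a read quorum R and write quorum W over N replicas
# guarantee read/write overlap (hence strong-ish consistency)
# exactly when R + W > N.

def quorum_is_consistent(n: int, r: int, w: int) -> bool:
    """True if every read set must intersect every write set."""
    return r + w > n

# Typical configuration: N=3 replicas, W=2 acks per write, R=2 reads.
ok = quorum_is_consistent(3, 2, 2)       # overlapping -> consistent
weak = quorum_is_consistent(3, 1, 1)     # non-overlapping -> may read stale data
```

Tuning R and W is how a system moves along the consistency-versus-performance curve from the earlier slide: smaller quorums cut latency, larger quorums strengthen consistency.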
...

Vignesh Kuppa Amarnath

Software Engineer @ MSRCOSMOS LLC



