Transcript
This transcript was autogenerated.
Greetings to everyone present here today. In today's software landscape, we are in the era of microservice architecture, where distributed systems play a crucial role in handling both synchronous and asynchronous messages and responses.
In a synchronous system, we can immediately notify the user of success or failure in real time. However, what about asynchronous systems?
Let me illustrate with an example.
Imagine I have exposed a web service that uses a messaging queue to report events.
When a user reports an event using my web service, the system acknowledges the request by placing the message in the queue. As a result, the user sees a success response and assumes that the event has been successfully reported. The problem arises when the messaging system goes down and fails to process the event, even though the user was told the event was successfully received. This happens because of the asynchronous design, where the response is sent to the user without guaranteeing that the event has been fully processed.
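To make this concrete, here is a minimal sketch of the pattern just described, using hypothetical names; the queue object and its enqueue method (including the durable-acknowledgement flag) are assumptions for illustration, not a specific product API. The unsafe variant reports success before the broker has durably stored the message, while the safer variant only does so after a confirmation.

class BrokerUnavailable(Exception):
    pass

def report_event_unsafe(queue, event):
    try:
        queue.enqueue(event)              # fire and forget: no delivery confirmation
    except BrokerUnavailable:
        pass                              # failure swallowed: the event is silently lost
    return {"status": "success"}          # user is told "success" either way

def report_event_safe(queue, event):
    try:
        receipt = queue.enqueue(event, wait_for_durable_ack=True)   # hypothetical durable-ack option
    except BrokerUnavailable:
        return {"status": "retry_later"}  # be honest with the caller when the broker is down
    return {"status": "accepted", "receipt": receipt}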
So in today's presentation, I'm going to discuss zero data loss at scale: building resilient asynchronous messaging for modern distributed architectures, with battle-tested strategies for SREs and architects to implement resilient messaging that maintains data integrity during failures.
By the way, I am Kupa Amarna.
I work as a software engineer serving multiple clients across the United States, with a specialization in various domains such as tax, health insurance, and payroll.
Now let me step into my presentation.
Let's start with the current distributed system challenges and some common problems. Message tracking blind spots refer to situations where tracking or monitoring the flow of messages becomes difficult, incomplete, or even impossible.
The next one is inadequate recovery mechanisms, which refers to the lack of effective strategies or processes to restore a system to a functional state after a failure occurs. Then there is inconsistent state across nodes.
It refers to a situation in a distributed system where different nodes that are supposed to hold the same or synchronized data end up with conflicting or divergent states. These are very common problems in distributed systems.
Now, how do these challenges impact SREs? Unpredictable recovery time refers to the varying or inconsistent amount of time it takes for a system to recover from a failure or disruption. Data integrity issues refer to problems where the accuracy, consistency, and reliability of data are compromised.
In distributed systems, asynchronous architectures, and large-scale applications, data integrity is crucial to ensure that the data remains accurate, complete, and consistent across all systems and services. Data integrity issues can lead to inconsistency, corruption, or loss of data, resulting in operational disruption and trust issues for both users and system administrators.
Complex post-failure reconciliation refers to the challenges and processes involved in restoring data integrity and consistency in a system after a failure occurs.
Now let's see how to prevent data loss with a message replication architecture. The first step is to distribute: replicate the message across multiple nodes. Then verify: confirm receipt by collecting acknowledgements from all destinations the message was transferred to. Next is to synchronize: maintain consistency between replicas. And the final step is to protect: ensure durability during failures.
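As a rough illustration of the distribute and verify steps, here is a minimal sketch; the node objects and their store method are assumptions for illustration, not a specific product API. The message only counts as stored once the required number of replicas have acknowledged it.

def replicate(message, nodes, required_acks=None):
    required = required_acks if required_acks is not None else len(nodes)   # default: acks from all destinations
    acks = 0
    for node in nodes:
        if node.store(message):            # each replica persists the message and acknowledges
            acks += 1
    if acks < required:
        raise RuntimeError(f"only {acks} of {required} acknowledgements received")
    return acks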
So here are some advanced recovery techniques. Snapshot-based recovery: restore the system from consistent point-in-time backups, enabling rapid recovery without complex message reconstruction.
Replay-based recovery: systematically reconstruct the system's state by replaying transaction logs, ensuring no messages are lost during the recovery phase.
And the next one is peer-assisted recovery: leverage healthy nodes in the network to collaboratively rebuild failed peers, distributing the recovery workload and minimizing downtime.
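Here is a minimal sketch, under assumed in-memory data structures, of combining snapshot-based and replay-based recovery: restore the last consistent snapshot, then re-apply every transaction logged after it so no messages are lost.

def recover(snapshot, transaction_log):
    state = dict(snapshot["data"])                      # restore the point-in-time backup
    for entry in transaction_log:
        if entry["seq"] > snapshot["seq"]:              # only replay entries recorded after the snapshot
            state[entry["key"]] = entry["value"]        # re-apply the logged change
    return state

# Example: snapshot taken at sequence 2, two later log entries replayed on top of it.
snap = {"seq": 2, "data": {"order-1": "created"}}
log = [{"seq": 3, "key": "order-1", "value": "paid"},
       {"seq": 4, "key": "order-2", "value": "created"}]
print(recover(snap, log))   # {'order-1': 'paid', 'order-2': 'created'}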
This slide is on distributed acknowledgement protocol design. Here are the steps involved in this design. First, send: dispatch the message to all the nodes that are involved in the transmission. The next step is to acknowledge: each node involved in the transmission needs to provide a confirmation of receipt, that is, it has to acknowledge the message. The next step is to persist the data: store the data durably on multiple nodes to prevent data loss. And the final step is to confirm: verify that the complete distribution is done.
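Here is a minimal sketch of these four steps, with illustrative node names and a simplified per-node state machine; a real protocol would add timeouts and retries.

SENT, ACKED, PERSISTED = "sent", "acknowledged", "persisted"

class DistributionTracker:
    def __init__(self, message_id, nodes):
        self.message_id = message_id
        self.states = {node: SENT for node in nodes}      # step 1: message dispatched to every node

    def acknowledge(self, node):                          # step 2: node confirms receipt
        self.states[node] = ACKED

    def mark_persisted(self, node):                       # step 3: node reports durable storage
        self.states[node] = PERSISTED

    def is_fully_distributed(self):                       # step 4: verify complete distribution
        return all(state == PERSISTED for state in self.states.values())

tracker = DistributionTracker("evt-42", ["node-a", "node-b", "node-c"])
for n in ["node-a", "node-b", "node-c"]:
    tracker.acknowledge(n)
    tracker.mark_persisted(n)
print(tracker.is_fully_distributed())   # True only once every node has persisted the message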
Here are some industry implementation examples. In financial services: mission-critical payment processing systems with guaranteed transaction integrity across distributed banking networks. In healthcare: real-time patient record synchronization, ensuring critical medical data consistency across multiple treatment facilities. In e-commerce: fault-tolerant order management systems, maintaining a seamless customer experience during high-traffic events and flash sales.
Now let us see how to ensure smooth, predictable communication in a distributed system by standardizing the message format. Standardizing the message format refers to the practice of using a predefined and consistent structure for the messages exchanged between components in a system, especially in a distributed or microservice architecture. Some of the commonly used standardized message formats are JSON, XML, CSV, YAML, et cetera.
Why do we need a standardized messaging format? For these three reasons: interoperability, so systems communicate seamlessly; transformation, through simplified data mapping; and validation, which becomes very easy with automated schema enforcement.
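For example, here is a minimal sketch of schema enforcement on a standardized JSON event message; the field names and types are an assumed example, not a prescribed schema.

import json

EVENT_SCHEMA = {"event_id": str, "event_type": str, "timestamp": str, "payload": dict}

def validate_event(raw_message):
    event = json.loads(raw_message)
    for field, expected_type in EVENT_SCHEMA.items():
        if field not in event:
            raise ValueError(f"missing required field: {field}")
        if not isinstance(event[field], expected_type):
            raise ValueError(f"field {field} must be of type {expected_type.__name__}")
    return event

msg = '{"event_id": "e-1", "event_type": "order.created", "timestamp": "2024-01-01T00:00:00Z", "payload": {"order": 1}}'
print(validate_event(msg)["event_type"])   # order.created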
The next one is about performance optimization. It refers to the process of improving the efficiency, speed, and scalability of a system or application. There are various ways to optimize performance.
The first is performance optimization through eventual consistency. It is an approach to consistency that allows for temporary inconsistency of data across different parts of the system, with a guarantee that eventually all nodes or replicas will converge to the same consistent state. Some of the benefits are higher throughput, reduced synchronization overhead, and stale-data risk mitigation.
The next one is strategic caching. It refers to the deliberate and efficient use of caching mechanisms to improve the performance and scalability of systems, especially in distributed environments. Caching is the process of storing copies of frequently accessed data in a faster, more readily available storage location to reduce the time it takes to fetch the data and to decrease the load on backend systems, such as databases or APIs.
And here are some of the benefits of using strategic caching: it lowers the average response time, reduces backend load, and enables intelligent invalidation.
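Here is a minimal sketch of strategic caching with a time-to-live and explicit invalidation; the loader function stands in for a slow backend call such as a database query or API request, and is an assumption for illustration.

import time

class TTLCache:
    def __init__(self, loader, ttl_seconds=30):
        self.loader = loader                   # slow backend lookup (database, API, ...)
        self.ttl = ttl_seconds
        self.entries = {}                      # key -> (value, expiry timestamp)

    def get(self, key):
        value, expires_at = self.entries.get(key, (None, 0.0))
        if time.time() < expires_at:
            return value                       # served from cache: lower response time, less backend load
        value = self.loader(key)               # cache miss or stale entry: hit the backend once
        self.entries[key] = (value, time.time() + self.ttl)
        return value

    def invalidate(self, key):
        self.entries.pop(key, None)            # explicit invalidation when the underlying data changes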
Next are partitioning strategies. These strategies are crucial in distributed systems and large-scale applications where data needs to be distributed across multiple servers or databases to ensure scalability, performance, and high availability.
Partitioning allows large data sets to be divided into smaller, manageable
pieces that can be distributed across multiple nodes, enabling parallel
processing and efficient data retrieval.
Here are some of the benefits of using partitioning strategies: balanced workload distribution, isolated failure domains, and horizontal scalability.
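As a rough illustration, here is a minimal sketch of hash-based partitioning, one common partitioning strategy: a stable hash of the key decides which node owns a record, so the workload spreads evenly and the same key always lands on the same partition. The node names are hypothetical.

import hashlib

def partition_for(key, num_partitions):
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_partitions        # same key always maps to the same partition

nodes = ["node-0", "node-1", "node-2", "node-3"]
for order_id in ["order-101", "order-102", "order-103"]:
    print(order_id, "->", nodes[partition_for(order_id, len(nodes))])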
Next, this slide is a graphical representation of the performance versus consistency trade-off. The X axis refers to consistency and the Y axis refers to performance. The dark yellow line refers to latency, which is the time taken by a message to travel from the sender to the receiver, and the light yellow line refers to throughput. Throughput is the rate of successful message transmission between the sender and the receiver.
The first one is strong consistency, which guarantees that the system always returns the most up-to-date version of the data, no matter which replica node is accessed. Next is quorum-based consistency, a model used in distributed systems to provide a balance between consistency and availability; it is about managing how many replicas must participate in a read or a write operation for it to be considered successful.
Read-your-writes consistency is also called strong consistency for the writing user: users see their own writes immediately.
And the last one is eventual consistency. This is a weak consistency: data may be inconsistent temporarily but will eventually converge.
Here in the graph, you can see that as latency decreases, the throughput increases.
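To illustrate the quorum model mentioned above, here is a minimal sketch of the standard rule that with N replicas, a write needing W acknowledgements and a read consulting R replicas always overlap on at least one up-to-date replica whenever R + W > N; the numbers below are just example configurations.

def quorum_overlaps(n_replicas, write_quorum, read_quorum):
    # If read and write quorums overlap, every read touches at least one replica
    # that holds the latest acknowledged write.
    return read_quorum + write_quorum > n_replicas

N = 3
print(quorum_overlaps(N, write_quorum=2, read_quorum=2))   # True: reads see the latest write
print(quorum_overlaps(N, write_quorum=1, read_quorum=1))   # False: faster, but reads may be stale (eventual consistency)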
Below are the steps for the case study implementation. The first step in the case study is the problem assessment: data was lost during a regional outage. Next is the architecture redesign: multi-region replication with quorum writes. The third step is monitoring enhancement: end-to-end message tracking. And the fourth step is the result, where we achieved zero data loss.
Here are some key takeaways for you to take into consideration. The first one is safety first: prioritize data integrity over raw performance. The next one is to balance requirements: match the consistency level to business needs. The third one is to implement incrementally: start with critical data paths. And the final one is to measure everything: deploy comprehensive monitoring.
Thank you.
I hope you all have enjoyed my presentation and gained insight into designing resilient asynchronous messaging.