Conf42 Site Reliability Engineering (SRE) 2025 - Online

- premiere 5PM GMT

Resilient by Design: Data-Driven Migration from Monoliths to Event-Driven Microservices

Abstract

Discover how to transform monolithic systems into resilient event-driven microservices without sacrificing stability. Learn battle-tested patterns from Fortune 500 case studies that reduce complexity, speed deployments, and enhance recovery times.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. Welcome to our session, Resilient by Design: A Data-Driven Migration from Monoliths to Event-Driven Microservices. I'm Amlan Ghosh. I have over two decades of experience in architecting software solutions, with a focus on modernization from monolithic applications to microservices-based, event-driven architecture.

Today we are tackling one of the most pressing challenges in modern system architecture: the transition from monolithic systems to event-driven microservices. This transformation, while promising unparalleled scalability and agility, demands meticulous planning and execution to maintain system stability throughout the journey. In this talk, we will explore a data-driven framework designed to address these challenges head on. From understanding system domains to designing robust schemas and implementing progressive migration techniques, we will highlight strategies to ensure resilience, fault tolerance, and seamless operations. So let's dive in and discover how resilience is engineered by design.

While microservices offer undeniable benefits like modularity, scalability, and fault isolation, the road to adopting this architecture is far from straightforward. Transitioning from a monolithic system to an event-driven microservices architecture comes with its own set of challenges. These include handling data consistency, managing distributed systems, and ensuring system resilience during the migration. This is where SRE principles and data-driven strategies play a crucial role in overcoming these challenges.

Let's get into some facts. Migrating to a new architecture is no small feat, and the numbers speak for themselves about the challenges enterprises encounter. 68% of enterprises identify stability concerns as their primary migration challenge. Downtime or disruptions during this process can severely impact business operations, making reliability a top priority.
43% of migrations fail on the first attempt due to insufficient architecture planning. A rushed or poorly designed approach often leads to incomplete transitions or disruptions. 37% of projects face cost overruns. Traditional migration methods often underestimate the complexity, leading to unplanned expenditure.

With stability concerns, first-time migration failures, and cost overruns posing significant challenges, it's clear that traditional migration methods often fall short. So how can we overcome these hurdles and ensure a smooth transition? The answer lies in adopting a well-defined, data-driven approach. Enter our event-driven migration framework. This framework is designed to guide enterprises through the complexities of migrating from monoliths to resilient, scalable, event-driven microservices. It addresses these challenges head on by combining robust architecture principles with real-time data insights. So let's explore how this framework can pave the way for successful, future-proof migrations.

Let's start at the base: domain-driven design. Domain-driven design is a strategic methodology for modeling complex systems by aligning software architecture with business needs. It emphasizes the use of bounded contexts to create clear service boundaries that encapsulate specific business logic, ensuring the modularity and scalability essential for microservices migration. Additionally, a ubiquitous language fosters seamless communication between developers and domain experts, reducing misunderstanding and improving collaboration. By leveraging tools like context mapping and event storming, domain-driven design simplifies identifying subdomains and their interactions, enabling a structured approach to managing complexity and building resilient systems. This makes domain-driven design a vital framework for transitioning to an event-driven microservices architecture. Next comes event sourcing.
Event sourcing is a technique that preserves the entire system history through immutable event logs, allowing every change to be recorded. These logs are invaluable for reliable state reconstruction, making systems robust and fault tolerant by enabling precise recovery of the system state at any point in time. Key use cases include debugging issues, auditing for compliance, and replaying events to recover from failures or errors. Tools and frameworks like Apache Kafka, EventStore, and Axon Framework provide the necessary infrastructure to implement event sourcing efficiently, ensuring a resilient and auditable architecture.

Next are our event-driven patterns. Event-driven architecture is a design approach that ensures responsiveness and reliability in distributed systems. By enabling components to communicate through events, it fosters decoupling, making systems more scalable and adaptive. Architectural safeguards like the circuit breaker pattern prevent cascading failures by isolating problematic services, while bulkheads compartmentalize workloads to protect critical resources. Additionally, fault isolation mechanisms ensure that a failure in one microservice doesn't impact others, maintaining overall system stability. These patterns not only enhance resilience but also align with the SRE principles of reliability, fault tolerance, and proactive problem mitigation.

Next is continuous delivery. Continuous delivery is essential for resilience; it also aligns with the SRE principles of reliability, fault tolerance, and proactive problem mitigation, while reducing time to market for updates. Automated testing serves as the foundation of safe deployments, ensuring high code quality and catching issues early in the development cycle. Rollback capabilities further ensure quick restoration of system functionality in case of deployment failures. Tools like Jenkins, GitLab, and AWS CodePipeline
are widely used to implement these practices effectively, driving stability and innovation in development processes.

Distributed systems are the backbone of modern applications, but they come with their own share of complexities, complexities that often challenge the site reliability engineers tasked with ensuring stability and reliability. From managing state changes to debugging issues and ensuring fault tolerance, maintaining uptime in such systems requires robust tools and approaches. This is where event sourcing shines as an alternative pattern that not only reduces complexity but also aligns perfectly with SRE principles. We'll now explore how event sourcing can transform the way we manage system state and reliability, focusing on capturing events, building event logs, reconstructing state, and enabling projections.

Event sourcing aligns seamlessly with SRE principles by reducing system complexity and enhancing reliability. It begins with capturing events: recording all state changes as immutable, timestamped objects to ensure accurate tracking and observability. This is followed by building an append-only event log, which serves as the system's definitive source of truth, providing resiliency and auditability through state reconstruction. The current application state is derived by sequentially processing the event stream, enabling swift debugging and efficient incident response, which is critical for maintaining uptime. Lastly, projections allow tailored data models to be generated from the same event sequence, improving operational insights and empowering SREs with actionable metrics to proactively manage reliability and performance. Together, these practices create a robust, fault-tolerant architecture well suited to SRE goals.

So we have explored how event sourcing reduces complexity and enhances reliability by aligning with the core principles of site reliability engineering.
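
The four steps described above (capturing events, an append-only log, state reconstruction, and projections) can be sketched in a few lines of Python. This is only an illustrative, in-memory sketch and not the implementation used in the case study: the `Event`, `EventLog`, `rebuild_stock`, and `units_sold_projection` names, the inventory domain, and the `StockReceived` / `StockSold` event types are all assumptions made for the example. A production system would persist the log in something like Kafka or an event store.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    """One state change, recorded as an immutable, timestamped object."""
    sku: str
    kind: str  # hypothetical event types: "StockReceived", "StockSold"
    quantity: int
    occurred_at: datetime


class EventLog:
    """Append-only log: events can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, sku: str, kind: str, quantity: int) -> Event:
        event = Event(sku, kind, quantity, datetime.now(timezone.utc))
        self._events.append(event)
        return event

    def read(self, upto: Optional[int] = None) -> List[Event]:
        # Return a copy so callers cannot mutate the log itself.
        return list(self._events[:upto])


def rebuild_stock(events: List[Event]) -> Dict[str, int]:
    """Reconstruct state by replaying events in order."""
    stock: Dict[str, int] = {}
    for e in events:
        delta = e.quantity if e.kind == "StockReceived" else -e.quantity
        stock[e.sku] = stock.get(e.sku, 0) + delta
    return stock


def units_sold_projection(events: List[Event]) -> Dict[str, int]:
    """A second read model derived from the same event sequence."""
    sold: Dict[str, int] = {}
    for e in events:
        if e.kind == "StockSold":
            sold[e.sku] = sold.get(e.sku, 0) + e.quantity
    return sold


log = EventLog()
log.append("sku-1", "StockReceived", 10)
log.append("sku-1", "StockSold", 3)
log.append("sku-2", "StockReceived", 5)
log.append("sku-1", "StockSold", 2)

current = rebuild_stock(log.read())                   # state after all events
as_of_second_event = rebuild_stock(log.read(upto=2))  # state at an earlier point
sold = units_sold_projection(log.read())              # alternative view, same log
```

Replaying only a prefix of the log (`upto=2`) is what makes point-in-time recovery and debugging possible: the earlier state is never stored, only recomputed from the events.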
But how does this play out in practice? To truly understand its impact, let's dive into a real-world example: the migration of a Fortune 500 retail company's system from a monolithic to an event-driven microservices architecture. This case study highlights the practical application of event sourcing in solving challenges related to system downtime, scalability, and incident response. It showcases how SRE practices, combined with event-driven patterns, enabled the company to achieve operational excellence, reduce complexity, and build a system resilient enough to handle millions of daily transactions. Let's look at the challenge we faced, the solution we implemented, and the key outcomes from this migration.

So let's take a closer look at how site reliability engineering principles guided the migration journey from a monolithic to an event-driven architecture. This transformation was carried out in four phases.

First, during the assessment phase, the team modeled domains and identified service boundaries. This strategic effort not only broke down the monolithic system into manageable components but also ensured modularity, a critical factor for minimizing risk and maintaining operational stability.

Second, the focus shifted to event schema design, where the team created 47 distinct event types along with a versioning strategy. By structuring event data with clarity, we laid the foundation for precise tracking and debugging, empowering SREs with enhanced observability for issue resolution.

The third phase, which ran in parallel,
saw the development of microservices alongside the legacy monolithic application, ensuring the existing system remained functional during the transition. This dual approach aligned with the SRE principles of minimizing disruption and maintaining uptime while introducing new components.

Finally, the progressive migration phase enabled incremental traffic shifting to the new architecture with zero downtime. This gradual rollout ensured stability, allowing SREs to monitor performance and address any issues proactively while keeping reliability intact. These four phases demonstrate the power of combining event-driven architecture with SRE practices to create a system that is both scalable and resilient. So now let's explore the measurable outcomes of this migration and the tools that made it possible.

Event schema design is crucial for ensuring reliability and scalability in an event-driven architecture. There are four key best practices. Explicit versioning incorporates schema versions in event metadata to maintain compatibility during updates. Domain-aligned events leverage the ubiquitous language of the business domain for naming events, ensuring clarity and alignment with domain logic. Temporal context embeds creation timestamps and causal metadata within the events, enabling accurate tracking and sequencing. Lastly, events should be self-contained, including all necessary context within the payload to minimize dependencies and streamline processing. These practices collectively enhance observability, fault tolerance, and operational efficiency.

Having established the importance of well-designed event schemas, the next step is to explore how to enhance system reliability with resilience patterns. Resilience patterns such as circuit breakers, bulkheads, and fault isolation mechanisms are integral to building robust event-driven systems. For SREs, these patterns not only help mitigate failures
but also align with the core principles of uptime, observability, and proactive incident management. So let's now dive into the resilience patterns, discussing their role in ensuring high availability and how they can be effectively implemented in a distributed architecture.

Resilience patterns are essential for maintaining system stability, reliability, and fault tolerance in distributed architectures. The circuit breaker pattern prevents system overload by failing fast when dependencies are unhealthy, reducing cascading failures by 85% and enabling automatic recovery with configurable thresholds for each service. The bulkhead pattern isolates components to contain failures within bounded contexts, utilizing resource pool isolation, thread pool segregation, and request limiting to protect critical services. Meanwhile, retry with backoff handles transient failures through intelligent retry mechanisms, leveraging exponential backoff algorithms, jitter for load distribution, and dead-letter queues to manage unprocessed events effectively. Together, these patterns align with SRE principles by improving system stability, scalability, and operational efficiency.

With the resilience patterns in place, the next critical step for ensuring system reliability is real-time monitoring. In an event-driven architecture, real-time monitoring provides the visibility required to maintain operational excellence and proactively address issues before they impact users. From an SRE perspective, this strategy proves pivotal in tracking key metrics, detecting anomalies, and ensuring adherence to service level objectives. So let's delve into how real-time monitoring complements resilience patterns and supports a proactive approach to system reliability.

Real-time monitoring strategies align with SRE principles by enhancing visibility and enabling
proactive reliability management. Aggregation centralizes metrics while preserving context through correlation IDs, ensuring accurate analysis across services. Alerting triggers notifications in response to service level objective violations, enabling rapid intervention before issues escalate. Investigation facilitates tracing request flows across distributed services, helping pinpoint failures and streamline debugging efforts. Lastly, instrumentation embeds telemetry in every service and event flow, providing continuous insights into system performance and health. Together, these practices ensure effective monitoring and operational excellence in complex architectures.

Now that we have explored the strategies for real-time monitoring, it's important to evaluate how different migration methodologies stack up in the context of reliability and scalability. From an SRE perspective, selecting the right migration approach is crucial to minimizing downtime, maintaining system stability, and meeting service level objectives. In the next section, we will compare popular migration methodologies, analyzing their strengths, limitations, and alignment with SRE principles to help determine the best fit for a resilient architecture.

When evaluating migration methodologies through the lens of reliability and scalability, it's clear that traditional approaches and event-driven strategies often produce vastly different outcomes. The traditional approach often relies on a big-bang cutover strategy, which involves extended downtime windows, migrating a monolithic database, and relying on manual verification processes. While this method can work in some cases, it has limited rollback capabilities, posing significant risk to system stability. On the other hand, our event-driven approach prioritizes incremental service migration,
enabling zero-downtime deployment and ensuring system availability throughout the transition, with data synchronization powered by events, automated canary analysis for gradual testing, and instant rollback mechanisms. This methodology aligns with our SRE principles and delivers the required results. By adopting this event-driven migration strategy, organizations can minimize risk, maintain uptime, and achieve a more scalable and reliable architecture.

With the migration methodologies compared, the next step is to outline a clear path of execution through an implementation roadmap. A well-defined roadmap ensures a structured transition. So let's walk through the key stages of our implementation roadmap, which is designed to ensure reliability and scalability while adhering to SRE principles.

First is domain analysis. We begin by mapping business domains to bounded contexts and identifying clear service boundaries. This ensures modularity and lays a solid foundation for decomposing the monolith into microservices.

Then comes event storming. This step involves close collaboration with domain experts to document core events and commands. This phase ensures that the architecture aligns with real-world business processes, enabling clearer modeling of system behavior.

Next is schema design. Here we define event schemas and implement a compatibility strategy with versioning to handle future changes. This step is crucial for maintaining system integrity and ensuring consistent communication between services.

Then comes infrastructure setup. With schemas ready, we deploy our event broker, set up an observability platform, and create robust continuous integration and continuous deployment pipelines.
These elements form the backbone of a reliable and scalable event-driven architecture.

And then there is incremental migration. Finally, we migrate one bounded context at a time using progressive delivery. This approach allows us to validate each step, ensuring zero downtime while maintaining system stability, and that is what we are doing as site reliability engineers.

Thank you, everyone, for joining today. In conclusion, we have explored the transition to event-driven architecture, emphasizing the importance of SRE principles in ensuring reliability, scalability, and resilience. From implementation roadmaps to migration methodologies and resilience patterns, these strategies collectively pave the way for modern, high-performing systems. Thank you for your time and attention. It's been a pleasure discussing these impactful practices with you. Wishing you continued success in driving innovation and reliability in your systems. Thank you.
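
As a companion to the resilience patterns covered in the talk, the retry-with-backoff pattern (exponential backoff, jitter, and a dead-letter queue) might look roughly like the Python sketch below. It is a minimal illustration under stated assumptions, not the speaker's implementation: `backoff_delay`, `process_with_retry`, the default `base`, `cap`, and `max_attempts` values, and the use of a plain list as the dead-letter queue are all hypothetical choices made for the example. The "full jitter" variant (a random delay between zero and the exponential bound) is one common way to spread retries out.

```python
import random
import time
from typing import Callable, Dict, List


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full-jitter delay: a random value in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def process_with_retry(event: Dict, handler: Callable[[Dict], None],
                       dead_letters: List[Dict], max_attempts: int = 4,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call the handler up to max_attempts times; if every attempt fails,
    park the event in the dead-letter list and return False."""
    last_error = ""
    for attempt in range(max_attempts):
        try:
            handler(event)
            return True
        except Exception as exc:  # a real system would retry only transient errors
            last_error = repr(exc)
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
    dead_letters.append({"event": event, "error": last_error})
    return False


# Demo with hypothetical handlers: one fails twice and then succeeds,
# the other never recovers.
calls = {"n": 0}


def flaky_handler(event: Dict) -> None:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("broker unavailable")


def always_failing(event: Dict) -> None:
    raise ConnectionError("still down")


def no_sleep(seconds: float) -> None:
    """Stand-in for time.sleep so the demo does not actually wait."""


dlq: List[Dict] = []
ok = process_with_retry({"id": 1}, flaky_handler, dlq, sleep=no_sleep)
failed = process_with_retry({"id": 2}, always_failing, dlq, sleep=no_sleep)
```

Passing `sleep` and `rng` as parameters keeps the retry logic testable without real waiting or randomness. Events that land in the dead-letter list stay available for later inspection or replay instead of being silently dropped.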

Amlan Ghosh

@ Biju Patnaik University of Technology


