Conf42 Observability 2025 - Online

- premiere 5PM GMT

Building Resilient Asynchronous Commerce Systems: Observability Lessons from Event-Driven Architectures


Abstract

Discover how observability transforms asynchronous commerce platforms. Uncover the secrets to building resilient systems with event-driven insights. Learn to master asynchronous workflows, distributed tracing, and real-time diagnostics for reliable digital transactions.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey, good morning everyone. This is Anup. I'm a Senior Technical Product Manager at T-Mobile, where I lead initiatives around large-scale digital commerce systems serving millions of customers across multiple platforms: web, mobile, and in store. Today I will share how observability is at the core of making async, event-driven commerce systems reliable and scalable. We will look at how we have approached this at T-Mobile, from architectural patterns to debugging techniques, and the key lessons we have learned along the way. Whether you're building distributed systems, managing reliability, or designing customer journeys, I hope this talk gives you practical takeaways you can apply to your own async environment.

Let me start with a statement: we are living in a world where milliseconds definitely matter. Say a customer is shopping on a website and hits a delay at checkout. He can decide to abandon the cart, and we can potentially lose that customer. We live in a world where conversion rates matter, and observability is no longer optional. It's foundational.

Let me give you an overview of why we had to shift to async to meet omnichannel demands, and why observability matters most to me: because it's the key to making that shift sustainable. Digital shoppers today expect everything to be fast, personalized, and seamless, whether they're online, in store, or on mobile. That level of responsiveness pushes us away from synchronous architectures that block or fail under pressure. The challenge is: how do we maintain visibility and trust when everything runs asynchronously and independently? That's where observability becomes critical. It's our lens into this complex world.

Synchronous systems are like a daisy chain: one broken link and everything fails. When we switch to async, we gain scale and flexibility, but at the same time we lose the comfort of linear flows. Failures don't scream in async flows; they whisper. And if you're not listening with the right tools, you miss them. That's the trade-off we must solve.

The rise of asynchronous commerce: why are we shifting from synchronous to asynchronous? The main reasons are demand and personalization, and the rise of headless microservices and event brokers. Omnichannel isn't just a buzzword; it's how customers behave now. I'll give you an example: someone browses on mobile, adds an item to the cart, then walks into a store and wants to check out with an agent. This means our systems must be loosely coupled but tightly coordinated. Async communication, through microservices and event-driven designs, helps us achieve this. But now we need visibility across all those hops.

Let me walk through a use case we have implemented, and are extending, at T-Mobile. The busiest time at T-Mobile is during NPI; NPI is a new product introduction, especially in September when Apple launches a device. We get almost a hundred thousand orders that day. It's our busiest day, ahead of Thanksgiving, Christmas, or New Year. In one of the NPIs last year, in the first fifteen minutes, the flood of orders took the payment system down.
We were in a state where orders could not be submitted. The way T-Mobile works today, you build a cart, go to checkout, make the payment, and only then can you submit an order. We had built circuit breakers, which kept the failure from cascading through the rest of the system, but the payment step itself was broken. Customers were stuck at a message saying, "Oops, something went wrong," and were not able to submit their orders. That made us think: do we really need payments to be synchronous? We can definitely go async.

That's when we decided to start, and async is not something new to us. Today we already use async for BOPIS, buy online, pick up in store. At T-Mobile we implemented event-driven architectures to support BOPIS, and we are designing an async payment flow that we hope to ship by this NPI. These flows are unpredictable: like I said, a payment may succeed in three seconds or time out after thirty. For BOPIS, inventory is the critical thing, because inventory can change mid-transaction. So for BOPIS we make a soft reservation call for inventory, and when the customer walks into the store, we make a hard reservation call. All of these calls used to be tightly coupled. That's when we thought: let's make this asynchronous, so customers can still be served, and use events to coordinate all of it. But it only works if we can see those events happening in real time.

So what makes async hard to observe? With regular synchronous calls, we can log the request and response, or debug them with tools like Splunk. Async flows are invisible flows, and there aren't many tools that were designed to support async. Imagine trying to follow a story where each sentence is told by a different person at a different time. That's what debugging async flows looks like: lots of events keep arriving, you listen to them, and you process them. That lack of causality is what makes traditional monitoring ineffective. We need tools that capture relationships, timing, and intent across systems.

To me, observability has four pillars. Distributed tracing: trace events across services using span correlation; you need to correlate them. Event audits and replay: reconstruct workflows for debugging and root cause analysis; when something goes wrong, you should be able to reconstruct the workflow, debug it, and find the RCA. Consistency monitoring: validate that business outcomes are eventually completed. And domain dashboards, which provide visibility using business-centric metrics and language. In async systems, observability isn't about tracking just technical metrics. It must reflect the business journey end to end.

Here's how we approach it through those four key pillars. For distributed tracing, at T-Mobile we use tools like OpenTelemetry to inject trace context into requests and events. This lets us correlate spans across microservices, even when communication is event-based.
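To make that concrete, here is a minimal sketch of trace-context propagation through an event, using the standard OpenTelemetry Python `inject`/`extract` API with kafka-python. The topic name, broker address, and the `handle_order_event` handler are hypothetical stand-ins, not T-Mobile's actual services.

```python
# Sketch: propagating trace context across an async Kafka hop, assuming
# opentelemetry-api/sdk and kafka-python are installed and a tracer
# provider is configured. Topic and handler names are illustrative.
import json
from kafka import KafkaConsumer, KafkaProducer
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("commerce.orders")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

def handle_order_event(event: dict) -> None:
    print(f"processing order {event['order_id']}")  # hypothetical business logic

def publish_order_event(order_id: str) -> None:
    # Producer side: open a span, then inject its context into the
    # message headers so downstream consumers can join the same trace.
    with tracer.start_as_current_span("order.submitted"):
        carrier: dict[str, str] = {}
        inject(carrier)  # writes W3C traceparent/tracestate into the dict
        headers = [(k, v.encode("utf-8")) for k, v in carrier.items()]
        payload = json.dumps({"order_id": order_id}).encode("utf-8")
        producer.send("orders", value=payload, headers=headers)

def consume_order_events() -> None:
    consumer = KafkaConsumer("orders", bootstrap_servers="localhost:9092")
    for message in consumer:
        # Consumer side: extract the upstream context so this span is
        # linked to the original checkout request, not a fresh trace.
        carrier = {k: v.decode("utf-8") for k, v in (message.headers or [])}
        ctx = extract(carrier)
        with tracer.start_as_current_span("order.process", context=ctx):
            handle_order_event(json.loads(message.value))
```

With the context carried in headers like this, every span in the consumer ties back to the single customer transaction that started it, which is the correlation the talk relies on.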
When something breaks, we can follow the trace path and pinpoint the exact hop where it failed or slowed down. That's the backbone for debugging async flows.

Coming to event audits and replay: async flows don't leave a clean log trail. So we log structured events with context: user ID, order ID, timestamps, event type, and so on. This allows us to replay what happened during an issue. Not only does this help with RCA, we can also retry failed processes without impacting other parts of the system.

For consistency monitoring, what we did was validate that the entire business transaction completed, not just that the services are running. For example, what happens when an order is placed? We reserve inventory, the payment is confirmed, and then a shipment is initiated, or not. We track the completion of these workflows across time, services, and systems, even when events arrive out of order.

For domain dashboards, we build dashboards that use business language: whether the pickup is done, which state the payment is in, whether it's in retry, whether the inventory check has been done. These make observability actionable for non-engineers: product managers, support teams, and operations can understand what's going on without diving into logs and traces. Domain dashboards are really helpful to all the team members.

Let me walk you through a BOPIS flow. This is what typically happens when a customer places an order: we trigger an event to validate the inventory, we trigger events to notify the store that an order is coming, and we trigger an event for pickup scheduling. These are all independent services working in parallel. By contrast, if someone logs into T-Mobile.com and places a regular order, it's all one by one: once the order is submitted, the payment is settled, then it goes to the warehouse, and the warehouse takes its own sweet time to dispatch the item. Whereas with BOPIS everything is async; well, payment is not async yet, but the store notifications are async, the validations are async, the pickup scheduling, everything is async.

And for async payment, like I mentioned, the main problem we faced during the last NPI was that orders were stuck because the system went down under an unexpected volume of orders. How can we solve it? Retrying is one option, but when the system is down, retrying would not help. So what we're thinking is: we want to decouple payment processing from order submission. Our plan is, when a customer makes a payment, since it can take longer than expected, we put it on a queue and process it asynchronously, and order submission also becomes async. Only after the payment is successful do we process the order. But if the payment is not successful, we trigger a notification to the customer, a text message or an email, asking them to re-enter the payment details. Only after the payment is confirmed is the order submitted. This way both payment processing and order submission are async, and we generate an order number and give it back to the customer right away.
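A minimal sketch of that decoupling, assuming an in-process queue standing in for the real message broker; the `charge_card` gateway call, order release, and notification helpers are hypothetical stand-ins for illustration, not the production design.

```python
# Sketch: decoupling order submission from payment processing.
# queue.Queue stands in for a real broker such as Kafka.
import queue
import uuid

payment_queue: "queue.Queue[dict]" = queue.Queue()

# Hypothetical stand-ins for the payment gateway, order service, and notifier.
def charge_card(card: str, amount: float) -> bool:
    return True  # pretend the charge succeeded

def release_order(order_id: str) -> None:
    print(f"order {order_id} submitted downstream")

def notify_customer(order_id: str, message: str) -> None:
    print(f"notify {order_id}: {message}")

def submit_order(cart: dict) -> str:
    # Accept the order immediately and hand back an order number; the
    # payment settles asynchronously in the background worker below.
    order_id = str(uuid.uuid4())
    payment_queue.put({"order_id": order_id, **cart})
    return order_id

def payment_worker() -> None:
    # Only a successful charge releases the order; a failed charge asks
    # the customer to re-enter payment details instead of blocking checkout.
    while True:
        task = payment_queue.get()
        if charge_card(task["card"], task["amount"]):
            release_order(task["order_id"])
        else:
            notify_customer(task["order_id"], "Please re-enter your payment details")
        payment_queue.task_done()
```

The design choice here is that checkout latency no longer depends on the payment gateway: a surge or outage grows the queue instead of turning away customers.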
The customer doesn't know what's happening in the background. He just knows that an order was submitted, and he gets an email saying, "Your order is submitted and we are working on it." In the background we validate the payment, and only if it goes through do we submit the order. If the payment doesn't go through, we send the customer a notification to re-enter the card details. This way customers are not blocked, and we are able to capture the orders and process them.

With async payments, the best thing is that customers get instant feedback. After the order is placed, behind the scenes we check the card's validity, we check for fraud, and we check whether the card has funds. If something fails, we fall back without disrupting the user flow. Here, observability means monitoring the state machines, tracing failures, and ensuring consistency without sacrificing speed.

We've spoken a lot about async; we really need tools and techniques to trace it. I'll go through a few tools we use at T-Mobile to track these events. OpenTelemetry is used to capture and correlate spans across distributed services and events. It's our go-to standard for distributed tracing. It allows us to inject trace IDs and span context into each service call and event message. Even when systems communicate asynchronously, whether it's an API call, an internal microservice call, or a Kafka message, we tie them all back to a single customer transaction. This is critical for finding out what went wrong, especially in complex flows. We have SDKs across the stack: in the web UI we use Node.js, and elsewhere Java, Python, and so on. In the commerce systems we use Jaeger, and we use Grafana a lot for our backend services. And when services are hosted on AWS, we use AWS X-Ray.

Kafka is central to our async architecture, because it acts as the backbone for event transport. Here too, what matters is observability. What do we monitor? Queue depth, to detect backlogs; consumer lag, to catch slow or stalled services; and throughput, to understand the system load. The tools we use here: Kafka Manager, which manages topics, queues, and clusters (it's deprecated, I know, but we used to use it); Grafana, to visualize Kafka throughput, partition lag, and message drops; and Burrow, a consumer lag monitoring tool for Kafka, along with error rates and retries to catch failing event handlers. These metrics help us detect not just failures but also performance degradation before it impacts users.
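As one example of the consumer-lag signal, here is a small sketch, not our Burrow setup, that computes lag per partition with kafka-python; the broker address, topic, and group names are illustrative.

```python
# Sketch: measuring consumer lag per partition with kafka-python.
# Lag = latest broker offset minus the group's committed offset; lag
# that keeps growing usually means a slow or stalled consumer.
from kafka import KafkaConsumer, TopicPartition

def consumer_lag(topic: str, group_id: str) -> dict[int, int]:
    consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                             group_id=group_id, enable_auto_commit=False)
    partitions = [TopicPartition(topic, p)
                  for p in consumer.partitions_for_topic(topic) or []]
    end_offsets = consumer.end_offsets(partitions)  # latest offset per partition
    lag = {}
    for tp in partitions:
        committed = consumer.committed(tp) or 0  # last committed offset, if any
        lag[tp.partition] = end_offsets[tp] - committed
    consumer.close()
    return lag

# e.g. alert when total lag across partitions crosses a threshold:
# total = sum(consumer_lag("orders", "payment-workers").values())
```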
We also have custom event replay dashboards. These are internal dashboards that help us track an order end to end, from adding an item to the cart until it's checked out. Here we use Kibana, Metabase, Power BI, and Grafana as well. The sources are business event logs from Kafka consumers and event gateways, plus our telemetry systems.

And how do I measure success when I move from sync to async? Today we support both models. On average we have about 7 million customers logging into T-Mobile.com; that's roughly how many tokens we register. But out of those 7 million, only about 130K carts are created, and out of 130K carts the actual orders come down to about 35K; almost a one-in-four cart-to-order ratio. When we look at the BOPIS orders, where we went async, checkout is faster and conversion is much better, closer to three in four rather than one in four. This led to lower abandonment. For BOPIS, I can trace that the cart-to-order conversion rate is significantly higher. And with async, during peak times like NPI, we are 97% available. Even during promotional peaks, when we run offers like buy one, get one, to attract customers, we were close to a hundred percent available. The results speak for themselves: faster checkouts mean higher conversion, the lower abandonment rate shows how we improved trust, and high uptime during traffic spikes proves that resilience and scalability can go hand in hand if observability is done right.

Again, when you move from sync to async, you have to make trade-offs. I hope you all know what the CAP theorem means: C for consistency, every read returns the most recent write; A for availability, every request receives a response; and P for partition tolerance, the system continues to operate despite network failures. You have to trade off: when you have to be consistent, your availability might be lost. On synchronous systems you'll be consistent, because you wait; it takes ten or fifteen seconds, you get that response back, and the customer is present. It's very much consistent, but availability, like I said, during peak time is gone: if you bombard the system with a lot of requests, the system goes down. When you move from sync to async, I would not say you have to compromise consistency a hundred percent, but you might have to make a little compromise. Since we do everything asynchronously, what happens? A customer adds a device and an accessory and submits an order, but by the time he comes to the store, if the store has no accessories, then we have to trade off, saying, "Hey, you can only buy the device, because the accessory went out of stock." It's a little inconsistent, but we still try to keep the inventory up to date and make sure the customer gets what he orders. Again, it's a trade-off between availability and consistency.

So what do we do in our solution when we move from sync to async? The important thing is idempotency. If some failure happens, we have to suppress duplicate side effects. Say a payment failed: you shouldn't keep charging the customer; you should not charge three times.
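A minimal sketch of that dedupe, assuming events carry a stable key such as order ID plus event type; the in-memory set stands in for a durable store (a database table or Redis), and `capture_payment` is a hypothetical charge call.

```python
# Sketch: idempotent event handling keyed on (order_id, event_type).
# An in-memory set stands in for a durable dedupe store; in production
# the key check and the side effect should commit atomically.
processed: set[tuple[str, str]] = set()

def capture_payment(order_id: str, amount: float) -> None:
    print(f"captured {amount} for order {order_id}")  # hypothetical charge call

def handle_payment_confirmed(event: dict) -> None:
    key = (event["order_id"], event["type"])
    if key in processed:
        # Duplicate delivery (e.g. a retried message): skip the side
        # effect so the customer is never charged twice.
        return
    processed.add(key)
    capture_payment(event["order_id"], event["amount"])

# A redelivered event is absorbed harmlessly:
handle_payment_confirmed({"order_id": "42", "type": "payment.confirmed", "amount": 99.0})
handle_payment_confirmed({"order_id": "42", "type": "payment.confirmed", "amount": 99.0})
```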
Make sure that if the customer is charged once, you don't charge him again. Then versioning: tracking changes and ensuring the correct state in the solution. And compensating transactions: if something goes wrong, we undo or mitigate the failed steps.

One of the core trade-offs in distributed systems is explained very well by the CAP theorem. It tells us we can only choose two of three guarantees: consistency, availability, and partition tolerance. Between sync and async: if it is sync, then it is consistent and partition tolerant; if it is async, then it is available and partition tolerant. And in high-scale, event-driven commerce, partition tolerance is non-negotiable: failures will happen. So we typically prioritize availability, meaning we must relax strict consistency. This leads to eventual consistency: data will converge, but not instantly.

To manage this effectively, like I said, we use a lot of techniques. We design the services to be idempotent so that repeated events don't cause duplicate effects. For example, if a payment confirmation is sent twice, we shouldn't charge the customer again; we ensure that the retries are safe. Then versioning: we attach version numbers to entities like orders and payments. This helps us detect and resolve race conditions. If two systems try to update the same entity simultaneously, the version ensures only the correct update is applied. And then compensating transactions: when things go wrong, we don't roll back. Instead, we emit events to undo or mitigate the issue. For example, if an order was confirmed but the payment later failed, we emit a cancellation event and restore the inventory.

Like I said, observability plays a critical role here. It alerts us when events fall out of sync, retries are piling up, or state transitions don't complete. Without observability, we wouldn't know consistency is broken until a customer complains. With it, we can react in near real time and fix issues proactively.

Now, how do we really test async systems? How do we simulate delays and message losses? How do we restart components and validate retry logic, fallbacks, and idempotency? Testing asynchronous systems is fundamentally different from testing traditional applications. It's not just about checking individual services or APIs; it's about testing system behavior over time, under pressure, and across boundaries. Let's break it down. Simulated failures: we simulate real-world scenarios such as delayed consumers. What happens if a service is slow to process a message? What happens if a message is lost, if an event never reaches its destination due to a transient outage? And service restarts: how does the system behave when one component restarts mid-event? This helps validate that our retry, deduplication, and fallback mechanisms actually work in practice, not just in theory.
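One way to exercise those scenarios, a sketch only, is to wrap a handler in a flaky transport that randomly drops or delays messages, then assert that bounded retries still converge; the drop rate, delay, and handler here are made up for illustration.

```python
# Sketch: simulating message loss and consumer delay to validate retries.
# The flaky transport is illustrative, not a real broker integration.
import random
import time

def flaky_deliver(handler, event: dict, drop_rate=0.3, max_delay=0.2) -> bool:
    """Deliver an event, randomly dropping or delaying it."""
    if random.random() < drop_rate:
        return False  # simulated message loss
    time.sleep(random.uniform(0, max_delay))  # simulated slow consumer
    handler(event)
    return True

def deliver_with_retries(handler, event: dict, attempts: int = 5) -> bool:
    # The property under test: bounded retries should get the event
    # through despite simulated loss and delay.
    for _ in range(attempts):
        if flaky_deliver(handler, event):
            return True
    return False

def test_retry_converges():
    random.seed(7)  # make the simulated failures reproducible
    seen = []
    ok = deliver_with_retries(
        seen.append, {"order_id": "42", "type": "payment.confirmed"})
    assert ok and len(seen) == 1  # delivered exactly once despite failures

test_retry_converges()
```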
For monitoring, the essentials: async observability means watching the system as a connected graph of event flows, not isolated services. We focus on queue depth, where large backlogs may indicate slow consumers or broken processing; processing latency, whether events are being handled within acceptable time windows, and if not, where the bottleneck is; and causal chains, whether we can trace a customer's order from placement to fulfillment, where missing links could mean dropped events or unhandled failures. A common pitfall is thinking a green dashboard means everything is okay. In async systems, services might be up but the flow broken. That's why we don't just test endpoints; we test the system end to end.

Let me wrap up with the major lessons we have learned from building and scaling async commerce systems. First, design for traceability. I always emphasize this: when you're building a system, design it around how you want to trace it. Tracing isn't something you can bolt on later. We often feel we can implement the design first and do tracing later, but for async I recommend thinking about tracing first: how do you want to trace the events? It has to be intentional. This means using consistent trace IDs, structured logging, and propagating context across services and event handlers from the start. If you skip this early, you will struggle to debug, monitor, and support your system. Trust me, it's very hard to think about tracing only then. So always design with tracing in mind.

Second, accept and plan for eventual consistency. In distributed systems, chasing perfect, real-time consistency will hurt your availability and scalability. Instead, we have learned to embrace eventual consistency and engineer around it using strategies like idempotent operations and compensating transactions. Observability is what keeps us honest there. It lets us know when consistency breaks and helps us fix it proactively.

Third, align metrics to business events. This is key for making observability useful beyond engineering. Developers care about logs, spans, and errors, but product managers and support teams care about orders not shipping, payments stuck in retry, or events not syncing. So build dashboards and alerts that speak that language. That's how you turn observability from a dev tool into an organizational asset.

I would like to conclude by saying that async commerce architectures give us what modern demand requires: scalability, flexibility, and resilience. They allow different parts of the business to move at their own pace. Orders can be taken while payments are still processing, inventory can update independently, and notifications can be retried without blocking the user. But here's the catch: that flexibility comes with complexity. Things happen in parallel, in different systems, at unpredictable times. If we can't see across these flows, we are flying blind. That's why observability isn't just a technical layer; it's a strategic enabler. It's what allows us to build fast, recover faster, and continuously improve. When observability is built in, we don't just operate; we understand what went wrong, why it went wrong, and what to do next. That's how we maintain trust in the customer experience, even when the backend gets messy. So as we build asynchronous, event-driven commerce systems, let's make sure observability stays a first-class citizen. It's the difference between hoping things work and knowing they do.
Thank you very much for patiently listening. Please let me know when you're building anything async; I'm happy to help. I hope this session gives you practical insights about building and operating async commerce systems with confidence. And if you're working on similar architectures or facing any challenges, I would love to continue this conversation. Feel free to reach me via email or connect with me on LinkedIn. Thank you. Bye-bye. Have a good day.

Anup Raja Sarabu

Senior Technical Product Manager @ T-Mobile



