Transcript
Hey, good morning everyone. This is Anbu. I'm a Senior Technical Product Manager at T-Mobile, where I lead initiatives around large-scale digital commerce systems serving millions of customers across multiple platforms: web, mobile, and in store.
Today I will share how observability is at the core of making async, event-driven commerce systems reliable and scalable. We will look at how we have approached this at T-Mobile, from architectural patterns to debugging techniques, and the key lessons we have learned along the way. Whether you're building distributed systems, managing reliability, or designing customer journeys, I hope this talk gives you practical takeaways you can apply to your own async environment.
Let me start with a statement: we live in a world where milliseconds definitely matter. Say a customer is shopping on a website and hits a delay while checking out. They can decide to abandon the cart, and we can potentially lose that customer. We live in a world where conversion rates matter, and observability is no longer optional. It's foundational.
Let me give you an overview of why we had to shift to async to meet omnichannel demands, and what that shift from sync to async requires. The thing that matters most to me is observability, because that's the key to making the shift sustainable.
Today, digital shoppers expect everything to be fast, personalized, and seamless, whether they're online, in store, or on mobile. That level of responsiveness pushes us to move away from synchronous architectures that block or fail under pressure. The challenge is: how do we maintain visibility and trust when everything runs asynchronously and independently? That's where observability becomes critical. It's our lens into this complex world.
Synchronous systems are like a daisy chain: one broken link and everything fails. When we switch to async, we get scale and flexibility, but at the same time we lose the comfort of linear flows. Failures don't scream in async flows; they whisper. And if you're not listening with the right tools, you miss them. That's the trade-off we must solve.
Now, the rise of asynchronous commerce. Why are we shifting from synchronous to asynchronous? The main reasons are demand and personalization, and the rise of headless microservices and event brokers. Omnichannel isn't just a buzzword; it's how customers behave.
I'll give you an example: someone browses on mobile and adds an item to the cart, then walks into the store and wants to check out with an agent. This means our systems must be loosely coupled but tightly coordinated. Async communication helps us achieve this through microservices and event-driven designs. But now we need visibility across all those hops.
Let me explain a use case that we have implemented, or are about to implement, at T-Mobile. Our busiest time is during NPI. NPI is new product introduction, basically the new product launches, especially in September when Apple launches a device. We get almost a hundred thousand orders that day. That's our busiest day, ahead of Thanksgiving, Christmas, or New Year. In one of the NPIs last year, in the first 15 minutes, the flood of orders took the payment system down.
We were in a state where orders could not be submitted, because today at T-Mobile, you build a cart, go to checkout, make the payment, and only then can you submit the order. We had built circuit breakers, which kept the rest of the system from breaking, but payments were broken, and customers were stuck at "Oops, something went wrong," unable to submit their orders.
That made us think: do we really need payments to be synchronous? We can definitely go async as well. That's where we decided to start.
Async is not something new for us. Today we already use async for BOPIS, which is buy online, pickup in store. At T-Mobile we implemented event-driven architectures to support BOPIS asynchronously. We are now designing the async payment flow, and probably by this NPI we want to have it implemented.
These flows are unpredictable: a payment may succeed in three seconds or time out after 30. And for BOPIS, inventory is the critical thing, because inventory might change mid-transaction. So for BOPIS, we first make a soft reservation call for inventory, and when the customer walks into the store, we make a hard reservation call. All of these calls used to be tightly coupled. That's when we thought, let's do this asynchronously so customers can be served in person without blocking. We use events to coordinate all of this. But it only works if we can see them happening in real time.
What makes async hard to observe? With regular synchronous calls, we can log the request and response, or use tools like Splunk to debug them. But async flows are practically invisible, and there aren't really many tools that were designed to support async. Imagine trying to follow a story where each sentence is told by a different person at a different time. That's what debugging async flows looks like: lots of events come in, you listen to them, and you process them. The lack of causality makes traditional monitoring insufficient. We need tools that capture relationships, timing, and intent across systems.
So to me, observability has four pillars. Distributed tracing: you need to trace events across services using span correlation. Event audits and replay: reconstruct workflows for debugging and root cause analysis. When something goes wrong, you should be able to reconstruct the workflow, debug it, and find an RCA. Consistency monitoring: validate that business outcomes are eventually completed. And domain dashboards, which provide visibility using business-centric metrics and language.
In async systems, observability isn't about tracking just technical metrics. It must reflect the business journey end to end. Here's how we approach it through those four key pillars.
First, distributed tracing. At T-Mobile we use tools like OpenTelemetry to inject trace context into requests and events. This lets us correlate spans across microservices, even when communication is event-based. When something breaks, we can follow the trace path and pinpoint the exact hop where it failed or slowed down. It's the backbone for debugging async flows.
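To make that concrete, here's a minimal sketch of the pattern, assuming an OpenTelemetry SDK is already configured and a kafka-python producer is in use. The topic name and handler are illustrative, not our production code: the producer injects the W3C trace context into the message headers, and the consumer extracts it so its span joins the same trace.

```python
# Sketch: propagate OpenTelemetry trace context through a Kafka message so a
# downstream consumer continues the same trace. Assumes a configured
# OpenTelemetry SDK and a kafka-python producer; names are illustrative.
import json

from opentelemetry import propagate, trace

tracer = trace.get_tracer("commerce.checkout")

def publish_order_event(producer, order):
    with tracer.start_as_current_span("publish order.submitted") as span:
        span.set_attribute("order.id", order["order_id"])
        headers = {}
        propagate.inject(headers)  # writes W3C traceparent into the dict
        producer.send(
            "order-events",  # hypothetical topic
            value=json.dumps(order).encode("utf-8"),
            headers=[(k, v.encode("utf-8")) for k, v in headers.items()],
        )

def handle_order_event(message, process):
    # Consumer side: extract the parent context so this span joins the trace.
    carrier = {k: v.decode("utf-8") for k, v in (message.headers or [])}
    ctx = propagate.extract(carrier)
    with tracer.start_as_current_span("consume order.submitted", context=ctx):
        process(message)
```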
Coming to event audits and replay: async flows don't leave a clean log trail. So we log structured events with context: user ID, order ID, timestamps, event type, et cetera. This allows us to replay what happened during an issue. Not only does this help with RCA, we can also retry a failed process without impacting other parts of the system.
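As a rough illustration, an audit record plus a replay helper can be as small as this. The field names and the file-based store are assumptions for the sketch; in production this would be a durable event store.

```python
# Sketch: structured event auditing with a replay helper. The JSONL file
# stands in for a durable event store; field names are illustrative.
import json
import time

AUDIT_LOG = "events.audit.jsonl"

def audit(event_type, order_id, user_id, payload):
    record = {"ts": time.time(), "event_type": event_type,
              "order_id": order_id, "user_id": user_id, "payload": payload}
    with open(AUDIT_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")

def replay(order_id, handler):
    """Re-run a handler over every audited event for one order, so a failed
    step can be retried without touching unrelated orders."""
    with open(AUDIT_LOG) as f:
        for line in f:
            record = json.loads(line)
            if record["order_id"] == order_id:
                handler(record)
```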
And for consistency monitoring, it's about validating that the entire business transaction completed, not just that the services are running. For example, what happens when an order is placed? We reserve inventory, payment is confirmed, and then shipment is initiated or not. We track the completion of these workflows across time, services, and systems, even when events arrive out of order.
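A toy version of that check could look like the following, assuming some queryable event store behind a get_events helper; both the milestone names and the helper are hypothetical.

```python
# Sketch: consistency monitoring. Verify an order eventually hit every
# expected business milestone, tolerating out-of-order arrival.
import logging

log = logging.getLogger("consistency")

EXPECTED = {"order.placed", "inventory.reserved",
            "payment.confirmed", "shipment.initiated"}

def check_order(order_id, get_events, max_age_minutes=30):
    """get_events(order_id) -> list of (event_type, age_in_minutes)."""
    events = list(get_events(order_id))
    seen = {event_type for event_type, _ in events}
    missing = EXPECTED - seen
    # Alert only once the workflow is old enough that it should have
    # finished; younger workflows may simply still be in flight.
    oldest = max((age for _, age in events), default=0)
    if missing and oldest > max_age_minutes:
        log.error("order %s incomplete, missing: %s", order_id, sorted(missing))
```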
Domain dashboards: we build dashboards that use business language about orders. Whether the pickup is done or not, which state the payment is in, whether it's in retry or not, whether the inventory check has been done or not. These make observability actionable for non-engineers: our product managers, support teams, and operations can understand what's going on without diving into logs and traces. So domain dashboards are really helpful to all the team members.
Let me walk you through a BOPIS flow. This is what typically happens when a customer places an order: we trigger an event to validate the inventory, we trigger an event to notify the store that an order is coming, and we trigger an event for pickup scheduling. These are all independent services working in parallel, as in the sketch below. By contrast, if someone logs into T-Mobile.com and places a regular order, it's all one by one: once the order is submitted, the payment is settled, then it goes to the warehouse, and the warehouse takes its own sweet time to dispatch the item. Whereas in the case of BOPIS, everything is async. Well, payment is not async yet, sorry, but the store notifications are async, the validations are async, and the pickup scheduling is async. Now, for async payments.
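The fan-out itself can be as simple as publishing independent events that separate consumers pick up. This is a sketch with illustrative topic names, not our actual event schema.

```python
# Sketch: BOPIS fan-out. Three independent events instead of a sequential
# call chain; any producer with a send(topic, value) API would do.
import json

def on_bopis_order_placed(producer, order):
    event = json.dumps({"order_id": order["order_id"]}).encode("utf-8")
    producer.send("inventory.validate", event)  # validate inventory
    producer.send("store.notify", event)        # tell the store an order is coming
    producer.send("pickup.schedule", event)     # schedule the customer pickup
    producer.flush()  # ensure the events actually left this process
```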
Like I mentioned, the main problem we faced during the last NPI was that orders were stuck because the payment system was down under an unexpected volume of orders. So how can we solve it? Retrying is one option, but when the system itself is down, retrying would not help. What we want to do is decouple payment processing from order submission.
Our plan: when a customer makes a payment, since it can take longer than expected, we put it on a queue and process it asynchronously. Order submission is also async: only after the payment is successful do we process the order. If the payment is not successful, we trigger a notification to the customer, a text message or an email, asking them to re-enter the payment. Only after the payment is confirmed is the order submitted. This way, both payment processing and order submission are async. We generate an order number and give it back to the customer. The customer doesn't know what's happening in the background; he just knows an order is submitted, and he gets an email saying, "Hey, your order is submitted and we are working on it." In the background, we validate the payment, and only if it goes through do we submit the order. If the payment doesn't go through, we send the customer another notification to re-enter the card details. This way customers are not blocked, and we are able to capture the orders and process them.
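Here's a deliberately simplified sketch of that decoupling. The in-process queue stands in for the real message broker, and charge, submit_order, and notify_customer are hypothetical callbacks, not our production interfaces.

```python
# Sketch: decoupled async payment. Checkout enqueues the payment and returns
# an order number immediately; a worker charges the card and only then
# submits the order, or asks the customer to re-enter payment details.
import queue

payment_queue = queue.Queue()  # stand-in for a durable broker topic

def checkout(order):
    payment_queue.put(order)        # payment handled asynchronously
    return order["order_id"]        # instant feedback for the customer

def payment_worker(charge, submit_order, notify_customer):
    while True:
        order = payment_queue.get()
        if charge(order):           # validity, fraud, and funds checks
            submit_order(order)     # order is submitted only on success
        else:
            notify_customer(order, "Please re-enter your payment details.")
```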
With async payments, the best thing is that customers get instant feedback. After the order is placed, behind the scenes we check the card validity, we check for fraud, and we check whether the card has funds. If something fails, we fall back without disrupting the user flow. Here, observability means monitoring the state transitions, tracing failures, and ensuring consistency without sacrificing speed.
We've spoken a lot about async; we really need tools and techniques through which we can trace it. I will go through a few tools that we use at T-Mobile to track these events.
OpenTelemetry is used to capture and correlate spans across distributed services and events. It's our go-to standard for distributed tracing. It allows us to inject trace IDs and span context into each service call and event message. Even when systems communicate asynchronously, whether it's an API call, an internal microservice call, or a Kafka message, we tie them all back to a single customer transaction. This is critical for finding out what went wrong, especially in complex flows.
We have a lot of tooling around OpenTelemetry. We have SDKs: in the web UI we use Node.js, and elsewhere Java, Python, et cetera. In the commerce system we use Datadog and Jaeger, and we have Grafana, which we use a lot for our backend services. And when services are hosted on AWS, we use AWS X-Ray. Then there's Kafka. Kafka is central to our async architecture because it acts as the backbone for event transport.
Here also, what matters is observability: what we monitor and how. We monitor the queue depth to detect backlogs, the consumer lag to catch slow or stalled services, and the throughput to understand the system load. The tools we use here: Kafka Manager, which manages the topics, the queues, and the clusters. It's deprecated, I know, but we used to use it. We also use Grafana to visualize Kafka throughput, partition lag, and message drops. And we use Burrow, a consumer-lag monitoring tool for Kafka, plus error rates for retries to catch failing event handlers. These metrics help us detect not just failures but also performance degradation before it impacts users.
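For illustration, consumer lag per partition is just the topic's end offset minus the group's committed offset, which kafka-python can compute directly; the broker address, topic, and group id below are placeholders.

```python
# Sketch: per-partition consumer lag with kafka-python.
from kafka import KafkaConsumer, TopicPartition

def consumer_lag(topic="order-events", group="payment-workers",
                 bootstrap="localhost:9092"):
    consumer = KafkaConsumer(group_id=group, bootstrap_servers=bootstrap)
    partitions = [TopicPartition(topic, p)
                  for p in consumer.partitions_for_topic(topic) or []]
    end = consumer.end_offsets(partitions)  # latest offset per partition
    lag = {}
    for tp in partitions:
        committed = consumer.committed(tp) or 0  # group's progress
        lag[tp.partition] = end[tp] - committed
    consumer.close()
    return lag  # alert when any partition's lag keeps growing
```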
Then there are custom event and replay dashboards. These are internal dashboards which help track an order end to end, from adding the item to the cart until it's checked out. Here we use Kibana, Metabase, Power BI, and Grafana as well. What are the sources for these? Business event logs from Kafka consumers and event gateways, and our telemetry systems.
And how do I measure success when I move from sync to async? Today we support both flows. On average we have about 7 million customers logging into T-Mobile.com; we register that many tokens. But when you look at the carts, from those 7 million only about 130K carts are created, and from those 130K the actual orders come down to about 35K. So for the regular flow, 130K carts to 35K orders is almost a one-in-four ratio. But the BOPIS orders check out faster: it's not one in four, it's closer to three in four. Say, out of 4,000 BOPIS carts built, roughly 3,000 become orders.
This led us to lower abandonment. With BOPIS, I can trace that the cart-to-order conversion rate is significantly higher. And what happened with async on availability? During peak time, like I said, during NPI we are 97% available. Even during promotional peaks, when we run promotions like "buy one, get one" to attract customers, we were almost close to a hundred percent available.
You see, the results speak for themselves. Faster checkouts mean higher conversion, the lower abandonment rate shows how we improved trust, and high uptime during traffic spikes proves that resilience and scalability can go hand in hand if observability is done right.
Again, when you move from sync to async, you have to make some trade-offs. I hope you all know what the CAP theorem means. C is for consistency: every read returns the most recent write. A is for availability: every request receives a response. And P is for partition tolerance: the system continues to operate despite network failures.
You'll have to trade off: when you have to be consistent, your availability might be lost. On a synchronous system, you'll be consistent, because you wait for it; it takes 10 or 15 seconds, you get the response back while the customer is present. So it is very much consistent. But availability? Like I said, during peak time, if you're bombarding it with a lot of requests, the system is gone. When you move from sync to async, I would not say you have to compromise consistency a hundred percent, but you might have to make a small compromise.
Like I said, since we do everything asynchronously, what happens? A customer orders a device and an accessory and submits the order, but by the time he comes to the store, if the store has no accessories, we have to trade off, saying, "Hey, you can only buy the device, because the accessory went out of stock." It's a little inconsistent, but we still try to keep the inventory up to date and make sure the customer gets what he orders. Again, you'll have to trade off between availability and consistency.
So what do we do in our solution? When we move from sync to async, the first important thing is idempotency: we make sure that if some failure happens, we suppress duplicate side effects. Let's say a payment failed; you shouldn't keep charging the customer, you should not charge three times. Make sure that if the customer is charged once, you don't charge him again. Then versioning: tracking changes and ensuring correct state. And compensating transactions: if something goes wrong, we undo or mitigate the failed steps.
One of the core trade-offs in distributed systems is explained very well by the CAP theorem. It tells us we can only choose two of the three guarantees: consistency, availability, and partition tolerance. Between sync and async: if it is sync, it is consistent and partition-tolerant. If it is async, it is available and partition-tolerant. And in high-scale, event-driven commerce, partition tolerance is non-negotiable; failures will happen. So we typically prioritize availability, meaning we must relax strict consistency. This leads to eventual consistency: data will converge, but not instantly. To manage this effectively, like I said, we use a lot of techniques.
We design the services to be idempotent so that repeated events don't cause duplicate effects. For example, if a payment confirmation is sent twice, we shouldn't charge the customer again. We ensure that retries are safe.
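A bare-bones version of the idempotency idea: remember processed event IDs and suppress duplicates. A real system would persist the set, for example with a database unique constraint, rather than keep it in memory.

```python
# Sketch: idempotent payment handling. A re-delivered event can never
# charge the customer a second time.
processed_ids = set()  # stand-in for a persistent dedupe store

def handle_payment_event(event, charge_customer):
    if event["event_id"] in processed_ids:
        return  # duplicate delivery: suppress the side effect
    charge_customer(event["order_id"], event["amount"])
    processed_ids.add(event["event_id"])
```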
And then versioning: we attach version numbers to entities like orders or payments. This helps us detect and resolve race conditions. If two systems try to update the same entity simultaneously, the version ensures only the correct one is applied.
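In sketch form, this kind of optimistic versioning looks like the following; the in-memory dict stands in for a database row with a version column.

```python
# Sketch: version check on update. A stale writer is rejected instead of
# silently overwriting newer state.
class StaleUpdate(Exception):
    pass

orders = {}  # order_id -> {"version": int, "state": str}

def update_order(order_id, new_state, based_on_version):
    current = orders[order_id]
    if current["version"] != based_on_version:
        raise StaleUpdate(f"order {order_id} changed since v{based_on_version}")
    orders[order_id] = {"version": based_on_version + 1, "state": new_state}
```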
And then compensating transactions. When things go wrong, we don't roll back; instead, we emit events to undo or mitigate the issue. For example, if an order was confirmed but the payment later failed, we emit a cancellation event and restore inventory.
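Sketched out, the compensation is just more events; the topic names are illustrative.

```python
# Sketch: compensating transaction. Instead of a rollback, emit undo events
# and let downstream consumers react.
import json

def on_payment_failed(producer, order):
    payload = json.dumps({"order_id": order["order_id"]}).encode("utf-8")
    producer.send("order.cancelled", payload)     # undo the confirmation
    producer.send("inventory.restored", payload)  # put the stock back
    producer.flush()
```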
Like I said, observability plays a critical role here. It alerts us when events fall out of sync, retries are piling up, or state transitions don't complete. Without observability, we wouldn't know consistency is broken until a customer complains. With it, we can react in near real time and fix issues proactively.
So how do we really test async systems? How do we simulate delays, message losses, and restarts? How do we validate retry logic, fallbacks, and idempotency? And on the monitoring side: how do we monitor queue depth and detect backlogs? How do we track latency and delays? How do we make sure events are sequenced correctly and complete? Testing asynchronous systems is fundamentally different from testing traditional applications. It's not just about checking individual services or APIs; it's about testing the system's behavior over time, under pressure, and across boundaries.
Let's break it down. Simulated failures: we simulate real-world scenarios. What happens if a consumer is delayed, if a service is slow to process a message? What happens if a message is lost, if an event never reaches its destination due to a transient outage? And what happens when one component receives the same event again? This helps validate that our retry, deduplication, and fallback mechanisms actually work in practice, not just in theory. See the test sketch below.
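For example, a duplicate-delivery test can be as small as this; the inline handler mirrors the hypothetical idempotent-consumer sketch from earlier.

```python
# Sketch: simulate redelivery and assert the side effect happened once.
def test_duplicate_delivery_is_suppressed():
    charges = []
    seen = set()

    def handle(event):
        if event["event_id"] in seen:
            return  # duplicate: suppress the side effect
        charges.append(event["order_id"])
        seen.add(event["event_id"])

    event = {"event_id": "evt-1", "order_id": "ord-1"}
    handle(event)
    handle(event)  # simulated duplicate delivery
    assert charges == ["ord-1"]  # customer charged exactly once
```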
Monitoring essentials: async observability means watching the system as a connected graph of event flows, not isolated services. We focus on queue depth, where large backlogs may indicate slow consumers or broken processing. Processing latency: are events being handled within acceptable time windows, and if not, where's the bottleneck? Causal chains: can we trace a customer's order from placement to fulfillment? Missing links could mean dropped events or unhandled failures. A common pitfall is thinking a green dashboard means everything is okay; in async systems, services might be up while the flow is broken. That's why we don't just test endpoints; we test the system end to end.
Let me wrap up with the major lessons we have learned from building and scaling async commerce systems. First: design for traceability. I always emphasize this: when you're building a system, design for how you want to trace it. Tracing isn't something you can bolt on later. We often feel we can implement the design first and add tracing later, but for async, I recommend thinking about tracing first: how do you want to trace the events? It has to be intentional. This means using consistent trace IDs, structured logging, and propagating context across services and event handlers from the start. If you skip this early, you will struggle to debug, monitor, and support your system. Trust me, it's very hard if you only start thinking about tracing then. So always design with tracing in mind.
Second: accept and plan for eventual consistency in distributed systems. Chasing perfect real-time consistency will hurt your availability and scalability. Instead, we have learned to embrace eventual consistency and engineer around it using strategies like idempotent operations and compensating transactions. Observability is what keeps us honest there: it lets us know when consistency breaks and helps us fix it proactively.
Third, I would say: align metrics to business events. This is key for making observability useful beyond engineering. Developers care about logs, spans, and errors, but product managers and support teams care about orders not shipping, payments stuck in retry, or events out of sync. So build dashboards and alerts that speak their language. That's how you turn observability from a dev tool into an organizational asset.
I would like to conclude by saying that async commerce architectures give us what modern commerce demands: scalability, flexibility, and resilience. They allow different parts of the business to move at their own pace. Orders can be taken while payments are still processing, inventory can update independently, and notifications can be retried without blocking the user. But here's the catch: that flexibility comes with complexity. Things happen in parallel, in different systems, at unpredictable times.
If we can't see across these flows, we are flying blind. That's why observability isn't just a technical layer. It's a strategic enabler. It's what allows us to build fast, recover faster, and continuously improve. When observability is built in, we don't just operate; we understand what went wrong, why it went wrong, and what to do next. That's how we maintain trust in the customer experience, even when the backend gets messy.
So as we build asynchronous, event-driven commerce systems, let's make sure observability stays a first-class citizen. It's the difference between hoping things work and knowing they do.
Thank you very much for patiently listening. If you're building anything async, I'm happy to help, and I hope this session gives you practical insights about building and operating async commerce systems with confidence. If you're working on similar architectures or facing any challenges, I would love to continue this conversation. Feel free to reach me via email or connect with me on LinkedIn.
Thank you.
Bye-bye.
Have a good day.