Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone and welcome to Conf42 Observability 2025.
My name is Ashwin and I'm joined by my colleague Devan.
I'm a senior engineering manager at Adobe, and Devan is a product
manager in our payments organization.
Together we have spent years building and scaling high-volume
commerce and payments platforms.
Today's talk is titled Beyond the Dashboard: Building
End-to-End Observability for Commerce and Payment Systems.
I want to start with a question.
How many of you have experienced that sinking feeling when your revenue drops,
but all your dashboards show green?
Your servers are up, your APIs are responding, but somehow
there is a loss in your business.
That's exactly the problem we are here to solve today.
During the course of this talk, we'll take you on a journey from reactive
monitoring to proactive business observability, sharing real world
examples, practical frameworks and lessons learned from managing commerce
and payments platforms at Adobe.
Let's dive in.
Let me start by putting this challenge in perspective with
some numbers that might shock you.
First, let's talk about downtime costs.
Amazon loses $220,000 per minute during downtime.
That's not a typo.
$220,000 every single minute their systems are down. But it's not just about downtime.
During peak traffic periods, the numbers get even more staggering.
Shopify processes around $4.6 million in orders per minute
during Black Friday peak traffic.
Imagine the pressure on their engineering teams.
Every second of degraded performance could cost millions.
And here's perhaps the most sobering statistic:
62% of customers abandon their purchase
after just one failed payment attempt.
They don't retry, they don't call support.
They just leave, and many of them never come back.
This brings us to the key insight: in commerce,
observability isn't just about uptime, it's about survival.
When your payment systems fail, you're not just dealing with an
engineering problem, you're dealing with an existential business threat.
These aren't theoretical numbers.
Every minute your systems are degraded,
every failed payment that goes unnoticed,
every customer who hits an error: that's direct revenue loss.
And unlike other types of applications, in commerce there's no such
thing as acceptable downtime.
Before we dive deep, let me quickly establish our credentials.
At Adobe Commerce, we are scaling platforms that handle millions of daily transactions.
We have designed end-to-end observability architecture
for complex commerce platforms.
Most importantly, we have the real-world battle scars:
Black Friday and Cyber Monday chaos, meeting payment mandates like 3DS,
and many geo-specific mandates, where every second counts and failure isn't an option.
We are here to share what we have learned in the trenches.
Now let's talk about the fundamental shift we need to make in how
we think about observability.
Traditional monitoring focuses on basic service health checks.
It answers simple uptime questions like, is my service up?
It looks at CPU and memory usage, response time metrics, and HTTP error
rates like 4xx and 5xx codes.
This is the foundation, but in commerce it's not nearly enough.
Commerce Observability takes a completely different approach.
It tracks the complete customer journey and business outcomes.
It monitors the end-to-end experience from the customer's perspective.
Instead of asking, is my service up?
Commerce Observability asks the questions that actually matter to your business.
Can customers complete a purchase? Your checkout API might be
responding with 200 status codes, but if there's a bug in the payment
flow, customers still can't buy.
Is fraud detection blocking good customers?
Your fraud system might be working perfectly from a technical standpoint,
but if it's too aggressive, you are losing legitimate revenue.
What's the actual customer experience?
Maybe your API is fast, but the front end is broken
or payment forms aren't loading,
and customers are frustrated.
Are payment failures causing involuntary churn?
This is particularly concerning for a subscription business.
Your billing system might appear healthy, but customers could be churning
silently due to failed payments.
We'll talk about this a little later as well.
And here's one that's often overlooked.
Are customers able to edit their wallets correctly?
Sometimes something as simple as updating payment information can become
a revenue killer if it's broken.
Now, let me show you what full funnel observability actually
looks like in practice.
This is the complete customer transaction journey, and we need
observability at every single step.
Let's walk through the journey from a customer's perspective.
It starts with product discovery,
when a customer lands on your site and starts browsing.
Here we are monitoring page load times and search performance.
If your product pages are slow or search isn't working,
you are losing customers before they even find what they want.
Next is cart addition.
The customer decides they want something and adds it to the cart.
We monitor API response time here because if the add-to-cart button
doesn't respond quickly, customers will assume it's broken and leave.
Then we hit the checkout process.
Here we are monitoring form validation, and payment method selection.
Every friction point here directly impacts conversion.
If customers can't easily select their payment methods, or if form validation
is confusing, they'll abandon.
Authorization and fraud is where the complexity increases.
Here there are many data points that will have to be monitored.
We are monitoring risk engine decisions and payment approval rates.
Your fraud detection needs to be smart enough to block bad actors while keeping
it smooth for the good customers.
And the balance here directly impacts your revenue.
Finally, order fulfillment: confirmation and order processing.
Even after payment succeeds,
if customers don't get proper confirmation or if there are issues
with order processing, you'll get chargebacks and customer support tickets.
The key insight is that each touch point represents critical revenue impact.
A failure at any step doesn't just affect that step.
It affects your entire conversion funnel.
This is why we need observability across the complete journey, not
just at individual service endpoints.
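To make the funnel view concrete, here is a minimal sketch that computes step-to-step conversion and points at the weakest transition. The step names and counts are hypothetical placeholders, not figures from our platform; in practice they would come from your own event pipeline.

```python
# Hypothetical counts of sessions that reached each funnel step in one window.
# Dicts preserve insertion order, so the steps stay in journey order.
funnel_counts = {
    "product_discovery": 10000,
    "cart_addition": 3000,
    "checkout": 1500,
    "authorization": 1200,
    "order_confirmed": 1140,
}


def step_conversion(counts):
    """Return (from_step, to_step, rate) for each adjacent pair of steps."""
    steps = list(counts.items())
    rates = []
    for (prev_name, prev_n), (name, n) in zip(steps, steps[1:]):
        rate = n / prev_n if prev_n else 0.0
        rates.append((prev_name, name, rate))
    return rates


def weakest_step(counts):
    """Return the transition with the lowest conversion rate."""
    return min(step_conversion(counts), key=lambda r: r[2])


if __name__ == "__main__":
    for src, dst, rate in step_conversion(funnel_counts):
        print(f"{src} -> {dst}: {rate:.1%}")
```

Comparing these rates against the same window last week, rather than reading them in isolation, is usually what shows which touch point has started to leak.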
Now let's look at this from a subscription business perspective, because the
observability challenges are quite different when you're dealing with
recurring revenue and subscriptions.
There are different data points and metrics that need to be observed.
This is a typical subscription lifecycle journey, and every
step is a potential failure point that impacts ARR and retention.
Unlike one-time transactions,
subscription businesses have to worry about the entire customer
journey, not just a single purchase.
Let's walk around this cycle.
It starts with trial signup.
We are monitoring conversion metrics and onboarding friction points.
If users can't easily sign up for trials or if the onboarding
process is confusing, you are losing potential subscribers before they
even experience your product value.
Moving to free to paid conversion, this is critical.
We monitor payment setup success rates and method preferences.
Many potential customers drop off here, not because they don't want to
pay, but because the payment method or the payment setup process is
too complicated or doesn't support their preferred payment method.
Here it is critical to measure payment method usage rates,
and also, when new payment methods, for example, Klarna or LPMs,
are launched, how they're adopted.
Then we have the billing cycle.
This is where many subscription businesses get caught off guard.
We monitor recurring billing events and schedule accuracy.
A missed billing event, for example, or incorrect scheduling
can cause revenue delays or customer confusion.
For example, if our billing system is not able to charge the user for a
month, the entitlement system might remove access after a grace period,
even though the user is not at fault.
Payment processing is where the technical complexity really starts showing up.
We are monitoring the performance of multiple gateways and their authorization
rates. In subscriptions, payment failures are particularly damaging because
they can cause involuntary churn:
customers who want to stay
can't, because their payment failed.
The next step here is renewal, which is the moment of
truth for retention rates and involuntary churn prevention.
This is where silent failures can be devastating, but this
is also an opportunity for us to right-size subscriptions.
Maybe a customer needs to be downgraded so that their plan is right-sized,
rather than churning out completely.
So without proper observability into their usage patterns and payment behavior, we
will miss these retention opportunities.
Finally, usage and expansion.
We monitor feature adoption and upgrade revenue opportunities here.
Understanding how customers use our product helps us identify expansion
opportunities and even prevent churn.
The key insight here is that subscription observability requires monitoring
through the entire customer lifetime, not just individual transactions.
Each step in the cycle compounds over time, making the impact of
failures much more significant than in traditional e-commerce business.
Now let's talk about how the traditional three pillars of observability,
that's metrics, logs, and traces, will look when you're dealing with
commerce and subscription businesses.
The first pillar is metrics: beyond system health.
Instead of just tracking CPU and memory usage, we need to
track revenue flow and business outcomes.
We will have to focus on customer-centric measurements.
This means revenue flow metrics like ARR and conversion rates.
These tell you about your business health, not just your system health.
Payment success rate by processor,
because a 99% success rate with one processor might be 85% with another,
and that difference is millions in revenue.
And then customer lifecycle metrics like churn and LTV. These help you
understand the long-term impact of the technical decisions that you make.
And time to value:
how long does it take customers to get value from your
product after they sign up?
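As a small illustration of the payment success rate by processor metric, the sketch below groups payment attempts by processor and reports both the success rate and the declined amount. The processor names, record shape, and amounts are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical payment attempts: (processor, approved, amount).
attempts = [
    ("processor_a", True, 50.0),
    ("processor_a", True, 20.0),
    ("processor_a", False, 30.0),
    ("processor_a", True, 100.0),
    ("processor_b", True, 40.0),
    ("processor_b", False, 60.0),
]


def success_by_processor(rows):
    """Return {processor: (success_rate, declined_amount)}."""
    totals = defaultdict(lambda: {"n": 0, "ok": 0, "declined": 0.0})
    for processor, approved, amount in rows:
        t = totals[processor]
        t["n"] += 1
        if approved:
            t["ok"] += 1
        else:
            t["declined"] += amount
    return {p: (t["ok"] / t["n"], t["declined"]) for p, t in totals.items()}


if __name__ == "__main__":
    for processor, (rate, declined) in success_by_processor(attempts).items():
        print(f"{processor}: success {rate:.0%}, declined amount {declined:.2f}")
```

Reporting the declined amount next to the rate keeps the metric tied to revenue rather than to request counts alone.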
Alright, the second pillar:
logs with business context.
Traditional logs tell you what happened technically.
Commerce logs need to capture decision reasoning
and customer journey context.
They need to link technical events to business impact.
This includes payment processor responses: not just that a payment failed, but
why it failed and what that means for the specific customer.
Customer journey breadcrumbs,
that is, understanding the full path a customer took
before they encountered an issue.
Then business decision reasoning:
why did the fraud engine block this transaction?
What data points led to the decision?
And revenue-weighted error classification:
not all errors are equal.
Some cost you more money than others.
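Here is one way a log record with business context could look, as a sketch only. The field names, the decline codes, and the threshold used for the revenue-weighted classification are all hypothetical; a real processor defines its own response codes.

```python
import json

# Hypothetical mapping from decline codes to whether a retry may help.
RETRYABLE = {
    "insufficient_funds": True,
    "issuer_unavailable": True,
    "stolen_card": False,
}


def payment_failure_event(customer_id, journey, decline_code, amount, currency):
    """Build a structured record linking a technical failure to business impact."""
    return {
        "event": "payment_failed",
        "customer_id": customer_id,
        "journey": journey,            # breadcrumbs: steps taken before the failure
        "decline_code": decline_code,  # why it failed, as reported by the processor
        "retryable": RETRYABLE.get(decline_code, False),
        "revenue_at_risk": amount,
        "currency": currency,
    }


def severity(event, high_threshold=100.0):
    """Revenue-weighted classification: larger amounts at risk rank higher."""
    return "high" if event["revenue_at_risk"] >= high_threshold else "low"


record = payment_failure_event(
    "cust_123", ["cart", "checkout", "pay"], "insufficient_funds", 249.0, "USD"
)

if __name__ == "__main__":
    print(json.dumps(record), severity(record))
```

Emitting the record as JSON keeps it queryable, so the same log line can answer both the engineering question (which code?) and the business question (how much revenue?).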
Let's look at the third pillar:
traces across revenue systems. We need to follow
complete transaction flows across all systems that affect revenue.
This isn't just about following a request through your microservices.
It's about mapping dependencies that affect your business.
This means end-to-end transaction flows, from the customer's click
to the money in the bank. Payment gateway dependencies:
understanding how third-party services impact your revenue
is also quite important.
Let's say Adyen or PayPal has issues. How does that affect
your business and your customers?
And now, talking about subscription lifecycle tracking, we
will have to follow the customer's entire journey, like we
looked at a few slides back, from trial to renewal to expansion.
The key insight is that Commerce Observability requires all three
pillars to work together to give you a complete picture of both your technical
systems and your business outcomes.
All the slides above bring us to this fundamental mindset shift:
from reactive monitoring to proactive observability. Reactive monitoring waits
for problems to surface. You get alerted when a service is down, customer complaints
have gone up, and by then revenue is already lost.
But proactive observability detects patterns before they impact customers.
You can detect degradation patterns, predict failure cascades, implement
automated remediation, and operate in revenue protection mode.
Here's the difference.
Instead of waiting for customers to complain about failed payments,
proactive observability can detect a 2% increase in payment failures
and predict that this could become a 20% involuntary churn.
This will give us time to fix it before it starts impacting our business. In commerce,
reactive means revenue is already lost.
Proactive means we are protecting the revenue.
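A minimal sketch of that kind of early detection follows. It assumes the "2% increase" means two percentage points above a trailing baseline of comparable windows; the function name, the threshold, and the sample rates are hypothetical.

```python
from statistics import mean


def failure_rate_alert(history, current, min_increase=0.02):
    """Flag when the current failure rate exceeds the trailing baseline.

    history: failure rates (0..1) from recent comparable windows.
    current: failure rate of the latest window.
    min_increase: absolute increase that triggers the alert (0.02 = 2 points).
    Returns (alert, baseline, increase).
    """
    baseline = mean(history)
    increase = current - baseline
    return increase >= min_increase, baseline, increase


# Hypothetical failure rates for the last four comparable windows.
history = [0.050, 0.048, 0.052, 0.050]
alert, baseline, increase = failure_rate_alert(history, 0.075)

if __name__ == "__main__":
    print(f"baseline={baseline:.3f} increase={increase:.3f} alert={alert}")
```

A fixed threshold is the simplest possible choice; it is used here only to show the idea of alerting on a business metric before complaints arrive.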
To drive this further home,
let's look at a real-world example of why this proactive mindset is so critical.
This is about what we call the silent killer of subscription
businesses: involuntary churn.
Let's start with the problem.
20 to 40% of total churn in a subscription business is involuntary.
These are failed payments that silently destroy subscriber bases without warning.
Think about that.
Nearly half of your customer losses might not be because of
customers wanting to leave, but because of payment system failures.
The renewal failure rate is staggering: 5 to 18% of subscriptions
fail at each billing cycle.
That means one in eight of your customers might not renew successfully, not
because they don't want your product, but because their payment didn't go through.
And now let's look at this impact
on a global scale: around $118 billion was lost globally
in 2020 due to payment failures.
This creates massive revenue hemorrhage across entire industries.
That's not just a technical problem, that's an economic crisis.
Here's the observability challenge.
Traditional monitoring creates a blind spot.
Your systems will show healthy while
customers quietly disappear.
Your billing systems report success.
Your APIs are responding, your databases are performing well,
but customers are churning silently.
The most critical issue is
zero real-time visibility.
Most businesses discover churn after customers have already gone.
By the time you realize there's a problem, you've lost not just revenue,
but customer relationships that took months or even years to build.
This is a perfect example of why traditional monitoring
will fail in commerce
or subscription businesses. All your technical metrics can be green
while your business is bleeding money.
You need observability that focuses on business outcomes, not just system health.
This is why we need a proactive approach like we discussed earlier
that can detect payment issues sooner so that more customers can be saved
and more revenue can be protected.
This brings us to one of the most challenging aspects of
commerce observability, which is dealing with unknown unknowns.
These are problems you don't even know to look for.
Let me contrast two approaches.
The traditional approach is reactive.
Teams define known failure modes upfront and create alerts for expected problems.
This relies on reactive monitoring,
where alerts trigger after impact. It will contain blind spots
for unexpected or novel issues.
Investigations are typically manual and slow, and it focuses on
infrastructure or service health.
Traditional monitoring will answer questions like, are
all services returning 200 OK?
Is the memory usage within limits?
But here's the problem.
This approach misses the business-critical issues that
you never thought to monitor.
Unknown detection takes a completely different approach.
Proactive intelligent systems surface insights from patterns
you didn't know to look for.
This includes anomaly detection on both business and system metrics:
not just CPU spikes, but unexpected drops in conversion rates or unusual
payment patterns, for example.
Purchases in France may be failing for certain card types.
Your systems still look healthy, as this might be a very small dip
for a small subset of your users.
But in the overall revenue scheme, this is still a revenue loss,
one which goes unnoticed and could have been easily avoided if you were
looking for these unknown unknowns.
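To illustrate how a small dip in one segment can be surfaced even when the overall numbers look fine, here is a sketch that compares each segment's success rate against its own history with a simple z-score. The segment keys, the rates, and the threshold are hypothetical, and a z-score is just one of many possible anomaly tests.

```python
from statistics import mean, stdev


def segment_anomalies(history, today, z_threshold=3.0):
    """Return segments whose success rate today is unusually low.

    history: {segment: [past daily success rates]}
    today:   {segment: today's success rate}
    Each segment is compared against its own mean and standard deviation.
    """
    flagged = []
    for segment, past in history.items():
        if len(past) < 2 or segment not in today:
            continue  # not enough data to estimate spread
        mu, sigma = mean(past), stdev(past)
        if sigma == 0:
            continue  # no variation in history, z-score undefined
        z = (today[segment] - mu) / sigma
        if z <= -z_threshold:
            flagged.append((segment, z))
    return flagged


# Hypothetical segments keyed by (country, card type).
history = {
    ("FR", "card_type_x"): [0.92, 0.93, 0.91, 0.92, 0.93],
    ("FR", "card_type_y"): [0.95, 0.94, 0.96, 0.95, 0.95],
}
today = {("FR", "card_type_x"): 0.80, ("FR", "card_type_y"): 0.95}

if __name__ == "__main__":
    for segment, z in segment_anomalies(history, today):
        print(segment, f"z={z:.1f}")
```

The point of slicing by segment is that nobody has to predict in advance which country or card type will break; the comparison runs over every segment that has history.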
Cross-system correlation of logs, events, and telemetry will help you understand
how issues in one system might be causing problems in another. Behavioral
pattern recognition flags outliers, say, by detecting when customer behavior
changes in ways that might indicate there is some problem
causing customers to change their behavior.
And we should also look at how to automate root cause analysis with
investigation workflows: systems that can trace problems
back to their source automatically. Instead of asking, are services healthy?
Unknown detection asks, why is conversion dropping
in ways that we didn't expect?
The key insight is captured in this question:
what is happening that we didn't think to monitor?
In commerce, the most expensive problems are often the ones you never saw coming.
A payment processor starts declining certain types of cards.
A fraud detection rule suddenly flags legitimate customers.
A front-end checkout update breaks the checkout workflow in a specific browser
or a specific in-app experience.
These are issues that can cost millions before you even realize they exist.
This is why modern commerce observability needs to be intelligent
and proactive, not just reactive.
Now let's get practical with lessons from the trenches.
What you should do: start with the subscriber's journey, not system
architecture. Instrument for ARR outcomes, not just technical health.
Create runbooks with churn impact context so your team knows about the business
impact immediately.
And test your observability during billing cycles, when you need it the most.
What not to do.
Don't alert on everything.
Focus on revenue and churn impact.
Don't ignore payment processor dependencies or
other external dependencies.
Don't forget about in-app mobile experience metrics and don't neglect
customer success team communication.
They're often the first to hear about issues.
The key insight here is that effective commerce
observability is as much about process and communication as it is about
technology and the stack that we use. One of the biggest challenges
in commerce observability isn't technical, it's organizational:
getting everyone on the same page.
The challenge is that different teams have different priorities.
Engineering typically focuses on system uptime and performance.
Product cares about feature adoption and retention.
BU and finance organizations care about ARR
and churn reduction.
The customer success team wants a seamless billing experience
and a very low call volume.
The solution here is alignment around shared business outcomes.
We have to create shared dashboards with ARR context so everyone
sees the same business metrics.
We have to implement cross-functional incident response for payment
issues, and not just treat them as engineering problems.
We should use revenue impact scoring for all alerts so teams understand business
priority, and hold regular business reviews of subscription health metrics.
When everyone understands how their work connects to revenue and customer success,
observability becomes a shared responsibility, not
just an engineering tool.
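Revenue impact scoring for alerts can be sketched very simply. The scoring formula below (affected customers times average order value times failure fraction) and the alert fields are assumptions chosen for illustration, not a description of how any particular alerting tool works.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    affected_customers: int   # customers hitting the issue per hour
    avg_order_value: float    # average revenue per affected customer
    failure_fraction: float   # share of affected attempts that fail (0..1)


def revenue_impact_per_hour(alert):
    """Estimated revenue at risk per hour; a deliberately simple product."""
    return alert.affected_customers * alert.avg_order_value * alert.failure_fraction


def prioritize(alerts):
    """Sort alerts so the largest estimated revenue impact comes first."""
    return sorted(alerts, key=revenue_impact_per_hour, reverse=True)


# Hypothetical alerts with made-up numbers.
alerts = [
    Alert("high CPU on batch worker", 0, 0.0, 0.0),
    Alert("card declines up on one gateway", 400, 60.0, 0.25),
    Alert("wallet edit page errors", 50, 120.0, 0.5),
]

if __name__ == "__main__":
    for a in prioritize(alerts):
        print(f"{revenue_impact_per_hour(a):>8.2f}  {a.name}")
```

Putting a revenue estimate next to each alert gives engineering, product, finance, and customer success one shared number to argue about instead of four separate dashboards.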
And speaking of that product perspective and a non-engineering
perspective, let me now hand it over to my colleague Devan, who will
share insights on observability from a product manager's perspective.
Take it over, Devan.