Transcript
This transcript was autogenerated. To make changes, submit a PR.
And thank you for joining today's session.
My name is Josh Thomas and I'm currently leading engineering and architecture
for commerce and financial systems.
Over the years, I've worked on subscription billing, payment optimization, and
large-scale enterprise integrations across platforms like Zuora, Recurly,
Stripe, Oracle, and Salesforce.
Today I'm excited to share how platform engineering principles can transform
payment systems from a cost center into a real competitive advantage.
Going to the next slide, let's start with today's agenda.
Here's how I will be spending the time in this presentation.
We'll start with the evolution of payment infrastructure, how we have gone
from monolithic processors to distributed, event-driven systems.
Next, I'll walk you through the core principles of platform
engineering applied to payments.
Then we will look at how to design for resiliency, because payments will fail,
and at key strategies like payment failure recovery.
After that, I'll highlight the observability and compliance side of things,
two areas where I think teams are usually slowed down if it's not built into
the platform. But if it's built into the platform, it can actually accelerate
delivery.
And finally we will talk about scaling across different business
models: B2C, B2B, and usage-based billing.
By the end of this talk you'll see how modern platforms can directly drive
revenue, retention, and customer trust.
Okay, so to start off, let's see how a traditional payment system compares
with a platform-driven approach.
These are some of the differences you can see.
Traditionally, payment systems were seen as an unavoidable cost of doing
business.
They were brittle, manual, and full of tribal knowledge locked into specific
teams.
Adding a new feature or a payment method took months, and when a failure
happened, there was no visibility into what went wrong.
In contrast, if you look at a platform-driven approach, you can see
we build self-service infrastructure that abstracts the complexity.
With automated recovery processes built in, you don't need firefighting, right?
Because everybody knows when a system goes wrong.
And then we can easily look at the runbook and see
how we can start the recovery process.
So we reduce the friction, and product teams can launch much faster, right?
Because of this, you don't have to worry too much about things going wrong.
You can always focus on what to add next or what features to build, right?
So payments stop being an overhead and become a lever of growth.
Moving on.
So let's look at the evolution of payment infrastructure.
Let's take a step back and see how payment infrastructure has evolved.
We started with monolithic processors: one provider, tightly coupled systems,
zero flexibility. As companies expanded, they integrated multiple providers,
payment processors like Braintree, Stripe, and Adyen.
But each required custom integration code, leading to duplicated effort.
One payment processor would work one way, and another payment processor
would work another way, right?
So it was a mix of a lot of APIs and different architectures.
The next stage evolved into more API-first services. Providers began
offering abstraction layers, and teams still had to handle all the
failures, but it was a good step in the right direction, right?
But today, leading companies are looking at event-driven payment
platforms, right?
Where it's distributed architectures, and recovery, observability, and
self-service are baked in from the start, right?
So the shift is really clear in how things have evolved:
from monolithic to event-driven payment platforms, where you
get notified of everything that happens, right?
So that's what brings structure into the payment infrastructure.
So, moving on.
Let's look at the core platform engineering
principles for payments, right?
The first thing that we always look at is:
can we abstract the complexity so that it's not part of everything, right?
Product teams really need not know every minute detail of, say, Visa,
MasterCard, or a local payment method, right?
So how can we abstract all of those things in
the engineering layers, right?
Second, how can we build self-service by default, right?
So that engineering teams are not involved in each product
launch, pricing change, or payment method launch, right?
It's more configuration-based, something any product team can do.
Third is compliance, right?
Instead of treating regulations as a separate
process, where we have a lot of UAT and back-end process,
how can we directly embed them into the infrastructure, so that the APIs have
proper auditing and logging, right?
And then, how can we create a paved road, which is optimized defaults for
common payment scenarios while allowing flexibility where needed, right?
So basically, not building overly complex systems, but building a system in
such a way that we can customize as needed, without having to support
every customization, right?
So that it's easy to maintain the code as well.
So, looking at the architecture, right?
Let's look at what things are really needed in
a payment platform, right?
One of the things that we always look at is that you need a gateway, right?
A unified interface that can abstract multiple providers.
Say in the case of Stripe and Braintree, each of them has different APIs and
a different way of handling things, right?
But if you have a gateway which can talk to them, it
provides that flexibility, right?
From an engineering standpoint, or from a product standpoint, you have a payment
processor or a payment provider, and you just call the same payment gateway,
and the payment gateway has the complexity to make the transformations that are
needed for each of those processing layers.
So that's one way of abstracting it and having a unified interface.
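To make the unified-interface idea concrete, here is a minimal Python sketch of
a gateway that hides provider differences behind adapters. Everything in it is
an assumption for illustration: the two providers, their payload shapes, and
the class names are hypothetical and do not reflect the real Stripe or
Braintree APIs.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ChargeResult:
    provider: str
    approved: bool
    reference: str


class ProviderAdapter(ABC):
    """Translates the gateway's unified request into one provider's format."""

    @abstractmethod
    def charge(self, amount_cents: int, currency: str, token: str) -> ChargeResult:
        ...


class CentsProviderAdapter(ProviderAdapter):
    # Hypothetical provider that expects integer minor units (cents).
    def charge(self, amount_cents, currency, token):
        payload = {"amount": amount_cents, "currency": currency.lower(), "token": token}
        # A real adapter would send `payload` to the provider here.
        return ChargeResult("cents_provider", True, f"cents:{payload['amount']}")


class DecimalProviderAdapter(ProviderAdapter):
    # Hypothetical provider that expects a decimal string such as "19.99".
    def charge(self, amount_cents, currency, token):
        payload = {"amount": f"{amount_cents / 100:.2f}", "currency": currency, "token": token}
        return ChargeResult("decimal_provider", True, f"decimal:{payload['amount']}")


class PaymentGateway:
    """Single entry point; callers never see provider-specific formats."""

    def __init__(self):
        self._adapters = {}

    def register(self, name: str, adapter: ProviderAdapter) -> None:
        self._adapters[name] = adapter

    def charge(self, provider: str, amount_cents: int, currency: str, token: str) -> ChargeResult:
        if provider not in self._adapters:
            raise ValueError(f"unknown provider: {provider}")
        return self._adapters[provider].charge(amount_cents, currency, token)


gateway = PaymentGateway()
gateway.register("cents_provider", CentsProviderAdapter())
gateway.register("decimal_provider", DecimalProviderAdapter())
```

The calling code always passes the same arguments to `gateway.charge`; only the
adapter knows how each provider wants the amount formatted.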
Next is the subscription engine, right?
The subscription engine is the core of a subscription business.
That's where you create the subscriptions, and each
subscription has a life cycle, right?
It can be monthly, it can be annual, it can be a bundle, or
different price points baked into the subscription, right?
The subscription engine abstracts all the logic of making sure you bill
the right customer at the right billing period, and consolidates everything,
with discounts, taxes, and other things taken into consideration, right?
So that's the responsibility of the subscription engine.
Then event processing; that's core for any system, right?
Whether it is third-party systems or systems within, right?
When there is an action happening, say an order is created or
a subscription is renewed, right?
That becomes your event, right?
And based on that event, you may have five to ten actions to take, right?
Maybe sending out a notification, sending it for payment collection,
making sure the GL is posted, and making sure the revenue is also allocated,
right?
So these are things that need not be a really tightly coupled
synchronous transaction process, right?
All of this can be an event-driven process where all
the systems or microservices get notified, and then they can do the
step that is needed of them, right?
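The fan-out described above can be sketched with a tiny in-process
publish/subscribe bus. This is only an illustration under stated assumptions:
the event name, the handlers, and the `EventBus` class are hypothetical, and a
production platform would normally put a message broker between publisher and
subscribers so the steps run asynchronously.

```python
from collections import defaultdict


class EventBus:
    """Minimal synchronous pub/sub: one event, many independent handlers."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Each subscriber reacts on its own; the publisher knows none of them.
        return [handler(payload) for handler in self._handlers.get(event_type, [])]


def send_notification(event):
    return f"notified customer {event['customer_id']}"


def collect_payment(event):
    return f"queued payment of {event['amount_cents']} cents"


def post_to_general_ledger(event):
    return f"posted GL entry for subscription {event['subscription_id']}"


bus = EventBus()
bus.subscribe("subscription.renewed", send_notification)
bus.subscribe("subscription.renewed", collect_payment)
bus.subscribe("subscription.renewed", post_to_general_ledger)

results = bus.publish(
    "subscription.renewed",
    {"customer_id": "c-1", "subscription_id": "s-9", "amount_cents": 1500},
)
```

Adding a new downstream action, such as revenue allocation, means adding one
more subscriber; the code that renews the subscription does not change.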
And the last part is revenue recovery, right?
In any system there will be failures, right?
In the case of payments, once you have a subscription, you have a payment
method, and that card could expire.
There may not be enough funds on the card to make the payment.
So there are multiple cases that can happen.
It's always good to have an automatic retry mechanism built into the platform
so that we are able to collect the amount and recover the revenue from the
customer.
Moving on.
How do you really build resilient systems, right?
One of the things that we always see is that payment services
always need to have 99.99% availability, right?
The target can be five nines, four nines, or three nines, right?
That's the minimum that is needed, because anytime there is
downtime for a payment system, you can see how it impacts the business, right?
If you look at this, even at three nines (99.9%), right?
You can have only about eight hours of downtime per year.
That is across all 365 days, right?
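As a quick check on the numbers, the downtime budget follows directly from the
availability target. The short Python sketch below (the function name is just
for illustration) shows that the figure of roughly eight hours per year
corresponds to three nines (99.9%), while 99.99% allows under an hour.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def allowed_downtime_hours(availability_percent: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    hours = allowed_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

So 99.9% leaves about 8.76 hours per year, 99.99% about 52.6 minutes, and
99.999% about 5.3 minutes.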
So the systems have to be highly resilient.
They have to be reliable, and they have to scale on their own
based on the customer needs as well as the traffic on the site, right?
And other systems should not be able to bring them down, right?
So that's core, and when there are multiple services involved in
this payment flow or checkout process,
we have to make sure this applies to all the services, right?
It's not only one service being up and running; whatever other services
are there in the order-to-cash flow,
they all have to be up and running.
And it needs to be resilient all the time.
So how do you really achieve this?
One of the things that we can always think about is circuit breakers, right?
If some systems are failing, how
can we avoid adding too much load onto those systems, right?
And start throwing an error right away so that we don't bring down
those systems as well, right?
So that is one way to think about it, right?
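A circuit breaker of the kind described here can be sketched in a few lines of
Python. This is a simplified illustration, not a production implementation:
the thresholds, the class names, and the injectable `clock` parameter are
assumptions chosen to keep the example small and testable.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call to protect a failing service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Fail fast: do not add load to a service that is already struggling.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold or self._opened_at is not None:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each call to a downstream provider, for example
`breaker.call(charge_card, order)`, where `charge_card` stands for whatever
function talks to the provider, and treat `CircuitOpenError` as a signal to
use a fallback path.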
And then, how can we have a fallback process, right?
Because sometimes the customer does have the money on the card, right?
It may be some other process that failed on our side.
So instead of having the customer go through the entire flow and do a
checkout again, how can we make sure we can do the reprocessing on our side,
as fallback processing: give them access to the product, and then maybe collect
the money at a later point in time.
So how do you build compensation handling or fallback processing, right?
That's something that will be really useful when you think about it.
And then the other thing is the retry logic, right?
How do you really think about retries?
How can you learn from all the failures in the system,
look at the data that we have, and see how
we can build intelligent processing, right?
With AI in mind,
it's always good to have those models learn from the data that
you already have, and then come up with a model which can
predict what the intelligent retry is, or what time to retry, right?
For those payments to be successful.
So there are a lot of use cases where we can bring AI in when you're
thinking about resilient payment systems.
So this is a case study that we have put together, which
shows how it happens, right?
Before the platform engineering, 15% of the subscription payments failed on
the first attempt, and the manual retry process
was all best guesses.
You cannot do infinite retries, right?
There always has to be a limit on how you retry,
maybe four or five times, right?
In a 20-day period.
Because the more retries you do on failed cards, Visa, MasterCard, or any of
these payment networks will say, okay, you are trying to charge a customer the
wrong way, or you will get fined because you are retrying in
cases where you are not supposed to retry.
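The constraint described here, a small number of retries spread over a bounded
window and no retries for declines that should not be retried, can be sketched
as follows. The decline codes, the day offsets, and the function name are
hypothetical placeholders, not values from any card network or processor.

```python
from datetime import datetime, timedelta

# Hypothetical "do not retry" decline codes; real codes vary by processor.
NON_RETRYABLE = {"stolen_card", "account_closed"}

# Assumed schedule: four attempts spread across a 20-day window.
RETRY_OFFSETS_DAYS = [3, 7, 13, 20]


def plan_retries(first_failure: datetime, decline_code: str) -> list:
    """Return the retry timestamps for a failed payment, or [] if none are allowed."""
    if decline_code in NON_RETRYABLE:
        return []
    return [first_failure + timedelta(days=offset) for offset in RETRY_OFFSETS_DAYS]


schedule = plan_retries(datetime(2024, 1, 1), "insufficient_funds")
```

Keeping the schedule in data rather than in code makes it straightforward to
swap in a different retry pattern later without touching the retry logic.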
So there was a
lack of visibility into what the different error codes are, and
they were not being processed properly.
And there was no visibility into how the recovery performance was going, right?
Are we able to collect most of the payments, and
are we collecting on the third attempt, the fifth attempt, or the tenth
attempt, right?
There was no real visibility into what's
happening. And to tell the truth, right?
A 40% recovery rate
on failed payments
is not a big number, right?
It should be a higher number, because these are customers
who are subscribers and they're supposed to pay the company, right?
So that's where I think this becomes a big revenue lever
for any company to focus on and build engineering around.
So coming back to how you
get it right.
With the platform implementation, we start using smart algorithms
and we consider all the data, right?
Because all these failures have a lot of data associated with them.
Whether it's from the MasterCard and Visa networks or from the payment
processors, there are a lot of error codes and patterns available,
which we can look at and say, okay, if the payment
failed with this code, this is a retryable error, and if it's retried on, say,
Friday at this time, it could be successful.
So those are the things.
And then there are always A/B testing frameworks for retry
strategies that can be applied.
You can always bucket some
variations, right?
For the payment retries, and then try to learn from them, right?
Even though the data suggests something, unless you try it out, right?
You'll never know how successful it is and how it compares
to the current strategy, right?
So it's always better to do A/B testing and figure
out what the right window is,
and then do more payment retries in that window.
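One common way to bucket retry variations is deterministic hashing, so that
the same subscription always lands in the same variant. The sketch below is an
assumption about how such bucketing could look; the experiment name, variant
labels, and function names are made up for illustration.

```python
import hashlib


def assign_variant(subscription_id: str, variants: list, experiment: str = "retry-timing-v1") -> str:
    """Stable assignment: hashing the ID means no assignment table is needed."""
    digest = hashlib.sha256(f"{experiment}:{subscription_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]


def recovery_rate_by_variant(outcomes) -> dict:
    """outcomes: iterable of (variant, recovered) pairs from completed retries."""
    totals = {}
    for variant, recovered in outcomes:
        attempts, wins = totals.get(variant, (0, 0))
        totals[variant] = (attempts + 1, wins + int(recovered))
    return {variant: wins / attempts for variant, (attempts, wins) in totals.items()}


variants = ["retry_next_morning", "retry_on_friday"]
chosen = assign_variant("sub_12345", variants)
rates = recovery_rate_by_variant([
    ("retry_next_morning", True),
    ("retry_next_morning", False),
    ("retry_on_friday", True),
])
```

Comparing the per-variant recovery rates over enough traffic is what tells you
which retry window actually performs better than the current strategy.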
Then it's always a smart idea to have real-time monitoring, because without
monitoring it's really difficult to bring out what
level of success rate is coming in, right?
You always have to be on top of it, monitoring
how the retries are working.
And if we are reliant on an AI model, there
are always chances that sometimes it may not be doing it the right way,
right?
Or the success rates may have started going down.
So it's always better to keep track of whether the model is
trending in the right direction and make changes based on that.
And if you look at the 78% success rate, we brought about
close to a 40-point increase in the success rate, and that translates
to millions in recovered revenue.
Okay, so moving on to compliance, right?
Compliance
is something that I think is always feared by engineering teams because it
adds a lot of overhead to the systems.
You have to make sure you audit the systems.
You have to build in more data validation or data checks, right?
So it's always a pain
to think about compliance if
it becomes an afterthought, right?
But I think if you start thinking about compliance and bake it into
the platform, then it becomes such an easy thing, right?
One of the things is a tokenization service:
secure APIs that handle payment data.
Because one of the things with tokenization is that you're not
exposing the customer's
payment or credit card information, right?
It's always best, as a company, not to store any of it.
That's where I think hosted fields come in.
If you're using Stripe, Zuora, or Recurly, everybody has
hosted payment method fields.
You just have to have the SDK,
load the JavaScript, and then they take care of collecting the
payment and credit card numbers, and they tokenize them for us.
And then we just use the token
whenever we want to charge the
customer or do any transaction on that particular credit card, so that
we are not storing that information.
And that makes it easier for our systems to reduce their PCI scope, right?
Because we are not handling any of that data.
Then infrastructure as code.
It's always good to have infrastructure as code from the very start,
because with all the cloud platforms, it's easier to build in
Terraform infrastructure-as-code capability.
So it's easy to scale, it's easy to spin
up new environments, and you can
put in more guardrails, right?
So that not all of the engineering team has access to critical data, and
everything is controlled by the IaC code and not through manual access
directly through the UI, right?
So it's easy from an auditing standpoint to make sure everything is
done through code, and all the code
deployments happen through an approval process,
like a CMC process, right?
So that's what it brings in.
And then building in automated compliance testing, right?
We can always automate a lot of these things.
We can have frameworks and automated security scanning done once every
month or once every quarter, right?
Just to make sure all the systems are still adhering
to all the compliance standards.
Moving on to observability, right?
This is one of the most interesting things, right?
In the current world, if you need to build a resilient system,
you need to know what's happening within your system, right?
And if you want to know what's happening within your system,
you need to start
looking at how to build in observability from the start, right?
If you are looking at platforms like Splunk or New Relic, right?
They give us a lot of visibility into what's happening within
the containers.
What is the CPU usage?
What is the memory usage, right?
Are they not enough, or are they spinning up new instances
because the current instance size is not able to handle the load, right?
All these things
need to be monitored so that we can be sure we are building
in the right direction.
So that's where I think it has to be part of all the services
that we build in the engineering landscape, right?
So that we know what the health metrics are, what the Apdex score is, whether
there is a higher error rate,
and whether there are any anomalies.
All of this can be identified, and the good thing about having proper
observability is that you can notify the right people at the right time.
And then when the team comes in, it's easy for them to
look at the trace and
figure out, okay, this is because of
this issue, so let's either roll back the changes or increase the instance
size or the memory, right?
So that's how we think about it, right?
That's where
observability becomes very important, so that you can easily identify issues
and take resolution steps
as and when an issue happens.
So how do you really scale payment platforms across business models?
There are really three core business models, right?
One is B2C, where you
deal directly with your customers.
Any customer that lands on your site, if they want to buy a
subscription or something, you are dealing directly with them.
So there will be millions of customers, millions of payment methods, and a lot
of transactions happening, right?
Because it's B2C.
In those cases, I think we have to really track how the user
is progressing through the funnel.
How do you do A/B testing based on different flows?
If some people like different payment methods, right?
Is it better to have one payment method highlighted
here versus another payment method?
Based on platform and geo, we should be able to surface
different payment methods.
Localized payment methods are also one of the important things, because
when you look at becoming a global company, a lot of the payment
methods that people are familiar with in the US may not be the same ones that
are popular in other countries, right?
So it's always necessary to look at those platforms to enable
those localized payment methods.
For India, it's UPI, and for Mexico, it's Mercado Pago.
So there are different things, right?
And then how do you simplify it so that there are not many hoops to jump
through?
The second is B2B enterprise platforms.
This is more enterprise to enterprise, right?
Your customers are enterprises, where
the sales team is trying to pitch them: okay, hey, these are
the products that we are selling.
This is the pricing or the product catalog.
And these are the different combinations of products that we sell.
And this may be a combination of a one-time charge,
setup fees, and subscriptions, along with amendments and some usage-based
charges, right?
So these are some of the things that they try to sell customers.
Here, I think the complexity is more in the billing hierarchy.
It may not be the number of subscriptions;
it will be more the complexity of how the products and pricing are handled.
And then how do you integrate it with a procurement system, if there is one,
supply chain, and other things, right?
And multi-entity invoicing capabilities, right?
So that's the complexity there.
The other thing is usage-based billing, right?
This can happen for both B2C and B2B.
This is getting used by
almost all companies because of the way gen AI has come out, right?
Everybody's talking about tokens, and a lot of things are
now based on usage, right?
There are different models, even in the health industry,
where we are using, say, this many sessions; then how
do you bill the customer, right?
Or do you bill the insurance, right?
So usage-based billing is important, where on one side
you have to track the different usage patterns of
the customers, and then how do you map them to the systems, right?
Like the subscription platforms, where
this usage
can be mapped to how much they need to be invoiced, right?
So that is something, along with how you
use the new predictive billing kind of thing,
and how the pricing rules map these usages to the billing side of things.
Moving on.
I just want to spend a little more time on
usage-based billing, because
I think that's where a lot of teams have felt complexity.
There is high-volume event ingestion, right?
Because usage can happen, in the case of, let's say, just
the token-based things, right?
People use the APIs, so you have to keep track of it, and there
will be a lot of events coming in, right?
How do you aggregate all this data across different systems?
How do you have these complex metering abstractions for various patterns?
Do you do a daily load or a monthly load?
That's something that is always tricky, right?
So one of the things that we do from a platform side
is to have predefined dimensions and metrics.
And then having a self-service way of registering dimensions, so that
everybody can say, okay, this much
usage maps to this much.
And then also give them a usage estimation API, so that they can
self-serve, and even the customers can see how much they have used and
how much they have to pay, right?
In the next billing cycle.
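The pieces mentioned here, registered metrics, aggregated usage events, and a
usage estimate, can be illustrated with a small in-memory Python sketch. The
metric names, the per-unit prices, and the `UsageMeter` class are hypothetical;
a real platform would ingest events from a stream and persist the aggregates.

```python
from collections import defaultdict


class UsageMeter:
    def __init__(self):
        self._rates = {}  # metric name -> price per unit, in cents
        self._usage = defaultdict(lambda: defaultdict(int))

    def register_metric(self, name: str, cents_per_unit: int) -> None:
        """Self-service registration: teams declare a metric and its price."""
        self._rates[name] = cents_per_unit

    def record(self, customer_id: str, metric: str, quantity: int) -> None:
        if metric not in self._rates:
            raise ValueError(f"unregistered metric: {metric}")
        self._usage[customer_id][metric] += quantity  # aggregate as events arrive

    def usage(self, customer_id: str) -> dict:
        return dict(self._usage.get(customer_id, {}))

    def estimate_cents(self, customer_id: str) -> int:
        """What the customer would owe if the billing cycle closed now."""
        return sum(qty * self._rates[metric] for metric, qty in self.usage(customer_id).items())


meter = UsageMeter()
meter.register_metric("api_call", 2)     # assumed price: 2 cents per call
meter.register_metric("session", 1500)   # assumed price: 15 dollars per session
meter.record("c1", "api_call", 100)
meter.record("c1", "api_call", 50)
meter.record("c1", "session", 2)
meter.record("c2", "api_call", 10)
```

Here `meter.estimate_cents("c1")` adds 150 API calls at 2 cents and 2 sessions
at 1,500 cents, which is the kind of figure a usage estimation endpoint could
show the customer before the invoice is generated.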
Or if the billing is upfront, then they have to make sure,
okay, we cut access, and then they pay more, and then they get more access.
So that's the way that we are thinking about it, and
this is becoming the norm nowadays.
So that's why I think everybody is interested in knowing how to really
do usage-based billing.
And how do you think about building from a developer
experience perspective, right?
For payment systems.
It's always an API-first design approach.
I think that's a pattern that has been around for years,
and all the developers have really accepted it, right?
People are familiar with it.
Everybody knows how to use APIs, REST APIs or GraphQL, right?
So this is a pretty common standard nowadays.
And also developer tooling, right?
If you want people to start using it in their
own languages, it's always good to build SDKs.
SDKs, CLI tools, and sandboxes are a good way of bringing in new developers
to start interacting with your product or start building against yours, right?
Whenever you start building, say you want to
build an integration with Stripe or
Recurly, you always try to see whether there's an SDK that you
can just directly integrate, right?
Just import the SDK and start building.
So you don't have to write that glue code, right?
For transforming the REST API request and response.
One important thing is that those SDKs have to be maintained, so that
as soon as a new version comes in, we make sure the
SDKs or CLIs are updated, right?
That's the thing with it.
But I think that's one good way.
Then documentation; that's always important.
Because if you don't have a big support team or anything, right?
It's always better to have proper documentation,
so that any developer can just go in and say, okay, this is
the request, this is the response, this is how you call our API.
And that helps you.
And then having a developer portal with best practices, payment resources,
and the key contacts whom they can reach out to.
And there are templates and tools available, right?
These are things that apply across the board, right?
This is not specific to payment systems.
For any team that wants to work on developing new solutions,
these are some best practices.
So how do you measure success, right?
Beyond uptime.
A lot of it from a platform standpoint is developer velocity: how
do you roll out new features, right?
That's something that we try to see, right?
From an engineering standpoint, how fast can we work on the business
initiatives, and how much effort is being put in?
So that we are not always asking engineering teams to do the work, right?
How many things can be done in a self-service way, so that
we build UI-based solutions or API-based solutions
which product teams can execute on their own, right?
So that the engineering teams are not involved.
And then quality, right?
How can we reduce the number of incidents that happen, right?
Basically making sure that
when we are building something, we have proper quality checks in place, with
regression that can be run, right?
So that they are always up to the standards that we are expecting.
That's on the engineering side, but coming to business impact, right?
There is always revenue recovery, which is tied to
millions of dollars, right?
So we have to make sure we are tracking
it very closely and trying to see how many payments we can recover.
The other is conversion, which is more the checkout success rate, right?
There will be, say, a hundred people going through the checkout process.
If only 10 people are placing an order, that means
90 people are falling out of the funnel, right?
How do we improve that, right?
Is it because of friction in the funnel?
Is that something that we can improve on, right?
Are there things which are not clear to the customer, like how to
add a payment method and all that?
How can we reduce that friction?
That's something that we always look at.
Other than that, there's retention.
Retention is really important, and a lot depends on the product also.
If the student likes the product, or the customer likes the product, then
they will keep coming back, right?
And they will be renewing the subscription.
So it's always better to make sure:
how healthy is that interaction with the customer?
And if the user has not been using it for a long time, how can
we bring them back by sending a notification, saying,
okay, hey, these are the best ways you can
utilize the platform, right?
So trying to bring them back to the product.
So we are almost reaching the end.
What are the key takeaways, right?
Let's try to wrap up. First, I think you have to really
start with developer experience,
because developers are the first ones who will
be trying to interact with all these payment platforms or other things, right?
If the platform isn't usable, the adoption really fails, right?
You have to always make sure everything is easy to use.
From a design standpoint, I think you have to always design for
failure, recovery, and resiliency.
Because systems will fail; there are so
many partners and interactions that happen on a payment flow.
So it's always better to design for failures.
Know what the failures are; do
a failure mode analysis, right?
FMA.
And then figure out what the failure points are and
take actions based on that.
Then embed compliance, right?
That was the third one, so that teams are
not really weighed down by all this
PCI-related overhead, right?
And then always try to align
the technical and the business metrics, so that the team is aware of what
the business interests are. That
needs to be the core of the system,
and then we build towards that, right?
So there is always that need to know
what the real goal of the business is,
and what they are really focused on,
and how we can build solutions in such a way that they
align with the business goals.
Yeah.
So this is how you move payments from a cost center
to becoming a growth engine, right?
Being a real part of the business.
Yeah.
So thank you for spending the time with me.
I hope you learned something new, and I'm happy for the opportunity.
Have a good day.
Bye.