Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
My name is Tapan.
I have spent considerable time building payment platforms and working with
payment systems, and today I'm here to share my learnings on what it takes to
build a planet-scale payment platform.
One of the prime examples of that is the global card payment ecosystem.
The global card payment ecosystem needs to process trillions of dollars
annually through one of the most complex distributed systems ever built.
The infrastructure needs to maintain five nines of reliability while
handling billions of transactions across continents and currencies, while
dealing with a variety of regulatory frameworks that are local to each country.
Platform engineers who are trying to design these systems need to deal with a
unique convergence of technical challenges.
They need to build a system that is extremely reliable, that can process
transactions with subsecond latency, and that has zero tolerance for data loss.
All of these challenges make designing a payment platform that can operate at
planet scale really hard.
Let's dig deeper into the scale and complexity challenges that come with a
payment platform.
When we think of a payment platform, especially one processing card
transactions, and even when we are focusing on a specific region, we are
looking at 150 billion card transactions happening annually.
And when we look at holiday periods, that traffic can sometimes run
to 65,000 transactions per second.
And when you consider downtime scenarios, you're looking at $31,000 worth of
money lost if you have a downtime of one second.
That becomes really hard.
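
To put those numbers side by side, here is a small back-of-the-envelope sketch. It uses only the three figures quoted above plus the five-nines target mentioned earlier; it assumes a flat 365-day year, and the function names are made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years

ANNUAL_TRANSACTIONS = 150_000_000_000   # figure quoted in the talk
PEAK_TPS = 65_000                       # figure quoted in the talk
LOSS_PER_SECOND_USD = 31_000            # figure quoted in the talk


def average_tps(annual_transactions: int) -> float:
    """Average transactions per second if traffic were perfectly flat."""
    return annual_transactions / SECONDS_PER_YEAR


def allowed_downtime_seconds(availability: float) -> float:
    """Yearly downtime budget for a given availability target."""
    return SECONDS_PER_YEAR * (1.0 - availability)


def downtime_cost_usd(seconds: float) -> float:
    """Money lost for an outage of the given length, at the quoted rate."""
    return seconds * LOSS_PER_SECOND_USD


if __name__ == "__main__":
    avg = average_tps(ANNUAL_TRANSACTIONS)
    budget = allowed_downtime_seconds(0.99999)  # five nines
    print(f"average load: {avg:,.0f} TPS")
    print(f"peak / average: {PEAK_TPS / avg:.1f}x")
    print(f"five-nines budget: {budget:.0f} seconds per year")
    print(f"cost of a 5 minute outage: ${downtime_cost_usd(300):,.0f}")
```

The average works out to a bit under 5,000 transactions per second, so the holiday peak is more than ten times the average, and five nines leaves only a little over five minutes of downtime for the whole year.
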
And you know what?
When you are trying to process a transaction, there are multiple parties
involved, which makes the challenge much more unique and harder to solve.
For example, when we look at a card transaction, there are cardholders
who actually own the card.
There are merchants who are trying to accept the card, and there are acquirers
who are trying to settle the card transactions.
There are the networks, there are payment gateways, and there are issuers, the
banks that actually issued the card and need to approve the transactions.
When you look at all these parties, they create a complex web of dependencies
that needs to be orchestrated in real time.
Take a single card swipe at a coffee shop.
You're expecting the transaction to go through in roughly 100 to 200
milliseconds, and in that time frame the transaction has to go through all the
parties involved in the real-time flow and come back with an approval for you
to be able to buy that cup of coffee.
That requires serious distributed systems knowledge and actual planning as a
platform engineer.
Let's go deeper.
When we look at a card transaction, somebody is trying to initiate a purchase.
That purchase needs to be authorized.
For that, it has to be processed through gateways.
There are multiple gateways that could be involved, and we need to
find the right gateway to work with.
Each party involved in a payment transaction wants to reduce its own
liability, so each runs its own set of fraud checks.
So you're looking at a series of fraud checks that need to be performed
for a transaction to be successful.
At the end, when a transaction reaches the issuer, the issuer, who has the
final authority, has to approve the transaction, and then all of that has to
flow back to you in real time for you to be able to see the result.
So that brings its own complexity. It's a multi-step transaction, and then you
throw in the complexity of multiple parties.
Each party is dealing with multiple data centers and multiple geographies.
And you're trying to do this while being compliant with dozens of regulatory
frameworks.
On top of that, payment is a sensitive business, right?
Security is paramount.
So you need security to be taken care of at every hop.
So payment platform engineers, when designing the system, need to design it
such that it can handle hardware failures.
It has to be resilient to traffic spikes without dropping transactions.
It requires very sophisticated redundancy strategies and intelligent routing
algorithms, so that in real time you can detect failures, reroute
transactions, and maintain transaction consistency across globally
distributed systems.
We don't want to run into a scenario where a transaction is successful on the
bank side, but the gateway side says it failed, right?
We need both sides to hold the same state, so we need strict rather than
eventual consistency there.
Let's go further.
When we think about distributed system architecture for payment processing,
I think a few key concepts are very important, right?
One is horizontal scalability.
As much as we would like to upgrade individual machines and push for more
vertical scalability, what really works well for payment systems is horizontal
scalability, because that allows the system to scale with the number of
transactions and to meet the requirement of subsecond latency while
maintaining consistent performance.
You also want the system to be built using a service-oriented architecture.
You want the critical components to be isolated, and that has multiple benefits.
One is that it allows you to upgrade a smaller component instead of
bringing the entire system down.
Second, it allows critical failures to be limited to the specific components
where they occur.
Geographic distribution is equally important because you want the system to
have good uptime.
You also want to protect against catastrophic failures like an earthquake, for
example, or a flood.
You also want to process these transactions with subsecond latency, which
means that in many cases you can't afford to have a transaction travel between
continents, because that adds to the overall latency of the transaction.
Data consistency.
That's one of the key parts I hinted at earlier, right?
Data consistency in distributed environments is very important.
All of us are familiar with the CAP theorem: consistency, availability, and
partition tolerance.
One of the things that stands out for payment platforms is that while most
systems can operate with and tolerate eventual consistency, payment platforms
need to see the transaction go through completely or not at all.
There is no room for ambiguity here.
Within a certain time bound, they need a very definite outcome for a
transaction, and that makes the problem very interesting.
I touched upon latency, right? Card authorization represents the most
latency-sensitive component of payment processing.
Merchants are expecting a response within 100 to 200 milliseconds.
That requires us to validate the card, check available funds on the card,
assess the fraud risk, apply any business logic that is there, and route the
transaction to the right issuer and gateways, while maintaining a subsecond
response time.
That is difficult when you are running through so many network hops.
Payment authorization systems employ several optimization strategies, like
connection pooling and persistent connections.
They want to eliminate TCP handshake times.
They want to use a protocol that sends less data, instead of using an XML
format.
They also leverage machine learning or rule-engine logic to optimize the
routing toward the places that can offer better performance.
All of that is needed to be able to do this in real time.
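
As a rough illustration of the "less data than XML" point, the sketch below encodes the same authorization request two ways and lets you compare the sizes. The `AuthRequest` fields and the fixed 31-byte wire layout are hypothetical, chosen only for the comparison; they are not a real network message format.

```python
import struct
import xml.etree.ElementTree as ET
from typing import NamedTuple


class AuthRequest(NamedTuple):
    card_token: str      # hypothetical 16-character token, not a real card number
    amount_minor: int    # amount in minor units, e.g. cents
    currency: str        # three-letter currency code, e.g. "USD"
    merchant_id: int


# Network byte order, no padding: 16-byte token, uint64, 3-byte currency, uint32.
_WIRE = struct.Struct("!16sQ3sI")


def encode_binary(req: AuthRequest) -> bytes:
    return _WIRE.pack(req.card_token.encode("ascii"), req.amount_minor,
                      req.currency.encode("ascii"), req.merchant_id)


def decode_binary(data: bytes) -> AuthRequest:
    token, amount, currency, merchant = _WIRE.unpack(data)
    return AuthRequest(token.rstrip(b"\x00").decode("ascii"), amount,
                       currency.decode("ascii"), merchant)


def encode_xml(req: AuthRequest) -> bytes:
    root = ET.Element("authRequest")
    for name, value in req._asdict().items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root)


if __name__ == "__main__":
    req = AuthRequest("tok_0123456789ab", 450, "USD", 4211)
    print(len(encode_binary(req)), "bytes as binary")
    print(len(encode_xml(req)), "bytes as XML")
```

The binary form is a fixed 31 bytes for this layout, while the XML form spends most of its bytes on tag names. In practice a schema-based format would likely be used rather than hand-packed structs; the point here is only the size difference.
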
Load balancing and traffic management is another key aspect when you're
designing a distributed system for a payment platform, right?
We need adaptive load balancing: modern platforms need to consider server
health, response times, and queue depths when routing requests.
We don't want a backlog of requests to accumulate at one server.
We want routing to be done in a manner that can meet the latency requirements
and availability requirements.
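
A minimal sketch of that routing decision, assuming each server reports its health, average response time, and queue depth. The `ServerStats` type and the scoring formula are illustrative assumptions, not a description of any particular load balancer.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ServerStats:
    name: str
    healthy: bool
    avg_response_ms: float
    queue_depth: int


def expected_wait_ms(server: ServerStats) -> float:
    """Rough estimate: requests already queued plus ours, each taking the average time."""
    return server.avg_response_ms * (server.queue_depth + 1)


def pick_server(servers: Iterable[ServerStats]) -> ServerStats:
    """Choose the healthy server with the lowest estimated wait."""
    candidates = [s for s in servers if s.healthy]
    if not candidates:
        raise RuntimeError("no healthy servers available")
    return min(candidates, key=expected_wait_ms)


if __name__ == "__main__":
    fleet = [
        ServerStats("a", healthy=True, avg_response_ms=20.0, queue_depth=10),
        ServerStats("b", healthy=True, avg_response_ms=35.0, queue_depth=2),
        ServerStats("c", healthy=False, avg_response_ms=5.0, queue_depth=0),
    ]
    print(pick_server(fleet).name)
```

Server "a" is the fastest healthy machine per request, but its backlog makes "b" the better choice, which is the kind of decision a plain round-robin balancer would miss.
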
We need circuit breaker patterns.
When we see something is unhealthy, we should be able to detect that in real
time and route the traffic elsewhere.
All of that has to be part of the design of the distributed
system so that we can minimize losses and availability issues.
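
The circuit breaker idea can be sketched in a few lines. This is a simplified, single-threaded version under assumed thresholds (open after a number of consecutive failures, allow a trial call after a timeout); the class and parameter names are hypothetical.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call an unhealthy dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # timeout elapsed, let a trial call through
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("dependency marked unhealthy, route elsewhere")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller that catches `CircuitOpenError` can immediately try another gateway instead of waiting on a timeout, which is what keeps a single unhealthy dependency from eating the latency budget.
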
Caching is going to be very important as well, right?
When we are trying to process millions of transactions, there is a lot of
static data that we use to run the checks and the validations.
All of that can be cached to avoid the I/O cost, so that we can maintain the
SLA for latency by leveraging those caches, and often those caches have to be
distributed because your services are deployed on distributed systems.
When we go deeper into the cache itself, with a distributed cache the
invalidation requires very careful orchestration.
You need to make sure updates propagate quickly, preventing any routing errors.
But if you invalidate too aggressively, you can create a thundering herd
problem.
All of this requires a more sophisticated cache warming strategy.
You need a gradual rollout mechanism and multi-level caching
to ensure that you can offer performance with consistency.
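
One common way to soften the thundering herd after an invalidation is to let only one caller reload a missing key while the others wait for its result. The sketch below shows that idea for an in-process cache; `SingleFlightCache` and its loader callback are hypothetical names, and a real distributed cache would need the same coordination across machines.

```python
import threading
import time


class SingleFlightCache:
    """In-process cache where concurrent misses on one key trigger one load."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}      # key -> (value, expires_at)
        self._key_locks = {}    # key -> lock guarding reloads of that key
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def _fresh(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        with self._lock_for(key):
            entry = self._fresh(key)  # another thread may have reloaded meanwhile
            if entry is not None:
                return entry[0]
            value = self._loader(key)
            self._entries[key] = (value, self._clock() + self._ttl)
            return value

    def invalidate(self, key):
        self._entries.pop(key, None)
```

With this in place, twenty requests that miss on the same key at the same moment produce one call to the backing store instead of twenty.
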
Fraud detection.
I touched upon this. It is very interesting, and it is one of the things that
is common across everyone, from the payment gateways to the networks,
to the issuers, to the merchants.
All of them are trying to perform fraud checks and mitigate risk, right?
A card transaction comes with a certain set of rules that are built by the
networks, and if you do the checks well, you can shift the liability to the
other parties in the transaction.
That is one of the reasons why fraud detection and risk management are very
important.
So modern fraud detection systems need to analyze hundreds of
parameters per transaction in real time.
They need to leverage the technology that is available, from device
fingerprinting, device identification, and behavioral patterns
to velocity checks and geographic anomalies.
All of this has to be completed within a tight latency budget so that you can
offer subsecond end-to-end latency.
So all of this has to be done in 10 to 20 milliseconds.
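
Of the signals listed, the velocity check is the easiest to show in code. Here is a minimal sliding-window sketch; the class name, the per-token keying, and the limits are assumptions for illustration, and a production system would keep this state in a shared low-latency store rather than in process memory.

```python
from collections import defaultdict, deque


class VelocityCheck:
    """Flags a card that exceeds max_events within a sliding time window."""

    def __init__(self, max_events: int, window_seconds: float):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # card token -> recent timestamps

    def allow(self, card_token: str, now: float) -> bool:
        recent = self._events[card_token]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_events:
            return False  # too many attempts, hand off to further risk checks
        recent.append(now)
        return True
```

Each call does a small, bounded amount of work per card, which is what makes a check like this fit inside a budget of a few milliseconds.
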
Fraud detection models need to be deployed to multiple regions.
You can't just centralize all the models and then have callers pay the
network cost of querying them to get an answer.
You need the models deployed close to the traffic.
That creates its own challenges, very similar to what we touched upon with
distributed caching.
When you have models deployed in a distributed way, model versioning becomes
a challenge.
A/B testing becomes a challenge, and monitoring becomes a challenge.
The platforms employ sophisticated model-serving infrastructure that can
hot swap models without disrupting traffic.
Adaptive threat response.
Again, another very important one: we need payment platforms to be
able to adapt to emerging threats.
A lot of new things are happening, and when attack patterns change,
platforms need a mechanism to update detection logic without code
deployments. We need that to be rule-engine driven, or ML driven in some
cases, so that new fraud patterns can be defined
using domain-specific languages.
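
To make "update detection logic without code deployments" concrete, here is a tiny sketch in which rules are plain data evaluated by a fixed engine. The rule format, field names, and thresholds are all invented for illustration; real rule engines and fraud DSLs are far richer.

```python
import operator

# Comparison operators a rule is allowed to use.
_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "==": operator.eq,
    "!=": operator.ne,
    "in": lambda value, allowed: value in allowed,
}


def rule_matches(rule: dict, txn: dict) -> bool:
    """A rule matches when every one of its conditions holds for the transaction."""
    return all(
        field in txn and _OPS[op](txn[field], expected)
        for field, op, expected in rule["when"]
    )


def evaluate(rules: list, txn: dict) -> list:
    """Return the names of all rules the transaction triggers."""
    return [rule["name"] for rule in rules if rule_matches(rule, txn)]


# Rules are plain data, so they could be loaded from JSON or a database at runtime.
RULES = [
    {"name": "high_amount_new_device",
     "when": [("amount_minor", ">", 50_000), ("device_age_days", "<", 1)]},
    {"name": "country_mismatch",
     "when": [("ip_country", "!=", "US"), ("card_country", "==", "US")]},
]
```

Because the engine code never changes, responding to a new attack pattern means publishing a new rule entry rather than shipping a new build.
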
Another important aspect that is worth not missing is settlement and clearing.
At the end, the transaction has to settle and the funds have to be
transferred before the transaction is called complete.
This part is slightly different from authorization.
While authorization happens in milliseconds, settlement actually
happens on a different time scale.
Many of the traditional card networks do daily settlement, but with the newly
emerging real-time payment rails,
they're looking for more immediate fund transfer and settlement.
So that requires very sophisticated workflow orchestration.
Each transaction is processed through multiple states.
You need sophisticated state management to know what state the transaction
is in.
We talked about it, right?
Consistency is important.
We need the transaction state to be consistent across all
the partners that you work with and within our own systems.
So you need to maintain a state: is it authorized, is it
captured, is it batched, is it submitted, is it settled?
Each state has different timing requirements, and you need to understand the
failure codes coming from the partners very well.
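
The states listed above map naturally onto an explicit state machine. The sketch below uses the five states from the talk plus a single `FAILED` state, which is an assumption added here; the allowed transitions are likewise a simplification.

```python
from enum import Enum


class TxnState(Enum):
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    BATCHED = "batched"
    SUBMITTED = "submitted"
    SETTLED = "settled"
    FAILED = "failed"  # assumption: one terminal failure state


# Which states may follow each state.
ALLOWED = {
    TxnState.AUTHORIZED: {TxnState.CAPTURED, TxnState.FAILED},
    TxnState.CAPTURED: {TxnState.BATCHED, TxnState.FAILED},
    TxnState.BATCHED: {TxnState.SUBMITTED, TxnState.FAILED},
    TxnState.SUBMITTED: {TxnState.SETTLED, TxnState.FAILED},
    TxnState.SETTLED: set(),
    TxnState.FAILED: set(),
}


class IllegalTransition(ValueError):
    pass


class Transaction:
    def __init__(self, txn_id: str):
        self.txn_id = txn_id
        self.state = TxnState.AUTHORIZED
        self.history = [TxnState.AUTHORIZED]

    def advance(self, new_state: TxnState) -> None:
        if new_state == self.state:
            return  # duplicate partner message: treat as a no-op
        if new_state not in ALLOWED[self.state]:
            raise IllegalTransition(f"{self.state.value} -> {new_state.value}")
        self.state = new_state
        self.history.append(new_state)
```

Rejecting illegal jumps, such as going straight from captured to settled, is what surfaces a missed or out-of-order partner message early instead of at reconciliation time.
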
Another very important aspect, to be honest,
is cross-border complexity.
International transactions add layers of complexity through currency
conversion and cross-border fees, and they change the settlement timelines.
Platforms must integrate with multiple currency exchanges and manage
foreign exchange risk.
Of course, the foreign exchange market moves really fast.
They must also ensure compliance with international money transfer regulations.
Multicurrency settlement requires careful attention to timing.
Exchange rates fluctuate continuously, so transactions must lock in
rates as early as possible.
That requires very sophisticated hedging strategies, especially when you're
dealing with currency movements, and maintaining relationships with liquidity
providers, especially if you're trying to process billions of transactions.
You need relationships with multiple liquidity providers so that
you can minimize the currency risk.
Otherwise you are at the mercy of the partners that you're working with, and
they could potentially charge you more.
And then you might be looking at a very large bill.
You need to be able to do reconciliation at scale. With millions or billions
of transactions flowing through multiple systems daily, reconciliation
becomes a very critical platform capability.
Every transaction must be tracked from authorization to settlement,
with any discrepancy identified early and resolved quickly. Given
the number of parties involved in a transaction, that sometimes requires
quite a bit of work to walk through each partner to ensure the transaction
is in a good state across all partners.
If the funds have been taken out of a buyer's account,
every party needs to show it in the same manner.
And if they have not been, then it should appear the same way everywhere.
Otherwise, you are looking at a series of work that has to be done to get the
transaction into a good state.
So most systems that are trying to deal with planet scale are looking for
continuous reconciliation instead of the traditional
batch-based reconciliation approach that used to work in the past.
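
At its core, reconciliation is a comparison of two views of the same transactions. A minimal sketch, assuming both sides can be reduced to a mapping from transaction id to an (amount, state) pair; the report type and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReconciliationReport:
    missing_at_partner: list = field(default_factory=list)
    missing_internally: list = field(default_factory=list)
    mismatched: list = field(default_factory=list)  # (txn_id, ours, theirs)

    @property
    def clean(self) -> bool:
        return not (self.missing_at_partner or self.missing_internally or self.mismatched)


def reconcile(internal: dict, partner: dict) -> ReconciliationReport:
    """Compare two views of the same transactions, keyed by transaction id.

    Each value is an (amount_minor, state) tuple.
    """
    report = ReconciliationReport()
    for txn_id in sorted(internal.keys() | partner.keys()):
        ours, theirs = internal.get(txn_id), partner.get(txn_id)
        if theirs is None:
            report.missing_at_partner.append(txn_id)
        elif ours is None:
            report.missing_internally.append(txn_id)
        elif ours != theirs:
            report.mismatched.append((txn_id, ours, theirs))
    return report
```

Running the same comparison over small, frequent windows rather than one nightly file is one way to approach the continuous reconciliation described above, since each discrepancy is then only minutes old when it is found.
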
Security.
Another important aspect. Like I said, payment platforms really
push the boundary in all aspects of distributed systems.
Security is one of the most critical aspects of it.
Payment platforms represent high-value targets for cybercriminals.
This is money we are talking about, right?
People are looking for ways to steal money, or to purchase things on your
behalf so that they can enjoy the benefit while you bear the brunt of the
cost of the item that they bought.
So it requires a very comprehensive security architecture.
Defense-in-depth strategies layer multiple security controls, ensuring that
the compromise of any single component does not expose sensitive data.
So that requires multiple things to be achieved and done to ensure that
we can offer end-to-end security.
For example, as we touched upon earlier, one of the key things we can achieve
with a service-oriented architecture is isolating critical systems.
That ensures that when one system is compromised, the rest of the system is not.
That becomes very important.
If we have an attack and we are able to preserve a large part of the
components, we can recover the transactions
because we have the data elsewhere.
If everything is exposed, then we are looking at something that
potentially cannot be recovered.
Hardware security modules: we need tamper-resistant key storage
and cryptographic operations.
That becomes very important compared with a traditional way of storing keys.
We need every request to be authenticated regardless of where it came from.
We need to build components with the principle in mind that this is a
zero-trust architecture, so everything has to go through a proper security
exchange.
Tokenization.
It's something that has started to come out in the last five years.
The networks are pushing for it: they want to move away from sensitive
card numbers to non-sensitive tokens, which can expire after some time.
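
A toy version of the tokenization idea: the card number stays inside one component, and everything else handles a random token with an expiry. `TokenVault` is a hypothetical in-memory stand-in; it is not how network tokenization services are implemented, and the card number used in the usage example is a widely used test value.

```python
import secrets
import time


class TokenVault:
    """Maps short-lived random tokens to card numbers held only inside the vault."""

    def __init__(self, ttl_seconds: float = 3600.0, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # token -> (card_number, expires_at)

    def tokenize(self, card_number: str) -> str:
        token = "tok_" + secrets.token_hex(16)  # random, not derived from the card
        self._store[token] = (card_number, self._clock() + self._ttl)
        return token

    def detokenize(self, token: str) -> str:
        card_number, expires_at = self._store.get(token, (None, 0.0))
        if card_number is None or self._clock() >= expires_at:
            self._store.pop(token, None)
            raise KeyError("unknown or expired token")
        return card_number


if __name__ == "__main__":
    vault = TokenVault(ttl_seconds=60.0)
    token = vault.tokenize("4111111111111111")
    print(token.startswith("tok_"), vault.detokenize(token) == "4111111111111111")
```

Because the token is random rather than computed from the card number, a leaked token reveals nothing about the card and simply stops working once it expires.
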
Key management. We talked about it, right?
Because there are multiple third parties involved, you need to be able to
communicate with them by sharing your public keys and accessing theirs,
while keeping your private keys protected.
Managing cryptographic keys across global platforms
becomes quite a unique challenge.
Keys need to be rotated regularly.
They have to be distributed securely and made available to thousands of
servers without creating a single point of failure.
The platform needs to be able to handle millions of key operations per
second while maintaining audit trails and enforcing security policies.
Now, this is very interesting: this is not just for the outside,
but also for the inside.
When you're building a platform, you also don't want your internal
users to be able to facilitate a transaction that was never authorized.
We want critical audit logs for them.
We also want to ensure that there are enough checks, balances, and controls
in place so that we can detect unauthorized
activity even within the company, right?
Even within the company that is trying to build the payment platform.
So modern platforms employ hierarchical key management systems, where master
keys in hardware security modules encrypt the data encryption keys that are
stored with the data.
That approach works very well because it enables efficient key rotation: only
the key encryption keys need updating, rather than re-encrypting vast amounts
of data.
Sophisticated key derivation schemes enable platforms to generate
transaction-specific keys without storing them.
That reduces the overall key management overhead, all while maintaining
strong security guarantees.
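
The "transaction-specific keys without storing them" idea can be sketched with a keyed hash: the per-transaction key is recomputed from a master key and the transaction id whenever it is needed. This is a simplified HMAC-based illustration using only the standard library, not the scheme any particular platform uses; in practice the master key would live inside an HSM and a standardized derivation function would be used.

```python
import hashlib
import hmac


def derive_transaction_key(master_key: bytes, transaction_id: str) -> bytes:
    """Derive a 32-byte key bound to one transaction id.

    The same inputs always give the same key, so it can be recomputed on demand
    instead of being stored.
    """
    info = b"txn-key-v1|" + transaction_id.encode("utf-8")
    return hmac.new(master_key, info, hashlib.sha256).digest()


def sign(master_key: bytes, transaction_id: str, payload: bytes) -> bytes:
    """Authenticate a payload with the key derived for this transaction."""
    key = derive_transaction_key(master_key, transaction_id)
    return hmac.new(key, payload, hashlib.sha256).digest()


def verify(master_key: bytes, transaction_id: str, payload: bytes, tag: bytes) -> bool:
    expected = sign(master_key, transaction_id, payload)
    return hmac.compare_digest(expected, tag)
```

A tag produced for one transaction id does not verify under another, so a leaked per-transaction key is only useful for that single transaction.
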
The most important part, again as in any distributed system but even more so
for payment platforms, is observability and monitoring at scale.
Observability in payment platforms extends beyond traditional
metrics like CPU and memory usage.
You need to look at business metrics like authorization rates, settlement
success rates, and fraud detection accuracy, which provide
critical insights into platform health.
Platforms must correlate technical and business metrics to identify
issues before they impact transactions.
You need your system to be set up in a way that it
can detect issues very early on and can proactively warn the users.
Distributed tracing enables engineers to follow individual
transactions across dozens of services.
Sampling becomes very critical at scale.
Tracing every transaction would overwhelm the monitoring system, so
you need to be able to detect the patterns and then highlight them.
Adaptive sampling strategies capture enough data to identify
issues while managing overhead.
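
One simple form of adaptive sampling keeps every trace that looks interesting and only a small, stable share of the rest. The function below is a sketch under assumed thresholds (200 ms for "slow", 1% baseline); the names and defaults are illustrative only.

```python
import zlib


def should_keep_trace(trace_id: str, is_error: bool, duration_ms: float,
                      slow_threshold_ms: float = 200.0,
                      baseline_rate: float = 0.01) -> bool:
    """Keep every failed or slow trace, plus a small stable share of the rest."""
    if is_error or duration_ms >= slow_threshold_ms:
        return True
    # Hash the trace id into one of 10,000 buckets so every service that sees
    # the same id makes the same keep/drop decision.
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 10_000
    return bucket < baseline_rate * 10_000
```

Hashing the trace id instead of rolling a random number matters in a distributed setting: every hop reaches the same decision, so a sampled transaction is captured end to end rather than in fragments.
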
Incident response and zero-downtime operations.
When issues occur in a payment system, response time is very critical.
We touched upon this very early on.
Each second is about $31,000 of money lost, right?
For some businesses, it matters a lot whether you are down for five seconds
or for five minutes.
And if you're a payment gateway, for example, the outage could be critical
for a buyer who's trying to make a purchase.
For example, if they're trying to pay a medical bill so that a surgery can be
approved, that can be very critical for them.
Similarly, for merchants who are trying to make money, it's possible that this
is a holiday period, and they want as much revenue as possible coming in
during that period.
And if you're down for five minutes, you're looking at a very
significant revenue loss for them.
So if issues happen, you need to respond in a very timely manner.
Incident response processes must enable rapid diagnosis and remediation
while ensuring that security and compliance are maintained.
Chaos engineering is becoming a practice for payment platforms because you
need that muscle in engineers to be able to deal with such scenarios.
Often payment platforms test themselves by intentionally introducing
failures in controlled environments so that
they can identify weaknesses ahead of time, before they start
to show up in production.
When we look at the future, card processing has been around and it's growing
well, but we are seeing a lot of change in future directions and emerging
challenges.
We are seeing the emergence of real-time payment rails.
Shifting from batch to real-time settlement fundamentally changes the
platform architecture requirements.
Systems designed for daily batches need to evolve to handle continuous
settlement while maintaining the same reliability guarantees, and
that is a hard problem to solve.
The other one is that, with open banking regulations, platforms are forced to
expose APIs that were previously internal.
That creates another set of scaling challenges, as third-party
developers build applications that can generate unpredictable traffic patterns.
That also increases the overall risk exposure.
Cryptocurrency integration.
Of course, cryptocurrency is rising.
Everybody's aware of it.
There are more and more transactions happening through stablecoins now,
and also through typical cryptocurrencies.
So as they gain mainstream adoption and central banks explore digital
currencies, payment platforms need to evolve to support these new payment
methods, which operate on different principles than traditional payment rails.
Overall, platform engineers must design systems that can bridge between
traditional and blockchain-based systems while maintaining
a consistent user experience.
Because in the end, for the user, all of these are currencies.
They don't expect a different experience when buying with a cryptocurrency,
a stablecoin, a digital currency, or a traditional currency.
They expect a very similar outcome in a very similar manner.
With crypto it becomes harder because you need to handle the
volatility of cryptocurrencies.
You have a very different wallet infrastructure, and you need to deal
with evolving compliance regulations.
I think those were all the things.
Let me conclude. Building planet-scale payment platforms requires mastering
numerous technical disciplines while maintaining a focus on
reliability, security, and performance.
Reliability is paramount.
Security is paramount.
Performance is very important.
The architectures and patterns discussed represent years of evolution, right?
The card system had been around for some time, even before the traditional
payment system had been around.
They evolved: security probably came first,
reliability came second, and then performance keeps getting better, where
we are moving from a typical time of a couple of days, three days,
or a week for a transaction to settle to settling the transaction within
a second.
For the platform engineers who are trying to work in this space, I would
say the challenges are immense, but they come as opportunities.
As payment methods continue to evolve and transaction volumes grow,
the need for innovative solutions becomes even more critical.
That is where engineers shine.
The next generation of payment platforms will need to be even more resilient,
scalable, and adaptable than today's systems.
Especially when we look at the emergence of AI, right?
With AI in the hands of people who are trying to create chaos in the payment
system, especially by looking for vulnerabilities, security, reliability,
and performance, all of these aspects will be challenged in this evolving
world.
The payment industry stands at an inflection point. Real-time payment
systems, open banking, and digital currencies
promise to transform how money moves globally.
So platform engineers who understand both what has worked well so far in
payment platforms, or money movement platforms, and the emerging
trends will be in the best position to build the financial infrastructure
that is needed for tomorrow.
The challenges are significant, and for those who are willing to tackle
them, the opportunity to impact global commerce has never been greater.
With that, I want to thank you all for listening to me.
Appreciate the time.
Thank you.
Bye.