Building Planet-Scale Payment Platforms: Engineering Resilience in Global Card Processing Infrastructure

Abstract

Discover the distributed systems, real-time ML, and zero-downtime strategies that power global payment platforms. Learn battle-tested patterns for building planet-scale infrastructure that processes money at light speed.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hello everyone. My name is Tapan. I have spent considerable time building payment platforms and working with payment systems, and today I'm here to share my learnings on what it takes to build a planet-scale payment platform. One of the prime examples of that is the global card payment ecosystem.

The global card payment ecosystem needs to process trillions of dollars annually through one of the most complex distributed systems ever built. The infrastructure needs to maintain five nines of reliability while handling billions of transactions across continents and currencies, while dealing with a variety of regulatory frameworks that are local to each country. Platform engineers who are trying to design these systems need to deal with a unique convergence of technical challenges. They need to build a system that is extremely reliable, can process transactions with subsecond latency, and has zero tolerance for data loss. All of these challenges make designing a payment platform that can operate at planet scale really hard.

Let's dig deeper into the scale and complexity challenges that come with a payment platform. When we think of a payment platform, especially one processing card transactions, and even when we focus on a specific region, we are looking at 150 billion card transactions happening annually. During holiday periods, that volume can run up to 65,000 transactions per second. And when you consider downtime scenarios, you are looking at $31,000 worth of money lost for one second of downtime. That becomes really hard.

And when you are trying to process a transaction, there are multiple parties involved, which makes the challenge much more unique and hard to solve. For example, in a card transaction, there are cardholders, who actually own the card.
There are merchants, who are trying to process the card. There are acquirers, who are trying to settle the card transactions. There are the networks, there are payment gateways, and there are issuers, the banks that actually issued the card and need to approve the transactions. All these parties create a complex web of dependencies that needs to be orchestrated in real time. When you look at a single card swipe at a coffee shop, you expect the transaction to go through, and that timeframe is roughly 100 to 200 milliseconds. In that time, the transaction has to go through all the parties involved in the real-time flow and come back with an approval for you to be able to buy that cup of coffee. That requires serious distributed systems knowledge and careful planning as a platform engineer.

Let's go deeper. When we look at a card transaction, somebody is trying to initiate a purchase. That purchase needs to be authorized, and for that it has to be processed through gateways. There are multiple gateways that could be involved, and we need to find the right gateway to work with. Each party involved in a payment transaction wants to reduce its own liability, so each runs its own set of fraud checks. So you're looking at a series of fraud checks that need to be performed for a transaction to be successful. At the end, when the transaction reaches the issuer, the issuer, who has the final authority, has to approve it, and all of that has to flow back to you in real time for you to see the result.

That brings its own complexity. It is a multi-step transaction, and then you throw in the complexity of multiple parties, each dealing with multiple data centers and multiple geographies. And you're trying to do all this while being compliant with dozens of regulatory frameworks.
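
The multi-party flow described above can be sketched as a chain of hops, each running its own check, with the issuer deciding last and the whole path held to the 200 millisecond budget. This is a minimal illustration, not how any real network is implemented: the `Hop` type, the per-hop latencies, and the check functions are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Hop:
    name: str
    latency_ms: float                # assumed per-hop cost, illustrative only
    approve: Callable[[dict], bool]  # this party's own risk or validity check


def authorize(txn: dict, hops: List[Hop], budget_ms: float = 200.0) -> Tuple[str, str, float]:
    """Walk the hops in order. Any party may decline; the budget caps total time."""
    elapsed = 0.0
    for hop in hops:
        elapsed += hop.latency_ms
        if elapsed > budget_ms:
            return ("TIMEOUT", hop.name, elapsed)
        if not hop.approve(txn):
            return ("DECLINED", hop.name, elapsed)
    # The last hop is the issuer, which holds the final authority.
    return ("APPROVED", hops[-1].name, elapsed)


# Hypothetical parties and checks for a single card swipe.
HOPS = [
    Hop("gateway", 10.0, lambda t: t["amount"] > 0),
    Hop("acquirer", 30.0, lambda t: t["merchant_id"] in {"coffee-shop-1"}),
    Hop("network", 40.0, lambda t: t["risk_score"] < 0.8),
    Hop("issuer", 60.0, lambda t: t["balance"] >= t["amount"]),
]

coffee = {"amount": 4.50, "merchant_id": "coffee-shop-1",
          "risk_score": 0.1, "balance": 120.0}
print(authorize(coffee, HOPS))
```

Passing a smaller `budget_ms` shows how quickly the budget is consumed once every party adds its own processing time.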
Payments are also a sensitive business. Security is paramount, so it needs to be taken care of at every hop. So payment platform engineers, when designing the system, need to design it such that it can handle hardware failures and be resilient to traffic spikes without dropping transactions. It requires very sophisticated redundancy strategies and intelligent routing algorithms that can detect failures in real time and reroute transactions, and it must maintain transaction consistency across globally distributed systems. We don't want to run into a scenario where a transaction is successful on the bank side, but the gateway side says it failed. We need both sides to reach the same state.

Let's go further. When we think about distributed system architecture for payment processing, a few key concepts are very important. One is horizontal scalability. As much as we would like to upgrade individual machines and push for more vertical scalability, what really works well for payment systems is horizontal scalability, because it allows the system to scale with the number of transactions and to meet the requirement of subsecond latency while maintaining consistent performance.

You also want the system to be built using a service-oriented architecture. You want the critical components to be isolated, which has multiple benefits. One is that it allows you to upgrade a smaller component rather than bringing the entire system down. Second, it allows failures to be limited to the specific components where they occur.

Geographic distribution is equally important because you want the system to have good uptime. You also want to protect against catastrophic events such as an earthquake or a flood. And you want to process transactions with subsecond latency, which means that in many cases you can't afford to have a transaction travel between continents, because that adds to the overall latency of the transaction.

Data consistency is one of the key parts I hinted at earlier. Data consistency in distributed environments is very important, and all of us are familiar with the CAP theorem: consistency, availability, and partition tolerance. One thing that stands out for payment platforms is that while most systems can operate with and tolerate eventual consistency, payment platforms need to see the transaction through or not at all. There is no room for ambiguity here. Within a certain time bound, they need to have a definite outcome for a transaction. That makes the problem very interesting.

I touched upon latency. Card authorization is the most latency-sensitive component of payment processing. Merchants expect a response within 100 to 200 milliseconds. That requires us to validate the card, check the available funds on the card, assess the fraud risk, apply any business logic, and route the transaction to the right issuer and gateways, all while maintaining a subsecond response time. That is difficult when you are going through so many network hops. Payment authorization systems employ several optimization strategies, such as connection pooling and persistent connections to eliminate TCP handshake time, and protocols that carry less data than an XML format. They also leverage machine learning or rule engine logic to optimize routing toward the places that can offer better performance. All of that is needed to be able to do this in real time.
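
As one way to picture routing toward the places that offer better performance, here is a minimal sketch that keeps a smoothed latency estimate per route and sends traffic to the fastest healthy one. The talk does not describe a specific algorithm, so the exponentially weighted moving average, the `Route` and `LatencyRouter` names, the starting estimate of 50 ms, and the smoothing factor are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Route:
    name: str
    ewma_ms: float = 50.0   # smoothed latency estimate (assumed starting value)
    healthy: bool = True


class LatencyRouter:
    """Pick the healthy route with the lowest smoothed latency."""

    def __init__(self, routes, alpha=0.2):
        self.routes = {r.name: r for r in routes}
        self.alpha = alpha  # weight given to the newest observation

    def record(self, name, latency_ms, ok=True):
        """Fold one observed call into the route's estimate and health flag."""
        route = self.routes[name]
        route.ewma_ms = self.alpha * latency_ms + (1 - self.alpha) * route.ewma_ms
        route.healthy = ok  # simplistic: a single failure marks the route unhealthy

    def pick(self):
        candidates = [r for r in self.routes.values() if r.healthy]
        if not candidates:
            raise RuntimeError("no healthy route available")
        return min(candidates, key=lambda r: r.ewma_ms).name


router = LatencyRouter([Route("gateway-a"), Route("gateway-b")])
router.record("gateway-a", 200.0)          # gateway-a slows down
print(router.pick())                       # traffic shifts to gateway-b
router.record("gateway-b", 40.0, ok=False) # gateway-b starts failing
print(router.pick())                       # traffic reroutes to gateway-a
```

A production router would need recovery probes and a less abrupt health signal; the sketch only shows the feedback loop between observed latency and route choice.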
Load balancing and traffic management is another key aspect when you're designing a distributed system for a payment platform. We need adaptive load balancing: modern platforms need to consider server health, response times, and queue depths when routing requests. We don't want a backlog of requests to accumulate at one server. We want load to be spread in a manner that meets the latency and availability requirements. We need circuit breaker patterns. When we see that something is unhealthy, we should be able to detect that in real time and route the traffic elsewhere. All of that has to be part of the design of the distributed system so that we can minimize losses and availability issues.

Caching is going to be very important as well. When we are trying to process millions of transactions, there is a lot of static data that we use to run the checks and validations. All of that can be cached to avoid the I/O cost, so that we can maintain the SLA for latency. Often those caches have to be distributed because your services are deployed on distributed systems. When we go deeper into the cache itself, invalidation in a distributed cache requires very careful orchestration. You need to make sure updates propagate quickly to prevent routing errors. But if you invalidate too aggressively, you can create thundering herd problems. All of this requires a more sophisticated cache warming strategy. You need a gradual rollout mechanism and multi-level caching to ensure that you can offer performance with consistency.

Fraud detection. I touched upon this. It is very interesting, and it is something that is very common across everyone, from payment gateways to the networks, to the issuers, to the merchants.
All of them are trying to perform fraud checks and mitigate risk. A card transaction comes with a certain set of rules that are built by the networks, and if you do well, you can shift the liability to the other parties in the transaction. That is one of the reasons why fraud detection and risk management is very important. Modern fraud detection systems need to analyze hundreds of parameters per transaction in real time. They need to leverage the available technology, for example device fingerprinting, device identification, and behavioral patterns, along with velocity checks and geographic anomalies. All of this has to be completed within a tight latency budget so that you can offer subsecond end-to-end latency. So all of this has to be done in 10 to 20 milliseconds.

Fraud detection models need to be deployed to multiple regions. You can't just centralize all the models and then pay the network cost of querying them to get an answer. You need them deployed in each region. That creates its own challenges, very similar to what we touched upon with distributed caching. When you have distributed models deployed, model versioning becomes a challenge, A/B testing becomes a challenge, and monitoring becomes a challenge. Platforms employ sophisticated model serving infrastructure that can hot swap models without disrupting traffic.

Adaptive threat response is another very important one. We need payment platforms to be able to adapt to emerging threats. When attack patterns change, platforms need a mechanism to update detection logic without code deployments. We need that to be rule engine driven, or ML driven in some cases, so that new fraud patterns can be defined using domain-specific languages.
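
To make the idea of updating detection logic without a code deployment concrete, here is a minimal sketch in which fraud rules are plain data that can be reloaded at runtime. The JSON rule format, the field names (`txns_last_minute`, `km_from_last_txn`), and the thresholds are hypothetical, and a real domain-specific language would be far richer than a single comparison per rule.

```python
import json
import operator

# Comparison operators that a rule is allowed to reference by name.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq, "!=": operator.ne}


def load_rules(text):
    """Parse rules from JSON and reject any operator the engine does not know."""
    rules = json.loads(text)
    for rule in rules:
        if rule["op"] not in OPS:
            raise ValueError(f"unsupported operator: {rule['op']}")
    return rules


def evaluate(rules, features):
    """Return the names of all rules that fire for one transaction's features."""
    fired = []
    for rule in rules:
        value = features.get(rule["field"])
        if value is not None and OPS[rule["op"]](value, rule["threshold"]):
            fired.append(rule["name"])
    return fired


# Version 1 has only a velocity check; version 2 adds a geographic anomaly rule.
RULES_V1 = '[{"name": "high_velocity", "field": "txns_last_minute", "op": ">", "threshold": 5}]'
RULES_V2 = '''[
  {"name": "high_velocity", "field": "txns_last_minute", "op": ">", "threshold": 5},
  {"name": "geo_jump", "field": "km_from_last_txn", "op": ">", "threshold": 5000}
]'''

features = {"txns_last_minute": 8, "km_from_last_txn": 9000}
active = load_rules(RULES_V1)
print(evaluate(active, features))
active = load_rules(RULES_V2)  # new pattern shipped as data, no code change
print(evaluate(active, features))
```

Because the rules are data, a new attack pattern can be pushed to every region the same way a configuration change would be, which is the property the talk is pointing at.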
Another important aspect that is worth not missing is settlement and clearing. In the end, a transaction has to settle, the funds have to be transferred, and only then is the transaction considered complete. This part is slightly different from authorization. While authorization happens in milliseconds, settlement happens on a different time scale. Many card networks do daily settlement, but with the newly emerging real-time payment rails, they are looking for more immediate fund transfer and settlement. That requires very sophisticated workflow orchestration. As a transaction is processed through multiple states, you need sophisticated state management to know what state the transaction is in. We talked about why consistency is important: we need the transaction state to be consistent across all the partners that you work with and within our own systems. So you need to maintain a state: is it authorized, is it captured, is it batched, is it submitted, is it settled? Each comes with different timing requirements, and you need to understand the failure codes coming from the partners very well.

Another very important aspect, to be honest, is cross-border complexity. International transactions add layers of complexity through currency conversion and cross-border fees, and the settlement timelines vary. Platforms must integrate with multiple currency exchanges, manage foreign exchange risk (the foreign exchange market moves really fast), and ensure compliance with international money transfer regulations. Multicurrency settlement requires careful attention to timing. Exchange rates fluctuate continuously, so transactions must lock in rates as early as possible.
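
The settlement states listed earlier (authorized, captured, batched, submitted, settled) can be modeled as an explicit state machine that rejects any transition it does not recognize. This is a minimal sketch: the strictly linear ordering, the extra `FAILED` state, and the `Transaction` class are assumptions made for illustration, since the talk names the states but not the allowed transitions.

```python
from enum import Enum


class State(Enum):
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    BATCHED = "batched"
    SUBMITTED = "submitted"
    SETTLED = "settled"
    FAILED = "failed"  # assumed terminal state for partner failure codes


# Assumed transition table: a linear happy path, with failure possible at each step.
ALLOWED = {
    State.AUTHORIZED: {State.CAPTURED, State.FAILED},
    State.CAPTURED: {State.BATCHED, State.FAILED},
    State.BATCHED: {State.SUBMITTED, State.FAILED},
    State.SUBMITTED: {State.SETTLED, State.FAILED},
    State.SETTLED: set(),
    State.FAILED: set(),
}


class Transaction:
    def __init__(self, txn_id):
        self.txn_id = txn_id
        self.state = State.AUTHORIZED
        self.history = [State.AUTHORIZED]  # audit trail of every state reached

    def advance(self, new_state):
        """Move to new_state, or raise if the transition is not allowed."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}")
        self.state = new_state
        self.history.append(new_state)


txn = Transaction("txn-001")
for step in (State.CAPTURED, State.BATCHED, State.SUBMITTED, State.SETTLED):
    txn.advance(step)
print([s.value for s in txn.history])
```

Keeping the table explicit means a skipped or repeated step, such as settling a transaction that was never captured, fails loudly instead of silently producing an inconsistent state.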
Locking in rates requires very sophisticated hedging strategies, especially when you're dealing with currency movements, and you need to maintain relationships with liquidity providers, especially if you're trying to process billions of transactions. You need relationships with multiple liquidity providers so that you can minimize the currency risk. Otherwise you are at the mercy of the partners you're working with, who could potentially charge you, and you might be looking at a very large bill.

You need to be able to do reconciliation at scale. With millions or billions of transactions flowing through multiple systems daily, reconciliation becomes a very critical platform capability. Every transaction must be tracked from authorization to settlement, with any discrepancy identified early and resolved quickly. Given the number of parties involved in a transaction, that sometimes requires quite a bit of work, walking through each partner to ensure the transaction is in a good state across all of them. If the funds have been taken out of a buyer's account, every party needs to show that. And if they have not been, that should appear the same way everywhere. Otherwise, you are looking at a series of tasks that have to be done to get the transaction into a good state. So most systems that deal with planet scale aim for continuous reconciliation rather than the traditional batch-based reconciliation approach that used to work in the past.

Security is another important aspect. As I said, payment platforms really push the boundary in all aspects of distributed systems, and security is one of the most critical. Payment platforms represent high-value targets for cybercriminals. This is money, right?
We are talking about people looking for ways to steal money, or trying to purchase things in your name so that they can enjoy the benefit while you bear the brunt of the cost of the item they bought. So it requires a very comprehensive security architecture. Defense-in-depth strategies layer multiple security controls, ensuring that the compromise of any single component does not expose sensitive data. That requires multiple things to be done to ensure that we can offer end-to-end security.

For example, as we touched upon earlier, one of the key things we can achieve with a service-oriented architecture is isolating critical systems. That ensures that when one system is compromised, the rest of the system is not. That becomes very important. If we have an attack and are able to preserve a large part of the components, we can recover the transaction because we have the data elsewhere. If everything is exposed, then we are looking at something that potentially cannot be recovered.

Hardware security modules: we need tamper-resistant key storage and cryptographic operations. That becomes very important compared with a less secure way of storing keys. We need every request to be authenticated regardless of where it came from. We need to build components with zero trust architecture in mind, so everything has to go through a proper security exchange.

Tokenization is something that has started to come out in the last five years. The networks are pushing for it: they want to move away from sensitive card numbers to non-sensitive tokens, which can expire after some time.

Key management, which we talked about. Because there are multiple third parties involved, you need to be able to communicate with them by sharing your public keys and accessing theirs.
Managing cryptographic keys across global platforms becomes quite a unique challenge. Keys need to be rotated regularly, distributed securely, and made available to thousands of servers without creating a single point of failure. Key management systems need to handle millions of key operations per second while maintaining audit trails and enforcing security policies. This is interesting not just for outside threats, but also internally. When you're building a platform, you also don't want your internal users to be able to facilitate a transaction that was never authorized. We want critical audit logs for them. We also want to ensure that there are enough checks, balances, and controls in place so that we can detect unauthorized activity even within the company that is building the payment platform.

So modern platforms employ hierarchical key management systems, where master keys in hardware security modules encrypt the data encryption keys that are stored with the data. That approach works very well because it enables efficient key rotation: only the key encryption keys need updating, rather than re-encrypting vast amounts of data. Sophisticated key derivation schemes enable platforms to generate transaction-specific keys without storing them. That reduces the overall key management overhead while maintaining strong security guarantees.

The most important part of any distributed system, and much more so for a payment platform, is observability and monitoring at scale. Observability in payment platforms extends beyond traditional metrics like CPU and memory usage. You need to look at business metrics like authorization rates, settlement success rates, and fraud detection accuracy, which provide critical insights into platform health. Platforms must correlate technical and business metrics to identify issues before they impact transactions. You need your system to be set up in a way that it can detect issues very early on and proactively warn the users.

Distributed tracing enables engineers to follow individual transactions across dozens of services. Sampling becomes very critical at scale. Tracing every transaction would overwhelm the monitoring system, so you need to be able to detect the patterns that matter and highlight them. Adaptive sampling strategies capture enough data to identify issues while managing overhead.

Incident response and zero-downtime operations. When issues occur in a payment system, response time is very critical. We touched upon this early on: each second is about $31,000 of money lost. For some businesses, it matters whether you are down for five seconds or five minutes. And if you're a payment gateway, an outage can be critical for a buyer who's trying to make a purchase. For example, if they're trying to pay a medical bill so that a surgery can be approved, it can be very critical for them. Similarly for merchants who are trying to make money: it's possible that this is a holiday period, when they want as much revenue as possible coming in. If you're down for five minutes, you're looking at a very significant revenue loss for them. So when issues happen, you need to respond in a very timely manner. Incident response processes must enable rapid diagnosis and remediation while ensuring that security and compliance are maintained. Chaos engineering is becoming a practice for payment platforms because you need engineers to build the muscle to deal with such scenarios.
Payment platforms often test themselves by intentionally introducing failures in controlled environments so that weaknesses can be identified ahead of time, before they start to show up in production.

When we look at the future, card processing has been around for a while and is growing well, but we are seeing a lot of change in the form of future directions and emerging challenges. We are seeing the emergence of real-time payment rails. Shifting from batch to real-time settlement fundamentally changes the platform architecture requirements. Systems designed for daily batches must evolve to handle continuous settlement while maintaining the same reliability guarantees. That is a hard problem to solve.

Another one is that, with open banking regulations, platforms are forced to expose APIs that were previously internal. That creates another set of scaling challenges, as third-party developers build applications that can generate unpredictable traffic patterns. It also increases overall risk exposure.

Cryptocurrency integration. Cryptocurrency is rising, and everybody is aware of it. More and more transactions are happening through stablecoins now than through typical cryptocurrencies. As they gain mainstream adoption and central banks explore digital currencies, payment platforms need to evolve to support these new payment methods, which operate on different principles than traditional payment rails. Platform engineers must design systems that can bridge traditional and blockchain-based systems while maintaining a consistent user experience. In the end, for the user, all of these are currencies. They don't expect a different experience whether they are buying with a cryptocurrency, a stablecoin, a digital currency, or a traditional currency. They expect a very similar outcome in a very similar manner.
With crypto it becomes harder because you need to handle the volatility of cryptocurrencies, you have a very different wallet infrastructure, and you need to deal with evolving compliance regulations.

I think those were all the things. Let me conclude. Building planet-scale payment platforms requires mastering numerous technical disciplines while maintaining a focus on reliability, security, and performance. Reliability is paramount. Security is paramount. Performance is very important. The architectures and patterns discussed represent years of evolution. The card system has been around for some time, even before the traditional payment system had been around. They evolved: security probably came first, reliability came second, and performance keeps getting better, as we move from a transaction typically taking a couple of days, three days, or a week to settle to settling a transaction within a second.

For platform engineers working in this space, I would say the challenges are immense, but they come as opportunities. As payment methods continue to evolve and transaction volumes grow, the need for innovative solutions becomes even more critical. That is where engineers shine. The next generation of payment platforms will need to be even more resilient, scalable, and adaptable than today's systems, especially when we look at the emergence of AI. With AI in the hands of people who are trying to create chaos in the payment system and look for vulnerabilities, security, reliability, and performance will all be challenged in this evolving world.

The payment industry stands at an inflection point. Real-time payment systems, open banking, and digital currencies promise to transform how money moves globally.
So platform engineers who understand both what has worked well so far in payment platforms or money movement platforms and the emerging trends will be in the best position to build the financial infrastructure that is needed for tomorrow. The challenges are significant, and for those who are willing to tackle them, the opportunity to impact global commerce has never been greater. With that, I want to thank you all for listening to me. I appreciate your time. Thank you. Bye.

Conf42 Platform Engineering 2025 - Online

September 04 2025 - premiere 5PM GMT

Tapan Vijay

Head of Payment Engine & Gateway @ Meta
