Platform Engineering for Subscription Payment Systems: Building Self-Service Infrastructure at Scale

Abstract

Turn payment chaos into platform gold! Build self-service infrastructure that lets teams ship faster while auto-recovering millions in failed transactions. See how smart platform engineering transforms payments from cost center to competitive advantage.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Thank you for joining today's session. My name is George Thomas, and I'm currently leading engineering and architecture for commerce and financial systems. Over the years, I have worked on subscription billing, payment optimization, and large-scale enterprise integrations across platforms like Zuora, Recurly, Stripe, Oracle, and Salesforce. Today I'm excited to share how platform engineering principles can transform payment systems from a cost center into a real competitive advantage.
Let's start with today's agenda. Here's how I will be spending the time in this presentation. We'll start with the evolution of payment infrastructure: how we have gone from monolithic processors to distributed, event-driven systems. Next, I'll walk you through the core principles of platform engineering applied to payments. Then we will look at how to design for resiliency, because payments will fail, and at key patterns like payment failure recovery. After that, I will highlight observability and compliance, two areas where teams are usually slowed down if they are not built into the platform, but which can actually accelerate delivery when they are. Finally, we will talk about scaling across different business models: B2C, B2B, and usage-based billing. By the end of this talk, you'll see how modern platforms can directly drive revenue, retention, and customer trust.
To start off, let's see how a traditional payment system compares with a platform-driven approach. Traditionally, payment systems were seen as an unavoidable cost of doing business. They were brittle and manual, full of tribal knowledge locked into specific teams. Adding a new feature or a payment method took months, and when a failure happened, there was no visibility into what went wrong.
In contrast, with a platform-driven approach we build self-service infrastructure that abstracts the complexity. With automated recovery processes built in, there is no need for firefighting, because everybody knows when a system goes wrong, and we can go straight to the runbook and start the recovery process. We reduce the friction, and product teams can launch much faster. Because of this, you don't have to worry too much about things going wrong; you can focus on what to add next. Payments stop being an overhead and become a lever of growth.
Moving on, let's take a step back and see how payment infrastructure has evolved. We started with monolithic processors: one provider, tightly coupled systems, zero flexibility. As companies expanded, they integrated multiple payment processors like Braintree, Stripe, and Adyen. But each required custom integration code, leading to duplicated effort. One payment processor works one way and another works another way, so it was a mix of a lot of APIs and different architectures. The next stage evolved into API-first services. Providers began offering abstraction layers; teams still had to handle all the failures, but it was a good step in the right direction. Today, leading companies are looking at event-driven payment platforms with distributed architectures, where recovery, observability, and self-service are baked in from the start. So the shift is really clear.
We have gone from monolithic systems to event-driven payment platforms where you get notified of everything that happens. That is what brings structure to the payment infrastructure.
Moving on, let's look at the core platform engineering principles for payments. First, abstract the complexity so that it is not part of everything. Product teams need not know every minute detail of, say, Visa, Mastercard, or a local payment method; we abstract all of that in the engineering layers. Second, build self-service by default, so that engineering teams are not involved in every product launch, pricing change, or payment method launch. It becomes configuration that any product team can do. Third is compliance. Instead of treating regulations as a separate process with a lot of UAT and back-end work, how can we embed them directly into the infrastructure, with APIs that have proper auditing and logging? And fourth, create a paved road: optimized defaults for common payment scenarios, while allowing flexibility where needed. The idea is not to build overly complex systems, but to build a system that we can customize as needed without having to support every customization, so that the code stays easy to maintain.
Looking at the architecture, what is really needed in a payment platform? First, you need a gateway: a unified interface that can abstract multiple providers. Take Stripe and Braintree: each of them has different APIs and a different way of handling things.
But if you have a gateway which can talk to each of them, it provides that flexibility. From an engineering or product standpoint, whichever payment processor or provider you have, you just call the same payment gateway, and the gateway takes on the complexity of making the transformations needed for each processor. That is one way of abstracting it behind a unified interface.
Then the subscription engine, which is the core of a subscription business. This is where you create the subscriptions, and each subscription has a lifecycle. It can be monthly, it can be annual, it can be a bundle, with different price points baked into the subscription. The subscription engine abstracts all the logic of making sure you bill the right customer for the right billing period, taking discounts, taxes, and other adjustments into consideration.
Then event processing, which is core for any system, whether it involves third-party systems or systems within the company. When an action happens, say an order is created or a subscription is renewed, that becomes your event. Based on that event, you may have five to ten actions to take: sending out a notification, sending it for payment collection, making sure the GL is posted, and making sure the revenue is allocated. These need not be tightly coupled, synchronous transactions. All of this can be an event-driven process where the systems or microservices get notified and each does the step required of it.
And the last part is revenue recovery.
In any system there will be failures. In the case of payments, once you have a subscription with a payment method on file, that card could expire, or there may not be enough funds on the card to make the payment. There are multiple failure cases. So it's always good to have an automatic retry mechanism built into the platform, so that we are able to collect the amount and recover the revenue from the customer.
Moving on, how do you build resilient systems? Payment services always need high availability, something like 99.99%. It can be five nines, four nines, or three nines, and that is the minimum needed, because any downtime in a payment system has a direct impact. Even at three nines (99.9%), you can have only about 8.8 hours of downtime per year, across all 365 days. So the system has to be highly resilient and reliable. It has to scale on its own based on customer needs and the traffic on the site, and other systems should not be able to bring it down. And when multiple services are involved in the payment flow or checkout process, this applies to all of them. It's not only one service being up and running; every service in the order-to-cash flow has to be up, and resilient, all the time.
So how do you achieve this? One thing to think about is circuit breakers. If some systems are failing, how do we avoid adding more load to them? We start throwing an error right away, so that we don't bring those systems down as well. That is one way to think about it.
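
The circuit-breaker idea described above can be sketched in a few lines. This is a minimal illustration, not the implementation from the talk: the class name, the failure threshold, and the reset timeout are all assumptions chosen for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, then fails fast."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value, tune per dependency
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast so a struggling dependency gets no extra load.
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request to a payment provider in `breaker.call(...)` and treat `CircuitOpenError` as a signal to use a fallback path instead of waiting on a dependency that is already failing.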
Then, how can we have a fallback process? Sometimes the customer does have the money on the card, and it is some other process that failed on our side. Instead of having the customer go through the entire checkout flow again, how can we reprocess on our side as a fallback, give them access to the product, and collect the money at a later point in time? Compensation handling or fallback processing like this is really useful.
The other thing is retry logic. How do you think about retries? You can learn from all the failures in the system, look at the data you have, and build intelligent processing. With AI in mind, it's good to have models learn from the data you already have and predict the intelligent retry: what time to retry for those payments to be successful. There are a lot of use cases for AI in payment systems.
This is a case study that we have put together to show how it happens. Before platform engineering, 15% of subscription payments failed on the first attempt, and the manual retry process was all best guesses. You cannot do infinite retries. There always has to be a strategy, maybe four or five retries in a 20-day period, because the more you retry failed cards, the more Visa, Mastercard, or any of these payment networks will say you are charging the customer the wrong way, and you can get fined for retrying in cases where you are not supposed to. There was a lack of visibility into the different error codes, which were not being processed properly. And there was no visibility into recovery performance: are we collecting most of the payments, and are we collecting on the third attempt, the fifth, or the tenth? To tell the truth, a 40% recovery rate on failed payments is not a big number. It should be higher, because these are subscribers who are supposed to pay the company. That's why this becomes a big revenue lever for any company to focus on and build engineering around.
So how do you get it right? With the platform implementation, we use smart algorithms and consider all the data, because every failure has a lot of data associated with it. Whether it comes from the Mastercard and Visa networks or from the payment processors, there are a lot of error codes and historical patterns available. We can look at them and say: if the payment failed with this set of codes, these are retryable errors, and if we retry on, say, Friday at this time, it could be successful. Then there are A/B testing frameworks for retry strategies. You can bucket some variations of the payment retries and learn from them. Even though the data suggests something, unless you try it out you will never know how successful it is and how it compares with your current strategy. So it's always better to do A/B testing, figure out the right window, and do more payment retries in that window.
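
As a rough sketch of the two ideas above, classifying declines as retryable and bucketing subscriptions into A/B retry schedules, something like the following could work. The decline codes, the two schedules, and every function name here are hypothetical; real processors use their own code sets, and a production system would learn the schedules from data rather than hard-code them.

```python
import hashlib
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical normalized decline codes; real processor codes differ.
RETRYABLE_CODES = {"insufficient_funds", "processor_unavailable", "try_again_later"}

MAX_RETRIES = 4         # the talk mentions four or five retries...
RETRY_WINDOW_DAYS = 20  # ...within a 20-day period

# Two illustrative schedules (days after the first failure) to compare in an A/B test.
SCHEDULES = {
    "control": [3, 7, 14, 20],
    "variant": [1, 5, 10, 18],
}


def assign_bucket(subscription_id: str) -> str:
    """Deterministically bucket a subscription so it always sees the same schedule."""
    digest = hashlib.sha256(subscription_id.encode("utf-8")).hexdigest()
    return "variant" if int(digest, 16) % 2 else "control"


def next_retry_at(
    subscription_id: str,
    decline_code: str,
    retries_done: int,
    first_failed_at: datetime,
) -> Optional[datetime]:
    """Return when to retry next, or None if no further retry should be made."""
    if decline_code not in RETRYABLE_CODES:
        return None  # hard declines and unknown codes are never retried
    if retries_done >= MAX_RETRIES:
        return None  # retry budget exhausted
    schedule = SCHEDULES[assign_bucket(subscription_id)]
    offset_days = schedule[retries_done]
    if offset_days > RETRY_WINDOW_DAYS:
        return None  # outside the allowed retry window
    return first_failed_at + timedelta(days=offset_days)
```

Hashing the subscription ID keeps the assignment stable across retries, so recovery rates can be compared per bucket without storing the assignment separately.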
Then it's always a smart idea to have real-time monitoring, because without monitoring it's really difficult to know what success rate you are getting. You have to stay on top of how the retries are working. And if we rely on an AI model, there is always a chance that it is not doing the right thing, or that the success rates have started going down. So it's better to keep track of whether the model is trending in the right direction and make changes based on that. If you look at the 78% success rate, we brought about close to a 40-point increase over the earlier 40% recovery rate, and that translates to millions in recovered revenue.
Moving on to compliance. Compliance is something engineering teams always fear, because it adds a lot of overhead to the systems. You have to audit the systems and build in more data validation and data checks. So compliance is always a pain to think about if it becomes an afterthought. But if you start thinking about compliance early and bake it into the platform, it becomes much easier.
One of the pieces is a tokenization service: secure APIs that handle payment data. With tokenization, you are not exposing the customer's credit card information. As a company, it's always best not to hold any of it, and that's where hosted fields come in. If you're using Stripe, Zuora, or Recurly, they all have hosted payment method fields. You just load their JavaScript SDK, and they take care of capturing the credit card numbers and tokenizing them for us.
Then we just use the token whenever we want to charge the customer or do any transaction on that particular credit card, so that we are not storing that information. That makes it easier to keep our own systems out of PCI scope, because we are not handling any of that data.
Then infrastructure as code. It's always good to have infrastructure as code from the very start, because with cloud PaaS platforms it's easy to build in that capability with a tool like Terraform. It makes it easy to scale and to spin up new environments, and it puts more guardrails in place, so that not every engineering team has access to critical data and everything is controlled by the IaC code rather than by manual access through the UI. From an auditing standpoint, it's easy to confirm that everything is done through code and that all code deployments go through an approval process, like a change management process.
Then build in automated compliance testing. We can automate a lot of these things: frameworks and automated security scanning that run, say, once every month or once every quarter, just to make sure all the systems still adhere to the compliance standards.
Moving on to observability. This is one of the most interesting areas. In the current world, if you need to build a resilient system, you need to know what's happening within your system, and for that you need to build in observability from the start. Platforms like Splunk and New Relic give us a lot of visibility into what's happening within the containers: what is the CPU usage, what is the memory usage?
Are the resources not enough? Are new instances spinning up because the current instance size cannot handle the load? All of these things need to be monitored so that we can be sure we are building in the right direction. Observability has to be part of all the services we build in the engineering landscape, so that we know the health metrics and the Apdex score, and whether there is a higher error rate or any anomalies. All of this can be identified, and the good thing about proper observability is that you can notify the right people at the right time. The team can then come in, look at the traces, and figure out the cause, so they can either roll back the changes or increase the instance size or the memory. That's why observability is so important: you can easily identify issues and take resolution steps as and when an issue happens.
So how do you scale payment platforms across business models? There are really three core business models. One is B2C, where you deal directly with your customers. Any customer that lands on your site and wants to buy a subscription is dealing with you directly. There will be millions of customers, millions of payment methods, and a lot of transactions, because it's B2C. In those cases, we have to track how the user is progressing through the funnel. How do you do A/B testing across different flows? If some people prefer different payment methods, is it better to highlight one payment method here versus another? Based on platform and geo, we should be able to surface different payment methods.
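
Surfacing payment methods by geography can be as simple as configuration that product teams own. The sketch below assumes a country-code lookup; the mapping, the method identifiers, and the function name are illustrative only and not taken from any specific provider.

```python
# Hypothetical configuration: ordered payment methods to surface per country.
# The first entry is the one highlighted at checkout.
DEFAULT_METHODS = ["card"]

METHODS_BY_COUNTRY = {
    "US": ["card", "paypal"],
    "IN": ["upi", "card"],
    "MX": ["mercado_pago", "card"],
}


def payment_methods_for(country_code: str) -> list[str]:
    """Return the ordered payment methods for a country, falling back to cards."""
    methods = METHODS_BY_COUNTRY.get(country_code.upper(), DEFAULT_METHODS)
    return list(methods)  # copy so callers cannot mutate the shared config
```

Keeping this as data rather than code is what lets a product team change the highlighted method, or run an A/B test on the ordering, without an engineering release.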
Localized payment methods are also important. When you look at becoming a global company, the payment methods people are familiar with in the US may not be the ones that are popular in other countries. So you need to enable those localized payment methods: for India it's UPI, and for Mexico it's Mercado Pago. And then how do you keep it simple, so that the customer doesn't have to jump through many hoops?
The second is B2B enterprise platforms. This is enterprise to enterprise. Your customers are enterprises, and the sales team pitches them: these are the products we sell, this is the pricing or the product catalog, and these are the different combinations of products. A deal may be a combination of a one-time setup fee and subscriptions, along with amendments and some usage-based charges. Here the complexity is more in the billing hierarchy. It may not be the number of subscriptions; it is more about how the products and pricing are handled, how you integrate with procurement systems, supply chain, and so on, and multi-entity invoicing capabilities.
The third is usage-based billing, which can apply to both B2C and B2B. It is being adopted by more and more companies because of the way gen AI has come out. Everybody is talking about tokens, and a lot of things are now based on usage.
There are different models even in the health industry: if a customer uses a certain number of sessions, how do you bill the customer, or do you bill the insurance? So usage-based billing is important. On one side you have to track the different usage patterns of the customers, and on the other you have to map that usage into the subscription platform, so that it translates into how much they need to be invoiced. Then there is predictive billing, and the pricing rules that map usage to the billing side of things.
I want to spend a little more time on usage-based billing, because that's where a lot of teams have felt the complexity. There is high-volume event ingestion. Take token-based pricing: people call APIs, you have to keep track of every call, and a lot of events come in. How do you aggregate all this data across different systems? How do you build metering abstractions for the various patterns? Do you do a daily load or a monthly load? That is always tricky. What we do on the platform side is have predefined dimensions and metrics, plus a self-service way to register new dimensions, so that everybody can say this much usage maps to this much charge. We also give them a usage estimation API, so that they can self-serve and even the customers can see how much they have used and how much they will have to pay in the next billing cycle. Or, if the billing is upfront, we cut off access when the allowance runs out, and then they pay more and get more access.
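
To make the metering pieces above concrete, here is a minimal in-memory sketch of metric registration, event ingestion, and a usage estimate. All names (`UsageMeter`, `register_metric`, `estimate`) and the flat per-unit pricing are assumptions for illustration; a real platform would persist events, deduplicate them, and support tiered pricing.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class UsageEvent:
    customer_id: str
    metric: str    # must be a registered metric, e.g. "api_tokens"
    quantity: int


class UsageMeter:
    """Hypothetical meter: aggregates usage per customer and metric."""

    def __init__(self):
        self.unit_prices = {}            # metric -> price per unit
        self.totals = defaultdict(int)   # (customer_id, metric) -> total quantity

    def register_metric(self, metric: str, unit_price: Decimal) -> None:
        """Self-service registration: declare a metric and how it maps to a charge."""
        self.unit_prices[metric] = unit_price

    def ingest(self, event: UsageEvent) -> None:
        """Aggregate one usage event; reject metrics nobody registered."""
        if event.metric not in self.unit_prices:
            raise ValueError(f"unregistered metric: {event.metric}")
        self.totals[(event.customer_id, event.metric)] += event.quantity

    def estimate(self, customer_id: str) -> Decimal:
        """Estimate the customer's charge for usage aggregated so far."""
        total = Decimal("0")
        for (cust, metric), quantity in self.totals.items():
            if cust == customer_id:
                total += self.unit_prices[metric] * quantity
        return total
```

Using `Decimal` rather than floats avoids rounding drift in money amounts, and the same `estimate` call can back both an internal dashboard and a customer-facing usage page.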
That's the way we are thinking about it, and it is becoming the norm nowadays, which is why everybody is interested in knowing how to do usage-based billing.
How do you think about building payment systems from a developer experience perspective? It always starts with an API-first design approach. That pattern has been around for years and developers have really accepted it. Everybody knows how to use APIs, REST or GraphQL; it is a pretty common standard now. Then developer tooling. If you want people to build in their own languages, it's good to provide SDKs. SDKs, CLI tools, and sandboxes are a good way to bring in new developers to start interacting with your product or building against it. Whenever you build an integration, say with Stripe or Recurly, you first check whether there's an SDK you can integrate directly. You just import the SDK and start building, so you don't have to write the extra code for transforming the REST API requests and responses. One important thing is that those SDKs have to be maintained: as soon as a new API version comes in, the SDKs and CLIs have to be updated.
Documentation is always important. If you don't have a big support team, it's better to have proper documentation, so that any developer can go in and see: this is the request, this is the response, this is how you call our API.
Then a developer portal with best practices, payment resources, the key contacts developers can reach out to, and the templates and tools that are available. These are not specific to payment systems; they are good practices for any team that wants to develop new solutions.
So how do you measure success beyond uptime? On the platform side, there is developer velocity: how fast do you roll out new features? From an engineering standpoint, how fast can we work on business initiatives, and how much effort is being put in? There is self-service: how much can be done without always asking engineering teams to do the work? We build UI-based or API-based solutions that product teams can execute on their own, so that engineering teams are not involved. And there is quality: how can we reduce the number of incidents? When we build something, we need proper quality checks in place and regression suites that can be run, so that the systems are always up to the standards we expect.
That's the engineering side. Coming to business impact, there is revenue recovery, which is tied to millions of dollars, so we have to track it very closely and see how many payments we can recover. Another is conversion, which is essentially the checkout success rate. If a hundred people go through the checkout process and only 10 place an order, that is a 10% success rate, and 90 people are falling out of the funnel.
How can we improve that? Is it because of friction in the funnel? Are there things that are not clear to the customer, such as how to add a payment method? How can we reduce that friction? That's something we always look at. Then there is retention. Retention is really important, and a lot depends on the product as well. If the student, or the customer, likes the product, they will keep coming back and renewing the subscription. So it's always good to check how healthy the interaction with the customer is. If a user has not used the product for a long time, how can we bring them back, by sending notifications and showing them the best ways to use the platform?
We are almost at the end, so let's wrap up with the key takeaways. First, start with developer experience, because developers are the first ones who will interact with these payment platforms. If the platform isn't usable, adoption fails, so always make sure everything is easy to use. Second, from a design standpoint, always design for failure, recovery, and resiliency. Systems will fail; there are so many partners and interactions in a payment flow. Know what the failures are: do a failure mode analysis (FMA), figure out the failure points, and take action based on that. Third, embed compliance.
Compliance built into the platform means teams are not weighed down by all the PCI-related overhead. And finally, always try to align the technical and the business metrics, so that the team is aware of the business interests that need to be at the core of the system, and we build towards that. There is always a need to know what the real goal of the business is, what it is focused on, and how we can build solutions that align with those goals.
So this is how you move payments from a cost center to a growth engine, a real part of the business. Thank you for spending the time with me. I hope you learned something new, and I'm grateful for the opportunity. Have a good day. Bye.

Conf42 Platform Engineering 2025 - Online

September 04 2025 - premiere 5PM GMT

George Thomas

Engineering Manager / Enterprise Integration Architect @ Chegg
