Conf42 MLOps 2025 - Online

- premiere 5PM GMT

Production MLOps in Finance: Scaling AI Systems for High-Volume Transaction Processing

Abstract

Financial ML systems crash and burn when they hit production scale. Learn battle-tested MLOps strategies that keep fraud detection models alive under extreme load, handle model drift in compliance-heavy environments, and deploy credit systems that regulators actually approve.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, thank you for being here. My name is Naresh. Today I'll be sharing insights on how we can successfully scale AI systems in the financial industry, specifically how to handle high-volume transaction processing while staying compliant with strict regulations. The talk will cover operational strategies, infrastructure choices, and real-world lessons we have learned from deploying ML pipelines across the financial services transactional ecosystem.

What are the challenges we are looking at in financial institutions? The challenges are fundamentally unlike any other industry. We are processing billions and billions of transactions with response times of a fraction of a second. On top of that, every decision must satisfy regulations like GDPR, CCPA, or CRA, and you also have to maintain 99.99% uptime for critical systems. Beyond uptime, latency is another aspect of it: you need to provide responses within a fraction of a second while also making sure that the right decisions are made at the right time, with the right level of compliance checks and balances. I'll give you an example. When you're doing a credit card authorization, the authorization has to happen in less than a fraction of a second because the customer is still there in the session. While the customer is in the session and you are engaging with them, behind the scenes you need to make sure that KYC is done, KYB is done, and the sanctions screening check is done. All of these things have to happen simultaneously, in parallel, and then the final decision has to be taken while ensuring the business outcomes are met, because every drop-off at checkout is a lost sale, and a lost sale is lost revenue. These are the factors that need to be considered when we are looking at deploying machine learning systems in the financial space.

What are the three critical financial applications? There are three core areas: real-time fraud detection, as I alluded to in the previous example; credit assessment; and automated financial management. The first is real-time fraud detection, where models scan millions of daily transactions to spot suspicious activity in milliseconds. The second is credit assessment, where platforms need to evaluate creditworthiness while remaining compliant and explainable. Imagine a scenario where you are applying for a personal loan or a credit card. You would have seen from real-world examples that the decision comes back in a few seconds, but behind the scenes the whole chain of assessment, prospecting, and underwriting has to happen in a very short time. And lastly there is financial management, everything from portfolio optimization to algorithmic trading to personalized financial guidance. These are mission critical, so they rely on strong ML operations foundations and the associated pipelines.

Real-time fraud detection architecture is super important. What I'm presenting on this slide is the various channels, the transaction sources from where a transaction can originate: point-of-sale channels, online channels, and mobile channels, each with its own event stream.
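As a rough illustration of that first hop, here is a minimal sketch of a channel publishing its transaction events onto a stream. It assumes Kafka via the kafka-python client; the broker address, topic name, and event fields are illustrative assumptions, not the speaker's schema.

```python
# Minimal sketch: a channel (POS, online, or mobile) publishing transaction
# events onto a Kafka topic for downstream feature pipelines and fraud scoring.
# Assumes kafka-python is installed; topic and field names are illustrative.
import json
import time
import uuid

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],          # assumption: local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    key_serializer=lambda k: k.encode("utf-8"),
    acks="all",                                    # durability over raw speed
    linger_ms=5,                                   # small batching window
)

def publish_transaction(channel: str, account_id: str, amount: float, currency: str) -> None:
    """Publish one transaction event, keyed by account so one account's
    events stay ordered within a partition."""
    event = {
        "event_id": str(uuid.uuid4()),
        "channel": channel,                        # "pos" | "online" | "mobile"
        "account_id": account_id,
        "amount": amount,
        "currency": currency,
        "ts": time.time(),
    }
    producer.send("transactions.raw", key=account_id, value=event)

publish_transaction("online", "acct-123", 42.50, "USD")
producer.flush()
```

From a topic like this, the streams can fan out to the feature pipelines and the online scoring path described next.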
All of these channels have their own respective streams, and those streams of data have to be processed and published. That is where the next aspect, the feature pipelines and the feature store, comes into the picture, where we outline the low-latency inference systems, the decision engines and the rules they apply, and finally the monitoring and audit systems that ensure telemetry, retraining, and compliance. The key point here, though, is that fraud detection isn't just about a model. It's about a pipeline that can handle massive scale within milliseconds.

What are the challenges of fraud detection in MLOps? This is another key vector that needs to be considered. The first, as you see on this slide, is real-time performance: achieving sub-10-millisecond latency while processing over a hundred thousand transactions per second during peak periods is incredibly demanding. The second is feature freshness. Fraud patterns evolve quickly, and the attack spikes we have seen in recent incidents show how fast they can peak, so keeping features fresh plays a critical role in staying current with evolving fraud patterns. The third is rapid adaptation: fraud tactics emerge constantly, and our systems need to recognize them immediately. Finally, there are false positives. If we flag too many legitimate transactions, we run the risk of losing customers, so the balancing act has to be super efficient. Otherwise we risk losing valuable customers, which in turn impacts sales and revenue.

The next slide speaks about the key vectors in assessing a credit application, or the creditworthiness of an individual, using the MLOps framework. The framework ensures data orchestration that tracks the full lineage of every data source. Model governance provides version history and approval workflows, so we know which model made which decision, and validation systems rigorously test for fairness. When it comes to lending, regulators and customers need to know the system is both accurate and fair. That's the credit assessment side of the world.

And of course the final aspect is automated financial management systems. This is, again, super important, because in any line of business, whether it is payments or lending or, for that matter, any financial transaction, at the end of the day everything has to balance in the ledger. It's a zero-sum principle: the debits and credits have to match, and you cannot have money being created on the fly. It cannot be created, nor can it be destroyed. Every record has to match, and that can be achieved through streamlined model deployment pipelines, monitoring systems, and automated integration workflows, which you can build on various streaming technologies as well. As and when a transaction happens, you publish it to a Kafka topic, consumers subscribe to the topic, and then you record the transactions in your ledger system and use them for reconciliation, settlement, and clearing, making sure that everything is reconciled behind the scenes. So it's super important to have automated financial management systems; otherwise it becomes very tricky to track the transactions.
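To make the ledger idea concrete, here is a minimal sketch of the consumer side under the same assumptions as before (kafka-python, hypothetical topic and field names): each transaction is booked as a balanced pair of ledger entries, with a simple reconciliation check that debits equal credits. The "merchant-settlement" account is an illustrative placeholder.

```python
# Minimal sketch: consume transaction events and record them as balanced
# double-entry ledger rows, then reconcile. Topic, field, and account names
# are illustrative assumptions, not a production schema.
import json
from collections import defaultdict

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "transactions.raw",
    bootstrap_servers=["localhost:9092"],
    group_id="ledger-writer",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

ledger = []  # in-memory stand-in for the real ledger store

def record(event: dict) -> None:
    """Every transaction books a debit and an equal credit (zero-sum)."""
    amount = round(event["amount"], 2)
    ledger.append({"account": event["account_id"], "debit": amount, "credit": 0.0})
    ledger.append({"account": "merchant-settlement", "debit": 0.0, "credit": amount})

def reconcile() -> bool:
    """Reconciliation check: total debits must equal total credits."""
    totals = defaultdict(float)
    for row in ledger:
        totals["debit"] += row["debit"]
        totals["credit"] += row["credit"]
    return round(totals["debit"], 2) == round(totals["credit"], 2)

for message in consumer:
    record(message.value)
    if len(ledger) % 1000 == 0 and not reconcile():
        raise RuntimeError("Ledger out of balance - halt and investigate")
```

In a real system the in-memory list would be a durable ledger store, but the balancing invariant is the same.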
The next slide covers the high-level architecture for deploying MLOps in financial services. The foundation is the data infrastructure: governance, processing, and storage for structured and unstructured data, with one set of policies for structured data and another set for unstructured data. On top of that sits the training and validation pipeline, which automates the compliance checks and the testing procedures across the various validation pipelines. Then there is the feature engineering and storage layer, which delivers both batch and real-time features depending on where you need them. Next we have the model registry and deployment systems for versioning, A/B testing, and controlled rollouts. You want controlled rollouts and controlled testing, where you decide: if I pull the lever on a particular parameter in one direction, how is it going to behave, and what is the impact on my core products and features? Will it improve take rate, will it improve conversion, will it reduce fraud rates or increase them? These are the vectors you can control by virtue of having versioning and A/B testing. Finally, at the very top is model serving, the high-performance APIs that power customer-facing applications. As I alluded to at the start, in either of the use cases, a customer is waiting on the checkout screen trying to purchase a product, or waiting on the screen trying to apply for a loan. These are very time-sensitive customer flows, so the high-performance API layer at the top ensures that the right level of information is abstracted, thereby servicing clients in a timely manner.

The next slide talks about model deployment strategies in a containerized manner. Containerization has been a game changer in this space: separate Kubernetes clusters for training and inference, GPU nodes for deep learning, and auto-scaling based on demand. Compliance-critical models you can in fact scale horizontally and vertically by having dedicated or isolated nodes. Financial institutions also add domain-specific optimizations, such as hot standby replicas for zero-downtime, geo-distributed deployments. So having well-defined deployment strategies is super critical.

The next slide is about real-time monitoring and alerting systems. This is where all of the key technical metrics live: what is the latency, what is the throughput, what is your resource utilization, whether you are at peak or mean utilization, so that you can dynamically increase or reduce capacity at runtime, and you can also track error rates. The other key area is model performance: prediction accuracy, drift in the underlying distributions, and how the model is being used. Alongside that are the business KPIs, such as approval rates, false positives, and the impact on revenue and customer experience. And lastly, compliance is a key aspect of it, ensuring that the end-to-end transaction is thought through from an observability and reliability standpoint.
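As one hedged illustration of the distribution-drift check mentioned above, here is a small sketch that computes a Population Stability Index (PSI) between a training baseline and recent production scores. The 0.1 / 0.25 thresholds are commonly cited heuristics, and the synthetic score distributions are assumptions for the example, not the speaker's setup.

```python
# Minimal sketch: distribution-drift check using the Population Stability
# Index (PSI) between a training baseline and recent production values.
# The 0.1 / 0.25 thresholds are widely used heuristics, not hard rules.
import numpy as np

def psi(baseline: np.ndarray, recent: np.ndarray, bins: int = 10) -> float:
    """PSI = sum((recent% - baseline%) * ln(recent% / baseline%)) over bins
    defined by the baseline's quantiles."""
    edges = np.quantile(baseline, np.linspace(0.0, 1.0, bins + 1))
    edges = np.unique(edges)                       # guard against duplicate quantiles
    base_pct = np.histogram(baseline, edges)[0] / len(baseline)
    # Clip recent values into the baseline's range so every value is counted.
    new_pct = np.histogram(np.clip(recent, edges[0], edges[-1]), edges)[0] / len(recent)
    base_pct = np.clip(base_pct, 1e-6, None)       # avoid divide-by-zero
    new_pct = np.clip(new_pct, 1e-6, None)
    return float(np.sum((new_pct - base_pct) * np.log(new_pct / base_pct)))

def drift_status(baseline: np.ndarray, recent: np.ndarray) -> str:
    value = psi(baseline, recent)
    if value < 0.10:
        return f"stable (PSI={value:.3f})"
    if value < 0.25:
        return f"moderate drift (PSI={value:.3f}) - investigate"
    return f"significant drift (PSI={value:.3f}) - consider shadow retrain"

# Illustrative usage with synthetic fraud-score distributions.
rng = np.random.default_rng(7)
train_scores = rng.beta(2, 8, size=50_000)         # baseline score distribution
prod_scores = rng.beta(2, 6, size=10_000)          # shifted production scores
print(drift_status(train_scores, prod_scores))
```

A check like this would typically run on a schedule per feature and per model score, feeding the alerting layer described above.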
And what are the various automated validation frameworks? You have data validation, model training checks, performance and fairness checks, and explainability for the compliance side of the world. Financial institutions have to ensure these validation processes are in place; otherwise, how do you know the robustness and accuracy of a particular model before you go ahead and deploy it in production? You don't want a scenario where you skip validation, deploy to production, and then all of a sudden there is a negative impact across the length and breadth of the spectrum.

The next slide is about model drift in high-stakes environments. What exactly is model drift, and why does it matter? You simply cannot avoid model drift; it's unavoidable, especially in a high-stakes environment. You detect drift through statistical tests and segment-based analysis, and whenever drift occurs, you respond through shadow deployment comparison: you gradually ramp traffic up or down, and when in doubt you have a fallback approach of human-in-the-loop approval, with an audit trail to make sure that performance through transitions stays seamless.

I also want to speak about implementing A/B testing for financial models. This is another area that is super critical. Start with a clear hypothesis tied to a business metric, design tests that ensure statistical rigor, carefully control for adverse effects, analyze at the segment level, and roll out progressively with continuous monitoring. Yes, innovation is key, but not at the cost of protecting the customers. Safe A/B testing is super important.

So to wrap up, what are my key takeaways? Design for scale from day one. Integrate compliance throughout the pipeline. Automate everything, but ensure fallback mechanisms with human-in-the-loop intervention are there. And invest in robust monitoring and response systems. That's pretty much it, and I hope you were able to gather some insights from my presentation. Thank you for giving me this opportunity.
...

Naresh Karri

Group Product Manager - Money Platform @ Intuit

Naresh Karri's LinkedIn account


