Transcript
This transcript was autogenerated. To make changes, submit a PR.
Greetings of the day, everyone.
My name is Rahul Gti.
Thanks for being here.
Let's get started.
Right away.
Every minute of every day, a silent, invisible war is being fought.
It's not a physical battlefield, but across the digital networks
that power our global economy.
The enemy is sophisticated, relentless, and increasingly automated.
I am talking of course about financial fraud.
The slide here quotes a figure of $32 billion in global losses,
but the reality is the problem is accelerating at an alarming rate.
Recent data from the US Federal Trade Commission is staggering. In 2024 alone, reported consumer losses to fraud in the US jumped over $12.5 billion, a 25% increase in just one year, and that's just what's reported by the consumers.
On the organizational side, a 2025 survey from the Association for Financial
Professionals found that a staggering 79% of the organizations were victims
of payment fraud attempts last year.
This isn't just a numbers game, it's a technological arms race.
More than 70% of executives expect financial crime to increase in 2025, citing criminals' increased use of AI as a top reason the old defense systems are failing.
The nature of the threat has fundamentally changed.
The data shows that while the number of fraud reports has remained stable,
the percentage of people who actually lost money in an attack jumped
from 27% to 38% in a single year.
This means fraudsters are becoming far more effective in each attempt.
This isn't a volume problem anymore.
It's a sophistication problem.
So today I'm here to talk about how we engineer the front
lines of this new battlefield.
It's no longer about writing better rules.
It's about building fundamentally better platforms.
We will explore how to architect and build resilient AI powered systems that can not
only detect, but anticipate and dismantle sophisticated fraud at massive scale.
So let's look at the evolution of fraud detection.
For decades, fraud detection was a game of cat and mouse played in slow motion.
A new fraud pattern would emerge.
Analysts would study it, write a new rule and deploy it.
This is the world of traditional rule-based systems, but these
systems are fundamentally brittle.
Their inflexibility meant we were always one step behind the criminals.
Their reliance on rigid thresholds led to high false positive rates, which weren't just a technical problem.
They were a customer experience nightmare, blocking legitimate transactions and creating unnecessary friction.
And these false positives have a real cost, not just in customer frustration, but in operational overhead, a cost we later quantified at nearly $5 million a year in our own operations.
Furthermore, these systems suffered from limited context. They could see a single transaction, but not the broader sinister pattern it was a part of, and as transaction volumes exploded with the rise of digital finance, these rigid rule engines simply couldn't scale.
This brings us to the platform engineering imperative.
To win this fight, we can't just deploy a machine learning model.
We must build a comprehensive end-to-end platform that handles the entire lifecycle, from data ingestion and real-time processing to model deployment, monitoring, security, and compliance.
It's a foundational shift from writing rules to engineering intelligence.
Let's drill into the foundations.
So what does this platform actually look like?
Let's walk through the critical architectural blueprint.
These five core layers work in concert to deliver real time
decisions in milliseconds.
It all starts with data.
The data ingestion layer is our gateway to the outside world, responsible for
collecting massive varieties of data streams, transaction data, customer
profiles, device information, and even external threat intelligence feeds.
It's the foundation of everything that follows, ensuring data quality
and consistency from the beginning.
Next is the central nervous system: the stream processing engine for real-time detection.
We can't wait for the data to land in the database.
We need to analyze it in flight where frameworks like Apache Kafka
and Apache Flink are critical.
For those unfamiliar, think of Kafka as a highly durable, scalable commit log for streaming data.
It allows us to publish and subscribe to data streams reliably, acting as a buffer and decoupling our data producers from data consumers.
So Flink then acts as the computational engine on top of these streams.
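As a minimal sketch of that decoupling, here is what producing and consuming a transaction event can look like with the kafka-python client. The topic name, broker address, and payload fields are hypothetical, and in our platform the consumer side is a Flink job rather than a plain Python loop.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: the ingestion layer publishes raw transaction events to a
# topic and moves on -- it does not know or care who will consume them.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",          # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("transactions", {"txn_id": "t-123", "amount": 250.0,
                               "account": "a-42", "device": "d-7"})
producer.flush()

# Consumer side: any downstream service (feature engineering, model
# inference) subscribes independently and processes events in flight.
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    txn = message.value
    # hand the event to the stream-processing logic (Flink, in our case)
    print(txn["txn_id"], txn["amount"])
```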
The third component is one of the most critical for production ml.
The feature store. This is our centralized, single source of truth
for machine learning features.
It is essentially a dual database system, with an offline component for training and a low-latency online component for serving.
It ensures the exact same feature logic is used for both training our models and serving predictions in real time.
This is absolutely essential to prevent a notorious problem in MLOps called training-serving skew, where inconsistencies between training and production data silently degrade model performance.
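A minimal sketch of why that matters, with a hypothetical feature schema: one feature function is shared by the offline (training) path and the online (serving) path, so the logic cannot drift apart.

```python
from datetime import datetime, timedelta

def compute_features(transactions, now):
    """Single definition of the feature logic, shared by offline and online paths.

    `transactions` is a list of dicts with 'amount' and 'timestamp' keys
    (hypothetical schema); `now` is the reference time.
    """
    recent = [t for t in transactions
              if now - t["timestamp"] <= timedelta(hours=24)]
    return {
        "txn_count_24h": len(recent),
        "avg_amount_24h": (sum(t["amount"] for t in recent) / len(recent)
                           if recent else 0.0),
    }

history = [{"amount": 20.0, "timestamp": datetime(2024, 1, 1, 10)},
           {"amount": 900.0, "timestamp": datetime(2024, 1, 1, 22)}]

# Offline: replay history to build a training row.
training_row = compute_features(history, now=datetime(2024, 1, 2, 9))

# Online: the low-latency store calls the exact same function per event.
serving_row = compute_features(history, now=datetime(2024, 1, 2, 9))

assert training_row == serving_row   # no training-serving skew by construction
```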
Once we have the models, we need to manage them.
The model registry is a version-controlled repository for our trained models.
It's not just storage, it's the control plane that allows our serving layer to
deploy multiple models simultaneously, run A/B tests to see which performs better, and execute gradual rollouts or instant rollbacks if a model isn't performing as expected.
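Here is a toy sketch of that control-plane idea: versioned models plus a configurable traffic split. The version names and scores are stand-ins, and a production registry does far more, but the mechanics of A/B routing and instant rollback look roughly like this.

```python
import random

class ModelRegistry:
    """Toy registry: versioned models plus a traffic split for A/B tests."""

    def __init__(self):
        self.versions = {}        # version -> model (a callable here)
        self.traffic_split = {}   # version -> fraction of traffic

    def register(self, version, model):
        self.versions[version] = model

    def set_traffic(self, split):
        assert abs(sum(split.values()) - 1.0) < 1e-9
        self.traffic_split = split

    def route(self, transaction):
        """Pick a version according to the split, then score the transaction."""
        r, cumulative = random.random(), 0.0
        for version, share in self.traffic_split.items():
            cumulative += share
            if r <= cumulative:
                return version, self.versions[version](transaction)

registry = ModelRegistry()
registry.register("v1", lambda txn: 0.10)    # stand-ins for trained models
registry.register("v2", lambda txn: 0.80)
registry.set_traffic({"v1": 0.9, "v2": 0.1})   # gradual rollout of v2
# Instant rollback is just another split update:
# registry.set_traffic({"v1": 1.0})
```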
Finally, the AI doesn't always have the last word.
The decision engine is where machine intelligence meets human expertise.
It takes the model's prediction scores and combines them with
configurable business rules.
This allows our fraud analysts to fine-tune the platform's response, like triggering a step-up authentication or flagging an account for review, without writing a single line of code.
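To make that concrete, here is a minimal sketch of such a decision function. The thresholds, action names, and rule keys are hypothetical; the point is that they live in analyst-editable configuration rather than in code.

```python
def decide(score, txn, rules):
    """Combine the model's fraud score with analyst-configured business rules.

    `rules` is plain configuration (hypothetical thresholds), so analysts can
    tune the platform's response without touching code.
    """
    if score >= rules["block_threshold"]:
        return "BLOCK"
    if score >= rules["step_up_threshold"] or txn["amount"] > rules["review_amount"]:
        return "STEP_UP_AUTH"        # e.g. a one-time passcode challenge
    if score >= rules["flag_threshold"]:
        return "FLAG_FOR_REVIEW"
    return "APPROVE"

rules = {"block_threshold": 0.95, "step_up_threshold": 0.7,
         "flag_threshold": 0.4, "review_amount": 10_000}
print(decide(0.82, {"amount": 1_500}, rules))   # -> STEP_UP_AUTH
```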
This isn't just a logical diagram, it's a blueprint for agility.
By using an event bus like Kafka to connect these
components, we decouple them.
So let's see how this platform performs under fire.
The attack didn't start with the bank, it started with a whisper.
So a fraudster gains access to a customer's credentials via a phishing campaign.
But instead of a smash and grab, they play a long game.
Over days one to three, they make small, seemingly innocent test transactions; a rule-based system sees nothing wrong. Then they start adding new payees, but cleverly, with names similar to existing ones to avoid suspicion. This is done on days four and five.
On day six comes the attack. This is where our platform's architecture becomes the hero of the story.
The data ingestion layer didn't just see the transactions, it saw the context: it saw the new device fingerprint, the unusual login location, and the mouse movements that were uncharacteristic of the real customer.
The stream processing engine analyzed this data in real time
as the new payees were added.
The feature store provided the crucial historical context.
This customer had never added multiple payees in such a short period.
This was a clear behavioral anomaly.
Our ensemble of models, served from the model registry, picked up on this pattern of behavior. It wasn't one single event that was suspicious, but the sequence of events over several days.
Finally, the decision engine orchestrated the response. Instead of a simple block, it triggered a step-up authentication challenge.
The fraudster, of course, couldn't pass it.
The transfers were blocked and the account was secured.
The key takeaway here is that no single rule could have caught this.
The fraud wasn't in one transaction.
It was in the narrative of the user's behavior.
Over six days, our platform was able to read that narrative in real time.
Now let's dig a little bit deeper into the multi-cloud architecture.
A platform this critical cannot be dependent on a single provider.
We operate in a multi-cloud environment, and this is a deliberate strategic choice for several key reasons.
First, avoiding vendor lock-in. By building our core services using portable technologies, containers and Kubernetes, we retain the flexibility to deploy components across different cloud providers. This allows us to choose the best service for the job without being tied to a single ecosystem, giving our organization leverage and ensuring our critical capabilities are not subject to the fate of a single vendor.
Second, resiliency and redundancy.
An outage in one cloud region, or even an entire provider, won't bring our platform down.
We can route traffic and fail over services to another cloud,
which is essential for meeting our 99.99% availability target. Third, regulatory and compliance needs. Different jurisdictions have different data residency requirements.
A multi-cloud architecture allows us to process and store the data in
specific geographic locations to meet those complex regulatory demands.
Of course, this isn't easy. It requires sophisticated engineering for data replication, secure cross-cloud networking, and a unified monitoring plane to provide a single pane of glass across all the environments.
The challenges listed here, which are replication, networking,
monitoring, are exactly why a platform engineering mindset is so critical.
Now let's take a look at the data engineering part of it.
At the heart of any great AI system is great data engineering. For a financial institution, data is often scattered across decades of legacy systems, from core banking platforms to mobile applications.
The first challenge is creating a unified data lake.
A data lake is a centralized repository that allows us to ingest and store vast amounts of structured, semi-structured, and unstructured data in its native format.
It becomes the single source of truth, breaking down data silos, and
enabling the holistic analysis required for advanced fraud detection.
This architecture allows us to embrace the extract, load, and transform (ELT) paradigm. We ingest the raw data first, in full. This gives our data scientists the freedom to explore and engineer new types of features without being constrained by a rigid, predefined schema, which dramatically accelerates our ability to innovate.
With our data unified, the next step is to transform it into intelligent signals, or features, for our models. The first technique is real-time streaming aggregations.
This is about creating features that capture behavior over time.
We use our stream processing engine to compute metrics on the fly, like transaction count in the last five minutes, or average transaction amount over the last 24 hours.
These time-window features are incredibly powerful for detecting changes in behavior.
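A minimal sketch of one such time-window feature, kept deliberately framework-free to show the idea; in our platform this would be a windowed aggregation inside the stream processing engine rather than a standalone class.

```python
from collections import deque

class SlidingWindowFeature:
    """Count and average of transactions inside a time window (e.g. 5 minutes)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()          # (timestamp, amount) pairs, oldest first

    def add(self, ts, amount):
        self.events.append((ts, amount))
        self._evict(ts)

    def _evict(self, now):
        # drop events that have fallen out of the window
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()

    def features(self, now):
        self._evict(now)
        amounts = [a for _, a in self.events]
        return {"txn_count": len(amounts),
                "avg_amount": sum(amounts) / len(amounts) if amounts else 0.0}

w = SlidingWindowFeature(window_seconds=300)   # 5-minute window
w.add(ts=0, amount=20.0)
w.add(ts=120, amount=950.0)
print(w.features(now=180))   # both events still inside the window
print(w.features(now=400))   # the first event has aged out
```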
Then comes the graph based features.
This is where things get really interesting.
Fraud is rarely a solo activity.
It is a network phenomenon.
By modeling our data as a graph where customers, accounts and devices are nodes
and transactions are edges, we can compute features that describe a user's position and relationships within that network.
This allows us to uncover hidden connections and coordinated activity
that would otherwise be invisible.
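A small sketch of the idea using networkx. The node naming convention and the specific features are made up for illustration; the point is that customers, devices, and payee accounts become nodes, and the features describe how a customer sits inside that network.

```python
import networkx as nx

# Customers, devices and payee accounts are nodes; interactions are edges.
G = nx.Graph()
G.add_edge("customer:alice", "device:d-7")
G.add_edge("customer:bob", "device:d-7")          # shared device fingerprint
G.add_edge("customer:alice", "account:payee-9")
G.add_edge("customer:bob", "account:payee-9")     # shared payee

def graph_features(graph, customer):
    """Features describing a customer's position in the network."""
    my_devices = {n for n in graph.neighbors(customer) if n.startswith("device:")}
    sharing = {
        other
        for dev in my_devices
        for other in graph.neighbors(dev)
        if other.startswith("customer:") and other != customer
    }
    return {"degree": graph.degree(customer),
            "customers_sharing_a_device": len(sharing)}

print(graph_features(G, "customer:alice"))
# -> {'degree': 2, 'customers_sharing_a_device': 1}
```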
And then comes the external enrichment.
We enrich our internal data in real time with external feeds, pulling in
device reputation scores, IP geolocation data, and threat intelligence, to add even more context to every transaction.
With graph features, we fundamentally change the question we are asking.
We move from assessing an individual in isolation to assessing an individual
in the context of their entire network.
And as we are about to see, that change in perspective makes all the difference.
Now let's get back to our case study.
We blocked the initial fraudulent transfer, but the investigation was just beginning. Remember those new payees the fraudster added? Our graph algorithms went to work.
They revealed that these weren't just random accounts.
They were connected to a hidden network of 47 other accounts across
12 different financial institutions, all created in the same two-week window.
The graph showed classic signs of fraud: circular money flows between accounts, shared device fingerprints despite different registered addresses, and suspicious temporal correlations in account creation times.
We weren't fighting a lone wolf.
We were fighting an organized network. Simultaneously, our real-time streaming aggregations were lighting up like a Christmas tree.
The login velocity from the compromised account was up 400% compared to the customer's six-month baseline. The transaction times shifted from the customer's typical business hours to late at night, and our behavioral biometrics models detected an 85% deviation in mouse patterns from the customer's established profile.
This wasn't just a different location, it was a different person.
These features, computed in under 50 milliseconds, were the critical signals that allowed us not just to block one transfer, but to proactively identify and dismantle an entire fraud network, preventing an estimated $2.3 million in losses across the affected institutions.
So this case study demonstrates a powerful virtuous cycle.
The real-time platform stops the bleeding.
So building great models requires more than just algorithms. It requires a robust infrastructure that empowers data scientists to move from idea to production with speed and confidence. Our ML infrastructure is built on four pillars.
First, a sandbox experimentation platform.
This is where data scientists can safely explore new data sets, test new feature ideas, and iterate on models without impacting the production environment.
Second, distributed training. Modern fraud models are massive and are trained on terabytes of data. Our distributed training infrastructure allows us to parallelize this process across clusters of GPUs, reducing training time from weeks to just hours.
Third, automated ML pipelines. A model is not a one-and-done artifact. It's a living system that needs to be constantly refreshed. Our automated ML pipelines handle the entire lifecycle of retraining, validation, and deployment, ensuring our models never go stale and are continuously learning from the latest fraud patterns.
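A minimal sketch of what such a pipeline's control flow can look like. The step functions, the AUC quality gate, and the canary traffic share are hypothetical stand-ins for the real training, evaluation, and deployment machinery.

```python
def retraining_pipeline(train_fn, eval_fn, deploy_fn, min_auc=0.92):
    """Retrain, validate against a quality gate, then roll out a small canary.

    train_fn/eval_fn/deploy_fn are stand-ins for the real pipeline steps,
    and the AUC threshold is a hypothetical quality gate.
    """
    candidate = train_fn()                       # retrain on the latest labeled data
    metrics = eval_fn(candidate)                 # validate on a held-out window
    if metrics["auc"] < min_auc:                 # never promote a worse model
        return {"status": "rejected", "metrics": metrics}
    deploy_fn(candidate, traffic_share=0.05)     # start with a small canary
    return {"status": "canary", "metrics": metrics}

# Example wiring with trivial stand-ins:
result = retraining_pipeline(
    train_fn=lambda: "model-2024-06-01",
    eval_fn=lambda m: {"auc": 0.94},
    deploy_fn=lambda m, traffic_share: print(f"deploying {m} at {traffic_share:.0%}"),
)
print(result)
```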
And finally, and most critically, low-latency serving.
A prediction that arrives too late is useless.
Our serving infrastructure is architected for extreme performance, with model replicas deployed across multiple availability zones and automatic failover, all to ensure we can deliver predictions well under a hundred milliseconds, which is our SLA.
We treat our ML lifecycle not as a research project, but as a software engineering discipline.
Our infrastructure is designed to be a model factory, enabling us to industrialize the process of taking a promising idea from a data scientist's notebook and turning it into a hardened, production-grade asset that serves millions of customers.
A platform that processes this much sensitive financial data must be a fortress. Our security posture is built on the principle of defense in depth. Defense in depth is a strategy that uses multiple, overlapping security measures. The idea is that if one layer is breached, another is there to stop the attack. It is about creating redundancy in our defenses and assuming that no single control is perfect. We don't add security at the end.
It's embedded in every architectural decision we make. At the network security layer, we use zero-trust principles and microsegmentation to ensure that components can only talk to each other if explicitly authorized.
All communications are encrypted. At the data security layer, all data is encrypted, both in transit and at rest, with strict key management protocols that meet regulatory requirements. Access control is governed by the principle of least privilege, using fine-grained, role-based access control to ensure people and services can only access the data they absolutely need.
And then we maintain comprehensive audit logging of every action taken on the platform, for compliance and forensic analysis.
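As a toy sketch of how least privilege and audit logging fit together in code: every authorization decision is checked against a narrow role and recorded. The roles, permissions, and log shape here are hypothetical; a real deployment would back this with an identity provider and a tamper-evident log store.

```python
import datetime
import json

ROLES = {   # hypothetical, deliberately narrow permission sets
    "fraud_analyst": {"case:read", "case:flag"},
    "model_service": {"feature:read", "score:write"},
}

AUDIT_LOG = []   # in production, an append-only, tamper-evident store

def authorize(principal, role, permission):
    allowed = permission in ROLES.get(role, set())
    AUDIT_LOG.append({                      # every decision is recorded
        "ts": datetime.datetime.utcnow().isoformat(),
        "principal": principal, "role": role,
        "permission": permission, "allowed": allowed,
    })
    return allowed

print(authorize("alice", "fraud_analyst", "case:flag"))     # True
print(authorize("alice", "fraud_analyst", "model:deploy"))  # False, but logged
print(json.dumps(AUDIT_LOG[-1], indent=2))
```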
Let's get into the federated learning infrastructure.
So far we have talked about what we can do with our own data, but
the most sophisticated fraud rings don't target one institution. They attack the entire financial ecosystem. What if we could fight back together?
This is the promise of federated learning.
It's a machine learning technique that allows multiple organizations to
collaboratively train a shared model without ever exposing or exchanging
their raw, sensitive customer data.
The process is iterative.
A central server sends a global model to each participating institution.
Each institution trains that model locally on its own private data.
They then send only the model updates, the learned parameters, not the data, back to the server. The server aggregates these updates to improve the global model, which is then sent back for the next round.
Building this is a massive technical undertaking, but the vision is powerful.
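The server-side aggregation step is conceptually simple. Here is a minimal FedAvg-style sketch with NumPy; the parameter vectors and sample counts are made up, and a real system would add secure aggregation and differential privacy on top.

```python
import numpy as np

def federated_average(updates, sample_counts):
    """FedAvg-style aggregation: weight each institution's parameter update
    by how much data it trained on. Only parameters are shared, never raw data."""
    total = sum(sample_counts)
    return sum(w * (n / total) for w, n in zip(updates, sample_counts))

# Hypothetical parameter vectors sent back by three institutions:
bank_a = np.array([0.20, -0.10, 0.05])
bank_b = np.array([0.25, -0.12, 0.07])
bank_c = np.array([0.15, -0.08, 0.02])

global_update = federated_average([bank_a, bank_b, bank_c],
                                  sample_counts=[50_000, 120_000, 30_000])
print(global_update)   # the server only ever sees these aggregated parameters
```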
Now let's take a look at the event-driven microservices and biometric systems integration.
To make all of this work at scale, we need an architecture that is
flexible, resilient, and scalable.
We have built our platform on an event-driven microservices architecture. In an event-driven architecture, or EDA, services don't make direct calls to each other.
Instead, they communicate asynchronously by producing and consuming events.
For example, when a transaction occurs, the ingestion service simply publishes a "transaction created" event. Downstream services, like our feature engineering and model inference services, subscribe to that event and react accordingly.
This decoupling makes the system incredibly resilient and scalable, as services can fail and scale independently.
We implement this using advanced patterns.
Event sourcing means we store the full history of all state changes as an immutable sequence of events in an append-only log. This gives us a perfect, unchangeable audit trail and allows us to reconstruct the state of any entity at any point in time, which is invaluable for forensics and debugging.
By adopting this, we get a perfect audit log for free.
For regulators, this is the gold standard.
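A minimal sketch of the idea, with a hypothetical account event schema: every state change is appended as an event, and the current state of an entity is simply a replay of the log.

```python
EVENT_LOG = []   # append-only; events are never updated or deleted

def append(event):
    EVENT_LOG.append(event)

def account_state(account_id):
    """Reconstruct current state by replaying the full event history."""
    balance, payees = 0.0, set()
    for e in EVENT_LOG:
        if e["account"] != account_id:
            continue
        if e["type"] == "deposit":
            balance += e["amount"]
        elif e["type"] == "transfer_out":
            balance -= e["amount"]
        elif e["type"] == "payee_added":
            payees.add(e["payee"])
    return {"balance": balance, "payees": payees}

append({"account": "a-42", "type": "deposit", "amount": 1000.0})
append({"account": "a-42", "type": "payee_added", "payee": "p-9"})
append({"account": "a-42", "type": "transfer_out", "amount": 250.0})
print(account_state("a-42"))   # state at any point is derivable from the log
```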
To manage transactions that span multiple services, we use the Saga pattern.
A saga is a sequence of local transactions. If any step fails, the saga executes compensating transactions to undo the preceding steps. This ensures data consistency across our distributed system without using slow, blocking two-phase commits, making our system far more resilient to the partial failures that are inevitable in a large-scale environment.
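A minimal sketch of the pattern, with print statements standing in for the real service calls: each step carries a paired compensation, and a failure triggers the compensations of the already-committed steps in reverse order.

```python
def run_saga(steps):
    """Execute local transactions in order; on failure, run the compensations
    of the steps that already committed, in reverse order."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for undo in reversed(completed):
                undo()
            return "rolled back"
    return "committed"

def fail():
    raise RuntimeError("fraud hold")   # hypothetical failure in the third service

# Hypothetical three-service money transfer:
steps = [
    (lambda: print("debit source account"), lambda: print("refund source account")),
    (lambda: print("credit destination"),   lambda: print("reverse credit")),
    (fail,                                  lambda: print("release fraud hold")),
]
# Prints "reverse credit" then "refund source account", then reports the outcome.
print(run_saga(steps))   # -> rolled back
```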
And we are exploring blockchain integration, not for currency, but for its original purpose: a shared, immutable ledger. This can provide a cryptographically secure, tamper-proof audit trail for high-value transactions, creating a single source of truth that is verifiable by multiple parties.
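Even without a full blockchain, the core property, a tamper-evident chain of records, can be illustrated with a simple hash chain. This is a toy sketch, not our production ledger; the record fields are hypothetical.

```python
import hashlib
import json

def append_block(chain, record):
    """Each entry commits to the previous one, so tampering with any past
    record changes every subsequent hash and is immediately detectable."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    payload = json.dumps(record, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    chain.append({"record": record, "prev_hash": prev_hash, "hash": entry_hash})

def verify(chain):
    for i, entry in enumerate(chain):
        prev = chain[i - 1]["hash"] if i else "0" * 64
        expected = hashlib.sha256(
            (prev + json.dumps(entry["record"], sort_keys=True)).encode()
        ).hexdigest()
        if entry["prev_hash"] != prev or entry["hash"] != expected:
            return False
    return True

chain = []
append_block(chain, {"txn": "t-1", "amount": 50_000, "decision": "STEP_UP_AUTH"})
append_block(chain, {"txn": "t-2", "amount": 12_000, "decision": "APPROVE"})
print(verify(chain))                       # True
chain[0]["record"]["amount"] = 1           # tamper with history
print(verify(chain))                       # False: the chain no longer verifies
```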
Now, observability and performance optimization.
Our strategy is built on three pillars.
Metrics give us the high-level quantitative view: transaction throughput, model accuracy, and CPU utilization.
Logging provides us the ground truth, a detailed record of what happened during a specific transaction.
And distributed tracing is the magic that ties it all together.
It allows us to follow a single request as it hops across a dozen different microservices, letting us pinpoint exactly where bottlenecks or failures are occurring. And the results speak for themselves.
We achieved a median, or P50, latency of 47 milliseconds. Even for our 99th percentile, the slowest 1% of requests, we are at 89 milliseconds, well within our hundred-millisecond SLA. We focus on both because the P99 represents the experience of our unluckiest users, and keeping that tail latency low is critical for a consistently good experience for everyone.
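As a small aside on how those numbers are read: P50 and P99 are just percentiles of the observed latency distribution. A sketch with synthetic samples (the distribution parameters are made up purely so the numbers land in a similar range):

```python
import numpy as np

# Synthetic latency samples in milliseconds, standing in for real measurements.
latencies_ms = np.random.lognormal(mean=3.8, sigma=0.25, size=100_000)

p50 = np.percentile(latencies_ms, 50)   # the median experience
p99 = np.percentile(latencies_ms, 99)   # the "unluckiest 1%" experience

print(f"P50 = {p50:.0f} ms, P99 = {p99:.0f} ms")
sla_ms = 100
print("within SLA" if p99 < sla_ms else "SLA breach on the tail")
```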
The platform handles a peak of 45 transactions per second and maintains 99.99% availability.
Now, at the end of the day, all this engineering effort is only valuable if it delivers real business impact, and it does. For a mid-sized bank processing close to $50 billion in transactions, this platform translates to $27.2 million in prevented annual losses. By reducing false positives, we have unblocked 52,000-plus legitimate transactions daily, improving the customer experience and saving $4.8 million in associated service costs. The ROI is clear: 340% in the first year alone, rising to 580% in the second year as the operational savings and, of course, the customer retention benefits also kick in.
So as you can see, there is a jump in the second year.
This shows the compounding value of the platform approach.
The initial return comes from stopping fraud, but the long-term
sustained value comes from making the entire business run better.
Reducing the operational friction and also improving the customer loyalty.
But of course, this arms race never stops.
We are actively exploring the next wave of technologies to stay ahead of these fraudsters.
Among these advanced AI techniques, generative AI offers incredible new possibilities. We can use it to generate highly realistic synthetic data to train our models on rare fraud patterns they have never seen before, or even use large language models to help analysts investigate cases faster. And then there is quantum computing. While it's still on the horizon, quantum computing holds the potential to solve optimization and pattern recognition problems that are intractable for classical computers today. We are also exploring edge computing and 5G, which would allow us to make decisions even faster.
Building these platforms is a marathon, not a sprint.
It requires a deep commitment to engineering excellence, a platform,
first mindset, and a relentless focus on delivering business value.
The war against fraud is one we cannot afford to lose.
And with resilient, intelligent platforms, it's a war we can win.
Thank you.