Conf42 Site Reliability Engineering (SRE) 2025 - Online

- premiere 5PM GMT

Building Financial Fortresses: Scaling Cloud-Native AI for Real-Time Fraud Detection

Abstract

Discover how SRE principles power AI fraud detection that processes millions of transactions in milliseconds. Learn battle-tested strategies for scaling ML pipelines, preventing failures, and achieving 99.99% availability while stopping fraudsters in real-time. Real metrics, real solutions.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. I'm Prakash Vanga. I have 13 years of experience overall, specializing in AWS cloud engineering and identity and access management. I currently serve as an AWS data engineer, designing and implementing scalable and secure data pipelines leveraging AWS services. I'm really excited to be here to talk about something at the intersection of AI, cloud infrastructure, and real-world business impact: using cloud-native AI to fight financial fraud in real time, and doing it reliably.
Today, financial institutions face a critical challenge: detecting sophisticated fraud patterns across billions of digital transactions while maintaining system reliability and performance. In this session, we will dive into how site reliability engineering, or SRE, empowers financial institutions to build cloud-native AI systems that detect fraud at scale without compromising speed, availability, or user trust. We will explore how to architect infrastructure capable of subsecond response times that can handle massive transaction volumes and still meet the gold standard of 99.9% uptime. This is about more than just stopping fraud. It's about building resilient, intelligent systems that form the backbone of modern financial security.
Going to the next slide. The challenge here is mainly speed, scale, and security. Let's start with the problem. Financial systems process billions of transactions daily, exposing them to constant, sophisticated fraud attempts. Legacy systems, which analyze data in batches hours after transactions happen, simply don't work anymore. The challenge is huge: detect fraud in real time across massive volumes while keeping latency under a second and blocking increasingly advanced attacks. For site reliability engineers, this means balancing performance, scalability, and security all at once.
Going to the next slide: our experience moving from batch to streaming, the architectural evolution.
We started with a traditional batch processing system where data was analyzed hours after a transaction happened. That delay led to a lot of false positives and frustrated customers. To fix this, we transitioned to a streaming architecture. This was a big shift. We introduced Kafka streams to process events in real time, ensuring data flows continuously instead of in chunks. We added a feature store to provide low-latency access to historical data, giving our models the context they need instantly. This gave us real-time insights and fraud detection that keeps up with modern threats. And we moved to microservice-based models, enabling real-time inference with auto-scaling so we can meet demand without sacrificing speed or reliability.
Going to the next slide. Let us look at resilience patterns for AI systems. AI systems bring new kinds of failure modes that traditional SRE methods can't always handle; sometimes, for example, they return results too slowly. To handle these, we adopted specialized resilience patterns. We use circuit breakers to detect when models are overwhelmed, falling back to simpler rule-based systems to keep transactions flowing and avoid cascading failures. We built graceful degradation paths where critical fraud checks continue even if deeper analysis needs to be paused during peak loads. And with automated canary analysis, we deploy new models to just a small portion of traffic. This lets us monitor metrics like latency and false positives before a full rollout, reducing risk while improving quality.
Going to the next slide: specialized observability for ML systems. Traditional monitoring tools weren't enough for our AI fraud detection systems. They simply couldn't capture the complexity of model behavior or connect technical issues to business impact. We needed to see how our AI models were performing and how that performance affected the business. So we built a specialized observability stack focused on model performance, data quality, and business relevance.
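As an illustration of the circuit-breaker fallback mentioned in the resilience patterns above, here is a minimal Python sketch. It is not the speaker's implementation: the class name `ModelCircuitBreaker`, the `rule_based_score` helper, the failure threshold, and the cool-down period are all hypothetical choices made for the example.

```python
import time
from typing import Any, Callable, Dict, Optional


def rule_based_score(txn: Dict[str, Any]) -> float:
    """Hypothetical fallback rule: treat unusually large amounts as risky."""
    return 0.9 if txn.get("amount", 0) > 10_000 else 0.1


class ModelCircuitBreaker:
    """Serves a fallback scorer while the ML model is failing.

    After `max_failures` consecutive model errors the breaker opens and the
    model is skipped for `reset_after_s` seconds. Once the cool-down has
    elapsed, one trial call is let through (half-open state).
    """

    def __init__(
        self,
        model: Callable[[Dict[str, Any]], float],
        fallback: Callable[[Dict[str, Any]], float],
        max_failures: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.model = model
        self.fallback = fallback
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        return self.opened_at is not None

    def score(self, txn: Dict[str, Any]) -> float:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Breaker is open: skip the model entirely.
                return self.fallback(txn)
            # Cool-down elapsed: allow a single trial call; one more
            # failure re-opens the breaker immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = self.model(txn)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return self.fallback(txn)
        self.failures = 0
        return result
```

In this sketch every transaction still receives a score, from either the model or the rules, which is the point of the pattern: transactions keep flowing while the model recovers. A production breaker would typically also treat slow responses as failures, which is omitted here for brevity.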
We now track inference latency, prediction confidence, and compute usage across different model versions. We monitor data drift by detecting changes in the statistical properties of input features, so we know when real-world transactions start to differ from training data. With explainability tools, we visualize why a transaction was flagged. This is key for both debugging and gaining stakeholder trust. And finally, we correlate technical metrics with business outcomes like fraud losses or customer friction. So we are not just monitoring systems, we are tracking what actually matters.
Going to the next slide. Then we have service level objectives that balance security and experience. Setting service level objectives for fraud detection isn't just about uptime. It's about balancing multiple, often competing goals: security, accuracy, faster approvals, and a smooth user experience. Traditionally, service level objectives focused only on availability, but that wasn't enough. So we built a multi-dimensional service level objective framework that combines both technical metrics and business impact. We track availability at 99.9% while keeping detection latency under 200 milliseconds for a seamless experience. To ensure quality, we aim for a false positive rate below 0.5 percent and a fraud detection rate above 98 percent. And to maintain trust, we require one hundred percent explainability, with clear reason codes for every decision. This framework helps us make smart trade-offs when we need to balance depth of analysis with real-time performance.
Going to the next slide: chaos engineering for fraud systems. Fraud detection is a high-stakes environment. Failures here can lead to immediate financial loss. So traditional chaos engineering, which often involves breaking things in production, is too risky for us.
Instead, we designed a specialized approach that is safe and controlled. We run chaos experiments in replica environments that mirror production, using synthetic transaction data to simulate both normal and fraudulent behavior. Each experiment is hypothesis driven, targeting specific failure modes. Most importantly, we validate that degradation paths work, failover triggers correctly, and automatic recovery mechanisms do what they are supposed to do. This way we gain the confidence of chaos testing without risking real money or user trust.
Going to the next slide. By applying SRE principles tailored specifically for AI-driven fraud detection, we saw major gains across both technical and business dimensions. The biggest win was a 78% reduction in detection time, which let us intercept fraud before transactions were completed, preventing loss instead of trying to recover it later. We also cut false positives by 63%, which directly improved customer satisfaction: fewer legitimate transactions were wrongly flagged, and fewer users experienced delays. At the system level, we maintained 99.99 percent availability, ensuring constant protection without downtime. And finally, we saw a 42% cost reduction through more efficient infrastructure.
Going to the next slide. In financial services, we are bound by strict regulations, which can clash with the fast-moving, iterative nature of tech. To balance both, our SRE team built specialized workflows that support continuous compliance without slowing down innovation. We translate regulatory requirements into technical controls, using infrastructure as code to ensure every deployment is repeatable and immutable. We also established a robust model governance framework, documenting training data, model logic, and how each ties back to a business need. Lastly, we implemented automated audit trails, logging every system change in detail. This allows us to reconstruct system state at any time, ensuring traceability and audit readiness.
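The audit-trail idea above, reconstructing system state at any point in time from logged changes, can be sketched as an append-only log that is replayed up to a given timestamp. This is only an illustration under assumptions: `AuditLog`, `ChangeEvent`, and the configuration keys in the usage note are hypothetical names, not details from the talk.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class ChangeEvent:
    """One immutable record of a system change (hypothetical schema)."""
    timestamp: float
    actor: str
    key: str
    value: Any


class AuditLog:
    """Append-only change log that can rebuild state as of any timestamp."""

    def __init__(self) -> None:
        self._events: List[ChangeEvent] = []

    def record(self, timestamp: float, actor: str, key: str, value: Any) -> None:
        # Reject out-of-order writes so that replay stays deterministic.
        if self._events and timestamp < self._events[-1].timestamp:
            raise ValueError("events must be recorded in time order")
        self._events.append(ChangeEvent(timestamp, actor, key, value))

    def state_at(self, timestamp: float) -> Dict[str, Any]:
        """Replay every change made at or before `timestamp`."""
        state: Dict[str, Any] = {}
        for event in self._events:
            if event.timestamp > timestamp:
                break
            state[event.key] = event.value
        return state

    def history(self, key: str) -> List[ChangeEvent]:
        """Return every recorded change to one setting, oldest first."""
        return [event for event in self._events if event.key == key]
```

For example, after recording a change of `model_version` to `"v1"` at time 1.0 and to `"v2"` at time 3.0, `state_at(2.5)` returns the configuration with `"v1"`, while `history("model_version")` lists who made each change. A real system would persist these events durably rather than keep them in memory.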
Going to the next slide. Building AI fraud detection systems isn't just about adding more power. It's about thoughtful architecture. Our blueprint focused on decoupling components, so if one fails, it doesn't bring the whole system down. We built in redundancy and ensured full observability, not just for technical metrics but for business impact too. We broke the system into reliability domains, each with its own scaling logic. This gave us flexibility, especially when deploying new ML models, which often have unpredictable demands. At the foundation, we use scalable cloud infrastructure with multi-region failover. Our data layer uses resilient event streams for guaranteed delivery. The model layer uses smart load balancing. And finally, the observability layer provides end-to-end insight across system health and model behavior.
Going to the next slide. AI fraud detection systems can be seen as risky and complex, but with the right SRE practices, they become a competitive advantage. When reliability is built into the foundation, organizations can deploy more sophisticated models faster, with greater confidence and lower operational overhead. And this value doesn't stop at fraud prevention. The same infrastructure can support personalization, risk scoring, or market analytics, creating a multiplier effect across the business. To get there, we recommend a three-step approach. First, assess current capabilities: identify reliability gaps and performance bottlenecks using SRE principles. Second, build foundational patterns before scaling: implement key patterns like circuit breakers and observability. Third, measure business impact: align technical reliability with outcomes like reduced fraud losses and better customer experience. By doing this, we turn reliability into not just a technical goal but a business strategy.
Finally, thank you all for your time. If you are working on fraud detection or AI systems, or just trying to make your systems more reliable, I would love to connect.
Let's keep building systems that are not only smart, but also strong, fast, and trustworthy. Thank you.

Prakash Vanga

@ Sriven technologies LLC


