Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
Thank you for joining me in this conference.
Today we will discuss how FinTech reliability is achieved
through SRE innovation.
So as organizations accelerate their AI initiatives, many are discovering
that traditional infrastructure was not built to handle the high demands
of machine learning operations.
These workloads are highly resource intensive and are unpredictable.
While Kubernetes has emerged as a platform of choice for managing
these environments, simply running the containers is not enough.
The challenge lies in running them optimally.
What we are seeing is that many teams are struggling, whether with
over-provisioned resources, unpredictable costs, or performance bottlenecks.
So in this particular session, I would like to walk you through a set of proven
SRE strategies designed to increase performance, reduce costs, and
significantly improve the reliability of AI/ML pipelines running in Kubernetes.
These approaches come directly from high-stakes environments
where even minor inefficiencies can lead to major financial losses.
Together we'll go through techniques that have delivered up
to 60% improvements in resource utilization and substantial
reductions in incident rates, all while maintaining agility and performance.
Ultimately, this presentation is about more than just keeping systems online.
It's about engineering confidently at scale in an increasingly
AI-driven financial ecosystem.
So let's take a closer look at how this transformation is happening.
Okay.
When we talk about scaling AI infrastructure efficiently, GPU
utilization is one of the most overlooked, and most expensive, bottlenecks.
So let's look at how we can dramatically shift that performance-to-cost ratio
using smart GPU-level strategies.
Let's start with MIG, or Multi-Instance GPU partitioning.
This is a feature available on NVIDIA's A100 and H100 series GPUs.
It allows you to carve up a single GPU into multiple isolated instances,
which enables multiple workloads to run simultaneously without
interfering with each other.
Our real-world tests have shown up to 7x better utilization, especially in
environments with varied inference loads.
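To make that concrete, here is a minimal sketch of how a workload might request a single MIG slice once the NVIDIA device plugin exposes MIG profiles. The resource name nvidia.com/mig-1g.5gb, the pod name, and the image are placeholders that depend on how your cluster is actually partitioned.

```python
import json

# Minimal sketch: a pod that requests one MIG slice instead of a whole GPU.
# The resource name depends on the MIG profile exposed by the NVIDIA device
# plugin; "nvidia.com/mig-1g.5gb" and the pod/image names are placeholders.
inference_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "fraud-scoring-inference"},
    "spec": {
        "containers": [{
            "name": "scorer",
            "image": "registry.example.com/fraud-scorer:latest",
            "resources": {"limits": {"nvidia.com/mig-1g.5gb": 1}},
        }],
        "restartPolicy": "Never",
    },
}

# Print the manifest; in practice it would be applied with kubectl or a client.
print(json.dumps(inference_pod, indent=2))
```

An A100 can expose up to seven such slices, which is where the utilization gains for mixed inference loads come from.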
Next is memory efficiency. CUDA provides low-level access to memory
management, allowing us to optimize Tensor Core use and reduce fragmentation.
This can lead to 15 to 30% throughput gains, particularly for
vision-heavy models like ID verification or biometric fraud detection.
Coming to precision optimization: for many FinTech ML workloads,
especially during inference, full FP32 precision is not always needed.
By switching to FP16 or even INT8, organizations can
drastically shrink the memory footprint.
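As a rough sketch of what that switch looks like in practice, here is a small PyTorch example that runs FP16 inference via autocast and applies dynamic INT8 quantization. The model here is a throwaway toy, not a production fraud model.

```python
import copy

import torch
import torch.nn as nn

# Toy stand-in for an inference model; a real fraud or ID-verification model
# would be loaded from a model registry instead.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 2)).eval()
batch = torch.randn(32, 256)

# INT8: dynamic quantization stores Linear weights as 8-bit integers,
# shrinking the memory footprint for CPU inference.
int8_model = torch.quantization.quantize_dynamic(
    copy.deepcopy(model), {nn.Linear}, dtype=torch.qint8
)
with torch.inference_mode():
    int8_out = int8_model(batch)

# FP16: autocast runs supported ops in half precision on the GPU, roughly
# halving activation memory relative to full FP32.
if torch.cuda.is_available():
    gpu_model, gpu_batch = model.cuda(), batch.cuda()
    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
        fp16_out = gpu_model(gpu_batch)
```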
So these techniques are game changers.
They maximize GPU resource usage and provide critical improvements in
both performance and cost efficiency.
Moving on to the next slide, on auto-scaling strategies.
As organizations push to run ML workloads more efficiently,
traditional auto-scaling approaches are not enough, especially
in GPU-accelerated environments.
Here we'll go through four techniques that enable smarter autoscaling.
First, we'll go through workload profiling.
The foundation of autoscaling is understanding when and how much to scale.
Workload profiling uses historical telemetry across both
training and inference cycles to uncover usage patterns.
This helps define baseline resource needs, identify spikes
during market events or fraud surges, and ensures that
auto-scaling decisions are data-informed rather than purely reactive.
Coming to the custom metrics pipeline.
Out of the box, the Kubernetes HPA doesn't know anything about machine learning.
That's where custom metrics come in. Using Prometheus adapters,
we expose ML-relevant signals like queue depth, batch processing duration,
and GPU memory pressure to the autoscaler.
This enables precise decisions based on actual workload characteristics,
not just CPU load or pod count.
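Here is a minimal sketch of the exporter side of that pipeline using prometheus_client. The metric names and the readings are illustrative, and the Prometheus adapter configuration that maps these series to HPA external metrics is a separate, cluster-specific piece not shown here.

```python
import random
import time

from prometheus_client import Gauge, start_http_server

# ML-relevant signals the stock HPA knows nothing about. Metric names are
# illustrative; a Prometheus adapter maps them to external metrics for the HPA.
QUEUE_DEPTH = Gauge("inference_queue_depth", "Requests waiting for a GPU worker")
BATCH_DURATION = Gauge("batch_processing_seconds", "Duration of the last batch")
GPU_MEM_PRESSURE = Gauge("gpu_memory_pressure_ratio", "Used / total GPU memory")

def observe_once() -> None:
    # Placeholder readings; a real exporter would query the serving layer
    # and NVML instead of random numbers.
    QUEUE_DEPTH.set(random.randint(0, 500))
    BATCH_DURATION.set(random.uniform(0.05, 0.8))
    GPU_MEM_PRESSURE.set(random.uniform(0.2, 0.95))

if __name__ == "__main__":
    start_http_server(9100)  # scrape target for Prometheus
    while True:
        observe_once()
        time.sleep(15)
```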
Coming to predictive scaling for recurring or seasonal workloads.
Think end-of-day model retraining or batch scoring jobs.
We can apply time-based or ML-based scaling triggers.
Predictive models can anticipate spikes before they happen,
enabling warm pools to be spun up just in time.
This reduces cold start latency by up to 85%, which is especially valuable
for real-time FinTech systems.
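A minimal sketch of a time-based trigger is shown below: it pre-scales a warm pool of inference workers shortly before a known end-of-day window. The deployment name, namespace, and schedule are assumptions; a real setup would more likely drive this from a forecast model or a cron-style scaler.

```python
from datetime import datetime, timezone

from kubernetes import client, config

# Assumed names; the schedule below pre-warms capacity before an EOD window.
DEPLOYMENT = "risk-scoring-workers"
NAMESPACE = "ml-serving"
WARM_REPLICAS, BASELINE_REPLICAS = 12, 3
EOD_WINDOW_UTC = range(20, 23)  # 20:00-22:59 UTC, an illustrative busy period

def desired_replicas(now: datetime) -> int:
    return WARM_REPLICAS if now.hour in EOD_WINDOW_UTC else BASELINE_REPLICAS

def reconcile() -> None:
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    apps = client.AppsV1Api()
    replicas = desired_replicas(datetime.now(timezone.utc))
    apps.patch_namespaced_deployment_scale(
        name=DEPLOYMENT,
        namespace=NAMESPACE,
        body={"spec": {"replicas": replicas}},
    )

if __name__ == "__main__":
    reconcile()
```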
Coming to buffer management, we can't forget cluster resilience.
Buffer management means deliberately over-provisioning a small amount
of GPU headroom during critical hours like trading sessions
or the holiday season, then scaling back down more aggressively after hours.
This balances cost control with high availability, which is
non-negotiable for financial applications.
These autoscaling strategies go well beyond a basic HPA setup.
When implemented together, they enable organizations to sustain peak
model performance at minimal cost.
Okay.
Coming to multi-GPU training orchestration: one of the biggest bottlenecks
in scaling machine learning training is GPU job orchestration.
It's not just about having more GPUs.
It's about using them intelligently.
When we orchestrate jobs efficiently across the GPUs, especially in
distributed environments, we unlock huge gains in throughput and cost efficiency.
The first step is topology-aware scheduling.
Why does this matter?
When jobs span multiple GPUs, especially across nodes, network
latency becomes the enemy. We solve this by using Kubernetes node
affinity rules to schedule jobs on nodes that are physically close and
connected through high-bandwidth links like NVLink or PCIe.
This minimizes communication overhead and drastically improves
performance in distributed training.
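Here is a hedged sketch of what that looks like on a training pod. The label keys (a topology-block label applied by the platform team) and the names are assumptions, not Kubernetes defaults; node affinity pins the pod to a high-bandwidth block, and pod affinity packs peer workers onto the same block.

```python
# Sketch of topology-aware placement for a distributed training worker.
# The label keys ("gpu.topology/block", "app") are assumed to be applied by
# the platform team; Kubernetes itself does not define them.
training_worker = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "trainer-worker-0", "labels": {"app": "dist-trainer"}},
    "spec": {
        "affinity": {
            # Land only on nodes that sit on a high-bandwidth interconnect.
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [{
                        "matchExpressions": [{
                            "key": "gpu.topology/block",
                            "operator": "In",
                            "values": ["nvlink-a"],
                        }]
                    }]
                }
            },
            # Pack peer workers onto the same block so gradient exchange stays local.
            "podAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": [{
                    "labelSelector": {"matchLabels": {"app": "dist-trainer"}},
                    "topologyKey": "gpu.topology/block",
                }]
            },
        },
        "containers": [{
            "name": "trainer",
            "image": "registry.example.com/trainer:latest",
            "resources": {"limits": {"nvidia.com/gpu": 4}},
        }],
    },
}
```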
Next, we move on to distributed training frameworks.
Frameworks like TensorFlow's MultiWorkerMirroredStrategy need to be tuned
to our infrastructure; we often see poor performance simply because
training scripts are not optimized.
We address this with custom job templates that enable
dynamic scaling, reduce training time variance, and
minimize communication bottlenecks between nodes.
Finally, there's a priority-based preemption mechanism.
In shared clusters, when multiple teams submit jobs, it's critical
to ensure that high-priority jobs don't wait in line behind
low-priority ones.
So we implement intelligent queuing mechanisms that evaluate
job priority, deadlines, and resource fairness.
It's like air traffic control for GPUs: preventing collisions,
reducing idle time, and keeping the system fair and efficient.
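A minimal sketch of the Kubernetes side of this is a PriorityClass per tier plus a priorityClassName on each job's pod template. The class names and values below are assumptions, and the deadline and fairness logic mentioned above would live in a queueing layer on top of this.

```python
# Sketch: two priority tiers. Pods in the higher tier can preempt lower-tier
# pods when the cluster is full. Names and values are illustrative.
production_inference = {
    "apiVersion": "scheduling.k8s.io/v1",
    "kind": "PriorityClass",
    "metadata": {"name": "prod-inference"},
    "value": 100000,
    "preemptionPolicy": "PreemptLowerPriority",
    "description": "Latency-critical production inference jobs",
}

research_training = {
    "apiVersion": "scheduling.k8s.io/v1",
    "kind": "PriorityClass",
    "metadata": {"name": "research-training"},
    "value": 1000,
    "preemptionPolicy": "Never",  # research jobs queue instead of preempting
    "description": "Best-effort experimentation and retraining",
}

# A job's pod template then opts into a tier:
pod_template_snippet = {"spec": {"priorityClassName": "prod-inference"}}
```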
The key takeaway is that when you bring all three pillars together,
topology awareness, smart frameworks, and priority-based scheduling,
you can cut training time by 30 to 50%.
That's a massive impact when you are running hundreds or even
thousands of experiments each week.
Okay.
Moving on to advanced monitoring and observability.
As machine learning workloads grow in complexity and scale, the need for robust
observability becomes absolutely critical, not just for performance,
but for cost efficiency and operational reliability.
This slide outlines the layered approach we take towards building a
comprehensive monitoring stack for machine learning infrastructure.
Let's start at the foundation, the historical data.
Long-term storage of metrics is essential for capacity
planning and trend analysis.
Whether you are sizing your GPU clusters for next quarter's workloads
or preparing for upcoming model retraining cycles, you need that
historical context.
It helps avoid over-provisioning and gives a factual basis
for infrastructure budgeting.
Next, we move up to resource utilization.
This is where fine-grained metrics come in: CPU usage,
memory pressure, and GPU utilization at the process level.
This layer is all about visibility into how your infrastructure
is being consumed in real time; without it, you are essentially flying blind.
Above that, we have performance insights.
Generic metrics like CPU usage won't tell you how long a model
takes to train or when your data pipeline is getting congested.
So we create custom dashboards that reflect ML-specific metrics
such as throughput, training loss over time, and GPU memory fragmentation.
These give engineers a much clearer picture of how
their workloads are behaving.
At the top of the pyramid is root cause analysis.
This is the most advanced layer and arguably the most valuable as well.
It ties everything together through distributed tracing and correlation
across the layers, like linking spikes in GPU usage to a specific
phase of model training. When something breaks or slows down, this is how you
rapidly isolate and resolve the issue.
The systems we are talking about are not just about alerting.
They're about creating a narrative that ML engineers and SREs can follow.
They support both the immediate issue at hand and strategic decision making.
The most successful organizations build all four of these layers
into their monitoring approach.
Together they create a holistic feedback loop
that helps optimize models and improve uptime as well as cost controls.
Now let's explore how we can dramatically improve data throughput and
reduce latency in ML pipelines by optimizing storage.
As machine learning workloads become increasingly data intensive,
storage becomes a critical bottleneck.
This slide outlines three key strategies to overcome that.
Starting with distributed file systems: file systems like GPFS
are built for high-throughput scenarios and are capable of parallel
data processing across hundreds of nodes.
They can outperform traditional network storage solutions by up to 8-10x,
especially when working with the small, fragmented file types
that are common in ML data sets.
The design allows efficient aggregation of bandwidth and parallel
reads and writes, which accelerates data ingestion during training.
Next, we deploy in-memory caching layers between the persistent
storage and the compute resources. These caching layers drastically
reduce IO wait times for frequently accessed data sets, by 65% or more.
They also implement automatic eviction policies, ensuring optimal
memory usage without manual intervention.
This is particularly beneficial in iterative workflows where
the same data sets are read multiple times during model tuning.
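The idea can be sketched in a few lines: a bounded in-memory cache in front of slower persistent storage, with least-recently-used eviction standing in for the automatic eviction policies mentioned above. The capacity and file-based loader are placeholders; production systems would typically use a shared cache service rather than per-process memory.

```python
from collections import OrderedDict
from pathlib import Path

class LRUDatasetCache:
    """Tiny in-memory cache in front of persistent storage, LRU eviction."""

    def __init__(self, capacity_items: int = 8):
        self.capacity = capacity_items
        self._store: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, path: str) -> bytes:
        if path in self._store:
            self._store.move_to_end(path)      # mark as recently used
            return self._store[path]
        data = Path(path).read_bytes()         # slow path: hit persistent storage
        self._store[path] = data
        if len(self._store) > self.capacity:   # evict least recently used shard
            self._store.popitem(last=False)
        return data

# Iterative tuning loops re-read the same shards; repeated gets are served
# from memory instead of waiting on IO.
cache = LRUDatasetCache(capacity_items=4)
```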
Lastly, we have automated storage tiering.
This involves moving data between hot and cold storage tiers based on
access frequency and performance needs.
This not only preserves performance for active data sets, but also
achieves 40 to 50 percent cost reductions by shifting
inactive data to lower-cost storage.
This is done transparently through a unified namespace,
so engineers don't need to worry about where the data physically resides.
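Conceptually, the tiering policy can be as simple as the sketch below: anything not read within a cutoff window moves to a cheaper tier. The hot/cold directory layout and the 30-day cutoff are stand-ins; a real system would act on object-store storage classes behind the unified namespace rather than local folders.

```python
import shutil
import time
from pathlib import Path

HOT = Path("/data/hot")      # fast tier (illustrative paths)
COLD = Path("/data/cold")    # cheap tier
CUTOFF_DAYS = 30             # assumed access-frequency threshold

def tier_down() -> None:
    """Move datasets not accessed within the cutoff window to the cold tier."""
    now = time.time()
    COLD.mkdir(parents=True, exist_ok=True)
    for item in HOT.glob("*"):
        idle_days = (now - item.stat().st_atime) / 86400
        if idle_days > CUTOFF_DAYS:
            shutil.move(str(item), COLD / item.name)

if __name__ == "__main__":
    tier_down()
```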
Together, these three layers, a distributed file system, in-memory
caching, and intelligent tiering, create a robust, cost-efficient
foundation for data-driven ML pipelines.
The result is faster training, lower cost, and smoother operations.
Here we are looking at how we optimize cloud spend for machine learning
workloads using spot instances.
We start with workload classification: we assess jobs based on their
fault tolerance and run length to identify which ones can safely run
on spot instances.
Next, we implement fault tolerance by adding automated checkpointing
and recovery systems, so if a spot instance gets reclaimed, the job
can resume smoothly.
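A hedged sketch of that checkpoint-and-resume loop is below, using PyTorch and a SIGTERM handler as a stand-in for the cloud provider's reclaim notice. The checkpoint path and the model are placeholders, and a real job would also checkpoint on a regular step interval rather than only at shutdown.

```python
import signal
import sys
from pathlib import Path

import torch
import torch.nn as nn

CKPT = Path("checkpoints/job.pt")  # assumed persistent volume or object-store mount
CKPT.parent.mkdir(parents=True, exist_ok=True)

model = nn.Linear(64, 1)           # toy model standing in for the real training job
opt = torch.optim.SGD(model.parameters(), lr=0.01)
step = 0

# Resume if a previous (possibly reclaimed) run left a checkpoint behind.
if CKPT.exists():
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    opt.load_state_dict(state["opt"])
    step = state["step"] + 1

def save_and_exit(signum, frame):
    # Spot reclaim usually arrives as SIGTERM with a short grace period.
    torch.save({"model": model.state_dict(), "opt": opt.state_dict(), "step": step}, CKPT)
    sys.exit(0)

signal.signal(signal.SIGTERM, save_and_exit)

while step < 10_000:
    opt.zero_grad()
    loss = model(torch.randn(32, 64)).pow(2).mean()  # dummy objective
    loss.backward()
    opt.step()
    step += 1
```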
Then comes bidding strategy optimization: using historical
and real-time pricing data, we adjust our bidding to maximize savings
while maintaining job reliability.
Lastly, we adopt hybrid deployment models, dynamically switching
between spot and on-demand instances based on cost and availability.
Together, these practices can achieve cost savings of up to 60 to 80%
without sacrificing performance.
Moving on, let's talk about optimizing node pool configurations,
an area where we can unlock substantial performance and cost
benefits, especially in machine learning and compute-heavy environments.
First, we start with heterogeneous hardware segmentation.
Rather than mixing all the GPU types into a single node pool,
we create specialized node pools for different GPUs, with A100s
going to one node pool and V100s going to a different node pool.
Each has different compute and memory characteristics, and assigning
workloads without separating the node pools leads to inefficiencies.
By using taints and tolerations, we make sure workloads are scheduled
only onto the hardware they are optimized for.
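As a sketch, the node pool carries a taint and only workloads that tolerate it (and select that GPU type) land there. The taint key and label values below are illustrative conventions, not Kubernetes defaults.

```python
# Sketch: the A100 pool is tainted so only GPU-aware workloads land on it,
# and the pod both tolerates the taint and selects the A100 pool explicitly.
# Taint key and label values are assumed naming conventions.

a100_node_taint = {"key": "gpu-pool", "value": "a100", "effect": "NoSchedule"}
# applied with: kubectl taint nodes <node> gpu-pool=a100:NoSchedule

training_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "large-model-trainer"},
    "spec": {
        "nodeSelector": {"gpu-pool": "a100"},
        "tolerations": [{
            "key": "gpu-pool",
            "operator": "Equal",
            "value": "a100",
            "effect": "NoSchedule",
        }],
        "containers": [{
            "name": "trainer",
            "image": "registry.example.com/trainer:latest",
            "resources": {"limits": {"nvidia.com/gpu": 8}},
        }],
    },
}
```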
Next, we have resource ratio optimization.
This is about fine-tuning the CPU, GPU, and memory ratios
to match the ML workload profiles.
For example, inference might require more CPU, whereas model
training could need more GPU memory.
If these ratios are miscalculated, we either end up with
bottlenecks or underutilized hardware, both of which are very costly.
Then there is network topology alignment. When you're running
distributed training, network bandwidth and latency matter a lot.
We design node pools to align with the physical infrastructure,
which means grouping nodes with high-speed interconnects like
NVLink or InfiniBand into the same pool.
This ensures that data exchanges between GPUs happen in a fast and
consistent manner, which avoids slowdowns during the training process.
Now let's quickly look at how we can share resources fairly when multiple
teams are working on the same system.
First, we use namespaces to keep things organized.
Each team or type of work, like production versus non-production,
gets its own namespace.
This way we can set up different permissions, use specific settings
for each environment, and keep the network traffic separate.
Second, we set up resource controls.
That means we limit how much GPU, memory, and storage each team can use.
This helps make sure a team doesn't accidentally take
more than its allocated share.
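A minimal sketch of those controls with the Kubernetes Python client is shown below. The namespace name and the quota numbers are placeholders; requests.nvidia.com/gpu is the usual way to cap GPU counts per namespace.

```python
from kubernetes import client, config

def apply_team_quota(namespace: str = "team-fraud-ml") -> None:
    """Cap GPU, memory, and storage for one team's namespace (illustrative numbers)."""
    config.load_kube_config()
    quota = client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="team-quota"),
        spec=client.V1ResourceQuotaSpec(hard={
            "requests.nvidia.com/gpu": "8",  # at most 8 GPUs in flight
            "requests.memory": "256Gi",
            "limits.memory": "512Gi",
            "requests.storage": "2Ti",       # total PVC storage for the team
        }),
    )
    client.CoreV1Api().create_namespaced_resource_quota(namespace, quota)

if __name__ == "__main__":
    apply_team_quota()
```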
Then finally, we give higher priority to critical jobs like live
production workloads.
At the same time, we make sure every team gets at least the minimum
resources needed for its operations.
This avoids the problem of one team over-utilizing resources while
another team doesn't get anything.
Overall, this setup keeps the infrastructure fair, reliable,
and running smoothly for everyone.
Here's the impact these optimization strategies can have.
In one of the case studies, it was observed that companies that
applied these techniques saw an average 43% reduction in costs;
that is nearly half of their infrastructure spend saved just by
tuning things right.
Next, training jobs became much faster, 2.8 times faster on average.
That means models are trained and ready in far less time, which
speeds up the entire development cycle.
The study also suggests there was a big jump in GPU utilization,
up around 67%.
Instead of having expensive GPUs sitting idle, they're being
used efficiently across the cluster.
And finally, production models became more reliable, with 94% of
inference requests meeting the SLA goals.
That's a strong sign of stable, production-ready systems.
This slide gives us a step-by-step guide for applying the
strategies we've covered so far.
Step one is to establish baseline metrics.
Before making any changes, we need to understand how our
resources are being used right now.
That includes tracking GPU and CPU usage, training times,
and how much it's costing per model.
Then step two is to go for the quick wins.
These are changes that are easy to implement and give immediate
value without disrupting our current setup.
Examples include sizing resource limits correctly, enabling
simple autoscaling, and using the right storage classes for each workload.
Then step three is to deploy more advanced optimizations.
Once the basics are in place, we can start using smarter tactics like custom
autoscaling based on real-time metrics, using spot instances to cut
costs, or even fine-tuning node placement based on hardware topology.
And then finally, step four is continuous refinement.
Optimization is not a one-time thing; it needs to evolve as
workloads and technologies change.
That means doing regular reviews, setting up alerts for unusual costs,
and making sure our infrastructure and models keep improving over time,
which is very important.
In short, this roadmap helps balance the quick efficiency gains
with smart long-term planning.
Finally, thank you for joining me in this session and I hope you have a great
time at this particular conference.
Thank you.