Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
Thank you for joining me in this conference.
Today we will discuss how FinTech reliability is achieved
through SRE innovation.
So as organizations accelerate their AI initiatives, many are discovering
that traditional infrastructure was not built to handle the high demands
of machine learning operations.
These workloads are highly resource intensive and are unpredictable.
While Kubernetes has emerged as a platform of choice for managing
these environments, simply running the containers is not enough.
The challenge lies in running them optimally.
What we are seeing is that many teams are struggling, whether with
over-provisioned resources, unpredictable costs, or performance bottlenecks.
So in this particular session, I would like to walk you through a set of proven
SRE strategies designed to increase performance, reduce costs, and
significantly improve the reliability of AI/ML pipelines running in Kubernetes.
These approaches come directly from high-stakes environments
where even minor inefficiencies can lead to major financial losses.
Together we'll go through techniques that have delivered up
to 60% improvements in resource utilization and substantial
reductions in incident rates, all while maintaining agility and performance.
Ultimately, this presentation is about more than just keeping systems online.
It's about engineering confidently at scale in an increasingly
AI-driven financial ecosystem.
So let's take a closer look at how this transformation is happening.
Okay.
When we talk about scaling AI infrastructure efficiently, GPU
utilization is one of the most overlooked, and most expensive, bottlenecks.
So let's look at how we can dramatically shift that performance-to-cost ratio
using smart GPU-level strategies.
Let's start with MIG, or Multi-Instance GPU partitioning.
This is a feature available on NVIDIA's A100 and H100 series GPUs.
It allows you to carve up a single GPU into multiple isolated instances,
which enables multiple workloads to run simultaneously without
interfering with each other.
Our real-world tests have shown up to 7x better utilization, especially in
environments with varied inference loads.
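To make that concrete, here is a minimal sketch of how a workload might request a single MIG slice once the NVIDIA device plugin exposes MIG profiles. The resource name nvidia.com/mig-1g.5gb, the pod name, and the image are placeholders that depend on how your cluster is actually partitioned.

```python
import json

# Minimal sketch: a pod that requests one MIG slice instead of a whole GPU.
# The resource name depends on the MIG profile exposed by the NVIDIA device
# plugin; "nvidia.com/mig-1g.5gb" and the pod/image names are placeholders.
inference_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "fraud-scoring-inference"},
    "spec": {
        "containers": [{
            "name": "scorer",
            "image": "registry.example.com/fraud-scorer:latest",
            "resources": {"limits": {"nvidia.com/mig-1g.5gb": 1}},
        }],
        "restartPolicy": "Never",
    },
}

# Print the manifest; in practice it would be applied with kubectl or a client.
print(json.dumps(inference_pod, indent=2))
```

An A100 can expose up to seven such slices, which is where the utilization gains for mixed inference loads come from.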
Next is memory efficiency. CUDA provides low-level access to memory
management, allowing us to optimize Tensor Core use and reduce fragmentation.
This can lead to 15 to 30% throughput gains, particularly for
vision-heavy models like ID verification or biometric fraud detection.
Coming to precision optimization: for many FinTech ML workloads,
especially during inference, full FP32 precision is not always needed.
By switching to FP16 or even INT8, organizations can
drastically shrink the memory footprint.
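As a rough sketch of what that switch looks like in practice, here is a small PyTorch example that runs FP16 inference via autocast and applies dynamic INT8 quantization. The model here is a throwaway toy, not a production fraud model.

```python
import copy

import torch
import torch.nn as nn

# Toy stand-in for an inference model; a real fraud or ID-verification model
# would be loaded from a model registry instead.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 2)).eval()
batch = torch.randn(32, 256)

# INT8: dynamic quantization stores Linear weights as 8-bit integers,
# shrinking the memory footprint for CPU inference.
int8_model = torch.quantization.quantize_dynamic(
    copy.deepcopy(model), {nn.Linear}, dtype=torch.qint8
)
with torch.inference_mode():
    int8_out = int8_model(batch)

# FP16: autocast runs supported ops in half precision on the GPU, roughly
# halving activation memory relative to full FP32.
if torch.cuda.is_available():
    gpu_model, gpu_batch = model.cuda(), batch.cuda()
    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
        fp16_out = gpu_model(gpu_batch)
```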
So these techniques are game changers.
They maximize GPU resource usage and provide critical improvements in
both performance and cost efficiency.
Moving on to the next slide, on auto-scaling strategies.
As organizations push to run ML workloads more efficiently,
traditional auto-scaling approaches are not enough, especially
in GPU-accelerated environments.
Here we'll go through four techniques that enable smarter autoscaling.
First, we'll go through workload profiling.
The foundation of autoscaling is understanding when and how much to scale.
Workload profiling uses historical telemetry across both
training and inference cycles to uncover usage patterns.
This helps define baseline resource needs, identify spikes
during market events or fraud surges, and ensures that
auto-scaling decisions are data-informed rather than purely reactive.
Coming to the custom metrics pipeline.
Out of the box, the Kubernetes HPA doesn't know anything about machine learning.
That's where custom metrics come in. Using Prometheus adapters,
we expose ML-relevant signals like queue depth, batch processing duration,
and GPU memory pressure to the autoscaler.
This enables precise decisions based on actual workload characteristics,
not just CPU load or pod count.
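Here is a minimal sketch of the exporter side of that pipeline using prometheus_client. The metric names and the readings are illustrative, and the Prometheus adapter configuration that maps these series to HPA external metrics is a separate, cluster-specific piece not shown here.

```python
import random
import time

from prometheus_client import Gauge, start_http_server

# ML-relevant signals the stock HPA knows nothing about. Metric names are
# illustrative; a Prometheus adapter maps them to external metrics for the HPA.
QUEUE_DEPTH = Gauge("inference_queue_depth", "Requests waiting for a GPU worker")
BATCH_DURATION = Gauge("batch_processing_seconds", "Duration of the last batch")
GPU_MEM_PRESSURE = Gauge("gpu_memory_pressure_ratio", "Used / total GPU memory")

def observe_once() -> None:
    # Placeholder readings; a real exporter would query the serving layer
    # and NVML instead of random numbers.
    QUEUE_DEPTH.set(random.randint(0, 500))
    BATCH_DURATION.set(random.uniform(0.05, 0.8))
    GPU_MEM_PRESSURE.set(random.uniform(0.2, 0.95))

if __name__ == "__main__":
    start_http_server(9100)  # scrape target for Prometheus
    while True:
        observe_once()
        time.sleep(15)
```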
Coming to predictive scaling for recurring or seasonal workloads.
Think end-of-day model retraining or batch scoring jobs.
We can apply time-based or ML-based scaling triggers.
Predictive models can anticipate spikes before they happen,
enabling warm pools to be spun up just in time.
This reduces cold start latency by up to 85%, which is especially valuable
for real-time FinTech systems.
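A minimal sketch of a time-based trigger is shown below: it pre-scales a warm pool of inference workers shortly before a known end-of-day window. The deployment name, namespace, and schedule are assumptions; a real setup would more likely drive this from a forecast model or a cron-style scaler.

```python
from datetime import datetime, timezone

from kubernetes import client, config

# Assumed names; the schedule below pre-warms capacity before an EOD window.
DEPLOYMENT = "risk-scoring-workers"
NAMESPACE = "ml-serving"
WARM_REPLICAS, BASELINE_REPLICAS = 12, 3
EOD_WINDOW_UTC = range(20, 23)  # 20:00-22:59 UTC, an illustrative busy period

def desired_replicas(now: datetime) -> int:
    return WARM_REPLICAS if now.hour in EOD_WINDOW_UTC else BASELINE_REPLICAS

def reconcile() -> None:
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    apps = client.AppsV1Api()
    replicas = desired_replicas(datetime.now(timezone.utc))
    apps.patch_namespaced_deployment_scale(
        name=DEPLOYMENT,
        namespace=NAMESPACE,
        body={"spec": {"replicas": replicas}},
    )

if __name__ == "__main__":
    reconcile()
```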
Coming to buffer management, we can't forget cluster resilience.
Buffer management means deliberately over-provisioning a small amount
of GPU headroom during critical hours like trading sessions
or the holiday season, then scaling back down more aggressively after hours.
This balances cost control with high availability, which is
non-negotiable for financial applications.
These autoscaling strategies go well beyond a basic HPA setup.
When implemented together, they enable organizations to sustain peak
model performance at minimal cost.
Okay.
Coming to multi-GPU training orchestration: one of the biggest bottlenecks
in scaling machine learning training is GPU job orchestration.
It's not just about having more GPUs.
It's about using them intelligently.
When we orchestrate jobs efficiently across the GPUs, especially in
distributed environments, we unlock huge gains in throughput and cost efficiency.
The first step is topology-aware scheduling.
Why does this matter?
When jobs span multiple GPUs, especially across nodes, network
latency becomes the enemy. We solve this by using Kubernetes node
affinity rules to schedule jobs on nodes that are physically close and
connected through high-bandwidth links like NVLink or PCIe.
This minimizes communication overhead and drastically improves
performance in distributed training.
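Here is a hedged sketch of what that looks like on a training pod. The label keys (a topology-block label applied by the platform team) and the names are assumptions, not Kubernetes defaults; node affinity pins the pod to a high-bandwidth block, and pod affinity packs peer workers onto the same block.

```python
# Sketch of topology-aware placement for a distributed training worker.
# The label keys ("gpu.topology/block", "app") are assumed to be applied by
# the platform team; Kubernetes itself does not define them.
training_worker = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "trainer-worker-0", "labels": {"app": "dist-trainer"}},
    "spec": {
        "affinity": {
            # Land only on nodes that sit on a high-bandwidth interconnect.
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [{
                        "matchExpressions": [{
                            "key": "gpu.topology/block",
                            "operator": "In",
                            "values": ["nvlink-a"],
                        }]
                    }]
                }
            },
            # Pack peer workers onto the same block so gradient exchange stays local.
            "podAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": [{
                    "labelSelector": {"matchLabels": {"app": "dist-trainer"}},
                    "topologyKey": "gpu.topology/block",
                }]
            },
        },
        "containers": [{
            "name": "trainer",
            "image": "registry.example.com/trainer:latest",
            "resources": {"limits": {"nvidia.com/gpu": 4}},
        }],
    },
}
```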
Next, we move on to distributed training frameworks.
Frameworks like TensorFlow's MultiWorkerMirroredStrategy need to be tuned
to our infrastructure; we often see poor performance simply because
training scripts are not optimized.
We address this with custom job templates that enable
dynamic scaling, reduce training time variance, and
minimize communication bottlenecks between nodes.
Finally, there's a priority-based preemption mechanism.
In shared clusters, when multiple teams submit jobs, it's critical
to ensure that high-priority jobs don't wait in line behind
low-priority ones.
So we implement intelligent queuing mechanisms that evaluate
job priority, deadlines, and resource fairness.
It's like air traffic control for GPUs: preventing collisions,
reducing idle time, and keeping the system fair and efficient.
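A minimal sketch of the Kubernetes side of this is a PriorityClass per tier plus a priorityClassName on each job's pod template. The class names and values below are assumptions, and the deadline and fairness logic mentioned above would live in a queueing layer on top of this.

```python
# Sketch: two priority tiers. Pods in the higher tier can preempt lower-tier
# pods when the cluster is full. Names and values are illustrative.
production_inference = {
    "apiVersion": "scheduling.k8s.io/v1",
    "kind": "PriorityClass",
    "metadata": {"name": "prod-inference"},
    "value": 100000,
    "preemptionPolicy": "PreemptLowerPriority",
    "description": "Latency-critical production inference jobs",
}

research_training = {
    "apiVersion": "scheduling.k8s.io/v1",
    "kind": "PriorityClass",
    "metadata": {"name": "research-training"},
    "value": 1000,
    "preemptionPolicy": "Never",  # research jobs queue instead of preempting
    "description": "Best-effort experimentation and retraining",
}

# A job's pod template then opts into a tier:
pod_template_snippet = {"spec": {"priorityClassName": "prod-inference"}}
```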
The key takeaway is that when you bring all three pillars together,
topology awareness, smart frameworks, and priority-based scheduling,
you can cut training time by 30 to 50%.
That's a massive impact when you are running hundreds or even
thousands of experiments each week.
Okay.
Moving on to advanced monitoring and observability.
As machine learning workloads grow in complexity and scale, the need for robust
observability becomes absolutely critical, not just for performance,
but for cost efficiency and operational reliability.
This slide outlines the layered approach we take towards building a
comprehensive monitoring stack for machine learning infrastructure.
Let's start at the foundation, the historical data.
Long-term storage of metrics is essential for capacity
planning and trend analysis.
Whether you are sizing your GPU clusters for next quarter's workloads
or preparing for upcoming model retraining cycles, you need that
historical context.
It helps avoid over-provisioning and gives a factual basis
for infrastructure budgeting.
Next, we move up to resource utilization.
This is where fine-grained metrics come in: CPU usage,
memory pressure, and GPU utilization at the process level.
This layer is all about visibility into how your infrastructure
is being consumed in real time; without it, you are essentially flying blind.
Above that, we have performance insights.
Generic metrics like CPU usage won't tell you how long a model
takes to train or when your data pipeline is getting congested.
So we create custom dashboards that reflect ML-specific metrics
such as throughput, training loss over time, and GPU memory fragmentation.
These give engineers a much clearer picture of how
their workloads are behaving.
At the top of the pyramid is root cause analysis.
This is the most advanced layer and arguably the most valuable as well.
It ties everything together through distributed tracing and correlation
across the layers, like linking spikes in GPU usage to a specific
phase of model training. When something breaks or slows down, this is how you
rapidly isolate and resolve the issue.
The systems we are talking about are not just about alerting.
They're about creating a narrative that ML engineers and SREs can follow.
They support both the immediate issue at hand and strategic decision making.
The most successful organizations build all four of these layers
into their monitoring approach.
Together they create a holistic feedback loop
that helps optimize models and improve uptime as well as cost controls.
Now let's explore how we can dramatically improve data throughput and
reduce latency in ML pipelines by optimizing storage.
As machine learning workloads become increasingly data intensive,
storage becomes a critical bottleneck.
This slide outlines three key strategies to overcome that.
Starting with distributed file systems: file systems like GPFS
are built for high-throughput scenarios and are capable of parallel
data processing across hundreds of nodes.
They can outperform traditional network storage solutions by up to 8-10x,
especially when working with the small, fragmented file types
that are common in ML data sets.
The design allows efficient aggregation of bandwidth and parallel
reads and writes, which accelerates data ingestion during training.
Next, we deploy in-memory caching layers between the persistent
storage and the compute resources. These caching layers drastically
reduce IO wait times for frequently accessed data sets, by 65% or more.
They also implement automatic eviction policies, ensuring optimal
memory usage without manual intervention.
This is particularly beneficial in iterative workflows where
the same data sets are read multiple times during model tuning.
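The idea can be sketched in a few lines: a bounded in-memory cache in front of slower persistent storage, with least-recently-used eviction standing in for the automatic eviction policies mentioned above. The capacity and file-based loader are placeholders; production systems would typically use a shared cache service rather than per-process memory.

```python
from collections import OrderedDict
from pathlib import Path

class LRUDatasetCache:
    """Tiny in-memory cache in front of persistent storage, LRU eviction."""

    def __init__(self, capacity_items: int = 8):
        self.capacity = capacity_items
        self._store: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, path: str) -> bytes:
        if path in self._store:
            self._store.move_to_end(path)      # mark as recently used
            return self._store[path]
        data = Path(path).read_bytes()         # slow path: hit persistent storage
        self._store[path] = data
        if len(self._store) > self.capacity:   # evict least recently used shard
            self._store.popitem(last=False)
        return data

# Iterative tuning loops re-read the same shards; repeated gets are served
# from memory instead of waiting on IO.
cache = LRUDatasetCache(capacity_items=4)
```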
Lastly, we have automated storage tiering.
This involves moving data between hot and cold storage tiers based on
access frequency and performance needs.
This not only preserves performance for active data sets, but also
achieves 40 to 50 percent cost reductions by shifting
inactive data to lower-cost storage.
This is done transparently through a unified namespace,
so engineers don't need to worry about where the data physically resides.
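Conceptually, the tiering policy can be as simple as the sketch below: anything not read within a cutoff window moves to a cheaper tier. The hot/cold directory layout and the 30-day cutoff are stand-ins; a real system would act on object-store storage classes behind the unified namespace rather than local folders.

```python
import shutil
import time
from pathlib import Path

HOT = Path("/data/hot")      # fast tier (illustrative paths)
COLD = Path("/data/cold")    # cheap tier
CUTOFF_DAYS = 30             # assumed access-frequency threshold

def tier_down() -> None:
    """Move datasets not accessed within the cutoff window to the cold tier."""
    now = time.time()
    COLD.mkdir(parents=True, exist_ok=True)
    for item in HOT.glob("*"):
        idle_days = (now - item.stat().st_atime) / 86400
        if idle_days > CUTOFF_DAYS:
            shutil.move(str(item), COLD / item.name)

if __name__ == "__main__":
    tier_down()
```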
Together, these three layers, a distributed file system, in-memory
caching, and intelligent tiering, create a robust, cost-efficient
foundation for data-driven ML pipelines.
The result is faster training, lower cost, and smoother operations.
Here we are looking at how we optimize cloud spend for machine learning
workloads using spot instances.
We start with workload classification: we assess jobs based on their
fault tolerance and run length to identify which ones can safely run
on spot instances.
Next, we implement fault tolerance by adding automated checkpointing
and recovery systems, so if a spot instance gets reclaimed, the job
can resume smoothly.
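A hedged sketch of that checkpoint-and-resume loop is below, using PyTorch and a SIGTERM handler as a stand-in for the cloud provider's reclaim notice. The checkpoint path and the model are placeholders, and a real job would also checkpoint on a regular step interval rather than only at shutdown.

```python
import signal
import sys
from pathlib import Path

import torch
import torch.nn as nn

CKPT = Path("checkpoints/job.pt")  # assumed persistent volume or object-store mount
CKPT.parent.mkdir(parents=True, exist_ok=True)

model = nn.Linear(64, 1)           # toy model standing in for the real training job
opt = torch.optim.SGD(model.parameters(), lr=0.01)
step = 0

# Resume if a previous (possibly reclaimed) run left a checkpoint behind.
if CKPT.exists():
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    opt.load_state_dict(state["opt"])
    step = state["step"] + 1

def save_and_exit(signum, frame):
    # Spot reclaim usually arrives as SIGTERM with a short grace period.
    torch.save({"model": model.state_dict(), "opt": opt.state_dict(), "step": step}, CKPT)
    sys.exit(0)

signal.signal(signal.SIGTERM, save_and_exit)

while step < 10_000:
    opt.zero_grad()
    loss = model(torch.randn(32, 64)).pow(2).mean()  # dummy objective
    loss.backward()
    opt.step()
    step += 1
```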
Then comes bidding strategy optimization: using historical
and real-time pricing data, we adjust our bidding to maximize savings
while maintaining job reliability.
Lastly, we adopt hybrid deployment models, dynamically switching
between spot and on-demand instances based on cost and availability.
Together, these practices can achieve cost savings of up to 60 to 80%
without sacrificing performance.
Moving on, let's talk about optimizing node pool configurations,
an area where we can unlock substantial performance and cost
benefits, especially in machine learning and compute-heavy environments.
First, we start with heterogeneous hardware segmentation.
Rather than mixing all the GPU types into a single node pool,
we create specialized node pools for different GPUs, with A100s
going to one node pool and V100s going to a different node pool.
Each has different compute and memory characteristics, and assigning
workloads without separating the node pools leads to inefficiencies.
By using taints and tolerations, we make sure workloads are scheduled
only onto the hardware they are optimized for.
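As a sketch, the node pool carries a taint and only workloads that tolerate it (and select that GPU type) land there. The taint key and label values below are illustrative conventions, not Kubernetes defaults.

```python
# Sketch: the A100 pool is tainted so only GPU-aware workloads land on it,
# and the pod both tolerates the taint and selects the A100 pool explicitly.
# Taint key and label values are assumed naming conventions.

a100_node_taint = {"key": "gpu-pool", "value": "a100", "effect": "NoSchedule"}
# applied with: kubectl taint nodes <node> gpu-pool=a100:NoSchedule

training_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "large-model-trainer"},
    "spec": {
        "nodeSelector": {"gpu-pool": "a100"},
        "tolerations": [{
            "key": "gpu-pool",
            "operator": "Equal",
            "value": "a100",
            "effect": "NoSchedule",
        }],
        "containers": [{
            "name": "trainer",
            "image": "registry.example.com/trainer:latest",
            "resources": {"limits": {"nvidia.com/gpu": 8}},
        }],
    },
}
```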
Next, we have resource ratio optimization.
This is about fine-tuning the CPU, GPU, and memory ratios
to match the ML workload profiles.
For example, inference might require more CPU, whereas model
training could need more GPU memory.
If these ratios are miscalculated, we either end up with
bottlenecks or underutilized hardware, both of which are very costly.
Then there is network topology alignment. When you're running
distributed training, network bandwidth and latency matter a lot.
We design node pools to align with the physical infrastructure,
which means grouping nodes with high-speed interconnects like
NVLink or InfiniBand into the same pool.
This ensures that data exchanges between GPUs happen in a fast and
consistent manner, which avoids slowdowns during the training process.
Now let's quickly look at how we can share resources fairly when multiple
teams are working on the same system.
First, we use namespaces to keep things organized.
Each team or type of work, like production versus non-production,
gets its own namespace.
This way we can set up different permissions, use specific settings
for each environment, and keep the network traffic separate.
Second, we set up resource controls.
That means we limit how much GPU, memory, and storage each team can use.
This helps make sure a team doesn't accidentally take
more than its allocated share.
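A minimal sketch of those controls with the Kubernetes Python client is shown below. The namespace name and the quota numbers are placeholders; requests.nvidia.com/gpu is the usual way to cap GPU counts per namespace.

```python
from kubernetes import client, config

def apply_team_quota(namespace: str = "team-fraud-ml") -> None:
    """Cap GPU, memory, and storage for one team's namespace (illustrative numbers)."""
    config.load_kube_config()
    quota = client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="team-quota"),
        spec=client.V1ResourceQuotaSpec(hard={
            "requests.nvidia.com/gpu": "8",  # at most 8 GPUs in flight
            "requests.memory": "256Gi",
            "limits.memory": "512Gi",
            "requests.storage": "2Ti",       # total PVC storage for the team
        }),
    )
    client.CoreV1Api().create_namespaced_resource_quota(namespace, quota)

if __name__ == "__main__":
    apply_team_quota()
```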
Then finally, we give higher priority to critical jobs like live
production workloads.
At the same time, we make sure every team gets at least the minimum
resources needed for its operations.
This avoids the problem of one team over-utilizing resources while
another team doesn't get anything.
Overall, this setup keeps the infrastructure fair, reliable,
and running smoothly for everyone.
Here's the impact these optimization strategies can have.
In one of the case studies, it was observed that companies that
applied these techniques saw an average 43% reduction in costs;
that is nearly half of their infrastructure spend saved just by
tuning things right.
Next, training jobs became much faster, 2.8 times faster on average.
That means models are trained and ready in far less time, which
speeds up the entire development cycle.
The study also suggests there was a big jump in GPU utilization,
up around 67%.
Instead of having expensive GPUs sitting idle, they're being
used efficiently across the cluster.
And finally, production models became more reliable, with 94% of
inference requests meeting the SLA goals.
That's a strong sign of stable, production-ready systems.
This slide gives us a step-by-step guide for applying the
strategies we've covered so far.
Step one is to establish baseline metrics.
Before making any changes, we need to understand how our
resources are being used right now.
That includes tracking GPU and CPU usage, training times,
and how much it's costing per model.
Then step two is to go for the quick wins.
These are changes that are easy to implement and give immediate
value without disrupting our current setup.
Examples include sizing resource limits correctly, enabling
simple autoscaling, and using the right storage classes for each workload.
Then step three is to deploy more advanced optimizations.
Once the basics are in place, we can start using smarter tactics like custom
autoscaling based on real-time metrics, using spot instances to cut
costs, or even fine-tuning node placement based on hardware topology.
And then finally, step four is continuous refinement.
Optimization is not a one-time thing; it needs to evolve as
workloads and technologies change.
That means doing regular reviews, setting up alerts for unusual costs,
and making sure our infrastructure and models keep improving over time,
which is very important.
In short, this roadmap helps balance the quick efficiency gains
with smart long-term planning.
Finally, thank you for joining me in this session and I hope you have a great
time at this particular conference.
Thank you.