Conf42 Rustlang 2025 - Online

- premiere 5PM GMT

Rust-Based AI-Driven Carbon-Aware Kubernetes Scheduling for Sustainable Cloud Infrastructure

Abstract

Revolutionary Rust-powered AI transforms Kubernetes into an eco-warrior! Our framework slashes data center carbon emissions by 40% while boosting performance. Real-time carbon-aware scheduling meets blazing-fast Rust implementation. Save the planet AND your cloud costs!

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. I'm here to talk mostly about how we can do carbon-aware Kubernetes scheduling instead of traditional Kubernetes scheduling. You might be using EKS, AKS, or GKE (Google Kubernetes Engine), whatever managed Kubernetes your cloud platform offers, or any bare-metal Kubernetes. I think the future is that we have to be aware of building sustainable infrastructure and not harming the planet. I've been researching this, and I've come up with a framework; I'll talk about what can be implemented to improve cost effectiveness, sustainability, and the environmental friendliness of the infrastructure as a whole.

The current challenge is that a lot of cloud computing is happening: with AI booming and machine learning models being trained on it, lots and lots of EC2 and EKS resources are being used. Traditional Kubernetes resource management usually works like this: there is a scheduler, and based on the requests and the deployments we define, it schedules the pods, and whatever workloads we have run on those pods. The key issues here are that data centers account for approximately 1% of global electricity consumption, which is a lot; that current Kubernetes schedulers prioritize only performance and cluster availability, not sustainable, carbon-aware scheduling; and that cloud infrastructure often has idle instances just lying around with no proper usage, which costs us in many ways. Eventually we are impacted cost-wise, environment-wise, and even availability-wise.

Here is the solution I have been thinking about and developing. Google has also come up with such APIs before, I think, but not a fully grown Kubernetes solution; they were just exploring how carbon-aware scheduling of pods could be done. Our solution is a carbon-aware scheduler, and there are four major pieces to it. First, it integrates with carbon intensity data APIs. There are many data sources we can integrate our Kubernetes with to gather real-time information about the energy sources, whether it is wind energy, coal energy, or hydro energy; we can identify what kind of energy it is, which gives us real-time carbon awareness. Second is predictive energy modeling: we deploy machine learning models that forecast energy consumption. Suppose I have batch workloads: how much energy do they consume? What about very critical production workloads, or simple standalone jobs? The models predict that energy consumption per workload. Third is the Rust-powered performance framework: the critical components are implemented in Rust and deliver memory-safe, concurrent processing architectures with minimal overhead, ensuring sustainability without compromising on any performance.
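To make the first piece concrete, here is a minimal Rust sketch of polling a carbon-intensity source. The endpoint URL, JSON shape, and crate choices (reqwest with its blocking and json features, plus serde) are assumptions for illustration only; real providers such as WattTime or Electricity Maps each have their own APIs and schemas.

```rust
// A minimal sketch of polling a carbon-intensity API, assuming the
// `reqwest` (blocking + json features) and `serde` (derive) crates.
// The endpoint URL and JSON shape below are hypothetical.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct CarbonIntensity {
    region: String,
    grams_co2_per_kwh: f64, // current grid carbon intensity
    renewable_share: f64,   // fraction of generation from renewables
}

fn fetch_intensity(region: &str) -> Result<CarbonIntensity, Box<dyn std::error::Error>> {
    // Real providers each have their own schema; this URL is a placeholder.
    let url = format!("https://carbon-api.example.com/v1/intensity/{region}");
    Ok(reqwest::blocking::get(url)?.json::<CarbonIntensity>()?)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let i = fetch_intensity("eu-west-1")?;
    println!(
        "{}: {:.0} gCO2/kWh, {:.0}% renewable",
        i.region,
        i.grams_co2_per_kwh,
        i.renewable_share * 100.0
    );
    Ok(())
}
```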
This Rust framework approach has been coming up a lot, and many people are playing around with it to find the best implementation strategy possible. The fourth piece is SLAs. All our applications have to meet their service-level agreements; breaching them can cost companies millions, so when you are building these algorithms you have to be really aware of your applications' SLAs and design your system based on that.

The core framework architecture has three main components: the carbon-aware scheduler, the workload analyzer, and the metrics collector. The carbon-aware scheduler is completely Rust-based, and it replaces the traditional Kubernetes scheduler. Kubernetes has its own scheduler that places pods based on the ReplicaSets and Deployments we define; this is like the next version of it, a carbon-aware scheduler that optimizes by integrating with the carbon data sources, the APIs I was talking about before, and schedules the workloads based on that.

The next component is the workload analyzer. As I was saying before, you have to understand, when your workload is running, what its energy consumption is, what kind of energy it needs. Based on that you can categorize your workloads, batch processing and other kinds, and you also estimate the energy each one takes. That is what the workload analyzer does.

The third component is the metrics collector. Here you mainly want to understand how things are performing: what the CPU and memory usage is, what the energy consumption is, and which type of energy source is being used. You can categorize different kinds of metrics so that you know this particular workload drew its energy from hydro, that one from coal, this one from solar, this one from wind, all the different kinds of sources. Knowing that, if you have very high-availability or high-performance workloads, you can schedule them on mid-tier or higher-cost nodes, knowing they will always be available, while batch workloads and non-critical, less important workloads can run on the low-energy, lower-cost nodes. That way you save cost.

The next topic is the Rust implementation. I don't have to say much about why we picked Rust for carbon-aware computing: it has memory safety without garbage collection, predictable performance, strong concurrency, a low resource footprint, and compile-time guarantees. There are lots of benefits to using Rust.
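To make the workload-analyzer idea concrete, here is a toy Rust sketch. The class names, the classification heuristic, and the energy constant are placeholders for what the talk describes as ML-driven classification and estimation.

```rust
// A toy stand-in for the workload analyzer: classify a workload and attach
// a rough energy estimate. In the talk's framework the class and estimate
// come from ML models; the heuristic and constant here are made up.
#[derive(Debug, Clone, Copy, PartialEq)]
enum WorkloadClass {
    Critical,  // strict SLA: schedule for availability, not carbon
    Stateless, // movable between nodes at low cost
    Batch,     // deferrable to a low-carbon window
}

#[derive(Debug)]
struct WorkloadProfile {
    class: WorkloadClass,
    est_energy_kwh: f64,
}

fn analyze(has_strict_sla: bool, is_stateless: bool, avg_cpu_millis: u32) -> WorkloadProfile {
    let class = if has_strict_sla {
        WorkloadClass::Critical
    } else if is_stateless {
        WorkloadClass::Stateless
    } else {
        WorkloadClass::Batch
    };
    // Crude placeholder: assume energy scales linearly with CPU request.
    WorkloadProfile { class, est_energy_kwh: f64::from(avg_cpu_millis) * 2e-6 }
}

fn main() {
    println!("{:?}", analyze(false, false, 500)); // classified as a batch job
}
```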
Next are the machine learning models for energy prediction: what kinds of models actually predict the energy consumption of workloads? The first is gradient-boosted decision trees, using the gradient boosting algorithm. The next is RNNs, recurrent neural networks: you can use them to analyze the temporal patterns inside the workloads to predict future energy consumption trends, how much energy is needed, and so on. The last one is reinforcement learning, which continuously takes feedback loops and accordingly tries to improve toward carbon reduction and lower energy consumption. So RL is also included here.

The next topic is the carbon-aware scheduling algorithms, the algorithms that together make up the core of the system. The first is carbon data integration: you integrate the carbon intensity APIs or data sources so that you understand where the energy is coming from. The next is workload classification: you can classify workloads as batch workloads, regular non-critical workloads, or stateless workloads; some workloads get triggered based on API calls, so you can classify them accordingly. The next is temporal optimization: once the different workloads are identified, they are potentially rescheduled to execute during periods of low carbon intensity or higher renewable-energy availability. Then there is spatial optimization: non-critical, deferrable workloads are assigned to the nodes where the lowest carbon emissions take place, or to nodes in the higher renewable-energy category, say wind energy or solar energy. And finally resource efficiency: this definitely helps improve how the resources on each node are used, which improves availability, eventually improves cost, and reduces energy consumption.

These five algorithms are really the core. You integrate with the carbon-aware APIs, you classify the workloads, and you understand how much energy your workloads need, which is done with the machine learning models' predictions. Based on that, the scheduler we have defined automatically routes workloads onto specific node types. Suppose I have three or four node types: this node's energy source is wind, this one solar, this one coal, this one hydro, different sources of energy. The scheduler is intelligent enough to understand, based on priority: this workload has to go here; this one is critical, it has to go there. That is how the carbon-aware scheduling algorithms are defined.

Coming to the next part, integration with the Kubernetes ecosystem: as usual we have a scheduler, a metrics server, custom resource definitions, and a Kubernetes operator. All of these are standard Kubernetes entities, but with a bit more enhancement, and you can implement and plug them into your existing Kubernetes systems.
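Here is a minimal Rust sketch of the spatial-optimization step described above: a node picker that sends critical pods to the node with the most headroom and everything else to the cleanest grid. The types and the scoring rule are assumptions, not the framework's actual implementation.

```rust
// A minimal sketch of spatial optimization: among nodes with room for the
// pod, critical pods take the most headroom, everything else takes the
// cleanest grid. Struct and field names are illustrative.
struct NodeInfo {
    name: String,
    carbon_gco2_per_kwh: f64, // fed by the carbon API / metrics collector
    free_cpu_millis: u32,
}

fn pick_node<'a>(nodes: &'a [NodeInfo], needed_cpu: u32, critical: bool) -> Option<&'a NodeInfo> {
    let fits = nodes.iter().filter(|n| n.free_cpu_millis >= needed_cpu);
    if critical {
        fits.max_by_key(|n| n.free_cpu_millis)
    } else {
        fits.min_by(|a, b| a.carbon_gco2_per_kwh.total_cmp(&b.carbon_gco2_per_kwh))
    }
}

fn main() {
    let nodes = [
        NodeInfo { name: "wind-node".into(), carbon_gco2_per_kwh: 50.0, free_cpu_millis: 2_000 },
        NodeInfo { name: "coal-node".into(), carbon_gco2_per_kwh: 820.0, free_cpu_millis: 8_000 },
    ];
    if let Some(n) = pick_node(&nodes, 1_000, false) {
        println!("batch pod -> {}", n.name); // wind-node: cleanest node that fits
    }
}
```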
Coming to the next slide, it is about the deployment and implementation strategy, which is the key critical part whenever you are trying to implement these techniques in your own environments. The first phase is to pick the non-critical workloads. Begin with batch processing jobs or non-time-sensitive workloads that can be easily shifted to low-carbon periods. You can implement this through your CI/CD pipelines; if there are any data processing jobs, run them this way and collect baseline metrics, so that you understand how much shifting toward carbon-aware scheduling you are actually able to do.

The next phase is production monitoring: extend to the production workloads as well, but in monitoring-only mode, so that you are not directly enabling carbon-aware scheduling. You just run it in monitoring mode and analyze what energy each particular workload needs; once you integrate with the carbon sources, you at least understand the limitations, what kind of scale is needed, and which workloads have to go where. You analyze all of that and collect the data and metrics in this second phase.

In the third phase you actually schedule the workloads: you enable carbon-aware scheduling for some stateless services, for applications where there is still a little bit of leeway with the SLAs, which are not very critical. You can configure these stateless applications with carbon preferences, implement canary deployments or rolling deployments, and then monitor how they are performing: is there any breakage, are there any interruptions occurring with the workloads while they are being scheduled? You monitor all of that.

The fourth phase is full implementation. Only after you have collected the metrics from both non-production and production environments can you gather them together, sit with your teams, discuss, and then go ahead with the full production phase. If this is implemented, the projection is a 40% carbon reduction, considering you will be moving to renewable energy sources. There will be energy savings, since you are not scheduling too many nodes for a simple task but going in a controlled fashion. And there will be a cost reduction as well, around 15%, across all areas.

As a case study: one of the global financial companies, and Google as well, are already starting such carbon-aware scheduling processes with Rust in the backend, trying to move non-critical workloads to the higher renewable-energy resources to save some costs. They are even okay with small interruptions; I am sure that will not happen, but even if it does, they are okay with it, and they are trying to improve the implementation as much as possible and to open source it to other companies as well. So this is an open-source ecosystem: the framework, the algorithms, and everything have been developed, and people from all over are encouraged to contribute to it.
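Going back to the second phase, here is a tiny Rust sketch of what a monitoring-only switch could look like: the scheduler computes its carbon-aware decision but only logs it, leaving the default kube-scheduler in charge. The enum, names, and println-based logging are illustrative, not the framework's actual API.

```rust
// A tiny sketch of phase two: in MonitorOnly mode the carbon-aware
// scheduler only records what it *would* do, while the default
// kube-scheduler stays in charge. Names are illustrative.
#[derive(Clone, Copy)]
enum Mode {
    MonitorOnly, // phase two: log shadow decisions, change nothing
    Enforce,     // phase three onward: actually bind pods
}

fn apply_decision(pod: &str, node: &str, mode: Mode) {
    match mode {
        Mode::MonitorOnly => println!("[shadow] would bind {pod} -> {node}"),
        Mode::Enforce => {
            // A real implementation would call the Kubernetes binding API here.
            println!("binding {pod} -> {node}");
        }
    }
}

fn main() {
    apply_decision("nightly-report-42", "wind-node", Mode::MonitorOnly);
}
```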
Then there are the future research directions. You can have hardware-level integration, and edge computing adaptations: you can implement this on all your edge computing devices, mobile phones, or, if not mobile phones, any devices located on-site inside factories. Industry-specific models can also be developed for all the big large-scale industries. And there is the global policy framework: the EU, the US, and different countries around the world have not yet formulated proper policy; that is still in progress.

So you can get started today if you are interested. First, understand the carbon footprint data: how to see it, how to integrate with those APIs, and how the consumption patterns occur. The second step is to implement the non-destructive components: start with observability and analysis tools, so that you will be able to understand the impact in your environment, how you would actually benefit from integrating them, and how you can create sustainability. The third step is to try carbon-aware scheduling as a pilot program (see the sketch below): work it on your test cluster, gather the metrics, and see whether it is helping or not. Then you can scale it across your organization to the different teams.
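As one concrete shape such a pilot could take, here is a small Rust sketch of the temporal-optimization idea from earlier: defer a batch job until the forecast grid intensity improves or the deadline approaches. The forecast format, threshold, and decision rule are assumptions, not the talk's implementation.

```rust
/// A small sketch of temporal optimization for a pilot: defer a batch job
/// unless the grid is already clean enough, or no cleaner hour exists
/// before the deadline. `forecast[i]` = predicted gCO2/kWh in i hours.
fn should_run_now(forecast: &[f64], threshold_gco2: f64, hours_until_deadline: usize) -> bool {
    let now = match forecast.first() {
        Some(&v) => v,
        None => return true, // no forecast data: fail open and run
    };
    let horizon = forecast.len().min(hours_until_deadline.max(1));
    // Run if we're under the threshold, or if waiting can't improve things.
    now <= threshold_gco2 || !forecast[1..horizon].iter().any(|&v| v < now)
}

fn main() {
    let forecast = [420.0, 390.0, 180.0, 210.0]; // intensity drops overnight
    // Hour 2 is cleaner and within the deadline, so the job is deferred.
    println!("run now? {}", should_run_now(&forecast, 200.0, 4));
}
```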
All in all, I can say that this carbon-aware scheduling is definitely a benefit now and in the future, and we are saving our Earth for the future generations. Yep. Thank you.

Bandhavi Sakhamuri

Senior Site Reliability Engineer @ Solarwinds
