Conf42 MLOps 2025 - Online

- premiere 5PM GMT

Embracing Sky Computing: Charting a New Course in AI Cloud Infrastructure

Abstract

Unlock the future of AI product development with Sky Computing. Learn how seamless, intelligent cloud orchestration can revolutionize scalability, cost management, and compliance, paving the way for smarter, more resilient AI in production.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. My name is Brian Irish. I'm an engineering lead at Ditto, and I'll be talking today about sky computing: charting a new course in AI cloud infrastructure. First of all, apologies: this is my first time using OBS, so if there are some technical glitches, I apologize for that. But let's get started.

As I said, I'm an engineering lead at Ditto. I've spent the last seven or eight years as an SRE, running production workloads at SuperOrbital, Raincoat, and Wayfair, and I have two total decades of experience going back to 2005.

I wanted to talk first about what the cloud has meant to us and the promises it has made over the course of its lifetime. We've been promised on-demand scalability, pay-as-you-go pricing models, and near-instant global reach, and the providers have kept those promises. I say "they"; it's really a concept. But it's not without its drawbacks, and I want to draw attention to some of them. Vendor lock-in: we all know how difficult it can be, in practice, to be cloud agnostic. Data gravity is another concern, and I'll talk about that more later. And we all know egress charges can be a huge shock in your AWS bill, for instance. How expensive hosting in the cloud can be is very eye-opening for a lot of organizations out in the world.

I think this is what has led to a lot of repatriation. What do I mean by that? There's a quote from an article I read: 42% of organizations surveyed in the United States are considering or have already moved at least half of their cloud-based workloads back to on-premise infrastructures, a phenomenon known as cloud repatriation. I thought that was very interesting and wanted to make a note of it. While I applaud what these companies are doing from a pragmatic viewpoint on their infrastructure costs, it does raise a question: did these companies meticulously plan their cloud usage from the start with cost optimization in mind, or are they really just reacting to the unexpectedly high AWS bill they got at the end of the month? I think most evidence points to the latter. Companies that are now repatriating their infrastructure often failed to implement basic cloud cost controls from day one, like automatically shutting down dev environments during off hours, using spot instances for batch workloads, or implementing resource quotas (see the sketch below for one example). So rather than fixing those foundational issues, they've chosen to abandon the cloud completely. That's their decision, but I think taking a step back to on-prem might not truly be a step forward in the long run.
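To make one of those day-one controls concrete, here is a minimal sketch of an off-hours shutdown for dev environments. It is not from the talk; it assumes the dev workloads run as Deployments in a Kubernetes `dev` namespace and that a `dev-scaler` service account with permission to scale Deployments exists (both hypothetical):

```yaml
# Hypothetical example: a Kubernetes CronJob that scales every Deployment
# in the dev namespace to zero replicas at 19:00 on weekdays.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dev-shutdown
  namespace: dev
spec:
  schedule: "0 19 * * 1-5"            # 7 PM, Monday through Friday
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: dev-scaler   # assumed to have RBAC to scale Deployments
          containers:
            - name: scale-down
              image: bitnami/kubectl:1.30
              command:
                - kubectl
                - scale
                - deployment
                - --all
                - --replicas=0
                - --namespace=dev
          restartPolicy: OnFailure
```

A matching morning CronJob scales everything back up. The point is that controls like this are cheap to put in place compared to a full repatriation.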
If we remain in the cloud, though, how do we address the fundamental challenges of vendor lock-in, data gravity, and rising costs in a transformative way, rather than just applying incremental fixes? This is where I want to start talking about the concept of sky computing.

Sky computing has been around since 2021, when Ion Stoica (and I apologize if I've mispronounced your name) published his very notable paper. I very much encourage everyone to go read it; it's not a very long read, I believe 20 or so pages, and it really gets to the heart of the problem and what he believes the solution is. I think he's spot on. I'll be talking more about it in the upcoming slides, but just as a general overview: workloads flow seamlessly between cloud providers, so they are free from vendor lock-in and inefficiencies, and you as the operator have much better control over data governance. These are very big issues, especially for enterprises.

So, the three pillars. These are pillars I've coined myself; they're not official or written in Ion's paper at all, but they're what I've distilled it down to.

Pillar one is abstraction. At the heart of sky computing lies abstraction, which serves as the glue that unifies disparate cloud platforms through a compatibility layer. Leveraging tools like Kubernetes and Ray, and standardized APIs like S3's, sky computing hides the complexities of individual clouds. This means you won't need to worry about whether your data sits in AWS S3 or Azure Blob Storage; the abstraction layer handles those details, deciding on the optimal storage based on cost and performance. By providing a write-once, run-anywhere experience, abstraction eliminates vendor lock-in and simplifies application deployment. As more cloud platforms become supported, the abstraction layer offers greater flexibility, enabling smoother transitions and substantial cost savings without requiring you to change how you develop or how you manage your workloads. Even beyond the hyperscale cloud providers of today, like AWS, GCP, and Azure, you'll see support for neocloud providers like Lambda Labs, CoreWeave, and others, providing even greater operational flexibility.
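To sketch what that abstraction looks like in practice, here is a hypothetical task definition in the style of SkyPilot, the broker framework introduced shortly. The bucket name and script are placeholders, not from the talk:

```yaml
# Hypothetical sketch of a cloud-agnostic task, SkyPilot-style.
# No cloud is pinned, so the layer underneath is free to place the job
# on whichever supported provider is cheapest and available.
resources:
  accelerators: A100:1     # "I need one A100", not "I need this EC2 instance type"

file_mounts:
  /data: s3://my-training-data   # placeholder bucket; could equally be gs:// or r2://

run: |
  python train.py --data-dir /data
```

The job asks for capabilities, not for a specific provider's instance types; that is the write-once, run-anywhere property.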
Pillar two is automation. Building on the foundation of abstraction, automation is the next pillar driving sky computing forward. Intercloud brokers embody this pillar by serving as intelligent decision makers that manage workload placement, cost optimization, and compliance across multiple clouds. These brokers continuously analyze factors like pricing, resource availability, and regulatory requirements; using AI and real-time data, they automatically route or adjust workloads based on current conditions, ensuring your applications always run in the most cost-effective and efficient environment. This removes the manual overhead of juggling multiple cloud providers, reduces the chance of human error, and lets you focus on higher-level tasks while the system optimizes operations behind the scenes. This is perhaps the largest pillar, by the way, and the hardest to get right.

The third and last pillar I want to talk about is agility. Agility is about creating a responsive and flexible cloud ecosystem where data and workloads can move freely. Reciprocal peering agreements are key to this agility, and you'll see me mention them a couple of times in this presentation. These agreements are collaborations between cloud providers that allow for free, low-cost, or discounted data transfers, breaking down barriers such as egress fees and data gravity. As these agreements take shape, often organically, driven by hyperscale providers keen to support popular brokers serving the automation layer, workloads can move seamlessly between clouds. This dynamic environment empowers businesses to adapt quickly to shifting costs, regulatory changes, or performance requirements without being locked into a single provider. Crucially, this level of agility opens up the cloud ecosystem in ways never seen before, by tearing down silos and encouraging collaboration between providers.

Sky computing creates an interconnected landscape where innovation thrives. Smaller, more specialized neocloud providers, like Lambda Labs or RunPod, gain a seat at the table, fostering competition and driving breakthroughs in service offerings. Enterprises can mix and match services from various providers without fear, leveraging the best features from each platform to suit their needs. The result is an agile infrastructure that can pivot on demand, offering both resilience and the flexibility to innovate and disrupt rather than just iterate. This unprecedented openness not only breaks the barriers of vendor lock-in, but also sparks a whole new era of creativity and efficiency across the entire cloud industry, fundamentally changing how businesses harness cloud technology.

SkyPilot is the inaugural broker framework, encompassing the automation pillar I just spoke about, the central pillar. It was developed by the UC Berkeley Sky Computing Lab, the creators of Spark and Ray, and also, by the way, the academic home of Ion Stoica, author of that paper. It is an open-source intercloud broker framework implementing sky computing principles, with over a million downloads as of July 2025. It's on version 0.10, including enterprise features, and it offers a unified interface to 17 clouds, plus Kubernetes, plus on-prem. Key capabilities: three to six times cost savings through spot orchestration alone; 4x faster provisioning, where you can get 200 GPUs in less than 90 seconds; 9.6x faster checkpointing with MOUNT_CACHED; zero code changes for existing ML workloads; and one YAML file plus one command, `sky launch`, to run on any cloud.

Now, what kind of impact does this have? This is MLOps, after all: what kind of impact does this have on ML workloads? I want to reference a couple of case studies written by these companies, available on the SkyPilot website, which I'll link to shortly. Two case studies were of particular interest. Abridge was a Slurm shop running their AI infrastructure on Slurm, and I found a quote in their case study from one of their research scientists that I just had to include in this presentation; I thought it was pretty great. And Covalent as well: they built a whole new AI stack, really combating the GPU scarcity that I'm sure everyone is aware of these days. You can find these on SkyPilot's website, which I believe is skypilot.co.

I want to go into more detail on the Abridge case study and bring some things from it into this presentation that I thought were very interesting. In their words, not mine, SkyPilot delivered "the familiar experience our researchers wanted with the reliability our production workloads required." For interactive development, you can run `sky launch --gpus H100:4`, which provides immediate SSH access to a GPU-enabled shell without complex setup, just like `srun --gres=gpu:4 --pty bash` in Slurm, but it works seamlessly across all of their infrastructure. For Jupyter Notebook hosting, they can spin up Jupyter notebooks directly on GPU clusters, enabling researchers to prototype with high-end hardware that wasn't available locally.
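As a sketch of what that notebook hosting can look like (this YAML is hypothetical, not from Abridge's case study; the port and package choices are placeholders):

```yaml
# Hypothetical SkyPilot task: serve JupyterLab from a cloud GPU node.
resources:
  accelerators: H100:1

setup: |
  pip install jupyterlab

run: |
  jupyter lab --ip 0.0.0.0 --port 8888 --no-browser
```

Launch it with `sky launch -c notebook notebook.yaml`, then tunnel with `ssh -L 8888:localhost:8888 notebook` (SkyPilot adds an SSH alias for each cluster name) and open the notebook locally.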
SkyPilot's managed jobs provide the same convenience as Slurm's job scheduler, but work across all of their infrastructure: automatic restart on job failures, strong isolation, and reliable job management for long-running training jobs. And for model evals, quick model evaluation becomes simple: they can deploy models as FastAPI services in minutes for testing. Unlike Slurm, which lacks native API endpoint support, SkyPilot makes it easy to expose models as services.

One example from their case study is the entire YAML file for running distributed training on SkyPilot, and it's not complex at all. Researchers can use pure PyTorch and Hugging Face without wrappers or additional abstraction layers. SkyPilot seamlessly sets up multiple nodes, populates environment variables with the cluster topology information (number of nodes, GPUs, IP addresses), and then kicks off the jobs. It's a huge win, since machine learning engineers don't need to struggle with scaffolding for distributed training, and it works with any package manager.
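The exact file from the slide isn't reproduced in this transcript, but a minimal SkyPilot distributed-training task in that spirit might look like the following; the script name, node count, and accelerator choice are all assumptions:

```yaml
# Hypothetical sketch of multi-node PyTorch training on SkyPilot.
# SkyPilot populates the SKYPILOT_* environment variables with the
# cluster topology, so torchrun can be wired up without extra scaffolding.
resources:
  accelerators: H100:8

num_nodes: 2

setup: |
  pip install torch

run: |
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc-per-node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node-rank=$SKYPILOT_NODE_RANK \
    --master-addr=$MASTER_ADDR \
    --master-port=8008 \
    train.py
```

One `sky launch` of this file provisions both nodes, runs the same `run` block on each with the right rank, and tears everything down afterward if you ask it to.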
So when it comes to AI and ML workloads, different stages of a pipeline may benefit from different cloud providers' specialties, and this is where sky computing becomes really interesting. With an intercloud broker like SkyPilot, you can split your pipeline: run model training on Google Cloud with TPU-optimized instances for deep learning, then run inference on AWS with their Inferentia chips for lower latency, and run your data pre-processing on Azure. Why would you do this? For one, the speed and efficiency of the machines I mentioned above. Then cost savings: Azure is oftentimes much cheaper than AWS. And regional data regulations: going back to data governance for compliance purposes, this is huge for organizations.

So what's preventing sky computing from becoming ubiquitous? These are some of the challenges that I personally foresee.

Standardization is a huge one. Universal standards across all cloud platforms are unlikely due to competitive interests and proprietary technologies, but progress can still be made by leveraging existing, widely adopted tools such as Kubernetes, Ray, and the S3 APIs. These standards don't cover every scenario, but they provide a practical bridge, allowing sky computing to move forward without waiting for complete industry-wide uniformity.

Another challenge is economic resistance. Hyperscalers will resist reciprocal peering agreements, as sharing data freely between platforms can conflict with their business models. While this resistance exists, smaller cloud providers and innovative startups have strong incentives to embrace sky computing principles: their agility and their desire to compete with larger players drive them to support the ecosystem, gradually encouraging wider adoption and putting pressure on the bigger providers to reconsider their stance.

Infrastructure inertia is a very big topic here. There are significant pre-existing investments: organizations have already invested heavily in their current cloud infrastructure, not just in cost but in expertise, tooling, and processes, which leads to hesitation about dramatic changes such as this. There's resistance to new paradigms, which will be the case every time a new one appears: reluctance to adopt something like sky computing because it lacks widespread adoption. There's the good-enough status quo: current cloud deployments often function adequately, even if they're not cost- or performance-optimized. There is daunting overhead: concerns about retraining staff, updating deployment pipelines, and refactoring applications for sky computing's abstraction layer. And there are perceived risks: apprehension about reliability and support when moving away from established cloud providers to newer services like RunPod or Lambda Labs that are not as proven as AWS; nothing is going to be as proven as AWS. Together, these factors create substantial inertia, hindering widespread sky computing adoption.

Finally, I want to speak about the challenge of legitimacy. This is maybe an understated one, but it deserves a mention here at the very least. The concept of sky computing faces challenges in establishing legitimacy, partly due to a Wikipedia entry with a warning banner questioning source reliability and noting a lack of academic citations. This stems from an incident where a commercial entity attempted to shape the narrative around sky computing through Wikipedia editing, leading to their ban from the platform. The incident highlights a broader challenge: commercial entities sometimes try to claim thought leadership for emerging technologies through questionable means, inadvertently damaging the credibility of legitimate technological advances. However, the fundamental value proposition of sky computing, providing a unified interface across cloud providers while optimizing cost, performance, and compliance, stands independent of any single company's implementation.

For final thoughts, I genuinely believe we're at a turning point in cloud infrastructure. I don't think sky computing is just another buzzword; I think it's a practical fix for real problems. It brings together different cloud services into one smooth system, making life easier for businesses and SREs who need reliable, flexible, and efficient operations. At the end of the day, this makes a sizable dent in balance sheets, and that's what matters. As more sky computing solutions emerge, tech leaders worldwide will continue to notice; they'll see the benefits and quickly move their workloads to these smarter, more open, and more cost-effective cloud setups. The future of the cloud is here, knocking at our door. It's an exciting moment to rethink how we build and manage systems that stand up to real-world demands: more resilient, more adaptable, and ready for what's next. Thank you very much.
...

Brian Irish

Engineering Lead @ Ditto



