Conf42 Cloud Native 2025 - Online

- premiere 5PM GMT

Cloud-Native ML Infrastructure: Building Resilient Apache Spark Clusters on Kubernetes for AI Workloads


Abstract

The convergence of AI/ML workloads with cloud-native infrastructure presents unique challenges in scalability, resource utilization, and operational complexity. This talk demonstrates how to architect production-grade Apache Spark clusters on Kubernetes that specifically cater to the demands of modern AI/ML applications while adhering to cloud-native principles.

Key Topics

  • Implementing cloud-native patterns for Spark on Kubernetes: statelessness, declarative configurations, and immutable infrastructure
  • Designing resilient architectures for ML workloads using Kubernetes operators and custom controllers
  • Container optimization strategies for Spark executors running AI/ML workloads
  • Network optimization and storage patterns for large-scale data processing
  • Implementing GitOps workflows for Spark-based ML infrastructure
  • Monitoring and observability solutions for distributed ML training

Attendees will gain practical insights into building cloud-native data infrastructure that scales effectively for AI/ML workloads, with real-world examples of Kubernetes configurations, deployment patterns, and operational best practices.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Welcome everyone to our presentation on cloud-native ML infrastructure. Today we will be exploring how to build resilient Apache Spark clusters on Kubernetes specifically designed for AI and ML workloads. I'm Anant Kumar, and I'm excited to share insights on addressing the unique challenges of scalability, resource utilization, and operational complexity in this domain. Let's begin.

Let's begin by looking at the evolution of ML infrastructure. Historically, ML workloads ran on premises, requiring significant upfront investment in hardware and ongoing maintenance costs. This approach limited scalability and flexibility. The shift to cloud-native infrastructure has revolutionized how we handle ML workloads, offering greater scalability, flexibility, and cost efficiency through pay-as-you-go models and dynamic resource allocation. This evolution has enabled organizations to experiment more freely and deploy ML solutions faster.

Despite these advances, ML infrastructure still faces significant challenges. First, scalability remains difficult as ML models and datasets grow increasingly large and complex. Second, resource utilization is often inefficient, with compute resources frequently over-provisioned or underutilized. And finally, operational complexity increases as organizations manage distributed systems spanning multiple environments and technologies. These challenges are precisely what we will address with our Spark on Kubernetes approach.

Apache Spark has become a cornerstone technology for ML workloads for several compelling reasons. Its distributed processing capabilities allow it to handle massive datasets across clusters of machines, which is essential for training modern ML models. Its in-memory computing provides speed advantages that are critical for iterative ML algorithms. Additionally, Spark offers pre-built libraries for common ML tasks through MLlib, making implementation more straightforward and reducing development time.

Kubernetes has emerged as the leading container orchestration platform, offering key capabilities that complement Spark for ML workloads. Its container orchestration simplifies deployment and management of Spark applications. Its sophisticated resource management ensures efficient allocation and utilization of compute resources. Perhaps most importantly, its autoscaling capabilities allow dynamic adjustment of resources based on workload demands, optimizing both performance and cost.

A comprehensive cloud-native ML architecture consists of four essential components. First, data storage solutions, including data lakes, object storage, and databases, that provide the foundation for managing large volumes of data. Second, compute resources like Kubernetes clusters, virtual machines, and GPUs that power the processing. Third, ML frameworks such as Spark, TensorFlow, and PyTorch that provide the tools for building and training models. And finally, model serving platforms that deploy and serve trained models to end users or applications.

Now, let's examine how Spark operates within a Kubernetes environment. Spark applications are deployed as Kubernetes pods, with each worker running in its own pod. These workers communicate with each other and with the master node to coordinate processing tasks. Think of it this way: the master node is the driver, and the workers are essentially the executors. Data is stored and accessed through persistent volumes, ensuring data remains available even if pods are rescheduled or restarted. This architecture combines Spark's processing power with Kubernetes' orchestration capabilities.
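To make the persistent-volume idea concrete, here is a minimal sketch of a PersistentVolumeClaim that Spark driver and executor pods could mount for shared training data. The names, namespace, storage class, and size are illustrative assumptions, not values from the talk.

```yaml
# Hypothetical claim for shared training data; adjust names and sizes to your cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: spark-training-data
  namespace: ml-workloads
spec:
  accessModes:
    - ReadWriteMany            # many executor pods read the same dataset
  storageClassName: standard   # replace with a storage class available in your cluster
  resources:
    requests:
      storage: 500Gi

# The claim can then be mounted into executor pods through Spark's Kubernetes
# volume properties, e.g. passed to spark-submit as --conf options:
#   spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=spark-training-data
#   spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/data
```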
Effective resource management is crucial for Spark on Kubernetes. We implement it through resource requests and limits, defining the minimum resources each Spark pod requires and the maximum it can consume. Resource quotas set boundaries on resource consumption for namespaces, preventing any single application from monopolizing cluster resources. Node affinity and taints allow us to assign Spark pods to specific nodes based on their requirements, such as placing GPU-intensive tasks on nodes with the appropriate hardware.

The Spark Operator simplifies Spark application management on Kubernetes. Installation involves deploying the operator into your Kubernetes cluster, which can be done via Helm charts. Configuration is simple: it requires setting up the operator with appropriate cluster settings and standard Kubernetes resource limits. Once deployed, the operator enables declarative application deployment through custom resource definitions, often abbreviated as CRDs, streamlining the entire process of running Spark applications. You can manage and orchestrate Spark applications through the Spark Operator, and all the CRUD operations are supported: you can create, read, update, and delete Spark applications.

Dynamic resource allocation is a key advantage of our approach. It involves three critical elements: resource monitoring to track Spark resource consumption in real time, dynamic scaling to adjust resources based on workload demand, and resource optimization to ensure efficient utilization. Together, these elements create a responsive system that can scale up during peak processing needs and scale down when demand decreases. This also helps with bursty traffic: resources scale up during the burst and settle back down once the workload stabilizes.

Autoscaling implementation focuses on two dimensions. Horizontal scaling adds or removes Spark worker nodes based on processing demands, allowing the cluster to expand or contract. Horizontal scaling is generally the preferred method because you can use general-purpose instances that scale out in large numbers; cost-wise, these instances are not that expensive, so it makes sense to use horizontal scaling wherever possible. Vertical scaling, by contrast, adjusts resources within existing nodes, such as increasing CPU or memory allocation. Both approaches can be automated based on metrics like CPU utilization, memory usage, or application-specific metrics: you set threshold values, and when the metrics exceed them, scaling is triggered automatically.
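As a concrete sketch of how resource requests, limits, and dynamic allocation come together, here is a hypothetical SparkApplication manifest, assuming the open-source Kubeflow Spark Operator and its v1beta2 CRD. The image, file paths, and sizing values are illustrative assumptions rather than values from the talk.

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: feature-engineering-job               # illustrative name
  namespace: ml-workloads
spec:
  type: Python
  mode: cluster
  image: registry.example.com/spark-ml:3.5.0  # hypothetical image with dependencies baked in
  mainApplicationFile: local:///opt/app/train_features.py
  sparkVersion: "3.5.0"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
  driver:
    cores: 1
    memory: "2g"
    serviceAccount: spark                     # needs rights to create executor pods
  executor:
    cores: 2
    instances: 4                              # starting point; dynamic allocation adjusts this
    memory: "8g"
  dynamicAllocation:
    enabled: true
    initialExecutors: 4
    minExecutors: 2
    maxExecutors: 20                          # upper bound for scale-out during bursts
```

Applying a manifest like this creates the driver pod, which in turn requests executor pods; the application can then be read, updated, or deleted like any other Kubernetes resource.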
Effective data management requires understanding the various storage options. This slide visually represents the diverse storage solutions available, from cloud object storage for raw data to specialized databases for structured information. The right storage strategy depends on your specific workload characteristics, access patterns, and latency requirements, and it can range from blob storage to databases to a data lake.

Data persistence is essential for ML workloads. We implement it through persistent volumes that store data beyond pod lifecycles, ensuring data survives container restarts. We know that in Kubernetes the pod lifecycle is ephemeral: at any time, for any reason, a pod can be evicted, so it's important to have persistent data. StatefulSets manage stateful applications with persistent identifiers and stable storage, which is critical for maintaining state. And data replication across multiple nodes provides redundancy, ensuring availability even if individual nodes fail or get evicted.

A robust monitoring infrastructure is vital for production systems. I haven't seen any production system without good monitoring and visibility infrastructure, so this is a critical component and we must hold it to a high standard. We recommend implementing monitoring tools like Prometheus and Grafana for metrics collection and visualization. You can run a sidecar container next to your main application container that is responsible for emitting metrics in the Prometheus format, which can then be scraped or forwarded to external exporters. Logging solutions such as Fluentd and Elasticsearch centralize logs for easier troubleshooting. Tracing systems like Jaeger and Zipkin provide insights into request flows across distributed systems. Put together, these tools offer comprehensive visibility into your infrastructure. You can wire these metrics into PagerDuty alerts, so you know when production systems degrade or fail and can intervene to bring them back without impacting customer workflows.

Performance optimization requires a multifaceted approach. Data partitioning optimizes distribution across workers, reducing data movement and improving processing efficiency. You can also use an external shuffle service, which offloads shuffle data from the executors and improves efficiency across the cluster. Data caching keeps frequently accessed data in memory, reducing read latency. And code optimization improves application efficiency through techniques like predicate pushdown, one of the key features of Spark, where Spark pushes the predicate, or you could say the filter, down to the source. For example, if a SQL query manipulates data from an Avro source file, Spark will push the filter down to the Avro reader instead of reading all the records from the file. On top of that, you can gain efficiency through broadcast joins and query optimization as well: Spark provides Catalyst, which converts the logical query plan into an optimized logical query plan.

Security must be a priority in any ML infrastructure. This chart illustrates the balanced approach required across multiple security domains: network policies to control pod communications, data encryption for protecting sensitive information, authentication mechanisms to verify identities, and vulnerability scanning to proactively identify weaknesses.

Implementing proper authentication and authorization involves several mechanisms. Role-based access control, which is native to Kubernetes, restricts access based on user roles, applying the principle of least privilege. OAuth integration leverages external identity providers for streamlined authentication. Certificate-based authentication uses digital certificates to secure communication between components. You can also use mutual TLS (mTLS) between internal service endpoints for extra protection.
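As a minimal sketch of least-privilege RBAC, here is a hypothetical service account for the Spark driver along with a namespaced Role and RoleBinding. The exact resources and verbs your driver needs may differ; the names here are assumptions for illustration.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: ml-workloads
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-driver
  namespace: ml-workloads
rules:
  # The driver creates and tears down executor pods in its own namespace only.
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "get", "list", "watch", "delete"]
  - apiGroups: [""]
    resources: ["services", "configmaps", "persistentvolumeclaims"]
    verbs: ["create", "get", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-driver-binding
  namespace: ml-workloads
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: ml-workloads
roleRef:
  kind: Role
  name: spark-driver
  apiGroup: rbac.authorization.k8s.io
```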
Network security and data protection work in tandem. Network segmentation isolates Spark workloads from other applications, minimizing the attack surface. Data encryption, both at rest and in transit, ensures sensitive data remains protected even if other security measures are compromised.

Cost management is increasingly important as ML workloads scale, and it is paramount when these workloads run in the cloud. We have all heard stories where EC2 instances were left running for long periods, incurring unnecessary cost. We can choose spot instances, and we can find out at what time of day and in which zones spot instances are cheaper, to improve cost efficiency. We should reserve this for non-critical workloads, because spot instances have a tendency to be reclaimed. Resource optimization fine-tunes allocation to reduce waste, ensuring you are not over-provisioning. Idle resource management scales down or terminates resources when they are not in use, avoiding unnecessary expenses.

High availability ensures your ML infrastructure remains operational despite failures. This requires multiple master nodes for redundancy, eliminating single points of failure. Multiple master nodes are also great if you have a larger cluster: they work in tandem to share the workload, making the Kubernetes API server highly available. On the worker side, redundant Spark workers provide fault tolerance for processing tasks. A failed or evicted worker pod can be restarted, and in many cases those tasks succeed when retried, giving the application self-healing properties. Data replication across multiple nodes ensures data remains accessible even if individual nodes fail.

Disaster recovery planning prepares for worst-case scenarios. Data backup and recovery procedures preserve critical data and enable restoration when needed. Replication across multiple regions protects against regional outages. Failover mechanisms automatically redirect operations to functioning resources, minimizing downtime during failures.

Implementing CI/CD streamlines development and deployment. The build stage compiles and packages Spark applications with their dependencies. There are a lot of build tools available in the market; Maven is used predominantly, but for monorepo setups Bazel is also gaining traction. Testing runs automated tests to verify functionality and performance. Deployment pushes applications to the Kubernetes cluster using automated processes, and integration tests are an essential part of the CI/CD pipeline before we deploy the Docker image. Monitoring tracks application health and performance post-deployment.

The value of ML models lies in their application. We need effective ways to deploy trained models from Spark to serving platforms like TensorFlow Serving, KFServing, or custom solutions. Out of all the AI frameworks available in the market, for example TensorFlow, PyTorch, or JAX, TensorFlow turns out to be a favorite for many production deployment scenarios. It provides high-level APIs like Keras, which simplify model development, and there is a lite version of the framework available for mobile applications. These platforms then serve models for real-time inference and predictions, bridging the gap between model development and production use.
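To illustrate the serving side, here is a minimal, hypothetical sketch of a TensorFlow Serving Deployment on Kubernetes that reads exported models from a persistent volume. The namespace, model name, claim name, and resource figures are assumptions, not values from the talk.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-serving
  namespace: ml-serving
spec:
  replicas: 2                                 # redundancy for the inference endpoint
  selector:
    matchLabels:
      app: tf-serving
  template:
    metadata:
      labels:
        app: tf-serving
    spec:
      containers:
        - name: tensorflow-serving
          image: tensorflow/serving:2.14.0    # pin to a tag you have validated
          args:
            - --model_name=recommender        # hypothetical model name
            - --model_base_path=/models/recommender
          ports:
            - containerPort: 8501             # REST
            - containerPort: 8500             # gRPC
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
            limits:
              cpu: "2"
              memory: 4Gi
          volumeMounts:
            - name: model-store
              mountPath: /models
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: exported-models        # e.g. written by the Spark pipeline
```

A Service or ingress in front of this Deployment would then expose the REST and gRPC ports to client applications.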
Dependency management becomes crucial as projects grow. Tools like Maven, SBT, or Gradle help manage project dependencies efficiently. Containerization via Docker packages applications with their dependencies, ensuring consistency across lower and higher environments and simplifying deployment.

Effective troubleshooting requires multiple approaches. Log analysis examines Spark and Kubernetes logs to identify errors and their root causes. There are many log analysis tools available that can collect and aggregate logs from all the containers running in a pod, including sidecar containers as well as init containers. Debugging tools help inspect code execution and variable values. Kubernetes monitoring tools provide insight into pod status, resource utilization, and cluster health; there are many such tools, including open-source projects you can leverage to set up monitoring across your Kubernetes cluster.

Continuous performance improvement requires systematic measurement. This involves measuring performance metrics under various workloads to establish baselines, identifying bottlenecks and areas for improvement to direct optimization efforts, and validating performance after implementing changes to confirm their effectiveness.

Let's examine a real-world case study involving a large-scale ML pipeline. The challenge was building a pipeline capable of processing terabytes of data efficiently. The solution leveraged Spark on Kubernetes to provide distributed processing and dynamic scalability, resulting in significant performance improvements and cost reduction. It was paired with a catalog service, for example Hive Metastore, and used the open-source Apache Iceberg table format to efficiently store and manage tables, with the data itself stored in S3. Spark SQL, Hive Metastore, and Iceberg together provide a comprehensive and compelling data lake solution that can store terabytes of data efficiently and also supports schema evolution in case the business needs to change the schema of a table. Also, with the help of Iceberg, we can do time travel, which can be required for compliance reasons.

From our experience, we have distilled key best practices. Start small with a minimal viable cluster and gradually scale as you gain experience. Automate deployment, scaling, and monitoring processes to reduce manual intervention. Continuously and periodically optimize resource utilization and do performance benchmarking based on metrics and evolving workload patterns.

Looking ahead, several trends will shape the future of cloud-native AI and ML infrastructure. Serverless computing for ML workloads will reduce infrastructure management overhead. Edge computing will bring AI and ML capabilities closer to data sources for real-time applications. AI-powered infrastructure management will automate optimization and scaling decisions, creating self-tuning systems. The AI and ML space itself is moving very fast, and there are a lot of community-built tools, datasets, and models available on websites such as Hugging Face, where people collaborate and build on each other's work to take the field forward.

This concludes our presentation on building resilient Apache Spark clusters on Kubernetes for AI workloads. We have covered key concepts, best practices, and practical implementation strategies. I encourage you to explore the additional resources
and continue learning about this rapidly evolving field. Thank you for your time. I hope this presentation was helpful and that you will be able to embark on your own AI and ML infrastructure journey. Thank you for attending.
...

Anant Kumar

Tech Lead @ Salesforce



