Conf42 Cloud Native 2024 - Online

Chidori - AI/ML Cluster Management Platform

Abstract

This talk discusses the role of a Kubernetes cluster management platform in speeding up development, scaling AI infrastructure, and lowering computing costs.

Summary

  • Ahmed Gaber and Nadine Khaled talk about Chidori, an AI and ML cluster management platform. They cover how Spark operates on Kubernetes, dive into some of the challenges involved, and close with a demo.
  • Nadine Khaled from the Incorta Cloud team demonstrates how to install Chidori in your environment and put it into action. Through Chidori you can submit your Spark jobs to any Spark master, see the status of these jobs, and share them with others.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. Welcome to Conf42 Cloud Native. Today we are happy to talk about Chidori, the AI and ML cluster management platform developed by Incorta. Let me introduce myself first. I am Ahmed Gaber, but you can call me Biga. I am a cloud engineering manager at Incorta, and I have with me today Nadine Khaled, cloud engineer at Incorta. In our agenda today we will talk about how Spark operates on Kubernetes, dive into some of the challenges involved, and explain how Chidori can address these challenges. We have also prepared a good demo for you.
Let's dive into how Spark works on Kubernetes. As you see in this diagram, the client submits the Spark job to Kubernetes to run as a Spark driver. On Kubernetes we have two modes. The first is client mode, which means the driver keeps running on the client side. The second is cluster mode, which means the driver runs as a pod inside Kubernetes. Once the driver has started, it asks Kubernetes to start the executor pods, so the Kubernetes scheduler starts to allocate the executor pods inside the cluster. After these executor pods are created, the driver is notified and starts scheduling work on the executors. So as you see here, Spark benefits from Kubernetes scalability features like horizontal cluster scaling of the nodes themselves: if you have pods to be allocated and there isn't enough capacity on the nodes, the cluster scales up and adds new nodes to make room for your job. You also have resource management to ensure that the driver and executors run within your capacity.
We found some good insights with that model. First, as I said, cluster autoscaling. Cluster autoscaling gives you the flexibility to get better performance at low cost, so you don't have to keep nodes up and running all the time.
We also found that enabling dynamic allocation gives you flexibility inside the application itself to scale the executors in and out. Using spot nodes for Spark workloads also saves a lot of cost, which lets you get higher performance at low cost. We also noticed that most Spark bottlenecks come from shuffling. So to optimize your Spark job, you should attach fast local SSDs, as offered by your cloud vendor, to serve as the Spark scratch space. In summary, running Spark on Kubernetes not only optimizes resource utilization and reduces cost, but also enhances the overall performance of your Spark application.
While running Spark on Kubernetes brings us a lot of benefits, like better resource use and cost savings, it is not without its challenges. Let's dive into some of the challenges you might face. First is the naked pod. As I said before, when you submit a Spark job in cluster mode, the driver runs as a separate pod inside Kubernetes. This pod is not controlled by any replication controller, StatefulSet, or Deployment object in Kubernetes, which creates an availability issue: the pod acts as a single point of failure. If the driver goes down for any reason, your job fails completely, and Kubernetes has no controller to start the driver again. The second challenge is around the distribution of driver pods across nodes and its implications for cost. Driver pods are allocated across nodes based on resource requests and node affinities. However, as jobs conclude, you may observe some pods scattered across the nodes due to the constraints imposed by the naked pod issue. This distribution can prevent nodes from being scaled down to optimize running cost. Another challenge is startup time overhead. This time comes from two factors.
The first is when a new node is required to host a driver or an executor, so there is latency while waiting for this node to become available. The other factor is the startup time of the driver pod itself. If your job uses heavy Python libraries, you have to wait for these libraries to be installed and for the job's configuration to be applied before the pod becomes available. This also delays the start of job execution after submission.
Another challenge is the Kubernetes scheduler itself. To understand this issue we must understand how Kubernetes allocates pods to nodes. The scheduler, called kube-scheduler, watches the API server. Once the control plane receives a request to create a pod, the scheduler starts looking for available nodes to host this pod, based on the resource constraints defined in the pod definition and on node affinities. Once the scheduler finds feasible nodes to host the pod, it scores them to find the best match. If the scheduler doesn't find any feasible node, the pod remains unscheduled until the scheduler finds a matching node for it. So what is missing? Kubernetes was built for running microservices, with a scale-out architecture in mind. The default Kubernetes scheduler is not ideal for AI and ML workloads: it lacks critical high-performance scheduling components like batch scheduling, preemption, and multiple queues for efficiency. In addition, Kubernetes is missing gang scheduling for scaling parallel-processing AI workloads up to multiple distributed nodes. Also, most AI and ML jobs require an array of libraries and frameworks, including wheels, eggs, and JARs. This diversity requires a robust tracking system to ensure everything within our container image is up to date and functions as expected.
Moreover, the size of container images becomes a critical consideration. As we add more components, the images grow larger, which again slows down deployment and hurts efficiency, and on top of that you have to manage compatibility and upgrades across these versions. Another challenge of running any Spark or ML job inside Kubernetes is related to a concept called topology awareness. The scheduler mainly allocates pods based on the resource requests and pod affinities defined by the pod itself. However, the state of the node is tracked by the kubelet, another agent that runs on each node in your cluster. So, for example, if a node has disk pressure or some resource is being throttled, you need to be aware of that before you allocate a job to the node. You must also use node affinities and pod affinities together to get the best match when allocating pods to nodes with the Kubernetes scheduler. Another challenge is related to integration and monitoring. To monitor a Spark job on Kubernetes, you have to use the Spark master UI or the Spark history server. These tools are job focused: they are concerned only with tasks and stages, plus some resource monitoring of the infrastructure. What is missing is the correlation between the cluster behavior, the Kubernetes behavior, and the job, so you will find it difficult to troubleshoot some of the issues you may face. Then there is integration with third-party tools. The current way to submit any job, as we saw in the first slide, is the spark-submit command, which is a CLI command in Spark, so it is not friendly to integrate with other tools. After identifying all of these challenges, we started to build our beloved solution, Chidori.
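For reference, the kind of spark-submit invocation the talk refers to can be assembled as in the sketch below. This is an illustration only, not Chidori code: the helper name `build_spark_submit`, the API server address, the image name, and the application path are all hypothetical placeholders. The flags and configuration keys themselves (`--master k8s://...`, `--deploy-mode`, `spark.kubernetes.namespace`, `spark.kubernetes.container.image`, `spark.dynamicAllocation.enabled`, `spark.dynamicAllocation.shuffleTracking.enabled`) are standard Spark options.

```python
import shlex


def build_spark_submit(api_server, image, app, deploy_mode="cluster",
                       namespace="default", dynamic_allocation=True):
    """Return the argument list for a spark-submit run on Kubernetes.

    Hypothetical helper: every value passed in is a caller-supplied placeholder.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    conf = {
        "spark.kubernetes.namespace": namespace,
        "spark.kubernetes.container.image": image,
    }
    if dynamic_allocation:
        # On Kubernetes there is no external shuffle service by default,
        # so dynamic allocation is paired with shuffle tracking.
        conf["spark.dynamicAllocation.enabled"] = "true"
        conf["spark.dynamicAllocation.shuffleTracking.enabled"] = "true"
    args = ["spark-submit", "--master", f"k8s://{api_server}",
            "--deploy-mode", deploy_mode]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


if __name__ == "__main__":
    cmd = build_spark_submit(
        "https://kubernetes.example.com:6443",  # hypothetical API server
        "example.registry/spark:latest",        # hypothetical image
        "local:///opt/spark/app/job.py",        # hypothetical application
    )
    print(shlex.join(cmd))
```

Because the result is a plain CLI argument list, any tool that wants to submit a job has to shell out and parse text output, which is the integration friction described in the talk.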
We started building Chidori with a mindset of solving all the issues I listed in the previous slides: solving the naked pod availability issue, providing a more stable framework that can run Spark or ML jobs inside Kubernetes, providing a well-integrated REST API for third parties, and providing clearer monitoring that correlates different factors so you can troubleshoot jobs inside Kubernetes. In this diagram I will explain the high-level design of Chidori. Let us start with the Chidori server. As I said in the first slide on the issues, we have the naked pod availability issue when we run our driver inside Kubernetes. So we built Chidori around the concept of hosting Spark drivers inside Kubernetes. Chidori hosts the Spark driver, and Chidori itself is a Kubernetes deployment, so it is fully managed by Kubernetes, which guarantees high availability and stability. We built an API server that provides multiple APIs for dealing with Spark on Kubernetes, like create job, delete job, list jobs, and get logs. This API integrates with the spark-submit client and with any third party that can integrate with a Spark on Kubernetes server. Once we receive a job inside Chidori, it is queued. We built the queuing because we want Chidori to control how many drivers can run at a time, so the admin can configure the maximum number of jobs and the maximum memory and CPU that can be consumed at a time. Once the job is received by the API server, it is stored in our queuing system. We provide an interface for multiple queuing systems like RabbitMQ, Cloud Pub/Sub, and Azure.
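The queuing and capacity limits described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Chidori's implementation: the names `Job`, `Limits`, and `AdmissionQueue` are hypothetical, and an in-memory deque stands in for an external queue such as RabbitMQ.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    memory_gb: float
    cpu: float


@dataclass
class Limits:
    """Admin-configured caps on what may run at the same time."""
    max_jobs: int
    max_memory_gb: float
    max_cpu: float


class AdmissionQueue:
    """FIFO queue that releases jobs only while capacity remains."""

    def __init__(self, limits):
        self.limits = limits
        self.pending = deque()
        self.running = {}

    def submit(self, job):
        self.pending.append(job)

    def _fits(self, job):
        used_mem = sum(j.memory_gb for j in self.running.values())
        used_cpu = sum(j.cpu for j in self.running.values())
        return (len(self.running) < self.limits.max_jobs
                and used_mem + job.memory_gb <= self.limits.max_memory_gb
                and used_cpu + job.cpu <= self.limits.max_cpu)

    def dispatch(self):
        """Start queued jobs in order until the head of the queue no longer fits."""
        started = []
        while self.pending and self._fits(self.pending[0]):
            job = self.pending.popleft()
            self.running[job.name] = job
            started.append(job.name)
        return started

    def finish(self, name):
        """Release the capacity held by a completed job."""
        self.running.pop(name)


if __name__ == "__main__":
    queue = AdmissionQueue(Limits(max_jobs=2, max_memory_gb=8, max_cpu=4))
    for job in (Job("a", 4, 2), Job("b", 4, 2), Job("c", 1, 1)):
        queue.submit(job)
    print(queue.dispatch())  # "c" waits because two jobs are already running
    queue.finish("a")
    print(queue.dispatch())
```

Strict FIFO is the simplest policy to reason about; a real system might instead let a small job skip ahead of a large one that does not fit yet.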
Once the job is queued, if there is enough capacity for the job to run, the scheduler fetches it from the queue and starts a goroutine to run it. The core engine then tracks the job to manage its whole lifecycle, and all of this metadata is stored in our backend store for monitoring and auditing. We also built Chidori to interface with many Spark providers like Incorta, Kubernetes, and Databricks, so you can use Chidori to submit jobs to Incorta, to your own cluster in Kubernetes, or to your cluster in Databricks. We also built monitoring for running jobs, with full capabilities to correlate different factors while you troubleshoot your jobs. We also have a connect layer that provides a Spark Connect interface to other parties on the client side, and we provide our own Chidori version of spark-submit that integrates easily with the Chidori server. So you can consider Chidori a full AI and ML cluster manager on Kubernetes that provides full integration with Kubernetes and with different tools in the ML ecosystem like MLflow and Kubeflow. You can focus on your business logic by developing your model, training, deployment, and serving, and Chidori will take care of infrastructure management. So now for the demo part.
Hi everyone. As mentioned by Biga, this is Nadine Khaled from the Incorta Cloud team, and today I'm going to demonstrate how to install Chidori in your environment and put it into action. As you can see here, we provide Helm charts for easy installation into your namespace. Once Chidori is installed, you can verify that all the infrastructure components are created. By infrastructure components I mean the Spark server deployment, which in our case is Chidori; the RabbitMQ StatefulSet, which is responsible for queuing the jobs; and the Chidori core deployment, which is responsible for monitoring the jobs that you have run before.
And, as you can see here, we have created all the necessary services that let the deployments communicate with each other. Chidori also simplifies the management of Python packages, so you can install the Python packages that you want for your job execution without the hassle of installing them manually. As you can see here, I have installed the Python package TensorFlow, and all these packages are already preinstalled in Chidori so you don't need to install them again. Also, through Chidori you can submit your Spark jobs to any Spark master, whether it's Kubernetes, Databricks, Dataproc, or Azure HDInsight. Chidori also offers flexibility in specifying the driver memory: you can choose the size of the driver that you want. So let's go back to the Chidori setup. Once you make sure that all the pods are up and running, you can start submitting your Spark jobs through spark-submit, and you can open Chidori monitoring to see the status of these jobs. As you can see here, these are all the jobs I have created before. I can filter by status, whether failed or succeeded. I can also filter by the date the job was created, and by the schema name and the table name of the job. You can also preview the history of the jobs that were created with the same schema name and table name. As you can see here, these are the jobs that were created with the same schema name and table name, along with their status and all the information about them. You can also view this history in a chart view, so you can see how long these jobs took to be loaded or created through Chidori. You can also take some actions on the jobs. You can download the Spark driver logs, and you can open the job in the Spark history server, which redirects you to the Spark history server. Chidori also provides a shareable link feature.
So when you want to share the history of a job, or any details or information about it, with someone, you can just give them this signed URL. That spares them the hassle of logging in to Chidori monitoring with credentials. As you can see here, you just copy the URL, the shareable link, and the person you gave it to opens the link and is able to view the history of these jobs. They can display it, as I said before, in the chart view, and they can display details about each job that was created. So that was Chidori, and I hope you enjoyed the demo. Thank you for attending our session, and have a good day.
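One common way to build a shareable signed link like the one shown in the demo is an HMAC over the path and an expiry time. The talk does not say how Chidori signs its URLs, so the scheme, the query parameter names, the host name, and the secret below are all assumptions made for illustration.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"replace-with-a-real-secret"  # hypothetical signing key


def _signature(path, expires):
    """HMAC-SHA256 over the path and expiry, hex encoded."""
    message = f"{path}?expires={expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_url(base, path, ttl_seconds, now=None):
    """Return a link to `path` that stays valid for `ttl_seconds`."""
    now = int(time.time()) if now is None else int(now)
    expires = now + ttl_seconds
    query = urlencode({"expires": expires, "sig": _signature(path, expires)})
    return f"{base}{path}?{query}"


def verify_url(url, now=None):
    """Check the signature and expiry of a link produced by sign_url."""
    now = int(time.time()) if now is None else int(now)
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(parsed.path, expires)
    # compare_digest avoids leaking information through timing differences.
    valid = hmac.compare_digest(sig.encode(), expected.encode())
    return valid and now <= expires


if __name__ == "__main__":
    link = sign_url("https://chidori.example.com", "/jobs/42/history", 3600)
    print(link, verify_url(link))
```

The recipient needs no credentials because the signature itself proves the link was issued by the server, and the expiry bounds how long it can be used.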
...

Ahmed Gaber

Cloud Engineering Manager @ Incorta


Nadine Khaled

Software Engineer @ Incorta



