Transcript
This transcript was autogenerated.
Welcome everyone to our presentation on cloud native ML infrastructure.
Today we will be exploring how to build resilient Apache Spark
clusters on Kubernetes specifically designed for AI and ML workloads.
I'm Anant Kumar and I'm excited to share insights on addressing the
unique challenges of scalability, resource utilization and operational
complexity in this domain.
Let's begin by looking at the evolution of ML infrastructure.
Historically, ML workloads ran on premises, requiring significant upfront investment in hardware and ongoing maintenance costs.
This approach limited scalability and flexibility.
The shift to cloud native infrastructure has revolutionized how we handle ML workloads, offering greater scalability, flexibility, and cost efficiency through pay-as-you-go models and dynamic resource allocation.
This evolution has enabled organizations to experiment more
freely and deploy ML solutions faster.
Despite advances, ML infrastructure still faces significant challenges.
First, scalability remains difficult as ML models and datasets grow increasingly complex and large.
Second, resource utilization is often inefficient, with compute resources frequently over-provisioned or underutilized.
And finally, operational complexity increases as organizations manage distributed systems spanning multiple environments and technologies.
These challenges are precisely what we will address with our
Spark on Kubernetes approach.
Apache Spark has become a cornerstone technology for ML workloads
for several compelling reasons.
Its distributed processing capabilities allow it to handle massive datasets across clusters of machines, which is essential for training modern ML models.
Its in-memory computing provides speed advantages critical for iterative ML algorithms.
Additionally, Spark offers pre-built libraries for common ML tasks through MLlib, making implementation more straightforward and reducing development time.
Kubernetes has emerged as the leading container orchestration platform,
offering key capabilities that complement Spark for ML workloads.
Its container orchestration simplifies deployment and
management of Spark applications.
Its sophisticated resource management ensures efficient allocation and utilization of compute resources.
Perhaps most importantly, its autoscaling capabilities allow dynamic adjustment of resources based on workload demands, optimizing both performance and cost.
A comprehensive cloud native ML architecture consists of
four essential components.
First, data storage solutions, including data lakes, object storage, and databases, that provide the foundation for managing large volumes of data.
Second, compute resources like Kubernetes clusters, virtual machines, and GPUs that power processing.
Third, ML frameworks such as Spark, TensorFlow, and PyTorch that provide the tools for building and training models.
And finally, model serving platforms that deploy and serve trained models to end users or applications.
Now, let's examine how Spark operates within a Kubernetes environment.
Spark applications are deployed as Kubernetes pods, with each worker running in its own pod.
These workers communicate with each other and the master node to coordinate processing tasks.
Think of it this way: the master node is the driver, and the workers are essentially the executors.
Data is stored and accessed through persistent volumes, ensuring data remains available even if pods are rescheduled or restarted.
This architecture combines Spark's processing power with Kubernetes
orchestration capabilities.
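As a rough sketch of what this looks like in configuration, here is a PySpark session pointed at a Kubernetes cluster; the API server URL, image name, and namespace are placeholders.

```python
from pyspark.sql import SparkSession

# Sketch of a Spark session targeting Kubernetes; all values are placeholders.
spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.example.com:6443")  # cluster API server
    .appName("spark-on-k8s-sketch")
    .config("spark.kubernetes.container.image", "example/spark:latest")
    .config("spark.kubernetes.namespace", "ml-workloads")
    .config("spark.executor.instances", "4")  # four executor (worker) pods
    .getOrCreate()
)
```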
Effective resource management is crucial for Spark on Kubernetes.
We implement this through resource requests and limits, defining the minimum resources each Spark pod requires and the maximum it can consume.
Resource quotas set boundaries on resource consumption for namespaces, preventing any single application from monopolizing cluster resources.
Node affinity and taints allow us to assign Spark pods to specific nodes based on their requirements, such as placing GPU-intensive tasks on nodes with appropriate hardware.
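As a hedged sketch, these are the kinds of Spark-on-Kubernetes properties involved; the values and the GPU node label are illustrative.

```python
# Requests/limits and node placement for Spark executor pods (values are
# illustrative; the keys are standard Spark-on-Kubernetes properties).
conf = {
    "spark.executor.memory": "4g",                   # executor memory request
    "spark.kubernetes.executor.request.cores": "1",  # CPU request per executor
    "spark.kubernetes.executor.limit.cores": "2",    # CPU limit per executor
    # Node selector: schedule pods onto GPU nodes (label is hypothetical).
    "spark.kubernetes.node.selector.accelerator": "nvidia-gpu",
}

# Applied when building the session, e.g.:
# builder = SparkSession.builder
# for key, value in conf.items():
#     builder = builder.config(key, value)
```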
The Spark Operator simplifies Spark application management on Kubernetes.
Installation involves deploying the operator into your Kubernetes cluster,
which can be done via Helm Charts.
Configuration is simple: it requires setting up the operator with appropriate cluster settings and standard Kubernetes resource limits.
Once deployed, the operator enables declarative application deployment through custom resource definitions (CRDs), streamlining the entire process of running Spark applications.
You can manage and orchestrate Spark applications through the Spark Operator, and all CRUD operations can be performed through it.
For example, you can create a Spark application, read it, update it, and delete it.
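Here is a sketch of those CRUD operations against the operator's SparkApplication CRD using the Kubernetes Python client; the application name and namespace are placeholders, and the spec is elided.

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()
group, version, plural = "sparkoperator.k8s.io", "v1beta2", "sparkapplications"

app = {
    "apiVersion": f"{group}/{version}",
    "kind": "SparkApplication",
    "metadata": {"name": "demo-app", "namespace": "default"},
    "spec": {},  # driver/executor spec elided for brevity
}

# Create, read, and delete the SparkApplication custom resource.
api.create_namespaced_custom_object(group, version, "default", plural, app)
api.get_namespaced_custom_object(group, version, "default", plural, "demo-app")
api.delete_namespaced_custom_object(group, version, "default", plural, "demo-app")
```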
Dynamic resource allocation is a key advantage of our approach.
This involves three critical elements.
Resource monitoring to track Spark resource consumption in real time.
Dynamic scaling to adjust resources based on workload demand.
And resource optimization to ensure efficient utilization.
Together, these elements create a responsive system that can scale up during peak processing needs and scale down when demand decreases.
This also helps with bursty traffic: resources expand during the burst and contract once the workload stabilizes.
Autoscaling implementation focuses on two dimensions.
Horizontal scaling adds or removes Spark worker nodes based on processing demands, allowing the cluster to expand or contract.
Horizontal scaling is generally the preferred method because you can use general-purpose instances that scale horizontally and deploy them in large numbers.
Cost-wise, these instances are not that expensive, so it makes sense to use horizontal scaling wherever appropriate.
Vertical scaling, by contrast, adjusts resources within existing nodes, such as increasing CPU or memory allocation.
Both approaches can be automated based on metrics like CPU utilization, memory usage, or application-specific metrics.
You can set threshold values, and when these metrics exceed those thresholds, scaling triggers automatically.
Effective data management requires understanding various storage options.
This slide visually represents the diverse storage solutions available, from cloud object storage for raw data to specialized databases for structured information.
The right storage strategy depends on your specific workload characteristics, access patterns, and latency requirements, and it can vary from blob storage to databases to data lakes.
Data persistence is essential for ML workloads.
We implement this through persistent volumes that store data beyond pod lifecycles, ensuring data survives container restarts.
We know that in Kubernetes the pod lifecycle is ephemeral: at any time, for any reason, a pod can be evicted, so it's important to have persistent data.
StatefulSets manage stateful applications with persistent identifiers and stable storage, which is critical for maintaining state.
And data replication across multiple nodes provides redundancy, ensuring availability even if individual nodes fail or get evicted.
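As a sketch, here is how a persistent volume claim for Spark data might be requested with the Kubernetes Python client; the claim name, size, and namespace are illustrative.

```python
from kubernetes import client, config

config.load_kube_config()

# Request a 100Gi volume that outlives any individual pod.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="spark-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        resources=client.V1ResourceRequirements(requests={"storage": "100Gi"}),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim("default", pvc)
```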
A robust monitoring infrastructure is vital for production systems.
I haven't seen any production system without good monitoring and visibility infrastructure, so this is a critical component, and we must hold a high standard when setting up the monitoring infrastructure.
We recommend implementing monitoring tools like Prometheus and Grafana for metrics collection and visualization.
You can run a sidecar container alongside your main application container that is responsible for exposing metrics in the Prometheus format for scraping.
Logging solutions such as Fluentd and Elasticsearch centralize
logs for easier troubleshooting.
Tracing systems like Jaeger and Zipkin provide insights into request flows across distributed systems.
Putting this all together, these tools offer comprehensive
visibility into your infrastructure.
You can wire these metrics into PagerDuty alerts so you know when production systems degrade or fail, and you can intervene manually to bring those systems back without impacting customer workflows.
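A minimal sketch of the sidecar idea using the prometheus_client library; the metric names and values are made up.

```python
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

# Illustrative metrics; real values would come from the application.
jobs_total = Counter("spark_jobs_total", "Total Spark jobs submitted")
active_executors = Gauge("spark_active_executors", "Currently active executors")

# Expose /metrics in Prometheus format for the server to scrape.
start_http_server(8000)

while True:
    jobs_total.inc()
    active_executors.set(random.randint(2, 20))  # stand-in for a real reading
    time.sleep(15)
```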
Performance optimization requires a multifaceted approach.
Data partitioning optimizes distribution across workers, reducing data movement
and improving processing efficiency.
You can also use an external shuffle service that is responsible for Spark shuffle data, which reduces the impact of shuffles, essentially increasing efficiency across the cluster.
Data caching keeps frequently accessed data in memory, reducing read latency.
And code optimization improves application efficiency through techniques like predicate pushdown, one of the key features of Spark, where Spark pushes the predicate, or filter, down to the source.
For example, if a SQL query manipulates data from an Avro source file, Spark will push the filter down to the Avro reader instead of reading all the records from the Avro file.
On top of that, you can gain efficiency through broadcast joins and query optimization as well.
Spark provides Catalyst, which converts the logical query plan into an optimized logical query plan.
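Here is a PySpark sketch illustrating these techniques together; the paths and column names are placeholders, and reading Avro assumes the spark-avro package is on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("perf-sketch").getOrCreate()

# Predicate pushdown: this filter is pushed to the Avro reader, so only
# matching records are read from the source.
events = (
    spark.read.format("avro")
    .load("s3a://bucket/events")
    .filter(col("region") == "us-east-1")
)

# Caching: keep a frequently reused dataset in memory.
events.cache()

# Broadcast join: ship the small dimension table to every executor,
# avoiding a shuffle of the large events table.
dims = spark.read.parquet("s3a://bucket/dims")
joined = events.join(broadcast(dims), "region")

joined.explain()  # inspect the Catalyst-optimized plan
```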
Security must be a priority in any ML infrastructure.
This chart illustrates the balanced approach required
across multiple security domains.
Network policies to control pod communications.
Data encryption for protecting sensitive information.
Authentication mechanisms to verify identities.
And vulnerability scanning to proactively identify weaknesses.
Implementing proper authentication and authorization involves several mechanisms.
Role-based access control, which is native to Kubernetes, restricts access based on user roles, applying the principle of least privilege.
OAuth integration leverages external identity providers for streamlined authentication.
And certificate-based authentication uses digital certificates for securing communication between components.
You can also introduce mTLS for communication between internal service endpoints for extra protection.
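As a sketch of least-privilege RBAC via the Kubernetes Python client; the role name, namespace, resources, and verbs are illustrative.

```python
from kubernetes import client, config

config.load_kube_config()

# A narrowly scoped Role: only the resources and verbs a Spark job needs.
role = client.V1Role(
    metadata=client.V1ObjectMeta(name="spark-role", namespace="ml-workloads"),
    rules=[
        client.V1PolicyRule(
            api_groups=[""],
            resources=["pods", "services", "configmaps"],
            verbs=["get", "list", "watch", "create", "delete"],
        )
    ],
)
client.RbacAuthorizationV1Api().create_namespaced_role("ml-workloads", role)
```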
Network security and data protection work in tandem.
Network segmentation isolates Spark workloads from other applications,
minimizing the attack surface.
Data encryption, both at rest and in transit, ensures sensitive data remains protected even if other security measures are compromised.
Cost management is increasingly important as ML workloads scale.
It's paramount when these workloads run in the cloud.
We have all heard stories of EC2 instances left running far longer than needed, incurring unnecessary cost.
We can choose spot instances, and we can find out at what time of day and in which zones spot instances are cheaper, to improve our cost efficiency.
Reserve these for non-critical workloads, because spot instances have a tendency to be taken away.
Resource optimization fine-tunes allocation to reduce waste, ensuring you are not over-provisioning.
Idle resource management scales down or terminates resources when not in
use, avoiding unnecessary expenses.
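One hedged sketch of the spot-instance pattern, assuming a recent Spark version with per-role node selectors and a cluster whose spot nodes carry a label; the "lifecycle" label here is hypothetical, and clouds label spot capacity differently.

```python
# Steer Spark pods by node label (labels are illustrative).
conf = {
    # Executors tolerate eviction, so place them on cheaper spot nodes.
    "spark.kubernetes.executor.node.selector.lifecycle": "spot",
    # Keep the driver on on-demand nodes so a spot reclaim doesn't kill the job.
    "spark.kubernetes.driver.node.selector.lifecycle": "on-demand",
}
```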
High availability ensures your ML infrastructure remains operational despite failures.
This requires multiple master nodes for redundancy, eliminating single points of failure.
Multiple master nodes are also great if you have a larger cluster: they work in tandem to share the workload, making the Kubernetes API server highly available.
On the worker side, redundant Spark workers provide fault tolerance for processing tasks.
A failed or evicted worker pod can be restarted, and in many cases those tasks succeed when retried, giving the application self-healing properties.
Data replication across multiple nodes ensures data remains accessible even if individual nodes fail.
Disaster recovery planning prepares for worst case scenarios.
Data backup and recovery procedures preserve critical data and enable restoration when needed.
Replication across multiple regions protects against regional outages.
Failover mechanisms automatically redirect operations to functioning resources, minimizing downtime during failures.
Implementing CI/CD streamlines development and deployment.
The build stage compiles and packages Spark applications with their dependencies.
There are a lot of build tools available in the market; Maven is used predominantly, but for monorepo setups, Bazel is also gaining a lot of traction.
Testing runs automated tests to verify functionality and performance.
Deployment pushes applications to the Kubernetes cluster
using automated processes.
Integration tests are an essential part of the CI/CD pipeline before we deploy the Docker image.
Monitoring tracks application health and performance post-deployment.
The value of ML models lies in their application.
We need effective ways to deploy trained models from Spark to serving platforms like TensorFlow Serving, KFServing, or custom solutions.
Of the AI frameworks available in the market, for example TensorFlow, PyTorch, or JAX, TensorFlow comes out as one of the favorites for a lot of production deployment scenarios.
It provides high-level APIs like Keras, which simplify model deployment, and there is a lightweight version of the framework, TensorFlow Lite, available for mobile applications.
These platforms then serve models for real-time inference and predictions, bridging the gap between model development and production use.
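As a sketch of what real-time inference against TensorFlow Serving's REST API looks like from Python; the host, model name, and input are placeholders.

```python
import requests

# TensorFlow Serving's REST predict endpoint (host and model are placeholders).
url = "http://tf-serving.example.com:8501/v1/models/my_model:predict"
payload = {"instances": [[1.0, 2.0, 3.0]]}  # one input row

resp = requests.post(url, json=payload, timeout=10)
resp.raise_for_status()
print(resp.json()["predictions"])
```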
Dependency management becomes crucial as projects grow.
Tools like Maven, SBT, or Gradle help manage project dependencies efficiently.
Containerization via Docker packages applications with their dependencies, ensuring consistency across lower and higher environments and simplifying deployment.
Effective troubleshooting requires multiple approaches.
Log analysis examines Spark and Kubernetes logs to identify errors and their root causes.
There are many log analysis tools available in the market that can source and aggregate logs from all the containers running in a pod, including sidecar containers as well as init containers.
Debugging tools help inspect code execution and variable values.
Kubernetes monitoring tools provide insight into pod status, resource utilization, and cluster health.
There are many such tools available, including open source projects you can leverage to set up monitoring across your Kubernetes cluster.
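A small sketch of pulling logs from every container in a pod, including sidecar and init containers, with the Kubernetes Python client; the pod and namespace names are placeholders.

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pod = core.read_namespaced_pod("spark-driver-pod", "ml-workloads")

# Iterate over main and init containers and print the tail of each log.
for c in pod.spec.containers + (pod.spec.init_containers or []):
    logs = core.read_namespaced_pod_log(
        "spark-driver-pod", "ml-workloads", container=c.name, tail_lines=50
    )
    print(f"--- {c.name} ---\n{logs}")
```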
Continuous performance improvement requires systematic measurement.
This involves measuring performance metrics under various workloads to establish baselines.
Identifying bottlenecks and areas for improvement directs optimization efforts, and validating performance after implementing changes confirms their effectiveness.
Let's examine a real world case study involving a large scale ML pipeline.
The challenge was building a pipeline capable of processing terabytes of data efficiently.
The solution leveraged Spark on Kubernetes to provide distributed processing and dynamic scalability, resulting in significant performance improvements and cost reductions.
We paired it with a catalog service, for example Hive Metastore, and used the open source Iceberg open table format to efficiently store and manipulate tables, with the data stored in S3.
Spark SQL, Hive Metastore, and Iceberg put together provide a very comprehensive and compelling data lake solution, which has the capability to store terabytes of data efficiently as well as provide schema evolution in case business needs require changing the schema of a table.
Also, with the help of Iceberg, we can do time travel, which can be required for compliance reasons.
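A sketch of that stack in PySpark, assuming the Iceberg runtime jar is on the classpath; the catalog name, warehouse path, and table are placeholders.

```python
from pyspark.sql import SparkSession

# An Iceberg catalog backed by Hive Metastore, with data in S3.
spark = (
    SparkSession.builder.appName("iceberg-sketch")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hive")  # Hive Metastore catalog
    .config("spark.sql.catalog.lake.warehouse", "s3a://bucket/warehouse")
    .getOrCreate()
)

# Time travel: read the table as of a past timestamp.
spark.sql(
    "SELECT * FROM lake.db.events TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()
```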
From our experience, we have distilled key best practices.
Start small with a minimal viable cluster and gradually
scale as you gain experience.
Automate deployment, scaling, and monitoring processes to reduce manual intervention.
Continuously and periodically optimize resource utilization, and do performance benchmarking based on metrics and evolving workload patterns.
Looking ahead, several trends will shape the future of cloud
native AI and ML infrastructure.
Serverless computing for ML workloads will reduce infrastructure management overhead.
Edge computing will bring AI and ML capabilities closer to data
sources for real time applications.
AI powered infrastructure management will automate optimization and scaling
decisions, creating self tuning systems.
AI and ML is itself a very fast-moving space, and there are a lot of community-built tools, datasets, and models available on websites such as Hugging Face, where people collaborate and leverage each other's work to take the space forward.
This concludes our presentation on building resilient Apache Spark clusters on Kubernetes for AI workloads.
We have covered key concepts, best practices, and practical
implementation strategies.
I encourage you to explore the additional resources and continue learning about this rapidly evolving field.
Thank you for your time.
I hope my presentation was helpful and that you will be able to embark on your AI and ML infrastructure journey.
Thank you for attending.