Affordable ML Platform

Abstract

Building an end-to-end ML platform requires vision, strategy, and execution. From data acquisition to model deployment, it’s about seamless integration, robust infrastructure, and scalable workflows. Let’s empower innovation with a comprehensive ML ecosystem.

Summary

This talk presents an affordable machine learning platform for a world where data is becoming the new oil. Such platforms democratize access to machine learning capabilities. They are designed to lower the barriers, enabling you to turn your data into actionable insights regardless of your budget.
We will start with an overview of what an affordable machine learning platform is. We will identify the key stakeholders and user groups who benefit the most from an affordable machine learning platform. Then we will dive deeper into the technical aspects that make an affordable machine learning platform efficient and powerful.
An affordable machine learning platform is generally designed to work effectively with a single GPU or a few GPUs. The key to affordability is GPU sharing. Here we discuss why sharing GPUs is crucial in creating an affordable machine learning platform.
High-performance GPUs, which are necessary for running complex machine learning tasks, come with a hefty price tag. Consider that GPUs are often idle outside regular working hours. Increasing GPU utilization is key to reducing costs and building an affordable machine learning platform.
The key for startups and small businesses is to respond flexibly to market demands at a low cost. An affordable machine learning platform can significantly benefit various groups. It can help them maintain flexibility and scalability in their operations.
A typical machine learning platform can be divided into an application layer, an infrastructure layer, and a hardware layer. To create an affordable version of the platform, we have decided not to include the data components at this stage. To meet the need for scalable container environments on a single PC, the answer is to introduce OpenStack.
Machine learning tasks typically use Nvidia GPUs. MIG allows true parallelism for multiple tasks on the same GPU. MPS merges multiple tasks into a single GPU context. The cost of using MIG can be very high, as only Nvidia's high-performance professional GPUs support this technology.
GPU time sharing, also called time slicing, is a feature that allows a single GPU to be shared by multiple processes or users. Many third-party vendors have proposed their own GPU sharing solutions. I believe that GPU sharing and scalable container environments are among the most critical parts.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Welcome to my presentation about the affordable machine learning platform. In a world where data is becoming the new oil, the ability to harness this data through machine learning and artificial intelligence is no longer a luxury reserved for large corporations. It has become a necessity for businesses of all sizes, researchers, and individual developers. However, the journey to effective machine learning can often seem expensive. High costs, complex infrastructure, and the need for specialized hardware can create significant barriers to entry. Many small businesses, startups, and individual developers find themselves wondering: how can we leverage the power of machine learning without breaking the bank? This is where affordable machine learning platforms come into play. These platforms democratize access to machine learning capabilities, providing cost-effective, scalable, and user-friendly solutions to build, deploy, and manage machine learning models. They are designed to lower the barriers, enabling you to turn your data into actionable insights regardless of your budget. Today, we will explore the landscape of affordable machine learning platforms, understand their key features, and learn how they can be utilized to maximize efficiency while minimizing costs. Thank you for joining us. Let's get started.

Before we dive into the details, let me walk through today's agenda. We will cover the following topics. We will start with an overview of what an affordable machine learning platform is and why it is essential in today's data-driven world. We will identify the key stakeholders and user groups who benefit the most from an affordable machine learning platform. Then we will break down the essential components of a machine learning platform, explaining which parts are necessary for an affordable machine learning platform. Finally, we will dive deeper into the technical aspects that make an affordable machine learning platform efficient and powerful.
Let's begin with the first topic: what is an affordable machine learning platform? To start, let's define what a machine learning platform is. A machine learning platform is a comprehensive environment that provides the necessary tools, frameworks, and infrastructure to develop, train, deploy, and manage machine learning models. It streamlines the entire machine learning workflow, from data preprocessing and model building to deployment. An affordable machine learning platform is generally designed to work effectively with a single GPU or a few GPUs, focusing on resource sharing to ensure cost efficiency and broad accessibility. So the key point of affordability is GPU sharing.

Who needs this machine learning platform? Before answering this question, I am going to discuss why sharing GPUs is crucial in creating an affordable machine learning platform. First, let's talk about the cost. High-performance GPUs, which are necessary for running complex machine learning tasks, come with a hefty price tag. For many startups, small businesses, and individual researchers, this cost can be a significant barrier to entry. Consider that GPUs are often idle outside regular working hours. In many organizations, these expensive resources sit unused after the workday, leading to inefficiency. Most applications have CPU and I/O work in between launching GPU kernels, so the GPU utilization of a deep learning model running solely on the GPU is, most of the time, much less than 100%. It means that even during working hours there can be periods when GPUs are not fully utilized. GPUs are also getting more powerful each year. Experimenting with a new model allows, and sometimes even requires, one to use smaller hyperparameters, making the model use much less GPU memory than normal. Such tasks lead to underutilization and inefficiency. Typically, GPUs are the most expensive part of a machine learning platform and also the component that most affects platform utilization.
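To make the utilization argument above concrete, here is a minimal Python sketch of the arithmetic. The function names and all numbers are hypothetical and were not part of the talk; the sketch only assumes that the cost of a GPU-hour is fixed whether or not the GPU is busy.

```python
def average_utilization(busy_hours_per_day: float, busy_utilization: float) -> float:
    """Day-long average utilization when the GPU sits idle outside the busy hours."""
    return (busy_hours_per_day / 24.0) * busy_utilization


def cost_per_useful_gpu_hour(hourly_cost: float, utilization: float) -> float:
    """Effective cost of one fully used GPU-hour at a given average utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_cost / utilization


# Hypothetical scenario: one user, 8 working hours a day, GPU half busy in that window.
solo = average_utilization(busy_hours_per_day=8, busy_utilization=0.5)
# Hypothetical scenario: several users sharing the GPU around the clock at 80% load.
shared = average_utilization(busy_hours_per_day=24, busy_utilization=0.8)

# With a notional cost of 1.0 per GPU-hour, compare the effective cost of useful work.
print(cost_per_useful_gpu_hour(1.0, solo))
print(cost_per_useful_gpu_hour(1.0, shared))
```

Under these assumed numbers, the shared GPU delivers useful work at a fraction of the effective cost of the single-user GPU, which is the motivation for GPU sharing.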
Increasing GPU utilization is key to reducing costs and building an affordable machine learning platform.

Now, about who needs this platform. First, let's talk about startups and small businesses. The key for startups and small businesses is to respond flexibly to market demands at a low cost. When building a machine learning platform, they shouldn't invest heavily from the start. Instead, they should look for cost-effective solutions that can scale as their needs grow.

Next, educational institutions, including universities and research labs, need to provide students with hands-on experience in completing end-to-end machine learning tasks. While modern GPUs might be overkill for education, an affordable machine learning platform can offer practical exposure to industry practices without the excessive cost. This approach enables students to gain valuable skills and experience while helping institutions manage their budgets effectively.

Nonprofit organizations also work on tight budgets and need to maximize their impact with limited resources. They can use machine learning to analyze data, optimize operations, and drive their missions more effectively. An affordable machine learning platform provides them with the necessary computational power without diverting too much of their funds from their primary objectives.

Freelancers and consultants in the field of data science and machine learning often work independently or in small teams. Access to an affordable machine learning platform allows them to offer competitive services and solutions to their clients without the need to invest heavily in expensive hardware. It can help them maintain flexibility and scalability in their operations. In summary, an affordable machine learning platform can significantly benefit various groups.

Before introducing the affordable machine learning platform, I would like to first introduce what components a typical machine learning platform should consist of. Here I use a simplified diagram.
Through this diagram, we can see that a typical machine learning platform can be divided, from top to bottom, into an application layer, an infrastructure layer, and a hardware layer. The application layer is divided into a machine learning part and a data part.

Let's talk about the machine learning part first. Typically, the machine learning part is divided into four phases: data engineering, experiments, training, and inference. Data engineering is responsible for collecting, cleaning, and preparing data to ensure it's reliable and suitable for machine learning tasks. The experiment phase involves exploring and analyzing data, testing different algorithms, and creating model prototypes to find the best solution. The training phase works on training models using historical data and optimizing their parameters to achieve the best performance. The inference phase involves deploying trained models into production environments to make predictions on new data.

The data part is also an indispensable part of the machine learning platform. Data-related subparts usually include a feature store, model management, and a data lake. The feature store is responsible for storing and serving feature data consistently across training and inference to ensure reproducibility. Model management involves tracking and versioning machine learning models, driving collaboration, and managing model deployment pipelines. The data lake serves as a centralized repository for storing vast amounts of raw and processed data, enabling efficient data retrieval and analysis.

However, to create an affordable version of the platform, we have decided not to include the data components at this stage. The reason for this is that our target users typically handle smaller datasets, and there is no immediate need to establish a dedicated data platform at this point.
Moreover, data platforms and machine learning platforms can be decoupled, so that as our business scales up in the future, we can build a dedicated data platform separately. So for the affordable machine learning platform, the scope is the dark-colored part in the diagram: the machine learning part and the infrastructure part. I will deep dive into the most critical technical points in these two parts: scalable container environments and GPU sharing.

To better understand the requirements for scalable container environments, let's revisit some typical business scenarios. Educational institutions often operate with a single machine equipped with a few GPU cards, suitable for classroom use and small-scale research projects. Startups and small businesses typically have a setup consisting of a few PCs, each with GPUs, ideal for initial product development and small-scale deployment. Freelancers and consultants usually work with a single PC equipped with only one GPU, which is perfect for individual projects and consultancy work.

Here we find a conflict. From the business scenario perspective, the hardware setup may consist of only one or a few PCs. However, building a machine learning platform requires multiple container environments, including experiment, training, and inference environments. To manage these environments effectively, we need to introduce Kubernetes, and introducing Kubernetes typically requires at least three nodes. The challenge now becomes how to deploy Kubernetes on a single PC while ensuring compatibility for potential multi-node expansion.

The answer is to introduce OpenStack. To further illustrate our needs, consider a typical business scenario where the initial hardware setup consists of only one PC. As the business grows, the hardware may expand to multiple physical machines or virtual machines and potentially transition to a cloud environment. The challenge then shifts to ensuring compatibility with heterogeneous hardware environments while maintaining scalability.
OpenStack is well suited to address this issue. As shown in the diagram on the OpenStack website, it excels in heterogeneous hardware compatibility. OpenStack also provides a dedicated single-machine deployment tutorial, making it a perfect fit for our requirements. Additionally, since OpenStack provides a virtual machine environment, it allows for a seamless transition to either a self-hosted cloud or a public cloud environment in the future without disrupting the operation of the Kubernetes cluster and the machine learning platform.

This diagram illustrates the potential lifecycle of a typical affordable machine learning platform. In the initial phase, OpenStack is used to support single-machine setups and ensure compatibility with heterogeneous hardware. As the platform evolves, OpenStack continues to provide compatibility with more complex environments, including cloud infrastructure. In the later stages, OpenStack can be seamlessly removed and the platform can move to other container environments directly, enhancing flexibility and scalability.

Next, I will introduce one of the most critical technologies in building an affordable machine learning platform: GPU sharing. Before diving deep into the technical details, let's first understand the mainstream GPU sharing solutions and their applicable scenarios from a big-picture perspective. Since most machine learning tasks use Nvidia GPUs, we will start by looking at several official Nvidia solutions, including Multi-Instance GPU, which is MIG; GPU time sharing; and Multi-Process Service, which is MPS. We can see that MIG allows true parallelism for multiple tasks on the same GPU with the highest level of isolation, making it suitable for inference tasks and small-scale training tasks, although it's also the most expensive option, which we will discuss in detail later.
GPU time sharing involves time slicing on a single GPU, which causes context switching between different tasks, leading to increased total task time. This makes it unsuitable for latency-sensitive inference tasks, but more suitable for relatively asynchronous training tasks. The last solution, MPS, is the earliest GPU sharing solution. It merges multiple tasks into a single GPU context, so that if one task fails, all tasks will fail. Thus, it's only suitable for experimental scenarios. Among third-party solutions, we will primarily introduce Tencent TKE's Gaia GPU. Let's now dive into the principles of those solutions and their advantages and disadvantages.

MIG is a technology that allows a single Nvidia GPU to be partitioned into multiple isolated instances. Each of these instances has its own dedicated resources, such as memory, compute cores, and bandwidth. This separation ensures that multiple workloads can run simultaneously on the same GPU without affecting each other, thus maximizing resource utilization and performance. The primary advantage of MIG is the strong isolation it provides between different tasks. By dedicating specific resources to each instance, MIG prevents one task from impacting the performance or stability of another. This makes MIG particularly suitable for environments that run varied workloads. Additionally, MIG allows for precise resource allocation, enabling efficient use of GPU capabilities and improving overall system scalability. However, there is an important drawback to consider. The cost of using MIG can be very high, as only Nvidia's high-performance professional GPUs support this technology. As shown in the table on the right, the minimum requirement to support MIG is the Nvidia A30, which doesn't align with the goal of an affordable machine learning platform.

Let's talk about the next candidate. GPU time sharing, also called time slicing, is a feature that allows a single GPU to be shared by multiple processes or users.
By dividing the GPU's compute resources into time slices, each process or user gets a dedicated time slice during which it has full access to the GPU's resources. This enables multiple tasks to run on the same GPU in a sequential manner, providing the illusion of parallel processing while ensuring that each task gets a fair share of the GPU's capabilities. The advantage of time slicing is increased flexibility and better resource utilization. It allows multiple users or processes to share a single GPU without the need for partitioning its hardware resources physically.

There are some disadvantages to time slicing. In a time-slicing setup, multiple processes share the same pool of VRAM, leading to potential memory contention. If one process consumes a large amount of VRAM, it can leave insufficient memory for other tasks, causing performance degradation or failures. Additionally, the shared memory space increases the risk of memory leaks, where inefficient memory management by one process can gradually consume more VRAM, impacting the performance and stability of other processes. Regarding fault isolation, time slicing doesn't provide strict isolation between processes. If one process encounters a fault or crashes, it can potentially affect other processes sharing the same GPU resources. This lack of isolation can lead to system instability and unpredictable performance, making it challenging to ensure reliable operation in production environments.

The next solution is Nvidia MPS. The core principle of MPS is allowing multiple processes to share a single GPU context. Traditionally, each CUDA application would create its own GPU context, leading to resource wastage and context-switching overhead. By sharing a single GPU context, MPS reduces these inefficiencies. Additionally, MPS merges command queues from different processes into a single queue, thereby facilitating more efficient scheduling and execution of commands, which in turn minimizes GPU idle time.
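The time-slicing trade-off described in this section (fair sharing, but a longer total task time because of context switching) can be illustrated with a toy round-robin simulation in Python. This is only a sketch under stated assumptions: the function name, the fixed slice length, and the fixed switch cost are hypothetical, and a real GPU scheduler behaves in more complex ways.

```python
from collections import deque


def simulate_time_slicing(work_ms, slice_ms, switch_ms):
    """Round-robin one GPU among tasks and return each task's finish time in ms.

    work_ms   -- GPU time each task needs when it has the GPU to itself
    slice_ms  -- length of one time slice (assumed fixed)
    switch_ms -- assumed cost of a context switch between two different tasks
    """
    remaining = list(work_ms)
    finish = [0.0] * len(remaining)
    queue = deque(i for i, work in enumerate(remaining) if work > 0)
    clock = 0.0
    previous = None
    while queue:
        task = queue.popleft()
        if previous is not None and previous != task:
            clock += switch_ms  # pay the switch cost whenever the running task changes
        run = min(slice_ms, remaining[task])
        clock += run
        remaining[task] -= run
        if remaining[task] > 0:
            queue.append(task)  # not done yet, wait for the next turn
        else:
            finish[task] = clock
        previous = task
    return finish


# Two tasks that each need 10 ms of GPU time, 5 ms slices, 1 ms per switch.
print(simulate_time_slicing([10, 10], slice_ms=5, switch_ms=1))
```

In this toy setup the two tasks need 20 ms of GPU work in total, but the last one finishes at 23 ms, and the first task finishes later than it would have if it had run alone. That is the kind of delay that hurts latency-sensitive inference more than long-running training.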
The primary advantage of Nvidia MPS is performance improvement. By reducing context switching and managing commands more efficiently, MPS can significantly enhance the performance of concurrent processes. Furthermore, unlike MIG, and like time slicing, MPS supports consumer-grade GPUs. There are also notable disadvantages associated with MPS. One key issue is memory isolation: the shared GPU context results in less strict memory isolation compared to separate contexts. This can lead to memory contention and potential data security concerns. Another significant drawback is fault isolation. If one process encounters a fault, it can impact other processes sharing the same GPU context. Finally, configuring and debugging MPS can be complex and requires specialized knowledge and experience.

Outside of the official solutions, many third-party vendors have proposed their own GPU sharing solutions. A typical example is Tencent Gaia GPU. Tencent provides a complete vGPU sharing solution, which is fully open source. In Gaia GPU, vCUDA is the GPU resource limitation component, and it belongs to the CUDA hijacking category. It manages each container's memory usage by intercepting CUDA's memory allocation and release requests, thereby achieving memory isolation. The only thing to note is that context creation doesn't go through the malloc functions, so it's impossible to know how much memory the process uses for its context. Therefore, the current memory usage is queried from the GPU each time.

In terms of computing power isolation, users can specify the GPU utilization rate for a container. vCUDA will monitor utilization and take action when it exceeds the limit. Both hard isolation and soft isolation are supported. Since a monitoring-and-adjustment scheme is used, computing power cannot be limited over a short period. Only long-term efficiency and fairness can be guaranteed. Therefore, it's not suitable for scenarios where task times are extremely short, such as inference tasks.
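The memory-isolation idea described above (intercept every allocation request, compare against a per-container quota, and ask the device for the current usage rather than trusting a local counter) can be sketched in a few lines of Python. This is a conceptual model only, not Tencent's implementation: `FakeGpu`, `QuotaInterceptor`, and their methods are hypothetical names, and no real CUDA calls are made.

```python
class FakeGpu:
    """Hypothetical stand-in for a physical GPU that tracks bytes used per container."""

    def __init__(self, total_bytes):
        self.total_bytes = total_bytes
        self._used = {}

    def used_by(self, container):
        return self._used.get(container, 0)

    def alloc(self, container, nbytes):
        if sum(self._used.values()) + nbytes > self.total_bytes:
            raise MemoryError("device out of memory")
        self._used[container] = self.used_by(container) + nbytes

    def free(self, container, nbytes):
        self._used[container] = max(0, self.used_by(container) - nbytes)


class QuotaInterceptor:
    """Sits between containers and the device, in the spirit of API hijacking."""

    def __init__(self, gpu, quotas):
        self.gpu = gpu
        self.quotas = quotas  # container name -> allowed bytes

    def malloc(self, container, nbytes):
        # Query the device each time: usage that bypasses this wrapper
        # (for example memory taken at context creation) is still counted.
        in_use = self.gpu.used_by(container)
        if in_use + nbytes > self.quotas[container]:
            raise MemoryError(f"{container} would exceed its quota")
        self.gpu.alloc(container, nbytes)

    def free(self, container, nbytes):
        self.gpu.free(container, nbytes)


gpu = FakeGpu(total_bytes=100)
shim = QuotaInterceptor(gpu, quotas={"train": 40, "notebook": 60})
gpu.alloc("train", 5)      # simulated context overhead that bypasses the shim
shim.malloc("train", 30)   # allowed: 5 + 30 <= 40
try:
    shim.malloc("train", 10)  # rejected: 35 + 10 > 40
except MemoryError as err:
    print(err)
```

Because the quota check reads usage from the device, the 5 bytes of simulated context overhead count against the container even though they never passed through the interceptor, which mirrors the reason given in the talk for querying the GPU each time.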
Machine learning platforms are an extremely broad topic, and even if we limit the scope to building an affordable machine learning platform, there are still many aspects to share. However, I believe that GPU sharing and scalable container environments are among the most critical parts. A few years ago, while I was developing a machine learning platform, we successfully ran our platform on a small PC cluster with only a few machines, as shown in the picture. This demonstrated that by applying the technological solutions previously mentioned, we can indeed build an affordable machine learning platform aimed at educational institutions, NGOs, freelancers, and startups. Finally, I hope my insights have provided valuable guidance and inspiration for your own affordable machine learning platform endeavors. Thank you all for joining this online session.

Conf42 Machine Learning 2024 - Online

May 30 2024

Zhiya Zang

Senior Software Engineer @ TikTok
