Conf42 Cloud Native 2025 - Online

- premiere 5PM GMT

Tree Ensemble Classifiers on Heterogeneous Platforms: Performance & Scalability Challenges

Abstract

Tree ensemble methods (Random Forest, Gradient Boosting) are widely used in ML but can be inefficient in cloud-based, multi-threaded environments due to uneven workload distribution across heterogeneous CPU cores. This talk analyzes performance trade-offs in existing ONNX-based implementations, introduces a custom C++ wrapper for optimized task scheduling, and demonstrates a 4x speedup in cloud-based inference workloads.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, I'm Andrei Stroganov. Thank you for joining my talk about performance and scalability challenges of tree ensemble classifiers on heterogeneous platforms. Specifically, we will be talking about heterogeneous CPUs. These are the CPUs we use every day in our mobile devices. They use the big.LITTLE architecture, which combines high-performance cores with energy-efficient cores within a single processor. Initially this technology came from the mobile market, where power efficiency is crucial, and now heterogeneous CPUs are gradually making their way into the server segment. Heterogeneous CPUs can dynamically allocate workloads to the appropriate cores, their applications range from lightweight microservices to resource-intensive data processing tasks, and they are an attractive option for data centers.

We will experiment with two popular machine learning models, random forest and gradient boosting. Both of them use decision trees under the hood, so let's do a quick recap. A decision tree is a tree-like structure which makes decisions based on feature values. Here we see a small decision tree with three features: humidity, temperature, and wind speed. Each node holds a feature and its splitting value: if humidity is above 90 percent we go right, and if, say, temperature is above 15 degrees we go left, and the prediction is that it is raining. In practice decision trees are quite a bit larger and use more features, but a single decision tree is a weak predictor, so we usually use ensembles. For instance, we can take multiple trees and train each of them on a random subset of the training data set; the resulting structure is a random forest. Another approach is to train a sequence of decision trees so that each subsequent tree tries to compensate for the prediction inaccuracy of the previous trees; this structure is a gradient boosting machine. Both of them are widely popular and have implementations in great libraries such as ONNX, XGBoost and LightGBM, but due to their branching nature they are usually processed on CPUs; we don't use NPUs or GPUs for them. There can be obstacles, though, when we port these algorithms to a heterogeneous setting. (A small sketch of the per-sample tree traversal these models perform is shown below.)

These are our experimental settings. We will use a weather forecasting data set with 21 numerical features and an octa-core CPU with one very fast core, three slower cores, and four energy-efficient cores. Our models will have up to 100 decision trees, and they will be processed using ONNX Runtime.
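The following is a minimal C++ sketch of the per-sample traversal described above: a decision tree is stored as an array of nodes, each sample walks from the root to a leaf by comparing feature values against split thresholds, and an ensemble averages the per-tree outputs. The node layout, the thresholds, and the tiny weather tree are illustrative stand-ins loosely based on the talk's humidity/temperature/wind example, not the actual model or code from the talk.

```cpp
#include <iostream>
#include <vector>

// One node of a binary decision tree: compare a feature against a threshold
// and go left or right; leaves carry the prediction value.
struct Node {
    int feature;      // index into the feature vector; -1 marks a leaf
    float threshold;  // split value for internal nodes
    int left, right;  // child indices for internal nodes
    float value;      // prediction stored in leaves
};

using Tree = std::vector<Node>;

// Walk a single tree from the root to a leaf for one sample.
float predict_tree(const Tree& tree, const std::vector<float>& features) {
    int idx = 0;
    while (tree[idx].feature >= 0) {
        const Node& n = tree[idx];
        idx = (features[n.feature] <= n.threshold) ? n.left : n.right;
    }
    return tree[idx].value;
}

// A tree ensemble (random forest or gradient boosting) combines the per-tree
// outputs; here we simply average them.
float predict_ensemble(const std::vector<Tree>& forest,
                       const std::vector<float>& features) {
    float sum = 0.0f;
    for (const Tree& t : forest) sum += predict_tree(t, features);
    return sum / static_cast<float>(forest.size());
}

int main() {
    // Illustrative tree: features 0 = humidity (%), 1 = temperature (C), 2 = wind speed.
    Tree tree = {
        {0, 90.0f, 1, 2, 0.0f},    // root: humidity <= 90 ? left : right
        {-1, 0.0f, -1, -1, 0.0f},  // leaf: not raining
        {1, 15.0f, 3, 4, 0.0f},    // humidity > 90: check temperature
        {-1, 0.0f, -1, -1, 1.0f},  // leaf: raining
        {-1, 0.0f, -1, -1, 0.0f},  // leaf: not raining
    };
    std::vector<Tree> forest = {tree};

    std::vector<float> sample = {95.0f, 10.0f, 20.0f};  // humidity, temperature, wind
    std::cout << "rain score: " << predict_ensemble(forest, sample) << "\n";
}
```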
Here we see the stock performance of random forest using several threads. On the X axis we see the data size, where 1 denotes 1 million samples, and on the Y axis we see the inference time in seconds. The blue line is the performance of the single-threaded algorithm. If we add an additional thread, we get much better performance, so let's add another two threads: this is the result of predicting with four threads, and we again improve performance. Adding two more threads gives the best performance, but if we try to add a couple more threads, performance gets worse. This is a concurrency penalty. So the problem is that we can't be sure how many threads we should use to get good performance. These are the results for gradient boosting with the stock implementation, and we see that the problem remains: this is the single-thread performance, and adding one more thread gives the best performance. Adding more threads results in worse performance, with a huge concurrency penalty, and note that we got only a very small performance boost here, about 1.2 times.

In order to improve performance, let's see how random forest and other tree ensemble classifiers are usually multithreaded. If we have enough trees, we can process each tree in a separate thread. For instance, if we have 9 trees and 4 available threads, we can take 4 trees and process them in parallel, take another 4 trees and process them, and then only 3 trees remain. The second approach is to split the data set into several batches and process each batch in parallel in a different thread, using the whole decision forest, which here consists of three trees. Both of these approaches assume that our CPU cores are roughly equal, but in a heterogeneous setting that is not entirely true.

What can happen if we run equally demanding tasks on different cores? Consider this setting: we have four cores in our CPU. Core A is the fastest, core B is two times slower, and we have two slowest cores, C and D. If we put equally demanding tasks on these cores, core A finishes its job very fast and then does nothing, core B finishes a bit later and also sits idle, and we have to wait for cores C and D to finish their tasks. Case 2 is a more realistic scenario in which the cores help each other: when core A finished its task 1, it took the unfinished part of task 3 from core C, and then the unfinished part from core D. This schedule is more compact, but we still see some cores idling.

So can we do better? Yes, we can. We can use a worker pool pattern, where we split our data set into very small batches; a batch should be small enough that even the slowest core processes it in a reasonable time. We use several workers in several threads: each worker takes a batch, processes it, and requests the next batch. Here we see that core A took batch 1, core B took batch 2, core C took batch 3, and core D took batch 4. When core A finished processing batch 1, it took batch 5 and then batch 6; by that time core B had finished batch 2 and requested the next available batch, 7. This schedule is far more compact. A minimal sketch of this scheduling scheme is shown below.

These are the results of using the worker pool pattern for random forest. We see that adding threads results in better performance: this is the single-thread performance, this is two threads, this is four threads, and these are six, seven and eight threads. The system couldn't give us more processing power, so in all three of those cases we get the same performance, but there is no concurrency penalty and we have better performance overall. These are the results for gradient boosting: again, adding threads increases performance, the best performance comes with eight threads, there is no concurrency penalty, and we get a four-times performance boost. As we said, we use small batches of around four hundred to eight hundred samples: we don't want very small batches, because we would waste processor time handling the queue, and we don't want very large batches either, because then we are back to the load-imbalance problem.
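Below is a minimal C++ sketch of the worker pool pattern described above, not the speaker's actual ONNX Runtime wrapper: the data set is cut into small fixed-size batches, a shared atomic counter serves as the batch queue, and each worker thread repeatedly grabs the next batch until none are left, so faster cores naturally end up processing more batches. The score_batch function is a hypothetical placeholder for the real per-batch inference call, and the batch size of 512 and thread count of 8 are illustrative choices within the ranges mentioned in the talk.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

// Hypothetical placeholder for the real per-batch work, e.g. running the tree
// ensemble (or an ONNX Runtime session) over samples [begin, end).
void score_batch(const std::vector<float>& data, std::size_t begin,
                 std::size_t end, std::vector<float>& out) {
    for (std::size_t i = begin; i < end; ++i) {
        out[i] = data[i] * 0.5f;  // stand-in for the actual model prediction
    }
}

// Worker pool: the data set is cut into small fixed-size batches and a shared
// atomic counter hands out the next batch index. Faster cores simply come back
// for new batches more often, so no core sits idle while work remains.
void parallel_predict(const std::vector<float>& data, std::vector<float>& out,
                      std::size_t batch_size, unsigned num_threads) {
    const std::size_t num_batches = (data.size() + batch_size - 1) / batch_size;
    std::atomic<std::size_t> next_batch{0};

    auto worker = [&]() {
        for (;;) {
            const std::size_t b = next_batch.fetch_add(1);
            if (b >= num_batches) break;  // no batches left, this worker exits
            const std::size_t begin = b * batch_size;
            const std::size_t end = std::min(begin + batch_size, data.size());
            score_batch(data, begin, end, out);
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}

int main() {
    std::vector<float> data(1'000'000, 1.0f);  // 1 million samples, one value each
    std::vector<float> out(data.size());
    // Batch sizes of roughly 400-800 samples worked well in the talk; 512 and
    // 8 threads (matching the octa-core test CPU) are illustrative picks here.
    parallel_predict(data, out, 512, 8);
    std::cout << "first prediction: " << out[0] << "\n";
}
```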
To sum up: we saw that porting software to heterogeneous CPUs may require some architectural changes to fully utilize their computing potential. We demonstrated that the stock multithreading architecture may utilize a heterogeneous CPU inefficiently and discussed the reasons for this, and we showed how to solve the problem using the worker pool pattern, achieving a four- to five-times performance boost for random forest and gradient boosting. Thank you for your attention, and I'll be glad to answer any questions. Goodbye.

Andrei Stroganov

Lead Software Engineer @ Samsung Research

Andrei Stroganov's LinkedIn account


