Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, I'm Andrey Stroganov.
Thank you for joining me for my talk about performance and scalability
challenges of tree ensemble classifiers on heterogeneous platforms.
Specifically, we will be talking about heterogeneous CPUs.
Those are the CPUs that we use every day in our mobile devices.
They use the big.LITTLE architecture, which combines high-performance
cores with energy-efficient cores within a single processor.
Initially, this technology came from the mobile market, where power
efficiency is crucial, and now heterogeneous CPUs are gradually making
their way into the server segment.
Heterogeneous CPUs can dynamically allocate workloads to the appropriate
cores, their applications can range from lightweight microservices to resource
intensive data processing tasks, and they are an attractive option for data centers.
We will experiment with two popular machine learning models, random
forest and gradient boosting.
Both of them use decision trees under the hood, so let's do a quick recap.
A decision tree is a tree-like structure which makes decisions based on feature values.
Here we see a small decision tree with three features: humidity, temperature, and wind speed. Each node holds a feature and its splitting value.
If humidity is above 90 percent, we go right, and if, let's say, temperature is above 15 degrees, we go left. And the prediction is that it is raining.
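As a minimal illustration, the small tree above could be written as plain Python. The humidity and temperature splits follow the walk-through, while the wind speed branch and the non-rain leaf labels are just illustrative placeholders.

```python
def predict_weather(humidity, temperature, wind_speed):
    # Humidity above 90% sends us down one subtree, and temperature above
    # 15 degrees then selects the "rain" leaf, as in the walk-through.
    # The wind speed branch and the other leaf labels are illustrative.
    if humidity > 90:
        if temperature > 15:
            return "rain"
        return "cloudy"
    if wind_speed > 30:
        return "windy"
    return "clear"

print(predict_weather(humidity=95, temperature=20, wind_speed=10))  # rain
```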
In practice, decision trees are much larger and use more features, but a single decision tree is a weak predictor. So we usually utilize ensembles.
For instance, we can take multiple trees and train each of them on a random subset of the training data set. The resulting structure will be a random forest.
Another approach is to train a sequence of decision trees so that each subsequent tree tries to compensate for the prediction inaccuracy of the previous trees. This structure would be a gradient boosting machine.
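To make the difference concrete, here is a small sketch of how the two ensembles combine their trees; the per-tree predictors are hypothetical callables, not code from the talk.

```python
def random_forest_predict(trees, x):
    # Each tree was trained on its own random subset of the data;
    # the forest simply averages their independent predictions.
    return sum(tree(x) for tree in trees) / len(trees)


def gradient_boosting_predict(trees, x, learning_rate=0.1):
    # Trees are applied in sequence; each one nudges the accumulated
    # score to compensate for the errors of the previous trees.
    score = 0.0
    for tree in trees:
        score += learning_rate * tree(x)
    return score
```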
Both of them are widely popular. They have implementations in such great libraries as ONNX, XGBoost, and LightGBM, but due to their branching nature, they are usually processed on CPUs. We don't use NPUs or GPUs for them.
There might be some obstacles when we port these algorithms to a heterogeneous setting.
And these are our experimental settings. We will use a weather forecasting data set with 21 numerical features and an octa-core CPU with one very fast core, three slower cores, and four energy-efficient cores. Our models will have up to 100 decision trees, and they will be processed using ONNX Runtime.
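As a rough sketch of this setup, an exported tree ensemble can be served through the ONNX Runtime Python API. The model file name and the synthetic input below are hypothetical, but the 21-feature shape matches the data set described above.

```python
import numpy as np
import onnxruntime as ort

# Hypothetical exported model; in practice the ensemble would be converted
# to ONNX (e.g. with a tool such as skl2onnx) before being served here.
sess = ort.InferenceSession("random_forest.onnx",
                            providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name

# Synthetic stand-in for the weather data: 1 million rows, 21 features.
X = np.random.rand(1_000_000, 21).astype(np.float32)
labels = sess.run(None, {input_name: X})[0]
```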
Here we see the stock performance of random forest using several threads.
On the X axis we see the data size, where 1 denotes 1 million samples. On the Y axis we see the inference time in seconds.
The blue line here is the performance of the single-threaded algorithm. If we add an additional thread, we get much better performance. So let's add another two threads. This one here is the result of predicting using four threads. We again improved performance. Adding two additional threads, we get the best performance. But if we try to add a couple more threads, we see that performance gets worse. And this is a concurrency penalty.
So the problem is that we can't be sure how many threads we should take for good performance.
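The "stock" multithreading discussed here corresponds to letting ONNX Runtime parallelise a single run() call internally. A minimal sketch of sweeping that thread count might look like this; the model file name is hypothetical and the printed timings are illustrative.

```python
import time

import numpy as np
import onnxruntime as ort

X = np.random.rand(1_000_000, 21).astype(np.float32)  # synthetic input

for n_threads in (1, 2, 4, 6, 8):
    opts = ort.SessionOptions()
    opts.intra_op_num_threads = n_threads  # threads used inside one run() call
    sess = ort.InferenceSession("random_forest.onnx",
                                sess_options=opts,
                                providers=["CPUExecutionProvider"])
    start = time.perf_counter()
    sess.run(None, {sess.get_inputs()[0].name: X})
    print(f"{n_threads} threads: {time.perf_counter() - start:.2f} s")
```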
And these are the results for gradient boosting with stock multithreading. We see that the problem stays. This one is the single-thread performance, and if we add an additional thread, we get the best performance. Adding more threads results in worse performance, with a huge concurrency penalty. And let's note that we got a very small performance boost here, 1.2 times.
In order to improve performance, let's see how random forest and tree ensemble classifiers are usually multithreaded.
If we have enough trees, we can process each tree in a separate thread. For instance, if we have 9 trees and 4 available threads, then we can take 4 trees, process them in parallel, take another 4 trees and process them, and then we will have only 1 tree left while 3 threads sit idle.
And so we will use the second approach, where we split the dataset into several batches, and each batch is processed in parallel in a different thread, using a decision forest consisting of three trees. But both these approaches imply that our CPU cores are roughly equal, which in a heterogeneous setting is not entirely true.
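A rough sketch of these two stock approaches, assuming scikit-learn-style objects with a predict() method (the names and objects are assumptions, not the talk's actual code):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def predict_per_tree(trees, X, n_threads=4):
    # Approach 1: one task per tree; the per-tree outputs are averaged.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        per_tree = list(pool.map(lambda tree: tree.predict(X), trees))
    return np.mean(per_tree, axis=0)


def predict_per_batch(forest, X, n_threads=4):
    # Approach 2: split the data into equal batches and run the whole
    # forest on each batch in its own thread.
    batches = np.array_split(X, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = list(pool.map(forest.predict, batches))
    return np.concatenate(parts)
```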
Here is what can happen if we run equally demanding tasks on different cores.
So let's look at this setting. We have four cores in our CPU. Core A is the fastest, core B is two times slower, and we have two slowest cores, C and D. If we put equally demanding tasks on these cores, then we see that core A finishes its job very fast and then does nothing, core B finishes a bit later and also sits idle, and we have to wait for cores C and D to finish their tasks.
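A back-of-the-envelope calculation shows why this static split is wasteful; the relative speeds below are illustrative assumptions (core A at full speed, core B at half speed, and C and D at a quarter).

```python
# One equal-sized task per core, finish time = work / relative speed.
speeds = {"A": 1.0, "B": 0.5, "C": 0.25, "D": 0.25}
finish = {core: 1.0 / speed for core, speed in speeds.items()}
print(finish)                # {'A': 1.0, 'B': 2.0, 'C': 4.0, 'D': 4.0}
print(max(finish.values()))  # the prediction is only done after 4.0 time units
```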
And there is a more realistic scenario in case 2, when cores help each other: when core A finished its task one, it took the unfinished part of task three from core C, and then it took the unfinished part from core D. This schedule is more compact, but we still see that some cores are idling.
So can we do better?
Yes we can.
We can use a worker pool pattern, where we split our data set into very small batches. The size of a batch should be small enough that even the slowest core can process it in a reasonable time. And we use several workers in several threads. Each worker takes a batch, processes it, and requests the next batch.
Here we see that core A took batch 1, core B took batch 2, core C took batch 3, and core D took batch 4. When core A finished processing batch 1, it took batch 5 and then batch 6; by that time core B had finished batch 2 and requested the next available batch, batch 7.
This schedule is far more compact.
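Here is a minimal sketch of the worker pool pattern in Python. `predict_fn` is a placeholder for whatever runs the ensemble on one batch (for example an ONNX Runtime session restricted to a single internal thread), and the batch bookkeeping is deliberately simple.

```python
import queue
import threading

import numpy as np


def worker_pool_predict(predict_fn, X, n_workers=8, batch_size=512):
    # Split the data into small batches and let every worker pull the next
    # available batch as soon as it finishes its current one.
    n_batches = (len(X) + batch_size - 1) // batch_size
    todo = queue.Queue()
    for i in range(n_batches):
        todo.put(i)
    results = [None] * n_batches

    def worker():
        while True:
            try:
                i = todo.get_nowait()          # request the next batch
            except queue.Empty:
                return                         # no work left for this worker
            lo, hi = i * batch_size, min((i + 1) * batch_size, len(X))
            results[i] = predict_fn(X[lo:hi])  # fast cores simply loop more often

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return np.concatenate(results)
```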
And these are the results of using the worker pool pattern for random forest.
We see that adding threads results in better performance. So this is the single-thread performance, this is the two-thread performance, this is the four-thread performance, and these are the six-, seven-, and eight-thread performances. We see that the system couldn't give us more processing power, so in all these three cases we get the same performance, but there is no concurrency penalty here and we have better performance overall.
These are the results for gradient boosting. Again, we see that when we add additional threads, the performance increases; the best performance is when we use eight threads, and there is no concurrency penalty. We get a four times performance boost.
So, as we said, we use small batches of around four hundred to eight hundred samples. We don't want to use very small batches, because we would waste processor time handling the queue, and we don't want to use very large batches, because then the slowest core could again hold up the whole schedule.
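Tying this back to the earlier sketches: the worker_pool_predict function above could wrap an ONNX Runtime session limited to one internal thread, with a batch size picked from that 400 to 800 sample range (512 here is just an illustrative choice, and the model file name is hypothetical).

```python
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 1      # each worker thread handles its own batch
sess = ort.InferenceSession("random_forest.onnx",
                            sess_options=opts,
                            providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name

X = np.random.rand(1_000_000, 21).astype(np.float32)
preds = worker_pool_predict(lambda batch: sess.run(None, {input_name: batch})[0],
                            X, n_workers=8, batch_size=512)
```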
So, to sum up: we saw that porting software to heterogeneous CPUs may require some architectural changes to fully utilize their computing potential. We demonstrated that a stock multithreading architecture may utilize a heterogeneous CPU inefficiently and discussed the reasons for this. We demonstrated how to solve this problem using the worker pool pattern, and we got a four to five times performance boost for random forest and gradient boosting.
I thank you for your attention and I'll be glad to answer any questions.
Goodbye.