Transcript
This transcript was autogenerated. To make changes, submit a PR.
Good morning or afternoon, everyone.
Today we are diving into a critical and exciting area: edge generative AI.
Specifically, we'll be exploring how to engineer low-latency AI solutions
designed for resource-constrained environments. As businesses increasingly
look for real-time AI capabilities right where the action happens, at the edge,
understanding how to optimize these powerful generative
models becomes essential.
We look at systematic ways to maintain performance while cutting
down on computational needs, power consumption, and latency.
This opens up fascinating new possibilities for intelligent
applications directly on edge devices.
Just a little bit about myself.
I'm Ian, a principal software engineer at SD Engineering.
My work focuses on enhancing global connectivity through advanced
IP satellite network infrastructure.
My experience in network engineering and software development gives me a
practical perspective on the challenges and opportunities of deploying
complex systems like AI in diverse environments, including the edge.
You can find more details or connect with me on LinkedIn.
So why is Edge AI challenging?
Let's look at the edge computing frontier.
Unlike the seemingly limitless resources in the cloud, edge devices
operate under significant constraints.
First, limited computational resources.
Think processing power, memory, and storage: edge devices typically
have much less than cloud servers.
Second, connectivity challenges.
Edge solutions often need to work reliably, even with spotty, slow,
or sometimes no internet connection.
Third, energy constraints.
Many edge devices run on batteries or have strict power budgets limiting
how complex our AI models can be.
And finally, real-time requirements.
Many edge applications, like autonomous vehicles or industrial monitoring,
demand immediate, low-latency responses, putting high performance demands
on these constrained devices.
Despite these challenges, the demand for edge AI is booming.
Let's look at market trends.
We are seeing strong expansion.
The edge AI accelerator market, the specialized hardware, is growing incredibly
fast, at nearly 39% CAGR, projected to hit almost $7.7 billion by 2027.
This signals huge demand for on-device AI.
What's driving this? The key adoption drivers are primarily the need for low
latency: over 78% of organizations cite this as crucial for real-time responses.
Then data privacy is another major factor, cited by almost 65% of organizations.
Pushing processing locally instead of sending sensitive data to
the cloud is important for them.
And the industry is adapting.
Model optimization techniques, which we'll discuss heavily today, are
already used in 82% of deployments.
This shows a clear focus on making complex AI run effectively on edge hardware.
Next, edge AI market evolution.
Let's look ahead.
We are really at an inflection point.
Currently, generative AI at the edge is still limited by those
resource constraints we mentioned.
Over the next 12 months, we expect increasing adoption of optimized edge
AI solutions. Within 18 to 24 months, the prediction is that most enterprises
needing real-time edge capabilities will be actively deploying them.
Beyond 2025, edge AI is likely to become ubiquitous, supported by
specialized hardware acceleration.
This rapid evolution is driven by hardware advancements, better optimization
techniques like quantization and pruning, and the growing need for
privacy-preserving local computation.
The shift away from cloud dependency for realtime tasks is well underway.
Market trends: just to reiterate those key market drivers, because they
underscore the why behind edge optimization.
The market growth is significant: as mentioned, a 39% CAGR for accelerators
and a roughly $7.6 billion market soon.
The primary needs are low latency for immediate responses and data privacy,
both favoring local on-device processing. And crucially, the industry is already
heavily invested in model optimization, used in almost 82% of deployments
to overcome hardware limitations.
This sets the stage for the techniques we are going to
explore in the rest of the talk.
Acceleration solutions.
While software optimization is key, hardware plays a vital role.
Specialized hardware acceleration solutions dramatically boost performance
and energy efficiency at the edge.
We have neural processing units (NPUs), dedicated AI accelerators now common in
modern systems-on-chip; they offer 10 to 15 times the energy efficiency of a
standard CPU for typical neural network tasks.
Google offers Edge TPUs, purpose-built edge accelerators providing significant
processing power, up to four tera operations per second in a small,
low-power, roughly two-watt package.
That's impressive.
Then mobile GPUs are also increasingly optimized for AI, leveraging their
parallel processing power, and FPGA accelerators offer reconfigurable hardware
adapted to specific AI models, often very power efficient in production.
These accelerators are designed specifically for the math-intensive
operations common in AI, like matrix multiplication.
Edge GPUs overview.
This table gives a snapshot of the diverse landscape of edge GPUs and accelerators.
Don't worry about memorizing the details, but notice the key players:
the NVIDIA Jetson series (Nano, NX, AGX) and RTX Ada, AMD Ryzen AI, Intel Core
Ultra with its NPU and Arc GPU, Google Coral Edge TPUs, Qualcomm Snapdragon NPUs,
and specialist edge accelerators.
The important takeaway here is the variation in AI performance,
memory capacity and bandwidth, power consumption, and form factor.
Choosing the right hardware depends heavily on specific application
needs regarding performance, power, budget, and physical size.
This plays a critical role in the process of developing these models.
Now let's look at the inference performance specifically for
large language models, which are notoriously resource intensive.
Again, this table is illustrative based on available data, which can be fragmented.
Key things to note.
High-end desktop GPUs like the RTX 4090 or 3090 achieve very high throughput
in tokens per second using optimized frameworks like NVIDIA TensorRT-LLM,
significantly outperforming less optimized methods like llama.cpp on the
same hardware. For edge-specific hardware like the NVIDIA Jetson Orin series,
AMD Ryzen AI, and Intel Core Ultra, running 7B-class models like Llama, often
quantized to INT4, is becoming feasible, especially with dedicated frameworks
like TensorRT-LLM, ONNX Runtime with Vitis AI, IPEX-LLM, and OpenVINO.
However, standardized benchmarking numbers for throughput and latency on these
edge platforms are still emerging and depend heavily on software maturity and
optimization levels. That's why we see missing data in specific rows and columns.
Power consumption is drastically lower for edge devices compared to desktop
GPUs, highlighting the efficiency gains, but also the performance trade-offs.
Beyond physical hardware: hardware is only part of the equation.
Software, specifically AI frameworks and optimization engines, is crucial
for actually running models efficiently on the hardware.
Key AI frameworks like TensorFlow Lite and PyTorch Mobile are
designed for edge deployment.
NVIDIA TensorRT and its LLM variant are highly optimized for their GPUs.
Inference optimization engines like ONNX Runtime and Intel's OpenVINO
focus on optimizing models for efficient execution across various hardware.
Vendors also provide specific software stacks, like NVIDIA JetPack,
AMD Ryzen AI software, Intel's oneAPI tools, and Qualcomm's AI Stack, which
bundle drivers, libraries, and tools for their respective hardware.
Choosing the right framework involves trade-offs.
As this chart illustrates, TensorFlow Lite generally offers excellent
hardware support through its delegation system, allowing it to
leverage NPUs and GPUs effectively. ONNX Runtime often provides
slightly better raw execution speed and great cross-platform portability.
PyTorch Mobile is often praised for its developer experience and
simpler deployment, making it good for rapid prototyping.
Apache TVM can achieve the best performance through deep
compiler optimizations, but also comes with significantly
higher deployment complexity.
So the choice depends on your priorities: for broad hardware support, TF Lite;
for raw speed and portability, ONNX Runtime is the one to explore; for ease of
use, PyTorch Mobile; for maximum performance at the cost of complexity, TVM.
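As a small illustration of the TensorFlow Lite route, here is a minimal sketch of loading a model with an optional hardware delegate and falling back to the CPU. The model file and delegate library name are hypothetical placeholders; the exact delegate depends on your accelerator.

```python
# Minimal sketch: run a TFLite model, using a hardware delegate if one is available.
# "model.tflite" and the delegate library are placeholders for your own artifacts.
import numpy as np
import tensorflow as tf

try:
    # e.g. an Edge TPU or vendor NPU delegate shared library, if present on the device
    delegate = tf.lite.experimental.load_delegate("libedgetpu.so.1")
    interpreter = tf.lite.Interpreter(model_path="model.tflite",
                                      experimental_delegates=[delegate])
except (ValueError, OSError):
    # Fall back to the built-in CPU kernels when no accelerator delegate is found
    interpreter = tf.lite.Interpreter(model_path="model.tflite")

interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

dummy_input = np.zeros(inp["shape"], dtype=inp["dtype"])
interpreter.set_tensor(inp["index"], dummy_input)
interpreter.invoke()
prediction = interpreter.get_tensor(out["index"])
```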
Optimization techniques: now let's get to the core software optimization
techniques, especially critical for large models like LLMs on the edge.
The most crucial technique, and the one we will spend significant time
on, is quantization: reducing the precision of the numbers used in the model.
Next is network pruning: removing redundant parts of the neural network.
Then there is knowledge distillation: training a smaller student model
to mimic a larger teacher model.
Other methods include low-rank approximation or factorization
and various memory optimizations.
Today we'll focus primarily on the top three: quantization, pruning,
and knowledge distillation.
Let's start with quantization.
The fundamental idea is to use fewer bits to represent the model's weights
and activations, reducing size and often speeding up computation.
There are several approaches. Post-training quantization (PTQ) is the simplest:
we train our model normally, usually in 32-bit floating point, and then convert
it to lower precision, like 8-bit integer (INT8) or even INT4, afterwards.
It's easier, but might have a moderate accuracy cost.
Next is quantization-aware training (QAT), which incorporates the effects of
quantization during the training process; the model learns to be robust to
lower precision. It's more complex, but usually preserves accuracy better.
Then dynamic range quantization adjusts how quantization is applied
based on the actual activation values encountered during inference,
offering a balance between accuracy and performance,
especially good for varying inputs.
Precision and quantization: this visual helps illustrate the concept.
Standard training uses FP32.
PTQ simply converts these FP32 weights to INT8 or INT4
after training. QAT simulates this conversion during training, allowing the
model to adapt. An advanced technique is mixed-precision deployment.
Here, you don't have to quantize the entire model uniformly.
You can analyze which layers are most sensitive to precision loss and keep
them at higher precision like FP32 or FP16, while aggressively quantizing
less sensitive layers to INT8 or INT4.
This offers a fine-grained trade-off.
Quantization is powerful.
Moving from FP32 to INT8 can make models four times smaller
and inference three to four times faster,
especially on hardware with INT8 support.
Inference is the stage where the model makes predictions on unseen data.
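To make the "four times smaller" figure concrete, here is a quick back-of-the-envelope sketch; the 7B parameter count is just an assumed example, and real file sizes also include metadata and non-quantized tensors.

```python
# Back-of-the-envelope weight storage at different precisions (illustrative only).
params = 7_000_000_000          # e.g. an assumed 7B-parameter LLM
bytes_per_param = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

for precision, nbytes in bytes_per_param.items():
    size_gb = params * nbytes / 1e9
    print(f"{precision}: ~{size_gb:.1f} GB of weights")

# FP32 -> ~28 GB vs INT8 -> ~7 GB: the 4x reduction comes from 4 bytes vs 1 byte per weight.
```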
Let's dive deeper into the first quantization technique, PTQ.
What is it?
Converting a pre-trained FP32 model after training is complete.
Pros: it's simpler and faster to implement because we don't need to modify
the training pipeline. No retraining is needed.
Cons: it generally leads to higher accuracy loss compared to QAT,
especially when going to very low precision like INT4.
It often requires a calibration dataset,
a small representative dataset used to determine the best
way to map the FP32 ranges to INT8 or INT4 ranges.
For many models, PTQ to INT8 results in less than 2% accuracy degradation,
which is often acceptable because edge use cases are typically not very
generic; they are very domain specific.
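For a concrete picture, here is a minimal sketch of post-training INT8 quantization with TensorFlow Lite, one possible toolchain; `saved_model_dir` and `calibration_samples` are hypothetical placeholders for your own model and calibration data.

```python
# Minimal sketch of post-training INT8 quantization (PTQ) with TensorFlow Lite.
# `saved_model_dir` and `calibration_samples` are placeholders you would supply.
import tensorflow as tf

def representative_dataset():
    # A few hundred representative inputs calibrate the FP32 -> INT8 ranges.
    for sample in calibration_samples:
        yield [sample.astype("float32")]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full integer quantization so integer-only accelerators (NPUs, Edge TPUs) can run it.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_int8_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_int8_model)
```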
Why is quantization, particularly to INT8, so beneficial for hardware?
Let's look at the specs for the NVIDIA A10 GPU, often found
in servers, but it illustrates a common principle in accelerators.
Notice the performance figure for standard FP32 math:
it delivers 31.2 teraflops. But look at the peak INT8
performance using its tensor cores:
250 tera operations per second (TOPS),
potentially up to 500 TOPS with sparsity features.
This massive increase in throughput for INT8 operations compared
to FP32 is why quantization is so effective and so critical.
Hardware accelerators are often specifically designed to perform low-precision
integer math much faster and more power efficiently than floating-point math.
Next quantization technique: quantization-aware training (QAT).
What is it? Simulating the effects of quantization during
the model training process.
It generally achieves better accuracy preservation compared to PTQ,
especially at lower bit depths, because the model learns to compensate
for the precision loss during training.
What are the cons?
It's more complex to implement, as you need to modify your
training code and pipeline, it requires access to the original
training data and training infrastructure, and training time is
longer. QAT often keeps accuracy degradation below 1.5% for INT8,
potentially even better than PTQ.
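Here is a minimal sketch of what QAT could look like with the TensorFlow Model Optimization toolkit, one possible toolchain; `base_model`, `train_ds`, and `val_ds` are hypothetical placeholders, and the wrapper only supports certain Keras layer types.

```python
# Minimal sketch of quantization-aware training (QAT) with tfmot.
# `base_model`, `train_ds`, and `val_ds` are placeholders for your own model and data.
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Wrap the FP32 Keras model with fake-quantization ops so training
# "sees" the INT8 rounding it will face at inference time.
qat_model = tfmot.quantization.keras.quantize_model(base_model)

qat_model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
# A short fine-tune is usually enough for the weights to adapt to quantization noise.
qat_model.fit(train_ds, validation_data=val_ds, epochs=3)

# Convert to an actual quantized TFLite model for deployment.
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_qat_model = converter.convert()
```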
Next: dynamic range quantization.
This means adapting the quantization parameters, like the scaling factor,
based on the range of activation values observed at runtime, rather than
using fixed parameters determined offline. What are the pros?
It offers a good balance between the simplicity of PTQ and the accuracy of QAT.
It adapts to the characteristics of the input data being processed.
It can also reduce memory bandwidth needs, as activation ranges
are determined on the fly.
What are the cons?
There can be a potential runtime overhead associated with calculating
these ranges during inference, because this runs on the edge,
although it is often minimal on supported hardware.
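For comparison with the PTQ sketch above, here is a minimal sketch of dynamic range quantization in TensorFlow Lite; note that no calibration dataset is needed, and `saved_model_dir` is again a hypothetical placeholder.

```python
# Minimal sketch of dynamic range quantization in TensorFlow Lite:
# weights are stored as INT8, activation ranges are computed on the fly at runtime.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # no representative_dataset set
tflite_dynamic_model = converter.convert()

with open("model_dynamic_range.tflite", "wb") as f:
    f.write(tflite_dynamic_model)
```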
The next quantization technique is mixed precision:
using different numerical precision levels for different
layers within the same model.
What are the pros of this approach?
It provides an excellent way to balance compression and accuracy.
We can keep critical, sensitive layers at higher precision,
for example FP16 or FP32, while quantizing other, more robust
layers aggressively to INT8 or INT4.
The cons? It requires careful analysis to determine which layers are sensitive.
It also needs framework and hardware support to handle computations
involving multiple data types within the same inference path.
Studies have shown potential for significant compression,
for example 30 to 70%, with minimal accuracy loss, like less than 1%,
using mixed precision, because converting some of the layers to
lower-precision integers offers more model compression.
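As a rough sketch of the mixed-precision idea, here is one way to quantize only selected layers with the TensorFlow Model Optimization toolkit; `base_model` and `SENSITIVE_LAYER_NAMES` are hypothetical placeholders you would derive from your own sensitivity analysis, and only supported Keras layer types can be annotated.

```python
# Minimal sketch of a mixed-precision strategy with tfmot: only layers judged
# insensitive to precision loss are annotated for quantization, the rest stay in float.
import tensorflow as tf
import tensorflow_model_optimization as tfmot

quantize_annotate = tfmot.quantization.keras.quantize_annotate_layer

def annotate(layer):
    # Keep sensitive layers (e.g. the first conv or the final classifier) in float,
    # annotate everything else for quantization.
    if layer.name in SENSITIVE_LAYER_NAMES:
        return layer
    return quantize_annotate(layer)

annotated_model = tf.keras.models.clone_model(base_model, clone_function=annotate)
# quantize_apply inserts fake-quant ops only around the annotated layers.
mixed_precision_model = tfmot.quantization.keras.quantize_apply(annotated_model)
```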
The next major optimization technique is neural network pruning.
The core idea is that many large neural networks are over-parameterized:
they have redundant weights or neurons that don't contribute
much to the final output.
Pruning aims to identify and remove these non-essential parts.
The process generally involves identifying redundancy by analyzing
weights or activation patterns, then removing those elements.
This can be structured pruning, where entire filters or neurons are removed,
often better for hardware speedups, or unstructured pruning, where
individual weights are removed, leading to sparse networks.
Techniques like iterative magnitude pruning gradually remove small weights
while retraining the network to maintain accuracy.
Sensitivity analysis often helps to evaluate how removing certain
parts impacts overall performance.
Effective pruning can reduce models significantly, sometimes 50 to 90%
with very minimal impact on accuracy, creating leaner, faster networks.
So here's an overview of the pruning process.
Step one: train the initial network.
We start by training a potentially over-parameterized network until it
converges and performs well. Step two: identify
and remove unimportant elements.
This is the crucial step where we apply criteria like weight
magnitude, activation frequency, or impact on loss, to decide which parts
are unimportant and remove them.
This might involve setting weights to zero (unstructured) or removing
entire structures like filters (structured).
Then step three: fine-tune the pruned network.
After removing elements, the network's performance usually
drops, so we retrain the smaller pruned network for a few cycles.
This allows the remaining weights to adapt and often recovers most,
if not all, of the lost accuracy.
This cycle of pruning and fine-tuning can be repeated iteratively to reach
a desired level of sparsity and size.
Let's look at specific pruning types.
Magnitude-based pruning is perhaps the simplest and most common.
It operates at the granularity of either individual weights (weight pruning,
which is unstructured) or entire units, neurons or filters (which is
structured pruning).
For weight pruning, we simply remove weights with the lowest absolute
values, closest to zero, assuming they contribute least. For structured pruning,
we might remove entire filters based on the sum or average
magnitude of their weights.
The result: unstructured pruning leads to sparse, irregular weight
matrices, while structured pruning results in smaller but still dense
matrices, which are often easier for standard hardware to accelerate.
What are the pros?
A simple concept and high compression potential, especially unstructured.
Structured pruning often gives better inference speedups on standard hardware.
What are the cons?
Unstructured pruning often needs specialized hardware or libraries for real
speedups, while structured pruning is coarser and might impact accuracy more.
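For a concrete picture of magnitude-based pruning, here is a minimal sketch using the TensorFlow Model Optimization toolkit; `base_model`, `train_ds`, and the sparsity schedule values are assumed placeholders, not numbers from the talk.

```python
# Minimal sketch of magnitude-based (unstructured) pruning with tfmot:
# weights with the smallest absolute values are gradually zeroed out during fine-tuning.
import tensorflow_model_optimization as tfmot

pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0,
    final_sparsity=0.8,      # aim for 80% of weights set to zero
    begin_step=0,
    end_step=10_000,
)

pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    base_model, pruning_schedule=pruning_schedule)

pruned_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# UpdatePruningStep applies the mask updates at each training step.
pruned_model.fit(train_ds, epochs=2,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Strip the pruning wrappers before export; the zeroed weights remain.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
```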
The next pruning technique is importance-based pruning.
Beyond just magnitude, importance-based pruning
uses more sophisticated metrics to decide what to remove.
What is the mechanism?
Instead of just looking at the weight values, it might consider
the effect on the loss function if the weight were removed,
using techniques like Taylor expansion or optimal brain damage,
or other algorithms analyze neuron activations
or gradient information during training.
The goal is to make a more informed, sophisticated selection of which elements
are truly unimportant to the network's function, potentially preserving accuracy
better than the simple magnitude-based pruning technique.
Next is iterative pruning. What's the mechanism here?
As we see in the process diagram, it involves repeating cycles of
prune and fine-tune.
We remove a small percentage of weights or neurons, retrain briefly, then
remove some more, retrain again, and so on.
The goal is to reach the target sparsity or size gradually, rather
than removing everything at once.
This gradual approach generally leads to better accuracy preservation
for a given level of sparsity compared to one-shot pruning.
What are the trade-offs? It's more time consuming due to the multiple
pruning and retraining steps.
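Here is a minimal sketch of such an iterative prune-and-fine-tune loop using PyTorch's pruning utilities; `model`, `fine_tune`, and `evaluate` are hypothetical placeholders for your own training code, and the round count and per-round sparsity are just example values.

```python
# Minimal sketch of iterative magnitude pruning in PyTorch: prune a little,
# fine-tune briefly, repeat. `model`, `fine_tune`, and `evaluate` are placeholders.
import torch.nn as nn
import torch.nn.utils.prune as prune

def prunable_params(model):
    return [(m, "weight") for m in model.modules()
            if isinstance(m, (nn.Linear, nn.Conv2d))]

ROUNDS = 5
PER_ROUND_SPARSITY = 0.2   # prune 20% of the remaining weights each round

for round_idx in range(ROUNDS):
    # Zero out the smallest-magnitude weights globally across all prunable layers.
    prune.global_unstructured(prunable_params(model),
                              pruning_method=prune.L1Unstructured,
                              amount=PER_ROUND_SPARSITY)
    fine_tune(model, epochs=1)          # let the surviving weights adapt
    print(f"round {round_idx}: accuracy = {evaluate(model):.3f}")

# Make the pruning permanent by folding the masks into the weights.
for module, name in prunable_params(model):
    prune.remove(module, name)
```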
A fascinating related concept is the lottery ticket hypothesis.
It proposes that dense, randomly initialized networks contain sparse
subnetworks, called winning tickets, that, trained in isolation from the
start using their original initial weights, can achieve performance
comparable to the full dense network.
Magnitude pruning, especially iterative pruning, is effectively
a method to find these winning tickets within the large network.
What is the implication?
It suggests that inherent sparsity might be a fundamental
property of trainable neural networks.
We don't necessarily need the huge dense network;
we just need to find the right sparse structure within it.
This provides theoretical backing for why pruning can be so effective
without catastrophic accuracy loss, which is important for the edge.
The next technique is knowledge distillation for the edge. Imagine we have a
large, complex, highly accurate teacher model.
It performs great, but it's too slow or large for the edge.
We also have a much smaller student model, which is lightweight enough
for edge deployment, but might not be as accurate on its own.
Knowledge distillation works by training the student model not just on the
ground-truth labels, but also to mimic the outputs of the teacher model.
Specifically, the student learns from the teacher's soft targets:
the full probability distribution across all classes that the teacher
predicts, even for incorrect classes.
This encodes richer information about how the teacher model thinks and generalizes.
This allows the small student model to benefit from the knowledge learned by
the much larger teacher, often achieving significantly better accuracy than if
it were trained only on hard labels.
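As a concrete illustration, here is a minimal sketch of the standard soft-target distillation loss in PyTorch; the temperature and weighting values are just typical assumptions, not numbers from the talk.

```python
# Minimal sketch of a soft-target knowledge-distillation loss.
# `student_logits`, `teacher_logits`, and `labels` are placeholders produced elsewhere.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    # Soft targets: match the teacher's full probability distribution,
    # softened by a temperature so small probabilities still carry signal.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard targets: the usual cross-entropy against ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss
```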
Knowledge distillation techniques can be categorized by the type of knowledge
transferred. Response-based distillation is the most common type, focusing on
matching the output layer of the student to the teacher,
using those soft labels that were mentioned; it captures the teacher's
final prediction behavior.
Examples show significant parameter reduction while retaining high
quality, up to 95% of the teacher's.
Feature-based distillation goes deeper, trying to match the activations in
intermediate layers of the student and teacher.
The goal is to encourage the student to develop internal feature
representations similar to those of the teacher.
We can also categorize distillation by how the training happens.
Offline distillation is the standard approach:
first train the teacher model completely, and then use the frozen
pre-trained teacher model to train the student.
Then there is online distillation.
Here the teacher and student models are trained simultaneously.
They learn together, potentially influencing each other, which can
sometimes lead to better results than the sequential offline approach.
Self-distillation is an interesting variant where a model distills knowledge
from itself. Often deeper layers of the network act as a teacher for shallower
layers acting as the student.
This can improve the performance of a single model without
needing a separate teacher.
This is used when there is a very large gap between teacher and
student size or capability, or when we are optimizing existing large models.
Another crucial strategy is to use edge-optimized model
architectures from the start.
These are neural network designs created specifically for efficiency
on resource-constrained devices.
Examples include MobileNets,
famous for using depthwise separable convolutions, which drastically reduce
computation, roughly eight to nine times less than standard convolutions.
Great for vision tasks. ShuffleNets use channel shuffling and grouped
convolutions for further optimization on low-power devices.
And then SqueezeNet employs fire modules to reduce parameter counts while
maintaining classification accuracy.
Then EfficientNet uses a compound scaling method to
intelligently balance network width, depth, and input resolution for excellent
efficiency and accuracy trade-offs.
Using these pre-built architectures is often more effective than simply
shrinking a standard large architecture, as they incorporate efficiency
principles directly into their design.
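To make the depthwise separable convolution idea concrete, here is a minimal MobileNet-style block sketched in PyTorch; the channel counts in the cost comparison are just an example.

```python
# Minimal sketch of a MobileNet-style depthwise separable convolution block:
# a per-channel 3x3 depthwise conv followed by a 1x1 pointwise conv,
# instead of one dense 3x3 conv over all channels.
import torch.nn as nn

def depthwise_separable_conv(in_ch, out_ch, stride=1):
    return nn.Sequential(
        # Depthwise: one 3x3 filter per input channel (groups=in_ch)
        nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                  padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        # Pointwise: 1x1 conv mixes information across channels
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# Rough cost comparison for a 3x3 conv, 256 -> 256 channels, per output pixel:
# standard:  3*3*256*256 = 589,824 multiply-accumulates
# separable: 3*3*256 + 256*256 = 67,840, roughly 8.7x fewer, matching the 8-9x figure.
```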
Let's look at a practical example: an edge voice assistant using a combination
of the techniques, likely neural network pruning and 8-bit quantization.
A baseline model was optimized.
The optimized model achieved 98.2% wake word accuracy with a very low
false activation rate under 0.5%.
It managed 87.3 command recognition accuracy across
different noisy environments.
Crucially, the model size for the entire system Wake.
Workplace command recognition was reduced to just 76 mb, a 73% reduction
from baseline, and the response latency was only 85 milliseconds.
End to end well within the threshold for a real time fee.
Knowledge distillation from a larger teacher model likely helped
maintain the high accuracy despite the significant size reduction.
This demonstrates how these techniques deliver tangible, real-world
performance benefits at the edge.
Now, takeaways and implementation details. What are the key points?
First, establish performance targets:
be clear about latency, accuracy, power, and size budgets before
starting optimization.
Second, apply integrated optimization.
Don't rely on just one technique;
combine pruning, quantization, distillation, and potentially architectural
choices for the best results. Their effects are often multiplicative.
Third, test on target hardware; emulators aren't enough.
Benchmark and profile directly on the edge devices we deploy onto
to find the real bottlenecks.
Fourth, iterate with real-world data.
Edge environments can be unpredictable, so continuously collect data and
refine the models after deployment. Successfully
deploying generative AI at the edge needs a systematic, iterative approach
considering the whole pipeline.
A quick word on benchmarking: it's absolutely essential.
We need to define metrics clearly
(latency, throughput, accuracy, power), create a standardized test environment
simulating real deployment conditions, measure performance comprehensively
across different hardware and workloads, and, critically, use benchmark results
to further guide optimization iterations.
Systematic benchmarking validates the optimization efforts and ensures
the solutions perform reliably and consistently in the real world.
This is important because, as you have seen, there are multiple choices to make.
Which techniques, and which combination of techniques, to apply can only be
determined if we have standardized benchmarking, so we can run it
iteratively and find out whether the combination actually reaches our targets.
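As one way to put this into practice, here is a minimal latency-benchmarking sketch for a TFLite model; `model.tflite` is a placeholder, and the numbers only mean something when run on the actual target device.

```python
# Minimal sketch of on-device latency benchmarking for a TFLite model.
# "model.tflite" is a placeholder; run this on the actual target hardware.
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
dummy = np.zeros(inp["shape"], dtype=inp["dtype"])

# Warm up so one-time allocation costs don't skew the measurements.
for _ in range(10):
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()

latencies = []
for _ in range(200):
    interpreter.set_tensor(inp["index"], dummy)
    start = time.perf_counter()
    interpreter.invoke()
    latencies.append((time.perf_counter() - start) * 1000)

print(f"p50 latency: {np.percentile(latencies, 50):.1f} ms, "
      f"p95: {np.percentile(latencies, 95):.1f} ms")
```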
Then finally, the implementation roadmap.
What are the practical steps? First, audit model
requirements: understand the constraints and performance needs. Then prototype
optimization pipelines: test techniques individually first to see their impact.
Then implement a combined approach:
integrate the chosen techniques, pruning, quantization, and distillation,
carefully. And finally, deploy and monitor.
This is the observability aspect:
release the optimized model, but include telemetry to gather
real-world performance data for continuous improvement.
Start with an audit, build systematically, validate thoroughly,
and monitor continuously.
That concludes our look into engineering low-latency generative AI for the edge.
By applying techniques like quantization, pruning, and knowledge distillation,
and choosing appropriate architectures and hardware, we can overcome resource
constraints and unlock powerful AI capabilities directly at the edge.
Thank you for your time and attention, and have a good day.