Transcript
This transcript was autogenerated. To make changes, submit a PR.
Good morning or afternoon, everyone.
Today we are diving into a critical and exciting area: edge generative AI.
Specifically, we'll be exploring how to engineer low-latency AI solutions
designed for resource-constrained environments. As businesses increasingly
look for real-time AI capabilities right where the action happens, at the edge,
understanding how to optimize these powerful generative
models becomes essential.
We look at systematic ways to maintain performance while cutting
down on computational needs, power consumption, and latency.
This opens up fascinating new possibilities for intelligent
applications directly on edge devices.
Just a little bit about myself.
I'm Ian, a principal software engineer at SD Engineering.
My work focuses on enhancing global connectivity through advanced
IP satellite network infrastructure.
My experience in network engineering and software development gives me a
practical perspective on the challenges and opportunities of deploying
complex systems like AI in diverse environments, including the edge.
You can find more details or connect with me on LinkedIn.
So why is Edge AI challenging?
Let's look at the edge computing frontier.
Unlike the seemingly limitless resources in the cloud, edge devices
operate under significant constraints.
First, limited computational resources.
Think processing power, memory, and storage: edge devices typically
have much less than cloud servers.
Second, connectivity challenges.
Edge solutions often need to work reliably, even with spotty, slow,
or sometimes no internet connection.
Third, energy constraints.
Many edge devices run on batteries or have strict power budgets limiting
how complex our AI models can be.
And finally, real-time requirements.
Many edge applications, like autonomous vehicles or industrial monitoring,
demand immediate, low-latency responses, putting high performance demands
on these constrained devices.
Despite these challenges, the demand for edge AI is booming.
Let's look at market trends.
We are seeing strong expansion.
The edge AI accelerator market, the specialized hardware, is growing incredibly
fast, at nearly 39% CAGR, projected to hit almost $7.7 billion by 2027.
This signals huge demand for on-device AI.
What's driving this? The key adoption drivers are primarily the need for low
latency: over 78% of organizations cite this as crucial for real-time responses.
Then data privacy is another major factor, cited by almost 65% of organizations.
Pushing processing locally instead of sending sensitive data to
the cloud is important for them.
And the industry is adapting.
Model optimization techniques, which we'll discuss heavily today, are
already used in 82% of deployments.
This shows a clear focus on making complex AI run effectively on edge hardware.
Next, edge AI market evolution.
Let's look ahead.
We are really at an inflection point.
Currently, generative AI at the edge is still limited by those
resource constraints we mentioned.
Over the next 12 months, we expect increasing adoption of optimized edge
AI solutions. Within 18 to 24 months, the prediction is that most enterprises
needing real-time edge capabilities will be actively deploying them.
Beyond 2025, edge AI is likely to become ubiquitous, supported by
specialized hardware acceleration.
This rapid evolution is driven by hardware advancements, better optimization
techniques like quantization and pruning, and the growing need for
privacy-preserving local computation.
The shift away from cloud dependency for realtime tasks is well underway.
Market trends: just to reiterate those key market drivers, because they
underscore the why behind edge optimization.
The market growth is significant: as mentioned, a 39% CAGR for accelerators
and a roughly $7.6 billion market soon.
The primary needs are low latency for immediate responses and data privacy,
both favoring local on-device processing. And crucially, the industry is already
heavily invested in model optimization, used in almost 82% of deployments
to overcome hardware limitations.
This sets the stage for the techniques we are going to
explore in the rest of the talk.
Acceleration solutions.
While software optimization is key, hardware plays a vital role.
Specialized hardware acceleration solutions dramatically boost performance
and energy efficiency at the edge.
We have neural processing units (NPUs), dedicated AI accelerators now common in
modern systems-on-chip; they offer 10 to 15 times the energy efficiency of a
standard CPU for typical neural network tasks.
Google offers Edge TPUs, purpose-built edge accelerators providing significant
processing power, up to four tera operations per second in a small,
low-power, roughly two-watt package.
That's impressive.
Then mobile GPUs are also increasingly optimized for AI, leveraging their
parallel processing power, and FPGA accelerators offer reconfigurable hardware
adapted to specific AI models, often very power efficient in production.
These accelerators are designed specifically for the math-intensive
operations common in AI, like matrix multiplication.
Edge GPUs overview.
This table gives a snapshot of the diverse landscape of edge GPUs and accelerators.
Don't worry about memorizing the details, but notice the key players:
the NVIDIA Jetson series (Nano, NX, AGX) and RTX Ada, AMD Ryzen AI, Intel Core
Ultra with its NPU and Arc GPU, Google Coral Edge TPUs, Qualcomm Snapdragon NPUs,
and specialist edge accelerators.
The important takeaway here is the variation in AI performance,
memory capacity and bandwidth, power consumption, and form factor.
Choosing the right hardware depends heavily on specific application
needs regarding performance, power, budget, and physical size.
This plays a critical role in the process of developing these models.
Now let's look at the inference performance specifically for
large language models, which are notoriously resource intensive.
Again, this table is illustrative based on available data, which can be fragmented.
Key things to note.
High-end desktop GPUs like the RTX 4090 or 3090 achieve very high throughput
in tokens per second using optimized frameworks like NVIDIA TensorRT-LLM,
significantly outperforming less optimized methods like llama.cpp on the
same hardware. For edge-specific hardware like the NVIDIA Jetson Orin series,
AMD Ryzen AI, and Intel Core Ultra, running 7B-class models like Llama, often
quantized to INT4, is becoming feasible, especially with dedicated frameworks
like TensorRT-LLM, ONNX Runtime with Vitis AI, IPEX-LLM, and OpenVINO.
However, standardized benchmarking numbers for throughput and latency on these
edge platforms are still emerging and depend heavily on software maturity and
optimization levels. That's why we see missing data in specific rows and columns.
Power consumption is drastically lower for edge devices compared to desktop
GPUs, highlighting the efficiency gains, but also the performance trade-offs.
Beyond physical hardware: hardware is only part of the equation.
Software, specifically AI frameworks and optimization engines, is crucial
for actually running models efficiently on the hardware.
Key AI frameworks like TensorFlow Lite and PyTorch Mobile are
designed for edge deployment.
NVIDIA TensorRT and its LLM variant are highly optimized for their GPUs.
Inference optimization engines like ONNX Runtime and Intel's OpenVINO
focus on optimizing models for efficient execution across various hardware.
Vendors also provide specific software stacks, like NVIDIA JetPack,
AMD Ryzen AI software, Intel's oneAPI tools, and Qualcomm's AI Stack, which
bundle drivers, libraries, and tools for their respective hardware.
Choosing the right framework involves trade-offs.
As this chart illustrates, TensorFlow Lite generally offers excellent
hardware support through its delegation system, allowing it to
leverage NPUs and GPUs effectively. ONNX Runtime often provides
slightly better raw execution speed and great cross-platform portability.
PyTorch Mobile is often praised for its developer experience and
simpler deployment, making it good for rapid prototyping.
Apache TVM can achieve the best performance through deep
compiler optimizations, but also comes with significantly
higher deployment complexity.
So the choice depends on your priorities: for broad hardware support, TF Lite;
for raw speed and portability, ONNX Runtime is the one to explore; for ease of
use, PyTorch Mobile; for maximum performance at the cost of complexity, TVM.
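As a small illustration of the TensorFlow Lite route, here is a minimal sketch of loading a model with an optional hardware delegate and falling back to the CPU. The model file and delegate library name are hypothetical placeholders; the exact delegate depends on your accelerator.

```python
# Minimal sketch: run a TFLite model, using a hardware delegate if one is available.
# "model.tflite" and the delegate library are placeholders for your own artifacts.
import numpy as np
import tensorflow as tf

try:
    # e.g. an Edge TPU or vendor NPU delegate shared library, if present on the device
    delegate = tf.lite.experimental.load_delegate("libedgetpu.so.1")
    interpreter = tf.lite.Interpreter(model_path="model.tflite",
                                      experimental_delegates=[delegate])
except (ValueError, OSError):
    # Fall back to the built-in CPU kernels when no accelerator delegate is found
    interpreter = tf.lite.Interpreter(model_path="model.tflite")

interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

dummy_input = np.zeros(inp["shape"], dtype=inp["dtype"])
interpreter.set_tensor(inp["index"], dummy_input)
interpreter.invoke()
prediction = interpreter.get_tensor(out["index"])
```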
Optimization techniques: now let's get to the core software optimization
techniques, especially critical for large models like LLMs on the edge.
The most crucial technique, and the one we will spend significant time
on, is quantization: reducing the precision of the numbers used in the model.
Next is network pruning: removing redundant parts of the neural network.
Then there is knowledge distillation: training a smaller student model
to mimic a larger teacher model.
Other methods include low-rank approximation or factorization
and various memory optimizations.
Today we'll focus primarily on the top three: quantization, pruning,
and knowledge distillation.
Let's start with quantization.
The fundamental idea is to use fewer bits to represent the model's weights
and activations, reducing size and often speeding up computation.
There are several approaches. Post-training quantization (PTQ) is the simplest:
we train our model normally, usually in 32-bit floating point, and then convert
it to lower precision, like 8-bit integer (INT8) or even INT4, afterwards.
It's easier, but might have a moderate accuracy cost.
Next is quantization-aware training (QAT), which incorporates the effects of
quantization during the training process; the model learns to be robust to
lower precision. It's more complex, but usually preserves accuracy better.
Then dynamic range quantization adjusts how quantization is applied
based on the actual activation values encountered during inference,
offering a balance between accuracy and performance,
especially good for varying inputs.
Precision and quantization: this visual helps illustrate the concept.
Standard training uses FP32.
PTQ simply converts these FP32 weights to INT8 or INT4
after training. QAT simulates this conversion during training, allowing the
model to adapt. An advanced technique is mixed-precision deployment.
Here, you don't have to quantize the entire model uniformly.
You can analyze which layers are most sensitive to precision loss and keep
them at higher precision like FP32 or FP16, while aggressively quantizing
less sensitive layers to INT8 or INT4.
This offers a fine-grained trade-off.
Quantization is powerful.
Moving from FP32 to INT8 can make models four times smaller
and inference three to four times faster,
especially on hardware with INT8 support.
Inference is the stage where the model makes predictions on unseen data.
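To make the "four times smaller" figure concrete, here is a quick back-of-the-envelope sketch; the 7B parameter count is just an assumed example, and real file sizes also include metadata and non-quantized tensors.

```python
# Back-of-the-envelope weight storage at different precisions (illustrative only).
params = 7_000_000_000          # e.g. an assumed 7B-parameter LLM
bytes_per_param = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

for precision, nbytes in bytes_per_param.items():
    size_gb = params * nbytes / 1e9
    print(f"{precision}: ~{size_gb:.1f} GB of weights")

# FP32 -> ~28 GB vs INT8 -> ~7 GB: the 4x reduction comes from 4 bytes vs 1 byte per weight.
```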
Let's dive deeper into the first quantization technique, PTQ.
What is it?
Converting a pre-trained FP32 model after training is complete.
Pros: it's simpler and faster to implement because we don't need to modify
the training pipeline. No retraining is needed.
Cons: it generally leads to higher accuracy loss compared to QAT,
especially when going to very low precision like INT4.
It often requires a calibration dataset,
a small representative dataset used to determine the best
way to map the FP32 ranges to INT8 or INT4 ranges.
For many models, PTQ to INT8 results in less than 2% accuracy degradation,
which is often acceptable because edge use cases are typically not very
generic; they are very domain specific.
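For a concrete picture, here is a minimal sketch of post-training INT8 quantization with TensorFlow Lite, one possible toolchain; `saved_model_dir` and `calibration_samples` are hypothetical placeholders for your own model and calibration data.

```python
# Minimal sketch of post-training INT8 quantization (PTQ) with TensorFlow Lite.
# `saved_model_dir` and `calibration_samples` are placeholders you would supply.
import tensorflow as tf

def representative_dataset():
    # A few hundred representative inputs calibrate the FP32 -> INT8 ranges.
    for sample in calibration_samples:
        yield [sample.astype("float32")]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full integer quantization so integer-only accelerators (NPUs, Edge TPUs) can run it.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_int8_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_int8_model)
```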
Why is quantization, particularly to INT8, so beneficial for hardware?
Let's look at the specs for the NVIDIA A10 GPU, often found
in servers, but it illustrates a common principle in accelerators.
Notice the performance figure for standard FP32 math:
it delivers 31.2 teraflops. But look at the peak INT8
performance using its tensor cores:
250 tera operations per second (TOPS),
potentially up to 500 TOPS with sparsity features.
This massive increase in throughput for INT8 operations compared
to FP32 is why quantization is so effective and so critical.
Hardware accelerators are often specifically designed to perform low-precision
integer math much faster and more power efficiently than floating-point math.
Next quantization technique: quantization-aware training (QAT).
What is it? Simulating the effects of quantization during
the model training process.
It generally achieves better accuracy preservation compared to PTQ,
especially at lower bit depths, because the model learns to compensate
for the precision loss during training.
What are the cons?
It's more complex to implement, as you need to modify your
training code and pipeline, it requires access to the original
training data and training infrastructure, and training time is
longer. QAT often keeps accuracy degradation below 1.5% for INT8,
potentially even better than PTQ.
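Here is a minimal sketch of what QAT could look like with the TensorFlow Model Optimization toolkit, one possible toolchain; `base_model`, `train_ds`, and `val_ds` are hypothetical placeholders, and the wrapper only supports certain Keras layer types.

```python
# Minimal sketch of quantization-aware training (QAT) with tfmot.
# `base_model`, `train_ds`, and `val_ds` are placeholders for your own model and data.
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Wrap the FP32 Keras model with fake-quantization ops so training
# "sees" the INT8 rounding it will face at inference time.
qat_model = tfmot.quantization.keras.quantize_model(base_model)

qat_model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
# A short fine-tune is usually enough for the weights to adapt to quantization noise.
qat_model.fit(train_ds, validation_data=val_ds, epochs=3)

# Convert to an actual quantized TFLite model for deployment.
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_qat_model = converter.convert()
```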
Next: dynamic range quantization.
This means adapting the quantization parameters, like the scaling factor,
based on the range of activation values observed at runtime, rather than
using fixed parameters determined offline. What are the pros?
It offers a good balance between the simplicity of PTQ and the accuracy of QAT.
It adapts to the characteristics of the input data being processed.
It can also reduce memory bandwidth needs, as activation ranges
are determined on the fly.
What are the cons?
There can be a potential runtime overhead associated with calculating
these ranges during inference, because this runs on the edge,
although it is often minimal on supported hardware.
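For comparison with the PTQ sketch above, here is a minimal sketch of dynamic range quantization in TensorFlow Lite; note that no calibration dataset is needed, and `saved_model_dir` is again a hypothetical placeholder.

```python
# Minimal sketch of dynamic range quantization in TensorFlow Lite:
# weights are stored as INT8, activation ranges are computed on the fly at runtime.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # no representative_dataset set
tflite_dynamic_model = converter.convert()

with open("model_dynamic_range.tflite", "wb") as f:
    f.write(tflite_dynamic_model)
```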
The next quantization technique is mixed precision:
using different numerical precision levels for different
layers within the same model.
What are the pros of this approach?
It provides an excellent way to balance compression and accuracy.
We can keep critical, sensitive layers at higher precision,
for example FP16 or FP32, while quantizing other, more robust
layers aggressively to INT8 or INT4.
The cons? It requires careful analysis to determine which layers are sensitive.
It also needs framework and hardware support to handle computations
involving multiple data types within the same inference path.
Studies have shown potential for significant compression,
for example 30 to 70%, with minimal accuracy loss, like less than 1%,
using mixed precision, because converting some of the layers to
lower-precision integers offers more model compression.
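As a rough sketch of the mixed-precision idea, here is one way to quantize only selected layers with the TensorFlow Model Optimization toolkit; `base_model` and `SENSITIVE_LAYER_NAMES` are hypothetical placeholders you would derive from your own sensitivity analysis, and only supported Keras layer types can be annotated.

```python
# Minimal sketch of a mixed-precision strategy with tfmot: only layers judged
# insensitive to precision loss are annotated for quantization, the rest stay in float.
import tensorflow as tf
import tensorflow_model_optimization as tfmot

quantize_annotate = tfmot.quantization.keras.quantize_annotate_layer

def annotate(layer):
    # Keep sensitive layers (e.g. the first conv or the final classifier) in float,
    # annotate everything else for quantization.
    if layer.name in SENSITIVE_LAYER_NAMES:
        return layer
    return quantize_annotate(layer)

annotated_model = tf.keras.models.clone_model(base_model, clone_function=annotate)
# quantize_apply inserts fake-quant ops only around the annotated layers.
mixed_precision_model = tfmot.quantization.keras.quantize_apply(annotated_model)
```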
The next major optimization technique is neural network pruning.
The core idea is that many large neural networks are over-parameterized:
they have redundant weights or neurons that don't contribute
much to the final output.
Pruning aims to identify and remove these non-essential parts.
The process generally involves identifying redundancy by analyzing
weights or activation patterns, then removing those elements.
This can be structured pruning, where entire filters or neurons are removed,
often better for hardware speedups, or unstructured pruning, where
individual weights are removed, leading to sparse networks.
Techniques like iterative magnitude pruning gradually remove small weights
while retraining the network to maintain accuracy.
Sensitivity analysis often helps to evaluate how removing certain
parts impacts overall performance.
Effective pruning can reduce models significantly, sometimes 50 to 90%
with very minimal impact on accuracy, creating leaner, faster networks.
So here's an overview of the pruning process.
Step one: train the initial network.
We start by training a potentially over-parameterized network until it
converges and performs well. Step two: identify
and remove unimportant elements.
This is the crucial step where we apply criteria like weight
magnitude, activation frequency, or impact on loss, to decide which parts
are unimportant and remove them.
This might involve setting weights to zero (unstructured) or removing
entire structures like filters (structured).
Then step three: fine-tune the pruned network.
After removing elements, the network's performance usually
drops, so we retrain the smaller pruned network for a few cycles.
This allows the remaining weights to adapt and often recovers most,
if not all, of the lost accuracy.
This cycle of pruning and fine-tuning can be repeated iteratively to reach
a desired level of sparsity and size.
Let's look at specific pruning types.
Magnitude-based pruning is perhaps the simplest and most common.
It operates at the granularity of either individual weights (weight pruning,
which is unstructured) or entire units, neurons or filters (which is
structured pruning).
For weight pruning, we simply remove weights with the lowest absolute
values, closest to zero, assuming they contribute least. For structured pruning,
we might remove entire filters based on the sum or average
magnitude of their weights.
The result: unstructured pruning leads to sparse, irregular weight
matrices, while structured pruning results in smaller but still dense
matrices, which are often easier for standard hardware to accelerate.
What are the pros?
A simple concept and high compression potential, especially unstructured.
Structured pruning often gives better inference speedups on standard hardware.
What are the cons?
Unstructured pruning often needs specialized hardware or libraries for real
speedups, while structured pruning is coarser and might impact accuracy more.
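For a concrete picture of magnitude-based pruning, here is a minimal sketch using the TensorFlow Model Optimization toolkit; `base_model`, `train_ds`, and the sparsity schedule values are assumed placeholders, not numbers from the talk.

```python
# Minimal sketch of magnitude-based (unstructured) pruning with tfmot:
# weights with the smallest absolute values are gradually zeroed out during fine-tuning.
import tensorflow_model_optimization as tfmot

pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0,
    final_sparsity=0.8,      # aim for 80% of weights set to zero
    begin_step=0,
    end_step=10_000,
)

pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    base_model, pruning_schedule=pruning_schedule)

pruned_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# UpdatePruningStep applies the mask updates at each training step.
pruned_model.fit(train_ds, epochs=2,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Strip the pruning wrappers before export; the zeroed weights remain.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
```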
The next pruning technique is importance-based pruning.
Beyond just magnitude, importance-based pruning
uses more sophisticated metrics to decide what to remove.
What is the mechanism?
Instead of just looking at the weight values, it might consider
the effect on the loss function if the weight were removed,
using techniques like Taylor expansion or optimal brain damage,
or other algorithms analyze neuron activations
or gradient information during training.
The goal is to make a more informed, sophisticated selection of which elements
are truly unimportant to the network's function, potentially preserving accuracy
better than the simple magnitude-based pruning technique.
Next is iterative pruning. What's the mechanism here?
As we see in the process diagram, it involves repeating cycles of
prune and fine-tune.
We remove a small percentage of weights or neurons, retrain briefly, then
remove some more, retrain again, and so on.
The goal is to reach the target sparsity or size gradually, rather
than removing everything at once.
This gradual approach generally leads to better accuracy preservation
for a given level of sparsity compared to one-shot pruning.
What are the trade-offs? It's more time consuming due to the multiple
pruning and retraining steps.
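Here is a minimal sketch of such an iterative prune-and-fine-tune loop using PyTorch's pruning utilities; `model`, `fine_tune`, and `evaluate` are hypothetical placeholders for your own training code, and the round count and per-round sparsity are just example values.

```python
# Minimal sketch of iterative magnitude pruning in PyTorch: prune a little,
# fine-tune briefly, repeat. `model`, `fine_tune`, and `evaluate` are placeholders.
import torch.nn as nn
import torch.nn.utils.prune as prune

def prunable_params(model):
    return [(m, "weight") for m in model.modules()
            if isinstance(m, (nn.Linear, nn.Conv2d))]

ROUNDS = 5
PER_ROUND_SPARSITY = 0.2   # prune 20% of the remaining weights each round

for round_idx in range(ROUNDS):
    # Zero out the smallest-magnitude weights globally across all prunable layers.
    prune.global_unstructured(prunable_params(model),
                              pruning_method=prune.L1Unstructured,
                              amount=PER_ROUND_SPARSITY)
    fine_tune(model, epochs=1)          # let the surviving weights adapt
    print(f"round {round_idx}: accuracy = {evaluate(model):.3f}")

# Make the pruning permanent by folding the masks into the weights.
for module, name in prunable_params(model):
    prune.remove(module, name)
```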
A fascinating related concept is the lottery ticket hypothesis.
It proposes that dense, randomly initialized networks contain sparse
subnetworks, called winning tickets, that, trained in isolation from the
start using their original initial weights, can achieve performance
comparable to the full dense network.
Magnitude pruning, especially iterative pruning, is effectively
a method to find these winning tickets within the large network.
What is the implication?
It suggests that inherent sparsity might be a fundamental
property of trainable neural networks.
We don't necessarily need the huge dense network;
we just need to find the right sparse structure within it.
This provides theoretical backing for why pruning can be so effective
without catastrophic accuracy loss, which is important for the edge.
The next technique is knowledge distillation for the edge. Imagine we have a
large, complex, highly accurate teacher model.
It performs great, but it's too slow or large for the edge.
We also have a much smaller student model, which is lightweight enough
for edge deployment, but might not be as accurate on its own.
Knowledge distillation works by training the student model not just on the
ground-truth labels, but also to mimic the outputs of the teacher model.
Specifically, the student learns from the teacher's soft targets:
the full probability distribution across all classes that the teacher
predicts, even for incorrect classes.
This encodes richer information about how the teacher model thinks and generalizes.
This allows the small student model to benefit from the knowledge learned by
the much larger teacher, often achieving significantly better accuracy than if
it were trained only on hard labels.
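As a concrete illustration, here is a minimal sketch of the standard soft-target distillation loss in PyTorch; the temperature and weighting values are just typical assumptions, not numbers from the talk.

```python
# Minimal sketch of a soft-target knowledge-distillation loss.
# `student_logits`, `teacher_logits`, and `labels` are placeholders produced elsewhere.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    # Soft targets: match the teacher's full probability distribution,
    # softened by a temperature so small probabilities still carry signal.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard targets: the usual cross-entropy against ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss
```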
Knowledge distillation techniques can be categorized by the type of knowledge
transferred. Response-based distillation is the most common type, focusing on
matching the output layer of the student to the teacher,
using those soft labels that were mentioned; it captures the teacher's
final prediction behavior.
Examples show significant parameter reduction while retaining high
quality, up to 95% of the teacher's.
Feature-based distillation goes deeper, trying to match the activations in
intermediate layers of the student and teacher.
The goal is to encourage the student to develop internal feature
representations similar to those of the teacher.
We can also categorize distillation by how the training happens.
Offline distillation is the standard approach:
first train the teacher model completely, and then use the frozen
pre-trained teacher model to train the student.
Then there is online distillation.
Here the teacher and student models are trained simultaneously.
They learn together, potentially influencing each other, which can
sometimes lead to better results than the sequential offline approach.
Self-distillation is an interesting variant where a model distills knowledge
from itself. Often deeper layers of the network act as a teacher for shallower
layers acting as the student.
This can improve the performance of a single model without
needing a separate teacher.
This is used when there is a very large gap between teacher and
student size or capability, or when we are optimizing existing large models.
Another crucial strategy is to use edge-optimized model
architectures from the start.
These are neural network designs created specifically for efficiency
on resource-constrained devices.
Examples include MobileNets,
famous for using depthwise separable convolutions, which drastically reduce
computation, roughly eight to nine times less than standard convolutions.
Great for vision tasks. ShuffleNets use channel shuffling and grouped
convolutions for further optimization on low-power devices.
And then SqueezeNet employs fire modules to reduce parameter counts while
maintaining classification accuracy.
Then EfficientNet uses a compound scaling method to
intelligently balance network width, depth, and input resolution for excellent
efficiency and accuracy trade-offs.
Using these pre-built architectures is often more effective than simply
shrinking a standard large architecture, as they incorporate efficiency
principles directly into their design.
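To make the depthwise separable convolution idea concrete, here is a minimal MobileNet-style block sketched in PyTorch; the channel counts in the cost comparison are just an example.

```python
# Minimal sketch of a MobileNet-style depthwise separable convolution block:
# a per-channel 3x3 depthwise conv followed by a 1x1 pointwise conv,
# instead of one dense 3x3 conv over all channels.
import torch.nn as nn

def depthwise_separable_conv(in_ch, out_ch, stride=1):
    return nn.Sequential(
        # Depthwise: one 3x3 filter per input channel (groups=in_ch)
        nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                  padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        # Pointwise: 1x1 conv mixes information across channels
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# Rough cost comparison for a 3x3 conv, 256 -> 256 channels, per output pixel:
# standard:  3*3*256*256 = 589,824 multiply-accumulates
# separable: 3*3*256 + 256*256 = 67,840, roughly 8.7x fewer, matching the 8-9x figure.
```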
Let's look at a practical example: an edge voice assistant using a combination
of the techniques, likely neural network pruning and 8-bit quantization.
A baseline model was optimized.
The optimized model achieved 98.2% wake word accuracy with a very low
false activation rate under 0.5%.
It managed 87.3 command recognition accuracy across
different noisy environments.
Crucially, the model size for the entire system Wake.
Workplace command recognition was reduced to just 76 mb, a 73% reduction
from baseline, and the response latency was only 85 milliseconds.
End to end well within the threshold for a real time fee.
Knowledge distillation from a larger teacher model likely helped
maintain the high accuracy despite the significant size reduction.
This demonstrates how these techniques deliver tangible, real-world
performance benefits at the edge.
Now, takeaways and implementation details. What are the key points?
First, establish performance targets:
be clear about latency, accuracy, power, and size budgets before
starting optimization.
Second, apply integrated optimization.
Don't rely on just one technique;
combine pruning, quantization, distillation, and potentially architectural
choices for the best results. Their effects are often multiplicative.
Third, test on target hardware; emulators aren't enough.
Benchmark and profile directly on the edge devices we deploy onto
to find the real bottlenecks.
Fourth, iterate with real-world data.
Edge environments can be unpredictable, so continuously collect data and
refine the models after deployment. Successfully
deploying generative AI at the edge needs a systematic, iterative approach
considering the whole pipeline.
A quick word on benchmarking: it's absolutely essential.
We need to define metrics clearly
(latency, throughput, accuracy, power), create a standardized test environment
simulating real deployment conditions, measure performance comprehensively
across different hardware and workloads, and, critically, use benchmark results
to further guide optimization iterations.
Systematic benchmarking validates the optimization efforts and ensures
the solutions perform reliably and consistently in the real world.
This is important because, as you have seen, there are multiple choices to make.
Which techniques, and which combination of techniques, to apply can only be
determined if we have standardized benchmarking, so we can run it
iteratively and find out whether the combination actually reaches our targets.
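As one way to put this into practice, here is a minimal latency-benchmarking sketch for a TFLite model; `model.tflite` is a placeholder, and the numbers only mean something when run on the actual target device.

```python
# Minimal sketch of on-device latency benchmarking for a TFLite model.
# "model.tflite" is a placeholder; run this on the actual target hardware.
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
dummy = np.zeros(inp["shape"], dtype=inp["dtype"])

# Warm up so one-time allocation costs don't skew the measurements.
for _ in range(10):
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()

latencies = []
for _ in range(200):
    interpreter.set_tensor(inp["index"], dummy)
    start = time.perf_counter()
    interpreter.invoke()
    latencies.append((time.perf_counter() - start) * 1000)

print(f"p50 latency: {np.percentile(latencies, 50):.1f} ms, "
      f"p95: {np.percentile(latencies, 95):.1f} ms")
```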
Then finally, the implementation roadmap.
What are the practical steps? First, audit model
requirements: understand the constraints and performance needs. Then prototype
optimization pipelines: test techniques individually first to see their impact.
Then implement a combined approach:
integrate the chosen techniques, pruning, quantization, and distillation,
carefully. And finally, deploy and monitor.
This is the observability aspect:
release the optimized model, but include telemetry to gather
real-world performance data for continuous improvement.
Start with an audit, build systematically, validate thoroughly,
and monitor continuously.
That concludes our look into engineering low-latency generative AI for the edge.
By applying techniques like quantization, pruning, and knowledge distillation,
and choosing appropriate architectures and hardware, we can overcome resource
constraints and unlock powerful AI capabilities directly at the edge.
Thank you for your time and attention, and have a good day.