Conf42 Golang 2025 - Online

- premiere 5PM GMT

Optimizing Real-Time AI Inference at the Edge: Accelerating Autonomous Vehicle Safety and Efficiency


Abstract

Autonomous vehicles generate vast data, requiring ultra-fast AI. This talk explores edge inference techniques that cut latency, enhance efficiency, and improve safety. Discover how neuromorphic computing, model optimizations, and AI accelerators revolutionize real-time AV decision-making! 🚀


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. My name is Murali Krishna Reddy Mandalapu. Welcome to the Conf42 Golang Conference. Today we will discuss real-time AI inference at the edge for self-driving cars. Self-driving cars represent one of the most challenging applications of edge computing and AI: modern autonomous vehicles must process over a gigabyte of sensor data per second from high-resolution cameras, lidar, and radar units to make the split-second decisions that ensure passenger safety. In this presentation, we will explore how real-time AI inference at the edge enables autonomous vehicles to function safely and efficiently. We will also examine the computational challenges, hardware innovations, and optimization techniques that make this possible. Let's get into the details.

First, let's understand the challenge autonomous vehicles pose in terms of processing requirements. One, the data they generate is massive. Two, the latency requirements for safe operation of the vehicle on the road are very strict. Three, the compute requirements vary significantly with the environment the vehicle operates in.

From the data perspective, autonomous vehicles generate anywhere from 1.4 to 19 terabytes of raw data per hour: multiple high-resolution cameras operating at 30 to 60 frames per second at full HD resolution, lidar generating one hundred thousand to 4.5 million points per frame, and radar systems operating at 24 to 77 gigahertz.

With this amount of data, the strict latency requirements pose another challenge. The complete perception, decision, and action pipeline must execute within 100 to 300 milliseconds for collision avoidance. At high speeds, every 10 milliseconds of processing delay translates to approximately 0.3 to 0.5 meters of extra stopping distance, which is crucial for the safe operation of the car (a small latency-budget sketch in Go follows below).

Next, environmental variability: computational load can vary by up to 480% between minimal-complexity scenarios, such as open highways, and maximum-complexity environments such as dense urban intersections, where vehicles crisscross, pedestrians are present, and many more complex situations arise. This requires adaptive computation architectures.

Let's understand how edge computing evolved for autonomous cars. The first generation repurposed consumer GPUs. Early autonomous prototypes used adapted consumer GPUs delivering up to 12 tera operations per second, with significant limitations: high power consumption of up to 300 watts, which imposed additional thermal-management constraints requiring liquid cooling, and data transfer bottlenecks that consumed up to 67% of processing time. In the second generation, automotive-grade accelerators were designed as purpose-built processors with improved efficiency of up to two to four tera operations per second per watt, memory traffic reduced by up to 60% through pruning and compression techniques, and support for reduced-precision computing for up to a 4x throughput improvement.
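As a rough illustration of the latency budget described above, here is a minimal Go sketch that enforces a deadline over a perception, planning, and control pipeline using a context timeout. The stage names, per-stage durations, and the 150-millisecond budget are illustrative assumptions, not figures from a production system.

```go
// Hypothetical sketch: enforcing a perception-to-action latency budget with a
// context deadline. Stage names and costs are illustrative only.
package main

import (
	"context"
	"fmt"
	"time"
)

// runStage simulates one pipeline stage and aborts early if the budget is spent.
func runStage(ctx context.Context, name string, cost time.Duration) error {
	select {
	case <-time.After(cost): // simulated compute time for the stage
		return nil
	case <-ctx.Done(): // budget exhausted before the stage finished
		return fmt.Errorf("%s aborted: %w", name, ctx.Err())
	}
}

func main() {
	// Overall budget for perception -> planning -> control (assumed 150 ms here).
	ctx, cancel := context.WithTimeout(context.Background(), 150*time.Millisecond)
	defer cancel()

	stages := []struct {
		name string
		cost time.Duration
	}{
		{"perception", 50 * time.Millisecond},
		{"planning", 60 * time.Millisecond},
		{"control", 20 * time.Millisecond},
	}

	start := time.Now()
	for _, s := range stages {
		if err := runStage(ctx, s.name, s.cost); err != nil {
			// In a real vehicle a missed deadline would trigger a fallback,
			// such as a conservative braking maneuver.
			fmt.Println("deadline miss:", err)
			return
		}
	}
	fmt.Printf("pipeline completed in %v\n", time.Since(start))
}
```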
Let's get into the next level of detail on these techniques, starting with the edge versus cloud computing trade-off. Edge computing advantages: near-instantaneous processing, with 5 to 15 milliseconds of latency versus up to 500 milliseconds for the cloud; continued functionality during connectivity interruptions, which may occur on up to 38% of a route; reduced security vulnerabilities, by up to 50%; and better privacy protection, by keeping sensitive data within the vehicle boundary.

Cloud computing, of course, has its own advantages. It offers two to three orders of magnitude greater computational throughput, which enables more sophisticated algorithms with higher accuracy. It externalizes power consumption that would amount to as much as 1,500 watts of edge processing; since the power is delivered from the grid, there is effectively no limit on what it can consume. It is ideal for non-safety-critical tasks such as high-definition map generation.

With that, let's get into the hybrid approach. Cloud processing handles high-bandwidth analytics, mapping, and fleet learning, which is up to 72% of the tasks. Hybrid processing performs intelligent load balancing with seamless task migration across the platforms. Edge processing is used primarily for mission-critical perception and control systems, which is up to 82% of the processing needs in an autonomous car. The industry has converged on a hybrid architecture that strategically allocates computational workloads between onboard systems and cloud infrastructure. This sophisticated approach delivers almost 100% functional reliability while reducing the vehicle's computational requirements by up to 45% compared to pure edge solutions. By leveraging the inherent strengths of both paradigms, the immediacy of edge processing and the scalability of cloud computing, this hybrid framework creates an optimal balance of safety, efficiency, and capability while effectively neutralizing the limitations of either approach in isolation (a small routing sketch follows below).
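To make the hybrid allocation concrete, here is a hypothetical Go sketch of a routing policy that keeps safety-critical work on the edge and offloads non-critical tasks to the cloud only when the measured link latency fits the task's budget. The Task and Link types, thresholds, and task names are illustrative assumptions, not part of any particular vehicle platform.

```go
// Hypothetical sketch of hybrid edge/cloud workload routing: safety-critical
// or latency-bound tasks stay on the edge; heavy non-critical analytics go to
// the cloud when connectivity allows.
package main

import (
	"fmt"
	"time"
)

type Task struct {
	Name           string
	SafetyCritical bool          // mission-critical perception/control must stay on the edge
	LatencyBudget  time.Duration // maximum tolerable round-trip latency
}

type Link struct {
	Connected bool
	RTT       time.Duration // measured round-trip time to the cloud
}

// route decides where a task executes, degrading gracefully to the edge
// during connectivity interruptions.
func route(t Task, l Link) string {
	if t.SafetyCritical {
		return "edge"
	}
	if !l.Connected || l.RTT > t.LatencyBudget {
		return "edge"
	}
	return "cloud"
}

func main() {
	link := Link{Connected: true, RTT: 120 * time.Millisecond}
	tasks := []Task{
		{"object-detection", true, 50 * time.Millisecond},
		{"hd-map-update", false, 2 * time.Second},
		{"fleet-analytics", false, 500 * time.Millisecond},
	}
	for _, t := range tasks {
		fmt.Printf("%-17s -> %s\n", t.Name, route(t, link))
	}
}
```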
Now let's look at the requirements of several perception and planning tasks. First, object detection and classification, which is done mostly at the edge. With state-of-the-art models we can achieve up to 87% precision while operating within a strict latency constraint of up to 50 milliseconds per frame on automotive-grade processors. Second, multi-stage processing: lightweight region proposal networks run first, at roughly 20 to 30 hertz, followed by more intensive classification networks that process only the identified regions of interest, reducing computational requirements by up to 75%. A further optimization is quantization of the detection models: reducing precision from 32-bit floating point to 8-bit integer cuts memory requirements by up to 76% and inference time by up to 3.4 times, with an accuracy degradation of only about 1.8 to 2% compared to full-precision 32-bit parameters.

Next, let's understand the compute requirements of lane detection and path planning. Lane detection consumes up to 12% of the perception budget, processing camera feeds at 720p or 1080p resolution to extract lane markings with 5 to 8 centimeter accuracy at distances up to 80 meters. Multimodal fusion combines RGB camera data with lidar reflectivity and radar returns to maintain accuracy during adverse weather, when visual data quality can degrade by almost 60%. Third, trajectory planning: the planner evaluates 1,500 to 3,000 candidate trajectories every 100 milliseconds, consuming almost 25% of the computational budget, to optimize safety, comfort, and efficiency within the strict latency constraints we have.

Next, let's understand a few techniques used to optimize the model itself. Quantization converts high-precision 32-bit floating point to an efficient 8-bit integer representation, delivering 4x weight compression and 2x activation compression with minimal accuracy loss of up to 1.2%. This is critical for edge deployment: 8-bit operations consume nine times less energy than floating-point calculations (a minimal quantization sketch in Go follows at the end of this section).

The next technique is pruning. Here we systematically eliminate neural network parameters, up to 70% of them, that contribute negligibly to performance. Structured pruning techniques yield up to 3.8x computational efficiency gains while maintaining model integrity, with accuracy dropping only about 2% after fine-tuning. Another model optimization technique is knowledge distillation, in which larger teacher models guide the training of compact student networks. This enables dramatically smaller models, with up to eight times fewer parameters, to capture the essential capabilities of larger architectures while sacrificing only up to 3% accuracy, which is ideal for resource-constrained edge devices.

Let's get into hardware-aware neural architecture search. Automated discovery finds novel network architectures tailored to specific hardware, outperforming handcrafted designs by up to 12% in accuracy while reducing inference latency by 2.3 times, which is very good. The second aspect is hardware modeling: the search incorporates detailed models of memory access patterns, operator execution times, and parallelization capabilities, so the resulting models are aware of the underlying hardware and achieve better utilization. Third, sensor fusion optimization: the approach is particularly effective for sensor fusion tasks, reducing inference latency by up to 48% compared to conventional architectures on automotive-grade accelerators. Fourth, attention mechanisms: the search discovers efficient attention-based architectures that selectively process high-information regions, reducing overall computational requirements by up to 55%.
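To illustrate the 32-bit to 8-bit conversion described above, here is a minimal Go sketch of symmetric, per-tensor INT8 post-training quantization, assuming the scale is derived from the maximum absolute weight. It is a simplified illustration of the idea, not any particular toolchain's implementation.

```go
// Minimal sketch of symmetric INT8 post-training quantization with a
// per-tensor scale (scale = maxAbs / 127). Illustrative only.
package main

import (
	"fmt"
	"math"
)

// quantize maps float32 weights to int8 values plus a single scale factor.
func quantize(w []float32) ([]int8, float32) {
	var maxAbs float64
	for _, v := range w {
		if a := math.Abs(float64(v)); a > maxAbs {
			maxAbs = a
		}
	}
	scale := float32(maxAbs / 127.0)
	if scale == 0 {
		scale = 1 // avoid division by zero for an all-zero tensor
	}
	q := make([]int8, len(w))
	for i, v := range w {
		q[i] = int8(math.Round(float64(v / scale)))
	}
	return q, scale
}

// dequantize recovers approximate float values, e.g. for accuracy checks.
func dequantize(q []int8, scale float32) []float32 {
	out := make([]float32, len(q))
	for i, v := range q {
		out[i] = float32(v) * scale
	}
	return out
}

func main() {
	weights := []float32{0.12, -0.87, 0.45, -0.02, 0.99}
	q, scale := quantize(weights)
	fmt.Println("int8 weights:", q, "scale:", scale)
	fmt.Println("reconstructed:", dequantize(q, scale))
}
```

The 4x weight compression mentioned above follows directly from storing one byte per parameter instead of four; the small reconstruction error visible in the output is the source of the quoted accuracy loss.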
There is also another emerging trend: neuromorphic computing. Before looking at neuromorphic computing, let's understand event-driven processing. Instead of processing data frame by frame, we process data based on events. This reduces power consumption by up to 95% compared to a frame-based approach by allocating resources only when significant changes occur. It also brings temporal advantages: microsecond-scale temporal resolution, down to 10 microseconds, improves detection latency by 20 to 45 milliseconds for high-speed objects compared to conventional vision systems. It also improves lighting adaptability, maintaining consistent detection across illumination ranges from 0.1 lux to 100,000 lux and addressing the limitations of conventional sensors in high-dynamic-range environments.

With this understanding, let's see what neuromorphic computing is. Neuromorphic computing represents a fundamental departure from conventional architectures, drawing inspiration from biological neural systems. These systems integrate processing and memory in artificial neuron and synapse structures that mimic their biological counterparts. Prototype neuromorphic processing units can process 50 to 100 million events per second while consuming only 100 to 300 milliwatts of power, representing a two-orders-of-magnitude improvement in efficiency compared to GPU-based solutions.

Okay, let's summarize the future: where are we headed with all this? Distributed AI and continuous learning are the future. Distributed AI architectures maintain 85 to 92% of critical functionality even with failures in 30% of the processing nodes, and they reduce latency by up to 47% for complex perception tasks through better parallelization and reduced data movement. Continuous learning systems improve detection accuracy by 15 to 20% in novel environments not represented in the initial training data, and they employ safety-aware incremental updates that limit parameter changes to only 2% per cycle. With federated learning, vehicles contribute to collective intelligence while transmitting only about 0.5% of the data required for centralized approaches, and synchronizing every 100 to 500 kilometers provides up to 85% of the benefit of continuous connectivity, which is a great result (a minimal aggregation sketch follows below).
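As a closing illustration of the federated update idea, here is a hypothetical Go sketch in which each vehicle reports only a small weight delta and a fleet server averages the deltas into the shared model, so raw sensor data never leaves the vehicles. The data structures and values are illustrative assumptions, not a description of any production fleet-learning system.

```go
// Hypothetical sketch of federated-style aggregation: average small per-vehicle
// weight deltas into the shared global model.
package main

import "fmt"

// applyFederatedUpdate adds the mean of the vehicles' deltas to the global
// weights; only the deltas, a tiny fraction of the raw data, are transmitted.
func applyFederatedUpdate(global []float64, deltas [][]float64) {
	if len(deltas) == 0 {
		return
	}
	for i := range global {
		var sum float64
		for _, d := range deltas {
			sum += d[i]
		}
		global[i] += sum / float64(len(deltas))
	}
}

func main() {
	global := []float64{0.50, -0.20, 0.10}
	// Small deltas reported by three vehicles after local fine-tuning.
	deltas := [][]float64{
		{0.01, -0.02, 0.00},
		{0.03, 0.00, -0.01},
		{-0.01, 0.01, 0.02},
	}
	applyFederatedUpdate(global, deltas)
	fmt.Println("updated global weights:", global)
}
```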
Thank you. With this, I conclude my presentation.

Murali Krishna Reddy Mandalapu

Senior Director, Hardware Engineering @ Renesas Electronics



