Search | arXiv e-print repository

doi 10.1016/j.sysarc.2023.102872

An Automotive Case Study on the Limits of Approximation for Object Detection

Authors: Martí Caro, Hamid Tabani, Jaume Abella, Francesc Moll, Enric Morancho, Ramon Canal, Josep Altet, Antonio Calomarde, Francisco J. Cazorla, Antonio Rubio, Pau Fontova, Jordi Fornt

Abstract: The accuracy of camera-based object detection (CBOD) built upon deep learning is often evaluated against the real objects in frames only. However, such simplistic evaluation ignores the fact that many unimportant objects are small, distant, or background, and hence, their misdetections have less impact than those for closer, larger, and foreground objects in domains such as autonomous driving. Mor… ▽ More The accuracy of camera-based object detection (CBOD) built upon deep learning is often evaluated against the real objects in frames only. However, such simplistic evaluation ignores the fact that many unimportant objects are small, distant, or background, and hence, their misdetections have less impact than those for closer, larger, and foreground objects in domains such as autonomous driving. Moreover, sporadic misdetections are irrelevant since confidence on detections is typically averaged across consecutive frames, and detection devices (e.g. cameras, LiDARs) are often redundant, thus providing fault tolerance. This paper exploits such intrinsic fault tolerance of the CBOD process, and assesses in an automotive case study to what extent CBOD can tolerate approximation coming from multiple sources such as lower precision arithmetic, approximate arithmetic units, and even random faults due to, for instance, low voltage operation. We show that the accuracy impact of those sources of approximation is within 1% of the baseline even when considering the three approximate domains simultaneously, and hence, multiple sources of approximation can be exploited to build highly efficient accelerators for CBOD in cars. △ Less

Submitted 13 April, 2023; originally announced April 2023.

Journal ref: Journal of Systems Architecture, Volume 138, 2023, 102872, ISSN 1383-7621

arXiv:2302.14426 [pdf, other]

doi 10.1016/j.sysarc.2022.102635

At-Scale Evaluation of Weight Clustering to Enable Energy-Efficient Object Detection

Authors: Martí Caro, Hamid Tabani, Jaume Abella

Abstract: Accelerators implementing Deep Neural Networks for image-based object detection operate on large volumes of data due to fetching images and neural network parameters, especially if they need to process video streams, hence with high power dissipation and bandwidth requirements to fetch all those data. While some solutions exist to mitigate power and bandwidth demands for data fetching, they are of… ▽ More Accelerators implementing Deep Neural Networks for image-based object detection operate on large volumes of data due to fetching images and neural network parameters, especially if they need to process video streams, hence with high power dissipation and bandwidth requirements to fetch all those data. While some solutions exist to mitigate power and bandwidth demands for data fetching, they are often assessed in the context of limited evaluations with a scale much smaller than that of the target application, which challenges finding the best tradeoff in practice. This paper sets up the infrastructure to assess at-scale a key power and bandwidth optimization - weight clustering - for You Only Look Once v3 (YOLOv3), a neural network-based object detection system, using videos of real driving conditions. Our assessment shows that accelerators such as systolic arrays with an Output Stationary architecture turn out to be a highly effective solution combined with weight clustering. In particular, applying weight clustering independently per neural network layer, and using between 32 (5-bit) and 256 (8-bit) weights allows achieving an accuracy close to that of the original YOLOv3 weights (32-bit weights). Such bit-count reduction of the weights allows shaving bandwidth requirements down to 30%-40% of the original requirements, and reduces energy consumption down to 45%. This is based on the fact that (i) energy due to multiply-and-accumulate operations is much smaller than DRAM data fetching, and (ii) designing accelerators appropriately may make that most of the data fetched corresponds to neural network weights, where clustering can be applied. Overall, our at-scale assessment provides key results to architect camera-based object detection accelerators by putting together a real-life application (YOLOv3), and real driving videos, in a unified setup so that trends observed are reliable. △ Less

Submitted 28 February, 2023; originally announced February 2023.

Comments: 25 pages, 13 figures, 5 tables, published in Journal of Systems Architecture

Journal ref: Journal of Systems Architecture, Volume 129, 2022, 102635, ISSN 1383-7621

arXiv:2106.16006 [pdf, other]

Improving the Efficiency of Transformers for Resource-Constrained Devices

Authors: Hamid Tabani, Ajay Balasubramaniam, Shabbir Marzban, Elahe Arani, Bahram Zonooz

Abstract: Transformers provide promising accuracy and have become popular and used in various domains such as natural language processing and computer vision. However, due to their massive number of model parameters, memory and computation requirements, they are not suitable for resource-constrained low-power devices. Even with high-performance and specialized devices, the memory bandwidth can become a perf… ▽ More Transformers provide promising accuracy and have become popular and used in various domains such as natural language processing and computer vision. However, due to their massive number of model parameters, memory and computation requirements, they are not suitable for resource-constrained low-power devices. Even with high-performance and specialized devices, the memory bandwidth can become a performance-limiting bottleneck. In this paper, we present a performance analysis of state-of-the-art vision transformers on several devices. We propose to reduce the overall memory footprint and memory transfers by clustering the model parameters. We show that by using only 64 clusters to represent model parameters, it is possible to reduce the data transfer from the main memory by more than 4x, achieve up to 22% speedup and 39% energy savings on mobile devices with less than 0.1% accuracy loss. △ Less

Submitted 30 June, 2021; originally announced June 2021.

Comments: This paper is accepted as a full paper at 24th Euromicro Conference on Digital System Design (DSD)

arXiv:2105.02613 [pdf, other]

Challenges and Obstacles Towards Deploying Deep Learning Models on Mobile Devices

Authors: Hamid Tabani, Ajay Balasubramaniam, Elahe Arani, Bahram Zonooz

Abstract: From computer vision and speech recognition to forecasting trajectories in autonomous vehicles, deep learning approaches are at the forefront of so many domains. Deep learning models are developed using plethora of high-level, generic frameworks and libraries. Running those models on the mobile devices require hardware-aware optimizations and in most cases converting the models to other formats or… ▽ More From computer vision and speech recognition to forecasting trajectories in autonomous vehicles, deep learning approaches are at the forefront of so many domains. Deep learning models are developed using plethora of high-level, generic frameworks and libraries. Running those models on the mobile devices require hardware-aware optimizations and in most cases converting the models to other formats or using a third-party framework. In reality, most of the developed models need to undergo a process of conversion, adaptation, and, in some cases, full retraining to match the requirements and features of the framework that is deploying the model on the target platform. Variety of hardware platforms with heterogeneous computing elements, from wearable devices to high-performance GPU clusters are used to run deep learning models. In this paper, we present the existing challenges, obstacles, and practical solutions towards deploying deep learning models on mobile devices. △ Less

Submitted 6 May, 2021; originally announced May 2021.

arXiv:2104.07735 [pdf, other]

doi 10.1016/j.jpdc.2021.02.008

Performance Analysis and Optimization Opportunities for NVIDIA Automotive GPUs

Authors: Hamid Tabani, Fabio Mazzocchetti, Pedro Benedicte, Jaume Abella, Francisco J. Cazorla

Abstract: Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD) bring unprecedented performance requirements for automotive systems. Graphic Processing Unit (GPU) based platforms have been deployed with the aim of meeting these requirements, being NVIDIA Jetson TX2 and its high-performance successor, NVIDIA AGX Xavier, relevant representatives. However, to what extent high-performance GPU co… ▽ More Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD) bring unprecedented performance requirements for automotive systems. Graphic Processing Unit (GPU) based platforms have been deployed with the aim of meeting these requirements, being NVIDIA Jetson TX2 and its high-performance successor, NVIDIA AGX Xavier, relevant representatives. However, to what extent high-performance GPU configurations are appropriate for ADAS and AD workloads remains as an open question. This paper analyzes this concern and provides valuable insights on this question by modeling two recent automotive NVIDIA GPU-based platforms, namely TX2 and AGX Xavier. In particular, our work assesses their microarchitectural parameters against relevant benchmarks, identifying GPU setups delivering increased performance within a similar cost envelope, or decreasing hardware costs while preserving original performance levels. Overall, our analysis identifies opportunities for the optimization of automotive GPUs to further increase system efficiency. △ Less

Submitted 15 April, 2021; originally announced April 2021.

Showing 1–5 of 5 results for author: Tabani, H