Search | arXiv e-print repository

Dynamic DNNs and Runtime Management for Efficient Inference on Mobile/Embedded Devices

Authors: Lei Xun, Jonathon Hare, Geoff V. Merrett

Abstract: Deep neural network (DNN) inference is increasingly being executed on mobile and embedded platforms due to several key advantages in latency, privacy and always-on availability. However, due to limited computing resources, efficient DNN deployment on mobile and embedded platforms is challenging. Although many hardware accelerators and static model compression methods were proposed by previous work… ▽ More Deep neural network (DNN) inference is increasingly being executed on mobile and embedded platforms due to several key advantages in latency, privacy and always-on availability. However, due to limited computing resources, efficient DNN deployment on mobile and embedded platforms is challenging. Although many hardware accelerators and static model compression methods were proposed by previous works, at system runtime, multiple applications are typically executed concurrently and compete for hardware resources. This raises two main challenges: Runtime Hardware Availability and Runtime Application Variability. Previous works have addressed these challenges through either dynamic neural networks that contain sub-networks with different performance trade-offs or runtime hardware resource management. In this thesis, we proposed a combined method, a system was developed for DNN performance trade-off management, combining the runtime trade-off opportunities in both algorithms and hardware to meet dynamically changing application performance targets and hardware constraints in real time. We co-designed novel Dynamic Super-Networks to maximise runtime system-level performance and energy efficiency on heterogeneous hardware platforms. Compared with SOTA, our experimental results using ImageNet on the GPU of Jetson Xavier NX show our model is 2.4x faster for similar ImageNet Top-1 accuracy, or 5.1% higher accuracy at similar latency. We also designed a hierarchical runtime resource manager that tunes both dynamic neural networks and DVFS at runtime. Compared with the Linux DVFS governor schedutil, our runtime approach achieves up to a 19% energy reduction and a 9% latency reduction in single model deployment scenario, and an 89% energy reduction and a 23% latency reduction in a two concurrent model deployment scenario. △ Less

Submitted 16 January, 2024; originally announced January 2024.

Comments: Accepted at Design, Automation & Test in Europe Conference (DATE) 2024, PhD Forum

arXiv:2401.08943 [pdf, other]

Fluid Dynamic DNNs for Reliable and Adaptive Distributed Inference on Edge Devices

Authors: Lei Xun, Mingyu Hu, Hengrui Zhao, Amit Kumar Singh, Jonathon Hare, Geoff V. Merrett

Abstract: Distributed inference is a popular approach for efficient DNN inference at the edge. However, traditional Static and Dynamic DNNs are not distribution-friendly, causing system reliability and adaptability issues. In this paper, we introduce Fluid Dynamic DNNs (Fluid DyDNNs), tailored for distributed inference. Distinct from Static and Dynamic DNNs, Fluid DyDNNs utilize a novel nested incremental t… ▽ More Distributed inference is a popular approach for efficient DNN inference at the edge. However, traditional Static and Dynamic DNNs are not distribution-friendly, causing system reliability and adaptability issues. In this paper, we introduce Fluid Dynamic DNNs (Fluid DyDNNs), tailored for distributed inference. Distinct from Static and Dynamic DNNs, Fluid DyDNNs utilize a novel nested incremental training algorithm to enable independent and combined operation of its sub-networks, enhancing system reliability and adaptability. Evaluation on embedded Arm CPUs with a DNN model and the MNIST dataset, shows that in scenarios of single device failure, Fluid DyDNNs ensure continued inference, whereas Static and Dynamic DNNs fail. When devices are fully operational, Fluid DyDNNs can operate in either a High-Accuracy mode and achieve comparable accuracy with Static DNNs, or in a High-Throughput mode and achieve 2.5x and 2x throughput compared with Static and Dynamic DNNs, respectively. △ Less

Submitted 16 January, 2024; originally announced January 2024.

Comments: Accepted at Design, Automation & Test in Europe Conference (DATE) 2024

arXiv:2206.02525 [pdf, other]

Dynamic DNNs Meet Runtime Resource Management on Mobile and Embedded Platforms

Authors: Lei Xun, Bashir M. Al-Hashimi, Jonathon Hare, Geoff V. Merrett

Abstract: Deep neural network (DNN) inference is increasingly being executed on mobile and embedded platforms due to low latency and better privacy. However, efficient deployment on these platforms is challenging due to the intensive computation and memory access. We propose a holistic system design for DNN performance and energy optimisation, combining the trade-off opportunities in both algorithms and har… ▽ More Deep neural network (DNN) inference is increasingly being executed on mobile and embedded platforms due to low latency and better privacy. However, efficient deployment on these platforms is challenging due to the intensive computation and memory access. We propose a holistic system design for DNN performance and energy optimisation, combining the trade-off opportunities in both algorithms and hardware. The system can be viewed as three abstract layers: the device layer contains heterogeneous computing resources; the application layer has multiple concurrent workloads; and the runtime resource management layer monitors the dynamically changing algorithms' performance targets as well as hardware resources and constraints, and tries to meet them by tuning the algorithm and hardware at the same time. Moreover, We illustrate the runtime approach through a dynamic version of 'once-for-all network' (namely Dynamic-OFA), which can scale the ConvNet architecture to fit heterogeneous computing resources efficiently and has good generalisation for different model architectures such as Transformer. Compared to the state-of-the-art Dynamic DNNs, our experimental results using ImageNet on a Jetson Xavier NX show that the Dynamic-OFA is up to 3.5x (CPU), 2.4x (GPU) faster for similar ImageNet Top-1 accuracy, or 3.8% (CPU), 5.1% (GPU) higher accuracy at similar latency. Furthermore, compared with Linux governor (e.g. performance, schedutil), our runtime approach reduces the energy consumption by 16.5% at similar latency. △ Less

Submitted 6 June, 2022; v1 submitted 17 May, 2022; originally announced June 2022.

Comments: Accepted as a presentation at Fourth UK Mobile, Wearable and Ubiquitous Systems Research Symposium (MobiUK 2022)

arXiv:2109.12047 [pdf, other]

Intermittent Opportunistic Routing Components for the INET Framework

Authors: Edward Longman, Mohammed El-Hajjar, Geoff V. Merrett

Abstract: Intermittently-powered wireless sensor networks (WSNs) use energy harvesting and small energy storage to remove the need for battery replacement and to extend the operational lifetime. However, an intermittently-powered forwarder regularly turns on or off, which requires alternative networking solutions. Opportunistic routing (OR) is a potential cross-layer solution for this novel application, but… ▽ More Intermittently-powered wireless sensor networks (WSNs) use energy harvesting and small energy storage to remove the need for battery replacement and to extend the operational lifetime. However, an intermittently-powered forwarder regularly turns on or off, which requires alternative networking solutions. Opportunistic routing (OR) is a potential cross-layer solution for this novel application, but due to the interaction with the energy storage, the operation of these protocols is highly dynamic. To compare protocols and components in like-for-like scenarios we propose module interfaces for MAC, routing and discovery protocols, that enable clear separation of concerns and good interchangeability. We also suggest some candidates for each of the protocols based on our own implementation and research. △ Less

Submitted 24 September, 2021; originally announced September 2021.

Comments: Published in: M. Marek, G. Nardini, V. Vesely (Eds.), Proceedings of the 8th OMNeT++ Community Summit, Virtual Summit, September 8-10, 2021

Report number: OMNET/2021/05

arXiv:2109.09495 [pdf, other]

GhostShiftAddNet: More Features from Energy-Efficient Operations

Authors: Jia Bi, Jonathon Hare, Geoff V. Merrett

Abstract: Deep convolutional neural networks (CNNs) are computationally and memory intensive. In CNNs, intensive multiplication can have resource implications that may challenge the ability for effective deployment of inference on resource-constrained edge devices. This paper proposes GhostShiftAddNet, where the motivation is to implement a hardware-efficient deep network: a multiplication-free CNN with few… ▽ More Deep convolutional neural networks (CNNs) are computationally and memory intensive. In CNNs, intensive multiplication can have resource implications that may challenge the ability for effective deployment of inference on resource-constrained edge devices. This paper proposes GhostShiftAddNet, where the motivation is to implement a hardware-efficient deep network: a multiplication-free CNN with fewer redundant features. We introduce a new bottleneck block, GhostSA, that converts all multiplications in the block to cheap operations. The bottleneck uses an appropriate number of bit-shift filters to process intrinsic feature maps, then applies a series of transformations that consist of bit-wise shifts with addition operations to generate more feature maps that fully learn to capture information underlying intrinsic features. We schedule the number of bit-shift and addition operations for different hardware platforms. We conduct extensive experiments and ablation studies with desktop and embedded (Jetson Nano) devices for implementation and measurements. We demonstrate the proposed GhostSA block can replace bottleneck blocks in the backbone of state-of-the-art networks architectures and gives improved performance on image classification benchmarks. Further, our GhostShiftAddNet can achieve higher classification accuracy with fewer FLOPs and parameters (reduced by up to 3x) than GhostNet. When compared to GhostNet, inference latency on the Jetson Nano is improved by 1.3x and 2x on the GPU and CPU respectively. Code is available open-source on \url{https://github.com/JIABI/GhostShiftAddNet}. △ Less

Submitted 3 February, 2022; v1 submitted 20 September, 2021; originally announced September 2021.

Journal ref: The 32nd British Machine Vision Conference BMVC 2021

arXiv:2107.08199 [pdf, other]

Dynamic Transformer for Efficient Machine Translation on Embedded Devices

Authors: Hishan Parry, Lei Xun, Amin Sabet, Jia Bi, Jonathon Hare, Geoff V. Merrett

Abstract: The Transformer architecture is widely used for machine translation tasks. However, its resource-intensive nature makes it challenging to implement on constrained embedded devices, particularly where available hardware resources can vary at run-time. We propose a dynamic machine translation model that scales the Transformer architecture based on the available resources at any particular time. The… ▽ More The Transformer architecture is widely used for machine translation tasks. However, its resource-intensive nature makes it challenging to implement on constrained embedded devices, particularly where available hardware resources can vary at run-time. We propose a dynamic machine translation model that scales the Transformer architecture based on the available resources at any particular time. The proposed approach, 'Dynamic-HAT', uses a HAT SuperTransformer as the backbone to search for SubTransformers with different accuracy-latency trade-offs at design time. The optimal SubTransformers are sampled from the SuperTransformer at run-time, depending on latency constraints. The Dynamic-HAT is tested on the Jetson Nano and the approach uses inherited SubTransformers sampled directly from the SuperTransformer with a switching time of <1s. Using inherited SubTransformers results in a BLEU score loss of <1.5% because the SubTransformer configuration is not retrained from scratch after sampling. However, to recover this loss in performance, the dimensions of the design space can be reduced to tailor it to a family of target hardware. The new reduced design space results in a BLEU score increase of approximately 1% for sub-optimal models from the original design space, with a wide range for performance scaling between 0.356s - 1.526s for the GPU and 2.9s - 7.31s for the CPU. △ Less

Submitted 30 July, 2021; v1 submitted 17 July, 2021; originally announced July 2021.

Comments: Accepted at MLCAD 2021

arXiv:2106.11208 [pdf, other]

Temporal Early Exits for Efficient Video Object Detection

Authors: Amin Sabet, Jonathon Hare, Bashir Al-Hashimi, Geoff V. Merrett

Abstract: Transferring image-based object detectors to the domain of video remains challenging under resource constraints. Previous efforts utilised optical flow to allow unchanged features to be propagated, however, the overhead is considerable when working with very slowly changing scenes from applications such as surveillance. In this paper, we propose temporal early exits to reduce the computational com… ▽ More Transferring image-based object detectors to the domain of video remains challenging under resource constraints. Previous efforts utilised optical flow to allow unchanged features to be propagated, however, the overhead is considerable when working with very slowly changing scenes from applications such as surveillance. In this paper, we propose temporal early exits to reduce the computational complexity of per-frame video object detection. Multiple temporal early exit modules with low computational overhead are inserted at early layers of the backbone network to identify the semantic differences between consecutive frames. Full computation is only required if the frame is identified as having a semantic change to previous frames; otherwise, detection results from previous frames are reused. Experiments on CDnet show that our method significantly reduces the computational complexity and execution of per-frame video object detection up to $34 \times$ compared to existing methods with an acceptable reduction of 2.2\% in mAP. △ Less

Submitted 21 June, 2021; originally announced June 2021.

arXiv:2105.03608 [pdf, other]

Optimising Resource Management for Embedded Machine Learning

Authors: Lei Xun, Long Tran-Thanh, Bashir M Al-Hashimi, Geoff V. Merrett

Abstract: Machine learning inference is increasingly being executed locally on mobile and embedded platforms, due to the clear advantages in latency, privacy and connectivity. In this paper, we present approaches for online resource management in heterogeneous multi-core systems and show how they can be applied to optimise the performance of machine learning workloads. Performance can be defined using platf… ▽ More Machine learning inference is increasingly being executed locally on mobile and embedded platforms, due to the clear advantages in latency, privacy and connectivity. In this paper, we present approaches for online resource management in heterogeneous multi-core systems and show how they can be applied to optimise the performance of machine learning workloads. Performance can be defined using platform-dependent (e.g. speed, energy) and platform-independent (accuracy, confidence) metrics. In particular, we show how a Deep Neural Network (DNN) can be dynamically scalable to trade-off these various performance metrics. Achieving consistent performance when executing on different platforms is necessary yet challenging, due to the different resources provided and their capability, and their time-varying availability when executing alongside other workloads. Managing the interface between available hardware resources (often numerous and heterogeneous in nature), software requirements, and user experience is increasingly complex. △ Less

Submitted 8 May, 2021; originally announced May 2021.

Comments: Accepted at DATE 2020

arXiv:2105.03600 [pdf, other]

Incremental Training and Group Convolution Pruning for Runtime DNN Performance Scaling on Heterogeneous Embedded Platforms

Authors: Lei Xun, Long Tran-Thanh, Bashir M Al-Hashimi, Geoff V. Merrett

Abstract: Inference for Deep Neural Networks is increasingly being executed locally on mobile and embedded platforms due to its advantages in latency, privacy and connectivity. Since modern System on Chips typically execute a combination of different and dynamic workloads concurrently, it is challenging to consistently meet inference time/energy budget at runtime because of the local computing resources ava… ▽ More Inference for Deep Neural Networks is increasingly being executed locally on mobile and embedded platforms due to its advantages in latency, privacy and connectivity. Since modern System on Chips typically execute a combination of different and dynamic workloads concurrently, it is challenging to consistently meet inference time/energy budget at runtime because of the local computing resources available to the DNNs vary considerably. To address this challenge, a variety of dynamic DNNs were proposed. However, these works have significant memory overhead, limited runtime recoverable compression rate and narrow dynamic ranges of performance scaling. In this paper, we present a dynamic DNN using incremental training and group convolution pruning. The channels of the DNN convolution layer are divided into groups, which are then trained incrementally. At runtime, following groups can be pruned for inference time/energy reduction or added back for accuracy recovery without model retraining. In addition, we combine task map** and Dynamic Voltage Frequency Scaling (DVFS) with our dynamic DNN to deliver finer trade-off between accuracy and time/power/energy over a wider dynamic range. We illustrate the approach by modifying AlexNet for the CIFAR10 image dataset and evaluate our work on two heterogeneous hardware platforms: Odroid XU3 (ARM big.LITTLE CPUs) and Nvidia Jetson Nano (CPU and GPU). Compared to the existing works, our approach can provide up to 2.36x (energy) and 2.73x (time) wider dynamic range with a 2.4x smaller memory footprint at the same compression rate. It achieved 10.6x (energy) and 41.6x (time) wider dynamic range by combining with task map** and DVFS. △ Less

Submitted 8 May, 2021; originally announced May 2021.

Comments: Accepted at ACM/IEEE Workshop on Machine Learning for CAD (MLCAD) 2019

arXiv:2105.03596 [pdf, other]

Dynamic-OFA: Runtime DNN Architecture Switching for Performance Scaling on Heterogeneous Embedded Platforms

Authors: Wei Lou, Lei Xun, Amin Sabet, Jia Bi, Jonathon Hare, Geoff V. Merrett

Abstract: Mobile and embedded platforms are increasingly required to efficiently execute computationally demanding DNNs across heterogeneous processing elements. At runtime, the available hardware resources to DNNs can vary considerably due to other concurrently running applications. The performance requirements of the applications could also change under different scenarios. To achieve the desired performa… ▽ More Mobile and embedded platforms are increasingly required to efficiently execute computationally demanding DNNs across heterogeneous processing elements. At runtime, the available hardware resources to DNNs can vary considerably due to other concurrently running applications. The performance requirements of the applications could also change under different scenarios. To achieve the desired performance, dynamic DNNs have been proposed in which the number of channels/layers can be scaled in real time to meet different requirements under varying resource constraints. However, the training process of such dynamic DNNs can be costly, since platform-aware models of different deployment scenarios must be retrained to become dynamic. This paper proposes Dynamic-OFA, a novel dynamic DNN approach for state-of-the-art platform-aware NAS models (i.e. Once-for-all network (OFA)). Dynamic-OFA pre-samples a family of sub-networks from a static OFA backbone model, and contains a runtime manager to choose different sub-networks under different runtime environments. As such, Dynamic-OFA does not need the traditional dynamic DNN training pipeline. Compared to the state-of-the-art, our experimental results using ImageNet on a Jetson Xavier NX show that the approach is up to 3.5x (CPU), 2.4x (GPU) faster for similar ImageNet Top-1 accuracy, or 3.8% (CPU), 5.1% (GPU) higher accuracy at similar latency. △ Less

Submitted 11 May, 2021; v1 submitted 8 May, 2021; originally announced May 2021.

Comments: Accepted at CVPR ECV Workshop 2021

arXiv:1810.03333 [pdf]

Practical Implementation of Memristor-Based Threshold Logic Gates

Authors: Georgios Papandroulidakis, Alexantrou Serb, Ali Khiat, Geoff V. Merrett, Themistoklis Prodromakis

Abstract: Current advances in emerging memory technologies enable novel and unconventional computing architectures for high-performance and low-power electronic systems, capable of carrying out massively parallel operations at the edge. One emerging technology, ReRAM, also known to belong in the family of memristors (memory resistors), is gathering attention due to its attractive features for logic and in-m… ▽ More Current advances in emerging memory technologies enable novel and unconventional computing architectures for high-performance and low-power electronic systems, capable of carrying out massively parallel operations at the edge. One emerging technology, ReRAM, also known to belong in the family of memristors (memory resistors), is gathering attention due to its attractive features for logic and in-memory computing; benefits which follow from its technological attributes, such as nanoscale dimensions, low power operation and multi-state programming. At the same time, design with CMOS is quickly reaching its physical and functional limitations, and further research towards novel logic families, such as Threshold Logic Gates (TLGs) is scoped. TLGs constitute a logic family known for its high-speed and low power consumption, yet rely on conventional transistor technology. Introducing memristors enables a more affordable reconfiguration capability of TLGs. Through this work, we are introducing a physical implementation of a memristor-based current-mode TLG (MCMTLG) circuit and validate its design and operation through multiple experimental setups. We demonstrate 2-input and 3-input MCMTLG configurations and showcase their reconfiguration capability. This is achieved by varying memristive weights arbitrarily for sha** the classification decision boundary, thus showing promise as an alternative hardware-friendly implementation of Artificial Neural Networks (ANNs). Through the employment of real memristor devices as the equivalent of synaptic weights in TLGs, we are realizing components that can be used towards an in-silico classifier. △ Less

Submitted 8 October, 2018; originally announced October 2018.

Showing 1–11 of 11 results for author: Merrett, G V