
Vitamin-V: Expanding Open-Source RISC-V Cloud Environments

Authors: Ramon Canal, Stefano Di Carlo, Dimitris Gizopoulos, Alberto Scionti, Francesco Lubrano, Josep-Lluís Berral, Aaron Call, Diego Marron, Konstantinos Nikas, Dionisios Pnevmatikatos, Daniel Raho, Alvise Rigo, Yannis Papaefstathiou, José María Arnau, Angelos Arelakis

Abstract: Among the key contributions of Vitamin-V (2023-2025 Horizon Europe project), we develop a complete RISC-V open-source software stack for cloud services with comparable performance to the cloud-dominant x86 counterpart. In this paper, we detail the software suites and applications ported plus the three cloud setups under evaluation.

Submitted 12 June, 2024; originally announced July 2024.

Comments: RISC-V Summit Europe 2024, 24-28 June 2024

arXiv:2310.17501 [pdf, other]

A Lightweight, Compiler-Assisted Register File Cache for GPGPU

Authors: Mojtaba Abaie Shoushtary, Jose Maria Arnau, Jordi Tubella Murgadas, Antonio Gonzalez

Abstract: Modern GPUs require an enormous register file (RF) to store the context of thousands of active threads. It consumes considerable energy and contains multiple large banks to provide enough throughput. Thus, a RF caching mechanism can significantly improve the performance and energy consumption of the GPUs by avoiding reads from the large banks that consume significant energy and may cause port conflicts. This paper introduces an energy-efficient RF caching mechanism called Malekeh that repurposes an existing component in GPUs' RF to operate as a cache in addition to its original functionality. In this way, Malekeh minimizes the overhead of adding a RF cache to GPUs. Besides, Malekeh leverages an issue scheduling policy that utilizes the reuse distance of the values in the RF cache and is controlled by a dynamic algorithm. The goal is to adapt the issue policy to the runtime program characteristics to maximize the GPU's performance and the hit ratio of the RF cache. The reuse distance is approximated by the compiler using profiling and is used at run time by the proposed caching scheme. We show that Malekeh reduces the number of reads to the RF banks by 46.4% and the dynamic energy of the RF by 28.3%. Besides, it improves performance by 6.1% while adding only 2KB of extra storage per core to the baseline RF of 256KB, which represents a negligible overhead of 0.78%.

Submitted 26 October, 2023; originally announced October 2023.

arXiv:2305.10982 [pdf, other]

Vitamin-V: Virtual Environment and Tool-boxing for Trustworthy Development of RISC-V based Cloud Services

Authors: A. Arelakis, J. M. Arnau, J. L. Berral, A. Call, R. Canal, S. Di Carlo, J. Costa, D. Gizopoulos, V. Karakostas, F. Lubrano, K. Nikas, Y. Nikolakopoulos, B. Otero, G. Papadimitriou, I. Papaefstathiou, D. Pnevmatikatos, D. Raho, A. Rigo, E. Rodríguez, A. Savino, A. Scionti, N. Tampouratzis, A. Torregrosa

Abstract: Vitamin-V is a 2023-2025 Horizon Europe project that aims to develop a complete RISC-V open-source software stack for cloud services with comparable performance to the cloud-dominant x86 counterpart and a powerful virtual execution environment for software development, validation, verification, and test that considers the relevant RISC-V ISA extensions for cloud deployment.

Submitted 27 June, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

Comments: Paper accepted and presented at the RISC-V Summit Europe, Barcelona, 5-9th June 2023. arXiv admin note: substantial text overlap with arXiv:2305.01983

arXiv:2302.00361 [pdf, ps, other]

doi 10.1145/3579371.3589055

K-D Bonsai: ISA-Extensions to Compress K-D Trees for Autonomous Driving Tasks

Authors: Pedro H. E. Becker, José María Arnau, Antonio González

Abstract: Autonomous Driving (AD) systems extensively manipulate 3D point clouds for object detection and vehicle localization. Therefore, efficient processing of 3D point clouds is crucial in these systems. In this work we propose K-D Bonsai, a technique to cut down memory usage during radius search, a critical building block of point cloud processing. K-D Bonsai exploits value similarity in the data structure that holds the point cloud (a k-d tree) to compress the data in memory. K-D Bonsai further compresses the data using a reduced floating-point representation, exploiting the physically limited range of point cloud values. For easy integration into today's systems, we implement K-D Bonsai through Bonsai-extensions, a small set of new CPU instructions to compress, decompress, and operate on points. To maintain baseline safety levels, we carefully craft the Bonsai-extensions to detect precision loss due to compression, allowing re-computation in full precision to take place if necessary. Therefore, K-D Bonsai reduces data movement, improving performance and energy efficiency, while guaranteeing baseline accuracy and programmability. We evaluate K-D Bonsai over the Euclidean cluster task of Autoware.ai, a state-of-the-art software stack for AD. We achieve an average of 9.26% improvement in end-to-end latency, 12.19% in tail latency, and a reduction of 10.84% in energy consumption. Unlike the expensive accelerators proposed in related work, K-D Bonsai improves radius search with minimal area increase (0.36%).

Submitted 30 August, 2023; v1 submitted 1 February, 2023; originally announced February 2023.

MSC Class: Article No. 18; 2018 Related DOI: https://doi.org/10.1145/3243176.3243184

Journal ref: ISCA'23 Proceedings of the 50th Annual International Symposium on Computer Architecture, Article No. 20, 2023

arXiv:2212.00608 [pdf, other]

Exploiting Kernel Compression on BNNs

Authors: Franyell Silfa, Jose Maria Arnau, Antonio González

Abstract: Binary Neural Networks (BNNs) are showing tremendous success on realistic image classification tasks. Notably, their accuracy is similar to the state-of-the-art accuracy obtained by full-precision models tailored to edge devices. In this regard, BNNs are very amenable to edge devices since they employ 1 bit to store the inputs and weights, and thus, their storage requirements are low. Also, BNN computations are mainly done using xnor and popcount operations, which are implemented very efficiently using simple hardware structures. Nonetheless, supporting BNNs efficiently on mobile CPUs is far from trivial since their benefits are hindered by frequent memory accesses to load weights and inputs. In BNNs, a weight or an input is stored using one bit, and aiming to increase storage and computation efficiency, several of them are packed together as a sequence of bits. In this work, we observe that the number of unique sequences representing a set of weights is typically low. Also, we have seen that during the evaluation of a BNN layer, a small group of unique sequences is employed more frequently than others. Accordingly, we propose exploiting this observation by using Huffman Encoding to encode the bit sequences and then using an indirection table to decode them during the BNN evaluation. Also, we propose a clustering scheme to identify the most common sequences of bits and replace the less common ones with some similar common sequences. Hence, we decrease the storage requirements and memory accesses since common sequences are encoded with fewer bits.
We extend a mobile CPU by adding a small hardware structure that can efficiently cache and decode the compressed sequence of bits. We evaluate our scheme using the ReActNet model with the ImageNet dataset. Our experimental results show that our technique can reduce memory requirement by 1.32x and improve performance by 1.35x.

Submitted 1 December, 2022; originally announced December 2022.
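
The xnor and popcount formulation mentioned in the abstract above can be illustrated with a minimal Python sketch. It assumes each value in {-1, +1} is packed into one bit (1 for +1, 0 for -1); the function name `binary_dot` and the packing convention are illustrative assumptions and are not taken from the paper.

```python
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two length-n vectors over {-1, +1}, each packed into an int.

    Assumed packing (hypothetical): bit value 1 encodes +1, bit value 0 encodes -1.
    """
    mask = (1 << n) - 1
    # xnor marks the positions where both vectors hold the same value
    agree = bin(~(a_bits ^ b_bits) & mask).count("1")  # popcount of the xnor
    # each agreeing position contributes +1, each disagreeing position -1
    return 2 * agree - n


if __name__ == "__main__":
    # [+1, -1, +1, +1] and [+1, +1, -1, +1] packed as 0b1011 and 0b1101
    print(binary_dot(0b1011, 0b1101, 4))
```

Because the whole dot product reduces to one xnor and one popcount per packed word, the cost is dominated by loading the packed words, which is consistent with the memory-access bottleneck the abstract points out.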

arXiv:2202.06563 [pdf, other]

Saving RNN Computations with a Neuron-Level Fuzzy Memoization Scheme

Authors: Franyell Silfa, Jose-Maria Arnau, Antonio González

Abstract: Recurrent Neural Networks (RNNs) are a key technology for applications such as automatic speech recognition or machine translation. Unlike conventional feed-forward DNNs, RNNs remember past information to improve the accuracy of future predictions and, therefore, they are very effective for sequence processing problems. For each application run, recurrent layers are executed many times for processing a potentially large sequence of inputs (words, images, audio frames, etc.). In this paper, we observe that the output of a neuron exhibits small changes in consecutive invocations. We exploit this property to build a neuron-level fuzzy memoization scheme, which dynamically caches each neuron's output and reuses it whenever it is predicted that the current output will be similar to a previously computed result, avoiding in this way the output computations. The main challenge in this scheme is determining whether the new neuron's output for the current input in the sequence will be similar to a recently computed result. To this end, we extend the recurrent layer with a much simpler Bitwise Neural Network (BNN), and show that the BNN and RNN outputs are highly correlated: if two BNN outputs are very similar, the corresponding outputs in the original RNN layer are likely to exhibit negligible changes. The BNN provides a low-cost and effective mechanism for deciding when fuzzy memoization can be applied with a small impact on accuracy.
We evaluate our memoization scheme on top of a state-of-the-art accelerator for RNNs, for a variety of different neural networks from multiple application domains. We show that our technique avoids more than 26.7% of computations, resulting in 21% energy savings and 1.4x speedup on average.

Submitted 14 February, 2022; originally announced February 2022.
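
The neuron-level fuzzy memoization idea in the abstract above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' design: the class name `MemoizedNeuron`, the sign-based binarization, and the absolute-difference `threshold` test are all hypothetical choices, and the activation function is omitted.

```python
def sign_bits(vec):
    """Binarize a vector to {-1, +1} by sign (assumed binarization)."""
    return [1 if v >= 0 else -1 for v in vec]


class MemoizedNeuron:
    """One neuron that reuses its cached output when a cheap binarized proxy is stable."""

    def __init__(self, weights, threshold):
        self.weights = weights
        self.bin_weights = sign_bits(weights)
        self.threshold = threshold  # hypothetical similarity threshold
        self.cached_proxy = None    # binarized dot product seen at the last full evaluation
        self.cached_out = None      # full-precision output from the last full evaluation
        self.reused = 0             # number of evaluations served from the cache

    def evaluate(self, x):
        # Cheap proxy: dot product of binarized weights and binarized inputs
        proxy = sum(w * s for w, s in zip(self.bin_weights, sign_bits(x)))
        if (self.cached_proxy is not None
                and abs(proxy - self.cached_proxy) <= self.threshold):
            self.reused += 1
            return self.cached_out  # skip the full-precision computation
        # Proxy changed too much: compute in full precision and refresh the cache
        out = sum(w * v for w, v in zip(self.weights, x))
        self.cached_proxy, self.cached_out = proxy, out
        return out
```

A larger `threshold` trades accuracy for more reuse; how the threshold is chosen is a design decision this sketch leaves open.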

arXiv:2202.04990 [pdf, other]

Mixture-of-Rookies: Saving DNN Computations by Predicting ReLU Outputs

Authors: Dennis Pinto, Jose-María Arnau, Antonio González

Abstract: Deep Neural Networks (DNNs) are widely used in many application domains. However, they require a vast amount of computations and memory accesses to deliver outstanding accuracy. In this paper, we propose a scheme to predict whether the output of each ReLU-activated neuron will be a zero or a positive number in order to skip the computation of those neurons that will likely output a zero. Our predictor, named Mixture-of-Rookies, combines two inexpensive components. The first one exploits the high linear correlation between binarized (1-bit) and full-precision (8-bit) dot products, whereas the second component clusters together neurons that tend to output zero at the same time. We propose a novel clustering scheme based on the analysis of angles, as the sign of the dot product of two vectors depends on the cosine of the angle between them. We implement our hybrid zero output predictor on top of a state-of-the-art DNN accelerator. Experimental results show that our scheme introduces a small area overhead of 5.3% while achieving a speedup of 1.2x and reducing energy consumption by 16.5% on average for a set of diverse DNNs.

Submitted 10 February, 2022; originally announced February 2022.

Comments: 13 pages, 14 figures

arXiv:2202.04971 [pdf, other]

ASRPU: A Programmable Accelerator for Low-Power Automatic Speech Recognition

Authors: Dennis Pinto, Jose-María Arnau, Antonio González

Abstract: The outstanding accuracy achieved by modern Automatic Speech Recognition (ASR) systems is enabling them to quickly become a mainstream technology. ASR is essential for many applications, such as speech-based assistants, dictation systems and real-time language translation. However, highly accurate ASR systems are computationally expensive, requiring on the order of billions of arithmetic operations to decode each second of audio, which conflicts with a growing interest in deploying ASR on edge devices. On these devices, hardware acceleration is key for achieving acceptable performance. However, ASR is a rich and fast-changing field, and thus, any overly specialized hardware accelerator may quickly become obsolete. In this paper, we tackle those challenges by proposing ASRPU, a programmable accelerator for on-edge ASR. ASRPU contains a pool of general-purpose cores that execute small pieces of parallel code. Each of these programs computes one part of the overall decoder (e.g. a layer in a neural network). The accelerator automates some carefully chosen parts of the decoder to simplify the programming without sacrificing generality. We provide an analysis of a modern ASR system implemented on ASRPU and show that this architecture can achieve real-time decoding with a very low power budget.

Submitted 10 February, 2022; originally announced February 2022.

Comments: 11 pages, 11 figures

arXiv:2107.09408 [pdf, other]

CREW: Computation Reuse and Efficient Weight Storage for Hardware-accelerated MLPs and RNNs

Authors: Marc Riera, Jose-Maria Arnau, Antonio Gonzalez

Abstract: Deep Neural Networks (DNNs) have achieved tremendous success for cognitive applications. The core operation in a DNN is the dot product between quantized inputs and weights. Prior works exploit the weight/input repetition that arises due to quantization to avoid redundant computations in Convolutional Neural Networks (CNNs). However, in this paper we show that their effectiveness is severely limited when applied to Fully-Connected (FC) layers, which are commonly used in state-of-the-art DNNs, as is the case for modern Recurrent Neural Networks (RNNs) and Transformer models. To improve energy-efficiency of FC computation we present CREW, a hardware accelerator that implements Computation Reuse and an Efficient Weight Storage mechanism to exploit the large number of repeated weights in FC layers. CREW first performs the multiplications of the unique weights by their respective inputs and stores the results in an on-chip buffer. The storage requirements are modest due to the small number of unique weights and the relatively small size of the input compared to convolutional layers. Next, CREW computes each output by fetching and adding its required products. To this end, each weight is replaced offline by an index in the buffer of unique products. Indices are typically smaller than the quantized weights, since the number of unique weights for each input tends to be much lower than the range of quantized weights, which reduces storage and memory bandwidth requirements.
Overall, CREW greatly reduces the number of multiplications and provides significant savings in model memory footprint and memory bandwidth usage. We evaluate CREW on a diverse set of modern DNNs. On average, CREW provides 2.61x speedup and 2.42x energy savings over a TPU-like accelerator. Compared to UCNN, a state-of-the-art computation reuse technique, CREW achieves 2.10x speedup and 2.08x energy savings on average.

Submitted 11 March, 2022; v1 submitted 20 July, 2021; originally announced July 2021.
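
The two-step flow described in the CREW abstract above (multiply each unique weight by its input once, then build every output by fetching and adding indexed products) can be sketched in software as follows. This is only an illustrative model of the idea, not the hardware design: the function name `crew_fc_layer` and the list-based data layout are assumptions made for the example.

```python
def crew_fc_layer(inputs, weights):
    """Evaluate a fully-connected layer with computation reuse.

    weights[j][i] is the quantized weight connecting input i to output j.
    Returns (outputs, number_of_multiplications_performed).
    """
    n_out, n_in = len(weights), len(inputs)

    # Offline step: for each input, collect its unique weights and replace
    # every weight by an index into that input's list of unique weights.
    unique = []
    index = [[0] * n_in for _ in range(n_out)]
    for i in range(n_in):
        vals = sorted({weights[j][i] for j in range(n_out)})
        pos = {w: k for k, w in enumerate(vals)}
        unique.append(vals)
        for j in range(n_out):
            index[j][i] = pos[weights[j][i]]

    # Step 1: multiply each unique weight by its input exactly once
    # (this table plays the role of the buffer of unique products).
    products = [[w * inputs[i] for w in unique[i]] for i in range(n_in)]

    # Step 2: each output fetches its required products by index and adds them.
    outputs = [sum(products[i][index[j][i]] for i in range(n_in))
               for j in range(n_out)]

    mults = sum(len(u) for u in unique)
    return outputs, mults


if __name__ == "__main__":
    outs, mults = crew_fc_layer([2, 3], [[1, 4], [1, 4], [5, 4]])
    print(outs, mults)
```

In this toy layer a direct evaluation needs 6 multiplications (3 outputs times 2 inputs), while the reuse scheme needs only as many as there are unique weights per input.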

arXiv:2101.09083 [pdf, other]

Exploiting Beam Search Confidence for Energy-Efficient Speech Recognition

Authors: Dennis Pinto, Jose-María Arnau, Antonio González

Abstract: With computers getting more and more powerful and integrated in our daily lives, the focus is increasingly shifting towards more human-friendly interfaces, making Automatic Speech Recognition (ASR) a central player as the ideal means of interaction with machines. Consequently, interest in speech technology has grown in the last few years, with more systems being proposed and higher accuracy levels being achieved, even surpassing human accuracy. While ASR systems become increasingly powerful, the computational complexity also increases, and the hardware support has to keep pace. In this paper, we propose a technique to improve the energy-efficiency and performance of ASR systems, focusing on low-power hardware for edge devices. We focus on optimizing the DNN-based Acoustic Model evaluation, as we have observed it to be the main bottleneck in state-of-the-art ASR systems, by leveraging run-time information from the Beam Search. By doing so, we reduce energy and execution time of the acoustic model evaluation by 25.6% and 25.9%, respectively, with negligible accuracy loss.

Submitted 22 January, 2021; originally announced January 2021.

arXiv:2009.10656 [pdf, other]

E-BATCH: Energy-Efficient and High-Throughput RNN Batching

Authors: Franyell Silfa, Jose Maria Arnau, Antonio Gonzalez

Abstract: Recurrent Neural Network (RNN) inference exhibits low hardware utilization due to the strict data dependencies across time-steps. Batching multiple requests can increase throughput. However, RNN batching requires a large amount of padding since the batched input sequences may largely differ in length. Schemes that dynamically update the batch every few time-steps avoid padding. However, they require executing different RNN layers in a short timespan, decreasing energy efficiency. Hence, we propose E-BATCH, a low-latency and energy-efficient batching scheme tailored to RNN accelerators. It consists of a runtime system and effective hardware support. The runtime concatenates multiple sequences to create large batches, resulting in substantial energy savings. Furthermore, the accelerator notifies it when the evaluation of a sequence is done, so that a new sequence can be immediately added to a batch, thus largely reducing the amount of padding. E-BATCH dynamically controls the number of time-steps evaluated per batch to achieve the best trade-off between latency and energy efficiency for the given hardware platform. We evaluate E-BATCH on top of E-PUR and TPU. In E-PUR, E-BATCH improves throughput by 1.8x and energy-efficiency by 3.6x, whereas in TPU, it improves throughput by 2.1x and energy-efficiency by 1.6x, over the state-of-the-art.

Submitted 22 September, 2020; originally announced September 2020.

arXiv:2007.07131 [pdf, other]

Irregular Accesses Reorder Unit: Improving GPGPU Memory Coalescing for Graph-Based Workloads

Authors: Albert Segura, Jose-Maria Arnau, Antonio Gonzalez

Abstract: GPGPU architectures have become established as the dominant parallelization and performance platform achieving exceptional popularization and empowering domains such as regular algebra, machine learning, image detection and self-driving cars. However, irregular applications struggle to fully realize GPGPU performance as a result of control flow divergence and memory divergence due to irregular memory access patterns. To ameliorate these issues, programmers are obligated to carefully consider architecture features and devote significant efforts to modify the algorithms with complex optimization techniques, which shift programmers' priorities yet struggle to quell the shortcomings. We show that in graph-based GPGPU irregular applications these inefficiencies prevail, yet we find that it is possible to relax the strict relationship between thread and data processed to empower new optimizations. Based on this key idea, we propose the Irregular accesses Reorder Unit (IRU), a novel hardware extension tightly integrated in the GPGPU pipeline. The IRU reorders data processed by the threads on irregular accesses, which significantly improves memory coalescing and allows increased performance and energy efficiency. Additionally, the IRU is capable of filtering and merging duplicated irregular accesses, which further improves graph-based irregular applications. Programmers can easily utilize the IRU with a simple API, or compiler optimized generated code with the extended ISA instructions provided.
We evaluate our proposal for state-of-the-art graph-based algorithms and a wide selection of applications. Results show that the IRU achieves a memory coalescing improvement of 1.32x and a 46% reduction in the overall traffic in the memory hierarchy, which results in 1.33x and 13% improvement in performance and energy savings respectively, while incurring a small 5.6% area overhead.

Submitted 15 March, 2022; v1 submitted 14 July, 2020; originally announced July 2020.

Report number: UPC-DAC-RR-ARCO-2020-1

arXiv:1911.04244 [pdf, other]

Boosting LSTM Performance Through Dynamic Precision Selection

Authors: Franyell Silfa, Jose-Maria Arnau, Antonio Gonzàlez

Abstract: The use of low numerical precision is a fundamental optimization included in modern accelerators for Deep Neural Networks (DNNs). The number of bits of the numerical representation is set to the minimum precision that is able to retain accuracy based on an offline profiling, and it is kept constant for DNN inference. In this work, we explore the use of dynamic precision selection during DNN inference. We focus on Long Short Term Memory (LSTM) networks, which represent the state-of-the-art networks for applications such as machine translation and speech recognition. Unlike conventional DNNs, LSTM networks remember information from previous evaluations by storing data in the LSTM cell state. Our key observation is that the cell state determines the amount of precision required: time steps where the cell state changes significantly require higher precision, whereas time steps where the cell state is stable can be computed with lower precision without any loss in accuracy. Based on this observation, we implement a novel hardware scheme that tracks the evolution of the elements in the LSTM cell state and dynamically selects the appropriate precision in each time step. For a set of popular LSTM networks, our scheme selects the lowest precision for more than 66% of the time, outperforming systems that fix the precision statically. We evaluate our proposal on top of a modern accelerator highly optimized for LSTM computation, and show that it provides 1.56x speedup and 23% energy savings on average without any loss in accuracy.
The extra hardware to determine the appropriate precision represents a small area overhead of 8.8%.

Submitted 7 November, 2019; originally announced November 2019.

arXiv:1911.01258 [pdf, other]

SHARP: An Adaptable, Energy-Efficient Accelerator for Recurrent Neural Network

Authors: Reza Yazdani, Olatunji Ruwase, Minjia Zhang, Yuxiong He, Jose-Maria Arnau, Antonio Gonzalez

Abstract: The effectiveness of Recurrent Neural Networks (RNNs) for tasks such as Automatic Speech Recognition has fostered interest in RNN inference acceleration. Due to the recurrent nature and data dependencies of RNN computations, prior work has designed customized architectures specifically tailored to the computation pattern of RNNs, getting high computation efficiency for certain chosen model sizes. However, given that the dimensionality of RNNs varies a lot for different tasks, it is crucial to generalize this efficiency to diverse configurations. In this work, we identify adaptiveness as a key feature that is missing from today's RNN accelerators. In particular, we first show the problem of low resource-utilization and low adaptiveness for the state-of-the-art RNN implementations on GPU, FPGA and ASIC architectures. To solve these issues, we propose an intelligent tile-based dispatching mechanism for increasing the adaptiveness of RNN computation, in order to efficiently handle the data dependencies. To do so, we propose Sharp as a hardware accelerator, which pipelines RNN computation using an effective scheduling scheme to hide most of the dependent serialization. Furthermore, Sharp employs dynamic reconfigurable architecture to adapt to the model's characteristics. Sharp achieves 2x, 2.8x, and 82x speedups on average, considering different RNN models and resource budgets, compared to the state-of-the-art ASIC, FPGA, and GPU implementations, respectively.
Furthermore, we provide significant energy-reduction with respect to the previous solutions, due to the low power dissipation of Sharp (321 GFLOPS/Watt).

Submitted 21 May, 2023; v1 submitted 4 November, 2019; originally announced November 2019.

arXiv:1906.02535 [pdf, other]

(Pen-) Ultimate DNN Pruning

Authors: Marc Riera, Jose-Maria Arnau, Antonio Gonzalez

Abstract: DNN pruning reduces memory footprint and computational work of DNN-based solutions to improve performance and energy-efficiency. An effective pruning scheme should be able to systematically remove connections and/or neurons that are unnecessary or redundant, reducing the DNN size without any loss in accuracy. In this paper we show that prior pruning schemes require an extremely time-consuming iterative process that requires retraining the DNN many times to tune the pruning hyperparameters. We propose a DNN pruning scheme based on Principal Component Analysis and relative importance of each neuron's connection that automatically finds the optimized DNN in one shot without requiring hand-tuning of multiple parameters.

Submitted 6 June, 2019; originally announced June 2019.

arXiv:1711.07480 [pdf, other]

doi 10.1145/3243176.3243184

E-PUR: An Energy-Efficient Processing Unit for Recurrent Neural Networks

Authors: Franyell Silfa, Gem Dot, Jose-Maria Arnau, Antonio Gonzalez

Abstract: Recurrent Neural Networks (RNNs) are a key technology for emerging applications such as automatic speech recognition, machine translation or image description. Long Short Term Memory (LSTM) networks are the most successful RNN implementation, as they can learn long term dependencies to achieve high accuracy. Unfortunately, the recurrent nature of LSTM networks significantly constrains the amount of parallelism and, hence, multicore CPUs and many-core GPUs exhibit poor efficiency for RNN inference. In this paper, we present E-PUR, an energy-efficient processing unit tailored to the requirements of LSTM computation. The main goal of E-PUR is to support large recurrent neural networks for low-power mobile devices. E-PUR provides an efficient hardware implementation of LSTM networks that is flexible to support diverse applications. One of its main novelties is a technique that we call Maximizing Weight Locality (MWL), which improves the temporal locality of the memory accesses for fetching the synaptic weights, reducing the memory requirements by a large extent. Our experimental results show that E-PUR achieves real-time performance for different LSTM networks, while reducing energy consumption by orders of magnitude with respect to general-purpose processors and GPUs, and it requires a very small chip area. Compared to a modern mobile SoC, an NVIDIA Tegra X1, E-PUR provides an average energy reduction of 92x.

Submitted 20 November, 2017; originally announced November 2017.

Report number: UPC-DAC-RR-2017-8

Journal ref: PACT '18 Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, Article No. 18, 2018

arXiv:1707.08089 [pdf, other]

Delay Performance of MISO Wireless Communications

Authors: Jesus Arnau, Marios Kountouris

Abstract: Ultra-reliable, low latency communications (URLLC) are currently attracting significant attention due to the emergence of mission-critical applications and device-centric communication. URLLC will entail a fundamental paradigm shift from throughput-oriented system design towards holistic designs for guaranteed and reliable end-to-end latency. A deep understanding of the delay performance of wireless networks is essential for efficient URLLC systems. In this paper, we investigate the network layer performance of multiple-input, single-output (MISO) systems under statistical delay constraints. We provide closed-form expressions for MISO diversity-oriented service process and derive probabilistic delay bounds using tools from stochastic network calculus. In particular, we analyze transmit beamforming with perfect and imperfect channel knowledge and compare it with orthogonal space-time codes and antenna selection. The effect of transmit power, number of antennas, and finite blocklength channel coding on the delay distribution is also investigated. Our higher layer performance results reveal key insights of MISO channels and provide useful guidelines for the design of ultra-reliable communication systems that can guarantee the stringent URLLC latency requirements.

Submitted 25 July, 2017; originally announced July 2017.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:1703.06069 [pdf, other]

Performance Analysis of Ultra-Dense Networks with Elevated Base Stations

Authors: Italo Atzeni, Jesús Arnau, Marios Kountouris

Abstract: This paper analyzes the downlink performance of ultra-dense networks with elevated base stations (BSs). We consider a general dual-slope pathloss model with distance-dependent probability of line-of-sight (LOS) transmission between BSs and receivers. Specifically, we consider the scenario where each link may be obstructed by randomly placed buildings. Using tools from stochastic geometry, we show that both coverage probability and area spectral efficiency decay to zero as the BS density grows large. Interestingly, we show that the BS height alone has a detrimental effect on the system performance even when the standard single-slope pathloss model is adopted.

Submitted 17 March, 2017; originally announced March 2017.

Comments: 6 pages, 4 figures. To be presented at SpaSWiN'17 (WiOpt workshops), May 2017

arXiv:1703.01279 [pdf, other]

Downlink Cellular Network Analysis with LOS/NLOS Propagation and Elevated Base Stations

Authors: Italo Atzeni, Jesús Arnau, Marios Kountouris

Abstract: In this paper, we investigate the downlink performance of dense cellular networks with elevated base stations (BSs) using a channel model that incorporates line-of-sight (LOS)/non-line-of-sight (NLOS) propagation in both small-scale and large-scale fading. Modeling LOS fading with Nakagami-$m$ fading, we provide a unified framework based on stochastic geometry that encompasses both closest and strongest BS association. Our study is particularized to two distance-dependent LOS/NLOS models of practical interest. Considering the effect of LOS propagation alone, we derive closed-form expressions for the coverage probability with Nakagami-$m$ fading, showing that the performance for strongest BS association is the same as in the case of Rayleigh fading, whereas for closest BS association it monotonically increases with the shape parameter $m$. Then, focusing on the effect of elevated BSs, we show that network densification eventually leads to near-universal outage even for moderately low BS densities: in particular, the maximum area spectral efficiency is proportional to the inverse of the squared BS height.

Submitted 3 March, 2017; originally announced March 2017.

Comments: Submitted to the IEEE for possible publication

arXiv:1702.06493 [pdf, ps, other]

Timely CSI Acquisition Exploiting Full Duplex

Authors: Jesus Arnau, Marios Kountouris

Abstract: In this paper, we propose a method for acquiring accurate and timely channel state information (CSI) by leveraging full-duplex transmission. Specifically, we propose a mobile communication system in which base stations continuously transmit a pilot sequence in the uplink frequency band, while terminals use self-interference cancellation capabilities to obtain CSI at any time. Our proposal outperforms its half-duplex counterpart by at least 50% in terms of throughput while ensuring the same (or even lower) outage probability. Remarkably, it also outperforms using full duplex for downlink data transmission for low values of downlink bandwidth and received power.

Submitted 21 February, 2017; originally announced February 2017.

Comments: 6 pages, 4 figures, accepted at IEEE WCNC 2017

arXiv:1602.03644 [pdf, other]

Impact of LOS/NLOS Propagation and Path Loss in Ultra-Dense Cellular Networks

Authors: Jesús Arnau, Italo Atzeni, Marios Kountouris

Abstract: Most prior work on performance analysis of ultra-dense cellular networks (UDNs) has considered standard power-law path loss models and non-line-of-sight (NLOS) propagation modeled by Rayleigh fading. The effect of line-of-sight (LOS) on coverage and throughput and its implication on network densification are still not fully understood. In this paper, we investigate the performance of UDNs when the signal propagation includes both LOS and NLOS components. Using a stochastic geometry based cellular network model, we derive expressions for the coverage probability, as well as tight approximations and upper bounds for both closest and strongest base station (BS) association. Our results show that under standard singular path loss model, LOS propagation increases the coverage, especially with nearest BS association. On the contrary, using dual slope path loss, LOS propagation is beneficial with closest BS association and detrimental for strongest BS association.

Submitted 28 September, 2016; v1 submitted 11 February, 2016; originally announced February 2016.

Comments: Paper presented at IEEE ICC 2016 - Wireless Communications Symposium

arXiv:1512.05526 [pdf, other]

Single-Pole IIR Channel Power Prediction with Variable Delays

Authors: Jesús Arnau

Abstract: Exploiting outdated channel quality indicators is crucial in most adaptive wireless communication systems. This is often done through channel prediction based on previous received indicators. In this paper, we analyze the case where the feedback delay experienced by the quality indicators is not constant, but random. Focusing on a single-pole IIR predictor, we obtain analytical expressions for the MSE and the filter parameters, and study the throughput behavior through Monte Carlo simulations. Results show that prediction provides a performance advantage for average delays smaller than 30 ms for low terminal speeds.

Submitted 17 December, 2015; originally announced December 2015.

Comments: Paper presented at IEEE GLOBECOM 2015, San Diego, California

arXiv:1503.07321 [pdf, other]

doi 10.1109/ICCW.2015.7247312

Fractional Pilot Reuse in Massive MIMO Systems

Authors: Italo Atzeni, Jesús Arnau, Mérouane Debbah

Abstract: Pilot contamination is known to be one of the main impairments for massive MIMO multi-cell communications. Inspired by the concept of fractional frequency reuse and by recent contributions on pilot reutilization among non-adjacent cells, we propose a new pilot allocation scheme to mitigate this effect. The key idea is to allow users in neighboring cells that are closest to their base stations to reuse the same pilot sequences. Focusing on the uplink, we obtain expressions for the overall spectral efficiency per cell for different linear combining techniques at the base station and use them to obtain both the optimal pilot reuse parameters and the optimal number of scheduled users. Numerical results show a remarkable improvement in terms of spectral efficiency with respect to the existing techniques.

Submitted 16 June, 2015; v1 submitted 25 March, 2015; originally announced March 2015.

Comments: Paper presented at the IEEE ICC 2015 Workshop on 5G & Beyond - Enabling Technologies and Applications

arXiv:1303.3110 [pdf, ps, other]

Adaptive Transmission Techniques for Mobile Satellite Links

Authors: Jesus Arnau, Alberto Rico-Alvariño, Carlos Mosquera

Abstract: Adapting the transmission rate in an LMS channel is a challenging task because of the relatively fast time variations, of the long delays involved, and of the difficulty in mapping the parameters of a time-varying channel into communication performance. In this paper, we propose two strategies for dealing with these impairments, namely, multi-layer coding (MLC) in the forward link, and open-loop adaptation in the return link. Both strategies rely on physical-layer abstraction tools for predicting the link performance. We will show that, in both cases, it is possible to increase the average spectral efficiency while at the same time keeping the outage probability under a given threshold. To do so, the forward link strategy will rely on introducing some latency in the data stream by using retransmissions. The return link, on the other hand, will rely on a statistical characterization of a physical-layer abstraction measure.

Submitted 13 March, 2013; originally announced March 2013.

Comments: Presented at the 30th AIAA International Communications Satellite Systems Conference (ICSSC), Ottawa, Canada, 2012. Best Professional Paper Award

arXiv:1211.5903 [pdf, ps, other]

MMSE Performance Analysis of Generalized Multibeam Satellite Channels

Authors: Dimitrios Christopoulos, Jesus Arnau, Symeon Chatzinotas, Carlos Mosquera, Bjorn Ottersten

Abstract: Aggressive frequency reuse in the return link (RL) of multibeam satellite communications (SatComs) is crucial towards the implementation of next generation, interactive satellite services. In this direction, multiuser detection has shown great potential in mitigating the increased intrasystem interferences, induced by a tight spectrum reuse. Herein we present an analytic framework to describe the linear Minimum Mean Square Error (MMSE) performance of multiuser channels that exhibit full receive correlation: an inherent attribute of the RL of multibeam SatComs. Analytic, tight approximations on the MMSE performance are proposed for cases where closed form solutions are not available in the existing literature. The proposed framework is generic, thus providing a generalized solution straightforwardly extendable to various fading models over channels that exhibit full receive correlation. Simulation results are provided to show the tightness of the proposed approximation with respect to the available transmit power.

Submitted 26 November, 2012; originally announced November 2012.

Comments: 4 pages, 2 figures, submitted to the IEEE
