-
Vitamin-V: Expanding Open-Source RISC-V Cloud Environments
Authors:
Ramon Canal,
Stefano Di Carlo,
Dimitris Gizopoulos,
Alberto Scionti,
Francesco Lubrano,
Josep-Lluís Berral,
Aaron Call,
Diego Marron,
Konstantinos Nikas,
Dionisios Pnevmatikatos,
Daniel Raho,
Alvise Rigo,
Yannis Papaefstathiou,
José María Arnau,
Angelos Arelakis
Abstract:
Among the key contributions of Vitamin-V (2023-2025 Horizon Europe project), we develop a complete RISC-V open-source software stack for cloud services with comparable performance to the cloud-dominant x86 counterpart. In this paper, we detail the software suites and applications ported plus the three cloud setups under evaluation.
Submitted 12 June, 2024;
originally announced July 2024.
-
Design, Implementation and Evaluation of the SVNAPOT Extension on a RISC-V Processor
Authors:
Nikolaos-Charalampos Papadopoulos,
Stratos Psomadakis,
Vasileios Karakostas,
Nectarios Koziris,
Dionisios N. Pnevmatikatos
Abstract:
The RISC-V SVNAPOT Extension aims to remedy the performance overhead of the Memory Management Unit (MMU) under heavy memory loads. The Privileged Specification defines additional Natural-Power-of-Two (NAPOT) multiples of the 4KB base page size, with 64KB as the default candidate. In this paper we extend the MMU of the Rocket Chip Generator to manage the co-location of 64KB pages along with 4KB pages in the L2 TLB. We present the design challenges we had to overcome and the trade-offs of our design choices. We conduct a preliminary sensitivity analysis of the L2 TLB with different configurations/page sizes. Finally, we summarize techniques that could further improve memory management performance on RISC-V systems.
Submitted 22 June, 2024;
originally announced June 2024.
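As a rough illustration of the 64KB NAPOT pages mentioned in the abstract above, the following sketch shows how a leaf PTE marked as NAPOT could be translated. It assumes an Sv39-style PTE layout (N bit at position 63, 44-bit PPN starting at bit 10) and the 64KB encoding in which the low four PPN bits read `1000`. The function names are hypothetical and this is not the paper's Rocket Chip implementation.

```python
PAGE_SHIFT = 12                  # 4KB base page
PTE_N_BIT = 1 << 63              # NAPOT marker bit (assumed Sv39-style layout)
PTE_PPN_SHIFT = 10
PTE_PPN_MASK = (1 << 44) - 1


def pte_ppn(pte):
    """Extract the physical page number field from a PTE."""
    return (pte >> PTE_PPN_SHIFT) & PTE_PPN_MASK


def is_napot_64k(pte):
    """True if the PTE is marked NAPOT and its low PPN bits encode 64KB."""
    return bool(pte & PTE_N_BIT) and (pte_ppn(pte) & 0xF) == 0b1000


def translate(pte, vaddr):
    """Translate vaddr through a leaf PTE (hypothetical helper)."""
    ppn = pte_ppn(pte)
    if is_napot_64k(pte):
        # The low 4 PPN bits hold the size encoding, not address bits;
        # they are taken from the virtual page number instead.
        vpn_low = (vaddr >> PAGE_SHIFT) & 0xF
        ppn = (ppn & ~0xF) | vpn_low
    return (ppn << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1))
```

Because all sixteen 4KB pages of the 64KB region share one PTE value, a TLB that understands the encoding can cover the whole region with a single entry.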
-
Vitamin-V: Virtual Environment and Tool-boxing for Trustworthy Development of RISC-V based Cloud Services
Authors:
A. Arelakis,
J. M. Arnau,
J. L. Berral,
A. Call,
R. Canal,
S. Di Carlo,
J. Costa,
D. Gizopoulos,
V. Karakostas,
F. Lubrano,
K. Nikas,
Y. Nikolakopoulos,
B. Otero,
G. Papadimitriou,
I. Papaefstathiou,
D. Pnevmatikatos,
D. Raho,
A. Rigo,
E. Rodríguez,
A. Savino,
A. Scionti,
N. Tampouratzis,
A. Torregrosa
Abstract:
Vitamin-V is a 2023-2025 Horizon Europe project that aims to develop a complete RISC-V open-source software stack for cloud services with comparable performance to the cloud-dominant x86 counterpart and a powerful virtual execution environment for software development, validation, verification, and test that considers the relevant RISC-V ISA extensions for cloud deployment.
Submitted 27 June, 2024; v1 submitted 18 May, 2023;
originally announced May 2023.
-
ArrayFlex: A Systolic Array Architecture with Configurable Transparent Pipelining
Authors:
C. Peltekis,
D. Filippas,
G. Dimitrakopoulos,
C. Nicopoulos,
D. Pnevmatikatos
Abstract:
Convolutional Neural Networks (CNNs) are the state-of-the-art solution for many deep learning applications. For maximum scalability, their computation should combine high performance and energy efficiency. In practice, the convolutions of each CNN layer are mapped to a matrix multiplication that includes all input features and kernels of each layer and is computed using a systolic array. In this work, we focus on the design of a systolic array with a configurable pipeline, with the goal of selecting an optimal pipeline configuration for each CNN layer. The proposed systolic array, called ArrayFlex, can operate in normal or shallow pipeline mode, thus balancing the execution time in cycles and the operating clock frequency. By selecting the appropriate pipeline configuration per CNN layer, ArrayFlex reduces the inference latency of state-of-the-art CNNs by 11%, on average, as compared to a traditional fixed-pipeline systolic array. Most importantly, this result is achieved while using 13%-23% less power, for the same applications, thus offering a combined energy-delay-product efficiency between 1.4x and 1.8x.
Submitted 6 June, 2023; v1 submitted 22 November, 2022;
originally announced November 2022.
-
Performance landscape of resource-constrained platforms targeting DNNs
Authors:
Panagiotis Miliadis,
Christos-Savvas Bouganis,
Dionisios Pnevmatikatos
Abstract:
In recent years, a significant number of complex, deep neural networks have been developed for a variety of applications including speech and face recognition, computer vision in the areas of health-care, automatic translation, image classification, etc. Moreover, there is an increasing demand for deploying these networks in resource-constrained edge devices. As the computational demands of these models keep increasing, pushing the targeted devices to their limits, the constant development of new hardware systems tailored to those workloads has been observed. Since programmability of these diverse and complex platforms -- compounded by the rapid development of new DNN models -- is a major challenge, platform vendors have developed Machine Learning tailored SDKs to maximize the platform's performance.
This work investigates the performance achieved on a number of modern commodity embedded platforms coupled with the vendors' provided software support when state-of-the-art DNN models from image classification, object detection and image segmentation are targeted. The work quantifies the relative latency gains of the particular embedded platforms and provides insights on the relationship between the required minimum batch size for achieving maximum throughput, concluding that modern embedded systems reach their maximum performance even for modest batch sizes when a modern state of the art DNN model is targeted. Overall, the presented results provide a guide for the expected performance for a number of state-of-the-art DNNs on popular embedded platforms across the image classification, detection and segmentation domains.
Submitted 3 November, 2021; v1 submitted 21 July, 2021;
originally announced July 2021.
-
Enabling Virtual Memory Research on RISC-V with a Configurable TLB Hierarchy for the Rocket Chip Generator
Authors:
Nikolaos Charalampos Papadopoulos,
Vasileios Karakostas,
Konstantinos Nikas,
Nectarios Koziris,
Dionisios N. Pnevmatikatos
Abstract:
The Rocket Chip Generator uses a collection of parameterized processor components to produce RISC-V-based SoCs. It is a powerful tool that can produce a wide variety of processor designs ranging from tiny embedded processors to complex multi-core systems. In this paper we extend the features of the Memory Management Unit of the Rocket Chip Generator and specifically the TLB hierarchy. TLBs are essential in terms of performance because they mitigate the overhead of frequent Page Table Walks, but may harm the critical path of the processor due to their size and/or associativity. In the original Rocket Chip implementation the L1 Instruction/Data TLB is fully-associative and the shared L2 TLB is direct-mapped. We lift these restrictions and design and implement configurable, set-associative L1 and L2 TLB templates that can create any organization from direct-mapped to fully-associative to achieve the desired ratio of performance and resource utilization, especially for larger TLBs. We evaluate different TLB configurations and present performance, area, and frequency results of our design using benchmarks from the SPEC2006 suite on the Xilinx ZCU102 FPGA.
Submitted 16 September, 2020;
originally announced September 2020.
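The range of TLB organizations described in the abstract above, from direct-mapped to fully-associative, can be sketched with a small software model. This is a minimal illustration only: the class name, the modulo set-index function, and the LRU replacement policy are assumptions for the example and are not taken from the Rocket Chip templates.

```python
from collections import OrderedDict


class SetAssocTLB:
    """Toy TLB model: ways=1 is direct-mapped, ways=entries is fully-associative."""

    def __init__(self, entries, ways):
        if ways <= 0 or entries % ways != 0:
            raise ValueError("entries must be a positive multiple of ways")
        self.ways = ways
        self.num_sets = entries // ways
        # One ordered map per set; insertion order tracks recency (LRU first).
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def _set_for(self, vpn):
        # Assumed index function: low-order bits of the virtual page number.
        return self.sets[vpn % self.num_sets]

    def lookup(self, vpn):
        """Return the cached PPN, or None on a miss."""
        s = self._set_for(vpn)
        if vpn in s:
            s.move_to_end(vpn)  # mark as most recently used
            return s[vpn]
        return None

    def insert(self, vpn, ppn):
        """Fill an entry, evicting the least recently used one if the set is full."""
        s = self._set_for(vpn)
        if vpn in s:
            s.move_to_end(vpn)
        elif len(s) >= self.ways:
            s.popitem(last=False)
        s[vpn] = ppn
```

With four entries, `SetAssocTLB(4, 1)` lets virtual pages 0 and 4 conflict on the same slot, whereas `SetAssocTLB(4, 4)` holds both at once. The cost of that flexibility in hardware is a wider tag comparison per lookup.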
-
Design Guidelines for High-Performance SCM Hierarchies
Authors:
Dmitrii Ustiugov,
Alexandros Daglis,
Javier Picorel,
Mark Sutherland,
Edouard Bugnion,
Babak Falsafi,
Dionisios Pnevmatikatos
Abstract:
With emerging storage-class memory (SCM) nearing commercialization, there is evidence that it will deliver the much-anticipated high density and access latencies within only a few factors of DRAM. Nevertheless, the latency-sensitive nature of memory-resident services makes seamless integration of SCM in servers questionable. In this paper, we ask the question of how best to introduce SCM for such servers to improve overall performance/cost over existing DRAM-only architectures. We first show that even with the most optimistic latency projections for SCM, the higher memory access latency results in prohibitive performance degradation. However, we find that deployment of a modestly sized high-bandwidth 3D stacked DRAM cache makes the performance of an SCM-mostly memory system competitive. The high degree of spatial locality that memory-resident services exhibit not only simplifies the DRAM cache's design as page-based, but also enables the amortization of increased SCM access latencies and the mitigation of SCM's read/write latency disparity.
We identify the set of memory hierarchy design parameters that plays a key role in the performance and cost of a memory system combining an SCM technology and a 3D stacked DRAM cache. We then introduce a methodology to drive provisioning for each of these design parameters under a target performance/cost goal. Finally, we use our methodology to derive concrete results for specific SCM technologies. With PCM as a case study, we show that a two bits/cell technology hits the performance/cost sweet spot, reducing the memory subsystem cost by 40% while keeping performance within 3% of the best performing DRAM-only system, whereas single-level and triple-level cell organizations are impractical for use as memory replacements.
Submitted 7 March, 2019; v1 submitted 20 January, 2018;
originally announced January 2018.