Search | arXiv e-print repository

Design and Experimental Investigation of Trikarenos: A Fault-Tolerant 28nm RISC-V-based SoC

Authors: Michael Rogenmoser, Philip Wiese, Bruno Endres Forlin, Frank K. Gürkaynak, Paolo Rech, Alessandra Menicucci, Marco Ottavi, Luca Benini

Abstract: We present a fault-tolerant by-design RISC-V SoC and experimentally assess it under atmospheric neutrons and 200 MeV protons. The dedicated ECC and Triple-Core Lockstep countermeasures correct most errors, guaranteeing a device cross-section lower than $5.36 \times 10^{-12}$ cm$^2$. We present a fault-tolerant by-design RISC-V SoC and experimentally assess it under atmospheric neutrons and 200 MeV protons. The dedicated ECC and Triple-Core Lockstep countermeasures correct most errors, guaranteeing a device cross-section lower than $5.36 \times 10^{-12}$ cm$^2$. △ Less

Submitted 8 July, 2024; originally announced July 2024.

Comments: 4 pages (excluding title page), accepted at RADECS 2024

arXiv:2406.06546 [pdf, other]

SentryCore: A RISC-V Co-Processor System for Safe, Real-Time Control Applications

Authors: Michael Rogenmoser, Alessandro Ottaviano, Thomas Benz, Robert Balas, Matteo Perotti, Angelo Garofalo, Luca Benini

Abstract: In the last decade, we have witnessed exponential growth in the complexity of control systems for safety-critical applications (automotive, robots, industrial automation) and their transition to heterogeneous mixed-criticality systems (MCSs). The growth of the RISC-V ecosystem is creating a major opportunity to develop open-source, vendor-neutral reference platforms for safety-critical computing.… ▽ More In the last decade, we have witnessed exponential growth in the complexity of control systems for safety-critical applications (automotive, robots, industrial automation) and their transition to heterogeneous mixed-criticality systems (MCSs). The growth of the RISC-V ecosystem is creating a major opportunity to develop open-source, vendor-neutral reference platforms for safety-critical computing. We present SentryCore, a reliable, real-time, self-contained, open-source mega-IP for advanced control functions that can be seamlessly integrated into Systems-on-Chip, e.g., for automotive applications, through industry-standard Advanced eXtensible Interface 4 (AXI4). SentryCore features three embedded RISC-V processor cores in lockstep with error-correcting code (ECC) protected data memory for reliable execution of any safety-critical application. Context switching is accelerated to under 110 clock cycles via a RISC-V core-local interrupt controller (CLIC) and dedicated hardware extensions, while a timer-based direct memory access (DMA) engine streamlines sensor data readout during periodic control loops. SentryCore was implemented in Intel's 16nm process node and tested with FreeRTOS, ThreadX, and RTIC software support. △ Less

Submitted 16 May, 2024; originally announced June 2024.

Comments: 2 pages, accepted at the RISC-V Summit Europe 2024

arXiv:2310.02045 [pdf, other]

doi 10.1109/ICECS58634.2023.10382727

Trikarenos: A Fault-Tolerant RISC-V-based Microcontroller for CubeSats in 28nm

Authors: Michael Rogenmoser, Luca Benini

Abstract: One of the key challenges when operating microcontrollers in harsh environments such as space is radiation-induced Single Event Upsets (SEUs), which can lead to errors in computation. Common countermeasures rely on proprietary radiation-hardened technologies, low density technologies, or extensive replication, leading to high costs and low performance and efficiency. To combat this, we present Tri… ▽ More One of the key challenges when operating microcontrollers in harsh environments such as space is radiation-induced Single Event Upsets (SEUs), which can lead to errors in computation. Common countermeasures rely on proprietary radiation-hardened technologies, low density technologies, or extensive replication, leading to high costs and low performance and efficiency. To combat this, we present Trikarenos, a fault-tolerant 32-bit RISC-V microcontroller SoC in an advanced TSMC 28nm technology. Trikarenos alleviates the replication cost by employing a configurable triple-core lockstep configuration, allowing three Ibex cores to execute applications reliably, operating on ECC-protected memory. If reliability is not needed for a given application, the cores can operate independently in parallel for higher performance and efficiency. Trikarenos consumes 15.7mW at 250MHz executing a fault-tolerant matrix-matrix multiplication, a 21.5x efficiency gain over state-of-the-art, and performance is increased by 2.96x when reliability is not needed for processing, with a 2.36x increase in energy efficiency. △ Less

Submitted 3 October, 2023; originally announced October 2023.

Comments: 4 pages, 4 figures, accepted by IEEE International Conference on Electronics Circuits and Systems (ICECS) 2023

arXiv:2308.00154 [pdf, other]

PATRONoC: Parallel AXI Transport Reducing Overhead for Networks-on-Chip targeting Multi-Accelerator DNN Platforms at the Edge

Authors: Vikram Jain, Matheus Cavalcante, Nazareno Bruschi, Michael Rogenmoser, Thomas Benz, Andreas Kurth, Davide Rossi, Luca Benini, Marian Verhelst

Abstract: Emerging deep neural network (DNN) applications require high-performance multi-core hardware acceleration with large data bursts. Classical network-on-chips (NoCs) use serial packet-based protocols suffering from significant protocol translation overheads towards the endpoints. This paper proposes PATRONoC, an open-source fully AXI-compliant NoC fabric to better address the specific needs of multi… ▽ More Emerging deep neural network (DNN) applications require high-performance multi-core hardware acceleration with large data bursts. Classical network-on-chips (NoCs) use serial packet-based protocols suffering from significant protocol translation overheads towards the endpoints. This paper proposes PATRONoC, an open-source fully AXI-compliant NoC fabric to better address the specific needs of multi-core DNN computing platforms. Evaluation of PATRONoC in a 2D-mesh topology shows 34% higher area efficiency compared to a state-of-the-art classical NoC at 1 GHz. PATRONoC's throughput outperforms a baseline NoC by 2-8X on uniform random traffic and provides a high aggregated throughput of up to 350 GiB/s on synthetic and DNN workload traffic. △ Less

Submitted 31 July, 2023; originally announced August 2023.

Comments: Accepted and presented at 60th DAC

arXiv:2305.08562 [pdf, other]

doi 10.1109/MDAT.2023.3306720

FlooNoC: A Multi-Tbps Wide NoC for Heterogeneous AXI4 Traffic

Authors: Tim Fischer, Michael Rogenmoser, Matheus Cavalcante, Frank K. Gürkaynak, Luca Benini

Abstract: Meeting the staggering bandwidth requirements of today's applications challenges the traditional narrow and serialized NoCs, which hit hard bounds on the maximum operating frequency. This paper proposes FlooNoC, an open-source, low-latency, fully AXI4-compatible NoC with wide physical channels for latency-tolerant high-bandwidth non-blocking transactions and decoupled latency-critical short messag… ▽ More Meeting the staggering bandwidth requirements of today's applications challenges the traditional narrow and serialized NoCs, which hit hard bounds on the maximum operating frequency. This paper proposes FlooNoC, an open-source, low-latency, fully AXI4-compatible NoC with wide physical channels for latency-tolerant high-bandwidth non-blocking transactions and decoupled latency-critical short messages. We demonstrate the feasibility of wide channels by integrating a 5x5 router and links within a 9-core compute cluster in 12 nm FinFet technology. Our NoC achieves a bandwidth of 629Gbps per link while running at only 1.23 GHz (at 0.19 pJ/B/hop), with just 10% area overhead post layout. △ Less

Submitted 6 August, 2023; v1 submitted 15 May, 2023; originally announced May 2023.

arXiv:2305.05240 [pdf, other]

A High-performance, Energy-efficient Modular DMA Engine Architecture

Authors: Thomas Benz, Michael Rogenmoser, Paul Scheffler, Samuel Riedel, Alessandro Ottaviano, Andreas Kurth, Torsten Hoefler, Luca Benini

Abstract: Data transfers are essential in today's computing systems as latency and complex memory access patterns are increasingly challenging to manage. Direct memory access engines (DMAEs) are critically needed to transfer data independently of the processing elements, hiding latency and achieving high throughput even for complex access patterns to high-latency memory. With the prevalence of heterogeneous… ▽ More Data transfers are essential in today's computing systems as latency and complex memory access patterns are increasingly challenging to manage. Direct memory access engines (DMAEs) are critically needed to transfer data independently of the processing elements, hiding latency and achieving high throughput even for complex access patterns to high-latency memory. With the prevalence of heterogeneous systems, DMAEs must operate efficiently in increasingly diverse environments. This work proposes a modular and highly configurable open-source DMAE architecture called intelligent DMA (iDMA), split into three parts that can be composed and customized independently. The front-end implements the control plane binding to the surrounding system. The mid-end accelerates complex data transfer patterns such as multi-dimensional transfers, scattering, or gathering. The back-end interfaces with the on-chip communication fabric (data plane). We assess the efficiency of iDMA in various instantiations: In high-performance systems, we achieve speedups of up to 15.8x with only 1 % additional area compared to a base system without a DMAE. We achieve an area reduction of 10 % while improving ML inference performance by 23 % in ultra-low-energy edge AI systems over an existing DMAE solution. We provide area, timing, latency, and performance characterization to guide its instantiation in various systems. △ Less

Submitted 14 November, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

Comments: 14 pages, 14 figures, accepted by an IEEE journal for publication

arXiv:2303.08706 [pdf, other]

doi 10.1145/3635161

Hybrid Modular Redundancy: Exploring Modular Redundancy Approaches in RISC-V Multi-Core Computing Clusters for Reliable Processing in Space

Authors: Michael Rogenmoser, Yvan Tortorella, Davide Rossi, Francesco Conti, Luca Benini

Abstract: Space Cyber-Physical Systems (S-CPS) such as spacecraft and satellites strongly rely on the reliability of onboard computers to guarantee the success of their missions. Relying solely on radiation-hardened technologies is extremely expensive, and develo** inflexible architectural and microarchitectural modifications to introduce modular redundancy within a system leads to significant area increa… ▽ More Space Cyber-Physical Systems (S-CPS) such as spacecraft and satellites strongly rely on the reliability of onboard computers to guarantee the success of their missions. Relying solely on radiation-hardened technologies is extremely expensive, and develo** inflexible architectural and microarchitectural modifications to introduce modular redundancy within a system leads to significant area increase and performance degradation. To mitigate the overheads of traditional radiation hardening and modular redundancy approaches, we present a novel Hybrid Modular Redundancy (HMR) approach, a redundancy scheme that features a cluster of RISC-V processors with a flexible on-demand dual-core and triple-core lockstep grou** of computing cores with runtime split-lock capabilities. Further, we propose two recovery approaches, software-based and hardware-based, trading off performance and area overhead. Running at 430 MHz, our fault-tolerant cluster achieves up to 1160 MOPS on a matrix multiplication benchmark when configured in non-redundant mode and 617 and 414 MOPS in dual and triple mode, respectively. A software-based recovery in triple mode requires 363 clock cycles and occupies 0.612 mm2, representing a 1.3% area overhead over a non-redundant 12-core RISC-V cluster. As a high-performance alternative, a new hardware-based method provides rapid fault recovery in just 24 clock cycles and occupies 0.660 mm2, namely ~9.4% area overhead over the baseline non-redundant RISC-V cluster. The cluster is also enhanced with split-lock capabilities to enter one of the redundant modes with minimum performance loss, allowing execution of a mission-critical or a performance section, with <400 clock cycles overhead for entry and exit. The proposed system is the first to integrate these functionalities on an open-source RISC-V-based compute device, enabling finely tunable reliability vs. performance trade-offs. △ Less

Submitted 14 November, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

arXiv:2205.12580 [pdf, other]

doi 10.1109/ISVLSI54635.2022.00089

On-Demand Redundancy Grou**: Selectable Soft-Error Tolerance for a Multicore Cluster

Authors: Michael Rogenmoser, Nils Wistoff, Pirmin Vogel, Frank Gürkaynak, Luca Benini

Abstract: With the shrinking of technology nodes and the use of parallel processor clusters in hostile and critical environments, such as space, run-time faults caused by radiation are a serious cross-cutting concern, also impacting architectural design. This paper introduces an architectural approach to run-time configurable soft-error tolerance at the core level, augmenting a six-core open-source RISC-V c… ▽ More With the shrinking of technology nodes and the use of parallel processor clusters in hostile and critical environments, such as space, run-time faults caused by radiation are a serious cross-cutting concern, also impacting architectural design. This paper introduces an architectural approach to run-time configurable soft-error tolerance at the core level, augmenting a six-core open-source RISC-V cluster with a novel On-Demand Redundancy Grou** (ODRG) scheme. ODRG allows the cluster to operate either as two fault-tolerant cores, or six individual cores for high-performance, with limited overhead to switch between these modes during run-time. The ODRG unit adds less than 11% of a core's area for a three-core group, or a total of 1% of the cluster area, and shows negligible timing increase, which compares favorably to a commercial state-of-the-art implementation, and is 2.5$\times$ faster in fault recovery re-synchronization. Furthermore, when redundancy is not necessary, the ODRG approach allows the redundant cores to be used for independent computation, allowing up to 2.96$\times$ increase in performance for selected applications. △ Less

Submitted 3 October, 2023; v1 submitted 25 May, 2022; originally announced May 2022.

Journal ref: ISVLSI (2022) 398-401

Showing 1–8 of 8 results for author: Rogenmoser, M