Showing 1–7 of 7 results for author: Bertaccini, L

Searching in archive cs.

  1. arXiv:2406.15068  [pdf, other]

    cs.AR

    Occamy: A 432-Core 28.1 DP-GFLOP/s/W 83% FPU Utilization Dual-Chiplet, Dual-HBM2E RISC-V-based Accelerator for Stencil and Sparse Linear Algebra Computations with 8-to-64-bit Floating-Point Support in 12nm FinFET

    Authors: Gianna Paulin, Paul Scheffler, Thomas Benz, Matheus Cavalcante, Tim Fischer, Manuel Eggimann, Yichao Zhang, Nils Wistoff, Luca Bertaccini, Luca Colagrande, Gianmarco Ottavi, Frank K. Gürkaynak, Davide Rossi, Luca Benini

    Abstract: We present Occamy, a 432-core RISC-V dual-chiplet 2.5D system for efficient sparse linear algebra and stencil computations on FP64 and narrow (32-, 16-, 8-bit) SIMD FP data. Occamy features 48 clusters of RISC-V cores with custom extensions, two 64-bit host cores, and a latency-tolerant multi-chiplet interconnect and memory system with 32 GiB of HBM2E. It achieves leading-edge utilization on stenc…

    Submitted 21 June, 2024; originally announced June 2024.

    Comments: 2 pages, 7 figures. Accepted at the 2024 IEEE Symposium on VLSI Technology & Circuits

  2. arXiv:2405.19284  [pdf, other]

    cs.DC cs.AI cs.AR

    Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform

    Authors: Viviane Potocnik, Luca Colagrande, Tim Fischer, Luca Bertaccini, Daniele Jahier Pagliari, Alessio Burrello, Luca Benini

    Abstract: Transformer-based foundation models have become crucial for various domains, most notably natural language processing (NLP) or computer vision (CV). These models are predominantly deployed on high-performance GPUs or hardwired accelerators with highly customized, proprietary instruction sets. Until now, limited attention has been given to RISC-V-based general-purpose platforms. In our work, we pre…

    Submitted 29 May, 2024; originally announced May 2024.

    Comments: 14 pages, 10 figures, 4 tables, IEEE Transactions on Circuits and Systems for Artificial Intelligence

    ACM Class: C.4; C.3; I.2

  3. arXiv:2305.07325  [pdf, other]

    cs.AR

    Echoes: a 200 GOPS/W Frequency Domain SoC with FFT Processor and I2S DSP for Flexible Data Acquisition from Microphone Arrays

    Authors: Mattia Sinigaglia, Luca Bertaccini, Luca Valente, Angelo Garofalo, Simone Benatti, Luca Benini, Francesco Conti, Davide Rossi

    Abstract: Emerging applications in the IoT domain require ultra-low-power and high-performance end-nodes to deal with complex near-sensor-data analytics. Domains such as audio, radar, and Structural Health Monitoring require many computations to be performed in the frequency domain rather than in the time domain. We present ECHOES, a System-On-a-Chip (SoC) composed of a RISC-V core enhanced with fixed and f…

    Submitted 12 May, 2023; originally announced May 2023.

  4. arXiv:2301.03904  [pdf, other]

    cs.AR cs.AI cs.LG

    RedMule: A Mixed-Precision Matrix-Matrix Operation Engine for Flexible and Energy-Efficient On-Chip Linear Algebra and TinyML Training Acceleration

    Authors: Yvan Tortorella, Luca Bertaccini, Luca Benini, Davide Rossi, Francesco Conti

    Abstract: The increasing interest in TinyML, i.e., near-sensor machine learning on power budgets of a few tens of mW, is currently pushing toward enabling TinyML-class training as opposed to inference only. Current training algorithms, based on various forms of error and gradient backpropagation, rely on floating-point matrix operations to meet the precision and dynamic range requirements. So far, the energ…

    Submitted 6 May, 2023; v1 submitted 10 January, 2023; originally announced January 2023.

  5. arXiv:2209.00889  [pdf, other]

    cs.AR

    Soft Tiles: Capturing Physical Implementation Flexibility for Tightly-Coupled Parallel Processing Clusters

    Authors: Gianna Paulin, Matheus Cavalcante, Paul Scheffler, Luca Bertaccini, Yichao Zhang, Frank Gürkaynak, Luca Benini

    Abstract: Modern high-performance computing architectures (Multicore, GPU, Manycore) are based on tightly-coupled clusters of processing elements, physically implemented as rectangular tiles. Their size and aspect ratio strongly impact the achievable operating frequency and energy efficiency, but they should be as flexible as possible to achieve a high utilization for the top-level die floorplan. In this pa…

    Submitted 2 September, 2022; originally announced September 2022.

    Comments: 6 pages. Accepted for publication in the IEEE Computer Society Annual Symposium on VLSI (ISVLSI) 2022

  6. arXiv:2207.03192  [pdf, other]

    cs.AR

    MiniFloat-NN and ExSdotp: An ISA Extension and a Modular Open Hardware Unit for Low-Precision Training on RISC-V cores

    Authors: Luca Bertaccini, Gianna Paulin, Tim Fischer, Stefan Mach, Luca Benini

    Abstract: Low-precision formats have recently driven major breakthroughs in neural network (NN) training and inference by reducing the memory footprint of the NN models and improving the energy efficiency of the underlying hardware architectures. Narrow integer data types have been vastly investigated for NN inference and have successfully been pushed to the extreme of ternary and binary representations. In…

    Submitted 7 July, 2022; originally announced July 2022.

    Comments: This work has been submitted to the ARITH22 - IEEE Symposium on Computer Arithmetic for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. 8 pages

  7. arXiv:2204.11192  [pdf, other]

    cs.AR

    RedMulE: A Compact FP16 Matrix-Multiplication Accelerator for Adaptive Deep Learning on RISC-V-Based Ultra-Low-Power SoCs

    Authors: Yvan Tortorella, Luca Bertaccini, Davide Rossi, Luca Benini, Francesco Conti

    Abstract: The fast proliferation of extreme-edge applications using Deep Learning (DL) based algorithms required dedicated hardware to satisfy extreme-edge applications' latency, throughput, and precision requirements. While inference is achievable in practical cases, online finetuning and adaptation of general DL models are still highly challenging. One of the key stumbling stones is the need for parallel…

    Submitted 24 April, 2022; originally announced April 2022.