Search | arXiv e-print repository

Inexactness and Correction of Floating-Point Reciprocal, Division and Square Root

Authors: Lucas M. Dutton, Christopher Kumar Anand, Robert Enenkel, Silvia Melitta Müller

Abstract: Floating-point arithmetic performance determines the overall performance of important applications, from graphics to AI. Meeting the IEEE-754 specification for floating-point requires that final results of addition, subtraction, multiplication, division, and square root are correctly rounded based on the user-selected rounding mode. A frustrating fact for implementers is that naive rounding method… ▽ More Floating-point arithmetic performance determines the overall performance of important applications, from graphics to AI. Meeting the IEEE-754 specification for floating-point requires that final results of addition, subtraction, multiplication, division, and square root are correctly rounded based on the user-selected rounding mode. A frustrating fact for implementers is that naive rounding methods will not produce correctly rounded results even when intermediate results with greater accuracy and precision are available. In contrast, our novel algorithm can correct approximations of reciprocal, division and square root, even ones with slightly lower than target precision. In this paper, we present a family of algorithms that can both increase the accuracy (and potentially the precision) of an estimate and correctly round it according to all binary IEEE-754 rounding modes. We explain how it may be efficiently implemented in hardware, and for completeness, we present proofs that it is not necessary to include equality tests associated with round-to-nearest-even mode for reciprocal, division and square root functions, because it is impossible for input(s) in a given precision to have exact answers exactly midway between representable floating-point numbers in that precision. In fact, our simpler proofs are sometimes stronger. △ Less

Submitted 30 March, 2024; originally announced April 2024.

arXiv:2212.02872 [pdf, other]

doi 10.1038/s41928-023-01010-1

A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference

Authors: Manuel Le Gallo, Riduan Khaddam-Aljameh, Milos Stanisavljevic, Athanasios Vasilopoulos, Benedikt Kersting, Martino Dazzi, Geethan Karunaratne, Matthias Braendli, Abhairaj Singh, Silvia M. Mueller, Julian Buechel, Xavier Timoneda, Vinay Joshi, Urs Egger, Angelo Garofalo, Anastasios Petropoulos, Theodore Antonakopoulos, Kevin Brew, Samuel Choi, Injo Ok, Timothy Philip, Victor Chan, Claire Silvestre, Ishtiaq Ahsan, Nicole Saulnier , et al. (4 additional authors not shown)

Abstract: The need to repeatedly shuttle around synaptic weight values from memory to processing units has been a key source of energy inefficiency associated with hardware implementation of artificial neural networks. Analog in-memory computing (AIMC) with spatially instantiated synaptic weights holds high promise to overcome this challenge, by performing matrix-vector multiplications (MVMs) directly withi… ▽ More The need to repeatedly shuttle around synaptic weight values from memory to processing units has been a key source of energy inefficiency associated with hardware implementation of artificial neural networks. Analog in-memory computing (AIMC) with spatially instantiated synaptic weights holds high promise to overcome this challenge, by performing matrix-vector multiplications (MVMs) directly within the network weights stored on a chip to execute an inference workload. However, to achieve end-to-end improvements in latency and energy consumption, AIMC must be combined with on-chip digital operations and communication to move towards configurations in which a full inference workload is realized entirely on-chip. Moreover, it is highly desirable to achieve high MVM and inference accuracy without application-wise re-tuning of the chip. Here, we present a multi-core AIMC chip designed and fabricated in 14-nm complementary metal-oxide-semiconductor (CMOS) technology with backend-integrated phase-change memory (PCM). The fully-integrated chip features 64 256x256 AIMC cores interconnected via an on-chip communication network. It also implements the digital activation functions and processing involved in ResNet convolutional neural networks and long short-term memory (LSTM) networks. We demonstrate near software-equivalent inference accuracy with ResNet and LSTM networks while implementing all the computations associated with the weight layers and the activation functions on-chip. The chip can achieve a maximal throughput of 63.1 TOPS at an energy efficiency of 9.76 TOPS/W for 8-bit input/output matrix-vector multiplications. △ Less

Submitted 6 December, 2022; originally announced December 2022.

Journal ref: Nature Electronics 6, 680-693 (2023)

arXiv:2104.03142 [pdf, other]

A matrix math facility for Power ISA(TM) processors

Authors: José E. Moreira, Kit Barton, Steven Battle, Peter Bergner, Ramon Bertran, Puneeth Bhat, Pedro Caldeira, David Edelsohn, Gordon Fossum, Brad Frey, Nemanja Ivanovic, Chip Kerchner, Vincent Lim, Shakti Kapoor, Tulio Machado Filho, Silvia Melitta Mueller, Brett Olsson, Satish Sadasivam, Baptiste Saleil, Bill Schmidt, Rajalakshmi Srinivasaraghavan, Shricharan Srivatsan, Brian Thompto, Andreas Wagner, Nelson Wu

Abstract: Power ISA(TM) Version 3.1 has introduced a new family of matrix math instructions, collectively known as the Matrix-Multiply Assist (MMA) facility. The instructions in this facility implement numerical linear algebra operations on small matrices and are meant to accelerate computation-intensive kernels, such as matrix multiplication, convolution and discrete Fourier transform. These instructions h… ▽ More Power ISA(TM) Version 3.1 has introduced a new family of matrix math instructions, collectively known as the Matrix-Multiply Assist (MMA) facility. The instructions in this facility implement numerical linear algebra operations on small matrices and are meant to accelerate computation-intensive kernels, such as matrix multiplication, convolution and discrete Fourier transform. These instructions have led to a power- and area-efficient implementation of a high throughput math engine in the future POWER10 processor. Performance per core is 4 times better, at constant frequency, than the previous generation POWER9 processor. We also advocate the use of compiler built-ins as the preferred way of leveraging these instructions, which we illustrate through case studies covering matrix multiplication and convolution. △ Less

Submitted 7 April, 2021; originally announced April 2021.

Showing 1–3 of 3 results for author: Müller, S M